Why Are There So Many Contigs?

While genome sequencing has become relatively inexpensive over the last several years, genome assembly is still rather greedy. What do we mean by greedy? A complete genome assembly often requires a lot of data to fill in the gaps. In a perfect world, one genome assembly would consistently yield one contig per chromosome or plasmid, but unfortunately we just aren't there yet (though we are getting close). This is especially true for de novo genome assembly, where no reference genome is available to guide the placement of reads. Generating a single contig from a de novo assembly is incredibly difficult for a number of reasons, including highly repetitive regions that can span thousands of base pairs.

One of the key advantages of NGS technology, such as the Illumina MiSeq or NovaSeq, over Sanger sequencing is the reduced cost, but one of the key disadvantages is the reduced read length. Because we are often handling reads 150-250 bp in length, a read that falls entirely inside a repeat looks identical to reads from every other copy of that repeat, so working out how these short reads overlap to reconstruct the full-length repetitive region is quite challenging. The assembler typically breaks the contig at the repeat instead, which can leave researchers with a lot of gaps to fill…literally. The complexity of genome assembly increases further with polyploid genomes, where the assembler must also determine which alleles map to which loci.
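The read-length problem can be illustrated with a toy example. The sketch below is only an illustration under simplifying assumptions: the genome string, the repeat, and the function name `ambiguous_read_fraction` are all made up for this example, the reads are error-free, and real repeats and reads are far longer. It counts how many reads of a given length could have come from more than one place in the genome.

```python
from collections import Counter


def ambiguous_read_fraction(genome: str, read_len: int) -> float:
    """Fraction of error-free reads whose sequence occurs at more than one position.

    Hypothetical helper for illustration: it takes one read starting at
    every position in the genome and checks whether that exact sequence
    also appears somewhere else.
    """
    reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(reads)
    ambiguous = sum(1 for read in reads if counts[read] > 1)
    return ambiguous / len(reads)


# Toy genome (made up): the same 12 bp repeat appears twice,
# flanked and separated by unique sequence.
repeat = "ACGTTGCAAGCT"
genome = "GGATCCTA" + repeat + "TTGACCGATA" + repeat + "CCATGGAC"

# Reads shorter than the repeat can sit entirely inside it and are ambiguous;
# reads longer than the repeat always include some unique flanking sequence.
for read_len in (8, 20):
    frac = ambiguous_read_fraction(genome, read_len)
    print(f"read length {read_len}: {frac:.2f} of reads are ambiguous")
```

In this toy case the 8 bp reads that fall inside the repeat cannot be placed uniquely, while the 20 bp reads all can, which is the same reason longer reads help bridge repeats in a real assembly.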

How Do I Fill in the Gaps?

One of the best ways to fill in the gaps is a hybrid approach. Long-read sequencing technology, such as the PacBio Sequel, or long-insert library protocols, such as Illumina mate-pair sequencing, is used to span the larger, highly repetitive regions, and this is combined with short-insert reads to fill in or polish the gaps. Other gap-closing strategies include PCR validation, where primers are designed from the end sequences of the two contigs flanking a gap, and PCR primer walking, where each new primer is designed from the terminal sequence of the previous PCR extension, and the process is repeated until the gap between the contigs is bridged.