General musings from the MacVector team about sequence analysis, molecular biology, the Mac in general and of course your favorite sequence analysis app for the Mac!

Balancing Velvet KMER and coverage

The Velvet assembly algorithm in MacVector is very fast and generates excellent assemblies. However, when assembling NGS data you do have to make sure that the parameters you submit are appropriate for the data in order to get optimal results. By far the most important parameter is the KMER value. If you are not getting good assemblies, this is the parameter you should change.

Below are the results of varying the KMER value for an NGS assembly of a circular 8,859 bp plasmid using data acquired from an Illumina HiSeq machine. In this case, the data consisted of a pair of FASTQ files with a read length of 75 nt, each containing 1,370,000 paired-end reads. The table below shows the longest contig that resulted from each combination of the number of input reads and the Velvet KMER parameter.

MacVector has a feature in the contig editor that simplifies circularization of contigs with overlapping direct repeats at the ends. All the contigs in black could be circularized to generate an 8,859 bp plasmid. Those in red were not full length, or could not be circularized.

  • First, note that Velvet (like most assemblers) does not like a massive overabundance of coverage. If you submit too many reads, the algorithm gets confused, and you have to be very careful with your choice of KMER to get a good assembly.

  • Second, note that the more reads you submit, the higher the KMER needs to be to generate a complete contig.
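To put a number on "over-abundance", here is a minimal sketch of the usual back-of-the-envelope coverage estimate (total bases sequenced divided by target length). The function name is ours, not part of MacVector or Velvet, and we assume that each of the two FASTQ files contributes 1,370,000 reads of 75 nt.

```python
def estimated_coverage(num_reads: int, read_length: int, target_length: int) -> float:
    """Approximate fold coverage: total sequenced bases / target length.

    Hypothetical helper for illustration; ignores adapter trimming,
    contamination and uneven read distribution.
    """
    return num_reads * read_length / target_length


# Assumption: two FASTQ files, each holding 1,370,000 reads of 75 nt.
total_reads = 2 * 1_370_000
coverage = estimated_coverage(total_reads, read_length=75, target_length=8_859)
print(f"Estimated coverage of the full data set: {coverage:,.0f}x")
```

Under those assumptions the full data set works out to roughly 23,000x coverage of the plasmid, which is why submitting every read makes the choice of KMER so touchy.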

The take-home lesson is that, in general, you should tune the amount of data in your NGS set to be between 100x and 1,000x coverage, as that gives the most flexibility in your choice of KMER. Start with a KMER that is ~70% of the average length of your reads (it has to be odd, so 51 in this case), then vary the KMER to see what impact that has. This holds true for bacterial genome assemblies as well as simple plasmids like this one. Next week we will discuss a tool to help you break up large NGS data files into smaller segments to facilitate this analysis.
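As a rough illustration of that rule of thumb, the sketch below works out how many reads correspond to a target coverage and picks a starting KMER at about 70% of the read length, rounded down to an odd number. The helper names and the rounding choice are our own assumptions for illustration; they are not MacVector or Velvet functions.

```python
def reads_for_coverage(target_coverage: int, read_length: int, target_length: int) -> int:
    """Number of reads needed to reach a given fold coverage (rounded up)."""
    total_bases = target_coverage * target_length
    return -(-total_bases // read_length)  # integer ceiling division


def starting_kmer(read_length: int, fraction: float = 0.7) -> int:
    """Starting KMER: ~70% of the read length, rounded down to an odd value."""
    k = int(read_length * fraction)
    if k % 2 == 0:
        k -= 1  # KMER has to be odd
    return k


# Values from the plasmid example: 75 nt reads, 8,859 bp target.
low = reads_for_coverage(100, read_length=75, target_length=8_859)
high = reads_for_coverage(1_000, read_length=75, target_length=8_859)
print(f"100x to 1,000x coverage: {low:,} to {high:,} reads in total")
print(f"Starting KMER for 75 nt reads: {starting_kmer(75)}")
```

For the plasmid above this suggests somewhere between about 12,000 and 118,000 reads in total, with a starting KMER of 51. From there, try a few odd values on either side and compare the longest contig each one produces.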

Not sure if you have Assembler? Choose MacVector | About MacVector. If the screen that appears says “MacVector with Assembler, Pro Edition” then you have it. If not, you can sign up for a fully functional 21-day trial version.
