101 things you (maybe) didn’t know about MacVector: #51 – Rapid assembly of genomes with Velvet and SPAdes

Not so long ago, assembling even a small genome from Next Generation Sequencing (NGS) data required a cluster of computers and a lot of patience. But improvements in algorithms and hardware mean that it is now realistic to assemble bacterial genomes, or even smaller eukaryotic genomes, using MacVector on a modest laptop.

MacVector 16 introduced a new assembler, SPAdes, to add to the existing Velvet and Phrap assemblers. Here’s a quick summary of the pros and cons of each assembler:

phrap: slow, and does a poor job with most NGS data. However, it handles long sequences extremely well as it’s tuned for Sanger (ABI-type) reads. It is the best choice if you have just a few long reads, or one or two consensus sequences resulting from the other algorithms.

velvet: fast with moderate RAM requirements. You can literally assemble smaller genomes (e.g. Mycoplasma) in less than two minutes. However, it takes a bit of playing with the parameters to get optimal assemblies.

SPAdes: much slower than velvet, but uses a lot less RAM with larger genomes/data sets. Typically works “out of the box” with fewer tweaks required to generate longer contigs than velvet.

Let’s take a look at some performance data that illustrate this, using a variety of bacterial genomes and input data, to get a better understanding of how to use these algorithms. For this we will compare just velvet and SPAdes, as it is not generally appropriate to use phrap for large NGS datasets. The table below lists the timings, memory usage and summary of results for a variety of different bacterial genomes with NGS data from Illumina HiSeq and MiSeq machines. These data were generated on a four-year-old MacBook Pro with a 2.7 GHz Intel Core i7 processor and 16 GB of RAM, i.e. a very modest machine by today’s standards.

[Table: assembly timings, memory usage and summary of results for Velvet and SPAdes]

First, note how fast some of these assemblies complete. Velvet can assemble a small Mycoplasma genome from just under half a million MiSeq reads in a little more than a minute. Even a large ~7 GB Streptomyces sp. genome was assembled by Velvet in under an hour. While SPAdes is slower to complete, it uses significantly less RAM with the larger datasets and was able to complete the assembly of a larger B. subtilis MiSeq dataset where Velvet ran out of memory and ground to a halt.

As is usual with bacterial genome assemblies, neither Velvet nor SPAdes was able to generate a single contig representing the entire genome in these tests. This is due to the presence of repeat sequences (typically rRNA operons and insertion sequences), which prevent the assembly algorithms from knowing in which order to join the contigs. While this can be solved by including additional long-insert reads in the assembly, we’ll explore some strategies for merging contigs in a future blog post.
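To see why repeats leave the contig order ambiguous, here is a minimal Python sketch. Velvet and SPAdes are both de Bruijn graph assemblers, but this is a toy illustration only, not how either program (or MacVector) is implemented. The function names, the toy sequence and the k-mer size are all made up for the example, and the graph is built directly from a known sequence rather than from reads, which is a big simplification.

```python
from collections import defaultdict


def de_bruijn_edges(sequence, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in `sequence`."""
    graph = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Return the nodes that have more than one possible successor."""
    return {node: sorted(nexts) for node, nexts in graph.items() if len(nexts) > 1}


# Hypothetical toy "genome": three copies of one repeat separated by unique sequence.
REPEAT = "GATTACA"
toy_genome = "AAC" + REPEAT + "CCG" + REPEAT + "TTG" + REPEAT + "GGA"

# k is shorter than the repeat, so the three copies collapse into one path.
graph = de_bruijn_edges(toy_genome, k=5)

for node, successors in branching_nodes(graph).items():
    print(f"{node} can be followed by: {', '.join(successors)}")
```

In this toy case the three repeat copies collapse into a single path, and the node at the end of that path has three different successors. Nothing in the graph says which exit belongs with which entrance, so an assembler working from reads shorter than the repeat has to stop and report separate contigs at that point.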

This is an article in a long-running series of tips to help you get the most out of MacVector. If you want to be notified every time a new tip is published, follow us @MacVector on Twitter (or check the feed for the hashtag #101MacVectorTips) or like us on Facebook.
