de novo NGS Assembly Using Velvet
You can use Assembler to assemble millions of short Next Generation Sequencing (NGS) reads using the Velvet de novo assembler. There is also a blog post describing this functionality. Note that these assemblies can be extremely computationally intensive, and the amount of RAM installed on the machine matters most. In our experience, with 16 GB of RAM you can reasonably assemble a maximum of about 10 million 100nt reads within a couple of hours on a laptop. Above this, Velvet requires more memory and starts using your hard drive as extra "swap" space, significantly slowing down the computation. A very fast SSD alleviates this a little, but repeating the same assembly with 20 million reads may lead to 24 hr assembly times.
Adding Reads to a Project
Assembly projects have two toolbar buttons for adding sequence data to the project. The Add Seqs button is used to add read data to the project - these are typically Fastq or Fasta formatted files containing sequence data from Illumina Solexa, SOLiD, 454 or MiSeq sequencing runs. To save disk space, the files are not copied - MacVector just notes their location, so it's important not to move them after you have created a project. The image below shows a project populated with a pair of Fastq read files representing paired-end reads from a bacterial sequencing project. Each file has approximately 2.4 million 90nt reads.
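Because read count and read length determine how demanding an assembly will be, it can help to check a Fastq file before adding it to a project. The following is a minimal sketch in Python, not part of MacVector; the function name fastq_read_stats is hypothetical, and it assumes the common four-line Fastq layout (header, sequence, "+", qualities) with no wrapped sequence lines.

```python
from typing import Iterable, Tuple


def fastq_read_stats(lines: Iterable[str]) -> Tuple[int, float]:
    """Return (number of reads, mean read length).

    Assumes strict 4-line Fastq records: header, sequence, '+', qualities.
    """
    n_reads = 0
    total_bases = 0
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the second line of each record holds the sequence
            n_reads += 1
            total_bases += len(line.strip())
    mean_len = total_bases / n_reads if n_reads else 0.0
    return n_reads, mean_len


if __name__ == "__main__":
    # Tiny in-memory example with two reads of 4 nt and 6 nt.
    example = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGTAC", "+", "IIIIII"]
    print(fastq_read_stats(example))
```

For a real file you could pass an open file handle instead of a list, for example fastq_read_stats(open(path)), where path is whatever location your reads live in.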
Assembling Using Velvet
Next, select the files you want to assemble (if nothing is selected, all of the files in the project are used), then click on the Velvet button.
The most important setting in the dialog is the Hash ("K-MER") Length. Sometimes you will need to experiment with this a little to get optimal results. This value must be shorter than the length of the reads, so if you are using older data where read lengths were only in the 33nt range, you will need to reduce it accordingly. Values between 31 and 51 appear to work best for typical bacterial sequencing projects. The value should be an odd number - if you enter an even number, Velvet will round it down to the next lower odd number.
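The two rules above (the value must be shorter than the reads, and even values are rounded down to an odd number) can be sketched as a small Python check. This is only an illustration of the rules as described here, not Velvet's or MacVector's actual code; the function name effective_hash_length is hypothetical.

```python
def effective_hash_length(requested: int, read_length: int) -> int:
    """Return the hash length that would effectively be used.

    Rules illustrated (as described in the text above):
      - the value must be shorter than the read length
      - an even value is rounded down to the next lower odd number
    """
    if requested < 1:
        raise ValueError("hash length must be a positive integer")
    if requested >= read_length:
        raise ValueError("hash length must be shorter than the read length")
    if requested % 2 == 0:
        requested -= 1  # even values drop to the next lower odd number
    return requested


if __name__ == "__main__":
    print(effective_hash_length(32, 90))  # even request with 90nt reads
    print(effective_hash_length(31, 90))  # odd request is left unchanged
```

With 90nt reads, a request of 32 is treated as 31, while a request of 33 on 33nt reads is rejected because it is not shorter than the reads.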
The second most critical value is the checkbox to indicate that the source files contain paired reads. You can have multiple sets of paired reads - MacVector is clever enough to work out which files pair with each other. However, with MacVector 13 the paired reads must reside in separate files.
The other values give you fine-grained control over the assembly - typically you can just select the Auto settings to have Velvet work out suitable parameters by itself.
When completed, the project window refreshes to display all of the contigs created by the assembly. Note that the original source files remain untouched - the reads are copied into the results. You can sort the contigs by length or number of reads. There is also an "UnusedReads.fa" file created that contains all of the reads that did not assemble.
The Properties tab gives you an overview of the assembly - the number of contigs, total length, etc. One of the most useful parameters is the N50 statistic, a widely used measure of how contiguous an assembly is. It is defined as the contig length for which the contigs of that length or longer together account for at least half of the summed length of all contigs. For example, for contigs of 80, 70, 50, 40, 30 and 20 kb (290 kb in total), the two longest contigs already sum to 150 kb, which is more than half of the total, so the N50 is 70 kb.
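To make the N50 definition concrete, here is a minimal Python sketch that computes it from a list of contig lengths. It is not MacVector's implementation; the function name n50 is hypothetical, and it uses the common convention that the running total must reach at least half of the total length.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of the given contig lengths (0 if there are none).

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the sum of
    all contig lengths.
    """
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # only reached for an empty input


if __name__ == "__main__":
    # Total is 290, half is 145; 80 + 70 = 150 reaches it at the 70 contig.
    print(n50([80, 70, 50, 40, 30, 20]))
```

A higher N50 means more of the assembly sits in long contigs, but it says nothing about whether those contigs are correct, so it is best read alongside the contig count and total length.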
You can view any of the contigs by double-clicking on it to display it in the Contig Editor.
With MacVector 13, the contigs cannot be edited - behind the scenes the data is stored in the popular BAM format to save space and this is essentially a read-only format. However, the BAM file can be extracted from the project and used as input to other applications that accept that format.
There is also a Summary tab that lists useful summary information about the contig.
Exporting Contig Consensus Sequences
One additional nice feature of the Assembly Project is that you can select one or more (or all) contigs and export them in Fastq or Fasta format. This is a useful way of building up large assemblies from a series of smaller assemblies. You can run assemblies on multiple subsets of files (create a separate project for each one so you can run assemblies in parallel), export all of the contigs as Fastq files, then create a new master project where you can "assemble the assemblies" using phrap.