MacVectorTip: Use Bowtie to remove contaminating reads prior to NGS Assembly

MacVector with Assembler can use Velvet and/or SPAdes for fast, memory-efficient de novo NGS assembly of modest-sized genomes (typically up to 40 Mbp or so), even on a laptop. One common task is to assemble NGS data from BAC clones. However, a problem that often arises is that the BAC DNA preparations may be contaminated with genomic E. coli DNA. SPAdes, in particular, is so efficient that it will easily assemble large contigs from the contaminating reads, even when they make up only a minority of the data. In the example below, where E. coli genomic sequence represented over two thirds of the reads, the large genomic contigs overwhelm the few BAC contigs (the main BAC contigs are 192 kb and 41 kb). The primary BAC contig is highlighted below, but there are clearly a lot of large genomic contigs that add to the confusion.

The solution is to first run a Bowtie reference alignment of the raw reads against an E. coli genomic sequence. Ideally, you would use the genomic sequence of your host strain, but any E. coli genome will be effective. You can use multiple genomes with relatively little difference in processing time: 10 genomes take only about twice as long as one genome. In this case, we used the Add Ref button to add two random E. coli genomes as reference sequences, then aligned the NGS data files against them using Bowtie with the default settings, resulting in this alignment:

The critical point here is not the actual alignments, but the fact that the reads that do NOT align to the reference sequence(s) are saved into a pair of compressed FASTQ (fq.gz) files. These should be massively enriched for non-E. coli DNA sequences. While you can save these files to disk (choose File->Export Selected Reads To…), you can also assemble the data in the files directly by selecting them and then clicking the SPAdes button. When that completes with the default settings, you should see something like this.

Now the top two assemblies (NODE_x) are the major BAC contigs. You can also click on the ‘#’ column to sort the contigs by number of reads aligned, which (in this case) brings up additional minor BAC contigs.

Note that removing contaminating sequences via Bowtie alignment is not perfect. In particular, for paired-end reads, both reads need to align for the pair to be considered “aligned”. So, if one read of a pair does not align, or has “failed”, both will be placed in the Unaligned_Reads file.

Another limitation to be aware of is that the unaligned reads files may contain pairs of reads that map perfectly, except that the distance between the two reads differs between the reference sequence and the organism that has been sequenced.

Paired-end reads are generated by sequencing both ends of a single fragment. The insert length, or distance between the two reads on that fragment, is used by assembly algorithms both to improve accuracy and to detect structural variants, for example when resolving tricky sections with long runs of repeats or large indels.

Bowtie uses the term concordant for pairs of reads that map well to the reference, and discordant for pairs where both reads map well but the distance between them does not match the insert length.
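As a rough illustration of the distinction, the sketch below classifies a read pair from the leftmost mapped position of each mate. The function name, the `min_span`/`max_span` window and the single fixed read length are all assumptions made for this example and are not taken from Bowtie or MacVector. Real aligners also check the relative orientation of the two mates, which is ignored here.

```python
from typing import Optional


def classify_pair(pos1: Optional[int], pos2: Optional[int], read_len: int,
                  min_span: int, max_span: int) -> str:
    """Classify a read pair as 'concordant', 'discordant' or 'unaligned'.

    pos1 / pos2 are the leftmost reference positions of the two mates,
    or None when that mate did not align. The window min_span..max_span
    is a hypothetical range of acceptable fragment spans.
    """
    # If either mate failed to align, the pair as a whole is unaligned.
    if pos1 is None or pos2 is None:
        return "unaligned"

    # Span covered on the reference, from the start of the leftmost mate
    # to the end of the rightmost mate (simplified: equal read lengths).
    span = abs(pos2 - pos1) + read_len

    if min_span <= span <= max_span:
        return "concordant"
    return "discordant"


print(classify_pair(1000, 1300, 100, 200, 500))   # concordant (span 400)
print(classify_pair(1000, 5000, 100, 200, 500))   # discordant (span 4100)
print(classify_pair(1000, None, 100, 200, 500))   # unaligned
```

The second call shows the case discussed here: both mates map, but the span on the reference falls well outside the expected window.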

Unfortunately, for the purpose of filtering out a genome from a mixed dataset, the unaligned reads files will also include discordant reads, which means that you will never fully remove all traces of the genome you are trying to filter out. In the example above, there are small contigs of the E. coli genome, which are likely due to discordant reads in the unaligned reads files.
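For readers who want to reproduce a similar filter outside MacVector, the sketch below builds a command line for the stand-alone Bowtie 2 aligner. It assumes that Bowtie 2 is installed and that an index of the E. coli reference(s) has already been built with `bowtie2-build`. All file names and the helper function are hypothetical, and this is not necessarily how MacVector invokes Bowtie internally. The `--un-conc-gz` option writes every pair that failed to align concordantly to gzipped FASTQ files, which matches the behaviour described above: discordant pairs end up alongside the truly unaligned ones.

```python
from typing import List


def bowtie2_filter_command(index_prefix: str, reads_1: str, reads_2: str,
                           unaligned_template: str, sam_out: str) -> List[str]:
    """Build a Bowtie 2 command that keeps pairs NOT aligning concordantly.

    unaligned_template should contain '%', which Bowtie 2 replaces with
    1 or 2 to name the output file for each mate.
    """
    return [
        "bowtie2",
        "-x", index_prefix,                   # prefix of the bowtie2-build index
        "-1", reads_1,                        # mate 1 reads
        "-2", reads_2,                        # mate 2 reads
        "--un-conc-gz", unaligned_template,   # pairs that did not align concordantly
        "-S", sam_out,                        # alignments to the contaminant genome
    ]


# Hypothetical file names, for illustration only.
cmd = bowtie2_filter_command(
    "ecoli_index",
    "reads_R1.fq.gz",
    "reads_R2.fq.gz",
    "unaligned_R%.fq.gz",
    "ecoli_hits.sam",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to run the alignment. The two `unaligned_R1.fq.gz` and `unaligned_R2.fq.gz` files it produces would then be the input for assembly.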

However, even with these two limitations the technique works extremely well to ensure a better assembly of your organism of interest.