101 things you (maybe) didn’t know about MacVector: #52 – Data mining to identify and analyze pangolin CoV-2 analogs to the human COVID-19 virus

One of the most underrated features in MacVector is the Database | Align to Folder function. You can use this as a more sensitive version of a local BLAST search to find sequences in a “database” that match a query sequence. But in this case the “database” is simply a collection of your own sequences, stored in one or more folders on your computer, or on a locally accessible server. More importantly, in these days of huge NGS data sets, the folders can contain FASTA or FASTQ formatted files, and the files can even be compressed using the gzip algorithm. MacVector understands paired-end reads and can retrieve both reads of a pair even if only one of them matches the query sequence.
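
To make the paired-end behavior concrete, here is a minimal Python sketch of the idea. It is not MacVector's implementation: the function names (read_fastq, retrieve_pairs, gz_text_handle) are hypothetical, an exact substring test stands in for the real alignment step, and it assumes the two mate files list their reads in the same order.

```python
import gzip
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from 4-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or a trailing blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        read_id = header[1:].split()[0]  # drop '@' and any description
        if read_id.endswith(("/1", "/2")):
            read_id = read_id[:-2]  # drop the mate suffix
        yield read_id, seq, qual


def retrieve_pairs(r1_handle, r2_handle, query_word):
    """Yield both mates of a pair when either mate contains query_word.

    The substring test is a stand-in for a real alignment.
    """
    pairs = zip(read_fastq(r1_handle), read_fastq(r2_handle))
    for (id1, seq1, _), (id2, seq2, _) in pairs:
        if id1 != id2:
            raise ValueError(f"mate files out of sync: {id1} vs {id2}")
        if query_word in seq1 or query_word in seq2:
            yield id1, seq1, seq2


def gz_text_handle(text):
    """Gzip-compress text in memory and reopen it as a text stream."""
    raw = io.BytesIO(gzip.compress(text.encode("ascii")))
    return gzip.open(raw, "rt", encoding="ascii")


R1_TEXT = "@read1/1\nACGTACGTAA\n+\nIIIIIIIIII\n@read2/1\nTTTTTTTTTT\n+\nIIIIIIIIII\n"
R2_TEXT = "@read1/2\nGGGGGGGGGG\n+\nIIIIIIIIII\n@read2/2\nCCCCCCCCCC\n+\nIIIIIIIIII\n"

for hit in retrieve_pairs(gz_text_handle(R1_TEXT), gz_text_handle(R2_TEXT), "GTACG"):
    print(hit)
```

Only the first mate of read1 contains the query word here, yet both mates are returned together, which is what you want when the reads are to be assembled afterwards.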

As an example of the power of this approach, we used MacVector to retrieve reads matching the human SARS-CoV-2 genome from a collection of RNA-Seq reads from pangolins, assembled those reads into a viral genome and compared the sequence and encoded proteins to published bat and human isolates of SARS-CoV-2. You can read more about how that was accomplished and the results of the analysis in a published Technical Note.

You can use this approach to scan RNA-Seq reads for specific genes, or to identify reads in total genome sequencing experiments that extend sequences of interest, or to retrieve plasmids or bacteriophages. We’ve even used it to retrieve RNA-Seq reads using a protein sequence from a distantly related organism as a query. Here’s how to set up a typical search:

First, make sure you have chosen a suitable Search Folder. You can have a hierarchy of folders and ask MacVector to search recursively through all the enclosed folders. Also be sure to check the paired-end reads checkbox if any of your files contain paired-end reads.
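
If you want to check which files a recursive search would reach before you start, a short script can list them. The sketch below is only an illustration in Python: the function names are hypothetical, and the list of file extensions is an assumption, not the set MacVector actually recognizes.

```python
import gzip
from pathlib import Path

# Assumed extensions; the set MacVector actually recognizes may differ.
SEQUENCE_SUFFIXES = (".fasta", ".fa", ".fastq", ".fq")


def find_sequence_files(root, recursive=True):
    """Return FASTA/FASTQ files under root, optionally descending into subfolders."""
    root = Path(root)
    candidates = root.rglob("*") if recursive else root.glob("*")
    found = []
    for path in sorted(candidates):
        if not path.is_file():
            continue
        name = path.name.lower()
        if name.endswith(".gz"):
            name = name[:-3]  # look at the extension under the .gz
        if name.endswith(SEQUENCE_SUFFIXES):
            found.append(path)
    return found


def open_sequence_file(path):
    """Open a plain or gzip-compressed sequence file as text."""
    if str(path).lower().endswith(".gz"):
        return gzip.open(path, "rt", encoding="ascii")
    return open(path, "rt", encoding="ascii")
```

Calling find_sequence_files on your chosen folder returns every matching file in the hierarchy, while recursive=False restricts the list to the top-level folder.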

Increasing the Hash Value speeds up searches dramatically, at the expense of more memory usage. The current maximum is 14, which means that a read must contain at least a 14-residue perfect match to the query before it will even be considered as a potential match. If you expect a lot of hits, increase Scores to Keep to a large value.
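
The effect of the hash value on sensitivity can be sketched with a simple word index. This is a generic Python illustration of exact-word seeding, not MacVector's actual code, and the function names are hypothetical: a read is passed on for alignment only if it shares at least one exact word of length k with the query.

```python
from collections import defaultdict


def build_kmer_index(query, k):
    """Map every k-residue word in the query to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return dict(index)


def has_seed_match(read, index, k):
    """Return True if the read shares at least one exact k-residue word with the query."""
    return any(read[i:i + k] in index for i in range(len(read) - k + 1))


query = "ATGGCGTACGTTAGCCGATAGGCTTAACGGATCC"
seed = query[5:19]  # a 14-residue stretch of the query
read_exact = "TTTT" + seed + "GGGG"
# Same read with a single mismatch in the middle of the 14-residue stretch
read_mutated = "TTTT" + seed[:7] + "C" + seed[8:] + "GGGG"

for k in (6, 14):
    index = build_kmer_index(query, k)
    print(k, has_seed_match(read_exact, index, k), has_seed_match(read_mutated, index, k))
```

With k set to 6 both reads are candidates, but with k set to 14 the read carrying a single central mismatch no longer has any perfect 14-residue word in common with the query and is skipped. This is why a high hash value suits near-identical matches and a lower one suits divergent sequences.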

Finally, the Scoring Matrix can be critical. If you are looking for matches using a query sequence from a related organism, you should likely use the DNA database matrix.nmat so that you can retrieve weak matches. However, if you are trying to extend a genomic sequence where you are expecting essentially perfect matches, though perhaps with just short overlaps at the ends of reads, then the DNA identity with penalties matrix.nmat is tuned for those searches.
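
To see why the choice of matrix matters, consider a toy ungapped comparison in Python. The match and mismatch values below are invented for illustration and are not the contents of either .nmat file; the point is only that a heavy mismatch penalty pushes a divergent match below zero while a lenient scheme keeps it.

```python
def score_ungapped(a, b, match, mismatch):
    """Score two equal-length sequences position by position, without gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


# Hypothetical scoring schemes, NOT the values in MacVector's matrices.
LENIENT = {"match": 2, "mismatch": -1}  # tolerant of divergence
STRICT = {"match": 1, "mismatch": -5}   # rewards near-perfect identity only

query = "ACGTACGTACGTACGTACGT"
divergent = "ACATACGAACGTTCGTAGGT"  # 4 mismatches out of 20 (80% identity)

print("lenient:", score_ungapped(query, divergent, **LENIENT))
print("strict: ", score_ungapped(query, divergent, **STRICT))
```

Under the lenient scheme the 80% identical sequence still earns a clearly positive score, so it would be kept as a weak hit. Under the strict scheme the same comparison scores below zero, so only near-perfect overlaps survive.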

This is an article in a long-running series of tips to help you get the most out of MacVector. If you want to be notified every time a new tip is published, follow us @MacVector on Twitter (or check the feed for the hashtag #101MacVectorTips) or like us on Facebook.