101 things you (maybe) didn’t know about MacVector: #50 – Using Align To Folder to “clone” genes from NGS data

The Database | Align To Folder… function in MacVector is remarkably powerful. It’s like having your own personal BLAST search, except that it can also scan through millions of Reads in fasta or fastq formatted files to identify those matching an input sequence, which can be DNA or protein. In addition, it understands paired-end read files, so that when it finds a match it can retrieve both Reads of a pair even if only one of them matched the input sequence. Let’s examine this more closely with a specific example.

Here I am starting off with the protein sequence of a galactokinase enzyme from Streptomyces coelicolor. Let’s identify those reads that might encode the protein in an Illumina Solexa dataset from an unknown Clostridium species, retrieve those into a pair of fastq files (one each for the left and right reads), then assemble them to determine the coding region of the protein in the Clostridium sp. This is obviously an artificial example, chosen largely because S. coelicolor DNA is 73% G+C, so there is unlikely to be much sequence similarity between it and a Clostridium species at the DNA level. This illustrates the power of using the Align To Folder function in translated mode to scan for DNA sequences that could encode a specified protein.

First, I open the GalK protein sequence and choose Database|Align To Folder…

NGSAlignToFolder SetupDialog

I clicked on the Choose button and selected the folder where my pair of fastq files are located. The other critical settings are to let MacVector know to expect paired files in the folder and, for this example, to check the Align To DNA checkbox. This tells MacVector to translate each DNA Read in all 6 possible frames and compare the translations to the GalK protein. Finally, I set Scores To Keep to 10,000 to be sure I’m keeping all of the potential hits.
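To make the "6 possible frames" idea concrete, here is a minimal Python sketch of a six-frame translation of a single Read. It is not MacVector's implementation; the function names are hypothetical, and it assumes the standard genetic code (NCBI translation table 1).

```python
# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(read):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to protein translations."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGAAATAG").items():
        print(frame, protein)
```

Each of the six translations would then be compared to the query protein. A 90 nt Read yields at most 30 residues per frame, which is why individual hits are short and need assembling afterwards.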

To give you an idea of performance, searching this 387aa protein against a total of 19 million x 90nt Reads in two fastq files took a little over 2 hours on a 3 year old 2.7 GHz MacBook Pro. At no time did MacVector use more than 500 MB of RAM, so this analysis can be performed on fairly modest machines, as long as you are patient.

Once complete, you can view a graphical overview of the alignments, a text view showing the actual translated alignments and a list view with each of the “hits” displayed, one per line. The Folder Aligned Sequence view shows that many of the translated reads do indeed show good similarity with the query sequence.

S coelicolor galK prot Results

After scrolling through the alignments, I could see that most of the top ~1,000 hits had reasonable similarity to the GalK protein. So I switched to the Folder Description List tab and selected the first 1,000 hits;

S coelicolor galK prot SelectedHits

When you select lines containing sequence hits in a Description List like this, a variety of Retrieve To Xxx… items become activated in the Database menu. I selected Database|Retrieve To File… and was prompted for a filename;

GalKHitsSaveAs

However, once I click on Save and look in the destination folder, I see that two files have been created;

GalKHitsInFinder

MacVector is smart enough to save BOTH sequences when either one of them is selected in the Description List. Not only that, it keeps them in separate files, so you can use them for further NGS analyses that take paired-end read files, like Bowtie or Velvet for example.
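The idea of keeping both mates when either one is a hit can be sketched in a few lines of Python. This is only an illustration, not how MacVector does it: the function names are hypothetical, and it assumes four-line fastq records, mates stored in the same order in both files, and read names that differ only by a trailing /1 or /2.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record fastq stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def pair_key(name):
    """Read name without '@', trailing comment, or /1 and /2 mate suffix."""
    name = name.lstrip("@").split()[0]
    return name[:-2] if name.endswith(("/1", "/2")) else name


def retrieve_pairs(left, right, hit_names, left_out, right_out):
    """Write both mates of every pair in which either mate is a hit."""
    hit_keys = {pair_key(name) for name in hit_names}
    kept = 0
    for rec_l, rec_r in zip(read_fastq(left), read_fastq(right)):
        if pair_key(rec_l[0]) in hit_keys or pair_key(rec_r[0]) in hit_keys:
            left_out.write("{}\n{}\n+\n{}\n".format(*rec_l))
            right_out.write("{}\n{}\n+\n{}\n".format(*rec_r))
            kept += 1
    return kept


if __name__ == "__main__":
    left = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nTTTT\n+\nIIII\n")
    right = io.StringIO("@r1/2\nGGGG\n+\nIIII\n@r2/2\nAAAA\n+\nIIII\n")
    out_l, out_r = io.StringIO(), io.StringIO()
    # Only the right-hand mate of pair r2 matched, but both mates are written.
    print(retrieve_pairs(left, right, {"r2/2"}, out_l, out_r))
    print(out_l.getvalue(), out_r.getvalue(), sep="")
```

Writing the two mates to separate, order-matched files is what lets downstream paired-end tools consume the output directly.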

So, can I assemble these into a contig encoding the GalK protein of our Clostridium sp.? I used File|New|Assembly Project to create a new Assembly project then clicked on the Add Seqs toolbar button to add my files. I then clicked on the Velvet toolbar button and accepted the default parameters, except that I checked the Source files contain paired read checkbox;

GalKVelvetParams

Velvet takes just a few seconds to assemble such a small number of sequences. Encouragingly, there were just two contigs generated, with the longest one (1587bp) containing most of the reads;

GalKAssembly

So I double-clicked on Contig 1 to open up the contig editor and switched to the Map tab where I could see the restriction sites and an overview of the assembly;

GalContig1

Perhaps the entire GalK protein is encoded within the Contig? So I chose Analyze|Open Reading Frames…, accepted the defaults and saw one long ORF running from right to left on the minus strand;

GalContigORFResults
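For readers who want to see what an ORF scan on both strands involves, here is a rough Python sketch. It does not reproduce the Analyze|Open Reading Frames… defaults, which are not described here: it assumes an ORF starts at ATG, ends at TAA, TAG or TGA, and it ignores ORFs that run off the end of the sequence. All names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def longest_orf(dna):
    """Return (length_nt, strand, start) of the longest ATG...stop ORF.

    'start' is a 0-based position on the scanned strand, so for a minus
    strand ORF it counts from the start of the reverse complement.
    Returns (0, None, None) if no complete ORF is found.
    """
    best = (0, None, None)
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG since the previous stop in this frame
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start  # length includes the stop codon
                    if length > best[0]:
                        best = (length, strand, start)
                    start = None
    return best


if __name__ == "__main__":
    # A toy "contig" whose only complete ORF lies on the minus strand.
    contig = reverse_complement("CCATGAAACCCGGGTTTTAGCC")
    print(longest_orf(contig))
```

In the example above, the ORF is reported on the "-" strand, mirroring the right-to-left ORF seen in the contig.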

After selecting this, it’s trivial to choose Analyze|Translation… and get the amino acid translation in a separate MacVector window. Finally, let’s run a Database|Internet BLAST search on that translated protein to see what we have “cloned”.

GalKCloneBlastResults

Surprise! The top hit is a Galactokinase protein from another Clostridium species. So there you have it – sorry about the long post but hopefully this gives you a good feeling for how you can use the MacVector Align To Folder function to clone genes from NGS data sets and validate the results.



This is an article in a long running series of tips to help you get the most out of MacVector. If you want to get notified every time a new tip gets published, follow us @MacVector on twitter (or check the feed for the hashtag #101MacVectorTips) or like us on Facebook.


Download the latest published version of your favorite sequence with its accession number

It’s very quick to download the latest version of a sequence if you know its accession number. When you start working with a new sequence, it’s the best place to start.

  • Go to DATABASE > ENTREZ
  • Enter the accession number of your favorite sequence
  • Click SEARCH
  • Double click on the result to open up your sequence directly in MacVector.
    NewImage

    If you do not know the accession number, then it’s still easy, but you might need to perform a more complex search to only retrieve a few hits. For example, “ORGANISM=Homo sapiens, GENE=Presenilin”.

    If you just want to “refresh” your own copy of a sequence with the latest published annotation, then use Import Features instead.


    How to change the default appearance of RE sites

    MacVector is extremely customizable. If you don’t like the defaults we supply, it’s very easy to change them. Let’s look at restriction enzyme sites. By default we show unique sites in small red letters and sites that cut more than once in small blue letters. But suppose you want something bigger, bolder and, well, more black and white? Here’s how to do it. Hold down the [option] key and choose the Options | Default Symbols menu item. If no sequence windows are open, you don’t even need to hold down the [option] key. This opens the Default Symbols editor.

    NewImage

    In the Results tab, select the Unique Restriction item and change the font e.g. to Helvetica Bold 14.0 pt. Then click on the Restriction item and maybe change that to Helvetica 14.0 pt. I’ve also changed the Symbol pen color to black. Click OK to dismiss the dialog and save the changes, then open up one of your favorite vectors (you do have to re-load sequences for the changes to take effect).

    NewImage


    NIH Research Festival September 14-16, 2016

    We’re at the NIH Research Festival this week. Please drop by our booth on Thursday or Friday, when the big, white exhibitors’ tent is open. We’re at booth #562.

    We’ll be able to show you our latest release, MacVector 15, and we’ll have some goodies too!


    Working with digested fragments in the Cloning Clipboard

    The Cloning Clipboard is an easy and flexible way to design and document your cloning strategies. Here are two tips on manipulating a single fragment.

    – If you drag a fragment from the Cloning Clipboard to a vector, then you’ll get the ligation dialog. However, if you have already selected a pair of enzyme sites, then the ligation dialog does not appear. Instead, your fragment is immediately ligated into those two sites. This also happens if there is an unambiguous way the two fragments would ligate together. For example, an EcoRI-BamHI fragment would ligate directly into a vector digested with EcoRI and BamHI, whereas if you’d just digested with EcoRI, it would not.

    – If you need to manipulate, or save, a fragment before ligating it, then just double click on the fragment in the Cloning Clipboard. A new sequence window containing your fragment with its annotation will appear. The “digested end” information will be lost though. So in any subsequent ligation, the new sequence will be treated as a blunt ended fragment.

    Do remember that the history of a ligation is always documented, and that the Cloning Clipboard allows you to export fragments as PDF. So you can make flow charts showing the cloning strategy used for a construct in an application such as Illustrator, PowerPoint or Word.

    NewImage


    101 things you (maybe) didn’t know about MacVector: #49 – Identifying CRISPR Indels

    If you are screening a set of clones for the presence of changes after a CRISPR experiment, then the MacVector Analyze | Align To Reference functionality is the approach to use. However, you may find that the default parameters are not ideal for this type of analysis – they are tuned for simple sequence confirmation experiments and trade off between sensitivity to accommodate larger (>5 nt) insertions/deletions and speed.

    If you are using MacVector 14.5.3 or earlier, the key to correctly handling larger deletions and insertions is to;

    (a) Use the cDNA Alignment algorithm. This allows for unlimited length deletions in the reads, as long as there are sufficient matching residues on either side of the deletion to exceed the minimum match criteria.
    (b) Increase the Sensitivity setting. This determines “how far ahead” the alignment algorithm looks. To handle insertions in the reads compared to the reference, it needs to be larger than the expected number of inserted residues. Typically, for CRISPR experiments, this is only one or two residues, but it can be larger, so setting this to a larger value will handle those cases much better. However, you also need to adjust the X-Dropoff value to be at least Sensitivity multiplied by the Gap Penalty.
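The relationship between the two settings in (b) is simple arithmetic, sketched below in Python. The numeric values are made-up examples, not recommended MacVector settings, and the function names are hypothetical.

```python
def minimum_x_dropoff(sensitivity, gap_penalty):
    """Smallest X-Dropoff satisfying X-Dropoff >= Sensitivity x Gap Penalty."""
    return sensitivity * gap_penalty


def settings_are_consistent(sensitivity, gap_penalty, x_dropoff):
    """True if the X-Dropoff is large enough for the chosen Sensitivity."""
    return x_dropoff >= minimum_x_dropoff(sensitivity, gap_penalty)


if __name__ == "__main__":
    # Hypothetical values chosen only to illustrate the rule.
    sensitivity, gap_penalty = 20, 3
    print(minimum_x_dropoff(sensitivity, gap_penalty))
    print(settings_are_consistent(sensitivity, gap_penalty, x_dropoff=50))
```

With these example numbers, an X-Dropoff of 50 would be too small because the rule asks for at least 60.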

    Here are some reasonable settings for MacVector 14.5.3 with the critical settings outlined in red;

    CRISPR14 5 3Settings

    Aligning with the high Sensitivity value may take some time compared to the normal defaults e.g. 1,000 nt ABI trace files might align against a ~10,000 nt Reference at about 1 Read per second on an average machine. Here is an example alignment of a set of reads with a variety of short insertions and deletions centered around a (fake) CRISPR target site;

    CRISPR14 5 3Results

    You can see that the largest insert in a Read was 7 nt and this is reflected by 7 gaps inserted in the Reference. While most of the reads have short deletions and the deleted residues are represented by the gap character (“-”), some of the Reads with longer deletions are indicated by the “large gap” series of characters (“<- - - - ->”). This simply indicates that the Read was split into two separate segments during the alignment process.

    If you are using MacVector 15.0.1 or later, you will find that this interface has had some significant tweaks. First, a new CRISPR Indel Detection mode has been added to the Align To Reference settings dialog. This largely removes any need to adjust the individual settings;

    CRISPR15 0Settings

    The second change is that a clean up step has been added to the alignment algorithm to minimize the number of gapped segments in the final alignment. This has the effect of dramatically cleaning up the region around the indels. Compare the alignment below to the previous MacVector 14.5.3 generated alignment;

    CRISPR15 0Results





    How to use Codon Preference plots

    When you are looking for open reading frames in newly sequenced regions, it’s not always the longest ORFs that are protein-encoding. Let’s look at an example from one of the sequences included with MacVector:

    /Applications/MacVector/Sample Files/Gal Cosmid.nucl.

    This is from Streptomyces coelicolor, a filamentous bacterium with a 73% G+C content. The high G+C% means that stop codons (TAA, TAG and TGA) occur relatively infrequently by chance, so long open reading frames are quite common. Look at this Analyze | Nucleic Acid Toolbox plot;

    NewImage
    You can see there is a long open reading frame in the topmost pane in Frame +3. However, this is extremely unlikely to actually encode a functional protein. How do we know? Take a look at the Staden Codon Preference plots in panes 3 and 4. These plot the probability that each of the three frames encodes a protein based on the selected codon bias table (here we are using Streptomyces coelicolor.bias). The plus strand plot (pane 3) suggests the blue frame has the best chance of encoding a protein, but it’s still not that great (compare it to the blue plot at the left hand end of the sequence where there is a blue ORF in the upper pane). The red plot, which we would expect to be the highest if the long ORF really encoded a protein, is very low throughout the length of the ORF. Now look at the lowest pane – the green (-2) frame is very high throughout most of that region, with the exception of a dip in the middle. This corresponds almost exactly to two distinct green ORFs in the second (minus strand) plot. This is excellent evidence that the two green ORFs on the minus strand are the most likely to encode actual translated proteins and the long red ORF on the plus strand is purely accidental due to the infrequency of stop codons in this genome.
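The claim that a high G+C content makes chance stop codons rare can be checked with a back-of-the-envelope calculation. The Python sketch below assumes a deliberately simple model that is not part of MacVector: bases are independent, A and T are equally frequent, and G and C are equally frequent. The function name is hypothetical.

```python
def stop_codon_probability(gc_fraction):
    """Probability that a random codon is TAA, TAG or TGA.

    Assumes independent bases with P(A) = P(T) and P(G) = P(C).
    """
    p_at = (1.0 - gc_fraction) / 2.0  # probability of A, and of T
    p_gc = gc_fraction / 2.0          # probability of G, and of C
    taa = p_at * p_at * p_at
    tag = p_at * p_at * p_gc
    tga = p_at * p_gc * p_at
    return taa + tag + tga


if __name__ == "__main__":
    for gc in (0.50, 0.73):
        p = stop_codon_probability(gc)
        # 1/p is the average number of codons between chance stops in a frame.
        print(f"G+C {gc:.0%}: P(stop) = {p:.4f}, about {1 / p:.0f} codons per stop")
```

Under this model, a chance stop turns up about once every 21 codons at 50% G+C but only about once every 63 codons at 73% G+C, so open reading frames that arise by chance are roughly three times longer on average.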


    MacVector 15 is out, with a focus on protein analysis and alignment tools.

    MacVector 15 has many new features including new protein analysis tools for reference alignment of proteins, translated DNA alignments and for functional analysis of protein sequences.

    InterProScan: Scan proteins for functional domains against a variety of sequence, protein family, domain and motif databases using the InterProScan service. This searches many of the different databases that make up InterPro, including UniProt, PROSITE, HAMAP, Pfam, and PRINTS. You can annotate domains back to your sequence with a link to the original database entry.

    NewImage

    The Multiple Sequence Alignment tool has been improved with two new features. It will allow you to align DNA sequences based on their amino acid translations and multiple protein sequences can now be aligned to a single reference protein sequence.

    NewImage

    Translated Multiple Sequence Alignments: Align DNA sequences based on their amino acid translations. Display DNA sequences and their translations at the same time. Align the protein sequences using ClustalW, Muscle or T-Coffee to see the effect on the underlying DNA sequences. Directly edit the DNA sequences and immediately see the impact of the change on the amino acid alignments.

    Align proteins against a reference: You can use a protein sequence as a reference so that the display keys off that sequence when showing similarities. This allows you to view proteins in a similar way to the DNA Align To Reference interface.

    NewImage

    AppleScript and Auto Annotate: Auto-annotation has joined the growing number of MacVector tools that support AppleScript. Batch annotate folders of blank sequences. Example scripts are provided.

    Check out the release notes for full details of this release.

    How to update to MacVector 15

    If you have active maintenance and are running MacVector 13.0.1 or later, then you should have been notified about the new release already. At that point you have the option to automatically upgrade to MacVector 15. To install this version, you must have a maintenance contract that was active on 1 June 2016. If you are running OS X 10.6.8, the semi-automatic updater is not supported and you should download the full updater directly.

    If you have an older version of MacVector then download the trial and request an upgrade quote.

    If you have downloaded the trial in the past, then downloading a new trial will give you a fresh 21 days to evaluate MacVector even if a previous trial license has expired.

    Remember that when a trial expires it becomes MacVector Free.


    Tweak your DNA Matrix for better Align To Folder searches with primers

    You can use the Database | Align To Folder function as your own “personal BLAST search”, comparing a sequence to all of the sequences in a target folder hierarchy. The files in the folder can be in any format MacVector recognizes, including fasta and fastq formatted multiple sequence files.

    Many users take this approach to scan a sequence against a “database” of primers maintained in a fasta file. One problem you may encounter is that MacVector fails to find matches to perfect primers, particularly if they are short (20 nt or less). This is due to settings in the matrix file used for the search. If you want to find matches to short sequences, such as primers, you need to make some changes to the data in the .nmat file you use for your search. Open the .nmat file you are using for the search and click on the appropriately named “Tweak” toolbar item;

    NewImage

    This will bring up the “Tweak Editor” dialog;

    NewImage

    The key here is to change the p4 setting. It represents the “minimum” score that should be exceeded to consider something a match. The default is 80. Note how the score for an A vs A or G vs G match in the matrix is 4. That means that a perfect match of a 20 nt primer against a sequence will give a score of 20 x 4 = 80. The default settings require the score to be GREATER THAN the p4 threshold, so a perfect match of a 20 nt primer would be rejected. If you changed p4 to be 79, then those would be accepted.

    Now consider what would happen if you had a 19 out of 20 match to a 20 nt primer. The score would be (19 x 4) – 2 = 74 because mismatches score “-2” in the matrix. So you would need to drop p4 to be 73 or less to identify those matches.
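The scoring arithmetic above is easy to tabulate. The Python sketch below uses only the numbers quoted in the text (match = 4, mismatch = -2, score must be greater than p4) and assumes an ungapped alignment over the full primer length. The function names are hypothetical.

```python
MATCH_SCORE = 4      # score for an identical base pair, as quoted above
MISMATCH_SCORE = -2  # score for a mismatch, as quoted above


def ungapped_score(primer_length, mismatches):
    """Score of a full-length, ungapped primer alignment."""
    matches = primer_length - mismatches
    return matches * MATCH_SCORE + mismatches * MISMATCH_SCORE


def is_reported(score, p4):
    """A hit is kept only if its score is strictly greater than p4."""
    return score > p4


def highest_usable_p4(primer_length, mismatches):
    """Largest p4 value that still reports such an alignment."""
    return ungapped_score(primer_length, mismatches) - 1


if __name__ == "__main__":
    for mismatches in (0, 1, 2):
        score = ungapped_score(20, mismatches)
        print(mismatches, score, highest_usable_p4(20, mismatches))
```

Each extra mismatch costs 6 points (a lost match plus the mismatch penalty), so tolerating two mismatches in a 20 nt primer would mean a score of 68 and a p4 of 67 or less.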

    If you drop it too low, like the “60” indicated here, you may get many thousands of spurious matches. But the best ones will always be shown at the top of the list.


    How to reset the sequence numbering when working with a subsection from a larger sequence

    When you copy a section from a long sequence and paste it into a new MacVector window, the numbering from the original sequence is retained. This is very useful if you want to work on a shorter segment of a genome without losing the original numbering. However, sometimes it is preferable to have the numbering start at “1”. It’s simple to do that in MacVector: just right-click in the Editor tab of the sequence window ([control]-click if you have a single button mouse or trackpad) and choose Reset Origin to 1 from the popup menu.

    NewImage
