101 things you (maybe) didn’t know about MacVector: #52 – Data mining to identify and analyze pangolin CoV-2 analogs to the human COVID-19 virus

One of the most underrated features in MacVector is the Database | Align to Folder function. You can use this as a more sensitive version of a local BLAST search to find sequences in a “database” that match a query sequence. But in this case the “database” is simply a collection of your own sequences, stored in one or more folders on your computer, or on a locally accessible server. More importantly, in these days of huge NGS data sets, the folders can contain fasta or fastq formatted files, and the files can even be compressed using the gzip algorithm. MacVector understands paired-end reads and can retrieve both reads of a pair even if only one of them matches the query sequence.

As an example of the power of this approach, we used MacVector to retrieve reads matching the human SARS-CoV-2 genome from a collection of RNA-Seq reads from pangolins, assembled those reads into a viral genome and compared the sequence and encoded proteins to published bat and human isolates of SARS-CoV-2. You can read more about how that was accomplished and the results of the analysis in a published Technical Note.

You can use this approach to scan RNA-Seq reads for specific genes, or to identify reads in total genome sequencing experiments that extend sequences of interest, or to retrieve plasmids or bacteriophages. We’ve even used it to retrieve RNA-Seq reads using a protein sequence from a distantly related organism as a query. Here’s how to set up a typical search:

Align2FolderSetup

First make sure you have chosen a suitable Search Folder – you can have a hierarchy of folders and ask MacVector to search recursively through all the enclosed folders. Also be sure to check the paired-end reads checkbox if any of your files represent paired end reads.
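To make the folder layout concrete, here is a minimal Python sketch of what a recursive scan over such a folder might look like outside MacVector. It is not MacVector's implementation: the function names, the list of file suffixes and the assumption of simple four-line FASTQ records are all illustrative.

```python
import gzip
from pathlib import Path

# Assumed set of read-file suffixes; adjust to match your own naming scheme.
READ_SUFFIXES = (".fasta", ".fa", ".fastq", ".fq")


def find_read_files(folder):
    """Recursively list FASTA/FASTQ files, including gzip-compressed ones."""
    found = []
    for path in sorted(Path(folder).rglob("*")):
        name = path.name.lower()
        if name.endswith(".gz"):
            name = name[:-3]  # look at the suffix underneath the .gz
        if path.is_file() and name.endswith(READ_SUFFIXES):
            found.append(path)
    return found


def read_fastq(path):
    """Yield (read_id, sequence, quality) from a plain or gzipped FASTQ file.

    Assumes the common four-lines-per-record layout.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break
            sequence = handle.readline().rstrip()
            handle.readline()  # the '+' separator line
            quality = handle.readline().rstrip()
            yield header[1:].split()[0], sequence, quality
```

Because `gzip.open` is used in text mode, compressed and uncompressed files can be read through the same loop.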

Increasing the Hash Value speeds up searches dramatically, at the expense of more memory usage. The current maximum is 14, which means that you need at least a 14 residue perfect match before a potential match will even be considered. If you expect a lot of hits, increase Scores to Keep to a large value.
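The trade-off behind the Hash Value can be illustrated with a generic k-mer seeding sketch in Python. This is a textbook approach under stated assumptions, not MacVector's actual algorithm; the function names and the toy sequences are made up for the example.

```python
from collections import defaultdict


def build_kmer_index(query: str, k: int) -> dict:
    """Map every k-mer in the query to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return dict(index)


def has_seed_match(read: str, index: dict, k: int) -> bool:
    """Return True if the read shares at least one exact k-mer with the query."""
    return any(read[i:i + k] in index for i in range(len(read) - k + 1))


query = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
perfect = "CGTACGTTAGCCTAGGAT"     # exact 18-base substring of the query
mismatched = "CGTACGTTTGCCTAGGAT"  # the same read with one central mismatch

index14 = build_kmer_index(query, k=14)
index8 = build_kmer_index(query, k=8)

print(has_seed_match(perfect, index14, 14))     # True
print(has_seed_match(mismatched, index14, 14))  # False: no 14-mer survives the mismatch
print(has_seed_match(mismatched, index8, 8))    # True: a shorter seed still matches
```

A larger seed means fewer candidate reads to align, which is why searches get faster, but reads whose longest perfect run is shorter than the seed are never considered.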

Finally, the Scoring Matrix can be critical. If you are looking for matches using a query sequence from a related organism, you should likely use DNA database matrix.nmat so that you can retrieve weak matches. However, if you are trying to extend a genomic sequence where you are expecting essentially perfect matches, though perhaps with just short overlaps at the ends of reads, then DNA identity with penalties matrix.nmat is tuned for those searches.



This is an article in a long running series of tips to help you get the most out of MacVector. If you want to get notified every time a new tip gets published, follow us @MacVector on twitter (or check the feed for the hashtag #101MacVectorTips) or like us on Facebook.


Working from home with MacVector during the COVID-19 pandemic

A lot of MacVector users are now at home getting used to a new way of working. The MacVector team are distributed throughout the US and Europe and we are used to remote working. However, for those new to working from home, it’s a LOT different to working back in the lab with your colleagues!

(See how we’ve been making the most of the lockdown to use existing sequencing archives to assemble a new Pangolin SARS-CoV-2 genome.)

We want to help make it easier to use MacVector:

  • If you currently use a network license and are struggling to access this from home, then email MacVector Support. We can help you in various ways.
  • If you normally use a Mac desktop in the lab, then we can give you a temporary license to activate on any home Mac you might have.
  • Even if you use an old license of MacVector, then we will give you a temporary license of MacVector 17.5.
  • If you use a Standard license, then you can already activate that on any home Mac. However, if you have forgotten your license activation details then email MacVector Support for a reminder.
  • Finally, if you are in any way connected with COVID-19 research, then please have our thanks, and have an annual license free of charge.

    Our thoughts are with all those affected by COVID-19, be it directly or indirectly. Our particular thoughts are with those on the front line, from health care workers to those researchers working so hard on behalf of all humanity to find a vaccine and a treatment.


    Primer validation with MacVector: Primer3, Covid19 and primer design

    The CDC recently published diagnostic real-time primers for identification of SARS-CoV-2 in any person suspected of having COVID-19.

    Unfortunately, as pointed out on the Biome Informatics blog, these primers have issues that should have been easily detected had they been tested with a good quality primer testing tool (the linked blog post uses Primer3). What’s more, if a good quality primer design application, such as MacVector’s Primer3 tool, had been used from the beginning, these issues would likely never have occurred.

    Some primer design software tools can be difficult to use. However, MacVector provides an easy to use interface to Primer3. Designing a pair of primers to amplify a target can be as simple as just three mouse clicks (click on the target gene, click ANALYZE | PRIMER3 and then click OK).

    MacVector also has QuickTest Primer for visual design and testing of primers. QuickTest Primer simplifies primer design by showing your primer and its statistics in real time. Does your primer have a hairpin? Nudge it along your template until the hairpin goes. Want to add a restriction site? Then add one and again nudge your primer to optimise the oligo.

    MacVector also has built in tools for helping find suitable targets to amplify using Blast or scanning local sequences too.

    COVID19 QTPrimer 2

    Using MacVector’s Quicktest Primer tool you can see the obvious hairpin in the reverse primer of the CDC first set of primers.

    Here’s a demonstration workflow on how to use MacVector’s Primer Database and Quicktest Primer tools to look at these primers:

    Download the primers

  • Download the CDC primer list (in a MacVector Primer Database file).
  • Open MACVECTOR | PREFERENCES | SCAN DNA | PRIMERS.
  • Click SET DATABASE FILE and choose the downloaded file.

    Download one of the many sequenced SARS-CoV-2 genomes

  • Open MACVECTOR | ENTREZ and search for “organism=SARS-CoV-2” and “all fields = genome”.
  • There are a lot of submitted genomes from this organism (94 as of March 16, 2020).

  • Double click on a hit to open the genome directly in MacVector.
  • Now switch to the MAP tab.
  • You will see the primers annotated on the sequence (faded as they are dynamically shown).

  • Open ANALYZE | QUICKTEST PRIMER
  • Click the Insert from Primer Database button (as seen in the screenshot below).

    QT Primer

    You will see the primer displayed in the QT Primer interface showing any flaws. Most of the primers have hairpins.

    SARS CoV 2 QTPrimer

    Designing Primers

    Here’s an example workflow of how you would design primers to detect the N gene.

    Please note that for this demonstration workflow we have designed primers against the N gene, which had been selected by the CDC as a target. For real world purposes you would need to select an appropriate target. One way would be to align all sequenced human SARS-CoV-2 genomes (or those of whatever organism you are looking for), then look for highly conserved regions, particularly in regions of the virus thought to be critical for pathogenicity. Those conserved regions can then be used as targets to design primers. It would also be useful to select regions that distinguish SARS-CoV-2 from the original SARS-CoV-1.
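    The idea of scanning an alignment for conserved stretches can be sketched in a few lines of Python. This is only an illustration under simple assumptions (pre-aligned sequences of equal length, gaps written as "-", a hypothetical window size and threshold), not the method used by MacVector or the CDC.

```python
def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column."""
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        most_common = max(set(residues), key=residues.count)
        # Divide by the full column height so gaps count against conservation.
        scores.append(residues.count(most_common) / len(column))
    return scores


def conserved_windows(alignment, window=20, threshold=1.0):
    """Start positions of windows in which every column meets the threshold."""
    scores = column_conservation(alignment)
    return [
        start
        for start in range(len(scores) - window + 1)
        if min(scores[start:start + window]) >= threshold
    ]


# Toy alignment with variable columns at positions 3 and 11.
alignment = [
    "ATGCCGTTAGCA",
    "ATGCCGTTAGCA",
    "ATGACGTTAGCT",
]
print(conserved_windows(alignment, window=5))  # [4, 5, 6]
```

    Lowering the threshold below 1.0 tolerates occasional variants, which may be more realistic for a large collection of genomes.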

  • Select the N Gene in the SARS-CoV-2 genome (the one you downloaded above).
  • Run ANALYZE | PRIMER DESIGN/TEST (Primer3)
  • Change the setting to REGION TO SCAN and enter a product size of 400 to 500
  • Check Hybridization Probe Sequence.
  • Ensure all primers are set to Find Primer and not USE THIS PRIMER.
  • For this workflow example you do not need to change any advanced options.
  • Click OK

    Designprimers Primer3 settings

    In the Results window you will see a MAP with the generated primer pairs and products.

    NewPrimersPrimer3

    The initial left primer still has a hairpin. You could tweak Primer3 settings. However, you can quickly tweak the primer itself using Quicktest Primer. If you slide the primer three bases to the left then the hairpin will go.
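    As a rough illustration of what nudging a primer achieves, the Python sketch below looks for a perfect-stem hairpin and slides the primer along the template until none is found. It is deliberately naive: real tools such as Primer3 score secondary structure thermodynamically, and the stem length, loop length and function names here are assumptions for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_hairpin(primer, min_stem=4, min_loop=3):
    """Return (stem_start, partner_start, stem) for the first perfect stem, else None."""
    for i in range(len(primer) - 2 * min_stem - min_loop + 1):
        stem = primer[i:i + min_stem]
        # The partner must start at least min_loop bases after the stem ends.
        j = primer.find(reverse_complement(stem), i + min_stem + min_loop)
        if j != -1:
            return i, j, stem
    return None


def nudge_until_clean(template, start, length, max_shift=5):
    """Try shifts 0, -1, +1, -2, +2, ... and return the first hairpin-free primer."""
    for shift in sorted(range(-max_shift, max_shift + 1), key=abs):
        pos = start + shift
        if pos < 0 or pos + length > len(template):
            continue
        candidate = template[pos:pos + length]
        if find_hairpin(candidate) is None:
            return pos, candidate
    return None


template = "CACACAGCTCACACACAGAGCACAC"
primer = template[4:21]
print(find_hairpin(primer))                # (2, 13, 'GCTC')
print(nudge_until_clean(template, 4, 17))  # (3, 'ACAGCTCACACACAGAG')
```

    In this toy case a one-base shift to the left drops the last base of the stem's partner, so the perfect stem can no longer form.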

  • Select the LEFT primer in the Results Spreadsheet tab and choose ADD TO PRIMER DATABASE
  • Choose a suitable name.
  • Repeat for the RIGHT primer and the PROBE.
  • Open QuickTest Primer
  • Click INSERT FROM PRIMER DATABASE and test each primer.
  • To slide a primer left or right click on the cursor buttons either side of the primer.
  • Don’t forget to save any modified primer. You may also want to test the primers again using Primer3. If you save the primers to your Primer Database then this is easy to do.

    https://tomeraltman.net/2020/03/03/technical-problems-COVID-primers.html

    https://www.cdc.gov/coronavirus/2019-ncov/lab/rt-pcr-panel-primer-probes.html


    MacVector Training workshop at The Crick: Tuesday 17th March 2020

    Unfortunately due to the Covid19 outbreak this workshop is now cancelled. Keep safe and healthy everybody.

    The workshop, previously cancelled last year, is now rescheduled:

    Room: HR Training Room 01.2162. Floor: 1 
    Date: 17th March 2020 – from 9:30 – 11:30

    Chris Lindley of MacVector, Inc. will be giving a training workshop for both novice and advanced users of MacVector at The Crick, reviewing both basic and advanced functions, in particular the new tools introduced over the last few versions.

    The format is very informal and participants are very much encouraged to direct the workshop towards areas of the most interest.

    Laptops will be provided for users to work through examples and tutorials as they are demonstrated. Workbooks will also be provided to allow attendees to work through during the workshop and afterwards.

    The intention is that all attendees will learn at least one new and useful tool or tip. The workshop is two hours, but Chris will be available in the room for further discussion until 13:00.

    Please register for the workshop by emailing Chris (drop-ins on the day will be very welcome, but will not be guaranteed access to a laptop or a workbook).

    See what MacVector can do for your lab.

    UnknownGibsonCloning


    Make more of your alignments with MacVector 17.5

    Our latest release MacVector 17.5 gives you new tools to make the most of your alignments.

    It displays shared domains in protein alignments to visualize the relationships between aligned proteins. It introduces Flye for de novo assembly of PacBio and Oxford Nanopore long reads and a slew of enhancements to the Contig and Align to Reference Editors.

    As ever there are a slew of minor enhancements, bug fixes and changes to better support the latest releases of macOS.

    Outlining Shared Domains in Aligned Sequences

    Multiple sequence alignments now retain feature information and can use this to outline shared domains in the Picture output tab. You can set the colors of features in the individual sequence documents in the usual way and these are used for the outlines.

    There is a feature display mode in the Editor tab where you can see the extent and color of the features. When you switch to the Picture tab, you will see colored outlines around the shared domains:

    Prions text

    de novo Assembly of PacBio and Oxford Nanopore reads with Flye

    Flye is an assembler algorithm tuned to assemble poor quality long reads such as those produced by PacBio and Oxford Nanopore sequencers. Because these reads tend to be very error prone, MacVector 17.5 also includes an optional polishing step using Racon. With typical bacterial genome assemblies it is fairly common to be able to assemble reads into a single full-length genome contig.

    Contig and Align to Reference Editor Enhancements

    There have been a number of enhancements to these editors, primarily to aid in visualizing edits and quality values and to “clean up” the visual appearance of alignments.

    Residue Background Colored by Quality

    There have been several changes to provide improved support for quality values of de novo contigs and reference assemblies.
    A Shading toolbar button lets you turn on coloring based on quality. Edited residues are always given a phred quality value of 99 and are shown with a blue background.

    Base Calling with Phred

    You can now directly run phred on Sanger sequencing trace files in the Align to Reference Editor by clicking on the Basecall toolbar item with the appropriate sequences selected:

    Assembly qualityscorecolouring 2 2x 400

    Editing Enhancements

    There are some new context-sensitive menu items in the Align to Reference Editor tab:

    Delete Clipped Residues – deletes any greyed-out (“clipped” or “trimmed”) residues. While these are ignored by the consensus calculation, some users prefer to delete them for a cleaner looking alignment.

    Close Gaps by Deleting Residues – you’ll often see gaps in the consensus where one or more reads has an additional erroneous inserted residue. This menu item removes the extra residues from the read, cleaning up the visual appearance of the alignment.

    Nudge reads – Select the name of the sequence you want to nudge and use the left/right arrow keys to move it around. If you have problematic alignments where you need to physically insert residues or gaps, hold down the

    MSADomainEditor 2x

    Miscellaneous Enhancements

    There have been a large number of minor enhancements. Some, such as reworking code behind the scenes to replace deprecated Apple functions and refactoring code for better stability and performance, help ensure that MacVector will continue to work on upcoming releases of macOS and take advantage of improved hardware. There have also been improvements to Dark Mode support in many areas and much better handling of the labels in crowded Map views.

    How to upgrade to MacVector 17.5

    If you have active maintenance and are running MacVector 15.5.4 or later then you will be notified about the new release. To install this version, you must have a maintenance contract that was active on 1st February, 2020. You must also be running MacVector 15.5.4 and OS X 10.9 Mavericks or later.

    If you have an older version of MacVector then download the trial and request an upgrade quote.

    Even if you have downloaded the trial in the past then downloading a new trial will give you a fresh 21 days to evaluate MacVector.

    When a trial license expires it becomes MacVector Free. So if you decide against upgrading then you can just delete the trial license and easily go back to your current version. It’s risk free as MacVector files are backwards compatible.


    Importing BAM files into an Assembly Project

    You can import BAM files, containing reads mapped against a reference sequence, into a MacVector Assembly Project. As well as the BAM file(s) you will also need the original reference sequence the reads were mapped against. FASTA is fine, but an annotated reference is better for visualisation.

    The tool needed is called ADD CONTIG. This is one of the toolbar buttons in an Assembly Project:

    First create a new assembly project.

    • FILE > NEW > ASSEMBLY PROJECT

    • click ADD REF to add the reference sequence.

    • Use ADD CONTIG to import your BAM/SAM file.

    Then you need to associate the BAM file(s) with the reference:

    • Select the reference and an imported contig (BAM file).

    • Right click on the selection and choose UNITE REFERENCE WITH CONSENSUS SEQUENCE

    You can optionally also generate a report on any variants (either at the previous step or a later stage).

    • Right Click and choose GENERATE VCF

    If you import multiple BAM files against the same reference sequence you can also graphically compare these datasets with the Coverage Tab (third tab along in the Assembly Project window).

    CoverageTabx2

    Incidentally if you need to access the BAM files from within MacVector’s Assembly Projects then you can right click on an Assembly Project and view the contents.

    Simple DNA sequence assembly on a Mac with MacVector with Assembler.

    MacVector has a software plugin called Assembler that integrates directly into the DNA sequence analysis toolkit and provides DNA sequence assembly functionality. Dealing with sequencing reads has never been easier.

    MacVector includes no fewer than five different assemblers just a few mouse clicks away from your sequencing reads. Phrap assembles Sanger sequencing reads or existing contigs, while there are three separate NGS de novo assemblers – Velvet for short read datasets, Flye for Nanopore and PacBio long reads and SPAdes for mixed assemblies. For reference assembly, Bowtie2 can map millions of sequencing reads against genomic reference sequences and is ideal for RNA-Seq gene expression analysis data too.

    Assembler is tightly integrated into MacVector. It’s easy to bring sequencing reads into MacVector, and it’s just as easy to directly design primers for a contig, run BLAST searches on a contig, and much more, right from your desktop!


    Calculating the optimal PCR annealing temperature

    MacVector has several tools to help with primer design and testing. The Analyze | Primer Design/Test (Pairs) function uses the popular Primer3 algorithm to find suitable pairs of primers to amplify specified segments of DNA. You can also enter pairs of pre-designed primers and test their suitability for use in PCR. In both cases, the Tm of each primer is reported, along with the optimal annealing temperature (Ta).


    The optimal annealing temperature (degrees C) is calculated as follows (from W. Rychlik, W.J. Spencer, and R.E. Rhoads, Nucl. Acids Res. 18:6409–6412 (1990)):

    (Lowest Primer Tm x 0.3) + (Product Tm x 0.7) - 14.9

    This means that you can get an optimal annealing temperature for a PCR experiment that is significantly different from the optimal annealing temperature for an individual primer (e.g. in a sequencing experiment) because of the large influence of the product in the calculation.
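    As a worked example of the formula above, here is a small Python sketch. The function name and the Tm values are hypothetical and chosen only to show the arithmetic.

```python
def optimal_annealing_temp(primer_tms, product_tm):
    """Ta = 0.3 x (lowest primer Tm) + 0.7 x (product Tm) - 14.9, in degrees C."""
    return 0.3 * min(primer_tms) + 0.7 * product_tm - 14.9


# Hypothetical values: two primers with Tm 58.0 and 61.5, product Tm 82.0.
ta = optimal_annealing_temp([58.0, 61.5], product_tm=82.0)
print(f"{ta:.1f}")  # 59.9
```

    Here 0.3 x 58.0 = 17.4 and 0.7 x 82.0 = 57.4, so the product contributes more than three times as much as the primer, which is why the result can sit well away from the Tm of either primer.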


    How to copy a specific short amino acid translation of a sequence

    There can be times when you are messing about with open reading frames, inserting residues to change frames to try to get the perfect CDS fusion. The MacVector single sequence Editor will show the translations (click and hold on the “Display” toolbar button) but if you select and copy, only the DNA sequence (with any overlapping features) will be copied to the clipboard. If you need to copy a specific translation of a sequence, here’s how to do it: select the region you are interested in, then invoke Analyze | Translation… Select the “Display text view with translation” option, set the Number of Frames to 3 or 6 and click OK.


    From the resulting window, you can select the text of the amino acid sequence you are interested in, copy, and then create a new sequence document (File | New From Clipboard) or paste into an external application.
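    For readers who want to see what a three or six frame translation involves, here is a minimal Python sketch using the standard genetic code. It is an illustration only, with assumed function names; it is not how MacVector produces its text view.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"])  # MA*
print(frames["-1"])  # LGH
```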


    Optimizing Reverse Translations

    The Analyze | Reverse Translation menu option lets you create a DNA sequence from a Protein sequence, reverse translated using a specific Genetic Code (by default, the Universal Genetic Code). The default option creates a DNA sequence with N’s and other ambiguities reflecting the degeneracy of the genetic code. This is great if you want to identify less ambiguous sections to design probes or primers and in fact MacVector will even display a list of probes with the least ambiguities.
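    The degenerate output can be pictured with the Python sketch below, which merges the bases seen at each codon position into a single IUPAC ambiguity code. This per-position merge is one simple approach and an assumption on our part; it can over-represent some amino acids (for example, YTN for Leu also covers the Phe codons TTT and TTC), and MacVector's own output may differ.

```python
from collections import defaultdict
from itertools import product

# Standard (universal) genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS_FOR = defaultdict(list)
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS_FOR[aa].append("".join(codon))

# IUPAC nucleotide codes keyed by the sorted set of bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def degenerate_codon(aa):
    """Collapse all codons for one amino acid into a single ambiguous codon."""
    codons = CODONS_FOR[aa]
    return "".join(
        IUPAC["".join(sorted({codon[i] for codon in codons}))] for i in range(3)
    )


def reverse_translate_degenerate(protein):
    return "".join(degenerate_codon(aa) for aa in protein.upper())


print(reverse_translate_degenerate("MFW"))  # ATGTTYTGG
print(reverse_translate_degenerate("LS"))   # YTNWSN
```

    Residues such as Met and Trp, with a single codon each, give the unambiguous stretches that are attractive for probe and primer design.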

    However, MacVector also offers an optimization function if you are interested in designing a gene with codon usage optimized for expression in a particular organism.


    To use this function, you do need to supply a codon usage table – a number of common tables are shipped with MacVector:

    /Applications/MacVector/Codon Bias Tables/

    There are four different algorithms that MacVector provides for optimizing codon usage.

    Most Frequently Used Codon – this simply uses the most commonly occurring codon for each amino acid. So if, e.g. the most common Leu codon is CTC, all Leu codons will be CTC. Perhaps this is only useful if you want to design a “best guess” primer and are willing to accept a certain failure rate. If you used this to optimize expression, the host would likely run out of that tRNA and you wouldn’t see optimal expression.

    Frequency Distribution – this selects a random codon for each amino acid, biased towards the most commonly used codon that encodes each amino acid. Each time you run the algorithm, a different, random set of codons will be selected. If you were to generate a new DNA over and over again, eventually this would create a collection of sequences where the average codon usage would exactly match the average for the .bias organism. But any individual reverse translation may randomly be quite different.

    Probability Distribution – this is probably the most powerful setting if you are interested in expression. Similar to the Frequency Distribution, this chooses a random codon, biased towards the most frequently used codons for each amino acid. However, this version tries to ensure that the final DNA sequence has a codon usage profile as closely matching as possible to the codon usage of the selected .bias file. Again, each time you invoke the algorithm, it will produce a different sequence. But as the overall codon usage in the DNA sequence is guaranteed to be as close as possible to the codon usage in the .bias organism this should, in theory, give you the best chance of high expression. Again, you will get a different sequence each time you invoke this.

    Uniform Distribution – this ignores the usage of each codon and randomly assigns an appropriate codon for each amino acid. It’s similar to the default algorithm that uses ambiguities to create an “absolute” coding DNA, but here it just chooses a random codon with no regard for codon usage probability. Again, you will get a different sequence each time you invoke this.
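    The four strategies described above can be compared side by side in a short Python sketch. This is our own interpretation, not MacVector's code: the usage numbers are invented for three amino acids rather than taken from a real .bias file, and the "probability distribution" function uses a simple quota-and-shuffle scheme as one plausible way to make the final codon counts track the table.

```python
import random
from collections import Counter

# Hypothetical relative codon usage (illustrative values only).
USAGE = {
    "L": {"CTG": 0.5, "CTC": 0.2, "TTA": 0.1, "TTG": 0.1, "CTT": 0.05, "CTA": 0.05},
    "K": {"AAA": 0.75, "AAG": 0.25},
    "M": {"ATG": 1.0},
}


def most_frequent(protein):
    """Always use the single most common codon for each amino acid."""
    return "".join(max(USAGE[aa], key=USAGE[aa].get) for aa in protein)


def frequency_distribution(protein, rng):
    """Pick each codon independently, weighted by its usage."""
    return "".join(
        rng.choices(list(USAGE[aa]), weights=list(USAGE[aa].values()))[0]
        for aa in protein
    )


def uniform_distribution(protein, rng):
    """Pick each codon at random, ignoring usage."""
    return "".join(rng.choice(list(USAGE[aa])) for aa in protein)


def probability_distribution(protein, rng):
    """Fix per-codon counts to match usage as closely as possible, then shuffle."""
    pools = {}
    for aa, n in Counter(protein).items():
        exact = {codon: freq * n for codon, freq in USAGE[aa].items()}
        counts = {codon: int(x) for codon, x in exact.items()}
        # Hand any leftover slots to the codons with the largest remainders.
        leftover = n - sum(counts.values())
        by_remainder = sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)
        for codon in by_remainder[:leftover]:
            counts[codon] += 1
        pool = [codon for codon, k in counts.items() for _ in range(k)]
        rng.shuffle(pool)
        pools[aa] = pool
    return "".join(pools[aa].pop() for aa in protein)


rng = random.Random(0)  # seeded only so that repeated runs are comparable
print(most_frequent("MLK"))  # ATGCTGAAA
print(probability_distribution("KKKK", rng))
```

    With four lysines the quota version always uses AAA three times and AAG once, in a random order, whereas the independent weighted draw only matches that ratio on average.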


    Use Database | Auto-Annotate Sequence to annotate prokaryotic genomes

    The continuing advances in Next Generation Sequencing have made it relatively low cost to sequence prokaryotic genomes. Many scientists are embarking on large projects to sequence multiple related genomes. These might be clinical isolates of the same species exhibiting different pathogenic properties, environmental isolates from different sites, or a study over time of the changes in microbial genomes from specific locations. Once you have your sequence, the definitive source of annotation is the NCBI Prokaryotic Annotation Pipeline. However, to have that run on your sequence, you must submit the sequence to the NCBI. This is not always ideal – perhaps you are still working on resolving repeat sequences for your genome, you don’t want to wait for it to be published or you don’t want to go through the hassle of a formal submission for many variant sequences. MacVector to the rescue!

    First, you need to download existing similar genome sequences – open the Database | Online Search for Keywords (Entrez) browser and search for [name of your organism] “+” “complete genome”. Assuming you are working with a reasonably common organism, you might find a few (or a lot of) hits. Select those you are interested in and click on the To Disk button, selecting a suitable target folder to save the downloaded genomes into.

    Now take your unannotated genome sequence and invoke Database | Auto-Annotate Sequence. Select the folder containing your genomes as the target and press OK. On a 2.7 GHz laptop, scanning a 1.8 Mbp Campylobacter jejuni genome against 25 related C. jejuni genomes (average 5,000 features per genome) takes around 10 minutes with the default parameters, resulting in a fully annotated genome.


    MacVector 17’s new “Genome Comparison” tool lets you directly compare the features of two related genomes based on DNA or (for CDS features) protein sequences and reports all of the identities, similarities, differences and missing features. The tool confirmed that no features were missing compared to the NCBI annotated genome and there were just minor differences with a few CDS features where there were mutations creating or removing stop codons.
