General musings from the MacVector team about sequence analysis, molecular biology, the Mac in general and of course your favorite sequence analysis app for the Mac!

Assemble bacterial genomes in minutes on your Mac laptop

MacVector with Assembler contains some remarkably powerful algorithms for assembling Next Generation Sequencing (NGS) data. Not so long ago, you needed a powerful Linux server with lots of memory for de novo assembly of whole genomes. But with advances in the efficiency of algorithms and improvements in hardware, it is now possible to assemble quite large genomes on a Mac laptop.

MacVector 16 incorporates two separate NGS de novo assemblers, Velvet and SPAdes. Both are very capable assemblers with a small memory footprint. Velvet is significantly faster, but SPAdes often generates longer contigs because it does a slightly better job of resolving repeats. SPAdes can also handle many more data types for mixed-read assemblies, and its even smaller memory footprint allows it to be used for larger data sets. With Velvet you often need to tweak the parameters for optimal performance, whereas SPAdes usually “just works”: it can often generate meaningful assemblies from relatively poor data where Velvet will fail without considerable parameter tweaking.

Both are invoked the same way: use File | New | Assembly Project to create a new project, then click on the Add Reads button and select the read files you want to import. Typically these are paired-end reads (either interleaved or as separate files), but they can also be unpaired reads, consensus sequences exported from a different assembly, or Ion Torrent, PacBio or Oxford Nanopore reads. You can also import compressed (gzip) files directly, with no need to uncompress them, saving a lot of disk space. Finally, click on the Velvet or SPAdes toolbar button to run the assembler. The end result will be a number of contigs.
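If you want to check how many reads a compressed file contains before importing it, the same “no need to uncompress” idea can be sketched outside MacVector in a few lines of Python. This is only an illustration, not how MacVector reads the files: the function name is hypothetical, the example reads are made up, and it assumes the common FASTQ layout of exactly four lines per read.

```python
import gzip
import io


def count_fastq_reads(source):
    """Count reads in a gzip-compressed FASTQ file.

    `source` may be a path or an open binary file object. Assumes the common
    layout of exactly four lines per read (no wrapped sequence lines).
    """
    with gzip.open(source, "rt") as handle:
        line_count = sum(1 for line in handle if line.strip())
    if line_count % 4 != 0:
        raise ValueError("line count is not a multiple of 4; file may be truncated")
    return line_count // 4


# Small in-memory demonstration with two made-up reads (one interleaved pair).
example = "@pair1/1\nACGT\n+\nIIII\n@pair1/2\nTGCA\n+\nIIII\n"
compressed = io.BytesIO(gzip.compress(example.encode("ascii")))
print(count_fastq_reads(compressed))
```

For a file on disk you would pass the path instead of the in-memory buffer. The data is streamed through gzip, so nothing is written to disk.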

Here are some examples of performance, with all tests run on a 2013 2.7 GHz MacBook Pro with 16 GB RAM:

NewImage

In the case of the small Mycobacterium genome, Velvet completed the assembly in a little over a minute. Even a moderately large ~7 Mbp Streptomyces sp. assembly of 5 million HiSeq reads took just 16 minutes with Velvet and less than an hour with the more memory-efficient SPAdes algorithm.

For a more in-depth discussion of these results, please see our recent blog post.

Posted in Tips

Simple Assembly of Sanger Sequencing Files with MacVector Assembler

With MacVector Assembler, assembling ABI Sanger Sequencing files is simple, fast and accurate. MacVector uses the popular phred/phrap/cross_match set of tools from the University of Washington. To improve accuracy, and to help resolve repeats, these tools use “quality scores” (popularly known as “phred scores”), giving them an advantage over many other methods. To assemble two or more ABI files, follow these steps.

  • Use File | New | Assembly Project to create a new project
  • Click on the Add Seqs toolbar button and select all of your ABI (or SCF) chromatogram files to import
  • Click on the phred toolbar button – this re-calls the traces and generates quality scores (there is no need to select any items in the project, though you can select items to run phred on specific files only)
  • Click on the phrap toolbar button and accept the defaults
  • After phrap has run, you will be presented with one or more contigs (assuming your ABI reads actually overlap). If you double-click on one of those, a contig editor will open letting you view and edit the actual alignments.

    NewImage
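The quality scores that phred generates follow a standard definition: a score Q corresponds to an estimated base-calling error probability P, where Q = -10 × log10(P). The short Python sketch below simply works through that arithmetic. The function names are hypothetical and are not part of MacVector or the phred tools.

```python
import math


def phred_to_error_probability(q):
    """Estimated probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Quality score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_probability(q))
```

So a base with a score of 20 has an estimated 1 in 100 chance of being wrong, and a score of 30 corresponds to 1 in 1000.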

    Not sure if you have Assembler? Choose MacVector | About MacVector. If the screen that appears says “MacVector with Assembler, Pro Edition” then you have it. If not, you can sign up for a fully functional 21-day trial version.

    Posted in Tips

    An overview of assembling sequencing data with MacVector’s Assembler plugin

    To assemble various types of sequencing reads, follow these steps.

    NewImage

    • Choose File | New | Assembly Project to create a new empty project file.

    Then follow one of these procedures:

    To create a de novo assembly from Sanger reads

    • Click on the Add Reads toolbar button, then select the sequence files you wish to assemble and click on the Open button. Read files can also be dragged and dropped onto the open Assembly Project window.
    • Click Phred to basecall the sequences in the project. Note that if no sequences are selected, phred will be run on ALL of the files in the project.
    • Click Phrap to assemble the reads.

    To create a de novo assembly from NGS datasets

    • Click on the Add Reads toolbar button, then select the sequence files you wish to assemble and click on the Open button. Files can also be dragged and dropped onto the open Assembly Project window. Paired read files are automatically detected.
    • Choose either SPAdes or Velvet to assemble the reads.

    To create a reference assembly

    • Click the Add Reads button, then select the sequence files you wish to assemble and click on the Open button.
    • Click Add Ref, select the sequence file(s) you wish to align the reads against and click on the Open button.
    • Click Bowtie to map all read files against all of the reference sequences in the project.

    Not sure if you have Assembler? Choose MacVector | About MacVector. If the screen that appears says “MacVector with Assembler, Pro Edition” then you have it. If not, you can sign up for a fully functional 21-day trial version.

    Posted in Techniques, Tips

    MacVector video tutorials on YouTube

    There are short video tips on using MacVector on our blog and our YouTube channel. Each one is under 2 minutes long, and most are shorter than that! There will be a new screencast every few weeks, so please subscribe to our YouTube channel so you don’t miss any!

    The latest screencasts include quickly annotating a gene to a sequence, confirming a small sequencing project against a reference and checking the orientation of a ligated insert using MacVector’s Restriction Digest and Agarose Gel tools.

    If there are any tools or features you’d like to see a screencast about, please do let us know... and always remember that we will come and run real live workshops for you too!

    Posted in Tips

    Viewing external database entries for features in a sequence

    Sequences, or regions of sequences, can be linked to external databases. The link may cover an entire sequence entry, or a single feature, for example when an annotation tool such as InterProScan annotates a protein with domain or motif information. This is very useful when you want to view more detailed or updated information. Within the Genbank specification, which MacVector uses extensively, a link to an external database entry can be stored in a /DB_XREF qualifier, which allows the database entry to be viewed easily. The Genbank (and Genpept) specifications allow many different databases to be referenced using this qualifier.

    NewImage

    In MacVector, the original database entry can easily be viewed in a web browser by selecting and then right-clicking the feature entry in the Features tab and viewing the available DB_XREF entries. Selecting one will load it in your web browser.

    NewImage
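To make the qualifier format concrete: in a Genbank flat file each cross-reference is written as a database name and an identifier separated by a colon. The Python sketch below pulls those pairs out of a block of feature text. It is only an illustration of the file format, not of how MacVector handles the qualifier, and the feature text and identifiers in it are made-up placeholders.

```python
import re

# Matches /db_xref="Database:Identifier", ignoring case in the qualifier name.
DB_XREF_PATTERN = re.compile(r'/db_xref="([^":]+):([^"]+)"', re.IGNORECASE)


def extract_db_xrefs(feature_text):
    """Return (database, identifier) pairs found in Genbank feature text."""
    return DB_XREF_PATTERN.findall(feature_text)


# Placeholder feature text; the identifiers are invented for illustration.
sample_feature = '''
     CDS             100..400
                     /gene="exampleA"
                     /db_xref="GeneID:12345"
                     /db_xref="InterPro:IPR000001"
'''

for database, identifier in extract_db_xrefs(sample_feature):
    print(database, identifier)
```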

    Posted in Tips

    Use the Replica Button For Synchronized Views

    Most primary MacVector windows (Nucleic Acid Sequence, Protein Sequence, Multiple Sequence Alignment, Align To Reference, Contig Assembly etc.) have a Replica toolbar button. If you click that button, a second window will open, potentially set to a different tab. The key to this functionality is that the two windows are linked – any selection you make in one window is reflected in the other. In most cases this means that if you select an object in, for example, the Map tab of one window, the Editor tab of the other window will scroll to display the selected region.

    Here’s an example where clicking on an aligned read in an Align To Reference Map tab has automatically scrolled the replica Editor tab to show the selected sequence.

    NewImage

    Posted in Tips

    How to Identify Bacterial Promoters Using MacVector

    MacVector’s Subsequence tool is a very flexible search function that can be used for a variety of tasks. MacVector itself has a built-in variant of the function for maintaining and searching primer databases (Analyze | Primer Database Search…). Each entry in the file MacVector uses as a source of subsequence data can have up to 3 segments, with a variable distance between the segments, along with a defined number of permitted mismatches and even a system for requiring that specific residues must match. That makes it ideal for searching for bacterial promoters. For example, the canonical Escherichia coli promoter sequence is a “-35” region “TTGACA”, then a gap of 16 to 18 residues, then a “-10” region “TATAAT”. You should find an EcoliPromoter.nsub file in the /MacVector/Subsequences/ folder. If it is missing, you can download it. If you open the file in MacVector, you can see this.

    NewImage

    You can see that the file has four entries – each of these has two segments representing the -35 and -10 region, but each has additional settings that control how close a match has to be before it is reported. The names give some idea of the stringency of the match – Perfect, Probable, Possible and Weak. If you double-click on the Probable item, you get this editor.
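The idea of a two-segment search with a variable gap and a mismatch allowance can be sketched in a few lines of Python. This is not MacVector’s Subsequence algorithm: the function names are hypothetical, and the rule used here (a single mismatch budget shared across both segments) is an assumption chosen to keep the example short.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_promoters(seq, minus35="TTGACA", minus10="TATAAT",
                   gaps=(16, 17, 18), max_mismatches=0):
    """Return (start, gap, mismatches) for each candidate promoter.

    `start` is the 0-based position of the -35 segment. The mismatch
    budget is shared by both segments (an assumption of this sketch).
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - len(minus35) + 1):
        mm35 = count_mismatches(seq[start:start + len(minus35)], minus35)
        if mm35 > max_mismatches:
            continue
        for gap in gaps:
            start10 = start + len(minus35) + gap
            box10 = seq[start10:start10 + len(minus10)]
            if len(box10) < len(minus10):
                break  # ran off the end; larger gaps cannot fit either
            total = mm35 + count_mismatches(box10, minus10)
            if total <= max_mismatches:
                hits.append((start, gap, total))
    return hits


# Made-up test sequence: perfect -35 and -10 regions separated by 17 residues.
test_seq = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
print(find_promoters(test_seq))
```

Raising max_mismatches loosens the search in the same spirit as moving from a stricter to a weaker entry in the file, although the actual settings in each entry may work differently.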

    Posted in Tips

    Import Multi-Sequence Genbank Files into an Assembly Project for easy access to Features

    There are many genomes in the Genbank database that cannot be downloaded as single annotated sequences. These might be large multi-chromosome eukaryotic genomes but, increasingly, they are partially sequenced bacterial chromosomes where the major contigs have been annotated using the NCBI annotation pipeline. Typically, when you encounter these, there are options to download annotated versions as multi-sequence Genbank formatted files. MacVector has the option to open any file containing multiple sequences as either a Multiple Sequence Alignment document or as individual Sequence documents. This is not always optimal if you have more than a handful of sequences in the file. However, if you use MacVector with Assembler, you can import these sequences into a project using the Add Ref toolbar button. The individual sequences will then be displayed in the project window and, if you double-click on one, the complete annotated sequence will be opened.

    GenbankintoAssemblyProject

    This is a great way to view and/or sort collections of annotated sequences in Genbank format, something that cannot be done directly in the Apple Finder. Once a sequence is open, you can Export… it in another format if you wish.
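If you are curious what such a multi-sequence file contains before you import it, each record in a Genbank flat file begins with a LOCUS line and ends with a line containing only “//”. The Python sketch below lists the record names on that basis. It is a rough illustration with a hypothetical function name and a made-up, heavily abbreviated example, not a full Genbank parser.

```python
def list_genbank_records(text):
    """Return the name from each LOCUS line in a Genbank flat file."""
    names = []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            # The record name is the first field after the LOCUS keyword.
            names.append(fields[1] if len(fields) > 1 else "")
    return names


# Made-up, heavily abbreviated two-record example.
sample = (
    "LOCUS       contig_1   1200 bp    DNA\n"
    "ORIGIN\n"
    "//\n"
    "LOCUS       contig_2    800 bp    DNA\n"
    "ORIGIN\n"
    "//\n"
)
print(list_genbank_records(sample))
```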

    Posted in Tips

    Opening multiple sequences as alignments or individual sequences

    Many sequence formats contain multiple concatenated sequence entries. For example FASTA and Genbank are two formats capable of storing multiple individual sequences.
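As an illustration of what “multiple concatenated sequence entries” means in practice, a FASTA file simply starts each entry with a “>” header line followed by one or more lines of sequence. The Python sketch below splits such text into individual entries. The function name is hypothetical and the example sequences are made up.

```python
def parse_fasta(text):
    """Split FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up two-entry example; the first sequence is wrapped over two lines.
sample_fasta = ">seq1\nACGT\nAC\n>seq2 second entry\nTTGG\n"
for name, sequence in parse_fasta(sample_fasta):
    print(name, len(sequence))
```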

    By default, MacVector will treat such files as alignments and open them in the Multiple Sequence Alignment editor. Most users who want to open such a file do want to see an alignment. Additionally, if the default behaviour were to open as individual sequences, then accidentally clicking on a large alignment would result in many hundreds of individual sequence windows opening up on your desktop (do remember that holding down the OPTION key and clicking on the close button will close all open sequences).

    If you need to open such a sequence file as individual sequences, then there’s a simple option that you need to check in the FILE | OPEN dialog. This behaviour has not changed for quite some time. However, back in MacVector 13 the appearance of the dialog changed, due to a change in Apple’s current guidelines on file dialogs. Whereas the older dialog had an obvious way to see this dropdown menu, now all you see is a small OPTIONS button in the bottom left hand corner.

    NewImage

    To open multiple sequence files as individual files you need to check an option in the FILE | OPEN dialog.

  • Click FILE | OPEN
  • In the dialog click OPTIONS (bottom left corner)
  • Change OPEN MULTIPLE SEQUENCE FILE AS from AUTO to SINGLE SEQUENCES
  • Click OPEN

    Posted in Tips

    Restoring file associations when MacVector no longer opens your sequences

    Macs are pretty good at choosing the right application to open a document. For example, when you double-click on a .nucl document, it will open in MacVector. However, sometimes this file association breaks. Applications should coexist peacefully on a Mac, but sometimes a misbehaving app will corrupt these file associations and you will find that your sequence displays a generic document icon (or, worse, the icon of a different application!). When you double-click on the icon, it will no longer open in your favourite DNA sequence analysis tool!

    Luckily this is easily fixable:

    NewImage

    1. Select a .nucl file in the Finder
    2. Choose File | Get Info (or use command-I).
    3. In the “Open with” section, click on the popup menu and select MacVector
    4. Then click on the Change All… button to apply the change to all files.
    5. Repeat for all file types used by MacVector that are not opening correctly (e.g. .prot, .msan, .msap, .axml)

    Posted in Tips