Use Database | Auto-Annotate Sequence to annotate prokaryotic genomes

The continuing advances in Next Generation Sequencing have made it relatively inexpensive to sequence prokaryotic genomes. Many scientists are embarking on large projects to sequence multiple related genomes. These might be clinical isolates of the same species exhibiting different pathogenic properties, environmental isolates from different sites, or a study over time of the changes in microbial genomes from specific locations. Once you have your sequence, the definitive source of annotation is the NCBI Prokaryotic Annotation Pipeline. However, to have that run on your sequence, you must submit the sequence to the NCBI. This is not always ideal: perhaps you are still working on resolving repeat sequences for your genome, you don’t want to wait for it to be published, or you don’t want to go through the hassle of a formal submission for many variant sequences. MacVector to the rescue!

First, you need to download existing similar genome sequences – open the Database | Online Search for Keywords (Entrez) browser and search for [name of your organism] “+” “complete genome”. Assuming you are working with a reasonably common organism, you might find a few (or a lot of) hits. Select those you are interested in and click on the To Disk button, selecting a suitable target folder to save the downloaded genomes into.
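If you would rather script the same kind of search outside MacVector, the query can be expressed as an NCBI E-utilities ESearch URL. This is only a minimal sketch and not what MacVector sends internally: the helper name build_genome_search_url, the [Organism] and [Title] field restrictions and the retmax value are all assumptions chosen for illustration.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_genome_search_url(organism: str, max_hits: int = 50) -> str:
    """Build an ESearch URL for complete genomes of one organism.

    The field tags used in the term are an assumed way of phrasing
    "organism + complete genome"; adjust them to suit your search.
    """
    term = f'"{organism}"[Organism] AND "complete genome"[Title]'
    params = {"db": "nuccore", "term": term, "retmax": max_hits}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Only builds and prints the URL; no network request is made here.
    print(build_genome_search_url("Campylobacter jejuni"))
```

The sketch stops at building the URL so that it runs without a network connection; fetching and saving the hits is left to whichever client you prefer.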

Now take your unannotated genome sequence and invoke Database | Auto-Annotate Sequence. Select the folder containing your genomes as the target and press OK. On a 2.7 GHz laptop, scanning a 1.8 Mbp Campylobacter jejuni genome against 25 related C. jejuni genomes (average 5,000 features per genome) takes around 10 minutes with the default parameters, resulting in a fully annotated genome.

MacVector 17’s new “Genome Comparison” tool lets you directly compare the features of two related genomes based on DNA or (for CDS features) protein sequences and reports all of the identities, similarities, differences and missing features. Used on the auto-annotated C. jejuni genome from the example above, the tool confirmed that no features were missing compared to the NCBI annotated genome and that there were only minor differences in a few CDS features where mutations created or removed stop codons.

Posted in Techniques, Tips

Which DNA Matrix to use in Align To Folder?

The Database | Align To Folder function is a very useful tool to find and retrieve similar sequences from folders on your computer or on other local machines. Think of it as your own personal BLAST service. It can not only search individual sequences in any format MacVector can read (MacVector, Genbank, EMBL, ABI etc) but will also process collections of sequences in fasta or fastq format.

One important factor to consider in these searches is the DNA Scoring Matrix (.nmat) file to use. There are several included in the /Applications/MacVector/Scoring Matrices/ folder. The default file is DNA database matrix.nmat. This is ideal for identifying sequences that are not particularly closely related, such as the same gene from distant organisms or sequences matching a highly degenerate input sequence such as a reverse translation of a protein sequence.

However, one common use of Align To Folder is to identify and retrieve NGS reads from large collections in fasta or fastq formatted files. It is particularly useful for finding reads to help resolve repeat regions or close gaps between contigs. When running these types of alignments, it is preferable to use a matrix that is tuned to find reads with a higher identity stringency. The best scoring matrix for this is DNA identity with penalties matrix.nmat. Here are some examples, using a short query sequence, where the searches differ only in the scoring matrix.

Low-scoring alignments using DNA database matrix.nmat

Low-scoring alignments using DNA identity with penalties matrix.nmat

It can clearly be seen that the second example has true matching alignments that represent sections of reads that extend beyond the query fragment. All of the reads can safely be retrieved and used in additional assembly analyses to extend the query or help resolve repeats. However, the DNA database matrix example contains matches that have extensive regions with very poor similarity. These clearly do not represent reads that could be used to extend the query sequence.
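To see why a matrix with mismatch penalties is more stringent, here is a toy sketch. The match and mismatch values are hypothetical and are not the contents of the shipped .nmat files, and the scoring is a simple ungapped, position-by-position comparison rather than MacVector’s alignment algorithm.

```python
# Hypothetical scoring schemes, for illustration only.
PERMISSIVE = {"match": 2, "mismatch": 0}   # mismatches cost nothing
STRICT = {"match": 2, "mismatch": -6}      # mismatches are penalized


def score_ungapped(query: str, read: str, match: int, mismatch: int) -> int:
    """Score two equal-length sequences position by position."""
    if len(query) != len(read):
        raise ValueError("sequences must be the same length")
    return sum(match if q == r else mismatch for q, r in zip(query, read))


if __name__ == "__main__":
    query = "ACGTACGTACGTACGTACGT"
    true_read = "ACGTACGTACGTACGTACGT"   # identical to the query
    poor_read = "ACGTACGTACTTGCATAAGT"   # four mismatches in the second half

    for label, scheme in (("permissive", PERMISSIVE), ("strict", STRICT)):
        print(label,
              score_ungapped(query, true_read, **scheme),
              score_ungapped(query, poor_read, **scheme))
```

With these made-up values the identical read scores 40 under both schemes, while the read with four mismatches scores 32 without penalties but only 8 with them, so a score cutoff separates the two far more easily.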

For additional information about possible uses of Align to Folder, check out this blog post.

Posted in Tips

Gap closing and genome finishing tools in Align to Reference and Assembler.

Automated algorithms can only take you so far with genome assembly. The final steps involved in finishing a genome always need manual intervention. MacVector’s various assembly editors have many tools to help finish genome sequencing projects, for example closing gaps, extending reference sequences and even automatically circularizing contigs. If you select reads, then right-click (or use CTRL-left click), you will see a context-sensitive menu with the following tools:

  • Export Consensus with/without Gaps
  • Align Selected Reads
  • Delete Selected Reads
  • Reset (unalign) Selected Reads
  • Export Selected Reads as FASTA/FASTQ
  • Select Matching Pairs – if you have aligned a set of paired-end reads, you can select individual read(s) and use this function to select the corresponding mate(s). This is particularly useful if you want to find pairs that will extend a contig and export them for further analysis/assembly.
  • Extend Reference with Selected Read – This is active if you have selected a single read that hangs over either end of a Reference sequence. This will extend the Reference in the appropriate direction using the sequence of the read.
  • Circularize Consensus – This is enabled if it detects direct repeats at the ends of a contig, and even tells you the length of the repeat it found. It will circularize the consensus and create a new circular sequence window with the repeat appropriately deleted.
  • Select Overlapping Reads Containing Selected Sequence – This is enabled if you select a short region in a read. All overlapping reads that contain that selected sequence will be selected. For paired reads you can then use Select Matching Pairs to select their mate, then Export Selected Reads as FASTQ/FASTA to export them to a file.
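The idea behind Select Overlapping Reads Containing Selected Sequence followed by Export Selected Reads as FASTA can be sketched in a few lines. This is a rough illustration, not MacVector’s implementation: the function names are hypothetical, reads are held in a plain dictionary, and checking the reverse complement as well as the forward strand is an assumption.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def select_reads_containing(reads: dict[str, str], selection: str) -> list[str]:
    """Names of reads that contain the selection on either strand."""
    forward = selection.upper()
    reverse = reverse_complement(forward)
    return [
        name
        for name, seq in reads.items()
        if forward in seq.upper() or reverse in seq.upper()
    ]


def to_fasta(reads: dict[str, str], names: list[str]) -> str:
    """Format the named reads as FASTA text."""
    return "".join(f">{name}\n{reads[name]}\n" for name in names)


if __name__ == "__main__":
    reads = {
        "r1": "TTTACGGATCCAAA",   # contains the selection directly
        "r2": "CCGGATCCGTTT",     # contains its reverse complement
        "r3": "AAAAAAAAAAAA",     # no match
    }
    selected = select_reads_containing(reads, "ACGGATCC")
    print(to_fasta(reads, selected), end="")
```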

Not all tools are applicable or available in all editors, and some tools are only enabled when using paired-end reads. Here’s what’s available in each editor.

Align to Reference editor

  • Export Consensus with/without Gaps
  • Align Selected Reads
  • Delete Selected Reads
  • Reset (unalign) Selected Reads
  • Export Selected Reads as FASTA/FASTQ
  • Select Matching Pairs
  • Extend Reference with Selected Read
  • Select Overlapping Reads Containing Selected Sequence

Reference Contig editor

  • Export Consensus with/without Gaps
  • Export Selected Reads as FASTA/FASTQ
  • Select Matching Pairs
  • Select Overlapping Reads Containing Selected Sequence

De novo contig editor

  • Export Consensus with/without Gaps
  • Export Selected Reads as FASTA/FASTQ
  • Select Matching Pairs
  • Circularize Consensus

Read more about the various assembly tools in MacVector.

Simple DNA sequence assembly on a Mac with MacVector with Assembler.

MacVector has a software plugin called Assembler that integrates directly into the DNA sequence analysis toolkit and provides DNA sequence assembly functionality. Dealing with sequencing reads has never been easier.

MacVector includes no fewer than five different assemblers just a few mouse clicks away from your sequencing reads. Phrap assembles Sanger sequencing reads or existing contigs, while there are three separate NGS de novo assemblers: Velvet for short read datasets, Flye for Nanopore and PacBio long reads, and SPAdes for mixed assemblies. For reference assembly, Bowtie2 can map millions of sequencing reads against genomic reference sequences and is ideal for RNA-Seq gene expression analysis data too.

Assembler is tightly integrated into MacVector. It’s easy to bring sequencing reads into MacVector, and it’s just as easy to directly design primers for a contig, run BLAST searches on a contig, and much more, right from your desktop!

Posted in Tips

MacVector 17 Workshop at The Crick

Room: HR Training Room 01–2162. Floor: 1 
Date: 15 October 2019  From: 9:30 to 11:30

Now rescheduled – Date to be advised

Chris Lindley of MacVector, Inc. will be giving a training workshop for both novice and advanced users of MacVector at The Crick, reviewing both basic and advanced functions, in particular new tools introduced over the last few versions.

The format is very informal and participants are very much encouraged to direct the workshop towards areas of the most interest.

Laptops will be provided for users to work through examples and tutorials as they are demonstrated. Workbooks will also be provided for attendees to work through during the workshop and afterwards.

The intention is that all attendees will learn at least one new and useful tool or tip. The workshop is two hours, but Chris will be available in the room for further discussion until 13:00.

Please register for the workshop by emailing Chris (drop-ins on the day will be very welcome, but will not be guaranteed access to a laptop or a workbook).

See what MacVector can do for your lab.

Posted in Meetings

Migrating your Vector NTI sequence database to MacVector.

ThermoFisher (owners of Invitrogen) have announced that Vector NTI Express is nearing the end of its life, and Vector NTI Advance was terminated quite some time ago. If you are looking for an easy to use sequence analysis application, then look for one that is reliable and trusted. MacVector is easy to use, has a comprehensive set of tools and is definitely not going away! MacVector has been the tried and trusted sequence analysis application on the Mac for over 20 years. There are many thousands of happy molecular biologists using MacVector in labs all over the world. Don’t take our word for it, read what our users have to say. What’s more, opening Vector NTI files is straightforward.

If you are using Vector NTI Advance 11 or Vector NTI Express

  • Download the Mac or Windows Vector NTI Data Export Tool from ThermoFisher’s website.
  • Run this and migrate your entire sequence database to Genbank format. Note that if you are running macOS Catalina, then Apple’s increased security means you will need to right click, choose Open, then Open again.
  • To open a sequence, simply double-click the Genbank file and it will open directly in MacVector.
  • When you make changes and save, MacVector will automatically migrate the data into MacVector’s own NUCL format.
  • You can optionally batch process all your files into MacVector format using an AppleScript that is supplied within the MacVector application folder.

If you are using Vector NTI Advance 10 or earlier

  • Open MacVector
  • Select the Database | Vector NTI Import… menu item
  • Click on the Choose button to locate the Vector NTI database folder on your Mac.
  • MacVector will display a list of all of the sequences available in the database. There is a popup menu to toggle between Nucleic Acid and Protein sequences. The list can be sorted to more easily identify sequence(s) of interest.
  • Select one or more sequences and then click either the To Desktop button to open those sequences in MacVector, or To Disk to save the sequences in MacVector format to a folder on your hard drive.

Sequence annotation

    MacVector will read all of the standard features and annotations associated with each sequence. Graphical appearance information is discarded and the highly customizable MacVector graphical features are used instead. If you prefer your sequences to look different then it is easy to curate all graphical features using MacVector’s Auto Annotation tool.

    MacVector ignores any restriction enzyme sites annotated in the database sequence and replaces them with the default dynamic set of sites used by MacVector’s RE Picker. However, all sequence features and related information are preserved. MacVector follows the Genbank format for the features table and is always kept up to date with the latest Genbank release, so where possible MacVector will also migrate any old and deprecated feature information contained in the Vector NTI file into the current feature nomenclature.

    You will always get a good discount for upgrading to MacVector from Vector NTI.

    Your sequences remain yours

    The MacVector team strongly believe that you should never be locked out of your data. Your data is yours! Even if you have no MacVector license, you can download MacVector Free and export your data. All versions of MacVector, including MacVector Free, have the tools to migrate sequences in Genbank format.

    Posted in General, Tips

    Identifying transposon insertion sites from multiplexed NGS data

    Transposon mutagenesis is a common approach for investigating gene function in bacterial genomes by selecting for clones where insertion of the transposon into the genome has generated a specific phenotype. You can then simply sequence the entire genome of each clone by NGS to identify the transposon insertion site. To lower the cost of such experiments, it is common to pool several individual genomes into each NGS sample and then run appropriate sequence analysis to identify the genes disrupted by the transposition events.

    There is a new Transposon Insertion Analysis Tutorial that describes how to perform this analysis using MacVector with Assembler. To follow along, you can download sample data. The basic strategy is to use MacVector’s Align to Folder functionality to pull out all pairs of reads that contain transposon sequences then align those to the genome to identify the end points of the transposon insertion site.
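The basic strategy can be illustrated with a deliberately simplified sketch. It is not the tutorial’s workflow: exact string matching stands in for alignment, only the forward strand and one transposon end are considered, and the function name, parameters and flank length are hypothetical.

```python
def find_insertion_sites(
    genome: str, reads: list[str], tn_end: str, flank_len: int = 12
) -> list[int]:
    """Return sorted genome positions that follow a transposon end in any read.

    For each read containing tn_end, the flank_len bases immediately after it
    are taken as genomic flank and looked up in the genome by exact match.
    """
    sites = set()
    for read in reads:
        start = read.find(tn_end)
        if start == -1:
            continue  # read does not contain the transposon end
        flank_start = start + len(tn_end)
        flank = read[flank_start:flank_start + flank_len]
        if len(flank) < flank_len:
            continue  # not enough genomic sequence to place the read
        position = genome.find(flank)
        if position != -1:
            sites.add(position)
    return sorted(sites)


if __name__ == "__main__":
    genome = "ATGCCGTAAGCTTGGCATCGATCGGATTACAGCTAGCTAAGG"
    tn_end = "TGTTGTGTGGA"  # made-up transposon end sequence
    reads = [
        "CC" + tn_end + genome[20:34],  # junction read
        genome[5:30],                   # purely genomic read, ignored
    ]
    print(find_insertion_sites(genome, reads, tn_end))
```

A real dataset would need both strands, both transposon ends and tolerance for sequencing errors, which is why the tutorial relies on alignment rather than exact matching.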

    The tutorial goes into detail, describing several approaches you can use to identify the insertion locations, along with shortcuts and suggestions on how to rapidly annotate the insertion sites on the complete genome. While the tutorial does use MacVector with Assembler for parts of the analysis, you can actually accomplish the same end result using plain MacVector.

    Posted in Techniques, Tutorials

    Human Transcriptome RNA-Seq Analysis Using MacVector

    With MacVector Pro and Assembler you can use Bowtie to perform RNA-Seq analyses using NGS data. The interface even has specialized output tabs listing the coverage information and statistics for each annotated CDS and gene feature on the genome. You can download a short tutorial and a sample dataset that illustrate the analysis workflow using a small (1.6 Mbp) prokaryotic genome.

    What surprises many people is that the combination of MacVector and modest Macintosh hardware can actually perform this analysis on the human genome. There are limitations: it’s not currently practical to do this with the entire genome due to memory and processing constraints, but it is possible to run an analysis against the known Human Transcriptome. The latest version of this can be downloaded from the GENCODE database. There is a new RNA-Seq Human Transcriptome Analysis Tutorial that describes the basic procedure in detail and some sample data that can be downloaded. The end result is a table of results that can be copied and pasted into Microsoft Excel for additional analysis.

    Posted in Techniques, Tutorials

    Use a right-click in the Editor tab to see if your contig can be circularized

    MacVector incorporates no fewer than three different de novo assemblers: Phrap, Velvet and SPAdes. While all are great assemblers, each with its own specific advantages, none of them will generate a circular sequence from input reads. However, MacVector also includes a tool to help you with this. If you are assembling reads representing plasmid sequences, or if you are closing gaps in a circular genome, you can find out if a contig can be circularized by double-clicking on it in the Assembly Project and then right-clicking* in the Contig Editor to bring up a context-sensitive menu.

    The algorithm looks for a perfect overlap of at least 20 bases between the two ends of the contig. If no overlap exists, the menu item is greyed out and reads “Cannot Circularize Consensus”. Otherwise, it indicates the length of the overlap. If you select the menu item, a new sequence window opens containing the circularized consensus of the contig, with all gaps removed.
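The end-overlap check described above can be sketched as follows. This is an illustration under stated assumptions rather than MacVector’s code: the function names are hypothetical, gaps are assumed to be “-” characters, and reporting the longest perfect overlap (rather than, say, the first one found) is a guess.

```python
from typing import Optional


def find_end_overlap(consensus: str, min_overlap: int = 20) -> int:
    """Length of the longest perfect overlap between the two ends, or 0."""
    seq = consensus.replace("-", "").upper()  # ignore consensus gaps
    # Try the longest candidate first; stop at the minimum allowed length.
    for length in range(len(seq) // 2, min_overlap - 1, -1):
        if seq[:length] == seq[-length:]:
            return length
    return 0


def circularize(consensus: str, min_overlap: int = 20) -> Optional[str]:
    """Return the gap-free sequence with one copy of the end repeat removed.

    Returns None when the ends do not overlap by at least min_overlap bases.
    """
    overlap = find_end_overlap(consensus, min_overlap)
    if overlap == 0:
        return None
    seq = consensus.replace("-", "")
    return seq[:-overlap]


if __name__ == "__main__":
    repeat = "ACGTTGCAAGCTTGACCTGAAGT"   # 23 bases present at both ends
    unique = "GGGCCCATATATCGCGTTAACC"
    contig = repeat + unique + repeat
    print(find_end_overlap(contig))
    print(circularize(contig))
```

Dropping one copy of the repeat from the end means the returned linear string, read as a circle, contains the repeat exactly once.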

    *To right click with a trackpad hold down [CTRL] and click once or tap with two fingers. MacVector has many “right click” menus with extra functionality.

    Not sure if you have Assembler? Choose MacVector | About MacVector. If the screen that appears says “MacVector with Assembler, Pro Edition” then you have it. If not, you can sign up for a fully functional 21 day trial version.

    Posted in Techniques, Tips

    Import Multi-Sequence Genbank Files into an Assembly Project for easy access to Features

    There are many genomes in the Genbank database that cannot be downloaded as single annotated sequences. These might be large multi-chromosome eukaryotic genomes but, increasingly, they are partially sequenced bacterial chromosomes where the major contigs have been annotated using the NCBI annotation pipeline. Typically, when you encounter these, there are options to download them as annotated multi-sequence Genbank formatted files. MacVector has the option to open any file containing multiple sequences as either a Multiple Sequence Alignment document or as individual Sequence documents. This is not always optimal if you have more than a handful of sequences in the file. However, if you use MacVector with Assembler, you can import these sequences into a project using the Add Ref toolbar button. The individual sequences will then be displayed in the project window and, if you double-click on one, the complete annotated sequence will be opened.

    This is a great way to view and/or sort collections of annotated sequences in Genbank format, something that cannot be done directly through the Apple Finder. Once opened, you can Export… any sequence in another format if you wish.

    Posted in Techniques, Tips

    Opening Genbank or FASTA files with multiple sequences as individual sequences

    Many sequence formats contain multiple concatenated sequence entries. For example FASTA and Genbank are two formats capable of storing multiple individual sequences.

    By default MacVector will treat such sequences as alignments and open them in the Multiple Sequence Alignment editor. Most users who want to open such a file do want to see an alignment. Additionally, if the default behaviour were to open as individual sequences, then accidentally clicking on a large alignment would result in many hundreds of individual sequence windows opening up on your desktop (do remember that holding down the OPTION key and clicking on the close button will close all open sequences).

    If you need to open such a sequence file as individual sequences, then there’s a simple option that you need to check in the FILE | OPEN dialog. This behaviour has not changed for quite some time. However, a few versions back the appearance of the dialog changed, due to a change in Apple’s guidelines on file dialogs. Whereas the older dialog had an obvious way to see this dropdown menu, now all you see is a small OPTIONS button in the bottom left hand corner.

    To open multiple sequence files as individual files you need to check an option in the FILE | OPEN dialog.

  • Click FILE | OPEN
  • In the dialog click OPTIONS (bottom left corner)
  • Change OPEN MULTIPLE SEQUENCE FILE AS from AUTO to SINGLE SEQUENCES
  • Click OPEN
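To make the idea of “multiple concatenated sequence entries” concrete, here is a small sketch that splits multi-sequence FASTA text into individual records outside MacVector. The function name split_fasta is hypothetical, and the parser handles only plain FASTA (header lines starting with “>”), not Genbank.

```python
def split_fasta(text: str) -> list[tuple[str, str]]:
    """Split FASTA text into (header, sequence) pairs, one per entry."""
    records: list[tuple[str, str]] = []
    header = None
    chunks: list[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines between entries
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">seq1 first entry\nACGT\nTTGA\n\n>seq2\nGGCC\n"
    for name, sequence in split_fasta(example):
        print(name, len(sequence))
```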

    Posted in Tips