Auto Annotation in MacVector 11

Have you ever got a plasmid back from the sequencing facility as a bare sequence with no annotations? Or downloaded a vector from the vendor’s site, only to find it is available only in FASTA format with no features? Maybe your collaborators send you poorly annotated sequences, or your lab-mate uses MacVector but insists on annotating sequences with a tiny unreadable font or garish colors. What you need is a quick and easy way to annotate the sequence, or to change the feature appearance so it looks just like YOU want it. That’s exactly what we added to MacVector 11.

Here’s the idea: over time you build up a collection of plasmids and sequence fragments of the genes and vectors you work with the most. Perhaps you always like your favorite gene to appear as a striped red box. Now, when you get a new sequence, just run the auto annotation algorithm (Database | Auto Annotate Sequence) and point it at a folder containing your annotated sequences. The algorithm not only finds the matching features and copies them onto your bare sequence, but it also copies the graphic appearance (symbol) information. Let’s look at an example.

MacVector 11 comes with a large set of pre-annotated vectors. You can find them in the /Applications/MacVector 11/Common Vectors/ folder. We’ve also included an /Annotated Fragments/ folder here with a starter set of genes and replication origins you’ll find on many cloning vectors. Here’s a composite graphic image of a selection of those fragments.

SampleFragments(shrunken)

There is a plain text copy of pBR322 in /MacVector 11/Tutorial Files/AutoAnnotation/pBR322Ascii.txt. If you open this file in MacVector and toggle its topology to linear, you’ll see there are no features assigned to the plasmid.

pBR322 before Auto Annotation

The next step is to invoke Database | Auto Annotate Sequence, then click on the Choose… button to select the /MacVector 11/Common Vectors/Annotated Fragments/ folder. Finally, click on the OK button and the algorithm will search through all of the files in the folder looking for matching features. When complete, a report is displayed – when you close that, you’ll see the newly annotated sequence.

pBR322Annotated

In this case, pBR322 has picked up the tetracycline and ampicillin resistance CDS features, along with the rop gene and replication origin.

Prefer a different way of graphically displaying the features? Try repeating the analysis, but selecting the /MacVector 11/Common Vectors/NEB/ folder – this contains a selection of vectors available from New England Biolabs, formatted to match the appearance in their catalog.


pBR322Reannotated

This time, when the algorithm completes, the features take on the typical appearance seen in the catalog. Note that the CDS features have not been duplicated – MacVector realizes the features already exist and just replaces the graphic symbols. You can also optionally set the algorithm to ignore duplicate features completely, in which case the sequence appearance would have been left unchanged.

You can use the Auto Annotation function to scan any folder containing DNA sequences. They don’t have to be in MacVector format, although features from GenBank or EMBL files will be given the default appearance for the feature type. There is a certain amount of fuzziness built into the algorithm – it can handle mismatches and even a few gaps and still identify matching features. We’ll be posting a more detailed tutorial in the next week with more information about the different parameters and limitations of the algorithm. In the meantime, take it for a spin and build up a collection of curated sequences containing all your favorite genes formatted for that great visual impact in your presentations.

Posted in Algorithms, Tutorials | Comments closed

Topology of a sequence and interface changes

With each new release of MacVector we have always strived to be the easiest to use sequence analysis application. We have made some fairly radical changes to the interface over the last few years, and inevitably there are some changes that you know will confuse long time users but are needed to make the interface more consistent and logical.

One of the major changes was in MacVector 10. Prior to this release, single sequences were displayed in the sequence editor, with static views of the graphical map and feature table available from this single window. The actual sequence was only represented in the editor. In MacVector 10 this changed to a tabbed window with multiple dynamic views representing the same sequence model underneath. With this more flexible approach, regardless of which view you are looking at or modifying, you are acting on the sequence directly. Furthermore, because of this single sequence model all windows are linked, and you can view multiple windows (we call them replicas) of the same sequence. For example, suppose you have three replicas open, one showing the sequence, another the feature table and the third the map view. If you select a gene feature in the Feature table, the sequence representing that gene is highlighted in the sequence view, and the feature is also highlighted in the Map window. This is great for visualising your sequences, especially large ones.

However, this major change has left quite a few legacy functions that do not fit the new model so well. One such function is the linear/circular button in the floating graphics palette. This button has always been a visual function, and it does not affect the real topology of the sequence underneath. There are times when it is nice to display a circular sequence in linear mode, for example when you want to zoom in on a crowded region, or when looking at the MCS of a plasmid whilst cloning fragments.

This made complete sense with the old sequence model of MacVector. However, with the new unified tabbed windows, viewing a linear sequence as a circular map gives the incorrect impression that you are working with a circular sequence, which means that some algorithms would give unexpected results. A classic example is the cloning vector pBR322, which has an EcoRI site that crosses the origin. As a plasmid, the sequence’s topology should be circular. However, if you have toggled the topology button in the sequence window, the sequence is now linear and the EcoRI site disappears. Prior to MacVector 11, if you then viewed this as a circular plasmid, the restriction site was still absent, yet the molecule appeared to be circular. That’s going to produce a confusing set of fragments on a gel!

pBR322_Topology.png
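To make the origin-crossing problem concrete, here is a minimal Python sketch of a topology-aware site search. It is only an illustration and not MacVector’s implementation: the function name `find_sites` and the toy sequence are made up for this example, and the toy sequence is not pBR322.

```python
def find_sites(seq, site, circular=False):
    """Return 0-based start positions of `site` in `seq`.

    For a circular molecule the first len(site) - 1 bases are appended
    to the search text, so a site that spans the origin is also found.
    """
    seq = seq.upper()
    site = site.upper()
    # Only wrap around the origin when the topology is circular.
    text = seq + seq[: len(site) - 1] if circular else seq
    positions = []
    start = text.find(site)
    while start != -1:
        positions.append(start)
        start = text.find(site, start + 1)
    return positions


# Hypothetical toy sequence: GAATTC is split across the two ends.
toy = "TTCAAAGGGCCCAAAGAA"
linear_hits = find_sites(toy, "GAATTC")
circular_hits = find_sites(toy, "GAATTC", circular=True)
print("linear:", linear_hits, "circular:", circular_hits)
```

Treated as linear, the toy sequence has no EcoRI site at all; treated as circular, the site that starts three bases before the end is found. That is the same effect as the disappearing pBR322 site described above.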

We thought that it was safer to change this behaviour in MacVector 11, so now, although you can view a circular molecule as linear, you cannot view a linear molecule as a circular one. We could not think of any user need for viewing a linear sequence as circular, although there are many advantages to viewing a circular sequence as linear, e.g. the aforementioned example of viewing the MCS of a plasmid. So now, whenever you view a linear sequence, the floating graphics palette linear/circular button is disabled to prevent you from inadvertently changing the Map and *thinking* you are dealing with a circular molecule. It’s a fairly small change, but unfortunately it has confused quite a few users! We do think that it is safer to have made this change. Displaying misleading results is something that we definitely do not want to do!

Currently, only the restriction enzyme, subsequence searching and (most) nucleic acid toolbox plots consider linearity/circularity in their algorithms. However, we plan on adding much better support for circular sequences in MacVector 11.5, so we thought it important to clean up this ambiguous linear/circular issue in this release. To this end, you will also see that the Topology button has now been added to the default toolbar button set of all the sequence window tabs to help indicate it is an important property of the sequence.

Posted in Algorithms, Development, Releases | Comments closed

MacVector 11.0 and Mac OS X 10.6 Snow Leopard

MacVector 11.0 and Mac OS X 10.6 Snow Leopard (Uncia uncia or Panthera uncia)

On August 28th Apple Computer, Inc. publicly released Mac OS X 10.6, also known as Snow Leopard. With any new release of the operating system (OS) there is always the question of whether the applications you own will continue to work. Often companies will release a new version of their application that fixes any issues that it might have with the new OS. At MacVector, Inc. we work to make sure that our applications already support the new OS releases. MacVector 11.0 and MacVector Assembler 11.0 have been tested with Snow Leopard and any necessary fixes have already been made to keep our applications running on the latest OS from Apple Computer, Inc.

As ardent Mac fans we’re as excited as ever about this new OS, and even though we’re running development builds already, we’ll be first in line on Friday to purchase our shrink wrapped copies!

Talking of releases don’t forget if you are still running an old copy of MacVector you can upgrade at a highly discounted price here.

Posted in Development, General, Releases | Comments closed

What’s new in MacVector 11!

Our next release, MacVector 11, will be out this week. Continuing our policy of continuous improvement, there are a lot of under-the-hood improvements to make MacVector both faster and easier to use. However, MacVector 11 has a lot of other changes and some really useful new features that make this one of our best releases yet!

Here’s a short overview of these:

Auto Annotation.

How often have you received or downloaded a vector or other DNA sequence that has no annotations? The new auto annotation function in MacVector 11 will scan a folder full of existing annotated sequences and automatically add matching features to the bare sequence. Not only does the algorithm add the features, but it also copies the appearance information of the matching features so you can be assured that e.g. an ampicillin resistance gene is always a blue arrow, an M13 origin is always a striped red box etc.

Click Cloning Enhancements

In MacVector 11 you can manipulate the ends of restriction fragments before joining them together. A new interface lets you cut back or fill in each end of the source or target molecule before ligation.

Floating Analysis Toolbar.

For those users who prefer to click on toolbar buttons to initiate analyses, MacVector 11 introduces a new floating toolbar window containing buttons for all common MacVector analysis functions. You can customize the toolbar to show just the functions you use most often or show them all for rapid access to every available algorithm. There’s also a bunch of new toolbar buttons to add to sequence windows. These allow you to put your own most commonly used buttons, specific to both nucleic acid and amino acid windows, on the toolbar, giving you easier access to functions.

Sequence Editor Changes

The primary sequence editor has been rewritten using modern OS X code to better handle long sequences and to provide a more modern look and feel. In addition, you can now display 3 or 6 frame translations below the sequence.
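If you have not met 3 or 6 frame translations before, the Python sketch below shows what is being displayed: the sequence is translated from each of the three possible starting offsets on the forward strand, and again on the reverse complement. This is an illustration only, not MacVector’s code. It assumes the standard genetic code, and the frame labels (+1 to +3, -1 to -3) are just one common convention; function names are made up for the example.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(seq):
    """Return a dict mapping frame label to its translation."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frame("ATGGCCTAA")
for label in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(label, frames[label])
```

A 3 frame display is simply the three forward entries of the same dictionary.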

Next Generation Sequencing

The optional add-on Assembler module has been enhanced to provide support for next generation sequencing machines. Short read data may be imported in Fastq format and assembled using the latest version of phrap.

Miscellaneous Enhancements.

As always, we add a slew of minor enhancements to each release of MacVector designed to improve workflows, speed up processing or provide better integration with the operating system. Look for the ability to change the colors of chromatogram file displays for red/green color blind users, improved import of sequences from Vector NTI and better handling of genomic sequences.

Incidentally if you are still using an old dongle controlled version of MacVector you may be interested in an offer we have to upgrade!

Posted in Releases | Comments closed

Keeping a Happy Mac!

Macs generally just work. That’s not just a marketism, but something that every Mac user enjoys! When compared with some other operating systems (and as a Linux user as well as Mac I don’t just mean Windows!) OS X is a fairly robust system when it comes to keeping it running. However, like more or less anything to do with software it’s wise to set aside some regular time for maintenance.

We use and really like a very good utility called OnyX. It’s perfect for a quick once-a-month blast through all the usual disc check, permissions repair and cache clearing tasks!

Of course you could always wait until you have problems before sorting them out. However, as well as for help in performing regular maintenance tasks, we’ve found that it is a great help in cleaning up a multitude of problems. So much that it’s the first tool we generally use when something goes wrong. For some of the more obscure issues we’ve come across in Support we’ve had a lot of successes fixing problems with OnyX by running the regular tasks. We like it a lot.

OnyX is a free utility you can download from:

http://www.titanium.free.fr/pgs/english.html

The most useful functions are:

a) In the “Maintenance” section, execute the “Permissions” option.

b) In the “Cleaning” section, execute cleaning of all the “Caches”.

I, and most of the rest of the MacVector team, generally run through these once a month or so on our Macs. Be aware that upon startup OnyX generally runs a disc check. You might as well go grab a coffee whilst these are being run, as even your favourite sequence analysis tool may struggle to perform!

Posted in General | Comments closed

Upgrade your old dongle to MacVector 11!

MacVector 11 will be released during August, 2009. This release has been designed to be the easiest version of MacVector to use yet. We’ve added many interface enhancements, such as a floating analysis toolbar. Also creating plasmid constructs has never been easier with point and click vector construction with Click Cloning. We’ve not forgotten to add new features either with 3/6 frame translation direct in the Sequence Editor, an auto annotation editor for help in maintaining your vector library, and next generation sequencing data support for Assembler.

To celebrate this release, for the months of July and August we are offering a one-time-only discounted price for upgrading any old USB (or older!) dongles that you may have to a full standard license of MacVector. Please contact us and send in your old dongle to take advantage of an upgrade price of $499 for MacVector 11*.

MacVector has always been an easy application to pick up and use straightaway, but over the last few releases we have really made it the easiest tool for molecular biologists to use in the lab. There is no need to waste time learning how to use an application: just open it up, download your sequences straight from the NCBI’s Entrez website (within MacVector), design your primers, BLAST your sequence, and clone straight into a vector. If you are not convinced, sign up for a free 21 day trial here. This will install alongside your existing copy of MacVector, and your files will not be affected.

http://www.macvector.com/try.html

If you do not have an existing dongle to upgrade, have an old network license, or you are not an academic, then contact sales@macvector.com for great pricing.

*Please note that this is a download only offer.

Posted in Releases | Comments closed

Free MacVector Teaching licenses available

Did you know that you are eligible for teaching licenses as long as you hold an active, up-to-date license? Recently, we had a whole class of students download our 30 day trial on the same day. Maybe the instructor did not hold a current license of MacVector, but if they did, they could have contacted us and gotten unlimited access to the program for the length of the course. All we need is the course name and number, the start and stop dates of the course, and the approximate number of machines MacVector will be loaded on to. MacVector is an excellent and very practical tool that can be used in teaching a Molecular Biology course. So if you hold a current MacVector license and are in the process of developing lesson plans for the courses you will be teaching next semester, contact sales@macvector.com for more details.

Posted in Uncategorized | Comments closed

Next Generation sequencing formats

As is common with emerging technologies, there is a lack of standards, and there are many different and competing file formats for storing short read or next generation sequencing data. All these formats try to solve the same problem of storing an almost unprecedented amount of sequence data in a usable and complete format. However, one emerging format that seems very appropriate for this type of data is Fastq.

Phrap was one of the first real high throughput assemblers that could also deal with quality scores (generated by its stablemate Phred). The general input files to Phrap are a single Fasta file containing the reads, and an associated Qual file that contains quality scores for each and every read in the Fasta file. I’m not sure what it is about bioinformaticians, but they always feel the need to add Yet Another Format rather than reuse one of the many decent formats. However, in this case Fastq is a logical progression of the Fasta + Qual format in that the two individual files are now merged. That is, each read comprises four lines: a label header, a sequence, a second label header and a quality score line.

Here’s an example of such a file. Note that this example has a single ‘+’ character to indicate the quality label line, rather than a duplicate of the label.

@ERR000955.3982 IL6_1091:3:1:210:502/1
TCCAAACACACTTTGTGTAGAATCTGCAAGTGGAGAT
+
>>>>>>>>>>>>>>>>>>>;>>>>><>>>;>>;><>>
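As a rough illustration of how simple the four-line layout is to read, here is a minimal Python sketch of a Fastq reader. It is not how MacVector or phrap read these files. It assumes every record is exactly four lines (sequence and quality are not wrapped), the function name `read_fastq` is made up, and the two records are invented toy data: one uses a bare ‘+’ and the other repeats the label.

```python
import io


def read_fastq(handle):
    """Yield (label, sequence, quality) tuples from four-line Fastq records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        # Check line position rather than scanning for '@', because '@'
        # is also a legal character inside a quality line.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line Fastq record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], sequence, quality


# Invented toy records, not real sequencing data.
example = io.StringIO(
    "@read1 demo\n"
    "ACGTAC\n"
    "+\n"
    "IIIIII\n"
    "@read2 demo\n"
    "GGCC\n"
    "+read2 demo\n"
    "5555\n"
)
records = list(read_fastq(example))
for label, sequence, quality in records:
    print(label, sequence, quality)
```

Reading strictly four lines at a time is a deliberate choice here: a quality line may itself begin with ‘@’, so looking for ‘@’ at the start of a line is not a safe way to find the next record.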

Many of the raw file formats (SFF, SRF etc.) are big! They contain the raw image files as well as basecalled sequence and quality data. Fastq files are at the complete opposite end. They are small, and only contain minimal data. They may contain millions of reads, yet are still in the less than a tenth of a gigabyte range for a single run’s data. It is worth noting that the various Short Read Archives (NCBI, EBI etc.) require the submission of original raw image files, but only allow the reads to be downloaded in Fastq format.

The quality line must comprise the same number of characters as bases, i.e. one quality character per base. However, most quality scores are double digits. Fastq gets around this by using a single ASCII character to encode each quality score. However, here’s where consistency fails. The quality line may be one of three different types. Sanger format encodes a Phred quality score with an offset of 33; raw reads typically score 0 – 60 (ASCII characters 33 to 93), although the format allows characters up to ASCII 126. The latest Illumina 1.3 format also contains a Phred quality score, typically from 0 to 40, but this time encoded using ASCII 64 to 104. Finally, the older Illumina (née Solexa) 1.0 format has its own Solexa/Illumina quality score from -5 to 40, encoded using ASCII 59 to 104. Of course this does pose problems: unless you know which encoding was used, there is no way of telling which it is without guesswork.
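The Python sketch below shows what decoding looks like for the three variants, and why guessing is unreliable. It is an illustration under stated assumptions, not MacVector’s code: the variant names and function names are made up, and `possible_variants` only uses the lowest character seen, so it can narrow the choice but often cannot settle it. The Solexa to Phred conversion uses the standard relationship between the two scales.

```python
import math

# ASCII offsets implied by the ranges above (Solexa -5 maps to ASCII 59).
OFFSETS = {"sanger": 33, "illumina1.3": 64, "solexa": 64}


def decode_quality(quality, variant="sanger"):
    """Convert a Fastq quality string into a list of integer scores."""
    offset = OFFSETS[variant]
    return [ord(ch) - offset for ch in quality]


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent Phred score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def possible_variants(quality_lines):
    """List the encodings consistent with the lowest character seen."""
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    candidates = ["sanger"]  # offset 33 can represent every legal character
    if lowest >= 59:
        candidates.append("solexa")  # Solexa scores start at -5 (ASCII 59)
    if lowest >= 64:
        candidates.append("illumina1.3")  # Phred 0 with offset 64
    return candidates


print(decode_quality("I5!", "sanger"))
print(decode_quality("h@", "illumina1.3"))
print(possible_variants([">;<"]))
```

A read whose characters all fall at or above ASCII 64 is consistent with all three encodings, which is exactly the guesswork problem: only a file containing low-valued characters can be pinned down from the data alone.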

There are also other issues with this format. It could be said that the label line for the quality score line is redundant, and the filesize could be reduced by 25% if this was removed. Some applications do generate and accept fastq files that have a single ‘@’ or ‘+’ in place of the quality label line.

It would be helpful to see a tightening of this format and indeed there is a fastq2 format that does not have these weaknesses.

With the next release of MacVector and Assembler during the Summer we will be adding support for the Fastq format. We will also be adding support for de novo assembly of short read data. This release is currently in internal beta testing, and will be out for a public beta trial soon.

Posted in Development, General | Comments closed

Alignments in MacVector

Update 19 August 2013: We’ve added support for Muscle and T-Coffee to the MSA editor

We get a lot of comments and questions from users on the various alignment functions in MacVector. They say there’s more than one way to skin a cat (not that I’ve done that – I have skinned a catfish, but I only know one way), and that’s certainly true for alignments in MacVector. Each function is designed for a different purpose. First, let’s just list the functions:

  • ClustalW – we also call this the “standard” Multiple Sequence Alignment (MSA)
  • Align to Reference
  • Pustell Matrix (also known as a Dot Plot)
  • Internet BLAST
  • Align to Folder
  • Contig Assembly

ClustalW/Multiple Sequence Alignment (MSA)

If you have two or more related sequences (DNA or Protein) and you want to examine the relationship between them, use this function. Choose File->New->Protein Alignment (or File->New->Nucleic Acid Alignment) to create an empty MSA window. Add sequences to the alignment by using the Edit->Add Sequences from File menu item, then click on the Align toolbar button to automatically align the sequences using ClustalW. Click on the Prefs toolbar button to control the appearance and behavior of the data in each of the tabs that represent different views or analyses of the alignment. This functionality is most suited for protein alignments, or for nucleic acid sequences where you are interested in examining phylogenetic relationships. If you wish to compare two or more DNA sequences, you should definitely consider whether one of the other alignment functions may be more suitable.

Align to Reference

The Align to Reference Editor window

Use this if you have a reference sequence and you want to align one or more DNA sequences against it. A typical use would be in resequencing e.g. sequencing a cloned PCR fragment to check no errors were introduced, sequencing across end junctions, scanning for successful mutagenesis clones etc. In each case, open the file that represents the parent or “reference” sequence, then choose Analyze->Align to Reference. In the window that opens, click on the “+” button to add sequences from disk – these can be in any format that MacVector can read – typically ABI or SCF chromatogram files, but you can add plain sequences as well. When you click on the Align button, choose the Sequence Confirmation algorithm – this is tuned to expect the small insertions/deletions you would expect in raw chromatogram files. Compared to ClustalW, Align to Reference has the advantage that it will automatically “flip” sequences to guarantee optimal alignment.

Align to Reference can also be used to align cDNA clones against a genome sequence. The steps are similar – use the genomic sequence as the reference, then add one or more cDNA clones to the alignment. Again, these can be chromatogram files. Now choose the cDNA Alignment algorithm when you Align – this is tuned to expect large insertions representing the intron regions.

Pustell Matrix

Repetitive sequence elements identified using a dot plot

This “Dot Plot” function is great for identifying weak regions of similarity between two sequences. It is not designed to show full-length alignments between two sequences, but instead shows shorter segments of direct or inverted similarity. You can use this to identify shorter regions of similarity, then copy those sections to new sequence windows for more in depth analysis using ClustalW or Align to Reference. Dot Plots are also the best way of identifying sequence rearrangements – the display clearly shows insertions and deletions (the main diagonal will be broken and have an offset) and even inversions (the inverted diagonal will run bottom left to top right and be colored blue). Finally you can use it to identify repetitive regions which appear as parallel diagonals offset from the main diagonal. Pustell Matrix can be used not only to compare DNA:DNA and Protein:Protein, but you can also use it to compare DNA:Protein where the algorithm will translate the DNA in all 6 frames before aligning to the protein.
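To show how repeats turn into off-diagonal hits, here is a minimal sliding-window dot plot in Python. It is a sketch of the general idea only: it is not the Pustell algorithm as implemented in MacVector, it uses simple identity counting rather than a scoring matrix, and the function names, window size and toy sequence are all made up for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def dot_plot(seq_a, seq_b, window=7, min_matches=7):
    """Return the set of (i, j) positions where the windows starting at
    seq_a[i] and seq_b[j] share at least `min_matches` identical bases."""
    hits = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= min_matches:
                hits.add((i, j))
    return hits


# Toy sequence containing a direct repeat of GATTACA.
toy = "GATTACAGATTACA"
direct_hits = dot_plot(toy, toy)
off_diagonal = {(i, j) for (i, j) in direct_hits if i != j}
print(sorted(off_diagonal))

# Inverted similarity: compare against the reverse complement instead.
inverted_hits = dot_plot(toy, reverse_complement(toy))
print(sorted(inverted_hits))
```

Comparing the toy sequence against itself gives the main diagonal plus two hits offset by seven bases, which is the “parallel diagonal” signature of a repeat. Lowering `min_matches` below `window` allows mismatches, at the cost of more background dots.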

Internet BLAST

Use this to identify and align a test sequence to the databases at the NCBI using the popular BLAST algorithm. One slightly hidden function in MacVector is that you can select sequences in the “hitlist” and then choose Database->Retrieve to Disk or Database->Retrieve to Desktop to download the matching sequences from the NCBI. You don’t even need to select the entire line – just select part of a line and use the Retrieve menu item.

Align to Folder

This allows you to scan a local folder full of sequences (in any format MacVector can recognize) and align them using the FastA alignment algorithm. It’s kind of like a local BLAST, but more sensitive. Like the Pustell Matrix, you can choose to search DNA with Protein and vice versa. Many users like this function because the text alignment output also shows the features in the test sequence. This can be very useful for demonstrating the differences between your sequence and other sequences for patent purposes.

Contig Assembly

This requires our optional Assembler add-on. Use this if you want to align two or more DNA sequences with the idea of assembling them into a longer sequence with a consensus. It is primarily designed for de novo sequencing, where you have no reference or scaffold sequence to align the individual sequences to. The MacVector implementation uses the popular phred, phrap and cross_match algorithms from the University of Washington, which use quality values for improved accuracy of assembly. While you can use this for resequencing, you should consider whether the Align to Reference function might be a better choice.

Tutorials

There are tutorials for Sequence Confirmation and Contig Assembly in the Documentation folder of your MacVector installation. You can also download copies from our website.

So there we have at least five ways to align sequences using MacVector. Now if I can just find another 4 ways of skinning a catfish (or even just ONE that’s easier than my current method) then I’ll be all set.

Posted in Algorithms, Tutorials | Comments closed

Philadelphia and the ASM Convention

We had a great time at the ASM Convention several weeks ago. Thanks to all of the MacVector users who stopped by the booth. We enjoyed meeting you, showing you new features in MacVector, and gaining your feedback that will help us plan for the future. We also met a lot of people who hadn’t heard of us and got a lot of positive feedback.

There were 2 questions that we were asked over and over.

The first was “Do you have a MacVector for PC?” The answer is “not yet”. Our plan in the 2 1/2 years that our company has been developing MacVector has been to first make it the best Macintosh desktop sequence analysis solution on the market, then port it to Windows. I’m happy to say that we are almost there. We hope to start developing a ‘MacVector for Windows’ soon.

The other question asked frequently was “Can MacVector support Next Generation Sequencing?” The answer is that it will in our next version, MacVector 11.0, due out later this summer. Our Assembler module (optional add-on) has been enhanced to provide support for next-gen sequencing machines, reading FASTQ files and assembling them using the latest version of phrap.

I also thought Philadelphia was a great city. I saw the Liberty Bell. It was a 6 block walk from the Convention Center. We found a really cool old pub, too: McGillins Olde Ale House. It had a great selection of local brews, as well as cheap pitchers of PBR! There is also a great city market right across the street from the Convention Center where I found the best cheesesteaks. I went there for lunch every day!

Here’s a photo of Kevin K. in our booth at the ASM meeting:

Kevin at our booth at the ASM

Posted in General | Comments closed