General musings from the MacVector team about sequence analysis, molecular biology, the Mac in general and of course your favorite sequence analysis app for the Mac!

Importing features from a Genome Browser

One of the new features in MacVector 12.6 is the ability to annotate sequences based on the features stored in the GFF/BED/GTF files that many Genome Browsers can export. MacVector 12.6 will annotate an empty or already annotated sequence with the features stored within these files.

BED, GFF, GTF, and GFF3 formats

GFF, GTF, GFF3 and BED are all file formats used to store annotation (features), generally without containing any sequence, although they are commonly accompanied by a FASTA file containing the sequence only. They emerged as a way of exporting, or exchanging, information from a specified region of a genome without having to take the entire genome.

Most sequence formats were developed to describe a specific gene or protein. Although this is no longer true, they are still oriented towards a region of fixed length. These annotation files are not at all length specific and could potentially store just two features at either end of the same chromosome. They are a much more flexible way of dealing with annotation, especially a large amount of it, than a fixed length sequence format such as Genbank.

They are also not limited to a single sequence and can contain information from multiple sequences in the same file (FASTA files can also contain multiple sequences). For example you could store the entire human set of chromosomes in a pair of (quite large!) files: a multiple sequence FASTA file and a single GFF file.

The format of these annotation files does vary (whoever said Bioinformaticians had to be consistent!) but basically each file consists of a set of individual lines (one line per feature) along the following lines:

SEQUENCE ID, START, STOP, FEATURE TYPE, NOTE

    SEQUENCE ID is the sequence these annotations belong to.
    START and STOP define the region of the sequence that the feature is annotated against.
    FEATURE TYPE is obvious! Note that this does not always correspond to a valid Genbank Feature Keyword.
    NOTE holds any descriptive text attached to the feature.
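To make the layout above more concrete, here is a minimal Python sketch that reads one tab-separated GFF3-style line into a record. It assumes the usual nine GFF3 columns (seqid, source, type, start, end, score, strand, phase, attributes). The `Feature` class, the `parse_gff3_line` function and the example line are invented here for illustration and have nothing to do with how MacVector itself reads these files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    seq_id: str
    feature_type: str
    start: int   # 1-based, inclusive (GFF convention)
    stop: int    # 1-based, inclusive
    strand: str
    note: str    # raw attributes column, kept as free text


def parse_gff3_line(line: str) -> Feature:
    """Parse one GFF3 feature line made of nine tab-separated columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    seq_id, _source, ftype, start, stop, _score, strand, _phase, attrs = cols
    return Feature(seq_id, ftype, int(start), int(stop), strand, attrs)


# Hypothetical example line, not taken from a real download.
example = "chrX\texample\tCDS\t916000\t916100\t.\t+\t0\tID=cds1;Note=illustrative"
feature = parse_gff3_line(example)
print(feature.seq_id, feature.feature_type, feature.start, feature.stop)
```

One detail worth remembering if you ever compare files by eye: BED start positions are 0-based with an exclusive end, whereas GFF, GTF and GFF3 positions are 1-based and inclusive, so a BED start is one less than the matching GFF start.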

Genome Browsers

These tools (generally online web gateways) allow you to browse the entire chromosome or genome of a particular organism, almost like a graphical model of a sequence database. All the information known about that particular organism’s sequence that has been submitted to one of the large sequence databases (e.g. Genbank at the NCBI) should be visualised within the genome browser. You can download all the annotation contained within a particular region fairly easily using one of these annotation formats. Then you can either annotate an existing file that you are working with (so combining your own “private” annotation with all known public annotation) or annotate an empty sequence.

To annotate a sequence with a BED/GFF/GFF3/GTF file in MacVector

From the UCSC’s Genome browser

  • Click on this link to open the UCSC’s Genome Browser.
  • Select C.elegans and enter sel-12 in the gene name. Click SUBMIT.

    The interface will change and show all annotation associated with that region. You can modify the amount or type of annotation being shown. This particular gene, C.elegans Sel-12, is located on Chromosome X.

  • Click the Tables link at the top of the page
  • This will now allow you to export all the annotation (tracks) associated with the previously displayed region.

  • Change the REGION to POSITION.
  • If it is left at genome the entire genome will be downloaded

  • Change the OUTPUT FORMAT to GTF or BED
  • Click GET OUTPUT

  • Now switch back to MacVector
  • Now you need to open the sequence you want to annotate. For this example we could go to DATABASE > ENTREZ and search for and download Accession Number U35660. However, that only contains the mRNA and not the genomic sequence. So instead download the FASTA sequence from the NCBI:

    Sequence : Chromosome: X; NC_003284.7 (915873..918235)

    You will need to ensure that the start position of the downloaded file corresponds to the start of the region of the chromosome we have just downloaded the annotation for. This is easily done in MacVector.

  • Double click on the RED cross located near the start of the sequence in the Editor View and change it to 915872
  • MV126 SettingOrigin
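The reason the start coordinate matters is simple arithmetic: the annotation file numbers its features along the whole chromosome, while a downloaded excerpt starts counting at 1. The short Python sketch below shows the conversion, assuming 1-based inclusive coordinates and using the NCBI range quoted above purely as sample numbers. The function names are made up for this example, and the sketch is not MacVector's own code.

```python
# Sample numbers taken from the NCBI range quoted above.
REGION_START = 915873
REGION_STOP = 918235


def to_local(genomic_pos: int, region_start: int) -> int:
    """Convert a 1-based chromosome coordinate to a 1-based position in the excerpt."""
    return genomic_pos - region_start + 1


def to_genomic(local_pos: int, region_start: int) -> int:
    """Convert a 1-based position in the excerpt back to a chromosome coordinate."""
    return local_pos + region_start - 1


region_length = REGION_STOP - REGION_START + 1
print(region_length)                      # 2363
print(to_local(916000, REGION_START))     # 128
print(to_genomic(128, REGION_START))      # 916000
```

Without some form of offset, a feature recorded at chromosome position 916000 would fall far beyond the end of a 2363 base excerpt, which is why the import would otherwise report features outside the sequence.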

    Now we will import our downloaded annotation.

  • Select FILE | IMPORT FEATURES
  • The Sequence ID (SeqID) contained within all features in the file will be shown in a dialog along with the number of features for each SeqID and the region of the sequence that these will be annotated against. A warning will be displayed if any of the features are outside of the region to be annotated.

    MV126 ImportFeaturesDialogue

  • Select the appropriate SeqID ChrX and click OK
  • A dialogue will be shown with the number of annotations that have been added. If you annotate a blank sequence (e.g. a fasta file) the resulting features may be initially hidden. However, you can easily show them from the Graphics Palette tree view.

    You can choose to annotate your sequence with all the features contained within the imported file or to ignore duplicates.

  • Click OK
  • MV126 ImportFeaturesResult
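The import dialog described above lists each SeqID with its feature count and the region those features span. If you want to preview the same kind of summary before importing, something like the following Python sketch would do. It works on plain `(seq_id, start, stop)` tuples, the sample data is invented, and `summarise_by_seqid` is a hypothetical helper rather than anything MacVector provides.

```python
def summarise_by_seqid(features):
    """Return {seq_id: (count, lowest_start, highest_stop)} for (seq_id, start, stop) tuples."""
    summary = {}
    for seq_id, start, stop in features:
        count, lowest, highest = summary.get(seq_id, (0, start, stop))
        summary[seq_id] = (count + 1, min(lowest, start), max(highest, stop))
    return summary


# Invented sample data for illustration only.
features = [
    ("chrX", 916000, 916100),
    ("chrX", 915900, 917500),
    ("chrI", 5000, 5200),
]

for seq_id, (count, lowest, highest) in summarise_by_seqid(features).items():
    print(f"{seq_id}: {count} feature(s) spanning {lowest}..{highest}")
```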

    ToxoDB Genome browser

    Here’s a similar workflow from the ToxoDB Genome browser

  • For this workflow we’ll start with a “random” 2kbp sequence from ToxoDB as a “starter” empty sequence.
  • Toxoplasma gondii ME49
    
    "CCTTCCCTGCGTCAGAGGAGAAGAGAACGGCTTGACCGATGGAGGACCCCGCAAACATGAGGGCGAAGGTAGTCTGCATGATCTCTGAACAAGGAACACGGCGCGGAAAGGGAAGCACAGAAGGAAGTCGATCAAGACACCCTGCGTTGTTTTTCGGGGAGCCCCAGAGAGGGAGCTCGCGGCTCTGGACTTCAAGGTCCGTCGAGCAGCAGAACGCTTCACTCGGCAAGGAAGGAGCAGTTTCTTCTCTCGCGTCTTGTTTCGCTTTCACGGCTTCGTTTTCTCGCCGCGACCTGCGAAAAGAAAACAGCTCCCCTATAAGAACTCGACTCTCGAGCCTGCGGTTTGGTATCGGCTTTTTCTTCAGAGTTTTTTCTGTCGCGCGTTCGGACAACCAGTTCCGTGCTTGCGCGCCTCCTCTGAAGGCCGCGCCCGCCTCTCGACTCCCGTCGCTTCTCTCTTCGGCTGGATAAGAGAAAACGCTGAACGAAGAGGAGAGTACGCACTGGCATCGTTTGTCGACTTTCGTCTCCAGGTGGGGGAGTGTCGGTCGACTCACCCAAGGGATTCTTCCCTTCGCTGCTCACGATCTGGCCGCCATACCAAAAAATCAGCGCCTGCAGAGCGTACTGAGCTCCCTGACAGACACGCAGACGCAGCGGCAAGGAGACGCTGAAAAGAAGAAAGACAACCGGAGAGCGCGGAGAACAAAGAACTGTGAGCGTGCAACGACGGGATCAAGGACGACAGCGAATCTCCCGTCTTCAGGACCTCGACGGGCATTCCGCATGGCAGGTCCTTTGACTCCGAAAAACTCTGCGGCAGCCTCGATGACCCTTACCCCCCCGGAACATCCCCGAGAGCTCGGTGGAAAAAACCTCTCATTCGAGAGCGACAGATCAGGCTTTGCTAGTCGAGCCAGAAGGCAGGAGAAGGAACGGAGCGAACCGCGGATGCGTCTCTCTGCGCGGACGAGTCTTCATGAGCAGGCACCGCGACGTTCCAAGAAGCAGAAAGAGAGAGAGGAGAGAGGAGAGAAGCGAGAAGCTCGGGAACTCACGAGGAAGCAACAAAGATCTCTCCTCGTCACTCACGTTCCGTCGACCTGCATGGCAGGCGTGACGCGGCATGCACAGCAGAAGACCTTTCGAGGTCACCACACACACCGCCTCGGACGTCGAGAAGTCTCGATCTTTGTGAACCACAGGGCTCTGTTTTGTGTGGCGGAACGAAGAAACCAAGCGCTTAGGATGGAGCTCACTGGAGAGCAGGAAACGGATCTTCAAACGAGTGTCGACGTCCTCCCGCGCATCCGAACCGAAACTCAAACGCGCTCCAGAGAGACGACATAGAAGACAGAGACGTACAATGAGAGAAGAAGAGACAACGCGGCAGGGGGAGTCTGACGTCCGACCTCGACTCGAGAAGTCGCTCGCCAAAACGTGTGTGCAGTGTCTTCTGTTTCTTTCCAAGTTCTCCAGTCCGAAGAAACCGGACACTCTGACATGACTCGATACAGGGACCTGCCCGCCGACTCTTTCCTACTTTCAGCGGTCCTCCCTGTTCATCTTTCCTGTGACATTTCGGCATCTCTTTTTCTTGGTTTCCTCGCCTTCTCACCTGACTGAAGCCCCAGAAAAAGCCGAGGAGAGCCGCAGCGCGCTCTTCTTCCTTCAGCGTCCTCAGAAGAACGCTCTGGTACCGTTCCGTGAAGTGAGGCTCTAAACCGAACGCTGAAACAATGCGAATACCGTTCAGAGCCTCGCTCATCACGAAGGCAGCGGTGTCGCGGTCCTCCACCTTCTCCGCCTTCTTGTTCGCCCCCTCACCTGCAGAAAAAACTCCAAGGTTTCCAAAGCCTCGTGAGGCTCCCCATGAAGCTCTCCGCCTACGCTTGCGCAGATTGAGGCAGAGCAAACTACGCAAATGTGAGCCTACATGTACACACAGTTTCGTCGAGATTTGTACCTATATCTAAGAAGATTTGTACGGAAA
TGCGGGTGTGAAGCGGCAGTTTTCGAGGTGGCGTGCATACATCGACGCGACTCGGAGACCCAGCTTTGAGGAGACAGGAGAGAGAAAAGGAAACGGAGATAGAGCAGGTGGGGAGATCAGGTTTGCTCTGGGAGACGTGGACGGTCGCAGACGAAGAAGCAGACGCACGGAGCGAGCAGTGCAAAAAAGCGCGAGACAGAGCCGGCGCTGGGGAACTTCTGAGGAAGAATTCGAGAGAGAGAGGACCGTGGTGAAAAGCCAAGCTAAACGCGTGGACTTCCAGTTCTGCGGACTTTTCGGAGCCGAAATGTGAGACTGAGCGGCAGTGGCGGGGAGCAGAAGAACAAAACATGCGAGGACCACGGCGGCCAAGCGCGCGTCTCCAAAGAACGCGATGATGACACCTGAAATAAAGATAATGGAAAAAACAGAAGAAAAGCAACGGTCTCTCGCGCAGTCTTCAATCACGAGTTGACGCACACACGCATTAGGGGAGAACAACCTCTGCGATGTTGGATGCTTCTTTAGTGAGTGGACGTTTCCAGAAATCAAATCAAAGTAGCACGACACCGACAAGCAAAGAGATGTATACTTTCGTTCACGAGCACTGAAAGACGCGTCTAGACGCCTACGGATGCAGAGGGATCTGAAGCGACGAGTGAACAGTCAAAAAGCTTTCCGGCCTACCAGTGACAACAGCAGCCAATCCCTGGGTCATCGCGAGTGCGTTTCCAGCGCTTCCTGTCTTGACGAGAAGGACGTCGCTGGAGAGAACTCCCGTGAGATATCCTGAGGTCATTTTTTAGAAGAAGCGAACACTCTGGCGCGGCGTCTTCACTCTCGTGCTCACAGAAAGAATGAACTCACGAGACCCATGGCAGGACAAGTCTCAGACAGACACACAC"
  • Blast the above sequence using the Blast interface at Toxodb.org.
  • This will find a single hit and display a link

  • Click on the link to open it in the ToxoDB Genome Browser.
  • Select DOWNLOAD TRACKS, then CONFIGURE and change it to GFF3 format and SAVE TO DISK. Now click GO.

    This downloads a file “dumped_region” which will contain all the annotation stored in the ToxoDB Genome Browser in GFF3 format.

  • Now switch back to MacVector and open the 2Kbp sequence.
  • You can use FILE > NEW FROM CLIPBOARD to bring the sequence into MacVector quickly

    Again if you do not change the start coordinate of the sequence the IMPORT FEATURES dialogue will show an error

    ToxoDB 1

  • Change the start coordinate of the sequence to match its location in the genome (which is 405235 as detailed on the Blast hit page) as in the step in the previous workflow.
  • Select FILE > IMPORT FEATURES and choose the GFF3 file.
  • ToxoDB 2

    The dialogue will show that it has found 6 features. Note that the GFF3 file contains annotation from a much longer sequence than our initial query sequence. MacVector will ignore any annotations that lie outside the query sequence and will show a warning message to indicate this.

    ToxoDB 3
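The idea of ignoring annotation that lies outside the query sequence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sequence length of 2000 and the sample features are invented, and a feature is treated as "outside" unless it fits entirely within the sequence. The post does not say how MacVector handles features that only partly overlap, so that rule is a guess made for the example.

```python
def split_by_range(features, seq_start, seq_length):
    """Split (type, start, stop) features into those fully inside the sequence and the rest."""
    seq_stop = seq_start + seq_length - 1
    inside, outside = [], []
    for feature in features:
        _ftype, start, stop = feature
        if seq_start <= start and stop <= seq_stop:
            inside.append(feature)
        else:
            outside.append(feature)
    return inside, outside


# Start coordinate from the workflow above; length and features are hypothetical.
SEQ_START = 405235
SEQ_LENGTH = 2000

features = [
    ("gene", 405300, 406900),   # fits inside 405235..407234
    ("exon", 404000, 405400),   # starts before the sequence
    ("gene", 410000, 412000),   # lies entirely beyond the sequence
]

inside, outside = split_by_range(features, SEQ_START, SEQ_LENGTH)
print(f"{len(inside)} feature(s) kept, {len(outside)} outside the sequence")
```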

    Keeping a sequence updated

    You will be able to “update” and add new annotation to your existing sequence. For example after a few months I could revisit these two genome browser websites and download an updated GFF3 file. Upon importing these features MacVector will optionally replace any duplicate features and add new ones. So you can work with a sequence and also keep it updated as other researchers find out more about this particular sequence.

    Duplicate Features

    Due to the lack of strict standards across the many different file formats it may be that a potential duplicate is not recognised as such because the wording or keyword is different. In the majority of cases some degree of manual curation of the annotated sequence will be required. In all cases MacVector will err on the side of caution and will never throw away any potentially interesting or important information contained within a feature. Only entries that are 100% the same (after being parsed during the import) will be considered as duplicates. MacVector will never class a feature as a duplicate if the START, STOP or FEATURE TYPE are different in any way, even if they differ by just a single base.
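The "only 100% identical entries count as duplicates" rule is easy to picture in code. The Python sketch below merges a fresh import into an existing feature list and skips only exact matches. Features are plain `(start, stop, feature_type, note)` tuples and `merge_features` is a hypothetical helper written for this post, so treat it as an illustration of the rule rather than a description of MacVector's internals.

```python
def merge_features(existing, incoming):
    """Append incoming features unless an identical tuple is already present."""
    seen = set(existing)
    merged = list(existing)
    added = 0
    for feature in incoming:
        if feature not in seen:      # any difference at all means "not a duplicate"
            merged.append(feature)
            seen.add(feature)
            added += 1
    return merged, added


# Invented sample data for illustration only.
existing = [(100, 400, "CDS", "sel-12")]
incoming = [
    (100, 400, "CDS", "sel-12"),     # exact duplicate, skipped
    (100, 401, "CDS", "sel-12"),     # differs by a single base, kept
    (100, 400, "gene", "sel-12"),    # different feature type, kept
]

merged, added = merge_features(existing, incoming)
print(f"{added} new feature(s) added, {len(merged)} in total")
```

Because near-duplicates are deliberately kept, a list built this way can contain entries that a human would regard as the same feature, which is where the manual curation mentioned above comes in.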

    Posted in General, Releases, Techniques, Tutorials

    ASM2012: San Francisco

    We’ll be at the 112th ASM meeting in San Francisco this year from June 16 – 19, 2012.

    GoldenGateMay06

    Come visit us if you are at the show. Our booth at the ASM will be 936.

    We’ll be demoing MacVector 12.6 which will be released about that date. You’ll also be able to pick up demo CDs and incubator floaties. We are giving away some cool mouse pads with a summary of the DNA and Protein IUPAC codes and the Universal Genetic Code printed on them for quick reference.

    FloatiePhotoCropped+Resized+rotated MousepadPhoto

    This year it looks like the Twitter hashtag will be #asm2012. As ever you can follow us on @macvector

    If you are attending the meeting do pop along and say hi. See you in a month!

    Posted in General, Meetings

    Showing features as bases or a translation in a plasmid map

    Everybody has different tastes and giving everybody identical plasmid maps is unfair! So MacVector is designed to be as flexible as possible to allow you to make your maps look like YOU want them to look.

    In keeping with this theme, a recent change allows appropriate features to be shown as residues when there is sufficient space to show them (for example when zoomed to residue).

    By default this is enabled for certain features, but it is controlled from the Symbol Editor (double click on a feature to edit it).

  • Change the dropdown menu to Show as Graphic to disable this
  • Select Show Residue Letters if Room to enable it

    For example if this is enabled for a CDS feature, then when zoomed to residue the amino acids (as either 1 letter or 3 letter codes) will be shown.

    For example in this screenshot we’ve changed the CDS feature to show residues:

    MV12 ShowResidueLettersWhenRoom

    Posted in Tips

    New release of KeyServer K2

    The latest version of KeyServer is now K2 7.0.0.5.

    Run the version check on this page to check which version of which part you are running and if you need to upgrade.

    Remember if you are upgrading from a version earlier than K2 v 7 you’ll need a new MacVector license. Contact Support to request one.

    Posted in Releases

    Assembler: Using the coverage map of the Reference Contig editor to analyze your assembly

    There are two main steps to creating a reference assembly: mapping your reads against your reference sequence and then analysing the alignment for variations. Knowing the depth of reads, or coverage, of an alignment is important for both of these stages. A low average depth of coverage means that you have less confidence in the called consensus and a high average depth of coverage means you have spent too much money on sequencing. Even more important are regions with reads well above or below the median level of coverage, which can indicate anomalies or variations in the sequence.

    When you generate a reference contig with Bowtie, the Map view of a reference or child contig will show a plot of the depth of reads along the entire reference. This coverage map shows four statistics. A single plot line (default color is black) shows a running average of the number of reads at that point, calculated using a moving window of varying length depending on the zoom level. Such a plot is not sensitive when the window covers a large region of sequence (for example when viewing megabases of sequence), so two shaded areas indicate the highest value (default color is dark blue) and the lowest value (default color is light blue) of the reads within that window. As the coverage map is viewed at higher magnifications the window from which the running average is calculated becomes shorter, so these three values move closer together, to the extent that when viewed at, or close to, sequence level the three plots become identical.
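To see why the average, highest and lowest plots converge as you zoom in, here is a small Python sketch that summarises a list of per-base read depths. It is a simplification: it uses fixed, non-overlapping windows rather than a true moving window, the depth values are invented, and `window_stats` is a name made up for this example.

```python
def window_stats(depth, window):
    """Return (mean, lowest, highest) depth for each consecutive block of `window` bases."""
    stats = []
    for i in range(0, len(depth), window):
        chunk = depth[i:i + window]
        stats.append((sum(chunk) / len(chunk), min(chunk), max(chunk)))
    return stats


# Invented per-base read depths for illustration only.
depth = [4, 6, 5, 0, 0, 9, 10, 8]

print(window_stats(depth, 4))   # [(3.75, 0, 6), (6.75, 0, 10)]
print(window_stats(depth, 1))   # one base per window: mean, lowest and highest all agree
```

With a wide window the mean hides the bases with no reads, which only show up in the lowest value. With a window of one base the three numbers are identical, just as the three plots are at sequence level.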

    Regions of zero coverage

    Areas of zero coverage are shown in light grey. Note that these areas are always displayed even when they are disproportionate to the level of magnification. For example a region of zero coverage will always be displayed even when you are viewing a 20 megabase contig in its entirety. Also note that there are no areas of zero coverage in child contigs as by definition they are bounded by either end of the reference contig and/or an area of zero coverage. If you hover the mouse over the coverage map it will give the exact number of reads at that position (for example X reads over base XX).
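If you have the per-base read depths to hand, finding the zero coverage regions yourself takes only a few lines. The Python sketch below reports runs of consecutive bases with no reads. The minimum run length of two bases follows the "2 or more bases" rule mentioned in the primer design steps further down. The depth values are invented and `zero_runs` is a hypothetical helper, not a MacVector function.

```python
def zero_runs(depth, min_len=2):
    """Return 1-based (start, stop) pairs for runs of zero depth at least `min_len` long."""
    runs = []
    run_start = None
    for pos, reads in enumerate(depth, start=1):
        if reads == 0:
            if run_start is None:
                run_start = pos          # a new run of zeros begins here
        else:
            if run_start is not None and pos - run_start >= min_len:
                runs.append((run_start, pos - 1))
            run_start = None
    # Close a run that reaches the end of the sequence.
    if run_start is not None and len(depth) - run_start + 1 >= min_len:
        runs.append((run_start, len(depth)))
    return runs


# Invented per-base read depths for illustration only.
depth = [3, 0, 0, 0, 5, 0, 7, 0, 0]
print(zero_runs(depth))   # [(2, 4), (8, 9)]
```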

    Regions with low coverage

    There are many reasons why regions will have lower than average coverage. These are generally caused by the base composition over that region. For example regulatory elements in a sequence, where proteins such as transcription factors bind, can have lower than average coverage, perhaps due to their low GC content.

    Regions with high coverage

    Short regions with excessively high coverage can be indicative of a repeated region that may or may not be present in the reference sequence. Reads will be piled up on one of the repeated sections rather than being spread out over each repeated region. Paired end reads can go some way to help detect these and allow correct alignment of reads. Also do remember that you may have the same read mapped to multiple locations on the reference unless you select the “USE BEST ALIGNMENT ONLY” option.
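One rough way to pick out positions with coverage well above or below the median is to compare each depth against a multiple of the median. The Python sketch below does exactly that. The cut-offs (a quarter of the median and four times the median) are arbitrary values chosen for this example, the depths are invented, and `flag_outliers` is a hypothetical helper, so adjust everything to suit your own data.

```python
from statistics import median


def flag_outliers(depth, low_factor=0.25, high_factor=4.0):
    """Return (median, low_positions, high_positions) using 1-based positions."""
    mid = median(depth)
    low = [pos for pos, reads in enumerate(depth, start=1) if reads < mid * low_factor]
    high = [pos for pos, reads in enumerate(depth, start=1) if reads > mid * high_factor]
    return mid, low, high


# Invented per-base read depths for illustration only.
depth = [20, 22, 19, 21, 3, 20, 95, 18, 21, 20]

mid, low, high = flag_outliers(depth)
print(f"median depth {mid}: low at {low}, high at {high}")
# median depth 20.0: low at [5], high at [7]
```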

    MV125 ReferenceContigCoverageMapSymbols

    Further Analysis

    The coverage map makes it very easy to design primers for further sequencing, for example Sanger sequencing for hybrid assembly. Remember that you can run general MacVector analysis tools directly on a contig and it will act as if you are running that analysis on a single sequence.

    Here’s how easy it is to design primers:

  • First look for an area of low, or zero, coverage in the reference contig. Remember that areas of 2 or more bases with zero aligned reads are highlighted in grey and will be visible at all levels.
  • Zoom into that area using the cursor.
  • Now select the sequence spanning the low coverage region.
  • Now run ANALYZE > PRIMERS > DESIGN PRIMERS (PRIMER3)….
  • Check it’s set to AMPLIFY FEATURE/REGION. This will now take a 200bp region on either side of your selected region and design primers to amplify this region.
  • Now you can amplify this sequence from your original sample, or instead design some sequencing primers and sequence it directly.
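The "200bp either side" idea from the steps above comes down to working out two flanking windows around your selection. The Python sketch below computes them and clips them at the ends of the sequence. It does not call Primer3 or MacVector, the coordinates are invented, and `flanking_regions` is a hypothetical helper, so it only shows where primers would be searched for and not how they are chosen.

```python
def flanking_regions(sel_start, sel_stop, seq_length, flank=200):
    """Return 1-based (start, stop) windows to the left and right of a selection."""
    left = (max(1, sel_start - flank), sel_start - 1)
    right = (sel_stop + 1, min(seq_length, sel_stop + flank))
    return left, right


# Invented coordinates for illustration only.
print(flanking_regions(1500, 1650, 5000))   # ((1300, 1499), (1651, 1850))
print(flanking_regions(100, 300, 400))      # ((1, 99), (301, 400)), clipped at both ends
```

When the selection sits close to the end of a contig the flank is shorter than 200bp, which leaves less room to find a good primer on that side.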

    Posted in Algorithms, Tips

    Choosing the default application to open a file type

    Sometimes you’ll find that when you double click on a document (for example a protein sequence) it opens in the wrong application. Generally this results from recently installing a new application that has registered itself as the default for that type of document, replacing the application you normally use.

    For example you double click on a Genbank sequence file (one with the file extension “.gb”). You expect to see this sequence in MacVector, but instead it opens in some other application.

    Here’s how to fix this:

  • 1 – Select any file of the type you want to change (e.g. the aforementioned Genbank file) and “get info” using CMD + “i” (or right click, or hold down CTRL and left click, on the file and choose “get info”).
  • 2 – In the Get Info dialog move to the OPEN WITH section (as in the screenshot below).
  • 3 – Change the drop down menu to MacVector.app.
  • 4 – Click the CHANGE ALL button and reply CONTINUE on the dialog that appears.
  • Now close the Get Info dialog and double click on the Genbank file. It should now open in MacVector.

    Screen Shot 2012 02 06 at 18 56 28

    Posted in Support, Tips

    Updating to Keyserver 7.0

    KeyServer 7.0 was released a few months back. There is no need to upgrade your license server for MacVector. However, if you do want to upgrade then you must download the latest installer direct from Sassafras. This version (7.0.0.3) contains important bug fixes and is a later version than the KeyServer installer on the MacVector 12.5 CD.

    You will also need to contact MacVector Support to request a new network license.

    Posted in Tips

    Using old subsequence files with MacVector

    Sometimes MacVector will refuse to open a subsequence file that was created by a version of MacVector prior to MacVector 11 and has since been stored on a remote filesystem or emailed.

    Mac files generally comprise two pieces: a data fork and a resource fork. Versions of the operating system prior to the release of OS X Leopard (including OS9 and earlier) used two pieces of information called TYPE and CREATOR to specify which application would be used to open a document. This information was stored within the resource fork. The resource fork has been lost from the file during the transfer or, more accurately, because the file was stored on a non-HFS filesystem which only recognises the data fork.

    With older versions of MacVector this would have required the resource fork to be recreated and new TYPE/CREATOR information to be specified and stored within the new resource fork. However, one of the many major changes made to OS X has been the move away from opening files using the TYPE/CREATOR information stored within the resource fork towards the use of Uniform Type Identifiers (UTI) and file extensions. MacVector now registers many file extensions with OS X when installed. For example “.nucl” is a single NA sequence and “.renz” is a restriction enzyme file. NA subsequence files have the “.nsub” extension. If a file is missing the resource fork then OS X will rely on the file’s extension to decide which application can open it.

    So fixing your subsequence files is as easy as adding the “.nsub” file extension to the file name.
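If you have a whole folder of old subsequence files to fix, renaming them by hand gets tedious. The Python sketch below adds “.nsub” to every file in a folder that has no extension at all. It makes one big assumption that you must check first: that every extensionless file in that folder really is a NA subsequence file. The function name and the dry run default are choices made for this example and nothing here is supplied by MacVector.

```python
from pathlib import Path


def add_nsub_extension(folder, dry_run=True):
    """Add ".nsub" to extensionless files in `folder`; return (old_name, new_name) pairs.

    ASSUMPTION: every extensionless file in the folder is a NA subsequence file.
    With dry_run=True nothing is renamed, so the result is only a preview.
    """
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.name.startswith("."):
            continue                     # skip folders and hidden files such as .DS_Store
        if path.suffix:
            continue                     # the file already has an extension
        target = path.with_name(path.name + ".nsub")
        if target.exists():
            continue                     # never overwrite an existing file
        if not dry_run:
            path.rename(target)
        renamed.append((path.name, target.name))
    return renamed
```

Call `add_nsub_extension("/path/to/folder")` first to preview the changes, then repeat with `dry_run=False` once you are happy with the list. Work on a copy of the folder if the files matter to you.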

    Do remember though that MacVector will always try its hardest to recognise a file’s format. So even if double clicking on a file will not open it (it’s the operating system that’s at fault here, not MacVector) then using FILE > OPEN from within MacVector will generally work. Saving that file will then recreate the resource fork AND add the extension.

    MV125 SubsequenceEditor

    Posted in Tips

    MacVector 12.5 workshop at the University of Pennsylvania

    If you are a user of MacVector or have any interest in sequence analysis for molecular biologists then you may be interested to know that there’s a workshop about MacVector at the University of Pennsylvania this Thursday open to faculty, students and researchers.

    When: Thursday February 9th 9:30 to 11:00
    Where: Blockley Hall, Room 1311

    “Dr Kevin Kendall will present an informal 90 minute workshop for both beginners and long-time users of MacVector. The workshop will review how to use the basic functions and features in MacVector, as well as covering more advanced workflows and new functionality in the latest releases, MacVector 12.0 and 12.5. He will be happy to answer any questions about MacVector that you may have, as well as previewing the functionality planned for MacVector 12.6.”

    Light refreshments will be provided.

    Please register for the workshop on the University Bioinformatics Group website

    Posted in Meetings, Tutorials

    MacVector 12.5 workshop at Memorial Sloan Kettering Cancer Center

    If you are a user of MacVector or have any interest in sequence analysis for molecular biologists then you may be interested to know that there’s a workshop about MacVector at MSKCC this Wednesday open to MSKCC faculty, students and researchers.

    When: Wednesday February 8 from 11:00 to 12:30
    Where: Room RRL B20

    “Dr Kevin Kendall will present an informal 90 minute workshop for both beginners and long-time users of MacVector. The workshop will review how to use the basic functions and features in MacVector, as well as covering more advanced workflows and new functionality in the latest releases, MacVector 12.0 and 12.5. He will be happy to answer any questions about MacVector that you may have, as well as previewing the functionality planned for MacVector 12.6.”

    Posted in Meetings, Tutorials