One of the new features in MacVector 12.6 is the ability to annotate sequences based on the features stored in the GFF/BED/GTF files that many genome browsers (e.g. Ensembl, UCSC, etc.) will export data as. MacVector 12.6 will annotate an empty or already annotated sequence with the features stored within these files.
BED, GFF, GTF, and GFF3 formats
GFF, GTF, GFF3 and BED are all file formats used to store annotation (features), generally without containing any sequence, although they are commonly accompanied by a FASTA file containing the sequence only. They emerged as a way of exporting, or exchanging, information from a specified region of a genome without having to take the entire genome.
Most sequence formats were developed to describe a specific gene or protein. Although this is no longer true, they are still oriented towards a region of fixed length. These annotation files are not at all length-specific and could potentially store just two features at either end of the same chromosome. They are a much more flexible way of dealing with annotation, especially large amounts of it, than a fixed-length sequence format such as GenBank.
They are also not limited to a single sequence and can contain information from multiple sequences in the same file (FASTA files can also contain multiple sequences). For example, you could store the entire set of human chromosomes in a pair of (quite large!) files: a multiple-sequence FASTA file and a single GFF file.
The format of these annotation files does vary (whoever said bioinformaticians had to be consistent!), but basically each file consists of a set of individual lines (one line per feature) along the following lines:
SEQUENCE ID, START, STOP, FEATURE TYPE, NOTE
- Sequence ID is the sequence these annotations belong to.
- START and STOP are the region of the sequence the feature is annotated against.
- FEATURE TYPE is obvious! Note that this does not always correspond to a valid GenBank feature keyword.
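To make the layout concrete, here is a minimal Python sketch that reads one GFF3 line into the fields listed above. It relies on the standard GFF3 layout of nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes); the `Feature` type and the `parse_gff3_line` function are names made up for this illustration and have nothing to do with how MacVector parses these files internally.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    """One annotation line, reduced to the fields discussed above."""
    seqid: str   # SEQUENCE ID
    start: int   # START (GFF3 coordinates are 1-based and inclusive)
    stop: int    # STOP
    ftype: str   # FEATURE TYPE
    note: str    # NOTE (the raw GFF3 attributes column, left unparsed)


def parse_gff3_line(line: str) -> Optional[Feature]:
    """Return a Feature for a GFF3 data line, or None for comments and blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, _source, ftype, start, stop, _score, _strand, _phase, attrs = cols
    return Feature(seqid, int(start), int(stop), ftype, attrs)


# A made-up example line (not taken from any real export).
example = "chrX\texample\tgene\t100\t200\t.\t+\t.\tID=g1"
print(parse_gff3_line(example))
```

BED files carry similar information but order the columns differently and use 0-based start coordinates, so a BED reader would need its own small function.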
Genome browsers (generally online web gateways) allow you to browse the entire chromosome or genome of a particular organism, almost like a graphical model of a sequence database. All the information known about that particular organism’s sequence that has been submitted to one of the large sequence databases (e.g. GenBank at the NCBI) should be visualised within the genome browser. You can download all the annotation contained within a particular region fairly easily using one of these annotation formats. You can then annotate an existing file that you are working with, so combining your own “private” annotation with all known public annotation.
To annotate a sequence with a BED/GFF/GFF3/GTF file in MacVector
From the UCSC’s Genome browser
The interface will change and show all annotation associated with that region. You can modify the amount or type of annotation being shown. This particular gene, C. elegans sel-12, is located on Chromosome X.
This will now allow you to export all the annotation (tracks) associated with the previously displayed region.
If it is left at genome, the entire genome will be downloaded.
Now you need to open the sequence you want to annotate. For this example we could go to
Sequence : Chromosome: X; NC_003284.7 (915873..918235)
You will need to ensure that the start position of the downloaded sequence corresponds to the start of the chromosome region for which we have just downloaded the annotation. This is easily done in MacVector.
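The reason the start position matters is simple arithmetic: the features in the annotation file are numbered along the whole chromosome, while an excised sequence is numbered from 1 unless told otherwise. A minimal sketch of the conversion, using the region quoted above (915873..918235) and a hypothetical helper called `to_local`:

```python
def to_local(chrom_pos: int, region_start: int) -> int:
    """Convert a 1-based chromosome coordinate to a 1-based position
    within a sequence excised from the chromosome at region_start."""
    return chrom_pos - region_start + 1


REGION_START, REGION_STOP = 915873, 918235  # region quoted in the text

print(to_local(REGION_START, REGION_START))  # first base of the region
print(to_local(REGION_STOP, REGION_START))   # last base of the region
```

Setting the sequence's start coordinate to the region start lets the chromosome numbering in the file line up with the sequence directly, without shifting each feature.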
Now we will import our downloaded annotation.
The Sequence ID (SeqID) contained within all features in the file will be shown in a dialog along with the number of features for each SeqID and the region of the sequence that these will be annotated against. A warning will be displayed if any of the features are outside of the region to be annotated.
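The kind of summary described here (a feature count and a covered region per SeqID) can be sketched in a few lines of Python. This is only an illustration of the idea, not MacVector's code; `summarise_by_seqid` and the sample tuples are invented for the example.

```python
def summarise_by_seqid(features):
    """Map each SeqID to (feature count, lowest START, highest STOP).

    `features` is an iterable of (seqid, start, stop) tuples.
    """
    summary = {}
    for seqid, start, stop in features:
        if seqid in summary:
            count, lo, hi = summary[seqid]
            summary[seqid] = (count + 1, min(lo, start), max(hi, stop))
        else:
            summary[seqid] = (1, start, stop)
    return summary


# Made-up features on two sequences.
sample = [("chrX", 100, 200), ("chrX", 150, 400), ("chrI", 10, 20)]
for seqid, (count, lo, hi) in summarise_by_seqid(sample).items():
    print(f"{seqid}: {count} feature(s) spanning {lo}..{hi}")
```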
A dialogue will be shown with the number of annotations that have been added. If you annotate a blank sequence (e.g. a FASTA file), the resulting features may initially be hidden. However, you can easily show them from the Graphics Palette tree view.
You can choose to annotate your sequence with all the features contained within the imported file or to ignore duplicates.
ToxoDB Genome browser
Here’s a similar workflow from the ToxoDB Genome browser.
This will find a single hit and display a link.
This downloads a file “dumped_region” which will contain all the annotation stored in the ToxoDB Genome Browser in GFF3 format.
You can use FILE > NEW FROM CLIPBOARD to bring the sequence into MacVector quickly.
Again, if you do not change the start coordinate of the sequence, the IMPORT FEATURES dialogue will show an error.
The dialogue will show that it has found 6 features. Note that the GFF3 file contains annotation from a much longer sequence than our initial query sequence. MacVector will ignore any annotations that lie outside the query sequence and will show a warning message to indicate this.
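The filtering step can be pictured as splitting the features into those that fit the query sequence and those that do not. The sketch below assumes that "outside" means "not fully contained in the region"; the text does not say how MacVector treats features that only partly overlap, so that rule is an assumption, and `split_by_region` is a hypothetical name.

```python
def split_by_region(features, region_start, region_stop):
    """Split (start, stop, ftype) tuples into those fully inside the
    region and the rest. Treating partial overlaps as 'outside' is an
    assumption made for this sketch."""
    inside, outside = [], []
    for feature in features:
        start, stop, _ftype = feature
        if region_start <= start and stop <= region_stop:
            inside.append(feature)
        else:
            outside.append(feature)
    return inside, outside


# Made-up features against a made-up query region of 1000..2000.
feats = [(1100, 1500, "gene"), (900, 1200, "exon"), (5000, 6000, "gene")]
kept, ignored = split_by_region(feats, 1000, 2000)
if ignored:
    print(f"Warning: {len(ignored)} feature(s) lie outside the sequence")
print(kept)
```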
Keeping a sequence updated
You will be able to “update” your existing sequence and add new annotation to it. For example, after a few months I could revisit these two genome browsers’ websites and download an updated GFF3 file. Upon importing these features, MacVector will optionally replace any duplicate features and add new ones. So you can work with a sequence and also keep it updated as other researchers find out more about it.
Due to the lack of strict standards across the many different file formats, a potential duplicate may not be recognised as such because the wording or keyword is different. In the majority of cases some degree of manual curation of the annotated sequence will be required. In all cases MacVector will err on the side of caution and will never throw away any potentially interesting or important information contained within a feature. Only entries that are 100% the same (after being parsed during the import) will be considered duplicates. MacVector will never class a feature as a duplicate if the START, STOP or FEATURE TYPE are different in any way, even if they differ by just a single base.
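The duplicate rule described above amounts to an exact-match test on the parsed feature. A minimal Python sketch of that idea follows; it illustrates the rule as stated here rather than MacVector's actual implementation, and `merge_features` and the tuple layout are invented for the example.

```python
def merge_features(existing, incoming):
    """Add incoming features to existing ones, skipping exact duplicates.

    Each feature is a (start, stop, ftype, note) tuple. Two features are
    duplicates only if every field is identical, so a one-base difference
    in START or STOP, or a differently worded note, is kept as new.
    """
    seen = set(existing)
    merged = list(existing)
    skipped = 0
    for feature in incoming:
        if feature in seen:
            skipped += 1
        else:
            merged.append(feature)
            seen.add(feature)
    return merged, skipped


# Made-up features: one exact duplicate, one off by a single base, one new.
current = [(100, 200, "gene", "sel-12")]
update = [
    (100, 200, "gene", "sel-12"),
    (100, 201, "gene", "sel-12"),
    (300, 400, "exon", ""),
]
merged, skipped = merge_features(current, update)
print(len(merged), "features after import;", skipped, "duplicate(s) skipped")
```

Being this strict means near-duplicates pile up rather than being silently merged, which is why some manual curation is usually still needed.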