Use Database | Auto-Annotate Sequence to annotate prokaryotic genomes

Nov 8, 2018


Continuing advances in Next Generation Sequencing have made it relatively inexpensive to sequence prokaryotic genomes, and many scientists are embarking on large projects to sequence multiple related genomes. These might be clinical isolates of the same species exhibiting different pathogenic properties, environmental isolates from different sites, or a study over time of the changes in microbial genomes from specific locations. Once you have your sequence, the definitive source of annotation is the NCBI Prokaryotic Genome Annotation Pipeline (PGAP). However, to have that run on your sequence, you must submit the sequence to the NCBI. This is not always ideal: perhaps you are still working on resolving repeat sequences for your genome, you don’t want to wait for it to be published, or you don’t want to go through the hassle of a formal submission for many variant sequences. MacVector to the rescue!

First, you need to download existing genome sequences from similar organisms. Open the Database | Online Search for Keywords (Entrez) browser and search for [name of your organism] “+” “complete genome”. Assuming you are working with a reasonably common organism, you should find anywhere from a few hits to a great many. Select those you are interested in and click the To Disk button, then choose a suitable target folder for the downloaded genomes.
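If you would rather script this download step than click through the browser, the same kind of search can be sent to the public NCBI E-utilities service. The sketch below is only an illustration and is not how MacVector performs the search internally: the helper names (`build_search_url`, `build_fetch_url`, `search_ids`) are made up for this example, and the `[Organism]` and `[Title]` field tags are one assumed way of expressing "organism name plus complete genome".

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_search_url(organism, retmax=25):
    """Build an esearch URL for complete genomes of one organism.

    The query wording is an assumption for illustration; adjust it to
    match whatever search you would type into the Entrez browser.
    """
    term = f'"{organism}"[Organism] AND "complete genome"[Title]'
    params = {"db": "nuccore", "term": term,
              "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def build_fetch_url(ids):
    """Build an efetch URL returning full GenBank records for the given IDs."""
    params = {"db": "nuccore", "id": ",".join(ids),
              "rettype": "gbwithparts", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def search_ids(organism, retmax=25):
    """Run the search over the network and return the matching record IDs."""
    with urlopen(build_search_url(organism, retmax)) as response:
        return json.load(response)["esearchresult"]["idlist"]


if __name__ == "__main__":
    # Only the URL is built here; call search_ids() to contact NCBI.
    print(build_search_url("Campylobacter jejuni", retmax=5))
```

Saving the text returned by the fetch URL into one folder, one file per genome, would give you the same kind of reference folder that the To Disk button creates.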

Now open your unannotated genome sequence and invoke Database | Auto-Annotate Sequence. Select the folder containing the downloaded genomes as the target and press OK. On a 2.7 GHz laptop, scanning a 1.8 Mbp Campylobacter jejuni genome against 25 related C. jejuni genomes (average 5,000 features per genome) takes around 10 minutes with the default parameters and produces a fully annotated genome.

Compare Genomes lets you directly compare the features of two related genomes based on DNA or (for CDS features) protein sequences, and reports all of the identities, similarities, differences and missing features. In this case, the tool confirmed that no features were missing from the auto-annotated genome compared to the NCBI-annotated genome; there were just minor differences in a few CDS features where mutations had created or removed stop codons.
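To make the idea of a feature-by-feature comparison concrete, here is a toy sketch of the kind of report described above. It is not MacVector's algorithm: Compare Genomes matches features by sequence, whereas this example simply assumes each genome is given as a dictionary mapping a feature name to its protein sequence. The function `compare_cds`, the gene names and the short sequences are all invented for illustration.

```python
def compare_cds(reference, query):
    """Compare two {feature name: protein sequence} dictionaries.

    Returns a report listing features that are identical, features whose
    sequences differ (with both lengths, since a gained or lost stop codon
    changes the protein length), and features present in only one genome.
    """
    report = {"identical": [], "different": [],
              "missing_in_query": [], "extra_in_query": []}
    for name, protein in reference.items():
        if name not in query:
            report["missing_in_query"].append(name)
        elif query[name] == protein:
            report["identical"].append(name)
        else:
            report["different"].append((name, len(protein), len(query[name])))
    report["extra_in_query"] = [name for name in query if name not in reference]
    return report


if __name__ == "__main__":
    # Made-up example data: geneB is truncated in the query genome.
    reference = {"geneA": "MKTLL", "geneB": "MSDNQ", "geneC": "MARVK"}
    query = {"geneA": "MKTLL", "geneB": "MSD", "geneD": "MPPLE"}
    for category, entries in compare_cds(reference, query).items():
        print(category, entries)
```

A CDS that appears under "different" with a shorter length in one genome is the pattern you would expect from a mutation that introduces a premature stop codon.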