How to use Codon Preference plots

When you are looking for open reading frames in newly sequenced regions, the longest ORFs are not always the ones that encode proteins. Let's look at an example from one of the sequences included with MacVector:

/Applications/MacVector/Sample Files/Gal Cosmid.nucl.

This is from Streptomyces coelicolor, a filamentous bacterium with a 73% G+C content. All three stop codons (TAA, TAG and TGA) are A+T-rich, so the high G+C% means that they occur relatively infrequently by chance and long open reading frames are quite common. Look at this plot, generated with Analyze | Nucleic Acid Toolbox:
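To put a rough number on how rare chance stop codons become, here is a minimal sketch. It assumes a deliberately simple model that is not part of MacVector: every base is drawn independently, with G and C each at half the G+C fraction and A and T each at half the remainder. The function names are hypothetical and exist only for this illustration.

```python
def stop_codon_probability(gc_fraction):
    """Chance that a random codon is TAA, TAG or TGA.

    Assumes independent bases with P(G) = P(C) = gc/2 and
    P(A) = P(T) = (1 - gc)/2. This is a simplification; real
    genomes are not this uniform.
    """
    at = (1.0 - gc_fraction) / 2.0   # probability of A (and of T)
    gc = gc_fraction / 2.0           # probability of G (and of C)
    taa = at * at * at
    tag = at * at * gc
    tga = at * gc * at
    return taa + tag + tga


def expected_codons_between_stops(gc_fraction):
    """Mean number of codons until a stop appears by chance in one frame."""
    return 1.0 / stop_codon_probability(gc_fraction)


if __name__ == "__main__":
    for gc in (0.50, 0.73):
        p = stop_codon_probability(gc)
        print(f"G+C {gc:.0%}: P(stop) = {p:.4f}, "
              f"about {expected_codons_between_stops(gc):.0f} codons between stops")
```

Under this model a stop turns up roughly every 21 codons at 50% G+C, but only about every 63 codons at 73% G+C, so random open stretches are around three times longer in a genome like this one.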

You can see there is a long open reading frame in the topmost pane in Frame +3. However, it is extremely unlikely to encode a functional protein. How do we know? Take a look at the Staden Codon Preference plots in panes 3 and 4. These plot the probability that each of the three frames encodes a protein, based on the selected codon bias table (here we are using Streptomyces coelicolor.bias). The plus strand plot (pane 3) suggests the blue frame has the best chance of encoding a protein, but it's still not that great (compare it to the blue plot at the left-hand end of the sequence, where there is a blue ORF in the upper pane). The red plot, which we would expect to be the highest if the long ORF really encoded a protein, is very low throughout the length of the ORF.

Now look at the lowest pane: the green (-2) frame is very high throughout most of that region, with the exception of a dip in the middle. This corresponds almost exactly to two distinct green ORFs in the second (minus strand) pane. This is excellent evidence that the two green ORFs on the minus strand are the most likely to encode actual translated proteins, and that the long red ORF on the plus strand is purely accidental, a consequence of the infrequency of stop codons in this genome.
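The idea behind a per-frame preference plot can be sketched in a few lines of Python. This is not the Staden algorithm as MacVector implements it, and it does not read a .bias file. It is a simplified stand-in that scores each codon by its log-odds against a uniform codon distribution and averages the scores over a sliding window. The function name, the window size, the pseudo-frequency and the toy codon table are all assumptions made for illustration.

```python
import math


def frame_scores(seq, codon_freq, window=25, pseudo=1e-3):
    """Windowed mean log-odds codon score for each of the three forward frames.

    codon_freq maps codon -> relative frequency in known genes (hypothetical
    input here). Codons missing from the table receive the small 'pseudo'
    frequency so the logarithm is always defined.
    """
    uniform = 1.0 / 64.0  # background: all 64 codons equally likely
    scores = {}
    for frame in range(3):
        # Split the sequence into complete codons starting at this frame offset.
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        log_odds = [math.log(codon_freq.get(c, pseudo) / uniform) for c in codons]
        # One averaged value per window position along the frame.
        scores[frame] = [
            sum(log_odds[i:i + window]) / window
            for i in range(len(log_odds) - window + 1)
        ]
    return scores


if __name__ == "__main__":
    # Toy table: pretend the organism strongly prefers GCC (illustrative only).
    toy_table = {"GCC": 0.5}
    toy_seq = "GCC" * 30
    result = frame_scores(toy_seq, toy_table, window=10)
    for frame, values in result.items():
        print(f"frame +{frame + 1}: mean score {sum(values) / len(values):.2f}")
```

In the toy run only the first frame reads the preferred codon, so it scores well above the other two. That mirrors how, in the plot above, the frame carrying a real gene stands out while the frame of an accidental ORF stays low. To score the minus strand the same way, you would pass in the reverse complement of the sequence.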
