How Assembler uses quality scores to create assemblies

A common problem with all types of sequence assembly is distinguishing between sequencing errors and true genomic variations. Quality scores help the algorithm decide whether a variant base call is of high quality, and therefore likely to be a true variation such as a SNP, or of low quality, and therefore more likely to be a sequencing error.

For Assembler, trace files can be basecalled with Phred, which adds quality data. Reads that are added in FASTQ format should already contain Phred quality scores. Both Phrap and Bowtie2 are Phred quality score aware. However, Velvet does not currently use quality scores when generating an assembly.
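To make the FASTQ point concrete, here is a minimal sketch of how the quality line of a FASTQ record maps to Phred scores. It assumes the common Phred+33 (Sanger / Illumina 1.8+) encoding, where each character's ASCII code minus 33 is the score. The function name `decode_fastq_quality` is hypothetical and is not part of Assembler, Phrap or Bowtie2.

```python
def decode_fastq_quality(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    Assumes Phred+33 encoding by default; older Illumina files
    used an offset of 64 instead.
    """
    return [ord(char) - offset for char in quality_line]


# One quality character per base in the read.
scores = decode_fastq_quality("II5+!")
print(scores)
```

Here `I` decodes to 40, `5` to 20, `+` to 10 and `!` to 0, so a single read can mix very reliable and very unreliable base calls.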

When assembling with Phrap, the quality scores guide how strongly each read's base call is weighted when the consensus is calculated. A single high-scoring base call will always contribute far more to the consensus than multiple low-scoring ones, especially ones that have a score below 20 (shown in red). The scoring scale is logarithmic. A score of 10 means that Phred has determined there to be a 1 in 10 chance that the basecall is in error. A score of 20 means a 1 in 100 chance of an error. A score of 20 or above is generally considered acceptable, and so Assembler shows scores of 20 and above in green (this cutoff point is different for the overall contig quality scores, where 40 or above is considered an acceptable score).
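The logarithmic scale follows the standard Phred definition: the error probability is 10 raised to the power of minus the score divided by 10. The sketch below shows that conversion, along with a simple red/green classification. The function names are hypothetical, and the colour rule is only an illustration of the cutoff described above (assuming a score of exactly 20 counts as green), not Assembler's actual display code.

```python
def phred_to_error_probability(score):
    """Return the probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


def display_colour(score, cutoff=20):
    """Classify a per-base score as 'green' (acceptable) or 'red' (poor).

    Assumption: scores equal to the cutoff are treated as acceptable.
    """
    return "green" if score >= cutoff else "red"


for score in (10, 20, 30, 40):
    chance = round(1 / phred_to_error_probability(score))
    print(f"Q{score}: 1 in {chance} chance of error, shown in {display_colour(score)}")
```

Each step of 10 in the score makes an error ten times less likely, which is why one Q40 base call carries far more weight than several Q10 calls.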

Phrap determines both the consensus and the quality of the consensus, and it essentially ignores poor quality sequence wherever there is good quality sequence at the same position, so you should not need to remove poor quality sequence. In fact, the Phrap manual specifically states that you should never remove poor quality sequence. A red-scoring base call will therefore not contribute to the consensus where better quality sequence covers the same position.

However, you can manually override the quality score in the contig editor. The quality scores of 98 and 99 are reserved for manual editing of a trace file. If you change any base in a trace sequence using a valid IUPAC symbol, the quality score will change to 99 and the base will be shown in blue. So if you are confident that a particular base call is correct, overwriting it with the same symbol will raise its quality score to 99. When calculating the sequence consensus, Assembler will always give special priority to manually edited traces with quality scores of 99. If there are one or more reads with manually edited bases at a position, then the consensus will be calculated only from these, and the other reads will be ignored. If there are multiple reads with manually edited, but non-matching, bases at the same position, then the consensus will use an ambiguity character to show this.
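The priority rule for manually edited bases can be sketched for a single alignment column as follows. This is an illustration under stated assumptions, not Assembler's or Phrap's real implementation: the function `column_consensus` is hypothetical, edited bases that are themselves ambiguity symbols are assumed to be merged by taking the union of the bases they represent, and the fallback for unedited columns (take the highest-scoring base call) is a deliberate simplification of Phrap's consensus calculation.

```python
# Standard IUPAC nucleotide codes and the bases each one represents.
CODE_TO_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
BASES_TO_CODE = {frozenset(bases): code for code, bases in CODE_TO_BASES.items()}

MANUAL_EDIT_SCORE = 99  # score given to manually edited bases


def column_consensus(column):
    """Return the consensus symbol for one column.

    `column` is a list of (base, quality) pairs, one per read
    covering this position.
    """
    edited = [base.upper() for base, quality in column
              if quality == MANUAL_EDIT_SCORE]
    if edited:
        # Only manually edited reads count; disagreement gives an ambiguity code.
        bases = set().union(*(CODE_TO_BASES[base] for base in edited))
        return BASES_TO_CODE[frozenset(bases)]
    # Simplified fallback (assumption): use the highest-scoring base call.
    best_base, _ = max(column, key=lambda pair: pair[1])
    return best_base.upper()


print(column_consensus([("A", 99), ("G", 40), ("G", 40)]))  # edited read wins
print(column_consensus([("A", 99), ("G", 99), ("C", 30)]))  # edits disagree
```

In the first column the single edited A overrides two high-quality G calls. In the second, the two edited reads disagree (A and G), so the result is the IUPAC ambiguity code R.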
