Short read improvements to MacVector 11.1

The ability to produce de novo assemblies of short read data was introduced in MacVector & Assembler 11, and we’ve enhanced it in MacVector 11.1. Assembler now stores and visualises metadata on the type of short read you have, and also handles the quality data that accompanies each read in a better way.

Mixing your reads

There are currently three major types of short read data, because three next generation sequencing platforms dominate and they produce considerably different read lengths.

– Illumina reads, which are generally 66bp in length (the first generation of Solexa sequencers produced reads 33bp long).

– 454 reads, which can be 400 to 500bp long.

– SOLiD reads, which are around 50bp long.

It is important to know which sequencer your reads have come from, as length is not the only characteristic specific to each type. So with MacVector 11.1 we have introduced new feature types to indicate which type of read each one is. Whether you mix Sanger reads with 454 reads and a dash of Illumina reads, your assembly project will always keep the source of each read stored in the project metadata. Furthermore, the default symbol for each read type is different, so the types can be easily told apart visually.

Yet another sequence file format inconsistency!

An unfortunate fact about file formats in the bioinformatics world is that there are just so many of them, most offering a slightly different approach to the same problem. So it is nice that the Fastq format already seems to be fairly ubiquitous for storing raw short read data, which is the main reason we chose it to support short read data in MacVector. On the downside, there are already three variations in the way that quality scores are stored in the format (there are variations in the sequence labels as well, but let’s just consider the quality scores).

As well as the basecalled sequence, the Fastq format stores a quality value for each base, encoded as a single character whose ASCII code represents the value. All three variations of the format use that strategy. However, the actual number stored, and the way it is mapped to an ASCII character, does vary. The first generation of Illumina machines (when the company was still called Solexa) used a proprietary quality score. The current generation of their sequencers uses the more widely recognised Phred quality scores. However, they store this number differently from the more recognised “Sanger” format, which stores a Phred quality score in the range 0 to 93 using ASCII codes 33 to 126.

It is very important to recognise when data uses the old Solexa scores. It is less important to distinguish the later Illumina format, as the difference is only relevant when scores are very high, and such high Phred scores are unlikely.
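To make the encoding concrete, here is a minimal Python sketch of how a quality line could be decoded under each of the three variations. It is an illustration only, not how Assembler does it. The Sanger offset of 33 comes from the description above; the offset of 64 for both Illumina variants, and the Solexa to Phred conversion formula, are taken from the commonly documented conventions and are assumptions as far as this post is concerned. The names `OFFSETS`, `decode_quality` and `solexa_to_phred` are made up for the example.

```python
import math

# ASCII offset for each Fastq quality variant.
# 33 for Sanger is stated in the text; 64 for the two Illumina
# variants is the commonly documented value (an assumption here).
OFFSETS = {"sanger": 33, "solexa": 64, "illumina": 64}


def decode_quality(quality_line, variant):
    """Return the integer scores encoded in one Fastq quality line."""
    offset = OFFSETS[variant]
    return [ord(ch) - offset for ch in quality_line]


def solexa_to_phred(q_solexa):
    """Convert an old Solexa score to the equivalent Phred score.

    Solexa scores are based on the odds p / (1 - p) of an error,
    Phred scores on the error probability p itself.
    """
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


if __name__ == "__main__":
    # The same character means different things in different variants.
    print(decode_quality("I", "sanger"))    # score read as Sanger
    print(decode_quality("I", "illumina"))  # score read as Illumina
    print(round(solexa_to_phred(-5), 2))    # lowest Solexa score as Phred
```

The two scales diverge most at the low end: a Solexa score of 0 corresponds to a Phred score of about 3, while at a score of 40 the two are almost identical. That is why old Solexa data needs converting rather than just shifting by a fixed offset.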

So as well as distinguishing which sequencer produced the data, Assembler now also supports the import of Fastq reads with any of the three types of quality data, and will take the quality type into account when the reads are assembled.
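If you are not sure which variation a file uses, the characters in the quality lines can at least narrow it down. The sketch below is a hypothetical helper, not a description of what Assembler does on import. It assumes the commonly documented lower bounds: ASCII 33 for Sanger, ASCII 59 for old Solexa scores (which can go as low as -5 with an offset of 64) and ASCII 64 for the later Illumina format. The function name `consistent_variants` is invented for the example.

```python
def consistent_variants(quality_lines):
    """List the quality variants that could have produced these lines.

    Only the lowest character is informative: any character below a
    variant's minimum ASCII code rules that variant out.
    """
    codes = [ord(ch) for line in quality_lines for ch in line]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest = min(codes)
    if lowest < 33:
        raise ValueError("character below ASCII 33 is not valid in any variant")
    if lowest < 59:
        # Too low for either offset-64 variant.
        return ["sanger"]
    if lowest < 64:
        # Too low for later Illumina, possible for Solexa or Sanger.
        return ["sanger", "solexa"]
    # Nothing can be ruled out from the characters alone.
    return ["sanger", "solexa", "illumina"]
```

Note that the check can only rule variants out. A file containing nothing but high characters is consistent with all three, so the source of the data still needs to be known in that case.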