Next Generation sequencing formats

As is common with emerging technologies, where standards are still lacking, there are many different and competing file formats for storing short read or next generation sequencing data. All of these formats try to solve the same problem: storing an almost unprecedented amount of sequence data in a usable and complete form. However, one emerging format that seems very appropriate for this type of data is Fastq.

Phrap was one of the first real high throughput assemblers that could also deal with quality scores (generated by its stablemate Phred). The general input files to Phrap are a single Fasta file containing the reads, and an associated Qual file that contains quality scores for each and every read in the Fasta file. I’m not sure what it is about bioinformaticians, but they always feel the need to add Yet Another Format rather than reuse one of the many decent formats. However, in this case Fastq is a logical progression of the Fasta + Qual format, in that the two individual files are now merged. That is, each read comprises four lines: a label header, a sequence, a second label header and a quality score line.

Here’s an example of such a file. Note that this example has a single ‘+’ character to indicate the quality label line, rather than a duplicate of the label.

@ERR000955.3982 IL6_1091:3:1:210:502/1
TCCAAACACACTTTGTGTAGAATCTGCAAGTGGAGAT
+
>>>>>>>>>>>>>>>>>>>;>>>>><>>>;>>;><>>

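The four-line layout described above can be read with very little code. What follows is only a minimal sketch in Python, not how any particular assembler does it: the names `FastqRecord` and `read_fastq` are made up for illustration, and it assumes that every read occupies exactly four lines (sequences are not wrapped over several lines).

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    label: str      # header text without the leading '@'
    sequence: str   # base calls
    quality: str    # one encoded quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a stream that uses exactly four lines per read."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Small made-up input: the second record repeats its label after the '+'.
demo = io.StringIO("@read1 first\nACGT\n+\nIIII\n@read2\nGG\n+read2\n;;\n")
for record in read_fastq(demo):
    print(record.label, record.sequence, record.quality)
```

The sketch accepts both a bare ‘+’ and a repeated label on the third line, and it rejects any record whose quality line is not the same length as its sequence.
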
Many of the raw file formats (SFF, SRF etc.) are big! They contain the raw image files as well as basecalled sequence and quality data. Fastq files are at the complete opposite end. They are small and contain only minimal data. They may contain millions of reads, yet a single run’s data still comes to less than a tenth of a gigabyte. It is worth noting that the various Short Read Archives (NCBI, EBI etc.) require the submission of original raw image files, but only allow the reads to be downloaded in Fastq format.

The quality line must comprise the same number of characters as there are bases, i.e. one quality character per base. However, most quality scores are double digits. Fastq gets around this by using a single ASCII character to encode each quality score. However, here’s where consistency fails. The quality line may be one of three different types. Sanger format encodes a Phred quality score using ASCII characters from 33 upwards; the format allows scores of 0 to 93 (ASCII 33 to 126), although scores in raw reads rarely go above 60 (ASCII 93). The latest Illumina 1.3 format also contains a Phred quality score, from 0 to 40, but this time encoded using ASCII 64 to 104. Finally, the older Illumina (née Solexa) 1.0 format has its own Solexa/Illumina quality score, from -5 to 40, encoded using ASCII 59 to 104. Of course this poses a problem: unless you know which encoding was used, there is no way of telling which it is without guesswork.
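To make the encoding concrete, here is a rough Python sketch of decoding a quality line and of the guesswork involved. The function names are invented for this post, and the table of lowest characters is simply taken from the ranges above. The Solexa to Phred conversion uses the commonly quoted formula Q_phred = 10·log10(10^(Q_solexa/10) + 1); treat it as an illustration rather than as any vendor's exact pipeline.

```python
import math
from typing import Iterable, List

# ASCII offset subtracted from each character to recover the score.
# Solexa also uses 64: its lowest score, -5, is ASCII 59.
OFFSETS = {"sanger": 33, "solexa": 64, "illumina1.3": 64}

# Lowest character each encoding can produce (from the ranges above).
LOWEST_CHAR = {"sanger": 33, "solexa": 59, "illumina1.3": 64}


def decode_quality(quality: str, encoding: str) -> List[int]:
    """Turn a quality line into a list of integer scores."""
    offset = OFFSETS[encoding]
    return [ord(char) - offset for char in quality]


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa-scaled score to the Phred scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def candidate_encodings(quality_lines: Iterable[str]) -> List[str]:
    """List the encodings that could have produced these quality lines.

    Only the lowest character seen is used, so the answer is frequently
    ambiguous: this is the guesswork mentioned above.
    """
    lowest = min((ord(c) for line in quality_lines for c in line), default=None)
    if lowest is None:
        return list(LOWEST_CHAR)  # no data, so nothing can be ruled out
    return [name for name, floor in LOWEST_CHAR.items() if lowest >= floor]


print(decode_quality(">;<", "sanger"))   # [29, 26, 27]
print(candidate_encodings([">;<"]))      # ['sanger', 'solexa']
```

Note that the lowest character in the example read earlier is ‘;’ (ASCII 59), which is consistent with both Sanger and Solexa encoding. Only a character below ASCII 59 settles the question (it must be Sanger); a file whose characters all sit at 64 or above could be any of the three.
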

There are also other issues with this format. It could be said that the label on the quality score line is redundant, and the file size could be reduced by up to 25% if it were removed. Some applications do generate and accept Fastq files that have a single ‘@’ or ‘+’ in place of the quality label line.
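Dropping the repeated label is easy to do after the fact. The sketch below is a hypothetical helper (`strip_quality_labels` is not part of any tool mentioned here) and it assumes a strict four-lines-per-read file with no blank lines. It picks out the label line by its position rather than by its first character, because ‘+’ (ASCII 43) and ‘@’ (ASCII 64) are themselves valid quality characters and can legitimately start a quality line.

```python
import io
from typing import TextIO


def strip_quality_labels(src: TextIO, dst: TextIO) -> None:
    """Copy a four-line-per-read Fastq stream, shortening '+label' to '+'."""
    for index, line in enumerate(src):
        # The third line of every record (index 2, 6, 10, ...) is the label line.
        if index % 4 == 2 and line.startswith("+"):
            line = "+\n"
        dst.write(line)


source = io.StringIO("@read1\nACGT\n+read1\n+@II\n")
target = io.StringIO()
strip_quality_labels(source, target)
print(target.getvalue(), end="")
```

In this made-up record the quality line happens to start with ‘+’, and it is left untouched because only the third line of each record is rewritten.
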

It would be helpful to see a tightening of this format and indeed there is a fastq2 format that does not have these weaknesses.

With the next release of MacVector and Assembler during the Summer we will be adding support for the Fastq format. We will also be adding support for de novo assembly of short read data. This release is currently in internal beta testing, and will be out for a public beta trial soon.