Estimating insert length quickly for a read pair

[Edit December 20, 2017 – As of MacVector 15.5 you can simply right click a READ and select “SEE MATCHING READS” to view the pair of reads. The total sequence length is selected. ]

Insert length is the length of the sequence in between a pair of reads. Sequencers are supplied DNA samples in fragments of a known length and each end is sequenced (generally in a 5′ to 3′ direction from both ends).

For example, if you have a fragment of 2 Kbp and your reads average 500 bp, then the insert length would be around 1 Kbp. Insert length is determined by the protocol that is used when preparing samples for sequencing, so you should normally know the insert length (and orientation) of your sequencing data.
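The arithmetic behind that example can be written out as a short shell sketch. The variable names are purely illustrative, and the calculation assumes the definition used here (insert = fragment minus both reads).

```shell
# Illustrative values taken from the example in the text.
fragment_len=2000   # total fragment length in bp
read_len=500        # average read length in bp

# Insert length = fragment length minus the two sequenced ends.
insert_len=$(( fragment_len - 2 * read_len ))
echo "Estimated insert length: ${insert_len} bp"
```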

However, you may not know it, or perhaps you have only a wide range and want to see whether using more stringent values would improve your assembly. Velvet will estimate this for you; Bowtie does not.

Here’s a quick way to estimate the insert length using MacVector and Assembler.

Choose a published sequence from your dataset, e.g. a reference sequence or a gene. If no such sequence exists, take a contig from any assembly you may have already done.
Use this sequence with Align to Folder against your dataset of paired reads. If your read files are large this may take some time (*consider extracting the first few hundred pairs as described below).
Select a subset of hits in the results (fewer read pairs will be less accurate, but a lot quicker to do) and save these as FASTQ using DATABASE | RETRIEVE TO FILE.

Now we will repeat the alignment with these extracted hits.

Align these extracted reads using Align to Reference.
Once aligned, toggle the SORT menu so that the reads are in alphabetical order. As long as your read pairs are named according to the usual conventions, the two mates of a pair should be next to each other, for example “SLXA-EAS1_89:1:100:858:113/1” and “SLXA-EAS1_89:1:100:858:113/2”.

Now measure the insert length from a few aligned pairs.

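Scroll down the list until you reach the first pair. Select both reads and check the length of the selection. Note that if the selection spans both reads, the reported length includes the reads themselves; subtract the two read lengths if you want only the sequence between them.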
Repeat with a few pairs until you are happy with the estimated length.
*If you are comfortable with working from the command line, then a quick step is to use “head” to extract 100 pairs.
A read in a FASTQ file is generally composed of four lines: the header, the sequence, a separator line beginning with “+” (which may repeat the header) and the quality line. So extracting 4 × N lines from the top of the file gives you the first N reads.
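Before extracting, it can help to confirm that a file really follows the four-lines-per-read layout. The following is a minimal sketch under that assumption; `count_fastq_reads` is a hypothetical helper name, and the tiny demo file exists only so the example runs on its own.

```shell
# Count reads in a FASTQ file, assuming four lines per read.
# count_fastq_reads is a made-up helper name, not part of any tool.
count_fastq_reads() {
    lines=$(awk 'END { print NR }' "$1")
    if [ $(( lines % 4 )) -ne 0 ]; then
        # Multi-line records or a truncated file would break the head approach.
        echo "warning: $1 has $lines lines, not a multiple of 4" >&2
    fi
    echo $(( lines / 4 ))
}

# Tiny two-read demo file (invented data) so the sketch is self-contained.
demo=$(mktemp)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nIIII\n' > "$demo"
count_fastq_reads "$demo"
```

If the warning appears for your own file, taking a fixed multiple of four lines with “head” may cut a read in half.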

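If your pairs are in two files then run the following to extract the first 100 reads from each file (that is, 100 pairs, provided both files list the mates in the same order).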
```
head -n 400 Mate1.fastq > Mate1_100reads.fastq
head -n 400 Mate2.fastq > Mate2_100reads.fastq
```
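Since the alphabetical sort relies on mates sharing a name, it may be worth checking that the two extracted files line up. This is a rough sketch that assumes the “/1” and “/2” naming convention and four lines per read; `count_mismatched_mates` is a hypothetical helper, and the demo files are made up for illustration (the read name is the one quoted earlier).

```shell
# Print how many positions hold reads whose names differ between two mate files,
# ignoring a trailing /1 or /2. Hypothetical helper, not part of any tool.
count_mismatched_mates() {
    names1=$(mktemp)
    names2=$(mktemp)
    # Header lines are lines 1, 5, 9, ... in a four-line-per-read FASTQ file.
    awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' "$1" > "$names1"
    awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' "$2" > "$names2"
    # Put the names side by side and count rows where they disagree.
    paste "$names1" "$names2" | awk -F '\t' '$1 != $2 { bad++ } END { print bad + 0 }'
    rm -f "$names1" "$names2"
}

# One-read demo files so the sketch runs on its own.
m1=$(mktemp)
m2=$(mktemp)
printf '@SLXA-EAS1_89:1:100:858:113/1\nACGT\n+\nIIII\n' > "$m1"
printf '@SLXA-EAS1_89:1:100:858:113/2\nTTGA\n+\nIIII\n' > "$m2"
count_mismatched_mates "$m1" "$m2"
```

A result of 0 suggests the files are in the same order; with your own data you would pass the two extracted files, e.g. Mate1_100reads.fastq and Mate2_100reads.fastq.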
