Splitting Fastq Formatted Files

Most NGS platforms can save data in Fastq formatted files. Typically, these are provided as a pair of files representing paired-end reads, though the reads may also be provided interleaved in a single file. Depending on the source, these files can contain anywhere from a few hundred reads to as many as 50 or 100 million. On many occasions the number of reads in these files is not only overkill for the required experiment, but can actually prevent algorithms from correctly assembling them into appropriate contigs. Ideally, for de novo assembly, you want no more than 100-500x coverage of the genome. If you are sequencing a small plasmid, this may mean using just a few thousand reads, even if your sequencing provider has sent data files containing many millions of reads.

In addition, working with smaller numbers of reads reduces computation, memory and disk space requirements. MacVector can often assemble an entire bacterial genome in just a few minutes with an optimized set of reads, whereas the full set may take many hours, especially on Macs equipped with less than 16 GB RAM.

We at MacVector have created a simple free utility called SplitFastqFile.app to simplify splitting large files into smaller files. You can download a zipped copy of the utility using this link.

Using SplitFastqFile.app

SplitFastqFile.app is a "droplet" application, meaning that after you download and extract the application, you just need to drag a fastq formatted file onto it and it will split that file into multiple smaller files. During the splitting, you will be prompted for several inputs:

First, you should choose how many reads to save in each split file. Aim for a coverage of between 100x and 500x for the molecule you are sequencing. If you have a pair of files, use the same value for both files so that the paired reads stay matched across the split files.
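As a rough guide to choosing that number, coverage can be estimated as (total reads x read length) / genome length. The sketch below rearranges this to give a reads-per-file value. The function name `reads_per_file` and the example figures (a 5,000 bp plasmid and 150 bp reads) are illustrative assumptions, not values taken from the utility.

```python
import math

def reads_per_file(genome_length, read_length, coverage, paired=True):
    """Estimate how many reads to keep in each file for a target coverage.

    Uses coverage = total_reads * read_length / genome_length.
    For paired-end data the total is shared equally between the two files.
    """
    total_reads = math.ceil(coverage * genome_length / read_length)
    files = 2 if paired else 1
    return math.ceil(total_reads / files)

# Hypothetical example: 5,000 bp plasmid, 150 bp reads, 200x coverage.
print(reads_per_file(5_000, 150, 200))  # 3334 reads in each of the two files
```

This treats every read as full length and ignores trimming, so it is best used as a starting estimate.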

Next, you will be prompted for a destination. Be sure to use the New Folder button if you want to save the files in a unique location.

Now select a "Prefix" for the filenames. This will be used as the root of each filename. For example, if you choose "Run_1650_R1_" then the files will be named "Run_1650_R1_aaa.fastq", "Run_1650_R1_aab.fastq" and so on.
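The naming pattern can be illustrated with a short Python sketch. It assumes the suffix is always three lowercase letters advancing in alphabetical order, which is inferred from the two example names above rather than confirmed behavior of the utility. `split_filenames` is a hypothetical helper name.

```python
from itertools import islice, product
from string import ascii_lowercase

def split_filenames(prefix, extension=".fastq", width=3):
    """Yield prefix + 'aaa', 'aab', ... + extension in alphabetical order."""
    for letters in product(ascii_lowercase, repeat=width):
        yield prefix + "".join(letters) + extension

# First three names for the prefix used in the example above.
names = list(islice(split_filenames("Run_1650_R1_"), 3))
print(names)
# ['Run_1650_R1_aaa.fastq', 'Run_1650_R1_aab.fastq', 'Run_1650_R1_aac.fastq']
```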

A dialog will appear while the job is running. With long jobs this will close and reopen multiple times. For very large files (>50 million reads) the job may take several minutes to run on a typical machine. Be patient!

Finally, you will get a dialog box confirming that the utility has finished processing the files.

The target folder will contain all of the split files. The original file is never changed by the utility.
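For readers who want to script a similar split, the following Python sketch shows the general idea. It is not the implementation used by SplitFastqFile.app. It assumes every read occupies exactly four lines (header, sequence, "+", qualities), and the function name `split_fastq` and its parameters are hypothetical.

```python
from itertools import islice, product
from pathlib import Path
from string import ascii_lowercase

def split_fastq(source, dest_dir, prefix, reads_per_file):
    """Write consecutive chunks of `reads_per_file` reads to new files.

    Assumes four lines per read. The source file is only opened for
    reading, so it is never modified. Returns the list of files written.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Three-letter suffixes: aaa, aab, aac, ... (at most 26**3 files).
    suffixes = ("".join(s) for s in product(ascii_lowercase, repeat=3))
    written = []
    with open(source, "r") as handle:
        while True:
            # Take the next block of reads; an empty block means end of file.
            chunk = list(islice(handle, 4 * reads_per_file))
            if not chunk:
                break
            out_path = dest / f"{prefix}{next(suffixes)}.fastq"
            with open(out_path, "w") as out:
                out.writelines(chunk)
            written.append(out_path)
    return written
```

For a pair of files, calling such a function once per file with the same `reads_per_file` value keeps the split files of the two mates aligned with each other. Each chunk is held in memory while it is written, so very large `reads_per_file` values would call for a streaming variant.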

Copyright © 2024 MacVector, Inc. All rights reserved. Terms of Use.

MacVector, Inc • PO Box 1147 • Apex • North Carolina 27502 • USA

phone: +1-919-303-7450 • toll free: +1-866-338-0222 • fax: +1-919-303-7449