Automatic Assembly of Sub-projects with Phrap (Sub-Assemblies)

New to MacVector 18.6 is the ability to sort and assemble reads from different datasets into individual sub-projects. This functionality is located in the Phrap parameters dialog. When enabled and configured appropriately for your dataset, it automatically breaks out the input reads into sub-projects that are assembled separately.

A simple pattern-matching text box lets you define which characters in the input filenames should be treated as project names, and which should be treated as read names. After assembly, contigs can be exported (to a variety of file formats, including fasta and fastq) retaining the project name in the contig names.

This function can be a great time saver if you run a lot of related small sequencing projects, as long as you use a well-defined naming convention.

Pattern Matching

The reads in your datasets must follow a defined naming standard. You need to construct a pattern that matches both the project name and the read name, using a small set of characters that define which part of the filename is the project name and which part is the read name. As an aid to constructing a pattern, the sub-project name shown in the dialog updates dynamically as you type, so you can see what the sub-projects will be named. These characters are:

  • P – a single character to be included in a project name.
  • X – a single character to be excluded from the project names (typically these would be the read names).
  • “-”, “_” or “.” – separators. If present in the pattern they MUST be in the filenames (you can add more separator characters in the dialog).
  • p – one or more characters to be included in the project name. Extends to the next separator or to the end of the filename.
  • x – one or more characters to be excluded from the project name. Extends to the next separator or to the end of the filename.
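
The rules above can be sketched in a few lines of Python. This is not MacVector's implementation, only an illustration of how the pattern characters might be interpreted. The function name `project_name` is hypothetical, and the sketch makes three assumptions that the text does not state explicitly: P and X never match a separator character, a p or x that follows another p or x skips the separator where the previous run stopped, and any part of the filename left over after the pattern is used up is ignored.

```python
def project_name(filename, pattern, separators="-_."):
    """Return the project name `pattern` extracts from `filename`,
    or None if the filename does not fit the pattern.

    Illustrative sketch only; not MacVector's actual matching code.
    """
    parts = []
    i = 0  # current position in filename
    for token in pattern:
        if token in separators:
            # A separator in the pattern must be present in the filename.
            if i >= len(filename) or filename[i] != token:
                return None
            i += 1
        elif token in "PX":
            # Assumption: a single-character token never matches a separator.
            if i >= len(filename) or filename[i] in separators:
                return None
            if token == "P":
                parts.append(filename[i])
            i += 1
        elif token in "px":
            # Assumption: skip the separator where a previous p/x run stopped.
            if i < len(filename) and filename[i] in separators:
                i += 1
            start = i
            # Extend to the next separator or to the end of the filename.
            while i < len(filename) and filename[i] not in separators:
                i += 1
            if i == start:
                return None  # p and x need at least one character
            if token == "p":
                parts.append(filename[start:i])
        else:
            raise ValueError(f"unsupported pattern character: {token!r}")
    # Assumption: unmatched trailing characters (e.g. the extension) are ignored.
    return "".join(parts)


print(project_name("BASENAME-1001g07_0x00.s01_1.scf", "PPPPPPPP-PPPPxxxx"))
print(project_name("BASENAME-1001g07_0x00.s01_1.scf", "p-PPPPx"))
```

Under these assumptions, both calls print BASENAME1001: the separator itself is matched but not copied into the project name, and the second pattern shows that a single p can stand in for a fixed run of P characters when the project prefix varies in length.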

This is best demonstrated with an example. Here we have a sequencing dataset called BASENAME. Each individual sample that was sequenced was given a number from 1000 to 1100. Typical read names are:

List of read names

  • BASENAME-1001g07_0x00.s01_1.scf
  • BASENAME-1001g07_0x00.s02_1.scf
  • BASENAME-1003g07_1a03.m22_2.scf
  • BASENAME-1003g07_1b06.m23_1.scf
  • BASENAME-1005g07_2c07.s01_1.scf
  • BASENAME-1005g07_0x00.s01_1.scf

Your pattern for this could be:

PPPPPPPP-PPPPxxxx

We can break this down as follows for the first read name:

BASENAME-1001g07_0x00.s01_1.scf
  • PPPPPPPP – the main name up to the separator (BASENAME)
  • “-” – the separator
  • PPPP – the number of the individual sample (1001)
  • xxxx – each x excludes all characters up to the next separator. The first x excludes g07, the second x excludes the next set of characters (0x00), and so on.

The above set of reads would produce the following three sub-assemblies:

  • BASENAME1001
  • BASENAME1003
  • BASENAME1005
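
To check a naming convention before importing reads, the same grouping can be approximated outside MacVector. The sketch below is an assumption-laden stand-in, not the application's own logic: it hand-translates the example pattern PPPPPPPP-PPPPxxxx into a regular expression (eight project characters, a hyphen, four project characters, then at least one excluded character) and groups the six example filenames by the resulting project name. The names `EXAMPLE_PATTERN` and `group_reads` are hypothetical.

```python
import re
from collections import defaultdict

reads = [
    "BASENAME-1001g07_0x00.s01_1.scf",
    "BASENAME-1001g07_0x00.s02_1.scf",
    "BASENAME-1003g07_1a03.m22_2.scf",
    "BASENAME-1003g07_1b06.m23_1.scf",
    "BASENAME-1005g07_2c07.s01_1.scf",
    "BASENAME-1005g07_0x00.s01_1.scf",
]

# Hand-written regex approximation of PPPPPPPP-PPPPxxxx, assuming the
# default separators "-", "_" and ".". Groups 1 and 2 are the P runs.
EXAMPLE_PATTERN = re.compile(r"^([^-_.]{8})-([^-_.]{4})[^-_.]+")


def group_reads(filenames):
    """Group filenames by project name; return (groups, unmatched)."""
    groups = defaultdict(list)
    unmatched = []
    for filename in filenames:
        match = EXAMPLE_PATTERN.match(filename)
        if match is None:
            unmatched.append(filename)
            continue
        # The separator is matched but left out of the project name.
        groups[match.group(1) + match.group(2)].append(filename)
    return dict(groups), unmatched


groups, unmatched = group_reads(reads)
for project in sorted(groups):
    print(project, len(groups[project]))
print("unmatched:", unmatched)
```

For the six reads listed above, this yields the same three project names as the sub-assemblies, with two reads in each and nothing left unmatched. Filenames that land in the unmatched list are the ones that would need renaming or a different pattern.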

How to sort reads into sub-projects

  1. File | NEW | ASSEMBLY PROJECT
  2. Click ADD SEQS to add your dataset.
  3. Click ASSEMBLE | PHRAP
  4. Click the Sub-Assemblies tab in the Phrap dialog.
  5. Toggle the Enable Sub-assemblies setting to on.
  6. Ensure your separator character is listed in the Valid Separators box.
  7. Construct a suitable matching pattern (see above).
  8. Click OK.

MacVector 18.6 was released in July 2023. This release adds one-click optimization of CDS coding regions, automatic phrap sub-project assembly, direct support of .csv/.tsv files for Primer Database, inclusion of graphical information in GenBank exports and numerous tweaks and improvements to many workflows.
