Wednesday, September 11, 2013

Converting bas.h5 to fasta with pls2fasta.

I would strongly recommend taking the time to install pbcore/pbh5tools in order to work with bas/bax h5 files; these are kept up to date by the software group at PacBio. If you're having difficulty getting this set up on your system, you can instead use the compiled binary pls2fasta to convert bas.h5 files into fasta or fastq format. It is part of the blasr source, which you can download from GitHub. After compiling, there will be an executable at $(source_dir)/pbihdfutils/bin/pls2fasta

Typical usage is as follows:
>pls2fasta in.bas.h5 out.fasta -trimByRegion
The flag "-trimByRegion" is necessary to include only the high quality regions of reads in "out.fasta". The other regions are not just low quality, but pure noise caused by signal recorded before sequencing began or after the true sequence ended, so you don't gain any information by leaving them in. To produce fastq output, use:
>pls2fasta in.bas.h5 out.fastq -trimByRegion -fastq
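If you have several bas.h5 files, the same command can be wrapped in a small shell function. This is only a sketch: the function name convert_bas_to_fastq and the PLS2FASTA variable are my own inventions for illustration, and it assumes every input name ends in ".bas.h5" and that the output should be written next to the input. It uses only the two flags shown above.

```shell
# Path to the pls2fasta executable (assumption: it is on your PATH;
# otherwise set PLS2FASTA to the full path of the compiled binary).
PLS2FASTA="${PLS2FASTA:-pls2fasta}"

# Hypothetical helper: convert each given bas.h5 file to a trimmed
# fastq file with the same base name, stopping at the first failure.
convert_bas_to_fastq() {
    for bas in "$@"; do
        # Strip the ".bas.h5" suffix and append ".fastq".
        out="${bas%.bas.h5}.fastq"
        "$PLS2FASTA" "$bas" "$out" -trimByRegion -fastq || return 1
    done
}
```

You would then call it as "convert_bas_to_fastq *.bas.h5" in the directory holding the files. Setting PLS2FASTA=echo first prints the commands instead of running them, which is a cheap way to check the output names before converting anything.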
I do not know of any software that handles PacBio reads and makes appropriate use of the quality values, other than the Quiver method released by PacBio. It is possible to make blasr use quality values when forming pairwise alignments, but this is turned off by default because the adjacent insertion/deletion columns that quality-value-aware alignment allows often cause problems for naive consensus calling methods.