Friday, June 9, 2017

NYT parliamentary vote palette

Always on the hunt for a good 8 class palette, NYT has one for today's graphic of the parliamentary vote: #FAA651 #FFEF4C #3989CB #D65454 #960630 #93C44C #6A996A #999999

Friday, January 6, 2017

SV Calling in the cliveome

Well, the CliveOme is out, and open for analysis. I took a few steps to do SV calling, and the results are presented below.  Note this is very preliminary; a rough pass takes an afternoon, and nailing it all down months.  In particular the counts for deletion are very high, and you see an imbalanced mobile element insertion and deletion rate (the rate should be the same).

1. Data.
The data I used was 53 fastq files taking up about 300G of space.  I didn't plan on doing anything with the raw data.

2. Mapping:
I used my branch of blasr (https://github.com/mchaisso/blasr/commit/c950018b65728cf9a0806381f1e69fba6ea40b08)

to map to build 38, using a precomputed index
blasr input.fofn GRCh38.fasta -maxMatch 25 -advanceExactMatches 5 -sam -nproc 16 -minMapQV 30 -out /dev/stdout | samtools view -bS - | samtools sort -T /tmp/clive -o clive.bam

This took < 1 day using 16 cores and about 30G of memory.

3. Basic sequencing stats.  The dataset had about 30X euchromatic coverage, with about 0.5Gbase unmapped. The data release quotes 50X, so there is a chance I didn't grab all the files.


While there are a few very long (> 100kbp) sequences here, it only accounts for 0.2% of the total coverage, and 8% of the total coverage allowing reads over 50kbp. This implies we can't start assuming de novo assembly with 100kbp reads yet.

The accuracy was 89%, counting # matches/ (#matches + # insertions + # deletions + # mismatches), so multi-base insertions or deletions are counted as single events.

And like all SMS, there isn't a length/accuracy bias.

4. Sequencing error modes.

I'm a noob on nanopore sequencing, so much of what I may mention about error modes is probably well known.  Generic regions look fine, but have a high SNV rate.



Zooming in on a few of these there was a pretty high strand bias for SNVs.  Another way to look at this is, for each SNV, look at the ratio of the min(# forward, #reverse)/(#forward+#reverse) for all reads supporting an SNV, for the two haplotypes.



Clive mentioned on Twitter this could be due to modified bases and their basecaller not supporting native DNA yet, so I won't look any more to figure out what's happening.

Here is the strand ratio for the pacbio SNVs on chromosome 1 for a different individual:


Also, effectively all poly-A/polyT sequences I saw were deleted:



5. Calling structural variation.

The current pipeline I use for calling SV on a diploid sample is: 1. phase SNVs. 2. Partition SMS reads counting overlap of phased SNVs, and 3. Calling SV on each set of partitioned reads separately.  I didn't try taking the time to do the phasing because the SNV strand bias would cause a bunch of artifacts to appear in phase.  The code to separate strand bias from real SNVs represented on both strands is running now, but it's /very/ slow and probably won't finish until long past my attention has turned back to my normal research.

I modified my pipeline to do local assembly in 60kbp tiling windows using Canu V1.4. Instead of mapping the local assemblies, the 10 longest p-reads from each local assembly were mapped, this helps fix the problem of detecting heterozygous variants. The p-reads had an average accurcy of 96.9%, making them unsuitable for indel calling and OK for calling SV.  The p-reads were mapped back to hg38, and a set of unmerged SVs were output simply by parsing the cigars strings. To merge, SVs were split according to insertion and deletion, and then a graph was made with each SV as a node, and an edge connecting any two SVs if they had 50% reciprocal overlap.  The largest SV from any connected component with at least two nodes was output.



InsertionDeletion
Repeat typeCountMeans.d.TotalCountMeans.d.Total
Complex4968118.91191.235907531861683.75121.191559030
AluY1319263.5447.943476111857220.39119.8409259
AluS183176.6379.6832323158995.2157.34151293
L1Hs1711971.842145.283371852591597.512240.74413754
L1192765.581695.61469921734148.16393.03256910
L1P427553.71931.832364362109172381.9362747
SVA256709.81841.54181711430452.19798.13194442
HERV44288.02422.8212673274143.41355.7339295
Alu-mosaic254358.31256.5691010589166.86195.6498278
STR4690244.82328.2111481881331284.6565.211126812
Satellite2583297.44482.817682802645221.41460.67585625
Singletons308208.39367.86641841898103.76140.33196930
Tandem_Repeats4686592.71620.4827774037399350.75987.772595177
Total20557370.61991.14761869053403168.17531.358980612


Overall, the number of inserted and deleted bases is on par with what one would expect from a diploid sample ( e.g. http://genome.cshlp.org/content/early/2016/11/25/gr.214007.116.full.pdf+html), though a little high.

The count of the deletion variants is very high, driven by the STR and Complex callsets (a complex call is complex sequence context, e.g. unmasked bases, rather than a complex set of operations). The causes for this could be the use of p-reads as opposed to some more sophisticated model - larger events will be broken up into multiple smaller events in more noisy alignments, or due to a context dependent deletion error mode such as the loss of poly-a tracts.


Clive, if you care, I'll send the list of deletions that overlap with exons.  The count is high, but it's driven by small events that are likely false positives.  Not to worry though, nothing pops out.