2012 wellcome-talk

Upload: ctitusbrown

Post on 10-May-2015


TRANSCRIPT

  • 1. Streaming approaches to sequence data compression (via normalization and error correction). C. Titus Brown, Asst. Prof., CSE and Microbiology, Michigan State University. [email protected]

2. What Ewan said.

3. What Guy said.

4. Side note: error correction is the biggest data problem left in sequencing.* Both for mapping & assembly. (*paraphrased, E. Birney)

5. My biggest research problem: soil. Est. ~50 Tbp to comprehensively sample the microbial composition of a gram of soil (bacterial species in 1:1m dilution, est. by 16s; does not include phage, etc., that are invisible to tagging approaches). Currently we have approximately 2 Tbp spread across 9 soil samples for one project, and 1 Tbp across 10 samples for another. Need 3 TB RAM on a single chassis to assemble 300 Gbp (Velvet); estimate 500 TB RAM for 50 Tbp of sequence. That just won't do.

6. Online, streaming, lossy compression (digital normalization). Much of next-gen sequencing is redundant.

7. Uneven coverage => even more redundancy. Suppose you have a dilution factor of A (10) to B (1). To get 10x coverage of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory.

8. Coverage before digital normalization: (MD amplified)

9. Coverage after digital normalization: normalizes coverage, discards redundancy, eliminates the majority of errors, scales assembly dramatically; assembly is 98% identical.

10. Digital normalization algorithm:

        for read in dataset:
            if estimated_coverage(read) < CUTOFF:
                update_kmer_counts(read)
                save(read)
            else:
                pass  # discard read

    Note: single pass; fixed memory.

11. Digital normalization retains information, while discarding data and errors.

12. Little-appreciated implications!! Digital normalization puts both sequence and assembly graph analysis on a streaming and online basis. Potentially really useful for streaming variant calling and streaming sample categorization. Can implement (< 2)-pass error detection/correction using locus-specific coverage. Error correction can be tuned to specific coverage retention and variant detection.

13. Local graph coverage. Diginorm provides the ability to measure local graph coverage very efficiently (online). (Theory still needs to be developed.)
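The digital normalization loop on slide 10 can be fleshed out into a runnable sketch. This is not the real implementation: the plain dict counter and the median-of-k-mer-counts coverage estimate below are illustrative stand-ins (a production tool would use a fixed-memory count structure such as a CountMin sketch to keep the single pass within bounded memory), and the reads and parameters are invented.

```python
from collections import defaultdict

K = 4        # k-mer size (tiny for illustration; real runs use k around 20)
CUTOFF = 3   # stop keeping reads once their estimated coverage reaches this

kmer_counts = defaultdict(int)

def kmers(read):
    """All K-length substrings of a read."""
    return [read[i:i + K] for i in range(len(read) - K + 1)]

def estimated_coverage(read):
    """Median k-mer count approximates the coverage of the read's locus."""
    counts = sorted(kmer_counts[km] for km in kmers(read))
    return counts[len(counts) // 2]

def normalize(reads):
    """Single pass over the reads; memory grows only with k-mers actually kept."""
    kept = []
    for read in reads:
        if estimated_coverage(read) < CUTOFF:
            for km in kmers(read):   # update counts only for retained reads
                kmer_counts[km] += 1
            kept.append(read)
        # else: discard the (redundant) read
    return kept

reads = ["ACGTACGT"] * 10 + ["TTTTGGGG"]
kept = normalize(reads)
# the 10 redundant copies collapse to CUTOFF-many; the distinct read survives
```

With these toy reads, the first three copies of the repeated read push its k-mer coverage to the cutoff, the remaining seven are discarded, and the one distinct read is kept untouched, which is exactly the "retain information, discard data" behavior of slide 11.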
Alignment of reads to graph. Fixes digital normalization. Aligned reads => error-corrected reads. Can align longer sequences (transcripts? contigs?) to graphs.

        Original sequence: AGCCGGAGGTCCCGAATCTGATGGGGAGGCG
        Read:              AGCCGGAGGTACCGAATCTGATGGGGAGGCG

    [Figure: alignment of the read to the de Bruijn graph; the legend marks the seed k-mer, emission bases, k-mer coverage, and vertex class.] (Jason Pell)

15. 1.2x-pass error-corrected E. coli.* (Significantly more compressible.) *The same approach can be used on mRNAseq and metagenomic data.

16. Some thoughts. We need fundamental measures of information retention so that we can place limits on what we're discarding with lossy compression. Compression is most useful to scientists when it makes analysis faster / lower-memory / better. Variant calling, assembly, and just deleting your data are all various forms of lossy compression :)

17. How compressible is soil data? De Bruijn graph overlap: 51% of the reads in prairie (330 Gbp) have coverage > 1 in the corn samples' de Bruijn graph (180 Gbp).

18. Further resources. Everything discussed here:
    Code: github.com/ged-lab/ (BSD license)
    Blog: http://ivory.idyll.org/blog (titus brown blog)
    Twitter: @ctitusbrown
    Grants on lab web site: http://ged.msu.edu/interests.html; see esp. "BIGDATA: Small: DA: DCM: Low-memory Streaming Prefilters for Biological Sequencing Data"
    Preprints: on arXiv, q-bio (search: diginorm arxiv, illumina artifacts arxiv, assembling)

19. Streaming Twitter analysis.
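The overlap measurement on slide 17 can be approximated, in miniature, with plain k-mer sets. This is a simplified assumption, not the talk's method: a query read counts as overlapping if any of its k-mers appears in the reference sample's k-mer set (the slide's coverage > 1 criterion in a real de Bruijn graph is stricter), and the corn/prairie reads here are made up.

```python
K = 4  # toy k-mer size; real analyses use much larger k

def kmer_set(reads, k=K):
    """Collect every k-mer across a set of reads (a stand-in for the graph)."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}

def overlap_fraction(query_reads, reference_reads, k=K):
    """Fraction of query reads sharing at least one k-mer with the reference."""
    ref = kmer_set(reference_reads, k)
    hits = sum(
        1 for read in query_reads
        if any(read[i:i + k] in ref for i in range(len(read) - k + 1))
    )
    return hits / len(query_reads)

corn = ["ACGTACGTAA", "GGGGCCCCAA"]      # hypothetical reference sample
prairie = ["ACGTACGTTT", "TTTTAAAATT"]   # hypothetical query sample
frac = overlap_fraction(prairie, corn)   # one of the two prairie reads overlaps
```

Scaled up, this kind of cross-sample k-mer containment is what makes the 51% figure on slide 17 a statement about how compressible one sample is given the other.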