Transcript

Slide 1

VARiD: A Variation Detection Framework for Color- and Letter-Space platformsAdrian Dalca1,2, Steven Rumble3, Sam Levy4, Michael Brudno2,5Massachusetts Institute of TechnologyUniversity of TorontoStanford UniversityThe Scripps Research InstituteDonnelly Centre and the Banting and Best Department of Medical Research

1motivation | methods | results | summaryVariation detection from NGS readsReference: TCAGCATCGGCATCGACTGCACAGGACCAGTCGATCGAC

Donor: ??????????????????????????????????????? GCATCGACTGCA CGGGATCGACTGAligned reads: ATCCATTGCA GATCCACTGCAC Determine differences (variation) between reference and donorusing NGS reads of the donorMotivationColor-space and Letter-space platformsbring them togetherMethodsSummaryResults33motivation | methods | results | summarySequencing Platforms letter-space Sanger, 454, Illumina, etc> NC_005109.2 | BRCA1 SX3TCAGCATCGGCATCGACTGCACAGG color-space AB SOLiD less software tools available > NC_005109.2 | BRCA1 AF3T212313230313232121311120 many differences -> useful to combine this information sequencing biases inherent errors advantages4Mention that we use numbers for colors- Dont List Differences4

Color Spacemotivation | methods | results | summaryTranslation MatrixTranslation Automata5> T2123132303132321213111205Translating> T212313230313232121311120> T

Sequencing Error vs SNPSequencing Error> T212313230313232121311120> T212313230310232121311120> TCAGCATCGGCAAGCTGACGTGTCC

SNP> TCAGCATCGGCATCGACTGCACAGG> TCAGCATCGGCAGCGACTGCACAGG> T212313230312332121311120AGCTCAGCATCGGCATCGACTGCACAGGColor Spacemotivation | methods | results | summary6Known first letter part of the linker, not of the read6

Color Spacemotivation | methods | results | summary7 clear distinction between a sequencing error and a SNP can this help us in SNP detection? sounds like it! single color change error, 2 colors changed (likely) SNP.Easy snp callWell covered basesDifficult Casereference incolor-spacereadsposition

7Detection Heterozygous SNPs Homozygous SNPs Tri-allelic SNPs small indels account for various errors, quality values & misalignmentsMotivation variation caller to handle both letter-space & color-space readsMotivationmotivation | methods | results | summary8VARiD system to make inferences on the donor bases variation detection8Methods

Simple HMM Modelstates, emissions, transitions, FB

Extended HMM Modelgaps, diploids, exceptions

MotivationSummaryResults99Statistical model for a system - states

Assume that system is a Markov process with state unobserved. Markov Process: next state depends only on current state

We can observe the states emission (output)each state has a probability distribution over outputsHidden Markov Model (HMM)motivation | methods | results | summary10S1S2S3e1e2e1e2e1e210Hidden Markov Model (HMM)motivation | methods | results | summary11Apply HMM to variation detection: we dont know the state (donor), but we can observe some output determined by the state (aligned reads)11Hidden Markov Model (HMM)motivation | methods | results | summary12. . . . . B6 B7 B8 B9 . . . . .B6 B7B7 B8B8 B9AAACcolor 0color 1AAACcolor 0color 1AAACcolor 0color 1unknowndonor- Will go to states, emissions and transitions12Why pairs of letters? Handle colors. AA and TT gives the same colors. Cant just model colorsThe donor could be: letters: AA color 0 letters: AC color 1 : letters: TT color 016 combinations

Statesmotivation | methods | results | summary13B6 B7REDUNDANT?

13Transitionsmotivation | methods | results | summary14. . . B6 B7 B8 . . .B6 B7B7 B8AACAATTT::GACT::AATT.........Possible StatesTransitions only certain transitions allowed when allowed, p(Xt|Xt-1) = freq(Xt) each state depends only on the previous states (Markov Process)States 16 possible states only look at second letter- Will go to states, emissions and transitions14T01020100311223 T1030101311223 T20100311223ATTGCGCAATGCG TTGGGCAATGCGA GCGCACTGCGACUnknown genomeColor readsLetter readsEmissionsmotivation | methods | results | summary15..... B6B7B8B9 .....B7 B8color 0color 1AAACcolor 0AA15AAcolor 0color 1color 2color 3letters Aletters C1 3emissionprobabilityp(em|AA)letters T1- 3letters GEmission Probabilitiesmotivation | methods | results | summary16Same color emission distributionTTDifferent letter emission distributionTT16

E.g. For state CC:Combining emission probabilities probability that this state emitted these reads.motivation | methods | results | summaryEmission Probabilities17T01020100311223 T1030101311223 T20100311223ATTGCGCAATGCG TTGGGCAATGCGA GCGCACTGCGAC..... B6B7B8B9 .....17Summary

unknown state donor pair at location

transitions transition probabilities

emissions reads at location emission probabilities motivation | methods | results | summarySimple HMM18B6 B7AAACcolor 0color 118 Have set-up a form of an HMM run Forward-Backward algorithm get probability distribution over states at each positionAACAATTT::GACT::likely statemotivation | methods | results | summaryForward-Backward Algorithm19 Variation Detection:compare most likely state with reference:

ref: GCTATCCAdon: ...AT...probability- FB intuition !!!19Methods

Simple HMM Modelstates, emissions, transitions, FB

Extended HMM Modelgaps, diploids, exceptions

MotivationSummaryResults2020Simple HMM only detects homozygous SNPs

Extended HMM: short indels heterozygous SNPs complex error profiles & quality valuesmotivation | methods | results | summaryExtended HMM2121Expand states Have states that include gaps emit: gap or colorA----GAGTGT-T- Have larger states, for diploids Transitions built in similar fashion as before Same algorithm, but in all we have 1600 states with very sparse transitionsExpansion: Gaps and heterozygous SNPsmotivation | methods | results | summary2222 Emission probabilities Support quality values Use variable error rates for emissions Translate through the first color first color is incorrect letter-space signalDonor: ACAGCATCGGCATCGACTGC 1123132303123321213read: >T2123132303123321213 > C123132303123321213Expansionmotivation | methods | results | summary23 Post-process putative SNPs correlated adjacent errors may support het SNPs check putative SNPs know the error rate (= error rate at first color) note: not ok to translate the whole read due to effects of color-space error, but one letter is safe. handle like a normal letter-space emission23motivation | methods | results | summary

blue: varid stepsSummaryResultsMotivationMethodsSummary2525Resultsmotivation | methods | results | summary Human dataset from Harismendy et al, 2009. (NA17156,17275,17460,17773)454, SOLiD, Sanger

Color-space dataset: Compare random subsets: Corona (with AB mapper) VARiD (with SHRiMP) VARiD (with AB mapper)

Conclusions: the three pipelines perform very similarly. High-coverage results is as good as can be achieved 26F-measureSAY: Although VARiD should be used to combine datasets, for now we only tested it on purely color-space dataSAY: (most FN?) Low coverage for the 2nd allele- add a column for indels?

26Resultsmotivation | methods | results | summaryLetter-space dataset: Compare random subsets : GigaBayes (with Mosaik) VARiD (with SHRiMP) VARiD (with Mosaik)

Conclusion:the three pipelines perform very similarly. High-coverage results is as good as can be achieved 27F-measureSAY: Although VARiD should be used to combine datasets, for now we only tested it on purely color-space dataSAY: (most FN?) Low coverage for the 2nd allele- add a column for indels?

27letter-spaceF-meas.0x1x2.5x5x10xColor-space0x0.019.443.559.171.710x47.851.859.469.576.520x60.658.965.373.480.350x73.169.873.680.083.5100x75.675.277.982.786.0ResultsVARiD Motivation:Combining Letter-space and Color-space data to achieve increased accuracy in at-cost comparisonmotivation | methods | results | summary28Assuming a same-cost comparison of: 10x letter-space (LS) 100x color-space (CS) 5x LS and 50x CSVARiDSAY: Although VARiD should be used to combine datasets, for now we only tested it on purely color-space dataSAY: (most FN?) Low coverage for the 2nd allele- add a column for indels?

28SummaryMotivationMethodsResults2929Summary of VARiD HMM modeling underlying donor Treats color-space and letter-space together in the same framework no translation take advantage of each technologys properties accurately calls SNPs, short indels in both color- and letter-space improved results with hybrid data.Summary motivation | methods | results | summary Website: http://compbio.cs.utoronto.ca/varid (VARiD freely available) Contact: [email protected] ADJ SNP: would need an extra runStress: Fully Probabilistic

TOOK OUT ADJACENT SNPs example30Acknowledgementsmotivation | methods | results | summary Website: http://compbio.cs.utoronto.ca/varid (VARiD freely available) Contact: [email protected] Participant Support Award : ISCB, Department of Energy, and National Science Foundation Acknowledgements Steven Rumble, Sam Levy, Michael Brudno Misko DzambaFunding

For ADJ SNP: would need an extra runStress: Fully Probabilistic

TOOK OUT ADJACENT SNPs example31


Top Related