Genotyping
CMSC702 Spring 2014, University of Maryland
Source: users.umiacs.umd.edu/.../lect15_genotyping/genotyping.pdf
What makes them different?
Much human variation is due to differences in ~6 million base pairs (0.1% of the genome), referred to as SNPs
Genomic DNA (SNP position marked N):
TACATAGCCATCGGTANGTACTCAATGATGATAA
G
Single Nucleotide Polymorphism (SNP)
Three genotypes (each pair of lines is one double-stranded chromosome copy; one copy comes from the mother, one from the father)

AA
TACATAGCCATCGGTAAGTACTCAATGATGATA
ATGTATCGGTAGCCATTCATGAGTTACTACTAT

TACATAGCCATCGGTAAGTACTCAATGATGATA
ATGTATCGGTAGCCATTCATGAGTTACTACTAT

AG
TACATAGCCATCGGTAAGTACTCAATGATGATA
ATGTATCGGTAGCCATTCATGAGTTACTACTAT

TACATAGCCATCGGTAGGTACTCAATGATGATA
ATGTATCGGTAGCCATCCATGAGTTACTACTAT

GG
TACATAGCCATCGGTAGGTACTCAATGATGATA
ATGTATCGGTAGCCATCCATGAGTTACTACTAT

TACATAGCCATCGGTAGGTACTCAATGATGATA
ATGTATCGGTAGCCATCCATGAGTTACTACTAT
[Check, Nature 437]
Personal Genomics
Next-gen Sequencing Platforms
• Millions of short DNA fragments (~100 bp) sequenced in parallel
Source: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010
Source: Whiteford et al. Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics. 2009
name / sequence / quality scores
x 100s of millions
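Each read carries a name, a base sequence, and one quality score per base. The following is a minimal sketch of turning quality characters into error probabilities. It assumes a four-line FASTQ record with Phred+33 encoding, which the slides do not state; the record itself and the function names are made up for illustration.

```python
def phred_to_error_prob(qual_char, offset=33):
    """Convert one FASTQ quality character to an error probability.

    Assumes Phred scaling, Q = -10 * log10(p), with an ASCII offset of 33.
    """
    q = ord(qual_char) - offset
    return 10 ** (-q / 10)


def parse_fastq_record(lines):
    """Split a four-line FASTQ record into name, sequence, and error probabilities."""
    name = lines[0].lstrip("@")
    seq = lines[1]
    quals = lines[3]
    if len(seq) != len(quals):
        raise ValueError("sequence and quality strings differ in length")
    return name, seq, [phred_to_error_prob(c) for c in quals]


# Hypothetical record, for illustration only.
record = ["@read_1", "GTCGCAGTATCTGTCT", "+", "IIIIIIIIII5555++"]
name, seq, err = parse_fastq_record(record)
```

Under these assumptions, `I` corresponds to Q40 (error probability about 1e-4), `5` to Q20 and `+` to Q10.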
Sequencing throughput
HiSeq 2500: 60 billion bp per day (2012)
HiSeq 2000: 25 billion bp per day (2010)
GA IIx: 5 billion bp per day (2009)
GA II: 1.6 billion bp per day (2008)
Images: www.illumina.com/systems
Numbers: www.politigenomics.com/next-generation-sequencing-informatics
Dates: Illumina press releases
Sec-gen Sequencing for SNPs
TAACGATTC
ATTGCTAAG ......
......
TAACGTTTC
ATTGCAAAG ......
......
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT
GTCGCAGTANCTGTCT
||||||||| ||||||
GTCGCAGTATCTGTCT

GGATCTGCGATATACC
|||||| |||||||||
GGATCT-CGATATACC

AATCTGATCTTATTTT
||||||||||||||||
AATCTGATCTTATTTT

ATATATATATATATAT
||||||||||||||||
ATATATATATATATAT

TCTCTCCCANNAGAGC
|||||||||  |||||
TCTCTCCCAGGAGAGC
Align Aggregate
Reference
Call: HET A, G p-value: 0.0023
GTCGCAGTATCTGTCT
GTCGCAGTATCTGTNN
TGTCGCAGTATCTGTC
TATGTCGCAGTATCTG
TATATCGCAGTATCTT
TATATCGCAGTATCTG
NATATCGCAGTATNTG
CCCTATATCGCAGTAT
ACACCCTATGTCGCA
ACACCCTATCTCGCA
ACACCCTATGTCGCA
GA-CACCCTATGTCGC
CCGGA-CACCCTATAT
CCGGA-CACCCTATAT
GCCGGA-CACCCTATG
Statistics
“Coverage”
“Pileup” or “Coverage plot”
“Depth of coverage” = 14
(slide courtesy of Ben Langmead)
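The pileup at a reference position is the column of read bases aligned to it, and the depth of coverage is the size of that column. A minimal sketch, assuming each alignment is given as a 0-based start plus an ungapped read sequence (a simplification; real alignments can contain gaps). The reads and offsets below are hypothetical.

```python
from collections import Counter


def pileup(alignments, position):
    """Return the bases that aligned reads place at a reference position.

    Each alignment is a (start, sequence) pair: a 0-based reference start and
    an ungapped read sequence (a simplifying assumption).
    """
    bases = []
    for start, seq in alignments:
        offset = position - start
        if 0 <= offset < len(seq):
            bases.append(seq[offset])
    return bases


# Hypothetical alignments against a short reference, for illustration only.
alignments = [
    (0, "GTCGCAGTAT"),
    (2, "CGCAGTGTCT"),
    (4, "CAGTATCTGT"),
    (6, "GTNTCTGTCT"),
]
column = pileup(alignments, 8)
depth = len(column)
counts = Counter(column)
```

Here `column` collects one base per overlapping read, `depth` is the depth of coverage at position 8, and `counts` tallies the observed alleles that a genotype caller would work from.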
We want the probability of genotype given aligned bases:
$P(T_i \mid D)$
SNP calling
• We will look at SOAP and samtools today: [Ruiqiang Li et al., Genome Research 2009; Heng Li, Bioinformatics 2011].
• Both use a “Bayesian” formulation
• This is also how “first” generation SNP-calling was done (BayesSNP).
• Many others use a similar formulation (MAQ, Atlas-SNP, FreeBayes).
• The main difference is in their probabilistic framework for the genotype.
SOAP: Short Oligonucleotide Analysis Package
samtools: Sequence Alignment/Map tools
SOAPsnp
$$P(T_i \mid D) = \frac{P(D \mid T_i)\,P(T_i)}{\sum_x P(D \mid T_x)\,P(T_x)}$$
$P(D \mid T_i)$: probability of data given genotype
$P(T_i)$: prior probability of genotype
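Bayes' rule over genotypes can be sketched in a few lines. This assumes the ten unordered diploid genotypes over A, C, G, T; the likelihood values and the flat prior in the example are made-up numbers, not output of SOAPsnp.

```python
from itertools import combinations_with_replacement

# The ten unordered diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
GENOTYPES = ["".join(g) for g in combinations_with_replacement("ACGT", 2)]


def posterior(likelihood, prior):
    """Apply Bayes' rule: P(T_i | D) = P(D | T_i) P(T_i) / sum_x P(D | T_x) P(T_x)."""
    joint = {g: likelihood[g] * prior[g] for g in GENOTYPES}
    total = sum(joint.values())
    return {g: joint[g] / total for g in GENOTYPES}


# Hypothetical inputs, for illustration only.
likelihood = {g: 1e-6 for g in GENOTYPES}
likelihood.update({"AG": 0.02, "AA": 0.001, "GG": 0.001})
flat_prior = {g: 1.0 / len(GENOTYPES) for g in GENOTYPES}

post = posterior(likelihood, flat_prior)
best = max(post, key=post.get)
```

The denominator only normalizes, so the called genotype is the one with the largest product of likelihood and prior.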
Prior Probabilities
Assuming:
1) SNP rate is $10^{-3}$
2) Error rate in reference is $10^{-5}$
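As a rough illustration of how a SNP rate becomes a genotype prior, the sketch below gives the homozygous-reference genotype probability 1 minus the SNP rate and splits the remaining mass evenly over the other nine genotypes. The even split is purely an assumption made here, since the slides do not give the full prior table, and the reference error rate is not modelled in this toy version.

```python
from itertools import combinations_with_replacement

GENOTYPES = ["".join(g) for g in combinations_with_replacement("ACGT", 2)]


def genotype_prior(ref_base, snp_rate=1e-3):
    """Toy prior: homozygous-reference gets 1 - snp_rate, the rest share snp_rate evenly.

    The even split is an illustrative assumption, not SOAPsnp's prior table.
    """
    if ref_base not in "ACGT" or len(ref_base) != 1:
        raise ValueError("ref_base must be one of A, C, G, T")
    hom_ref = ref_base * 2
    others = [g for g in GENOTYPES if g != hom_ref]
    prior = {g: snp_rate / len(others) for g in others}
    prior[hom_ref] = 1.0 - snp_rate
    return prior


prior = genotype_prior("A")
```

With a reference base of A, almost all prior mass sits on AA, so the reads must provide substantial evidence before a variant genotype wins.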
Data probability
$$P(D \mid T_i) = \prod_{k=1}^{n} P(d_k \mid T_i)$$
$$P(d_k \mid T_i) = \frac{P(d_k \mid H_m) + P(d_k \mid H_n)}{2}, \qquad T_i = H_m H_n$$
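The two formulas combine into a per-genotype log-likelihood: average the two haplotype terms for each observed base, then multiply over bases (summed in log space here to avoid underflow). The per-haplotype term below is a stand-in error model assumed for illustration, with errors spread evenly over the three other bases; it is not the model used by SOAPsnp. The pileup column is hypothetical.

```python
import math


def base_likelihood(observed, haplotype_base, error_prob):
    """P(d_k | H): assumed simple error model, errors spread evenly over the 3 other bases."""
    if observed == haplotype_base:
        return 1.0 - error_prob
    return error_prob / 3.0


def log_likelihood(pileup, genotype):
    """log P(D | T_i) for T_i = H_m H_n, assuming bases are independent given the genotype."""
    h_m, h_n = genotype
    total = 0.0
    for observed, error_prob in pileup:
        # Each read base comes from either haplotype with probability 1/2.
        p = 0.5 * (base_likelihood(observed, h_m, error_prob)
                   + base_likelihood(observed, h_n, error_prob))
        total += math.log(p)
    return total


# Hypothetical pileup column: (observed base, error probability) pairs.
column = [("A", 0.001), ("G", 0.001), ("A", 0.01), ("G", 0.01), ("A", 0.001)]
scores = {g: log_likelihood(column, g) for g in ("AA", "AG", "GG")}
```

For this mixed column the heterozygous genotype AG scores highest, because each homozygous genotype has to explain several high-quality bases as sequencing errors.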
Data for each base (allele) is
1. observed base (allele), $o_k$
2. sequencing cycle, $c_k$
3. quality score (error probability), $q_k$
4. occurrence
$$P(d_k \mid H_m) = P(o_k, c_k, q_k \mid H_m) = P(o_k, c_k \mid H_m, q_k)\,P(q_k \mid H_m)$$
No model here: use a lookup table!
$P(o_k, c_k \mid H_m, q_k)$
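One way to picture such a lookup table is as relative frequencies tallied from observations whose true allele is taken as known. The sketch below is an assumption about how a table like this could be built; the slides only say that a table is used. The function names, the training tuples, and the floor value for unseen cells are all hypothetical.

```python
from collections import defaultdict


def build_lookup_table(observations):
    """Estimate P(o, c | H, q) by counting.

    observations: iterable of (true_allele, quality, observed_base, cycle).
    Returns a dict keyed by (true_allele, quality) whose values map
    (observed_base, cycle) to a relative frequency.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for true_allele, quality, observed, cycle in observations:
        counts[(true_allele, quality)][(observed, cycle)] += 1
    table = {}
    for key, cell in counts.items():
        total = sum(cell.values())
        table[key] = {oc: n / total for oc, n in cell.items()}
    return table


def lookup(table, observed, cycle, allele, quality, floor=1e-6):
    """Return P(o, c | H, q), falling back to a small assumed floor for unseen cells."""
    return table.get((allele, quality), {}).get((observed, cycle), floor)


# Hypothetical training observations, for illustration only.
observations = [("A", 30, "A", 1)] * 8 + [("A", 30, "G", 1), ("A", 30, "A", 2)]
table = build_lookup_table(observations)
p_match = lookup(table, "A", 1, "A", 30)
p_unseen = lookup(table, "T", 1, "A", 30)
```

Because the table is indexed by observed base, cycle, allele and quality, it can capture cycle-dependent and base-specific error patterns without a parametric model.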
Quality score recalibration
Substitution Errors
SOAPsnp
• Quality score recalibration and biased substitution rates are incorporated
• Uses a “Bayesian” formulation
• Simple model, easily implemented
• Independence across genomic loci
• Easily parallelized (see Crossbow)
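Because each locus is called independently, the per-locus work can be mapped over a pool of workers. The sketch below shows only that structure: `call_locus` is a placeholder that returns the most common base rather than a Bayesian call, the pileups are made up, and a thread pool stands in for the processes or cluster a real CPU-bound pipeline would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def call_locus(item):
    """Placeholder per-locus caller: returns the most common observed base."""
    position, bases = item
    base, _ = Counter(bases).most_common(1)[0]
    return position, base


def call_all(pileups, workers=4):
    """Call every locus independently; executor.map preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(call_locus, pileups.items()))


# Hypothetical pileups keyed by reference position, for illustration only.
pileups = {100: "AAAAG", 101: "CCCC", 102: "GGTGG"}
calls = call_all(pileups)
```

No locus needs results from any other, so the same mapping pattern carries over to splitting the genome into chunks across machines.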