Genotyping - University of Maryland, users.umiacs.umd.edu/.../lect15_genotyping/genotyping.pdf

Genotyping CMSC702 Spring 2014


Page 1: genotyping - University Of Marylandusers.umiacs.umd.edu/.../lect15_genotyping/genotyping.pdf · 2014. 4. 10. · Sequencing technologies - the next generation. Nat Rev Genet. 2010

Genotyping CMSC702 Spring 2014

Page 2

What makes them different?

Much human variation is due to differences in ~6 million base pairs (0.1% of the genome), referred to as SNPs

Page 3

Genomic DNA: SNP

TACATAGCCATCGGTANGTACTCAATGATGATAA

G

Single Nucleotide Polymorphism (SNP)

Three genotypes
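The three genotypes at a two-allele site are the unordered pairs of alleles. A minimal Python sketch, assuming the A/G site used on the following slides; the helper name `genotypes` is illustrative and not from the lecture:

```python
from itertools import combinations_with_replacement

def genotypes(alleles):
    """Return all unordered diploid genotypes over the given alleles."""
    ordered = sorted(alleles)
    return ["".join(pair) for pair in combinations_with_replacement(ordered, 2)]

print(genotypes("AG"))         # ['AA', 'AG', 'GG']
print(len(genotypes("ACGT")))  # 10
```

With all four bases allowed there are 10 possible diploid genotypes, which is the space a genotype caller has to score at each site.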

Page 4

Genotype AA

TACATAGCCATCGGTAAGTACTCAATGATGATA
ATGTATCGGTAGCCATTCATGAGTTACTACTAT

TACATAGCCATCGGTAAGTACTCAATGATGATA
ATGTATCGGTAGCCATTCATGAGTTACTACTAT

Mother

Father

Page 5

Genotype AG

TACATAGCCATCGGTAAGTACTCAATGATGATA
ATGTATCGGTAGCCATTCATGAGTTACTACTAT

TACATAGCCATCGGTAGGTACTCAATGATGATA
ATGTATCGGTAGCCATCCATGAGTTACTACTAT

Mother

Father

Page 6

Genotype GG

TACATAGCCATCGGTAGGTACTCAATGATGATA
ATGTATCGGTAGCCATCCATGAGTTACTACTAT

TACATAGCCATCGGTAGGTACTCAATGATGATA
ATGTATCGGTAGCCATCCATGAGTTACTACTAT

Mother

Father

Page 7

[Check, Nature 437]

Page 10

Personal Genomics

Page 11

Next-gen Sequencing Platforms

• Millions of short DNA fragments (~100 bp) sequenced in parallel

Source: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010

Source: Whiteford et al. Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics. 2009

name, sequence, quality scores

x 100s of millions
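Each read carries a name, a sequence, and per-base quality scores. The sketch below assumes the reads are stored as four-line FASTQ records with Phred+33 quality encoding; the slide does not name the file format, so both are assumptions, and `parse_fastq` and `phred_to_error` are illustrative helper names.

```python
from typing import Iterator, List, Tuple

def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (name, sequence, Phred scores) from 4-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        name = lines[i].lstrip("@").strip()
        seq = lines[i + 1].strip()
        # Phred+33 encoding is assumed: score = ASCII code - 33
        quals = [ord(ch) - 33 for ch in lines[i + 3].strip()]
        yield name, seq, quals

def phred_to_error(q: int) -> float:
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)

# One hypothetical record: name line, bases, separator, quality string
record = ["@read1", "GTCGCAGTAT", "+", "IIIIIIII5#"]
for name, seq, quals in parse_fastq(record):
    errors = [phred_to_error(q) for q in quals]
    print(name, seq, quals, errors)
```

The error probabilities derived this way are the per-base quantities that the genotype likelihoods later in the lecture depend on.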

Sequencing throughput

HiSeq 2500: 60 billion bp per day (2012)
HiSeq 2000: 25 billion bp per day (2010)
GA IIx: 5 billion bp per day (2009)
GA II: 1.6 billion bp per day (2008)

Images: www.illumina.com/systems

Numbers: www.politigenomics.com/next-generation-sequencing-informatics

Dates: Illumina press releases
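To put these throughput figures in perspective, the sketch below estimates how many instrument-days one deeply sequenced human genome would take. The genome size (3 billion bp) and the 30x target coverage are assumptions chosen for illustration, not numbers from the slides.

```python
GENOME_BP = 3e9  # assumed haploid human genome size, in base pairs
COVERAGE = 30    # assumed target depth of coverage

# Throughput figures as listed on the slide (bp per day)
throughput_bp_per_day = {
    "GA II (2008)": 1.6e9,
    "GA IIx (2009)": 5e9,
    "HiSeq 2000 (2010)": 25e9,
    "HiSeq 2500 (2012)": 60e9,
}

def days_needed(bp_per_day, genome_bp=GENOME_BP, coverage=COVERAGE):
    """Days of sequencing to reach `coverage`-fold depth over the genome."""
    return genome_bp * coverage / bp_per_day

for name, rate in throughput_bp_per_day.items():
    print(f"{name}: {days_needed(rate):.1f} days")
```

Under these assumptions the estimate drops from roughly two months on the GA II to under two days on the HiSeq 2500.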

Page 12

Sec-gen Sequencing for SNPs

...TAACGATTC...
...ATTGCTAAG...

...TAACGTTTC...
...ATTGCAAAG...

Pages 13-15

Sec-gen Sequencing for SNPs (figure-only slides)

Page 16

Sec-gen Sequencing for SNPs

Reference

GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT

Align

GTCGCAGTANCTGTCT
||||||||| ||||||
GTCGCAGTATCTGTCT

GGATCTGCGATATACC
|||||| |||||||||
GGATCT-CGATATACC

AATCTGATCTTATTTT
||||||||||||||||
AATCTGATCTTATTTT

ATATATATATATATAT
||||||||||||||||
ATATATATATATATAT

TCTCTCCCANNAGAGC
|||||||||  |||||
TCTCTCCCAGGAGAGC

Aggregate

GTCGCAGTATCTGTCT
GTCGCAGTATCTGTNN
TGTCGCAGTATCTGTC
TATGTCGCAGTATCTG
TATATCGCAGTATCTT
TATATCGCAGTATCTG
NATATCGCAGTATNTG
CCCTATATCGCAGTAT
ACACCCTATGTCGCA
ACACCCTATCTCGCA
ACACCCTATGTCGCA
GA-CACCCTATGTCGC
CCGGA-CACCCTATAT
CCGGA-CACCCTATAT
GCCGGA-CACCCTATG

“Coverage”

“Pileup” or “Coverage plot”

“Depth of coverage” = 14

Statistics

Call: HET A, G
p-value: 0.0023

(slide courtesy of Ben Langmead)

We want the probability of the genotype given the aligned bases:

$$P(T_i \mid D)$$

Page 17

SNP calling

• We will look at SOAP and samtools today: [Ruiqiang Li et al., Genome Research 2009; Heng Li, Bioinformatics 2011].

• Both use a “Bayesian” formulation

• This is also how “first” generation SNP-calling was done (BayesSNP).

• Many others use a similar formulation (MAQ, Atlas-SNP, FreeBayes).

• The main difference is in their probabilistic framework for the genotype.

SOAP: Short Oligonucleotide Analysis Package. samtools: Sequence Alignment/Map tools.

Page 18

SOAPsnp

$$P(T_i \mid D) = \frac{P(D \mid T_i)\, P(T_i)}{\sum_x P(D \mid T_x)\, P(T_x)}$$

$P(D \mid T_i)$: probability of data given genotype

$P(T_i)$: prior probability of genotype
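A minimal Python sketch of this Bayes' rule computation follows. The likelihood and prior values are invented for illustration and are not SOAPsnp's actual numbers; `posterior` is an illustrative helper name.

```python
def posterior(likelihoods, priors):
    """Bayes' rule: P(T_i | D) = P(D | T_i) P(T_i) / sum_x P(D | T_x) P(T_x)."""
    joint = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(joint.values())
    return {g: joint[g] / total for g in joint}

# Hypothetical values for an A/G site where the reference base is G
likelihoods = {"AA": 1e-6, "AG": 1e-2, "GG": 1e-6}  # P(D | T_i)
priors = {"AA": 0.001, "AG": 0.001, "GG": 0.998}    # P(T_i)

post = posterior(likelihoods, priors)
best = max(post, key=post.get)
print(best, post[best])
```

Even though the prior strongly favors the homozygous reference genotype, data that are much more likely under the heterozygous genotype move the posterior toward AG.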

Page 19

Prior Probabilities

Assuming:

1) SNP rate is $10^{-3}$

2) Error rate in reference is $10^{-5}$

Page 20

Data probability

[Figure: reference sequence with the read pileup from Page 16]

$$P(D \mid T_i) = \prod_{k=1}^{n} P(d_k \mid T_i)$$

$$P(d_k \mid T_i) = \frac{P(d_k \mid H_m) + P(d_k \mid H_n)}{2}, \qquad T_i = H_m H_n$$
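The two formulas above can be sketched in Python as a sum of log-probabilities over the observed bases. The per-allele term used here is a deliberately simple error model (correct with probability 1 - e, otherwise uniform over the other three bases); that model is an assumption standing in for SOAPsnp's own estimate of $P(d_k \mid H)$.

```python
import math

def base_likelihood(observed, allele, error):
    """Simplified P(d_k | H): the called base is correct with probability
    1 - error, otherwise one of the three other bases uniformly (assumption)."""
    return 1.0 - error if observed == allele else error / 3.0

def genotype_log_likelihood(pileup, genotype):
    """log P(D | T_i) for T_i = H_m H_n, treating observations as independent."""
    h_m, h_n = genotype
    total = 0.0
    for observed, error in pileup:
        # Each read comes from either allele with probability 1/2
        p = 0.5 * (base_likelihood(observed, h_m, error)
                   + base_likelihood(observed, h_n, error))
        total += math.log(p)
    return total

# Toy pileup column: (observed base, error probability from the quality score)
pileup = [("G", 0.001), ("A", 0.001), ("G", 0.01), ("A", 0.001), ("G", 0.001)]
log_liks = {g: genotype_log_likelihood(pileup, g) for g in ("AA", "AG", "GG")}
for g, ll in log_liks.items():
    print(g, ll)
```

Working in log space avoids numerical underflow when the product runs over many reads.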

Page 21

Data probability

[Figure: reference sequence with the read pileup from Page 16]

Data for each base (allele) is
1. observed base (allele)
2. sequencing cycle
3. quality score (error probability)
4. occurrence

$$P(d_k \mid T_i) = \frac{P(d_k \mid H_m) + P(d_k \mid H_n)}{2}$$

Page 22

Data probability

[Figure: reference sequence with the read pileup from Page 16]

Writing $o_k$ for the observed base, $c_k$ for the sequencing cycle, and $q_k$ for the quality score of observation $d_k$:

$$P(d_k \mid H_m) = P(o_k, c_k, q_k \mid H_m) = P(o_k, c_k \mid H_m, q_k)\, P(q_k \mid H_m)$$

Page 23

Data probability

[Figure: reference sequence with the read pileup from Page 16]

No model here: use a lookup table!

$$P(o_k, c_k \mid H_m, q_k)$$
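One way to build such a lookup table is to tabulate relative frequencies from observations where the true allele is taken as known. The sketch below is a hedged illustration of that idea, not SOAPsnp's actual training procedure; the tuple layout and the name `build_lookup` are assumptions.

```python
from collections import Counter, defaultdict

def build_lookup(observations):
    """Estimate P(o, c | H, q) as relative frequencies.

    `observations` holds (true_allele, quality, observed_base, cycle) tuples,
    gathered at sites where the true allele is assumed to be known.
    """
    counts = defaultdict(Counter)
    for allele, qual, observed, cycle in observations:
        counts[(allele, qual)][(observed, cycle)] += 1
    table = {}
    for condition, outcomes in counts.items():
        total = sum(outcomes.values())
        table[condition] = {oc: n / total for oc, n in outcomes.items()}
    return table

# Toy training data (hypothetical)
obs = [("A", 30, "A", 1), ("A", 30, "A", 1), ("A", 30, "A", 2), ("A", 30, "G", 2)]
table = build_lookup(obs)
print(table[("A", 30)][("G", 2)])  # 0.25
```

A practical table would also need some handling for combinations never seen in training, for example pseudocounts, which this sketch omits.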

Page 24

Quality score recalibration

Page 25

Substitution Errors

Page 26

SOAPsnp

• Quality score recalibration and biased substitution rates are incorporated

• Uses a “Bayesian” formulation

• Simple model, easily implemented

• Independence across genomic loci

• Easily parallelized (see Crossbow)
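Because each locus is scored independently, the per-locus call can be mapped over positions in any order or in parallel. The sketch below uses a thread pool from the Python standard library and a toy majority-vote caller as a stand-in for the Bayesian call; it does not reflect how Crossbow distributes work, and the names are illustrative.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def call_locus(item):
    """Toy per-locus caller: report the most common base (a stand-in for
    the Bayesian genotype call, not SOAPsnp's actual procedure)."""
    position, bases = item
    return position, Counter(bases).most_common(1)[0][0]

# Hypothetical pileup columns keyed by reference position
columns = [(100, "GGGAG"), (101, "TTTTT"), (102, "CCACC")]

# No locus depends on another, so the work can be split freely.
with ThreadPoolExecutor(max_workers=3) as pool:
    calls = dict(pool.map(call_locus, columns))
print(calls)
```

For CPU-bound calling on real data, separate processes or cluster jobs would be the more natural unit of parallelism; the thread pool here only demonstrates that the per-locus function needs no shared state.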