genotyping - university of marylandusers.umiacs.umd.edu/.../lect15_genotyping/genotyping.pdf ·...

Genotyping

CMSC702 Spring 2014

What makes them different?

Much human variation is due to differences in ~6 million base pairs (0.1% of the genome), referred to as SNPs

Genomic DNA:SNP

TACATAGCCATCGGTANGTACTCAATGATGATAA

G

Single Nucleotide Polymorphism (SNP)

Three genotypes

TACATAGCCATCGGTAAGTACTCAATGATGATA

AA

ATGTATCGGTAGCCATTCATGAGTTACTACTAT

TACATAGCCATCGGTAAGTACTCAATGATGATAATGTATCGGTAGCCATTCATGAGTTACTACTAT

Mother

Father

TACATAGCCATCGGTAAGTACTCAATGATGATA

AG

ATGTATCGGTAGCCATTCATGAGTTACTACTAT

TACATAGCCATCGGTAGGTACTCAATGATGATAATGTATCGGTAGCCATCCATGAGTTACTACTAT

Mother

Father

TACATAGCCATCGGTAGGTACTCAATGATGATA

GG

ATGTATCGGTAGCCATCCATGAGTTACTACTAT

TACATAGCCATCGGTAGGTACTCAATGATGATAATGTATCGGTAGCCATCCATGAGTTACTACTAT

Mother

Father

[Check, Nature 437]

Personal Genomics

Next-gen Sequencing Platforms

• Millions of short DNA fragments (~100 bp) sequenced in parallel


Source: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010

Source: Whiteford et al. Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics. 2009


name, sequence, quality scores

x 100s of millions
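The name/sequence/quality triple above can be read with a few lines of code. The sketch below assumes the reads are stored as four-line FASTQ records with Phred+33 quality encoding; the slides name neither the file format nor the encoding, so treat both as assumptions. `Read`, `parse_fastq`, and `error_probability` are illustrative names.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class Read(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # Phred scores, one per base


def parse_fastq(handle: TextIO, offset: int = 33) -> Iterator[Read]:
    """Yield reads from a stream of 4-line FASTQ records (assumed layout)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        # Drop the leading '@' and decode each quality character.
        yield Read(header[1:], seq, [ord(ch) - offset for ch in qual])


def error_probability(q: int) -> float:
    """Phred score Q -> probability that the base call is wrong."""
    return 10 ** (-q / 10)


# One made-up record: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
example = io.StringIO("@read1\nGTCGCAGTANCTGTCT\n+\nIIIIIIIII#IIIIII\n")
reads = list(parse_fastq(example))
```

The quality score is what later feeds the "error probability" term of the data model.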


Sequencing throughput

HiSeq 2500: 60 billion bp per day (2012)

HiSeq 2000: 25 billion bp per day (2010)

GA IIx: 5 billion bp per day (2009)

GA II: 1.6 billion bp per day (2008)

Images: www.illumina.com/systems

Numbers: www.politigenomics.com/next-generation-sequencing-informatics

Dates: Illumina press releases
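To put these throughput numbers in perspective, the sketch below estimates how many instrument-days each platform would need to sequence one human genome. The genome size (3 billion bp) and the 30x target coverage are assumptions chosen for illustration, not figures from the slides, and the calculation ignores run setup time, duplicates, and unmappable reads.

```python
GENOME_SIZE_BP = 3e9   # assumed haploid human genome size
TARGET_COVERAGE = 30   # assumed average depth of coverage

# Throughput figures as listed on the slides (bp per day).
THROUGHPUT_BP_PER_DAY = {
    "GA II (2008)": 1.6e9,
    "GA IIx (2009)": 5e9,
    "HiSeq 2000 (2010)": 25e9,
    "HiSeq 2500 (2012)": 60e9,
}


def days_for_coverage(bp_per_day: float,
                      genome_size: float = GENOME_SIZE_BP,
                      coverage: float = TARGET_COVERAGE) -> float:
    """Days of sequencing needed to produce genome_size * coverage bases."""
    return genome_size * coverage / bp_per_day


for name, rate in THROUGHPUT_BP_PER_DAY.items():
    print(f"{name}: {days_for_coverage(rate):.1f} days")
```

Under these assumptions the estimate drops from roughly two months on the GA II to under two days on the HiSeq 2500.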


Second-gen Sequencing for SNPs

...TAACGATTC...
...ATTGCTAAG...

...TAACGTTTC...
...ATTGCAAAG...


GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT

GTCGCAGTANCTGTCT
||||||||| ||||||
GTCGCAGTATCTGTCT

GGATCTGCGATATACC
|||||| |||||||||
GGATCT-CGATATACC

AATCTGATCTTATTTT
||||||||||||||||
AATCTGATCTTATTTT

ATATATATATATATAT
||||||||||||||||
ATATATATATATATAT

TCTCTCCCANNAGAGC
|||||||||  |||||
TCTCTCCCAGGAGAGC

Align Aggregate

Reference

Call: HET A, G p-value: 0.0023

GTCGCAGTATCTGTCT
GTCGCAGTATCTGTNN
TGTCGCAGTATCTGTC
TATGTCGCAGTATCTG
TATATCGCAGTATCTT
TATATCGCAGTATCTG
NATATCGCAGTATNTG
CCCTATATCGCAGTAT
ACACCCTATGTCGCA
ACACCCTATCTCGCA
ACACCCTATGTCGCA
GA-CACCCTATGTCGC
CCGGA-CACCCTATAT
CCGGA-CACCCTATAT
GCCGGA-CACCCTATG

Statistics

“Coverage”

“Pileup” or “Coverage plot”

“Depth of coverage” = 14

(slide courtesy of Ben Langmead)
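The aggregate step can be sketched as follows: each aligned read contributes its bases to the reference positions it covers, and the depth of coverage at a position is the number of bases stacked there. The start offsets below are hypothetical placements against a short stretch of the reference, since the original alignment offsets are not recoverable from the slide; `pileup` and `depth` are illustrative names.

```python
from collections import Counter
from typing import Dict, List, Tuple


def pileup(aligned_reads: List[Tuple[int, str]]) -> Dict[int, Counter]:
    """Map each reference position to a Counter of the bases covering it.

    aligned_reads holds (0-based start position, read sequence) pairs.
    '-' (deleted base) and 'N' (uncalled base) are skipped.
    """
    columns: Dict[int, Counter] = {}
    for start, seq in aligned_reads:
        for i, base in enumerate(seq):
            if base in "-N":
                continue
            columns.setdefault(start + i, Counter())[base] += 1
    return columns


def depth(columns: Dict[int, Counter], pos: int) -> int:
    """Depth of coverage: number of called bases at a reference position."""
    return sum(columns.get(pos, Counter()).values())


# Hypothetical placements against this reference fragment.
reference = "TATGTCGCAGTATCTGTCT"
reads = [
    (3, "GTCGCAGTATCTGTCT"),
    (2, "TGTCGCAGTATCTGTC"),
    (0, "TATGTCGCAGTATCTG"),
    (0, "TATATCGCAGTATCTT"),  # differs from the reference at positions 3 and 15
]
cols = pileup(reads)
```

Position 3 of this toy pileup holds three G bases and one A, which is the kind of column a SNP caller then has to interpret.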

We want the probability of the genotype given the aligned bases:

$P(T_i \mid D)$

SNP calling

• We will look at SOAP and samtools today: [Ruiqiang Li et al., Genome Research 2009; Heng Li, Bioinformatics 2011].

• Both use a “Bayesian” formulation

• This is also how “first” generation SNP-calling was done (BayesSNP).

• Many others use a similar formulation (MAQ, Atlas-SNP, FreeBayes).

• The main difference is in their probabilistic model of the genotype.

SOAP: Short Oligonucleotide Analysis Package

samtools: Sequence Alignment/Map tools

SOAPsnp

$$P(T_i \mid D) = \frac{P(D \mid T_i)\,P(T_i)}{\sum_x P(D \mid T_x)\,P(T_x)}$$

Probability of data given genotype

Prior probability of genotype
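Bayes' rule above amounts to multiplying each genotype's likelihood by its prior and normalizing over all genotypes. The sketch below shows this for three candidate genotypes; the likelihood and prior values are made up for illustration and do not come from SOAPsnp, and `genotype_posteriors` is an illustrative name.

```python
from typing import Dict


def genotype_posteriors(likelihoods: Dict[str, float],
                        priors: Dict[str, float]) -> Dict[str, float]:
    """Return P(T_i | D) for every genotype T_i via Bayes' rule."""
    joint = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(joint.values())  # denominator: sum over all genotypes T_x
    if total == 0:
        raise ValueError("all genotypes have zero probability")
    return {g: p / total for g, p in joint.items()}


# Illustrative numbers only: P(D | T) and P(T) for three genotypes.
likelihoods = {"AA": 1e-6, "AG": 2e-3, "GG": 1e-7}
priors = {"AA": 0.998, "AG": 0.001, "GG": 0.001}

posteriors = genotype_posteriors(likelihoods, priors)
best_call = max(posteriors, key=posteriors.get)
```

Even with a strong prior for the homozygous genotype AA, a large enough likelihood for AG makes the heterozygous call win.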

Prior Probabilities

Assuming:

1) SNP rate is $10^{-3}$

2) Error rate in the reference is $10^{-5}$
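One simple way to turn a SNP rate into genotype priors is sketched below. It is a deliberate simplification and not SOAPsnp's actual prior table: the homozygous-reference genotype gets probability 1 minus the SNP rate, the nine other diploid genotypes share the SNP rate equally, and the reference error rate is ignored. `simple_genotype_priors` is an illustrative name.

```python
from itertools import combinations_with_replacement
from typing import Dict

BASES = "ACGT"


def simple_genotype_priors(ref_base: str,
                           snp_rate: float = 1e-3) -> Dict[str, float]:
    """Assumed toy prior: uniform split of snp_rate over non-reference genotypes."""
    # The 10 unordered diploid genotypes: AA, AC, AG, AT, CC, ...
    genotypes = ["".join(pair)
                 for pair in combinations_with_replacement(BASES, 2)]
    hom_ref = ref_base * 2
    others = [g for g in genotypes if g != hom_ref]
    priors = {g: snp_rate / len(others) for g in others}
    priors[hom_ref] = 1.0 - snp_rate
    return priors


priors_at_g = simple_genotype_priors("G")
```

A more realistic prior would distinguish heterozygous from homozygous variants and transitions from transversions; the uniform split is only meant to show where the SNP rate enters the model.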

Data probability



Data for each base (allele) is
1. observed base (allele)
2. sequencing cycle
3. quality score (error probability)
4. occurrence

$$P(d_k \mid T_i) = \frac{P(d_k \mid H_m) + P(d_k \mid H_n)}{2}$$
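The averaging over the two alleles $H_m$ and $H_n$ of a diploid genotype can be sketched as follows. Two assumptions are made that the slides do not state: the per-allele term uses a simple uniform error model (a base is read correctly with probability 1 minus its error probability, otherwise as one of the three other bases with equal chance) instead of SOAPsnp's lookup table, and the observed bases are treated as independent so their log-likelihoods add up. Function names are illustrative.

```python
import math
from typing import List, Tuple


def allele_likelihood(observed: str, allele: str, error_prob: float) -> float:
    """P(d_k | H) under an assumed uniform substitution-error model."""
    return 1.0 - error_prob if observed == allele else error_prob / 3.0


def base_likelihood(observed: str, genotype: str, error_prob: float) -> float:
    """P(d_k | T_i): average of the likelihoods under the two alleles."""
    h_m, h_n = genotype  # e.g. "AG" -> alleles "A" and "G"
    return 0.5 * (allele_likelihood(observed, h_m, error_prob)
                  + allele_likelihood(observed, h_n, error_prob))


def log_data_likelihood(observations: List[Tuple[str, float]],
                        genotype: str) -> float:
    """log P(D | T_i), assuming independent bases with error_prob > 0."""
    return sum(math.log(base_likelihood(obs, genotype, err))
               for obs, err in observations)


# Made-up pileup column: six A bases and five G bases, each with error 0.001.
column = [("A", 0.001)] * 6 + [("G", 0.001)] * 5
scores = {g: log_data_likelihood(column, g) for g in ("AA", "AG", "GG")}
best_genotype = max(scores, key=scores.get)
```

With a near-even mix of A and G, both homozygous genotypes must explain about half the bases as sequencing errors, so the heterozygous genotype scores far higher.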

Data probability



Data for each base (allele) is
1. observed base (allele), $o_k$
2. sequencing cycle, $c_k$
3. quality score (error probability), $q_k$
4. occurrence

$$P(d_k \mid H_m) = P(o_k, c_k, q_k \mid H_m) = P(o_k, c_k \mid H_m, q_k)\,P(q_k \mid H_m)$$

Data probability



No model here: use a lookup table!

$P(o_k, c_k \mid H_m, q_k)$
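A lookup table of this kind can be estimated by counting. The sketch below tallies how often each (observed base, cycle) pair is seen for a given true allele and quality score, then normalizes the counts. Where the training observations come from (for example, aligned bases at sites assumed to match the reference) is an assumption here, the class applies no smoothing, and `LookupTable` is a hypothetical name, not SOAPsnp's data structure.

```python
from collections import defaultdict


class LookupTable:
    """Empirical estimate of P(observed, cycle | allele, quality)."""

    def __init__(self) -> None:
        # (allele, quality, observed, cycle) -> number of times seen
        self.counts = defaultdict(int)
        # (allele, quality) -> number of times seen
        self.totals = defaultdict(int)

    def add(self, allele: str, quality: int, observed: str, cycle: int) -> None:
        """Record one training observation."""
        self.counts[(allele, quality, observed, cycle)] += 1
        self.totals[(allele, quality)] += 1

    def prob(self, observed: str, cycle: int, allele: str, quality: int) -> float:
        """Relative frequency; 0.0 when the (allele, quality) pair is unseen."""
        total = self.totals.get((allele, quality), 0)
        if total == 0:
            return 0.0
        return self.counts.get((allele, quality, observed, cycle), 0) / total


# Made-up training data: true allele A at quality 30.
table = LookupTable()
training = ([("A", 30, "A", 1)] * 97
            + [("A", 30, "G", 1)] * 2
            + [("A", 30, "C", 2)] * 1)
for allele, quality, observed, cycle in training:
    table.add(allele, quality, observed, cycle)

p_match = table.prob("A", 1, "A", 30)
p_a_to_g = table.prob("G", 1, "A", 30)
```

Because the table is indexed by observed base and true allele separately, it can capture biased substitution rates (A read as G more often than as C, say) that a single error probability cannot.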

Quality score recalibration

Substitution Errors

SOAPsnp

• Quality score recalibration and biased substitution rates are incorporated

• Uses a “Bayesian” formulation

• Simple model, easily implemented

• Independence across genomic loci

• Easily parallelized (see Crossbow)
