SNP and Haplotype Analysis: Algorithms and Applications
Eran Halperin, International Computer Science Institute, Berkeley, California (CPM 2006)


CPM 2006 1

SNP and Haplotype Analysis
Algorithms and Applications

Eran Halperin
International Computer Science Institute
Berkeley, California


CPM 2006 2

“Computational Genetics”


CPM 2006 3

The Human Genome Project

“What we are announcing today is that we have reached a milestone…that is, covering the genome in…a working draft of the human sequence.”

“But our work previously has shown… that having one genetic code is important, but it's not all that useful.” (referring to comparative genomics).

“I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…”

Washington, DC, June 26, 2000.


CPM 2006 4

Individually Tailored Medicine

People react to different drugs in different ways.

The vision: a simple DNA test would help to determine which medicine to prescribe.


CPM 2006 5

• An international consortium that aims to genotype the genomes of 270 individuals from four different populations.
• Launched in 2002. The first phase was finished in October (Nature, 2005).


CPM 2006 6

Motivation

Environmental Factors (50%)

Genetic Factors (50%)

Complex disease

Multiple genes may affect the disease.

Therefore, the effect of every single gene may be negligible.


CPM 2006 7

Disease Association Studies

The search for genetic factors

Comparing the DNA contents of two populations:

• Cases - individuals carrying the disease.
• Controls - background population.

A significant discrepancy between the two populations is evidence for a causal gene.


CPM 2006 8

Where should we look?

[Figure: aligned DNA sequences for the cases and for the controls, with the associated SNP marked.]

SNP = Single Nucleotide Polymorphism
Usually SNPs are bi-allelic (only two letters appear).
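The case/control comparison can be illustrated with a small allelic test at a single bi-allelic SNP. The sketch below is not taken from the slides: it assumes the "significant discrepancy" is measured with a Pearson chi-square statistic on a 2x2 table of allele counts, and the function name `allele_chi_square` is hypothetical.

```python
def allele_chi_square(case_counts, control_counts):
    """Pearson chi-square statistic for a 2x2 table of allele counts.

    Each argument is (copies of allele 1, copies of allele 2).
    Assumes every row and column total is non-zero.
    """
    observed = (tuple(case_counts), tuple(control_counts))
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    n = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if allele frequency were equal in both groups.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic


# Illustrative (made-up) counts: 60/40 among cases versus 40/60 among controls.
print(allele_chi_square((60, 40), (40, 60)))
```

With one degree of freedom, a statistic above roughly 3.84 corresponds to p < 0.05 for a single SNP; a genome-wide scan would additionally need a correction for the number of SNPs tested.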


CPM 2006 10

Genotyping Technology

• Extracting the allele information for a SNP from a DNA sample.

• Considerable reductions in genotyping cost over the last couple of years.

• Current cost allows for the genotyping of 500,000 SNPs for ~$1000 (compared to ~50 cents per SNP 3-4 years ago).


CPM 2006 11

Computational Challenges


CPM 2006 12

Haplotypes

• SNPs in physical proximity are correlated.

• A sequence of alleles along a chromosome is called a haplotype.


CPM 2006 13

Haplotype Block Structure

(Daly et al., 2001) Block 6 from Chromosome 5q31


CPM 2006 14

Haplotypes as Proxies for Rare SNPs

Common haplotypes:
– 011000111 (23% of population)
– 000001111 (55% of population)
– 111111111 (14% of population)

Tag SNPs

000001111
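To make the idea of tag SNPs concrete, the following sketch searches for the smallest set of SNP positions that still tells the three common haplotypes above apart. It is a brute-force illustration only; the function name `smallest_tag_set` and the 0-based position numbering are assumptions, and the slide does not say which positions it marks as tags.

```python
from itertools import combinations


def smallest_tag_set(haplotypes):
    """Return the first smallest tuple of 0-based sites that separates all haplotypes."""
    n_sites = len(haplotypes[0])
    n_distinct = len(set(haplotypes))
    for size in range(1, n_sites + 1):
        for sites in combinations(range(n_sites), size):
            # Restrict every haplotype to the candidate tag sites.
            signatures = {tuple(h[s] for s in sites) for h in haplotypes}
            if len(signatures) == n_distinct:
                return sites
    return tuple(range(n_sites))


common = ["011000111", "000001111", "111111111"]
print(smallest_tag_set(common))
```

A single bi-allelic SNP can separate at most two haplotypes, so three haplotypes need at least two tags; typing those two positions is then enough to identify which common haplotype is present.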


CPM 2006 15

Tag SNP Selection

• Input: a set of genotypes

• Goal: find a set of t tag SNPs such that using these SNPs only, the error rate for the prediction of all other SNPs is minimized.

Formulation by [H., Kimmel, Shamir, 05’] (STAMPA)


CPM 2006 16

Tag SNPs

• Correlations between SNPs

[Figure: aligned DNA sequences for the cases and for the controls.]


CPM 2006 17

Basic Assumption

Given two SNPs, the probabilities of the values at any intermediate SNPs do not change if we know the values of additional distal ones.

[Diagram: SNP j, intermediate SNPs, SNP k]


CPM 2006 18

STAMPA (Selection of TAg SNPs to Maximize Prediction Accuracy)

1. Put aside one test genotype. Use the rest of the data to develop a majority rule for each pair of SNPs to predict intermediate SNP values.

2. Average prediction error over all test genotypes gives a score to the pair j and k.

3. Apply dynamic programming to obtain the best set of tag SNPs.

[Diagram: test genotype with SNP j, intermediate SNPs, SNP k]
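The three steps can be sketched roughly as follows. This is a simplified illustration, not the published STAMPA code: it assumes genotypes are tuples over {0,1,2}, forces the first and last SNP to be tags so that every other SNP lies between two tags, counts errors instead of averaging them, and falls back to the overall majority when no training genotype matches both tags. The names `pair_error` and `select_tags` are hypothetical.

```python
from collections import Counter
from functools import lru_cache


def pair_error(genotypes, j, k):
    """Leave-one-out errors when SNPs strictly between tags j and k are predicted.

    Needs at least two genotypes so that the training set is never empty.
    """
    errors = 0
    for t, test in enumerate(genotypes):
        train = genotypes[:t] + genotypes[t + 1:]
        # Training genotypes that agree with the test genotype at both tags.
        matching = [g for g in train if g[j] == test[j] and g[k] == test[k]]
        pool = matching or train  # assumption: fall back to all training data
        for i in range(j + 1, k):
            guess = Counter(g[i] for g in pool).most_common(1)[0][0]
            errors += guess != test[i]
    return errors


def select_tags(genotypes, t):
    """Pick t tags minimizing the summed pair errors; returns (errors, tags)."""
    last = len(genotypes[0]) - 1

    @lru_cache(maxsize=None)
    def best(k, used):
        # Minimum error for SNPs 0..k given that SNP k is tag number `used`.
        if used == 1:
            return (0, (0,)) if k == 0 else (float("inf"), ())
        options = [(float("inf"), ())]
        for j in range(k):
            err, tags = best(j, used - 1)
            if tags:
                options.append((err + pair_error(genotypes, j, k), tags + (k,)))
        return min(options)

    return best(last, t)
```

The dynamic program relies on the basic assumption stated earlier: the error for SNPs between two consecutive tags depends only on that pair, so the total error decomposes into a sum over consecutive tag pairs.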


CPM 2006 19

Comparison: STAMPA vs. ldSelect

[Figure: STAMPA (marked x) versus ldSelect on 52 sets of Yoruba genotypes (Gabriel et al., 2002).]


CPM 2006 20

The haplotype ancestral structure of two subtypes of NHL. The trees are automatically generated by HAP (H., Eskin, 04’).


CPM 2006 21

Phasing

• Cost-effective genotyping technology gives genotypes and not haplotypes.

[Diagram: two haplotypes, ATCCGAAGACGC and ATACGAAGCCGC (mother chromosome and father chromosome), shown next to the genotype they produce. Possible phases: AGACGAATCCGC ….]
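Why a genotype admits several phases can be shown by enumeration. The sketch below uses the {0,1,2} genotype encoding that appears later in the deck, read here as 0 and 1 for the two homozygous states and 2 for a heterozygous site; the function name `possible_phases` is hypothetical.

```python
from itertools import product


def possible_phases(genotype):
    """List every unordered haplotype pair that is consistent with a genotype string."""
    het_sites = [i for i, g in enumerate(genotype) if g == "2"]
    if not het_sites:
        return [(genotype, genotype)]

    phases = []
    # Fix the first heterozygous site to avoid listing mirrored pairs twice.
    for rest in product("01", repeat=len(het_sites) - 1):
        first, second = list(genotype), list(genotype)
        for site, allele in zip(het_sites, ("0",) + rest):
            first[site] = allele
            second[site] = "1" if allele == "0" else "0"
        phases.append(("".join(first), "".join(second)))
    return phases


for pair in possible_phases("02022"):
    print(pair)
```

A genotype with h heterozygous sites has 2^(h-1) candidate phases, which is why phasing needs a model (a tree, a likelihood, or family data) to choose among them.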


CPM 2006 22

Public Genotype Data Growth

Timeline, 2001 to 2006:
• Daly et al., Nature Genetics: 103 SNPs, 40,000 genotypes
• Gabriel et al., Science: 3,000 SNPs, 400,000 genotypes
• TSC Data, Nucleic Acids Research: 35,000 SNPs, 4,500,000 genotypes
• Perlegen Data, Science: 1,570,000 SNPs, 100,000,000 genotypes
• NCBI dbSNP, Genome Research: 3,000,000 SNPs, 286,000,000 genotypes
• HapMap Phase 2: 5,000,000+ SNPs, 600,000,000+ genotypes

- HAP’s speed allows it to phase whole-genome datasets.
- HAP is very accurate (Marchini et al., 2006).


CPM 2006 23

HAP Phasing Model

• A directed phylogenetic tree.
• {0,1} alphabet.
• Each site mutates at most once.
• No recombination.
• Goal: Finding a phase that fits the tree model.

Formulation: [Gusfield, 2003]

[Tree diagram, edges labeled by the mutating site: 00000 -2-> 01000; 01000 -1-> 11000; 01000 -5-> 01001; 11000 -3-> 11100; 11100 -4-> 11110.]
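Whether a set of haplotypes fits this tree model can be checked without building the tree. The sketch below uses the standard pairwise condition for a directed perfect phylogeny rooted at the all-zero haplotype: for any two sites, the sets of haplotypes carrying a 1 must be nested or disjoint. The function name is hypothetical, and this is only a check on resolved haplotypes, not the HAP phasing algorithm itself.

```python
from itertools import combinations


def fits_directed_perfect_phylogeny(haplotypes):
    """True if the 0/1 haplotypes fit a tree rooted at all zeros with one mutation per site."""
    n_sites = len(haplotypes[0])
    # For each site, the set of haplotypes (by index) that carry the mutated allele.
    carriers = [
        frozenset(h for h, hap in enumerate(haplotypes) if hap[s] == "1")
        for s in range(n_sites)
    ]
    for a, b in combinations(carriers, 2):
        overlapping = a & b
        nested = a <= b or b <= a
        if overlapping and not nested:
            return False
    return True


tree_nodes = ["00000", "01000", "11000", "01001", "11100", "11110"]
print(fits_directed_perfect_phylogeny(tree_nodes))        # haplotypes read off the tree
print(fits_directed_perfect_phylogeny(["10", "01", "11"]))  # needs a recurrent mutation
```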


CPM 2006 24

Example

Genotypes

02022

22200

21222

21200

02000

01022

Haplotypes

00000

01000

11100

01011

[Tree diagram, edges labeled by the mutating site: 00000 -2-> 01000; 01000 -1-> 11000; 01000 -5-> 01001; 11000 -3-> 11100; 01001 -4-> 01011.]

Given the tree and the haplotypes the phase is unique
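The uniqueness claim can be checked directly on the example. The sketch below pairs up the four listed haplotypes and reports, for each genotype, which pairs produce it; it assumes 2 marks a heterozygous site, and the helper names `combine` and `explanations` are hypothetical.

```python
from itertools import combinations_with_replacement


def combine(h1, h2):
    """Genotype produced by two haplotypes: shared allele, or 2 where they differ."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def explanations(genotype, haplotypes):
    """All unordered haplotype pairs from the given set that yield the genotype."""
    return [
        (h1, h2)
        for h1, h2 in combinations_with_replacement(haplotypes, 2)
        if combine(h1, h2) == genotype
    ]


haplotypes = ["00000", "01000", "11100", "01011"]
genotypes = ["02022", "22200", "21222", "21200", "02000", "01022"]
for g in genotypes:
    print(g, explanations(g, haplotypes))
```

Each genotype in the example is explained by exactly one pair, so once the haplotype set is fixed the phase is determined.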


CPM 2006 25

Phasing via Greedy

• A simple heuristic:
– Find a haplotype that is compatible with as many genotypes as possible.
– Assign the haplotype to these genotypes.
– Continue with the rest of the genotypes.

• Intuition: Haplotypes with missing data.


CPM 2006 26

Haplotypes with missing data

Input:
111*11*1
00*01*1*
01*000*0
11*11*11
*111**00
1111*11*
01*00010

Goal: Find a maximum likelihood phase.

Output:
11111111
00001111
01000010
11111111
11110000
11111111
01000010
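The greedy heuristic can be tried on this input. The sketch below reads the slide's input as seven rows of eight characters, which is an assumption about the lost layout, and finds the best haplotype by brute force over all binary strings, which is only feasible for short rows. The names `compatible` and `greedy_phase` are hypothetical.

```python
from itertools import product


def compatible(row, haplotype):
    """A row with '*' for missing data is compatible if all known positions match."""
    return all(r == "*" or r == h for r, h in zip(row, haplotype))


def greedy_phase(rows):
    """Repeatedly assign the haplotype that completes the most unassigned rows."""
    candidates = ["".join(bits) for bits in product("01", repeat=len(rows[0]))]
    assigned = [None] * len(rows)
    while None in assigned:
        open_rows = [i for i, a in enumerate(assigned) if a is None]
        # max() keeps the first best candidate, so ties break lexicographically.
        best = max(
            candidates,
            key=lambda h: sum(compatible(rows[i], h) for i in open_rows),
        )
        for i in open_rows:
            if compatible(rows[i], best):
                assigned[i] = best
    return assigned


rows = ["111*11*1", "00*01*1*", "01*000*0", "11*11*11",
        "*111**00", "1111*11*", "01*00010"]
for row, hap in zip(rows, greedy_phase(rows)):
    print(row, "->", hap)
```

Rows that end up alone in their group have free positions that this sketch fills with the lexicographically first choice, so those rows can differ from the slide's output while using the same number of distinct haplotypes.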


CPM 2006 27

Greedy Analysis (H., Karp, 2005)

• Maximum likelihood == minimum entropy solution.

• Entropy(Greedy) < Entropy(OPT) + 3.

• Can be viewed as a variant of set cover.
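The link between likelihood and entropy can be checked numerically. Under the assumption that haplotype probabilities are estimated by their empirical frequencies in the solution, the log-likelihood of N haplotypes equals -N times the entropy of those frequencies, so maximizing one minimizes the other. The function names below are hypothetical.

```python
from collections import Counter
from math import log2


def haplotype_entropy(haplotypes):
    """Entropy (in bits) of the empirical haplotype distribution."""
    total = len(haplotypes)
    return sum((c / total) * log2(total / c) for c in Counter(haplotypes).values())


def log2_likelihood(haplotypes):
    """Log-likelihood of the sample under its own empirical frequencies."""
    total = len(haplotypes)
    counts = Counter(haplotypes)
    return sum(log2(counts[h] / total) for h in haplotypes)


solution = ["11111111", "00001111", "01000010", "11111111",
            "11110000", "11111111", "01000010"]
print(haplotype_entropy(solution))
print(log2_likelihood(solution), -len(solution) * haplotype_entropy(solution))
```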


CPM 2006 28

Mother, Father, Child Trios

• Advantages:
– Better phasing results (Marchini et al., 06’).
– Robustness to population stratification (Spielman et al., 93’).

• Disadvantage:
– 50% more expensive (and thus, reduces power).


CPM 2006 29

Inferring Haplotypes From Trios

Genotypes:
Parent 1: 122112
Parent 2: 210022
Child: 120222

Haplotype pairs as they are resolved (? = not yet determined):
Parent 1: 1??11? / 1??11?, then 10?11? / 11?11?, then 10011? / 11111?
Parent 2: ?100?? / ?100??, then 1100?? / 0100??, then 11000? / 01001?
Child: 1?0??? / 1?0???, then 100??? / 110???, then 10011? / 11000?

Assumption: No recombination
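The site-by-site reasoning for the child can be sketched as below. The code assumes the encoding 0 and 1 for homozygous sites and 2 for heterozygous sites, does not check for Mendelian inconsistencies, and uses the hypothetical name `phase_child`.

```python
FLIP = {"0": "1", "1": "0"}


def phase_child(parent1, parent2, child):
    """Return the child's haplotypes (from parent 1, from parent 2); '?' if undetermined."""
    from_p1, from_p2 = [], []
    for p1, p2, c in zip(parent1, parent2, child):
        if c != "2":
            # Homozygous child: both parents transmitted the same allele.
            a1 = a2 = c
        elif p1 != "2":
            # Parent 1 is homozygous, so it can only have transmitted its own allele.
            a1, a2 = p1, FLIP[p1]
        elif p2 != "2":
            a2, a1 = p2, FLIP[p2]
        else:
            # All three heterozygous: the site cannot be resolved from the trio alone.
            a1 = a2 = "?"
        from_p1.append(a1)
        from_p2.append(a2)
    return "".join(from_p1), "".join(from_p2)


print(phase_child("122112", "210022", "120222"))
```

On the slide's trio this reproduces the child's final pair, with the last site left open because all three individuals are heterozygous there.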


CPM 2006 30

Genotyping Trios via DNA pools
[Beckman, Abel, Braun, H.]

[Diagram: father (F), mother (M), and child (C).]


CPM 2006 31

Configuration                               1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Mother transmitted allele                   A  A  A  A  A  A  A  A  G  G  G  G  G  G  G  G
Mother untransmitted allele                 A  A  A  A  G  G  G  G  A  A  A  A  G  G  G  G
Father transmitted allele                   A  A  G  G  A  A  G  G  A  A  G  G  A  A  G  G
Father untransmitted allele                 A  G  A  G  A  G  A  G  A  G  A  G  A  G  A  G
Father and Child pool, allele frequency     0  1  2  3  0  1  2  3  1  2  3  4  1  2  3  4
Mother and Child pool, allele frequency     0  0  1  1  1  1  2  2  2  2  3  3  3  3  4  4

- Every configuration has a different pair of values.
- Except for configurations 7 and 10 (het-het-het).
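The table can be regenerated by enumeration. The sketch below assumes the pool values count copies of allele G among the four chromosomes in each two-person pool, which matches the numbers shown; the function name `pool_signatures` is hypothetical.

```python
from collections import defaultdict
from itertools import product


def pool_signatures():
    """Map (father+child G count, mother+child G count) to configuration numbers."""
    signatures = defaultdict(list)
    # Order: mother transmitted, mother untransmitted, father transmitted, father untransmitted.
    for number, (mt, mu, ft, fu) in enumerate(product("AG", repeat=4), start=1):
        child = (mt, ft)
        father_child = (ft, fu) + child
        mother_child = (mt, mu) + child
        key = (father_child.count("G"), mother_child.count("G"))
        signatures[key].append(number)
    return signatures


collisions = {k: v for k, v in pool_signatures().items() if len(v) > 1}
print(collisions)
```

The only pair of pool values shared by two configurations is (2, 2), from configurations 7 and 10, where mother, father, and child are all heterozygous.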


CPM 2006 32

Genotyping Unrelated Individuals

Edge size: pool size (accuracy)
Vertex degree: amount of DNA used


CPM 2006 33

An algebraic view

\[
A = \begin{pmatrix}
1 & 0 & 0 & 1 & 1 \\
0 & 1 & 1 & 1 & 0 \\
1 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0
\end{pmatrix}
\]

Is there at most one solution to $Ax = b$ with $x \in \{0,1,2\}^5$?
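For a matrix this small the question can be answered by exhaustive search. The sketch below reads each row of A as one pool and each entry of x as the number of copies of an allele carried by one individual, which is an interpretation of the slide rather than something it states; the function names are hypothetical.

```python
from itertools import product


def pool_counts(A, x):
    """Compute Ax for a 0/1 pooling matrix A and a genotype vector x."""
    return tuple(sum(a * v for a, v in zip(row, x)) for row in A)


def is_uniquely_decodable(A):
    """True if no two vectors in {0,1,2}^n give the same pool counts."""
    seen = set()
    for x in product((0, 1, 2), repeat=len(A[0])):
        b = pool_counts(A, x)
        if b in seen:
            return False
        seen.add(b)
    return True


A = [
    [1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
]
print(is_uniquely_decodable(A))
```

For this A the search finds no collision among the 243 vectors, so four pools suffice to recover five genotypes; the search is exponential in n and only meant for small examples.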


CPM 2006 34

For every m, what is the largest n, so that m equations uniquely determine the n {0,1,2} variables?

For every $m$, what is the largest $n$ for which there exists $A \in \{0,1\}^{m \times n}$ such that for all $x, x' \in \{0,1,2\}^n$, $Ax = Ax' \Rightarrow x = x'$?


CPM 2006 35

Lower Bound

• A random matrix $A$.
– For every $x \in \{-2,-1,0,1,2\}^n$, $A_i x = 0$ with prob. $O(k^{-0.5})$, where $k$ is the number of non-zero elements.
– Since the rows are independent, the probability that $Ax = 0$ is $O(k^{-m/2})$.
– Using union bound, $n = \Omega(m \log m)$.


CPM 2006 36

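Upper Bound

• Counting argument:
– Each entry of $Ax$ is an integer between 0 and $2n$, so there are at most $(2n+1)^m$ different values that $Ax$ can take.
– There are $3^n$ values for $x$.
– Unique decoding requires $3^n \le (2n+1)^m$, i.e. $n \log 3 \le m \log(2n+1)$, and so $n = O(m \log m)$.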
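The counting argument can be evaluated for concrete m. The sketch below returns the largest n that passes the counting test, using the fact that each entry of Ax is an integer between 0 and 2n and hence Ax takes at most (2n+1)^m values. It only gives an upper limit on n, not a matrix that achieves it, and the function name is hypothetical.

```python
def counting_upper_bound(m):
    """Largest n with 3**n <= (2n + 1)**m, for m >= 1."""
    n = 1
    # Once 3**n overtakes (2n + 1)**m it stays ahead, so a linear scan is enough.
    while 3 ** (n + 1) <= (2 * (n + 1) + 1) ** m:
        n += 1
    return n


for m in (1, 2, 4, 8, 16):
    print(m, counting_upper_bound(m))
```

For m = 4 the bound is n = 11, compared with the n = 5 reached by the example matrix in the algebraic view.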


CPM 2006 37

Further Challenges

• Population stratification
– In case/control studies and in family-based studies.
– Admixed populations.

• Other pooling schemes
– Practical considerations: error rates, missing data, scalability, etc.

• Inferring evolutionary processes (e.g. selection, recombination rate, haplotype ancestry, etc.).


CPM 2006 38

Summary

• Exciting times in genetics: changes in medicine may be felt in our lifetime.
– An opportunity for computer scientists to have a huge impact.

• Interdisciplinary work is needed. It involves computer science, statistics, genetics, biology, and medicine.


CPM 2006 39

Acknowledgement

• UCSD
– Eleazar Eskin

• Tel-Aviv U.
– Ron Shamir
– Gad Kimmel
– Noga Alon

• HIIT
– Matti Kaariainen

• Sequenom Inc.
– Andreas Braun
– Ken Abel

• Perlegen Sciences
– David Hinds
– David Cox

• UC Berkeley
– Richard Karp
– Chris Skibola

• MPI
– Rene Beier

• CHORI
– Kenny Beckman


CPM 2006 40