Page 1: Computational and Statistical Challenges in Association Studies

Computational and Statistical Challenges in Association Studies

Eleazar Eskin

University of California, Los Angeles

Page 2: Computational and Statistical Challenges in Association Studies

The Human Genome Project “What we are announcing today is that we have reached a milestone…that is, covering the genome in…a working draft of the human sequence.”

“I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…”

Washington, DC, June 26, 2000.

Page 3: Computational and Statistical Challenges in Association Studies

Human Genetics

Disease Risk: “Genetic” factors account for 20%-80% of disease risk. Many genes contribute to “complex” diseases.

Personalized Medicine: Treatment decisions influenced by diagnostics.

Understanding Disease Biology: New drug targets. Understanding of the mechanism of disease.

[Figure: Mother, Father, and Child, annotated with risk factors.]

Where are the risk factors? (Genetic Basis of Disease)

Page 4: Computational and Statistical Challenges in Association Studies

Disease Association Studies: The search for genetic factors

Comparing the DNA contents of two populations:

• Cases - individuals carrying the disease.
• Controls - background population.

A difference within a gene between the two populations is evidence that the gene is involved in the disease.

Page 5: Computational and Statistical Challenges in Association Studies

Single Nucleotide Polymorphisms (SNPs)

AGAGCCGTCGACAGGTATAGCCTAAGAGCCGTCGACATGTATAGTCTA

AGAGCAGTCGACAGGTATAGTCTAAGAGCAGTCGACAGGTATAGCCTA

AGAGCCGTCGACATGTATAGCCTAAGAGCAGTCGACATGTATAGCCTA

AGAGCCGTCGACAGGTATAGCCTAAGAGCCGTCGACAGGTATAGCCTA

Human Variation: Humans differ by 0.1% of their DNA. A significant fraction of this variation is accounted for by SNPs.

Page 6: Computational and Statistical Challenges in Association Studies

Single Nucleotide Polymorphisms: Association Analysis

[Figure: aligned DNA sequences for cases (individuals with the disease) and controls (healthy individuals), with the associated SNP marked.]

Page 7: Computational and Statistical Challenges in Association Studies

Single Nucleotide Polymorphisms: Association Analysis

[Figure: aligned DNA sequences for a large number of cases and controls.]

Challenges: Millions of common SNPs. Correlations between SNPs. False positives.

Page 8: Computational and Statistical Challenges in Association Studies

Single Nucleotide Polymorphisms (SNPs)

[Figure: aligned DNA sequences for a large number of cases and controls.]

Challenges: Millions of common SNPs. Correlations between SNPs. SNP locations unknown. False positives.

Page 9: Computational and Statistical Challenges in Association Studies

• Successor to the Human Genome Project.
• International consortium that aims to genotype the genomes of 270 individuals from four different populations.
• Launched in 2002. First phase was finished in October (Nature, 2005).
• Collected genotypes for 3.9 million SNPs.
• Location and correlation structure of many common SNPs.

Page 10: Computational and Statistical Challenges in Association Studies

Public Genotype Data Growth

Timeline, 2001 to 2006, in order of appearance:

Source          Venue                     SNPs          Genotypes
Daly et al.     Nature Genetics           103           40,000
Gabriel et al.  Science                   3,000         400,000
TSC Data        Nucleic Acids Research    35,000        4,500,000
Perlegen Data   Science                   1,570,000     100,000,000
NCBI dbSNP      Genome Research           3,000,000     286,000,000
HapMap Phase 2                            5,000,000+    600,000,000+

More SNPs increase genome coverage in association studies.

More genotypes allow for discovery of weaker associations.

Page 11: Computational and Statistical Challenges in Association Studies

Some Computational Challenges

Genetics - identifying disease genes
• Haplotype phasing - preprocessing SNPs
• Association study design
• Association study analysis
• Population stratification
• Inferring evolutionary processes (recombination rates, selection, haplotype ancestry)
• Etc.

Genomics - functions of disease genes
• Predicting functional effect of variation
• Understanding disease effect on gene regulation
• Understanding disease effect on metabolic pathways
• Combining systems biology with genetics
• Etc.

HAP, WHAP, SAT Tagger

Page 12: Computational and Statistical Challenges in Association Studies

Haplotype Phasing using Imperfect Phylogeny

Page 13: Computational and Statistical Challenges in Association Studies

Haplotype Phasing

High-throughput, cost-effective sequencing technology gives genotypes and not haplotypes.

Haplotypes:
ATCCGA (mother chromosome)
AGACGC (father chromosome)

Genotype: A, T/G, C/A, C, G, A/C

Possible phases:
ATACGA / AGCCGC
AGACGA / ATCCGC
….
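The ambiguity on this slide can be made concrete with a short sketch that lists every haplotype pair compatible with the genotype A, T/G, C/A, C, G, A/C. The function name `enumerate_phasings` and the per-site tuple encoding are illustrative assumptions, not part of the talk.

```python
from itertools import product

def enumerate_phasings(genotype):
    """Return all unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of (allele1, allele2) tuples, one per site (hypothetical encoding).
    """
    het_sites = [i for i, (a, b) in enumerate(genotype) if a != b]
    phasings = []
    # Fix the orientation of the first heterozygous site so each unordered
    # pair is listed once: 2^(k-1) pairs for k heterozygous sites.
    n_free = max(len(het_sites) - 1, 0)
    for flips in product([False, True], repeat=n_free):
        flip_at = dict(zip(het_sites[1:], flips))
        h1, h2 = [], []
        for i, (a, b) in enumerate(genotype):
            if flip_at.get(i, False):
                a, b = b, a
            h1.append(a)
            h2.append(b)
        phasings.append(("".join(h1), "".join(h2)))
    return phasings

# Genotype from the slide: sites 2, 3 and 6 are heterozygous.
genotype = [("A", "A"), ("T", "G"), ("C", "A"), ("C", "C"), ("G", "G"), ("A", "C")]
phasings = enumerate_phasings(genotype)
for h1, h2 in phasings:
    print(h1, h2)
```

With three heterozygous sites there are four candidate pairs, and the number doubles with every additional heterozygous site, which is why phasing needs extra modeling assumptions.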

Page 14: Computational and Statistical Challenges in Association Studies

Haplotype Limited Diversity

Previous studies on local haplotype structure: (Daly et al., 2001) chromosome 5q31. (Patil et al., 2001) chromosome 21.

Study findings: The SNPs on each haplotype are correlated. SNPs can be separated into blocks of limited diversity.

Local regions have few haplotypes.

Page 15: Computational and Statistical Challenges in Association Studies

Haplotype Data in a Block

(Daly et al., 2001) Block 6 from Chromosome 5q31

Page 16: Computational and Statistical Challenges in Association Studies

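Example Phasing

Genotype encoding: 2 = heterozygous (1/0), 1 = homozygous (1/1), 0 = homozygous (0/0).

Each haplotype is followed, in parentheses, by the number of times it is used in that resolution.

Genotype    1st possible resolution          2nd possible resolution
22222222    11111110 (2) + 00000001 (7)      11100110 (1) + 00011001 (1)
22000001    11000001 (3) + 00000001 (7)      10000001 (2) + 01000001 (2)
22022002    11011000 (2) + 00000001 (7)      01011001 (1) + 10000000 (1)
22222222    11111110 (2) + 00000001 (7)      10101110 (1) + 01010001 (1)
22000001    11000001 (3) + 00000001 (7)      11000001 (1) + 00000001 (1)
22022002    11011000 (2) + 00000001 (7)      11001000 (1) + 00010001 (1)
22000001    11000001 (3) + 00000001 (7)      01000001 (2) + 10000001 (2)

Which resolution, the first or the second? Maximum Likelihood Criterion.

Maximum Likelihood Haplotype Inference is an NP-Hard Problem.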
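As a rough illustration of the maximum likelihood criterion, the sketch below scores the two resolutions from this slide. It assumes a simple model that the slide does not spell out: haplotype frequencies are estimated from the resolution itself and chromosomes are treated as independent draws. The helper names are hypothetical.

```python
from collections import Counter
from math import log

def genotype_of(h1, h2):
    """Encode a haplotype pair as a genotype: 0 = 0/0, 1 = 1/1, 2 = heterozygous."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))

def log_likelihood(resolution):
    """Log-likelihood of a resolution under frequencies estimated from its own
    haplotype counts (an assumed, simplified model)."""
    counts = Counter(h for pair in resolution for h in pair)
    total = sum(counts.values())
    return sum(c * log(c / total) for c in counts.values())

genotypes = ["22222222", "22000001", "22022002", "22222222",
             "22000001", "22022002", "22000001"]

first = [("11111110", "00000001"), ("11000001", "00000001"),
         ("11011000", "00000001"), ("11111110", "00000001"),
         ("11000001", "00000001"), ("11011000", "00000001"),
         ("11000001", "00000001")]

second = [("11100110", "00011001"), ("10000001", "01000001"),
          ("01011001", "10000000"), ("10101110", "01010001"),
          ("11000001", "00000001"), ("11001000", "00010001"),
          ("01000001", "10000001")]

for resolution in (first, second):
    # Both resolutions must reproduce the observed genotypes.
    assert [genotype_of(h1, h2) for h1, h2 in resolution] == genotypes

ll_first, ll_second = log_likelihood(first), log_likelihood(second)
print(ll_first, ll_second)
```

Under this model the first resolution, which reuses a small set of haplotypes, scores higher than the second, which introduces twelve distinct haplotypes for seven individuals.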

Page 17: Computational and Statistical Challenges in Association Studies

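Narrowing the Search: Perfect Phylogeny

• A directed phylogenetic tree.
• {0,1} alphabet.
• Each site mutates at most once.
• No recombination.

[Figure: a perfect phylogeny tree over the haplotypes 00000, 01000, 11000, 01001, 11100, and 11110, with each edge labeled by the site (1 to 5) that mutates on it.]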
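A minimal sketch of what the perfect phylogeny conditions imply for haplotype data, assuming the root is the all-zeros haplotype as in the figure. It uses the classical pairwise condition (the sets of haplotypes carrying the mutated allele at any two sites are either disjoint or nested); the function name is an illustrative choice and this is not the HAP algorithm.

```python
from itertools import combinations

def admits_directed_perfect_phylogeny(haplotypes):
    """Check whether 0/1 haplotypes fit a perfect phylogeny rooted at all zeros.

    For every pair of sites, the sets of haplotypes carrying allele 1 must be
    disjoint or nested; otherwise some site would have to mutate twice.
    """
    n_sites = len(haplotypes[0])
    carriers = [{k for k, h in enumerate(haplotypes) if h[j] == "1"}
                for j in range(n_sites)]
    for s, t in combinations(carriers, 2):
        if s & t and not (s <= t or t <= s):
            return False
    return True

# Haplotypes at the nodes of the tree on this slide.
tree_haplotypes = ["00000", "01000", "11000", "01001", "11100", "11110"]
fits_tree = admits_directed_perfect_phylogeny(tree_haplotypes)

# Three haplotypes that cannot be explained with one mutation per site.
fits_conflict = admits_directed_perfect_phylogeny(["10", "01", "11"])
print(fits_tree, fits_conflict)
```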

Page 18: Computational and Statistical Challenges in Association Studies

The Perfect Phylogeny Haplotype Problem (PPH)

Given genotypes over a short region, find compatible haplotypes which correspond to a perfect phylogeny tree [Gusfield '02].

PPH deficiency - the data does not fit the model.

Page 19: Computational and Statistical Challenges in Association Studies

Solving PPH

A very simple $O(nm^2)$ algorithm for the PPH problem. (Also Gusfield 02, Bafna et al., 2003)

But – in practice, we do not expect to see perfect phylogeny in biological data.

We extend our algorithms to the case where the data is almost perfect phylogeny.

Eskin, Halperin, Karp. “Large Scale Reconstruction of Haplotypes from Genotype Data.” RECOMB 2003.

Page 20: Computational and Statistical Challenges in Association Studies

HAP Algorithm

HAP Local Predictions http://research.calit2.net/hap/ Over 6,000 users of webserver.

Main Ideas: Imperfect Phylogeny Maximum Likelihood Criterion

Extremely efficient. Orders of magnitude faster than other algorithms.

Eskin, Halperin, Karp. “Large Scale Reconstruction of Haplotypes from Genotype Data.” RECOMB 2003.

Page 21: Computational and Statistical Challenges in Association Studies

Public Genotype Data Growth

Timeline, 2001 to 2006, in order of appearance:

Source          Venue                     SNPs          Genotypes
Daly et al.     Nature Genetics           103           40,000
Gabriel et al.  Science                   3,000         400,000
TSC Data        Nucleic Acids Research    35,000        4,500,000
Perlegen Data   Science                   1,570,000     100,000,000
NCBI dbSNP      Genome Research           3,000,000     286,000,000
HapMap Phase 2                            5,000,000+    600,000,000+

HAP timeline: Eskin, Halperin, Karp, RECOMB 2003.

Page 22: Computational and Statistical Challenges in Association Studies

Phasing Methods

HAP is one of many phasing algorithms: Clark, 1990; Excoffier and Slatkin, 1995; PHASE (Stephens et al., 2001); HAPLOTYPER (Niu et al., 2002); Gusfield, 2000; Lancia et al., 2001; and many more.

How do we phase entire chromosomes?

Algorithms were designed for only 4-12 SNPs!

HAP “tiling” extension: phasing for long regions.

Leverages the speed of HAP.

Page 23: Computational and Statistical Challenges in Association Studies

Scaling to Whole Genomes: HAP-TILE

• For each window we compute the haplotypes using HAP.
• We tile the windows using dynamic programming.

[Figure: genotypes and the local predictions computed for each window.]

Page 24: Computational and Statistical Challenges in Association Studies

Haplotype Tiling Problem (ignoring homozygous positions)

0010000011011011111001

001000 110111 010000 101111 011111 100000 000101 111010 000011 111100 100110 011001

(minimum number of conflicts)

• NP-Hard Problem
• Dynamic Programming Solution

(Eskin et al. 2004.)

Page 25: Computational and Statistical Challenges in Association Studies

Phasing Running Time Comparison (Phaseoff Competition)

Marchini et al. American Journal of Human Genetics, 2006.

HAP is over 1000x faster than PHASE.

Page 26: Computational and Statistical Challenges in Association Studies

Public Genotype Data Growth

Timeline, 2001 to 2006, in order of appearance:

Source          Venue                     SNPs          Genotypes
Daly et al.     Nature Genetics           103           40,000
Gabriel et al.  Science                   3,000         400,000
TSC Data        Nucleic Acids Research    35,000        4,500,000
Perlegen Data   Science                   1,570,000     100,000,000
NCBI dbSNP      Genome Research           3,000,000     286,000,000
HapMap Phase 2                            5,000,000+    600,000,000+

HAP timeline: Eskin, Halperin, Karp, RECOMB 2003. Perlegen collaboration (12 hours). NCBI dbSNP collaboration (24 hours). (48 hours)

Page 27: Computational and Statistical Challenges in Association Studies

Only 103 SNPs, 0.02% of the genome!

[Figure label: RECOMB 2003 Submission]

Page 28: Computational and Statistical Challenges in Association Studies

Weighted Haplotype Association

Page 29: Computational and Statistical Challenges in Association Studies

Association Statistics

Assume we are given N/2 cases and N/2 control individuals.

Since each individual has 2 chromosomes, we have a total of N case chromosomes and N control chromosomes.

At SNP A, let $\hat{p}_A^+$ and $\hat{p}_A^-$ be the observed case and control frequencies respectively.

We know that:

$$\hat{p}_A^+ \sim N\!\left(p_A^+,\; p_A^+(1-p_A^+)/N\right)$$

$$\hat{p}_A^- \sim N\!\left(p_A^-,\; p_A^-(1-p_A^-)/N\right)$$

Page 30: Computational and Statistical Challenges in Association Studies

Association Statistics

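$$\hat{p}_A^+ \sim N\!\left(p_A^+,\; p_A^+(1-p_A^+)/N\right)$$

$$\hat{p}_A^- \sim N\!\left(p_A^-,\; p_A^-(1-p_A^-)/N\right)$$

$$\hat{p}_A^+ - \hat{p}_A^- \sim N\!\left(p_A^+ - p_A^-,\; \left(p_A^+(1-p_A^+) + p_A^-(1-p_A^-)\right)/N\right)$$

We approximate

$$p_A^+(1-p_A^+) + p_A^-(1-p_A^-) \approx 2\,\hat{p}_A(1-\hat{p}_A)$$

where $\hat{p}_A = (\hat{p}_A^+ + \hat{p}_A^-)/2$ is the pooled observed frequency. Then, if $p_A^+ = p_A^-$,

$$S_A = \frac{\hat{p}_A^+ - \hat{p}_A^-}{\sqrt{2/N}\,\sqrt{\hat{p}_A(1-\hat{p}_A)}} \sim N(0,1)$$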

Page 31: Computational and Statistical Challenges in Association Studies

Association Statistic

Under the null hypothesis, $p_A^+ - p_A^- = 0$.

We compute the statistic $S_A$:

$$S_A = \frac{\hat{p}_A^+ - \hat{p}_A^-}{\sqrt{2/N}\,\sqrt{\hat{p}_A(1-\hat{p}_A)}} \sim N(0,1)$$

If $S_A < \Phi^{-1}(\alpha/2)$ or $S_A > -\Phi^{-1}(\alpha/2)$, then the association is significant at level $\alpha$.
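The test on this slide can be sketched in a few lines. The function names and the example frequencies are illustrative assumptions; the pooled frequency is taken to be the average of the case and control frequencies, and N is the number of chromosomes in each group.

```python
from math import sqrt
from statistics import NormalDist

def association_statistic(p_case, p_control, n_chromosomes):
    """S_A for observed case/control allele frequencies.

    n_chromosomes is N, the number of chromosomes in each group.
    The pooled frequency is assumed to be the average of the two groups.
    """
    p_pooled = (p_case + p_control) / 2
    scale = sqrt(2 / n_chromosomes) * sqrt(p_pooled * (1 - p_pooled))
    return (p_case - p_control) / scale

def is_significant(s, alpha):
    """Two-sided test: reject when S_A < Phi^-1(alpha/2) or S_A > -Phi^-1(alpha/2)."""
    threshold = NormalDist().inv_cdf(alpha / 2)  # a negative number
    return s < threshold or s > -threshold

# Hypothetical example: allele frequency 0.30 in cases and 0.25 in controls.
s = association_statistic(p_case=0.30, p_control=0.25, n_chromosomes=2000)
print(round(s, 3), is_significant(s, alpha=0.05))
```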

Page 32: Computational and Statistical Challenges in Association Studies

Association Power

Let's assume that SNP A is causal and $p_A^+ \neq p_A^-$.

Given the true $p_A^+$ and $p_A^-$, if we collect N individuals and compute the statistic $S_A$, the probability that $S_A$ is significant at level $\alpha$ is the power.

Power is the chance of detecting an association of a certain strength with a certain number of individuals.

Page 33: Computational and Statistical Challenges in Association Studies

Association Statistic

Let's assume that $p_A^+ \neq p_A^-$. Then

$$S_A = \frac{\hat{p}_A^+ - \hat{p}_A^-}{\sqrt{2/N}\,\sqrt{\hat{p}_A(1-\hat{p}_A)}} \sim N\!\left(\frac{p_A^+ - p_A^-}{\sqrt{2/N}\,\sqrt{p_A(1-p_A)}},\; 1\right)$$

$$S_A \sim N\!\left(\frac{(p_A^+ - p_A^-)\sqrt{N}}{\sqrt{2\,p_A(1-p_A)}},\; 1\right)$$

$$S_A \sim N\!\left(\lambda_A\sqrt{N},\; 1\right)$$

Page 34: Computational and Statistical Challenges in Association Studies

Association Power

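$$S_A = \frac{\hat{p}_A^+ - \hat{p}_A^-}{\sqrt{2/N}\,\sqrt{\hat{p}_A(1-\hat{p}_A)}} \sim N\!\left(\lambda_A\sqrt{N},\; 1\right)$$

[Figure: distribution of $S_A$ centered at the non-centrality parameter $\lambda_A\sqrt{N}$; the area beyond the threshold for significance is the power of the association test.]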

Page 35: Computational and Statistical Challenges in Association Studies

Association Power

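The statistical power of an association with N individuals, non-centrality parameter $\lambda\sqrt{N}$, and significance threshold $\alpha$ is

$$P(\alpha, \lambda\sqrt{N}) = \Phi\!\left(\Phi^{-1}(\alpha/2) + \lambda\sqrt{N}\right) + 1 - \Phi\!\left(-\Phi^{-1}(\alpha/2) + \lambda\sqrt{N}\right)$$

Note that if $\lambda = 0$, power is always $\alpha$.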
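A small sketch of the power formula, using the standard normal distribution from the Python standard library. The function name and the example frequencies are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def association_power(lam, n, alpha):
    """Power of the two-sided association test.

    lam: lambda, so that the statistic is distributed N(lam * sqrt(n), 1).
    n: sample size N. alpha: significance level.
    """
    phi = NormalDist()
    c = phi.inv_cdf(alpha / 2)
    ncp = lam * sqrt(n)  # non-centrality parameter
    return phi.cdf(c + ncp) + 1 - phi.cdf(-c + ncp)

# Hypothetical true frequencies: 0.30 in cases, 0.25 in controls, average 0.275.
lam = (0.30 - 0.25) / sqrt(2 * 0.275 * (1 - 0.275))
for n in (500, 1000, 2000, 4000):
    print(n, round(association_power(lam, n, alpha=0.05), 3))
```

The loop shows power increasing with N for a fixed effect, and setting `lam` to 0 returns the significance level itself, matching the note on the slide.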

Page 36: Computational and Statistical Challenges in Association Studies

Indirect Association

Now let's assume that we have 2 markers, A and B. Let us assume that marker B is the causal mutation, but we are observing marker A.

If we observed marker B directly, our statistic would be

$$S_B \sim N\!\left(\lambda_B\sqrt{N},\; 1\right), \qquad \lambda_B = \frac{p_B^+ - p_B^-}{\sqrt{2\,p_B(1-p_B)}}$$

Page 37: Computational and Statistical Challenges in Association Studies

Indirect Association

However, we are observing A, where our statistic is

$$S_A \sim N\!\left(\lambda_A\sqrt{N},\; 1\right), \qquad \lambda_A = \frac{p_A^+ - p_A^-}{\sqrt{2\,p_A(1-p_A)}}$$

What is the relation between $S_A$ and $S_B$?

Page 38: Computational and Statistical Challenges in Association Studies

Indirect Association

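We want to relate

$$S_A \sim N\!\left(\lambda_A\sqrt{N},\; 1\right), \qquad \lambda_A = \frac{p_A^+ - p_A^-}{\sqrt{2\,p_A(1-p_A)}}$$

to

$$S_B \sim N\!\left(\lambda_B\sqrt{N},\; 1\right), \qquad \lambda_B = \frac{p_B^+ - p_B^-}{\sqrt{2\,p_B(1-p_B)}}$$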

Page 39: Computational and Statistical Challenges in Association Studies

Indirect Association

We assume conditional probability distributions are equal in case and control samples

$$p_A^+ = p_{AB}^+ + p_{Ab}^+$$

$$p_A^+ = p_B^+\,p_{A|B} + (1-p_B^+)\,p_{A|b}$$

$$p_A^- = p_B^-\,p_{A|B} + (1-p_B^-)\,p_{A|b}$$

$$p_A^+ - p_A^- = p_{A|B}\,(p_B^+ - p_B^-) - p_{A|b}\,(p_B^+ - p_B^-)$$

$$p_A^+ - p_A^- = (p_B^+ - p_B^-)(p_{A|B} - p_{A|b})$$

Page 40: Computational and Statistical Challenges in Association Studies

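Indirect Association

Then

$$\lambda_A = \frac{p_A^+ - p_A^-}{\sqrt{2\,p_A(1-p_A)}} = \frac{(p_B^+ - p_B^-)(p_{A|B} - p_{A|b})}{\sqrt{2\,p_A(1-p_A)}}$$

$$= \frac{(p_B^+ - p_B^-)(p_{A|B} - p_{A|b})}{\sqrt{2\,p_A(1-p_A)}} \cdot \frac{\sqrt{2\,p_B(1-p_B)}}{\sqrt{2\,p_B(1-p_B)}}$$

$$= \frac{p_B^+ - p_B^-}{\sqrt{2\,p_B(1-p_B)}} \cdot \frac{(p_{A|B} - p_{A|b})\sqrt{2\,p_B(1-p_B)}}{\sqrt{2\,p_A(1-p_A)}}$$

$$= \lambda_B\,\frac{(p_{A|B} - p_{A|b})\sqrt{2\,p_B(1-p_B)}}{\sqrt{2\,p_A(1-p_A)}}$$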

Page 41: Computational and Statistical Challenges in Association Studies

Indirect Association

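Note that

$$\lambda_A = \lambda_B\,\frac{(p_{A|B} - p_{A|b})\sqrt{2\,p_B(1-p_B)}}{\sqrt{2\,p_A(1-p_A)}}$$

$$= \lambda_B \left(\frac{p_{AB}}{p_B} - \frac{p_{Ab}}{1-p_B}\right) \frac{\sqrt{p_B(1-p_B)}}{\sqrt{p_A(1-p_A)}}$$

$$= \lambda_B \left(\frac{p_{AB} - p_{AB}\,p_B - p_{Ab}\,p_B}{p_B(1-p_B)}\right) \frac{\sqrt{p_B(1-p_B)}}{\sqrt{p_A(1-p_A)}}$$

$$= \lambda_B\,\frac{p_{AB} - p_A\,p_B}{\sqrt{p_A(1-p_A)\,p_B(1-p_B)}} = \lambda_B\sqrt{r^2}$$

$$\lambda_A = \lambda_B\sqrt{r^2}$$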

Page 42: Computational and Statistical Challenges in Association Studies

Indirect Association

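How many individuals, $N_A$, do we need to collect at marker A to achieve the same power as if we collected $N_B$ individuals at marker B?

$$S_A \sim N\!\left(\lambda_A\sqrt{N_A},\; 1\right), \qquad S_B \sim N\!\left(\lambda_B\sqrt{N_B},\; 1\right)$$

Equal power requires equal non-centrality parameters. Using $\lambda_A = \lambda_B\sqrt{r^2}$:

$$\lambda_A\sqrt{N_A} = \lambda_B\sqrt{N_B}$$

$$\lambda_B\sqrt{r^2}\,\sqrt{N_A} = \lambda_B\sqrt{N_B}$$

$$N_A = \frac{N_B}{r^2}$$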
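A numerical check of the indirect association argument, as a sketch with made-up haplotype frequencies. It computes the correlation coefficient r between markers A and B, verifies that the observed marker's λ equals r times the causal marker's λ when the conditional probabilities are the same in cases and controls, and then applies the sample size rule $N_A = N_B / r^2$. All names and numbers here are hypothetical.

```python
from math import sqrt

def correlation_r(p_ab, p_a, p_b):
    """Correlation coefficient between two biallelic markers."""
    return (p_ab - p_a * p_b) / sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))

def required_sample_size(n_b, r):
    """Individuals needed at the proxy marker to match n_b at the causal marker."""
    return n_b / r ** 2

# Hypothetical population frequencies.
p_a, p_b, p_ab = 0.30, 0.20, 0.18
r = correlation_r(p_ab, p_a, p_b)

# Causal marker B: case and control frequencies averaging to p_b.
pb_case, pb_control = 0.25, 0.15
p_a_given_B = p_ab / p_b                  # P(A | B)
p_a_given_b = (p_a - p_ab) / (1 - p_b)    # P(A | b)

# Frequencies at A follow from the conditional probabilities.
pa_case = pb_case * p_a_given_B + (1 - pb_case) * p_a_given_b
pa_control = pb_control * p_a_given_B + (1 - pb_control) * p_a_given_b

lam_b = (pb_case - pb_control) / sqrt(2 * p_b * (1 - p_b))
lam_a = (pa_case - pa_control) / sqrt(2 * p_a * (1 - p_a))

n_a = required_sample_size(1000, r)
print(round(r, 4), round(lam_a, 6), round(r * lam_b, 6), round(n_a, 1))
```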

Page 43: Computational and Statistical Challenges in Association Studies

Visualization in terms of Power

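[Figure: two distributions with non-centrality parameters $\lambda_A\sqrt{N}$ and $\lambda_B\sqrt{N}$; for each, the area beyond the threshold for significance is the power of the association test.]

$$\lambda_A = \lambda_B\sqrt{r^2}$$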

Page 44: Computational and Statistical Challenges in Association Studies

Correlating Haplotypes with the Disease

The disease may be correlated with a SNP not in the panel.

The disease may be more correlated with a haplotype (group of SNPs) than with any single SNP in the panel.

Haplotype tests: Which haplotypes should we test? Which blocks should we pick?

Page 45: Computational and Statistical Challenges in Association Studies

Key Problem: Indirect Association

We have the HapMap. Information on 4,000,000 SNPs.

Affymetrix gene chip collects information on 500,000 SNPs. What about the remaining 3,500,000 SNPs?

So far, we have designed studies by picking tag SNPs with high $r^2$.

Can we use the HapMap when performing association? Multi-Tag methods.

Page 46: Computational and Statistical Challenges in Association Studies

Haplotypes as Proxies for Hidden SNPs (de Bakker et al., 2005)

Haplotypes (SNPs 1-5)    Freq.
1 2 3 4 5
A A A A A                .25
A G A G G                .15
A G A G A                .10
G A G G G                .25
G G G G G                .25


Page 47: Computational and Statistical Challenges in Association Studies

WHAP - Weighted Haplotypes

Haplotypes (SNPs 1-5)    Freq.
1 2 3 4 5
A A A A A                .25
A G A G G                .15
A G A G A                .10
G A G G G                .25
G G G G G                .25

A: 0.71 AA + 0.29 AG

Page 48: Computational and Statistical Challenges in Association Studies

Basic MultiMarker Method

For each SNP in HapMap, find haplotype among genotyped SNPs that has highest r2 to the SNP.

Perform association at each SNP and each added haplotype.

Now instead of performing 500,000 tests, we perform 4,000,000 tests.

Page 49: Computational and Statistical Challenges in Association Studies

Weighted Haplotype Test

For each haplotype h, we assign a weight $w_h$.

We use a “weighted” allele frequency statistic:

$$W_h = \sum_h w_h\,(p_h^+ - p_h^-)$$

This statistic is the weighted numerator in $S_A$. What is the variance of this statistic?

Complication: Haplotype frequencies are not independent!

Page 50: Computational and Statistical Challenges in Association Studies

Weighted Haplotype Example

Assume we have 4 haplotypes AB, Ab, aB and ab.

If we set the weights so that $w_{AB} = w_{Ab} = 1$ and $w_{aB} = w_{ab} = 0$, this is equivalent to looking at the single SNP A.

If we set the weights so that $w_{AB} = 1$ and $w_{Ab} = w_{aB} = w_{ab} = 0$, this is equivalent to looking at the single haplotype AB.

Other weights can be something in between.

Page 51: Computational and Statistical Challenges in Association Studies

The -test

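The test statistic, as a function of the weights w, is

$$\frac{N \left( \sum_{h=1}^{k} w_h \left( p_h^{\mathrm{case}} - p_h^{\mathrm{control}} \right) \right)^2}{2 \left( \sum_{h=1}^{k} w_h^2 p_h - \left( \sum_{h=1}^{k} w_h p_h \right)^2 \right)}$$

Each haplotype h is assigned a weight $w_h$. N is the number of individuals. $p_h^{\mathrm{case}}$ and $p_h^{\mathrm{control}}$ are the probabilities of h in cases and controls; $p_h$ is their average.

Under the null, the test statistic is $\chi^2$ distributed.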
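A small sketch of the weighted statistic on this slide may help make the roles of the weights concrete. It assumes that the unsuperscripted frequency in the denominator is the case/control average; the haplotype frequencies and the sample size below are hypothetical, and the function name is made up for this example.

```python
def weighted_haplotype_statistic(weights, p_case, p_control, n_individuals):
    """N * (sum_h w_h (p_case_h - p_control_h))^2 / (2 * weighted variance)."""
    # Assumption: the pooled frequency p_h is the case/control average.
    p_avg = [(a + b) / 2 for a, b in zip(p_case, p_control)]
    diff = sum(w * (a - b) for w, a, b in zip(weights, p_case, p_control))
    mean_w = sum(w * p for w, p in zip(weights, p_avg))
    var_w = sum(w * w * p for w, p in zip(weights, p_avg)) - mean_w ** 2
    return n_individuals * diff ** 2 / (2 * var_w)


# Hypothetical frequencies for haplotypes AB, Ab, aB, ab.
p_case = [0.30, 0.30, 0.20, 0.20]
p_control = [0.25, 0.25, 0.25, 0.25]

# Weights (1, 1, 0, 0) pick out allele A: the single-SNP test.
stat_snp_a = weighted_haplotype_statistic([1, 1, 0, 0], p_case, p_control, 1000)
# Weights (1, 0, 0, 0) pick out haplotype AB: the single-haplotype test.
stat_hap_ab = weighted_haplotype_statistic([1, 0, 0, 0], p_case, p_control, 1000)
print(round(stat_snp_a, 2), round(stat_hap_ab, 2))
```

With weights (1, 1, 0, 0) the denominator collapses to $2 p_A (1 - p_A)$, so the value matches the ordinary allele-frequency statistic for SNP A.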

Page 52: Computational and Statistical Challenges in Association Studies

Non-Centrality Parameter

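Under weights $w_1, w_2, w_3, w_4$ and true case/control probabilities $p_1^+, p_2^+, p_3^+, p_4^+$ and $p_1^-, p_2^-, p_3^-, p_4^-$, $W_h$ is expected to be

$$W_h = \sum_{i=1}^{4} w_i \left( p_i^+ - p_i^- \right)$$

When normalizing for the variance, the non-centrality parameter is

$$\lambda_h \sqrt{N} = \frac{\sum_{i=1}^{4} w_i \left( p_i^+ - p_i^- \right)}{\sqrt{2/N} \, \sqrt{\sum_{i=1}^{4} w_i^2 p_i - \left( \sum_{i=1}^{4} w_i p_i \right)^2}}$$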
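To connect the non-centrality parameter to power, here is a hedged sketch. It computes the normalized quantity (lambda times the square root of N) from weights and true case/control frequencies, then converts it to two-sided power under the standard assumption that the normalized statistic is normal with unit variance. The pooled frequency is taken to be the case/control average, and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def noncentrality(weights, p_case, p_control, n_individuals):
    """Expected weighted frequency difference divided by its standard deviation."""
    # Assumption: the pooled frequency p_i is the case/control average.
    p_avg = [(a + b) / 2 for a, b in zip(p_case, p_control)]
    diff = sum(w * (a - b) for w, a, b in zip(weights, p_case, p_control))
    mean_w = sum(w * p for w, p in zip(weights, p_avg))
    var_w = sum(w * w * p for w, p in zip(weights, p_avg)) - mean_w ** 2
    return diff / (sqrt(2 / n_individuals) * sqrt(var_w))


def power(ncp, alpha=0.05):
    """Two-sided power, assuming the statistic is Normal(ncp, 1)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(ncp - z) + NormalDist().cdf(-ncp - z)


# Hypothetical true frequencies for four haplotypes.
p_case = [0.30, 0.30, 0.20, 0.20]
p_control = [0.25, 0.25, 0.25, 0.25]
ncp_value = noncentrality([1, 1, 0, 0], p_case, p_control, 1000)
print(round(ncp_value, 3), round(power(ncp_value), 3))
```

When the non-centrality parameter is zero, the power equals the significance level, which is a quick sanity check on the formula.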

Page 53: Computational and Statistical Challenges in Association Studies

$W_h$ and indirect association

Let us assume that SNP C is causal with non-centrality parameter $\lambda_C$.

If we perform weighted haplotype association, the non-centrality parameter is $\lambda_h$.

How are they related? (That is, what is the power of the weighted haplotype association test?)

Using the same technique, we can show that $\lambda_h = r_h \lambda_C$, where $r_h$ is the conceptual equivalent of r in the two-SNP case.

Page 54: Computational and Statistical Challenges in Association Studies

The Relation to Power

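$$r_h^2 = \frac{\left( \sum_{h=1}^{k} w_h \left( q_{hC} - q_{hc} \right) \right)^2 p_C (1 - p_C)}{\sum_{h} w_h^2 p_h - \left( \sum_{h} w_h p_h \right)^2}$$

$q_{hC} = P(h \mid C)$

$q_{hc} = P(h \mid c)$

$p_C = P(C)$

The power of detecting the SNP with N individuals is the same as using the tag SNPs with $N / r_h^2$ individuals.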
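The relation between the weighted correlation and sample size can be checked numerically. The sketch below evaluates the squared weighted correlation for arbitrary weights and converts it to the equivalent sample size. It assumes the example haplotype table from the slides, with SNPs 1 and 2 as tags and allele 'A' at SNP 5 playing the causal role; that choice, and the function name, are assumptions made for illustration.

```python
def r_h_squared(weights, p_h, p_h_and_c, p_c):
    """Squared weighted correlation between tag haplotypes and a causal allele.

    p_h[i]       frequency of tag haplotype i
    p_h_and_c[i] joint frequency of tag haplotype i with the causal allele C
    p_c          frequency of the causal allele C
    """
    q_h_given_cap = [j / p_c for j in p_h_and_c]                       # P(h | C)
    q_h_given_low = [(p - j) / (1 - p_c) for p, j in zip(p_h, p_h_and_c)]  # P(h | c)
    shift = sum(w * (a - b)
                for w, a, b in zip(weights, q_h_given_cap, q_h_given_low))
    mean_w = sum(w * p for w, p in zip(weights, p_h))
    var_w = sum(w * w * p for w, p in zip(weights, p_h)) - mean_w ** 2
    return shift ** 2 * p_c * (1 - p_c) / var_w


# Tag haplotypes AA, AG, GA, GG over SNPs 1-2 of the example table.
p_h = [0.25, 0.25, 0.25, 0.25]
p_h_and_c = [0.25, 0.10, 0.0, 0.0]   # joint with 'A' at SNP 5
p_c = 0.35

r2_snp1 = r_h_squared([1, 1, 0, 0], p_h, p_h_and_c, p_c)    # single SNP 1
r2_hap_aa = r_h_squared([1, 0, 0, 0], p_h, p_h_and_c, p_c)  # single haplotype AA

n_direct = 1000
n_needed = n_direct / r2_snp1  # individuals needed when testing via SNP 1
print(round(r2_snp1, 3), round(r2_hap_aa, 3), round(n_needed))
```

With the single-SNP weights the function reduces to the ordinary pairwise $r^2$ between SNP 1 and SNP 5, which is one way to confirm the reconstruction of the formula.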

Page 55: Computational and Statistical Challenges in Association Studies

Choosing the Weights

Haplotypes

1 2 3 4 5

A A A A A .25

A G A G G .15

A G A G A .10

G A G G G .25

G G G G G .25

Optimal weights:

$$w_h(s_5) = P(s_5 = \text{‘A’} \mid h) = q_{Ah}$$

Page 56: Computational and Statistical Challenges in Association Studies

The Relation to Power

$$r_h^2 = \frac{\sum_{h=1}^{k} q_{Ch} \left( p_{Ch} - p_C \, p_h \right)}{p_C (1 - p_C)}$$

This is exactly $r^2$ in the case of one tag SNP.

WHAP always has at least as much power as:
• single SNP test
• single haplotype test
• haplotype group test
• $\chi^2$ with k degrees of freedom.
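As a numerical illustration of the claim above, the sketch below computes the optimal weights (the probability of the target allele given each tag haplotype) and the resulting closed-form correlation, then compares it with the single-SNP and single-haplotype alternatives. It reuses the example haplotype table, assuming SNPs 1 and 2 are the tags and allele 'A' at SNP 5 is the target; this setup is an assumption for illustration only.

```python
def optimal_weights(p_h, p_h_and_c):
    """w_h = P(C | h): probability of the target allele given tag haplotype h."""
    return [j / p for p, j in zip(p_h, p_h_and_c)]


def r_h_squared_optimal(p_h, p_h_and_c, p_c):
    """Closed form: sum_h q_Ch (p_Ch - p_C p_h) / (p_C (1 - p_C))."""
    q = optimal_weights(p_h, p_h_and_c)
    numerator = sum(qi * (j - p_c * p) for qi, p, j in zip(q, p_h, p_h_and_c))
    return numerator / (p_c * (1 - p_c))


def pairwise_r2(p_x, p_y, p_xy):
    """Ordinary r^2 between two binary events."""
    d = p_xy - p_x * p_y
    return d * d / (p_x * (1 - p_x) * p_y * (1 - p_y))


# Tag haplotypes AA, AG, GA, GG over SNPs 1-2; target is 'A' at SNP 5.
p_h = [0.25, 0.25, 0.25, 0.25]
p_h_and_c = [0.25, 0.10, 0.0, 0.0]
p_c = 0.35

weights = optimal_weights(p_h, p_h_and_c)
whap = r_h_squared_optimal(p_h, p_h_and_c, p_c)
single_snp = pairwise_r2(0.50, p_c, 0.35)   # SNP 1 allele A vs. target
single_hap = pairwise_r2(0.25, p_c, 0.25)   # haplotype AA vs. target
print([round(w, 2) for w in weights])
print(round(single_snp, 3), round(single_hap, 3), round(whap, 3))
```

On this table the ordering is single SNP, then single haplotype, then weighted haplotypes, which is consistent with the slide's claim for this one example.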

Page 57: Computational and Statistical Challenges in Association Studies

Cases: 0.5M SNPs

Controls: 0.5M SNPs

HapMap: 4M SNPs

Use as training dataset to get the weights

Tests: T1,…,T4M

Apply tests: T1,…,T4M

Positive results give evidence for a causal SNP - can be verified by a follow up/two stage study.

Page 58: Computational and Statistical Challenges in Association Studies

How Many SNPs are Captured?

Tag Set   Pop   SNP    HAP    WHAP
Affy500   CEU   0.61   0.77   0.84
Affy500   CHB   0.62   0.76   0.83
Affy500   JPT   0.59   0.73   0.81
Affy500   YRI   0.37   0.61   0.74
Illumina  CEU   0.88   0.97   0.98
Illumina  CHB   0.80   0.91   0.94
Illumina  JPT   0.78   0.90   0.95
Illumina  YRI   0.52   0.83   0.92

Page 59: Computational and Statistical Challenges in Association Studies
Page 60: Computational and Statistical Challenges in Association Studies

Power Simulations

Pop   SNP    HAP    WHAP
CEU   0.92   0.94   0.96
CHB   0.90   0.94   0.95
JPT   0.90   0.93   0.95
YRI   0.77   0.88   0.92

- Relative power to using all SNPs.
- Tested on the ENCODE regions, Affy 500k tag SNPs.

Page 61: Computational and Statistical Challenges in Association Studies

Practical Issues

We assume we have the haplotype frequencies in the HapMap (not the phase).

We assume the case/control populations are coming from the same population as the HapMap.

Over-fitting: Train with half of the data, test the other half. No correlation between the haps and random SNPs.

Page 62: Computational and Statistical Challenges in Association Studies

Page 63: Computational and Statistical Challenges in Association Studies

WHAP $r^2$ in a region. Red lines are collected SNPs. Blue lines are $r_h^2$ values.

Page 64: Computational and Statistical Challenges in Association Studies

Associations using WHAP. Red lines are associations at collected SNPs. Blue lines are associations at uncollected SNPs inferred by WHAP.

Page 65: Computational and Statistical Challenges in Association Studies

Page 66: Computational and Statistical Challenges in Association Studies

Optimal Genome Wide Tagging by Reduction to SAT

Page 67: Computational and Statistical Challenges in Association Studies

Correlation Structure

Page 68: Computational and Statistical Challenges in Association Studies

Example r2 Matrix

Page 69: Computational and Statistical Challenges in Association Studies

Graph Representation

Page 70: Computational and Statistical Challenges in Association Studies

Satisfiability and SAT Solvers

• Boolean variables; a literal is a variable or its negation
• Logical operators: AND ∧, OR ∨, NOT ¬

Example: (s1 ∨ ¬s2) ∧ (s2 ∨ s3 ∨ s1)

Satisfying assignment: s1 = false; s2 = false; s3 = true
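The example formula can be checked mechanically. The following sketch encodes the two clauses as lists of signed integers (a common convention, chosen here for illustration) and verifies the assignment from the slide by brute force. Real SAT solvers are far more sophisticated; exhaustive enumeration is used only because the example has three variables.

```python
from itertools import product

# Clauses as lists of signed integers: 2 means s2, -2 means NOT s2.
# (s1 OR NOT s2) AND (s2 OR s3 OR s1)
CNF = [[1, -2], [2, 3, 1]]


def satisfies(cnf, assignment):
    """True if every clause contains at least one literal made true."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in cnf
    )


def all_models(cnf, n_vars):
    """Enumerate every satisfying assignment by brute force."""
    models = []
    for values in product([False, True], repeat=n_vars):
        assignment = dict(enumerate(values, start=1))
        if satisfies(cnf, assignment):
            models.append(assignment)
    return models


slide_assignment = {1: False, 2: False, 3: True}
print(satisfies(CNF, slide_assignment))  # True
print(len(all_models(CNF, 3)))           # 5
```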

Page 71: Computational and Statistical Challenges in Association Studies

Negation Normal Form (NNF): a rooted DAG (circuit) whose internal nodes are and/or gates and whose leaves are literals over the variables A, B, C, D. (Figure: A. Darwiche)

Page 72: Computational and Statistical Challenges in Association Studies

CNF Form and Logical Solutions

Page 73: Computational and Statistical Challenges in Association Studies

NNF Form of Solutions

Page 74: Computational and Statistical Challenges in Association Studies

Local Single SNP r2 Tagging

• Generate a clause for each SNP
• The clause for SNP si contains all of its covers (the SNPs that can tag si)
• The input CNF is the conjunction of all clauses
• Compile with minSAT solver
• Find solutions by traversal of the NNF
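The encoding in the steps above can be sketched on a toy problem. In this illustration, the pairwise correlation matrix and the 0.8 threshold are hypothetical, and an exhaustive search over subsets stands in for the minSAT compilation and NNF traversal used in the talk, so it only scales to a handful of SNPs.

```python
from itertools import combinations

# Hypothetical pairwise r^2 matrix for five SNPs (symmetric, 1.0 on the diagonal).
R2 = [
    [1.00, 0.90, 0.85, 0.10, 0.00],
    [0.90, 1.00, 0.50, 0.20, 0.10],
    [0.85, 0.50, 1.00, 0.90, 0.30],
    [0.10, 0.20, 0.90, 1.00, 0.95],
    [0.00, 0.10, 0.30, 0.95, 1.00],
]


def cover_clauses(r2, threshold=0.8):
    """One clause per SNP: the SNPs (itself included) that cover it."""
    n = len(r2)
    return [[j for j in range(n) if j == i or r2[i][j] >= threshold]
            for i in range(n)]


def minimum_tag_sets(clauses, n_snps):
    """All smallest tag sets that satisfy every clause (brute-force stand-in)."""
    for size in range(1, n_snps + 1):
        solutions = [
            set(tags) for tags in combinations(range(n_snps), size)
            if all(any(j in tags for j in clause) for clause in clauses)
        ]
        if solutions:
            return solutions
    return []


clauses = cover_clauses(R2)
print(clauses)
print(minimum_tag_sets(clauses, len(R2)))
```

Each clause is a disjunction over the SNPs that could tag a given SNP, so a set of chosen tags is valid exactly when it makes every clause true. Minimizing the number of true variables then gives an optimal tag set.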

Page 75: Computational and Statistical Challenges in Association Studies

Optimal Tagging

Page 76: Computational and Statistical Challenges in Association Studies

Whole Genome Tagging

Page 77: Computational and Statistical Challenges in Association Studies

MultiMarker Example

Page 78: Computational and Statistical Challenges in Association Studies

MultiMarker Tagging

Page 79: Computational and Statistical Challenges in Association Studies

UCLA: Adnan Darwiche, Arthur Choi, Knot Pipatswisawat

ICSI: Eran Halperin, Richard Karp

Perlegen Sciences: David Hinds, David Cox

Ph.D. Students: Buhm Han, Nils Homer, Hyun Min Kang, Sean O’Rourke, Jimmie Ye, Noah Zaitlen

Webserver Hosted By: