SNPs, Haplotypes, Disease Associations Algorithmic Foundations of Computational Biology II Course 1 Prof. Sorin Istrail


Page 1

SNPs, Haplotypes, Disease Associations

Algorithmic Foundations of Computational Biology II

Course 1

Prof. Sorin Istrail

Page 2

SNPs and the Human Genome: The Minimal Informative Subset

Page 3

Overview

Introduction: SNPs, Haplotypes

A Data Compression Problem: The Minimum Informative Subset

A New Measure: Informativeness

Page 4

A Most Challenging Problem

“None of the [advances of the 20th century medicine] depend on a deep knowledge of cellular processes or on any discoveries of molecular biology.

Cancer is still treated by gross physical and chemical assaults on the offending tissue.

Cardiovascular Disease is treated by surgery whose anatomical bases go back to the 19th century … Of course, intimate knowledge of the living cell and of basic molecular processes may be useful eventually.”

Lewontin (1991)

Page 5

Now

“A decade later, molecular biology can claim very few successes for drugs in clinical use that were designed ab initio to control a specific component of a pathway linked to disease: these include the monoclonal antibody Herceptin, and the kinase inhibitor Gleevec.”

Reik, Gregory and Urnov (2002)

Page 6

Introduction

SNPs, HAPLOTYPES

Page 7

Single Nucleotide Polymorphism (SNP)

A SNP is a position in a genome at which two or more different bases occur in the population, each with a frequency >1%.

GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG

The two alleles at the site are G and T.

The most abundant type of polymorphism

Page 8

tttctccatttgtcgtgacacctttgttgacaccttcatttctgcattctcaattctatttcactggtctatggcagagaacacaaaatatggccagtggcctaaatccagcctactaccttttttttttttttgtaacattttactaacatagccattcccatgtgtttccatgtgtctgggctgcttttgcactctaatggcagagttaagaaattgtagcagagaccacaatgcctcaaatatttactctacagccctttataaaaacagtgtgccaactcctgatttatgaacttatcattatgtcaataccatactgtctttattactgtagttttataagtcatgacatcagataatgtaaatcctccaactttgtttttaatcaaaagtgttttggccatcctagatatactttgtattgccacataaatttgaagatcagcctgtcagtgtctacaaaatagcatgctaggattttgatagggattgtgtagaatctatagattaattagaggagaatgactatcttgacaatactgctgcccctctgtattcgtgggggattggttccacaacaacacccaccccccactcggcaacccctgaaacccccacatcccccagcttttttcccctgctaccaaaatccatggatgctcaagtccatataaaatgccatactatttgcatataacctctgcaatcctcccctatagtttagatcatctctagattacttataatactaataaaatctaaatgctatgtaaatagttgctatactgtgttgagggttttttgttttgttttgttttatttgtttgtttgtttgtattttaagagatggtgtcttgctttgttgcccaggctggagtgcagtggtgagatcatagcttactgcagcctcaaactcctggactcaaacagtcctcccacctcagcctcccaaagtgctgggatacaggtgtgacccactgtgcccagttattattttttatttgtattattttactgttgtattatttttaattattttttctgaatattttccatctatagttggttgaatcatggatgtggaacaggcaaatatggagggctaactgtattgcatcttccagttcatgagtatgcagtctctctgtttatttaaagttttagtttttctcaaccatgtttacttttcagtatacaagactttgacgttttttgttaaatgtatttgtaagtattttattatttgtgatgttatttaaaaagaaattgttgactgggcacagtggctcacgcctgtaatcccagcactttgggaggctgaggcgggcagatcacgaggtcaggagatcaagaccatcctggctaacatggtaaaaccccgtctctactaaaaatagaaaaaaattagccaggcgtggtggcgagtgcctgtagtcccagctactcgggaggctgaggcaggagaatggtgtgaacctgggaggcggagcttgcagtgagctgagatcgtgccactgcattccagcctgcgtgacagagcgagactctgtcaaaaaaataaataaaatttaaaaaaagaagaagaaattattttcttaatttcattttcaggttttttatttatttctactatatggatacatgattgatttttgtatattgatcatgtatcctgcaaactagctaacatagtttattatttctctttttttgtggattttaaaggattttctacatagataaataaacacacataaacagttttacttctttcttttcaacctagactggatgcattttttgtttttgtttgtttgtttgctttttaacttgctgcagtgactagagaatgtattgaagaatatattgttgaacaaaagcagtgagagtggacatccctgctttccccctgattttagggggaatgttttcagtctttcactatttaatatgattttagctataggtttatcctagatccctgttatcatgttgag
gaaattcccttctatttctagtttgttgagattttttaattcatgtgattgcgctatctggctttgctctca

[Figure: SNP alleles highlighted along the sequence above, shown as the pairs t/c, g/a and g/c.]

The human genome contains ~3 G base pairs per haploid set; a diploid cell carries 46 chromosomes (23 pairs).

Two individuals are 99.9% the same, i.e., they differ in ~3 M base pairs.

SNPs occur once every ~600 bp.

The average gene in the human genome spans ~27 kb.

~50 SNPs per gene
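The figures on this slide follow from simple arithmetic. The short check below uses only the numbers quoted above; the variable names are illustrative.

```python
# Back-of-the-envelope check of the figures quoted on the slide.
GENOME_BP = 3_000_000_000   # ~3 G base pairs
# 99.9% identical means about 1 differing base pair per 1,000.
differing_bp = GENOME_BP // 1000

GENE_SPAN_BP = 27_000       # average gene spans ~27 kb
SNP_SPACING_BP = 600        # one SNP every ~600 bp
snps_per_gene = GENE_SPAN_BP / SNP_SPACING_BP

print(differing_bp)         # 3000000, i.e. ~3 M base pairs
print(snps_per_gene)        # 45.0, consistent with "~50 SNPs per gene"
```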

Page 9

Haplotype

Two individuals:

G C T C G A C A A C A G
G T T C G T C A A C A G

Haplotypes (the alleles at the three marked SNP sites):

C A G
T T G

Page 10

Mutations

Infinite Sites Assumption:

Each site mutates at most once

Page 11

Haplotype Pattern

C A G T
T T G A
C A T G
C T G T

0 0 0 0
1 1 0 1
0 0 1 0
0 1 0 1

At each SNP site, label the two alleles as 0 and 1.

The choice of which allele is 0 and which is 1 is arbitrary.

Page 12

Recombination

Parental haplotypes:

G T T C G A C A A C A T
A C G T A T C T A T T A

Recombinant:

G T T C G A C T A T T A
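The recombinant haplotype on this slide is a prefix of one parental haplotype joined to a suffix of the other. A minimal sketch of a single crossover follows; the function name and the breakpoint value are assumptions chosen to reproduce the slide.

```python
def recombine(parent_a: str, parent_b: str, breakpoint: int) -> str:
    """Single crossover: sites before `breakpoint` come from parent_a,
    the remaining sites from parent_b."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parental haplotypes must have the same length")
    return parent_a[:breakpoint] + parent_b[breakpoint:]

parent_a = "GTTCGACAACAT"
parent_b = "ACGTATCTATTA"
# A breakpoint after the 7th site yields the recombinant shown on the slide.
child = recombine(parent_a, parent_b, 7)
print(child)  # GTTCGACTATTA
```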

Page 13

Recombination

Parental haplotypes:

G T T C G A C A A C A T
A C G T A T C T A T T A

Recombinant:

G T T C G A C T A T T A

The two alleles are linked, i.e., they are “traveling together”.

Recombination disrupts the linkage.

Page 14

Variations in Chromosomes Within a Population

Common Ancestor

Emergence of Variations Over Time

time present

Disease Mutation

Linkage Disequilibrium (LD)

Page 15

Time = present

2,000 gens. ago

Disease-Causing Mutation

1,000 gens. ago

Extent of Linkage Disequilibrium

Page 16

A Data Compression Problem

The Minimum Informative Subset

Page 17

A Data Compression Problem

Select SNPs to use in an association study
    Would like to associate single nucleotide polymorphisms (SNPs) with disease.

Very large number of candidate SNPs
    Chromosome-wide studies, whole-genome scans
    For cost effectiveness, select only a subset.

Closely spaced SNPs are highly correlated
    It is less likely that there has been a recombination between two SNPs if they are close to each other.

Page 18

Disease Associations

Page 19

Association studies

[Figure: two groups of individuals, Disease / Responder and Control / Non-responder, each individual marked with its allele at Marker A (Allele 0 or Allele 1).]

Marker A is associated with Phenotype
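The slides do not name a test statistic for association. One common choice, shown here only as an illustration, is Pearson's chi-square test on the 2x2 table of allele counts in cases and controls. The counts below are hypothetical and are not taken from the slides.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (1 degree of freedom, no continuity
    correction) for the 2x2 table
                 allele 0   allele 1
        cases        a          b
        controls     c          d
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Hypothetical allele counts at Marker A (illustration only).
chi2 = chi_square_2x2(a=30, b=10, c=15, d=25)
print(round(chi2, 3))  # 11.429
# 3.841 is the 5% critical value of chi-square with 1 degree of freedom.
print(chi2 > 3.841)    # True
```

With these made-up counts the statistic exceeds the 5% critical value, so the marker would be reported as associated with the phenotype at that level.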

Page 20

Association studies

Evaluate whether nucleotide polymorphisms associate with phenotype

T A G A A
C G G A A
C G T A A
T A T C G
T G T A G
T G G A G

Page 21

Association studies

T A G A A
C G G A A
C G T A A
T A T C G
T G T A G
T G G A G

Page 22

Association studies

1 1 0 0 0
0 0 0 0 0
0 0 1 0 0
1 1 1 1 1
1 0 1 0 1
1 0 0 0 1
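The 0/1 matrix on this slide can be reproduced from the nucleotide haplotypes of the two preceding slides. The sketch below assumes one particular labeling, in which the allele carried by a chosen reference haplotype is 0 and the other allele is 1; the function name and the `reference_index` parameter are illustrative, and since the labeling is arbitrary any other consistent choice would do.

```python
def encode_haplotypes(haplotypes, reference_index=0):
    """Encode biallelic haplotypes as 0/1 rows: an allele equal to the
    reference haplotype's allele at that site is 0, the other allele is 1."""
    reference = haplotypes[reference_index]
    for site in range(len(reference)):
        alleles = {h[site] for h in haplotypes}
        if len(alleles) > 2:
            raise ValueError(f"site {site} has more than two alleles")
    return [[0 if a == r else 1 for a, r in zip(h, reference)]
            for h in haplotypes]

haplotypes = ["TAGAA", "CGGAA", "CGTAA", "TATCG", "TGTAG", "TGGAG"]
# Taking the second haplotype (index 1) as reference gives the slide's matrix.
encoded = encode_haplotypes(haplotypes, reference_index=1)
for row in encoded:
    print(*row)
```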

Page 23

Compression based on Haplotype Resolution

Page 24

D-graph of a SNP

For a SNP s we associate a bipartite graph.

Nodes: the set of haplotypes.

Edges: the set of pairs of haplotypes with different alleles at s.

[Figure: haplotype matrix with SNPs s1 and s2 marked; entries as extracted: 0 1 0 1 1 / 1 0 0 0 / 0 0 1 0 1 / 1]

Page 25

D-graph of a set of SNPs

For a set of SNPs S we associate a graph, the union of the D-graphs of the SNPs in S. (Each single-SNP D-graph is bipartite; the union need not be.)

Nodes: the set of haplotypes.

Edges: the set of pairs of haplotypes with different alleles at some SNP s in S.

[Figure: haplotype matrix with SNPs s1 and s2 marked; entries as extracted: 0 1 0 1 1 / 1 0 0 0 / 0 0 1 0 1 / 1]
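The D-graph definitions above can be sketched in a few lines. The haplotype matrix `H` below is hypothetical (the matrix on the slide did not survive extraction intact), and haplotypes and SNPs are referred to by 0-based index.

```python
from itertools import combinations

def d_graph(haplotypes, snps):
    """Edge set of the D-graph of a set of SNPs: all pairs of haplotypes
    (i, j), i < j, that carry different alleles at some SNP in `snps`."""
    edges = set()
    for i, j in combinations(range(len(haplotypes)), 2):
        if any(haplotypes[i][s] != haplotypes[j][s] for s in snps):
            edges.add((i, j))
    return edges

# Hypothetical 3 haplotypes x 3 SNPs, for illustration only.
H = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
print(sorted(d_graph(H, [0])))     # D-graph of the first SNP alone
print(sorted(d_graph(H, [0, 1])))  # D-graph of the set {first, second SNP}
```

For a single SNP the graph is bipartite, with the haplotypes carrying allele 0 on one side and those carrying allele 1 on the other. In this example the D-graph of the two-SNP set contains all three pairs, which form a triangle.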

Page 26

SNP Selection

Red SNP is equivalent to Blue SNP

[Figure: haplotype matrix with SNP columns colored; entries as extracted: 0 1 0 1 1 / 1 0 0 0 / 0 0 1 0 1 / 1]

Page 27

SNP Selection

Red SNPs predict Green SNP

[Figure: haplotype matrix with SNP columns colored; entries as extracted: 0 1 0 1 1 / 1 0 0 0 / 0 0 1 0 1 / 1]

Page 28

Data Compression

Minimal Informative Subset

[Figure: haplotype matrix; entries as extracted: 0 1 0 1 1 / 1 0 0 0 / 0 0 1 0 1 / 1]

Page 29

Compression based on Haplotype Blocks

Page 30

Hypothesis – Haplotype Blocks?

The genome consists largely of blocks of common SNPs with relatively little recombination within the blocks.

Patil et al., Science, 2001; Jeffreys et al., Nature Genetics, 2001; Daly et al., Nature Genetics, 2001

Page 31

Haplotype Block Structure: LD-Blocks and 4-Gamete Test Blocks

[Figure: a 200 kb region showing sense genes, antisense genes, DNA, SNPs, and haplotype blocks numbered 1 to 4.]

Page 32

Four Gamete Block Test

Hudson and Kaplan 1985

A segment of SNPs is a block if between every pair of SNPs at most 3 out of the 4 gametes (00, 01, 10, 11) are observed.

BLOCK:

0 0 1
0 1 1
1 1 0
1 1 1

VIOLATES THE BLOCK DEFINITION:

0 0 1
0 1 1
1 1 0
1 0 1
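The block definition on this slide translates directly into a pairwise check. The sketch below assumes haplotypes are given as rows of 0/1 alleles; the function name is illustrative. The two matrices are the ones shown on the slide.

```python
from itertools import combinations

def is_four_gamete_block(haplotypes):
    """True if no pair of SNP columns shows all four gametes
    (00, 01, 10, 11) among the given 0/1 haplotypes."""
    n_sites = len(haplotypes[0])
    for a, b in combinations(range(n_sites), 2):
        gametes = {(h[a], h[b]) for h in haplotypes}
        if len(gametes) == 4:
            return False
    return True

block = [[0, 0, 1], [0, 1, 1], [1, 1, 0], [1, 1, 1]]
not_block = [[0, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 1]]
print(is_four_gamete_block(block))      # True
print(is_four_gamete_block(not_block))  # False: first two SNPs show 00, 01, 11, 10
```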

Page 33

Finding Recombination Hotspots: Many Possible Partitions into Blocks

A C T A G A T A G C C T
G T T C G A C A A C A T
A C T C T A T G A T C G
G T T A T A C G A C A T
A C T C T A T A G T A T
A C T A G C T G G C A T

All four gametes are present: [pairs of sites marked in the figure]

Page 34

A C T A G A T A G C C T
G T T C G A C A A C A T
A C T C T A T G A T C G
G T T A T A C G A C A T
A C T C T A T A G T A T
A C T A G C T G G C A T

Find the left-most right endpoint of any constraint and mark the site before it a recombination site.

Eliminate any constraints crossing that site.

Repeat until all constraints are gone.

The final result is a minimum-size set of sites crossing all constraints.
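The greedy procedure above can be sketched as follows, under some stated assumptions: a constraint is a pair of site indices (i, j) with i < j at which all four gametes are observed, a recombination site p is the gap just before site p, and p crosses a constraint when i < p <= j. The function name and the example constraints are hypothetical.

```python
def min_recombination_sites(constraints):
    """Greedy choice of recombination sites that cross every constraint.

    Processing constraints by increasing right endpoint and placing a
    site just before the right endpoint of each constraint that is not
    yet crossed implements the steps listed on the slide."""
    sites = []
    for left, right in sorted(constraints, key=lambda c: c[1]):
        # Skip the constraint if the most recent site already crosses it.
        if sites and left < sites[-1] <= right:
            continue
        sites.append(right)
    return sites

# Hypothetical constraints (pairs of site indices showing all four gametes).
constraints = [(0, 3), (1, 2), (2, 5), (4, 6)]
print(min_recombination_sites(constraints))  # [2, 5]
```

Because constraints are visited in order of right endpoint, only the most recently placed site can cross the current constraint, so checking `sites[-1]` is sufficient.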

Page 35

Data Compression

Haplotype Blocks based on LD (Method of Gabriel et al. 2002)

ACGATCGATCATGAT
GGTGATTGCATCGAT
ACGATCGGGCTTCCG
ACGATCGGCATCCCG
GGTGATTATCATGAT

Selecting Tagging SNPs in blocks

A------A---TG--
G------G---CG--
A------G---TC--
A------G---CC--
G------A---TG--
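The compression shown on this slide keeps only the tagging SNP columns and masks the rest. The sketch below reads the tag positions off the masked rows on the slide (0-based columns 0, 7, 11 and 12) and checks that the tags still tell all five haplotypes apart; the function name is illustrative.

```python
def mask_to_tags(haplotype: str, tag_sites) -> str:
    """Keep the alleles at the tagging sites and replace the rest by '-'."""
    keep = set(tag_sites)
    return "".join(c if i in keep else "-" for i, c in enumerate(haplotype))

haplotypes = [
    "ACGATCGATCATGAT",
    "GGTGATTGCATCGAT",
    "ACGATCGGGCTTCCG",
    "ACGATCGGCATCCCG",
    "GGTGATTATCATGAT",
]
tag_sites = [0, 7, 11, 12]  # columns left unmasked on the slide (0-based)

masked = [mask_to_tags(h, tag_sites) for h in haplotypes]
for row in masked:
    print(row)

# The tagging SNPs alone still distinguish every haplotype.
print(len(set(masked)) == len(set(haplotypes)))  # True
```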

Page 36

A New Measure

Informativeness

Page 37

Informativeness

0 1 0 0 1
0 1 1 0 0

[Figure labels: haplotypes h1 and h2, SNP s]

Page 38

Informativeness

1 0 0 0 0
0 1 0 0 1
0 1 1 0 0
1 0 1 1 1

s1 s2 s3 s4 s5

I(s1,s2) = 2/4 = 1/2

Page 39

Informativeness

1 0 0 0 0
0 1 0 0 1
0 1 1 0 0
1 0 1 1 1

s1 s2 s3 s4 s5

I({s1,s2}, s4) = 3/4

Page 40

Informativeness

1 0 0 0 0
0 1 0 0 1
0 1 1 0 0
1 0 1 1 1

s1 s2 s3 s4 s5

I({s3,s4},{s1,s2,s5}) = 3

S = {s3,s4} is a Minimal Informative Subset
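The slides quote informativeness values without spelling out the formula. The sketch below assumes one reading: I(S, t) is the fraction of haplotype pairs distinguished by SNP t that are also distinguished by some SNP in S, and I(S, T) is the sum of I(S, t) over t in T. The column order of the matrix as it survives in this transcript may not match the slide's s1 to s5 labels, so the sketch uses a hypothetical matrix `H` (an assumption, built from the same five column patterns) for which the three quoted values come out under this reading.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical haplotype matrix: rows are haplotypes, columns are s1..s5.
H = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 1, 1, 0],
    [1, 1, 0, 1, 1],
]

def edges(snps):
    """Pairs of haplotypes that differ at some SNP in `snps` (0-based)."""
    return {(i, j) for i, j in combinations(range(len(H)), 2)
            if any(H[i][s] != H[j][s] for s in snps)}

def informativeness(S, T):
    """Sum over targets t in T of the fraction of t's edges covered by S.
    Assumes every target SNP is polymorphic (has at least one edge)."""
    covered = edges(S)
    total = Fraction(0)
    for t in T:
        target = edges([t])
        total += Fraction(len(target & covered), len(target))
    return total

s1, s2, s3, s4, s5 = range(5)
print(informativeness([s1], [s2]))              # 1/2
print(informativeness([s1, s2], [s4]))          # 3/4
print(informativeness([s3, s4], [s1, s2, s5]))  # 3
```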

Page 41

Informativeness: Graph theory insight

Minimum Set Cover = Minimum Informative Subset

[Figure: bipartite graph with the SNPs s1 to s5 on one side and the edges e1 to e6 on the other.]

1 0 0 0 0
0 1 0 0 1
0 1 1 0 0
1 0 1 1 1

s1 s2 s3 s4 s5

Page 42

Informativeness: Graph theory insight

Minimum Set Cover {s3, s4} = Minimum Informative Subset

[Figure: bipartite graph with the SNPs s1 to s5 on one side and the edges e1 to e6 on the other.]

1 0 0 0 0
0 1 0 0 1
0 1 1 0 0
1 0 1 1 1

s1 s2 s3 s4 s5
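The set-cover view can be made concrete with a brute-force search: each SNP covers the haplotype pairs (edges) it distinguishes, and a minimum informative subset is a smallest set of SNPs covering every edge. The sketch below is an illustration under assumptions, not the authors' algorithm. It uses a hypothetical matrix `H` with columns labeled s1 to s5, chosen so that {s3, s4} is one of the minimum covers, and exhaustive enumeration, which is only practical for small inputs because minimum set cover is NP-hard in general.

```python
from itertools import combinations

# Hypothetical haplotype matrix: rows are haplotypes, columns are s1..s5.
H = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 1, 1, 0],
    [1, 1, 0, 1, 1],
]
NAMES = ["s1", "s2", "s3", "s4", "s5"]

def edges(snps):
    """Pairs of haplotypes that differ at some SNP in `snps` (0-based)."""
    return {(i, j) for i, j in combinations(range(len(H)), 2)
            if any(H[i][s] != H[j][s] for s in snps)}

def minimum_informative_subsets():
    """All smallest SNP subsets whose edges cover every edge of any SNP."""
    n_snps = len(NAMES)
    all_edges = edges(range(n_snps))
    for size in range(1, n_snps + 1):
        found = [c for c in combinations(range(n_snps), size)
                 if edges(c) == all_edges]
        if found:
            return [tuple(NAMES[s] for s in c) for c in found]
    return []

covers = minimum_informative_subsets()
print(len(edges(range(5))))    # 6 edges, e1..e6
print(("s3", "s4") in covers)  # True
```

For this matrix no single SNP covers all six edges, so every minimum cover has two SNPs; several such pairs exist, and {s3, s4} is one of them.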

Page 43

Real Haplotype Data

A region of Chr. 22

45 Caucasian samples

Two different runs of the Gabriel et al. block detection method + Zhang et al. SNP selection algorithm

Our block-free algorithm

Page 44

When Maximum Likelihood = Bayesian = Parsimony

[Figure: repeated panels with axis ticks running 1 to 30 and 1 to 14 or 15; legend: A C G T]