Phasing of 2-SNP Genotypes Based on Non-Random Mating Model Dumitru Brinza joint work with Alexander Zelikovsky Department of Computer Science Georgia State University Atlanta, USA


Post on 06-Jan-2016


Page 1: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Phasing of 2-SNP Genotypes

Based on Non-Random Mating Model

Dumitru Brinza

joint work with Alexander Zelikovsky

Department of Computer Science

Georgia State University

Atlanta, USA

Page 2: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Outline

Molecular biology terms

Motivation

Problem formulation

Previous work

Our contribution

Phasing of 2-SNP genotypes

Phasing of multi-SNP genotypes

Results

Page 3: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Molecular biology terms

Human Genome – all the genetic material in the chromosomes, length $3 \times 10^9$ base pairs.

Differences between any two people occur in 0.1% of the genome.

SNP – single nucleotide polymorphism: a site where two or more different nucleotides occur in a large percentage of the population.

Genotype – The entire genetic identity of an individual, including alleles, SNPs, or gene forms. (e.g., AC CT TG AA AC TG)

Haplotype – A single set of chromosomes (half of the full set of genetic material). (e.g., A C T A A T)

Genotype is a mixture of two haplotypes.

Page 4: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

From ACTG to 0,1,2 notations

Haplotype: wild-type SNPs are denoted by 0; mutated SNPs are denoted by 1.

Genotypes: homozygous SNPs are denoted by 0 (mixture of 00) or 1 (mixture of 11); heterozygous SNPs are denoted by 2 (mixture of 01 or 10).

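Example (each column is a SNP; a column where the two haplotypes agree is homozygous, a column where they differ is heterozygous):

Two haplotypes per individual:
1 1 0 0 1 0 0 1
1 1 0 1 1 1 0 0

Genotype for the individual:
1 1 0 2 1 2 0 2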
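The 0/1/2 encoding can be checked with a few lines of Python. This is only an illustrative sketch: the function name `genotype_from_haplotypes` is hypothetical, and the two example haplotypes are the digits of the figure above read as two 8-SNP rows.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype vector."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Equal alleles keep their value (homozygous 0 or 1);
    # different alleles are encoded as 2 (heterozygous).
    return [a if a == b else 2 for a, b in zip(h1, h2)]


hap1 = [1, 1, 0, 0, 1, 0, 0, 1]
hap2 = [1, 1, 0, 1, 1, 1, 0, 0]
genotype = genotype_from_haplotypes(hap1, hap2)
print(genotype)  # [1, 1, 0, 2, 1, 2, 0, 2]
```

Going in this direction is trivial; the phasing problem is the inverse map, which is not unique.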

Page 5: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Motivation

Haplotypes may contain a large number of genetic markers that are responsible for human disease.

Haplotypes may increase the power of association between marker loci and phenotypic traits.

Evolutionary trees can be reconstructed from haplotypes.

Physical phasing (inferring haplotypes experimentally) is too expensive. There is a great need for computational methods that extract haplotype information from the given genotype information.

Existing methods are either extremely slow or less accurate for genome-wide studies.

Page 6: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Phasing problem (Haplotype inference)

Inferring haplotypes, or genotype phasing, is the resolution of a genotype into two haplotypes.

Given: n genotype vectors (entries 0, 1 or 2).

Find: n pairs of haplotype vectors, one pair of haplotypes per genotype, explaining the genotypes.

For an individual genotype with h heterozygous sites there are $2^{h-1}$ possible haplotype pairs explaining this genotype (h = 20k genome-wide). In addition, around 10% of the data is missing.

This is hopeless without a genetic model.
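To make the count of candidate haplotype pairs concrete, the following sketch enumerates every unordered pair explaining a small genotype. The function name `explaining_pairs` is hypothetical, and missing data is ignored here.

```python
from itertools import product


def explaining_pairs(genotype):
    """Return all unordered haplotype pairs consistent with a 0/1/2 genotype."""
    het = [k for k, g in enumerate(genotype) if g == 2]
    if not het:
        return [(list(genotype), list(genotype))]
    pairs = []
    # Fix the first heterozygous site to 0 in the first haplotype so that
    # (h1, h2) and (h2, h1) are not counted twice.
    for bits in product([0, 1], repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for k, b in zip(het, (0,) + bits):
            h1[k] = b
            h2[k] = 1 - b
        pairs.append((h1, h2))
    return pairs


pairs = explaining_pairs([1, 1, 0, 2, 1, 2, 0, 2])
print(len(pairs))  # 4 pairs for h = 3 heterozygous sites
```

The list doubles with every additional heterozygous site, which is why exhaustive enumeration is not an option at genome-wide scale.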

Page 7: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Previous work

PHASE – Bayesian statistical method (Stephens et al., 2001, 2003)

HAPLOTYPER – proposed a Monte Carlo approach (Niu et al., 2002)

Phamily – phase the trio families based on PHASE (Acherman et al., 2003)

GERBIL – statistical method using maximum likelihood (ML), MST and expectation-maximization (EM) (Kimmel and Shamir, 2005)

SNPHAP – use ML/EM assuming Hardy-Weinberg equilibrium (Clayton et al., 2004)

Page 8: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Contribution

We explore phasing of genotypes with 2 SNPs, which is ambiguous only when both sites are heterozygous. In that case there are two possible phasings, and the phasing problem reduces to inferring their frequencies.

Having the phasing solution for 2-SNP genotypes, we propose an algorithm for inferring the complete haplotypes for a given genotype based on the maximum spanning tree of a complete graph with vertices corresponding to heterozygous sites and edge weights given by the inferred 2-SNP frequencies.

Extensive experimental validation of the proposed methods and comparison with previously known methods.

Page 9: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Phasing of 2-SNP genotypes

At least one SNP is homozygous – phasing is well defined. Examples: genotype 01 = haplotypes 01 + 01, or genotype 21 = haplotypes 01 + 11.

Both SNPs are heterozygous – ambiguity:

Cis- phasing: 22 = 00 + 11

Trans- phasing: 22 = 01 + 10

Page 10: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Odds of cis- or trans- phasing

Odds ratio of being phased cis- / trans-

Additive odds ratio is better (also noticed in PHASE)

LD (linkage disequilibrium) between SNPs i and j

Page 11: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Confidence in cis- or trans- phasing

Closer pairs of SNPs are more tightly linked (fewer crossovers).

The confidence cij in phasing 2 SNPs i and j is inversely proportional to the squared distance between them.

Logarithm is for sign-indication of cis-/trans- preference

cij > 0 means cis- with certainty |cij|

cij < 0 means trans- with certainty |cij|

Cis- phasing of SNPs i and j: 22 = 00 + 11

Trans- phasing of SNPs i and j: 22 = 01 + 10

Page 12: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Certainty of cis- or trans- phasing

n – number of genotypes

F00, F01, F10, F11 – true haplotype frequencies (observed + true in 22)

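Genotypes (SNPs i and j are the 3rd and 6th columns), with the contribution of each genotype to the 2-SNP haplotype counts:

? 1 0 2 1 1 0 1 0 1    i j = 01: #01 + 2
1 1 0 0 1 0 0 2 0 1    i j = 00: #00 + 2
0 1 2 0 1 2 0 1 0 1    i j = 22: (#00 + 1 , #11 + 1) or (#01 + 1 , #10 + 1)*
2 1 1 0 1 1 0 ? 0 1    i j = 11: #11 + 2
0 1 1 0 1 2 0 0 2 1    i j = 12: #10 + 1 , #11 + 1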
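The counting rule in this example can be sketched as follows. The function name `count_two_snp_haplotypes`, the use of `None` for a missing value, and the choice of the 3rd and 6th columns as SNPs i and j are assumptions made for illustration; the slide only shows the counts.

```python
def count_two_snp_haplotypes(genotypes, i, j):
    """Count unambiguous 2-SNP haplotypes at sites i, j; also count 22 genotypes."""
    counts = {"00": 0, "01": 0, "10": 0, "11": 0}
    ambiguous = 0  # genotypes that are heterozygous at both sites
    for g in genotypes:
        gi, gj = g[i], g[j]
        if gi is None or gj is None:
            continue  # missing data contributes nothing here
        if gi == 2 and gj == 2:
            ambiguous += 1  # cis (00 + 11) or trans (01 + 10), resolved later
            continue
        # Alleles carried by the two haplotypes at each site.
        ai = (0, 1) if gi == 2 else (gi, gi)
        aj = (0, 1) if gj == 2 else (gj, gj)
        for a, b in zip(ai, aj):
            counts[f"{a}{b}"] += 1
    return counts, ambiguous


genotypes = [
    [None, 1, 0, 2, 1, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 0, 0, 2, 0, 1],
    [0, 1, 2, 0, 1, 2, 0, 1, 0, 1],
    [2, 1, 1, 0, 1, 1, 0, None, 0, 1],
    [0, 1, 1, 0, 1, 2, 0, 0, 2, 1],
]
counts, ambiguous = count_two_snp_haplotypes(genotypes, 2, 5)
print(counts, ambiguous)
```

Only the 22 genotypes are left undecided; how they are split between cis and trans is exactly what the frequency model has to estimate.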

Page 13: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Haplotype frequencies in 22

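Random mating model => Hardy-Weinberg Equilibrium (HWE):

$$(F_{00}+F_{01}+F_{10}+F_{11})^2 = \underbrace{F_{00}^2}_{G_{00}} + \underbrace{F_{01}^2}_{G_{01}} + \underbrace{F_{10}^2}_{G_{10}} + \underbrace{F_{11}^2}_{G_{11}} + \underbrace{2F_{00}F_{01}}_{G_{02}} + \underbrace{2F_{00}F_{10}}_{G_{20}} + \underbrace{2F_{00}F_{11} + 2F_{01}F_{10}}_{G_{22}} + \underbrace{2F_{01}F_{11}}_{G_{21}} + \underbrace{2F_{10}F_{11}}_{G_{12}}$$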
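As a numerical check of the HWE expansion, the sketch below derives the nine 2-SNP genotype frequencies from four haplotype frequencies. The function name and the example frequencies are made up for illustration.

```python
def hwe_genotype_frequencies(F):
    """F maps '00', '01', '10', '11' to haplotype frequencies that sum to 1."""
    G = {}
    haplotypes = ["00", "01", "10", "11"]
    for a in haplotypes:
        for b in haplotypes:
            # Per-site genotype label: equal alleles keep their value,
            # different alleles become 2.
            label = "".join(x if x == y else "2" for x, y in zip(a, b))
            G[label] = G.get(label, 0.0) + F[a] * F[b]
    return G


F = {"00": 0.4, "01": 0.1, "10": 0.2, "11": 0.3}
G = hwe_genotype_frequencies(F)
for label in sorted(G):
    print(label, round(G[label], 4))
```

Under random mating, the cis share of the 22 genotypes would be F00·F11 / (F00·F11 + F01·F10); the 22 genotypes are the only ones whose haplotype content cannot be read off directly.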

Even single-SNP haplotype frequencies may deviate from HWE

$$(F_0+F_1)(F_0+F_1-2x) = (F_0+x)^2 + (F_1+x)^2 + 2(F_0F_1-x^2)$$

with the three right-hand terms labeled $xG_0$, $yG_1$ and $zG_2$.

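Accordingly we adjust the expectation of 2-SNP haplotype frequencies:

$$(F_{00}+F_{01}+F_{10}+F_{11})^2 = \underbrace{F_{00}^2}_{xx\,G_{00}} + \underbrace{F_{01}^2}_{xy\,G_{01}} + \underbrace{F_{10}^2}_{yx\,G_{10}} + \underbrace{F_{11}^2}_{yy\,G_{11}} + \underbrace{2F_{00}F_{01}}_{xz\,G_{02}} + \underbrace{2F_{00}F_{10}}_{zx\,G_{20}} + \underbrace{2F_{00}F_{11} + 2F_{01}F_{10}}_{zz\,G_{22}} + \underbrace{2F_{01}F_{11}}_{zy\,G_{21}} + \underbrace{2F_{10}F_{11}}_{yz\,G_{12}}$$

Compute the expected haplotype frequencies in 22 as the best fit to the observed deviation in single-site haplotype frequencies.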
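One possible reading of the x, y, z labels is that they are per-site ratios between observed genotype frequencies and their HWE expectations, and that a 2-SNP genotype is rescaled by the product of the two single-site ratios (so G22 is scaled by z·z). The slides do not spell out the fitting procedure, so the sketch below is an assumption; both function names are hypothetical.

```python
def single_site_factors(genotype_column):
    """Observed / HWE-expected ratio for genotype classes 0, 1, 2 at one SNP."""
    known = [g for g in genotype_column if g in (0, 1, 2)]
    n = len(known)
    observed = {c: known.count(c) / n for c in (0, 1, 2)}
    f1 = observed[1] + observed[2] / 2  # frequency of allele 1
    f0 = 1.0 - f1
    expected = {0: f0 * f0, 1: f1 * f1, 2: 2 * f0 * f1}
    # Assumption: a class with zero expectation is left unscaled.
    return {c: (observed[c] / expected[c] if expected[c] > 0 else 1.0)
            for c in (0, 1, 2)}


def adjusted_factor(factors_i, factors_j, gi, gj):
    """Scaling applied to the HWE expectation of the 2-SNP genotype (gi, gj)."""
    return factors_i[gi] * factors_j[gj]


# A site with an excess of homozygotes relative to HWE.
factors = single_site_factors([0, 0, 0, 1, 1, 1, 2, 2])
print(factors)
print(adjusted_factor(factors, factors, 2, 2))
```

With this reading, an excess of homozygotes at both sites lowers the expected frequency of the doubly heterozygous genotype 22 relative to plain HWE.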

Page 14: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Phasing of multi-SNP genotypes

Genotype graph for genotype g is a weighted complete graph G(g) where:

Vertices = 2’s, i.e., heterozygous SNPs in g

Weight w(i,j) = |cij|, the confidence in phasing the 2 SNPs i and j

Phasing of 2 heterozygous SNPs:

cij > 0: cis-edge, 22 = 00 + 11

cij < 0: trans-edge, 22 = 01 + 10

Phasing = genotype graph coloring: color all vertices in two colors such that any 2 vertices connected by a cis-edge have the same color, and any 2 vertices connected by a trans-edge have opposite colors.

Genotype:     2 1 2 0 1 2 0 2 0 1   (heterozygous sites a, b, c, d)
Haplotype #1: 1 1 0 0 1 0 0 1 0 1
Haplotype #2: 0 1 1 0 1 1 0 0 0 1

[Figure: genotype graph on vertices a, b, c, d]

Page 15: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Genotype graph coloring

Frequent conflicts when coloring genotype graph G since it has cycles.

Genotype Graph Coloring Problem: find a coloring with the total weight (number) of conflicting edges minimized.

Exact solution: ILP – slow and not accurate

Heuristic solution: find the maximum spanning tree (MST) of G and color the MST instead of G

[Figure: weighted genotype graph and its maximum spanning tree]
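The MST heuristic can be sketched in Python as below. This is not the authors' implementation: the function name `phase_by_spanning_tree`, the Kruskal-style construction, and the example confidences are assumptions. The sign convention follows the genotype-graph slide (cij > 0 is a cis-edge, cij < 0 a trans-edge); a zero weight is treated as trans here.

```python
def phase_by_spanning_tree(genotype, c):
    """Phase one genotype from pairwise confidences.

    genotype: list of 0/1/2 values.
    c: dict mapping (i, j), with i and j heterozygous positions, to cij.
    """
    het = [k for k, g in enumerate(genotype) if g == 2]
    parent = {v: v for v in het}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Kruskal on edges sorted by decreasing |cij| gives a maximum spanning tree.
    tree = {v: [] for v in het}
    for (i, j), w in sorted(c.items(), key=lambda e: -abs(e[1])):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree[i].append((j, w))
            tree[j].append((i, w))

    # A tree has no cycles, so a 2-coloring never runs into a conflict.
    color = {}
    for root in het:
        if root in color:
            continue
        color[root] = 0
        stack = [root]
        while stack:
            u = stack.pop()
            for v, w in tree[u]:
                if v not in color:
                    color[v] = color[u] if w > 0 else 1 - color[u]
                    stack.append(v)

    h1 = [color[k] if g == 2 else g for k, g in enumerate(genotype)]
    h2 = [1 - color[k] if g == 2 else g for k, g in enumerate(genotype)]
    return h1, h2


genotype = [2, 1, 2, 0, 1, 2, 0, 2, 0, 1]
# Hypothetical confidences for the heterozygous positions 0, 2, 5, 7.
c = {(0, 2): -3.0, (2, 5): 2.0, (5, 7): -1.0,
     (0, 7): 0.8, (0, 5): -0.5, (2, 7): 0.2}
h1, h2 = phase_by_spanning_tree(genotype, c)
print(h1)
print(h2)
```

With these made-up weights the two returned haplotypes coincide with the pair shown for the same genotype on the multi-SNP phasing slide. Edges outside the tree are simply ignored, which is where the heuristic trades optimality for speed.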

Page 16: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

2SNP algorithm

For each pair of SNPs do
    Collect statistics on haplotype/genotype frequencies
    Compute weights reflecting likelihood of trans-/cis-

For each genotype g do
    Find MST for the complete graph G(g) where vertices are heterozygous sites
    Color G(g) vertices and phase based on coloring

For each haplotype h with ?’s (missing SNP values) do
    Find a haplotype h’ closest to h (with minimum number of mismatches)
    Replace ?’s in h with the known SNP value in h’

Runtime (two bottlenecks)
    $O(nm)$ – computing haplotype frequencies for 20×m pairs of SNPs in each genotype, where n is the number of genotypes and m the number of SNPs
    $O(n^2 m)$ – missing data recovery, finding the number of mismatches for any two haplotypes
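The missing-data step can be illustrated with a short sketch. The function name `fill_missing`, the use of `None` for "?", and the tie-breaking rule (the first closest haplotype wins) are assumptions; mismatches are counted only at positions known in both haplotypes.

```python
def fill_missing(haplotypes):
    """Fill None entries of each haplotype from its closest other haplotype."""
    filled = []
    for a, h in enumerate(haplotypes):
        if None not in h:
            filled.append(list(h))
            continue
        best, best_mismatches = None, None
        for b, other in enumerate(haplotypes):
            if a == b:
                continue
            # Mismatches over positions that are known in both haplotypes.
            mismatches = sum(
                1 for x, y in zip(h, other)
                if x is not None and y is not None and x != y
            )
            if best is None or mismatches < best_mismatches:
                best, best_mismatches = other, mismatches
        # A position stays None if the closest haplotype is missing it too.
        filled.append([
            best[k] if x is None and best is not None else x
            for k, x in enumerate(h)
        ])
    return filled


haps = [[0, 1, None, 1], [0, 1, 1, 1], [1, 0, 0, 0]]
print(fill_missing(haps))  # [[0, 1, 1, 1], [0, 1, 1, 1], [1, 0, 0, 0]]
```

Comparing every haplotype with every other one over m SNPs is what produces the quadratic term in the runtime.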

Page 17: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Datasets

Chromosome 5q31: 129 genotypes with 103 SNPs derived from the 616 KB region of human Chromosome 5q31 (Daly et al., 2001).

Yoruba population (D): 30 genotypes with SNPs from 51 various genomic regions, with number of SNPs per region ranging from 13 to 114 (Gabriel et al., 2002).

Random matching 5q31: 128 genotypes, each with 89 SNPs from the 5q31 cytokine gene, generated by random matching from 64 haplotypes of 32 West Africans (Hull et al., 2004).

HapMap datasets: 30 genotypes of Utah residents and Yoruba residents available on HapMap by Dec 2005. The number of SNPs varies from 52 to 1381 across 40 regions, including ENm010, ENm013, ENr112, ENr113 and ENr123, spanning 500 KB regions of chromosome bands 7p15.2, 7q21.13, 2p16.3, 4q26 and 12q12 respectively, and two regions spanning the genes STEAP and TRPM8 plus 10 KB upstream and downstream.

Page 18: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Unrelated individuals phasing validation

Phasing methods can be validated on simulated data (haplotypes are known).

Validation on real data is usually performed on trio data: offspring haplotypes are mostly known (inferred from the parents' haplotypes).

Error types

Single-site error: number of SNPs in the phased offspring haplotypes that differ from the SNPs inferred from trio data, divided by (total number of SNPs) × (total number of haplotypes).

Individual error: number of incorrectly phased offspring genotypes (those with at least one single-site error), divided by the total number of genotypes.

Switching error: minimum number of switches needed in the pair of haplotypes of a phased offspring genotype so that both haplotypes coincide with the haplotypes inferred from trio data, divided by the total number of heterozygous positions in the offspring genotypes.
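The single-site and switching counts for one genotype can be sketched as follows. The function names are hypothetical, the inferred pair is assumed to explain the same genotype as the true pair, and for the single-site count the better of the two possible pairings of haplotypes is taken, which the slides do not specify.

```python
def single_site_errors(inferred, truth):
    """Number of SNP values in the inferred pair that differ from the true pair."""
    (a1, a2), (t1, t2) = inferred, truth

    def diff(x, y):
        return sum(p != q for p, q in zip(x, y))

    # The order of the two haplotypes is arbitrary, so try both pairings.
    return min(diff(a1, t1) + diff(a2, t2), diff(a1, t2) + diff(a2, t1))


def switches(inferred, truth):
    """Minimum number of switches turning the inferred pair into the true pair."""
    (a1, _), (t1, t2) = inferred, truth
    # At each heterozygous site, record whether a1 carries the allele of t1.
    match = [p == q for p, q, r in zip(a1, t1, t2) if q != r]
    # Every change of that indicator between consecutive sites is one switch.
    return sum(1 for u, v in zip(match, match[1:]) if u != v)


truth = ([1, 1, 0, 0, 1, 0, 0, 1, 0, 1], [0, 1, 1, 0, 1, 1, 0, 0, 0, 1])
inferred = ([1, 1, 0, 0, 1, 1, 0, 0, 0, 1], [0, 1, 1, 0, 1, 0, 0, 1, 0, 1])

n_snps, n_het = len(truth[0]), 4
print(single_site_errors(inferred, truth) / (n_snps * 2))  # 0.2
print(switches(inferred, truth) / n_het)                   # 0.25
```

In this example a single switch between the second and third heterozygous sites already produces four wrong SNP values, which is why the two measures are reported separately.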

Page 19: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Results

Page 20: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Chromosome-Wide Phasing

Entire chromosomes for 30 trios from HapMap

Average errors: single-site 3.3%, switching 8.8%

#SNPs    Runtime
1.5K     2 sec
2.5K     8 sec
5.0K     25 sec
10.0K    55 sec
20.0K    220 sec
40.0K    17 min
60.0K    35 min
80.0K    70 min

Page 21: Phasing of 2-SNP Genotypes  Based on Non-Random Mating Model Dumitru Brinza

Conclusion

2SNP method

Several orders of magnitude faster

Scalable for genome-wide study

Phase 10000 SNPs in less than one hour

Same accuracy as PHASE and GERBIL