
Post on 21-Dec-2015


Inferring Ethnicity from Mitochondrial DNA Sequence

Chih Lee (1), Ion Mandoiu (1) and Craig E. Nelson (2)

chih.lee@uconn.edu, ion@engr.uconn.edu, craig.nelson@uconn.edu

(1) Department of Computer Science and Engineering
(2) Department of Molecular and Cell Biology
University of Connecticut


Outline

Introduction
Methods
Results and Discussions
Conclusions


Ethnicity in Forensics

Ethnicity information assists forensic investigators.

Investigator-assigned ethnicity: based on genetic and non-genetic markers.

Genetic information enhances inference accuracy when access to the most informative markers (e.g. skin/hair) is limited.

Autosomal markers:
  Excellent accuracy assigning samples to clades [Phi07, Shr97]
  May not survive degradation


Mitochondrial DNA

Circular, 16,569 bp
Maternally inherited
High copy number
  Recoverable from degraded samples
Coding region SNPs define haplogroups [Beh07]
Hypervariable Region


Hypervariable Region

High mutation rate compared to the coding region

Haplogroup inference [Beh07]
  23 groups
  96.7% accuracy rate with 1NN

Geographic origin inference [Ege04]
  SE Africa, Germany and Iceland
  66.8% accuracy rate with PCA-QDA

[Figure: HVR 1 and HVR 2 on the mtDNA coordinate axis (marked positions 16024, 16569, 1, 576)]


Ethnicity Inference from HVR

The problem:
  Given a set of HVR sequences tagged with ethnicities
  Predict the ethnicities of new HVR sequences
  A classification problem

Our contribution:
  Assess the performance of 4 classification algorithms: SVM, LDA, QDA and 1NN.


Encoding HVR

Align to rCRS (revised Cambridge Reference Sequence)

SNP profile
  16067T: C→T substitution
  315.1C: insertion
  523D: deletion
  Each SNP becomes a binary variable

Missing data (untyped regions)
  Assume rCRS
  Use mutation probability
  Common region
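The encoding step can be sketched as follows. This is a minimal illustration, assuming a profile is a list of variant labels relative to rCRS and that the feature columns are the distinct labels seen across the profiles. The function names and toy profiles are hypothetical; the slides do not describe the actual implementation.

```python
def build_vocabulary(profiles):
    """Collect every distinct variant label, in sorted order, as feature columns."""
    return sorted({variant for profile in profiles for variant in profile})


def encode(profile, vocabulary):
    """Return a 0/1 vector: 1 if the profile carries the variant, else 0 (same as rCRS)."""
    present = set(profile)
    return [1 if variant in present else 0 for variant in vocabulary]


# Toy profiles (illustrative labels only, not taken from the FBI database).
profiles = [["16067T", "315.1C"], ["315.1C", "523D"], []]
vocabulary = build_vocabulary(profiles)
vectors = [encode(p, vocabulary) for p in profiles]
```

An empty profile encodes to the all-zero vector, which corresponds to a sequence identical to rCRS over the typed region.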


Support Vector Machines

Binary classification algorithm

Map instances to a high-dimensional space (the feature space)

Optimal separating hyperplane with maximum margin

Kernel function $k(x_1, x_2)$: similarity between $x_1$ and $x_2$ in the feature space

Radial basis kernel: $k(x_1, x_2) = \exp(-\gamma \lVert x_1 - x_2 \rVert^2)$

Software: LIBSVM [Cha01]
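To make the radial basis kernel concrete, here is a small sketch that evaluates it directly. The function name and the value of gamma are assumptions chosen for illustration; LIBSVM computes the kernel internally and the slides do not state the gamma used.

```python
import math


def rbf_kernel(x1, x2, gamma):
    """Radial basis kernel exp(-gamma * ||x1 - x2||^2) for equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * squared_distance)


# For 0/1 vectors the squared Euclidean distance equals the number of differing positions.
a = [1, 0, 1, 0]
b = [1, 1, 0, 0]
k_same = rbf_kernel(a, a, gamma=0.5)  # identical vectors
k_diff = rbf_kernel(a, b, gamma=0.5)  # two mismatching positions
```

Identical vectors give a similarity of 1, and the similarity decays toward 0 as the vectors move apart, at a rate controlled by gamma.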


Linear/Quadratic Discriminant Analysis

Find $\arg\max_g P(G=g \mid X=x)$

Assumptions:
  $X \mid G=g \sim N_p(\mu_g, \Sigma_g)$
  $P(G=g)$ is the same for all $g$

$P(G=g \mid X=x) \propto P(X=x \mid G=g)$

$\mu_g$ and $\Sigma_g$ are estimated from the training data

LDA: common dispersion matrix, $\Sigma_g = \Sigma$ for all $g$
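Under these assumptions, maximizing the posterior is the same as maximizing the log of the Gaussian class density. The standard textbook discriminant functions are written out below for reference; they are not taken from the slides, and the symbol $\delta_g$ is introduced here only for illustration.

```latex
\begin{align*}
\hat{g}(x) &= \arg\max_g \, \delta_g(x), \\
\text{QDA:}\quad \delta_g(x) &= -\tfrac{1}{2}\log\lvert\Sigma_g\rvert
  - \tfrac{1}{2}(x-\mu_g)^{\top}\Sigma_g^{-1}(x-\mu_g), \\
\text{LDA:}\quad \delta_g(x) &= x^{\top}\Sigma^{-1}\mu_g
  - \tfrac{1}{2}\mu_g^{\top}\Sigma^{-1}\mu_g .
\end{align*}
```

With a common $\Sigma$ the quadratic term in $x$ is identical for every class and cancels, which is why the LDA decision boundaries are linear while the QDA boundaries are quadratic.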


1-Nearest Neighbor

Assign a new sample to the dominating ethnicity among the nearest samples in the training data

Distance measure: the Hamming distance

Used by Behar et al. (2007) for haplogroup assignment
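A minimal sketch of this rule on binary SNP vectors is shown below. The function names and toy data are hypothetical, and ties among equally near training samples are broken by label count and then by order of appearance, which is an assumption rather than something the slides specify.

```python
from collections import Counter


def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def predict_1nn(sample, train_vectors, train_labels):
    """Return the most common label among the training samples at minimum Hamming distance."""
    distances = [hamming(sample, t) for t in train_vectors]
    nearest = min(distances)
    votes = Counter(label for d, label in zip(distances, train_labels) if d == nearest)
    return votes.most_common(1)[0][0]


train_vectors = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]]
train_labels = ["A", "A", "B"]
prediction = predict_1nn([0, 0, 1, 0], train_vectors, train_labels)
```

In this toy example the query differs from the third training vector in a single position, so it receives that vector's label.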


Principal Component Analysis

A dimension reduction technique

Used in conjunction with SVM, LDA and QDA

Denoted as: PCA-SVM, PCA-LDA and PCA-QDA


The FBI mtDNA Population Database

Two tables:
  forensic: typed by FBI
  published: collected from literature

Retain only Caucasian, African, Asian and Hispanic samples

# samples           All     Caucasian       African         Asian         Hispanic
forensic dataset    4,426   1,674 (37.8%)   1,305 (29.5%)   761 (17.2%)   686 (15.5%)
published dataset   3,976   2,807 (70.6%)   254 (6.4%)      915 (23%)     -


Data Coverage and Subsets

Variable sequence lengths

trimmed forensic dataset (4,426): 16024-16365

trimmed published dataset (1,904): 16024-16365

full-length forensic dataset (2,540): 16024-16569, 1-576

[Figure: coverage of the forensic and published datasets along HVR 1 and HVR 2 (axis positions 16024, 16569, 1, 576)]


5-fold Cross-Validation (trimmed forensic)

Macro-Accuracy: Average of ethnicity-wise accuracy rates

Micro-Accuracy: Weighted by # samples

More accurate than Egeland et al. (2004)

Matches human experts relying on skull and large bones [Dib83, isc83]
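The two accuracy measures can be sketched as below. Micro-accuracy is computed here as the overall fraction of correct predictions, which equals the per-ethnicity accuracies weighted by sample count. The function name and the toy labels are illustrative assumptions, not results from the study.

```python
from collections import defaultdict


def macro_micro_accuracy(true_labels, predicted_labels):
    """Return (macro, micro): unweighted mean of per-class accuracy, and overall accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        correct[truth] += int(truth == guess)
    per_class = [correct[c] / total[c] for c in total]
    macro = sum(per_class) / len(per_class)
    micro = sum(correct.values()) / sum(total.values())
    return macro, micro


# Toy labels: a large class predicted perfectly and a small class predicted half right.
truth = ["Caucasian"] * 4 + ["Asian"] * 2
guess = ["Caucasian"] * 4 + ["Asian", "Caucasian"]
macro, micro = macro_micro_accuracy(truth, guess)
```

When classes are imbalanced, the macro figure drops more sharply than the micro figure if a small class is poorly predicted, as in this toy case.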


Seq. Region Effect on Accuracy

Different primers result in different coverage.

PCA-LDA outperforms 1NN on long sequences.

PCA-SVM is consistently the best.

[Figure: full-length forensic dataset; regions marked 100%, 90% and 80% along HVR 1 and HVR 2 (axis positions 16024, 16569, 1, 576)]


Seq. Region Effect on Accuracy

HVR 2 contains less information.

PCA-SVM is consistently the best.

[Figure: full-length forensic dataset; regions marked 100%, 90% and 80% along HVR 1 and HVR 2 (axis positions 16024, 16569, 1, 576)]


Twenty 10% Windows

Accuracy varies with region.

PCA-SVM remains the best.

1NN is as good as PCA-SVM for short regions.

[Figure: 10% windows along HVR 1 and HVR 2 (axis positions 16024, 16569, 1, 576)]


Independent Validation (1/2)

Training data: trimmed forensic dataset

Test data: trimmed published dataset

PCA-SVM

No Hispanic samples in the test data, but samples can be mis-classified as Hispanic

Asian: ~17% lower than CV


Independent Validation (2/2)

Composition of the Asian samples in the training data:
  China (356 profiles), Japan (163), Korea (182), Pakistan (8), and Thailand (52)
  Strong bias towards East Asia

145 mis-classified Asian samples in the test data:
  10 samples of unknown country of origin
  90 samples from Kazakhstan and Kyrgyzstan

Both countries have a significant Russian population.

Evidence of admixture with Caucasians.

             # Samples   Asian        Caucasian    African    Hispanic
Kazakhstan   107         56 (52.3%)   47 (44.0%)   3 (2.8%)   1 (0.9%)
Kyrgyzstan   95          56 (58.9%)   34 (35.8%)   1 (1.1%)   4 (4.2%)
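The figure of 90 mis-classified samples from the two countries can be cross-checked against the counts listed above. The short sketch below assumes the four ethnicity columns give the predicted class for samples whose true label is Asian; the slide does not state this explicitly.

```python
# Counts copied from the slide; columns are assumed to be predicted ethnicities.
predicted = {
    "Kazakhstan": {"Asian": 56, "Caucasian": 47, "African": 3, "Hispanic": 1},
    "Kyrgyzstan": {"Asian": 56, "Caucasian": 34, "African": 1, "Hispanic": 4},
}

# A sample counts as mis-classified when it is assigned any label other than Asian.
misclassified = {
    country: sum(n for label, n in counts.items() if label != "Asian")
    for country, counts in predicted.items()
}
total_misclassified = sum(misclassified.values())
```

Under that reading, the per-country totals add up to the 90 samples quoted in the bullet above.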


Handling Missing Data

Mimic real-world scenario
  Training: forensic dataset
  Test: published dataset

rCRS and Probability are biased toward Caucasian.

Common Region is the best overall.
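The three missing-data strategies named in the slides could look roughly like the sketch below. The slides give only the strategy names, so the details are assumptions: untyped features are marked with None, "Probability" is read as filling in the variant frequency observed in the training data, and "Common Region" is read as restricting both training and test data to the features typed in the test sample. All function names are hypothetical.

```python
def fill_assume_rcrs(vector):
    """Treat every untyped feature (None) as matching rCRS, i.e. 0."""
    return [0 if v is None else v for v in vector]


def fill_with_probability(vector, mutation_freq):
    """Replace untyped features with the variant frequency observed in the training data."""
    return [f if v is None else v for v, f in zip(vector, mutation_freq)]


def common_region(train_vectors, test_vector):
    """Keep only the features that were typed in the test sample."""
    keep = [i for i, v in enumerate(test_vector) if v is not None]
    trimmed_train = [[row[i] for i in keep] for row in train_vectors]
    trimmed_test = [test_vector[i] for i in keep]
    return trimmed_train, trimmed_test


train = [[1, 0, 1], [0, 0, 1]]  # fully typed toy training vectors
freq = [sum(col) / len(train) for col in zip(*train)]  # per-feature variant frequency
test = [1, None, None]  # the last two features were not typed

rcrs_filled = fill_assume_rcrs(test)
prob_filled = fill_with_probability(test, freq)
trimmed_train, trimmed_test = common_region(train, test)
```

The first two strategies fill untyped positions with values driven by the reference sequence or by training-set frequencies, while the third avoids imputing anything at the cost of discarding features.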


Posterior Probability Calibration

PCA-SVM on published dataset with “Common Region”

Accuracy rates are slightly higher than the estimated posterior probabilities.
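A calibration check of this kind can be sketched by binning predictions by their estimated posterior and comparing each bin's mean posterior with its observed accuracy. The function name, the number of bins and the toy values are assumptions for illustration; the slides do not describe how the comparison was computed.

```python
def calibration_table(confidences, correct, n_bins=5):
    """Bin predictions by estimated posterior; return (mean posterior, accuracy, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, ok in zip(confidences, correct):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, ok))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            rows.append((mean_p, accuracy, len(members)))
    return rows


# Toy values: estimated posterior of the predicted class, and whether the prediction was right.
confidences = [0.55, 0.65, 0.9, 0.95, 1.0]
correct = [True, False, True, True, True]
rows = calibration_table(confidences, correct)
```

A well-calibrated classifier has accuracy close to the mean posterior in every bin; accuracy above the mean posterior, as the slide reports, means the estimated probabilities are slightly conservative.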


Conclusions

SVM is the most accurate algorithm among those investigated, outperforming
  Discriminant analysis employed by Egeland et al. (2004)
  1NN similar to that used by Behar et al. (2007)

Overall accuracy of 80%-90% in CV and independent testing
  Matches the accuracy of human experts relying on measurements of skull and large bones [Dib83, isc83]
  Approaches the accuracy achieved with ~60 autosomal loci [Bam04]


Questions?

Thank you for your attention.