principal component analysis in genomic data

17
Principal Component Analysis in Genomic Data Seunggeun Lee Department of Biostatistics University of North Carolina at Chapel Hill March 4, 2010 Seunggeun Lee (UNC-CH) PCA March 4, 2010 1 / 12

Upload: others

Post on 22-Mar-2022

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Principal Component Analysis in Genomic Data

Principal Component Analysis in Genomic Data

Seunggeun Lee

Department of BiostatisticsUniversity of North Carolina at Chapel Hill

March 4, 2010

Seunggeun Lee (UNC-CH) PCA March 4, 2010 1 / 12

Page 2: Principal Component Analysis in Genomic Data

Bio

Korean

Undergraduate Major : Biology/Statistics

Worked 3 and 1/2 years as a Software Engineer

Came to UNC at 2005, admitted to MS and then progressed to PhD

Dissertation Advisors : Dr. Fei Zou and Dr. Fred A. Wright

Seunggeun Lee (UNC-CH) PCA March 4, 2010 2 / 12

Page 3: Principal Component Analysis in Genomic Data

Genomewide Association Studies

Single Nucleotide Polymorphism(SNP)

I Single nucleotide variationI Occur every 100 to 300 bases

Genomewide association StudyI Goal : To find SNPs associated

with case-control or quantitativetraits.

I Typically have > 1, 000 samplesI Test each SNPs

Seunggeun Lee (UNC-CH) PCA March 4, 2010 3 / 12

Page 4: Principal Component Analysis in Genomic Data

Genomewide Association Studies

Obtain genotype using SNP microarray.I SNP chips have 500k ∼ 1 million SNPs

For each SNPs (A vs a)

AA Aa aa

Case 320 160 20

Control 245 210 45

I P-value = 2× 10−6

I Compute p-values of all 5× 105 ∼ 106 SNPs

Seunggeun Lee (UNC-CH) PCA March 4, 2010 4 / 12

Page 5: Principal Component Analysis in Genomic Data

Genomewide Association Studies

Obtain genotype using SNP microarray.I SNP chips have 500k ∼ 1 million SNPs

For each SNPs (A vs a)

AA Aa aa

Case 320 160 20

Control 245 210 45

I P-value = 2× 10−6

I Compute p-values of all 5× 105 ∼ 106 SNPs

Seunggeun Lee (UNC-CH) PCA March 4, 2010 4 / 12

Page 6: Principal Component Analysis in Genomic Data

Genomewide Association Studies

Figure: −log10 P-values, from the GAIN Schizophrenia Data

Seunggeun Lee (UNC-CH) PCA March 4, 2010 5 / 12

Page 7: Principal Component Analysis in Genomic Data

Population Stratification

Population stratification is the presence of a systematic difference in allelefrequencies between subpopulations. It is a major confounding factor ofGWAS.

Study : Genetic components of Height

Samples were collected in EuropeI Researchers have found that Chr 2q 21 region is

associated with heightI This region encodes the lactase gene (LCT)

Relationship between Height and Lactasetolerance?

Seunggeun Lee (UNC-CH) PCA March 4, 2010 6 / 12

Page 8: Principal Component Analysis in Genomic Data

Northern vs. Southern European

Why ?

Study was conducted in Europe

Northern European vs. Southern European

Height (Adult men) Lactose Tolerance

Northern (Sweden) 5 ft 11 1/2 in 98%

Southern (Italy) 5 ft 9 1/2 in ∼ 50%

Northern Europeans are taller than Southern Europeans

Northern Europeans are lactose tolerant, but Southern Europeans arenot.

⇒ Population stratification (PS) is a major confounding factor in GWAS

Seunggeun Lee (UNC-CH) PCA March 4, 2010 7 / 12

Page 9: Principal Component Analysis in Genomic Data

Northern vs. Southern European

Why ?

Study was conducted in Europe

Northern European vs. Southern European

Height (Adult men) Lactose Tolerance

Northern (Sweden) 5 ft 11 1/2 in 98%

Southern (Italy) 5 ft 9 1/2 in ∼ 50%

Northern Europeans are taller than Southern Europeans

Northern Europeans are lactose tolerant, but Southern Europeans arenot.

⇒ Population stratification (PS) is a major confounding factor in GWAS

Seunggeun Lee (UNC-CH) PCA March 4, 2010 7 / 12

Page 10: Principal Component Analysis in Genomic Data

PCA in GWAS data

To adjust PS, we need to know the accurate ethnicity information.PCA is typically used for this purpose.

Price et al. (2006) Principal components analysis corrects forstratification in genome-wide association studies. Nature Genetics

Seunggeun Lee (UNC-CH) PCA March 4, 2010 8 / 12

Page 11: Principal Component Analysis in Genomic Data

European

Figure: European Journal of Human Genetics (2008) 16, 14131429

Seunggeun Lee (UNC-CH) PCA March 4, 2010 9 / 12

Page 12: Principal Component Analysis in Genomic Data

Asian

Figure: PLoS ONE. 2008; 3(12): e3862.

Seunggeun Lee (UNC-CH) PCA March 4, 2010 10 / 12

Page 13: Principal Component Analysis in Genomic Data

What I am doing

My research focuses on

Linkage Disequilibrium AdjustmentI Control of population stratification using correlated SNPs by shrinkage

principal components. Human Heredity (accepted)

Principal Component SelectionI Control of population stratification by correlation-selected principal

components. Biometrics (under revision)

Asymptotic behaviors of Principal ComponentI Convergence and prediction of principal component scores in high

dimensional settings. Annals of Statistics (accepted)

Other reserch papers: published and submitted in American Journalof Human genetics, Molecular Psychiatry, and Genetics.

Seunggeun Lee (UNC-CH) PCA March 4, 2010 11 / 12

Page 14: Principal Component Analysis in Genomic Data

What I am doing

My research focuses on

Linkage Disequilibrium AdjustmentI Control of population stratification using correlated SNPs by shrinkage

principal components. Human Heredity (accepted)

Principal Component SelectionI Control of population stratification by correlation-selected principal

components. Biometrics (under revision)

Asymptotic behaviors of Principal ComponentI Convergence and prediction of principal component scores in high

dimensional settings. Annals of Statistics (accepted)

Other reserch papers: published and submitted in American Journalof Human genetics, Molecular Psychiatry, and Genetics.

Seunggeun Lee (UNC-CH) PCA March 4, 2010 11 / 12

Page 15: Principal Component Analysis in Genomic Data

What I am doing

My research focuses on

Linkage Disequilibrium AdjustmentI Control of population stratification using correlated SNPs by shrinkage

principal components. Human Heredity (accepted)

Principal Component SelectionI Control of population stratification by correlation-selected principal

components. Biometrics (under revision)

Asymptotic behaviors of Principal ComponentI Convergence and prediction of principal component scores in high

dimensional settings. Annals of Statistics (accepted)

Other reserch papers: published and submitted in American Journalof Human genetics, Molecular Psychiatry, and Genetics.

Seunggeun Lee (UNC-CH) PCA March 4, 2010 11 / 12

Page 16: Principal Component Analysis in Genomic Data

What I am doing

My research focuses on

Linkage Disequilibrium AdjustmentI Control of population stratification using correlated SNPs by shrinkage

principal components. Human Heredity (accepted)

Principal Component SelectionI Control of population stratification by correlation-selected principal

components. Biometrics (under revision)

Asymptotic behaviors of Principal ComponentI Convergence and prediction of principal component scores in high

dimensional settings. Annals of Statistics (accepted)

Other reserch papers: published and submitted in American Journalof Human genetics, Molecular Psychiatry, and Genetics.

Seunggeun Lee (UNC-CH) PCA March 4, 2010 11 / 12

Page 17: Principal Component Analysis in Genomic Data

Welcome to Chapel Hill!!

Seunggeun Lee (UNC-CH) PCA March 4, 2010 12 / 12