genome-wide association studies caitlin collins, thibaut jombart mrc centre for outbreak analysis...

34
Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis using 30-10-2014

Upload: bartholomew-dixon

Post on 21-Dec-2015

220 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Genome-Wide Association Studies

Caitlin Collins, Thibaut Jombart

MRC Centre for Outbreak Analysis and ModellingImperial College London

Genetic data analysis using 30-10-2014

Page 2: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

2

Outline• Introduction to GWAS

• Study designo GWAS designo Issues and considerations in GWAS

• Testing for associationo Univariate methods o Multivariate methods

• Penalized regression methods• Factorial methods

Page 3: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

3

Genomics & GWAS

Page 4: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Genomics & GWAS 4

The genomics revolution• Sequencing technology

o 1977 – Sangero 1995 – 1st bacterial genomes

• < 10,000 bases per day per machine

o 2003 – 1st human genome• > 10,000,000,000,000

bases per day per machine

• GWAS publicationso 2005 – 1st GWAS

o Age-related macular degeneration

o 2014 – 1,991 publicationso 14,342 associations

Page 5: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Genomics & GWAS 5

A few GWAS discoveries…

Page 6: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Genomics & GWAS 6

• Genome Wide Association Studyo Looking for SNPs… associated with a phenotype.

• Purpose:o Explain

• Understanding• Mechanisms• Therapeutics

o Predict• Intervention• Prevention• Understanding not required

So what is GWAS?

Page 7: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Genomics & GWAS 7

• Definitiono Any relationship

between two measured quantities that renders them statistically dependent.

• Heritabilityo The proportion of

variance explained by genetics

o P = G + E + G*E• Heritability > 0

Associationp

SNPs

n individuals

Cases

Controls

Page 8: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Genomics & GWAS 8

Page 9: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Genomics & GWAS 9

Why?• Environment, Gene-Environment interactions• Complex traits, small effects, rare variants• Gene expression levels• GWAS methodology?

Page 10: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

10

Study Design

Page 11: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Study Design 11

• Case-Controlo Well-defined “case”o Known heritability

• Variationso Quantitative phenotypic data

• Eg. Height, biomarker concentrationso Explicit models

• Eg. Dominant or recessive

GWAS design

Page 12: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Study Design 12

• Data qualityo 1% rule

• Controlling for confoundingo Sex, age, health profileo Correlation with other variables

• Population stratification*

• Linkage disequilibrium*

Issues & Considerations

*

*

Page 13: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Study Design 13

• Problemo Violates assumed population homogeneity, independent

observations• Confounding, spurious associations

o Case population more likely to be related than Control population• Over-estimation of significance of associations

Population stratification• Definition

o “Population stratification” = population structure

o Systematic difference in allele frequencies btw. sub-populations… • … possibly due to different

ancestry

Page 14: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Study Design 14

• Solutionso Visualise

• Phylogenetics• PCA

o Correct• Genomic Control• Regression on

Principal Components of PCA

Population stratification II

Page 15: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Study Design 15

• Definitiono Alleles at separate loci are

NOT independent of each other

• Problem?o Too much LD is a problem

• noise >> signal

o Some (predictable) LD can be beneficial• enables use of “marker”

SNPs

Linkage disequilibrium (LD)

Page 16: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

16

Testing for Association

Page 17: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Testing for Association 17

• Standard GWASo Univariate methods

• Incorporating interactionso Multivariate methods

• Penalized regression methods (LASSO)• Factorial methods (DAPC-based FS)

Methods for association testing

Page 18: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Testing for Association 18

• Variationso Testing

• Fisher’s exact test, Cochran-Armitage trend test, Chi-squared test, ANOVA

• Gold Standard—Fischer’s exact testo Correcting

• Bonferroni• Gold Standard—FDR

Univariate methodsp

SNPs

n individuals

Cases

Controls

• Approacho Individual test statisticso Correction for multiple

testing

Page 19: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Testing for Association 19

Strengths Weaknesses• Straightforward

• Computationally fast

• Conservative

• Easy to interpret

• Multivariate system, univariate framework

• Effect size of individual SNPs may be too small

• Marginal effects of individual SNPs ≠ combined effects

Univariate – Strengths &

weaknesses

Page 20: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Testing for Association 20

What about interactions?

Page 21: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Testing for Association 21

• Epistasiso “Deviation from linearity under a general linear

model”

vs.

• With p predictors, there are:• k-way interactions

• p = 10,000,000 5 x 1011 o That’s 500 BILLION possible pair-wise

interactions!

• Need some way to limit the number of pairwise interactions considered…

Interactions

Page 22: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Testing for Association 22

Multivariate methodsLASSO penalized regression

Ridge regression

Neural NetworksPenalized Regression

Bayesian Approaches

Factorial Methods

Bayesian Epistasis Association Mapping

Logic Trees

Modified Logic Regression-Gene

Expression ProgrammingGenetic Programming

for Association Studies

Logic feature selectionMonte Carlo

Logic RegressionLogic regression

Supervised-PCA

Sparse-PCA

DAPC-based FS (snpzip)

Bayesian partitioning

The elastic net

Bayesian Logistic Regression with

Stochastic Search Variable Selection

Odds-ratio-based MDR

Multi-factor dimensionality reduction

method

Genetic programming optimized neural

networks

Parametric decreasing method

Restricted partitioning method

Combinatorial partitioning method

Random forests

Set association approach

Non-parametric Methods

Page 23: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Testing for Association 23

• Penalized regression methodso LASSO penalized regression

• Factorial methodso DAPC-based

feature selection

Multivariate methods

Page 24: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Testing for Association 24

• Approacho Regression models multivariate associationo Shrinkage estimation feature selection

• Variationso LASSO, Ridge, Elastic net, Logic regression

• Gold Standard—LASSO penalized regression

Penalized regression methods

Page 25: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Testing for Association 25

• Regressiono Generalized linear model (“glm”)

• Penalizationo L1 norm

o Coefficients 0

o Feature selection!

LASSO penalized regression

Page 26: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Testing for Association 26

Strengths Weaknesses• Stability

• Interpretability

• Likely to accurately select the most influential predictors

• Sparsity

• Multicollinearity

• Not designed for high-p

• Computationally intensive

• Calibration of penalty parameterso User-defined variability

• Sparsity

LASSO – Strengths & weaknesses

Page 27: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Testing for Association 27

• Approacho Place all variables (SNPs) in a multivariate spaceo Identify discriminant axis best separationo Select variables with the highest contributions to that

axis

• Variationso Supervised-PCA, Sparse-PCA, DA, DAPC-based FSo Our focus—DAPC with feature selection (snpzip)

Factorial methods

Page 28: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Testing for Association 28Discriminant axis

Den

sit

y o

f in

div

idu

als

Discriminant axis

Den

sit

y o

f in

div

idu

als

DAPC-based feature selection

a

b

cd

e

AllelesIndividuals

a b c d e0

0.1

0.2

0.3

0.4

0.5

Contribution to Discriminant Axis

Healthy (“controls”)

Diseased (“cases”)

Discriminant Axis

Discriminant Axis

Page 29: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

a b c d e0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

Contribution to Discriminant Axis

?

Discriminant axis

Den

sit

y o

f in

div

idu

als

DAPC-based feature selection• Where should we draw the line?

o Hierarchical clustering

Page 30: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Testing for Association 30

Hierarchical clustering (FS)

a b c d e0

0.050.1

0.150.2

0.250.3

0.350.4

0.45

Contribution to Discriminant Axis

Hooray!

Page 31: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

Testing for Association 31

Strengths Weaknesses

• More likely to catch all relevant SNPs (signal)

• Computationally quick

• Good exploratory tool

• Redundancy > sparsity

• Sensitive to n.pca

• N.snps.selected varies

• No “p-value”

• Redundancy > sparsity

DAPC – Strengths & weaknesses

Page 32: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

32

Conclusions• Study design

o GWAS designo Issues and considerations in GWAS

• Testing for associationo Univariate methods o Multivariate methods

• Penalized regression methods• Factorial methods

Page 33: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

33

Thanks for listening!

Page 34: Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis

34

Questions?