genome-wide association studiesadegenet.r-forge.r-project.org/files/glasgow2015... ·...

Post on 19-Jun-2020

2 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Genome-Wide Association Studies

Caitlin Collins, Thibaut Jombart

MRC Centre for Outbreak Analysis and Modelling

Imperial College London

Genetic data analysis using

30-10-2014

Outline

• Introduction to GWAS

• Study design o GWAS design

o Issues and considerations in GWAS

• Testing for association o Univariate methods

o Multivariate methods

• Penalized regression methods

• Factorial methods

2

Genomics & GWAS

3

The genomics revolution

• Sequencing technology o 1977 – Sanger

o 1995 – 1st bacterial genomes

• < 10,000

bases per day per machine

o 2003 – 1st human genome

• > 10,000,000,000,000

bases per day per machine

• GWAS publications o 2005 – 1st GWAS

o Age-related macular degeneration

o 2014 – 1,991 publications

o 14,342 associations

Genomics & GWAS 4

A few GWAS discoveries…

Genomics & GWAS 5

• Genome Wide Association Study o Looking for SNPs…

associated with a phenotype.

• Purpose: o Explain

• Understanding

• Mechanisms

• Therapeutics

o Predict • Intervention

• Prevention

• Understanding not required

So what is GWAS?

Genomics & GWAS 6

• Definition o Any relationship

between two measured quantities that renders them statistically dependent.

• Heritability o The proportion of

variance explained by genetics

o P = G + E + G*E • Heritability > 0

Association

Genomics & GWAS 7

p SNPs

n individuals

Cases

Controls

Genomics & GWAS 8

Why? • Environment, Gene-Environment interactions

• Complex traits, small effects, rare variants

• Gene expression levels

• GWAS methodology?

Genomics & GWAS 9

Study Design

10

• Case-Control o Well-defined “case”

o Known heritability

• Variations o Quantitative phenotypic data

• Eg. Height, biomarker concentrations

o Explicit models

• Eg. Dominant or recessive

GWAS design

Study Design 11

• Data quality o 1% rule

• Controlling for confounding o Sex, age, health profile

o Correlation with other variables

• Population stratification*

• Linkage disequilibrium*

Issues & Considerations

Study Design 12

*

*

• Problem o Violates assumed population homogeneity, independent

observations

• Confounding, spurious associations

o Case population more likely to be related than Control population

• Over-estimation of significance of associations

Population stratification

Study Design 13

• Definition o “Population stratification” =

population structure

o Systematic difference in allele frequencies btw. sub-populations…

• … possibly due to different ancestry

• Solutions

o Visualise

• Phylogenetics

• PCA

o Correct

• Genomic Control

• Regression on Principal

Components of PCA

Population stratification II

Study Design 14

• Definition o Alleles at separate loci are NOT

independent of each other

• Problem? o Too much LD is a problem

• noise >> signal

o Some (predictable) LD can be beneficial

• enables use of “marker” SNPs

Linkage disequilibrium (LD)

Study Design 15

Testing for Association

16

• Standard GWAS

o Univariate methods

• Incorporating interactions

o Multivariate methods

• Penalized regression methods (LASSO)

• Factorial methods (DAPC-based FS)

Methods for association testing

Testing for Association 17

• Variations o Testing

• Fisher’s exact test, Cochran-Armitage trend test, Chi-squared test, ANOVA

• Gold Standard—Fischer’s exact test

o Correcting

• Bonferroni

• Gold Standard—FDR

Univariate methods

Testing for Association 18

p SNPs

n individuals

Cases

Controls

• Approach o Individual test statistics

o Correction for multiple

testing

Strengths Weaknesses

• Straightforward

• Computationally fast

• Conservative

• Easy to interpret

• Multivariate system,

univariate framework

• Effect size of individual

SNPs may be too small

• Marginal effects of

individual SNPs ≠

combined effects

Univariate – Strengths & weaknesses

Testing for Association 19

What about interactions?

Testing for Association 20

• Epistasis o “Deviation from linearity under a

general linear model”

• With p predictors, there are:

• 𝑝𝑘

=𝑝𝑘

𝑘! k-way interactions

• p = 10,000,000 5 x 1011

o That’s 500 BILLION possible pair-wise interactions!

• Need some way to limit the number of pairwise interactions considered…

Interactions

Testing for Association 21

𝑌𝑖 = 𝑤0 + 𝑤1𝐴𝑖 + 𝑤2𝐵𝑖 +𝒘𝟑𝑨𝒊𝑩𝒊

Multivariate methods

Testing for Association 22

LASSO penalized regression

Ridge regression

Neural Networks Penalized Regression

Bayesian Approaches

Factorial Methods

Bayesian Epistasis

Association Mapping

Logic Trees

Modified Logic

Regression-Gene

Expression Programming Genetic Programming for

Association Studies

Logic feature selection Monte Carlo

Logic Regression Logic regression

Supervised-PCA

Sparse-PCA

DAPC-based FS

(snpzip)

Bayesian partitioning

The elastic net

Bayesian Logistic

Regression with

Stochastic Search

Variable Selection

Odds-ratio-

based MDR

Multi-factor

dimensionality reduction

method

Genetic programming

optimized neural

networks

Parametric

decreasing method

Restricted

partitioning method

Combinatorial

partitioning method

Random forests

Set association

approach

Non-parametric Methods

• Penalized regression methods o LASSO penalized regression

• Factorial methods o DAPC-based

feature selection

Multivariate methods (ii)

Testing for Association 23

• Approach o Regression models multivariate association

o Shrinkage estimation feature selection

• Variations o LASSO, Ridge, Elastic net, Logic regression

• Gold Standard—LASSO penalized regression

Penalized regression methods

Testing for Association 24

• Regression o Generalized linear model (“glm”)

• Penalization o L1 norm

o Coefficients 0

o Feature selection!

LASSO penalized regression

Testing for Association 25

Strengths Weaknesses

• Stability

• Interpretability

• Likely to accurately

select the most

influential predictors

• Sparsity

• Multicollinearity

• Not designed for high-p

• Computationally intensive

• Calibration of penalty parameters o User-defined variability

• Sparsity

• NO p-values!

LASSO – Strengths & weaknesses

Testing for Association 26

• Approach o Place all variables (SNPs) in a multivariate space

o Identify discriminant axis best separation

o Select variables with the highest contributions to that axis

• Variations o Supervised-PCA, Sparse-PCA, DA, DAPC-based FS

o Our focus—DAPC with feature selection (snpzip)

Factorial methods

Testing for Association 27

Discriminant axis

De

nsi

ty o

f in

div

idu

als

Discriminant axis

De

nsi

ty o

f in

div

idu

als

DAPC-based feature selection

a

b

c d

e

Alleles

Individuals

0

0.1

0.2

0.3

0.4

0.5

a b c d e

Contribution to

Discriminant Axis

Healthy (“controls”)

Diseased (“cases”)

Testing for Association 28

Discriminant Axis Discriminant Axis

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

a b c d e

Contribution to Discriminant Axis

?

Discriminant axis

De

nsi

ty o

f in

div

idu

als

DAPC-based feature selection • Where should we draw the line?

o Hierarchical clustering

Hierarchical clustering (FS)

Testing for Association 30

0

0.1

0.2

0.3

0.4

0.5

a b c d e

Contribution to

Discriminant Axis

Hooray!

Strengths Weaknesses

• More likely to catch all

relevant SNPs (signal)

• Computationally quick

• Good exploratory tool

• Redundancy > sparsity

• Sensitive to n.pca

• N.snps.selected varies

• No “p-value”

• Redundancy > sparsity

DAPC – Strengths & weaknesses

Testing for Association 31

Conclusions

• Study design o GWAS design

o Issues and considerations in GWAS

• Testing for association o Univariate methods

o Multivariate methods

• Penalized regression methods

• Factorial methods

32

Thanks for listening!

33

Questions?

34

top related