genome-wide association studies

Post on 01-Jan-2016


Genome-Wide Association Studies

Caitlin Collins, Thibaut Jombart

MRC Centre for Outbreak Analysis and Modelling, Imperial College London

Genetic data analysis using 30-10-2014


Outline
• Introduction to GWAS
• Study design
  o GWAS design
  o Issues and considerations in GWAS
• Testing for association
  o Univariate methods
  o Multivariate methods
    • Penalized regression methods
    • Factorial methods


Genomics & GWAS

The genomics revolution
• Sequencing technology
  o 1977 – Sanger
  o 1995 – 1st bacterial genomes
    • < 10,000 bases per day per machine
  o 2003 – 1st human genome
    • > 10,000,000,000,000 bases per day per machine
• GWAS publications
  o 2005 – 1st GWAS
    • Age-related macular degeneration
  o 2014 – 1,991 publications
    • 14,342 associations


A few GWAS discoveries…

So what is GWAS?
• Genome-Wide Association Study
  o Looking for SNPs… associated with a phenotype.
• Purpose:
  o Explain
    • Understanding
    • Mechanisms
    • Therapeutics
  o Predict
    • Intervention
    • Prevention
    • Understanding not required

Association
• Definition
  o Any relationship between two measured quantities that renders them statistically dependent.
• Heritability
  o The proportion of variance explained by genetics
  o P = G + E + G*E
    • Heritability > 0
[Figure: data matrix of n individuals (cases and controls) by p SNPs]
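The heritability bullet can be written out as a variance ratio. This is a sketch under an assumption the slide does not state: that the three terms of P = G + E + G*E are pairwise uncorrelated. The symbol H^2 (broad-sense heritability) is introduced here for illustration.

```latex
\begin{aligned}
P &= G + E + G \times E \\
\operatorname{Var}(P) &= \operatorname{Var}(G) + \operatorname{Var}(E) + \operatorname{Var}(G \times E) \\
H^2 &= \frac{\operatorname{Var}(G)}{\operatorname{Var}(P)}, \qquad 0 \le H^2 \le 1
\end{aligned}
```

Under this reading, "Heritability > 0" means that some of the phenotypic variance is attributable to genetic variation, so there is something for an association study to find.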

Why?
• Environment, gene-environment interactions
• Complex traits, small effects, rare variants
• Gene expression levels
• GWAS methodology?


Study Design

GWAS design
• Case-control
  o Well-defined “case”
  o Known heritability
• Variations
  o Quantitative phenotypic data
    • E.g. height, biomarker concentrations
  o Explicit models
    • E.g. dominant or recessive
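An "explicit model" amounts to a choice of how genotypes are coded before testing. The sketch below assumes genotypes are stored as counts of the minor allele (0, 1 or 2); the function name and the coding labels are illustrative, not taken from the slides.

```python
import numpy as np

def encode_genotypes(minor_allele_counts, model="additive"):
    """Recode 0/1/2 minor-allele counts under an explicit genetic model."""
    g = np.asarray(minor_allele_counts)
    if model == "additive":
        return g.astype(float)           # each copy adds the same effect
    if model == "dominant":
        return (g >= 1).astype(float)    # one copy is enough
    if model == "recessive":
        return (g == 2).astype(float)    # two copies are required
    raise ValueError(f"unknown model: {model}")

genotypes = [0, 1, 2, 1, 0]
dominant = encode_genotypes(genotypes, "dominant")
recessive = encode_genotypes(genotypes, "recessive")
```

The recoded vector can then be used as the predictor in whichever association test is chosen.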

Issues & Considerations
• Data quality
  o 1% rule
• Controlling for confounding
  o Sex, age, health profile
  o Correlation with other variables
• Population stratification*
• Linkage disequilibrium*

Population stratification
• Definition
  o “Population stratification” = population structure
  o Systematic difference in allele frequencies between sub-populations…
    • … possibly due to different ancestry
• Problem
  o Violates assumed population homogeneity, independent observations
    • Confounding, spurious associations
  o Case population more likely to be related than control population
    • Over-estimation of significance of associations

Population stratification II
• Solutions
  o Visualise
    • Phylogenetics
    • PCA
  o Correct
    • Genomic Control
    • Regression on principal components (PCA)
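Genomic Control, one of the corrections listed above, can be sketched in a few lines. The sketch assumes that each SNP has a 1-df chi-squared test statistic and that most SNPs are not associated with the trait, so the median statistic reflects inflation alone. All names are illustrative.

```python
import numpy as np

# Median of a chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_control(chi2_stats):
    """Return (lambda_gc, corrected statistics) for 1-df chi-squared tests."""
    chi2_stats = np.asarray(chi2_stats, dtype=float)
    lambda_gc = np.median(chi2_stats) / CHI2_1DF_MEDIAN
    # By convention the statistics are only ever deflated, never inflated.
    lambda_gc = max(lambda_gc, 1.0)
    return lambda_gc, chi2_stats / lambda_gc

# Toy example: null statistics uniformly inflated by a factor of 1.5.
rng = np.random.default_rng(0)
null_stats = rng.standard_normal(10_000) ** 2   # chi-squared, 1 df
inflated = 1.5 * null_stats
lam, corrected = genomic_control(inflated)
```

Dividing by the inflation factor brings the median of the corrected statistics back to its expected null value. It applies the same deflation to every SNP, which is why regression on principal components is listed as an alternative.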

Linkage disequilibrium (LD)
• Definition
  o Alleles at separate loci are NOT independent of each other
• Problem?
  o Too much LD is a problem
    • noise >> signal
  o Some (predictable) LD can be beneficial
    • enables use of “marker” SNPs
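The definition above can be made concrete with the usual pairwise LD measures, D and r². The sketch assumes phased haplotypes coded 0/1 at two biallelic loci; the function name is illustrative.

```python
import numpy as np

def ld_stats(hap_a, hap_b):
    """D and r^2 between two biallelic loci from 0/1-coded haplotypes."""
    hap_a = np.asarray(hap_a, dtype=float)
    hap_b = np.asarray(hap_b, dtype=float)
    p_a = hap_a.mean()               # frequency of allele "1" at locus A
    p_b = hap_b.mean()               # frequency of allele "1" at locus B
    p_ab = (hap_a * hap_b).mean()    # frequency of the "1-1" haplotype
    d = p_ab - p_a * p_b             # zero when the loci are independent
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d ** 2 / denom if denom > 0 else float("nan")
    return d, r2

# Perfectly correlated loci: the second is a perfect "marker" for the first.
d_full, r2_full = ld_stats([0, 0, 1, 1], [0, 0, 1, 1])
# Independent loci: every haplotype combination is equally frequent.
d_none, r2_none = ld_stats([0, 0, 1, 1], [0, 1, 0, 1])
```

An r² close to 1 is the "predictable" LD that lets a genotyped marker SNP stand in for an ungenotyped causal variant.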


Testing for Association

Methods for association testing
• Standard GWAS
  o Univariate methods
• Incorporating interactions
  o Multivariate methods
    • Penalized regression methods (LASSO)
    • Factorial methods (DAPC-based FS)

Univariate methods
• Approach
  o Individual test statistics
  o Correction for multiple testing
• Variations
  o Testing
    • Fisher’s exact test, Cochran-Armitage trend test, Chi-squared test, ANOVA
    • Gold standard: Fisher’s exact test
  o Correcting
    • Bonferroni
    • Gold standard: FDR
[Figure: data matrix of n individuals (cases and controls) by p SNPs]
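The two-step approach (one test per SNP, then a multiple-testing correction) can be sketched as follows. For brevity the sketch uses the chi-squared test from the list above rather than Fisher's exact test, and implements the FDR correction as the Benjamini-Hochberg procedure. The allele counts are made-up toy numbers and the function names are illustrative.

```python
import math
import numpy as np

def chi2_2x2_pvalue(table):
    """Pearson chi-squared test (1 df, no continuity correction) on a 2x2 table."""
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row @ col / table.sum()
    stat = ((table - expected) ** 2 / expected).sum()
    # Survival function of chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2))

def bonferroni(pvals, alpha=0.05):
    """Reject where p <= alpha / m (controls the family-wise error rate)."""
    pvals = np.asarray(pvals, dtype=float)
    return pvals <= alpha / pvals.size

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the k smallest p-values, k = max{i : p_(i) <= alpha * i / m}."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    passed = pvals[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()   # largest rank that passes
        reject[order[: k + 1]] = True
    return reject

# One table per SNP. Rows: cases, controls. Columns: allele A, allele a.
tables = [
    [[70, 30], [40, 60]],   # strong association
    [[61, 39], [45, 55]],   # moderate association
    [[50, 50], [50, 50]],   # no association
]
pvals = [chi2_2x2_pvalue(t) for t in tables]
keep_bonferroni = bonferroni(pvals)
keep_fdr = benjamini_hochberg(pvals)
```

In this toy example Bonferroni retains only the first SNP, while the FDR procedure also retains the second, which illustrates why Bonferroni is described as conservative.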

Univariate – Strengths & weaknesses

Strengths
• Straightforward
• Computationally fast
• Conservative
• Easy to interpret

Weaknesses
• Multivariate system, univariate framework
• Effect size of individual SNPs may be too small
• Marginal effects of individual SNPs ≠ combined effects


What about interactions?

Interactions
• Epistasis
  o “Deviation from linearity under a general linear model”
  [Figure: two expressions contrasted (“vs.”); not recoverable from this transcript]
• With p predictors, there are C(p, k) = p! / (k! (p − k)!) k-way interactions
• p = 1,000,000 → 5 x 10^11
  o That’s 500 BILLION possible pair-wise interactions!
• Need some way to limit the number of pairwise interactions considered…
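The number of k-way interactions among p predictors is the binomial coefficient C(p, k), which can be checked directly. A minimal sketch (the function name is illustrative): the 500 billion figure corresponds to one million SNPs, and ten million SNPs would give roughly 50 trillion pairs.

```python
import math

def n_interactions(p, k):
    """Number of distinct k-way interactions among p predictors: C(p, k)."""
    return math.comb(p, k)

pairs_1m = n_interactions(1_000_000, 2)     # 499_999_500_000, about 5 x 10^11
pairs_10m = n_interactions(10_000_000, 2)   # 49_999_995_000_000, about 5 x 10^13
triples_1m = n_interactions(1_000_000, 3)   # grows roughly as p^3 / 6
```

Because the count grows as p^k / k!, testing every pair is already expensive and every triple is out of reach, which motivates methods that restrict the interactions considered.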

Multivariate methods
[Figure: overview of multivariate methods; labels recovered from the figure are listed below]

Categories: Penalized Regression, Bayesian Approaches, Factorial Methods, Logic Trees, Neural Networks, Non-parametric Methods

Methods:
• LASSO penalized regression
• Ridge regression
• The elastic net
• Bayesian Epistasis Association Mapping
• Bayesian partitioning
• Bayesian Logistic Regression with Stochastic Search Variable Selection
• Supervised-PCA
• Sparse-PCA
• DAPC-based FS (snpzip)
• Logic regression
• Monte Carlo Logic Regression
• Logic feature selection
• Modified Logic Regression-Gene Expression Programming
• Genetic Programming for Association Studies
• Genetic programming optimized neural networks
• Multi-factor dimensionality reduction method
• Odds-ratio-based MDR
• Parametric decreasing method
• Restricted partitioning method
• Combinatorial partitioning method
• Random forests
• Set association approach

Multivariate methods
• Penalized regression methods
  o LASSO penalized regression
• Factorial methods
  o DAPC-based feature selection

Penalized regression methods
• Approach
  o Regression models → multivariate association
  o Shrinkage estimation → feature selection
• Variations
  o LASSO, Ridge, Elastic net, Logic regression
    • Gold standard: LASSO penalized regression

LASSO penalized regression
• Regression
  o Generalized linear model (“glm”)
• Penalization
  o L1 norm
  o Coefficients → 0
  o Feature selection!
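How the L1 penalty drives coefficients to exactly zero can be seen in a small coordinate-descent sketch. It is a simplification of the setting above: it fits an ordinary linear model with a quantitative phenotype rather than a full GLM, on simulated 0/1/2 genotypes. The function names, the penalty value and the simulated effect sizes are all illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t; values with |z| <= t become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1 / 2n) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent.

    Assumes the columns of X and the response y are centred (no intercept).
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()           # y - X @ beta, with beta = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # Covariance of column j with the residual once column j is added back.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta

# Simulated data: 50 SNPs, of which only SNPs 3 and 17 affect the phenotype.
rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.integers(0, 3, size=(n, p)).astype(float)   # 0/1/2 allele counts
X -= X.mean(axis=0)
true_beta = np.zeros(p)
true_beta[[3, 17]] = [1.0, -1.0]
y = X @ true_beta + 0.5 * rng.standard_normal(n)
y -= y.mean()

beta_hat = lasso_cd(X, y, lam=0.2)
selected = np.nonzero(beta_hat)[0]           # SNPs with non-zero coefficients
```

The soft-thresholding step is where feature selection happens: any SNP whose covariance with the current residual stays below the penalty keeps a coefficient of exactly zero. The choice of `lam` controls how many SNPs survive, which is the calibration issue that comes with this family of methods.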

LASSO – Strengths & weaknesses

Strengths
• Stability
• Interpretability
• Likely to accurately select the most influential predictors
• Sparsity

Weaknesses
• Multicollinearity
• Not designed for high-p
• Computationally intensive
• Calibration of penalty parameters
  o User-defined variability
• Sparsity

Factorial methods
• Approach
  o Place all variables (SNPs) in a multivariate space
  o Identify discriminant axis → best separation
  o Select variables with the highest contributions to that axis
• Variations
  o Supervised-PCA, Sparse-PCA, DA, DAPC-based FS
  o Our focus: DAPC with feature selection (snpzip)

DAPC-based feature selection
[Figure: density of individuals along the discriminant axis for healthy (“controls”) and diseased (“cases”) individuals, with a bar chart of the contribution of alleles a–e to the discriminant axis]

DAPC-based feature selection
• Where should we draw the line?
  o Hierarchical clustering
[Figure: contribution of alleles a–e to the discriminant axis, with the selection threshold marked “?”]

Hierarchical clustering (FS)
[Figure: contribution of alleles a–e to the discriminant axis]

Hooray!
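The three steps of DAPC-based feature selection (place variables in a multivariate space, find the discriminant axis, keep the top contributors) can be sketched with NumPy. This is not the snpzip implementation: it assumes two groups, uses a plain PCA followed by Fisher's linear discriminant, and replaces the hierarchical clustering step with a cut at the largest gap in the sorted contributions, which is what a two-cluster single-linkage split reduces to in one dimension. All function names and the simulated data are illustrative.

```python
import numpy as np

def dapc_contributions(X, groups, n_pca):
    """Contribution of each variable to the discriminant axis between two groups."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    Xc = X - X.mean(axis=0)
    # PCA step: keep the first n_pca principal components.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    V = vt[:n_pca].T                         # variables x n_pca
    scores = Xc @ V                          # individuals x n_pca
    # DA step: Fisher's linear discriminant on the PC scores.
    labels = np.unique(groups)
    s0, s1 = scores[groups == labels[0]], scores[groups == labels[1]]
    within = (np.cov(s0, rowvar=False) * (len(s0) - 1)
              + np.cov(s1, rowvar=False) * (len(s1) - 1))
    w = np.linalg.solve(np.atleast_2d(within), s1.mean(axis=0) - s0.mean(axis=0))
    loadings = V @ w                         # back to the original variables
    return loadings ** 2 / np.sum(loadings ** 2)

def select_by_largest_gap(contrib):
    """Keep the variables above the largest gap in the sorted contributions."""
    order = np.argsort(contrib)[::-1]        # largest contribution first
    gaps = -np.diff(contrib[order])          # drop between neighbours
    cut = np.argmax(gaps) + 1
    return np.sort(order[:cut])

# Simulated data: 20 variables, only variable 4 differs between the groups.
rng = np.random.default_rng(2)
n, p = 200, 20
groups = np.repeat([0, 1], n // 2)
X = rng.standard_normal((n, p))
X[groups == 1, 4] += 4.0
contrib = dapc_contributions(X, groups, n_pca=5)
selected = select_by_largest_gap(contrib)
```

The contributions sum to one, so they can be read as the share of the discriminant axis carried by each variable. The result depends on `n_pca`, and the number of selected variables is decided by the clustering step rather than by a significance threshold.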

DAPC – Strengths & weaknesses

Strengths
• More likely to catch all relevant SNPs (signal)
• Computationally quick
• Good exploratory tool
• Redundancy > sparsity

Weaknesses
• Sensitive to n.pca
• N.snps.selected varies
• No “p-value”
• Redundancy > sparsity


Conclusions
• Study design
  o GWAS design
  o Issues and considerations in GWAS
• Testing for association
  o Univariate methods
  o Multivariate methods
    • Penalized regression methods
    • Factorial methods


Thanks for listening!


Questions?
