eric xing © eric xing @ cmu, 2006-2010 1 machine learning application i: computational genomics...

59
Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Machine Learning Application I: Application I: Computational Genomics Computational Genomics Eric Xing Eric Xing Lecture 18, August 16, 2010

Upload: mildred-long

Post on 04-Jan-2016

235 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 1

Machine LearningMachine Learning

Application I: Application I:

Computational GenomicsComputational Genomics

Eric XingEric Xing

Lecture 18, August 16, 2010

Page 2: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 2

Biological Data Analysis

Dynamic, noisy, heterogeneous, high-dimensional data

“High-resolution” inference Parsimonious Scalability Stability Sample complexity Confidence bound

Page 3: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 3

Healthy

Sick

ACTCGTACGTAGACCTAGCATTTACGCAATAATGCGA

ACTCGAACCTAGACCTAGCATTTACGCAATAATGCGA

TCTCGTACGTAGACGTAGCATTTACGCAATTATCCGA

ACTCGAACCTAGACCTAGCATTTACGCAATTATCCGA

ACTCGTACGTAGACGTAGCATAACGCAATAATGCGA

TCTCGTACCTAGACGTAGCATAACGCAATAATCCGA

ACTCGAACCTAGACCTAGCATAACGCAATTATCCGA

Single nucleotide polymorphism (SNP)

Causal (or "associated") SNP

Genetic Basis of Diseases

Page 4: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 4

A . . . . T . . G . . . . . C . . . . . . T . . . . . A . . G .

A . . . . A . . C . . . . . C . . . . . . T . . . . . A . . G .

T . . . . T . . G . . . . . G . . . . . . T . . . . . T . . C .

A . . . . A . . C . . . . . C . . . . . . T . . . . . T . . C .

A . . . . T . . G . . . . . G . . . . . . A . . . . . A . . G .

T . . . . T . . C . . . . . G . . . . . . A . . . . . A . . C .

A . . . . A . . C . . . . . C . . . . . . A . . . . . T . . C .

Data

Genotype Phenotype

• Cancer: Dunning et al. 2009.• Diabetes: Dupuis et al. 2010.• Atopic dermatitis: Esparza-Gordillo et al. 2009.• Arthritis: Suzuki et al. 2008

Standard Approach

causal SNPcausal SNP

a univariate a univariate phenotypephenotype::e.g., disease/controle.g., disease/control

Genetic Association Mapping

Page 5: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 5

Healthy

Cancer

ACTCGTACGTAGACCTAGCATTACGCAATAATGCGA

ACTCGAACCTAGACCTAGCATTACGCAATAATGCGA

TCTCGTACGTAGACGTAGCATTACGCAATTATCCGA

ACTCGAACCTAGACCTAGCATTACGCAATTATCCGA

ACTCGTACGTAGACGTAGCATAACGCAATAATGCGA

TCTCGTACCTAGACGTAGCATAACGCAATAATCCGA

ACTCGAACCTAGACCTAGCATAACGCAATTATCCGA

Causal SNPs

Genetic Basis of Complex Diseases

Page 6: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 6

Genetic Basis of Complex Diseases

Healthy

Cancer

ACTCGTACGTAGACCTAGCATTACGCAATAATGCGA

ACTCGAACCTAGACCTAGCATTACGCAATAATGCGA

TCTCGTACGTAGACGTAGCATTACGCAATTATCCGA

ACTCGAACCTAGACCTAGCATTACGCAATTATCCGA

ACTCGTACGTAGACGTAGCATAACGCAATAATGCGA

TCTCGTACCTAGACGTAGCATAACGCAATAATCCGA

ACTCGAACCTAGACCTAGCATAACGCAATTATCCGA

Causal SNPs

Biological mechanism

Page 7: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 7

Intermediate Phenotype

Genetic Basis of Complex Diseases

Healthy

Cancer

ACTCGTACGTAGACCTAGCATTACGCAATAATGCGA

ACTCGAACCTAGACCTAGCATTACGCAATAATGCGA

TCTCGTACGTAGACGTAGCATTACGCAATTATCCGA

ACTCGAACCTAGACCTAGCATTACGCAATTATCCGA

ACTCGTACGTAGACGTAGCATAACGCAATAATGCGA

TCTCGTACCTAGACGTAGCATAACGCAATAATCCGA

ACTCGAACCTAGACCTAGCATAACGCAATTATCCGA

Causal SNPs

Clinical records

Gene expression

Association to intermediate phenotypes

Page 8: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 8

Association with Phenome

ACGTTTTACTGTACAATT

Multivariate complex syndrome (e.g., asthma)Multivariate complex syndrome (e.g., asthma)age at onset, history of eczemaage at onset, history of eczema

genome-wide expression profilegenome-wide expression profile

ACGTTTTACTGTACAATT

Traditional Approachcausal SNPcausal SNP

a univariate phenotype:a univariate phenotype:gene expression levelgene expression level

Structured Association

Page 9: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 9

Considerone phenotype at a time

Consider multiple correlated phenotypes (phenome) jointly

vs.

Standard Approach New Approach

Phenotypes Phenome

Structured Association : a New Paradigm

Page 10: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 10

Subnetworks for lung physiology Subnetwork for

quality of life

Genetic Association for Asthma Clinical Traits

TCGACGTTTTACTGTACAATT

Statistical challengeHow to find

associations to a multivariate entity in a graph?

Page 11: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 11

Gene Expression

Trait Analysis

TCGACGTTTTACTGTACAATT

Genes

Samples

Statistical challengeHow to find

associations to a multivariate entity in a tree?

Page 12: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 12

Sparse Associations

Pleotropic effectsPleotropic effects

Epistatic effectsEpistatic effects

Page 13: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 13

Sparse Learning

Linear Model:

Lasso (Sparse Linear Regression)

Why sparse solution?

penalizing

constraining

[R.Tibshirani 96]

Page 14: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 14

Multi-Task Extension Multi-Task Linear Model:

Coefficients for k-th task:

Coefficient Matrix:

Input:

Output:

Coefficients for a variable (2nd)

Coefficients for a task (2nd)

Page 15: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 15

Background: Sparse multivariate regression for disease association studies

Structured association – a new paradigm Association to a graph-structured phenome

Graph-guided fused lasso (Kim & Xing, PLoS Genetics, 2009)

Association to a tree-structured phenome Tree-guided group lasso (Kim & Xing, ICML 2010)

Outline

Page 16: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 16

T G

A A

C C

A T

G A

A G

T A

GenotypeTrait

Multivariate Regression for Single-Trait Association Analysis

Xy

2.1 x=

x= β

Association Strength

?

Page 17: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 17

T G

A A

C C

A T

G A

A G

T A

GenotypeTrait

Multivariate Regression for Single-Trait Association Analysis

Many non-zero associations: Which SNPs are truly significant?

2.1 x=

Association Strength

Page 18: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 18

Genotype

x=2.1

Trait

Lasso for Reducing False Positives (Tibshirani, 1996)

Many zero associations (sparse results), but what if there are multiple related traits?

+

J

j 1

|βj |

T G

A A

C C

A T

G A

A G

T A

Lasso Penalty for sparsity

Association Strength

Page 19: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 19

Genotype

(3.4, 1.5, 2.1, 0.9, 1.8)

Multivariate Regression for Multiple-Trait Association Analysis

T G

A A

C C

A T

G A

A G

T A

How to combine information across multiple traits to increase the power?

Association Strength

x=

Association strength between SNP j and Trait i: βj,i

Allerg

y fo

r cat

s

Allerg

y fo

r roa

ches

Allerg

y in

spr

ing

FEVFEF

Allergy Lung physiology

LD

Syntheti

c lethal

+ ji,

|βj,i|

Page 20: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 20

Genotype

(3.4, 1.5, 2.1, 0.9, 1.8)

Trait

Multivariate Regression for Multiple-Trait Association Analysis

T G

A A

C C

A T

G A

A G

T A

Allergy Lung physiology

Association Strength

x=

+We introduce

graph-guided fusion penalty

Association strength between SNP j and Trait i: βj,i

+ ji,

|βj,i|

Page 21: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 21

Multiple-trait Association:Graph-Constrained Fused Lasso

Step 1: Thresholded correlation graph of phenotypes

ACGTTTTACTGTACAATT

Step 2: Graph-constrained fused lasso

Lasso Penalty

Graph-constrained fusion penalty

Fusion

Page 22: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 22

Fusion Penalty

Fusion Penalty: | βjk - βjm |

For two correlated traits (connected in the network), the association strengths may have similar values.

ACGTTTTACTGTACAATT

SNP j

Trait m

Trait k

Association strength between SNP j and Trait k: βjk

Association strength between SNP j and Trait m: βjm

Page 23: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 23

ACGTTTTACTGTACAATT

Overall effect

Graph-Constrained Fused Lasso

Fusion effect propagates to the entire network Association between SNPs and subnetworks of traits

Page 24: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 24

ACGTTTTACTGTACAATT

Multiple-trait Association: Graph-Weighted Fused Lasso

Subnetwork structure is embedded as a densely connected nodes with large edge weights

Edges with small weights are effectively ignored

Overall effect

Page 25: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 25

Estimating Parameters

Quadratic programming formulation Graph-constrained fused lasso

Graph-weighted fused lasso

Many publicly available software packages for solving convex optimization problems can be used

Page 26: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 26

Improving Scalability

Iterative optimization• Update βk

• Update djk’s, djml’s

Original problem

Equivalently

Using a variational formulation

Page 27: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 27

Previous Works vs. Our Approach

Previous approach Our approach

PCA-based approach (Weller et al., 1996, Mangin et al., 1998)

Implicit representation of trait correlations

Hard to interpret the derived traits

Explicit representation of trait correlations

Extension of module network for eQTL study (Lee et al., 2009)

Average traits within each trait cluster

Loss of information

Original data for traits are used

Network-based approach (Chen et al., 2008, Emilsson et al., 2008)

Separate association analysis for each trait (no information sharing)

Single-trait association are combined in light of trait network modules

Joint association analysis of multiple traits

Page 28: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 28

• 50 SNPs taken from HapMap chromosome 7, CEU population

• 10 traits

Trait Correlation Matrix

True Regression Coefficients

Single SNP-Single Trait Test

Significant at α = 0.01

LassoGraph-guided Fused Lasso

Thresholded Trait Correlation Network

Simulation Results

Phenotypes

SN

Ps

No association

High association

Page 29: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 29

Simulation Results

Page 30: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 30

Asthma Association Study

543 severe asthma patients from the Severe Asthma Research Program (SARP)

Genotypes : 34 SNPs in IL4R gene 40kb region of chromosome 16 Impute missing genotypes with PHASE (Li and Stephens, 2003)

Traits : 53 asthma-related clinical traits Quality of Life: emotion, environment, activity, symptom Family history: number of siblings with allergy, does the father has asthma? Asthma symptoms: chest tightness, wheeziness

Page 31: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 31

Asthma and IL4R Gene

T cell

In normal individuals: allergen ignored by the immune system

Allergen (ragweed, grass, etc.)

Page 32: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 32

Asthma and IL4R Gene

T cell

In asthma patientsAllergen (ragweed, grass, etc.)

Th2 cell B cell

ImmunoglobulinE (IgE) antibody

Inflammation!

• Airway in lung narrows• Difficult to breathe• Asthma symptoms

Interleukin-4 Receptor regulates IgE antibody production

Page 33: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 33

IL4R Gene

Chr16: 27,325,251-27,376,098

IL4R

SNPs in intron

SNPs in exon

SNPs in promotor region

exonsintrons

Page 34: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 34

Asthma Trait Network

Subnetwork for quality of life

Subnetwork for lung physiology

Phenotype Correlation Network

Subnetwork for Asthma symptoms

Page 35: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 35

Results from Single-SNP/Trait Test

Single-Marker Single-Trait Test

SN

Ps

Permutation test α = 0.05

Permutation test α = 0.01

No association

High association

Trait Network

Phenotypes

Phe

noty

pes

Lung physiology-related traits I• Baseline FEV1 predicted value: MPVLung • Pre FEF 25-75 predicted value • Average nitric oxide value: online • Body Mass Index • Postbronchodilation FEV1, liters: Spirometry • Baseline FEV1 % predicted: Spirometry • Baseline predrug FEV1, % predicted • Baseline predrug FEV1, % predicted

Q551R SNP• Codes for amino-acid changes in the intracellular signaling portion of the receptor• Exon 11

Page 36: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 36

Trait Network

Lasso Graph-guided Fused Lasso

Comparison of Gflasso with Others

Single-Marker Single-Trait Test

SN

Ps

Phenotypes

Phe

noty

pes

?

Lung physiology-related traits I• Baseline FEV1 predicted value: MPVLung • Pre FEF 25-75 predicted value • Average nitric oxide value: online • Body Mass Index • Postbronchodilation FEV1, liters: Spirometry • Baseline FEV1 % predicted: Spirometry • Baseline predrug FEV1, % predicted • Baseline predrug FEV1, % predicted

Q551R SNP• Codes for amino-acid changes in the intracellular signaling portion of the receptor• Exon 11

No association

High association

Page 37: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 37

Linkage Disequilibrium Structure in IL-4R gene

SNP Q551R

SNP rs3024660

SNP rs3024622

r2 =0.64

r2 =0.07

Page 38: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 38

IL4R Gene

Chr16: 27,325,251-27,376,098

IL4R

SNPs in intron

SNPs in exon

SNPs in promotor region

exonsintrons

Q551R

rs3024660

rs3024622

Page 39: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 39

Association to a Tree-structured

Phenome

TCGACGTTTTACTGTACAATT

Page 40: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 40

TCGACGTTTTACTGTACAATT

Tree-guided Group Lasso

Why tree? Tree represents a clustering structure

Scalability to a very large number of phenotypes Graph : O(|V|2) edges Tree : O(|V|) edges

Expression quantitative trait mapping (eQTL) Agglomerative hierarchical clustering is a popular tool

Page 41: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 41

Tree-Guided Group Lasso

In a simple case of two genes

• Low height• Tight correlation• Joint selection

• Large height• Weak correlation• Separate selection

h

h

Gen

otyp

es

Gen

otyp

es

Page 42: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 42

L1 penalty• Lasso penalty• Separate selection

L2 penalty • Group lasso• Joint selection

h

Elastic net

Select the child nodes jointly or separately?

Tree-Guided Group Lasso

In a simple case of two genes

Tree-guided group lasso

Page 43: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 43

h2

h1

Select the child nodes jointly or separately?

Joint selection

Separate selection

Tree-Guided Group Lasso

For a general tree

Tree-guided group lasso

Page 44: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 44

h2

h1

Select the child nodes jointly or separately?

Tree-Guided Group Lasso

For a general tree

Tree-guided group lasso

Joint selection

Separate selection

Page 45: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 45

Previously, in Jenatton, Audibert & Bach, 2009

Balanced Shrinkage

Page 46: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 46

Estimating Parameters

Second-order cone program

Many publicly available software packages for solving convex optimization problems can be used

Also, variational formulation

Page 47: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 47

Illustration with Simulated Data

True association strengths Lasso

Tree-guided group lasso

SN

Ps

Phenotypes

No association

High association

Page 48: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 48

Yeast eQTL Analysis

Tree-guided group lasso

Single-Marker Single-Trait Test

SN

Ps

Phenotypes

No association

High association

Hierarchical clustering tree

Page 49: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 49

Structured Association

Phenome Structure

Graph-guided fused lasso (Kim & Xing, PLoS Genetics, 2009)

Graph

Tree-guided fused lasso (Kim & Xing, Submitted)

Tree

Temporally smoothed lasso (Kim, Howrylak, Xing, Submitted)

Dynamic Trait

Genome Structure

Stochastic block regression (Kim & Xing, UAI, 2008)

Linkage Disequilibrium

Multi-population group lasso (Puniyani, Kim, Xing, Submitted)

Population Structure

EpistasisACGTTTTACTGTACAATT

Group lasso with networks(Lee, Kim, Xing, Submitted)

Page 50: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 50

Structured Association

Phenome Structure

Graph-guided fused lasso (Kim & Xing, PLoS Genetics, 2009)

Graph

Tree-guided fused lasso (Kim & Xing, Submitted)

Tree

Temporally smoothed lasso (Kim, Howrylak, Xing, Submitted)

Dynamic Trait

Genome Structure

Stochastic block regression (Kim & Xing, UAI, 2008)

Linkage Disequilibrium

Multi-population group lasso (Puniyani, Kim, Xing, Submitted)

Population Structure

EpistasisACGTTTTACTGTACAATT

Group lasso with networks(Lee, Kim, Xing, Submitted)

Page 51: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 51

Computation Time

Page 52: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 52

Proximal Gradient Descent

Original Problem:

Approximation Problem:

Gradient of theApproximation:

Page 53: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 53

Geometric Interpretation

Smooth approximation

Uppermost LineNonsmooth

Uppermost LineSmooth

Page 54: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 54

Convergence Rate

Theorem: If we require and set , the

number of iterations is upper bounded by:

Remarks: state of the art IPM method for for SOCP converges at a rate

Page 55: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 55

Multi-Task Time Complexity

Pre-compute:

Per-iteration Complexity (computing gradient)

Tree:

Graph:

IPM for SOCP

Proximal-Gradient

IPM for SOCP

Proximal-Gradient

Proximal-Gradient: Independent of Sample Size Linear in #.of Tasks

Page 56: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 56

Experiments

Multi-task Graph Structured Sparse Learning (GFlasso)

Page 57: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 57

Experiments

Multi-task Tree-Structured Sparse Learning (TreeLasso)

Page 58: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 58

Conclusions

Novel statistical methods for joint association analysis to correlated phenotypes Graph-structured phenome : graph-guided fused lasso Tree-structured phenome : tree-guided group lasso

Advantages Greater power to detect weak association signals Fewer false positives Joint association to multiple correlated phenotypes

Other structures In phenotypes: dynamic trait In genotypes: linkage disequilibrium, population structure, epistasis

58

Page 59: Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics Eric Xing Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 59

Reference• Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of Royal Statistical Society, Series B

58:267–288.• Weller J, Wiggans G, Vanraden P, Ron M (1996) Application of a canonical transformation to detection of quantitative trait

loci with the aid of genetic markers in a multi-trait experiment. Theoretical and Applied Genetics 92:998–1002.• Mangin B, Thoquet B, Grimsley N (1998) Pleiotropic QTL analysis. Biometrics 54:89–99.• Chen Y, Zhu J, Lum P, Yang X, Pinto S, et al. (2008) Variations in DNA elucidate molecular networks that cause disease.

Nature 452:429–35.• Lee SI, Dudley A, Drubin D, Silver P, Krogan N, et al. (2009) Learning a prior on regulatory potential from eQTL data.

PLoS Genetics 5:e1000358.• Emilsson V, Thorleifsson G, Zhang B, Leonardson A, Zink F, et al. (2008) Genetics of gene expression and its effect on

disease. Nature 452:423–28.• Kim S, Xing EP (2009) Statistical estimation of correlated genome associations to a quantitative trait network. PLoS

Genetics 5(8): e1000587.• Kim S, Xing EP (2008) Sparse feature learning in high-dimensional space via block regularized regression. In

Proceedings of the 24th International Conference on Uncertainty in Artificial Intelligence (UAI), pages 325-332. AUAI Press.

• Kim S, Xing EP (2010) Exploiting a hierarchical clustering tree of gene-expression traits in eQTL analysis. Submitted.• Kim S, Howrylak J, Xing EP (2010) Dynamic-trait association analysis via temporally-smoothed lasso. Submitted.• Puniyani K, Kim S, Xing EP (2010) Multi-population GWA mapping via multi-task regularized regression. Submitted.• Lee S, Kim S, Xing EP (2010) Leveraging genetic interaction networks and regulatory pathways for joint mapping of

epistatic and marginal eQTLs. Submitted.• Dunning AM et al. (2009) Association of ESR1 gene tagging SNPs with breast cancer risk. Hum Mol Genet. 18(6):1131-9. • Esparza-Gordillo J et al. (2009) A common variant on chromosome 11q13 is associated with atopic dermatitis. Nature

Genetics 41:596-601.• Suzuki A et al. (2008) Functional SNPs in CD244 increase the risk of rheumatoid arthritis in a Japanese population.

Nature Genetics 40:1224-1229.• Dupuis J et al. (2010) New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk.

Nature Genetics 42:105-116.59