Robust estimation of the relationship between DNA copy number and gene expression
Pierre Neuvial
Laboratoire Statistique et Génome, Université d’Évry Val d’Essonne, UMR CNRS 8071 – USC INRA
Joint work with Antoine Chambaz and Mark van der Laan
P. Neuvial (Stat & Genome) Associating copy number and expression June 8, 2011 1 / 20
Outline
1 Association between DNA copy number and gene expression
2 Targeted maximum likelihood estimation of association
3 Results
Association between DNA copy number and gene expression
Characteristics of tumor cells
Hanahan & Weinberg (2000)
self-sufficiency in growth factors
insensitivity to anti-growth signals
no apoptosis
angiogenesis
limitless replication potential
tissue invasion and metastases
Enabled by genetic instability of tumor cells
Changes in cancer cells at the molecular level
Different levels of biological information
DNA copy number
gene expression
DNA methylation
Quantitative measurements can be obtained from DNA microarrays
Goal: find genes that drive tumorigenesis
to better understand cancer cells
to help find new treatments
What gene-level data look like
187 GBM (brain cancer) samples from the Cancer Genome Atlas (TCGA)
[Figure: pairwise scatter plots of DNA methylation, DNA copy number, and gene expression for EGFR. Correlations shown in the panels: cor=−0.57 and cor=−0.54 (DNA methylation row), cor=0.87 (DNA copy number row).]
Which genes are drivers?
“Driver genes” are expected to show some association between DNA copy number and gene expression
⇒ Test for association, and quantify it
Methods for genome-wide scanning for gene-level associations
linear correlations
differential expression (t-tests) between copy number states
canonical correlation analyses
Issues with existing methods
they essentially identify genes that were already known to be involved
associations may be nonlinear
DNA methylation may down-regulate gene expression
Defining “gene-level data”
In the preceding plot:
DNA methylation (W): proportion of “methylated” signal at a CpG locus in the gene’s promoter region.
DNA copy number (X): smoothed normalized total copy number relative to a set of reference samples.
Expression (Y): “unified” gene expression level across 3 platforms
Targeted maximum likelihood estimation of association
Definition of a parameter of interest
Observation O = (W, X, Y) ∼ P ∈ M for a given gene:
W : DNA methylation
X : DNA copy number; X = 0: copy neutral state (2 copies)
Y : gene expression
M: non-parametric set of all possible data-generating distributions of O
Parameter of interest (defined for all P ∈ M):
$$\Psi(P) = \arg\min_{\beta \in \mathbb{R}} E_P\left[\left(E_P(Y \mid X, W) - E_P(Y \mid X = 0, W) - \beta X\right)^2\right]$$
In a semi-parametric model where E_P(Y | X, W) = E_P(Y | X = 0, W) + βX, we have Ψ(P) = β.
By contrast, Ψ: M → R is defined universally
Ψ(P) is a non-parametric variable importance measure of the “effect” of X (continuous) on Y (continuous) accounting for W
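As a minimal sketch of what this parameter measures, the Python snippet below evaluates Ψ(P) on simulated draws where the conditional mean is known. It relies on the fact that minimizing E_P[(r − βX)²] over β, with r = E_P(Y | X, W) − E_P(Y | X = 0, W), gives β = E_P[X r] / E_P[X²]. The distributions of W and X, both conditional mean functions, and the slope 1.5 are illustrative assumptions, not taken from the talk.

```python
import math
import random

BETA = 1.5  # assumed slope of the toy semi-parametric model (illustration only)

def theta_linear(x, w):
    """Hypothetical E(Y | X=x, W=w) = E(Y | X=0, W=w) + BETA * x."""
    return (2.0 - 3.0 * w) + BETA * x

def theta_nonlinear(x, w):
    """Hypothetical conditional mean with a saturating copy number effect."""
    return (2.0 - 3.0 * w) + 2.0 * math.tanh(x)

def psi(xs, ws, theta):
    """Empirical version of E[X * r] / E[X^2], with r = theta(X, W) - theta(0, W)."""
    num = sum(x * (theta(x, w) - theta(0.0, w)) for x, w in zip(xs, ws))
    den = sum(x * x for x in xs)
    return num / den

rng = random.Random(0)
n = 1000
ws = [rng.random() for _ in range(n)]  # methylation-like covariate in [0, 1]
# Copy number: exactly neutral (X = 0) about half of the time, otherwise altered.
xs = [0.0 if rng.random() < 0.5 else rng.gauss(0.0, 1.0) for _ in range(n)]

psi_linear = psi(xs, ws, theta_linear)        # recovers BETA in the linear model
psi_nonlinear = psi(xs, ws, theta_nonlinear)  # still well defined without linearity
print(psi_linear, psi_nonlinear)
```

In the linear case the parameter coincides with the slope; in the nonlinear case it is still a single number summarizing the best linear approximation of the copy number effect.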
Comment on the parameter of interest
Let θ(P)(X, W) = E_P(Y | X, W), then
$$\Psi(P) = \operatorname{corr}\big(X, r_P(X, W)\big) \sqrt{\frac{E_P\left[r_P(X, W)^2\right]}{E_P\left[X^2\right]}},$$
where r_P(X, W) = θ(P)(X, W) − θ(P)(0, W)
Case where X is binary
If X ∈ {0, 1}, then
$$\Psi(P) = E_P\left[\big(\theta(P)(1, W) - \theta(P)(0, W)\big)\, h(W)\right]$$
with weight h(W) = P(X = 1 | W) / P(X = 1)
Targeted maximum likelihood methods: motivation
Goal: estimate a parameter Ψ(P) from observations arising from a distribution P. Ψ is known.
Naive strategy
1 Estimate P using P̂
2 Plug-in: Ψ(P̂)
Our target parameter is Ψ(P), not P!
P̂ aims at balancing bias and variance for the whole distribution
Ψ(P̂) does not balance bias and variance for Ψ(P)
Targeted maximum likelihood estimation (TMLE)
From an initial estimate P_n^0:
1 Create a model P_n^0(ε) parametrized by ε ∈ R whose score is the efficient influence curve of Ψ at P_n^0
2 Estimate ε using maximum likelihood: ε_n^0
3 Update accordingly: P_n^1 = P_n^0(ε_n^0)
Repeat as many times as necessary... hence final estimate P_n^*
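The three steps can be sketched in Python on a deliberately simpler target than the talk's Ψ, whose efficient influence curve is not given in these slides. The swapped-in parameter is the mean outcome under a binary exposure, E[E(Y | A = 1, W)], with a Gaussian working model for the fluctuation. Every name, distribution, and constant below is a hypothetical choice for illustration; this is not the authors' implementation.

```python
import random

rng = random.Random(1)
n = 20000

# Toy data: W binary covariate, A binary exposure depending on W, Y continuous.
data = []
for _ in range(n):
    w = 1 if rng.random() < 0.5 else 0
    g_true = 0.7 if w == 1 else 0.3
    a = 1 if rng.random() < g_true else 0
    y = 1.0 + 2.0 * a + 3.0 * w + rng.gauss(0.0, 1.0)
    data.append((w, a, y))

TRUE_PSI = 1.0 + 2.0 + 3.0 * 0.5  # E[E(Y | A=1, W)] = 4.5 in this simulation

# Initial estimate: deliberately crude, Q0(a, w) = overall mean of Y.
y_bar = sum(y for _, _, y in data) / n

def q0(a, w):
    return y_bar

psi_init = sum(q0(1, w) for w, _, _ in data) / n

# Exposure mechanism g(w) = P(A=1 | W=w), estimated by stratum proportions.
g_hat = {}
for level in (0, 1):
    stratum = [a for w, a, _ in data if w == level]
    g_hat[level] = sum(stratum) / len(stratum)

# Step 1: fluctuation Q_eps(a, w) = Q0(a, w) + eps * H(a, w), with H(a, w) = a / g(w).
# Under a Gaussian working model its score at eps = 0 is proportional to
# H * (Y - Q0), the outcome-regression part of the efficient influence curve.
def clever(a, w):
    return a / g_hat[w]

# Step 2: maximum likelihood for eps (least squares of residuals on H).
den = sum(clever(a, w) ** 2 for w, a, _ in data)
eps = sum(clever(a, w) * (y - q0(a, w)) for w, a, y in data) / den

# Step 3: update the initial estimate and plug it in.
def q1(a, w):
    return q0(a, w) + eps * clever(a, w)

psi_tmle = sum(q1(1, w) for w, _, _ in data) / n

# Repeating the procedure from q1 yields eps close to 0: one step suffices here.
eps_next = sum(clever(a, w) * (y - q1(a, w)) for w, a, y in data) / den

print(psi_init, psi_tmle, TRUE_PSI, eps_next)
```

In this toy setting the initial plug-in ignores W and is biased, while the targeted update moves it close to the simulation truth. With a linear fluctuation, a single update already solves the score equation, so iterating changes nothing.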
Statistical properties
P_0: true distribution of O
Consistency (double robustness)
TMLE is consistent if one of the following conditions holds:
θ(P_n^*)(0, ·) consistently estimates the true θ(P_0)(0, ·)
E_{P_n^*}(X | W) and P_n^*(X = 0 | W) consistently estimate E_{P_0}(X | W) and P_0(X = 0 | W)
Asymptotic normality
Under the same conditions, TMLE is asymptotically Gaussian
We can compute asymptotic p-values and thus rank genes
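A minimal sketch of how asymptotic normality turns into a gene ranking: given an estimate and its standard error for each gene, a two-sided p-value for the null Ψ = 0 follows from the Gaussian approximation. The gene names, estimates, and standard errors below are made-up placeholders, not results from the talk, and the standard errors are assumed to be supplied by the estimation procedure.

```python
import math

def wald_p_value(estimate, std_error, null_value=0.0):
    """Two-sided p-value 2 * (1 - Phi(|z|)) under the Gaussian approximation."""
    z = (estimate - null_value) / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical (estimate, standard error) pairs for three placeholder genes.
results = {
    "gene_A": (0.90, 0.10),
    "gene_B": (0.05, 0.10),
    "gene_C": (-0.40, 0.15),
}

p_values = {gene: wald_p_value(est, se) for gene, (est, se) in results.items()}

# Rank genes from strongest to weakest evidence of association.
ranking = sorted(p_values, key=p_values.get)
print(ranking)
```

In a genome-wide scan the resulting p-values would normally be adjusted for multiple testing before being interpreted; that step is left out of this sketch.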
Results
Simulation strategy
Assumptions:
up to 3 copy number classes: normal regions, and regions of copy number gains and losses
in normal regions, expression is negatively correlated with methylation
in regions of copy number alteration, copy number and expression are positively correlated
GBM data used as a baseline for simulation:
Sample name    Methylation  Copy number  Expression
TCGA-02-0001   0.05         2.72         -0.46
TCGA-02-0003   0.01         9.36         1.25
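The assumptions listed above can be turned into a small simulation sketch. The version below draws a copy number class per sample, then generates expression from methylation in normal regions and from copy number in altered regions. Class probabilities, value ranges, slopes, and noise levels are all hypothetical; the talk resamples from GBM data instead, which this sketch does not attempt to reproduce. Copy number is coded so that 0 is the neutral state.

```python
import random

def pearson(u, v):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

def simulate_gene(n, rng, p_loss=0.2, p_gain=0.3):
    """Return n tuples (state, methylation w, copy number x, expression y)."""
    rows = []
    for _ in range(n):
        w = rng.uniform(0.0, 0.1)  # methylation proportion (assumed range)
        u = rng.random()
        if u < p_loss:
            state, x = "loss", rng.uniform(-1.5, -0.5)
        elif u < p_loss + p_gain:
            state, x = "gain", rng.uniform(0.5, 3.0)
        else:
            state, x = "normal", 0.0  # copy neutral
        if state == "normal":
            # Normal regions: expression decreases with methylation.
            y = 0.5 - 10.0 * w + rng.gauss(0.0, 0.1)
        else:
            # Altered regions: expression increases with copy number.
            y = 0.8 * x + rng.gauss(0.0, 0.3)
        rows.append((state, w, x, y))
    return rows

rows = simulate_gene(1000, random.Random(42))
normal = [r for r in rows if r[0] == "normal"]
altered = [r for r in rows if r[0] != "normal"]

cor_wy_normal = pearson([r[1] for r in normal], [r[3] for r in normal])
cor_xy_altered = pearson([r[2] for r in altered], [r[3] for r in altered])
print(cor_wy_normal, cor_xy_altered)
```

The two correlations give a quick check that the generated data have the intended signs before any estimator is applied to them.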
Simulated data set mimics real data set
[Figure, left panel: Real data (GBM, n=187), EGFR. Pairwise scatter plots of DNA methylation, DNA copy number, and gene expression. Correlations shown: cor=−0.57 and cor=−0.54 (DNA methylation row), cor=0.87 (DNA copy number row).]
[Figure, right panel: Simulated data (n=200). Same layout. Correlations shown: −0.66 and −0.70 (DNA methylation row), 0.80 (DNA copy number row).]
Simulated data: TMLE corrects initial estimation
Real data analysis: TCGA OV data set
[Figure: pairwise scatter plots of DNA methylation, DNA copy number, and gene expression for STAT5A. Correlations shown: cor=0.051 and cor=−0.46 (DNA methylation row), cor=0.23 (DNA copy number row); pcor=0.286.]
Thanks
Antoine Chambaz
Mark van der Laan
Terry Speed
The Cancer Genome Atlas Research Network