High-dimensional data analysis: Microarrays and multiple testing
Mark van de Wiel (1,2)
1. Dep. of Mathematics, VU University Amsterdam
2. Dep. of Biostatistics & Dep. of Pathology, VU University medical center, Amsterdam
Genomics: a short history (1)
Some history
1. Watson & Crick: double helix structure of DNA (1953)
Source: http://ghr.nlm.nih.gov/handbook/illustrations/
Genomics: a short history (2)
2. Human Genome Project: identification of all 20,000-25,000 human genes (1990-2003)
THE WHITE HOUSE, Office of the Press Secretary
For Immediate Release, June 25, 2000
PRESIDENT CLINTON ANNOUNCES THE COMPLETION OF THE FIRST SURVEY OF THE ENTIRE HUMAN GENOME
Hails Public and Private Efforts Leading to This Historic Achievement
June 26, 2000
Today, at a historic White House event with British Prime Minister Tony Blair,
President Clinton announced that the international Human Genome Project and Celera
Genomics Corporation have both completed an initial sequencing of the human
genome -- the genetic blueprint for human beings.
Genomics: a short history (3)
3a. 1961 DNA hybridisation discovered
3b. 1994 Introduction of robotics (Hoheisel et al.)
3c. 1995 First microarray publication (Schena et al.)
3d. 1997 First whole genome microarray experiments (DeRisi et al.)
3e. 1999 First publication on microarrays for cancer classification (Golub et al.): Leukemia / Affymetrix arrays
Central dogma
1. DNA is the same in each cell (tumours are an exception)
2. Function of the cell is determined by proteins
3. The path from DNA to proteins goes via messenger RNA (mRNA)
4. DNA is transcribed to mRNA according to the needs of that cell
5. mRNA contains the instructions for what proteins to build
Microarrays measure the amount of mRNA
DNA → mRNA → protein
Microarrays (1)
Source: http://www.cottongenomics.org/
Source: http://research.yale.edu/ysm/
Microarrays (2)
1. Isolation of mRNA (single-stranded DNA; genes)
2. Labeling with color molecule
3. Chip contains probes which uniquely correspond to genes
4. Hybridization to the chip
5. Laser to read labeled molecules
6. Image analysis converts colors to numbers (intensities)
7. Result: data matrix with 2 intensities (one per dye) for each array
Microarray Movie
The result
Probe ID      Gene           m1_g      m2_r      m3_g      m4_r      m5_g      m6_r      m7_g      m8_r
A_52_P616356  Ccr1           34.46396  38.87202  39.8253   37.60986  34.46396  39.74775  43.21416  41.64688
A_52_P580582  Nppa           68.61412  63.78335  64.54471  59.00334  68.61412  66.14105  67.13218  58.91294
A_52_P403405  Aqp7           54.3694   43.58079  48.42171  40.02895  54.3694   44.35261  50.96373  44.11335
A_52_P819156  AK046412       40.35896  40.19367  42.21101  39.46673  40.35896  40.97604  46.80699  45.51824
A_51_P331831  Hvcn1          1139.168  1239.731  1331.944  1201.655  1139.168  1491.437  1109.039  1516.419
A_51_P430630  Gpr33          35.93206  33.36196  35.95886  34.107    35.93206  34.09339  40.05874  39.5299
A_52_P502357  C230086J09Rik  34.30417  34.11315  41.22859  34.82119  34.30417  35.15055  45.26332  39.64404
A_52_P299964  Maml2          33.37359  34.92393  41.02386  36.41753  33.37359  36.563    45.34714  41.52166
A_51_P356389  A330106F07Rik  37.64724  38.64861  41.41     37.02367  37.64724  39.60467  45.60977  43.02076
A_52_P684402  Ptdss2         96.73227  114.3885  115.1037  93.56061  96.73227  107.1197  179.2954  120.0925
A_51_P414208  1110014K05Rik  39.31122  40.92528  41.12306  37.04293  39.31122  43.00788  45.41691  41.51804
A_51_P280918  Itfg1          42.2577   47.87027  58.40548  50.16758  42.2577   55.26483  63.48501  69.59664
A_52_P613688  Elmo1          51.93495  41.54138  47.34302  66.62057  51.93495  43.62045  53.96656  59.24562
A_52_P258194  Crtac1         42.62472  43.35425  42.28466  41.94087  42.62472  44.39193  46.12511  44.2117
A_52_P229271  Pnpt1          34.99725  38.55899  39.07093  37.96528  34.99725  40.0605   42.41069  42.81282
A_52_P214630  Sox9           35.1932   33.39075  37.21131  34.75895  35.1932   34.67025  41.13792  40.00255
A_52_P579519  Tmem144        35.21073  34.24402  40.30091  37.07755  35.21073  35.37331  43.38546  41.40743
A_52_P979997  AK039768       34.48014  39.14762  35.53075  35.00453  34.48014  39.19561  39.61501  39.7067
A_52_P453864  Syne1          35.06627  37.30331  38.93411  38.40689  35.06627  37.66834  43.29757  41.74719
• Number of rows (e.g. 44,000) is determined by the number of probes (> number of genes)
• More genes than samples: high-dimensional setting
Statistical issues before data analysis
1. Design of the experiment (not discussed)
2. Quality control (not discussed)
3. Normalization
Data visualized by MA plot
Use of different dyes (colours) may lead to a non-linear dye bias
This needs to be removed since it is artificial
M = log2(R/G) = log2(R)-log2(G)
A = log2(R*G)=log2(R)+log2(G)
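The M and A definitions above can be computed directly from paired red and green intensities. A minimal sketch; the function name `ma_values` and the toy intensities are illustrative assumptions, not part of the slides.

```python
import math

def ma_values(red, green):
    """Return (M, A) for paired red/green intensities, using the slide's definitions."""
    # M = log2(R) - log2(G): the log-ratio of the two dyes
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    # A = log2(R) + log2(G): the summed log-intensity (no factor 1/2, as on the slide)
    a = [math.log2(r) + math.log2(g) for r, g in zip(red, green)]
    return m, a

# Toy intensities (made up for illustration)
m, a = ma_values([8.0, 4.0], [2.0, 4.0])
print(m, a)
```

Many texts define A with an extra factor 1/2, so that it is the average log-intensity. That only rescales the horizontal axis of the MA plot.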
Normalization
Purpose: remove artificial dye effects to obtain unbiased M values. Most popular method: Loess.
Assumption: mean M value equals 0 for all intensity ranges.
Algorithm
1. Sort A values: A'_1, ..., A'_p.
2. For each A'_i, define the window W_i = [A'_i - L, A'_i + L].
3. For each W_i, linearly regress M on A using the points in W_i: M = a_i + b_i A + ε
4. M'_i(pred) = a_i + b_i A'_i
5. Subtract M'_i(pred) from M'_i.
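The five steps can be sketched in plain Python. This is the unweighted windowed regression as listed above, not a full loess, which would also down-weight points far from A'_i. The name `local_linear_normalize` and the parameter `half_width` (the L of step 2) are assumptions for illustration.

```python
def local_linear_normalize(a_vals, m_vals, half_width):
    """Subtract a windowed linear fit of M on A from each M value."""
    normalized = []
    for a_i, m_i in zip(a_vals, m_vals):
        # Step 2: window W_i = [a_i - L, a_i + L]; it always contains the point itself
        window = [(a, m) for a, m in zip(a_vals, m_vals) if abs(a - a_i) <= half_width]
        n = len(window)
        mean_a = sum(a for a, _ in window) / n
        mean_m = sum(m for _, m in window) / n
        # Step 3: ordinary least-squares fit M = intercept + slope * A within the window
        sxx = sum((a - mean_a) ** 2 for a, _ in window)
        sxy = sum((a - mean_a) * (m - mean_m) for a, m in window)
        slope = sxy / sxx if sxx > 0 else 0.0
        intercept = mean_m - slope * mean_a
        # Steps 4 and 5: predict at a_i and subtract the prediction
        normalized.append(m_i - (intercept + slope * a_i))
    return normalized
```

This naive version scans all points for every window, so it does not need the sorting of step 1. With sorted A values the windows could be found much faster, which matters for tens of thousands of probes.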
Loess
Figure: data before and after Loess normalization
After normalization
Probe ID      Gene           p1        p2        p3        p4
A_52_P616356  Ccr1           -0.17364  0.082575  -0.20578  0.053296
A_52_P580582  Nppa           0.105326  0.129502  0.05296   0.18842
A_52_P403405  Aqp7           0.319102  0.27461   0.293776  0.208256
A_52_P819156  AK046412       0.005921  0.096982  -0.02189  0.040279
A_51_P331831  Hvcn1          -0.12205  0.14851   -0.38872  -0.45136
A_51_P430630  Gpr33          0.107068  0.076279  0.07578   0.019173
A_52_P502357  C230086J09Rik  0.008056  0.24368   -0.03516  0.191238
A_52_P299964  Maml2          -0.06551  0.17183   -0.13168  0.127147
A_51_P356389  A330106F07Rik  -0.03787  0.161531  -0.07313  0.08431
A_52_P684402  Ptdss2         -0.24187  0.298961  -0.14715  0.578193
A_51_P414208  1110014K05Rik  -0.05805  0.150749  -0.12966  0.129491
A_51_P280918  Itfg1          -0.17992  0.219348  -0.38715  -0.1326
A_52_P613688  Elmo1          0.322157  -0.49282  0.251701  -0.13464
A_52_P258194  Crtac1         -0.02448  0.011778  -0.05861  0.061124
A_52_P229271  Pnpt1          -0.13983  0.041415  -0.19494  -0.01361
A_52_P214630  Sox9           0.075848  0.098357  0.021599  0.040377
A_52_P579519  Tmem144        0.040163  0.120267  -0.00665  0.067322
A_52_P979997  AK039768       -0.18316  0.021527  -0.18493  -0.00333
A_52_P453864  Syne1          -0.08922  0.019669  -0.10327  0.052607
Log2-ratios for further analysis. Ratios: cancel out experimental spot effect, log to obtain symmetric scale. However, nowadays log-intensities (both dyes) are used more and more often.
Data
Gene expression matrix X: X_{ij}, i = 1, ..., p, j = 1, ..., n; p > n
Response vector y: y_j, j = 1, ..., n.
y_j ∈ R
Type of response
• Nominal. E.g. tumor type. R = {Benign, Malignant}
• Ordinal. Stage of a tumor. R = {1,2,3,4}
• Continuous. Disease severity score. R = ℝ+
• Censored. Survival. R = ℝ+ × {0,1}.
Typical data analyses for microarrays (1)
Multivariate
• Unsupervised Clustering
• Principal component analysis
• Classification (statistical learning, discriminant analysis, supervised clustering)
• Multivariate regression with penalty for overfitting (e.g. Lasso / Ridge regression)
• Prognostic multivariate survival models
Typical data analyses for microarrays (2)
Univariate
• Inference (Hypothesis testing). Expression of each gene is related to clinical response using, for example,
– ANOVA
– Linear Regression
– Cox regression (survival)
– Permutation (nonparametric) tests
Hybrid
• Inference for sets of genes that are functionally related
Two-step ANOVA (1)
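y_{acdg} = \mu + \alpha_a + \tau_c + \delta_d + u_{acdg}    (1)
u_{acdg} = \gamma_g + (\gamma\alpha)_{ga} + (\gamma\delta)_{gd} + (\gamma\tau)_{gc} + \varepsilon_{acdg}    (2)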
(1) is the normalization model; it contains no gene-specific terms, so the residual u absorbs all gene-specific effects.
(2) is the differential expression model
Indices a: array; c: condition; d: dye; g: gene
Two-step ANOVA (2)
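y_{acdg} = \mu + \alpha_a + \tau_c + \delta_d + u_{acdg}    (1)
u_{acdg} = \gamma_g + (\gamma\alpha)_{ga} + (\gamma\delta)_{gd} + (\gamma\tau)_{gc} + \varepsilon_{acdg}    (2)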
Use of the two-step ANOVA: first fit (1) on all data, then estimate residuals u for each gene, then fit (2) for each gene separately.
Main advantage with respect to one-level model: computational. One-level model would require fitting many parameters simultaneously in one ANOVA.
Computation of raw p-values is the same as for usual ANOVA.
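The two fitting steps can be sketched with ordinary least squares in NumPy. This is a rough illustration under stated assumptions: two conditions and two dyes coded 0/1, arrays coded 0, 1, 2, ..., treatment (dummy) coding with the first level as reference, and the same design for every gene. The names `design_matrix` and `two_step_anova` are hypothetical.

```python
import numpy as np

def design_matrix(array, cond, dye, n_arrays):
    """Intercept plus dummy-coded array, condition and dye columns."""
    cols = [np.ones(len(array))]
    cols += [(array == a).astype(float) for a in range(1, n_arrays)]
    cols.append((cond == 1).astype(float))
    cols.append((dye == 1).astype(float))
    return np.column_stack(cols)

def two_step_anova(y, array, cond, dye):
    """y: (genes, observations) log-intensities; array/cond/dye: labels per observation."""
    n_genes, _ = y.shape
    n_arrays = int(array.max()) + 1
    X = design_matrix(array, cond, dye, n_arrays)
    # Step 1: fit model (1) once, on the data of all genes stacked together
    X_all = np.tile(X, (n_genes, 1))
    beta, *_ = np.linalg.lstsq(X_all, y.reshape(-1), rcond=None)
    u = y - X @ beta  # residuals u_{acdg}, one row per gene
    # Step 2: fit model (2) to the residuals, gene by gene
    cond_col = n_arrays  # column index of the condition term
    effects = np.empty(n_genes)
    for g in range(n_genes):
        coef, *_ = np.linalg.lstsq(X, u[g], rcond=None)
        effects[g] = coef[cond_col]  # estimate of the gene-by-condition term
    return u, effects
```

Each step-2 fit involves only a handful of parameters, which is the computational advantage mentioned above. The returned `effects` are point estimates only; p-values would follow from the usual ANOVA F- or t-statistics per gene.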
Multiple Testing, Motivation.
Histogram of 20,000 p-values generated under H0
Even when all 20,000 null hypotheses are true, we expect 20,000 × 0.05 = 1,000 p-values smaller than α = 0.05!
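This expectation is easy to check by simulation, assuming p-values under H0 are uniform on (0, 1), as they are for continuous test statistics. The function name `count_false_positives` and the seed are illustrative assumptions.

```python
import random

def count_false_positives(n_tests=20000, alpha=0.05, seed=1):
    """Count p-values below alpha when every null hypothesis is true."""
    rng = random.Random(seed)
    # Under H0 each p-value is a draw from the uniform distribution on (0, 1)
    p_values = [rng.random() for _ in range(n_tests)]
    return sum(p < alpha for p in p_values)

print(count_false_positives())  # close to 20000 * 0.05 = 1000
```

The count varies from run to run, with a standard deviation of about 31 around the expected 1,000.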
Multiple Testing. Illustration of Benjamini-Hochberg procedure
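A minimal sketch of the Benjamini-Hochberg step-up procedure: sort the m p-values, find the largest rank k with p_(k) <= (k/m) q, and reject the k hypotheses with the smallest p-values. The function name `benjamini_hochberg` and the default level q = 0.05 are assumptions for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k with p_(k) <= (k / m) * q
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

print(benjamini_hochberg([0.02, 0.03, 0.035, 0.9]))
```

In this toy input the two smallest p-values exceed their own thresholds (0.0125 and 0.025), but the third is below 0.0375, so the step-up rule rejects all three.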
Multiple Testing
M