High-dimensional data analysis: Microarrays and multiple testing

Mark van de Wiel (1,2)

1. Dep. of Mathematics, VU University Amsterdam

2. Dep. of Biostatistics & Dep. of Pathology, VU University medical center, Amsterdam

Genomics: a short history (1)

Some history

1. Watson & Crick: double helix structure of DNA (1953)

Source: http://ghr.nlm.nih.gov/handbook/illustrations/


2. Human Genome Project: identification of all 20,000-25,000 human genes (1990-2003)

THE WHITE HOUSE, Office of the Press Secretary

For Immediate Release, June 25, 2000

PRESIDENT CLINTON ANNOUNCES THE COMPLETION OF THE FIRST SURVEY OF THE ENTIRE HUMAN GENOME

Hails Public and Private Efforts Leading to This Historic Achievement

June 26, 2000. Today, at a historic White House event with British Prime Minister Tony Blair, President Clinton announced that the international Human Genome Project and Celera Genomics Corporation have both completed an initial sequencing of the human genome -- the genetic blueprint for human beings.


3a. 1961 DNA hybridisation discovered

3b. 1994 Introduction of robotics (Hoheisel et al.)

3c. 1995 First microarray publication (Schena et al.)

3d. 1997 First whole genome microarray experiments (DeRisi et al.)

3e. 1999 First publication on microarrays for cancer classification (Golub et al.): Leukemia / Affymetrix arrays

Central dogma

1. DNA is the same in each cell (tumours are an exception)

2. Function of the cell is determined by proteins

3. The path from DNA to proteins goes via messenger RNA (mRNA)

4. DNA is transcribed to mRNA according to the needs of that cell

5. mRNA contains the instructions for what proteins to build

Microarrays measure the amount of mRNA

DNA → mRNA → protein

Microarrays (1)

Source: http://www.cottongenomics.org/

Source: http://research.yale.edu/ysm/

Microarrays (2)

1. Isolation of mRNA (single-stranded; transcribed from genes)

2. Labeling with color molecule

3. Chip contains probes which uniquely correspond to genes

4. Hybridization to the chip

5. Laser to read labeled molecules

6. Image analysis converts colors to numbers (intensities)

7. Result: data matrix with 2 intensities (one per dye) per probe for each array

Microarray Movie

The result

Probe ID      Gene           m1_g      m2_r      m3_g      m4_r      m5_g      m6_r      m7_g      m8_r
A_52_P616356  Ccr1           34.46396  38.87202  39.8253   37.60986  34.46396  39.74775  43.21416  41.64688
A_52_P580582  Nppa           68.61412  63.78335  64.54471  59.00334  68.61412  66.14105  67.13218  58.91294
A_52_P403405  Aqp7           54.3694   43.58079  48.42171  40.02895  54.3694   44.35261  50.96373  44.11335
A_52_P819156  AK046412       40.35896  40.19367  42.21101  39.46673  40.35896  40.97604  46.80699  45.51824
A_51_P331831  Hvcn1          1139.168  1239.731  1331.944  1201.655  1139.168  1491.437  1109.039  1516.419
A_51_P430630  Gpr33          35.93206  33.36196  35.95886  34.107    35.93206  34.09339  40.05874  39.5299
A_52_P502357  C230086J09Rik  34.30417  34.11315  41.22859  34.82119  34.30417  35.15055  45.26332  39.64404
A_52_P299964  Maml2          33.37359  34.92393  41.02386  36.41753  33.37359  36.563    45.34714  41.52166
A_51_P356389  A330106F07Rik  37.64724  38.64861  41.41     37.02367  37.64724  39.60467  45.60977  43.02076
A_52_P684402  Ptdss2         96.73227  114.3885  115.1037  93.56061  96.73227  107.1197  179.2954  120.0925
A_51_P414208  1110014K05Rik  39.31122  40.92528  41.12306  37.04293  39.31122  43.00788  45.41691  41.51804
A_51_P280918  Itfg1          42.2577   47.87027  58.40548  50.16758  42.2577   55.26483  63.48501  69.59664
A_52_P613688  Elmo1          51.93495  41.54138  47.34302  66.62057  51.93495  43.62045  53.96656  59.24562
A_52_P258194  Crtac1         42.62472  43.35425  42.28466  41.94087  42.62472  44.39193  46.12511  44.2117
A_52_P229271  Pnpt1          34.99725  38.55899  39.07093  37.96528  34.99725  40.0605   42.41069  42.81282
A_52_P214630  Sox9           35.1932   33.39075  37.21131  34.75895  35.1932   34.67025  41.13792  40.00255
A_52_P579519  Tmem144        35.21073  34.24402  40.30091  37.07755  35.21073  35.37331  43.38546  41.40743
A_52_P979997  AK039768       34.48014  39.14762  35.53075  35.00453  34.48014  39.19561  39.61501  39.7067
A_52_P453864  Syne1          35.06627  37.30331  38.93411  38.40689  35.06627  37.66834  43.29757  41.74719

• Number of rows (e.g. 44,000) is determined by the number of probes (> number of genes)

• More genes than samples: high-dimensional setting

Statistical issues before data analysis

1. Design of the experiment (not discussed)

2. Quality control (not discussed)

3. Normalization

Data visualized by MA plot

Use of different dyes (colours) may lead to a non-linear dye bias

This needs to be removed since it is artificial

M = log2(R/G) = log2(R)-log2(G)

A = log2(R*G)=log2(R)+log2(G)
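As a minimal sketch, the M and A values can be computed per probe from the two channel intensities. The intensity values below are hypothetical, and the variable names `R`, `G`, `M`, `A` simply follow the formulas above.

```python
import numpy as np

# Hypothetical red (R) and green (G) intensities for five probes on one array.
R = np.array([38.9, 63.8, 43.6, 40.2, 1239.7])
G = np.array([34.5, 68.6, 54.4, 40.4, 1139.2])

M = np.log2(R) - np.log2(G)   # log-ratio: difference between the two dyes
A = np.log2(R) + np.log2(G)   # log-product: overall intensity, as defined above

# An MA plot is then a scatter plot of M (vertical axis) against A (horizontal axis).
```

A positive M means the red channel is brighter for that probe, a negative M means the green channel is brighter.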

Normalization

Purpose: remove artificial dye effects to obtain unbiased M values. Most popular method: Loess.

Assumption: mean M value equals 0 for all intensity ranges.

Algorithm

1. Sort A values: A'_1, ..., A'_p.

2. For A'_i, define the window W_i = [A'_i - L, A'_i + L].

3. For each W_i, linearly regress M on A using the points with A in W_i: M = a + bA + ε

4. M'_i(pred) = a_i + b_i A'_i

5. Subtract M'_i(pred) from M'_i.
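The five steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not a full Loess implementation: it uses unweighted least squares inside each window, whereas Loess as usually implemented weights points by their distance to the window centre. The function name `window_normalize` is hypothetical, and the half-width `L` is the one from step 2.

```python
import numpy as np

def window_normalize(M, A, L=1.0):
    """Subtract a locally fitted linear trend of M on A (unweighted windows)."""
    M = np.asarray(M, dtype=float)
    A = np.asarray(A, dtype=float)
    order = np.argsort(A)                      # step 1: sort by A
    A_s, M_s = A[order], M[order]
    pred = np.empty_like(M_s)
    for i, a_i in enumerate(A_s):
        # step 2: indices of all points with A in [a_i - L, a_i + L]
        lo = np.searchsorted(A_s, a_i - L, side="left")
        hi = np.searchsorted(A_s, a_i + L, side="right")
        x, y = A_s[lo:hi], M_s[lo:hi]
        if x.max() > x.min():
            # step 3: least-squares line M = a + b*A within the window
            b, a = np.polyfit(x, y, 1)
            pred[i] = a + b * a_i              # step 4: prediction at A'_i
        else:
            pred[i] = y.mean()                 # degenerate window: use the mean
    M_norm = np.empty_like(M)
    M_norm[order] = M_s - pred                 # step 5, restored to input order
    return M_norm
```

A larger `L` gives a smoother fitted curve; a smaller `L` follows the dye bias more closely but is noisier.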

Loess

[Figure: MA plot before and after normalization]

After normalization

Probe ID      Gene           p1        p2        p3        p4
A_52_P616356  Ccr1           -0.17364  0.082575  -0.20578  0.053296
A_52_P580582  Nppa           0.105326  0.129502  0.05296   0.18842
A_52_P403405  Aqp7           0.319102  0.27461   0.293776  0.208256
A_52_P819156  AK046412       0.005921  0.096982  -0.02189  0.040279
A_51_P331831  Hvcn1          -0.12205  0.14851   -0.38872  -0.45136
A_51_P430630  Gpr33          0.107068  0.076279  0.07578   0.019173
A_52_P502357  C230086J09Rik  0.008056  0.24368   -0.03516  0.191238
A_52_P299964  Maml2          -0.06551  0.17183   -0.13168  0.127147
A_51_P356389  A330106F07Rik  -0.03787  0.161531  -0.07313  0.08431
A_52_P684402  Ptdss2         -0.24187  0.298961  -0.14715  0.578193
A_51_P414208  1110014K05Rik  -0.05805  0.150749  -0.12966  0.129491
A_51_P280918  Itfg1          -0.17992  0.219348  -0.38715  -0.1326
A_52_P613688  Elmo1          0.322157  -0.49282  0.251701  -0.13464
A_52_P258194  Crtac1         -0.02448  0.011778  -0.05861  0.061124
A_52_P229271  Pnpt1          -0.13983  0.041415  -0.19494  -0.01361
A_52_P214630  Sox9           0.075848  0.098357  0.021599  0.040377
A_52_P579519  Tmem144        0.040163  0.120267  -0.00665  0.067322
A_52_P979997  AK039768       -0.18316  0.021527  -0.18493  -0.00333
A_52_P453864  Syne1          -0.08922  0.019669  -0.10327  0.052607

Log2-ratios for further analysis. Ratios: cancel out experimental spot effect, log to obtain symmetric scale. However, nowadays log-intensities (both dyes) are used more and more often.

Data

Gene expression matrix $X$: $X_{ij}$, $i = 1, \dots, p$, $j = 1, \dots, n$; $p > n$

Response vector $y$: $y_j$, $j = 1, \dots, n$

$y_j \in R$

Type of response

• Nominal. E.g. tumor type. R = {Benign, Malignant}

• Ordinal. Stage of a tumor. R = {1,2,3,4}

• Continuous. Disease severity score. R = ℝ+

• Censored. Survival. R = ℝ+ × {0,1}.

Typical data analyses for microarrays (1)

Multivariate

• Unsupervised Clustering

• Principal component analysis

• Classification (statistical learning, discriminant analysis, supervised clustering)

• Multivariate regression with penalty for overfitting (e.g. Lasso / Ridge regression)

• Prognostic multivariate survival models

Typical data analyses for microarrays (2)

Univariate

• Inference (hypothesis testing). Expression of each gene is related to clinical response using, for example,

– ANOVA

– Linear Regression

– Cox regression (survival)

– Permutation (nonparametric) tests

Hybrid

• Inference for sets of genes that are functionally related

Two-step ANOVA (1)

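$$y_{acdg} = \mu + \alpha_a + \tau_c + \delta_d + u_{acdg} \qquad (1)$$

$$u_{acdg} = \gamma_g + (\gamma\alpha)_{ga} + (\gamma\delta)_{gd} + (\gamma\tau)_{gc} + \varepsilon_{acdg} \qquad (2)$$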

(1) is the normalization model; gene effects enter it only through the residual u. That is, the residual u contains all gene-specific factors.

(2) is the differential expression model

Indices a: array; c: condition; d: dye; g: gene

Two-step ANOVA (2)


Use of the two-step ANOVA: first fit (1) on all data, then estimate residuals u for each gene, then fit (2) for each gene separately.

Main advantage with respect to one-level model: computational. One-level model would require fitting many parameters simultaneously in one ANOVA.

Computation of raw p-values is the same as for usual ANOVA.
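The two-step procedure can be sketched with ordinary least squares in NumPy. This is a sketch under assumptions, not the lecture's implementation: the function names, the treatment (dummy) coding, and the small simulated dye-swap design are all hypothetical, and it returns t-statistics rather than p-values to stay within NumPy.

```python
import numpy as np

def dummies(codes):
    """Treatment-coded indicator columns; level 0 is the reference level."""
    levels = np.arange(1, codes.max() + 1)
    return (codes[:, None] == levels[None, :]).astype(float)

def two_step_anova(y, array, cond, dye):
    """y: (n_obs, n_genes) log-intensities; array, cond, dye: integer codes per row.

    Returns the step-1 residuals u and, per gene, the step-2 coefficient
    estimates and their t-statistics.
    """
    n_obs = y.shape[0]
    # Same design columns in both steps: intercept, array, condition, dye.
    X = np.column_stack([np.ones(n_obs), dummies(array), dummies(cond), dummies(dye)])

    # Step 1: fit model (1) on all data. Every gene shares the same design, so
    # pooled least squares equals a fit to the per-row mean over genes.
    beta1, *_ = np.linalg.lstsq(X, y.mean(axis=1), rcond=None)
    u = y - (X @ beta1)[:, None]

    # Step 2: fit model (2) to the residuals, one regression per gene
    # (lstsq solves every gene column independently in one call).
    beta2, *_ = np.linalg.lstsq(X, u, rcond=None)
    resid = u - X @ beta2
    df = n_obs - X.shape[1]
    s2 = (resid ** 2).sum(axis=0) / df              # residual variance per gene
    xtx_inv_diag = np.diag(np.linalg.inv(X.T @ X))
    t = beta2 / np.sqrt(np.outer(xtx_inv_diag, s2))
    return u, beta2, t

# Hypothetical example: 4 two-colour arrays in a dye-swap design, 50 genes.
rng = np.random.default_rng(0)
array = np.repeat(np.arange(4), 2)                  # two channels per array
dye = np.tile([0, 1], 4)                            # 0 = green, 1 = red
cond = np.array([0, 1, 1, 0, 0, 1, 1, 0])           # conditions swap dyes
y = 8.0 + 0.3 * dye[:, None] + rng.normal(0.0, 0.05, (8, 50))
y[:, 0] += 2.0 * cond                               # gene 0 differs between conditions
u, beta2, t = two_step_anova(y, array, cond, dye)
cond_row = 4                                        # intercept + 3 array dummies come first
```

In this sketch `beta2[cond_row]` holds the gene-specific condition effects. The computational advantage is visible in the shapes: step 2 solves many small regressions with six coefficients each instead of one model with six coefficients per gene fitted jointly.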

Multiple Testing, Motivation.

Histogram of 20,000 p-values generated under H0

Even when all 20,000 null hypotheses are true, we expect 20,000 × 0.05 = 1,000 p-values smaller than α = 0.05!
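This count is easy to check by simulation. The sketch below assumes that p-values are uniformly distributed on [0, 1] under H0 (true for continuous test statistics); the seed is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)        # fixed seed, arbitrary choice
m, alpha = 20_000, 0.05

p_null = rng.uniform(size=m)          # p-values when every null hypothesis is true
n_false_pos = int((p_null < alpha).sum())
expected = m * alpha                  # expected number of false positives

print(n_false_pos, expected)
```

The observed count fluctuates around the expected value from run to run, but every one of these "discoveries" is a false positive.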

Multiple Testing. Illustration of Benjamini-Hochberg procedure
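As a minimal sketch of the Benjamini-Hochberg step-up procedure: sort the m p-values, find the largest rank k with p_(k) <= k*q/m, and reject the k hypotheses with the smallest p-values. The function name `benjamini_hochberg` and the default FDR level `q = 0.05` are assumptions made for this illustration.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean array marking the hypotheses rejected at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                         # ranks the p-values
    thresholds = q * np.arange(1, m + 1) / m      # k*q/m for k = 1..m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()            # largest rank meeting its threshold
        rejected[order[:k + 1]] = True            # reject everything up to that rank
    return rejected

# Hypothetical p-values: the third-smallest meets its threshold, so the two
# smaller ones are rejected as well even though they miss their own thresholds.
example = benjamini_hochberg([0.9, 0.035, 0.02, 0.03], q=0.05)
```

The step-up rule is what distinguishes the procedure from comparing each p-value with its own threshold separately.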

Multiple Testing

M
