STATISTICAL TOOLS FOR SYNTHESIZING LISTS OF DIFFERENTIALLY EXPRESSED FEATURES IN MICROARRAY EXPERIMENTS

Marta Blangiardo and Sylvia Richardson¹

¹ Centre for Biostatistics, Imperial College, St Mary’s Campus, Norfolk Place, London W2 1PG, UK.

[email protected]

ACKNOWLEDGEMENTS

We would like to thank Natalia Bochkina, Alex Lewin and Anne-Mette Hein for helpful discussions. This work has been supported by a Wellcome Trust Functional Genomics Development Initiative (FGDI) thematic award “Biological Atlas of Insulin Resistance (BAIR)”, PC2910_DHCT.

REFERENCES

Allison et al. (2002), “A mixture model approach for the analysis of microarray gene expression data”, Computational Statistics and Data Analysis, 39, 1-20.

Baldi and Long (2001), “A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes”, Bioinformatics, 17, 509-519.

Ma et al. (2005), “Bioinformatics identification of novel early stress response genes in rodent models of lung injury”, Am J Physiol Lung Cell Mol Physiol, 289(3), 468-477.

SCOPE OF THE WORK

Consider two different but related experiments: how can we assess whether they have more differentially expressed (DE) genes in common than expected by chance?

RANKED LISTS

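Suppose we have two experiments, each reporting a measure of differential expression on a probability scale (e.g. a p-value) for the same n genes:

| Experiment A | Experiment B |
|---|---|
| pA1 | pB1 |
| pA2 | pB2 |
| … | … |
| pAn | pBn |

Small p-value: MOST differentially expressed

Large p-value: NOT differentially expressed

For a cut-off q, O1+(q) is the number of genes called differentially expressed in Experiment A and O+1(q) the number called differentially expressed in Experiment B.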

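We rank the genes according to the probability measures. For each cut-off q we obtain a 2x2 table:

| | Exp B: DE | Exp B: not DE | Total |
|---|---|---|---|
| Exp A: DE | O11(q) | O1+(q) - O11(q) | O1+(q) |
| Exp A: not DE | O+1(q) - O11(q) | n - O1+(q) - O+1(q) + O11(q) | n - O1+(q) |
| Total | O+1(q) | n - O+1(q) | n |

The number of genes observed in common is O11(q).

The number of genes expected in common by chance, under the null hypothesis H0 of independence, is E(O11(q) | H0) = O1+(q) O+1(q) / n.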
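As an illustration, the 2x2 table at a given cut-off can be computed as in the Python sketch below. This is not the authors' R code: the function name `two_by_two`, the toy p-values and the convention that a gene is called DE when its p-value is at most q are assumptions made for the example.

```python
def two_by_two(p_a, p_b, q):
    """Cross-classify genes at cut-off q.

    Assumption: a gene is called DE in an experiment when its p-value <= q.
    Returns the 2x2 table (rows: Exp A DE / not DE, columns: Exp B DE / not DE)
    and the number of genes expected in common under independence.
    """
    if len(p_a) != len(p_b):
        raise ValueError("both lists must cover the same genes, in the same order")
    n = len(p_a)
    o1p = sum(a <= q for a in p_a)                           # O1+(q)
    op1 = sum(b <= q for b in p_b)                           # O+1(q)
    o11 = sum(a <= q and b <= q for a, b in zip(p_a, p_b))   # O11(q)
    table = [[o11, o1p - o11],
             [op1 - o11, n - o1p - op1 + o11]]
    expected = o1p * op1 / n                                 # E(O11(q) | H0)
    return table, expected


# Toy p-values for six genes (hypothetical data).
p_a = [0.001, 0.020, 0.300, 0.004, 0.800, 0.030]
p_b = [0.002, 0.500, 0.010, 0.003, 0.900, 0.040]
table, expected = two_by_two(p_a, p_b, q=0.05)
print(table, round(expected, 3))
```

With these toy values, three genes fall below the cut-off in both lists, against about 2.7 expected under independence.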

RATIO

We propose to calculate the maximum of the observed to expected ratio:

$$T(q) = \frac{O_{11}(q)}{E(O_{11}(q) \mid H_0)}, \qquad T(q^*) = \max_q T(q) = \frac{O_{11}(q^*)}{E(O_{11}(q^*) \mid H_0)}$$

It is the maximal deviation from the underlying independence model.

•By using the maximum ratio, multiple testing issues for different list sizes are avoided

•It returns a single list of O11(q*) genes for further biological investigation
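A minimal Python sketch of the maximum-ratio search follows. It is an illustration only: the grid of thresholds, the `p-value <= q` convention, the helper names and the toy data are assumptions, and thresholds at which one of the lists is empty are skipped because the ratio is undefined there.

```python
def ratio_curve(p_a, p_b, thresholds):
    """Return (q, T(q), O11(q)) for every threshold with non-empty margins."""
    n = len(p_a)
    curve = []
    for q in thresholds:
        o1p = sum(a <= q for a in p_a)
        op1 = sum(b <= q for b in p_b)
        if o1p == 0 or op1 == 0:
            continue  # T(q) is undefined when no gene is called DE in a list
        o11 = sum(a <= q and b <= q for a, b in zip(p_a, p_b))
        curve.append((q, o11 * n / (o1p * op1), o11))
    return curve


def max_ratio(p_a, p_b, thresholds):
    """Return (q*, T(q*), O11(q*)); ties go to the first threshold in the grid."""
    curve = ratio_curve(p_a, p_b, thresholds)
    if not curve:
        raise ValueError("no threshold has non-empty margins")
    return max(curve, key=lambda row: row[1])


# Toy p-values for six genes (hypothetical data).
p_a = [0.001, 0.020, 0.300, 0.004, 0.800, 0.030]
p_b = [0.002, 0.500, 0.010, 0.003, 0.900, 0.040]
q_star, t_star, o11_star = max_ratio(p_a, p_b, [0.005, 0.01, 0.05, 0.5])
print(q_star, t_star, o11_star)
```

Here the ratio is largest at the most stringent threshold, which also shows how a very short list can drive T(q).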

PERMUTATION TEST

Given a threshold q and fixed margins, O11(q) follows a hypergeometric distribution under H0, so T(q) is proportional to a hypergeometric variable:

$$O_{11}(q) \mid H_0 \sim \mathrm{Hyper}(O_{1+}(q), O_{+1}(q), n), \qquad T(q) = \frac{n\, O_{11}(q)}{O_{1+}(q)\, O_{+1}(q)}$$

But the distribution of T(q*) is not easily obtained, since the tables for different thresholds are nested in each other. We therefore use the empirical distribution of T(q*) obtained via permutations: we perform a Monte Carlo test of T(q*) under the null hypothesis of independence between the two experiments, which returns a Monte Carlo p-value.

LIMITATIONS OF THE TEST

•The uncertainty of the margins is not taken into account

•The list of genes in common can be very small (typically when the total number of DE genes is small), which can make the estimate of T(q) unstable

•We propose a Bayesian model that also treats the margins as random variables
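The Monte Carlo test described under PERMUTATION TEST could look like the Python sketch below. The poster does not spell out the permutation scheme, so several choices here are assumptions: the gene labels of one experiment are shuffled to break the pairing, the p-value uses the common (exceedances + 1) / (permutations + 1) convention, and the data are a hypothetical, strongly associated toy example.

```python
import random


def max_ratio(p_a, p_b, thresholds):
    """Maximum over the grid of T(q) = n * O11(q) / (O1+(q) * O+1(q))."""
    n = len(p_a)
    best = 0.0
    for q in thresholds:
        o1p = sum(a <= q for a in p_a)
        op1 = sum(b <= q for b in p_b)
        if o1p == 0 or op1 == 0:
            continue  # ratio undefined for an empty list
        o11 = sum(a <= q and b <= q for a, b in zip(p_a, p_b))
        best = max(best, o11 * n / (o1p * op1))
    return best


def permutation_p_value(p_a, p_b, thresholds, n_perm=199, seed=0):
    """Monte Carlo p-value for T(q*) under independence (margins stay fixed)."""
    rng = random.Random(seed)
    t_obs = max_ratio(p_a, p_b, thresholds)
    shuffled = list(p_b)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks the gene pairing, keeps both margins
        if max_ratio(p_a, shuffled, thresholds) >= t_obs:
            exceed += 1
    return t_obs, (exceed + 1) / (n_perm + 1)


# Hypothetical data: the same 20 genes are DE in both experiments.
de = [(i + 1) / 1000 for i in range(20)]        # 0.001 ... 0.020
null = [0.1 + 0.005 * i for i in range(180)]    # all above every threshold
p_a = de + null
p_b = de + null[::-1]
t_obs, p_value = permutation_p_value(p_a, p_b, [0.01, 0.02, 0.05])
print(t_obs, p_value)
```

Because each permutation recomputes the maximum over all thresholds, the nesting of the tables is reflected in the reference distribution.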


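The vector of parameters θ(q) = (θ1(q), θ2(q), θ3(q), θ4(q)) of the multinomial model for the joint frequencies, with θ4(q) = 1 - θ1(q) - θ2(q) - θ3(q), is modelled as a non-informative Dirichlet:

$$\theta \sim \mathrm{Di}(0.05, 0.05, 0.05, 0.05)$$

The derived quantity of interest is the ratio of the probability that a gene is in common to the probability that a gene is in common by chance:

$$R(q) = \frac{\theta_1(q)}{[\theta_1(q) + \theta_2(q)]\,[\theta_1(q) + \theta_3(q)]}$$

Since the model is conjugate, the posterior distribution of θ is Dirichlet:

$$\theta \mid O, n \sim \mathrm{Di}\big(O_{11}(q) + 0.05,\ [O_{1+}(q) - O_{11}(q)] + 0.05,\ [O_{+1}(q) - O_{11}(q)] + 0.05,\ [n - O_{1+}(q) - O_{+1}(q) + O_{11}(q)] + 0.05\big)$$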
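Because the posterior is Dirichlet, R(q) can be summarised by direct simulation. The Python sketch below uses only the standard library and draws Dirichlet vectors by normalising Gamma variates. The counts in the example are hypothetical, and the function names, the number of draws and the simple order-statistic quantiles are assumptions for illustration, not the authors' implementation.

```python
import random
import statistics


def dirichlet_draw(alpha, rng):
    """One Dirichlet(alpha) draw, obtained by normalising Gamma(alpha_i, 1) variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def posterior_ratio_summary(o11, o1p, op1, n, prior=0.05, n_draws=5000, seed=1):
    """Posterior median and 95% credibility interval of R(q) at one threshold."""
    rng = random.Random(seed)
    # Posterior Dirichlet parameters: the four cell counts plus the prior weight.
    alpha = [o11 + prior,
             (o1p - o11) + prior,
             (op1 - o11) + prior,
             (n - o1p - op1 + o11) + prior]
    draws = []
    for _draw in range(n_draws):
        t1, t2, t3, _t4 = dirichlet_draw(alpha, rng)
        draws.append(t1 / ((t1 + t2) * (t1 + t3)))   # R(q)
    draws.sort()
    lower = draws[int(0.025 * (n_draws - 1))]
    upper = draws[int(0.975 * (n_draws - 1))]
    return statistics.median(draws), lower, upper


# Hypothetical counts: 30 genes in common, lists of 100 and 120, n = 2000.
median_r, ci_low, ci_high = posterior_ratio_summary(o11=30, o1p=100, op1=120, n=2000)
print(round(median_r, 2), round(ci_low, 2), round(ci_high, 2))
```

For these counts the plug-in ratio is 30 x 2000 / (100 x 120) = 5, and the simulated posterior median should lie close to it.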

DECISION RULE

We can obtain a sample from the posterior distribution of the derived quantity R(q) and calculate the 95% credibility interval (CI95) for each threshold q. We define q* as the value of q at which the median of R(q) attains its maximum, considering only the thresholds whose credibility interval does not include 1:

$$q^* = \arg\max_{q} \{\mathrm{Median}(R(q) \mid O, n) \text{ over the set of values of } q \text{ for which } CI_{95}(q) \text{ excludes } 1\}$$

Then R(q*) is the ratio associated with q*.
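The decision rule amounts to a filtered arg max, as in the short Python sketch below. The posterior summaries in the example are hypothetical numbers, and the tuple layout and the function name `choose_q_star` are assumptions.

```python
def choose_q_star(summaries):
    """Pick q* from (q, median, ci_low, ci_high) tuples.

    Only thresholds whose 95% credibility interval excludes 1 are eligible;
    among them, q* is the threshold with the largest posterior median of R(q).
    Returns None when every interval contains 1 (no association declared).
    """
    eligible = [s for s in summaries if s[2] > 1.0 or s[3] < 1.0]
    if not eligible:
        return None
    return max(eligible, key=lambda s: s[1])


# Hypothetical posterior summaries of R(q) on a grid of thresholds.
summaries = [
    (0.01, 6.0, 0.8, 14.0),   # largest median, but the CI contains 1
    (0.02, 4.5, 1.9, 8.0),
    (0.05, 2.1, 1.3, 3.0),
    (0.10, 1.2, 0.9, 1.5),    # CI contains 1
]
result = choose_q_star(summaries)
print(result)
```

In this example the threshold with the largest median is discarded because its interval contains 1, which is how the rule guards against unstable ratios from very short lists.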

DISCUSSION

•This is a simple procedure to evaluate whether two (or more) experiments are associated

•The permutation test gives a first look under the model in which the marginal frequencies are fixed

•The Bayesian model extends the scenario by introducing variability in all the components

•It is flexible and adaptable to comparisons of several experiments at different levels (gene level, biological process level) and to different problems (e.g. comparison between species, comparison between platforms)

SIMULATION

We use three batches of simulations differing in the level of association between experiments and in the percentage of DE genes. For every batch we simulate two lists of 2000 p-values (Allison et al., 2002) and average the results over 100 simulations.

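Conditional Model (Permutation Test)

| Association | DE | T(q*) | q* | O11(q*) | O1+(q*) | O+1(q*) | MC p-value |
|---|---|---|---|---|---|---|---|
| 0 | 10% | 1.1 | 0.040 | 10 | 115 | 120 | 0.550 |
| 0.25 | 10% | 5.7 | 0.010 | 6 | 49 | 50 | 0.060 |
| 0.25 | 20% | 3.0 | 0.019 | 11 | 82 | 82 | 0.030 |
| 0.25 | 30% | 2.5 | 0.023 | 21 | 125 | 126 | 0.002 |

Joint Model (Bayesian Analysis)

| Association | DE | R(q*) | 95% CI | q* | O11(q*) | O1+(q*) | O+1(q*) |
|---|---|---|---|---|---|---|---|
| 0 | 10% | 1.0 | [0.4-1.5] | 0.050 | 18 | 125 | 130 |
| 0.25 | 10% | 5.0 | [2.2-10.6] | 0.020 | 8 | 59 | 59 |
| 0.25 | 20% | 2.9 | [1.5-4.9] | 0.026 | 17 | 105 | 106 |
| 0.25 | 30% | 2.4 | [1.4-3.6] | 0.030 | 28 | 148 | 150 |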

APPLICATION: analysis of the deleterious effect of mechanical ventilation on lung gene expression

We re-analyse the experiment presented in Ma et al. (2005), which investigates the deleterious effect of mechanical ventilation on lung gene expression through a model of mechanical ventilation-induced lung injury (VILI) in rodents (mice and rats).

We analyse the two datasets separately using Cyber-T (Baldi and Long, 2001).

We use RESOURCERER to reconstruct the list of orthologs for the two species.

We apply the methodology described to the lists of 2969 p-values (ortholog genes).

Conditional Model: T(q*) = 1.44, q* = 0.01, MC p-value < 0.001

Joint Model: R(q*) = 1.43, q* = 0.01, CI95 = [1.13-1.75]

Expected number of genes in common under H0:

$$E(O_{11}(q) \mid H_0) = \frac{O_{1+}(q)\, O_{+1}(q)}{n}$$

[Figure: two example panels. Not associated: P-value 0.8. Associated: P-value < 0.001.]

BAYESIAN MODEL

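Starting with the 2x2 table, we specify a multinomial distribution of dimension 3 for the vector of joint frequencies:

$$\mathrm{Multi}(O \mid n, \theta) \propto \theta_1^{O_{11}(q)}\, \theta_2^{[O_{1+}(q) - O_{11}(q)]}\, \theta_3^{[O_{+1}(q) - O_{11}(q)]}\, \Big(1 - \sum_{i=1}^{3} \theta_i\Big)^{[n - O_{1+}(q) - O_{+1}(q) + O_{11}(q)]}$$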
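For readers who want the conjugacy step spelled out, the short derivation below combines this multinomial likelihood with the Di(0.05, 0.05, 0.05, 0.05) prior used on this poster. The shorthand x_1, ..., x_4 for the four cell counts and θ_4 for the fourth probability is introduced here for brevity and is not the poster's notation.

```latex
% Shorthand (assumption): the four cells of the 2x2 table at threshold q
%   x_1 = O_{11}(q),              x_2 = O_{1+}(q) - O_{11}(q),
%   x_3 = O_{+1}(q) - O_{11}(q),  x_4 = n - O_{1+}(q) - O_{+1}(q) + O_{11}(q),
% and \theta_4 = 1 - \theta_1 - \theta_2 - \theta_3.
\begin{align*}
p(\theta \mid O, n)
  &\propto p(O \mid n, \theta)\, p(\theta) \\
  &\propto \prod_{i=1}^{4} \theta_i^{x_i} \times \prod_{i=1}^{4} \theta_i^{0.05 - 1} \\
  &= \prod_{i=1}^{4} \theta_i^{(x_i + 0.05) - 1},
\end{align*}
% which is the kernel of a Dirichlet density, so
\[
\theta \mid O, n \sim \mathrm{Di}(x_1 + 0.05,\ x_2 + 0.05,\ x_3 + 0.05,\ x_4 + 0.05).
\]
```

Each prior weight of 0.05 acts like a small pseudo-count added to the corresponding cell, so the data dominate the posterior unless a cell is nearly empty.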

•97 genes found in common between mice and rats

•15 genes in common with the original analysis (which highlighted 48 genes)

•Two enriched pathways with our methodology:

1) MAPK signalling activity. 6 of the significant orthologs are involved in this KEGG pathway (Fgfr1, Gadd45a, Hspa8, Hspa1a, Il1b, Il1r2), while only 4 were highlighted in the original analysis.

2) Cytokine-cytokine receptor interaction. 5 of the significant orthologs are involved in this KEGG pathway (IL6, Il1b, Il1r2, CCL2, Kit), while only 4 were highlighted in the original analysis.

NO association is declared when the two lists are not associated (MC p-value not significant, CI95 includes 1). Example for the joint model: R(q*) = 1.0, q* = 0.05, O11(q*) = 18.

When there is a TRUE association:

•R(q*) is always smaller than T(q*) and its q* is slightly larger, as it accounts for the additional variability

All the computations have been performed in R and are available on the BGX website (www.bgx.org.uk).

Application, counts at q*: O11(q*) = 97, O1+(q*) = 393, O+1(q*) = 886

Increasing % of DE genes:

Conditional Model

•The ratio T(q*) decreases

•q*, O1+(q*), O+1(q*) and O11(q*) increase

•The MC p-value is more significant

Joint Model

•The ratio R(q*) decreases

•q*, O1+(q*), O+1(q*) and O11(q*) increase

•CI95 are narrower