
Heterogeneity of Variance in Gene Expression Microarray Data

David M. Rocke

Department of Applied Science and Division of Biostatistics

University of California, Davis

March 15, 2003

Abstract

Motivation

One important problem in the analysis of gene expression microarray data is that the variation in expression under constant conditions is not stable from gene to gene. Recently, variance-stabilizing transformations have been developed that can remove the systematic dependence of the variance on the mean, but it appears that there is still considerable variance heterogeneity that can interfere with global analysis of expression data.

Results

We develop a method consisting of a variance-stabilizing data transformation followed by empirical Bayes estimation of gene-specific variances. The method is more powerful than using data from each gene alone, but does not suffer from the bias caused by the use of global error models.

Availability

R code will be available from the author by e-mail or on the website http://www.cipic.ucdavis.edu/~dmrocke

Contact

[email protected].


1. Introduction

Consider a set of microarray experiments of n arrays, each with p genes. For each gene considered separately, we entertain a statistical model that is linear in a set of factors or variables attached to the arrays, so that the form of the statistical model is common to all genes. Assume that the expression data have been transformed so that the variances neither increase nor decrease systematically with the mean expression of the gene.

Given a statistical hypothesis framed within the linear model for each gene, there is almost always an exact or approximate F-test, in which the numerator can be calculated from the cell means of the data for a particular gene, and the denominator (if the test is conducted in isolation for each gene) is a function of the deviations of the data from the cell means (Kerr 2003).

An alternative approach with variance-stabilized data is to obtain the numerator of the test from the particular gene, but obtain the denominator from a global error model (in this case, constant variance). This increases the power of the tests considerably because the variance estimates will be based on thousands of points, not just a few. However, it introduces possible biases if the variances are not truly homogeneous (Kerr 2003; Kerr, Martin, and Churchill 2000).

A compromise between power and bias may be obtained by using variance estimates for the denominators of the F-tests that are themselves a compromise between the gene-specific variance and the global variance.

2. A Motivating Example

We consider an experiment in which cell lines in four conditions are to be compared. There are two observations for each of the four conditions, with one Affymetrix U95A GeneChip for each sample. For the sake of illustration, we will consider the MAS 4.0 average difference summary, one main advantage of which is that it does not artificially compress the low-level data.

One goal of the analysis is to determine which genes are differentially expressed among the four conditions. A standard approach, if we consider only one gene, would be to perform a one-way analysis of variance (ANOVA). However, a standard assumption of that analysis is that the variance at the different levels is the same. In the case of microarray data, there is a strong dependence of the variance on the mean, as is shown for these data by Figures 1 and 2, which give the difference of replicates in a gene-by-group condition vs. the sum.

This type of variability can be removed by the generalized log (glog) transform introduced independently by Durbin et al. (2002), Hawkins (2002), Huber et al. (2002), and Munson (2001), and further developed in Durbin and Rocke (2003a; 2003b), Geller et al. (2003), and Rocke and Durbin (2003a; 2003b). Figure 3 shows the same sum/difference data after transforming by the glog with a parameter of λ = 1225 estimated by maximum likelihood (Durbin and Rocke 2003a).
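As an illustration only, the following sketch applies a glog transform to a few intensities. It assumes the form glog(z) = ln(z + sqrt(z^2 + λ)), which is one common way the transform is written; the function name `glog` and the sample intensities are hypothetical and are not taken from the author's R code.

```python
import math

def glog(z, lam):
    """Generalized log, assumed form ln(z + sqrt(z^2 + lam)) with lam > 0.

    Unlike the ordinary log, this is defined for zero and negative inputs,
    and it behaves like ln(2z) when z is large relative to sqrt(lam).
    """
    return math.log(z + math.sqrt(z * z + lam))

LAMBDA = 1225.0  # parameter value reported in the text for this data set

# Hypothetical average-difference values, including a negative one.
for z in (-20.0, 0.0, 100.0, 10000.0):
    print(z, round(glog(z, LAMBDA), 4))
```

At z = 0 the transform equals ln(sqrt(λ)), so the value of λ sets the scale below which intensities are treated as roughly additive noise.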


MSE Source       TWER    FWER     FDR

Gene-Specific    2114       1      18
Global           2478     571    1516
Posterior        2350      29     508

Table 1: Number of genes out of 12,625 “significant” at the 5% level for three methods of estimating the MSE in a microarray experiment. Column 2 uses the raw p-values with test-wise error rate (TWER) 5%. Column 3 gives the count under family-wise error rate (FWER) control using the Bonferroni inequality, and column 4 is the set of genes nominated as significant by the false-discovery-rate (FDR) method of Benjamini and Hochberg (1995; see also Reiner et al. 2003).

At this point, one could reasonably perform an ANOVA for gene i using the model

$$z_{ijk} = \beta_j + \epsilon_{jk} \tag{2.1}$$

where the $z_{ijk}$ are additively normalized, glog-transformed expression values. In this way, we obtain 12,625 F-tests of the null hypothesis of equal expression for all groups, in which we compare the mean square for groups from gene i ($MSG_i$) to the mean square for error from gene i ($MSE_i$) by referring the ratio $MSG_i/MSE_i$ to an F distribution with 3 and 4 degrees of freedom.
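The gene-by-gene test can be sketched as follows for a single gene with four conditions and two replicates each, which yields 3 and 4 degrees of freedom. The function name `one_way_anova` and the data values are hypothetical; the reference to the F distribution is left out, so only the mean squares and their ratio are computed.

```python
def one_way_anova(groups):
    """Return (MSG, MSE, F, df_groups, df_error) for a list of replicate lists."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares, built from the cell means.
    ss_groups = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares, built from deviations from the cell means.
    ss_error = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_groups = len(groups) - 1
    df_error = n_total - len(groups)
    msg = ss_groups / df_groups
    mse = ss_error / df_error
    return msg, mse, msg / mse, df_groups, df_error

# Hypothetical glog-scale values for one gene: 4 conditions x 2 replicates.
gene = [[7.1, 7.3], [7.2, 7.0], [8.4, 8.6], [7.1, 7.2]]
msg, mse, f_stat, df1, df2 = one_way_anova(gene)
print(df1, df2, msg, mse, f_stat)
```

With only 4 error degrees of freedom, the MSE in the denominator is itself quite variable, which is the loss of power discussed in the text.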

This procedure should be valid, and after an adjustment for multiplicity, the results could be used directly. Figure 4 gives a histogram of the 12,625 p-values, showing that certainly some of them represent real effects. The first line of figures in Table 1 shows that, at the 5% level, 1 gene is significant using the Bonferroni method, and 18 are significant using the FDR method of Benjamini and Hochberg (1995; see also Reiner et al. 2003).
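The three counts reported per row of Table 1 can be sketched as below: the raw count at the test-wise level, the Bonferroni count, and the Benjamini-Hochberg step-up count. The function names and the p-values are hypothetical, and this is a plain restatement of the standard procedures rather than the author's code.

```python
def twer_count(pvalues, level=0.05):
    """Number of raw p-values at or below the test-wise level."""
    return sum(1 for p in pvalues if p <= level)

def bonferroni_count(pvalues, level=0.05):
    """Number of p-values at or below level / m (family-wise control)."""
    m = len(pvalues)
    return sum(1 for p in pvalues if p <= level / m)

def bh_count(pvalues, level=0.05):
    """Benjamini-Hochberg step-up: the largest k with p_(k) <= k * level / m."""
    m = len(pvalues)
    k_max = 0
    for k, p in enumerate(sorted(pvalues), start=1):
        if p <= k * level / m:
            k_max = k
    return k_max

# Hypothetical p-values for eight genes.
pvals = [0.0001, 0.004, 0.012, 0.03, 0.2, 0.45, 0.6, 0.9]
print(twer_count(pvals), bonferroni_count(pvals), bh_count(pvals))
```

The step-up rule rejects every hypothesis up to the largest qualifying rank, so the FDR count always lies between the Bonferroni count and the raw count.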

A possible objection to this procedure is that we are losing power by not employing information from other genes. If we employ the perspective of Kerr, Martin, and Churchill (2000), we could estimate the model

$$z_{ijk} = \mu_i + n_k + \beta_{ij} + \epsilon_{ijk} \tag{2.2}$$

where the $z_{ijk}$ are glog-transformed (unnormalized) expression values, the normalization is part of the ANOVA (the $n_k$ terms), and the group effects are in the gene-by-group interaction terms $\beta_{ij}$ (Kerr 2003). This analysis gives us another mean square for error that we could use as a denominator, in which case the F-statistics for each gene separately would have 3 and 50,493 df. Figure 5 shows the histogram of the p-values using this method. The excess of very small F-statistics is a sign that the model is incorrect. In this case, the assumption that all genes have the same MSE is almost certainly false. Use of an average MSE, when smaller or larger ones would be more appropriate, will lead to an excess of p-values at both ends. In the second line of figures in Table 1, the number of genes nominated as significant is much greater for each of the three error-rate criteria than when the gene-specific MSE is used. It is likely that some of these are mistakes, being due to a large true gene-specific MSE being coupled with using an average MSE as a denominator instead of an unbiased gene-specific MSE estimate.
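The error degrees of freedom quoted for the global model can be recovered by counting parameters. The sketch below assumes the usual identifiability constraints (one mean per gene, one fewer array effect than arrays, and one fewer interaction per gene than groups); that accounting and the function name are assumptions, not something spelled out in the text.

```python
def global_model_residual_df(genes, arrays, groups):
    """Residual df for z_ijk = mu_i + n_k + beta_ij + eps_ijk under
    an assumed standard parameter count."""
    observations = genes * arrays
    gene_terms = genes                        # mu_i, one mean per gene
    array_terms = arrays - 1                  # n_k, normalization effects
    interaction_terms = genes * (groups - 1)  # beta_ij, gene-by-group effects
    return observations - gene_terms - array_terms - interaction_terms

residual_df = global_model_residual_df(genes=12625, arrays=8, groups=4)
print(residual_df)
```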

The average value of the MSE over all 12,625 genes is 0.1017, which is also the residual MSE from the global model. If the 4 df estimates from each gene had the distribution predicted from normality and constant true variance, the variance of these MSE estimates across genes would be $2\sigma^4/\nu = (0.1017)^2/2 \approx 0.00517$. Instead, it is 0.0556, more than 10 times the size it should be. Of the two simple explanations for this, nonnormality and heterogeneity of variance, the latter is the simpler possibility.
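The homogeneity benchmark above can be evaluated directly from the formula for the variance of a ν-df variance estimate under normality. This is a small arithmetic sketch using the mean and variance of the MSEs quoted in the text; the function and variable names are hypothetical.

```python
def expected_mse_variance(sigma2, nu):
    """Variance of a nu-df variance estimate under normality: 2 * sigma^4 / nu."""
    return 2.0 * sigma2 ** 2 / nu

mean_mse = 0.1017      # average MSE over genes, from the text
observed_var = 0.0556  # observed variance of the MSEs, from the text

expected_var = expected_mse_variance(mean_mse, nu=4)
ratio = observed_var / expected_var
print(expected_var, ratio)
```

A ratio far above 1 indicates more spread among the gene-specific MSEs than sampling error alone would produce.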

We now proceed to account for this situation using a standard empirical Bayes estimate for the individual gene MSE.

3. The Modeling Setup

Given n genes indexed by i, suppose that the true variance of the effect of interest for gene i is $\sigma_i^2$. For each i we obtain a ν degree-of-freedom estimate $s_i^2$ of $\sigma_i^2$. We will work in the Gaussian framework for convenience, in which case we may assume that $s_i^2$ has a gamma distribution with parameters τ (the mean) and a = ν/2 (the shape parameter). Again for simplicity, we treat the case where ν is constant across genes. Though the case where ν varies is not conceptually more difficult, the computations are more complex.

We model these individual values $\sigma_i^2 = \tau_i$ as random with an inverse gamma distribution with parameters α and η = αβ. Note that η is the mean of the inverse of τ (the reciprocal variance 1/τ is sometimes called the precision). With this as a prior distribution, and an observed value $s_i^2 = x$, the posterior distribution for τ is proportional to

$$\frac{e^{-1/(\tau\beta^*)}}{\tau^{\nu/2+\alpha+1}} \tag{3.1}$$

where

$$\beta^* = \frac{2}{x\nu + 2\alpha/\eta} \tag{3.2}$$

Thus, the posterior distribution is inverse gamma, like the prior, with parameters

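$$\alpha^* = \nu/2 + \alpha \tag{3.3}$$

$$\beta^* = \frac{2}{x\nu + 2\alpha/\eta} \tag{3.4}$$

$$\eta^* = \alpha^*\beta^* = \frac{\nu + 2\alpha}{x\nu + 2\alpha/\eta} \tag{3.5}$$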

Also

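$$\frac{1}{\eta^*} = \frac{x\nu + 2\alpha/\eta}{\nu + 2\alpha} \tag{3.6}$$

$$= x\left(\frac{\nu}{\nu + 2\alpha}\right) + \frac{1}{\eta}\left(\frac{2\alpha}{\nu + 2\alpha}\right) \tag{3.7}$$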


Now x here is an observed value of $s_i^2$, and 1/η is the reciprocal of the mean prior precision, which is thus an estimate of the center of the prior distribution for $\tau_i = \sigma_i^2$. Also, ν is the degrees of freedom of $s_i^2$ and 2α is the equivalent degrees of freedom of the prior. Thus, the posterior estimate of the variance used here, 1/η*, will be a weighted average of the individual variance and the prior mean reciprocal precision, each weighted by its degrees of freedom.
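The weighted-average form of the posterior estimate can be sketched as a pair of small functions. The names `posterior_variance` and `posterior_weights` and the numeric inputs are hypothetical; the formulas follow (3.6) and (3.7).

```python
def posterior_variance(x, nu, alpha, eta):
    """Posterior estimate 1/eta* = (x*nu + 2*alpha/eta) / (nu + 2*alpha)."""
    return (x * nu + 2.0 * alpha / eta) / (nu + 2.0 * alpha)

def posterior_weights(nu, alpha):
    """Weights on the observed variance x and on the prior value 1/eta."""
    total = nu + 2.0 * alpha
    return nu / total, 2.0 * alpha / total

# Hypothetical inputs: a 4-df gene variance and a prior worth 4 df.
x, nu, alpha, eta = 0.02, 4.0, 2.0, 10.0
w_gene, w_prior = posterior_weights(nu, alpha)
direct = posterior_variance(x, nu, alpha, eta)
weighted = w_gene * x + w_prior * (1.0 / eta)
print(direct, weighted)
```

The two expressions agree, and as ν grows relative to 2α the estimate moves toward the gene's own variance.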

This method of estimation of a variance using an inverse gamma conjugate prior is completely standard (Carlin and Louis 2000; Gelman et al. 1995), and has been used previously in a microarray context by Baldi and Long (2001). The first two references give more detail on the derivation of the posterior in this case.

4. Empirical Estimation of the Prior

To complete the empirical Bayes estimation procedure, we need to specify how we estimate the parameters of the prior from the ensemble of variances. If each observed variance $s_i^2$ has a gamma distribution $F_i$ with parameters τ and a = ν/2, and if the prior distribution G of τ is inverse gamma with parameters α and β, then

$$E(s_i^2) = \frac{1}{\beta(\alpha - 1)} \qquad V(s_i^2) = \frac{2(\alpha - 1)/\nu + 1}{\beta^2(\alpha - 1)^2(\alpha - 2)} \tag{4.1}$$

If an ensemble of variances has mean M and variance V, then a method-of-moments estimate of α and β is given by solving

$$M = \frac{1}{\beta(\alpha - 1)} \qquad V = \frac{2(\alpha - 1)/\nu + 1}{\beta^2(\alpha - 1)^2(\alpha - 2)} \tag{4.2}$$

for α and β. This leads to

$$\alpha = \frac{M^2(1 - 2/\nu) + 2V}{V - 2M^2/\nu} \qquad \beta = \frac{1}{M(\alpha - 1)} \tag{4.3}$$

as method-of-moments estimates. If the variances were homogeneous, then we would have $V \approx 2M^2/\nu$. If either the denominator or the numerator is negative, that is presumably a sign that there is not an important amount of heterogeneity in the variances. However, usually both will be bounded well away from zero.
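A minimal sketch of the method-of-moments step in (4.3) follows. The function name `estimate_prior` is hypothetical, and returning `None` when the numerator or denominator is not positive is an illustrative choice for the "no important heterogeneity" case, not a rule stated in the text. The inputs are the ensemble mean and variance reported for the example data set.

```python
def estimate_prior(mean_mse, var_mse, nu):
    """Method-of-moments estimates (alpha, beta, eta) from equation (4.3).

    Returns None when the numerator or denominator is not positive, which
    the text reads as a sign of little variance heterogeneity.
    """
    numerator = mean_mse ** 2 * (1.0 - 2.0 / nu) + 2.0 * var_mse
    denominator = var_mse - 2.0 * mean_mse ** 2 / nu
    if numerator <= 0.0 or denominator <= 0.0:
        return None
    alpha = numerator / denominator
    beta = 1.0 / (mean_mse * (alpha - 1.0))
    return alpha, beta, alpha * beta

# Mean and variance of the 12,625 gene-specific MSEs, each with 4 df.
alpha, beta, eta = estimate_prior(0.1016965, 0.0556099, nu=4)
print(alpha, beta, eta, 1.0 / eta)
```

When the denominator is positive, α exceeds 2 automatically, so the prior mean and variance in (4.1) both exist.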


5. The Example Continued

For the example data set, the mean of the 12,625 values of the residual MSE is 0.1016965 and the variance of the same collection is 0.0556099. Using (4.3), we obtain

α = 2.308

β = 7.520

η = 17.353

1/η = 0.0576

The degrees of freedom of the prior is 2α = 4.615, so for each gene i, we obtain an 8.6 df MSE estimate by taking a weighted average of the 4 df MSE from the ANOVA of that gene (with weight 4/8.6) and the prior best estimate 0.0576 (with weight 4.6/8.6). Figure 6 shows the histogram of the p-values obtained by this method, which shows no sign of distortion at the high p-value end.
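To make the comparison among the three denominators concrete, the sketch below forms the F ratio for one gene using the gene-specific MSE, the global MSE, and the posterior (shrunken) MSE. The constants are the values quoted in the text; the gene's MSG and MSE are hypothetical, and the reference distributions and p-values are not computed here.

```python
NU = 4.0             # error df of each gene-specific MSE
PRIOR_DF = 4.615     # 2 * alpha, from the text
PRIOR_MSE = 0.0576   # 1 / eta, the prior best estimate, from the text
GLOBAL_MSE = 0.1017  # residual MSE of the global model, from the text

def shrunken_mse(gene_mse):
    """Weighted average of the gene MSE and the prior value, weighted by df."""
    return (NU * gene_mse + PRIOR_DF * PRIOR_MSE) / (NU + PRIOR_DF)

def f_ratios(msg, gene_mse):
    """F ratios for one gene under the three choices of denominator."""
    return {
        "gene_specific": msg / gene_mse,
        "global": msg / GLOBAL_MSE,
        "posterior": msg / shrunken_mse(gene_mse),
    }

# Hypothetical gene with a small observed MSE.
ratios = f_ratios(msg=0.9, gene_mse=0.02)
print(shrunken_mse(0.02), ratios)
```

For a gene whose observed MSE is below the prior value, the posterior denominator is pulled upward, so its F ratio falls between the gene-specific and global ratios.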

Comparing the three methods shown in Table 1, we see that the global MSE estimate rejects the most genes, but Figure 5 shows that these rejections cannot be trusted. The posterior best estimate MSE identifies a much larger number of genes as differentially expressed than using 4 df gene-specific MSEs, without apparent signs of problems with maintaining the size of the tests.

6. Concluding Remarks

Bayesian and empirical Bayesian methods are frequently proposed for the analysis of microarray data (for example, Baldi and Long 2001; Broet et al. 2002; Efron et al. 2002; Ibrahim et al. 2002; Newton et al. 2001, 2003; Theilhaber et al. 2001). What is proposed here is a sort of minimal empirical Bayesian approach. We do not need to put a prior distribution on the mean expression across genes or on the probability of positive expression, since this is handled by the multiplicity-adjusted F-tests. Our approach resembles most closely the treatment in Baldi and Long (2001). However, their use of the log transform resulted in substantial dependence of the variance on the mean, whereas by use of the glog transform, we have removed at least most of this dependence. This makes the Bayesian model fit the data better than in their case.

We have written code in the R language (Ihaka and Gentleman 1996) that implements many of the required calculations in standard situations. The code will be available from the author by e-mail or on the website http://www.cipic.ucdavis.edu/~dmrocke.

Acknowledgements

The research reported in this paper was supported by grants from the National Science Foundation (ACI 96-19020 and DMS 98-70172) and the National Institute of Environmental Health Sciences, National Institutes of Health (P43 ES04699).


Appendix: The Gamma and Inverse Gamma Distributions

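The gamma distribution with parameters α and β has density

$$f_X(x) = \frac{x^{\alpha-1} e^{-x/\beta}}{\Gamma(\alpha)\beta^{\alpha}} \tag{A.1}$$

The first two moments are given by

$$E(X) = \alpha\beta = \tau \tag{A.2}$$

$$V(X) = \alpha\beta^2 = \tau^2/\alpha \tag{A.3}$$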

The inverse gamma distribution with parameters α and β is the distribution of Y = 1/X, where X is gamma distributed with parameters α and β. The density of Y is

$$f_Y(y) = \frac{e^{-1/(y\beta)}}{\Gamma(\alpha)\beta^{\alpha}y^{\alpha+1}} \tag{A.4}$$

The first two moments are given by

$$E(Y) = \frac{1}{\beta(\alpha - 1)} \tag{A.5}$$

$$V(Y) = \frac{1}{\beta^2(\alpha - 1)^2(\alpha - 2)} \tag{A.6}$$
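The inverse gamma moments can be spot-checked by simulation. The sketch below draws gamma variates with shape α and scale β using Python's standard library, inverts them, and compares the sample mean and variance with the closed forms. The parameter values, sample size, and function names are arbitrary choices for illustration.

```python
import random

def inverse_gamma_moments(alpha, beta):
    """Closed-form mean and variance of Y = 1/X, X ~ gamma(shape alpha, scale beta).

    The mean requires alpha > 1 and the variance requires alpha > 2.
    """
    mean = 1.0 / (beta * (alpha - 1.0))
    var = 1.0 / (beta ** 2 * (alpha - 1.0) ** 2 * (alpha - 2.0))
    return mean, var

def simulate_inverse_gamma(alpha, beta, n, seed=0):
    """Draw n inverse gamma variates by inverting gamma(shape, scale) draws."""
    rng = random.Random(seed)
    return [1.0 / rng.gammavariate(alpha, beta) for _ in range(n)]

alpha, beta = 6.0, 0.5
draws = simulate_inverse_gamma(alpha, beta, n=100_000)
sample_mean = sum(draws) / len(draws)
sample_var = sum((y - sample_mean) ** 2 for y in draws) / (len(draws) - 1)
theory_mean, theory_var = inverse_gamma_moments(alpha, beta)
print(sample_mean, theory_mean, sample_var, theory_var)
```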

We will re-parametrize in terms of α and η = αβ, which is the mean of the reciprocal of the inverse gamma variate. We then have that the density is

$$f_Y(y) = \frac{e^{-\alpha/(y\eta)}}{\Gamma(\alpha)(\eta/\alpha)^{\alpha}y^{\alpha+1}} \tag{A.7}$$

The first two moments are given in this parametrization by

$$E(Y) = \frac{\alpha}{\eta(\alpha - 1)} \tag{A.8}$$

$$V(Y) = \frac{\alpha^2}{\eta^2(\alpha - 1)^2(\alpha - 2)} \tag{A.9}$$

References

Baldi, P. and Long, A.D. (2001) “A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inference of gene changes,” Bioinformatics, 17, 506-519.

Benjamini, Y. and Hochberg, Y. (1995) “Controlling the false discovery rate,” Journal of the Royal Statistical Society, Series B, 57, 289-300.

Broet, P., Richardson, S., and Radvanyi, F. (2002) “Bayesian hierarchical model for identifying changes in gene expression from microarray experiments,” Journal of Computational Biology, 9, 671-683.

Carlin, B.P. and Louis, T.A. (2000) Bayes and Empirical Bayes Methods for Data Analysis, Second Edition, New York: Chapman and Hall.

Durbin, B.P., Hardin, J.S., Hawkins, D.M., and Rocke, D.M. (2002) “A variance-stabilizing transformation for gene-expression microarray data,” Bioinformatics, 18, S105-S110.

Durbin, B. and Rocke, D.M. (2003a) “Estimation of transformation parameters for microarray data,” Bioinformatics, in press.

Durbin, B. and Rocke, D.M. (2003b) “Exact and approximate variance-stabilizing transformations for two-color microarrays,” submitted for publication.

Efron, B., Tibshirani, R., Storey, J.D., and Tusher, V. (2002) “Empirical Bayes analysis of a microarray experiment,” Journal of the American Statistical Association, 96, 1151-1160.

Geller, S.C., Gregg, J.P., Hagerman, P.J., and Rocke, D.M. (2003) “Transformation and normalization of oligonucleotide microarray data,” submitted for publication.

Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995) Bayesian Data Analysis, New York: Chapman and Hall.

Hawkins, D.M. (2002) “Diagnostics for conformity of paired quantitative measurements,” Statistics in Medicine, 21, 1913-1935.

Holder, D., Raubertas, R.F., Pikounis, V.B., Svetnik, V., and Soper, K. (2001) “Statistical analysis of high density oligonucleotide arrays: A SAFER approach,” GeneLogic Workshop on Low Level Analysis of Affymetrix GeneChip Data.

Huber, W., von Heydebreck, A., Sultmann, H., Poustka, A., and Vingron, M. (2002) “Variance stabilization applied to microarray data calibration and to the quantification of differential expression,” Bioinformatics, 18, S96-S104.

Ibrahim, J.G., Chen, M.-H., and Gray, R.J. (2002) “Bayesian models for gene expression with microarray data,” Journal of the American Statistical Association, 97, 88-99.

Ihaka, R. and Gentleman, R. (1996) “R: A language for data analysis and graphics,” Journal of Computational and Graphical Statistics, 5, 299-314. (See www.r-project.org.)

Kerr, M.K. (2003) “Linear models for microarray data analysis: Hidden similarity and differences,” University of Washington Biostatistics Working Paper 190.

Kerr, M.K., Martin, M., and Churchill, G.A. (2000) “Analysis of variance for gene expression microarray data,” Journal of Computational Biology, 7, 819-837.

Munson, P. (2001) “A ‘Consistency’ Test for Determining the Significance of Gene Expression Changes on Replicate Samples and Two Convenient Variance-stabilizing Transformations,” GeneLogic Workshop on Low Level Analysis of Affymetrix GeneChip Data.

Newton, M.A., Kendziorski, C.M., Richmond, C.S., Blattner, F.R., and Tsui, K.W. (2001) “On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data,” Journal of Computational Biology, 8, 37-52.

Newton, M.A., Noueiry, A., Sarkar, D., and Ahlquist, P. (2003) “Detecting differential gene expression with a semiparametric hierarchical mixture model,” manuscript.

Reiner, A., Yekutieli, D., and Benjamini, Y. (2003) “Identifying differentially expressed genes using false discovery rate controlling procedures,” Bioinformatics, 19, 368-375.

Rocke, D. and Durbin, B. (2001) “A model for measurement error for gene expression arrays,” Journal of Computational Biology, 8, 557-569.

Rocke, D. and Durbin, B. (2003) “Approximate variance-stabilizing transformations for gene-expression microarray data,” Bioinformatics, in press.

Theilhaber, J., Bushnell, S., Jackson, A., and Fuchs, R. (2001) “Bayesian estimation of fold changes in the analysis of gene expression: The PFOLD algorithm,” Journal of Computational Biology, 8, 585-614.


List of Figures

1. Absolute difference in replicates versus the sum for the 12,625 × 4 gene-by-group combinations.

2. Absolute difference in replicates versus the rank of the sum for the 12,625 × 4 gene-by-group combinations.

3. Absolute difference in replicates versus the rank of the sum for the 12,625 × 4 gene-by-group combinations after transformation by the glog with λ = 1225.

4. Histogram of p-values for 12,625 F-tests using gene-specific MSE.

5. Histogram of p-values for 12,625 F-tests using global MSE.

6. Histogram of p-values for 12,625 F-tests using posterior best-estimate MSE.


[Figure 1. Panel title: Raw Data. Axes: Sum (horizontal), Difference (vertical).]

[Figure 2. Panel title: Raw Data. Axes: Rank of Sum (horizontal), Difference (vertical).]

[Figure 3. Panel title: Glog of Data. Axes: Rank of Sum (horizontal), Difference (vertical).]

[Figure 4. Panel title: Histogram of Gene-Specific p-Values. Axes: Raw p-Values (horizontal), Frequency (vertical).]

[Figure 5. Panel title: Histogram of Global p-Values. Axes: Raw p-Values (horizontal), Frequency (vertical).]

[Figure 6. Panel title: Histogram of Posterior p-Values. Axes: Raw p-Values (horizontal), Frequency (vertical).]
