

Regulatory Toxicology and Pharmacology 43 (2005) 194–202
www.elsevier.com/locate/yrtph
doi:10.1016/j.yrtph.2005.07.001

Weighing the results of differing 'low dose' studies of the mouse prostate by Nagel, Cagen, and Ashby: Quantification of experimental power and statistical results

J.W. Owens *, J.G. Chaney

Central Product Safety, The Procter & Gamble Company, Cincinnati, OH 45061, USA

Received 26 May 2005; Available online 2 September 2005

* Corresponding author. Fax: +1 513 627 1208. E-mail address: [email protected] (J.W. Owens).

Abstract

Differing experimental findings with respect to "low dose" responses in the mouse prostate after in utero exposure have generated considerable controversy. An analysis of such controversies requires a broad strength and weight of the evidence approach. For example, a National Toxicology Program review panel acquired the raw data from nearly 50 studies and then statistically reanalyzed these data in a common and comparable approach. However, the statistical power of the various studies was not calculated and the quantitative p values were not reported in this reanalysis. Such calculations and values address vital strength- and weight-of-the-evidence questions: (1) how sensitive were the various studies to detect changes in prostate weight, particularly the negative replicate studies and (2) what were the p values; were negative studies robust or only marginal in their inability to find an effect? We first examined the statistical power of the studies to detect a positive effect on prostate weight. Preliminary calculations indicated that the two subsequent replicating studies were indeed more sensitive to changes in prostate weight in comparison to the original study, having reasonable power to detect an effect at only 50% of the response reported in the original study. Additional calculations were performed using the raw data available from one negative replicating study and the methods recommended by the statistics subpanel of the original review. This analysis used Dunnett's multiple comparison procedure for groups with p < 0.05 to infer statistical significance, employed an analysis-of-covariance model with body weight as a covariate, and addressed litter as a nested random effect. The quantitated p values for this replicated study, comparing the two Bisphenol A treatment groups (2 and 20 µg/kg/day) to the control, were 0.821 and 0.972, respectively. This indicates this study was indeed robust in finding no treatment-related effect. Thus, the weight and strength of the evidence, based on sensitivity and quantitative p value, was that it is highly unlikely for this negative replicating study to have missed a true effect. In the future, we recommend a similar use of statistical power analysis for those designing experimental studies and for those conducting weight-of-the-evidence reviews, and we also recommend the clear quantitation and reporting of p values to support the review's interpretation and conclusions.
© 2005 Elsevier Inc. All rights reserved.

Keywords: Ashby; Bisphenol A; Cagen; Endocrine disruption; Estrogen; Experimental power; Mouse prostate; Nagel

1. Introduction

Apparent effects at doses well below those traditionally accepted by toxicologists as no-adverse-effect-levels (NOAELs) have been the recent subject of controversy (see for example, Anon., 1997a,b). This "low dose" controversy potentially challenges the toxicological tenet that the dose makes the poison. In turn, the "low dose" conclusion challenges the regulatory approaches to chemical safety, which use uncertainty factors to extrapolate statistical NOAELs from toxicological studies in the hazard characterization step of risk assessment to protect large populations.


Central to this "low dose" controversy are conflicting studies administering Bisphenol A (BPA)1 to mice in utero and measuring changes in the weight of the adult prostate. The original positive study reported that increases in prostate weights were found at doses well below those of traditional toxicological thresholds for BPA, i.e., 2 and 20 µg BPA/kg body weight/day (Nagel et al., 1997). In contrast, the later replicating studies were negative, failing to observe any changes in prostate weights (Ashby et al., 1999; Cagen et al., 1999a). Further, multiple generation reproductive and developmental studies with BPA were performed and did not observe any adverse effects at the very low BPA doses in rats (Ema et al., 2001; Tyl et al., 2002).

The "low dose" controversy resulted in a major public scientific review of these and other studies under the auspices of the NTP (Melnick et al., 2002; NTP, 2001). The review panel and its subpanels considered a wide range of factors that might account for the different outcomes, including experimental designs, animal strain, laboratory diet, and statistical analyses. A novel aspect of this review was the request to 15 principal investigators to provide their raw experimental data in order to conduct an independent statistical reanalysis prior to the review meeting. Data from most of the 58 requested studies were received, the data were carefully audited, and the data were reanalyzed using a single consistent approach by the statistics subpanel (Haseman et al., 2001).

The comment in the panel report that caught our attention was made by the BPA subpanel: "collectively, these studies [referring to Ashby et al., 1999; Cagen et al., 1999a; and others] found no evidence for a low dose effect of BPA, despite the considerable strength and statistical power they represent, which the subpanel considered especially noteworthy (NTP, 2001)." However, we could find no actual quantification or comparison of the actual power of these studies or a description of any power calculations made to support this statement. Further, the panel did not report quantified p values for all of the studies. Such quantified p values can be of high value. For example, if p < 0.05 was considered significant, values of 0.002 and 0.812 would be strong evidence for an individual study detecting an effect and for the lack of effect, respectively. In contrast, p values of 0.046 or 0.062 would be marginal evidence for any effect or lack of effect, respectively. Therefore, we have used the available means and standard deviations to quantitate the experimental power of the studies of Nagel et al. (1997), Cagen et al. (1999a), and Ashby et al. (1999), and we have calculated p values when the raw data were available.

1 Abbreviations used: BPA, Bisphenol A; CV, coefficient of variation; NTP, National Toxicology Program.

Such examination and calculations are not trivial tasks. For example, actual group means and standard deviations were published by both Cagen et al. (1999a) and Ashby et al. (1999). Although these data were presented only in graphic form in the original publication of Nagel et al. (1997) (see their Fig. 2), the numerical values were available from the NTP review (2001) (see p. A-9). These three data sets provide an ability to calculate coefficients of variation and to perform initial power simulations for the three studies. However, detailed simulations and calculations require the data from individual animals. These individual data are available only from the original publication of Ashby et al. (1999) and not for the other two studies. In addition, cofactors must be statistically tested and taken into account. For example, prostate weight is potentially correlated with individual animal body weights and, when littermates are present, may be subject to a possible litter effect.

A statistical reanalysis of the raw data from these studies was performed during the NTP review. The findings were that:

(1) in the Nagel study, prostate weights were indeed significantly different, but at p < 0.05, not p < 0.01 as reported by the authors (the quantitative p value was not reported; only the < indication was used); prostate weight was correlated with body weight at p < 0.05 (the quantitative p value was again not reported), although the original study did not find a correlation with body weight (Nagel et al., 1997); and although the original authors reported a significant litter effect (Nagel et al., 1997), no reference to a significant litter effect or its p value was found in the subpanel report;

(2) in the Cagen study, there was no effect of treatment on the prostate weight (the quantitative p value was again not reported); a significant body weight correlation at p < 0.05 (the quantitative p value was again not reported) was found that was not originally reported; and a significant litter effect on prostate weight at p < 0.01 was found, consistent with the original report;

(3) in the Ashby study, there was no effect of treatment on the prostate weight; there was a significant correlation of the prostate weight with body weight (quantitative p values were not reported for the body weight correlation or the treatment effect on prostate weight); and there was a marginal litter effect at p = 0.046 (NTP, 2001). All findings were consistent with the original report.

This suggests an evaluation is needed that considers the statistical power of the conflicting experiments and the calculation and transparent reporting of the quantitative p values of these studies to judge the strength and the weight of the evidence on the "low dose" controversy.

2. Materials and methods

The experimental results being compared in the initial power simulations are from the studies of Nagel et al. (1997), Cagen et al. (1999a), and Ashby et al. (1999). These are summarized in Table 1. The actual data for individual litters and animals from the study of Ashby et al. (1999) were published and were taken from Table 6 of that paper.

We have followed the general approach of the statistics subpanel (Haseman et al., 2001), using Dunnett's multiple comparison test, p < 0.05 for significance, a log-transformation to address any data heterogeneity, and the use of body weight as a covariable (an ANCOVA approach). We have also noted and agree with the statistics subpanel's comments: (a) on the essential need to take a possible litter effect into account, in that a litter effect proportionally reduces the apparent sample size of individual animal numbers, and (b) where littermates exist or are used only within a single treatment, then litter is a nested factor (Haseman et al., 2001).

2.1. Preliminary calculations and study comparisons

The sample sizes, the mean body weights and standard deviations (SDs), the mean prostate weights and SDs, and the statistical methods of the studies were published or available from the NTP review. What was not available from Nagel et al. (1997) or Cagen et al. (1999a) was the individual animal weights. However, because the summary information was available from all three studies, this information could be used to set up conditions for simulations, and these simulations could be used to estimate the power of these studies. For the pooled control (naïve + vehicle control) group within each study, a random sample was drawn from a normal distribution with a mean of 100 and a standard deviation equal to 100 times the coefficient of variation for the pooled control group reported in the study. For both the low dose and high dose groups within each study, a random sample was drawn from a normal distribution with mean and SD as follows:

mean = 100 + increase,
SD = dose group CV × (100 + increase),

where increase represents the percentage increase of interest in prostate weight between dosed and control groups, and CV is the coefficient of variation (SD/mean).

The maximum magnitude of the prostate weight difference between the pooled control and the treated groups exceeded 30% in Nagel et al. (1997), so the magnitude of change ("increase") in our power calculations was set at intervals of 5%, i.e., 10, 15, 20, 25, and 30%. By using a mean of 100 for the pooled control group or (100 + increase)% for each dose group and using SD values based on this mean and the respective CV, the random samples were normalized across dose levels. Instead of simply using the low dose or high dose group prostate weight SD in the simulations, the SD was computed as shown above because we are interested in differing percentage increases from control and because the SD is not constant across doses in the individual study data, especially in Nagel et al. (1997).

Table 1
Summary of data for the three BPA mouse prostate studies examined

Study/group            N Animals   N Litters   Body wt. ± SDa (g)   Prostate wt. ± SDa (mg)   Prostate CVa (%)
Nagel et al. (1997)b
  Control              11          11          37.9 ± 2.8           40.8 ± 3.3                8.1
  2 µg/kg/day BPA      7           7           34.6 ± 3.0           52.8 ± 6.1                11.6
  20 µg/kg/day BPA     7           7           36.7 ± 3.0           54.9 ± 16.7               30.4
Ashby et al. (1999)
  Control              54          11          42.9 ± 4.4           49.1 ± 7.7                15.7
  2 µg/kg/day BPA      37          7           47.9 ± 6.1           53.2 ± 8.2                15.4
  20 µg/kg/day BPA     29          6           45.9 ± 5.1           50.3 ± 9.1                18.1
Cagen et al. (1999a)
  Control              NR          44          34.7 ± 2.3           39 ± 6.7                  17.2
  2 µg/kg/day BPA      NR          22          35.7 ± 2.3           39 ± 8.1                  20.8
  20 µg/kg/day BPA     NR          18          37.0 ± 2.6           41 ± 6.9                  16.8

BPA, Bisphenol A; NR, not reported; potentially up to 4 males per litter were necropsied and the male reproductive tissue weights measured (Cagen et al., 1999a).
a For the studies of Nagel et al. and Ashby et al., means, SDs, and CVs are calculated based on individual animals. For Cagen et al., means, SDs, and CVs are calculated based on individual litters.
b These values were obtained from the NTP panel review records (page A-9, NTP, 2001).


For example, 11 simulated data points representing the pooled naïve and vehicle control groups from Nagel et al. (1997) were drawn from a normal distribution with a mean of 100 and a standard deviation (SD) of 100 × 0.081 = 8.1. Seven data points representing the low dose group from Nagel et al. were drawn from a normal distribution with mean = 100 + 10 = 110 and SD = 11.6 × (100 + 10)% = 12.76, and seven points representing the high dose group from this study were drawn from a normal distribution with mean = 100 + 10 = 110 and SD = 30.4 × (100 + 10)% = 33.44.

This process of generating random data was repeated 10,000 times for each percentage increase of interest (in prostate weight between dosed and control groups) to obtain 10,000 simulated data sets for each of the percentage increase/study combinations. All studies had used Dunnett's multiple comparison test with p < 0.05 for significance (Ashby et al., 1999; Cagen et al., 1999a; Nagel et al., 1997), so a one-sided Dunnett's test with a 0.05 level of significance was performed on each simulated data set. A total of 10,000 tests were performed under each set of conditions, and a power estimate was derived for each set of conditions by simply dividing the number of tests showing a significant treatment effect by 10,000. All simulations were performed using S-PLUS 6.1 (Insightful, Seattle, WA).
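For readers who wish to reproduce these preliminary simulations, the following is a minimal sketch in Python (the original work used S-PLUS 6.1, whose code was not published). The group sizes and CVs come from Table 1 and are expressed as fractions; scipy.stats.dunnett (available in SciPy 1.11 and later) stands in for the Dunnett procedure. Counting a simulated data set as significant when either dose comparison is significant is our assumption, since the text does not specify which comparison(s) were counted.

```python
# Illustrative sketch only: a Python/NumPy re-expression of the preliminary
# power simulation described above (the authors used S-PLUS 6.1).
import numpy as np
from scipy.stats import dunnett  # requires SciPy >= 1.11

rng = np.random.default_rng(0)

def preliminary_power(n_ctrl, n_low, n_high, cv_ctrl, cv_low, cv_high,
                      increase, n_sim=10_000, alpha=0.05):
    """Estimate power to detect a given % increase in prostate weight.

    CVs are fractions (e.g., 0.081 for 8.1%); 'increase' is in percent.
    """
    hits = 0
    for _ in range(n_sim):
        ctrl = rng.normal(100, 100 * cv_ctrl, n_ctrl)
        low = rng.normal(100 + increase, cv_low * (100 + increase), n_low)
        high = rng.normal(100 + increase, cv_high * (100 + increase), n_high)
        # One-sided Dunnett comparison of each dose group against control;
        # a data set counts as a "hit" if either comparison is significant
        # (our assumption; the paper does not state which were counted).
        res = dunnett(low, high, control=ctrl, alternative='greater')
        hits += int((res.pvalue < alpha).any())
    return hits / n_sim

# Example: the Nagel et al. (1997) design with a 15% prostate weight increase
# (group sizes and CVs taken from Table 1).
print(preliminary_power(11, 7, 7, 0.081, 0.116, 0.304, increase=15))
```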

2.2. Power calculations of the Ashby study

Individual animal weights, including body weights and prostate weights, were available from Ashby et al. This particular study, unlike Nagel et al. and Cagen et al., also included more than one animal per litter. Because the individual animal information was available and the Ashby study allows for an examination of litter effect, power estimates could be derived via simulation in a more thorough manner. We were once again interested in various magnitudes of change between control and high dose groups (i.e., 10, 15, 20, 25, and 30%), so that prostate weights were randomly drawn from a normal distribution with the following mean and standard deviation:

mean = (100 + increase) × litter mean / treatment mean,
SD = (100 + increase) × within-litter SD / litter mean,

where increase = 0 for control groups and increase represents the percentage increase of interest in prostate weight for dosed groups.

These means and standard deviations were calculated separately for each litter within each experimental group (0, 2, and 20 µg/kg/day only). By using a factor of 100 for the pooled control group or (100 + increase) for each dose group and using mean and SD values based on this factor, the random samples were normalized across dose levels. By using means and standard deviations based on the actual litter means and standard deviations, litter-to-litter variability (in addition to the pre-defined difference of interest between dosed and control groups) was taken into account in the simulated data.

In addition to simulating and rescaling prostate weight data, body weights were rescaled because body weight is useful as a covariate in models describing the effect of treatment on prostate weight. The actual body weights from the Ashby study were rescaled using the following formula:

rescaled body weight = (100 + increase) × body weight / mean body weight,

where mean body weight represents the mean body weight of animals in a particular dose group, increase = 0 for control groups, and increase represents the percentage increase of interest in prostate weight for dosed groups.
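As one concrete, purely illustrative rendering of this simulation step, the sketch below builds a single simulated data set from a data frame of the Ashby et al. (1999) raw data using the formulas above. The column names ('group', 'litter', 'prostate_wt', 'body_wt') and the function name are our own assumptions; they are not taken from the original S-PLUS analysis code.

```python
# Illustrative sketch only: generate one simulated data set following the
# rescaling formulas above. 'ashby' is assumed to be a pandas DataFrame of
# the Ashby et al. (1999) raw data with columns 'group' (dose in ug/kg/day),
# 'litter', 'prostate_wt', and 'body_wt' (assumed names, not from the paper).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def simulate_dataset(ashby: pd.DataFrame, increase: float) -> pd.DataFrame:
    """increase: percentage increase applied to dosed groups (0 for controls)."""
    pieces = []
    for (group, litter), d in ashby.groupby(['group', 'litter']):
        inc = 0.0 if group == 0 else increase
        treat_mean = ashby.loc[ashby['group'] == group, 'prostate_wt'].mean()
        litter_mean = d['prostate_wt'].mean()
        within_sd = d['prostate_wt'].std(ddof=1) if len(d) > 1 else 0.0
        mean = (100 + inc) * litter_mean / treat_mean
        sd = (100 + inc) * within_sd / litter_mean
        sim = d.copy()
        sim['sim_prostate'] = rng.normal(mean, sd, len(d))
        # Rescale the actual body weights so they remain usable as a covariate
        grp_body_mean = ashby.loc[ashby['group'] == group, 'body_wt'].mean()
        sim['rescaled_body'] = (100 + inc) * d['body_wt'] / grp_body_mean
        pieces.append(sim)
    return pd.concat(pieces, ignore_index=True)
```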

An ANCOVA model was used to describe the effect of treatment and litter on prostate weight (taking body weight into account) and is shown as follows. Prostate and body weights were log-transformed to reduce heterogeneity of variance:

Y_ijk = µ + a_i + b_j(i) + c·X_ijk + e_ijk,

where Y_ijk is the log2-transformed, simulated prostate weight for animal k from litter j, treatment i; µ is the overall mean; a_i is the effect of treatment i; b_j(i) is the effect of litter j within treatment i; X_ijk is the log2-transformed, rescaled body weight of animal k (from litter j, treatment i); c is a regression coefficient for the relationship between Y and X; and e_ijk is the random error for animal k from litter j, treatment i.

A total of 5000 simulated data sets were generated. Each simulated data set contained the same number of animals from the same number of litters within each dose group as the Ashby study. The ANCOVA model was fit to each simulated data set and Dunnett's multiple comparison procedure was used to compare the high dose (20 µg/kg/day) group to the pooled control group. The one-sided 95% confidence bound for Dunnett's test was calculated as follows:

D = 1.93 × (MSE × (1/54 + 1/29))^(1/2),

where MSE is the mean squared error from the fitted ANCOVA model, 54 is the pooled control sample size, 29 is the high dose group sample size, and 1.93 is the critical value for Dunnett's test with a total of three comparison groups and a total sample size of 54 + 37 + 29 = 120.

A Dunnett's test was declared significant if the high dose group mean prostate weight exceeded the control group mean prostate weight by D. A power estimate was derived for each percentage increase (between control and high dose) by simply dividing the number of tests showing a significant treatment effect by 5000. All simulations were performed using S-PLUS 6.1 (Insightful, Seattle, WA).
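The sketch below, again an illustrative Python rendering rather than the authors' S-PLUS code, shows how the ANCOVA fit and the Dunnett bound D could be applied to one simulated data set produced by the sketch above. We use statsmodels' MixedLM with a random intercept per litter as one reasonable way of treating litter as a nested random effect (assuming litter identifiers are unique across treatment groups), and we compare unadjusted group means of the log-transformed weights against D, which is a simplification of the procedure described above.

```python
# Illustrative sketch only: fit an ANCOVA-type mixed model to one simulated
# data set ('sim', from simulate_dataset above) and apply the one-sided
# Dunnett bound D = 1.93 * sqrt(MSE * (1/54 + 1/29)) described in the text.
import numpy as np
import statsmodels.formula.api as smf

def dunnett_significant(sim, n_ctrl=54, n_high=29, crit=1.93):
    sim = sim.copy()
    sim['log_prostate'] = np.log2(sim['sim_prostate'])
    sim['log_body'] = np.log2(sim['rescaled_body'])
    # Fixed effects: treatment group and log2 body weight;
    # random intercept for litter (litter IDs assumed unique across groups).
    fit = smf.mixedlm('log_prostate ~ C(group) + log_body',
                      data=sim, groups=sim['litter']).fit()
    mse = fit.scale  # residual (within-litter) variance of the fitted model
    d = crit * np.sqrt(mse * (1.0 / n_ctrl + 1.0 / n_high))
    high_mean = sim.loc[sim['group'] == 20, 'log_prostate'].mean()
    ctrl_mean = sim.loc[sim['group'] == 0, 'log_prostate'].mean()
    # Significant if the high dose mean exceeds the control mean by more than D
    return (high_mean - ctrl_mean) > d

# Power is then the fraction of simulated data sets (5000 in the paper)
# for which dunnett_significant() returns True.
```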

2.3. Combined studies

In addition to examining the power of the three individual studies, a simple meta-analysis was performed to determine the overall magnitude of the effect found in each study and whether this effect is statistically significant. The mean differences between the high dose and control groups found in each study were weighted by the reciprocals of their variances to calculate an overall weighted mean difference. The standard error of this mean was calculated to obtain an approximate 95% confidence interval on the overall effect.
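A minimal sketch of this fixed-effect, inverse-variance combination is given below. The study-specific mean differences and their variances would need to be supplied from the three studies (they are not tabulated in that form here), so the function arguments are placeholders.

```python
# Illustrative sketch only: fixed-effect (inverse-variance) combination of
# the high dose vs. control mean differences from the three studies.
import numpy as np

def combined_effect(mean_diffs, variances):
    """Return the weighted mean difference and an approximate 95% CI."""
    d = np.asarray(mean_diffs, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v                      # weights = reciprocals of the variances
    pooled = np.sum(w * d) / np.sum(w)
    se = np.sqrt(1.0 / np.sum(w))    # standard error of the weighted mean
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)
```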

3. Results

3.1. Preliminary power calculations

The results of the first set of power calculations are presented in Fig. 1. This initial analysis showed that indeed the Cagen et al. (1999a) and Ashby et al. (1999) studies should have had substantial power to detect changes in the mouse prostate even at magnitudes well below those observed in the Nagel study. In fact, based on the preliminary calculations, these studies were almost certain to detect a 25% change in prostate weight, were highly likely to detect a 20% change, and still had good power to detect a 15% change in prostate weight, which is half that reported in the original study.

Fig. 1. Power calculations for BPA mouse prostate studies. The means, CVs, and group sizes from the three studies of Nagel et al. (1997), Cagen et al. (1999a), and Ashby et al. (1999) were used to calculate for each study the respective power or capability to detect a positive effect at different magnitudes or percentage change in the mouse prostate weight. The calculations were based on Dunnett's one-sided multiple comparison test with p < 0.05 for significance as described in the Methods.

3.2. Detailed power calculations with the Ashby data

Ashby et al. published their full raw data set for body and prostate weights by group and litter (see Table 6, Ashby et al., 1999), allowing a second round of power calculations to examine the actual power of this study. Body weight was observed to be a significant covariate and a litter effect was found in the statistical analyses conducted by Ashby et al. (1999). Further, as Dunnett's test implies an assumed homogeneity of variance, a logarithmic transformation was included to eliminate or to reduce heterogeneity, as would be typical practice for most statisticians. Therefore, in the actual power calculations for Ashby's data, body weight was considered as a covariate and logarithmic transformations of both prostate weight and body weight data were included. The results of this final set of calculations are presented in Table 2. According to the results, the power level for the actual study is very similar to that seen when using model simulations that did not account for litter-to-litter variability and body weight. However, the power of the actual study is somewhat greater than the simulation when looking at a small prostate weight increase (10%), and the actual power indicates this study is still likely to detect weight changes at magnitudes well below those observed in the Nagel study.

We also calculated Dunnett's p values for the BPA treated groups versus the control in the Ashby study. Therefore, a set of progressively more detailed calculations was made. With treatment as the only factor, no nesting of the litter effect, no body weight covariate, and no data transformations, the prostate weight p value versus the control was 0.020 for the 2 µg/kg BPA group and 0.413 for the 20 µg/kg BPA group. With treatment and litter nested within treatment, no body weight covariate, and no data transformation, the prostate weight p value was 0.053 for the 2 µg/kg BPA group and 0.563 for the 20 µg/kg BPA group. There were clear absolute differences in the group mean body weights: 42.9 g for the controls, 47.9 g for the 2 µg/kg BPA group, and 45.9 g for the 20 µg/kg BPA group, resulting in a strong treatment effect on body weight (p = 0.00004) and a correlation between body weight and prostate weight (r = 0.43, p < 0.0001). Therefore, with treatment and litter nested within treatment and now including body weight as a covariate and log-transformation of the data, there were major shifts in the p values. The prostate weight p value under the latter conditions was 0.821 for the 2 µg/kg BPA group and 0.972 for the 20 µg/kg BPA group, meaning both BPA treated groups are far from being statistically significant. These results are summarized in Table 3 and illustrate the impact of the particular statistical method and approach on the interpretation of results.

Table 2
Power calculations incorporating litter and body weight covariates and log-transformation of prostate and body weights into the Ashby data analysisa

% weight increase    Approximate power (%) for detecting dose effect (Ashby et al., 1999)
10                   71.0
15                   91.6
20                   98.0
25                   99.1
30                   99.5

a Using the actual raw data from the Ashby study (see Table 6, Ashby et al., 1999); Dunnett's one-sided multiple comparison test with p < 0.05 for significance; an ANCOVA model with log-transformed prostate weight as the response, group as a fixed factor, litter as a random factor nested within group, and log-transformed body weight as a covariate; and 5000 simulations as described under Materials and methods.

Table 3
Comparison of dose groups in the Ashby study under various modelsa

Model                                                      2 µg/kg vs. control p value   20 µg/kg vs. control p value
Y_ijk = µ + a_i + e_ijk                                    0.020                         0.413
Y_ijk = µ + a_i + b_j(i) + e_ijk                           0.053                         0.563
log2(Y_ijk) = µ + a_i + b_j(i) + c·log2(X_ijk) + e_ijk     0.821                         0.972

Y_ijk is the prostate weight for animal k from litter j, treatment i; µ is the overall mean; a_i is the effect of treatment i; b_j(i) is the effect of litter j within treatment i; X_ijk is the body weight of animal k (from litter j, treatment i); c is a regression coefficient for the relationship between Y and X; and e_ijk is the random error for animal k from litter j, treatment i.
a Using the actual raw data from the Ashby study (see Table 6, Ashby et al., 1999); Dunnett's one-sided multiple comparison test.

3.3. Combined studies

The mean prostate weight gains for the 20 µg/kg BPA group relative to the control group in the studies of Nagel, Ashby, and Cagen were 35, 2, and 5%, respectively. The weighted mean gain is 6.5% with a standard error of approximately 4.2%. Based on this mean gain and standard error, the 95% confidence interval for the combined studies is (−2%, 15%).

4. Discussion

The NTP panel attempted to review nearly 60 experimental studies, requesting their raw data to undertake a consistent statistical reanalysis. An effort on this scale is literally unprecedented. However, this reflects the importance of the "low dose" issue to toxicological and regulatory tenets. Nearly 50 of the requested data sets were submitted. The enormous statistical effort was intriguing in itself as:

• audits discovered some data errors;
• indicated inappropriate statistical approaches and assumptions in some cases;

• in one case revealed a clear data outlier that resulted in spurious statistical significance;

• indicated surprising findings in several studies, such as an unexpected negative correlation between uterine weight and body weight;

• also discovered typographical errors in values in the published literature (NTP, 2001; Haseman et al., 2001).

Despite such efforts, no overall weight-of-the-evidence consensus was reached by the panel. Also, it is not always transparent on which points the panel may have agreed or disagreed, to what degree there was agreement or disagreement, or what their rationales were, as the panel did not communicate their deliberations in the style of a detailed weight and strength analysis of the evidence as did the IPCS (2002).

Our objective here is to address important strength- and weight-of-the-evidence questions that still remain unanswered by the original review. Our first question asks: how sensitive were the major studies in detecting a change in prostate weight? We addressed this question with power calculations, focusing in particular on the power of the negative replicates to detect changes in comparison to the original study. Our second question asks: what is the strength of the evidence or findings? We address this question by going beyond a qualitative statement of "significant" or "not significant." We calculate and report the quantitative p values to see if the absence of an effect in the negative studies was robust or marginal.

The power or sensitivity of a study is the statistical probability that a positive effect of a given change in an endpoint can be detected by the study. Power is determined by multiple factors, such as the number of animals in a group, the variability of an endpoint, the magnitude of the effect or change in the endpoint, the precise statistical methods and assumptions, and sometimes the influence of subtle covariables, such as interactions with body weight and litter (Haseman et al., 2001). For a sound contribution to a weight-of-the-evidence process, each of these elements would need to be included or considered and transparently reported with the power calculations.

Quantification of study power is important to the proper design of any replication studies. In retrospect, we are surprised that neither replicating study reported performing initial power calculations with the intended purpose of achieving robust levels of sensitivity compared to the original study. Not only would this have been sound experimental practice, but reporting these power estimates in conjunction with their results would have been to their advantage. This presentation would have added weight to the original papers and, possibly, would have carried weight with the scientific review panel.

Quantification of study power and p values is also important when weighing the strength of the evidence from differing studies. Power and quantitative p values allow very important questions to be posed with regard to the strength and weight of evidence for differing studies: (1) do the studies attempting to replicate an initial observation have adequate power to also find a true positive effect, (2) do the replicating studies have sufficient power as to make it very unlikely to have missed a true effect so that the chance of a false negative is very low, and (3) in such cases, would the power of the replicating studies be sufficiently large and the effect seen in the initial study modest so as to suggest that the first result was a false positive?

The quantification of the power of the disputed studies is revealing. The statement made by the BPA subpanel in the NTP review is correct; the studies of Cagen and Ashby do appear to have greater overall power compared to the Nagel study, as seen in Table 2 and Fig. 1. With an estimated power level of just over 80%, the Cagen study would fail to detect an increase of 15% in the prostate weight in only about one of five studies. The Ashby study is more robust, with power greater than 90% with a 15% prostate weight increase. Further, as the weight increase approaches the near 30% increase reported in the Nagel study, the power of the Cagen study and, particularly, the Ashby study increases rapidly to well over 99%. This preliminary look suggests limited opportunity for these studies to have missed a positive effect; that is, to have reported a false negative. This led us to a more detailed examination of the experiments.

In a set of more detailed power calculations with the Ashby study, we included litter as a nested effect and body weight as a covariable in the power calculations and included the log-transformed prostate and body weights. These are in agreement with the procedures and recommendations of the original statistical subpanel (Haseman et al., 2001). We have used litter as a nested effect, so that effective sample size is reduced in proportion to the litter effect. This allows the data themselves to speak and not be directed one way or another by the analysts' choice of assumptions. We note that this also avoids the possibility of very unequal treatment of marginally different findings, where in one data set with a litter effect at p = 0.07, the data might be analyzed as individuals, ignoring the litter effect entirely, while in another data set with a litter effect at p = 0.04, the data might be analyzed with only the litter as the unit, ignoring the number of individual animals.

Even with this more complex model, the actual power of the particular Ashby study remains just above 91% with a 15% increase in prostate weight, or half of the magnitude increase of the Nagel study; 98% for detecting a 20% prostate weight increase, or two thirds of the Nagel study; and over 99% for detecting 25 and 30% prostate weight increases. Therefore, the Ashby study has a very limited chance to have missed any effect and to have reported a false negative result.

Actual p values are useful to understand the strength of any conclusions regarding the occurrence or absence of statistical significance. For example, assume that no treatment effects exist, and that we have three equivalent studies with the Nagel conditions, i.e., same sample sizes, same standard deviations, and so on. An estimate of the likelihood of seeing an increase of 30% (as seen by Nagel et al.) in one or more of the three studies is approximately [1 − (1 − p value)^3], where the p value comes from the Nagel study. Although Nagel et al. (1997) reported p < 0.01, the statistical reanalysis during the NTP scientific review reported p < 0.05 (NTP, 2001). That difference yields likelihoods of false positives in at least one study as low as 0.03 (if p = 0.01) or as high as 0.14 (if p = 0.05). In other words, false positives are not rare events in actual experiments, as might be expected. Assuming no real treatment effect, at least one study in three would be expected to exhibit a 30% increase in prostate weight 3–14 times out of every 100 times that three independent studies were conducted. There is a clear need, then, to replicate important findings and to thoroughly analyze the results.
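As a quick check of the quoted figures, the arithmetic is simply:

```python
# Probability that at least one of three independent studies shows a false
# positive, for the two candidate p values of the Nagel result.
for p in (0.01, 0.05):
    print(p, round(1 - (1 - p) ** 3, 3))   # 0.01 -> 0.03, 0.05 -> 0.143
```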

The final calculations following the original subpanel's approach yielded for the Ashby study a p value of 0.821 comparing the 2 µg/kg BPA group to the control and 0.972 comparing the 20 µg/kg BPA group to the control. Together with the superior power of that study, there is then strong evidence that the Nagel study was indeed a false positive.

From a meta-analysis perspective, the results of these three studies can be combined to estimate the overall effect of BPA on prostate weight. However, it should be recognized that any potential differences among studies will not be taken into account in a meta-analysis. In the studies of Nagel, Ashby, and Cagen, the mean prostate weight increase was 35, 2, and 5%, respectively. The resulting 95% confidence interval on the overall effect of BPA indicates that the population mean change in prostate weight for these three studies is somewhere between a 2% weight loss and a 15% weight gain. Since this interval includes zero, we cannot conclude that BPA has a significant effect on prostate weight. On the other hand, we can conclude with 95% confidence that the population weight gain is less than 15% and that the Nagel study's reported increase lies beyond this confidence limit.

In the rat, there are several studies available to compare mean prostate weight changes with those in the three mouse studies. Cagen et al. (1999b) have also performed in utero exposures to similar low doses of BPA and recorded the weights of adult male sex accessory tissues, and Ema et al. (2001) and Tyl et al. (2002) have conducted multi-generation studies that included similar low doses of BPA. Again, the number of litters and individual animals in these studies indicates a power greater than the Nagel study. The changes in mean prostate weights in the adult males of these studies at ~20 µg/kg/day BPA doses were 2.4, 5.9, and −1.4% in F1 adults and −9.2% in F2 adults, respectively. These data would clearly diminish the upper confidence limit and even the lower confidence limit for any change in prostate weight, if they were pooled with the mouse data.

Despite extensive debate about these studies, the actual protocol and laboratory differences among the studies are at best small. For example, one speculation involves differences in technical proficiency among the laboratories. However, there are no substantial differences in prostate weight coefficient of variation across dose levels among the laboratories to suggest any differences in technical dissection skill or proficiency (see Table 1). Other differences unlikely to impact a fundamental biological response like estrogenicity include speculations about differences between an isolated CF-1 mouse colony in the Nagel study and a commercial animal source for CF-1 mice in the Cagen and the Ashby studies.

There were different laboratory diets, and there has been some speculation about the possible role of phytoestrogen contents. However, various analyses (Degen et al., 2002; Owens et al., 2003; Thigpen et al., 1999a,b, 2002) indicate that (1) the phytoestrogen contents of the Nagel (PMI 5001 and 5008) and the Cagen (PMI 5002) laboratory diets should have been similar and (2) the phytoestrogen content of the Ashby study diet (RM1, Special Diet Services) should have been about one third to one half of the PMI diets, making the Ashby study the lowest of the three. More importantly, the phytoestrogen levels lead to a phytoestrogen consumption by the mice that would exceed by several orders of magnitude the estrogenic potency of the BPA doses administered in these studies (Owens et al., 2003). That is, the BPA contribution to estrogen intake should have been trivial in all cases, including the original study. In this regard, since the original review, there has been further evidence that approximately 90–95% of the circulating BPA after oral administration in rodents is glucuronidated in a first pass effect (Domoradzki et al., 2003, 2004; Zalko et al., 2003), and the glucuronidated form of BPA appears to be inactive as an estrogen (Matthews et al., 2001; Snyder et al., 2000). Collectively, considerations such as these begin to place the plausibility of differences among the labs and the biological basis for the "low dose" effect in further doubt.

We offer several conclusions based on these statistical calculations. First, in regard to the "low dose" issue, if the effect reported in the Nagel study were indeed a robust, true effect, then it is very unlikely that the Cagen study would have failed to observe the same effect (a false negative result). It is even more unlikely based on the current analyses that the Ashby study would have failed to observe the same effect (a false negative result). In the Ashby study, the p values comparing the two BPA groups to the control were 0.821 and 0.972, indicating no support for the view that this negative result was marginal. Thus, these power calculations make it highly unlikely for the negative replicating studies to have missed a true effect, casting doubt on the "low dose" findings as a true positive effect. Therefore, we see no merit in the original study being used to suggest changes in current regulatory approaches.

Second, in the future, before undertaking the replication of studies, those charged with the design of such a study should perform preliminary power calculations such as those presented here. They can then report the power that their study presents in detecting any effects reported in an original study, including at successively lower levels of change. We also recommend the consideration of litter, body weight, and other effects before the study design is finalized, as well as their integration with the chosen statistical methods.

Third, in the future, statisticians supporting major scientific reviews of experimental results, such as the NTP review, should quantify and document the power of disputed studies for the scientific panels. Several statistical approaches may be used, but the statistical results obtained from the ultimate approach should be fully reported (including quantitative p values regardless of significance). This makes transparent the strength or marginality of the results, negative or positive. The panels should then clearly incorporate these power results, statistical approaches, and quantitative statistical values in their deliberations and employ the results as one aspect in an overall weight of the evidence.

Acknowledgments

We acknowledge and thank Drs. Joe Haseman and Greg Carr for thoughtful discussions and comments that improved the design and performance of these statistical analyses, and thank Drs. George Daston and Scott Belanger for suggesting improvements to the manuscript.


References

Anon., 1997a. Industry and scientists in cross-fire on endocrine-disrupting chemicals. ENDS Rep. 268, 26–29.

Anon., 1997b. Controversy over bisphenol-A research. Endocrine/Estrogen Newslett. 3 (12), 1–4.

Ashby, J., Tinwell, H., Haseman, J., 1999. Lack of effects for low dose levels of Bisphenol A and diethylstilbestrol on the prostate gland of CF1 mice exposed in utero. Regul. Toxicol. Pharmacol. 30, 156–166.

Cagen, S.Z., Waechter Jr., J.M., Dimond, S.S., Breslin, W.J., Butala, J.H., Jekat, F.W., Joiner, R.L., Shiotsuka, R.N., Veenstra, G.E., Harris, L.R., 1999a. Normal reproductive organ development in CF-1 mice following prenatal exposure to Bisphenol A. Toxicol. Sci. 50, 36–44.

Cagen, S.Z., Waechter Jr., J.M., Dimond, S.S., Breslin, W.J., Butala, J.H., Jekat, F.W., Joiner, R.L., Shiotsuka, R.N., Veenstra, G.E., Harris, L.R., 1999b. Normal reproductive organ development in Wistar rats exposed to Bisphenol A in the drinking water. Regul. Toxicol. Pharmacol. 30, 130–139.

Degen, G.H., Janning, P., Diel, P., Bolt, H.M., 2002. Estrogenic isoflavones in rodent diets. Toxicol. Lett. 128, 145–157.

Domoradzki, J.Y., Pottenger, L.H., Thornton, C.M., Hansen, S.C., Card, T.L., Markham, D.A., Dryzga, M.D., Shiotsuka, R.N., Waechter Jr., J.M., 2003. Metabolism and pharmacokinetics of Bisphenol A (BPA) and the embryo–fetal distribution of BPA and BPA–monoglucuronide in CD Sprague–Dawley rats at three gestational stages. Toxicol. Sci. 76, 21–34.

Domoradzki, J.Y., Thornton, C.M., Pottenger, L.H., Hansen, S.C., Card, T.L., Markham, D.A., Dryzga, M.D., Shiotsuka, R.N., Waechter Jr., J.M., 2004. Age and dose dependency of the pharmacokinetics and metabolism of Bisphenol A in neonatal Sprague–Dawley rats following oral administration. Toxicol. Sci. 77, 230–242.

Ema, M., Fuji, S., Furukawa, M., Kiguchi, M., Ikka, T., Harazono, A., 2001. Rat two-generation reproductive toxicity study of Bisphenol A. Reprod. Toxicol. 15, 505–523.

Haseman, J.K., Bailer, A.J., Kodell, R.L., Morris, R., Portier, K., 2001. Statistical issues in the analysis of low-dose endocrine disruptor data. Toxicol. Sci. 61, 201–210.

International Program on Chemical Safety (IPCS), 2002. Global assessment of the state-of-the-science of endocrine disruptors. In: Damstra, T., Barlow, S., Bergman, A., Kavlock, R., Van Der Kraak, G. (Eds.). World Health Organization, Geneva, Switzerland.

Matthews, J.B., Twomey, K., Zacharewski, T.R., 2001. In vitro and in vivo interactions of bisphenol A and its metabolite, bisphenol A glucuronide, with estrogen receptors α and β. Chem. Res. Toxicol. 14, 149–157.

Melnick, R., Lucier, G., Wolfe, M., Hall, R., Stancel, G., Prins, G.S., Gallo, M., Reuhl, K., Ho, S.-M., Brown, T., Moore, J., Leakey, J., Haseman, J., 2002. Summary of the National Toxicology Program's report of the endocrine disruptors low-dose peer review. Environ. Health Perspect. 110, 427–431.

Nagel, S.C., vom Saal, F.S., Thayer, K.A., Dhar, M.G., Boechler, M., Welshons, W.V., 1997. Relative binding affinity-serum modified access (RBA-SMA) assay predicts the relative in vivo bioactivity of the xenoestrogens bisphenol A and octylphenol. Environ. Health Perspect. 105, 70–76.

NTP, 2001. Endocrine disruptors low dose peer review. NTP, Raleigh, NC. http://ntp-server.niehs.nih.gov/htdocs/liason/LowDoseWebPage.html. Hard copy requests to NTP Liaison and Scientific Review Office (NIEHS, P.O. Box 12233, Research Triangle Park, NC 27709; Tel. 919 541 0530; fax: 919 541 0295; [email protected]).

Owens, W., Ashby, J., Odum, J., Onyon, L., 2003. The OECD program to validate the rat uterotrophic bioassay: phase two—dietary phytoestrogen analyses. Environ. Health Perspect. 111, 1559–1567.

Snyder, R.W., Maness, S.C., Gaido, K.W., Welsch, F., Sumner, S.C.J., Fennell, T.R., 2000. Metabolism and disposition of bisphenol A in female rats. Toxicol. Appl. Pharmacol. 168, 225–234.

Thigpen, J.E., Setchell, K.D.R., Ahlmark, K.B., Locklear, J., Spahr, T., Caviness, G.F., Goelz, M.F., Haseman, J.K., Newbold, R.R., Forsythe, D.B., 1999a. Phytoestrogen content of purified, open and closed-formula laboratory animal diets. Lab. Anim. Sci. 49, 530–536.

Thigpen, J.E., Setchell, K.D.R., Goelz, M.F., Forsythe, D.B., 1999b. The phytoestrogen content of rodent diets. Lett. Environ. Health Perspect. 107, A182–A183.

Thigpen, J.E., Locklear, J., Haseman, J., Saunders, H.E., Caviness, G., Grant, M.F., Forsythe, D.B., 2002. Dietary factors affecting uterine weights of immature CD-1 mice used in uterotrophic bioassays. Cancer Detect. Prev. 26, 381–393.

Tyl, R.W., Myers, C.B., Marr, M.C., Brine, D.R., Veselica, M.M., Fail, P.A., Chang, T.Y., Seely, J.C., Joiner, R.L., Butala, J.H., Dimond, S.S., Cagen, S.Z., Shiotsuka, R.N., Stropp, G.D., Waechter, J.M., 2002. Three-generation reproductive toxicity evaluation of Bisphenol A (BPA) in CD® (Sprague–Dawley) rats. Toxicol. Sci. 68, 121–146.

Zalko, D., Soto, A.M., Dolo, L., Dorio, C., Rathahao, E., Debrauwer, L., Faure, R., Cravedi, J.P., 2003. Biotransformation of bisphenol A in a mammalian rodent model: answers and new questions raised by low-dose metabolic fate studies in pregnant CD1 mice. Environ. Health Perspect. 111, 309–319.