Benchmarking Depression Treatment 1

Running Head: BENCHMARKING DEPRESSION TREATMENT IN MANAGED CARE

Benchmarking the Effectiveness of Psychotherapy Treatment for Adult Depression in a Managed

Care Environment

Takuya Minami, Department of Educational Psychology, University of Utah

Bruce E. Wampold, Department of Counseling Psychology, University of Wisconsin – Madison

Ronald C. Serlin, Department of Educational Psychology, University of Wisconsin – Madison

Eric G. Hamilton, PacifiCare Behavioral Health, Pittsburgh, Pennsylvania

George S. Brown, Center for Clinical Informatics, Salt Lake City, Utah

John C. Kircher, Department of Educational Psychology, University of Utah

We would like to express our greatest appreciation to PacifiCare Behavioral Health, Inc., for

their permission to utilize their data for this study.

Correspondence should be addressed to Takuya Minami, Department of Educational Psychology,

University of Utah, Salt Lake City, Utah 84112, U.S.A. Email: [email protected].


Abstract

This study investigated the effectiveness of psychotherapy treatment for adult clinical

depression provided in a general clinical setting, notably a managed care environment, using

benchmarks established from efficacy data of published clinical trials. Overall results suggest

clinical equivalence between the effectiveness of psychotherapy provided in a managed care

environment as compared to efficacy observed in clinical trials, although providers in individual

practice exhibit slightly poorer outcomes than either the clinical trial benchmark or providers in

group practices.


Benchmarking the Effectiveness of Psychotherapy Treatment for Adult Depression in a Managed

Care Environment

More than a decade has passed since estimating the effectiveness of psychotherapy as it is

delivered in natural settings (i.e., treatment-as-usual, TAU) was proclaimed as one of the most

critical issues in the field of psychotherapy (e.g., Seligman, 1995; Weisz, Donenberg, Han, &

Weiss, 1995). However, most of the studies in the past decade that investigated outcomes in

clinical settings have involved evaluations of empirically supported treatments (ESTs) and other

manualized treatments that have been implemented in clinical settings rather than evaluating the

effectiveness of TAUs. Specifically, ESTs and other manualized treatments have been

implemented in clinical settings for treating many psychological disorders, including

agoraphobia (Hahlweg, Fiegenbaum, Frank, Schroeder, & von Witzleben, 2001), obsessive-

compulsive disorder (Franklin, Abramowitz, Kozak, Levitt, & Foa, 2000; Warren & Thomas,

2001), panic disorder (Addis, Hatgis, Krasnow, Jacob, Bourne, & Mansfield, 2004; García-

Palacios et al., 2002; Wade, Treat, & Stuart, 1998), posttraumatic stress disorder (Gillespie,

Duffy, Hackmann, & Clark, 2002), social phobia (Lincoln et al., 2003), depression (Merrill,

Tolbert, & Wade, 2003; Persons, Bostrom, & Bertagnolli, 1999), substance abuse (Morgenstern,

Blanchard, Morgan, Labouvie, & Hayaki, 2001), criminal offense (Henggeler, Melton, Brondino,

Scherer, & Hanley, 1997), bulimia nervosa (Tuschen-Caffier, Pook, & Frank, 2001), and

psychosis (Morrison et al., 2004). In other words, the past decade experienced a drastic increase

in the dissemination of manualized treatments that have been shown to be effective in clinical

trials to naturalistic settings. This process assumes that TAUs are not as effective as ESTs in

clinical settings and that outcomes can be improved by delivering ESTs (e.g., Hollon, Thase, &

Markowitz, 2002). However, as researchers have continued to conclude that ESTs should be


disseminated to clinical settings, they have ignored the important question about whether or not

TAUs already achieve outcomes comparable to ESTs (e.g., Addis, 2002; Chorpita et al., 2002;

Herschell, McNeil, & McNeil, 2004; Manderscheid & Henderson, 2004; Stirman, Crits-

Christoph, & DeRubeis, 2004).

At first glance, it appears as though little empirical support exists for the effectiveness of

interventions provided in the community, especially in the area of child and adolescent

psychotherapy (Weiss, Catron, & Harris, 2000; Weiss, Catron, Harris, & Phung, 1999; Weisz &

Weiss, 1989; Weisz, Weiss, & Donenberg, 1992). For example, a benchmarking study

conducted by Weersing and Weisz (2002) described the symptom trajectory of depressed youth

who were provided TAUs in community mental health centers (CMHCs) as resembling that

observed in control groups in clinical trials. However, this study did not provide

unequivocal evidence for the superiority of ESTs over TAUs due to several significant

limitations. First, it is most likely that the adolescents who were treated at the CMHCs were

significantly different from the clients in clinical trials with regard to socioeconomic status, rates

of comorbidity, and other exclusion criteria (Westen, Novotny, & Thompson-Brenner, 2004). In

addition, it is well documented that the therapists’ workload and other work-related

environments are drastically different between clinical trials and clinical settings (Borkovec &

Castonguay, 1998; Rupert & Baird, 2004). Therefore, it is premature to draw conclusions that

ESTs for children and adolescents outperform TAUs.

As for the adult population, not only are there very few studies that have investigated the

effectiveness of TAUs, but the results of these studies are also mixed. An investigation of TAU marital

therapy conducted in Germany revealed that although significant pre-post effects were found,

their overall effect size was low (Hahlweg & Klann, 1997). On the other hand, TAUs conducted


in a community-based substance abuse treatment program showed equivalent results with

cognitive-behavioral therapy (CBT) that was implemented in the same setting (Morgenstern et

al., 2001). Furthermore, Addis et al. (2004) reported that the delivery of an EST in a managed

care environment, notably panic control therapy (PCT), attained significantly better clinical

outcomes for some variables as compared to TAU; however, the PCT therapists received

additional training and supervision and, in any event, the superiority of PCT over TAU was

small (an effect size in the neighborhood of .15; see Wampold, 2005). Therefore, the inferiority

of TAU in clinical settings has not been conclusively established.

In addition, there have been several methodological problems involved in estimating the

size of effects of TAUs. First, because what is often defined as TAU is idiosyncratic to the study,

simplistic conclusions drawn based on comparisons between ESTs and TAUs are misleading

unless one carefully reviews what the authors defined as TAUs. For example, the “usual

services” that were used in comparison against multisystemic therapy (MST) implemented in

community mental health centers involved only monitoring by probation officers and referrals to

other social services and/or special academic programs (Henggeler et al., 1997). Although the

authors rightfully did not generalize their findings to suggest overall superiority of MST relative

to TAUs in general, the lack of uniform agreement about the nature of TAUs has contributed to

erroneous perceptions of the effectiveness of TAUs. Second, even if TAUs are bona fide

psychotherapies, most comparisons against implemented ESTs are unbalanced. For example,

therapists who are under the EST conditions receive additional training and supervision that are

not offered to therapists under the TAU conditions, not to mention the potential demand

characteristics of the study (e.g., Addis et al., 2004; Merrill, Tolbert, & Wade, 2003; Wade, Treat,

& Stuart, 1998; see also Westen, Novotny, & Thompson-Brenner, 2004). Clearly, care must be


taken in defining and operationalizing TAUs to be able to make valid conclusions about the

effectiveness of psychotherapy in clinical settings.

Recently, a promising method was introduced that allows for evaluation of psychotherapy

effectiveness without altering any aspect of TAUs by using benchmarks created from clinical

trials. Specifically, benchmarking allows pre-post outcome data in clinical settings to be

compared against pre-post outcome data from clinical trials (e.g., Merrill, Tolbert, & Wade,

2003; Wade, Treat, & Stuart, 1998; Weersing & Weisz, 2002). Benchmarking involves the

following three steps: (a) calculation of benchmarks by aggregating pre-post effect sizes

observed in clinical trials, (b) calculation of an effect size in the clinical setting, and (c) statistical

comparisons between the benchmarks and the clinical settings effect size (Minami, Serlin,

Wampold, Kircher, & Brown, 2005; Minami, Wampold, Serlin, Kircher, & Brown, 2005). Thus,

this strategy allows for direct statistical evaluation of effectiveness by comparing effects

produced in clinical settings to a rigorous standard established by clinical trials.

The purpose of the current study was to evaluate the effectiveness of TAUs delivered in a

managed care health organization (HMO) using a benchmarking strategy. Specifically, a subset

of the HMO data containing adult clients diagnosed with major depressive disorder (MDD;

American Psychiatric Association, 1994) was statistically compared against benchmarks of adult

depression treatment derived from clinical trials conducted to evaluate the effectiveness of

treatment for depression (Minami, Serlin, et al., 2005; Minami, Wampold, et al., 2005). In

addition, the effect of possible moderators of treatment effectiveness (i.e., individual versus

group providers, medication use) was examined.

Method

Participants


Base HMO data. The original database (labeled as the “base HMO data”) for this study

contained client outcome data for 48,038 adult clients who received treatment from 6,007

treatment providers1 between February 8, 1999 and December 31, 2004, under the insurance

coverage of PacifiCare Behavioral Health, Inc. (PBH). Available demographics are provided in

Tables 1 and 2. PBH does not routinely collect other demographic variables of clients and

providers such as race/ethnicity, education, and income, and thus these variables were unavailable. The base

HMO data was reduced based on inclusion and exclusion criteria as explained below.

Outcome Measure

Outcome Questionnaire – 30.12 (OQ-30; Lambert, Hatfield, Vermeersch, Burlingame,

Reisinger, & Brown, 2001) was used to assess outcomes of clients included in the HMO dataset.

This instrument was a shortened version of the Outcome Questionnaire – 45.2 (OQ-45;

Vermeersch, Lambert, & Burlingame, 2000; Wells, Burlingame, Lambert, & Hoag, 1996), which

was designed to measure patient progress in three dimensions: (a) subjective discomfort, (b)

interpersonal relationships, and (c) social role performance. The OQ-45 was designed to be a

low-cost, brief-but-broad measure, which was sensitive to short-term changes yet reliable and

valid. The OQ-30 is a briefer version of the OQ-45 that was specifically created for clients to

conveniently complete multiple times. Lambert et al. reported high internal consistency and test-

retest reliability as well as high concurrent validity with other symptom measures, such as the

Inventory of Interpersonal Problems (Horowitz, Rosenberg, Baer, Ureno, & Villasenor, 1988),

Social Adjustment Scale (Weissman & Bothwell, 1976), and Beck Depression Inventory (BDI;

Beck & Steer, 1987).

Procedure


Initial data collection. Clients were asked to fill out the OQ-30 before their first, third,

and fifth sessions, as well as every fifth session thereafter; this assessment has been

implemented system-wide at PBH as their routine assessment, and has a current participation rate

of approximately 70%. Whereas the clinical trials have the ease of defining episodes of care as

the period between when the participants entered the clinical trial and when they “completed” or

“dropped out,” episodes cannot be as clearly defined in clinical settings. Therefore, for the

current study, an episode of care was defined as the cluster of outcome assessment points that do

not have more than a 90-day gap between two observations. In other words, if any two

observations were more than 90 days apart, the former observation was deemed to be the score

of the last session of an episode, and the latter observation was considered the intake score of the

next episode.
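The episode definition above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure actually run on the PBH data: the function name `split_into_episodes`, the `(date, score)` record layout, and the example client are all hypothetical.

```python
from datetime import date

MAX_GAP_DAYS = 90  # gap threshold stated in the text


def split_into_episodes(observations):
    """Group (date, score) pairs into episodes of care.

    A new episode starts whenever two consecutive observations
    are more than MAX_GAP_DAYS apart.
    """
    ordered = sorted(observations, key=lambda obs: obs[0])
    episodes = []
    for obs in ordered:
        # Compare with the last observation of the current episode, if any.
        if episodes and (obs[0] - episodes[-1][-1][0]).days <= MAX_GAP_DAYS:
            episodes[-1].append(obs)
        else:
            episodes.append([obs])
    return episodes


# Hypothetical client record (dates and scores invented for illustration).
example = [
    (date(2001, 1, 10), 62),
    (date(2001, 1, 24), 58),
    (date(2001, 2, 21), 51),
    (date(2001, 9, 3), 60),   # more than 90 days after the previous score
    (date(2001, 9, 17), 55),
]
episodes = split_into_episodes(example)
first_episode = episodes[0]
intake_score, last_score = first_episode[0][1], first_episode[-1][1]
```

In this sketch a gap of exactly 90 days keeps two observations in the same episode, which follows the "more than 90 days apart" wording; the first and last scores of the first episode are the two values that enter the pre-post effect size.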

Data reduction. The base HMO data was reduced to match, as best possible, the clinical

population represented in the clinical trials that investigate the efficacy of psychotherapy

treatments for adult depression. First, for the purpose of the study, only the first episode of care

for a given client was included so as to maintain independence of observations as best possible.

The database was further reduced based on the clients’ demographics and severity of depression

to match those of the clinical trials that were used to create the benchmarks. Thus, clients were

included in the subset data if they met all three of the following criteria: (a) client age of 18 years

or older, (b) a primary diagnosis of MDD, and (c) a score of 43 or above on the OQ-30, which

serves as the clinical cutoff score based on Jacobson and Truax’s (1991) formula (Lambert et al.,

2001). Client data with regard to concurrent substance abuse, other comorbidity (e.g., psychotic

features or personality disorders), or suicidal ideation were unavailable, and thus were not used as

exclusion criteria although such exclusion criteria are typically used in clinical trials of


depression. Applying these criteria resulted in a subset database of outcomes for 6,323 adult

clients with depression who received treatment from 2,001 providers (i.e., subset HMO data).

All available demographic and clinical information from the base and subset HMO data with

regard to the clients and therapists are also provided in Tables 1 and 2, respectively. No data

were available on race/ethnicity and other demographics of both the providers and clients, as the

HMO does not routinely collect this information.
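The inclusion criteria described above can be expressed as a simple filter. The sketch below is illustrative only: the `Episode` record, its field names, the diagnosis label "MDD", and the example records are assumptions made for this illustration, not the structure of the PBH database.

```python
from dataclasses import dataclass

OQ30_CLINICAL_CUTOFF = 43  # clinical cutoff score stated in the text
MIN_AGE = 18               # adult clients only


@dataclass
class Episode:
    client_id: str
    episode_number: int      # 1 = first episode of care for the client
    age: int
    primary_diagnosis: str   # hypothetical label; "MDD" marks major depressive disorder
    intake_oq30: float


def meets_inclusion_criteria(ep):
    """First episode only, adult, primary MDD diagnosis, intake in the clinical range."""
    return (
        ep.episode_number == 1
        and ep.age >= MIN_AGE
        and ep.primary_diagnosis == "MDD"
        and ep.intake_oq30 >= OQ30_CLINICAL_CUTOFF
    )


# Invented records for illustration.
records = [
    Episode("a", 1, 34, "MDD", 61.0),   # included
    Episode("a", 2, 35, "MDD", 58.0),   # excluded: not the first episode
    Episode("b", 1, 17, "MDD", 66.0),   # excluded: under 18
    Episode("c", 1, 45, "GAD", 70.0),   # excluded: primary diagnosis is not MDD
    Episode("d", 1, 52, "MDD", 42.0),   # excluded: below the clinical cutoff
    Episode("e", 1, 29, "MDD", 43.0),   # included: exactly at the cutoff
]
subset = [ep for ep in records if meets_inclusion_criteria(ep)]
```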

PLEASE INSERT TABLES 1 AND 2 AROUND HERE

Subset HMO Data Effect Size Calculation

Treatment effect size of the subset HMO data was calculated following basic meta-analytic procedures (Becker, 1988; Hedges & Olkin, 1985). Specifically, where $M_1$ and $M_2$ are

the intake OQ-30 and last available OQ-30 means, respectively, and $SD_1$ is the standard

deviation of the intake score, the biased estimator $g_{HMO}$ is

$$ g_{HMO} = \frac{M_1 - M_2}{SD_1}. \tag{1} $$

Here, the standard deviation of the intake score was used rather than a pooled standard deviation

because it is presumably less influenced by repeated testing and/or treatment, presenting a less

confounded value (Becker). Using the correction, as derived by Hedges and Olkin, the unbiased

estimate of the effect size $d_{HMO}$, to be benchmarked, is

$$ d_{HMO} = \left(1 - \frac{3}{4N - 5}\right) g_{HMO}, \tag{2} $$

where $N$ is the sample size of the clinical settings data. The estimated variance of $d_{HMO}$ is

$$ \hat{\sigma}^2_{d_{HMO}} = \frac{2\left(1 - r_{12}\right)}{N} + \frac{d_{HMO}^2}{2N}, \tag{3} $$


where $r_{12}$ is the estimated correlation between the intake and last available scores (Becker). For

the current sample, the correlation of the intake and last available score was .4966 and this value

was used for $r_{12}$ in equation 3.
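As an illustration, Equations 1 through 3 can be computed as follows. This is a minimal sketch rather than the authors' analysis code; the function name `hmo_effect_size` is hypothetical, and the inputs are the rounded summary statistics reported in the Results section, so the output may differ from the reported values in the last decimal places.

```python
def hmo_effect_size(m1, m2, sd1, n, r12):
    """Pre-post effect size and its variance for a single treated sample.

    m1, m2 : intake and last available means
    sd1    : standard deviation of the intake scores
    n      : sample size
    r12    : correlation between intake and last available scores
    """
    g = (m1 - m2) / sd1                            # Equation 1 (biased estimator)
    d = (1 - 3 / (4 * n - 5)) * g                  # Equation 2 (small-sample correction)
    var_d = 2 * (1 - r12) / n + d ** 2 / (2 * n)   # Equation 3
    return g, d, var_d


# Rounded summary statistics as reported for the subset HMO data.
g, d, var_d = hmo_effect_size(m1=63.58, m2=54.19, sd1=12.75, n=6323, r12=0.4966)
print(f"g = {g:.4f}, d = {d:.4f}, variance = {var_d:.7f}")
```

With a sample of this size the correction factor in Equation 2 is very close to 1, so the biased and unbiased estimates are nearly identical.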

Clinical Trials Benchmarks

Benchmarks for both the treatment efficacy of psychotherapy for adult depression and

natural history of depression were derived by Minami, Wampold, et al. (2005). The treatment

efficacy benchmarks were derived by meta-analytically aggregating pretest to last available

assessment effects (i.e., both completers and intent-to-treat samples change effects) in published

clinical trials of psychotherapy treatment for adult depression. The natural history benchmarks

were constructed in a manner methodologically identical to the treatment efficacy benchmarks, but with

the symptom trajectories of wait-list control groups. For this study, intent-to-treat benchmarks that

aggregated outcomes of self-report measures assessing broad symptoms were selected in order to

make valid comparison, as the HMO data set used a global well-being measure (viz., the OQ-30)

and contained all patients who entered treatment and completed at least two outcome measures.

Accordingly, the treatment efficacy benchmark was $d_{TE(B)} = 0.831$ and the natural history

benchmark was $d_{NH(B)} = 0.122$ for global measure for intent-to-treat samples (Minami,

Wampold, et al.). The mean numbers of weeks in treatment in the clinical trials were

approximately 16 for the efficacy benchmark and 10 for the natural history benchmark (Minami,

Wampold, et al.).

Benchmarking

Testing against the treatment efficacy benchmark. The subset HMO data was tested

using the benchmarking strategy illustrated in Minami, Serlin, et al. (2005) against the treatment

efficacy benchmark $d_{TE(B)} = 0.831$. This strategy tests the true effect size in the population as


represented by the clinical settings data against a critical value derived from the benchmark,

taking into consideration a predetermined margin of $d = 0.2$ between the benchmark and the

population to claim clinical equivalence while maintaining an overall Type I error of .05. In

other words, if the clinical settings effect was within $d = 0.2$ below the efficacy benchmark (i.e.,

$d = 0.631$), the population effect size as represented by this data was considered clinically

equivalent to the efficacy benchmark. The margin of $d = 0.2$ was selected based on Cohen’s

(1988) suggestion that this magnitude of effect size is small, and therefore, any differences

between the benchmarks and the population effect size that were within 1/5th of a standard

deviation were considered to be clinically trivial (Minami, Wampold, et al., 2005).

To statistically compare the population effect size represented by the subset HMO data

against the benchmark taking into consideration the $d = 0.2$ margin, the “good-enough

principle” as illustrated in Serlin and Lapsley (1985, 1993) was used. This procedure allows for

hypothesis testing with a range-null rather than a point-null hypothesis, while maintaining an

overall Type I error of .05. Specifically, with subset HMO effect size data $d_{HMO}$, its sample

size $N$, and the $d = 0.2$ margin $\Delta$, the noncentral $t$ test statistic $t_{HMO}$ has a noncentrality parameter

$$ \lambda_{TE} = \sqrt{N}\left(\delta_{TE(B)} - \Delta\right), \tag{4} $$

where $\delta_{TE(B)}$ is the true treatment efficacy benchmark. When $\nu$ is the degrees of freedom (i.e.,

$\nu = N - 1$), $t_{HMO}$ will be tested against the noncentral $t$ critical value $t_{\nu, \lambda : \alpha}$ at $\alpha = .05$.

Testing against the natural history benchmark. For the population effect size to claim

any effectiveness over and above the natural symptom trajectory of depression, the clinical

setting effect must exceed at minimum $d = 0.2$ above the natural history benchmark

$d_{NH(B)} = 0.122$ (i.e., $d = 0.322$). Thus, using an identical method as illustrated for comparison

against the treatment efficacy benchmark other than the direction of the margin, the noncentral $t$

test statistic $t_{HMO}$ has a noncentrality parameter

$$ \lambda_{NH} = \sqrt{N}\left(\delta_{NH(B)} + \Delta\right), \tag{5} $$

where $\delta_{NH(B)}$ is the true depression natural history benchmark. For this analysis, $t_{HMO}$ will be

tested against the noncentral $t$ critical value $t_{\nu, \lambda : \alpha}$ at $\alpha = .05$. Again, however, whether or not

the population effect size as represented by the subset HMO data is statistically and clinically

superior to the natural trajectory of depression could be determined by visually inspecting the

figure in Minami, Wampold, et al. (2005) without actual calculation.
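The two comparisons described in this section can be summarized as one-tailed range-null tests. The following is a sketch of the hypotheses implied by the margin $\Delta = 0.2$, not a formulation taken from the cited sources. The symbol $\delta_{HMO}$, denoting the population effect size represented by the HMO data, is introduced here for illustration only, and the test statistic is taken to be $t_{HMO} = \sqrt{N}\, d_{HMO}$, as used in the Results section.

```latex
% Treatment efficacy benchmark: clinical equivalence is claimed when H_0 is rejected.
H_0:\ \delta_{HMO} \le \delta_{TE(B)} - \Delta
\quad \text{vs.} \quad
H_1:\ \delta_{HMO} > \delta_{TE(B)} - \Delta,
\qquad \text{reject } H_0 \text{ if } t_{HMO} > t_{\nu,\, \lambda_{TE} : \alpha}

% Natural history benchmark: effectiveness beyond natural history is claimed when H_0 is rejected.
H_0:\ \delta_{HMO} \le \delta_{NH(B)} + \Delta
\quad \text{vs.} \quad
H_1:\ \delta_{HMO} > \delta_{NH(B)} + \Delta,
\qquad \text{reject } H_0 \text{ if } t_{HMO} > t_{\nu,\, \lambda_{NH} : \alpha}
```

In both cases the boundary of the null range supplies the noncentrality parameter, which is why the margin is subtracted from the efficacy benchmark and added to the natural history benchmark.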

Results

Subset HMO Effect Size

The $N = 6{,}323$ adult clients who had clinical depression in the subset HMO data had

mean intake and last session scores of $M_1 = 63.58$ ($SD = 12.75$) and $M_2 = 54.19$ ($SD = 15.68$),

respectively. Thus, the effect size $d_{HMO}$ was

$$ d_{HMO} = \left(1 - \frac{3}{4 \cdot 6323 - 5}\right) \cdot \frac{63.58 - 54.19}{12.75} = 0.7360. \tag{6} $$

As $r_{12} = .4966$ in this subset HMO data, the variance $\hat{\sigma}^2_{d_{HMO}}$ was

$$ \hat{\sigma}^2_{d_{HMO}} = \frac{2\left(1 - .4966\right)}{6323} + \frac{\left(0.7360\right)^2}{2 \cdot 6323} = 0.0002021. \tag{7} $$

Benchmarking

Benchmarking the overall subset HMO effect size. With the subset HMO effect size of

$d_{HMO} = 0.7360$ with $N = 6{,}323$, the test statistic $t_{HMO}$ was

$$ t_{HMO} = \sqrt{6323} \cdot 0.7360 = 58.53. \tag{8} $$


When compared against the treatment efficacy benchmark, it had a noncentrality parameter

$$ \lambda_{TE} = \sqrt{6323}\left(0.8310 - 0.2\right) = 50.18. \tag{9} $$

Tested against the noncentral $t$ critical value $t_{6323,\, 50.18 : .95} = 51.99$, $t_{HMO}$ was statistically

significant at $p < .0001$. That is, the overall subset HMO effect size $d_{HMO} = 0.7360$ was

clinically equivalent to the treatment efficacy benchmark $d_{TE(B)} = 0.831$. Here, with $N = 6{,}323$,

the 95th percentile one-tailed critical value that the subset HMO effect size needed to exceed to

claim clinical equivalence with the treatment efficacy benchmark was

$$ d_{TE(CV)} = \frac{t_{6323,\, 50.18 : .95}}{\sqrt{N}} = \frac{51.99}{\sqrt{6323}} = 0.6538 \tag{10} $$

(Minami, Serlin, et al., 2005). Clearly, the subset HMO effect size $d_{HMO} = 0.7360$ exceeded the

critical value.

When compared against the natural history benchmark $d_{NH(B)} = 0.122$, $t_{HMO} = 58.53$ had

a noncentrality parameter

$$ \lambda_{NH} = \sqrt{6323}\left(0.1216 + 0.2\right) = 25.57. \tag{11} $$

Tested against the noncentral $t$ critical value $t_{6323,\, 25.57 : .95} = 27.26$, $t_{HMO}$ was statistically

significant at $p < .0001$. For reference, the 95th percentile one-tailed critical value for the

clinical settings effect size to exceed to claim clinical effectiveness over and above the natural

trajectory of depression was $d_{NH(CV)} = 0.3429$ (Minami, Serlin, et al., 2005). Thus, the primary

conclusion of this analysis is that providers in clinical practice treating major depression attain

outcomes comparable to those achieved by treatments provided in clinical trials and surpass

improvement inherent in the natural course of depression.
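The reported quantities can be checked approximately with the Python standard library alone. The sketch below is not the computation used in the study: in place of exact noncentral $t$ quantiles it substitutes a large-sample normal approximation (mean $\lambda$, variance $1 + \lambda^2 / 2\nu$), which is an assumption introduced here and is adequate only because $\nu$ is very large. The function name `approx_nct_critical` is hypothetical. Exact quantiles would require a statistical library that implements the noncentral $t$ distribution.

```python
import math
from statistics import NormalDist


def approx_nct_critical(nu, lam, alpha=0.05):
    """Approximate upper-alpha critical value of a noncentral t distribution.

    Uses a normal approximation with mean lam and variance 1 + lam^2 / (2 * nu);
    this is an approximation for large nu, not an exact quantile.
    """
    z = NormalDist().inv_cdf(1 - alpha)
    return lam + z * math.sqrt(1 + lam ** 2 / (2 * nu))


N = 6323          # subset HMO sample size
MARGIN = 0.2      # clinical equivalence margin
D_HMO = 0.7360    # reported subset HMO effect size
root_n = math.sqrt(N)

t_hmo = root_n * D_HMO                   # Equation 8
lam_te = root_n * (0.8310 - MARGIN)      # Equation 9
lam_nh = root_n * (0.1216 + MARGIN)      # Equation 11

t_crit_te = approx_nct_critical(N - 1, lam_te)
t_crit_nh = approx_nct_critical(N - 1, lam_nh)

# Critical values re-expressed on the effect size scale (Equation 10).
d_cv_te = t_crit_te / root_n
d_cv_nh = t_crit_nh / root_n

print(f"t_HMO = {t_hmo:.2f}")
print(f"efficacy:        lambda = {lam_te:.2f}, t_crit ~ {t_crit_te:.2f}, d_CV ~ {d_cv_te:.4f}")
print(f"natural history: lambda = {lam_nh:.2f}, t_crit ~ {t_crit_nh:.2f}, d_CV ~ {d_cv_nh:.4f}")
```

Under this approximation the critical values agree with the reported 51.99 and 27.26 to within a few hundredths, and the observed test statistic exceeds both by a wide margin.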


Moderator analyses of the subset HMO data. As some clinical characteristics were

potential moderators, the subset HMO data was further divided based on the following factors:

(a) providers’ practice context (i.e., individual or group practice providers) and (b) client

medication use. The results of these analyses are shown in Table 3. Although the subset HMO

effect size as a whole cleared the treatment efficacy benchmark, the effect size was significantly

impacted by both providers’ practice and concurrent medication use. Although clients in group

practices cleared the treatment efficacy benchmark regardless of whether or not the clients were

on medication, clients treated by individual providers cleared the benchmark only with

concurrent medication use (see Table 4).

PLEASE INSERT TABLES 3 AND 4 AROUND HERE

Because effect size is affected by initial severity (i.e., typically, those who are initially

more severe demonstrate larger effects given sufficient number of sessions; Garfield, 1986;

Lambert, 2001), two subgroups within the individual practice/no medication condition were

further analyzed. For the first subset, we selected only those clients who collectively met the

initial severity observed in the individual practice/medication condition (i.e., intake $M \approx 65.5$).

This subset within the individual practice/no medication condition exceeded the critical value

$d_{CV} = 0.6865$ ($N = 1{,}092$, $p < .0001$), indicating that among clients with average intake scores

at this severity, providers in individual practice collectively performed equivalently as compared

to clinical trials, even without medication. However, with the second subset that matched the

intake severity to that observed among group practice/no medication condition (i.e., intake

$M \approx 61.3$), providers in individual practices did not exceed the critical value.

As the providers in individual practice treating clients who were not on medication did

not perform at a level clinically equivalent to the clinical trials, we estimated the percentage of providers


in this condition that collectively exceed the critical value as proposed a priori. In other words,

we sought to investigate what percentage of the providers, performing on the poorer end, needed

to be excluded in order for the set of providers in individual practice to meet the critical value to

claim clinical equivalence with the clinical trials. This analysis showed that excluding the

poorest functioning five percent of providers in individual practice was sufficient for these

providers to meet the required treatment benchmark.

Discussion

There has been a dearth of studies investigating the effectiveness of TAUs delivered in

clinical settings. To our knowledge, the present article reports the first benchmarking study of

TAUs for the psychotherapy treatment of adult clinical depression. Notable is the use of

benchmarks for treatment and natural history derived meta-analytically from clinical trials and

the use of the range-null hypothesis testing procedure (Serlin & Lapsley, 1985, 1993), which

allowed for a $d = 0.2$ margin between the benchmarks and the subset HMO data to claim

clinical indifference between the two.

The results of the present study clearly demonstrated the general effectiveness of

psychotherapy treatment for adult depression provided in general clinical settings. The

assumption that providers in clinical practice produce outcomes inferior to what would be

accomplished had these providers used an EST for depression seems to be unwarranted, given

that the providers matched the effects produced by clinical trials.

An interesting result is that providers in group practices attain better outcomes than

providers in individual practice. Although there is no data in the present study that provides an

explanation for this result, several interesting conjectures are apparent. There may be something

intrinsic to group practices that augments the performance of their members, such as the


availability of colleague consultation or multidisciplinary approaches. On the other hand, the

explanation may involve selection; better therapists either select group practices or group

practices select better therapists. Finally, because the present study was naturalistic (i.e.,

patients were not randomly assigned to therapists), it may be that patients with better prognoses

select group practices. However, even the providers in individual practice achieve commendable

outcomes when one considers the fact that in clinical trials therapists typically are selected for

their expertise and are trained, monitored, and supervised (e.g., Rounsaville, O’Malley, Foley, &

Weissman, 1988; see also Rupert & Baird, 2004; Westen et al., 2004). In the present study,

eliminating the poorest functioning five percent allowed the providers in individual practice to

achieve clinical equivalence to the clinical trial benchmark; we speculate that clinical trialists are

more selective than this. Given that there is significant variation in outcomes attributable to

therapists in clinical trials and in managed care settings (Crits-Christoph, Baranackie, Kurcias,

Beck, Carroll, Perry, et al., 1991; Wampold & Brown, 2005), the selection of therapists in

clinical trials would tend to augment the effects in that context.

It is also important to note that whereas the mean number of weeks in treatment in the

clinical trials included in the efficacy benchmark was approximately 16, the mean number of

weeks for the subset HMO data was less than 9, with a median of 6. This comparison further

strengthens the evidence of effectiveness in clinical settings. Although an argument could be

made that most of the change in psychotherapy occurs early in treatment (Barkham, Rees, Stiles,

Shapiro, Hardy, & Reynolds, 1996; Howard, Kopta, Krause, & Orlinsky, 1986), the shorter

length of treatment nevertheless has a profound impact on both the well-being of the client and cost

effectiveness.


It also appears that the concurrent administration of medications augments the effects of

psychotherapy, a result consistent with the conclusions of some research (Thase & Jindal, 2004).

However, this result appears to be due, in large part, to the fact that those on medication are

initially more severely dysfunctional. Again, because of the lack of random assignment, making

a strong conclusion about the effects of medication is precluded.

There are limitations that need to be considered in interpreting the results of the present

study. First of all, the treatment efficacy and natural history benchmarks that were used in this

study, despite being the best indexes currently available, are a compilation of various self-report

outcome measures that assess global symptoms (e.g., SCL-90; Derogatis, 1977). Therefore, as

the subset HMO data used the OQ-30, outcome measures for the benchmarks and the effects in

clinical practice are not exact matches and thus differences among the measures could potentially

impact the results. However, these benchmarks appeared to be the most representative to

compare against, given that the OQ-30 is also a self-report measure of global symptoms.

It could be claimed that the observed effectiveness demonstrated in the managed care

context was solely an artifact of regression to the mean. Because initial severity is correlated

with pre to posttest effect sizes, valid comparison of the clinical effect in the HMO data to the

benchmark assumes equal severity. Care was taken to only include clients who were given a

MDD diagnosis and who were in the clinical range of the OQ, a strategy that simulates the

inclusion criteria for clinical trials of depression. However, the equivalence of initial severity in

the benchmark samples with that of the clinical sample is unknown. Although this

criticism cannot be completely refuted, we also benchmarked the subset HMO data with the

natural history benchmark, which aggregated pre-post effect sizes of wait-list control conditions.

Under the assumption that the clinical conditions between treatments and wait-list controls in the


clinical trials were sufficiently equivalent, it could be inferred that the natural history benchmark

would represent the combination of natural remission of depression and the regression artifact. It

is important to note that clinical equivalence as demonstrated using the benchmarking strategy

does not explain away other differences between clinical settings and clinical trials. Typically,

naturalistic practice settings and research environments are quite different with regard to client

and therapist factors such as heterogeneity among clients, funding structure, supervision and

training, length of treatment, and clinical caseload (Nathan, Stuart, & Dolan, 2000; Rounsaville,

O’Malley, Foley, & Weissman, 1988; Rupert & Baird, 2004; Seligman, 1995; Wampold, 1997,

2001; Westen & Morrison, 2001; Westen et al., 2004). In fact, it is the incorporation of these

documented differences in interpreting our results that further strengthens our conclusion. For

example, Westen and Morrison reported that approximately 70% of clients are screened out in

clinical trials based on their strict inclusion and exclusion criteria, whereas in the clinical settings,

such criteria are ethically impermissible. In addition, Rupert and Baird have documented the

client load and lack of supervision available in the general clinical settings, which is in stark

contrast to the conditions of therapists participating in rigorously conducted clinical trials who

receive additional training and supervision. Despite these inequalities in conditions, the results

of this study are reassuring in that adult clients receiving psychotherapy in general clinical settings

for depression are most likely receiving quality care.

Finally, the moderator analyses are difficult to interpret because clients were not

randomly assigned to conditions (i.e., were not randomly assigned to group or individual

practices or to medication or no medication conditions). Consequently, the results of the

moderator analyses need to be considered tentative.


The increased demand to demonstrate accountability and effectiveness of clinical practice

has put pressure on clinical settings to assess clinical outcomes. However, there are growing

concerns among therapists regarding this issue. For example, Hahlweg and Klann (1997)

reported that the response rate of therapists who were asked to measure outcomes was low,

possibly due to the anxiety that the results may be used for administrative purposes (e.g.,

promotion and/or retention). In addition, Plante, Andersen, and Boccaccini (1999), in their

survey of Clinical Diplomates of the American Board of Professional Psychology, reported that

many considered the routine use of outcome measures too lengthy and unnecessary. Although

more effectiveness studies are necessary, the above concerns must be addressed adequately so

that therapists will willingly participate in outcome assessments. Without active participation

from therapists, effectiveness cannot be adequately assessed.

Simultaneously, the current trend to implement ESTs and other manualized treatments in

clinical settings has also met with resistance from therapists, as many perceive such

implementation as a hindrance to their autonomy and creativity (e.g., Plante et al., 1999).

Moreover, it is impossible to determine whether or not ESTs were satisfactorily implemented

without monitoring therapists’ adherence, which would be highly impractical

considering the cost. For clinicians who are faced with choosing between measuring their

clinical outcomes and abandoning their TAUs in favor of ESTs, measuring outcomes may in fact

appear favorable. Specifically, if providers in clinical settings could demonstrate that they are

attaining outcomes clinically equivalent to efficacy observed in clinical trials, the current

rationale for implementing ESTs, which is to ensure accountability, becomes illogical. After all,

it is the results with clients in the field that are the goal of treatment; if providers can document


that they are achieving desired outcomes, it makes little sense to suggest that they adopt

particular treatments.


References

Addis, M. E. (2002). Methods for disseminating research products and increasing evidence-

based practice: Promises, obstacles, and future directions. Clinical Psychology: Science

and Practice, 9, 367-378.

Addis, M. E., Hatgis, C., Krasnow, A. D., Jacob, K., Bourne, L., & Mansfield, A. (2004).

Effectiveness of cognitive-behavioral treatment for panic disorder versus treatment as

usual in a managed care setting. Journal of Consulting and Clinical Psychology, 72, 625-

635.

American Psychiatric Association (1994). Diagnostic and statistical manual of mental disorders

(4th ed.). Washington, DC: Author.

Barkham, M., Rees, A., Stiles, W. B., Shapiro, D. A., Hardy, G. E., & Reynolds, S. (1996).

Dose-effect relations in time-limited psychotherapy for depression. Journal of

Consulting and Clinical Psychology, 64, 927-935.

Beck, A. T., & Steer, R. A. (1987). Beck Depression Inventory manual. San Antonio, TX:

Harcourt Brace Jovanovich.

Becker, B. J. (1988). Synthesizing standardized mean-change measures. British Journal of

Mathematical and Statistical Psychology, 41, 257-278.

Borkovec, T. D., & Castonguay, L. G. (1998). What is the scientific meaning of empirically

supported therapy? Journal of Consulting and Clinical Psychology, 66, 136-142.

Chorpita, B. F., Yim, L. M., Donkervoet, J. C., Arensdorf, A., Amundsen, M. J., McGee, C.,

Serrano, A., Yates, A., Burns, J. A., & Morelli, P. (2002). Toward large-scale

implementation of empirically supported treatments for children: A review and


observations by the Hawaii Empirical Basis to Services Task Force. Clinical Psychology:

Science and Practice, 9, 165-190.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:

Erlbaum.

Crits-Christoph, P., Baranackie, K., Kurcias, J. S., Beck, A. T., Carroll, K., Perry, K., Luborsky,

L., McLellan, T., Woody, G., Thompson, L., Gallagher, D., & Zitrin, C. (1991). Meta-

analysis of therapist effects in psychotherapy outcome studies. Psychotherapy Research,

1, 81-91.

Derogatis, L. R. (1977). The SCL-90 Manual I: Scoring, administration and procedures.

Baltimore, MD: Johns Hopkins University School of Medicine, Clinical Psychometrics

Unit.

Franklin, M. E., Abramowitz, J. S., Kozak, M. J., Levitt, J. T., & Foa, E. B. (2000).

Effectiveness of exposure and ritual prevention for obsessive-compulsive disorder:

Randomized compared with nonrandomized samples. Journal of Consulting and Clinical

Psychology, 68, 594-602.

García-Palacios, A., Botella, C., Robert, C., Baños, R., Perpiña, C., Quero, S., & Ballester, R.

(2002). Clinical utility of cognitive-behavioural treatment for panic disorder. Results

obtained in different settings: A research centre and a public mental health unit. Clinical

Psychology and Psychotherapy, 9, 373-383.

Garfield, S. L. (1986). Research on client variables in psychotherapy. In S. L. Garfield & A. E.

Bergin (Eds.), Handbook of psychotherapy and behavior change (3rd ed., pp. 213-256).

New York: John Wiley & Sons.


Gillespie, K., Duffy, M., Hackmann, A., & Clark, D. M. (2002). Community based cognitive

therapy in the treatment of post-traumatic stress disorder following the Omagh bomb.

Behaviour Research and Therapy, 40, 345-357.

Hahlweg, K., Fiegenbaum, W., Frank, M., Schroeder, B., & von Witzleben, I. (2001). Short- and

long-term effectiveness of an empirically supported treatment for agoraphobia. Journal

of Consulting and Clinical Psychology, 69, 375-382.

Hahlweg, K., & Klann, N. (1997). The effectiveness of marital counseling in Germany: A

contribution to health services research. Journal of Family Psychology, 11, 410-421.

Hamilton, M. A. (1960). A rating scale for depression. Journal of Neurology, Neurosurgery,

and Psychiatry, 23, 56-62.

Hamilton, M. A. (1967). Development of a rating scale for primary depressive illness. British

Journal of Social and Clinical Psychology, 6, 278-296.

Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. San Diego, CA:

Academic Press.

Henggeler, S. W., Melton, G. B., Brondino, M. J., Scherer, D. G., & Hanley, J. H. (1997).

Multisystemic therapy with violent and chronic juvenile offenders and their families: The

role of treatment fidelity in successful dissemination. Journal of Consulting and Clinical


Herschell, A. D., McNeil, C. B., & McNeil, D. W. (2004). Clinical child psychology’s progress

in empirically supported treatments. Clinical Psychology: Science and Practice, 11, 267-

288.

Hollon, S. D., Thase, M. E., & Markowitz, J. C. (2002). Treatment and prevention of depression.

Psychological Science in the Public Interest, 3, 39-77.


Horowitz, L. M., Rosenberg, S. E., Baer, B. A., Ureno, G., & Villasenor, V. S. (1988). Inventory

of interpersonal problems: Psychometric properties and clinical applications. Journal of


Howard, K. I., Kopta, S. M., Krause, M. S., & Orlinsky, D. E. (1986). The dose-effect

relationship in psychotherapy. American Psychologist, 41, 159-164.

Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining

meaningful change in psychotherapy research. Journal of Consulting and Clinical


Lambert, M. J. (2001). The status of empirically supported therapies: Comment on Westen and

Morrison’s (2001) multidimensional meta-analysis. Journal of Consulting and Clinical


Lambert, M. J., Hatch, D. R., Kingston, M. D., & Edwards, B. C. (1986). Zung, Beck, and

Hamilton Rating Scales as measures of treatment outcome: A meta-analytic comparison.

Journal of Consulting and Clinical Psychology, 54, 54-59.

Lambert, M. J., Hatfield, D. R., Vermeersch, D. A., Burlingame, G. M., Reisinger, C. W., &

Brown, G. S. (2001). Administration and scoring manual for the LSQ (Life Status

Questionnaire). East Setauket, NY: American Professional Credentialing Services.

Lincoln, T. M., Rief, W., Hahlweg, K., Frank, M., von Witzleben, I., Schroeder, B., &

Fiegenbaum, W. (2003). Effectiveness of an empirically supported treatment for social

phobia in the field. Behaviour Research and Therapy, 41, 1251-1269.

Manderscheid, R. W., & Henderson, M. J. (2004). Mental health, United States, 2002 executive

summary. Administration and Policy in Mental Health, 32, 49-55.


Merrill, K. A., Tolbert, V. E., & Wade, W. A. (2003). Effectiveness of cognitive therapy for

depression in a community mental health center: A benchmarking study. Journal of


Minami, T., Serlin, R. C., Wampold, B. E., & Kircher, J. C. (2005). How to benchmark clinical

settings effect sizes against clinical trials. Manuscript submitted for publication.

Minami, T., Wampold, B. E., Serlin, R. C., Kircher, J. C., & Brown, G. S. (2005). Benchmarks

for the treatment of adult depression: Issues and results. Manuscript submitted for

publication.

Morgenstern, J., Blanchard, K. A., Morgan, T. J., Labouvie, E., & Hayaki, J. (2001). Testing the

effectiveness of cognitive-behavioral treatment for substance abuse in a community

setting: Within treatment and posttreatment findings. Journal of Consulting and Clinical

Psychology, 69, 1007-1017.

Morrison, A. P., Renton, J. C., Williams, S., Knight, D. H., Kreutz, M., Nothard, S., Patel, U., &

Dunn, G. (2004). Delivering cognitive therapy to people with psychosis in a community

mental health setting: An effectiveness study. Acta Psychiatrica Scandinavica, 220, 36-

44.

Nathan, P. E., Stuart, S. P., & Dolan, S. L. (2000). Research on psychotherapy efficacy and

effectiveness: Between Scylla and Charybdis? Psychological Bulletin, 126, 964-981.

Persons, J. B., Bostrom, A., & Bertagnolli, A. (1999). Results of randomized controlled trials of

cognitive therapy for depression generalize to private practice. Cognitive Therapy and

Research, 23, 535-548.


Plante, T. G., Andersen, E. N., & Boccaccini, M. T. (1999). Empirically supported treatments

and related contemporary changes in psychotherapy practice: What do clinical ABPPs

think? Clinical Psychologist, 52, 23-31.

Rounsaville, B. J., O’Malley, S., Foley, S., & Weissman, M. M. (1988). Role of manual-guided

training in the conduct and efficacy of interpersonal psychotherapy for depression.

Journal of Consulting and Clinical Psychology, 56, 681-688.

Rupert, P. A., & Baird, K. A. (2004). Managed care and the independent practice of psychology.

Professional Psychology: Research and Practice, 35, 185-193.

Seligman, M. E. P. (1995). The effectiveness of psychotherapy: The Consumer Reports Study.

American Psychologist, 50, 965-974.

Serlin, R. C., & Lapsley, D. K. (1985). Rationality in psychological research: The good-enough

principle. American Psychologist, 40, 73-83.

Serlin, R. C., & Lapsley, D. K. (1993). Rational appraisal of psychological research and the

good-enough principle. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in

the behavioral sciences: Methodological issues (pp. 199-228). Hillsdale, NJ: Lawrence

Erlbaum Associates.

Shadish, W. R., Matt, G. E., Navarro, A. M., & Phillips, G. (2000). The effects of psychological

therapies under clinically representative conditions: A meta-analysis. Psychological

Bulletin, 126, 512-529.

Shadish, W. R., Matt, G. E., Navarro, A. M., Siegle, G., Crits-Christoph, P., Hazelrigg, M. D.,

Jorm, A. F., Lyons, L. C., Nietzel, M. T., Prout, H. T., Robinson, L., Smith, M. L.,

Svartberg, M., & Weiss, B. (1997). Evidence that therapy works in clinically

representative conditions. Journal of Consulting and Clinical Psychology, 65, 355-365.


Stirman, S. W., Crits-Christoph, P., & DeRubeis, R. J. (2004). Achieving successful

dissemination of empirically supported psychotherapies: A synthesis of dissemination

theory. Clinical Psychology: Science and Practice, 11, 343-359.

Thase, M. E., & Jindal, R. D. (2004). Combining psychotherapy and psychopharmacology for

treatment of mental disorders. In M. J. Lambert (Ed.), Handbook of psychotherapy and

behavior change (5th ed.). New York: John Wiley & Sons.

Tuschen-Caffier, B., Pook, M., & Frank, M. (2001). Evaluation of manual-based cognitive-

behavioral therapy for bulimia nervosa in a service setting. Behaviour Research and

Therapy, 39, 299-308.

Vermeersch, D. A., Lambert, M. J., & Burlingame, G. M. (2000). Outcome Questionnaire: Item

sensitivity to change. Journal of Personality Assessment, 74, 242-261.

Wade, W. A., Treat, T. A., & Stuart, G. L. (1998). Transporting an empirically supported

treatment for panic disorder to a service clinic setting: A benchmarking strategy. Journal

of Consulting and Clinical Psychology, 66, 231-239.

Wampold, B. E. (1997). Methodological problems in identifying efficacious psychotherapies.

Psychotherapy Research, 7, 21-43.

Wampold, B. E. (2001). The great psychotherapy debate: Model, methods, and findings.

Mahwah, NJ: Lawrence Erlbaum Associates.

Wampold, B. E. (2005). Do therapies designated as ESTs for specific disorders produce

outcomes superior to non-EST therapies? Not a scintilla of evidence to support ESTs

as more effective than other treatments. In J. C. Norcross, L. E. Beutler, & R. F. Levant

(Eds.), Evidence-based practices in mental health: Debate and dialogue on the

fundamental questions (pp. 299-308, 317-319). Washington, DC: American

Psychological Association.

Wampold, B. E., & Brown, G. (2005). Estimating therapist variability in outcomes attributable

to therapists: A naturalistic study of outcomes in managed care. Journal of Consulting

and Clinical Psychology, 73, 914-923.

Warren, R., & Thomas, J. C. (2001). Cognitive-behavior therapy of obsessive-compulsive

disorder in private practice: An effectiveness study. Journal of Anxiety Disorders, 15,

277-285.

Weersing, V. R., & Weisz, J. R. (2002). Community clinic treatment of depressed youth:

Benchmarking usual care against CBT clinical trials. Journal of Consulting and Clinical


Weiss, B., Catron, T., & Harris, V. (2000). A 2-year follow-up of the effectiveness of traditional

child psychotherapy. Journal of Consulting and Clinical Psychology, 68, 1094-1101.

Weiss, B., Catron, T., Harris, V., & Phung, T. M. (1999). The effectiveness of traditional child

psychotherapy. Journal of Consulting and Clinical Psychology, 67, 82-94.

Weissman, M. M., & Bothwell, S. (1976). Assessment of social adjustment by patient self-

report. Archives of General Psychiatry, 33, 1111-1115.

Weisz, J. R., Donenberg, G. R., Han, S. S., & Weiss, B. (1995). Bridging the gap between

laboratory and clinical in child and adolescent psychotherapy. Journal of Consulting and

Clinical Psychology, 63, 688-701.

Weisz, J. R., & Weiss, B. (1989). Assessing the effects of clinic-based psychotherapy with

children and adolescents. Journal of Consulting and Clinical Psychology, 57, 741-746.


Weisz, J. R., Weiss, B., & Donenberg, G. R. (1992). The lab versus the clinic: Effects of child

and adolescent psychotherapy. American Psychologist, 47, 1578-1585.

Wells, M. G., Burlingame, G. M., Lambert, M. J., & Hoag, M. (1996). Conceptualization and

measurement of patient change during psychotherapy: Development of the Outcome

Questionnaire and Youth Outcome Questionnaire. Psychotherapy, 33, 275-283.

Westen, D., & Morrison, K. (2001). A multidimensional meta-analysis of treatments for

depression, panic, and generalized anxiety disorder: An empirical examination of the

status of empirically supported therapies. Journal of Consulting and Clinical Psychology,

69, 875-899.

Westen, D., Novotny, C. M., & Thompson-Brenner, H. (2004). The empirical status of

empirically supported psychotherapies: Assumptions, findings, and reporting in

controlled clinical trials. Psychological Bulletin, 130, 631-663.


Footnotes

¹ Treatment providers include providers who practice individually (i.e., individual

providers) and those who are in group practice (i.e., group providers). Group providers have an ID

solely for their group, and thus do not have ID numbers for each practicing provider within the

group.

² Due to a licensing agreement, the OQ-30 is named the Life Status Questionnaire (LSQ) at

PacifiCare Behavioral Health, Inc.


Table 1

Client Demographic Information of the Base and Subset HMO Data

                                                 Base HMO Data                 Subset HMO Data

Clients                  N (%)                   48,038 (100.00)               6,323 (13.16ᵃ)
Female                   n (%)                   32,713 (68.10)                4,514 (71.39)
Age                      M (SD, Range, Mdn)      39.75 (11.54, 18 - 96, 39)    39.99 (11.19, 18 - 86, 40)

Diagnosis
  Depression             n (%)                   9,024 (18.79)                 6,323 (70.07ᵃ)
  Adjustment             n (%)                   5,450 (11.35)                 -
  Anxiety                n (%)                   2,324 (4.84)                  -
  Other                  n (%)                   2,380 (4.95)                  -
  Unknown                n (%)                   28,860 (60.08)                -

Treatment
  Sessions               M (SD, Range, Mdn)      9.22 (9.10, 1 - 160, 6)       8.78 (8.61, 1 - 127, 6)

Provider Training Level
  Masters                n (%)                   19,185 (39.94)                1,884 (46.81)
  Doctoral               n (%)                   8,696 (18.10)                 975 (24.22)
  Medical                n (%)                   2,695 (5.61)                  99 (2.46)
  Other/Unknown          n (%)                   17,462 (36.35)                1,067 (26.51)

Medication
  Yes                    n (%)                   8,355 (17.39)                 3,645 (57.65)
  No                     n (%)                   9,986 (20.79)                 2,263 (35.79)
  Unknown                n (%)                   29,697 (61.82)                415 (6.56)

ᵃ Percentage of base HMO data.


Table 2

Provider Demographic Information of the Base and Subset HMO Data

                                                 Base HMO Data                 Subset HMO Data

Providers                N (%)                   6,007 (100.00)                2,001 (33.31ᵃ)

Individual Practiceᵇ     n (%)                   5,911 (98.40)                 1,920 (95.95)
  Female                 n (%)                   2,404 (40.67)                 879 (45.78)
  Male                   n (%)                   1,304 (22.06)                 450 (23.44)
  Gender Unknown         n (%)                   2,203 (37.27)                 591 (30.78)

Training Level
  Masters                n (%)                   2,300 (38.91)                 814 (42.40)
  Doctoral               n (%)                   1,231 (20.83)                 459 (23.91)
  Medical                n (%)                   166 (2.81)                    47 (2.45)
  Other/Unknown          n (%)                   2,214 (37.46)                 600 (31.25)

Years in Practice        M (SD, Range, Mdn)      22.41 (7.89, 4 - 53, 22)      22.05 (7.85, 4 - 50, 22)

ᵃ Percentage of total number of providers (i.e., N = 6,007). ᵇ Group practices do not have

individual IDs for their therapists, and thus all following demographics pertain to therapists in

individual practice unless otherwise noted.


Table 3

Subset HMO Data Benchmarking

                                                                           vs. Treatment Efficacy    vs. Natural History
Condition                N        Intake M (SD)    Last M (SD)      d         d_CV(TE)    p           d_CV(NH)    p

Overall                  6,323    63.58 (12.75)    54.19 (15.68)    0.7360    0.6538      < .0001     0.3429      < .0001

Practice
  Individual             4,025    63.43 (12.90)    55.08 (15.36)    0.6482    0.6597      .1617       -           -
  Group                  2,298    63.82 (12.48)    52.65 (16.13)    0.8950    0.6691      < .0001     -           -

Medication
  Concurrent             3,645    65.40 (12.91)    54.98 (16.17)    0.8071    0.6611      < .0001     -           -
  None                   2,263    60.54 (11.80)    52.75 (14.63)    0.6601    0.6693      .1055       -           -
  Unknown                415      64.08 (13.15)    55.15 (16.40)    0.6778    0.7221      .1757       -           -

Note. Client sample size, intake score means (standard deviations), last score means (standard deviations), effect sizes (i.e., d), critical

values (i.e., d_CV), and significance level (i.e., p) are for their respective conditions. Hyphen denotes analyses that were not conducted.
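
As a rough check on how the tabled effect sizes relate to the tabled means, the sketch below recomputes d for three rows of Table 3. It rests on two assumptions that this section does not state: that d is the intake-minus-last mean difference divided by the intake standard deviation, and that a small-sample correction of the form 1 - 3/(4n - 5) may be applied. The function name and the row selection are illustrative only. Because the tabled means are rounded to two decimals, agreement is expected only to roughly the third decimal place.

```python
def standardized_mean_change(intake_mean, last_mean, intake_sd, n=None):
    """Pre-post effect size, ASSUMED here to be (intake - last) / intake SD.

    If n is supplied, the value is multiplied by 1 - 3 / (4n - 5), a common
    small-sample correction (also an assumption, not taken from the text).
    """
    d = (intake_mean - last_mean) / intake_sd
    if n is not None:
        d *= 1.0 - 3.0 / (4.0 * n - 5.0)
    return d


# Rows copied from Table 3:
# (N, intake M, intake SD, last M, reported d, reported treatment-efficacy d_CV)
rows = {
    "Group": (2298, 63.82, 12.48, 52.65, 0.8950, 0.6691),
    "Concurrent": (3645, 65.40, 12.91, 54.98, 0.8071, 0.6611),
    "Unknown": (415, 64.08, 13.15, 55.15, 0.6778, 0.7221),
}

for label, (n, m_in, sd_in, m_last, d_reported, d_cv) in rows.items():
    d = standardized_mean_change(m_in, m_last, sd_in, n)
    # A reported d above its critical value corresponds to a small p in the table.
    print(f"{label}: recomputed d = {d:.4f}, reported d = {d_reported:.4f}, "
          f"reported d exceeds d_CV: {d_reported > d_cv}")
```

In the rows used here, the reported d exceeds the treatment efficacy critical value exactly where the tabled p is below .0001, and falls short of it where p = .1757.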


Table 4

Subset HMO Data Benchmarking by Provider Practice and Medication

                                                 Session
Medication               Client N    Intake M (SD)    Last M (SD)      d         d_CV      p

Individual Practice
  Concurrent             2,211       65.59 (13.15)    56.30 (15.82)    0.7065    0.6698    .0008
  No Medication          1,552       60.18 (11.72)    53.05 (14.24)    0.6082    0.6774    .7964
    (Concurrentᵃ)        1,092       65.46 (9.88)     56.49 (14.00)    0.9073    0.6865    < .0001
    (Groupᵇ)             1,453       61.32 (11.24)    53.81 (14.06)    0.6677    0.6790    .1038
  Unknown                262         64.58 (13.48)    56.78 (16.48)    0.5774    0.7466    .7914

Group Practice
  Concurrent             1,434       65.13 (12.54)    52.96 (16.49)    0.9698    0.6793    < .0001
  No Medication          711         61.32 (11.95)    52.08 (15.41)    0.7721    0.7001    .0005
  Unknown                153         63.22 (12.58)    52.36 (15.93)    0.8593    0.7843    .0008

ᵃ Initial OQ-30 severity matched with clients who are concurrently on medication and treated by individual providers. ᵇ Initial OQ-30

severity matched with clients who are not on medication and treated by group providers.
