

Research Article

Received 1 October 2008, Accepted 4 March 2010 Published online 25 May 2010 in Wiley Interscience

(www.interscience.wiley.com) DOI: 10.1002/sim.3945

Using Testlet Response Theory to analyze data from a survey of attitude change among breast cancer survivors

Xiaohui Wang,a∗†‡ Su Baldwin,b§ Howard Wainer,b,c¶‖ Eric T. Bradlow,c‖‡‡ Bryce B. Reeve,d∗∗ Ashley W. Smith,d†† Keith M. Bellizzi e‡ and Kathy B. Baumgartner f§§

In this paper we examine alternative measurement models for fitting data from health surveys. We show why a testlet-based latent trait model that includes covariate information, embedded within a fully Bayesian framework, can allow multiple simultaneous inferences and aid interpretation. We illustrate our approach with a survey of breast cancer survivors that reveals how the attitudes of those patients change after diagnosis toward a focus on appreciating the here-and-now, and away from consideration of longer-term goals. Using the covariate information, we also show the extent to which individual-level variables such as race, age and Tamoxifen treatment are related to a patient’s change in attitude.

The major contribution of this research is to demonstrate the use of a hierarchical Bayesian IRT model with covariates in this application area; hence a novel case study, and one that is certainly closely aligned with but distinct from the educational testing applications that have made IRT the dominant test scoring model. Copyright © 2010 John Wiley & Sons, Ltd.

Keywords: testlet; Bayesian; MCMC; local independence; breast cancer survey

1. Introduction

Surveys are typically constructed with two related but different goals in mind. The first, and the easiest to accomplish, is the retrieval of facts from the surveyed population (e.g. ‘How many cigarettes did you smoke in the last 24 hours?’). The second goal is to retrieve information about a more vague concept, which may be conceived as a ‘higher-order’ construct or factor as in [1], which cannot be directly addressed with a single, factual question. Instead we try to understand its nature with a sequence of questions, each of which, individually, sheds only a little light on the underlying variable of interest, but taken as a whole give us a fairly complete understanding of the construct of interest. For example, if we are interested in studying the level of health in a society we might ask questions that reflect differing amounts of robustness, and/or measure different aspects of health (e.g. physical, mental, short term versus long term, etc.).

a Department of Statistics, University of Virginia, Charlottesville, VA, U.S.A.
b National Board of Medical Examiners, Philadelphia, PA, U.S.A.
c The Wharton School of the University of Pennsylvania, Philadelphia, PA, U.S.A.
d National Cancer Institute, Bethesda, MA, U.S.A.
e University of Connecticut, Storrs, CT, U.S.A.
f University of Louisville, Louisville, KY, U.S.A.
∗ Correspondence to: Xiaohui Wang, Department of Statistics, University of Virginia, Charlottesville, VA, U.S.A.
† E-mail: [email protected]
‡ Assistant Professor.
§ Measurement Scientist.
¶ Distinguished Research Scientist.
‖ Professor.
∗∗ Psychometrician and Program Director.
†† Behavioral Scientist.
‡‡ Co-Director.
§§ Associate Professor.

Contract/grant sponsor: National Science Foundation; contract/grant number: DMS0631639
Contract/grant sponsor: National Security Agent; contract/grant number: GG11184


Copyright © 2010 John Wiley & Sons, Ltd. Statist. Med. 2010, 29 2028--2044


X. WANG ET AL.

In addition to questions about the issues of interest, surveys also usually try to characterize the surveyed population according to various demographic/background questions. Such information is included to help us understand any systematic variations in the responses to the questions asked as a function of those background variables (e.g. ‘Do Hispanics have more robust health than Blacks?’). These results can then be used for issues such as targeting interventions in underserved populations, or merely for descriptive purposes and reporting.

The last, and most technical, issue that one may face (in surveys and otherwise) is the inevitable relationship between how the data are collected and how they are analyzed: the impact of data collection on the measurement model. For example, survey questions are often administered as clumps of items with related content, which are termed ‘testlets’ in the educational testing literature by [2] and ‘blocks’ in the survey literature. These are groups of questions, each of which focuses on a single topic (e.g. pulmonary health, cardiac health, renal health). Instruments composed of testlets limit the choice of appropriate measurement models to those that do not depend on the assumption of conditional local independence among all of the items within the survey. Traditional models that comprise item response theory (IRT)¶¶ rely on this assumption and, if it is violated, tend to yield overly optimistic estimates of precision, as indicated in [6]. Instead, models that take the nested item structure into account are likely to provide more realistic estimates of the true information that is obtained from the respondents’ answers. It is this aspect of survey design that we focus on here, and most importantly its application to a survey of significant individual-level and societal value, that of breast cancer survivors.

We highlight the importance and practicality of our approach via a data set consisting of survey responses from women enrolled in the Health, Eating, Activity, and Lifestyle (HEAL) Study, a population-based, multi-center, multi-ethnic, prospective study of women newly diagnosed with in situ or Stages I to IIIA breast cancer, see [7, 8]. In 2005, 858 women completed a health-related quality of life questionnaire approximately 39 months post cancer diagnosis [9]. We excluded 53 women diagnosed with recurrent breast cancer or a second primary breast cancer by the date of these analyses, to focus on a population of individuals facing a potentially life-threatening event for the first time. This defined a cohort of 805 women. Subjects were also screened to ensure that complete data were available, as dealing with missing data was not our primary goal. This led to the exclusion of an additional 87 participants, which yielded a final sample size of 718 women for the analysis.‖‖

This study will examine the women’s responses to the Posttraumatic Growth Inventory (PTGI) developed by Tedeschi and Calhoun [10], which assesses the changes or transformation that people report experiencing following a traumatic event (i.e. breast cancer for this sample). The scale consists of 21 items with responses on a 0–5 point ordinal Likert scale, ranging from ‘no change’ to ‘very great change’.

In addition to the 21 PTGI survey items, there were also background questions asked about race, ethnicity, age, income, employment and marital status, whether the women were taking the drug Tamoxifen (a drug commonly used to help prevent breast cancer and its recurrence in women near or beyond menopause∗∗∗), and how long it had been since they were diagnosed. These questions were anticipated to be useful in assessing the differing model-based item characteristics across varying backgrounds.

The survey was designed so that the 21 items from which it was composed fell into five broad categories (testlets):

I. Relating to others
II. New possibilities
III. Personal strength
IV. Spiritual change
V. Appreciation of life

The exact survey questions and the testlets they fell into are shown in Table I below.

Because each of the 21 items falls into one of these five categories of change, each item is a manifest indicator of a latent factor of interest. The testlet items in this survey were not laid out contiguously (e.g. the items in testlet I are not administered as the first 7 items). Prior research has suggested that this may reduce the testlet dependence structure, as in [11], compared to one that was so ordered. While on the one hand this may decrease the need for a testlet-IRT model, as described here, it also makes the results we find conservative, because most surveys lay out items that are measuring the same construct contiguously. Furthermore, it raises the question of the tradeoff between a contiguous design, where the factor structure is maintained with a likely increase in testlet variance, and a non-contiguous design, where the opposite is likely.

¶¶ IRT is the dominant measurement paradigm in contemporary educational testing [3--5].
‖‖ As the data were collected through three different clinic centers, it is important to account for any differences among them. We include dummy variables in our analyses for this purpose.
∗∗∗ A detailed description can be found at http://www.breastcancer.org/tre_sys_tamox_idx.html.


Table I. The survey items shown within their testlet structure.

Testlet number   Item number   Text of item
I                6             Knowing that I can count on people in times of trouble
                 7             A sense of closeness with others
                 9             A willingness to express my emotion
                 12            Having compassion for others
                 15            Putting effort into my relationships
                 18            I learned a great deal about how wonderful people are
                 20            I accept needing others
II               19            I developed new interests
                 21            I established a new path for my life
                 13            I am able to do better things with my life
                 14            New opportunities are available which would not have been otherwise
                 2             I am more likely to try to change things which need changing
III              4             A feeling of self-reliance
                 8             Knowing I can handle difficulties
                 10            Being able to accept the way things work out
                 17            I discovered that I am stronger than I thought I was
IV               5             A better understanding of spiritual matters
                 16            I have a stronger religious faith
V                1             My priorities about what is important in life
                 3             An appreciation for the value of my own life
                 11            Appreciating each day

1.1. Hypotheses

Despite the recent reports of reductions in breast cancer deaths due to better screening and treatments, breast cancer remains the most frequent cancer diagnosed in females and the second leading cause of deaths from cancer after lung cancer, see [12]. Diagnosis and treatment of breast cancer have been shown by [13, 14] to have a great impact on a woman’s psychological, sexual, and physical functioning. Despite the negative impact, studies have also shown some improvements in the quality of life of long-term survivors of breast cancer [15]. The PTGI studied here was intended to be used to examine the changes in psycho-social functioning among the breast cancer survivors who were approximately three years post diagnosis.

The 21-item PTGI scale employed in this research, and its relationship with socio-economic and demographic factors, has been of considerable recent interest. These studies, as described below, used a combination of factor analytic methods to assess the five subscale dimensions, and regression methods to assess covariate differences on posttraumatic change. However, none of the approaches used in these studies allows for a coherent and simultaneous set of inferences regarding the scale items, their inherent groupings, and the people it differentially impacts. The hierarchical Bayesian approach employed here addresses that concern.

For instance, in a post hoc evaluation of the psychometric properties of the PTGI using a sample of undergraduate students, it was identified as having five factors [10]: relationships with others, new possibilities, appreciation of life, spirituality, and personal strength (as given in Table I). The five-factor solution, however, did not hold up in validation studies by [16, 17]. In [17], whose data consist of middle- to old-aged cardiovascular disease patients, it was suggested that a single summary score of the PTGI’s 21 items was to be preferred for its parsimony because it did not seem to lose information.

Using our data, an exploratory factor analysis suggested a four factor solution when using the criterion of accounting for 90 per cent of total variation, and the maximum likelihood method showed that six factors were significant at p = .05. This result of between four and six factors is consistent with the aforementioned extant research. The Bayesian testlet response model utilized here, we believe, can also shed further light on the issue of dimensionality, as the testlet-specific variance components will indicate the extent of excess local-scale dependence above and beyond treating the scale as a one-factor 21-item battery. The testlet model we present can be re-parameterized as a bi-factor model by putting the loadings on each dimension of the bi-factor model proportional to the loadings on the general dimension, see [18]. In this manner, the testlet model is a special case of a general multidimensional IRT model in [19] under the condition of independence of all dimensions, which was supported by our exploratory factor analysis.

We anticipated, a priori, that an examination of the factor structure of the PTGI scale in this study will yield a similar result; that a single growth factor would dominate given the similarity in the characteristics of the traumatic event to those patients in [16, 17]. However, we were concerned that there would be local dependency among items that clustered in the original factors of [10]; especially the spirituality factor, which has been shown to have the greatest impact following breast cancer in [15]. Therefore, we postulate that:

H1 The testlet variances for the 5 underlying subscales will be substantial, yet most significant for the spirituality factor.

Previous studies have also found that younger women report more posttraumatic change following breast cancer than older women, see [20--22]. Thus, we hypothesize that:

H2 Younger women will report more posttraumatic change in this study.

Earlier research has also found a relationship between ethnicity and posttraumatic change. An independent effect of ethnicity on reports of change was shown in [23, 24] for survivors of sexual assault and breast cancer, respectively. Specifically, African-American and Hispanic women reported higher levels of change than Non-Hispanic White women. Therefore, we expect that:

H3 A significant relationship between ethnicity and posttraumatic change exists; in particular, higher levels for African-American and Hispanic women.

Previous studies have also examined the association between marital status and posttraumatic change, see [25, 26]. Breast cancer survivors who were married or in a committed relationship reported more change than women who were single or widowed, see [20]. Being married has also been found to be associated with more change among bereaved parents due to the loss of a child, as indicated in [26]. In [20], it is suggested that women’s partners can offer a beneficial social support system to help cope with their disease. In the current analysis, we therefore hypothesized that:

H4 Having a long-term partner would be associated with more posttraumatic change.

There is no literature, however, that has examined the relationship between taking Tamoxifen (one of our measured variables as described previously) and posttraumatic change. Clinical variables were incorporated here as covariates to account for potential confounding due to cancer status and treatment effects. One could argue that those women who are on adjuvant††† therapy (e.g. Tamoxifen) are more actively involved in the management of their disease than those who are not on Tamoxifen. This might mean that the process of coping with breast cancer is different for women who are on Tamoxifen versus women who are not, because they are reminded of their disease every day due to their use of adjuvant therapy. These women may continue to experience more of an impact (both positive and negative) from their disease as a result. Thus, whereas there was no strong research-based hypothesis on the effects of Tamoxifen on posttraumatic change, we were interested in accounting for such clinical variables as adjuvant treatment. Hence, we posit that:

H5 Women taking Tamoxifen will report a larger posttraumatic change in their attitudes.

Within the IRT framework, each survey respondent’s position on a latent trait (i.e. change in attitudes following breast cancer) can be modeled without difficulty while dealing with the local dependence engendered by the survey construction. The contribution of this research is the incorporation of background information as covariates into the measurement model to help explain the results and test our hypotheses without relying on the assumption of local independence. Research incorporating covariates into the measurement model to explain the differences in ability is abundant. For example, [27] proposed a linear two-level mixed effects model that included covariates of ability. Owing to its linear nature, however, this model cannot be directly applied to the IRT models. In [28], a more general multi-level IRT model that is estimated through a Bayesian procedure is proposed. However, that model was based on the assumption of local independence. The measurement model used in this paper is a natural extension of the original TRT model as in [29], where covariates are also incorporated.

In the remainder of this paper we will describe both the method of analysis that we used and the results obtained. In addition, at various points in the discussion, we will point out the extent to which methods that fail to account for the structure of the data yield different results, and the substantive benefits of using the background information to help understand the effects of demographic and other factors on the PTGI results.

2. The measurement model

The PTGI was designed to assess perceived changes in one’s life following a traumatic event. In our mathematical framework, this can be operationalized by estimating the position of those surveyed on an underlying latent dimension (let us call this ‘changeability’‡‡‡) where those with a higher value on the latent dimension (reflecting more change) are more likely to give higher ordinal response scores as in [30]. Thus it is sensible to analyze these data with a measurement model that assumes the existence of such a dimension, recognizes the ordinal nature of the measured responses and the item testlet structure as specified earlier, and then subsequently examine the extent to which this model yields a good fit to the data. The model we used in the analysis of the data from this survey instrument accomplishes this and is derived from a more general set of models called testlet response theory (TRT) originally introduced in [29].

††† Adjuvant: a substance that, when added to a medicine, speeds or improves its action; that which aids another, such as an auxiliary remedy.

TRT is a family of models akin to parallel models in IRT, but for which the fundamental unit of analysis is the item nested within a testlet. It assumes conditional local independence, conditional on the unidimensional latent trait, the item testlet structure, and a set of parameters that govern the potential extra dependence of items nested within the same testlet. For items that are in different item categories (e.g. I = relating to others and II = new possibilities), TRT behaves as IRT; however, for items within the same testlet, the model accounts for a likely stronger association. These models have been extended in many ways since they were first introduced and are described in detail in [31].

The specific TRT model that we use differs from standard IRT not just in the relaxation of the assumption of local independence but also in the incorporation of covariates directly into the analysis, as in Chapter 11 of [31]. Including covariates as an integral part of the analysis has two principal advantages: one is science-based and the other relates to estimation. First, this directly allows us to go beyond the standard IRT result of ‘how strongly was this question endorsed’ or ‘to what extent does this item reflect changeability’ to ‘why’. For the goals of science this is of obvious importance. Second, by incorporating the covariates directly into the model, one does not have to do a potentially inappropriate post-analysis regression using point estimates of respondent latent locations as a dependent variable, which ignores the fact that they are estimates.

The estimation of the model is embedded within a fully Bayesian framework, where each parameter is associated with a prior and a hyperprior distribution. An important advantage of this framework is that it allows for sharing of information across items and people in a way that improves the precision. We next describe the details of our Bayesian model specification.

2.1. The model specification

The important feature of the model that we utilize is that it accounts for the ordinal nature of the data, by utilizing a polytomous probit response model of [30, 32] designed for TRT in [33]. In particular, we model the probability that respondent $i$ ($i = 1, \ldots, 718$ here) gives response score $r$ ($r = 0, \ldots, 5$) to survey item $j$ ($j = 1, \ldots, 21$ here), $Y_{ij}$, as

$$P(Y_{ij} = r) = \Phi(g_r - t_{ij}) - \Phi(g_{r-1} - t_{ij}), \qquad (1)$$

where $\Phi$ denotes the cumulative standard normal distribution function, $g_r$ denotes a latent cutoff for score $r$ such that $Y_{ij} = r$ if $g_{r-1} < t_{ij} < g_r$, and $t_{ij}$ is as given next and represents the TRT aspect of the work.

The model replaces the standard IRT linear predictor of changeability score, $t_{ij} = a_j(\theta_i - b_j)$, with its corresponding TRT version $t_{ij} = a_j(\theta_i - b_j - \gamma_{id(j)})$, where $\theta_i$ is the $i$th person’s location on the latent dimension and the $a_j$’s and $b_j$’s are analogous to the usual discrimination and difficulty parameters for item $j$ in educational testing. In particular, $b_j$ represents the general propensity for a given item (out of the 21 survey items) to receive higher ordinal scores than others, whereas $a_j$ is akin to a factor loading describing the degree to which item $j$ is correlated with the underlying changeability score. The additional term in the model, $\gamma_{id(j)}$, which incorporates the extra dependence when two items $j$ and $j'$ are in the same testlet, i.e. $d(j) = d(j')$, is indeed a person-by-item effect that is the same for a given person $i$ and all the items within one testlet. By utilizing this structure, two item responses for person $i$, $j$ and $j'$ ($j \neq j'$), which are both in testlet $d(j) = d(j')$, share the term $\gamma_{id(j)}$ in their latent linear predictor $t_{ij}$ and hence are more highly correlated under the model than items $j$ and $j'$ for which $d(j) \neq d(j')$ (as $\gamma_{id(j)}$ is assumed to be independent of $\gamma_{id(j')}$ for different testlets). For instance, in this study $d(1) = d(3)$, as both lie in category V, whereas $d(11) \neq d(19)$, as survey item 11 is in category V while item 19 is in category II.
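The ordinal probit probabilities in (1) and the testlet linear predictor can be illustrated with a short numerical sketch. The Python below is not the SCORIGHT implementation; the function names and all parameter and cutoff values are hypothetical, chosen only to show how the six category probabilities arise from one linear predictor and sum to one.

```python
import math

def phi_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear_predictor(a_j, b_j, theta_i, gamma_id=0.0):
    """t_ij = a_j * (theta_i - b_j - gamma_id); gamma_id = 0 gives the IRT form."""
    return a_j * (theta_i - b_j - gamma_id)

def category_probs(t_ij, cutoffs):
    """P(Y_ij = r) for r = 0..R, following equation (1).

    `cutoffs` holds the finite cutoffs g_0 < g_1 < ... < g_{R-1};
    g_{-1} = -inf and g_R = +inf are appended here.
    """
    g = [-math.inf] + list(cutoffs) + [math.inf]
    # g[r + 1] is g_r and g[r] is g_{r-1} because of the leading -inf entry.
    return [phi_cdf(g[r + 1] - t_ij) - phi_cdf(g[r] - t_ij)
            for r in range(len(g) - 1)]

# Hypothetical values for illustration only (not estimates from the paper).
cuts = [0.0, 0.5, 1.0, 1.5, 2.0]          # g_0 fixed at 0, g_1..g_4 free
t = linear_predictor(a_j=1.2, b_j=0.0, theta_i=0.5, gamma_id=0.3)
probs = category_probs(t, cuts)           # six probabilities for scores 0..5
```

Under this parameterization a larger linear predictor moves probability mass toward the higher response scores, which is the sense in which respondents with more ‘changeability’ tend to give higher ordinal responses.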

The likelihood of the complete data, $Y$, is then derived as

$$P(Y \mid \Omega) = \prod_i \prod_j \prod_r P(Y_{ij} = r)^{I(Y_{ij} = r)}, \qquad (2)$$

the product over all persons and items, where $I(Y_{ij} = r) = 1$ if person $i$ gave Likert scale score $r$ to ‘change survey item’ (item $j$) and 0 otherwise, and $\Omega$ indicates the full parameter space. We note that (2) above is a model-based approach to local dependence identification which is related to the work of [34, 35], which develop methods to detect local independence.
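As a rough sketch of how the likelihood above could be evaluated on the log scale, the Python below sums the log of the observed-category probability over persons and items. It assumes, for brevity, a single set of cutoffs shared by all items; the function name `log_likelihood`, its argument layout, and the toy inputs are hypothetical and are not taken from the paper or from SCORIGHT.

```python
import math

def phi_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def log_likelihood(Y, a, b, theta, gamma, testlet_of, cutoffs):
    """Log of the complete-data likelihood for an ordinal probit testlet model.

    Y[i][j]       observed score of person i on item j (0..R)
    a[j], b[j]    item parameters
    theta[i]      person location on the latent dimension
    gamma[i][d]   person-by-testlet effect for testlet index d
    testlet_of[j] testlet index d(j) of item j
    cutoffs       finite cutoffs g_0..g_{R-1} (assumed common to all items)
    """
    g = [-math.inf] + list(cutoffs) + [math.inf]
    total = 0.0
    for i, row in enumerate(Y):
        for j, r in enumerate(row):
            t = a[j] * (theta[i] - b[j] - gamma[i][testlet_of[j]])
            # Only the observed category contributes, because the indicator
            # I(Y_ij = r) is zero for every other r.
            p = phi_cdf(g[r + 1] - t) - phi_cdf(g[r] - t)
            total += math.log(p)
    return total

# Toy check: one person, one binary item, t = 0, so P(Y = 1) = 0.5.
ll = log_likelihood(Y=[[1]], a=[1.0], b=[0.0], theta=[0.0],
                    gamma=[[0.0]], testlet_of=[0], cutoffs=[0.0])
```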

‡‡‡ This play on words, ‘changeability’ as the juxtaposition of ‘change’ (as asked by the 0–5 Likert items in this survey) and ‘ability’, is intentional, to highlight its similarity to the ability dimension commonly used in educational testing.


To fully specify a Bayesian model, prior distributions need to be specified for the parameters that govern the likelihood given in (2). The Bayesian hierarchical structure that was employed here (and implemented in the computer program SCORIGHT§§§; details are in [31]) was

i. $\theta_i \sim N(x_\theta \beta_\theta, 1)$,
ii. $[\log(a_j), b_j] \sim N_2((\mu_a, \mu_b), \Sigma)$, (3)
iii. $\gamma_{id(j)} \sim N(0, \sigma^2_{d(j)})$

where $N_2(x, y)$ denotes a bivariate normal distribution with mean $x$ and covariance matrix $y$. (3.ii) is noteworthy because it explicitly allows $a_j$ and $b_j$ to be correlated, reflecting the common empirical pattern of items that are more discriminating (higher $a_j$) tending to be endorsed less (have a lower $b_j$).

The Bayesian TRT model specification was completed by using a conjugate prior distribution for the covariate slopes, a non-informative prior on the vector of means, an inverse-gamma prior on the testlet variances, and an inverse-Wishart hyperprior on $\Sigma$, respectively, to ensure proper posteriors. Extensive testing suggests relatively little sensitivity to the exact hyperprior information for this project; yet, it is something (especially with sparse data for some testlets) that all researchers should work at with caution. Further details are available upon request.

Obviously, the term $\gamma_{id(j)}$ is what yields the TRT model and thus when $\gamma_{id(j)} = 0$, we have a modified version of Samejima’s model in [36]; albeit a Bayesian one. Furthermore, if we allow $\sigma^2_a \to \infty$, $\sigma^2_b \to \infty$, and $\sigma^2_{d(j)} \to \infty$ we would obtain the standard two parameter probit model (without Bayesian shrinkage) and hence the standard (frequentist) IRT models are nested within this model.

2.2. Model identification

Without constraints, the model is in general non-identifiable. There are three sources of identification problems:

Additive aliasing I: $a_j(\theta_i - b_j - \gamma_{id(j)}) = a_j((\theta_i - \gamma_{id(j)} - d) - (b_j - d))$

Multiplicative aliasing II: $a_j(\theta_i - b_j - \gamma_{id(j)}) = \left(\frac{a_j}{s}\right)[s(\theta_i - b_j - \gamma_{id(j)})]$

Additive aliasing III: $g_r - t_{ij} = (g_r + d) - (t_{ij} + d)$

To solve the additive aliasing problem I, we constrain $\theta$ to have a mean of 0. Therefore, we have to constrain $x_\theta$ to be a set of mean-centered covariates with corresponding slopes $\beta_\theta$, and the variance of $\theta$ is set to 1. To solve the multiplicative aliasing problem, constraints also need to be imposed on the variance of either $\theta_i$ or $a_j$. We handle this problem by applying a prior distribution to $\theta_i$ with variance equal to 1 so that the posterior distribution is not subject to multiplicative aliasing.

To solve the additive aliasing problem III, for each item $j$, we fix $g_0 = 0$, with $g_{-1} = -\infty$ and $g_r = +\infty$ for the highest score $r$. Therefore, we only need to estimate $g_1, \ldots, g_{r-1}$, which need to be estimated for every item. By applying the non-informative prior on $g_k$, $k = 1, \ldots, r-1$, the fully conditional density of $g_k$ given $T_{ij}$, $y_{ij}$, $\theta_i$, $a_j$, $b_j$, $\gamma_{id(j)}$ and $\{g_m, m \neq k\}$ can be seen (up to a proportional constant) as a uniform distribution.
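The three sources of non-identifiability can be checked numerically: each transformation changes the parameters but leaves the quantity entering the likelihood unchanged. The Python below is only an illustrative sketch; the function names and the values of the parameters, the shift d and the scale s are arbitrary and hypothetical.

```python
import math

def t_ij(a, theta, b, gamma):
    """TRT linear predictor a * (theta - b - gamma)."""
    return a * (theta - b - gamma)

def aliasing_demo(a=1.3, theta=0.4, b=-0.2, gamma=0.1, g=0.7, d=2.5, s=3.0):
    """Return pairs (original, transformed) for the three aliasing problems."""
    base = t_ij(a, theta, b, gamma)
    # I: shift theta and b by the same constant d.
    additive_1 = t_ij(a, theta - d, b - d, gamma)
    # II: divide a by s and multiply theta, b and gamma by s.
    multiplicative = t_ij(a / s, s * theta, s * b, s * gamma)
    # III: shift the cutoff g and the linear predictor by the same constant d.
    cut_gap = g - base
    cut_gap_shifted = (g + d) - (base + d)
    return (base, additive_1), (base, multiplicative), (cut_gap, cut_gap_shifted)

pairs = aliasing_demo()
all_equal = all(math.isclose(x, y, abs_tol=1e-12) for x, y in pairs)
```

Because the data cannot distinguish the members of each pair, constraints such as a zero mean and unit variance for the latent trait and a fixed first cutoff are needed before the parameters can be interpreted.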

2.3. Bayesian computation

As the computational aspects of Bayesian IRT and TRT models are well-documented elsewhere (e.g. [31, 32, 37]), and are not the main focus of this research, we provide only a brief description of our computational approach. Inferences for the unknown model parameters given in (1)–(3) are derived from posterior samples obtained using Markov chain Monte Carlo (MCMC) procedures as in [38, 39].

The model was fit to the responses from the 21-item breast cancer survivor survey using MCMC methods, with two chains run from overdispersed starting values. For each chain, we used a burn-in period of M = 10 000 iterations, which is consistent with other research that has demonstrated this to be sufficient (e.g. [40]). The next 20 000 iterations were used for inferences, where we retained every twentieth draw (thinning) to reduce the high autocorrelation. The convergence of our MCMC sampler was assessed by noting that the potential scale reduction factors (see [41, 42]) for the prior and hyperprior parameters are all very close to 1.0. All the reported inferences here are thus based on the 2000 combined MCMC draws after convergence.
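The draw count and the convergence diagnostic described above can be sketched as follows. The Python below uses one common textbook form of the potential scale reduction factor (within-chain and between-chain variances for a scalar parameter); it is an assumption that this matches the exact variant used in [41, 42] or in SCORIGHT, and the simulated chains are synthetic stand-ins rather than output from the fitted model.

```python
import math
import random
from statistics import mean, variance

def retained_draws(n_chains, n_iterations, thin):
    """Number of draws kept after thinning, summed over chains."""
    return n_chains * (n_iterations // thin)

def potential_scale_reduction(chains):
    """Potential scale reduction factor for one scalar parameter.

    `chains` is a list of equal-length lists of post-burn-in draws.
    """
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])      # W
    between = n * variance(chain_means)               # B
    var_plus = (n - 1) / n * within + between / n     # pooled variance estimate
    return math.sqrt(var_plus / within)

# Two chains, 20 000 post-burn-in iterations each, keeping every 20th draw.
n_kept = retained_draws(n_chains=2, n_iterations=20000, thin=20)

# Synthetic chains for illustration only.
rng = random.Random(0)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(2)]
stuck = [[rng.gauss(0.0, 1.0) for _ in range(1000)],
         [rng.gauss(5.0, 1.0) for _ in range(1000)]]
r_mixed = potential_scale_reduction(mixed)   # close to 1 when chains agree
r_stuck = potential_scale_reduction(stuck)   # well above 1 when they do not
```

Values near 1.0 indicate that the between-chain variation is small relative to the within-chain variation, which is the pattern reported for the prior and hyperprior parameters.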

§§§SCORIGHT is a general purpose computer program for doing Bayesian computation for item response data that can be dichotomous, polytomous or a mixed version of the two, where items can be nested within testlets. SCORIGHT is available at no charge from the authors upon request.


Figure 1. Continuous residual εij versus linear predictor tij: the dotted lines are 99 per cent predictive bounds under the TRT model, i.e. εij should have an IID normal distribution.

2.4. Posterior predictive checks

We ran a series of posterior predictive checks, as demonstrated in [43], to detect any systematic differences between the model and the observed data and to verify that inferences from our Bayesian TRT model would be applicable to the examination of posttraumatic change among breast cancer survivors. The general idea is to check the model using discrepancy variables T(y, Ω), where Ω denotes the model parameters, by comparing the observed values with the replicated values with respect to the posterior distribution of Ω. In most cases, the posterior predictive p-value can be calculated as follows:

\[
P\{T(y^{\mathrm{rep}},\Omega) > T(y,\Omega) \mid y\} = \int P\{T(y^{\mathrm{rep}},\Omega) > T(y,\Omega) \mid \Omega\}\, f(\Omega \mid y)\, d\Omega .
\]

In our framework, it can be estimated by

\[
\frac{1}{2000}\sum_{l=1}^{2000} I\bigl(T(y^{\mathrm{rep}},\Omega_l) > T(y,\Omega_l)\bigr).
\]

We carried out posterior predictive checks using the following model-fit diagnostics.¶¶¶
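The Monte Carlo estimate of the posterior predictive p-value amounts to counting how often the replicated discrepancy exceeds the realized one. The sketch below uses a hypothetical sum-of-squares discrepancy and a four-draw toy posterior purely to make the counting explicit; none of these values come from the survey analysis.

```python
def posterior_predictive_p_value(discrepancy, y, replicated, draws):
    """Share of posterior draws for which T(y_rep, draw) exceeds T(y, draw)."""
    exceed = sum(
        1
        for y_rep, draw in zip(replicated, draws)
        if discrepancy(y_rep, draw) > discrepancy(y, draw)
    )
    return exceed / len(draws)

def squared_error(data, mu):
    """Hypothetical discrepancy: squared distance of the data from a mean mu."""
    return sum((value - mu) ** 2 for value in data)

# Toy inputs: observed data, four posterior draws of mu, and one
# replicated data set per draw (all values are illustrative only).
y_obs = [0.0, 1.0]
mu_draws = [0.0, 1.0, 0.5, 0.5]
y_reps = [[0.0, 0.0], [5.0, 5.0], [0.5, 0.5], [3.0, 3.0]]

p_toy = posterior_predictive_p_value(squared_error, y_obs, y_reps, mu_draws)
```

Here the replicated discrepancy is larger for the second and fourth draws only, so the estimate is 2/4. Values near 0 or 1 would signal that the replicated data systematically differ from the observed data on the chosen discrepancy.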

As Y_ij (the PTGI ordinal response score) is discrete, we first utilize the vector of latent continuous model residuals as in [44]. The latent continuous residuals ε_ij^l are drawn conditionally on simulated values of the model parameters Ω and the replicated data y^rep. If the model fits the data adequately, then the residual plot of ε_ij^l versus E(t_ij^l) should show no pattern and the qq-normal plot of ε_ij^l should fall on a straight line.

Figures 1 and 2 demonstrate for one simulated case that the residuals do not show any pattern with respect to the linear predictors and that the qq-plot versus the normal distribution falls very close to a straight line, well within the confidence bands. All the other simulated cases we examined follow the same pattern as the one shown in Figures 1 and 2.

Rather than looking at a single simulated case, Figure 3 shows 20 random draws from the 2000 simulated cases. The left panel of Figure 3 displays a qq-plot of 20 random draws of ε based on the original data, and the right panel displays a similar plot of 20 random draws of the residual ε based on the replicated data. The two plots are indistinguishable, indicating that the distribution of the realized residuals fits the assumptions of the model well.

Figure 4 shows the averaged continuous residuals over all patients for each question j, to assess model fit not in aggregate but at the individual-item level. The upper panel of Figure 4 shows the averaged continuous residual versus question number for 20 random draws for the observed data. The lower panel shows the corresponding plot for the replicated data. Again, the two are similar, indicating a reasonable fit.

¶¶¶A much larger set of diagnostics was run and is available upon request; the findings indicate excellent fit to the same degree as presented here.


Figure 2. QQ-plot of εij in Figure 1. The envelope shows confidence bands.

Figure 3. The left panel shows a qq-plot of 20 random draws of ε for the observed data; and the right panel shows 20 random draws for the replicated data. The similarity supports the fit of our model.

Finally, Figure 5 gives the scatter plot of the deviance for the observed data and for the replicated data. The Bayesian p-value is 22 per cent under 2000 simulations, suggesting no lack of fit of our model. Clearly, there are many other aspects of model fit that could be checked, but all the diagnostics we ran suggest no lack of fit, which might be due to the small sample size of the data.


Figure 4. The upper panel contains averaged values of εij over all patients for each question j; the 20 plots correspond to 20 random draws of parameters from the posterior distribution; the dashed lines are 95 per cent predictive bounds under the model. The lower panel is for draws from the predictive distribution.

Figure 5. Scatter plot of predictive versus realized deviance. The p-value is estimated to be about 22 per cent.

2.5. Model selection

As mentioned, due to the nested structure of the data, we utilized TRT models that capture excess local dependence instead of standard IRT models. Whether this more general model is needed is an empirical question. To assess this, we resort to two standard criteria.

The first is based on the mean absolute prediction error (MAPE), i.e. |y − y^{rep,l}| averaged over all posterior draws, l = 1, ..., 2000. We find that the posterior MAPE for the TRT model is 0.98 and for the IRT model it is 1.11 (Bayesian p-value < 0.01). Therefore, on this criterion, the TRT model is preferred by 13 per cent, a considerable amount given the associated standard error of 1.35 per cent.
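As a minimal sketch of this criterion, the function below averages the absolute difference between observed and replicated responses over all responses and all draws. The toy responses are hypothetical; the last lines only restate the gap between the two MAPE values reported above.

```python
def mean_absolute_prediction_error(y, replicated):
    """Average |y - y_rep| over all responses and all replicated data sets."""
    total, count = 0.0, 0
    for y_rep in replicated:
        for observed, predicted in zip(y, y_rep):
            total += abs(observed - predicted)
            count += 1
    return total / count

# Hypothetical toy example: three ordinal responses, two replicated data sets.
y_obs = [0, 3, 5]
y_reps = [[1, 3, 4], [0, 2, 5]]
mape_toy = mean_absolute_prediction_error(y_obs, y_reps)

# Gap between the reported posterior MAPE values (IRT minus TRT).
mape_irt, mape_trt = 1.11, 0.98
gap = round(mape_irt - mape_trt, 2)
```

The toy value is 3/6 = 0.5, and the reported values differ by 0.13 on the response scale.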


Table II. Deviance results for the cancer survey data.

Model    D̄         D(Ω̄)      pD       DIC

IRT      36 944    36 141    803      37 748
TRT      34 225    32 061    2 164    36 390

Figure 6. Posterior mean tracelines for the six response categories obtained for Item 1.

The second criterion is based on a Bayesian model selection rule: the Deviance Information Criterion (DIC) as in [45]. DIC combines the goal of improved model fitting with penalizing unnecessary model complexity. It is given by

\[
\mathrm{DIC} = \bar{D} + p_D,
\]

where, with Ω denoting the model parameters and Ω̄ their posterior mean,

\[
\bar{D} = E_{\Omega \mid Y}[-2\log P(Y \mid \Omega)],
\qquad
p_D = E_{\Omega \mid Y}[-2\log P(Y \mid \Omega)] + 2\log P(Y \mid \bar{\Omega}),
\]

with D̄ serving as a Bayesian measure of model adequacy and pD serving as the penalty term that measures the complexity of the model, which is the difference between the posterior mean deviance and the deviance at the posterior mean. The better model has the smaller value of DIC.

Table II shows the deviance for the breast cancer survivor data. A comparison of DIC for the IRT and TRT models suggests that the TRT model is better in terms of overall fit. If we consider an IRT model as a fixed effects model, we would expect that pD should be approximately equal to the true number of independent parameters. Combining the θ_i of all patients, the item parameters, and the cutoffs for each item, the total number of parameters is equal to 823, which is close to the pD (803) of the IRT model.
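The entries of Table II can be checked with a few lines of arithmetic: pD is the posterior mean deviance minus the deviance at the posterior mean, and DIC adds pD back to the posterior mean deviance. The function name is illustrative; the inputs are the rounded values printed in the table.

```python
def dic_summary(mean_deviance, deviance_at_posterior_mean):
    """Return (pD, DIC) from the posterior mean deviance and the plug-in deviance."""
    p_d = mean_deviance - deviance_at_posterior_mean
    return p_d, mean_deviance + p_d

irt_pd, irt_dic = dic_summary(36_944, 36_141)
trt_pd, trt_dic = dic_summary(34_225, 32_061)
trt_preferred = trt_dic < irt_dic
```

The recomputed pD values match the table exactly (803 and 2164). The recomputed DIC values, 37 747 and 36 389, are each one unit below the tabled 37 748 and 36 390, which is what one would expect if the table entries were rounded from unrounded deviances. Either way the TRT model has the smaller DIC.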

3. Model results

One standard inference that is obtainable from an IRT (or TRT) model for polytomous data is the set of probability curves (often called tracelines) describing how, for each item, the probability of giving response category r varies with the latent trait θ. For example, the probability of each possible response, r = 0, 1, ..., 5, to item 1 can be represented graphically by the six posterior (pointwise) mean tracelines shown in Figure 6. From this, we can see that


Figure 7. The expected score curve for item 1 allows us to summarize the analog to difficulty for a polytomous item with the point on the x-axis at which the expected score reaches its 50 per cent point.

the responses to categories 0, 1, and 2 are all close together, suggesting that these categories, as a function of latent changeability, do not really differentiate much among responders, whereas response categories 3, 4, and 5 are better differentiated.

We derived similar figures for all 21 items. The study of these figures yields insights into the functioning of the survey instrument (items that have good ‘separation’ in the observed range of changeability scores θ are ‘good’ items in some sense). They also provide information regarding the impact of each item on the respondents. For binary items this is easily characterized by the item discrimination parameter a_j, but for polytomous items the relationship is more complex.

Item tracelines also provide the raw material for an often useful summary. We can take the various component tracelines and combine them to yield an expected score curve for item j, E(Y_.j) = Σ_r r·P(Y_.j = r) = 0·P(0) + 1·P(1) + ··· + 5·P(5). The expected score curve for Item 1 is shown in Figure 7. We have drawn on Figure 7 a horizontal line at the 50 per cent point of expected score (an expected score of 2.5) and indicated the value of the changeability score, θ = −0.09, to which it corresponds. We see that a respondent whose changeability score is approximately 0 will have an expected response score to item j of r = 2.5. Items that are less frequently cited as characterizing change would be offset to the right (i.e. a person would need to be more changeable to endorse them) and those items that are more frequently cited would be offset to the left. We will summarize all items by this parameter (the value of θ which yields an expected score of 2.5) to ease comparisons among them (Table III).
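The construction of an expected score curve and its 50 per cent point can be sketched as follows. The sketch assumes a probit graded response form with the testlet effect set to zero; the discrimination, difficulty and cutoff values are hypothetical and are not the estimates for any PTGI item.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def category_probabilities(theta, a, b, cutoffs):
    """Tracelines P(Y = r | theta), r = 0..len(cutoffs), for one item.

    Assumed form: latent score t = a * (theta - b), testlet effect omitted,
    with P(Y <= r) = Phi(cutoffs[r] - t) and increasing cutoffs.
    """
    t = a * (theta - b)
    cumulative = [0.0] + [normal_cdf(g - t) for g in cutoffs] + [1.0]
    return [cumulative[r + 1] - cumulative[r] for r in range(len(cutoffs) + 1)]

def expected_score(theta, a, b, cutoffs):
    """E(Y | theta) = sum over r of r * P(Y = r | theta)."""
    probs = category_probabilities(theta, a, b, cutoffs)
    return sum(r * p for r, p in enumerate(probs))

def fifty_per_cent_point(a, b, cutoffs, lo=-6.0, hi=6.0, tol=1e-9):
    """Bisection for the theta at which the expected score is half the maximum."""
    target = len(cutoffs) / 2.0  # 2.5 for a six-category item scored 0..5
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_score(mid, a, b, cutoffs) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical parameters for a six-category item (first cutoff fixed at 0).
a_j, b_j = 1.2, -1.0
g_j = [0.0, 0.6, 1.2, 1.9, 2.7]
theta_50 = fifty_per_cent_point(a_j, b_j, g_j)
```

Because the expected score is increasing in θ when the discrimination is positive, bisection is enough to locate the 50 per cent point; repeating this for every item gives a single location per item on the changeability scale.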

Similar analyses were run for all of the items and the 50 per cent points thus obtained are shown in Table III along with an abbreviated version of the text of the items. Table III is ordered by the 50 per cent points from the most highly endorsed to the least, and the items are spaced by leaving horizontal gaps in the table, based upon the apparent gaps in the item statistics, see [46]. We see that the items where respondents indicated the most change since being diagnosed (1, 3, and 11) are all from the ‘appreciation of life’ category. The items that were next in terms of change had to do with an increased self-awareness (2 and 17) and a greater appreciation for other people (6, 7, 12, 18). At the other end, we see that these breast cancer survivors acknowledge much less change in items that reflect moving on to ‘new activities’ (14, 19, 21), and only a little more change was acknowledged in topics of religion (16) and acceptance of one’s fate (10).

Showing these results as an annotated stem and leaf diagram (in Table IV) makes clearer both the separation among survey items and the differences in the location of the testlets, one of our primary goals.

With these insights in hand, and the methods used to obtain them, we now move on to the second major goal of this research, describing the ‘whys’.


Table III. The survey items and their 50 per cent points on the changeability scale.

Item number   Item                                                                   50 per cent point

3             An appreciation for the value of my own life                           −0.58
11            Appreciating each day                                                  −0.31
1             My priorities about what is important in life                          −0.29

6             Knowing that I can count on people in times of trouble                 −0.09
18            I learned a great deal about how wonderful people are                  −0.05
12            Having compassion for others                                           −0.01
17            I discovered that I am stronger than I thought I was                    0.00
5             A better understanding of spiritual matters                             0.01
8             Knowing I can handle difficulties                                       0.02
7             A sense of closeness with others                                        0.06
2             I am more likely to try to change things that need changing             0.08

4             A feeling of self-reliance                                              0.18
10            Being able to accept the way things work out                            0.24
16            I have a stronger religious faith                                       0.29
13            I am able to do better things with my life                              0.30
9             A willingness to express my emotion                                     0.36
15            Putting effort into my relationships                                    0.39

20            I accept needing others                                                 0.56
21            I established a new path for my life                                    0.60
19            I developed new interests                                               0.75
14            New opportunities are available which would not have been otherwise     1.03

Table IV. The 50 per cent points on the changeability scale for each item within different testlets.

Testlet number     50 per cent point   Item number     Two testlet names

V                  −0.6                3               Appreciation
                   −0.5                                of
                   −0.4                                life
V, V               −0.3                11, 1
                   −0.2
I, I               −0.1                6, 18
I, III, IV, III     0                  12, 17, 5, 8
I, II               0.1                7, 2
III, III            0.2                4, 10
IV, II              0.3                16, 13
I, I                0.4                9, 15
                    0.5
I, II               0.6                20, 21
II                  0.7                19              New
                    0.8                                possibilities
                    0.9
II                  1                  14

4. Using covariates to understand why

We can use the covariates associated with each person as one source of explanatory information to help us understand what underlying factors may account for why some breast cancer survivors responded the way they did. Specifically, we might suspect that age (Hypothesis 2) and the type of medical treatment (Hypothesis 5)‖‖‖ are related to changeability. Our Bayesian analysis of θ_i ∼ N(x_θ β_θ, 1) as given in (3) allows us to provide these results directly.

‖‖‖This study was not designed to allow us to discover the direction of the causal arrow; for example, we do not know whether patients changed more because they took Tamoxifen or whether changeable patients were more likely to take Tamoxifen.


Figure 8. The posterior density of the regression weight on the covariate ‘Tamoxifen’. Virtually all of the mass is above zero.

Table V. The posterior means of coefficients for the covariates.

Variable                  Coefficient   S.E. of coefficient   Posterior prob

Clinic center 1           −0.02         0.24                  0.47
Clinic center 2            0.00         0.11                  0.50

Age                       −0.03         0.01                  0.00
Tamoxifen                  0.31         0.08                  0.00
Months since diagnosis    −0.03         0.02                  0.04
White                     −0.21         0.22                  0.16

Working                    0.16         0.10                  0.04
Hispanic                   0.19         0.13                  0.06
Income                     0.05         0.03                  0.10
Married                   −0.07         0.10                  0.23

4.1. Using the posterior distributions of the covariate coefficients

The posterior distribution of a parameter can be used to construct a histogram-based estimate for the unknown parameter. In Figure 8, we show the posterior distribution of the coefficient β_θ associated with the covariate ‘Took Tamoxifen’. The regression-based analysis shown in Table V strongly suggests that this is a significant predictor of changeability. Looking at the posterior distribution confirms this conclusion by showing us that virtually all the mass of the posterior lies above the value of zero. That is, the posterior probability that the ‘Took Tamoxifen’ coefficient is greater than 0 is close to 1, which confirms Hypothesis 5.

Examining the posterior distributions of the coefficients of three other covariates that were shown to be significant in exploratory regressions (age, months since diagnosis, white) shows a similar result.

Next, we look at the posterior distributions for two covariates (Hispanic and Married) that were not significant in an exploratory regression. Figure 9 contains the posterior for the coefficient associated with the covariate ‘Married’. We see immediately that the value of zero lies near the middle of the distribution and hence we can conclude that marital status has little to do with a woman’s ordinal Likert responses to the PTGI survey. Again the Bayesian analysis supports the inferences we made from the exploratory post hoc regression and disconfirms Hypothesis 4.

Finally, in Figure 10, we examine the posterior distribution for the covariate ‘Hispanic’ and we see that 91 per cent of the distribution lies to the right of zero. We would be remiss if we concluded that being Hispanic was unrelated to survey responses. Here the conclusions from a Bayesian analysis are at odds with those drawn from a simple traditional analysis.

The posterior distribution for ‘Working’ (not shown) closely resembles that of ‘Hispanic’ and suggests including it in our inferences, whereas the posterior for ‘Income’ resembles that of ‘Married’ and can likely be excluded.
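Statements such as ‘virtually all of the mass is above zero’ are summaries of the retained MCMC draws of a coefficient. The sketch below shows the computation; because the actual draws are not reproduced here, it substitutes normal draws centered at the Tamoxifen coefficient and standard error from Table V, which is an assumption made only for illustration.

```python
import random

def share_above_zero(draws):
    """Fraction of posterior draws that are strictly greater than zero."""
    return sum(1 for d in draws if d > 0) / len(draws)

# Deterministic toy check: three of four draws are positive.
toy_share = share_above_zero([-1.0, 0.5, 2.0, 3.0])

# Stand-in for 2000 retained draws of the Tamoxifen coefficient
# (normal approximation with mean 0.31 and s.d. 0.08; assumed, not actual output).
rng = random.Random(0)
tamoxifen_draws = [rng.gauss(0.31, 0.08) for _ in range(2000)]
tamoxifen_share = share_above_zero(tamoxifen_draws)
```

Under this approximation essentially every draw is positive, in line with the description of Figure 8. The same function applied to the real draws for any covariate gives the mass above zero quoted in the figure captions.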


Figure 9. The posterior density of the regression weight on the covariate ‘Married’; 38 per cent of the mass is above zero.

Figure 10. The posterior density of the regression weight on the covariate ‘Hispanic’; 91 per cent of the mass is above zero.

To summarize our empirical findings, we can report that:

H2 Confirmed; young women do report more posttraumatic change than older women.
H3 Partially confirmed; Hispanic women report greater posttraumatic change than other women, but African-American women do not.
H4 Disconfirmed; married women do not report greater posttraumatic change than other women.
H5 Confirmed; women taking Tamoxifen report greater posttraumatic change than other women.

5. What was the effect of local dependence?

The model we fit allowed dependence within testlets by incorporating the testlet effects, γ_id(j), described in Section 2.1, but it does not require the kind of all-or-nothing decision that usually follows the examination of eigenvalues that is the hallmark of principal components or factor analysis. Instead our model characterizes the size of the excess local dependence that would accompany a multiple factor structure. We can then examine the distribution of the parameter(s) that represent local dependence and see how large it is. In addition we can look at the posterior distribution of the parameters that represent the individual respondents and see how different it would be if we ignored the testlet structure. This is precisely what we did.


Figure 11. The posterior densities of the testlet parameter γ for all five testlets.

Figure 12. The posterior densities of the person parameter θ (changeability) for person 1, shown for both models: the usual IRT model that assumes local independence, and the testlet model that allows within-testlet dependence.

If these same data were fit with the analogous IRT (not TRT) model that assumes local independence, it is sensible to ask, as an empirical demonstration of our approach, how different the results would have been. The answer, for point estimates of the parameters, is ‘not very different’. This is reassuring, as IRT is often used when its assumptions are clearly being violated. The place where unmodeled local dependence affects results is in the estimates of the parameter uncertainties. If we assume independence when it is not true, we do not have as much information as we might have thought.

In Figure 11 we show the posterior distributions of the testlet variance parameters. If the variance of γ is zero there is no local dependence; the extent to which it is greater than zero is a measure of the local dependence. The way that the variance of γ is estimated ensures that it is always positive, but to be meaningful it must be substantially greater than zero. As we can see, although it is greater than zero for all five testlets, it is only substantially greater for testlet IV (p < 0.001). Thus we confirm Hypothesis 1 that the greatest amount of local dependence is found in the Spirituality testlet, testlet IV. Note that the variance of γ is on the same scale as the variance of θ, which equals one, so that the effect of γ is between one-tenth and one-half of the effect of individual differences in changeability.

Obviously there is local dependence within testlets. But how much will this dependence affect our inferences if we neglect to model it? To illustrate this, let us consider the posterior distributions of θ, the latent changeability trait, for a typical respondent under two models that are identical except that one assumes local independence (IRT) and one does not (TRT). As is evident (in Figure 12), they are both centered in the same place, but the TRT posterior is more spread out.


The IRT model tells us that we have measured with greater precision than is, in fact, the case. For this survey the variance of the posterior is underestimated by about 50 per cent when averaged over all respondents. This result firmly supports one of the conclusions of [10] and the size of the effect is on a par with those found in the prior research by [33].

6. Discussion

This study describes in some detail how the results from a survey instrument can be understood using the most modern of contemporary statistical machinery. In a single coherent analysis we have incorporated:

(i) A method to estimate each person’s location on the latent dimension that underlies the 21 items on the survey;
(ii) The testlet structure of the survey instrument;
(iii) The covariates that were gathered to illuminate some of the reasons why the different respondents answered the way they did.

We found, unsurprisingly, that breast cancer survivors report that since their diagnosis they have become more focused on events and people temporally close at hand, and fewer report following new paths in life or new interests. We cannot say that the diagnosis is a cause of this response without a proper control group (and we have no access to data that would allow us to compute estimates of change prior to diagnosis), but that would seem to be a plausible working hypothesis. We also cannot tell which direction the causal arrow is pointing after uncovering the relationship between ‘taking Tamoxifen’ and reporting a change in attitudes. Such a causal link could be examined with suitable longitudinal data and a measurement model such as the one we described here. Our hope is that analyses such as this extend the toolbox that survey researchers consider when they would like to answer questions on survey design, factor structure, and the ‘whys’ simultaneously.

Furthermore, an area of important and ongoing research is how to determine the correlates of prominent testlet effects. If these effects were better understood, they could be controlled a priori during test development. SCORIGHT is designed to help understand the relationship between testlet effects and their covariates by allowing the testlet variances to be predicted from covariates such as the number of items in a testlet. We believe that investigation of testlet effects is of substantive importance for any scientific endeavor.

Acknowledgements

This work was supported by the National Board of Medical Examiners and the Wharton Interactive Media Initiative. The research of Xiaohui Wang was partially supported by the National Science Foundation grant DMS0631639 and the National Security Agent grant GG11184. We are delighted to take this opportunity to express our gratitude. The data were supplied by the National Cancer Institute.

References

1. Spearman C. ‘General intelligence’ objectively determined and measured. American Journal of Psychology 1904; 15:201--293.
2. Wainer H, Kiely G. Item clusters and computerized adaptive testing: the case for testlets. Journal of Educational Measurement 1987; 24:189--205.
3. Lord FM. Applications of Item Response Theory to Practical Testing Problems. Lawrence Erlbaum Associates: Hillsdale, NJ, 1980.
4. Rasch G. Probabilistic Models for some Intelligence and Attainment Tests. Danmarks Paedagogiske Institut: Copenhagen, 1960 (Republished in 1980 by the University of Chicago Press, Chicago).
5. Thissen D, Wainer H. Test Scoring. Lawrence Erlbaum Associates: Hillsdale, NJ, 2001.
6. Wainer H, Thissen D. How is reliability related to the quality of test scores? What is the effect of local dependence on reliability? Educational Measurement: Issues and Practice 1996; 15(1):22--29.
7. McTiernan A, Rajan KB, Tworoger SS, Irwin M, Bernstein L, Baumgartner R, Gilliland F, Stanczyk FZ, Yasui Y, Ballard-Barbash R. Adiposity and sex hormones in postmenopausal breast cancer survivors. Journal of Clinical Oncology 2003; 21(10):1961--1966.
8. Irwin ML, McTiernan A, Bernstein L, Gilliland FD, Baumgartner R, Baumgartner K, Ballard-Barbash R. Physical activity levels among breast cancer survivors. Medicine and Science in Sports and Exercise 2004; 36(9):1484--1491.
9. Bowen DJ, Alfano CM, McGregor BA, Kuniyuki A, Bernstein L, Meeske K, Baumgartner KB, Fetherolf J, Reeve BB, Smith AW, Ganz PA, McTiernan A, Barbash RB. Possible socioeconomic and ethnic disparities in quality of life in a cohort of breast cancer survivors. Breast Cancer Research and Treatment 2007; 106(1):85--95.
10. Tedeschi RG, Calhoun LG. The Posttraumatic growth inventory: measuring the positive legacy of trauma. Journal of Traumatic Stress 1996; 9(3):455--472.
11. Bradlow ET, Fitzsimons GJ. Subscale distance and item clustering effects in self-administered surveys: a new metric. Journal of Marketing Research 2001; XXXVIII:254--261.
12. Jemal A, Siegel R, Ward E, Murray T, Xu J, Thun MJ. Cancer statistics, 2007. CA: A Cancer Journal for Clinicians 2007; 57:43--66.
13. Arndt V, Stegmaier C, Ziegler H, Brenner H. A population-based study of the impact of specific symptoms on quality of life in women with breast cancer 1 year after diagnosis. Cancer 2006; 107:2496--2503.


14. Hartl K, Janni W, Kastner R, Sommer H, Strobl BR, Stauber M. Impact of medical and demographic factors on long-term quality of life and body image of breast cancer patients. Annals of Oncology 2003; 14:1064--1071.
15. Ganz PA, Desmond KA, Leedham B, Rowland JH, Meyerowitz BE, Belin TR. Quality of life in long-term, disease-free survivors of breast cancer: a follow-up study. Journal of the National Cancer Institute 2002; 94:39--49.
16. Ho SMY, Chan CLW, Ho RTH. Posttraumatic growth in Chinese cancer survivors. Psycho-Oncology 2003; 13:377--389.
17. Sheikh AI, Marotta SA. A cross-validation study of the posttraumatic growth inventory. Measurement and Evaluation in Counseling and Development 2005; 38:66--78.
18. Li Y, Bolt DM, Fu J. A comparison of alternative models for testlets. Applied Psychological Measurement 2006; 30:3--21.
19. de la Torre J, Patz RJ. Making the most of what we have: a practical application of multidimensional IRT in test scoring. Journal of Educational and Behavioral Statistics 2005; 30:295--311.
20. Bellizzi KM. Expressions of generativity and posttraumatic growth in adult cancer survivors. International Journal of Aging and Human Development 2004; 58:247--267.
21. Cordova MJ, Cunningham LLC, Carlson CR, Andrykowski MA. Posttraumatic growth following breast cancer: a controlled comparison study. Health Psychology 2001; 20(3):176--185.
22. Bower JE, Meyerowitz BE, Desmond KA, Bernaards CA, Rowland JH, Ganz PA. Perceptions of positive meaning and vulnerability following breast cancer: predictors and outcomes among long-term breast cancer survivors. Annals of Behavioral Medicine 2005; 29(3):236--245.
23. Frazier P, Tashiro T, Berman M, Steger M, Long J. Correlates of levels and patterns of positive life changes following sexual assault. Journal of Consulting and Clinical Psychology 2004; 72(1):19--30.
24. Tomich PL, Helgeson VS. Is finding something good in the bad always good? Benefit finding among women with breast cancer. Health Psychology 2004; 23(1):16--23.
25. Bellizzi KM, Blank TO. Predicting posttraumatic growth in breast cancer survivors. Health Psychology 2006; 25(1):47--56.
26. Polatinsky S, Esprey Y. An assessment of gender differences in the perception of benefit resulting from the loss of a child. Journal of Traumatic Stress 2000; 13:709--718.
27. Adams RJ, Wilson M, Wu M. Multilevel item response models: an approach to errors in variables regression. Journal of Educational and Behavioral Statistics 1997; 22:47--76.
28. Fox J-P, Glas CAW. Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika 2001; 66:271--288.
29. Bradlow ET, Wainer H, Wang X. A Bayesian random effects model for testlets. Psychometrika 1999; 64:153--168.
30. Samejima F. Homogeneous case of the continuous response level. Psychometrika 1973; 38:203--219.
31. Wainer H, Bradlow ET, Wang X. Testlet Response Theory. Cambridge University Press: New York, 2007.
32. Albert JH, Chib S. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 1993; 88:669--679.
33. Wang X, Bradlow ET, Wainer H. A general Bayesian model for testlets: theory and applications. Applied Psychological Measurement 2002; 26(1):109--128.
34. Ip EH. Testing for local dependency in dichotomous and polytomous item response model. Psychometrika 2001; 66:109--132.
35. Zhang J, Stout W. The theoretical detect index of dimensionality and its application to approximate simple structure. Psychometrika 1999; 64:213--249.
36. Samejima F. Estimation of latent ability using a response pattern of graded scores. Psychometrika Monographs (Whole No. 17), 1969.
37. Patz RJ, Junker BW. A straightforward approach to Markov Chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics 1999; 24:146--178.
38. Gelfand AE, Hills SE, Racine-Poon A, Smith AFM. Illustration of Bayesian inference in normal data models using Gibbs sampling. Journal of the American Statistical Association 1990; 85:972--985.
39. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. Chapman & Hall: London, 1995.
40. Sinharay S. Assessing fit of unidimensional item response theory models using a Bayesian approach. Journal of Educational Measurement 2005; 42:375--394.
41. Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Statistical Science 1992; 7:457--511.
42. Brooks S, Gelman A. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics 1998; 7(4):434--455.
43. Gelman A, Meng X, Stern H. Posterior predictive model assessment via realized discrepancies. Statistica Sinica (with Discussion) 1996; 6:733--807.
44. Albert JH, Chib S. Bayesian residual analysis for binary regression models. Biometrika 1995; 78:637--644.
45. Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B 2002; 64(4):583--639.
46. Wainer H, Schacht S. Gapping. Psychometrika 1978; 43:203--212.
