

Rasch Modeling with Small Samples: A Review of the Literature

Some authors [1,2] have recently advocated an early-stage quantitative analysis, using a small additional sample, during the scale development process.

Purported benefits of early-stage, small-sample quantitative analyses:

1) Flag quantitatively problematic items which may not appear problematic from qualitative results

2) Avoid measurement gaps

3) Ensure a better match between the measure and the target population

4) Allow for a preliminary check of the measurement properties of the scale

Are these benefits attainable in practice?

• These benefits would be extremely useful, but we question whether they are attainable in practice, based on several factors identified in a review of the empirical research.

Performance of models with small samples

• The available empirical studies of small-sample analyses are not promising: they show poor recovery of both item and person parameters [3,4]. Most parameter recovery studies do not even investigate conditions with N as small as that advocated for early analyses (i.e., N < 100) [5-14].
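What a parameter recovery study measures can be illustrated with a toy simulation. The sketch below is not taken from any of the cited studies. It assumes a dichotomous Rasch model, 10 hypothetical items with difficulties spread evenly from -1.5 to 1.5 logits, persons drawn from N(0, 1), and a bare-bones joint maximum likelihood estimator chosen for brevity (no bias correction). All function names are illustrative.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate(n_persons, difficulties, rng):
    """Dichotomous Rasch responses for persons drawn from N(0, 1)."""
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)
        data.append([1 if rng.random() < logistic(theta - b) else 0
                     for b in difficulties])
    return data


def jmle_difficulties(data, n_iter=50):
    """Joint ML item difficulties (mean-centered); None if not estimable."""
    n_items = len(data[0])
    # Persons with zero or perfect raw scores have no finite estimate.
    rows = [r for r in data if 0 < sum(r) < n_items]
    item_scores = [sum(r[i] for r in rows) for i in range(n_items)]
    if not rows or any(s == 0 or s == len(rows) for s in item_scores):
        return None
    counts = {}  # number of persons at each raw score
    for r in rows:
        counts[sum(r)] = counts.get(sum(r), 0) + 1
    theta = {score: 0.0 for score in counts}  # one ability per raw score
    b = [0.0] * n_items
    for _ in range(n_iter):
        # One Newton step per raw-score group, limited to 1 logit.
        for score in theta:
            p = [logistic(theta[score] - bi) for bi in b]
            info = max(sum(pi * (1.0 - pi) for pi in p), 1e-6)
            theta[score] += max(-1.0, min(1.0, (score - sum(p)) / info))
        # One Newton step per item, limited to 1 logit.
        for i in range(n_items):
            p = {s: logistic(theta[s] - b[i]) for s in counts}
            expected = sum(counts[s] * p[s] for s in counts)
            info = max(sum(counts[s] * p[s] * (1.0 - p[s]) for s in counts), 1e-6)
            b[i] += max(-1.0, min(1.0, (expected - item_scores[i]) / info))
        mean_b = sum(b) / n_items
        b = [bi - mean_b for bi in b]  # identify the scale: mean difficulty 0
    return b


def recovery_rmse(n_persons, n_reps=100, seed=2024):
    """RMSE of estimated vs. generating difficulties across replications."""
    rng = random.Random(seed)
    true_b = [-1.5 + 3.0 * i / 9 for i in range(10)]  # hypothetical, mean 0
    sq_errors = []
    for _ in range(n_reps):
        est = jmle_difficulties(simulate(n_persons, true_b, rng))
        if est is None:
            continue  # skip data sets with an inestimable item
        sq_errors.extend((e - t) ** 2 for e, t in zip(est, true_b))
    return math.sqrt(sum(sq_errors) / len(sq_errors))


if __name__ == "__main__":
    for n in (30, 100, 500):
        print(n, round(recovery_rmse(n), 3))
```

Running it with N = 30, 100, and 500 gives a rough sense of how recovery error in logits shrinks as N grows under these idealized conditions. The numbers depend on the assumptions above and are not a substitute for the published results.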

Realistic minimum sample size

• Below is part of a commonly cited table of minimum sample sizes from Linacre [15].

• Even the “Size for most purposes” values are based on a large “acceptable error” range and assume idealized conditions. Samples will realistically need to be much larger for accurate estimation.
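The arithmetic behind sample-size ranges of this kind can be sketched as follows. The sketch assumes, for illustration only, that the standard error of an item calibration lies between 2/sqrt(N) with good targeting and 3/sqrt(N) with poor targeting, and that the confidence multipliers are rounded to z = 2.0 (about 95%) and z = 2.6 (about 99%). The function name and these inputs are assumptions rather than a statement of how the published table was built, although they do reproduce its ±1 logit ranges.

```python
def sample_size_range(half_width_logits, z):
    """N such that z * SE equals the half-width, for SE = 2/sqrt(N) and 3/sqrt(N)."""
    best_targeting = (2.0 * z / half_width_logits) ** 2
    poor_targeting = (3.0 * z / half_width_logits) ** 2
    return round(best_targeting), round(poor_targeting)


print(sample_size_range(1.0, 2.0))  # (16, 36)
print(sample_size_range(1.0, 2.6))  # (27, 61)
print(sample_size_range(0.5, 2.0))  # (64, 144)
```

Because N grows with the square of the required precision, halving the tolerated error from ±1 logit to ±0.5 logit quadruples the sample size, which is why accurate estimation needs samples far larger than the tabled minimums.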

Post-analysis result visualization

• Figure 1 shows 40 threshold values (such as would be obtained from a 10-item clinical outcome assessment (COA) with 5 response categories, i.e., 4 thresholds per item), all obtained from the same generating, or “true,” parameter values.

• Each item parameter has been randomly perturbed by up to plus or minus 1 logit, the range of “acceptable error.”

• Conclusions regarding the presence and location of any “measurement gaps” change across these 4 plots.

• Figure 2 shows an item-person map with three possible sets of item distributions from the same set of items, all within the purported range of acceptable error.

• Conclusions regarding “domain coverage” and item-person concordance vary greatly depending on which threshold plot is used.
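The perturbation exercise described above can be imitated with a few lines of code. This sketch does not reproduce the figures. It assumes 40 hypothetical “true” thresholds spaced evenly from -3 to 3 logits and draws each perturbation uniformly within ±1 logit (the distribution is an assumption). It then reports where the widest “measurement gap” falls in each of 4 perturbed sets.

```python
import random


def largest_gap(thresholds):
    """Return (width, midpoint) of the widest gap between adjacent thresholds."""
    ordered = sorted(thresholds)
    width, lower = max((hi - lo, lo) for lo, hi in zip(ordered, ordered[1:]))
    return width, lower + width / 2.0


def perturb(thresholds, max_error, rng):
    """Shift every threshold by a uniform draw within +/- max_error logits."""
    return [t + rng.uniform(-max_error, max_error) for t in thresholds]


# Hypothetical generating values: 40 thresholds, evenly spaced.
true_thresholds = [-3.0 + 6.0 * k / 39 for k in range(40)]

rng = random.Random(7)
replicates = [perturb(true_thresholds, 1.0, rng) for _ in range(4)]

for i, rep in enumerate(replicates, 1):
    width, where = largest_gap(rep)
    print(f"plot {i}: widest gap {width:.2f} logits, centered near {where:+.2f}")
```

The generating thresholds have no gap wider than about 0.15 logits, so any larger gap reported for a perturbed set is produced by estimation error of the tolerated size rather than by the items themselves.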

Post-analysis fit assessment

• Numerous studies [16-20] have found that the null distributions of INFIT and OUTFIT are affected by sample size and parameter distribution. Despite published recommendations, no single “cut point” is appropriate across conditions, and empirical studies have not investigated the performance of these fit measures with samples as small as those currently being advocated.
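The dependence of a fit statistic's null distribution on sample size can be seen in a small simulation. The sketch below covers the OUTFIT mean-square only and makes simplifying assumptions: a single dichotomous item of difficulty 1.0 logits, persons drawn from N(0, 1), data that fit the Rasch model exactly, and residuals computed from the generating parameters rather than from estimates. The 0.7 to 1.3 cut points are used purely as an illustrative rule of thumb.

```python
import math
import random
import statistics


def simulate_outfit(n_persons, difficulty, rng):
    """OUTFIT mean-square for one item: mean squared standardized residual."""
    total = 0.0
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))
        x = 1 if rng.random() < p else 0
        total += (x - p) ** 2 / (p * (1.0 - p))
    return total / n_persons


def null_summary(n_persons, difficulty=1.0, n_reps=500, seed=11, cut=(0.7, 1.3)):
    """SD of OUTFIT under the model and the share of replications flagged."""
    rng = random.Random(seed)
    values = [simulate_outfit(n_persons, difficulty, rng) for _ in range(n_reps)]
    flagged = sum(v < cut[0] or v > cut[1] for v in values) / n_reps
    return statistics.stdev(values), flagged


if __name__ == "__main__":
    for n in (30, 100, 500):
        sd, flagged = null_summary(n)
        print(f"N={n}: SD of OUTFIT {sd:.3f}, flagged by fixed cut points {flagged:.1%}")
```

Even though every simulated data set fits the model, the spread of OUTFIT is much wider at N = 30 than at N = 500, so the same fixed cut points flag a fitting item far more often in the small sample.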

Applicability of small-sample results to analysis sample

• When a highly constrained model is used to accommodate the small sample available during early analyses, but other IRT models are planned for use in the later psychometric analyses, any information gained from these early analyses does not necessarily generalize to later stage analyses.

Conclusions

• Existing literature shows that the use of IRT models with small samples, including even Rasch-consistent models, is empirically unsupported.

• Our conclusion: conduct the standard large sample psychometric study that will allow for the proper statistical analysis of the draft instrument.

• Simply put, if one is not likely to receive trustworthy information from an initial small data collection phase, why collect the additional data and conduct the analyses at all?

References

1. Stansbury, J. P. (2013, April). Mixed methods to enhance content validity of measures for use in drug-development trials. In A. Slagle (Moderator), Mixed methods – FDA Perspective: Incorporating mixed methods to enhance content validity in drug-development tools. Panel conducted at the Patient Reported Outcome (PRO) Consortium Workshop, Silver Spring, MD. Retrieved from http://c-path.org/PROSlides/Workshop3/2012_PROConsortium_PanelSession2.pdf

2. Gorecki, C., Lamping, D. L., Nixon, J., Brown, J. M., & Cano, S. (2012). Applying mixed methods to pretest the Pressure Ulcer Quality of Life (PU_QOL) instrument. Quality of Life Research, 21, 441-451.

3. Stone, M., & Yumoto, F. (2004). The effect of sample size for estimation Rasch/IRT parameters with dichotomous items. Journal of Applied Measurement, 5, 48-61.

4. Chen, W.-H., Lenderking, W., Jin, Y., Wyrwich, K. W., Gelhorn, H., & Revicki, D. A. (2014). Is Rasch model analysis applicable in small sample pilot studies for assessing preliminary item characteristics? An example using PROMIS pain behavior item bank data. Quality of Life Research, 23, 485-493.

5. Choi, S., Cook, K., & Dodd, B. (1997). Parameter recovery for the partial credit model using MULTILOG. Journal of Outcome Measurement, 1, 114-142.

6. DeMars, C. E. (2002, April). Recovery of graded response and partial credit parameters in MULTILOG and PARSCALE. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, IL.

7. French, G., & Dodd, B. (1999). Parameter recovery for the rating scale model using PARSCALE. Journal of Outcome Measurement, 3, 176-199.

8. Goldman, S. H., & Raju, N. S. (1986). Recovery of one- and two-parameter logistic item parameters: An empirical study. Educational and Psychological Measurement, 46, 11-21.

9. Guyer, R., & Thompson, N. (2011). Item response theory parameter recovery using Xcalibre 4.1 (Technical Report). St. Paul, MN: Assessment Systems Corporation. Retrieved from http://www.assess.com/docs/Xcalibre_4.1_tech_report.pdf

10. He, Q., & Wheadon, C. (2008). The effect of sample size on item parameter estimation for the partial credit model. Centre for Education and Research Policy. Retrieved from https://cerp.aqa.org.uk/sites/default/files/pdf_upload/CERP_RP_QH_11122008.pdf

11. Le, L. T., & Adams, R. J. (2013). Accuracy of Rasch model item parameter estimation. Retrieved from Australian Council for Educational Research website: http://research.acer.edu.au/cgi/viewcontent.cgi?article=1013&context=ar_misc

12. Meyer, J. P., & Hailey, E. (2012). A study of Rasch, partial credit, and rating scale model parameter recovery in WINSTEPS and jMetrik. Journal of Applied Measurement, 13, 248-258.

13. Preinerstorfer, D., & Formann, A. K. (2012). Parameter recovery and model selection in mixed Rasch models. British Journal of Mathematical and Statistical Psychology, 65, 251-262.

14. Wang, W.-C., & Chen, C.-T. (2005). Item parameters recovery, standard error estimates, and fit statistics of the WINSTEPS program for the family of Rasch models. Educational and Psychological Measurement, 65, 376-404.

15. Linacre, J. M. (1994). Sample size and item calibration stability. Rasch Measurement Transactions, 7, 328.

16. Smith, R. M. (1996). Polytomous mean-square fit statistics. Rasch Measurement Transactions, 10, 516-517.

17. Karabatsos, G. (2000). A critique of Rasch residual fit statistics. Journal of Applied Measurement, 1, 152-176.

18. Wright, B. D., Linacre, J. M., Gustafson, J.-E., & Martin-Löf, P. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.

19. Smith, R. M., Schumacker, R. E., & Bush, M. J. (1998). Using item mean squares to evaluate fit to the Rasch model. Journal of Outcome Measurement, 2, 66-78.

20. Smith, R. M. (1996). A comparison of the Rasch separate calibration and between-fit methods of detecting item bias. Educational and Psychological Measurement, 56, 403-418.

Authors: R.J. Wirth (1), Carrie R. Houts (1), & Linda S. Deal (2)

(1) Vector Psychometric Group, LLC; (2) Pfizer, Inc.

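Minimum sample sizes for stable item calibrations (part of the table from Linacre [15]):

Item calibrations    Confidence    Minimum sample size range    Size for most purposes
stable within        interval      (best to poor targeting)
± 1 logit            95%           16-36                        30 (min. for dichotomies)
± 1 logit            99%           27-61                        50 (min. for polytomies)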