
Page 1

Do your results really say what you think they say?
Issues of reliability and validity in evaluation measuring instruments

Krista S. Schumacher, PhD student & Program Evaluator
Oklahoma State University | JCCI Resource Development Services

AEA Meeting, October 17, 2013
Assessment in Higher Education TIG

Page 2

Key Issue

“Unfortunately, many readers and researchers fail to realize that no matter how profound the theoretical formulations, how sophisticated the design, and how elegant the analytic techniques, they cannot compensate for poor measures” (Pedhazur & Pedhazur Schmelkin, 1991).

Page 3

The Problem

Review of 52 educational evaluation studies, 1971 to 1999 (Brandon & Singh, 2009): none adequately addressed measurement.
Research on the practice of evaluation is lacking, and the literature on validity in evaluation studies ≠ measurement validity (Chen, 2010; Mark, 2011).

Page 4

The Problem (cont.)

Federal emphasis on “scientifically based research”: experimental designs, quasi-experimental designs, regression discontinuity designs, etc.
Where is measurement validity? How can programs be compared?
How can we justify requests for continued funding?

Page 5

Program Evaluation Standards: Accuracy

Standard A2: Valid Information. “Evaluation information should serve the intended purposes and support valid interpretation” (p. 171).
Standard A3: Reliable Information. “Evaluation procedures should yield sufficiently dependable and consistent information for the intended users” (p. 179).

(Yarbrough, Shulha, Hopson, & Caruthers, 2011)

Page 6

Measurement Validity & Reliability Defined

Validity: the instrument measures the intended construct, so inferences drawn from its scores are valid.
Reliability: the instrument consistently measures a construct, yielding consistent scores across administrations, but perhaps not the intended construct.
Reliability ≠ validity.

Page 7

Validity Types (basic for evaluation)

Face validity: on its face, the instrument appears to measure the intended construct. Assessment: subject matter expert (SME) ratings.
Content validity: items are representative of the domain of interest. Assessment: SME ratings. Provides no information about the validity of inferences from scores.
Construct validity: instrument content reflects the intended construct. Assessment: exploratory factor analysis (EFA) or principal components analysis (PCA); see the sketch following this slide.
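To make the construct validity assessment concrete, here is a minimal sketch of a principal components analysis on survey item scores. The data are simulated, and the retention rule (eigenvalues greater than 1) is an assumption chosen for illustration, not a detail of any instrument discussed in this presentation.

import numpy as np
from sklearn.decomposition import PCA

# Simulated Likert-type responses: 200 respondents x 12 items, each scored 1-5.
rng = np.random.default_rng(42)
item_scores = rng.integers(1, 6, size=(200, 12)).astype(float)

# Fit a PCA and keep components with eigenvalues > 1 (Kaiser criterion).
pca = PCA().fit(item_scores)
eigenvalues = pca.explained_variance_
n_retained = int((eigenvalues > 1).sum())

# Loadings: how strongly each item relates to each retained component.
loadings = pca.components_[:n_retained].T * np.sqrt(eigenvalues[:n_retained])

print(f"Components retained: {n_retained}")
for item, row in enumerate(loadings, start=1):
    print(f"Item {item:2d} loadings:", np.round(row, 3))

Loadings of this kind are what the Career Interest Survey table later in this deck (Page 13) reports for the Tyler-Wood, Knezek, & Christensen (2010) instrument.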

Page 8

Understanding Construct Validity

Pumpkin pie example (Nassif & Khalil, 2006):
Construct: the pie
Factors: crust and filling
Variables (items): individual ingredients

Page 9

Validity Types (more advanced)

Criterion validity: establishes a relationship or discrimination. Assessment: correlation of scores with another test or with an outcome variable.
Types of criterion validity evidence:
Concurrent validity: positive correlation with scores from another instrument measuring the same construct.
Discriminant validity: negative correlation with scores from an instrument measuring an opposite construct; also assessed by comparing scores from different groups.
Predictive validity: positive correlation of scores with the criterion variable the test is intended to predict (e.g., SAT scores and undergraduate GPA). A correlation sketch follows this slide.
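As a minimal sketch of criterion-related validity evidence, the example below correlates simulated instrument scores with a simulated criterion, a stand-in for the SAT-to-GPA case above. The variable names and the size of the relationship are assumptions made purely for illustration.

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
# Simulated admissions-test scores and a later-GPA criterion that partly depends on them.
test_scores = rng.normal(loc=500, scale=100, size=150)
gpa = 2.0 + 0.002 * test_scores + rng.normal(scale=0.4, size=150)

r, p_value = pearsonr(test_scores, gpa)
print(f"Predictive validity evidence: r = {r:.2f} (p = {p_value:.3g})")
# Concurrent validity would correlate scores with another instrument measuring the
# same construct; discriminant evidence expects a negative or near-zero correlation
# with an opposite construct, or clear score differences between known groups.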

Page 10

Reliability (basic for evaluation)

Reliability is a measure of error (how much of a score is due to chance).
Internal consistency reliability is one type of reliability. Cronbach's coefficient alpha is the most common index; it behaves like a correlation coefficient:
+1 = high reliability, no error; 0 = no reliability, high error; ≥ .70 is desired (Nunnally, 1978).
Alpha is not a measure of dimensionality: if an instrument has multiple scales (factors), compute alpha for each scale. A computational sketch follows this slide.
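For readers who want to compute alpha themselves, here is a minimal sketch using the standard formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of the total score). The item data are simulated; none of the numbers correspond to instruments discussed in this presentation.

import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array with rows = respondents and columns = items on a single scale."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_score_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_score_variance)

# Simulate 100 respondents answering 4 items that share one underlying trait,
# so the items should hang together and alpha should exceed the .70 benchmark.
rng = np.random.default_rng(0)
trait = rng.normal(size=(100, 1))
scale_items = trait + rng.normal(scale=0.5, size=(100, 4))

print(f"Cronbach's alpha: {cronbach_alpha(scale_items):.2f}")

Consistent with the slide above, compute one alpha per scale; pooling items from different scales into a single alpha can mask a multidimensional instrument.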

Page 11

Psychometrically Tested Instrument in Evaluation: Example

Middle Schoolers Out to Save the World (Tyler-Wood, Knezek, & Christensen, 2010)
$1.6 million NSF Innovative Technology Experiences for Students and Teachers (ITEST) project
STEM attitudes and career interest surveys

Process:
Adapted existing psychometrically tested instruments
Instrument development discussed
Validity and reliability evidence included
Instruments published in the article

Page 12

Middle Schoolers Out to Save the World: Validity & Reliability

Content validity: subject matter experts (teachers; advisory board members)
Construct validity: principal components analysis
Criterion-related validity:
Concurrent: correlated scores with other instruments tested for validity and reliability
Discriminant: compared scores among differing groups (e.g., 6th graders vs. ITEST PIs)

Page 13

Middle Schoolers Out to Save the World: Construct Validity

Career Interest Survey items loaded on three components:
Component 1: Supportive environment
Component 2: Science education interest
Component 3: Perceived importance of a science career

Component loadings by item:
Item 1: .781 | Item 2: .849 | Item 3: .759 | Item 4: .900
Item 5: .851 | Item 6: .921 | Item 7: .852 | Item 8: .736
Item 9: .844 | Item 10: .670 | Item 11: .888 | Item 12: .886

Page 14

Middle Schoolers Out to Save the World: Reliability

Internal consistency reliabilities for the Career Interest scales:

Scale | # Items | Cronbach's alpha
Perception of a supportive environment for pursuing a career in science | 4 | .86
Interest in pursuing educational opportunities that would lead to a career in science | 5 | .94
Perceived importance of a career in science | 3 | .78
All items | 12 | .94

Page 15

Evaluations Lacking Instrument Validity & Reliability

Six evaluations reviewed, representing approximately $9 million in federal funding
NSF programs:
STEM Talent Expansion Program (STEP)
Innovative Technology Experiences for Students and Teachers (ITEST)
Research in Disabilities Education
All used evaluator-developed instruments

Page 16

Purpose of Sample Evaluation Instruments

Instruments intended to measure:
Attitudes toward science, technology, engineering & math (STEM)
Anxiety related to STEM education
Interest in STEM careers
Confidence regarding success in a STEM major
Program satisfaction

Page 17

Measurement Fatal Flaws in Sample Evaluations

The evaluations failed to:
Discuss the process of instrument development. How were items developed? Were they reviewed by anyone other than the evaluators?
Report reliability or validity information. Evaluations that used existing instruments did not report results of psychometric testing.
In addition, one evaluation used different instruments for the pre- and post-tests: how can claims of increases or decreases be made when different items are used?

Page 18

Reported Findings of Sample Evaluations

IEP students were less likely than non-IEP peers to be interested in STEM fields (Lam et al., 2008)
A freshman seminar increased perceived readiness for the following semester (Raines, 2012)
A residential program increased STEM attitudes and career interest (Lenaburg et al., 2012)
Participants were satisfied with the program (Russomanno et al., 2010)
Increased perceived self-competence regarding information technology (IT) (Hayden et al., 2011)
Improved perceptions of IT professionals among high school faculty (Forssen et al., 2011)

Page 19

Implications for Evaluation

Funding and other program decisions: findings based on valid and reliable data provide strong justifications.
Use existing (tested) instruments when possible:
Assessment Tools in Informal Science: http://www.pearweb.org/atis/dashboard/index
Buros Center for Testing (Mental Measurements Yearbook): http://buros.org/
For newly created instruments: discuss the process of instrument creation and report evidence of validity and reliability.

Page 20

Conclusion

No more missing pieces: measurement deserves a place of priority.

Continually ask:
• Are the data trustworthy?
• Are my conclusions justifiable?
• How do we know these results really say what we think they say?

Page 21

References

Brandon, P. R., & Singh, J. M. (2009). The strength of the methodological warrants for the findings of research on program evaluation use. American Journal of Evaluation, 30(2), 123-157.
Chen, H. T. (2010). The bottom-up approach to integrative validity: A new perspective for program evaluation. Evaluation and Program Planning, 33, 205-214.
Forssen, A., Lauriski-Karriker, T., Harriger, A., & Moskal, B. (2011). Surprising possibilities imagined and realized through information technology: Encouraging high school girls' interests in information technology. Journal of STEM Education: Innovations & Research, 12(5/6), 46-57.
Hayden, K., Ouyang, Y., Scinski, L., Olszewski, B., & Bielefeldt, T. (2011). Increasing student interest and attitudes in STEM: Professional development and activities to engage and inspire learners. Contemporary Issues in Technology and Teacher Education, 11(1), 47-69.
Lam, P., Doverspike, D., Zhao, J., Zhe, J., & Menzemer, C. (2008). An evaluation of a STEM program for middle school students on learning disability related IEPs. Journal of STEM Education: Innovations & Research, 9(1/2), 21-29.
Lenaburg, L., Aguirre, O., Goodchild, F., & Kuhn, J.-U. (2012). Expanding pathways: A summer bridge program for community college STEM students. Community College Journal of Research and Practice, 36(3), 153-168.
Mark, M. M. (2011). New (and old) directions for validity concerning generalizability. New Directions for Evaluation, 2011(130), 31-42.
Nassif, N., & Khalil, Y. (2006). Making a pie as a metaphor for teaching scale validity and reliability. American Journal of Evaluation, 27(3), 393-398.
Nunnally, J. (1978). Psychometric theory. New York, NY: McGraw-Hill.
Pedhazur, E. J., & Pedhazur Schmelkin, L. (1991). Measurement, design, and analysis: An integrated approach. New York, NY: Psychology Press.
Raines, J. M. (2012). FirstSTEP: A preliminary review of the effects of a summer bridge program on pre-college STEM majors. Journal of STEM Education: Innovations and Research, 13(1).
Russomanno, D., Best, R., Ivey, S., Haddock, J. R., Franceschetti, D., & Hairston, R. J. (2010). MemphiSTEP: A STEM talent expansion program at the University of Memphis. Journal of STEM Education: Innovations and Research, 11(1/2), 69-81.
Tyler-Wood, T., Knezek, G., & Christensen, R. (2010). Instruments for assessing interest in STEM content and careers. Journal of Technology and Teacher Education, 18(2), 341-363.
Yarbrough, D. B., Shulha, L. M., Hopson, R. K., & Caruthers, F. A. (Eds.). (2011). The program evaluation standards: A guide for evaluators and evaluation users (3rd ed.). Thousand Oaks, CA: Sage.

Page 22

Contact Information

JCCI Resource Development Services
http://www.jccionline.com
BECO Building West, 5410 Edson Lane, Suite 210B, Rockville, MD 20852

Jennifer Kerns, President: 301-468-1851 | [email protected]
Krista S. Schumacher, Associate: 918-284-7276 | [email protected]