Reliability and Validity
DESCRIPTION
Slide show from Research Methods course
TRANSCRIPT
Reliability and Validity
Measurement
Scales (Levels) of measurement
Scales clarify the characteristics of measurement processes
Scales indicate which statistical procedures are appropriate
Nominal
• Categories without order
• Colors, gender, political party, nationality
Ordinal
• Categories with order
• Size (S,M,L), Social class, Agreement (strong, some, low, none)
Interval
• Distance is meaningful between categories
• Temperature, ACT scores, shoe size, IQ
Ratio
• Scale of categories has absolute zero
• Age, income, all rates and percents, vacation time
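The four levels above can be summarized in a small lookup table. This is an illustrative sketch, not part of the slides; the example variables and statistics are common textbook pairings:

```python
# Hypothetical sketch: levels of measurement paired with example
# variables and statistics commonly considered appropriate at each level.
SCALE_EXAMPLES = {
    "nominal":  {"examples": ["color", "political party"],   "stats": ["mode", "chi-square"]},
    "ordinal":  {"examples": ["size (S, M, L)", "social class"], "stats": ["median", "rank correlation"]},
    "interval": {"examples": ["temperature (F)", "ACT score"],  "stats": ["mean", "Pearson correlation"]},
    "ratio":    {"examples": ["age", "income"],              "stats": ["mean", "ratios of values"]},
}

def appropriate_stats(level: str) -> list:
    """Return statistics commonly used at the given level of measurement."""
    return SCALE_EXAMPLES[level.lower()]["stats"]

print(appropriate_stats("ordinal"))  # ['median', 'rank correlation']
```

Each level permits the statistics of the levels below it, which is why identifying the scale comes before choosing a procedure.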
Levels of Measurement: Learn them by playing the game!
Does my measurement procedure give the same accurate measurement
each time it is used?
Reliability
What is reliability?
Reliability is consistency in measurement
Does this procedure or test yield the same results if you repeat the measurement, so long as conditions have not changed?
How do we know?
Stern Tone Variator, from The Archives of the History of American Psychology
What is measurement validity?
Lavery Psychograph, from The Archives of the History of American Psychology
Validity is “truth” in measurement
Does this procedure or test actually measure the construct or dimension it is intended to measure?
How do we know?
Reliable, but not valid
Reliable: pattern shows the shot hits the same part of the target each time: it is consistent, so it is reliable.
Not valid. The goal is to hit the center of the target, but the shots are not in that area.
Valid, but not reliable
Valid because the pattern is evenly distributed around the correct goal (center): the person probably tried to hit the correct place.
Not reliable because the shots are off the mark in every possible direction; they are not consistent.
Neither reliable nor valid
Not reliable because the shots are not tightly clustered together; they are not consistent.
Not valid because, to the extent there is any pattern, it is not at the true target, the center.
Both reliable and valid
Reliable: the darts land close together. The red player can reliably hit the same part of the target.
Valid: the darts are clustered at the center, where they were aimed.
Bullseye!! by modenadude at http://www.flickr.com/photos/modenadude/3280286776
Why do reliability and validity matter?
All of our research uses data.
Data is gathered through measurement procedures.
The scores only have meaning if they measure what they are supposed to measure (validity) and do so with accuracy and consistency (reliability).
Evaluating whether data are reliable and valid is a key element in applying research findings.
We use statistical techniques in evaluating the reliability and validity of data.
Reliability and Validity
Reliability
Does the value observed and recorded accurately reflect the “true” value of the object?
Test by measuring the object multiple times or ways.
Every researcher must either use a known instrument, or test and demonstrate the reliability of a new tool.
The Literature Search is a huge labor-saving device.
Using a known instrument improves research quality.
Validity
Does the value observed and recorded reflect the concept and dimension of interest?
Test by comparing with other data or similar processes.
Every researcher must either use a known instrument, or test and demonstrate the validity of a new tool.
The Literature Search is a huge labor-saving device.
Using a known instrument improves research quality.
Observed Score = True Score + Error
Unreliable measurement tools introduce error
Reliability improves with a new tool or method
Test-Retest is the simplest way to assess reliability
Can be used whenever the process of measuring will not, by itself, affect the data
Test with correlation of first with second scores.
Reliability – Aiming to reduce error
Reliable measurements allow researchers to test their theories and hypotheses.
The more error in the data – whether random or systematic – the less likely they are to find true and significant results.
Researchers in psychology and human service fields spend months and years to develop a set of questions or observations that has high reliability.
Sources of unreliability
Meaning of questions is unclear or produces random answers.
Raters not adequately trained on the method of making ratings.
Some of the questions or items measure a subtly different dimension; they don’t “go with” the others.
Instructions may be unclear or inconsistent, even if the test questions are fine.
Outside events may be having an effect.
Reliability – Parallel forms and Test-Retest
Testing situation requires the use of different forms of the same tool
School & licensing exams
Learning occurs in testing, so it can’t be repeated
Test by computing correlation of two or more forms, taken under same circumstances.
Reliability – Internal Consistency
Abstract or complex dimensions can’t be measured directly
Several related items or observations are more likely to get an accurate result
Need to verify that all the items relate to the same dimension
Cronbach’s alpha (α) is commonly reported. Interpret its strength on the same scale as a correlation.
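Cronbach’s alpha can be computed directly from the standard formula, α = k/(k−1) · (1 − Σ item variances / variance of total score). The responses below are made up for illustration:

```python
# Sketch: Cronbach's alpha for a small set of hypothetical items, using
#   alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
from statistics import pvariance

# Each inner list is one respondent's answers to k = 4 items (made up).
responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
]

def cronbach_alpha(rows):
    k = len(rows[0])
    items = list(zip(*rows))                       # one tuple per item
    item_vars = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(r) for r in rows])  # variance of total scores
    return k / (k - 1) * (1 - item_vars / total_var)

print(f"alpha = {cronbach_alpha(responses):.2f}")
```

Items that all track the same underlying dimension make the total-score variance large relative to the item variances, pushing alpha toward 1.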
Reliability – Inter-rater Reliability
Observation process is skilled and requires individual judgment.
Trained researchers make observations of the same subject independently.
Test with Correlation of ratings (interval or ratio) or percent agreement (nominal or ordinal)
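For nominal ratings, the percent-agreement check mentioned above is a one-line computation. The two raters and their codes below are hypothetical:

```python
# Sketch: percent agreement between two raters assigning nominal
# categories to the same subjects (hypothetical codes).
rater_a = ["safe", "unsafe", "safe", "safe",   "unsafe", "safe"]
rater_b = ["safe", "unsafe", "safe", "unsafe", "unsafe", "safe"]

agree = sum(a == b for a, b in zip(rater_a, rater_b))
pct = agree / len(rater_a)
print(f"percent agreement = {pct:.0%}")  # 5 of 6 ratings match
```

For interval or ratio ratings, a Pearson correlation between the two raters’ scores would be used instead, as the slide indicates.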
Taking In the View by Randy Son of Robert at http://www.flickr.com/photos/randysonofrobert/2384256036/
Reliability – Effective Range
Measurement process needs accuracy across all possible outcomes.
Ceiling effect: process cannot measure extreme high scores
Floor effect: process cannot measure extreme low scores
Both are a scale attenuation problem.
Measurement in research
The real world is messy
Not always clear how to sort out all the reliability questions
Reliability issues are intertwined with validity.
Learn by seeing what competent researchers do.
math problems for girls by woodleywonderworks at http://www.flickr.com/photos/wwworks/3597217248/
W. Andrew Harrell describes his study
Two controversies are connected to
Dr. Harrell’s study:
Do physical traits such as beauty have an evolutionary impact (i.e., people are more likely to have children to pass along their genes)?
Isn’t beauty a subjective judgment, rather than a trait that can be objectively measured in a research study?
Dr. Harrell addresses both of those questions in this 4-minute audio clip.
Click to listen
University of Alberta (2005, April 13). Researchers Show Parents Give Unattractive Children Less Attention. ScienceDaily. Retrieved July 25, 2009, from http://www.sciencedaily.com/releases/2005/04/050412213412.htm
Measurement in Harrell’s study
Direct observation of seat belt safety
Previously validated measures of beauty
Two trained researchers evaluate attractiveness (inter-rater reliability)
Two different trained researchers observe safety (inter-rater reliability and avoiding bias)
Increasing reliability
All researchers should use identical instructions.
A larger number of items (questions) will provide a more stable measure of a complex dimension.
Some questions will be eliminated because they evoke varied and inconsistent responses
The items need to cover the entire range of the dimension in order to observe the extreme values
Reliability may change if observations are made in very different populations or situations (e.g., college students vs. seniors in assisted living).
Does my measurement procedure give a measurement of the construct or variable that
I intend? Or is it measuring something else?
Validity
Two meanings of “validity”
Validity is an over-arching concern of research
Measurement – are the observations directly and truly linked to the dimension or concept claimed?
Research design – how well does the experiment or study control the situation so that we are confident that the relationships or results observed were due to the impact of the independent variable?
In this course, we consider validity in measurement now and validity in research design later.
“Face” and Content Validity
“Face” validity
Content validity
Are all aspects of the dimension or concept covered?
Are any aspects over- or under-emphasized?
Does the measure differentiate this dimension from other similar ones?
Improving content validity
Thorough search of the literature
Consult with experts who disagree with your perspective
Manuscripts and checklists by Muffett at http://www.flickr.com/photos/calliope/173797447/
Criterion validity – p. 81
A measure is valid if it has a strong relationship to an external criterion
A music audition is a valid measure if it selects the better players over those with less ability (concurrent validity).
The GRE is a valid measure if people who do well on the GRE succeed in graduate school (predictive validity).
It is often as hard to demonstrate the link between the test and the criterion as between the test and the dimension.
GRE flashcards by NEPMET at http://www.flickr.com/photos/blahman/2168064272/
Construct validity
The dimension to be measured is a construct, an abstract idea related to a group of interrelated variables.
The construct itself might be socially constructed
Classic studies in obedience are being re-interpreted.
Some cultures lack a word or idea for “schizophrenia”
Researchers make their case for validity, but must be open to reconsideration.
Milgram’s obedience experiment
Sources of invalidity
Lavery Psychograph, from The Archives of the History of American Psychology
Incorrect theory produced this psychograph. It yielded mostly random data.
The procedure measures a dimension, but not the one intended by the researcher.
The procedure measures a dimension, but there is more than one interpretation of its meaning.
Relationship of reliability and validity
Both reliability and validity are measured on a continuum, evaluated in terms of degrees.
If a measurement has very low reliability, it can’t be valid because it is not even accurate.
Maximum validity is the square root of reliability.
Especially in inferential statistics when relationships are tested, the analyst must remember to examine the quality of measurement.
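The ceiling stated above — maximum validity is the square root of reliability — can be illustrated numerically (the reliability values below are arbitrary examples):

```python
# Sketch: the attenuation ceiling. A measure's validity coefficient
# cannot exceed the square root of its reliability coefficient.
import math

def max_validity(reliability: float) -> float:
    """Upper bound on a measure's validity, given its reliability."""
    return math.sqrt(reliability)

for r in (0.25, 0.64, 0.81):
    print(f"reliability {r:.2f} -> validity at most {max_validity(r):.2f}")
```

Because the square root of a number below 1 is larger than the number itself, modest reliability still leaves room for useful validity, but very low reliability (e.g., 0.25) caps validity sharply (at 0.50 here).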