Reliability and Validity
DESCRIPTION
Slide show from Research Methods course
TRANSCRIPT
Reliability and Validity
Measurement
Scales (Levels) of measurement
Scales clarify the characteristics of measurement processes
Scales indicate which statistical procedures are appropriate
Nominal
• Categories without order
• Colors, gender, political party, nationality
Ordinal
• Categories with order
• Size (S,M,L), Social class, Agreement (strong, some, low, none)
Interval
• Distance is meaningful between categories
• Temperature, ACT scores, shoe size, IQ
Ratio
• Scale of categories has absolute zero
• Age, income, all rates and percents, vacation time
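The four levels above can be summarized in a small lookup table. This is an illustrative sketch, not part of the slides; the example variables and statistics are common textbook pairings:

```python
# Hypothetical sketch: levels of measurement paired with example
# variables and statistics commonly considered appropriate at each level.
SCALE_EXAMPLES = {
    "nominal":  {"examples": ["color", "political party"],   "stats": ["mode", "chi-square"]},
    "ordinal":  {"examples": ["size (S, M, L)", "social class"], "stats": ["median", "rank correlation"]},
    "interval": {"examples": ["temperature (F)", "ACT score"],  "stats": ["mean", "Pearson correlation"]},
    "ratio":    {"examples": ["age", "income"],              "stats": ["mean", "ratios of values"]},
}

def appropriate_stats(level: str) -> list:
    """Return statistics commonly used at the given level of measurement."""
    return SCALE_EXAMPLES[level.lower()]["stats"]

print(appropriate_stats("ordinal"))  # ['median', 'rank correlation']
```

Each level permits the statistics of the levels below it, which is why identifying the scale comes before choosing a procedure.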
Levels of Measurement: Learn them by playing the game!
Does my measurement procedure give the same accurate measurement
each time it is used?
Reliability
What is reliability?
Reliability is consistency in measurement
Does this procedure or test yield the same results if you repeat the measurement, so long as conditions have not changed?
How do we know?
Stern Tone Variator, from The Archives of the History of American Psychology
What is measurement validity?
Lavery Psychograph, from The Archives of the History of American Psychology
Validity is “truth” in measurement
Does this procedure or test actually measure the construct or dimension it is intended to measure?
How do we know?
Reliable, but not valid
Reliable: pattern shows the shot hits the same part of the target each time: it is consistent, so it is reliable.
Not valid. The goal is to hit the center of the target, but the shots are not in that area.
Valid, but not reliable
Valid because the pattern is evenly distributed around the correct goal (center): the person probably tried to hit the correct place.
Not reliable because the shots are off the mark in every possible direction; they are not consistent.
Neither reliable nor valid
Not reliable because the shots are not tightly clustered together; they are not consistent.
Not valid because, to the extent there is any pattern, it is not at the true target, the center.
Both reliable and valid
Reliable: the darts land close together. The red player can reliably hit the same part of the target.
Valid: the darts are clustered at the center, where they were aimed.
Bullseye!! by modenadude at http://www.flickr.com/photos/modenadude/3280286776
Why do reliability and validity matter?
All of our research uses data.
Data is gathered through measurement procedures.
The scores only have meaning if they measure what they are supposed to measure (validity) and do so with accuracy and consistency (reliability).
Evaluating whether data are reliable and valid is a key element in applying research findings.
We use statistical techniques in evaluating the reliability and validity of data.
Reliability and Validity
Reliability
Does the value observed and recorded accurately reflect the “true” value of the object?
Test by measuring the object multiple times or ways.
Every researcher must either use a known instrument, or test and demonstrate the reliability of a new tool.
The Literature Search is a huge labor-saving device.
Using a known instrument improves research quality.
Validity
Does the value observed and recorded reflect the concept and dimension of interest?
Test by comparing with other data or similar processes.
Every researcher must either use a known instrument, or test and demonstrate the validity of a new tool.
The Literature Search is a huge labor-saving device.
Using a known instrument improves research quality.
Observed Score = True Score + Error
Unreliable measurement tools introduce error
Reliability improves with a new tool or method
Test-Retest is the simplest way to assess reliability
Can be used whenever the process of measuring will not, by itself, affect the data
Test with correlation of first with second scores.
Reliability – Aiming to reduce error
Reliable measurements allow researchers to test their theories and hypotheses.
The more error in the data – whether random or systematic – the less likely they are to find true and significant results.
Researchers in psychology and human service fields spend months and years to develop a set of questions or observations that has high reliability.
Sources of unreliability
Meaning of questions is unclear or produces random answers.
Raters not adequately trained on the method of making ratings.
Some of the questions or items measure a subtly different dimension; they don’t “go with” the others.
Instructions may be unclear or inconsistent, even if the test questions are fine.
Outside events may be having an effect.
Reliability – Parallel forms and Test-Retest
Testing situation requires the use of different forms of the same tool
School & licensing exams
Learning occurs in testing, so it can’t be repeated
Test by computing correlation of two or more forms, taken under same circumstances.
Reliability – Internal Consistency
Abstract or complex dimensions can’t be measured directly
Several related items or observations are more likely to get an accurate result
Need to verify that all the items relate to the same dimension
Cronbach’s alpha (α) is commonly reported. Interpret its strength on the same scale as a correlation.
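Cronbach’s alpha can be computed directly from the standard formula, α = k/(k−1) · (1 − Σ item variances / variance of total score). The responses below are made up for illustration:

```python
# Sketch: Cronbach's alpha for a small set of hypothetical items, using
#   alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
from statistics import pvariance

# Each inner list is one respondent's answers to k = 4 items (made up).
responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
]

def cronbach_alpha(rows):
    k = len(rows[0])
    items = list(zip(*rows))                       # one tuple per item
    item_vars = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(r) for r in rows])  # variance of total scores
    return k / (k - 1) * (1 - item_vars / total_var)

print(f"alpha = {cronbach_alpha(responses):.2f}")
```

Items that all track the same underlying dimension make the total-score variance large relative to the item variances, pushing alpha toward 1.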
Reliability – Inter-rater Reliability
Observation process is skilled and requires individual judgment.
Trained researchers make observations of the same subject independently.
Test with Correlation of ratings (interval or ratio) or percent agreement (nominal or ordinal)
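For nominal ratings, the percent-agreement check mentioned above is a one-line computation. The two raters and their codes below are hypothetical:

```python
# Sketch: percent agreement between two raters assigning nominal
# categories to the same subjects (hypothetical codes).
rater_a = ["safe", "unsafe", "safe", "safe",   "unsafe", "safe"]
rater_b = ["safe", "unsafe", "safe", "unsafe", "unsafe", "safe"]

agree = sum(a == b for a, b in zip(rater_a, rater_b))
pct = agree / len(rater_a)
print(f"percent agreement = {pct:.0%}")  # 5 of 6 ratings match
```

For interval or ratio ratings, a Pearson correlation between the two raters’ scores would be used instead, as the slide indicates.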
Taking In the View by Randy Son of Robert at http://www.flickr.com/photos/randysonofrobert/2384256036/
Reliability – Effective Range
Measurement process needs accuracy across all possible outcomes.
Ceiling effect: process cannot measure extreme high scores
Floor effect: process cannot measure extreme low scores
Both are a scale attenuation problem.
Measurement in research
The real world is messy
Not always clear how to sort out all the reliability questions
Reliability issues are intertwined with validity.
Learn by seeing what competent researchers do.
math problems for girls by woodleywonderworks at http://www.flickr.com/photos/wwworks/3597217248/
W. Andrew Harrell describes his study
Two controversies are connected to
Dr. Harrell’s study:
Do physical traits such as beauty have an evolutionary impact (i.e., people are more likely to have children to pass along their genes)?
Isn’t beauty a subjective judgment, rather than a trait that can be objectively measured in a research study?
Dr. Harrell addresses both of those questions in this 4-minute audio clip.
Click to listen
University of Alberta (2005, April 13). Researchers Show Parents Give Unattractive Children Less Attention. ScienceDaily. Retrieved July 25, 2009, from http://www.sciencedaily.com/releases/2005/04/050412213412.htm
Measurement in Harrell’s study
Direct observation of seat belt safety
Previously validated measures of beauty
Two trained researchers evaluate attractiveness (inter-rater reliability)
Two different trained researchers observe safety (inter-rater reliability and avoiding bias)
Increasing reliability
All researchers should use identical instructions.
A larger number of items (questions) will provide a more stable measure of a complex dimension.
Some questions will be eliminated because they evoke varied and inconsistent responses
The items need to cover the entire range of the dimension in order to observe the extreme values
Reliability may change if observations are made in very different populations or situations (e.g., college students vs. seniors in assisted living).
Does my measurement procedure give a measurement of the construct or variable that
I intend? Or is it measuring something else?
Validity
Two meanings of “validity”
Validity is an over-arching concern of research
Measurement – are the observations directly and truly linked to the dimension or concept claimed?
Research design – how well does the experiment or study control the situation so that we are confident that the relationships or results observed were due to the impact of the independent variable?
In this course, we consider validity in measurement now and validity in research design later.
“Face” and Content Validity
“Face” validity
Content validity
Are all aspects of the dimension or concept covered?
Are any aspects over- or under-emphasized?
Does the measure differentiate this dimension from other similar ones?
Improving content validity
Thorough search of the literature
Consult with experts who disagree with your perspective
Manuscripts and checklists by Muffett at http://www.flickr.com/photos/calliope/173797447/
Criterion validity – p. 81
A measure is valid if it has a strong relationship to an external criterion
A music audition is a valid measure if it selects the better players over those with less ability (concurrent validity).
The GRE is a valid measure if people who do well on the GRE succeed in graduate school (predictive validity).
It is often as hard to demonstrate the link between the test and the criterion as between the test and the dimension.
GRE flashcards by NEPMET at http://www.flickr.com/photos/blahman/2168064272/
Construct validity
The dimension to be measured is a construct, an abstract idea related to a group of interrelated variables.
The construct itself might be socially constructed
Classic studies in obedience are being re-interpreted.
Some cultures lack a word or idea for “schizophrenia”
Researchers make their case for validity, but must be open to reconsideration.
Milgram’s obedience experiment
Sources of invalidity
Lavery Psychograph, from The Archives of the History of American Psychology
Incorrect theory produced this psychograph. It yielded mostly random data.
The procedure measures a dimension, but not the one intended by the researcher.
The procedure measures a dimension, but there is more than one interpretation of its meaning.
Relationship of reliability and validity
Both reliability and validity are measured on a continuum, evaluated in terms of degrees.
If a measurement has very low reliability, it can’t be valid because it is not even accurate.
Maximum validity is the square root of reliability.
Especially in inferential statistics when relationships are tested, the analyst must remember to examine the quality of measurement.
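The ceiling stated above — maximum validity is the square root of reliability — can be illustrated numerically (the reliability values below are arbitrary examples):

```python
# Sketch: the attenuation ceiling. A measure's validity coefficient
# cannot exceed the square root of its reliability coefficient.
import math

def max_validity(reliability: float) -> float:
    """Upper bound on a measure's validity, given its reliability."""
    return math.sqrt(reliability)

for r in (0.25, 0.64, 0.81):
    print(f"reliability {r:.2f} -> validity at most {max_validity(r):.2f}")
```

Because the square root of a number below 1 is larger than the number itself, modest reliability still leaves room for useful validity, but very low reliability (e.g., 0.25) caps validity sharply (at 0.50 here).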