QSU Seminar
Reliability and Validity: Practical Considerations
Design and Analytic Approaches
Rita Popat, PhD
Dept of Health Research & Policy, Division of Epidemiology
rpopat@stanford.edu
JAMA. 2004;292:1188-1194
What do we want to know about the measurements? Why?
• Dependent variable (outcome)
• Independent variable (risk factor or predictor)
What are other possible explanations for not detecting an association?
JAMA. 2004;291:1978-1986
Outline
• Definitions: Measurement error, reliability, validity
• Why should we care about measurement error?
• Effects of measurement error on Study Validity (Categorical Exposures)
• Effects of measurement error on Study Validity (Continuous Exposures)
• Measures (or indices) for reliability and validity
Measurement error
• For an individual, measurement error is the difference between his/her observed and true measurement.
• Measurement error can occur in dependent (outcome) or independent (predictor or exposure) variables
• For categorical variables, measurement error is referred to as misclassification
• Measurement error is an important source of bias that can threaten internal validity of a study
Reliability (aka reproducibility, consistency)
• Reliability is the extent to which repeated measurement of a stable phenomenon (by the same or different people and instruments, at different times and places) obtains similar results.
• A precise measurement is reproducible, that is, has the same (or nearly the same) value each time it is measured.
• The higher the reliability, the greater the statistical power for a fixed sample size
• Reliability is affected by random error
Validity or Accuracy
•The accuracy of a variable is the degree to which it actually represents what it is intended to represent
•That is: The extent to which the measurement represents the true value of the attribute being assessed.
Precise (Reliable) and Accurate (Valid) measurements are key to minimizing measurement error
(Figure: four targets illustrating precision without accuracy, accuracy with low precision, neither precision nor accuracy, and both precision and accuracy.)
Measurement error in Categorical Variables
• Referred to as Misclassification and could be in the
  • Outcome variables, or
  • Exposure variables
• How do we know misclassification exists?
  • When the method used for classifying exposure lacks accuracy
Assessment of Accuracy
Imperfect classification vs. true classification:

| | True: Present (+) | True: Absent (−) | Total |
|---|---|---|---|
| Classified present (+) | a (TP) | b (FP) | a+b |
| Classified absent (−) | c (FN) | d (TN) | c+d |
| Total | a+c | b+d | |

Sensitivity = a / (a+c)    False negative = c / (a+c)
Specificity = d / (b+d)    False positive = b / (b+d)
Criterion validity (compare against a reference or gold standard)
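These criterion-validity quantities follow directly from the 2×2 cell counts; a minimal Python sketch (the function name and the cell counts are mine, not from the slides):

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity = a/(a+c), specificity = d/(b+d) from the 2x2 table above."""
    sensitivity = tp / (tp + fn)   # true positives among those truly present
    specificity = tn / (tn + fp)   # true negatives among those truly absent
    return sensitivity, specificity

# Hypothetical counts: a=90 TP, b=20 FP, c=10 FN, d=80 TN
se, sp = sensitivity_specificity(tp=90, fp=20, fn=10, tn=80)
print(se, sp)  # 0.9 0.8
```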
| | Cases (outcome +) | Controls (outcome −) |
|---|---|---|
| Exposure + | a | b |
| Exposure − | c | d |
Misclassification of exposure
• Non-differential
• Differential
Misclassification of exposure

True exposure:

| | Cases | Controls |
|---|---|---|
| Exposure + | 50 | 20 |
| Exposure − | 50 | 80 |

OR = (50)(80) / (20)(50) = 4.0

Reported exposure, 90% sensitivity & 80% specificity in cases & controls:

| | Cases | Controls |
|---|---|---|
| Exposure + | 55 | 34 |
| Exposure − | 45 | 66 |

OR = (55)(66) / (34)(45) = 2.4

Attenuation of true association due to misclassification of exposure
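The attenuation in this example can be reproduced by pushing the true counts through the stated sensitivity and specificity; a sketch (the helper names and expected-count approach are mine):

```python
def misclassified(exposed, unexposed, se, sp):
    """Expected observed counts after applying sensitivity/specificity to true counts."""
    obs_exp = exposed * se + unexposed * (1 - sp)     # true positives kept + false positives
    obs_unexp = exposed * (1 - se) + unexposed * sp   # false negatives + true negatives
    return obs_exp, obs_unexp

def odds_ratio(a, b, c, d):
    """OR from a 2x2 table: a,b = exposed cases/controls; c,d = unexposed cases/controls."""
    return (a * d) / (b * c)

# True exposure: cases 50/50, controls 20/80
print(odds_ratio(50, 20, 50, 80))  # 4.0
# 90% sensitivity, 80% specificity in BOTH groups (non-differential)
case_exp, case_unexp = misclassified(50, 50, se=0.9, sp=0.8)   # ~55, ~45
ctrl_exp, ctrl_unexp = misclassified(20, 80, se=0.9, sp=0.8)   # ~34, ~66
print(round(odds_ratio(case_exp, ctrl_exp, case_unexp, ctrl_unexp), 1))  # 2.4
```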
Misclassification of the exposure
• Non-differential misclassification occurs when the degree of misclassification of exposure is independent of outcome/disease status
• Tends to bias the association toward the null
• Occurs when the sensitivity and specificity of the classification of exposure are the same for those with and without the outcome, but less than 100%
Underestimation of a relative risk or odds ratio for…
(Figure: A. Risk factor and B. Protective factor. In both panels the observed value lies between the true value and the null value of 1: bias toward the null hypothesis. Modified from Greenberg, Fig 10-4, chapter 10.)
Misclassification of the exposure

True exposure:

| | Cases | Controls |
|---|---|---|
| Exposure + | 50 | 20 |
| Exposure − | 50 | 80 |

OR = (50)(80) / (20)(50) = 4.0

Reported exposure, cases: 96% sensitivity and 100% specificity; controls: 70% sensitivity and 100% specificity:

| | Cases | Controls |
|---|---|---|
| Exposure + | 48 | 14 |
| Exposure − | 52 | 86 |

OR = (48)(86) / (14)(52) = 5.7
Misclassification of the exposure

True exposure:

| | Cases | Controls |
|---|---|---|
| Exposure + | 50 | 20 |
| Exposure − | 50 | 80 |

OR = (50)(80) / (20)(50) = 4.0

Reported exposure, cases: 96% sensitivity and 100% specificity; controls: 70% sensitivity and 80% specificity:

| | Cases | Controls |
|---|---|---|
| Exposure + | 48 | 30 |
| Exposure − | 52 | 70 |

OR = (48)(70) / (30)(52) ≈ 2.2
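Both differential scenarios can be checked with the same kind of expected-count arithmetic (helper names are mine, not from the slides):

```python
def misclassified(exposed, unexposed, se, sp):
    """Expected observed exposed/unexposed counts after imperfect classification."""
    return exposed * se + unexposed * (1 - sp), exposed * (1 - se) + unexposed * sp

def odds_ratio(a, b, c, d):
    """OR from a 2x2 table: a,b = exposed cases/controls; c,d = unexposed cases/controls."""
    return (a * d) / (b * c)

# True counts: cases 50/50, controls 20/80 (true OR = 4.0)
# Differential error with perfect specificity: bias AWAY from the null
ce, cu = misclassified(50, 50, se=0.96, sp=1.0)     # cases: ~48, ~52
ke, ku = misclassified(20, 80, se=0.70, sp=1.0)     # controls: ~14, ~86
print(round(odds_ratio(ce, ke, cu, ku), 1))         # 5.7
# Differential error with imperfect control specificity: here, bias TOWARD the null
ke2, ku2 = misclassified(20, 80, se=0.70, sp=0.80)  # controls: ~30, ~70
print(round(odds_ratio(ce, ke2, cu, ku2), 2))       # 2.15
```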
Misclassification of the exposure
• Differential misclassification occurs when the degree of misclassification differs between the groups being compared.
• May bias the association either toward or away from the null hypothesis
• Occurs when the sensitivity and specificity of the classification of exposure differ for those with and without the outcome
Overestimation of a relative risk or odds ratio for…
(Figure: A. Risk factor and B. Protective factor. In both panels the observed value lies farther from the null value of 1 than the true value: bias away from the null hypothesis. Modified from Greenberg, Fig 10-4, chapter 10.)
(Figure: assessing hormone therapy exposure (never/former/current) among cases, reported by the index subject or a proxy (~25%), compared against a pharmacy database: what is the accuracy of each source?)
Summary so far….
• Misclassification of exposure is an important source of bias
• Good to know something about the validity of measurement for exposure classification before the study begins
• Almost impossible to avoid misclassification, but try to avoid differential misclassification
• If the study has already been conducted, develop analytic strategies that explore exposure misclassification as a possible explanation of the observed results (especially for a “primary” exposure of interest)
Measurement error in Continuous Variables
• Physiologic measures (SBP, BMI)
• Biomarkers (hormone levels, lipids)
• Nutrients
• Environmental exposures
• Outcome measures (QOL, function)
Model of measurement error
• Observed value for subject i: Xi = Ti + b + Ei, where Ti is the true value, b is the systematic error of the instrument (bias), and Ei is the random error for subject i
Measurement Theory: Example contd.

Observed measurement for subject i = Ti + b + Ei
• Ti: true value
• b: systematic error in the instrument (bias)
• Ei: additional "random error" for subject i

EXAMPLE: One measured diastolic blood pressure (DBP) as an indicator of 2-year average DBP.
• b: BP cuff miscalibrated, measuring everyone's diastolic BP as 10 mmHg less; misreading by the interviewer
• Ei: randomness in BP cuff mechanics; subject i's 10 mmHg increase over the 2-year average; subject intimidated (diastolic BP 20 mmHg higher than usual); random fluctuations in current BP
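A small simulation illustrates how the bias term b survives averaging while the random errors Ei tend to cancel out (all numbers here are hypothetical, loosely following the DBP story above):

```python
import random
import statistics

random.seed(42)
b = -10  # systematic error: a miscalibrated cuff reads everyone 10 mmHg low

# Hypothetical true 2-year average DBP for 2000 subjects
true_dbp = [random.gauss(80, 8) for _ in range(2000)]
# Observed value per the model X_i = T_i + b + E_i, with E_i ~ N(0, 5)
observed = [t + b + random.gauss(0, 5) for t in true_dbp]

# The random errors average out; the bias b does not
print(round(statistics.mean(true_dbp) - statistics.mean(observed)))  # 10
```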
Measurement theory
Validity of X…
• The validity of X can be expressed by the validity coefficient: the correlation between the observed measure X and the true value T
Measurement error
Differential Measurement error
Differential Measurement error
Differential Bias
Non-Differential Measurement error
The effects of non-differential measurement error on the odds ratio. OR_T is the true odds ratio for exposure versus reference level r. OR_O is the observable odds ratio for exposure versus reference level r.
Effects of non-differential measurement error
RR_O = RR_T^(ρ_TX²)

Example: with ρ_TX = 0.82 and a true RR_T = 0.62, the observable RR_O = 0.62^(0.82²) ≈ 0.73.
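One commonly cited attenuation relation for a continuously measured exposure is RR_O = RR_T^(ρ_TX²), where ρ_TX is the validity coefficient; checking it against this slide's numbers:

```python
rr_true = 0.62   # true (protective) relative risk
rho_tx = 0.82    # validity coefficient of the exposure measure

# Observable RR is the true RR raised to the squared validity coefficient
rr_observed = rr_true ** (rho_tx ** 2)
print(round(rr_observed, 2))  # 0.73 -- attenuated toward the null value of 1
```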
Summary so far….
• Measurement error is an important source of bias
• Good to know something about the validity of the exposure measurement before the study begins
• Almost impossible to avoid misclassification, but try to avoid differential misclassification!
• Non-differential measurement error will attenuate the results toward the null, resulting in loss of power for a fixed sample size
• This should be taken into account when estimating sample size during the planning stage, and when interpreting results and determining the internal validity of a study
So why should we evaluate reliability and validity of measurements?
• If it precedes the actual study, it tells us whether the instrument/method we are using is reliable and valid
• This information can help us run sensitivity analysis or correct for the measurement error in the variables after the study has been completed
Outline
• Definitions: Measurement error, reliability, validity
• Why should we care about measurement error?
• Effects of measurement error on Study Validity (Categorical Exposures)
• Effects of measurement error on Study Validity (Continuous Exposures)
• Measures (or indices) for reliability and validity
Choice of reliability and validity measures depends on the type of variable . . .

| Type of Variable | Reliability Measure(s) | Validity Measure(s) |
|---|---|---|
| Dichotomous | Kappa | sensitivity, specificity |
| Ordinal | weighted kappa, ICC* | misclassification matrix |
| Continuous | ICC*, Bland-Altman plots | Pearson correlation (see note), Bland-Altman plots |

*ICC – intraclass correlation coefficient
Note: in inter-method reliability studies, inferences about validity can be made from coefficients of reproducibility (such as the Pearson correlation)
Assessing Accuracy (Validity) of continuous measures
• Bias: difference between the mean value as measured and the mean of the true values
• That is, bias = x̄ − X̄, where x̄ is the mean of the measured values and X̄ is the mean of the true values
• Standardized bias = (x̄ − X̄) / SD_X
• Bland-Altman plots
Bland and Altman plots
• Take two measurements (different methods or instruments) on the same subject
• For each subject, plot the difference between the two measures (y axis) vs. the mean of the two measures (x axis)
• We expect the mean difference to be 0
• We expect 95% of the differences to be within 2 standard deviations (SD)
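The ingredients of the plot reduce to a mean difference and limits of agreement; a sketch with hypothetical paired measurements:

```python
import statistics

# Hypothetical paired measurements (two instruments, same six subjects)
m1 = [150, 155, 160, 163, 170, 174]
m2 = [155, 158, 165, 170, 176, 184]

diffs = [b - a for a, b in zip(m1, m2)]          # y-axis of the plot
means = [(a + b) / 2 for a, b in zip(m1, m2)]    # x-axis of the plot
bias = statistics.mean(diffs)                    # expected ~0 if the methods agree
sd = statistics.stdev(diffs)
limits = (bias - 1.96 * sd, bias + 1.96 * sd)    # 95% limits of agreement

print(bias)  # 6 -- a systematic offset, so the methods do not agree
```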
Yoong et al. BMC Medical Research Methodology 2013, 13:38
Suppose there is no gold standard; then how do we evaluate validity? We make inferences from inter-method reliability studies!
Note: we will not be able to estimate bias when the two measures are based on different scales
Inferences about validity from inter-method reliability studies
• Suppose two different methods (instruments) are used to measure the same continuous exposure. Let X1 denote the measure of interest (i.e., the one to be used to measure the exposure in the study) and X2 is the comparison measure
• We have the reliability coefficient ρ_X1X2
• However, we are actually interested in the validity coefficient ρ_TX1
• Example: Is self-reported physical activity valid? Compare it to the 4-week diary.
Relationship of Reliability to Validity

| Errors of X1 and X2 are: | Relationship between reliability and validity | Usual application |
|---|---|---|
| 1. Uncorrelated and both measures are equally precise | ρ_X1X2 = ρ_TX1 · ρ_TX2, so ρ_TX1 = √ρ_X1X2 | Intramethod study |
| 2. Uncorrelated, X2 is more precise than X1 | ρ_X1X2 < ρ_TX1 < √ρ_X1X2 | Intermethod study |
| 3. Uncorrelated, X1 is more precise than X2 | ρ_TX1 > √ρ_X1X2 | Intermethod study |
| 4. Correlated errors and both measures are equally precise | ρ_TX1 < √ρ_X1X2 | Intramethod study, Intermethod study |

Take home message: In most situations the square root of the reliability coefficient can provide an upper limit to the validity coefficient
Inferences about validity from inter-method reliability studies
• In our example, X1 is measure of interest (i.e., the one to be used to measure the exposure in the study: self-reported activity) and X2 is the comparison measure (4-wk diaries)
• We have the reliability coefficient ρ_X1X2 = 0.79
• Errors in X1 and X2 are likely to be uncorrelated and X2 is more precise than X1, so
  0.79 < ρ_TX1 < √0.79 ≈ 0.89
• So, self-reported activity appears to be a valid measure
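The bracketing used here (reliability coefficient below the validity coefficient, with its square root as the upper limit) is quick to verify:

```python
import math

reliability = 0.79              # observed inter-method reliability coefficient
upper = math.sqrt(reliability)  # upper limit on the validity coefficient
print(round(upper, 2))          # 0.89, so 0.79 < validity coefficient < 0.89
```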
Summary of Inferences From Reliability to Validity
• Reliability studies are used to interpret the validity of X.
• Reliability is necessary for validity (an instrument cannot be valid if it is not reproducible).
• Reliability is not sufficient for validity: repetition of a test may yield the same result because both X1 and X2 measure some systematic error (i.e., errors are correlated).
• Reliability can only give an upper limit on validity. If the upper limit is low, then the instrument is not valid.
• An estimate of reliability (or validity) depends on the sample (i.e., may vary by age, gender, etc.)
Reliability of continuously distributed variables
• Pearson product-moment correlation?
• Spearman rank correlation?
But…does correlation tell you about relationship or agreement?

| Measure 1 | Measure 2 |
|---|---|
| 150 | 155 |
| 155 | 158 |
| 160 | 165 |
| 163 | 170 |
| 170 | 176 |
| 174 | 184 |

Pearson's correlation coefficient = 0.99. Is this measure reliable?
(Figure: scatter plot of Measure 2 vs. Measure 1; the points fall close to a straight line, with Measure 2 consistently higher than Measure 1.)
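That combination, a near-perfect correlation alongside a systematic offset, can be verified with plain Python (no external libraries):

```python
import math
import statistics

m1 = [150, 155, 160, 163, 170, 174]
m2 = [155, 158, 165, 170, 176, 184]

# Pearson correlation from deviations about each mean
mean1, mean2 = statistics.mean(m1), statistics.mean(m2)
cov = sum((a - mean1) * (b - mean2) for a, b in zip(m1, m2))
r = cov / math.sqrt(sum((a - mean1) ** 2 for a in m1)
                    * sum((b - mean2) ** 2 for b in m2))
print(round(r, 2))  # 0.99 -- nearly perfect linear relationship...

# ...yet Measure 2 runs systematically higher, so there is no agreement
print(statistics.mean(b - a for a, b in zip(m1, m2)))  # 6
```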
Reliability of continuously distributed variables
• Other methods are generally preferred for intra- or inter-observer reliability when the same method/instrument is used:
  - Intraclass correlation coefficient (ICC): calculated using variance estimates obtained through an analysis of variance (ANOVA)
  - Bland-Altman plots
• The correlation coefficient is useful for inter-method reliability to make inferences about validity (especially when the measurement scale differs for the two methods)
Intraclass Correlation Coefficient (ICC)
• If within-person variance is very high, then measurement error can "overwhelm" the measurement of between-person differences.
• If between-person differences are obscured by measurement error, it becomes difficult to demonstrate a correlation between the imperfectly measured characteristic and any other variable of interest.
• ICC is computed using ANOVA

Total variance = between-person variance + within-person variance

ICC = between-person variance / total variance
ANalysis Of Variance (ANOVA) in a reliability study
In a reliability study, we are not studying associations between predictors and outcome, so we will express the overall variability in the measurement as a function of between-subjects and within-subjects variability.
• So let's consider a test-retest reliability study, where multiple measurements are taken for each subject
SST = SSB + SSW
Total Variation

SST = (X11 − X̄)² + (X12 − X̄)² + … + (Xnk − X̄)²

where Xij is the j-th measurement on subject i, and X̄ is the grand mean of all nk measurements.
(Figure: measurements for Subjects 1-3 plotted around the grand mean X̄.)
Between-Subject Variation

SSB = k1(X̄1 − X̄)² + k2(X̄2 − X̄)² + … + kn(X̄n − X̄)²

Where: ki = number of measurements taken on subject i, and X̄i = mean for subject i
(Figure: subject means X̄1, X̄2, X̄3 plotted around the grand mean X̄.)
Within-Subject Variation

SSW = (X11 − X̄1)² + (X12 − X̄1)² + … + (Xnkn − X̄n)²

(Figure: individual measurements X11, X12, X13, … plotted around each subject's mean.)
One-way analysis of variance for computation of ICC: test-retest study

Here, each subject is a group; k = number of times the measure is repeated.

| Source of variance | Sum of squares (SS) | Degrees of freedom (df) | Mean square (MS = SS/df) |
|---|---|---|---|
| Between subjects | Σi k(X̄i − X̄)² | n − 1 | BMS |
| Within subjects (random error) | Σi Σj (Xij − X̄i)² | n(k − 1) | WMS |
| Total | Σi Σj (Xij − X̄)² | nk − 1 | |

ρ̂x (ICC) = (BMS − WMS) / (BMS + (k − 1)·WMS)
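The BMS/WMS recipe can be turned into a few lines of code; a sketch for a balanced test-retest design (the data are made up):

```python
def icc_oneway(data):
    """One-way ICC: (BMS - WMS) / (BMS + (k-1)*WMS).
    `data` is a list of per-subject measurement lists, each of length k."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    means = [sum(row) / k for row in data]
    # Between-subjects and within-subjects mean squares
    bms = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    wms = sum((x - means[i]) ** 2
              for i, row in enumerate(data) for x in row) / (n * (k - 1))
    return (bms - wms) / (bms + (k - 1) * wms)

print(icc_oneway([[1, 1], [2, 2], [3, 3]]))                   # 1.0 -- exact agreement
print(round(icc_oneway([[10, 12], [20, 19], [30, 31]]), 2))   # 0.99 -- small within-subject error
```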
Interpretation of ICC

ρ̂x (ICC) = (BMS − WMS) / (BMS + (k − 1)·WMS)
Interpretation of ICC
• The ICC ranges between 0 and 1 and is a measure of reliability adjusted for chance agreement
• An ICC of 1 is obtained when there is perfect agreement; in general, a higher ICC is obtained when the within-subject error (i.e., random error) is small.
• Hence, ICC = 1 only when there is exact agreement between measures (i.e., Xi1 = Xi2 = … = Xik for each subject).
• Generally, ICCs greater than 0.7 are considered to indicate good reliability.
Two-way fixed effects ANOVA for computation of ICC (inter-rater reliability)

| Source of variance | Sum of squares (SS) | Degrees of freedom (df) | Mean square (MS = SS/df) |
|---|---|---|---|
| Between subjects | Σi k(X̄i − X̄)² | n − 1 | SMS |
| Between measures | Σj n(X̄j − X̄)² | k − 1 | MMS |
| Within subjects (random error) | by subtraction | (n − 1)(k − 1) | EMS |
| Total | Σi Σj (Xij − X̄)² | nk − 1 | |

ρ̂x (ICC) = n(SMS − EMS) / (n·SMS + (k − 1)·MMS + (n − 1)(k − 1)·EMS)
Measuring reliability of categorical variables
• Percent agreement (concordance rate)
• Kappa statistic
Reliability of categorical variables
• The concordance rate is the proportion of observations on which the two observers agree.
• Example: agreement matrix for radiologists reading mammography for breast cancer
                          Radiologist B
                          Yes (+)    No (−)    Total
Radiologist A   Yes (+)      a          b      a + b
                No  (−)      c          d      c + d
                Total      a + c      b + d

Overall % agreement = (a + d) / (a + b + c + d)
Concordance rates: limitations
• Considerable agreement could be expected by chance alone.
• Percent agreement is misleading when the observations are not evenly distributed among the categories (i.e., when the proportion "abnormal" on a dichotomous test is substantially different from 50%).
So, what reliability measures should we use?
Kappa
• Kappa is another measure of reliability.
• Kappa measures the extent of agreement beyond what would be expected by chance alone.
• It can be used for binary variables or for nominal variables with more than two levels.
Cohen’s Kappa (κ): some notation
• A reliability study in which n subjects have each been measured twice, where each measure is a nominal variable with k categories.
• It is assumed that the two measures are equally accurate.
• κ is a measure of agreement that corrects for the agreement that would be expected by chance.
Cohen’s Kappa
Table. Layout of data for computation of Cohen’s κ and weighted κ

                           Measure 2 (or Rater 2)
                      1      2     . .     k      Total
Measure 1      1     p11    p12    . .    p1k      r1
(or Rater 1)   2     p21    p22    . .    p2k      r2
               .      .      .      .      .        .
               k     pk1    pk2    . .    pkk      rk
           Total     c1     c2     . .    ck       1
The observed proportion of agreement, Po, is the sum of the proportions on the diagonal of the table above:

$$P_o = \sum_{i=1}^{k} p_{ii}$$
The expected proportion of agreement (on the diagonal), Pe, is:

$$P_e = \sum_{i=1}^{k} r_i c_i$$

where ri and ci are the marginal proportions for the 1st and 2nd measure, respectively.
Kappa

Then, kappa is estimated by:

$$\hat{\kappa} = \frac{P_o - P_e}{1 - P_e} = \frac{\text{Observed agreement (\%)} - \text{Expected agreement (\%)}}{100\% - \text{Expected agreement (\%)}}$$

• The denominator, $1 - P_e$, is the maximum possible nonchance agreement: 100% less the contribution of chance.
• The numerator, $P_o - P_e$, is the proportion of observations that can be attributed to reliable measurement (i.e., not due to chance).
• So kappa is the ratio of observed nonchance agreement to the maximum possible nonchance agreement.
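Putting Po, Pe, and the kappa formula together, here is a small sketch in plain Python (the function name and counts-table layout are our own, with rows = rater 1 and columns = rater 2):

```python
# Sketch: Cohen's kappa from a k x k table of raw counts.
# table[i][j] = number of subjects rated category i by rater 1 and j by rater 2.
def cohens_kappa(table):
    k = len(table)
    total = sum(sum(row) for row in table)
    p = [[cell / total for cell in row] for row in table]     # proportions p_ij
    p_o = sum(p[i][i] for i in range(k))                      # observed agreement
    r = [sum(row) for row in p]                               # row marginals r_i
    c = [sum(p[i][j] for i in range(k)) for j in range(k)]    # column marginals c_j
    p_e = sum(r[i] * c[i] for i in range(k))                  # chance agreement
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement (all counts on the diagonal) gives κ = 1, and κ = 0 whenever the observed agreement equals the chance agreement.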
Pictorial of kappa statistic

[Figure: a 0–100% agreement scale marking the agreement expected by chance and the observed agreement; the interval between them, relative to the full interval from chance to 100%, is the potential improvement beyond chance. Kappa is the percentage of the maximum possible improvement over that expected by chance alone that is actually achieved (kappa ≈ 0.50 in this figure).]
Kappa

• Kappa ranges from −1 (perfect disagreement) to +1 (perfect agreement).
• A kappa of 0 means that the observed agreement equals the expected agreement:

$$\hat{\kappa} = \frac{P_o - P_e}{1 - P_e} = 0 \quad \text{when } P_o = P_e$$
Reliability of categorical variables

• Example 1: agreement matrix for radiologists reading mammography for breast cancer

                          Radiologist B
                          Yes (+)    No (−)    Total
Radiologist A   Yes (+)    21 (a)     43 (b)     64
                No  (−)     3 (c)     83 (d)     86
                Total       24        126       150

Overall % agreement = (a + d) / (a + b + c + d) = (21 + 83) / 150 = 0.69

$$\hat{\kappa} = \frac{P_o - P_e}{1 - P_e} = \frac{0.69 - 0.55}{1 - 0.55} = 0.31$$
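As a quick check on the arithmetic (plain Python; variable names ours): the slide's 0.31 reflects rounding Po and Pe to two decimals before dividing, while carrying full precision gives approximately 0.32.

```python
# Worked example: mammography agreement matrix (a=21, b=43, c=3, d=83).
a, b, c, d = 21, 43, 3, 83
n = a + b + c + d                                        # 150 readings
p_o = (a + d) / n                                        # observed agreement
p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2   # chance agreement from marginals
kappa = (p_o - p_e) / (1 - p_e)
print(round(p_o, 2), round(p_e, 2), round(kappa, 2))     # prints: 0.69 0.55 0.32
```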