POPULATIONS AND SAMPLING

Let’s Look at our Example Research Question

How do UF COP pharmacy students who only watch videostreamed lectures differ from those who attend class lectures (and also have access to videostreamed lectures) in terms of learning outcomes?

Population

Who Do You Want These Study Results to Generalize To?

Population: The group you wish to generalize to is your population. There are two types:

– Theoretical population: In our example, this would be all pharmacy students in the US

– Accessible population: In our example, this would be all COP pharmacy students

Sampling

Target population or sampling frame: All in the accessible population that you can draw your sample from.

Sample: The group of people you select to be in the study; a subgroup of the target population. This is not necessarily the group that is actually in your study.

Sampling: How you select your sample

Sampling Strategies

Probability sampling
– Simple random sampling
– Stratified sampling
– Multistage cluster sampling

Nonprobability sampling
– Convenience sampling
– Snowball sampling
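The two simplest probability strategies listed above can be sketched in a few lines of Python. This is only an illustration: the sampling frame of 300 students, the campus codes other than GNV, and the function names are invented for the example and are not taken from the study.

```python
import random

def simple_random_sample(frame, n, seed=0):
    """Simple random sampling: every unit in the frame is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def stratified_sample(frame, stratum_of, fraction, seed=0):
    """Stratified sampling: draw the same fraction at random from every stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical sampling frame: 300 students spread over four campuses (assumed codes).
campuses = ["GNV", "JAX", "ORL", "STP"]
frame = [(f"student{i:03d}", campuses[i % 4]) for i in range(300)]

srs = simple_random_sample(frame, 60)
strat = stratified_sample(frame, stratum_of=lambda s: s[1], fraction=0.2)
```

A simple random sample can over- or under-represent a campus by chance; the stratified draw guarantees each campus contributes in proportion to its size.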

Sample Size

Select as large a sample as possible from your population. There is less potential error that the sample is different from the population when you use a large sample.

Sampling error: The difference between the sample estimate and the true population value (example: exam score).

Sample Size

Sample size formulas/tables can be used. Factors that are considered include:
– Confidence in the statistical test
– Sampling error

See Appendix B in Creswell (pg 630):
– Sampling error formula: used to determine sample size for a survey
– Power analysis formula: used to determine group size in an experimental study
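As a rough illustration of how confidence and sampling error drive survey sample size, the sketch below uses one common textbook formula for estimating a proportion, n = z² p(1 − p) / e², with an optional finite population correction. It is not necessarily the formula in Creswell's Appendix B, and the function name and defaults are assumptions made for this example.

```python
import math
from statistics import NormalDist

def survey_sample_size(margin_of_error, confidence=0.95, proportion=0.5, population=None):
    """Sample size needed to estimate a proportion within +/- margin_of_error."""
    # z value for the requested two-sided confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    if population is not None:
        # finite population correction for a small accessible population
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

n_large_population = survey_sample_size(0.05)
n_small_population = survey_sample_size(0.05, population=300)
```

Using proportion=0.5 is the conservative choice because it maximizes p(1 − p); a smaller tolerated sampling error or a higher confidence level both increase the required sample size.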

Back to Our Example

– What is our theoretical population?
– What is our accessible population?
– What sampling strategy should we use?

How do UF COP pharmacy students who only watch videostreamed lectures differ from those who attend class lectures (and also have access to videostreamed lectures) in terms of learning outcomes?

Important Concept: Random sampling vs random assignment

We have talked about random sampling in this session.

Random sampling is not the same as random assignment.
– Random sampling is used to select individuals from the population who will be in the sample.
– Random assignment is used in an experimental design to assign individuals to groups.
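The distinction can be made concrete with a short sketch in which the two steps happen one after the other. The population of 300 student IDs, the sample size of 100, and the group names are hypothetical values chosen for illustration.

```python
import random

rng = random.Random(42)

# Hypothetical accessible population of 300 students.
accessible_population = [f"student{i:03d}" for i in range(300)]

# Random sampling: decides WHO is in the study.
sample = rng.sample(accessible_population, 100)

# Random assignment: decides WHICH GROUP each sampled person joins.
shuffled = sample[:]
rng.shuffle(shuffled)
video_only = shuffled[:50]
lecture = shuffled[50:]
```

Random sampling supports generalizing from the sample to the population; random assignment supports the claim that the groups were comparable before the treatment.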

VALIDITY

Lou Ann Cooper, PhD
Director of Program Evaluation and Medical Education Research
University of Florida College of Medicine

INTRODUCTION

Both research and evaluation include:
– Design: how the study is conducted
– Instruments: how data is collected
– Analysis of the data to make inferences about the effect of a treatment or intervention

Each of these components can be affected by bias.

INTRODUCTION

Two types of error in research:

– Random error due to random variation in participants’ responses at measurement. Inferential statistics, i.e. the p-value and 95% confidence interval, measure random error and allow us to draw conclusions based on research data.

– Systematic error or bias.

BIAS: DEFINITION

Deviations of results (or inferences) from the truth, or processes leading to such deviation.

Any trend in the selection of subjects, data collection, analysis, interpretation, publication or review of data that can lead to conclusions that are systematically different from the truth.

Systematic deviation from the truth that distorts the results of research.

BIAS

Bias is a form of systematic error that can affect scientific investigations and distort the measurement process.

Bias is primarily a function of study design and execution, not of results, and should be addressed early in the study planning stages.

Not all bias can be controlled or eliminated; attempting to do so may limit usefulness and generalizability.

Awareness of the presence of bias will allow more meaningful scrutiny of the results and conclusions.

A biased study loses validity, and bias is a common reason for invalid research.

POTENTIAL BIASES IN RESEARCH AND EVALUATION

Study Design
– Issues related to internal validity
– Issues related to external validity

Instrument Design
– Issues related to construct validity

Data Analysis
– Issues related to statistical conclusion validity

VALIDITY

Validity is discussed and applied based on two complementary conceptualizations in education and psychology:

Test validity: the degree to which a test measures what it was designed to measure.

Experimental validity: the degree to which a study supports the intended conclusion drawn from the results.

FOUR TYPES OF VALIDITY QUESTIONS

– Conclusion: Is there a relationship between cause and effect?
– Internal: Is the relationship causal?
– Construct: Can we generalize to the constructs?
– External: Can we generalize to other persons, places, times?

CONCLUSION VALIDITY

Conclusion validity is the degree to which conclusions we reach about relationships are reasonable, credible or believable.

Relevant for both quantitative and qualitative research studies.

Is there a relationship in your data or not?

STATISTICAL CONCLUSION VALIDITY

Basing conclusions on proper use of statistics

– Reliability of measures
– Reliability of implementation
– Type I errors and statistical significance
– Type II errors and statistical power
– Fallacies of aggregation
– Interaction and non-linearity
– Random irrelevancies in the experimental setting
– Random heterogeneity of respondents

VIOLATED ASSUMPTIONS OF STATISTICAL TESTS

The particular assumptions of a statistical test must be met if the results of the analysis are to be meaningfully interpreted.

Levels of measurement.

Example: Analysis of Variance (ANOVA)

LEVELS OF MEASUREMENT

A hierarchy is implied in the idea of level of measurement.

At lower levels, assumptions tend to be less restrictive and data analyses tend to be less sensitive.

In general, it is desirable to have a higher level of measurement (interval or ratio) rather than a lower one (nominal or ordinal).

STATISTICAL ANALYSIS AND LEVEL OF MEASUREMENT

ANALYSIS OF VARIANCE ASSUMPTIONS
– Independence of cases.
– Normality. In each of the groups, the data are continuous and normally distributed.
– Equal variances or homoscedasticity. The variance of data in groups should be the same.

The Kruskal-Wallis test is a nonparametric alternative which does not rely on an assumption of normality.

RELIABILITY

Measures (tests and scales) of low reliability may not register true changes.

Reliability of treatment implementation: when treatments/procedures are not administered in a standard fashion, error variance is increased and the chance of obtaining true differences will decrease.

STATISTICAL DECISION

TRUE POPULATION STATUS    Reject H0                          Retain H0
H0 is TRUE                Type I Error (α)                   Correct Decision (1 - α)
H0 is FALSE               Correct Decision (1 - β, Power)    Type II Error (β)

TYPE I ERRORS AND STATISTICAL SIGNIFICANCE

A Type I error is made when a researcher concludes that there is a relationship and there really isn’t (false positive).

If the researcher rejects H0 because p ≤ .05, ask:
– If data are from a random sample, is the significance level appropriate?
– Are significance tests applied to a priori hypotheses?
– Fishing and the error rate problem
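The error rate problem behind "fishing" can be shown with a small calculation: if several tests are run at the same alpha and all null hypotheses are true, the chance of at least one Type I error grows quickly. The sketch assumes the tests are independent, which real analyses often are not, and the function names are made up for this example.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test alpha that keeps the familywise rate at or below alpha (Bonferroni)."""
    return alpha / n_tests

rates = {k: familywise_error_rate(0.05, k) for k in (1, 5, 10, 20)}
per_test_alpha = bonferroni_alpha(0.05, 10)
```

With ten unplanned comparisons at alpha = .05 the familywise rate is roughly 40%, which is why significance tests on a priori hypotheses are easier to defend than tests found by searching the data.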

TYPE II ERRORS AND STATISTICAL POWER

A Type II error is made when a researcher concludes that there is not a relationship and there really is (false negative).

If the researcher fails to reject H0 because p > .05, ask:
– Has the researcher used statistical procedures of adequate power?
– Does failure to reject H0 merely reflect a small sample size?

FACTORS THAT INFLUENCE POWER AND STATISTICAL INFERENCE

– Alpha level
– Effect size
– Directional vs. non-directional test
– Sample size
– Unreliable measures
– Violating the assumptions of a statistical test
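Several of the factors above can be explored with a normal-approximation power calculation for comparing two group means. This is a sketch, not a replacement for a proper power analysis: it assumes equal group sizes, a standardized effect size (difference in means divided by the common standard deviation), and a z-test rather than a t-test. The function name is hypothetical.

```python
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05, two_sided=True):
    """Approximate power of a two-group z-test for a standardized mean difference."""
    norm = NormalDist()
    # expected value of the test statistic when the effect is real
    shift = effect_size * (n_per_group / 2) ** 0.5
    if two_sided:
        z_crit = norm.inv_cdf(1 - alpha / 2)
        return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)
    z_crit = norm.inv_cdf(1 - alpha)
    return norm.cdf(shift - z_crit)

baseline = approx_power(0.5, 64)
larger_sample = approx_power(0.5, 128)
directional = approx_power(0.5, 64, two_sided=False)
stricter_alpha = approx_power(0.5, 64, alpha=0.01)
```

Comparing the four values shows the pattern in the list: power rises with sample size and with a directional test, and falls when alpha is made stricter.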

RANDOM IRRELEVANCIES

Features of the experimental setting other than the treatment affect scores on the dependent variable

Controlled by choosing settings free from extraneous sources of variation

Measure anticipated sources of variance to include in the statistical analysis

RANDOM HETEROGENEITY OF RESPONDENTS

Participants can differ on factors that are correlated with the major dependent variables

Certain respondents will be more affected by the treatment than others

Minimized by:
– Blocking variables and covariates
– Within-subjects designs

STRATEGIES TO REDUCE ERROR TERMS

– Subjects as own control
– Homogeneous samples
– Pretest measures on the same scales used for measuring the effect
– Matching on variables correlated with the post-test
– Effects of other variables correlated with the post-test used as covariates
– Increase the reliability of the dependent variable measures

STRATEGIES TO REDUCE ERROR TERMS

Estimates of the desired magnitude of a treatment effect should be elicited before research begins

Absolute magnitude of the treatment effect should be presented so readers can infer whether a statistically reliable effect is practically significant.

INTERNAL VALIDITY

Internal validity has to do with defending against sources of bias arising in a research design.

To what degree is the study designed such that we can infer that the educational intervention caused the measured effect?

An internally valid study will minimize the influence of extraneous variables.

Example: Did participation in a series of webinars on TB in children change the practice of physicians?

THREATS TO INTERNAL VALIDITY

– History
– Maturation
– Testing
– Instrumentation
– Statistical regression
– Selection
– Interactions with selection
– Mortality

INTERNAL VALIDITY: THREATS IN SINGLE GROUP REPEATED MEASURES DESIGNS

– History
– Maturation
– Testing
– Instrumentation
– Mortality
– Regression

THREATS TO INTERNAL VALIDITY: HISTORY

The observed effects may be due to or be confounded with nontreatment events occurring between the pretest and the post-test

History is a threat to conclusions drawn from longitudinal studies

Greater time period between measurements = more risk of a history effect

History is not a threat in cross sectional designs conducted at one point in time

THREATS TO INTERNAL VALIDITY: MATURATION

Invalid inferences may be made when the maturation of participants between measurements has an effect and this maturation is not the research interest.

Internal (physical or psychological) changes in participants unrelated to the independent variable – older, wiser, stronger, more experienced.

THREATS TO INTERNAL VALIDITY: TESTING

Reactivity as a result of testing

The effects of taking a test on the outcomes of a second test:
– Practice
– Learning

Improved scores on the second administration of a test can be expected even in the absence of intervention due to familiarity

THREATS TO INTERNAL VALIDITY: INSTRUMENTATION

Changes in instruments, observers or scorers which may produce changes in outcomes

Observers/raters, through experience, become more adept at their task

Ceiling and floor effects

Longitudinal studies

THREATS TO INTERNAL VALIDITY: STATISTICAL REGRESSION

Test-retest scores tend to drift systematically toward the mean rather than remain stable or become more extreme

Regression effects may obscure treatment effects or developmental changes

Most problematic when participants are selected because they are extreme on the classification variable of interest
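A small simulation makes the regression effect visible. It is a sketch under invented assumptions: each observed score is a stable true score plus independent measurement error, no treatment is given, and the lowest 10% of pretest scorers are selected for follow-up. The means, standard deviations, and sample size are arbitrary.

```python
import random
from statistics import mean

rng = random.Random(1)

def observed_score(true_score):
    # observed score = stable true score + independent measurement error
    return true_score + rng.gauss(0, 10)

true_scores = [rng.gauss(70, 10) for _ in range(5000)]
pretest = [observed_score(t) for t in true_scores]
posttest = [observed_score(t) for t in true_scores]  # no treatment in between

# Select participants because they are extreme (lowest 10%) on the pretest.
cutoff = sorted(pretest)[len(pretest) // 10]
selected = [i for i, score in enumerate(pretest) if score <= cutoff]

pre_mean = mean(pretest[i] for i in selected)
post_mean = mean(posttest[i] for i in selected)
overall_change = mean(posttest) - mean(pretest)
```

The selected group's post-test mean moves back toward the overall mean even though nothing was done to them, while the full sample shows essentially no change. In a real study this apparent gain could be mistaken for a treatment effect.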

THREATS TO INTERNAL VALIDITY: MORTALITY

Differences in drop-out rates/attrition across conditions of the experiment

Makes “before” and “after” samples not comparable

This selection artifact may become operative in spite of random assignment

Major threat in longitudinal studies

INTERNAL VALIDITY: MULTIPLE GROUP THREATS

Selection

Interactions with Selection:
– Selection-History
– Selection-Maturation
– Selection-Testing
– Selection-Instrumentation
– Selection-Mortality
– Selection-Regression

THREATS IN DESIGNS WITH GROUPS: SOCIAL INTERACTION THREATS

– Compensatory equalization of treatments
– Compensatory rivalry
– Resentful demoralization
– Treatment imitation or diffusion
– Unintended treatments

EXTERNAL VALIDITY

The extent to which the results of a study can be generalized:
– Population validity: generalizations related to other groups of people
– Ecological validity: generalizations related to other settings, times, contexts, etc.

THREATS TO EXTERNAL VALIDITY

– Pre-test treatment interaction
– Multiple treatment interference
– Interaction of selection and treatment
– Interaction of setting and treatment
– Interaction of history and treatment
– Experimenter effects

THREATS TO EXTERNAL VALIDITY

– Reactive arrangements
– Artificial environment
– Hawthorne effect
  ◊ Halo effect
  ◊ John Henry effect
– Placebo effect
– Participant-researcher interaction
– Novelty effect

SELECTING A RESEARCH DESIGN



What If…

We gave 150 pharmacy students (all are distance campus) access to streaming video and then measured their performance on a written exam (measures achievement of learning outcomes)…

PRE-EXPERIMENTAL DESIGNS

One Group Posttest Design

X O

X = Implementation of the treatment
O = Measurement of the participants in the experimental group

Also referred to as ‘One Shot Case Study’


For Discussion: What are the threats to validity?

SOURCES OF INVALIDITY

Internal
History                          –
Maturation                       –
Testing
Instrumentation
Regression
Mortality                        –
Selection                        –
Selection Interactions

External
Interaction of Testing and X
Interaction of Selection and X   –
Reactive Arrangements
Multiple X Interference

PRE-EXPERIMENTAL DESIGNS

Comparison Group Posttest Design

X O
- - - - - - -
  O

– Static Group Comparison
– Ex post facto research
– No pretest observations


For Discussion: What if we compare test scores for these students with last year’s scores (assume last year had no streaming video)?

SOURCES OF INVALIDITY

Internal
History                          +
Maturation                       ?
Testing                          +
Instrumentation                  +
Regression                       +
Mortality                        –
Selection                        –
Selection Interactions           –

External
Interaction of Testing and X
Interaction of Selection and X   –

What If…

We gave 150 pharmacy students (all are distance campus) access to streaming video and measured their performance on a written exam both before and after the intervention (measures achievement of learning outcomes)…

PRE-EXPERIMENTAL DESIGNS

One Group Pretest/Posttest Design

O X O

– Not a true experiment
– Because participants serve as their own control, results may be less biased

What If…

We gave 150 pharmacy students (all are distance campus) access to streaming video and measured their performance on a written exam both before and after the intervention (measures achievement of learning outcomes)…

For Discussion: What are the threats to validity (what are the plausible hypotheses that could explain any difference)?

SOURCES OF INVALIDITY

Internal
History                          –
Maturation                       –
Testing                          –
Instrumentation                  –
Regression                       ?
Mortality                        +
Selection                        +
Selection Interactions           –

External
Interaction of Testing and X     –
Interaction of Selection and X   –
Reactive Arrangements            ?

What If…

We could randomize all 300 pharmacy students to the following groups:
Group 1: access only streaming video
Group 2: attend lectures

For each group, administer both a pre-test and a post-test

TRUE EXPERIMENTAL DESIGNS

Pretest/Posttest Design with Control Group and Random Assignment

R O X O
- - - - - - -
R O   O

– Measurement of pre-existing differences
– Controls most threats to internal validity

What If…

We could randomize all 300 pharmacy students to the following groups:
Group 1: access only streaming video
Group 2: attend lectures

For each group, administer both a pre-test and a post-test

For Discussion: What are the threats to validity (what are the plausible hypotheses that could explain any difference)?

SOURCES OF INVALIDITY

Internal
History                          +
Maturation                       +
Testing                          +
Instrumentation                  +
Regression                       +
Mortality                        +
Selection                        +
Selection Interactions           +

External
Interaction of Testing and X     –
Interaction of Selection and X   ?
Reactive Arrangements            ?

What If…

We could randomize all 300 pharmacy students to the following groups:

Group 1: access only streaming video and post-test
Group 2: attend lectures and post-test

TRUE EXPERIMENTAL DESIGNS

Posttest Only Control Group

R X O
- - - - - - -
R   O

What If…

We could randomize all 300 pharmacy students to the following groups:

Group 1: access only streaming video and post-test
Group 2: attend lectures and post-test

For Discussion: What have we lost by not using a pre-test (as compared to the experimental randomized pre-test and post-test design)?

SOURCES OF INVALIDITY

Internal
History                          +
Maturation                       +
Testing                          +
Instrumentation                  +
Regression                       +
Mortality                        +
Selection                        +
Selection Interactions           +

External
Interaction of Testing and X     +
Interaction of Selection and X   ?
Reactive Arrangements            ?

What If…

We could randomize all 300 pharmacy students to the following groups:
Group 1: pre-test, access only streaming video, and post-test
Group 2: pre-test, attend lectures, and post-test
Group 3: access only streaming video and post-test only
Group 4: attend lectures and post-test only

TRUE EXPERIMENTAL DESIGNS

Solomon Four Group Comparison

R O X O
R O   O
R   X O
R     O

What If…

We could randomize all 300 pharmacy students to the following groups:
Group 1: pre-test, access only streaming video, and post-test
Group 2: pre-test, attend lectures, and post-test
Group 3: access only streaming video and post-test only
Group 4: attend lectures and post-test only

For Discussion: What have we gained by having 4 groups (esp. groups 3 and 4)?

What If…

It is NOT feasible to use randomization. What if we were to have the following groups:
Group 1 (all distant campuses): access only streaming video
Group 2 (GNV campus): attend lectures

QUASI-EXPERIMENTAL DESIGNS

Nonequivalent Control Group

O X O
O   O

– Pre-existing differences can be measured
– Controls some threats to validity

What If…

It is NOT feasible to use randomization. What if we were to have the following groups:
Group 1 (all distant campuses): access only streaming video
Group 2 (GNV campus): attend lectures

For Discussion: What have we “lost” by not randomizing?

SOURCES OF INVALIDITY

Internal
History                          –
Maturation                       –
Testing
Instrumentation
Regression
Mortality                        –
Selection                        –
Selection Interactions

External
Interaction of Testing and X
Interaction of Selection and X   –

QUASI-EXPERIMENTAL DESIGNS

Time Series

O O O O X O O O O

QUASI-EXPERIMENTAL DESIGNS

Counterbalanced Design

O X1 O X2 O X3
O X3 O X1 O X2
O X2 O X3 O X1

MEASUREMENT VALIDITY: SOURCES OF EVIDENCE

CLASSIC VIEW OF TEST VALIDITY

Traditional triarchic view of validity:
– Content
– Criterion
  ◊ Concurrent
  ◊ Predictive
– Construct

Tests were described as “valid” or “invalid”

Reliability was considered a separate test trait

MODERN VIEW OF VALIDITY

Scientific evidence needed to support test score interpretation
– Standards for Educational & Psychological Testing (1999)
– Cronbach, Messick, Kane

Some theory, key concepts, examples

Reliability as part of validity

VALIDITY: DEFINITIONS

“A proposition deserves some degree of trust only when it has survived serious attempts to falsify it.” (Cronbach, 1980)

According to the Standards, validity refers to “the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores.”

“Validity is an integrative summary.” (Messick, 1995)

“Validation is the process of building an argument supporting interpretation of test scores.” (Kane, 1992)

WHAT IS A CONSTRUCT?

Constructs are psychological attributes, hypothetical concepts

A defensible construct has:
– A theoretical basis
– Clear operational definitions involving measurable indicators
– Demonstrated relationships to other constructs or observable phenomena

A construct should be differentiated from related theoretical constructs as well as from methodological irrelevancies

THREATS TO CONSTRUCT VALIDITY (Cook & Campbell)

– Inadequate preoperational explication of constructs
– Mono-operation bias
– Mono-method bias
– Interaction of different treatments
– Interaction of testing and treatment
– Restricted generalizability across constructs

THREATS TO CONSTRUCT VALIDITY (Cook & Campbell)

– Confounding constructs
– Confounding levels of constructs
– Hypothesis guessing within experimental conditions
– Evaluation apprehension
– Researcher expectancies

SOURCES OF VALIDITY EVIDENCE

1. Test Content
   – Task Representation
   – Construct Domain
2. Response Process: Item Psychometrics
3. Internal Structure: Test Psychometrics
4. Relationships with Other Variables: Correlations
   – Test-Criterion Relationships
   – Convergent and Divergent Data
5. Consequences of Testing: Social context

Standards for Educational and Psychological Testing, 1999

ASPECTS OF VALIDITY: CONTENT

Content validity refers to how well elements of the test or scale relate to the content domain.
– Content relevance.
– Content representativeness.
– Content coverage.

Systematic analysis of what the test is intended to measure.
– Technical quality.
– Construct-irrelevant variance.

SOURCES OF VALIDITY EVIDENCE: TEST CONTENT

Detailed understanding of the content sampled by the instrument and its relationship to content domain

Content-related evidence is often established during the planning stages of an assessment or scale.

Content-related validity studies:
– Exact sampling plan, table of specifications, blueprint
– Representativeness of items/prompts → domain
– Appropriate content for instructional objectives
  ◊ Cognitive level of items
  ◊ Match to instructional objectives

Review by panel of experts:
– Content expertise of item/prompt writers
– Expertise of content reviewers
– Quality of items/prompts, sensitivity review

ASPECTS OF VALIDITY: RESPONSE PROCESSES

– Emphasis is on the role of theory.
– Tasks sample domain processes as well as content.
– Accuracy in combining scores from different item formats or subscales.
– Quality control: scanning, assignment of grades, score reports.

SOURCES OF VALIDITY EVIDENCE: RESPONSE PROCESSES

– Fit of student responses to hypothesized construct?
– Basic quality control information: accuracy of item responses, recording, data handling, scoring
– Statistical evidence that items/tasks measure the intended construct
  ◊ Achievement items measure intended content and not other content
  ◊ Ability items predict targeted achievement outcome
  ◊ Ability items fail to predict a non-related ability or achievement outcome

SOURCES OF EVIDENCE: RESPONSE PROCESSES

Debrief examinees regarding solution processes.

“Think-aloud” during pilot testing.

Subscore/subscale analyses, i.e., correlation patterns among part scores.

Accurate and understandable interpretations of scores for examinees.

SOURCES OF VALIDITY EVIDENCE: INTERNAL STRUCTURE

Statistical evidence of the hypothesized relationship between test item scores and the construct

– Reliability
  ◊ Test scale reliability
  ◊ Rater reliability
  ◊ Generalizability
– Item analysis data
  ◊ Item difficulty and discrimination
  ◊ MCQ option function analysis
  ◊ Inter-item correlations
– Scale factor structure
– Dimensionality studies
– Differential item functioning (DIF) studies

ASPECTS OF VALIDITY: EXTERNAL

Can the test results be evaluated by objective criteria?

Correlations with other relevant variables:
– Test-criterion correlations
  ◊ Concurrent or predictive
– MTMM matrix
  ◊ Convergent correlations
  ◊ Divergent (discriminant) correlations

SOURCES OF VALIDITY EVIDENCE: RELATIONSHIPS TO OTHER VARIABLES

Statistical evidence of the hypothesized relationship between test scores and the construct

– Criterion-related validity studies
– Correlations between test scores/subscores and other measures
– Convergent-Divergent studies
– MTMM

RELATIONSHIPS WITH OTHER VARIABLES

Predictive validity: Variation of concurrent validity where the criterion is in the future.

A classic example is to determine whether students who score high on an admissions test such as the MCAT earn higher preclinical GPAs.


Convergent validity: Assessed by the correlation among items which make up the scale (internal consistency), by the correlation of the given scale with measures of the same construct using instruments proposed by other researchers, and by the correlation of relationships involving the given scale across samples or across methods.


Criterion (concurrent) validity: correlation between scale or instrument measurement items and known accepted standard measures or criteria.

Do the proposed measures for a given concept exhibit generally the same direction and magnitude of correlation with other variables as measures of that concept already accepted in this area of research?

Divergent (discriminant) validity: The indicators of different constructs should not be so highly correlated as to lead us to conclude that they measure the same thing. This would happen if there is definitional overlap between two constructs.

MULTI-TRAIT MULTI-METHOD (MTMM) MATRIX

Mono-method and/or mono-trait biases: use of a single data gathering method or a single indicator for a concept may result in bias

Multi-trait/multi-method validation uses multiple indicators per concept and gathers data for each indicator by multiple methods or multiple sources.

MULTI-TRAIT MULTI-METHOD (MTMM) MATRIX

                   ILS                                     LSTI
                   ActRefl   SensInt   VisVerb   SeqGlob   ExtInt    SensInt   ThinkFeel JudPer
ILS   ActRefl      0.75
ILS   SensInt     -0.15      0.81
ILS   VisVerb      0.03      0.18      0.60
ILS   SeqGlob     -0.32     -0.48     -0.12      0.81
LSTI  ExtInt       0.60     -0.11      0.05     -0.43      0.54
LSTI  SensInt     -0.22      0.69     -0.02      0.54     -0.18      0.69
LSTI  ThinkFeel    0.02     -0.09      0.09      0.10      0.01      0.02      0.19
LSTI  JudPer      -0.27      0.46     -0.18      0.39     -0.33      0.51     -0.01      0.50

Matrix regions: reliability diagonal (monotrait-monomethod), heterotrait-monomethod, validity diagonal, heterotrait-heteromethod

ILS = Index of Learning Styles
LSTI = Learning Style Type Indicator

ILS scales: ActRefl = Active-reflective, SensInt = Sensing-intuitive, VisVerb = Visual-verbal, SeqGlob = Sequential-global
LSTI scales: ExtInt = Extrovert-introvert, SensInt = Sensing-intuition, ThinkFeel = Thinking-feeling, JudPer = Judging-perceiving

Source: Cook DA; Smith AJ. Validity of index of learning styles scores: multitrait-multimethod comparison with three cognitive learning style instruments. Medical Education, 2006; 40: 900-907


RELATIONSHIP BETWEEN RELIABILITY AND VALIDITY

Neither is a property of a test or scale. Reliability is important validity evidence. Without reliability, there can be no validity.

Reliability is necessary, but not sufficient for validity.

Purpose of an instrument dictates what type of reliability is important and the sources of validity evidence necessary to support the desired inferences.

SOURCES OF VALIDITY EVIDENCE: CONSEQUENCES

Evidence of the effects of tests on students, instruction, schools, society

– Consequential validity
– Social consequences of assessment
– Effects of passing-failing tests
  ◊ Economic costs of failure
  ◊ Costs to society of false positive/false negative decisions
– Effects of tests on instruction/learning
  ◊ Intended vs. unintended

RELIABILITY AND INSTRUMENTATION



TYPES OF RELIABILITY

Different types of assessments require different kinds of reliability

Written MCQ/Likert-scale items
– Scale reliability
– Internal consistency

Written Constructed Response and Essays
– Inter-rater agreement
– Generalizability theory

TYPES OF RELIABILITY

Oral Exams
– Rater reliability
– Generalizability theory

Observational Assessments
– Rater reliability
– Inter-rater agreement
– Generalizability theory

Performance Exams (OSCEs)
– Rater reliability
– Generalizability theory
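One widely used index of inter-rater agreement for categorical ratings is Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The slides do not name a specific index, so this choice, the function name, and the pass/fail ratings below are assumptions made purely for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters scoring the same cases."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    # agreement expected if the raters assigned categories independently
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical pass/fail ratings of ten examinees by two raters.
rater_a = ["pass"] * 6 + ["fail"] * 4
rater_b = ["pass"] * 5 + ["fail"] * 4 + ["pass"]
kappa = cohens_kappa(rater_a, rater_b)
```

Here the raters agree on 8 of 10 cases, but kappa is noticeably lower than 0.8 because some of that agreement would be expected by chance alone.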

ROUGH GUIDELINES FOR RELIABILITY

The higher the better!

Depends on purpose of test

Very high-stakes: > 0.90 (Licensure exams)

Moderate stakes: at least ~ 0.75 (Classroom test, Medical school OSCE)

Low stakes: > 0.60 (Quiz, test for feedback only)

INCREASING RELIABILITY

Written tests
– Use objectively scored formats
– At least 35-40 MCQs
– MCQs that differentiate between high and low scorers

Performance exams
– At least 7-12 cases
– Well trained standardized patients and/or other raters
– Monitoring and quality control

Observational Exams
– Many independent raters (7-11)
– Standard checklists/rating scales
– Timely ratings
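The advice to use more items, cases, or raters reflects the Spearman-Brown prophecy formula, which predicts reliability when a test is lengthened by a factor k: r_new = k·r / (1 + (k − 1)·r). The sketch below assumes the added items (or cases, or raters) are comparable in quality to the existing ones; the function names and the example numbers are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing test length by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

def length_factor_needed(reliability, target):
    """How many times longer the test must be to reach the target reliability."""
    return target * (1 - reliability) / (reliability * (1 - target))

# Hypothetical 20-item quiz with reliability 0.60, doubled to 40 items.
doubled = spearman_brown(0.60, 2)
factor_for_090 = length_factor_needed(0.60, 0.90)
```

Doubling the hypothetical quiz moves it from the low-stakes range toward the moderate-stakes guideline, while reaching 0.90 would require a test several times longer.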

SCALE DEVELOPMENT

1. Identify the primary purpose for which scores will be used.
   – Validity is the most important consideration.
   – Validity is not a property of an instrument.
   – Inferences to be made determine the type of items you will write.
2. Specify the important aspects of the construct to be measured.

SCALE DEVELOPMENT

3. Initial pool of items.
4. Expert review (content validity)
5. Preliminary item ‘tryout’
6. Statistical properties of the items
   – Item analysis
   – Reliability estimate
   – Dimensionality

ITEM ANALYSIS

– Item ‘difficulty’: item variance, frequencies
– Inter-item covariances/correlations
– Item discrimination: an item that discriminates well correlates with the total score.
– Cronbach’s coefficient alpha
– Factor analysis, multidimensional scaling, IRT
– Structural aspect of validity.
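Cronbach's coefficient alpha can be computed directly from the item variances and the variance of the total score: alpha = k/(k − 1) · (1 − Σ item variances / total-score variance). The sketch below uses only the Python standard library; the function name and the small table of Likert responses are invented for illustration.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """item_scores: one row per respondent, one column per item."""
    k = len(item_scores[0])
    item_vars = [pvariance([row[j] for row in item_scores]) for j in range(k)]
    total_var = pvariance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical responses: 5 respondents x 4 Likert items (1-5).
responses = [
    [4, 5, 4, 5],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 4],
]
alpha = cronbach_alpha(responses)
```

Alpha is high here because respondents who score high on one item tend to score high on the others. With real data it would be read alongside item discrimination and dimensionality evidence, not on its own.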

NEED TO EVALUATE SCALE

Jarvis & Petty (1996)

Hypothesis: Individuals differ in the extent to which they engage in evaluative responding.

Subjects were undergraduate psychology students.

Comprehensive reliability and validity studies.

Concluded the scale was ‘unidimensional’.

REFERENCES

Cook, T.D. & Campbell, D. T. (1979). Quasi-Experimentation: Design and Analysis Issues for Field Settings.

Downing, S. M. Threats to the validity of locally developed multiple-choice tests in medical education: construct-irrelevant variance and construct underrepresentation. Adv in Health Sci Educ 2002; 7:235-241.

Downing, S. M. Validity: On the meaningful interpretation of assessment data. Med Educ 2003; 37:830-837.

Messick, S. (1989) Validity. In Educational Measurement 3rd Ed. R. L. Linn, Ed.

Downing, S. M. Reliability: On the reproducibility of assessment data. Med Educ, 2004; 38:1006-1012.

http://www.socialresearchmethods.net
