F. Kaftandjieva. Terminology
Milestones in Comparability
1904: “The proof and measurement of association between two things”
1951: “Scores on two or more tests may be said to be comparable for a certain population if they show identical distributions for that population.”
1971: ‘Scales, norms, and equivalent scores’: Equating, Calibration, Comparability
Alignment
Alignment refers to the degree of match between test content and the standards.
Dimensions of alignment: content, depth, emphasis, performance, accessibility.
Alignment is related to content validity.
Specification (Manual, Ch. 4)
“Specification … can be seen as a qualitative method. … There are also quantitative methods for content validation but this manual does not require their use.” (p. 2)
24 pages of forms
Outcome: “A chart profiling coverage graphically in terms of levels and categories of CEF.” (p. 7)
Crocker, L. et al. (1989). Quantitative Methods for Assessing the Fit Between Test and Curriculum. Applied Measurement in Education, 2(2), 179-194.
Construct Instrument Examinees Moderator
Equating = = = no
Calibration = = no
Projection = no
Statistical moderation Other test
Social moderation Judges
Mislevy & Linn: Linking Assessments
Equating, Linking
Sample-Free Estimation
[Figure: scatter plot of estimates from Sub-sample A (horizontal axis) against Sub-sample B (vertical axis), both axes from -2.0 to 2.0; 301 items, r = .975]
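The idea behind the plot can be illustrated with a small simulation. This is only a sketch and not the analysis behind the figure: the Rasch model, the simulated responses, the item and sample sizes, and the simple pairwise estimator are all assumptions made here for illustration. Two sub-samples with clearly different ability levels answer the same items, and the difficulties estimated separately in each sub-sample should still line up closely.

```python
import math
import random


def simulate_responses(abilities, difficulties, rng):
    """Simulate 0/1 answers under the Rasch model (an assumption of this sketch)."""
    return [
        [1 if rng.random() < 1.0 / (1.0 + math.exp(-(theta - b))) else 0
         for b in difficulties]
        for theta in abilities
    ]


def pairwise_difficulties(responses):
    """Estimate centered item difficulties from pairwise item comparisons."""
    k = len(responses[0])
    # solved_not[i][j]: number of persons who solved item i but failed item j
    solved_not = [[0] * k for _ in range(k)]
    for row in responses:
        for i in range(k):
            if row[i] == 1:
                for j in range(k):
                    if row[j] == 0:
                        solved_not[i][j] += 1
    estimates = []
    for i in range(k):
        # log(n_ji / n_ij) estimates b_i - b_j; 0.5 guards against empty cells
        diffs = [
            math.log((solved_not[j][i] + 0.5) / (solved_not[i][j] + 0.5))
            for j in range(k) if j != i
        ]
        estimates.append(sum(diffs) / k)  # b_i minus the mean difficulty
    return estimates


def pearson_r(x, y):
    """Pearson correlation of two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2024)
true_b = [-1.5 + 3.0 * i / 14 for i in range(15)]        # 15 hypothetical items
sample_a = [rng.gauss(-0.7, 1.0) for _ in range(1000)]   # weaker sub-sample
sample_b = [rng.gauss(0.7, 1.0) for _ in range(1000)]    # stronger sub-sample

b_from_a = pairwise_difficulties(simulate_responses(sample_a, true_b, rng))
b_from_b = pairwise_difficulties(simulate_responses(sample_b, true_b, rng))
r_ab = pearson_r(b_from_a, b_from_b)
print(f"r between the two sets of estimates: {r_ab:.3f}")
```

The pairwise estimator is used because it only compares persons who solved exactly one item of a pair, so the ability distribution of the sub-sample largely drops out of the estimate.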
The ruler (θ scale)
[Figure: scatter plot of b-values estimated with OPLM (horizontal axis) against FACETS (vertical axis), both axes from -2.0 to 2.0; r = +0.9998]
[Figure: two θ rulers, each running from -3 to 3]
[Figure: two rulers for the same temperatures, Celsius (-300 to 150) and Fahrenheit (-500 to 300), with absolute zero and boiling water marked]
F° = 1.8 * C° + 32
C° = (F° – 32) / 1.8
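The temperature analogy can be put into a few lines of code. The sketch below first implements the two conversion formulas from the slide and then recovers the slope and intercept from objects measured on both scales. The function `linking_constants` and its mean-and-standard-deviation approach are assumptions added here for illustration; the slides do not say how the constants for linking two θ scales were obtained.

```python
from statistics import mean, pstdev


def celsius_to_fahrenheit(c):
    """F = 1.8 * C + 32, as on the slide."""
    return 1.8 * c + 32


def fahrenheit_to_celsius(f):
    """C = (F - 32) / 1.8, as on the slide."""
    return (f - 32) / 1.8


def linking_constants(values_on_x, values_on_y):
    """Slope and intercept of y = slope * x + intercept.

    Hypothetical helper: uses the means and standard deviations of the same
    objects measured on both scales, and assumes the relation is increasing.
    """
    slope = pstdev(values_on_y) / pstdev(values_on_x)
    intercept = mean(values_on_y) - slope * mean(values_on_x)
    return slope, intercept


# The same four temperatures expressed on both rulers.
celsius = [-40.0, 0.0, 37.0, 100.0]
fahrenheit = [celsius_to_fahrenheit(c) for c in celsius]

slope, intercept = linking_constants(celsius, fahrenheit)
print(round(slope, 6), round(intercept, 6))
```

The same logic carries over to two θ scales only if they measure the same construct and differ solely in origin and unit, which is the point of the ruler analogy.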
Fact 1:
Human judgment is the epicenter of every standard-setting method
Berk, 1995
Fact 2:
The cut-off points on the latent continuum do not possess any objective reality outside and independently of our minds. They are mental constructs, which can differ from person to person.
Consequently:
Whether the levels themselves are set at the proper points is a most contentious issue and depends on the defensibility of the procedures used for determining them
Messick, 1994
Defensibility: Claims vs. Evidence
National Standards: Understands manuals for devices used in their everyday life
CEF – A2: Can understand simple instructions on equipment encountered in everyday life – such as a public telephone (p. 70)
Cambridge ESOL
DIALANG
Finnish Matriculation
CIEP (TCF)
CELI, Università per Stranieri di Perugia
Goethe-Institut
TestDaF Institut
WBT (Zertifikat Deutsch)
Common Practice (Buckendahl et al., 2000)
External evaluation of the alignment of 12 tests by 2 publishers
Publisher reports:
  No description of the exact procedure followed
  Reports include only the match between items and standards
Evaluation study:
  At least 10 judges per test
Comparison results:
  % of agreement: 26% - 55%
  Overestimation of the match by test publishers
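A percentage of agreement of this kind can be computed as follows. This is a sketch under stated assumptions: the source does not describe how the study derived its figures, so the majority-vote panel consensus, the function names, and the item and standard codes below are all hypothetical.

```python
from collections import Counter


def panel_consensus(judgments):
    """Most frequent standard per item across judges.

    judgments: list of dicts, one per judge, mapping item id -> standard code.
    Majority voting is an assumption of this sketch.
    """
    items = judgments[0].keys()
    return {
        item: Counter(j[item] for j in judgments).most_common(1)[0][0]
        for item in items
    }


def percent_agreement(publisher, panel):
    """Percentage of items for which publisher and panel name the same standard."""
    same = sum(1 for item in publisher if publisher[item] == panel.get(item))
    return 100.0 * same / len(publisher)


# Hypothetical data: four items, three judges (a real panel would be larger).
publisher = {"item1": "S1", "item2": "S2", "item3": "S1", "item4": "S3"}
judges = [
    {"item1": "S1", "item2": "S3", "item3": "S1", "item4": "S2"},
    {"item1": "S1", "item2": "S3", "item3": "S2", "item4": "S2"},
    {"item1": "S1", "item2": "S2", "item3": "S1", "item4": "S2"},
]

panel = panel_consensus(judges)
print(percent_agreement(publisher, panel))  # 50.0 for this made-up data
```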
Standard 1.7: When a validation rests in part on the opinions or decisions of expert judges, observers or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The description of procedures should include any training and instruction provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may have influenced one another should be set forth.
Standards for Educational and Psychological Testing, 1999
Evaluation Criteria
Hambleton, R. (2001). Setting Performance Standards on Educational Assessments and Criteria for Evaluating the Process. In: Cizek, G. (Ed.), Setting Performance Standards: Concepts, Methods and Perspectives. Lawrence Erlbaum Ass., 89-116.
A list of 20 questions as evaluation criteria:
  Planning & Documentation: 4 (20%)
  Judgments: 11 (55%)
  Standard Setting Method: 5 (25%)
Planning
Judges
Because standard-setting inevitably involves human judgment, a central issue is who is to make these judgments, that is, whose values are to be embodied in the standards.
Messick, 1994
Selection of Judges
The judges should have the right qualifications, but some other criteria, such as occupation, working experience, age and sex, may be taken into account, because ‘… although ensuring expertise is critical, sampling from relevant different constituencies may be an important consideration if the testing procedures and passing scores are to be politically acceptable’ (Maurer & Alexander, 1992).
Number of Judges
Livingston & Zieky (1982) suggest that the number of judges be no fewer than 5.
Based on court cases in the USA, Biddle (1993) recommends using 7 to 10 Subject Matter Experts in the Judgement Session.
As a general rule, Hurtz & Hertz (1999) recommend sampling 10 to 15 raters.
According to the Manual (p. 94), 10 judges is the minimum.
Training Session
The weakest point
How much? Until it hurts (Berk, 1995)
Main focus: intra-judge consistency
Evaluation forms (Hambleton, 2001)
Feedback
Training Session: Feedback Form
[Figure: Inter-judge Consistency. Consistency (vertical axis, 0.80 to 1.00) shown for each expert, identified by Experts' ID on the horizontal axis]
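One way to produce a per-expert consistency value for such a feedback form is sketched below. The slides do not say which statistic was used, so the measure here (the correlation between one judge's level ratings and the mean ratings of all other judges) is an assumption, as are the judge IDs and ratings.

```python
import math


def pearson_r(x, y):
    """Pearson correlation; undefined if either list is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def inter_judge_consistency(ratings):
    """Correlate each judge with the mean rating of the remaining judges.

    ratings: dict mapping judge id -> list of level ratings, same item order.
    """
    result = {}
    for judge, own in ratings.items():
        others = [r for j, r in ratings.items() if j != judge]
        rest_mean = [sum(col) / len(col) for col in zip(*others)]
        result[judge] = pearson_r(own, rest_mean)
    return result


# Hypothetical ratings of six items on a six-level scale.
ratings = {
    "J1": [1, 2, 3, 4, 5, 6],
    "J2": [1, 2, 3, 4, 5, 6],
    "J3": [2, 2, 3, 5, 5, 6],
    "J4": [6, 1, 4, 2, 5, 3],  # a judge who disagrees with the rest
}

consistency = inter_judge_consistency(ratings)
for judge, value in consistency.items():
    print(judge, round(value, 2))
```

Leaving the judge's own ratings out of the group mean keeps a single judge from inflating their own consistency value, which matters when the panel is small.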
[Figure: Intra-Judge Consistency: Expert 13. Assigned level (vertical axis, 1 to 6) plotted against item difficulty (horizontal axis, -3 to 2)]
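Intra-judge consistency of this kind, the agreement between the levels one judge assigns and the empirical difficulty of the items, can be summarised with a rank correlation. The sketch below uses Spearman's rho as an assumed choice of statistic; the slides show only the plot, and the difficulties and levels in the example are made up.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation, computed as Pearson r on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data for one judge: empirical difficulty and assigned level.
item_difficulty = [-2.1, -1.3, -0.4, 0.2, 0.9, 1.6]
assigned_level = [1, 2, 2, 3, 5, 4]

rho = spearman_rho(item_difficulty, assigned_level)
print(round(rho, 3))
```

A rank-based statistic is used because the levels are ordered categories rather than interval measurements.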
Standard Setting Method
Good practice: the most appropriate; due diligence; field tested; reality check; validity evidence; more than one
Probably the only point of agreement among standard-setting gurus is that there is hardly any agreement between results of any two standard-setting methods, even when applied to the same test under seemingly identical conditions.
Berk, 1995
Standard Setting Methods
[Figure: language proficiency (vertical axis, -1.0 to 1.0) for the methods CG, BG, A3, A2, A1 and A0, labelled as test-centered and examinee-centered methods]
He that increaseth knowledge increaseth sorrow. (Ecclesiastes 1:18)
[Figure: pass rate (vertical axis, 0% to 100%) by year, 1999 to 2004]
Instead of Conclusion
In sum, it may seem that providing valid grounds for valid inferences in standards-based educational assessment is a costly and complicated enterprise. But when the consequences of the assessment affect accountability decisions and educational policy, this needs to be weighed against the costs of uninformed or invalid inferences.
Messick, 1994
The chief determiner of performance standards is not truth; it is consequences.
Popham, 1997
Perhaps by the year 2000, the collaborative efforts of measurement researchers and practitioners will have raised the standard on standard-setting practices for this emerging testing technology.
Berk, 1996