TRANSCRIPT
Maintaining, adjusting and generalizing standards and cut-scores
Robert Coe, Durham University
Standard setting in the Nordic countries
Centre for Educational Measurement, University of Oslo (CEMO), Oslo, 22 September 2015
Slides available at: www.twitter.com/ProfCoe (@ProfCoe)
Different meanings of ‘standards’
– Does it test something sensible?
– Is the content complex & extensive?
– Are the questions hard?
– Has the design/development followed the rules?
– Has it been marked properly?
– Are the scores reliable?
– Does it actually measure the intended construct?
– Can the outcomes (grades/scores) be used as desired?
– Does a particular cut-point indicate the same as
  – Its ‘equivalent’ in previous versions
  – Some kind of equivalent in other assessments
  – Some specified level of performance
In England we are worried about
– Standards over time
– Standards across qualifications, subjects, or specifications within the same broad qualification
– Standards between awarding organisations, or assessment processes
– Standards across countries
– Standards between groups of candidates (e.g. males/females, rich/poor)
Comparability (Newton, 2010)
Candidates who score at linked (grade boundary) marks must be the same in terms of …
– the character of their attainments (phenomenal)
– the causes of their attainments (causal)
– the extent to which their attainments predict their future success (predictive)
Comparability (Coe, Newton & Elliott, 2012)
Any rational claim about the comparability of grades in different qualifications amounts to a claim that those grades can be treated as interchangeable for some purpose or interpretation.
We should talk about the comparability of grades or scores (rather than of qualifications), since these are the outcomes of an assessment that are interpreted and used.
Most interpretations of a grade achieved in an examination relate directly to the candidate. In other words, we are interested in what the grade tells us about the person who achieved it, and inferring characteristics of the person from the observed performance.
Any claim about interchangeability relates to a particular construct.
Test development and standards

Theoretical
– Specify the construct
– Develop the assessments to measure it
– Use equating/linking procedures to link key cut-points
– Candidates with linked scores are equivalent (with respect to the construct)

Pragmatic
– Assessments evolve, shaped by
  – Explicit constructs
  – Past practice
  – User requirements (wide range of different uses & purposes)
  – Political drivers
  – Pragmatic constraints
– Comparability defined by public opinion (Cresswell, 1996, 2012)
An integration: rational and pragmatic
– Consider the different ways exam results are used (interchangeably)
– Identify an implied construct for each (in terms of which they are interchangeable)
– Develop a defensible method for minimising unfairness and undesirable behaviour that results from these interchangeability requirements
1. Use/interpretation: The claim by teachers in the 2012 GCSE English dispute that students who met the criteria deserve a C.
   Implied construct: The grade indicates specific competences within the subject domain that have been demonstrated on the assessment occasion.
   Interchangeability requirement: Performance judged to meet the same ‘criteria’ gets the same grade on different occasions, specifications, and boards.

2. Use/interpretation: The use of a B in GCSE maths as a filter for A level study in maths.
   Implied construct: The grade indicates specific competences within the subject domain that the candidate is likely to be able to reproduce in the future.
   Interchangeability requirement: Grades (across occasions, specifications, boards) represent the same level of the construct (mathematics).
3. Use/interpretation: The use of ‘5A*-C EM’ (at least 5 grade Cs inc Eng & maths) at GCSE as a filter for any A level study.
   Implied construct: The grade indicates competences transferable to other academic study that the candidate is likely to be able to reproduce in the future.
   Interchangeability requirement: Grades achieved in different combinations of subjects and other allowable qualifications must be equivalent in terms of their predictions for subsequent academic outcomes.

4. Use/interpretation: Employers requiring job applicants to have ‘5A*-C EM’.
   Implied construct: The grade indicates competences transferable to employment contexts that the candidate is likely to be able to reproduce in the future.
   Interchangeability requirement: Grades achieved in maths and English (across occasions, specifications, boards) must predict the same level of relevant, reproducible workplace competences.
5. Use/interpretation: Use of GCSE results in league tables to judge schools.
   Implied construct: Average grades for a class or school (especially if referenced against prior attainment) indicate the impact (and hence quality) of the teaching experienced.
   Interchangeability requirement: Grades achieved in different combinations of subjects and other allowable qualifications must be equivalent in terms of some measure of the teaching (quality and quantity) that is typically (after controlling for pre-existing or irrelevant differences) associated with those outcomes.

6. Use/interpretation: Comparison of GCSE results of different types of school to justify impact of policy.
   Implied construct: Average grades across the jurisdiction indicate the impact (and hence quality) of the system’s schooling provision.
   Interchangeability requirement: As in 5.
Grade C in GCSE French could be made comparable to the same grade
– In French in previous years (or parallel specifications), in terms of what is specified to be demonstrated
– In French in previous years (or parallel specifications), or in other languages, in terms of the candidate’s ability to communicate in the target language
– In other (academic) subjects, in terms of their prediction of subsequent attainment in other (academic) subjects
– In other subjects, in terms of how hard it is to get students to reach this level
It follows that …
– We cannot talk about standards (setting or maintaining) until we decide which of these uses/interpretations we want to support
– In at least some cases the different uses/interpretations will be incompatible
– If we want the ‘standard’ to be captured in the outcome (score/grade) we have to prioritise (or optimise)
– Alternatively, we can use different equivalences for different uses
A level data

[Chart: relative severity (corrected tariff) of A level subjects at grades A* to E, on a vertical scale from 0 to 180. Subjects are ordered from leniently graded (e.g. Film Studies, Art Photography, Media Studies) to severely graded (e.g. Chemistry, Physics).]
Judgement-based methods
– Criterion-based judgement
  – Judgement against specific competences
  – Judgement against overall grade descriptors
– Item-based judgement
  – Angoff method
  – Bookmark method
– Comparative judgement
  – Cross-moderation
  – Paired comparison
– Judgement of demand
  – CRAS (complexity, resources, abstractness, strategies)
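Of the item-based judgement methods listed above, the Angoff method is the most readily sketched numerically: each judge estimates, for every item, the probability that a minimally competent candidate would answer it correctly, and the cut score is the sum over items of the judges' mean estimates. A minimal sketch (the judge ratings below are invented for illustration):

```python
# Angoff standard setting: judges estimate, for each item, the probability
# that a 'minimally competent' candidate answers it correctly. Averaging
# across judges and summing across items gives the recommended cut score.

def angoff_cut_score(ratings):
    """ratings[j][i] = judge j's probability estimate for item i."""
    n_items = len(ratings[0])
    item_means = [
        sum(judge[i] for judge in ratings) / len(ratings)
        for i in range(n_items)
    ]
    return sum(item_means)

# Three hypothetical judges rating a five-item test:
ratings = [
    [0.9, 0.7, 0.6, 0.4, 0.3],
    [0.8, 0.8, 0.5, 0.5, 0.2],
    [1.0, 0.6, 0.7, 0.3, 0.4],
]
print(round(angoff_cut_score(ratings), 2))  # cut score out of 5 marks: 2.9
```

In practice the method usually involves several rounds of judgement with discussion and impact data between rounds; the arithmetic above is only the aggregation step.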
Equating methods
– Classical equating models
  – Linear equating
  – Equipercentile equating
– IRT equating
  – Rasch model
  – Other IRT models
– Equating designs
  – Equivalent groups
  – Common persons
  – Common items
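Linear equating, the simplest of the classical models above, maps a score x on form X onto the form-Y scale by matching the two score distributions' means and standard deviations: y = μ_Y + (σ_Y/σ_X)(x − μ_X). A minimal sketch, with invented score data standing in for an equivalent-groups design:

```python
from statistics import mean, pstdev

def linear_equate(x, scores_x, scores_y):
    """Map score x from form X onto form Y's scale by matching the
    mean and standard deviation of the two score distributions."""
    mu_x, sd_x = mean(scores_x), pstdev(scores_x)
    mu_y, sd_y = mean(scores_y), pstdev(scores_y)
    return mu_y + (sd_y / sd_x) * (x - mu_x)

# Hypothetical equivalent-groups design: comparable groups sat each form.
form_x = [40, 50, 60, 70, 80]   # mean 60
form_y = [35, 45, 55, 65, 75]   # mean 55, same spread
print(linear_equate(60, form_x, form_y))  # 55.0: form Y is 5 marks harder
```

Equipercentile equating replaces the linear map with one that matches the full cumulative distributions, so a cut-point on form X is linked to the form-Y score with the same percentile rank.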
Linking/comparability methods
– Reference/anchor test
  – Concurrent
  – Prior
– Common candidate methods
  – Subject pairs
  – Subject matrix
  – Latent trait
– Pre-testing designs (when high-stakes & released)
  – Live testing with additional future trial test items
  – Random future test versions within live testing
  – Low-stakes pre-testing two versions in counterbalanced trial
  – Low-stakes pre-testing with an anchor test
– Norm/cohort referencing
  – Pure cohort referencing
  – Adjusted cohort referencing
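Pure cohort referencing, the last method above, fixes the proportion of the cohort awarded each grade and moves the cut scores from year to year to match. A minimal sketch (the grade proportions and mark distribution are invented for illustration):

```python
def cohort_referenced_cuts(marks, top_proportions):
    """Return the minimum mark for each grade so that (approximately)
    the given cumulative proportion of the cohort reaches that grade
    or above. top_proportions: cumulative shares, best grade first,
    e.g. {"A": 0.2, "B": 0.5} = top 20% get A, top 50% get B or above."""
    ranked = sorted(marks, reverse=True)
    n = len(ranked)
    cuts = {}
    for grade, p in top_proportions.items():
        k = max(1, round(p * n))      # candidates at or above the cut
        cuts[grade] = ranked[k - 1]   # mark of the last candidate in the top p
    return cuts

# A hypothetical ten-candidate cohort:
marks = [72, 68, 65, 61, 58, 55, 51, 47, 42, 36]
cuts = cohort_referenced_cuts(marks, {"A": 0.2, "B": 0.5, "C": 0.8})
print(cuts)  # {'A': 68, 'B': 58, 'C': 47}
```

Adjusted cohort referencing modifies the fixed proportions using evidence about the cohort (e.g. prior attainment), rather than holding them constant.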