TRANSCRIPT
1 | CEMO Standard Setting Conference 2015 © Pant, 2015
Standard setting: Bridging the worlds of research and policy making
Hans Anand Pant
Humboldt-Universität zu Berlin, Faculty of Humanities and Social Sciences, Department of Education Studies

Standard-setting: International state of research and practices in the Nordic countries
The Centre for Educational Measurement @ The University of Oslo
September 21-23, 2015
2 | CEMO Standard Setting Conference 2015 © Pant, 2015
Outline
What this talk will cover
Educational Standards Reform in Germany
Standard Setting: The “German Case”
Multi-Purpose Tests: The Communication Trap
What we would do next time
3 | CEMO Standard Setting Conference 2015 © Pant, 2015
"Empirical shift" in education policy due to the "PISA shock" in 2001
Need to improve academic performance of students across literacy domains
Need to decrease disparities for ethnic and social groups
Need for a stronger focus on output/outcomes and accountability
4 | CEMO Standard Setting Conference 2015 © Pant, 2015
A shift from input-based to output-based quality management in education

Input: state curricula, school books (the intended curriculum)
Processes: instruction (the implemented, actual curriculum)
Output (Outcome): student achievement (the effective curriculum)

Education policy is within the jurisdiction of the 16 federal states!
5 | CEMO Standard Setting Conference 2015 © Pant, 2015
Basis of output-orientation: National Educational Standards

Enacted in 2003/2004 for core subjects by the Standing Conference of the Ministers of Education (KMK)
Subjects: Mathematics, German language, First Foreign Language (English or French), Sciences (Biology, Chemistry, Physics)
Represent the core elements of knowledge in each subject matter
Implementation is mandatory for all 16 states
Alignment of state-specific curricula and centralized exams with the educational standards
7 | CEMO Standard Setting Conference 2015 © Pant, 2015
National Educational Standards by Grade Level

Elementary level: Grade 4
Lower secondary level: Grades 9 and 10
Upper secondary level: Grades 12/13
Subjects: German, Mathematics, First Foreign Language (English/French), Science
8 | CEMO Standard Setting Conference 2015 © Pant, 2015
Multiple expectations
Develop a scientifically valid and standards-based assessment system
Provide policy-relevant results on the system level
Provide assessments relevant to teachers and schools
9 | CEMO Standard Setting Conference 2015 © Pant, 2015
Three types of cyclic school achievement tests in Germany
International National State-wide
10 | CEMO Standard Setting Conference 2015 © Pant, 2015
Differences between the three assessments in Germany

                         International            National tests            State-wide tests
                         (PISA, PIRLS, TIMSS)     (16-state comparisons)    (within-state)
Standards-based tests?   No                       Yes                       Yes
Data base                Sample-based             Sample-based              Population-based
                         (approx. 5,000)          (approx. 40-60,000)
Frequency                3 years, 5 years         5-6 years                 Every year
Main purpose             System monitoring        System monitoring         School & classroom
                                                                            instruction improvement
Who is accountable?      Federal Ministry of      16 State Ministries of    Principals, teachers
("high stakes")          Education; 16 State      Education and their
                         Ministries of Education  school authorities
11 | CEMO Standard Setting Conference 2015 © Pant, 2015
Schematic process in large scale assessment

Content domain: content standards (state curricula); performance standards (can-do statements)
Item domain: test specifications ("blue print" for test developers); test (item bank)
Outcome domain: standard setting (definition of cut-scores); test use and interpretations (intended & actual); test impact

Evidential aspects of validation (procedural, internal, and external aspects)
Consequential aspects of validation (utility and impact aspects)
12 | CEMO Standard Setting Conference 2015 © Pant, 2015
Significance of Standard Setting

The purpose of standard setting procedures is to…
allow the operationalization of performance standards and the definition of educational goals
allow the standards-based interpretation of competence-oriented large-scale assessment results
facilitate the communication of proficiency distributions of examinees to a range of stakeholders
represent the intersection of evidentiary and consequential aspects of test-score interpretation
13 | CEMO Standard Setting Conference 2015 © Pant, 2015
Principle of Standard Setting

Continuum view of proficiency: a continuous student proficiency scale running from low to high ability.
Ordinal categories of proficiency: Cutscore 1 and Cutscore 2 divide the continuum into the levels Basic, Proficient, and Advanced.
Source: Meskauskas (1976)
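The mapping from the continuous scale to ordinal levels can be sketched in a few lines of code; the cut-score values below are purely illustrative, not the ones used in any actual assessment.

```python
import numpy as np

# Hypothetical cut-scores on the reporting scale -- illustrative values only
cut_scores = [450.0, 550.0]            # Cutscore 1, Cutscore 2
labels = ["Basic", "Proficient", "Advanced"]

def classify(theta):
    """Map a continuous proficiency estimate to an ordinal proficiency level."""
    return labels[int(np.digitize(theta, cut_scores))]

print(classify(430))   # Basic
print(classify(500))   # Proficient
print(classify(610))   # Advanced
```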
14 | CEMO Standard Setting Conference 2015 © Pant, 2015
Standard Setting: The Bookmark Method

The Bookmark Method: cut-score between levels L1 and L2

[Figure: two Ordered Item Booklets (OIBs) containing items 1-22, with a bookmark placed at the boundary between levels L1 and L2]

In the Ordered Item Booklet, items are arranged by ascending empirical difficulty.
Items below the bookmark define the content a student at level L1 has mastered.
Items above the bookmark have not yet been mastered by a student at level L2.
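Under a Rasch model, a bookmark placement translates into a cut-score directly: the cut is the ability at which the bookmarked item is answered correctly with the chosen response probability (RP67 is a common choice). A minimal sketch, with made-up item difficulties:

```python
import math

RP = 0.67  # response-probability criterion often used with the Bookmark method

def bookmark_cut_score(item_difficulties, bookmark_page):
    """Cut-score implied by a bookmark in the Ordered Item Booklet (Rasch model).

    item_difficulties: logit difficulties; sorting them builds the OIB.
    bookmark_page: 0-based index of the bookmarked item in the OIB.
    """
    oib = sorted(item_difficulties)        # items by ascending empirical difficulty
    b = oib[bookmark_page]                 # difficulty of the bookmarked item
    # Solve P(correct) = RP for theta, where P = 1 / (1 + exp(-(theta - b)))
    return b + math.log(RP / (1 - RP))

# Hypothetical logit difficulties for a 10-item booklet
difficulties = [-1.8, -1.2, -0.7, -0.3, 0.0, 0.4, 0.9, 1.3, 1.8, 2.5]
print(round(bookmark_cut_score(difficulties, 4), 2))   # 0.71
```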
15 | CEMO Standard Setting Conference 2015 © Pant, 2015
Iterative group consensus finding procedure

Panel experts set cut-scores in three rounds:
Round 1: individual cut-scores using OIBs and proficiency level descriptors; followed by normative feedback (group members' individual and average cut-scores)
Round 2: individual cut-scores; followed by normative feedback plus impact data (resulting population distribution of student achievement)
Round 3: final cut-scores are set individually (striving for a consensus)
The Bookmark Method

[Figure: examinee reading proficiency distribution (frequency by ability, scale 100-900) in relation to cut-scores 1-4 and proficiency levels A1, A2, B1, B2, C1]
16 | CEMO Standard Setting Conference 2015 © Pant, 2015
Illustration of a proficiency level model (scientific reasoning in biology)
17 | CEMO Standard Setting Conference 2015 © Pant, 2015
From criterion-referenced to standards-referenced levels of proficiency
18 | CEMO Standard Setting Conference 2015 © Pant, 2015
The Interplay of Research and Policy in the Standard Setting Process in Germany

Policy enacts educational standards
Researchers (domain content experts, edumetricians, psychologists, …) set up an assessment system
Cut-score recommendations are presented to the 16 state ministers of education
Ministers formally approve and authorize the set of cut-scores for every single proficiency level model
19 | CEMO Standard Setting Conference 2015 © Pant, 2015
Results from national standards-based monitoring
International National State-wide
20 | CEMO Standard Setting Conference 2015 © Pant, 2015
Norm-referenced feedback: mean scores of mathematics achievement by state (9th grade)

State                            M     SD
Saxony                          536     96
Thuringia                       521     93
Brandenburg                     518    105
Bavaria                         517    101
Saxony-Anhalt                   513     99
Mecklenburg-Western Pomerania   505     95
Rhineland-Palatinate            503     97
Schleswig-Holstein              502     98
Baden-Wuerttemberg              500    100
Germany                         500    100
Lower Saxony                    495     91
Hesse                           495     97
Saarland                        489     93
Hamburg                         489     99
North Rhine-Westphalia          486    102
Berlin                          479    104
Bremen                          471    103

(States marked on the slide as significantly above or significantly below the German average.)

The maximum mean difference between states corresponds to over 2 years in learning progression!
21 | CEMO Standard Setting Conference 2015 © Pant, 2015
Why criterion-referenced standards? What does "criterion-referenced" imply?

"A criterion-referenced test is one that is deliberately constructed to yield measurements that are directly interpretable in terms of specified performance standards" (Glaser & Nitko, 1971, p. 653)
22 | CEMO Standard Setting Conference 2015 © Pant, 2015
Criterion/standards-referenced comparisons: proportions of students by proficiency level in maths (9th grade)

Proficiency level    I    II   III   IV    V
Bremen              39    27    21   10    3
Saxony              11    27    31   23    7

34% of students in Bremen meet the normative standard (proficiency level III or above)
61% of students in Saxony meet the normative standard (proficiency level III or above)
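The criterion-referenced reading of the distributions above is simple arithmetic; taking "meeting the normative standard" as reaching level III or above:

```python
# Shares (%) by proficiency level I..V, as reported on the slide
shares = {
    "Bremen": [39, 27, 21, 10, 3],
    "Saxony": [11, 27, 31, 23, 7],
}
for state, s in shares.items():
    print(state, sum(s[2:]))   # levels III-V: Bremen 34, Saxony 61
```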
23 | CEMO Standard Setting Conference 2015 © Pant, 2015
Criterion/Standards-referenced monitoring on a system level
Notes (translated from German): <KS II = minimum standard not met; ≥KS III = at least the regular standard reached; Orthogr. = orthography; Fachwiss. = content knowledge; Erkennt. = scientific inquiry. For mathematics in lower secondary level, percentages refer to all ninth-graders taught toward the regular curriculum goals, regardless of whether they pursue an HSA or MSA certificate. Light red marking = between 10% and 15% missed the minimum standard (of the primary level or the MSA), or only between 40% and 50% reached at least the regular standard; dark red marking = more than 15% missed the minimum standard, or fewer than 40% reached at least the regular standard; green marking = fewer than 10% missed the minimum standard, or more than 50% reached at least the regular standard. * = no nationwide figures available from the 2009 state comparison.

Each state row gives 15 pairs of percentages (<KS II / ≥KS III) across the assessed domains. Primary level: German reading, German listening, mathematics (global). Lower secondary level: German (reading, listening, orthography), English (reading and listening comprehension), mathematics (global), and biology, chemistry, and physics (content knowledge and scientific inquiry each).

BW 10 69  6 77  9 73  5 73  1 90  1 87 15 41  7 46 24 44  8 68 13 54 20 53 13 60 12 67 10 69
BY 10 73  6 79  8 77  2 80  0 94  1 93  8 56  4 60 21 50  6 75  5 67 12 67  9 71  8 76  5 80
BE 22 54 15 62 27 47 14 60  6 75  4 76 27 37 17 38 33 37  9 67 12 57 19 56 14 61 13 65 11 70
BB 13 66  9 70 13 65 10 62  3 79  5 73 34 27 27 20 19 53  2 86  3 74  7 74  5 78  3 84  3 87
HB 21 52 14 60 25 50 14 59  5 75  8 71 32 32 16 39 39 34  7 66 12 54 22 51 18 58 12 64 10 69
HH 18 59 11 68 20 56 12 63  2 83  4 75 24 37 11 46 28 41  8 66 11 54 20 52 15 58 12 63  8 71
HE 13 64  8 71 14 62  8 66  2 84  2 86 19 38  9 43 27 41  7 66 11 53 19 53 13 59 10 66  8 71
MV  9 70  6 76 11 69 11 61  3 80  3 77 33 29 28 25 20 47  3 77  6 62  9 65  8 67  6 73  6 77
NI 12 67  7 75 12 65 10 65  2 87  4 80 26 34 13 37 25 42  5 72  8 61 17 57 11 64 10 67  6 77
NW 14 63  8 71 13 65  7 66  1 87  1 83 19 38 10 40 31 39  8 63 11 53 20 51 17 56 13 61  8 70
RP 14 64  8 70 13 65  5 73  1 88  1 88 16 42  8 44 24 46  3 78  6 65 11 63  7 70  5 74  4 81
SL 13 66 10 69 12 69  5 73  1 86  1 89 18 43 10 44 28 41  5 73  6 63 14 61  9 67  8 71  6 77
SN  8 72  6 76  7 76  9 67  3 80  3 76 32 32 26 25 12 61  2 84  4 70  7 72  5 77  3 83  3 86
ST  9 74  9 71  8 75  7 67  2 85  3 80 26 34 22 26 21 50  2 81  5 66  6 75  6 73  4 81  3 82
SH 13 65  6 76 13 64  5 72  1 90  2 86 14 41  5 48 23 46  4 76  6 64 12 62  7 70  5 76  5 79
TH 10 70  6 75 12 67  8 65  1 82  3 79 29 33 22 28 18 52  3 83  3 71  8 72  4 77  3 84  2 87
DE 12 67  7 74 12 68  * *   * *   * *  19 40 10 43 25 44  6 71  9 59 16 58 12 64 10 69  7 75
24 | CEMO Standard Setting Conference 2015 © Pant, 2015
A glossary so far…

Educational standards: educational performance expectations
Standard setting: mapping of performance expectations to assessments
Cut-scores: minimum criteria for meeting level-specific expectations
Validity: plausibility, appropriateness, defensibility
25 | CEMO Standard Setting Conference 2015 © Pant, 2015
Current validity concept (APA, AERA, NCME)
It is the uses and interpretations of an assessment result, i.e. the inferences, rather than the assessment result itself, that are validated.
Validity may be relatively high for one use of assessment results but quite low for another use or interpretation.
Validity is a matter of degree rather than all or none.
“The consequential aspect of (…) validity includes evidence and rationales for evaluating the intended and unintended consequences of score interpretation and use in both the short- and long-term” (Messick, 1995, p. 746)
“[T]he validation process is scientific as well as rhetorical and requires both evidence and argument” (p. 747).
26 | CEMO Standard Setting Conference 2015 © Pant, 2015
Samuel Messick's Validity Concept

In sum: validation takes the form of an open-ended argument that evaluates the overall plausibility of the proposed interpretations of an assessment result and its consequences.
27 | CEMO Standard Setting Conference 2015 © Pant, 2015
A more formal model of validation

Model of Inference (Toulmin, 1953; Kane, 2011): the movement from accepted data, through a warrant, to a claim (Brockriede & Ehninger, 1960, p. 44)

DATUM → CLAIM (with a QUALIFIER), supported by a WARRANT, which rests on BACKING
28 | CEMO Standard Setting Conference 2015 © Pant, 2015
Validity argument

DATUM: examinee performance measured as a test-score on an assessment
CLAIM: interpretation of test-scores as a proficiency level classification
QUALIFIER: justification for test use for specified populations under specified conditions

WARRANTS and their BACKING:
Clearly defined content domain: definition of the output domain in terms of the cognitive processes required to perform domain-relevant tasks
Test content alignment: evidence of appropriate content representation, response processes, level of challenge, and difficulty-generating item characteristics
Clearly defined PLDs: definition of specific examples of examinee domain-relevant behaviors at each level of proficiency
Test-score reliability: classical test theory and item response theory psychometric analyses of item and person parameters
Defensible cut-scores: evidence of procedural, internal, and external validity; cut-scores correspond to the performance level descriptors

Empirical and rhetorical evidence supports the propositions of validity: a validity argument for the intended purpose of the assessment and the standards-based (criterion-referenced) interpretation of test-scores.
Intended and unintended consequences: the bottleneck for consequential concerns.
29 | CEMO Standard Setting Conference 2015 © Pant, 2015
Standards-based state-wide assessments
International National State-wide
30 | CEMO Standard Setting Conference 2015 © Pant, 2015
Differences between the three assessments in Germany

                         International            National tests            State-wide tests
                         (PISA, PIRLS, TIMSS)     (16-state comparisons)    (within-state)
Standards-based tests?   No                       Yes                       Yes
Data base                Sample-based             Sample-based              Population-based
                         (approx. 5,000)          (approx. 40-60,000)
Periodicity              3 years, 5 years         5-6 years                 Every year
Main purpose             System monitoring        System monitoring         School & classroom
                                                                            instruction improvement
Who is accountable?      Federal Ministry of      16 State Ministries of    Principals, teachers
("high stakes")          Education; 16 State      Education and their
                         Ministries of Education  school authorities
31 | CEMO Standard Setting Conference 2015 © Pant, 2015
Goal of state-wide testing: Data-based instruction and school improvement
Coverage: 1.5 million students and 64,000 teachers every year
32 | CEMO Standard Setting Conference 2015 © Pant, 2015
Test preparation strategies among teachers
Increasing instructional time
More individualized teaching
Increasing instructional load
Reallocation of content
Coaching
Cheating
33 | CEMO Standard Setting Conference 2015 © Pant, 2015
Proficiency level distributions in population-based vs. sample-based studies of an identical student cohort (1-year lag)

         3rd grade (state-wide, 2010)       4th grade (national, 2011)
State    I   II  III   IV    V  III+        I   II  III   IV    V  III+
A       26   21   21   14   18    53       13   21   30   25   11    66
B       38   19   18   11   15    43       22   24   27   19    7    54
C       29   20   21   13   17    51       10   21   32   26   11    69
D       21   19   23   15   20    58       10   17   29   28   16    74
E       22   19   22   15   22    59        9   21   33   27   10    70
F       24   20   22   14   20    57       12   21   29   25   13    67
G       23   19   22   14   22    58       14   23   30   22   11    63
H       20   19   23   15   23    61       13   22   30   24   11    65
34 | CEMO Standard Setting Conference 2015 © Pant, 2015
Sample performance feedback in a state-wide test on individual student level and class level
Student score with 95% CI
Class level distribution
Proficiency level description
Standard level description
35 | CEMO Standard Setting Conference 2015 © Pant, 2015
One test type, different purposes: classification accuracy as a differential threat to validity depending on the test purpose

Classification accuracy is not an issue in the system-monitoring case, where plausible-value distributions are used to estimate the population percentage of students at each proficiency level.
However, in the case of school- and class-level aggregate feedback, individual person ability estimates (MLE or WLE) are used to report the proficiency level distribution (e.g., 20% of school X's students perform below the basic standard). Hence, misclassification bias matters, especially when trends in the proficiency distribution are fed back.
36 | CEMO Standard Setting Conference 2015 © Pant, 2015
Consequential Validity Study: Definition of Classification Accuracy

Item Response Theory based classification accuracy using Rudner's (2001) method

Let θ_c denote the cut-score, θ_j the true score of examinee j, and θ̂ the observed score.

The standardized distance between the true score and the cut-score is determined:

z = (θ_c − θ_j) / SE(θ_j)

The area to the right of the cut-score is then calculated to determine the probability of the examinee being incorrectly classified above their ability.

[Figure: frequency distribution on the ability scale (−3 to 3) centered on θ_j, with the cut-score θ_c marked]
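The misclassification probability just described can be computed with the normal approximation; a minimal sketch with illustrative numbers (not values from the study):

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_classified_above_cut(theta_j, se_j, theta_c):
    """Probability that examinee j, whose true score theta_j lies below the
    cut-score, is nevertheless classified above it (Rudner-style normal
    approximation)."""
    z = (theta_c - theta_j) / se_j     # standardized distance to the cut-score
    return 1.0 - normal_cdf(z)         # area to the right of the cut-score

# Hypothetical examinee: true score -0.4 logits, SE 0.35, cut-score at 0.0
print(round(p_classified_above_cut(-0.4, 0.35, 0.0), 3))
```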
37 | CEMO Standard Setting Conference 2015 © Pant, 2015
Consequential Validity Study

Factors that have been hypothesized to systematically affect classification accuracy (Wyse, 2010):
the number of proficiency levels set on an ability scale;
the reliability of examinee ability estimates (Ercikan, 2006; Ercikan & Julian, 2002);
the distance on the ability scale between cut-scores (Zhang, 2010);
the number of examinees with ability estimates close to a cut-score (Emons, Sijtsma, & Meijer, 2007);
the measurement error associated with the examinee ability estimates (Betebenner et al., 2008).
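How these factors trade off can be explored with a small Monte Carlo simulation; the cut-score placements and error level below are assumptions for illustration, not values from any of the cited studies:

```python
import random

random.seed(1)

def simulated_accuracy(cut_scores, se=0.35, n=20000):
    """Share of simulated examinees whose observed proficiency level
    (ability estimate with measurement error) matches their true level."""
    def level(theta):
        return sum(theta >= c for c in cut_scores)
    hits = 0
    for _ in range(n):
        true = random.gauss(0.0, 1.0)            # true ability ~ N(0, 1)
        observed = true + random.gauss(0.0, se)  # estimate with error
        hits += level(true) == level(observed)
    return hits / n

cuts_5_levels = [-1.2, -0.4, 0.4, 1.2]                   # 4 cut-scores -> 5 levels
cuts_8_levels = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]   # 7 cut-scores -> 8 levels
print(round(simulated_accuracy(cuts_5_levels), 2))
print(round(simulated_accuracy(cuts_8_levels), 2))
```

With more, more closely spaced cut-scores, more examinees sit near a boundary, so exact-level agreement drops; this mirrors the trade-off the study examines.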
38 | CEMO Standard Setting Conference 2015 © Pant, 2015
Consequential Validity Study
Applied context for research questions on classification accuracy
For their state-wide assessments in grade 8, states in Germany can administer multiple psychometrically linked, non-equivalent parallel test forms of varying mean item difficulty.
This is done to take into account the average ability of the different target populations (Hauptschule vs. Realschule vs. Gymnasium).
Moreover, some states prefer to give their schools more fine-grained feedback, based on an 8-level instead of the usual 5-level competence-model version of the CEFR.
39 | CEMO Standard Setting Conference 2015 © Pant, 2015
Classification accuracy vs. feedback differentiation
The Common European Framework of Reference for Languages: levels of language proficiency

A (Basic Language Use): A1 Breakthrough, A2 Waystage
B (Independent Language Use): B1 Threshold, B2 Vantage
C (Proficient Language Use): C1 Operational Proficiency, C2 Mastery
Finer-grained sublevels: A2.1/A2.2, B1.1/B1.2, B2.1/B2.2
40 | CEMO Standard Setting Conference 2015 © Pant, 2015
Consequential Validity Study
Research questions on classification accuracy
How do the two factors, ability-matched test forms and the number of proficiency levels, affect classification accuracy in a real sample of large-scale data?
Is class-level aggregated classification accuracy differentially affected by increasing the number of proficiency levels? Put differently: is it possible to identify schools that would likely experience a disproportionately high loss in classification accuracy as the number of proficiency levels increases?
41 | CEMO Standard Setting Conference 2015 © Pant, 2015
Consequential Validity Study
Data base and analyses
Response data of N = 33,520 eighth-grade examinees from the states of Berlin and Brandenburg, in N = 1,706 classes with an average size of 19 students, were analyzed.
Examinee ability was estimated using maximum likelihood estimation (MLE), with a reliability of .919. English reading comprehension was assessed using two parallel test forms with n = 58 and n = 52 items.
The items in test form one, taken by lower-achieving classes, had lower logit difficulty estimates (M = -1.13, SD = 1.40) than the items in test form two (M = -0.12, SD = 1.49), taken by higher-achieving classes.
42 | CEMO Standard Setting Conference 2015 © Pant, 2015
Classification accuracy of two non-equivalent parallel test-forms relative to examinee ability and 4 vs. 7 cut-scores

[Figure: two panels plotting classification accuracy against the ability estimate for test form 1 and test form 2, once with 4 cut-scores and once with 7 cut-scores]
43 | CEMO Standard Setting Conference 2015 © Pant, 2015
Conclusions for standard setting and feedback format

The paradoxical relationship between the perceived detail of reports and the accuracy with which examinees can be assigned to the reported levels leads to the recommendation to set only as many cut-scores, and thus proficiency levels, as the purpose of the assessment requires.
Recipients of test results should be made aware that, when making comparisons between aggregated units such as classes or schools, groups that differ in their mean ability may also differ in the accuracy with which they can be assigned to proficiency levels.
Taking the present findings and those of earlier studies into account, reports of the proportions of examinees classified at criterion-referenced proficiency levels should include the degree of uncertainty attached to the classification at each level.
44 | CEMO Standard Setting Conference 2015 © Pant, 2015
Consequential aspects when designing standards-referenced assessments

Performance feedback level: individual, class, school, state
Level of stakes: low to high
Evaluation purpose: summative (status) vs. formative (trend, trajectory)
45 | CEMO Standard Setting Conference 2015 © Pant, 2015
What we ideally should have done: valid standard setting procedures require …

a thorough and thoughtful discussion of test use and purpose before setting up an assessment system
an open discussion about the vested interests of various stakeholders in the interpretation of test results
an explicit strategy for relating assessment results to existing exams and certification systems
an idea of how to deal with the uncertainty/ambiguity/imprecision of test results, and how to communicate this to other stakeholders
46 | CEMO Standard Setting Conference 2015 © Pant, 2015
Take-home
Early management of expectations rather than later management of disappointment!
Contact:[email protected]@deutsche-schulakademie.de
Thank you for your attention!
48 | CEMO Standard Setting Conference 2015 © Pant, 2015
The argument for test-score interpretation

DATUM: examinee performance as an assessment test-score
CLAIM: interpretation of test-score as a proficiency level classification
QUALIFIER: conditions under which the proposed claims are valid
WARRANTS: content standards are well defined; test content is aligned; PLDs are well defined; test-scores are reliable and unbiased; cut-scores are appropriate
(The claim feeds into intended and unintended consequences.)
49 | CEMO Standard Setting Conference 2015 © Pant, 2015
Item Development Process at the IQB

Regional groups: development of items for standardized assessments and of items and tasks for classroom implementation (including student solutions); the result is the first item pool, submitted via an internet portal.
Evaluation group: review and revision of items, feedback to the item development groups, and development of an overarching framework; the result is a revised item pool.
IQB: collecting, commenting, and evaluating; release of items after review/revision into the final item pool.
The final pool is split into a pool of items for standardized assessments (testing institutes) and a pool of items for classroom implementation (publishers).