f. kaftandjieva. terminology f. kaftandjieva milestones in comparability 1904 “the proof and...

F. Kaftandjieva


Terminology


Milestones in Comparability

1904: “The proof and measurement of association between two things”

1951: “Scores on two or more tests may be said to be comparable for a certain population if they show identical distributions for that population.”

1971: ‘Scales, norms, and equivalent scores’: Equating, Calibration, Comparability

1992, 1993

1997, 2001


Alignment

Alignment refers to the degree of match between test content and the standards

Dimensions of alignment: Content, Depth, Emphasis, Performance, Accessibility


Alignment

Alignment is related to content validity.

Specification (Manual, Ch. 4)

“Specification … can be seen as a qualitative method. … There are also quantitative methods for content validation but this manual does not require their use.” (p. 2)

24 pages of forms

Outcome: “A chart profiling coverage graphically in terms of levels and categories of CEF.” (p. 7)

Crocker, L. et al. (1989). Quantitative Methods for Assessing the Fit Between Test and Curriculum. Applied Measurement in Education, 2 (2), 179-194.


Alignment (Porter, 2004)

0.235

www.ncrel.org
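As an illustration of what a quantitative alignment measure can look like, here is a minimal sketch of the alignment index commonly attributed to Porter: one minus half the sum of absolute differences between two sets of cell proportions. The function name, the cell layout, and the example counts are assumptions made for this sketch; they are not taken from the slides, and the slide's value of 0.235 is not reproduced here.

```python
def alignment_index(test_cells, standards_cells):
    """Porter-style alignment index: 1 - sum(|x_i - y_i|) / 2.

    Both arguments are flat lists of non-negative emphasis counts (or weights)
    for the same content cells, in the same order. Each list is converted to
    proportions before the comparison.
    """
    if len(test_cells) != len(standards_cells):
        raise ValueError("both inputs must describe the same cells")
    total_test = sum(test_cells)
    total_standards = sum(standards_cells)
    if total_test <= 0 or total_standards <= 0:
        raise ValueError("each input needs a positive total")
    x = [value / total_test for value in test_cells]
    y = [value / total_standards for value in standards_cells]
    return 1 - sum(abs(a - b) for a, b in zip(x, y)) / 2


if __name__ == "__main__":
    # Hypothetical counts of items (test) and objectives (standards) per cell.
    test = [2, 2]
    standards = [1, 3]
    print(alignment_index(test, standards))
```

An index of 1 means the test and the standards distribute their emphasis identically across the cells; 0 means they share no cells at all.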





Mislevy & Linn: Linking Assessments

                         Construct   Instrument   Examinees   Moderator
Equating                 =           =            =           no
Calibration              =                        =           no
Projection                                        =           no
Statistical moderation                                        Other test
Social moderation                                             Judges

Equating, Linking


in Calibration

The Good & The Bad


Model – Data Fit


[Figure: scatter plot, Sub-sample A (x-axis) versus Sub-sample B (y-axis), both axes from -2.0 to 2.0; 301 items, r = .975]
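A comparison like the one between Sub-sample A and Sub-sample B can be sketched as follows, assuming that r is a Pearson correlation between two sets of item estimates. The difficulty proxy used here (the negative logit of the proportion correct) is a deliberately crude stand-in for a proper IRT calibration, and the function names and response data are hypothetical.

```python
import math


def logit_difficulty(item_responses):
    """Rough difficulty proxy: negative logit of the proportion correct.

    This is not a Rasch or OPLM estimate; it only orders items by difficulty.
    """
    p = sum(item_responses) / len(item_responses)
    if not 0 < p < 1:
        raise ValueError("proportion correct must lie strictly between 0 and 1")
    return -math.log(p / (1 - p))


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical 0/1 responses: one list per item, one entry per examinee.
    sub_sample_a = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
    sub_sample_b = [[1, 1, 0, 1], [0, 1, 1, 0], [0, 0, 1, 0]]
    b_a = [logit_difficulty(item) for item in sub_sample_a]
    b_b = [logit_difficulty(item) for item in sub_sample_b]
    print(pearson_r(b_a, b_b))
```

A high correlation between the two sets of estimates is what one would expect when item difficulties do not depend on which examinees happened to take the items.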

Sample-Free Estimation


[Figure: scatter plot of b-values, OPLM (x-axis) versus FACETS (y-axis), both axes from -2.0 to 2.0; r = +0.9998]

The ruler (θ scale)

[Figure: the θ scale (-3 to 3) shown alongside a Celsius scale (-300 to 150) and a Fahrenheit scale (-500 to 300), with absolute zero and boiling water marked]

F° = 1.8 * C° + 32

C° = (F° – 32) / 1.8
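The two temperature formulas are inverse linear transformations: the same quantity is expressed on two rulers with different units and origins. A minimal sketch, with function names chosen for illustration only:

```python
def celsius_to_fahrenheit(celsius):
    """F = 1.8 * C + 32"""
    return 1.8 * celsius + 32


def fahrenheit_to_celsius(fahrenheit):
    """C = (F - 32) / 1.8"""
    return (fahrenheit - 32) / 1.8


if __name__ == "__main__":
    # Two reference points marked on the slide's rulers.
    for label, celsius in [("absolute zero", -273.15), ("boiling water", 100.0)]:
        fahrenheit = celsius_to_fahrenheit(celsius)
        print(label, celsius, round(fahrenheit, 2))
        # Converting back recovers the original value (up to rounding error).
        print(round(fahrenheit_to_celsius(fahrenheit), 2))
```

The conversion changes the numbers attached to each point but not the order of the points or the ratios of the distances between them, which is the sense in which both scales act as the same ruler.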




Standard Setting


The Ugly


Fact 1:

Human judgment is the epicenter of every standard-setting method.

Berk, 1995


When Ugliness turns to Beauty


Fact 2:

The cut-off points on the latent continuum do not possess any objective reality outside and independently of our minds. They are mental constructs, and they can differ from person to person.

Consequently:

Whether the levels themselves are set at the proper points is a most contentious issue and depends on the defensibility of the procedures used for determining them

Messick, 1994


Defensibility


Defensibility: Claims vs. Evidence

National Standards: Understands manuals for devices used in their everyday life

CEF (A2): Can understand simple instructions on equipment encountered in everyday life, such as a public telephone (p. 70)


Cambridge ESOL
DIALANG
Finnish Matriculation
CIEP (TCF)
CELI, Università per Stranieri di Perugia
Goethe-Institut
TestDaF Institut
WBT (Zertifikat Deutsch)



Common Practice (Buckendahl et al., 2000)

External evaluation of the alignment of 12 tests by 2 publishers

Publisher reports:
  No description of the exact procedure followed
  Reports include only the match between items and standards

Evaluation study:
  At least 10 judges per test

Comparison results:
  Agreement: 26% to 55%
  Overestimation of the match by test publishers
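A percentage of agreement like the one reported in the comparison results can be computed as the share of items for which two sources assign the same standard. The sketch below assumes one assigned standard per item from each source; the function name, the standard labels, and the data are hypothetical, and the actual procedure used in the cited study is not described in the slides.

```python
def percent_agreement(publisher_matches, judge_matches):
    """Percentage of items assigned to the same standard by both sources."""
    if len(publisher_matches) != len(judge_matches) or not publisher_matches:
        raise ValueError("need two non-empty lists of equal length")
    same = sum(1 for p, j in zip(publisher_matches, judge_matches) if p == j)
    return 100.0 * same / len(publisher_matches)


if __name__ == "__main__":
    # Hypothetical item-to-standard assignments for four items.
    publisher = ["S1", "S2", "S3", "S1"]
    judges = ["S1", "S3", "S3", "S2"]
    print(percent_agreement(publisher, judges))
```

Simple agreement of this kind does not correct for matches that would occur by chance, so it is best read as a descriptive figure.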



Standard 1.7: When a validation rests in part on the opinions or decisions of expert judges, observers or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The description of procedures should include any training and instruction provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may have influenced one another should be set forth.

Standards for educational and psychological testing, 1999


Evaluation Criteria

Hambleton, R. (2001). Setting Performance Standards on Educational Assessments and Criteria for Evaluating the Process. In: Setting Performance Standards: Concepts, Methods and Perspectives, Ed. by Cizek, G., Lawrence Erlbaum Ass., 89-116.

A list of 20 questions as evaluation criteria:
  Planning & Documentation: 4 (20%)
  Judgments: 11 (55%)
  Standard Setting Method: 5 (25%)

Planning


Judges

Because standard-setting inevitably involves human judgment, a central issue is who is to make these judgments, that is, whose values are to be embodied in the standards.

Messick, 1994


Selection of Judges

The judges should have the right qualifications, but other criteria such as occupation, working experience, age, and sex may also be taken into account, because ‘… although ensuring expertise is critical, sampling from relevant different constituencies may be an important consideration if the testing procedures and passing scores are to be politically acceptable’ (Maurer & Alexander, 1992).


Number of Judges

Livingston & Zieky (1982) suggest using no fewer than 5 judges.

Based on court cases in the USA, Biddle (1993) recommends using 7 to 10 subject matter experts in the judgement session.

As a general rule, Hurtz & Hertz (1999) recommend sampling 10 to 15 raters.

According to the Manual (p. 94), 10 judges is the minimum.


Training Session

The weakest point

How much? Until it hurts (Berk, 1995)

Main focus: Intra-judge consistency

Evaluation forms (Hambleton, 2001)

Feedback


Training Session: Feedback Form

[Figure: Inter-judge Consistency. Consistency (y-axis, 0.80 to 1.00) by Experts' ID (x-axis)]

[Figure: Intra-Judge Consistency: Expert 13. Level (y-axis, 1 to 6) against Item Difficulty (x-axis, -3 to 2)]
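One way to put a number on intra-judge consistency for a feedback form is to check how well the levels a judge assigns to items follow the items' empirical difficulty. The sketch below uses a Spearman rank correlation for this; that choice of statistic is an assumption made for illustration, not the method documented in the slides, and the judge data are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical data for one judge: assigned level and item difficulty.
    assigned_levels = [1, 1, 2, 3, 3, 4]
    item_difficulty = [-2.1, -1.5, -0.4, 0.2, 0.9, 1.8]
    print(spearman_rho(assigned_levels, item_difficulty))
```

A judge who consistently assigns harder items to higher levels gets a value close to 1; values near 0 suggest that the assigned levels are unrelated to how difficult the items actually are.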


Standard Setting Method

Good Practice:
  The most appropriate
  Due diligence
  Field tested
  Reality check
  Validity evidence
  More than one


Probably the only point of agreement among standard-setting gurus is that there is hardly any agreement between results of any two standard-setting methods, even when applied to the same test under seemingly identical conditions.

Berk, 1995

Standard Setting Methods

[Figure: Language Proficiency (y-axis, -1.0 to 1.0) by method (x-axis: CG, BG, A3, A2, A1, A0), with test-centered and examinee-centered methods marked]

He that increaseth knowledge increaseth sorrow. (Ecclesiastes 1:18)


[Figure: Pass Rate (y-axis, 0% to 100%) by year (x-axis, 1999 to 2004)]


In sum, it may seem that providing valid grounds for valid inferences in standards-based educational assessment is a costly and complicated enterprise. But when the consequences of the assessment affect accountability decisions and educational policy, this needs to be weighed against the costs of uninformed or invalid inferences.

Messick, 1994


Instead of Conclusion


The chief determiner of performance standards is not truth; it is consequences.

Popham, 1997



Perhaps by the year 2000, the collaborative efforts of measurement researchers and practitioners will have raised the standard on standard-setting practices for this emerging testing technology.

Berk, 1996

