
Comparing standards of examination papers when there are no archived scripts

Ian Jones Loughborough University

Colin Foster Loughborough University

Jodie Hunter Massey University, New Zealand

13th Annual UK Rasch User Group Meeting
Cambridge 2019

Fifty years of A-level mathematics: have standards changed?

Ian Jones (a,*), Chris Wheadon (b), Sara Humphries (c) and Matthew Inglis (a)

(a) Mathematics Education Centre, Loughborough University, UK; (b) No More Marking Ltd.; (c) Ofqual

Advanced-level (A-level) mathematics is a high-profile qualification taken by many school leavers in England, Wales, Northern Ireland and around the world as preparation for university study. Concern has been expressed in these countries that standards in A-level mathematics have declined over time, and that school leavers enter university or the workplace lacking the required mathematical knowledge and skills. The situation in England, Wales and Northern Ireland reflects more general international concerns about decreasing educational standards. However, evidence to support this concern has been of limited scope, rarely subjected to peer-review and of questionable validity. Our study overcame the limitations of previous research into standards over time by applying a comparative judgement technique that enabled the direct comparison of mathematical performance across different examinations. Furthermore, unlike previous research, all examination questions were re-typeset and candidate responses rewritten to reduce bias arising from surface cues. Using this technique, mathematics experts judged A-level scripts from the 1960s, 1990s and the 2010s. We report that the experts believed current A-level mathematics standards to have declined since the 1960s, although there was no evidence that they believed standards have declined since the 1990s. We contrast our findings with those from previous comparison studies and consider implications for future research into standards over time.

Keywords: A-level mathematics; standards; assessment; comparative judgement

Background

Numerous articles and reports have been published over recent years decrying the mathematical knowledge of school leavers in England and Wales (e.g. Walport et al., 2010; ACME, 2011). This includes those who have achieved high grades in Advanced-level (A-level) mathematics (Hawkes & Savage, 2000; Croft et al., 2009), a course usually associated with achieving university entrance to science, engineering and mathematics courses in England and Wales. High-profile and on-going media coverage (e.g. Willis & Paton, 2009) suggests that standards were higher some time in the past, but have declined since. In this article we investigate whether this is in fact the case.

Concerns about declining standards perhaps go back as far as accredited education itself, but of particular relevance to the current debate in England and Wales is the influential Dearing report (National Committee of Inquiry into Higher Education,

*Corresponding author. Mathematics Education Centre, Loughborough University, Loughborough, LE11 3TU, UK. Email: [email protected]

© 2016 British Educational Research Association

British Educational Research Journal
Vol. 42, No. 4, August 2016, pp. 543–560

DOI: 10.1002/berj.3224

BERJ (2016) Results

[Figure: achievement parameter estimate (y-axis, 0 to 4.5) against year of examination (x-axis, 1960 to 2010), plotted separately for A grades, B grades and E grades.]

BERJ (2016) Method

BERJ (2016) Preparation

• Question papers typeset for consistency.

• Candidate responses transcribed for consistency.

• 66 scripts divided into 546 questions and uploaded to website for judging procedure.




TIME-CONSUMING AND EXPENSIVE

REQUIRES GRADED ARCHIVE

Can we apply CJ to standards comparison without graded scripts?

Hope 1

Model solutions vs. graded scripts

[Figure: scatter plot of script scores (y-axis) against perfect-solution scores (x-axis), r = .68.]

An investigation of construct relevant and irrelevant features of mathematics problem-solving questions using comparative judgement and Kelly’s Repertory Grid

Stephen D. Holmes, Qingping He and Michelle Meadows

Office of Qualifications and Examinations Regulation, Coventry, UK

ABSTRACT
The relationship between the characteristics of 33 mathematical problem-solving questions answered by 16-year-old students in England and the quality of problem-solving elicited was investigated in two studies. The first study used comparative judgement (CJ) to estimate the quality of the problem-solving elicited by each question, involving 33 mathematics teachers judging pairs of journal-style responses to the questions and the application of the Bradley–Terry model. In the second study a variant of Kelly’s Repertory Grid was used with five mathematics teachers to identify 23 dimensions along which the problem-solving questions varied. Significant relationships between ratings on some dimensions and the problem-solving quality estimated in the first study were found. This suggests that the Kelly’s Repertory Grid approach could be an effective way to identify features of questions that are relevant to the construct being assessed and features that could be potential sources of construct-irrelevant variance in test scores.

ARTICLE HISTORY
Received 29 April 2016
Accepted 28 October 2016

KEYWORDS
summative assessment; mathematical problem-solving; validity

Introduction

In recent years there has been discontent, particularly from employers, about the perceived lack of practical mathematical ability in England’s workforce and concerns that secondary school mathematics is not providing the skills required in the workplace or higher education (ACME, 2011; CBI, 2006; Toner, 2011; Vordermann, Porkess, Budd, Dunne, & Rahman-Hart, 2011). There have also been claims that the examinations in mathematics for 16-year-olds in England are not suitable for assessing the underlying mathematical ability of the students (Jones & Inglis, 2015). One solution suggested is that schools and school qualifications should place more emphasis on problem-solving and non-routine use of mathematics (Ofsted, 2012; Vordermann et al., 2011). This is consistent with the worldwide move towards the desire to train and assess these skills (e.g. ACT, 2006), perhaps best exemplified by the type of items used in the OECD PISA tests (OECD, 2014), the results of which have an ever-increasing impact on policymakers worldwide. The PISA Assessment and Analytical Framework (OECD, 2013) details the use of items which assess mathematical literacy, meaning the flexible application of mathematical

© 2017 Crown Copyright

CONTACT Stephen D. Holmes [email protected]

RESEARCH IN MATHEMATICS EDUCATION, 2017
VOL. 19, NO. 2, 112–129
https://doi.org/10.1080/14794802.2017.1334576


Hope 2

Expected vs Actual Difficulty

A Comparison of Actual and Expected Difficulty, and Assessment of Problem Solving in GCSE Maths


2.3.14 Item expected and actual difficulty relationship

Figure 28 shows that there was a moderately strong correlation [28] between the expected difficulty of the items and the difficulty as experienced by students (r=0.66). The disattenuated correlation, which estimates what the correlation would be if the measurement of expected and actual difficulty had been more precise, was reasonably high (r=0.76).

Figure 28: A scatter plot to show the relationship between expected and actual difficulty of items

2.3.15 Residual analysis of the relationship between expected and actual difficulty

Analysis of the residuals of a linear model between expected and actual difficulty revealed no systematic pattern between the independent variable (item difficulty) and the residuals. However, there is a correlation between item order and the residuals

[28] This correlation is between the study 1 difficulty parameters and the Rasch model parameters from study 2. The correlations between the study 1 parameters and study 2 item facility values were 0.56 for foundation tier and 0.68 for higher tier. Unlike the Rasch parameters which can be equated, the facility values for the two tiers cannot be combined to obtain one correlation.

r_disattenuated = .76

From page 64 of Ofqual (2015) A Comparison of Expected Difficulty, Actual Difficulty and Assessment of Problem Solving across GCSE Maths Sample Assessment Materials. Report Ofqual/15/5679.
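The disattenuated value quoted above is presumably Spearman's correction for attenuation, which divides the observed correlation by the square root of the product of the two reliabilities. A minimal sketch follows. The excerpt does not give the reliabilities, so the values 0.85 and 0.90 are hypothetical and only illustrate the arithmetic; the function name is also an assumption.

```python
import math


def disattenuate(r_xy, rel_x, rel_y):
    """Spearman's correction for attenuation: r_xy / sqrt(rel_x * rel_y)."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Observed correlation from the excerpt; both reliabilities are hypothetical.
r_observed = 0.66
r_corrected = disattenuate(r_observed, rel_x=0.85, rel_y=0.90)
print(round(r_corrected, 2))

# Working backwards: the product of reliabilities implied by the two
# reported values (0.66 observed, 0.76 disattenuated).
implied_product = (0.66 / 0.76) ** 2
print(round(implied_product, 2))
```

Working backwards from the two reported values gives only the product of the reliabilities, not the reliability of each measure.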

Can we apply CJ to standards comparison without graded scripts?

Study 1

Judging non-typeset items only.

Study 1: comparative judgement

• Exam papers from 1964, 1968, 1996, 2012 (as per BERJ, 2016).

• Split into 42 question items.

• Judged by 8 maths PhD students, total 670 pairwise judgements.

• Internal consistency, SSR = .91.

• Inter-rater reliability (split-halves, 100 iterations), median r = .79.
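The judging and reliability steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the software used in the study: the judgements are simulated, the Bradley–Terry model is fitted by a simple ridge-penalised gradient ascent, and all function names, the item count and the number of judgements per judge are hypothetical. Split-half reliability is computed by repeatedly splitting the judges into two random halves, fitting each half separately, and taking the median Pearson correlation between the two sets of item scores.

```python
import math
import random
import statistics


def fit_bradley_terry(judgements, n_items, iters=150, lr=0.05, ridge=0.01):
    """Item scores from (winner, loser) pairs via penalised gradient ascent."""
    theta = [0.0] * n_items
    for _ in range(iters):
        grad = [-ridge * t for t in theta]  # small ridge keeps estimates finite
        for winner, loser in judgements:
            p_win = 1.0 / (1.0 + math.exp(theta[loser] - theta[winner]))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        theta = [t + lr * g for t, g in zip(theta, grad)]
    mean = sum(theta) / n_items
    return [t - mean for t in theta]  # centre the scale on zero


def pearson(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_half_reliability(by_judge, n_items, iterations=100, seed=0):
    """Median correlation between scores fitted on two random halves of judges."""
    rng = random.Random(seed)
    judges = sorted(by_judge)
    half = len(judges) // 2
    correlations = []
    for _ in range(iterations):
        rng.shuffle(judges)
        first = [pair for j in judges[:half] for pair in by_judge[j]]
        second = [pair for j in judges[half:] for pair in by_judge[j]]
        correlations.append(pearson(fit_bradley_terry(first, n_items),
                                    fit_bradley_terry(second, n_items)))
    return statistics.median(correlations)


def simulate(n_items=10, n_judges=8, per_judge=25, seed=1):
    """Hypothetical data: each judge compares random pairs of items."""
    rng = random.Random(seed)
    true = [rng.gauss(0.0, 1.5) for _ in range(n_items)]
    by_judge = {}
    for judge in range(n_judges):
        pairs = []
        for _ in range(per_judge):
            i, j = rng.sample(range(n_items), 2)
            p_i = 1.0 / (1.0 + math.exp(true[j] - true[i]))
            pairs.append((i, j) if rng.random() < p_i else (j, i))
        by_judge[judge] = pairs
    return true, by_judge


true_scores, by_judge = simulate()
all_judgements = [pair for pairs in by_judge.values() for pair in pairs]
scores = fit_bradley_terry(all_judgements, n_items=10)
r_median = split_half_reliability(by_judge, n_items=10)
print("correlation with simulated true scores:", round(pearson(scores, true_scores), 2))
print("median split-half r:", round(r_median, 2))
```

The simulated data set is smaller than Study 1 (42 items, 670 judgements) only to keep the sketch quick to run.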

Study 1: analysis

We compared item scores with

(i) the scores of the perfect candidates from the BERJ paper (“perfect scores”), and

(ii) the scores of the real scripts from the BERJ paper (“script scores”).

(Scores were available for 38 of the 42 questions judged for Study 1.)

Study 1: correlations

r = 0.63   r = 0.68   r = 0.49

item vs perfect   item vs script   perfect vs script

Study 1: variance explained

• Year as a predictor of item score (BERJ, 2016).

• Present study: F(1,36) = 35.83, p < .001, R² = .500, year as predictor: b = -0.06.

• Perfect scores (BERJ, 2016): F(1,36) = 13.94, p < .001, R² = .279, year as predictor: b = -0.03.

• Script scores (BERJ, 2016): F(1,36) = 13.62, p < .001, R² = .274, year as predictor: b = -0.03.
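For a regression with a single predictor, the F statistic and R² are tied together by F = (R² / df_model) / ((1 − R²) / df_resid), so the figures above can be cross-checked against each other. The sketch below does that with the reported values; the function name is hypothetical. The residual degrees of freedom of 36 correspond to n − 2 with the 38 questions for which scores were available.

```python
def f_from_r_squared(r2, df_model, df_resid):
    """F statistic implied by R-squared for a linear regression."""
    return (r2 / df_model) / ((1.0 - r2) / df_resid)


# (R-squared, reported F) pairs taken from the bullets above.
reported = {
    "present study": (0.500, 35.83),
    "perfect scores": (0.279, 13.94),
    "script scores": (0.274, 13.62),
}

recomputed = {}
for label, (r2, f_reported) in reported.items():
    recomputed[label] = f_from_r_squared(r2, df_model=1, df_resid=36)
    print(f"{label}: F from R2 = {recomputed[label]:.2f}, reported F = {f_reported}")
```

The recomputed values land within about 0.2 of the reported F statistics, which is roughly what is expected when R² is quoted to three decimal places.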

Study 2

Judging (i) typeset papers only, and (ii) typeset papers with perfect solutions.


Study 2: Exam papers

Year    Boards
1964    JMB*
1968    JMB*
1990    JMB
1996    AEB*, London, UCLES
2000    Edexcel
2006    MEI
2012    AQA*, MEI
2017    MEI

* included in BERJ (2016).

Study 2: comparative judgement

• (i) Papers Only.

• Judged by 5 maths PhD students, total 250 judgements, SSR = .84.

• (ii) Papers and Solutions.

• Judged by 5 different maths PhD students, total 330 judgements, SSR = .87.
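The SSR (scale separation reliability) values above are usually computed from the fitted scores and their standard errors as the share of observed score variance not attributable to measurement error. A minimal sketch of that common definition follows. The scores and standard errors are hypothetical, and the use of the sample variance (rather than the population variance) is an assumption, since conventions differ between CJ tools.

```python
import statistics


def scale_separation_reliability(estimates, standard_errors):
    """SSR = (observed variance - mean squared standard error) / observed variance."""
    observed_var = statistics.variance(estimates)  # sample variance (assumption)
    mean_sq_error = statistics.fmean([se ** 2 for se in standard_errors])
    return (observed_var - mean_sq_error) / observed_var


# Hypothetical paper scores (logits) and standard errors, for illustration only.
paper_scores = [-1.5, -0.5, 0.0, 0.5, 1.5]
paper_ses = [0.4, 0.4, 0.4, 0.4, 0.4]
ssr = scale_separation_reliability(paper_scores, paper_ses)
print(round(ssr, 3))
```

Under this definition, SSR rises when the papers are spread further apart on the scale or when more judgements shrink the standard errors.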

Study 2: correlation

[Figure: scatter plot of papers-only scores (y-axis) against papers-and-solutions scores (x-axis), one labelled point per paper (1964 Paper 1, 1968 Paper 1, 1990 JMB, 1996 AEB, 1996 London, 1996 UCLES, 2000 Edexcel, 2006 MEI, 2012 AQA, 2012 MEI, 2017 MEI), r = .74.]

Study 2: analysis

We compared exam paper scores with

(i) the scores of the perfect candidates from the BERJ paper (“perfect scores”), and

(ii) the scores of the real scripts from the BERJ paper (“script scores”).

Unlike for Study 1 we did this graphically.

Study 2: graphical analysis

[Figure, panel "BERJ": parameter estimates for perfect solutions and for scripts, shown by paper.]

[Figure, panel "Study 2 data": parameter estimates for papers only and for papers & solutions, shown by paper.]

Papers shown: 1964 Paper 1, 1968 Paper 1, 1990 JMB, 1996 AEB, 1996 London, 1996 UCLES, 2000 Edexcel, 2006 MEI, 2012 AQA, 2012 MEI, 2017 MEI.

Limitations

• Standards-based assessment research is nonsense (Goldstein, 1979; Newton, 1997).

• Study 2 had only four data points. No estimate is available of the probability that the results are due to chance.

• Papers vary in length from 8 to 40 pages. CJ score vs length: ρ = –.47, p = .15.

• Cannot say “a candidate who achieved a grade B in 1996 or 2012 appears to have ... performed approximately at the level of a candidate who achieved a grade E in 1964”
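The p-value attached to the length correlation in the list above can be roughly reproduced from ρ alone using the common t approximation, t = ρ·sqrt((n − 2) / (1 − ρ²)) with n − 2 degrees of freedom. The sketch below assumes n = 11, the number of papers listed for Study 2; that sample size, the choice of approximation and the function names are assumptions, not details given in the source. The t distribution is integrated numerically with Simpson's rule so that only the standard library is needed.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_from_rho(rho, n, steps=2000):
    """Two-sided p-value for a rank correlation via the t approximation."""
    df = n - 2
    t = abs(rho) * math.sqrt(df / (1 - rho * rho))
    # Simpson's rule for the area under the density between 0 and t
    # (steps must be even).
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3
    return 1 - 2 * central_area


# rho from the slide; n = 11 is an assumption (the papers listed for Study 2).
p_value = two_sided_p_from_rho(-0.47, n=11)
print(round(p_value, 2))
```

Under these assumptions the result comes out close to the reported p = .15. An exact permutation test could give a slightly different value for a sample this small.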

Thank you

Ian Jones, Loughborough University, [email protected]

Colin Foster, Loughborough University, [email protected]

Jodie Hunter Massey University, New Zealand
