![Page 1: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/1.jpg)
Comparing standards of examination papers when there are no archived scripts
Ian Jones Loughborough University
Colin Foster Loughborough University
Jodie Hunter Massey University, New Zealand
13th Annual UK Rasch User Group Meeting
Cambridge 2019
![Page 2: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/2.jpg)
Fifty years of A-level mathematics: have standards changed?
Ian Jones (a,*), Chris Wheadon (b), Sara Humphries (c) and Matthew Inglis (a)
(a) Mathematics Education Centre, Loughborough University, UK; (b) No More Marking Ltd.; (c) Ofqual
Advanced-level (A-level) mathematics is a high-profile qualification taken by many school leavers in England, Wales, Northern Ireland and around the world as preparation for university study. Concern has been expressed in these countries that standards in A-level mathematics have declined over time, and that school leavers enter university or the workplace lacking the required mathematical knowledge and skills. The situation in England, Wales and Northern Ireland reflects more general international concerns about decreasing educational standards. However, evidence to support this concern has been of limited scope, rarely subjected to peer-review and of questionable validity. Our study overcame the limitations of previous research into standards over time by applying a comparative judgement technique that enabled the direct comparison of mathematical performance across different examinations. Furthermore, unlike previous research, all examination questions were re-typeset and candidate responses rewritten to reduce bias arising from surface cues. Using this technique, mathematics experts judged A-level scripts from the 1960s, 1990s and the 2010s. We report that the experts believed current A-level mathematics standards to have declined since the 1960s, although there was no evidence that they believed standards have declined since the 1990s. We contrast our findings with those from previous comparison studies and consider implications for future research into standards over time.
Keywords: A-level mathematics; standards; assessment; comparative judgement
Background
Numerous articles and reports have been published over recent years decrying the mathematical knowledge of school leavers in England and Wales (e.g. Walport et al., 2010; ACME, 2011). This includes those who have achieved high grades in Advanced-level (A-level) mathematics (Hawkes & Savage, 2000; Croft et al., 2009), a course usually associated with achieving university entrance to science, engineering and mathematics courses in England and Wales. High-profile and on-going media coverage (e.g. Willis & Paton, 2009) suggests that standards were higher some time in the past, but have declined since. In this article we investigate whether this is in fact the case.
Concerns about declining standards perhaps go back as far as accredited education itself, but of particular relevance to the current debate in England and Wales is the influential Dearing report (National Committee of Inquiry into Higher Education,
*Corresponding author. Mathematics Education Centre, Loughborough University, Loughborough, LE11 3TU, UK. Email: [email protected]
© 2016 British Educational Research Association
British Educational Research Journal
Vol. 42, No. 4, August 2016, pp. 543–560
DOI: 10.1002/berj.3224
![Page 3: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/3.jpg)
BERJ (2016) Results
[Figure: Achievement Parameter Estimate (0 to 4.5) against Year of Examination (1960 to 2010), with separate series for A, B and E grades.]
![Page 4: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/4.jpg)
BERJ (2016) Method
![Page 5: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/5.jpg)
BERJ (2016) Preparation
• Question papers typeset for consistency.
• Candidate responses transcribed for consistency.
• 66 scripts divided into 546 questions and uploaded to website for judging procedure.
TIME-CONSUMING AND EXPENSIVE
REQUIRES GRADED ARCHIVE
![Page 8: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/8.jpg)
Can we apply CJ to standards comparison without graded scripts?
![Page 9: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/9.jpg)
Hope 1
Model solutions vs. graded scripts
[Figure: scatter plot of script scores (y-axis, −3 to 2) against perfect-solution scores (x-axis, −2 to 2); r = .68.]
![Page 10: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/10.jpg)
An investigation of construct relevant and irrelevant features of mathematics problem-solving questions using comparative judgement and Kelly’s Repertory Grid
Stephen D. Holmes, Qingping He and Michelle Meadows
Office of Qualifications and Examinations Regulation, Coventry, UK
ABSTRACT
The relationship between the characteristics of 33 mathematical problem-solving questions answered by 16-year-old students in England and the quality of problem-solving elicited was investigated in two studies. The first study used comparative judgement (CJ) to estimate the quality of the problem-solving elicited by each question, involving 33 mathematics teachers judging pairs of journal-style responses to the questions and the application of the Bradley–Terry model. In the second study a variant of Kelly’s Repertory Grid was used with five mathematics teachers to identify 23 dimensions along which the problem-solving questions varied. Significant relationships between ratings on some dimensions and the problem-solving quality estimated in the first study were found. This suggests that the Kelly’s Repertory Grid approach could be an effective way to identify features of questions that are relevant to the construct being assessed and features that could be potential sources of construct-irrelevant variance in test scores.
ARTICLE HISTORY
Received 29 April 2016
Accepted 28 October 2016
KEYWORDS
summative assessment; mathematical problem-solving; validity
Introduction
In recent years there has been discontent, particularly from employers, about the perceived lack of practical mathematical ability in England’s workforce and concerns that secondary school mathematics is not providing the skills required in the workplace or higher education (ACME, 2011; CBI, 2006; Toner, 2011; Vordermann, Porkess, Budd, Dunne, & Rahman-Hart, 2011). There have also been claims that the examinations in mathematics for 16-year-olds in England are not suitable for assessing the underlying mathematical ability of the students (Jones & Inglis, 2015). One solution suggested is that schools and school qualifications should place more emphasis on problem-solving and non-routine use of mathematics (Ofsted, 2012; Vordermann et al., 2011). This is consistent with the worldwide move towards the desire to train and assess these skills (e.g. ACT, 2006), perhaps best exemplified by the type of items used in the OECD PISA tests (OECD, 2014), the results of which have an ever-increasing impact on policymakers worldwide. The PISA Assessment and Analytical Framework (OECD, 2013) details the use of items which assess mathematical literacy, meaning the flexible application of mathematical
© 2017 Crown Copyright
CONTACT Stephen D. Holmes [email protected]
RESEARCH IN MATHEMATICS EDUCATION, 2017
VOL. 19, NO. 2, 112–129
https://doi.org/10.1080/14794802.2017.1334576
Hope 2
![Page 11: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/11.jpg)
Expected vs Actual Difficulty
A Comparison of Actual and Expected Difficulty, and Assessment of Problem Solving in GCSE Maths
2.3.14 Item expected and actual difficulty relationship
Figure 28 shows that there was a moderately strong correlation (see footnote 28) between the expected difficulty of the items and the difficulty as experienced by students (r = 0.66). The disattenuated correlation, which estimates what the correlation would be if the measurement of expected and actual difficulty had been more precise, was reasonably high (r = 0.76).
Figure 28: A scatter plot to show the relationship between expected and actual difficulty of items
2.3.15 Residual analysis of the relationship between expected and actual difficulty
Analysis of the residuals of a linear model between expected and actual difficulty revealed no systematic pattern between the independent variable (item difficulty) and the residuals. However, there is a correlation between item order and the residuals
28 This correlation is between the study 1 difficulty parameters and the Rasch model parameters from study 2. The correlations between the study 1 parameters and study 2 item facility values were 0.56 for foundation tier and 0.68 for higher tier. Unlike the Rasch parameters which can be equated, the facility values for the two tiers cannot be combined to obtain one correlation.
r (disattenuated) = .76
From page 64 of Ofqual (2015) A Comparison of Expected Difficulty, Actual Difficulty and Assessment of Problem Solving across GCSE Maths Sample Assessment Materials. Report Ofqual/15/5679.
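The step from r = 0.66 to the disattenuated 0.76 can be illustrated with Spearman's correction for attenuation, which divides the observed correlation by the square root of the product of the two reliabilities. The excerpt does not state which reliabilities the report used, so the sketch below only back-calculates the reliability product implied by the two rounded figures. It assumes the standard correction was applied, and the function names are illustrative rather than taken from the report.

```python
import math


def disattenuate(r_observed, reliability_x, reliability_y):
    """Spearman's correction: r_true = r_observed / sqrt(rel_x * rel_y)."""
    return r_observed / math.sqrt(reliability_x * reliability_y)


def implied_reliability_product(r_observed, r_disattenuated):
    """Product rel_x * rel_y that turns r_observed into r_disattenuated."""
    return (r_observed / r_disattenuated) ** 2


# Rounded values quoted in the excerpt; the reliabilities are NOT given there.
product = implied_reliability_product(0.66, 0.76)
print(f"implied reliability product: {product:.3f}")

# Round trip: any pair of reliabilities with this product reproduces 0.76.
# Putting the whole product on one measure is an arbitrary choice here.
print(f"check: {disattenuate(0.66, product, 1.0):.2f}")
```

Because both inputs are rounded to two decimal places, the implied product is only approximate.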
![Page 12: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/12.jpg)
Can we apply CJ to standards comparison without graded scripts?
Study 1
Judging non-typeset items only.
![Page 13: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/13.jpg)
Study 1: comparative judgement
• Exam papers from 1964, 1968, 1996, 2012 (as per BERJ, 2016).
• Split into 42 question items.
• Judged by 8 maths PhD students, total 670 pairwise judgements.
• Internal consistency, SSR = .91.
• Inter-rater reliability (split-halves, 100 iterations), median r = .79.
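The slide does not say how the pairwise judgements were turned into item scores. A common choice in CJ work is the Bradley–Terry model, which the Holmes et al. abstract earlier in the deck names. The following is a minimal sketch under that assumption, using made-up judgements for four items. It fits the model with the standard minorisation-maximisation update and computes SSR using one common definition, (observed variance − mean squared standard error) / observed variance. The standard errors use a diagonal approximation to the information matrix. All names and data are hypothetical.

```python
import math
from collections import defaultdict


def fit_bradley_terry(judgements, n_iter=200):
    """Fit Bradley-Terry scores from (winner, loser) pairs.

    Returns {item: theta} on the logit scale, centred so the scores sum to 0.
    Assumes every item wins at least once and loses at least once.
    """
    items = sorted({item for pair in judgements for item in pair})
    wins = defaultdict(int)
    n_compared = defaultdict(int)  # n_compared[(i, j)]: times i met j
    for winner, loser in judgements:
        wins[winner] += 1
        n_compared[(winner, loser)] += 1
        n_compared[(loser, winner)] += 1
    for i in items:
        total = sum(n_compared[(i, j)] for j in items if j != i)
        if wins[i] == 0 or wins[i] == total:
            raise ValueError(f"item {i!r} needs at least one win and one loss")

    strength = {i: 1.0 for i in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = sum(n_compared[(i, j)] / (strength[i] + strength[j])
                        for j in items if j != i)
            updated[i] = wins[i] / denom
        # Rescale so the geometric mean of the strengths is 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(items)
        strength = {i: v / math.exp(log_mean) for i, v in updated.items()}
    return {i: math.log(strength[i]) for i in items}


def scale_separation_reliability(theta, judgements):
    """SSR = (observed variance - mean squared SE) / observed variance."""
    info = defaultdict(float)  # diagonal of the information matrix
    for winner, loser in judgements:
        p_win = 1.0 / (1.0 + math.exp(theta[loser] - theta[winner]))
        info[winner] += p_win * (1.0 - p_win)
        info[loser] += p_win * (1.0 - p_win)
    mean_sq_error = sum(1.0 / info[i] for i in theta) / len(theta)
    mean_theta = sum(theta.values()) / len(theta)
    observed_var = sum((t - mean_theta) ** 2
                       for t in theta.values()) / (len(theta) - 1)
    return (observed_var - mean_sq_error) / observed_var


# Hypothetical data: each pair is judged four times and the higher-ranked
# item wins three of the four.
ranking = ["A", "B", "C", "D"]
judgements = []
for hi in range(len(ranking)):
    for lo in range(hi + 1, len(ranking)):
        judgements += [(ranking[hi], ranking[lo])] * 3
        judgements += [(ranking[lo], ranking[hi])]

theta = fit_bradley_terry(judgements)
ssr = scale_separation_reliability(theta, judgements)
print({item: round(score, 2) for item, score in theta.items()}, round(ssr, 2))
```

With only 24 judgements the toy SSR will be far below the .91 reported above. The split-halves figure would be obtained by fitting the same model separately to two random halves of the judges and correlating the two sets of scores.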
![Page 14: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/14.jpg)
Study 1: analysis
We compared item scores with
(i) the scores of the perfect candidates from the BERJ paper (“perfect scores”), and
(ii) the scores of the real scripts from the BERJ paper (“script scores”).
(Scores were available for 38 of the 42 questions judged for Study 1.)
![Page 15: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/15.jpg)
Study 1: correlations
r = 0.63 r = 0.68r = 0.49
item vs perfect item vs script perfect vs script
![Page 16: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/16.jpg)
Study 1: variance explained
• Year as a predictor of item score (BERJ, 2016).
• Present study: F(1,36) = 35.83, p < .001, R² = .500, year as predictor: b = -0.06.
• Perfect scores (BERJ, 2016): F(1,36) = 13.94, p < .001, R² = .279, year as predictor: b = -0.03.
• Script scores (BERJ, 2016): F(1,36) = 13.62, p < .001, R² = .274, year as predictor: b = -0.03.
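For a regression with a single predictor, the F statistic and the variance explained are tied together by R² = F·df₁ / (F·df₁ + df₂), so the figures on this slide can be cross-checked against one another. The sketch below is a reader's consistency check rather than part of the original analysis, and the function name is illustrative. The residual degrees of freedom of 36 also imply 38 items, which matches the 38 questions for which scores were available.

```python
def r_squared_from_f(f_stat, df_model, df_resid):
    """R^2 implied by an overall F test: F*df1 / (F*df1 + df2)."""
    return f_stat * df_model / (f_stat * df_model + df_resid)


# (F statistic, reported R^2) for the three regressions, all with F(1, 36).
reported = {
    "Present study": (35.83, 0.500),
    "Perfect scores": (13.94, 0.279),
    "Script scores": (13.62, 0.274),
}

implied = {name: r_squared_from_f(f_stat, 1, 36)
           for name, (f_stat, _) in reported.items()}

for name, (_, r2) in reported.items():
    print(f"{name}: reported R2 = {r2:.3f}, implied R2 = {implied[name]:.3f}")

# One predictor plus an intercept: n = residual df + 2.
n_items = 36 + 2
print("items in each regression:", n_items)
```

The implied values agree with the reported ones to within rounding of the published statistics.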
![Page 18: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/18.jpg)
Study 2
Judging (i) typeset papers only, and (ii) typeset papers with perfect solutions.
Can we apply CJ to standards comparison without graded scripts?
![Page 19: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/19.jpg)
Study 2: Exam papers
| Year | Boards |
| --- | --- |
| 1964 | JMB* |
| 1968 | JMB* |
| 1990 | JMB |
| 1996 | AEB*, London, UCLES |
| 2000 | Edexcel |
| 2006 | MEI |
| 2012 | AQA*, MEI |
| 2017 | MEI |
* included in BERJ (2016).
![Page 20: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/20.jpg)
Study 2: comparative judgement
• (i) Papers Only.
• Judged by 5 maths PhD students, total 250 judgements, SSR = .84.
• (ii) Papers and Solutions.
• Judged by 5 different maths PhD students, total 330 judgements, SSR = .87.
![Page 21: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/21.jpg)
Study 2: correlation
[Figure: scatter plot of Papers only scores (y-axis, −1.5 to 2.0) against Papers and solutions scores (x-axis, −1.5 to 1.0), one labelled point per exam paper from 1964 Paper 1 to 2017 MEI; r = .74.]
![Page 22: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/22.jpg)
Study 2: analysis
We compared exam paper scores with
(i) the scores of the perfect candidates from the BERJ paper (“perfect scores”), and
(ii) the scores of the real scripts from the BERJ paper (“script scores”).
Unlike for Study 1 we did this graphically.
![Page 23: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/23.jpg)
Study 2: graphical analysis
[Figure, panel "BERJ": Parameter (−1 to 1) for the Perfect and Scripts series, plotted by exam paper from 1964 Paper 1 to 2017 MEI.]
![Page 24: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/24.jpg)
Study 2: graphical analysis
[Figure, two panels plotted by exam paper from 1964 Paper 1 to 2017 MEI. Panel "BERJ": Parameter (−1 to 1) for the Perfect and Scripts series. Panel "Study 2 data": Parameter (−1 to 2) for the Papers only and Papers & solutions series.]
![Page 26: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/26.jpg)
Limitations
• Standards-based assessment research is nonsense (Goldstein, 1979; Newton, 1997).
• Study 2 had only four data points. No estimate is available of whether the results are due to chance.
• Papers vary in length from 8 to 40 pages. CJ score vs length: ρ = –.47, p = .15.
• Cannot say “a candidate who achieved a grade B in 1996 or 2012 appears to have ... performed approximately at the level of a candidate who achieved a grade E in 1964”
![Page 27: Comparing standards of examination papers when there are no … · 2020. 3. 16. · Expected vs Actual Difficulty A Comparison of Actual and Expected Difficulty, and Assessment of](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf848ecf4eb65944544c4/html5/thumbnails/27.jpg)
Thank you
Ian Jones Loughborough [email protected]
Colin Foster Loughborough [email protected]
Jodie Hunter Massey University, New Zealand