Developing a model for investigating the impact of assessment within educational contexts by a public examination provider
Dr Nick Saville, Research and Validation Group, Cambridge ESOL
"Impact by Design"
Nick Saville
Athens June 2006
A model for investigating the impact of (language) assessment within educational contexts
Teaching Testing Learning
A Perspective from Cambridge ESOL
A model for investigating the impact of assessment within educational contexts
Teaching Testing Learning
Implications for Cambridge Assessment?
153 years of history ……..
In tune with the spirit of the Victorian age
14th December 1858
370 students in seven different local contexts took an examination paper set by UCLES for the first time
153 years of history ……..
"This year, we find that students have acquired a great deal of skill but that they seem to have acquired it for examination purposes"
Art examiner writing in The TES, 1915
Michael Shaw - Remembrance of things passed, Cover Story - Magazine, TES (10 December 2010)
153 years of history …….. plus ça change
Outline for today's talk
• Background to ESOL's approach
  • 1980s
  • Messick, Bachman – early 1990s
• The literature on washback/impact
  • early work and recent progress
  • gaps? where next?
• Analysis of three case studies
  • what can be learnt?
• Towards a Comprehensive Model of Impact
  • applicable to other educational contexts?
ESOL background – 1987-1990: Japan
Considerations in developing fair tests
The art of the possible
[Diagram: the Test positioned between V (validity), R (reliability) and P (practicality), with practicality first shown as a question]
“Practicality in Language Testing: an educational management model”
Main argument: test development is a form of educational innovation - and needs to be managed as such
“... achieving a balance between the purpose of the test, its validity for the purpose, the required reliability for the purpose and the constraints imposed by the context is essentially the task facing the test designer ….”
Saville (1990), University of Reading.
A Cambridge test development project: Japan, 1987 to 1989
Putting the test into context
[Diagram: the Test at the centre of V, R and P]
“The aim … is not only to encourage good testing practice, but to prevent bad tests being produced .... a bad test is not only one with low reliability and dubious validity but also one which has a damaging washback on the curriculum”.
Saville 1990
……. any test which is produced should be appropriate to the educational context in which it is to be used and the effect on learners and institutions will be a major consideration.
Impact Ripples
[Diagram: the Test (V, R, P) at the centre, with ripples spreading outwards]
I – Local Impact, “micro” level
II – Wider Impact, “macro” level
U = V + R + I + P
Prof L Bachman (UCLA) - Cambridge Seminars 1990/91
The unitary concept of Usefulness
Overall Validity
U = V + R + I + P
Bachman and Palmer, 1996 : U = Cv + A + I + R + I + P
Developing “useful tests”, fit for purpose
Balancing the test qualities
Usefulness as “overall Validity”
Current ESOL Practice
Principles of Good Practice - 2011
Quality Management and validation in language assessment
www.cambridgeesol.org/about/standards/pogp.html
VRIP
See also brochure - Making an Impact
Starting to develop a model of impact
• 1993 – 1995
• Using VRIP to develop and revise exams, e.g. the revision of IELTS 1995
  • The IELTS impact project
• An expanded view of impact - from the test developer’s perspective
  • Working for positive impact
  • Limiting negative consequences
Maxims for achieving/monitoring impact
Maxim 1 PLAN: Use a rational and explicit approach to test development
Maxim 2 SUPPORT: Support stakeholders in the testing process
Maxim 3 COMMUNICATE: Provide comprehensive, useful and transparent information
Maxim 4 MONITOR and EVALUATE: Collect all relevant data and analyse as required
Milanovic and Saville, 1995, Considering the impact of the Cambridge EFL examinations
The literature on washback/impact
• Readings in the language testing literature:
  • Hamp-Lyons (1989)
  • Alderson and Wall (1993) Does washback exist? etc.
  • Language Testing (1996: 13, 3) Messick, Bailey, etc.
  • Hamp-Lyons (1997)
  • Watanabe (1997)
  • Cheng and Watanabe (eds) (2004)
• Recent PhD studies and subsequent books in SILT series based on research conducted in the 1990s:
  • Cheng (SILT 21 - 2005)
  • Wall (SILT 22 - 2005)
  • Hawkey (SILT 24 - 2006)
  • Green (SILT 25 - 2007) - “washback in context”
Washback/impact
• Washback (or backwash) has been broadly defined in the assessment literature as the effect of testing on teaching and learning
• One aspect of the broader phenomenon known as impact
15 washback hypotheses (Alderson and Wall, 1993)
• Based on who or what might be affected:
  • Teaching
  • Learning
  • Content
  • Rate of learning
  • Sequence of teaching/learning
  • Degree/depth of curriculum coverage
  • Attitudes of teachers/learners
  • Etc.
Washback
• A continuum - stretching from harmful at one end, through neutral, to beneficial at the other end
[Diagram: scale running from Negative (-) through Neutral to Positive (+)]
Washback
• Negative?
  • Restriction of content – narrowing of curriculum
  • Too much time practising for the test
• Positive?
  • Transparent objectives and outcomes
  • Increased motivation of learners
  • Increased accountability of teachers (?)
The “law” of unintended consequences
• “Any purposeful action will produce some unintended consequences” or side-effects
• “Goodhart’s Law” (or “Campbell’s Law” in the USA)
  • a variant of the “law” of unintended consequences
“Goodhart’s Law”
• “All performance indicators lose their meaning when adopted as policy targets”
• Examples:
  • England - school achievement targets - school league tables
  • USA – No Child Left Behind (NCLB)
• The clearer you are about what you want, the more likely you are to get it – but the less likely it is to mean what you wanted it to! (Dylan Wiliam, Cambridge 2008)
Perverse incentives?
• Assessment policy can create a tension between
  • educational objectives at the micro level (teaching and learning in schools) and
  • a requirement for accountability at the macro level
Washback
• Negative?
  • Restriction of content – narrowing of curriculum
  • Too much time practising for the test
• Positive?
  • Transparent objectives and outcomes
  • Increased motivation of learners
  • Increased accountability of teachers (?)
• BUT – cause and effect explanations are rarely adequate …..
Washback Models
In the language testing literature:
• Hughes (1993)
• Bailey (1996)
• Watanabe (2004)
• Cheng (2004, 2005)
• Green (2007)
Bailey’s 1996 Model (based on Hughes 1993)
3 Ps:
• Participants
  • students
  • teachers
• Processes
• Products
  • learning
  • teaching
  • materials
  • curricula
[Book covers: Liying Cheng, Dianne Wall, Roger Hawkey - Studies in Language Testing series]
[Diagram: Green's model of washback. Labels:]
• Washback direction: Focal construct; Test design characteristics (item format, content, complexity, etc.); Overlap; Potential for positive backwash; Potential for negative backwash
• Washback variability: Participant characteristics and values
  • Knowledge/understanding of test demands
  • Resources to meet test demands
  • Acceptance of test demands
• Washback intensity: Perception of test importance (Unimportant to Important); Perception of test difficulty (Easy, Challenging, Unachievable); Backwash to participant (No backwash to Intense backwash)
• Other stakeholders: Course providers, Materials writers, Publishers, Teachers, Learners
Green, IELTS Washback in context: Preparation for academic writing in higher education (SILT 25, 2007)
The model starts from test design characteristics and related validity issues of construct representation identified with washback by Messick (1996).
Washback will be most intense – have the most powerful effects on teaching and learning behaviours – where participants see the test as challenging and the results as important (see blue arrow).
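The claim about washback intensity can be illustrated with a minimal Python sketch. Everything numeric here is an assumption made purely for illustration: the 0 to 1 importance scale, the `DIFFICULTY_WEIGHT` values, and the names `Perception` and `washback_intensity` are hypothetical. The slide presents the model only as a diagram with the categories Easy, Challenging and Unachievable; it gives no formula.

```python
from dataclasses import dataclass

# Hypothetical weights: the slide names three difficulty categories only.
# The numbers are illustrative assumptions, chosen so that "challenging"
# gives the strongest effect, as the slide states.
DIFFICULTY_WEIGHT = {
    "easy": 0.2,
    "challenging": 1.0,
    "unachievable": 0.2,
}


@dataclass
class Perception:
    """One participant's view of a test (an illustrative structure)."""
    importance: float  # assumed scale: 0.0 = unimportant, 1.0 = important
    difficulty: str    # "easy", "challenging" or "unachievable"


def washback_intensity(p: Perception) -> float:
    """Return an illustrative intensity between 0.0 and 1.0."""
    if not 0.0 <= p.importance <= 1.0:
        raise ValueError("importance must be between 0.0 and 1.0")
    if p.difficulty not in DIFFICULTY_WEIGHT:
        raise ValueError(f"unknown difficulty: {p.difficulty!r}")
    # Intensity is highest only when the test is seen as both
    # important and challenging.
    return p.importance * DIFFICULTY_WEIGHT[p.difficulty]


if __name__ == "__main__":
    for difficulty in DIFFICULTY_WEIGHT:
        view = Perception(importance=1.0, difficulty=difficulty)
        print(difficulty, washback_intensity(view))
```

A multiplicative form is used here only because it makes intensity fall to zero when the results are seen as unimportant, whatever the difficulty; other functional forms would fit the diagram equally well.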
[Book cover: IELTS - Washback in context, Studies in Language Testing, 25]
The literature on washback/impact
So
• Impact is relatively new in the field of language assessment - an extension of the notion of washback and related to ethicality
• It is now considered to be of growing importance
• It is part of a validity argument and evidence needs to be provided
Broadly speaking there is consensus
• washback is an aspect of impact related to the “micro contexts” of the classroom and the school (teaching and learning)
• impact deals with wider influences and includes the “macro contexts” - tests and examinations in society
BUT
• The dynamics between the micro and macro contexts mean that this is a complex rather than a simple or linear relationship - a “complex dynamic system”
The literature on washback/impact
And currently:
• there has not been a comprehensive model of test or examination impact within educational contexts
• impact has not yet been fully integrated into an approach to test development and validation in a systematic way
Three case studies – 1995 to 2004
• Case 1 - the world-wide survey of the impact of IELTS
  • a starting point for the work and the original model for what has followed
  • a conceptualisation of impact and design/validation of suitable instruments to investigate it
• Case 2 - the Italian PL2000 project
  • an application of the model within a macro educational context
  • an initial attempt at applying the approach on a limited basis within a state educational context
  • Hawkey – SILT 24 (2006)
• Case 3 - the Florence Language Learning Gains Project
  • an extension and re-application of the model within a single school context
  • at the micro level focusing on individual stakeholders within a single language teaching institution
Case 1 - the IELTS Impact studies
The project had the following aim within the IELTS revision project (1993-5):
….. to investigate the impact of the test on candidates and on other test users, as part of the continuous process of ensuring that IELTS is as valid, effective and ethical as possible
IELTS 1995 Revision Project
Phases of the IELTS Impact Study
Phase One: Prof. C.Alderson (Lancaster University) was commissioned to develop first draft of data collection instruments (1995)
Phase Two: trialling, revision, rationalisation of instruments
Phase Three (2001-2004): pre-survey, main data collection, analyses, report
See:
• Research Notes (2, 2000; 6, 2001; 15, 2004)
• Alderson and Banerjee (SILT 11, 2001)
• Saville and Hawkey (2004 - in Cheng and Watanabe)
• Hawkey (SILT 24, 2006)
Stakeholder participation in Phase 3
• Responses received from:
  • 572 pre- and post-IELTS candidates
  • 83 teachers completing the teacher questionnaire
  • 43 teachers completing the instrument for the analysis of textbook materials
• Stakeholder interviews and focus groups at selected case study centres, involving:
  • 120 students
  • 21 teachers
  • 15 receiving institution administrators
• 12 “live” IELTS-preparation classes have been video-recorded and analysed.
Some key points and lessons learnt
• Setting objectives, design and research questions
  • The instruments – development and validation
  • The data – (strategies for collection, storage, retrieval)
  • The analysis and interpretation of multiple sources of data (quantitative and qualitative)
• Managing impact studies
  • practical, legal, ethical issues
  • project management and action planning
• But the IELTS international dimension introduces multiple contexts – many more case studies required in specific contexts
Case 2 - the Italian PL2000 project
Using international certification in Italian state-sector education
• an application of the approach within a single macro educational context
• an initial attempt at applying the approach on a limited basis within a state educational context
Progetto Lingue 2000
• The Progetto Lingue 2000 within the state school system of Italy
• As the name suggests - came into practice in the academic year 1999 to 2000
Progetto Lingue 2000
• The intention of the progetto was:
“.... to introduce innovation into the teaching and learning of other languages by putting greater emphasis on the development of communicative competence in all grades of the school system”
Italian Ministry document
• Emphasis on
  • the use of new technology in pedagogic contexts
  • self-study and the individualisation of the learning experience
• The adoption of a level system based on the Council of Europe’s Common European Framework of Reference (CEFR) as learning objectives and standards
• The option of getting a certificate of proficiency to certify the level reached
  • the certificate should be aligned to the CEFR scale and issued by a certificating body which is recognised internationally
Progetto Lingue 2000: An educational innovation project
[Diagram, built up over three slides: Educational goals (Italy’s national learning goals integrated with pan-European - Council of Europe - goals), linked to Resources; Teacher development & support; Curriculum design; Assessment and Certification, including optional external certification]
PL2000 Impact Project 2001-2
Main interdependent language programme stakeholders and dimensions
• Stakeholders: Students; Parents; Teachers; Teacher-trainers; Curriculum developers; Testers; Publishers; Receiving institutions; Employers
• Dimensions: Learning goals, curriculum, syllabus; Materials; Teacher Support; Testing; Methodology
Some key points and lessons learnt
• Applied lessons learnt in the IELTS studies
• Adapted the instruments and data collection techniques
• Introduced new features of data collection
  • Seven case study schools with school visits and interviews
• Proved the successful application of the approach within a national context
• Showed the possibility of matching learning objectives and tests via a “neutral” framework of reference – CEFR
• But – only limited data
• Test provider was an “outsider”
[Book cover: Impact Theory and Practice, Studies in Language Testing, 24]
Case 3 – Florence project
• the Florence Language Learning Gains Project
  • an extension and re-application of the model within a single school context
  • at the micro level focusing on individual stakeholders within a single language teaching institution (British Institute of Florence)
Key points and lessons learnt:
• Focus on washback on language performance and learning growth
  • Can the influence of the test be separated from the other variables?
• Longitudinal study over one academic year (2002-3)
• Participant learners were compared in terms of:
  • Competence level
  • Age
  • Stage
  • Motivation
  • External (high stakes) or internal final exam
  • Learning gain
• Provided multiple sources of very rich data
• But - difficult and costly to do
• Requires active participation of many stakeholder groups and individuals
Learning from the 3 impact case studies
• What can be learned using these specific impact projects as meta-data?
• Three key factors of contemporary educational systems need to be accounted for:
  1. the nature of complex dynamic systems (see for example D. Larsen-Freeman 1997)
  2. the roles that stakeholders play within such systems
  3. the need to see assessment projects as educational innovations within the systems and to manage change effectively – need a theory of action
1. The nature of complex dynamic systems
[Diagram: the testing system in its context, repeated over several slides]
• Inputs to test design: Learners; Teachers; Test writers/examiners; Receiving institutions; School owners; Future employers; Government agencies; Professional bodies; Test centre administrators; Materials writers; Publishers; etc.
• Testing System: Test constructs; Test format; Test conditions; Test assessment criteria; Test scores
• Contexts of test use - consequences: Learners; Parents/carers; Teachers; Receiving institutions; Employers; School owners; Examiners; Government agencies; Professional bodies; Academic researchers; Test writers/Examiners; etc.
2. The roles that stakeholders play
[Same diagram]
3. The need to see assessment projects as educational innovations and to manage change effectively
See Wall (SILT 22, 2005): a case study using insights from testing and innovation theory, e.g. Henrichsen (1989)
Hybrid Model of the Diffusion / Implementation Process
[Diagram: timeline running through Antecedents, Process and Consequences]
Learning from the case studies
• When applied to (language) assessment, two key factors also need to be accounted for:
  a) the nature of the construct: language itself as a socio-cognitive phenomenon - the latest views on validity
  b) the nature of the test development and validation process
    • from conception to routine data collection and analysis
• Impact research, therefore, is another kind of validation activity ........
a) A socio-cognitive framework
[Diagram, built up over four slides. Labels:]
• Theory; Test Taking; Context
• TT CONTEXT: TLU; Learning context; Context of score use
• Messick, Bachman, Kane, Mislevy, Weir, etc.
• The testing system: Core Construct (see also Pellegrino)
• The contexts: Learning contexts; Testing contexts; Use of results contexts
• Impact: Consequential aspects of validity
Inference to a framework - from Dr Neil Jones (based on Kane, Mislevy etc.)
[Diagram: chain of inferences from Test performance through Test score and True score to the “Real world” (target situation of use)]
• Evaluation (Scoring model): How can we score what we observe? Relates to marking, rating criteria
• Generalization (Measurement model): Does the test measure consistently? Relates to test reliability, rater training, scale construction and version equating using IRT, etc.
• Extrapolation: Does the test score reflect the candidate’s actual ability? Relates to Validity, e.g. a Socio-cognitive model linking features of the learners, the test content and the skills to be measured
• Idealization (Specific testing context; Link to context-neutral framework; CEFR levels): How does the specific learning/testing context relate to a more general proficiency framework? Depends on identifying the salient features of the levels and the specific learner group – not all salient features may be relevant to all groups. Quantitative and qualitative evidence may be provided.
b) Model of the Test Development Process
“ … seek validity by design as a likely basis for washback” (Messick, 1996: 252)
Seek “impact by design”, i.e. a theory of action (Saville, 2009)
Model of the Test Development Process
• Identifying stakeholders and their needs
• Linking these needs to the requirements of test usefulness
  - including predicted impact
  - theoretical
  - practical
• Long term, iterative processes - a key feature of validation
Involvement of the stakeholder constituency
E.g. during test design and development
• presentation and consultation to do with specifications and detailed syllabus designs
• professional support programmes for institutions and individual teachers/students etc. who plan to use the examinations
• training and employment of suitable personnel within the field to work on all aspects of the examination cycle – to be question/item writers, to act as examiners, etc.
Cf. the Maxims referred to above
After an examination becomes operational
• Procedures need to be in place to collect data routinely which allows impact to be estimated:
  • who is taking the examination (i.e. a profile of the candidates)
  • who is using the examination results and for what purpose
  • who is teaching towards the examination and under what circumstances
  • what kinds of courses and materials are being designed and used to prepare candidates
  • what effect the examination has on public perceptions generally (e.g. regarding educational standards)
  • how the examination is viewed by those directly involved in educational processes (e.g. by students, examination takers, teachers, parents, etc.)
  • how the examination is viewed by members of society outside education (e.g. by politicians, business people, etc.)
Towards a comprehensive model
• How can these considerations be combined to produce a comprehensive, integrated model?
  • to guide language testers in ways to build impact into test development and validation systems
  • to promote research into impact by a wide range of stakeholders
A meta-framework building on Milanovic & Saville’s maxims (1996)
Four inter-related dimensions:
1. re-conceptualise the role of impact study within the assessment enterprise, vis-à-vis societal systems generally and language education specifically
2. introduce the concept of “impact by design” into the planning and operationalisation of language assessments by examination providers
3. re-organise validation procedures to incorporate impact research into operational activities to provide the basis for knowing about and understanding how well an assessment system works in practice with regard to its impact (as defined in point 1 above)
4. develop an appropriate theory of action which enables examination providers to work with stakeholders to achieve the intended objectives, to avoid negative consequences and to take remedial action when necessary.
“Impact by design”
• Integral part of a framework for developing and validating examination systems
• A concept akin to social impact assessment (SIA)
• Focus on what matters – e.g. successful learning
Impacts (positive and negative) anticipated in design phase
Impact research methodology used to find out what happens
Remedial action taken when needed on the basis of impact evidence
A revised model (2009)
[Diagram. Labels:]
• Key considerations
  • Centrality of language construct, theories of language learning
    - a socio-cognitive model
    - learning understood as change
    - effective communication
  • Impact research incorporated into routine validation processes
  • Mixed method designs used with impact “toolkit” to collect quantitative and qualitative data
  • Importance of the timeline with iterative cycles of review and revisions implemented over time
• Emergent aspects of validity
  • Improved understanding of the meaning of language assessment in context and of the effects and consequences on systems and people
• Stance
  • Perspective of UK examinations board
  • Influenced by critical realism, contemporary pragmatism
• Reconceptualising impact taking account of:
  - theories of knowledge
  - socio-cognitive theory
  - constructivism
  - theories of change
• Impact by design
• Procedural basis for knowing about effects and consequences
• Theory of Action
Applications beyond ESOL?
• Applying the model within the UK educational context:
• The Asset Languages Project (2003 onwards)
Conclusion
Investigating impact as validation
• The investigation of impact is not a discrete or one-off activity
• It is an essential component in establishing the overall validity (usefulness) of an assessment system in terms of its fitness for specific purposes and contexts of use
• The proposed model locates the study of test impact as one of a set of research and development tools within an iterative approach to on-going test validation
• It is consistent with Messick, 1996:
“In essence ..... test validation is empirical evaluation of meaning and consequences of measurement, taking into account extraneous factors in the applied setting that might erode or promote validity of local score interpretation and use.”
Thank You!