Developing an In-House Speaking Assessment: Rasch Analysis for Action Research
TRANSCRIPT
DEVELOPING AN IN-HOUSE SPEAKING ASSESSMENT: RASCH ANALYSIS FOR ACTION RESEARCH
FLLT 2016, Bangkok, Thailand. June 24, 2016
by Andy Vajirasarn, Toyo University (Tokyo, Japan)
OVERVIEW
• Action Research
• Speaking Assessment Development
• Study and Results/Conclusions
• What's next?
ACTION RESEARCH
THEORY & PRACTICE
ACTION RESEARCH: DEFINITIONS
“…most agree on the following: action research is inquiry that is done by or with insiders to an organization or community, but never to or on them.” (Anderson, 2005, in Ivankova, 2015)
“Action research is a family of practices of living inquiry... It seeks to bring together action and reflection, theory and practice, in participation with others, in the pursuit of practical solutions to issues of pressing concern to people, and more generally, the flourishing of individual persons and their communities.” (Reason & Bradbury, 2008, in Ivankova, 2015)
Methodological features: systematic, cyclical, and flexible; it involves collection of multiple data sources and generation of a plan of action or intervention. (Ivankova, 2015)
ACTION RESEARCH MODELS
Kurt Lewin (1948)
Kemmis and McTaggart (2007)
Stringer (2014)
ACTION RESEARCH CYCLE (IVANKOVA, 2015)
[Cycle diagram: Diagnosis → Reconnaissance → Plan → Act → Evaluate → Monitor]
ACTION RESEARCH MODELS
Ivankova's (2015) model:
1. Diagnosis: identify the problem
2. Reconnaissance: gather information
3. Plan: plan the intervention
4. Act: execute the intervention
5. Evaluate: assess the intervention
6. Monitor: monitor for further improvement needs
STAGE 1: DIAGNOSIS OF CONTEXT
• Private Japanese university
• Annual screening test to enter the Advanced English Program
• I participated as 1 of the 4 volunteer judges
• 80 candidates seen in one day
• Group interview: 6 candidates at once
• 10-15 minutes per group
STAGE 1 (DIAGNOSIS): WHAT'S THE PROBLEM?
• No meetings, consensus-building, or training for raters
• Raters instructed to "ask anything" as interview questions
• Never knew what other judges would ask until the day of!
• No control for quality of questions: topic and wording difficulty levels
• Marks given for English Skills and Motivation using "S, A, B, C, or F"
• No criteria given for English skills or motivation
• No guidance given about the meaning of levels
  o What is a B in Motivation?
  o What is a C in English skills?
In brief: very loosely run; purely intuition-based.
STAGE 2: RECONNAISSANCE
• Requested and obtained permission for the action research project. YES!
• Field notes
  o Conducted observation of a speaking assessment
• Transcripts
  o Conducted interviews with the director and other volunteer judges
• Review of speaking assessment development literature
  o Brown (2012) on rubrics and language assessment
  o Fulcher and Davidson (2007) on instrument design and validity
  o Luoma (2004) on assessing speaking
  o Taylor (2011) on how the IELTS test was validated
RECONNAISSANCE OUTCOMES
• Bank of questions (2 sets: easier and harder)
• Procedure: start with an easier question first, then move on to a harder question
• Scoring rubric (matrix of criteria)
  o 4 categories: Content Relevance, Content Support, Fluency, Accuracy
  o 4 levels of descriptors: 4, 3, 2, 1
• Procedure: check validity using Multifaceted Rasch Measurement (Bond & Fox, 2007; Fulcher & Davidson, 2007; Linacre, 2006)
MULTIFACETED RASCH ANALYSIS (MFRM): SOPHISTICATED STATISTICAL ANALYSES
MFRM AND "FACETS" (LINACRE, 2006)
• Raw data are used to build a model.
• The data are then compared to the model for how well they "fit" (checked with infit and outfit mean squares, or with t-test z-scores).
• Passing the fit test = your instrument is "constructive for measurement."
• Data handling:
  o Robust against MISSING VALUES
  o Raters do not need to rate every candidate; overlapping groups are fine.
  o A finer-grained view of the data is possible via the logit score/scale.
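To make the model-building step concrete, here is a minimal Python sketch of the many-facet rating scale model that underlies FACETS. This is my own illustration, not FACETS code: the function name and all parameter values are hypothetical. Each observed rating is modeled from examinee ability, rater severity, criterion difficulty, and category thresholds, all on one logit scale.

```python
import math

def category_probabilities(ability, severity, difficulty, thresholds):
    """Many-facet rating scale model (illustrative sketch).

    The log-odds of an examinee receiving category k rather than
    k-1 from a given rater on a given criterion is modeled as:
        ability - severity - difficulty - thresholds[k-1]
    All values are in logits; `thresholds` are the Rasch-Andrich
    thresholds between adjacent rubric levels.
    """
    # Cumulative sums of the adjacent-category log-odds give each
    # category's unnormalized log-probability.
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + ability - severity - difficulty - tau)
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values: an above-average examinee, a slightly harsh
# rater, an easy criterion, and thresholds for a 4-level rubric.
probs = category_probabilities(ability=0.5, severity=0.3,
                               difficulty=-0.2, thresholds=[-1.5, 0.0, 1.5])
for level, p in enumerate(probs, start=1):
    print(f"P(level {level}) = {p:.2f}")
```

Fitting means estimating these parameters from the raw ratings; the infit/outfit statistics then summarize how far each observed rating strays from the probabilities the model predicts.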
MFRM AND "FACETS" (LINACRE, 2006)
• Provides measures of:
  o examinee ability
  o rater severity
  o rating category difficulty level
  o rating scale use
• Fairness adjustment:
  o Post-scoring adjustment is possible.
  o Based on all raters' severity data, an adjustment value is calculated.
  o Provides an "adjusted score" alongside the "observed score" (raw data).
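As a rough sketch of the fairness-adjustment idea: a crude raw-score analogue, not FACETS' actual logit-based computation, and all names and numbers below are hypothetical.

```python
# Hypothetical raw ratings: candidate -> {rater: observed score}.
ratings = {"cand01": {"r21": 12, "r24": 10},
           "cand02": {"r21": 15, "r25": 13}}

# Hypothetical rater severities in raw-score units (positive = harsh).
severity = {"r21": -0.8, "r24": 0.5, "r25": 0.7}

def adjusted_scores(ratings, severity):
    """Give credit back for harsh raters and discount lenient ones,
    then average: an 'adjusted score' reported alongside the
    observed score."""
    out = {}
    for cand, by_rater in ratings.items():
        adj = [score + severity[r] for r, score in by_rater.items()]
        out[cand] = sum(adj) / len(adj)
    return out

print(adjusted_scores(ratings, severity))
# cand01: (12 - 0.8 + 10 + 0.5) / 2 = 10.85, vs. an observed mean of 11.0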
THE PILOT STUDY: WHAT AM I LOOKING FOR & HOW?
STAGE 3: PLAN THE INTERVENTION
Purpose of the study: using MFRM to seek evidence for the soundness of the new assessment scale.
Research questions:
1. How are the performances of the examinees, raters, and rubric categories related when they are put on the same logit scale?
2. To what degree are different raters scoring in the same ways?
3. How difficult or easy are the rubric categories relative to each other?
PHASE 1: THE TRIAL INTERVIEW TEST
• 20 volunteer candidates were recruited.
• Candidates were given the procedure, question lists, and the rubric.
• I conducted one-on-one interviews with each candidate.
• Sessions were recorded using an IC recorder.
PHASE 2: RATING
• 5 language professionals recruited as raters: 4 + myself
  o 3 native English speakers; 2 L1 Japanese non-native English speakers
• Rubrics provided along with the audio data
• No formal training session, but informal verbal instruction on rubric use
• 20 candidates submitted self-ratings
  o Each candidate was given their own interview audio data.
  o Self-ratings were submitted by email.
RATING WORKLOAD
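The exact allocation matters less than one property: because raters need not rate every candidate, the overlapping groups must stay linked so that FACETS can calibrate everyone onto one scale. Below is a minimal sketch of that linkage check; the judging plan is entirely hypothetical, not the one used in this study.

```python
# Hypothetical judging plan: rater -> set of candidate IDs rated.
plan = {"r21": {1, 2, 3, 4, 5, 6, 7, 8},
        "r22": {6, 7, 8, 9, 10, 11, 12},
        "r23": {11, 12, 13, 14, 15, 16},
        "r24": {15, 16, 17, 18, 19, 20},
        "r25": {1, 10, 20}}

def is_linked(plan):
    """True if every rater is connected to every other through
    shared candidates; a disconnected design cannot be placed on
    a single common logit scale."""
    raters = list(plan)
    linked = {raters[0]}
    grew = True
    while grew:
        grew = False
        for r in raters:
            if r not in linked and any(plan[r] & plan[s] for s in linked):
                linked.add(r)
                grew = True
    return len(linked) == len(raters)

print(is_linked(plan))  # True: each rater overlaps at least one other
```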
RESULTS: WHAT DID I FIND?
RESULTS WALKTHROUGH
• Table 2: Raw scores
• Table 3: Summary fit statistics
• Figure 1: Vertical ruler, summary of all facets (RQ 1)
• Table 4: Student ability (RQ 1)
• Table 5: Rater severity (RQ 1 & 2)
• Table 6: Rating criteria (RQ 1 & 3)
• Figure 2: Rating scale probability curves (not an RQ, but related to the purpose)
TABLE 2: RAW SCORES
[Table of raw scores by rubric category: Content Relevance, Content Support, Fluency, and Accuracy, with candidates' point totals (e.g., 10, 12, 15, and 16 points).]
TABLE 3: SUMMARY FIT STATISTICS
• Examinees: separate and unique examinee abilities.
• Categories: Content Relevance, Content Support, Fluency, and Accuracy are significantly separate constructs.
• Raters: separation is too high; the raters are a bit too separate and unique. A lower value here means a more "normed" rating ability.
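For reference, the separation statistic being read off here can be sketched as the ratio of the "true" spread of the measures to their average error. This is my simplified version of what FACETS reports, and the numbers below are hypothetical.

```python
import math

def separation(measures, std_errors):
    """Rasch separation index G (illustrative sketch): true spread
    of the measures divided by their root-mean-square error.
    High G for examinees = good discrimination among abilities;
    high G for raters = raters differ in severity (here, a LOWER
    value would indicate better-normed raters)."""
    n = len(measures)
    mean = sum(measures) / n
    observed_var = sum((m - mean) ** 2 for m in measures) / (n - 1)
    error_var = sum(se ** 2 for se in std_errors) / n
    true_var = max(observed_var - error_var, 0.0)
    return math.sqrt(true_var) / math.sqrt(error_var)

# Hypothetical rater severity measures (logits) and standard errors.
print(separation([-0.6, -0.2, 0.1, 0.3, 0.4], [0.15] * 5))
```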
FIGURE 1: VERTICAL RULER
• #17 as a student is an average-ability speaker.
• #17 as a rater is lenient on himself.
• #12 as a student is a high-ability speaker.
• #12 as a rater is too strict on herself.
• Student raters (#1–20) span a range of 17 SD units: too varied and erratic for "good" measurement.
• Professional raters (#21–25) grouped near each other, not far (within ±2 logits) from the mean: much "fairer" as raters than the students are.
• The "fairest one of all" is a non-native English speaker… and so is the second fairest! (#25 and #24)
• The native English speakers were… a bit more lenient (#21–23).
TABLE 4: STUDENT ABILITY
Reasonable range for fit mean squares = .4 to 1.2, OR z-scores within −2 to +2 SD.
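A minimal sketch of this screening rule as applied in Tables 4–6; the thresholds come from the slide, but the fit values below are hypothetical.

```python
def fits_ok(mean_square, z_score,
            msq_range=(0.4, 1.2), z_range=(-2.0, 2.0)):
    """Flag an element (student, rater, or criterion) as fitting the
    Rasch model if EITHER its infit/outfit mean square falls in the
    reasonable range OR its standardized z-score is within +/-2 SD,
    per the screening rule on the slide."""
    msq_ok = msq_range[0] <= mean_square <= msq_range[1]
    z_ok = z_range[0] <= z_score <= z_range[1]
    return msq_ok or z_ok

# Hypothetical (mean square, z-score) fit statistics for three raters.
for rater, (msq, z) in {"r21": (0.9, 0.3),
                        "r24": (1.4, 1.8),
                        "r17": (2.1, 3.5)}.items():
    print(rater, "OK" if fits_ok(msq, z) else "MISFIT")
```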
TABLE 5: RATER SEVERITY
Reasonable range for fit mean squares = .4 to 1.2, OR z-scores within −2 to +2 SD.
TABLE 6: RATING CRITERIA
Reasonable range for fit mean squares = .4 to 1.2, OR z-scores within −2 to +2 SD.
FIGURE 2: PROBABILITY CURVES
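Figure 2 plots, for each rubric level, the probability of receiving that level across the ability scale; if the scale functions well, each level dominates over its own stretch. The sketch below shows how such curves are generated, reusing the rating scale model from earlier with hypothetical thresholds.

```python
import math

# Hypothetical Rasch-Andrich thresholds for rubric levels 1-4.
THRESHOLDS = [-1.5, 0.0, 1.5]

def level_probs(measure):
    """P(level 1..4) at a given measure (logits), with rater
    severity and criterion difficulty folded into the measure."""
    logits, cum = [0.0], 0.0
    for tau in THRESHOLDS:
        cum += measure - tau
        logits.append(cum)
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Print each level's probability along the ability scale; plotting
# these columns against `m` reproduces curves like Figure 2.
for m in range(-4, 5):
    row = "  ".join(f"{p:.2f}" for p in level_probs(float(m)))
    print(f"measure {m:+d}:  {row}")
```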
SUMMARY/CONCLUSION
There is evidence that this scoring system provided statistically consistent measures of student ability, rater severity, and rubric functioning.
• Overall goodness of fit looks OK: all facets had z-scores within the 2 SD range.
• Student ability: good spread from high to low.
• Rater severity:
  o Self-ratings were erratic and unexpected. (Remove them and re-calculate?)
  o Professional raters were stable and fair.
• Categories: evidence for unique constructs.
• Levels:
  o All levels (1 to 4) were used enough times.
  o No merging/collapsing of unused levels is needed.
WHAT'S NEXT?
Before using the assessment under LIVE test-taking conditions, it would be best to…
• design rater training program materials (with recordings and ratings from the trial)
• assess the quality and difficulty levels of the interview questions
• conduct post-scoring interviews (with raters and candidates)
For the rest… see me in Nagoya at JALT this November for stages 4, 5, & 6.
Thanks for coming!
WORKS CITED
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). New York: Routledge.
Brown, J. D. (Ed.). (2012). Developing, using, and analyzing rubrics in language assessment with case studies in Asian and Pacific languages. Honolulu, HI: NFLRC.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book. New York: Routledge.
Ivankova, N. (2015). Mixed methods applications in action research: From methods to community action. Los Angeles: Sage Publications.
Linacre, J. M. (2006). Facets Rasch measurement computer program (Version 3.61.0). Chicago: Winsteps.com.
Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press.
Taylor, L. (Ed.). (2011). Examining speaking: Research and practice in assessing second language speaking. Cambridge: Cambridge University Press.