Developing an In-House Speaking Assessment: Rasch Analysis for Action Research
TRANSCRIPT
DEVELOPING AN IN-HOUSE SPEAKING ASSESSMENT: RASCH ANALYSIS FOR ACTION RESEARCH
FLLT 2016, Bangkok, Thailand. June 24, 2016
by Andy Vajirasarn, Toyo University (Tokyo, Japan)
OVERVIEW
• Action Research
• Speaking Assessment Development
• Study and Results/Conclusions
• What's next?
ACTION RESEARCH
THEORY & PRACTICE
ACTION RESEARCH: DEFINITIONS
“…most agree on the following: action research is inquiry that is done by or with insiders to an organization or community, but never to or on them.” (Anderson, 2005, in Ivankova, 2015)
“Action research is a family of practices of living inquiry... It seeks to bring together action and reflection, theory and practice, in participation with others, in the pursuit of practical solutions to issues of pressing concern to people, and more generally, the flourishing of individual persons and their communities.” (Reason & Bradbury, 2008, in Ivankova, 2015)
Methodological features: systematic, cyclical, and flexible; it involves collection of multiple data sources and generation of a plan of action or intervention. (Ivankova, 2015)
ACTION RESEARCH MODELS
Kurt Lewin (1948)
Kemmis and McTaggart (2007)
Stringer (2014)
ACTION RESEARCH CYCLE (IVANKOVA, 2015)
[Cycle diagram: Diagnosis → Reconnaissance → Plan → Act → Evaluate → Monitor]
ACTION RESEARCH MODELS
Ivankova's (2015) model:
1. Diagnosis: identify the problem
2. Reconnaissance: gather information
3. Plan: plan the intervention
4. Act: execute the intervention
5. Evaluate: assess the intervention
6. Monitor: monitor for further improvement needs
STAGE 1: DIAGNOSIS OF CONTEXT
• Private Japanese university
• Annual screening test to enter the Advanced English Program
• I participated as 1 of the 4 volunteer judges
• 80 candidates seen in one day
• Group interview: 6 candidates at once
• 10-15 minutes per group
STAGE 1 (DIAGNOSIS): WHAT'S THE PROBLEM?
• No meetings, consensus-building, or training for raters
• Raters instructed to "ask anything" as interview questions
• Never knew what other judges would ask until the day of!
• No control for quality of questions: topic and wording difficulty levels
• Marks given for English Skills and Motivation using "S, A, B, C, or F"
• No criteria given for English skills or motivation
• No guidance given about the meaning of levels
  o What is a B in Motivation?
  o What is a C in English skills?
In brief: very loosely run; purely intuition-based.
STAGE 2: RECONNAISSANCE
• Requested and obtained permission for the action research project. YES!
• Field notes
  o Conducted observation of a speaking assessment
• Transcripts
  o Conducted interviews with the director and other volunteer judges
• Review of speaking assessment development literature
  o Brown (2012) on rubrics and language assessment
  o Fulcher and Davidson (2007) on instrument design and validity
  o Luoma (2004) on assessing speaking
  o Taylor (2011) on how the IELTS test was validated
RECONNAISSANCE OUTCOMES
• Bank of questions (2 sets: easier and harder)
• Procedure: start with an easier question first, then move on to a harder question
• Scoring rubric (matrix of criteria)
  o 4 categories: Content Relevance, Content Support, Fluency, Accuracy
  o 4 levels of descriptors: 4, 3, 2, 1
• Procedure: check validity using Multifaceted Rasch Measurement (Bond & Fox, 2007; Fulcher & Davidson, 2007; Linacre, 2006)
MULTIFACETED RASCH ANALYSIS (MFRM): SOPHISTICATED STATISTICAL ANALYSES
MFRM AND "FACETS" (LINACRE, 2006)
• Raw data are used to build a model.
• The data are then compared to the model for how well they "fit" (checked with infit and outfit mean squares, or with t-test z-scores).
• Passing the fit test = your instrument is "constructive for measurement."
• Data handling:
  o Robust against MISSING VALUES
  o Raters do not need to rate every candidate; overlapping groups are fine.
  o A finer-grained view of the data is possible via the logit score/scale.
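To make the model-building step concrete, here is a minimal Python sketch of the many-facet rating scale model that underlies FACETS. This is my own illustration, not FACETS code: the function name and all parameter values are hypothetical. Each observed rating is modeled from examinee ability, rater severity, criterion difficulty, and category thresholds, all on one logit scale.

```python
import math

def category_probabilities(ability, severity, difficulty, thresholds):
    """Many-facet rating scale model (illustrative sketch).

    The log-odds of an examinee receiving category k rather than
    k-1 from a given rater on a given criterion is modeled as:
        ability - severity - difficulty - thresholds[k-1]
    All values are in logits; `thresholds` are the Rasch-Andrich
    thresholds between adjacent rubric levels.
    """
    # Cumulative sums of the adjacent-category log-odds give each
    # category's unnormalized log-probability.
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + ability - severity - difficulty - tau)
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values: an above-average examinee, a slightly harsh
# rater, an easy criterion, and thresholds for a 4-level rubric.
probs = category_probabilities(ability=0.5, severity=0.3,
                               difficulty=-0.2, thresholds=[-1.5, 0.0, 1.5])
for level, p in enumerate(probs, start=1):
    print(f"P(level {level}) = {p:.2f}")
```

Fitting means estimating these parameters from the raw ratings; the infit/outfit statistics then summarize how far each observed rating strays from the probabilities the model predicts.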
MFRM AND "FACETS" (LINACRE, 2006)
• Provides measures of:
  o examinee ability
  o rater severity
  o rating category difficulty level
  o rating scale use
• Fairness adjustment:
  o Post-scoring adjustment is possible.
  o Based on all raters' severity data, an adjustment value is calculated.
  o Provides an "adjusted score" alongside the "observed score" (raw data).
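As a rough sketch of the fairness-adjustment idea: a crude raw-score analogue, not FACETS' actual logit-based computation, and all names and numbers below are hypothetical.

```python
# Hypothetical raw ratings: candidate -> {rater: observed score}.
ratings = {"cand01": {"r21": 12, "r24": 10},
           "cand02": {"r21": 15, "r25": 13}}

# Hypothetical rater severities in raw-score units (positive = harsh).
severity = {"r21": -0.8, "r24": 0.5, "r25": 0.7}

def adjusted_scores(ratings, severity):
    """Give credit back for harsh raters and discount lenient ones,
    then average: an 'adjusted score' reported alongside the
    observed score."""
    out = {}
    for cand, by_rater in ratings.items():
        adj = [score + severity[r] for r, score in by_rater.items()]
        out[cand] = sum(adj) / len(adj)
    return out

print(adjusted_scores(ratings, severity))
# cand01: (12 - 0.8 + 10 + 0.5) / 2 = 10.85, vs. an observed mean of 11.0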
THE PILOT STUDY: WHAT AM I LOOKING FOR & HOW?
STAGE 3: PLAN THE INTERVENTION
Purpose of the study: using MFRM to seek evidence for the soundness of the new assessment scale.
Research questions:
1. How are the performances of the examinees, raters, and rubric categories related when they are put on the same logit scale?
2. To what degree are different raters scoring in the same ways?
3. How difficult or easy are the rubric categories relative to each other?
PHASE 1: THE TRIAL INTERVIEW TEST
• 20 volunteer candidates were recruited.
• Candidates were given the procedure, question lists, and the rubric.
• I conducted one-on-one interviews with each candidate.
• Sessions were recorded using an IC recorder.
PHASE 2: RATING
• 5 language professionals recruited as raters: 4 + myself
  o 3 native English speakers; 2 L1 Japanese non-native English speakers
• Rubrics provided along with the audio data
• No formal training session, but informal verbal instruction on rubric use
• 20 candidates submitted self-ratings
  o Each candidate was given their own interview audio data.
  o Self-ratings were submitted by email.
RATING WORKLOAD
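The exact allocation matters less than one property: because raters need not rate every candidate, the overlapping groups must stay linked so that FACETS can calibrate everyone onto one scale. Below is a minimal sketch of that linkage check; the judging plan is entirely hypothetical, not the one used in this study.

```python
# Hypothetical judging plan: rater -> set of candidate IDs rated.
plan = {"r21": {1, 2, 3, 4, 5, 6, 7, 8},
        "r22": {6, 7, 8, 9, 10, 11, 12},
        "r23": {11, 12, 13, 14, 15, 16},
        "r24": {15, 16, 17, 18, 19, 20},
        "r25": {1, 10, 20}}

def is_linked(plan):
    """True if every rater is connected to every other through
    shared candidates; a disconnected design cannot be placed on
    a single common logit scale."""
    raters = list(plan)
    linked = {raters[0]}
    grew = True
    while grew:
        grew = False
        for r in raters:
            if r not in linked and any(plan[r] & plan[s] for s in linked):
                linked.add(r)
                grew = True
    return len(linked) == len(raters)

print(is_linked(plan))  # True: each rater overlaps at least one other
```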
RESULTS: WHAT DID I FIND?
RESULTS WALKTHROUGH
• Table 2: Raw scores
• Table 3: Summary fit statistics
• Figure 1: Vertical ruler, summary of all facets (RQ 1)
• Table 4: Student ability (RQ 1)
• Table 5: Rater severity (RQ 1 & 2)
• Table 6: Rating criteria (RQ 1 & 3)
• Figure 2: Rating scale probability curves (not an RQ, but related to the purpose)
TABLE 2: RAW SCORES
[Table of raw scores by rubric category: Content Relevance, Content Support, Fluency, and Accuracy, with candidates' point totals (e.g., 10, 12, 15, and 16 points).]
TABLE 3: SUMMARY FIT STATISTICS
• Examinees: separate and unique examinee abilities.
• Categories: Content Relevance, Content Support, Fluency, and Accuracy are significantly separate constructs.
• Raters: separation is too high; the raters are a bit too separate and unique. A lower value here means a more "normed" rating ability.
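For reference, the separation statistic being read off here can be sketched as the ratio of the "true" spread of the measures to their average error. This is my simplified version of what FACETS reports, and the numbers below are hypothetical.

```python
import math

def separation(measures, std_errors):
    """Rasch separation index G (illustrative sketch): true spread
    of the measures divided by their root-mean-square error.
    High G for examinees = good discrimination among abilities;
    high G for raters = raters differ in severity (here, a LOWER
    value would indicate better-normed raters)."""
    n = len(measures)
    mean = sum(measures) / n
    observed_var = sum((m - mean) ** 2 for m in measures) / (n - 1)
    error_var = sum(se ** 2 for se in std_errors) / n
    true_var = max(observed_var - error_var, 0.0)
    return math.sqrt(true_var) / math.sqrt(error_var)

# Hypothetical rater severity measures (logits) and standard errors.
print(separation([-0.6, -0.2, 0.1, 0.3, 0.4], [0.15] * 5))
```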
FIGURE 1: VERTICAL RULER
• #17 as a student is an average-ability speaker.
• #17 as a rater is lenient on himself.
• #12 as a student is a high-ability speaker.
• #12 as a rater is too strict on herself.
• Student raters (#1–20) span a range of 17 SD units: too varied and erratic for "good" measurement.
• Professional raters (#21–25) grouped near each other, not far (within ±2 logits) from the mean: much "fairer" as raters than the students are.
• The "fairest one of all" is a non-native English speaker… and so is the second fairest! (#25 and #24)
• The native English speakers were… a bit more lenient (#21–23).
TABLE 4: STUDENT ABILITY
Reasonable range for fit mean squares = .4 to 1.2, OR z-scores within −2 to +2 SD.
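A minimal sketch of this screening rule as applied in Tables 4–6; the thresholds come from the slide, but the fit values below are hypothetical.

```python
def fits_ok(mean_square, z_score,
            msq_range=(0.4, 1.2), z_range=(-2.0, 2.0)):
    """Flag an element (student, rater, or criterion) as fitting the
    Rasch model if EITHER its infit/outfit mean square falls in the
    reasonable range OR its standardized z-score is within +/-2 SD,
    per the screening rule on the slide."""
    msq_ok = msq_range[0] <= mean_square <= msq_range[1]
    z_ok = z_range[0] <= z_score <= z_range[1]
    return msq_ok or z_ok

# Hypothetical (mean square, z-score) fit statistics for three raters.
for rater, (msq, z) in {"r21": (0.9, 0.3),
                        "r24": (1.4, 1.8),
                        "r17": (2.1, 3.5)}.items():
    print(rater, "OK" if fits_ok(msq, z) else "MISFIT")
```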
TABLE 5: RATER SEVERITY
Reasonable range for fit mean squares = .4 to 1.2, OR z-scores within −2 to +2 SD.
TABLE 6: RATING CRITERIA
Reasonable range for fit mean squares = .4 to 1.2, OR z-scores within −2 to +2 SD.
FIGURE 2: PROBABILITY CURVES
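Figure 2 plots, for each rubric level, the probability of receiving that level across the ability scale; if the scale functions well, each level dominates over its own stretch. The sketch below shows how such curves are generated, reusing the rating scale model from earlier with hypothetical thresholds.

```python
import math

# Hypothetical Rasch-Andrich thresholds for rubric levels 1-4.
THRESHOLDS = [-1.5, 0.0, 1.5]

def level_probs(measure):
    """P(level 1..4) at a given measure (logits), with rater
    severity and criterion difficulty folded into the measure."""
    logits, cum = [0.0], 0.0
    for tau in THRESHOLDS:
        cum += measure - tau
        logits.append(cum)
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Print each level's probability along the ability scale; plotting
# these columns against `m` reproduces curves like Figure 2.
for m in range(-4, 5):
    row = "  ".join(f"{p:.2f}" for p in level_probs(float(m)))
    print(f"measure {m:+d}:  {row}")
```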
SUMMARY/CONCLUSION
There is evidence that this scoring system provided statistically consistent measures of student ability, rater severity, and rubric functioning.
• Overall goodness of fit looks OK: all facets had z-scores within the 2 SD range.
• Student ability: good spread from high to low.
• Rater severity:
  o Self-ratings were erratic and unexpected. (Remove them and re-calculate?)
  o Professional raters were stable and fair.
• Categories: evidence for unique constructs.
• Levels:
  o All levels (1 to 4) were used enough times.
  o No merging/collapsing of unused levels is needed.
WHAT'S NEXT?
Before using the assessment under LIVE test-taking conditions, it would be best to…
• design rater training program materials (with recordings and ratings from the trial)
• assess the quality and difficulty levels of the interview questions
• conduct post-scoring interviews (with raters and candidates)
For the rest… see me in Nagoya at JALT this November for stages 4, 5, & 6.
Thanks for coming!
WORKS CITED
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). New York: Routledge.
Brown, J. D. (Ed.). (2012). Developing, using, and analyzing rubrics in language assessment with case studies in Asian and Pacific languages. Honolulu, HI: NFLRC.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book. New York: Routledge.
Ivankova, N. (2015). Mixed methods applications in action research: From methods to community action. Los Angeles: Sage Publications.
Linacre, J. M. (2006). Facets Rasch measurement computer program (Version 3.61.0). Chicago: Winsteps.com.
Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press.
Taylor, L. (Ed.). (2011). Examining speaking: Research and practice in assessing second language speaking. Cambridge: Cambridge University Press.