1. Could the BILC Benchmark Advisory Tests (BATs) be delivered as Computer Adaptive Tests (CATs)? 2. Does computer scoring of Speaking proficiency work? BILC Professional Seminar, Monterey, CA. Ray Clifford, 13 June 2011


Page 1:

1. Could the BILC Benchmark Advisory Tests (BATs) be delivered as Computer Adaptive Tests (CATs)?
2. Does computer scoring of Speaking proficiency work?

BILC Professional Seminar, Monterey, CA

Ray Clifford, 13 June 2011


Page 5:

These topics have two things in common.

• Both topics are related to proficiency testing.
• Both projects attempt to “push the envelope”.
  – In technical settings, “pushing the envelope” means pushing the limits of an aircraft or technology system.
  – An “envelope” is also the name of the container used to mail or protect documents.
  – Some envelopes are stationery, and others are stationary.

Page 6:

#1
Could a BILC Benchmark Advisory Test (BAT) be delivered as a Computer Adaptive Test (CAT)?

Page 7:

Benchmark Advisory Tests

• The BATs follow the Criterion-Referenced scoring model used by the human-adaptive Oral Proficiency Interview (OPI):
• To earn any specific proficiency rating, the test taker has to satisfy all of the level-specific Task, Conditions/Contexts, and Accuracy (TCA) criteria associated with that level.
  – Note 1: When researchers tried assigning ratings based on a total of component scores, they found that total scores didn’t accurately predict human, Criterion-Referenced, OPI ratings.
  – Note 2: The same non-alignment occurred when they used multiple-regression analyses.

Page 8:

Why use Criterion-Referenced scoring rather than total scores?

• Proficiency ratings are “criterion” ratings, and they require non-compensatory rating judgments at each level.

• Total and average scores, even when weighted, are compensatory scores.

• “Floor and ceiling” level-specific score comparisons are needed to assign a final rating.

• Raters can’t apply “floor and ceiling” rating criteria using a single or composite score.

Page 9:

Why do Speaking tests work?

1. Defined a primary construct for each proficiency level, and a secondary construct: that the primary constructs form a hierarchy.
2. Converted these proficiency constructs into test specifications.
3. Created a test delivery system, the OPI, based on those test specifications.
4. Applied Criterion-Referenced (C-R), “floor and ceiling”, scoring procedures.

Page 10:

And OPI Speaking Tests work well.

• The primary, level-specific constructs are supported by inter-rater agreement statistics:
  – Pearson’s r = 0.978
  – Cohen’s weighted Kappa = 0.920
  (See Foreign Language Annals, Vol. 36, No. 4, 2003, p. 512)
• The secondary, hierarchical construct is supported by the fact that the “floor and ceiling” rating system does not result in “inversions” in assigned ratings.
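Agreement statistics like these can be computed directly from paired ratings. The sketch below is a minimal pure-Python Cohen’s weighted Kappa with quadratic weights, assuming proficiency ratings have first been mapped to ordinal integers (e.g., 0+, 1, 1+, 2 mapped to 0–3); the two rating lists are invented for illustration, not the published OPI rater data.

```python
def weighted_kappa(rater1, rater2, n):
    """Cohen's weighted Kappa with quadratic weights, for ordinal ratings 0..n-1."""
    observed = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater1, rater2):
        observed[a][b] += 1
    total = float(len(rater1))
    row = [sum(r) for r in observed]                              # rater 1 marginals
    col = [sum(observed[i][j] for i in range(n)) for j in range(n)]  # rater 2 marginals
    disagree_obs = disagree_exp = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic weight: 0 on the diagonal, largest for extreme disagreements.
            w = (i - j) ** 2 / float((n - 1) ** 2)
            disagree_obs += w * observed[i][j] / total
            disagree_exp += w * (row[i] / total) * (col[j] / total)
    return 1.0 - disagree_obs / disagree_exp

# Hypothetical ratings from two raters over ten examinees (integers 0..3):
r1 = [0, 1, 1, 2, 2, 3, 3, 1, 2, 0]
r2 = [0, 1, 1, 2, 2, 3, 2, 1, 2, 0]
print(round(weighted_kappa(r1, r2, 4), 3))
```

Quadratic weighting penalizes a two-level disagreement more than an adjacent-level one, which suits ordinal proficiency scales.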

Page 11:

We used the same steps to create Reading and Listening BATs.

1. Defined level-specific primary constructs and a secondary hierarchical construct.
2. Converted the constructs into test specifications.
3. Created a test delivery system based on those test specifications.
4. Applied Criterion-Referenced, “floor and ceiling”, scoring procedures.

Page 12:

Definition of Proficient Reading

• Proficient reading: The active, automatic process of using one’s internalized language and culture expectancy system to comprehend an authentic text for the purpose for which it was written.

Page 13:

4 Levels; Not 64 Possible Profiles
Reading and Listening Test Design Overview

Level 3 – Author Purpose: Support opinions, hypothesize. Text Characteristics: Lengthy and complex contexts. Reader Task: Comprehend and evaluate the author's opinions and feelings. Test Method: Essay or oral report.

Level 2 – Author Purpose: Instruct. Text Characteristics: Multiple paragraphs including narration and detailed descriptions. Reader Task: Understand the main facts and supporting details. Test Method: Short answer responses.

Level 1 – Author Purpose: Orient, inform, provide simple facts. Text Characteristics: Sentence-level factual discourse. Reader Task: Grasp the main ideas. Test Method: List the main facts.

Level 0 – Author Purpose: Enumerate, list. Text Characteristics: Lists of words or phrases. Reader Task: Recognize, recall meanings. Test Method: Multiple choice or other "recall" item types.

Page 14:

Benefits of Aligning Reading (and Listening) Test Factors

• Complexity is greatly reduced.
• Each level becomes a separate “Task, Condition, and Accuracy” ability criterion based on typical language patterns found in the targeted society.
• When TCA criteria are aligned, raters can:
  – Check for sustained ability at each level.
  – Assign general proficiency ratings using a floor and ceiling approach.
  – Assign progress ratings toward the next higher level.

Page 15:

Warning! Multiple Choice tests may not be aligned with the trait to be tested.

Reading and Listening Test Design Overview

Level 3 – Author Purpose: Support opinions, hypothesize. Text Characteristics: Lengthy and complex contexts. Reader Task: Comprehend and evaluate the author's opinions and feelings. Test Method: Essay or oral report ≠ multiple-choice recognition.

Level 2 – Author Purpose: Instruct. Text Characteristics: Multiple paragraphs including narration and detailed descriptions. Reader Task: Understand the main facts and supporting details. Test Method: Short answer responses ≠ multiple-choice recognition.

Level 1 – Author Purpose: Orient, inform, provide simple facts. Text Characteristics: Sentence-level factual discourse. Reader Task: Grasp the main ideas. Test Method: List the main facts ≠ multiple-choice recognition.

Level 0 – Author Purpose: Enumerate, list. Text Characteristics: Lists of words or phrases. Reader Task: Recognize, recall meanings. Test Method: Multiple choice or other "recall" item types.

Page 16:

[Chart: Predicted Development Stages, Level X and Level X+1]

Page 17:

[Chart: Counter-Model Inversions, Level X and Level X+1]

Page 18:

Initial BAT Test Results

• 187 NATO personnel from 12 nations took the English listening test.
  – Sustained Level 3: 50
  – Sustained 2, most of 3 (2+): 42
  – Sustained Level 2: 28
  – Sustained 1, most of 2 (1+): 34
  – Sustained Level 1: 14
  – Most of Level 1 (0+): 3
  – No pattern or random ability: 16

Page 19:

Initial BAT Test Results (Continued)

• The number of counter-model inversions: 0
• The “floor and ceiling” criterion-referenced ratings gave more accurate results than assigning ratings based on the total score.
• In fact, the criterion-referenced rating process ranked 70 (37%) of the test takers differently than they would have been ranked by their total score results.

Page 20:

Example A: Total score = 37 (62%)
C-R assigned Proficiency Level = 1+
(Level 1 with Developing abilities at Level 2)

• Level 1: “Almost all” – 17 points, 85%
• Level 2: Most – 11 points, 55%
• Level 3: Some – 9 points, 45%

Page 21:

Example B: Total score = 35 (58%)
C-R assigned Proficiency Level = 2
(Level 2 with Random abilities at Level 3)

• Level 1: “Almost all” – 17 points, 85%
• Level 2: “Almost all” – 14 points, 70%
• Level 3: None – 4 points, 20%
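The non-compensatory logic behind Examples A and B can be sketched in a few lines. The cut scores below (70% to count a level as “sustained”, 50% at the next level for a “+” rating) are assumptions chosen so the sketch reproduces the two examples; they are not the official BAT criteria.

```python
# Assumed cut scores -- illustrative only, not the official BAT criteria.
SUSTAINED = 70.0   # percent needed to count a level as sustained (the "floor")
DEVELOPING = 50.0  # percent at the next level needed for a "+" rating

def floor_ceiling_rating(percent_by_level):
    """Non-compensatory C-R rating: every level up to the floor must be sustained."""
    base = 0
    for level in (1, 2, 3):
        if percent_by_level.get(level, 0.0) >= SUSTAINED:
            base = level
        else:
            break  # one weak level caps the rating, no matter how high the total is
    if percent_by_level.get(base + 1, 0.0) >= DEVELOPING:
        return f"{base}+"
    return str(base)

# Example A: higher total (62%), but Level 2 is not sustained.
print(floor_ceiling_rating({1: 85, 2: 55, 3: 45}))  # 1+
# Example B: lower total (58%), yet Levels 1 and 2 are both sustained.
print(floor_ceiling_rating({1: 85, 2: 70, 3: 20}))  # 2
```

Note how a compensatory total score would rank Example A above Example B, while the criterion-referenced rating ranks B higher, which is the kind of re-ranking reported for 70 of the 187 test takers.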

Page 22:

Thanks to the BILC Secretariat and ATC

• “Permissive” BAT research has continued using English language learners interested in applying for admittance to a U.S. university.
• A diversity of first languages was represented among the test takers.
• The number who have taken the BAT Reading test now exceeds 600.
• With 600+ test takers, we have done the IRT analyses needed for adaptive testing.

Page 23:

Preparing a Computer Adaptive Test (or in this case, a CAT BAT)

1. WinSteps IRT analyses confirmed that the BAT test items were “clustering” by level.
2. Clustered items were then assembled into testlets of 5 items each.
3. The logit values for each level were separated by more than 1.0 logits.
4. For any given level, the testlets were of comparable difficulty – within 0.02 logits.
5. The logit standard error of measurement for each group of testlets was 0.06 or less.
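With testlets of known, well-separated difficulty, adaptive delivery reduces to a short walk over the levels. The sketch below shows one plausible routing rule (pass a testlet, try the next level up; fail, step down; stop once floor and ceiling are adjacent), not the actual BAT delivery algorithm; `administer` is a hypothetical stand-in for presenting a 5-item testlet and judging whether performance was sustained.

```python
def run_cat(administer, levels=(0, 1, 2, 3), start=1):
    """Adaptive testlet routing: returns the floor (highest sustained level),
    stopping as soon as the floor and the first non-sustained level meet."""
    passed, failed = set(), set()
    level = start
    while True:
        if administer(level):
            passed.add(level)
            if level == levels[-1] or (level + 1) in failed:
                break  # ceiling already located just above
            level += 1
        else:
            failed.add(level)
            if level == levels[0] or (level - 1) in passed:
                break  # floor already located just below
            level -= 1
    return max(passed) if passed else None

# A simulated examinee who sustains Levels 0-2 but not Level 3:
count = 0
def administer(level):
    global count
    count += 1
    return level <= 2

print(run_cat(administer))  # floor = 2
print(count)                # 3 testlets administered instead of a fixed 4
```

Skipping testlets the examinee would certainly pass or fail is what drives the roughly 50% reduction in testing time reported later in the deck.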

Page 24:

Testlet WinSteps Results, n = 680

Page 25: 1. Could the BILC Benchmark Advisory Tests (BATs) be delivered as Computer Adaptive Tests (CATs)? 2. Does computer scoring of Speaking proficiency work?

a

Level 2

Level 1

Logit

Valu

e

-3.0

-2.0

-1.0

0.0

1.0

2.0

3.0

Testlet A Testlet B Category 3 Category 4

a 1.8 1.8 1.8 30.0

Level 2 0.3 0.1 0.3 0.2

Level 1 -1.5 -1.6 -1.6 -1.7

Note: There is no Testlet Dfor Level 3

1.8 1.8 1.8

0.30.1

0.3 0.2

-1.5 -1.6 -1.6-1.7


Page 27:

#1
Could the BILC Benchmark Advisory Tests (BATs) be delivered as Computer Adaptive Tests (CATs)?

Yes! And simulations using actual student data show that testing time would be reduced by an average of 50%.

Page 28:

#2
Does computer scoring of Speaking proficiency work?

Page 29:

Types of Speaking Tests

• Direct Tests
  – Oral Proficiency Interview (OPI): human administered, human scored.
• Semi-direct Tests
  – OPIc: computer administered, human scored.
  – “OPIc2”: computer administered and scored.
  – “Elicited Speech”: computer administered and scored.
• Indirect Tests
  – Elicited Imitation: computer administered and scored.

Page 30:

“OPIc2” Experiment

• Found relationships between proficiency levels and composite scores based on “verbosity” and 1-gram lexical matching.
  – Able to identify Level 1 speakers compared to Level 2 and Level 3 speakers.
  – But the scoring process took hours.
  – The voice-to-text conversion process was imprecise.
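Features of the kind the experiment describes can be approximated with two crude measures: verbosity as a token count, and 1-gram lexical match as the share of response tokens that appear in reference speech samples. The function and sample texts below are invented for illustration, not the experiment’s actual feature set or weighting.

```python
def opic2_features(response, reference_samples):
    """Two crude speaking-score features: verbosity (token count) and
    1-gram lexical match against a pooled reference vocabulary."""
    tokens = response.lower().split()
    reference_vocab = {w for sample in reference_samples
                       for w in sample.lower().split()}
    verbosity = len(tokens)
    match = (sum(t in reference_vocab for t in tokens) / len(tokens)
             if tokens else 0.0)
    return verbosity, match

refs = ["the meeting was held on thursday",
        "a compromise was reached last week"]
print(opic2_features("the meeting last thursday went well", refs))
```

Because both features are surface-level, it is plausible that such a composite separates Level 1 speakers from Level 2 and 3 speakers but cannot distinguish the higher levels from each other, as the slide reports.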

Page 31:

3 Voice-to-Text Output Examples
(From carefully enunciated voicemail messages)

• < The meeting was held on Thursday at 3:15 PM. >
• < Discussions that took place last Thursday late into a compromise and they shut down was avoided. >
• < Hey the concept been more collegial more could've been accomplished ending in pass would've been avoided. >

Page 32:

1st Voice-to-Text Output Example
(Original statement and output)

• The original message “The meeting was held on Thursday at 3:15 pm.” was transcribed as: < The meeting was held on Thursday at 3:15 PM. >

Page 33:

2nd Voice-to-Text Output Example
(Original statement and output)

• The original message “The discussions that took place last Thursday led to a compromise, and a shutdown was avoided.” was transcribed as: < Discussions that took place last Thursday late into a compromise and they shut down was avoided. >

Page 34:

3rd Voice-to-Text Output Example
(Original statement and output)

• The original message “Had the confab been more collegial, more could have been accomplished, and an impasse would have been avoided.” was transcribed as: < Hey the concept been more collegial more could've been accomplished ending in pass would've been avoided. >

Page 35:

Voice-to-Text Output Examples
(Representative of proficiency levels?)

• Attempt at Level 1: < The meeting was held on Thursday at 3:15 PM. >
• Attempt at Level 2: < Discussions that took place last Thursday late into a compromise and they shut down was avoided. >
• Attempt at Level 3: < Hey the concept been more collegial more could've been accomplished ending in pass would've been avoided. >

Page 36:

Next we tried EI and found:

• The optimum number of syllables in a prompt was dependent on the speakers’ proficiency.
• Low frequency words were more difficult.
• Contrasting L1 and L2 language features were more difficult.
• Providing user control of prompt timing had no significant impact on EI scores.
• Low ability learners showed a positive practice effect with repeated exposure to the identical prompts.

Page 37:

Elicited Speech (ES) Tests

• EI findings led to the creation of new ES tests that force “chunking” at the meaning level rather than at the phoneme or word level.
• The new ES tests include prompts with…
  – Complex sentences that exceed the syllable counts previously recommended for EI tests.
  – Level-specific language features drawn from the ILR “grammar grids”.
• Thus, the ES prompts should be aligned with the targeted proficiency levels.

Page 38:

ES Test Goal: Measure the Speaker’s Language Expectancy System (LES)

• It is hypothesized that our language comprehension and our language production depend on an internalized Language Expectancy System (LES).
• The more developed one’s target-language LES, the more accurately s/he understands and produces the target language.
• ES tests are designed to access the LES twice: for comprehension and for production.

Page 39:

Is an ES test a Listening or Speaking Test?

• To some extent it doesn’t matter, because the same LES is involved in both activities.
• Being able to say things one can’t understand is not a valuable skill.
• If one can’t regenerate a sentence, then s/he would not have been able to say it without the benefit of the model prompt.


Page 45:

The EI versus the ES Response Process

EI (Elicited Imitation):
1. Hear the EI prompt.
2. Form a representation of the sound chunks in SM.
3. Store that representation of sounds in STM.
4. Recall the sound representation from STM.
5. Reproduce the prompt.

ES (Elicited Speech):
1. Hear the ES prompt.
2. Form a representation of the meaning chunks in SM.
3. Store that meaning representation in STM.
4. Recall the meaning representation from STM.
5. Use the Language Expectancy System stored in one’s LTM plus the meaning retrieved from STM to regenerate the prompt.

SM: Sensory Memory; STM: Short-Term Memory; LTM: Long-Term Memory

Page 46:

Innovations

• The ES prompts should be aligned with ILR syntax, vocabulary, and text type expectations.
• The Automated Speech Recognition engine uses a forced-alignment scoring strategy that uses the ES prompts as a model.
• This approach improves accuracy and avoids the multimillion-dollar development of a full natural language corpus to use as a model for the ASR processor.
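Forced alignment proper operates on the acoustic signal, aligning audio frames against the phone sequence of the known prompt. As a rough text-level analogue of scoring against a fixed model, the sketch below aligns an ASR transcript to the prompt with word-level edit distance and reports the fraction of prompt words recovered; it is an illustration of the idea, not the engine’s actual algorithm.

```python
def prompt_word_accuracy(prompt, transcript):
    """Word accuracy of a transcript against a known prompt, via edit distance."""
    ref, hyp = prompt.lower().split(), transcript.lower().split()
    m, n = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # match / substitution
    return max(0.0, 1.0 - dist[m][n] / m)

prompt = "the discussions that took place last thursday led to a compromise"
good = prompt_word_accuracy(prompt, prompt)
garbled = prompt_word_accuracy(
    prompt, "discussions that took place last thursday late into a compromise")
print(good, garbled)  # 1.0 for a perfect regeneration, lower for the garbled one
```

Because the prompt is known in advance, the scorer only has to decide how well the response matches one fixed model, which is far cheaper than recognizing unconstrained speech against a full natural language corpus.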

Page 47:

Progress

• Initial research results using a Spanish ES test have been very promising.
  – About 100 persons have been double tested with the revised version and the OPI.
  – The correlation between human scoring of ES tests and official OPI results was r = 0.91.
  – Automated Speech Recognition (ASR) scoring predicted exact OPI ratings about 2 out of 3 times.

Page 48:

Next Steps for ES Testing

• Create a version of the test that can be:
  – Computer/Internet delivered.
  – Computer scored in near real time.
  – Equated to proficiency ratings.
• Add a fluency assessment module to the existing ASR accuracy scoring measures.
• Try C-R “floor and ceiling” scoring.
• Conduct alpha testing with DOD personnel.

Page 49:

#2
Does computer scoring of Speaking proficiency work?

Somewhat; and it will get better.

Page 50:

Questions?

Page 51:

Remember:

If you can’t measure it,you can’t improve it!

Page 52: