1. Could the BILC Benchmark Advisory Tests (BATs) be delivered as Computer Adaptive Tests (CATs)? 2. Does computer scoring of Speaking proficiency work? BILC Professional Seminar, Monterey, CA. Ray Clifford, 13 June 2011


Page 1:

1. Could the BILC Benchmark Advisory Tests (BATs) be delivered as Computer Adaptive Tests (CATs)?
2. Does computer scoring of Speaking proficiency work?

BILC Professional Seminar, Monterey, CA

Ray Clifford, 13 June 2011


Page 5:

These topics have two things in common.

• Both topics are related to proficiency testing.
• Both projects attempt to “push the envelope”.
  – In technical settings, “pushing the envelope” means pushing the limits of an aircraft or technology system.
  – An “envelope” is also the name of the container used to mail or protect documents.
  – Some envelopes are stationery, and others are stationary.

Page 6:

#1
Could a BILC Benchmark Advisory Test (BAT) be delivered as a Computer Adaptive Test (CAT)?

Page 7:

Benchmark Advisory Tests

• The BATs follow the Criterion-Referenced scoring model used by the human-adaptive Oral Proficiency Interview (OPI):
• To earn any specific proficiency rating, the test taker has to satisfy all of the level-specific Task, Conditions/Contexts, and Accuracy (TCA) criteria associated with that level.
  – Note 1: When researchers tried assigning ratings based on a total of component scores, they found that total scores didn’t accurately predict human, Criterion-Referenced, OPI ratings.
  – Note 2: The same non-alignment occurred when they used multiple-regression analyses.

Page 8:

Why use Criterion-Referenced scoring rather than total scores?

• Proficiency ratings are “criterion” ratings, and they require non-compensatory rating judgments at each level.

• Total and average scores, even when weighted, are compensatory scores.

• “Floor and ceiling” level-specific score comparisons are needed to assign a final rating.

• Raters can’t apply “floor and ceiling” rating criteria using a single or composite score.

Page 9:

Why do Speaking tests work?

1. Defined a primary construct for each proficiency level, and a secondary construct: that the primary constructs form a hierarchy.
2. Converted these proficiency constructs into test specifications.
3. Created a test delivery system, the OPI, based on those test specifications.
4. Applied Criterion-Referenced (C-R), “floor and ceiling”, scoring procedures.

Page 10:

And OPI Speaking Tests work well.

• The primary, level-specific constructs are supported by inter-rater agreement statistics:
  – Pearson’s r = 0.978
  – Cohen’s weighted Kappa = 0.920
  (See Foreign Language Annals, Vol. 36, No. 4, 2003, p. 512)
• The secondary, hierarchical construct is supported by the fact that the “floor and ceiling” rating system does not result in “inversions” in assigned ratings.
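Agreement statistics like these can be computed directly from paired ratings. The sketch below is a minimal pure-Python Cohen’s weighted Kappa with quadratic weights, assuming proficiency ratings have first been mapped to ordinal integers (e.g., 0+, 1, 1+, 2 mapped to 0–3); the two rating lists are invented for illustration, not the published OPI rater data.

```python
def weighted_kappa(rater1, rater2, n):
    """Cohen's weighted Kappa with quadratic weights, for ordinal ratings 0..n-1."""
    observed = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater1, rater2):
        observed[a][b] += 1
    total = float(len(rater1))
    row = [sum(r) for r in observed]                              # rater 1 marginals
    col = [sum(observed[i][j] for i in range(n)) for j in range(n)]  # rater 2 marginals
    disagree_obs = disagree_exp = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic weight: 0 on the diagonal, largest for extreme disagreements.
            w = (i - j) ** 2 / float((n - 1) ** 2)
            disagree_obs += w * observed[i][j] / total
            disagree_exp += w * (row[i] / total) * (col[j] / total)
    return 1.0 - disagree_obs / disagree_exp

# Hypothetical ratings from two raters over ten examinees (integers 0..3):
r1 = [0, 1, 1, 2, 2, 3, 3, 1, 2, 0]
r2 = [0, 1, 1, 2, 2, 3, 2, 1, 2, 0]
print(round(weighted_kappa(r1, r2, 4), 3))
```

Quadratic weighting penalizes a two-level disagreement more than an adjacent-level one, which suits ordinal proficiency scales.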

Page 11:

We used the same steps to create Reading and Listening BATs.

1. Defined level-specific primary constructs and a secondary hierarchical construct.
2. Converted the constructs into test specifications.
3. Created a test delivery system based on those test specifications.
4. Applied Criterion-Referenced, “floor and ceiling”, scoring procedures.

Page 12:

Definition of Proficient Reading

• Proficient reading: The active, automatic process of using one’s internalized language and culture expectancy system to comprehend an authentic text for the purpose for which it was written.

Page 13:

4 Levels; Not 64 Possible Profiles
Reading and Listening Test Design Overview

Level 3 – Author Purpose: Support opinions, hypothesize. Text Characteristics: Lengthy and complex contexts. Reader Task: Comprehend and evaluate the author's opinions and feelings. Test Method: Essay or oral report.

Level 2 – Author Purpose: Instruct. Text Characteristics: Multiple paragraphs including narration and detailed descriptions. Reader Task: Understand the main facts and supporting details. Test Method: Short answer responses.

Level 1 – Author Purpose: Orient, inform, provide simple facts. Text Characteristics: Sentence-level factual discourse. Reader Task: Grasp the main ideas. Test Method: List the main facts.

Level 0 – Author Purpose: Enumerate, list. Text Characteristics: Lists of words or phrases. Reader Task: Recognize, recall meanings. Test Method: Multiple choice or other "recall" item types.

Page 14:

Benefits of Aligning Reading (and Listening) Test Factors

• Complexity is greatly reduced.
• Each level becomes a separate “Task, Condition, and Accuracy” ability criterion based on typical language patterns found in the targeted society.
• When TCA criteria are aligned, raters can:
  – Check for sustained ability at each level.
  – Assign general proficiency ratings using a floor and ceiling approach.
  – Assign progress ratings toward the next higher level.

Page 15:

Warning! Multiple Choice tests may not be aligned with the trait to be tested.

Reading and Listening Test Design Overview

Level 3 – Author Purpose: Support opinions, hypothesize. Text Characteristics: Lengthy and complex contexts. Reader Task: Comprehend and evaluate the author's opinions and feelings. Test Method: Essay or oral report ≠ multiple-choice recognition.

Level 2 – Author Purpose: Instruct. Text Characteristics: Multiple paragraphs including narration and detailed descriptions. Reader Task: Understand the main facts and supporting details. Test Method: Short answer responses ≠ multiple-choice recognition.

Level 1 – Author Purpose: Orient, inform, provide simple facts. Text Characteristics: Sentence-level factual discourse. Reader Task: Grasp the main ideas. Test Method: List the main facts ≠ multiple-choice recognition.

Level 0 – Author Purpose: Enumerate, list. Text Characteristics: Lists of words or phrases. Reader Task: Recognize, recall meanings. Test Method: Multiple choice or other "recall" item types.

Page 16:

[Chart: Predicted Development Stages, Level X and Level X+1]

Page 17:

[Chart: Counter-Model Inversions, Level X and Level X+1]

Page 18:

Initial BAT Test Results

• 187 NATO personnel from 12 nations took the English listening test.
  – Sustained Level 3: 50
  – Sustained 2, most of 3 (2+): 42
  – Sustained Level 2: 28
  – Sustained 1, most of 2 (1+): 34
  – Sustained Level 1: 14
  – Most of Level 1 (0+): 3
  – No pattern or random ability: 16

Page 19:

Initial BAT Test Results (Continued)

• The number of counter-model inversions: 0
• The “floor and ceiling” criterion-referenced ratings gave more accurate results than assigning ratings based on the total score.
• In fact, the criterion-referenced rating process ranked 70 (37%) of the test takers differently than they would have been ranked by their total score results.

Page 20:

Example A: Total score = 37 (62%)
C-R assigned Proficiency Level = 1+
(Level 1 with Developing abilities at Level 2)

• Level 1: “Almost all” – 17 points, 85%
• Level 2: Most – 11 points, 55%
• Level 3: Some – 9 points, 45%

Page 21:

Example B: Total score = 35 (58%)
C-R assigned Proficiency Level = 2
(Level 2 with Random abilities at Level 3)

• Level 1: “Almost all” – 17 points, 85%
• Level 2: “Almost all” – 14 points, 70%
• Level 3: None – 4 points, 20%
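The non-compensatory logic behind Examples A and B can be sketched in a few lines. The cut scores below (70% to count a level as “sustained”, 50% at the next level for a “+” rating) are assumptions chosen so the sketch reproduces the two examples; they are not the official BAT criteria.

```python
# Assumed cut scores -- illustrative only, not the official BAT criteria.
SUSTAINED = 70.0   # percent needed to count a level as sustained (the "floor")
DEVELOPING = 50.0  # percent at the next level needed for a "+" rating

def floor_ceiling_rating(percent_by_level):
    """Non-compensatory C-R rating: every level up to the floor must be sustained."""
    base = 0
    for level in (1, 2, 3):
        if percent_by_level.get(level, 0.0) >= SUSTAINED:
            base = level
        else:
            break  # one weak level caps the rating, no matter how high the total is
    if percent_by_level.get(base + 1, 0.0) >= DEVELOPING:
        return f"{base}+"
    return str(base)

# Example A: higher total (62%), but Level 2 is not sustained.
print(floor_ceiling_rating({1: 85, 2: 55, 3: 45}))  # 1+
# Example B: lower total (58%), yet Levels 1 and 2 are both sustained.
print(floor_ceiling_rating({1: 85, 2: 70, 3: 20}))  # 2
```

Note how a compensatory total score would rank Example A above Example B, while the criterion-referenced rating ranks B higher, which is the kind of re-ranking reported for 70 of the 187 test takers.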

Page 22:

Thanks to the BILC Secretariat and ATC

• “Permissive” BAT research has continued using English language learners interested in applying for admittance to a U.S. university.
• A diversity of first languages was represented among the test takers.
• The number who have taken the BAT Reading test now exceeds 600.
• With 600+ test takers, we have done the IRT analyses needed for adaptive testing.

Page 23:

Preparing a Computer Adaptive Test (or in this case, a CAT BAT)

1. WinSteps IRT analyses confirmed that the BAT test items were “clustering” by level.
2. Clustered items were then assembled into testlets of 5 items each.
3. The logit values for each level were separated by more than 1.0 logits.
4. For any given level, the testlets were of comparable difficulty – within 0.02 logits.
5. The logit standard error of measurement for each group of testlets was 0.06 or less.
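With testlets of known, well-separated difficulty, adaptive delivery reduces to a short walk over the levels. The sketch below shows one plausible routing rule (pass a testlet, try the next level up; fail, step down; stop once floor and ceiling are adjacent), not the actual BAT delivery algorithm; `administer` is a hypothetical stand-in for presenting a 5-item testlet and judging whether performance was sustained.

```python
def run_cat(administer, levels=(0, 1, 2, 3), start=1):
    """Adaptive testlet routing: returns the floor (highest sustained level),
    stopping as soon as the floor and the first non-sustained level meet."""
    passed, failed = set(), set()
    level = start
    while True:
        if administer(level):
            passed.add(level)
            if level == levels[-1] or (level + 1) in failed:
                break  # ceiling already located just above
            level += 1
        else:
            failed.add(level)
            if level == levels[0] or (level - 1) in passed:
                break  # floor already located just below
            level -= 1
    return max(passed) if passed else None

# A simulated examinee who sustains Levels 0-2 but not Level 3:
count = 0
def administer(level):
    global count
    count += 1
    return level <= 2

print(run_cat(administer))  # floor = 2
print(count)                # 3 testlets administered instead of a fixed 4
```

Skipping testlets the examinee would certainly pass or fail is what drives the roughly 50% reduction in testing time reported later in the deck.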

Page 24:

Testlet WinSteps Results, n = 680

Page 25: 1. Could the BILC Benchmark Advisory Tests (BATs) be delivered as Computer Adaptive Tests (CATs)? 2. Does computer scoring of Speaking proficiency work?

a

Level 2

Level 1

Logit

Valu

e

-3.0

-2.0

-1.0

0.0

1.0

2.0

3.0

Testlet A Testlet B Category 3 Category 4

a 1.8 1.8 1.8 30.0

Level 2 0.3 0.1 0.3 0.2

Level 1 -1.5 -1.6 -1.6 -1.7

Note: There is no Testlet Dfor Level 3

1.8 1.8 1.8

0.30.1

0.3 0.2

-1.5 -1.6 -1.6-1.7


Page 27:

#1
Could the BILC Benchmark Advisory Tests (BATs) be delivered as Computer Adaptive Tests (CATs)?

Yes! And simulations using actual student data show that testing time would be reduced by an average of 50%.

Page 28:

#2
Does computer scoring of Speaking proficiency work?

Page 29:

Types of Speaking Tests

• Direct Tests
  – Oral Proficiency Interview (OPI): human administered, human scored.
• Semi-direct Tests
  – OPIc: computer administered, human scored.
  – “OPIc2”: computer administered and scored.
  – “Elicited Speech”: computer administered and scored.
• Indirect Tests
  – Elicited Imitation: computer administered and scored.

Page 30:

“OPIc2” Experiment

• Found relationships between proficiency levels and composite scores based on “verbosity” and 1-gram lexical matching.
  – Able to identify Level 1 speakers compared to Level 2 and Level 3 speakers.
  – But the scoring process took hours.
  – The voice-to-text conversion process was imprecise.
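Features of the kind the experiment describes can be approximated with two crude measures: verbosity as a token count, and 1-gram lexical match as the share of response tokens that appear in reference speech samples. The function and sample texts below are invented for illustration, not the experiment’s actual feature set or weighting.

```python
def opic2_features(response, reference_samples):
    """Two crude speaking-score features: verbosity (token count) and
    1-gram lexical match against a pooled reference vocabulary."""
    tokens = response.lower().split()
    reference_vocab = {w for sample in reference_samples
                       for w in sample.lower().split()}
    verbosity = len(tokens)
    match = (sum(t in reference_vocab for t in tokens) / len(tokens)
             if tokens else 0.0)
    return verbosity, match

refs = ["the meeting was held on thursday",
        "a compromise was reached last week"]
print(opic2_features("the meeting last thursday went well", refs))
```

Because both features are surface-level, it is plausible that such a composite separates Level 1 speakers from Level 2 and 3 speakers but cannot distinguish the higher levels from each other, as the slide reports.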

Page 31:

3 Voice-to-Text Output Examples
(From carefully enunciated voicemail messages)

• < The meeting was held on Thursday at 3:15 PM. >
• < Discussions that took place last Thursday late into a compromise and they shut down was avoided. >
• < Hey the concept been more collegial more could've been accomplished ending in pass would've been avoided. >

Page 32:

1st Voice-to-Text Output Example
(Original statement and output)

• The original message “The meeting was held on Thursday at 3:15 pm.” was transcribed as: < The meeting was held on Thursday at 3:15 PM. >

Page 33:

2nd Voice-to-Text Output Example
(Original statement and output)

• The original message “The discussions that took place last Thursday led to a compromise, and a shutdown was avoided.” was transcribed as: < Discussions that took place last Thursday late into a compromise and they shut down was avoided. >

Page 34:

3rd Voice-to-Text Output Example
(Original statement and output)

• The original message “Had the confab been more collegial, more could have been accomplished, and an impasse would have been avoided.” was transcribed as: < Hey the concept been more collegial more could've been accomplished ending in pass would've been avoided. >

Page 35:

Voice-to-Text Output Examples
(Representative of proficiency levels?)

• Attempt at Level 1: < The meeting was held on Thursday at 3:15 PM. >
• Attempt at Level 2: < Discussions that took place last Thursday late into a compromise and they shut down was avoided. >
• Attempt at Level 3: < Hey the concept been more collegial more could've been accomplished ending in pass would've been avoided. >

Page 36:

Next we tried EI and found:

• The optimum number of syllables in a prompt was dependent on the speakers’ proficiency.
• Low frequency words were more difficult.
• Contrasting L1 and L2 language features were more difficult.
• Providing user control of prompt timing had no significant impact on EI scores.
• Low ability learners showed a positive practice effect with repeated exposure to the identical prompts.

Page 37:

Elicited Speech (ES) Tests

• EI findings led to the creation of new ES tests that force “chunking” at the meaning level rather than at the phoneme or word level.
• The new ES tests include prompts with…
  – Complex sentences that exceed the syllable counts previously recommended for EI tests.
  – Level-specific language features drawn from the ILR “grammar grids”.
• Thus, the ES prompts should be aligned with the targeted proficiency levels.

Page 38:

ES Test Goal: Measure the Speaker’s Language Expectancy System (LES)

• It is hypothesized that our language comprehension and our language production depend on an internalized Language Expectancy System (LES).
• The more developed one’s target-language LES, the more accurately s/he understands and produces the target language.
• ES tests are designed to access the LES twice: for comprehension and for production.

Page 39:

Is an ES test a Listening or Speaking Test?

• To some extent it doesn’t matter, because the same LES is involved in both activities.
• Being able to say things one can’t understand is not a valuable skill.
• If one can’t regenerate a sentence, then s/he would not have been able to say it without the benefit of the model prompt.


Page 45:

The EI versus the ES Response Process

EI (Elicited Imitation):
1. Hear the EI prompt.
2. Form a representation of the sound chunks in SM.
3. Store that representation of sounds in STM.
4. Recall the sound representation from STM.
5. Reproduce the prompt.

ES (Elicited Speech):
1. Hear the ES prompt.
2. Form a representation of the meaning chunks in SM.
3. Store that meaning representation in STM.
4. Recall the meaning representation from STM.
5. Use the Language Expectancy System stored in one’s LTM plus the meaning retrieved from STM to regenerate the prompt.

SM: Sensory Memory; STM: Short-Term Memory; LTM: Long-Term Memory

Page 46:

Innovations

• The ES prompts should be aligned with ILR syntax, vocabulary, and text type expectations.
• The Automated Speech Recognition engine uses a forced-alignment scoring strategy that uses the ES prompts as a model.
• This approach improves accuracy and avoids the multimillion-dollar development of a full natural language corpus to use as a model for the ASR processor.
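Forced alignment proper operates on the acoustic signal, aligning audio frames against the phone sequence of the known prompt. As a rough text-level analogue of scoring against a fixed model, the sketch below aligns an ASR transcript to the prompt with word-level edit distance and reports the fraction of prompt words recovered; it is an illustration of the idea, not the engine’s actual algorithm.

```python
def prompt_word_accuracy(prompt, transcript):
    """Word accuracy of a transcript against a known prompt, via edit distance."""
    ref, hyp = prompt.lower().split(), transcript.lower().split()
    m, n = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # match / substitution
    return max(0.0, 1.0 - dist[m][n] / m)

prompt = "the discussions that took place last thursday led to a compromise"
good = prompt_word_accuracy(prompt, prompt)
garbled = prompt_word_accuracy(
    prompt, "discussions that took place last thursday late into a compromise")
print(good, garbled)  # 1.0 for a perfect regeneration, lower for the garbled one
```

Because the prompt is known in advance, the scorer only has to decide how well the response matches one fixed model, which is far cheaper than recognizing unconstrained speech against a full natural language corpus.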

Page 47:

Progress

• Initial research results using a Spanish ES test have been very promising.
  – About 100 persons have been double tested with the revised version and the OPI.
  – The correlation between human scoring of ES tests and official OPI results was r = 0.91.
  – Automated Speech Recognition (ASR) scoring predicted exact OPI ratings about 2 out of 3 times.

Page 48:

Next Steps for ES Testing

• Create a version of the test that can be:
  – Computer/Internet delivered.
  – Computer scored in near real time.
  – Equated to proficiency ratings.
• Add a fluency assessment module to the existing ASR accuracy scoring measures.
• Try C-R “floor and ceiling” scoring.
• Conduct alpha testing with DOD personnel.

Page 49:

#2
Does computer scoring of Speaking proficiency work?

Somewhat; and it will get better.

Page 50:

Questions?

Page 51:

Remember:

If you can’t measure it,you can’t improve it!

Page 52: