1 Evaluating the constructs and automated scoring performance for speaking tasks in the Versant Tests and PTE Academic Alistair Van Moere Knowledge Technologies Pearson NCME, New Orleans April 11, 2011 Symposium: Innovations in the automated scoring of spoken responses


Evaluating the constructs and automated scoring performance for speaking tasks in the

Versant Tests and PTE Academic

Alistair Van Moere

Knowledge Technologies

Pearson

NCME, New Orleans April 11, 2011

Symposium: Innovations in the automated scoring of spoken responses


Automated Tests of Spoken Proficiency

• Versant Test - Listening-Speaking test

- Uses: Job recruitment, placement, progress monitoring

- Available in English, Spanish, Arabic, Dutch, (French, Chinese)

• PTE Academic - 4-skills language proficiency test

- Uses: Entrance into English-speaking universities


Assessment argument

(Mislevy 2005)



Versant Tasks and Scoring

63 responses, 3 min 30 s of speech

Tasks: Read Aloud, Answer Question, Repeat Sentence, Sentence Build, Story Retell

Subscores: Sentence Mastery, Pronunciation, Vocabulary, Fluency

Subscore weights in the OVERALL score, as listed on the slide: 20%, 30%, 30%, 20%
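As an illustration of how four subscores could combine into the OVERALL score, here is a minimal Python sketch of a weighted composite. The slide lists weights of 20%, 30%, 30% and 20%, but the extracted text does not show which weight belongs to which subscore, so the pairing in `ASSUMED_WEIGHTS` is an assumption, as are the function name and the example subscores.

```python
# Assumed pairing of weights to subscores (hypothetical; the slide only
# lists the four percentages 20%, 30%, 30%, 20%).
ASSUMED_WEIGHTS = {
    "sentence_mastery": 0.30,
    "vocabulary": 0.20,
    "fluency": 0.30,
    "pronunciation": 0.20,
}


def overall_score(subscores, weights=ASSUMED_WEIGHTS):
    """Return the weighted sum of the subscores."""
    if set(subscores) != set(weights):
        raise ValueError("subscores and weights must name the same traits")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[trait] * subscores[trait] for trait in weights)


# Made-up subscores for one test taker, for illustration only.
example = {
    "sentence_mastery": 60,
    "vocabulary": 50,
    "fluency": 70,
    "pronunciation": 40,
}
print(overall_score(example))
```

Swapping in a different pairing only requires editing the dictionary; the arithmetic stays the same.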


Versant Test Scoring

Trait | Scoring | Machine-Human, r | Human split-half | Machine split-half
Fluency | Temporal features of speech predict expert human judgments | .94 | .99 | .97
Pronunciation | Spectral properties and segmental aspects predict human judgments | .88 | .99 | .97
Vocabulary | i) Rasch-based ability measures from dichotomous-scored vocabulary items; ii) LSA-based measures on constructed responses predict human judgments | .96 | .93 | .92
Sentence Mastery | Rasch-based ability measures from word errors on increasingly complex sentences | .97 | .95 | .92
Overall | | .97 | .99 | .97

Validation sample, n=143, flat score distribution
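For readers unfamiliar with the statistics reported here, the following Python sketch shows one common way to compute a machine-human Pearson correlation and a split-half reliability. The slide does not say how the halves were formed or whether a Spearman-Brown correction was applied; the odd/even split and the correction below are assumptions, and the function names and toy data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


def split_half_reliability(item_scores):
    """Odd/even split-half reliability with Spearman-Brown correction.

    item_scores holds one list of per-item scores per test taker.
    """
    first_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    second_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(first_half, second_half)
    # Spearman-Brown step-up from half-length to full-length test.
    return 2 * r_half / (1 + r_half)


# Toy data: four test takers, four items each (made up).
toy = [[1, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 0]]
print(split_half_reliability(toy))
```

Machine-human agreement would use `pearson_r(machine_scores, human_scores)` on the same set of test takers.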


Scoring Model for Sentence Mastery

Repeat Sentence:

Prompt: I’ll catch up with you soon.

Response: “uh .. I’ll catch up you … I don’t know” = 2 word errors

Prompt: Security wouldn’t let him in because he didn’t have a pass.

Response: “Security wouldn’t help him pass” = 7 word errors

Item complexity + Accuracy of response (word errors) → Partial credit Rasch model → Estimate of SENTENCE MASTERY ability
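The word-error counts on this slide can be reproduced with a word-level edit distance (substitutions, insertions and deletions each count as one error). This is a minimal sketch, not the operational scoring model: it assumes that fillers such as "uh" and the phrase "I don't know" are discarded before alignment, which is what makes the first example come out at 2. The filler list, the ignored phrase and the function names are all hypothetical.

```python
import re

FILLERS = {"uh", "um"}               # assumed filler words
IGNORED_PHRASES = ["i don't know"]   # assumed non-attempt phrase


def normalize(text):
    """Lowercase, unify apostrophes, drop ignored phrases and fillers."""
    text = text.lower().replace("’", "'")
    for phrase in IGNORED_PHRASES:
        text = text.replace(phrase, " ")
    words = re.findall(r"[a-z']+", text)
    return [w for w in words if w not in FILLERS]


def word_errors(reference, response):
    """Word-level Levenshtein distance between prompt and response."""
    ref, hyp = normalize(reference), normalize(response)
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


print(word_errors("I'll catch up with you soon.",
                  "uh .. I'll catch up you ... I don't know"))   # 2
print(word_errors("Security wouldn't let him in because he didn't have a pass.",
                  "Security wouldn't help him pass"))            # 7
```

In the model described on the slide, counts like these are combined with item complexity in a partial credit Rasch model; that step is not shown here.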


Versant’s Domain of Use

[Figure labels: Basic Language Cognition; Higher Language Cognition; Versant Tests; Domain-specific Tests]

Hulstijn (2010)


Versant’s Domain of Use

[Figure labels: Basic Language Cognition; Higher Language Cognition]

Versant test score correlations with communicative tests

Communicative test | r | n
Test of Spoken English (TSE) | 0.88 | 58
New TOEFL Speaking | 0.84 | 321
BEST Plus interview | 0.86 | 151
IELTS interview test | 0.76 | 130
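Because the sample sizes behind these correlations differ widely (n = 58 to n = 321), it can help to look at an approximate confidence interval for each r. The sketch below uses the standard Fisher z-transformation; it is an added illustration and not part of the original analysis, and the function name and the 95% level are assumptions.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson r (Fisher z method)."""
    if not -1 < r < 1 or n <= 3:
        raise ValueError("need -1 < r < 1 and n > 3")
    z = math.atanh(r)                # Fisher z-transform of r
    se = 1 / math.sqrt(n - 3)        # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# r and n values as reported on the slide.
reported = {
    "Test of Spoken English (TSE)": (0.88, 58),
    "New TOEFL Speaking": (0.84, 321),
    "BEST Plus interview": (0.86, 151),
    "IELTS interview test": (0.76, 130),
}
for test_name, (r, n) in reported.items():
    low, high = fisher_ci(r, n)
    print(f"{test_name}: r = {r:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Smaller samples give wider intervals, so the TSE figure is less precisely estimated than the TOEFL one even though its point estimate is higher.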


PTE Academic: Broader construct

Tasks: Read Aloud, Repeat Sentence, Retell Lecture, Answer Short Question, Describe Image

Traits: Content, Pronunciation, Vocabulary, Accuracy, Fluency

 | Describe Image | Retell Lecture
Preparation time | 25 secs | 40 secs
Response time | 40 secs | 40 secs


PTE Academic: Sampling Academic Domain

• 5 tasks:

– ~36 responses

– ~8 minutes of speech

• Input:

– Reading texts

– Listening texts

– Visual (non-linguistic)

• Output:

– Prepared monologues

– Short, real-time responses


Content Scoring of Constructed Responses

• Word choice (Latent Semantic Analysis)

• Content relevance

• Lexical measures

• Words in sequence; collocations

Example item: Retell Lecture

[Lecture image labels: Prokaryotic cell; Eukaryotic cell]

Sample response:

“the lecture was given about biotic cells prokaryotic cell was first described and eukaryotic cell was secondly ref uh described uh it was said eukaryotic cells are more complicated than prokaryotic cell eukaryotic cell is microorganisms where it is it has one single cell and multi cell organisms are also present in eukaryotic cell this more complicated than prokaryotic cell which is placed in right side of the screen”
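To give a feel for content-relevance scoring, here is a deliberately simplified Python sketch. It is not Latent Semantic Analysis: LSA needs a semantic space trained on a large corpus, so this stand-in only computes cosine similarity between raw word counts of a response and a reference text. The reference summary, the off-topic response and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Bag-of-words counts over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two word-count vectors (0 to 1)."""
    va, vb = term_counts(text_a), term_counts(text_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical reference summary of the lecture content.
reference = "eukaryotic cells are more complicated than prokaryotic cells"
on_topic = "it was said eukaryotic cells are more complicated than prokaryotic cell"
off_topic = "the weather was sunny and we went to the beach"

print(cosine_similarity(reference, on_topic))
print(cosine_similarity(reference, off_topic))
```

Unlike this word-overlap measure, an LSA-based score can also credit a response that uses related vocabulary not found in the reference.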


PTE Academic: Reliability

[Figure: Overall Score Machine to Human Correlation, R = 0.96. Plot of machine scores (y-axis) against human scaled scores (x-axis).]

Scoring | Machine-Human, r | Human split-half | Machine split-half
Overall | .97 | .97 | .96

Validation sample, n=158


TASKS

CLAIM: The tasks are valid for assessing spoken language proficiency

BACKING: The tasks tap real-time automatic processes, and sample academic language & domain interactions

COUNTER: Some tasks are not authentic; the interactions are too constrained

REBUTTAL: Many concurrent validation correlations with interview tests > 0.80 (different tasks and different performances)

SCORING

CLAIM: The scoring is sufficiently accurate to replace humans

BACKING: Machine-to-human score correlations ~0.97 (same tasks, same performance instance)

COUNTER: Machines are notoriously error prone; scores may be double counting poor pronunciation

REBUTTAL: Scores are relatively insensitive to simulations of worse recognition; systems should be optimized for score accuracy.


Acknowledgements

Dr Jared Bernstein, Consulting Scientist, Knowledge Technologies, Pearson

Prof John De Jong, SVP Global Strategy & Business Development, Pearson