
www.britishcouncil.org

Jamie Dunlea, British Council

Investigating the relationship between empirical task difficulty, textual features and CEFR levels

EALTA 2014

29 May – 1 June

University of Warwick

Language Assessment Research


First…
• HEALTH WARNINGS
• CAUTIONS
• CAVEATS
• SOME APOLOGIES
• Some interesting information to share


What we will do
• Look at task specifications for reading to specify criterial features of input texts for different CEFR levels
• Focus on vocabulary profiles and readability measures which are included in item writing specifications
• Discuss an exploratory analysis of the textual features of texts built to spec and the relationship to empirical difficulty
• Look at the relationship between
   • Rasch difficulty estimates of reading tasks from the item bank of an operational test designed around the CEFR and
   • selected linguistic indices which we use for item specification (and some additional measures).

Davidson & Fulcher (2007) encourage test developers to see the framework as a “series of guidelines from which tests (and teaching materials) can be built to suit local contextualized needs.”


Task specs: Where to start?

The CEFR can be a springboard to task and test development


Test specs from the CEFR

CEFR: Vocabulary Range

B2: Has a good range of vocabulary for matters connected to his field and most general topics. Can vary formulation to avoid frequent repetition, but lexical gaps can still cause hesitation and circumlocution.

B1: Has a sufficient vocabulary to express him/herself with some circumlocutions on most topics pertinent to his everyday life such as family, hobbies and interests, work, travel, and current events.

A2: Has sufficient vocabulary to conduct routine, everyday transactions involving familiar situations and topics.

A2: Has a sufficient vocabulary for the expression of basic communicative needs. Has a sufficient vocabulary for coping with simple survival needs.


Task specs: Where to start?

Descriptors need to remain holistic in order to give an overview; detailed lists of microfunctions, grammatical forms and vocabulary are presented in language specifications for particular languages (e.g. Threshold Level 1990).

An analysis of the functions, notions, grammar and vocabulary necessary to perform the communicative tasks described on the scales could be part of the process of developing new sets of language specifications.

(Council of Europe, 2001, p. 30)

CEFR Grid for Reading Tests

Characteristics:
• Text source
• Authenticity
• Discourse type
• Domain
• Topic
• Nature of content
• Text length
• Vocabulary
• Grammar

Vocabulary options: Only frequent vocabulary | Mostly frequent vocabulary | Rather extended | Extended

Sources: Manual (Council of Europe, 2009); Alderson et al. (2006)

Some important principles

• Some criteria when considering categories:
   • Consistency
   • Transparency
   • Accountability
   • Ease of use for item writers
• Specs have different audiences, and different levels of specificity according to the needs of the audience
• No spec is exhaustive: all specs will contain some of a possible range of categories and measures
• No spec is final: specs need to be reviewed and revised


Task specs: an example

Test: Aptis General
Component: Reading
Task: Matching headings to text

Features of the Task

Skill focus: Expeditious global reading of longer text, integrating propositions across a longer text into a discourse-level representation.

Task level: A1 | A2 | B1 | B2 | C1 | C2

Task description: Matching headings to paragraphs within a longer text. Candidates read through a longer text consisting of 7 paragraphs, identifying the best heading for each paragraph from a bank of 8 options.

Cognitive processing, goal setting:
• Expeditious reading: local (scan/search for specifics)
• Careful reading: local (understanding sentence)
• Expeditious reading: global (skim for gist/search for key ideas/detail)
• Careful reading: global (comprehend main idea(s)/overall text(s))

Cognitive processing, levels of reading:
• Word recognition
• Lexical access
• Syntactic parsing
• Establishing propositional meaning (clause/sentence level)
• Inferencing
• Building a mental model
• Creating a text level representation (discourse structure)
• Creating an intertextual representation (multi-text)


Features of the Input Text

Words: 700-750 words
Domain: Public | Occupational | Educational | Personal
Discourse mode: Descriptive | Narrative | Expository | Argumentative | Instructive
Content knowledge: General | Specific
Cultural specificity: Neutral | Specific
Nature of information: Only concrete | Mostly concrete | Fairly abstract | Mainly abstract
Lexical level: K1 | K2 | K3 | K4 | K5 | K6 | K7 | K8 | K9 | K10
   The cumulative coverage should reach 95% at the K5 level. No more than 5% of words should be beyond the K5 level.
Readability: Flesch-Kincaid Grade Level 9-12
Grammar: A1-B2 exponents; average sentence length 18-20 words
Text genre: Magazines, newspapers, instructional materials (such as extracts from undergraduate textbooks describing important events and ideas, etc.).




Features of the Response

Targets: Length up to 10 words; Lexical K1-K5; Grammatical A1-B2
Distractors: Length up to 10 words; Lexical K1-K5; Grammatical B1-B2
Key information: Within sentence | Across sentences | Across paragraphs
Extra criteria
Presentation: Written | Aural | Illustrations/Graphs


Using automated tools

Spec entries checked with automated tools:
   Lexical level: K1 | K2 | K3 | K4 | K5 | K6 | K7 | K8 | K9 | K10
   Readability: Flesch-Kincaid Grade Level 9-12

Lexical profiles: BNC-20 lists
• Derived from British National Corpus spoken corpora by Paul Nation (2006) and adapted by Tom Cobb
• 20 1000-word levels, word = word family

Tools for analysis:
• http://www.lextutor.ca/vp/
• http://www.victoria.ac.nz/lals/about/staff/paul-nation

Alternative frequency lists
• General Service List (2000 word families)
• Academic Word List
• BNC-COCA 25





Using automated tools

Readability: Flesch-Kincaid Grade Level
• Based on syllables per word and words per sentence.
• Reflects lexical level (longer words tend to be less frequent) and syntactic complexity (longer sentences tend to have more compound sentences and embedded clauses)
• Scaled to US grade levels (higher number, harder text)

Tools for analysis:
• https://readability-score.com/
• http://cohmetrix.memphis.edu/cohmetrixpr/index.html
• Readability measures available in Word

Some alternative readability measures
• Reading Ease (basis for Flesch-Kincaid)
• Coh-Metrix indices
• Lexile measures
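As a rough illustration of how the grade level combines the two ratios above, the sketch below applies the standard published Flesch-Kincaid Grade Level formula. The function names and the vowel-group syllable counter are assumptions made for this example; the tools listed above use their own syllable and sentence detection and will give somewhat different values.

```python
import re


def fk_grade_level(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid Grade Level from raw counts (standard formula)."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


def rough_syllables(word):
    """Crude heuristic (assumption): one syllable per group of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade_level_text(text):
    """Estimate the grade level of a text using the rough counters above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    syllables = sum(rough_syllables(w) for w in words)
    return fk_grade_level(len(words), len(sentences), syllables)


# 100 words in 5 sentences with 150 syllables:
# 0.39 * 20 + 11.8 * 1.5 - 15.59 = 9.91, inside a 9-12 grade band.
example_grade = fk_grade_level(100, 5, 150)
```

Because both ratios enter linearly, lengthening sentences or choosing longer words each raise the grade level independently.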


How much of a text do learners need to be able to comprehend?

A threshold level of 95% suggested for “reasonable” comprehension and guessing words from context (Laufer, 1989; Hirsch & Nation, 1992; Chujo & Oghigian, 2009)

A higher threshold of 98% suggested for “reading with ease” (Hirsch & Nation, 1992; Hu & Nation, 2000; Nation, 2006)

Van Zeeland & Schmitt (2012) suggest the different criteria could be suitable for different purposes, with 95% suitable for “adequate comprehension”.
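A minimal sketch of how a coverage threshold can be turned into a single "level" for a text: given the 1000-word band of each running word, accumulate coverage band by band and report the first band at which the threshold is met. The input format (one band number per token, None for off-list words) and the function names are assumptions for illustration; the token counts are invented, and a real profile would come from a tool such as VocabProfile.

```python
from collections import Counter


def cumulative_coverage(token_bands, n_bands=20):
    """Cumulative % of tokens covered at each 1000-word band (K1..Kn).

    token_bands: one band number per running word; None means off-list.
    """
    counts = Counter(token_bands)
    total = len(token_bands)
    coverage, running = [], 0
    for band in range(1, n_bands + 1):
        running += counts.get(band, 0)
        coverage.append(100.0 * running / total)
    return coverage


def band_reaching(coverage, threshold):
    """First band whose cumulative coverage meets the threshold, else None."""
    for band, pct in enumerate(coverage, start=1):
        if pct >= threshold:
            return band
    return None


# Invented 100-token profile: 80 K1, 10 K2, 4 K3, 2 K4, 2 K5, 2 off-list.
tokens = [1] * 80 + [2] * 10 + [3] * 4 + [4] * 2 + [5] * 2 + [None] * 2
coverage = cumulative_coverage(tokens)
level_95 = band_reaching(coverage, 95)  # K4 for this invented profile
level_98 = band_reaching(coverage, 98)  # K5 for this invented profile
```

The same text therefore sits at a higher band under the 98% criterion than under the 95% criterion, which is why the choice of threshold matters when writing a spec.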


Aptis Reading Test Tasks

A1: 5 items per task; word length 50-60
   Task focus: Sentence level meaning (Careful, local reading)
   Response format: 3-option multiple choice for each gap.

A2: 6 items per task; word length 90-100
   Task focus: Inter-sentence cohesion (Careful global reading)
   Response format: Reorder 6 jumbled sentences. All sentences must be used to complete the story.

B1: 7 items per task; word length 125-135
   Task focus: Text-level comprehension of short texts (Careful global reading)
   Response format: 7 gaps in a short text. Select the best word to fill each gap from a bank of 9 options.

B2: 7 items per task; word length 700-750
   Task focus: Text-level comprehension of longer text (Global reading, both careful and expeditious)
   Response format: 7 paragraphs forming a long text. Select the most appropriate heading for each paragraph from a bank of 8 options.


Aptis Reading Test Tasks

Lvl | Word length | BNC level (95%) | Flesch-Kincaid Grade Level
A1 | 50-60 | 1000 | (not given)
A2 | 90-100 | 2000 | 4-6
B1 | 125-135 | 3000 | 6-9
B2 | 700-750 | 4000-5000 | 9-12
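One way an item writer's draft could be screened against these figures is sketched below. The dictionary layout and the `check_text` helper are hypothetical, not part of any Aptis tooling, and the sketch assumes the BNC level is a ceiling (95% coverage must be reached at or below it) and that A1 has no readability requirement because none is listed.

```python
# Level specs transcribed from the table above (layout is an assumption).
LEVEL_SPECS = {
    "A1": {"words": (50, 60), "bnc95_max": 1000, "fk_grade": None},
    "A2": {"words": (90, 100), "bnc95_max": 2000, "fk_grade": (4, 6)},
    "B1": {"words": (125, 135), "bnc95_max": 3000, "fk_grade": (6, 9)},
    "B2": {"words": (700, 750), "bnc95_max": 5000, "fk_grade": (9, 12)},
}


def check_text(level, word_count, bnc95_level, fk_grade):
    """Return a list of spec violations for a draft text (empty if none)."""
    spec = LEVEL_SPECS[level]
    problems = []

    low, high = spec["words"]
    if not low <= word_count <= high:
        problems.append(f"word count {word_count} outside {low}-{high}")

    # Assumption: the BNC figure is the highest band at which 95% may be reached.
    if bnc95_level > spec["bnc95_max"]:
        problems.append(f"95% coverage reached at {bnc95_level}, above {spec['bnc95_max']}")

    if spec["fk_grade"] is not None:
        low, high = spec["fk_grade"]
        if not low <= fk_grade <= high:
            problems.append(f"Flesch-Kincaid grade {fk_grade} outside {low}-{high}")

    return problems


# Example: a B1 draft that is too long, too lexically demanding and too hard.
issues = check_text("B1", word_count=200, bnc95_level=4000, fk_grade=10.0)
```

A check like this only covers the automated measures; features such as discourse mode or abstractness still need human judgement.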


Aptis Reading Test Tasks
• Total of 20 texts, 5 for each CEFR level, from operational test forms in the item bank
• Rasch measures derived during pretesting, with anchors to facilitate equating of items to a common scale for all items in the item bank
• The Rasch difficulty is for the task, rather than at the individual item level.


Other measures

Studies using automatic textual analysis to investigate criterial features of reading texts:
• Green, Unaldi & Weir (2009). Empiricism versus connoisseurship: Establishing the appropriacy of texts in tests of academic reading
• Weir, Vidaković & Galaczi (2013). Measured Constructs
• Wu (2012). Establishing the Validity of the General English Proficiency Test Reading Component Through a Critical Evaluation on Alignment with the Common European Framework of Reference


Additional indices

CEFR Level: CEFR level of the task (according to task specifications)
Rasch difficulty: Difficulty measure for task
BNC 95: Level at which 95% coverage is reached
RDFKGL: Flesch-Kincaid Grade Level
RDL2: Coh-Metrix L2 Readability
DESWC: Word count
DESSL: Sentence length
DESWLlt: Word length, number of letters, mean
PCNARp: Text Easability PC Narrativity, percentile
PCSYNp: Text Easability PC Syntactic simplicity, percentile
PCCNCp: Text Easability PC Word concreteness, percentile
PCREFp: Text Easability PC Referential cohesion, percentile
PCDCp: Text Easability PC Deep cohesion, percentile
PCVERBp: Text Easability PC Verb cohesion, percentile
PCCONNp: Text Easability PC Connectivity, percentile
PCTEMPp: Text Easability PC Temporality, percentile
LDVOCD: Lexical diversity, VOCD, all words
SYNLE: Left embeddedness, words before main verb, mean
SYNNP: Number of modifiers per noun phrase, mean
WRDCNCc: Concreteness for content words, mean


Correlations: linguistic indices and difficulty

Measure | Kendall’s Tau | Spearman Rank Order
BNC95 | .824** | .924**
RDFKGL | .547** | .690**
RDL2 | -.505** | -.645**
DESWC | .540** | .777**
DESSL | .487** | .668**
DESWLlt | .432** | .589**
PCNARp | -.474** | -.608**
PCSYNp | -.105 | -.211
PCCNCp | .084 | .128
PCREFp | -.164 | -.296
PCDCp | .200 | .281
PCVERBp | .063 | .032
PCCONNp | -.257 | -.366
PCTEMPp | -.105 | -.162
LDVOCD | .491** | .661**
SYNLE | .639** | .818**
SYNNP | .611** | .792**
WRDCNCc | .274 | .364
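For readers unfamiliar with the two rank-order statistics reported here, the following is a dependency-free sketch of Spearman's rho (Pearson correlation of average ranks) and Kendall's tau-b (concordant minus discordant pairs, adjusted for ties). The function names are this example's own and the six data points are invented purely for illustration; they are not the study data, and a statistics package would normally be used instead.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects for ties in either variable."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from both terms
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt((untied + only_y_tied) * (untied + only_x_tied))


# Invented values: a BNC 95% level and a task difficulty for six texts.
bnc95 = [1, 1, 2, 3, 3, 5]
difficulty = [-2.0, -1.5, -0.4, 0.3, 0.1, 1.8]
rho = spearman_rho(bnc95, difficulty)
tau = kendall_tau_b(bnc95, difficulty)
```

With only four possible values of the BNC 95% level, ties are frequent, which is one reason the tie-corrected tau-b form is used in this sketch.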

Discussion


• Strong relationship between vocabulary level and difficulty.
• Strong, but weaker, relationship between readability and difficulty.
• These correlations need to be interpreted with caution, as differences are built into the actual test tasks at each level.
• But the relative differences between the linguistic measures in terms of strength of relationship to empirical difficulty are still instructive.

Discussion


• First step in an ongoing research agenda to investigate these relationships with larger data sets; this will include looking within levels to control for text length and task.

• Future studies will also include the cognitive features of task demand and the nature of the information in the texts (both are in our item specs, but require qualitative judgement).

• Such studies should look at the strength of prediction of these variables (cognitive and linguistic) through regression analysis when enough data is available.

• Rasch difficulty values are from pretesting data; future studies will look at Rasch measures derived from operational data with larger data sets.
