www.britishcouncil.org 1
Investigating the relationship between empirical task difficulty, textual features and CEFR levels
Jamie Dunlea, British Council
Language Assessment Research
EALTA 2014, 29 May – 1 June, University of Warwick
First…
• HEALTH WARNINGS
• CAUTIONS
• CAVEATS
• SOME APOLOGIES
• Some interesting information to share
What we will do
• Look at task specifications for reading to specify criterial features of input texts for different CEFR levels
• Focus on vocabulary profiles and readability
measures which are included in item writing
specifications
• Discuss an exploratory analysis of the textual
features of texts built to spec and the relationship to
empirical difficulty
• Look at the relationship between
  • Rasch difficulty estimates of reading tasks from the item bank of an operational test designed around the CEFR, and
  • selected linguistic indices which we use for item specification (and some additional measures).
Davidson & Fulcher (2007) encourage test developers to see the framework as a “series of guidelines from which tests (and teaching materials) can be built to suit local contextualized needs.”
The CEFR can be a springboard to task and test development
Task specs: Where to start?
Test specs from the CEFR
CEFR: Vocabulary Range
B2: Has a good range of vocabulary for matters connected to his/her field and most general topics. Can vary formulation to avoid frequent repetition, but lexical gaps can still cause hesitation and circumlocution.
B1: Has a sufficient vocabulary to express him/herself with some circumlocutions on most topics pertinent to his/her everyday life such as family, hobbies and interests, work, travel, and current events.
A2: Has sufficient vocabulary to conduct routine, everyday transactions involving familiar situations and topics.
A2: Has a sufficient vocabulary for the expression of basic communicative needs. Has a sufficient vocabulary for coping with simple survival needs.
Task specs: Where to start?
Descriptors need to remain holistic in order to give
an overview; detailed lists of microfunctions,
grammatical forms and vocabulary are presented
in language specifications for particular languages
(e.g. Threshold Level 1990).
An analysis of the functions, notions, grammar
and vocabulary necessary to perform the
communicative tasks described on the scales
could be part of the process of developing new
sets of language specifications.
(Council of Europe, 2001, p. 30)
CEFR Grid for Reading Tests
Characteristics:
• Text source
• Authenticity
• Discourse type
• Domain
• Topic
• Nature of content
• Text length
• Vocabulary
• Grammar
Vocabulary: Only frequent vocabulary / Mostly frequent vocabulary / Rather extended / Extended
Manual (Council of Europe, 2009)
Alderson et al. (2006)
Some important principles
• Some criteria when considering categories:
  • Consistency
  • Transparency
  • Accountability
  • Ease of use for item writers
• Specs have different audiences, and different levels of specificity according to the needs of the audience
• No spec is exhaustive: all specs will contain some of a possible range of categories and measures
• No spec is final: specs need to be reviewed and revised
Task specs: an example
Test: Aptis General
Component: Reading
Task: Matching headings to text
Features of the Task
Skill focus: Expeditious global reading of longer text, integrating propositions across a longer text into a discourse-level representation.
Task level: A1 / A2 / B1 / B2 / C1 / C2
Task description: Matching headings to paragraphs within a longer text. Candidates read through a longer text consisting of 7 paragraphs, identifying the best heading for each paragraph from a bank of 8 options.
Cognitive processing: goal setting
• Expeditious reading: local (scan/search for specifics)
• Careful reading: local (understanding sentence)
• Expeditious reading: global (skim for gist/search for key ideas/detail)
• Careful reading: global (comprehend main idea(s)/overall text(s))
Cognitive processing: levels of reading
• Word recognition
• Lexical access
• Syntactic parsing
• Establishing propositional meaning (clause/sentence level)
• Inferencing
• Building a mental model
• Creating a text level representation (discourse structure)
• Creating an intertextual representation (multi-text)
Task specs: an example
Features of the Input Text
Words: 700-750 words
Domain: Public / Occupational / Educational / Personal
Discourse mode: Descriptive / Narrative / Expository / Argumentative / Instructive
Content knowledge: General / Specific
Cultural specificity: Neutral / Specific
Nature of information: Only concrete / Mostly concrete / Fairly abstract / Mainly abstract
Lexical level: K1 / K2 / K3 / K4 / K5 / K6 / K7 / K8 / K9 / K10. The cumulative coverage should reach 95% at the K5 level. No more than 5% of words should be beyond the K5 level.
Readability: Flesch-Kincaid Grade Level 9-12
Grammar: A1-B2 exponents; average sentence length 18-20 words
Text genre: Magazines, newspapers, instructional materials (such as extracts from undergraduate textbooks describing important events and ideas, etc.).
Task specs: an example
Features of the Response
Targets: Length up to 10 words; Lexical K1-K5; Grammatical A1-B2
Distractors: Length up to 10 words; Lexical K1-K5; Grammatical B1-B2
Key information: Within sentence / Across sentences / Across paragraphs
Extra criteria
Presentation: Written / Aural / Illustrations/Graphs
Using automated tools
Lexical profiles: BNC-20 lists
• Derived from British National Corpus spoken corpora by
Paul Nation (2006) and adapted by Tom Cobb
• 20 levels of 1,000 words each, where a "word" is a word family
Tools for analysis:
• http://www.lextutor.ca/vp/
• http://www.victoria.ac.nz/lals/about/staff/paul-nation
Alternative frequency lists
• General Service List (2,000 word families)
• Academic Word List
• BNC-COCA 25
Using automated tools
Readability: Flesch-Kincaid Grade Level
• Based on syllables per word and words per sentence.
• Acts as a proxy for lexical level (longer words tend to be less frequent) and syntactic complexity (longer sentences tend to contain more compound structures and embedded clauses).
• Scaled to US grade levels (higher number, harder text).
Tools for analysis:
• https://readability-score.com/
• http://cohmetrix.memphis.edu/cohmetrixpr/index.html
• Readability measures available in Word
Some alternative readability measures
• Flesch Reading Ease (basis for Flesch-Kincaid)
• Coh-Metrix indices
• Lexile measures
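As a rough illustration of how syllables per word and words per sentence combine into a grade level, here is a minimal sketch of the standard Flesch-Kincaid Grade Level formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The syllable counter is a crude vowel-group heuristic and the sentence splitter is naive; both are assumptions made for this sketch, so its output will differ somewhat from the tools listed above.

```python
import re


def count_syllables(word: str) -> int:
    """Crude estimate (assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level from words/sentence and syllables/word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical sample texts, not taken from the test materials.
easy = "The cat sat on the mat. It was a sunny day."
hard = ("Comprehensive institutional evaluation necessitates "
        "systematic consideration of organisational priorities.")
print(round(flesch_kincaid_grade(easy), 2))
print(round(flesch_kincaid_grade(hard), 2))
```

Longer words and longer sentences both push the score up, which is why the measure is treated here as a proxy for lexical and syntactic demand rather than a direct measure of either.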
How much of a text do learners need to be
able to comprehend?
A threshold level of 95% suggested for “reasonable” comprehension and guessing words from context (Laufer, 1989; Hirsch & Nation, 1992; Chujo & Oghigian, 2009)
A higher threshold of 98% suggested for “reading with ease” (Hirsch & Nation, 1992; Hu & Nation, 2000; Nation, 2006)
Van Zeeland & Schmitt (2012) suggest that the different criteria could suit different purposes, with 95% suitable for “adequate comprehension”.
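The coverage thresholds above can be made concrete with a small sketch that finds the first frequency band at which cumulative coverage of a text reaches a given threshold. The function name `coverage_level`, the `band_of` lookup and the toy word bands are all hypothetical; real profilers such as the ones listed earlier work from full word-family lists, which this sketch does not attempt to reproduce.

```python
def coverage_level(tokens, band_of, threshold=0.95):
    """Return (band, coverage) for the first band whose cumulative
    coverage reaches the threshold, or (None, coverage) if none does.

    band_of maps a lower-cased word to its frequency band (1 = K1, 2 = K2, ...).
    Words missing from band_of are treated as off-list and never covered.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = {}
    for token in tokens:
        band = band_of.get(token.lower())
        counts[band] = counts.get(band, 0) + 1
    total = len(tokens)
    covered = 0
    for band in sorted(b for b in counts if b is not None):
        covered += counts[band]
        if covered / total >= threshold:
            return band, covered / total
    return None, covered / total


# Toy bands and text, invented for illustration only.
toy_bands = {"the": 1, "cat": 1, "mat": 2}
toy_tokens = ["the"] * 10 + ["cat"] * 6 + ["mat"] * 3 + ["quark"]

band, coverage = coverage_level(toy_tokens, toy_bands)
print(band, coverage)
```

In the toy text, 16 of 20 tokens fall in band 1 and 19 of 20 in bands 1 and 2, so the 95% threshold is first met at band 2, while a 98% threshold is never met because one token is off-list.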
Aptis Reading Test Tasks
Level | Items per task | Word length | Task focus | Response format
A1 | 5 | 50-60 | Sentence level meaning (careful, local reading) | 3-option multiple choice for each gap.
A2 | 6 | 90-100 | Inter-sentence cohesion (careful global reading) | Reorder 6 jumbled sentences. All sentences must be used to complete the story.
B1 | 7 | 125-135 | Text-level comprehension of short texts (careful global reading) | 7 gaps in a short text. Select the best word to fill each gap from a bank of 9 options.
B2 | 7 | 700-750 | Text-level comprehension of longer text (global reading, both careful and expeditious) | 7 paragraphs forming a long text. Select the most appropriate heading for each paragraph from a bank of 8 options.
Aptis Reading Test Tasks
Level | Word length | BNC level (95%) | Flesch-Kincaid Grade Level
A1 | 50-60 | 1000 | (not given)
A2 | 90-100 | 2000 | 4-6
B1 | 125-135 | 3000 | 6-9
B2 | 700-750 | 4000-5000 | 9-12
Aptis Reading Test Tasks
• Total of 20 texts, 5 for each CEFR level, from operational test forms in the item bank
• Rasch measures derived during
pretesting with anchors to facilitate
equating of items to a common scale for
all items in the item bank
• The Rasch difficulty is for the task, rather
than at the individual item level.
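For readers less familiar with Rasch measures, the sketch below shows the basic dichotomous Rasch model, in which the probability of success depends only on the difference between person ability and difficulty, both in logits. The slides do not say which Rasch model or software was used for these task-level estimates, so this is an illustration of the general idea under that assumption; the ability and difficulty values are invented.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Dichotomous Rasch model: P(success) = 1 / (1 + exp(-(ability - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical logit values, not taken from the item bank.
ability = 0.5
for difficulty in (-1.0, 0.5, 2.0):
    p = rasch_probability(ability, difficulty)
    print(f"difficulty {difficulty:+.1f} -> P(success) = {p:.3f}")
```

Because difficulties sit on one logit scale once anchored and equated, tasks from different pretest forms can be compared directly, which is what makes a single difficulty value per task usable in the correlations that follow.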
Other measures
Studies using automatic textual analysis to investigate criterial features of reading texts:
• Green, Ünaldi & Weir (2009). Empiricism versus connoisseurship: Establishing the appropriacy of texts in tests of academic reading
• Weir, Vidaković & Galaczi (2013). Measured Constructs
• Wu (2012). Establishing the Validity of the General English Proficiency Test Reading Component Through a Critical Evaluation on Alignment with the Common European Framework of Reference
Additional indices
CEFR Level: CEFR level of the task (according to task specifications)
Rasch difficulty: Difficulty measure for task
BNC95: BNC level at which 95% coverage is reached
RDFKGL: Flesch-Kincaid grade level
RDL2: Coh-Metrix L2 Readability
DESWC: Word count
DESSL: Sentence length
DESWLlt: Word length, number of letters, mean
PCNARp: Text Easability PC Narrativity, percentile
PCSYNp: Text Easability PC Syntactic simplicity, percentile
PCCNCp: Text Easability PC Word concreteness, percentile
PCREFp: Text Easability PC Referential cohesion, percentile
PCDCp: Text Easability PC Deep cohesion, percentile
PCVERBp: Text Easability PC Verb cohesion, percentile
PCCONNp: Text Easability PC Connectivity, percentile
PCTEMPp: Text Easability PC Temporality, percentile
LDVOCD: Lexical diversity, VOCD, all words
SYNLE: Left embeddedness, words before main verb, mean
SYNNP: Number of modifiers per noun phrase, mean
WRDCNCc: Concreteness for content words, mean
Correlations: linguistic indices and difficulty
Measure | Kendall's Tau | Spearman Rank Order
BNC95 | .824** | .924**
RDFKGL | .547** | .690**
RDL2 | -.505** | -.645**
DESWC | .540** | .777**
DESSL | .487** | .668**
DESWLlt | .432** | .589**
PCNARp | -.474** | -.608**
PCSYNp | -0.105 | -0.211
PCCNCp | 0.084 | 0.128
PCREFp | -0.164 | -0.296
PCDCp | 0.2 | 0.281
PCVERBp | 0.063 | 0.032
PCCONNp | -0.257 | -0.366
PCTEMPp | -0.105 | -0.162
LDVOCD | .491** | .661**
SYNLE | .639** | .818**
SYNNP | .611** | .792**
WRDCNCc | 0.274 | 0.364
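Both coefficients reported here are rank-based, which suits a small sample with ordinal or tied values such as BNC levels. The sketch below is a plain-Python illustration of Spearman's rho (Pearson correlation on average ranks) and Kendall's tau-b (which adjusts for ties). It is not the analysis actually run for the study, the function names are this sketch's own, and the data are invented.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b: (concordant - discordant) pairs, corrected for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: counted in neither term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt((untied + ties_x) * (untied + ties_y))


# Invented values for illustration: BNC level (in thousands) and task difficulty (logits).
bnc95 = [1, 1, 2, 2, 3, 3, 4, 5]
difficulty = [-2.1, -1.8, -0.9, -1.0, 0.2, 0.5, 1.4, 1.1]
print(round(kendall_tau_b(bnc95, difficulty), 3))
print(round(spearman_rho(bnc95, difficulty), 3))
```

Kendall's tau is typically smaller in magnitude than Spearman's rho on the same data, which matches the pattern in the table above.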
Discussion
• Strong relationship between vocabulary level and
difficulty.
• Strong, though weaker, relationship between readability and difficulty.
• These correlations need to be interpreted with caution, as differences are built into the actual test tasks at each level.
• But the relative differences between the linguistic measures in terms of strength of relationship to empirical difficulty are still instructive.
Discussion
• This is a first step in an ongoing research agenda to investigate these relationships with larger data sets; later work will include looking within levels so that text length and task can be controlled for.
• Future studies will also include the cognitive features of task demand and the nature of the information in the texts (both are in our item specs, but both require qualitative judgement).
• Such studies should look at the strength of prediction of these variables (cognitive and linguistic) through regression analysis when enough data is available.
• Rasch difficulty values are from pretesting data; future studies will look at Rasch measures derived from operational data with larger data sets.