Master’s Thesis
International Studies in Computational Linguistics (ISCL)
Automated C-Test Difficulty Prediction:
Integrating Lexical, Sentence, and Text Features
in a Multi-Lingual Perspective
Author: Sabrina Galasso
1st Supervisor: Prof. Dr. Walt Detmar Meurers
2nd Supervisor: apl. Prof. Dr. Kurt Eberle
Submitted in Partial Fulfillment of the Requirements for the Degree of
Master of Arts in Computational Linguistics
Seminar für Sprachwissenschaft, Eberhard Karls Universität Tübingen
February 9th, 2018
Anti-Plagiarism Declaration
Surname: Galasso
First name: Sabrina
Student ID number: 3730351
Address: Elly-Heuss-Knapp-Str. 27, 72074 Tübingen
I hereby affirm that I wrote the thesis titled
“Automated C-Test Difficulty Prediction: Integrating Lexical, Sentence, and
Text Features in a Multi-Lingual Perspective”,
supervised by Prof. Detmar Meurers,
independently and using only the aids stated in the thesis. I am aware that I
must write independently all written work that I submit as coursework or
examination work during my studies. Quotations and the use of external sources
and aids must be clearly marked by me according to the rules of scholarly
documentation. I may not present other people's texts or text passages
(including those from the internet) as my own.
A violation of these basic rules of scholarly work is considered an attempt at
deception or fraud and entails corresponding consequences. In every case, the
work is graded “insufficient” (5.0). In particularly serious cases, the
examination board may exclude the candidate from taking further examinations
(cf. § 12 para. 3 of the examination regulations for the Magister programmes of
11 and 25 September 1995, and § 13 para. 3 of the examination regulations for
the cultural studies Bachelor's and Master's programmes of 12 October 2006 and
23 November 2007).
Summary: I hereby declare that this paper is the result of my own independent
scholarly work. I have acknowledged all the other authors’ ideas and referenced
direct quotations from their work (in the form of books, articles, essays,
dissertations, and on the internet). No material other than that listed has
been used.
Tübingen, February 9, 2018
Sabrina Galasso
Abstract
This thesis aims at the automated prediction of the difficulty of items,
sentences, and whole text passages in Spanish and English C-Tests. The
C-Test is an integrative placement test that measures a learner’s general
language proficiency. It is based on the cloze principle and tests the ability
to restore mutilated words in an authentic text. The process of designing
the test should involve knowledge about the difficulty of single items and
text passages.
Given actual learner data provided by a university’s language learning
center, we analyze the performance of C-Test participants on item, sentence,
and text level across proficiency levels for English and Spanish. C-Test items
are clustered into four to eight groups using different sets of performance
variables. A broad range of lexical features, readability features, syntactic
features, discourse features, and features describing the item’s context and
possible candidates are integrated into one pipeline. These features are
used within classification experiments in order to get insights into linguistic
characteristics of the test difficulty.
A combination of performance variables that can be integrated into a
real-world C-Test generation application for both languages is presented:
Similar to the findings described by Svetashova (2015), the combination
includes information about the test takers’ performance on item and text
level. The C-Test items could be grouped into five interpretable classes: easy
items in difficult texts, easy items in easy texts, difficult items in easy
texts, difficult items in difficult texts, and as a fifth group either all
items in very easy texts for the English data, or all items in very difficult
texts for the Spanish data. Using the full set of features for the
classification of these five classes leads to a macro-averaged F1 score of
0.76 (Support Vector Machine) for the English data, and a score of 0.82 for
the Spanish data.
In order to compare the difficulty characteristics of the two languages,
we further perform classification experiments using a comparable set of
features and add experiments based on information about the test takers’
performance on sentence and item level. The results show that the set of
comparable features is more predictive for Spanish than for English. Features
describing the item’s candidate space are highly predictive for both
languages.
Acknowledgements
“No act of kindness, no matter how small, is ever wasted.” (Aesop)
I am deeply grateful for the help and support that I received during the
writing of this thesis. Receiving kindness in its various facets helped me to
overcome challenging times.
First of all, I want to thank my supervisor Detmar Meurers for his continuous
personal and technical guidance throughout the research and writing of this
work. His feedback on my progress was always motivating and deepened my
interest in the topic.
I would like to offer my deepest thanks to Yulia Svetashova. Without her
personal advice and professional support this work would not have been
completed in this form. Our conversations shaped the fundamental structure of
this thesis. Her great willingness to help and her sincere words of
encouragement helped me through demanding periods.
I want to thank Claudia Duttlinger and Jorge Martín-Martín for giving me the
opportunity to gain insights into real-world placement testing. Thank you for
placing trust in me as a researcher and software developer. I have greatly
benefited from our collaboration.
I further want to thank Eyal Schejter, Björn Rudzewitz, Xiaobin Chen, and
Jochen Saile for their support whenever I faced technical problems. I also
appreciated help from Zarah Weiß regarding the understanding and
implementation of the Dependency Locality Theory. I would like to thank my
university friends Alexander Hartmann, Till Pachalli, Eyal Schejter, and Heike
Cardoso for proofreading my thesis.
Most of all, I want to thank my family members and closest friends.
Without the encouragement and unconditional support of my parents, my
sister, and my brother this work would not have been possible at all. And
finally, I would like to thank my future husband — for everything.
Contents
1 Introduction
2 Background
2.1 Language Testing
2.2 The C-Test
2.2.1 From cloze tests to C-Tests
2.2.2 Test design across languages
2.2.3 C-Test criticism
2.3 Linguistic Complexity
2.4 Related Work on C-Test Difficulty Prediction
2.4.1 Beinborn et al. (2014) and Beinborn (2016)
2.4.2 Svetashova (2015)
3 Data
3.1 C-Tests at the Fachsprachenzentrum
3.2 Descriptive Analysis of the Data
3.2.1 Database Structure
3.2.2 Available Data for English and Spanish
4 Performance Modeling
4.1 Performance Data Description
4.2 Clustering Approach
5 Difficulty Modeling
5.1 Linguistic preprocessing
5.2 Modeling the difficulty of English C-Tests
5.2.1 Item Level
5.2.2 Sentence Level
5.2.3 Text Level
5.3 Modeling the difficulty of Spanish C-Tests
5.3.1 Item Level
5.3.2 Sentence Level
5.3.3 Text Level
6 Experiments and Results
6.1 Experimental Setup
6.2 Investigation of Performance Profiles
6.2.1 Classification Results on English Performance Profiles
6.2.2 Classification Results on Spanish Performance Profiles
6.3 Predicting C-Test Performance on Text and Item Level
6.3.1 English
6.3.2 Spanish
6.4 Comparative Investigation of Difficulty Prediction for Spanish and English
6.4.1 Comparison of Performance on Text and Item Level
6.4.2 Comparison of Performance on Sentence and Item Level
6.5 Discussion of the Presented Results
7 Conclusion
Bibliography
A Appendix
List of Figures
2.1 A difficulty continuum described by Beinborn et al. (2014)
2.2 Variable factor map for the best performance profile (Svetashova, 2015)
2.3 Individuals factor map for the best performance profile (Svetashova, 2015)
2.4 Top 50 features by feature subset (Svetashova, 2015)
2.5 Example text with items highlighted according to difficulty prediction results (Svetashova, 2015)
4.1 Distribution of the participants’ proficiency levels for each text passage
4.2 Distribution of the participants’ proficiency levels for each item
4.3 Variables factor map using all 21 performance variables and 4 clusters (English)
4.4 Variables factor map using all 21 performance variables and 4 clusters (Spanish)
4.5 Individuals and variables factor map using textAv and percCorrect (Spanish)
4.6 Individuals and variables factor map using textAv and percCorrect (English)
5.1 UIMA preprocessing pipeline to annotate C-TestTokens
6.1 Individuals and variables factor map with performance profile Text SentProf Item CL4 (English)
6.2 Variable importance of the top 20 predictors in the RF model for Text SentProf Item CL4 (English)
6.3 Variable importance of the top 20 predictors in the RF model for Text SentProf Item CL4 (Spanish)
6.4 Variable importance of the top 20 predictors in the SVM model for Text Item CL5 (English)
6.5 Cluster interpretation given the performance profile Text Item CL5 (English)
6.6 Variable importance of the top 20 predictors in the SVM model for Text Item CL5 (Spanish)
6.7 Cluster interpretation given the performance profile Text Item CL5 (Spanish)
6.8 Using Text Dlt features only: RF classification confusion matrices for Spanish (top) and English (bottom) for the profile Text Item CL5
6.9 Using Item SubtlexCand features only: RF classification confusion matrices for Spanish (top) and English (bottom) for the profile Text Item CL5
6.10 Individuals and variables factor map with performance profile Sent Item CL4 (Spanish)
6.11 Individuals and variables factor map with performance profile Sent Item CL4 (English)
List of Tables
2.1 Beinborn’s most predictive features
2.2 Difficulty estimates for “the” and “in” (Svetashova, 2015)
2.3 Performance of feature groups in Svetashova (2015)
3.1 Number of participants and processed texts for Spanish and English
3.2 Conducted C-Tests with the number of test takers and information about their performance score
4.1 Theoretically possible item difficulty properties
5.1 The features of the C-TestToken type as implemented in our UIMA preprocessing pipeline
5.2 List of context and candidate space features
5.3 List of lexical variation features
5.4 List of lexical sophistication features
5.5 List of syntactic complexity features
6.1 The number of features (factor and numeric) before and after removing highly correlated predictors
6.2 SVM and RF classification results using all features (English)
6.3 SVM and RF classification results using all features (Spanish)
6.4 Mean values of different item and text level features by cluster (English)
6.5 Mean values of predictive item and text level features by cluster (Spanish)
6.6 Classification results for SVM and RF for the performance profile Text Item CL5 and different feature subsets. 63 features were considered to be comparable.
6.7 Variable importance comparison. RF models for profile Text Item CL5 using a set of 63 comparable features.
6.8 Classification results for SVM and RF for the performance profile Sent Item CL4 using the 63 comparable features
6.9 Variable importance comparison. RF models for profile Sent Item CL4 using the set of 63 comparable features.
A.1 Item level features from the groups of linguistic, psycholinguistic and position based features
A.2 Item level features from the group of context and candidate space features
A.3 Sentence level features
A.4 Text level features from the group of lexical features
A.5 Text level features from the groups of syntactic complexity, DLT and readability features
A.6 English: Text ids and number of participants per text
A.7 Spanish: Text ids and number of participants per text
List of Abbreviations
AE Analysis Engine
AOA Age-of-acquisition
AWL Academic Word List
CALT Computer-Assisted Language Testing
CEFR Common European Framework of Reference for Languages
CTAP Common Text Analysis Platform
DLT Dependency Locality Theory
FSZ Fachsprachenzentrum (The Language Learning Center of the University of
Tübingen)
HCPC Hierarchical Clustering on Principal Components
ICALL Intelligent Computer-Assisted Language Learning
IRT Item Response Theory
LFP Lexical Frequency Profile
NLP Natural Language Processing
NLTK Natural Language Toolkit
PCA Principal Component Analysis
PCFG probabilistic context-free grammar
POS Part-of-speech
RF Random Forest
SLA Second Language Acquisition
SVM Support Vector Machine
TFIDF Term Frequency–Inverse Document Frequency
TTR Type-Token-Ratio
UIMA Unstructured Information Management Architecture
1 Introduction
‘What does it mean when we say that someone knows a language?’ (Spolsky, 1969).
Bernard Spolsky asked this question in the late 1960s to introduce one of his
works on language testing, and it is still very present in contemporary
research on second language learning. It also implies the question of how one
can break down the continuum of proficiency from beginning learners of a
foreign language to native-like speakers. Based on empirical research, the
Council of Europe developed the Common European Framework of Reference for
Languages (CEFR)¹, which groups foreign language proficiency into six levels.
It was developed to provide a standard for determining language qualifications
and to facilitate teaching.
Research on second language acquisition (SLA) has shown that teaching is most
effective if the teaching material suits the learner’s current state of
development. The linguist and education researcher Stephen Krashen claimed
that language acquisition occurs if learners are exposed to input that is
slightly more advanced than what they already know (Krashen, 1985). As the
demand for language courses has increased in the past decades, so has the
interest in assigning large numbers of learners to appropriate course levels in
a cost-effective way. This assignment is done using placement tests, which are
conducted by universities, language schools, or even human resources
departments of large companies trying to determine which further language
training their employees might need.
Placement tests differ considerably in their underlying method. They can be
very versatile and complex, testing different kinds of language skills such as
grammar or vocabulary knowledge in different parts, or including oral as well
as written assignments. The more complex the test is, the higher the costs of
conducting it. Therefore, many institutions make use of simple cloze tests,
which require less preparation and scoring time and have still been shown to be
quite reliable in assigning language levels to learners. The test investigated
in this thesis is the C-Test, a variation of the cloze procedure that follows
certain rules on how the gapping is done. There is much research on the
validity and reliability of C-Tests, but less research on how the difficulty of
the test can be predicted automatically. The process of building a C-Test
should involve knowledge about the difficulty of single text passages in order
to select them appropriately and thereby influence the test’s overall
difficulty. This, in turn, should influence the analysis of the test takers’
performance results and thereby the process of assigning them to appropriate
language course levels. Thus, predicting the test’s difficulty is an important
step towards a reasonable and appropriate comparison of results.
¹ http://www.coe.int/en/web/common-european-framework-reference-languages (Last accessed: 18/02/08)
Predicting the difficulty of C-Tests requires knowledge about linguistic
complexity. It needs to be investigated which characteristics of language use
indicate how difficult a test is. These characteristics can span different
scopes of locality within the test, such as single gaps themselves, sentences,
or whole paragraphs and texts. Some gaps might not be difficult themselves, but
the sentence or text in which they occur is so hard to understand that the gap
cannot be filled in correctly. Furthermore, these characteristics cover
different aspects of the language under consideration, e.g. morphology, syntax,
lexical variation, psycholinguistic frequency counts, or even discourse-related
features. There is much work on using such features to predict the complexity
of texts. These features have also been used in the context of language
acquisition to predict the readability of texts or to analyze the language of
learners across proficiency levels. We will make use of complexity features in
order to predict the difficulty of C-Tests. Two languages will be investigated
and compared to each other in terms of the impact of feature types on the test
difficulty.
This thesis will investigate the difficulty of C-Tests based on data provided
by a German university’s language learning center. The main question that
underlies this thesis is the following: How can we best predict the difficulty
of single gaps and whole texts? This implies the analysis of test takers’
performances and further leads to the investigation of linguistic
characteristics within tests. These characteristics can be found on lexical,
sentence, and text level and will vary across languages with different
properties. It will be considered which features have a higher impact on the
test difficulty given the nature of the languages under consideration: English,
as a more analytic language, as opposed to Spanish with a richer morphology.
Section 2 will give a brief overview of language testing in general and present
the C-Test together with the criticism it has received. Furthermore, the
concept of automated complexity analysis and corresponding research is
presented. This is needed in order to understand the existing work on the
difficulty prediction of C-Tests: We will present in detail the work of
Svetashova (2015) and Beinborn (2016), who investigated item and text
difficulty of English C-Tests. Furthermore, the dataset will be described
(Section 3) and analyzed by modeling the test takers’ performance (Section 4).
In Section 5, we present our difficulty model, focusing on the underlying
linguistic features. The machine learning experiments and the results are
explained in Section 6, followed by Section 7, in which a conclusion is drawn
and future work is suggested.
2 Background
How a learner’s language skills can be tested and how test results can be
analyzed and evaluated is a persistent question in SLA research. The following
will give an overview of different types of language tests and then describe
the C-Test in detail, focusing on both the praise and the criticism it has
received. Afterwards, research on the automated analysis of linguistic
complexity will be presented, since this is essential knowledge for
understanding related work on C-Test difficulty prediction. Difficulty
prediction on English C-Tests has been conducted by Svetashova (2015) and
Beinborn (2016), whose works represent the fundamental basis of this thesis and
are therefore described in depth.
2.1 Language Testing
Language testing is a broad term that describes the study of determining how
proficiently somebody uses a certain language. Although this main idea is true
for all language tests, there are fundamental differences with respect to the
underlying method and their purpose (McNamara, 2000).
Based on the test method, McNamara (2000) distinguishes between
paper-and-pencil language tests and performance tests. The former comprise the
more traditional tests, which are used to assess discrete points of knowledge,
such as certain grammar or vocabulary skills. In these types of tests the
response format is mostly fixed, meaning that the test takers have to choose
among a fixed set of possible solutions. Such tests have the advantage of being
efficient in terms of grading, and the scores can easily be compared across
learners. Besides grammar and vocabulary knowledge, such tests can measure
skills in reading or listening comprehension, but they cannot be used for
measuring the learner’s ability in language production. In contrast to
paper-and-pencil language tests, performance tests include acts of
communication where the focus is on language production skills. The response
format is therefore not fixed, and the grading process is built upon an agreed
rating procedure (McNamara, 2000).
In addition to the test method, language tests differ with respect to the test
purpose. Achievement tests aim at assessing individual progress and always
relate to a prior teaching goal. Thus, they give evidence on the outcome (i.e.,
achievement) of the teaching process. Proficiency tests, in contrast, do not
aim at measuring a learner’s past progress but relate to the specific purpose
of the use of the language in the future. Such tests often involve
communication tasks specific to the target language usage and try to simulate
real-world situations. Hughes (2007) defines the word ‘proficient’ as ‘having
sufficient command of the language for a particular purpose’. Furthermore,
there are placement tests, which aim at assigning the test taker to an
appropriate language course level. A challenging task with placement tests is
to design a test that suits the teaching programme of the institution. Since
the outcome of placement tests does not influence future teaching methods,
they are considered summative as opposed to formative tests. Formative
assessment can be used to formulate feedback in order to influence future
teaching, whereas summative assessment summarizes the student’s ability at a
certain time. Hughes (2007) emphasizes the importance of designing placement
tests in a way that suits the teaching programme of the institution. He argues
that placement tests are most successful if they are designed for particular
situations. Putting much effort into the construction of placement tests saves
time and effort in teaching, but many institutions still cannot afford the
expense of versatile tests given the increasing demand for placement tests.
This thesis investigates the difficulty of the C-Test, a placement test that
relies on the cloze procedure, can be generated semi-automatically, and
therefore has quite low preparation costs. Furthermore, the scoring is very
simple and comparable across learners. The following subsection will give an
overview of cloze tests in general and present the modifications leading to the
C-Test.
The implementation of language tests has changed within the last decades due to
technical progress in the whole field of SLA, leading to the emergence of new
interdisciplinary research fields such as Intelligent Computer-Assisted
Language Learning (ICALL) (Amaral and Meurers, 2011). Nowadays, the use of
computers has become standard in foreign language courses. One can hardly
imagine a language learning course without technical aids, e.g. electronic
communication platforms, electronic dictionaries, or digital presentations. Big
institutions cannot cope with the high demand for language courses without
making use of computers, especially in their placement procedures. Suvorov and
Hegelheimer (2013) describe a newly emerging field called Computer-Assisted
Language Testing (CALT). However, while the usage of computers has become the
norm, the gains of research in natural language processing (NLP) have hardly
been used for language learning and testing purposes. Meurers et al. (2010)
present a tool that makes use of NLP to visually highlight linguistic patterns
that are known to be difficult to learn. Amaral and Meurers (2011) describe the
challenge of using NLP to foster computer-assisted language learning systems
and present a system that automatically provides individualized feedback.
Chapelle and Chung (2010) describe how NLP can be used for language assessment
and speech recognition.
2.2 The C-Test
The C-Test is a widely accepted language test used for different purposes. The
following will describe the origins of the C-Test and how it is designed and
applied. Furthermore, the criticism the test has received will be presented.
2.2.1 From cloze tests to C-Tests
Cloze tests are based on the concept of reduced redundancy, which relies on the
fact that language often provides more information than what is actually
necessary to understand what somebody intends to say. According to Spolsky
(1969), the redundancy within language might seem wasteful, but it actually
helps to cope with noise appearing in the communication channel. Spolsky (1969)
further points out that the ability to cope with noise in language highly
depends on the individual’s language proficiency. A non-native speaker might
need the ‘full normal redundancy’, whereas native speakers can often perfectly
understand messages where some parts are missing due to interferences that
disturb the communication. Tests based on reduced redundancy principles
challenge language learners to restore the distorted parts. It is assumed that
the competence in restoring these parts will give evidence on the test takers’
language proficiency.
Cloze tests were originally constructed by deleting words randomly or by
deleting every nth word in a text. It has been argued that these are the
simplest approaches and that they usually lead to deletions of different parts
of speech, avoiding deletions of e.g. only articles. However, such cloze tests
have also been widely criticized. According to Alderson (1979), the results of
cloze tests relate more to tests of grammar and vocabulary and less to reading
comprehension tests. He further points out that they measure lower-order
language skills rather than higher-order language skills and that the
restoration of the gap is highly dependent on the gap’s immediate context.
Moreover, the deletion rates highly influence the test results. Bachman (1982)
therefore suggests a linguistically motivated rational deletion of selected
syntactic or cohesive items. It is concluded that this cloze variation can
measure higher-order skills including coherence and cohesion by taking into
account syntactic and discourse level relationships (Bachman, 1982). The
downside of such approaches is that it is hardly possible to generalize the
deletion procedure across tests and that each individual test therefore needs
to be analyzed separately in terms of reliability and validity. As a
consequence, Raatz and Klein-Braley (1981) aim at designing a test in which the
random deletion is modified without violating the principle of an internalized
grammar and hence without affecting generalizability. They describe six
criteria followed during the test development (Raatz and Klein-Braley, 1981):
1. shorter texts producing at least 100 items
2. no problems in choice of deletion rate and starting point
3. deletions should be an absolutely representative sample of the elements of the
text
4. it should not favor examinees with special knowledge
5. only exact scoring should be used
6. native speakers should obtain virtually perfect scores.
These criteria show that Raatz and Klein-Braley (1981) tackle problems concerning
the test design including text selection and deletion process. On top of that, they
address the problem of how such tests should be scored and how the results might be
analyzed and interpreted.
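To make the classical fixed-ratio procedure mentioned at the beginning of this subsection concrete, the following Python sketch deletes every nth word of a text. It is only an illustration under stated assumptions: the function name `make_cloze`, the parameters `n`, `start`, and `blank`, the whitespace tokenization, and the rule of skipping tokens that are not plain words are all hypothetical choices, not taken from the cited literature.

```python
import re


def make_cloze(text, n=7, start=1, blank="____"):
    """Blank every n-th whitespace-separated token, beginning at index `start`.

    Returns the gapped text and the list of deleted words (the answer key).
    Tokenization and punctuation handling are simplifying assumptions.
    """
    tokens = text.split()
    answers = []
    for i in range(start, len(tokens), n):
        # Keep trailing punctuation visible; skip tokens that are not plain words.
        match = re.fullmatch(r"(\w+)(\W*)", tokens[i])
        if match is None:
            continue
        word, trailing = match.groups()
        answers.append(word)
        tokens[i] = blank + trailing
    return " ".join(tokens), answers


sample = "The quick brown fox jumps over the lazy dog near the old river bank today."
gapped, key = make_cloze(sample, n=5, start=4)
print(gapped)
print(key)
```

Changing `n` or `start` changes which words are deleted, which illustrates why the choice of deletion rate and starting point is a concern for this test type.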
2.2.2 Test design across languages
A C-Test, as described by Grotjahn (2002), typically consists of four to six
short texts on different topics. This prevents the test results from merely
reflecting the learner’s familiarity with a certain topic. Each short text
starts with a complete sentence in order to provide the learner with some
context. Starting from the second sentence, the second half of every second
word is deleted. If the word to be gapped consists of an odd number of letters,
the larger part of the word is deleted. After a predefined number of gaps, the
text ends with a short untouched part that again provides some non-mutilated
context. This predefined number usually varies from 20 to 25.
In this thesis, we will use the following terms and notations: A single word that
contains a gap is called an item. The blanked part of the correct solution is enclosed in brackets. The
following gives an example:
• item: diffi[culty]
• base: diffi
• ending: culty
• intended word: difficulty
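The gapping rule and the notation above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names `split_item` and `render_item` are hypothetical, and special cases such as one-letter words, numbers or proper names are not handled.

```python
from typing import Tuple


def split_item(word: str) -> Tuple[str, str]:
    """Return (base, ending): the visible part and the blanked part of a word."""
    # Floor division: for an odd number of letters the ending is the larger part.
    cut = len(word) // 2
    return word[:cut], word[cut:]


def render_item(word: str) -> str:
    """Render a word in the item notation used here, e.g. diffi[culty]."""
    base, ending = split_item(word)
    return f"{base}[{ending}]"


print(render_item("difficulty"))
```

For a word with an odd number of letters such as "which", the sketch keeps two letters as the base and blanks the remaining three.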
In contrast to usual cloze tests, the number of acceptable solutions for a C-Test gap is
low. Only rarely should a semantically and grammatically correct restoration
other than the original solution be considered correct (Grotjahn, 2002). In
cases such as ”in Ju__”, one should accept both solutions: ”in June” and ”in July”. One
should differentiate between the following three types of multiple solution occurrences:
• A gap’s alternative solution should be accepted across texts.
E.g., theatre, theater (British and American English spelling variants).
• A gap’s alternative solution should be accepted only within a specific text.
E.g., snowfall, snowstorm. The words snowfall and snowstorm might be inter-
changeable in one text, but not in the other, where a snowfall is much lighter
than a storm.
• A gap’s alternative solution should be accepted only for this specific gap.
E.g., June, July. In the context ”in July” one should also accept June. However,
in the context ”Fourth of July” one could argue that a learner should know that
only ”July” and not ”June” should be accepted.
Grotjahn (2002) suggests working with predefined lists of acceptable solutions. Another
strategy would be to avoid multiple solution variants by reducing the number of
blanked letters. This is not recommended since it reduces the difficulty at the same
time.
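One possible way to organize such predefined lists along the three scopes distinguished above (across texts, within a specific text, for a specific gap) is sketched below. The data layout, the identifiers `text_a` and `text_b`, the gap index and the function name `is_accepted` are all hypothetical and serve illustration only; they are not taken from Grotjahn (2002).

```python
from typing import Dict, Set, Tuple

# Hypothetical lists of acceptable alternatives, keyed by the intended word.
GLOBAL_VARIANTS: Dict[str, Set[str]] = {"theatre": {"theater"}}
# Alternatives valid only within one text: (text id, intended word).
TEXT_VARIANTS: Dict[Tuple[str, str], Set[str]] = {
    ("text_a", "snowfall"): {"snowstorm"},
}
# Alternatives valid only for one gap: (text id, gap index).
GAP_VARIANTS: Dict[Tuple[str, int], Set[str]] = {
    ("text_b", 7): {"June"},
}


def is_accepted(answer: str, intended: str, text_id: str, gap_id: int) -> bool:
    """Check an answer against the intended word and all applicable variant lists."""
    accepted = {intended}
    accepted |= GLOBAL_VARIANTS.get(intended, set())
    accepted |= TEXT_VARIANTS.get((text_id, intended), set())
    accepted |= GAP_VARIANTS.get((text_id, gap_id), set())
    return answer in accepted


print(is_accepted("June", "July", "text_b", 7))
print(is_accepted("June", "July", "text_b", 8))
```

Keeping the three scopes in separate lookups makes it explicit which alternatives generalize across texts and which have to be decided gap by gap.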
The presented rules for developing C-Tests seem very fixed and language independent.
However, these construction rules raise language-dependent issues. Grotjahn
and Tönshoff (1992) present the following language-specific phenomena:
• apostrophes: Words containing apostrophes are often considered as one word
instead of as multiple words. In Italian, prepositional articles and the following
nouns are merged by an apostrophe, e.g. ”dell’anno”. If this construction is treated
as one test item, the learner needs to restore a whole noun.
• compounds: The components of German compounds are usually not separated
by whitespace. In this way, whole stems can be blanked, making a restoration
very hard.
• enclitic personal pronouns: In some Romance languages, multiple personal
pronouns are attached to the verb’s ending. E.g., the Italian verb ”regalarglielo”
(to give it to him) would be difficult to restore when it is blanked.
• grapheme combinations: Grapheme combinations can represent single phones.
If a word’s non-blanked part ends with a grapheme that is part of such a polygraph
combination, it could mislead the reader, since inner phonation often serves as a
restoration strategy. E.g., ”luglio” (Italian), where ”gl” is a polygraph.
As described, there often exist various options to handle such phenomena. The
challenge is to cope with them without altering the fixed C-Test rules too much to keep
the test results as generalizable and comparable as possible.
2.2.3 C-Test criticism
As mentioned in the beginning of this section, the development of the C-Test solved
some widely criticized issues that came up with cloze tests. The C-Test rules clearly
define the deletion rate and the starting and ending point of the deletions, which was
not the case for cloze tests. The fact that a C-Test is made up of several short texts
addressing different topics further reduces the problem of topic related distortions.
Moreover, it was shown that it is possible for native speakers to gain full C-Test scores.
An advantage of the C-Test as a placement test is that the restoration of gaps involves
different levels of language. Following a top-down processing approach, the learner can
infer the solution given the text’s topic, or information about persons, objects or places
involved in a sentence or text passage. In addition to such contextual clues, the learner
can follow ”grammatical, syntactical, lexical, semantic, collocational, [...] pragmatic,
logical, situational clues (and no doubt many others)” (Klein-Braley, 1985). Thus, being
able to use information on different levels of language leads to a successful restoration
of the gaps. This information can cover smaller or larger scopes within a text. As for
sentences or clauses, a reader could for example derive from the syntactic structure
that a gapped verb must be a past participle and use this knowledge to fill the gap.
As C-Tests are usually based on real-life texts, they can be considered authentic to
a certain extent (Klein-Braley, 1985). However, in general it is difficult to produce
authentic tests since there exist factors such as test anxiety which are not given in
real-life communications. According to McNamara (2000), test performances are only
indicators of how a learner would perform in a similar real world situation. McNa-
mara (2000, p. 8) emphasizes the importance of differentiating between the criterion
(behaviour in the target situation) and the test itself.
The validity and reliability of C-Tests have been widely investigated. Although the C-
Test has been criticized for measuring only reading ability, it has been shown that solving
a C-Test also involves other language skills. Eckes and Grotjahn (2006) compared
C-Test results to German learners’ performance on a complex language test which
involves exercises in reading, writing, listening and speaking. The study shows that
the C-Test is able to measure the same general dimension as the more wide-ranging
test does. In placement tests it is important to get a general estimate of the learner’s
proficiency throughout different target real-life situations irrespective of the learner’s
competence in specific language skills (Eckes and Grotjahn, 2006). Babaii and Ansary
(2001) show similar results by comparing English C-Test results to TOEFL results.
They further investigated gap restoration strategies by collecting retrospective protocols
of the participants. These protocols indicate that the participants follow four different
strategies, namely automatic processing, lexical adjacency, sentential cues and top-
down cues. This also reveals that the C-Test does not only require local micro-level
processing strategies, but also macro-level processing (Eckes and Grotjahn, 2006). This,
in turn, is important for measuring general language proficiency. It has been
demonstrated that C-Test results correlate not only with other language tests, but also with
external criteria such as school grades and teachers’ estimates (Cohen et al., 1984).
Dörnyei and Katona (1992) criticize the fact that the reliability of C-Tests has been widely
investigated while the reasons for this success remain unclear, i.e. it has been shown
that the test is reliable, but not why and how it works.
This thesis compares the item-level and overall difficulty of English C-Tests with that of Spanish
C-Tests. Most of the literature investigates C-Tests for learners of English as a foreign
language. It should be mentioned that the validation of English tests cannot simply be
transferred to other languages. For example, words in Hebrew are morphologically richer than
in English. Nevertheless, Cohen et al. (1984) show that the C-Test is also reliable and valid
for Hebrew. They further show, however, that Hebrew C-Tests correlate more strongly with
grammar tests on verbal inflections and noun-adjective agreement than with reading
comprehension exercises. This is caused by the typological properties of Hebrew as a
synthetic language. Since Spanish is also more synthetic than English, one could expect
similar results for Spanish. Affixation is more prominent and complex in Spanish than
in English.
In summary, the C-Test is mostly considered to be reliable and valid for the measure-
ment of general language proficiency. The test is furthermore very efficient, meaning
that the costs for test construction and scoring are very low in contrast to other lan-
guage tests.
2.3 Linguistic Complexity
In order to understand what makes a test for language learners easy or difficult one
should take into account what makes the language under consideration in general more
complex. The exploration of linguistic complexity involves the comparison of linguistic
patterns across texts that are assumed to differ in complexity. Examples
of such texts are schoolbook texts addressing different learner levels. One could
investigate at which class level or in which schoolbook section particular syntactic patterns or
vocabulary entries are introduced. Furthermore, texts produced by language learners at
different levels of proficiency could be investigated to learn about the linguistic complexity
of native texts as well. Vajjala and Meurers (2012) have shown that measures from SLA
research, which have proven informative for the proficiency classification of learners,
can be used to improve the readability classification of native-language texts.
Readability measures are traditionally surface based and consist of formulas which
only include letter, syllable or word counts. Such measures have been criticized since
they do not take into account deeper linguistic structures. Technological advances in
computational linguistics and the growing availability of data made much deeper lin-
guistic analyses possible. McNamara et al. (2014) developed Coh-Metrix, a tool to au-
tomatically evaluate English text and discourse by computing a broad range of metrics,
e.g., traditional readability indices or more fine-grained measures based on theoretical
constructs such as cohesion and coherence, lexical diversity, syntactic analyses or la-
tent semantic analysis. Todirascu et al. (2013), Vajjala and Meurers (2012), Hancke
et al. (2012), and Weiß (2015) describe features for automatically measuring linguistic
complexity in the context of language learning. Chen and Meurers (2016) present a
web-based Common Text Analysis Platform (CTAP) which was built to strengthen re-
search collaboration by enabling researchers to share their feature implementations and
by providing an interface for non-programmers to manage their corpora and flexibly
choose feature sets corresponding to their purpose.
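To make the notion of a traditional surface-based readability formula concrete, the following sketch computes the Automated Readability Index (ARI), which relies only on character, word and sentence counts. The ARI is chosen here purely as an example of this family of formulas; it is not singled out by the works cited above. The tokenization with regular expressions is a simplifying assumption.

```python
import re


def automated_readability_index(text: str) -> float:
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    # Simplifying assumptions: words are runs of letters, digits or apostrophes,
    # and sentences end with '.', '!' or '?'.
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


print(automated_readability_index("The cat sat. The dog ran."))
```

The formula sees nothing beyond counts, which illustrates why such measures are criticized for ignoring deeper linguistic structure.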
The indices for measuring linguistic complexity which are described in the mentioned
works differ from each other in terms of the locality they span. Indices on word level of-
ten provide information about morphological properties such as derivation or inflection.
They can also be based on lexicon look-ups of psycholinguistic databases that provide
information about the word’s age of acquisition (Kuperman et al., 2012). Furthermore,
word level features investigate lexical density and lexical variation.
On sentence level, syntactic complexity is measured by investigating certain syntac-
tic constructions and their frequencies such as the number of clauses or constituents
within the sentence. These features will be presented in more detail in the context of
related work in C-Test difficulty prediction in Section 2.4. Weiß (2017) made use of
features that are based on Gibson’s (2000) Dependency Locality Theory (DLT) in order
to assess the proficiency of learners of German as a second language. DLT emerged in
the field of Human Sentence Processing and is based on the idea of a cost that arises
when two elements in a sentence need to be integrated. According to DLT, this cost
depends on the locality of the two elements in terms of the distance between them. We
will describe these features in more detail in Section 5.2. Applying the DLT features to
two learner corpora resulted in the observation that integration costs increase with increasing
learner proficiency. Weiß (2017) emphasizes that this observation is consistent
across different tasks included in the corpus.
On text level, the indices can measure discourse related features often involving as-
pects of cohesion and coherence. Todirascu et al. (2013) describe measures for French
based on parts of speech, where they explored the number of referring expressions
within a text. They found that a lower number of personal pronouns per sentence
increases the text’s difficulty, whereas a higher number of definite articles per
text makes the text more readable. They also included entity coherence and density
indices, and investigated manually annotated reference chains throughout sentences.
Graesser et al. (2003, p. 2) state that in general a ”text is less coherent when there are
many conceptual and structural gaps in the text, and the reader does not possess the
knowledge to fill them”. This quote does not refer to actual C-Test gaps, but illustrates
the connection between a lower coherence of a text and the reader’s ability to cope with
it. One can therefore assume that coherence related features can be used to predict
the difficulty of C-Tests, where actual gaps are used to decrease the text’s coherence.
Beinborn (2016) and Svetashova (2015) follow this assumption and further show how
complexity features can be adapted to the difficulty prediction of C-Test items. These
two works will be described in the following.
2.4 Related Work on C-Test Difficulty Prediction
One of the first attempts to find measures for determining a C-Test’s difficulty was
reported by Klein-Braley (1984). She focused on the investigation of single C-Test text
passages to ensure a reasonable selection of text passages. She developed measures
based on sentence length and Type-Token-Ratio and reports that these measures can
be used to predict the difficulty of C-Tests. A C-Test’s difficulty is often measured
by dividing the number of all erroneously entered gaps by the number of gaps in total
(mean error rate). Beinborn et al. (2014) criticize this measure since it does not capture
any information about single gaps:
”In the extreme case, half of the gaps can be solved by all learners and the
other half by almost no one. The test is then assigned a medium difficulty,
but the results are not useful for discrimination between learners.” (Beinborn
et al., 2014).
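The extreme case described in this quote can be reproduced with a small sketch. The response matrices below are invented toy data, and the function names are hypothetical; the point is only that two tests with the same mean error rate can behave very differently on the level of single gaps and single learners.

```python
def mean_error_rate(responses):
    """responses[l][g] is True if learner l solved gap g correctly."""
    total = sum(len(row) for row in responses)
    errors = sum(not correct for row in responses for correct in row)
    return errors / total


def item_error_rates(responses):
    """Error rate of every single gap across all learners."""
    n_learners = len(responses)
    n_gaps = len(responses[0])
    return [sum(not row[g] for row in responses) / n_learners for g in range(n_gaps)]


def learner_scores(responses):
    """Number of correctly solved gaps per learner."""
    return [sum(row) for row in responses]


# Toy data (invented): 4 learners, 4 gaps.
# Test A: two gaps are solved by everyone, two gaps by no one.
test_a = [[True, True, False, False] for _ in range(4)]
# Test B: every gap is solved by exactly half of the learners.
test_b = [[True] * 4, [True] * 4, [False] * 4, [False] * 4]

for name, test in (("A", test_a), ("B", test_b)):
    print(name, mean_error_rate(test), item_error_rates(test), learner_scores(test))
```

Both toy tests have a mean error rate of 0.5, but in test A every learner obtains the same score, so the test cannot discriminate between learners.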
Beinborn et al. (2014), Beinborn (2016) and Svetashova (2015) therefore go further
by investigating not only the difficulty of whole texts but also the difficulty of single
C-Test items. Recent advances in Natural Language Processing (NLP) and the
increased availability of language data and other resources allow for the development of a
great number of features on item, sentence and text level. The following presents
these works with a focus on Svetashova (2015), which forms the theoretical and
technical groundwork of this thesis.
2.4.1 Beinborn et al. (2014) and Beinborn (2016)
Beinborn et al. (2014) developed a difficulty model including features based on four
concepts: solution difficulty, candidate ambiguity, inter-gap dependency and paragraph
difficulty. Solution difficulty and candidate ambiguity are processed on a so-called
micro-level, where only the direct context of a gap is taken into account to estimate its
difficulty. The solution difficulty describes how likely it is for the student to know the
solution, e.g. based on the word’s frequency or morphological complexity. On the macro-
level, the inter-gap dependency (e.g., the impact of the difficulty of preceding gaps on the
current gap) and paragraph difficulty (including readability features) influence a gap’s
difficulty.
Four different C-Tests were conducted as placement tests for university students.
The authors analysed the answers of at least 140 participants per test and show that the variance
of error rates is very high within single paragraphs. Furthermore, the answer variety
increases with higher error rates, showing that difficult gaps lead to various mistakes
rather than to one typical mistake.
Their regression results show that solution difficulty has a high impact on the gap’s
difficulty, followed by the paragraph’s difficulty. Both feature groups have widely been
researched, while the other two perform worse. However, feature selection resulted in a
set of features from all four categories. Table 2.1 lists the top 21 features which resulted
in a leave-one-out testing accuracy of 57% on the whole data.
To compare their results to the performance of human experts, they conducted an
experiment with three English teachers. The teachers were asked to annotate 20 texts
by assigning a difficulty category to each item. Figure 2.1 lists these categories and the
corresponding error rates. The results show that approximately 50% of the predictions
were correct, which is on the same level as their automatic approach. Furthermore,
their annotation experiment showed a very low inter-annotator agreement (0.36 Fleiss’
Table 2.1: The 21 most predictive features grouped by difficulty dimension (Beinborn et al., 2014, p. 526)
Kappa). All three teachers agreed with each other and were actually correct in only
25.3% of the cases.
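For readers unfamiliar with the agreement measure, the following sketch computes Fleiss’ Kappa from a table of rating counts using the standard definition. The input layout, the function name and the toy ratings are assumptions made for illustration; they are not the data of Beinborn et al. (2014).

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters who assigned item i to category j.

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Observed agreement per item, averaged over all items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items
    # Agreement expected by chance from the overall category proportions.
    p_categories = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in p_categories)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data (invented): 3 raters, 4 difficulty categories, 2 items.
print(fleiss_kappa([[3, 0, 0, 0], [0, 3, 0, 0]]))  # full agreement
print(fleiss_kappa([[2, 1, 0, 0], [1, 2, 0, 0]]))  # partial agreement
```

Values close to 1 indicate agreement well above chance, whereas values around or below 0 indicate agreement at or below chance level.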
Figure 2.1: The difficulty continuum described by Beinborn et al. (2014, p. 524).
Beinborn’s dissertation aims at the difficulty prediction and manipulation of different
text-completion exercises including C-Tests for English, German and French (Beinborn,
2016). The implemented C-Test difficulty features can again be grouped into the four
dimensions described in Beinborn et al. (2014). The regression results show that so-
lution/word difficulty is the most predictive feature group for all three languages. In
general, the micro-level features work better than the features on the macro-level. The
difficulty prediction for the English and French data works even better after
removing the macro-level features. Beinborn (2016) concludes that these findings give
evidence for the assumption that a gap’s difficulty is mostly determined by itself and
its direct context. As in Beinborn et al. (2014), features from all dimensions are present
after feature selection. This is the case for all the languages under consideration. How-
ever, there are twice as many features from the micro-level dimensions.
Beinborn (2016) further presents an error analysis that provides insights into the
data. The difficulty assigned to named entities turned out to be too high since named
entities might be familiar to learners but have not been integrated into the model.
Under-estimation happened in cases where candidate answers are more frequent and
simpler than the actually correct answer. This should be captured by candidate am-
biguity features, which apparently did not get enough weight. The same happens for
spelling difficulties as in words like ”of” and ”off”, which seem very simple but often
lead to an erroneous spelling.
With regard to the multilingual perspective of this work, the following findings of
Beinborn (2016) are important. The difficulty prediction for English
is worse than for the other two languages. Beinborn (2016) mentions that this might
be caused by the comparably large candidate space given for short English words in
comparison to short words in French and German. A possible explanation why word
frequency features in particular are not as predictive for English is the substantial
presence of the English language in today’s everyday life, leading to a vocabulary
knowledge which is often domain-specific.
2.4.2 Svetashova (2015)
Svetashova (2015) aims at predicting the difficulty of C-Test items focusing on their
linguistic properties. Information about the difficulty of C-Test items and whole text
passages can then be used for an appropriate selection of texts. The data under consideration
were provided by the Language Learning Center of the University of Tübingen.
Svetashova (2015) presents in detail the underlying technical steps and theoretical foundations.
She tested various machine learning approaches and investigated a broad range
of variable combinations. It is important to mention that the following does not summarize
her work, but rather gives insight into five steps that are ultimately relevant for this
thesis.
Step 1: Model the learners’ performance. The first step was to extract relevant
performance statistics from five C-Tests, where each test is made up of five scored
texts and one calibration text. Twenty distinct texts were processed by more than
150 participants, resulting in 7725 available answers. More information about the data
is given in Section 3. For each item the percentage of correctly inserted answers was
calculated. The proficiency levels (A1-C2) were available as final test scores. This
information was used to generate further variables representing the percentage of cor-
rectly inserted answers within a single proficiency level. Additionally, variables that
evolved from Item Response Theory (IRT) have been investigated. All answers that
did not exactly match the correct solution were considered as incorrect. This can also
include non-erroneous solutions such as spelling variants or incorrect encoding of the
apostrophe, which would not be considered an error by human correctors.
The following summarizes the information given in the performance statistics:
• percentage of exact matches
• percentage of exact matches by proficiency level (A1-C2)
• difference from the mean text passage difficulty
Table 2.2: This table is taken from Svetashova (2015). It shows difficulty estimates for occurrences of the words ”in” and ”the” and the difference from the mean (overall text difficulty).
• IRT difficulty estimate
• IRT discrimination value
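The first two statistics in this list can be sketched as follows. The input format (pairs of proficiency level and answer string for a single item), the function name and the toy answers are assumptions for illustration; the IRT estimates are not covered by this sketch.

```python
from collections import defaultdict


def exact_match_rates(answers, solution):
    """Percentage of exact matches for one item, overall and per proficiency level.

    answers: list of (proficiency_level, answer_string) pairs.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    per_level = defaultdict(lambda: [0, 0])  # level -> [correct, total]
    for level, answer in answers:
        per_level[level][1] += 1
        # Exact matching: spelling variants are counted as incorrect.
        if answer == solution:
            per_level[level][0] += 1
    n_correct = sum(correct for correct, _ in per_level.values())
    overall = 100.0 * n_correct / len(answers)
    by_level = {level: 100.0 * c / t for level, (c, t) in per_level.items()}
    return overall, by_level


# Toy answers (invented) for an item whose intended word is "theatre".
toy_answers = [("A2", "theatre"), ("A2", "theater"), ("B1", "theatre"), ("B1", "theatre")]
print(exact_match_rates(toy_answers, "theatre"))
```

In the toy data, the spelling variant "theater" lowers the A2 rate although a human corrector would probably not count it as an error.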
Usually, the difficulty of an item is defined by the ratio of incorrectly inserted answers
to all answers. Beinborn et al. (2014) split the difficulty continuum into four levels
according to the error rate. Svetashova (2015) developed a new method to group
C-Test items into clusters, including two dimensions: The difficulty of the item itself
(percentage of correct answers for that item) and the overall difficulty of the text passage
(the mean text difficulty, MTD). It was found that words which are very similar to each
other can be of different difficulty depending on the MTD. Table 2.2 shows that
the item ”in” was filled in correctly by only 64.36 % of the test takers in text passage
590, whereas in text passage 524 more than 93 % got the correct answer. Thus, in
this case one cannot predict the item’s difficulty by considering it on the word level.
The table further shows the MTD of both texts, which differ highly from each other:
39.79 % for text 590, 70.26 % for text 524. Subtracting these MTD values from the
percentage of exact matches results in very similar values for the items across text
passages. Thus, the difference from the mean in these cases captures the similarity of
the same words occurring in different texts.
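The arithmetic behind this observation can be retraced with the values quoted above. In the sketch below, the value 93.0 for text passage 524 is only a lower bound standing in for "more than 93 %", and the function names are hypothetical.

```python
from statistics import mean


def mean_text_difficulty(item_pcts):
    """MTD: mean percentage of exact matches over all items of a text passage."""
    return mean(item_pcts)


def difference_from_mean(item_pct, mtd):
    """How much easier (positive) or harder (negative) an item is than its text."""
    return item_pct - mtd


# Values for the item "in" as quoted in the text.
diff_590 = difference_from_mean(64.36, 39.79)
# 93.0 is a lower bound: the text only states "more than 93 %".
diff_524 = difference_from_mean(93.0, 70.26)

print(round(diff_590, 2), round(diff_524, 2))
```

Although the raw percentages differ by almost 30 points, both differences from the mean lie above 20 points and are within a few points of each other.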
Svetashova (2015) combined different performance variables in order to find out which
items can be grouped together according to similarities. Hierarchical Clustering on
Principal Components (HCPC) was performed to investigate possible groupings. She
introduces the term performance profile, which describes a clustering principle or, in
other words, a combination of performance variables.
The following part will describe features that are linguistically motivated or
otherwise connected to the C-Test literature. Machine learning algorithms are trained on
these features in order to model the difficulty. In the end Svetashova (2015) checks
which features can best predict which performance clusters.
Step 2: Model the tests’ difficulty. The information about an item’s difficulty
can span different levels of locality. Svetashova (2015) describes features on text,
sentence, and item level:
• Text Level. The textual features include traditional surface-based readability
features and more sophisticated readability features described by Vajjala
(2015). Furthermore, she included indices which measure lexical complexity and
which are based on different linguistic resources. Examples of such features
are Lexical Frequency Profile measures. These measures are based on lists that
group words into different bands of vocabulary frequencies. A list which contains
word families that are especially common for academic use was also investigated.
Furthermore she included lexical density, richness, sophistication and variation
measures, which were originally developed by Lu (2012) in the context of Second
Language Acquisition. Syntactic properties of the whole text have also been
analysed: These features are based on the counts of production units such as
phrases, clauses, sentences, complex nominals, or t-units.
• Sentence Level. The features on sentence level mainly address syntactic complexity
based on counts of parse tags generated by both a dependency parser
and a constituency parser. The depth of the parse tree and the log probability of
the parse tree (parse score) have also been used as features.
• Item Level. The item level features represent the broadest class. They capture
linguistic properties (LING) of the item based on morphological as well as syntac-
tic information obtained from part-of-speech (POS) tagging, named entity recog-
nition, parsing and a morphological analysis tool. A semantic feature encodes
whether the word has manually been annotated as topic specific. A further fea-
ture encodes whether the item has orthographic variants. In addition, the LING
features group contains surface-based features measuring the length of an item,
its base and its ending. The psycholinguistic features (PSY) are mostly based on
frequency lists and on lists generated within psycholinguistic experiments. Some
reveal information about a word’s familiarity, imageability, or stylistic category,
some are based on age-of-acquisition (AoA) ratings. She also investigated term
frequency–inverse document frequency (TFIDF) measures, which originate from
information retrieval and indicate the importance of a word within a collection
of documents. The context and candidate space features (CONT) form
another group of item-level features. These can, for example, show how likely
other solutions are in the given context (1- to 5-grams). Such n-gram probabilities
have been extracted from a corpus containing about one hundred billion tokens.
The likelihood of POS sequences has also been investigated. A last subset of the
item level features addresses the positioning (POSIT) of the item itself. These
features are motivated by the assumption that the position of the item in the text
passage can influence its difficulty. The features indicate, for example, whether
an intended word occurs in the non-gapped introductory sentence, or whether it
occurs earlier or later in the passage.
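As an illustration of one of the item-level measures named above, the following sketch computes a TFIDF value for a word in a tokenized document. It uses one common variant (relative term frequency multiplied by the logarithm of the inverse document frequency); which variant Svetashova (2015) used is not stated here, so the exact weighting is an assumption, as are the function name and the toy corpus.

```python
import math
from collections import Counter


def tf_idf(term, document, corpus):
    """TFIDF of a term in a tokenized document, relative to a tokenized corpus."""
    # Relative frequency of the term within the document.
    tf = Counter(document)[term] / len(document)
    # Number of documents in the corpus that contain the term.
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


# Toy corpus (invented): three tokenized text passages.
corpus = [["the", "snow", "storm"], ["the", "theatre"], ["the", "snow"]]

print(tf_idf("the", corpus[0], corpus))      # occurs in every document
print(tf_idf("theatre", corpus[1], corpus))  # occurs in one document only
```

A word that occurs in every passage receives a weight of zero, while a word that is specific to one passage receives a high weight.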
Step 3: Choose the best performance profile. One of the main findings of Svetashova
(2015) concerns the way in which the performance data were modeled. Using Hierarchical
Clustering, she clustered the data in 2200 different ways and tried to find out which
clustering best represents the performance in terms of correlation with the full difficulty
feature set model (265 features). This was done by following two machine learning
approaches: training a Support Vector Machine (SVM) classifier and a Random Forest
(RF) classifier. The performance profile that performed best combines variables of
three types: ”1) the log odds ratio to insert an item correctly for the test takers of
different proficiency levels, 2) the text passage average of correct insertions by the test
takers of different proficiency levels (measured as logits) and 3) the IRT model difficulty
estimates for each item” (Svetashova, 2015, p. 89). Figure 2.2 shows the variable factor
map for the best performing performance profile. The item-based irtDifficulty variable
and the logitA1-C2 variables negatively correlate with each other. With these variables
we can imagine a line that indicates the item difficulty from difficult (top left) to easy
(bottom right). The variables denoting performance on text passages (logitTextAver)
are perpendicular to the item-based ones. We can imagine a line from the bottom left
to the top right with ”items which occur in text passages with decreasing difficulty”
(Svetashova, 2015, p. 90). Interpreting these findings, she groups the items into the
following four clusters visualized in Figure 2.3 (p. 18):
1. IdTd – difficult items in difficult text passages
2. IdTe – difficult items in easy text passages
3. IeTd – easy items in difficult text passages
4. IeTe – easy items in easy text passages.
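The two dimensions behind these four labels can be illustrated with a deliberately simplified sketch. Note that it does not reproduce the clustering procedure of Svetashova (2015): instead of Hierarchical Clustering on Principal Components, it applies a plain threshold to the log odds (logits) of a correct insertion. The threshold of 0 (that is, 50 % correct) and the function names are arbitrary assumptions.

```python
import math


def logit(p, eps=1e-6):
    """Log odds of a proportion p, clipped to avoid division by zero."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def two_dim_label(item_p_correct, text_p_correct, threshold=0.0):
    """Combine an item label (Id/Ie) with a text label (Td/Te).

    Simplification: a fixed logit threshold replaces the clustering step.
    """
    item = "Ie" if logit(item_p_correct) > threshold else "Id"
    text = "Te" if logit(text_p_correct) > threshold else "Td"
    return item + text


# Proportions for the item "in" in text passage 590 as reported above.
print(two_dim_label(0.6436, 0.3979))
```

Under this simplification, the occurrence of "in" in text passage 590 would count as an easy item in a difficult text.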
The classification based on these four classes (referred to as 2DimClustering) sub-
stantially outperformed the classification based on the fourfold difficulty continuum
split (referred to as EqualRanges). As an example, the RF classification resulted in
a micro-averaged F1 score of 0.7959 using 2DimClustering, and 0.4747 using Equal-
Ranges. Svetashova (2015) experimented with both clustering approaches in order to
be able to compare her results to previous work in the field. The results for the fourfold
continuum split (F1-scores around 0.49 for RF) are comparable to those of Beinborn
et al. (2014), who report an SVM leave-one-out cross validation accuracy of 0.46.
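When reading micro-averaged F1 scores next to accuracy values, it may help to recall that for single-label classification, where every item receives exactly one class, the micro-averaged F1 score coincides with accuracy. The sketch below shows this on invented toy labels; the function names are hypothetical.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1: pool true/false positives and false negatives over classes."""
    tp = fp = fn = 0
    for label in set(gold) | set(pred):
        tp += sum(g == label and p == label for g, p in zip(gold, pred))
        fp += sum(g != label and p == label for g, p in zip(gold, pred))
        fn += sum(g == label and p != label for g, p in zip(gold, pred))
    if tp + fp + fn == 0:
        raise ValueError("gold and pred must not be empty")
    return 2 * tp / (2 * tp + fp + fn)


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy labels (invented) using the four cluster names.
gold = ["IdTd", "IdTe", "IeTd", "IeTe", "IeTe"]
pred = ["IdTd", "IeTe", "IeTd", "IeTe", "IdTe"]
print(micro_f1(gold, pred), accuracy(gold, pred))
```

Every misclassified item contributes exactly one false positive and one false negative, which is why the two measures agree in this setting.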
Step 4: Investigate variable importance and feature groups. The performance
of features can be analysed in two ways: by investigating variable
importance and by evaluating the overall performance of distinct feature groups.
Figure 2.4 shows the results of extracting the variable importance after the RF and
SVM classification. The two most important features in RF were item-level features
based on the term frequency–inverse document frequency. The unigram log probability,
the number of candidate unigrams with a higher log probability and information about
topic specificity closely follow as highly predictive features. Figure 2.4 shows how
many features out of each feature group were part of the top 50 features. The features
Figure 2.2: The variable factor map for the performance profile that worked best. The variables logitA1-C2 and irtDifficulty denote performance on items, whereas the logitTextAver variables denote performance on text passages (Svetashova, 2015, p. 90).
that performed best were those on text-level (especially lexical complexity) and item-
level (especially candidate space features and psycholinguistic characteristics).
The performance of single feature groups and feature group combinations has also
been tested and the results are summarized in Table 2.3 for both machine learning
algorithms. The best group performance is achieved by the combination of text-level
and item-level features for SVM (mic. F1-score 0.7959) as well as for RF (0.7347).
This combination even outperforms the total feature set. She further
showed that reducing the number of features from 265 to 30 does not lead to a
substantial performance loss.
Svetashova (2015) further carried out experiments on unseen data by training the
RF model on the whole dataset and testing it on 6 unseen texts that had been initially
excluded due to a relatively low number of participants. The classification led to a
cross validation accuracy of 0.8088. It turned out that textual features are comparable
within the group of items in difficult texts (IdTd, IeTd) and within the group of items in
easy texts (IdTe, IeTe). Correspondingly, the item-based features are similar within the
group of difficult items (IdTd, IdTe) and within the group of easy items (IeTd, IeTe).
Step 5: Drawing a conclusion
Svetashova (2015) concludes that these findings are relevant for the actual
generation of C-Tests and that the pipeline could be integrated into a user
application that supports language teachers in inspecting item and text difficulties.
The experiments on the unseen data in particular give evidence that the findings are
generalizable. As a step towards interpreting the results, she presents the prediction
results for one test passage, visualized in Figure 2.5. The underlying text was
classified as easy. The colors indicate whether the single items were classified as
easy (green) or difficult (red). The number labels represent the percentage of
participants who actually filled the gap correctly. The only item that was incorrectly
classified as easy is “one”, which was filled in correctly by only 23.21% of the
participants.
Figure 2.3: The individuals factor map for the performance profile that worked best (Svetashova, 2015, p. 91).
Figure 2.4: This figure shows the distribution of the top 50 features for the 2DimClustering (two dimensions: text and item difficulty) for SVM and RF (Svetashova, 2015, p. 102).
Table 2.3: This table shows how the different feature groups and feature group combinations perform in classification (2DimClustering). Table taken from Svetashova (2015, p. 107).
Figure 2.5: This figure shows a text where each item is highlighted according to the prediction results (red for items predicted as difficult, green for items predicted as easy). The numbers indicate the percentage of correct matches. Figure taken from Svetashova (2015, p. 119).
She further suggests the following:
1. “text passage properties and the characteristics of the individual gaps” could be
shown to the user.
2. One could “accumulate information about the performance of the testees at
different proficiency levels, their typical errors and acquisition order of the elements
of different levels: word, sentence and broader context.”
Within this thesis, we will move a step closer to this target by integrating the
linguistic preprocessing presented by Svetashova (2015) into one single pipeline,
focusing on those features with high information gain. We will further extract the
Spanish performance data and implement a broad range of features for Spanish as well,
in order to gain insights from a multi-lingual perspective.
3 Data
The data under investigation was provided by the Language Learning Centre
(Fachsprachenzentrum, FSZ) of the University of Tübingen (see footnote 2). The FSZ
offers foreign language courses for different languages at different levels of
proficiency. The four most requested languages are English, Spanish, French, and
Italian. For these languages, the FSZ conducts C-Tests in order to assign the students
to appropriate course levels. The following first describes the testing procedure and
then presents the descriptive analysis of the data under consideration.
3.1 C-Tests at the Fachsprachenzentrum
At the FSZ, a C-Test consists of six text passages covering different topics. Only five
of the six passages are used for scoring; the remaining one is used for calibration.
Each short text starts with a header and a first sentence that does not contain any
gaps. Starting with the second sentence, the second half of every second word is
removed. For words with an odd number of letters, the removed part is the larger one.
Words consisting of one letter are ignored and never gapped. After 25 gaps, the
current sentence is completed and one more untouched sentence is added. Words that
contain hyphens are counted as two words. Additionally, there are some
language-specific tokenization rules: in English, a word that contains an apostrophe
is counted as one word, whereas in the other languages it counts as two words, e.g.:
• Peter didn’t know that. → Peter did___ know th__.
• Qu’est-ce que c’est? → Qu’e__-ce q__ c’e__?
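The gapping rule described above can be illustrated with a short Python sketch. It is not the procedure used at the FSZ (the tests were created manually); the function names, the regular expressions, and the use of underscores for removed letters are assumptions made for illustration. The sketch works on a single sentence and leaves out the untouched first sentence and the limit of 25 gaps. It also assumes that one-character words do not advance the every-second-word count, which the text does not specify.

```python
import re

LETTERS = r"[^\W\d_]+"  # one or more Unicode letters
# Assumption: English keeps apostrophes inside a word, while in the other
# languages an apostrophe ends the word ("Qu’est" becomes "Qu’" + "est").
# Hyphens are never part of a word, so hyphenated words count as two words.
WORD_PATTERNS = {
    "en": LETTERS + r"(?:['’]" + LETTERS + r")*",
    "other": LETTERS + r"['’]?",
}

def gap_word(word):
    """Keep the first half of a word; for odd lengths the removed part is larger."""
    keep = len(word) // 2
    return word[:keep] + "_" * (len(word) - keep)

def gap_sentence(sentence, language="en"):
    """Gap every second word of a sentence and leave all other characters untouched."""
    word = WORD_PATTERNS["en" if language == "en" else "other"]
    tokens = re.findall("(?:" + word + r")|[\W\d_]", sentence)
    out, position = [], 0
    for token in tokens:
        # One-character words are skipped and do not count (assumption).
        if token[0].isalpha() and len(token) > 1:
            position += 1
            if position % 2 == 0:
                token = gap_word(token)
        out.append(token)
    return "".join(out)

print(gap_sentence("Peter didn’t know that.", "en"))
print(gap_sentence("Qu’est-ce que c’est?", "fr"))
```

Under these assumptions the two calls reproduce the gap positions of the two examples above.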
The tests were generated manually by course lecturers and then conducted electron-
ically on computers at the FSZ. Each test taker has 30 minutes to complete the test.
3.2 Descriptive Analysis of the Data
The FSZ database was provided as a MySQL dump file which contains an updated
version of the database investigated by Svetashova (2015). We will first describe the
structure of the database and then consider the available content for English and
Spanish.
2 https://www.uni-tuebingen.de/en/facilities/verwaltung-dezernate/division-iii-international-affairs/language-learning-centre/termine-teilnahmebedingungen/einstufungstests/c-tests.html (Last accessed: 18/02/08)
3.2.1 Database Structure
Svetashova (2015) describes the database structure in detail. The following summarizes
this structure by listing the database’s three tables ctest, ctest texte and tmp followed
by the relevant information obtained from each table.
• ctest (tests): test id, language, ids of the six text passages, id of text that was
used for calibration
• ctest texte (text passages): text id, language, the text with gapped parts marked
by square brackets
• tmp (results): result id, student id, test id, language, answers for each gap number
(1-25), final score (0-125)
Table ctest contains information about the individual tests, each of which was taken
by multiple students. It stores the text ids of the six text passages that make up the
test and the id of the text used for calibration. Table ctest texte contains the
individual text passages as plain text, where the gapped parts are enclosed in square
brackets. Correct answers can therefore be extracted directly from the texts. The tmp
table contains information about the individual test trials, including the student id,
the answers for each gap number, and the student’s final score. The final score
represents the number of correctly filled gaps in five of the six texts and can
therefore range from 0 to 125. Svetashova (2015, p. 33) lists how this score is mapped
to the CEFR levels A1 to C2 defined by the Council of Europe (cf. Section 1, p. 1) and
to the UNIcert® course recommendation levels (see footnote 3).
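Because the gapped parts are enclosed in square brackets, the solutions can be read off a passage with a single regular expression. The following Python sketch illustrates this; the function names are hypothetical, and it assumes that a student's answers are stored as the missing word parts in gap order, which the database description does not state explicitly.

```python
import re

GAP = re.compile(r"\[([^\]]*)\]")  # text between one pair of square brackets

def correct_answers(passage):
    """Return the removed word parts in gap order (gap 1 first)."""
    return GAP.findall(passage)

def count_correct(passage, answers):
    """Count exact matches between given answers and the bracketed solutions."""
    solutions = correct_answers(passage)
    return sum(1 for given, expected in zip(answers, solutions) if given == expected)

passage = "At t[he] same ti[me], four o[ut] of fi[ve]"
print(correct_answers(passage))
print(count_correct(passage, ["he", "me", "at", "ve"]))
```

Summing such counts over the five scored passages of a test would give a score between 0 and 125, in line with the range stored in the tmp table.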
3.2.2 Available Data for English and Spanish
We extracted all available test results for English and Spanish from the database in
order to investigate the students’ performance. This was done following the method of
Svetashova (2015) (see footnote 4).
As illustrated in Table 3.1, the earlier version of the FSZ database (English 2015)
contained the results of 1399 participants who took C-Tests in English. In total, 8394
text passages were processed by these participants. Svetashova (2015) performed the
item difficulty analysis only on those texts with more than 140 participants and
therefore considered 7725 processed text passages covering 20 distinct texts. The
newer version of the database contains the answers of 4006 participants for English,
who processed 24036 text passages in total (covering 106 distinct texts). We will
perform item difficulty analysis on the whole dataset as well as on the subset of
texts that had been processed by more than 140 participants.
3 http://unicert-online.org/en/unicert-stufen
4 Yulia Svetashova kindly provided a tool to extract the English C-Test performance data from the database. We reused it and adapted it to further extract the Spanish data.
The data available for Spanish contains the performances of 1570 participants, who
processed 9420 text passages in total. These numbers are comparable to those of the
English 2015 version, except for the number of distinct texts: compared to the English
2015 data, the Spanish data contains more than twice as many distinct texts (91 vs.
45), whereas the number of participants is only slightly higher (1570 for Spanish,
1399 for English 2015).
                                   English 2015   English 2017   Spanish 2017
participants                               1399           4006           1570
texts processed in total                   8394          24036           9420
  distinct texts                             45            106             91
texts processed by > 140 part.             7725          22146           8358
  distinct texts                             20             50             45
texts processed by ≤ 140 part.              669           1890           1062
  distinct texts                             15             56             46
Table 3.1: The number of participants and the number of processed texts given in the FSZ database for English (versions 2015 and 2017) and Spanish.
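The counts in Table 3.1 can be cross-checked with simple arithmetic. The Python sketch below assumes that every participant processed exactly one test of six passages (suggested by Section 3.1, but an assumption here), and the dictionary keys are hypothetical names chosen for illustration.

```python
# Values copied from Table 3.1.
TABLE_3_1 = {
    "English 2015": {"participants": 1399, "total": 8394, "over_140": 7725, "up_to_140": 669},
    "English 2017": {"participants": 4006, "total": 24036, "over_140": 22146, "up_to_140": 1890},
    "Spanish 2017": {"participants": 1570, "total": 9420, "over_140": 8358, "up_to_140": 1062},
}
PASSAGES_PER_TEST = 6  # a C-Test at the FSZ consists of six text passages

def check_row(row):
    """Return two booleans: total == 6 x participants, total == sum of both subsets."""
    return (
        row["total"] == PASSAGES_PER_TEST * row["participants"],
        row["total"] == row["over_140"] + row["up_to_140"],
    )

for name, row in TABLE_3_1.items():
    print(name, check_row(row))
```

Both checks hold for all three columns, which supports reading "texts processed in total" as six passages per participant.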
Table 3.2 shows how many students participated in each test and how well they
performed on average. For the English C-Tests with a reasonable number of test
takers (i.e., excluding tests 80 and 115), the students’ scores range from 0 to 121.
The mean scores range from 60.0 (tests 111 and 116) to 73.1 (test 90). For the 10
Spanish tests, the scores range from 0 to 116. For English as well as for Spanish,
none of the test takers achieved the maximum score of 125. The mean scores are lower
for Spanish than for English, ranging from 43.1 (test 128) to 57.5 (test 83).
4 Performance Modeling
This Section describes the performance of the test takers of Spanish and English C-
Tests.
The performance of C-Test takers can be modeled by considering different variables.
As described in Section 2.4.2, Svetashova (2015) extracted various variables from the
database. She found that variables based on item-level and text-level difficulty
across proficiency levels performed well in the subsequent clustering and machine
learning approaches. We will further add the test takers’ performance on sentence
level across proficiency levels to the performance statistics. The Java tool which
extracts this information from the MySQL database was adapted from Svetashova (2015)
(see footnote 5).
In the following we will first describe the performance data in general and then
present the clustering approach and results.
EN test id     N   Min   Max   Mean
80             5    46    74   62.4
81           432     1   121   68.6
90           373     0   114   73.1
91           486     0   120   70.4
95           422     0   117   62.6
96           487     0   120   61.1
106          374     0   117   63.4
111          444     0   111   60.0
115            3    62    78   72.0
116          426     4   111   60.0
120          428     0   113   64.2
125          126     0   106   60.5
ES test id     N   Min   Max   Mean
83           178     0   116   57.5
87           164     3   103   53.7
94           191     0   106   55.2
103          159     0   113   48.0
104          185     2   112   47.9
110          145     0   101   49.5
114          176     0   115   50.5
119          132     0    98   47.4
123          191     0   110   53.1
128           49     0    95   43.1
Table 3.2: Conducted C-Tests with the number of test takers (N) and information about their performance score (min, max, mean) for the English and Spanish data.
4.1 Performance Data Description
For each item with more than 140 participants, we extracted its text id, item id,
sent id and test id from the database. For each item, the following 21 performance
variables were inspected:
• item level:
  – percCorrect: percentage of correct insertions for the item under consideration
  – percCorrectA1, . . . , percCorrectC2: percentage of correct insertions within
    proficiency level A1, . . . , C2
• sentence level:
  – sentAv: average of item-wise percentages of correct insertions in the sentence
  – sentAvA1, . . . , sentAvC2: average of item-wise percentages of correct insertions
    within proficiency level A1, . . . , C2
• text level:
  – textAv: average of item-wise percentages of correct insertions in the text
  – textAvA1, . . . , textAvC2: average of item-wise percentages of correct insertions
    within proficiency level A1, . . . , C2
5 Yulia Svetashova kindly provided the code to extract the performance statistics from the database. She further modified the code to also extract performance information on sentence level.
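To make the definitions of the 21 variables concrete, the following Python sketch derives them from a flat list of responses. It is not the Java tool mentioned above; the record layout (keys text_id, sent_id, item_id, level, correct) and the function names are assumptions made for illustration.

```python
from collections import defaultdict

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
SUFFIXES = [""] + LEVELS  # "" stands for all participants

def mean(values):
    return sum(values) / len(values) if values else None

def performance_variables(responses):
    """responses: dicts with keys text_id, sent_id, item_id, level, correct (0 or 1)."""
    outcomes = defaultdict(lambda: defaultdict(list))  # item -> suffix -> 0/1 values
    units = {}  # item -> (text_id, sent_id)
    for r in responses:
        key = (r["text_id"], r["item_id"])
        units[key] = (r["text_id"], r["sent_id"])
        outcomes[key][""].append(r["correct"])
        outcomes[key][r["level"]].append(r["correct"])

    # Item level: percentage of correct insertions, overall and per level.
    result = {}
    for key in outcomes:
        row = {}
        for suffix in SUFFIXES:
            hits = outcomes[key][suffix]
            row["percCorrect" + suffix] = 100 * mean(hits) if hits else None
        result[key] = row

    # Sentence and text level: averages of the item-wise percentages.
    for name, unit_of in (("sentAv", lambda k: units[k]), ("textAv", lambda k: units[k][0])):
        for suffix in SUFFIXES:
            groups = defaultdict(list)
            for key, row in result.items():
                if row["percCorrect" + suffix] is not None:
                    groups[unit_of(key)].append(row["percCorrect" + suffix])
            for key, row in result.items():
                row[name + suffix] = mean(groups[unit_of(key)])
    return result
```

Each item ends up with 7 item-level, 7 sentence-level and 7 text-level values; levels without any participants are reported as None in this sketch.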
We further inspected the distribution of participants across proficiency levels for
each text. The charts in Figure 4.1 show the percentages of participants per
proficiency level. The English C-Test text passages were processed by a smaller
proportion of beginning learners (A1 and A2) than the Spanish ones. For English,
the proportion of A1 learners ranges from 4% to 16%, whereas for Spanish it ranges
from 9% to 31%. This difference is even more pronounced for A2 learners, whose
proportion ranges from 11% to 27% for English and from 24% to 46% for Spanish. Thus,
assuming that the levels A1 and A2 describe beginning learners, about half of the
Spanish C-Test participants turned out to be beginning learners.
Figure 4.1: These charts visualize the distribution of the participants’ proficiency levels for each text passage (left chart for 56 English texts, right chart for 46 Spanish texts). The proficiency levels are defined by the participants’ final C-Test score.
The distribution of the participants’ proficiency levels can also be considered item-
wise instead of text-wise. Figure 4.2 shows how the distribution across the participants’
proficiency levels changes for items with increasing difficulty in terms of percCorrect
(overall percentage of correct insertions). For both languages, the distribution across
proficiency levels is less clear for difficult items than for easy items. The proportion of
correct insertions by the more proficient learners (C2) increases with increasing item
difficulty.
Considering only those English texts with more than 140 participants, these are the
three easiest items in terms of percCorrect (percentage of exact matches):
• 466_4: This yea[r’s] outdoor coo[king] season mi[ght] be ov[er], but (97.22%)
• 584_02: The pol[ice] said o[n] Monday th[ey] were sear[ching] for (97.07%)
• 470_01: At t[he] same ti[me], four o[ut] of fi[ve] (96.99%)
Figure 4.2: These 100% stacked bar charts visualize the distribution of the participants’ proficiency levels for each item, where items are sorted by percCorrect (overall percentage of correct insertions) from easy (left) to difficult (right).
The three most difficult items in terms of exact matches (excluding those with
apostrophe encoding issues) are the following:
• 601_16: continue t[o] grow a[t] really stagg[erring] rates (0.0%)
• 586_24: i[n] the abs[ence] of t[he] firm’s br[ash] and bril[liant] co-founder (0.0%)
• 581_25: fulfi[lling] the long[held] dream of the United States (0.63%)
For Spanish, 14 of the 1125 items in total have no correct answers at all, such as
the following three:
• 648_15: pa[ra] fortalecer a[sí] el sist[ema] inmunológico y hac[erse] más fue[rte]
frente a l[as] infecciones vír[icas]. (0.0%)
• 455_17: Un entr[amado] de len[guas], culturas, relig[iones] y pue[blos] (0.0%)
• 501_04: La elec[ción] de l[as] familias argen[tinas] se vo[lcó] más ha[cia] la
seg[unda] alternativa. (0.0%)
All of the ten easiest items are the determiners “los” and “la” or the preposition
“de”, e.g.:
• 454_05: que l[os] mosquitos n[o] pican (97.2%)
• 456_05: mien[tras] que l[a] segunda enfa[tiza] la nac[ión] (96.6%)
• 456_08: origen d[e] la sobe[ranía] (94.9%)
• 455_12: que a l[o] largo d[e] la hist[oria] han trans[itado] (94.4%)
IeSeTe easy item in easy sentence and easy text
IeSeTd easy item in easy sentence but difficult text
IeSdTe easy item in difficult sentence but easy text
IeSdTd easy item in difficult sentence and difficult text
IdSeTe difficult item in easy sentence and easy text
IdSeTd difficult item in easy sentence but difficult text
IdSdTe difficult item in difficult sentence but easy text
IdSdTd difficult item in difficult sentence and difficult text
Table 4.1: Theoretically possible item difficulty properties based on the three locality dimensions item (I), sentence (S), and text (T). The units are either considered easy (e) or difficult (d).
4.2 Clustering Approach
To investigate different possible groupings of items, we follow Svetashova (2015) and
perform Hierarchical Clustering on Principal Components (HCPC) using R (R Development
Core Team, 2008) and the package FactoMineR (Lê et al., 2008) (see footnote 6). This
means that Principal Component Analysis (PCA) is performed first in order to transform
the whole set of variables into a set of linearly uncorrelated variables. Afterwards,
hierarchical clustering is applied to group the items.
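The two-stage idea (PCA first, hierarchical clustering second) can be sketched in plain Python for the special case of two performance variables. This is not the FactoMineR implementation used in the thesis: the function names are hypothetical, the PCA uses the closed form for a 2x2 correlation matrix, and Ward's criterion for merging clusters is an assumption chosen for illustration.

```python
import math

def standardize(values):
    """Center to mean 0 and scale to unit (population) variance."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

def pca_2d(xs, ys):
    """PCA of two standardized variables; returns component scores and variance shares."""
    zx, zy = standardize(xs), standardize(ys)
    r = sum(a * b for a, b in zip(zx, zy)) / len(zx)  # Pearson correlation
    sign = 1.0 if r >= 0 else -1.0
    root2 = math.sqrt(2.0)
    # Eigenvectors of [[1, r], [r, 1]] are (1, sign)/sqrt(2) and (1, -sign)/sqrt(2).
    scores = [((a + sign * b) / root2, (a - sign * b) / root2) for a, b in zip(zx, zy)]
    explained = ((1 + abs(r)) / 2, (1 - abs(r)) / 2)
    return scores, explained

def ward_clusters(points, k):
    """Naive agglomerative clustering: always merge the pair with the smallest Ward cost."""
    clusters = [[i] for i in range(len(points))]

    def centroid(members):
        return tuple(sum(points[i][d] for i in members) / len(members) for d in range(2))

    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca, cb = centroid(clusters[a]), centroid(clusters[b])
                dist2 = sum((p - q) ** 2 for p, q in zip(ca, cb))
                na, nb = len(clusters[a]), len(clusters[b])
                cost = na * nb / (na + nb) * dist2  # increase in within-cluster variance
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

A call such as ward_clusters(pca_2d(text_av, perc_correct)[0], 5), with two hypothetical lists of equal length, would return five groups of item indices. The cubic running time of this naive version is only acceptable for small inputs.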
As the performance variables can be grouped into item, sentence, and text level
information, we can describe items according to these three dimensions. Considering
items (I), sentences (S), and texts (T) as either easy (e) or difficult (d) leads to 8
clusters an item can belong to (cf. Table 4.1).
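The count of 8 follows from combining two difficulty values over three locality dimensions (2^3). A short Python sketch, with a hypothetical function name, enumerates the labels:

```python
from itertools import product

def difficulty_labels(units=("I", "S", "T"), values=("e", "d")):
    """Build all labels such as 'IeSdTe' from the units and their difficulty values."""
    return [
        "".join(unit + value for unit, value in zip(units, combo))
        for combo in product(values, repeat=len(units))
    ]

print(difficulty_labels())
```

The labels come out in the same order as the rows of Table 4.1, starting with IeSeTe and ending with IdSdTd.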
We performed HCPC on the full set of 21 performance variables and experimented
with the number of clusters, ranging from four to eight. The number of clusters does
not have a notable impact on the resulting variable factor maps, which, within a given
language, all look similar to the four-cluster maps shown in Figures 4.3 and 4.4.
In both languages, the first two principal components describe more than 70% of the
variance. Variables on item and text level are perpendicular to each other, which
suggests that they encode different information. Those on sentence level lie in
between and correlate more with text level than with item level variables. For
Spanish, distinguishing the variables by proficiency level does not add notable
information on any level of locality (Figure 4.4). This is different for the English
performance data, where the proficiency level variables are not relevant on text
level but add information on sentence and item level. Thus, for English, inspecting
the proficiency levels is more informative on more local levels than on more global
levels (Figure 4.3).
Experimenting with the number of clusters, we expected a clear-cut individuals factor
map as found for the data in Svetashova (2015) (cf. the individuals factor map in
Figure 2.3 on p. 18). In contrast, the performance profile consisting of the two
variables textAv and percCorrect did not result in a clear-cut individuals factor map
when using four clusters. However, it resulted in a clear-cut map using five clusters
for both languages. For the Spanish data, cluster 1 represents items in very difficult
texts in terms of textAv (Figure 4.5). Thus, within very difficult texts, the
clustering does not distinguish between difficult and easy items. However, with
decreasing text difficulty, one can distinguish between easy items (clusters 2 and 3)
and difficult items (clusters 4 and 5).
6 The clustering of the performance data is based on R code kindly provided by Yulia Svetashova.
[Variables factor map (PCA); axes: Dim 1 (49.09%), Dim 2 (23.19%); variables shown: textAv, sentAv, and percCorrect, each overall and for the levels A1 to C2]
Figure 4.3: English: Variables factor map (PCA) using all 21 performance variables and 4 clusters.
[Variables factor map (PCA); axes: Dim 1 (46.65%), Dim 2 (23.86%); variables shown: textAv, sentAv, and percCorrect, each overall and for the levels A1 to C2]
Figure 4.4: Spanish: Variables factor map (PCA) using all 21 performance variables and 4 clusters.
[Two panels. Factor map (individuals); axes: Dim 1 (59.68%), Dim 2 (40.32%); items labeled by item id; legend: cluster 1 to cluster 5. Variables factor map (PCA); axes: Dim 1 (59.68%), Dim 2 (40.32%); variables shown: textAv, percCorrect]
Figure 4.5: Spanish: Individuals and variables factor map using the two performance variables textAv and percCorrect.
If we follow the same clustering approach for English, the individuals factor map
looks as shown in Figure 4.6. We can distinguish between easy and difficult items (in
terms of percCorrect) for texts of low difficulty (in terms of textAv). Items in very easy
texts are grouped into one cluster (cluster 5). With higher textual difficulty, one can
distinguish between easy items (clusters 2 and 4) and difficult items (clusters 1 and 3). We
presume that the difference in the clustering results is not caused by language-specific
differences, but is rather dataset-specific.
[Figure 4.6 (plot not reproduced): individuals factor map with the items grouped into clusters 1 to 5, and variables factor map (PCA) for textAv and percCorrect; axes Dim 1 (61.84%) and Dim 2 (38.16%).]
Figure 4.6: English: Individuals and variables factor map using the two performance variables textAv and percCorrect.
5 Difficulty Modeling
Modeling the difficulty of C-Test text passages and individual items is a relevant step
towards influencing the difficulty of whole C-Tests. As a step towards integrating
difficulty prediction into a single tool, we implemented all features in one processing
pipeline. We use the Java framework UIMA (Unstructured Information Management
Architecture; Ferrucci et al., 2009) to process raw C-Test text passages extracted from
the FSZ database. UIMA can be used to implement individual processing components,
so-called Analysis Engines (AEs), such as a tokenizer, a part-of-speech tagger, or a
feature extractor. These components can then be chained together. In this way, structure
is given to unstructured content, while UIMA takes care of the data flow and supports
the programmer in configuring and running the pipeline.
Following Svetashova (2015), the difficulty of a C-Test text passage can be modeled
on different levels of locality:
• difficulty prediction of single items
• difficulty prediction of single sentences
• difficulty prediction of the whole text passage
Depending on the level of locality, different kinds of linguistic features can be
implemented: lexical features based on frequency lists or other psycholinguistic
databases, readability features, syntactic features, and discourse features.
The following sections describe the features implemented for English and Spanish.
Every feature is calculated item-wise: On sentence level, all items in one sentence
have the same feature value. On text level, all items in one text have the same feature
value.
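The item-wise representation described above can be illustrated with a minimal Python sketch. It is not the UIMA implementation used in this work: the `Item` class, the helper `attach_shared_features`, the key `density_Lex` and all feature values are hypothetical and serve only to show how sentence- and text-level values are copied onto every item they cover.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One C-Test gap together with its (hypothetical) feature dictionary."""
    item_id: str
    sentence_id: int
    features: dict = field(default_factory=dict)


def attach_shared_features(items, sentence_features, text_features):
    """Copy sentence- and text-level feature values onto every item they cover."""
    for item in items:
        # Text level: every item of the passage receives the same value.
        item.features.update(text_features)
        # Sentence level: every item of one sentence receives the same value.
        item.features.update(sentence_features[item.sentence_id])
    return items


# Made-up example values, for illustration only.
items = [Item("gap1", 0), Item("gap2", 0), Item("gap3", 1)]
sentence_features = {0: {"parseDepth": 7}, 1: {"parseDepth": 4}}
text_features = {"density_Lex": 0.52}
attach_shared_features(items, sentence_features, text_features)
```

After the call, `gap1` and `gap2` share one `parseDepth` value because they belong to the same sentence, and all three items share the text-level value.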
5.1 Linguistic preprocessing
The UIMA pipeline visualized in Figure 5.1 is used to annotate C-Test text passages, as
extracted from the FSZ database, with different linguistic information. The database
stores text passages as strings with the correct solutions enclosed in brackets. All UIMA
annotations contain at least two features: the start and the end index of the span being
annotated. The indices of all annotations in the pipeline refer to the indices in the
original database text with the bracket structure. Since NLP tools, e.g. for sentence
segmentation or tokenization, require input without brackets, we delete brackets for the
relevant steps and reinsert them afterwards to ensure correct indices. The final UIMA
type that is needed to encode relevant item information is the C-TestToken type. It
differs from usual NLP tokens since C-Test passages require their own tokenization
rules: NLP tokenizers often split strings that are not separated by whitespace, such as
contractions (cf. the example "didn't" in Table 5.1), whereas a C-TestToken keeps such a
word as a single unit. A C-TestToken annotation stores whether the token contains
a gap (isItem), as well as the intended word, its base, its ending, its lemma and its POS.
Furthermore, the position of the token in the text and in the sentence is stored.
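The bracket handling described above can be sketched as follows. This is a minimal Python illustration, not the Java code of the pipeline; the function names `strip_brackets` and `to_original_span` and the example sentence are hypothetical. The sketch assumes that an item is stored as a base followed by its bracketed ending, as in "pipe[lines]".

```python
def strip_brackets(marked_text):
    """Return the bracket-free text and, for every remaining character,
    its index in the original bracketed string."""
    plain_chars, original_index = [], []
    for i, ch in enumerate(marked_text):
        if ch in "[]":
            continue
        plain_chars.append(ch)
        original_index.append(i)
    return "".join(plain_chars), original_index


def to_original_span(start, end, original_index, marked_text):
    """Map a [start, end) span on the plain text back to the bracketed text."""
    o_start = original_index[start]
    o_end = original_index[end - 1] + 1
    # Assumption: a closing bracket directly after the token belongs to it.
    if o_end < len(marked_text) and marked_text[o_end] == "]":
        o_end += 1
    return o_start, o_end


marked = "The pipe[lines] did[n't] leak."
plain, index_map = strip_brackets(marked)

# An NLP tool would report this span on the bracket-free text.
start = plain.index("pipelines")
span = to_original_span(start, start + len("pipelines"), index_map, marked)
```

Keeping an explicit index map avoids recomputing offsets when the brackets are reinserted, so every annotation can refer to the original database string.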
SentenceAnnotator → TokenAnnotator → PosAnnotator → LemmaAnnotator → C-TestTokenAnnotator
Figure 5.1: The UIMA pipeline that chains multiple NLP analysis engines together in order to annotate C-TestTokens.
feature | description | example "pipe[lines]" | example "did[n't]"
--------|-------------|----------------------|-------------------
isItem | indicates whether this token is one of the 25 test items | true | true
intendedWord | the actually intended word | pipelines | didn't
base | the item's base, if it is an item | pipe | did
ending | the item's ending, if it is an item | lines | n't
lemma | the lemma of the intended token | pipelines | do not
pos | the POS of the intended token | NNS | VBD RB
tokenId | the position of the token in the text (counted in tokens) | 65 | 45
tokenInSentId | the position of the token in the sentence (counted in tokens) | 3 | 8
mergedString | the merged string of the intended word if the C-TestToken is a result of two individual NLP tokens. This string contains an underscore. It can be used to see whether and how the NLP tokenizer would have split the word. | - | did_n't
Table 5.1: The features of the C-TestToken type as implemented in our UIMA preprocessing pipeline.
Sentence segmentation and tokenization were carried out using the Apache OpenNLP
Java library 7.
English part-of-speech (POS) tagging was also done using OpenNLP, resulting in
POS tags from the Penn Treebank Tagset. For Spanish, we used a Maximum Entropy
Tagger from the Stanford NLP Group 8 in order to get simplified tags from the Ancora
Corpus Tagset 9 (Toutanova et al., 2003). We further transformed the simplified tags
into the Parole Reduced Tagset 10, a set of 66 tags.
Lemmatization was done using OpenNLP's Simple Lemmatizer. For Spanish, a
dictionary 11 developed in the OpeNER project was adapted by transforming the tags into
the Parole Reduced Tagset (García-Pablos et al., 2013).
7 https://opennlp.apache.org/ (Last accessed: 18/02/08). English models were available for OpenNLP version 1.6, the Spanish models only for version 1.4.
8 https://nlp.stanford.edu/ (Last accessed: 18/02/08)
The English constituency parser is a probabilistic context-free grammar (PCFG)
parser implemented by the Stanford NLP Group.
Syntactic dependencies were annotated using a transition-based neural-network
dependency parser implemented by the Stanford NLP Group (Chen and Manning, 2014).
We used the Universal Dependencies representation, which is available for both languages
(Nivre et al., 2016).
5.2 Modeling the difficulty of English C-Tests
The English C-Test difficulty model is based on the findings of Svetashova (2015),
focusing on the 70 most predictive features in her experiments. We further integrated
complexity measures that were already implemented in a similar pipeline described
by Chen and Meurers (2016). Additionally, the pipeline contains features based on
the Dependency Locality Theory (DLT) as described by Weiß (2017) 12. The following
subsections describe the 155 implemented features, grouped according to the locality
they span.
9 Stanford's simplified Ancora tagset: https://nlp.stanford.edu/software/spanish-faq.shtml (Last accessed: 18/02/08)
10 See http://www.cs.upc.edu/~nlp/SVMTool/parole.html for a list of the reduced tags (Last accessed: 18/02/08)
11 The dictionary is available at https://github.com/opener-project/pos-tagger-en-es/tree/master/core/src/main/resources (Last accessed: 18/02/08)
5.2.1 Item Level
We integrated features from all item-level groups described by Svetashova (2015) and
summarized in Section 2.4.2: features capturing linguistic properties (LING),
psycholinguistic features (PSY), context and candidate space features (CONT) and
position-based features (POSIT). Some of them were implemented just as described in her
work, while others had to be modified in order to be integrated into the UIMA pipeline. We
also added further features in order to allow for a reasonable comparison between Spanish
and English features.
Linguistic Features (Ling): The linguistic features that performed well in
Svetashova (2015) include surface-based features describing the number of letters to be
restored (endingLength) and the number of letters in the intended token's lemma
(lemmaLength). Furthermore, the POS of the intended token is captured by the boolean
feature isContentWord, which indicates whether the POS describes a content word or
not. We consider nouns, verbs, adjectives, adverbs and cardinal numbers to be content
words 13. A further feature is the number of dependencies the item has in the sentence
(numDependencies).
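As an illustration of the isContentWord feature, the prefix rule stated in the corresponding footnote (Penn Treebank tags starting with N, V, J, R, plus CD) could be written as the following Python sketch. The function name `is_content_word` is hypothetical and the code is not taken from the pipeline.

```python
# Tag prefixes for nouns, verbs, adjectives and adverbs in the Penn Treebank tagset.
CONTENT_TAG_PREFIXES = ("N", "V", "J", "R")


def is_content_word(penn_tag):
    """Return True if the Penn Treebank tag denotes a content word.

    Note: a pure prefix rule also accepts the particle tag RP,
    since it starts with R.
    """
    return penn_tag == "CD" or penn_tag.startswith(CONTENT_TAG_PREFIXES)
```

For example, `is_content_word("NNS")` holds for the item "pipelines" from Table 5.1, while a determiner tagged `DT` is rejected.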
Psycholinguistic Features (Psy): We included the term frequency–inverse
document frequency (TFIDF) feature, which is based on frequency counts of the tokens
in the document and in a collection of documents. This feature was developed in the
context of Information Retrieval in order to describe a term's importance throughout
a collection of documents. Following Svetashova (2015), we also consider the term
frequency (i.e., the number of occurrences of the term divided by the total number of
words, TF), the document frequency (i.e., the number of documents that contain the
term, DF) and the inverse document frequency (IDF) as single features. We further added
the number of occurrences of the term in the document (TermOccInDoc) as a feature. The
document collection is a corpus of 45 C-Test text passages and 88 texts from the Brown
Corpus compiled by Svetashova (2015).
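The relation between these frequency features can be made concrete with a small Python sketch. The text does not state which IDF variant is used, so the sketch assumes the common form log(N / DF); the function name `tf_idf_features` and the toy collection are hypothetical.

```python
import math


def tf_idf_features(term, document, collection):
    """Compute the frequency features for one term.

    document: list of tokens; collection: list of token lists.
    Assumption: IDF = log(N / DF), with 0.0 if the term occurs in no document.
    """
    occurrences = document.count(term)
    tf = occurrences / len(document)
    df = sum(1 for doc in collection if term in doc)
    idf = math.log(len(collection) / df) if df else 0.0
    return {
        "TermOccInDoc": occurrences,
        "TF": tf,
        "DF": df,
        "IDF": idf,
        "TFIDF": tf * idf,
    }


# Toy collection, for illustration only.
collection = [
    ["the", "pipeline", "leaks", "the", "oil"],
    ["the", "test", "is", "easy"],
    ["oil", "is", "expensive"],
]
feats = tf_idf_features("oil", collection[0], collection)
```

In this toy example "oil" occurs once among five tokens (TF = 0.2) and in two of three documents (DF = 2), so TFIDF equals 0.2 times log(3/2).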
Other psycholinguistic features are semantic variables based on the machine-usable
dictionary MRC (Wilson, 1988), which provides up to 26 linguistic and psycholinguistic
attributes for more than 150,000 words. We adapted features based on the following
attributes: age of acquisition (Sem AoA) (Kuperman et al., 2012), concreteness
(Sem Concr), familiarity (Sem Fam), and meaningfulness (Sem Meani), since they
performed well in experiments by Svetashova (2015). MRC also contains an attribute
that indicates in how many of the 15 stylistic categories of the Brown Corpus the word
can be found (StylNCatsMRC) 14.
12 Zarah Weiß kindly provided her Java code to extract DLT features for German, which we then adapted to English and Spanish.
13 Penn Treebank tags starting with N, V, J, R and CD.
In order to check whether a word belongs to a high frequency class, Svetashova (2015)
uses a web interface provided by the ADELEX research project 15. The tool extracts a
feature that indicates whether a token is contained in the 1000 most frequent words of a
collection of corpora. This is based on the idea of Lexical Frequency Profiles developed
by Laufer and Nation (1995). We implemented a similar feature that checks whether the
word is contained in the 1000 most frequent words in the SUBTLEX-US corpus (Lfp Band1)
(Brysbaert et al., 2012).
The AoA database 16 contains information about the lemma and POS of given words
in the SUBTLEX-US corpus (Kuperman et al., 2012). We processed this list in order
to ensure a reasonable mapping to Penn Treebank tags 17. We follow Svetashova (2015)
and implement two features that indicate (1) whether a token is the most frequent token
of its lemma (isLemmasMostFreqTok) and (2) whether the POS tag is the token's most
frequent tag (isTokensMostFreqTag) 18.
Position-based Features (Posit): This set of 8 features contains information
about the item's position and its occurrences in the text. We measure the position
in terms of the token number (Number Token) and the gap number (Number Gap).
In order to include information about additional occurrences of the intended word
in the text, we implement features measuring the distance between the gap and the
previous occurrence of the intended word or the intended word's lemma
(distancePrevMention Token and distancePrevMention Lemma). Two boolean features indicate
whether the intended token occurs in the non-mutilated parts of the text (isInClosing and
isInStartOrClosing). Svetashova (2015) further follows Beinborn et al. (2014) and suggests
features measuring the difficulty of the previous gap by considering its unigram and
trigram probabilities (previousItemUnigramProb and previousItemTrigramProb). We
follow Svetashova (2015) and look them up in the Web1T corpus.
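The two distance features can be illustrated with a short Python sketch. The text does not specify the unit of the distance or the value used when there is no previous mention; the sketch assumes a distance in tokens and `None` for a missing mention. The function name `distance_prev_mention` and the example sentence are hypothetical.

```python
def distance_prev_mention(forms, gap_index):
    """Distance in tokens between the gap and the closest previous occurrence
    of the same form (token or lemma); None if there is no previous mention."""
    target = forms[gap_index]
    for i in range(gap_index - 1, -1, -1):
        if forms[i] == target:
            return gap_index - i
    return None


# The same function serves both features, depending on the sequence passed in.
tokens = ["dogs", "bark", "and", "the", "dog", "runs"]
lemmas = ["dog", "bark", "and", "the", "dog", "run"]

dist_token = distance_prev_mention(tokens, 4)  # "dog" has not occurred as a token
dist_lemma = distance_prev_mention(lemmas, 4)  # lemma "dog" occurred 4 tokens earlier
```

The example shows why both variants are informative: the surface form "dog" is new at the gap, whereas its lemma was already mentioned through "dogs".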
14 We noticed missing values for some words and extracted our own counts from the Brown Corpus using Python's Natural Language Toolkit (NLTK), resulting in the feature StylNCatsNLTK (Bird and Loper, 2006).
15 Project website: http://www.ugr.es/~inped/ada/ (Last accessed: 18/02/08)
16 The database including lemma and POS information is available at http://crr.ugent.be/archives/806 (Last accessed: 18/02/08)
17 We merged determiners and articles and treated numbers as adjectives. Items where the tag was "unclassified" or represented by a number were deleted from the list.
18 If the word was not on the list, it was assumed that the tag is not the token's most frequent POS.
Context and Candidate Space Features (Cont): We also integrated some of
the Context and Candidate Space Features presented by Svetashova (2015) into the
pipeline. The context features are based on Web1T, a Google n-gram corpus containing
uni- to five-grams and their observed frequency counts extracted from texts of about 1
trillion word tokens (Brants and Franz, 2006) 19. For each item, all uni- to five-grams
that contain the token are considered, i.e., the n-grams extending to its left and right
side. Thus, a span of 9 tokens is under consideration, resulting in 15 features (one
unigram, two bigrams, three trigrams, four four-grams and five five-grams). Furthermore,
the maximum n-gram log probability of an item represents a feature (maxNgram).
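The count of 15 context n-grams within a 9-token span can be verified with a small Python sketch that enumerates all n-grams (n = 1 to 5) containing the item. The function name `ngram_windows` is hypothetical, and the lookup of the log probabilities in Web1T is deliberately left out.

```python
def ngram_windows(tokens, item_index, max_n=5):
    """Enumerate all n-grams (1 <= n <= max_n) that contain the item.

    Windows that would reach beyond the text boundaries are skipped.
    """
    windows = []
    for n in range(1, max_n + 1):
        # The item can occupy any of the n positions inside the n-gram.
        for start in range(item_index - n + 1, item_index + 1):
            if start >= 0 and start + n <= len(tokens):
                windows.append(tuple(tokens[start:start + n]))
    return windows


tokens = "a b c d ITEM f g h i".split()
windows = ngram_windows(tokens, 4)
```

With four tokens of context on each side, the windows cover exactly the nine tokens around the item; an item at the very beginning of a text has fewer windows, so the corresponding features need some default value (an assumption not discussed here).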
The motivation behind the candidate features is that in some cases the actually
intended token is less frequent in the given context than another word which would
also fit into the gap according to its base and the length constraint. Svetashova (2015)
considers all words in the psycholinguistic database MRC (Wilson, 1988) which have the
same base and match the length constraint as competing candidates. One feature
simply counts the number of possible endings (numPotentialEndings). The n-gram
log probabilities for the actual solution and the candidate solutions have also been
extracted from the Web1T corpus. The features consider the unigram and all bigrams
and trigrams that contain the item under consideration. Thus, the n-grams to the left
and right side of the token are considered. The features that span these 1- to 3-gram
windows can be grouped into three different subsets and result in 18 features in total
(Svetashova, 2015).
• Ngram Cands delta (double): the difference between the n-gram probability of
an item and its competing candidate with highest n-gram probability
• Ngram Cands hasBigger (boolean): true, if there exist candidate n-grams that
have a bigger log probability
• Ngram Cands weakness (int): the number of candidates with bigger n-gram
probability
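The three candidate feature types listed above can be sketched in Python for a single n-gram window. The function name `candidate_features` and the log probabilities are hypothetical; the value 0.0 for delta when there are no competing candidates is an assumption, since the text does not specify this case.

```python
def candidate_features(item_logprob, candidate_logprobs):
    """Compare the solution's log probability with those of its competitors."""
    best = max(candidate_logprobs, default=None)
    stronger = [p for p in candidate_logprobs if p > item_logprob]
    return {
        # Difference between the item and its strongest competitor
        # (assumption: 0.0 if there is no competitor at all).
        "delta": item_logprob - best if best is not None else 0.0,
        # True if at least one candidate is more probable than the solution.
        "hasBigger": bool(stronger),
        # Number of candidates that are more probable than the solution.
        "weakness": len(stronger),
    }


# Made-up log probabilities, for illustration only.
hard_item = candidate_features(-4.0, [-3.5, -6.0, -3.0])
easy_item = candidate_features(-2.0, [-5.0])
```

A negative delta together with a high weakness value marks an item whose intended solution competes with more probable alternatives in the given context.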
In order to have features that are comparable to Spanish, we implemented another
feature set based on the SUBTLEX databases which contain word frequencies based on
film subtitles. The SUBTLEX-US database contains 74,286 American English words
and the corresponding per million word frequencies. This word list was used instead
of MRC to generate a map that maps item bases to possibly intended words. The
number of candidates again represents a feature (subtlex numCandidates). We then
computed one unigram feature for each of the mentioned three ideas (delta, hasBigger
and weakness) replacing n-gram log probabilities by per million word frequencies:
• SubtlexCands delta (double): the difference between the per million word
frequency of an item and its competing candidate with highest per million word
frequency
• SubtlexCands hasBigger (boolean): true, if there exists a candidate that has a
bigger per million word frequency
• SubtlexCands weakness (int): the number of candidates with bigger per million
word frequency
All the context and candidate space features are listed in Table 5.2.
19 The code to extract the information from Web1T was made available by Yulia Svetashova.
feature group | features
--------------|---------
Item Ngram Probs | unigram, bigram left, bigram right, trigram left, trigram center, trigram right, fourgram left, fourgram left center, fourgram right center, fourgram right, fivegram left, fivegram left center, fivegram center, fivegram right center, fivegram right, maxNgram
Item Ngram Cands Delta | unigram, bigram left, bigram right, trigram left, trigram center, trigram right
Item Ngram Cands HasBigger | unigram, bigram left, bigram right, trigram left, trigram center, trigram right
Item Ngram Cands Weakness | unigram, bigram left, bigram right, trigram left, trigram center, trigram right, numPotentialEndings
Item SubtlexCands | unigram delta, unigram hasBigger, unigram weakness, unigram numCandidates
Table 5.2: The 39 context and candidate space features implemented in the English difficulty model.
5.2.2 Sentence Level
Only two sentence-level features implemented by Svetashova (2015) were among her
70 most predictive features. The first considers the output of Stanford's CoreNLP
constituency parser and counts every construction that is not a simple declarative clause
(numComplicators) 20. The second, parseScore, represents the log probability of the
parse tree and is also provided by the Stanford parser. We also included the parseDepth
feature as described by Svetashova (2015), which describes the depth of the parse tree, and
the number of gaps in the sentence (numGapsInSent). The sentence id, i.e., the position
of the sentence in the text, is also stored (sentID).
As already mentioned in the context of linguistic complexity (Section 2.3, p. 9),
Weiß (2017) considered features based on the DLT. The DLT is a theory of how human
computational resources are used during sentence processing, where "words are
input one at a time" (Gibson, 2000). According to Gibson (2000), two linguistic elements
(e.g., a head and its dependent) need to be integrated during sentence processing.
This integration is costly, and the cost depends on the distance between the elements.
Weiß (2017) follows Shain et al. (2016) and computes integration costs based on three
assumptions: a) verbs are more expensive, b) coordination is less expensive, and c)
modifier dependencies can be excluded. This leads to the following modifications when
calculating integration costs:
a) a cost of 1 for non-finite verbs, a cost of 2 for finite verbs (v)
b) only one collective count for a coordinated set of referents (c)
c) ignore dependencies to preceding modifiers (m)
Combining these modifications with each other, and additionally using none of them (o),
leads to eight different computations of integration costs. For all eight of them,
she implemented three variants: the maximal total integration cost, the mean
total integration cost at finite verbs, and the number of adjacent high integration cost
areas. However, the variant of adjacent high integration cost areas requires a "threshold
to qualify discourse costs as 'high'" which should be set according to clear linguistic
evidence that still needs to be revealed (Weiß, 2017, p. 75). We therefore only integrate
the first two variants into our pipeline, resulting in a set of 16 sentence-level DLT
features:
• maxTotalIntegrationCostPerFiniteVerb for the 8 conditions:
o, v, c, m, cv, cm, vm, cmv
• totalIntegrationCostsAtFiniteVerbPerFiniteVerb for the 8 conditions:
o, v, c, m, cv, cm, vm, cmv
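That the eight condition labels correspond exactly to all subsets of the three modifications v, c and m, and that two variants times eight conditions yield 16 features, can be checked with a short Python sketch. The helper names and the way variant and label are joined with an underscore are hypothetical; the actual feature identifiers in the pipeline may be formed differently.

```python
from itertools import combinations

MODIFICATIONS = "vcm"  # verb weighting, coordination, modifier exclusion
VARIANTS = (
    "maxTotalIntegrationCostPerFiniteVerb",
    "totalIntegrationCostsAtFiniteVerbPerFiniteVerb",
)
# Condition labels as listed in the text; "o" stands for no modification.
LABELS = ["o", "v", "c", "m", "cv", "cm", "vm", "cmv"]


def all_subsets(symbols):
    """Return every subset of the given modifications as a frozenset."""
    return [
        frozenset(combo)
        for size in range(len(symbols) + 1)
        for combo in combinations(symbols, size)
    ]


def label_to_condition(label):
    """Translate a label such as 'cv' into the set of active modifications."""
    return frozenset() if label == "o" else frozenset(label)


# Hypothetical naming scheme: variant and condition joined by an underscore.
feature_names = [f"{variant}_{label}" for variant in VARIANTS for label in LABELS]
```

Representing a condition as a set of active modifications makes the order of the letters in a label (for example "vm" versus "mv") irrelevant.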
5.2.3 Text Level
Following Svetashova (2015), we implement the following three groups of text level
features: lexical features, syntactic features and traditional readability features.
20Yulia Svetashova kindly provided her Java code for the extraction of the numComplicators feature
Automated C-Test Difficulty Prediction
5 Difficulty Modeling 36
Lexical features (Lex): The lexical complexity of texts can be measured in differ-
ent ways. We distinguish between lexical density, lexical variation, and lexical sophis-
tication. Additional lexical features are based on different frequency lists.
• Density: Lexical density, as described by Lu (2012), measures the ratio of lex-
ical words to all words in a text (density Lex ). We consider nouns, verbs, ad-
jectives, and adverbs as lexical words. We also compute the ratio of further
POS groups to the number of all words in the text, resulting in the following
features: density Functional, density Noun, density Verb, density Adjective, den-
sity Adverb, density Conjunction, and density Determiner.
• Variation: According to Lu (2012), who investigates lexical variation in the con-
text of language learning, ”lexical variation refers to the range of a learner’s
vocabulary as displayed in his or her language use”. Table 5.3 lists the lexical
variation features computed within this work and the corresponding descriptions
and formulas.
• Sophistication: Lexical sophistication measures describe how sophisticated the
words in a text are. Following Lu (2012), we count ”words, lexical words, and
verbs as sophisticated if they were not on the list of the 2000 most frequent
words generated from the British National Corpus”. The list was obtained from
Lancaster University's University Centre for Computer Corpus Research on Language
(UCREL)21. The different lexical sophistication measures vary in terms of the
underlying formula and in terms of the POS under consideration (either all words
or verbs). The formulas consist of word type and word token counts, where we follow
Lu (2012, p. 192) and consider ”different inflections of the same lemma (e.g., “go,”
“goes,” “going,” “went,” and “gone”) as one type”. Thus, in this case the term type
differs from the notion of a type in the lexical variation measures.
We implemented all 5 lexical sophistication measures as listed in Table 5.4.
Svetashova (2015) further gathered measures based on the Lexical Frequency
Profile (LFP), as introduced by Laufer and Nation (1995), from the ADELEX
Analyzer (ADA) (cf. p. 32). The assumption behind LFP is that ”the propor-
tion of high frequency general service and academic words in learners’ writing”
influences lexical richness (Laufer and Nation, 1995). ADA divides the 7000 most
frequent words of different corpora into 7 bands of 1000 words and computes the
percentage of words included in each list by either matching types or tokens. (NB:
in this case word types and not lemmas are considered as types). We replicate
these features by considering the 7000 most frequent tokens in the SUBTLEX-US
corpus (cf. p. 32).
Another group of sophistication features that performed well in Svetashova (2015)
21http://ucrel.lancs.ac.uk/bncfreq/lists/1_2_all_freq.txt (Last accessed: 18/02/08)
are features extracted from the Lextutor Vocabprofile 22. We integrated these
features into our pipeline by taking the Lextutor’s Academic Word List (AWL),
originally developed by Coxhead (2000), and then generating word families
using the Lextutor’s Familizer 23. The Familizer extracted the word families
for every word in the AWL, which resulted in about 3000 words belonging to 570
word families. The feature awlPercent encodes the percentage of words in the text
that belong to the AWL, and the feature awlNumFam encodes how many
of the corresponding word families have been found in the text. We compute the
same two features for the 1000 most frequent words in the SUBTLEX-US corpus
(famListPercent k1, famNum k1 ).
Feature               Description
variation NDWZ        $T$ (number of different words)
variation Adj         $T_{adj} / N_{lex}$ (adjective variation)
variation Adv         $T_{adv} / N_{lex}$ (adverb variation)
variation Mod         $T_{adj+adv} / N_{lex}$ (modifier variation)
variation Lex         $T_{lex} / N_{lex}$ (lexical word variation)
variation Verb2       $T_{verb} / N_{lex}$ (verb variation II)
variation VV1         $T_{verb} / N_{verb}$ (verb variation I)
variation VV1sq       $T_{verb}^2 / N_{verb}$ (squared VV1)
variation VV1co       $T_{verb} / \sqrt{2 N_{verb}}$ (corrected VV1)
variation TTR         $T / N$ (type token ratio)
variation TTR RTTR    $T / \sqrt{N}$ (root TTR)
variation TTR CTTR    $T / \sqrt{2N}$ (corrected TTR)
variation TTR Log     $\log T / \log N$ (bilogarithmic TTR)
variation TTR Uber    $\log^2 N / \log(N/T)$ (Uber index)
Table 5.3: The lexical variation features. N refers to the number of words, whereas T
refers to the number of distinct words (i.e. word types).
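As an illustration of the TTR-based formulas in Table 5.3, the following sketch computes them from a plain list of tokens. It is not the thesis implementation: lowercasing tokens before counting types is an assumption, the dictionary keys are hypothetical short names, and the POS-based measures (adjective, adverb, and verb variation) are omitted because they require a tagger.

```python
import math


def lexical_variation(tokens):
    """Compute the TTR-based measures of Table 5.3 for a list of word tokens."""
    n = len(tokens)                           # N: number of words
    if n == 0:
        raise ValueError("need at least one token")
    t = len({tok.lower() for tok in tokens})  # T: number of distinct words
    measures = {
        "NDW": t,
        "TTR": t / n,
        "RTTR": t / math.sqrt(n),
        "CTTR": t / math.sqrt(2 * n),
    }
    # log N is 0 for a single token and log(N/T) is 0 when all tokens are
    # distinct, so these two measures are undefined in those cases.
    measures["LogTTR"] = math.log(t) / math.log(n) if n > 1 else None
    measures["Uber"] = math.log(n) ** 2 / math.log(n / t) if t < n else None
    return measures


example = lexical_variation("the cat saw the dog".split())
print(example)
```

The guards show why several of these measures are unstable on very short texts, which is relevant for C-Test passages of limited length.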
Syntactic features (Syn): Svetashova (2015) made use of the web-based L2 Syntactic
Complexity Analyzer developed by Lu (2010)24. We replicate these features by
analyzing the output of the Stanford constituency parser.
The Tregex patterns described by Lu (2010) are used to extract the relevant linguistic
units from the parser output (Levy and Andrew, 2006). The best performing features
in Svetashova (2015) have been integrated into the pipeline and are listed in Table 5.5.
Dependency Locality Theory features (Dlt): Weiß (2017) projected the sentence-level
DLT features (cf. p. 35) to the text level. We integrate these features into our
pipeline.
Readability features (Read): Traditional readability features highly correlate
with each other in the context of C-Test item difficulty prediction, as shown by
Svetashova (2015). We therefore only integrate two readability indices into the pipeline:
the Flesch Reading Ease formula (readability Flesh) and the Flesch-Kincaid Grade
22http://www.lextutor.ca/vp/eng/ (Last accessed: 18/02/08)
23http://www.lextutor.ca/familizer/ (Last accessed: 18/02/08)
24http://aihaiyang.com/software/l2sca/ (Last accessed: 18/02/08)
Feature                        Description
sophistication LS1             $N_{sophLex} / N_{word}$ (lexical sophistication I)
sophistication LS2             $T_{sophWord} / T_{word}$ (lexical sophistication II)
sophistication VS1             $T_{sophVerb} / N_{verb}$ (verb sophistication I)
sophistication CSV             $T_{sophVerb} / \sqrt{2 N_{verb}}$ (corrected VS I)
sophistication VS2             $T_{sophVerb}^2 / N_{verb}$ (verb sophistication II)
frequencyProfile Band1 Tok     percentage of N in band 1 (freq. 1-1000 in SUBTLEX)
frequencyProfile Band1 Type    percentage of T in band 1 (freq. 1-1000 in SUBTLEX)
frequencyProfile Band2 Tok     percentage of N in band 2 (freq. 1001-2000 in SUBTLEX)
frequencyProfile Band2 Type    percentage of T in band 2 (freq. 1001-2000 in SUBTLEX)
frequencyProfile Band3 Tok     ...
frequencyProfile Band3 Type    ...
frequencyProfile Band4 Tok     ...
frequencyProfile Band4 Type    ...
frequencyProfile Band5 Tok     ...
frequencyProfile Band5 Type    ...
frequencyProfile Band6 Tok     ...
frequencyProfile Band6 Type    ...
frequencyProfile Band7 Tok     ...
frequencyProfile Band7 Type    ...
awlPercent                     percentage of N in AWL (Academic Word List)
awlNumFam                      number of AWL families
famListPercent k1              percentage of words in k1 list (freq. 1-1000 in SUBTLEX)
famNum k1                      number of k1 families
Table 5.4: The lexical sophistication features. N refers to the number of words, lexical
words or verbs. T refers to the number of distinct words, lexical words or
verbs (i.e. word types).
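The frequencyProfile features of Table 5.4 can be illustrated with a small sketch that assigns each word to a frequency band and reports the share of tokens and of types per band. The function name, the key format, and the lowercasing step are assumptions made for this example; the input is any list of words sorted from most to least frequent, standing in for the SUBTLEX ranking.

```python
def frequency_profile(tokens, ranked_words, band_size=1000, num_bands=7):
    """Percentage of tokens and of types that fall into each frequency band."""
    if not tokens:
        raise ValueError("need at least one token")
    # Band 1 holds ranks 1..band_size, band 2 the next band_size ranks, etc.
    band_of = {}
    for rank, word in enumerate(ranked_words[: band_size * num_bands]):
        band_of.setdefault(word.lower(), rank // band_size + 1)
    tokens = [tok.lower() for tok in tokens]
    types = set(tokens)
    profile = {}
    for band in range(1, num_bands + 1):
        tok_hits = sum(1 for tok in tokens if band_of.get(tok) == band)
        type_hits = sum(1 for typ in types if band_of.get(typ) == band)
        profile[f"Band{band}_Tok"] = 100.0 * tok_hits / len(tokens)
        profile[f"Band{band}_Type"] = 100.0 * type_hits / len(types)
    return profile


# Toy ranking with two bands of two words each, for illustration only.
toy_ranking = ["the", "of", "cat", "dog", "zebra"]
toy_profile = frequency_profile(
    ["the", "cat", "the", "zebra"], toy_ranking, band_size=2, num_bands=2
)
print(toy_profile)
```

Token-based and type-based percentages differ whenever frequent words are repeated, which is why both variants appear in the table even though they often correlate highly.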
Feature              Description
complexity CNperC    # complex nominals / # clauses (complex nominals per clause)
complexity CTperT    # complex T-units / # T-units (complex T-unit ratio)
complexity DCperC    # dependent clauses / # clauses (dependent clause ratio)
complexity VPperT    # verb phrases / # T-units (verb phrases per T-unit)
complexity MLC       # words / # clauses (mean length of a clause)
complexity MLT       # words / # T-units (mean length of a T-unit)
complexity MLS       # words / # sentences (mean length of a sentence)
Table 5.5: The syntactic complexity features.
Level (readability Kincaid). These formulas are based on the average sentence length
and the number of syllables per word.
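For reference, the two indices can be written out with the commonly cited English coefficients. This is a sketch under the assumption that the standard published coefficients are used; the thesis does not state the constants, and the counting of words, sentences, and syllables is left to the caller.

```python
def flesch_reading_ease(num_words, num_sentences, num_syllables):
    """Flesch Reading Ease with the standard English coefficients."""
    words_per_sentence = num_words / num_sentences
    syllables_per_word = num_syllables / num_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_kincaid_grade(num_words, num_sentences, num_syllables):
    """Flesch-Kincaid Grade Level with the standard English coefficients."""
    words_per_sentence = num_words / num_sentences
    syllables_per_word = num_syllables / num_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# A text with 100 words, 5 sentences, and 150 syllables.
print(flesch_reading_ease(100, 5, 150))
print(flesch_kincaid_grade(100, 5, 150))
```

Both indices are linear in the same two quantities, average sentence length and average syllables per word, which is consistent with the high correlation among readability features noted above.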
5.3 Modeling the difficulty of Spanish C-Tests
The implementation of the Spanish C-Test difficulty features builds on the English
features described above and pursues two main goals: an efficient way of predicting
the difficulty of Spanish C-Test items per se, and a meaningful, linguistically insightful
comparison of difficulty phenomena in the two languages. The following sections mainly
focus on those Spanish features that differ from the English ones.
5.3.1 Item Level
Linguistic Features (Ling): The linguistic features for Spanish are the same as those
for English: endingLength, lemmaLength, isContentWord and numDependencies. We
considered the Ancora tags starting with N (nouns), V (verbs), A (adjectives) and R
(adverbs) as content words.
Psycholinguistic Features (Psy): The TFIDF measures have been adapted to
Spanish by compiling a new collection of documents: 91 C-Test text passages and 90
randomly chosen texts from a sample of the Corpus del Español 25.
The Lexical Frequency Profile measure, which indicates whether the item is contained
in the 1000 most frequent words, is based on the Spanish SUBTLEX-ESP corpus
(Cuetos et al., 2012) 26.
Position Based Features (Posit): All the position based features from English,
except for those based on the Web1T-corpus, have been adapted to the Spanish pipeline.
Context and Candidate Space (Cont): We implemented candidate space fea-
tures that are comparable between the two languages by focusing on the SUBTLEX
databases. As already described for English, the SUBTLEX database has been used to
generate maps from item bases to possible solutions according to the length constraint.
The following exemplifies a map entry for the base ”añ” and all the matching possibly
intended words extracted from the first 72,286 words in SUBTLEX-ESP27:
añ[ ] → [años, añade, añora, añado, añeja, añoro, añada, añadí, añejo]
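A map from item bases to candidate solutions can be sketched as follows. The rule used here, that the visible base is the first half of the word rounded down, is an assumption inferred from the example entry, in which four-letter and five-letter words share the same two-letter base; the function names are hypothetical. The toy vocabulary uses the Spanish spellings años, añade, and añora.

```python
import unicodedata
from collections import defaultdict


def c_test_base(word):
    """Visible part of a C-Test item: the first half of the word, rounded down."""
    return word[: len(word) // 2]


def build_candidate_map(vocabulary):
    """Map each base to all vocabulary words that would produce that base."""
    candidates = defaultdict(list)
    for word in vocabulary:
        # Normalize so that accented letters count as single characters.
        word = unicodedata.normalize("NFC", word)
        if len(word) >= 2:  # one-letter words have no visible base
            candidates[c_test_base(word)].append(word)
    return dict(candidates)


toy_vocabulary = ["años", "añade", "añora", "ante", "año"]
candidate_map = build_candidate_map(toy_vocabulary)
print(candidate_map)
```

The length constraint is implicit in the base function: a word only appears under a base if deleting its second half leaves exactly that base, so a longer word starting with the same letters is not a candidate.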
5.3.2 Sentence Level
On the sentence level we include the following features that have already been described
for English: parseScore, parseDepth, numGapsInSent and sentID. The DLT features
have also been adapted to Spanish.
5.3.3 Text Level
On the text level we fully adapted the DLT and readability features to the Spanish
pipeline. The Spanish readability features make use of syllable counts extracted by
following the approach described by Hernández-Figueroa et al. (2009)28.
All the lexical density and variation features have been adapted to Spanish. To include
the aspect of sophistication, we implement the lexical sophistication measures
developed by Lu (2012), which are listed for English in Table 5.4 (p. 38). We consider
a Spanish word as sophisticated if it is not among the 2000 most frequent words in
SUBTLEX-ESP. We further adapt the Lexical Frequency Profile features by considering
the SUBTLEX-ESP corpus.
25A sample of the corpus is available at https://www.corpusdata.org/spanish.asp (Last accessed: 17/12/29)
26http://crr.ugent.be/archives/679 (Last accessed: 17/12/28)
27This number was chosen because it is the number of words available in the English database.
28Code available at https://github.com/vic/silabas4j (Last accessed: 17/12/29)
6 Experiments and Results
This section describes the machine learning experiments conducted in order to find out
how the difficulty of C-Test items, sentences and text passages can be predicted based
on the described difficulty features. We experiment with different clustering approaches
and analyze how the predictions correlate with the actual learner performance data.
At the end of this Section the results will be discussed.
First, we predict the resulting cluster memberships using the whole set of difficulty
features for both languages. Different performance profiles are evaluated by taking into
account their interpretability.
We then focus on those two performance profiles that are suitable for real world
applications where the selection of text passages should be based on predictions about
the difficulty of single passages.
In order to compare the difficulty characteristics across the two languages, English
and Spanish, a unified set of comparable features is designed. We conduct classification
experiments using this feature set on the presented performance profiles. In addition, we
inspect the performance on the more local sentence and item levels, since the linguistic
differences between the languages are expected to be grammatical rather than
content-related.
6.1 Experimental Setup
The data investigated in the following experiments comprises performance
statistics for English and Spanish C-Tests gathered from the FSZ of the University
of Tübingen (cf. Section 3). As described in Section 4.2, we perform Hierarchical
Clustering on Principal Components (HCPC) on the performance data. The resulting
classes are then predicted using machine learning models which are trained on the
implemented difficulty features. The R package caret is used for the data preprocessing
and the machine learning part (Kuhn et al., 2015). Following Svetashova (2015), the
machine learning is performed using two different algorithms: Support Vector Machines
(SVM) and Random Forests (RF)29. The caret package provides a function that splits
29Yulia Svetashova kindly provided her machine learning pipeline including feature preprocessing and model training
the data into training and test sets. We use this function to create data partitions and
use 80% of the data for training and 20% for testing. Svetashova (2015) chose the
named algorithms for several reasons (cf. Svetashova, 2015, p. 71 ff.):
• ”They have been successfully applied to a wide spectrum of NLP problems”
• SVM makes a comparison to results reported in other literature possible
• ”both algorithms have the implementations that are fast in terms of training time
and are suitable for regression and classification tasks”
• ”these two algorithms showed best accuracy values in our preliminary testing
experiments”
• ”SVMs are reported to handle successfully large feature sets”
• ”The attractive property of random forests is that they permit to understand the
relative importance of variables that provide the predictive accuracy”
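The 80/20 partition mentioned above can be illustrated with a class-wise (stratified) split. The thesis uses a function of the R package caret for this step; caret's createDataPartition samples within each class, but whether exactly this function was used is an assumption. The Python sketch below only mirrors the idea with the standard library, and the function name and seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_indices, test_indices), sampling within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep the class proportions roughly equal in both partitions.
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


toy_labels = ["cluster1"] * 10 + ["cluster2"] * 5
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

Splitting within each class keeps small clusters represented in the test set, which matters when the number of clusters grows from 4 to 8.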
In terms of feature preprocessing, the machine learning pipeline follows four steps:
1. Dealing with missing values: Missing values in the feature data are caused
by different reasons. First, features based on table lookups can contain missing
values if the item under consideration is not present in the list. This applies
to the item-level psycholinguistic features based on the MRC dictionary lookup
(including age of acquisition, concreteness, familiarity, and meaningfulness fea-
tures). We replace missing values by the average feature values in these features.
Furthermore, features based on Web1T corpus lookups contain missing values if a
certain n-gram is not found in the corpus at all. We chose to assign the minimum
probability found in the corresponding feature to all missing values in the fea-
tures. The position based feature that measures the uni- or trigram probability
of the previous item is always missing for the first item in the text. These missing
values are replaced by the average feature value. The features indicating whether
an item is a lemma’s most frequent token or indicating whether an item has a
lemma’s most frequent tag are set to false if the words are not found in the MRC
lookup.
2. Zero- and near-zero variance removal: In order to steer clear of features
that contain no or little information, zero and near-zero variance removal can
be applied. We checked for zero and near-zero variance features, but it turned
out that the feature set integrated in the pipeline of this thesis does not contain
any such features. This is caused by the fact that we focus on those features of
Svetashova (2015) that turned out to be predictive in her experiments, in which
she also removed such variables first.
3. Highly correlated predictors removal: The consideration of multiple features
that correlate highly with each other does not result in any gain of information
when performing machine learning experiments. We therefore filter out variables
that correlate highly with another variable, setting the cutoff threshold to 0.95.
Table 6.1 lists the total number of features and the number of features remaining
after the removal of highly correlated predictors. For the English data, 98 out of 145
numeric features remain after the highly correlated numeric features have been
removed. For the Spanish data, 66 out of 95 remain. High correlations mainly
concern DLT features on text and sentence level, and feature variants that only
differ in terms of the mathematical formula (e.g., different variations of TTR or the
verb variation formula). Considering types instead of tokens when computing the
lexical frequency profile measures on text level also makes no difference in some
cases, leading to highly correlating variables (e.g.,
FrequencyProfile Band7 Type highly correlates with FrequencyProfile Band7 Tok).
We further found highly correlating variables within the group of candidate and
context n-gram features.
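Two of the preprocessing steps above, imputing missing values and filtering highly correlated predictors, can be sketched in a few lines. This is an illustration under simplifying assumptions, not the caret-based pipeline used in the thesis: features are plain lists with None for missing values, zero-variance features are assumed to have been removed already, and the filter keeps the first feature of each correlated group, which is a simpler rule than the one a library routine may apply.

```python
import math


def impute(values, strategy="mean"):
    """Replace None by the mean (or the minimum) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed) if strategy == "mean" else min(observed)
    return [fill if v is None else v for v in values]


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drop_highly_correlated(features, cutoff=0.95):
    """Keep a feature only if |r| <= cutoff with every feature kept so far."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept


toy_features = {
    "ttr": [0.5, 0.6, 0.7, 0.8],
    "rttr": [1.0, 1.2, 1.4, 1.6],   # a rescaled copy of "ttr"
    "length": [3.0, 1.0, 4.0, 1.0],
}
print(impute([1.0, None, 3.0]))
print(impute([0.2, None, 0.1], strategy="min"))
print(drop_highly_correlated(toy_features))
```

The "min" strategy corresponds to assigning the minimum observed probability to unseen n-grams, while the mean strategy corresponds to the dictionary-lookup features.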
                                                    EN     ES
# features                                          156    100
# numeric predictors                                145    95
# numeric predictors (highly correlated deleted)    98     66
# factor predictors                                 11     5
# all predictors (filtered)                         109    71
Table 6.1: The number of features (factor and numeric) before and after removing
highly correlated predictors
In order to evaluate the models, we report macro- and micro-averaged F1 scores,
inspect the variable importance, and train additional models on feature subsets. The
R package caret provides a variable importance evaluation function. This function uses
the model information and shows which variables are most predictive.
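To make the two reported scores concrete, the sketch below computes macro- and micro-averaged F1 for single-label multi-class predictions. Averaging the macro score over all labels that occur in either the gold or the predicted labels is an assumption of this example; the thesis does not spell out its exact averaging convention.

```python
def f1_scores(gold, predicted):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(gold) | set(predicted))
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was missed

    def f1(true_pos, false_pos, false_neg):
        denominator = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denominator if denominator else 0.0

    # Macro: average of the per-class F1 scores (every class counts equally).
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: F1 over the pooled counts (every instance counts equally).
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


gold = [1, 1, 1, 2, 2, 3]
pred = [1, 1, 2, 2, 2, 3]
macro_f1, micro_f1 = f1_scores(gold, pred)
print(macro_f1, micro_f1)
```

In the single-label setting the micro-averaged F1 equals accuracy, so a gap between the macro and micro scores indicates that small clusters are predicted worse (or better) than large ones.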
6.2 Investigation of Performance Profiles
This section describes the machine learning experiments conducted in order to find
out how the difficulty of C-Test items can be predicted based on different information
about the test takers’ performance. As described in Section 4.2 (Clustering Approach),
we can describe the difficulty of an item according to three dimensions: text difficulty,
sentence difficulty, and item difficulty. Assigning a binary label (easy or difficult) to each
dimension leads to the 8 clusters listed in Table 4.1 (p. 26). Depending on the number of
dimensions present in a performance profile, we experimented with different numbers
of clusters: If three dimensions are present in the performance cluster, clustering is
performed with 4 to 8 clusters. If only two dimensions are present, we perform clustering
with 4 clusters and add experiments with 5 clusters, since the 5 cluster individuals
maps (Figure 4.5 and 4.6, p. 28) are interpretable in a way that is suitable for real
world applications.
When describing performance profiles, the terms Text, Sent, and Item indicate
that only the overall average variable is considered (percCorrect, sentAv, or textAv).
TextProf, SentProf, and ItemProf further include the proficiency level variables. The
full set of 21 performance variables is described on page 23 and referred to as all.
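The naming scheme can be made concrete with a small sketch that expands a profile name into its performance variables. It assumes six proficiency levels (A1 to C2), which is consistent with 21 = 3 × 7 variables in the full set; the names of the text-level and item-level proficiency variables (e.g. textAvA1, percCorrectA1) are hypothetical, formed by analogy with the sentence-level names.

```python
# Assumed proficiency levels; consistent with 21 = 3 * (1 + 6) variables.
LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")

# Overall average variable per dimension, as named in the text.
OVERALL = {"Text": "textAv", "Sent": "sentAv", "Item": "percCorrect"}


def profile_variables(profile):
    """Expand e.g. 'Text SentProf Item' into its performance variables."""
    variables = []
    for part in profile.split():
        with_levels = part.endswith("Prof")
        dimension = part[:-4] if with_levels else part
        overall = OVERALL[dimension]
        variables.append(overall)
        if with_levels:
            # Hypothetical naming: overall variable name plus level suffix.
            variables.extend(f"{overall}{level}" for level in LEVELS)
    return variables


for name in ("Text Sent Item", "Text SentProf Item", "TextProf SentProf ItemProf"):
    print(name, len(profile_variables(name)))
```

Under these assumptions a plain dimension contributes one variable and a Prof dimension contributes seven.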
As seen in the variables factor map, the English proficiency level variables on text
level highly correlate with each other, independently of the number of clusters (4-8)
(see Figure 4.3). We therefore consider proficiency variable combinations on sentence
and item level (SentProf and ItemProf ) in more detail than on text level. However,
in order to obtain results that are comparable to those reported in Svetashova (2015),
we add clusters including proficiency level variables on text level (TextProf ItemProf ).
6.2.1 Classification Results on English Performance Profiles
                                          SVM               RF
English Performance Profiles    # vars    mac F1   mic F1   mac F1   mic F1
all CL4                         21        0.63     0.63     0.67     0.68
all CL5                         21        0.57     0.61     0.60     0.63
all CL6                         21        0.56     0.58     0.61     0.62
all CL7                         21        0.58     0.58     0.62     0.61
all CL8                         21        0.54     0.58     0.57     0.60
Text Sent Item CL4              3         0.68     0.71*    0.64     0.71*
Text Sent Item CL5              3         0.67     0.67     0.69     0.71*
Text Sent Item CL6              3         0.61     0.62     0.59     0.62
Text Sent Item CL7              3         0.59     0.58     0.60     0.60
Text Sent Item CL8              3         0.63     0.62     0.64     0.62
Text SentProf Item CL4          9         0.79*    0.83*    0.94*    0.95*
Text SentProf Item CL5          9         0.72*    0.73*    0.86*    0.86*
Text SentProf Item CL6          9         0.69     0.68     0.86*    0.85*
Text SentProf Item CL7          9         0.74*    0.72*    0.89*    0.86*
Text SentProf Item CL8          9         0.77*    0.72*    0.89*    0.85*
Text SentProf ItemProf CL4      15        0.56     0.57     0.60     0.60
Text SentProf ItemProf CL5      15        0.45     0.54     0.49     0.57
Text SentProf ItemProf CL6      15        0.45     0.52     0.51     0.57
Text SentProf ItemProf CL7      15        0.47     0.52     0.47     0.53
Text SentProf ItemProf CL8      15        0.46     0.48     0.49     0.52
TextProf ItemProf CL4           14        0.76*    0.77*    0.77*    0.79*
TextProf ItemProf CL5           14        0.77*    0.75*    0.76*    0.74*
Text ItemProf CL4               8         0.59     0.63     0.61     0.65
Text Item CL4                   2         0.73*    0.75*    0.74*    0.75*
Text Item CL5                   2         0.76*    0.74*    0.75*    0.72*
Table 6.2: English: The macro- and micro-averaged F1-scores for SVM and RF
classification using all features. Values higher than or equal to 0.70 are highlighted
(marked with *).
Table 6.2 presents the SVM and RF results for the classification of different
performance profiles using the full feature set. Values higher than or equal to 0.70 are
highlighted. The performance profiles that the classifiers predict best are those
containing a high proportion of sentence performance variables. The profile Text SentProf
Item CL4 is built using four clusters and the following nine performance variables:
textAv, sentAv, the sentence averages split by proficiency level (sentAvA1, sentAvA2, ...),
and the item variable percCorrect. Figure 6.1 shows the individuals factor map and the
variables factor map for this profile. Roughly speaking, the clusters describe items
according to the sentAv variable: Cluster 1 contains items in difficult sentences,
followed by those in less difficult sentences (red items, cluster 2). Even less difficult are
the sentences containing cluster 3 items, and finally, cluster 4 items occur in the easiest
sentences. Predicting these classes using all features resulted in a micro-averaged F1
score of 0.95 (macro-averaged: 0.94) for RF classification. The most predictive features
in this model are listed in Figure 6.2 and reflect the tendencies interpreted from the
factor maps in Figure 6.1: 7 out of the 10 most predictive features are sentence level
features. The two most predictive features are the parse score and the number of gaps
in a sentence. Only two item level features (Item Posit Number Token and Item Posit
Number Gap) are among the top 20 features.
[Figure 6.1, left panel: Factor map of the individuals (C-Test items), Dim 1 (71.74%) against Dim 2 (10.56%), items labeled by ID and colored by cluster (cluster 1 to cluster 4). Right panel: Variables factor map (PCA), Dim 1 (71.74%) against Dim 2 (10.56%), showing the variables textAv, sentAv, sentAvA1, sentAvA2, sentAvB1, sentAvB2, sentAvC1, sentAvC2, and percCorrect.]
Figure 6.1: English: Individuals and variables factor map using performance profile
Text SentProf Item CL4. Predicting these 4 classes using all English features
results in an F1 score of 0.95
Table 6.2 further shows that the RF prediction of the English performance profile
TextProf ItemProf CL4 results in a micro-averaged F1 score of 0.79, which is comparable
to the RF result reported by Svetashova (2015): a micro-averaged F1 score of 0.7959 on
a performance profile that contains only the two dimensions text and item. However,
the extended data set does not show the clear-cut distinction between the item clusters
that the data set investigated by Svetashova (2015, p. 91) shows. We therefore chose
to inspect in more detail the performance profile that makes use of two variables only:
textAv and percCorrect (see p. 28 for factor maps and a description and interpretation of
the resulting clusters). The SVM classification for this profile (Text Item CL5) results
in a macro-averaged F1 score of 0.76. We will now describe the results for Spanish,
where similarities can be drawn in terms of interpretable clustering approaches.
[Figure 6.2 (plot not reproduced): dot plot of variable importance (x-axis: Importance, ticks 20 to 100). Predictors in plot order: Text_Lex_Sophistication_VS2, Text_Lex_FrequencyProfile_Band6_Tok, Text_Syn_Complexity_MLS, Text_Dlt_mMaxTotalIntegrationCostPerFiniteVerb, Text_Lex_awlPercent, Item_Posit_Number_Gap, Text_Lex_FrequencyProfile_Band3_Type, Text_Lex_FrequencyProfile_Band1_Type, Text_Lex_FrequencyProfile_Band2_Type, Text_Dlt_cTotalIntegrationCostsAtFiniteVerbPerFiniteVerb, Text_Syn_Complexity_MLT, Text_Lex_Sophistication_LS1, Sent_Const_numComplicators, Sent_sentID, Item_Posit_Number_Token, Sent_Dlt_cTotalIntegrationCostsAtFiniteVerbPerFiniteVerb, Sent_Dlt_cmvMaxTotalIntegrationCostPerFiniteVerb, Sent_Const_parseDepth, Sent_numGapsInSent, Sent_Const_parseScore.]
Figure 6.2: English: Variable importance of the top 20 predictors in the RF model, when predicting the 4 classes of the performance profile Text SentProf Item CL4
6.2.2 Classification Results on Spanish Performance Profiles
The classification results for the Spanish performance profiles are listed in Table 6.3.
Compared to English, the F1 scores are considerably higher for most of the profiles.
Values of 0.80 or higher are highlighted, whereas for English a threshold of 0.70
was used. The best performance profiles are again the ones with a high
proportion of sentence level variables. Using the best profile (Text SentProf Item CL4),
the RF classification resulted in a micro-averaged F1 score of 0.93, which is slightly
lower than for the English data (0.95). The variable importance plotted in Figure
6.3 also shows similarities to what could be observed for English: the parse score and the
number of gaps in the sentence are the most predictive features. In contrast to English,
however, these two features are less dominant for Spanish (100% numGapsInSent and
62.84% parseScore for Spanish, compared to 100% parseScore and 88.12%
numGapsInSent for English).
High classification results can also be observed for the profiles combining text and
item variables: the four classes from profile TextProf ItemProf CL4 were predictable
with F1 scores of up to 0.86 (0.79 for English). However, for reasons of interpretability,
the performance profile Text Item CL5 (cf. p. 27), with an F1 score of 0.82 in SVM
classification, was chosen for further experiments.
Spanish Performance Profiles   # vars   SVM mac F1   SVM mic F1   RF mac F1   RF mic F1
all CL4                        21       0.68         0.70         0.68        0.70
all CL5                        21       0.73         0.68         0.75        0.71
all CL6                        21       0.74         0.71         0.72        0.69
all CL7                        21       0.70         0.69         0.78        0.77
all CL8                        21       0.71         0.68         0.74        0.73
Text Sent Item CL4             3        0.75         0.77         0.73        0.76
Text Sent Item CL5             3        0.78         0.77         0.78        0.77
Text Sent Item CL6             3        0.78         0.77         0.78        0.78
Text Sent Item CL7             3        0.70         0.70         0.73        0.72
Text Sent Item CL8             3        0.63         0.68         0.68        0.72
Text SentProf Item CL4         9        0.87         0.89         0.91        0.93
Text SentProf Item CL5         9        0.91         0.91         0.92        0.93
Text SentProf Item CL6         9        0.80         0.83         0.82        0.83
Text SentProf Item CL7         9        0.83         0.85         0.85        0.85
Text SentProf Item CL8         9        0.84         0.86         0.86        0.87
Text SentProf ItemProf CL4     15       0.71         0.71         0.74        0.74
Text SentProf ItemProf CL5     15       0.64         0.65         0.69        0.70
Text SentProf ItemProf CL6     15       0.63         0.66         0.64        0.68
Text SentProf ItemProf CL7     15       0.58         0.61         0.60        0.63
Text SentProf ItemProf CL8     15       0.50         0.53         0.47        0.52
TextProf ItemProf CL4          14       0.80         0.84         0.83        0.86
TextProf ItemProf CL5          14       0.68         0.71         0.72        0.75
Text ItemProf CL4              8        0.48         0.48         0.54        0.55
Text Item CL4                  2        0.85         0.82         0.84        0.82
Text Item CL5                  2        0.82         0.82         0.81        0.80
Table 6.3: Spanish: The macro- and micro-averaged F1 scores for SVM and RF classification using all features. Values higher than or equal to 0.80 are highlighted.
6.3 Predicting C-Test Performance on Text and Item Level
The difficulty prediction of single texts is relevant for the selection of text passages
during the composition of a whole C-Test. As shown by Svetashova (2015), C-Test
items can be clustered into 4 groups according to two dimensions: overall text passage
difficulty and item difficulty. She presents experiments on unseen data and suggests
how the results could be applied in a C-Test generation application. A C-Test item
in a given text passage could be highlighted according to its item difficulty (Ie and
Id) as in Figure 2.5 on p. 19. Furthermore, the whole text passage is classified as
easy or difficult (Te and Td). However, given the newly compiled extended data sets,
clustering the items into 5 groups provides a clearer picture for both languages than
using 4 clusters (cf. Figures 4.5 and 4.6 on p. 27). We will therefore investigate in more
detail how the item classes resulting from the performance profile Text Item CL5
can be predicted automatically. For both languages, this clustering principle is both
interpretable and suitable for use within a real-world application.
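The highlighting idea described above can be sketched as a simple lookup. The following Python sketch maps the five English Text Item CL5 clusters to item and text difficulty labels, following the cluster interpretation given in Figure 6.5. The names `ENGLISH_CL5` and `describe_cluster` are hypothetical and only serve as an illustration; they are not part of the system described in this thesis.

```python
# Illustrative sketch: the mapping follows the English cluster interpretation
# of the profile Text Item CL5 (Figure 6.5). All names are hypothetical.
ENGLISH_CL5 = {
    1: ("difficult", "difficult"),  # Id, Td
    2: ("easy", "difficult"),       # Ie, Td
    3: ("difficult", "easy"),       # Id, Te
    4: ("easy", "easy"),            # Ie, Te
    5: ("any", "very easy"),        # all items in very easy texts
}


def describe_cluster(cluster_id):
    """Return a readable item/text difficulty label for a predicted cluster."""
    item_difficulty, text_difficulty = ENGLISH_CL5[cluster_id]
    return f"{item_difficulty} item, {text_difficulty} text"


# Example: labels that a generation tool could use to color each gap.
labels = [describe_cluster(c) for c in (1, 4, 5)]
```

A separate lookup table would be needed for Spanish, since the Spanish clusters are numbered and interpreted differently (Figure 6.7).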
[Figure 6.3 (plot not reproduced): dot plot of variable importance (x-axis: Importance, ticks 20 to 100). Predictors in plot order: Text_Lex_Sophistication_VS1, Text_Lex_Variation_Adv, Text_Lex_FrequencyProfile_Band5_Type, Sent_sentID, Item_Psy_TFIDF, Item_SubtlexCand_deltaBigger, Text_Lex_Variation_Lex, Item_Psy_DF, Text_Lex_Density_Function, Item_Posit_Number_Gap, Sent_Dlt_cvTotalIntegrationCostsAtFiniteVerbPerFiniteVerb, Sent_Dlt_oMaxTotalIntegrationCostPerFiniteVerb, Text_Lex_Density_Determiner, Sent_Dlt_mMaxTotalIntegrationCostPerFiniteVerb, Sent_Dlt_cmvMaxTotalIntegrationCostPerFiniteVerb, Item_Posit_Number_Token, Text_Lex_Variation_TTR_Uber, Sent_Const_parseDepth, Sent_Const_parseScore, Sent_numGapsInSent.]
Figure 6.3: Spanish: Variable importance of the top 20 predictors in the RF model, when predicting the 4 classes of the performance profile Text SentProf Item CL4
The SVM model trained on the whole feature set resulted in a macro-averaged F1
score of 0.76 for the English data (Table 6.2), and 0.82 for the Spanish data (Table
6.3). In the following, we will consider the variable importance of the models for both
languages and inspect some mean values of predictive features.
6.3.1 English
The most predictive features for the prediction of the five classes of the English profile
Text Item CL5 are plotted in Figure 6.4. Of the 20 most predictive features, 15 are
item level features. The most prominent feature subgroup on item level is the group
of psycholinguistic features, with term and document frequency measures leading the
way (TFIDF and DF). The number of stylistic categories that contain the item
is also a predictive variable. Further predictive psycholinguistic features are based on
frequency lists and psycholinguistic ratings. The surface-based features indicating the
length of the item’s lemma and the length of the gap in letters are also among the top
20 features, as is the number of candidates with a higher unigram probability in the
Web1T corpus. On the text level, POS tag based lexical
density and variation, and human sentence processing measures based on Dependency
Locality Theory perform well.
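To make the term and document frequency measures more concrete, the following Python sketch computes a document frequency and one common TF-IDF variant for a word over a small collection of tokenized documents. The exact weighting scheme and corpus behind Item Psy DF and Item Psy TFIDF are not restated in this section, so the formula (raw term frequency times the natural log of the inverse document frequency), the function names, and the toy documents are assumptions made for illustration only.

```python
import math


def document_frequency(term, documents):
    """Number of documents (lists of tokens) that contain the term."""
    return sum(1 for doc in documents if term in doc)


def tf_idf(term, document, documents):
    """Raw term frequency in one document times log inverse document frequency."""
    tf = document.count(term)
    df = document_frequency(term, documents)
    if df == 0:
        return 0.0  # term never occurs in the collection
    return tf * math.log(len(documents) / df)


# Hypothetical toy collection of tokenized documents.
docs = [
    ["the", "test", "is", "short"],
    ["the", "item", "is", "rare"],
    ["the", "text", "is", "easy"],
]

df_the = document_frequency("the", docs)    # occurs in every document
df_rare = document_frequency("rare", docs)  # occurs in one document only
```

A word that appears in every document receives a TF-IDF weight of zero under this variant, while a word that is concentrated in few documents receives a higher weight.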
[Figure 6.4 (plot not reproduced): dot plots of variable importance in five panels labeled A to E (x-axis: Importance, ticks 20 to 100). Predictors in plot order: Item_Psy_Sem_Concr, Item_Psy_Sem_FamFeature, Text_Lex_Variation_Adv, Item_Ling_lemmaLength, Text_Dlt_mMaxTotalIntegrationCostPerFiniteVerb, Text_Lex_Variation_Mod, Item_SubtlexCand_weakness, Text_Dlt_vmMaxTotalIntegrationCostPerFiniteVerb, Item_Ngram_Probs_Five_Left, Item_Ling_endingLength, Text_Lex_Density_Adverb, Item_Psy_Sem_AoA_Kup, Item_Psy_Lfp_Band1.1, Item_Ngram_Cands_BiggerUniDelta, Item_Psy_StylNCatsMRC, Item_Psy_StylNCatsNLTK, Item_Ngram_Probs_Bi_Left, Item_Psy_TFIDF, Item_Ngram_Probs_Uni, Item_Psy_DF.]
Figure 6.4: English: Variable importance of the top 20 predictors in the SVM model, when predicting the 5 classes of the performance profile Text Item CL5
Figure 6.5 visualizes how the clusters can be interpreted based on the two dimensions
text and item difficulty. For each of the clusters, Table 6.4 lists mean values of features
on text and item level. On the item level, the mean feature values of clusters 1 and
3 should indicate higher difficulty, whereas the mean feature values of clusters 2 and
4 should indicate lower difficulty. Inspecting the mean number of documents that
contain the item (Item Psy DF), one can see higher values for the green highlighted
clusters (61.31 and 80) than for the red ones (18.5 and 31.9). This observation supports
an intuitive assumption: throughout a collection of documents, difficult items occur
in fewer documents than easy items. The other three item level features listed
in the table also reflect intuitive assumptions: the frequency of the item in a corpus is
lower for difficult words and higher for easy words. The mean values for the feature
Item SubtlexCand weakness show that an item becomes more difficult if many of its
solution candidates have a higher frequency than the actual solution. On
text level, the values of cluster 5 indicate very low textual difficulty in contrast to the
other four clusters. The distinction between mean values of medium (green) and high
(red) difficulty is not noticeable in the lexical features adverb density and modifier
variation. However, the DLT mean feature values reflect the degrees of difficulty in a
remarkable way: the higher the integration cost features, the less difficult
the gap restoration.
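The DLT trend described above can be checked directly against the cluster means reported in Table 6.4. The following Python sketch groups the five English clusters by their text difficulty (as interpreted in Figure 6.5) and averages the reported Text Dlt vmMaxTotalInt means per group. Note that it takes an unweighted mean of cluster means, which ignores cluster sizes; this simplification and all identifier names are assumptions of the sketch.

```python
# Mean values of Text Dlt vmMaxTotalInt per cluster, copied from Table 6.4.
vm_max_total_int = {1: 1.165, 2: 1.145, 3: 1.34, 4: 1.3, 5: 1.47}

# Text difficulty of each English cluster according to Figure 6.5.
text_difficulty = {
    1: "difficult",
    2: "difficult",
    3: "easy",
    4: "easy",
    5: "very easy",
}


def mean_by_group(values, groups):
    """Average the per-cluster values within each group label (unweighted)."""
    collected = {}
    for cluster, value in values.items():
        collected.setdefault(groups[cluster], []).append(value)
    return {label: sum(vals) / len(vals) for label, vals in collected.items()}


group_means = mean_by_group(vm_max_total_int, text_difficulty)

# The integration cost should rise as text difficulty decreases.
ordered = [group_means[g] for g in ("difficult", "easy", "very easy")]
is_increasing = all(a < b for a, b in zip(ordered, ordered[1:]))
```

With the values from Table 6.4, the group means rise from the difficult texts over the easy texts to the very easy texts, which matches the trend stated in the text.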
[Figure 6.5 (diagram not reproduced): clusters arranged along two axes, decreasing text difficulty and decreasing item difficulty.]
Cluster 1: difficult items in difficult texts
Cluster 2: easy items in difficult texts
Cluster 3: difficult items in easy texts
Cluster 4: easy items in easy texts
Cluster 5: all items in very easy texts
Figure 6.5: English: Cluster interpretation given the performance profile Text Item CL5
Feature                      Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5
                             Id, Td      Ie, Td      Id, Te      Ie, Te      I all, T very e
Item Psy DF                  18.474      61.31       31.903      80          67.449
Item Ngram Probs Uni         -11.277     -8.397      -10.588     -7.316      -8.136
Item Psy StylNcatsNLTK       8.153       13.021      9.897       13.935      12.638
Item SubtlexCand weakness    8.095       2.464       5.009       0.907       1.572
Text Lex density Adverb      0.054       0.059       0.057       0.061       0.028
Text Dlt vmMaxTotalInt       1.165       1.145       1.34        1.3         1.47
Text Lex Variation Mod       0.21        0.212       0.205       0.212       0.157
Text Dlt mMaxTotalInt        0.624       0.61        0.749       0.712       0.841
Table 6.4: This table lists the mean values of different English item and text level features by cluster. Cells highlighted in green indicate simplicity on the corresponding locality level. Cells highlighted in red indicate difficulty. On text level, yellow indicates strong simplicity.
6.3.2 Spanish
The top 20 predictors in the SVM model for the performance profile Text Item CL5 are
shown in Figure 6.6. Fifteen predictors are text level features, including features on lexical
variation, sophistication, density, and frequency profiles. On item level, only three
features are among the top 20 (TFIDF, DF, and IDF). Furthermore, two sentence level
features can be considered predictive: the parse score feature as well as one of the
DLT integration cost features (cmvMaxTotalIntegrationCostPerFiniteVerb).
Table 6.5 lists the mean values of features that were predictive in either SVM or RF.
[Figure 6.6 (plot not reproduced): dot plots of variable importance in five panels labeled A to E (x-axis: Importance, ticks 0 to 80). Predictors in plot order: Text_Lex_Variation_VV1co, Text_Lex_Density_Noun, Sent_Dlt_cmvMaxTotalIntegrationCostPerFiniteVerb, Sent_Const_parseScore, Text_Lex_FrequencyProfile_Band2_Type, Text_Lex_Density_Adposition, Text_Lex_Variation_TTR_Uber, Text_Lex_Sophistication_LS1, Item_Psy_TFIDF, Item_Psy_IDF, Item_Psy_DF, Text_Lex_Density_Conjunction, Text_Lex_Sophistication_LS2, Text_Lex_FrequencyProfile_Band5_Type, Text_Lex_Density_Determiner, Text_Lex_Density_Adverb, Text_Lex_FrequencyProfile_Band1_Type, Text_Lex_FrequencyProfile_Band6_Tok, Text_Lex_FrequencyProfile_Band6_Type, Text_Lex_Variation_Adv.]
Figure 6.6: Spanish: Variable importance of the top 20 predictors in the SVM model, when predicting the 5 classes of the performance profile Text Item CL5
For item level features, the values in clusters 2 and 3 should indicate higher difficulty,
whereas the values in clusters 4 and 5 should indicate lower difficulty. The listed features
confirm these assumptions. Items that occur in fewer documents are more difficult to
fill. Items with a positive difference between the frequency of the item itself
and that of its most frequent competing candidate (Item SubtlexCands deltaBigger) are
easier to master. A
high number of possible candidates is also an indicator for the item’s difficulty. On
text level, the mean values of all three features in clusters 3 and 5 indicate simplicity in
contrast to the values in clusters 2 and 4. The distinction between the very difficult texts in
cluster 1 and the medium difficult texts in clusters 2 and 4 is only noticeable in the adverb
variation. A high number of distinct adverbs divided by the number of all lexical words
in a text is associated with higher textual difficulty.
6.4 Comparative Investigation of Difficulty Prediction for Spanish and English
In the following, we will first compare the classifiers’ performance in predicting the
five classes of the presented profile Text Item CL5 by investigating a common set of
comparable features for both languages.
Additionally, we perform experiments on a performance profile including the test
takers’ performance on sentence and item level. This is motivated by the assumption
that the differences between Spanish and English C-Test item difficulty are mainly
caused by the differences in the grammatical structure of the two languages rather
than by the text’s general content.
[Figure 6.7 (diagram not reproduced): clusters arranged along two axes, decreasing text difficulty and decreasing item difficulty.]
Cluster 1: all items in very difficult texts
Cluster 2: difficult items in difficult texts
Cluster 3: difficult items in easy texts
Cluster 4: easy items in difficult texts
Cluster 5: easy items in easy texts
Figure 6.7: Spanish: Cluster interpretation given the performance profile Text Item CL5
Feature                          Cluster 1          Cluster 2    Cluster 3   Cluster 4   Cluster 5
                                 I all, T very d    Id, Td       Id, Te      Ie, Td      Ie, Te
Item Psy DF                      37.307             15.456       19.346      97.030      108.913
Item SubtlexCands deltaBigger    1144.471           -1010.447    -871.123    2023.725    3305.070
Item SubtlexCands numCand        37.067             32.096       29.867      22.269      22.631
Text Dlt cmvMaxTotalInt          0.768              0.851        0.996       0.855       1.022
Text Read Kincaid                19.020             19.804       17.503      19.219      17.429
Text Lex Variation Adv           0.101              0.059        0.056       0.058       0.054
Table 6.5: This table lists the mean values of different predictive Spanish item and text level features by cluster. Cells highlighted in green indicate simplicity on the corresponding locality level. Cells highlighted in red indicate difficulty. On text level, violet indicates strong difficulty.
6.4.1 Comparison of Performance on Text and Item Level
We extract a subset of features that encode morphological or syntactic information
and that we consider suitable for a linguistic comparison. This includes features
encoding POS or parsing information, including DLT features. Furthermore, the
SUBTLEX candidate features are included, since they might capture influences caused by
the richer morphology of the Spanish language. We consider the SUBTLEX candidate
features comparable, since the features rely on the same type of corpus and the exact
same number of words in the databases. The results for RF and SVM using different
feature subsets are presented in Table 6.6. All F1 scores show a better performance for
the Spanish than for the English data. Using all 63 comparable features and filtering
highly correlated features leads to a macro-averaged F1 score of 0.78 for the Spanish
data, and 0.70 for the English data. The feature Text Lex Density Lex and four DLT
features were removed due to high correlations in the English, but not in the Spanish
data. Thus, the DLT features show higher variance in the Spanish data. In the following,
we check for differences in the variable importance of the models and further focus
on the SUBTLEX candidate features and on the textual DLT features.
features                      #feat   SVM mac. F1   SVM mic. F1   RF mac. F1   RF mic. F1
EN all 63 (no highly cor.)    31      0.70          0.68          0.70         0.68
ES all 63 (no highly cor.)    36      0.77          0.76          0.78         0.76
EN Item SubtlexCands          4       0.20          0.39          0.27         0.38
ES Item SubtlexCands          4       0.32          0.40          0.30         0.39
EN Text Dlt                   16      0.46          0.47          0.49         0.51
ES Text Dlt                   16      0.50          0.64          0.52         0.63
Table 6.6: Classification results for SVM and RF for the performance profile Text Item CL5 and different feature subsets. 63 features were considered to be comparable. After filtering highly correlated variables, 31 remained for the English data and 36 for the Spanish data. For the classification using the other feature subsets, no filtering was applied.
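The filtering of highly correlated features mentioned above can be sketched as follows. This Python sketch greedily keeps a feature only if its absolute Pearson correlation with every feature kept so far stays at or below a threshold. The threshold of 0.9, the greedy order, the function names, and the toy columns are all assumptions for illustration; the section does not state which cutoff or filtering procedure was actually used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treated as uncorrelated in this sketch
    return cov / math.sqrt(var_x * var_y)


def drop_highly_correlated(features, threshold=0.9):
    """Keep features in insertion order unless |r| with a kept one exceeds threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy feature columns (one value per item).
toy_features = {
    "dlt_a": [1.0, 2.0, 3.0, 4.0],
    "dlt_b": [2.0, 4.0, 6.0, 8.1],  # almost a multiple of dlt_a
    "lex_c": [0.3, 0.1, 0.4, 0.2],
}

remaining = drop_highly_correlated(toy_features)
```

Because such a filter depends on the data, the same starting set of 63 features can shrink to different subsets per language, as with the 31 English and 36 Spanish features reported in Table 6.6.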
Table 6.7 lists the RF variable importance scaled over all features. Features that
occur in the top 10 in both languages are printed in bold. For English, three SUBTLEX
candidate features are the most important predictors by far, with a scaled importance
ranging from 61.85 (weakness) to 100 (deltaBigger). A further feature on the item
level is the item’s number of dependencies. On the text level, four lexical variation
measures and one DLT measure are among the 10 most predictive features. The parse
score represents the only highly predictive sentence level feature. For the Spanish data,
the picture looks similar. The most important feature is also the SUBTLEX candidate
feature deltaBigger. It describes the difference between the per million word frequency
of an item and that of its competing candidate with the highest per million word frequency. The
number of competing candidates and the candidate weakness are also among the 10
most predictive features, but have lower importance than in the English model. The
second most important feature for Spanish is the boolean feature indicating whether an
item is a content word or not. This feature is not among the 10 most predictive features
for English. On text level, the DLT measure that combines the conditions c, v, and m
is also important for both languages, as are certain lexical density and variation
measures.
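The definition of deltaBigger given above can be illustrated with a short Python sketch. It computes the difference between the per million word frequency of the solution and that of its most frequent competing candidate, together with the number of competing candidates. The function name, the dictionary keys, the handling of gaps without competitors, and the example frequencies are hypothetical; how the candidates themselves are generated is not covered here.

```python
def candidate_features(solution_freq, candidate_freqs):
    """Compute two candidate-space features from per million word frequencies.

    solution_freq: frequency of the intended solution word.
    candidate_freqs: frequencies of the competing candidates for the gap.
    """
    num_cand = len(candidate_freqs)
    # Assumption: without competitors, compare against a frequency of zero.
    strongest = max(candidate_freqs, default=0.0)
    delta_bigger = solution_freq - strongest
    return {"numCand": num_cand, "deltaBigger": delta_bigger}


# Hypothetical gap whose strongest competitor is more frequent than the solution.
hard_gap = candidate_features(120.0, [35.5, 300.0, 2.0])

# Hypothetical gap whose solution is more frequent than every competitor.
easy_gap = candidate_features(120.0, [35.5, 2.0])
```

A negative deltaBigger means that a competing candidate is more frequent than the solution, which corresponds to the clusters with difficult items in Table 6.5; a positive value corresponds to the clusters with easy items.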
English features                Imp      Spanish features                Imp
Item SubtlexCand deltaBigger    100.00   Item SubtlexCand deltaBigger    100.00
Item SubtlexCand numCand        74.54    Item Ling Morph isContentW      76.69
Item SubtlexCand weakness       61.85    Text Dlt cmvMaxTotalInt         71.79
Text Lex Variation Verb2        39.69    Item SubtlexCand numCand        67.20
Text Lex Variation Adv          36.74    Text Lex Density Verb           64.00
Item Ling Synt numDep           27.06    Text Lex Variation VV1          38.26
Text Dlt cmvMaxTotalInt         26.34    Text Lex Variation TTR RTTR     38.25
Text Lex Variation TTR          24.98    Text Lex Density Determiner     37.00
Text Lex Variation VV1          20.76    Item SubtlexCand weakness       34.45
Sent Const parseScore           19.33    Item Ling Synt numDep           27.13
Table 6.7: Variable importance given the RF models for profile Text Item CL5 using the same set of 63 comparable features for both languages. The listed importance values are scaled over all features. Overlapping features are marked in bold.
As shown in Table 6.6, the candidate space features achieve a maximum
F1 score of 0.39 for English, and 0.40 for Spanish. Thus, their performance does not
differ much between the two languages. However, using the DLT integration cost features,
the classification works considerably better for Spanish. The textual DLT features
achieve a maximum F1 score of 0.64 using SVM for Spanish, and a maximum score
of 0.51 for English. Figure 6.8 shows the confusion matrices for the RF classification
task of the profile Text Item CL5 for both languages when using only the 16 textual
DLT features. We highlight correct classifications and those misclassifications that are
acceptable given the lack of information on item level: considering only textual input
features, the classifier cannot distinguish between difficult and easy items. For the
English clusters, acceptable confusion therefore occurs between classes 1 and 2, as well
as between classes 3 and 4; for the Spanish clusters, the corresponding pairs are classes
2 and 4, and classes 3 and 5 (cf. Figures 6.5 and 6.7). For
the English data, almost all items of class 5 are tagged correctly as items in very easy
texts. For the Spanish data, all items of class 1 are correctly classified as items in very
difficult texts. Correspondingly, we consider a feature subset on item level only. The
tables in Figure 6.9 show the confusion matrices for the classifications using the item
candidate space features. Considering the English data, the confusion mostly happens
because the features provide only item level information. Class 1 is mostly confused
with class 3, and both have the same item difficulty class. The classifier requires text level
information in order to distinguish between them. The same holds for classes 2 and 4:
the model predicts correctly that the items are easy, but fails to distinguish
whether the item is in an easy or a difficult text. Thus, the features’ locality levels
reflect the locality dimension from the performance data.
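The relation between a confusion matrix and the reported scores can be sketched in Python using the English RF confusion matrix for the Text Dlt features from Figure 6.8. The sketch assumes that the macro-averaged F1 is the unweighted mean of the per-class F1 scores and that the micro-averaged F1 equals the share of correctly classified items, which holds for single-label classification. Whether the thesis scores were computed on exactly this matrix or averaged over evaluation folds is not stated here, so the function and variable names as well as the computation path are illustrative assumptions.

```python
# RF confusion matrix for English, Text Dlt features only (Figure 6.8).
# Rows are actual classes 1 to 5, columns are predicted classes 1 to 5.
EN_TEXT_DLT = [
    [0, 33, 1, 4, 0],
    [0, 43, 0, 4, 0],
    [0, 0, 30, 34, 1],
    [0, 6, 36, 28, 0],
    [0, 0, 1, 1, 25],
]


def f1_scores(matrix):
    """Return (macro F1, micro F1) for a square confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    per_class = []
    for i in range(n):
        actual = sum(matrix[i])                          # TP + FN
        predicted = sum(matrix[r][i] for r in range(n))  # TP + FP
        denom = actual + predicted
        # F1 = 2 * TP / (2 * TP + FP + FN); zero if the class never occurs.
        per_class.append(2 * matrix[i][i] / denom if denom else 0.0)
    macro = sum(per_class) / n
    micro = correct / total  # equals accuracy for single-label classification
    return macro, micro


macro_f1, micro_f1 = f1_scores(EN_TEXT_DLT)
```

Rounded to two decimals, this yields a macro-averaged F1 of 0.49 and a micro-averaged F1 of 0.51, consistent with the RF values listed for EN Text Dlt in Table 6.6. Class 1 is never predicted in this matrix and therefore contributes an F1 of zero to the macro average.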
6.4.2 Comparison of Performance on Sentence and Item Level
Due to typological properties of the languages under consideration, we expect differ-
ences in C-Test difficulty phenomena to arise on item and sentence level, rather than
on text level. The following describes how the C-Test items can be clustered consider-
ing the performance variables sentAv and percCorr and shows how well the resulting
classes can be predicted using the comparable set of features.
ES (rows: actual class, columns: predicted class)
actual    1     2     3     4     5
1         15    0     0     0     0
2         0     70    0     0     0
3         0     1     52    0     7
4         0     39    1     0     0
5         0     0     36    0     3
EN (rows: actual class, columns: predicted class)
actual    1     2     3     4     5
1         0     33    1     4     0
2         0     43    0     4     0
3         0     0     30    34    1
4         0     6     36    28    0
5         0     0     1     1     25
Figure 6.8: Using Text Dlt features only: RF classification confusion matrices for Spanish (top) and English (bottom) for the profile Text Item CL5. Correct classifications and acceptable misclassifications are highlighted.
ES (rows: actual class, columns: predicted class)
actual    1     2     3     4     5
1         0     6     1     1     7
2         2     43    17    4     4
3         2     27    22    4     5
4         1     10    5     7     17
5         1     4     4     15    15
EN (rows: actual class, columns: predicted class)
actual    1     2     3     4     5
1         7     5     20    6     0
2         4     4     9     30    0
3         11    8     26    18    2
4         3     3     5     57    2
5         2     2     3     19    1
Figure 6.9: Using Item SubtlexCand features only: RF classification confusion matrices for Spanish (top) and English (bottom) for the profile Text Item CL5. Correct classifications and acceptable misclassifications are highlighted.
For both languages, we performed HCPC to cluster the items into four groups, given
the two named variables. The individuals and variables factor maps are given in Figures
6.11 (English) and 6.10 (Spanish). The resulting clusters can be interpreted in the
following way:
• Cluster 1: Difficult items in difficult sentences
• Cluster 2: Difficult items in easy sentences
• Cluster 3: Easy items in difficult sentences
• Cluster 4: Easy items in easy sentences
The RF and SVM classification results using the full set of 63 comparable features
and two additional subsets are listed in Table 6.8. Taking into account all 63 features
results in a maximal F1 score of 0.72 for Spanish, and 0.64 for English. Thus, the
classification of the Spanish performance classes outperforms the classification of the
English classes. However, using the set of 19 sentence features only, the English classes
are easier to predict than the Spanish ones. The higher F1 scores for the Spanish data,
when using all 63 features, might be caused by a reasonably good interaction with item
level features.
features                      #feat   SVM mac. F1   SVM mic. F1   RF mac. F1   RF mic. F1
EN all 63 (no highly cor.)    31      0.62          0.64          0.59         0.61
ES all 63 (no highly cor.)    36      0.61          0.67          0.67         0.72
EN all sent                   19      0.48          0.52          0.53         0.58
ES all sent                   19      0.43          0.54          0.43         0.53
EN Sent Dlt                   16      0.31          0.39          0.29         0.38
ES Sent Dlt                   16      0.34          0.44          0.34         0.42
Table 6.8: Classification results for SVM and RF for the performance profile Sent Item CL4 using the 63 comparable features. After filtering highly correlated variables, 31 remained for the English data and 36 for the Spanish data.
Table 6.9 shows the 20 most predictive variables for both languages in the RF model. In
both cases, five item level variables are among the top 20 variables, but the Spanish item
level features are ranked higher than the English ones. The variable importance further
shows that the locality level of the features again reflects the investigated performance
variables’ locality: there is only one English and no Spanish text level feature among
the 10 most predictive features. Table 6.8 also shows that the Spanish classification
outperforms the English one when using only sentence level DLT features. Again, the
variable importance reflects these tendencies: five Sent Dlt features are among the
top 11 for Spanish, and only two for English. A further difference concerning variable
importance is the ranking of the morphological feature indicating whether the word is a
content word or not. This feature is the second most predictive feature for the Spanish
data, with a variable importance of 53.38 (scaled over all features). In contrast, for the
English data it has an importance of 12.56 and is not even in the top 10.
[Factor map (plot not reproduced): individuals factor map with axes Dim 1 (66.14%) and Dim 2 (33.86%); the individual item points and item labels are omitted.]
455_14505_15
544_11640_07
544_24
645_09
534_12546_10508_17
541_23
648_13541_16535_23455_05
645_14
502_12495_11495_12
649_23
544_05
499_20
651_11497_09497_11
501_21713_15
645_13713_01
537_14651_24
508_03640_12651_04
713_17
708_08
739_16
650_12
543_19
506_06506_16
544_08710_03499_03
640_22
506_15457_10505_05
547_05640_08
650_07457_09
501_04
456_17742_23
534_01
543_02
508_16
638_24
648_01
650_19
739_03534_13546_06547_12
457_01
651_21
640_05
502_06502_08
739_08
543_07
508_15
455_08
544_02
457_05
541_17456_14
649_25
745_09
650_14
713_18
646_03
649_10
455_18
535_13742_18
546_18
710_18
535_12497_17742_21
548_11
541_07
454_08
547_19
708_04502_17
457_11
501_15
457_02
547_15
647_02
650_01739_22
546_17
739_02
653_03
543_13
649_13
499_19
543_24
640_11534_15
649_20
710_05
739_05
454_10
712_17
640_04
503_20
742_19739_09
650_06
541_25
649_05
647_23
546_03
649_04649_11
547_07742_25
744_20501_05
451_10
646_10
500_22649_17544_15
711_21
502_07
456_06
495_22
544_16
744_08499_18456_02
537_11
744_03744_18
546_11
713_22
456_03
495_17
739_18
506_09
548_12
546_08651_16
645_11
638_05
502_15
506_01
713_19
646_17
534_06
742_20
744_11
508_14
711_18
456_13
638_06638_11
547_02
708_07
544_19
538_14
495_21
742_11505_16
538_15
508_21
640_15
537_17
535_18
739_11
712_15
745_11
495_07
451_01
742_06
649_07
647_05
502_03
497_20505_11
535_22
546_21
495_01
547_11547_14640_09
541_24
497_10
501_07
648_07
741_04
547_16
457_12
502_18
457_04
710_25
640_03
711_10
648_04
710_11
741_20
547_03
710_17
648_06
712_09
497_16
501_23
548_23
495_24
739_12
708_10
535_07505_06739_17
711_01
742_13
500_08
451_09
499_04
651_06
541_15
647_19
650_13
456_15713_21
505_12
744_09744_19
650_15548_18
651_07
541_06
456_09500_24
651_08651_10
548_06
506_07
653_04
640_14
455_19
741_03
742_22739_19640_16
646_20
739_06
455_03
495_20541_13
544_13
650_04
739_15
495_03
745_25
742_03742_04
506_12
650_10
534_23
742_14
501_20
711_16
501_10653_13
535_21
649_14
651_05508_06
455_07
645_23
711_23
538_16
544_10
547_18
713_08
497_21
744_06541_11
647_21502_22
456_16648_08497_23497_24505_18651_14497_15
538_10
457_08650_03502_14
457_16505_01505_09
503_19503_22
497_05
541_12
505_08651_12
451_13
547_22538_02
497_12
646_23713_09
502_24
651_20
650_11
506_23537_06
710_01710_06
741_25495_14
741_16
651_09535_15
534_21
506_21
745_16
648_02648_05
640_02
647_24
505_14497_22
503_01
546_05546_12
646_07
647_04
648_16651_13
713_02
744_15
708_23
547_13
742_10744_16
503_05
645_01
541_10
546_01
744_17
534_16
741_17
502_13
649_06
456_18
456_25
646_22
455_25
640_20
535_05
646_15
537_18
505_03
501_24
503_07
499_01
653_02
649_19
741_05
645_08
451_12
508_09
451_03
739_13
500_01
502_10
739_14
500_03537_02
648_14455_10
647_25455_01
741_26456_12
711_19
640_10
499_12548_24
744_25
454_19
646_01
710_10
506_17547_10
711_04
508_22
648_17
534_07
646_12
653_01
508_01
638_04
499_15710_19
454_13
744_13
455_11
537_23
651_15
711_09
454_07
535_10
503_24
547_24
653_21
544_04
645_12
710_24
742_17
648_21
739_21640_17
538_08547_21
708_21
501_01742_01
741_08
547_06500_20
548_10548_13650_18708_09645_02
649_16
503_02
710_22
535_08547_08
541_08
457_19454_09711_03
546_02
742_05713_20
640_19
499_22
495_16
508_05
744_23
711_11
653_19
651_03
650_17
534_02650_05742_07
500_15
500_10
646_08
497_14
713_16742_02
648_22
544_06
455_06739_20
645_07
647_17646_09
451_11
544_12
505_19
457_23
548_21
645_03708_06
500_05
646_11
505_17
499_14
454_18
713_06
503_23
649_09
538_18503_04
454_04
503_12
713_12455_15
497_01
497_19640_13495_13
710_14
647_16
497_18
502_11
534_08
500_14537_04
534_24
534_05
537_10
495_18
503_10
534_19
535_20
505_20
506_14
454_17
505_10
647_13
653_23
456_07
455_09505_04535_02
742_08
742_15742_24508_11
454_02
650_09455_13
508_12
646_05
506_02
739_25
744_22
547_17
502_21
506_05
741_23
744_02
546_04
535_16535_19
505_02648_09456_20
538_03
497_06
500_21
646_13
457_15
647_14
454_24
742_09
508_02742_16
495_05
648_10
647_20710_15
535_25
739_23640_21
534_04
535_01
650_08
502_25
739_07534_11
456_11
503_25713_23
534_10
451_19
502_02744_07501_25
640_06
501_06
538_07
500_12451_21
638_10
541_09
535_17501_03534_03646_06
457_13
497_04537_05
495_25
646_24
457_03
713_07
497_13651_02
713_14
710_23499_17456_10
508_10
645_24
713_11
456_04
744_14647_15
638_03
457_07
499_13
648_03648_11
741_22495_09
744_04713_13
499_16455_02
502_05541_18456_19
456_23
495_02
502_09541_14
647_22455_04
648_12495_15
538_06
710_20
505_21
501_22
506_22
500_25
502_23
457_06
547_23547_25
744_10744_12
650_02
500_06500_11
646_02
506_20
638_08534_22548_22503_06646_19
506_19537_07
495_23
538_17
455_12
538_11538_13646_16
535_24502_01
653_22451_18
495_19
645_25
646_04744_05
500_23538_19646_14
503_21
741_07
502_04
534_25
741_18538_09741_15638_12548_25
505_23
744_01
538_01503_09537_01638_01503_08646_21741_02
501_02456_01
713_10
454_20
638_02547_20638_09
710_21
503_03646_25
710_16
653_18
647_18
653_20
646_18
537_09506_18
741_01495_06538_04495_04538_05
456_22
538_12
454_21
741_06
451_17451_20744_21503_11638_07456_08
713_25
645_22
456_24
495_08
456_05
713_24
505_22
537_03744_24741_19741_21495_10713_04506_24451_15505_24
713_05497_02
456_21
741_09537_08653_24500_09741_24500_04497_03506_25500_02
505_25451_16454_23
500_07500_13454_22451_14
451_24
454_06454_03454_01
451_25
454_25454_05
451_22
451_23
708_24499_25708_25
cluster 1
cluster 2
cluster 3
cluster 4
cluster 1 cluster 2 cluster 3 cluster 4
●
−1.0 −0.5 0.0 0.5 1.0
−1.
0−
0.5
0.0
0.5
1.0
Variables factor map (PCA)
Dim 1 (66.14%)
Dim
2 (
33.8
6%)
sentAv
percCorrect
sentAv
percCorrect
sentAv
percCorrect
sentAv
percCorrect
sentAv
percCorrect
sentAv
percCorrect
sentAv
percCorrect
sentAv
percCorrect
sentAv
percCorrect
Figure 6.10: Spanish: Individuals and variables factor map using performance pro-file Sent Item CL4. Predicting these 4 classes using RF and all Englishfeatures results in a micro-averaged F1 score of 0.72
[Figure 6.11, two panels: factor map of the individual items, colored by cluster (cluster 1 to cluster 4), and variables factor map (PCA) with the variables sentAv and percCorrect. Axes: Dim 1 (70.41%), Dim 2 (29.59%).]
Figure 6.11: English: Individuals and variables factor map using performance profile Sent Item CL4. Predicting these 4 classes using SVM and all English features results in a micro-averaged F1 score of 0.64.
6.5 Discussion of the Presented Results
We first investigated how well different performance profiles can be predicted using the
full set of features. For both languages, the profile that contains a large proportion of
sentence level performance variables (Text SentProf Item CL4) performed best
(micro-averaged F1 score of 0.95 for English and 0.93 for Spanish). It is the only profile
that includes variables split by proficiency levels on sentence level (SentProf) and groups
the items into four classes. These classes can be interpreted as differing in the
difficulty of the sentences in which the items occur. Although the classification of the
Spanish data mostly outperforms the English classification, this is not the case for this
performance profile. An explanation might lie in the performance statistics themselves:
The Principal Component Analysis using all performance variables has shown that within the
Spanish data the proficiency level variables correlate more with each other than within
the English data (cf. Figures 4.3 and 4.4 on p. 27). Thus, the profile containing the
SentProf variables is more informative for English than for Spanish and might therefore be
easier to predict. In other words, the information about how well learners of different
proficiency levels can master a sentence might be easier to predict for English C-Tests
than for Spanish C-Tests. However, a direct comparison of the two languages requires
comparable underlying feature sets.
The results on different performance profiles further confirm the findings of Svetashova
(2015): The profiles that combine item and text level performance variables seem to be
well predictable using the generated feature sets. Because the resulting factor maps are
easy to interpret, we decided to investigate the profile Text Item CL5 in more depth.
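
The results in this chapter are reported as micro-averaged and macro-averaged F1 scores. As a minimal sketch of how the two averaging schemes differ, the following Python function computes both from gold and predicted class labels. It is an illustration only, not the evaluation code used in this thesis; the function name and the label encoding are assumptions.

```python
from collections import Counter


def f1_scores(gold, predicted):
    """Return (micro_f1, macro_f1) for two equally long label sequences."""
    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # class p was predicted although it is wrong
            fn[g] += 1  # the true class g was missed

    def f1(true_pos, false_pos, false_neg):
        denom = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denom if denom else 0.0

    # Micro: pool the counts of all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute one F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro
```

Because the macro average weights every class equally, a small cluster that is predicted poorly lowers it much more than it lowers the micro average. For single-label classification, the micro-averaged F1 score equals the accuracy.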
English features                    Imp   Spanish features                    Imp
Item SubtlexCand deltaBigger     100.00   Item SubtlexCand deltaBigger     100.00
Item SubtlexCand numCandidates    63.03   Item Ling Morph isContentWord.1   53.38
Sent Const parseScore             58.45   Item SubtlexCand numCandidates    50.77
Item SubtlexCand weakness         35.22   Sent Const parseScore             42.06
Sent Const parseDepth             25.22   Item Ling Synt numDependencies    35.63
Item Ling Synt numDependencies    25.10   Item SubtlexCand weakness         28.75
Sent Dlt vmMaxTotalInt            18.02   Sent Const parseDepth             27.65
Sent Dlt cTotalInt                13.49   Sent Dlt cmvMaxTotalInt           22.40
Text Lex Density Verb             12.60   Sent Dlt mMaxTotalInt             21.49
Item Ling Morph isContentWord.1   12.56   Sent Dlt cMaxTotalInt             18.24
Text Dlt cmvMaxTotalInt           11.55   Sent Dlt cvTotalInt               16.94
Text Lex Variation TTR RTTR       11.06   Text Lex Density Determiner       11.01
Text Lex Density Function         10.62   Text Dlt cmvMaxTotalInt           10.64
Text Lex Variation Lex             9.94   Text Lex Variation TTR Log         6.75
Text Lex Variation TTR             8.86   Text Lex Variation Adv             6.27
Text Lex Variation Verb2           8.70   Text Dlt vmTotalInt                5.33
Text Dlt cTotalInt                 8.67   Text Lex Density Lex               5.27
Sent Dlt cmvTotalInt               8.20   Text Lex Density Function          5.07
Text Lex Variation Mod             8.13   Text Lex Variation VV1co           5.05
Text Lex Density Noun              7.40   Text Lex Variation Lex             4.14
Table 6.9: Variable importance given the RF models for profile Sent Item CL4 using the same set of 63 comparable features for both languages. The listed importance values are scaled over all features.
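
The importance values in Table 6.9 are scaled so that the most important feature receives the value 100. A minimal sketch of one common convention, a min-max rescaling of the raw scores to the range 0 to 100, is given below. That this is the scaling applied here is an assumption; the function names are hypothetical.

```python
def scale_importance(raw):
    """Rescale raw importance scores to 0-100 (min-max over all features)."""
    lowest, highest = min(raw.values()), max(raw.values())
    if highest == lowest:
        # All features are equally important; no ranking is possible.
        return {name: 0.0 for name in raw}
    span = highest - lowest
    return {name: 100.0 * (value - lowest) / span for name, value in raw.items()}


def top_k(scaled, k):
    """Return the k features with the highest scaled importance."""
    ranked = sorted(scaled.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

Under this convention the values only express a ranking within one model. A value of 50 in the English column and a value of 50 in the Spanish column therefore do not denote the same absolute importance.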
The performance profile Text Item CL5, which is the most interpretable and the most
suitable for a real-world application for both languages, resulted in a macro-averaged F1
score of 0.76 (English, 156 features) and 0.82 (Spanish, 100 features) using SVM. As shown
in the visualization by Svetashova (2015) (Figure 2.5), an application can make use of
these models in order to visualize item difficulties by color. The presented results
indicate that these models can be used in practice.
Although the Spanish features are just a subset of the English features, they turned out
to produce better classification results. A prominent feature subgroup among the 20 most
predictive English features is the group of psycholinguistic item level features, which
includes term and document frequency measures, measures based on frequency lists, and
psycholinguistic ratings. For the Spanish data, item level features are less prominent:
three term and document frequency features are the only item level features among the top
20 features. Since the feature sets of the two languages differ in that English has many
more item level features than Spanish, a direct comparison cannot be drawn from these
first experiments.
For both languages, we also presented mean feature values for the different clusters and
showed that they reflect the cluster interpretations and mostly confirm intuitive
assumptions about item difficulty. The results confirmed assumptions such as the
following: A high document frequency or n-gram frequency of an item makes a gap easier.
The more competing candidates an item has, the more difficult it is. However, the
experiments also led to unexpected results. For both languages, on text as well as on
sentence level, the DLT integration costs increase with decreasing text or sentence
difficulty. DLT emerged in the field of human sentence processing and has been applied to
the complexity analysis of learner data (Weiß, 2017). Weiß (2017) has shown that in German
L2 corpora, DLT measures increase with increasing learner proficiency. One could therefore
expect higher integration costs in sentences that contain gaps which are difficult to fill
in. The results presented in this study show the opposite for features on sentence as well
as on text level. Thus, test takers apparently profit from constructions with high
integration costs. A possible reason could be the position of the gaps relative to the
positions at which integration costs are counted: the features in this thesis only measure
these costs on sentence and text level, not on item level. Furthermore, one should
consider other possible reasons for these findings by reviewing the parse trees that
served as input for the feature calculations (cf. footnote 30). One reason could be that
test takers benefit from high integration costs because more dependent words precede the
verb under consideration.
Using a set of 63 comparable features only, we performed experiments on the performance
profile Text Item CL5. We showed that four DLT predictors were removed from the set of
English features due to high correlations, but not from the set of Spanish features. This
suggests that the DLT features show higher variance in the Spanish data. A closer look at
the data and the resulting feature calculations might reveal the reasons for this outcome.
We expect that language-specific differences in the dependency structures cause
differences in the correlations of the resulting features.
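
The removal of highly correlated predictors mentioned above can be sketched as follows. This is a simple greedy variant under stated assumptions: the cutoff of 0.9 and the order-dependent strategy are illustrative choices, and the function names are hypothetical. The criterion actually applied in the experiments may differ.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / (sd_x * sd_y)


def drop_correlated(columns, cutoff=0.9):
    """Keep a feature only if its absolute correlation with every
    previously kept feature does not exceed the cutoff.

    columns maps feature names to lists of values (one value per item).
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept
```

Applied separately to the English and the Spanish feature matrix, such a filter can retain different DLT features for each language, which is the situation described above.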
The variable importance of the RF models showed a similar picture for both languages. A
difference within the 10 most predictive features is the high ranking of the
morphological feature isContentWord in the Spanish model. This ranking supports the claim
that the rich morphology of Spanish content words influences difficulty more than the
less rich morphology of English content words. The SUBTLEX candidate features are ranked
higher for English than for Spanish. As described on page 13, Beinborn (2016) mentions
that English has more short words than German and French, which in turn causes high
numbers of candidates for English short words. Since Spanish and French are both Romance
languages with similar grammatical systems, the number of short words should be similar
in these two languages and lower than in English. Thus, a larger candidate space for
English short words might cause the candidate features to be more predictive in the
English classification than in the Spanish one.
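
To illustrate why short words tend to have a larger candidate space, the following sketch counts the vocabulary words that fit a gap. It assumes the usual C-Test deletion rule, in which the second half of a word is deleted and the larger half is removed for words with an odd number of letters, so that a visible prefix of length p admits solutions of length 2p or 2p + 1. The vocabulary and the function name are hypothetical, and the sketch does not reproduce the SUBTLEX-based feature computation of this thesis.

```python
def candidates(prefix, vocabulary):
    """Return all vocabulary words that could fill a C-Test gap.

    A word fits if it starts with the visible prefix and its length is
    compatible with the assumed deletion rule (2p or 2p + 1 letters).
    """
    p = len(prefix)
    return sorted(
        word
        for word in vocabulary
        if word.startswith(prefix) and len(word) in (2 * p, 2 * p + 1)
    )
```

With a toy vocabulary, the two-letter prefix "th" is compatible with words such as "that", "then", "these", and "this", whereas the five-letter prefix "diffi" only matches "difficulty". The number of competing candidates therefore tends to drop as the visible prefix grows longer.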
The last set of experiments investigated differences between English and Spanish when
considering the performance variables on sentence and item level only. Using the full set
of comparable features, the results for the Spanish classification outperform the English
one. However, this is not the case when using the full set of 19 comparable sentence
features. This suggests that a reasonably good interaction with item level features leads
to better results when using all 63 features. The variable importance confirms that item
level features are ranked higher in the Spanish data than in the English data. This might
be caused by the fact that in Spanish more information can be encoded in one word. A
further result that supports this idea is the rank of the feature isContentWord. Just as
for the profile Text Item CL5, it is again ranked higher for Spanish than for English.
30 Punctuation marks have not been ignored when considering the English and Spanish
dependency structures. We do not expect this to influence the results.
Moreover, it should be noted that differences between the languages can also be caused by
differences in the nature of the performance data. Inspecting the proficiency levels more
thoroughly is relevant in the context of difficulty comparison: For Spanish, the
proportion of proficient learners was significantly smaller than for English (cf. 4.1).
This might have influenced the difficulty prediction, since different features are better
indicators of difficulty at different levels of proficiency.
7 Conclusion
This thesis presented an automated way of investigating difficulty characteristics of
English and Spanish C-Tests, focusing on the locality levels of items, sentences, and text
passages. The purpose of this thesis was twofold: First, we aimed at developing a pipeline
that automatically predicts the difficulty of C-Test items and text passages for the two
languages English and Spanish. This pipeline is intended to be integrable into a
real-world application, where difficulty estimates for text passages and items can be used
to influence the difficulty of whole C-Tests. Second, we aimed at investigating how the
difficulty characteristics of C-Tests vary across the two languages, given the differences
in their grammatical and morphological systems. The following summarizes the findings and
suggests ideas for future work.
We reviewed existing work on the topic, focusing on the approaches presented in
Svetashova (2015), Beinborn (2016), and Beinborn et al. (2014). Their findings concerning
performance modeling and difficulty prediction served as a baseline for this work.
The performance of test takers could be modeled given a C-Test database provided by the
Language Learning Center of the University of Tübingen. We extracted statistics about the
test takers’ performance on different levels of locality: For each C-Test item, the
percentage of correct insertions and the corresponding averages over sentences and whole
texts were used as performance variables. We further inspected the performance of test
takers across proficiency levels given their final C-Test score. The difficulty of a
C-Test item has usually been defined as the ratio of incorrectly inserted answers to all
answers. Svetashova (2015) presented a new way of grouping the items, namely based on
information about the test takers’ performance on the two locality dimensions item and
text. Regression and classification results both confirmed
that the classes resulting from the two-dimensional clustering approach can be predicted
quite accurately using a broad range of linguistic features. Thus, her findings showed
that the difficulty of a C-Test item depends on the difficulty of the whole text. We
followed her clustering approach and further included sentence-averaged statistics as
performance variables. We performed Hierarchical Clustering on Principal Components on
the updated set of items and experimented with the combination of performance variables
and the number of clusters. A clustering principle using only the item and text variables
resulted in an interpretable picture of item groupings. For both languages, the items
could be grouped into five clusters: easy items in difficult texts, easy items in easy
texts, difficult items in easy texts, difficult items in difficult texts, and as a fifth
group either all items in very easy texts for the English data, or all items in very
difficult texts for the Spanish data. We presume that the difference in the clustering
results is not caused by language-specific differences, but is rather dataset-specific.
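
The performance variables summarized above can be sketched as follows: the percentage of correct insertions per item, and its averages over the sentence and the text in which the item occurs. The input layout, the item identifiers of the form text_gap, and the name textAv are assumptions made for this illustration; percCorrect and sentAv follow the variable names shown in the factor maps.

```python
from collections import defaultdict
from statistics import mean


def performance_variables(responses, sentence_of):
    """Compute item, sentence, and text level performance variables.

    responses:   {item_id: [True/False per test taker]}, where item_id is
                 assumed to look like "<text>_<gap>".
    sentence_of: {item_id: index of the sentence containing the gap}.
    """
    perc = {item: 100.0 * sum(r) / len(r) for item, r in responses.items()}

    by_sentence, by_text = defaultdict(list), defaultdict(list)
    for item, value in perc.items():
        text = item.split("_")[0]
        by_sentence[(text, sentence_of[item])].append(value)
        by_text[text].append(value)

    result = {}
    for item, value in perc.items():
        text = item.split("_")[0]
        result[item] = {
            "percCorrect": value,
            "sentAv": mean(by_sentence[(text, sentence_of[item])]),
            "textAv": mean(by_text[text]),
        }
    return result
```

A profile such as the item and text profile then corresponds to selecting the percCorrect and textAv columns before the clustering step. The traditional item difficulty, the ratio of incorrect answers to all answers, is 1 - percCorrect / 100.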
The difficulty of the C-Test items was modeled using a broad range of English features
that proved to be predictive in the work of Svetashova (2015). In addition, we considered
features that emerged within the scope of linguistic complexity analysis and human
sentence processing worth integrating into our pipeline. The final feature set comprised
surface-based, lexical, psycholinguistic, syntactic, and discourse features, as well as
C-Test specific features describing an item’s candidate space, position, or context. In
total, 156 English features were implemented. To get a multi-lingual perspective on C-Test
difficulty characteristics, we designed a versatile set of 100 features for Spanish
C-Tests.
Our classification experiments have shown which performance profiles can be pre-
dicted best using which set of features. The experiments have been conducted using
Support Vector Machine (SVM) and Random Forest (RF) models with a data split of
80/20%. Using the full feature set, we experimented with different performance profiles
and got the best classification results for those profiles containing a high proportion of
sentence performance variables (Text SentProf Item CL4 ): a micro-averaged F1 score
of 0.95 for the English data, and 0.93 for the Spanish data using RF. The performance
profile Text Item CL5, which is the most interpretable and the most suitable for a real-world
application for both languages, resulted in a macro-averaged F1 score of 0.76 (English, 156 features)
and 0.82 (Spanish, 100 features) using SVM. Thus, the Spanish classification outperforms
the English classification, even though the Spanish features are just a subset of
the English features.
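Since both micro-averaged and macro-averaged F1 scores are reported above, the following sketch shows how the two averages differ. It is a plain re-implementation for illustration, not the evaluation code used in the experiments; the function name and the toy labels are assumptions.

```python
def f1_scores(gold, predicted):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(gold) | set(predicted))
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted although the gold class is g
            fn[g] += 1  # g was missed

    def f1(t, f_pos, f_neg):
        denominator = 2 * t + f_pos + f_neg
        return 2 * t / denominator if denominator else 0.0

    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts of all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro
```

In the single-label setting the micro-averaged F1 equals accuracy, so it is dominated by large clusters, whereas the macro average weights every cluster equally. This should be kept in mind when comparing scores across profiles with different cluster sizes.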
Additionally, it was shown that the clusters’ mean feature values reflect difficulty
tendencies which either confirm or contradict intuitive assumptions. As an example,
high-frequency words are easier to master than low-frequency words. On the other hand,
sentences which are known to be more complex in terms of the Dependency Locality
Theory are easier to master. We suggested that this might be caused by the gap’s
position relative to the integration costs counting. To determine the reasons that
explain these findings, we suggest inspecting the dependency structures of single
data points in future work.
In order to approximately compare our results to those reported by Svetashova
(2015), we performed classification experiments on the performance profile that com-
bines all item and text level performance variables, including the proficiency level split-
ting. Profile TextProf ItemProf CL4 is the performance profile that is most comparable
to Svetashova’s best performing profile, although it lacks performance statistics from
Item Response Theory. We reported comparative RF classification results given the
new English data: a micro-averaged F1 score of 0.79, while Svetashova (2015) reports a
score of 0.7959. It should be noted, however, that all results reported in this work
refer to the texts extracted from the latest version of the FSZ database, while
Svetashova (2015) conducted her experiments on an earlier, smaller version (cf. Table
3.1, p. 22).
To compare the difficulty characteristics of the two languages, we performed clas-
sification experiments with a set of 63 comparable features. With a micro-averaged
F1 score of 0.72 in RF classification, the Spanish classification outperforms the English
classification, which reaches its highest F1 score of 0.64 using SVM. From these results
we can conclude that the underlying set of comparable features is more predictive for
the Spanish language than for the English language. The variable importance showed
a much higher ranking of the morphological feature isContentWord, which confirmed
our preliminary assumption that morphological features are more important for Spanish,
which is morphologically richer than English. For English, the features describing the
item’s candidate space were highly predictive. The high number of English short words
increases the candidate space of short words. We suggested that this might have caused
the good performance of English candidate features in contrast to Spanish candidate
features. The most predictive feature for both languages measures the difference
between the word frequency of an item itself and that of its competing candidate with the highest
word frequency. This feature is also the most predictive one in experiments on the test
takers’ performance on sentence and item level. The classification results for profile
Sentence Item CL4 showed that item level features are ranked lower in the English
experiment than in the Spanish experiment.
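The frequency-difference feature described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the pipeline: the function name is hypothetical, competing candidates are assumed to be all words in the frequency list that share the visible prefix and the length of the solution, and the frequency values are invented.

```python
def candidate_frequency_delta(solution, prefix, frequencies):
    """Frequency of the solution minus that of its most frequent competitor.

    `frequencies` maps words to frequency values (e.g. from a subtitle
    frequency list). Returns None if the candidate space holds no competitor.
    """
    competitors = [
        word
        for word in frequencies
        if word != solution
        and word.startswith(prefix)
        and len(word) == len(solution)  # assumed candidate criterion
    ]
    if not competitors:
        return None
    strongest = max(frequencies[word] for word in competitors)
    return frequencies.get(solution, 0.0) - strongest


# Invented toy frequencies, for illustration only.
toy_frequencies = {
    "house": 120.0,
    "horse": 40.0,
    "hound": 5.0,
    "hotel": 60.0,
    "home": 300.0,
}
delta = candidate_frequency_delta("horse", "ho", toy_frequencies)
```

A negative value indicates that a more frequent word fits the same gap, which plausibly makes the item harder.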
Given the findings of this thesis, several questions remain for future research. The
presented results show which language characteristics influence the difficulty of
C-Test items. More research is still necessary to figure out why certain features
are more predictive than others. This will involve a thorough inspection of single data
points, also taking into account the language learners’ proficiency levels. Additionally,
the set of features could be expanded by including further measures described for
example in the context of SLA, human sentence processing, or language complexity
analysis. Another way of extending the feature set would be to project item level
features to sentence and text level, e.g. by taking the average values of items in the
sentence or text. So far the text level features only consider the text as a non-gapped
text rather than as a gapped testing text.
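The proposed projection of item level features to the sentence or text level could look like the following sketch. The item representation (a dictionary with the keys text, sentence and features) is a hypothetical format chosen for illustration, and all items are assumed to carry the same feature names.

```python
from collections import defaultdict


def project_item_features(items, level="sentence"):
    """Average item level feature values per sentence or per text.

    `items` is a list of dicts with the keys "text", "sentence" and
    "features" (feature name -> numeric value); this layout is assumed.
    """
    groups = defaultdict(list)
    for item in items:
        if level == "text":
            key = item["text"]
        else:
            key = (item["text"], item["sentence"])
        groups[key].append(item["features"])

    projected = {}
    for key, feature_dicts in groups.items():
        names = feature_dicts[0].keys()
        projected[key] = {
            name: sum(f[name] for f in feature_dicts) / len(feature_dicts)
            for name in names
        }
    return projected
```

Features obtained this way would describe the gapped text as it is presented to the test taker, in contrast to the current text level features.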
Further research will also be required concerning the DLT measures, which
proved to be predictive in the given task. The experiments showed that increasing
integration costs lower the sentence and text difficulties. More research is necessary to
substantiate these findings.
In order to gain more insights into differences between C-Test difficulties in English
and Spanish, we suggest extending the set of features which encode morphological and
syntactic information. For Spanish, there exist tagsets with a high number of different
tags encoding very specific information. We suggest making use of this information
by extending those feature sets that are based on POS information. On top of that,
the syntactic complexity features could be adapted to the Spanish pipeline. The in-
vestigation of other languages is also desirable in order to broaden the multi-lingual
perspective on the topic.
We further suggest examining the test takers’ distribution across proficiency levels
in more detail. It would be interesting to discover which language characteristics
affect difficulty for beginning learners in contrast to those that indicate difficulties
for advanced learners.
The presented pipeline is fully automated and expects an input structure that is
already used in a real-world C-Test generation application. As a next step towards the
application of the results, the difficulty prediction component needs to be integrated
into the existing user interface.
References
Alderson, J. C. (1979). The cloze procedure and proficiency in english as a foreign
language. TESOL Quarterly, 13(2):219–227.
Amaral, L. and Meurers, D. (2011). On using intelligent computer-assisted language
learning in real-life foreign language teaching and learning. ReCALL, 23(1):4–24.
Babaii, E. and Ansary, H. (2001). The c-test: a valid operationalization of reduced
redundancy principle? System, 29(2):209–219.
Bachman, L. F. (1982). The trait structure of cloze test scores. TESOL Quarterly,
16(1):61–70.
Beinborn, L. (2016). Predicting and Manipulating the Difficulty of Text-Completion Exercises
for Language Learning. Dr. rer. nat. thesis, Technische Universität Darmstadt.
Beinborn, L., Zesch, T., and Gurevych, I. (2014). Predicting the difficulty of language
proficiency tests. Transactions of the Association for Computational Linguistics,
2:517–529.
Bird, S. and Loper, E. (2006). Nltk: the natural language toolkit. In Proceedings of
the COLING/ACL on Interactive presentation sessions, pages 69–72. Association for
Computational Linguistics.
Brants, T. and Franz, A. (2006). Web 1t 5-gram version 1.
Brysbaert, M., New, B., and Keuleers, E. (2012). Adding part-of-speech information
to the subtlex-us word frequencies. Behavior research methods, 44(4):991–997.
Chapelle, C. A. and Chung, Y.-R. (2010). The promise of nlp and speech processing
technologies in language assessment. Language Testing, 27(3):301–315.
Chen, D. and Manning, C. (2014). A fast and accurate dependency parser using neural
networks. In Proceedings of the 2014 conference on empirical methods in natural
language processing (EMNLP), pages 740–750. Association for Computational Lin-
guistics.
Chen, X. and Meurers, D. (2016). Ctap: A web-based tool supporting automatic
complexity analysis. In Proceedings of the Workshop on Computational Linguistics
for Linguistic Complexity (CL4LC), pages 113–119.
Cohen, A. D., Segal, M., and Bar-Siman-Tov, R. (1984). The c-test in hebrew. Language
Testing, 1(2):221–225.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2):213–238.
Cuetos, F., Glez-Nosti, M., Barbón, A., and Brysbaert, M. (2012). Subtlex-esp: Spanish
word frequencies based on film subtitles. Psicológica, 33(2):133–143.
Dörnyei, Z. and Katona, L. (1992). Validation of the c-test amongst hungarian efl
learners. Language Testing, 9(2):187–206.
Eckes, T. and Grotjahn, R. (2006). A closer look at the construct validity of c-tests.
Language Testing, 23(3):290–325.
Ferrucci, D., Lally, A., Verspoor, K., and Nyberg, E. (2009). Unstructured information
management architecture (UIMA) version 1.0. OASIS Standard.
García-Pablos, A., Cuadros, M., Gaines, S., and Rigau, G. (2013). Opener demo: Open
polarity enhanced named entity recognition. In Come Hack with OpeNER! Workshop
Programme, volume 501, pages 12–14.
Gibson, E. (2000). The dependency locality theory: A distance-based theory of linguis-
tic complexity. Image, language, brain, pages 95–126.
Graesser, A. C., McNamara, D. S., and Louwerse, M. M. (2003). What do readers need
to learn in order to process coherence relations in narrative and expository text.
Rethinking reading comprehension, pages 82–98.
Grotjahn, R. (2002). Konstruktion und einsatz von c-tests: Ein leitfaden für die praxis.
Der C-Test. Theoretische Grundlagen und praktische Anwendungen, 4:211–225.
Grotjahn, R. and Tönshoff, W. (1992). Textverständnis bei der c-test-bearbeitung.
pilotstudien mit französisch- und italienischlernern. Der C-Test. Theoretische Grundlagen
und praktische Anwendungen, 1:19–95.
Hancke, J., Vajjala, S., and Meurers, D. (2012). Readability classification for German
using lexical, syntactic, and morphological features. In Proceedings of the 24th In-
ternational Conference on Computational Linguistics (COLING), pages 1063–1080,
Mumbai, India.
Hernández-Figueroa, Z., Rodríguez-Rodríguez, G., and Carreras-Riudavets, F. (2009).
Separador de sílabas del español-silabeador tip.
Hughes, A. (2007). Testing for language teachers. Ernst Klett Sprachen.
Klein-Braley, C. (1984). Practice and Problems in Language Testing. Papers from the
International Symposium on Language Testing, volume 29, chapter Advance Predic-
tion of Difficulty with C-Tests., pages 97–112. ERIC, Colchester, England.
Klein-Braley, C. (1985). A cloze-up on the c-test: a study in the construct validation
of authentic tests. Language Testing, 2(1):76–104.
Krashen, S. (1985). The Input Hypothesis: Issues and Implications. Laredo.
Kuhn, M., Wing, J., Weston, S., Williams, A., Keefer, C., Engelhardt, A., Cooper,
T., Mayer, Z., and Kenkel, B. (2015). caret: Classification and regression training. r
package version 6.0–21. CRAN, Vienna, Austria.
Kuperman, V., Stadthagen-Gonzalez, H., and Brysbaert, M. (2012). Age-of-acquisition
ratings for 30,000 english words. Behavior Research Methods, 44(4):978–990.
Laufer, B. and Nation, P. (1995). Vocabulary size and use: Lexical richness in l2 written
production. Applied linguistics, 16(3):307–322.
Lê, S., Josse, J., and Husson, F. (2008). FactoMineR: A package for multivariate
analysis. Journal of Statistical Software, 25(1):1–18.
Levy, R. and Andrew, G. (2006). Tregex and tsurgeon: tools for querying and ma-
nipulating tree data structures. In Proceedings of the fifth international conference
on Language Resources and Evaluation, pages 2231–2234, Genoa, Italy. European
Language Resources Association (ELRA).
Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing.
International Journal of Corpus Linguistics, 15(4):474–496.
Lu, X. (2012). The relationship of lexical richness to the quality of esl learners’ oral
narratives. The Modern Language Journal, 96(2):190–208.
McNamara, D. S., Graesser, A. C., McCarthy, P. M., and Cai, Z. (2014). Automated
evaluation of text and discourse with Coh-Metrix. Cambridge University Press.
McNamara, T. (2000). Language Testing. Oxford Introduction to Language Study
ELT. OUP Oxford.
Meurers, D., Ziai, R., Amaral, L. A., Boyd, A., Dimitrov, A., Metcalf, V., and Ott,
N. (2010). Enhancing authentic web pages for language learners. In Proceedings
of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building
Educational Applications, pages 10–18, Los Angeles, California. Association for Com-
putational Linguistics.
Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajič, J., Manning, C. D.,
McDonald, R. T., Petrov, S., Pyysalo, S., Silveira, N., et al. (2016). Universal
dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth
International Conference on Language Resources and Evaluation (LREC).
R Development Core Team (2008). R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-
900051-07-0.
Raatz, U. and Klein-Braley, C. (1981). The c-test–a modification of the cloze procedure.
In Practice and Problems in Language Testing. Proceedings of the International Language
Testing Symposium of the Interuniversitäre Sprachtestgruppe, volume 4, Essex,
England. Education Resources Information Center (ERIC).
Shain, C., van Schijndel, M., Futrell, R., Gibson, E., and Schuler, W. (2016). Memory
access during incremental sentence processing causes reading time latency. Pro-
ceedings of the Workshop on Computational Linguistics for Linguistic Complexity
(CL4LC), pages 49–58.
Spolsky, B. (1969). Reduced redundancy as a language testing tool. In Language
Testing Section of the 2nd International Congress of Applied Linguistics, Cambridge.
England. Education Resources Information Center (ERIC).
Suvorov, R. and Hegelheimer, V. (2013). Computer-assisted language testing. The
companion to language assessment.
Svetashova, Y. (2015). C-test item difficulty prediction: Exploring the linguistic charac-
teristics of c-tests using machine learning. Master’s thesis, Department of Linguistics,
University of Tübingen.
Todirascu, A., François, T., Gala, N., Fairon, C., Ligozat, A.-L., and Bernhard, D.
(2013). Coherence and cohesion for the assessment of text readability. Natural
Language Processing and Cognitive Science, 11:11–19.
Toutanova, K., Klein, D., Manning, C. D., and Singer, Y. (2003). Feature-rich part-
of-speech tagging with a cyclic dependency network. In Proceedings of the 2003
Conference of the North American Chapter of the Association for Computational
Linguistics on Human Language Technology-Volume 1, pages 173–180. Association
for Computational Linguistics.
Vajjala, S. (2015). Analyzing text complexity and text simplification: connecting linguistics,
processing and educational applications. PhD thesis, University of Tübingen.
Vajjala, S. and Meurers, D. (2012). On improving the accuracy of readability classifi-
cation using insights from second language acquisition. In Proceedings of the Seventh
Workshop on Building Educational Applications Using NLP, pages 163–173. Associ-
ation for Computational Linguistics.
Weiß, Z. (2015). More linguistically motivated features of language complexity in read-
ability classification of german textbooks: Implementation and evaluation. Bachelor’s
Thesis, Department of Linguistics, University of Tübingen.
Weiß, Z. (2017). Using measures of linguistic complexity to assess german l2 proficiency
in learner corpora under consideration of task-effects. Master’s thesis, Department
of Linguistics, University of Tübingen.
Wilson, M. (1988). Mrc psycholinguistic database: Machine-usable dictionary, version
2.00. Behavior Research Methods, 20(1):6–10.
A Appendix
The following tables list all the features implemented in the English pipeline. The
column "ES" denotes whether the feature has been adapted to Spanish. The column
"COMP" indicates whether the feature is included in the set of 63 comparable features.
Moreover, Table A.6 and Table A.7 list how many participants processed
each text passage.
Feature Group
  Feature Name                              ES  COMP

Linguistic Features
  Item Ling endingLength                    √
  Item Ling lemmaLength                     √
  Item Ling Morph isContentWord             √   √
  Item Ling Synt numDependencies            √   √

Psycholinguistic Features
  Item Psy TermOccInDoc                     √
  Item Psy TF                               √
  Item Psy DF                               √
  Item Psy IDF                              √
  Item Psy TFIDF                            √
  Item Psy isLemmasMostFreqTok
  Item Psy isTokensMostFreqTag
  Item Psy Lfp Band1                        √
  Item Psy Sem AoA Kup
  Item Psy Sem Concr
  Item Psy Sem FamFeature
  Item Psy Sem Imag
  Item Psy Sem Meani
  Item Psy StylNCatsMRC
  Item Psy StylNCatsNLTK

Position-based Features
  Item Posit distancePrevMention Lemma      √
  Item Posit distancePrevMention Token      √
  Item Posit isInClosing                    √
  Item Posit isInStartingOrClosing          √
  Item Posit Number Gap                     √
  Item Posit Number Token                   √
  Item Posit previousItemTrigramProb
  Item Posit previousItemUnigramProb

Table A.1: Item level features from the groups of linguistic, psycholinguistic and position-based features.
Feature Group
  Feature Name                              ES  COMP

Context and Candidate Space
  Item Ngram Cands NumPotentialEndings
  Item Ngram Cands BiggerUniDelta
  Item Ngram Cands BiggerBiRightDelta
  Item Ngram Cands BiggerBiLeftDelta
  Item Ngram Cands BiggerTriCenterDelta
  Item Ngram Cands BiggerTriLeftDelta
  Item Ngram Cands BiggerTriRightDelta
  Item Ngram Cands UniPrWeakness
  Item Ngram Cands BiLeftPrWeakness
  Item Ngram Cands BiRightPrWeakness
  Item Ngram Cands TriLeftPrWeakness
  Item Ngram Cands TriCenterPrWeakness
  Item Ngram Cands TriRightPrWeakness
  Item Ngram Cands HasBiggerUni
  Item Ngram Cands HasBiggerBiLeft
  Item Ngram Cands HasBiggerBiRight
  Item Ngram Cands HasBiggerTriCenter
  Item Ngram Cands HasBiggerTriLeft
  Item Ngram Probs Uni
  Item Ngram Probs Bi Left
  Item Ngram Probs Bi Right
  Item Ngram Probs Tri Center
  Item Ngram Probs Tri Left
  Item Ngram Probs Tri Right
  Item Ngram Probs Four CenterRight
  Item Ngram Probs Four Left
  Item Ngram Probs Four LeftCenter
  Item Ngram Probs Four Right
  Item Ngram Probs Five Center
  Item Ngram Probs Five CenterRight
  Item Ngram Probs Five Left
  Item Ngram Probs Five LeftCenter
  Item Ngram Probs Five Right
  Item Ngram Probs Max
  Item SubtlexCand deltaBigger              √   √
  Item SubtlexCand hasBigger                √   √
  Item SubtlexCand numCandidates            √   √
  Item SubtlexCand weakness                 √   √

Table A.2: Item level features from the group of context and candidate space features.
Feature Group
  Feature Name                              ES  COMP

Surface Features
  Sent numGapsInSent                        √
  Sent sentID                               √   √

Syntactic Features
  Sent Const numComplicators
  Sent Const parseDepth                     √   √
  Sent Const parseScore                     √   √

DLT Features
  Sent Dlt oMaxTotalIntegrationCost         √   √
  Sent Dlt cMaxTotalIntegrationCost         √   √
  Sent Dlt mMaxTotalIntegrationCost         √   √
  Sent Dlt vMaxTotalIntegrationCost         √   √
  Sent Dlt cmMaxTotalIntegrationCost        √   √
  Sent Dlt cvMaxTotalIntegrationCost        √   √
  Sent Dlt vmMaxTotalIntegrationCost        √   √
  Sent Dlt cmvMaxTotalIntegrationCost       √   √
  Sent Dlt oTotalIntegrationCosts           √   √
  Sent Dlt cTotalIntegrationCosts           √   √
  Sent Dlt vTotalIntegrationCosts           √   √
  Sent Dlt mTotalIntegrationCosts           √   √
  Sent Dlt cmTotalIntegrationCosts          √   √
  Sent Dlt cvTotalIntegrationCosts          √   √
  Sent Dlt vmTotalIntegrationCosts          √   √
  Sent Dlt cmvTotalIntegrationCosts         √   √

Table A.3: Sentence level features.
Feature Group
  Feature Name                              ES  COMP

Lexical Features
  Text Lex Density Adjective                √   √
  Text Lex Density Adverb                   √   √
  Text Lex Density Conjunction              √   √
  Text Lex Density Determiner               √   √
  Text Lex Density Function                 √   √
  Text Lex Density Lex                      √   √
  Text Lex Density Noun                     √   √
  Text Lex Density Verb                     √   √
  Text Lex Variation Adj                    √   √
  Text Lex Variation Adv                    √   √
  Text Lex Variation Lex                    √   √
  Text Lex Variation Mod                    √   √
  Text Lex Variation NDWZ                   √   √
  Text Lex Variation TTR                    √   √
  Text Lex Variation TTR CTTR               √   √
  Text Lex Variation TTR Log                √   √
  Text Lex Variation TTR RTTR               √   √
  Text Lex Variation TTR Uber               √   √
  Text Lex Variation Verb2                  √   √
  Text Lex Variation VV1                    √   √
  Text Lex Variation VV1co                  √   √
  Text Lex Variation VV1sq                  √   √
  Text Lex Sophistication CSV               √
  Text Lex Sophistication LS1               √
  Text Lex Sophistication LS2               √
  Text Lex Sophistication VS1               √
  Text Lex Sophistication VS2               √
  Text Lex FrequencyProfile Band1 Tok       √
  Text Lex FrequencyProfile Band1 Type      √
  Text Lex FrequencyProfile Band2 Tok       √
  Text Lex FrequencyProfile Band2 Type      √
  Text Lex FrequencyProfile Band3 Tok       √
  Text Lex FrequencyProfile Band3 Type      √
  Text Lex FrequencyProfile Band4 Tok       √
  Text Lex FrequencyProfile Band4 Type      √
  Text Lex FrequencyProfile Band5 Tok       √
  Text Lex FrequencyProfile Band5 Type      √
  Text Lex FrequencyProfile Band6 Tok       √
  Text Lex FrequencyProfile Band6 Type      √
  Text Lex FrequencyProfile Band7 Tok       √
  Text Lex FrequencyProfile Band7 Type      √
  Text Lex awlNumFam
  Text Lex awlPercent
  Text Lex famListPercent k1
  Text Lex famNum k1

Table A.4: Text level features from the group of lexical features.
Feature Group
  Feature Name                              ES  COMP

Syntactic Complexity Features
  Text Syn Complexity CNperC
  Text Syn Complexity CTperT
  Text Syn Complexity DCperC
  Text Syn Complexity MLC
  Text Syn Complexity MLS
  Text Syn Complexity MLT
  Text Syn Complexity VPperT

DLT Features
  Text Dlt oMaxTotalIntegrationCost         √   √
  Text Dlt cMaxTotalIntegrationCost         √   √
  Text Dlt mMaxTotalIntegrationCost         √   √
  Text Dlt vMaxTotalIntegrationCost         √   √
  Text Dlt cvMaxTotalIntegrationCost        √   √
  Text Dlt cmMaxTotalIntegrationCost        √   √
  Text Dlt vmMaxTotalIntegrationCost        √   √
  Text Dlt cmvMaxTotalIntegrationCost       √   √
  Text Dlt oTotalIntegrationCosts           √   √
  Text Dlt cTotalIntegrationCosts           √   √
  Text Dlt mTotalIntegrationCosts           √   √
  Text Dlt vTotalIntegrationCosts           √   √
  Text Dlt cmTotalIntegrationCosts          √   √
  Text Dlt cvTotalIntegrationCosts          √   √
  Text Dlt vmTotalIntegrationCosts          √   √
  Text Dlt cmvTotalIntegrationCosts         √   √

Readability Index Features
  Text Readability Flesh                    √
  Text Readability Kincaid                  √

Table A.5: Text level features from the groups of syntactic complexity, DLT and readability features.
Texts processed by more than 140 participants:

# part.  Text ID
567      585
564      529
564      588
550      527
550      591
546      584
545      533
541      532
526      582
495      609
494      598
494      610
493      590
492      589
488      523
487      587
485      586
485      605
479      600
476      581
474      690
474      716
471      719
470      720
470      724
469      725
465      689
464      685
464      688
455      691
441      528
435      478
435      481
432      466
432      470
432      479
427      530
427      531
427      596
422      603
419      594
418      604
417      601
374      524
373      525
173      764
171      756
169      766
165      762
160      748
Total: 22146

Texts processed by fewer than 140 participants:

# part.  Text ID
62       602
61       718
60       583
59       578
57       761
56       580
54       760
53       607
52       755
47       593
47       608
47       765
46       579
46       611
46       723
46       752
45       606
45       763
43       599
43       750
43       754
42       721
42       751
42       753
41       694
41       749
40       692
40       722
40       759
39       595
38       686
37       592
37       717
34       693
33       597
33       687
33       757
32       758
29       767
15       822
14       825
13       821
12       819
12       820
12       823
12       829
10       828
9        824
9        826
8        468
8        474
8        827
5        472
5        476
5        477
2        526
Total: 1890

Table A.6: English: Number of participants per text. The first list contains the texts which were processed by more than 140 participants, the second list those which were processed by fewer participants. In total, 24036 (22146+1890) answers are available for English.
Texts processed by more than 140 participants:

# part.  Text ID
222      745
218      508
215      506
213      742
212      547
210      739
209      744
208      503
208      505
208      650
208      741
207      543
207      548
205      537
204      535
200      646
197      649
197      653
193      651
191      495
191      544
184      502
181      546
178      451
178      454
178      455
178      456
178      457
178      534
177      640
175      541
173      538
173      648
170      638
167      645
164      497
164      499
164      500
164      501
161      647
153      713
152      711
151      712
147      708
147      710
Total: 8358

Texts processed by fewer than 140 participants:

# part.  Text ID
65       795
64       794
62       796
61       798
59       797
29       643
28       507
28       642
27       536
27       540
26       504
26       652
26       740
26       807
25       542
24       743
24       799
24       804
24       811
23       639
23       715
23       809
22       539
22       549
22       805
19       509
19       545
19       714
19       746
19       802
19       806
19       810
18       279
16       808
15       800
13       709
10       841
9        842
8        801
7        847
5        840
4        843
4        844
4        845
4        846
2        839
Total: 1062

Table A.7: Spanish: Number of participants per text. The first list contains the texts which were processed by more than 140 participants, the second list those which were processed by fewer participants. In total, 9420 (8358+1062) answers are available for Spanish.