

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2344–2356, Brussels, Belgium, October 31 – November 4, 2018. ©2018 Association for Computational Linguistics


Large-scale Cloze Test Dataset Created by Teachers

Qizhe Xie*, Guokun Lai*, Zihang Dai, Eduard Hovy
Language Technologies Institute, Carnegie Mellon University
{qizhex, guokun, dzihang, hovy}@cs.cmu.edu

Abstract

Cloze tests are widely adopted in language exams to evaluate students' language proficiency. In this paper, we propose the first large-scale human-created cloze test dataset CLOTH [1, 2], containing questions used in middle-school and high-school language exams. With missing blanks carefully created by teachers and candidate choices purposely designed to be nuanced, CLOTH requires a deeper language understanding and a wider attention span than previous automatically-generated cloze datasets. We test the performance of dedicated baseline models, including a language model trained on the One Billion Word Corpus, and show that humans outperform them by a significant margin. We investigate the source of the performance gap, trace model deficiencies to some distinct properties of CLOTH, and identify the limited ability to comprehend long-term context as the key bottleneck.

1 Introduction

Being a classic language exercise, the cloze test (Taylor, 1953) is an accurate assessment of language proficiency (Fotos, 1991; Jonz, 1991; Tremblay, 2011) and has been widely employed in language examinations. Under a typical setting, a cloze test requires examinees to fill in missing words (or sentences) to best fit the surrounding context. To facilitate natural language understanding, automatically-generated cloze datasets have been introduced to measure the ability of machines in reading comprehension (Hermann et al., 2015; Hill et al., 2016; Onishi et al., 2016).

* Equal contribution.
[1] CLOTH (CLOze test by TeacHers) is available at http://www.cs.cmu.edu/~glai1/data/cloth/.
[2] The leaderboard is available at http://www.qizhexie.com/data/CLOTH_leaderboard.html

In these datasets, each cloze question typically consists of a context paragraph and a question sentence. By randomly replacing a particular word in the question sentence with a blank symbol, a single test case is created. For instance, the CNN/Daily Mail datasets (Hermann et al., 2015) use news articles as contexts and summary bullet points as the question sentence; only named entities are removed when creating the blanks. Similarly, in the Children's Book Test (CBT) (Hill et al., 2016), cloze questions are obtained by removing a word in the last sentence of every 21 consecutive sentences, with the first 20 sentences being the context. Different from the CNN/Daily Mail datasets, CBT also provides each question with a candidate answer set, consisting of words randomly sampled from the context that have the same part-of-speech tag as the correct answer.
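To make the automatic generation recipe concrete, the sketch below shows one way a CBT-style candidate set could be assembled by sampling context words that share the answer's part-of-speech tag. This is an illustrative approximation assuming NLTK's off-the-shelf tagger, not the procedure actually used to build CBT.

```python
# Sketch: a CBT-style candidate set built by sampling context words that share
# the answer's part-of-speech tag. Assumes NLTK and its tagger data are
# installed; this illustrates the general recipe, not CBT's exact code.
import random
import nltk


def cbt_style_candidates(context_tokens, answer, num_distractors=9):
    tagged = nltk.pos_tag(context_tokens + [answer])
    answer_tag = tagged[-1][1]
    # Pool of context words with the same POS tag as the answer.
    pool = [w for w, t in tagged[:-1]
            if t == answer_tag and w.lower() != answer.lower()]
    distractors = random.sample(pool, min(num_distractors, len(pool)))
    return [answer] + distractors


context = "The cat sat on the mat while the dog watched the bird".split()
print(cbt_style_candidates(context, "fence", num_distractors=3))
```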

Thanks to the automatic generation process, these datasets can be very large, which has led to significant research progress. However, compared to how humans would create cloze questions and evaluate reading comprehension ability, the automatic generation process bears some inevitable issues. Firstly, blanks are chosen uniformly without considering which aspect of language a question will test. Hence, quite a portion of automatically-generated questions can be purposeless or even trivial to answer. Another issue involves the ambiguity of answers. Given a context and a sentence with a blank, there can be multiple words that fit almost equally well into the blank. A possible solution is to include a candidate option set, as done by CBT, to get rid of the ambiguity. However, automatically generating the candidate option set can be problematic since it cannot guarantee that the ambiguity is removed. More importantly, automatically-generated candidates can be totally irrelevant or simply grammatically unsuitable for the blank, again resulting in purposeless or trivial questions.


Probably due to these issues, neural models have achieved results comparable to human-level performance within a very short time (Chen et al., 2016; Dhingra et al., 2016; Seo et al., 2016). While there have been works trying to incorporate human design into cloze question generation (Zweig and Burges, 2011; Paperno et al., 2016), due to the expensive labeling process, the MSR Sentence Completion Challenge created by this effort has 1,040 questions and the LAMBADA dataset (Paperno et al., 2016) has 10,022 questions, limiting the possibility of developing powerful neural models on them. As a result of the small size, human-created questions are only used to compose development sets and test sets.

Motivated by the aforementioned drawbacks, we propose CLOTH, a large-scale cloze test dataset collected from English exams. Questions in the dataset are designed by middle-school and high-school teachers to prepare Chinese students for entrance exams. To design a cloze test, teachers first determine the words that can test students' knowledge of vocabulary, reasoning or grammar; they then replace those words with blanks and provide three other candidate options for each blank. If a question does not specifically test grammar usage, all of the candidate options complete the sentence with correct grammar, leading to highly nuanced questions. As a result, human-created questions are usually harder and are a better assessment of language proficiency. A general cloze test evaluates several aspects of language proficiency including vocabulary, reasoning and grammar, which are key components of comprehending natural language.

To verify whether human-created cloze questions are difficult for current models, we train and evaluate state-of-the-art language models (LMs) and machine comprehension models on this dataset, including a language model trained on the One Billion Word Corpus. We find that the state-of-the-art model lags behind human performance even when it is trained on a large external corpus. We analyze where the model fails compared to humans, who perform well. After conducting an error analysis, we hypothesize that the performance gap results from the model's inability to use long-term context. To examine this hypothesis, we evaluate human performance when the human subjects are only allowed to see one sentence as the context.

Our hypothesis is confirmed by the matched performance of the model and humans when given only one sentence. In addition, we demonstrate that human-created data is more difficult than automatically-generated data. Specifically, it is much easier for the same model to perform well on automatically-generated data.

We hope that CLOTH provides a valuable testbed for both the language modeling community and the machine comprehension community. Specifically, the language modeling community can use CLOTH to evaluate their models' abilities in modeling long contexts, while the machine comprehension community can use CLOTH to test machines' understanding of language phenomena.

2 Related Work

Large-scale automatically-generated cloze tests (Hermann et al., 2015; Hill et al., 2016; Onishi et al., 2016) have led to significant research advancements. However, the generated questions do not consider which language phenomena are being tested and are relatively easy to solve. Recently proposed reading comprehension datasets are instead all labeled by humans to ensure a high quality (Rajpurkar et al., 2016; Joshi et al., 2017; Trischler et al., 2016; Nguyen et al., 2016).

Perhaps the closest work to CLOTH is the LAMBADA dataset (Paperno et al., 2016). LAMBADA also targets finding challenging words to test an LM's ability to comprehend a longer context. However, LAMBADA does not provide a candidate set for each question, which can cause ambiguities when multiple words fit. Furthermore, only the test set and development set are labeled manually; the provided training set is the unlabeled Book Corpus (Zhu et al., 2015). Such unlabeled data does not emphasize long-dependency questions and has a distribution mismatched with the test set, as shown in Section 5. Further, the Book Corpus is too large to allow rapid algorithm development for researchers who do not have access to a huge amount of computational power.

Aiming to evaluate machines under the same conditions that humans are evaluated under, there is a growing interest in obtaining data from examinations. The NTCIR QA Lab (Shibuki et al., 2014) contains a set of real-world college entrance exam questions.


The Entrance Exams task at the CLEF QA Track (Penas et al., 2014; Rodrigo et al., 2015) evaluates machines' reading comprehension ability. The AI2 Reasoning Challenge (Clark et al., 2018; Schoenick et al., 2017) contains approximately eight thousand scientific questions used in middle school. Lai et al. (2017) propose the first large-scale machine comprehension dataset obtained from exams. They show that questions designed by teachers have a significantly larger proportion of reasoning questions. Our dataset focuses on evaluating both language proficiency and reasoning abilities.

3 CLOTH Dataset

In this section, we introduce the CLOTH dataset, which is collected from English examinations, and analyze the aspects of language proficiency it assesses.

3.1 Data Collection and Statistics

We collect the raw data from three free and public websites in China that gather exams created by English teachers to prepare students for college/high school entrance exams [3]. Before cleaning, there are 20,605 passages and 332,755 questions. We perform the following steps to ensure the validity of the data: Firstly, we remove questions with an inconsistent format, such as questions with more than four options. Then we filter out all questions whose validity relies on external information such as pictures or tables. Further, we find that half of the passages are duplicates, and we delete those passages. Lastly, on one of the websites, the answers are stored as images; we use two OCR software programs [4] to extract the answers from the images and discard a question when the results from the two programs differ. After the cleaning process, we obtain a clean dataset of 7,131 passages and 99,433 questions.
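The OCR cross-check step can be pictured with a small sketch like the one below; `run_tesseract` and `run_finereader` are hypothetical wrappers around the two OCR programs, not code released with the paper.

```python
# Sketch of the OCR cross-check described above: keep an answer only when the
# two independent OCR programs agree; otherwise discard the question.
def extract_answers(image_paths, run_tesseract, run_finereader):
    answers = []
    for path in image_paths:
        a = run_tesseract(path).strip().upper()
        b = run_finereader(path).strip().upper()
        # None marks a question to be discarded because the OCR results disagree.
        answers.append(a if a == b else None)
    return answers
```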

Since high school questions are more difficult than middle school questions, we divide the dataset into CLOTH-M and CLOTH-H, which stand for the middle school part and the high school part. We reserve 11% of the data for the test set and another 11% for the development set. The detailed statistics of the whole dataset and the two subsets are presented in Table 1. Note that the questions were created to test non-native speakers, hence the vocabulary size is not very large.

[3] The three websites include http://www.21cnjy.com/; http://5utk.ks5u.com/; http://zujuan.xkw.com/. We checked that CLOTH does not contain sentence completion example questions from the GRE, SAT and PSAT.

[4] tesseract: https://github.com/tesseract-ocr; ABBYY FineReader: https://www.abbyy.com/en-us/finereader/

3.2 Question Type Analysis

In order to evaluate students' mastery of a language, teachers usually design tests in a way that questions cover different aspects of a language. Specifically, they first identify words in the passage that can examine students' knowledge of vocabulary, logic, or grammar. Then, they replace the words with blanks and prepare three incorrect but nuanced candidate options to make the test non-trivial. A sample passage is presented in Table 2.

To understand what abilities this dataset assesses, we divide questions into several types and label the proportion of each type. According to English teachers who regularly create cloze test questions for English exams in China, there are broadly three types: grammar, vocabulary and reasoning. Grammar questions are easily differentiated from the other two categories. However, the teachers themselves cannot specify a clear distinction between reasoning questions and vocabulary questions, since all questions require comprehending the words within the context and conducting some level of reasoning by recognizing incomplete information or conceptual overlap.

Hence, we divide the questions other than grammar questions based on the difficulty level for a machine to answer them, following works on analyzing machine comprehension datasets (Chen et al., 2016; Trischler et al., 2016). In particular, we divide them in terms of their dependency ranges, since questions that only involve a single sentence are easier to answer than questions involving evidence distributed across multiple sentences. Further, we divide questions involving long-term dependency into matching/paraphrasing questions and reasoning questions, since matching questions are easier. The four types are:

• Grammar: The question is about grammar usage, involving tense, preposition usage, active/passive voice, subjunctive mood and so on.

• Short-term-reasoning: The question is about content words and can be answered based on the information within the same sentence. Note that the content words can evaluate knowledge of both vocabulary and reasoning.

• Matching/paraphrasing: The question is answered by copying/paraphrasing a word in the context.


Dataset            CLOTH-M                     CLOTH-H                     CLOTH (Total)
                   Train    Dev     Test       Train    Dev     Test       Train    Dev      Test
# passages         2,341    355     335        3,172    450     478        5,513    805      813
# questions        22,056   3,273   3,198      54,794   7,794   8,318      76,850   11,067   11,516
Vocab. size        15,096                      32,212                      37,235
Avg. # sentence    16.26                       18.92                       17.79
Avg. # words       242.88                      365.1                       313.16

Table 1: The statistics of the training, development and test sets of CLOTH-M (middle school questions), CLOTH-H (high school questions) and CLOTH.

• Long-term-reasoning: The answer must be inferred by synthesizing information distributed across multiple sentences.

We sample 100 passages from the high school category and 100 from the middle school category, for a total of 3,000 questions. The types of these questions are labeled on Amazon Mechanical Turk. We pay $1 per high school passage and $0.5 per middle school passage. We refer readers to Appendix A.1 for details of the labeling process and the labeled sample passage.

The proportion of different question types is shown in Table 3. The majority of questions are short-term-reasoning questions, while approximately 22.4% of the data needs long-term information, of which long-term-reasoning questions constitute a large proportion.

4 Exploring Models’ Limits

In this section, we investigate whether human-created cloze tests are a challenging problem for state-of-the-art models. We find that an LM trained on the One Billion Word Corpus achieves a remarkable score but cannot solve the cloze test. After conducting an error analysis, we hypothesize that the model is not able to deal with long-term dependencies. We verify the hypothesis by comparing the model's performance with human performance when the information humans obtain is limited to one sentence.

4.1 Human and Model Performance

LSTM To test the performance of RNN-based supervised models, we train a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) to predict the missing word given the context, using only the labeled data. The implementation details are in Appendix A.3.
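A minimal PyTorch sketch of such a supervised baseline is given below: a bidirectional LSTM encodes the passage, and the hidden state at the blank scores each candidate's embedding. The dimensions and the dot-product scorer are illustrative assumptions, not the exact architecture described in Appendix A.3.

```python
# Sketch of a BiLSTM cloze baseline: the blank position's hidden state is
# projected and dotted with each candidate embedding to produce 4 logits,
# which can be trained with a standard cross-entropy loss.
import torch
import torch.nn as nn


class BiLSTMCloze(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=650):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, emb_dim)

    def forward(self, context_ids, blank_pos, candidate_ids):
        # context_ids: (batch, seq_len); blank_pos: (batch,); candidate_ids: (batch, 4)
        h, _ = self.lstm(self.emb(context_ids))                 # (batch, seq_len, 2*hidden)
        blank_h = h[torch.arange(h.size(0)), blank_pos]         # hidden state at the blank
        query = self.proj(blank_h)                              # (batch, emb_dim)
        cand_emb = self.emb(candidate_ids)                      # (batch, 4, emb_dim)
        return torch.bmm(cand_emb, query.unsqueeze(2)).squeeze(2)  # (batch, 4) logits
```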

Attentive Readers To enable the model to gather information from a longer context, we

augment the supervised LSTM model with the attention mechanism (Bahdanau et al., 2014), so that the representation at the blank is used as a query to find the relevant context in the document, and a blank-specific representation of the document is used to score each candidate answer. Specifically, we adapt the Stanford Attentive Reader (Chen et al., 2016) and the position-aware attention model (Zhang et al., 2017) to the cloze test problem. With the position-aware attention model, the attention scores are based on both the context match and the distance from a context word to the blank. Both attention models are trained only with human-created blanks, just as the LSTM model is.

LM In a cloze test, the context on both sides may be enough to determine the correct answer. Suppose $x_i$ is the missing word and $x_1, \cdots, x_{i-1}, x_{i+1}, \cdots, x_n$ are the context. We choose the $x_i$ that maximizes the joint probability $p(x_1, \cdots, x_n)$, which essentially maximizes the conditional likelihood $p(x_i \mid x_1, \cdots, x_{i-1}, x_{i+1}, \cdots, x_n)$. Therefore, an LM can be naturally adapted to the cloze test.
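This adaptation can be summarized by the following sketch, which fills the blank with each option, scores the completed sequence with a language model, and returns the highest-scoring option. Here `lm_log_prob` is a hypothetical function that returns the total log-likelihood a trained LM assigns to a token sequence.

```python
# Sketch of using an LM for a cloze question: pick the option whose insertion
# maximizes the joint probability p(x_1, ..., x_n) of the completed sequence.
def answer_blank(left_context, right_context, options, lm_log_prob):
    scores = []
    for opt in options:
        tokens = left_context + [opt] + right_context
        scores.append(lm_log_prob(tokens))  # log p(x_1, ..., x_n)
    return max(range(len(options)), key=lambda i: scores[i])
```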

In essence, the LM treats each word as a possible blank and learns to predict it. As a result, it receives more supervision than the LSTM trained on human-labeled questions. Besides training a neural LM on our dataset, we are also interested in whether the state-of-the-art LM can solve the cloze test, so we additionally test the LM trained on the One Billion Word Benchmark (Chelba et al., 2013), referred to as 1B-LM, which achieves a perplexity of 30.0 (Jozefowicz et al., 2016) [5]. To make the evaluation time tractable, we limit the context length to one sentence or three sentences. Note that the One Billion Word Corpus does not overlap with the CLOTH corpus.

[5] The pre-trained model is obtained from https://github.com/tensorflow/models/tree/master/research/lm_1b


Passage: Nancy had just got a job as a secretary in a company. Monday was the first day she went to work, so she was very _1_ and arrived early. She _2_ the door open and found nobody there. "I am the _3_ to arrive." She thought and came to her desk. She was surprised to find a bunch of _4_ on it. They were fresh. She _5_ them and they were sweet. She looked around for a _6_ to put them in. "Somebody has sent me flowers the very first day!" she thought _7_. "But who could it be?" she began to _8_. The day passed quickly and Nancy did everything with _9_ interest. For the following days of the _10_, the first thing Nancy did was to change water for the followers and then set about her work. Then came another Monday. _11_ she came near her desk she was overjoyed to see a(n) _12_ bunch of flowers there. She quickly put them in the vase, _13_ the old ones. The same thing happened again the next Monday. Nancy began to think of ways to find out the _14_. On Tuesday afternoon, she was sent to hand in a plan to the _15_. She waited for his directives at his secretary's _16_. She happened to see on the desk a half-opened notebook, which _17_: "In order to keep the secretaries in high spirits, the company has decided that every Monday morning a bunch of fresh flowers should be put on each secretary's desk." Later, she was told that their general manager was a business management psychologist.

Questions:
1. A. depressed B. encouraged C. excited D. surprised
2. A. turned B. pushed C. knocked D. forced
3. A. last B. second C. third D. first
4. A. keys B. grapes C. flowers D. bananas
5. A. smelled B. ate C. took D. held
6. A. vase B. room C. glass D. bottle
7. A. angrily B. quietly C. strangely D. happily
8. A. seek B. wonder C. work D. ask
9. A. low B. little C. great D. general
10. A. month B. period C. year D. week
11. A. Unless B. When C. Since D. Before
12. A. old B. red C. blue D. new
13. A. covering B. demanding C. replacing D. forbidding
14. A. sender B. receiver C. secretary D. waiter
15. A. assistant B. colleague C. employee D. manager
16. A. notebook B. desk C. office D. house
17. A. said B. written C. printed D. signed

Table 2: A sample passage from our dataset. Boldface highlights the correct answers. There is only one best answer among the four candidates, although several candidates may seem correct.


Human performance We measure the performance of Amazon Mechanical Turkers on 3,000 sampled questions when the whole passage is given.

Results The comparison is shown in Table 4. Both attentive readers achieve accuracy similar to the LSTM. We hypothesize that the reason for the attention models' unsatisfactory performance is that the evidence for a question cannot be found by simply matching the context.

                 Short-term          Long-term
Dataset          GM       STR        MP       LTR      O
CLOTH            0.265    0.503      0.044    0.180    0.007
CLOTH-M          0.330    0.413      0.068    0.174    0.014
CLOTH-H          0.240    0.539      0.035    0.183    0.004

Table 3: The question type statistics of 3,000 sampled questions, where GM, STR, MP, LTR and O denote grammar, short-term-reasoning, matching/paraphrasing, long-term-reasoning and others respectively.

Model                   CLOTH    CLOTH-M    CLOTH-H
LSTM                    0.484    0.518      0.471
Stanford AR             0.487    0.529      0.471
Position-aware AR       0.485    0.523      0.471
LM                      0.548    0.646      0.506
1B-LM (one sent.)       0.695    0.723      0.685
1B-LM (three sent.)     0.707    0.745      0.693
Human performance       0.859    0.897      0.845

Table 4: Models' performance and human-level performance on CLOTH. The LSTM, Stanford Attentive Reader and Attentive Reader with position-aware attention shown in the top part only use supervised data labeled by humans. The LM outperforms the LSTM since it receives more supervision in learning to predict each word. Training on a large external corpus further significantly enhances the LM's accuracy.

Similarly, on reading comprehension, though attention-based models (Wang et al., 2017; Seo et al., 2016; Dhingra et al., 2016) have reached human performance on the SQuAD dataset (Rajpurkar et al., 2016), their performance is still not comparable to human performance on datasets that focus more on reasoning, where the evidence cannot be found by simple matching (Lai et al., 2017; Xu et al., 2017). Since the focus of this paper is to analyze the proposed dataset, we leave the design of reasoning-oriented attention models for future work.

The LM achieves much better performance than the LSTM. The gap is larger when the LM is trained on the One Billion Word Corpus, indicating that more training data results in better generalization. Specifically, the accuracy of 1B-LM is 0.695 when one sentence is used as the context, indicating that an LM can learn sophisticated language regularities when given sufficient data.


The same conclusion can also be drawn from the success of the concurrent work ELMo, which uses LM representations as word vectors and achieves state-of-the-art results on six language tasks (Peters et al., 2018). However, if we increase the context length to three sentences, the accuracy of 1B-LM only improves marginally. In contrast, humans outperform 1B-LM by a significant margin, which demonstrates that the deliberately designed questions in CLOTH are not completely solved even by state-of-the-art models.

4.2 Analyzing 1B-LM's Strengths and Weaknesses

In this section, we would like to understand why 1B-LM lags behind human performance. We find that most of the errors involve long-term reasoning. Additionally, in a lot of cases, the dependency is within the context of three sentences. We show several errors made by the 1B-LM in Table 5. In the first example, the model does not know that Nancy finding nobody in the company means that Nancy was the first one to arrive at the company. In the second and third examples, the model fails probably because it does not recognize that "they" refers to "flowers". The dependency in the last case is longer: it depends on the fact that Nancy was alone in the company.

Based on the case study, we hypothesize that the LM is not able to take long-term information into account, although it achieves a surprisingly good overall performance. Additionally, the 1B-LM is trained on the sentence level, which might also result in the inability to track paragraph-level information. However, to investigate the differences between training on the sentence level and on the paragraph level, a prohibitive amount of computational resources would be required to train a large model on the One Billion Word Corpus.

On the other hand, a practical comparison is to test the model's performance on different types of questions. We find that the model's accuracy is 0.591 on long-term-reasoning questions of CLOTH-H while it achieves 0.693 on short-term-reasoning questions (a comprehensive type-specific performance analysis is available in Appendix A.2), which partially confirms that long-term reasoning is harder. However, we cannot completely rely on the performance on specific question types, partly due to the large variance caused by the small sample size. Another reason is that the reliability of question type labels depends on whether the Turkers are careful enough. For example, in the error analysis shown in Table 5, a careless Turker could label the second example as short-term reasoning without noticing that the meaning of "they" relies on a long context.


To objectively verify whether the LM's strengths lie in dealing with short-term information, we obtain the ceiling performance of using only short-term information. Showing only one sentence as the context, we ask the Turkers to select an option based on their best guess given the insufficient information. By limiting the context span manually, the ceiling performance with access to only a short context is estimated accurately.

As shown in Table 6, the performance of 1B-LM using one sentence as the context almost matches the human ceiling performance of using only short-term information. Hence we conclude that the LM can almost perfectly solve all short-term cloze questions. However, the performance of the LM does not improve significantly when a long-term context is given, indicating that the performance gap is due to the inability to perform long-term reasoning.

5 Comparing Human-created Data and Automatically-generated Data

In this section, we demonstrate that human-created data is a better testbed than automatically-generated cloze tests since it results in a larger gap between the model's performance and human performance.

A casual observation is that a cloze test can be created by randomly deleting words and randomly sampling candidate options. In fact, to generate large-scale data, similar generation processes have been introduced and widely used in machine comprehension (Hermann et al., 2015; Hill et al., 2016; Onishi et al., 2016). However, research on cloze test design (Sachs et al., 1997) shows that tests created by deliberately deleting words are more reliable than tests created by randomly or periodically deleting words. To design an accurate language proficiency assessment, teachers usually deliberately select words in order to examine students' proficiency in grammar, vocabulary and reasoning. Moreover, in order to make the questions non-trivial, the three incorrect options provided by teachers are usually grammatically correct and relevant to the context. For instance, in the fourth problem of the sample passage shown in Table 2, "grapes", "flowers" and "bananas" all fit the description of being fresh.



Context: She pushed the door open and found nobody there. "I am the ___ to arrive." She thought and came to her desk.
Options: A. last B. second C. third D. first

Context: They were fresh. She ___ them and they were sweet. She looked around for a vase to put them in.
Options: A. smelled B. ate C. took D. held

Context: She smelled them and they were sweet. She looked around for a ___ to put them in. "Somebody has sent me flowers the very first day!"
Options: A. vase B. room C. glass D. bottle

Context: "But who could it be?" she began to ___. The day passed quickly and Nancy did everything with great interest.
Options: A. seek B. wonder C. work D. ask

Table 5: Error analysis of the 1B-LM with three sentences as the context. The questions are sampled from the sample passage shown in Table 2. The correct answer is in bold text. The incorrectly selected options are in italics.

                         CLOTH    CLOTH-M    CLOTH-H
Short context   1B-LM    0.695    0.723      0.685
                Human    0.713    0.771      0.691
Long context    1B-LM    0.707    0.745      0.693
                Human    0.859    0.897      0.845

Table 6: Humans' performance compared with the 1B-LM. In the short-context setting, both the 1B-LM and humans only use the information of one sentence. In the long-context setting, humans have the whole passage as the context, while the 1B-LM uses contexts of three sentences.

Hence we naturally hypothesize that human-created data has distinct characteristics when compared with automatically-generated data. To verify this assumption, we compare the LSTM model's performance when given different proportions of the two types of data. Specifically, to train a model with α percent of automatically-generated data, we randomly replace α percent of the blanks with blanks at random positions, while keeping the remaining 100 − α percent of the questions the same. The candidate options for the generated blanks are random words sampled from the unigram distribution. We test models obtained with varying α on human-created data and automatically-generated data respectively.
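A sketch of this mixing procedure, under assumed data structures, is shown below: with probability α a question's blank is moved to a random position and its distractors are drawn from the unigram distribution; otherwise the teacher-designed question is kept unchanged.

```python
# Sketch of building an alpha-fraction automatically-generated training mix.
# Field names and data structures are illustrative assumptions.
import random


def make_mixed_question(tokens, human_blank, human_options, alpha,
                        unigram_words, unigram_probs):
    if random.random() < alpha:
        pos = random.randrange(len(tokens))                  # blank at a random position
        answer = tokens[pos]
        # Three distractors sampled from the corpus unigram distribution.
        distractors = random.choices(unigram_words, weights=unigram_probs, k=3)
        return {"blank": pos, "options": [answer] + distractors}
    # Keep the teacher-designed blank and candidate options unchanged.
    return {"blank": human_blank, "options": human_options}
```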

Test \ α%         0%       25%      50%      75%      100%
Human-created     0.484    0.475    0.469    0.423    0.381
Generated         0.422    0.699    0.757    0.785    0.815

Table 7: The model's performance when trained on α percent of automatically-generated data and 100 − α percent of human-created data

From the comparison in Table 7, we have the following observations: (1) Human-created data leads to a larger gap between the model's performance and the ceiling/human performance.

The model's performance and human performance on the human-created data are 0.484 and 0.859 respectively, as shown in Table 4, leading to a gap of 0.376. In comparison, the performance gap on the automatically-generated data is at most 0.185, since the model reaches an accuracy of 0.815 when fully trained on generated data. (2) Although human-created data may provide more information for distinguishing similar words, the distributional mismatch between the two types of data makes it non-trivial to transfer the knowledge gained from human-created data to tackle automatically-generated data. Specifically, the model's performance on automatically-generated data monotonically decreases when given a higher ratio of human-created data.

6 Combining Human-created Data with Automatically-generated Data

In Section 4.1, we show that the LM is able to take advantage of more supervision since it predicts each word based on the context. At the same time, we also show that human-created data and automatically-generated data are quite different in Section 5. In this section, we propose a model that takes advantage of both sources.

6.1 Representative-based Model

Specifically, for each question, regardless of whether it is human-created or automatically-generated, we can compute the negative log likelihood of the correct answer as the loss function. Suppose $J_H$ is the average negative log likelihood loss on human-created questions and $J_R$ is the loss function on generated questions. We combine the losses on human-created questions and generated questions by simply adding them together, i.e., $J_R + J_H$ is used as the final loss function. We introduce the definition of $J_R$ in the following paragraphs.



Although automatically-generated data is available in large quantities and is valuable for model training, as shown in the previous section, automatically-generated questions are quite different from human-created questions. Ideally, a large amount of human-created questions would be more desirable than a large amount of automatically-generated questions. A possible avenue towards having large-scale human-created data is to automatically pick out generated questions that are representative of, or similar to, human-created questions. In other words, we train a network to predict whether a question is a generated question or a human-created question. A generated question is representative of human-created questions if it has a high probability of being classified as human-created. We can then give higher weights to questions that resemble human-created questions.

We first introduce our method to obtain the representativeness information. Let $x$ denote the passage and $z$ denote whether a word is selected as a question by a human, i.e., $z_i$ is 1 if the $i$-th word is selected to be filled in the original passage and 0 otherwise. Suppose $h_i$ is the representation of the $i$-th word given by a bidirectional LSTM. The network computes the probability $p_i$ of $x_i$ being a human-created question as follows:

$$l_i = h_i^\top w_{x_i}; \quad p_i = \mathrm{Sigmoid}(l_i)$$

where $l_i$ is the logit, which will be used in the final model, and $w_{x_i}$ is the word embedding of $x_i$. We train the network to minimize the binary cross entropy between $p_i$ and the ground-truth label $z_i$ at each token.
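A PyTorch sketch of this predictor is shown below; the hidden and embedding dimensions, and the projection used to make their sizes compatible, are illustrative assumptions.

```python
# Sketch of the representativeness predictor: a BiLSTM encodes the passage,
# each token's logit is the dot product of its (projected) hidden state with
# its word embedding, and training uses binary cross entropy against the 0/1
# "was this word blanked by a teacher" labels z.
import torch
import torch.nn as nn


class RepresentativenessTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=300):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, emb_dim)  # map h_i into embedding space

    def forward(self, token_ids):
        e = self.emb(token_ids)               # (batch, seq_len, emb_dim)
        h, _ = self.lstm(e)                   # (batch, seq_len, 2*hidden_dim)
        logits = (self.proj(h) * e).sum(-1)   # l_i = h_i^T w_{x_i}
        return logits                         # feed to nn.BCEWithLogitsLoss against z


model = RepresentativenessTagger(vocab_size=50000)
loss_fn = nn.BCEWithLogitsLoss()
```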

After obtaining the representativeness information, we define the representativeness-weighted loss function as

$$J_R = \sum_{i \notin H} \mathrm{Softmax}_i\!\left(\frac{l_1}{\alpha}, \cdots, \frac{l_n}{\alpha}\right) J_i$$

where $J_i$ denotes the negative log likelihood loss for the $i$-th question, $l_i$ is the output representativeness logit of the $i$-th question, $H$ is the set of all human-created questions, and $\alpha$ is the temperature of the Softmax function. The model degenerates into assigning a uniform weight to all questions when the temperature is $+\infty$. We set $\alpha$ to 2 based on the performance on the dev set [6].
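The combined objective can be sketched as follows, assuming the per-question losses and representativeness logits have already been collected into tensors.

```python
# Sketch of the combined objective: generated questions' losses are weighted by
# a temperature-softmax over their representativeness logits (J_R), then added
# to the mean loss on human-created questions (J_H). Shapes are assumptions.
import torch
import torch.nn.functional as F


def combined_loss(gen_losses, gen_logits, human_losses, alpha=2.0):
    # gen_losses, gen_logits: (num_generated,); human_losses: (num_human,)
    weights = F.softmax(gen_logits / alpha, dim=0)   # Softmax_i(l_1/alpha, ..., l_n/alpha)
    j_r = (weights * gen_losses).sum()               # weighted loss on generated questions
    j_h = human_losses.mean()                        # average loss on human-created questions
    return j_r + j_h
```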

[6] The code is available at https://github.com/qizhex/Large-scale-Cloze-Test-Dataset-Created-by-Teachers

Model            Ex.    CLOTH    CLOTH-M    CLOTH-H
Our model        No     0.583    0.673      0.549
LM               No     0.548    0.646      0.506
LSTM             No     0.484    0.518      0.471
Stanford AR      No     0.487    0.529      0.471
1B-LM            Yes    0.707    0.745      0.693
Human                   0.859    0.897      0.845

Table 8: Overall results on CLOTH. Ex. denotes the use of external data.

Model                CLOTH    CLOTH-M    CLOTH-H
Our model            0.583    0.673      0.549
w.o. rep.            0.566    0.662      0.528
w.o. hum.            0.565    0.665      0.526
w.o. rep. or hum.    0.543    0.643      0.505

Table 9: Ablation study on using the representativeness information (denoted as rep.) and the human-created data (denoted as hum.)

6.2 Results

We summarize the performance of all models in Table 8. Our representativeness model outperforms all other models that do not use external data on CLOTH, CLOTH-M and CLOTH-H.

6.3 Analysis

In this section, we verify the effectiveness of the representativeness-based averaging by ablation studies. When we remove the representativeness information by setting α to infinity, the accuracy drops from 0.583 to 0.566. When we further remove the human-created data so that only generated data is employed, the accuracy drops to 0.543, similar to the performance of the LM. The results further confirm that it is beneficial to incorporate human-created questions into training.

A sample of the predicted representativeness is shown in Figure 1 [7]. Clearly, words that are too obvious have low scores, such as punctuation marks and the simple words "a" and "the". In contrast, content words whose semantics are directly related to the context have a higher score, e.g., "same", "similar" and "difference" have high scores when the difference between two objects is discussed, and "secrets" has a high score since it is related to the subsequent sentence "does not want to share with others".

[7] The script to generate the figure is obtained from https://gist.github.com/ihsgnef/f13c35cd46624c8f458a4d23589ac768


Figure 1: Representativeness prediction for each word. Lighter color means less representative. The words deleted by humans as blanks are in bold text.

Our prediction model achieves an F1 score of 36.5 on the test set, which is understandable since there are many plausible questions within a passage.

It has been shown that features such as morphological information and readability are beneficial for cloze test prediction (Skory and Eskenazi, 2010; Correia et al., 2012, 2010; Kurtasov, 2013). We leave investigating advanced approaches to automatically designing cloze tests for future work.

7 Conclusion and Discussion

In this paper, we propose CLOTH, a large-scale cloze test dataset designed by teachers. With missing blanks and candidate options carefully created by teachers to test different aspects of language phenomena, CLOTH requires a deep language understanding and better captures the complexity of human language. We find that humans outperform the 1B-LM by a significant margin. After a detailed analysis, we find that the performance gap is due to the model's inability to understand a long context. We also show that, compared to automatically-generated questions, human-created questions are more difficult and lead to a larger margin between human performance and the model's performance.

Despite the excellent performance of the 1B-LM compared with models trained only on CLOTH, it is still important to investigate and create more effective models and algorithms that provide advantages complementary to having a large amount of data. For rapid algorithm development, we suggest training models only on the training set of CLOTH and comparing with models that do not utilize external data.

We hope our dataset provides a valuable testbed for both the language modeling community and the machine comprehension community.

In particular, the language modeling community can use CLOTH to evaluate their models' abilities in modeling a long context. In addition, the machine comprehension community may also find CLOTH useful in evaluating machines' understanding of language phenomena including vocabulary, reasoning and grammar, which are key components of comprehending natural language.

In our future work, we would like to design algorithms that better model a long context, utilize external knowledge, and explore more effective semi-supervised learning approaches. Firstly, we would like to investigate efficient ways of utilizing external knowledge such as paraphrasing and semantic concepts, as in prior works (Dong et al., 2017; Dasigi et al., 2017). In comparison, training on a large external dataset is a rather time-consuming way of utilizing external knowledge. Secondly, to use the generated questions more effectively, the representativeness-based semi-supervised approach might be improved by techniques studied in active learning and hard example mining (Settles, 2009; Shrivastava et al., 2016; Chang et al., 2017).

Acknowledgement

We thank Yulun Du, Kaiyu Shi and Zhilin Yang for insightful discussions and suggestions on the draft. We thank Shi Feng for the script to highlight representative words. This research was supported in part by DARPA grant FA8750-12-2-0342 funded under the DEFT program.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Haw-Shiuan Chang, Erik Learned-Miller, and Andrew McCallum. 2017. Active bias: Training more accurate neural networks by emphasizing high variance samples. In Advances in Neural Information Processing Systems, pages 1003–1013.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. arXiv preprint arXiv:1606.02858.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.

Rui Correia, Jorge Baptista, Maxine Eskenazi, and Nuno J. Mamede. 2012. Automatic generation of cloze question stems. In PROPOR, pages 168–178. Springer.

Rui Correia, Jorge Baptista, Nuno Mamede, Isabel Trancoso, and Maxine Eskenazi. 2010. Automatic generation of cloze question distractors. In Proceedings of the Interspeech 2010 Satellite Workshop on Second Language Studies: Acquisition, Learning, Education and Technology, Waseda University, Tokyo, Japan.

Pradeep Dasigi, Waleed Ammar, Chris Dyer, and Eduard Hovy. 2017. Ontology-aware token embeddings for prepositional phrase attachment. arXiv preprint arXiv:1705.02925.

Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. 2016. Gated-attention readers for text comprehension. arXiv preprint arXiv:1606.01549.

Li Dong, Jonathan Mallinson, Siva Reddy, and Mirella Lapata. 2017. Learning to paraphrase for question answering. arXiv preprint arXiv:1708.06022.

Sandra S. Fotos. 1991. The cloze test as an integrative measure of EFL proficiency: A substitute for essays on college entrance examinations? Language Learning, 41(3):313–336.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In NIPS.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The Goldilocks principle: Reading children's books with explicit memory representations. ICLR.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Jon Jonz. 1991. Cloze item types and second language comprehension. Language Testing, 8(1):1–22.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. ACL.

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Andrey Kurtasov. 2013. A system for generating cloze test items from Russian-language text. In Proceedings of the Student Research Workshop associated with RANLP 2013, pages 107–112.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. EMNLP.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.

Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. arXiv preprint arXiv:1608.05457.

Denis Paperno, German Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.

Anselmo Penas, Yusuke Miyao, Alvaro Rodrigo, Eduard H. Hovy, and Noriko Kando. 2014. Overview of CLEF QA Entrance Exams task 2014. In CLEF (Working Notes), pages 1194–1200.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Alvaro Rodrigo, Anselmo Penas, Yusuke Miyao, Eduard H. Hovy, and Noriko Kando. 2015. Overview of CLEF QA Entrance Exams task 2015. In CLEF (Working Notes).

J. Sachs, P. Tung, and R. Y. H. Lam. 1997. How to construct a cloze test: Lessons from testing measurement theory models. Perspectives.

Carissa Schoenick, Peter Clark, Oyvind Tafjord, Peter Turney, and Oren Etzioni. 2017. Moving beyond the Turing Test with the Allen AI Science Challenge. Communications of the ACM, 60(9):60–64.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.

Burr Settles. 2009. Active learning literature survey.

Hideyuki Shibuki, Kotaro Sakamoto, Yoshinobu Kano, Teruko Mitamura, Madoka Ishioroshi, Kelly Y. Itakura, Di Wang, Tatsunori Mori, and Noriko Kando. 2014. Overview of the NTCIR-11 QA-Lab task. In NTCIR.

Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. 2016. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769.

Adam Skory and Maxine Eskenazi. 2010. Predicting cloze task quality for vocabulary training. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pages 49–56. Association for Computational Linguistics.

Wilson L. Taylor. 1953. "Cloze procedure": A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.

Annie Tremblay. 2011. Proficiency assessment standards in second language acquisition research. Studies in Second Language Acquisition, 33(3):339–372.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2016. NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830.

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 189–198.

Yichong Xu, Jingjing Liu, Jianfeng Gao, Yelong Shen, and Xiaodong Liu. 2017. Towards human-level machine reading comprehension: Reasoning and inference with multiple strategies. arXiv preprint arXiv:1711.04964.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.

Geoffrey Zweig and Christopher J. C. Burges. 2011. The Microsoft Research Sentence Completion Challenge. Technical Report MSR-TR-2011-129, Microsoft.


(a) Middle school group (CLOTH-M); (b) High school group (CLOTH-H)

Figure 2: Model and human performance on questions of different types (grammar, short-term-reasoning, matching/paraphrasing, long-term-reasoning), comparing LM, our model, 1B-LM (one sentence), 1B-LM (three sentences), Human (one sentence) and Human (whole passage). Our model is introduced in Sec. 6.

A Appendix

A.1 Question Type Labeling

To label the questions, we provided the definition and an example for each question category to the Amazon Mechanical Turkers. To ensure quality, we limited the workers to master Turkers, who are experienced and maintain a high acceptance rate. However, we did not restrict the backgrounds of the Turkers, since master Turkers should have a reasonable amount of knowledge about English to have conducted previous tasks. In addition, the vocabulary used in CLOTH is usually not difficult since the passages are constructed to test non-native speakers in middle school or high school. To get a concrete idea of the nature of question types, please refer to the examples shown in Table 10.

A.2 Type-specific Performance Analysis

We can further verify the strengths and weaknesses of the 1B-LM by studying the performance of models and humans on different question categories. Note that the performance presented here may be subject to a high variance due to the limited number of samples in each category. From the comparison shown in Figure 2, we see that 1B-LM is indeed good at short-term questions. Specifically, when humans only have access to the context of one sentence, 1B-LM is close to human performance on almost all categories. Further, comparing LM and 1B-LM, we find that training on the large corpus leads to improvements in all categories, showing that training on a large amount of data leads to a substantial improvement in learning complex language regularities.

A.3 Implementation Details

We implement our models using PyTorch (Paszke et al., 2017). We train our model on all questions in CLOTH and test it on CLOTH-M and CLOTH-H separately. For our final model, we use Adam (Kingma and Ba, 2014) with a learning rate of 0.001. The hidden dimension is set to 650 and we initialize the word embeddings with 300-dimensional GloVe word vectors (Pennington et al., 2014). The temperature α is set to 2. We tried increasing the dimensionality of the model but did not observe a performance improvement.

When we train the small LM on CLOTH, we largely follow the recommended hyperparameters in the PyTorch LM example [8]. Specifically, we employ a 2-layer LSTM with a hidden dimension of 1024. The input embedding and output weight matrix are tied. We set the dropout rate to 0.5. The initial learning rate is set to 10 and is divided by 4 whenever the perplexity stops improving on the dev set.
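For reference, the sketch below collects these reported hyperparameters into plain Python dictionaries and expresses the learning-rate schedule as a small helper; how they are wired into the actual training loop is an assumption.

```python
# Reported hyperparameters of the small LM, plus the "divide the learning rate
# by 4 when dev perplexity stops improving" schedule as a tiny helper.
SMALL_LM_CONFIG = {
    "lstm_layers": 2,
    "hidden_dim": 1024,
    "tie_embeddings": True,
    "dropout": 0.5,
    "initial_lr": 10.0,
}


def update_lr(current_lr, dev_ppl_history, decay=4.0):
    # Divide the learning rate by `decay` when dev perplexity stops improving.
    if len(dev_ppl_history) >= 2 and dev_ppl_history[-1] >= dev_ppl_history[-2]:
        return current_lr / decay
    return current_lr
```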

We predict the answer for each blank independently for all of the models mentioned in this paper, since we do not observe significant performance improvements in our preliminary experiments when an auto-regressive approach is employed, i.e., when we fill all previous blanks with predicted answers.

[8] https://github.com/pytorch/examples/tree/master/word_language_model


We hypothesize that, regardless of whether inter-blank dependencies exist, since blanks are usually distributed far away from each other, the LSTM is not able to capture such long dependencies. When testing language models, we use the longest text spans that do not contain blanks.

Passage: Nancy had just got a job as a secretary in a company. Monday was the first day she went to work, so she was very _1_ and arrived early. She _2_ the door open and found nobody there. "I am the _3_ to arrive." She thought and came to her desk. She was surprised to find a bunch of _4_ on it. They were fresh. She _5_ them and they were sweet. She looked around for a _6_ to put them in. "Somebody has sent me flowers the very first day!" she thought _7_. "But who could it be?" she began to _8_. The day passed quickly and Nancy did everything with _9_ interest. For the following days of the _10_, the first thing Nancy did was to change water for the followers and then set about her work. Then came another Monday. _11_ she came near her desk she was overjoyed to see a(n) _12_ bunch of flowers there. She quickly put them in the vase, _13_ the old ones. The same thing happened again the next Monday. Nancy began to think of ways to find out the _14_. On Tuesday afternoon, she was sent to hand in a plan to the _15_. She waited for his directives at his secretary's _16_. She happened to see on the desk a half-opened notebook, which _17_: "In order to keep the secretaries in high spirits, the company has decided that every Monday morning a bunch of fresh flowers should be put on each secretary's desk." Later, she was told that their general manager was a business management psychologist.

Questions                                                    Question type
1. A. depressed B. encouraged C. excited D. surprised        short-term reasoning
2. A. turned B. pushed C. knocked D. forced                  short-term reasoning
3. A. last B. second C. third D. first                       long-term reasoning
4. A. keys B. grapes C. flowers D. bananas                   matching
5. A. smelled B. ate C. took D. held                         short-term reasoning
6. A. vase B. room C. glass D. bottle                        long-term reasoning
7. A. angrily B. quietly C. strangely D. happily             short-term reasoning
8. A. seek B. wonder C. work D. ask                          long-term reasoning
9. A. low B. little C. great D. general                      long-term reasoning
10. A. month B. period C. year D. week                       long-term reasoning
11. A. Unless B. When C. Since D. Before                     grammar
12. A. old B. red C. blue D. new                             long-term reasoning
13. A. covering B. demanding C. replacing D. forbidding      long-term reasoning
14. A. sender B. receiver C. secretary D. waiter             long-term reasoning
15. A. assistant B. colleague C. employee D. manager         matching
16. A. notebook B. desk C. office D. house                   matching
17. A. said B. written C. printed D. signed                  grammar

Table 10: An Amazon Turker's labels for the sample passage