
Bootstrapping for Text Learning Tasks

Rosie Jones 1 (rosie@cs.cmu.edu)
Andrew McCallum 2,1 ([email protected])
Kamal Nigam 1 ([email protected])
Ellen Riloff 3 (riloff@cs.utah.edu)

1 School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
2 Just Research, 4616 Henry Street, Pittsburgh, PA 15213
3 Department of Computer Science, University of Utah, Salt Lake City, UT 84112

Abstract

When applying text learning algorithms to complex tasks, it is tedious and expensive to hand-label the large amounts of training data necessary for good performance. This paper presents bootstrapping as an alternative approach to learning from large sets of labeled data. Instead of a large quantity of labeled data, this paper advocates using a small amount of seed information and a large collection of easily-obtained unlabeled data. Bootstrapping initializes a learner with the seed information; it then iterates, applying the learner to calculate labels for the unlabeled data, and incorporating some of these labels into the training input for the learner. Two case studies of this approach are presented. Bootstrapping for information extraction provides 76% precision for a 250-word dictionary for extracting locations from web pages, when starting with just a few seed locations. Bootstrapping a text classifier from a few keywords per class and a class hierarchy provides accuracy of 66%, a level close to human agreement, when placing computer science research papers into a topic hierarchy. The success of these two examples argues for the strength of the general bootstrapping approach for text learning tasks.

1 Introduction

Text learning algorithms today are reasonably successful when provided with enough labeled or annotated training examples. For instance, text classifiers [Lewis, 1998; Joachims, 1998; Yang, 1999; Cohen and Singer, 1996; Schapire and Singer, 1999] reach high accuracy from large sets of class-labeled documents; information extraction algorithms [Califf, 1998; Riloff, 1993; Soderland, 1999; Freitag, 1998] perform well when given many tagged documents or large sets of rules as input. However, as more complex domains are considered, the requisite size of these training sets grows prohibitively large. Creating these training sets becomes tedious and expensive, since typically they must be labeled by a person. This leads us to consider learning algorithms that do not require such large amounts of labeled data.

While labeled data is difficult to obtain, unlabeled data is readily available and plentiful. Castelli and Cover [1996] show in a theoretical framework that unlabeled data can be used in some settings to improve classification, although it is exponentially less valuable than labeled data. Fortunately, unlabeled data can often be obtained by completely automated methods. Consider the problem of classifying news articles: a short Perl script and a night of automated Internet downloads can fill a hard disk with unlabeled examples of news articles. In contrast, it might take several days of human effort and tedium to label even one thousand of these, depending on the nature of the task.

However, one cannot learn to perform classification from unlabeled data alone. By itself, unlabeled data describes the domain of the problem, but not the task over the domain. Thus, unlabeled data must be coupled with at least some information about the target function for the learning task. This target, or seed, information can come in many different forms, such as keywords or features which may appear in examples of the target classes, or a small number of examples of the target classes.

This paper advocates using a bootstrapping framework for text learning tasks that would otherwise require large training sets. The input to the bootstrapping process is a large amount of unlabeled data and a small amount of seed information to inform the learner about the specific task at hand. In this paper we consider seed information in the form of keywords associated with classes. Bootstrapping initializes a learner with the keywords. It then iterates, applying the learner to hypothesize labels for unlabeled data, and building a new learner from these bootstrapped labels.

We present two instantiations of the bootstrapping approach for different text learning tasks. The first case study is learning extraction patterns and dictionaries for information extraction, using keywords and a parser as the only knowledge supplied. We use a combination of two classifiers: a noun-phrase classifier and a noun-phrase context classifier based on extraction patterns. The seeding gives us a small number of relatively reliable training examples (noun phrases), which we then use to bootstrap the two classifiers. During the bootstrapping process the output of one classifier becomes the input for the other. The noun phrases help us identify good extraction patterns, and the extraction patterns help us identify new noun phrases. This combined bootstrapping is nested inside a higher level of bootstrapping, called meta-bootstrapping, which identifies the most reliable noun phrases generated by the inner bootstrapping loop.

We perform experimental evaluation of the information extraction bootstrapping algorithm by generating dictionaries for locations from corporate web pages used in the WebKB project [Craven et al., 1998]. Meta-bootstrapping identifies 191 location phrases in the web pages. After 50 iterations, 76% of the hypothesized location phrases on the web pages were true locations. This algorithm has been described in isolation in [Riloff and Jones, 1999]; here we place it into the larger context of bootstrapping for text learning.

The second case study of bootstrapping is document classification using a naive Bayes classifier, where knowledge about the classes of interest is provided in the form of a few keywords per class and a class hierarchy. We present extensions to naive Bayes that succeed in this task without requiring sets of labeled training data. The extensions reduce the need for human effort by (1) using keyword matching to automatically assign approximate labels, (2) using a statistical technique called shrinkage that finds more robust parameter estimates by taking advantage of the hierarchy, and (3) increasing accuracy further by iterating Expectation-Maximization to probabilistically reassign approximate labels and incorporate unlabeled data.

Experimental evaluation of the document classification bootstrapping approach is performed on a data set of computer science research papers. A 70-leaf hierarchy of computer science and a few keywords for each class are provided as input. Thirty thousand unlabeled research papers are also provided. Keyword matching alone provides 45% accuracy. Our bootstrapping algorithm uses this as input and outputs a naive Bayes text classifier that achieves 66% accuracy. Interestingly, this accuracy approaches estimated human agreement levels of 72%.

This paper proceeds as follows. Section 2 discusses the general framework of the bootstrapping process and outlines the design space of bootstrapping algorithms. Section 3 presents one instantiation of bootstrapping for information extraction. A second example, for text classification, is given in Section 4. Related work is presented in Section 5, and discussion follows in Section 6.

2 Overview of Bootstrapping Algorithm

Bootstrapping is a general framework for improving a learner using unlabeled data. Typically, bootstrapping is an iterative process where labels for the unlabeled data are estimated at each round in the process, and the labels are then incorporated as training data into the learner. The bootstrapping described in this paper consists of the following steps:

• Initialization: A small amount of hand-chosen seed information (in this paper, keywords) for each class is applied to the entire set of unlabeled examples, resulting in labels for those examples that match the keywords.

• Iterate Bootstrapping:

– Using Labels to Improve the Model: The information from the labeled examples is used to generate a new, more reliable, model.

– Relabeling: The model is used to generate new labels for the unlabeled data.
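To make the framework concrete, the following is a minimal Python sketch of this loop. The Learner interface (train/predict), the confidence threshold, and the substring keyword matcher are illustrative assumptions, not the paper's implementation; the two case studies below instantiate these pieces in task-specific ways.

```python
# Minimal sketch of the generic bootstrapping loop above. The Learner
# object, confidence threshold, and keyword matcher are hypothetical.

def keyword_init(examples, keywords_by_class):
    """Initialization: label every example that matches a seed keyword."""
    labels = {}
    for i, text in enumerate(examples):
        for cls, words in keywords_by_class.items():
            if any(w in text for w in words):
                labels[i] = cls
                break
    return labels

def bootstrap(examples, keywords_by_class, learner, rounds=10, threshold=0.9):
    labels = keyword_init(examples, keywords_by_class)
    for _ in range(rounds):
        # Using labels to improve the model.
        learner.train([(examples[i], c) for i, c in labels.items()])
        # Relabeling: hypothesize labels for all examples, keeping only
        # confident ones (one point on the confidence continuum of Sec. 2.2).
        for i, text in enumerate(examples):
            cls, confidence = learner.predict(text)
            if confidence >= threshold:
                labels[i] = cls
    return learner, labels
```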

2.1 Initialization

In both instantiations of bootstrapping in this paper, seed information is provided in the form of keywords. The text-learning tasks we address involve the content and topics of documents and parts of documents, which previous research on feature selection has shown to be at least partially discernible from a few individual words within the text [Lewis and Ringuette, 1994; Cohen, 1996]. We do not require the keywords to label all examples, nor do we require every label they provide to be correct. Instead, we expect these labels to give us areas of greater-than-random confidence from which to begin a clustering process.

In general, keywords tend to be high-precision and low-recall. For example, Cohen [1996] showed that his system RIPPER, using only five words in the feature set, can grow rules which classify titles of AP news articles with 86% precision, at a recall level of 18%. This property is often an important feature of the seed information for bootstrapping, because it allows bootstrapping to iteratively trade off some precision for significantly higher recall.

The individual domain and learning task will determine the number of words in an example and their distribution, thus affecting the precision and recall of the initial labeling. For this reason, our methods need to be robust to variations in the reliability of the initial set. The algorithm described in Section 3 assumes a relatively reliable initial set of examples. Given the domain and task (finding place-names as substrings of documents while avoiding false hits), this is a fair assumption. The task described in Section 4 involves distinguishing between documents in related fields of computer science. Here we might expect greater ambiguity in the keyword assignment, and the algorithm we use there is more robust to such ambiguity.

2.2 Bootstrapping

We can regard bootstrapping as iterative clustering. Given the initial class-based clusters of labeled examples provided as input, we re-cluster, and output a new set of labeled examples.

Bootstrapping can relabel examples in different ways. In Section 3 we describe an approach which relabels some examples with high confidence, effectively adding examples to the training pool iteratively. This algorithm relies on restarting the process at periodic intervals, to resume with the most confident predictions. In Section 4, we describe an algorithm that uses Expectation-Maximization to relabel all examples probabilistically, expressing weaker confidence in those labels.

These are two points in a continuum we can trace between high-confidence labeling and low-confidence labeling, and between using few or all labeled examples to retrain the models. Note that more computation is required when all examples are used to retrain models, and more confidently labeled examples are required when we use fewer.

3 Bootstrapping Dictionaries for Information Extraction

The wealth of on-line text has produced widespread interest in the problem of information extraction. Information extraction (IE) systems try to identify and extract specific types of information from natural language text. Most IE systems focus on information that is relevant to a particular domain or topic. For example, IE systems have been built to extract the names of perpetrators and victims of terrorist incidents, and the names of people and companies involved in corporate acquisitions. IE systems have also been developed to extract information about joint venture activities [MUC-5 Proceedings, 1993], microelectronics [MUC-5 Proceedings, 1993], job postings [Califf, 1998], rental ads [Soderland, 1999], and seminar announcements [Freitag, 1998].

Most information extraction systems rely on two dictionaries: a dictionary of extraction patterns and a semantic lexicon.[1] In the past few years, several techniques have been developed to automate the construction of these dictionaries. However, most of these methods rely on special training data. For example, AutoSlog [Riloff, 1993; 1996a], CRYSTAL [Soderland et al., 1995], RAPIER [Califf, 1998], SRV [Freitag, 1998], and WHISK [Soderland, 1999] need a training corpus that includes annotations for the desired extractions, and PALKA [Kim and Moldovan, 1993] and LIEP [Huffman, 1996] require manually defined keywords, frames, or object recognizers. AutoSlog-TS [Riloff, 1996b] has simpler needs but still requires a corpus of texts that have been labeled as relevant and irrelevant to the domain.

Using the bootstrapping techniques described in Section 2, we have developed an algorithm that can learn dictionaries for information extraction without any special training resources. Our bootstrapping technique generates extraction patterns and a semantic lexicon simultaneously, using only texts that are representative of the domain and a small set of seed words. Our approach is based on two observations.

1. Objects that belong to a semantic category can be used to identify extraction patterns for that category. For example, suppose we know that the words "schnauzer", "terrier", and "dalmatian" all refer to dogs. We may then discover that the pattern "<X> barked" extracts many instances of these words and infer that it is a useful pattern for extracting references to dogs.

2. Extraction patterns for a semantic category can be used to identify new members of that category. For example, suppose we know that "<X> barked" is a good pattern for extracting dogs. Then we can infer that every noun phrase that it extracts is a reference to a dog. (This inference will not always be correct, but it should be correct more often than not.)

[1] For our purposes, a semantic lexicon simply refers to a dictionary of words or phrases with semantic category labels.

Our bootstrapping algorithm begins with a small set of seed words that belong to a semantic category of interest. These seed words are used to learn extraction patterns that reliably extract members of the same semantic class. The learned extraction patterns are then used to generate new category members, and the process repeats. We call this process mutual bootstrapping because it generates both a semantic lexicon and a dictionary of extraction patterns at the same time. We have also found it useful to introduce a second level of bootstrapping that retains only the most reliable lexicon entries produced by the mutual bootstrapping process and then re-starts it from scratch. This two-tiered bootstrapping process is more robust than a single level of bootstrapping and produces high-quality dictionaries.

3.1 Mutual Bootstrapping

The mutual bootstrapping process begins with a text corpus and a small set of predefined seed words for the semantic category of interest. First, a set of candidate extraction patterns is generated by running AutoSlog [Riloff, 1993; 1996a] exhaustively over the text corpus. Given a noun phrase (NP) to extract, AutoSlog uses heuristics to generate a linguistic expression that represents relevant context for extracting the NP. This linguistic expression should be general enough to extract other relevant noun phrases as well. Applied exhaustively, AutoSlog generates a pattern to extract every noun phrase in the corpus. Once all candidate extraction patterns have been generated, we apply them to the corpus and record the noun phrases (NPs) that they extract.

We now have all possible extraction patterns for the corpus[2] and all noun phrases that they extract, so one can view the next step as the process of labeling this data. That is, we want to determine which noun phrases belong to the semantic category of interest, and which extraction patterns will reliably extract members of the category. When we say that we "learn" an extraction pattern or a semantic lexicon entry, we really mean that we are labeling the extraction pattern or noun phrase as belonging to the semantic category.

[2] That is, all extraction patterns that AutoSlog can generate.


Generate all candidate extraction patterns from the training corpus using AutoSlog.
Apply the candidate extraction patterns to the training corpus and save the patterns with their extractions to EPdata.

SemLex = {seed words}
Cat_EPlist = {}

MUTUAL BOOTSTRAPPING LOOP
  1. Score all extraction patterns in EPdata.
  2. best_EP = the highest scoring extraction pattern not already in Cat_EPlist
  3. Add best_EP to Cat_EPlist
  4. Add best_EP's extractions to SemLex.
  5. Go to step 1.

Figure 1: Mutual Bootstrapping Algorithm

Figure 1 outlines the mutual bootstrapping algorithm, which iteratively learns extraction patterns from the seed words and then exploits the learned extraction patterns to identify more words that belong to the semantic category. At each iteration, the algorithm saves the best extraction pattern for the category to a list (Cat_EPlist). All of its extractions are inferred to be category members and added to the semantic lexicon (SemLex). Then the next best extraction pattern is identified, based on both the original seed words and the new words that were just added to the lexicon, and the process repeats.
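The following is a minimal Python sketch of the loop in Figure 1. The ep_data mapping (from each candidate pattern to the set of noun phrases it extracts), the fixed iteration count, and the score_pattern function (the RlogF metric given below) are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the mutual bootstrapping loop in Figure 1.
# ep_data: dict mapping each candidate extraction pattern to the set of
# noun phrases it extracts (EPdata). score_pattern is defined below.

def mutual_bootstrap(ep_data, seed_words, iterations=10):
    sem_lex = set(seed_words)      # SemLex
    cat_ep_list = []               # Cat_EPlist
    for _ in range(iterations):
        # 1. Rescore all patterns against the current (growing) lexicon.
        scored = {ep: score_pattern(nps, sem_lex)
                  for ep, nps in ep_data.items() if ep not in cat_ep_list}
        if not scored:
            break
        # 2.-3. Keep the highest-scoring pattern not already kept.
        best_ep = max(scored, key=scored.get)
        cat_ep_list.append(best_ep)
        # 4. Infer that all of its extractions are category members.
        sem_lex.update(ep_data[best_ep])
    return cat_ep_list, sem_lex
```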

Since the semantic lexicon is constantly growing, the extraction patterns need to be rescored after each iteration. The scoring measure counts how many unique lexicon entries a pattern extracts. This measure rewards generality: a pattern that extracts several different category members will score higher than a pattern that extracts only one or two category members, no matter how often. To score extractions, our scoring metric uses head phrase matching, which means that X matches Y if X is the rightmost substring of Y. For example, "New York" will match against "downtown New York" and "the financially stable New York". We stripped each NP of articles, common adjectives, and numbers before matching it with other NPs and saving it to the lexicon. We also used a small stopword list and a number recognizer to discard general terms such as pronouns and numbers.
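A small sketch of head phrase matching as just described; the whitespace tokenization is an assumption for illustration.

```python
# Sketch of head phrase matching: X matches Y when X is the rightmost
# (word-level) substring of Y.

def head_phrase_match(x, y):
    xw, yw = x.lower().split(), y.lower().split()
    return 0 < len(xw) <= len(yw) and yw[-len(xw):] == xw

head_phrase_match("New York", "downtown New York")                # True
head_phrase_match("New York", "the financially stable New York")  # True
head_phrase_match("New York", "New York City")                    # False
```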

We scored each extraction pattern with the RlogF metric previously used by AutoSlog-TS [Riloff, 1996b]. The score is computed as:

\[ \mathrm{score}(\mathit{pattern}_i) = R_i \cdot \log_2(F_i) \]

where F_i is the number of unique lexicon entries extracted by pattern_i, N_i is the total number of unique NPs extracted by pattern_i, and R_i = F_i / N_i. This metric was designed for information extraction tasks, where it is important to identify not only the most reliable extraction patterns but also patterns that will frequently extract relevant information (even if irrelevant information also will be extracted). For example, the pattern "kidnapped in <x>" will extract locations but it will also extract dates (e.g., "kidnapped in January"). The fact that it frequently extracts locations makes it essential to have in the dictionary, even if dates will also be extracted. The RlogF metric tries to strike a balance between reliability and frequency: R is high when the pattern's extractions are highly correlated with the semantic category, and F is high when the pattern extracts a large number of category members.
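As a sketch, the RlogF score could be computed as follows, reusing head_phrase_match from above to count which extractions are already lexicon entries; the data representation is an assumption.

```python
import math

# Sketch of the RlogF score. F_i counts the unique extractions that are
# already lexicon entries (via head phrase matching); N_i counts all
# unique extractions; R_i = F_i / N_i.

def score_pattern(extracted_nps, sem_lex):
    unique_nps = set(extracted_nps)
    f = sum(1 for np in unique_nps
            if any(head_phrase_match(entry, np) for entry in sem_lex))
    if f == 0:
        return 0.0                   # pattern extracts no category members
    return (f / len(unique_nps)) * math.log2(f)   # R_i * log2(F_i)
```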

The mutual bootstrapping algorithm works quite well, producing good extraction patterns and lexicon entries. But the bootstrapping process can be led astray by extraction patterns that have an affinity for more than one type of semantic category. For example, suppose the pattern "shot in <x>" is one of the first extraction patterns learned for the location category. This pattern frequently extracts both locations and body parts (e.g., "shot in the back" or "shot in the arm"). All of its extractions are assumed to be locations, so many references to body parts will be incorrectly added to the location lexicon. In the next iteration, extraction patterns will be rewarded for extracting references to body parts, and the bootstrapping process can get derailed. To make the algorithm more robust, we introduce a second level of bootstrapping that retains only the most reliable lexicon entries learned by mutual bootstrapping and then restarts the process all over again.

Note that all dictionary entries and extraction patterns in the lexicons are considered to be positive examples, and all those outside the lexicons are considered to be negative examples. This means that high confidence is implicitly held in those positive examples (since at each iteration we may relabel negative examples as positive ones, but not vice versa). Since the bootstrapping alternates between two types of classifiers, the complementary nature of the information they use can dissipate the effects of noise in the examples labeled so far. The way we quantify our confidence in each labeled example, by its accuracy in the alternate classifier, reflects the high-accuracy, low-coverage bias of this approach to bootstrapping.

3.2 Multi-level Bootstrapping

On top of the mutual bootstrapping procedure, we introduce a second level of bootstrapping which we will call meta-bootstrapping. The outer bootstrapping mechanism (meta-bootstrapping) compiles the results from the inner bootstrapping process (mutual bootstrapping) and identifies the five most reliable lexicon entries. These five NPs are retained for the permanent semantic lexicon, and the rest of the mutual bootstrapping output is discarded. The entire mutual bootstrapping process is then restarted from scratch. Figure 2 shows the meta-bootstrapping process.

[Figure 2: The Meta-Bootstrapping Process. (Diagram: seed words initialize the temporary semantic lexicon; the mutual bootstrapping loop, over the candidate extraction patterns and their extractions, selects best_EP, adds it to the category EP list, and adds best_EP's extractions to the temporary lexicon; meta-bootstrapping then adds the 5 best NPs to the permanent semantic lexicon and re-initializes.)]

To determine which NPs are most "reliable", we score each NP based on the number of unique patterns that extracted it. Intuitively, a noun phrase extracted by three different extraction patterns is more likely to belong to the category than a noun phrase extracted by a single pattern. For tie-breaking purposes, we also add in a small factor that represents the strength of the patterns that extracted it. The scoring metric for noun phrases is shown below, where N_i is the number of unique patterns that extracted NP_i:

\[ \mathrm{score}(NP_i) = \sum_{k=1}^{N_i} \bigl( 1 + 0.01 \cdot \mathrm{score}(\mathit{pattern}_k) \bigr) \]
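A one-function sketch of this score; pattern_scores is assumed to hold the RlogF scores of the patterns that extracted the NP.

```python
# Sketch of the NP reliability score: the count of unique patterns that
# extract the NP, plus a small tie-breaking factor from those patterns'
# own RlogF scores.

def score_np(patterns_extracting_np, pattern_scores):
    return sum(1 + 0.01 * pattern_scores[p] for p in patterns_extracting_np)
```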

The main advantage of meta-bootstrapping comes from re-evaluating the extraction patterns after each mutual bootstrapping process. For example, after the first mutual bootstrapping run, five new words are added to the permanent semantic lexicon. Then mutual bootstrapping is restarted from scratch with the original seed words plus these five new words. Now, the best pattern selected by mutual bootstrapping might be different from the best pattern selected last time. This produces a snowball effect because its extractions are added to the temporary semantic lexicon, which is the basis for choosing the next extraction pattern. In practice, what happens is that the ordering of the patterns changes (sometimes dramatically) between subsequent runs of mutual bootstrapping. In particular, more general patterns seem to float to the top as the permanent semantic lexicon grows.

3.3 Evaluation

We evaluated our bootstrapping algorithm by generating dictionaries for locations using two text collections: corporate web pages used in the WebKB project [Craven et al., 1998] and terrorism news articles from the MUC-4 corpus [MUC-4 Proceedings, 1992]. For training, we used 4160 of the web pages and 1500 of the terrorism texts. We preprocessed the web pages first by removing HTML tags and adding periods to separate independent phrases.[3] AutoSlog generated 19,690 candidate extraction patterns from the web page training set, and 14,064 candidate extraction patterns from the terrorism training set.[4] The seed word lists that we used are shown in Figure 3. We used different location seeds for the two text collections because the terrorism articles were mainly from Latin America while the web pages were much more international. The seed words must be frequent in the training texts for the bootstrapping algorithm to work well.

Web Location:   australia canada china england france germany japan mexico switzerland united states

Terr. Location: bolivia city colombia district guatemala honduras neighborhood nicaragua region town

Figure 3: Seed Word Lists

We ran the meta-bootstrapping algorithm (outer bootstrapping) for 50 iterations. The extraction patterns produced by the last iteration were the output of the system, along with the permanent semantic lexicon. For each meta-bootstrapping iteration, we ran the mutual bootstrapping procedure (inner bootstrapping) until it produced 10 patterns that extracted at least one new NP (i.e., one not currently in the semantic lexicon), with two exceptions: (1) if the best pattern had a score < 0.7, mutual bootstrapping stopped, and (2) if the best pattern had a score > 1.8, mutual bootstrapping continued.

[3] Web pages pose a problem for NLP systems because separate lines do not always end with a period (e.g., list items and headers). We used several heuristics to insert periods whenever an independent line or phrase was suspected.

[4] AutoSlog actually generated many more extraction patterns, but for practical reasons we only used the patterns that appeared with frequency ≥ 2.

                 Iter 1    Iter 10      Iter 20       Iter 30        Iter 40        Iter 50
Web Location     5/5 (1)   46/50 (.92)  88/100 (.88)  129/150 (.86)  163/200 (.82)  191/250 (.76)
Terr. Location   5/5 (1)   32/50 (.64)  66/100 (.66)  100/150 (.67)  127/200 (.64)  158/250 (.63)

Table 1: Accuracy of the Semantic Lexicons

Web Location Patterns     Terrorism Location Patterns
offices in <x>            living in <x>
facilities in <x>         traveled to <x>
operations in <x>         become in <x>
loans in <x>              sought in <x>
operates in <x>           presidents of <x>
locations in <x>          parts of <x>
producer in <x>           to enter <x>
states of <x>             condemned in <x>
seminars in <x>           relations between <x>
activities in <x>         ministers of <x>
consulting in <x>         part in <x>
countries of <x>          taken in <x>
rep. of <x>               returned to <x>
outlets in <x>            process in <x>
consulting in <x>         involvement in <x>
customers in <x>          intervention in <x>
diensten in <x>           linked in <x>
distributors in <x>       operates in <x>
services in <x>           kidnapped in <x>
expanded into <x>         refuge in <x>
partners in <x>           democracy in <x>
located in <x>            prevailing in <x>
plant in <x>              billion in <x>
dienste in <x>            groups in <x>
packages in <x>           wave in <x>
shipyards in <x>          outskirts of <x>
support in <x>            prevails in <x>
dedicated in <x>          percent in <x>
testing in <x>            stay in <x>
manager in <x>            heard throughout <x>

Figure 4: Top 25 extraction patterns for locations

Figure 4 shows the top-ranked extraction patterns produced by meta-bootstrapping after 50 iterations. Most of these extraction patterns are clearly useful patterns for extracting noun phrases that represent locations. It is also interesting to note that the location patterns generated for the web pages are very different from the location patterns generated for the terrorism articles. The two text collections contain very different vocabulary, and the locations mentioned in the texts are very different too. The web pages are widely international and mention many countries and U.S. cities, while the terrorism articles are mostly from Latin America and focus on cities and towns therein. One of the strengths of a corpus-based algorithm is that it learns dictionary entries that are most important for the texts represented in the training corpus.

To evaluate the quality of the semantic lexicons, we manually inspected each word. We judged a word to belong to the category if it was a specific category member (e.g., "Japan" is a specific location) or a general referent for the category (e.g., "the area" is a referent for locations). Although referents are meaningless in isolation, they are useful for information extraction tasks because a coreference resolver should be able to find their antecedent. The referents were also very useful during bootstrapping because a pattern that extracts "the area" will probably also extract specific locations.

Table 1 shows the accuracy of the semantic lexicons after the 1st iteration of meta-bootstrapping and after each 10th iteration. Each cell shows the number of true category members among the entries generated thus far. For example, 50 phrases had been added to the web location lexicon after the tenth iteration, and 46 of those (92%) were true location phrases. Table 1 shows that meta-bootstrapping identified 191 location phrases in the web pages and 158 location phrases in the terrorism articles. The density of good phrases was also quite high. Even after 50 iterations, 76% of the hypothesized location phrases on the web pages were true locations, and 63% of the hypothesized location phrases in the terrorism articles were true locations.

In summary, multi-level bootstrapping can produce high-quality extraction patterns and semantic lexicon entries using only raw texts and a small set of seed words as input. Our bootstrapping approach has two advantages over previous techniques for learning information extraction dictionaries: both a semantic lexicon and a dictionary of extraction patterns are acquired simultaneously, and no special training resources are needed.

4 Bootstrapping a Text Classifier

In the previous section, we showed how an application of the bootstrapping framework successfully created an information extractor. Now, we turn to the problem of text classification. A variety of text classification algorithms, such as naive Bayes [Lewis, 1998], support vector machines [Joachims, 1998], k-nearest neighbor [Yang, 1999], rule-learning algorithms [Cohen and Singer, 1996], and Boosting [Schapire and Singer, 1999] have proven adept at text classification when provided with large amounts of labeled training data. In recent work [Nigam et al., 1999], we have shown that text classification error can be reduced by up to 30% when a small amount of labeled documents is augmented with a large collection of unlabeled documents. Here, we extend this approach and replace the task information given by the labeled documents with seed information in another form.

[Figure 5: A subset of Cora's topic hierarchy. Each node contains its title and the five most probable words, as calculated by naive Bayes and shrinkage with vertical word redistribution [Hofmann & Puzicha, 1998]. Words among the initial keywords for that class are indicated in plain font; others are in italics. Node titles visible in this subset include Computer Science, NLP, Compiler Design, Garbage Collection, Semantics, Software Engineering, Programming, OS, Artificial Intelligence, Hardware & Architecture, HCI, Multimedia, Information Retrieval, Interface Design, Planning, Knowledge Representation, and Machine Learning.]

In this study, knowledge about the classes of interest is provided in the form of a few keywords per class and a class hierarchy. Keywords are typically generated more quickly and easily than even a small number of labeled documents. Many classification problems naturally come with hierarchically-organized classes. Our algorithm proceeds by using the keywords to generate preliminary labels for some documents by term-matching. Then these labels, the hierarchy, and all the unlabeled documents become the input to a bootstrapping algorithm that generates a naive Bayes classifier.

The bootstrapping algorithm used here combines hierarchical shrinkage and Expectation-Maximization (EM) with unlabeled data. EM is an iterative algorithm for maximum likelihood estimation in parametric estimation problems with missing data. In our scenario, the class labels of the documents are treated as missing data. Here, EM works by first training a classifier with only the documents preliminarily labeled by the keywords, and then using the classifier to re-assign probabilistically-weighted class labels to all the documents by calculating the expectation of the missing class labels. It then trains a new classifier using all the documents, and iterates.

We further improve the classifier by also incorporating shrinkage, a statistical technique for improving parameter estimation in the face of sparse data. When classes are provided in a hierarchical relationship, shrinkage is used to estimate new parameters by using a weighted average of the specific (but unreliable) local class estimates and the more general (but also more reliable) ancestors of the class in the hierarchy. The optimal weights in the average are calculated by an EM process that runs simultaneously with the EM that is re-estimating the class labels.

Experimental results show that the bootstrapping algorithm described here uses the unlabeled data, keywords, and class hierarchy to output a classifier that is close to human agreement levels. The experimental domain in this paper is topic identification of computer science research papers. This domain originates as part of the Ra research project, an effort to build domain-specific search engines on the Web with machine learning techniques. Our demonstration system, Cora, is a search engine over computer science research papers [McCallum et al., 1999]. The bootstrapping classification algorithm described in this paper is used in Cora to place research papers into a Yahoo-like hierarchy of the field of computer science. The search engine, with the hierarchy, is publicly available at www.cora.justresearch.com.

4.1 Generating Preliminary Labels with Keywords

The first step in the bootstrapping process is to use the keywords to generate preliminary labels for as many of the unlabeled documents as possible. Each class is given just a few keywords. Figure 5 shows examples of the number and type of keywords given in our experimental domain; the human-provided keywords are shown in the nodes in non-italic font.

In this paper, we generate preliminary labels from the keywords by term-matching in a rule-list fashion: for each document, we step through the keywords and place the document in the category of the first keyword that matches. If an extensive keyword list were carefully chosen, this method could be reasonably accurate. However, finding enough keywords to obtain broad coverage and finding sufficiently specific keywords to obtain high accuracy is very difficult; it requires intimate knowledge of the data and a lot of trial and error.
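A minimal sketch of this rule-list labeling follows; the ordered (class, keyword) list below is a hypothetical stand-in for the hand-chosen keywords of Figure 5.

```python
# Sketch of rule-list labeling by term-matching: a document takes the
# class of the first keyword that matches, and otherwise stays unlabeled.
# The RULE_LIST entries here are illustrative, not the paper's keywords.

RULE_LIST = [
    ("NLP", "language"),
    ("Machine Learning", "reinforcement"),
    ("Information Retrieval", "retrieval"),
]

def preliminary_label(document):
    words = set(document.lower().split())
    for cls, keyword in RULE_LIST:
        if keyword in words:
            return cls
    return None  # no keyword matched; the document remains unlabeled
```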

As a result, keyword matching is both an inaccurate and incomplete classification method. Keywords tend to be high-precision and low-recall; this brittleness will leave many documents unlabeled. Some documents will incorrectly match keywords from the wrong class. In general we expect the brittleness of the keywords to be the dominating factor in overall error. In our experimental domain, for example, 59% of the unlabeled documents are left unlabeled by keyword matching.

Another method of priming bootstrapping with keywords would be to take each set of keywords as a labeled mini-document containing just a few words. This could then be used as input to any standard learning algorithm. Testing this, and other keyword labeling approaches, is an area of ongoing research.

4.2 The Bootstrapping Algorithm

The goal of the bootstrapping step is to generate a naive Bayes classifier from the inputs: the (inaccurate and incomplete) preliminary labels, the unlabeled data, and the class hierarchy. One straightforward method would be to simply take the unlabeled documents with preliminary labels and treat them as labeled data in a standard supervised setting. This approach provides only minimal benefit, for three reasons: (1) the labels are rather noisy, (2) the sample of preliminarily-labeled documents is skewed from the regular document distribution (i.e., it includes only documents with known keywords), and (3) data is sparse in comparison to the size of the feature space. Adding the remaining unlabeled data and running EM helps counter the first and second of these problems. Adding hierarchical shrinkage to naive Bayes helps counter the first and third. We begin a detailed description of our bootstrapping algorithm with a short overview of standard naive Bayes text classification, then proceed to add EM to incorporate the unlabeled data, and conclude by explaining hierarchical shrinkage. An outline of the entire algorithm is presented in Table 2.

• Inputs: A collection D of unlabeled documents, a class hierarchy, and a few keywords for each class.

• Generate preliminary labels for as many of the unlabeled documents as possible by term-matching with the keywords in a rule-list fashion.

• Initialize all the λ_j's to be uniform along each path from a leaf class to the root of the class hierarchy.

• Iterate the EM algorithm:
  – (M-step) Build the maximum likelihood multinomial at each node in the hierarchy given the class probability estimates for each document (Equations 1 and 2). Normalize all the λ_j's along each path from a leaf class to the root of the class hierarchy so that they sum to 1.
  – (E-step) Calculate the expectation of the class labels using the classifier created in the M-step (Equation 3). Increment the new λ_j's by attributing each word probabilistically to the ancestors of each class.

• Output: A naive Bayes classifier that takes an unlabeled document and predicts a class label.

Table 2: An outline of the bootstrapping algorithm described in Sections 4.1 and 4.2.

The naive Bayes framework

We use the framework of multinomial naive Bayes text classification [Lewis, 1998; McCallum and Nigam, 1998]. It is useful to think of naive Bayes as estimating the parameters of a probabilistic generative model for text documents. In this model, first the class of the document is selected. The words of the document are then generated based on the parameters of the class-specific multinomial (i.e., unigram model). Thus, the classifier parameterizes the class prior probabilities and the class-conditioned word probabilities. Each class, c_j, has a document frequency relative to all other classes, written P(c_j). For every word w_t in the vocabulary V, P(w_t|c_j) indicates the frequency that the classifier expects word w_t to occur in documents in class c_j.

In the standard supervised setting, learning the parameters is accomplished using a set of labeled training documents, D. To estimate the word probability parameters, P(w_t|c_j), we count the frequency with which word w_t occurs among all word occurrences for documents in class c_j. We supplement this with Laplace smoothing that primes each estimate with a count of one to avoid probabilities of zero. Define N(w_t, d_i) to be the count of the number of times word w_t occurs in document d_i, and define P(c_j|d_i) ∈ {0, 1}, as given by the document's class label. Then, the estimate of the probability of word w_t in class c_j is:

\[ \mathrm{P}(w_t \mid c_j) = \frac{1 + \sum_{d_i \in D} N(w_t, d_i)\,\mathrm{P}(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{d_i \in D} N(w_s, d_i)\,\mathrm{P}(c_j \mid d_i)} \qquad (1) \]

The class prior probability parameters are set in the same way, where |C| indicates the number of classes:

\[ \mathrm{P}(c_j) = \frac{1 + \sum_{d_i \in D} \mathrm{P}(c_j \mid d_i)}{|C| + |D|} \qquad (2) \]

Given an unlabeled document and a classifier, we determine the probability that the document belongs in class c_j using Bayes' rule and the naive Bayes assumption that the words in a document occur independently of each other given the class. If we denote w_{d_i,k} to be the kth word in document d_i, then classification becomes:

\[ \mathrm{P}(c_j \mid d_i) \propto \mathrm{P}(c_j)\,\mathrm{P}(d_i \mid c_j) \propto \mathrm{P}(c_j) \prod_{k=1}^{|d_i|} \mathrm{P}(w_{d_i,k} \mid c_j) \qquad (3) \]

Empirically, when given a large number of training documents, naive Bayes does a good job of classifying text documents [Lewis, 1998]. More complete presentations of naive Bayes for text classification are provided by Mitchell [1997] and McCallum and Nigam [1998].
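The following sketch implements Equations 1-3 directly for hard labels (P(c_j|d_i) ∈ {0, 1}); representing each document as a word-count dictionary is an assumption for illustration.

```python
import math

# Sketch of multinomial naive Bayes with Laplace smoothing (Eqs. 1-3),
# for hard class labels. docs: list of (word_counts, label) pairs.

def train_nb(docs, vocab, classes):
    prior, cond = {}, {}
    for c in classes:
        in_c = [counts for counts, label in docs if label == c]
        prior[c] = (1 + len(in_c)) / (len(classes) + len(docs))       # Eq. 2
        total = sum(sum(counts.values()) for counts in in_c)
        cond[c] = {w: (1 + sum(counts.get(w, 0) for counts in in_c))
                      / (len(vocab) + total)
                   for w in vocab}                                    # Eq. 1
    return prior, cond

def classify(word_counts, prior, cond):
    """Log-space version of Eq. 3: score(c) = log P(c) + sum log P(w|c)."""
    scores = {c: math.log(prior[c])
                 + sum(n * math.log(cond[c][w])
                       for w, n in word_counts.items() if w in cond[c])
              for c in prior}
    return max(scores, key=scores.get)
```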


Adding unlabeled data with EM

In the standard supervised setting, each document comes with a label. In our bootstrapping scenario, the preliminary keyword labels are both incomplete and inaccurate: the keyword matching leaves many documents unlabeled, and labels some incorrectly. In order to use the entire data set in a naive Bayes classifier, we use the Expectation-Maximization (EM) algorithm to generate probabilistically-weighted class labels for all the documents. This results in classifier parameters that are more likely given all the data.

EM is a class of iterative algorithms for maximum likelihood estimation in problems with incomplete data [Dempster et al., 1977]. Given a model of data generation, and data with some missing values, EM iteratively uses the current model to estimate the missing values, and then uses the missing value estimates to improve the model. Using all the available data, EM will locally maximize the likelihood of the parameters and give estimates for the missing values. In our scenario, the class labels of the unlabeled data are treated as the missing values.

In implementation, EM is an iterative two-step process. Initially, the parameter estimates are set in the standard naive Bayes way from just the preliminarily labeled documents. Then we iterate the E- and M-steps. The E-step calculates probabilistically-weighted class labels, P(c_j|d_i), for every document using the classifier and Equation 3. The M-step estimates new classifier parameters using all the documents, by Equations 1 and 2, where P(c_j|d_i) is now continuous, as given by the E-step. We iterate the E- and M-steps until the classifier converges. The initialization step identifies each mixture component with a class and seeds EM so that the local maxima that it finds correspond well to class definitions.
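A sketch of this two-step loop follows; train_nb_soft (a variant of train_nb above that accepts fractional P(c|d)) and predict_proba (Equation 3 followed by normalization) are hypothetical helpers, not code from the paper.

```python
# Sketch of the EM loop over naive Bayes. train_nb_soft and predict_proba
# are assumed helpers: a soft-label trainer (Eqs. 1-2 with continuous
# P(c|d)) and a normalized classifier (Eq. 3), respectively.

def em_bootstrap(prelabeled, unlabeled, vocab, classes, iters=10):
    # Initialization: standard naive Bayes on the keyword-labeled docs only.
    model = train_nb(prelabeled, vocab, classes)
    all_docs = [counts for counts, _ in prelabeled] + unlabeled
    for _ in range(iters):
        # E-step: probabilistically-weighted class labels for every document.
        soft_labels = [(counts, predict_proba(counts, model))
                       for counts in all_docs]
        # M-step: re-estimate parameters from all documents (Eqs. 1 and 2,
        # with P(c|d) now continuous).
        model = train_nb_soft(soft_labels, vocab, classes)
    return model
```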

In previous work [Nigam et al., 1999], we have shown that this technique significantly increases text classification accuracy when given limited amounts of labeled data and large amounts of unlabeled data. The expectation here is that EM will both correct and complete the labels for the entire data set.

Improving sparse data estimates with shrinkage

Even when provided with a large pool of documents, naive Bayes parameter estimation during bootstrapping will suffer from sparse data because naive Bayes has so many (|V||C| + |C|) parameters to estimate. Using the provided class hierarchy, we can integrate the statistical technique shrinkage into the EM algorithm to help alleviate the sparse data problem.

Consider trying to estimate the probability of the word "intelligence" in the class NLP. This word should clearly have non-negligible probability there; however, with limited training data we may be unlucky, and the observed frequency of "intelligence" in NLP may be very far from its true expected value. One level up the hierarchy, however, the Artificial Intelligence class contains many more documents (the union of all the children). There, the probability of the word "intelligence" can be more reliably estimated.

Shrinkage calculates new word probability estimates for each leaf class by a weighted average of the estimates on the path from the leaf to the root. The technique balances a trade-off between specificity and reliability. Estimates in the leaf are most specific but unreliable; further up the hierarchy estimates are more reliable but unspecific. We can calculate mixture weights for the averaging that are guaranteed to maximize the likelihood of the data with the EM algorithm.

More formally, let {P^1(w_t|c_j), ..., P^k(w_t|c_j)} be word probability estimates, where P^1(w_t|c_j) is the estimate using training data just in the leaf, P^{k-1}(w_t|c_j) is the estimate at the root using all the training data, and P^k(w_t|c_j) is the uniform estimate (P^k(w_t|c_j) = 1/|V|). The interpolation weights among c_j's "ancestors" (which we define to include c_j itself) are written {λ_j^1, λ_j^2, ..., λ_j^k}, where Σ_{i=1}^{k} λ_j^i = 1. The new word probability estimate based on shrinkage, denoted P̌(w_t|c_j), is then:

\[ \check{\mathrm{P}}(w_t \mid c_j) = \lambda_j^1\, \mathrm{P}^1(w_t \mid c_j) + \cdots + \lambda_j^k\, \mathrm{P}^k(w_t \mid c_j) \qquad (4) \]

The λ_j vectors are calculated during EM. In the E-step of bootstrapping, for every word of training data in class c_j, we determine the expectation that each ancestor was responsible for generating it. In the M-step, we normalize the sum of these expectations to obtain new mixture weights λ_j.
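As a sketch, the shrinkage-based estimate of Equation 4 can be computed as below; the path_estimates and lambdas inputs are assumed to come from the EM procedure just described.

```python
# Sketch of the shrinkage estimate (Eq. 4). path_estimates[i] holds the
# word distribution P^{i+1}(.|c_j) at the i-th node on the leaf-to-root
# path; lambdas are this leaf's k mixture weights, summing to 1 (the last
# weight belongs to the uniform estimate P^k). Both inputs are assumed.

def shrunk_estimate(word, path_estimates, lambdas, vocab_size):
    estimates = [p.get(word, 0.0) for p in path_estimates]  # leaf ... root
    estimates.append(1.0 / vocab_size)                      # uniform P^k
    return sum(lam * p for lam, p in zip(lambdas, estimates))
```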

One can think of hierarchical shrinkage as defining a different generative model than the one in Section 4.2. As before, a class is selected first. Then, for each word position in the document, an ancestor of the class (including itself) is selected according to the averaging weights. Then, the word itself is chosen based on the parameters of that ancestor. For every possible generative model of this type, there is a corresponding model of the simple naive Bayes type that yields the same document distribution. It is created by "flattening" the hierarchy according to the weights. A more complete description of hierarchical shrinkage for text classification is presented by McCallum et al. [1998].

4.3 Experimental Results

In this section, we provide empirical evidence that bootstrapping a text classifier from unlabeled data can produce a high-accuracy text classifier. As a test domain, we use computer science research papers. We have created a 70-leaf hierarchy of computer science topics, part of which is shown in Figure 5. Creating the hierarchy took about 60 minutes, during which we examined conference proceedings and explored computer science sites on the Web. Selecting a few keywords associated with each node took about 90 minutes. A test set was created by expert hand-labeling of a random sample of 625 research papers from the 30,682 papers in the Cora archive at the time we began these experiments. Of these, 225 (about one-third) did not fit into any category, and were discarded, resulting in a 400-document test set. Labeling these 400 documents took about six hours. Some of these papers were outside the area of computer science (e.g., astrophysics papers), but most were papers that would be considered computer science papers given a more complete hierarchy. The class frequencies of the data are not too skewed; on the test set, the most populous class accounted for only 7% of the documents.

Method    # Lab   # P-Lab   # Unlab   Acc
Keyword   —       —         —         45%
NB        100     —         —         30%
NB        399     —         —         47%
NB+EM+S   —       12,657    18,025    66%
NB        —       12,657    —         47%
NB+S      —       12,657    —         63%
Human     —       —         —         72%

Table 3: Classification results with different techniques: keyword matching, human agreement, naive Bayes (NB), and naive Bayes combined with hierarchical shrinkage (S) and EM. The classification accuracy (Acc) and the number of labeled (Lab), keyword-matched preliminarily-labeled (P-Lab), and unlabeled (Unlab) documents used by each method are shown.

Each research paper is represented as the words of the title, author, institution, references, and abstract. A detailed description of how these segments are automatically extracted is provided elsewhere [McCallum et al., 1999; Seymore et al., 1999]. Words occurring in fewer than five documents and words on a standard stoplist were discarded. No stemming was used. Bootstrapping was performed using the algorithm outlined in Table 2.

Table 3 shows classification results with the different classification techniques used. The rule-list classifier based on the keywords alone provides 45% accuracy. (The 43% of documents in the test set containing no keywords cannot be assigned a class by the rule-list classifier, and are counted as incorrect.) As an interesting time comparison, about 100 documents could have been labeled in the time it took to generate the keyword lists. Naive Bayes accuracy with 100 labeled documents is only 30%. With 399 labeled documents (using our test set in a leave-one-out fashion), naive Bayes reaches 47%. When running the bootstrapping algorithm, 12,657 documents are given preliminary labels by keyword matching. EM and shrinkage incorporate the remaining 18,025 documents, "fix" the preliminary labels, and leverage the hierarchy; the resulting accuracy is 66%. As an interesting comparison, agreement on the test set between two human experts was 72%.

A few further experiments reveal some of the inner workings of bootstrapping. If we build a naive Bayes classifier in the standard supervised way from the 12,657 preliminarily labeled documents, the classifier gets 47% accuracy. This corresponds to the performance of the first iteration of bootstrapping. Note that this matches the accuracy of traditional naive Bayes with 399 labeled training documents, but requires less than a quarter of the human labeling effort. If we run bootstrapping without the 18,025 documents left unlabeled by keyword matching, accuracy reaches 63%. This indicates that shrinkage and EM on the preliminarily labeled documents provide substantially more benefit than the remaining unlabeled documents.

One explanation for the small impact of the 18,025 documents left unlabeled by keyword matching is that many of these do not fall naturally into the hierarchy. Remember that about one-third of the 30,000 documents fall outside the hierarchy. Most of these will not be given preliminary labels by keyword matching. The presence of these outlier documents skews EM parameter estimation. A more inclusive computer science hierarchy would allow the unlabeled documents to benefit classification more.

However, even without a complete hierarchy, we could use these documents if we could identify the outliers among them. Some techniques for robust estimation with EM are discussed by McLachlan and Basford [1988]. One specific technique for these text hierarchies is to add extra leaf nodes containing uniform word distributions to each interior node of the hierarchy, in order to capture documents not belonging in any of the predefined topic leaves. This should allow EM to perform well even when a large percentage of the documents do not fall into the given classification hierarchy. A similar approach is also planned for research in topic detection and tracking (TDT) [Baker et al., 1999]. Experimentation with these techniques is an area of ongoing research.

5 Related Work

Other research efforts in text learning have also used bootstrapping approaches. Brin [1998] uses a bootstrapping approach on the World Wide Web to extract book title and author pairs. From a small set of seed books, his DIPRE algorithm searches the Web for known pairs and learns new patterns that are common for these pairs. Then, these new patterns are used to identify new books, in an iterative fashion.
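
A heavily simplified skeleton of that iteration might look as follows; `find_occurrences`, `induce_patterns`, and `apply_patterns` are hypothetical stand-ins for DIPRE's actual components, which operate over crawled web pages:

```python
def dipre_bootstrap(seed_pairs, find_occurrences, induce_patterns,
                    apply_patterns, n_rounds=5):
    """Sketch of DIPRE-style bootstrapping: locate known (title, author)
    pairs on the Web, induce surface patterns from their contexts, and
    apply the patterns to harvest new pairs for the next round."""
    known_pairs = set(seed_pairs)
    patterns = set()
    for _ in range(n_rounds):
        occurrences = find_occurrences(known_pairs)  # contexts of known pairs
        patterns |= induce_patterns(occurrences)     # shared surface patterns
        known_pairs |= apply_patterns(patterns)      # new pairs from patterns
    return known_pairs
```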

The preliminary labeling by keywords used in this paper is similar to the seed collocations used by Yarowsky [1995]. There, in a word sense disambiguation task, a bootstrapping algorithm is seeded with some examples of common collocations with the particular sense of some word (e.g. the seed "life" for the biological sense of "plant").

The co-training algorithm [Blum and Mitchell, 1998] for classification works in cases where the feature space is separable into naturally redundant and independent parts. For example, a web page can be viewed both as the text on the page itself and as the collection of text in hyperlink anchors pointing to that page.
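
In outline, the loop looks roughly like the schematic sketch below; the `fit`, `predict`, and `confidence` methods are an assumed classifier interface, and the two views here correspond to page text and anchor text.

```python
def co_train(clf_a, clf_b, labeled, unlabeled, n_rounds=10, k=5):
    """Schematic co-training loop: each classifier trains on its own view
    (index 0 or 1 of each example) and then moves its k most confident
    unlabeled examples, with predicted labels, into the labeled pool."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(n_rounds):
        for view, clf in enumerate((clf_a, clf_b)):
            clf.fit([x[view] for x, y in labeled], [y for x, y in labeled])
            # Rank remaining unlabeled examples by this view's confidence.
            unlabeled.sort(key=lambda x: -clf.confidence(x[view]))
            for x in unlabeled[:k]:
                labeled.append((x, clf.predict(x[view])))
            del unlabeled[:k]
    return clf_a, clf_b
```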

Unlabeled data has also been used in document classification by Nigam et al. [1999], where, as in this work, EM is used to fill in the "missing" class labels on the unlabeled data. Unlike this work, Nigam et al. begin with a few labeled documents instead of keywords. More importantly, they also describe two methods of overcoming problems when the results of maximum likelihood clustering do not correspond well to class labels.

6 Discussion

The two problems studied in this paper work at very different granularities. In the document classification domain, our granularity of interest is the entire document, and our feature set consists of potentially every word in the vocabulary. This means that a single keyword can match one of several hundred words in a document, and is chosen from a vocabulary of tens of thousands of words. We can view the seeding using these keywords as finding better-than-random starting points, which we then exploit using Expectation-Maximization (EM) and shrinkage over the class hierarchy.

In the information extraction task, the granularity of interest is at the phrase level. A phrase contains relatively few words, so a match with a keyword is a much stronger indicator of class membership. The feature set we use consists of all possible phrases with exact match, so our vocabulary is large, but the number of possible features for each example is very small. In this situation, we can view the seeding from keywords as providing a small set of highly accurately labeled data to begin bootstrapping.

Thus, the effect of initial labels is likely to differ between the two tasks; each example in the extraction task contains only a few words, whereas in classification an example typically contains several hundred words. We therefore expect the significance of individual keywords to be lower in the document setting than in the extraction setting, but their coverage to be higher.

In future work we propose to examine the effect of the domain and task on the applicability of the seeded bootstrapping framework. Pure unsupervised clustering [Duda and Hart, 1973] can be used to find underlying regularities in the data domain. If there is a strong correspondence between these regularities and our target task, there may be no need for seeds, except to speed up convergence by providing a good starting point for the clustering. However, if unsupervised clustering admits multiple solutions on our dataset, and only a subset of these correspond to partitions that perform our task well, seeding can help us initialize the clustering at a point close to the target local maximum. We hypothesize that many local maxima exist in text learning tasks, which makes the seeding essential.

We will test this hypothesis in two ways. First, we will perform unsupervised clustering on the datasets (e.g. by providing random initial seeds) and then label the clusters post facto with the seeds. If the clusters produced this way perform as well as those produced using seeding, then our approach finds the global maximum for our task. If this is not the case (as we expect), we will perform experiments providing seed words of successively poorer quality (i.e. seed words which are more and more ambiguous or loosely related to the target class) and examine how the quality of the initial labeling using those seed words affects our final task-specific results.
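
A possible sketch of the first test follows, assuming some clustering routine has already assigned each document to a cluster and that `keyword_label` is a rule-list seed labeler like the one sketched earlier (both the function and its interface are hypothetical):

```python
from collections import Counter

def label_clusters_post_facto(cluster_ids, docs, keyword_label):
    """Give each cluster the class whose seed keywords match the most of
    its member documents; every document then inherits its cluster's
    label. Clusters matching no seeds are left unlabeled (None)."""
    votes = {}
    for doc, cluster in zip(docs, cluster_ids):
        label = keyword_label(doc)               # rule-list seed labeler
        if label is not None:
            votes.setdefault(cluster, Counter())[label] += 1
    majority = {c: cnt.most_common(1)[0][0] for c, cnt in votes.items()}
    return [majority.get(c) for c in cluster_ids]
```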

This paper has considered text learning in cases without labeled training data. To this end, we advocate using the general bootstrapping process. From a small amount of seed information and a large collection of unlabeled data, bootstrapping iterates the steps of building a model from generated labels, and obtaining labeled data from the model. In two case studies in information extraction and text classification, the bootstrapping paradigm proves successful at learning without labeled data.

Acknowledgements

This research is supported in part by the National Science Foundation under grants IRI-9509820, IRI-9704240, and SBR-9720374 and the DARPA HPKB program under contract F30602-97-1-0215.

References

[Baker et al., 1999] D. Baker, T. Hofmann, A. McCallum, and Y. Yang. A hierarchical probabilistic model for novelty detection in text. Technical report, Just Research, 1999. http://www.cs.cmu.edu/~mccallum.

[Blum and Mitchell, 1998] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT '98, 1998.

[Brin, 1998] Sergey Brin. Extracting patterns and relations from the World Wide Web. In WebDB Workshop at EDBT '98, 1998.

[Califf, 1998] M. E. Califf. Relational Learning Techniques for Natural Language Information Extraction. PhD thesis, Tech. Rept. AI98-276, Artificial Intelligence Laboratory, The University of Texas at Austin, 1998.

[Castelli and Cover, 1996] V. Castelli and T. M. Cover. The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Transactions on Information Theory, 42(6):2101–2117, November 1996.

[Cohen and Singer, 1996] W. Cohen and Y. Singer. Context-sensitive learning methods for text categorization. In SIGIR '96, 1996.

[Cohen, 1996] W. W. Cohen. Learning trees and rules with set-valued features. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), 1996.

[Craven et al., 1998] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998.

[Dempster et al., 1977] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[Duda and Hart, 1973] R. Duda and P. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.

[Freitag, 1998] Dayne Freitag. Multistrategy learning for information extraction. In Proceedings of the Fifteenth International Conference on Machine Learning, 1998.

[Huffman, 1996] S. Huffman. Learning information extraction patterns from examples. In Stefan Wermter, Ellen Riloff, and Gabriele Scheler, editors, Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, pages 246–260. Springer-Verlag, Berlin, 1996.

[Joachims, 1998] T. Joachims. Text categorization with Support Vector Machines: Learning with many relevant features. In ECML-98, 1998.

[Kim and Moldovan, 1993] J. Kim and D. Moldovan. Acquisition of semantic patterns for information extraction from corpora. In Proceedings of the Ninth IEEE Conference on Artificial Intelligence for Applications, pages 171–176, Los Alamitos, CA, 1993. IEEE Computer Society Press.

[Lewis and Ringuette, 1994] David D. Lewis and Marc Ringuette. A comparison of two learning algorithms for text categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, pages 81–93, 1994.

[Lewis, 1998] D. D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. In ECML-98, 1998.

[McCallum and Nigam, 1998] A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998. Tech. rep. WS-98-05, AAAI Press. http://www.cs.cmu.edu/~mccallum.

[McCallum et al., 1998] A. McCallum, R. Rosenfeld, T. Mitchell, and A. Ng. Improving text classification by shrinkage in a hierarchy of classes. In ICML-98, 1998.

[McCallum et al., 1999] Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Using machine learning techniques to build domain-specific search engines. In IJCAI-99, 1999. To appear.

[McLachlan and Basford, 1988] G. J. McLachlan and K. E. Basford. Mixture Models. Marcel Dekker, New York, 1988.

[Mitchell, 1997] T. M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.

[MUC-4 Proceedings, 1992] Proceedings of the Fourth Message Understanding Conference (MUC-4). Morgan Kaufmann, San Mateo, CA, 1992.

[MUC-5 Proceedings, 1993] Proceedings of the Fifth Message Understanding Conference (MUC-5). Morgan Kaufmann, San Francisco, CA, 1993.

[Nigam et al., 1999] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 1999. To appear.

[Riloff and Jones, 1999] E. Riloff and R. Jones. Automatically generating extraction patterns from untagged text. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, pages 1044–1049. The AAAI Press/MIT Press, 1999. To appear.

[Riloff, 1993] E. Riloff. Automatically constructing a dictionary for information extraction tasks. In Proceedings of the Eleventh National Conference on Artificial Intelligence, pages 811–816. AAAI Press/The MIT Press, 1993.

[Riloff, 1996a] E. Riloff. An empirical study of automated dictionary construction for information extraction in three domains. Artificial Intelligence, 85:101–134, 1996.

[Riloff, 1996b] E. Riloff. Automatically generating extraction patterns from untagged text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1044–1049. The AAAI Press/MIT Press, 1996.

[Schapire and Singer, 1999] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning Journal, 1999. To appear.

[Seymore et al., 1999] K. Seymore, A. McCallum, and R. Rosenfeld. Learning hidden Markov model structure for information extraction. In AAAI-99 Workshop on Machine Learning for Information Extraction, 1999. In submission.

[Soderland et al., 1995] S. Soderland, D. Fisher, J. Aseltine, and W. Lehnert. CRYSTAL: Inducing a conceptual dictionary. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1314–1319, 1995.

[Soderland, 1999] Stephen Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 1999. To appear.

[Yang, 1999] Y. Yang. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1999. To appear.

[Yarowsky, 1995] D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In ACL-95, 1995.