Information Extraction: Algorithms and Prospects in a Retrieval Context



Series Editor: W. Bruce Croft, University of Massachusetts, Amherst

Also in the Series:

INFORMATION RETRIEVAL SYSTEMS: Theory and Implementation, by Gerald Kowalski;
edited by Gregory Grefenstette;
Analytic Models of Performance, by Robert M. Losee;
Advanced Models for the Representation and Retrieval of Information, by Fabio Crestani, Mounia Lalmas, and Cornelis Joost van Rijsbergen;
Technologies for Managing Electronic Document Collections, Justin Zobel;
ADVANCES IN INFORMATION RETRIEVAL: Recent Research from the Center for Intelligent Information Retrieval, by W. Bruce Croft;
INFORMATION RETRIEVAL SYSTEMS: Theory and Implementation, Second Edition, by Gerald J. Kowalski and Mark T. Maybury;
by Jian Kang Wu;
An Information Search Approach, by George Chang, Marcus J.;
by James Z. Wang;
edited by James Allan;
edited by W. Bruce Croft and John Lafferty;
by Yixin Chen, Jia Li and James Z. Wang;
INFORMATION RETRIEVAL: Algorithms and Heuristics;
CHARTING A NEW COURSE: Natural Language Processing and Information Retrieval, edited by John I. Tait;
INTELLIGENT DOCUMENT RETRIEVAL: Exploiting Markup Structure, by Udo Kruschwitz;
THE TURN: Integration of Information Seeking and Retrieval in Context, by Peter Ingwersen, Kalervo Järvelin;
NEW DIRECTIONS IN COGNITIVE INFORMATION RETRIEVAL, edited by Amanda Spink, Charles Cole;
COMPUTING ATTITUDE AND AFFECT IN TEXT: Theory and Applications, edited by James G. Shanahan, Yan Qu, Janyce Wiebe; ISBN: 1-4020-4026-1


Information Extraction: Algorithms and Prospects in a Retrieval Context

By

Marie-Francine Moens

Katholieke Universiteit Leuven
Leuven, Belgium


A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN-10 1-4020-4987-0 (HB)
ISBN-13 978-1-4020-4987-3 (HB)
ISBN-10 1-4020-4993-5 (e-book)
ISBN-13 978-1-4020-4993-4 (e-book)

Published by Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands.

www.springer.com

Printed on acid-free paper

All Rights Reserved
© 2006 Springer
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.


To those who search for meaning amidst ambiguous appearances...


Contents

Preface
Acknowledgements

1 Information Extraction and Information Technology
  1.1 Defining Information Extraction
  1.2 Explaining Information Extraction
    1.2.1 Unstructured Data
    1.2.2 Extraction of Semantic Information
    1.2.3 Extraction of Specific Information
    1.2.4 Classification and Structuring
  1.3 Information Extraction and Information Retrieval
    1.3.1 Information Overload
    1.3.2 Information Retrieval
    1.3.3 Searching for the Needle
  1.4 Information Extraction and Other Information Processing Tasks
  1.5 The Aims of the Book
  1.6 Conclusions
  1.7 Bibliography

2 Information Extraction from an Historical Perspective
  2.1 Introduction
  2.2 An Historical Overview
    2.2.1 Early Origins
    2.2.2 Frame Theory
    2.2.3 Use of Resources
    2.2.4 Machine Learning
    2.2.5 Some Afterthoughts
  2.3 The Common Extraction Process
    2.3.1 The Architecture of an Information Extraction System
    2.3.2 Some Information Extraction Tasks
  2.4 A Cascade of Tasks
  2.5 Conclusions
  2.6 Bibliography

3 The Symbolic Techniques
  3.1 Introduction
  3.2 Conceptual Dependency Theory and Scripts
  3.3 Frame Theory
  3.4
    3.4.1 Partial Parsing
    3.4.2 Finite State Automata
  3.5 Conclusions
  3.6 Bibliography

4 Pattern Recognition
  4.1 Introduction
  4.2 What is Pattern Recognition?
  4.3 The Classification Scheme
  4.4 The Information Units to Extract
  4.5 The Features
    4.5.1 Lexical Features
    4.5.2 Syntactic Features
    4.5.3 Semantic Features
    4.5.4 Discourse Features
  4.6 Conclusions
  4.7 Bibliography

5 Supervised Classification
  5.1 Introduction
  5.2 Support Vector Machines
  5.3 Maximum Entropy Models
  5.4 Hidden Markov Models
  5.5 Conditional Random Fields
  5.6 Decision Rules and Trees
  5.7 Relational Learning
  5.8 Conclusions
  5.9 Bibliography

6 Unsupervised Classification Aids
  6.1 Introduction
  6.2 Clustering
    6.2.1 Choice of Features
    6.2.2 Distance Functions between Two Objects
    6.2.3 Proximity Functions between Two Clusters
    6.2.4 Algorithms
    6.2.5 Number of Clusters
    6.2.6 Use of Clustering in Information Extraction
  6.3 Expansion
  6.4 Self-training
  6.5 Co-training
  6.6 Active Learning
  6.7 Conclusions
  6.8 Bibliography

7 Integration of Information Extraction in Retrieval Models
  7.1 Introduction
  7.2 State of the Art of Information Retrieval
  7.3 Requirements of Retrieval Systems
  7.4 Motivation of Incorporating Information Extraction
  7.5 Retrieval Models
    7.5.1 Vector Space Model
    7.5.2 Language Model
    7.5.3 Inference Network Model
    7.5.4 Logic Based Model
  7.6 Data Structures
  7.7 Conclusions
  7.8 Bibliography

8 Evaluation of Information Extraction Technologies
  8.1 Introduction
  8.2 Intrinsic Evaluation of Information Extraction
    8.2.1 Classical Performance Measures
    8.2.2 Alternative Performance Measures
    8.2.3 Measuring the Performance of Complex Extractions
  8.3 Extrinsic Evaluation of Information Extraction in Retrieval
  8.4 Other Evaluation Criteria
  8.5 Conclusions
  8.6 Bibliography

9 Case Studies
  9.1 Introduction
  9.2 Generic versus Domain Specific Character
  9.3 Information Extraction from News Texts
  9.4 Information Extraction from Biomedical Texts
  9.5 Intelligence Gathering
  9.6 Information Extraction from Business Texts
  9.7 Information Extraction from Legal Texts
  9.8 Information Extraction from Informal Texts
  9.9 Conclusions
  9.10 Bibliography

10 The Future of Information Extraction in a Retrieval Context
  10.1 Introduction
  10.2 The Human Needs and the Machine Performances
  10.3 Most Important Findings
    10.3.1 Machine Learning
    10.3.2 The Generic Character of Information Extraction
    10.3.3 The Classification Schemes
    10.3.4 The Role of Paraphrasing
    10.3.5 Flexible Information Needs
    10.3.6 The Indices
  10.4 Algorithmic Challenges
    10.4.1 The Features
    10.4.2 A Cascaded Model for Information Extraction
    10.4.3 The Boundaries of Information Units
    10.4.4 Extracting Sharable Knowledge
    10.4.5 Expansion
    10.4.6 Algorithms for Retrieval
  10.5 The Future of IE in a Retrieval Context
  10.6 Bibliography

Index


Preface

Information extraction (IE) is usually defined as the process of selectively structuring and combining data that are explicitly stated or implied in one or more natural language documents. This process involves a semantic classification of certain pieces of information and is considered a light form of text understanding. IE has a history going back at least three decades and different approaches have been developed. Currently, there is considerable interest in using these technologies for information retrieval, since there is an increasing need to localize precise information in documents, for instance, as the answer to a question, rather than retrieving the entire document or a list of documents. Advanced retrieval models such as language modeling answer that need by building a probabilistic model of the content of a document. Question answering systems try to take the next step by inferring answers to a natural language question from a document collection. In these and other information retrieval models a semantic classification of entities, of relations between entities, and of semantically relevant portions of texts (phrases, sentences, maybe passages) is very valuable to advance the state of the art of text searching. When talking about a semantic Web, semantic classification becomes of primordial importance, but also in other tasks that involve information selection and filtering, such as text summarization and information synthesis from different documents, IE is an indispensable preprocessing step.

The book gives an overview and explanation of the most successful and efficient algorithms for information extraction, and of how they can be integrated in an information retrieval system. Special focus is on approaches that are fairly generic, i.e., that can be applied for processing heterogeneous document collections rather than a specific domain or text type, and that are as language independent as possible. The book contains a wealth of information on past and current milestones in information extraction, on the necessary knowledge and resources involved in the extraction processes, and on the final aims of an extraction system. Additionally, a very important focus is on current statistical and machine learning techniques for information detection and classification. In an information retrieval context, these techniques can be used to learn and fine-tune traditional knowledge engineered rules and patterns.



The book has grown from the results of a project on Generic Technology for Information Extraction from Texts, researched at the Katholieke Universiteit Leuven, Belgium, from 2000-2004 and sponsored by the Institute for the Promotion of Innovation by Science and Technology in Flanders, and from a graduate course on Text Based Information Retrieval taught at the same university to students in Artificial Intelligence, Informatics, and Electrical Engineering. This book is meant to give a comprehensive overview of the field of information extraction, especially as it is used in an information retrieval context. It is aimed at researchers in information extraction or related disciplines, but the many illustrations and real world examples also make it suitable as a handbook for students.



Acknowledgements

First, I would like to thank Rik De Busser, who is currently a Ph.D. student in Linguistics at La Trobe University in Melbourne, Australia, and who helped with the redaction of the first three chapters of this book. Secondly, I thank Prof. Jos Dumortier, the director of the Interdisciplinary Centre for Law and Information Technology at the K.U.Leuven, for the opportunities given to our research group Legal Informatics and Information Retrieval. Many thanks go to the staff of this group and especially to Roxana Angheluta, Jan De Beer, Koen Deschacht and Wim De Smet for participating in weekly project discussions. I am very grateful to Prof. Danny De Schreye, Head of the Informatics Department in the Faculty of Engineering, for the many encouragements to pursue research in the domain of artificial intelligence. I sincerely thank Prof. Paul Van Orshoven, dean of our faculty, Prof. Yves Willems, former dean of the Faculty of Engineering, and Prof. Marc Vervenne, Rector of the K.U.Leuven, who gave me a marvelous chance to continue and perpetuate my research and teaching in the domain of information retrieval. Information extraction from written texts by a machine is a first step towards their automatic understanding. The task compares to decoding the symbols of an old language and gradually learning the meaning of the inscriptions. I am very grateful to the late Prof. Jan Quaegebeur (K.U.Leuven) and Prof. John Callender (University of California Los Angeles, USA). A long time ago they aroused in me a profound interest in content extraction from texts. I surely must thank Dr. Donna Harman (NIST, USA), Prof. Ed Hovy (University of Southern California, USA) and Prof. Karen Sparck Jones (University of Cambridge, UK) for creating influential and valuable ideas in the fields of information retrieval and text analysis. The final thank you goes to my family for their patience on Sunday afternoons.


1 Information Extraction and Information Technology

With Rik De Busser

1.1 Defining Information Extraction

A company wants to track the general sentiments about its newly released product in Web blogs. Another company wants to use the news feeds it bought from a press agency to construct a detailed overview of all technological trends in the development of semiconductor technologies. The company also wants a timeline of all business transactions involved in this development. A space agency allows astronauts to query large amounts of technical documentation by means of natural language speech. A government is gathering data on a natural disaster and wants to urgently inform emergency services with a summary of the latest data available. An intelligence agency is investigating general trends in terrorist activities all over the world. They have a database of millions of news feeds, minutes and e-mails and want to use these to get a detailed overview of all terrorist events in a particular geographical region in the last five years. A legal scholar is interested in studying the decisions of judges in divorce settlements and the underlying criteria. He or she has thousands of court decisions at his or her disposal. A biomedical research group is investigating a new treatment and wants to know all possible ways in which a specific group of proteins can interact with other proteins and what the exact results of these interactions are. There are tens of thousands of articles, conference papers and technical reports to study.

The above examples have a number of elements in common: (1) they are requests for information; (2) the answer to such a request is usually present in unstructured data sources such as text and images; (3) it is impossible for humans to process all the data because there is simply too much of it; and


(4) computers are not able to directly query for the target information because it is not stored in a structured format such as a database but in unstructured sources. Information extraction (IE) is the subdiscipline of artificial intelligence that tries to solve this kind of problem.

Traditionally, information extraction is associated with template based extraction of event information from natural language text, which was a popular task of the Message Understanding Conferences in the late eighties and nineties (Sundheim, 1992). MUC information extraction tasks started from a predefined set of templates, each containing specific information slots that encode event types relevant to a very specific subject domain – for instance, terrorism in Latin America – and used relatively straightforward pattern matching techniques to fill out these templates with specific instances of these events from a corpus of texts. Patterns in the form of a grammar or rules (e.g., in the form of regular expressions) were mapped on the text in order to identify the information.

MUC was the first large scale effort to boost research into automatic information extraction and it would define the research field for the decades to come. Even at the time of writing, information extraction is often associated with template based pattern matching techniques. Unsurprisingly, the MUC legacy still resounds very strongly in Riloff and Lorenzen's definition of information extraction:

IE systems extract domain-specific information from natural language text. The domain and types of information to be extracted must be defined in advance. IE systems often focus on object identification, such as references to people, places, companies, and physical objects. […] Domain-specific extraction patterns (or something similar) are used to identify relevant information. (Riloff and Lorenzen, 1999, p. 169)

This definition represents a traditional view on what information extraction is and it more or less captures what this discipline is about: the extraction of information that is semantically defined from a text, using a set of extraction rules that are tailored to a very specific domain. The main points expressed by this definition are that an information extraction system identifies information in text, i.e., in an unstructured information source, and that the information adheres to predefined semantics (e.g., people, places, etc.). However, we will see in the rest of the book that at present the scope of Riloff and Lorenzen's definition has become too limited. Information extraction is not necessarily domain specific. In practice, the domain of the information to be extracted is often determined in advance, but this has more to do with technological limitations of the present state of the art than with the long-term goals of the research discipline. An ideal information extraction system should be domain independent or at least portable to any domain with a minimum amount of engineering effort. Moreover, Riloff and Lorenzen do not further specify the types of information. Although many different types of semantics can be defined, the semantics – whether they are defined in a specific or a general subject domain – ideally should be as much as possible universally accepted and bear on the ontological nature and relationships of being.
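The template filling style with regular expression patterns described above can be made concrete in a few lines of code. The sketch below is a deliberately naive illustration, not a reconstruction of any actual MUC system: the event type, slot names, pattern and example sentence are all invented.

```python
import re

# One invented extraction pattern for a toy BOMBING event template:
# "<perpetrator> bombed <target> in <location>".
PATTERN = re.compile(
    r"(?P<perpetrator>[A-Z][\w ]+?) bombed (?P<target>[\w ]+?) in (?P<location>[A-Z]\w+)"
)

def fill_template(sentence):
    """Map the pattern onto the text; on a match, fill the template slots."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    template = {"event_type": "BOMBING"}
    template.update(match.groupdict())
    return template

print(fill_template("Guerrillas bombed a power station in Lima yesterday."))
```

A real system chains many such patterns (or a grammar) and merges their matches; the brittleness of this single pattern, which misses every paraphrase of the same event, is precisely what motivates the machine learning techniques of later chapters.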

Another consequence of the stress on pattern matching approaches that were developed during the MUC competitions is that eventually any technique in which pattern matching is used to organize data in some structured format can be considered to be information extraction. For instance, the early nineties saw a sudden surge in popularity of research into approaches that try to extract the content of websites (e.g., shopbots that extract and compare prices), usually in order to convert them into a more convenient, uniform structural format. Some of these approaches analyze the natural language content of full text websites, but many only use pattern matching techniques that exploit the structural properties of markup languages to harvest data from automatically generated web pages. While many researchers conveniently gathered these approaches under the common denominator web based information extraction (see for instance Eikvil, 1999), we will assume that information extraction presupposes at least some degree of semantic content analysis. In addition, information extraction is also very much involved in finding the relationships that exist between the extracted information, based on evidence in text (e.g., John kisses Claudia).

Cowie and Lehnert try to mend the previous inaccuracies. They see information extraction as a process that involves the extraction of fragments of information from natural language texts and the linking of these fragments into a coherent framework. In their view, information extraction

[…] isolates relevant text fragments, extracts relevant information from the fragments, and then pieces together the targeted information in a coherent framework. […] The goal of information extraction research is to build systems that find and link relevant information while ignoring extraneous and irrelevant information. (Cowie and Lehnert, 1996, p. 81)


Cowie and Lehnert's interpretation of information extraction is close to what we need to solve the problems at the beginning of this chapter. There is still one thing missing in their definition. Although in this book we concentrate on information extraction from text, text is not the only source of unstructured information. Among these sources, it is probably the one where the largest advances in automatic understanding have been made. But other sources (e.g., image, video) exhibit a similar need for semantically labeling unstructured information, and advances in their automatic understanding are expected in the near future. Any framework in which information extraction functions should not exclude these sources.

The interpretations above are only a few representative definitions, and in the literature one finds additional variants. To this multitude, we will add our own working definition, trying to incorporate the kernel task and function of information extraction and to avoid both Riloff and Lorenzen's and Cowie and Lehnert's limitations:

DEFINITION

Information extraction is the identification, and consequent or concurrent classification and structuring into semantic classes, of specific information found in unstructured data sources, such as natural language text, making the information more suitable for information processing tasks.

This definition is concise and covers exactly the sense in which we will use the term information extraction throughout this book, but it is still fairly abstract. In the next sections, we will clarify what its constituent parts exactly mean.
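The three steps named in the definition, identification, classification and structuring into semantic classes, can be pictured with a deliberately naive sketch; the gazetteer, class labels and sentence below are our own invented illustration, not part of the book's formal apparatus.

```python
import re

# Invented gazetteer mapping surface strings to semantic classes.
GAZETTEER = {
    "Brussels": "LOCATION",
    "Katholieke Universiteit Leuven": "ORGANIZATION",
}

def extract(text):
    """Identify known strings in unstructured text, classify each into a
    semantic class, and structure the result as a list of records."""
    records = []
    for surface, sem_class in GAZETTEER.items():
        for match in re.finditer(re.escape(surface), text):
            records.append({"text": surface, "class": sem_class,
                            "start": match.start(), "end": match.end()})
    return sorted(records, key=lambda r: r["start"])

print(extract("The Katholieke Universiteit Leuven is located near Brussels."))
```

The output records are "more suitable for information processing tasks" in exactly the sense of the definition: a retrieval or mining system can query them directly, which it cannot do with the raw sentence.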

1.2 Explaining Information Extraction

1.2.1 Unstructured Data

Information extraction is used to get some information out of unstructured data. Written and spoken text, pictures, video and audio are all forms of unstructured data. Unstructured does not imply that the data is structurally incoherent (in that case it would simply be nonsense), but rather that its information is encoded in such a way that it is difficult for computers to immediately interpret. It would actually be more accurate to use the terms computationally opaque vs. computationally transparent data. Information extraction is the process that adds meaning to unstructured, raw data, whether that is text, images, video or audio. Consequently, the data become structured or semi-structured and can be more easily processed by the computer (e.g., in information retrieval, data mining or summarization).

In this book, the unstructured data sources we are mainly concerned with are written natural language texts. The texts can be of different type or genre (e.g., news articles, scientific treatises, police reports). Unless stated otherwise, we will assume that these texts are rather well formed, i.e., that they are largely coherent and error free. This is far from always the case. Written data, especially in an electronic context, are notorious for being incoherent and full of grammatical and spelling errors, whether drafted with purpose (e.g., spam messages) or without (e.g., instant messages, postings on informal news groups). Errors might also occur in the output of an automatic speech recognition system. The algorithms described in this book can all be applied to these deviant types of textual data, provided that the system is specifically trained to deal with the characteristics of the relevant text type or that the probabilities of the translation into elements of well formed text are taken into consideration.

Limiting this book to information extraction from the text medium is no restriction of its value. The technologies described in this book contribute to an advanced understanding of textual sources, which, for instance, can be used for aligning text and images when training systems that understand images. In addition, because technologies for the understanding of media other than text will be further developed in the near future, it seems valuable to compile information extraction technologies for text, as they serve as a source of ideas for content recognition in other media or in combined media (e.g., images and text, or video).

1.2.2 Extraction of Semantic Information

Information extraction identifies information in texts by taking advantage of their linguistic organization. Any text in any language consists of a complex layering of recurring patterns that form a coherent, meaningful whole. This is a consequence of the principle of compositionality (Szabó, 2004), a general notion from linguistic philosophy that underlies many modern approaches to language and that states that the meaning of any complex linguistic expression is a function of the meanings of its constituent parts. An English sentence typically contains a number of constituent


parts (e.g., a subject, a verb, maybe one or more objects). Their individual meanings, ordering and realization (for instance, the use of a specific verb tense) allow us to determine what the sentence means. If a text were completely irregular, it would simply be impossible for humans to make any sense of it.

It is not yet entirely clear how these linguistic layers exactly interact, but many linguistic theories and natural language processing assume the existence of a realizational chain. This theoretical notion has its roots in the grammar that was written by the Indian grammarian Panini in the 6th-5th century B.C. (see Kiparsky, 2002). According to this notion, meaning in a language is realized in the linguistic surface structure through a number of distinct linguistic levels, each of which is the result of a projection of the properties of higher, more abstract levels. For instance, for Panini the meaning of a simple sentence starts as an idea in the mind of a writer. It then passes through the stage in which the event and all its participants are translated into a set of semantic concepts, each of which is in its turn translated into a set of grammatical and lexical concepts. These are in their turn translated into the character sequences that we see written down on a page of paper. Information extraction (and natural language processing, for that matter) assumes that this projection process is to a considerable extent bidirectional, i.e., that ideas are recoverable from their surface realizations by a series of inverse processes.

In other words, information extraction presupposes that although the semantic information in a text and its linguistic organization are not immediately computationally transparent, they can nevertheless be retrieved by taking into account surface regularities that reflect the computationally opaque internal organization. An information extraction system will use a set of extraction patterns, which are either manually constructed or automatically learned, to take information out of a text and put it in a more structured format. The exact techniques that are used to extract semantic information from a natural language text form the main topic of this book. Particular methodologies and algorithms will be discussed throughout its main chapters.

The use of the term extraction implies that the semantic target information is explicitly present in a text's linguistic organization, i.e., that it is readily available in the lexical elements (words and word groups), the grammatical constructions (phrases, sentences, temporal expressions, etc.) and the pragmatic ordering and rhetorical structure (paragraphs, chapters, etc.) of the source text. In this sense, information extraction is different from techniques that infer information from texts, for instance by building logical rules (logical inference) and by trying to distil world or domain


knowledge from the propositions in a text through deductive, inductive or abductive reasoning. We will refer to this latter kind of information as knowledge. Knowledge discovery is also possible by means of statistical data mining techniques that operate on the information extracted from the texts (also referred to as text mining). In all these operations information extraction is often an indispensable preprocessing step. For instance, information that is extracted from police reports could be used as the input for a data mining algorithm for profiling or for detecting general crime trends, or as the input of a case based reasoning algorithm that predicts the location of the next strike of a serial killer based on similar case patterns.

1.2.3 Extraction of Specific Information

Information extraction is traditionally applied in situations where it is known in advance which kind of semantic information is to be extracted from a text. For instance, it might be necessary to identify what kind of events are expressed in a certain text and at what moment these events take place. Since in a specific language, events and temporal expressions can only be expressed in a limited number of ways, it is possible to design a method to identify specific events and the corresponding temporal location in a text. Depending on the information need, different models can be constructed to distinguish different kinds of classes at different levels of semantic granularity. In some applications, for example, it will suffice to indicate that a part of a sentence is a temporal expression, while in others it might be necessary to distinguish between different temporal classes, for instance between expressions indicating past, present and future.
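The two levels of granularity just mentioned, a coarse TEMPORAL class versus the finer past/present/future distinction, can be sketched with a toy tagger; the trigger patterns below are invented and far from exhaustive.

```python
import re

# Fine-grained temporal classes, each with a small invented trigger pattern.
FINE_CLASSES = [
    ("PAST", re.compile(r"\b(?:yesterday|last (?:week|month|year))\b")),
    ("PRESENT", re.compile(r"\b(?:today|now)\b")),
    ("FUTURE", re.compile(r"\b(?:tomorrow|next (?:week|month|year))\b")),
]

def tag_temporal(sentence, fine_grained=True):
    """Return (expression, class) pairs; with fine_grained=False every
    expression receives the single coarse class TEMPORAL."""
    tags = []
    for label, pattern in FINE_CLASSES:
        for match in pattern.finditer(sentence):
            tags.append((match.group(0), label if fine_grained else "TEMPORAL"))
    return tags

print(tag_temporal("The plant closed last year and reopens next week."))
```

The same recognition machinery serves both information needs; only the labels handed to the downstream application change.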

Information extraction does not present the user with entire documents, but it extracts textual units or elements from the documents, typically simple or multi-term basic phrases (Appelt and Israel, 1999), which we also call text regions. As such, information extraction is different from extractive summarization, which usually retrieves entire sentences from texts that serve as its summary. Information extraction, however, can be a useful first step in extractive headline summarization, in which the summary sentence is further reduced to a string of relevant phrases similar to a newspaper headline.

Specificity implies that not only the semantic nature of the target information is predefined in an information extraction system, but also the unit and scope of the elements to be extracted. Typical extraction units for an extraction system are word compounds and basic noun phrases, but in some applications it might be opportune to extract other linguistic units,


such as verb phrases, temporal markers, clauses, strings of related meanings that persist throughout different sentences, larger rhetorical structures, etc. Whereas the unit of extraction has to do with the granularity of individual information chunks that are lifted out of the source text, the scope of extraction refers to the granularity of the extraction space for each individual information request. Information can be extracted from one clause or from multiple clauses or sentences spanning one or more texts before it is output by the system. Consider, for example, an information question that asks for event information about assassinations: the name of the person assassinated and the time and place of the event might be given in the first sentence of a news article, while the name of the assassin and his method are mentioned some sentences further in the discourse.
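Widening the scope beyond a single sentence amounts to merging partial templates extracted from different places in the discourse. A minimal sketch of that merging step, with slot names and fillers invented after the assassination example above:

```python
def merge_templates(partial_templates):
    """Combine slot fillers found in different sentences of the same
    discourse into one event template; the earliest filler for a slot wins."""
    event = {}
    for template in partial_templates:
        for slot, filler in template.items():
            event.setdefault(slot, filler)
    return event

# The first sentence names the victim, time and place of the event;
# a later sentence supplies the assassin and the method.
sentence_1 = {"victim": "the minister", "time": "Monday", "place": "Rome"}
sentence_4 = {"assassin": "an unknown gunman", "method": "shooting"}
print(merge_templates([sentence_1, sentence_4]))
```

Deciding that the two sentences really describe the same event is of course the hard part; simple slot merging as above assumes that question has already been answered.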

During the Message Understanding Conferences (MUC), there gradually arose a set of typical information extraction tasks (see Grishman and Sundheim, 1996; Cunningham, 1997). A most popular task probably is named entity recognition, i.e., recognizing person names, organizations, locations, dates, times, money amounts and percentages. These names are often expanded to protein names, product brands, etc. Other tasks are event extraction, i.e., recognizing events, their participants and settings, and scenario extraction, i.e., linking of individual events in a story line. Coreference resolution, i.e., determining whether two expressions in natural language refer to the same entity, person, time, place, and event in the world, also receives quite a lot of attention. These task definitions have been extremely influential in current information extraction research and we will see that although they are getting too narrow to cover everything that is presently expected from information extraction, they still define its main targets. Currently, we see a lot of interest in the task of entity relation recognition. A number of domain specific extractions are also popular, e.g., extraction of the date of availability of a product from a Web page, extraction of scientific data from publications, and extraction of the symptoms and treatments of a disease from patient reports. The interest in the above extraction tasks is also demonstrated in the current Automatic Content Extraction (ACE) project.

1.2.4 Classification and Structuring

Typical for information extraction is that information is not just extracted from a text but afterwards also semantically classified in order to ensure its future use in information systems. By doing this, the information from unstructured text sources also becomes structured (i.e., computationally transparent and semantically well defined). In the extreme case, the information that is verbatim extracted from the texts is discarded for further processing, but this is not what is usually intended.

Any classification process requires a semantic classification scheme, i.e., a set of semantic classes that are organized in some relevant way (for instance in a hierarchy) and that are used to categorize the extracted chunks of information into a number of meaningful groups. A very large variety of semantic classification schemes is conceivable, ranging from a small set of abstract semantic classes to a very elaborate and specific classification.
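Such a scheme can be sketched in a few lines. The class names, the two-level hierarchy and the example chunks below are invented for illustration only; they are not taken from any particular extraction system:

```python
# Toy semantic classification scheme: a small hierarchy of semantic
# classes used to label extracted text chunks. All names are
# illustrative assumptions, not part of any standard scheme.

SCHEME = {
    "ENTITY": ["PERSON", "ORGANIZATION", "LOCATION"],
    "TIME": ["DATE", "DURATION"],
}

def parent_class(label):
    """Return the abstract parent of a specific semantic class."""
    for parent, children in SCHEME.items():
        if label in children:
            return parent
    raise KeyError(label)

# Extracted chunks paired with the class an extractor assigned them.
chunks = [("Leuven", "LOCATION"), ("29 November 1995", "DATE")]
for text, label in chunks:
    print(f"{text!r} -> {label} (a kind of {parent_class(label)})")
```

A realistic scheme would of course be far larger, and often much deeper than two levels.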

Based on the general information focus of a system, we can make a main distinction between closed domain and open domain (or domain independent) information extraction systems. Traditionally, information extraction systems were closed domain systems, which means that they were designed to function in a rather specialized, well delineated knowledge domain (and that they will therefore use very specific classification rules). For instance, most MUC systems covered very limited subjects such as military encounters, Latin-American terrorism or international joint ventures (Grishman and Sundheim, 1996). Domain independent information extraction systems, on the other hand, are capable of handling texts belonging to heterogeneous text types and subject domains, and usually use very generic classification schemes, which might be refined if the information processing task demands a more specific identification of semantic information. The technology described in this book applies to both closed and open domain information extraction.

We mentioned before that information extraction essentially converts unstructured information from natural language texts into structured information. This implies that there has to be a predefined structure, a representation, in which the extracted information can be cast. Although the extracted information can solely be labeled for consequent processing by the information system, in the past many template based extraction systems have been developed. Template representations were typically used to describe single events (and later also complex scenarios) and consist of a set of attribute-value pairs (so-called slots), each of which represents a relevant aspect of the event (e.g., the action or state, the persons participating, time, place). An information extraction task traditionally tries to take information from a source text and map it to an empty slot of the defined template.

In order to know which piece of information is supposed to end up in which template slot, an information extraction application uses a set of extraction rules. These rules state which formal or linguistic properties a particular chunk of information must possess to belong to a particular semantic class. Especially in earlier systems these rules were usually handcrafted (e.g., the FASTUS system developed by Appelt et al., 1993).

Currently, machine learning is playing a central role in the information extraction paradigm. In most cases, supervised learning is used, in which a learning algorithm uses a training corpus with manually labeled examples to induce extraction rules, as they are applicable to a particular language and text type (e.g., the CRYSTAL system developed by Soderland et al., 1995). In some cases, it is also possible to apply unsupervised learning, for which no training corpus is necessary. For instance, unsupervised learning systems have been implemented for noun phrase coreferent resolution (e.g., Cardie and Wagstaff, 1999). Today, we see a large interest in weakly supervised learning approaches that limit the number of examples to be manually labeled (Shen et al., 2004). The application of these learning techniques has been one of the main enabling factors for information extraction to move from very domain specific to more domain independent analyses. In addition, the machine learning techniques more easily allow modeling a probabilistic class assignment instead of a purely deterministic one.

1.3 Information Extraction and Information Retrieval

1.3.1 Information Overload

Our modern world is flooded with information. Nobody knows exactly how much information there is – or how one could uniformly measure information flows from heterogeneous sources – but Lyman and Varian (2003) estimate that the total amount of newly created information on physical media (print, film, optical and magnetic storage) amounted to some 5 exabytes in 2002, most of it stored in digital format. This corresponds to 9,500 billion books or 500,000 times the entire Library of Congress (which is supposed to contain approximately 10 terabytes of information). According to their measures, the surface Web contains around 167 terabytes of information, and there are indications that the deep Web, i.e., information stored in databases that is accessible to human users through query interfaces but largely inaccessible to automatic indexing, is about 400 to 500 times larger. Lyman and Varian (2003) estimate it to be at least 66,800 terabytes of data. A large fraction of this information is unstructured, in the form of text, images, video and audio. These gargantuan figures are already


outdated at the time of writing and are dwarfed by the amount of e-mail traffic that is generated, which according to Lyman and Varian (2003) amounts to more than 300,000 terabytes of unique information per year.

Fig. 1.1. Graphical presentation of the size of the Web and of global storage capacity on computer hard discs anno 2003 (panels a, b and c).
Legend:
1 Size of the Library of Congress
2 Size of the surface Web
3 Size of the surface + deep Web
4 Size of the surface + deep Web + e-mail traffic
5 Size of text data on hard discs sold in 2003


According to these authors, during 2003 an estimated 15,892.24 exabytes of hard disc storage was sold worldwide. A similar study that confirms the information overload was made by O'Neill, Lavoie and Bennett (2003). If this trend continues, we will have to express amounts of information in yottabytes (2^80 bytes).

In order to give a rough impression of the amounts of data that are involved, Fig. 1.1 gives a graphical representation of the total amount of data present on the Web and on hard discs worldwide in 2003. Data ratios are reflected in the relative size differences between the diameters of the circles. Figure 1.1a shows the size of unique textual data on the surface Web [2] in comparison with the textual data on the combined surface and deep Web [3] and of the surface and deep Web plus all e-mail traffic [4]. The size of the Library of Congress [1] is given as a reference. Figure 1.1b is a 180-fold magnification in which Fig. 1.1a appears as the minute rectangle at the point of tangency of 3 and 4. Figure 1.1c gives an impression of the complete size of the textual Web at the left hand side and a comparison with all textual data on hard discs at the right hand side.

This immense information production prevents its users from efficiently selecting information and accurately using it in problem solving and decision making (Edmunds and Morris, 2000; Farhoomand and Drury, 2002). Even if we find ways of reducing the information generation, there still is a large demand for intelligent tools that assist humans in information selection and processing (Berghel, 1997).

1.3.2 Information Retrieval

Information retrieval (IR) is a solution to this kind of problem (Baeza-Yates and Ribeiro-Neto, 1999). It allows a user to retrieve a set of documents from large document collections, such as the Web or a corporate intranet, based on a keyword based query. Information retrieval is able to search efficiently through huge amounts of data because it builds indexes from the documents in advance in order to reduce the time complexity of each real-time search. The low level keyword matching techniques that are generally used in information retrieval systems make them error tolerant, domain independent and – above all – very fast (Lewis and Sparck Jones, 1996). The success of information retrieval systems in general, and of Web search engines in particular, is largely due to the flexibility of these systems with regard to the queries that users pose. Users have all kinds of information needs that are very difficult to determine a priori. Because users do not always pose their queries with the words that occur in relevant documents, query expansion with synonyms and related terms is very popular, primarily enhancing the recall of the results of the search. Information retrieval is very successful in what it is aimed to do, namely providing a rough and quick approach to find relevant documents.
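The pre-built index and the synonym expansion just described can be sketched in a few lines. The documents and the synonym table below are invented for illustration; this is a minimal inverted index, not the design of any particular retrieval system:

```python
from collections import defaultdict

# Minimal inverted index, built in advance so that query-time lookup is
# a dictionary access rather than a scan over all documents.
docs = {
    1: "the company opened offices in jakarta",
    2: "talks between the firms collapsed",
    3: "the firm opened a plant in manila",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Toy expansion table: querying "company" should also match "firm(s)".
SYNONYMS = {"company": {"firm", "firms"}}

def search(term):
    """Look up a term, expanding it with synonyms to improve recall."""
    terms = {term} | SYNONYMS.get(term, set())
    hits = set()
    for t in terms:
        hits |= index.get(t, set())
    return sorted(hits)

print(search("company"))  # → [1, 2, 3]: the expansion reaches docs 2 and 3
```

Without the expansion step the query would return only document 1, which is precisely the recall problem that synonym expansion addresses.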

A downside is that such a robust and flexible approach sometimes results in a low precision of an information search and in a huge number of possibly relevant documents when a large document base is searched, which are impossible for the user of the information system to consult (Blair, 2002).

1.3.3 Searching for the Needle

Because of the information overload, the classical information retrieval paradigm is no longer preferable. This paradigm has its roots in the example of the traditional library. One is helped in finding potentially relevant books and documents, but the books and documents are still consulted by the humans. When the library becomes very large and the pile of potentially relevant books is immensely high, humans want more advanced information technology to assist them in their information search. We think that information extraction technology plays an important role in such a development.

Currently, an information retrieval system returns a list of relevant documents, where each individual document has to be fetched and skimmed through in order to assess its real relevance. There is a need for tools that reduce the amount of text that has to be read to obtain the desired information. To address this need, the information retrieval community is currently exploring ways of pinpointing highly relevant information. This is one of the reasons question answering systems are being researched. The user of a question answering retrieval system expresses his or her information need as a natural language question and the system extracts the answer to the information question from the texts of the documents (Maybury, 2003).

Information extraction is one of the core technologies to help facilitate highly focused retrieval. Indeed, recognizing entities and semantically meaningful relations between those entities is a key to providing focused information access.

With the current interest in expressing queries as natural language texts, the need for semantic classification of entities and their relations in the texts of document and query becomes of primordial importance. Information extraction technology realizes that – simply put – not only the words of the query, but also the semantic classifications of entities and their relations must match the information found in the documents (Moens, 2002).

Especially in information gathering settings where the economic costs of searching are high, or in time critical applications such as military or corporate intelligence gathering, a user often needs very specific information very quickly. For instance, an organization might need a list of all companies that have offices in the Middle East and conducted business transactions or pre-contract negotiations in the Philippines or Indonesia in the last five months. The user knows that much of this information is available in news feeds that were gathered over the last half year, but it is impossible to go through tens of thousands of news snippets to puzzle all relevant data together. In addition, we cannot neglect the need for flexible querying. There will always be a large variety of dynamically changing information needs.

Information retrieval techniques typically use general models for processing large volumes of text. The indices are stored in data structures that are especially designed to be efficiently searched at the time of querying. An ideal information retrieval system answers all kinds of possible information questions in a very precise way by extracting the right information from a (possibly large) collection of documents.

Information extraction helps building such information systems. The extracted information is useful to construct sensitive indices more closely linked to the actual meaning of a particular text (Cowie and Lehnert, 1996). This is often restricted to the recognition and classification of entities that are referenced in different places of the text and the recognition of relations between them. Besides an index of words that occur in the documents, certain words or other information units are tagged with additional semantic information. This meta-information allows answering information questions more precisely without losing the advantages of flexible querying. Information extraction technology allows for a much richer indexing representation of both query and document or information found in the document, which can improve retrieval performance for both open and closed domain texts. Especially linguistically motivated categories of semantics become important (e.g., expressions of time, location, coreference, abstract processes and their participants, ...). As we will show in this book, the identified and classified information – even if very generic semantic classifications are made – is useful in information retrieval and selection, allowing answers to information needs to be more precisely inferred from information contained in documents. Information extraction can be regarded as a kind of cheap and easy form of natural language understanding, which can be integrated in an information retrieval system to roughly provide some understanding of query and document.
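Such semantic tagging of index entries can be sketched as postings keyed on a word together with its semantic class. The labels, document identifiers and the two-field posting key below are invented for illustration, not the index layout of any real system:

```python
# Sketch of an index whose entries carry semantic labels alongside the
# indexed word, so a query can ask for "washington as a LOCATION"
# rather than just the string.

postings = {
    ("washington", "PERSON"): {11},       # e.g., "George Washington"
    ("washington", "LOCATION"): {12, 17},
}

def typed_search(term, semantic_class=None):
    """Match on the word alone, or on word plus semantic class."""
    hits = set()
    for (word, label), doc_ids in postings.items():
        if word == term and semantic_class in (None, label):
            hits |= doc_ids
    return sorted(hits)

print(typed_search("washington"))              # → [11, 12, 17]
print(typed_search("washington", "LOCATION"))  # → [12, 17]
```

The first call behaves exactly like a plain keyword index, which is how flexible querying is preserved; the semantic class is an optional filter, not a replacement for the word index.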

As such, information extraction introduces natural language processing (NLP) technology into retrieval systems. Natural language processing has sought models for analyzing human language. These attempts were sometimes successful, but they also raised an awareness of the enormous magnitude of the task. NLP deals with the processing of the linguistic structure of text. This includes morphological, syntactic and semantic analysis of language and the extraction of discourse properties and domain knowledge or world knowledge, and it eventually aims at full natural language understanding. The problems of automatic text understanding concern the encoding of the necessary knowledge into a system and the construction of a proper inference mechanism, as well as the computational complexity of the necessary operations. Information retrieval technology has traditionally depended upon relatively simple but robust methods, while natural language processing involves complex knowledge based systems that have never approached robustness. The much more recent field of information extraction, which is a first step towards full natural language understanding, has reached a degree of maturity and robustness that makes it ready to be incorporated in information processing systems.

Also in Cross-Language Information Retrieval (CLIR), information extraction plays an important role. In CLIR, a query in one language searches a document base in another language. The semantic concepts that information extraction uses are mostly language independent and help to more accurately translate a query and match the translated query with the documents. To minimize the language specific alterations that need to be made when extending an information extraction system to a new language, it is important to separate the task specific conceptual knowledge the system uses, which may be assumed to be language independent, from the language dependent lexical knowledge the system requires, which unavoidably must be extended or learned for each new language. The language independent domain model can be compared to the use of an interlingua representation in Machine Translation (MT). An information extraction system, however, does not require full generation capabilities from the intermediate representation (unless the extracted information is translated), and the task will be well specified by a limited model of semantic classes rather than a full unrestricted world model. This makes an interlingua representation feasible for information extraction.


1.4 Information Extraction and Other Information Processing Tasks

People are not only interested in information retrieval systems that precisely answer an information query, they also use information systems that help in solving problems, such as data mining systems, systems that reason with knowledge, and systems that visualize, synthesize or summarize information. Given the current information overload, system assistance is more than welcome. Information extraction is a helpful step in these processes because the data become structured and semantically enriched.

A typical example is applying data mining techniques to the information found in texts. Examples are law texts and police reports that are parsed in order to analyze specific trends (Zeleznikow and Stranieri, 2005). There is an increasing interest in extracting knowledge from texts to be used in knowledge based systems. A knowledge based system uses knowledge that is formally represented in a knowledge representation language and reasons with this knowledge in the search for an answer to an information question. The knowledge can be in the form of sharable knowledge components such as an ontology or in the form of very specific knowledge that is used for performing a specific task. Information extraction technology is very useful to automatically build the knowledge rules and frames. A nice example from the legal domain is the automatic translation of legislation into knowledge rules. The rules can be used to infer the answer to a specific problem (e.g., Is the cultivation of Erythroxylon punishable?) (Moens, 2003). Similarly, technical documentation can be automatically translated into knowledge structures to be questioned at the time a problem occurs.

Information extraction semantically classifies textual units. As such, information extraction is related to text categorization and abstractive summarization. Text categorization classifies text passages or complete texts with semantic labels. These labels are usually used in the matching of an information query with the document in a retrieval context, or for filtering documents according to a certain user profile. Abstractive summarization replaces a text or a text passage by one or more abstract concepts. Although text categorization, abstractive summarization and information extraction techniques overlap to a large degree, the semantic labels in text categorization and abstracting capture the most salient information of a text in one or a few abstract concepts. Information extraction is here somewhat the opposite, as it allows finding detailed information, for instance with regard to a certain event. On the other hand, information extraction and text categorization complement each other in two directions. In a top down approach, very domain specific information extraction technologies can be selected based on a prior semantic classification of a complete text or text passage. In a bottom up approach, the detailed information labeled with information extraction technologies can contribute to a more fine-grained classification of a complete text or passage.

There is a large need to synthesize and summarize information that is, for instance, collected as the result of an information search. Information extraction technologies identify the entities that are involved in certain processes and the relations between these entities. Such information better permits generating concise headlines or compressed sentences that make up the summary.

1.5 The Aims of the Book

The main goal of the book is to give a comprehensive overview of algorithms used in information extraction from texts. Almost equal importance is given to early technologies developed in the field, primarily with the aim of natural language understanding, as to the most advanced and recent technologies for information extraction. The past approaches are an incentive to identify some forgotten avenues that can be researched to advance the state of the art. Machine learning is playing a central role in the development of the novel technologies and contributes to the portability and widespread use of information extraction technology. The book will especially focus on weakly supervised learning algorithms that are very promising for information extraction.

A second important aim is to focus on the prospects of information extraction to be used in modern information systems, and more specifically in information retrieval systems. We want to demonstrate that, on the one hand, the statistical and machine learning techniques traditionally used in information retrieval pay little attention to the underlying cognitive and linguistic models that shape the patterns that we are attempting to detect. On the other hand, current models of information retrieval that use expressive query statements in natural language (e.g., simple and complex question answering, a textual query by example) exhibit the need for a semantic based matching between the content of the query and the documents.

It is also argued that information extraction technology has evolved from a template extraction technique to a truly necessary labeling step in many information management tasks. In the past, as a result of the development of domain specific information extraction technology, lexicons were built that store subcategorization patterns for domain specific verbs in such a fashion as to permit using the patterns directly in a pattern matching engine. Here we look at information extraction differently. We see information extraction as a tool that aids in other tasks such as information retrieval. In such a framework, information extraction is seen as a kind of preprocessing and labeling of texts, which contributes to the performance of other tasks (here mainly information retrieval, but we also refer to data mining, knowledge discovery, and summarization). We aim at labeling information in open domain and closed domain texts, at identifying specific facts or more generic semantic information, and at storing this information in a format that allows flexible further processing. This evolution is sustained by content recognition techniques that are currently being developed for other media, such as images, to which the information extraction paradigm also applies.

Whereas traditional information extraction recognizes information in text in a deterministic way and represents this information in a format with known semantics, such as relational database fields, we leave room for probabilistic classification of the information and consequent probabilistic processing of the information (e.g., in a probabilistic retrieval model). This is an approach that better corresponds with the way we humans search for information in texts of a language we do not completely understand, but in which, from the combination of evidence, we can make a good guess and fairly accurately locate the information. This is also an approach that better fits the philosophy and tradition of information retrieval.
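As a toy illustration of such probabilistic processing (the spans, confidence scores and the independence assumption below are invented for this sketch, not a model from the book), a document's score for a semantic class can combine the per-span extraction probabilities instead of committing to hard labels:

```python
# Probabilistic extraction feeding retrieval: each extracted span keeps
# the classifier's confidence instead of a hard label. A document's
# score for a semantic class is the probability that at least one span
# truly has that class, assuming the spans are independent.

doc_spans = {
    "doc_a": [("PERSON", 0.9), ("LOCATION", 0.6)],
    "doc_b": [("LOCATION", 0.4), ("LOCATION", 0.5)],
}

def class_score(spans, wanted):
    """P(at least one span of the wanted class), under independence."""
    p_none = 1.0
    for label, p in spans:
        if label == wanted:
            p_none *= (1.0 - p)
    return 1.0 - p_none

ranked = sorted(doc_spans,
                key=lambda d: class_score(doc_spans[d], "LOCATION"),
                reverse=True)
print(ranked)  # → ['doc_b', 'doc_a']
```

Note how two weaker LOCATION mentions in doc_b (0.4 and 0.5) combine to 0.7 and outrank the single 0.6 mention in doc_a; a deterministic system that thresholds at 0.5 would have discarded one of them.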

In addition, we want to demonstrate that the form of semantic labeling that is advocated by information extraction technologies does not endanger the cherished flexibility of an information search. According to our definition given on p. 4, the extracted information is classified and structured. The information retrieval models can smoothly integrate this structured information when matching the information question with the document. In this respect, current XML (Extensible Markup Language) retrieval models that combine and rank structured and unstructured information are a source of inspiration (Blanken et al., 2003).

Last but not least, we want to illustrate the information extraction technologies and evaluate the results as they are currently implemented in many different application domains. The illustrative examples intend to demonstrate the generic character of current information extraction technologies and their portability to different domains. The evaluation should also lead to better insights into the points of attention for future research. We focus in this book on information extraction from text in natural language. Although information extraction takes into account certain characteristics of the language, and the examples are usually taken from the English language, our discussion is as language independent as possible. Many text documents currently have structure and layout tagged in markup languages such as XML and HTML (HyperText Markup Language). Such markup is not the focus of our attention.

The book is organized as follows.

In Chapter 2 we give a short historical overview of information extraction, starting from its early origins in the mid seventies, which coincide with early attempts at natural language understanding, where we do not ignore the influence of artificial intelligence authorities such as Roger Schank and Marvin Minsky. During this period the use of symbolic, handcrafted knowledge was very popular. We explain the general trend towards machine learning techniques that started in the early nineties and finish with the current interest in weakly supervised learning methods and hybrid technologies that combine the use of symbolic and machine learning techniques. The dive into a history of more than three decades focuses on the different factors of success and causes of difficulties. This chapter also explains typical information extraction tasks from both a linguistic theoretical and an application oriented viewpoint. The tasks include, among others, named entity recognition, noun phrase coreferent resolution, semantic role classification, and the recognition and resolution of temporal expressions. The chapter also describes the general architecture of an information extraction system and assisting linguistic and knowledge resources.

Chapter 3 gives an in depth discussion of some of the most important symbolic techniques for information extraction that use handcrafted knowledge. We start from the Conceptual Dependency Theory of Roger Schank, explain in detail frame based approaches, and discuss the use of finite state automata to parse the texts.

Chapter 4 offers an introduction to the current pattern recognition methods that use machine learning. A substantial part of this chapter is devoted to the features of the texts that are used in the information extraction tasks. The features include lexical, syntactic, semantic and discourse features found in the texts, as well as features derived from external knowledge resources.

Chapter 5 explains the most important and most successful supervised machine learning techniques currently in use in information extraction. We explain Support Vector Machines, maximum entropy modeling, hidden Markov models, conditional random fields, the learning of decision rules and trees, and relational learning. The theoretical background of each technology is illustrated with realistic examples of information extraction tasks.

Chapter 6 is devoted to unsupervised learning aids. Such aids have become very popular because of the high cost of manual annotation. The techniques described range from completely unsupervised methods such as clustering to weakly supervised methods such as co-training, self-training and active learning. Again, the theory of each technique is illustrated with a real information extraction example.

Chapter 7 integrates information extraction in an information retrieval framework. More specifically, it studies how information extraction is incorporated in the various existing retrieval models. A retrieval model matches query and document representations and ranks the documents, or the information found in the documents, according to relevance to the query. The different models that are discussed are the classical vector space model, the language model, the inference network model and the logic based model. Because information extraction leaves behind a bag-of-words representation of query and documents, we need adapted indexing structures that can be efficiently searched at the time of querying.

Chapter 8 discusses the evaluation metrics currently in use in information extraction. Evaluation metrics allow a comparison of technologies and systems. Classical metrics such as recall, precision and accuracy are discussed. Metrics that value the goodness of a clustering are also important to mention (e.g., to evaluate noun phrase coreferent resolution), as are the evaluation metrics currently in use in international competitions. The difference between an intrinsic and an extrinsic evaluation of information extraction is explained.

Chapter 9 elaborates on many recent applications of information extraction and gives the reader a good assessment of the capabilities of current technologies. Information extraction is illustrated with applications in news services and intelligence gathering, and with extracting content in the biomedical, business and legal domains. Finally, we study the special case of information extraction from noisy texts (e.g., transcribed speech) and its difficulties.

Chapter 10 summarizes our most important findings with regard to the algorithms used in information extraction and the future prospects of this technology. A section also elaborates on promising future improvements of the technologies.

Each chapter is accompanied by clear and illustrative examples, and the most relevant past and current bibliography is cited.

1.6 Conclusions

In this first chapter we have defined information extraction from text as a technology that identifies, structures and semantically classifies certain information in texts. We have demonstrated the importance of information extraction in many information processing tasks, among which information retrieval. In the next chapter, we give a historical overview of information extraction technologies, define in detail typical information extraction tasks, and outline the architecture of an information extraction system. This will give the reader a better understanding of information extraction needs as they have arisen in the course of the last decades and will smoothly introduce the different technologies that are discussed in the main parts of the book.

1.7 Bibliography

ACE: www.nist.gov/speech/tests/ace/

Appelt, Douglas E. and David J. Israel (1999). Introduction to information extraction technology. Tutorial at the International Joint Conference on Artificial Intelligence IJCAI-99: http://www.ai.sri.com/~appelt/ie-tutorial/

Appelt, Douglas E., Jerry R. Hobbs, John Bear, David J. Israel and Mabry Tyson (1993). FASTUS: A finite-state processor for information extraction from real-world text. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (pp. 1172-1178). San Mateo, CA: Morgan Kaufmann.

Baeza-Yates, Ricardo and Berthier Ribeiro-Neto (1999). Modern Information Retrieval. Harlow, UK: Addison-Wesley.

Berghel, Hal (1997). Cyberspace 2000: Dealing with information overload. Communications of the ACM, 40 (2), 19-24.

Blair, David C. (2002). The challenge of commercial document retrieval, part I: Major issues, and a framework based on search exhaustivity, determinacy of representation and document collection size. Information Processing and Management, 38 (2), 273-291.

Blanken, Henk M., Torsten Grabs, Hans-Jörg Schek, Ralf Schenkel and Gerhard Weikum (Eds.) (2003). Intelligent Search on XML Data, Applications, Languages, Models, Implementations and Benchmarks (Lecture Notes in Computer Science, 2818). New York, NY: Springer.

Cardie, Claire and Kiri Wagstaff (1999). Noun phrase coreference as clustering. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (pp. 82-89). East Stroudsburg, PA: ACL.

Cowie, Jim and Wendy Lehnert (1996). Information extraction. Communications of the ACM, 39 (1), 80-91.

Cunningham, Hamish (1997). Information Extraction: A User Guide. Research memo CS-97-02. Sheffield: University of Sheffield, ILASH.

Edmunds, Angela and Anne Morris (2000). The problem of information overload in business organisations: A review of the literature. International Journal of Information Management, 20, 17-28.

Eikvil, Line (1999). Information Extraction from the World Wide Web: A Survey. Norwegian Computer Center, Report no. 945, July 1999.


Farhoomand, Ali F. and Don H. Drury (2002). Managerial information overload. Communications of the ACM, 45 (10), 127-131.

Grishman, Ralph and Beth Sundheim (1996). Message Understanding Conference 6: A brief history. In Proceedings of the 16th International Conference on Computational Linguistics (pp. 466-471). San Mateo, CA: Morgan Kaufmann.

Kiparsky, Paul (2002). On the Architecture of Panini's Grammar. Three lectures delivered at the Hyderabad Conference on the Architecture of Grammar, January 2002, and at UCLA, March 2002.

Lewis, David D. and Karen Sparck Jones (1996). Natural language processing for information retrieval. Communications of the ACM, 39 (1), 92-101.

Lyman, Peter and Hal R. Varian (2003). How Much Information? 2003. URL:http://www.sims.berkeley.edu/research/projects/how-much-info-2003/

Maybury, Mark (Ed.) (2003). New Directions in Question Answering. Papers from the 2003 AAAI Spring Symposium. Menlo Park, CA: The AAAI Press.

Moens, Marie-Francine (2002). What information retrieval can learn from case-based reasoning. In Proceedings JURIX 2002: The Fifteenth Annual Conference (Frontiers in Artificial Intelligence and Applications) (pp. 83-91). Amsterdam: IOS Press.

Moens, Marie-Francine (2003). Interrogating legal documents: The future of legal information systems? In Proceedings of the JURIX 2003 Workshop on Question Answering for Interrogating Legal Documents, December 11, 2003 (pp. 19-30). Utrecht University, The Netherlands.

O’Neill, Edward T., Brian F. Lavoie and Rick Bennett (2003). Trends in the evolution of the public web, 1998-2002. D-Lib Magazine, 9 (4).

Riloff, Ellen and Jeffrey Lorenzen (1999). Extraction-based text categorization: Generating domain-specific role relationships automatically. In Tomek Strzalkowski (Ed.), Natural Language Information Retrieval (pp. 167-196). Dordrecht, The Netherlands: Kluwer Academic Publishers.

Shen, Dan, Jie Zhang, Jian Su, Guodong Zhou and Chew-Lim Tan (2004). Multi-criteria-based active learning for named entity recognition. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (pp. 590-597). East Stroudsburg, PA: ACL.

Soderland, Stephen, David Fisher, Jonathan Aseltine and Wendy Lehnert (1995). CRYSTAL: Inducing a conceptual dictionary. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (pp. 1314-1319). San Mateo, CA: Morgan Kaufmann.

Sundheim, Beth M. (1992). Overview of the fourth Message Understanding Evaluation and Conference. In Proceedings of the Fourth Message Understanding Conference (MUC-4) (pp. 3-21). San Mateo, CA: Morgan Kaufmann.

Szabó, Zoltán Gendler (2004). Compositionality. In Edward N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Fall 2004 Edition).

Zeleznikow, John and Andrew Stranieri (2005). Knowledge Discovery from Legal Databases. New York, NY: Springer.



2 Information Extraction from an Historical Perspective

With Rik De Busser

2.1 Introduction

This chapter presents an historical overview of information extraction spanning more than three decades. It also explains the evolution from systems that use symbolic, handcrafted knowledge towards systems that train from labeled and eventually unlabeled examples. The historical overview also allows us to introduce the most common information extraction tasks and the common architecture of an extraction system. The most important algorithms will be discussed in detail in Chap. 3, which focuses on systems that use symbolic, handcrafted knowledge, and in Chaps. 4, 5 and 6, which discuss the machine learning approaches.

2.2 An Historical Overview

2.2.1 Early Origins

At the end of the sixties, Roger C. Schank introduced a revolutionary model to parse natural language texts into formal semantic representations (Schank, 1972; Schank, 1975)1 and very soon his Conceptual Dependency

1 An exposition of an early version of his theory can be found in Schank (1972). The basic principles of CD theory as it is known among computational linguists today are explained in Schank (1975).


Theory (CDT) gained an enormous popularity. Schank's basic assumption is that "there exists a conceptual base that is interlingual, onto which linguistic structures in a given language map during the understanding process and out of which such structures are created during generation" (Schank, 1972, p. 553 ff.). This implies that any two linguistic structures that have the same meaning ought to be represented by identical conceptual structures (even when they are of a different language). These conceptual structures or conceptualizations – as Schank called them – are composed of primary concepts, the interconnections of which are governed by a closed set of universal conceptual syntax rules and a larger set of concept-specific conceptual semantic rules. For the first time, a more or less comprehensive model had been developed that not only made it possible to semantically analyze entire texts, but that was also fit for practical implementation into artificial intelligence systems for the extraction of semantic information, something that would later be called information extraction.

In its infancy, Conceptual Dependency Theory mainly aimed to extract semantic information about individual events from sentences at a conceptual level. The main categories of concepts are PPs (i.e., picture producers, in other words, concrete nouns) and actions. Relations between concepts are dependencies. The main conceptualization of a clause is a two-way dependency between a PP (the actor) and an action. Schank defined natural language words in terms of conceptual primitives or predicates. The syntax of the conceptual level is described by a set of rules which specify which type of concepts can depend on which other type, as well as the different kinds of dependency relationships between concepts. Soon after its design, the theory developed into a fairly comprehensive system to embed these event analyses into full scenarios. Schank's theory had and still has an enormous influence on information extraction technologies.
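As a toy illustration of such a conceptualization (our own sketch, not Schank's implementation), the two-way dependency between an actor and a conceptual primitive, with case-like dependents attached, can be written as a small data structure:

```python
# Toy sketch of a conceptual dependency structure: an actor (a PP, i.e. a
# picture producer), a conceptual primitive act, and its dependents.
from dataclasses import dataclass, field

@dataclass
class Conceptualization:
    actor: str                      # the PP acting as actor
    action: str                     # a conceptual primitive (predicate)
    dependents: dict = field(default_factory=dict)

# "John gave Mary a book" using Schank's ATRANS primitive (transfer of an
# abstract relationship such as possession); slot names are illustrative.
c = Conceptualization(
    actor="John",
    action="ATRANS",
    dependents={"object": "book", "recipient": "Mary", "donor": "John"})
```

The conceptual syntax rules mentioned above would constrain which categories may fill which of these dependent positions.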

Schank's unconventional theory mothered a research group at Yale University in the late 1970s and early 1980s, which developed several prototypes of information extraction tools that were able to extract information from texts with a previously unseen accuracy – be it in a very limited domain.

One of the early systems developed by the Yale School, called SAM, parsed a text into its full CD structure. In a first stage, a natural language analyzer maps the input text into conceptual dependency structures sentence by sentence, while filling out all implicit classes of concepts by using expectation routines. Using the output of the language analyzer, a second module – the script applier – tries to understand the story offered by the text. It matches a given input with a script in its database and it uses the script to predict which information is likely to follow in the input string



and to fill out any conceptualizations left implicit. The script applier is assisted by a conceptual parser, which parses the input into individual conceptualizations and disambiguates word senses, and a memory module, which maps information about physical objects onto references to a real world model. At the end of the analysis, the script applier outputs a fully instantiated conceptual dependency network of the text, which can be accessed by the post-processing modules. Schank and Abelson (1977) mention a summarizer and a question-answering module.

Conceptual Dependency Theory has been implemented in numerous other applications, most of which were not developed beyond the test stage. The majority of these systems did not use fully developed CD scripts (which are supposed to contain all information about any event which could possibly occur in a given situation), but so-called sketchy scripts, which only contain the most crucial or most relevant conceptualizations. This approach allows a text to be only partially analyzed, while the rest is ignored or skipped; in other words, only certain information is extracted. Such an analysis is commonly referred to as partial parsing.

One of the most typical systems using sketchy scripts is undoubtedly FRUMP. The Fast Reading Understanding and Memory Program was developed at Yale University for skimming newspaper articles (DeJong, 1977). At the time of writing of DeJong's 1982 article, sketchy scripts for the interpretation modules of FRUMP had been manually constructed for 60 different situations. The interpretation of texts takes place in two modules, which DeJong calls the predictor and the substantiator. On the basis of the current context, the predictor predicts which conceptualizations or parts of conceptualizations are likely to follow in the partly analyzed text and passes its results on to the substantiator. The latter will try to fill out the predicted structures, either by finding a word or phrase from the input text matching a slot filler specification proposed in one of the predictions, or by drawing an inference based on the input text and the relevant CD structure proposed by the predictor. When the substantiator succeeds in verifying one of the predictions, the predictor adds it to the current context. If none of the predictions can be verified, the predictor backtracks and makes new suggestions. In this way the entire text is processed sentence by sentence. Since FRUMP's routines are expectation driven – as all algorithms based on conceptual dependency are – the system needs to be initialized: At the start of the analysis it has not yet built up a context on which to base its predictions. It somehow has to be able to activate one or more relevant scripts to allow the substantiator to create an initial context or when the substantiator does not succeed in verifying the predictions.


Therefore, FRUMP provides activation routines for sketchy scripts. A script is triggered by words or phrases in the text, by events already detected in the texts and by a related script that is already active. After processing the text, most – but not all – empty slots in the script will be filled. FRUMP stores the partly instantiated script containing the information about the article and when encountering a new article related to the same situation, it will use it to update the partly instantiated script.

Computer understanding of human narratives is a classic challenge for natural language processing and artificial intelligence. The understanding regards the extraction of the chronology and the plot. Rumelhart (1975, 1977) has proposed the idea of so-called story grammars for understanding and summarizing text. He analyzed stories into hierarchical structures. The system of Lehnert (1982) built a plot unit connectivity graph for narrative text. Plot units have the form of propositions and are composed of affect states (e.g., positive events, negative events, mental states) that are linked by four types of relations (motivation, actualization, termination and equivalence). The recognition of affect states is based on a large predictive knowledge base containing knowledge about plans, goals and themes. The analysis of the story in terms of plot units results in a complex network where some units are subordinated to others. These graph type representations of a story structure still have their influence today (Mani, 2003).

Also interesting to mention is the Linguistic String Project (LSP) at New York University, which began in 1965 with funding from the National Science Foundation to develop computer methods for structuring and accessing information in the scientific and technical literature. Document processing was to be based on linguistic principles, first to demonstrate the possibility of computerized grammatical analysis (parsing), then to extend to specialized vocabulary and rules for particular scientific domains. Domain specialization led to an elaboration of the methods of sublanguage analysis (Sager, 1981), in particular as applied to the language of clinical reporting in patient documents and to the extraction of information. The project still has an influence on medical language processing.

2.2.2 Frame Theory

Schank's research inspired many scientists. Though his theory was rarely implemented in its pure form, it boosted research into frame based methods for knowledge representation. The notion of a frame system was explicitly formulated for the first time by Minsky in his groundbreaking paper on knowledge representation structures for artificial intelligence. Minsky defined a frame as follows:


A frame is a data structure for representing a stereotyped situation, like being in a certain type of living room or being at a child's birthday party. (Minsky, 1975, p. 212)

A frame stores the properties or characteristics of an entity, action or event. It typically consists of a number of slots referring to the properties named by the frame, each of which contains a value (or is left blank). The number and type of slots will be chosen according to the particular knowledge to be represented. A slot may contain a reference to another frame. Other features of frames offer advantages: They include the provision of a default value for a particular slot in all frames of a certain type, the use of more complex methods for inheriting values and properties between frames, and the use of procedural attachments to the frame slots. When frames have mutual relationships, a semantic net of frames can represent them.

Minsky's frames pervaded AI research in the second half of the 1970s. They would remain the major data representation structures for information extraction applications up to the late 1990s. Instantiated conceptual frames are often stored in a semantic network that can afterwards be accessed by the question-answering module.

In the 1980s information extraction became a hot topic. A multitude of algorithms were designed that were influenced by Schank's Conceptual Dependency theory, many of which used frames. Famous systems are TESS, which analyzes bank telexes (Young and Hayes, 1985), SCISOR, which was used to process news articles on corporate mergers and acquisitions from the online information service of the Dow Jones (Jacobs and Rau, 1990), CONSTRUE (Hayes and Weinstein, 1991) and FASTUS (Appelt, Hobbs, Bear, Israel and Tyson, 1993). A nice overview of other systems developed at the end of the 1980s and 1990s is given by Hahn (1989) and Jacobs (1992). Some of the systems first parse the text into its syntactic structure before instantiating the frames and mapping the information to the frame slots. An example is FASTUS, which will be discussed extensively in Chap. 3.

From their introduction in the 1970s on, frames managed to preserve their position in the world of knowledge representation with a remarkable tenacity. As we will see further in this chapter, the current FrameNet project (Baker et al., 1998; Fillmore and Baker, 2001) is creating an online lexical resource for English, based on frame semantics (Fillmore, 1968). An important component of the resource is a frame database containing descriptions of lexical units for English. The descriptions include the


conceptual structure of frames that represents a lexical item and descriptions for the elements (semantic roles) which participate in such a structure. The FrameNet database is available in XML format and translated into a DAML + OIL (DARPA Agent Markup Language + Ontology Interface Layer) knowledge representation.
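The frame machinery described in this section – slots with values, default fillers, inheritance between frames, and slots that point to other frames – can be sketched in a few lines. This is a toy illustration of the general idea, not the representation used by any system mentioned here.

```python
# Minimal sketch of Minsky-style frames: named slots, inheritance of slot
# values along a parent chain, and slots that may reference other frames.
class Frame:
    def __init__(self, name, parent=None, **slots):
        self.name, self.parent, self.slots = name, parent, slots

    def get(self, slot):
        # Look up a slot locally, then inherit from the parent frames.
        if slot in self.slots:
            return self.slots[slot]
        return self.parent.get(slot) if self.parent else None

event = Frame("event", location=None, time=None)
birthday_party = Frame("birthday-party", parent=event,
                       activity="party games", food="cake")
# An instance frame fills some slots and inherits the defaults of others:
instance = Frame("party-2024", parent=birthday_party,
                 location="living room")
```

Here `instance.get("food")` inherits the default filler from the `birthday-party` frame, while `location` is filled locally; procedural attachments would replace the simple lookup with arbitrary computation.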

2.2.3 Use of Resources

The many practical information extraction systems that were developed in the 1980s and 1990s showed the need for a number of additional resources when processing texts.

First of all, we need a module for tokenization and sentence segmentation (Palmer, 2000). Tokenization breaks a text into tokens or words. It distinguishes words, components of multi-part words (e.g., splitting the Dutch term onroerendgoedmarkt into markt van onroerend goed, meaning market of real estate) and multiword expressions (e.g., in spite of). In space delimited languages (such as most European languages) a word or token can be defined as a string of characters separated by white space. In unsegmented languages (such as Chinese, Thai and Japanese), you need additional lexical and morphological information that can be found in a word list in the form of a machine-readable dictionary. Alternatively, statistical techniques can be used to learn which characters are most likely to form words based on co-occurrence statistics (e.g., use of the mutual information statistic, chi-square statistic, etc.). During lexical analysis a text is usually split into sentences.
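The co-occurrence idea for unsegmented languages can be illustrated with pointwise mutual information between adjacent characters; a high score suggests the pair belongs to one word, a low score suggests a boundary. The corpus below is a toy character sequence; a real system would estimate the counts from a large corpus and tune a threshold.

```python
# Sketch of the mutual information statistic on adjacent characters, as
# used to hypothesize word boundaries in unsegmented languages.
import math
from collections import Counter

corpus = "北京大学生北京大学北京大学生"   # toy character stream
chars = Counter(corpus)
pairs = Counter(zip(corpus, corpus[1:]))
n_chars, n_pairs = sum(chars.values()), sum(pairs.values())

def pmi(a, b):
    """Pointwise mutual information of the adjacent character pair (a, b)."""
    p_ab = pairs[(a, b)] / n_pairs
    if p_ab == 0.0:
        return float("-inf")
    return math.log2(p_ab / ((chars[a] / n_chars) * (chars[b] / n_chars)))

# Characters that nearly always co-occur score high; segmentation
# boundaries are hypothesized where adjacent-character PMI is low.
score = pmi("北", "京")
```

In this toy stream the pair 北京 ("Beijing") scores higher than the accidental pair 生北 that straddles a word boundary, which is exactly the signal a statistical segmenter exploits.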

Usually language dependent rules are incorporated, for instance, for the resolution of apostrophes or hyphens. For most languages and texts, punctuation reliably indicates sentence boundaries. However, punctuation marks are sometimes ambiguous. A period, for example, can denote a decimal point or a thousands marker, an abbreviation, the end of a sentence, or even an abbreviation at the end of a sentence.
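Such period ambiguity is typically handled with simple language dependent rules. A naive sketch (the abbreviation list is illustrative and far from complete):

```python
# Naive rule-based sentence boundary detection: a period ends a sentence
# unless it closes a known abbreviation. A decimal point is never followed
# by white space, so the boundary pattern below skips it automatically.
import re

ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?](\s+|$)", text):
        end = match.start() + 1
        token = text[start:end].split()[-1]   # word ending at this mark
        if token in ABBREVIATIONS:
            continue                          # not a sentence boundary
        sentences.append(text[start:end].strip())
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

sents = split_sentences(
    "Dr. Smith paid 3.5 million. The deal closed. It was final.")
```

The example returns three sentences, with neither "Dr." nor "3.5" mistaken for a boundary; production systems refine this with machine learned boundary classifiers.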

Overall, normalization is considered to be a necessary processing step in any application that involves textual data. It comprises harmonizing spelling and capitalization and cleaning up unnecessary metadata. For some applications, stemming or lemmatization (i.e., restoring words to their root or dictionary form, respectively) can be useful.2

Another step is the enrichment of the textual data with linguistic metadata that will be used as features in the extraction process. To this end, a

2 Splitters are often incorporated in stemmers as affixes sometimes have to be removed.



number of natural language processing tools can be used. For most applications, they include part-of-speech (POS) tagging (i.e., detecting the syntactic word class such as noun, verb, etc.) and phrase chunking (i.e., detecting base noun and verb phrases). Syntactic structure is often indicative of the information distribution in a sentence. For many applications, a rudimentary syntactic analysis is sufficient, which is often referred to as shallow parsing. Shallow parsing aims to recover fragments of syntactic structures as efficiently and as correctly as possible. It can be implemented in different ways. For example, phrasal analysis can be accomplished by bracketing the output of a part-of-speech tagger (see Church, 1988). In some cases additional parsing (i.e., breaking up a sentence into its constituents and building the dependency tree of a sentence), or even full parsing might be desirable. Full parsing aims at providing an analysis of the sentence structure that is as detailed as possible. This might include the translation into a canonical structure (e.g., argument structure) in which processes (e.g., as expressed by verbs) of sentences and their arguments are delimited. Sometimes, sentence constituents are classified (e.g., into subject, object, semantic roles). For a more complete overview of natural language processing in general, we refer the reader to Allen (1995).

A treebank can also be a useful resource. A treebank can be defined as a syntactically processed corpus that contains annotations of natural language data at various linguistic levels: often at word, phrase, clause and sentence levels. A treebank provides mainly the morphosyntactic and syntactic structure of the utterances within the corpus and consists of a bank of linguistic trees, hence its name. The type of annotations differs, however, between treebanks. The descriptions can be based on various linguistic theories, such as a dependency grammar (e.g., the Prague Dependency Treebank for Czech, the Turin University Treebank for Italian, the Turkish treebank METU) or a Head-driven Phrase Structure Grammar (e.g., the HPSG-based Syntactic Treebank of Bulgarian, and the Polish and Verbmobil HPSG Treebanks). One of the most well known treebanks is the Penn treebank. Currently, there are also ongoing treebank projects for several languages such as Chinese, Dutch, French, Portuguese, Spanish, Turkish, etc.

Additional lexical resources in machine readable form that offer knowledge of synonymy, hypernymy, hyponymy and meronymy are valuable. Synonymy involves the use of different lexical items that express the same or a closely related word sense (e.g., sound and noise). Strict synonymy almost never occurs, since word forms describing the same concept tend to differentiate their meanings. For instance, sound and noise refer to the same referent, but they have a different meaning: Noise has a slightly negative connotation (e.g., an obnoxious sound) whereas the meaning of


sound is neutral. Hypernymy regards describing a term with a more general term (e.g., tree is a hypernym of oak). Hyponymy describes a term with a more specific term (e.g., apple is a hyponym of fruit). Meronyms are related through a part-whole relation (e.g., leg is a part of body). The relations discussed here can be found in a lexico-semantic resource such as WordNet for English (Miller, 1990).

Other lexico-semantic resources such as FrameNet are valuable. As mentioned above, the Berkeley FrameNet project is creating an online lexical resource for English, based on frame semantics and supported by corpus evidence (Baker et al., 1998; Fillmore and Baker, 2001). The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer assisted annotation of example sentences and automatic tabulation and display of the annotation results. The major product of this work, the FrameNet lexical database, currently contains more than 8,900 lexical units, more than 6,100 of which are fully annotated, in more than 625 semantic frames, exemplified in more than 135,000 annotated sentences. FrameNet documents the manner in which frame elements (for given words in given meanings) are grammatically instantiated in English sentences based on attested instances of contemporary English and organizes the results of such findings in a systematic way. The FrameNet database can be seen both as a dictionary and as a thesaurus. The former signals, for instance, the definition of a lexical item and gives access to annotated examples illustrating each syntactic pattern found in the corpus and the kinds of semantic information instanced with such patterns. The database also acts as a thesaurus in that, by being linked to frames, each word is directly connected with other words in its frame(s), and further extensions are provided by working out the ways in which a word's basic frames are connected with other frames through relations of inheritance (possibly multiple inheritance) and composition.

Tools that analyze the discourse structure of a text might be integrated in an extraction system. They include topic segmenters and recognition modules of rhetorical structures. Topic segmentation of texts concerns the detection of the overall organization of the text into themes or topics and the identification of text segments that correspond to the general and more specific topics. Existing segmentation algorithms usually produce what is called a linear segmentation of the text, assuming that the main topics or subtopics are sequentially organized (e.g., Hearst, 1997; Kan et al., 1998). Innovative topic segmentation algorithms allow detecting the hierarchical and sequential topical segments including semantic returns of a topic at different levels of topical detail (Moens, 2006). Rhetorical Structure Theory (RST) was developed in the second half of the 1980s at the University


of Southern California as a comprehensive linguistic theory for determining the textual coherence structure of monologue discourse (Mann and Thompson, 1987). RST assumes a text to have a hierarchical organization based on asymmetrical, and recursively definable, nucleus-satellite relationships. A rhetorical parsing algorithm segments a text into its rhetorical structure tree (Marcu, 2000).

Many information extraction applications will also use named entity recognition and coreference resolution. Since both detect and/or connect referents of basic semantic entities in text, they are usually considered to be forms of information extraction (see infra).

2.2.4 Machine Learning

Notwithstanding the success of the information extraction systems in the 1980s and 1990s, there was a growing concern with making information extraction systems easily portable to domains other than the one a system was built for and eventually with using information extraction in open domains. Here, the high cost of the manual pattern drafting and the knowledge acquisition involved made researchers investigate the possibilities of machine learning approaches.

The application of machine learning methods to aid the information extraction task goes back to work on the learning of verb preferences in the 1980s, which is published by Grishman and Sterling (1992) and Lehnert and Sundheim (1991) in the early 1990s. Other interesting research is early work on lexical knowledge acquisition by Kim and Moldovan (1993) and especially by Riloff and Lehnert (1993) on the famous AutoSlog system (Riloff, 1996). Soderland (1999) has done many information extraction experiments by using Muggleton's ILP (Inductive Logic Programming) system (Muggleton, 1991).

Most of these systems use supervised techniques to learn extraction patterns. The pattern recognizers or classifiers train from a training base of classified examples. The general idea is that a human expert annotates the fragments that should be extracted in a small corpus of training documents, and then the learning system generalizes from these examples to produce a function or rules that can be applied to previously unseen instances. The underlying idea is that it is easier to annotate documents than to write extraction rules, since the latter requires some degree of programming expertise and usually relies on the skills of the knowledge engineer to anticipate the extraction patterns. Although for some applications symbolic, handcrafted knowledge is more convenient, we see a gradually increasing interest in machine learning techniques from the second half of the 1990s onwards.


Several research experiments have demonstrated that supervised learning techniques produce very good results compared to systems that use handcrafted patterns. The results approach those that depend upon handcrafted rules. The AutoSlog system of Riloff (1996) constructed a dictionary of extraction patterns for the MUC-4 terrorism domain that achieved 98% of the performance of the handcrafted patterns. The research of Soderland (1999) endorses these findings. In recent years supervised learning techniques have become very popular in information extraction. The more current and most successful algorithms are discussed in detail in Chap. 5. They include Support Vector Machines, maximum entropy modeling, hidden Markov models, conditional random fields, learning of decision rules and trees, and relational learning.

Since the second half of the 1990s, we see a definite interest in using unsupervised or semi-supervised learning for information extraction. Apparently the cost of annotation is still a major handicap in developing large scale information extraction systems or when porting an existing system to another domain. Many bootstrapping technologies that employ forms of weakly supervised learning were developed, which we will discuss in detail in Chap. 6. The aim is to learn a pattern recognizer from a small number of labeled examples and to improve the classifier by using the unlabeled examples, or at least to learn a classifier whose performance is equal to one trained on the full labeled set. In addition, there are approaches that aim at eliminating the need for manual annotation entirely.
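The bootstrapping idea can be caricatured in a few lines: seed entities yield the contexts they occur in, contexts are promoted to extraction patterns, and the patterns harvest new entities from unannotated text. The corpus, the single seed and the promote-everything scoring below are toy choices of our own, not those of any system discussed in Chap. 6.

```python
# Caricature of weakly supervised bootstrapping for entity harvesting.
from collections import Counter

corpus = [
    "headquartered in Paris",
    "headquartered in Berlin",
    "the mayor of Paris",
    "the mayor of Oslo",
    "flights to Oslo",
]
seeds = {"Paris"}                      # a single labeled example

for _ in range(2):                     # a few bootstrapping iterations
    # 1. Collect contexts (here: the text preceding a known entity).
    contexts = Counter(
        line[: line.index(e)] for line in corpus for e in seeds if e in line)
    patterns = set(contexts)           # real systems score and filter here
    # 2. Apply the patterns to label new candidate entities.
    for line in corpus:
        for ctx in patterns:
            if ctx and line.startswith(ctx):
                seeds.add(line[len(ctx):])
```

After two iterations the seed set has grown to Paris, Berlin and Oslo: Berlin is reached through the shared context "headquartered in", Oslo through "the mayor of". The danger of semantic drift, which real systems counter with pattern scoring, is exactly what the naive promote-everything step above ignores.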

Most of the work in machine learning regards the acquisition of patterns for entity classification, entity relation recognition, semantic role classification, and recognition and resolution of temporal expressions. Very little research on automatically learning complete scenarios and scripts exists. An initial impetus is found in the learning of structured patterns by means of kernel methods, hidden Markov models, conditional random fields and relational learning as discussed in Chap. 5.

2.2.5 Some Afterthoughts

One of the most elusive properties of the human mind is without any doubt the ability to relate utterances in a text or conversation to some kind of conceptual model of the world. Despite more than three decades of research into natural language understanding in general and information extraction in particular, we have not yet succeeded in providing computers with this ability. Apart from the obvious computational complexity of the task, there are two main obstacles to putting real understanding into computers. First of all, determining the exact meaning of an utterance requires a considerable amount of knowledge about the text genre, about relevant conventions and implicitly assumed common background knowledge, and of data about the world in general to which the utterance refers. Any sophisticated form of natural language understanding would therefore require a comprehensive implementation of world knowledge (or for some applications domain knowledge) and of textual and conversational models that are necessary to interpret the exact function of an utterance in the flow of reasoning.

A second problem is that the relationships between language and meaning are still surprisingly unclear. This is because, on the one hand, the exact processes that a language user employs to encode meaning into language are not as straightforward as the relationships between other linguistic strata. On the other hand, there is no agreement at all as to how these relationships, insofar as they are known, should be formally implemented in a computational-linguistic framework. For instance, it is relatively easy to design procedures that within a certain error margin unambiguously assign part-of-speech tags (e.g., noun, adjective) to words in an utterance, since the relationship between a word in context and its part-of-speech is relatively straightforward and most researchers in the field will more or less agree on its annotation. It is not so easy to do the same for a word in context and its meaning. Consider for example the following sentence.

The prime minister dissolved the parliament. (2.1)

For instance, the word parliament has six distinct definitions in the Merriam-Webster Online Dictionary, and an NLP system will have to contain a considerable amount of knowledge about how parliaments are constituted and how they work in order to be able to determine that the definition “the supreme legislative body of a usually major political unit that is a continuing institution comprising a series of individual assemblages” conveys the meaning that is most relevant to sentence (2.1). For a complete understanding at the sentence level, the system would have to know that prime ministers are persons who are related to parliaments (although they are not necessarily a member of it); that in some countries prime ministers have the authority to dissolve or disband a parliament; and that the fact that a prime minister is involved implies that the parliament referred to is neither a medieval institutional body in England, nor a French court of justice that existed before the revolution of 1789. In realistic natural language applications, the world knowledge that would be needed for word sense disambiguation could be partially replaced by a semantic network, in which only abstract semantic links between linguistic entities exist. But even though it would be relatively uncontroversial to define a membership relation between parliamentarian and parliament or a part-whole relation between chamber and parliament, there is no such obvious relationship between prime minister and parliament. For example, in Belgium a prime minister cannot be a member of parliament, has no official authority to dissolve it, but can partake in sessions of one of its chambers and can in particular situations instigate the parliament’s dissolution (although only the king can effectively dissolve it). Even if someone should succeed in defining a set of relations that would be both uncontroversial and generally applicable, it remains to be seen whether it is feasible to build a semantic network of the size that will be necessary for real world applications.

When we move from the level of individual semantic entities (which is mainly the domain of lexical semantics) to the level of entities in context (which is primarily concerned with event analysis) the situation is at the same time more complex and more hopeful. Despite the fact that since the dawn of artificial intelligence a multitude of theories has been constructed for event analysis, no unified framework currently exists for describing the relationships between textual utterances and conceptualizations of events in a way that is useful in natural language processing. Grossly schematizing the complex field of event semantics, a rough distinction can be made between truth semantics, conceptual semantics, temporal semantics and modal semantics. Temporal and modal semantics are respectively concerned with the temporal allocation and the certainty or necessity of events. Truth semantics deals with the sufficient and necessary conditions for making valid judgments about event descriptions, i.e., its primary aim is making statements about the truth value of linguistic entities referring to events. Truth semantics is inherently conceived as a formalized model for describing event statements and for performing operations on them. It does not contain any information about the conceptualization of linguistic entities, i.e., it might be able to say whether a statement is valid or not, but it cannot tell anything about what exactly the statement is about. Therefore a semantic framework is needed that describes the relationships between linguistic entities and a conceptualization of entities and events in the real world. What exactly such a cognitive world model might be, how language users acquire it and how it should be implemented in NLP are all matters of heated dispute, which we will steer away from. From all the competing (and often partially complementary) theories that exist, we will only remark the following.

In frame theory, event types are encoded as semantic frames, each frame consisting of a number of attribute-value pairs, which are called frame elements. Despite its obvious potential for

natural language processing and information extraction in particular, extraction systems that rely on frame theory have often been developed in an ad hoc fashion and built to cover a very limited subject domain. The introduction of a more generic theoretical linguistic framework when defining the frame semantics would be advantageous, as the portability and the applicability of the semantics are increased. One linguistic theory that could fulfill this task is systemic-functional grammar.

Systemic-functional grammar (Halliday, 1994; Halliday and Matthiessen, 1999; Butler, 2003) starts from the hypothesis that humans perceive reality through a mediating set of fundamental conceptual categories, which are reflected in the lexico-grammatical constructs of a language. These categories reflect that human observers primarily conceive the world around them as a never-ending series of (consecutive and parallel) actions and states. As a consequence, any linguistic expression can be analyzed in terms of the events that it describes, the entities that somehow are part of that event, and its worldly setting. Any linguistic description of a single event is centered around a process of a particular type. A process in its turn consists of a number of semantic roles: the process role itself, which describes an event in the real world; a restricted number of participant roles, which describe the real world entities partaking in that event; and an – in theory – unrestricted number of circumstantial roles, which describe the general setting of the process. Participants and circumstances characterize the “Who, what, when and how?” (or the “Who did what to whom, when and how?”) expressed in sentences or phrases. Examples of process categories are Material, Verbal, Mental, Behavioral, Existential and Relational. The participant and circumstantial roles are process-category specific (e.g., Mary (Material_Actor) gave (Material_Process) John (Material_Beneficiary) a book (Material_Goal) or Mary (Sayer) said: (Verbal_Process) “Hi, John” (Verbiage)). Systemic-functional process categories imply certain properties of the participants involved in them, e.g., Behavioral processes entail certain person-like experiences (i.e., mental experience). These implications (or entailments) have grammatical reflexes. Behavioral processes are analyzed as distinct from, for instance, Material processes in that they do not (without added syntactic machinery) have ergative constructions such as Mary smiled John. Circumstantial elements simply express temporal, locative, durative, etc. semantic content.

Systemic-functional grammar offers very general semantic roles that might be further specified when building information extraction systems, but that can make up for the lack of semantic understanding of the text when certain extraction patterns for the specific semantic characterizations are lacking. In addition, their classification contributes to the disambiguation of the meaning of an utterance or lexical item.

We conclude this historical framework with a final remark that is important if we want to build information extraction systems and incorporate them in information processing systems. Cognitive linguistics adheres to the belief that, rather than existing independently of meaning, grammar is symbolic in nature and inherently meaningful (Langacker, 1999). It is claimed that all elements validly posited in grammatical description reside in the pairings of conceptualizations and ways of symbolizing them. Among these elements are grammatical markers, categories, relations, roles and constructions. This means that, if we are building information extraction patterns, grammar is important.

In the rest of this chapter we will define the typical information extraction tasks that have been implemented in working information extraction systems and define a general architecture of an information extraction system.

2.3 The Common Extraction Process

2.3.1 The Architecture of an Information Extraction System

In this section, we will go deeper into the typical components of an information extraction system, the types of information it can extract, and the theoretical foundations for assuming that it is possible to extract these kinds of information.

Figure 2.1 shows that the architecture of an operational information extraction system typically has two distinct phases: a training phase and a deployment phase. In the training phase, the knowledge engineer or the system itself acquires the necessary extraction patterns, the latter referring to the use of machine learning. In a first step, a text corpus is selected that is representative for the task the system is intended for (Fig. 2.1, T1).

Before the texts can be used for extrapolating extraction rules from them, they usually go through a preprocessing phase (T2) in which their formal characteristics are normalized. Another step belonging to the preprocessing phase is the enrichment of the textual data with linguistic metadata that will be used as parameters in the acquisition process (T2.2). To this end, a number of natural language processing tools can be used (see Sect. 2.2.3).
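As a minimal sketch of these preprocessing steps, the fragment below tokenizes a sentence and attaches part-of-speech metadata from a toy lexicon. The lexicon and tag names are invented here; a real system would use a full tokenizer and statistical tagger such as those of Sect. 2.2.3.

```python
# Sketch of preprocessing (T2/T2.2): tokenize, normalize case, and
# enrich each token with POS metadata from a (toy) lexicon.
import re

def preprocess(text, pos_lexicon):
    """Split text into word and punctuation tokens and pair each token
    with a part-of-speech tag; unknown words are tagged 'UNK'."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [(tok, pos_lexicon.get(tok.lower(), "UNK")) for tok in tokens]

lexicon = {"the": "DET", "prime": "ADJ", "minister": "NOUN",
           "dissolved": "VERB", "parliament": "NOUN"}
print(preprocess("The prime minister dissolved the parliament.", lexicon))
# [('The', 'DET'), ('prime', 'ADJ'), ('minister', 'NOUN'),
#  ('dissolved', 'VERB'), ('the', 'DET'), ('parliament', 'NOUN'),
#  ('.', 'UNK')]
```

The point of the metadata is that later learning stages can condition extraction patterns on tags rather than on raw word forms.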

In the manual approach, an information specialist uses the preprocessed training corpus during the learning phase (T3) as a basis for writing an extraction grammar. In the case of a machine learning approach, the training corpus is usually first manually annotated to indicate which elements in the texts are relevant for the extraction task, and the machine learning module uses these annotations in the learning phase (T3) to automatically induce the extraction grammar from the corpus. The extraction grammar can here take the form of a mathematical function that predicts the class of an example. It is also possible that the training corpus is not manually labeled, or is only partially annotated, referring respectively to unsupervised and weakly supervised techniques.

Fig. 2.1. A typical information extraction system.

In the deployment phase (D1-4), the information extraction system identifies and classifies relevant semantic information in new texts, i.e., texts that were not included in the training corpus. The preprocessing component in the deployment phase (D2) is as similar as possible to that in the learning phase. After preprocessing, the input texts are passed on to the extraction phase (D3), which uses the extraction grammar (K1) as it was learned in the learning step and possibly some additional knowledge (K2) to determine which elements in the input texts are relevant for the extraction task and how they relate to certain semantic classes. It extracts these textual elements from the texts, classifies them and outputs them in a structured format (D4). Some systems also have a feedback mechanism (K3) in which the final output of the system is corrected and used for retraining the learning component (incremental learning). Existing literature usually does not focus on the real world implementation of information extraction, but on development and testing, and as a consequence the deployment phase is often called the evaluation or testing phase.
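The training/deployment split can be caricatured in a few lines of code. This is our own drastic simplification: the "extraction grammar" is reduced to a word-to-class dictionary, standing in for whatever rules or learned function a real system (Fig. 2.1) would induce.

```python
# Toy version of the two-phase architecture: learn_grammar plays the
# role of the learning phase (T3); extract plays the extraction phase
# (D3) and structured output (D4). Labels and data are invented.
def learn_grammar(annotated_corpus):
    """T3: induce extraction 'rules' from (word, label) annotations,
    ignoring tokens annotated as O (outside any semantic class)."""
    return {word: label for word, label in annotated_corpus if label != "O"}

def extract(grammar, text):
    """D3/D4: apply the grammar to a new text and emit a structured
    mapping from extracted elements to their semantic classes."""
    return {w: grammar[w] for w in text.split() if w in grammar}

grammar = learn_grammar([("IBM", "Company"), ("works", "O"),
                         ("Smith", "Person")])
print(extract(grammar, "Smith now works at IBM"))
# {'Smith': 'Person', 'IBM': 'Company'}
```

Note how the deployment text was never seen during training; only the learned mapping carries over, which is exactly the contract between the two phases.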

It is obvious that the main task of an information extraction system is the extraction of semantic information from texts, and we already mentioned that this information is defined in advance of the extraction process. Different kinds of semantic information can be extracted from any one text, depending on the size of the linguistic units that are targeted in the extraction process and the linguistic context that is covered by the system. As seen above, we will call the former the extraction unit or text region, and the latter the linguistic context.

2.3.2 Some Information Extraction Tasks

There are a number of typical information extraction tasks that have lately been extensively researched with regard to open domain information extraction. They include named entity recognition, noun phrase coreference resolution, semantic role recognition, entity relation recognition and timeline recognition. This is not an exhaustive listing of extraction tasks. We will often use these example tasks to illustrate the extraction algorithms discussed in the following chapters.

Named Entity Recognition

Named entity recognition identifies and classifies named expressions in text (such as person, company, location or protein names).

Example:

John Smith works for IBM. (2.2)
Person               Company
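As a hedged illustration of the task, the sketch below combines a small gazetteer with a capitalization heuristic. Both the gazetteer and the class names are invented here; real recognizers rely on the learned or handcrafted patterns discussed in the following chapters.

```python
# Toy named entity recognizer: gazetteer lookup first, then a fallback
# rule that title-cased tokens are candidate person names.
import re

GAZETTEER = {"IBM": "Company"}          # invented, minimal gazetteer
TITLE_CASE = re.compile(r"^[A-Z][a-z]+$")

def tag_entities(tokens):
    """Assign an entity class per token; 'Person?' marks an uncertain
    guess based only on capitalization, 'O' marks non-entities."""
    tags = []
    for tok in tokens:
        if tok in GAZETTEER:
            tags.append((tok, GAZETTEER[tok]))
        elif TITLE_CASE.match(tok):
            tags.append((tok, "Person?"))
        else:
            tags.append((tok, "O"))
    return tags

print(tag_entities(["John", "Smith", "works", "for", "IBM"]))
# [('John', 'Person?'), ('Smith', 'Person?'), ('works', 'O'),
#  ('for', 'O'), ('IBM', 'Company')]
```

The capitalization rule alone obviously overgenerates (sentence-initial words, titles), which is why context features and learning are needed in practice.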


Noun Phrase Coreference Resolution

Two or more noun phrases are coreferent when they refer to the same entity in the situation described in the text. Many references in a text are encoded as phoric references, i.e., linguistic elements that, rather than directly encoding the meaning of an entity, refer to a direct description of the entity earlier or later in the text. They are respectively called anaphoric and cataphoric references.

Example:

Bill Clinton went to New York, where he was invited for a keynote speech. The former president ... (2.3)

Bill Clinton, he and the former president refer in this text to the same entity. He is an anaphoric reference.
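A deliberately naive sketch of resolving the anaphor in example (2.3): link a pronoun to the most recent preceding person mention. The mention list is assumed to be pre-chunked and pre-classified; real resolvers also need gender, number and salience features.

```python
# Toy anaphora resolution: each pronoun is linked to the nearest
# preceding mention classified as Person.
def resolve_pronouns(mentions, pronouns=frozenset({"he", "she"})):
    """mentions: list of (phrase, entity_class) pairs in text order.
    Returns a map from pronoun positions to their antecedents."""
    links, last_person = {}, None
    for i, (phrase, cls) in enumerate(mentions):
        if cls == "Person":
            last_person = phrase
        elif phrase.lower() in pronouns and last_person is not None:
            links[i] = last_person
    return links

doc = [("Bill Clinton", "Person"), ("went", "O"), ("to", "O"),
       ("New York", "Location"), ("he", "O")]
print(resolve_pronouns(doc))   # {4: 'Bill Clinton'}
```

The recency heuristic already shows why named entity recognition is useful before coreference resolution: the resolver needs to know which candidates are persons at all.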

Semantic Role Recognition

Semantic role recognition regards the assignment of semantic roles to the (syntactic) constituents of a sentence. The roles regard certain actions or states, their participants and their circumstances. Semantic roles can be very generally defined (e.g., the roles defined by the theory of systemic-functional grammar: cf. p. 37 ff.) or be more specific (e.g., as found in the FrameNet database and the roles shown in the example below).

Example:

She clapped her hands in inspiration. (2.4)
Agent       Body part Cause
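The assignment can be sketched with a tiny hand-made frame lexicon. The frames and role names below are our own invention, not taken from FrameNet, and the constituents are assumed to be already parsed.

```python
# Toy semantic role labeling: a verb lexicon maps each verb to an
# ordered list of roles, which is zipped with the parsed constituents.
FRAMES = {
    "clapped": ["Agent", "Body_part"],          # invented frame
    "gave":    ["Agent", "Recipient", "Theme"], # invented frame
}

def label_roles(verb, constituents):
    """Pair each constituent (in syntactic order) with the role slot
    the verb's frame assigns to that position."""
    roles = FRAMES.get(verb, [])
    return list(zip(roles, constituents))

print(label_roles("clapped", ["She", "her hands"]))
# [('Agent', 'She'), ('Body_part', 'her hands')]
```

Positional zipping is of course too rigid for passives or scrambled word order; statistical role labelers condition on syntactic paths instead of raw positions.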


Entity Relation Recognition

The relation between two or more entities is detected, and the relation is possibly typed with a semantic role.



Example:

John Smith works for IBM. (2.5)
Person     Relation  Company
           (works for)
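A sketch of the task for example (2.5), using a single lexico-syntactic pattern. The regular expression is ours and covers only this one construction; real systems combine many such patterns or learn them, and operate on recognized entities rather than raw surface strings.

```python
# Toy relation extraction: one surface pattern linking a person-shaped
# string and a company-shaped string via the phrase "works for".
import re

WORKS_FOR = re.compile(r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) works for "
                       r"(?P<company>[A-Z][A-Za-z]+)")

def extract_relations(text):
    """Return (subject, relation, object) triples found in the text."""
    return [(m.group("person"), "works_for", m.group("company"))
            for m in WORKS_FOR.finditer(text)]

print(extract_relations("John Smith works for IBM."))
# [('John Smith', 'works_for', 'IBM')]
```

Structured triples like these are what the "structured format" output stage (D4) of the architecture in Sect. 2.3.1 ultimately produces.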

Timex and Time Line Recognition

A first task is timex (i.e., temporal expression) detection and recognition in text. Temporal expressions to be marked include both absolute expressions (July 17, 1999, 12:00, the summer of ’69) and relative expressions (yesterday, last week, the next millennium). Also noteworthy are durations (one hour, two weeks), event-anchored expressions (two days before departure), and sets of times (every week). From the recognized timexes, the time line of different events can be reconstructed. Basic temporal relations are: X before Y, X equals Y, X meets Y, X overlaps Y, X during Y, X starts Y, X finishes Y. Recognizing a time line involves sophisticated forms of temporal reasoning.

Example:

On April 16, 2005 I passed my final exam. The three weeks before I studied a lot. (2.6)

March 26, 2005 -> April 15, 2005: Study
April 16, 2005: Exam
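The relative timex of example (2.6) can be resolved by anchoring it to the recognized absolute date and applying simple date arithmetic, as in this sketch (the function name and interval convention are ours):

```python
# Resolving "the three weeks before <anchor>" into a concrete interval
# via date arithmetic; the interval runs up to the day before the anchor.
from datetime import date, timedelta

def resolve_weeks_before(anchor, n_weeks):
    """Return the (start, end) interval of the n weeks before anchor."""
    end = anchor - timedelta(days=1)
    start = anchor - timedelta(weeks=n_weeks)
    return start, end

exam = date(2005, 4, 16)
start, end = resolve_weeks_before(exam, 3)
print(start, "->", end)   # 2005-03-26 -> 2005-04-15
```

This reproduces the interval in the example above; full temporal reasoning additionally has to order such intervals with relations like before, overlaps and during.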

Table 2.1 shows the selected information extraction tasks. Note that the extraction unit for these particular tasks, i.e., the unit to be semantically classified or structured, is quite small and spans several word groups at most. The linguistic contexts used in the classification and the eventual goals of the tasks differ. The linguistic context enlarges as the goal of the understanding grows in scope.

The above extraction tasks are rather domain independent, but they already allow identifying many of the details of an event (e.g., time, location). Domain dependent extraction tasks can be defined to complement an event description (e.g., the number of victims of a terrorist attack, the symptoms of a disease of a patient).


Table 2.1. Examples of information extraction tasks, their respective extraction units and linguistic contexts, and the eventual goal of the information extraction.

Information extraction task          Extraction unit     Linguistic context            Eventual goal
Named entity recognition             Word/word group     Sentence/text                 Entity understanding
Noun phrase coreference resolution   Word/word group     Sentence/text/multiple texts  Entity understanding
Semantic role recognition            Word/word group     Sentence                      Sentence understanding
Entity relation recognition          Words/word groups   Sentence/text/multiple texts  (Multi-text) discourse/story understanding
Timeline extraction                  Words/word groups   Sentence/text/multiple texts  (Multi-text) discourse/story understanding

At this level, information extraction is mainly interested in the extraction of information about individual events (and states), the status of participants in these events, and their spatial, temporal, causal, … setting.

Events in the real world never exist in isolation, but rather are part of more complex events that are causally linked to each other. Humans recognize these linked events as event complexes because they stereotypically occur in a certain order. We call these stereotyped event complexes scripts or scenarios. The eventual goal of information extraction at a textual level is to recognize scenarios and to link them to abstract models that reflect complex events in the real world. In some cases, the analysis of texts into scenarios might not be really meaningful. Policy reports or court decisions, for instance, might not contain a real event structure.


Complex events are not the largest semantic structures that can be found in textual data. They are often part of multi-event structures that can be ordered chronologically to represent an entire story. These structures usually span multiple texts and are to be distinguished from scenarios in that causality is often not as important as chronology. Eventually, it should be possible to extract complex chronologies of events from entire text corpora and to locate scenarios, single events and the entities participating in these events in their temporal and causal setting.


2.4 A Cascade of Tasks

Many of the extraction tasks rely on the results of other information extraction tasks. Typically a bottom-up analysis is performed in several stages. For instance, it is valuable to perform named entity recognition before noun phrase coreference resolution (e.g., in the above example (2.3) of noun phrase coreference resolution it is valuable to know that Bill Clinton is a person before resolving the anaphor he). Defining the semantic roles of a sentence’s constituents can be performed prior to the classification of relations between entities. It is also impossible to determine the scenario underlying a text without first being able to identify individual events, since that is exactly what a scenario is: a chain of events that is ordered in a meaningful, structured way.

Future information extraction systems may well be cascaded systems in which the output of one information extraction system is used as input of another information extraction system (e.g., the results of one information extraction task are used as features for training another information extraction system that more globally understands the text) (see Fig. 2.2). Actually, the foundations for such an approach were already laid by the FASTUS information extraction system, which we will discuss in detail in the next chapter.
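Such a cascade can be sketched as a chain of stages, each consuming the previous stage's output. The stub stages below are our own toy stand-ins for full IE systems and use invented lexicons.

```python
# Toy cascade: named entity recognition feeds a coreference stage that
# uses the NER labels as features, mirroring the idea of Fig. 2.2.
def ner_stage(tokens):
    """Stage 1: tag tokens with entity classes from a tiny lexicon."""
    lexicon = {"Clinton": "Person"}
    return [(t, lexicon.get(t, "O")) for t in tokens]

def coref_stage(tagged):
    """Stage 2: link the pronoun 'he' to the last Person mention,
    relying on the labels produced by the previous stage."""
    out, last = [], None
    for tok, cls in tagged:
        if cls == "Person":
            last = tok
        out.append((tok, cls, last if tok == "he" else None))
    return out

def cascade(tokens, stages):
    """Run the stages in order, piping each output into the next."""
    result = tokens
    for stage in stages:
        result = stage(result)
    return result

print(cascade(["Clinton", "said", "he", "agreed"],
              [ner_stage, coref_stage]))
# [('Clinton', 'Person', None), ('said', 'O', None),
#  ('he', 'O', 'Clinton'), ('agreed', 'O', None)]
```

The second stage could not do its job without the first, which is the essential property of the cascaded design.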

2.5 Conclusions

In this chapter we outlined the history of information extraction. The historical perspective allowed us to smoothly introduce some important – and mainly domain independent – information extraction tasks. The architecture of a classical information extraction system was explained, together with some possible future improvements. In the next chapters we discuss the most important information extraction algorithms. Chap. 3 explains the techniques that rely on symbolic, handcrafted extraction patterns.


Fig. 2.2. A cascaded information extraction system.




2.6 Bibliography

Allen, James (1995). Natural Language Understanding. Redwood City: Benjamin/Cummings.

Appelt, Douglas E., Jerry R. Hobbs, John Bear, David J. Israel and Mabry Tyson (1993). FASTUS: A finite-state processor for information extraction from real-world text. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (pp. 1172-1178). San Mateo, CA: Morgan Kaufmann.

Baker, Collin F., Charles J. Fillmore and John B. Lowe (1998). The Berkeley FrameNet project. In Proceedings of the COLING-ACL ’98 Joint Conference (pp. 86-90). San Francisco, CA: Morgan Kaufmann.

Butler, Christopher S. (2003). Structure and Function: A Guide to Three Major Structural-Functional Theories. Amsterdam, The Netherlands: John Benjamins.

Church, Kenneth (1988). A stochastic parts program and noun phrase parser for unrestricted texts. In Proceedings of the Second Conference on Applied Natural Language Processing. Austin, Texas.

DeJong, Gerald (1977). Skimming newspaper stories by computer. In Proceedings of the 5th International Joint Conference on Artificial Intelligence (p. 16). Cambridge, MA: William Kaufmann.

DeJong, Gerald (1982). An overview of the FRUMP system. In Wendy G. Lehnert and Martin H. Ringle (Eds.), Strategies for Natural Language Processing (pp. 149-176). Hillsdale, NJ: Lawrence Erlbaum.

Fillmore, Charles J. (1968). The case for case. In Emmon Bach and Robert T. Harms (Eds.), Universals in Linguistic Theory (pp. 1-88). New York, NY: Holt, Rinehart, and Winston.

Fillmore, Charles J. and Collin F. Baker (2001). Frame semantics for text understanding. In Proceedings of the WordNet and Other Lexical Resources Workshop.

Hahn, Udo (1989). Making understanders out of parsers: Semantically driven parsing as a key concept for realistic text understanding applications. International Journal of Intelligent Systems, 4, 345-393.

Halliday, Michael A.K. (1994). An Introduction to Functional Grammar. London: Arnold.

Halliday, Michael A.K. and Christian M.I.M. Matthiessen (1999). Construing Experience Through Meaning: A Language-based Approach to Cognition. London: Cassell.

Grishman, Ralph and John Sterling (1992). Acquisition of selectional patterns. In Proceedings of the 14th International Conference on Computational Linguistics (COLING) (pp. 658-664). Morristown, NJ: ACL.

Hayes, Philip J. and Steven P. Weinstein (1991). CONSTRUE/TIS: A system for content-based indexing of a database of news stories. In 2nd Annual Conference on Innovative Applications of Artificial Intelligence (pp. 49-64). Menlo Park, CA: AAAI Press.

Hearst, Marti A. (1997). TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23 (1), 33-64.

Jacobs, Paul S. (Ed.) (1992). Text-based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval. Hillsdale, NJ: Lawrence Erlbaum.

Page 56: Info Mat Ion Extractions

44 2 Information Extraction from an Historical Perspective

Jacobs, Paul S. and Lisa F. Rau (1990). “SCISOR”: Extracting information from on-line news. Communications of the ACM, 33 (11), 88-97.

Kan, Min-Yen, Judith L. Klavans and Kathy R. McKeown (1998). Linear segmentation and segment relevance. In Proceedings of the 6th International Workshop on Very Large Corpora (WVLC-6), Montréal, Québec, Canada, August 1998 (pp. 197-205).

Kim, Jun-Tae and Dan I. Moldovan (1993). PALKA: A system for lexical knowledge acquisition. In Proceedings of the Second International Conference on Information and Knowledge Management (CIKM ’93) (pp. 124-131). New York: ACM.

Langacker, Ronald W. (1999). Grammar and Conceptualization. Berlin: Walter De Gruyter.

Lehnert, Wendy G. (1982). Plot units: A narrative summarization strategy. In Wendy G. Lehnert and Martin H. Ringle (Eds.), Strategies for Natural Language Processing (pp. 375-412). Hillsdale, NJ: Lawrence Erlbaum.

Lehnert, Wendy, Claire Cardie, David Fisher, Joseph McCarthy and Ellen Riloff (1992). Description of the CIRCUS system as used for MUC-4. In Proceedings of the Fourth Message Understanding Conference MUC-4 (pp. 282-288). San Francisco, CA: Morgan Kaufmann.

Lehnert, Wendy and Beth Sundheim (1991). An evaluation of text analysis techniques. AI Magazine, 12 (3), 81-94.

Mani, Inderjeet (2003). Recent developments in temporal information extraction. In Proceedings of Recent Advances in Natural Language Processing, Borovets, Bulgaria, 10-12 September 2003 (pp. 45-60).

Mann, William C. and Sandra A. Thompson (1987). Rhetorical Structure Theory: A Theory of Text Organization. ISI Report ISI/RS-87-190. Marina del Rey, CA: Information Sciences Institute.

Marcu, Daniel (2000). The Theory and Practice of Discourse Parsing and Summarization. Cambridge, MA: The MIT Press.

Miller, George A. (Ed.) (1990). Special issue: WordNet: An on-line lexical database. International Journal of Lexicography, 3 (4).

Minsky, Marvin (1975). A framework for representing knowledge. In P.H. Winston (Ed.), The Psychology of Computer Vision (pp. 211-277). New York: McGraw-Hill.

Moens, Marie-Francine (2006). Using patterns of thematic progression for building a table of contents of a text. Journal of Natural Language Engineering (forthcoming).

Muggleton, Stephen H. (1991). Inductive logic programming. New Generation Computing, (4), 295-318.

Palmer, David D. (2000). Tokenisation and sentence segmentation. In Robert Dale, Herman Moisl and Harold Somers (Eds.), Handbook of Natural Language Processing (pp. 11-35). New York, NY: Marcel Dekker.

Riloff, Ellen (1996). An empirical study of automated dictionary construction for information extraction in three domains. Artificial Intelligence, 85, 101-134.

Riloff, Ellen and Wendy Lehnert. Automated dictionary construction for information extraction from text. In Proceedings of the Ninth IEEE Conference on Artificial Intelligence for Applications (pp. 93-99). Los Alamitos, CA: IEEE Computer Society Press.



Rumelhart, David E. (1975). Notes on a schema for stories. In D.G. Bobrow and A. Collins (Eds.), Representation and Understanding: Studies in Cognitive Science (pp. 211-236). New York, NY: Academic Press.

Rumelhart, David E. (1977). Introduction to Human Information Processing. New York, NY: John Wiley and Sons.

Sager, Naomi (1981). Natural Language Information Processing: A Computer Grammar of English and Its Applications. Reading, MA: Addison-Wesley.

Schank, Roger C. (1972). Conceptual dependency: A theory of natural language understanding. Cognitive Psychology, 3 (4), 532-631.

Schank, Roger C. (1975). Conceptual Information Processing. Amsterdam: North Holland.

Schank, Roger C. and Robert P. Abelson (1977). Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures. Hillsdale, NJ: Lawrence Erlbaum.

Soderland, S. (1999). Learning information extraction rules from semi-structured and free text. Machine Learning, 34 (1/3), 233-272.

Young, Sheryl R. and Philip J. Hayes (1985). Automatic classification and summarization of banking telexes. In The Second Conference on Artificial Intelligence Applications: The Engineering of Knowledge Based Systems (pp. 402-408). Washington, DC: IEEE Computer Society Press.


3 The Symbolic Techniques

With Rik De Busser

3.1 Introduction

In this chapter we go deeper into knowledge systems that are used for information extraction. They rely on symbolic knowledge that is handcrafted by a knowledge engineer, who is familiar with the knowledge formalism used by the extraction system and who on his or her own, or possibly with the help of an expert in the application domain, writes rules for the information extraction component. Typically the knowledge engineer has access to a moderately sized corpus of domain-relevant texts that can be manually inspected, and additionally he or she uses his or her background knowledge or that of the expert. As the previous chapter shows, a number of very interesting approaches have been developed during the last decades, some of which we want to discuss in more detail in this chapter.

3.2 Conceptual Dependency Theory and Scripts

Schank’s basic assumption was that “there exists a conceptual base that is interlingual, onto which linguistic structures in a given language map during the understanding process and out of which such structures are created during generation” (Schank, 1972, p. 553 ff.). These conceptual structures or conceptualizations are composed of primary concepts, the interconnections of which are governed by a closed set of universal conceptual syntax rules and a larger set of specific conceptual semantic rules. Schank distinguishes four classes of primary concepts, which are usually called roles:


Picture Producers (PP) represent physical objects.
Acts (ACT) represent primitive actions.
Picture Aiders (PA) modify PPs and usually represent a state with a specific value (i.e., an attribute).
Action Aiders (AA) modify ACTs.

Schank also specifies two other types of roles:
Location (LOC)
Time (T).

The core (and unfortunately also the most contested part) of CDT is the set of eleven ACTs, which should make it possible to represent (on their own or in different combinations) any action an actor can possibly perform in physical reality. For example, one of the most often used ACTs is PTRANS, which expresses a change of location of a PP (i.e., of an entity in the real world). In CDT, example sentence (3.1) will be transposed into a PTRANS construction, which can be graphically represented in CD theory as the diagram in Fig. 3.1.

Martin goes to Brussels. (3.1)

Fig. 3.1. CDT representation of sentence (3.1).

In the diagram, the double arrow indicates a relationship between an ACT and an actor (the entity that performs the ACT), O indicates an objective relationship and D a directive relationship. The entire construction means that there is a physical object Martin that performs the act of changing the location of Martin (himself) from an unknown location (indicated by X) to the location Brussels.

The concepts can only form combinations in accordance with a set of conceptual syntax rules, which are general constraints on the possible


combinations of different concept types. For instance, in Fig. 3.1 the construction Martin ⇔ PTRANS is only possible because a rule exists that states

PP ⇔ ACT Only a PP can occur as the actor of an act

Similarly, PTRANS ←O Martin is valid because of the existence of a rule

ACT ←O PP   ACTs can have objects

Similar rules exist for relating other roles (object, direction, recipient, instrument, …) to ACTs, for relating PPs to states, for relating locations and times to ACTs, for binding result states to the events they spring forth from, etc. In their entirety, these rules should form an exhaustive set that makes it possible to assemble primitive concepts into complex conceptualizations. For an overview of all conceptual syntax rules, we refer the reader to Schank (1975, p. 37 ff.).
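The flavor of such a rule system can be sketched as a small Python checker; the rule inventory below is illustrative and far smaller than Schank’s actual set:

```python
# Minimal sketch of conceptual syntax checking; the relation names and
# the rule set are our own illustration, not Schank's full inventory.
SYNTAX_RULES = {
    "actor": ("PP", "ACT"),      # PP <=> ACT : only a PP can act
    "object": ("ACT", "PP"),     # ACT <-O- PP : ACTs can have objects
    "direction": ("ACT", "LOC"), # ACT <-D- LOC : ACTs can have directions
}

def valid_link(relation, left_type, right_type):
    """Return True if a conceptual link of this relation type may
    connect concepts of the given classes."""
    return SYNTAX_RULES.get(relation) == (left_type, right_type)

# Martin <=> PTRANS is licensed: a PP may be the actor of an ACT.
print(valid_link("actor", "PP", "ACT"))   # True
# An ACT cannot itself fill the actor slot.
print(valid_link("actor", "ACT", "PP"))   # False
```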

In addition to these syntax rules, a set of conceptual semantic rules has been designed that puts semantic constraints on the combinatory possibilities of individual concepts. They will – among other things – prescribe exactly which roles a specific ACT has to take and which conditions a specific PP must fulfill to be eligible for a specific conceptual role slot. In the case of our example, there will be a conceptual semantic rule that can be schematically represented as Fig. 3.2.

Fig. 3.2. Conceptual semantic rule of (3.1).

This rule restricts the actor to some predefined semantic class CREATURE, it defines that the actor and object must refer to the same real world entity (CREATURE 1), and it prescribes that the actions contained in PTRANS must result in some form of spatial movement from a LOCATION 1 to some non-identical LOCATION 2.

3.2 Conceptual Dependency Theory and Scripts


Fig. 3.3. Simplified taking the bus script.1

Using the framework described so far, it is possible to construct conceptual representations of individual actions and – but to a lesser extent – of states. However, since Schank originally planned his theory to be a comprehensive formal model of human language understanding, it should not only

1 We simplified the notation and reduced the number of individual conceptualizations.


represent how humans – and by extension computers – can conceptually process the textual references to real world events and their participants, but it should also make it possible to represent interactions between these individual events in terms of conceptual dependency relationships. In the physical world, almost all interconnections involving actions and states are perceived in terms of causality, and thus Schank and Abelson (1977) introduced a set of causal syntax rules in CDT, which made it possible to construct complex causal chains of events by taking into account different types of causal relationships. We will not go into these rules in detail; an exhaustive overview can be found in Schank and Abelson (1977, p. 24 ff.).

Such a chain of conceptualizations (i.e., events) becomes a script when it describes “a predetermined, stereotyped sequence of actions that defines a well known situation” (Schank and Abelson, 1977, p. 41). In ordinary words, when a person or another entity performs a complex action in the real world, he/she/it will often use a stereotyped sequence of simple actions. These stereotyped sequences are represented in CDT as scripts. Figure 3.3 gives an example of a script for describing the complex event of taking a bus from one place to another.

In this script, some conceptualizations are vital: Without [3], [7] and [8] the bus script would simply not be a bus script. These are the main conceptualizations (or MAINCONs). So, each script is characterized by a limited number of typical conceptualizations that must be true for the script to be applicable to a particular situation. The other conceptualizations belonging to the script are very likely to occur in instances where it is applicable, but they can be violated without rendering the script itself inapplicable. It is likely, for example, that CR1 will pay for a ticket ([4] and [5]) after entering the bus, but when he would decide to sneak on the bus without paying, he would still be taking the bus.

Because of their strict internal organization, scripts have a predictive ability, which is exactly what makes them useful for information extraction applications. Scripts work very well on computers: They can be easily stored and accessed, which results in fast parsing times and a simple software architecture. The stylized structure of scripts makes it possible to make inferences. Especially this last ability is extremely useful for an information extraction system, since the parsing strategy using sketchy scripts will be able to predict which conceptualizations are likely to follow in the input string that is being processed and can use these predictions to facilitate parsing and to solve lexical ambiguity and anaphoric references.

When the Conceptual Dependency Theory is implemented in information extraction systems, fully developed CD scripts (which are supposed to contain all information about any event which could possibly occur in a given situation) are usually not used. So-called sketchy scripts only contain the most crucial or most relevant conceptualizations. Conceptualizations are usually internally represented as structures of the form

(role1 (var1, …, varm) ACT (var1, …, varp) role2 (var1, …, varq), …, rolek (var1, …, varr))

in which the entire structure corresponds to a conceptual syntax rule and the definition of the variables to conceptual semantic rules. These structures are used as templates: An interpretation module fills out the variable slots; the instantiated templates can later be used for other information processing tasks.
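The template-and-fill idea can be sketched in a few lines of Python; the slot names and constraint labels below are our own illustration, not a reconstruction of any particular CD implementation:

```python
import copy

# Hypothetical sketch of a PTRANS conceptualization used as a template:
# conceptual semantic rules become per-slot constraint labels, and an
# interpretation module fills the variable slots.
PTRANS_TEMPLATE = {
    "ACT": "PTRANS",
    "actor":  {"constraint": "CREATURE", "value": None},
    "object": {"constraint": "CREATURE", "value": None},  # corefers with actor
    "from":   {"constraint": "LOCATION", "value": None},
    "to":     {"constraint": "LOCATION", "value": None},
}

def instantiate(template, fillers):
    """Fill the variable slots of a fresh copy of the template."""
    frame = copy.deepcopy(template)
    for slot, value in fillers.items():
        frame[slot]["value"] = value
    return frame

# "Martin goes to Brussels." (3.1): actor and object Martin, goal Brussels.
frame = instantiate(PTRANS_TEMPLATE,
                    {"actor": "Martin", "object": "Martin", "to": "Brussels"})
print(frame["to"]["value"])  # Brussels
```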

Schank’s theory, however, had to deal with some criticism.

First of all, conceptual dependency is too domain dependent: In constrained domains it performs excellently, but when expanding the domain one has to write a script for every new situation that might be relevant to the analysis of the text. Scripts are insufficient to explain all mechanisms of anticipation in human thinking: Schank and Abelson (1977) point out that people can deal with situations never encountered before and, though in theory scripts would work perfectly well, they deem it impossible to construct a script for every conceivable situation. However, several research projects indicate that in most situations it is feasible to deal with a sufficient diversity of real world data using only scripts. Schank and Abelson (1977) also introduce the theoretical concept of plan, defining it as “a series of projected actions to realize a goal.”2 A specific plan is usually composed of a number of subgoals resulting in the end goal. The subgoals point towards lists of plans and scripts representing possible strategies to reach these goals.

2 A definition of scripts in contrast to plans can be found in Schank and Abelson (1977, p. 71 ff.).

The most often heard objection against Conceptual Dependency Theory is that it is ad hoc. Although it is very likely that conceptual primitives do exist in one form or another, some of Schank’s primitive ACTs seem to be rather ill chosen and devised to conceal the fact that the ACTs fail to cover all actions they are supposed to describe. As a consequence, there is semantic overlap between some concepts. A very obvious example is MOVE, which is defined as “to move a body part” but might as well be included in


PTRANS. On the other hand, there are some real world events that are very hard to describe in terms of conceptual dependency. Consider, for instance, the bus script that is explained in the previous section. In step 6 of that script, one possible subscript would be “to take a seat,” but when pondering upon its realization, one will instantly be confronted with difficult problems. It is in fact almost impossible to describe the sitting down event in a precise and distinctive sequence of ACTs, and the excessive complexity of such a construction might raise some questions as to how accurately it reflects human perception of this everyday event.

Another major problem with CD theory is that its theoretical underpinnings are extremely ill defined. Especially the parts dealing with plans and goals are too schematic where well-defined formal models are essential. A flaw which is even more difficult to mend is the fact that the plan and goal theory is inherently inconsistent: It uses ACTs to define scripts, scripts to define plans, and eventually plans to define ACTs (see Schank and Abelson, 1977) and thus undermines its own axiomatic base. At any level of analysis it will remain possible to decompose an entity into its constituent parts and, going down in circles, one will never meet an undividable, most fundamental theoretical concept. Furthermore, conceptual dependency theory is action driven; states and shifts from one state to another can be handled in CD, but there is no solution to adequately represent the – sometimes very subtle – differences between them. Schank (1975, p. 48) attempts to introduce several value scales, but he immediately admits that “these state scales do not really explain what a state means,” which is a serious shortcoming for a theory claiming to be a representation of man’s conceptualization of reality.

According to Schank two linguistic entities representing the same meaning will always have the same conceptual representation, whatever their outward form might be. Thus two linguistic occurrences referring to the same real world situation in two different languages ought to correspond to one single set of conceptualizations – in theory. In that case, CD could be used as an interlingua: A parser would simply have to map the input text into CD structures and the structures can be translated into another language. Indeed, when dealing with relatively simple conceptual networks, Schank’s language independence seems to work just fine (cf. DeJong, 1982). However, research into linguistic universals has proven that not all languages do have the same conceptual universe. In fact, human perception and the construction of a mental representation of reality are quite strongly determined by cultural factors, and it will become increasingly


problematic to use CD as a fully language independent internal representation when the complexity of its constructions has to be boosted.

Notwithstanding the criticisms, several aspects of CD are still valuable for current or future systems. The use of semantic frames to make predictions is one of them, as is the notion of conceptual primitives; and for some applications event driven scripts3 may be very useful.

3.3 Frame Theory

Essentially a frame is a knowledge representation structure that maps a class of entities or situations onto a limited set of relevant features that are expressed as type-value pairs (Minsky, 1975). Feature types are constant and fixed for a particular frame. Feature values are empty slots on which certain constraints are placed. An information extraction algorithm will try to match linguistic representations of real world occurrences with particular frames by simply filling out the slots in accordance with the constraints placed on them. A very simple frame for detecting dates, for instance, might be represented schematically as shown in Fig. 3.4.

When encountering the phrase on the 3rd day of August, the algorithm in Fig. 3.4 will instantiate a frame by filling out matching information and using functions to infer as many pieces of information as possible (Fig. 3.5).

DATE
Year : [yyyy]
Month : [m-name ∈ {January, …, December}]
Month-no : [m-number = integer and 0 < m-number ≤ 12; procedure p1 for calculating m-number if m-name is given]
Day : [d-name ∈ {Monday, …, Sunday}]
Day-no : [d-number = integer and 0 < d-number ≤ length(month)]

Fig. 3.4. Simple example of non-instantiated date frame.

3 Event is here used as a term encompassing both actions and states.
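As a rough sketch of how the frame of Fig. 3.4 could be instantiated in code (the parsing below is deliberately naive, and procedure p1 is reduced to a month-name lookup):

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def fill_date_frame(phrase):
    """Naive sketch of date-frame instantiation: match slot fillers in the
    phrase and infer Month-no from Month (procedure p1 of Fig. 3.4)."""
    frame = {"Year": None, "Month": None, "Month-no": None,
             "Day": None, "Day-no": None}
    m = re.search(r"\b(" + "|".join(MONTHS) + r")\b", phrase)
    if m:
        frame["Month"] = m.group(1)
        frame["Month-no"] = MONTHS.index(m.group(1)) + 1   # inference step
    d = re.search(r"\b(\d{1,2})(st|nd|rd|th)\b", phrase)
    if d:
        frame["Day-no"] = int(d.group(1))
    return frame

print(fill_date_frame("on the 3rd day of August"))
# {'Year': None, 'Month': 'August', 'Month-no': 8, 'Day': None, 'Day-no': 3}
```

A real implementation would add the control structures discussed below: checks that a four-digit number really is a year, handling of variant notations, and so on.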


Date | the 3rd day of August
Year :
Month : August
Month-no : 8
Day :
Day-no : 3

Fig. 3.5. Example of instantiated date frame.

This example is incomplete and quite simplistic. In the first place, the slots and the constraints placed on them are far from complete. For example, several control structures will be necessary to ensure that the slot fillers are correctly assigned to the right slot (e.g., not all four-digit numbers express a year) and to handle variant notations. Secondly, slots do not necessarily represent atomic values. A slot filler might as well be an entire subframe (or a pointer to a subframe). Furthermore, slots may contain certain default values. And most importantly, frames can – and in most environments: need to – function in an entire frame network.

In Minsky’s original proposal (1975), related frames are ordered in what he calls frame systems. Transformations between the frames of a system represent state changes in reality. Frame systems are in their turn incorporated into a general “information retrieval network,” the ordering of which can be used by the information extraction system to guide frame selection at initialization time. One of Minsky’s most productive ideas is that of frame nesting, in which a subframe (or a pointer to it) is used as the value of a frame feature. Needless to say, nesting can be useful in a context where multiple levels of analysis are necessary, as is the case in CD analysis – in which one needs an analysis phase for assembling the conceptualizations and another one for gathering them into scripts – and actually in any form of “deep” semantic analysis.

Frames are most often organized in generalization hierarchies as proposed by Winograd (1975). In a generalization hierarchy, concepts (represented by frames) are ordered by way of inclusion relationships based on inheritance of properties. For frame based approaches this simply implies that a subclass will automatically inherit all features of its superclass. Several variations are possible. On the most fundamental plane, one will have to choose whether a network will support multiple inheritance or whether it will restrict itself to single inheritance. In networks whose aim is to reflect relationships between concepts as they are perceived by a human understander, the former almost seems to be unavoidable, but it will


drastically heighten the chances for inconsistencies. Secondly, the network can be based on strict inheritance or it can support overwrite possibilities. In the latter case, subclasses will inherit feature values from their superclasses unless specified otherwise. The latter might be necessary in certain networks, but again it will increase the possibility of semantic incoherence. Generalization hierarchies are simple; yet they leave enough space for the algorithm to run through the network in a precise and meaningful way. Their construction is easy, notwithstanding the danger for internal inconsistencies. Some instability is likely to occur in larger networks, especially when using multiple inheritance. The major issue with hierarchies based on inclusion alone is that they only allow for grouping types with their sub- and supertypes, disregarding all other possibly relevant semantic relationships. An obvious solution is to construct a hybrid network in which inclusion is just one of many relationships.
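A single-inheritance hierarchy with overwrite possibilities can be sketched as follows; the frame names and features are invented for illustration:

```python
# Sketch of non-strict (overridable) inheritance in a generalization
# hierarchy: a frame inherits its superclass's features unless it
# redefines them locally. The hierarchy itself is a toy example.
HIERARCHY = {
    "vehicle": {"parent": None,      "features": {"movable": True}},
    "bus":     {"parent": "vehicle", "features": {"wheels": 6}},
    "minibus": {"parent": "bus",     "features": {"wheels": 4}},  # override
}

def features(frame):
    """Collect features along the superclass chain; local values win."""
    result = {}
    while frame is not None:
        node = HIERARCHY[frame]
        for k, v in node["features"].items():
            result.setdefault(k, v)   # keep the most specific value
        frame = node["parent"]
    return result

print(features("minibus"))  # {'wheels': 4, 'movable': True}
```

Supporting multiple inheritance would mean replacing the single `parent` pointer with a list and defining a conflict-resolution order, which is exactly where the inconsistencies mentioned above creep in.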

Relying on Winston (1975), Minsky also proposed to order related frames into a similarity network, in which difference pointers connect frames referring to related concepts. These differences can be used by the system when trying to select an appropriate frame to match a given input or when the input to match deviates only slightly from a selected frame. An example:

Based on certain features represented in the input, the algorithm could decide to start filling out a desktop frame. When at a certain time it would discover that some elements in its input do not match the exact conditions of a desktop PC, it can abandon its first guess and look for a more suitable frame on the basis of the differences defined in the network. For instance, some elements in the input could indicate that the real world object to be identified is portable; that its size is smaller than that of a regular desktop computer; and that it is therefore more productive to initialize a laptop frame.
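That selection strategy might be sketched like this (frame contents and difference pointers are invented for the example):

```python
# Sketch of frame selection in a similarity network: when the input
# contradicts the current frame, a difference pointer leads to a related
# frame that accounts for the mismatch.
FRAMES = {
    "desktop": {"portable": False, "screen": "large"},
    "laptop":  {"portable": True,  "screen": "small"},
}
# Difference pointers: (frame, mismatching feature) -> alternative frame
DIFFERENCES = {("desktop", "portable"): "laptop"}

def select_frame(start, observed):
    """Follow difference pointers until every observed feature is
    compatible with the current frame (or no pointer is left)."""
    frame = start
    while True:
        mismatch = next((f for f, v in observed.items()
                         if FRAMES[frame].get(f, v) != v), None)
        if mismatch is None or (frame, mismatch) not in DIFFERENCES:
            return frame
        frame = DIFFERENCES[(frame, mismatch)]

print(select_frame("desktop", {"portable": True}))   # laptop
print(select_frame("desktop", {"portable": False}))  # desktop
```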

Similarity networks were introduced for the analysis of vision. At first sight, they seem to be less applicable to the semantic analysis of language. And yet, a relatively young branch of corpus linguistics, prototype theory, developed in the second half of the 1990s, provides us with a theoretical foundation for similarity based lexico-semantic analysis (Geeraerts, 1997).

Prototype theory analyzes lexical items into bundles of binary features. Several occurrences of one item will turn out to develop a very restricted number of highly frequent feature constellations (the prototypical meaning of the lexical item) and a large number of peripheral semantic variations. Related lexical items can be combined into more elaborate structures, which will also develop into a network of largely overlapping items. Again, the more features a particular item has in common with the core of


a structure, the closer it will be to the prototype of the structure. Similarity networks have a flexibility that simple inclusion based networks lack, which is much related to human analogical reasoning. Prototype networks in addition offer a simple binary selection mechanism, which in theory ought to reduce the complexity of the network and to increase the accuracy (since many binary selections could be combined). They will more or less naturally order themselves into clusters and superclusters. However, some problems have to be dealt with: Defining the features or frame slots that are involved in the similarity computations is a difficult problem in any analogical reasoning (Kolodner, 1993).

Finally, inclusion based and similarity based approaches can be combined into one network. One does not even need to restrict oneself to similarity networks and generalization hierarchies. Any set of semantic relationships can be used to construct a semantic network, as long as it is based on a consistent framework that is able to avoid internal inconsistency. To function properly, it will be crucial for such a network to be constructed with painstaking accuracy.

Frames are still a popular way of representing knowledge. Obviously, frames have advantages, many of which have something to do with the fact that frames are a mode of knowledge representation. Whereas in the past frames were implemented as lists (e.g., in the programming language LISP), nowadays they are often coded as XML structures, possibly in XML supported knowledge representation languages such as DAML+OIL (DARPA Agent Markup Language + Ontology Interface Layer) or OWL (Web Ontology Language). The fundamental concept of frames remained unchanged throughout the years. Nevertheless, they have certain serious disadvantages. The largest one is that up till now frames had to be constructed manually. This forced developers to restrict themselves to information extraction systems covering very limited domains and caused performance rates to drop drastically once the system is applied to corpora they were not designed for. Moreover, it is probably not feasible to construct a broad domain frame ontology, in the first place because the construction of a network of that size would require huge amounts of work, and in the second place because there simply does not exist a semantic framework that would be able to cover all concepts that have to be covered for domain independent analysis and to unify them into a coherent network. In addition, it would be very difficult to design efficient search mechanisms for such a network. A possible solution would be working with several subnetworks that are called by an arbiter algorithm.


3.4 Actual Implementations of the Symbolic Techniques

3.4.1 Partial Parsing

As we have seen in Chap. 2, partial parsing is very often applied in information extraction systems, especially in frame-based approaches. Partial parsing refers to the situation in which a text is only partially analyzed: Only the content that is anticipated is analyzed, while the rest of the text is skipped. In partial parsing, the patterns to be recognized are often encoded as regular expressions, which are in turn translated into finite state machines. This is particularly interesting for frame-based approaches, since slot fillers can in many cases be directly identified in the text by looking at their context.

The parsing often relies on a grammar that captures how content is expected to be expressed. In many cases the patterns are expressed in a very simple form of grammar, namely a regular grammar. Regular grammars do not allow non-terminal symbols in their description. In the following example a simple arithmetic expression <ARITH> is described with a regular syntax (the letters and arithmetic operators are terminal symbols). The asterisk indicates zero, one, or more repetitions.

<ARITH> ::= ("a" | "b" | … | "z") (("+" | "-" | "*" | "/") ("a" | "b" | … | "z"))* (3.2)

Regular grammars are very well suited to represent textual patterns, and when implemented in finite state automata the parsing of the text can be realized in a short amount of time. A partial analysis of a text can be combined with part-of-speech tagging or other forms of shallow syntactic analysis (Abney, 1996). As demonstrated by the FASTUS system (see below), the processing of text based on regular grammars is very efficient.
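For instance, grammar (3.2) translates directly into a regular expression; a Python sketch:

```python
import re

# The regular grammar (3.2) as a regular expression: a letter, followed
# by zero or more (operator, letter) pairs.
ARITH = re.compile(r"[a-z](?:[+\-*/][a-z])*\Z")

print(bool(ARITH.match("a+b*c")))  # True
print(bool(ARITH.match("a++b")))   # False
```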

3.4.2 Finite State Automata

Finite state automata have been used extensively in information extraction tasks. Their use is motivated by the fact that many textual phenomena follow a partially fixed order. A finite state automaton is formally defined as follows (Partee et al., 1990):


A finite state automaton (fsa) M is a quintuple ⟨K, Σ, Δ, y0, F⟩, where:

K is a finite set, the set of states
Σ = {σ1, σ2, …, σk}, the alphabet of symbols
y0 ∈ K, the initial state
F ⊆ K, the set of final states
Δ is a relation from K × Σ into K, the transition relation.

Starting from the initial state y0, a fsa will move to a next state if it is correctly triggered by the presence of a member of the alphabet in the input string.4 When analyzing a valid string, it will eventually end up in one of its final states by repeating this process over a finite number of transitions. This signals a correct parse. When none of these final states can be reached – i.e., when at a certain point the automaton’s definition does not allow for a transition from y to y' given a certain input character, or when the end of the input string is reached and the automaton has not reached a final state yet – its analysis will fail. A finite state automaton can be deterministic or non-deterministic. The deterministic aspect is expressed in the fact that for every state and input symbol, exactly one outgoing transition is defined. This is not the case for a non-deterministic finite state automaton. In a finite state transducer, an output entity is constructed when final states are reached.
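The quintuple maps directly onto code. A sketch of a deterministic instance, with a toy transition function of our own choosing:

```python
# Sketch of a deterministic fsa as the quintuple (K, Sigma, Delta, y0, F):
# it accepts a string iff the transitions lead from the initial state to
# a final state; a missing transition means the parse fails.
def accepts(string, delta, y0, finals):
    state = y0
    for symbol in string:
        if (state, symbol) not in delta:
            return False
        state = delta[(state, symbol)]
    return state in finals

# Toy automaton over {a, b} accepting strings that end in "ab".
delta = {("q0", "a"): "q1", ("q0", "b"): "q0",
         ("q1", "a"): "q1", ("q1", "b"): "q2",
         ("q2", "a"): "q1", ("q2", "b"): "q0"}

print(accepts("aab", delta, "q0", {"q2"}))  # True
print(accepts("aba", delta, "q0", {"q2"}))  # False
```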

In Chap. 5 we will discuss probabilistic finite state automata where transitions and emissions of symbols of the alphabet are probabilistically modeled.

For full natural language analysis, finite state automata are not powerful enough (although lately much research has been done into finite state approximations of natural language). However, much improvement is possible by simply linking several automata into a network. A first strategy is building a finite state cascade. In a finite state cascade, multiple levels of analysis exist, each of which consists of a set of finite state automata. The output of each level is used as the input of the next one. A second approach is combining the fsas into an augmented transition network. In such a network, state transitions can contain pointers towards other automata. Although finite state machines may not be able to grasp language in all its complexity, there are good reasons for using them. It is relatively easy to implement and adapt them, since they are usually defined as a relatively small set of regular expressions. Furthermore, they can be incredibly fast, as Hobbs et al. (1996) observed when they compared the performance of their system FASTUS – which is based on a finite state cascade – with that of their original algorithm TACITUS – which tried to produce a full semantic analysis of the input. FASTUS’ finite state approach made it 180 times faster than its predecessor. We will now discuss the FASTUS system in detail.

4 Note that alphabet is here not used in its original sense. In automata theory, an alphabet can consist of any finite set of characters, numbers, or strings of characters or numbers.

In 1991 a team at SRI International (USA) started with the development of FASTUS, the Finite State Automaton Text Understanding System (Hobbs et al., 1996). Originally built as a preprocessing unit for the text understanding system TACITUS, it was further developed as a stand-alone information extraction system. Its makers describe FASTUS as “a set of cascaded, nondeterministic finite state transducers.”

The latest version of FASTUS has five processing levels. In a first step, compound words are identified, as well as proper names, dates, times and locations. For unknown words that are likely to be proper nouns due to their occurrence in a certain construction, specialized pattern rules have been made. A second stage performs a partial parse to identify noun phrases, verb phrases, and other meaningful elements like conjunctions and prepositional phrases.
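The level-to-level handoff of such a cascade can be sketched in miniature; the two toy levels below are our own and bear no relation to FASTUS’s actual grammars:

```python
import re

# Toy finite state cascade: level 1 tags tokens, level 2 groups a
# preposition plus a tagged date into a larger constituent.
def level1(tokens):
    """Tag each token; four-digit strings become DATE, the rest WORD."""
    return [("DATE" if re.fullmatch(r"\d{4}", t) else "WORD", t)
            for t in tokens]

def level2(tagged):
    """Merge the pattern WORD('in') DATE into a single PP-DATE group."""
    out, i = [], 0
    while i < len(tagged):
        if (i + 1 < len(tagged) and tagged[i] == ("WORD", "in")
                and tagged[i + 1][0] == "DATE"):
            out.append(("PP-DATE", "in " + tagged[i + 1][1]))
            i += 2
        else:
            out.append(tagged[i])
            i += 1
    return out

print(level2(level1(["founded", "in", "1991"])))
# [('WORD', 'founded'), ('PP-DATE', 'in 1991')]
```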

Company name   Bridgestone Sports Co.
Verb group     said
Noun group     Friday
Noun group     it
Verb group     had set up
Noun group     a joint venture
Preposition    in
Location       Taiwan
Preposition    with
Noun group     a local concern
And            and
Noun group     a Japanese trading house
Verb group     to produce
Noun group     golf clubs
Verb group     to be shipped
Preposition    to
Location       Japan

Fig. 3.6. Example output of the second stage of the FASTUS system.


For instance, the sentence:

Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. (3.3)

will produce the output shown in Fig. 3.6. Word classes deemed irrelevant (many adjectives and adverbs) and unknown words are simply skipped. Using a very limited number of regular expressions, FASTUS is able to recognize nearly all noun phrase complexes possible in English. A very simple example of these rules for noun groups (Hobbs et al., 1992):

NG → { Pro | N[timenp] }

will recognize any noun group that contains an independently used pronoun or a noun expressing a time that can constitute a noun phrase by itself. In his 1992 article, Hobbs mentions only seventeen general nouns for noun groups (although in later versions, domain specific rules were added). Verb groups are parsed using an even smaller set of expressions: The 1992 article mentions eight of them. In addition, tags are assigned to each of them, identifying them as actives, passives, gerunds, or infinitives. An illustration:

VG[passive] → { VG[be] {V-en | V-ed/en[trans]} | V-en }

Relation: TIE-UP
Entities: Bridgestone Sports Co.
          a local concern
          a Japanese trading house
Joint Venture Company:
Activity:
Amount:

Activity: PRODUCTION
Company:
Product: golf clubs
Start Date:

Fig. 3.7. Example of results of the fourth stage of the FASTUS system.


This rule identifies a passive verb group as a construction consisting either 1) of a form of the verb be followed by an unambiguous past participle form or by a transitive verb form that is ambiguous between simple past and past participle; or 2) of a verb form that is unambiguously a past participle. The third level of analysis deals with complex noun groups and verb groups that can be recognized solely on the basis of syntactic information. It handles apposition, several kinds of conjunction, and complex verb groups expressing modal meanings. In the fourth stage, finite state machines encode meaningful event patterns. State transitions are triggered by specific combinations of the head word and phrase type of the syntactic groups determined in the previous stages. For sentence (3.3) the following two patterns are detected:

{Company/ies} {Set-up} {Joint-Venture} with {Company/ies}
{Produce} {Product}

They cause the frames in Fig. 3.7 to be instantiated. At this stage, the system also detects relevant syntactic patterns that were not dealt with in the previous step. Finite state machines analyze nested phrases like relative clauses and conjoined verb phrases with ellipsis of the subject. Most importantly, all events are related to their purely propositional active equivalent. For example, the sentences:

Cars are manufactured by GM. (3.4)
Cars are to be manufactured by GM. (3.5)
GM, which manufactures cars. (3.6)

are all transposed to their simple active form by means of syntactic equivalence rules:

GM manufactures cars. (3.7)

These operations allow the system to identify different syntactic representations of the same event, which will be crucial for the final processing step. In this final stage, event structures referring to the same event are merged if they do not violate certain consistency criteria: The structure of their noun groups has to be consistent, their overall structure must be compatible, and they cannot exceed certain nearness criteria. For certain domains, more specific or additional rules can be determined.
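The head-word-triggered patterns of the fourth stage can be roughly sketched as follows; the group representation and the helper below are our own simplification, not FASTUS code:

```python
# Hypothetical sketch of stage four: a pattern over (phrase type, head
# word) pairs instantiates an event template when it matches the groups
# produced by stage two.
def match_tie_up(groups):
    """groups: list of (phrase_type, head, text) triples.
    Instantiates a TIE-UP template when a company sets up a joint venture."""
    companies = [t for p, h, t in groups if p == "CompanyName"]
    set_up = any(p == "VG" and h == "set up" for p, h, t in groups)
    venture = any(p == "NG" and h == "venture" for p, h, t in groups)
    if companies and set_up and venture:
        return {"Relation": "TIE-UP", "Entities": companies}
    return None

groups = [("CompanyName", "Bridgestone Sports Co.", "Bridgestone Sports Co."),
          ("VG", "set up", "had set up"),
          ("NG", "venture", "a joint venture")]
print(match_tie_up(groups))
# {'Relation': 'TIE-UP', 'Entities': ['Bridgestone Sports Co.']}
```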


One has to be aware of the fact that our exposition of FASTUS is based on the original design of the FASTUS system, in order to illustrate the use of a cascade of finite state automata in a frame based information extraction approach. It shows that FASTUS’ approach of using finite state grammars turned out to be very productive: The system has a simple and modular architecture, and it is fast and efficient. Moreover, it had a short development cycle, and certain stages of the analysis are almost completely domain independent. Consequently, it is easily portable to other domains (and to other languages: A Japanese version has been developed for MUC-4). In a word, FASTUS is a success story. Nonetheless, the system still leaves space for improvement. In Chap. 5 we will study how context-dependent Markov models are trained to detect the probabilities of state transitions and state emissions of a non-deterministic finite state automaton.

3.5 Conclusions

In this chapter we explained some important algorithms for information extraction based on symbolic, handcrafted knowledge. Especially the early approaches of Roger Schank and Marvin Minsky are very interesting, as they can serve as a source of inspiration for future text extraction and text understanding in general. In the next chapter we discuss machine learning algorithms used in information extraction. Chap. 4 sets a general framework for machine learning and focuses on the features used in common extraction tasks.

3.6 Bibliography

Abney, Steven (1996). Part-of-speech tagging and partial parsing. In Ken Church, Steven Young and Gerrit Bloothooft (Eds.), Corpus-Based Methods in Language and Speech. Dordrecht, The Netherlands: Kluwer Academic Publishers.

DeJong, Gerald (1982). An overview of the FRUMP system. In Wendy G. Lehnert and Martin H. Ringle (Eds.), Strategies for Natural Language Processing (pp. 149-176). Hillsdale, NJ: Lawrence Erlbaum.

Hobbs, Jerry R., Douglas Appelt, et al. (1992). FASTUS: A system for extracting information from natural-language text. Technical note no. 519. SRI International.

Hobbs, Jerry R., Douglas Appelt et al. (1996). FASTUS: A cascaded finite-state transducer for extracting information from natural-language text. In Finite State Devices for Natural Language Processing. Cambridge, MA: The MIT Press.



Geeraerts, Dirk (1997). Diachronic Prototype Semantics: A Contribution to Historical Lexicology. Oxford: Clarendon.

Kolodner, Janet (1993). Case-Based Reasoning. San Mateo, CA: Morgan Kaufmann.

Minsky, Marvin (1975). A framework for representing knowledge. In P.H. Winston (Ed.), The Psychology of Computer Vision (pp. 211-277). New York: McGraw-Hill.

Partee, Barbara H., Alice ter Meulen and Robert E. Wall (1990). Mathematical Methods in Linguistics. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Schank, Roger C. (1972). Conceptual dependency: A theory of natural language understanding. Cognitive Psychology, 3, 552-631.

Schank, Roger C. (1975). Conceptual Information Processing. Amsterdam: North-Holland.

Schank, Roger C. and Robert P. Abelson (1977). Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures. Hillsdale, NJ: Lawrence Erlbaum.

Winograd, Terry (1975). Frame representations and the declarative/procedural controversy. In Daniel G. Bobrow and Allan Collins (Eds.), Representation and Understanding: Studies in Cognitive Science (pp. 185-210). New York, NY: Academic Press.

Winston, Patrick H. (1975). Learning structural descriptions from examples. In Patrick H. Winston (Ed.), The Psychology of Computer Vision (pp. 157-209). New York, NY: McGraw-Hill.


4 Pattern Recognition

4.1 Introduction

As was learnt from the foregoing chapters, information extraction concerns the detection and recognition of certain information, and it relies on pattern recognition methods. Pattern recognition (also known as classification or pattern classification) aims at classifying data (patterns) based on either a priori knowledge that is acquired by human experts or on knowledge automatically learned from data. A system that automatically sorts patterns into classes or categories is called a pattern classifier. The classification patterns are recognized as a combination of features and their values. In case of information extraction the features are textual characteristics that can be identified or measured, and that are assumed to have a discriminative value when sorting patterns into semantic classes.

As seen in the previous chapter, in its early days information extraction from texts relied on symbolic, handcrafted knowledge. Information was extracted using a set of patterns in the form of rules or a grammar, and a recognizer called an automaton parsed the texts with the objective of finding constructions that conform to the grammar and that were translated into semantic concepts or relations. More recent technologies often use feature vectors as the input of statistical and machine learning algorithms in order to detect the classification patterns. Supervised, unsupervised and weakly supervised learning algorithms are common. The machine learning algorithms relieve the burden of the manual knowledge acquisition. The algorithms exhibit an additional advantage. Instead of a deterministic translation of text units into semantic classes as seen in the previous chapter, the approaches usually allow a probabilistic class assignment, which is useful if we want to make probabilistic inferences based on the extracted information. For instance, information retrieval models use probabilistic models such as Bayesian networks and reasoning with uncertainty when inferring the relevance of a document to a query. After all, when we humans read and understand a text, we make many (sometimes uncertain)
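The probabilistic class assignment mentioned above can be sketched with a toy naive-Bayes-style scorer that returns a distribution over classes rather than a hard label. All feature names, counts, priors and the smoothing scheme below are invented for illustration and do not come from a real corpus:

```python
from collections import Counter

# Hypothetical feature/class counts (toy numbers, not from a real corpus).
counts = {
    "PERSON":       Counter({"capitalized": 40, "follows_title": 25, "in_gazetteer": 10}),
    "ORGANIZATION": Counter({"capitalized": 35, "follows_title": 1,  "in_gazetteer": 20}),
}
priors = {"PERSON": 0.5, "ORGANIZATION": 0.5}

def posterior(features):
    """Return a probability distribution over the classes instead of a hard label."""
    vocab = {f for c in counts.values() for f in c}
    scores = {}
    for cls, prior in priors.items():
        total = sum(counts[cls].values())
        p = prior
        for f in features:
            # add-one smoothing so an unseen feature does not zero the product
            p *= (counts[cls][f] + 1) / (total + len(vocab))
        scores[cls] = p
    z = sum(scores.values())
    return {cls: s / z for cls, s in scores.items()}

dist = posterior(["capitalized", "follows_title"])
```

A downstream retrieval model can then reason with the whole distribution `dist` instead of committing to a single class.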


inferences with the content of a text in combination with additional world knowledge, the background knowledge of the reader, and his or her information goals (Graesser and Clark, 1985). Any intelligent information processing system that relies on extracted information should incorporate uncertainties about the information extracted.

Before proceeding to the next chapters that discuss prevalent pattern recognition methods used in information extraction, several important questions have to be answered. What are the information units and their relations that we want to detect in the texts and classify? How do we conveniently detect these information units? What are the classification schemes used in information extraction? How can an information unit be described with a feature vector or other object that captures the necessary feature values for correct classification? How can these features and their values be identified in the texts?

The aim of the book is to focus on generic and flexible approaches to information extraction. When we answer the above questions, the focus is on technologies that can be used in open domain settings. It will be shown that many of the information extraction tasks require similar types of features and classification algorithms. By stressing what binds the approaches, we hope to promote the development of generic information extraction technology that can be used in many extraction settings. The text of this chapter will be illustrated with many different examples of common extraction tasks such as named entity recognition, coreference resolution, semantic role recognition, relation recognition and timex recognition.

4.2 What is Pattern Recognition?

Pattern recognition classifies objects into a number of classes or categories based on the patterns that objects exhibit (Theodoridis and Koutroumbas 2003). The objects are described with a number of selected features and their values. An object x thus can be described as a vector of features:

x = [x1, x2, …, xp]T (4.1)

where p = the number of features measured.

The features or attributes together span a multi-variate space called the measurement space or feature space. Throughout the following chapters, features and feature vectors will be treated as random variables and vectors respectively. The measurements exhibit a random variation. This is partly due to the measurement noise of measuring devices and partly to the distinct


characteristics of each feature. When features and their values are identified in natural language text, we might not capture the values correctly because our tools cannot yet cope with all variations and ambiguities a natural language exhibits.

Vectors are not the sole representation format that we use for representing the textual objects. We can also use structured objects as representations, such as representations in first-order predicate logic and graphs. A text is often well suited to be represented as a tree (e.g., based on its parse or discourse tree), where the relations between features are figured as edges between the nodes, and nodes can contain the attributes of the features.
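As a sketch of these two representation formats, the fragment below builds a flat feature vector for one token and a tiny tree for a clause loosely based on the example sentence of Fig. 4.1. The feature names and the (heavily simplified) tree shape are invented for illustration:

```python
# A flat feature vector for the token "Eastern" (illustrative feature names):
feature_vector = {
    "token": "Eastern",
    "capitalized": 1,      # Boolean feature
    "length": 7,           # numeric feature
    "pos_tag": "NNP",      # nominal feature
}

# The same unit embedded in a simplified, hypothetical parse tree, where
# edges encode the relations between nodes:
parse_tree = ("S",
              [("NP", [("NNP", "Eastern"), ("NNPS", "Airlines")]),
               ("VP", [("VBD", "notified"),
                       ("NP", [("NN", "union"), ("NNS", "leaders")])])])

def leaves(node):
    """Collect the tokens at the leaves of a tree node, left to right."""
    label, children = node
    out = []
    for child in children:
        if isinstance(child, tuple) and isinstance(child[1], list):
            out.extend(leaves(child))   # internal node: recurse
        else:
            out.append(child[1])        # (tag, token) leaf
    return out
```

The tree retains the structural relations (which phrase modifies which) that the flat vector discards.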

The classification task can be seen as a two-class (binary) or multi-class problem. In a two-class problem, an object is classified as belonging or not belonging to a particular class and one trains a binary classifier for each class. In a multi-class problem the classification task is defined as one multi-class learning problem. It is convenient to learn multiple binary classifiers when the classes are not mutually exclusive. In the information extraction tasks, which we will further consider, classes are often mutually exclusive, allowing information extraction to be treated as a multi-class learning problem.

Pattern recognition methods rely on machine learning. The learning algorithm takes the training data as input and selects a hypothesis from the hypothesis space that fits the data. There are many different learning algorithms. The availability or non-availability of training examples determines whether the machine learning is considered as respectively supervised or unsupervised.

In supervised pattern recognition, usually a rather large set of classified examples can be used for training the classifier. The feature vectors whose true classes are known and which are used for building the classifier are considered as training examples and form the training set. Because in information extraction we work with textual material, the assignment of the true class is usually done by annotating the text with class labels. For instance, in a named entity recognition task proper names can be annotated with entity class labels (see Fig. 4.1).

In supervised pattern recognition the aim is to detect general, but high-accuracy classification patterns in the training set that reliably predict the correct class of new, previously unseen instances of a test set. It is important to choose the appropriate training algorithm (e.g., support vector machines, maximum entropy modeling, induction of rules and trees) in compliance with a number of a priori defined constraints on the data (e.g., dependency of features, occurrence of noisy features, size of the feature set, size of the training set, etc.).
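A minimal sketch of this supervised setting: a training set of annotated feature vectors and a classifier that labels unseen instances. A 1-nearest-neighbour rule stands in for the more powerful algorithms named above, and the feature names, toy examples and class labels are all invented:

```python
# (feature vector, true class) pairs, i.e. annotated training examples.
training_set = [
    ({"capitalized": 1, "ends_in_corp": 1, "has_digit": 0}, "ORGANIZATION"),
    ({"capitalized": 1, "ends_in_corp": 0, "has_digit": 0}, "PERSON"),
    ({"capitalized": 0, "ends_in_corp": 0, "has_digit": 1}, "DATE"),
]

def classify(x):
    """Assign the class of the most similar training example (Hamming distance)."""
    def dist(a, b):
        return sum(a[f] != b[f] for f in a)
    return min(training_set, key=lambda ex: dist(x, ex[0]))[1]

# An unseen instance from a hypothetical test set:
label = classify({"capitalized": 1, "ends_in_corp": 1, "has_digit": 0})
```

The whole pipeline of annotation, training and prediction is visible here in miniature; only the choice of learning algorithm changes in realistic systems.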


<HL> <ENAMEX TYPE="ORGANIZATION">Eastern Air</ENAMEX> Proposes Date For Talks on Pay-Cut Plan </HL>
<DD> <TIMEX TYPE="DATE">01/23/87</TIMEX> </DD>
<SO> WALL STREET JOURNAL (J) </SO>
<IN> LABOR TEX AIRLINES (AIR) </IN>
<DATELINE> <ENAMEX TYPE="LOCATION">MIAMI</ENAMEX> </DATELINE>
<TXT>
<p><s> <ENAMEX TYPE="ORGANIZATION">Eastern Airlines</ENAMEX> executives notified union leaders that the carrier wishes to discuss selective wage reductions on <TIMEX TYPE="DATE">Feb. 3</TIMEX>. </s></p>
<p><s> Union representatives who could be reached said they hadn't decided whether they would respond. </s></p>
<p><s> By proposing a meeting date, <ENAMEX TYPE="ORGANIZATION">Eastern</ENAMEX> moved one step closer toward reopening current high-cost contract agreements with its unions. </s>
<s> The proposal to meet followed an announcement <TIMEX TYPE="DATE">Wednesday</TIMEX> in which <ENAMEX TYPE="PERSON">Philip Bakes</ENAMEX>, <ENAMEX TYPE="ORGANIZATION">Eastern</ENAMEX>'s president, laid out proposals to cut wages selectively an average of <NUMEX TYPE="PERCENT">29%</NUMEX>. </s>
<s> The airline's three major labor unions, whose contracts don't expire until year's end at the earliest, have vowed to resist the cuts. </s></p>
<p><s> Nevertheless, one union official said he was intrigued by the brief and polite letter, which was hand-delivered by corporate security officers to the unions. </s>
<s> According to <ENAMEX TYPE="PERSON">Robert Callahan</ENAMEX>, president of <ENAMEX TYPE="ORGANIZATION">Eastern</ENAMEX>'s flight attendants union, the past practice of <ENAMEX TYPE="ORGANIZATION">Eastern</ENAMEX>'s parent, <ENAMEX TYPE="LOCATION">Houston</ENAMEX>-based <ENAMEX TYPE="ORGANIZATION">Texas Air Corp.</ENAMEX>, has involved confrontation and ultimatums to unions either to accept the carrier's terms or to suffer the consequences – in this case, perhaps, layoffs. </s></p>
<p><s> "Yesterday's performance was a departure," Mr. <ENAMEX TYPE="PERSON">Callahan</ENAMEX> said, citing the invitation to conduct broad negotiations – and the lack of a deadline imposed by management. </s>
<s> "Frankly, it's a little mystifying." </s></p>
</TXT>


Fig. 4.1. Annotated sentences from MUC-6 Document No. 870123-0009.


Unsupervised pattern recognition tries to unravel similarities or differences between objects and to group or cluster similar objects. Cluster algorithms are often used for this purpose. Unsupervised learning is a necessity when the classes are not a priori known, when annotated examples are not available or too expensive to produce, or when objects and their features or feature values change very dynamically. For instance, non-pronominal noun phrase coreference resolution across documents in document collections that dynamically change (such as news stories) is an example of where unsupervised learning is useful, because the context features of all noun phrases are very likely to exhibit a large variation over time.
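A sketch of this unsupervised setting: occurrences of an ambiguous name are grouped purely by the similarity of their context words, with no labelled data. The context snippets are invented, and a greedy single-pass scheme stands in for a real cluster algorithm:

```python
import math
from collections import Counter

# Invented context snippets for three mentions of the same surface name;
# the first two plausibly refer to one entity, the third to another.
contexts = [
    "airline carrier wage unions contract",
    "carrier airline pilots unions",
    "europe bloc countries border",
]

def cosine(a, b):
    """Cosine similarity between two bag-of-words context vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb)

def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering: join a document to the first cluster
    whose seed document is similar enough, else start a new cluster."""
    clusters = []
    for d in docs:
        for c in clusters:
            if cosine(d, c[0]) >= threshold:
                c.append(d)
                break
        else:
            clusters.append([d])
    return clusters

groups = cluster(contexts)
```

Both the similarity function and the threshold are the kinds of a priori choices the next paragraph discusses.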

In unsupervised pattern recognition an important focus is on the selection of features. One often relies on knowledge or an appreciation of features that are a priori assumed not to be relevant for the classes sought. In addition, the choice of a suitable function that computes the similarity or distance between two feature vectors is very important as these functions give different results depending on where the feature vectors are located in the feature space (cf. Jones and Furnas, 1987). The choice of a convenient cluster algorithm that clusters the objects into groups is important as well. Here too, the choice is defined by a number of a priori defined constraints on the data, such as the number of feature vectors and their location in the geometrical feature space.

Because of the large variety of natural language expressions it is not always possible to capture this variety by sufficient annotated examples. On the other hand, we have huge amounts of unlabeled data sets in large text collections. Hence the interest in unsupervised approaches for the semantic classification or in unsupervised aids that complement the lack of sufficient training examples.

In the framework of generic technologies for information extraction, it is important that the classification or extraction patterns are general enough to have a broad applicability, but specific enough to be consistently reliable over a large number of texts. However, there are many challenges to overcome. A major one that we have already cited is the lack of sufficient training examples that are labeled with a particular class. Natural language is very varied; capturing all possible variations in the examples and having sufficient overlap in the examples to discriminate good patterns from noisy patterns is almost impossible. We also expect the feature values to be sometimes inaccurate due to errors in the preprocessing phase (e.g., syntactic analysis) and to errors of human annotation of the training set. In addition, the number of potential features is very large, but only few of them are active in each example, and only a small fraction of them are relevant to the target concept. Moreover, the individual features and their values are


often ambiguous markers of several classes; in combination with other features they might become more discriminative. But introducing more features might not necessarily reduce ambiguity as they themselves are often sources of ambiguity. This situation poses problems both for supervised and unsupervised learning.

When information extraction is performed in real time, extraction algorithms need to perform fast computations and their computational complexity should be kept in check.

4.3 The Classification Scheme

A classification scheme describes the semantic distinctions that we want to assign to the information units and to the semantic relations between these units. The set can have the form of a straight list, for instance, when we define a list of named entity classes to be identified in a corpus (e.g., the classes protein, gene, drug, disease of information in biomedical texts). Or, the scheme can be characterized by its own internal structure. It might represent the labels that can be assigned to entities or processes (the entity classes), the attribute labels of the entity classes, the subclasses and the semantic relations that might hold between instances of the classes, yielding a real semantic network. For instance, in texts of the biomedical domain one might be interested in the protein and gene subclasses, in the protein attribute composition or in the relation is located on between a protein and a gene. In addition, this scheme preferably also integrates the constraints on the allowable combinations and dependencies of the semantic labels.

Semantic labels range from generic labels to domain specific labels. For instance, the semantic roles sayer in a verbal process and verbiage in a verbal process are rather generic information classes, while neurotransmitter and ribonuclear inclusion are quite domain specific. One can define all kinds of semantic labels to be assigned to information found in a text that is useful in subsequent information processing tasks such as information retrieval, text summarization, data mining, etc. Their definition often relies on existing taxonomies that are drafted based on linguistic or cognitive theories or on natural relationships that exist between entities. In case of a domain specific framework of semantic concepts and their relations we often use the term ontology.

In this book we are mostly interested in semantic labels that can be used for open domain tasks and more specifically open domain information retrieval. To accomplish such tasks, a semantic annotation of the text constituents preferably identifies at an intra-clause or -sentence level:


1) The type of action or state associated with the verb, possibly expressed in terms of primitive actions and states;
2) The entities participating in the action or state (normally expressed as arguments);
3) The semantic role of the participants in the action or state;
4) Possibly a more fine grained characterization of the type of the entity (e.g., person, organization, animal, …);
5) Coreferent relationships between noun phrase entities;
6) Temporal expressions;
7) Spatial expressions.

Coreferent relations are also found across clauses, sentences and even documents. In a more advanced setting, information extraction can detect temporal and spatial relations within and across documents.
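For concreteness, the seven intra-clause annotation levels listed above could be recorded in a structure like the following. The sentence and all label names are illustrative, not a fixed annotation standard:

```python
# Hypothetical intra-clause annotation for the invented sentence
# "GM manufactured cars in Detroit yesterday."
annotation = {
    "action": "manufacture",                      # 1) type of action
    "participants": ["GM", "cars"],               # 2) entities in the action
    "roles": {"GM": "agent", "cars": "patient"},  # 3) semantic roles
    "entity_types": {"GM": "organization",        # 4) finer-grained entity types
                     "cars": "artifact"},
    "coreference": {},                            # 5) none within this clause
    "timex": ["yesterday"],                       # 6) temporal expression
    "spatial": ["in Detroit"],                    # 7) spatial expression
}
```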

If information extraction is done in a specific domain with a specific task in mind, then we refine the label set for entities and their relations. For instance, in the domain of natural disasters, labels such as the number of victims, the number of houses destroyed, etc. might be useful to extract. In a business domain it might be interesting to extract the price of a product, the e-mail of a company's information desk or the company a person works for. In the legal domain it is interesting to extract the sentence in a criminal case.

The output of a low-level semantic classification can become a feature in a higher-level classification. For instance, a list of relations attributed to a person entity might trigger the concept restaurant visit by that person.

In the following sections and chapters we focus on information extraction approaches and algorithms that have proven their usefulness in extracting both semantic information that is labeled with generic and rather abstract classes, and domain specific information.

4.4 The Information Units to Extract

Our next question is what information units or elements we want to identify, classify and eventually extract from the texts. This process is often referred to as segmentation (Abney, 1991). When we use these information units in the indices of the texts, we call them text regions. The smallest textual units to which meaning is assigned and thus could function as an information unit are the free morphemes or root forms of words. However, some words on their own do not carry much meaning, but have functional properties in the syntactic structure of a text. These function words alone


can never function as information units. Single words, base phrases or chunks, larger phrases, clauses, sentences, passages or structured document parts (e.g., sections or chapters) might all be considered as information units to extract.

The extraction units most commonly used in information extraction are base phrases (e.g., base noun and verb phrases). A base noun phrase or noun chunk in English can be defined as a maximal contiguous sequence of tokens in a clause whose POS tags are from the set {JJ, VBN, VBG, POS, NN, NNS, NNP, NNPS, CD}.1 A base verb phrase is a maximal contiguous sequence of tokens in a clause whose POS tags are from the set {VB, VBD, VBP, VBZ}, possibly combined with a tag from the set {VBN, VBG}.2
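The base noun phrase definition above translates almost directly into code: collect maximal runs of tokens whose tags come from the given set. The POS-tagged sentence below is an invented example; a real system would obtain the tags from a tagger:

```python
# Tag set for base noun phrases, taken from the definition in the text.
NP_TAGS = {"JJ", "VBN", "VBG", "POS", "NN", "NNS", "NNP", "NNPS", "CD"}

def base_noun_phrases(tagged_tokens):
    """Return the maximal contiguous runs of NP-tagged tokens in one clause.

    tagged_tokens: list of (token, POS tag) pairs.
    """
    chunks, current = [], []
    for token, tag in tagged_tokens:
        if tag in NP_TAGS:
            current.append(token)          # extend the current chunk
        elif current:
            chunks.append(" ".join(current))  # a non-NP tag closes the chunk
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

# Invented tagged clause for illustration:
sentence = [("Eastern", "NNP"), ("Airlines", "NNPS"), ("notified", "VBD"),
            ("union", "NN"), ("leaders", "NNS"), ("on", "IN"),
            ("Feb.", "NNP"), ("3", "CD")]
chunks = base_noun_phrases(sentence)
```

The same pattern, with the verb tag sets, yields a base verb phrase chunker.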

One could define within a base noun phrase nested noun phrases. Here we have to deal with possessive noun phrases (e.g., her promotion, John's book) and modifier noun phrases or prenominal phrases (e.g., student scholarship, University officials). These noun phrases are still easy to detect in English texts. On the other hand a base noun phrase can be augmented with modifiers headed by a preposition (e.g., Massachusetts Institute of Technology). For this task we need a syntactic parser that captures the syntactic dependency structure of each sentence in order to distinguish a noun phrase that modifies another noun phrase from one that modifies a verb phrase (e.g., leaving my house in a hurry and leaving my house in my daddy's neighborhood). The detection of verb phrases and their arguments also requires a syntactic parse.

Although we have the tools to identify individual nouns and verbs, base phrases and full phrases, it is sometimes difficult to define which format is best suited to delimit an entity or the process it is involved in (e.g., Massachusetts Institute of Technology versus Rik De Busser of Leuven). This problem is especially significant in the biomedical domain (see Chap. 9). It can partially be solved by learning collocations, i.e., detecting words that co-occur together more often than by chance in a training corpus by means of statistical techniques (e.g., mutual information statistic, chi-square statistic, likelihood ratio for a binomial distribution) (Dunning 1993; Manning and Schütze, 1999). With these techniques it is possible to learn an expression (e.g., a noun phrase) consisting of two or more words that corresponds to some conventional way of saying things. Usually, the collocated words found add an element of meaning that cannot be predicted from the meanings of their composing parts.

1 Penn Treebank tag set: JJ = adjective; JJR = adjective, comparative; JJS = adjective, superlative; VBN = verb, past participle; VBG = verb, gerund/present participle; POS = possessive ending; NN = noun, singular; NNP = proper noun, singular; NNS = noun, plural; NNPS = proper noun, plural; CD = cardinal number.

2 VB = verb, base form; VBD = verb, past tense; VBP = verb, non-3rd person singular present; VBZ = verb, 3rd person singular present.

It is also possible to consider all candidate phrases in an information extraction task (e.g., for the university student of Bulgaria, consider: the university student of Bulgaria, the university student, the student of Bulgaria, the student) and to select the one among the candidates that belongs to a certain semantic class with a large probability. For instance, in a noun phrase coreference resolution task, such an approach has been implemented. Boundary detection and classification of the information unit are sometimes seen as two separate tasks, each relying on a different feature set. A difficult problem to deal with, comparable with the nested noun phrase problem, regards information units that are conjunctions of several individual units. Here too, all different possibilities of phrases can be considered.

Not only basic noun and verb phrases are identified; individual words or expressions might be useful to classify, such as certain adverbs and adverbial expressions (e.g., today, up to here).

We also consider information units that extend phrase boundaries such as the classification of sentences or passages. For such larger units we cross the domain of text categorization. The semantic classifications described in this book offer valuable features to classify larger text units with semantic concepts, and the technologies discussed can be used to classify relationships between clauses, sentences and passages (e.g., to detect rhetorical and temporal relationships) that are very valuable when semantically classifying a passage (e.g., classifying the passage as a visit to the dentist; or classifying it as a procedure).

4.5 The Features

Machine learning approaches rely on feature vectors built from a labeled (already classified) or an unlabeled document collection. Depending upon the classification task a set of features is selected. We usually do not use all features that are present in a text, but select a number of important ones for the information extraction task at hand in order to reduce the computational complexity of the training of the classifier, and at the same time we keep as much as possible class discriminatory information. In the framework of an open domain information extraction task, it is important that


the features are generic enough to be used across different domains and that their values can automatically be detected.

The information units that we have identified in the previous section are described with certain features, the values of which are stored in the feature vector of the unit that is semantically classified. The features themselves can be classified in different types. Features can have numeric values, i.e., discrete or real values. A special discrete value type is the Boolean one (i.e., value of 1 or 0). Features can also have nominal values (e.g., certain words), ordinal values (e.g., the values 0 = small number, 1 = medium number, 2 = large number), or interval or ratio scaled values. We can make conversions to other types of features. For instance, a feature with nominal values can be translated to a number of features that have a Boolean or real value (e.g., if the value of a feature represents a word in a vocabulary, the feature can be translated into a set of features, one for each word in the vocabulary, which is advantageous if one wants to give the words a weight).
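The nominal-to-Boolean conversion just described is the familiar one-hot expansion. A sketch with an invented three-word vocabulary, in both the Boolean and the weighted variant:

```python
# Invented vocabulary; in practice this would be the corpus vocabulary.
vocabulary = ["carrier", "union", "wage"]

def one_hot(word):
    """Expand one nominal value into |vocabulary| Boolean (0/1) features."""
    return [1 if word == v else 0 for v in vocabulary]

def weighted(word, weight):
    """The same expansion, but carrying a real-valued weight instead of 1/0."""
    return [weight if word == v else 0.0 for v in vocabulary]
```

The weighted variant is what makes it possible to give individual words a weight, as noted above.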

Features can also be distinguished by their position in the text. First, we can define features that occur in the information unit itself, such as the composition of letters and digits of an entity name. Secondly, there are the features that occur in the close neighborhood or context window of the token string to be classified. In this category there are the features of the words that surround an information unit to be classified. Thirdly, if a relationship between two entities is to be found, features that are linked with each of the entities or with both entities can be defined. Fourth, the broader context in which the information unit occurs can give additional evidence for its semantic classification. In this case it is convenient to define features that occur in the complete document or document collection. For instance, when classifying an entity name in a sentence, we might rely on the assumption of one sense per discourse (Yarowsky, 1995). Thus, repetitions of the name or reliably resolved acronyms or abbreviations of the name can offer additional context and evidence to classify the entity name (Chieu and Ng, 2002). Analogically, in a relation extraction task, when we have first resolved the noun phrases that refer to the same entity, we can define features that are selected from different documents in order to learn the relation between two entity names.

In the next section we discuss the most commonly used features in typical information extraction tasks. We classify the features in lexical, syntactic, semantic and discourse features. The features, their types and their values are illustrated in tables that explicitly group the features used in an extraction task. In this way we give the implementer of an information extraction system two views on the feature selection process. On one hand, the distinction in lexical, syntactic, semantic and discourse features groups the typical methodologies and feature selection algorithms needed


for the text analysis. On the other hand illustrative tables summarize feature selection for a particular extraction task. For a particular feature that is cited in these tables, we give its most common value type.

1. The features for a named entity recognition task are based on the work of Bikel et al. (1999), Borthwick (1999), Collins and Singer (1999), Zhou and Su (2002), and Bunescu and Mooney (2004) (Table 4.1). In named entity recognition, features typical for the entity name itself and contextual features play a role.

2. The features for the single-document noun phrase coreference resolution task refer to the work of Cardie and Wagstaff (1999), Soon et al. (2001) and Müller et al. (2002) (Table 4.2). Most reference resolution programs determine the relationship between a noun phrase and its referent only from the properties of the pair. The context of both noun phrases is usually ignored.

3. The features for the cross-document coreference resolution refer to the work of Bagga and Baldwin (1998), Gooi and Allan (2004) and Li et al. (2004) (Table 4.3). Cross-document noun phrase coreference resolution is per se a word sense disambiguation task. Two names refer to the same entity if their contexts in the different documents sufficiently match. Especially, proper names in these contexts are indicative of the meaning of the target proper name. Often, cross-document coreference resolution relies on single-document coreference resolution for solving the coreferents in one text, and it uses cross-document resolution for disambiguating identical names across texts, although mixed approaches that combine both tasks are also possible.

4. The features for a semantic role recognition task rely on the work of Fleischman and Hovy (2003), Pradhan et al. (2004) and Mehay et al. (2005) (Table 4.4). Syntactic and structural features (e.g., position) play an important role besides some lexical characteristics (e.g., use of certain prepositions).

5. In relation recognition our features are based on the work of Hasegawa et al. (2004) (Table 4.5). In this task contextual features are quite important: There is no way to be certain that the sentence He succeeds Mr. Adams describes a corporate management succession. It may refer to a political appointment, which is considered irrelevant if we want to identify management successions. A large window of context words is here advisable for feature selection.

6. The features used to detect temporal expressions or timexes were previously described in Mani (2003) and Ahn et al. (2005) (Table 4.6). Processing of temporal information regards the detection and possible normalization of temporal expressions in text; their classification in


76 4 Pattern Recognition

absolute and relative expressions and, in the case of the latter, the computation of the absolute value, if possible; and the ordering of the expressions in time (Mani et al., 2005).
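The normalization step mentioned above, computing an absolute value for a relative expression, can be sketched with a toy resolver. The expression table and the `resolve_timex` helper are illustrative assumptions, not part of the cited systems, which handle far richer expression grammars.

```python
from datetime import date, timedelta

# Toy lookup of a few relative timexes (illustrative only).
OFFSETS = {"today": 0, "yesterday": -1, "tomorrow": 1}

def resolve_timex(expression, document_date):
    """Anchor a relative temporal expression to an absolute date,
    relative to the date of the document in which it occurs."""
    expression = expression.lower().strip()
    if expression in OFFSETS:
        return document_date + timedelta(days=OFFSETS[expression])
    return None  # absolute or unsupported expressions are left to other rules

# Usage: anchor "yesterday" in an article dated 15 March 2006.
anchored = resolve_timex("yesterday", date(2006, 3, 15))
```

A real normalizer would also cover expressions such as "last Tuesday" or "three weeks ago", which require calendar arithmetic beyond a fixed day offset.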

The feature set used in information extraction is very rich and varied. Natural language data is a domain that particularly benefits from rich and overlapping feature representations.

Quite often feature values are transformed when used in an information extraction task. For instance, one can aggregate a number of different feature values into one general feature value. This process is referred to as feature extraction or feature generation. An example of feature extraction is when semantic classifications of words are used as features in complex extraction tasks (see infra).

Table 4.1. Typical features in a named entity recognition task of the candidate entity name i, considering a context window of l words.

FEATURE | VALUE TYPE | VALUE
Short type | Boolean | True if i matches the short type j; False otherwise.
POS | Nominal | Part-of-speech tag of the syntactic head of i.
Context word | Boolean or real value between 0 and 1; or nominal | True if the context word j occurs in the context of i; False otherwise. If a real value is used, it indicates the weight of the context word j. Alternatively, the context word feature can be represented as one feature with nominal values.
POS left | Nominal | POS tag of a word that occurs to the left of i.
POS right | Nominal | POS tag of a word that occurs to the right of i.
Morphological prefixes/suffixes | Nominal | Prefix or suffix of i.
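The short type feature of Table 4.1 reduces to a small character-class mapping; a minimal sketch of such a Collins-style template (the function name is ours):

```python
import re

def short_type(token):
    """Map a token to its short type: maximal runs of capital letters
    become 'A', of lowercase letters 'a', of digits '0'; all other
    (non-alphanumeric) characters are kept as they are."""
    token = re.sub(r"[A-Z]+", "A", token)
    token = re.sub(r"[a-z]+", "a", token)
    token = re.sub(r"[0-9]+", "0", token)
    return token

# Usage: the book's example word TGV-3 maps to A-0.
```

The same template collapses many surface forms into one feature value, which is exactly what makes it useful when training data is sparse.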

4.5.1 Lexical Features

Lexical features refer to the attributes of lexical items or words of a text. One can make a distinction between the words of the information unit that is to be classified, and its context words.


4.5 The Features 77

In named entity recognition tasks morphological characteristics of the information to be classified are often important. By morphological characteristics we mean the occurrence of specific character conventions, such as the occurrence pattern of digits and capital letters in a word or sequence of words. Because it is difficult to represent all possible compositions in a feature vector, entities are often mapped to a restricted number of feature templates that are a priori defined and are sometimes called short types (Collins, 2002). A short type of a word can, for instance, be defined by replacing any maximal contiguous sequence of capital letters with 'A', of lowercase letters with 'a' and of digits with '0', while keeping the other non-alphanumeric characters. For example, the word TGV-3 would be mapped to A-0. It is also possible to define short types for multi-word expressions. A template can also represent more refined patterns (e.g., the word contains one digit at a certain position or contains a digit and a period at a certain position).

Simple heuristic rules allow detecting certain attributes of an information unit. For instance, the title, first name, middle name and last name of a person can be identified and used as a feature in coreference resolution.

It is common that words or compound terms have different variant spellings, i.e., an entity can have different mentions. Especially, proper names such as person names can occur in a text in different alias forms. Although the task of alias recognition in itself is a noun phrase coreference resolution task, often a simple form of alias recognition is a priori applied, yielding classification features such as "is alias" and "is weak alias". They especially aim at detecting variations concerning punctuation (e.g., USA versus U.S.A), capitalization (e.g., Citibank versus CITIBANK), spacing (e.g., J.C. Penny versus J. C. Penny), abbreviations and acronyms (e.g., information retrieval versus IR), misspellings including omissions (e.g., Collin versus Colin), additions (e.g., McKeown versus MacKeown), substitutions (e.g., Kily versus Kyly), and letter reversals (e.g., Pierce versus Peirce). Punctuation and capitalization variations can be resolved - although not in an error-free way - by simple normalization. Abbreviations and acronyms can be normalized by using a translation table of abbreviations or acronyms and their corresponding expansions. Or, simple rules for acronym resolution might be defined. Especially for detecting misspellings, edit distances are computed. The similarity between two character strings is then based on the cost associated with converting one pattern to the other. If the strings are of the same length, the cost is directly related to the number of symbols that have to be changed in one of the strings so that the other string results. In the other case, when the strings have a different length, characters have to be either deleted or inserted at certain places of the test string. The edit distance D(A, B) is defined as the minimum total number of (possibly weighted) substitutions S, insertions I, and deletions R required to change pattern A into pattern B:

D(A, B) = min_j [ S(j) + I(j) + R(j) ]        (4.2)

where j runs over all possible combinations of symbol variations in order to obtain B from A. Dynamic programming algorithms are usually used to efficiently obtain B from A (Skiena, 1998, p. 60 ff.).

Another alias detection heuristic refers to the matching of strings except for articles and demonstrative pronouns. An evaluation of different techniques for proper name alias detection can be found in Branting (2003). The first mention of the entity in a text is usually taken as the most representative. It is clear that alias resolution across different documents requires additional context matching, as names that are (slightly) differently spelled might refer to different entities.

It is also common in text that entities are referred to by their synonym, hypernym, hyponym or sometimes meronym. A synonym is a term with the same meaning as the source term, but differently spelled. A hypernym denotes a more general term, while a hyponym refers to a more specific term compared to the source term. A meronym stands for a part-of relation. Thesauri or lexical databases such as WordNet (Miller, 1990) usually contain these term relationships. It is not always easy to correctly detect synonyms, hypernyms and hyponyms in texts because of the different meanings that words have. The lexica often cite the different meanings of a word, but sometimes lack sufficient context descriptions for each meaning in order to easily disambiguate a word in a text.

Other lexical features regard the gender and number of the information unit, or of the head of the unit if it is composed of different words. They are, for instance, used as matching features in a noun phrase coreference task. An entity can have as gender: masculine, feminine, both masculine and feminine, and neutral. Additional knowledge of the gender of persons is helpful. It could be detected by relying on lists of first names in a particular language or culture that are classified according to gender, when the person is mentioned with his or her first name and when the first name does not have an ambiguous gender (e.g., Dominique in French). The form of addressing a person also acts as a cue in determining a person's gender (e.g., Mrs. Foster). For common nouns, we have to infer the gender from additional knowledge sources. Number information is usually provided by the part-of-speech tagger, where a tag such as NNS refers to a plural noun.
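Equation 4.2 with uniform costs is the classic Levenshtein distance, computed by the dynamic programming scheme the text refers to. A minimal sketch (unit costs are our simplifying assumption; weighted variants change only the `cost` terms):

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions
    needed to turn string a into string b (unit costs)."""
    # dp[i][j] = distance between the prefixes a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                       # delete all of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j                       # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # substitution S
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion R
                           dp[i][j - 1] + 1,          # insertion I
                           dp[i - 1][j - 1] + cost)
    return dp[len(a)][len(b)]

# Usage: the misspelling pairs discussed in the text, e.g.
# the omission pair Collin / Colin differs by a single edit.
```

An alias detector would then treat two mentions whose distance falls under a small threshold as candidate variants of the same name.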



Table 4.2. Typical features in a single-document noun phrase coreference resolution task of the syntactic heads, i and j, of two candidate coreferent noun phrases in text T, where i < j in terms of word position in T.

FEATURE | VALUE TYPE | VALUE
Number agreement | Boolean | True if i and j agree in number; False otherwise.
Gender agreement | Boolean | True if i and j agree in gender; False otherwise.
Alias | Boolean | True if i is an alias of j or vice versa; False otherwise.
Weak alias | Boolean | True if i is a substring of j or vice versa; False otherwise.
POS match | Boolean | True if the POS tags of i and j match; False otherwise.
Pronoun i | Boolean | True if i is a pronoun; False otherwise.
Pronoun j | Boolean | True if j is a pronoun; False otherwise.
Appositive | Boolean | True if j is the appositive of i; False otherwise.
Definiteness | Boolean | True if j is preceded by the article "the" or a demonstrative pronoun; False otherwise.
Grammatical role | Boolean | True if the grammatical roles of i and j match; False otherwise.
Proper names | Boolean | True if i and j are both proper names; False otherwise.
Named entity class | Boolean | True if i and j have the same semantic class (e.g., person, company, location); False otherwise.
Discourse distance | Integer >= 0 | Number of sentences or words that i and j are apart.
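Several of the Boolean features in Table 4.2 reduce to simple string tests on the two mentions; a sketch for a candidate pair (the feature names follow the table, but the helper itself and its crude cues are our illustrative assumptions; real systems derive number, gender and POS information from linguistic preprocessing):

```python
def pair_features(i, j):
    """Compute a few Table 4.2-style features for two mention strings,
    where mention i precedes mention j in the text."""
    ii, jj = i.lower(), j.lower()
    return {
        # weak alias: one mention is a substring of the other
        "weak_alias": ii in jj or jj in ii,
        # crude proper-name cue: both mentions start with a capital
        "proper_names": i[:1].isupper() and j[:1].isupper(),
        # definiteness of the later mention j
        "definiteness": jj.startswith(("the ", "this ", "that ")),
    }

# Usage: a proper-name antecedent and a definite referring phrase.
feats = pair_features("Citibank", "the bank")
```

A learned coreference classifier would consume such a feature dictionary, together with the agreement and distance features of the table, for every candidate pair.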

In many semantic classifications the context words are very important. The size of the window of context words usually varies according to the extraction task. In named entity recognition the window size is usually quite small (two or three words to the left or the right of the target word, yielding a window of respectively 5 or 7 words). In a cross-document coreference resolution task, the window can be quite large (e.g., 50 words, or the sentence in which the target word occurs). Words in context windows might receive a weight that indicates their importance. Quite often classical weighting functions such as tf x idf are used for this purpose. The term frequency (tf) is valuable when the words of different context windows are combined in one vector. This is, for instance, the case when in one document the context windows of identical or alias mentions of an entity can be merged while relying on the one sense per discourse principle, which, for instance, for proper names can be accepted with high accuracy. The term frequency is then computed as the number of times a term occurs in the window(s). The inverse document frequency (idf) is useful to demote term weights when the term is a common term in the document collection under consideration or in a reference corpus in the language of the document. The idf of term i is usually computed as log(N/ni), where N is the number of documents in the collection and ni the number of documents in the collection in which i occurs. In context windows, stop words or function words might be neglected. For certain tasks such as cross-document noun phrase coreference resolution, proper names, time and location expressions in the context might receive a high weight. In order to find coreferring names across documents, the semantic roles and processes in which the entities are involved can yield additional cues.

Table 4.3. Typical features in a cross-document noun phrase coreference resolution task of the syntactic heads, i and j, of two candidate coreferent noun phrases where i and j occur in different documents.

FEATURE | VALUE TYPE | VALUE
Context word | Boolean or real value between 0 and 1 | True if the context word k occurs in the context of i and j; False otherwise. If a real value is used, it indicates the weight of the context word; proper names, time and location expressions in the context might receive a high weight.
Named entity class | Boolean | True if i and j have the same semantic class (e.g., person, company, location); False otherwise.
Semantic role | Boolean | True if the semantic role of i matches the semantic role of j; False otherwise.

4.5.2 Syntactic Features

The most common syntactic feature used in information extraction is the part-of-speech (POS) of a word. Part-of-speech taggers that operate with a very high accuracy are commonly available. The part-of-speech of a word often plays a role in determining the values of other features.
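The tf x idf weighting described above follows directly from its definition; a minimal sketch (the corpus counts in the usage line are invented toy numbers):

```python
import math

def tf_idf(term_count_in_window, n_docs_total, n_docs_with_term):
    """Weight of a context-window term: tf x log(N / n_i), where tf is
    the count of the term in the (merged) windows, N the collection
    size and n_i the number of documents containing the term."""
    return term_count_in_window * math.log(n_docs_total / n_docs_with_term)

# Usage: a term occurring twice in the merged context windows,
# found in 10 of 1000 documents in the collection.
w = tf_idf(2, 1000, 10)
```

Note that a term occurring in every document receives weight zero, which is exactly the demotion of common terms the text describes.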

So, for instance, the definiteness of an information unit or noun phrase entity can be approximately defined if the unit is preceded by the article "the" or a demonstrative pronoun (e.g., I saw a man and the man was old. That person wore strange clothes). In this example a man refers to indefinite information. Defining definiteness is valuable to detect anaphoric noun phrase coreferents in texts (Yang et al., 2004). Definite noun phrases usually refer to content that is already familiar or to content items of which there exists only one (e.g., the U.S.). Definiteness can be split up into two separate Boolean features: definite and indefinite (Ng and Cardie, 2002), which allows describing cases that are neither definite nor indefinite.

Alias recognition or weak alias recognition (cf. supra) can also rely on part-of-speech tags. The part-of-speech tag gives us information on words that might be removed for string matching of the candidate aliases. For instance, for proper names we can remove words that do not have the part-of-speech NNP (singular proper name) or NNPS (plural proper name). For words that belong to the general part-of-speech type NN (noun), especially the head noun is important in the matching of candidate aliases.

Detecting the type of phrase (e.g., a noun phrase such as the big bear, a prepositional noun phrase such as in the cold country) is important in a semantic role recognition task. The syntactic head of a phrase is here a useful feature. The syntactic head of a phrase is the word by whose part-of-speech the phrase is classified (e.g., man in the noun phrase the big man). In timex recognition, the following information units are usually considered as candidates: noun, noun phrase, adjective, adverb, adjective phrase and adverb phrase.

The voice of a clause (i.e., passive or active) is a useful feature in a relation extraction task. It can be detected based on surface expressions in the texts and the part-of-speech of the verb words. Another mode feature determines whether the sentence is affirmative or negative. This feature is more difficult to detect accurately.

A number of syntactic features rely on a parsing of the sentence's structure. Unfortunately, sentence parsers are not available for every language. The grammatical role of a phrase in a sentence or clause, such as subject, direct object or indirect object, might play a role in the extraction process. Grammatical roles, which are sometimes also called syntactic roles, are detected with the help of rules applied on the parse tree of a sentence. In certain languages the grammatical role of nouns and pronouns can be detected by their morphological form, which indicates cases such as nominative, accusative, genitive and ablative. The grammatical role is important in a coreference resolution task, as antecedent and referent often match with regard to their grammatical role (Yang et al., 2004).

Parse information is also important in detecting relations between entities (Culotta and Sorensen, 2004). For instance, defining whether the two noun phrase entities are in a modifier relation, or defining their grammatical roles in the sentence, acts as a useful feature in relation recognition.

Table 4.4. Common features in a generic semantic role recognition task of clause constituent i.

FEATURE | VALUE TYPE | VALUE
Phrase type | Nominal | Phrase type (e.g., noun phrase, verb phrase) as determined by the POS tag of the syntactic head of i.
Syntactic head | Nominal | The word that composes the syntactic head of the phrase that represents i.
Grammatical role | Nominal | The grammatical role of i.
Voice | Nominal | The voice of the clause of which i is part: active or passive.
Named entity class | Nominal | Name of the named entity class (e.g., person, organization) of the syntactic head of i; undefined when i is not a noun phrase.
Relative distance and position | Integer | The relative distance of the syntactic head of i with regard to the process, defined as a number proportional to the distance (e.g., in terms of words); the sign (negative or positive) also indicates whether i occurs before or after the process in the clause; zero when i represents the process in the clause.

4.5.3 Semantic Features

Semantic features refer to semantic classifications of single- or multi-word information units. The semantic features act as features in other semantic classification tasks. An example is John Barry works for IBM, where John Barry and IBM are already classified respectively as person name and company name. These more general features are then used in the recognition of the relation works for. There are multiple circumstances where the replacement of words and terms by more general semantic concepts is advantageous, especially when the features are used to semantically classify larger information units or in more complex classification tasks such as coreference resolution. In coreference resolution it is very important to use semantic classes such as female, male, person and organization, or animate and inanimate, and to find agreement of antecedent and referent on these classes. Semantic features may involve simple identification of the name of a day or month by respectively the classes day or month, the recognition of useful categories such as person name, company name, number and money, and the recognition of very general classes such as the sayer in a verbal process.

An additional advantage is that semantic tagging of individual words enables rules of greater generality than rules based exclusively on exact words. In this way it offers a solution to problems caused by the sparseness of training data and the variety of natural language expressions found in texts.

Table 4.5. Common features in a relation recognition task between two noun phrase entities i and j in a clause c, considering a context of l words.

FEATURE | VALUE TYPE | VALUE
Context word | Boolean or real value between 0 and 1; or nominal | True if the context word k occurs in the context of i and j; False otherwise. If a real value is used, it indicates the weight of the context word k. Alternatively, the context word feature can be represented as one feature with nominal values.
POS context word | Nominal | For each context word, there is a feature that designates the word's POS tag.
Semantic role i | Nominal | Semantic role of phrase i; undefined when i is a modifier.
Semantic role j | Nominal | Semantic role of phrase j; undefined when j is a modifier.
Modifier i | Boolean | True if i is a modifier of j; False otherwise.
Modifier j | Boolean | True if j is a modifier of i; False otherwise.
Affirmative | Boolean | True if the clause c in which i and j occur is affirmative; False otherwise.

There are several ways for identifying the semantic features. Firstly, they can be detected with the typical information extraction techniques described in this book, such as named entity recognition and semantic role recognition. Secondly, we can rely on external knowledge sources that are in the form of machine-readable dictionaries or lexica, which can be general or domain specific. Especially useful is a semantic lexicon that can be used to tag individual words with semantic classes appropriate to the domain. Semantic class determination relying on general lexical databases such as WordNet (Miller, 1990) is not easy when they lack the necessary contextual expressions to disambiguate word meanings. There also exist gazetteers that contain geographical or other names. In addition, semantic lexica might be incomplete, and in practical applications generic resources often have to be complemented with domain specific resources. A list of the most common first or last names can be used in a named entity recognition task (e.g., the US Census list of the most common first and last names in the US).

4.5.4 Discourse Features

Discourse features refer to features the values of which are computed by using text fragments, i.e., a discourse or a connected speech or writing, larger than the sentence. Many discourse features are interesting in an information extraction context. A very simple example is discourse distance. In relation recognition the distance between two entities is often important, as it is assumed that distance is inversely proportional to semantic relatedness. Especially in single-document coreference resolution discourse distance is relevant. Discourse distance can be expressed by the number of intervening words or by the number of intervening sentences.

Table 4.6. Common features of phrase i in a timex recognition task, considering a context window of l words.

FEATURE | VALUE TYPE | VALUE
Context word | Boolean or real value between 0 and 1; or nominal | True if the context word j occurs in the context of i; False otherwise. If a real value is used, it indicates the weight of the context word j. Alternatively, the context word feature can be represented as one feature with nominal values.
Short type | Boolean | True if i matches the short type j; False otherwise.

Discourse features such as rhetorical, temporal and spatial relations between certain information found in the texts are important in the semantic classification of larger text units. For instance, the temporal order of certain actions is a significant indicator of script-based concepts expressed in texts (e.g., a restaurant visit, a bank robbery). The recognition of temporal expressions (timexes), their possible anchoring to absolute time values and their relative ordering are themselves considered as information extraction tasks (e.g., Mani et al., 2005; Mani, 2003). TimeML (Pustejovsky et al., in Mani et al., 2005) is a proposed metadata standard for markup of events and their temporal anchoring in documents. The drafting of classification schemes of temporal relationships goes back to Allen (1984) (e.g., before, after, overlaps, during, etc.). More recent ontological classification schemes aim to logically describe the temporal content of Web pages and to make inferences or computations with them (Hobbs and Pan, 2004). Experiments with regard to the automatic classification of temporal relationships are very limited (Mani et al., 2003), and few studies report on adequate discourse features, except for features that track shifts in tense and aspect. This is why we did not include a separate table for typical temporal relationship features.

4.6 Conclusions

Information extraction is considered as a pattern classification task. The candidate information unit to be extracted or semantically classified is described by a number of features. The feature set is very varied. However, a number of generic procedures are used in feature selection and extraction. They comprise lexical analysis, part-of-speech tagging and possibly parsing of the sentences. These primitive procedures allow identifying a set of useful information extraction features that can be found in open and closed domain document collections. Discourse features are used to a lesser extent, but will certainly become more important in future semantic classifications. Elementary information classifications, such as named entity recognition, yield semantic features that can be used in more complex semantic classifications, such as coreference resolution and relation recognition. The results of entity relation and time line recognition tasks can in their turn act as features in a script recognition task. Such an approach, to which we refer as a cascaded model, starts from semantically classifying small information units, and in a kind of bootstrapping way uses these to classify larger information units. This model opens avenues for novel learning algorithms and could yield semantic representations of texts at various levels of detail.

In the following two chapters we discuss the typical learning algorithms used in information extraction.



4.7 Bibliography

Abney, Steven P. (1991). Parsing by chunks. In Steven P. Abney, Robert C. Berwick and Carol Tenny (Eds.), Principle Based Parsing: Computation and Psycholinguistics (pp. 257-278). Dordrecht, The Netherlands: Kluwer.

Ahn, David, Sisay F. Adafre and Maarten de Rijke (2005). Extracting temporal information from open domain text. In Proceedings of the 5th Dutch-Belgian Information Retrieval Workshop (DIR'05).

Allen, James (1984). Towards a general theory of action and time. Artificial Intelligence, 23 (2), 123-154.

Bagga, Amit and Breck Baldwin (1998). Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics (COLING-ACL'98) (pp. 79-85). Morgan Kaufmann: ACL.

Bikel, Daniel M., Richard Schwartz and Ralph M. Weischedel (1999). An algorithm that learns what's in a name. Machine Learning, 34 (1/2/3), 211-231.

Borthwick, Andrew E. (1999). A Maximum Entropy Approach to Named Entity Recognition. Ph.D. thesis, Computer Science Department, New York University.

Branting, Karl L. (2003). A comparative evaluation of name matching algorithms. In Proceedings of the 9th International Conference on Artificial Intelligence and Law (pp. 224-232). New York: ACM.

Bunescu, Razvan and Raymond J. Mooney (2004). Collective information extraction with relational Markov networks. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (pp. 439-446). East Stroudsburg, PA: ACL.

Cardie, Claire and Kiri Wagstaff (1999). Noun phrase coreference as clustering. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (pp. 82-89). San Francisco, CA: Morgan Kaufmann.

Chieu, Hai L. and Hwee T. Ng (2002). Named entity recognition: A maximum entropy approach using global information. In COLING 2002. Proceedings of the 19th International Conference on Computational Linguistics (pp. 190-196). San Francisco: Morgan Kaufmann.

Collins, Michael (2002). Ranking algorithms for named-entity extraction: Boosting and the voted perceptron. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 489-496). San Francisco: Morgan Kaufmann.

Collins, Michael and Yoram Singer (1999). Unsupervised models for named entity classification. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), College Park, MD.

Craven, M., et al. (2000). Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence, 118, 69-113.

Culotta, Aron and Jeffrey Sorensen (2004). Dependency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (pp. 424-430). East Stroudsburg, PA: ACL.



Dunning, Ted (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19 (1), 61-74.

Fleischman, Michael and Eduard Hovy (2003). A maximum entropy approach to FrameNet tagging. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. East Stroudsburg, PA: ACL.

Gooi, Chung Heong and James Allan (2004). Cross-document coreference on a large scale corpus. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (pp. 9-16). East Stroudsburg, PA: ACL.

Graesser, Arthur C. and Leslie F. Clark (1985). Structures and Procedures of Implicit Knowledge (Advances in Discourse Processes XVII). Norwood, NJ: Ablex Publishing Corporation.

Hasegawa, Takaaki, Satoshi Sekine and Ralph Grishman (2004). Discovering relations among named entities from large corpora. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (pp. 416-423). East Stroudsburg, PA: ACL.

Hobbs, Jerry R. and Feng Pan (2004). An ontology of time for the semantic Web. ACM Transactions on Asian Language Information Processing, 3 (1), 66-85.

Jones, William P. and George W. Furnas (1987). Pictures of relevance: A geometric analysis of similarity measures. Journal of the American Society for Information Science, 38 (6), 420-442.

Li, Xin, Paul Morie and Dan Roth (2004). Robust reading: Identification and tracing of ambiguous names. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (pp. 17-24). East Stroudsburg, PA: ACL.

Mani, Inderjeet (2003). Recent developments in temporal extraction. In Nicolas Nicolov and Ruslan Mitkov (Eds.), Proceedings of RANLP'03. Amsterdam: John Benjamins.

Mani, Inderjeet, James Pustejovsky and Robert Gaizauskas (Eds.) (2005). The Language of Time: A Reader. Oxford, UK: Oxford University Press.

Mani, Inderjeet, Barry Schiffman and Jianping Zhang (2003). Inferring temporal ordering of events in news. In Proceedings of the Human Language Technology Conference (HLT-NAACL'03) (pp. 55-57). Edmonton, Canada.

Manning, Christopher D. and Hinrich Schütze (1999). Foundations of Statistical Natural Language Processing. Boston, MA: The MIT Press.

Mehay, Dennis N., Rik De Busser and Marie-Francine Moens (2005). Labeling generic semantic roles. In Harry Bunt, Jeroen Geertzen and Elias Thyse (Eds.), Proceedings of the Sixth International Workshop on Computational Semantics (IWCS-6) (pp. 175-187). Tilburg, The Netherlands: Tilburg University.

Miller, George A. (Ed.) (1990). Special issue: WordNet: An on-line lexical database. International Journal of Lexicography, 3 (4).

Müller, Christoph, Stefan Rapp and Michael Strube (2002). Applying co-training to reference resolution. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 352-359). San Francisco: Morgan Kaufmann.

Ng, Vincent and Claire Cardie (2002). Improving machine learning approaches to coreference resolution. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 104-111). San Francisco: Morgan Kaufmann.

Pradhan, Sameer, Wayne Ward, Kadri Hacioglu, James H. Martin and Dan Jurafsky (2004). Shallow semantic parsing using support vector machines. In Proceedings of the Human Language Technology Conference/North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT/NAACL 2004). East Stroudsburg, PA: ACL.

Skiena, Steven S.K. (1998). The Algorithm Design Manual. New York, NY: Springer.

Soon, Wee Meng, Hwee Tou Ng and Daniel Lim (2001). A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27 (4), 521-544.

Theodoridis, Sergios and Konstantinos Koutroumbas (2003). Pattern Recognition. Amsterdam, The Netherlands: Academic Press.

Yang, Xiaofeng, Jian Su, Guodong Zhou and Chew Lim Tan (2004). Improvingpronoun resolution by incorporating coreferential information of candidates. InProceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (pp. 128-135). East Stroudsburg, PA: ACL.

Yarowski, David (1995). Unsupervised word sense disambiguation rivaling super-vised methods. In Proceedings of the 33th Annual Meeting of the Associationfor Computational Linguistics (pp. 189-196). Cambridge, MA.

Zhou, GuoDong and Jian Su (2002). Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 473-480). San Francisco, CA: Morgan Kaufmann.


5 Supervised Classification

5.1 Introduction

Supervised learning is a very popular approach in any text classification task. Many different algorithms are available that learn classification patterns from a set of labeled or classified examples. Given this sample of examples, the task is to model the classification process and to use the model to predict the classes of new, previously unseen examples. In information retrieval the supervised techniques are very popular for the classification of documents into subject categories (e.g., the classification of news into financial, political, cultural, sports, …) using the words of a document as the main features. In information extraction usually smaller content units are classified, with a variety of classification schemes ranging from rather generic categories, such as the generic semantic roles of sentence constituents, to very specific classes, such as the number of heavily burned people in a firework accident.

As in text categorization, the number of training examples is often limited, or training examples are expensive to build. When the example set is small, it very often represents incomplete knowledge about the classification model sought. This danger is especially present in natural language data, where a large variety of patterns express the same content.

Parallel to text categorization, the number of different features is large and the feature set can include some noisy features. There are many different words and syntactic, semantic and discourse patterns that make up the context of an information element. But, compared to text categorization, fewer features in the context are indicative of the class sought. Often, the features behave dependently, and the dependency is not always restricted to co-occurrence of certain feature values; it sometimes also demands that feature values occur in a certain order in the text. When different classes are to be assigned to text constituents, the class assignment might also depend on classes previously assigned.


Chap. 2 gave an extensive historical overview of machine learning approaches that have been used to extract information from text, and relevant references were cited. In this chapter we dig deeper into the current and most successful algorithms for information extraction that use a supervised learning approach. The chosen classifiers allow dealing with incomplete data and with a large set of features that on occasion might be noisy. They also have the potential to be used for weakly supervised learning (described in the next chapter), and they incorporate dependencies in their models.

As seen in Chap. 4, a feature vector x is described by a number of features (see Eq. (4.1)) that may refer to the information element to be classified, to the close context of the element, to the more global context of the document in which the element occurs, and perhaps to the context of the document collection. The goal is to assign a label y to a new example. Among the statistical learning techniques a distinction is often made between generative and discriminative classifiers (Vapnik, 1998; Ng and Jordan, 2002). Given inputs x and their labels y, a generative classifier learns a model of the joint probability p(x,y) and makes its predictions by using Bayes' rule to calculate p(y|x)1 and then selecting the most likely label y. An example of a generative classifier that we will discuss in this chapter is the hidden Markov model. A discriminative classifier models the posterior probability p(y|x) directly and selects the most likely label y, or learns a direct map from inputs x to the class labels. An example is the maximum entropy model, which is very often used in information extraction. Here, the joint probability p(x,y) is modeled directly from the inputs x. Another example is the Support Vector Machine, which is a quite popular learning technique for information extraction. Some of the classifiers adhere to the maximum entropy principle. This principle states that, when we make inferences based on incomplete information, we should draw them from the probability distribution that has the maximum entropy permitted by the information we have (Jaynes, 1982). Two of the discussed classifiers adhere to this principle: the maximum entropy model and conditional random fields.

In information extraction sometimes a relation exists between the various classes. In such cases it is valuable not to classify a feature vector separately from the other feature vectors, in order to obtain a more accurate classification of the individual extracted information. This is referred to as context-dependent classification, as opposed to context-free classification. So, the class to which a feature vector is assigned depends on 1) the feature vector itself; 2) the values of other feature vectors; and 3) the existing relation among the various classes. In information extraction, dependencies exist at the level of descriptive features and at the level of classes, the latter also referring to classes that can be grouped in higher-level concepts (e.g., scripts such as a bank robbery script). We study two context-dependent classifiers, namely a hidden Markov model and one based on conditional random fields.

1 With a slight abuse of notation in the discussion of the probabilistic classifiers, we will also use p(y|x) to denote the entire conditional probability distribution provided by the model, with the interpretation that y and x are placeholders rather than specific instantiations. A model p(y|x) is an element of the set of all conditional probability distributions. In the case that feature vectors take only discrete values, we will denote the probabilities by the capitalized letter P.
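The generative/discriminative contrast can be made concrete with a small counting example. This is only an illustrative sketch: the toy data and both "models" below are invented and stand in for the real classifiers discussed in this chapter. The generative side estimates the joint p(x,y) and applies Bayes' rule; the discriminative side estimates p(y|x) directly from the examples sharing the same x.

```python
from collections import Counter

# Hypothetical training sample of (feature value, label) pairs.
data = [("neuromuscular", "disease"), ("neuromuscular", "disease"),
        ("person", "nodisease"), ("person", "nodisease"),
        ("neuromuscular", "nodisease"), ("person", "disease")]

n = len(data)
joint = Counter(data)              # generative: model the joint p(x, y)
labels = {y for _, y in data}

def predict_generative(x):
    # Bayes' rule: p(y|x) is proportional to p(x, y); pick the most likely y.
    return max(labels, key=lambda y: joint[(x, y)] / n)

# Discriminative: estimate p(y|x) directly from the examples with this x.
cond = {x: Counter(y for xi, y in data if xi == x) for x, _ in data}

def predict_discriminative(x):
    return max(labels, key=lambda y: cond[x][y])

print(predict_generative("neuromuscular"))      # both pick "disease" here
print(predict_discriminative("neuromuscular"))
```

On this tiny sample the two predictions coincide; the chapter's point is that the two families estimate different distributions and behave differently with limited or dependent features.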

Learning techniques that present the learned patterns in representations that are well conceivable and interpretable by humans are still popular in information extraction. Rule and tree learning is the oldest of such approaches. When the rules learned are represented in logical propositions or first-order predicate logic, this form of learning is often called inductive logic programming (ILP). The term relational learning refers to learning in any format that represents relations, including, but not limited to, logic programs, graph representations, probabilistic networks, etc. In the last two sections of this chapter we study respectively rule and tree learning, and relational learning.

The selection of classifiers in this chapter by no means excludes other supervised learning algorithms; we refer to Mitchell (1997) and Theodoridis and Koutroumbas (2003) for a comprehensive overview of supervised classification techniques.

When a binary classifier is learned in an information extraction task, we are usually confronted with an unbalanced example set, i.e., there are usually many more negative examples than positive examples. Here the techniques of active learning discussed in the next chapter might be of help to select a subset of the negative examples.

When using binary classification methods such as a Support Vector Machine, we are usually confronted with the multi-class problem. The larger the number of classes, the more classifiers need to be trained and applied. We can handle the multi-class problem by using a one-vs-rest method (one class versus all other classes) or a pairwise method (one class versus another class). Both methods construct multi-class SVMs by combining several binary SVMs. When classes are not mutually exclusive, the one-vs-rest approach is advisable.
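The one-vs-rest combination can be sketched as follows. This is a hypothetical, minimal illustration: the centroid-distance scorer below merely stands in for a binary SVM's decision value f(x), and the entity classes and example vectors are invented.

```python
# One-vs-rest multi-class combination of binary scorers (illustrative sketch).
def centroid(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def dist2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def train_one_vs_rest(examples):
    # examples: list of (vector, class); one scorer per class vs. the rest.
    classes = sorted({y for _, y in examples})
    models = {}
    for c in classes:
        pos = centroid([x for x, y in examples if y == c])
        neg = centroid([x for x, y in examples if y != c])
        models[c] = (pos, neg)
    return models

def predict(models, x):
    # Score = how much closer x is to the positive than to the "rest" centroid;
    # the class whose binary scorer responds most strongly wins.
    return max(models, key=lambda c: dist2(x, models[c][1]) - dist2(x, models[c][0]))

examples = [([1.0, 0.0], "PER"), ([0.9, 0.1], "PER"),
            ([0.0, 1.0], "LOC"), ([0.1, 0.9], "LOC"),
            ([0.5, 0.5], "ORG"), ([0.6, 0.4], "ORG")]
models = train_one_vs_rest(examples)
print(predict(models, [0.95, 0.05]))  # -> PER
```

In practice each per-class scorer would be a binary SVM trained on the class's positive examples against all other examples, and a pairwise scheme would instead train one SVM per class pair.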

In information extraction we are usually confronted with a complex problem. For instance, on the one hand there is the detection of the boundaries of the information unit in the text; on the other hand there is the classification of the information unit. One can tackle these problems separately, or learn and test the extractor in one run. Sometimes the semantic classes to be assigned are hierarchically structured. This is often the case for entities to be recognized in biomedical texts. The hierarchical structure can be exploited both in an efficient training and testing of the classifier by assuming that one class is subsumed by the other. As an alternative, in relational learning one can learn class assignments and relations between classes. A similar situation occurs where components of a class and their chronological order signal a superclass. An important motivation for separating the classification tasks is when they use different feature sets. For instance, in the boundary recognition task the orthographic features are important, while in the classification task the context words are valuable. The issue of separating the learning tasks or combining them in one classifier will be tackled in Chap. 10.

5.2 Support Vector Machines

Early machine learning algorithms aimed at learning representations of simple symbolic functions that could be understood and verified by experts. Hence, the goal of learning in this paradigm was to output a hypothesis that performed the correct classification of the training data, and the learning algorithms were designed to find such an accurate fit to the data. A hypothesis is complete when it covers all positive examples, and it is consistent when it does not cover any negative ones. It is possible that a hypothesis does not converge to a (nearly) complete and (nearly) consistent one, indicating that there is no rule that discriminates between the positive and the negative examples. This can occur either for noisy data, or in cases where the rule language is not sufficiently complex to represent the dichotomy between positive and negative examples.

This situation has fed the interest in learning a mathematical function that discriminates the classes in the training data. Among these, linear functions are the best understood and the easiest to apply. Traditional statistics and neural network technology have developed many methods for discriminating between two classes of instances using linear functions. They can be called linear learning machines, as they use hypotheses that form linear combinations of the input variables.

In general, complex real-world applications require more expressive hypothesis spaces than the ones formed by linear functions (Minsky and Papert, 1969). Frequently, the target concept cannot be expressed as a simple linear combination of the given features, but requires that more abstract features of the data are exploited. Kernel representations offer an alternative solution by mapping the data into a high dimensional feature space where a linear separation of the data becomes easier. In natural language classification, it is often not easy to find a complete and consistent hypothesis that fits the training data, and in some cases linear functions are insufficient to discriminate the examples of two classes. This is because natural language is full of exceptions and ambiguities: we may not capture sufficient training data to cope with these phenomena, the training data might be noisy, or the function we are trying to learn might not have a simple logical representation.

In this section we will lay out the principles of a Support Vector Machine for data that are linearly or nearly linearly separable. We will also introduce kernel methods, because we think they are a suitable technology for certain information extraction tasks.

The technique of Support Vector Machines (Cristianini and Shawe-Taylor, 2000) is a method that finds a function that discriminates between two classes. In information extraction tasks the two classes are often the positive and negative examples of a class. In the theory discussed below we will use the terms positive and negative examples; this does not exclude that any two different semantic classes can be discriminated.

We will first discuss the technique for example data that are linearly separable and then generalize the idea to data that are not necessarily linearly separable and to examples that cannot be represented by linear decision surfaces, which leads us to the use of kernel functions.

Fig. 5.1. A maximal margin hyperplane with its support vectors highlighted (after Cristianini and Shawe-Taylor, 2000).


In a classical linear discriminant analysis, we find a linear combination of the features (variables) that forms the hyperplane that discriminates between the two classes (e.g., a line in a two-dimensional feature space, a plane in a three-dimensional feature space). Generally, many different hyperplanes exist that separate the examples of the training set into positive and negative examples, among which the best one should be chosen. For instance, one can choose the hyperplane that realizes the maximum margin between the positive and negative examples. The hope is that this leads to a better generalization performance on unseen examples. Or in other words, the hyperplane with margin d that has the maximum Euclidean distance to the closest training examples (support vectors) is chosen. More formally, we compute this hyperplane as follows:

Given the set S of n training examples:

S = {(x1, y1), ..., (xn, yn)}

where xi ∈ ℝ^p (a p-dimensional space) and yi ∈ {−1, +1}, indicating that xi is respectively a negative or a positive example.

When we train with data that are linearly separable, it is assumed that some hyperplane exists which separates the positive from the negative examples. The points which lie on this hyperplane satisfy:

w ⋅ xi + b = 0 (5.1)

where w defines the direction perpendicular to the hyperplane (the normal to the hyperplane). Varying the value of b moves the hyperplane parallel to itself. The quantities w and b are generally referred to as the weight vector and the bias, respectively. The perpendicular distance from the hyperplane to the origin is measured by:

|b| / ||w|| (5.2)

where ||w|| is the Euclidean norm of w.

Let d+ (d-) be the shortest distance from the separating hyperplane to the closest positive (negative) example; d+ and d- thus define the margin to the hyperplane. The task is now to find the hyperplane with the largest margin.

Given the training data that are linearly separable and that satisfy the following constraints:

Page 106: Info Mat Ion Extractions

5.2 Support Vector Machines 95

w ⋅ xi + b ≥ +1 for yi = +1 (5.3)

w ⋅ xi + b ≤ −1 for yi = −1 (5.4)

which can be combined into one set of inequalities:

yi( w ⋅ xi + b) −1≥ 0 for i = 1,…, n (5.5)

The hyperplane that defines one margin is defined by:

H1 : w ⋅ xi + b =1 (5.6)

with perpendicular distance from the origin:

|1 − b| / ||w|| (5.7)

The hyperplane that defines the other margin is defined by:

H 2 : w ⋅ xi + b = −1 (5.8)

with perpendicular distance from the origin:

|−1 − b| / ||w|| (5.9)

Hence d+ = d- = 1/||w|| and the margin = 2/||w||.

In order to maximize the margin, the following objective function is computed:

Minimize_{w,b}  w ⋅ w
Subject to yi( w ⋅ xi + b) − 1 ≥ 0, i = 1,...,n (5.10)
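The geometry of Eqs. (5.3)-(5.10) can be checked numerically. The candidate hyperplane (w, b) and the four training points below are invented for illustration; the snippet verifies the combined constraint of Eq. (5.5) and evaluates the margin 2/||w||.

```python
import math

# Tiny invented 2-D data set with a separating hyperplane w.x + b = 0.
w, b = [1.0, 1.0], -3.0
pos = [[2.0, 2.0], [3.0, 2.0]]   # yi = +1
neg = [[1.0, 1.0], [0.0, 1.0]]   # yi = -1

def f(x):
    # w.x + b
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Constraint (5.5): yi*(w.xi + b) - 1 >= 0 for every training example,
# i.e., f(x) >= +1 on positives and f(x) <= -1 on negatives.
assert all(f(x) >= 1 for x in pos) and all(f(x) <= -1 for x in neg)

norm_w = math.sqrt(sum(wi * wi for wi in w))
print("margin =", 2 / norm_w)   # distance between H1 and H2 is 2/||w||
```

Here both [2, 2] and [1, 1] satisfy their constraint with equality, so they lie exactly on the margin hyperplanes H1 and H2: they are the support vectors of this toy configuration.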

Page 107: Info Mat Ion Extractions

96 5 Supervised Classification

Linear learning machines can be expressed in a dual representation, which turns out to be easier to solve than the primal problem, since handling inequality constraints directly is difficult. The dual problem is obtained by introducing Lagrange multipliers λi, also called dual variables. We can transform the primal representation into a dual one by setting to zero the derivatives of the Lagrangian with respect to the primal variables, and substituting the relations that are obtained in this way back into the Lagrangian, hence removing the dependence on the primal variables. The resulting function contains only dual variables and is maximized under simpler constraints:

Maximize W(λ) = Σ_{i=1..n} λi − ½ Σ_{i,j=1..n} λi λj yi yj (xi ⋅ xj)
Subject to: λi ≥ 0, i = 1,...,n
Σ_{i=1..n} λi yi = 0 (5.11)

It can be noticed that the training examples enter only as inner products (see Eq. (5.11)), meaning that the hypothesis can be expressed as a linear combination of the training points. By solving a quadratic optimization problem, the decision function h(x) for a test instance x is derived as follows:

h(x) = sign( f (x)) (5.12)

f (x) = Σ_{i=1..n} λi yi (xi ⋅ x) + b (5.13)

The function in Eq. (5.13) only depends on the support vectors, i.e., the training examples for which λi > 0. Only the training examples that are support vectors influence the decision function. Also, the decision rule can be evaluated by using just inner products between the test point and the training points.

We can also train a soft margin Support Vector Machine, which is able to deal with some noise, i.e., classifying examples that are linearly separable while taking into account some errors. In this case, the amount of training error is measured using slack variables ξi, the sum of which must not exceed some upper bound.
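Evaluating the dual decision function of Eqs. (5.12)-(5.13) is then a matter of inner products with the support vectors. In the sketch below the support vectors and multiplier values are invented for illustration, not the output of an actual quadratic-programming solver.

```python
# (x_i, y_i, lambda_i) triples with lambda_i > 0, i.e., the support vectors.
sv = [([2.0, 2.0], +1, 1.0),
      ([1.0, 1.0], -1, 1.0)]
b = -3.0

def dot(a, c):
    return sum(ai * ci for ai, ci in zip(a, c))

def f(x):
    # f(x) = sum_i lambda_i * y_i * (x_i . x) + b   (Eq. (5.13))
    return sum(lam * y * dot(xi, x) for xi, y, lam in sv) + b

def h(x):
    # h(x) = sign(f(x))                             (Eq. (5.12))
    return 1 if f(x) >= 0 else -1

print(h([3.0, 3.0]), h([0.0, 0.0]))  # -> 1 -1
```

Note that only inner products between the test point and the stored training points are needed, which is exactly the property that later allows the dot product to be replaced by a kernel function.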


The hyperplanes that define the margins are now defined as:

H1: w ⋅ xi + b = 1 − ξi (5.14)

H2: w ⋅ xi + b = −1 + ξi (5.15)

Hence, we assume the following objective function to maximize the margin:

Minimize_{ξ,w,b}  w ⋅ w + C Σ_{i=1..n} ξi²
Subject to yi( w ⋅ xi + b) − 1 + ξi ≥ 0, i = 1,...,n (5.16)

where Σ_{i=1..n} ξi² is the penalty for misclassification and C is a weighting factor.

The decision function is computed as in the case of data objects that are linearly separable (cf. Eq. (5.13)).

When classifying natural language data, it is not always possible to linearly separate the data. In this case we can map the data into a feature space where they are linearly separable (see Fig. 5.2). However, working in a high dimensional feature space gives computational problems, as one has to work with very large vectors. In addition, there is a generalization theory problem (the so-called curse of dimensionality): when using too many features, we need a corresponding number of samples to ensure a correct mapping between the features and the classes. However, in the dual representation the data appear only inside inner products (both in the training algorithm shown by Eq. (5.11) and in the decision function of Eq. (5.13)). In both cases a kernel function (Eq. (5.19)) can be used in the computations.

A Support Vector Machine is a kernel-based method. It chooses a kernel function that projects the data, typically into a high dimensional feature space, where a linear separation of the data is easier.

Page 109: Info Mat Ion Extractions

98 5 Supervised Classification


Formally, a kernel function K is a mapping K: S × S → [0, ∞] from the instance space of training examples S to a similarity score:

K(xi, xj) = Σ_k φk(xi) φk(xj) = φ(xi) ⋅ φ(xj) (5.17)

In other words, a kernel function is an inner product in some feature space (this feature space can potentially be very complex). The kernel function must be symmetric [K(xi, xj) = K(xj, xi)] and positive semi-definite. By semi-definite we require that if x1, ..., xn ∈ S, then the n × n matrix G defined by Gij = K(xi, xj) is positive semi-definite2. The matrix G is called the Gram matrix or the kernel matrix. Given G, the support vector classifier finds a hyperplane with maximum margins that separates the instances of different classes. In the decision function f(x) we can just replace the dot products with kernels K(xi, xj).
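As a small illustration of Eq. (5.17) and of the Gram matrix, the quadratic kernel K(x, z) = (x ⋅ z)² below corresponds to an explicit feature map into a quadratic feature space; the sample points are invented, and the positive semi-definiteness check is only a spot check on a few test vectors.

```python
# Gram matrix G_ij = K(x_i, x_j) for a small invented sample, with
# K(x, z) = (x . z)^2, the inner product in a quadratic feature space
# phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2).
X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

def K(x, z):
    return sum(xi * zi for xi, zi in zip(x, z)) ** 2

G = [[K(xi, xj) for xj in X] for xi in X]
print(G)

# The kernel matrix must be symmetric ...
assert all(G[i][j] == G[j][i] for i in range(3) for j in range(3))

# ... and positive semi-definite: spot-check v^T G v >= 0 for a few v.
for v in ([1, 0, 0], [1, -1, 1], [-1, 2, -1]):
    q = sum(v[i] * G[i][j] * v[j] for i in range(3) for j in range(3))
    assert q >= 0
```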

h(x) = sign( f (x)) (5.18)

f (x) = Σ_{i=1..n} λi yi φ(xi) ⋅ φ(x) + b (5.19)

or

f (x) = Σ_{i=1..n} λi yi K(xi, x) + b

2 A matrix A ∈ ℝ^{p×p} is a positive semi-definite matrix if x^T A x ≥ 0 for all x ∈ ℝ^p. A positive semi-definite matrix has non-negative eigenvalues.

Fig. 5.2. A mapping of the features can make the classification task easier (after Cristianini and Shawe-Taylor, 2000).


To classify an unseen instance x, the classifier first projects x into the feature space defined by the kernel function. Classification then consists of determining on which side of the separating hyperplane x lies. If we have a way of efficiently computing the inner product φ(xi) ⋅ φ(x) in the feature space as a function of the original input points, the decision rule of Eq. (5.19) can be evaluated by at most n iterations of this process.

An example of a simple kernel function is the bag-of-words kernel used in text categorization, where a document is represented by a binary vector and each element corresponds to the presence or absence of a particular word in the document. Here, φk(xi) = 1 if the word occurs in document xi, and word order is not considered. Thus, the kernel function K(xi, xj) is a simple function that returns the number of words in common between xi and xj.

Kernel methods are effective at reducing the feature engineering burden for structured objects. In natural language processing tasks, the objects being modeled are often strings, trees or other discrete structures. By calculating the similarity between two such objects, kernel methods can employ dynamic programming solutions to efficiently enumerate over substructures that would be too costly to explicitly include as features.
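The bag-of-words kernel just described reduces to counting shared words. A minimal sketch, in which the two example "documents" are invented and lower-casing with whitespace tokenization is a simplifying assumption:

```python
# Bag-of-words kernel: documents as binary word-presence vectors, so the
# kernel value is the size of the intersection of their word sets.
def bow_kernel(doc_i, doc_j):
    return len(set(doc_i.lower().split()) & set(doc_j.lower().split()))

d1 = "the patient shows neuromuscular symptoms"
d2 = "the patient recovered"
print(bow_kernel(d1, d2))  # -> 2  ("the" and "patient")
```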

Another example that is relevant in information extraction is the tree kernel. Tree kernels constitute a particular case of more general kernels defined on a discrete structure (convolution kernels) (Collins and Duffy, 2001). The idea is to split the structured object into parts, to define a kernel on the "atoms", and to recursively compute the kernel over larger parts in order to get the kernel of the whole structure.

The property of kernel methods of mapping complex objects into a feature space where an easier discrimination between objects can be performed, and their capability to efficiently consider the features of complex objects, make them also interesting for information extraction tasks. In information extraction we can combine parse tree similarity with a similarity based on feature correspondence of the nodes of the trees. In the feature vector of each node additional attributes can be modeled (e.g., POS, general POS, entity type, entity level, WordNet hypernyms). Another example in information extraction would be to model script tree similarity of discourses, where nodes store information about certain actions and their arguments.

We illustrate the use of a tree kernel in an entity relation recognition task (Zelenko et al., 2003; Culotta and Sorensen, 2004). More specifically, the purpose of this research is to find relations between entities that are already recognized as persons, companies, locations, etc. (e.g., John works for Concentra).


In this example, the training set is composed of parsed sentences in which the sought relations are annotated. For each entity pair found in the same sentence, a dependency tree of this training example is captured based on the syntactic parse of the sentence. Then, a tree kernel can be defined that is used in a SVM to classify the test examples.

The kernel function incorporates two functions that consider attribute correspondence of two nodes ti and tj: a matching function m(ti, tj) ∈ {0, 1} and a similarity function s(ti, tj) ∈ [0, ∞]. The former just determines whether two nodes are matchable or not, i.e., two nodes can be matched when they are of compatible type. The latter computes the correspondence of the nodes ti and tj based on a similarity function that operates on the nodes' attribute values.

For two dependency trees T1 and T2 the tree kernel K(T1, T2) can be defined by the following recursive function:

K(ti, tj) = 0, if m(ti, tj) = 0
K(ti, tj) = s(ti, tj) + Kc(ti[c], tj[c]), otherwise (5.20)

where Kc is a kernel function that defines the similarity of the trees in terms of children subsequences. Note that two nodes are not matchable when one of them is nil. Let a and b be sequences of indices, such that a is a sequence a1 ≤ a2 ≤ … ≤ ak, and likewise for b. Let d(a) = ak − a1 + 1 and let l(a) be the length of a. Then Kc can be defined as:

Kc(ti[c], tj[c]) = Σ_{a,b: l(a)=l(b)} λ^{d(a)} λ^{d(b)} K(ti[a], tj[b]) (5.21)

The constant 0 < λ < 1 is a decay factor that penalizes matching subsequences that are spread out within the child sequences.

Intuitively, whenever we find a pair of matching nodes, the model searches for all matching subsequences of the children of each node. For each matching pair of nodes (ti, tj) in a matching subsequence, we accumulate the result of the similarity function s(ti, tj) and then recursively search for matching subsequences of their children ti[c] and tj[c]. Two types of tree kernels are considered in this model. A contiguous kernel only matches child subsequences that are uninterrupted by non-matching nodes; therefore, d(a) = l(a). On the other hand, a sparse tree kernel allows non-matching nodes within matching subsequences.
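The recursion of Eq. (5.20) can be sketched as follows. This is a drastically simplified contiguous variant: children are compared position by position, so the full subsequence enumeration and decay machinery of Eq. (5.21) is reduced to a single λ factor per level. The node "type" and "word" attributes, the similarity scores and the toy dependency trees are all invented for illustration.

```python
# Simplified contiguous tree-kernel sketch in the spirit of Eq. (5.20).
LAMBDA = 0.5  # decay factor penalizing matches deeper in the tree

def m(t1, t2):
    # matching function: are the two nodes of compatible type?
    return t1["type"] == t2["type"]

def s(t1, t2):
    # similarity function on the nodes' attribute values (invented scores)
    return 1.0 if t1["word"] == t2["word"] else 0.2

def K(t1, t2):
    if not m(t1, t2):
        return 0.0
    total = s(t1, t2)
    # Compare children pairwise by position (the contiguous simplification).
    for c1, c2 in zip(t1.get("children", []), t2.get("children", [])):
        total += LAMBDA * K(c1, c2)
    return total

t1 = {"type": "VP", "word": "works",
      "children": [{"type": "NP", "word": "John"},
                   {"type": "PP", "word": "for"}]}
t2 = {"type": "VP", "word": "works",
      "children": [{"type": "NP", "word": "Mary"},
                   {"type": "PP", "word": "for"}]}
print(K(t1, t2))  # 1.0 + 0.5*0.2 + 0.5*1.0 = 1.6
```

A full implementation would enumerate all child subsequences a, b with l(a) = l(b) and apply the λ^{d(a)} λ^{d(b)} penalty of Eq. (5.21), typically with dynamic programming.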

The above example shows that kernel methods have a lot to offer in information extraction. Complex information contexts can be modeled in a kernel function, and problem-specific kernel functions can be drafted. The problem is then concentrated on finding the suitable kernel function. The use of kernels as a general technique for using linear machines in a non-linear fashion can be exported to other learning systems (e.g., nearest neighbor classifiers).

Generally, Support Vector Machines are successfully employed in named entity recognition tasks (e.g., Isozaki and Kazawa, 2002), noun phrase coreference resolution (e.g., Isozaki and Hirao, 2003), semantic role recognition (e.g., Zhang and Lee, 2003; Mehay et al., 2005) and entity relation recognition (Culotta and Sorensen, 2004). As explained above, Support Vector Machines have the advantage that they can cope with many (sometimes noisy) features without being doomed by the curse of dimensionality.

5.3 Maximum Entropy Models

The maximum entropy model (sometimes referred to as MAXENT) computes the probability distribution p(x,y) with maximum entropy that satisfies the constraints set by the training examples (Berger et al., 1996). Among the possible distributions that fit the training data, the one is chosen that maximizes the entropy. The concept of entropy is known from Shannon's information theory (Shannon, 1948). It is a measure of the uncertainty concerning an event, and from another viewpoint a measure of the randomness of a message (here a feature vector).

Let us first explain the maximum entropy model with a simple example of named entity recognition. Suppose we want to model the probability of a named entity being a disease or not when it appears in three very simple contexts. In our example the contexts are composed of the word that is to be classified being one of the set {neuromuscular, Lou Gehrig, person}. In other words, the aim is to compute the joint probability distribution p defined over {neuromuscular, Lou Gehrig, person} × {disease, nodisease}, given a training set S of n training examples:

S = {(x1, y1), (x2, y2), ..., (xn, yn)}.

Because p is a probability distribution, a first constraint on the model is that:

Σ_{x,y} p(x,y) = 1 (5.22)


or

p(neuromuscular, disease) + p(Lou Gehrig, disease) + p(person, disease) + p(neuromuscular, nodisease) + p(Lou Gehrig, nodisease) + p(person, nodisease) = 1

It is obvious that numerous distributions satisfy this constraint, as seen in Tables 5.1 and 5.2. The training set will impose additional constraints on the distribution. In a maximum entropy framework, constraints imposed on a model are represented by k binary-valued3 features known as feature functions. A feature function fj takes the following form:

fj(x,y) = 1 if (x,y) satisfies a certain constraint
fj(x,y) = 0 otherwise (5.23)

Table 5.1. An example of a distribution that satisfies the constraint in Eq. (5.22).

                disease   nodisease
neuromuscular    1/4        1/8
Lou Gehrig       1/8        1/8
person           1/8        1/4
Total                       1.0

Table 5.2. An example of a distribution that in the most uncertain way satisfies the constraint in Eq. (5.22).

                disease   nodisease
neuromuscular    1/6        1/6
Lou Gehrig       1/6        1/6
person           1/6        1/6
Total                       1.0

From the training set we learn that in 50% of the examples in which a disease is mentioned the term Lou Gehrig occurs, and that 70% of the examples of the training set are classified as disease, imposing the following constraints expressed by the feature functions:

3 The model is not restricted to binary features. For binary features efficient numerical methods exist for computing the model parameters of Eq. (5.35).

Page 114: Info Mat Ion Extractions

fLouGehrig(x,y) = 1 if x1 = Lou Gehrig and y = disease
fLouGehrig(x,y) = 0 otherwise (5.24)

fdisease(x,y) = 1 if y = disease
fdisease(x,y) = 0 otherwise (5.25)

In this simplified example, our training set does not give any information about the other context terms. The problem is how to find the most uncertain model that satisfies the constraints. In Table 5.3 one can again look for the most uniform distribution satisfying these constraints, but the example makes it clear that the choice is not always obvious. The maximum entropy model offers a solution here. Thus, when training the system, we choose the model p* that preserves as much uncertainty as possible, or which maximizes the entropy H(p) among all the models p ∈ P that satisfy the constraints enforced by the training examples.

H(p) = − Σ_{x,y} p(x,y) log p(x,y) (5.26)

p* = argmax_{p∈P} H(p) (5.27)
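Computing Eq. (5.26) for the distributions of Tables 5.1 and 5.2 confirms that the uniform distribution is the more uncertain one. The base-2 logarithm below (entropy in bits) is a presentational choice; the equations in the text leave the base unspecified.

```python
import math

# Joint probabilities of the six (context, label) cells, read row by row.
table_5_1 = [1/4, 1/8, 1/8, 1/8, 1/8, 1/4]   # non-uniform distribution
table_5_2 = [1/6] * 6                         # uniform distribution

def H(p):
    # Entropy of Eq. (5.26); 0*log(0) is treated as 0 by skipping zero cells.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(round(H(table_5_1), 3), round(H(table_5_2), 3))  # -> 2.5 2.585
```

The uniform distribution attains the maximum possible entropy log2(6) ≈ 2.585 bits, which is why, absent any constraints, the maximum entropy principle would select it.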

In the examples above we have considered an input training example characterized by a certain label y and a feature vector x containing the context of the word (e.g., as described by the surrounding words and their POS tags). We can collect n training examples and summarize the training sample S in terms of its empirical probability distribution p̃, defined by:

Table 5.3. An example of a distribution that satisfies the constraints in Eqs. (5.22), (5.24) and (5.25).

                disease   nodisease
neuromuscular     ?          ?
Lou Gehrig       0.5         ?
person            ?          ?
Total            0.7        1.0


p̃(x,y) ≡ no / n (5.28)

where no is the number of times a particular pair (x,y) occurs in S, with no ≥ 0.

We want to compute the expected value of the feature function fj with respect to the empirical distribution p̃(x,y):4

Ep̃(fj) = Σ_{x,y} p̃(x,y) fj(x,y) (5.29)

The statistics of a feature function are captured, and it is required that the model we are building accords with them. We do this by constraining the expected value that the model assigns to the corresponding feature function fj. The expected value of fj with respect to the model p(y|x) is:

Ep(fj) = Σ_{x,y} p̃(x) p(y|x) fj(x,y) (5.30)

where p̃(x) is the empirical distribution of x in the training sample. We constrain this expected value to be the same as the expected value of fj in the training sample, i.e., the empirical expectation of fj. That is, we require:

Ep(fj) = Ep̃(fj) (5.31)

Combining Eqs. (5.29), (5.30) and (5.31) yields the following constraint equation:

Σ_{x,y} p̃(x) p(y|x) f_j(x,y) = Σ_{x,y} p̃(x,y) f_j(x,y)    (5.32)

By restricting attention to those models p(y|x) for which Eq. (5.31) holds, we eliminate from consideration the models that do not agree with the training samples. In addition, according to the principle of maximum entropy we should select the distribution which is most uniform. A mathematical measure of the uniformity of a conditional distribution p(y|x) is provided by the conditional entropy. The conditional entropy H(Y|X) measures how much entropy a random variable Y has remaining if we have already learned completely the value of a second random variable X. The conditional entropy of a discrete random variable Y given X is:

4 The notation is sometimes abused: f_j(x,y) will denote both the value of f_j for a particular pair (x,y) and the entire function f_j.

H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x)    (5.33)

H(Y|X) = −Σ_{x∈X} Σ_{y∈Y} p(x) p(y|x) log p(y|x)    (5.34)

or⁵

H(p) ≡ −Σ_{x∈X, y∈Y} p̃(x) p(y|x) log p(y|x)

Note that p̃(x) is estimated from the training set and p(y|x) is the learned model. When the model has no uncertainty at all, the entropy is zero. When the values of y are uniformly distributed, the entropy is log |Y|. It has been shown that there is always a unique model p*(y|x) with maximum entropy that obeys the constraints set by the training set. Considering the feature vector x of a test example, this distribution has the following exponential form:

p*(y|x) = (1/Z) exp(Σ_{j=1}^{k} λ_j f_j(x,y)),  0 < λ_j < ∞    (5.35)

where f_j(x,y) = one of the k binary-valued feature functions
λ_j = parameter adjusted to model the observed statistics
Z = normalizing constant computed as:

Z = Σ_y exp(Σ_{j=1}^{k} λ_j f_j(x,y))    (5.36)
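A minimal sketch of the exponential model of Eqs. (5.35)–(5.36); the labels, feature functions and parameter values are invented, and in practice the λ_j are fitted numerically:

```python
import math

LABELS = ["disease", "person"]

def f1(x, y):  # invented illustrative feature
    return 1 if "Lou" in x and y == "disease" else 0

def f2(x, y):  # invented illustrative feature
    return 1 if "Mr." in x and y == "person" else 0

features = [f1, f2]
lambdas = [1.2, 0.8]  # invented weights; normally learned from data

def p_star(y, x):
    """Eq. (5.35): p*(y|x) = exp(sum_j lambda_j f_j(x,y)) / Z, with Z from Eq. (5.36)."""
    def unnorm(lbl):
        return math.exp(sum(l * f(x, lbl) for l, f in zip(lambdas, features)))
    Z = sum(unnorm(lbl) for lbl in LABELS)
    return unnorm(y) / Z

x = ("Lou", "Gehrig")
print(p_star("disease", x))  # > 0.5, since only f1 fires for this context
```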

So, the task is to define the parameters λ_j in p which maximize H(p). In simple cases we can find the solution analytically; in more complex cases we need numerical methods to derive λ_j given a set of constraints. The problem can be considered as a constrained optimization problem, where we have to find the set of parameters of an exponential model which maximizes its log likelihood. Different numerical methods can be applied for this task, among which are generalized iterative scaling (Darroch and Ratcliff, 1972), improved iterative scaling (Della Pietra et al., 1997), gradient ascent and conjugate gradient (Malouf, 2002).

5 Following Berger et al. (1996) we use here the notation H(p) in order to emphasize the dependence of the entropy on the probability distribution p instead of the common notation H(Y|X) where Y and X are random variables.

We also have to efficiently compute the expectation of each feature function. Eq. (5.30) cannot be efficiently computed, because it would involve summing over all possible combinations of x and y, a potentially infinite set. Instead the following approximation is used, which takes into account the n training examples x_i:

E_p(f_j) = (1/n) Σ_{i=1}^{n} Σ_y p(y|x_i) f_j(x_i, y)    (5.37)

The maximum entropy model has been successfully applied to natural language tasks in which context-sensitive modeling is important (Berger et al., 1996; Ratnaparkhi, 1998), among which is information extraction. The model has been used in named entity recognition (e.g., Chieu and Hwee, 2002), coreference resolution (e.g., Kehler, 1997) and semantic role recognition (Fleischman et al., 2003; Mehay et al., 2005). The maximum entropy model offers many advantages. The classifier allows modeling dependencies between features, which certainly exist in many information extraction tasks. The classifier has the advantage that there is no need for an a priori feature selection, as features that are just randomly associated with a certain class will keep their randomness in the model. This has the advantage that you can train and experiment with many context features in the model, in an attempt to decrease the ambiguity of the learned patterns. Moreover, the principle of maximum entropy states that when we make inferences based on incomplete information, we should draw them from the probability distribution that has the maximum entropy permitted by the information that we do have (Jaynes, 1982). In many information extraction tasks, our training set is often incomplete given the large variety of natural language patterns that convey the semantic classes sought. Here, the maximum entropy approach offers a satisfactory solution.

The above classification methods assume that there is no relation between the various classes. In information extraction in particular and in text understanding in general, content is often dependent. For instance, when there is no grant approved, there is also no beneficiary of the grant. Or, more formally, one can say: there is only one or a finite number of ways in which information can be sequenced in a text or in a text sentence in order to convey its meaning. The scripts developed by Schank and his school in the 1970s and 1980s are an illustrative example (e.g., you have to get on the bus before you can ride the bus). But also at the more fine-grained level of the sentence, the functional position of an information unit in dependency with the other units defines the fine-grained meaning of the sentence units (e.g., semantic roles). In other words, information contained in text often has a certain dependency: one piece cannot exist without the other, or it has a high chance to occur with other information. This dependency and the nature of the dependency can be signaled by lexical items (and their co-occurrence in a large corpus) and refined by the syntactical constructs of the language, including the discourse structure.
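As an illustration of the numerical fitting mentioned above, the following sketch runs generalized iterative scaling on an invented two-label toy problem. The feature set is chosen so that every (x, y) pair activates exactly one feature, making the GIS constant C = 1 (for C ≠ 1 the update divides the log-ratio by C); the model expectations follow the approximation of Eq. (5.37):

```python
import math

# Toy problem (invented): one context x and two labels.
LABELS = ["A", "B"]
train = [("a-ctx", "A")] * 3 + [("a-ctx", "B")]  # empirical sample S
n = len(train)

features = [
    lambda x, y: 1 if y == "A" else 0,
    lambda x, y: 1 if y == "B" else 0,
]
lambdas = [0.0, 0.0]

def p(y, x):
    """Exponential model of Eq. (5.35)."""
    def u(lbl):
        return math.exp(sum(l * f(x, lbl) for l, f in zip(lambdas, features)))
    return u(y) / sum(u(lbl) for lbl in LABELS)

# Empirical expectations, Eq. (5.29).
emp = [sum(f(x, y) for x, y in train) / n for f in features]

for _ in range(10):  # generalized iterative scaling updates (C = 1)
    # Model expectations approximated over the training x's, Eq. (5.37).
    mod = [sum(p(y, x) * f(x, y) for x, _ in train for y in LABELS) / n
           for f in features]
    lambdas = [l + math.log(e / m) for l, e, m in zip(lambdas, emp, mod)]

print(p("A", "a-ctx"))  # converges to the empirical ratio 0.75
```

With C = 1 the fitted model reproduces the empirical label proportions after a single update; real feature sets need the slack feature that pads Σ_j f_j(x,y) up to a constant C.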

In pattern recognition there are a number of algorithms for context-dependent classification. In these models, the objects are described by feature vectors, but the features and their values stored in different feature vectors together contribute to the classification. In order to reduce the computational complexity of the algorithms, the vectors are often processed in a certain order and the dependency upon vectors previously processed is limited. The class to which a feature vector is assigned depends on its own value, on the values of the other feature vectors, and on the existing relation among the various classes. In other words, having obtained the class c_i for a feature vector x_i, the next feature vector cannot always belong to any other class. In the following sections we will discuss two common approaches to context-dependent information recognition: hidden Markov models and conditional random fields. We foresee that many other useful context-dependent classification algorithms will be developed in text understanding. In context-dependent classification, feature vectors are often referred to as observations. For instance, the feature vector x_i occurs in a sequence of observations X = (x_1,…,x_T).

5.4 Hidden Markov Models

In Chap. 2 we have seen that finite state automata quite successfully recognize a sequence of information units in a sentence or a text. In such a model a text is considered as a sequence of symbols and not as an unordered set. The task is to assign a class sequence Y = (y_1,…,y_T) to the sequence of observations X = (x_1,…,x_T). Research in information extraction has recently investigated probabilistic sequence models, where the task is to assign the most probable sequence of classes to the chain of observations. Typically, the model of the content is implemented as a Markov chain of states, in which transition probabilities between states and the probabilities of emissions of certain symbols of the alphabet are modeled.

In Fig. 5.3 the content of a Belgian criminal court decision is modeled as a Markov chain. The states are shown as circles and the start state is indicated as start. Possible transitions are shown by edges that connect states, and an edge is labeled with the probability of this transition. Transitions with zero probability are omitted from the graph. Note that the probabilities of the edges that go out from each state sum to 1. From this representation, it should be clear that a Markov model can be thought of as a (non-deterministic) finite state automaton with probabilities attached to each edge.

Fig. 5.3. An example Markov model that represents a Belgian criminal court decision. Some examples of emissions are shown without their probabilities. [Figure: states include start, court, date number, date letter, victim, accused, offence, routine opinion, opinion, routine foundation, foundation, verdict, conclusion and end; example emissions include "Nineteen hundred…", "John Smith" and "Transport law".]

The probability of a sequence of states or classes Y = (y_1,…,y_T) is easily calculated for a Markov chain:

P(y_1,…,y_T) = P(y_1) P(y_2|y_1) P(y_3|y_1,y_2) … P(y_T|y_1,…,y_{T−1})    (5.38)

A first order Markov model assumes that class dependence is limited only within two successive classes, yielding:

P(y_1,…,y_T) = P(y_1) P(y_2|y_1) P(y_3|y_2) … P(y_T|y_{T−1})    (5.39)

= P(y_1) ∏_{i=2}^{T} P(y_i|y_{i−1})    (5.40)
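Eq. (5.40) amounts to a product of probability lookups. A small sketch with invented states and transition probabilities (not the values of Fig. 5.3):

```python
# First order Markov chain, Eq. (5.40): P(y1..yT) = P(y1) * prod_i P(yi|yi-1).
# The states and the probabilities below are invented for illustration.
initial = {"court": 1.0}
trans = {("court", "date"): 0.8, ("court", "victim"): 0.2,
         ("date", "victim"): 1.0}

def sequence_prob(states):
    p = initial.get(states[0], 0.0)
    for prev, cur in zip(states, states[1:]):
        p *= trans.get((prev, cur), 0.0)  # unseen transitions get probability 0
    return p

print(sequence_prob(["court", "date", "victim"]))  # 1.0 * 0.8 * 1.0 = 0.8
```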

In Fig. 5.3 only some of the emission symbols are shown. The models that we consider in the context of information extraction have a discrete output, i.e., an observation outputs discrete values.

A first order Markov model is composed of a set of states Y with specified initial and final states y_1 and y_T, a set of transitions between states, and a discrete vocabulary of output symbols {σ_1, σ_2,…,σ_k}. In information extraction the emission symbols are usually words. The model generates an observation X = (x_1,…,x_T) by beginning in the initial state, transitioning to a new state, emitting an output symbol, transitioning to another state, emitting another symbol, and so on, until a transition is made into the final state. The parameters of the model are the transition probabilities P(y_i|y_{i−1}) that one state follows another and the emission probabilities P(x_i|y_i) that a state emits a particular output symbol.⁶

Classification regards the recognition of the most probable path in the model. For the task of information extraction this translates into the following procedure. Having observed the sequence of feature vectors X = (x_1,…,x_T), we have to find the respective sequence of classes or states Y = (y_1,…,y_T) that is most probably followed in the model. We compute Y* for which

Y* = argmax_Y P(Y|X)    (5.41)

P(Y|X) = P(y_1) P(x_1|y_1) ∏_{i=2}^{T} P(y_i|y_{i−1}) P(x_i|y_i)    (5.42)

In order to compute the most probable path the Viterbi algorithm is used. Instead of a brute-force computation, by which all possible paths are computed, the Viterbi algorithm efficiently computes a subset of these paths. It is based on the observation that, if you look at the best path that goes through a given state y_i at a given time t_i, the path is the concatenation of the best path that goes from state y_1 to y_i (while emitting symbols corresponding to the feature vectors x_1 to x_i respectively at times t_1 to t_i) with the best path from state y_i to the final state y_T (while emitting symbols corresponding to the feature vectors x_{i+1} to x_T respectively at times t_{i+1} to t_T). This is because the probability of a path going through state y_i is simply the product of the probabilities of the two parts (before and after y_i), so that the maximum probability of the global path is obtained when each part has a maximum probability.

6 We mean here the discrete symbol that is represented by the feature vector x.
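The Viterbi recursion can be sketched compactly; the two-state model below and all its probabilities are invented for illustration:

```python
# A compact Viterbi decoder for Eqs. (5.41)-(5.42); all model numbers invented.
states = ["Title", "Lastname"]
start_p = {"Title": 0.6, "Lastname": 0.4}
trans_p = {("Title", "Title"): 0.1, ("Title", "Lastname"): 0.9,
           ("Lastname", "Title"): 0.2, ("Lastname", "Lastname"): 0.8}
emit_p = {("Title", "Mr."): 0.9, ("Title", "Callender"): 0.1,
          ("Lastname", "Mr."): 0.05, ("Lastname", "Callender"): 0.95}

def viterbi(obs):
    # delta[y] = probability of the best path ending in state y; psi backtracks it.
    delta = {y: start_p[y] * emit_p.get((y, obs[0]), 0.0) for y in states}
    psi = []
    for x in obs[1:]:
        step, back = {}, {}
        for y in states:
            prev, score = max(((yp, delta[yp] * trans_p[(yp, y)]) for yp in states),
                              key=lambda t: t[1])
            step[y] = score * emit_p.get((y, x), 0.0)
            back[y] = prev
        delta, psi = step, psi + [back]
    # Recover the best final state, then follow the back-pointers.
    best = max(delta, key=delta.get)
    path = [best]
    for back in reversed(psi):
        path.append(back[path[-1]])
    return list(reversed(path))

print(viterbi(["Mr.", "Callender"]))  # ['Title', 'Lastname']
```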

When we want to train a Markov model based on labeled sequences X_all, there are usually two steps. First, one has to define the model of states or classes, which is called the topology of the model. Secondly, one has to learn the emission and transition probabilities of the model. The first step is usually drafted by hand when training an information extraction system (although at the end of this section we will mention some attempts to learn a state model). In the second step, the probabilities of the model are learned from the classified training examples. The task is learning the probabilities of the initial state, the state transitions and the emissions of a model µ.

In a visible Markov model (Fig. 5.4), the state sequence that actually generated an example is known, i.e., we can directly observe the states and the emitted symbols. If we can identify the path that was taken inside the model to produce each training sequence, we are able to estimate the probabilities by the relative frequency of each transition from each state and of emitting each symbol. The labeling is used to directly compute the probabilities of the parameters of the Markov model by means of maximum likelihood estimates in the training set X_all. The transition probabilities P(y'|y) and the emission probabilities P(x|y) are based on the counts of, respectively, the class transitions ξ(y→y') or ξ(y,y') and of the emissions occurring in a class γ(y) where y↑x_i, considered at the different times t:

P(y'|y) = Σ_{t=1}^{T−1} ξ_t(y,y') / Σ_{t=1}^{T−1} γ_t(y)    (5.43)

P(x|y) = Σ_{t=1, y↑x}^{T} γ_t(y) / Σ_{t=1}^{T} γ_t(y)    (5.44)
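For fully observed sequences, Eqs. (5.43)–(5.44) reduce to counting relative frequencies. A sketch with invented labeled sequences:

```python
from collections import Counter

# Labeled training sequences for a visible Markov model (invented data):
# each element is a (state, emitted word) pair, so paths are fully observed.
sequences = [
    [("Title", "Mr."), ("Lastname", "Callender")],
    [("Title", "Mr."), ("Lastname", "Smith")],
    [("Firstname", "John"), ("Lastname", "Smith")],
]

trans = Counter()   # transition counts xi(y, y') of Eq. (5.43)
emit = Counter()    # counts of state y emitting word x, Eq. (5.44)
state = Counter()   # occupancy counts gamma(y)
for seq in sequences:
    for (y, x) in seq:
        emit[(y, x)] += 1
        state[y] += 1
    for (y, _), (y2, _) in zip(seq, seq[1:]):
        trans[(y, y2)] += 1

def p_trans(y2, y):
    """Maximum likelihood estimate of P(y'|y), Eq. (5.43)."""
    return trans[(y, y2)] / sum(c for (a, _), c in trans.items() if a == y)

def p_emit(x, y):
    """Maximum likelihood estimate of P(x|y), Eq. (5.44)."""
    return emit[(y, x)] / state[y]

print(p_trans("Lastname", "Title"))  # 2/2 = 1.0
print(p_emit("Smith", "Lastname"))   # 2/3
```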

In a hidden Markov model (Rabiner, 1989) (Fig. 5.5) you do not know the state sequence that the model passed through when generating the training examples. The states of the training examples are not fully observable. This is often the case in an information extraction task from a sequence of words. Each state (e.g., word) is associated with a class that we want to extract. Some states are just background states, when they represent information not to be extracted or semantically labeled. As a result some of the words are observed as emission symbols and have an unknown class or state.

Fig. 5.4. Example of a visible Markov model for a named entity recognition task. [Figure: states include Title, First name, Last name and Verbal process; example emissions are "Mr.", "John", "Callender" and "said".]

In this case the transition and emission probabilities are inferred from a sequence of observations via some optimization function that is iteratively computed. The training of parameters is usually performed via the Baum-Welch algorithm, which is a special case of the Expectation-Maximization algorithm (EM) (Dempster et al., 1977). The task is learning the probabilities of the initial state, the state transitions and the emissions of the model µ. The Baum-Welch approach is characterized by the following steps:

1. Start with initial estimates for the probabilities chosen randomly or according to some prior knowledge.

2. Apply the model on the training data:
Expectation step (E): Use the current model and observations to calculate the expected number of traversals across each arc and the expected number of traversals across each arc while producing a given output.
Maximization step (M): Use these calculations to update the model into a model that most likely produces these ratios.

3. Iterate step 2 until a convergence criterion is satisfied (e.g., when the differences of the values with the values of a previous step are smaller than a threshold value ε).

Fig. 5.5. Example of a hidden Markov model for a named entity recognition task. [Figure: several states are unknown and marked "?"; example emissions include "smart", "Callender" and "grammarian".]

Expectation step (E)
We consider the number of times that a path passes through state y at time t and through state y' at the next time t + 1, and the number of times this state transition occurs while generating the training sequences X_all given the parameters of the current model µ. We then can define:

ξ_t(y,y') ≡ ξ_t(y,y'|X_all, µ) = P(y_t = y, y_{t+1} = y', X_all|µ) / P(X_all|µ)    (5.45)

= α(y_t = y) P(y'|y) P(x_{t+1}|y') β(y_{t+1} = y') / P(X_all|µ)    (5.46)

where α(y_t = y) represents the path history terminating at time t and state y (i.e., the probability of being at state y at time t and outputting the first t symbols) and β(y_{t+1} = y') represents the future of the path, which at time t + 1 is at state y' and then evolves unconstrained until the end (i.e., the probability of being at the remaining states and outputting the remaining symbols). We define also the probability of being at time t at state y:

γ_t(y) ≡ γ_t(y|X_all, µ) = α(y_t = y) β(y_t = y) / P(X_all|µ)    (5.47)

Σ_{t=1}^{T−1} γ_t(y) can be regarded as the expected number of transitions from state y, given the model µ and the observation sequences X_all. Σ_{t=1}^{T−1} ξ_t(y,y') can be regarded as the expected number of transitions from state y to state y', given the model µ and the observation sequences X_all.

Maximization step (M)
During the M-step the following formulas compute reasonable estimates of the unknown model parameters:

P(y'|y) = Σ_{t=1}^{T−1} ξ_t(y,y') / Σ_{t=1}^{T−1} γ_t(y)    (5.48)

P(x|y) = Σ_{t=1, y↑x}^{T} γ_t(y) / Σ_{t=1}^{T} γ_t(y)    (5.49)

P(y) = γ_1(y)    (5.50)
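The E-step quantities α, β, γ_t and ξ_t can be sketched for a single observation sequence as follows (toy two-state model, all probabilities invented, and no rescaling against underflow):

```python
# Forward-backward computation of gamma (Eq. 5.47) and xi (Eq. 5.46) for one
# observation sequence; toy two-state model with invented probabilities.
states = [0, 1]
pi = [0.5, 0.5]                                    # initial state probabilities
A = [[0.7, 0.3], [0.4, 0.6]]                       # A[y][y2] = P(y2|y)
B = [{"a": 0.8, "b": 0.2}, {"a": 0.3, "b": 0.7}]   # emission P(x|y)
obs = ["a", "b", "a"]
T = len(obs)

# alpha[t][y]: probability of emitting obs[:t+1] and being in state y at time t.
alpha = [[pi[y] * B[y][obs[0]] for y in states]]
for t in range(1, T):
    alpha.append([sum(alpha[t - 1][yp] * A[yp][y] for yp in states) * B[y][obs[t]]
                  for y in states])

# beta[t][y]: probability of emitting obs[t+1:] given state y at time t.
beta = [[1.0, 1.0] for _ in range(T)]
for t in range(T - 2, -1, -1):
    beta[t] = [sum(A[y][y2] * B[y2][obs[t + 1]] * beta[t + 1][y2] for y2 in states)
               for y in states]

p_obs = sum(alpha[T - 1][y] for y in states)  # P(X | mu)

def gamma(t, y):
    """Eq. (5.47): probability of being in state y at time t."""
    return alpha[t][y] * beta[t][y] / p_obs

def xi(t, y, y2):
    """Eq. (5.46): probability of the transition y -> y' at time t."""
    return alpha[t][y] * A[y][y2] * B[y2][obs[t + 1]] * beta[t + 1][y2] / p_obs
```

The M-step of Eqs. (5.48)–(5.50) would then re-estimate the transition and emission probabilities from sums of ξ_t and γ_t over all training sequences.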

Practical implementations of the HMM have to cope with problems of zero probabilities: as the values of α(y_t) and β(y_t) are smaller than one, products of them tend to go to zero, which demands an appropriate scaling of these values.

A hidden Markov model is a popular technique to detect and classify a linear sequence of information in text. The first information extraction systems that used HMM technology were developed by Leek (1997), whose system extracted gene names and locations from scientific abstracts, and by Bikel et al. (1997), who used this technology for named entity recognition. McCallum et al. (1999) extracted document segments that occur in a fixed or partially fixed order, such as the title, author, and journal from both the headers and reference sections of papers. Ray and Craven (2001) apply HMMs to Medline texts to extract proteins, locations, genes and disorders and their relationships. Zhang et al. (2004) also use HMMs for the recognition of biomedical named entities.

The disadvantage of using HMMs for information extraction is that we need large amounts of training data to guarantee that all state transitions appear a sufficient number of times in order to learn the probabilities in a satisfactory way. Content can be expressed in many linguistic variant forms, not least if one just considers the words of a text. In addition, there is the need for an a priori notion of the model's topology (the possible sequences of states), or this topology should be automatically learned. Existing work has generally used a handcrafted topology, in which states are connected manually in a reasonable way after evaluating the training corpus. There have been several attempts to automatically learn an appropriate topology for information extraction tasks. Examples can be found in Seymore et al. (1999) and McCallum et al. (1999).

5.5 Conditional Random Fields

Conditional random fields (CRF) regard a statistical method based on undirected graphical models. The method exhibits a number of properties that make it very well suited for information extraction tasks. Like the discriminative learning models it can accommodate many statistically correlated features of the input example. This contrasts with generative models, which often require conditional independence assumptions in order to make the computations tractable. Nonetheless, the discriminative methods seen so far do not incorporate context dependency of classes unless they resort to some heuristics to find an acceptable combination of classes. Conditional random fields incorporate both the possibility of incorporating dependent features and the possibility of context-dependent learning, making the technique one of the best current approaches to information extraction in empirical evaluations (Lafferty et al., 2001). This method can be thought of as a generalization of both the maximum entropy model and the hidden Markov model.


Let X be a random variable over data sequences to be labeled and Y a random variable over corresponding label sequences. All components Y_i of Y are assumed to range over a finite label alphabet. For example, in an information extraction task, X might range over the sentences of a text, while Y ranges over the semantic classes to be recognized in these sentences. A conditional random field is viewed as an undirected graphical model or Markov random field, conditioned on X (Jordan, 1999; Wallach, 2004). We define G = (V, E) to be an undirected graph such that there is a node v ∈ V corresponding to each of the random variables representing an element Y_v of Y. If each random variable Y_v obeys the Markov property with respect to G (e.g., in a first order model the transition probability depends only on the neighboring state), then the model (Y, X) is a conditional random field. In theory the structure of graph G may be arbitrary; however, when modeling sequences, the simplest and most common graph structure encountered is that in which the nodes corresponding to elements of Y form a simple first-order Markov chain. In this case the conditional random field forms a probabilistic finite state automaton.

In information extraction conditional random fields are often used to label sequential data, although the method can also be used in other settings. We focus here on a conditional random field that represents a sequence of extraction classes. Such a CRF defines a conditional probability distribution p(Y|X) of label sequences given input sequences. We assume that the random variable sequences X and Y have the same length and use x = (x_1,…,x_T) and y = (y_1,…,y_T) for an input sequence and label sequence respectively.⁷ Instead of defining a joint distribution over both label and observation sequences, the model defines a conditional probability over labeled sequences. A novel observation sequence x is labeled with y, so that the conditional probability p(y|x) is maximized.

Comparable to the maximum entropy model, we define a set of k binary-valued⁸ features or feature functions that each express some characteristic of the empirical distribution of the training data that should also hold in the model distribution. Each local feature is either a state feature s(y_i, x, i) or a transition feature t(y_{i−1}, y_i, x, i), where y_{i−1} and y_i are class labels, x an input sequence, and i an input position. When i is 1 (the start state of the sequence), t(y_{i−1}, y_i, x, i) = 0. Examples of such features are:

7 Note that we represent here an instantiation of an observation sequence as x in contrast with the rest of this book where we use x as an instantiation of a feature vector. Analogically, we use y for the representation of a label sequence.
8 See footnote 3.


s_j(y_i, x, i) = 1 if the observation at position i is the word "say"
                0 otherwise    (5.51)

t_j(y_{i−1}, y_i, x, i) = 1 if y_{i−1} has tag "title" and y_i has POS tag NNP
                          0 otherwise    (5.52)

Feature functions thus depend on the current state (in the case of a state feature function) or on the previous and current states (in the case of a transition feature function). We can use a more global notation f_j for a feature function, where f_j(y_{i−1}, y_i, x, i) is either a state function s_j(y_i, x, i) = s_j(y_{i−1}, y_i, x, i) or a transition function t_j(y_{i−1}, y_i, x, i).

The CRF's global feature vector F_j(x,y) for the input sequence x and label sequence y is given by:

F_j(x,y) = Σ_{i=1}^{T} f_j(y_{i−1}, y_i, x, i)    (5.53)

where i ranges over input positions (such as a sequence of words in a document) or, in terms of the graphical model, over the values on T input nodes. Considering k feature functions, the conditional probability distribution defined by the CRF is then:

p(y|x) = (1/Z) exp(Σ_{j=1}^{k} λ_j F_j(x,y))

or

p(y|x) = (1/Z) exp(Σ_{j=1}^{k} λ_j Σ_{i=1}^{T} f_j(y_{i−1}, y_i, x, i))    (5.54)

where λ_j = parameter adjusted to model the observed statistics
Z = normalizing constant computed as:

Z = Σ_{y∈Y} exp(Σ_{j=1}^{k} λ_j F_j(x,y))

Z is a normalization factor for observation sequence x, computed over the different possible state sequences, and f_j ranges over all k feature functions.
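For a short sequence, Eqs. (5.53)–(5.54) can be evaluated by brute-force enumeration of all label sequences; the labels, feature functions and weights below are invented for illustration:

```python
import math
from itertools import product

# Brute-force evaluation of the linear-chain CRF of Eqs. (5.53)-(5.54) on a toy
# sequence; the labels, feature functions, and weights are all invented.
LABELS = ["O", "NAME"]
x = ["Mr.", "Callender", "said"]
T = len(x)

def f1(y_prev, y, xs, i):  # state feature: "Mr." signals a NAME
    return 1 if xs[i] == "Mr." and y == "NAME" else 0

def f2(y_prev, y, xs, i):  # transition feature: NAME tends to continue
    return 1 if y_prev == "NAME" and y == "NAME" else 0

def f3(y_prev, y, xs, i):  # state feature: "said" is outside a name
    return 1 if xs[i] == "said" and y == "O" else 0

features, lambdas = [f1, f2, f3], [2.0, 1.5, 3.0]

def global_F(f, ys):
    """Global feature vector entry of Eq. (5.53); y_0 is a dummy start label."""
    return sum(f(ys[i - 1] if i > 0 else "START", ys[i], x, i) for i in range(T))

def score(ys):
    return math.exp(sum(l * global_F(f, ys) for l, f in zip(lambdas, features)))

Z = sum(score(ys) for ys in product(LABELS, repeat=T))  # over all |Y|^T sequences

def p(ys):
    """Conditional probability of Eq. (5.54)."""
    return score(ys) / Z

best = max(product(LABELS, repeat=T), key=score)  # Eq. (5.55) by enumeration
print(best)  # ('NAME', 'NAME', 'O')
```

Enumeration is exponential in T; it is shown only to make the definition concrete, while real implementations use the matrix/dynamic-programming computation described below for Eqs. (5.56)–(5.58).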


The most probable label sequence y* for input sequence x is:

y* = argmax_y p(y|x)    (5.55)

For a chain-structured conditional random field, the probability p(y|x) of label sequence y given an observation sequence x can be easily computed by using matrices and relying on algorithms for solving path problems in graphs. To simplify the expressions we add a start and end state, respectively represented by y_0 and y_{T+1}. Let y and y' be labels drawn from the alphabet from which labels are drawn. We define a set of T + 1 matrices {M_i(x) | i = 1,…,T + 1}, where each M_i(x) is a matrix with elements of the form:

M_i(y', y|x) = exp(Σ_{j=1}^{k} λ_j f_j(y', y, x, i))    (5.56)

The conditional probability of a label sequence y given observation sequence x can be written as:

p(y|x) = (1/Z) ∏_{i=1}^{T+1} M_i(y_{i−1}, y_i|x)    (5.57)

The normalization factor Z for observation sequence x may be computed from the set of M_i(x) matrices. Z is given by the (start, end) entry of the product of all T + 1 M_i(x) matrices:

Z = [∏_{i=1}^{T+1} M_i(x)]_{start, end}    (5.58)

The conditional random field as defined by Eq. (5.54) is heavily motivated by the principle of maximum entropy. As seen earlier in this chapter, the entropy of a probability distribution is a measure of uncertainty and is maximized when the distribution in question is as uniform as possible, subject to the constraints set by the training examples. The distribution chosen is the one that is as uniform as possible while ensuring that the expectation of each feature function with respect to the empirical distribution of the training data equals the expected value of the feature function with respect to the model distribution.

As for the maximum entropy model, we need numerical methods in order to derive λ_j given the set of constraints. The problem can be considered as a constrained optimization problem, where we have to find a set of parameters of an exponential model which maximizes its log likelihood. We refer here to the references given above on numerical methods for determining the model parameters for the maximum entropy model. Here also, we are confronted with the problem of efficiently calculating the expectation of each feature function with respect to the CRF model distribution for every observation sequence x in the training data. Fortunately, dynamic programming techniques similar to the Baum-Welch algorithm, which is commonly used for estimating the parameters of a hidden Markov model, can be used here for parameter estimation (Lafferty et al., 2001).
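The matrix computation of Z in Eqs. (5.56)–(5.58) can be checked against brute-force enumeration on a tiny invented model:

```python
import math
from itertools import product

# Matrix computation of the CRF normalizer Z, Eqs. (5.56)-(5.58), checked
# against brute-force enumeration; the tiny model below is invented.
LABELS = ["A", "B"]
EXT = ["start"] + LABELS + ["end"]   # label alphabet augmented with start/end
T = 3

def f1(y_prev, y, i):  # illustrative transition feature
    return 1 if y_prev == "A" and y == "B" else 0

def f2(y_prev, y, i):  # illustrative state feature
    return 1 if y == "A" and i == 1 else 0

lambdas, features = [0.5, 1.0], [f1, f2]

def M(i, y_prev, y):
    """Matrix element of Eq. (5.56); transitions that are impossible in the
    augmented chain (start is only left at i = 1, end only entered at i = T+1)
    get the value 0, and the closing transition into end carries no features."""
    if i == 1 and not (y_prev == "start" and y in LABELS):
        return 0.0
    if 1 < i <= T and not (y_prev in LABELS and y in LABELS):
        return 0.0
    if i == T + 1:
        return 1.0 if (y_prev in LABELS and y == "end") else 0.0
    return math.exp(sum(l * f(y_prev, y, i) for l, f in zip(lambdas, features)))

def matmul(P, Q):
    return {(a, c): sum(P[(a, b)] * Q[(b, c)] for b in EXT)
            for a in EXT for c in EXT}

# Build the T + 1 matrices over the augmented alphabet and multiply them.
mats = [{(a, b): M(i, a, b) for a in EXT for b in EXT} for i in range(1, T + 2)]
prod_M = mats[0]
for Mi in mats[1:]:
    prod_M = matmul(prod_M, Mi)
Z_matrix = prod_M[("start", "end")]  # Eq. (5.58)

# Brute force: sum the path products of Eq. (5.57) over all label sequences.
Z_brute = 0.0
for ys in product(LABELS, repeat=T):
    path = ("start",) + ys + ("end",)
    w = 1.0
    for i, (a, b) in enumerate(zip(path, path[1:]), start=1):
        w *= M(i, a, b)
    Z_brute += w

print(abs(Z_matrix - Z_brute) < 1e-9)  # True
```

The matrix product costs O(T · |labels|²) instead of the O(|labels|^T) enumeration, which is the point of the construction.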

Conditional random fields have been implemented for named entity recognition (McCallum and Li, 2003) and timex recognition and normalization (Ahn et al., 2005). They allow representing dependencies on previous classifications in a discourse. While adhering to the maximum entropy principle, they offer a valid solution when learning from incomplete information. Given that in information extraction tasks we often lack an annotated training set that covers all extraction patterns, this is a valuable asset.

Conditional random fields are a restricted class of undirected graphical models (Jordan, 1999). The advantage is that the feature functions can model many characteristics of the texts, not only with regard to an input sequence, its terms and their characteristics, but they can also take into account other discourse features that occur in previous sentences. Conditional random fields have here been illustrated with the case of a linear sequence of observations. Other graph models can be valuable for information extraction tasks.

For instance, a relational Markov network can represent arbitrary dependencies between extractions (e.g., Taskar et al., 2004). This model allows for a collective classification of a set of related entities by integrating information from features of individual entities as well as the relations between them. For example, in a protein named entity recognition task, repeated references to the same protein are common. If the context surrounding one occurrence of a name offers good evidence that the name is a protein, then this should influence the tagging of another occurrence of the same name in a different ambiguous context, if we assume the one sense per discourse heuristic (Bunescu and Mooney, 2004).

5.6 Decision Rules and Trees

Learning of rules and trees aims at inducing classifying expressions in the form of decision rules and trees from example cases. These are one of the oldest approaches to machine learning and were also part of one of the oldest applications of machine learning in information extraction. Each decision rule is associated with a particular class, and a rule that is satisfied, i.e., evaluated as true, is an indication of its class. Thus, classifying new cases involves the application of the learned classifying expressions and assignment to the corresponding class upon positive evaluation.

The rules are found by searching those combinations of features or of feature relations that are discriminative for each class. Given a set of positive examples and a set of negative examples (if available) of a class, the training algorithms generate a rule that covers all (or most) of the positive examples and none (or the fewest) of the negative examples. Having found this rule, it is added to the rule set, and the cases that satisfy the rule are removed from further consideration. The process is repeated until no more example cases remain to be covered.

The paradigm of searching possible hypotheses also applies to tree and rule learning. There are two major ways of accessing this search space (Mitchell, 1977). General-to-specific methods search the space from the most general towards the most specific hypothesis. One starts from the most general rule possible (often an empty clause), which is specialized at the encounter of a negative example that is covered. The principle is to add features to the rule. Specific-to-general methods search the hypothesis space from the most specific towards the most general hypothesis and will progressively generalize examples. One starts with a positive example, which forms the initial rule for the definition of the concept to be learned. This rule is generalized at the encounter of another positive example that is not covered. The principle is to drop features. The combination of the general-to-specific and the specific-to-general methods is the so-called version spaces method, which starts from two hypotheses (Mitchell, 1977). Negative examples specialize the most general hypothesis. Positive examples generalize the most specific hypothesis. The version spaces model suffers from practical and computational limitations. To test all possible hypotheses is most of the time impossible given the number of feature combinations.

The most widely used method is tree learning. The vectors of the training examples induce classification expressions in the form of a decision tree. A decision tree can be translated into if-then rules to improve the readability of the learned expressions. A decision tree consists of nodes and branches. Each node, except for terminal nodes or leaves, represents a test or decision and branches into subtrees for each possible outcome of the test. The tree can be used to classify an object by starting at the root of the tree and moving through it until a leaf (the class of the object) is encountered.


120 5 Supervised Classification

Basic algorithms (e.g., C4.5 of Quinlan, 1993) construct the trees in a top-down, greedy way by selecting the most discriminative feature and using it as the test at the root node of the tree. A descendant node is then created for each possible value of this feature, and the training examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the example's value for this feature). The entire process is then repeated using the training examples associated with each descendant node to select the best feature to test at that point in the tree. This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices. In this way not all the hypotheses of the search space are tested. Additional mechanisms can be incorporated. For instance, by searching a rule or tree that covers most of the positive examples and removing these examples from further training, the search space is divided into subspaces, for each of which a covering rule is sought. Other ways of reducing the search space are preferring simple rules above complex ones, and branching and bounding the search space, where the method will not consider a set of hypotheses if some criterion allows assuming that they are inferior to the current best hypothesis. The selection of the most discriminative feature at each node, except for a leaf node, is often done by selecting the one with the largest information gain, i.e., the feature that causes the largest reduction in entropy when the training examples are classified according to the outcome of the test at the node. As seen above, entropy is a measure of uncertainty.

More specifically, given a collection S of training examples, if the classification can take on k different values, then the entropy of S relative to the k classifications is defined as:

Entropy(S) ≡ − Σ_{i=1}^{k} p_i log2 p_i    (5.59)

where p_i is the proportion of S belonging to class i. The information gain of a feature f is the expected reduction in entropy caused by partitioning the examples according to this feature:

Gain(S, f) ≡ Entropy(S) − Σ_{v ∈ Values(f)} (|S_v| / |S|) Entropy(S_v)    (5.60)

where Values(f) is the set of all possible values of feature f, and S_v is the subset of S for which feature f has value v.
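As a worked illustration of Eqs. (5.59) and (5.60), the sketch below computes the entropy of a small training set and the information gain of two features. The toy named-entity data are invented for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum_i p_i log2 p_i over the class proportions, Eq. (5.59)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, feature):
    """Gain(S, f) = Entropy(S) - sum_v |S_v|/|S| Entropy(S_v), Eq. (5.60)."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(ex[feature] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Invented toy data: capitalization separates the classes, token length does not.
X = [{"cap": 1, "len": 3}, {"cap": 1, "len": 8},
     {"cap": 0, "len": 3}, {"cap": 0, "len": 8}]
y = ["entity", "entity", "other", "other"]
print(information_gain(X, y, "cap"))  # 1.0: perfectly discriminative
print(information_gain(X, y, "len"))  # 0.0: uninformative
```

A greedy tree learner would therefore place the test on `cap` at the root, exactly the selection criterion described above.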

Rule and tree learning algorithms were the first algorithms used in information extraction, and they are still popular learning techniques for this task. Factors that play a role in their popularity are their expressive power, which makes them compatible with human-engineered knowledge rules, and their easy interaction with other knowledge resources. Because of their greedy nature the algorithms usually perform better when the feature set is limited. Information extraction tasks can sometimes naturally make use of a limited set of features that exhibit some dependencies between the features (e.g., in coreference resolution).

Induction of rules and trees was a very popular information extraction technique in the 1990s. It has been applied among others to information extraction from semi-structured text (Soderland, 1999), and it continues to be a popular and successful technique in coreference resolution (McCarthy and Lehnert, 1995; Soon et al., 2001; Ng and Cardie, 2002).

5.7 Relational Learning

When the learned rules are written in a logical formalism, the learning is often referred to as inductive logic programming (ILP) (Mooney, 1997). The most simple rules are expressed in propositional logic, but often the learner will also acquire expressions in first-order predicate logic. The classifier learns small programs containing predicates, constants and variables, which can be used to make inferences, hence the term inductive logic programming.

Inductive logic programming is a subcategory of relational learning. Unless the rule representation is severely restricted, the learning is often intractable. In order to counter this problem for a specific extraction problem, domain-specific heuristics are implemented. However, we lack generic ILP methods that could be applicable to a variety of information extraction problems. Relational learning refers to all techniques that learn structured concept definitions from structured examples. Relational learning is concerned with the classification of patterns whose presence signifies that certain elements of a structure are in a particular relation to one another. The structure of the instances can have different formats (e.g., logical programs, Bayesian networks, graphs). The learning algorithm receives input examples of which the complete structure is classified.

In information extraction, relational learning that learns first-order predicates has been implemented for extracting rather structured information such as information in job postings (Califf and Mooney, 1997) and in seminar announcements (Roth and Yih, 2001). In addition, there exist relational models based on statistics. The kernel methods, the hidden Markov models and conditional random fields can be seen as relational learning models. In these cases, the relational model is chosen because the propositional, nominal or ordinal representations might become too large, or could lose much of the inherent domain structure.

Many questions still have to be solved, and appropriate algorithms for relational learning should be drafted. Relational learning could offer suitable solutions for recognizing information in texts.

5.8 Conclusions

Supervised pattern recognition techniques are very useful in information extraction. Many useful approaches exist. As we will see in Chap. 9, they currently constitute the most successful techniques. However, there is the bottleneck of acquiring sufficient annotated examples. In the next chapter it is shown how unsupervised learning techniques aid in resolving this problem.

Information extraction techniques recognize rather simple patterns that classify information in a particular semantic class. As we will discuss in the final chapter, there is a need for more advanced recognition of content, where meaning is assigned based on a conglomerate of different concepts and their relations found in the unstructured sources.

5.9 Bibliography

Ahn, David, Sisay F. Adafre and Maarten de Rijke (2005). Extracting temporal information from open domain text. In Proceedings of the 5th Dutch-Belgian Information Retrieval Workshop (DIR'05). Twente.

Berger, Adam, Stephen A. Della Pietra and Vincent J. Della Pietra (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22 (1), 39-71.

Bikel, Daniel M., Scott Miller, Richard Schwartz and Ralph Weischedel (1997). Nymble: A high-performance learning name-finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing (pp. 194-201). Washington, DC.

Bunescu, Razvan and Raymond J. Mooney (2004). Collective information extraction with relational Markov networks. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (pp. 439-446). East Stroudsburg, PA: ACL.


Califf, Mary E. and Raymond J. Mooney (1997). Relational learning of pattern-matching rules for information extraction. In T.M. Ellison (Ed.), CoNLL: Computational Natural Language Learning (pp. 9-15). ACL.

Chieu, H.L. and Hwee Tou Ng (2002). Named entity recognition: A maximum entropy approach using global information. In COLING 2002. Proceedings of the 19th International Conference on Computational Linguistics (pp. 190-196). San Francisco: Morgan Kaufmann.

Cristianini, Nello and John Shawe-Taylor (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, UK: Cambridge University Press.

Collins, Michael and Nigel Duffy (2001). Convolution kernels for natural language. In Thomas G. Dietterich, Sue Becker and Zoubin Ghahramani (Eds.), Advances in Neural Information Processing Systems 14 (pp. 625-632). Cambridge, MA: The MIT Press.

Culotta, Aron and Jeffrey Sorensen (2004). Dependency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (pp. 424-430). East Stroudsburg, PA: ACL.

Darroch, J.N. and D. Ratcliff (1972). Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43, 1470-1480.

Dempster, A.P., N.M. Laird and D.B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.

Della Pietra, Stephen, Vincent Della Pietra and John Lafferty (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 380-393.

Fleischman, Michael, Namhee Kwon and Eduard Hovy (2003). A maximum entropy approach to FrameNet tagging. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. East Stroudsburg, PA: ACL.

Isozaki, Hideki and Hideto Kazawa (2002). Efficient support vector classifiers for named entity recognition. In COLING 2002. Proceedings of the 19th International Conference on Computational Linguistics (pp. 390-396). San Francisco, CA: Morgan Kaufmann.

Isozaki, Hideki and Tsutomu Hirao (2003). Japanese zero pronoun resolution based on ranking rules and machine learning. In Proceedings of EMNLP-2003 (pp. 184-191). ACL.

Jaynes, Edwin T. (1982). On the rationale of maximum-entropy methods. Proceedings of the IEEE, 70 (9), 939-952.

Jordan, Michael I. (1999). Learning in Graphical Models. Cambridge, MA: The MIT Press.

Kehler, Andrew (1997). Probabilistic coreference in information extraction. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (pp. 163-173). Somerset, NJ: ACL.

Lafferty, John, Andrew McCallum and Fernando C.N. Pereira (2001). Conditional random fields: Probabilistic models for segmenting and labelling sequence data. In Proceedings of the 18th International Conference on Machine Learning (pp. 282-289). San Francisco, CA: Morgan Kaufmann.


Leek, Timothy Robert (1997). Information Extraction using Hidden Markov Models. Master thesis, University of California San Diego.

Malouf, Robert (2002). A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002) (pp. 49-55). San Francisco, CA: Morgan Kaufmann.

McCallum, Andrew, Kamal Nigam, Jason Rennie and Kristie Seymore (1999). A machine learning approach to building domain-specific search engines. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (pp. 662-667). San Mateo, CA: Morgan Kaufmann.

McCallum, Andrew, Andrew Ng and Michael I. Jordan (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Thomas Dietterich, Suzanna Becker and Zoubin Ghahramani (Eds.), Advances in Neural Information Processing Systems 14 (pp. 609-616). Cambridge, MA: The MIT Press.

McCallum, Andrew and Wei Li (2003). Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning (CoNLL). East Stroudsburg, PA: ACL.

McCarthy, Joseph and Wendy G. Lehnert (1995). Using decision trees for coreference resolution. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (pp. 1050-1055). San Mateo, CA: Morgan Kaufmann.

Mehay, Dennis N., Rik De Busser and Marie-Francine Moens (2005). Labeling generic semantic roles. In Harry Bunt, Jeroen Geertzen and Elias Thyse (Eds.), Proceedings of the Sixth International Workshop on Computational Semantics (IWCS-6) (pp. 175-187). Tilburg, The Netherlands: Tilburg University.

Minsky, Marvin L. and Seymour A. Papert (1969). Perceptrons. The MIT Press.

Mitchell, Tom (1977). Version spaces: A candidate elimination approach to rule learning. In Proceedings of the 5th International Joint Conference on Artificial Intelligence (pp. 305-310). Cambridge, MA: William Kaufmann.

Mitchell, Tom (1997). Machine Learning. McGraw-Hill.

Mooney, Raymond (1997). Inductive logic programming for natural language processing. In Inductive Logic Programming, volume 1314 of LNAI (pp. 3-24). Berlin: Springer.

Ng, Andrew Y. and Michael Jordan (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naïve Bayes. In Advances in Neural Information Processing Systems 14.

Ng, Vincent and Claire Cardie (2002). Improving machine learning approaches to coreference resolution. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-2002). San Francisco, CA: Morgan Kaufmann.

Quinlan, J. Ross (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.

Rabiner, Lawrence R. (1989). A tutorial on hidden Markov models and selected applications. In Proceedings of the IEEE, 77 (pp. 257-285). Los Alamitos, CA: The IEEE Computer Society.



Ratnaparkhi, Adwait (1998). Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania.

Ray, Soumya and Mark Craven (2001). Representing sentence structure in hidden Markov models for information extraction. In Proceedings of the 17th International Joint Conference on Artificial Intelligence, Seattle, WA. San Francisco, CA: Morgan Kaufmann.

Roth, Dan and Wen-tau Yih (2001). Relational learning via propositional algorithms: An information extraction case study. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (pp. 1257-1263). San Francisco, CA: Morgan Kaufmann.

Seymore, Kristie, Andrew McCallum and Ronald Rosenfeld (1999). Learning hidden Markov model structure for information extraction. In Proceedings of the AAAI 99 Workshop on Machine Learning for Information Extraction.

Shannon, Claude E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379-423, 623-656.

Soderland, Stephen (1999). Learning information extraction rules for semi-structured and free text. Machine Learning, 34 (1-3), 233-272.

Soon, Wee Meng, Hwee Tou Ng and Daniel Chung Yong Lim (2001). A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27 (4), 521-544.

Taskar, Ben, Vassil Chatalbashev and Daphne Koller (2004). Learning associative Markov networks. In Proceedings of the Twenty-First International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.

Theodoridis, Sergios and Konstantinos Koutroumbas (2003). Pattern Recognition. Amsterdam, The Netherlands: Academic Press.

Vapnik, Vladimir N. (1998). Statistical Learning Theory. New York: John Wiley and Sons.

Wallach, Hanna M. (2004). Conditional random fields: An introduction. University of Pennsylvania CIS Technical Report MS-CIS-04-21.

Zelenko, Dmitry, Chinatsu Aone and Anthony Richardella (2003). Kernel methods for relation extraction. Journal of Machine Learning Research, 3, 1083-1106.

Zhang, Dell and Wee Sun Lee (2003). Question classification using support vector machines. In Proceedings of the Twenty-Sixth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 26-31). New York: ACM.

Zhang, Jie, Dan Shen, Guodong Zhou, Jian Su and Chew-Lim Tan (2004). Enhancing HMM-based biomedical named entity recognition by studying special phenomena. Journal of Biomedical Informatics, 37, 411-422.

6 Unsupervised Classification Aids

6.1 Introduction

Many supervised algorithms that train from labeled or annotated examples have been applied to the task of information extraction. Although we lack large benchmark studies, the literature on information extraction shows that the results are reasonably successful (see Chap. 9 for current applications and results). However, many studies report on extraction applications in closed and limited domains. Especially in open domain information extraction, a major bottleneck is the lack of sufficient annotated examples. In information extraction a multitude of semantic classes can be assigned to the linguistic surface expressions of the texts. For many classes the linguistic variation of expressing the same class is quite high, while at the same time the set of features or feature combinations that trigger a semantic pattern in an example is limited. The manual labeling of enough training documents in order to build an accurate classifier is often prohibitively expensive. This might be less of a problem in text categorization, where only a few categories per document text are assigned and where many features (i.e., the words of a text) often individually trigger the category pattern. In addition, even when we train an information extraction system in a closed domain, the speed and cost of the annotation are also a considerable factor in the commercial development of extraction tools.

On the other hand, collecting a large quantity of unlabeled textual data is cheap. Thus, it could be interesting to train upon a small annotated corpus and in some way gradually improve the quality of the learned classification patterns by exploiting the unlabeled examples. Or, maybe it is possible to learn a classifier whose performance in terms of accuracy is equal to one trained on the full labeled set.

At one end of the spectrum there are the clustering algorithms, which are completely unsupervised technologies and which rely on unlabeled data. They detect the organization of similar patterns into sensible clusters or groups, which allows discovering similarities and differences among patterns and deriving useful conclusions from the clusters. On the other end of the spectrum there are the supervised approaches that rely on a full set of labeled examples. In this chapter we demonstrate that several approaches exist that severely restrict the number of annotations to be manually drafted and that exploit patterns in unlabeled examples. They are referred to as weakly supervised or bootstrapping techniques. The algorithms learn from a limited set of annotated examples and a large pool of unlabeled examples. A classifier is initially trained on the labeled seeds and is incrementally improved with examples from the unlabeled pool until the classifier reaches a certain level of accuracy (see Eq. (8.6)) on a test set. We discuss in detail expansion, self-training, co-training, and active learning. In expansion, the training set is iteratively expanded with similar examples. In self-training, the examples chosen for the next training step are those to which the current classifier assigns labels with most certainty. In co-training, the examples chosen for the next training step are those to which two or more current classifiers that use independent feature sets assign labels with most certainty. In active learning humans label all examples, but the machine carefully selects the examples to be labeled: e.g., the examples that the current classifier labels as most uncertain, or the most representative and diverse examples in the pool of unlabeled examples are considered.

The most logical order of this chapter would be to start from techniques of active learning, where human involvement is still the largest, and gradually discuss the techniques that require less manual intervention, with at the extreme the completely unsupervised techniques. We will, however, invert this ordering and start with the discussion of the unsupervised techniques. This allows us to first explain a number of essential concepts on feature selection and distance functions, which will be referred to throughout the chapter. The clustering approach is illustrated with two noun phrase coreference resolution tasks and a relation recognition task. Expansion is illustrated with the classical Yarowsky algorithm for word sense disambiguation. Self-training is illustrated with named entity classification. Co-training and active learning are illustrated respectively with a noun phrase coreference and a named entity recognition task. We do not discuss simple association techniques that, for instance, when a domain corpus is available, associate terms with this domain based on occurrence statistics of the terms in texts of the domain and in texts that do not belong to the domain (e.g., Riloff, 1996).

The supervised approaches in the previous chapter teach us that in order to improve information extraction results we have to deal with two types of problems. First, there is the variation of natural language and the fact that many different linguistic expressions signal the same semantic class. In machine learning terminology we can say that the number of potential features is very large, but only a few of them are active in each example. Secondly, there is the ambiguity of natural language and the fact that a linguistic surface expression seemingly signals different semantic classes. In machine learning terminology we can say that many features on their own are very ambiguous, but in combination with other features of the discourse context, they lose their ambiguity. When discussing the different approaches to information extraction that use unsupervised aids, we will each time elaborate on the effect of the approach on both problems.
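The self-training loop described above can be sketched as follows. The base classifier (a nearest-centroid model), the margin-based confidence, the threshold value and the toy data are illustrative assumptions for this sketch, not a specific system from the literature.

```python
# A minimal self-training sketch: train on labeled seeds, then repeatedly
# absorb the unlabeled examples the current model labels most confidently.

def train(labeled):
    """Fit one centroid per class from (vector, label) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = [a + b for a, b in zip(sums.get(y, [0.0] * len(x)), x)]
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(centroids, x):
    """Return (label, confidence) where confidence is the distance margin
    between the two nearest centroids."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(c, x)) ** 0.5, y)
                   for y, c in centroids.items())
    (d1, y1), (d2, _) = dists[0], dists[1]
    return y1, d2 - d1

def self_train(seeds, unlabeled, threshold=0.5, rounds=5):
    labeled, pool = list(seeds), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        scored = [(x, predict(model, x)) for x in pool]
        confident = [(x, y) for x, (y, m) in scored if m >= threshold]
        if not confident:
            break                      # nothing certain enough: stop
        labeled += confident
        chosen = {tuple(x) for x, _ in confident}
        pool = [x for x in pool if tuple(x) not in chosen]
    return train(labeled)

seeds = [([0.0, 0.1], "person"), ([1.0, 0.9], "location")]
pool = [[0.1, 0.0], [0.9, 1.0], [0.5, 0.5]]
model = self_train(seeds, pool)
print(predict(model, [0.2, 0.1])[0])  # -> person
```

Note that the ambiguous midpoint [0.5, 0.5] never clears the confidence threshold and is simply left unlabeled, which is the intended behavior of the loop.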

6.2 Clustering

Clustering is a multivariate statistical technique that allows an automatic generation of groups in data. The feature vectors of the unlabeled examples are clustered using a function that computes the numerical distance or similarity between each pair of objects. The result of the clustering is a partitioning of the collection of objects in groups of related objects. Some clustering algorithms generate a hierarchical grouping of the objects. When each vector only belongs to one cluster, the clustering is called hard or crisp. However, when the vector belongs to more than one cluster simultaneously with a certain degree of membership, one speaks of a soft or fuzzy clustering. Many books exist that give a good overview and discussion of clustering algorithms. We refer here to Kaufman and Rousseeuw (1990), and Theodoridis and Koutroumbas (2003).

There are a number of factors that play a role in the clustering. First of all, because the approach is unsupervised, the choice of the features is of primordial importance. The distance or similarity function chosen is another important criterion in the clustering. The functions might behave differently when feature vectors are situated in a certain geometrical position in the vector space. In addition, the clustering often respects a set of constraints. The constraints relate to cluster membership or to the shape of the clusters, which might be expressed in the cost function that is used to optimize a clustering. The constraints might also play a role in the choice of the cluster algorithm or might be expressed in a function that determines the best number of clusters.

6.2.1 Choice of Features

Depending on the task the appropriate features are selected. Some methods exist that select features in a clustering or eliminate noisy features. One set of methods refers to the so-called wrapper methods, where different subsets of features are evaluated by the learning algorithm itself (thus depending on other clustering criteria such as the proximity metric or the type of cluster algorithm) (Talavera, 1999; Dy and Brodley, 2000). Alternatively, one can attempt to efficiently determine the optimal feature weighting (Modha and Spangler, 2003). Filtering methods measure the tendency of the data set to cluster for different subsets of features based on intra- and intercluster distances (cf. infra) (Dash et al., 2002). These approaches usually involve a heuristic (non-exhaustive) search through the space of all subsets of features. None of these methods has been applied in information extraction.

Different groupings of the objects are possible depending on the features and their values (e.g., clustering noun phrase entities according to gender or according to occurrence frequency classes in the discourse yields a different grouping of the entities). The selected features must encode as much information as possible concerning the clustering sought. Consequently, in information extraction one often relies on a priori linguistic or cognitive knowledge to estimate the value of a feature (see Chap. 4).

6.2.2 Distance Functions between Two Objects

A suitable distance or similarity function is chosen that computes the association between a pair of feature vectors. When the feature values in the object vectors have continuous or discrete values, common distance functions such as the Manhattan or Euclidean distance or similarity functions such as the inner product or cosine function can be used. So, the distance and similarity between two vectors x_i and x_j, each having p dimensions, can be computed as follows (non-exhaustive list of example functions).

Manhattan distance: d1

d1(x_i, x_j) = Σ_{l=1}^{p} |x_il − x_jl|    (6.1)

Euclidean distance: d2

d2(x_i, x_j) = √( Σ_{l=1}^{p} (x_il − x_jl)² )    (6.2)

Inner product similarity: s1



s1(x_i, x_j) = Σ_{l=1}^{p} x_il x_jl    (6.3)

Cosine similarity: s2

s2(x_i, x_j) = Σ_{l=1}^{p} x_il x_jl / ( √(Σ_{l=1}^{p} x_il²) √(Σ_{l=1}^{p} x_jl²) )    (6.4)

Dice similarity: s3

s3(x_i, x_j) = 2 Σ_{l=1}^{p} x_il x_jl / ( Σ_{l=1}^{p} x_il + Σ_{l=1}^{p} x_jl )    (6.5)
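The distance and similarity functions of Eqs. (6.1)-(6.5) can be sketched in plain Python as below; the vectors are assumed to have equal length, and for the Dice measure nonnegative components (e.g., Boolean or term-weight vectors) are assumed.

```python
from math import sqrt

def manhattan(x, y):                 # Eq. (6.1): L1 distance
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):                 # Eq. (6.2): L2 distance
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def inner_product(x, y):             # Eq. (6.3): overlap of the vectors
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):                    # Eq. (6.4): length-normalized overlap
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return inner_product(x, y) / norm

def dice(x, y):                      # Eq. (6.5): overlap over the L1 lengths
    return 2 * inner_product(x, y) / (sum(x) + sum(y))

# Boolean feature vectors: the inner product is the set-intersection size.
x, y = [1, 1, 0, 1], [1, 0, 0, 1]
print(manhattan(x, y), inner_product(x, y))          # 1 2
print(round(cosine(x, y), 3), round(dice(x, y), 3))  # 0.816 0.8
```

The example also illustrates the remark below on the Boolean case: with zero/one vectors the inner product counts the shared active features.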

The inner product computes the vector intersection or overlap. In the Boolean case, where vector components are zero or one, the inner product can be used to compute the cardinality of the set intersection. The inner product is sensitive to between-objects and within-object differences in term weights. The inner product does not penalize vectors for their representational richness. For instance, a high value in one vector strongly influences the result.

The cosine function normalizes the inner product through its division by the Euclidean (L2) lengths of each of the vectors. The length normalization avoids the influence of a single component and also fixes an upper bound of the range of similarity values (i.e., 1). The cosine function may penalize representational richness and it is insensitive to between-vectors weight relationships.

The Dice measure critically depends on the relative L1 lengths of the two vectors. In the extreme case, when one of the vectors has a very large L1 length compared to that of the other vector, the effect of the latter in the normalization will be negligible. We refer to Jones and Furnas (1987) for an in-depth study of similarity measures used in text-based tasks.

In information extraction the vectors often have mixed values. For instance, nominal, ordinal or interval-scaled values are mixed with discrete or real values, or solely make up the object vectors. A similarity function that deals with mixed-valued vectors can be defined as follows.

Mixed value similarity: s4

s4(x_i, x_j) = Σ_{l=1}^{p} s_l(x_il, x_jl) / Σ_{l=1}^{p} w_l    (6.6)

where s_l(x_il, x_jl) is the similarity between the lth coordinates of x_i and x_j. In case the lth coordinates of the two vectors are binary, then:

s_l(x_il, x_jl) = 1 if x_il = x_jl = 1, and 0 otherwise    (6.7)

If the lth coordinates of the two vectors correspond to nominal or ordinal variables, then s_l(x_il, x_jl) = 1 if x_il and x_jl have the same values; otherwise s_l(x_il, x_jl) = 0. If the lth coordinates correspond to interval variables, then:

s_l(x_il, x_jl) = 1 − |x_il − x_jl| / r_l    (6.8)

where r_l is the length of the interval where the values of the lth coordinates lie. In the case that the interval variables x_il and x_jl coincide, s_l(x_il, x_jl) takes the maximum value, which equals 1. On the other hand, if the absolute difference between x_il and x_jl equals r_l, then s_l(x_il, x_jl) = 0. For any other value of |x_il − x_jl|, s_l(x_il, x_jl) lies between 0 and 1.

w_l is a weight factor corresponding to the lth coordinate. If one of the lth coordinates of x_i and x_j is undefined, w_l = 0. If the lth coordinate is a binary variable and is 0 for both vectors, then w_l = 0. In other cases, w_l is set equal to 1. If all w_l's are equal to 0, then s4(x_i, x_j) is undefined.

The above metrics are symmetrical, i.e., the distance or similarity between x_i and x_j is the same as the respective distance or similarity between x_j and x_i. In exceptional information extraction cases, asymmetric metrics are computed, such as the relative entropy or Kullback-Leibler divergence of two probability
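The mixed-value similarity s4 of Eqs. (6.6)-(6.8) can be sketched as follows. The per-coordinate type tags, the interval range r_l = 10 and the coreference-style toy features are illustrative assumptions for this example.

```python
# A sketch of s4, Eq. (6.6), with the coordinate-wise similarities of
# Eqs. (6.7)-(6.8) and the w_l weighting rules described in the text.

def mixed_similarity(x, y, types, ranges=None):
    """types[l] is 'binary', 'nominal' (also for ordinal) or 'interval';
    ranges[l] gives r_l for interval coordinates; None marks an
    undefined coordinate value."""
    num = den = 0.0
    for l, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:          # undefined coordinate: w_l = 0
            continue
        t = types[l]
        if t == "binary":
            if a == b == 0:                 # 0-0 binary match: w_l = 0
                continue
            s = 1.0 if a == b == 1 else 0.0  # Eq. (6.7)
        elif t == "nominal":
            s = 1.0 if a == b else 0.0       # nominal/ordinal agreement
        else:
            s = 1.0 - abs(a - b) / ranges[l]  # Eq. (6.8), interval
        num += s
        den += 1.0                          # w_l = 1
    return num / den if den else None       # all w_l = 0: s4 undefined

# Hypothetical noun phrase pair: gender (nominal), capitalized (binary),
# sentence distance (interval, assumed range r_l = 10).
x = ["fem", 1, 2.0]
y = ["fem", 1, 4.0]
print(mixed_similarity(x, y, ["nominal", "binary", "interval"],
                       ranges={2: 10.0}))   # (1 + 1 + 0.8) / 3
```

Note how the w_l rules fall out naturally: skipped coordinates simply contribute to neither the numerator nor the denominator of Eq. (6.6).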

distributions p(x) and q(x), or the cross entropy between a random variable X with true probability p(x) and a model distribution q(x) (cf. Eqs. (7.7) and (7.8)). In future information extraction tasks a larger emphasis will be on similarity functions that consider probabilistic object assignments instead of preferring functions that operate in the Euclidean space.

There are many different types of clustering algorithms. Ideally, all possible divisions of objects into groups should be tested according to some criterion of cluster goodness. However, in any realistic clustering task this is computationally an NP-hard¹ problem. So, most clustering algorithms incorporate a form of greediness, and in one way or another only test a subset of all possible divisions.

6.2.3 Proximity Functions between Two Clusters

When building the clustering, a proximity function that computes the closeness between two clusters can be taken into consideration. Common proximity functions are maximum proximity, minimum proximity, average proximity and mean proximity. Maximum proximity defines the proximity between clusters based on their most similar pair of objects. Minimum proximity defines the proximity between clusters based on their least similar pair of objects. The average function defines proximity between clusters based on the average of the similarities between all pairs of objects, where the objects of a pair belong to different clusters. The mean function defines proximity of clusters based on the similarity of the representative of each cluster. The representative may be the mean point (centroid), the mean center (medoid), or the median center of the cluster.

6.2.4 Algorithms

Sequential algorithms produce a grouping in one or a few iterations in which the n objects are considered in the clustering. They are very fast, but the clustering depends on the order of input of the objects. In the single pass algorithm, in one pass each of the n objects is assigned to the closest cluster, requiring that each object of a cluster has a minimum similarity with the centroid of the cluster or with one object of the cluster. This latter approach is implemented by Cardie and Wagstaff (1999) in the task of single-document coreference resolution.

¹ Non-deterministic polynomial-time hard.
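The single pass algorithm just described can be sketched as follows. The use of cosine similarity against the cluster centroid and the threshold value are illustrative choices for this sketch, not the settings of a particular system.

```python
# A minimal single-pass sequential clustering sketch: each object joins
# the closest existing cluster if it is similar enough to that cluster's
# centroid, and otherwise starts a new cluster.
from math import sqrt

def cos(x, y):
    num = sum(a * b for a, b in zip(x, y))
    den = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return num / den if den else 0.0

def centroid(cluster):
    n = len(cluster)
    return [sum(v) / n for v in zip(*cluster)]

def single_pass(objects, threshold=0.9):
    clusters = []
    for x in objects:
        sims = [(cos(centroid(c), x), c) for c in clusters]
        best = max(sims, default=(0.0, None))
        if best[0] >= threshold:
            best[1].append(x)      # join the closest similar cluster
        else:
            clusters.append([x])   # otherwise start a new cluster
    return clusters

points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(len(single_pass(points)))  # 2
```

As the text notes, the result depends on the input order: presenting the objects in a different order can yield a different grouping for borderline objects.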


Very famous are the hierarchical clustering algorithms. The agglomerative algorithms start from the n individual objects, which form singleton clusters, and in subsequent steps group them into more general clusters by decreasing the number of clusters by 1 at each step. Famous algorithms are the single link(age) algorithm (clusters are merged based on the maximum proximity function), the complete link(age) algorithm (clusters are merged based on the minimum proximity function) and the group average algorithm (clusters are merged based on the average proximity function or sometimes based on the mean proximity function). Divisive algorithms act in the opposite direction. One single cluster comprising the n objects is divided into smaller and smaller groups until the n single objects are found. An example is the use of the group average agglomerative algorithm implemented by Gooi and Allan (2004) for the resolution of noun phrase cross-document coreferents (cf. infra).

There are a number of clustering algorithms based on cost function optimization. Here, the clustering is evaluated in terms of a cost function f. Usually the number of clusters k is fixed. The algorithms start from an initial grouping into k clusters, and iteratively other groupings into k clusters are tested while trying to optimize f. The iteration terminates when a local optimum of f is determined. The algorithms include hard and fuzzy partitioning algorithms such as the hard k-means and k-medoid, and the fuzzy c-means. In both the k-means and the k-medoid algorithms an initial clustering is improved in consecutive steps by swapping objects between clusters. In the k-means algorithm a better clustering is one that minimizes the distance between an object and its centroid. The k-medoid algorithm minimizes the distance between an object and its medoid. A fuzzy clustering is seldom used for information extraction, as an object usually only belongs to one class.

6.2.5 Number of Clusters

For the hierarchical cluster algorithms and the partitioning into k clusters, the problem is how to find a good k, i.e., the number of clusters that is naturally present in the data set. Among the criteria of goodness of the clustering are the intra-cluster distance of the objects contained in a cluster and the inter-cluster distance, defined as the distance between objects of different clusters. These criteria give an indication of the number of clusters in the data set. Different heuristics have been defined.

The most simple approaches only consider intra-cluster distances, for instance by defining a threshold θ for the distance d(C_j) between pairs of


6.2 Clustering 135

objects of a cluster Cj of the cluster structure ℜc, or, more specifically, of the average pairwise distance of the objects in cluster Cj. In other words:

∄ Cj ∈ ℜc : d(Cj) > θ    (6.9)

In other cases the inter-cluster distance between two clusters Ci and Cj comes into play. A final clustering must satisfy the following criterion:

d(Ci, Cj) > max{d(Ci), d(Cj)}  ∀ Ci, Cj ∈ ℜc and Ci ≠ Cj    (6.10)

where d(Ci, Cj) = min_{xi ∈ Ci, xj ∈ Cj} d(xi, xj)

Even if the above strict criterion is not satisfied, we can still obtain a good clustering by considering the average fit of an object in the clustering. In this heuristic a good clustering can be defined as follows. For each object xi of the cluster structure ℜc, the degree of fitness f(xi) of xi to its cluster Ci is computed as the normalized difference between the distance of xi to its second-choice cluster Cj and the average distance of xi to all other objects of Ci:

f(xi) = (b(xi) − a(xi)) / max{a(xi), b(xi)}    (6.11)

where

a(xi) = average distance of xi to all other objects of its cluster Ci:

a(xi) = 1/(r − 1) Σ_{xj ∈ Ci} d(xi, xj),  Ci ∈ ℜc, xi ≠ xj, xi ∈ Ci and r = |Ci|

b(xi) = min_{Cj} 1/r Σ_{xj ∈ Cj} d(xi, xj),  Cj ∈ ℜc, Ci ≠ Cj and r = |Cj|

−1 ≤ f(xi) ≤ 1

When the cluster Ci to which xi belongs is a singleton cluster, it is unclear how a(xi) should be defined, and then simply f(xi) = 0. Also, when the clustering contains only one cluster, f(xi) cannot be defined.

f(xi) is averaged over all objects. This can be done for different cluster structures (e.g., different k values), which gives a certain evaluation of the


clustering, where a high value indicates a clear structure and a low value indicates that one might better apply an alternative method of data analysis. Among the good cluster structures and corresponding k values, the one with the highest average fitness can be chosen.
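The degree-of-fit heuristic of Eq. (6.11), averaged over all objects, can be sketched as follows (an illustrative implementation of our own; the Euclidean distance helper is an assumption, and the function requires at least two clusters, as the text notes):

```python
def euclid(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def fitness(clusters, dist):
    """Average degree of fit f(x) over all objects, following Eq. (6.11):
    f(x) = (b(x) - a(x)) / max(a(x), b(x)), where a(x) is the average
    distance of x to the other members of its cluster and b(x) the lowest
    average distance of x to the members of any other cluster.  A singleton
    cluster gives f(x) = 0; at least two clusters are required."""
    scores = []
    for i, ci in enumerate(clusters):
        for x in ci:
            if len(ci) == 1:
                scores.append(0.0)  # a(x) is undefined for singletons
                continue
            a = sum(dist(x, y) for y in ci if y is not x) / (len(ci) - 1)
            b = min(sum(dist(x, y) for y in cj) / len(cj)
                    for j, cj in enumerate(clusters) if j != i)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

A clustering that respects the natural grouping of the data should score close to 1; mixing objects of different natural groups drives the average fit down.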

6.2.6 Use of Clustering in Information Extraction

In information extraction, clustering is useful when no training examples are available, when the information that they contain changes very dynamically, or when good features can be chosen based on linguistic and discourse studies. For instance, cross-document resolution of noun phrase coreferents relies on the context in which two entities (e.g., persons) occur in the texts. Because new persons and their corresponding contexts continuously turn up in the texts (e.g., new person names are cited in the context of novel companies where they are employed), it is not always convenient to train a supervised classifier that learns the contextual patterns. Another example regards single-document noun phrase coreference resolution, where clustering techniques are sometimes used relying on a good choice of features that are discussed in existing linguistic and discourse theories. Many of the algorithms rely on threshold values for cluster membership (e.g., Cardie and Wagstaff, 1999). In noun phrase coreference resolution, little research involves the detection of cluster goodness, such as intra-cluster similarities.

Our first illustration regards single-document coreference resolution. Let C = {C1, …, Ck} be a set of classes or entities. Let X = {x1, …, xn} be a collection of random variables over observations or “mentions”. For example, in the following sentences Bill Clinton went to Nigeria to speak before AIDS workers. Afterwards, the former US president joined his wife to a trip to China. Entity C1 represents Bill Clinton. Mention x1 refers to Bill Clinton, mention x5 refers to the former US president, and mention x6 refers to his. In this case the mentions x1, x5 and x6 corefer, meaning that they are assigned the entity class C1 (Bill Clinton) of the mention positioned first in the text.

We have to partition the set of mentions into an unknown set of entities. This is equivalent to finding the clusters of the mentions that refer to the same entity (e.g., in the above example the mentions x1, x5 and x6 belong to the same cluster or entity class C1). The problem is often referred to as finding the coreference chain to which a mention belongs. Finding the best partition (i.e., computing all possible partitions and validating them with heuristics such as the ones mentioned above) is often NP-hard, but there are several methods that find an approximate clustering. It is useful that a noun phrase


coreference task is solved based on its relational nature, because the assignment of an object to a partition (or here a mention to an entity) depends not just on a single low distance measurement to one other object, but on its low distance measurement to all objects in the partition (and furthermore on its high distance measurement to all nodes of all other partitions) (cf. Bansal et al., 2004). In noun phrase coreference resolution we usually have evidence that two noun phrases cannot corefer when they are of a different gender, such as wife and he in the foregoing example, or when they belong to a different semantic class, such as Bill Clinton (person) and China (location) in the foregoing example. Considering these constraints already reduces the number of potential partitions to be tested. An additional constraint restricts the number of hypotheses to be tested by considering the fact that anaphoric and cataphoric coreferents (i.e., coreferents whose meaning depends on other textual elements with a more fully descriptive phrasing) are often restricted to the scope of a text paragraph.

An example algorithm for noun phrase coreference resolution splits a text into its paragraphs. For each paragraph, a clustering is sought while applying the constraints on impossible coreferents. A best clustering can be found among all remaining hypotheses, or a good clustering could start from an initial clustering that merges non-phoric references and assigns the phoric references to the closest cluster, after which an improvement of the clustering might be obtained through swapping of the phoric references. Coreference chains obtained in one paragraph can then be merged across paragraphs if, for instance, their non-phoric references are sufficiently similar.
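A greedy clustering with such cannot-link constraints might be sketched as follows (a hypothetical rendering, not the book's algorithm; the feature names gender and class, the similarity function, and the threshold are illustrative assumptions):

```python
def cluster_mentions(mentions, compatible, similarity, threshold):
    """Greedy coreference clustering: each mention joins the most similar
    existing chain, unless a hard constraint (e.g., a gender or semantic
    class mismatch) rules the merge out.  `compatible` encodes the
    cannot-link constraints between two mentions."""
    chains = []
    for m in mentions:
        best, best_sim = None, threshold
        for chain in chains:
            if not all(compatible(m, other) for other in chain):
                continue  # merging would violate a cannot-link constraint
            s = max(similarity(m, other) for other in chain)
            if s >= best_sim:
                best, best_sim = chain, s
        if best is not None:
            best.append(m)
        else:
            chains.append([m])
    return chains
```

In the Bill Clinton example above, a location mention such as China can never enter the chain of a person mention, however similar the surface contexts are.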

Another task for which a clustering approach is common regards cross-document coreference resolution. An example can be found in Gooi and Allan (2004). Here identical strings or alias strings in different documents are grouped when their contexts are sufficiently similar. Contexts are defined as a term vector of which the component terms are represented by their weights. These can, for instance, be computed as tf × idf weights, where tf is computed as the frequency of the term in the context windows of l terms that surround the mention in a single document, and idf is computed as the inverse document frequency weight based on a reference corpus. An agglomerative cluster algorithm was chosen that merges clusters when a minimum similarity between the clusters is satisfied.
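The context representation used in such an approach can be sketched as follows (our own minimal illustration of tf × idf weighting over a context window, with cosine similarity for comparing two contexts; the variable names are assumptions):

```python
import math
from collections import Counter

def context_vector(context_terms, doc_freq, n_docs):
    """tf x idf weights for the terms in the context window of a mention.
    `doc_freq` maps a term to its document frequency in a reference corpus
    of `n_docs` documents (unknown terms default to a frequency of 1)."""
    tf = Counter(context_terms)
    return {t: freq * math.log(n_docs / doc_freq.get(t, 1))
            for t, freq in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Two mentions of the same string would then be merged into one coreference chain when the cosine of their context vectors exceeds a minimum similarity.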

A third example of the use of clustering in information extraction regards relation recognition. For instance, Hasegawa et al. (2004) cluster the contexts of pairs of named entities. The pair of named entities is chosen based on selected types (e.g., company A and company B, person X and company Y) and a predefined distance between the words. Each pair of named entities is described by the terms that occur as intervening words


between the two entities at a given maximum distance. As in the foregoing example, contexts are defined as term vectors, the weights of which are computed with the tf × idf metric. In this example the cosine similarity is used and an agglomerative clustering (e.g., complete linkage) groups contexts when a minimum similarity between the clusters is satisfied. In this way expressions of a semantic relation between two named entities can be learned, which can be applied to classify new instances. The frequent common words of a cluster characterize the relation and become its label; the shared words are seen as the characterization of a certain relation. The complete linkage cluster algorithm, which is based on the principle of minimum proximity of the clusters and which results in compact clusters, yielded the best results in this experiment.
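Complete-linkage agglomerative grouping with a similarity threshold can be sketched as follows (an illustrative reconstruction, not Hasegawa et al.'s implementation; the similarity function used in the test is an assumption):

```python
def complete_link(vectors, sim, threshold):
    """Agglomerative clustering with complete linkage: two clusters are
    merged only if even their LEAST similar pair of members stays above
    the threshold, which is what makes the resulting clusters compact."""
    clusters = [[v] for v in vectors]
    while True:
        best, best_sim = None, threshold
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete linkage: similarity of the least similar pair
                s = min(sim(a, b) for a in clusters[i] for b in clusters[j])
                if s >= best_sim:
                    best, best_sim = (i, j), s
        if best is None:
            return clusters  # no pair of clusters is similar enough
        i, j = best
        clusters[i].extend(clusters.pop(j))
```

Swapping the `min` for a `max` or an average would give single linkage or group average linkage, respectively, matching the proximity functions of Sect. 6.2.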

6.3 Expansion

In the following sections we discuss bootstrapping or weakly supervised learning techniques. Bootstrapping refers to a technology that starts from a small initial effort and gradually grows into something larger and more significant. Bootstrapping is the promotion or development by initiative and effort with little or no assistance.2

The first class of techniques regards what can be called expansion techniques (in analogy with the expansion of a query in retrieval with synonym terms). The simple principle is as follows (Fig. 6.1). Given a set of seeds (examples that are manually classified), each instance of a subset of unlabeled examples receives the class of a labeled example that is judged sufficiently similar, or of the most similar examples. The newly classified examples are then added to the training set. This process can be iterated several times, but additional constraints are formulated in order to limit noisy expansions.
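The expansion principle can be sketched as follows (a minimal illustration of our own; representing examples as sets of context words and using the Jaccard similarity are assumptions, not prescriptions from the text):

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of context words."""
    return len(a & b) / len(a | b)

def expand(seeds, unlabeled, sim, threshold, iterations=3):
    """Expansion bootstrapping: an unlabeled example that is sufficiently
    similar to a labeled example inherits its class and joins the training
    set; the process is repeated for a few iterations."""
    labeled = dict(seeds)  # example (frozenset of words) -> class
    pool = set(unlabeled)
    for _ in range(iterations):
        newly = {}
        for x in pool:
            best, best_sim = None, threshold
            for ex, cls in labeled.items():
                s = sim(x, ex)
                if s >= best_sim:
                    best, best_sim = cls, s
            if best is not None:
                newly[x] = best
        if not newly:
            break
        labeled.update(newly)
        pool -= set(newly)
    return labeled
```

Because the newly labeled examples themselves become seeds for the next iteration, a badly chosen threshold quickly propagates noise, which is exactly the risk discussed at the end of this section.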

The Yarowsky algorithm (Yarowsky, 1995; Abney, 2004) for word sense disambiguation is probably the earliest and most famous bootstrapping approach in semantic classification. It learns a dictionary of word sense patterns for a priori defined senses of a particular word. Given an ambiguous target word (e.g., plant) and its two meanings (living thing and manufacturing institution), a few seed examples of each sense are chosen. The aim

2 http://www.webster.com


Fig. 6.1. Labeled seeds are expanded with unlabeled examples, which are classified with the class of the closest seeds.

is to classify a large pool of unlabeled examples and possibly to store the learned context patterns in a dictionary. The labeled seed examples and the unlabeled examples are represented by the terms of their context window (Fig. 6.2). In each step of the iteration, the target word in a set of unlabeled examples receives the sense of a close example. An example is similar when it contains at least one context word of a labeled example. Yarowsky defines two important constraints. First, the log ratio of the probability of the sense s in the labeled examples given the context word to the probability of another sense (!s) in the labeled examples given the context word must be larger than a threshold value (one sense per collocation assumption). Second, in examples from the same text the target word receives the same sense class (one sense per discourse assumption).
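The one-sense-per-collocation filter can be sketched as follows (an illustrative reconstruction, not Yarowsky's implementation; the add-one smoothing is our own assumption to avoid zero probabilities):

```python
import math

def reliable_collocations(counts, alpha):
    """One-sense-per-collocation filter: keep a context word w as an
    indicator of sense s only if log(P(s|w) / P(!s|w)) exceeds alpha.
    `counts[w][s]` is the number of labeled examples of sense s whose
    context window contains w; add-one smoothing avoids division by zero."""
    keep = {}
    for w, per_sense in counts.items():
        total = sum(per_sense.values()) + len(per_sense)
        for s, c in per_sense.items():
            p_s = (c + 1) / total        # smoothed P(s | w)
            p_not = 1 - p_s              # probability mass of the other senses
            if math.log(p_s / p_not) > alpha:
                keep[w] = s
    return keep
```

Only the collocations that strongly favor one sense survive, so the dictionary of learned patterns grows conservatively from iteration to iteration.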

A second illustration regards the use of expansion techniques for named entity recognition (Petasis et al., 2000). Starting from a seed set of classified proper names (e.g., person names, organization names) and their feature vectors, an unknown proper name receives the class of a seed proper name when the context of the seed is the same as the one of the unknown, or when the context of the seed is very similar. Similarity is defined based on shared context words, context words that have the same hypernym in WordNet, and on congruence of the syntactic function of the name in the sentence. A number of smoothing and weighting factors favor or disfavor




the similarity, and take into account phenomena such as the ambiguity of the meaning of a term or of a syntactic construct, and the conditional probability of a context given a semantic class.

Fig. 6.2. Schematic representation of the Yarowsky algorithm for word sense disambiguation.

The expansion approach has also been tested for named entity recognition (Agichtein and Gravano, 2003), question type recognition (Ravichandran and Hovy, 2002) and dictionary construction (Riloff and Jones, 1999).

Many of the bootstrapping algorithms in natural language technology give an answer to the lexical variation of natural language and the fact that the same content can be expressed by different phrasings. Expansion is illustrative of the most simple approach. One hopes to learn variant classification patterns that are only slightly different from the ones that are classified by humans, and consequently to improve the recall (see Eq. (8.1)) of the classification. However, it is very important to choose good seeds and to define valuable additional constraints on the expansion, otherwise

[Content of Fig. 6.2: labeled examples such as "Nissan car and truck plant", "automatic manufacturing plant in Fremont" and "company manufacturing plant in Orlando" (sense A), and "used in strain microscopic plant life", "actual distribution of plant life" and "by rapid growth of aquatic plant and animal life" (sense B), together with unlabeled examples such as "animal and plant tissues" and "plant specialized in the production of car seats". An example is considered similar if it contains the same context word w. Constraints: 1) log [P(Sense = s | context word w) / P(Sense = !s | context word w)] > α; 2) one sense per discourse.]


the expansion with slightly different patterns can introduce noisy and ambiguous patterns, which reduce the precision of the classification (see Eq. (8.2)). In other words, the features and the function that assess similarity should be carefully chosen. In addition, seeds should be selected that represent very different patterns, otherwise large improvements in recall cannot be expected by means of expansion with similar patterns. Finally, the features in the context patterns might be ambiguous on their own. In the Yarowsky algorithm contexts might comprise words that are ambiguous in meaning and, when they are used in expansion techniques, they will introduce noisy patterns. Noise propagates quickly (Angheluta, 2003), demanding that accuracy on a (hopefully) representative test set is carefully monitored during training.

6.4 Self-training

Self-training refers to supervised learning techniques that incrementally learn a classifier based on a seed set of labeled examples and a set of unlabeled examples that are labeled with the current classifier, until the trained classifier reaches a certain level of accuracy on the test set (Fig. 6.3). At each iteration, the classifier is applied on a set of size m of unlabeled examples. When the labeling confidence of an example exceeds a certain threshold, it is added to the training set, while the class distribution is maintained. Because only one classifier is trained on the full feature set, self-training is also referred to as learning with a single-view algorithm.

The techniques of expansion that were discussed in the previous section can also be regarded as self-training. We deliberately treated the techniques in two separate sections, defining expansion techniques as methods that use a nearest neighbor search on the unlabeled example set, and defining self-training as techniques that train a classifier that generalizes to some rule or mathematical function based on the training data. The borderline between both techniques is very thin, because some of the constraints used in an expansion technique can be considered as a kind of generalization (e.g., in the Yarowsky algorithm, where only the most probable collocational words that most strongly indicate the meaning are retained for further training). Another difference with the foregoing approach is that the newly learned classifier is applied on all labeled examples in each iteration, and thus it might change the class labels of these examples that were obtained in a previous step.
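The self-training loop described above can be sketched as follows (a generic skeleton of our own; the `fit` and `predict_proba` callables, the confidence threshold and the batch size are illustrative assumptions):

```python
def self_train(train, unlabeled, fit, predict_proba, confidence, batch=10):
    """Self-training: fit a classifier on the labeled seeds, label a batch
    of unlabeled examples with it, keep only the confidently labeled ones
    as new training data, retrain, and repeat until the pool is exhausted.
    `fit(pairs)` returns a model; `predict_proba(model, x)` returns a
    (label, confidence) pair."""
    train = list(train)          # (example, label) pairs
    pool = list(unlabeled)
    while pool:
        model = fit(train)
        batch_items, pool = pool[:batch], pool[batch:]
        for x in batch_items:
            label, p = predict_proba(model, x)
            if p >= confidence:
                train.append((x, label))
            # low-confidence examples are simply discarded here
    return fit(train), train
```

A production version would also re-score the previously self-labeled examples at each iteration, since, as noted above, the retrained classifier may change labels assigned in an earlier step.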


Fig. 6.3. Self-training: A classifier is incrementally trained (blue line), first based on the labeled seeds, and then based on the labeled seeds and a set of unlabeled examples that are labeled with the current classifier. The dotted blue line represents the set of all unlabeled examples that were considered for labeling in this step.

In a variant scenario, if a generative probabilistic classifier is used (i.e., probabilities are not estimated directly, rather they are estimated indirectly by invoking Bayes' rule, e.g., a naïve Bayes classification) for the training of the initial classifier based on the seed set of labeled examples, the Expectation Maximization (EM) algorithm is used to train the classifier that learns both from the labeled and unlabeled examples. The EM algorithm is a classical algorithm that learns a good solution for hidden variables (Dempster et al., 1977). The unlabeled data are considered as hidden or missing data. The goal is to find a model such that the posterior probability of its parameters is locally maximized given both the labeled data and the unlabeled data. The resulting model is then used to make predictions for the test examples.

Initially, the algorithm estimates the model parameters by training a probabilistic classifier on the labeled instances. Then, in the expectation step (E-step), all unlabeled data are probabilistically labeled by this classifier. During the maximization step (M-step), the parameters of the generative model are re-estimated using the initially labeled data and


the probabilistically labeled data in order to obtain a maximum a posteriori optimized hypothesis. The E- and M-steps are repeated during several iterations.

The unlabeled data contain information about the parameters of the generative model. They provide information about the joint probability distribution of features. Suppose in a named entity recognition task the context word profit signals a company name. The unlabeled data might give evidence that the word share frequently co-occurs with profit, making it possible that share also becomes an indicator of the class company.
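The E- and M-steps for a semi-supervised naïve Bayes model can be sketched as follows (an illustrative reconstruction with add-one smoothing; the variable names and the number of rounds are assumptions). The test reproduces the profit/share effect described above:

```python
import math
from collections import defaultdict

def train_nb(docs):
    """M-step: fit a naive Bayes model from (word-list, {class: weight})
    pairs, where the weights may be soft (probabilistic) labels."""
    prior = defaultdict(float)
    word_c = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for words, dist in docs:
        for c, w in dist.items():
            prior[c] += w
            for t in words:
                word_c[c][t] += w
                vocab.add(t)
    total = sum(prior.values())
    return {c: (prior[c] / total,
                {t: (word_c[c][t] + 1) / (sum(word_c[c].values()) + len(vocab))
                 for t in vocab})
            for c in prior}

def posterior(model, words):
    """Class posterior P(c | words) via Bayes' rule, in log space."""
    log_p = {c: math.log(p) + sum(math.log(pw.get(t, 1e-9)) for t in words)
             for c, (p, pw) in model.items()}
    m = max(log_p.values())
    exp = {c: math.exp(v - m) for c, v in log_p.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def em_nb(labeled, unlabeled, rounds=5):
    """Semi-supervised naive Bayes: the E-step labels the unlabeled
    documents probabilistically, the M-step refits the model on the
    labeled plus soft-labeled documents."""
    docs = [(w, {c: 1.0}) for w, c in labeled]
    model = train_nb(docs)
    for _ in range(rounds):
        soft = [(w, posterior(model, w)) for w in unlabeled]   # E-step
        model = train_nb(docs + soft)                          # M-step
    return model
```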

The self-training techniques are confronted with the same problems as the expansion techniques (see supra). It is difficult to choose valuable seeds and to fix their number. In addition, the importance of selecting valuable features for the classification task at hand cannot be underestimated. Although the technique is more popular in text categorization (McCallum et al., 1999), Ng and Cardie (2003) incorporated the EM algorithm in noun phrase coreference resolution.

A standard supervised learning paradigm is to induce a model from the data and to deduce the labeling of test data from this model. Vapnik (1995) argues that the model may be superfluous and advocates the use of transductive learning in order to directly estimate the labeling without first learning the model from the training data. In text categorization, transductive Support Vector Machines have been proposed in the case of a small number of labeled examples Slab and a large number of unlabeled examples Sunlab (Joachims, 2003). The general idea is as follows. The goal of the transductive Support Vector Machine is to select a function from the hypothesis space H using Slab and Sunlab such that the expected number of erroneous predictions on the test and the training samples is minimized.

The solution of this optimization problem is not only the separating hyperplane <w, b> but also the labeling of the test set x1*, …, xm* ∈ Sunlab. The key idea of the algorithm is that it begins with a labeling of the test examples done by means of the inductive SVM. Then, it improves the solution by switching the labels of test examples such that the cost function decreases (see Eqs. (5.10) and (5.11)). The algorithm converges towards a stable, although not necessarily optimal, solution. It takes the training and test examples as input, and outputs the labeling for the test examples and a classification model. So far, transductive learning has seldom been used for information extraction. Goutte et al. (2004) exploit it for named entity recognition.

Page 154: Info Mat Ion Extractions

144 6 Unsupervised Classification Aids

6.5 Co-training

In co-training two (or more) classifiers are trained using the same seed set of labeled examples, but each classifier trains with a disjoint subset of features (Blum and Mitchell, 1998) (Fig. 6.4). The features are usually split in such a way that the features of the different sets are conditionally independent. These feature subsets are commonly referred to as different views that the classifiers have on the training examples. Initially the classifiers are trained in a conventional way based on the seed set. At each iteration, the classifiers are then applied on a same set of size m of unlabeled examples. The examples that are then labeled with the current classifiers and on which the classifiers agree with most confidence are added to the pool of labeled examples, while the class distribution is maintained. Then the classifiers are retrained. This process is iterated several times until it reaches a certain level of accuracy on the test set. Note that different classifiers are learned, and when applied on new examples each classifier makes an independent decision. Usually the decision with the highest confidence determines the label for the new instance. Here too, labels assigned in previous steps might change in a subsequent iteration.

Fig. 6.4. Co-training: Two classifiers are incrementally trained (blue and green lines), first based on the labeled seeds, and then based on the labeled seeds and a set of unlabeled examples that are labeled with the current classifiers. The dotted blue and green lines represent the set of all unlabeled examples that were considered for labeling in this step.



In co-training the selection and number of seeds are also important parameters in training valuable classifiers. Another problem is the selection of a natural split of the features, where the chosen features from different sets are ideally conditionally independent. In an information extraction task the features are often not independent, but correlated.

Nevertheless, co-training has been applied to information extraction with some degree of success. If a natural split of the features into different sets is possible, with each set yielding an independent but discriminatory view of the data, relaxation of the features of each set yields additional expansions of the patterns. Co-training has been applied to named entity recognition (Collins and Singer, 1999; Abney, 2002) and noun phrase coreference resolution (Müller et al., 2002; Ng and Cardie, 2003).

6.6 Active Learning

A promising approach in an information extraction context is active learning. Among the weakly supervised methods described in this chapter, it is the one that requires the most supervision or human involvement. In an active learning approach all the examples are labeled by a human, but the limited set of examples to be labeled is carefully selected by the machine (Fig. 6.5). We assume here that the information extraction task is of such a nature that humans are able to correctly classify the training examples.

Fig. 6.5. Active learning: Representative and diverse examples to be labeled by humans are selected based on clustering.



Like the other weakly supervised methods, the algorithm starts with a seed set of labeled examples, although the existence of such an initial seed set is not strictly needed in order to correctly apply the algorithm. At each iteration, a set of examples is selected and labeled by a human and added to the training set in order to retrain the classifier, until the trained classifier reaches a certain level of accuracy on a test set. The selection of examples is not random. Often, those examples are selected that the current classifier considers as most uncertain and thus most informative. Or, examples are selected which are representative or diverse with regard to the pool of unlabeled examples.

For instance, when training a Support Vector Machine, vectors of unlabeled examples can be chosen which are very close to the hyperplane that separates the two classes. These are certainly examples of which the current classifier is most uncertain. When training a probabilistic classifier (e.g., a maximum entropy classifier), the probability of class assignment of unlabeled examples gives insight into the uncertainty of the classification. In order to quantify the uncertainty of the classification of an example, several entropy-based measures can be used (Tang et al., 2002).

When selecting representative examples in the pool of unlabeled examples, different approaches are possible. The similarity between the feature vectors of the examples can be computed, and examples that have a large average similarity with other examples are chosen. An alternative is to cluster the examples (using the clustering techniques) and then to select a representative example from each cluster (e.g., the medoid of the cluster). Clustering is also useful to detect diverse examples in a large set of unlabeled examples. If a natural clustering is sought, the medoids of the clusters should be far apart and quite diverse. When detecting a natural clustering is computationally expensive or is only approximated, outliers in the clusters can be selected for labeling.

Although the active learning methods are attractive, a few points need attention. When the system selects examples based on unsupervised methods like clustering, the criteria for obtaining a good clustering discussed earlier in this chapter play a role (e.g., the choice of discriminative and non-noisy features). Secondly, we should watch the computational complexity of the training phase of the weakly supervised classifier, as, for instance, finding a good clustering in each step of the iteration is computationally expensive.

How does active learning affect the two issues that we have postulated earlier in this chapter? A classifier used in information extraction has to


deal with two problems: a lack of training examples that exhibit variant patterns, and the ambiguity of the learned patterns. Active learning certainly contributes to a larger variation of the training examples. Not only slightly similar patterns, but also very diverse examples can be selected, and redundant examples are avoided. If the example set contains a large number of variant patterns, many iterations of the algorithm might be needed before the classifier is sufficiently accurate on a test set. The second problem, the ambiguity of language, is also taken care of. The system chooses examples to annotate of which the current classifier is most uncertain. In addition, by choosing outlier elements of clusters or examples that could be assigned to two clusters, the system detects examples that might exhibit some degree of ambiguity and that can be annotated by a human.
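Uncertainty-based example selection with an entropy measure can be sketched as follows (a minimal illustration of our own; the `predict_proba` callable, which returns a class-probability distribution, is an assumption):

```python
import math

def entropy(dist):
    """Shannon entropy of a class-probability distribution (a dict)."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def select_queries(pool, predict_proba, k):
    """Uncertainty sampling: ask the human annotator to label the k pool
    examples whose current class distribution has the highest entropy,
    i.e., the examples the classifier is most uncertain about."""
    return sorted(pool, key=lambda x: entropy(predict_proba(x)),
                  reverse=True)[:k]
```

An example on which the classifier assigns nearly equal probability to two classes has maximal entropy and is therefore queried first; confidently classified examples are left unannotated.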

In information extraction, experiments with active learning techniques are recent and limited. Shen et al. (2004) report comparable accuracy results in a named entity recognition task, while reducing the training set to be annotated by 80%. They use Support Vector Machine technology for the supervised training. Some of the methods discussed above for the selection of uncertain, representative and diverse examples were implemented. It might be interesting to study whether methods of example selection in concept learning (e.g., Michalski and Stepp, 1983) could be of interest for active learning. Moreover, knowledge acquisition methods used by knowledge engineers can be very inspiring for the task of automated example selection.

6.7 Conclusions

In this chapter we have discussed a number of interesting approaches that aim at relieving the annotation bottleneck and that can be further pursued in research. The weakly supervised algorithms offer portable pattern recognition technology. There are a number of algorithms that are very promising, but attention should go to the choice of representative seed examples and to the selection of good features that represent an example. There is a wealth of literature on linguistic and cognitive theories. These theories are a source of knowledge for advanced seed and feature selection in information extraction tasks.


lead to the development of novel learning algorithms, which we will dis-cuss in the last chapter of this book.

In the next chapter we will study the role of information extraction in information retrieval.

6.8 Bibliography

Abney, Steven (2002). Bootstrapping. In Proceedings of the 40th Annual Meetingof the Association for Computational Linguistics (ACL) (pp. 360-367). SanFrancisco, CA: Morgan Kaufmann.

Abney, Steven (2004). Understanding the Yarowski algorithm. Computational Linguistics, 30 (3), 365-395.

Agichtein, Eugene and Luis Gravano (2003). Querying text databases for efficient information retrieval. In Proceedings of the IEEE International Conference on Data Engineering (pp. 113-124). IEEE Computer Society.

Angheluta, Roxana (2003). Word Sense Disambiguation. Master Thesis, Katho-lieke Universiteit Leuven.

Bansal, Nikhil, Avrim Blum and Shuchi Chawla (2004). Correlation clustering.Machine Leaning, 56 (3), 89-113.

Blum, Avrim and Tom Mitchell (1998). Combining labeled with unlabeled datawith co-training. In Proceedings of the 11th Annual Conference on Computa-tional Learning Theory (COLT) (pp. 92-100). San Francisco, CA: Morgan TTKaufmann.

Cardie, Claire and Kiri Wagstaff (1999). Noun phrase coreference as clustering. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (pp. 82-89). ACL.

Collins, Michael and Yoram Singer (1999). Unsupervised models for named entity classification. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP). College Park, MD.

Dash, Manoranjan, Kiseok Choi, Peter Scheuermann and Huan Liu (2002). Feature selection for clustering. In Proceedings of the IEEE International Conference on Data Mining (pp. 115-122). IEEE Computer Society.

Dempster, A.P., N.M. Laird and D.B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.

Dy, Jennifer G. and Carla E. Brodley (2004). Feature selection for unsupervised learning. Journal of Machine Learning Research, 5, 845-889.

Gooi, Chung Heong and James Allan (2004). Cross-document coreference on a large scale corpus. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (pp. 9-16). East Stroudsburg, PA: ACL.

Goutte, Cyril, Eric Gaussier, Nicola Cancedda and Hervé Déjean (2004). Generative vs. discriminative approaches from label-deficient data. In JADT 2004: 7es Journées internationales d'Analyse statistique de Données Textuelles. Louvain-La-Neuve.


Hasegawa, Takaaki, Satoshi Sekine and Ralph Grishman (2004). Discovering relations among named entities from large corpora. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (pp. 416-423). New York, NY: ACM.

Joachims, Thorsten (2003). Transductive learning via spectral graph partitioning. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003). San Francisco, CA: Morgan Kaufmann.

Jones, William P. and George W. Furnas (1987). Pictures of relevance: A geometric analysis of similarity measures. Journal of the American Society for Information Science, 38 (6), 420-442.

Kaufman, Leonard and Peter J. Rousseeuw (1990). Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley and Sons.

McCallum, Andrew, Kamal Nigam, Jason Rennie and Kristie Seymore (1999). A machine learning approach to building domain specific search engines. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (pp. 662-667). San Mateo, CA: Morgan Kaufmann.

Michalski, Ryszard S. and Robert Stepp (1983). Learning from observation: Conceptual clustering. In Ryszard S. Michalski, Jaime G. Carbonell and Tom M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach II (pp. 331-363). Palo Alto, CA: TIOGA Publishing Co.

Modha, Dharmendra S. and W. Scott Spangler (2003). Feature weighting in k-means clustering. Machine Learning, 52 (3), 217-237.

Müller, Christoph, Stefan Rapp and Michael Strube (2002). Applying co-training to reference resolution. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 352-359). San Francisco, CA: Morgan Kaufmann.

Ng, Vincent and Claire Cardie (2003). Bootstrapping coreference classifiers with multiple machine learning algorithms. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP-2003). ACL.

Petasis, Georgios et al. (2000). Automatic adaptation of proper noun dictionaries through cooperation of machine learning and probabilistic methods. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 128-135). New York: ACM.

Ravichandran, Deepak and Eduard Hovy (2002). Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 41-47). East Stroudsburg, PA: ACL.

Riloff, Ellen and Rosie Jones (1999). Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (pp. 474-479). San Francisco: Morgan Kaufmann.

Riloff, Ellen (1996). An empirical study of automated dictionary construction for information extraction in three domains. Artificial Intelligence, 85, 101-134.

Shen et al. (2004). Multi-criteria-based active learning for named entity recognition. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (pp. 590-597). East Stroudsburg, PA: ACL.





Talavera, Luis (1999). Feature selection as a preprocessing step for hierarchical clustering. In Proceedings of the 16th International Conference on Machine Learning (pp. 389-397). San Francisco, CA: Morgan Kaufmann.

Tang, Min, Xiaoqiang Luo and Salim Roukos (2002). Active learning for statistical natural language parsing. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 120-127). San Francisco, CA: Morgan Kaufmann.

Theodoridis, Sergios and Konstantinos Koutroumbas (2003). Pattern Recognition. Amsterdam, The Netherlands: Academic Press.

Vapnik, Vladimir (1995). The Nature of Statistical Learning Theory. New York: Springer.

Yarowsky, David (1995). Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (pp. 189-196). Cambridge, MA: ACL.



7 Integration of Information Extraction in Retrieval Models

7.1 Introduction

In the foregoing chapters we have seen that information extraction algorithms are being developed on a large scale and can be applied on open domain document collections. The extracted information can further be processed in information systems, i.e., in data mining systems that detect valuable trends in the information, in summarization when selecting content based on certain semantic classes, or in expert systems that reason with the extracted knowledge in order to support human decision making. Another important application is information retrieval, where the extracted information contributes to a more refined representation of documents and query. It is this last application that is the focus of this book.

Information extraction provides the technology to identify and describe content. We believe that information extraction technologies will become an important component of any retrieval system and that representations of documents and query that are enriched with semantic information of their respective content will give rise to adapted and advanced retrieval models. A retrieval model is defined by the query and document representations and by the function that estimates the relevance of a document to a query. The output of the ranking function is a score of relevance so that documents can be sorted according to relevance to the query. Future retrieval models will incorporate the content descriptions that are generated with information extraction technologies and preferably incorporate the probabilistic nature of the assignments of the semantics.

Currently, we observe the first signs of such an evolution. On the one hand, sophisticated queries in the form of natural language statements are becoming popular and demand precise answers. Because the extraction tools become available, they can yield many different forms of semantic preprocessing that bootstrap on each other, yielding advanced forms of text understanding at various levels of detail. On the other hand,



when more general semantic concepts are assigned to the texts, they can be used as search terms and so increase the recall of the retrieval. In general, information extraction technology offers descriptors that go beyond the variation of natural language and the paraphrases of similar content.

This chapter is organized as follows. We will first discuss the state of the art of information retrieval, followed by a definition of the requirements and constraints of information retrieval systems. The next section goes deeper into the problems of current retrieval systems and forms the basis of the discussion for the motivation of the use of advanced retrieval models that incorporate information extraction. A large part of this chapter discusses the integration of information extraction results in current retrieval models and the drafting of effective and efficient search indices. The chapter focuses on retrieval models for texts. The reader will however notice that we look at retrieval models in the larger context of multi-media retrieval and that we anticipate that future information extraction from texts will be complemented with information extraction from other media.

We focus on the information extraction tasks that are discussed in Chap. 4. We do not exclude the many other types of extraction patterns that can be detected in text. Extraction technology offers meaning to the documents. Eventually, we aim for the extracted information to contribute to the understanding of the various content aspects of a text. More abstract concepts and scenario semantics can be assigned through bootstrapping from detailed extracted content. These could be used in the filtering of information. Filtering systems are not treated in this chapter.

We do not discuss here the case where information is retrieved and synthesized from different documents. In the future we foresee that retrieval systems will increasingly combine information from different sources. Because such systems have not been developed yet, apart from some rare implementations, we will discuss information synthesis and retrieval in the final chapter of this book, which treats future developments.

7.2 State of the Art of Information Retrieval

Information retrieval is concerned with answering information needs as accurately as possible. The information is found in ever-growing document collections. Information retrieval typically involves the querying of unstructured or semi-structured information, the former referring to the content of unstructured text (written or spoken), images, video and audio, the latter referring to well-defined metadata that are attached to the documents at the time of drafting. A well-known paradigm of querying a document



database is by inputting key terms and matching them against the terms by which the documents are indexed (which are the words of the texts in case of a full text search). There are the typical problems of synonymy and ambiguity, the latter referring to polysemous words. For instance, in text different words express the same content and different content is expressed by the same word. In text retrieval there are several ways to deal with these problems. Maybe the most classical way is to expand the words of the query with related terms, obtained from a thesaurus or learned from a large corpus or from a set of documents that are judged relevant for the query (Xu and Croft, 1996). Related to this approach there are techniques that translate words or other features in query and documents to mathematical concepts (e.g., the technique of Latent Semantic Indexing) (Deerwester et al., 1990). The translation will correlate terms that often occur in the same contexts. Retrieval then becomes concept matching and works well when the indexing can be trained on a large domain-specific corpus. These approaches certainly contribute to more effective word based retrieval.
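The correlation of terms that share contexts can be illustrated with a truncated singular value decomposition, the core operation behind Latent Semantic Indexing. The toy term-document matrix below is invented purely for illustration; it is not an example from the book:

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents);
# the counts are invented for illustration only.
terms = ["car", "automobile", "engine", "flower", "garden"]
A = np.array([
    [1.0, 0.0, 0.0],  # "car"        occurs in doc 1
    [0.0, 1.0, 0.0],  # "automobile" occurs in doc 2
    [1.0, 1.0, 0.0],  # "engine"     occurs in docs 1 and 2
    [0.0, 0.0, 1.0],  # "flower"     occurs in doc 3
    [0.0, 0.0, 1.0],  # "garden"     occurs in doc 3
])

# Truncated SVD: keep only the k strongest "concept" dimensions.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_concepts = U[:, :k] * s[:k]  # each term as a k-dim concept vector

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "car" and "automobile" never co-occur, so their raw vectors are
# orthogonal; in the concept space they become correlated because
# both co-occur with "engine".
raw = cosine(A[0], A[1])
latent = cosine(term_concepts[0], term_concepts[1])
print(raw, round(latent, 2))
```

Dropping the weakest singular direction is exactly what merges the two synonyms: the discarded dimension is the one that distinguished "car" from "automobile".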

In text based information retrieval, queries become increasingly sophisticated and take the form of real natural language statements, such as questions (e.g., How do I obtain a Canadian passport?), needs (e.g., Find information about base sequences and restriction maps in plasmids that are used as gene vectors.), commands (e.g., Give me the names and addresses of institutions that provide babysitters.), viewpoints or contextual queries (e.g., The effects of the 2005 Tsunami on the Indonesian economy in 2006.), or a query by example like the following case description (e.g., Car jacking with black Mercedes in the region of Brussels. Two men and a blond woman are signaled. One of the men wears an earring.).

These queries can be classified as real natural language questions or statements that in some way are exemplary of the information to be retrieved. The most well-known retrieval paradigms in such a setting relate to question answering and query by example.

In a question answering system a searcher poses a real question in natural language and the system does not retrieve the documents in which the answers can be found, but the answer to the question (e.g., Which sum is allocated to social securities in the budget for 2003?). Single questions are automatically answered by using a collection of documents as the source of data for the production of the answer. The question provides context for the meaning of the search terms and the question includes linguistic relationships between the terms that should also be found in the documents. Most current question answering systems are restricted to answering factual questions (e.g., What year was Mozart born?), because of the difficulty of answering questions that require reasoning with




content. Even if the question answering technology is only used to identify passages or sentences in which the answer can be found, it is already a useful tool for retrieval.

In query by example retrieval the searcher provides an example object and the system retrieves similar objects, possibly ranked by decreasing similarity. This technique is commonly used in multi-media information retrieval, for instance, for the retrieval of similar images given an example image or similar musical melodies given example melodies. In case of text based information retrieval the technique is less widespread, but there are cases where it is valuable (e.g., retrieval of precedent cases in law; retrieval of fact patterns by police forces). Exemplary documents are useful when they describe or exhibit the intellectual structure of a particular field of interest. In doing so, they provide not only an indexing vocabulary and, more importantly, a narrative context in which the indexing terms have a clearer meaning, but also the explicit relations that should exist between content. The exemplary texts often explicitly mention the relationships between some or all of the issues or topics they identify (e.g., that certain topics are related causally; or how specific events are related chronologically), and contain the relations between entities to be found in the texts of the document collection.
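In its simplest word-level form, query by example over text reduces to ranking documents by their similarity to the exemplary text. The following minimal sketch uses a plain bag-of-words cosine similarity on an invented mini-collection; a real system would add TF-IDF weighting, proper tokenization and, as argued in this chapter, semantic features:

```python
from collections import Counter
import math

def bow(text):
    # Naive bag-of-words; real systems tokenize and normalize properly.
    return Counter(text.lower().split())

def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_by_example(example, docs):
    # Rank the collection by decreasing similarity to the example text.
    q = bow(example)
    return sorted(((cosine(q, bow(d)), d) for d in docs),
                  key=lambda x: x[0], reverse=True)

# Invented mini-collection for illustration.
docs = [
    "car jacking reported near brussels two men fled",
    "summer garden party music flowers",
    "stolen black mercedes found in brussels region",
]
example = "car jacking with black mercedes in the region of brussels"
ranked = rank_by_example(example, docs)
for score, d in ranked:
    print(round(score, 2), d)
```

The carjacking reports outrank the unrelated document, but only through shared surface words; the relations the example states (who wears what, who drives what) are invisible to this representation.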

On the document side, we increasingly see that documents carry additional information in the form of metadata. These metadata commonly regard the structural organization of a document. For instance, a Belgian legislative document is divided into books, chapters, sections, articles, and paragraphs. The annotations might also regard descriptions of the content that currently are mostly manually attributed, but in the future might be automatically assigned. So-called XML retrieval models, named after the markup language, XML (Extensible Markup Language), in whose syntax most of the annotations are labeled, are being developed and take into account the extra information added to the documents when ranking documents according to relevance to the query (Blanken et al., 2003). Also document content might be expressed in semantic concepts that usually are manually assigned. The concepts might be expressed in OWL (Web Ontology Language) and information needs and documents can be represented in such a formalism (Song et al., 2005).


7.3 Requirements of Retrieval Systems

In the following we discuss a number of important requirements of current information retrieval systems. They are ordered from the classical requirements towards the more novel ones.

1. The retrieval of information should be effective, i.e., all relevant information should be retrieved (high recall) and all the retrieved information should be relevant (high precision).

2. The documents and possibly their index descriptions are often distributed among several databases. Retrieval systems have to cope with this situation, as well as with the fact that document databases and corresponding indices are usually very large.

3. Flexible querying is one of the pillars of the success of retrieval systems. The information needs of users of the system (e.g., users of a Web search engine) are enormously varied and often change from day to day.

4. Whenever correct and valuable information on the user of a retrieval system is available, the retrieval model should smoothly integrate this user's profile.

5. Users of retrieval systems can formulate queries in the form of questions or exemplary statements. Retrieval systems should retrieve appropriate answers to these types of queries.

6. The user is often interested in receiving the shortest, but complete answer to his or her information query. Often the retrieved information is large, demanding summarization and fusion of (semi-)redundant answers.

7. Documents might exhibit a structured format, i.e., document structure and some content are possibly tagged with a markup language such as XML (Extensible Markup Language). The extra knowledge about a document should be integrated in ranking models.

8. Increasingly our document collections integrate different media (e.g., text, video, audio). Retrieval systems should adequately cope with these multi-media document bases.

9. A retrieval system should not neglect that document content can be represented in many different ways. Representing the information extracted from documents solely in strict database fields or specifically designed knowledge representations would entail the loss of the underlying unstructured format and its possibilities to match different interpretations and information needs.




As we will see in the next section, many of the above requirements justify the use of information extraction technology in information retrieval.

7.4 Motivation of Incorporating Information Extraction

We are convinced that information extraction technology will become a necessary component of any retrieval system. Information extraction offers the opportunity to semantically enrich the indexing representations made of the documents. Traditionally, information extraction and retrieval are integrated in a different way. More specifically, information retrieval techniques are commonly used to initially select possibly relevant content that is further analyzed by information extraction technology. Our approach does not exclude that basic key based searches select information regions that are more deeply processed by extraction technology, but we are convinced that future retrieval systems will increasingly make use of extraction technology (from text, from audio, from images) in order to index documents and to find relevant information that answers an information need.

The information overload causes the traditional library paradigm of information retrieval systems to be abandoned. A classical information retrieval system very much relies on keyword indices to search documents, after which the documents are retrieved and consulted. This is pretty similar to searching, borrowing and consulting books in a paper and print library. Retrieval of documents contained in today's very large digital libraries often results in a large amount of possibly relevant documents (Blair, 2002a). Moreover, the user usually has no time to consult all retrieved documents in order to find the answer to his or her information need. So, we need to redefine the retrieval model paradigm. We expect the retrieval systems to more directly answer our information needs by information extracted from the documents and processed into a coherent answer to our information query, or at least to intelligently link information as the possible answer to the information query, which allows the user to efficiently navigate through the retrieved information.

In classical retrieval we are concerned with two problems. We want to retrieve all documents or information that is relevant for our query. In other words, the recall (see Eq. (8.24)) of the search should be as close as possible to 100%. In addition, we want to retrieve only documents or information that is relevant for our query. In other words, the precision of the search (see Eq. (8.25)) should be as close as possible to 100%. As we will see in Chap. 8, we are especially concerned with having high precision



values on the first answers that are retrieved by the information retrieval system.
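For reference, the two measures in their standard set-based form (Chap. 8 gives the exact definitions used in this book as Eqs. (8.24) and (8.25)) are:

```latex
\mathrm{recall} = \frac{|R \cap A|}{|R|},
\qquad
\mathrm{precision} = \frac{|R \cap A|}{|A|}
```

where $R$ denotes the set of relevant documents and $A$ the set of retrieved documents.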

We believe that both recall and precision in retrieval can be improved by the incorporation of semantic information in the indexing representations. First, because information extraction allows assigning more general conceptual terms to words, phrases, sentences or passages, we are able to improve the recall of the retrieval. Secondly, if precise information queries are posed (e.g., the above examples of statements in natural language) that are possibly augmented with information on the profile of the user, the extraction technology is valuable to more precisely pinpoint the information. For instance, attributes or relations describing the entities of the query should also be found in the documents. The problem of a low precision is partly a problem of ambiguity. Words get a meaning in their semantic relation with other terms. If we match query and document based on low-level features such as the words of a text, the additional match on a semantic level can only improve the precision of the search. Such a matching is based on an old finding in artificial intelligence that states that two texts match not only when the words match, but also when their relationships and attributes match (Winston, 1982). Semantic attributes and relations are discovered with information extraction technology. Moreover, in Chap. 10 we will explain how we are able to expand short queries with semantic information in order to improve the matching.
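The additional match on relationships and attributes can be sketched as follows. The feature structures, relation triples and weights below are invented for illustration and do not mirror a particular system from the book:

```python
# Hypothetical representation: each text carries its words plus
# extracted (subject, relation, object) triples.

def match_score(query, doc, w_word=1.0, w_rel=2.0):
    """Combine low-level word overlap with a semantic-level match on
    extracted relations, rewarding documents whose entities stand in
    the same relationships as those stated in the query."""
    word_overlap = len(query["words"] & doc["words"])
    rel_overlap = len(query["relations"] & doc["relations"])
    return w_word * word_overlap + w_rel * rel_overlap

query = {
    "words": {"men", "mercedes", "brussels", "earring"},
    "relations": {("man", "wears", "earring")},
}
doc_a = {  # contains the words AND the extracted relation
    "words": {"men", "mercedes", "brussels", "earring", "witness"},
    "relations": {("man", "wears", "earring")},
}
doc_b = {  # same word overlap, but a different relation was found
    "words": {"men", "mercedes", "brussels", "earring", "shop"},
    "relations": {("shop", "sells", "earring")},
}
print(match_score(query, doc_a), match_score(query, doc_b))
```

Both documents tie on word overlap alone; only the semantic layer separates the document in which a man actually wears the earring from the one that merely mentions the same words.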

This idea of using additional semantic information when building indexing representations of documents is not new; in fact, it is almost as old as the beginning of information retrieval research. Harris (1959) proposed using linguistic procedures to extract selected relations from scientific articles, and to use them for document access. It is only now in the 21st century that information extraction technology matures and computer power allows the computational overhead of using information extraction in information retrieval. The power of information extraction and text classification techniques such as named entity recognition, semantic role recognition, attribute recognition and other relation classifications between content elements is starting to become acknowledged in a retrieval context (e.g., Cohen et al., 2005).

An additional stimulus certainly is the use of semantic annotations in multi-media retrieval. Currently, these labels are usually manually assigned. Their automated assignment is only a matter of time when technology of content recognition in other media such as image recognition will mature. At that moment retrieval models that cope with semantic information attached to the documents will become an absolute necessity. Nevertheless, the trend has started to extract information from images. Entities or objects are detected, recognized and classified (e.g., as person) and possibly authenticated (e.g., with the name of the person).





There is also a large interest in recognizing relations between persons, or between persons and objects, and in identifying attributes of persons and objects. In any case multi-media retrieval systems have to cope with extracted information besides low level features. For instance, both in a text and image medium we have to deal with low level features that by themselves carry not much meaning, but in combination they reveal "meaningful" patterns that are recognized by information extraction techniques and that, when convenient, could be named with semantic concepts. When searching for information the query can be mapped based on low level features, but also the meaningful features play an important role in order to further improve the performance of the retrieval.

In the information retrieval community there has been a reluctance to incorporate a linguistic analysis into retrieval systems (Lewis and Sparck Jones, 1996). Attempts to include phrases as index terms in order to enhance the retrieval performance have failed. Indexing the text by considering phrases assumes that phrases refer to meaningful concepts. When in a retrieval environment a phrase appears in both query and document text, the two may refer to the same concept. This approach is limited by the fact that the phrase must appear in the same form in the document text and query in order for the concept to be matched (Lewis et al., 1989; Smeaton, 1992). However, this is rarely the case with phrasal terms. A same concept can be expressed using different syntactic structures (e.g., a garden party and a party in the garden), possibly combined with lexical variations in word use (e.g., prenatal ultrasonic diagnosis and in utero sonographic diagnosis of the fetus) or with morphological variants (e.g., vibrating over wavelets and wavelet vibrations). Phrases may contain anaphors and ellipses. This problem is also present in current question answering and query by example retrieval of text.

Current research tries to solve this problem by identifying paraphrases (i.e., finding similar content that is expressed differently) (Barzilay and McKeown, 2002; Barzilay and Lee, 2003) in order to entail matches. Correct mapping to a standard single phrase must take into account lexical, syntactic, and morphological variations and resolve anaphors and ellipses. Finding paraphrases of query expressions is certainly useful in retrieval. In addition, finding paraphrase expressions is also part of the information extraction task as seen in Chap. 5. But the problem with paraphrasing is that paraphrases can be sought at different levels of detail. Literally, paraphrasing is the rephrasing of a phrase or sentence. But it is possible to rephrase all sentences, even whole passages. Information extraction will not provide all possible rephrasings, but it will group expressions that refer to similar content under one semantic label.
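The effect of grouping such surface variants under one key can be sketched with a deliberately crude normalization. The suffix-stripping rule below is an illustrative stand-in only, not one of the paraphrase techniques of Barzilay and McKeown (2002) or Barzilay and Lee (2003):

```python
def crude_stem(word):
    # Extremely crude suffix stripping, for illustration only;
    # a real system would use a proper stemmer or lemmatizer.
    for suf in ("ions", "ing", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def phrase_key(phrase, stopwords=frozenset({"a", "in", "the", "of", "over"})):
    # A phrase is reduced to the set of stems of its content words,
    # so syntactic and morphological variants collapse to one key.
    return frozenset(crude_stem(w) for w in phrase.lower().split()
                     if w not in stopwords)

phrases = ["a garden party", "a party in the garden",
           "vibrating over wavelets", "wavelet vibrations"]
groups = {}
for p in phrases:
    groups.setdefault(phrase_key(p), []).append(p)
print([sorted(g) for g in groups.values()])
```

The two garden-party variants and the two wavelet variants each fall into a single group; anaphors, ellipses and lexical substitutions (prenatal vs. in utero) are of course far beyond such a normalization, which is exactly why paraphrase identification is a research problem.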



Some insights from the discipline of Case-Based Reasoning (CBR) are relevant in order to additionally motivate the use of information extraction technologies in a retrieval context. Case-Based Reasoning is generally concerned with remembering old problem situations and their solution and using these to find a solution, classification or other inference for the current problem. Humans use this form of analogical reasoning in many situations (Kolodner, 1993). In this framework, the search for a similar situation, i.e., similar content, is a very important initial step. Both CBR and IR systems use indexing representations that can be efficiently searched. CBR teaches us that the surface features of a case, i.e., the features that are most apparent such as the most obvious facts of the case, do not always reveal all its aspects. Additional meaning can be added to the case representations making them more suitable for reasoning and learning from them in new situations (Carbonell, 1986). First of all, this refers to describing the cases at a more abstract level or describing the content with semantic descriptors. Also, extra knowledge with regard to the different contexts in which the information in a case will be used can be added (Kolodner, 1993). Adding these semantics to the retrieval models is exactly what we want to accomplish.

In information retrieval, representing all the words of a text is popular because it is thought that these words still contain all the information in se, although this is a false assumption because in a bag-of-words approach we lose many of the relationships between words, phrases and sentences. We would not completely revise this model, but we would plead for a model that on top of these low level features additionally considers assigned semantics in the most flexible way. Blair (2002b) discusses two competing forces when generating document representations used in information retrieval. Exhaustivity refers to the degree to which all the concepts and notions included in the text are recognized in its description. When the representations are highly descriptive (i.e., are very exhaustive) searches will tend to have high recall. But, if descriptions are biased towards discrimination, i.e., having indexing terms that very well discriminate one document from the other, searches will tend to have high precision. When only using discriminating descriptions, the searcher might not be able to anticipate any representations of relevant documents and recall and precision will both be zero. Both the exhaustive and discriminative power of the indexing representations can be enhanced by using information extraction results that complement the words of a text. Such a model has the additional advantage that the semantic descriptors also make explicit the semantic relationships that exist between content and which are absolutely necessary to match query and information in documents. Last but not least, such a model does not inhibit a flexible information search, one of the main



Page 170: Info Mat Ion Extractions

160 7 Integration of Information Extraction in Retrieval Models

requirements of information systems. A retrieval model that relies on the words of the texts and additional semantics added to words, phrases, sentences and passages is certainly more expressive than a bag-of-words representation, while still providing the flexibility of all kinds of information searches. Information extraction technology offers the extra "knowledge" to include in document and query representations. The attachment of this knowledge is often of a probabilistic nature.
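One simple way to make this probabilistic attachment concrete is to store each semantic label alongside the words, weighted by the confidence of its (uncertain) assignment, and to let that confidence weight the match. All label names, tokens and probabilities below are invented for illustration:

```python
# Sketch of an index entry that keeps both word features (weight 1)
# and semantic-label features, each carrying its assignment probability.

def build_index(words, labels):
    """words: tokens of the text; labels: dict of label -> probability
    with which the extraction system assigned that label."""
    index = {w: 1.0 for w in words}        # low-level word features
    for label, prob in labels.items():
        index["SEM:" + label] = prob       # uncertain semantic features
    return index

def score(query_features, index):
    # Each matched feature contributes its weight to the score.
    return sum(index.get(f, 0.0) for f in query_features)

doc = build_index(
    words=["two", "men", "signaled", "near", "brussels"],
    labels={"PERSON": 0.95, "LOCATION": 0.80, "CARJACKING_EVENT": 0.60},
)
# A query matching one word and two semantic labels: 1 + 0.95 + 0.60.
q = ["men", "SEM:PERSON", "SEM:CARJACKING_EVENT"]
print(score(q, doc))
```

A confidently assigned label thus contributes almost as much as an observed word, while a doubtful assignment contributes proportionally less, which is one way to keep the ranking honest about extraction errors.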

There are currently very few empirical studies on the actual improvement of information retrieval by adding semantic information. One study by Grishman et al. (2002) indicates an increase in precision of the retrieved documents when incorporating structured information extracted from Web pages in the retrieval.

When building such a model, one wonders how much semantics we have to add to the indexing representations. According to Blair (2002a), the number of semantic labels that could be assigned to a text is almost unlimited. There is a lot of philosophical dispute on the existence of concepts. The Platonian position (universalia ante rem) stresses the existence of concepts as ideas besides the various objects. The Aristotelian position (universalia in re) says that concepts only exist in the objects. In the position of nominalism (universalia post rem), concepts are developed only in human thinking. Whatever the reality is, we have to keep in mind that the semantic labels are only artifacts that help us in information retrieval or other data processing tasks. Information extraction regards the labeling and storage of intermediary representations that assist in the process of understanding text or another medium.

In the following sections we will discuss how the different retrieval models might incorporate the extra "semantic knowledge" into query and document representations and into a matching and ranking function.

7.5 Retrieval Models

Because information extraction structures unstructured content, an obvious approach is to translate the identified information into database entries (cf. the templates discussed in Chap. 2) and to allow a deterministic matching between query and document content (cf. the querying of a relational or object-oriented database, or the querying of a retrieval system with a Boolean model). However, such models do not account for a matching


between uncertain representations of query and document content and for the uncertainty of the matching. Because of technological limitations and of the difficulty of exactly capturing the information need of a user, content representations of queries and documents are often uncertain, resulting in an uncertain matching process.

Flexibility in querying the retrieval system is an important requirement of retrieval technology and is the basis of the success of current search engines. Users pose all kinds of different queries that are not known in advance. In the previous section we have argued for document representations that represent the information as completely as possible, given the constraints that large document indices pose when they have to be efficiently searched.

In the previous section we have also referred to research into XML retrieval models. XML retrieval models preserve the non-deterministic matching of an open domain query and document, but exploit the document structure to lead the user more precisely to those text elements that answer his or her information need (Blanken et al., 2003). It has been proposed to use a vector-space model (Fuhr et al., 2003) and a language model (Hiemstra, 2003) to more accurately rank the retrieval elements (structured parts of a document). The retrieval or ranking models that we consider are partly inspired by XML retrieval models, but additionally attempt to incorporate the uncertainties of query and document representations in their ranking computations.

When we integrate the semantic annotations, the typical bag-of-words model changes from a flat list of words representing each document to a model in which 0 to k labels can be attached to single terms or phrases, combinations of terms, passages, etc. The representation of a document is thus in the form of a bed-of-words covered with different semantics.

The query can still be in the form of a list of terms, or, in case the query is composed of natural language statements (e.g., a question, example texts), the query can be translated to a layered format similar to that of a document. In addition, the query can be enriched with contextual information (Shen et al., 2005).

The document representations that are semantically enriched demand different search structures than the ones traditionally used. The latter structures are composed of a dictionary or vocabulary of words to which document identifiers are attached.


7.5.1 Vector Space Model

In the vector space retrieval model (Salton, 1989), documents and queries are represented as vectors in a p-dimensional space:

D_j = [w_{j1}, w_{j2}, ..., w_{jp}]^T    (7.1)

Q = [w_1, w_2, ..., w_p]    (7.2)

where p = the number of features measured.

The features w_i commonly represent the terms of the vocabulary by which the documents in the collection are indexed (i.e., the distinct index terms), and the values of the features are the term weights in the respective document and query. Term weights might be binary, indicating term presence or absence. In the vector space model the weights have a numerical value and indicate the importance of the terms in the document or query. For instance, weights are computed by a tf×idf weighting scheme, where the term weight is proportional to the number of times the term occurs in the considered query or document text (tf), possibly normalized by a factor that represents the length of the text, and where idf is a factor that is inversely proportional to the number of documents of a reference collection in which the term occurs.

Comparing document and query vectors is done by computing the similarity or distance between the vectors. The most common similarity functions are the inner product between term vectors and the cosine function, which computes the cosine of the angle between the vectors (see Eqs. (6.3) and (6.4) respectively). The result of the comparison is a ranking of the documents according to their similarity with the query. Although this model does not accurately represent queries and documents, because it adopts the simplifying assumption that terms are not correlated and term vectors are pair-wise orthogonal, the model is very popular in research and commercial retrieval systems.
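As a sketch of these computations, the tf×idf weighting and cosine ranking described above can be implemented as follows; the toy documents are invented for illustration:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse tf-idf vectors for a list of tokenized documents.

    tf is the raw term count in the document; idf is the inverse
    document frequency over the collection."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vecs, idf

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["information", "extraction", "labels"],
        ["information", "retrieval", "models"],
        ["syntax", "parsing"]]
vecs, idf = tfidf_vectors(docs)
query = Counter(["information", "retrieval"])
qvec = {t: c * idf.get(t, 0.0) for t, c in query.items()}
ranking = sorted(range(len(docs)), key=lambda j: cosine(qvec, vecs[j]), reverse=True)
print(ranking)  # [1, 0, 2]: the second document shares most weighted terms
```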

In this book we are concerned with adding semantic information to indices composed solely of words. One possible way is to expand the vectors with semantic attributes, i.e., describing the documents and queries in an ℜ^{l×k} vector space, where l is the number of entities or elements considered (e.g., words, phrases, passages) and k equals the number of semantic attributes by which each element can be classified. The weights of the semantic components can express the uncertainty of the assignment. Term vectors are already very large and sparse, the latter meaning that very few vector components actually receive a value larger than zero. Adding to


the vector representations additional semantic concepts by which certain terms are classified, or relationships between terms (represented as a semantic concept attached to a term), would make the vectors in the worst case k times larger compared to the classical term vectors, without even considering the combinations of words (e.g., passages, sentences) to which semantics can be assigned.1 Such a model becomes computationally quite complex. More importantly, the orthogonal properties of the vector space we are dealing with are not an appropriate representation of the correlations that exist between terms mutually, and between terms and semantic information.
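A sketch of how such an ℜ^{l×k} representation could be stored sparsely; the elements, labels and probabilities below are invented for illustration, and only nonzero (element, attribute) cells are kept:

```python
# Sparse encoding of the l x k space: each nonzero cell is a
# (text element, semantic attribute) pair with an uncertainty weight.
doc = {
    ("Barbara Walters", "PERSON"): 0.95,    # entity label with assignment confidence
    ("CNN", "ORGANIZATION"): 0.90,
    ("interviewed", "EVENT:interview"): 0.70,
}
query = {("Barbara Walters", "PERSON"): 1.0}

# An inner product over the shared cells; a dense representation would
# need l * k components, almost all of them zero.
score = sum(w * doc.get(cell, 0.0) for cell, w in query.items())
print(score)  # 0.95
```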

In addition, using probabilities as weights in a vector space model is not very appropriate. Computing the cosine as a measure of ranking assumes a Euclidean space, where differences in values are considered in the distance computations, and not the actual probabilities.

Notwithstanding these shortcomings, using semantic labeling in a vector space setting is not new. Wendlandt and Driscoll (1991) already implemented and tested a vector space model for document retrieval enriched with semantic information. Currently, there is a renewed interest in enhancing the vector space model with concepts defined in the frame of the Semantic Web (Castells, 2005).

The vector model for document retrieval that is enriched with semantic information can be transposed to passage and sentence retrieval and, for instance, be used in a question answering setting. Note that in the classic vector model a term is only considered once, even though it can occur multiple times in a document or other retrieval element. In a model where semantics are added to terms, a term can occur multiple times in a document, not necessarily reflecting the same semantic concept or participating in the same relation.

1 Instead of vectors in a high dimensional space, structured objects that integrate low level features (e.g., words) and the higher level semantic labels could be used as representations of query and document. Kernel functions (see Chap. 5) that rely on dynamic programming functions for similarity computations can be used to compute the similarity between query and document. This approach would not constitute a vector space retrieval model, but it shares a nearest neighbor search with the vector model.

7.5.2 Language Model

Documents represent a certain distribution of information content that is signaled by the distribution of words, but also by the distribution of semantic content elements that make up the information. In the language model we probabilistically model document content.

In recent years statistical language modeling has become a major retrieval modeling approach (Croft and Lafferty, 2003). Typically, a document is viewed as a model and a query as a string of text randomly sampled from this model. Most of the approaches rank the documents in the collection by the probability that the query Q is generated given a document D_j: P(Q | D_j), i.e., the probability that the query Q would be observed during repeated random sampling from the model of document D_j. In the language model the query is seen as a set of query terms that are assumed to be conditionally independent given the document, and thus the query probability can be represented as a product of the individual term probabilities:

P(q_1, ..., q_m | D_j) = ∏_{i=1}^{m} P(q_i | D_j)    (7.3)

where q_i is the ith query term in a query composed of m terms, and P(q_i | D_j) is specified by the document language model. Computing the probability that a query term appears in document D_j with Eq. (7.3) might yield a zero probability. So, a document model is usually chosen that allows for a smoothing of the probabilities. Often, the occurrence probability of a term in the corpus is used to smooth the document probabilities, yielding the following mixture model:

P(q_1, ..., q_m | D_j) = ∏_{i=1}^{m} (α P(q_i | D_j) + (1 − α) P(q_i | C))    (7.4)

where C is the collection of documents. The interpolation weight α is set empirically or learned from a training corpus with relevance judgments (e.g., with the Expectation Maximization algorithm). The probabilities are usually estimated by maximum likelihood estimation from the respective document or collection.

How can we adapt this model so that it incorporates the semantics attached to words, phrases, passages, etc., while keeping the flexible approach of a word based search? The queries can take many different formats, ranging from terms and semantic concepts to statements in natural language. In case the query terms are not found in a document, it is still possible that


one or more query terms match the semantic labels2 assigned to the document. The language model offers the possibility to incorporate the probabilities of the translation of a term into a concept:

P(c_1, ..., c_m | D_j) = ∏_{i=1}^{m} (α Σ_{l=1}^{L} P(c_i | w_l) P(w_l | D_j) + (1 − α) P(c_i | C))    (7.5)

where the term w_l of document D_j can express the concept c_i with probability P(c_i | w_l), and the sum ranges over the L distinct terms of D_j. When there are different terms in the document that lead to the same concept, their translation probabilities are summed.

If the query is a mixture of concepts and terms that are assumed to be independent, we could propose the following mixture model:

P(cq_1, ..., cq_m | D_j) = ∏_{i=1}^{m} (α Σ_{l=1}^{L} P(cq_i | w_l) P(w_l | D_j) + β P(cq_i | D_j) + (1 − α − β) P(cq_i | C))    (7.6)

where cq_i is a term or a concept. Such a model allows giving a different weight to query terms that literally occur in the documents and to query terms that are obtained by processing the document with information extraction techniques. For instance, when the query contains the name Barbara Walters, the model can give one weight to exact mentions of this person and another weight to resolved coreferents that refer to her, while naturally modeling the probability of each resolution.
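A minimal sketch of the mixture of Eq. (7.6), assuming toy documents, maximum likelihood estimates, and an invented translation probability for a resolved coreferent:

```python
# P(cq_i | w_l): probability that term w_l expresses the query item cq_i,
# e.g., output of a hypothetical coreference resolver (toy value).
translate = {("she", "barbara walters"): 0.8}

def p_term(term, doc):
    """Maximum likelihood P(term | D_j)."""
    return doc.count(term) / len(doc)

def p_query(query, doc, collection, alpha=0.5, beta=0.3):
    """Eq. (7.6): product over query items of a three-way mixture of the
    translation model, the literal document model, and the collection model."""
    score = 1.0
    for cq in query:
        trans = sum(translate.get((w, cq), 0.0) * p_term(w, doc) for w in set(doc))
        coll = sum(d.count(cq) for d in collection) / sum(len(d) for d in collection)
        score *= alpha * trans + beta * p_term(cq, doc) + (1 - alpha - beta) * coll
    return score

d1 = ["barbara walters", "said", "that", "she", "left"]
d2 = ["the", "anchor", "left", "the", "studio"]
collection = [d1, d2]
# d1 outranks d2: it contains both the literal mention and a resolved coreferent
print(p_query(["barbara walters"], d1, collection) >
      p_query(["barbara walters"], d2, collection))  # True
```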

One can design many different language models that probabilistically model document content based on the words of the documents and the semantic labels assigned. For instance, Cao et al. (2005) integrate word relationships obtained from WordNet into the language model. When a model combines different submodels into a mixture model, the difficulty is finding correct interpolation weights. Given sufficient relevance judgments, these weights could be learned from relevant and non-relevant documents.

The model has the advantage that we can also rank sentences, passages or other retrieval elements (e.g., elements structured a priori with XML labels). This is done by considering D_j in the above equations as the appropriate retrieval element.

2 When query and documents are represented by a set of concepts, we can compute relevance based on concept matching by using Eq. (7.4) and replacing a query term q_i by a concept term c_i.


The model has some handicaps. First, we make a simplifying assumption, i.e., that a semantic concept is assigned depending on a single term. We can of course compute the probability of concept assignment conditioned on different terms, which can be the case in information extraction. This would mean that we build explicit models for each document, as is done in the inference network model discussed in the next section.

Moreover, when the query is in the form of a natural language statement, semantic concepts can also be assigned to the terms of the query. Using here a query representation composed of terms and their correlated concepts violates the independence assumption when computing, with Eq. (7.3) and its variants, the probability that the document generates the query.

An alternative approach is associating a language model with both the document (or document passage or sentence) and the query, and having a method to estimate the distance between the two language models. The Kullback-Leibler divergence or relative entropy is a natural measure of divergence between two models p(x) and q(x) and is defined as:

H(p || q) = Σ_x p(x) log (p(x) / q(x))    (7.7)

Let θ_Q and θ_{D_j} be the language models of query Q and document D_j respectively, estimated according to one of the models described above; documents will then be ranked by −H(θ_Q || θ_{D_j}). The cross-entropy is also cited as a metric for the divergence of two language models:

H(p || q) = − Σ_x p(x) log q(x)    (7.8)

In this case documents are ranked by increasing cross-entropy. Entropy based rankings can be useful in case queries have the form of natural language statements such as exemplary texts or questions. Such models require that accurate models for estimating the probability distributions of both the query and the document are built. In practical applications, the KL-divergence has the additional problem that the probability of the document model generating a term might be zero, which demands an appropriate smoothing method.
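A sketch of entropy based ranking with Eq. (7.7); the smoothing scheme (interpolation with the collection model, which avoids the zero-probability problem just mentioned) and the toy texts are assumptions for illustration:

```python
import math

def unigram_model(text, collection_model, lam=0.9, vocab=None):
    """Smoothed unigram language model: interpolation of the maximum
    likelihood estimate with the collection model."""
    vocab = vocab or set(collection_model)
    counts = {t: text.count(t) for t in vocab}
    total = sum(counts.values())
    return {t: lam * (counts[t] / total if total else 0.0)
               + (1 - lam) * collection_model[t]
            for t in vocab}

def kl(p, q):
    """Kullback-Leibler divergence H(p || q) of Eq. (7.7)."""
    return sum(pw * math.log(pw / q[t]) for t, pw in p.items() if pw > 0)

docs = [["information", "extraction", "improves", "retrieval"],
        ["weather", "report", "for", "leuven"]]
tokens = [t for d in docs for t in d]
coll = {t: tokens.count(t) / len(tokens) for t in set(tokens)}

query_lm = unigram_model(["information", "retrieval"], coll)
doc_lms = [unigram_model(d, coll) for d in docs]
# rank by -KL divergence, i.e., by increasing divergence from the query model
ranking = sorted(range(len(docs)), key=lambda j: kl(query_lm, doc_lms[j]))
print(ranking)  # [0, 1]
```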


7.5.3 Inference Network Model

Information extraction allows us to attach additional information to the content of documents and queries. This information is often uncertain and, as shown in the foregoing models, it can be used as extra evidence in the ranking of the documents or retrieval elements. One intuitive way to model different sources of evidence and to infer conclusions from them is by modeling the information in a network and reasoning with the information.

A famous retrieval model is the inference network model (Turtle and Croft, 1992). An inference network is a directed, acyclic dependency graph (DAG) in which nodes represent propositional (binary) variables or constants, and edges represent dependence relations between propositions. An inference network can be defined as a directed graph G = (V, E) where V consists of l nodes and E of the edges between the nodes. The directed edge (p, q) ∈ E indicates the presence of a link from p to q.

In the inference network there is an edge between p and q if the proposition represented by node p causes the proposition represented by node q. Thus the dependency of a node q on the values of a set of parents π_q = {p_1, ..., p_k} can be modeled, i.e., P(q | π_q). Given a set of prior probabilities for the roots of the network, the network can be used to infer the conditional probability at each of its nodes.

In information retrieval the inference network is traditionally used to infer the relevance of a document for the query. The inference network consists of two components, a document network and a query network. The document network represents the document collection. It is built once for a given collection. The query network represents the information need and is dynamically built at the time of querying the collection. The attachment of both networks is performed during retrieval after the query network is built. The attachment joins the corresponding concepts or terms of query and documents. There are different ways to compute the weight or the conditional probability of the attachment (which we will illustrate further).

In information retrieval documents and queries are usually represented as a set of terms, i.e., terms form the nodes of the network. A word in the document or query text can also be represented as a subject concept, allowing for an extra matching on the level of concepts. In this way additional knowledge obtained from lexical resources such as thesauri, or subject categories assigned with the help of trained classifiers, can be incorporated in the model.

In order to rank a document D_j according to its relevance to the query, we attach evidence to the network asserting that D_j = true (= 1) and setting


evidence for all the other documents in the collection to false. The probability that the information need is met given that D_j has been observed in the collection is computed by means of the conditional probabilities of each node in the network, given its parents. A document can receive a prior probability, reflecting the evidence that some documents are a priori estimated to be more relevant than others. The probability of relevance given a certain document is computed as the propagation of the probabilities from the document node D_j to the query node Q. Doing this for all documents, we can compute the probability that the information need is met, given each document in the collection, and rank the documents accordingly. For all non-root nodes in the network, we must estimate the probability that a node takes on a value, given any set of values for its parent nodes. If a node q has a set of k parents π_q = {p_1, ..., p_k}, we estimate P(q | π_q). Several evidence combination methods for computing the conditional probability at a node given the parents are possible. This conditional probability of a parent can be modeled as a Boolean variable, flagging the activation of a word or concept in the network. Alternatively, parent nodes can receive a weight, which - when chosen between zero and one - might reflect the probability of their activation. One of the ways of computing the conditional probability at a node is by computing the weighted sum of the evidence provided by the parent nodes.

The inference network model offers a natural framework for modeling the typical bag-of-words representation augmented with semantic labels. One can naturally model the probability of the extraction of the information, even for labels that are modeled conditioned on information occurring in different documents. The semantics of the query can also be modeled in a natural way. The (possibly) uncertain representations of query and document can be matched taking into account common words and semantic concepts, the latter, for instance, referring to relations or attributes assigned to query or document content.

The inference network model has the additional advantage that the conditional probability of the strength of the attachment of a complex query structure to the document network can be seen as a subgraph matching problem and computed with dynamic programming techniques such as, for instance, the edit distance of two trees (Graves and Lalmas, 2002). In this case the concepts in document and query have additional specifications and are joined based on matching the trees that represent the specifications of the concept. The edit distance (cf. Eq. (4.2)) computes the number of insertions, deletions and amendments that would be necessary to translate the document tree into the query tree constraint.
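The weighted sum evidence combination described above can be sketched on an invented three-node fragment; the topology, weights and extraction probabilities are illustrative only:

```python
def weighted_sum(parent_beliefs, weights):
    """One evidence combination rule: the belief at a node is the
    weighted sum of its parents' beliefs; weights sum to one."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(b * w for b, w in zip(parent_beliefs, weights))

# Document node D_j asserted true; a term node and a concept node carry
# the (uncertain) extraction probabilities; the query node combines them.
d_j = 1.0
term = weighted_sum([d_j], [1.0]) * 0.9          # term indexed with weight 0.9
concept = weighted_sum([term], [1.0]) * 0.95     # P(semantic label | term) = 0.95
query = weighted_sum([term, concept], [0.6, 0.4])
print(round(query, 4))  # 0.882
```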


Fig. 7.1. An example of an inference network. r_i and c_i represent semantic labels assigned to query and sentence terms respectively. Different combinations of sentences can be activated and their relevance can be computed.

In summary, the inference network model offers a very natural way of modeling the probabilities of the extraction in a retrieval setting, while keeping the flexibility of an information search and the computation of the ranking of information. However, the model becomes more complex than a classical word based inference network, considering that many nodes will be added to the network and that terms, depending on their position in the discourse, might receive a different label, so that they have to be represented separately. So, a general disadvantage of the model is the computational expense when computing relevance in large document collections where documents are represented as complex graphs which model the rich semantics attached to words. However, the power of this retrieval model resides in considering candidate sentences or passages, retrieved with a simple keyword search from the document collection, and representing these sentences in an inference network which is attached to the query network. The resulting network will usually be manageable in terms of computational complexity. In such a framework, it is also possible to activate not just one candidate sentence or passage when computing relevance, but to consider a number of combinations of sentences to be active and to compute the relevance of the set. One can compute the evidence that two or more texts D_j or sentences S_j together are relevant for the query (see Fig. 7.1). The terms of the sentences can be linked to different concepts, which,


for instance, represent coreferring entities or events. The sentences can be extracted from different documents. Considering all possible subsets of documents (sentences) as an answer to the question is computationally not feasible. The inference network model has already proven its usefulness in multimedia information retrieval, where one is interested in the selection of media documents (e.g., a video) based both on the content and the context of the information, the latter referring to metadata such as media format, producer, genre, etc. (e.g., Graves and Lalmas, 2002, who define contextual nodes based on the MPEG-7 standard).

7.5.4 Logic Based Model

The logic based retrieval model (van Rijsbergen, 1986; Lalmas, 1998) assumes that queries and documents can be represented by logical formulas. Retrieval then is inferring the relevance of a document or of another information element to a query. The relevance of a document is deduced by applying inference rules. Formally, one can say: given a query Q and a document D_j, D_j is relevant to Q if D_j logically implies Q: D_j -> Q. Logic based retrieval models are not very common, except for the logic based Boolean model, where queries are composed of key terms connected with Boolean operators and relevant documents satisfy the logical conditions imposed by the query.

In question answering a question in natural language is posed to the document collection and the answer to the question is retrieved from the collection. Usually, a distinction is made between domain specific question answering and open domain question answering. Domain specific or closed domain question answering traditionally analyzes the question and translates it to a structured question that queries a database with structured information about the domain. One of the tracks in the Text REtrieval Conference (TREC) is open domain question answering. Most of the current technology adheres to the following procedure for finding the answer to the question. Based on the words of the query, sentences that might contain the answer are retrieved from the document collection. The type of question and corresponding type of answer are determined (e.g., the question demands the name of a person). The retrieved sentences are ranked according to their matching with the question and the matching of the type of question and type of answer. From each sentence that contains a matching type of answer, the answer is extracted. Answers can be additionally ranked by their frequency of occurrence in the document base. This paradigm works quite well in the case of factual questions. Finding the answers


to questions sometimes requires reasoning with information and the fusion of information. At present such techniques are being researched (Moldovan et al., 2003a).
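The factoid procedure sketched above can be illustrated as follows; the answer-type mapping, the sentences and their entity annotations are invented for illustration:

```python
import re

# A minimal sketch of the factoid QA pipeline: determine the answer type,
# rank candidate sentences by term overlap, extract a matching entity.
ANSWER_TYPES = {"who": "PERSON", "when": "DATE", "where": "LOCATION"}

sentences = [
    ("Barbara Walters interviewed the president.", {"PERSON": "Barbara Walters"}),
    ("The interview took place in 1999.", {"DATE": "1999"}),
]

def answer(question):
    qword = question.lower().split()[0]
    atype = ANSWER_TYPES.get(qword)                     # 1. expected answer type
    qterms = set(re.findall(r"\w+", question.lower())) - {qword}
    best, best_score = None, 0
    for text, entities in sentences:                    # 2. rank candidates
        overlap = len(qterms & set(re.findall(r"\w+", text.lower())))
        if atype in entities and overlap > best_score:  # 3. type must match
            best, best_score = entities[atype], overlap # 4. extract the answer
    return best

print(answer("Who interviewed the president?"))  # Barbara Walters
```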

Techniques of information extraction are indispensable in order to attach semantic meanings to questions and documents. In addition, reasoning techniques are needed for the resolution of complex questions that fuse information from different texts.

Information extraction technologies allow representing content in first-order predicate logic. The semantic labels form the predicates; the extracted texts make up the arguments. When document sentences and the question are represented in predicate logic, theorem provers can be used to infer the relevance of the extracted information. The theorem provers reason with the predicates, whereby predicates extracted from different sentences from one or different documents can match. Moldovan et al. (2003b) demonstrate the generation of representations in predicate logic from the question and all document sentences, and prove the relevancy of a sentence to the question. Similar work has been done by Blackburn and Bos (2005), who translate textual content into predicate logic and infer the relevance of document content when answering a natural language question.
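A minimal sketch of matching predicate-argument representations; the predicates are invented, and the matcher only unifies a single predicate against extracted facts rather than performing full theorem proving:

```python
# Sentences and the question as predicate-argument tuples, as a stand-in
# for the output of an actual semantic parser.
facts = [
    ("interview", "barbara_walters", "president"),
    ("located_in", "studio", "new_york"),
]
question = ("interview", "?x", "president")  # who interviewed the president?

def match(query, fact):
    """Unify a query predicate with a fact; '?'-prefixed terms are variables."""
    if len(query) != len(fact) or query[0] != fact[0]:
        return None
    binding = {}
    for q, f in zip(query[1:], fact[1:]):
        if q.startswith("?"):
            binding[q] = f
        elif q != f:
            return None
    return binding

answers = [b for fact in facts if (b := match(question, fact)) is not None]
print(answers)  # [{'?x': 'barbara_walters'}]
```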

These approaches do not (yet) capture the uncertainty of the extracted information. We do not have many examples of logic based models that cope with uncertainty. A logic for probabilistic inference is introduced with the notion of uncertain implication: D_j logically implies Q with certainty P. The evaluation of the uncertainty function P is related to the amount of semantic information which is needed to prove D_j -> Q. Ranking according to relevance then depends upon the number of transformations necessary to obtain the matching and the credibility of the transformations. To represent uncertain implications and to reason with them, modal logic is sometimes used (van Rijsbergen, 1986; Nie, 1992). For instance, when a matching between query and text representations is not successful, the text representation is transformed in order to satisfy other possible interpretations (cf. the possible worlds of modal logic) that might match the query.

In a model where documents are represented both with words and "uncertain" semantic labels, the document's words are translated into meaning units that in their turn might be used to attach other meaning units to the texts. These units might imply the query.

7.6 Data Structures

It is common practice to build auxiliary data structures that contain all the necessary information for the retrieval. Of course it is possible to process


each document on the fly by sequential or online searching. In such a case the retrieval system finds the patterns given by the query by sequentially processing all documents. In a brute force approach a sliding window will consider each character of the document text in the match with the query pattern. Although there are algorithms that reduce the number of comparisons made, such as the Knuth-Morris-Pratt, Boyer-Moore and Shift-Or algorithms (Baeza-Yates and Ribeiro-Neto, 1999), searching large databases in this way is an impossible task.

Instead of a sequential search, almost all information retrieval systems use auxiliary data structures that hold the information and that can be efficiently searched, or use these data structures in combination with an online search. The data structures, which are also called indices or indexes, are updated at regular intervals in case of a document collection that changes dynamically.

The most commonly used data structures for storing bag-of-words representations are inverted files. The inverted file or index (Salton, 1989) is built in the following way. Given a set of documents, each document is assigned a list of key terms. A key term can be a unique (stemmed) word or phrase that occurs in the document collection (stopwords might be excluded) or an assigned descriptor. The inverted file or index is a sorted list of key terms where each key term has pointers to the documents in which the key term occurs (e.g., document id, URL). In case of a full inverted index, the word position (or the position of the first character of a term) is also stored. In addition, term weights (e.g., the frequency of the term in the document) might be added. So for each term w we have a set of postings:

{ [d_id, f_{D_j,w}, o_1, ..., o_{f_{D_j,w}}], ... }    (7.9)

where

d_id = identifier of a document D_j containing term w
f_{D_j,w} = frequency of w in D_j
o = positions in D_j at which w is observed.

Such a model is very much term centered. This is logical because the common retrieval models only match the terms of query and documents, so it is very important that the retrieval system efficiently finds the documents in which a query term occurs.
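Building such a full inverted index with positional postings, following Eq. (7.9), can be sketched as follows; the toy documents are invented:

```python
from collections import defaultdict

def build_index(docs):
    """Full inverted index: for each term a list of postings
    [doc_id, frequency, positions], as in Eq. (7.9)."""
    index = defaultdict(list)
    for did, doc in enumerate(docs):
        positions = defaultdict(list)
        for o, term in enumerate(doc):
            positions[term].append(o)
        for term, offs in positions.items():
            index[term].append([did, len(offs), offs])
    return index

docs = [["information", "extraction", "and", "information", "retrieval"],
        ["data", "structures", "for", "retrieval"]]
index = build_index(docs)
print(index["information"])  # [[0, 2, [0, 3]]]
print(index["retrieval"])    # [[0, 1, [4]], [1, 1, [3]]]
```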


If we add semantic information to the content of a document, we still want to search the terms of a document, and consider in the ranking the additional semantic labels that are assigned to certain terms, to a sentence or passage, or to a set of terms or passages distributed within a document or over different documents. This demands complementary types of data structures.

In the following section we give suggestions on what types of data structures could be useful in a retrieval setting where the semantic information is attached to content.

Many of the ideas presented here are borrowed from XML retrieval systems. There are two considerations to be taken into account. From an effectiveness point of view, one should decide which type of information to store (data and semantic labels), depending on the kind of operations to be performed on the data. Secondly, from an efficiency point of view, one has to choose how to store the information in order to execute these operations in the fastest possible way (e.g., the level of fragmentation of the information, the definition of indices, replication of information). For the former we would like to be as generic as possible; for the latter different schemas are possible that balance storage overhead with computational complexity at the time of querying.

The popular inverted file model is inspired by the desire to efficiently find the retrieval elements that are relevant for the query. In a classical retrieval system the retrieval elements are documents. In more advanced retrieval systems, the element can be a phrase, a sentence, a passage or a combination hereof, i.e., the minimum of content that completely answers the information need. As a result, these retrieval elements should be defined and should be accessible. In the popular inverted file model, the keys that are searched are the terms. In our model, terms and sets of terms are augmented with labels. This is also the case when indexing XML tagged documents.

In the text region model (de Vries et al., 2003) that is used as an indexing scheme in XML information retrieval, the XML document is viewed as a sequence of tokens. Tokens are opening and closing tags as well as preprocessed text content. The preprocessing might regard form normalizations (e.g., abbreviation resolution), stemming and stopword removal. Each component is a text region, i.e., a contiguous subsequence of the document. XML text regions are represented by triplets {ti, si, ei}, where ti is the XML tag of region i, and si and ei represent the start and end positions of XML region i. Region identifiers also contain the identifier of the document to which a region belongs. The physical representation of this indexing scheme consists of two large tables: the node index R stores the set of all XML region triplets, and the word index W stores all the index terms. In this model the word index is separated from the node index, although an individual word can also be considered as a very small region. The choice for two separate indices is here motivated by efficiency requirements: for a word region only the start position needs to be stored.
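Under these definitions, building R and W from a token sequence can be sketched as follows (the tokenizer is a deliberately naive assumption; a real system would use an XML parser and the preprocessing described above):

```python
import re

def build_region_indices(xml_text):
    """Build the node index R ({tag, start, end}) and word index W."""
    tokens = re.findall(r"</?\w+>|\w+", xml_text.lower())
    R, W, stack = [], [], []
    for pos, tok in enumerate(tokens):
        if tok.startswith("</"):                  # closing tag ends a region
            tag, start = stack.pop()
            R.append({"tag": tag, "start": start, "end": pos})
        elif tok.startswith("<"):                 # opening tag starts a region
            stack.append((tok.strip("</>"), pos))
        else:                                     # word: only its start counts
            W.append((tok, pos))
    return R, W

R, W = build_region_indices("<doc><title>xml retrieval</title></doc>")
# R records the title region spanning positions 1-4 and the doc region 0-5
```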

In a representation that captures the semantics of a text, single words and text elements (e.g., phrases, sentences, passages) are represented as text regions that are semantically labeled. However, this model does not adhere to an XML syntax, as overlapping text regions (e.g., regions with different semantic labels) are not necessarily nested. Nevertheless, the text region model is suited to represent our bed-of-words representation in which a word or set of consecutive words is covered with a semantic layer. Many semantic layers, each referring to different aspects of the content, can cover the words.

Such a model would allow storing named entity classifications, resolved coreferents, relations between entities, and passage classifications within and across documents. The semantic classifications can also link information across documents (e.g., through the identification of resolved coreferents). This model stores the text regions (here semantically classified) as a node index R containing the triplets {ti, si, ei} and the words in the word index W. Alternatively, the node index can contain quadruplets {ti, si, ei, pi}, where pi refers to the probability with which the semantic label ti is assigned. Note that in this model a single word can also form a text region when it is semantically classified (e.g., as a named entity). The same region can have several semantic labels.
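For instance, finding which semantic layers cover a given word position reduces to a containment test on the quadruplets (the field names below are illustrative assumptions):

```python
def labels_at(node_index, position, min_prob=0.0):
    """Return (tag, prob) pairs of all regions covering a word position."""
    return [(r["tag"], r["prob"]) for r in node_index
            if r["start"] <= position <= r["end"] and r["prob"] >= min_prob]

# Overlapping, non-nested layers are allowed, unlike in strict XML:
node_index = [
    {"tag": "PERSON",    "start": 0, "end": 0, "prob": 0.95},
    {"tag": "STATEMENT", "start": 0, "end": 5, "prob": 0.70},
]
layers = labels_at(node_index, 0)  # both layers cover position 0
```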

This indexing scheme has several advantages. As is shown with XML tagged documents, a pure full text search as well as structured queries (e.g., by using a variation of the SQL language) can be sustained. As in an XML retrieval model, whole documents, sections, passages and sentences can be ranked according to relevance to the query by considering the regions and the words that are contained in them. As we will see further in Chap. 10, information can be synthesized from different documents or document parts, the text regions being linked through the coreferent entities or events. The indexing model can also be transposed to multimedia documents. For instance, one could define an image region and label it semantically, while keeping in a separate index also the positions of low-level image features, which might be used in a query by example.

In the original text region model as defined by de Vries et al. (2003), an XML tagged text is regarded as a sequence of tokens including the tokens of the start and end tags. Both the words of the texts and the start and end tags receive a position number. Because in our text region model regions are not necessarily nested, we would prefer not to have the text annotated with tags, but only have the assigned regions and their specifications (i.e., positions) stored in the node index.

This is a very simple but powerful model that extends a bag-of-words approach and allows performing the computations for the retrieval models described in this chapter. It also permits describing content from many different angles and taking these descriptions into account in the retrieval.

Alternative indexing schemes for XML documents have been implemented. They are often grounded in the tree-like representation of a document. An example is the binary relation approach (Schmidt, 2002), in which the parent-child relations of the XML elements, the relations between nodes and their attributes, and the relations between nodes and positions are represented as binary relations. Our model allows for hierarchical relations between nodes and for explicitly representing these relations in a separate indexing table with binary parent-child relationships. In this way tree-structured document formats such as XML, HTML, MPEG, etc. can be combined with the semantic index.

The indexing scheme that we propose is inspired by the retrieval of information, i.e., the retrieval of textual documents, passages, sentences or phrases. In information synthesis (see Chap. 10) we need to combine information from different sources considering certain constraints. The model as presented up till now allows indexing the equivalence of coreferent relations between content. In advanced models we also need to represent attributes of nodes or arguments of nodes. The arguments themselves are nodes and are defined by their label and position. So, we can extend the node index to a quintuple {ti, si, ei, pi, li}, where li refers to the list of arguments. An argument lj is composed of a node label tj, where tj must refer to an existing node. Such an extended scheme allows reasoning with the propositions and the predicate structures provided by the index. In this representation circularity in node references should be checked.
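The circularity check mentioned above amounts to cycle detection over the argument references; a depth-first sketch (assuming, as the scheme requires, that every argument id refers to an existing node):

```python
def has_circular_reference(nodes):
    """nodes: dict mapping a node id to the ids of its argument nodes."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}

    def visit(n):
        color[n] = GREY                      # node is on the current path
        for arg in nodes[n]:
            if color[arg] == GREY:           # back edge: circular reference
                return True
            if color[arg] == WHITE and visit(arg):
                return True
        color[n] = BLACK                     # node and its arguments are safe
        return False

    return any(color[n] == WHITE and visit(n) for n in nodes)

acyclic = {"n1": ["n2"], "n2": ["n3"], "n3": []}
cyclic = {"n1": ["n2"], "n2": ["n1"]}
```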

Efficiency considerations at the time of querying may demand additional (redundant) indexing structures that contain intermediate computations (e.g., for highly probable queries). Indices are usually stored on the servers of search engines. Rich case representations need a large storage space. The representations can be divided into essential indices, which will be searched in main memory, and secondary index attributes, which can be stored in secondary memory and be put in main memory in case of more sophisticated queries in the form of questions or natural language statements. Secondary indices are currently researched in multimedia information retrieval (e.g., geographic/cartographic applications).



7.7 Conclusions

The idea persists very strongly that information extraction translates free text into templates that are stored in relational databases. Restricting information retrieval to the querying of the templates – as could be done by a deterministic database search – severely reduces the power of a retrieval system. In this chapter we have shown that information extraction results can be incorporated in classical retrieval models and especially in the probabilistic models such as the language model and the inference network model. We have demonstrated that the different layers of semantic understanding that we attach to documents can be incorporated in these models, without losing the flexibility of information searches with which we are acquainted in the current popular full text searches. The retrieval models are computationally more expensive than the traditional word based models, but this could be compensated by first selecting candidate relevant sentences and then computing their relevance based on the retrieval models that incorporate semantic information. Such an approach is in line with finding to-the-point text elements in document bases that accurately answer the information queries. Powerful indexing structures can be designed and implemented that encompass the classical inverted file information and the additional semantics attached to the documents.

7.8 Bibliography

Baeza-Yates, Ricardo and Berthier Ribeiro-Neto (1999). Modern Information Retrieval. Harlow, UK: Addison-Wesley.

Barzilay, Regina and Kathy McKeown (2002). Extracting paraphrases from a parallel corpus. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (pp. 50-57). East Stroudsburg, PA: ACL.

Barzilay, Regina and Lillian Lee (2003). Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of HLT-NAACL 2003 (pp. 16-23). East Stroudsburg, PA: ACL.

Blackburn, Patrick and Johan Bos (2005). Representation and Inference for Natural Language. CSLI Publications.

Blair, David C. (2002a). The challenge of commercial document retrieval. Part I: Major issues, and a framework based on search exhaustivity, determinacy of representation and document collection size. Information Processing and Management, 38, 273-291.

Blair, David C. and Steven O. Kimbrough (2002b). Exemplary documents: A foundation for information retrieval design. Information Processing and Management, 38, 363-379.


Cao, Guihong, Jian-Yun Nie and Jing Bai (2005). Integrating word relationships into language models. In Proceedings of the Twenty-Eighth Annual International Conference on Research and Development in Information Retrieval (pp. 298-305). New York: ACM.

Carbonell, Jaime G. (1986). Derivational analogy: A theory of reconstructive problem solving and expertise acquisition. In Ryszard S. Michalski, Jaime G. Carbonell and Tom M. Mitchell (Eds.), Machine Learning 2 (pp. 371-392). San Francisco, CA: Morgan Kaufmann.

Castells, Pablo (2005). An ontology based information retrieval model. In Proceedings of the 2nd European Semantic Web Conference (Lecture Notes in Computer Science). Berlin: Springer.

Cohen, William W., Einat Minkov and Anthony Tomasic (2005). Learning to understand website update requests. In Proceedings of the World Wide Web 2005 Conference. New York: ACM.

Croft, W. Bruce and John Lafferty (2003). Language Modeling for Information Retrieval. Boston, MA: Kluwer Academic Publishers.

Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer and Richard Harshman (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41 (6), 391-407.

De Vries, Arjan P., Johan A. List and Henk Ernst Blok (2003). The Multi-model DBMS architecture and XML information retrieval. In H.M. Blanken, T. Grabs, H.-J. Schek, R. Schenkel and G. Weikum (Eds.), Intelligent Search on XML Data (Lecture Notes in Computer Science 2818) (pp. 179-192). Berlin: Springer.

Fuhr, Norbert, Kai Großjohann and Sascha Kriewel (2003). A query language and user interface for XML information retrieval. In Henk Blanken et al. (Eds.), Intelligent Search on XML Data (pp. 59-75). Berlin: Springer.

Graves, Andrew and Mounia Lalmas (2002). Video retrieval using an MPEG-7 based inference network. In Proceedings of the Twenty-Fifth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 339-346). New York: ACM.

Grishman, Ralph, Silja Huttunen and Roman Yangarber (2002). Information extraction for enhanced access to disease outbreak reports. Journal of Biomedical Informatics, 35, 236-246.

Harris, Zellig (1959). Linguistic transformations for information retrieval. In Proceedings of the International Conference on Scientific Information 2. Washington, DC: NAS-NRC.

Hiemstra, Djoerd (2003). Statistical language models for intelligent XML retrieval. In Henk Blanken et al. (Eds.), Intelligent Search on XML Data (pp. 107-118). Berlin: Springer.

Kolodner, Janet (1993). Case-Based Reasoning. San Mateo, CA: Morgan Kaufmann.

Lalmas, Mounia (1998). Logical models in information retrieval: Introduction and overview. Information Processing and Management, 34 (1), 19-34.


Blanken, Henk M., Torsten Grabs, Hans-Jörg Schek, Ralf Schenkel and Gerhard Weikum (2003). Intelligent Search on XML Data, Applications, Languages, Models, Implementations and Benchmarks. New York: Springer.


Lewis, David D., W. Bruce Croft and Nehru Bhandaru (1989). Language-oriented information retrieval. International Journal of Intelligent Systems, 4, 285-318.

Lewis, David D. and Karen Sparck Jones (1996). Natural language processing for information retrieval. Communications of the ACM, 39 (1), 92-101.

Moldovan, Dan, Christine Clark, Sanda Harabagiu and Steve Maiorana (2003a). COGEX: A logic prover for question answering. In Proceedings of the Human Language Technology and North American Chapter of Association of Computational Linguistics 2003 (pp. 166-172). East Stroudsburg, PA: ACL.

Moldovan, Dan, M. Pasca and Sanda Harabagiu (2003b). Performance issues and error analysis in an open domain question answering system. ACM Transactions on Information Systems, 21, 133-154.

Nie, Jian-Yun (1992). An information retrieval model based on modal logic. Information Processing and Management, 25 (5), 477-494.

Salton, Gerard (1989). Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Reading, MA: Addison-Wesley.

Schmidt, Albrecht (2002). Processing XML in Database Systems. PhD thesis: University of Amsterdam.

Shen, Xuehua, Bin Tan and ChengXiang Zhai (2005). Context-sensitive information retrieval using implicit feedback. In Proceedings of the Twenty-Eighth Annual International Conference on Research and Development in Information Retrieval (pp. 43-50). New York: ACM.

Smeaton, Alan F. (1992). Progress in the application of natural language processing. The Computer Journal, 35 (3), 268-278.

Song Jun-feng, Zhang Wei-ming, Xiao Wei-dong, Li Guo-hui and Xu Zhen-ning (2005). Ontology based information retrieval model for the Semantic Web. In Proceedings of the 2005 International Conference on e-Technology, e-Commerce and e-Service (EEE'05).

Turtle, Howard R. and W. Bruce Croft (1992). A comparison of retrieval models. The Computer Journal, 35 (3), 279-290.

Van Rijsbergen, Cornelis J. (1986). A non-classical logic for information retrieval. The Computer Journal, 29, 111-134.

Wendlandt, Edgar B. and James Driscoll (1991). Incorporating semantic analysis into a document retrieval strategy. In A. Bookstein, Y. Chiaramella, G. Salton and V.V. Raghaven (Eds.), Proceedings of the Fourteenth Annual International ACM SIGIR Conference (pp. 270-279). New York: ACM.

Winston, Patrick H. (1982). Learning new principles from precedents and exercises. Artificial Intelligence, 19 (3), 321-350.

Xu, Jinxi and W. Bruce Croft (1996). Query expansion using local and global document analysis. In Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 4-11). New York, NY: ACM.


8 Evaluation of Information Extraction Technologies

8.1 Introduction

If we build technologies, we would like to evaluate our systems in order to see how they behave with regard to a golden standard and how they compare to other existing technologies for the same task. This is also true for information extraction. Depending on the application, certain performances are measured. For instance, in some cases high precision results are of primordial importance, when the extraction results are not manually controlled, while in other cases, where the machine extraction only performs an initial filtering of the information that eventually is manually selected, a high recall of the extraction is important. High precision means that the extracted information does not contain any or only very few errors. High recall refers to the situation where all or almost all information to be extracted is actually extracted. An example of the former is extracting the price of an air flight from the World Wide Web. An example of the latter is intelligence gathering, where the analyst wants to find as much as possible valid information on the locations of a certain crime, which afterwards will be manually processed in combination with other evidence. In some tasks errors do not weigh equally, as some errors are perceived as more severe with regard to the use or further processing of the extracted information. This is, for instance, the case when similar phenomena are grouped and the correctness of the clustering is measured (e.g., in coreference resolution). It is not always easy to define appropriate evaluation measures, especially not in natural language processing tasks. In addition, a different weighting of certain types of errors introduces an element of subjectivity and context-dependency into the evaluation process. Evaluation is sometimes a complicated and controversial issue. Many of the metrics were already defined during the Message Understanding Conferences (MUC) in the 1990s. The MUC scoring program and criteria were an important first step in confronting this problem.


The Automatic Content Extraction (ACE) competition currently develops its own metrics. In the course of the last decades information retrieval has developed several evaluation metrics in the framework of the Text REtrieval Conferences (TREC) (van Rijsbergen, 1979; Baeza-Yates and Ribeiro-Neto, 1999; Voorhees and Harman, 2005).

Information extraction is usually not a final goal, but assists in other tasks such as information retrieval, summarization or data mining. Many evaluation measures aim at an intrinsic evaluation, i.e., the performance of the extraction task itself is measured. It might be valuable to perform an extrinsic evaluation, i.e., measuring the performance of another task in which information extraction is an integral part. In this book the extrinsic evaluation measures focus on measuring the performance of information retrieval in which extraction plays a role.

Most of the evaluation criteria that we discuss are qualitative criteria: they measure the quality of the results. Accuracy is here of most importance, but other measures such as recall and precision cannot be neglected. Once information extraction is applied in information retrieval tasks, and large document collections are consulted, a high precision is often important. In other situations one can be confronted with incomplete and imperfect relevance information. For these situations specific evaluation metrics are designed. Besides the quality of the results, other performance measures that are important in any text processing task, and application specific measures, come into play.

8.2 Intrinsic Evaluation of Information Extraction

A first group of evaluation measures concerns the intrinsic evaluation of the results of an information extraction task by comparison with some golden standard (Sparck Jones and Galliers, 1996, p. 19 ff.). Information extraction is a classification task. The assigned classes can be compared with the ideal class assignment, which is usually determined by a human expert.

In many information extraction tasks classes can be objectively assigned and there is seldom a discussion about which classes to assign (e.g., named entity recognition, noun phrase coreference resolution). However, there are tasks for which the assignment is less clear cut (e.g., certain semantic roles and classification of modifiers of nouns). In the latter case it is supposed that the so-called inter-annotator agreement is sufficiently high (e.g., more than 80%). For some tasks human evaluators do not agree on a golden standard. Inter-annotator agreement is usually computed with a reliability measure, the most common being the α-statistic (Krippendorff, 1980), the κ-statistic (Carletta, 1996) and Kendall's τ-value (Conover, 1980).



It is often difficult to obtain an annotated test set that is large enough to assess the performance of a system. If the performance of different systems is to be ranked, one is tempted to consider for classification by the human expert only those instances on which most systems disagree. These are definitely hard cases. However, the instances on which the systems agree can be completely wrongly classified. Ignoring them in the evaluation can still give a biased impression of absolute performance.

8.2.1 Classical Performance Measures

Information extraction adopts the typical evaluation measures for text classification tasks, namely recall and precision, their combination into the F-measure, and accuracy.

The effectiveness of automatic assignment of the semantic classes is directly computed by comparing the results of the automatic assignment with the manual assignments by an expert. When classes are not mutually exclusive (i.e., several classes can be assigned to one instance), binary classification decisions are the most appropriate.

Table 8.1 summarizes the relationships between the system classifications and the expert judgments for the class Ci in the case of a binary classification (Chinchor, 1992; Lewis, 1995). They form the basis for the computations of recall, precision and the F-measure.

R = a / (a + c) (8.1)
P = a / (a + b) (8.2)
Fal = b / (b + d) (8.3)

Recall (R) is the proportion of class members that the system assigns to the class. Precision (P) is the proportion of members assigned to the class that really are class members. Fallout (Fal) computes the proportion of incorrect class members given the number of incorrect class members that the system could generate. Ideally, recall and precision are close to 1 and fallout is close to 0.




Table 8.1. Contingency table of classification decisions.

                 Expert says yes   Expert says no
System says yes  a                 b                 a + b = k
System says no   c                 d                 c + d = n - k
                 a + c = r         b + d = n - r     a + b + c + d = n

where
n = number of classified objects
k = number of objects classified into the class Ci by the system
r = number of objects classified into the class Ci by the expert.

When comparing two classifiers, it is desirable to have a single measure of effectiveness. The F-measure, derived from the E-measure of van Rijsbergen (1979, p. 174 ff.), is a commonly used metric for combining recall and precision values in one metric:

F = (β2 + 1)PR / (β2P + R) (8.4)

where
P = precision
R = recall
β = a factor that indicates the relative importance of recall and precision; when β equals 1, i.e., recall and precision are of equal importance, the metric is called the harmonic mean (F1-measure).

Recall errors are referred to as false negatives, while precision errors regard false positives. The error rate (Er), which is also based on the contingency Table 8.1, takes into account both errors of commission (b) and errors of omission (c).

Er = (b + c) / n (8.5)

From Table 8.1 it is easy to see that the classical measure of accuracy is computed as:

Accuracy = (a + d) / n (8.6)
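All six measures of Eqs. (8.1)-(8.6) follow directly from the four counts of Table 8.1; a minimal sketch:

```python
def contingency_metrics(a, b, c, d, beta=1.0):
    """Compute Eqs. (8.1)-(8.6) from the counts a, b, c, d of Table 8.1."""
    n = a + b + c + d
    recall = a / (a + c)                     # Eq. (8.1)
    precision = a / (a + b)                  # Eq. (8.2)
    fallout = b / (b + d)                    # Eq. (8.3)
    f = ((beta**2 + 1) * precision * recall
         / (beta**2 * precision + recall))   # Eq. (8.4)
    error_rate = (b + c) / n                 # Eq. (8.5)
    accuracy = (a + d) / n                   # Eq. (8.6)
    return {"R": recall, "P": precision, "Fal": fallout,
            "F": f, "Er": error_rate, "Acc": accuracy}

m = contingency_metrics(a=8, b=2, c=2, d=8)  # R = P = 0.8 on this toy table
```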


Fig. 8.1. Example of macro-averaged and micro-averaged precision.

Often multiple classes are assigned (e.g., assigning semantic roles to sentence constituents), pointing to the need for an overall assessment of the performance of the extraction system. In this case the results of the above measurements for each class can be averaged over classes (macro-averaging) or over all binary classification decisions (micro-averaging) (Fig. 8.1) (Lewis, 1992). The latter way of averaging implies that categories with many examples have a larger impact upon the results.

In some information extraction tasks, classes are mutually exclusive, i.e., only one class can be assigned to the information constituent. In this case accuracy is an efficient performance measure, where accuracy is computed as the proportion of correct assignments to a class in all assignments. It can be seen that in this case micro-averaged precision and micro-averaged recall equal accuracy.

In information extraction both the detection of the information (e.g., detection of the boundary of an entity mention) and the recognition (classification) of a mention should be evaluated. For both tasks usually the same evaluation metrics are used. The result of information extraction is often a probabilistic assignment. None of the above metrics takes the probability of the assignment into consideration.
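The difference between the two ways of averaging can be seen on a toy example with one frequent and one rare class (the counts are invented):

```python
def macro_micro_precision(per_class):
    """per_class: one (a, b) pair per class, with a, b as in Table 8.1."""
    macro = sum(a / (a + b) for a, b in per_class) / len(per_class)
    micro = sum(a for a, _ in per_class) / sum(a + b for a, b in per_class)
    return macro, micro

# 90 assignments to a frequent class, 10 to a rare one:
macro, micro = macro_micro_precision([(80, 10), (5, 5)])
# micro-averaged precision (0.85) is dominated by the frequent class,
# while macro-averaging (about 0.69) weighs both classes equally
```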


8.2.2 Alternative Performance Measures

In cases where information is classified by grouping the tokens into clusters, adequate performance measures have been designed that are variations of the classical recall and precision measures. The metrics are usually illustrated with the task of noun phrase coreference resolution. Building noun phrase coreference chains regards the grouping of noun phrases into clusters. For instance, in the example John saw Mary. This girl was beautiful. She wore a red dress, one cluster should contain Mary, girl and she, apart from two singleton clusters respectively containing John and dress.

When evaluating or validating the clustering in information extraction, often the Vilain metric (the official metric used in the MUC competition) or the B-cubed metric (Bagga and Baldwin, 1998) is used. In both validations the clusters that are manually built by a human expert are compared with the clusters that are automatically built.

The Vilain algorithm takes into account the number of links that should be added 1) to the automatic output in order to arrive at the manual clustering and 2) to the manual output in order to arrive at the automatic one. The former number influences the recall measure R, while the latter influences the precision measure P. Formally one defines:

For a cluster S of entities in the manual output, p(S) is a partition of S relative to the automatic response. Each subset of S in the partition is formed by intersecting S and those automatic clusters that overlap S. For example, if one manual cluster is S = {A, B, C, D} and the automatic clustering is {A, B}, {C, …}, {D, …}, then p(S) = {{A, B}, {C}, {D}}.

c(S) is the minimal number of "correct" links necessary to generate the cluster S.

c(S) = |S| - 1 (8.7)

m(S) is the minimal number of "missing" links in the automatic clusters relative to the manual cluster S.

m(S) = |p(S)| - 1 (8.8)

The recall error for the manual cluster S is the number of missing links divided by the number of correct links:

m(S) / c(S) (8.9)



The recall is thus:

(c(S) - m(S)) / c(S) (8.10)

This equals:

(|S| - |p(S)|) / (|S| - 1) (8.11)

Extending this recall measure to the whole clustering output leads to:

R = Σj=1..k (|Sj| - |p(Sj)|) / Σj=1..k (|Sj| - 1) (8.12)

for each cluster j in the k clusters of the output.

The precision measure is obtained by switching the roles of the automatic and manual clustering, yielding:

P = Σj=1..k (|S′j| - |p′(S′j)|) / Σj=1..k (|S′j| - 1) (8.13)

where S′j is a cluster in the automatic output and p′(S′j) is its partition relative to the manual clustering.
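A compact sketch of the Vilain computation (Eqs. 8.7-8.13), with clusterings represented as lists of sets of mention ids; the precision of Eq. (8.13) is obtained by swapping the two arguments:

```python
def partition(S, clustering):
    """p(S): split S by the overlapping clusters of the other clustering."""
    parts = [S & c for c in clustering if S & c]
    covered = set().union(*parts) if parts else set()
    parts += [{m} for m in S - covered]      # uncovered mentions: singletons
    return parts

def vilain(key, response):
    """MUC recall of the response clustering against the key clustering."""
    numerator = sum(len(S) - len(partition(S, response)) for S in key)
    denominator = sum(len(S) - 1 for S in key)
    return numerator / denominator

key = [{"A", "B", "C", "D"}]
response = [{"A", "B"}, {"C"}, {"D"}]
recall = vilain(key, response)      # (4 - 3) / (4 - 1)
precision = vilain(response, key)   # all response links are correct here
```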

The B-cubed algorithm takes into account the number of entities that should be added 1) to the automatic output in order to arrive at the manual one and 2) to the manual output in order to arrive at the automatic one. The former number influences the recall measure Ri, the latter number


influences the precision measure Pi. Formally, given n objects, we define for each object i:

Ri = coi / moi (8.14)

Pi = coi / aoi (8.15)

where
coi = number of correct objects in the cluster automatically built that contains object i
moi = number of objects in the cluster manually built that contains object i
aoi = number of objects in the cluster automatically built that contains object i

The final recall R and precision P that consider all n objects of the clustering are respectively computed as follows:

R = Σi=1..n wi ⋅ Ri (8.16)

P = Σi=1..n wi ⋅ Pi (8.17)

where wi are weights that indicate the relative importance of each object (e.g., in noun phrase coreference resolution a pronoun could be weighted differently than a noun). All wi should sum to one and they are often chosen as 1/n. Both the Vilain and B-cubed metrics incorporate some form of subjectivity in measuring the validity of the clusters. The Vilain metric focuses on "What do I need to do in order to get the correct result?", and not on "Is the result that the system obtains correct or not?". The Vilain algorithm only rewards objects that are involved in some relationship. Determining that an object is not part of a cluster with another object is unrewarded. In this classic Vilain metric, all objects are treated similarly. In the B-cubed algorithm, an object's relationship with all other objects in its cluster can be weighted by a weighting parameter.
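The B-cubed computation (Eqs. 8.14-8.17) with uniform weights wi = 1/n can be sketched as follows, using the coreference example from above:

```python
def b_cubed(key, response):
    """Return B-cubed (recall, precision) with uniform weights 1/n."""
    def cluster_of(obj, clustering):
        return next(c for c in clustering if obj in c)

    objects = [o for cluster in key for o in cluster]
    n = len(objects)
    R = P = 0.0
    for obj in objects:
        auto = cluster_of(obj, response)
        manual = cluster_of(obj, key)
        correct = len(auto & manual)     # co_i
        R += correct / len(manual) / n   # Eq. (8.14), weighted by 1/n
        P += correct / len(auto) / n     # Eq. (8.15), weighted by 1/n
    return R, P

key = [{"Mary", "girl", "she"}, {"John"}, {"dress"}]
response = [{"Mary", "girl"}, {"she", "John"}, {"dress"}]
R, P = b_cubed(key, response)
```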


In analogy with the above measures, one can design other approaches for cluster validation, for instance by taking into account the number of wrong entities in one cluster.

8.2.3 Measuring the Performance of Complex Extractions

An information extraction task is often composed of different recognition tasks, hence the idea of using one evaluation score that evaluates the different recognitions. Such a score is valuable when detecting complex content, e.g., content characterized by relations between content elements. Evaluation scores that measure the performance of complex extractions have been designed during the ACE competition (ACE 2005).

The metrics used by the ACE competition compute a value score Value for a system, defined as the sum of the values of all of the system's output entity tokens, normalized by the sum of the values of all reference entity tokens, i.e., the sum of the ideal score of each token that should be recognized. The maximum possible Value score is 100%.

Value = Σi Value(sysi) / Σj Value(refj) (8.18)

where
sysi = value of each system token i based on its attributes and how well it matches its corresponding reference token
refj = value of a reference token j.

The tokens are the information elements recognized by the system. The value of a system token is defined as the product of two factors. One factor represents the inherent value of the token; the other assesses how accurately the token's attributes are recognized or the token's mentions are detected. In other words, it is evaluated whether content (e.g., an entity relation, a timex), its attributes and its arguments are recognized correctly. For instance, in a relation recognition task the arguments are the entities that form the relation.

There are two ways to look at content. One way reflects the linking of similar content which is referenced within and across documents, where this content (e.g., entity, relation) receives a unique identification number.



The evaluation includes the recognition of the different mentions of that content. A second way is to consider the recognition of each content element and its attributes independently. We will focus here on the first type of evaluation, because it is the most relevant for complex recognition tasks.

Value(sys) = ElementValue(sys) ⋅ ArgumentsValue({Arguments(sys)}) (8.19)

where sys is the content element considered (e.g., sys can be an entity, a relation, an event, etc.). sys can refer to a system token or a reference token. ElementValue(sys) is a function of the attributes of the element and, if mapped to the reference element, it judges how well the attributes match those of the corresponding reference element. The function can be defined according to the type of content element that is evaluated. For instance, in a named entity recognition task the inherent value of an entity element is defined as the product of the token's attribute value parameters and of its attribute types (e.g., the characteristics of the entity and the type of entity). This inherent value is reduced for any attribute errors (i.e., for any differences between the values of the system and the reference attributes) using error weighting parameters, {W_err-attribute}. If a system token is unmapped, then the value of that token is weighted by a false alarm penalty, W_E-FA.

The second factor in Eq. (8.19) determines how accurately the information element's mentions or arguments are detected. The detection of mentions refers to the detection of arguments in an equivalence relation between different mentions (e.g., the correct resolution of coreferring content elements). In other types of relations other arguments can be detected, such as the recognition of the arguments of an action or a speech act, or the recognition of the necessary parts of a script.

The exact function for the computation of the element value and the mentions value depends on the extraction task and on what aspects of its performance are considered important for a certain application. The functions are here illustrated with the example of the recognition and normalization of temporal expressions in text.

Value(sys) = ElementValue(sys) ⋅ MentionsValue(sys)    (8.20)

The ElementValue(sys) here depends on how well the attributes of the system token sys match those of the corresponding reference token. The intrinsic value of a timex token is defined as a sum of attribute value parameters, AttrValue, summed over all attributes a ∈ A which exist and which are the same for both the system and reference tokens. In the recognition and normalization of temporal expressions A is composed of the following attributes. Temporal expressions to be recognized include both absolute expressions and relative expressions (Type). In addition, the attributes include the normalized time expression (Val) (e.g., 2005-9-23), the normalized time expression modifier (Mod) (e.g., approximate), a normalized time reference point (AnchorVal) (e.g., 2005-9-5), a normalized time directionality (AnchorDir) (e.g., before), and a flag that ascertains that Val is composed of a set of time expressions (Set). These attributes follow the conventions of the "TIDES 2005 standard for annotations of temporal expressions". If a system token is unmapped, ElementValue(sys) is zero.

ElementValue(sys) = Σ_{a ∈ A}  AttrValue(a)   if a(sys) = a(ref) and sys is mapped
                               0              otherwise    (8.21)

MentionsValue(sys) is simply the sum of the mention values (MMV) of a system token. A mention's MMV is simply 1, if the system token's mention maps the corresponding reference token. If the system token's mention is unmapped, then the MMV is weighted by a false alarm penalty factor, W_M-FA, and also by a coreference weighting factor, W_M-CR. The latter refers to the penalty when the system mention happens to correspond to a legitimate reference mention, but one that does not belong to the corresponding reference token. For each pairing of a system token and a reference token, an optimum correspondence between system mentions and reference mentions that maximizes the sum of MMV over all system mentions is determined and used, subject to the constraint of a one-to-one mapping between system and reference mentions.

MMV(mention_sys) = 1                       if mention_sys is mapped
                   −(W_M-FA ⋅ W_M-CR)      otherwise    (8.22)

MentionsValue(sys) = Σ_{all docs} Σ_{all sys mentions in doc} MMV(mention_sys)    (8.23)


Table 8.2. Examples of values of weight parameters used in the attribute matching of the recognition and normalization of temporal expressions.

ElementValue parameters:
  Attribute:   Type   Val   Mod    AnchorVal   AnchorDir   Set
  AttrValue:   0.10   1     0.10   0.50        0.25        0.10
  W_E-FA = 0.75

MentionsValue parameters:
  W_M-FA = 0.75
  W_M-CR = 0.00
  MinOverlap = 0.30

System mentions and reference mentions are permitted to correspond only if their extents have a mutual overlap of at least MinOverlap. In the frame of ACE 2005, overlap is simply defined as the normalized number of characters that are shared by the two strings.

From the above it is clear that several parameters have to be set a priori. In the ACE 2005 competition these parameters were set as shown in Table 8.2. In order to obtain a global evaluation of a system's performance in temporal expression recognition and normalization in text, a final score is computed according to Eq. (8.18). This score is 100% when all timexes, their attributes and mentions are perfectly recognized and normalized.
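As an illustration, the computation of Eqs. (8.18)-(8.23) with the parameter values of Table 8.2 can be sketched in a few lines of Python. The token representation and the given token mapping are simplifying assumptions; a real ACE scorer searches for the optimal one-to-one mapping and also applies the false alarm penalties for unmapped tokens, which are omitted here.

```python
# Illustrative sketch of the ACE 2005 value score for timex recognition
# (Eqs. 8.18-8.23), using the parameter values of Table 8.2. The token
# representation and the given token mapping are simplifying assumptions.

ATTR_VALUE = {"Type": 0.10, "Val": 1.0, "Mod": 0.10,
              "AnchorVal": 0.50, "AnchorDir": 0.25, "Set": 0.10}
W_M_FA, W_M_CR = 0.75, 0.00   # mention false alarm / coreference penalties

def element_value(sys_attrs, ref_attrs):
    # Eq. 8.21: sum AttrValue(a) over the attributes a that exist for
    # both tokens and whose values match.
    return sum(w for a, w in ATTR_VALUE.items()
               if a in sys_attrs and a in ref_attrs
               and sys_attrs[a] == ref_attrs[a])

def mentions_value(sys_mentions, ref_mentions):
    # Eqs. 8.22-8.23: a mapped mention contributes 1, an unmapped one a
    # penalty (zero here, since W_M_CR = 0.00 in Table 8.2).
    return sum(1.0 if m in ref_mentions else -(W_M_FA * W_M_CR)
               for m in sys_mentions)

def token_value(attrs, mentions, ref_attrs, ref_mentions):
    # Eq. 8.20: product of the element value and the mentions value.
    return element_value(attrs, ref_attrs) * mentions_value(mentions, ref_mentions)

def value_score(mapped_pairs, reference):
    # Eq. 8.18: system token values normalized by the ideal reference values.
    sys_sum = sum(token_value(sa, sm, ra, rm)
                  for (sa, sm), (ra, rm) in mapped_pairs)
    ref_sum = sum(token_value(ra, rm, ra, rm) for ra, rm in reference)
    return sys_sum / ref_sum

# Hypothetical example: one timex, correct Type but a wrong normalized Val.
ref = ({"Type": "DATE", "Val": "2005-09-23"}, ["m1"])
sys = ({"Type": "DATE", "Val": "2005-09-05"}, ["m1"])
score = value_score([(sys, ref)], [ref])   # 0.10 / 1.10, about 9.1%
```

The example shows how strongly the Val attribute (weight 1) dominates the score: a single wrong normalization reduces the token's value to roughly 9% of its ideal value.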

The mutual overlap parameter determines the conditions under which two mentions are allowed to map. In MUC-4 (1992) a partial matching of mentions was allowed. In case of a partial matching, the performance score is decreased by a predefined factor. Lee et al. (2004) propose to measure the performance of the recognition according to each of the boundary conditions strict, left, right and sloppy: Strict means that the boundaries of the system and those of the answer match on both sides, left means that only the left boundary of the system and that of the answer match, right means that only the right boundary of the system and that of the answer match, and sloppy means that the boundaries of the system and those of the answer overlap.
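These boundary conditions and a mutual overlap test in the spirit of ACE 2005 can be sketched as simple predicates over character extents. The (start, end) span representation and the normalization of the overlap by the shorter extent are assumptions, since the sources only speak of boundaries and of a normalized number of shared characters.

```python
# Sketch of the boundary matching conditions of Lee et al. (2004) and of
# a mutual overlap test in the spirit of ACE 2005. Extents are assumed to
# be (start, end) character offsets with an exclusive end; normalizing the
# overlap by the shorter extent is an assumption.

def strict(sys, ref):
    return sys == ref                              # both boundaries match

def left(sys, ref):
    return sys[0] == ref[0] and sys[1] != ref[1]   # only the left boundary

def right(sys, ref):
    return sys[1] == ref[1] and sys[0] != ref[0]   # only the right boundary

def sloppy(sys, ref):
    return sys[0] < ref[1] and ref[0] < sys[1]     # the extents overlap

def mutual_overlap(sys, ref):
    # Shared characters, normalized by the length of the shorter extent.
    shared = max(0, min(sys[1], ref[1]) - max(sys[0], ref[0]))
    return shared / min(sys[1] - sys[0], ref[1] - ref[0])

MIN_OVERLAP = 0.30   # value used in ACE 2005 (Table 8.2)
can_map = mutual_overlap((0, 10), (5, 10)) >= MIN_OVERLAP
```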

Evaluation of several subtasks and integration of the evaluation scores in one metric often demands weighting of the subscores based on a priori defined parameters. This is illustrated with the performance measure discussed in this section. Such an approach is subjectively colored by the many parameters that have to be tuned. But, the metric makes it clear that there is an absolute need to evaluate a combination of extraction tasks. In the future this demand will only increase as the different extraction tasks will eventually lead to the understanding of texts.


8.3 Extrinsic Evaluation of Information Extraction in Retrieval

Classical evaluation in information retrieval relies on recall and precision values (possibly combined in an F-measure) to assess the performance of the retrieval. We refer to Eqs. (8.1), (8.2) and (8.5), where the binary class considered is now the one of relevancy or correctness of the answer in the result or answer list returned by the information retrieval system. Formally, we define recall (R) and precision (P) respectively as:

R = ard / trd    (8.24)

P = ard / ad    (8.25)

where ard = the number of relevant documents in the result list
      trd = the total number of relevant documents in the document base
      ad = the number of documents in the result list.

Note that the term "documents" is interpreted here very broadly and encompasses document elements or passages, sentences or phrases, apart from regular documents.
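With the definitions above, Eqs. (8.24) and (8.25) amount to a few lines of code; the document identifiers in the example are hypothetical.

```python
# Recall and precision over a result list (Eqs. 8.24-8.25).

def recall(result, relevant):
    # ard / trd: relevant documents retrieved over all relevant documents.
    return len([d for d in result if d in relevant]) / len(relevant)

def precision(result, relevant):
    # ard / ad: relevant documents retrieved over all retrieved documents.
    return len([d for d in result if d in relevant]) / len(result)

result = ["d1", "d4", "d2", "d9"]   # ranked result list
relevant = {"d1", "d2", "d3"}       # all relevant documents in the base
r, p = recall(result, relevant), precision(result, relevant)   # 2/3 and 2/4
```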

Currently, some measures take into account the ranking, which is the relative ordering of the retrieved documents by perceived relevance. One of the earliest metrics is the reciprocal answer rank (RAR), developed for evaluating the performance of question answering systems, which weights the correctness of an answer according to its position in the answer list, decreasing the influence of answers further down in this list. Another metric is the mean average precision (MAP), also referred to as the mean non-interpolated average precision (Buckley and Voorhees, 2002), computed as a mean over a set of queries. The average precision (AP) is computed after every retrieved relevant document, using zero as precision for relevant documents that are not retrieved, and then averaged over the total number of relevant documents for a query. Suppose we have trd relevant documents for a given query in our test collection; AP is defined as:


AP = (1/trd) Σ_{r=1}^{trd} P_r    (8.26)

P_r = ard_r / rd_r    (8.27)

where ard_r = the number of relevant documents in the result list up to the position of the rth relevant document
      rd_r = the position of the rth relevant document in the result list.

If the rth relevant document does not occur in the result list, P_r = 0.

A reader loses his or her time when looking at non-relevant documents that are ranked higher than relevant documents. So, a good result list should have as few as possible non-relevant documents ranked higher than relevant documents. This is, for instance, reflected in the bpref (binary preference) measure, which measures the number of faulty orderings in the result list, i.e., orderings where a non-relevant document is ranked before a relevant document (De Beer and Moens, 2006).

bpref = (1/ard) Σ_{r=1}^{ard} (1 − nn_r/nn)    (8.28)

where nn_r = the number of non-relevant documents in the result list up to the position of the rth relevant document
      nn = the number of non-relevant documents in the result list.

The soundness of a variant of this metric and its robustness in the face of incomplete and imperfect relevance information are discussed and demonstrated by Buckley and Voorhees (2004). By incomplete judgments we mean that the result list does not contain all the relevant documents. An imperfect judgment refers to a situation in which a document of the result list is no longer part of the document collection. Both situations occur in current search settings.

One document or answer might be more relevant than another one in the list of retrieved documents. It is our conviction that for many applications, binary relevance judgments are rarely adequate to fully express the perceived relevance level experienced by end users. Relevance should be considered a fuzzy variable, as it is - besides other factors - largely dependent on the utility of the judged documents for satisfying the user's (underspecified) information needs. Therefore, De Beer and Moens (2006) have proposed a generalization of the bpref measure that measures the intrusion of less relevant documents before and between more relevant documents.
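The ranked metrics above can be sketched as follows; relevant documents that are not retrieved contribute zero to AP, and the handling of unjudged documents, which the original bpref formulation treats more carefully, is simplified here.

```python
# Average precision (Eqs. 8.26-8.27) and bpref (Eq. 8.28) over a ranked
# result list. The handling of unjudged documents is a simplification.

def average_precision(result, relevant):
    ard_r, total = 0, 0.0
    for rank, doc in enumerate(result, start=1):
        if doc in relevant:
            ard_r += 1
            total += ard_r / rank     # P_r at the rank of this relevant doc
    return total / len(relevant)      # unretrieved relevant docs add zero

def bpref(result, relevant):
    # 1/ard * sum over retrieved relevant documents of (1 - nn_r / nn),
    # where nn_r counts the non-relevant documents ranked before the r-th
    # relevant one and nn the non-relevant documents in the result list.
    nn = sum(1 for d in result if d not in relevant)
    ard = sum(1 for d in result if d in relevant)
    if ard == 0:
        return 0.0
    score, nn_r = 0.0, 0
    for doc in result:
        if doc in relevant:
            score += 1.0 - (nn_r / nn if nn else 0.0)
        else:
            nn_r += 1
    return score / ard

result = ["d1", "dX", "d2", "dY", "d3"]   # dX, dY are non-relevant
relevant = {"d1", "d2", "d3"}
ap = average_precision(result, relevant)  # (1/1 + 2/3 + 3/5) / 3
bp = bpref(result, relevant)              # (1 + 1/2 + 0) / 3 = 0.5
```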


We are not aware of any metric that measures the performance of information extraction and information retrieval in a combined way. Such a metric could be useful to compare retrieval systems that use different information extraction technologies.

In information retrieval there is a growing need for evaluation metrics that judge answers to information questions. The answers are extracted from a document or even synthesized from different document sources (see Chap. 10). In such a setting it is important that the answer is complete and correct, i.e., it contains all and only correct elements and the elements are connected with the right relationships. Research into evaluation metrics for text summarization might be adopted. Such metrics are currently under development in the community of the Document Understanding Conference (DUC). When different answers are retrieved from the document collection, e.g., when the query or information question could be interpreted in different ways, the evaluation metric should also assess that the most relevant answers come first, or are preceded by only very few non-relevant or less relevant answers.

8.4 Other Evaluation Criteria

When dealing with text, other criteria for judging the performance of information extraction systems are important. The evaluation of natural language processing is extensively discussed in Sparck Jones and Galliers (1996).

A first evaluation criterion regards the computational complexity of the information extraction and of the storage overhead. Even if computer power has dramatically grown, extracting content from texts is computationally expensive and care should be taken to use efficient computations whenever possible. When information extraction results are added to document indices in retrieval systems, a balance should be sought between the number of allowable computations at query time and the storage overhead caused by intermediary results that were a priori calculated. It could be measured how large the indexing overhead is and how this affects the retrieval performance for certain applications.

Another concern is linguistic coverage. Although becoming a smaller problem over the years, some types of linguistic phenomena cannot yet be covered in a certain language as the necessary technology and resources are not yet developed. Or the linguistic tools might not yield sufficient reliability in terms of qualitative performance. This situation constrains certain information extraction tasks (e.g., entity relation recognition relies on a syntactic parse of a sentence). So, when judging the extraction systems, the evaluation report preferably includes the natural language processing resources and tools that are used, and evaluates their performance for the task at hand. If a part-of-speech tagger or a sentence parser is used, the accuracy of the results can be measured (van Halteren, 1999).

Some information extraction systems might perform well in a limited domain where enough annotated examples are provided to cover all phenomena (all variant linguistic expressions are annotated) and the ambiguity of the language is more restricted. In order to measure the domain coverage, the concept of domain has to be specified. This is often difficult. A domain is sometimes associated with a sublanguage. Such a sublanguage is more restricted in its linguistic properties (vocabulary, syntax, semantics and discourse organization) (Grishman and Kittredge, 1986). Typical sublanguage texts may be weather reports and medical discharge summaries of patients. Information extraction from sublanguage domains is thought to be easy. However, linguistic expressions from the standard language or from neighboring domains possibly enter the sublanguage without going through a process of setting up conventions. With regard to information extraction, this means that part of the extraction tasks can be defined across domains while others are very domain specific. As we will see in Chap. 9, some information tasks are much more difficult than others and the degree of difficulty may vary from domain to domain. Rather than considering domain coverage as the proportion of the domain that is covered by the extraction system, it makes more sense to measure the performance of the different extraction tasks.

The information to be extracted is described by the classification scheme or extraction ontology, and in order to have comparable performance measures, this classification scheme should be accepted by a large community. This brings us to the problem of standardization. The output of the extraction system (i.e., the semantic labels) should be standardized as much as possible, so as to ensure interoperability and comparability of systems and to facilitate that the output can be processed by other systems such as retrieval, data mining and summarization tools.

Another performance criterion is measuring the extensibility of the extraction system. A system can be extended in two ways: by enlarging the feature space or by enlarging the extraction scheme or ontology. Enlarging the feature space often regards the inclusion of extra linguistic phenomena because of the availability or advancement of natural language processing resources. The second enlargement regards the domain coverage, i.e., the classification scheme is extended in order to cover extra intra-domain or inter-domain classes. Extensibility is difficult to measure quantitatively. However, one could note the differences in performance after the system is extended. Also, the necessary changes to the system in this adaptation should be described in any performance report.

Related to extensibility is portability, i.e., the capability of a system to be transferred from one language or subject domain to another, and the amount of extra work this requires (e.g., in the form of drafting knowledge rules or of annotating and training a system).

We can also measure how much time it takes to train an extraction system, whether it is a system that is built from scratch or whether the system is extended or ported to another language or domain. The time to train a system largely depends on the size of the classification scheme used and on the complexity of the examples that are associated with certain classes. This is also difficult to measure quantitatively. First of all, there is the cost of annotation. Even with sophisticated annotation tools, which have to be adapted to changing features and classes, annotation is a real burden, which one wants to reduce as much as possible. Another question to be asked is: Can the system be trained incrementally without starting from scratch when new labeled examples are available, or when classes or features are updated?

The criteria of extensibility, portability and time to train a system regard the maintenance of the system.

Very often the circumstances in which a system is trained or operates are not ideal. For instance, the linguistic quality of the input can be distorted by spelling and grammatical errors (e.g., spam messages). Then, it is definitely worth measuring how robust the system is. The performance can also be compared when all the settings for the extraction are kept constant and the noisy text is replaced by its non-noisy variant.

Finally, there are a number of criteria that are common for many information systems. They regard – among others – latency (speed of generating the answer) and efficient usage of system resources (working memory, storage, bandwidth in case of distributed information), scalability to large document collections, huge classification schemes, and a large number of languages.

8.5 Conclusions

In this chapter we have given an overview of a number of classical evaluation metrics for assessing the performance of information extraction and information retrieval systems. There is still room for the development of evaluation metrics that measure the quality of the results of retrieval systems that incorporate extraction technology, for instance, when measuring


the completeness and correctness of an answer to an information question by splitting the answer into information elements and their relationships. Because information extraction from text and information retrieval technology that relies on these extraction technologies employ natural language processing tools, performance measures that are commonly applied in human language technology seem useful.

8.6 Bibliography

ACE (2005). The ACE 2005 evaluation plan. Evaluation of the detection and recognition of ACE entities, values, temporal expressions, relations and events.

Baeza-Yates, Ricardo and Berthier Ribeiro-Neto (1999). Modern Information Retrieval. New York: Addison-Wesley.

Bagga, Amit and Breck Baldwin (1998). Algorithms for scoring coreference chains. In Proceedings of the Linguistic Coreference Workshop at the First International Conference on Language Resources and Evaluation (LREC'98) (pp. 563-566). LREC.

Buckley, Chris and Ellen M. Voorhees (2002). Evaluating evaluation measure stability. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 33-40). New York: ACM.

Buckley, Chris and Ellen M. Voorhees (2004). Retrieval evaluation with incomplete information. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 25-32). New York: ACM.

Carletta, Jean (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22 (2), 249-254.

Chinchor, Nancy (1992). MUC-4 evaluation metrics. In Proceedings of the Fourth Message Understanding Conference (MUC-4) (pp. 22-50). San Mateo, CA: Morgan Kaufmann.

Conover, William J. (1980). Practical Nonparametric Statistics, 2nd edition. Redwood City, CA: Addison-Wesley.

Grishman, Ralph and Richard Kittredge (1986). Analyzing Language in Restricted Domains: Sublanguage Description and Processing. Hillsdale, NJ: Lawrence Erlbaum.

Krippendorff, Klaus (1980). Content Analysis: An Introduction to Its Methodology. Thousand Oaks, CA: Sage Publications.

De Beer, Jan and Marie-Francine Moens (2006). Rpref - A Generalization of Bpref towards Graded Relevance Judgments. In Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Lewis, David D. (1992). Evaluating and optimizing autonomous text classification systems. In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 246-254). New York: ACM.

Lewis, David D. (1995). Evaluating and optimizing autonomous text classification systems. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 246-254). New York: ACM.

Sparck Jones, Karen and Julia R. Galliers (1996). Evaluating Natural Language Processing: An Analysis and Review. New York: Springer.

Van Halteren, Hans (1999). Performance of taggers. In Hans van Halteren (Ed.) Syntactic Wordclass Tagging (pp. 81-94). Dordrecht: Kluwer Academic Publishing.

Van Rijsbergen, Cornelis J. (1979). Information Retrieval, 2nd ed. London: Butterworths.

Voorhees, Ellen and Donna K. Harman (Eds.) (2005). TREC: Experiment and Evaluation in Information Retrieval. Cambridge, MA: The MIT Press.


9 Case Studies

9.1 Introduction

In the foregoing chapters we focused on the history of information extraction, on the current extraction technologies and their evaluation. In this chapter, it is time to illustrate these technologies with real and recent case studies, to summarize the capabilities and the performance of these systems, and to draw attention to the bottlenecks that need further research. Furthermore, we will sum up the tasks in which the extraction technology is integrated and specifically focus on information that is relevant in a retrieval setting.

Information extraction technology is integrated in a variety of application domains and many different tasks are being implemented.

Information extraction from news texts has been studied considerably in past research. Information on worldwide events such as natural disasters, political events or on famous persons is commonly identified in the documents.

Another application domain where information extraction technology is in full expansion is the biomedical domain. In this domain, extraction has become a necessary technology in order to master the huge amounts of information, much of which is in the form of natural language texts.

A third domain, which currently gives a strong impetus to the development of information extraction technology, is intelligence gathering. After the September 11 attacks, police and intelligence services are eager to find and link information in the bulk of private e-mails, phone conversations and police reports, and in public sources such as news, chat rooms and Web texts.

In the economic and business domain, there is a large interest in extracting product and consumer information from texts found on the World Wide Web, to monitor mergers and transactions, and to identify consumer sentiments and attitudes. These texts usually carry some structure marked with HTML (HyperText Markup Language) or XML (Extensible Markup


Language). In the business domain, one is also interested in extracting information from technical documentation.

In the legal domain we see a large demand for information extraction technologies, especially for metadata generation and concept assignment to texts, which could be used for case-based reasoning. Notwithstanding this need and the huge amounts of written texts that are available in legal databases, information extraction is not very much researched in the legal domain. Moreover, the results of the rare studies in information extraction leave much room for improvement.

Finally, information extraction from speech and informal sources such as e-mail and spam poses additional difficulties that are the focus of current research.

The performance measures that accompany our case descriptions are only indicative because the evaluation settings (corpora, semantic classes, selected features) usually differ. The aim is to give the reader an estimate of the state-of-the-art performance. We refer to the literature for details on the evaluations. Unless stated otherwise, the results regard the processing of English texts.

The above list of extraction tasks is far from exhaustive and is only inspired by information extraction from text. Any information medium that is consulted by humans is or will eventually be accessed with information extraction technologies.

Before discussing the different application domains of information extraction, we will give some general remarks on the generic versus domain specific character of the extraction technology.

9.2 Generic versus Domain Specific Character

In the previous chapters we have described the technologies on a very general level and treated fairly generic extraction tasks such as named entity recognition, noun phrase coreference resolution, semantic role recognition, entity relation recognition, and timex recognition and resolution. These chapters show that the information extraction algorithms and methods can be transposed to many different application domains. However, within a certain domain the extraction tasks become more refined (e.g., domain specific entities are extracted) as each domain carries additional domain specific semantic information. The domains also handle specific text types or genres (e.g., newswires, news headlines, article abstracts, full articles, scientific reports, technical notes, internal communiqués, law texts, court decisions, political analyses and transcriptions of telephone conversations).


Variations between subject domains mainly come down to the use of a specialized vocabulary and of certain domain specific idiomatic expressions and grammatical constructions, besides the vocabulary, expressions and constructions of ordinary language. For instance, biomedical texts use many domain specific names, while legal texts are famous for their use of lengthy, almost unreadable grammatical constructions.

Variations between text types mainly regard the rhetorical and global textual features. The former includes the use of specific rhetorical markers, of specific forms of argumentation or causality, of the directness of the message, and the underlying goal of the text. The latter includes parameters such as text length, use of typography and specific rules as to text formatting. For example, a news feed will almost always be a short text that wants to inform the reader in a neutral and direct tone that a certain noteworthy event took place somewhere in the world. It will contain a headline (which usually summarizes the event described in the text body and is often capitalized or typographically distinct from the rest) and a small number of very short paragraphs. Scientific journal articles are usually longer; they do not necessarily describe a noteworthy event, but rather the result of scientific research; they do not simply want to convey something, but try to convince the reader that the research described in the article is scientifically relevant and sound; and they do that – or at least are supposed to do that – by using some form of rational argumentation. In their most simple form, they are organized into a number of subsections, each of which has a subtitle and is subdivided in a number of paragraphs. The articles are preceded by a title that is indicative of the content and an abstract containing a short overview of the article, and consist of a main body that has a topic-argument-conclusion structure.

As a result, the domain specific extraction tasks rely on domain specific and text type specific contextual features, often demanding different levels of linguistic processing – sometimes domain adapted linguistic processing – in order to compute the feature values. In addition, an ontology of domain specific semantic labels accompanies the information phenomena.

Although it is not easy to choose and model the right features and labels for the extraction task, the underlying technology and algorithms – especially the pattern recognition techniques – for information extraction are fairly generic. The difficulty in defining good features is one of the reasons why information extraction has been popular in a restricted semantic domain operating on a certain text type. Nowadays we have at our disposal many natural language processing tools that can be applied for feature selection. A completely domain independent information extraction system does not exist because of the reliance on a rich variety of features, but recent trends in information extraction increasingly stress the importance of


making extraction components as generic as possible, especially the underlying algorithms and methods.

These findings make information extraction also interesting for information retrieval from both specific document collections and collections that cover heterogeneous domains and text types, such as found on the World Wide Web.

9.3 Information Extraction from News Texts

Information extraction from news is well developed through the Message Understanding Conferences (MUC) of the late 1980s and 1990s, sponsored by the US Defense Advanced Research Projects Agency (DARPA). Many of the MUC competitions involved the extraction of information from newswires and newspaper articles. For instance, MUC-3 and MUC-4 involved the analysis of news stories on terrorist attacks. MUC-5 included texts on joint ventures, while MUC-7 identified information in news on airplane crashes and satellite launch events. Each of the MUC conferences operated in a specific domain, though the MUC experiences laid the foundations for many generic information extraction tasks (e.g., noun phrase coreference resolution, named entity recognition) and they showed that the technology developed could be easily ported to different domains. The MUC competitions focused also on finding relations between different entities that form the constituents of an event and that fill a template frame, e.g., time, location, instrument and actors in a terrorist attack. Typically in news, actors and their relations (who did what to whom) and the circumstances (e.g., location, date) are identified.

Currently, the Automatic Content Extraction initiative (ACE) of the National Institute of Standards and Technology (NIST) in the US develops content extraction technology to support automatic processing of human language in text form. One of the source types involves newswire texts. An important goal of this program is the recognition of entities, semantic relations and events. The entities include persons, organizations, geographical-political entities (i.e., politically defined geographical regions), localization (restricted to geographical entities), and facility entities (i.e., human made artifacts in a certain domain). In addition, relations between entities are detected. They include within and across document noun phrase coreference resolution, cross-document event tracking and predicate-argument recognition in clauses. In the frame of the above competitions valuable annotated test corpora were developed.


Named entity recognition – and more specifically the recognition of persons, organizations and locations – in news texts is fairly well developed, yielding performance in terms of F-measure1 (see Eq. (8.5)) above 95% for different machine learning algorithms (e.g., maximum entropy model, hidden Markov model) (e.g., Bikel et al., 1999). The performance of named entity taggers on written documents such as Wall Street Journal articles is comparable to human performance, the latter being estimated in the 94-96% F-measure range. This means that the relevant features are very well understood and that the patterns are quite unambiguous.
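
The precision, recall and F-measure figures quoted throughout this chapter derive from true positive, false positive and false negative counts; a minimal sketch of the computation (the function name and example counts are illustrative):

```python
def precision_recall_f1(tp, fp, fn, beta=1.0):
    """Compute precision, recall and the F-beta measure from raw counts.

    With beta=1 this is the harmonic mean of precision and recall,
    i.e., the F1-measure used throughout this chapter."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# A tagger that finds 95 of 100 gold entities with 3 spurious hits:
p, r, f1 = precision_recall_f1(tp=95, fp=3, fn=5)
print(round(p, 3), round(r, 3), round(f1, 3))  # -> 0.969 0.95 0.96
```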

The best results in noun phrase coreference resolution are obtained with decision tree algorithms, more specifically F-measures of 66.3% and 61.2% on the MUC-6 and MUC-7 data sets, respectively, for the decision tree algorithm C4.5 (Ng and Cardie, 2002), and F-measures in the lower 60% range when resolving coreferents with weakly supervised methods (Ng and Cardie, 2003). The F-measures are here computed with the Vilain evaluation metric for recall and precision (see Eqs. (8.12) and (8.13)). The results show that noun phrase coreference resolution in news texts is far from a solved problem.

Cross-document noun phrase coreference resolution research investigates both the detection of synonymous (alias) names (mentions with a different written form that refer to the same entity) and the disambiguation of polysemous names (mentions with the same written form that refer to different entities). Li et al. (2004) report results of 73% in terms of F-measure when applying a generative language model to 300 documents from the New York Times for the former task (cross-document alias detection of person, location and organization mentions while ignoring anaphoric references). For the disambiguation task, the same authors obtain an F-measure close to 91% under the same settings. Gooi and Allan (2004) report best results in terms of F-measure (obtained with the B-CUBED scoring algorithm for recall and precision, see Eqs. (8.16) and (8.17)) of more than 88% by clustering terms in contextual windows on the John Smith corpus with the aim of disambiguating the name John Smith across different documents.
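
The clustering idea behind such name disambiguation can be illustrated with a minimal sketch: each mention of an ambiguous name is represented by a bag of context words and greedily grouped by cosine similarity. The toy contexts, threshold and single-pass strategy below are invented for illustration and are not Gooi and Allan's actual method:

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_mentions(contexts, threshold=0.2):
    """Greedy single-pass clustering: each mention joins the first cluster
    whose centroid is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (centroid Counter, member indices)
    for i, ctx in enumerate(contexts):
        vec = Counter(ctx)
        for centroid, members in clusters:
            if cosine(vec, centroid) >= threshold:
                centroid.update(vec)
                members.append(i)
                break
        else:
            clusters.append((Counter(vec), [i]))
    return [members for _, members in clusters]

# Two "John Smith" mentions in political contexts, one in a chemistry context:
docs = [["senator", "campaign", "vote"],
        ["vote", "senator", "election"],
        ["polymer", "chemistry", "lab"]]
print(cluster_mentions(docs))  # -> [[0, 1], [2]]
```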

With the ACE corpus of news articles (composed of 800 annotated text documents gathered from various newspapers and broadcasts), Culotta and Sorensen (2004) obtain promising results on an entity relation recognition task by using different dependency tree kernels. The kernel function is used as a similarity metric. Given a set of labeled instances, the method determines the label of a novel instance by comparing it to the labeled instances

1 Unless stated otherwise, F-measures (see Eq. (8.5)) refer here to the harmonic mean of recall and precision, where both are equally weighted (also referred to as the F1-measure).



using this kernel function. For a binary classification with a support vector machine (SVM), the tree kernel and the combination of the tree (contiguous or sparse) kernel and the bag-of-word kernel outperform the bag-of-word kernel, with F-measures between 61% and 63% versus 52%. Precision is quite good (in the lower 80% range), but is tempered by the rather low recall values (ca. 50%). The 24 types of relations used (e.g., semantic roles that indicate part-of relations or specific family and affiliation relations) have quite different distributions in the data set. Seventeen of the relations have fewer than 50 annotated instances, which is an important cause of the low performance in terms of recall.
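
Kernels of this kind can be plugged into any SVM implementation that accepts a user-defined or precomputed kernel. As a rough illustration, deliberately much simpler than Culotta and Sorensen's dependency tree kernels, one can represent each relation instance by its bag of words and its set of parent-to-child dependency edges and sum the two dot-product kernels (the data structures and toy instances below are invented):

```python
from collections import Counter

def bow_kernel(x, y):
    """Bag-of-words kernel: dot product of word-count vectors."""
    cx, cy = Counter(x["words"]), Counter(y["words"])
    return sum(cx[w] * cy[w] for w in cx)

def edge_kernel(x, y):
    """A crude stand-in for a tree kernel: counts shared parent->child
    dependency edges (a far smaller feature space than a real tree kernel)."""
    cx, cy = Counter(x["edges"]), Counter(y["edges"])
    return sum(cx[e] * cy[e] for e in cx)

def combined_kernel(x, y):
    """Sum of the two kernels; a sum of kernels is again a valid kernel."""
    return bow_kernel(x, y) + edge_kernel(x, y)

# Toy relation instances: words plus dependency edges of the mention context.
a = {"words": ["smith", "works", "for", "acme"],
     "edges": [("works", "smith"), ("works", "for"), ("for", "acme")]}
b = {"words": ["jones", "works", "for", "acme"],
     "edges": [("works", "jones"), ("works", "for"), ("for", "acme")]}
print(combined_kernel(a, b))  # -> 5
```

A Gram matrix filled with such kernel values can be passed to a kernelized SVM that supports precomputed kernels.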

The lack of patterns in the training data is an important, but not the sole, cause of a low recall value. Another problem is implicit information, i.e., information that is not made explicit in the stories, but is understood by human readers from the context.

In addition, news stories are often of the narrative genre. They are very well suited to establish the timeline of the different steps in an event or the timeline of different events. Research into the recognition and resolution of timexes is only in its infancy, but is becoming an important research topic. Recognition and resolution of timexes have to deal with ambiguous signaling cues and content left implicit in the text (e.g., the time order of certain content is not explicitly expressed in a text or is lacking across texts, but the logical order is easy to infer based on world knowledge).

Not all news stories are of a narrative nature. Many of them are also opinion pieces or interweave opinions into the events. Information extraction technology used in the opinion extraction of news is limited (Grefenstette et al., 2004), although there is an extensive literature on sentiment or attitude tracking (see below).

Information extraction from news is important in question answering retrieval. Information extraction from the text of news is increasingly used to annotate accompanying images and video (Dowman et al., 2005), and there is no doubt that these annotations will play a valuable role in the retrieval of news media. Current research focuses on aligning content recognized in news media across text and images in order to obtain well-documented answers and summaries to information questions.

9.4 Information Extraction from Biomedical Texts

Among the application domains of information extraction, the biomedical domain is currently the most important. This is due to the large amount of



biological and medical literature that grows exponentially every day and the necessity to access this information efficiently.

A first source of information concerns patient reports. There have been efforts to extract information and subsequently encode it for use in data mining, decision support systems, patient management systems, quality monitoring systems, and clinical research.

A second source of information consists of the huge repositories of scientific literature in the biomedical domain. Medline alone contains over 15 million abstracts and is a critical source of information, with ca. 60,000 new abstracts appearing each month.

Different ontologies or classification schemes and annotated databases are available, e.g., the functional annotations listed in the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2004) and the Gene Ontology (GO) (Ashburner et al., 2000) annotation databases. The Gene Ontology is a large controlled vocabulary covering molecular functions, biological processes and cellular components. An important annotated data set is the GENIA corpus, currently the largest annotated text resource in the biomedical domain available to the public. In its current version, it contains 2000 MEDLINE abstracts that are part-of-speech tagged with Penn Treebank POS tags and annotated with the biomedical concepts defined in the GENIA ontology.

Named entity recognition in particular is a very common task because of the absolute necessity to recognize names of genes, proteins, gene products, organisms, drugs, chemical compounds, diseases, symptoms, etc. Named entity recognition is a first step toward more advanced extraction tasks such as the detection of protein-protein interactions, gene regulation events, the subcellular location of proteins and pathway discovery. In other words, the biological entities and their relationships convey knowledge that is embedded in the large textual document bases that are electronically available.

Named entity recognition poses specific problems because of the complex nature of the detection of the boundaries of the entities, their classification, mapping (tracing) and disambiguation. These problems also occur in other application domains, but are usually less pronounced there.

Boundary detection of the named entity is not always easy, and its recognition is often treated as a separate classification task. One cannot rely on simple character patterns such as capitalization, because of the variety of surface forms that refer to the same named entity. Biomedical named entities often have pre-modifiers or post-modifiers that are (e.g., 91 kDa protein) or are not (e.g., activated B cell lines) part of the entity. The names are often mentioned solely as, or referred to by, acronyms (e.g., NR = nerve root). Entities are of varying length (e.g., 47 kDa sterol



regulatory element binding factor, RA). Two or more biomedical named entities can share one head noun through a conjunction construction (e.g., 91 and 84 kDa proteins). Biomedical entities are often embedded in one another (e.g., <PROTEIN> <DNA> kappa 3 </DNA> binding factor </PROTEIN>).

Commonly used features in the classification task are orthographic features (e.g., use of capitals, digits), morphological prefixes and suffixes (e.g., ~cin, ~mide), part-of-speech features, head noun words (e.g., treatment, virus), and contextual trigger words such as verb triggers (e.g., activate, regulate).
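
A hedged sketch of such a feature extractor, producing a feature dictionary per token (the feature names, window size and trigger list below are invented for illustration):

```python
TRIGGER_VERBS = {"activate", "regulate", "bind", "inhibit"}  # illustrative list

def token_features(tokens, i):
    """Orthographic, morphological and contextual features for token i,
    in the spirit of the feature sets described above."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "has_upper": any(c.isupper() for c in tok),   # orthographic
        "has_digit": any(c.isdigit() for c in tok),
        "has_hyphen": "-" in tok,
        "prefix3": tok[:3].lower(),                   # morphological
        "suffix3": tok[-3:].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }
    # Contextual trigger: a verb trigger within a +/- 3 token window
    # (crudely stemmed by stripping a trailing "s" or "d").
    feats["near_trigger"] = any(
        t.lower().rstrip("sd") in TRIGGER_VERBS
        for t in tokens[max(0, i - 3):i + 4])
    return feats

f = token_features(["NF-kappaB", "activates", "IL-2", "transcription"], 2)
print(f["has_digit"], f["near_trigger"])  # -> True True
```

Such dictionaries are the typical input representation for maximum entropy or CRF sequence taggers.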

Biomedical names have many aliases (synonym terms, acronyms, morphological and derivational variants), reinforced by the ad hoc use of orthography such as capitals, spaces and punctuation (e.g., NF-Kappa B, NF Kappa B, NFkappaB and NF kappa B) and by inconsistent naming conventions (e.g., IL-2 has many variants such as IL2, Interleukin 2 and interleukin-2). On the other hand, names and their acronyms are often polysemous. Although exhibiting the same orthographic appearance, they can be classified into different semantic classes depending on the context (e.g., interleukin-2 is a protein in some contexts, but can be a DNA in another context; PA can stand for pseudomonas aeruginosa, pathology and pulmonary artery). Existing lexico-semantic resources in this domain typically lack the contextual information that supports the disambiguation of terms. This situation makes within- and cross-document noun phrase coreference resolution a necessity.
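
The purely orthographic variation can be partly collapsed with a simple normalization step; a minimal sketch (synonym aliases such as Interleukin 2 versus IL-2 still require a lexicon, and polysemous names still require context):

```python
import re

def normalize_name(name: str) -> str:
    """Collapse case, whitespace and hyphen variation in a biomedical name.
    A rough sketch, not a full normalizer."""
    return re.sub(r"[\s\-]+", "", name.lower())

variants = ["NF-Kappa B", "NF Kappa B", "NFkappaB", "NF kappa B"]
print({normalize_name(v) for v in variants})  # -> {'nfkappab'}
print(normalize_name("IL-2") == normalize_name("IL2"))  # -> True
```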

New terms and their corresponding acronyms are invented at a high rate while old ones are withdrawn or become obsolete.

Although the earliest systems relied on handcrafted extraction patterns, current named entity recognizers in the biomedical domain use machine learning techniques. The results of a hidden Markov model (Zhang et al., 2004) average 66.5% F-measure over 22 assigned categories of the GENIA ontology. The F-measures range from 80% (category body part) to 0% (e.g., categories atom, inorganic). The lack of sufficient training examples in this experiment and the resulting low recall are an important factor in the low F-measure for certain categories. Kou et al. (2005) made a comparative study of protein recognition on the GENIA corpus. The results are 66% in terms of F-measure when training a maximum entropy classifier and 71% when training a conditional random field classifier. The results of the CRF model could be improved by about 1% through an extension of conditional random fields (semi-CRFs) that enables more effective use of dictionary information as features. Lee et al. (2004) train a support vector machine and consider entity boundary detection and entity classification as two complementary tasks. Tests with the


GENIA corpus yield a best F-measure of 74.8% for the boundary detection task and of 66.7% for the entity classification task. Finkel et al. (2005) use a maximum entropy model for detecting gene and protein names in biomedical abstracts. Their system competed in the BioCreative comparative evaluation and achieved a precision of 83% and a recall of 84% (F-measure of 83%) in the open evaluation, and a precision of 78% and a recall of 85% (F-measure of 83%) in the closed evaluation. In the open evaluation, extra resources in the form of gazetteers (i.e., lists of names) or related texts were used.

The detection of entity boundaries in biomedical information extraction is a problem in itself. The identification of the boundaries is difficult because of the diverse use of modifiers such as adjectives, past participles or modifying prepositional phrases, and it is hard to distinguish whether a modifier is included in the named entity or not. Including statistics on the collocational use of the separate terms as extra features seems useful here. Finkel et al. (2005) also stress the importance of correct boundary detection as a way of improving the named entity recognition task in biomedical texts. In their research many errors (37% of false positives and 39% of false negatives) stem from incorrect name boundaries.

Recall (or false negative) errors are caused by patterns not seen in the training set, i.e., the classifier does not know the name and/or contextual pattern. A correctly performed initial normalization of the training and test examples seems very useful, especially a syntactic normalization with syntactic equivalence rules. However, this is not always easy. For instance, it is not simple to detect an instance of a coordinated noun phrase where the modifier is attached to only one of the phrases but modifies all of the coordinated members.
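
For the simplest coordination cases, such as the 91 and 84 kDa proteins example mentioned earlier, a narrow handcrafted rule can already normalize the phrase into two full mentions. This sketch covers only that single numeric pattern; real systems need syntactic analysis:

```python
import re

# One pattern-specific rule: "<num> and <num> kDa <head>" (invented scope).
COORD = re.compile(r"^(\d+)\s+and\s+(\d+)\s+(kDa\s+\w+)$")

def expand_coordination(phrase: str):
    """Expand a numeric coordination like '91 and 84 kDa proteins' into
    two full mentions; leave anything else untouched."""
    m = COORD.match(phrase)
    if not m:
        return [phrase]
    a, b, head = m.groups()
    return [f"{a} {head}", f"{b} {head}"]

print(expand_coordination("91 and 84 kDa proteins"))
# -> ['91 kDa proteins', '84 kDa proteins']
```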

Researchers seem to agree that in order to improve named entity recognition in the biomedical domain, we must explore other avenues, including better exploitation of existing features and resources, development of additional features, incorporation of additional external resources, and experimentation with other algorithms and strategies for approaching the task.

Named entity recognition is a first step toward more advanced extraction tasks such as the detection of protein-protein interactions, protein-nucleotide interactions, gene regulation events, the subcellular location of proteins and pathway discovery. These complex tasks involve relation detection. Current progress in genomics and proteomics projects worldwide has generated an increasing number of new proteins, whose biochemical functional characterizations are continuously being discovered and reported.

Entity relation recognition can be based on hand-built grammars with which the texts are partially parsed. An example is the research of Leroy et al. (2003). They use cascaded finite state automata to structure



relations between individual biomedical entities. In an experiment considering 26 abstracts they obtained 90% precision. Gaizauskas et al. (2000) built an extraction system that heavily relies on handcrafted information resources, which include case-insensitive terminology lexicons (the component terms of various categories), morphological cues (i.e., standard biochemical suffixes) and handcrafted grammar rules for each class. The system is applied to the extraction of information about enzymes and metabolic pathways and about protein structure.

More advanced techniques use machine learning for protein relation extraction. Ramani et al. (2005) recovered 6,580 interactions among 3,737 human proteins in Medline abstracts. Their algorithm has three parts. First, human protein names are identified in Medline abstracts using a recognizer based on conditional random fields. Then, interactions are identified by the co-occurrence of protein names across the set of Medline abstracts. Finally, valid interactions are filtered with a Bayesian classifier. A similar approach is taken by Huang et al. (2004), who aligned sentences in which protein names had already been identified. Similar patterns found in many sentences could be extracted as protein relations.
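
The co-occurrence step of such a pipeline can be sketched as follows. The protein names and threshold are toy data; the real system operates on CRF-tagged Medline abstracts and adds the Bayesian filtering stage:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_candidates(abstracts, min_count=2):
    """Propose a protein-protein interaction for every pair of recognized
    names that co-occurs in at least `min_count` abstracts.
    `abstracts` is a list of protein-name lists, i.e., NER output."""
    counts = Counter()
    for proteins in abstracts:
        for pair in combinations(sorted(set(proteins)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Toy output of a protein tagger over four abstracts:
tagged = [["BRCA1", "TP53"], ["TP53", "MDM2"], ["BRCA1", "TP53"], ["BRCA1"]]
print(cooccurrence_candidates(tagged))  # -> {('BRCA1', 'TP53'): 2}
```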

Literature-based gene expression analysis is a current research topic. Texts that describe genes and their function are an important source of information for discovering functionally related genes and genes that are simultaneously expressed. The texts give an additional justification and explanation (Glenisson et al., 2003).

The function of a protein is closely correlated with its subcellular location. With the rapid increase of new protein sequences entering the data banks, textual sources might help to expedite the determination of protein subcellular locations. Stapley et al. (2002) evaluated the recognition of 11 location roles in Medline abstracts and obtained F-measures ranging from 31% to 80% depending on the location class.

Pathway prediction aims at identifying a series of consecutive enzymatic reactions that produce specific products, in order to better understand the physiology of an organism, the effects of a drug, disease processes, and gene function. Complex biomedical extraction tasks aim at predicting these pathways. The information extraction task is similar to detecting an event or scenario that takes place between a number of entities and to identifying how the actions that constitute the scenario are ordered (e.g., in a sequence of reactions of a pathway). This means that the clausal and textual levels of analysis become relevant and that we will have to resort to event extraction and scenario building technologies to solve this problem. Research on pathway recognition has already been done by Friedman et al. (2001).


The overview given here is far from exhaustive. The biomedical literature is full of experiments that report on information extraction from textual sources and on the integration of data extracted from unstructured texts with structured data. Biomedical information is also increasingly extracted from figures and figure captions.

9.5 Intelligence Gathering

Police and intelligence services are charged with collecting, extracting, summarizing, analyzing and disseminating criminal intelligence data gathered from a variety of sources. In many cases the sources are just plain text. Processing these data and extracting information from them is critical to the strategic assault on national and international crime. The information is necessary to combat organized criminal groups and terrorists that could threaten state security.

Most criminal data are structured and stored in relational databases, in which data are represented as tuples with attributes describing various fields, such as the attributes of a suspect or the address of a crime scene. Unstructured data, such as free-text narrative reports, are often stored as text objects in databases or as text files. Valuable information in such texts is difficult to access or to use efficiently by crime investigators in further analyses. Recognizing entities, their attributes and relations in the texts is very important in the search for information, for crime pattern recognition and for criminal investigation in general. Combined with factual data in databases, the extracted information is very helpful as an analysis tool for the police.

We can make a distinction between the open and closed data sources of the intelligence services. The open sources are publicly available, have a variable degree of reliability, and include Web pages, files downloadable via the Internet, newsgroup accounts, magazine and news articles, and broadcast information. Closed sources have secured access and are available only to certain authorized persons. They include police and intelligence reports, internal documentation, investigation reports and "soft" information (i.e., noted information on suspicious behavior). The sources are not only composed of texts, but are increasingly in multimedia format. The textual sources are often of a multilingual nature.

Police forces and intelligence services worldwide are starting to use commercial mining tools, but these are not always adapted to their specific needs. On the other hand, research into the specific demands of extraction systems that operate in this application domain is limited or is not publicly


available. MUC-3 and MUC-4 already covered news articles on the subject of Latin American terrorism. DARPA recently started the research program Evidence Extraction and Link Discovery (EELD). The purpose of this project is the development of accurate and effective systems for scanning large amounts of heterogeneous, multilingual, open-source texts (news, Web pages, e-mails, etc.). The systems should identify entities, their attributes, and their relations in a larger story (scenario) in order to detect crime patterns, gangs and their organizational structure, and suspect activities (e.g., a person John B drives a white Alfa Romeo).

In all of the tasks described above, entity recognition is of primordial importance. The entities are first of all the common ones such as person, organization and location names and timexes, but they also comprise car brands, types of weapons, money amounts and narcotic drugs. In addition, it is very important to link the entities to each other, where the link is semantically typed.

There are very few evaluations of the performance of named entity recognizers that operate on police and intelligence texts. Chau and Xu (2005) trained a neural network pattern recognizer combined with a dictionary of popular names and a few handcrafted rules in order to detect and classify the entities in 36 reports that were randomly selected from the Phoenix Police Department database for narcotics-related crimes. The reports were relatively noisy: They are all written in uppercase letters and contain a significant number of typos, spelling errors, and grammatical mistakes. The following entities were recognized: persons (with a precision of 74% and a recall of 73%), addresses (with a precision of 60% and a recall of 51%), narcotic drugs (with a precision of 85% and a recall of 78%) and personal property (with a precision of 47% and a recall of 48%). These numbers differ sharply from the precision and recall numbers for entities extracted from news texts. A first reason regards the orthographical and grammatical errors found in these texts. Secondly, entities other than person names, organizations and locations, such as drug names, crime addresses, weapons and vehicles, are also relevant to crime intelligence analysis, but they are sometimes more difficult to extract as the contextual patterns are more ambiguous (e.g., Peter B. gave the Kalashnikov to Sherly S. in Amsterdam Centraal. and Peter B. gave the Cannabis to Sherly S. in Amsterdam Centraal.). These additional entity types do not often change names, so that external lexico-semantic resources can easily be used and maintained, unless the entities have code names in the captured messages.
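
A toy stand-in for such a hybrid recognizer, combining a small drug lexicon with one handcrafted rule for person names in all-uppercase reports (the lexicon entries, the rule and the sample report are invented and far cruder than the neural approach described above):

```python
import re

DRUG_LEXICON = {"CANNABIS", "HEROIN", "COCAINE"}  # illustrative entries
# Handcrafted rule: a word followed by an abbreviated surname, e.g. "PETER B."
NAME_PATTERN = re.compile(r"\b([A-Z][A-Z]+ [A-Z])\.")

def tag_report(text):
    """Dictionary lookup plus one rule over an all-uppercase narrative report."""
    entities = []
    for m in NAME_PATTERN.finditer(text):
        entities.append(("PERSON", m.group(1) + "."))
    for token in re.findall(r"[A-Z]+", text):
        if token in DRUG_LEXICON:
            entities.append(("DRUG", token))
    return entities

report = "PETER B. GAVE THE CANNABIS TO SHERLY S. IN AMSTERDAM."
print(tag_report(report))
# -> [('PERSON', 'PETER B.'), ('PERSON', 'SHERLY S.'), ('DRUG', 'CANNABIS')]
```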

Noun phrase coreference resolution is of absolute importance. In particular, persons and their references need to be tracked within and across documents. As in any other application domain, we have to disambiguate the names and their aliases. Criminals very often use, or are referred to by, different names that orthographically might be completely different (e.g., Peter B. aliased as Petro and The big sister), making the task of name tracking a special challenge.

To our knowledge, research into relation recognition is very limited. For instance, the extraction of subordination relations between entities was investigated in 1000 intelligence messages in order to construct the hierarchies of organizations (Crystal and Pine, 2005). The relations are detected as explicitly stated connections between two entity mentions (e.g., Muhammad Bakr al-Hakim is a leader of Iraq's largest Shiite political party is classified as a leadership relation). The authors report F-scores of 91% for the recognition of names, 83% for entity coreference resolution and 79% for subordination relation detection. The scores assume 50% partial credit assigned to "easily correctable" errors such as misplaced name boundaries (e.g., including a title in a person name). Both in the biomedical and in the police and intelligence domains, the recognition of relations between entities is important. Whereas in the former domain one can rely on many instances to affirm the validity of a detected relation, in the police and intelligence domain a single instance of a relationship can be of primordial importance.

Police and intelligence services are also interested in building a profile of an entity based on a corpus of documents. The extraction system should collect entity-centric information based on coreference and alias links. Different attributes of an entity should be detected (e.g., Peter B. has red hair; the car has a damaged back light).
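
Such entity-centric profile building can be sketched as a simple aggregation over an alias table produced by coreference resolution (the aliases and attributes below are toy data; extracting them is the hard part):

```python
from collections import defaultdict

def build_profiles(mentions, aliases):
    """Collect attributes per entity across documents, first mapping each
    mention to its canonical name via the (assumed given) alias table."""
    profiles = defaultdict(set)
    for name, attribute in mentions:
        profiles[aliases.get(name, name)].add(attribute)
    return dict(profiles)

aliases = {"Petro": "Peter B.", "The big sister": "Peter B."}
mentions = [("Peter B.", "has red hair"),
            ("Petro", "drives a white Alfa Romeo"),
            ("The big sister", "seen in Amsterdam")]
print(build_profiles(mentions, aliases))
```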

As seen in the previous section, the extraction, resolution and ordering of temporal expressions (timexes) are valuable tasks when processing news stories. In the police and intelligence domain, they are of primordial importance. Temporal information is crucial to link persons, to link persons to events, to track people and the events they participated in, and to link events. Extracting temporal information from the texts involves the recognition of the temporal expressions and the classification of the temporal relations: anchoring to a time (e.g., when exactly an event takes place is often relative to the time of detection of a crime and is often vaguely described, e.g., Sometime before the assassin meeting, the two men must have flown into Chicago), ordering between temporal events, aspectual relations (detecting the different phases of an event) and subordinating relations (events that syntactically subordinate other events).
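
The anchoring step can be sketched for a handful of relative expressions; the expression list below is deliberately tiny and invented, while real timex resolvers handle calendar dates, durations, and vague expressions as well:

```python
import re
from datetime import date, timedelta

RELATIVE = {"yesterday": -1, "today": 0, "tomorrow": 1}  # illustrative map

def resolve_timexes(text, anchor):
    """Recognize a few relative temporal expressions and anchor them to a
    reference date, e.g., the date a report was filed."""
    resolved = []
    for m in re.finditer(r"\b(yesterday|today|tomorrow)\b", text, re.I):
        offset = RELATIVE[m.group(1).lower()]
        resolved.append((m.group(1), anchor + timedelta(days=offset)))
    return resolved

report_date = date(2006, 3, 15)
print(resolve_timexes("The suspect was seen yesterday near the station.",
                      report_date))
# -> [('yesterday', datetime.date(2006, 3, 14))]
```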



In police and intelligence settings the recognition and resolution of spatial information is also very valuable in order to, for instance, link persons to events. Processing spatial information goes beyond the recognition of location names; it also includes the resolution of spatial references (e.g., up there) and the placing of persons and objects (e.g., a car) in a spatial context (e.g., the city of Brussels is mentioned at the beginning of the text and a carjacking is mentioned at the end: does the carjacking take place in Brussels?). Spatial information is often vague, ambiguous and hard to extract.

The extraction of temporal and spatial relations demands correctly annotated corpora. These are not easy to obtain given the ambiguity and vagueness of the language. Some resources for evaluation and training are available. There are the TimeBank data, in which timexes and their relations are annotated. In addition, corpora labeled with temporal information gathered for the task of information retrieval are becoming available (e.g., the AQUAINT corpus).

The entities in which police and intelligence services are interested are often the building blocks of an event description. Many different types of events are of interest. The "what", "who", "where", "when" and "frequency" of a meeting or a crime can be extracted. The "who", "when", "length", "content" and "frequency" of a phone call can be identified. Other types of events (e.g., delivery, travel) for which information has to be collected are possible.

In this domain, scenario or script extraction is relevant in order to classify a set of consecutive actions (e.g., describing a set of actions as a bank robbery) or content that is linked by certain types of rhetorical relations (e.g., causal relations). To our knowledge, research on scenario and script recognition does not exist apart from the use of symbolic knowledge inspired by the theories of Schank described in Chap. 2.

The police and intelligence domain also shows that we should not be tempted to reduce the content of a text to certain extracted information stored in a template database representation. Sometimes very small details (e.g., a Maria statue on the dashboard of a car) become the key to certain links between the information. In addition, in this domain text is not the only source of information that naturally occurs in unstructured format. Captures from surveillance cameras, images and video cannot be forgotten. Any search and analysis tool will have to deal with these multimedia formats. There is, however, the restriction that many of the information sources are closed sources and are not freely available for training and testing pattern recognizers.


9.6 Information Extraction from Business Texts

The business domain is a domain where structured data go hand in hand with unstructured data, the latter being mostly in the form of text. The text corpora consist of technical documentation, product descriptions, contracts, patents, Web pages, financial and economic news, and increasingly also informal texts such as blogs and consumer discussions. Data mining is well established in business communities, which explains why the mining of texts has also become highly valued. For these tasks, commercially available data and text mining software is often used, offering, however, only a rudimentary solution to the extraction problems. Classical commercial software offers functionality for the clustering of texts, the clustering of terms, the categorization of texts and named entity recognition.

The oldest applications of information extraction technologies are found as part of the processing of technical documentation (e.g., for spacecraft maintenance). In these documents natural language text is interwoven with structured information. Because the documents often have a strict formal organization and follow a number of stylistic conventions, their formal characteristics can be fixed and enforced by a drafting tool. Nevertheless, not all content can be structured at drafting time, which leaves room for the extraction of specific information, especially for answering unforeseen questions that users pose.

Businesses are concerned about their competitive intelligence. They want to actively monitor available information that is relevant for the decision making process of the company. They can use publicly available sources (e.g., the World Wide Web) in order to detect information on competitors or competitive products, by which they might assess the competitive threat. The extracted information can be used in order to react more quickly (e.g., when one of their products has received negative reviews). Information extraction can also be used to find new business opportunities.

Up until now, the extraction technologies have usually concerned named entity recognition. Apart from the common named entities such as product brands, organizations and persons, typical business entities can be defined, among which are prices and properties of products (e.g., dimensions, availability dates), which can often be seen as attributes of the entities. In this domain one of the earliest relation recognition tasks was developed based on handcrafted symbolic knowledge (e.g., Young and Hayes, 1985). Supervised learning techniques were used in the recognition of company mergers and succession relations in management functions (e.g., Soderland,



1999). Unsupervised learning techniques could extract patterns in the financial domain (e.g., Xu et al., 2002).

A problem that can be encountered is that information is often presented in a semi-structured format, e.g., Web texts where layout characteristics coded in the HyperText Markup Language (HTML) play an important role, or business forms with structured information coded in the Extensible Markup Language (XML). The structured characteristics of these documents are often very helpful in order to extract the right information. The problem is that the structured characteristics are usually not standardized (e.g., layouts or document architectures differ), requiring many annotated texts in order to train suitable systems.
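
A wrapper in the sense of Kushmerick (2000) exploits exactly this rigid markup structure; a minimal sketch over an invented product table, reduced to a single regular expression (real wrappers are induced from examples and are more robust than this):

```python
import re

# One rigidly formatted row pattern: <tr><td>name</td><td>price</td></tr>
ROW = re.compile(r"<tr><td>(.*?)</td><td>(.*?)</td></tr>", re.S)

def extract_prices(html):
    """Extract (product, price) pairs from a fixed-layout HTML table."""
    return [(name.strip(), price.strip()) for name, price in ROW.findall(html)]

page = """<table>
<tr><td>Widget A</td><td>19.95</td></tr>
<tr><td>Widget B</td><td>24.50</td></tr>
</table>"""
print(extract_prices(page))  # -> [('Widget A', '19.95'), ('Widget B', '24.50')]
```

The brittleness of such a wrapper when the layout changes illustrates why non-standardized document structures require many annotated examples.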

Besides extracting named entities from Web pages, the latest trend is to monitor and analyze the most up-to-date online news and blog posts, providing immediate insight into consumer discussions, perceptions and issues that could impact brand reputation. Information extraction technology here delivers true market intelligence and provides brand managers, product and marketing professionals with the critical analysis necessary to clearly understand consumer discussions relating to companies, products and competitors. Studies on sentiment or attitude tracking are still limited. We refer the interested reader to Hearst (1992), Finn et al. (2002), Dave et al. (2003), Pang and Lee (2004), Riloff et al. (2005), and Shanahan et al. (2006). Sentiment tracking offers a whole new area of research into information extraction where the technologies discussed in this book can be applied.

The business domain will certainly be a large client of extraction technology and offers many opportunities for research. We foresee a growing demand for automated syntheses of information and their presentation (e.g., comparisons of prices) that are generated on the basis of flexible information queries. Wrappers that extract information from highly structured sources such as the Web have been developed (Kushmerick, 2000). The business domain offers new ground for research in information extraction. When using the World Wide Web as an information source, scalability problems have to be addressed.
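Wrapper induction (Kushmerick, 2000) learns delimiter-based extraction rules from labeled pages. As a hedged sketch of the underlying idea only, the following toy left-right (LR) wrapper applies hand-specified delimiters; the HTML snippet, the delimiter strings and the slot layout are invented for illustration, not taken from Kushmerick's work:

```python
import re

# Toy HTML product listing; in a real setting this would be fetched from the Web.
html = (
    "<ul>"
    "<li><b>Camera X100</b> &mdash; <i>EUR 299</i></li>"
    "<li><b>Camera Z5</b> &mdash; <i>EUR 449</i></li>"
    "</ul>"
)

# An LR wrapper stores, per attribute slot, the delimiter strings that
# immediately precede and follow the value in the page layout.
wrapper = [("<b>", "</b>"), ("<i>", "</i>")]  # (product, price) slots

def extract(page, slots):
    """Apply the delimiter pairs in sequence to pull out one tuple per record."""
    pattern = "".join(
        re.escape(left) + r"(.*?)" + re.escape(right) + r".*?"
        for (left, right) in slots
    )
    return re.findall(pattern, page)

print(extract(html, wrapper))
# [('Camera X100', 'EUR 299'), ('Camera Z5', 'EUR 449')]
```

What wrapper induction adds is learning the delimiter strings automatically from a few annotated example pages; the fragility noted above (layouts are not standardized across sites) is exactly why a separate wrapper is typically needed per source.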

9.7 Information Extraction from Legal Texts

The legal field is perhaps the field where information is almost exclusively found in texts and where huge text collections are available. The repositories of legislation, court decisions and doctrinal texts are increasingly accessible via Web portals. These texts often combine structured with unstructured data, the former mostly referring to the typical architecture of the documents (e.g., legislation is divided into Books, Chapters, Sections and Articles) or to metadata that are typically manually assigned (e.g., date of enactment of an article).

Notwithstanding the large need for information extraction technologies in the legal domain, the use of this technology is very limited. The literature cites the recognition and tracing of named entities (mainly persons) (Branting, 2003). The tracing of persons regards the mapping of alias names and the disambiguation of names that have equal writings. Another extraction task regards the classification of sentences of court decisions according to their rhetorical role (e.g., offence, opinion, argument, procedure) (Moens et al., 1997; Grover et al., 2003; Hachey and Grover, 2005; Aouladomar, 2005). For the retrieval of court decisions and their use in case-based reasoning systems, it is important that the factors, i.e., the fact patterns that play a role in the decision, are assigned to the decision texts and to the arguments of the decisions in particular. The most extensive studies in assigning factors to court decisions were realized by Brüninghaus and Ashley (2001a).

There are different reasons for the low interest in using information extraction techniques in the law domain. A first problem deals with the language of the texts. Legal texts combine ordinary language with a typical legal vocabulary, syntax and semantics, making problems such as disambiguation, part-of-speech tagging and parsing more difficult than would be the case in other texts. Perhaps the most important causes of the slow integration of extraction technologies in the legal domain are a certain resistance to using automated means, the monopoly of a few commercial players that dominate the legal information market, and the past lack of international competitions and gold standards in the form of annotated corpora. In 2006 a legal track of the Text REtrieval Conference is planned.

Nevertheless, comparable to the police domain, there is a high demand for extracting named entities such as persons, organizations and locations, for linking them to certain acts or events, and for classifying these factual data into concepts, scripts or issues. The extracted data would be very useful in order to enhance the performance of information retrieval, to perform data mining computations (Stranieri and Zeleznikow, 2005) and to facilitate automated reasoning with cases (Brüninghaus and Ashley, 2001b). In addition, information extracted from legislative documents could be integrated in knowledge-based systems that automatically answer legal questions.
9.8 Information Extraction from Informal Texts

In all the above cases of information extraction more or less well-formed texts are processed. Often, however, we are confronted with informal texts from which we want to extract information. Examples are transcribed speech, spam texts, and instant messages that were generated through mobile services. If we can afford to annotate sufficient training examples, simple pattern matching approaches can already be of help. However, in many cases of informal texts the patterns change continuously (e.g., the different informal styles of different authors) or deliberately (e.g., in spam mail). Also, the natural language processing techniques on which we rely will not perform as adequately as they should. In the following section we will elaborate on the example of transcribed speech and refer to some other examples of informal texts.

Speech is transcribed to written text by means of automatic speech recognition (ASR) techniques. However, speech differs from written text because of the use of different discourse structures and stylistic conventions, and transcribed speech has to cope with the errors of the transcription.

Existing information extraction technologies do not perform well on transcribed speech. There are several reasons for this. Orthographic features such as punctuation, capitalization and the presence of non-alphabetic characters are usually absent in transcribed speech. Sentence boundary detection is difficult. Numbers are spelled out in places where digit sequences would be expected. When the vocabulary used by ASR does not contain all entities, detecting unknown names in texts is difficult because orthographic features cannot be used. In most current speech recognition systems, the size and content of the ASR vocabulary is predefined, and the recognizer will output the word in its lexicon that most closely matches the input audio stream. While the overall out-of-vocabulary rate is typically very low (< 1%) for most large-vocabulary systems, the out-of-vocabulary rate is significantly higher for words in proper name phrases, frequently ranging from 5% to more than 20% (Palmer and Ostendorf, 2000), and this rate usually differs depending on the type of noun phrase that is considered. The incompleteness of the ASR vocabulary is a common situation in domains where new names constantly surface (e.g., in news and in the business domain).

Most of the research in information extraction from speech regards named entity recognition. While recognition of named entities in news stories has attained F-measure values that are comparable with human performance (see supra), named entity recognition of speech data − both conversational speech and broadcast news speech − does not yet attain such a high performance.
The most interesting aspect in the development of information extraction from transcribed speech is the integration of explicit error handling in the extraction system, an idea originally postulated by Grishman (1998). In transcribed speech, errors corrupt the spoken text as words that are wrongly recognized. Consider a simple model in which the errors are created by sending the original text through a noisy channel, which both deletes some incoming tokens and inserts random tokens. Using a pattern recognizer that is trained on noiseless text will severely reduce the reliability of the information extractor.
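The token-level noisy channel just described can be simulated in a few lines. The deletion and insertion probabilities, the noise vocabulary and the toy "X joined Y" pattern below are invented for illustration and are not taken from Grishman's experiments:

```python
import random

random.seed(7)  # reproducible noise for the demo

def noisy_channel(tokens, p_del=0.15, p_ins=0.15, noise_vocab=("uh", "the", "a")):
    """Delete each incoming token with probability p_del and insert random
    tokens with probability p_ins, mimicking ASR errors as a noisy channel."""
    out = []
    for tok in tokens:
        if random.random() < p_ins:
            out.append(random.choice(noise_vocab))
        if random.random() >= p_del:
            out.append(tok)
    return out

# A rigid extraction pattern learned from clean text: "<Name> joined <Org>".
def extract_join(tokens):
    hits = []
    for i in range(len(tokens) - 2):
        if tokens[i + 1] == "joined":
            hits.append((tokens[i], tokens[i + 2]))
    return hits

clean = "Smith joined Acme . Jones joined Initech .".split()
print(extract_join(clean))                 # finds both relations on clean text
print(extract_join(noisy_channel(clean)))  # deletions/insertions break the pattern
```

A relation is lost whenever the channel deletes the trigger word or an argument, or inserts a token between them; this is the recall loss that error-aware extraction models try to compensate for.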

Grishman (1998) proposed a symbolic approach characterized by a noisy channel that may insert or delete a noun group or an individual token outside of a noun group. An experiment on MUC-6 texts showed that the model could attain precision values of 85%. But recall is very low: With a 30% word error rate, the extraction lost 85% in recall compared to the perfect transcript, meaning that with the model we miss many valid extraction patterns. Papageorgiou et al. (2004) recognized person, location and organization entity names based on a vocabulary name list and a finite-state based named entity recognition module. Although precision values are in the 90% range, recall values range between 31% and 53%. The authors blame the lack of proper names in the vocabulary of the ASR engine and the lack of grammar patterns used by the finite-state automaton.

Models can be designed that propagate a limited set of alternative analyses from one stage to the next, i.e., from the speech recognition to the extraction. Palmer and Ostendorf (2000) demonstrate the usefulness of this approach. These authors use a hidden Markov model while incorporating part-of-speech features. For the recognition of persons, locations, organizations, timexes and numeric expressions, their model could still attain F-measure rates above 71% for named entity recognition with a word error rate higher than 28% (evaluation on the DARPA-sponsored Hub-4 Broadcast News Transcription and Understanding data).
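The core of propagating alternative analyses can be sketched very simply: instead of extracting only from the single best ASR hypothesis, the extractor runs over each hypothesis in an n-best list and weights its output by the hypothesis probability. This is a minimal illustration of the idea, not Palmer and Ostendorf's model; the n-best list, the probabilities and the tiny gazetteer are all invented:

```python
from collections import defaultdict

# Each ASR hypothesis carries a posterior probability from the recognizer.
nbest = [
    (0.5, "meeting with alan turing tomorrow"),
    (0.3, "meeting with alan during tomorrow"),
    (0.2, "meeting with ellen turing tomorrow"),
]

known_names = {"alan turing", "ellen turing"}  # toy name gazetteer

def weighted_entities(hypotheses):
    """Extract gazetteer names from every hypothesis, accumulating each
    name's score as the total probability of the hypotheses containing it."""
    scores = defaultdict(float)
    for prob, text in hypotheses:
        tokens = text.split()
        for i in range(len(tokens) - 1):
            bigram = " ".join(tokens[i:i + 2])
            if bigram in known_names:
                scores[bigram] += prob
    return dict(scores)

print(weighted_entities(nbest))
# {'alan turing': 0.5, 'ellen turing': 0.2}
```

Note that a name missing from the 1-best string ("ellen turing") can still be recovered, with lower confidence, from a competing hypothesis; a downstream component can then threshold on the accumulated score.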

The use of features different from the typical text-based ones is also investigated in information extraction from speech. For instance, features can be considered that mark prosody, such as durational, intonational and energy characteristics (e.g., duration of pauses and of the final vowel, pitch relative to the speaker's baseline). In experiments, prosodic features did not have a notable effect on the F-measure of named entity recognition (Hakkani-Tür et al., 1999).

With the ACE (Automatic Content Extraction) competition we foresee a growing interest in information extraction from transcribed speech, as some ACE corpora contain this medium.

In general, informal texts are often ungrammatical. They are characterized by spelling errors, inconsistent use of capitalization patterns and ungrammatical constructions, making simple information extraction tasks such as named entity recognition more difficult and yielding lower accuracy rates than one would expect. Studies on information extraction from informal texts are very limited. Huang et al. (2001) and Jansche and Abney (2002) studied extraction of caller names and phone numbers from voicemail transcripts. Rennie and Jaakkola (2005) extracted named entities from e-mails and bulletin board texts. E-mails often demand inferences from humans for their correct interpretation, as content might be left out. The sender and receiver typically share a context, which must be inferred (Minkov et al., 2004). The best that can be done here is taking into account contextual documents. Spam mail is often ungrammatical and its vocabulary is malformed in order to mislead spam filters. Such a situation restricts the application of standard natural language processing tools.
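As noted above for transcribed speech, numbers in voicemail transcripts appear as spelled-out words. A simple pattern-matching baseline for the phone-number task studied by Huang et al. (2001) and Jansche and Abney (2002) can be sketched as follows; this is a hedged toy baseline of our own, not their method, and the transcript and minimum-length heuristic are invented:

```python
# Map spelled-out number words back to digits before applying a length heuristic.
WORD2DIGIT = {"zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
              "four": "4", "five": "5", "six": "6", "seven": "7",
              "eight": "8", "nine": "9"}

def extract_phone(transcript, min_len=7):
    """Return the first run of at least min_len digit words, as a digit string."""
    digits = []
    for tok in transcript.lower().split():
        if tok in WORD2DIGIT:
            digits.append(WORD2DIGIT[tok])
        elif digits and len(digits) >= min_len:
            break            # a sufficiently long run of digit words just ended
        else:
            digits = []      # run too short to be a phone number; reset
    return "".join(digits) if len(digits) >= min_len else None

msg = "hi it's dana call me back at five five five one two zero nine thanks"
print(extract_phone(msg))  # '5551209'
```

Such rigid patterns degrade quickly when transcription errors corrupt the digit run, which is why the cited studies resort to statistical models trained on annotated transcripts.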

9.9 Conclusions

The case studies demonstrate that information extraction is currently heavily researched. The technologies and algorithms are generically used across domains. While early research primarily relied on symbolic patterns that were manually acquired, current technology mostly focuses on machine learning of the recognition patterns, while handcrafted resources serve as complementary knowledge sources. Current extraction tasks regard named entity recognition, noun phrase coreference resolution (recognition of alias names and disambiguation of polysemous names) and recognition of relations between entity names. In the future we foresee that extraction technologies will be used to build complex scenarios, profiles and scripts and will be integrated in advanced coreference resolution across documents.

The problems encountered in information extraction are much the same across the different domains. The lack of annotated examples (and the corresponding lack of symbolic knowledge when rules are handcrafted) that cover the variety of linguistic expressions is omnipresent. Secondly, the need to find more advanced features to combat certain ambiguities in the learned patterns is also apparent. Increasingly we are confronted with informal texts (e.g., speech, blogs, mails) posing additional challenges to their processing. When dealing with these "noisy" texts, the problems only accumulate, demanding research in the years to come.

Last but not least, the need for information synthesis is very much present in all applications that attempt to answer complex information questions. For instance, in news we want to detect and link information about events. In the biomedical domain, we want to automatically discover complex biological scenarios from texts. Police and intelligence services demand the linking of persons and events in texts, allowing them to mine complex criminal patterns. In the business domain, we want to link entities to attribute values such as detailed product information and consumer attitudes. In the legal domain researchers are interested in building complex case representations that will be used in case-based reasoning, or in automatically translating legislation into the rules of knowledge-based systems that some day might substitute for human judges.
In Chap. 7 we have seen that the information queries of users are very flexible and that we should not be tempted to represent a document that is used in an information system as a template containing only certain extracted information; rather, the extracted information acts as additional descriptors besides the words of the text. These findings form the basis of the last chapter of this book, where special attention will go to the role of information extraction in retrieving and synthesizing information.

9.10 Bibliography

Aouladomar, Farida (2005). Some foundational linguistic elements for QA systems: An application to e-government services. In Proceedings of the Eighteenth JURIX Conference on Legal Knowledge and Information Systems (pp. 81-90). Amsterdam: IOS Press.

Ashburner, Michael et al. (2000). Gene ontology: Tool for the unification of biology: The Gene Ontology Consortium. Nature Genetics, 25, 25-29.

Bikel, Daniel M., Richard Schwartz and Ralph M. Weischedel (1999). An algorithm that learns what's in a name. Machine Learning, 34, 211-231.

Branting, L. Karl (2003). A comparative evaluation of name-matching algorithms. In Proceedings of the 9th International Conference on Artificial Intelligence and Law (pp. 224-232). New York: ACM.

Brüninghaus, Stefanie and Kevin D. Ashley (2001a). Improving the representation of legal case texts with information extraction methods. In Proceedings of the 8th International Conference on Artificial Intelligence and Law (pp. 42-51). New York: ACM.

Brüninghaus, Stefanie and Kevin D. Ashley (2001b). The role of information extraction for textual CBR. In Proceedings of the 4th International Conference on Case-Based Reasoning – Lecture Notes in Computer Science (pp. 74-89). Berlin: Springer.

Chau, Michael and Jennifer J. Xu (2005). CrimeNet explorer: A framework for criminal network knowledge discovery. ACM Transactions on Information Systems, 23 (2), 201-226.
Crystal, Michael R. and Carrie Pine (2005). Automated org-chart generation from intelligence message traffic. In Proceedings of the 2005 International Conference on Intelligence Analysis.

Culotta, Aron and Jeffrey Sorensen (2004). Dependency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (pp. 424-430). East Stroudsburg, PA: ACL.

Dave, Kushal, Steve Lawrence and David M. Pennock (2003). Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In Proceedings of the Twelfth International World Wide Web Conference. New York: ACM.

Dowman, Mike, Valentin Tablan, Cristian Ursu, Hamish Cunningham and Borislav Popov (2005). Semantically enhanced television news through Web and video integration. In Proceedings of the World Wide Web Conference. New York: ACM.

Finkel, Jenny, Shipra Dingare, Christopher D. Manning, Malvina Nissim, Beatrice Alex and Claire Grover (2005). Exploring the boundaries: Gene and protein identification in biomedical text. BMC Bioinformatics, 6 (Suppl 1): S5.

Finn, Aidan, Nicholas Kushmerick and Barry Smyth (2002). Genre classification and domain transfer for information filtering. In Fabio Crestani, Mark Girolami and Cornelis J. van Rijsbergen (Eds.), Proceedings of ECIR 2002, 24th European Colloquium on Information Retrieval Research. Heidelberg: Springer.

Friedman, Carol, Pauline Kra, Hong Yu, Michael Krauthammer and Andrey Rzhetsky (2001). GENIES: A natural language processing system for the extraction of molecular pathways from journal articles. ISMB (Supplement of Bioinformatics), 74-82.

Gaizauskas, Robert J., George Demetriou and Kevin Humphreys (2000). Term recognition and classification in biological science journal articles. In Proceedings of the Computational Terminology for Medical and Biological Applications Workshop of the 2nd International Conference on NLP (pp. 37-44).

Glenisson, Patrick, Janick Mathijs, Yves Moreau and Bart De Moor (2003). Meta-clustering of gene expression data and literature-extracted information. SIGKDD Explorations, Special Issue on Microarray Data Mining, 5 (2), 101-112.

Grefenstette, Gregory, Yan Qu, James G. Shanahan and David A. Evans (2004). Coupling niche browsers and affect analysis for opinion mining. In Proceedings of RIAO 2004. Paris: Centre des Hautes Études.

Gooi, Chung Heong and James Allan (2004). Cross-document coreference on a large scale corpus. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (pp. 9-16). East Stroudsburg, PA: Association for Computational Linguistics.

Grishman, Ralph (1998). Information extraction and speech recognition. In Proceedings of the Broadcast News Transcription and Understanding Workshop (pp. 159-165).

Grishman, Ralph, Silja Huttunen and Roman Yangarber (2002). Information extraction for enhanced access to disease outbreak reports. Journal of Biomedical Informatics, 35, 236-246.
Grover, Claire, Ben Hachey, Ian Hughson and Chris Korycinski (2003). Automatic summarization of legal documents. In Proceedings of the 9th International Conference on Artificial Intelligence and Law (pp. 243-251). New York: ACM.

Hakkani-Tür, Dilek, Gökhan Tür, Andreas Stolcke and Elizabeth Shriberg (1999). Combining words and prosody for information extraction from speech. In Proceedings of EUROSPEECH '99, 6th European Conference on Speech Communication and Technology.

Hearst, Marti (1992). Direction-based text interpretation as an information access refinement. In Paul Jacobs (Ed.), Text-Based Intelligent Systems. Hillsdale, NJ: Lawrence Erlbaum.

Huang, Jing, Geoffrey Zweig and Mukund Padmanabhan (2001). Information extraction from voicemail. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (pp. 290-297). San Mateo, CA: Morgan Kaufmann.

Huang, Minlie et al. (2004). Discovering patterns to extract protein-protein interactions from full text. Bioinformatics, 20 (18), 3604-3612.

Jansche, Martin and Steven P. Abney (2002). Information extraction from voicemail transcripts. In Proceedings of Empirical Methods in Natural Language Processing. East Stroudsburg, PA: ACL.

Kanehisa, Minoru, Susumu Goto, Shuichi Kawashima, Yasushi Okuno and Masahiro Hattori (2004). The KEGG resource for deciphering the genome. Nucleic Acids Research, 32, D277-D280.

Kushmerick, Nicholas (2000). Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118, 15-68.

Kou, Zhenzhen, William W. Cohen and Robert F. Murphy (2005). High-recall protein entity recognition using a dictionary. Bioinformatics, 21 (Suppl 1), i266-i273.
Lee, Ki-Joong, Young-Sook Hwang, Seonho Kim and Hae-Chang Rim (2004). Biomedical named entity recognition using a two-phase model based on SVMs. Journal of Biomedical Informatics, 37, 436-447.

Leroy, Gondy, Hsinchun Chen and Jesse D. Martinez (2003). A shallow parser based on closed-class words to capture relations in biomedical text. Journal of Biomedical Informatics, 36, 145-158.

Li, Xin, Paul Morie and Dan Roth (2004). Robust reading: Identification and tracing of ambiguous names. In Proceedings of the Human Language Technology Conference (pp. 17-24). East Stroudsburg, PA: ACL.

Marcotte, Edward M., Ioannis Xenarios and David Eisenberg (2001). Mining literature for protein-protein interactions. Bioinformatics, 17 (4), 359-363.

Minkov, Einat, Richard C. Wang and William W. Cohen (2004). Extracting personal names from emails. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP) (pp. 443-450). East Stroudsburg, PA: ACL.

Moens, Marie-Francine, Caroline Uyttendaele and Jos Dumortier (1997). Abstracting of legal cases: The SALOMON experience. In Proceedings of the 6th International Conference on Artificial Intelligence and Law (pp. 114-122). New York: ACM.
Ng, Vincent and Claire Cardie (2002). Improving machine learning approaches to coreference resolution. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 104-111). San Francisco, CA: Morgan Kaufmann.

Ng, Vincent and Claire Cardie (2003). Weakly supervised natural language learning without redundant views. In Proceedings of the Human Language Technology Conference (pp. 183-180). East Stroudsburg, PA: ACL.

Palmer, David D. and Mari Ostendorf (2000). Improving information extraction by modelling errors in speech recognizer output. http://citeseer.ist.psu.edu/646402.html

Pang, Bo and Lillian Lee (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (pp. 115-124). East Stroudsburg, PA: ACL.

Papageorgiou, Harris, Prokopis Prokopidis, Iason Demiros, Nikos Hatzigeorgiou and George Carayannis (2004). CIMWOS: A multimedia retrieval system based on combined text, speech and image processing. In Proceedings of the RIAO 2004 Conference. Paris: Centre des Hautes Études.

Ramani, Arun K., Razvan C. Bunescu, Raymond J. Mooney and Edward M. Marcotte (2005). Consolidating the set of known human protein-protein interactions in preparation for large-scale mapping of the human interactome. Genome Biology, 6: R40.

Rennie, Jason D.M. and Tommi Jaakkola (2005). Using term informativeness for named entity detection. In Proceedings of the Twenty-Eighth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 353-360). New York: ACM.
Riloff, Ellen, Janyce Wiebe and William Phillips (2005). Exploiting subjectivity classification to improve information extraction. In Proceedings of the 20th National Conference on Artificial Intelligence (AAAI-05). Menlo Park, CA: AAAI Press.

Shanahan, James G., Yan Qu and Janyce Wiebe (Eds.) (2006). Computing Attitude and Affect in Text (The Information Retrieval Series 20). New York: Springer.

Soderland, Stephen (1999). Learning information extraction rules for semi-structured and free text. Machine Learning, 34 (1-3), 233-272.

Stapley, B.J., L.A. Kelley and M.J. Sternberg (2002). Predicting the sub-cellular location of proteins from text using support vector machines. In Pacific Symposium on Biocomputing (pp. 374-385).

Stranieri, Andrew and John Zeleznikow (2005). Data Mining in Law. New York: Springer.

Xu, Feiyu, Daniela Kurz, Jakub Piskorski and Sven Schmeier (2002). Term extraction and mining of term relations from unrestricted texts in the financial domain. In Proceedings of the 5th International Conference on Business Information Systems (BIS-2002) (pp. 304-310). Poznan, Poland.
Young, Sheryl R. and Philip J. Hayes (1985). Automatic classification and summarization of banking telexes. In Proceedings of the Second Conference on Artificial Intelligence Applications (pp. 402-408). Los Alamitos, CA: IEEE Press.

Zhang, Jie, Dan Shen, Guodong Zhou, Jian Su and Chew-Lim Tan (2004). Enhancing HMM-based biomedical named entity recognition by studying special phenomena. Journal of Biomedical Informatics, 37, 411-422.
10 The Future of Information Extraction in a Retrieval Context

10.1 Introduction

One of the most important goals of artificial intelligence is the simulation of human perception and cognition, among which is the understanding of natural language. Information extraction is a first step towards such an understanding. This book started with several definitions of information extraction that we found in the literature and which led to our own tentative definition (Chap. 1, p. 4) that matured through the course of the book into the following one:

DEFINITION

Information extraction is the identification, and consequent or concurrent classification and structuring into semantic classes, of specific information found in unstructured data sources, such as natural language text, providing additional aids to access and interpret the unstructured data by information systems.

In this book the focus is on the extraction of information from text. Extraction is adding meaning to content. In the case of textual content, this means adding meaning to a term, phrase, passage, or to a combination of them, notwithstanding the many variant expressions of natural language that convey the same meaning. In addition, many different types of meaning can be attached to the same text. Information extraction offers the building blocks that can be combined in order to interpret content. This is not a new idea. Conceptual terms have been added to content since the beginning of information retrieval research, but it is only now that technology allows us to perform this task on a large scale.

The initial chapters (Chaps. 2 and 3) were very much oriented towards legacy approaches of information extraction from textual sources. The main purpose was to draw attention to certain – maybe forgotten – aspects of extraction, more specifically to the identification of relationships between content that eventually leads to a partial understanding of the discourse. The work of Roger Schank and Marvin Minsky is very important in this respect. They taught us that content in texts is composed of small elements, which the author of the texts has combined in order to communicate a certain message.
A strong impetus for developing information extraction technology came from the Message Understanding Conferences (MUC), held in the 1980s and 1990s, currently succeeded by the Automatic Content Extraction (ACE) competition. Another solid stimulus for developing extraction technology currently originates from the biomedical field, where content becomes manageable only with the help of this technology. A third important factor regards the growing use of techniques of content recognition in multimedia.

When the different methods and algorithms were explained in the body of this book (Chaps. 4, 5 and 6), a large part of them involved machine learning techniques. They replace traditional symbolic techniques that rely on handcrafted knowledge. We elaborated on the important task of feature selection and presented the most important algorithms of supervised, weakly supervised and unsupervised learning that are currently used in information extraction. The algorithms often yield a probabilistic assignment of a semantic class that types the extracted information.

Automated text understanding is not a final goal. The information that is detected is used for some other task, such as the selection of information and problem solving. In Chap. 7 we discussed how extraction technology can be integrated in retrieval technology. Information extraction offers us building blocks to better interpret the content of queries and documents, allowing us to find answers to our information questions that are possibly distributed across multiple discourses and perhaps across multiple media. We have also seen that, due to efficiency constraints, we had better perform the extraction a priori, attach the generic and domain-specific labels to content elements, and store them in the data structures used for indexing.

Although the probabilistic nature of the extraction technology has several advantages, it does not make the evaluation of the results easier. In Chap. 8 we have given an overview of different evaluation frameworks and pleaded for an extrinsic evaluation of the extraction technologies.

In Chap. 9 we have listed a number of case studies that have revealed the current development of information extraction technologies in different domains. In addition, the major requirements for extraction technologies and (current) important bottlenecks were discussed.

As our definition suggests, extraction technology offers building blocks that aid access to and further processing of the information. In this book the latter concerned information retrieval.

In the next section we justify the model of information extraction and retrieval from the angle of linguistic theory and philosophy. Then follows a section on the most important findings of the book. The last two sections are devoted to future algorithmic developments in information extraction and retrieval.

10.2 The Human Needs and the Machine Performances

The model of information extraction and information retrieval that we have developed during the book in essence regards the assumption of a top-down creation of linguistic content, starting from ideas and translating them into the character strings of a text. Understanding a text regards the inverse process. The understanding is not the final goal. Users combine information at different levels of abstraction when they select information or solve problems.

We have explained the existence of a realizational chain when creating natural language texts, a finding that goes back to the Indian grammarian Panini in the 6th-5th century B.C. (see Kiparsky, 2002). According to this notion, meaning in a language is realized in the linguistic surface structure through a number of distinct linguistic levels, each of which is the result of a projection of the properties of higher, more abstract levels. Ideas are translated into the broad conceptual components of a text. Such a conceptual component is broken into subideas, which will be reflected in sentences. The meaning of a sentence or set of sentences starts as an idea in the mind of a writer; it then passes through the stage in which the event and all its participants are translated into a set of semantic roles, which are in their turn translated into a set of grammatical and lexical concepts. These are in their turn translated into the character sequences that we see written down on a page of paper. Such a model of text generation is also found in psychological studies (e.g., Kintsch and van Dijk, 1978).

Understanding text requires decoding it and can be seen as the inverse of the above process.1 Sperber and Wilson (1995) argue that the semantic

1 Generating a multimedia document, for instance a video, can be seen as a similar process. The semantic message is constructed in the mind of the producer and translated into scenes and eventually into scene shots. Multimedia analysis can be seen as the reverse of this process: starting from basic features, the semantic value chain gradually adds more and more semantics.



representations recovered by decoding are useful only as a source of hypotheses and evidence for the second communication process, the inferential one. The user of the information applies inference rules to the recovered semantic representations to formulate an answer to his or her information need.

Throughout this book it became clear that information extraction is concerned with the inverse process: processing character strings bottom up and translating them into semantic concepts at various levels of detail and of various types. Current information extraction systems extract only simple levels of meaning, yielding a simple form of natural language understanding, but one can imagine future systems that come close to a full understanding of text in all its facets. Information extraction presupposes that although the semantic information in a text and its linguistic organization are not immediately computationally transparent, they can nevertheless be retrieved by taking into account surface regularities. It attaches meaning to text by considering the meaning of small building blocks, relating these elements to each other, creating additional meaning elements that in their turn can again be linked, and so on. Information extraction can be seen as an iterative process that will ultimately result in ideas and macrostructures identified in the texts. This idea was already present in early theories that aimed at story and discourse understanding (e.g., the work of Schank in 1975). However, the ideas developed in this book differ from the early implementations of extraction systems, which used a top-down, anticipatory and rigid analysis; here we rather advocate a bottom-up, flexible and more generic attitude towards information extraction.

As such, information extraction offers building blocks of content at various levels of detail and from various angles that can be used in combination with the words of a text, when users search for information for their problem at hand, or when the machine infers an answer to their problem. Information extraction permits the access to, selection of, and even reasoning with information to be performed at different levels of meaning, thus offering possibilities that are very much valued in information retrieval, summarization and information synthesis.


10.3 Most Important Findings

10.3.1 Machine Learning

Nowadays, powerful computers are omnipresent, and the advancements in the processing of natural language text allow doing things that were unthinkable a few decades ago. Especially the availability of reliable learning systems makes advanced information extraction possible.

Machine learning techniques that learn the extraction patterns have many advantages. It is often worthwhile that a knowledge engineer acquires symbolic knowledge that can be unambiguously defined and shared by different applications. However, information extraction is often concerned with ambiguous textual patterns that require many discourse and linguistic features for their disambiguation, which are difficult to anticipate. Machine learning naturally allows considering many more contextual features than is usually the case with handcrafted rules. Moreover, language is an instrument of a society of living persons. As it is the reflection of the properties, thoughts, ideas, beliefs and realizations of that society, the extraction model should dynamically adapt to the changing patterns of a living language.

Machine learning for information extraction has still other advantages. There is a lesser building effort compared to extraction systems that rely on handcrafted extraction patterns. Annotation is usually considered easier than knowledge engineering. Moreover, the learning techniques allow a probabilistic assignment of the semantic labels, as usually insufficient training data or knowledge rules are available to cover all linguistic phenomena, or the system is confronted with unsolved ambiguities of the language due to content left implicit or purposely left ambiguous by the author. The distribution function of linguistic phenomena just has a very long tail. Handling the most common phenomena gets you 60% recall and precision relatively quickly; getting 100% recall and precision requires handling increasingly rare phenomena. Hence the advantage of using learning techniques that adhere to the maximum entropy principle in order to cope with incomplete data. In addition, the probabilities of the assignment can be considered in the further processing of the information, such as in a retrieval system.

The machine learning techniques have an additional advantage in the sense that information extracted in previous steps can become the features in more advanced recognition tasks. One can dynamically combine different feature sets from the discourse (possibly as the result of previous extractions) and even from the whole document collection.
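The probabilistic assignment of semantic labels discussed above can be made concrete with a small sketch. The following toy multinomial logistic (maximum entropy) token classifier outputs a probability for each label instead of a hard decision; the features, labels and training examples are invented for illustration and do not come from the book:

```python
import math
from collections import defaultdict

LABELS = ["PERSON", "LOCATION", "OTHER"]

def features(token, prev):
    # Binary contextual features; real systems would use many more.
    return {
        "cap": token[0].isupper(),
        "prev_in": prev == "in",
        "prev_mr": prev in ("Mr.", "Mrs."),
    }

def predict(weights, feats):
    # Softmax over linear scores: a maximum entropy model.
    scores = {y: sum(weights[(y, f)] for f, v in feats.items() if v)
              for y in LABELS}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

def train(data, epochs=200, lr=0.5):
    # Simple gradient ascent on the conditional log-likelihood.
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, gold in data:
            probs = predict(weights, feats)
            for y in LABELS:
                grad = (1.0 if y == gold else 0.0) - probs[y]
                for f, v in feats.items():
                    if v:
                        weights[(y, f)] += lr * grad
    return weights

train_data = [
    (features("Smith", "Mr."), "PERSON"),
    (features("Paris", "in"), "LOCATION"),
    (features("table", "the"), "OTHER"),
    (features("Jones", "Mrs."), "PERSON"),
    (features("Leuven", "in"), "LOCATION"),
]

w = train(train_data)
probs = predict(w, features("Moens", "Mr."))
best = max(probs, key=probs.get)
```

The probabilities in `probs` can be stored with the label and used in later processing, such as ranking in a retrieval system.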

We have also seen that the learning techniques have evolved and leave room for finding the similarity of structured objects. Kernel methods are useful to compare the structured objects that, for instance, represent a discourse or sentence. In addition, many approaches towards weakly supervised methods are being developed in order to relieve the annotation burden.
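As a minimal illustration of a kernel over structured objects, the sketch below counts the identical subtrees shared by two toy parse trees. It is a simplified stand-in for the convolution tree kernels used in practice, not a particular published algorithm:

```python
from collections import Counter

# Trees are (label, children...) tuples; string children are leaves.

def subtrees(t):
    # Enumerate every full subtree of t (including t itself).
    yield t
    for child in t[1:]:
        if isinstance(child, tuple):
            yield from subtrees(child)

def tree_kernel(t1, t2):
    # K(t1, t2) = number of identical subtree pairs.
    c1, c2 = Counter(subtrees(t1)), Counter(subtrees(t2))
    return sum(c1[s] * c2[s] for s in c1)

# Two toy parse trees sharing the noun phrase "the man".
s1 = ("S", ("NP", ("DT", "the"), ("NN", "man")), ("VP", ("V", "ran")))
s2 = ("S", ("NP", ("DT", "the"), ("NN", "man")), ("VP", ("V", "slept")))
k = tree_kernel(s1, s2)
```

A kernel like this lets a support vector machine compare whole sentence or discourse structures without flattening them into fixed feature vectors.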

10.3.2 The Generic Character of Information Extraction

The common feeling that information extraction can only be useful in processing limited subject domains, and thus only operates with a tunnel vision, has hopefully been dispelled. The case studies in Chap. 9 have demonstrated that many of the information extraction algorithms, features and tasks are used across domains and in open domains. The pattern recognition algorithms, whether they are supervised, weakly supervised or unsupervised, are very generic. Common natural language processing techniques such as part-of-speech taggers, sentence parsers, and general semantic parsers (e.g., detection of rhetorical structure) can be exploited to yield valuable features for the extraction tasks. They are currently generally available and usually need only limited adaptations when a specific domain (e.g., law) or informal texts require so. The case studies have also shown that many extraction tasks are generic. They include, among others, named entity recognition, noun phrase coreference resolution, semantic role detection, entity attribute detection, and event and scenario classification.

What domain specificity is then left? The domain specificity resides in the domain-specific classification schemes, the ontologies or taxonomies that we use to classify the content. Depending on the domain, different named entities, semantic roles, attributes, events and scenarios become interesting. They constitute the labels by which we classify the content in view of further processing.

10.3.3 The Classification Schemes

The classification schemes or ontologies capture how humans tend to reflect the world. They also offer us the constraints on the allowable combinations and dependencies of information when using the semantic labels in information processing tasks. The biggest problem is defining semantic taxonomies or ontologies that convey content in a sensible manner. Both generic content classifications and domain-specific classifications are


needed. In order to improve the exchangeability of the ontologies, they should be accepted by a large community. There is a large dispute on whether semantic concepts (e.g., semantic roles) are a product of our language that names the facts of the world, or whether they are universal components of our mental representation of objects and concepts.

Machines will always be confronted with text (written or spoken), as it is a natural medium for humans to communicate and store information, or will have to process images and audio sources, fragrances, etc. as they are perceived in the world around us. These perceptions will have to be described with metadata when they are to be used by the machine.

Today’s information extraction technology is not so far advanced that we can detect scripts, scenarios and stories in a large variety of domains. Nevertheless, it already allows an initial recognition of meaning and of meaningful relations between words, which leaves behind the bag-of-words approach and the simple expansion of words with related terms. In information retrieval the semantics of the text are important in order to compute the matching of query and document. However, a human description of the semantics might be just a temporary artifact that is needed (as we have seen in the section on indexing) to efficiently retrieve, combine and reason with the information. As stated before, a shortcut expression of these semantics in the form of a priori assigned labels allows the system to better match query and document and eventually to make inferences with information found in one text and across texts. The semantic labels are a way of ordering the information with the aim of further processing. In the long run our machines might become very powerful and process, combine, integrate and synthesize information at the moment of perception, without the need of intermediary labels.

10.3.4 The Role of Paraphrasing

Currently, a lot of research interest goes into paraphrasing, i.e., finding natural language statements that express the same content. At first sight, paraphrase generation seems very useful in a retrieval context. One can generate all possible paraphrases for a simple query statement or question and try to match these expressions with the document content. However, paraphrasing has its limitations. Paraphrasing is possible for simple questions and query statements (e.g., to detect the different versions of phrases), but once more complex questions are posed, or complex contextual information has to be taken into account, it is very difficult to paraphrase whole discourses, passages or combinations of textual content. In information retrieval or question answering systems we aim to combine




information from different sources, and an integrated answer should be built by means of inferencing.

A more valuable model, which is inspired by human cognitive processing, stores the building blocks for possible interpretations of the texts; these can be matched against all possible queries or used in inferring the answer to a question. This is exactly what is accomplished by information extraction. Despite the many different appearances that natural language offers, information extraction detects and recognizes content, not only within phrases or sentences, but also in larger text segments.

This does not mean that paraphrasing is not useful. For instance, syntactic equivalence rules can be applied to training and testing examples when using machine learning techniques for information extraction in order to reduce the number of variant linguistic expressions.

10.3.5 Flexible Information Needs

The classical and successful paradigm of information retrieval that stands for a flexible information search will not be abandoned soon, as it supports the changing society and corresponding information needs. As we have seen in this book, the rise of information extraction technologies does not damage this paradigm. On the contrary, by adding semantic classifications to the bag-of-words model, the flexibility of the matching and the potential of a correct matching are only increased. Although information extraction aims at a rather shallow understanding of the texts, it can contribute to the many interpretations and meanings that a text provides and as such can improve the performance of information retrieval. This model parallels additions of semantics to low-level representations of images and audio, where one can query both with examples and with higher level descriptions. Such an approach is different from the traditional view of information extraction, which translates the content into a fixed template format that can be stored in a relational database.
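The layered matching described above can be sketched as a toy scoring function in which a document matches a query both on shared words and on shared semantic labels; the label names and weights below are illustrative assumptions, not a model from the book:

```python
# Each query/document is represented by a set of words and a set of
# semantic annotations produced by (hypothetical) extraction steps.

def score(query, doc, w_word=1.0, w_sem=2.0):
    # Combine bag-of-words overlap with semantic-layer overlap.
    word_overlap = len(query["words"] & doc["words"])
    sem_overlap = len(query["labels"] & doc["labels"])
    return w_word * word_overlap + w_sem * sem_overlap

query = {"words": {"attack", "casualties"},
         "labels": {("event", "attack"), ("role:patient", "casualties")}}
doc_a = {"words": {"attack", "casualties", "city"},
         "labels": {("event", "attack"), ("role:patient", "casualties")}}
doc_b = {"words": {"attack", "casualties", "city"},
         "labels": {("event", "drill")}}

# doc_a and doc_b share the same words with the query, but only doc_a
# also matches on the semantic layer, so it scores higher.
ranked = sorted(["doc_a", "doc_b"],
                key=lambda d: score(query, doc_a if d == "doc_a" else doc_b),
                reverse=True)
```

The point of the sketch is that the word index is never replaced, only complemented: removing the semantic layer degrades gracefully to bag-of-words matching.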

Moreover, many extraction tasks allow linking information across sentences and across documents. This is a very powerful mechanism. A typical linking task is coreference resolution, but other types of linking give rise to scripts, scenarios, events, etc. in the raw texts. Linking information is especially useful when answering complex information needs in the form of natural language questions and statements, and thus could enhance the recall and precision of the answer to such information questions.


10.3.6 The Indices

Ideally, the machine should interpret texts, images and other media on the fly, search the contents and combine them without any reliance on a priori computation of assisting data or index structures. In practice, this is currently not true. For instance, in information retrieval we construct indices that allow an efficient computation of finding the relevant documents or the answer to an information question. These indices are updated on a regular basis.

Information extraction structures the unstructured textual information, and traditionally the extracted information is translated into templates that populate relational databases. This means that a document is represented by selected content. In this book, we have definitely opposed such a view. The low-level features of a text allow for different uses and interpretations of the content. Some of them have received a semantic meaning. For information retrieval we have argued that we will keep the traditional indexing based on words, augmented with the semantic labels that are assigned to content units. The semantic labels act as intermediary assisting structures that hold information. The semantic meta-information that enriches the traditional word-based indices applies to the content of individual words, sentences and passages in a document. It also links content within and across documents, which could be exploited in the retrieval, as is traditionally done in retrieval systems that use the link structure to compute relevance rankings. In addition, uncertainty information should also be stored in the data structures, as it is a necessary component in subsequent computations.

The idea is to augment the traditional word-based indices with semantic information that forms generic building blocks for querying the information. Consequently, we leave the bag-of-words paradigm in favor of a bed-of-words that is layered with different semantics or elements of meaning.
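A minimal sketch of such a layered index could look as follows; the label names, documents and confidence values are hypothetical, and the structure simply stores the extractor's uncertainty next to each semantic posting so it can be filtered at query time:

```python
from collections import defaultdict

class LayeredIndex:
    """Word index augmented with a semantic-label layer."""

    def __init__(self):
        self.words = defaultdict(set)    # word -> {doc_id}
        self.labels = defaultdict(dict)  # (label, value) -> {doc_id: prob}

    def add(self, doc_id, tokens, annotations):
        for t in tokens:
            self.words[t].add(doc_id)
        for label, value, prob in annotations:
            self.labels[(label, value)][doc_id] = prob

    def query_word(self, word):
        return set(self.words[word])

    def query_label(self, label, value, min_prob=0.0):
        # Uncertainty is stored with the posting and filtered here.
        return {d for d, p in self.labels[(label, value)].items()
                if p >= min_prob}

idx = LayeredIndex()
idx.add("d1", ["suspect", "fled"], [("entity", "PERSON", 0.9)])
idx.add("d2", ["markets", "fled"], [("entity", "ORGANIZATION", 0.7)])

# Combine the word layer and the semantic layer in one query.
hits = idx.query_word("fled") & idx.query_label("entity", "PERSON",
                                                min_prob=0.5)
```

Both layers share the same posting-list machinery, which is what makes the "bed-of-words" an extension of, rather than a replacement for, the classical inverted index.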

10.4 Algorithmic Challenges

Both the information extraction task and the combination of information extraction and retrieval offer great future challenges in algorithm development.




10.4.1 The Features

It was shown that choosing good features is important, especially when dealing with unsupervised and weakly supervised learning methods. Selection also plays a role in supervised methods in order to test the large number of possibly relevant characteristics of natural language texts. In Chap. 4 we have described the features used in classical extraction tasks.

On the one hand, there is a large number of features that have potential in information extraction, ranging from the close context to the whole discourse and the complete corpus. Linguistic, cognitive and pragmatic studies offer a new mining field of features. Very often, these studies are neglected by researchers who are specialized in human language technology and information retrieval. We might try out many different feature sets. An a priori selection of the right features is sometimes very difficult, because in extraction tasks many features behave dependently, i.e., only in combination with other features do they form a discriminative pattern between classes.

The selection of the features often requires some degree of linguistic processing. Because linguistic resources such as part-of-speech taggers and sentence parsers have become easily available, they can be employed as a preprocessing step in the selection and extraction of features. A linguistic preprocessing of the texts fits the cascaded model that we propose and that will be explained further on.

10.4.2 A Cascaded Model for Information Extraction

In the cascaded model, the output of one type of extraction forms the features of a more complex extraction task. These features can possibly be complemented with other features. The approach has been successfully applied for simple extraction tasks. We want to expand this approach further. For instance, the recognized named entities are used to detect their relationships; the relationships, in their turn augmented by the (chronological) sequence in which they occur in a discourse, form the features for a script recognition task and are used to classify a set of actions as, for instance, a restaurant visit. In addition, we want to transpose this model to a machine learning environment for the reasons discussed above, to incorporate the probabilities of previous assignments, and possibly to backtrack to previous stages if more information becomes available.
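The restaurant-visit cascade can be sketched as three toy stages, where the output of each stage is the input of the next. The lexicons, verb lists and the script rule below are invented stand-ins for learned extractors, not a real system:

```python
def stage1_entities(tokens):
    # Stage 1: a trivial gazetteer-style named entity tagger.
    lexicon = {"waiter": "ROLE", "menu": "OBJECT", "bill": "OBJECT",
               "John": "PERSON"}
    return [(t, lexicon[t]) for t in tokens if t in lexicon]

def stage2_events(tokens, entities):
    # Stage 2: verbs in a discourse with recognized entities become
    # candidate events, kept in their chronological order.
    verbs = {"ordered", "ate", "paid"}
    return [t for t in tokens if t in verbs] if entities else []

def stage3_script(events):
    # Stage 3: the event sequence is the feature vector for script
    # classification; here a toy rule stands in for a classifier.
    return "restaurant_visit" if {"ordered", "paid"} <= set(events) else "other"

tokens = "John ordered from the menu ate quickly and paid the bill".split()
entities = stage1_entities(tokens)
events = stage2_events(tokens, entities)
script = stage3_script(events)
```

In a learned version, each stage would also pass along the probability of its assignments, so that later stages can weigh, or backtrack over, uncertain earlier decisions.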

The cascaded model was already proposed on a limited scale by Jerry Hobbs and his co-researchers (1996), based on the technology of non-probabilistic finite state transducers. In their approach, a finite-state automaton reads one element of a sequence at a time; each


element transitions the automaton into a new state, based on the type of element it is, e.g., the part of speech of a word. Some states are designated as final, and a final state is reached when the sequence of elements matches a valid pattern. In a finite state transducer, an output entity is constructed when final states are reached, e.g., a representation of the information in a phrase. In a cascaded finite state transducer, there are several finite state transducers at different stages. Earlier stages package a string of elements into something that the next stage will view as a single element.
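The two-stage packaging behavior can be sketched as follows; the phrase patterns and tags are toy assumptions in the spirit of the Hobbs et al. design, not their implementation:

```python
def noun_group_stage(tagged):
    # Stage 1: the pattern DT? JJ* NN+ reaches a final state and is
    # packaged into one NG element for the next stage.
    out, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DT":
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":
            j += 1
        start_nn = j
        while j < len(tagged) and tagged[j][1] == "NN":
            j += 1
        if j > start_nn:  # final state reached: emit one packaged NG
            out.append((" ".join(w for w, _ in tagged[i:j]), "NG"))
            i = j
        else:             # no match: pass the element through unchanged
            out.append(tagged[i])
            i += 1
    return out

def clause_stage(elements):
    # Stage 2: sees each NG as a single element and accepts NG VB NG.
    return [t for _, t in elements] == ["NG", "VB", "NG"]

sent = [("the", "DT"), ("angry", "JJ"), ("waiter", "NN"),
        ("brought", "VB"), ("the", "DT"), ("bill", "NN")]
stage1 = noun_group_stage(sent)
is_clause = clause_stage(stage1)
```

Stage 2 never sees part-of-speech tags at all; the packaging in stage 1 is exactly what makes each later transducer small and simple.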

As we have seen in this book, the decomposition of the natural language problem into levels is essential to our approach. Each level corresponds to a linguistic natural kind and reflects important universals about language. The model by Hobbs was inspired by the remarkable fact that very diverse languages all show the same structure of nominal and verbal elements, and of basic and complex phrase levels. Our model goes further and also attaches semantics at different levels of detail.

The cascaded model offers many advantages, as was already noticed by Hobbs. Organizing a system in this way offers greater portability among domains, because the system is built in a modular way and in certain steps modules might be replaced by others. Moreover, complementary modules can be more easily designed and implemented.

Why not learn directly that a certain text passage is about a restaurant visit? For instance, with a relational model we might learn the script of a restaurant visit and its component actions in one step.

Compared to such direct learning, the cascaded model offers major advantages. First of all, if we were to learn a more complex task directly, then, given the variety of natural language, we would have to annotate huge amounts of examples to be able to capture all variant expressions that signal a restaurant visit and that discriminate it from the many other similar scenarios. When performing this complex task in steps, one can take advantage of induction. From a limited set of instances, we can induce a more general rule that allows us to detect a building block of a restaurant visit scenario. By breaking up the extraction task into pieces, we make the whole process manageable. And we can reuse the building blocks in other recognition tasks. Finally, for each stage in the process a suitable set of features can be selected, avoiding the curse of dimensionality known in machine learning.

Another argument for learning extraction patterns in stages is the fact that extracted information is always used for some other task, such as data mining, summarization or information retrieval. These tasks benefit from having content descriptions at different levels of detail and along several semantic interpretations. In this book the task that we have considered is information retrieval. In information retrieval we do not know the queries




in advance. Learning the different semantics of a text allows interrogating its content from various angles or with information queries that represent various levels of abstraction. Using a cascaded model, in which extracted information forms the input for learning other extractions, makes the acquisition of the various semantics very efficient.

Of course, there is a downside to this model. Because the extraction systems do not work perfectly, errors might propagate. Errors made in earlier steps are carried along in the next stages. It is possible that if these errors are not too severe, they will be smoothed out in the next processing stages, when different classes are combined. Ideally, we should build extraction systems that correct themselves while bootstrapping from a simple to a more advanced understanding of the texts. Algorithms can be developed that correct errors in previous stages of the cascade when evidence is gathered on other parts of the comprehension. During processing several hypotheses can grow and eventually die as more evidence becomes available.

The above considerations can give rise to novel and efficient algorithms that in a cascaded way determine the probabilities of the recognized content, and use these assignments as probable features in following content recognition tasks, allowing for a selective backtracking in order to make corrections.

As such, information extraction becomes a very important stepping stone towards real text understanding. Such a view is also shared by Riloff (1999), who noted that extraction patterns can be used to represent more complex semantic classes, hereby following the initial insights proposed by Schank (1975). By exploiting the practical nature of information extraction technology and the insights and theories developed in discourse understanding of expository, narrative, and other (possibly domain specific) genres, Riloff believes in the possibility of building increasingly intelligent natural language processing systems that are practical for large scale applications. According to Riloff, although the grand challenge of developing a broad-coverage, in-depth natural language understanding system may still be a long way off, an effective synergy between information extraction and story understanding may prove to be a promising starting point for real progress in that direction.


10.4.3 The Boundaries of Information Units

A specific problem, when we semantically classify information, is finding the boundaries of the information to be classified. As seen in Chap. 9, this problem has received attention in the biomedical field, where the boundary detection of named entities is problematic. Not only in named entity recognition, but also in the other extraction tasks, especially when they involve the detection of passages, boundary detection is an issue. For instance, how is the text that deals with a restaurant visit delimited? For some tasks the results can be improved by considering boundary detection as a separate classification task and selectively using features for it. All (plausible) boundaries are then considered in a specific stage of the process and classified as correct or not, which could lead to computationally complex situations. Research is needed to demonstrate that breaking a complex extraction into smaller extraction tasks will benefit the accuracy of the boundary detection of the complex task.

10.4.4 Extracting Sharable Knowledge

As noted above, machine learning techniques that automatically acquire the extraction patterns have the advantage that meaning is assigned to content by taking into account the context of the information. Some knowledge resources (e.g., lexico-semantic resources) can be shared and unambiguously applied to different texts in order to extract information. For instance, it might be handy to use a hand-compiled list of countries and their capitals. These resources can be used in the information extraction as well as in the retrieval tasks.

Also, straightforward heuristic rules that can be safely applied in the processing could be more effective in the extraction process than using their attributes as features in a learning task. For instance, a one-sense-per-discourse assumption, i.e., that a name is usually used with the same meaning in the same discourse, could be helpful. Rules developed for textual entailment or paraphrasing (especially syntactic rules) can contribute to a decrease in the number of annotated examples that are needed when training a system.

10.4.5 Expansion

It is often stated that one of the limitations of information extraction technology is that it can only extract what is explicit in the text (Hobbs, 2002). In order to extract information that is only implicitly present in a text, you need inferences based on background knowledge. This is true. However, compared to human understanding and perception, the machine can also generate the most probable complement of the information, which it can learn from a large corpus of texts. For instance, consider the statement beating a nail. It does not say with what instrument this is done. Most probably, a human would infer that this is done with a hammer. Instead of considering all possible inferences, resulting in a combinatorial explosion of inferences, in human thinking only the most plausible ones come to mind. The machine can also infer this extra information. The most dominant or probable instrument learned from a large corpus can complement information that is already extracted, although this information is not explicitly present in the text. In order to pinpoint the most probable argument in the action of beating a nail, information extraction techniques that classify the arguments of an action are very useful.

10.4.6 Algorithms for Retrieval

In Chap. 7 we have seen that the integration of information extraction technology in information retrieval offers many possibilities for the development of novel algorithms. Most importantly, information extraction allows inferring the answer to an information query by utilizing different sources of evidence that represent document content. If the query is sufficiently elaborated, for instance, when it has the format of a natural language question or statement(s), information extraction technology will also semantically classify the query and its constituents.

There is a large potential for the development of retrieval models that incorporate the results of extractions. Especially graph-based algorithms that allow (uncertain) inferences seem very suitable.

An important problem causing information retrieval inaccuracy lies in the short queries that humans often use when searching for information. These queries lack the necessary context to pinpoint the real interest of the user. The above section shows that we might expand the query in a more focused way by applying information extraction technology on a large corpus. Typically, in information retrieval relevance feedback techniques are used to better learn the information need. Relevance feedback can rely on information extraction techniques, as not only the words of relevant and non-relevant documents will better discriminate the documents, but also their semantic meaning.

We also believe that in future retrieval systems questions and queries will increasingly be posed in spoken format. When using speech, people have the tendency to build longer, explanatory queries. Such queries pose additional challenges when recognizing their content.
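Relevance feedback over a feature space that mixes words with semantic labels, as discussed above, can be sketched with a Rocchio-style update; the label names, weights and documents are illustrative assumptions:

```python
from collections import Counter

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Classical Rocchio update, applied to features that may be plain
    # words or semantic labels such as ("event", "attack").
    new = Counter({f: alpha * v for f, v in query.items()})
    for doc in relevant:
        for f, v in doc.items():
            new[f] += beta * v / len(relevant)
    for doc in nonrelevant:
        for f, v in doc.items():
            new[f] -= gamma * v / len(nonrelevant)
    # Keep only positively weighted features for the expanded query.
    return Counter({f: v for f, v in new.items() if v > 0})

query = Counter({"attack": 1})
rel = [Counter({"attack": 2, ("event", "attack"): 1, "casualties": 1})]
nonrel = [Counter({"attack": 1, ("event", "drill"): 2})]

expanded = rocchio(query, rel, nonrel)
```

Because the relevant documents contribute their semantic annotations as well as their words, the expanded query is sharpened on both layers at once.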


10.5 The Future of IE in a Retrieval Context

In the future we foresee that technology for information synthesis will be developed and that such technology will be a mainstream component of a retrieval system. Information synthesis is the composition or combination of diverse content parts so as to form a coherent whole. Information fusion is often used as a synonym. Information synthesis is an important cognitive task of humans. In many professional and daily life settings we synthesize information that helps us to solve our problems. In Chap. 7 we already rejected the classical library paradigm for information searching in favor of systems that pinpoint us to relevant answers to our information query. Here, we really move beyond the library paradigm and increase the intelligence of the retrieval system by having the information synthesized.

Except for the domain of multi-document summarization, very little work has been done in the domain of information synthesis (Amigó et al., 2004). Multi-document summarization often relies only on simple word overlap. Moreover, although frame semantics are studied, little research is performed with regard to frame synthesis. In case of frame or template merging, information from different templates is merged into a single template. For example, one template may define a partial set of information, another template may define an overlapping set that contains additional information. Merging produces a single data set with all the information. Template merging is used in summarization, where conflicting information would give rise to the generation of summary sentences that discuss the contradictions found (McKeown and Radev, 1998). However, none of the existing approaches deal with generating information syntheses that answer all kinds of information needs in a flexible way.

Humans perform information synthesis all the time when they process input signals that they perceive in the world around them (Cutting, 1998).

Synthesis involves both the selection and integration of information. With regard to the selection of information, humans will search for information until they have found what they are looking for, and they will not search further. In other words, the human will usually not consult all the sources and choose the best one. This type of selection is called satisficing. An alternative way of selection is suppression. Given the presence of two sources of information, only one source is consulted because of knowledge that the result of consulting the second source would not matter (e.g., by knowing that the second source contains redundant information). A third way of selection is by veto. Here, some source of information overrules another one. For instance, some old information can be defeated by newer information (e.g., the number of deaths in a terrorist attack).

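The frame or template merging described in this section can be sketched in a few lines. The slot names, values and the dictionary representation below are invented for illustration and are not taken from any particular extraction system:

```python
# Sketch of template (frame) merging: two partially filled templates
# describing the same event are combined into one. Slots that disagree
# are collected as conflicts, which a summarizer could then verbalize.
# All slot names and values here are illustrative.

def merge_templates(t1, t2):
    """Merge two slot->value templates; return (merged, conflicts)."""
    merged, conflicts = dict(t1), {}
    for slot, value in t2.items():
        if slot not in merged:
            merged[slot] = value                      # additional information
        elif merged[slot] != value:
            conflicts[slot] = (merged[slot], value)   # contradictory fills
    return merged, conflicts

a = {"event": "attack", "location": "city X", "deaths": 20}
b = {"event": "attack", "deaths": 23, "date": "2004-05-17"}
merged, conflicts = merge_templates(a, b)
# merged holds the union of compatible slots; the disagreeing
# 'deaths' slot is reported separately as a conflict
```

A summarizer along the lines of McKeown and Radev could then generate a sentence that explicitly discusses the conflicting death counts.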

240 10 The Future of Information Extraction in a Retrieval Context

Information synthesis further involves the integration of information. This is accomplished by additive or subadditive accumulation. Given two sources of information, the resulting sum of information is respectively equal to or smaller than the simple combination of the information. In the case of cooperative integration, the resulting amount of information is larger than the mere addition of the information, because the integration has learned something that is not explicitly present in the sources. The combination allows making additional inferences. A third way of integration is the disambiguation of one source with the other. This form of integration is often a first step before accumulation and coordination.

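As a toy illustration of these integration modes, with facts simplified to plain strings and an invented inference rule:

```python
# Subadditive accumulation and cooperative integration, illustrated
# with facts represented as strings (a deliberate simplification).

source_a = {"attack in city X", "20 deaths reported"}
source_b = {"attack in city X", "attacker arrested"}

# Subadditive accumulation: the union is smaller than the simple
# combination because the two sources partly overlap.
accumulated = source_a | source_b
assert len(accumulated) < len(source_a) + len(source_b)

# Cooperative integration: combining the sources licenses an inference
# that neither source states explicitly (the rule here is invented).
def infer(facts):
    derived = set(facts)
    if "attack in city X" in facts and "attacker arrested" in facts:
        derived.add("the attack in city X is resolved")
    return derived

integrated = infer(accumulated)
# integrated contains one fact more than the plain union
```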
When humans synthesize information, they make inferences. Psychological studies offer a distinction between bridging (backward) inferences and elaboration (forward) inferences (Graesser and Bower, 1990; Sanford, 1990; Smith and Hancox, 2001). Bridging inferences link information to information that you have previously encountered, so they establish coherence between pieces of information. Elaboration inferences add additional information. Both types of inferences are important in information synthesis and require background knowledge. We could also simulate this process by machine (cf. the expansion methods discussed above).

In advanced retrieval systems information synthesis becomes an absolute necessity. For instance, in question answering there are many questions for which the answer is distributed over different information nuggets that possibly need to be selected and integrated by means of inferencing. Now, how will information extraction fit into this synthesis task? Information extraction translates information into different views that can be used in an efficient matching with the query. The main contribution of information extraction is that extraction technologies allow us to link entities, events, scenarios and other information. Finding relations between information within and across texts is a necessary condition if we want to reason with information or synthesize it. Finding equivalence relations (i.e., the linking of coreferring noun phrases, coreferring events, etc.) is of primordial importance here. But the resolution of other references, among which are temporal and spatial references (e.g., today, hereunder), is also very relevant. This would allow constructing a time line of the information or a visual representation of the content. We could also detect semantic relations of explanation, contrast, insight, etc. within and across documents.

The extracted information can be represented in different formats, such as annotations, graphs and logical networks. In Chap. 7 we have seen how we could combine extracted information in retrieval. It is important to take into consideration the uncertainty of the representation and the flexibility of the information need. What we extract from text is often uncertain. This is also the case in human perception. As explained above, we should not be tempted to select information in a text and put the information in template compartments, neglecting that other content might sometimes be relevant in future information synthesis tasks.

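The resolution of temporal references mentioned above, i.e., mapping expressions such as "today" onto calendar dates so that extracted events can be ordered on a time line, might be sketched as follows. The document date, the expression set and the events are invented for illustration:

```python
# Resolving relative temporal references against a document's
# publication date, so that extracted events can be placed on a
# single time line. Dates, expressions and events are illustrative.

from datetime import date, timedelta

def resolve_timex(expression, document_date):
    """Map a relative time expression to a calendar date (tiny sketch)."""
    offsets = {"today": 0, "yesterday": -1, "tomorrow": 1}
    if expression in offsets:
        return document_date + timedelta(days=offsets[expression])
    return None  # absolute or unknown expressions need other handling

doc_date = date(2004, 5, 17)
events = [("bombing", "yesterday"), ("press conference", "today")]
timeline = sorted((resolve_timex(t, doc_date), e) for e, t in events)
# timeline orders the events: the bombing precedes the press conference
```

A full timex resolver would of course handle absolute dates, durations and underspecified expressions as well; this only shows the anchoring step.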
Ultimately, information synthesis paves the way to real problem solving by the machine based on the content extracted from document collections. Once the machine has linked information within and across documents, it can reason with the information. When the system integrates information through reasoning, it should be able to reason with flexible information needs, uncertain information and defeasible information.

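Defeasible updating, the veto by newer information introduced earlier with the example of a changing casualty count, can be sketched as follows. The record format and timestamps are invented:

```python
# Veto / defeasible update sketch: a newer report overrules an older
# one for the same slot (e.g., an updated death count). The record
# format and the ISO-8601 timestamps are illustrative.

reports = [
    {"time": "2004-05-17T09:00", "slot": "deaths", "value": 20},
    {"time": "2004-05-17T15:30", "slot": "deaths", "value": 23},
]

def latest_values(reports):
    """Keep, per slot, only the most recent (defeating) value."""
    current = {}
    # ISO-8601 timestamps sort correctly as strings
    for r in sorted(reports, key=lambda r: r["time"]):
        current[r["slot"]] = r["value"]   # newer report vetoes older one
    return current

state = latest_values(reports)
# state keeps only the later count: the earlier value has been defeated
```

Recency is only one possible veto criterion; source authority or confidence could play the same role.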
Of course, we already foresee many bottlenecks in the development of information synthesis systems. Apart from all the difficulties mentioned in this book with regard to information extraction and retrieval, there will be the typical problems of information synthesis. We name here a few, but future research will certainly reveal others. First, you cannot combine all information, because some combinations do not make sense. Machines often lack this common sense knowledge. Another important question is the validity of information. How can the machine detect the validity of information and find out when it is defeated by new information?

As our information sources increasingly are a mixture of text and other media, cross-media synthesis becomes very relevant. We will capture evidence from different media (speech, audio, images, video and text) and build a synthesis. Such an aim especially demands a cross-media, cross-document extraction and alignment of content. Research in these areas is currently emerging.

And finally, we must not forget that we have to answer the information needs of humans, who often change their minds, who somehow want to build their own truth, their own interpretation, into the synthesis, tailored to what they already know ... in a multi-media context.

Will extraction tools automatically assemble our own personal 7 o’clock television news from all sources available?


10.6 Bibliography

Amigó, Enrique, Julio Gonzalo, Victor Peinado, Anselmo Peñas and Felisa Verdejo (2004). An empirical study of information synthesis tasks. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (pp. 208-215). East Stroudsburg, PA: ACL.

Cutting, James E. (1998). Information from the world around us. In Julian Hochberg (Ed.), Perception and Cognition at Century's End (pp. 69-93). San Diego, CA: Academic Press.

Graesser, Arthur C. and Gordon H. Bower (1990). The Psychology of Learning and Motivation: Inferences and Text Comprehension. San Diego, CA: Academic Press.

Hobbs, Jerry R. (2002). Information extraction from biomedical text. Journal of Biomedical Informatics, 35, 260-264.

Hobbs, Jerry R., Douglas Appelt et al. (1996). FASTUS: A cascaded finite-state transducer for extracting information from natural language text. In Emmanuel Roche and Yves Schabes (Eds.), Finite State Devices for Natural Language Processing (pp. 383-406). Cambridge, MA: The MIT Press.

Kintsch, Walter and Teun A. van Dijk (1978). Toward a model of text comprehension and production. Psychological Review, 85 (5), 363-394.

Kiparsky, Paul (2002). On the Architecture of Panini's Grammar. Three lectures delivered at the Hyderabad Conference on the Architecture of Grammar, January 2002, and at UCLA, March 2002.

McKeown, Kathleen and Dragomir R. Radev (1999). Generating summaries of multiple news texts. In Inderjeet Mani and Mark T. Maybury (Eds.), Advances in Automatic Text Summarization (pp. 381-389). Cambridge, MA: The MIT Press.

Riloff, Ellen (1999). Information extraction as a stepping stone toward story understanding. In Ashwin Ram and Kenneth Moorman (Eds.), Understanding Language Understanding: Computational Models of Reading (pp. 435-460). Cambridge, MA: The MIT Press.

Sanford, Tony (1990). On the nature of text-driven inference. In Encyclopedia of Language and Linguistics (pp. 515-535). Oxford, UK: Elsevier.

Schank, Roger C. (1975). Conceptual Information Processing. Amsterdam: North-Holland.

Smith, Elliot and Peter Hancox (2001). Representation, coherence and inference. Artificial Intelligence Review, 15, 295-323.

Sperber, Dan and Deirdre Wilson (1995). Relevance: Communication and Cognition (2nd edition). Oxford, UK: Basil Blackwell.


Index

Accuracy, 182
ACE, see Automatic Content Extraction
Active learning, 145-7
Alias recognition, 81
ASR, see Automatic speech recognition
Automatic Content Extraction, 8, 180, 187-90, 202, 218, 226
Automatic speech recognition, 216
AutoSlog, 32

Bag-of-words, 160-1, 169, 232-3
B-Cubed metric, 185-6
Bed-of-words, 161, 174, 233
Binary preference, 192
Boundary detection, 72, 207, 237
Bpref, see Binary preference

Cascaded model, 42-3, 60-3, 86, 234
Case-based reasoning, 159, 215-6
Categorization, 16
Chi-square test, 72
Classification
  Context-dependent, 91
  Context-free, 91
Classification scheme, 70
Classifier
  Discriminative, 90
  Generative, 90
Closed domain, 9
Clustering, 129-38
  Hierarchical, 134
  K-means, 134
  K-medoid, 134
  Number of clusters, 134-5
  Sequential, 133
Collocation, 72
Conceptual Dependency Theory, 23-6, 47-54
Conditional random fields, 90-1, 114-18
Context window, 74
Coreference resolution, see Noun phrase coreference resolution
Co-training, 144-5
Cross-entropy, see Entropy
Cross-language information retrieval, 15
Curse of dimensionality, 97

DAML + OIL, see DARPA agent Markup Language + Ontology Interface Layer
DARPA agent Markup Language + Ontology Interface Layer, 28, 57
Data mining, 7, 16, 216
Definiteness, 81
Distance, 130
  Euclidean, 130
  Manhattan, 130

Edit distance, 78
EELD, see Evidence extraction and link discovery
E-mail, 218
Entity relation recognition, 40, 83, 101, 114, 203-4, 208
Entropy, 103, 120
  Conditional, 105
  Cross-, 166
  Relative, 166
Event, 34
Evidence extraction and link discovery, 210
Expansion, 237-8
Expectation Maximization algorithm, 112-4, 142-3, 164
Extensible Markup Language, 18-9, 57, 165, 173-5, 214
Extrinsic evaluation, 180, 191

FASTUS, 10, 27, 42, 60-3
Feature, 73-86, 89, 116, 129-130, 229-30, 234
  Discourse, 85-6
  Lexical, 77-81
  Semantic, 84-5
  Syntactic, 81-2
Finite State Automaton, 58-60
First-order predicate logic, 171
F-measure, 182, 203-7, 217-8
Frame, 26-8, 54-8
  Network, 55
FrameNet, 27-8, 30
FRUMP, 25

Gazetteers, 207
Grammar
  Regular, 58
  Systemic-functional, 35-6

Hidden Markov model, see Markov model
HTML, see HyperText Markup Language
Hypernymy, 30, 78
HyperText Markup Language (HTML), 19, 175, 214
Hyponymy, 30, 78

Indexing representation, 156, 159
Indices, 14
Inductive logic programming, 91, 121
Inference, 167-71, 240
Information extraction, 1-4, 225
Information gain, 120
Information retrieval, 11-5, 151-76
Information synthesis, 239-41
Inter-annotator agreement, 180
Intrinsic evaluation, 180
Inverse document frequency, 78, 137-8, 162
Inverted file, 172

Kernel functions, 92-3, 97-101, 230
  Bag-of-words, 99
  Convolution, 99
  Tree, 99-101
Kullback-Leibler divergence, 133, 166

Latent Semantic Indexing, 153
Lemmatization, 28
Likelihood ratio, 72
Linking, 232

Machine learning, 31-2, 229-30
  Supervised, 67, 89-124
  Unsupervised, 71, 127-138
  Weakly supervised, 71, 128, 138-48, 230
Machine translation, 15
Macro-averaging, 183
MAP, see Mean average precision
Markov chain, 108-9, 115
Markov model
  Hidden, 90-1, 107-10, 112-4
  Visible, 110-1
Maximum entropy model, 90, 101-7
Maximum entropy principle, 103, 117, 229
Mean average precision, 191-2
Meronymy, 78
Message Understanding Conference, 2, 8, 179, 202
Metadata, 154
Micro-averaging, 183
Minsky, 26-8, 54-8, 226
Morpheme, 71
MUC, see Message Understanding Conference
Multi-class problem, 91
Multi-media, 157, 204, 209, 212, 226, 241

Named entity recognition, 38, 75-6, 101, 106, 114, 140, 203, 205, 217-8
Natural language processing, 29
Noun phrase coreference resolution, 39, 75, 184, 203, 210-11, 215
  Single-document, 79, 101, 106, 122, 136-7
  Cross-document, 80, 137-8

Ontology, 70, 201, 230
Open domain, 9
Opinion recognition, 204, 214
OWL, see Web Ontology Language

Panini, 6, 227
Paraphrasing, 158, 231-2
Parsing
  Full, 29
  Partial, 58
  Shallow, 29
  Sentence, 230
Part-of-speech tagging, 29, 58, 72, 230
Pattern recognition, 66
POS tagging, see Part-of-speech tagging
Precision, 156-7, 181, 185-6, 191, 204, 217
Principle of compositionality, 5
Proximity, 133

Query by example, 154
Question answering, 13, 153, 170-1

RAR, see Reciprocal answer rank
Recall, 152, 156-7, 181, 185-6, 191, 204, 217
Reciprocal answer rank, 191
Relational learning, 91, 121-2
Relation recognition, see Entity relation recognition
Retrieval model, 151-71, 238
  Inference network, 167-70
  Language, 164-6
  Logic-based, 170-1
  Vector space, 162-3
  XML, 154, 161
Rhetorical Structure Theory, 30-1, 85
Root, 71
Rule and tree learning, 91, 118-21

Scenario recognition, 41, 212
Schank, 23-6, 47-54, 226
Script recognition, 50, 212
Segmentation
  Linear, 30
  Hierarchical, 30
Self training, 141-4
Semantic role recognition, 39, 75, 82, 101, 106
Sentiment recognition, see Opinion recognition
Short type, 77
Similarity, 131
  Cosine, 131, 162
  Dice, 131-2
  Inner product, 130-1, 162
  Mixed value, 132
Spam mail, 218
Spatial semantics recognition and resolution, 212
Speech, 200, 216, 238
Stemming, 28
Summarization, 7, 16
Support Vector Machines, 90, 92-101, 148
  Transductive, 143-4
Synonymy, 29, 78, 153

Term frequency, 78, 137-8, 162
Text
  Biomedical, 199, 204-9, 226
  Business, 199, 213-4
  Informal, 200, 216-8
  Intelligence, 199, 209-13
  Legal, 200
  News, 199, 202-4
Text mining, 7
Text region, 7, 40, 173-5
Text Retrieval Conference, 180, 215
Time line recognition, 40, 75-6
TimeML, 85
Timex recognition, 40, 75-6, 84, 204, 211-12
Timex resolution, 40, 75-6, 84, 204, 212
Tokenization, 28
TREC, see Text Retrieval Conference
Treebank, 29

Version spaces method, 121
Vilain metric, 184-5
Visible Markov model, see Markov model
Viterbi algorithm, 110

Web Ontology Language, 57
WordNet, 30, 84
Word sense disambiguation, 138

XML, see Extensible Markup Language

Yarowsky, 138-40