Information Extraction for Free Text


Plain Text Information Extraction (based on Machine Learning)

Chia-Hui Chang
Department of Computer Science & Information Engineering, National Central University
chia@csie.ncu.edu.tw
9/24/2002

Introduction

Plain Text Information Extraction
- The task of locating specific pieces of data in a natural-language document
- The goal is to obtain useful structured information from unstructured text
- Exemplified by DARPA's MUC program

The extraction rules are based on:
- A syntactic analyzer
- A semantic tagger

Related Work

On-line documents:
- SRV (D. Freitag), AAAI-1998
- RAPIER (M. E. Califf), ACL-1997, AAAI-1999
- WHISK (S. Soderland), ML-1999

Free-text documents:
- PALKA, MUC-5, 1993
- AutoSlog (E. Riloff), AAAI-1993
- LIEP (Huffman), IJCAI-1995
- CRYSTAL (S. Soderland), IJCAI-1995, KDD-1997

SRV: Information Extraction from HTML: Application of a General Machine Learning Approach

Dayne Freitag
Dayne@cs.cmu.edu
AAAI-98

Introduction

SRV
- A general-purpose relational learner
- A top-down relational algorithm for IE
- Relies on a set of token-oriented features

Extraction patterns:
- First-order logic extraction patterns with predicates based on attribute-value tests

Extraction as Text Classification
- Identify the boundaries of field instances
- Treat each fragment as a bag of words
- Find relations from the surrounding context

Relational Learning

Inductive Logic Programming (ILP)
- Input: class-labeled instances
- Output: a classifier for unlabeled instances

Typical covering algorithm:
- Attribute-value tests are added greedily to a rule
- The number of positive examples covered is heuristically maximized while the number of negative examples covered is heuristically minimized
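The covering loop above can be sketched as follows; the function name and the representation of candidate tests as boolean predicates are illustrative assumptions, not part of any system described in this deck.

```python
def learn_rules(positives, negatives, candidate_tests):
    """Greedy covering: grow one conjunctive rule at a time, then
    remove the positives it covers and start the next rule."""
    rules = []
    uncovered = list(positives)
    while uncovered:
        rule, pos, neg = [], list(uncovered), list(negatives)
        while neg:
            # Greedy step: pick the test that keeps the most positives
            # while excluding the most negatives.
            best = max(candidate_tests,
                       key=lambda t: sum(map(t, pos)) - sum(map(t, neg)))
            new_pos = [x for x in pos if best(x)]
            new_neg = [x for x in neg if best(x)]
            if len(new_neg) == len(neg):   # no progress: give up on this rule
                break
            rule.append(best)
            pos, neg = new_pos, new_neg
        if not neg and pos:
            rules.append(rule)
            uncovered = [x for x in uncovered if not all(t(x) for t in rule)]
        else:
            break                          # remaining positives not cleanly coverable
    return rules
```

A toy run with numeric "instances" and two candidate tests shows one rule being learned that covers all positives and no negatives.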

Simple Features

Features of an individual token:
- Length (e.g. single letter vs. multiple letters)
- Character type (e.g. numeric or alphabetic)
- Orthography (e.g. capitalized)
- Part of speech (e.g. verb)
- Lexical meaning (e.g. geographical_place)

Individual Predicates

Individual predicates:
- Length(=3): accepts only fragments containing three tokens
- Some(?A [] capitalizedp true): the fragment contains some token that is capitalized
- Every(numericp false): every token in the fragment is non-numeric
- Position(?A fromfirst <2): the token bound to ?A is either first or second in the fragment
- Relpos(?A ?B =1): the token bound to ?A immediately precedes the token bound to ?B
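As a rough illustration, the predicates above can be encoded as plain Python checks over a fragment represented as a list of token strings — an assumed representation; SRV's own implementation differs.

```python
def length_eq(fragment, n):
    """length(=n): the fragment contains exactly n tokens."""
    return len(fragment) == n

def some_capitalized(fragment):
    """some(?A [] capitalizedp true): some token is capitalized."""
    return any(tok[:1].isupper() for tok in fragment)

def every_non_numeric(fragment):
    """every(numericp false): every token is non-numeric."""
    return all(not tok.isdigit() for tok in fragment)

def position_fromfirst_lt(fragment, tok, k):
    """position(?A fromfirst <k): the token occurs among the first k."""
    return tok in fragment[:k]

def relpos_eq(fragment, a, b):
    """relpos(?A ?B =1): token a immediately precedes token b."""
    return any(x == a and y == b for x, y in zip(fragment, fragment[1:]))
```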

Relational Features

Relational feature types:
- Adjacency (next_token)
- Linguistic syntax (subject_verb)

Example

Search

Predicates are added greedily, attempting to cover as many positive and as few negative examples as possible.

At every step in rule construction, all documents in the training set are scanned and every text fragment of the appropriate size is counted.

Every legal predicate is assessed in terms of the number of positive and negative examples it covers.

A position-predicate is not legal unless a some-predicate is already part of the rule.

Relational Paths

Relational features are used only in the path argument to the some-predicate.

Some(?A [prev_token prev_token] capitalizedp true): the fragment contains some token preceded by a capitalized token two tokens back.
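A minimal sketch of path-following, assuming only the adjacency features prev_token/next_token and a token-list fragment (both illustrative simplifications):

```python
def follow_path(tokens, i, path):
    """Follow a path of prev_token/next_token steps from index i;
    return the final index, or None if it falls off the fragment."""
    for step in path:
        i += -1 if step == "prev_token" else 1
        if not 0 <= i < len(tokens):
            return None
    return i

def some_with_path(tokens, path, test):
    """some(?A [path] test): some token whose path target passes test."""
    return any(
        (j := follow_path(tokens, i, path)) is not None and test(tokens[j])
        for i in range(len(tokens))
    )
```

For the slide's example, the path [prev_token prev_token] applied at a token two positions after a capitalized token succeeds.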

Validation

Training phase:
- 2/3 of the training data for learning, 1/3 for validation

Testing:
- Bayesian m-estimates
- All rules matching a given fragment are used to assign a confidence score
- Combined confidence: C = 1 - prod_i (1 - c_i), where c_i is the confidence of rule i
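The combined confidence can be read as a noisy-or over the matching rules' confidences, C = 1 - prod_i (1 - c_i); a minimal sketch under the assumption that the per-rule confidences are independent:

```python
def combined_confidence(confidences):
    """Noisy-or combination over all rules matching a fragment:
    C = 1 - prod(1 - c_i). Assumes independent rule confidences."""
    product = 1.0
    for c in confidences:
        product *= 1.0 - c
    return 1.0 - product
```

For example, two matching rules with confidence 0.5 each combine to 0.75, and any rule with confidence 1.0 forces the combined score to 1.0.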

Adapting SRV for HTML

Experiments

Data source:
- Four university computer science departments: Cornell, U. of Texas, U. of Washington, U. of Wisconsin

Data sets:
- 105 course pages: title, number, instructor
- 96 project pages: title, member

Two experiments:
- Random: five cross-validation runs
- LOUO: 4-fold experiments

Coverage results for the OPD and MPD settings; each rule has its own confidence.

Baseline Strategies

For both the OPD and MPD settings:
- A learner that simply memorizes field instances
- A random guesser

Conclusions

Increased modularity and flexibility:
- Domain-specific information is separate from the underlying learning algorithm

Top-down induction:
- From general to specific

Accuracy-coverage trade-off:
- A confidence score is associated with each prediction

Critique: single-slot extraction rules only

RAPIER: Relational Learning of Pattern-Match Rules for Information Extraction

M. E. Califf and R. J. Mooney
ACL-97, AAAI-1999

Rule Representation

- Single-slot extraction patterns
- Syntactic information (from a part-of-speech tagger)
- Semantic class information (from WordNet)

The Learning Algorithm

A specific-to-general (bottom-up) search:
- The pre-filler pattern contains one item for each word before the filler
- The filler pattern contains one item for each word in the filler
- The post-filler pattern contains one item for each word after the filler

Compressing the rules for each slot:
- Generate the least general generalization (LGG) of each pair of rules
- When the LGG of two constraints is a disjunction, create two alternatives: (1) the disjunction and (2) removal of the constraint
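The LGG-with-alternatives step can be sketched as follows, under the simplifying assumption that a word constraint is just a set of allowed words (RAPIER's actual pattern items also carry POS and semantic constraints):

```python
def lgg_alternatives(a, b):
    """LGG of two word constraints (sets of allowed words).
    Identical constraints generalize to themselves; otherwise return the
    slide's two alternatives: the disjunction, or dropping the constraint
    entirely (None = any word)."""
    if a == b:
        return [a]
    return [a | b,    # alternative 1: disjunction of the two word sets
            None]     # alternative 2: remove the constraint
```

On the slide's example, generalizing a constraint on "Atlanta" with one on "Kansas City" yields either the disjunction of the city words or an unconstrained item.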

Example:
- Located in Atlanta, Georgia.
- Offices in Kansas City, Missouri.

Example:
Assume there is a semantic class for states, but not one for cities.
- Located in Atlanta, Georgia.
- Offices in Kansas City, Missouri.

Experimental Evaluation

- 300 computer-related job postings
- 17 slots, including employer, location, salary, job requirements, language, and platform

Experimental Evaluation

- 485 seminar announcements
- 4 slots

WHISK

S. Soderland
University of Washington
Machine Learning Journal, 1999

Handles both semi-structured text and free text.

WHISK Rule Representation
- For semi-structured IE
- For free-text IE: slots are constrained by semantic classes (e.g. person name, position) and verb stems; the wildcard skips only within the same syntactic field

Example – Tagged by Users

The WHISK Algorithm

Creating a Rule from a Seed Instance

Top-down rule induction:
1. Start from an empty rule
2. Add terms within the extraction boundary (Base_1)
3. Add terms just outside the extraction boundary (Base_2)
4. Repeat until the seed is covered
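A rough sketch of the Base_1/Base_2 idea, with two loud assumptions: rules are modelled as literal regular expressions (WHISK actually generalizes terms to semantic classes), and "just outside" is approximated by a fixed character window.

```python
import re

def rule_from_seed(text, start, end, context=10):
    """Grow a rule for the fill text[start:end]: try Base_1 (terms within
    the extraction boundary); if that matches more than the seed, add
    literal context just outside the boundary (Base_2)."""
    base1 = "(" + re.escape(text[start:end]) + ")"
    if len(re.findall(base1, text)) == 1:
        return base1                            # Base_1 already unambiguous
    left = re.escape(text[max(0, start - context):start])
    right = re.escape(text[end:end + context])
    return left + base1 + right                 # Base_2: anchor on context
```

When the fill is unique in the seed text the inner terms suffice; when it is ambiguous (e.g. a repeated word) the surrounding terms disambiguate it.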

Example


AutoSlog: Automatically Constructing a Dictionary for Information Extraction Tasks

Ellen Riloff
Dept. of Computer Science, University of Massachusetts
AAAI-93

AutoSlog

Purpose: automatically construct a domain-specific dictionary for IE

Extraction patterns (concept nodes):
- Conceptual anchor: a trigger word
- Enabling conditions: constraints

Concept Node Example

Physical target slot of a bombing template

Construction of Concept Nodes

1. Given a targeted piece of information,
2. AutoSlog finds the first sentence in the text that contains the string.
3. The sentence is handed to CIRCUS, which generates a conceptual analysis of the sentence.
4. The first clause in the sentence is used.
5. A set of heuristics is applied to suggest a good conceptual anchor point for a concept node.
6. If none of the heuristics is satisfied, AutoSlog searches for the next sentence and returns to step 3.
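The loop above can be sketched with stand-ins for CIRCUS and the anchor-point heuristics supplied as plain functions; every name here is illustrative, not AutoSlog's actual interface.

```python
def propose_concept_node(sentences, target, analyze, heuristics):
    """Scan sentences for the target string; analyze the first match
    (stand-in for CIRCUS's first-clause analysis) and apply anchor-point
    heuristics in order until one proposes a concept node."""
    for sentence in sentences:
        if target not in sentence:           # step 2: find the string
            continue
        clause = analyze(sentence)           # steps 3-4: first clause
        for heuristic in heuristics:         # step 5: anchor heuristics
            node = heuristic(clause, target)
            if node is not None:
                return node
    return None                              # step 6 exhausted all sentences
```

A toy heuristic in the spirit of "<subject> passive-verb" triggers on the verb following "was" after the target noun.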

Conceptual Anchor Point Heuristics

Background Knowledge

Concept Node Construction
- Slot: the slot of the answer key, with hard and soft constraints
- Type: template types such as bombing or kidnapping
- Enabling condition: a heuristic pattern

Domain Specification
- The type of a template
- The constraints for each template slot

Another good concept node definition: a perpetrator slot from a perpetrator template

A bad concept node definition: a victim slot from a kidnapping template

Empirical Results

Input:
- An annotated corpus of texts in which the targeted information is marked with semantic tags denoting the type of information (e.g. victim) and the type of event (e.g. kidnapping)
- 1500 texts with 1258 answer keys containing 4780 string fillers

Output:
- 1237 concept node definitions
- Human intervention: 5 user-hours to sift through all generated concept nodes; 450 definitions are kept

Performance:

Conclusion

In 5 person-hours, AutoSlog creates a dictionary that achieves 98% of the performance of a hand-crafted dictionary.

Each concept node is a single-slot extraction pattern.

Reasons for bad definitions:
- A sentence contains the targeted string but does not describe the event
- A heuristic proposes the wrong conceptual anchor point
- CIRCUS incorrectly analyzes the sentence

CRYSTAL: Inducing a Conceptual Dictionary

S. Soderland, D. Fisher, J. Aseltine, W. Lehnert

University of Massachusetts

IJCAI’95

Concept Nodes (CN)

- CN-type and subtype
- Extracted syntactic constituents
- Linguistic patterns
- Constraints on syntactic constituents

The CRYSTAL Induction Tool

Creating initial CN definitions: one for each instance

Inducing generalized CN definitions by relaxing the constraints of highly similar definitions:
- Word constraints: intersecting the strings of words
- Class constraints: moving up the semantic hierarchy

Inducing Generalized CN Definitions

1. Start from a CN definition D.
2. Find a second definition D' that is similar to D.
3. Create a new, relaxed definition U that unifies D and D'.
4. Test whether U extracts only marked information:
   a) If yes, delete from the dictionary all definitions covered by U (e.g. D and D'), set D = U, and go to step 2.
   b) If no, start over with another definition as D.
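A toy rendering of this loop, assuming a CN definition is just a frozenset of constraints and that is_valid stands in for the error-rate test against the annotated corpus (both are illustrative simplifications of CRYSTAL):

```python
def generalize(definitions, is_valid):
    """Bottom-up induction: repeatedly unify a definition with its most
    similar neighbour (unification keeps only shared constraints) and
    keep the relaxed definition only if it passes the validity test."""
    dictionary = set(definitions)
    changed = True
    while changed:
        changed = False
        for d in list(dictionary):
            others = dictionary - {d}
            if not others:
                break
            # Most similar definition: largest constraint overlap.
            d2 = max(others, key=lambda o: len(d & o))
            unified = d & d2
            if unified and unified not in dictionary and is_valid(unified):
                # Delete every definition the new one covers, then keep it.
                dictionary = {x for x in dictionary if not unified <= x}
                dictionary.add(unified)
                changed = True
                break
    return dictionary
```

Each successful unification strictly shrinks the dictionary, so the loop terminates; unrelated definitions (empty intersection) are never merged.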

Implementation Issue

Finding similar definitions:
- CN definitions are indexed by verbs and by extraction buffers
- Similarity metric: intersecting classes or intersecting strings of words

Testing the error rate of a generalized definition:
- A database of instances segmented by the sentence analyzer is constructed

Experimental Results

- 385 annotated hospital discharge reports
- 14,719 training instances
- The error-tolerance parameter controls a trade-off between precision and recall

Output (CN definitions):
- 194 with coverage = 10
- 527 with 2 < coverage < 10

Comparison

Bottom-up (from specific to general):
- CRYSTAL [Soderland, 1996]
- RAPIER [Califf & Mooney, 1997]

Top-down (from general to specific):
- SRV [Freitag, 1998]
- WHISK [Soderland, 1999]

References

I. Muslea, "Extraction Patterns for Information Extraction Tasks: A Survey," AAAI-99 Workshop on Machine Learning for Information Extraction, 1999.

E. Riloff, "Automatically Constructing a Dictionary for Information Extraction Tasks," AAAI-93, pp. 811-816, 1993.

S. Soderland, D. Fisher, J. Aseltine, and W. Lehnert, "CRYSTAL: Inducing a Conceptual Dictionary," IJCAI-95, 1995.

D. Freitag, "Information Extraction from HTML: Application of a General Machine Learning Approach," AAAI-98, 1998.

M. E. Califf and R. J. Mooney, "Relational Learning of Pattern-Match Rules for Information Extraction," AAAI-99, Orlando, FL, pp. 328-334, July 1999.

S. Soderland, "Learning Information Extraction Rules for Semi-structured and Free Text," Machine Learning, 1999.
