Annotation Free Information Extraction
Chia-Hui Chang
Department of Computer Science & Information Engineering
National Central University
[email protected]
10/4/2002
Introduction
Text IE: AutoSlog-TS
Semi-structured IE: IEPAD
AutoSlog-TS: Automatically Generating Extraction Patterns from Untagged Text
Ellen Riloff
University of Utah
AAAI96
AutoSlog-TS
AutoSlog-TS is an extension of AutoSlog. It operates exhaustively, generating an extraction pattern for every noun phrase in the training corpus. It then evaluates the extraction patterns by processing the corpus a second time and generating relevance statistics for each pattern.
A more significant difference is that AutoSlog-TS allows multiple rules to fire if more than one matches the context.
AutoSlog-TS Concept
Relevance Rate
Pr(relevant text | text contains pattern_i) = rel-freq_i / total-freq_i
rel-freq_i: the number of instances of pattern_i that were activated in relevant texts.
total-freq_i: the total number of instances of pattern_i that were activated in the training corpus.
The motivation behind the conditional probability estimate is that domain-specific expressions will appear substantially more often in relevant texts than irrelevant texts.
Rank function
Next, we use a rank function to rank the patterns in order of importance to the domain:
relevance rate * log2(frequency)
So, a person only needs to review the most highly ranked patterns.
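As a rough sketch, the relevance rate and the rank function above can be combined as follows (the pattern strings and counts are made up for illustration; this is not the AutoSlog-TS implementation):

```python
import math

def rank_patterns(stats):
    """Rank extraction patterns by relevance-rate * log2(frequency).

    stats maps a pattern to (rel_freq, total_freq) counts, as defined
    on the slide; the variable names here are illustrative only.
    """
    ranked = []
    for pattern, (rel_freq, total_freq) in stats.items():
        relevance_rate = rel_freq / total_freq
        score = relevance_rate * math.log2(total_freq)
        ranked.append((score, pattern))
    ranked.sort(reverse=True)  # highest-ranked patterns first
    return [(p, s) for s, p in ranked]

# Hypothetical counts: a domain-specific pattern fires mostly in relevant texts.
stats = {"<subj> was murdered": (80, 100), "<subj> said": (300, 1000)}
print(rank_patterns(stats))
```

A reviewer would then inspect only the top of the returned list, which is exactly the human-in-the-loop step the slide describes.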
Experimental Results Setup
We evaluated AutoSlog and AutoSlog-TS by manually inspecting the performance of their dictionaries in the MUC-4 terrorism domain.
We used the MUC-4 texts as input and the MUC-4 answer keys as the basis for judging “correct” output (MUC-4 Proceedings 1992).
Training
AutoSlog: 772 relevant texts -> 1237 extraction patterns (450 retained after human review)
AutoSlog-TS: 1500 texts (50% relevant) -> 32345 extraction patterns; 11225 after frequency filtering; 210 retained after review
Testing
To evaluate the two dictionaries, we chose 100 blind texts from the MUC-4 test set. (50 relevant texts and 50 irrelevant texts)
We scored the output by assigning each extracted item to one of five categories: correct, mislabeled, duplicate, spurious, or missing.
Correct: the item matched against the answer keys.
Mislabeled: the item matched against the answer keys but was extracted as the wrong type of object.
Duplicate: the item was coreferent with an item in the answer keys.
Spurious: the item did not refer to any object in the answer keys.
Missing: items in the answer keys that were not extracted.
Experimental Results
We scored three items: perpetrators, victims, and targets.
Experimental Results
We calculated recall as: correct / (correct + missing)
We computed precision as: (correct + duplicate) / (correct + duplicate + mislabeled + spurious)
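The five scoring categories plug into these two formulas directly; a minimal sketch (the counts below are illustrative, not MUC-4 results):

```python
def score(counts):
    """Recall and precision from the five scoring categories on the slide.

    counts: dict with keys correct, mislabeled, duplicate, spurious, missing.
    """
    c, d = counts["correct"], counts["duplicate"]
    recall = c / (c + counts["missing"])
    precision = (c + d) / (c + d + counts["mislabeled"] + counts["spurious"])
    return recall, precision

# Illustrative numbers only:
r, p = score({"correct": 40, "mislabeled": 5, "duplicate": 10,
              "spurious": 15, "missing": 20})
print(round(r, 2), round(p, 2))  # recall = 40/60, precision = 50/70
```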
Behind the scenes
In fact, we have reason to believe that AutoSlog-TS is ultimately capable of producing better recall than AutoSlog because it generates many good patterns that AutoSlog did not.
AutoSlog-TS produced 158 patterns with a relevance rate ≧ 90% and frequency ≧ 5. Only 45 of these patterns were in the original AutoSlog dictionary.
The higher precision demonstrated by AutoSlog-TS is probably a result of the relevance statistics.
Future Directions
A potential problem with AutoSlog-TS is that there are undoubtedly many useful patterns buried deep in the ranked list, which cumulatively could have a substantial impact on performance.
The precision of the extraction patterns could also be improved by adding semantic constraints and, in the long run, creating more complex extraction patterns.
IEPAD: Information Extraction based on Pattern Discovery
C.H. Chang, National Central University, WWW10
Semi-structured Information Extraction
Information Extraction (IE)
Input: HTML pages
Output: a set of records
Pattern Discovery based IE
Motivation
Display of multiple records often forms a repeated pattern.
The occurrences of the pattern are spaced regularly and adjacently.
Now the problem becomes: find regular and adjacent repeats in a string.
IEPAD Architecture
[Architecture diagram: HTML pages -> Pattern Generator -> patterns -> Pattern Viewer (browsed by users) -> extraction rule -> Extractor (applied to an HTML page) -> extraction results]
The Pattern Generator
Four stages: Translator, PAT tree construction, Pattern validator, Rule composer.
[Pipeline diagram: HTML page -> Token Translator -> a token string -> PAT Tree Constructor -> PAT trees and maximal repeats -> Validator -> advanced patterns -> Rule Composer -> extraction rules]
1. Web Page Translation
Encoding of HTML source
Rule 1: Each tag is encoded as a token.
Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore).
HTML Example:
<B>Congo</B><I>242</I><BR>
<B>Egypt</B><I>20</I><BR>
Encoded token string:
T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
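The two encoding rules can be sketched with a simple regex-based translator (an illustrative approximation, not IEPAD's actual tokenizer; real pages would need a proper HTML parser):

```python
import re

def translate(html):
    """Encode an HTML fragment per the two rules above: each tag becomes
    a token T(tag); any text between tags becomes the special token T(_)."""
    tokens = []
    # The capturing group makes re.split keep the tags in the result.
    for part in re.split(r"(<[^>]+>)", html):
        if not part.strip():
            continue  # drop empty fragments between adjacent tags
        if part.startswith("<"):
            tokens.append("T(%s)" % part.upper())
        else:
            tokens.append("T(_)")
    return "".join(tokens)

print(translate("<B>Congo</B><I>242</I><BR>"))
# T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
```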
Various Encoding Schemes

Block-level tags:
  Headings: H1~H6
  Text containers: P, PRE, BLOCKQUOTE, ADDRESS
  Lists: UL, OL, LI, DL, DIR, MENU
  Others: DIV, CENTER, FORM, HR, TABLE, BR

Text-level tags:
  Logical markup: EM, STRONG, DFN, CODE, SAMP, KBD, VAR, CITE
  Physical markup: TT, I, B, U, STRIKE, BIG, SMALL, SUB, SUP, FONT
  Special markup: A, BASEFONT, IMG, APPLET, PARAM, MAP, AREA

Figure 2. Tag classification
2. PAT Tree Construction
PAT tree: binary suffix tree (a Patricia tree constructed over all possible suffix strings of a text)
Example
Binary codes: T(<B>) = 000, T(</B>) = 001, T(<I>) = 010, T(</I>) = 011, T(<BR>) = 100, T(_) = 110
Token string: T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
Encoded bit string: 000110001010110011100 000110001010110011100
Indexing positions:
suffix 1   000110001010110011100000110001010110011100$
suffix 2   110001010110011100000110001010110011100$
suffix 3   001010110011100000110001010110011100$
suffix 4   010110011100000110001010110011100$
suffix 5   110011100000110001010110011100$
suffix 6   011100000110001010110011100$
suffix 7   100000110001010110011100$
suffix 8   000110001010110011100$
suffix 9   110001010110011100$
suffix 10  001010110011100$
suffix 11  010110011100$
suffix 12  110011100$
suffix 13  011100$
suffix 14  100$
The Constructed PAT Tree
[PAT tree diagram: internal nodes labeled a-m, leaves labeled with suffix positions 1-14, edges labeled with binary substrings of the encoded string]
Figure 3. The PAT tree for the Congo code
Definition of Maximal Repeats
Let α occur in S at positions p1, p2, p3, ..., pk.
α is left maximal if there exists at least one pair (i, j) such that S[p_i - 1] ≠ S[p_j - 1].
α is right maximal if there exists at least one pair (i, j) such that S[p_i + |α|] ≠ S[p_j + |α|].
α is a maximal repeat if it is both left maximal and right maximal.
Finding Maximal Repeats
Definition: call the character S[p_i - 1] the left character of suffix p_i.
A node v is left diverse if at least two leaves in v's subtree have different left characters.
Lemma: the path label of an internal node v in a PAT tree is a maximal repeat if and only if v is left diverse.
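The definition can also be checked by brute force; the sketch below enumerates repeated substrings and tests the left/right maximality conditions directly (a PAT tree computes this far more efficiently; this naive version only makes the definition concrete):

```python
def maximal_repeats(s, min_len=2):
    """Naive maximal-repeat finder, checking left/right maximality
    directly from the definition. Occurrences at the string boundary
    are treated as having a unique (None) neighboring character."""
    found = set()
    n = len(s)
    for length in range(min_len, n):
        for i in range(n - length + 1):
            sub = s[i:i + length]
            pos = [j for j in range(n - length + 1) if s[j:j + length] == sub]
            if len(pos) < 2:
                continue  # not a repeat at all
            lefts = {s[p - 1] if p > 0 else None for p in pos}
            rights = {s[p + length] if p + length < n else None for p in pos}
            if len(lefts) > 1 and len(rights) > 1:  # left AND right maximal
                found.add(sub)
    return found

print(maximal_repeats("abcabc"))  # {'abc'}
```

Here "ab" is not a maximal repeat (both occurrences are followed by 'c', so it is not right maximal), while "abc" is.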
3. Pattern Validator
Suppose the occurrences of a maximal repeat α are ordered by position such that p1 < p2 < p3 < ... < pk, where p_i denotes the position of each suffix in the encoded token sequence.
Characteristics of a Pattern
Regularity, measured by the variance coefficient:
V(α) = StdDev({p_{i+1} - p_i | 1 ≤ i < k}) / Mean({p_{i+1} - p_i | 1 ≤ i < k})
Adjacency, measured by the density:
D(α) = (k * |α|) / (p_k + |α| - p_1)
Pattern Validator (Cont.)
Basic Screening: for each maximal repeat α, compute V(α) and D(α).
a) Check the pattern's regularity: keep α only if V(α) < 0.5; otherwise discard it.
b) Check the pattern's density: keep α only if 0.25 < D(α) < 1.5; otherwise discard it.
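The two screening statistics can be sketched as follows, assuming the density definition D(α) = k·|α| / (p_k + |α| - p_1); the function names and the worked positions are illustrative, not IEPAD's code:

```python
from statistics import mean, pstdev

def regularity(pos):
    """V(a): std-dev of the gaps between adjacent occurrences over their mean."""
    gaps = [pos[i + 1] - pos[i] for i in range(len(pos) - 1)]
    return pstdev(gaps) / mean(gaps)

def density(pos, length):
    """D(a): fraction of the spanned region covered by the k occurrences
    (assumes D = k * |a| / (p_k + |a| - p_1))."""
    k = len(pos)
    return k * length / (pos[-1] - pos[0] + length)

def keep(pos, length):
    """Basic screening from the slide: V < 0.5 and 0.25 < D < 1.5."""
    return regularity(pos) < 0.5 and 0.25 < density(pos, length) < 1.5

# A 7-token pattern occurring at positions 1 and 8, as in the Congo example:
print(keep([1, 8], 7))  # evenly spaced and perfectly adjacent -> True
```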
4. Rule Composer
Occurrence partition: flexible variance-threshold control.
Multiple string alignment: increases the density of a pattern.

Occurrence Partition
Problem: some patterns are divided into several blocks (e.g., Lycos and Excite pages, which have large regularity values).
Solution: cluster the occurrences of such a pattern; a cluster is kept if it satisfies V(α) < 0.1 and passes the density check, and discarded otherwise.
Multiple String Alignment
Problem: patterns with density less than 1 can extract only part of the information.
Solution: align the k-1 substrings among the k occurrences.
Multiple alignment is a natural generalization of two-string alignment, which can be solved in O(n*m) time by dynamic programming, where n and m are the string lengths.
Multiple String Alignment (Cont.)
Suppose “adc” is the discovered pattern for the token string “adcwbdadcxbadcxbdadcb”.
If we have the following multiple alignment for the strings “adcwbd”, “adcxb”, and “adcxbd”:
a d c w b d
a d c x b -
a d c x b d
The extraction pattern can be generalized as “adc[w|x]b[d|-]”
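The generalization step can be sketched as a column-wise scan over the aligned rows (a simplification that assumes the alignment is already computed; note the ordering of alternatives inside brackets may differ from the slide's):

```python
def generalize(rows):
    """Collapse a multiple alignment into an extraction pattern:
    columns where all rows agree keep their token; disagreeing
    columns become an alternative group. '-' marks a gap."""
    pattern = []
    for column in zip(*rows):
        symbols = sorted(set(column))
        if len(symbols) == 1:
            pattern.append(symbols[0])
        else:
            pattern.append("[%s]" % "|".join(symbols))
    return "".join(pattern)

rows = ["adcwbd", "adcxb-", "adcxbd"]
print(generalize(rows))  # adc[w|x]b[-|d]
```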
Pattern Viewer Java-application based GUI Web based GUI
http://www.csie.ncu.edu.tw/~chia/WebIEPAD/
The Extractor
Matches the pattern against the encoded token string using Knuth-Morris-Pratt's algorithm or Boyer-Moore's algorithm.
When a rule contains alternatives, the longest matching pattern wins.
What is extracted? The whole record.
Experiment Setup
Fourteen sources: search engines.
Performance measures: number of patterns, retrieval rate, and accuracy rate.
Parameters: encoding scheme, threshold control.
# of Patterns Discovered Using Block-Level Encoding
[Figure 5. Number of patterns validated vs. density threshold (0 to 1), for regularity thresholds r = 0.25, 0.5, and 0.75]
There were on average 117 maximal repeats in our test Web pages.
Translation
Table 2. Size of translated sequences and number of patterns

Encoding Scheme   Length of Sequence   No. of Patterns
All Tag           1128                 7.9
No Physical       873                  6.5
No Special        796                  5.7
Block-Level       514                  4.4

Average page length is 22.7 KB.
Accuracy and Retrieval Rate
Table 5. The performance of multiple string alignment

Search Engine    Retrieval Rate   Accuracy Rate   Matching Percentage
AltaVista        1.00             1.00            0.91
Cora             1.00             1.00            0.97
Excite           1.00             0.97            1.00
Galaxy           1.00             0.95            0.99
Hotbot           0.97             0.86            0.88
Infoseek         0.98             0.94            0.87
Lycos            0.94             0.63            0.94
Magellan         1.00             1.00            0.76
Metacrawler      0.90             0.96            0.78
NorthernLight    0.95             0.96            0.90
Openfind         0.83             0.90            0.66
Savvysearch      1.00             0.95            0.97
Stpt.com         0.99             1.00            0.95
Webcrawler       0.98             0.98            0.98

Average          0.97             0.94            0.90
Summary
IEPAD: Information Extraction based on Pattern Discovery
Components: rule generator, extractor, pattern viewer.
Performance: 97% retrieval rate and 94% accuracy rate.

Problems
IEPAD guarantees a high retrieval rate rather than a high accuracy rate: a generalized rule can extract more than the desired data.
Currently applicable only when there are several records in a Web page.
References
Text IE
Riloff, E. Automatically Generating Extraction Patterns from Untagged Text, Proceedings of AAAI-96, 1996, pp. 1044-1049.
Riloff, E. (1999) Information Extraction as a Stepping Stone toward Story Understanding, In Computational Models of Reading and Understanding, Ashwin Ram and Kenneth Moorman, eds., The MIT Press.
References
Semi-structured IE
D.W. Embley, Y.S. Jiang, and W.-K. Ng, Record-Boundary Discovery in Web Documents, Proceedings of SIGMOD'99.
C.H. Chang. and S.C. Lui. IEPAD: Information Extraction based on Pattern Discovery, WWW10, pp. 681-688, May 2-6, 2001, Hong Kong.
B. Chidlovskii, J. Ragetli, and M. de Rijke, Automatic Wrapper Generation for Web Search Engines, The 1st Intern. Conf. on Web-Age Information Management (WAIM'2000), Shanghai, China, June 2000