Annotation Free Information Extraction Chia-Hui Chang Department of Computer Science & Information Engineering National Central University [email protected] 10/4/2002


Page 1: Annotation Free Information Extraction Chia-Hui Chang Department of Computer Science & Information Engineering National Central University chia@csie.ncu.edu.tw

Annotation Free Information Extraction

Chia-Hui Chang Department of Computer Science & Information Engineering

National Central University
chia@csie.ncu.edu.tw

10/4/2002

Page 2:

Introduction

TEXT IE: AutoSlog-TS

Semi-structured IE: IEPAD

Page 3:

AutoSlog-TS: Automatically Generating Extraction Patterns from Untagged Text

Ellen Riloff

University of Utah

AAAI96

Page 4:

AutoSlog-TS

AutoSlog-TS is an extension of AutoSlog. It operates exhaustively, generating an extraction pattern for every noun phrase in the training corpus. It then evaluates the extraction patterns by processing the corpus a second time and generating relevance statistics for each pattern.

A more significant difference is that AutoSlog-TS allows multiple rules to fire if more than one matches the context.

Page 5:

AutoSlog-TS Concept

Page 6:

Relevance Rate

Pr(relevant text | text contains pattern_i) = rel-freq_i / total-freq_i

rel-freq_i: the number of instances of pattern_i that were activated in relevant texts.

total-freq_i: the total number of instances of pattern_i that were activated in the training corpus.

The motivation behind the conditional probability estimate is that domain-specific expressions will appear substantially more often in relevant texts than irrelevant texts.
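As a sketch, the relevance rate can be computed from per-pattern activation counts; the patterns and counts below are hypothetical, not from the MUC-4 experiments.

```python
from collections import Counter

# Hypothetical pattern activations: (pattern, text_is_relevant).
activations = [
    ("<subj> was kidnapped", True),
    ("<subj> was kidnapped", True),
    ("<subj> was kidnapped", False),
    ("<subj> reported", True),
    ("<subj> reported", False),
    ("<subj> reported", False),
    ("<subj> reported", False),
]

# total-freq_i and rel-freq_i from the definitions above.
total_freq = Counter(p for p, _ in activations)
rel_freq = Counter(p for p, relevant in activations if relevant)

relevance_rate = {p: rel_freq[p] / total_freq[p] for p in total_freq}
print(relevance_rate["<subj> was kidnapped"])  # 2/3
print(relevance_rate["<subj> reported"])       # 1/4 = 0.25
```

The domain-specific pattern ends up with the higher rate, matching the motivation above.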

Page 7:

Rank function

Next, we use a rank function to rank the patterns in order of importance to the domain:

relevance rate * log2(frequency)

So, a person only needs to review the most highly ranked patterns.
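A minimal sketch of this ranking, with hypothetical patterns and statistics:

```python
import math

# Hypothetical (pattern, relevance_rate, total_freq) triples.
patterns = [
    ("<subj> was kidnapped", 0.90, 48),
    ("<subj> reported", 0.25, 200),
    ("murder of <np>", 0.80, 4),
]

def rank_score(relevance_rate, frequency):
    # relevance rate * log2(frequency), as in the rank function above
    return relevance_rate * math.log2(frequency)

ranked = sorted(patterns, key=lambda t: rank_score(t[1], t[2]), reverse=True)
for pattern, rel, freq in ranked:
    print(pattern, round(rank_score(rel, freq), 2))
```

The log2 term keeps a rare but highly relevant pattern from outranking a frequent, reasonably relevant one.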

Page 8:

Experimental Results Setup

We evaluated AutoSlog and AutoSlog-TS by manually inspecting the performance of their dictionaries in the MUC-4 terrorism domain.

We used the MUC-4 texts as input and the MUC-4 answer keys as the basis for judging “correct” output (MUC-4 Proceedings 1992).

Training

Training texts and extraction patterns:

AutoSlog-TS: 1,500 training texts (50% relevant) yielded 32,345 extraction patterns, reduced to 11,225 after frequency filtering and to 210 after human review.

AutoSlog: 772 relevant training texts yielded 1,237 extraction patterns, reduced to 450 after human review.

Page 9:

Testing

To evaluate the two dictionaries, we chose 100 blind texts from the MUC-4 test set. (50 relevant texts and 50 irrelevant texts)

We scored the output by assigning each extracted item to one of five categories: correct, mislabeled, duplicate, spurious, or missing.

Correct: if an item matched against the answer keys.
Mislabeled: if an item matched against the answer keys but was extracted as the wrong type of object.
Duplicate: if an item was coreferent with an item in the answer keys.
Spurious: if an item did not refer to any object in the answer keys.
Missing: items in the answer keys that were not extracted.

Page 10:

Experimental Results

We scored three items: perpetrators, victims, and targets.

Page 11:

Experimental Results

We calculated recall as correct / (correct + missing).

We computed precision as (correct + duplicate) / (correct + duplicate + mislabeled + spurious).
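These two scoring formulas translate directly into code; the counts below are hypothetical, not MUC-4 results.

```python
def recall(correct, missing):
    # recall = correct / (correct + missing)
    return correct / (correct + missing)

def precision(correct, duplicate, mislabeled, spurious):
    # precision = (correct + duplicate) / (correct + duplicate + mislabeled + spurious)
    return (correct + duplicate) / (correct + duplicate + mislabeled + spurious)

# Hypothetical counts for one extraction slot.
print(recall(40, 10))           # 0.8
print(precision(40, 5, 3, 12))  # 45 / 60 = 0.75
```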

Page 12:

Behind the scenes

In fact, we have reason to believe that AutoSlog-TS is ultimately capable of producing better recall than AutoSlog because it generates many good patterns that AutoSlog did not.

AutoSlog-TS produced 158 patterns with a relevance rate ≧ 90% and frequency ≧ 5. Only 45 of these patterns were in the original AutoSlog dictionary.

The higher precision demonstrated by AutoSlog-TS is probably a result of the relevance statistics.

Page 13:

Future Directions

A potential problem with AutoSlog-TS is that there are undoubtedly many useful patterns buried deep in the ranked list, which cumulatively could have a substantial impact on performance.

The precision of the extraction patterns could also be improved by adding semantic constraints and, in the long run, creating more complex extraction patterns.

Page 14:

IEPAD: Information Extraction based on Pattern Discovery

C.H. Chang, National Central University, WWW10

Page 15:

Semi-structured Information Extraction

Information Extraction (IE):
Input: HTML pages
Output: a set of records

Page 16:

Pattern Discovery based IE

Motivation:
Display of multiple records often forms a repeated pattern.
The occurrences of the pattern are spaced regularly and adjacently.

Now the problem becomes: find regular and adjacent repeats in a string.

Page 17:

IEPAD Architecture

Pattern Generator: takes an HTML page and discovers patterns.
Pattern Viewer: presents the patterns so that users can compose an extraction rule.
Extractor: applies the extraction rule to HTML pages to produce the extraction results.

Page 18:

The Pattern Generator

Translator, PAT tree construction, pattern validator, rule composer:

HTML Page -> Token Translator -> (a token string) -> PAT Tree Constructor -> (PAT trees and maximal repeats) -> Validator -> (advanced patterns) -> Rule Composer -> (extraction rules)

Page 19:

1. Web Page Translation

Encoding of HTML source:
Rule 1: Each tag is encoded as a token.
Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore).

HTML example:
<B>Congo</B><I>242</I><BR>
<B>Egypt</B><I>20</I><BR>

Encoded token string:
T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)

T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
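The two encoding rules can be sketched in a few lines; this toy translator handles only well-formed tag/text interleavings, not full HTML:

```python
import re

def translate(html):
    # Rule 1: each tag becomes a token; Rule 2: any text between
    # tags becomes the special TEXT token, written T(_).
    tokens = []
    for part in re.split(r"(<[^>]+>)", html):
        if not part.strip():
            continue
        if part.startswith("<"):
            tokens.append("T(%s)" % part)
        else:
            tokens.append("T(_)")
    return "".join(tokens)

print(translate("<B>Congo</B><I>242</I><BR>"))
# T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
```

Note that the Congo and Egypt rows translate to the identical token string, which is what makes the repeated pattern discoverable.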

Page 20:

Various Encoding Schemes

Block-level tags:
  Headings: H1~H6
  Text containers: P, PRE, BLOCKQUOTE, ADDRESS
  Lists: UL, OL, LI, DL, DIR, MENU
  Others: DIV, CENTER, FORM, HR, TABLE, BR

Text-level tags:
  Logical markup: EM, STRONG, DFN, CODE, SAMP, KBD, VAR, CITE
  Physical markup: TT, I, B, U, STRIKE, BIG, SMALL, SUB, SUP, FONT
  Special markup: A, BASEFONT, IMG, APPLET, PARAM, MAP, AREA

Figure 2. Tag classification

Page 21:

2. PAT Tree Construction

PAT tree: binary suffix tree, i.e., a Patricia tree constructed over all possible suffix strings of a text.

Example token encoding:
T(<B>) 000, T(</B>) 001, T(<I>) 010, T(</I>) 011, T(<BR>) 100, T(_) 110

The token string T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>), repeated twice, encodes to:
000110001010110011100000110001010110011100

Indexing positions:
suffix 1  000110001010110011100000110001010110011100$
suffix 2  110001010110011100000110001010110011100$
suffix 3  001010110011100000110001010110011100$
suffix 4  010110011100000110001010110011100$
suffix 5  110011100000110001010110011100$
suffix 6  011100000110001010110011100$
suffix 7  100000110001010110011100$
suffix 8  000110001010110011100$
suffix 9  110001010110011100$
suffix 10 001010110011100$
suffix 11 010110011100$
suffix 12 110011100$
suffix 13 011100$
suffix 14 100$

Page 22:

The Constructed PAT Tree

[PAT tree diagram omitted: internal nodes branch on bits 0 and 1; the 14 leaves hold the suffix indices.]

Figure 3. The PAT tree for the Congo code

Page 23:

Definition of Maximal Repeats

Let α occur in S at positions p1, p2, p3, ..., pk. Then:
α is left maximal if there exists at least one pair (i, j) such that S[pi - 1] ≠ S[pj - 1]
α is right maximal if there exists at least one pair (i, j) such that S[pi + |α|] ≠ S[pj + |α|]
α is a maximal repeat if it is both left maximal and right maximal
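The definition can be checked naively (this is not the PAT-tree algorithm, just a direct test of left and right maximality; string boundaries are treated as a single sentinel):

```python
def occurrences(s, alpha):
    # All start positions of alpha in s (overlaps allowed).
    return [i for i in range(len(s) - len(alpha) + 1) if s[i:i + len(alpha)] == alpha]

def is_maximal_repeat(s, alpha):
    pos = occurrences(s, alpha)
    if len(pos) < 2:
        return False
    # Left/right characters; None stands in for the string boundary.
    lefts = {s[p - 1] if p > 0 else None for p in pos}
    rights = {s[p + len(alpha)] if p + len(alpha) < len(s) else None for p in pos}
    # Left maximal and right maximal: some pair differs on each side.
    return len(lefts) > 1 and len(rights) > 1

s = "adcwbdadcxbadcxbd"
print(is_maximal_repeat(s, "adc"))   # True
print(is_maximal_repeat(s, "adcx"))  # False: always followed by 'b'
```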

Page 24:

Finding Maximal Repeats

Definition: call the character S[pi - 1] the left character of suffix pi.

A node v is left diverse if at least two leaves in v's subtree have different left characters.

Lemma: the path label of an internal node v in a PAT tree is a maximal repeat if and only if v is left diverse.

Page 25:

3. Pattern Validator

Suppose the occurrences of a maximal repeat α are ordered by position such that p1 < p2 < ... < pk, where pi denotes the position of each suffix in the encoded token sequence.

Characteristics of a pattern:

Regularity (variance coefficient):
V(α) = StdDev({p_{i+1} - p_i | 1 ≦ i < k}) / Mean({p_{i+1} - p_i | 1 ≦ i < k})

Adjacency (density):
D(α) = ((k - 1) * |α|) / (pk - p1)
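Both measures are easy to compute from the occurrence positions of a maximal repeat; here the population standard deviation is assumed, and the positions are hypothetical:

```python
import statistics

def regularity(positions):
    # V: coefficient of variation of the gaps between occurrences.
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return statistics.pstdev(gaps) / statistics.mean(gaps)

def density(positions, pattern_len):
    # D: (k - 1) * |alpha| / (p_k - p_1).
    k = len(positions)
    return (k - 1) * pattern_len / (positions[-1] - positions[0])

pos = [0, 7, 14, 21]    # hypothetical, evenly spaced occurrences of a length-7 repeat
print(regularity(pos))  # 0.0: perfectly regular
print(density(pos, 7))  # 1.0: occurrences are exactly adjacent
```

A record boundary pattern in a listing page should score close to this ideal: near-zero regularity and density near 1.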

Page 26:

Pattern Validator (Cont.)

Basic screening: for each maximal repeat α, compute V(α) and D(α).
a) Check the pattern's regularity: if V(α) < 0.5, keep it; otherwise discard it.
b) Check the pattern's density: if 0.25 < D(α) < 1.5, keep it as a pattern; otherwise discard it.

Page 27:

4. Rule Composer Occurrence partition

Flexible variance threshold control Multiple string alignment

Increase density of a pattern

Page 28:

Occurrence Partition

Problem: some patterns are divided into several blocks (e.g., Lycos and Excite pages yield patterns with large regularity values).

Solution: cluster the occurrences of such a pattern; a cluster is kept for the density check if V(α) < 0.1, and discarded otherwise.

Page 29:

Multiple String Alignment

Problem: patterns with density less than 1 can extract only part of the information.

Solution: align the k-1 substrings among the k occurrences.

Multiple string alignment is a natural generalization of alignment for two strings, which can be solved in O(n*m) time by dynamic programming, where n and m are the string lengths.

Page 30:

Multiple String Alignment (Cont.)

Suppose "adc" is the discovered pattern for the token string "adcwbdadcxbadcxbdadcb".

If we have the following multiple alignment for strings ``adcwbd'', ``adcxb'' and ``adcxbd'':

a d c w b d

a d c x b -

a d c x b d

The extraction pattern can be generalized as “adc[w|x]b[d|-]”
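This generalization step can be sketched as a column-wise scan of the alignment (alternatives are emitted in sorted order, so the gap symbol may appear first):

```python
def generalize(aligned):
    # Column-wise scan: a column with one symbol stays fixed;
    # otherwise it becomes a set of alternatives ('-' marks a gap).
    rule = []
    for column in zip(*aligned):
        symbols = sorted(set(column))
        rule.append(symbols[0] if len(symbols) == 1 else "[" + "|".join(symbols) + "]")
    return "".join(rule)

alignment = ["adcwbd", "adcxb-", "adcxbd"]
print(generalize(alignment))  # adc[w|x]b[-|d]
```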

Page 31:

Pattern Viewer

Java-application-based GUI
Web-based GUI

http://www.csie.ncu.edu.tw/~chia/WebIEPAD/

Page 32:

The Extractor

Matching the pattern against the encoded token string: Knuth-Morris-Pratt's algorithm or Boyer-Moore's algorithm.

Alternatives in a rule: match the longest pattern.

What is extracted? The whole record.

Page 33:

Experiment Setup

Fourteen sources: search engines.
Performance measures: number of patterns; retrieval rate and accuracy rate.
Parameters: encoding scheme; threshold control.

Page 34:

Number of Patterns Discovered Using Block-Level Encoding

[Bar chart omitted: x-axis is density (0 to 1), y-axis is the number of patterns (0 to 14), with one series per regularity threshold r = 0.25, 0.5, 0.75.]

Figure 5. Number of patterns validated

Average 117 maximal repeats in our test Web pages

Page 35:

Translation

Table 2. Size of translated sequences and number of patterns

Encoding Scheme Length of Sequence No. of Patterns

All Tag 1128 7.9

No Physical 873 6.5

No Special 796 5.7

Block-Level 514 4.4

Average page length is 22.7KB

Page 36:

Accuracy and Retrieval Rate

Table 5. The performance of multiple string alignment

Search Engine   Retrieval Rate  Accuracy Rate  Matching Percentage
AltaVista       1.00            1.00           0.91
Cora            1.00            1.00           0.97
Excite          1.00            0.97           1.00
Galaxy          1.00            0.95           0.99
Hotbot          0.97            0.86           0.88
Infoseek        0.98            0.94           0.87
Lycos           0.94            0.63           0.94
Magellan        1.00            1.00           0.76
Metacrawler     0.90            0.96           0.78
NorthernLight   0.95            0.96           0.90
Openfind        0.83            0.90           0.66
Savvysearch     1.00            0.95           0.97
Stpt.com        0.99            1.00           0.95
Webcrawler      0.98            0.98           0.98

Average         0.97            0.94           0.90

Page 37:

Summary

IEPAD: Information Extraction based on Pattern Discovery
Rule generator
The extractor
Pattern viewer

Performance: 97% retrieval rate and 94% accuracy rate

Page 38:

Problems

IEPAD guarantees a high retrieval rate rather than a high accuracy rate: the generalized rule can extract more than the desired data.

It is currently only applicable when there are several records in a Web page.

Page 39:

References

TEXT IE

Riloff, E. (1996). Automatically Generating Extraction Patterns from Untagged Text. Proceedings of AAAI-96, pp. 1044-1049.

Riloff, E. (1999). Information Extraction as a Stepping Stone toward Story Understanding. In Computational Models of Reading and Understanding, Ashwin Ram and Kenneth Moorman, eds., The MIT Press.

Page 40:

References

Semi-structured IE

Embley, D.W., Jiang, Y.S., and Ng, W.-K. (1999). Record-Boundary Discovery in Web Documents. Proceedings of SIGMOD'99.

Chang, C.-H. and Lui, S.-C. (2001). IEPAD: Information Extraction based on Pattern Discovery. WWW10, pp. 681-688, May 2-6, 2001, Hong Kong.

Chidlovskii, B., Ragetli, J., and de Rijke, M. (2000). Automatic Wrapper Generation for Web Search Engines. The 1st International Conference on Web-Age Information Management (WAIM'2000), Shanghai, China, June 2000.