Annotation Free Information Extraction
Chia-Hui Chang
Department of Computer Science & Information Engineering
National Central University
[email protected]
10/4/2002
Introduction
Text IE: AutoSlog-TS
Semi-structured IE: IEPAD
AutoSlog-TS: Automatically Generating Extraction Patterns from Untagged Text
Ellen Riloff
University of Utah
AAAI96
AutoSlog-TS
AutoSlog-TS is an extension of AutoSlog. It operates exhaustively, generating an extraction pattern for every noun phrase in the training corpus. It then evaluates the extraction patterns by processing the corpus a second time and generating relevance statistics for each pattern.
A more significant difference is that AutoSlog-TS allows multiple rules to fire if more than one matches the context.
AutoSlog-TS Concept
Relevance Rate
Pr(relevant text | text contains pattern_i) = rel-freq_i / total-freq_i
rel-freq_i: the number of instances of pattern_i that were activated in relevant texts.
total-freq_i: the total number of instances of pattern_i that were activated in the training corpus.
The motivation behind the conditional probability estimate is that domain-specific expressions will appear substantially more often in relevant texts than irrelevant texts.
Rank function
Next, we use a rank function to rank the patterns in order of importance to the domain:
relevance rate * log2(frequency)
So, a person only needs to review the most highly ranked patterns.
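As a rough sketch, the relevance rate and the rank function above can be combined as follows (the pattern strings and counts are made up for illustration; this is not the AutoSlog-TS implementation):

```python
import math

def rank_patterns(stats):
    """Rank extraction patterns by relevance-rate * log2(frequency).

    stats maps a pattern to (rel_freq, total_freq) counts, as defined
    on the slide; the variable names here are illustrative only.
    """
    ranked = []
    for pattern, (rel_freq, total_freq) in stats.items():
        relevance_rate = rel_freq / total_freq
        score = relevance_rate * math.log2(total_freq)
        ranked.append((score, pattern))
    ranked.sort(reverse=True)  # highest-ranked patterns first
    return [(p, s) for s, p in ranked]

# Hypothetical counts: a domain-specific pattern fires mostly in relevant texts.
stats = {"<subj> was murdered": (80, 100), "<subj> said": (300, 1000)}
print(rank_patterns(stats))
```

A reviewer would then inspect only the top of the returned list, which is exactly the human-in-the-loop step the slide describes.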
Experimental Results Setup
We evaluated AutoSlog and AutoSlog-TS by manually inspecting the performance of their dictionaries in the MUC-4 terrorism domain.
We used the MUC-4 texts as input and the MUC-4 answer keys as the basis for judging “correct” output (MUC-4 Proceedings 1992).
Training
AutoSlog: 772 relevant texts -> 1237 extraction patterns (450 retained after human review)
AutoSlog-TS: 1500 texts (50% relevant) -> 32345 extraction patterns; 11225 after frequency filtering; 210 retained after review
Testing
To evaluate the two dictionaries, we chose 100 blind texts from the MUC-4 test set. (50 relevant texts and 50 irrelevant texts)
We scored the output by assigning each extracted item to one of five categories: correct, mislabeled, duplicate, spurious, or missing.
Correct: the item matched against the answer keys.
Mislabeled: the item matched against the answer keys but was extracted as the wrong type of object.
Duplicate: the item was coreferent with an item in the answer keys.
Spurious: the item did not refer to any object in the answer keys.
Missing: items in the answer keys that were not extracted.
Experimental Results
We scored three items: perpetrators, victims, and targets.
Experimental Results
We calculated recall as: correct / (correct + missing)
We computed precision as: (correct + duplicate) / (correct + duplicate + mislabeled + spurious)
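The five scoring categories plug into these two formulas directly; a minimal sketch (the counts below are illustrative, not MUC-4 results):

```python
def score(counts):
    """Recall and precision from the five scoring categories on the slide.

    counts: dict with keys correct, mislabeled, duplicate, spurious, missing.
    """
    c, d = counts["correct"], counts["duplicate"]
    recall = c / (c + counts["missing"])
    precision = (c + d) / (c + d + counts["mislabeled"] + counts["spurious"])
    return recall, precision

# Illustrative numbers only:
r, p = score({"correct": 40, "mislabeled": 5, "duplicate": 10,
              "spurious": 15, "missing": 20})
print(round(r, 2), round(p, 2))  # recall = 40/60, precision = 50/70
```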
Behind the scenes
In fact, we have reason to believe that AutoSlog-TS is ultimately capable of producing better recall than AutoSlog because it generates many good patterns that AutoSlog did not.
AutoSlog-TS produced 158 patterns with a relevance rate ≧ 90% and frequency ≧ 5. Only 45 of these patterns were in the original AutoSlog dictionary.
The higher precision demonstrated by AutoSlog-TS is probably a result of the relevance statistics.
Future Directions
A potential problem with AutoSlog-TS is that there are undoubtedly many useful patterns buried deep in the ranked list, which cumulatively could have a substantial impact on performance.
The precision of the extraction patterns could also be improved by adding semantic constraints and, in the long run, creating more complex extraction patterns.
IEPAD: Information Extraction based on Pattern Discovery
C.H. Chang, National Central University, WWW10
Semi-structured Information Extraction
Information Extraction (IE)
Input: HTML pages
Output: a set of records
Pattern Discovery based IE
Motivation
Display of multiple records often forms a repeated pattern.
The occurrences of the pattern are spaced regularly and adjacently.
Now the problem becomes: find regular and adjacent repeats in a string.
IEPAD Architecture
[Architecture diagram: HTML pages -> Pattern Generator -> patterns -> Pattern Viewer (browsed by users) -> extraction rule -> Extractor (applied to an HTML page) -> extraction results]
The Pattern Generator
Four stages: Translator, PAT tree construction, Pattern validator, Rule composer.
[Pipeline diagram: HTML page -> Token Translator -> a token string -> PAT Tree Constructor -> PAT trees and maximal repeats -> Validator -> advanced patterns -> Rule Composer -> extraction rules]
1. Web Page Translation
Encoding of HTML source
Rule 1: Each tag is encoded as a token.
Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore).
HTML Example:
<B>Congo</B><I>242</I><BR>
<B>Egypt</B><I>20</I><BR>
Encoded token string:
T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
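The two encoding rules can be sketched with a simple regex-based translator (an illustrative approximation, not IEPAD's actual tokenizer; real pages would need a proper HTML parser):

```python
import re

def translate(html):
    """Encode an HTML fragment per the two rules above: each tag becomes
    a token T(tag); any text between tags becomes the special token T(_)."""
    tokens = []
    # The capturing group makes re.split keep the tags in the result.
    for part in re.split(r"(<[^>]+>)", html):
        if not part.strip():
            continue  # drop empty fragments between adjacent tags
        if part.startswith("<"):
            tokens.append("T(%s)" % part.upper())
        else:
            tokens.append("T(_)")
    return "".join(tokens)

print(translate("<B>Congo</B><I>242</I><BR>"))
# T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
```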
Various Encoding Schemes

Block-level tags:
  Headings: H1~H6
  Text containers: P, PRE, BLOCKQUOTE, ADDRESS
  Lists: UL, OL, LI, DL, DIR, MENU
  Others: DIV, CENTER, FORM, HR, TABLE, BR

Text-level tags:
  Logical markup: EM, STRONG, DFN, CODE, SAMP, KBD, VAR, CITE
  Physical markup: TT, I, B, U, STRIKE, BIG, SMALL, SUB, SUP, FONT
  Special markup: A, BASEFONT, IMG, APPLET, PARAM, MAP, AREA

Figure 2. Tag classification
2. PAT Tree Construction
PAT tree: binary suffix tree (a Patricia tree constructed over all possible suffix strings of a text)
Example
Binary codes: T(<B>) = 000, T(</B>) = 001, T(<I>) = 010, T(</I>) = 011, T(<BR>) = 100, T(_) = 110
Token string: T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
Encoded bit string: 000110001010110011100 000110001010110011100
Indexing positions:
suffix 1   000110001010110011100000110001010110011100$
suffix 2   110001010110011100000110001010110011100$
suffix 3   001010110011100000110001010110011100$
suffix 4   010110011100000110001010110011100$
suffix 5   110011100000110001010110011100$
suffix 6   011100000110001010110011100$
suffix 7   100000110001010110011100$
suffix 8   000110001010110011100$
suffix 9   110001010110011100$
suffix 10  001010110011100$
suffix 11  010110011100$
suffix 12  110011100$
suffix 13  011100$
suffix 14  100$
The Constructed PAT Tree
[PAT tree diagram: internal nodes labeled a-m, leaves labeled with suffix positions 1-14, edges labeled with binary substrings of the encoded string]
Figure 3. The PAT tree for the Congo code
Definition of Maximal Repeats
Let α occur in S at positions p1, p2, p3, ..., pk.
α is left maximal if there exists at least one pair (i, j) such that S[p_i - 1] ≠ S[p_j - 1].
α is right maximal if there exists at least one pair (i, j) such that S[p_i + |α|] ≠ S[p_j + |α|].
α is a maximal repeat if it is both left maximal and right maximal.
Finding Maximal Repeats
Definition: call the character S[p_i - 1] the left character of suffix p_i.
A node v is left diverse if at least two leaves in v's subtree have different left characters.
Lemma: the path label of an internal node v in a PAT tree is a maximal repeat if and only if v is left diverse.
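The definition can also be checked by brute force; the sketch below enumerates repeated substrings and tests the left/right maximality conditions directly (a PAT tree computes this far more efficiently; this naive version only makes the definition concrete):

```python
def maximal_repeats(s, min_len=2):
    """Naive maximal-repeat finder, checking left/right maximality
    directly from the definition. Occurrences at the string boundary
    are treated as having a unique (None) neighboring character."""
    found = set()
    n = len(s)
    for length in range(min_len, n):
        for i in range(n - length + 1):
            sub = s[i:i + length]
            pos = [j for j in range(n - length + 1) if s[j:j + length] == sub]
            if len(pos) < 2:
                continue  # not a repeat at all
            lefts = {s[p - 1] if p > 0 else None for p in pos}
            rights = {s[p + length] if p + length < n else None for p in pos}
            if len(lefts) > 1 and len(rights) > 1:  # left AND right maximal
                found.add(sub)
    return found

print(maximal_repeats("abcabc"))  # {'abc'}
```

Here "ab" is not a maximal repeat (both occurrences are followed by 'c', so it is not right maximal), while "abc" is.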
3. Pattern Validator
Suppose the occurrences of a maximal repeat α are ordered by position such that p1 < p2 < p3 < ... < pk, where p_i denotes the position of each suffix in the encoded token sequence.
Characteristics of a Pattern
Regularity, measured by the variance coefficient:
V(α) = StdDev({p_{i+1} - p_i | 1 ≤ i < k}) / Mean({p_{i+1} - p_i | 1 ≤ i < k})
Adjacency, measured by the density:
D(α) = (k * |α|) / (p_k + |α| - p_1)
Pattern Validator (Cont.)
Basic Screening: for each maximal repeat α, compute V(α) and D(α).
a) Check the pattern's regularity: keep α only if V(α) < 0.5; otherwise discard it.
b) Check the pattern's density: keep α only if 0.25 < D(α) < 1.5; otherwise discard it.
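The two screening statistics can be sketched as follows, assuming the density definition D(α) = k·|α| / (p_k + |α| - p_1); the function names and the worked positions are illustrative, not IEPAD's code:

```python
from statistics import mean, pstdev

def regularity(pos):
    """V(a): std-dev of the gaps between adjacent occurrences over their mean."""
    gaps = [pos[i + 1] - pos[i] for i in range(len(pos) - 1)]
    return pstdev(gaps) / mean(gaps)

def density(pos, length):
    """D(a): fraction of the spanned region covered by the k occurrences
    (assumes D = k * |a| / (p_k + |a| - p_1))."""
    k = len(pos)
    return k * length / (pos[-1] - pos[0] + length)

def keep(pos, length):
    """Basic screening from the slide: V < 0.5 and 0.25 < D < 1.5."""
    return regularity(pos) < 0.5 and 0.25 < density(pos, length) < 1.5

# A 7-token pattern occurring at positions 1 and 8, as in the Congo example:
print(keep([1, 8], 7))  # evenly spaced and perfectly adjacent -> True
```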
4. Rule Composer
Occurrence partition: flexible variance-threshold control.
Multiple string alignment: increases the density of a pattern.

Occurrence Partition
Problem: some patterns are divided into several blocks (e.g., Lycos and Excite pages, which have large regularity values).
Solution: cluster the occurrences of such a pattern; a cluster is kept if it satisfies V(α) < 0.1 and passes the density check, and discarded otherwise.
Multiple String Alignment
Problem: patterns with density less than 1 can extract only part of the information.
Solution: align the k-1 substrings among the k occurrences.
Multiple alignment is a natural generalization of two-string alignment, which can be solved in O(n*m) time by dynamic programming, where n and m are the string lengths.
Multiple String Alignment (Cont.)
Suppose “adc” is the discovered pattern for the token string “adcwbdadcxbadcxbdadcb”.
If we have the following multiple alignment for the strings “adcwbd”, “adcxb”, and “adcxbd”:
a d c w b d
a d c x b -
a d c x b d
The extraction pattern can be generalized as “adc[w|x]b[d|-]”
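The generalization step can be sketched as a column-wise scan over the aligned rows (a simplification that assumes the alignment is already computed; note the ordering of alternatives inside brackets may differ from the slide's):

```python
def generalize(rows):
    """Collapse a multiple alignment into an extraction pattern:
    columns where all rows agree keep their token; disagreeing
    columns become an alternative group. '-' marks a gap."""
    pattern = []
    for column in zip(*rows):
        symbols = sorted(set(column))
        if len(symbols) == 1:
            pattern.append(symbols[0])
        else:
            pattern.append("[%s]" % "|".join(symbols))
    return "".join(pattern)

rows = ["adcwbd", "adcxb-", "adcxbd"]
print(generalize(rows))  # adc[w|x]b[-|d]
```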
Pattern Viewer Java-application based GUI Web based GUI
http://www.csie.ncu.edu.tw/~chia/WebIEPAD/
The Extractor
Matches the pattern against the encoded token string using Knuth-Morris-Pratt's algorithm or Boyer-Moore's algorithm.
When a rule contains alternatives, the longest matching pattern wins.
What is extracted? The whole record.
Experiment Setup
Fourteen sources: search engines.
Performance measures: number of patterns, retrieval rate, and accuracy rate.
Parameters: encoding scheme, threshold control.
# of Patterns Discovered Using Block-Level Encoding
[Figure 5. Number of patterns validated vs. density threshold (0 to 1), for regularity thresholds r = 0.25, 0.5, and 0.75]
There were on average 117 maximal repeats in our test Web pages.
Translation
Table 2. Size of translated sequences and number of patterns

Encoding Scheme   Length of Sequence   No. of Patterns
All Tag           1128                 7.9
No Physical       873                  6.5
No Special        796                  5.7
Block-Level       514                  4.4

Average page length is 22.7 KB.
Accuracy and Retrieval Rate
Table 5. The performance of multiple string alignment

Search Engine    Retrieval Rate   Accuracy Rate   Matching Percentage
AltaVista        1.00             1.00            0.91
Cora             1.00             1.00            0.97
Excite           1.00             0.97            1.00
Galaxy           1.00             0.95            0.99
Hotbot           0.97             0.86            0.88
Infoseek         0.98             0.94            0.87
Lycos            0.94             0.63            0.94
Magellan         1.00             1.00            0.76
Metacrawler      0.90             0.96            0.78
NorthernLight    0.95             0.96            0.90
Openfind         0.83             0.90            0.66
Savvysearch      1.00             0.95            0.97
Stpt.com         0.99             1.00            0.95
Webcrawler       0.98             0.98            0.98

Average          0.97             0.94            0.90
Summary
IEPAD: Information Extraction based on Pattern Discovery
Components: rule generator, extractor, pattern viewer.
Performance: 97% retrieval rate and 94% accuracy rate.

Problems
IEPAD guarantees a high retrieval rate rather than a high accuracy rate: a generalized rule can extract more than the desired data.
Currently applicable only when there are several records in a Web page.
References
Text IE
Riloff, E. Automatically Generating Extraction Patterns from Untagged Text, Proceedings of AAAI-96, 1996, pp. 1044-1049.
Riloff, E. (1999) Information Extraction as a Stepping Stone toward Story Understanding, In Computational Models of Reading and Understanding, Ashwin Ram and Kenneth Moorman, eds., The MIT Press.
References
Semi-structured IE
D.W. Embley, Y.S. Jiang, and W.-K. Ng, Record-Boundary Discovery in Web Documents, Proceedings of SIGMOD'99.
C.H. Chang. and S.C. Lui. IEPAD: Information Extraction based on Pattern Discovery, WWW10, pp. 681-688, May 2-6, 2001, Hong Kong.
B. Chidlovskii, J. Ragetli, and M. de Rijke, Automatic Wrapper Generation for Web Search Engines, The 1st Intern. Conf. on Web-Age Information Management (WAIM'2000), Shanghai, China, June 2000