![Page 1: Annotation Free Information Extraction Chia-Hui Chang Department of Computer Science & Information Engineering National Central University chia@csie.ncu.edu.tw](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d365503460f94a0d857/html5/thumbnails/1.jpg)
Annotation Free Information Extraction
Chia-Hui Chang
Department of Computer Science & Information Engineering
National Central University
chia@csie.ncu.edu.tw
10/4/2002
Introduction
Text IE: AutoSlog-TS
Semi-structured IE: IEPAD
AutoSlog-TS: Automatically Generating Extraction Patterns from Untagged Text
Ellen Riloff
University of Utah
AAAI96
AutoSlog-TS
AutoSlog-TS is an extension of AutoSlog.
It operates exhaustively by generating an extraction pattern for every noun phrase in the training corpus.
It then evaluates the extraction patterns by processing the corpus a second time and generating relevance statistics for each pattern.
A more significant difference is that AutoSlog-TS allows multiple rules to fire if more than one matches the context.
AutoSlog-TS Concept
Relevance Rate
$$\Pr(\text{relevant text} \mid \text{text contains pattern}_i) = \frac{\text{rel-freq}_i}{\text{total-freq}_i}$$
$\text{rel-freq}_i$: the number of instances of $\text{pattern}_i$ that were activated in relevant texts.
$\text{total-freq}_i$: the total number of instances of $\text{pattern}_i$ that were activated in the training corpus.
The motivation behind the conditional probability estimate is that domain-specific expressions will appear substantially more often in relevant texts than irrelevant texts.
Rank function
Next, we use a rank function to rank the patterns in order of importance to the domain:
relevance rate * log2(frequency)
So, a person only needs to review the most highly ranked patterns.
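The relevance rate and the rank function can be sketched in a few lines of Python. This is only an illustration: the pattern strings and counts below are hypothetical, and taking "frequency" to mean the total frequency of a pattern is an assumption, since the slide does not say which count is meant.

```python
import math

def relevance_rate(rel_freq, total_freq):
    """Estimate Pr(relevant text | text contains the pattern)."""
    return rel_freq / total_freq

def rank_score(rel_freq, total_freq):
    """relevance rate * log2(frequency); frequency = total_freq is an assumption."""
    return relevance_rate(rel_freq, total_freq) * math.log2(total_freq)

# Hypothetical counts: pattern -> (rel_freq, total_freq)
counts = {
    "<subj> was murdered": (9, 10),
    "<subj> was located": (5, 50),
    "exploded in <np>": (4, 4),
}

# Highest score first, so a reviewer can stop after the top of the list.
ranking = sorted(counts, key=lambda p: rank_score(*counts[p]), reverse=True)
```

The logarithm keeps a very frequent but weakly relevant pattern from outranking a less frequent pattern that fires almost only in relevant texts.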
Experimental Results Setup
We evaluated AutoSlog and AutoSlog-TS by manually inspecting the performance of their dictionaries in the MUC-4 terrorism domain.
We used the MUC-4 texts as input and the MUC-4 answer keys as the basis for judging “correct” output (MUC-4 Proceedings 1992).
Training

| System | Texts | Extraction Patterns |
|---|---|---|
| AutoSlog-TS | 1500, 50% relevant | 32345 → 11225 → 210 |
| AutoSlog | 772 relevant | 1237 → 450 |
Testing
To evaluate the two dictionaries, we chose 100 blind texts from the MUC-4 test set. (50 relevant texts and 50 irrelevant texts)
We scored the output by assigning each extracted item to one of five categories: correct, mislabeled, duplicate, spurious, or missing.
Correct: If an item matched against the answer keys.
Mislabeled: If an item matched against the answer keys but was extracted as the wrong type of object.
Duplicate: If an item was coreferent with an item in the answer keys.
Spurious: If an item did not refer to any object in the answer keys.
Missing: Items in the answer keys that were not extracted.
Experimental Results
We scored three items: perpetrators, victims, and targets.
Experimental Results
We calculated recall as: correct / (correct + missing)
We computed precision as: (correct + duplicate) / (correct + duplicate + mislabeled + spurious)
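As a small worked illustration, the two measures can be written as Python functions over the five scoring categories. The counts used in the example call are hypothetical, not results from the experiment.

```python
def recall(correct, missing):
    """correct / (correct + missing)"""
    return correct / (correct + missing)

def precision(correct, duplicate, mislabeled, spurious):
    """(correct + duplicate) / (correct + duplicate + mislabeled + spurious)"""
    hits = correct + duplicate
    return hits / (hits + mislabeled + spurious)

# Hypothetical counts for one slot type
r = recall(correct=30, missing=10)
p = precision(correct=30, duplicate=10, mislabeled=5, spurious=35)
```

Note that duplicates count in favor of precision but do not enter recall, so the two measures are computed over different totals.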
Behind the scenes
In fact, we have reason to believe that AutoSlog-TS is ultimately capable of producing better recall than AutoSlog because it generates many good patterns that AutoSlog did not.
AutoSlog-TS produced 158 patterns with a relevance rate ≥ 90% and frequency ≥ 5. Only 45 of these patterns were in the original AutoSlog dictionary.
The higher precision demonstrated by AutoSlog-TS is probably a result of the relevance statistics.
Future Directions
A potential problem with AutoSlog-TS is that there are undoubtedly many useful patterns buried deep in the ranked list, which cumulatively could have a substantial impact on performance.
The precision of the extraction patterns could also be improved by adding semantic constraints and, in the long run, creating more complex extraction patterns.
IEPAD: Information Extraction based on Pattern Discovery
C.H. Chang
National Central University
WWW10
Semi-structured Information Extraction
Information Extraction (IE)
Input: HTML pages
Output: A set of records
Pattern Discovery based IE
Motivation
Display of multiple records often forms a repeated pattern
The occurrences of the pattern are spaced regularly and adjacently
Now the problem becomes ...
Find regular and adjacent repeats in a string
IEPAD Architecture
[Figure: IEPAD architecture. Components: Pattern Generator, Pattern Viewer, Extractor. Labels: HTML Page, Patterns, Extraction Rule, Users, HTML Pages, Extraction Results.]
The Pattern Generator
Translator
PAT tree construction
Pattern validator
Rule composer
[Figure: HTML Page → Token Translator → A Token String → PAT Tree Constructor → PAT Trees and Maximal Repeats → Validator → Advanced Patterns → Rule Composer → Extraction Rules]
1. Web Page Translation
Encoding of HTML source
Rule 1: Each tag is encoded as a token
Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore)
HTML Example:
<B>Congo</B><I>242</I><BR>
<B>Egypt</B><I>20</I><BR>
Encoded token string:
T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
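The two encoding rules can be sketched with a regular expression, as a minimal illustration rather than the IEPAD translator itself. The function name `encode`, the choice to drop tag attributes, and the choice to ignore whitespace-only text are assumptions made for this sketch.

```python
import re

# A piece is either a tag (<...>) or a run of text between tags.
PIECE = re.compile(r"<[^>]+>|[^<]+")
TAG_NAME = re.compile(r"<\s*(/?)\s*([A-Za-z0-9]+)")

def encode(html):
    """Translate HTML into a list of tokens: one per tag, '_' per text run."""
    tokens = []
    for piece in PIECE.findall(html):
        if piece.startswith("<"):
            m = TAG_NAME.match(piece)
            if m:  # Rule 1: each tag becomes a token (attributes dropped)
                tokens.append("<" + m.group(1) + m.group(2).upper() + ">")
        elif piece.strip():  # Rule 2: text between tags becomes TEXT
            tokens.append("_")
    return tokens

page = "<B>Congo</B><I>242</I><BR>\n<B>Egypt</B><I>20</I><BR>"
tokens = encode(page)
```

Both records map to the same seven-token sequence, which is what makes the repeated pattern visible to the later stages.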
Various Encoding Schemes
| Tag class | Group | Tags |
|---|---|---|
| Block-level tags | Headings | H1~H6 |
| Block-level tags | Text containers | P, PRE, BLOCKQUOTE, ADDRESS |
| Block-level tags | Lists | UL, OL, LI, DL, DIR, MENU |
| Block-level tags | Others | DIV, CENTER, FORM, HR, TABLE, BR |
| Text-level tags | Logical markup | EM, STRONG, DFN, CODE, SAMP, KBD, VAR, CITE |
| Text-level tags | Physical markup | TT, I, B, U, STRIKE, BIG, SMALL, SUB, SUP, FONT |
| Text-level tags | Special markup | A, BASEFONT, IMG, APPLET, PARAM, MAP, AREA |

Figure 2. Tag classification
2. PAT Tree Construction
PAT tree: binary suffix tree
A Patricia tree constructed over all possible suffix strings of a text
Example
Token codes: T(<B>) = 000, T(</B>) = 001, T(<I>) = 010, T(</I>) = 011, T(<BR>) = 100, T(_) = 110
Token string: T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
Binary string: 000110001010110011100000110001010110011100
Indexing position:
suffix 1: 000110001010110011100000110001010110011100$
suffix 2: 110001010110011100000110001010110011100$
suffix 3: 001010110011100000110001010110011100$
suffix 4: 010110011100000110001010110011100$
suffix 5: 110011100000110001010110011100$
suffix 6: 011100000110001010110011100$
suffix 7: 100000110001010110011100$
suffix 8: 000110001010110011100$
suffix 9: 110001010110011100$
suffix 10: 001010110011100$
suffix 11: 010110011100$
suffix 12: 110011100$
suffix 13: 011100$
suffix 14: 100$
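The binary string and the fourteen suffixes of the example can be reproduced with a short Python sketch. It only generates the suffix strings that a PAT tree would index; it does not build the tree. The helper names are illustrative.

```python
# Three-bit codes from the example above.
CODES = {"<B>": "000", "</B>": "001", "<I>": "010",
         "</I>": "011", "<BR>": "100", "_": "110"}

def to_binary(tokens):
    """Concatenate the fixed-length code of every token."""
    return "".join(CODES[t] for t in tokens)

def token_suffixes(tokens):
    """One suffix per token position, each terminated by '$'."""
    return [to_binary(tokens[i:]) + "$" for i in range(len(tokens))]

record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
tokens = record * 2
binary = to_binary(tokens)
suffixes = token_suffixes(tokens)
```

Suffixes start only at token boundaries (every three bits), which is why a 42-bit string yields 14 suffixes rather than 42.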
The Constructed PAT Tree
Figure 3. The PAT tree for the Congo code (nodes labeled a to m; leaves indexed by suffix position)
Definition of Maximal Repeats
Let $\alpha$ occur in $S$ at positions $p_1, p_2, p_3, \dots, p_k$.
$\alpha$ is left maximal if there exists at least one $(i, j)$ pair such that $S[p_i - 1] \neq S[p_j - 1]$.
$\alpha$ is right maximal if there exists at least one $(i, j)$ pair such that $S[p_i + |\alpha|] \neq S[p_j + |\alpha|]$.
$\alpha$ is a maximal repeat if it is both left maximal and right maximal.
Finding Maximal Repeats
Definition: Let's call character $S[p_i - 1]$ the left character of suffix $p_i$.
A node $v$ is left diverse if at least two leaves in $v$'s subtree have different left characters.
Lemma: The path label $\alpha$ of an internal node $v$ in a PAT tree is a maximal repeat if and only if $v$ is left diverse.
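To make the definition concrete, here is a brute-force sketch that checks left and right maximality directly on a string in which each character stands for one token. It deliberately does not use a PAT tree, so it is far slower than the lemma-based approach and is meant only as an illustration. Treating the string boundary as a character different from all others is an assumption of this sketch.

```python
from collections import defaultdict

def maximal_repeats(s):
    """Return {substring: positions} for every maximal repeat in s."""
    n = len(s)
    found = {}
    for length in range(1, n):
        positions = defaultdict(list)
        for p in range(n - length + 1):
            positions[s[p:p + length]].append(p)
        for sub, occ in positions.items():
            if len(occ) < 2:
                continue  # not a repeat
            # None marks "no character" at either end of the string.
            left = {s[p - 1] if p > 0 else None for p in occ}
            right = {s[p + length] if p + length < n else None for p in occ}
            if len(left) > 1 and len(right) > 1:
                found[sub] = occ
    return found

repeats = maximal_repeats("adcwbdadcxbadcxbdadcb")
```

In this string "adc" is a maximal repeat, while "ad" is not: every occurrence of "ad" is followed by "c", so it can always be extended to the right.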
3. Pattern Validator
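Suppose the occurrences of a maximal repeat $\alpha$ are ordered by position such that $p_1 < p_2 < p_3 < \dots < p_k$, where $p_i$ denotes the position of each suffix in the encoded token sequence.
Characteristics of a Pattern
Regularity: Variance coefficient
$$V(\alpha) = \frac{\mathrm{StdDev}(\{p_{i+1} - p_i \mid 1 \le i < k\})}{\mathrm{Mean}(\{p_{i+1} - p_i \mid 1 \le i < k\})}$$
Adjacency: Density
$$D(\alpha) = \frac{k \cdot |\alpha|}{p_k - p_1 + |\alpha|}$$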
Pattern Validator (Cont.)
Basic Screening
For each maximal repeat $\alpha$, compute $V(\alpha)$ and $D(\alpha)$
a) check if the pattern's variance: $V(\alpha) < 0.5$
b) check if the pattern's density: $0.25 < D(\alpha) < 1.5$
[Flowchart: a maximal repeat is kept as a pattern only if both checks answer Yes; a No on either check discards it.]
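The basic screening step can be sketched as follows, reading the variance coefficient as the standard deviation of the gaps between adjacent occurrences divided by their mean, and the density as k·|α| / (p_k − p_1 + |α|). Using the population standard deviation is an assumption, since the slides do not say which estimator is meant; the function names are illustrative.

```python
from statistics import mean, pstdev

def variance_coefficient(positions):
    """StdDev / Mean of the gaps between adjacent occurrences (needs k >= 2)."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)

def density(positions, length):
    """k * |alpha| / (p_k - p_1 + |alpha|)"""
    k = len(positions)
    return k * length / (positions[-1] - positions[0] + length)

def passes_basic_screening(positions, length,
                           max_v=0.5, min_d=0.25, max_d=1.5):
    """Keep a maximal repeat only if it is regular and dense enough."""
    return (variance_coefficient(positions) < max_v
            and min_d < density(positions, length) < max_d)

# Occurrences of a three-token pattern, sorted by position
regular = passes_basic_screening([0, 6, 11, 17], 3)
irregular = passes_basic_screening([0, 3, 40, 43], 3)
```

The first pattern repeats at nearly constant intervals and passes; the second is split into two distant blocks, so its variance coefficient exceeds 0.5 and it is discarded at this stage.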
4. Rule Composer
Occurrence partition: flexible variance threshold control
Multiple string alignment: increase density of a pattern
Occurrence Partition
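Problem: Some patterns are divided into several blocks, which gives their occurrences a large variance coefficient (e.g., Lycos, Excite)
Solution: Clustering of the occurrences of such a pattern
[Flowchart: after clustering, occurrences with $V(\alpha) < 0.1$ go on to the density check; otherwise they are discarded.]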
Multiple String Alignment
Problem: Patterns with density less than 1 can extract only part of the information
Solution: Align k-1 substrings among the k occurrences
A natural generalization of alignment for two strings, which can be solved in O(n*m) by dynamic programming, where n and m are the string lengths.
Multiple String Alignment (Cont.)
Suppose “adc” is the discovered pattern for token string “adcwbdadcxbadcxbdadcb”
If we have the following multiple alignment for strings “adcwbd”, “adcxb” and “adcxbd”:
a d c w b d
a d c x b -
a d c x b d
The extraction pattern can be generalized as “adc[w|x]b[d|-]”
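The two mechanical parts of this example can be sketched in Python: cutting the token string into the k-1 substrings between occurrences, and turning an already aligned set of rows into a generalized pattern. The alignment itself is taken as given here; computing it is the harder step and is not shown. Function names are illustrative.

```python
def split_by_occurrences(s, pattern):
    """Cut s at each occurrence of pattern; the last occurrence is dropped."""
    starts = []
    pos = s.find(pattern)
    while pos != -1:
        starts.append(pos)
        pos = s.find(pattern, pos + 1)
    return [s[a:b] for a, b in zip(starts, starts[1:])]

def generalize(aligned_rows):
    """Merge equal-length aligned rows into one pattern with alternatives."""
    parts = []
    for column in zip(*aligned_rows):
        symbols = list(dict.fromkeys(column))  # distinct, first-seen order
        if len(symbols) == 1:
            parts.append(symbols[0])
        else:
            parts.append("[" + "|".join(symbols) + "]")
    return "".join(parts)

substrings = split_by_occurrences("adcwbdadcxbadcxbdadcb", "adc")
rule = generalize(["adcwbd", "adcxb-", "adcxbd"])
```

Each column with more than one symbol becomes a bracketed alternative, and "-" marks a position that may be missing in some records.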
Pattern Viewer
Java-application based GUI
Web based GUI
http://www.csie.ncu.edu.tw/~chia/WebIEPAD/
The Extractor
Matching the pattern against the encoded token string
Knuth-Morris-Pratt's algorithm
Boyer-Moore's algorithm
Alternatives in a rule: matching the longest pattern
What is extracted? The whole record
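As an illustration of the matching step, below is a minimal Knuth-Morris-Pratt search over a token list. It handles only a fixed token pattern; rules with alternatives and longest-match selection are outside this sketch, and the function names are illustrative.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(tokens, pattern):
    """Return the start index of every occurrence of pattern in tokens."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches, k = [], 0
    for i, tok in enumerate(tokens):
        while k > 0 and tok != pattern[k]:
            k = fail[k - 1]
        if tok == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # continue, allowing overlapping matches
    return matches

record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
starts = kmp_search(record * 2, record)
```

Each start index marks the beginning of one record in the token string, so the whole record can be recovered by mapping the matched token span back to the page.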
Experiment Setup
Fourteen sources: search engines
Performance measures: number of patterns, retrieval rate, and accuracy rate
Parameters: encoding scheme, threshold control
![Page 34: Annotation Free Information Extraction Chia-Hui Chang Department of Computer Science & Information Engineering National Central University chia@csie.ncu.edu.tw](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d365503460f94a0d857/html5/thumbnails/34.jpg)
Number of Patterns Discovered Using Block-Level Encoding
Figure 5. Number of patterns validated (x-axis: density, 0 to 1; y-axis: number of patterns, 0 to 14; series: r=0.25, r=0.5, r=0.75)
Average 117 maximal repeats in our test Web pages
Translation
Table 2. Size of translated sequences and number of patterns

| Encoding Scheme | Length of Sequence | No. of Patterns |
|---|---|---|
| All Tag | 1128 | 7.9 |
| No Physical | 873 | 6.5 |
| No Special | 796 | 5.7 |
| Block-Level | 514 | 4.4 |

Average page length is 22.7KB
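Accuracy and Retrieval Rate
Table 5. The performance of multiple string alignment

| Search Engine | Retrieval Rate | Accuracy Rate | Matching Percentage |
|---|---|---|---|
| AltaVista | 1.00 | 1.00 | 0.91 |
| Cora | 1.00 | 1.00 | 0.97 |
| Excite | 1.00 | 0.97 | 1.00 |
| Galaxy | 1.00 | 0.95 | 0.99 |
| Hotbot | 0.97 | 0.86 | 0.88 |
| Infoseek | 0.98 | 0.94 | 0.87 |
| Lycos | 0.94 | 0.63 | 0.94 |
| Magellan | 1.00 | 1.00 | 0.76 |
| Metacrawler | 0.90 | 0.96 | 0.78 |
| NorthernLight | 0.95 | 0.96 | 0.90 |
| Openfind | 0.83 | 0.90 | 0.66 |
| Savvysearch | 1.00 | 0.95 | 0.97 |
| Stpt.com | 0.99 | 1.00 | 0.95 |
| Webcrawler | 0.98 | 0.98 | 0.98 |
| Average | 0.97 | 0.94 | 0.90 |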
Summary
IEPAD: Information Extraction based on Pattern Discovery
Rule generator
The extractor
Pattern viewer
Performance: 97% retrieval rate and 94% accuracy rate
Problems
Guarantee high retrieval rate instead of accuracy rate
Generalized rule can extract more than the desired data
Only applicable when there are several records in a Web page, currently
References
Text IE
Riloff, E. (1996) Automatically Generating Extraction Patterns from Untagged Text, AAAI-96, pp. 1044-1049.
Riloff, E. (1999) Information Extraction as a Stepping Stone toward Story Understanding, In Computational Models of Reading and Understanding, Ashwin Ram and Kenneth Moorman, eds., The MIT Press.
Semi-structured IE
D.W. Embley, Y.S. Jiang, and W.-K. Ng, Record-Boundary Discovery in Web Documents, SIGMOD'99 Proceedings.
C.H. Chang and S.C. Lui, IEPAD: Information Extraction based on Pattern Discovery, WWW10, pp. 681-688, May 2-6, 2001, Hong Kong.
B. Chidlovskii, J. Ragetli, and M. de Rijke, Automatic Wrapper Generation for Web Search Engines, The 1st Intern. Conf. on Web-Age Information Management (WAIM'2000), Shanghai, China, June 2000.