04/22/23 09:05 1Copyright © 2000,2002 Weld, Kushmerick
Information Carnivores
MetaCrawler
• Built by Selberg & Etzioni
• Released in June ‘95
• In 2000, aggregated 12 search engines:
  – LookSmart, About, Infoseek,
  – GoTo, Google, DirectHit,
  – RealNames, Webcrawler, AltaVista,
  – Excite, Lycos, Thunderstone
• History
  – Netbot
  – Go2net
  – Infospace
  – ???
User enters query
Formulate queries
Send to Lycos, Excite, . . .
Collate results
Remove duplicates
Post-process + rank
Download?
Present to user
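The pipeline above (formulate queries, collate, remove duplicates, rank) can be sketched as follows. This is only an illustration, not MetaCrawler's actual code: the engine callables, the URL normalization rule, and the reciprocal-rank scoring are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple
from urllib.parse import urlsplit

Result = Tuple[str, str]                 # (url, title)
Engine = Callable[[str], List[Result]]   # hypothetical per-engine search function


def normalize(url: str) -> str:
    """Crude key for duplicate detection (an assumed rule, for illustration)."""
    parts = urlsplit(url.lower())
    host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
    return host + parts.path.rstrip("/")


def metasearch(query: str, engines: Dict[str, Engine]) -> List[Result]:
    merged: Dict[str, list] = {}         # key -> [score, first result seen]
    for search in engines.values():
        # Collate: each engine returns its own ranked list.
        for rank, result in enumerate(search(query)):
            entry = merged.setdefault(normalize(result[0]), [0.0, result])
            # Remove duplicates + rank: a duplicate hit only adds to the score.
            entry[0] += 1.0 / (rank + 1)
    ranked = sorted(merged.values(), key=lambda e: -e[0])
    return [result for _, result in ranked]


# Stub engines standing in for real search services.
ENGINES = {
    "lycos": lambda q: [("http://a.com/", "A"), ("http://b.com/x", "B")],
    "excite": lambda q: [("http://www.b.com/x/", "B again"), ("http://c.com", "C")],
}
```

A page returned by several engines accumulates score from each, so agreement between engines pushes it up the merged list.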
The Need for Wrappers
Lots of information, but computers don’t understand much of it.
Example 1: Seminar announcement news article
<[email protected]>
Type: cmu.andrew.assocs.UEA
Topic: Re: entreprenuership speaker
Dates: 17-Apr-95
Time: 7:00 PM
PostedBy: Colin S Osburn on 15-Apr-95 at 15:11 from CMU.EDU
Abstract: hello again
to reiterate
there will be a speaker on the law and startup businesses
this monday evening the 17th
it will be at 7pm in room 261 of GSIA in the new building, ie upstairs.
please attend if you have any interest in starting your own business or
are even curious.
Colin
IE output:
  date = monday evening the 17th
  speaker = ?
  start-time = 7pm
  end-time = ?
  location = room 261 of GSIA
Example 2: Seminar announcement Web pages
IE output:
  date = Nov 5
  speaker = Dr. Rodger Kibble
  affil = University of Brighton
  title = Using centering...

  date = Nov 19
  speaker = Dr. Reinhard Muskens
  affil = Katholieke Univ...
  title = Underspecification...

  date = Nov 26
  speaker = Dr. Julie Berndsen
  affil = University College...
  title = A Generic Lexicon...

  ...
Example 3: Job listings
IE
Strategy: Wrappers
user
  ↕ queries / results
Mediator
  ↕
wrapper A    wrapper B    wrapper C
  ↕            ↕            ↕
resource A   resource B   resource C
Scaling issues
Need custom wrapper for each resource.
<HTML><BODY BGCOLOR="FFFFFF" LINK="00009C" ALINK="00009C" VLINK="00009C" TEXT="000000"> <center> <table><tr><td><NOBR> <NOBR><img src="/ypimages/b_r_hd_a.gif" border=0 ALT="Switchboard Results" width=407 height=20 align=top><A HREF="/bin/cgiqa.dll?MEM=1" TARGET="_top"><img src="/ypimages/b_r_hd_1.gif" border=0 ALT="People" width=54 height=20 align=top></A><A HREF="/bin/cgidir.dll?MEM=1" TARGET="_top"><img src="/ypimages/b_r_hd_2.gif" border=0 ALT="Business" width=62 height=24 align=top></A><A HREF="/" TARGET="_top"><img src="/ypimages/b_r_hd_3.gif" border=0 ALT="Home" width=47 height=20 align=top></A></NOBR><br></td></tr></table> </center><center><table border=0 width=576> <tr><td colspan=2 align=center> <center>
But hand-coding is tedious.
Especially since sites frequently change format.
useful information
Wrapper Approaches
• Perl-like languages
  – Simple and effective (if tedious)
• Proprietary languages & tools
  – Click and generalize
• Conversion to tree form
  – Use XML as intermediate representation
  – Extract children of specified node
• Machine Learning
  – Promising, but not yet fielded
Kushmerick Contribution
Machine learning techniques to automatically construct wrappers from examples
example pages → wrapper procedure
<HTML><HEAD>Some Country Codes</HEAD><BODY><B>Some Country Codes</B><P><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR><HR><B>End</B></BODY></HTML>
Example
(Congo, 242) (Egypt, 20) (Belize, 501) (Spain, 34)
LR wrappers: The basic idea
Use <B>, </B>, <I>, </I> for parsing
exploit fortuitous non-linguistic regularity
<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I> <BR><B>Egypt</B> <I>20</I> <BR><B>Belize</B> <I>501</I> <BR><B>Spain</B> <I>34</I> <BR></BODY></HTML>
Left-Right wrappers
Country/Code LR wrapper:
  procedure ExtractCountryCodes
    while there are more occurrences of <B>
      1. extract Country between <B> and </B>
      2. extract Code between <I> and </I>
“Generic” LR wrapper
  procedure ExtractAttributes:
    while there are more occurrences of l1
      1. extract 1st attribute between l1 and r1
      . . .
      K. extract Kth attribute between lK and rK
LR wrapper = 2K strings l1, r1, …, lK, rK
  K = number of attributes
  lk = left delimiter, rk = right delimiter
Not just HTML tags!
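A minimal Python sketch of the generic LR procedure, applied to the country-code page from the earlier slide. The function name `lr_extract` and the choice to resume scanning just past each right delimiter are assumptions made for this illustration.

```python
from typing import List, Sequence, Tuple

PAGE = ("<HTML><TITLE>Some Country Codes</TITLE>"
        "<B>Congo</B> <I>242</I> <BR><B>Egypt</B> <I>20</I> <BR>"
        "<B>Belize</B> <I>501</I> <BR><B>Spain</B> <I>34</I> <BR></BODY></HTML>")


def lr_extract(page: str, delims: Sequence[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Run an LR wrapper given as [(l1, r1), ..., (lK, rK)]."""
    rows: List[Tuple[str, ...]] = []
    pos = 0
    # "while there are more occurrences of l1"
    while page.find(delims[0][0], pos) != -1:
        row = []
        for left, right in delims:
            start = page.find(left, pos)
            if start == -1:
                return rows              # incomplete tuple: stop
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end])  # k-th attribute between lk and rk
            pos = end + len(right)       # assumed: continue after rk
        rows.append(tuple(row))
    return rows


COUNTRY_CODE_WRAPPER = [("<B>", "</B>"), ("<I>", "</I>")]
```

Calling `lr_extract(PAGE, COUNTRY_CODE_WRAPPER)` returns the four (country, code) tuples. Nothing in the procedure is specific to HTML tags; any strings that reliably bracket the attributes will do.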
Wrapper induction algorithm
Inputs: example page supply, automatic page labeler, PAC model parameters
Output: wrapper
1. Gather enough pages to satisfy the termination condition (PAC model).
2. Label example pages.
3. Find a wrapper consistent with the examples.
Finding an LR wrapper
labeled pages → wrapper l1, r1, …, lK, rK
Example: Find 4 strings
  l1 = <B>, r1 = </B>, l2 = <I>, r2 = </I>
<HTML><HEAD>Some Country Codes</HEAD><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>
LR: Finding r1
<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>
r1 can be any prefix of the text that follows each Country value, eg </B>
LR: Finding l1, l2 and r2
<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>
r2 can be any prefix, eg </I>
l2 can be any suffix, eg <I>
l1 can be any suffix, eg <B>
Finding an LR wrapper: Algorithm
S = length of examples
K = number of attributes

Naïve algorithm: enumerate all combinations, $O(S^{2K})$
  for each candidate l1
    for each candidate r1
      ···
        for each candidate lK
          for each candidate rK
            succeed if consistent with examples

Efficient algorithm: constraints are independent, $O(KS)$
  for k = 1 to K
    for each candidate rk
      succeed if consistent with examples
  for k = 1 to K
    for each candidate lk
      succeed if consistent with examples
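The efficient algorithm can be sketched in Python as below: each delimiter is chosen on its own from the candidate prefixes (for rk) or suffixes (for lk). This is an illustrative simplification, not the thesis implementation. Assumptions: labels are given as value strings located left to right on one page, extraction resumes at the end of each extracted value, and the shortest valid candidate is picked.

```python
from typing import List, Optional, Tuple

Wrapper = List[Tuple[str, str]]          # [(l1, r1), ..., (lK, rK)]
Row = Tuple[str, ...]

PAGE = ("<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR>"
        "<B>Spain</B> <I>34</I><BR></BODY></HTML>")
ROWS = [("Congo", "242"), ("Egypt", "20"), ("Belize", "501"), ("Spain", "34")]


def lr_extract(page: str, wrapper: Wrapper) -> List[Row]:
    rows, pos = [], 0
    while page.find(wrapper[0][0], pos) != -1:
        row = []
        for left, right in wrapper:
            start = page.find(left, pos)
            if start == -1:
                return rows
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end])
            pos = end                    # assumed: resume at the value's end
        rows.append(tuple(row))
    return rows


def learn_lr(page: str, rows: List[Row]) -> Optional[Wrapper]:
    # Locate each labeled value, scanning left to right (naive labeling).
    spans, pos = [], 0
    for row in rows:
        for value in row:
            start = page.index(value, pos)
            pos = start + len(value)
            spans.append((start, pos))
    cuts = [0] + [c for span in spans for c in span] + [len(page)]
    n = len(spans)
    before = [page[cuts[2 * i]:cuts[2 * i + 1]] for i in range(n)]
    after = [page[cuts[2 * i + 2]:cuts[2 * i + 3]] for i in range(n)]
    values = [page[s:e] for s, e in spans]
    tail = after[-1]
    k_total, wrapper = len(rows[0]), []
    for k in range(k_total):
        idx = range(k, n, k_total)       # all occurrences of attribute k

        def right_ok(c: str) -> bool:    # rk must first occur right after the value
            return all((values[i] + after[i]).find(c) == len(values[i]) for i in idx)

        def left_ok(c: str) -> bool:     # lk must first occur right before the value
            if k == 0 and c in tail:     # l1 must not restart the loop in the tail
                return False
            return all(before[i].find(c) == len(before[i]) - len(c) for i in idx)

        a = min((after[i] for i in idx), key=len)
        b = min((before[i] for i in idx), key=len)
        r = next((a[:m] for m in range(1, len(a) + 1) if right_ok(a[:m])), None)
        l = next((b[-m:] for m in range(1, len(b) + 1) if left_ok(b[-m:])), None)
        if l is None or r is None:
            return None                  # no LR wrapper fits these examples
        wrapper.append((l, r))
    return wrapper
```

On this page the sketch settles on short delimiters rather than the full tags, which illustrates the slides' point that any consistent prefix or suffix works. A learned wrapper should still be checked by running it on the labeled pages.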
Summary of Kushmerick PhD Results

wrapper class   useful?   learnable?
LR              53 %      $O(KS)$
HLRT            57 %      $O(KS^2)$
OCLR            53 %      $O(KS^2)$
HOCLRT          57 %      $O(KS^4)$
N-LR            13 %      $O(S^{2K})$
N-HLRT          50 %      $O(S^{2K+2})$
total           70 %

useful?: “search.com” survey (AltaVista, WebCrawler, WhoWhere, CNN Headlines, Lycos, Shareware.Com, AT&T 800 Directory, ...)
learnable?: time to automatically build wrappers
K = number of attributes
S = size of examples
“Strong” trainable IE systems
• Examples:
  – CRYSTAL (Soderland et al, 1995)
  – SRV (Freitag, 1999)
  – Rapier (Califf & Mooney, 1999)
• General approach:
  – Define a space of possible extraction rules.
  – Learning = search rule space for set of rules that individually cover many positive examples and few negative examples
  – Sometimes use POS tagging and other shallow linguistic pre-processing
SRV (Freitag’s CMU PhD)
rule = conjunction of literals
literal = predefined relational encoding of a document
[Figure: example document and an example rule, with its English interpretation and FOL interpretation]
Learning Curves for Rapier ~ SRV
[Figure: learning curves for the job-listings domain; horizontal axis: more training data]
SRV: Pseudo-pseudo-code
procedure SRV(training examples E)
  RuleSet ← {}
  while (E is not empty)
    Rule ← TRUE
    repeat
      let Best be the literal that most improves Rule
        according to an information-theoretic gain metric
      Rule ← Rule ∧ Best
    until no such Best exists
    remove examples covered by Rule from E
    RuleSet ← RuleSet + Rule
  return RuleSet
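The covering loop can be made concrete with a small Python sketch. It is not SRV itself. Assumptions: examples are dictionaries of boolean features, a literal is simply a feature name that must be true, and a FOIL-style gain stands in for the information-theoretic metric. As in the pseudo-code, only covered positive examples are removed.

```python
import math
from typing import Dict, List

Example = Dict[str, bool]
Rule = List[str]                 # conjunction of literals; [] means TRUE


def covers(rule: Rule, x: Example) -> bool:
    return all(x[lit] for lit in rule)


def gain(rule: Rule, lit: str, pos: List[Example], neg: List[Example]) -> float:
    """FOIL-style gain of adding `lit` to `rule` (assumed metric)."""
    def counts(r: Rule):
        return (sum(covers(r, x) for x in pos), sum(covers(r, x) for x in neg))
    p0, n0 = counts(rule)
    p1, n1 = counts(rule + [lit])
    if p1 == 0:
        return 0.0
    info = lambda p, n: -math.log2(p / (p + n))
    return p1 * (info(p0, n0) - info(p1, n1))


def covering(pos: List[Example], neg: List[Example], literals: List[str]) -> List[Rule]:
    rule_set: List[Rule] = []
    pos = list(pos)
    while pos:                                   # while E is not empty
        rule: Rule = []                          # Rule <- TRUE
        while True:
            scored = [(gain(rule, l, pos, neg), l) for l in literals if l not in rule]
            best_gain, best = max(scored, default=(0.0, ""))
            if best_gain <= 0:                   # until no such Best exists
                break
            rule.append(best)                    # Rule <- Rule AND Best
        pos = [x for x in pos if not covers(rule, x)]
        rule_set.append(rule)
    return rule_set


# Toy data with hypothetical token features.
POS = [{"cap": True, "after_dr": True, "num": False},
       {"cap": True, "after_dr": True, "num": False},
       {"cap": False, "after_dr": False, "num": True}]
NEG = [{"cap": True, "after_dr": False, "num": False},
       {"cap": False, "after_dr": False, "num": False}]
```

On the toy data, `covering(POS, NEG, ["cap", "after_dr", "num"])` learns one rule per cluster of positives: first `after_dr`, then `num` for the remaining example.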
Covering algorithm: Pseudo-Example
[Figure: nine panels (1-9) of positive (+) and negative (-) training examples, illustrating the covering algorithm]
Why am I telling you this?
• “Strong” trainable IE systems explore a complex rule space…
  – Complicated algorithm/implementation
  – Deep & bushy search space
  – Susceptible to overfitting (?)
• Existing algorithms are covering algorithms
  – Other ways to reweight examples (eg, Boosting)
  – Theoretically more satisfying
  – Learned rules are more accurate (?)
If we use a cleverer reweighting scheme, can we get away with simpler rules? Can we do better than the “strong” learner?!
Boosting
• Boosting (Schapire, Freund, et al)
  – General ML technique for improving the performance of a “weak” learning algorithm: apply the learner repeatedly, each time modifying the training data weights to force the weak learner to focus on examples that were previously classified incorrectly
  – Given: Weak Learner L
  – Output: Boosted Learner L* using L as a “subroutine”
• Theorem: Any learning algorithm L with training error below ½ (on any weighting of the examples) can be mechanically converted into an algorithm L* with error arbitrarily close to 0.
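To make the reweighting idea concrete, here is a small AdaBoost sketch in Python. The weak learner (a one-dimensional threshold "stump") and the toy data are assumptions chosen for illustration; they are not taken from the slides.

```python
import math
from typing import List, Tuple

Stump = Tuple[float, float, int]     # (alpha, threshold, sign)


def stump_predict(x: float, thr: float, sign: int) -> int:
    return sign if x >= thr else -sign


def weak_learner(xs: List[float], ys: List[int], w: List[float]):
    """Return (weighted error, threshold, sign) of the best stump."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if stump_predict(x, thr, sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best


def adaboost(xs: List[float], ys: List[int], rounds: int) -> List[Stump]:
    n = len(xs)
    w = [1.0 / n] * n                            # start with uniform weights
    ensemble: List[Stump] = []
    for _ in range(rounds):
        err, thr, sign = weak_learner(xs, ys, w)
        if err >= 0.5:                           # weak learner no better than chance
            break
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)  # confidence of this hypothesis
        ensemble.append((alpha, thr, sign))
        # Misclassified examples gain weight, correct ones lose weight.
        w = [wi * math.exp(-alpha * y * stump_predict(x, thr, sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble: List[Stump], x: float) -> int:
    score = sum(a * stump_predict(x, thr, s) for a, thr, s in ensemble)
    return 1 if score >= 0 else -1


XS = [1, 2, 3, 4, 5, 6]
YS = [1, 1, -1, -1, 1, 1]            # no single threshold separates these
```

No single stump classifies `XS` correctly, but the weighted vote of three boosted stumps does, because each round concentrates weight on the examples the previous stumps got wrong.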
Reweighting Example
[Figure: four panels (t=1, t=2, t=3, t=4), each plotting weight against training instance, with weak hypotheses h1, h2, h3. Marked instances are those correctly classified by ht. The weak learner will focus on the remaining high-weight instances on iteration t=5.]
BWI’s extraction patterns
• Basic building block: Boundary detector
  Detector d = [who :][dr . Capitalized]
    “prefix” pattern = [who :]
    “suffix” pattern = [dr . Capitalized], where Capitalized is a wildcard
  Detector d matches a boundary B if:
    “prefix” pattern matches tokens to B’s left, and
    “suffix” pattern matches tokens to B’s right
  example: Who: | Dr. Jane Smith   (| marks the boundary B)
• Associated with every boundary detector d is a numeric confidence value Cd
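Boundary matching can be sketched in a few lines of Python. The tokenizer, the case-insensitive comparison of literal tokens, and the small wildcard table are assumptions for illustration; BWI's actual tokenization may differ.

```python
import re
from typing import Callable, Dict, List, Tuple

Detector = Tuple[List[str], List[str]]     # (prefix pattern, suffix pattern)

# A few wildcards, keyed by the names used on the slides.
WILDCARDS: Dict[str, Callable[[str], bool]] = {
    "Anything": lambda t: True,
    "Capitalized": lambda t: t[:1].isupper(),
    "Numeric": lambda t: t.isdigit(),
}


def tokenize(text: str) -> List[str]:
    """Split into word tokens and single punctuation tokens (assumed scheme)."""
    return re.findall(r"\w+|[^\w\s]", text)


def item_matches(item: str, token: str) -> bool:
    test = WILDCARDS.get(item)
    return test(token) if test else item == token.lower()


def matches(detector: Detector, tokens: List[str], b: int) -> bool:
    """Does the detector match boundary b (the gap before tokens[b])?"""
    prefix, suffix = detector
    if b < len(prefix) or b + len(suffix) > len(tokens):
        return False
    window = tokens[b - len(prefix): b + len(suffix)]
    return all(item_matches(i, t) for i, t in zip(prefix + suffix, window))


DETECTOR: Detector = (["who", ":"], ["dr", ".", "Capitalized"])
TOKENS = tokenize("Who: Dr. Jane Smith")
```

Here `TOKENS` is `['Who', ':', 'Dr', '.', 'Jane', 'Smith']`, and the detector matches only boundary 2, the gap between `:` and `Dr`.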
Wildcards
• “Standard” wildcards
  – Anything
  – Alphabetic
  – Capitalized
  – Lowercase
  – Alphanumeric
  – Numeric
  – Punctuation
  – SingleChar (one-character token)
• Tried several simple “lexical” wildcards
  – Firstname (dictionary of names from US Census)
  – Lastname
  – NonEnglishWord (tokens not in /usr/dict/words)
Detector learning algorithm
Input: training examples
Output: boundary detector d = ⟨p, s⟩
Start with empty detector d = [][], and grow detector one token at a time.
Repeat this process until d can’t be improved:
  Consider all ways to grow prefix by one token and all ways to grow suffix by one token.
  Pick the extension that most improves d’s accuracy on the training data.
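A greedy version of this loop might look like the Python sketch below. It is deliberately simplified: there are no boosting weights and no lookahead, the only wildcard is `*` (any token), and "accuracy" is scored as true positives minus false positives. All of these are assumptions for the example.

```python
from typing import List, Set, Tuple

Detector = Tuple[Tuple[str, ...], Tuple[str, ...]]   # (prefix, suffix)
Example = Tuple[List[str], Set[int]]                 # (tokens, true boundaries)


def matches(det: Detector, tokens: List[str], b: int) -> bool:
    prefix, suffix = det
    if b < len(prefix) or b + len(suffix) > len(tokens):
        return False
    window = tokens[b - len(prefix): b + len(suffix)]
    return all(p == "*" or p == t for p, t in zip(prefix + suffix, window))


def score(det: Detector, examples: List[Example]) -> int:
    """True positives minus false positives (assumed accuracy measure)."""
    total = 0
    for tokens, gold in examples:
        for b in range(len(tokens) + 1):
            if matches(det, tokens, b):
                total += 1 if b in gold else -1
    return total


def extensions(det: Detector, examples: List[Example]) -> Set[Detector]:
    """All ways to grow the prefix or the suffix by one token."""
    prefix, suffix = det
    out: Set[Detector] = set()
    for tokens, gold in examples:
        for b in gold:
            i, j = b - len(prefix) - 1, b + len(suffix)
            if i >= 0:
                out.update({((tokens[i],) + prefix, suffix), (("*",) + prefix, suffix)})
            if j < len(tokens):
                out.update({(prefix, suffix + (tokens[j],)), (prefix, suffix + ("*",))})
    return out


def learn_detector(examples: List[Example]) -> Detector:
    det: Detector = ((), ())                 # empty detector d = [][]
    best = score(det, examples)
    while True:
        scored = [(score(c, examples), c) for c in extensions(det, examples)]
        if not scored:
            return det
        s, cand = max(scored)
        if s <= best:                        # d can't be improved
            return det
        det, best = cand, s


EXAMPLES: List[Example] = [
    (["who", ":", "dr", ".", "smith"], {2}),
    (["who", ":", "prof", ".", "jones"], {2}),
    (["when", ":", "3", "pm"], set()),
]
```

On this toy data the detector grows from `[][]` to `[:][]` and then to `[who :][]`, at which point no one-token extension improves the score.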
Boosted Wrapper Induction
• Wrapper =
  1. Start detectors dS1, dS2, …
  2. End detectors dE1, dE2, …
  3. Length histogram $L : [-\infty, +\infty] \to [0, 1]$
• To invoke wrapper on a document:
  1. Apply all detectors to entire document
  2. Score every boundary B:
     $\mathrm{StartScore}(B) = \sum_{i \,:\, d_{S_i} \text{ matches } B} C_{d_{S_i}}$
     $\mathrm{EndScore}(B) = \sum_{i \,:\, d_{E_i} \text{ matches } B} C_{d_{E_i}}$
  3. Extract all substrings $(B_S, B_E)$ that satisfy
     $\mathrm{StartScore}(B_S) \cdot \mathrm{EndScore}(B_E) \cdot L(B_E - B_S) > \tau$
     where $\tau$ is a user-specified confidence threshold
Extraction Example
Start detectors
  dS1 = [a b c][d e], conf 0.2
  dS2 = [p q][r s t], conf 0.4
End detectors
  dE1 = [w x][y z], conf 0.5
  dE2 = [m][n o], conf 0.3
Length histogram L over lengths 1 2 3 4 (bars labeled 0.2 and 0.3)
[Figure: document boundaries annotated with StartScore=0.6, StartScore=0.4, StartScore=0.2]
Extraction Example
[Figure: the same detectors and length histogram, with document boundaries annotated with EndScore=0.3, EndScore=0.5, EndScore=0.3]
Extraction Example
StartScore(SB) = 0.6
EndScore(SE) = 0.5
SE − SB = 3 tokens
StartScore(SB) · EndScore(SE) · L(SE − SB) = 0.6 · 0.5 · 0.3 = 0.09 > τ ?   (τ = the user-specified confidence threshold)
roughly, “probability that ‘38-44K’ is a correct value”
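The scoring step can be checked with a short Python sketch that reproduces the arithmetic of this example. The boundary positions (10 and 13), the function names, and the threshold values are hypothetical; only the detector confidences and L(3) = 0.3 come from the slides.

```python
from typing import Dict, List, Tuple


def boundary_score(matching_confidences: List[float]) -> float:
    """StartScore or EndScore: sum of confidences of the detectors that match."""
    return sum(matching_confidences)


def extract(start_scores: Dict[int, float],
            end_scores: Dict[int, float],
            length_hist: Dict[int, float],
            tau: float) -> List[Tuple[int, int, float]]:
    """Return (start, end, confidence) for every pair scoring above tau."""
    hits = []
    for bs, s in start_scores.items():
        for be, e in end_scores.items():
            length = be - bs
            if length <= 0:
                continue
            conf = s * e * length_hist.get(length, 0.0)
            if conf > tau:
                hits.append((bs, be, conf))
    return sorted(hits, key=lambda h: -h[2])


# Boundary 10 is matched by both start detectors (0.2 + 0.4 = 0.6);
# boundary 13 is matched by end detector dE1 (0.5); the field is 3 tokens long.
START = {10: boundary_score([0.2, 0.4])}
END = {13: boundary_score([0.5])}
LENGTH_HIST = {3: 0.3}
```

With a threshold of, say, `tau = 0.05`, `extract(START, END, LENGTH_HIST, 0.05)` returns the single candidate (10, 13) with confidence of about 0.09; raising `tau` above 0.09 rejects it.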
BWI Algorithm
• Procedure BWI
  Input: training examples
  Output: Start & end detectors, length histogram
  Parameters:
    Number of boosting rounds T
    Lookahead depth L
    Wildcards
  1. S = Start boundary examples
  2. E = End boundary examples
  3. Start-detectors = AdaBoost(LearnDetector, S)
  4. End-detectors = AdaBoost(LearnDetector, E)
  5. Construct length histogram L from training data
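Step 5 is the simplest part to illustrate. One plausible reading, assumed here, is that L maps each field length (in tokens) to its relative frequency among the training fields; the slides do not spell out the normalization.

```python
from collections import Counter
from typing import Dict, List


def length_histogram(field_lengths: List[int]) -> Dict[int, float]:
    """Relative frequency of each field length seen in training (assumed form of L)."""
    counts = Counter(field_lengths)
    total = len(field_lengths)
    return {length: count / total for length, count in counts.items()}


# Hypothetical training fields of 3, 3, 2 and 4 tokens.
L_HIST = length_histogram([3, 3, 2, 4])
```

Lengths never seen in training are absent from the dictionary, so a lookup with default 0 gives them zero probability.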
Example
Experiments
• 16 IE tasks from 8 document collections
  – 8 fields from 3 “traditional” domains: Seminar announcements; Job listings; Reuters corporate acquisition articles
  – 8 fields from 5 “wrapper” domains: CS department faculty lists; Zagats restaurant reviews; LA Times restaurant reviews; Internet Address Finder; Stock quote server
• Performance metrics
  – Precision (fraction of extracted items that are correct)
  – Recall (fraction of items in the documents that were extracted)
  – F1 = 2/(1/Precision + 1/Recall)
• Competitors
  – SRV, Rapier, algorithm based on hidden Markov models
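The three metrics follow directly from their definitions. The sketch below treats extracted and true items as sets; the function name and the convention of returning 0 when a denominator would be zero are assumptions.

```python
from typing import Set, Tuple


def precision_recall_f1(extracted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    # F1 = 2 / (1/Precision + 1/Recall), the harmonic mean of the two.
    f1 = 2 / (1 / precision + 1 / recall) if precision and recall else 0.0
    return precision, recall, f1


# Hypothetical run: 4 items extracted, 2 of them among the 3 true items.
P, R, F1 = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"})
```

Here precision is 2/4, recall is 2/3, and F1 is 2/(2 + 1.5), about 0.571.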
Results: 16 tasks × 4 algorithms
[Figure: results comparison, with regions labeled “21 cases” and “7 cases”]
Summary & Conclusions
• BWI learns simple wrapper-like extraction patterns; each pattern has high accuracy but low coverage
  – Uses boosting to focus the weak pattern learner on difficult training examples
• Works because a few dozen or hundred (but not millions!) of patterns suffice for broad coverage.
  – Many real-world natural corpora have their own stereotypical language, nongrammatical utterances, stylistic constraints, editorial guidelines, formatting regularities, etc that greatly simplify extraction
• BWI outperforms 3 competitors in 75% of comparisons