Wrapper Learning: Cohen et al. 2002; Kushmerick 2000; Kushmerick & Freitag 2000. William Cohen, 1/26/03


TRANSCRIPT

Page 1:

Wrapper Learning:

Cohen et al. 2002; Kushmerick 2000; Kushmerick & Freitag 2000

William Cohen, 1/26/03

Page 2:

Goal: learn from a human teacher how to extract certain database records from a particular web site.

Page 3:

Learner

Page 4:

Page 5:

Page 6:

Why learning from few examples is important

At training time, only four examples are available, but one would like to generalize to future pages as well…

Must generalize across time as well as across a single site

Page 7:

Now, some details…

Page 8:

Kushmerick’s WIEN system

• Earliest wrapper-learning system (published at IJCAI ’97)

• Special things about WIEN:
  – Treats the document as a string of characters
  – Learns to extract a relation directly, rather than extracting fields and then associating them together in some way
  – Each training example is a completely labeled page

Page 9:

WIEN system: a sample wrapper

Page 10:

WIEN system: a sample wrapper

Left delimiters L1 = “<B>”, L2 = “<I>”; right delimiters R1 = “</B>”, R2 = “</I>”

Page 11:

WIEN system: a sample wrapper

Learning means finding L1, …, Lk and R1, …, Rk

• Li must precede every instance of field i

• Ri must follow every instance of field i

• Li, Ri can’t contain data items

• Limited number of possible candidates for Li, Ri
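The LR wrapper above can be sketched in a few lines of Python. The page string and delimiter lists below are illustrative stand-ins for what WIEN would learn, not examples from the slides:

```python
# A minimal sketch of an LR ("left-right") wrapper with k = 2 fields,
# delimited by <B>...</B> and <I>...</I>. WIEN learns the L and R lists;
# here they are supplied by hand.

def lr_extract(page, L, R):
    """Scan the page, extracting one tuple per pass over the delimiter lists."""
    tuples, pos = [], 0
    while True:
        row = []
        for l, r in zip(L, R):
            start = page.find(l, pos)
            if start == -1:
                return tuples          # no more complete tuples
            start += len(l)
            end = page.find(r, start)
            if end == -1:
                return tuples
            row.append(page[start:end])
            pos = end + len(r)
        tuples.append(tuple(row))

page = "<B>Congo</B> <I>242</I> <B>Egypt</B> <I>20</I>"
print(lr_extract(page, L=["<B>", "<I>"], R=["</B>", "</I>"]))
# [('Congo', '242'), ('Egypt', '20')]
```

Note how the constraints on the slide show up here: extraction fails silently if some Li or Ri also appears inside a data item, which is why WIEN forbids delimiters that contain data.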

Page 12:

WIEN system: a more complex class of wrappers (HLRT)

Extension: use the Li, Ri delimiters only after a “head” (after the first occurrence of H) and before a “tail” (before the first occurrence of T). Here H = “<P>”, T = “<HR>”.
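A sketch of HLRT extraction under the same illustrative assumptions; regex matching stands in for WIEN's character-level scan, and the page and delimiters are made up for the example:

```python
import re

# A minimal sketch of an HLRT wrapper: same left/right delimiters as LR,
# but extraction is restricted to the region after the first occurrence of
# the head H and before the first occurrence of the tail T.

def hlrt_extract(page, H, T, L, R):
    head = page.find(H) + len(H)
    tail = page.find(T, head)
    body = page[head:tail]                     # only extract inside head..tail
    # plain LR matching inside the body, one capture group per field
    pattern = ".*?".join(re.escape(l) + "(.*?)" + re.escape(r) for l, r in zip(L, R))
    return re.findall(pattern, body, re.DOTALL)

page = "<P><B>Congo</B> <I>242</I><HR><B>About us</B>"
print(hlrt_extract(page, "<P>", "<HR>", ["<B>", "<I>"], ["</B>", "</I>"]))
# [('Congo', '242')]  -- the <B>About us</B> after the tail is ignored
```

This is exactly what the head/tail buys: boldface navigation text outside the data region no longer confuses the <B>…</B> delimiters.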

Page 13:

Kushmerick: overview of various extensions to LR

Page 14:

Kushmerick and Freitag: Boosted wrapper induction

Page 15:

Review of boosting

Generalized version of AdaBoost (Schapire & Singer, 1999).

Allows “real-valued” predictions for each “base hypothesis”, including a value of zero (abstain).
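The confidence-rated boosting loop can be sketched as follows. This is a simplified illustration, not the paper's exact pseudocode: renormalization replaces the explicit Z_t normalizer, and the weak learner interface is an assumption:

```python
import math

# A minimal sketch of confidence-rated boosting (Schapire & Singer style):
# each weak hypothesis h_t returns a real value (possibly 0, i.e. abstain),
# and example weights are scaled by exp(-y * h_t(x)) then renormalized.

def boost(examples, weak_learner, rounds):
    """examples: list of (x, y) with y in {-1, +1}; returns H(x) = sum_t h_t(x)."""
    n = len(examples)
    w = [1.0 / n] * n
    hyps = []
    for _ in range(rounds):
        h = weak_learner(examples, w)          # fit a real-valued weak hypothesis
        w = [wi * math.exp(-y * h(x)) for wi, (x, y) in zip(w, examples)]
        z = sum(w)
        w = [wi / z for wi in w]               # renormalize (plays the role of Z_t)
        hyps.append(h)
    return lambda x: sum(h(x) for h in hyps)
```

Because an abstaining hypothesis returns 0, its covered/uncovered split leaves the weights of untouched examples unchanged, which is what makes rule-shaped weak hypotheses fit naturally into this framework.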

Page 16:

Learning methods: boosting rules

Weak learner (to find weak hypothesis h_t):

1. Split Data into Growing and Pruning sets

2. Let Rt be an empty conjunction

3. Greedily add conditions to Rt guided by Growing set:

4. Greedily remove conditions from Rt guided by Pruning set:

5. Convert to a weak hypothesis: on examples covered by Rt, predict

   Ĉ_Rt = ½ ln( (W+ + ε) / (W− + ε) )

   and predict 0 elsewhere, where W+ (W−) is the total weight of the positive (negative) examples covered by Rt, and the caret denotes smoothing (adding ε to each weight sum)

Constraint: W+ > W−
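This smoothed log-odds confidence can be written directly; the weight values and ε below are placeholders for illustration:

```python
import math

# A minimal sketch of the SLIPPER-style confidence-rated weak hypothesis:
# a rule abstains (predicts 0) outside its coverage and predicts a smoothed
# log-odds confidence on the examples it covers.

def rule_confidence(w_pos, w_neg, eps=0.5):
    """C-hat = 1/2 * ln((W+ + eps) / (W- + eps)); eps is the smoothing term."""
    return 0.5 * math.log((w_pos + eps) / (w_neg + eps))

def weak_hypothesis(rule_covers, w_pos, w_neg, eps=0.5):
    """Build h_t: the rule's confidence on covered examples, 0 elsewhere."""
    c = rule_confidence(w_pos, w_neg, eps)
    return lambda x: c if rule_covers(x) else 0.0
```

The constraint W+ > W− guarantees the confidence is positive, so each accepted rule votes for the positive class on its coverage.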

Page 17:

Learning methods: boosting rules

SLIPPER also produces fairly compact rule sets.

Page 18:

Learning methods: BWI

• Boosted wrapper induction (BWI) learns to extract substrings from a document.
  – Learns three concepts: firstToken(x), lastToken(x), substringLength(k)
  – Conditions are tests on tokens before/after x
    • E.g., tok_{i−2} = ‘from’, isNumber(tok_{i+1})
  – SLIPPER weak learner, no pruning.
  – Greedy search extends the “window size” by at most L in each iteration, uses lookahead L, with no fixed limit on window size.

• Good results in (Kushmerick and Freitag, 2000)
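A rough sketch of how the three learned concepts combine at extraction time. The hand-written detectors and length table below are illustrative stand-ins for what boosting would actually learn:

```python
# A minimal sketch of BWI-style extraction: score every candidate span whose
# start matches a "fore" (firstToken) detector and whose end matches an "aft"
# (lastToken) detector, weighted by the observed span-length distribution.

def bwi_extract(tokens, fore, aft, length_prob, threshold=0.5):
    """fore/aft map (tokens, position) to a score; length_prob maps length to P(len)."""
    spans = []
    for i in range(len(tokens)):
        for j in range(i, len(tokens)):
            score = fore(tokens, i) * aft(tokens, j) * length_prob.get(j - i + 1, 0.0)
            if score > threshold:
                spans.append((i, j, tokens[i:j + 1]))
    return spans

tokens = "Speaker : Alan Turing Time : 3 pm".split()
# toy detectors: start two tokens after "Speaker", end just before "Time"
fore = lambda t, i: 1.0 if i >= 2 and t[i - 2] == "Speaker" else 0.0
aft = lambda t, j: 1.0 if j + 1 < len(t) and t[j + 1] == "Time" else 0.0
print(bwi_extract(tokens, fore, aft, {2: 0.9}))
# [(2, 3, ['Alan', 'Turing'])]
```

In BWI proper, fore and aft are each boosted ensembles of token-test rules, so their scores are real-valued confidences rather than the 0/1 toy tests above.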

Page 19:

BWI algorithm

Page 20:

BWI algorithm: lookahead search here

Page 21:

BWI example rules

Page 22:

Page 23:

Cohen et al.

Page 24:

Improving a Page Classifier with Anchor Extraction and Link Analysis

William W. Cohen, NIPS 2002

Page 25:

• Previous work in page classification using links:
  – Exploit hyperlinks (Slattery & Mitchell, 2000; Cohn & Hofmann, 2001; Joachims, 2001): documents pointed to by the same “hub” should have the same class.

• What’s new in this paper:
  – Use the structure of hub pages (as well as the structure of the site graph) to find better “hubs”
  – Adapt an existing “wrapper learning” system to find that structure, on the task of classifying “executive bio pages”.

Page 26:

Intuition: links from this “hub page” are informative…

…especially these links

Page 27:

Idea: use the wrapper-learner to learn to extract links to execBio pages, smoothing the “noisy” data produced by the initial page classifier.

Task: train a page classifier, then use it to classify pages on a new, previously-unseen web site as executiveBio or other.

Question: can index pages for executive biographies be used to improve classification?

Page 28:

Background: “co-training” (Blum & Mitchell, ’98)

• Suppose examples are of the form (x1,x2,y) where x1,x2 are independent (given y), and where each xi is sufficient for classification, and unlabeled examples are cheap. – (E.g., x1 = bag of words, x2 = bag of links).

• Co-training algorithm:
  1. Use the x1’s (on labeled data D) to train f1(x1)=y
  2. Use f1 to label additional unlabeled examples U.
  3. Use the x2’s (on the labeled part of U+D) to train f2(x2)=y
  4. Repeat…

Page 29:

Simple 1-step co-training for web pages

f1 is a bag-of-words page classifier, and S is a web site containing unlabeled pages.

• Feature construction. Represent a page x in S as a bag of pages that link to x (“bag of hubs”).

• Learning. Learn f2 from the bag-of-hubs examples, labeled with f1

• Labeling. Use f2(x) to label pages from S.

Idea: use one round of co-training to bootstrap the bag-of-words classifier to one that uses site-specific features x2/f2.
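The simple 1-step scheme can be sketched end-to-end. This is a toy illustration with assumed interfaces: classify_words plays the role of the trained f1, and a majority-vote heuristic over hubs stands in for actually learning f2 from bag-of-hubs features:

```python
# A minimal sketch of 1-step co-training for web pages: label pages with a
# word-based classifier f1, represent each page as the bag of hubs linking
# to it, then relabel pages using hubs that mostly point to positives.

def one_step_cotrain(pages, links, classify_words):
    """pages: {url: text}; links: set of (hub_url, page_url); classify_words: text -> 0/1."""
    # Labeling with f1: noisy page labels from the bag-of-words classifier.
    noisy = {url: classify_words(text) for url, text in pages.items()}
    # Feature construction: page x is represented by the hubs that link to x.
    bag_of_hubs = {url: {h for h, p in links if p == url} for url in pages}
    # "Learning" f2 (stand-in): a hub is good if most pages it links to are positive.
    votes = {}
    for h, p in links:
        pos, tot = votes.get(h, (0, 0))
        votes[h] = (pos + noisy[p], tot + 1)
    good_hubs = {h for h, (pos, tot) in votes.items() if 2 * pos > tot}
    # Labeling with f2: positive iff some good hub links to the page.
    return {url: int(bool(bag_of_hubs[url] & good_hubs)) for url in pages}
```

For example, a "contact" page that f1 mislabels as negative gets flipped to positive when a hub that mostly links to bio pages also links to it; this is the smoothing effect the slides describe.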

Page 30:

Improved 1-step co-training for web pages

Feature construction.
- Label an anchor a in S as positive iff it points to a positive page x (according to f1). Let D = {(x’, a): a is a positive anchor on page x’}.
- Generate many small training sets Di from D by sliding small windows over D.
- Let P be the set of all “structures” found by any builder from any subset Di.
- Say that p links to x if p extracts an anchor that points to x. Represent a page x as the bag of structures in P that link to x.

Learning and labeling. As before.

Page 31:

[Figure: a builder produces an extractor for List1]

Page 32:

[Figure: a builder produces an extractor for List2]

Page 33:

[Figure: a builder produces an extractor for List3]

Page 34:

BOH representation:

{ List1, List3,…}, PR

{ List1, List2, List3,…}, PR

{ List2, List3,…}, Other

{ List2, List3,…}, PR

Learner

Page 35:

Experimental results

[Figure: bar chart of error rates (y-axis 0 to 0.25) on nine web sites, comparing Winnow, D-Tree, and None (no co-training); annotations mark a site where co-training hurts and a site with no improvement.]

Page 36:

Summary

- “Builders” (from a wrapper learning system) let one discover and use the structure of web sites and index pages to smooth page classification results.

- Discovering good “hub structures” makes it possible to use 1-step co-training on small (50-200 example) unlabeled datasets.
  – Average error rate was reduced from 8.4% to 3.6%.
  – The difference is statistically significant with a 2-tailed paired sign test or t-test.
  – EM with probabilistic learners also works; see (Blei et al., UAI 2002).