Automatic Wrappers for Large Scale Web Extraction Nilesh Dalvi (Yahoo!), Ravi Kumar (Yahoo!), Mohamed Soliman (EMC)

TRANSCRIPT

Page 1: Automatic Wrappers for Large  Scale Web Extraction

Automatic Wrappers for Large Scale Web Extraction

Nilesh Dalvi (Yahoo!), Ravi Kumar (Yahoo!), Mohamed Soliman (EMC)

Page 2: Automatic Wrappers for Large  Scale Web Extraction

Task: Learn rules to extract information (e.g. Directors) from structurally similar pages.

VLDB 2011, Seattle, USA

Page 3: Automatic Wrappers for Large  Scale Web Extraction

[Figure: DOM tree of an example movie page. Nodes shown: html, head, title, body, div (class=‘head’), div (class=‘content’), table (width=80%), td. The cells read "Title : Godfather", "Director : Coppola", "Runtime 118min".]

We can use the following XPath rule to extract directors:

W1 = /html/body/div[2]/table/td[2]/text()
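
As a rough illustration of applying W1, the sketch below runs an equivalent path over a toy page with Python's standard xml.etree.ElementTree. The markup in PAGE is an assumption loosely reconstructed from the slide's tree, and extract_directors is a hypothetical helper name. ElementTree supports only a subset of XPath (paths are relative to the root element and there is no text() step), so the rule is written relative to html and the text is read from each matched cell.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, loosely following the slide's DOM tree (assumed markup).
PAGE = """
<html>
  <head><title>Godfather</title></head>
  <body>
    <div class="head">Godfather</div>
    <div class="content">
      <table width="80%">
        <td>Title : Godfather</td>
        <td>Director : Coppola</td>
        <td>Runtime 118min</td>
      </table>
    </div>
  </body>
</html>
"""

def extract_directors(page: str) -> list[str]:
    """Apply the equivalent of /html/body/div[2]/table/td[2]/text()."""
    root = ET.fromstring(page.strip())  # root is the <html> element
    cells = root.findall("body/div[2]/table/td[2]")
    return [cell.text for cell in cells]

directors = extract_directors(PAGE)
print(directors)
```

The rule is purely positional, so it keeps working on other pages of the same site only as long as they share this layout.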


Page 4: Automatic Wrappers for Large  Scale Web Extraction

Wrappers

Can be learned with a small amount of supervision.

Very effective for site-level extraction.

Have been extensively studied in the literature.

Page 5: Automatic Wrappers for Large  Scale Web Extraction

In This Work:

Objective: learn wrappers without site-level supervision.


Page 7: Automatic Wrappers for Large  Scale Web Extraction

Idea

Obtain training data cheaply using dictionaries or automatic labelers.

Make wrapper induction tolerant to noise.


Page 9: Automatic Wrappers for Large  Scale Web Extraction

Summary of Approach


Page 10: Automatic Wrappers for Large  Scale Web Extraction

Summary of Approach

Two main problems:

Wrapper Enumeration: How can the space of all possible wrappers be generated efficiently?

Wrapper Ranking: How should the enumerated wrappers be ranked by quality?

Page 11: Automatic Wrappers for Large  Scale Web Extraction

Example : TABLE wrapper system

n1 a1 z1 p1

n2 a2 z2 p2

n3 a3 z3 p3

n4 a4 z4 p4

n5 a5 z5 p5

Works on a table.

Generates wrappers from the following space: a single cell, a row, a column or the entire table.


Page 12: Automatic Wrappers for Large  Scale Web Extraction

Example : TABLE wrapper system

n1 a1 z1 p1

n2 a2 z2 p2

n3 a3 z3 p3

n4 a4 z4 p4

n5 a5 z5 p5

L = { n1, n2, n4, a4, z5}

32 possible subsets

8 unique wrappers : {n1, n2, n4, a4, z5, C1, R4, T}
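
The count above can be reproduced with a minimal sketch of the TABLE system. It assumes the inductor returns the smallest of {cell, row, column, table} that covers the given labels, and it represents a wrapper by the set of cells it extracts; the names cell, table_inductor and naive_wrapper_space are illustrative, not from the talk. The loop skips the empty subset, so it visits 31 of the 32 subsets.

```python
from itertools import combinations

N_ROWS, N_COLS = 5, 4
COLUMN_NAMES = "nazp"  # column letters used on the slide

def cell(name):
    """Map a slide-style name such as 'a4' to a (row, column) pair."""
    return (int(name[1:]), COLUMN_NAMES.index(name[0]) + 1)

ALL_CELLS = frozenset((r, c) for r in range(1, N_ROWS + 1)
                      for c in range(1, N_COLS + 1))

def table_inductor(labels):
    """Assumed TABLE inductor: smallest cell/row/column/table covering labels."""
    labels = frozenset(labels)
    rows = {r for r, _ in labels}
    cols = {c for _, c in labels}
    if len(labels) == 1:
        return labels                      # a single cell
    if len(rows) == 1:
        (r,) = rows
        return frozenset(x for x in ALL_CELLS if x[0] == r)   # a row
    if len(cols) == 1:
        (c,) = cols
        return frozenset(x for x in ALL_CELLS if x[1] == c)   # a column
    return ALL_CELLS                       # the entire table

def naive_wrapper_space(labels):
    """Enumerate W(L) by brute force over every non-empty subset of L."""
    labels = list(labels)
    space = set()
    for size in range(1, len(labels) + 1):
        for subset in combinations(labels, size):
            space.add(table_inductor(subset))
    return space

L = [cell(n) for n in ("n1", "n2", "n4", "a4", "z5")]
space = naive_wrapper_space(L)
print(len(space))  # 8: five single cells, column 1, row 4, the whole table
```

The brute-force loop is exponential in |L|, which is what motivates enumerating the wrapper space directly.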


Page 13: Automatic Wrappers for Large  Scale Web Extraction

Wrapper Enumeration Problem

Input : A wrapper inductor Φ and a set of labels L

The wrapper space of L is defined as

W(L) = {Φ(S) | S ⊆ L}

Problem : Enumerate the wrapper space of L in time polynomial in the size of the wrapper space and the size of L.

Page 14: Automatic Wrappers for Large  Scale Web Extraction

Wrapper Inductors

TABLE : The wrapper inductor as defined before

XPATH : Learn the minimal XPath rule, in a simple fragment of XPath, that covers all the training examples

LR : Find the maximal pair of strings preceding and following all the training examples. The output of the wrapper is all strings delimited by the pair.
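
A minimal sketch of an LR-style inductor, under the assumption that a page is a plain string and each training example is a (start, end) character span. The function names and the toy PAGE are hypothetical, and the sketch does not handle the degenerate case where either delimiter comes out empty.

```python
import os
import re

def lr_inductor(page, spans):
    """Return the longest (left, right) delimiter pair shared by all spans."""
    # Longest common suffix of the text before each example,
    # computed as a common prefix of the reversed strings.
    left = os.path.commonprefix([page[:start][::-1] for start, _ in spans])[::-1]
    # Longest common prefix of the text after each example.
    right = os.path.commonprefix([page[end:] for _, end in spans])
    return left, right

def lr_extract(page, left, right):
    """Output of the wrapper: all strings delimited by the (left, right) pair."""
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    return re.findall(pattern, page, flags=re.DOTALL)

# Hypothetical listing page with three business names.
PAGE = ("<li><b>Acme</b> 12 Main St</li>"
        "<li><b>Bolt</b> 9 Elm St</li>"
        "<li><b>Cogs</b> 3 Oak Ave</li>")

def span_of(text):
    start = PAGE.index(text)
    return (start, start + len(text))

# Only two of the three names are labeled.
left, right = lr_inductor(PAGE, [span_of("Acme"), span_of("Cogs")])
names = lr_extract(PAGE, left, right)
print(left, right, names)
```

With two labeled names the learned delimiters generalize to the unlabeled third name, which is the behavior the LR description relies on.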


Page 15: Automatic Wrappers for Large  Scale Web Extraction

Well-behaved Inductor

A wrapper inductor Φ is well-behaved if it has the following properties:

[Fidelity] L ⊆ Φ(L)

[Closure] l ∈ Φ(L) ⇒ Φ(L) = Φ(L ∪ {l})

[Monotonicity] L1 ⊆ L2 ⇒ Φ(L1) ⊆ Φ(L2)

Theorem : TABLE, LR and XPATH are well-behaved wrapper inductors.
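
The three properties can be spot-checked by brute force on the running 5 x 4 table example. This is a sanity check on one label set, not a proof of the theorem; table_inductor is an assumed implementation of TABLE (smallest cell, row, column or table covering the labels) and is_well_behaved is a hypothetical helper.

```python
from itertools import combinations

CELLS = frozenset((r, c) for r in range(1, 6) for c in range(1, 5))

def table_inductor(labels):
    """Assumed TABLE inductor; a wrapper is the set of cells it extracts."""
    labels = frozenset(labels)
    rows = {r for r, _ in labels}
    cols = {c for _, c in labels}
    if len(labels) == 1:
        return labels
    if len(rows) == 1:
        return frozenset(x for x in CELLS if x[0] in rows)
    if len(cols) == 1:
        return frozenset(x for x in CELLS if x[1] in cols)
    return CELLS

def nonempty_subsets(items):
    items = list(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def is_well_behaved(phi, labels):
    """Check fidelity, closure and monotonicity over all non-empty subsets."""
    subsets = list(nonempty_subsets(labels))
    fidelity = all(s <= phi(s) for s in subsets)
    closure = all(phi(s) == phi(s | {l}) for s in subsets for l in phi(s))
    monotone = all(phi(a) <= phi(b)
                   for a in subsets for b in subsets if a <= b)
    return fidelity and closure and monotone

LABELS = [(1, 1), (2, 1), (4, 1), (4, 2), (5, 3)]  # n1, n2, n4, a4, z5
print(is_well_behaved(table_inductor, LABELS))
```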


Page 16: Automatic Wrappers for Large  Scale Web Extraction

Bottom-up Algorithm

Start with singleton labels in L as candidate label sets.

Learn wrappers by feeding candidate label sets to Φ.

Incrementally apply one-label extensions to each candidate.

Extend candidates with the closure of wrappers learned by Φ.

Theorem : The bottom-up algorithm is sound and complete.

Theorem : The bottom-up algorithm makes at most k·|L| calls to the wrapper inductor, where k is the size of the wrapper space.

Page 17: Automatic Wrappers for Large  Scale Web Extraction

Can we do better?

A wrapper inductor is a feature-based inductor if:

Every label is associated with a set of features ((attribute, value) pairs).

Φ(L) = intersection of the feature sets of all labels in L.

Output of a wrapper w = text nodes satisfying all the features of w.

E.g. TABLE can be expressed as a feature-based inductor with two features, row and col.

Both LR and XPATH can be expressed as feature-based inductors.
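
The feature-based view of TABLE can be sketched as follows, assuming each cell carries exactly the two features row and col. The names features, induce and extract are illustrative.

```python
from functools import reduce

# All text nodes of the 5 x 4 example table, as (row, col) pairs.
NODES = [(r, c) for r in range(1, 6) for c in range(1, 5)]

def features(node):
    """Feature set of a node: its (attribute, value) pairs."""
    row, col = node
    return frozenset({("row", row), ("col", col)})

def induce(labels):
    """Wrapper for a label set: the features shared by every label."""
    return reduce(frozenset.intersection, (features(l) for l in labels))

def extract(wrapper):
    """Output of a wrapper: nodes that satisfy all of its features."""
    return {n for n in NODES if wrapper <= features(n)}

n1, n2, n4, a4 = (1, 1), (2, 1), (4, 1), (4, 2)
print(induce([n1, n2]))   # shared feature: col = 1
print(induce([n4, a4]))   # shared feature: row = 4
print(induce([n1, a4]))   # no shared feature: the wrapper matches every node
```

A single label keeps both features and so extracts one cell, while an empty feature set extracts the entire table, which recovers the four wrapper shapes of TABLE.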


Page 18: Automatic Wrappers for Large  Scale Web Extraction

Top-down Algorithm

We give a top-down algorithm for feature-based wrapper inductors that makes exactly k calls to the inductor, where k is the size of the wrapper space.

Page 19: Automatic Wrappers for Large  Scale Web Extraction

Wrapper Ranking Problem

Given a set of wrappers, we want to output one that gives the “best” list.

Let X be a list extracted by a wrapper w.

Choose the wrapper that maximizes P[X | L], or equivalently, P[L | X] P[X].
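
The equivalence is an application of Bayes' rule: the evidence term P[L] does not depend on X, so it can be dropped from the maximization.

```latex
\[
\arg\max_{X} P[X \mid L]
  \;=\; \arg\max_{X} \frac{P[L \mid X]\, P[X]}{P[L]}
  \;=\; \arg\max_{X} P[L \mid X]\, P[X]
\]
```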


Page 20: Automatic Wrappers for Large  Scale Web Extraction

Example: Extracting names from business listings

n1 a1 z1 p1

n2 a2 z2 p2

n3 a3 z3 p3

n4 a4 z4 p4

n5 a5 z5 p5

Let us rank the following three lists as candidates for the set of names:

X1 = first column

X2 = entire table

X3 = first two columns

Page 21: Automatic Wrappers for Large  Scale Web Extraction

Example: Extracting names from business listings

n1 a1 z1 p1

n2 a2 z2 p2

n3 a3 z3 p3

n4 a4 z4 p4

n5 a5 z5 p5

X1 = first column

P[L | X1] : 2 wrong labels, 3 correct labels

P[X1] : nice repeating structure, schema size = 4


Page 22: Automatic Wrappers for Large  Scale Web Extraction

Example: Extracting names from business listings

n1 a1 z1 p1

n2 a2 z2 p2

n3 a3 z3 p3

n4 a4 z4 p4

n5 a5 z5 p5

X2 = entire table

P[L | X2] : 0 wrong labels, 5 correct labels

P[X2] : nice repeating structure, schema size = 1

Page 23: Automatic Wrappers for Large  Scale Web Extraction

Example: Extracting names from business listings

n1 a1 z1 p1

n2 a2 z2 p2

n3 a3 z3 p3

n4 a4 z4 p4

n5 a5 z5 p5

X3 = first two columns

P[L | X3] : 1 wrong label, 4 correct labels

P[X3] : poor repeating structure, schema size = 1 or 3

Page 24: Automatic Wrappers for Large  Scale Web Extraction

Ranking Model: P[L | X]

Assume a simple annotator with precision p and recall r that independently labels each node.

Each node in X is added to L with probability r.

Each node not in X is added to L with probability 1 - p.
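
Under the independence assumption stated above, P[L | X] is a product of one factor per node: r or 1 - r for nodes in X, and 1 - p or p for nodes outside X, depending on whether the node was labeled. The sketch below evaluates the log of that product on the business-listing example. The values p = 0.9 and r = 0.6 are hypothetical, the function name is illustrative, and the prior P[X] is deliberately left out.

```python
import math

def log_likelihood(labels, extracted, all_nodes, p, r):
    """log P[L | X] for an annotator that labels each node independently."""
    labels, extracted = set(labels), set(extracted)
    total = 0.0
    for node in all_nodes:
        if node in extracted:
            prob = r if node in labels else 1 - r    # node in X
        else:
            prob = (1 - p) if node in labels else p  # node outside X
        total += math.log(prob)
    return total

NODES = [(row, col) for row in range(1, 6) for col in range(1, 5)]
L = {(1, 1), (2, 1), (4, 1), (4, 2), (5, 3)}         # n1, n2, n4, a4, z5

CANDIDATES = {
    "X1": {n for n in NODES if n[1] == 1},           # first column
    "X2": set(NODES),                                # entire table
    "X3": {n for n in NODES if n[1] <= 2},           # first two columns
}

P, R = 0.9, 0.6  # hypothetical annotator precision and recall
scores = {name: log_likelihood(L, X, NODES, P, R)
          for name, X in CANDIDATES.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 3))
```

With these assumed values the first column scores highest on the likelihood term alone, ahead of the first two columns and the entire table; a full ranking would also multiply in P[X].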


Page 25: Automatic Wrappers for Large  Scale Web Extraction

Ranking Model: P[X]

Define features of the grammar that describes X, e.g. schema size and repeating structure.

Learn distributions over the values of these features, or take them as input as part of domain knowledge.

Page 26: Automatic Wrappers for Large  Scale Web Extraction

Experiments

Datasets:

DEALERS : Used automatic form filling techniques to obtain dealer listings from 300 store locator pages.

DISCOGRAPHY : Crawled 14 music websites that contain track listings of albums.

Task : Automatically learn wrappers to extract business names/track titles for each website.


Page 29: Automatic Wrappers for Large  Scale Web Extraction

Summary

A new framework for noise-tolerant wrapper induction

Two efficient wrapper enumeration algorithms

Probabilistic wrapper ranking model

Web-scale information extraction: no site-level supervision, no manual labeling, tolerating noise in automatic labeling


Page 31: Automatic Wrappers for Large  Scale Web Extraction

Bottom-up Algorithm

INPUT : Φ, L

Z = all singleton subsets of L

W = Z

while (Z not empty)
    Remove the smallest set S from Z
    For each possible single-label expansion S’ of S
        Add Φ(S’) to W
        Add (Φ(S’) ∩ L) back to Z
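
The pseudocode can be turned into a runnable sketch on the TABLE example. Two details are assumptions added here rather than taken from the slide: a seen set that stops a label set from being queued twice (needed for termination), and a calls counter used to compare against the k·|L| bound. table_inductor is an assumed implementation of TABLE that represents a wrapper by the cells it extracts.

```python
CELLS = frozenset((r, c) for r in range(1, 6) for c in range(1, 5))

def table_inductor(labels):
    """Assumed TABLE inductor: smallest cell/row/column/table covering labels."""
    labels = frozenset(labels)
    rows = {r for r, _ in labels}
    cols = {c for _, c in labels}
    if len(labels) == 1:
        return labels
    if len(rows) == 1:
        return frozenset(x for x in CELLS if x[0] in rows)
    if len(cols) == 1:
        return frozenset(x for x in CELLS if x[1] in cols)
    return CELLS

def bottom_up(phi, labels):
    """Enumerate the wrapper space of labels; returns (wrappers, inductor calls)."""
    labels = frozenset(labels)
    Z = [frozenset([l]) for l in labels]   # candidate label sets (singletons)
    seen = set(Z)                          # assumption: avoid re-queuing a set
    W, calls = set(), 0
    for s in Z:                            # wrappers of the singletons
        W.add(phi(s))
        calls += 1
    while Z:
        S = min(Z, key=len)                # remove the smallest set from Z
        Z.remove(S)
        for l in labels - S:               # each single-label expansion of S
            wrapper = phi(S | {l})
            calls += 1
            W.add(wrapper)
            closed = wrapper & labels      # closure: labels the wrapper covers
            if closed not in seen:
                seen.add(closed)
                Z.append(closed)
    return W, calls

L = [(1, 1), (2, 1), (4, 1), (4, 2), (5, 3)]  # n1, n2, n4, a4, z5
W, calls = bottom_up(table_inductor, L)
print(len(W), calls)
```

On this input the sketch finds the same 8 wrappers as exhaustive enumeration while staying within the k·|L| = 40 call bound.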


Page 32: Automatic Wrappers for Large  Scale Web Extraction

Bottom-up Algorithm

n1 a1 z1 p1

n2 a2 z2 p2

n3 a3 z3 p3

n4 a4 z4 p4

n5 a5 z5 p5

Z = {n1, n2, n4, a4, z5}

[Figure: wrapper lattice, initially the singleton wrappers n1, n2, n4, a4, z5]

Page 33: Automatic Wrappers for Large  Scale Web Extraction

Bottom-up Algorithm

n1 a1 z1 p1

n2 a2 z2 p2

n3 a3 z3 p3

n4 a4 z4 p4

n5 a5 z5 p5

Z = {n2, n4, a4, z5, {n1, n2, n4}}

[Figure: wrapper lattice over n1, n2, n4, a4, z5; wrapper added so far: C1]

Page 34: Automatic Wrappers for Large  Scale Web Extraction

Bottom-up Algorithm

n1 a1 z1 p1

n2 a2 z2 p2

n3 a3 z3 p3

n4 a4 z4 p4

n5 a5 z5 p5

Z = {n2, n4, a4, z5, {n1, n2, n4}, {n1, n2, n4, a4, z5}}

[Figure: wrapper lattice over n1, n2, n4, a4, z5; wrappers added so far: C1, T]

Page 35: Automatic Wrappers for Large  Scale Web Extraction

Bottom-up Algorithm

n1 a1 z1 p1

n2 a2 z2 p2

n3 a3 z3 p3

n4 a4 z4 p4

n5 a5 z5 p5

Z = {n4, a4, z5, {n1, n2, n4}, {n1, n2, n4, a4, z5}}

[Figure: wrapper lattice over n1, n2, n4, a4, z5; wrappers added so far: C1, T]

Page 36: Automatic Wrappers for Large  Scale Web Extraction

Bottom-up Algorithm

n1 a1 z1 p1

n2 a2 z2 p2

n3 a3 z3 p3

n4 a4 z4 p4

n5 a5 z5 p5

Z = {a4, z5, {n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}}

[Figure: wrapper lattice over n1, n2, n4, a4, z5; wrappers added so far: C1, T, R4]

Page 37: Automatic Wrappers for Large  Scale Web Extraction

Bottom-up Algorithm

n1 a1 z1 p1

n2 a2 z2 p2

n3 a3 z3 p3

n4 a4 z4 p4

n5 a5 z5 p5

Z = {z5, {n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}}

[Figure: wrapper lattice over n1, n2, n4, a4, z5; wrappers added so far: C1, T, R4]

Page 38: Automatic Wrappers for Large  Scale Web Extraction

Bottom-up Algorithm

n1 a1 z1 p1

n2 a2 z2 p2

n3 a3 z3 p3

n4 a4 z4 p4

n5 a5 z5 p5

Z = {{n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}}

[Figure: wrapper lattice over n1, n2, n4, a4, z5; wrappers added so far: C1, T, R4]

Page 39: Automatic Wrappers for Large  Scale Web Extraction

Bottom-up Algorithm

n1 a1 z1 p1

n2 a2 z2 p2

n3 a3 z3 p3

n4 a4 z4 p4

n5 a5 z5 p5

Z = {{n1, n2, n4}, {n1, n2, n4, a4, z5}}

[Figure: wrapper lattice over n1, n2, n4, a4, z5; wrappers added so far: C1, T, R4]

Page 40: Automatic Wrappers for Large  Scale Web Extraction

Bottom-up Algorithm

n1 a1 z1 p1

n2 a2 z2 p2

n3 a3 z3 p3

n4 a4 z4 p4

n5 a5 z5 p5

Z = {{n1, n2, n4, a4, z5}}

[Figure: wrapper lattice over n1, n2, n4, a4, z5; wrappers added so far: C1, T, R4]

Page 41: Automatic Wrappers for Large  Scale Web Extraction

Bottom-up Algorithm

n1 a1 z1 p1

n2 a2 z2 p2

n3 a3 z3 p3

n4 a4 z4 p4

n5 a5 z5 p5

Z = {}

[Figure: final wrapper lattice over n1, n2, n4, a4, z5 with wrappers C1, T, R4]

Page 42: Automatic Wrappers for Large  Scale Web Extraction

Top-down Algorithm

n1 a1 z1 p1

n2 a2 z2 p2

n3 a3 z3 p3

n4 a4 z4 p4

n5 a5 z5 p5

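[Figure: top-down refinement tree. The root is the full label set {n1, n2, n4, a4, z5}; edges are labeled "row" or "column", and the nodes shown include {n4, a4}, {n1, n2, n4}, a4, z5 and the single labels n1, n2, n4.]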

Page 43: Automatic Wrappers for Large  Scale Web Extraction

Wrapper Ranking: argmax_X P[L | X] P[X]?

Possible values of X are the possible wrappers computed by Φ.

P[L | X]: probability of observing L given that X is the right wrapper.

The annotator has precision p and recall r (estimated from tested labelings).

Independent annotation process: decide on labeling nodes independently.

Each node in X is added to L with probability r.

Each node not in X is added to L with probability 1 - p.

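[Figure: Venn diagram of all nodes (H), the wrapper output X and the labeled nodes L, split into four regions: labeled nodes in X, non-labeled nodes in X, labeled nodes outside X, and non-labeled nodes outside X. Region symbols appearing in the figure: X1, X2, A1, A2.]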