ALFRED: A Framework for Learning Web Wrappers from the Crowd

Valter Crescenzi, Paolo Merialdo, Disheng Qiu

Dipartimento di Ingegneria, Università degli Studi Roma Tre, Via della Vasca Navale, 79, Rome


Extracting data

We have 2M pages from IMDB and we want to extract titles, directors, etc.

[Figure: an inference algorithm generates a wrapper, which extracts the data from the pages into a DB.]

Supervised

Supervised approaches are hard to scale.

Unsupervised

Unsupervised approaches are easier to scale, but they are not accurate.

Automatic Annotator

Automatic annotators cannot be applied in all cases. They rely on:

• Sample values
• Ontology
• Lexical patterns

Crowdsourcing

An opportunity to scale supervised approaches.

Scaling Wrapper Inference

Scaling the number of workers with crowdsourcing platforms opens new challenges.

Issues and contributions:

Non-expert workers
• Simple interactions to reduce the worker error rate
• Membership Query (yes/no answer)

Costs
• Active Learning to carefully select queries
• Dynamic Expressiveness of the inference language

Quality
• Bayesian Model to evaluate the expected wrapper quality
• Sampling algorithms

ALFRED

ALFRED is a wrapper inference system supervised by workers from a crowdsourcing platform.

Input: an annotated page (page0).

Candidate rules generated from page0:

r1 = /html/table/tr[1]/td/text()
r2 = //*[contains(.,"Ratings:")]/../../tr[1]/td/text()
r3 = //*[contains(.,"Director:")]/../../tr[1]/td/text()
....

On page0 all the rules extract the same value:

      page0
r1    Spirited Away
r2    Spirited Away
r3    Spirited Away

On more pages the rules extract different values:

      page0           page1         page2
r1    Spirited Away   City of God   Howl's Moving Castle
r2    Spirited Away   -             9.3
r3    Spirited Away   City of God   null

ALFRED asks the worker: "Is this title the correct one?"

[Figure: the selected rule, r1 = /html/table/tr[1]/td/text(), is the wrapper that feeds the DB.]

Membership Query

The worker is shown one extracted value and gives a yes/no answer (here: "Yes!").

For each new answer:

• Rules compatible with the answer become more likely to be correct (Bayesian Model)
• If no rule is good enough, a new query is selected (Active Learning)

Bayesian Model

Training sequence L_k: the queried values, each with the worker's answer.

L_k = {"Spirited Away": Yes, "-": No, "9.3": No}

Probability that:

• a rule r is correct: P(r | L_k)
• none of the candidate rules is correct: P(R | L_k)

Bayesian update: both probabilities are recomputed after each new answer.
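The update after one answer can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the slides: the worker answers wrongly with probability `error_rate`, a rule predicts "Yes" exactly when it extracts the queried value, and the "none correct" hypothesis treats both answers as equally likely. The names `bayes_update`, `NONE` and `error_rate` are hypothetical.

```python
NONE = "none-correct"  # hypothetical label for "no candidate rule is correct"

def bayes_update(prior, extracts_value, answer, error_rate=0.1):
    """Return the posterior over hypotheses after one yes/no answer.

    prior: dict hypothesis -> probability (the rules plus NONE)
    extracts_value: dict rule -> True if the rule extracts the queried value
    answer: True for "Yes", False for "No"
    """
    unnormalized = {}
    for hyp, p in prior.items():
        if hyp == NONE:
            likelihood = 0.5  # assumption: the answer is uninformative here
        elif extracts_value[hyp] == answer:
            likelihood = 1.0 - error_rate  # the rule agrees with the worker
        else:
            likelihood = error_rate  # agreement only if the worker is wrong
        unnormalized[hyp] = p * likelihood
    total = sum(unnormalized.values())
    return {hyp: p / total for hyp, p in unnormalized.items()}

# Made-up prior; the query is the value "-" that only r2 extracts, answered "No".
prior = {"r1": 0.3, "r2": 0.3, "r3": 0.3, NONE: 0.1}
posterior = bayes_update(prior, {"r1": False, "r2": True, "r3": False}, answer=False)
```

With these numbers the "No" answer moves probability away from r2 and towards r1 and r3, which still agree with each other.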

Active Learning

ALFRED actively selects the queries; a good policy saves money.

• Random (baseline): values are randomly selected
• Entropy: values are selected by maximizing the entropy (most uncertain value)
• Greedy: values are selected to minimize the number of queries needed to confirm the most likely rule
• Lucky: a hybrid approach; it starts with the Entropy algorithm and then switches to Greedy to confirm the best rule
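One possible reading of the Entropy policy, as a sketch: for every candidate value, estimate the probability of a "Yes" as the total probability of the rules that extract it, and query the value whose answer is most uncertain. The probabilities below are made up, worker noise is ignored, and `pick_query` and `binary_entropy` are hypothetical names.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a yes/no outcome where "Yes" has probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def pick_query(rule_probs, extracted):
    """Pick the (page, value) pair whose answer is most uncertain.

    rule_probs: dict rule -> current probability of being correct
    extracted: dict (page, value) -> set of rules extracting that value there
    """
    def p_yes(rules):
        return sum(rule_probs[r] for r in rules)
    return max(extracted, key=lambda q: binary_entropy(p_yes(extracted[q])))

rule_probs = {"r1": 0.5, "r2": 0.2, "r3": 0.3}  # made-up probabilities
extracted = {
    ("page0", "Spirited Away"): {"r1", "r2", "r3"},
    ("page1", "City of God"): {"r1", "r3"},
    ("page1", "-"): {"r2"},
    ("page2", "Howl's Moving Castle"): {"r1"},
    ("page2", "9.3"): {"r2"},
}
query = pick_query(rule_probs, extracted)
```

Here the value on page0 is never worth asking about, because every rule extracts it and the answer cannot separate them.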

Expressiveness

The candidate rules are generated by observing the first annotated page.

Should we use the full expressiveness of XPath or just a fragment?

[Figure: number of candidate rules against the expressiveness of the fragment.]

Pool of candidate rules organized in fragments:

• Absolute Rules (complete path from root):
  /html/table/tr[1]/td/text()
• Relative Rules (path from a textual node):
  //*[contains(.,"Spirited Away")]/text()
  //*[contains(.,"Ratings:")]/../../tr[1]/td/text()
  //*[contains(.,"Director:")]/../../tr[1]/td/text()
• .... other XPaths
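To make "complete path from root" concrete, the sketch below derives an absolute rule for an annotated node using only Python's standard library. It is an illustration, not ALFRED's rule generator: the toy page, the function name `absolute_rule`, and the convention of adding a position only when siblings share a tag are all assumptions.

```python
import xml.etree.ElementTree as ET

def absolute_rule(root, target):
    """Build an absolute XPath (as a string) from the root down to target."""
    # ElementTree nodes do not know their parent, so index the tree first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        if len(same_tag) > 1:
            # XPath positions are 1-based.
            steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        else:
            steps.append(node.tag)
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps)) + "/text()"

# Toy page: the annotated value is the text of the first cell.
page0 = ET.fromstring(
    "<html><table>"
    "<tr><td>Spirited Away</td></tr>"
    "<tr><td>9.3</td></tr>"
    "</table></html>"
)
annotated = page0.find("./table/tr[1]/td")
rule = absolute_rule(page0, annotated)
```

On this toy page the result has the same shape as the absolute rule shown on the slide.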


Correct (absolute) rule: /html/table/tr[1]/td/text()

Case 1: the fragment is too expressive.

Candidate rules:
/html/table/tr[1]/td/text()
//*[contains(.,"Spirited Away")]/text()
//*[contains(.,"Ratings:")]/../../tr[1]/td/text()
//*[contains(.,"Director:")]/../../tr[1]/td/text()

• The correct rule can be generated
• But many MQ are needed to find it

Case 2: the fragment is just expressive enough.

Candidate rules:
/html/table/tr[1]/td/text()

• The correct rule can be generated
• Few queries are needed to find it

State-of-the-art approaches fall in the first case! They statically define the expressiveness of the XPath fragment.

Rules are organized in a hierarchy of fragments with increasing expressiveness:

R0: Absolute Rules
R1: R0 + Relative Rules
.....

[Figure: the fragments of the hierarchy, labelled 70%, 25% and 5%.]

We defined simple XPath fragments. Empirically observed: too expressive fragments are not actually needed.

Inspired by Structural Risk Minimization (SRM)*, a Machine Learning technique to address overfitting.

*Details: Shawe-Taylor et al., IEEE Transactions on Information Theory, 44(5):1926–1940, 1998

Dynamic Expressiveness

ALFRED starts from R0 (Absolute Rules). After each answer it checks two conditions:

• No solution? If P(R | L_k) is above its threshold, ALFRED expands the expressiveness to the next fragment (R1, .....).
• Is r good enough? If P(r | L_k) is above its threshold, ALFRED terminates; if not, it continues with a new query.
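The control loop can be sketched as below. Everything specific here is an assumption made for illustration: the noise model (`error_rate`), the prior mass on "none correct" (`none_prior`), the threshold values, the fixed query order (a real run would use an Active Learning policy), and the toy rules, in which the absolute rule happens to be wrong so that one expansion is triggered.

```python
NONE = "none-correct"  # hypothetical label for "no rule in the fragment is correct"

def posterior(rules, history, error_rate=0.1, none_prior=0.1):
    """Posterior over the current rules plus NONE, given all answers so far."""
    scores = {}
    for name, extract in rules.items():
        score = (1.0 - none_prior) / len(rules)
        for (page, value), answer in history:
            agrees = (extract.get(page) == value) == answer
            score *= (1.0 - error_rate) if agrees else error_rate
        scores[name] = score
    scores[NONE] = none_prior * 0.5 ** len(history)  # assumption: uninformative
    total = sum(scores.values())
    return {h: s / total for h, s in scores.items()}

def learn(fragments, queries, ask, rule_threshold=0.9, none_threshold=0.5):
    """Return (best rule or None, fragment level reached, number of queries)."""
    level, history = 0, []
    rules = dict(fragments[0])  # start from R0
    for query in queries:
        history.append((query, ask(query)))
        post = posterior(rules, history)
        # "No solution?": expand while NONE is too likely and fragments remain.
        while post[NONE] > none_threshold and level + 1 < len(fragments):
            level += 1
            rules.update(fragments[level])  # next fragment adds its rules
            post = posterior(rules, history)
        # "Is r good enough?": stop as soon as one rule is likely enough.
        best = max((h for h in post if h != NONE), key=post.get)
        if post[best] > rule_threshold:
            return best, level, len(history)
    return None, level, len(history)

truth = {"page0": "Spirited Away", "page1": "City of God", "page2": "Howl's Moving Castle"}
fragments = [
    {"absolute": {"page0": "Spirited Away", "page1": "-", "page2": "9.3"}},
    {"relative": dict(truth)},
]
queries = [("page1", "-"), ("page2", "9.3"), ("page1", "City of God")]
result = learn(fragments, queries, ask=lambda q: truth[q[0]] == q[1])
```

In this toy run two "No" answers make the "none correct" hypothesis dominant within R0, the pool is expanded once, and the answers already collected are enough to accept the relative rule without a further query.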

Results

Dataset: 40 attributes

Site               Entity         |Pages|
www.imdb.com       Actor          500k
www.imdb.com       Movies         500k
www.allmusic.com   Band           500k
www.allmusic.com   Albums         500k
www.nasdaq.com     Stock Quotes   7k

Measures:

• Costs: #MQ
• Quality: Precision and Recall

Results: Dynamic Expressiveness

Strategy   #MQ (SRM off)   #MQ (SRM on)   % MQ saved   P (SRM on)   R (SRM on)
RANDOM     379             190            50%          0.998        0.977
GREEDY     398             169            58%          0.998        0.983
LUCKY      196             132            33%          0.996        0.995
ENTROPY    205             116            44%          0.998        0.99

Dynamic Expressiveness saves between 33% and 58% of the queries.

Small quality loss: in some cases the expressiveness is not expanded when it is needed.


[Figure: two panels, Static Expressiveness and Dynamic Expressiveness; axis: # candidate rules.]

"Simple" attributes: complex algorithms are not needed.

"Complex" attributes: Entropy, Lucky and Dynamic Expressiveness save a lot of queries.

Future development

Noisy crowds: worker mistakes vs. task redundancy*

• How to evaluate the accuracy of a worker?
• Another query or another worker?

Same learning framework, different problems: NLP, Crawling

*Demo. Title: ALFRED: Crowd Assisted Data Extraction. When: Tomorrow, 17h. Where: Imperial Room.

Thank you for your attention!

Redundancy

[Figure: three plots of P(r1), P(r2) and P(r3) (y-axis, from 0 to 1) against # MQ (x-axis, from 0 to 4); panel titles: Accurate Worker, Not Accurate Worker, Many Workers.]

Sampling & Quality

With 2M pages from IMDB we have to work with a sample set, but selecting the right sample set is crucial: not all pages look like the pages about famous movies.

[Figure: a wrapper learned on the sample is applied to the pages to feed the DB.]

Pages make apparent the differences among the rules.

With page0 only: r1 = r2 = r3

      page0
r1    Spirited Away
r2    Spirited Away
r3    Spirited Away

With page0 and page1: r1 = r3 != r2

      page0           page1
r1    Spirited Away   City of God
r2    Spirited Away   -
r3    Spirited Away   City of God

With page0, page1 and page2: r1 != r3 != r2 (all three rules differ)

      page0           page1         page2
r1    Spirited Away   City of God   Howl's Moving Castle
r2    Spirited Away   -             9.3
r3    Spirited Away   City of God   null

Goal: find a small set that makes apparent the same differences observed in the whole set of pages.

The problem: find the smallest set of pages that makes apparent the differences among the rules (e.g., 100 pages that make apparent the same differences that we would observe in 2M pages).

It is an NP-hard problem. It can be cast as Set Cover: find the smallest set of pages that covers all the groups of rules (group = equivalent rules).

The smallest set is not needed: a greedy algorithm, O(|Pages|) in time and O(1) in space, works very well in practice.

Input: XPath rules

For every page p:
    if (p makes apparent new differences)
        representative pages += p

An offline algorithm that can be easily parallelized.
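A runnable sketch of the greedy pass, assuming that "makes apparent new differences" means "splits a group of rules that looked equivalent so far". Rules are modelled as functions from a page to the value they extract; the function name `representative_pages`, the toy pages and the extra `page3` are made up for illustration.

```python
def representative_pages(pages, rules):
    """Single pass over the pages, keeping those that separate some rules.

    pages: iterable of (page_id, page)
    rules: dict name -> callable(page) returning the extracted value
    Returns the selected page ids and the final groups of equivalent rules.
    """
    groups = [set(rules)]  # at the start all rules look equivalent
    selected = []
    for page_id, page in pages:
        refined = []
        for group in groups:
            by_value = {}
            for name in group:
                by_value.setdefault(rules[name](page), set()).add(name)
            refined.extend(by_value.values())
        if len(refined) > len(groups):  # the page split at least one group
            groups = refined
            selected.append(page_id)
    return selected, groups

# Toy pages store what each rule would extract, so a rule is just a lookup.
rules = {name: (lambda page, n=name: page[n]) for name in ("r1", "r2", "r3")}
pages = [
    ("page0", {"r1": "Spirited Away", "r2": "Spirited Away", "r3": "Spirited Away"}),
    ("page1", {"r1": "City of God", "r2": "-", "r3": "City of God"}),
    ("page2", {"r1": "Howl's Moving Castle", "r2": "9.3", "r3": None}),
    ("page3", {"r1": "Up", "r2": "8.2", "r3": "Up"}),
]
selected, groups = representative_pages(pages, rules)
```

The only state kept between pages is the partition of the rules, whose size does not depend on the number of pages; page0 and page3 are skipped because they separate nothing new.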

Results: Sampling

Three sample sets:

• Biased: pages collected by crawling the website
• Random: pages randomly picked from the whole set of pages
• Representative: pages collected by our sampling algorithm

Entity   Sampling         |Pages|   P      R
Movies   Biased           250       0.98   0.71
Movies   Random           250       0.99   0.99
Movies   Representative   42        1.00   1.00
Actors   Biased           250       1.00   1.00
Actors   Random           250       1.00   0.96
Stocks   Biased           86        1.00   0.98
Stocks   Random           86        1.00   0.99
Albums   Biased           258       1.00   0.99
Albums   Random           258       1.00   1.00
Bands    Biased           289       1.00   0.68
Bands    Random           289       1.00   1.00

• Representative: perfect
• Biased: recall loss (0.71 on Movies, 0.68 on Bands)
• Random: better than Biased

State of the Art

Supervised Wrapper Induction

• 2006 - Interactive wrapper generation with minimal user effort. U. Irmak et al. WWW
• 2006 - Active learning with multiple views. I. Muslea et al. JAIR

Unsupervised Wrapper Induction

• 2008 - Wrapper inference for ambiguous web pages. V. Crescenzi and P. Merialdo. JAAI
• 2005 - Web Data Extraction Based on Partial Tree Alignment. Yanhong Zhai. WWW

Automatic Annotators

• 2012 - D.I.A.D.E.M. T. Furche and G. Gottlob. WWW
• 2011 - Automatic wrappers for large scale web extraction. N.N. Dalvi et al. VLDB
