Joint Repairs for Web Wrappers
Stefano Ortona, Giorgio Orsi, Marcello Buoncristiano, and Tim Furche
ICDE, Helsinki, 19 May 2016
Title             Director          Rating  Runtime
Schindler’s List  Steven Spielberg  R       195 min
Web Data Extraction
Example output of RoadRunner / DEPTA:

Attribute_1                      Attribute_2
Schindler’s List                 Director: Steven Spielberg Rating: R Runtime: 195 min
Lawrence of Arabia (re-release)  Director: David Lean Rating: PG Runtime: 216 min
Le cercle Rouge (re-release)     Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min
Joint Data and Wrapper Repair
Attribute_1       Attribute_2
Schindler’s List  Director: Steven Spielberg Rating: R Runtime: 195 min

becomes

Title             Director          Rating  Runtime
Schindler’s List  Steven Spielberg  R       195 min
Maximal Repair is NP-complete

Example: splitting the single attribute value "Director: Steven Spielberg Rating: R Runtime: 195 min" into

Director          Rating  Runtime
Steven Spielberg  R       195 min
OBSERVATIONS
Templated websites: data is published following a template.
Wrapper behaviour: wrappers rarely misplace and over-segment at the same time; wrappers make systematic errors.
Oracles: oracles can be implemented as (ensembles of) NERs; NERs are not perfect, i.e., they make mistakes.
Joint Wrapper And Data Repair
When values are both misplaced and over-segmented, computing repairs of maximal fitness is hard; otherwise, just do the following:
(1) Compute all possible non-crossing k-partitions (k = |R|) of the tokens, i.e., assign to each attribute an element of the partition (O(n^k) candidates, a Narayana number).
(2) Discard tokens never accepted by the oracles in any of the partitions.
(3) Collapse identical partitions and choose the one with maximal fitness.
Without misplacement and over-segmentation, a solution can be found in polynomial time by computing a non-crossing k-partition.
NP-hardness: reduction from Weighted Set Packing. Membership in NP: guess a partition, check that it is non-crossing, and compute its fitness in PTIME.
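The segmentation step can be sketched in a simplified form that only considers contiguous, non-empty segments (the full algorithm allows any non-crossing partition); the oracles below are toy membership tests standing in for NER-backed ones:

```python
from itertools import combinations

def best_segmentation(tokens, oracles):
    """Enumerate all ways of splitting `tokens` into k = len(oracles)
    contiguous, non-empty segments, score each candidate with the oracles,
    and keep the one of maximal fitness."""
    n, k = len(tokens), len(oracles)
    best, best_fit = None, -1.0
    for cuts in combinations(range(1, n), k - 1):   # C(n-1, k-1) candidates
        bounds = (0,) + cuts + (n,)
        segs = [" ".join(tokens[bounds[i]:bounds[i + 1]]) for i in range(k)]
        fit = sum(w(s) for w, s in zip(oracles, segs)) / k
        if fit > best_fit:
            best, best_fit = segs, fit
    return best, best_fit

# Toy oracles (hypothetical): accept a value iff it looks like the attribute.
oracles = [
    lambda v: v in {"Steven Spielberg", "David Lean"},   # DIRECTOR
    lambda v: v in {"R", "PG"},                          # RATING
    lambda v: v.endswith("min"),                         # RUNTIME
]
segs, fit = best_segmentation("Steven Spielberg R 195 min".split(), oracles)
```

Enumerating the cut points and keeping the fittest candidate mirrors steps (1) and (3) above; step (2), discarding never-accepted tokens, is omitted for brevity.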
Stefano Ortona, [email protected], University of Oxford, UK
Giorgio Orsi, [email protected], University of Oxford, UK
Marcello Buoncristiano, [email protected], Università della Basilicata, Italy
Tim Furche, [email protected], University of Oxford, UK
http://diadem.cs.ox.ac.uk/wadar
Web data extraction (aka scraping/wrapping) uses wrappers to turn web pages into structured data.
Wrapper: a structure { ⟨R, eR⟩, { ⟨A1, eA1⟩, …, ⟨Am, eAm⟩ } } specifying the objects to be extracted (listings, records, attributes) and the corresponding XPath expressions e.
Wrappers are often created algorithmically and in large numbers. Tools capable of maintaining them over time are missing.
⟨RATING, //li[@class='second']/p⟩
⟨RUNTIME, //li[@class='third']/ul/li[1]⟩
Algorithmically-created wrappers generate data that is far from perfect: data can be badly segmented and misplaced.
⟨TITLE, ⟨1⟩, string($)⟩
⟨DIRECTOR, ⟨2⟩, substring-before(substring-after($, tor:_), _Rat)⟩
⟨RATING, ⟨2⟩, substring-before(substring-after($, ing:_), _Run)⟩
⟨RUNTIME, ⟨2⟩, substring-after($, time:_)⟩
Take a set Ω of oracles, where each ωA in Ω can say whether a value vA belongs to the domain of A. We define the fitness of a relation R w.r.t. Ω as the average tuple fitness f(R, Σ, Ω) = ∑_{ū∈R} f(ū, Σ, Ω) / |R|, where f(ū, Σ, Ω) = ∑_{i=1}^{c} ωAi(ui) / d, with c = min{arity(Σ), arity(R)} and d = max{arity(Σ), arity(R)}.
Repair: a set of regular expressions that, when applied to the original relation, produces a new relation with higher fitness.
(Figure: candidate segmentations of the record string, e.g. ⟨Director: Steven Spielberg⟩ ⟨Rating: R⟩ ⟨Runtime: 195 min⟩, against over-segmented or misplaced candidates such as ⟨Director: Steven⟩, ⟨Rating: R Runtime:195⟩, or ⟨min Director: Steven Spielberg⟩.)
WADaR: ⟨DIRECTOR, //li[@class='first']/div/span⟩
APPROXIMATING JOINT REPAIRS
Annotation (step 1): each record is interpreted as a string (the concatenation of its attributes), on which NERs identify relevant attribute values.
Entity recognisers make mistakes, WADaR tolerates incorrect and missing annotations.
Attribute_1                               Attribute_2
Schindler’s List                          Director: Steven Spielberg Rating: R Runtime: 195 min
Lawrence of Arabia (re-release)           Director: David Lean Rating: PG Runtime: 216 min
Le cercle Rouge (re-release)              Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min
The life of Jack Tarantino (coming soon)  Director: David R Lynch Rating: Not Rated Runtime: 123 min
(Figure: TITLE, DIRECTOR, RATING, and RUNTIME annotations over the records.)
Segmentation
Goal: understand the underlying structure of the relation.

Two possible ways of encoding the problem:
1. Max-flow sequence in a flow network (max-flow sequence: DIRECTOR RATING RUNTIME).
2. Most likely sequence in a memoryless Markov chain (most likely sequence: DIRECTOR RATING RUNTIME).

Solutions often coincide. Markov chains: intuitive and faster to compute. Max flows: provably optimal.
Induction

Example records used for induction:
Director: Steven Spielberg Rating: R Runtime: 195 min
Director: David Lean Rating: PG Runtime: 216 min
Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min
Director: David R Lynch Rating: Not Rated Runtime: 123 min
SUFFIX = substring-before("_(")
PREFIX = substring-after("tor:_"), SUFFIX = substring-before("_Rat")
PREFIX = substring(string-length()-7)
Input: set of clean annotations to be used as positive examples.
WADaR induces regular expressions by looking, e.g., at common prefixes, suffixes, non-content strings, and character length.
Induced expressions improve recall.

token value1 token token token token
token token value2 token token token
token token token value3 token token
When WADaR cannot induce regular expressions (not enough regularity), the data is repaired directly with the annotators. Wrappers are instead repaired with value-based expressions, i.e., disjunctions of the annotated values.
ATTRIBUTE=string-contains(“value1”|”value2”|”value3”)
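A value-based expression is mechanical to build: the pattern is just a disjunction of the values the annotators accepted. A minimal sketch:

```python
import re

def value_based_expression(values):
    """Fallback when no structural regularity exists: the 'regex' is simply a
    disjunction of the accepted annotated values. Longer values come first so
    that the longest match wins in the alternation."""
    alternatives = sorted(values, key=len, reverse=True)
    return re.compile("|".join(re.escape(v) for v in alternatives))

rx = value_based_expression(["Audi", "Ford", "Citroën"])
rx.search("Ford £22k C-max Titanium X").group(0)  # "Ford"
```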
Empirical Evaluation
(Figure: Precision, Recall, and F1-Score before and after repair, for each system and domain: ViNTs, DIADEM, and DEPTA on RE and Auto listings, and RoadRunner on the Auto, Book, Camera, Job, Movie, Nba, Restaurant, and University detail pages.)
5.1 Setting

Datasets. The dataset consists of 100 websites from 10 domains and is an enhanced version of SWDE [20], a benchmark commonly used in web data extraction. SWDE’s data is sourced from 80 sites and 8 domains: auto, book, camera, job, movie, NBA player, restaurant, and university. For each website, SWDE provides collections of 400 to 2k detail pages (i.e., where each page corresponds to a single record). We complemented SWDE with collections of listing pages (i.e., pages with multiple records) from 20 websites of the real estate (RE) and auto domains. Table 1 summarises the characteristics of the dataset. SWDE comes with ground-truth data created under the assumption that wrapper-generation systems could only generate extraction rules with DOM-element granularity, i.e., without segmenting text nodes. Modern wrapper-generation systems support text-node segmentation and we therefore refined the ground-truth accordingly. As an example, in the camera domain, the original ground-truth values for MODEL consisted of the entire product title. The text node includes, other than the model, COLOR, PIXELS, and MANUFACTURER. The ground-truth for the real estate and auto domains has been created following the SWDE format. The final dataset consists of more than 120k pages, for almost 130k records containing more than 500k attribute values.

Table 1: Dataset characteristics.
Domain       Type     Sites  Pages    Records  Attributes
Real Estate  listing  10     271      3,286    15
Auto         listing  10     153      1,749    27
Auto         detail   10     17,923   17,923   4
Book         detail   10     20,000   20,000   5
Camera       detail   10     5,258    5,258    3
Job          detail   10     20,000   20,000   4
Movie        detail   10     20,000   20,000   4
Nba Player   detail   10     4,405    4,405    4
Restaurant   detail   10     20,000   20,000   4
University   detail   10     16,705   16,705   4
Total        -        100    124,715  129,326  78
Wrapper-generation systems. We generated input relations for our evaluation using four wrapper-generation systems: DIADEM [19], DEPTA [36] and ViNTs [39] for listing pages, and RoadRunner [12] for detail pages.¹ The output of DIADEM, DEPTA, and RoadRunner can be readily used in the evaluation since these are full-fledged data extraction systems, supporting the segmentation of both records and attributes within listings or (sets of) detail pages. ViNTs, on the other hand, segments rows into records within a search result listing and, as such, it does not have a concept of attribute. Instead, it segments rows within a record. We therefore post-processed its output, typing the content of lines from different records that are likely to have the same semantics. We used a naïve heuristic similarity based on relative position in the record and string-edit distance of the row’s content. This is a very simple version of more advanced alignment methods based on instance-level redundancy used by, e.g., WEIR and TEGRA [7].
Metrics. The performance of the repair is evaluated by comparing wrapper-generated relations against the SWDE ground truth before and after the repair. The metrics used for the evaluation are Precision, Recall, and F1-Score computed at attribute level. Both the ground truth and the extracted values are normalised, and exact matching between the extracted values and the ground truth is required for a hit. For space reasons, in this paper we only present the most relevant results. The results of the full evaluation, together with the dataset, gold standard, extracted relations, and the code of the normaliser and of the scorer, are available at the online appendix [1].

¹ RoadRunner can be configured for listings but it performs better on detail pages.
All experiments are run on a desktop with an Intel quad-core i7at 3.40GHz with 16 GB Ram and Linux Mint OS 17.
5.2 Repair performance

Relation-level Accuracy. The first two questions we want to answer are: whether joint repairs are necessary and what their impact is in terms of quality. Table 2 reports, for each system, the percentage of: (i) Correctly extracted values. (ii) Under-segmentations, i.e., when values for an attribute are extracted together with values of other attributes or spurious content. Indeed, websites often publish multiple attribute values within the same text node and the involved extraction systems are not able to split values into multiple attributes. (iii) Over-segmentations, i.e., when attribute values are split over multiple fields. As anticipated in Section 2, this rarely happens since an attribute value is often contained in a single text node. In this setting an attribute value can be over-segmented only if the extraction system is capable of splitting single text nodes (DIADEM), but even in this case the splitting happens only when the system can identify a strong regularity within the text node. (iv) Misplacements, i.e., values are placed or labeled as the wrong attribute. This is mostly due to lack of semantic knowledge and confusion introduced by overlapping attribute domains. (v) Missing values, due to lack of regularity and optionality in the web source (RoadRunner, DEPTA, ViNTs) or missing values from the domain knowledge (DIADEM). Note that the numbers do not add up to 100% since errors may fall into multiple categories. These numbers clearly show that there is a quality problem in wrapper-generated relations and also support the atomic misplacement assumption.

Table 2: Wrapper generation system errors.
System      Correct (%)  Under-Segmented (%)  Over-Segmented (%)  Misplaced (%)  Missing (%)
DIADEM      60.9         34.6                 0                   23.2           3.5
DEPTA       49.7         44                   0                   25.3           6
ViNTs       23.9         60.8                 0                   36.4           15.2
RoadRunner  46.3         42.8                 0                   18.6           10.4
Figure 2 shows, for each system and each domain, the impact of the joint repair on our metrics. Light- (resp. dark-) colored bars denote the quality of the relation before (resp. after) the repair.

A first conclusion that can be drawn is that a repair is always beneficial. Of 697 extracted attributes, 588 (84.4%) require some form of repair and the average pre-repair F1-Score produced by the systems is 50%. We are able to induce a correct regular expression for 335 (57%) attributes, while for the remaining 253 (43%) WADaR produces value-based expressions. We can repair at least one attribute in each of the wrappers in all of the cases, and we can repair more than 75% of attributes in more than 80% of the cases.

Among the considered systems, DIADEM delivers, on average, the highest pre-repair F1-Score (~60%), but it never exceeds 65%. RoadRunner is on average worse than DIADEM but it reaches a better 70% F1-Score on restaurant. Websites in this domain are in fact highly structured and individual attribute values are contained in a dedicated text node. When attributes are less structured, e.g., on book, camera, and movie, RoadRunner has a significant drop in performance. As expected, ViNTs delivers the worst pre-cleaning results.

In terms of accuracy, our approach delivers a boost in F1-Score between 15% and 60%. Performance is consistently close to or above 80% across domains and, except for ViNTs, across systems, with a peak of 91% for RoadRunner on NBA player.
The following are the remaining causes of errors: (i) Missing values cannot be repaired as we can only use the data available in the extracted relation.
(Figure: WEIR vs. repair, Precision, Recall, and F-Score per domain: Auto, Book, Camera, Job, Movie, Nba, Restaurant, University.)
Evaluation
100 websites, 10 domains, 4 wrapper-generation systems.
Precision, Recall, F1-Score computed before and after repair.
WADaR boosts F1-Score between 15% and 60%. Performance consistently close to or above 80%.
Metrics computed considering exact matches.
WADaR against WEIR.
WADaR is highly robust to errors of the NERs.
WADaR scales linearly with the size of the input relation. Optimal joint-repair approximations are computed in polynomial time.
Optimality
WADaR provably produces relations of maximum fitness, provided that the proportion of correctly annotated tuples exceeds the maximum error rate of the annotators.
Background: Web wrapping
refcode  postcode  bedrooms  bathrooms  available   price
33453    OX2 6AR   3         2          15/10/2013  £1280 pcm
33433    OX4 7DG   2         1          18/04/2013  £995 pcm
Process of turning semi-structured (templated) web data into structured form
Hidden databases are actually a form of dark / dim data (ref. panel on Tuesday)
manual / (semi-)supervised: accurate, but expensive and non-scalable
unsupervised: less accurate, but cheaper and scalable
Wrapidity
Background: Web wrapping
From (manually or automatically) created examples to XPath-based wrappers
Even on templated websites, automatic wrapping can be inaccurate
Pairs <field,expression> that, once applied to the DOM, return structured records
field      expression
listing    //body
record     //div[contains(@class,'movlist_wrap')]
title      //span[contains(@class,'title')]/text()
rated      .//span[.='rating:']/following-sibling::strong/text()
genre      .//span[.='genre']/following-sibling::strong/text()
releaseMo  .//span[@class='release']/text()
releaseDy  .//span[@class='release']/text()
releaseYr  .//span[@class='release']/text()
image      .//@src
runtime    .//span[.='runtime']/following-sibling::strong/text()
Problems with wrapping
Inaccurate wrapping results in over- (or under-) segmented data
Attribute_1           Attribute_2
Ava’s Possessions     Release Date: March 4, 2016 | Rated: R | Genre(s): Sci-Fi, Mystery, Thriller, Horror | Production Company: Off Hollywood Pictures | Runtime: 216 min
Camino                Release Date: March 4, 2016 | Rated: Not Rated | Genre(s): Action, Adventure, Thriller | Production Company: Bielberg Entertainment | Runtime: 103 min
Cemetery of Splendor  Release Date: March 4, 2016 | Rated: Not Rated | Genre(s): Drama | User Score: 4.6 | Production Company: Centre National de la Cinématographie (CNC) | Runtime: 122 min
RS: source relation (example extraction using RoadRunner, Crescenzi et al.)
Target schema: Title, Release, Genre, Rating, Runtime
Questions
The questions we want to answer are:
can we fix the data, and use what we learn to repair wrappers as well?
are the solutions scalable?
Why do we care?
Companies such as FB and Skyscanner spend millions of dollars of engineering time creating and maintaining wrappers
Wrapper maintenance is a major cost of data acquisition from the web
Fixing the data
The wrapper thinks it is filling this schema…

MAKE  MODEL    PRICE
£19k  Audi     A3 Sportback
£43k  Audi     A6 Allroad quattro
£10k  Citroën  C3
£22k  Ford     C-max Titanium X
If all instances looked like this (i.e., mis-segmentation, but no garbage and no shuffling), this would be a table induction problem: TEGRA, WebTables, etc.
Moreover… we would still have no clue on how to fix the wrapper afterwards
…but instead it produces this instance…

£19k     Make: Audi Model: A3 Sportback
£43k     Make: Audi Model: A6 Allroad
Citroën  £10k Model: C3
Ford     £22k Model: C-max Titanium X
What is a good relation?
The problem is that wrapper-generated relations really look like this…
First, we need a way to determine how “far” we are from a good relation…
ū = ⟨u1, u2, …, un⟩   a tuple generated by the wrapper
Σ = ⟨A1, A2, …, Am⟩   the (target) schema for the extraction
Ω = {ωA1, …, ωAarity(Σ)}   set of oracles for Σ, with ωA(u) = 1 if u ∈ dom(A) or u = null, and ωA(u) = 0 otherwise

The fitness then quantifies how well ū (resp. the whole instance) “fits” Σ.

Example: Ω = {ωMAKE, ωMODEL, ωPRICE}, Σ = ⟨MAKE, PRICE, MODEL⟩
£19k     Make: Audi Model: A3 Sportback
£43k     Make: Audi Model: A6 Allroad
Citroën  £10k Model: C3
Ford     £22k Model: C-max Titanium X

f(R, Σ, Ω) = 1/2 = 50%
Problem Definition: Fitness
Σ = ⟨A1, A2, …, Am⟩ attributes (fields) of the target schema of the relation
ū = ⟨u1, u2, …, un⟩ tuple of the wrapper-generated relation R
Ω = {ωA1, …, ωAarity(Σ)} set of oracles for the fields of Σ, s.t.
ωA(u)=1 if u ∈ dom(A) or u=null, and ωA(u)=0 otherwise
We define the fitness of a tuple ū (resp. relation R) w.r.t. a schema Σ as:
f(ū, Σ, Ω) = (1/d) · ∑_{i=1}^{c} ωAi(ui)

where c = min{arity(Σ), arity(R)} and d = max{arity(Σ), arity(R)}

resp. f(R, Σ, Ω) = (1/|R|) · ∑_{ū∈R} f(ū, Σ, Ω)
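The definition translates directly into code; the sketch below reproduces the f(R, Σ, Ω) = 1/6 example that follows, with simple membership tests standing in for the oracles:

```python
def tuple_fitness(u, schema, oracles):
    """f(ū, Σ, Ω) = (1/d) · Σ_{i=1..c} ωAi(ui), with c/d the min/max arity."""
    c = min(len(schema), len(u))
    d = max(len(schema), len(u))
    return sum(oracles[schema[i]](u[i]) for i in range(c)) / d

def relation_fitness(rel, schema, oracles):
    """f(R, Σ, Ω): average tuple fitness over the relation."""
    return sum(tuple_fitness(u, schema, oracles) for u in rel) / len(rel)

# Toy oracles: membership tests standing in for real NER-backed oracles.
oracles = {
    "MAKE":  lambda v: v in {"Audi", "Citroën", "Ford"},
    "MODEL": lambda v: v in {"A3 Sportback", "A6 Allroad quattro",
                             "C3", "C-max Titanium X"},
    "PRICE": lambda v: v.startswith("£"),
}
schema = ("MAKE", "MODEL", "PRICE")
rel = [("£19k", "Audi", "A3 Sportback"),
       ("£43k", "Audi", "A6 Allroad quattro"),
       ("Citroën", "£10k", "C3"),
       ("Ford", "£22k", "C-max Titanium X")]
# Only MAKE is ever in the right column (last two rows): fitness 1/6 ≈ 17%.
```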
Input: a wrapper W, a relation R | W(P)=R for some set of pages P, and a schema Σ
MAKE     MODEL  PRICE
£19k     Audi   A3 Sportback
£43k     Audi   A6 Allroad quattro
Citroën  £10k   C3
Ford     £22k   C-max Titanium X

f(R, Σ, Ω) = 1/6 = 17%
Problem Definition: Σ-repairs
A Σ-repair is a pair σ = ⟨Π, ρ⟩ where:
Π = (i, j, …, k) is a permutation of the fields of R
ρ = { ⟨A1, ƐA1⟩, ⟨A2, ƐA2⟩, …, ⟨Am, ƐAm⟩ } is a set of regexes, one for each attribute in Σ

Σ-repairs can be applied to a tuple ū in the following way:

σ(ū) = ⟨ ƐA1(Π(ū)), ƐA2(Π(ū)), …, ƐAm(Π(ū)) ⟩
The notion of applicability extends naturally to relations σ(R) (i.e., sets of tuples)
Similarly, Σ-repairs can be applied to wrappers as well [details in the paper]
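For tuples, applying σ = ⟨Π, ρ⟩ can be sketched as follows; Python regexes with capture groups stand in for the paper's substring-based expressions, and the concrete patterns are illustrative assumptions:

```python
import re

def apply_repair(u, perm, regexes):
    """σ(ū): permute the tuple's fields, concatenate them into one string,
    then extract each target attribute with its own expression."""
    s = " ".join(u[i] for i in perm)          # Π(ū), flattened
    out = []
    for rx in regexes:                        # one Ɛ_A per target attribute
        m = re.search(rx, s)
        out.append(m.group(1) if m else None)
    return tuple(out)

u = ("£19k", "Make: Audi Model: A3 Sportback")
regexes = [r"Make:\s*(\S+)",    # MAKE   (hypothetical expressions)
           r"Model:\s*(.+)$",   # MODEL
           r"(£\S+)"]           # PRICE
apply_repair(u, (0, 1), regexes)  # ("Audi", "A3 Sportback", "£19k")
```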
Output: a wrapper W’ and a relation R’ | W’(P)=R’ and R’ is of maximum fitness w.r.t. Σ
The goal is to find the Σ-repair that maximises the fitness
Computing Σ-repairs
Complexity [details in the paper]:
1. non atomic misplacements: NP-complete (red. from Weighted Set Packing)
2. atomic misplacements: polynomial (red. from Stars and Buckets)
We have an atomic misplacement when the correct value for an attribute is:
1. entirely misplaced (i.e., contained in a single wrong field), or
2. over-segmented with its fragments in adjacent fields of the relation.
Atomic misplacement:
MAKE  MODEL  PRICE
£22k  Ford   C-max Titanium X

Non-atomic misplacement:
MAKE   MODEL   PRICE
C-max  £22k X  Ford Titanium
Naïve Algorithm:
For each tuple…
1. permute its fields in all possible ways (only needed for non-atomic misplacements)
2. segment the tuple in all possible ways
3. ask the oracles and keep the segmentation of highest fitness
Approximating Σ-repairs
The naïve algorithm has the following problems:
1. oracles do not (always) exist
2. it fixes one tuple at a time, while the wrapper needs a single fix for each attribute
3. even under the assumption of atomic misplacements, we still have to try O(n^k) different segmentations (worst case) before finding the one of maximum fitness
(1) Weak oracles
Use noisy NERs in place of oracles. If unavailable, one is easy to build.
In this work we use ROSeAnn (Chen et al., PVLDB 2013).
(2 and 3) Approximate relation-wide repairs
Wrappers are programs: if they make a mistake, they make it consistently.
So there is hope of finding a common underlying attribute structure.
Finding the right structure
We have to solve two problems:
find the underlying structure(s) of the relation
find a segmentation that maximises the fitness
An obvious way is sequence labelling (e.g., Markov chains + Viterbi) where oracles are simulated by NERs (so they can make mistakes)
(Markov chain built from the annotated relation below, Ω = {ωA, ωB, ωC, ωD}:)

a b c
a b c
a b c
a d
a d
b a d
b a d

The maximum likelihood sequence is actually ⟨A,D⟩ which “fits” ~28%.
It looks like there’s another sequence that fits better…
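The ⟨A,D⟩ outcome can be checked with a small memoryless Markov model; for brevity, the sketch scores only the observed label sequences instead of running Viterbi:

```python
from collections import Counter, defaultdict

def most_likely_sequence(rows):
    """Estimate transition probabilities of a memoryless Markov chain from the
    annotated rows, then return the most probable START-to-END label sequence
    (brute force over observed sequences rather than Viterbi, for brevity)."""
    trans = defaultdict(Counter)
    for row in rows:
        path = ("START",) + tuple(row) + ("END",)
        for a, b in zip(path, path[1:]):
            trans[a][b] += 1

    def prob(row):
        p, path = 1.0, ("START",) + tuple(row) + ("END",)
        for a, b in zip(path, path[1:]):
            p *= trans[a][b] / sum(trans[a].values())
        return p

    return max({tuple(r) for r in rows}, key=prob)

rows = [("a", "b", "c")] * 3 + [("a", "d")] * 2 + [("b", "a", "d")] * 2
most_likely_sequence(rows)  # ("a", "d"): the memoryless chain prefers a→d
```

The chain forgets that most a's that follow START continue with b c; that is exactly the memorylessness the max-flow encoding fixes.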
Finding the right structure
The problem is that Markov chains are memoryless… we have to remember the context and make sure our sequence satisfies the oracles more than any other. Ok… this sounds like a max-flow!

(Flow network with context-carrying nodes, e.g. vA,(), vB,(A), vC,(A,B), vD,(A), over the same annotated relation, Ω = {ωA, ωB, ωC, ωD}.)

The sequence corresponding to the max flow is ⟨A,B,C⟩ which “fits” ~32%.
Iteratively compute max flows on the network, i.e., likely sequences of high fitness
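When every record carries a complete annotation sequence, each record pushes one unit of flow along its own path, so each max-flow extraction reduces to picking the most frequent remaining sequence. The sketch below iterates that simplification (the real network also handles partial annotations and shared context nodes):

```python
from collections import Counter

def likely_sequences(annotated_records, coverage=0.8):
    """Approximate the iterative max-flow step: repeatedly take the most
    common annotation sequence until `coverage` of the records is explained."""
    remaining = list(annotated_records)
    total, covered, sequences = len(remaining), 0, []
    while remaining and covered / total < coverage:
        seq, n = Counter(remaining).most_common(1)[0]
        sequences.append(seq)
        covered += n
        remaining = [r for r in remaining if r != seq]
    return sequences

records = [("PRICE", "MAKE", "MODEL")] * 6 + [("MAKE", "PRICE", "MODEL")] * 2
likely_sequences(records)
```

With 6 of 8 records on the first sequence (75% coverage, below the threshold), a second iteration picks up the remaining pattern, mirroring the two iterations in the figure below.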
(Figure: max-flow iterations on the annotation network; Iteration 0 extracts the sequence PRICE, MAKE, MODEL; Iteration 1 extracts MAKE, PRICE, MODEL.)
First, annotate the relation using NERs (surrogate oracles) and build the network. We stop when we have covered “enough” of the tuples in the relation.
Example (Ω = {ωMAKE, ωMODEL, ωPRICE}):

£19k     Make: Audi Model: A3 Sportback
£43k     Make: Audi Model: A6 Allroad quattro
Citroën  £10k Model: C3
Ford     £22k Model: C-max Titanium X
Fixing the relation (and the wrapper)
Max flows represent likely sequences. We use them to eliminate unsound annotations.
We can use standard regex-induction algorithms to obtain robust expressions
£19k Make: Audi Model: A3 Sportback
MAKE [11,15) MODEL [24,36) PRICE [0,4)
The remaining annotations can be used as examples for regex induction
The induced expressions recover missing (incomplete) annotations
£19k Make: Audi Model: A3 Sportback
£43k Make: Audi Model: A6 Allroad quattro
Citroën £10k Model: C3
Ford £22k Model: C-max Titanium X
ρ = { ⟨MAKE, substring-before($, £) or substring-before(substring-after($, 'ke:␣'), '␣Mo')⟩,
      ⟨MODEL, substring-after($, 'el:␣')⟩,
      ⟨PRICE, substring-after(substring-before($, 'kMa␣' || 'kMo␣'), ␣)⟩ }
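The prefix/suffix induction step can be sketched with longest-common-affix matching over the annotated spans; the example spans are illustrative, and a Python regex stands in for the XPath substring expressions actually induced:

```python
import os
import re

def induce_expression(examples):
    """Given annotated occurrences of an attribute as (record, start, end)
    spans, find the common context immediately before and after the value and
    turn it into a single extraction regex. Returns None when there is not
    enough regularity (the value-based fallback case)."""
    prefixes = [rec[:s] for rec, s, e in examples]
    suffixes = [rec[e:] for rec, s, e in examples]
    # Longest common *ending* of the text before the value, and the longest
    # common *beginning* of the text after it.
    pre = os.path.commonprefix([p[::-1] for p in prefixes])[::-1]
    suf = os.path.commonprefix(suffixes)
    if not pre and not suf:
        return None
    return re.compile(re.escape(pre) + r"(.*?)" + (re.escape(suf) if suf else r"$"))

examples = [  # hypothetical annotated spans for DIRECTOR
    ("Director: Steven Spielberg Rating: R", 10, 26),
    ("Director: David Lean Rating: PG", 10, 20),
]
rx = induce_expression(examples)
```

The induced pattern, anchored on "Director: " and " Rating: ", then recovers values the annotators missed, e.g. on an unannotated record.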
Approximating Σ-repairs
MAKE     MODEL  PRICE
£19k     Audi   A3 Sportback
£43k     Audi   A6 Allroad quattro
Citroën  £10k   C3
Ford     £22k   C-max Titanium X
When an expression fails to match a minimum number of tuples, we fall back to the NERs: value-based expressions
ρ = { ⟨MAKE, value-based($, [Audi, Ford] )⟩, ⟨MODEL, substring-after(substring-after($, ␣), ␣)⟩, ⟨PRICE, substring-after(substring-before($, k␣),␣)⟩ }
Example: (induction threshold 75%)
MAKE     MODEL  PRICE
£19k     Audi   A3 Sportback
£43k     Audi   A6 Allroad quattro
Citroën  £10k   C3
Ford     £22k   C-max Titanium X
ρ = { ⟨MAKE, substring-before($, £) or substring-before(substring-after($, k␣),␣)⟩, ⟨MODEL, substring-after(substring-after($, ␣), ␣)⟩, ⟨PRICE, substring-after(substring-before($, k␣),␣)⟩ }
Example: (induction threshold 20%)
Evaluation
Dataset:
An enhanced version of the SWDE dataset (https://swde.codeplex.com)
10 domains, 100 websites, 78 attributes, ~100k pages, ~130k records
Systems:
wrapper-generation systems: DIADEM, DEPTA, ViNTs, RoadRunner
baseline wrapper induction/repair system: WEIR (Crescenzi et al., VLDB ’13)
Implementation: WADaR (Wrapper and Data Repair) – Java + SQL
Evaluation: Highlights
Fig. 2: Impact of repair.
to 30% in real estate, with an identical effect in almost all domains.
Attribute-level accuracy. Another question is whether there are substantial differences in attribute-level accuracy. The top of Table III shows attributes where the repair is very effective (F1-Score ≈ 1 after repair). These values appear as highly structured attributes on web pages and the corresponding expressions repair almost all tuples. As an example, DOOR NUMBER is almost always followed by the suffixes dr or door. In these cases, the wrapper induction under-segmented the text due to lack of sufficient examples.
TABLE III: Attribute-level evaluation.
System  Domain       Attribute        Original F1-Score  Repaired F1-Score
DIADEM  real estate  POSTCODE         0.304              0.947
DIADEM  auto         DOOR NUMBER      0                  0.984
DEPTA   real estate  BATHROOM NUMBER  0.314              0.973
DEPTA   auto         MAKE             0.564              0.986
DIADEM  real estate  CITY             0                  0.59
DEPTA   real estate  COUNTY           0                  0.728
DIADEM  auto         ENGINE TYPE      0                  0.225
DEPTA   auto         PRICE            0.711              0.742
For attributes such as CITY and COUNTY, despite a significant boost (59% and 72% respectively) produced by the repair, the final F1-Score is still low. These are irregularly structured attributes, often co-occurring with others, e.g., STATE, POSTCODE, in ways that cannot be easily isolated by regular expressions. Despite not having syntactic regularity, these attributes are semantically related, e.g., COUNTY is usually after CITY and before POSTCODE, and could be captured by extending fS_XPATH with NER capabilities [4].
An exceptional case is ENGINE TYPE, where the value Petrol is also recognised as COLOUR. This causes a loss of performance as it creates a systematic error in the annotated relation. Another exception is the case of PRICE in relations generated by DEPTA. DEPTA extracts large chunks of text with multiple prices among which the annotators cannot distinguish the target price reliably, resulting in worse performance.
Independent evaluation. We performed an extraction of restaurant chain locations in collaboration with a large social network, which provided us with 210 target websites. We used DIADEM as a wrapper induction system and we then applied joint repair on the generated relations. The accuracy has been manually evaluated by third-party rating teams on a sample of nearly 1,000 records of the 276,787 extracted. Table IV shows Precision and Recall computed on the sample (values higher than 0.9 are highlighted in bold). In order to estimate
TABLE IV: Accuracy of large scale evaluation.
Attribute       Precision  Recall  % Modified values
LOCALITY        0.993      0.993   11.34%
OPENING HOURS   1.00       0.461   17.14%
LOCATED WITHIN  1.00       0.224   29.75%
PHONE           0.987      0.849   50.74%
POSTCODE        0.999      0.989   9.4%
STREET ADDRESS  0.983      0.98    83.78%
the impact of the repair, we computed, for each attribute, the percentage of values that are different before and after the repair step. These numbers are shown in the last column of Table IV. Clearly, the repair is beneficial in all of the cases. For OPENING HOURS and LOCATED WITHIN, where recall is very low, the problem is due to the fact that these attributes were often not available on the source pages, thus being impossible to repair. The independent evaluation proved that our repair method can scale to hundreds of thousands of non-synthetic records. On the other hand, the joint repair is bound to the accuracy of the extraction system, i.e., it cannot repair data that has not been extracted.
We have previously shown (Section IV) that an optimal approximation of a joint repair can be computed efficiently. To stress the scalability of our method, we created a synthetic dataset by modifying two different variables: n, the number of records, with an impact mostly on the induction of regular expressions, since it increases the number of examples; and k, the number of attributes, which influences the size of the flow network and the computation of the maximum flow. The synthetic relations are built to produce the worst-case scenario, i.e., each record contains k annotated tokens, each annotation has a different context, and each record produces a different path on the network. This results in a network with n · k + 2 nodes and n · k + n edges. The chart on the left of Figure 3 plots the running time over an increasing number of records (with the number of attributes fixed), while the chart on the right
Fig. 3: Running time.
increases the number of attributes (with the number of records fixed). As expected, the joint repair grows linearly w.r.t. the size of the relation, and polynomially w.r.t. the number of attributes. In the extreme case, the computed network contains 10M nodes and 10.1M edges. The largest network obtained on non-synthetic datasets has 39,148 nodes and 45,797 edges (book), with repairs computed in less than 3 seconds.
Comparative evaluation. We compare our approach against WEIR [3], a wrapper induction and data integration system that can be used to compute a joint repair of a relation w.r.t. a schema. WEIR induces wrappers by generating candidate expressions using simple heuristics and by filtering them using instance-level redundancy across multiple web sources, i.e., it picks, among candidate rules, those that consistently match similar values on different sources. We compare with WEIR as the only other similar system, Turbo Syncer [8], is significantly older, and we were not able to obtain an implementation.
WEIR uses only redundant values for rule selection, resulting in relations with missing values (and records). We compared against WEIR on the original SWDE dataset, the same one used in their evaluation [3] (using RoadRunner as extraction system). We evaluated WEIR and our approach in two separate settings: Figure 4 shows the performance of our approach and WEIR on each domain, computed on redundant records only, while in Figure 5 we also take into account non-redundant ones. A first observation is that redundant records are a small fraction of the whole relation, thus limiting the recall (shown on top of the bars in Figure 4). The results show that, if we limit the evaluation to redundant values only, our approach delivers the same or better performance than WEIR. Interesting cases are auto, restaurant, and university, where our approach outperforms WEIR by more than 10% in F1-Score. In particular, WEIR suffers from false redundancy caused by a lax similarity measure and under-segmented text nodes. The only case where WEIR performs better than our approach is in movie, where the presence of multivalued attributes (such as GENRE) causes the selection of suboptimal max-flow sequences. If we also consider all values, including non-redundant ones, our approach clearly outperforms WEIR in every domain, with a peak of a 36% boost in F1-Score in camera.
In terms of running time, WEIR requires an average of 30 minutes per domain, whereas our approach repairs a domain in less than 2 minutes. This is due to the way WEIR exploits cross-source redundancy, i.e., instances in a source are compared against instances of all other sources. As a consequence, the running time increases with the number of sources. Our approach instead repairs each source in parallel.
Fig. 4: Comparison with WEIR (redundant values). [Bar chart: Precision, Recall, and F-Score of WEIR and Repair per domain (auto, book, camera, job, movie, nba, restaurant, university); the share of redundant records, shown on top of the bars, ranges from 3.2% to 14.9%.]

Fig. 5: Comparison with WEIR (all values). [Bar chart: the same measures computed on all values.]

We also ran a preliminary comparative evaluation with
Google Data Highlighter, a supervised data annotation tool that can be used to produce tabular data from web pages. A discussion is available at [1].
Fig. 6: Impact of individual components. [Chart over the configurations original / only annotator / only regex / only value-based / final; the underlying F1-Scores include: DIADEM (RE) 0.6487 / 0.7048 / 0.8315 / 0.8613 / 0.8735; DIADEM (AUTO) 0.6048 / 0.798 / 0.7249 / 0.8819 / 0.8875; RR (AUTO) 0.5651 / 0.5838 / 0.7148 / 0.8418 / 0.8691; RR (UNIVERSITY) 0.4594 / 0.6253 / 0.7803 / 0.8254 / 0.8335.]

Ablation study. In this experiment we measured the impact of each phase of the joint repair computation on F1-Score for the most relevant scenarios (other scenarios show similar results but are omitted for space reasons). With respect to Figure 6, original (or) refers to the
original (i.e., before repair) quality of the relation, while joint (j) is the post-repair quality. annotator (ann) shows the effect of constructing the repair by directly using annotations, regex (reg) shows the performance when only regexes are induced (i.e., without value-based expressions), and network (net) shows the repair performance when using only value-based expressions computed from max-flow sequences (i.e., no regex induction).
As we can see, the direct use of annotations for repair or regex induction alone delivers poor results. The major contribution to the quality is the use of max-flow sequences, which uncover the underlying structure of the relation and eliminate noisy annotations. Regex induction is still beneficial afterwards to recover misses of the annotators. The most striking case is STREET_ADDRESS in the real estate domain. The attribute is hardly recognised by annotators (accuracy around 50%); however, its structure in the relation is very regular and the after-repair accuracy reaches 85%.
Thresholding. This second experiment measures the effect of the thresholds t_flow and t_regex on performance. Figure 7 shows the variation of F1-Score for the most interesting scenarios (other scenarios report similar results). Setting excessively low thresholds negatively impacts the performance, as it causes a premature induction of regexes that repair only a small number of records. However, there are cases, e.g., DIADEM on real estate, where a lower threshold helps to recover misses of the annotators. In domains where attributes are better structured or the annotator is more accurate, e.g., auto, the best performance is achieved by setting a high threshold. Overall, the variation in performance is limited (3%) and the best average performance is obtained with a 75% threshold.
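As an illustration of the thresholding described above, the choice between an induced regex and the value-based fallback can be sketched as follows. This is a hypothetical helper, not the paper's implementation; the function name, sample values, and default threshold are assumptions (the 0.75 default mirrors the 75% threshold from the experiment).

```python
import re

def choose_repair(values, candidate_regex, t_regex=0.75):
    """Keep the induced regex only if it matches at least t_regex of the values;
    otherwise fall back to a value-based expression (disjunction of values)."""
    matched = sum(1 for v in values if re.search(candidate_regex, v))
    if matched / len(values) >= t_regex:
        return ("regex", candidate_regex)
    # below threshold: repair with a disjunction of the annotated values
    return ("value-based", "|".join(map(re.escape, sorted(set(values)))))

values = ["Runtime: 195 min", "Runtime: 216 min", "140 min"]
kind, expr = choose_repair(values, r"Runtime:")
# only 2 of 3 values match the candidate regex, which is below the 0.75 threshold
```

A lower `t_regex` would accept the regex earlier, which, as noted above, helps when annotators have low recall but risks premature induction.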
WADaR is 23% more accurate than WEIR on average
Evaluation: Robustness
We studied how F1-score varies w.r.t. annotation noise
Fig. 7: Impact of the t_flow and t_regex thresholds.
Effect of annotator accuracy. We gradually decreased the recall of our annotators by randomly eliminating a number of annotations and observed the effect on F1-Score while keeping a fixed regex induction threshold (0.75). To lower the effect of sampling bias, we ran the experiment 30 times with different annotation sets and took the average performance. The accuracy numbers are limited to those attributes where our approach induces regular expressions, since it is already clear that annotator errors directly reduce the accuracy of value-based expressions. This is still a significant number of attributes, i.e., around 65% in all cases except for RoadRunner on book (35%) and RoadRunner on movie (46%). Figure 8 shows
Fig. 8: Annotator recall drop - Fixed threshold
the impact of a drop in recall (x-axis) on F1-Score. As we can see, our approach is robust to a drop in recall until we reach an 80% loss, after which the performance rapidly decays. This is somewhat expected, since the regular expressions compensate for the missing recall up to the point where the max-flow sequences are no longer able to determine the underlying attribute structure reliably.
Figure 9 shows the effect on F1-Score if we instead set a low regex-induction threshold (i.e., 0.1). Clearly, in this case our approach is highly robust to annotator inaccuracy and we notice a loss in performance only after an 80-90% loss in recall. In summary, a lower regex-induction threshold is advisable when we know that annotators have low recall. Even with an annotator of very low accuracy, our approach is robust
Fig. 9: F1-Score variation with a threshold value of 0.1
enough to overcome the errors introduced by the annotator.
VI. RELATED WORK
Computing joint repairs is one of the many maintenance problems faced in web data extraction [5], [22], [25], [27], [28]. However, classical wrapper maintenance has assumed perfect, typically human-created wrappers to begin with, with errors only being introduced over time due to changes in the sources. When covering thousands or hundreds of thousands of sources with automatic or semi-supervised wrapper induction, this assumption is no longer valid.
Closer in spirit to joint repairs are techniques that generate wrappers from background data [3], [8], [17], [36]. These techniques implicitly align background data and wrappers as part of the generation process. The closest works to ours are Turbo Syncer [8] and WEIR [3], which use instance-level redundancy across multiple sources to compute extraction rules on individual sites that, together, can be used to effectively learn wrappers without supervision. An advantageous side-effect of these approaches is the construction of “compatible” relations that can be more easily integrated. Differently from Turbo Syncer and WEIR, our approach assumes the existence of an already generated wrapper to be repaired w.r.t. a target schema. From a practical point of view, both Turbo Syncer and WEIR can be adapted to compute joint repairs, however, as shown in Section V, with significantly worse performance than our approach due to their reliance on redundancy. Our approach also eliminates the need for re-induction of wrappers, leading to better runtime performance.
Redundancy. Instance-level redundancy across web sources has been previously used in different contexts to detect and repair inconsistent data extracted from the web [3], [7], [8], [13]. Redundancy-based approaches face two main obstacles: (i) it is not always possible to leverage sufficient redundancy in every domain (see, e.g., the number of redundant records in SWDE in Figure 4), and (ii) redundancy-based methods require access to a substantial number of sources, which has, so far, limited their scalability (see, e.g., WEIR's running time). Encoding the redundancy by other means, e.g., through entity recognisers and knowledge bases, has proven beneficial to circumvent the scalability problems without sacrificing generality of the approaches [7], [18]. Our approach achieves this via an ensemble of entity recognisers [4], some of which are trained using redundancy-based methods.
Cleaning, segmentation and alignment. Traditional data cleaning methods focus on the detection and repair of database inconsistencies, using, e.g., statistical value distributions [31], [34], constraints [2], [14], [16], [29], and knowledge bases [7]. Differently from our setting, cleaning methods operate on relation(s) that contain incorrect values but are assumed to be correctly segmented.
A more relevant body of work is list segmentation/table induction techniques, targeting the induction of structured records from unstructured (i.e., wrongly segmented) lists of values. These are alternatives to the segmentation based on flow networks used in our approach. Being inspired by tagging problems common in bio-informatics and other areas, these approaches traditionally require some form of supervision. Many require an initial seed of correctly segmented records [10],
Fixed induction threshold 75%
(high dependence on annotation quality)
Fixed induction threshold 10%
(low dependence on annotation quality)
F1-Score starts being affected only when the recall loss reaches ~80%.
WADaR is unaffected by precision loss (random noise) of up to ~300%.
Evaluation: Scalability
Worst-case scenario: all tuples are annotated with all attribute types
WADaR scales linearly w.r.t. the size of the relation and polynomially w.r.t. attributes
Fig. 2: Impact of repair. [Bar charts: Precision, Recall, and F-Score, original vs. repaired, for ViNTs (RE), ViNTs (Auto), DIADEM (RE), DEPTA (RE), DIADEM (Auto), DEPTA (Auto), and RoadRunner on auto, book, camera, job, movie, nba, restaurant, and university.]
to 30% in real estate, with an identical effect in almost all domains.
Attribute-level accuracy. Another question is whether there are substantial differences in attribute-level accuracy. The top of Table III shows attributes where the repair is very effective (F1-Score ≈ 1 after repair). These values appear as highly structured attributes on web pages, and the corresponding expressions repair almost all tuples. As an example, DOOR NUMBER is almost always followed by the suffix dr or door. In these cases, the wrapper induction under-segmented the text due to the lack of sufficient examples.
TABLE III: Attribute-level evaluation.
System | Domain | Attribute | Original F1-Score | Repaired F1-Score
DIADEM | real estate | POSTCODE | 0.304 | 0.947
DIADEM | auto | DOOR NUMBER | 0 | 0.984
DEPTA | real estate | BATHROOM NUMBER | 0.314 | 0.973
DEPTA | auto | MAKE | 0.564 | 0.986
DIADEM | real estate | CITY | 0 | 0.59
DEPTA | real estate | COUNTY | 0 | 0.728
DIADEM | auto | ENGINE TYPE | 0 | 0.225
DEPTA | auto | PRICE | 0.711 | 0.742
For attributes such as CITY and COUNTY, despite a significant boost (59% and 72% respectively) produced by the repair, the final F1-Score is still low. These are irregularly structured attributes, often co-occurring with others, e.g., STATE, POSTCODE, in ways that cannot be easily isolated by regular expressions. Despite not having syntactic regularity, these attributes are semantically related, e.g., COUNTY usually comes after CITY and before POSTCODE, and could be captured by extending f_XPATH^S with NER capabilities [4].
An exceptional case is ENGINE TYPE, where the value Petrol is also recognised as COLOUR. This causes a loss of performance as it creates a systematic error in the annotated relation. Another exception is the case of PRICE in relations generated by DEPTA. DEPTA extracts large chunks of text with multiple prices, among which the annotators cannot distinguish the target price reliably, resulting in worse performance.
Independent evaluation. We performed an extraction of restaurant chain locations in collaboration with a large social network, which provided us with 210 target websites. We used DIADEM as the wrapper induction system and then applied the joint repair on the generated relations. The accuracy has been manually evaluated by third-party rating teams on a sample of nearly 1,000 records of the 276,787 extracted. Table IV shows Precision and Recall computed on the sample (values higher than 0.9 are highlighted in bold). In order to estimate
TABLE IV: Accuracy of large scale evaluation.
Attribute | Precision | Recall | % Modified values
LOCALITY | 0.993 | 0.993 | 11.34%
OPENING HOURS | 1.00 | 0.461 | 17.14%
LOCATED WITHIN | 1.00 | 0.224 | 29.75%
PHONE | 0.987 | 0.849 | 50.74%
POSTCODE | 0.999 | 0.989 | 9.4%
STREET ADDRESS | 0.983 | 0.98 | 83.78%
the impact of the repair, we computed, for each attribute, the percentage of values that are different before and after the repair step. These numbers are shown in the last column of Table IV. Clearly, the repair is beneficial in all of the cases. For OPENING HOURS and LOCATED WITHIN, where recall is very low, the problem is that these attributes were often not available on the source pages, making them impossible to repair. The independent evaluation proved that our repair method can scale to hundreds of thousands of non-synthetic records. On the other hand, the joint repair is bound by the accuracy of the extraction system, i.e., it cannot repair data that has not been extracted.
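The "% modified values" metric used above can be computed as a simple before/after comparison. A minimal sketch, with a hypothetical helper and illustrative data (not the paper's code):

```python
# Share of attribute values that differ between the original and the
# repaired relation, i.e., the "% Modified values" column of Table IV.
def modified_ratio(original, repaired):
    assert len(original) == len(repaired)
    changed = sum(1 for o, r in zip(original, repaired) if o != r)
    return changed / len(original)

before = ["Director: Steven Spielberg Rating: R", "PG", "Not Rated"]
after  = ["Steven Spielberg", "PG", "Not Rated"]
assert modified_ratio(before, after) == 1 / 3  # one of three values was repaired
```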
We have previously shown (Section IV) that an optimal approximation of a joint repair can be computed efficiently. To stress the scalability of our method, we created a synthetic dataset by modifying two variables: n, the number of records, which mostly impacts the induction of regular expressions, since it increases the number of examples; and k, the number of attributes, which influences the size of the flow network and the computation of the maximum flow. The synthetic relations are built to produce the worst-case scenario, i.e., each record contains k annotated tokens, each annotation has a different context, and each record produces a different path in the network. This results in a network with n·k + 2 nodes and n·k + n edges. The chart on the left of Figure 3 plots the running time over an increasing number of records (with the number of attributes fixed), while the chart on the right
Fig. 3: Running time.
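The stated network size can be reproduced with a small sketch. The exact topology (one node per annotated token, one START-to-SINK path per record) is an assumption consistent with the worst-case description above, not the paper's generator:

```python
# Worst-case synthetic flow network: n records, k annotated tokens each,
# every record contributing its own START -> tok_1 -> ... -> tok_k -> SINK path.
def worst_case_network(n, k):
    nodes = {"START", "SINK"}
    edges = set()
    for rec in range(n):
        path = ["START"] + [f"tok_{rec}_{a}" for a in range(k)] + ["SINK"]
        nodes.update(path[1:-1])          # k fresh token nodes per record
        edges.update(zip(path, path[1:])) # k + 1 fresh edges per record
    return nodes, edges

nodes, edges = worst_case_network(n=1000, k=10)
assert len(nodes) == 1000 * 10 + 2     # n*k + 2 nodes
assert len(edges) == 1000 * 10 + 1000  # n*k + n edges
```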
Oracles decouple the problem of finding similar instances from the segmentation
£19k Audi A3 Sportback
£43k Audi A6 Allroad quattro
£10k Citroën C3
£22k Ford C-max Titanium X
Ω = {ωMAKE, ωMODEL, ωPRICE}
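The oracle set Ω above can be illustrated with toy stand-ins, one per attribute domain. Real oracles would be (ensembles of) NERs; the regexes and the tiny gazetteer here are illustrative assumptions only:

```python
import re

# Each oracle decides membership in one attribute domain,
# mirroring Ω = {ω_MAKE, ω_MODEL, ω_PRICE} on the listings above.
oracles = {
    "PRICE": lambda v: re.fullmatch(r"£\d+k", v) is not None,
    "MAKE":  lambda v: v in {"Audi", "Citroën", "Ford"},  # tiny gazetteer
    "MODEL": lambda v: re.fullmatch(r"[A-Z][\w\- ]+", v) is not None,
}

assert oracles["PRICE"]("£19k")
assert oracles["MAKE"]("Citroën")
assert not oracles["PRICE"]("Audi A3 Sportback")  # a misplaced value is rejected
```

Because each oracle only answers a membership question, finding which token span belongs to which attribute remains a separate segmentation problem, which is the decoupling the line above refers to.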
Open issues
Learning oracles
Building oracles is not difficult but still requires engineering time.
The IBM SystemT people did some good work in this direction. We can start there.
Missing attributes
Right now, if the wrapper fails to recover data, then we cannot repair it.
It is possible to manipulate the wrapper to match more content.
Markov Chains vs Max flows on wrapped relations
They seem to eventually compute the same sequences but in different order… proof?
What I know is that max-flows best approximate the maximum fitness at every step.
Questions?
References:
L. Chen, S. Ortona, G. Orsi, M. Benedikt. Aggregating Semantic Annotators. PVLDB ’13.
S. Ortona, G. Orsi, M. Buoncristiano, T. Furche. Joint Repairs for Web Wrappers. ICDE ’16.
S. Ortona, G. Orsi, M. Buoncristiano, T. Furche. WADaR: Joint Wrapper and Data Repair. VLDB ’15 (Demo).
T. Furche, G. Gottlob, G. Grasso, X. Guo, G. Orsi, C. Schallhart, C. Wang. DIADEM: Thousands of websites to a single database. PVLDB ’15.
Title | Director | Rating | Runtime
Schindler’s List | Steven Spielberg | R | 195 min
Web Data Extraction
RoadRunner
DEPTA
Attribute_1 | Attribute_2
Schindler’s List | Director: Steven Spielberg Rating: R Runtime: 195 min
Lawrence of Arabia (re-release) | Director: David Lean Rating: PG Runtime: 216 min
Le cercle Rouge (re-release) | Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min
Joint Data and Wrapper Repair
Attribute_1 | Attribute_2
Schindler’s List | Director: Steven Spielberg Rating: R Runtime: 195 min

Title | Director | Rating | Runtime
Schindler’s List | Steven Spielberg | R | 195 min
Maximal Repair is NP-complete

Attribute
Director: Steven Spielberg Rating: R Runtime: 195 min

Director | Rating | Runtime
Steven Spielberg | R | 195 min

Candidate repairs: φ1, φ2, φ3, φ4
OBSERVATIONS
Templated Websites: Data is published following a template.
Wrapper Behaviour: Wrappers rarely misplace and over-segment at the same time. Wrappers make systematic errors.
Oracles: Oracles can be implemented as (ensembles of) NERs. NERs are not perfect, i.e., they make mistakes.
Joint Wrapper And Data Repair
When values are both misplaced and over-segmented, computing repairs of maximal fitness is hard; otherwise, just do the following:
(1) Compute all possible non-crossing k-partitions (k = |R|) of the tokens, i.e., assign to each attribute an element of the partition (their number is given by the Narayana number N(n, k)).
(2) Discard tokens never accepted by oracles in any of the partitions.
(3) Collapse identical partitions and choose the one with maximal fitness.
Without misplacement and over-segmentation, a solution can be found in polynomial time by computing non-crossing k-partitions.
NP-hardness: reduction from Weighted Set Packing. Membership in NP: guess a partition, decide non-crossingness and compute fitness in PTIME.
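The count in step (1) can be sanity-checked by brute force. The standard Narayana formula is N(n, k) = C(n, k)·C(n, k−1)/n; `partitions_into_k` and the crossing test below are illustrative helpers, not the paper's algorithm:

```python
# Brute-force check that non-crossing partitions of n tokens into k blocks
# are counted by the Narayana number N(n, k).
from itertools import combinations
from math import comb

def partitions_into_k(elems, k):
    """All set partitions of elems into exactly k non-empty blocks."""
    if k == 1:
        yield [list(elems)]
        return
    first, rest = elems[0], list(elems[1:])
    for size in range(0, len(rest) - k + 2):   # size of first's block minus 1
        for others in combinations(rest, size):
            remaining = [e for e in rest if e not in others]
            for sub in partitions_into_k(remaining, k - 1):
                yield [[first, *others]] + sub

def is_noncrossing(partition):
    # a crossing is a < b < c < d with a, c in one block and b, d in another
    for i, X in enumerate(partition):
        for Y in partition[i + 1:]:
            for a in X:
                for c in X:
                    for b in Y:
                        for d in Y:
                            if a < b < c < d or b < a < d < c:
                                return False
    return True

n, k = 6, 3
count = sum(1 for p in partitions_into_k(list(range(n)), k) if is_noncrossing(p))
narayana = comb(n, k) * comb(n, k - 1) // n
assert count == narayana == 50   # vs. 90 unrestricted partitions of 6 into 3 blocks
```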
Stefano Ortona ([email protected]), University of Oxford, UK
Giorgio Orsi ([email protected]), University of Oxford, UK
Marcello Buoncristiano ([email protected]), Università della Basilicata, Italy
Tim Furche ([email protected]), University of Oxford, UK
http://diadem.cs.ox.ac.uk/wadar
Web data extraction (aka scraping/wrapping) uses wrappers to turn web pages into structured data.
Wrapper: structure { ⟨R,!R⟩ { ⟨A1,!A1⟩,…,⟨Am,!Am⟩ } } specifying objects to be extracted (listings, records, attributes) and corresponding XPath expressions.
Wrappers are often created algorithmically and in large numbers. Tools capable of maintaining them over time are missing.
⟨RATING, //li[@class=‘second’]/p⟩
⟨RUNTIME, //li[@class=‘third’]/ul/li[1]⟩
Algorithmically-created wrappers generate data that is far from perfect. Data can be badly segmented and misplaced.
⟨TITLE, ⟨1⟩, string($)⟩
⟨DIRECTOR, ⟨2⟩, substring-before(substring-after($, tor:_), _Rat)⟩
⟨RATING, ⟨2⟩, substring-before(substring-after($, ing:_), _Run)⟩
⟨RUNTIME, ⟨2⟩, substring-after($, time:_)⟩
Take a set Ω of oracles, where each ωA in Ω can say whether a value vA belongs to the domain of A. We define the fitness of a relation R w.r.t. Ω as:
Repair: specifies regular expressions that, when applied on the original relation, produce a new relation with higher fitness.
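The fitness formula itself did not survive extraction. A plausible simplified reading (an assumption, not the paper's exact definition) is the fraction of cell values accepted by the oracle of their attribute, which is enough to see why a repair raises fitness:

```python
# Simplified fitness of a relation w.r.t. a set of oracles: the fraction of
# values v_A that the oracle ω_A accepts as members of A's domain.
def fitness(relation, oracles):
    cells = [(attr, value) for row in relation for attr, value in row.items()]
    return sum(oracles[attr](value) for attr, value in cells) / len(cells)

oracles = {
    "DIRECTOR": lambda v: v in {"Steven Spielberg", "David Lean"},
    "RATING":   lambda v: v in {"R", "PG", "Not Rated"},
}
before = [{"DIRECTOR": "Director: Steven Spielberg", "RATING": "Rating: R"}]
after  = [{"DIRECTOR": "Steven Spielberg",           "RATING": "R"}]
assert fitness(before, oracles) < fitness(after, oracles)  # the repair raises fitness
```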
<Director: Steven>
<195 min>
<Director:><Steven Spielberg>
<Rating: R Runtime:195>
<Runtime: k195 min>
<min Director: Steven Spielberg>
<Rating: Runtime: 195>
<Director: Steven Spielberg>
<Rating: R>
<R>
<Spielberg Rating: R Runtime:>
WADaR:
⟨DIRECTOR, //li[@class=‘first’]/div/span⟩
APPROXIMATING JOINT REPAIRS
Annotation (1): Each record is interpreted as a string (a concatenation of attributes), which NERs analyse to identify relevant attributes.
Entity recognisers make mistakes; WADaR tolerates incorrect and missing annotations.
Attribute_1 | Attribute_2
Schindler’s List | Director: Steven Spielberg Rating: R Runtime: 195 min
Lawrence of Arabia (re-release) | Director: David Lean Rating: PG Runtime: 216 min
Le cercle Rouge (re-release) | Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min
The life of Jack Tarantino (coming soon) | Director: David R Lynch Rating: Not Rated Runtime: 123 min
Segmentation (2)
Goal: understand the underlying structure of the relation. Two possible ways of encoding the problem:
1. Max Flow Sequence in a Flow Network
2. Most Likely Sequence in a Memoryless Markov Chain
[Figure: a flow network and a Markov chain over the states START, TITLE, DIRECTOR, RATING, RUNTIME, SINK; most likely sequence: DIRECTOR RATING RUNTIME.]
Solutions often coincide. Markov Chains: intuitive and faster to compute. Max Flows: provably optimal.
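A minimal sketch of the Markov-chain encoding follows. The transition-estimation details and the exhaustive dynamic program are assumptions for illustration (real relations would need a Viterbi-style DP over states rather than paths), not the poster's exact procedure:

```python
# Estimate transition probabilities from per-record annotation sequences,
# then pick the most likely START -> ... -> SINK attribute sequence.
from collections import Counter
from math import log

def most_likely_sequence(annotated_records):
    trans, out = Counter(), Counter()
    for rec in annotated_records:
        path = ["START"] + rec + ["SINK"]
        trans.update(zip(path, path[1:]))
    for (a, _), c in trans.items():
        out[a] += c

    def logp(a, b):  # log transition probability, -inf if never observed
        return log(trans[(a, b)] / out[a]) if trans[(a, b)] else float("-inf")

    m = max(len(r) for r in annotated_records)
    states = {s for r in annotated_records for s in r}
    best = {("START",): 0.0}
    for _ in range(m):  # extend every partial path by one state
        best = {path + (s,): score + logp(path[-1], s)
                for path, score in best.items() for s in states}
    final = max(best, key=lambda p: best[p] + logp(p[-1], "SINK"))
    return list(final[1:])

records = [
    ["DIRECTOR", "RATING", "RUNTIME"],
    ["DIRECTOR", "RATING", "RUNTIME"],
    ["DIRECTOR", "RUNTIME", "RATING"],  # one noisy annotation order
]
print(most_likely_sequence(records))  # the majority order wins
```

The majority order DIRECTOR, RATING, RUNTIME gets probability (2/3)^3 against (1/3)^3 for the noisy order, which is the sense in which the chain eliminates noisy annotations.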
Induction (3)
Example titles: Schindler’s List; Lawrence of Arabia (re-release); Le cercle Rouge (re-release)
Example attribute strings: Director: Steven Spielberg Rating: R Runtime: 195 min; Director: David Lean Rating: PG Runtime: 216 min; Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min; Director: David R Lynch Rating: Not Rated Runtime: 123 min
Induced expressions:
SUFFIX = substring-before(“_(“)
PREFIX = substring-after(“tor:_“); SUFFIX = substring-before(“_Rat“)
PREFIX = substring(string-length()-7)
Input: a set of clean annotations to be used as positive examples.
WADaR induces regular expressions by looking, e.g., at common prefixes, suffixes, non-content strings, and character length.
Induced expressions improve recall:
  token value1 token token token token
  token token value2 token token token
  token token token value3 token token
When WADaR cannot induce regular expressions (not enough regularity), the data is repaired directly with the annotators. Wrappers are instead repaired with value-based expressions, i.e., a disjunction of the annotated values:
  ATTRIBUTE = string-contains("value1" | "value2" | "value3")
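A simplified sketch of the induction step (not the actual algorithm): look for a common prefix and suffix around the clean annotations, and fall back to a value-based disjunction when no regularity is found:

```python
import re
from os.path import commonprefix

def induce(records, values):
    """Induce a regex from clean (record, value) examples by looking at
    common prefixes and suffixes; a simplified sketch of WADaR's induction."""
    pres, posts = [], []
    for rec, val in zip(records, values):
        i = rec.find(val)
        if i < 0:
            continue
        pres.append(rec[:i])
        posts.append(rec[i + len(val):])
    # Shared text immediately before / after the values.
    lead = commonprefix([p[::-1] for p in pres])[::-1]
    trail = commonprefix(posts)
    if lead.strip() or trail.strip():
        return re.compile(re.escape(lead) + r"(.+?)" + re.escape(trail))
    # Not enough regularity: fall back to a value-based expression.
    return re.compile("|".join(re.escape(v) for v in values))

records = [
    "Director: Steven Spielberg Rating: R Runtime: 195 min",
    "Director: David Lean Rating: PG Runtime: 216 min",
]
rx = induce(records, ["Steven Spielberg", "David Lean"])

# The induced expression generalises to unseen records.
new_record = "Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min"
print(rx.search(new_record).group(1))
```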
Empirical Evaluation
[Figure 2: Precision, Recall, and F1-Score before (Original) and after (Repaired) the repair, for each wrapper-generation system and domain: ViNTs (RE), ViNTs (Auto), DIADEM (RE), DEPTA (RE), DIADEM (Auto), DEPTA (Auto), and RoadRunner on Auto, Book, Camera, Job, Movie, Nba, Restaurant, and University.]
5.1 Setting
Datasets. The dataset consists of 100 websites from 10 domains and is an enhanced version of SWDE [20], a benchmark commonly used in web data extraction. SWDE's data is sourced from 80 sites and 8 domains: auto, book, camera, job, movie, NBA player, restaurant, and university. For each website, SWDE provides collections of 400 to 2k detail pages (i.e., where each page corresponds to a single record). We complemented SWDE with collections of listing pages (i.e., pages with multiple records) from 20 websites of the real estate (RE) and auto domains. Table 1 summarises the characteristics of the dataset.

Table 1: Dataset characteristics.

  Domain       Type     Sites  Pages    Records  Attributes
  Real Estate  listing  10     271      3,286    15
  Auto         listing  10     153      1,749    27
  Auto         detail   10     17,923   17,923   4
  Book         detail   10     20,000   20,000   5
  Camera       detail   10     5,258    5,258    3
  Job          detail   10     20,000   20,000   4
  Movie        detail   10     20,000   20,000   4
  Nba Player   detail   10     4,405    4,405    4
  Restaurant   detail   10     20,000   20,000   4
  University   detail   10     16,705   16,705   4
  Total        -        100    124,715  129,326  78

SWDE comes with ground-truth data created under the assumption that wrapper-generation systems could only generate extraction rules with DOM-element granularity, i.e., without segmenting text nodes. Modern wrapper-generation systems support text-node segmentation and we therefore refined the ground-truth accordingly. As an example, in the camera domain, the original ground-truth values for MODEL consisted of the entire product title; besides the model, the text node includes COLOR, PIXELS, and MANUFACTURER. The ground-truth for the real estate and auto domains has been created following the SWDE format. The final dataset consists of more than 120k pages, for almost 130k records containing more than 500k attribute values.
Wrapper-generation systems. We generated input relations for our evaluation using four wrapper-generation systems: DIADEM [19], DEPTA [36], and ViNTs [39] for listing pages, and RoadRunner [12] for detail pages.1 The output of DIADEM, DEPTA, and RoadRunner can be readily used in the evaluation since these are full-fledged data extraction systems, supporting the segmentation of both records and attributes within listings or (sets of) detail pages. ViNTs, on the other hand, segments rows into records within a search result listing and, as such, has no concept of attribute. Instead, it segments rows within a record. We therefore post-processed its output, typing the content of lines from different records that are likely to have the same semantics. We used a naïve similarity heuristic based on the relative position in the record and the string-edit distance of the row's content. This is a very simple version of the more advanced alignment methods based on instance-level redundancy used by, e.g., WEIR and TEGRA [7].
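The naïve similarity heuristic used to post-process ViNTs can be sketched as follows (the thresholds are illustrative assumptions, not the values used in the evaluation):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # dp[j]: previous row, dp[j-1]: current row, prev: diagonal.
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def same_type(line_a, pos_a, line_b, pos_b, max_pos_gap=1, max_rel_dist=0.5):
    """Two lines from different records get the same type when their
    relative positions in the record are close and their contents are
    similar (thresholds here are illustrative assumptions)."""
    if abs(pos_a - pos_b) > max_pos_gap:
        return False
    d = edit_distance(line_a, line_b)
    return d / max(len(line_a), len(line_b), 1) <= max_rel_dist

print(same_type("Rating: R", 2, "Rating: PG", 2))
```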
Metrics. The performance of the repair is evaluated by comparing wrapper-generated relations against the SWDE ground truth before and after the repair. The metrics used for the evaluation are Precision, Recall, and F1-Score, computed at attribute level. Both the ground truth and the extracted values are normalised, and an exact match between the extracted values and the ground truth is required for a hit. For space reasons, in this paper we only present the most relevant results. The results of the full evaluation, together with the dataset, the gold standard, the extracted relations, and the code of the normaliser and of the scorer, are available in the online appendix [1].

1 RoadRunner can be configured for listings but it performs better on detail pages.
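The scoring just described can be sketched as follows (with a toy normaliser; the actual normaliser and scorer are in the online appendix [1]):

```python
def normalise(v):
    """Toy normalisation: lowercase and collapse whitespace
    (an illustrative stand-in for the paper's normaliser)."""
    return " ".join(v.lower().split())

def prf1(extracted, gold):
    """Attribute-level Precision/Recall/F1 with exact matching of
    normalised values, as described in the Metrics paragraph."""
    ext = [normalise(v) for v in extracted]
    gld = [normalise(v) for v in gold]
    hits = sum(1 for v in ext if v in gld)  # multiset nuances ignored in this sketch
    p = hits / len(ext) if ext else 0.0
    r = hits / len(gld) if gld else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(prf1(["Steven  Spielberg", "David Lean", "Wrong Value"],
           ["steven spielberg", "david lean", "jean-pierre melville"]))
```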
All experiments are run on a desktop with an Intel quad-core i7 at 3.40GHz, 16 GB RAM, and Linux Mint 17.
5.2 Repair performance
Relation-level Accuracy. The first two questions we want to answer are whether joint repairs are necessary and what their impact is in terms of quality. Table 2 reports, for each system, the percentage of: (i) Correctly extracted values. (ii) Under-segmentations, i.e., when values for an attribute are extracted together with values of other attributes or spurious content. Indeed, websites often publish multiple attribute values within the same text node and the involved extraction systems are not able to split the values into multiple attributes. (iii) Over-segmentations, i.e., when attribute values are split over multiple fields. As anticipated in Section 2, this rarely happens since an attribute value is often contained in a single text node. In this setting, an attribute value can be over-segmented only if the extraction system is capable of splitting single text nodes (DIADEM), and even then the splitting happens only when the system can identify a strong regularity within the text node. (iv) Misplacements, i.e., when values are placed or labelled as the wrong attribute. This is mostly due to a lack of semantic knowledge and to confusion introduced by overlapping attribute domains. (v) Missing values, due to a lack of regularity and optionality in the web source (RoadRunner, DEPTA, ViNTs) or to missing values in the domain knowledge (DIADEM). Note that the numbers do not add up to 100% since errors may fall into multiple categories.

Table 2: Wrapper-generation system errors.

  System      Correct (%)  Under-Segmented (%)  Over-Segmented (%)  Misplaced (%)  Missing (%)
  DIADEM      60.9         34.6                 0                   23.2           3.5
  DEPTA       49.7         44                   0                   25.3           6
  ViNTs       23.9         60.8                 0                   36.4           15.2
  RoadRunner  46.3         42.8                 0                   18.6           10.4

These numbers clearly show that there is a quality problem in wrapper-generated relations, and they also support the atomic misplacement assumption.
Figure 2 shows, for each system and each domain, the impact of the joint repair on our metrics. Light- (resp. dark-)colored bars denote the quality of the relation before (resp. after) the repair.

A first conclusion that can be drawn is that a repair is always beneficial. Of the 697 extracted attributes, 588 (84.4%) require some form of repair, and the average pre-repair F1-Score produced by the systems is 50%. WADaR is able to induce a correct regular expression for 335 (57%) attributes, while for the remaining 253 (43%) it produces value-based expressions. We can repair at least one attribute in each of the wrappers in all of the cases, and we can repair more than 75% of the attributes in more than 80% of the cases.
Among the considered systems, DIADEM delivers, on average, the highest pre-repair F1-Score (~60%), but it never exceeds 65%. RoadRunner is on average worse than DIADEM, but it reaches a better 70% F1-Score on restaurant. Websites in this domain are in fact highly structured, and individual attribute values are contained in a dedicated text node. When attributes are less structured, e.g., on book, camera, and movie, RoadRunner has a significant drop in performance. As expected, ViNTs delivers the worst pre-cleaning results.

In terms of accuracy, our approach delivers a boost in F1-Score between 15% and 60%. Performance is consistently close to or above 80% across domains and, except for ViNTs, across systems, with a peak of 91% for RoadRunner on NBA player.
The following are the remaining causes of errors: (i) Missing values cannot be repaired, as we can only use the data available in
[Figure: WADaR vs. WEIR — Precision, Recall, and F-Score per domain (Auto, Book, Camera, Job, Movie, Nba, Restaurant, University).]
Evaluation
100 websites, 10 domains, 4 wrapper-generation systems.
Precision, Recall, and F1-Score computed before and after repair.
WADaR boosts F1-Score between 15% and 60%. Performance is consistently close to or above 80%.
Metrics are computed considering exact matches.
WADaR against WEIR.
WADaR is highly robust to errors of the NERs.
WADaR scales linearly with the size of the input relation. Optimal joint-repair approximations are computed in polynomial time.
Optimality: WADaR provably produces relations of maximum fitness, provided that the number of correctly annotated tuples is more than the maximum error rate of the annotators.
More questions? Come to the poster later!!!
T. Furche, G. Gottlob, L. Libkin, G. Orsi, N. Paton. Data Wrangling for Big Data. EDBT ’16