Joint Repairs for Web Wrappers
Stefano Ortona, Giorgio Orsi, Marcello Buoncristiano, and Tim Furche
ICDE, Helsinki, 19 May 2016
Title             Director          Rating  Runtime
Schindler’s List  Steven Spielberg  R       195 min
Web Data Extraction
Example output of RoadRunner / DEPTA:

Attribute_1                      Attribute_2
Schindler’s List                 Director: Steven Spielberg Rating: R Runtime: 195 min
Lawrence of Arabia (re-release)  Director: David Lean Rating: PG Runtime: 216 min
Le cercle Rouge (re-release)     Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min
Joint Data and Wrapper Repair
Attribute_1       Attribute_2
Schindler’s List  Director: Steven Spielberg Rating: R Runtime: 195 min

becomes

Title             Director          Rating  Runtime
Schindler’s List  Steven Spielberg  R       195 min
Maximal Repair is NP-complete

Example: splitting the single attribute value "Director: Steven Spielberg Rating: R Runtime: 195 min" into

Director          Rating  Runtime
Steven Spielberg  R       195 min
OBSERVATIONS
Templated websites: data is published following a template.
Wrapper behaviour: wrappers rarely misplace and over-segment at the same time; wrappers make systematic errors.
Oracles: oracles can be implemented as (ensembles of) NERs; NERs are not perfect, i.e., they make mistakes.
Joint Wrapper And Data Repair
When values are both misplaced and over-segmented, computing repairs of maximal fitness is hard; otherwise, just do the following:
(1) Compute all possible non-crossing k-partitions (k = |R|) of the tokens, i.e., assign to each attribute an element of the partition (O(n^k) candidates, a Narayana number).
(2) Discard tokens never accepted by the oracles in any of the partitions.
(3) Collapse identical partitions and choose the one with maximal fitness.
Without misplacement and over-segmentation, a solution can be found in polynomial time by computing a non-crossing k-partition.
NP-hardness: reduction from Weighted Set Packing. Membership in NP: guess a partition, check that it is non-crossing, and compute its fitness in PTIME.
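The segmentation step can be sketched in a simplified form that only considers contiguous, non-empty segments (the full algorithm allows any non-crossing partition); the oracles below are toy membership tests standing in for NER-backed ones:

```python
from itertools import combinations

def best_segmentation(tokens, oracles):
    """Enumerate all ways of splitting `tokens` into k = len(oracles)
    contiguous, non-empty segments, score each candidate with the oracles,
    and keep the one of maximal fitness."""
    n, k = len(tokens), len(oracles)
    best, best_fit = None, -1.0
    for cuts in combinations(range(1, n), k - 1):   # C(n-1, k-1) candidates
        bounds = (0,) + cuts + (n,)
        segs = [" ".join(tokens[bounds[i]:bounds[i + 1]]) for i in range(k)]
        fit = sum(w(s) for w, s in zip(oracles, segs)) / k
        if fit > best_fit:
            best, best_fit = segs, fit
    return best, best_fit

# Toy oracles (hypothetical): accept a value iff it looks like the attribute.
oracles = [
    lambda v: v in {"Steven Spielberg", "David Lean"},   # DIRECTOR
    lambda v: v in {"R", "PG"},                          # RATING
    lambda v: v.endswith("min"),                         # RUNTIME
]
segs, fit = best_segmentation("Steven Spielberg R 195 min".split(), oracles)
```

Enumerating the cut points and keeping the fittest candidate mirrors steps (1) and (3) above; step (2), discarding never-accepted tokens, is omitted for brevity.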
Stefano Ortona, [email protected], University of Oxford, UK
Giorgio Orsi, [email protected], University of Oxford, UK
Marcello Buoncristiano, [email protected], Università della Basilicata, Italy
Tim Furche, [email protected], University of Oxford, UK
http://diadem.cs.ox.ac.uk/wadar
Web data extraction (aka scraping/wrapping) uses wrappers to turn web pages into structured data.
Wrapper: a structure { ⟨R, eR⟩, { ⟨A1, eA1⟩, …, ⟨Am, eAm⟩ } } specifying the objects to be extracted (listings, records, attributes) and the corresponding XPath expressions e.
Wrappers are often created algorithmically and in large numbers. Tools capable of maintaining them over time are missing.
⟨RATING, //li[@class='second']/p⟩
⟨RUNTIME, //li[@class='third']/ul/li[1]⟩
Algorithmically-created wrappers generate data that is far from perfect: data can be badly segmented and misplaced.
⟨TITLE, ⟨1⟩, string($)⟩
⟨DIRECTOR, ⟨2⟩, substring-before(substring-after($, tor:_), _Rat)⟩
⟨RATING, ⟨2⟩, substring-before(substring-after($, ing:_), _Run)⟩
⟨RUNTIME, ⟨2⟩, substring-after($, time:_)⟩
Take a set Ω of oracles, where each ωA in Ω can say whether a value vA belongs to the domain of A. We define the fitness of a relation R w.r.t. Ω as the average tuple fitness f(R, Σ, Ω) = ∑_{ū∈R} f(ū, Σ, Ω) / |R|, where f(ū, Σ, Ω) = ∑_{i=1}^{c} ωAi(ui) / d, with c = min{arity(Σ), arity(R)} and d = max{arity(Σ), arity(R)}.
Repair: a set of regular expressions that, when applied to the original relation, produces a new relation with higher fitness.
(Figure: candidate segmentations of the record string, e.g. ⟨Director: Steven Spielberg⟩ ⟨Rating: R⟩ ⟨Runtime: 195 min⟩, against over-segmented or misplaced candidates such as ⟨Director: Steven⟩, ⟨Rating: R Runtime:195⟩, or ⟨min Director: Steven Spielberg⟩.)
WADaR: ⟨DIRECTOR, //li[@class='first']/div/span⟩
APPROXIMATING JOINT REPAIRS
Annotation (step 1): each record is interpreted as a string (the concatenation of its attributes), on which NERs identify relevant attribute values.
Entity recognisers make mistakes, WADaR tolerates incorrect and missing annotations.
Attribute_1                               Attribute_2
Schindler’s List                          Director: Steven Spielberg Rating: R Runtime: 195 min
Lawrence of Arabia (re-release)           Director: David Lean Rating: PG Runtime: 216 min
Le cercle Rouge (re-release)              Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min
The life of Jack Tarantino (coming soon)  Director: David R Lynch Rating: Not Rated Runtime: 123 min
(Figure: TITLE, DIRECTOR, RATING, and RUNTIME annotations over the records.)
Segmentation
Goal: understand the underlying structure of the relation.

Two possible ways of encoding the problem:
1. Max-flow sequence in a flow network (max-flow sequence: DIRECTOR RATING RUNTIME).
2. Most likely sequence in a memoryless Markov chain (most likely sequence: DIRECTOR RATING RUNTIME).

Solutions often coincide. Markov chains: intuitive and faster to compute. Max flows: provably optimal.
Induction

Example records used for induction:
Director: Steven Spielberg Rating: R Runtime: 195 min
Director: David Lean Rating: PG Runtime: 216 min
Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min
Director: David R Lynch Rating: Not Rated Runtime: 123 min
SUFFIX = substring-before("_(")
PREFIX = substring-after("tor:_"), SUFFIX = substring-before("_Rat")
PREFIX = substring(string-length()-7)
Input: set of clean annotations to be used as positive examples.
WADaR induces regular expressions by looking, e.g., at common prefixes, suffixes, non-content strings, and character length.
Induced expressions improve recall.

token value1 token token token token
token token value2 token token token
token token token value3 token token
When WADaR cannot induce regular expressions (not enough regularity), the data is repaired directly with the annotators. Wrappers are instead repaired with value-based expressions, i.e., disjunctions of the annotated values.
ATTRIBUTE=string-contains(“value1”|”value2”|”value3”)
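A value-based expression is mechanical to build: the pattern is just a disjunction of the values the annotators accepted. A minimal sketch:

```python
import re

def value_based_expression(values):
    """Fallback when no structural regularity exists: the 'regex' is simply a
    disjunction of the accepted annotated values. Longer values come first so
    that the longest match wins in the alternation."""
    alternatives = sorted(values, key=len, reverse=True)
    return re.compile("|".join(re.escape(v) for v in alternatives))

rx = value_based_expression(["Audi", "Ford", "Citroën"])
rx.search("Ford £22k C-max Titanium X").group(0)  # "Ford"
```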
Empirical Evaluation
(Figure: Precision, Recall, and F1-Score before and after repair, for each system and domain: ViNTs, DIADEM, and DEPTA on RE and Auto listings, and RoadRunner on the Auto, Book, Camera, Job, Movie, Nba, Restaurant, and University detail pages.)
5.1 Setting

Datasets. The dataset consists of 100 websites from 10 domains and is an enhanced version of SWDE [20], a benchmark commonly used in web data extraction. SWDE’s data is sourced from 80 sites and 8 domains: auto, book, camera, job, movie, NBA player, restaurant, and university. For each website, SWDE provides collections of 400 to 2k detail pages (i.e., where each page corresponds to a single record). We complemented SWDE with collections of listing pages (i.e., pages with multiple records) from 20 websites of the real estate (RE) and auto domains. Table 1 summarises the characteristics of the dataset. SWDE comes with ground-truth data created under the assumption that wrapper-generation systems could only generate extraction rules with DOM-element granularity, i.e., without segmenting text nodes. Modern wrapper-generation systems support text-node segmentation and we therefore refined the ground-truth accordingly. As an example, in the camera domain, the original ground-truth values for MODEL consisted of the entire product title. The text node includes, other than the model, COLOR, PIXELS, and MANUFACTURER. The ground-truth for the real estate and auto domains has been created following the SWDE format. The final dataset consists of more than 120k pages, for almost 130k records containing more than 500k attribute values.

Table 1: Dataset characteristics.
Domain       Type     Sites  Pages    Records  Attributes
Real Estate  listing  10     271      3,286    15
Auto         listing  10     153      1,749    27
Auto         detail   10     17,923   17,923   4
Book         detail   10     20,000   20,000   5
Camera       detail   10     5,258    5,258    3
Job          detail   10     20,000   20,000   4
Movie        detail   10     20,000   20,000   4
Nba Player   detail   10     4,405    4,405    4
Restaurant   detail   10     20,000   20,000   4
University   detail   10     16,705   16,705   4
Total        -        100    124,715  129,326  78
Wrapper-generation systems. We generated input relations for our evaluation using four wrapper-generation systems: DIADEM [19], DEPTA [36] and ViNTs [39] for listing pages, and RoadRunner [12] for detail pages.¹ The output of DIADEM, DEPTA, and RoadRunner can be readily used in the evaluation since these are full-fledged data extraction systems, supporting the segmentation of both records and attributes within listings or (sets of) detail pages. ViNTs, on the other hand, segments rows into records within a search result listing and, as such, it does not have a concept of attribute. Instead, it segments rows within a record. We therefore post-processed its output, typing the content of lines from different records that are likely to have the same semantics. We used a naïve heuristic similarity based on relative position in the record and string-edit distance of the row’s content. This is a very simple version of more advanced alignment methods based on instance-level redundancy used by, e.g., WEIR and TEGRA [7].
Metrics. The performance of the repair is evaluated by comparing wrapper-generated relations against the SWDE ground truth before and after the repair. The metrics used for the evaluation are Precision, Recall, and F1-Score computed at attribute level. Both the ground truth and the extracted values are normalised, and exact matching between the extracted values and the ground truth is required for a hit. For space reasons, in this paper we only present the most relevant results. The results of the full evaluation, together with the dataset, gold standard, extracted relations, and the code of the normaliser and of the scorer, are available at the online appendix [1].

¹ RoadRunner can be configured for listings but it performs better on detail pages.
All experiments are run on a desktop with an Intel quad-core i7at 3.40GHz with 16 GB Ram and Linux Mint OS 17.
5.2 Repair performance

Relation-level Accuracy. The first two questions we want to answer are: whether joint repairs are necessary and what their impact is in terms of quality. Table 2 reports, for each system, the percentage of: (i) Correctly extracted values. (ii) Under-segmentations, i.e., when values for an attribute are extracted together with values of other attributes or spurious content. Indeed, websites often publish multiple attribute values within the same text node and the involved extraction systems are not able to split values into multiple attributes. (iii) Over-segmentations, i.e., when attribute values are split over multiple fields. As anticipated in Section 2, this rarely happens since an attribute value is often contained in a single text node. In this setting an attribute value can be over-segmented only if the extraction system is capable of splitting single text nodes (DIADEM), but even in this case the splitting happens only when the system can identify a strong regularity within the text node. (iv) Misplacements, i.e., values are placed or labeled as the wrong attribute. This is mostly due to lack of semantic knowledge and confusion introduced by overlapping attribute domains. (v) Missing values, due to lack of regularity and optionality in the web source (RoadRunner, DEPTA, ViNTs) or missing values from the domain knowledge (DIADEM). Note that the numbers do not add up to 100% since errors may fall into multiple categories. These numbers clearly show that there is a quality problem in wrapper-generated relations and also support the atomic misplacement assumption.

Table 2: Wrapper generation system errors.
System      Correct (%)  Under-Segmented (%)  Over-Segmented (%)  Misplaced (%)  Missing (%)
DIADEM      60.9         34.6                 0                   23.2           3.5
DEPTA       49.7         44                   0                   25.3           6
ViNTs       23.9         60.8                 0                   36.4           15.2
RoadRunner  46.3         42.8                 0                   18.6           10.4
Figure 2 shows, for each system and each domain, the impact of the joint repair on our metrics. Light- (resp. dark-) colored bars denote the quality of the relation before (resp. after) the repair.

A first conclusion that can be drawn is that a repair is always beneficial. Of 697 extracted attributes, 588 (84.4%) require some form of repair and the average pre-repair F1-Score produced by the systems is 50%. We are able to induce a correct regular expression for 335 (57%) attributes, while for the remaining 253 (43%) WADaR produces value-based expressions. We can repair at least one attribute in each of the wrappers in all of the cases, and we can repair more than 75% of attributes in more than 80% of the cases.

Among the considered systems, DIADEM delivers, on average, the highest pre-repair F1-Score (~60%), but it never exceeds 65%. RoadRunner is on average worse than DIADEM but it reaches a better 70% F1-Score on restaurant. Websites in this domain are in fact highly structured and individual attribute values are contained in a dedicated text node. When attributes are less structured, e.g., on book, camera, and movie, RoadRunner has a significant drop in performance. As expected, ViNTs delivers the worst pre-cleaning results.

In terms of accuracy, our approach delivers a boost in F1-Score between 15% and 60%. Performance is consistently close to or above 80% across domains and, except for ViNTs, across systems, with a peak of 91% for RoadRunner on NBA player.
The following are the remaining causes of errors: (i) Missing values cannot be repaired as we can only use the data available in the extracted relation.
(Figure: WEIR vs. repair, Precision, Recall, and F-Score per domain: Auto, Book, Camera, Job, Movie, Nba, Restaurant, University.)
Evaluation
100 websites, 10 domains, 4 wrapper-generation systems.
Precision, Recall, F1-Score computed before and after repair.
WADaR boosts F1-Score between 15% and 60%. Performance consistently close to or above 80%.
Metrics computed considering exact matches.
WADaR against WEIR.
WADaR is highly robust to errors of the NERs.
WADaR scales linearly with the size of the input relation. Optimal joint-repair approximations are computed in polynomial time.
Optimality
WADaR provably produces relations of maximum fitness, provided that the proportion of correctly annotated tuples exceeds the maximum error rate of the annotators.
Background: Web wrapping
refcode  postcode  bedrooms  bathrooms  available   price
33453    OX2 6AR   3         2          15/10/2013  £1280 pcm
33433    OX4 7DG   2         1          18/04/2013  £995 pcm
Process of turning semi-structured (templated) web data into structured form
Hidden databases are actually a form of dark / dim data (ref. panel on Tuesday)
manual / (semi-)supervised: accurate, but expensive and non-scalable
unsupervised: less accurate, but cheaper and scalable
Wrapidity
Background: Web wrapping
From (manually or automatically) created examples to XPath-based wrappers
Even on templated websites, automatic wrapping can be inaccurate
Pairs <field,expression> that, once applied to the DOM, return structured records
field      expression
listing    //body
record     //div[contains(@class,'movlist_wrap')]
title      //span[contains(@class,'title')]/text()
rated      .//span[.='rating:']/following-sibling::strong/text()
genre      .//span[.='genre']/following-sibling::strong/text()
releaseMo  .//span[@class='release']/text()
releaseDy  .//span[@class='release']/text()
releaseYr  .//span[@class='release']/text()
image      .//@src
runtime    .//span[.='runtime']/following-sibling::strong/text()
Problems with wrapping
Inaccurate wrapping results in over- (or under-) segmented data
Attribute_1           Attribute_2
Ava’s Possessions     Release Date: March 4, 2016 | Rated: R | Genre(s): Sci-Fi, Mystery, Thriller, Horror | Production Company: Off Hollywood Pictures | Runtime: 216 min
Camino                Release Date: March 4, 2016 | Rated: Not Rated | Genre(s): Action, Adventure, Thriller | Production Company: Bielberg Entertainment | Runtime: 103 min
Cemetery of Splendor  Release Date: March 4, 2016 | Rated: Not Rated | Genre(s): Drama | User Score: 4.6 | Production Company: Centre National de la Cinématographie (CNC) | Runtime: 122 min
RS: source relation (example extraction using RoadRunner, Crescenzi et al.)
Target schema: Title, Release, Genre, Rating, Runtime
Questions
The questions we want to answer are:
can we fix the data, and use what we learn to repair wrappers as well?
are the solutions scalable?
Why do we care?
Companies such as FB and Skyscanner spend millions of dollars of engineering time creating and maintaining wrappers
Wrapper maintenance is a major cost of data acquisition from the web
Fixing the data
The wrapper thinks it is filling this schema…

MAKE  MODEL    PRICE
£19k  Audi     A3 Sportback
£43k  Audi     A6 Allroad quattro
£10k  Citroën  C3
£22k  Ford     C-max Titanium X
If all instances looked like this (i.e., mis-segmentation, but no garbage and no shuffling), this would be a table induction problem: TEGRA, WebTables, etc.
Moreover… we would still have no clue on how to fix the wrapper afterwards
…but instead it produces this instance…

£19k     Make: Audi Model: A3 Sportback
£43k     Make: Audi Model: A6 Allroad
Citroën  £10k Model: C3
Ford     £22k Model: C-max Titanium X
What is a good relation?
The problem is that wrapper-generated relations really look like this…
First, we need a way to determine how “far” we are from a good relation…
ū = ⟨u1, u2, …, un⟩   a tuple generated by the wrapper
Σ = ⟨A1, A2, …, Am⟩   the (target) schema for the extraction
Ω = {ωA1, …, ωAarity(Σ)}   set of oracles for Σ, with ωA(u) = 1 if u ∈ dom(A) or u = null, and ωA(u) = 0 otherwise

The fitness then quantifies how well ū (resp. the whole instance) “fits” Σ.

Example: Ω = {ωMAKE, ωMODEL, ωPRICE}, Σ = ⟨MAKE, PRICE, MODEL⟩
£19k     Make: Audi Model: A3 Sportback
£43k     Make: Audi Model: A6 Allroad
Citroën  £10k Model: C3
Ford     £22k Model: C-max Titanium X

f(R, Σ, Ω) = 1/2 = 50%
Problem Definition: Fitness
Σ = ⟨A1, A2, …, Am⟩ attributes (fields) of the target schema of the relation
ū = ⟨u1, u2, …, un⟩ tuple of the wrapper-generated relation R
Ω = {ωA1, …, ωAarity(Σ)} set of oracles for the fields of Σ, s.t.
ωA(u)=1 if u ∈ dom(A) or u=null, and ωA(u)=0 otherwise
We define the fitness of a tuple ū (resp. relation R) w.r.t. a schema Σ as:
f(ū, Σ, Ω) = (1/d) · ∑_{i=1}^{c} ωAi(ui)

where c = min{arity(Σ), arity(R)} and d = max{arity(Σ), arity(R)}

resp. f(R, Σ, Ω) = (1/|R|) · ∑_{ū∈R} f(ū, Σ, Ω)
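The definition translates directly into code; the sketch below reproduces the f(R, Σ, Ω) = 1/6 example that follows, with simple membership tests standing in for the oracles:

```python
def tuple_fitness(u, schema, oracles):
    """f(ū, Σ, Ω) = (1/d) · Σ_{i=1..c} ωAi(ui), with c/d the min/max arity."""
    c = min(len(schema), len(u))
    d = max(len(schema), len(u))
    return sum(oracles[schema[i]](u[i]) for i in range(c)) / d

def relation_fitness(rel, schema, oracles):
    """f(R, Σ, Ω): average tuple fitness over the relation."""
    return sum(tuple_fitness(u, schema, oracles) for u in rel) / len(rel)

# Toy oracles: membership tests standing in for real NER-backed oracles.
oracles = {
    "MAKE":  lambda v: v in {"Audi", "Citroën", "Ford"},
    "MODEL": lambda v: v in {"A3 Sportback", "A6 Allroad quattro",
                             "C3", "C-max Titanium X"},
    "PRICE": lambda v: v.startswith("£"),
}
schema = ("MAKE", "MODEL", "PRICE")
rel = [("£19k", "Audi", "A3 Sportback"),
       ("£43k", "Audi", "A6 Allroad quattro"),
       ("Citroën", "£10k", "C3"),
       ("Ford", "£22k", "C-max Titanium X")]
# Only MAKE is ever in the right column (last two rows): fitness 1/6 ≈ 17%.
```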
Input: a wrapper W, a relation R | W(P)=R for some set of pages P, and a schema Σ
MAKE     MODEL  PRICE
£19k     Audi   A3 Sportback
£43k     Audi   A6 Allroad quattro
Citroën  £10k   C3
Ford     £22k   C-max Titanium X

f(R, Σ, Ω) = 1/6 = 17%
Problem Definition: Σ-repairs
A Σ-repair is a pair σ = ⟨Π, ρ⟩ where:
Π = (i, j, …, k) is a permutation of the fields of R
ρ = { ⟨A1, ƐA1⟩, ⟨A2, ƐA2⟩, …, ⟨Am, ƐAm⟩ } is a set of regexes, one for each attribute in Σ

Σ-repairs can be applied to a tuple ū in the following way:

σ(ū) = ⟨ ƐA1(Π(ū)), ƐA2(Π(ū)), …, ƐAm(Π(ū)) ⟩
The notion of applicability extends naturally to relations σ(R) (i.e., sets of tuples)
Similarly, Σ-repairs can be applied to wrappers as well [details in the paper]
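For tuples, applying σ = ⟨Π, ρ⟩ can be sketched as follows; Python regexes with capture groups stand in for the paper's substring-based expressions, and the concrete patterns are illustrative assumptions:

```python
import re

def apply_repair(u, perm, regexes):
    """σ(ū): permute the tuple's fields, concatenate them into one string,
    then extract each target attribute with its own expression."""
    s = " ".join(u[i] for i in perm)          # Π(ū), flattened
    out = []
    for rx in regexes:                        # one Ɛ_A per target attribute
        m = re.search(rx, s)
        out.append(m.group(1) if m else None)
    return tuple(out)

u = ("£19k", "Make: Audi Model: A3 Sportback")
regexes = [r"Make:\s*(\S+)",    # MAKE   (hypothetical expressions)
           r"Model:\s*(.+)$",   # MODEL
           r"(£\S+)"]           # PRICE
apply_repair(u, (0, 1), regexes)  # ("Audi", "A3 Sportback", "£19k")
```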
Output: a wrapper W’ and a relation R’ | W’(P)=R’ and R’ is of maximum fitness w.r.t. Σ
The goal is to find the Σ-repair that maximises the fitness
Computing Σ-repairs
Complexity [details in the paper]:
1. non atomic misplacements: NP-complete (red. from Weighted Set Packing)
2. atomic misplacements: polynomial (red. from Stars and Buckets)
We have an atomic misplacement when the correct value for an attribute is:
1. entirely misplaced (i.e., contained in a single wrong field), or
2. over-segmented with its fragments in adjacent fields of the relation.
Atomic misplacement:
MAKE  MODEL  PRICE
£22k  Ford   C-max Titanium X

Non-atomic misplacement:
MAKE   MODEL   PRICE
C-max  £22k X  Ford Titanium
Naïve Algorithm:
For each tuple…
1. permute its fields in all possible ways (only needed for non-atomic misplacements)
2. segment the tuple in all possible ways
3. ask the oracles and keep the segmentation of highest fitness
Approximating Σ-repairs
The naïve algorithm has the following problems:
1. oracles do not (always) exist
2. it fixes one tuple at a time, while the wrapper needs a single fix for each attribute
3. even under the assumption of atomic misplacements, we still have to try O(n^k) different segmentations (worst case) before finding the one of maximum fitness
(1) Weak oracles
Use noisy NERs in place of oracles. If unavailable, one is easy to build.
In this work we use ROSeAnn (Chen et al., PVLDB 2013).
(2 and 3) Approximate relation-wide repairs
Wrappers are programs: if they make a mistake, they make it consistently.
So there is hope of finding a common underlying attribute structure.
Finding the right structure
We have to solve two problems:
find the underlying structure(s) of the relation
find a segmentation that maximises the fitness
An obvious way is sequence labelling (e.g., Markov chains + Viterbi) where oracles are simulated by NERs (so they can make mistakes)
(Markov chain built from the annotated relation below, Ω = {ωA, ωB, ωC, ωD}:)

a b c
a b c
a b c
a d
a d
b a d
b a d

The maximum likelihood sequence is actually ⟨A,D⟩ which “fits” ~28%.
It looks like there’s another sequence that fits better…
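The ⟨A,D⟩ outcome can be checked with a small memoryless Markov model; for brevity, the sketch scores only the observed label sequences instead of running Viterbi:

```python
from collections import Counter, defaultdict

def most_likely_sequence(rows):
    """Estimate transition probabilities of a memoryless Markov chain from the
    annotated rows, then return the most probable START-to-END label sequence
    (brute force over observed sequences rather than Viterbi, for brevity)."""
    trans = defaultdict(Counter)
    for row in rows:
        path = ("START",) + tuple(row) + ("END",)
        for a, b in zip(path, path[1:]):
            trans[a][b] += 1

    def prob(row):
        p, path = 1.0, ("START",) + tuple(row) + ("END",)
        for a, b in zip(path, path[1:]):
            p *= trans[a][b] / sum(trans[a].values())
        return p

    return max({tuple(r) for r in rows}, key=prob)

rows = [("a", "b", "c")] * 3 + [("a", "d")] * 2 + [("b", "a", "d")] * 2
most_likely_sequence(rows)  # ("a", "d"): the memoryless chain prefers a→d
```

The chain forgets that most a's that follow START continue with b c; that is exactly the memorylessness the max-flow encoding fixes.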
Finding the right structure
The problem is that Markov chains are memoryless… we have to remember the context and make sure our sequence satisfies the oracles more than any other. Ok… this sounds like a max-flow!

(Flow network with context-carrying nodes, e.g. vA,(), vB,(A), vC,(A,B), vD,(A), over the same annotated relation, Ω = {ωA, ωB, ωC, ωD}.)

The sequence corresponding to the max flow is ⟨A,B,C⟩ which “fits” ~32%.
Iteratively compute max flows on the network, i.e., likely sequences of high fitness
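When every record carries a complete annotation sequence, each record pushes one unit of flow along its own path, so each max-flow extraction reduces to picking the most frequent remaining sequence. The sketch below iterates that simplification (the real network also handles partial annotations and shared context nodes):

```python
from collections import Counter

def likely_sequences(annotated_records, coverage=0.8):
    """Approximate the iterative max-flow step: repeatedly take the most
    common annotation sequence until `coverage` of the records is explained."""
    remaining = list(annotated_records)
    total, covered, sequences = len(remaining), 0, []
    while remaining and covered / total < coverage:
        seq, n = Counter(remaining).most_common(1)[0]
        sequences.append(seq)
        covered += n
        remaining = [r for r in remaining if r != seq]
    return sequences

records = [("PRICE", "MAKE", "MODEL")] * 6 + [("MAKE", "PRICE", "MODEL")] * 2
likely_sequences(records)
```

With 6 of 8 records on the first sequence (75% coverage, below the threshold), a second iteration picks up the remaining pattern, mirroring the two iterations in the figure below.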
(Figure: max-flow iterations on the annotation network; Iteration 0 extracts the sequence PRICE, MAKE, MODEL; Iteration 1 extracts MAKE, PRICE, MODEL.)
First, annotate the relation using NERs (surrogate oracles) and build the network. We stop when we have covered “enough” of the tuples in the relation.
Example (Ω = {ωMAKE, ωMODEL, ωPRICE}):

£19k     Make: Audi Model: A3 Sportback
£43k     Make: Audi Model: A6 Allroad quattro
Citroën  £10k Model: C3
Ford     £22k Model: C-max Titanium X
Fixing the relation (and the wrapper)
Max flows represent likely sequences. We use them to eliminate unsound annotations.
We can use standard regex-induction algorithms to obtain robust expressions
£19k Make: Audi Model: A3 Sportback
MAKE [11,15) MODEL [24,36) PRICE [0,4)
The remaining annotations can be used as examples for regex induction
The induced expressions recover missing (incomplete) annotations
£19k Make: Audi Model: A3 Sportback
£43k Make: Audi Model: A6 Allroad quattro
Citroën £10k Model: C3
Ford £22k Model: C-max Titanium X
ρ = { ⟨MAKE, substring-before($, £) or substring-before(substring-after($, 'ke:␣'), '␣Mo')⟩,
      ⟨MODEL, substring-after($, 'el:␣')⟩,
      ⟨PRICE, substring-after(substring-before($, 'kMa␣' || 'kMo␣'), ␣)⟩ }
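The prefix/suffix induction step can be sketched with longest-common-affix matching over the annotated spans; the example spans are illustrative, and a Python regex stands in for the XPath substring expressions actually induced:

```python
import os
import re

def induce_expression(examples):
    """Given annotated occurrences of an attribute as (record, start, end)
    spans, find the common context immediately before and after the value and
    turn it into a single extraction regex. Returns None when there is not
    enough regularity (the value-based fallback case)."""
    prefixes = [rec[:s] for rec, s, e in examples]
    suffixes = [rec[e:] for rec, s, e in examples]
    # Longest common *ending* of the text before the value, and the longest
    # common *beginning* of the text after it.
    pre = os.path.commonprefix([p[::-1] for p in prefixes])[::-1]
    suf = os.path.commonprefix(suffixes)
    if not pre and not suf:
        return None
    return re.compile(re.escape(pre) + r"(.*?)" + (re.escape(suf) if suf else r"$"))

examples = [  # hypothetical annotated spans for DIRECTOR
    ("Director: Steven Spielberg Rating: R", 10, 26),
    ("Director: David Lean Rating: PG", 10, 20),
]
rx = induce_expression(examples)
```

The induced pattern, anchored on "Director: " and " Rating: ", then recovers values the annotators missed, e.g. on an unannotated record.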
Approximating Σ-repairs
MAKE     MODEL  PRICE
£19k     Audi   A3 Sportback
£43k     Audi   A6 Allroad quattro
Citroën  £10k   C3
Ford     £22k   C-max Titanium X
When an expression fails to match a minimum number of tuples, we fall back to the NERs: value-based expressions
ρ = { ⟨MAKE, value-based($, [Audi, Ford] )⟩, ⟨MODEL, substring-after(substring-after($, ␣), ␣)⟩, ⟨PRICE, substring-after(substring-before($, k␣),␣)⟩ }
Example: (induction threshold 75%)
MAKE     MODEL  PRICE
£19k     Audi   A3 Sportback
£43k     Audi   A6 Allroad quattro
Citroën  £10k   C3
Ford     £22k   C-max Titanium X
ρ = { ⟨MAKE, substring-before($, £) or substring-before(substring-after($, k␣),␣)⟩, ⟨MODEL, substring-after(substring-after($, ␣), ␣)⟩, ⟨PRICE, substring-after(substring-before($, k␣),␣)⟩ }
Example: (induction threshold 20%)
Evaluation
Dataset:
An enhanced version of the SWDE dataset (https://swde.codeplex.com)
10 domains, 100 websites, 78 attributes, ~100k pages, ~130k records
Systems:
wrapper-generation systems: DIADEM, DEPTA, ViNTs, RoadRunner
baseline wrapper induction/repair system: WEIR (Crescenzi et al., VLDB ’13)
Implementation: WADaR (Wrapper and Data Repair) – Java + SQL
Evaluation: Highlights
Fig. 2: Impact of repair.
to 30% in real estate, with an identical effect in almost all domains.
Attribute-level accuracy. Another question is whether there are substantial differences in attribute-level accuracy. The top of Table III shows attributes where the repair is very effective (F1-Score ≈ 1 after repair). These values appear as highly structured attributes on web pages and the corresponding expressions repair almost all tuples. As an example, DOOR NUMBER is almost always followed by the suffixes dr or door. In these cases, the wrapper induction under-segmented the text due to lack of sufficient examples.
TABLE III: Attribute-level evaluation.
System  Domain       Attribute        Original F1-Score  Repaired F1-Score
DIADEM  real estate  POSTCODE         0.304              0.947
DIADEM  auto         DOOR NUMBER      0                  0.984
DEPTA   real estate  BATHROOM NUMBER  0.314              0.973
DEPTA   auto         MAKE             0.564              0.986
DIADEM  real estate  CITY             0                  0.59
DEPTA   real estate  COUNTY           0                  0.728
DIADEM  auto         ENGINE TYPE      0                  0.225
DEPTA   auto         PRICE            0.711              0.742
For attributes such as CITY and COUNTY, despite a significant boost (59% and 72% respectively) produced by the repair, the final F1-Score is still low. These are irregularly structured attributes, often co-occurring with others, e.g., STATE, POSTCODE, in ways that cannot be easily isolated by regular expressions. Despite not having syntactic regularity, these attributes are semantically related, e.g., COUNTY is usually after CITY and before POSTCODE, and could be captured by extending fS_XPATH with NER capabilities [4].
An exceptional case is ENGINE TYPE, where the value Petrol is also recognised as COLOUR. This causes a loss of performance as it creates a systematic error in the annotated relation. Another exception is the case of PRICE in relations generated by DEPTA. DEPTA extracts large chunks of text with multiple prices among which the annotators cannot distinguish the target price reliably, resulting in worse performance.
Independent evaluation. We performed an extraction of restaurant chain locations in collaboration with a large social network, which provided us with 210 target websites. We used DIADEM as a wrapper induction system and we then applied joint repair on the generated relations. The accuracy has been manually evaluated by third-party rating teams on a sample of nearly 1,000 records of the 276,787 extracted. Table IV shows Precision and Recall computed on the sample (values higher than 0.9 are highlighted in bold). In order to estimate
TABLE IV: Accuracy of large scale evaluation.
Attribute       Precision  Recall  % Modified values
LOCALITY        0.993      0.993   11.34%
OPENING HOURS   1.00       0.461   17.14%
LOCATED WITHIN  1.00       0.224   29.75%
PHONE           0.987      0.849   50.74%
POSTCODE        0.999      0.989   9.4%
STREET ADDRESS  0.983      0.98    83.78%
the impact of the repair, we computed, for each attribute, the percentage of values that are different before and after the repair step. These numbers are shown in the last column of Table IV. Clearly, the repair is beneficial in all of the cases. For OPENING HOURS and LOCATED WITHIN, where recall is very low, the problem is due to the fact that these attributes were often not available on the source pages, thus being impossible to repair. The independent evaluation proved that our repair method can scale to hundreds of thousands of non-synthetic records. On the other hand, the joint repair is bound to the accuracy of the extraction system, i.e., it cannot repair data that has not been extracted.
We have previously shown (Section IV) that an optimal approximation of a joint repair can be computed efficiently. To stress the scalability of our method, we created a synthetic dataset by modifying two different variables: n, the number of records, with an impact mostly on the induction of regular expressions, since it increases the number of examples; and k, the number of attributes, which influences the size of the flow network and the computation of the maximum flow. The synthetic relations are built to produce the worst-case scenario, i.e., each record contains k annotated tokens, each annotation has a different context, and each record produces a different path on the network. This results in a network with n · k + 2 nodes and n · k + n edges. The chart on the left of Figure 3 plots the running time over an increasing number of records (with the number of attributes fixed), while the chart on the right
Fig. 3: Running time.
increases the number of attributes (with the number of records fixed). As expected, the joint repair grows linearly w.r.t. the size of the relation, and polynomially w.r.t. the number of attributes. In the extreme case, the computed network contains 10M nodes and 10.1M edges. The largest network obtained on non-synthetic datasets has 39,148 nodes and 45,797 edges (book), with repairs computed in less than 3 seconds.
Comparative evaluation. We compare our approach against WEIR [3], a wrapper induction and data integration system that can be used to compute a joint repair of a relation w.r.t. a schema. WEIR induces wrappers by generating candidate expressions using simple heuristics and by filtering them using instance-level redundancy across multiple web sources, i.e., it picks, among candidate rules, those that consistently match similar values on different sources. We compare with WEIR as the only other similar system, Turbo Syncer [8], is significantly older, and we were not able to obtain an implementation.
WEIR uses only redundant values for rule selection, resulting in relations with missing values (and records). We compared against WEIR on the original SWDE dataset, the same one used in their evaluation [3] (using RoadRunner as extraction system). We evaluated WEIR and our approach in two separate settings: Figure 4 shows the performance of our approach and WEIR on each domain, computed on redundant records only, while in Figure 5 we also take into account non-redundant ones. A first observation is that redundant records are a small fraction of the whole relation, thus limiting the recall (shown on top of the bars in Figure 4). The results show that, if we limit the evaluation to redundant values only, our approach delivers the same or better performance than WEIR. Interesting cases are auto, restaurant, and university, where our approach outperforms WEIR by more than 10% in F1-Score. In particular, WEIR suffers from false redundancy caused by a lax similarity measure and under-segmented text nodes. The only case where WEIR performs better than our approach is in movie, where the presence of multivalued attributes (such as GENRE) causes the selection of suboptimal max-flow sequences. If we also consider all values, including non-redundant ones, our approach clearly outperforms WEIR in every domain, with a peak of a 36% boost in F1-Score in camera.
In terms of running time, WEIR requires an average of 30 minutes per domain, whereas our approach repairs a domain in less than 2 minutes. This is due to the way WEIR exploits cross-source redundancy, i.e., instances in a source are compared against instances of all other sources. As a consequence, the running time increases with the number of sources. Our approach instead repairs each source in parallel.
Fig. 4: Comparison with WEIR (redundant values). [Bar chart: Precision, Recall, and F-Score of WEIR and Repair per domain (auto, book, camera, job, movie, nba, restaurant, university); the share of redundant records, shown on top of the bars, ranges from 3.2% to 14.9%.]

Fig. 5: Comparison with WEIR (all values). [Bar chart: the same measures computed on all values.]

We also ran a preliminary comparative evaluation with
Google Data Highlighter, a supervised data annotation tool that can be used to produce tabular data from web pages. A discussion is available at [1].
Fig. 6: Impact of individual components. [Chart over the configurations original / only annotator / only regex / only value-based / final; the underlying F1-Scores include: DIADEM (RE) 0.6487 / 0.7048 / 0.8315 / 0.8613 / 0.8735; DIADEM (AUTO) 0.6048 / 0.798 / 0.7249 / 0.8819 / 0.8875; RR (AUTO) 0.5651 / 0.5838 / 0.7148 / 0.8418 / 0.8691; RR (UNIVERSITY) 0.4594 / 0.6253 / 0.7803 / 0.8254 / 0.8335.]

Ablation study. In this experiment we measured the impact of each phase of the joint repair computation on F1-Score for the most relevant scenarios (other scenarios show similar results but are omitted for space reasons). With respect to Figure 6, original (or) refers to the
original (i.e., before repair) quality of the relation, while joint (j) is the post-repair quality. annotator (ann) shows the effect of constructing the repair by directly using annotations, regex (reg) shows the performance when only regexes are induced (i.e., without value-based expressions), and network (net) shows the repair performance when using only value-based expressions computed from max-flow sequences (i.e., no regex induction).
As we can see, the direct use of annotations for repair or regex induction alone delivers poor results. The major contribution to the quality is the use of max-flow sequences, which uncover the underlying structure of the relation and eliminate noisy annotations. Regex induction is still beneficial afterwards to recover misses of the annotators. The most striking case is STREET_ADDRESS in the real estate domain. The attribute is hardly recognised by annotators (accuracy around 50%); however, its structure in the relation is very regular and the after-repair accuracy reaches 85%.
Thresholding. This second experiment measures the effect of the thresholds t_flow and t_regex on performance. Figure 7 shows the variation of F1-Score for the most interesting scenarios (other scenarios report similar results). Setting excessively low thresholds negatively impacts the performance, as it causes a premature induction of regexes that repair only a small number of records. However, there are cases, e.g., DIADEM on real estate, where a lower threshold helps to recover misses of the annotators. In domains where attributes are better structured or the annotator is more accurate, e.g., auto, the best performance is achieved by setting a high threshold. Overall, the variation in performance is limited (3%) and the best average performance is obtained with a 75% threshold.
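As an illustration of the thresholding described above, the choice between an induced regex and the value-based fallback can be sketched as follows. This is a hypothetical helper, not the paper's implementation; the function name, sample values, and default threshold are assumptions (the 0.75 default mirrors the 75% threshold from the experiment).

```python
import re

def choose_repair(values, candidate_regex, t_regex=0.75):
    """Keep the induced regex only if it matches at least t_regex of the values;
    otherwise fall back to a value-based expression (disjunction of values)."""
    matched = sum(1 for v in values if re.search(candidate_regex, v))
    if matched / len(values) >= t_regex:
        return ("regex", candidate_regex)
    # below threshold: repair with a disjunction of the annotated values
    return ("value-based", "|".join(map(re.escape, sorted(set(values)))))

values = ["Runtime: 195 min", "Runtime: 216 min", "140 min"]
kind, expr = choose_repair(values, r"Runtime:")
# only 2 of 3 values match the candidate regex, which is below the 0.75 threshold
```

A lower `t_regex` would accept the regex earlier, which, as noted above, helps when annotators have low recall but risks premature induction.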
WADaR is 23% more accurate than WEIR on average
Evaluation: Robustness
We studied how F1-score varies w.r.t. annotation noise
Fig. 7: Impact of the t_flow and t_regex thresholds.
Effect of annotator accuracy. We gradually decreased the recall of our annotators by randomly eliminating a number of annotations and observed the effect on F1-Score while keeping a fixed regex induction threshold (0.75). To lower the effect of sampling bias, we ran the experiment 30 times with different annotation sets and took the average performance. The accuracy numbers are limited to those attributes where our approach induces regular expressions, since it is already clear that annotator errors directly reduce the accuracy of value-based expressions. This is still a significant number of attributes, i.e., around 65% in all cases except for RoadRunner on book (35%) and RoadRunner on movie (46%). Figure 8 shows
Fig. 8: Annotator recall drop - Fixed threshold
the impact of a drop in recall (x-axis) on F1-Score. As we can see, our approach is robust to a drop in recall until we reach an 80% loss, after which the performance rapidly decays. This is somewhat expected, since the regular expressions compensate for the missing recall up to the point where the max-flow sequences are no longer able to determine the underlying attribute structure reliably.
Figure 9 shows the effect on F1-Score if we instead set a low regex-induction threshold (i.e., 0.1). Clearly, in this case our approach is highly robust to annotator inaccuracy and we notice a loss in performance only after an 80-90% loss in recall. In summary, a lower regex-induction threshold is advisable when we know that annotators have low recall. Even with an annotator of very low accuracy, our approach is robust
Fig. 9: F1-Score variation with a threshold value of 0.1
enough to overcome the errors introduced by the annotator.
VI. RELATED WORK
Computing joint repairs is one of the many maintenance problems faced in web data extraction [5], [22], [25], [27], [28]. However, classical wrapper maintenance has assumed perfect, typically human-created wrappers to begin with, with errors only being introduced over time due to changes in the sources. When covering thousands or hundreds of thousands of sources with automatic or semi-supervised wrapper induction, this assumption is no longer valid.
Closer in spirit to joint repairs are techniques that generate wrappers from background data [3], [8], [17], [36]. These techniques implicitly align background data and wrappers as part of the generation process. The closest works to ours are Turbo Syncer [8] and WEIR [3], which use instance-level redundancy across multiple sources to compute extraction rules on individual sites that, together, can be used to effectively learn wrappers without supervision. An advantageous side-effect of these approaches is the construction of “compatible” relations that can be more easily integrated. Differently from Turbo Syncer and WEIR, our approach assumes the existence of an already generated wrapper to be repaired w.r.t. a target schema. From a practical point of view, both Turbo Syncer and WEIR can be adapted to compute joint repairs, however, as shown in Section V, with significantly worse performance than our approach due to their reliance on redundancy. Our approach also eliminates the need for re-induction of wrappers, leading to better runtime performance.
Redundancy. Instance-level redundancy across web sources has been previously used in different contexts to detect and repair inconsistent data extracted from the web [3], [7], [8], [13]. Redundancy-based approaches face two main obstacles: (i) it is not always possible to leverage sufficient redundancy in every domain (see, e.g., the number of redundant records in SWDE in Figure 4), and (ii) redundancy-based methods require access to a substantial number of sources, which has, so far, limited their scalability (see, e.g., WEIR's running time). Encoding the redundancy by other means, e.g., through entity recognisers and knowledge bases, has proven beneficial to circumvent the scalability problems without sacrificing generality of the approaches [7], [18]. Our approach achieves this via an ensemble of entity recognisers [4], some of which are trained using redundancy-based methods.
Cleaning, segmentation and alignment. Traditional data cleaning methods focus on the detection and repair of database inconsistencies, using, e.g., statistical value distributions [31], [34], constraints [2], [14], [16], [29], and knowledge bases [7]. Differently from our setting, cleaning methods operate on relation(s) that contain incorrect values but are assumed to be correctly segmented.
A more relevant body of work is list segmentation/table induction techniques, targeting the induction of structured records from unstructured (i.e., wrongly segmented) lists of values. These are alternatives to the segmentation based on flow networks used in our approach. Being inspired by tagging problems common in bio-informatics and other areas, these approaches traditionally require some form of supervision. Many require an initial seed of correctly segmented records [10],
Fixed induction threshold 75%
(high dependence on annotation quality)
Fixed induction threshold 10%
(low dependence on annotation quality)
F1-Score starts being affected only when the recall loss reaches ~80%.
WADaR is unaffected by precision loss (random noise) of up to ~300%.
Evaluation: Scalability
Worst-case scenario: all tuples are annotated with all attribute types
WADaR scales linearly w.r.t. the size of the relation and polynomially w.r.t. attributes
Fig. 2: Impact of repair. [Bar charts: Precision, Recall, and F-Score, original vs. repaired, for ViNTs (RE), ViNTs (Auto), DIADEM (RE), DEPTA (RE), DIADEM (Auto), DEPTA (Auto), and RoadRunner on auto, book, camera, job, movie, nba, restaurant, and university.]
to 30% in real estate, with an identical effect in almost all domains.
Attribute-level accuracy. Another question is whether there are substantial differences in attribute-level accuracy. The top of Table III shows attributes where the repair is very effective (F1-Score ≈ 1 after repair). These values appear as highly structured attributes on web pages, and the corresponding expressions repair almost all tuples. As an example, DOOR NUMBER is almost always followed by the suffix dr or door. In these cases, the wrapper induction under-segmented the text due to the lack of sufficient examples.
TABLE III: Attribute-level evaluation.
System | Domain | Attribute | Original F1-Score | Repaired F1-Score
DIADEM | real estate | POSTCODE | 0.304 | 0.947
DIADEM | auto | DOOR NUMBER | 0 | 0.984
DEPTA | real estate | BATHROOM NUMBER | 0.314 | 0.973
DEPTA | auto | MAKE | 0.564 | 0.986
DIADEM | real estate | CITY | 0 | 0.59
DEPTA | real estate | COUNTY | 0 | 0.728
DIADEM | auto | ENGINE TYPE | 0 | 0.225
DEPTA | auto | PRICE | 0.711 | 0.742
For attributes such as CITY and COUNTY, despite a significant boost (59% and 72% respectively) produced by the repair, the final F1-Score is still low. These are irregularly structured attributes, often co-occurring with others, e.g., STATE, POSTCODE, in ways that cannot be easily isolated by regular expressions. Despite not having syntactic regularity, these attributes are semantically related, e.g., COUNTY usually comes after CITY and before POSTCODE, and could be captured by extending f_XPATH^S with NER capabilities [4].
An exceptional case is ENGINE TYPE, where the value Petrol is also recognised as COLOUR. This causes a loss of performance as it creates a systematic error in the annotated relation. Another exception is the case of PRICE in relations generated by DEPTA. DEPTA extracts large chunks of text with multiple prices, among which the annotators cannot distinguish the target price reliably, resulting in worse performance.
Independent evaluation. We performed an extraction of restaurant chain locations in collaboration with a large social network, which provided us with 210 target websites. We used DIADEM as the wrapper induction system and then applied the joint repair on the generated relations. The accuracy has been manually evaluated by third-party rating teams on a sample of nearly 1,000 records of the 276,787 extracted. Table IV shows Precision and Recall computed on the sample (values higher than 0.9 are highlighted in bold). In order to estimate
TABLE IV: Accuracy of large scale evaluation.
Attribute | Precision | Recall | % Modified values
LOCALITY | 0.993 | 0.993 | 11.34%
OPENING HOURS | 1.00 | 0.461 | 17.14%
LOCATED WITHIN | 1.00 | 0.224 | 29.75%
PHONE | 0.987 | 0.849 | 50.74%
POSTCODE | 0.999 | 0.989 | 9.4%
STREET ADDRESS | 0.983 | 0.98 | 83.78%
the impact of the repair, we computed, for each attribute, the percentage of values that are different before and after the repair step. These numbers are shown in the last column of Table IV. Clearly, the repair is beneficial in all of the cases. For OPENING HOURS and LOCATED WITHIN, where recall is very low, the problem is that these attributes were often not available on the source pages, making them impossible to repair. The independent evaluation proved that our repair method can scale to hundreds of thousands of non-synthetic records. On the other hand, the joint repair is bound by the accuracy of the extraction system, i.e., it cannot repair data that has not been extracted.
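The "% modified values" metric used above can be computed as a simple before/after comparison. A minimal sketch, with a hypothetical helper and illustrative data (not the paper's code):

```python
# Share of attribute values that differ between the original and the
# repaired relation, i.e., the "% Modified values" column of Table IV.
def modified_ratio(original, repaired):
    assert len(original) == len(repaired)
    changed = sum(1 for o, r in zip(original, repaired) if o != r)
    return changed / len(original)

before = ["Director: Steven Spielberg Rating: R", "PG", "Not Rated"]
after  = ["Steven Spielberg", "PG", "Not Rated"]
assert modified_ratio(before, after) == 1 / 3  # one of three values was repaired
```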
We have previously shown (Section IV) that an optimal approximation of a joint repair can be computed efficiently. To stress the scalability of our method, we created a synthetic dataset by modifying two variables: n, the number of records, which mostly impacts the induction of regular expressions, since it increases the number of examples; and k, the number of attributes, which influences the size of the flow network and the computation of the maximum flow. The synthetic relations are built to produce the worst-case scenario, i.e., each record contains k annotated tokens, each annotation has a different context, and each record produces a different path in the network. This results in a network with n·k + 2 nodes and n·k + n edges. The chart on the left of Figure 3 plots the running time over an increasing number of records (with the number of attributes fixed), while the chart on the right
Fig. 3: Running time.
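The stated network size can be reproduced with a small sketch. The exact topology (one node per annotated token, one START-to-SINK path per record) is an assumption consistent with the worst-case description above, not the paper's generator:

```python
# Worst-case synthetic flow network: n records, k annotated tokens each,
# every record contributing its own START -> tok_1 -> ... -> tok_k -> SINK path.
def worst_case_network(n, k):
    nodes = {"START", "SINK"}
    edges = set()
    for rec in range(n):
        path = ["START"] + [f"tok_{rec}_{a}" for a in range(k)] + ["SINK"]
        nodes.update(path[1:-1])          # k fresh token nodes per record
        edges.update(zip(path, path[1:])) # k + 1 fresh edges per record
    return nodes, edges

nodes, edges = worst_case_network(n=1000, k=10)
assert len(nodes) == 1000 * 10 + 2     # n*k + 2 nodes
assert len(edges) == 1000 * 10 + 1000  # n*k + n edges
```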
Oracles decouple the problem of finding similar instances from the segmentation
£19k Audi A3 Sportback
£43k Audi A6 Allroad quattro
£10k Citroën C3
£22k Ford C-max Titanium X
Ω = {ωMAKE, ωMODEL, ωPRICE}
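The oracle set Ω above can be illustrated with toy stand-ins, one per attribute domain. Real oracles would be (ensembles of) NERs; the regexes and the tiny gazetteer here are illustrative assumptions only:

```python
import re

# Each oracle decides membership in one attribute domain,
# mirroring Ω = {ω_MAKE, ω_MODEL, ω_PRICE} on the listings above.
oracles = {
    "PRICE": lambda v: re.fullmatch(r"£\d+k", v) is not None,
    "MAKE":  lambda v: v in {"Audi", "Citroën", "Ford"},  # tiny gazetteer
    "MODEL": lambda v: re.fullmatch(r"[A-Z][\w\- ]+", v) is not None,
}

assert oracles["PRICE"]("£19k")
assert oracles["MAKE"]("Citroën")
assert not oracles["PRICE"]("Audi A3 Sportback")  # a misplaced value is rejected
```

Because each oracle only answers a membership question, finding which token span belongs to which attribute remains a separate segmentation problem, which is the decoupling the line above refers to.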
Open issues
Learning oracles
Building oracles is not difficult but still requires engineering time.
The IBM SystemT people did some good work in this direction. We can start there.
Missing attributes
Right now, if the wrapper fails to recover data, then we cannot repair it.
It is possible to manipulate the wrapper to match more content.
Markov Chains vs Max flows on wrapped relations
They seem to eventually compute the same sequences but in different order… proof?
What I know is that max-flows best approximate the maximum fitness at every step.
Questions?
References:
L. Chen, S. Ortona, G. Orsi, M. Benedikt. Aggregating Semantic Annotators. PVLDB ’13.
S. Ortona, G. Orsi, M. Buoncristiano, T. Furche. Joint Repairs for Web Wrappers. ICDE ’16.
S. Ortona, G. Orsi, M. Buoncristiano, T. Furche. WADaR: Joint Wrapper and Data Repair. VLDB ’15 (Demo).
T. Furche, G. Gottlob, G. Grasso, X. Guo, G. Orsi, C. Schallhart, C. Wang. DIADEM: Thousands of websites to a single database. PVLDB ’15.
Title | Director | Rating | Runtime
Schindler’s List | Steven Spielberg | R | 195 min
Web Data Extraction
RoadRunner
DEPTA
Attribute_1 | Attribute_2
Schindler’s List | Director: Steven Spielberg Rating: R Runtime: 195 min
Lawrence of Arabia (re-release) | Director: David Lean Rating: PG Runtime: 216 min
Le cercle Rouge (re-release) | Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min
Joint Data and Wrapper Repair
Attribute_1 | Attribute_2
Schindler’s List | Director: Steven Spielberg Rating: R Runtime: 195 min

Title | Director | Rating | Runtime
Schindler’s List | Steven Spielberg | R | 195 min
Maximal Repair is NP-complete

Attribute
Director: Steven Spielberg Rating: R Runtime: 195 min

Director | Rating | Runtime
Steven Spielberg | R | 195 min

Candidate repairs: φ1, φ2, φ3, φ4
OBSERVATIONS
Templated Websites: Data is published following a template.
Wrapper Behaviour: Wrappers rarely misplace and over-segment at the same time. Wrappers make systematic errors.
Oracles: Oracles can be implemented as (ensembles of) NERs. NERs are not perfect, i.e., they make mistakes.
Joint Wrapper And Data Repair
When values are both misplaced and over-segmented, computing repairs of maximal fitness is hard; otherwise, just do the following:
(1) Compute all possible non-crossing k-partitions (k = |R|) of the tokens, i.e., assign to each attribute an element of the partition (their number is given by the Narayana number N(n, k)).
(2) Discard tokens never accepted by oracles in any of the partitions.
(3) Collapse identical partitions and choose the one with maximal fitness.
Without misplacement and over-segmentation, a solution can be found in polynomial time by computing non-crossing k-partitions.
NP-hardness: reduction from Weighted Set Packing. Membership in NP: guess a partition, decide non-crossingness and compute fitness in PTIME.
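The count in step (1) can be sanity-checked by brute force. The standard Narayana formula is N(n, k) = C(n, k)·C(n, k−1)/n; `partitions_into_k` and the crossing test below are illustrative helpers, not the paper's algorithm:

```python
# Brute-force check that non-crossing partitions of n tokens into k blocks
# are counted by the Narayana number N(n, k).
from itertools import combinations
from math import comb

def partitions_into_k(elems, k):
    """All set partitions of elems into exactly k non-empty blocks."""
    if k == 1:
        yield [list(elems)]
        return
    first, rest = elems[0], list(elems[1:])
    for size in range(0, len(rest) - k + 2):   # size of first's block minus 1
        for others in combinations(rest, size):
            remaining = [e for e in rest if e not in others]
            for sub in partitions_into_k(remaining, k - 1):
                yield [[first, *others]] + sub

def is_noncrossing(partition):
    # a crossing is a < b < c < d with a, c in one block and b, d in another
    for i, X in enumerate(partition):
        for Y in partition[i + 1:]:
            for a in X:
                for c in X:
                    for b in Y:
                        for d in Y:
                            if a < b < c < d or b < a < d < c:
                                return False
    return True

n, k = 6, 3
count = sum(1 for p in partitions_into_k(list(range(n)), k) if is_noncrossing(p))
narayana = comb(n, k) * comb(n, k - 1) // n
assert count == narayana == 50   # vs. 90 unrestricted partitions of 6 into 3 blocks
```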
Stefano Ortona ([email protected]), University of Oxford, UK
Giorgio Orsi ([email protected]), University of Oxford, UK
Marcello Buoncristiano ([email protected]), Università della Basilicata, Italy
Tim Furche ([email protected]), University of Oxford, UK
http://diadem.cs.ox.ac.uk/wadar
Web data extraction (aka scraping/wrapping) uses wrappers to turn web pages into structured data.
Wrapper: structure { ⟨R,!R⟩ { ⟨A1,!A1⟩,…,⟨Am,!Am⟩ } } specifying objects to be extracted (listings, records, attributes) and corresponding XPath expressions.
Wrappers are often created algorithmically and in large numbers. Tools capable of maintaining them over time are missing.
⟨RATING, //li[@class=‘second’]/p⟩
⟨RUNTIME, //li[@class=‘third’]/ul/li[1]⟩
Algorithmically-created wrappers generate data that is far from perfect. Data can be badly segmented and misplaced.
⟨TITLE, ⟨1⟩, string($)⟩
⟨DIRECTOR, ⟨2⟩, substring-before(substring-after($, tor:_), _Rat)⟩
⟨RATING, ⟨2⟩, substring-before(substring-after($, ing:_), _Run)⟩
⟨RUNTIME, ⟨2⟩, substring-after($, time:_)⟩
Take a set Ω of oracles, where each ωA in Ω can say whether a value vA belongs to the domain of A. We define the fitness of a relation R w.r.t. Ω as:
Repair: specifies regular expressions that, when applied on the original relation, produce a new relation with higher fitness.
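The fitness formula itself did not survive extraction. A plausible simplified reading (an assumption, not the paper's exact definition) is the fraction of cell values accepted by the oracle of their attribute, which is enough to see why a repair raises fitness:

```python
# Simplified fitness of a relation w.r.t. a set of oracles: the fraction of
# values v_A that the oracle ω_A accepts as members of A's domain.
def fitness(relation, oracles):
    cells = [(attr, value) for row in relation for attr, value in row.items()]
    return sum(oracles[attr](value) for attr, value in cells) / len(cells)

oracles = {
    "DIRECTOR": lambda v: v in {"Steven Spielberg", "David Lean"},
    "RATING":   lambda v: v in {"R", "PG", "Not Rated"},
}
before = [{"DIRECTOR": "Director: Steven Spielberg", "RATING": "Rating: R"}]
after  = [{"DIRECTOR": "Steven Spielberg",           "RATING": "R"}]
assert fitness(before, oracles) < fitness(after, oracles)  # the repair raises fitness
```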
<Director: Steven>
<195 min>
<Director:><Steven Spielberg>
<Rating: R Runtime:195>
<Runtime: k195 min>
<min Director: Steven Spielberg>
<Rating: Runtime: 195>
<Director: Steven Spielberg>
<Rating: R>
<R>
<Spielberg Rating: R Runtime:>
WADaR:
⟨DIRECTOR, //li[@class=‘first’]/div/span⟩
APPROXIMATING JOINT REPAIRS
Annotation (1): Each record is interpreted as a string (a concatenation of attributes), which NERs analyse to identify relevant attributes.
Entity recognisers make mistakes; WADaR tolerates incorrect and missing annotations.
Attribute_1 | Attribute_2
Schindler’s List | Director: Steven Spielberg Rating: R Runtime: 195 min
Lawrence of Arabia (re-release) | Director: David Lean Rating: PG Runtime: 216 min
Le cercle Rouge (re-release) | Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min
The life of Jack Tarantino (coming soon) | Director: David R Lynch Rating: Not Rated Runtime: 123 min
Segmentation (2)
Goal: understand the underlying structure of the relation. Two possible ways of encoding the problem:
1. Max Flow Sequence in a Flow Network
2. Most Likely Sequence in a Memoryless Markov Chain
[Figure: a flow network and a Markov chain over the states START, TITLE, DIRECTOR, RATING, RUNTIME, SINK; most likely sequence: DIRECTOR RATING RUNTIME.]
Solutions often coincide. Markov Chains: intuitive and faster to compute. Max Flows: provably optimal.
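A minimal sketch of the Markov-chain encoding follows. The transition-estimation details and the exhaustive dynamic program are assumptions for illustration (real relations would need a Viterbi-style DP over states rather than paths), not the poster's exact procedure:

```python
# Estimate transition probabilities from per-record annotation sequences,
# then pick the most likely START -> ... -> SINK attribute sequence.
from collections import Counter
from math import log

def most_likely_sequence(annotated_records):
    trans, out = Counter(), Counter()
    for rec in annotated_records:
        path = ["START"] + rec + ["SINK"]
        trans.update(zip(path, path[1:]))
    for (a, _), c in trans.items():
        out[a] += c

    def logp(a, b):  # log transition probability, -inf if never observed
        return log(trans[(a, b)] / out[a]) if trans[(a, b)] else float("-inf")

    m = max(len(r) for r in annotated_records)
    states = {s for r in annotated_records for s in r}
    best = {("START",): 0.0}
    for _ in range(m):  # extend every partial path by one state
        best = {path + (s,): score + logp(path[-1], s)
                for path, score in best.items() for s in states}
    final = max(best, key=lambda p: best[p] + logp(p[-1], "SINK"))
    return list(final[1:])

records = [
    ["DIRECTOR", "RATING", "RUNTIME"],
    ["DIRECTOR", "RATING", "RUNTIME"],
    ["DIRECTOR", "RUNTIME", "RATING"],  # one noisy annotation order
]
print(most_likely_sequence(records))  # the majority order wins
```

The majority order DIRECTOR, RATING, RUNTIME gets probability (2/3)^3 against (1/3)^3 for the noisy order, which is the sense in which the chain eliminates noisy annotations.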
Induction (3)
Example titles: Schindler’s List; Lawrence of Arabia (re-release); Le cercle Rouge (re-release)
Example attribute strings: Director: Steven Spielberg Rating: R Runtime: 195 min; Director: David Lean Rating: PG Runtime: 216 min; Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min; Director: David R Lynch Rating: Not Rated Runtime: 123 min
Induced expressions:
SUFFIX = substring-before(“_(“)
PREFIX = substring-after(“tor:_“); SUFFIX = substring-before(“_Rat“)
PREFIX = substring(string-length()-7)
Input: a set of clean annotations to be used as positive examples.
WADaR induces regular expressions by looking, e.g., at common prefixes, suffixes, non-content strings, and character length.
Induced expressions improve recall:
  token value1 token token token token
  token token value2 token token token
  token token token value3 token token
When WADaR cannot induce regular expressions (not enough regularity), the data is repaired directly with the annotators. Wrappers are instead repaired with value-based expressions, i.e., a disjunction of the annotated values:
  ATTRIBUTE = string-contains("value1" | "value2" | "value3")
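A simplified sketch of the induction step (not the actual algorithm): look for a common prefix and suffix around the clean annotations, and fall back to a value-based disjunction when no regularity is found:

```python
import re
from os.path import commonprefix

def induce(records, values):
    """Induce a regex from clean (record, value) examples by looking at
    common prefixes and suffixes; a simplified sketch of WADaR's induction."""
    pres, posts = [], []
    for rec, val in zip(records, values):
        i = rec.find(val)
        if i < 0:
            continue
        pres.append(rec[:i])
        posts.append(rec[i + len(val):])
    # Shared text immediately before / after the values.
    lead = commonprefix([p[::-1] for p in pres])[::-1]
    trail = commonprefix(posts)
    if lead.strip() or trail.strip():
        return re.compile(re.escape(lead) + r"(.+?)" + re.escape(trail))
    # Not enough regularity: fall back to a value-based expression.
    return re.compile("|".join(re.escape(v) for v in values))

records = [
    "Director: Steven Spielberg Rating: R Runtime: 195 min",
    "Director: David Lean Rating: PG Runtime: 216 min",
]
rx = induce(records, ["Steven Spielberg", "David Lean"])

# The induced expression generalises to unseen records.
new_record = "Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min"
print(rx.search(new_record).group(1))
```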
Empirical Evaluation
[Figure 2: Precision, Recall, and F1-Score before (Original) and after (Repaired) the repair, for each wrapper-generation system and domain: ViNTs (RE), ViNTs (Auto), DIADEM (RE), DEPTA (RE), DIADEM (Auto), DEPTA (Auto), and RoadRunner on Auto, Book, Camera, Job, Movie, Nba, Restaurant, and University.]
5.1 Setting
Datasets. The dataset consists of 100 websites from 10 domains and is an enhanced version of SWDE [20], a benchmark commonly used in web data extraction. SWDE's data is sourced from 80 sites and 8 domains: auto, book, camera, job, movie, NBA player, restaurant, and university. For each website, SWDE provides collections of 400 to 2k detail pages (i.e., where each page corresponds to a single record). We complemented SWDE with collections of listing pages (i.e., pages with multiple records) from 20 websites of the real estate (RE) and auto domains. Table 1 summarises the characteristics of the dataset.

Table 1: Dataset characteristics.

  Domain       Type     Sites  Pages    Records  Attributes
  Real Estate  listing  10     271      3,286    15
  Auto         listing  10     153      1,749    27
  Auto         detail   10     17,923   17,923   4
  Book         detail   10     20,000   20,000   5
  Camera       detail   10     5,258    5,258    3
  Job          detail   10     20,000   20,000   4
  Movie        detail   10     20,000   20,000   4
  Nba Player   detail   10     4,405    4,405    4
  Restaurant   detail   10     20,000   20,000   4
  University   detail   10     16,705   16,705   4
  Total        -        100    124,715  129,326  78

SWDE comes with ground-truth data created under the assumption that wrapper-generation systems could only generate extraction rules with DOM-element granularity, i.e., without segmenting text nodes. Modern wrapper-generation systems support text-node segmentation and we therefore refined the ground-truth accordingly. As an example, in the camera domain, the original ground-truth values for MODEL consisted of the entire product title; besides the model, the text node includes COLOR, PIXELS, and MANUFACTURER. The ground-truth for the real estate and auto domains has been created following the SWDE format. The final dataset consists of more than 120k pages, for almost 130k records containing more than 500k attribute values.
Wrapper-generation systems. We generated input relations for our evaluation using four wrapper-generation systems: DIADEM [19], DEPTA [36], and ViNTs [39] for listing pages, and RoadRunner [12] for detail pages.1 The output of DIADEM, DEPTA, and RoadRunner can be readily used in the evaluation since these are full-fledged data extraction systems, supporting the segmentation of both records and attributes within listings or (sets of) detail pages. ViNTs, on the other hand, segments rows into records within a search result listing and, as such, has no concept of attribute. Instead, it segments rows within a record. We therefore post-processed its output, typing the content of lines from different records that are likely to have the same semantics. We used a naïve similarity heuristic based on the relative position in the record and the string-edit distance of the row's content. This is a very simple version of the more advanced alignment methods based on instance-level redundancy used by, e.g., WEIR and TEGRA [7].
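The naïve similarity heuristic used to post-process ViNTs can be sketched as follows (the thresholds are illustrative assumptions, not the values used in the evaluation):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # dp[j]: previous row, dp[j-1]: current row, prev: diagonal.
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def same_type(line_a, pos_a, line_b, pos_b, max_pos_gap=1, max_rel_dist=0.5):
    """Two lines from different records get the same type when their
    relative positions in the record are close and their contents are
    similar (thresholds here are illustrative assumptions)."""
    if abs(pos_a - pos_b) > max_pos_gap:
        return False
    d = edit_distance(line_a, line_b)
    return d / max(len(line_a), len(line_b), 1) <= max_rel_dist

print(same_type("Rating: R", 2, "Rating: PG", 2))
```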
Metrics. The performance of the repair is evaluated by comparing wrapper-generated relations against the SWDE ground truth before and after the repair. The metrics used for the evaluation are Precision, Recall, and F1-Score, computed at attribute level. Both the ground truth and the extracted values are normalised, and an exact match between the extracted values and the ground truth is required for a hit. For space reasons, in this paper we only present the most relevant results. The results of the full evaluation, together with the dataset, the gold standard, the extracted relations, and the code of the normaliser and of the scorer, are available in the online appendix [1].

1 RoadRunner can be configured for listings but it performs better on detail pages.
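The scoring just described can be sketched as follows (with a toy normaliser; the actual normaliser and scorer are in the online appendix [1]):

```python
def normalise(v):
    """Toy normalisation: lowercase and collapse whitespace
    (an illustrative stand-in for the paper's normaliser)."""
    return " ".join(v.lower().split())

def prf1(extracted, gold):
    """Attribute-level Precision/Recall/F1 with exact matching of
    normalised values, as described in the Metrics paragraph."""
    ext = [normalise(v) for v in extracted]
    gld = [normalise(v) for v in gold]
    hits = sum(1 for v in ext if v in gld)  # multiset nuances ignored in this sketch
    p = hits / len(ext) if ext else 0.0
    r = hits / len(gld) if gld else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(prf1(["Steven  Spielberg", "David Lean", "Wrong Value"],
           ["steven spielberg", "david lean", "jean-pierre melville"]))
```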
All experiments are run on a desktop with an Intel quad-core i7 at 3.40GHz, 16 GB RAM, and Linux Mint 17.
5.2 Repair performance
Relation-level Accuracy. The first two questions we want to answer are whether joint repairs are necessary and what their impact is in terms of quality. Table 2 reports, for each system, the percentage of: (i) Correctly extracted values. (ii) Under-segmentations, i.e., when values for an attribute are extracted together with values of other attributes or spurious content. Indeed, websites often publish multiple attribute values within the same text node and the involved extraction systems are not able to split the values into multiple attributes. (iii) Over-segmentations, i.e., when attribute values are split over multiple fields. As anticipated in Section 2, this rarely happens since an attribute value is often contained in a single text node. In this setting, an attribute value can be over-segmented only if the extraction system is capable of splitting single text nodes (DIADEM), and even then the splitting happens only when the system can identify a strong regularity within the text node. (iv) Misplacements, i.e., when values are placed or labelled as the wrong attribute. This is mostly due to a lack of semantic knowledge and to confusion introduced by overlapping attribute domains. (v) Missing values, due to a lack of regularity and optionality in the web source (RoadRunner, DEPTA, ViNTs) or to missing values in the domain knowledge (DIADEM). Note that the numbers do not add up to 100% since errors may fall into multiple categories.

Table 2: Wrapper-generation system errors.

  System      Correct (%)  Under-Segmented (%)  Over-Segmented (%)  Misplaced (%)  Missing (%)
  DIADEM      60.9         34.6                 0                   23.2           3.5
  DEPTA       49.7         44                   0                   25.3           6
  ViNTs       23.9         60.8                 0                   36.4           15.2
  RoadRunner  46.3         42.8                 0                   18.6           10.4

These numbers clearly show that there is a quality problem in wrapper-generated relations, and they also support the atomic misplacement assumption.
Figure 2 shows, for each system and each domain, the impact of the joint repair on our metrics. Light- (resp. dark-)colored bars denote the quality of the relation before (resp. after) the repair.

A first conclusion that can be drawn is that a repair is always beneficial. Of the 697 extracted attributes, 588 (84.4%) require some form of repair, and the average pre-repair F1-Score produced by the systems is 50%. WADaR is able to induce a correct regular expression for 335 (57%) attributes, while for the remaining 253 (43%) it produces value-based expressions. We can repair at least one attribute in each of the wrappers in all of the cases, and we can repair more than 75% of the attributes in more than 80% of the cases.
Among the considered systems, DIADEM delivers, on average, the highest pre-repair F1-Score (~60%), but it never exceeds 65%. RoadRunner is on average worse than DIADEM, but it reaches a better 70% F1-Score on restaurant. Websites in this domain are in fact highly structured, and individual attribute values are contained in a dedicated text node. When attributes are less structured, e.g., on book, camera, and movie, RoadRunner has a significant drop in performance. As expected, ViNTs delivers the worst pre-cleaning results.

In terms of accuracy, our approach delivers a boost in F1-Score between 15% and 60%. Performance is consistently close to or above 80% across domains and, except for ViNTs, across systems, with a peak of 91% for RoadRunner on NBA player.
The following are the remaining causes of errors: (i) Missing values cannot be repaired, as we can only use the data available in
[Figure: WADaR vs. WEIR — Precision, Recall, and F-Score per domain (Auto, Book, Camera, Job, Movie, Nba, Restaurant, University).]
Evaluation
100 websites, 10 domains, 4 wrapper-generation systems.
Precision, Recall, and F1-Score computed before and after repair.
WADaR boosts F1-Score between 15% and 60%. Performance is consistently close to or above 80%.
Metrics are computed considering exact matches.
WADaR against WEIR.
WADaR is highly robust to errors of the NERs.
WADaR scales linearly with the size of the input relation. Optimal joint-repair approximations are computed in polynomial time.
Optimality: WADaR provably produces relations of maximum fitness, provided that the number of correctly annotated tuples is more than the maximum error rate of the annotators.
More questions? Come to the poster later!!!
T. Furche, G. Gottlob, L. Libkin, G. Orsi, N. Paton. Data Wrangling for Big Data. EDBT ’16