ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites Valter Crescenzi Giansalvatore Mecca Paolo Merialdo VLDB 2001


Page 1: ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo

VLDB 2001

Page 2: Overview

Automatically generates a wrapper from large structured Web pages
Supports nested structures
Efficient approach to large, complex pages with regular structures

Page 3: Approach

Given a set of example pages
Generate a Union-Free Regular Expression (UFRE)
Find the least upper bound on the RE lattice to generate a wrapper
Reduces to finding the least upper bound of two UFREs
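To make the notion of a union-free regular expression concrete, here is a minimal sketch of one possible in-memory representation. The class names (`Tag`, `PCData`, `Opt`, `Plus`) and the `render` helper are assumptions made for illustration; they are not taken from the slides. The point is that the node types cover literal tokens, data fields, optionals and iterators, and that no alternation (union) node exists.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Tag:
    text: str  # a literal token such as "<li>" or fixed template text


@dataclass(frozen=True)
class PCData:
    pass  # a data field, written #PCDATA


@dataclass(frozen=True)
class Opt:
    body: Tuple["Node", ...]  # (body)?


@dataclass(frozen=True)
class Plus:
    body: Tuple["Node", ...]  # (body)+


# Deliberately no "Union"/alternation node: the expression is union-free.
Node = Union[Tag, PCData, Opt, Plus]


def render(nodes):
    """Render a sequence of nodes in the notation used on the slides."""
    parts = []
    for n in nodes:
        if isinstance(n, Tag):
            parts.append(n.text)
        elif isinstance(n, PCData):
            parts.append("#PCDATA")
        elif isinstance(n, Opt):
            parts.append("(" + render(n.body) + ")?")
        else:  # Plus
            parts.append("(" + render(n.body) + ")+")
    return "".join(parts)


# Hypothetical wrapper: a list with one or more items, each holding a field.
wrapper = (Tag("<ul>"), Plus((Tag("<li>"), PCData(), Tag("</li>"))), Tag("</ul>"))
print(render(wrapper))  # <ul>(<li>#PCDATA</li>)+</ul>
```

Nesting falls out of the representation: a `Plus` or `Opt` body may itself contain `Plus` or `Opt` nodes.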

Page 4: Matching/Mismatching

Start with the first page and create an RE that defines the wrapper
Match each successive sample against the wrapper
Mismatches result in generalizations of the regular expression
Types of mismatches
  – String mismatches
  – Tag mismatches
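The matching step can be sketched as a token-by-token comparison that stops at the first difference and reports its type. This is a simplified illustration, not the authors' implementation: the tokenizer regex, the function names and the two sample pages (apart from the name "John Smith", which appears later in the slides) are assumptions.

```python
import re

# A token is either a whole tag or a run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split HTML into tag tokens and stripped, non-empty text tokens."""
    tokens = []
    for m in TOKEN.finditer(html):
        t = m.group().strip()
        if t:
            tokens.append(t)
    return tokens


def is_tag(token):
    return token.startswith("<")


def first_mismatch(wrapper, sample):
    """Return (index, kind) for the first differing token, or None."""
    for i, (w, s) in enumerate(zip(wrapper, sample)):
        if w == s or (w == "#PCDATA" and not is_tag(s)):
            continue
        # Two differing text tokens: string mismatch; otherwise a tag is involved.
        kind = "string" if not is_tag(w) and not is_tag(s) else "tag"
        return i, kind
    if len(wrapper) != len(sample):
        # One side ran out of tokens first; treated here as a tag mismatch.
        return min(len(wrapper), len(sample)), "tag"
    return None


# Hypothetical pages; the first page serves as the initial wrapper.
page1 = "<html><b>John Smith</b><ul><li>DB Primer</li></ul></html>"
page2 = "<html><b>Paul Jones</b><img src='x.gif'/><ul><li>XML at Work</li></ul></html>"
print(first_mismatch(tokenize(page1), tokenize(page2)))  # (2, 'string')
```

A real matcher would resolve the mismatch by generalizing the wrapper and then resume matching; this sketch only locates and classifies it.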

Page 5: Example Pages

Page 6: Example

String mismatches are used to discover fields of the documents
Wrapper is generated by replacing “John Smith” with #PCDATA
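A minimal sketch of this generalization, assuming the wrapper and the sample are already tokenized and share the same tag structure, so only string mismatches occur. The function name and the second person's name are hypothetical.

```python
def generalize_strings(wrapper, sample):
    """Replace differing text tokens with #PCDATA.

    Assumes both token lists have the same tag structure; any tag
    difference is reported instead of being resolved here.
    """
    if len(wrapper) != len(sample):
        raise ValueError("tag structure differs; not a pure string mismatch")
    result = []
    for w, s in zip(wrapper, sample):
        if w.startswith("<") or s.startswith("<"):
            if w != s:
                raise ValueError("tag mismatch: %r vs %r" % (w, s))
            result.append(w)
        elif w == s:
            result.append(w)  # identical text stays as template text
        else:
            result.append("#PCDATA")  # differing text becomes a field
    return result


wrapper = ["<b>", "Name:", "</b>", "John Smith", "<br/>"]
sample = ["<b>", "Name:", "</b>", "Paul Jones", "<br/>"]
print(generalize_strings(wrapper, sample))
# ['<b>', 'Name:', '</b>', '#PCDATA', '<br/>']
```

Text that is identical on both pages ("Name:" here) is kept as part of the template, which is how labels are told apart from data.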

Page 7: Example (Cont.)

Tag Mismatches: Discovering Optionals
First check to see if mismatch is caused by an iterator
If not, could be an optional field in wrapper or sample
Cross search used to determine possible optionals
Image field determined to be optional
  – (<img src=…/>)?
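The cross search can be sketched as follows: at a tag mismatch, look for the wrapper's current token further along in the sample, and for the sample's current token further along in the wrapper; whichever side has to skip tokens holds the candidate optional. This is a simplified illustration under stated assumptions: the function names, the flat token-list wrapper and the image file name are hypothetical, and the iterator check that the slide says comes first is omitted.

```python
def find_optional(wrapper, sample, i, j):
    """At a tag mismatch wrapper[i] != sample[j], look for an optional span.

    Returns ("sample", span) if the sample holds extra tokens the wrapper
    lacks, ("wrapper", span) for the opposite case, or None.
    """
    # Case 1: the wrapper's token reappears later in the sample, so the
    # sample tokens skipped on the way there are a candidate optional.
    if wrapper[i] in sample[j + 1:]:
        k = sample.index(wrapper[i], j + 1)
        return "sample", sample[j:k]
    # Case 2: the sample's token reappears later in the wrapper.
    if sample[j] in wrapper[i + 1:]:
        k = wrapper.index(sample[j], i + 1)
        return "wrapper", wrapper[i:k]
    return None


def add_optional(wrapper, sample, i, j):
    """Return a new wrapper with the optional span marked as (...)?."""
    found = find_optional(wrapper, sample, i, j)
    if found is None:
        return wrapper
    side, span = found
    opt = "(" + "".join(span) + ")?"
    if side == "sample":
        return wrapper[:i] + [opt] + wrapper[i:]
    return wrapper[:i] + [opt] + wrapper[i + len(span):]


wrapper = ["<b>", "#PCDATA", "</b>", "<ul>", "<li>"]
sample = ["<b>", "Paul Jones", "</b>", "<img src='logo.gif'/>", "<ul>", "<li>"]
print(add_optional(wrapper, sample, 3, 3))
# ['<b>', '#PCDATA', '</b>', "(<img src='logo.gif'/>)?", '<ul>', '<li>']
```

When both searches succeed, a real system has to choose between the candidates; this sketch simply prefers the sample side.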

Page 9: Example (Cont.)

Tag Mismatches: Discovering Iterators
Assume mismatch is caused by repeated elements in a list
Match possible squares against earlier squares
Generalize the wrapper by finding all contiguous repeated occurrences
  – (<li><i>Title:</i>#PCDATA</li>)+
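The square search can be sketched as follows. The token just before the mismatch is taken as the terminal tag of a repeated block (a "square"); the candidate square runs from the mismatch to the next occurrence of that tag and is compared with the block immediately before it. This is a simplified illustration that only handles a square on the sample side and a flat token-list wrapper; the function names and the book titles are assumptions.

```python
def matches(pattern, tokens):
    """Token-wise match where #PCDATA accepts any non-tag token."""
    return len(pattern) == len(tokens) and all(
        p == t or (p == "#PCDATA" and not t.startswith("<"))
        for p, t in zip(pattern, tokens)
    )


def find_square(wrapper, sample, i, j):
    """At a tag mismatch wrapper[i] != sample[j], look for a repeated block."""
    terminal = sample[j - 1]  # last token of the previous occurrence
    if terminal not in sample[j:]:
        return None
    end = sample.index(terminal, j) + 1
    size = end - j
    if i - size < 0:
        return None
    earlier = wrapper[i - size:i]  # the block just before the mismatch
    if not matches(earlier, sample[j:end]):
        return None
    return earlier


def collapse(wrapper, i, square):
    """Replace contiguous copies of `square` ending at i with (square)+."""
    size = len(square)
    start = i - size
    while start - size >= 0 and wrapper[start - size:start] == square:
        start -= size
    plus = "(" + "".join(square) + ")+"
    return wrapper[:start] + [plus] + wrapper[i:]


wrapper = ["<ul>", "<li>", "<i>", "Title:", "</i>", "#PCDATA", "</li>", "</ul>"]
sample = ["<ul>",
          "<li>", "<i>", "Title:", "</i>", "XML at Work", "</li>",
          "<li>", "<i>", "Title:", "</i>", "HTML Scripts", "</li>",
          "</ul>"]
square = find_square(wrapper, sample, 7, 7)  # "</ul>" vs "<li>"
print(collapse(wrapper, 7, square))
# ['<ul>', '(<li><i>Title:</i>#PCDATA</li>)+', '</ul>']
```

Matching the candidate against the earlier block is itself a wrapper-versus-sample match, which is where nested iterators and optionals would be discovered recursively; the sketch uses a flat comparison instead.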

Page 10: Extracted Result

Page 11: Recursive Example

Page 12: Complexity

Page 13: Discussion

Assumptions
  – Pages are well-structured
  – Want to extract at the level of entire fields
  – Structure can be modeled without disjunctions
Search space for explaining mismatches is huge
  – Uses a number of heuristics to prune space
      Limited backtracking
      Limit on number of choices to explore
      Patterns cannot be delimited by optionals
  – Will result in pruning possible wrappers

Page 14: Experimental Result

Page 15: Comparison with Other Works

Page 16:

Columns: Name; Structure (Semi, Free); Single-slot; Multi-slot; Missing items; Permutations; Nested data; Resilient

WIEN: X X X
SoftMealy: X X X X X X*
STALKER: X X X * X X X
RAPIER: X X ? X X X ?
SRV: X X ? X X X ?
WHISK: X X X X X X X* ?
AutoSlog: X X X X
ROADRUNNER: X X X X X
BYU Onto: X X ? X X X X X X

X means the information extraction system has the capability; X* means the system has the capability as long as the training corpus can accommodate the required training data; ? means the system has the capability to some degree; * means that the extraction pattern itself does not show the capability, but the overall system has it.