ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites. Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo. VLDB 2001.

ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo

VLDB 2001

Page 2: ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Overview

Automatically generates a wrapper from large structured Web pages
Supports nested structures
Efficient approach to large, complex pages with regular structures

Page 3: ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Approach

Given a set of example pages
Generate a Union-free Regular Expression (UFRE)
Find the least upper bound on the RE lattice to generate a wrapper
Reduces to finding the least upper bound of two UFREs
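To make the notion of a union-free regular expression concrete, here is a minimal Python sketch, assuming pages are already tokenized into tags and text strings. The names `Opt`, `Plus`, `matches`, and the sample wrapper are hypothetical and chosen for illustration; they are not taken from the ROADRUNNER implementation. The expression uses only sequence, optional, iterator, and #PCDATA, with no alternation.

```python
from dataclasses import dataclass

PCDATA = "#PCDATA"  # placeholder that matches any text (non-tag) token


@dataclass
class Opt:
    """Optional sub-expression, written (body)?"""
    body: list


@dataclass
class Plus:
    """Iterated sub-expression, written (body)+"""
    body: list


def _ends(expr, tokens, pos):
    """Yield each position where expr can stop when matching starts at pos."""
    if not expr:
        yield pos
        return
    head, rest = expr[0], expr[1:]
    if isinstance(head, Opt):
        # Either skip the optional part or consume it once.
        starts = {pos} | set(_ends(head.body, tokens, pos))
    elif isinstance(head, Plus):
        # Consume the body one or more times.
        starts, frontier = set(), set(_ends(head.body, tokens, pos))
        while frontier:
            starts |= frontier
            frontier = {e for p in frontier
                        for e in _ends(head.body, tokens, p)} - starts
    elif pos < len(tokens) and (
        tokens[pos] == head
        or (head == PCDATA and not tokens[pos].startswith("<"))
    ):
        starts = {pos + 1}
    else:
        starts = set()
    for p in starts:
        yield from _ends(rest, tokens, p)


def matches(expr, tokens):
    """True if the whole token list is accepted by the union-free expression."""
    return any(end == len(tokens) for end in _ends(expr, tokens, 0))


# Hypothetical wrapper: a text field, an optional image, and a non-empty list.
wrapper = ["<html>", PCDATA, Opt(["<img/>"]), "<ul>",
           Plus(["<li>", PCDATA, "</li>"]), "</ul>", "</html>"]
```

Because there is no union operator, every generalization step can only add an optional, an iterator, or a #PCDATA field to the expression.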

Page 4: ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Matching/Mismatching

Start with the first page and create an RE that defines the wrapper
Match each successive sample against the wrapper
Mismatches result in generalizations of the regular expression
Types of mismatches
– String mismatches
– Tag mismatches
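The parallel scan of wrapper and sample can be sketched as follows. This is an illustrative simplification that assumes both are flat token lists; the function name `first_mismatch` is hypothetical. Two differing text tokens are reported as a string mismatch, and anything involving a tag as a tag mismatch. Treating a length difference as a tag mismatch is an assumption of this sketch.

```python
PCDATA = "#PCDATA"


def is_tag(token):
    """A token is a tag if it starts with '<'; otherwise it is text."""
    return token.startswith("<")


def first_mismatch(wrapper, sample):
    """Return (index, kind) for the first mismatch, or None if the lists agree."""
    for i, (w, s) in enumerate(zip(wrapper, sample)):
        if w == s or (w == PCDATA and not is_tag(s)):
            continue  # tokens agree, or a field already covers this text
        if not is_tag(w) and not is_tag(s):
            return i, "string"  # two different strings at the same position
        return i, "tag"         # different tags, or a tag against a string
    if len(wrapper) != len(sample):
        # Assumption of this sketch: running off one list counts as a tag mismatch.
        return min(len(wrapper), len(sample)), "tag"
    return None
```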

Page 5: ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Example Pages

Page 6: ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Example

String mismatches are used to discover fields of the documents
Wrapper is generated by replacing “John Smith” with #PCDATA
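The string-mismatch step can be illustrated with a short sketch, assuming wrapper and sample have the same length and differ only in text tokens. The function name `generalize_strings` and the second name in the usage example are hypothetical; only “John Smith” comes from the slide.

```python
PCDATA = "#PCDATA"


def is_tag(token):
    return token.startswith("<")


def generalize_strings(wrapper, sample):
    """Replace every differing pair of text tokens with a #PCDATA field."""
    if len(wrapper) != len(sample):
        raise ValueError("this sketch handles string mismatches only")
    result = []
    for w, s in zip(wrapper, sample):
        if w == s:
            result.append(w)
        elif not is_tag(w) and not is_tag(s):
            result.append(PCDATA)  # differing strings mark a data field
        else:
            raise ValueError("tag mismatch: needs an optional or an iterator")
    return result
```

For example, `generalize_strings(["<b>", "John Smith", "</b>"], ["<b>", "Paul Jones", "</b>"])` yields `["<b>", "#PCDATA", "</b>"]`.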

Page 7: ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Example (Cont.)

Tag Mismatches: Discovering Optionals

First check to see if mismatch is caused by an iterator
If not, could be an optional field in wrapper or sample
Cross search used to determine possible optionals
Image field determined to be optional
– (<img src=…/>)?
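The cross search for optionals might look roughly like the sketch below. It assumes flat token lists and a mismatch at wrapper index `i` and sample index `j`. The function name `find_optional` and the `<img/>` token are hypothetical stand-ins. The sketch checks the wrapper side first and returns the first hit, whereas the slides indicate that the real system explores alternatives with limited backtracking.

```python
def find_optional(wrapper, i, sample, j):
    """Cross search at a tag mismatch between wrapper[i] and sample[j].

    Returns ("wrapper", tokens) if the skipped tokens sit on the wrapper,
    ("sample", tokens) if they sit on the sample, or None if neither
    mismatching token is found further ahead on the other side.
    """
    # Does the sample's token appear later on the wrapper?
    # Then the wrapper tokens before it are optional.
    if sample[j] in wrapper[i + 1:]:
        k = wrapper.index(sample[j], i + 1)
        return "wrapper", wrapper[i:k]
    # Does the wrapper's token appear later on the sample?
    # Then the sample tokens before it are optional.
    if wrapper[i] in sample[j + 1:]:
        k = sample.index(wrapper[i], j + 1)
        return "sample", sample[j:k]
    return None
```

With a wrapper `["<b>", "#PCDATA", "</b>", "<img/>", "<ul>"]` and a sample `["<b>", "#PCDATA", "</b>", "<ul>"]`, the mismatch at position 3 on both sides resolves to the image tag being optional on the wrapper.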

Page 8: ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Example (Cont.)

Figure annotations: #PCDATA, (<IMG src=…/>)?

Page 9: ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Example (Cont.)

Tag Mismatches: Discovering Iterators

Assume mismatch is caused by repeated elements in a list
Match possible squares against earlier squares
Generalize the wrapper by finding all contiguous repeated occurrences
– (<li><i>Title:</i>#PCDATA</li>)+
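A simplified sketch of locating a candidate square follows. It assumes the mismatch occurs at sample index `j`, takes the token just before the mismatch as the terminal tag of the repeated unit, and checks the candidate against the tokens immediately preceding it. The function name `candidate_square` and the book titles in the usage example are hypothetical, and the check is far less general than a full recursive match.

```python
PCDATA = "#PCDATA"


def is_tag(token):
    return token.startswith("<")


def candidate_square(sample, j):
    """Return the generalized repeated unit starting at sample[j], or None."""
    if j < 1:
        return None
    terminal = sample[j - 1]  # the last token of the previous occurrence
    try:
        end = sample.index(terminal, j)  # next occurrence of the terminal tag
    except ValueError:
        return None
    square = sample[j:end + 1]
    start = j - len(square)
    if start < 0:
        return None
    earlier = sample[start:j]  # the occurrence just before the candidate
    generalized = []
    for a, b in zip(earlier, square):
        if a == b:
            generalized.append(a)
        elif not is_tag(a) and not is_tag(b):
            generalized.append(PCDATA)  # differing strings become a field
        else:
            return None  # the candidate does not repeat the earlier tokens
    return generalized


sample = ["<ul>",
          "<li>", "<i>", "Title:", "</i>", "DB Primer", "</li>",
          "<li>", "<i>", "Title:", "</i>", "Comp. Sys.", "</li>",
          "</ul>"]
square = candidate_square(sample, 7)
```

Here `square` is `["<li>", "<i>", "Title:", "</i>", "#PCDATA", "</li>"]`, which corresponds to the body of the iterator `(<li><i>Title:</i>#PCDATA</li>)+`.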

Page 10: ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Extracted Result

Page 11: ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Recursive Example

Page 12: ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Complexity

Page 13: ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Discussion

Assumptions
– Pages are well-structured
– Want to extract at the level of entire fields
– Structure can be modeled without disjunctions
Search space for explaining mismatches is huge
– Uses a number of heuristics to prune space
Limited backtracking
Limit on number of choices to explore
Patterns cannot be delimited by optionals
– Will result in pruning possible wrappers

Page 14: ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Experimental Result

Page 15: ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Comparison with Other Works

Page 16: ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Columns: Name, Structure, Semi, Free, Single-slot, Multi-slot, Missing items, Permutations, Nested data, Resilient

WIEN: X X X
SoftMealy: X X X X X X*
STALKER: X X X * X X X
RAPIER: X X ? X X X ?
SRV: X X ? X X X ?
WHISK: X X X X X X X* ?
AutoSlog: X X X X
ROADRUNNER: X X X X X
BYU Onto: X X ? X X X X X X

X means the information extraction system has the capability; X* means the system has the ability as long as the training corpus can accommodate the required training data; ? shows that the system has the ability to some degree; * means that the extraction pattern itself doesn’t show the ability, but the overall system has the capability.
