A Brief Survey of Web Data Extraction Tools
Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Altigran S. da Silva, Juliana S. Teixeira
Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
SIGMOD Record, June 2002
Presented by Young-Seok Lim
January 9th, 2009
2
Contents
Introduction
A taxonomy for characterizing Web data extraction tools
Overview of Web data extraction tools
Qualitative analysis
Conclusions
3
Introduction
The explosion of the World Wide Web has made a wealth of data on many different subjects available
Users retrieve Web data by
  Browsing: not suitable for locating particular items of data, because following links is tedious and it is easy to get lost
  Keyword searching: sometimes more efficient than browsing, but often returns vast amounts of data
4
Introduction
Ideas taken from the database area
  Databases require structured data
  XML is a standard for structuring data
But what about existing Web data?
  Unstructured or semistructured
  Enormous and still increasing
A possible strategy is to extract data from Web sources to populate databases
  Specialized programs, called wrappers, extract data from Web sources
5
Introduction
Given a Web page S containing a set of implicit objects, a wrapper is a program that executes the mapping W that populates a data repository R with the objects in S
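The definition above can be illustrated with a minimal Python sketch. The sample page, the field names, and the hand-written regular expression are all assumptions made for this example; none of them comes from the surveyed tools.

```python
import re
from typing import Dict, List

# Hypothetical page S: each <li> implicitly encodes a (title, price) object.
PAGE_S = """
<ul>
  <li><b>Database Systems</b> <i>$55.00</i></li>
  <li><b>Web Data Mining</b> <i>$42.50</i></li>
</ul>
"""

# Hand-written extraction rule, valid only for the layout of this example page.
ITEM_RULE = re.compile(
    r"<li><b>(?P<title>.*?)</b>\s*<i>\$(?P<price>[\d.]+)</i></li>"
)


def wrapper(page: str) -> List[Dict[str, object]]:
    """Mapping W: turn the implicit objects in a page into repository records."""
    repository = []
    for match in ITEM_RULE.finditer(page):
        repository.append(
            {"title": match.group("title"), "price": float(match.group("price"))}
        )
    return repository


# Repository R populated with the objects found in S.
REPOSITORY_R = wrapper(PAGE_S)
print(REPOSITORY_R)
```

Writing such rules by hand for every source is the effort that the tools surveyed in the following slides try to reduce.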
6
A taxonomy for characterizing Web data extraction tools
Languages for Wrapper Development
  Development of languages specially designed to assist users in constructing wrappers
  Minerva, TSIMMIS, Web-OQL
HTML-aware Tools
  Rely on inherent structural features of HTML documents (the parsing tree) to accomplish data extraction
  W4F, XWRAP, RoadRunner
7
A taxonomy for characterizing Web data extraction tools
NLP-based Tools
  Use natural language processing techniques (filtering, part-of-speech tagging, and lexical semantic tagging) to learn extraction rules
  RAPIER, SRV, WHISK
Wrapper Induction Tools
  Generate delimiter-based extraction rules derived from a given set of training examples
  WIEN, SoftMealy, STALKER
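To make "delimiter-based extraction rules derived from training examples" concrete, here is a simplified Python sketch. It is not the actual algorithm of WIEN, SoftMealy, or STALKER: it assumes a single field per rule, learns the left delimiter as the longest common suffix of the text before each labeled value, and learns the right delimiter as the longest common prefix of the text after it. All names and sample pages are hypothetical.

```python
import os
from typing import List, Tuple


def common_suffix(strings: List[str]) -> str:
    """Longest common suffix, computed as a common prefix of reversed strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def induce_rule(examples: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Learn (left, right) delimiters from (page, labeled value) pairs."""
    befores, afters = [], []
    for page, value in examples:
        start = page.index(value)
        befores.append(page[:start])
        afters.append(page[start + len(value):])
    left, right = common_suffix(befores), os.path.commonprefix(afters)
    if not left or not right:
        raise ValueError("examples share no common delimiters")
    return left, right


def apply_rule(page: str, rule: Tuple[str, str]) -> List[str]:
    """Extract every string enclosed by the learned delimiters."""
    left, right = rule
    results, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        results.append(page[start:end])
        pos = end + len(right)
    return results


# Two labeled training pages (hypothetical country/code listings).
TRAINING = [
    ("<p>Codes</p><b>Congo</b> <i>242</i>", "Congo"),
    ("<b>Belize</b> <i>501</i><br>", "Belize"),
]
RULE = induce_rule(TRAINING)
NEW_PAGE = "<b>Spain</b> <i>34</i><br><b>Egypt</b> <i>20</i><br>"
print(RULE, apply_rule(NEW_PAGE, RULE))
```

With too few or too similar examples the learned delimiters can overfit (for instance, by absorbing a digit that happens to follow every training value), which is one reason such tools depend on a representative training set.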
8
A taxonomy for characterizing Web data extraction tools
Modeling-based Tools
  Given a target structure for objects of interest, try to locate in Web pages portions of data that implicitly conform to that structure
  NoDoSE, DEByE
Ontology-based Tools
  Given a specific application domain, an ontology can be used to locate constants present in the page and to construct objects with them
  Brigham Young University Data Extraction Group
9
Overview of Web data extraction tools - Languages for wrapper development
Minerva
  Combines a declarative grammar-based approach with features typical of procedural programming languages
  A set of productions
    Each production defines the structure of a non-terminal symbol of the grammar, in terms of terminal symbols and other non-terminals
  Exception clause
10
Overview of Web data extraction tools - Languages for wrapper development
TSIMMIS
  Specification files composed of a sequence of commands that define extraction steps
  Each command has the form [variables, source, pattern]
    variables is a set of variables that hold the extraction results
Web-OQL
  Declarative query language capable of locating selected pieces of data in HTML pages
  Operates on an abstract HTML syntax tree, called a hypertree
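The TSIMMIS command form [variables, source, pattern] can be mimicked in Python as follows. This is only an analogue: it uses Python regular expressions in place of the tool's own pattern syntax, and the function name `run_spec`, the initial source name `page`, and the sample page are assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple

# One extraction step: (variables, source, pattern).
Command = Tuple[List[str], str, str]


def run_spec(commands: List[Command], page: str) -> Dict[str, str]:
    """Run commands in order; each binds variables from a previously bound source."""
    env = {"page": page}  # the fetched page is the initial source
    for variables, source, pattern in commands:
        match = re.search(pattern, env[source], flags=re.DOTALL)
        if match is None:
            raise ValueError(f"pattern {pattern!r} not found in {source!r}")
        # Each capture group fills the variable at the same position.
        for name, value in zip(variables, match.groups()):
            env[name] = value
    return env


PAGE = (
    "<html><h1>Forecast</h1>"
    "<table><tr><td>Lisbon</td><td>21</td></tr></table></html>"
)
SPEC: List[Command] = [
    (["table"], "page", r"<table>(.*)</table>"),
    (["city", "temp"], "table", r"<td>(.*?)</td><td>(\d+)</td>"),
]
RESULT = run_spec(SPEC, PAGE)
print(RESULT["city"], RESULT["temp"])
```

Each step narrows the text considered by the next one, which is the sense in which a specification file is a sequence of extraction steps.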
11
Overview of Web data extraction tools - HTML-aware Tools
W4F (World Wide Web Wrapper Factory)
  Three phases
    Describe how to access the document
    Describe what pieces of data to extract
    Declare what target structure to use for storing the data extracted
  HEL (HTML Extraction Language) defines the extraction rules
XWRAP
  Semiautomatic construction of wrappers
  Cleans up bad HTML tags
  Outputs a wrapper coded in Java
  Six heuristics
12
Overview of Web data extraction tools - HTML-aware Tools
RoadRunner
  Compares the HTML structure of two (or more) given sample pages belonging to the same “page class”
  A grammar is inferred from the schema
  Fully automatic, with no user intervention
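The idea of comparing two pages of the same class can be sketched with Python's standard `difflib`. This is a deliberately simplified stand-in, not RoadRunner's matching algorithm: tokens that agree across both pages are treated as template, and any region where the pages differ is collapsed into a `#FIELD` placeholder. The sample pages and the placeholder name are assumptions.

```python
import re
from difflib import SequenceMatcher
from typing import List


def tokenize(html: str) -> List[str]:
    """Split a page into tag tokens and non-empty text tokens."""
    tokens = re.findall(r"<[^>]+>|[^<]+", html)
    return [t.strip() for t in tokens if t.strip()]


def infer_template(page_a: str, page_b: str) -> List[str]:
    """Keep tokens shared by both pages; mark differing regions as data fields."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    template: List[str] = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            template.extend(a[i1:i2])
        else:
            template.append("#FIELD")
    return template


PAGE_A = "<html><b>Title:</b> Database Primer <b>Author:</b> John Smith</html>"
PAGE_B = "<html><b>Title:</b> Web Mining <b>Author:</b> Mary Jones</html>"
TEMPLATE = infer_template(PAGE_A, PAGE_B)
print(TEMPLATE)
```

No labeled examples are supplied here, which mirrors the "no user intervention" property; handling optional and repeated structures requires considerably more machinery than this sketch has.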
13
Overview of Web data extraction tools - NLP-based Tools
RAPIER (Robust Automated Production of Information Extraction Rules)
  Extracts from free text
  Takes a template indicating the data to be extracted
  Learns data extraction patterns to extract data for populating the template's slots
  Patterns use constraints on the words and part-of-speech tags
  Single-slot
14
Overview of Web data extraction tools - NLP-based Tools
SRV
  Based on a given set of training examples
  Relies on a set of token-oriented features that can be either simple or relational
  Single-slot
WHISK
  A set of extraction rules is induced from a given set of training example documents
  On each iteration, the user adds tags
  Multi-slot
15
Overview of Web data extraction tools – Wrapper Induction Tools
WIEN
  A pioneer wrapper induction tool
  Takes as input a set of pages where data of interest is labeled to serve as examples
  Does not deal with nested structures or with variations typical of semistructured data
SoftMealy
  Uses a special kind of automata called finite-state transducers (FST)
16
Overview of Web data extraction tools – Wrapper Induction Tools
STALKER
  Can deal with hierarchical data extraction
  Two inputs
    A set of training examples in the form of a sequence of tokens representing the surroundings of the data to be extracted
    A description of the page structure, called an Embedded Catalog Tree (ECT)
  Disjunctive rules
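The role of disjunctive rules can be illustrated with a small Python sketch. It assumes a rule is a list of alternatives, each alternative being a sequence of landmarks to skip past, and the first alternative that matches wins. It works on raw strings rather than token sequences, leaves out rule learning and the Embedded Catalog Tree, and uses hypothetical sample pages.

```python
from typing import List, Optional

Alternative = List[str]   # landmarks to skip past, left to right
Rule = List[Alternative]  # disjunction: the first matching alternative wins


def skip_to(text: str, landmarks: Alternative) -> Optional[int]:
    """Return the position just after the last landmark, or None on failure."""
    pos = 0
    for landmark in landmarks:
        found = text.find(landmark, pos)
        if found == -1:
            return None
        pos = found + len(landmark)
    return pos


def apply_rule(text: str, rule: Rule) -> Optional[int]:
    """Try each alternative of a disjunctive rule in order."""
    for alternative in rule:
        pos = skip_to(text, alternative)
        if pos is not None:
            return pos
    return None


def extract(text: str, start_rule: Rule, end_landmark: str) -> Optional[str]:
    """Extract the text between the start rule's position and the end landmark."""
    start = apply_rule(text, start_rule)
    if start is None:
        return None
    end = text.find(end_landmark, start)
    return text[start:end] if end != -1 else None


# The phone number is bold on some pages and italic on others.
PHONE_START: Rule = [["Phone:", "<b>"], ["Phone:", "<i>"]]
PAGE_1 = "Name: Joe's <br> Phone: <b>(800) 555-1234</b>"
PAGE_2 = "Name: Kim's <br> Phone: <i>(310) 555-9876</i>"
print(extract(PAGE_1, PHONE_START, "</"), extract(PAGE_2, PHONE_START, "</"))
```

A single non-disjunctive rule would fail on one of the two pages, which is why disjunctions help with the structural variations typical of semistructured data.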
17
Overview of Web data extraction tools – Modeling-based Tools
NoDoSE (Northwestern Document Structure Extractor)
  Interactive tool for semi-automatically determining the structure of documents
  Mining component
DEByE (Data Extraction By Example)
  Interactive tool that receives as input a set of example objects taken from a sample Web page
  Object extraction patterns (OEP)
  Bottom-up extraction algorithm
18
Overview of Web data extraction tools – Ontology-based Tools
The work of the Data Extraction Group at Brigham Young University (BYU)
  Ontologies constructed to describe the data of interest
  If the ontology is representative enough, extraction is fully automated
  Inherently resilient and adaptable
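A rough Python sketch of the ontology-based idea follows. Here the "ontology" is reduced to a dictionary of regular-expression recognizers for constants in a car-advertisement domain; this reduction, the field names, and the sample ads are assumptions for illustration and do not reproduce the BYU group's ontology model.

```python
import re
from typing import Dict

# Toy "ontology": each field of the domain object has a recognizer for its constants.
CAR_AD_ONTOLOGY = {
    "year": r"\b(?:19|20)\d{2}\b",
    "price": r"\$\d[\d,]*",
    "mileage": r"\b\d[\d,]*\s*miles\b",
}


def extract_object(text: str, ontology: Dict[str, str]) -> Dict[str, str]:
    """Locate constants in the text and assemble them into one object."""
    record = {}
    for field, pattern in ontology.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            record[field] = match.group(0)
    return record


# Two ads from different hypothetical sources, with different layouts.
AD_1 = "For sale: 1998 Honda Civic, 85,000 miles, runs great. Asking $3,500 obo."
AD_2 = "$4,200 - 2001 Toyota Corolla (60,000 miles)"
print(extract_object(AD_1, CAR_AD_ONTOLOGY))
print(extract_object(AD_2, CAR_AD_ONTOLOGY))
```

Because the recognizers depend on the content of the constants rather than on page markup, the same ontology works on both ads, which hints at why this approach is described as resilient and adaptable.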
19
Qualitative analysis - Degree of automation
Related to the amount of work left to the user during the process of generating a wrapper
Approaches based on languages
  Require the writing of code
HTML-aware tools
  Higher degree of automation
  To be really effective, HTML tags must be used very consistently in the target pages
NLP-based, induction-based, modeling-based tools
  Semi-automated
  User has to provide examples
Ontology-based tools
  Manual
  Require the construction of an ontology
20
Qualitative analysis – Support for Complex Objects
Approaches based on languages
  Supported through coding
HTML-aware tools
  W4F uses HEL coding
NLP-based, induction-based, modeling-based tools
  SoftMealy allows the representation of structural variations
  SoftMealy doesn’t deal with nested structures
  STALKER, NoDoSE, DEByE represent hierarchical structure and structural variations
21
Qualitative analysis – Page Contents
Two kinds of pages
  Semi-structured data
  Semi-structured text
22
Qualitative analysis – Ease of Use
HTML-aware tools, NLP-based tools, wrapper induction tools, and modeling-based tools usually provide a GUI
In the BYU tool, the ontology creation process must also be done manually by the user
23
Qualitative analysis – XML Output
In Minerva, the user has to explicitly write code to generate output in XML
In W4F, through the “mapping wizard”
XWRAP and DEByE natively provide XML output
NoDoSE supports a variety of output formats
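As an illustration of what "explicitly writing code to generate XML output" involves, the following Python sketch serializes extracted records with the standard `xml.etree.ElementTree` module. The element names and the sample record are assumptions; the sketch is not tied to any of the tools above.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def to_xml(records: List[Dict[str, object]],
           root_tag: str = "books", item_tag: str = "book") -> str:
    """Serialize extracted records as one XML element per record."""
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for field, value in record.items():
            # One child element per field; values are stored as text.
            ET.SubElement(item, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


XML_OUT = to_xml([{"title": "Web Data Mining", "price": 42.5}])
print(XML_OUT)
```

Tools with native XML support spare the user this step.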
24
Qualitative analysis – Support for Non-HTML Sources
NLP-based tools and the BYU tools are especially suitable for non-HTML sources
Wrapper induction tools and modeling-based tools don’t rely exclusively on HTML tags
25
Qualitative analysis – Resilience and Adaptiveness
The structural and presentation features of Web pages are prone to frequent changes
  Resilience: the capacity to continue working properly when the pages change
  Adaptiveness: the property of working properly with pages from another source in the same application domain