A Brief Survey of Web Data Extraction Tools

Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Altigran S. da Silva, Juliana S. Teixeira
Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
SIGMOD Record, June 2002

Presented by Young-Seok Lim
January 9th, 2009



Contents

  Introduction
  A taxonomy for characterizing Web data extraction tools
  Overview of Web data extraction tools
  Qualitative analysis
  Conclusions

Introduction

With the explosion of the World Wide Web, a wealth of data on many different subjects has become available online

Users retrieve Web data by
  Browsing
    Not suitable for locating particular items of data, because following links is tedious and it is easy to get lost
  Keyword searching
    Sometimes more efficient than browsing, but often returns vast amounts of data

Ideas taken from the database area
  Databases require structured data
  XML is a standard for structuring data

But what about existing Web data?
  Unstructured or semistructured
  Enormous and still increasing

A possible strategy is to extract data from Web sources to populate databases
  Specialized programs, called wrappers, extract the data from Web sources

Given a Web page S containing a set of implicit objects, a wrapper is a program that executes the mapping W that populates a data repository R with the objects in S
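The definition above can be made concrete with a very small hand-written wrapper. This is only a sketch under stated assumptions: the page fragment, the field names, and the regular expression standing in for the mapping W are all invented for illustration and do not come from any of the surveyed tools.

```python
import re

# Hypothetical page S: an HTML fragment holding two implicit "book" objects.
PAGE_S = """
<ul>
  <li><b>Databases</b> by A. Author, $30</li>
  <li><b>Web Mining</b> by B. Writer, $45</li>
</ul>
"""

# The mapping W, written here by hand as one regular expression (an assumption;
# real wrappers may use grammars, learned rules, or parse-tree paths instead).
BOOK_PATTERN = re.compile(
    r"<li><b>(?P<title>[^<]+)</b> by (?P<author>[^,]+), \$(?P<price>\d+)</li>"
)

def wrapper(page):
    """Execute the mapping W: build the repository R from the objects in page S."""
    repository = []
    for match in BOOK_PATTERN.finditer(page):
        record = match.groupdict()
        record["price"] = int(record["price"])  # convert the price to a number
        repository.append(record)
    return repository

REPOSITORY_R = wrapper(PAGE_S)
```

Here REPOSITORY_R is a plain list of dictionaries; a real repository R would more likely be a database table or an XML document.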

A taxonomy for characterizing Web data extraction tools

Languages for Wrapper Development
  Languages specially designed to assist users in constructing wrappers
  Minerva, TSIMMIS, Web-OQL

HTML-aware Tools
  Rely on inherent structural features of HTML documents (the parsing tree) to accomplish data extraction
  W4F, XWRAP, RoadRunner

NLP-based Tools
  Use natural language processing techniques (filtering, part-of-speech tagging, and lexical semantic tagging) to learn extraction rules
  RAPIER, SRV, WHISK

Wrapper Induction Tools
  Generate delimiter-based extraction rules derived from a given set of training examples
  WIEN, SoftMealy, STALKER

Modeling-based Tools
  Given a target structure, try to locate in Web pages portions of data that implicitly conform to that structure
  NoDoSE, DEByE

Ontology-based Tools
  Given a specific application domain, an ontology can be used to locate constants present in the page and to construct objects with them
  Brigham Young University Data Extraction Group

Overview of Web data extraction tools - Languages for wrapper development

Minerva
  Combines a declarative grammar-based approach with features typical of procedural programming languages
  A set of productions
    Each production defines the structure of a non-terminal symbol of the grammar, in terms of terminal symbols and other non-terminals
  Exception clause

TSIMMIS
  Specification files composed of a sequence of commands that define extraction steps
  Each command has the form [variables, source, pattern]
    variables is a set of variables that hold the extraction results

Web-OQL
  Declarative query language capable of locating selected pieces of data in HTML pages
  Works on an abstract HTML syntax tree, called a hypertree
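The [variables, source, pattern] command form used by TSIMMIS can be illustrated with a toy interpreter. This is a rough sketch, not the TSIMMIS specification language: it assumes that source names either the page itself (called "root" here) or a variable bound by an earlier command, and it swaps in Python regular expressions for the tool's own pattern syntax. The function name run_spec and the sample page are hypothetical.

```python
import re

def run_spec(commands, page):
    """Run [variables, source, pattern] commands in order.

    Assumptions: `source` is "root" (the whole page) or a variable bound by an
    earlier command; `pattern` is a Python regex with one group per variable.
    """
    env = {"root": page}
    for variables, source, pattern in commands:
        match = re.search(pattern, env[source], re.DOTALL)
        if match is None:
            raise ValueError(f"pattern {pattern!r} did not match {source!r}")
        # Bind each captured group to the corresponding variable name.
        for name, value in zip(variables, match.groups()):
            env[name] = value
    return env

PAGE = "<html><table><tr><td>Belo Horizonte</td><td>23C</td></tr></table></html>"

SPEC = [
    (["table"], "root", r"<table>(.*?)</table>"),                   # step 1: isolate the table
    (["city", "temp"], "table", r"<td>(.*?)</td><td>(.*?)</td>"),   # step 2: split the row
]

result = run_spec(SPEC, PAGE)
```

Each command narrows the text handed to the next one, which is the sense in which the specification file describes a sequence of extraction steps.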

Overview of Web data extraction tools - HTML-aware Tools

W4F (World Wide Web Wrapper Factory)
  Three phases
    Describe how to access the document
    Describe what pieces of data to extract
    Declare what target structure to use for storing the extracted data
  Extraction rules are defined in HEL (HTML Extraction Language)

XWRAP
  Semi-automatic construction of wrappers
  Cleans up bad HTML tags
  Outputs a wrapper coded in Java
  Six heuristics
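The three phases listed for W4F (access the document, extract pieces of data, map them to a target structure) can be sketched with Python's standard html.parser module. This is an illustration of the HTML-aware idea only: the path notation below is not HEL, the classes Node and TreeBuilder and the function select are hypothetical helpers, and a literal string stands in for document retrieval.

```python
from html.parser import HTMLParser

class Node:
    """One element of a minimal parse tree."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

class TreeBuilder(HTMLParser):
    """Build a parse tree; assumes well-formed, properly nested HTML."""
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()

def select(node, path):
    """Follow a path of (tag, index) steps down the parse tree."""
    for tag, index in path:
        matches = [child for child in node.children if child.tag == tag]
        node = matches[index]
    return node

# Phase 1 (access): a literal string stands in for fetching the document.
PAGE = ("<html><body><h1>Stocks</h1>"
        "<table><tr><td>ACME</td><td>12.5</td></tr></table></body></html>")
builder = TreeBuilder()
builder.feed(PAGE)

# Phase 2 (extraction): paths over the parse tree select the pieces of data.
row = [("html", 0), ("body", 0), ("table", 0), ("tr", 0)]
symbol = select(builder.root, row + [("td", 0)]).text
price = select(builder.root, row + [("td", 1)]).text

# Phase 3 (mapping): store the extracted strings in a target structure.
quote = {"symbol": symbol, "price": float(price)}
```

Because the paths depend on tag positions, a change in the page layout breaks them, which is why HTML-aware tools need consistent use of HTML tags in the target pages.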

RoadRunner
  Compares the HTML structure of two (or more) given sample pages belonging to the same “page class”
  A grammar is inferred from the resulting schema
  Fully automatic, with no user intervention
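The idea of comparing sample pages of the same page class can be approximated in a few lines. The sketch below is a deliberate simplification and not RoadRunner's matching algorithm: it aligns the token sequences of two pages with difflib.SequenceMatcher, treats shared tokens as template and each mismatch as a data field, and ignores optional and repeated structures. The names tokenize, infer_template, and the "#FIELD" marker are hypothetical.

```python
import re
from difflib import SequenceMatcher

def tokenize(page):
    """Split a page into tags and text chunks, dropping whitespace-only tokens."""
    return [tok for tok in re.findall(r"<[^>]+>|[^<]+", page) if tok.strip()]

def infer_template(page_a, page_b):
    """Keep tokens shared by both pages; replace each mismatch with a field slot."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    template = []
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op == "equal":
            template.extend(a[a1:a2])   # common structure: part of the template
        else:
            template.append("#FIELD")   # differing content: a data field
    return template

PAGE_A = "<html><b>Title:</b>Databases<i>Price:</i>30</html>"
PAGE_B = "<html><b>Title:</b>Web Mining<i>Price:</i>45</html>"

template = infer_template(PAGE_A, PAGE_B)
```

No labeled examples are supplied at any point, which mirrors the fully automatic character of this class of tool.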

Overview of Web data extraction tools - NLP-based Tools

RAPIER (Robust Automated Production of Information Extraction Rules)
  Extracts from free text
  Takes a template indicating the data to be extracted
  Learns data extraction patterns to extract data for populating the template's slots
  Patterns use constraints on the words and part-of-speech tags
  Single-slot

SRV
  Based on a given set of training examples
  Relies on a set of token-oriented features that can be either simple or relational
  Single-slot

WHISK
  A set of extraction rules is induced from a given set of training example documents
  At each iteration, the user adds tags to training instances
  Multi-slot

Overview of Web data extraction tools - Wrapper Induction Tools

WIEN
  A pioneer wrapper induction tool
  Takes as input a set of pages where data of interest is labeled to serve as examples
  Does not deal with nested structures or with variations typical of semistructured data

SoftMealy
  Uses a special kind of automata called finite-state transducers (FST)
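Delimiter-based induction of the kind pioneered by WIEN can be sketched for a single attribute. The code below is a simplified illustration, not WIEN's algorithm: it learns one left and one right delimiter as the longest context shared by all labeled examples, and it assumes each labeled value occurs exactly once in its page and that both learned delimiters are non-empty. The function names and sample pages are hypothetical.

```python
import os
import re

def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]

def learn_lr(pages, labels):
    """Learn a (left, right) delimiter pair from labeled values.

    labels[i] lists the values of interest in pages[i]; each value is assumed
    to occur exactly once in its page.
    """
    befores, afters = [], []
    for page, values in zip(pages, labels):
        for value in values:
            start = page.index(value)
            befores.append(page[:start])               # text before the value
            afters.append(page[start + len(value):])   # text after the value
    return common_suffix(befores), os.path.commonprefix(afters)

def extract(page, left, right):
    """Return every string found between the learned delimiters."""
    return re.findall(re.escape(left) + r"(.*?)" + re.escape(right), page)

TRAIN_PAGES = ["<li><b>Brazil</b> 55</li><li><b>Korea</b> 82</li>"]
TRAIN_LABELS = [["Brazil", "Korea"]]

left, right = learn_lr(TRAIN_PAGES, TRAIN_LABELS)
countries = extract("<li><b>Chile</b> 56</li><li><b>Japan</b> 81</li>", left, right)
```

A flat delimiter pair like this has no way to express nesting or missing attributes, which is the limitation noted for WIEN above.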

STALKER
  Can deal with hierarchical data extraction
  Two inputs
    A set of training examples in the form of a sequence of tokens representing the surroundings of the data to be extracted
    A description of the page structure, called an Embedded Catalog Tree (ECT)
  Disjunctive rules
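A disjunctive rule can be read as an ordered list of alternatives, where the first alternative that matches is used. The sketch below illustrates that reading with landmark-skipping rules; it is loosely inspired by STALKER and makes several assumptions: the rules are written by hand rather than learned, the ECT is not modeled, and the names skip_to, apply_disjunctive, and extract_phone as well as the landmarks are hypothetical.

```python
def skip_to(text, position, landmarks):
    """Consume each landmark in turn; return the position after the last one, or None."""
    for landmark in landmarks:
        found = text.find(landmark, position)
        if found == -1:
            return None
        position = found + len(landmark)
    return position

def apply_disjunctive(text, alternatives):
    """Try each alternative rule in order; the first one that matches wins."""
    for landmarks in alternatives:
        position = skip_to(text, 0, landmarks)
        if position is not None:
            return position
    return None

# Hand-written disjunctive start rule: either "Phone:" then "<b>", or "Phone:" then "<i>".
START_RULE = [["Phone:", "<b>"], ["Phone:", "<i>"]]

def extract_phone(text):
    """Extract the text between the matched start position and the next tag."""
    start = apply_disjunctive(text, START_RULE)
    if start is None:
        return None
    end = text.find("<", start)
    return text[start:end] if end != -1 else text[start:]

bold_case = extract_phone("Name: Ana <br> Phone: <b>555-1234</b>")
italic_case = extract_phone("Phone: <i>555-9999</i>")
```

The two alternatives let one rule cover pages that format the same field in different ways, which is how disjunction helps with structural variations.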

Overview of Web data extraction tools - Modeling-based Tools

NoDoSE (Northwestern Document Structure Extractor)
  Interactive tool for semi-automatically determining the structure of documents
  Mining component

DEByE (Data Extraction By Example)
  Interactive tool that receives as input a set of example objects taken from a sample Web page
  Object extraction patterns (OEP)
  Bottom-up extraction algorithm

Overview of Web data extraction tools - Ontology-based Tools

The work of the Data Extraction Group at Brigham Young University (BYU)
  Ontologies are constructed to describe the data of interest
  If the ontology is representative enough, extraction is fully automated
  Inherently resilient and adaptable
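The ontology-based idea of locating constants and building objects from them can be illustrated with a toy example. This is a sketch under stated assumptions, not the BYU system: the "ontology" is reduced to a dictionary that maps each attribute to a regular expression describing its constants, and the car-advertisement domain, attribute names, and patterns are invented for illustration.

```python
import re

# Hypothetical mini-ontology for car advertisements: each attribute lists a
# pattern describing what its constants look like in text.
CAR_ONTOLOGY = {
    "year": r"\b(?:19|20)\d{2}\b",
    "make": r"\b(?:Ford|Honda|Toyota)\b",
    "price": r"\$\d[\d,]*",
    "mileage": r"\b\d[\d,]*\s*(?:miles|mi)\b",
}

def build_object(text, ontology):
    """Locate one constant per attribute and assemble them into an object."""
    obj = {}
    for attribute, pattern in ontology.items():
        match = re.search(pattern, text)
        if match:
            obj[attribute] = match.group(0)
    return obj

AD = "For sale: 1998 Honda Civic, 72,000 miles, runs great. $4,500 or best offer."
car = build_object(AD, CAR_ONTOLOGY)
```

Nothing in the patterns refers to HTML tags or page layout, so the same ontology keeps working when a page is reformatted or when ads come from a different site in the same domain.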

Qualitative analysis - Degree of automation

Related to the amount of work left to the user during the process of generating a wrapper

Language-based approaches
  Require the writing of code

HTML-aware tools
  Higher degree of automation
  To be really effective, the target pages must use HTML tags very consistently

NLP-based, induction-based, and modeling-based tools
  Semi-automated
  The user has to provide examples

Ontology-based tools
  Manual
  Require the construction of an ontology

Qualitative analysis - Support for Complex Objects

Language-based approaches
  Through coding

HTML-aware tools
  W4F uses HEL coding

NLP-based, induction-based, and modeling-based tools
  SoftMealy allows the representation of structural variations
  SoftMealy does not deal with nested structures
  STALKER, NoDoSE, and DEByE represent hierarchical structure and structural variations

Qualitative analysis - Page Contents

Two kinds of pages
  Semi-structured data
  Semi-structured text

Qualitative analysis - Ease of Use

HTML-aware tools, NLP-based tools, wrapper induction tools, and modeling-based tools usually present a GUI

In the BYU tool, the ontology creation process must also be done manually by the user

Qualitative analysis - XML Output

In Minerva, the user has to explicitly write code to generate XML output
W4F provides a “mapping wizard”
XWRAP and DEByE provide XML output natively
NoDoSE supports a variety of output formats
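Writing the output code by hand, as a language-based tool requires, amounts to something like the following. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the function to_xml, the element names, and the sample record are hypothetical, and field names are assumed to be valid XML element names.

```python
import xml.etree.ElementTree as ET

def to_xml(records, root_tag="records", item_tag="record"):
    """Serialize a list of extracted records (dictionaries) as an XML string."""
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for field, value in record.items():
            # One child element per extracted field.
            ET.SubElement(item, field).text = str(value)
    return ET.tostring(root, encoding="unicode")

xml_text = to_xml([{"title": "Databases", "price": 30}])
```

Tools with native XML output perform this step automatically from the structure of the extracted objects.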

Qualitative analysis - Support for Non-HTML Sources

NLP-based tools and the BYU tool are especially suitable for non-HTML sources

Wrapper induction tools and modeling-based tools do not rely solely on HTML tags

Qualitative analysis - Resilience and Adaptiveness

The structural and presentation features of Web pages are prone to frequent changes
  Resilience: the capacity to continue working properly when the pages change
  Adaptiveness: the property of working properly with pages from another source in the same application domain


Conclusion
