Large-Scale Deep Web Integration: Exploring and Querying Structured Data on the Deep Web Kevin C. Chang Tutorial in SIGMOD’06


Large-Scale Deep Web Integration: Exploring and Querying Structured Data on the Deep Web
Kevin C. Chang
Tutorial in SIGMOD’06


Still challenges on the Web?

Google is only the start of search (and MSN will not be the end of it).


Structured Data: Prevalent but ignored!


Challenges on the Web come in “dual”: getting access to the structured information!
Kevin’s 4 quadrants:
[Figure: quadrant diagram with axes Access and Structure; regions labeled Surface Web and Deep Web]


Tutorial Focus: Large-scale integration of structured data over the Deep Web
That is: search-flavored integration.
Disclaimer: what it is not
  Small-scale, pre-configured, mediated-querying settings: many related techniques; some we will relate today.
  Text databases (or meta-search): several related but “text-oriented” issues in meta-search, e.g., Stanford, Columbia, UIC; more in the IR community (distributed IR).
And never a “complete” bibliography! See the “Web Integration” bibliography at http://metaquerier.cs.uiuc.edu/
Finally, no intention to “finish” this tutorial.


Evidence in beta: Google Base.


When Google speaks up… “What is an ‘Attribute’?” says Google!


And things are indeed happening!


The Deep Web: Databases on the Web


The previous Web: Search used to be “crawl and index”


The current Web: Search must eventually resort to integration


How to enable effective access to the deep Web?

Example sources: Cars.com, Amazon.com, Apartments.com, Biography.com, 401carfinder.com, 411localte.com


Survey the frontier: BrightPlanet.com, March 2000 [Bergman00]
Overlap analysis of search engines: N = (n_a · n_b) / n_0
“Search sites” not clearly defined.
Estimated 43,000 – 96,000 deep Web sites.
Content size 500 times that of surface Web.
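The overlap analysis named on this slide can be sketched as a capture-recapture estimate. This is a minimal sketch under stated assumptions: n_a and n_b are taken to be the numbers of sites listed by two independent sources, and n_0 the number listed by both. The function name and the sample counts are hypothetical and are not figures from [Bergman00].

```python
def overlap_estimate(n_a: int, n_b: int, n_0: int) -> float:
    """Estimate the total population size N from two listings and their overlap."""
    if n_0 <= 0:
        raise ValueError("the two listings must overlap to give an estimate")
    return n_a * n_b / n_0


# Hypothetical counts: two directories list 400 and 300 sites, 60 in common.
print(overlap_estimate(400, 300, 60))
```

The estimate is only as good as the independence assumption: if both listings favor the same popular sites, the overlap is inflated and N is underestimated.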


Survey the frontier: UIUC MetaQuerier, April 2004 [ChangHL+04]
Macro: Deep Web at large
  Data: automatically sampled 1 million IPs
Micro: per-source specific characteristics
  Data: manually collected sources; 8 representative domains, 494 sources
  Airfare (53), Autos (102), Books (69), CarRentals (24), Hotels (38), Jobs (55), Movies (78), MusicRecords (75)
Available at http://metaquerier.cs.uiuc.edu/repository


They wanted to observe…

How many deep-Web sources are out there?
  “The dot-com bust has brought down DBs on the Web.”
How many structured databases?
  “There are just (or, much more) text databases.”
How do search engines cover them?
  “Google does it all.” Or, “InvisibleWeb.com does it all.”
How hidden are they?
  “It is the hidden Web.”
How complex are they?
  “Queries on the Web are much simpler, even trivial.”
  “Coping with semantics is hopeless. Let’s just wait till the semantic Web.”


And their results are…

How many deep-Web sources are out there?
  307,000 sites, 450,000 DBs, 1,258,000 interfaces.
How many structured databases?
  348,000 (structured) : 102,000 (text), roughly 3 : 1
How do search engines cover them?
  Google covered 5% fresh and 21% stale objects.
  InvisibleWeb.com covered 7.8% of sources.
How hidden are they?
  CarRental (0%) > Airfares (~4%) > … > MusicRec > Books > Movies (80%+)
How complex are they?
  The “Amazon effect”


Reported the “Amazon effect”…

Attributes converge in a domain!

Condition patterns converge even across domains!


Google’s Recent Survey [courtesy Jayant Madhavan]


Driving Force: The Large Scale


Circa 2000: Example System: Information Agents [MichalowskiAKMTT04, Knoblock03]


Circa 2000: Example System: Comparison Shopping Engines [GuptaHR97]
Virtual Database


System: Example Applications


Vertical Search Engines: “Warehousing” approach
e.g., Libra Academic Search [NieZW+05] (courtesy MSRA)
  Integrating information from multiple types of sources
  Ranking papers, conferences, and authors for a given query
  Handling structured queries
[Figure: source types feeding the warehouse: Web Databases; PDF, PS, and DOC files; Journal, Author, and Conf. Homepages]


On-the-fly Meta-querying Systems, e.g., WISE [HeMYW03], MetaQuerier [ChangHZ05]
MetaQuerier@UIUC:
[Figure labels: FIND sources; QUERY sources; db of dbs; unified query interface; sources Amazon.com, Cars.com, 411localte.com, Apartments.com]


What needs to be done? Technical Challenges:
1. Source Modeling & Selection
2. Schema Matching
3. Source Querying, Crawling, and Object Ranking
4. Data Extraction
5. System Integration


The Problems: Technical Challenges


Technical Challenges

1. Source Modeling & Selection

How to describe a source and find the right sources for query answering?


Source Modeling: Circa 2000

Focus: Design of expressive model mechanism.
Techniques:
  View-based mechanisms: answering queries using views, LAV, GAV (see [Halevy01] for survey).
  Hierarchical or layered representations for modeling in-site navigations ([KnoblockMA+98], [DavulcuFK+99]).


Source Modeling & Selection: for Large Scale Integration

Focus: Discovery of sources.
  Focused crawling to collect query interfaces [BarbosaF05, ChangHZ05].
Focus: Extraction of source models.
  Hidden grammar-based parsing [ZhangHC04].
  Proximity-based extraction [HeMY+04].
  Classification to align with given taxonomy [HessK03, Kushmerick03].
Focus: Organization of sources and query routing.
  Offline clustering [HeTC04, PengMH+04].
  Online search for query routing [KabraLC05].


Form Extraction: the Problem

Output all the conditions; for each:
  Grouping elements (into query conditions)
  Tagging elements with their “semantic roles”: attribute, operator, value
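The target output of form extraction can be pictured as a list of conditions, each grouping form elements and tagging them with a role. The following is a minimal sketch only: the class name `Condition`, the book-search form, and its field values are hypothetical illustrations, not the representation used by any cited system.

```python
from dataclasses import dataclass, field


@dataclass
class Condition:
    """One query condition extracted from a form."""
    attribute: str                                      # label text, e.g. "Author"
    operators: list[str] = field(default_factory=list)  # e.g. radio-button options
    values: list[str] = field(default_factory=list)     # empty list means free text


# Hypothetical extraction result for a book-search form.
conditions = [
    Condition("Author", operators=["exact name", "start of last name"]),
    Condition("Format", values=["Hardcover", "Paperback"]),
]

for c in conditions:
    kind = "enumerated" if c.values else "free text"
    print(f"{c.attribute}: operators={c.operators}, value domain={kind}")
```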


Form Extraction: Parsing Approach [ZhangHC04]
Does a hidden syntactic model exist?
Observation: Interfaces share “patterns” of presentation.
Hypothesis:
[Figure: query capabilities pass through Interface Creation, governed by a Grammar, to produce the interface]
Now, the problem:
Given an interface, how to find its query capabilities?


Best-Effort Visual Language Parsing Framework
Input: HTML query form
Output: semantic structure
[Figure: components: Layout Engine, Tokenizer, BE-Parser, Ambiguity Resolution, Error Handling; 2P Grammar (Productions, Preferences)]


Form Extraction: Clustering Approach [HessK03, Kushmerick03]
Concept: A form as a Bayesian network.
Training: Estimate the Bayesian probabilities.
Classification: Max-likelihood predictions given terms.
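The "max-likelihood predictions given terms" step can be illustrated with a much simpler model than the one on the slide. The sketch below swaps in a plain naive Bayes term model with add-one smoothing; it is not the Bayesian network of [HessK03, Kushmerick03], and the training examples, category names, and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (terms, category). Returns term counts per category."""
    term_counts = defaultdict(Counter)
    for terms, category in examples:
        term_counts[category].update(terms)
    return term_counts


def classify(terms, term_counts):
    """Return the category maximizing the smoothed log-likelihood of the terms."""
    vocab = {t for counts in term_counts.values() for t in counts}
    best, best_score = None, -math.inf
    for category, counts in term_counts.items():
        denom = sum(counts.values()) + len(vocab)
        score = sum(math.log((counts[t] + 1) / denom) for t in terms)
        if score > best_score:
            best, best_score = category, score
    return best


# Hypothetical labeled form fields from an airfare domain.
examples = [
    (["departure", "city"], "Origin"),
    (["from", "airport"], "Origin"),
    (["arrival", "city"], "Destination"),
    (["to", "airport"], "Destination"),
]
model = train(examples)
print(classify(["from", "city"], model))
```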


Technical Challenges

2. Schema Matching

How to match the schematic structures between sources?


Schema Matching: Circa 2000
Focus: Generic matching without assuming Web sources
Techniques: [RahmB01]


Schema Matching: for Large Scale Integration

Focus: Matching a large number of interface schemas, often in a holistic way.
  Statistical model discovery [HeC03]; correlation mining [HeCH04, HeC05].
  Query probing [WangWL+04].
  Clustering [HeMY+03, WuYD+04].
  Corpus-assisted [MadhavanBD+05]; Web-assisted [WuDY06].
Focus: Constructing unified interfaces.
  As a global generative model [HeC03].
  Cluster-merge-select [HeMY+03].


WISE-Integrator: Cluster-Merge-Represent [HeMY+03]


WISE-Integrator: Cluster-Merge-Represent [HeMY+03]
Matching attributes:
  Synonymous labels: WordNet, string similarity
  Compatible value domains (enum values or type)
Constructing the integrated interface:
  form = initially empty
  until all attributes are covered:
    take one attribute
    select a representative and merge values
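The cluster, merge, and represent steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the WISE-Integrator implementation: a string-similarity ratio from Python's `difflib` stands in for WordNet synonymy, the 0.8 threshold is arbitrary, the representative is simply the most frequent label, and the sample forms are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Label similarity by string ratio (a stand-in for WordNet synonymy)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def integrate(interfaces):
    """interfaces: list of {label: set of values}. Returns the unified form."""
    clusters = []  # each cluster is a list of (label, values) pairs
    for form in interfaces:
        for label, values in form.items():
            for cluster in clusters:
                if any(similar(label, other) for other, _ in cluster):
                    cluster.append((label, values))
                    break
            else:
                clusters.append([(label, values)])
    unified = {}
    for cluster in clusters:
        # Representative: the most frequent label; values: union over the cluster.
        rep = Counter(label for label, _ in cluster).most_common(1)[0][0]
        unified[rep] = set().union(*(values for _, values in cluster))
    return unified


# Hypothetical book-search interfaces; an empty set means a free-text field.
forms = [
    {"Author": set(), "Format": {"Hardcover"}},
    {"Authors": set(), "Format": {"Paperback"}},
    {"Author": set(), "Title": set()},
]
print(integrate(forms))
```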


Statistical Schema Matching: MGS [HeC03, HeCH04, HeC05]
Does a hidden statistical model exist?
Observation: Schemas share “tendencies” of attribute usage.
Hypothesis:
[Figure: a hidden Statistical Model drives Schema Generation, producing schemas over attributes such as α, β, γ, δ, η]
Now, the problem:
Given the schemas, how to find the attribute matchings?


Statistical Hypothesis Discovery

Statistical formulation:
  Given as observations: [figure: schemas of query interfaces (QIs) over attributes α, β, γ, η, δ]
  Find underlying hypothesis: [figure: probability model (Prob) over the attributes α, β, γ, η, δ]
“Global” approach: Hidden model discovery [HeC03]
  Find the entire global model at once.
“Local” approach: Correlation mining [HeCH04, HeC05]
  Find local fragments of matchings one at a time.
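The "local" approach can be illustrated with a toy co-occurrence check. The sketch assumes, as an illustration that the slide itself does not spell out, that attributes which are each frequent across schemas but never appear together are candidate synonyms. The function name, the support threshold, and the sample schemas are hypothetical, and real correlation measures are more refined than a zero co-occurrence test.

```python
from itertools import combinations


def candidate_synonyms(schemas, min_support=2):
    """Return attribute pairs that are each frequent but never co-occur."""
    attrs = sorted({a for s in schemas for a in s})
    support = {a: sum(a in s for s in schemas) for a in attrs}
    pairs = []
    for a, b in combinations(attrs, 2):
        together = sum(a in s and b in s for s in schemas)
        if together == 0 and min(support[a], support[b]) >= min_support:
            pairs.append((a, b))
    return pairs


# Hypothetical book-domain schemas.
schemas = [
    {"author", "title", "subject"},
    {"writer", "title", "category"},
    {"author", "title", "category"},
    {"writer", "title", "subject"},
]
print(candidate_synonyms(schemas))
```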


Technical Challenges

3. Source Querying, Crawling & Search

How to query a source? How to crawl all objects and to search them?


Source Querying: Circa 2000

Focus: Mediation of cross-source, join-able queries
  Query rewriting, planning. Extensive study: e.g., [LevyRO96, AmbiteKMP01, Halevy01].
Focus: Execution & optimization of queries
  Adaptive, speculative query optimization; e.g., [NaughtonDM+01, BarishK03, IvesHW04].


Source Querying: for Large Scale Integration

1. Metaquerying model:
  Focus: On-the-fly querying.
    MetaQuerier Query Assistant [ZhangHC05].
2. Vertical-search-engine model:
  Focus: Source crawling to collect objects.
    Form submission by query generation/selection, e.g., [RaghavanG01, WuWLM06].
  Focus: Object search and ranking [NieZW+05].


On-the-fly Querying [ZhangHC05]
Type-locality based Predicate Translation
  Correspondences occur within localities
  Translation by type-handler
[Figure: inputs Source predicate s and Target template P; components Predicate Mapper, Type Recognizer, Domain Specific Handler, Text Handler, Numeric Handler, Datetime Handler; output Target Predicate t*]
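Translation by type-handler can be pictured as a dispatch on the recognized type of the source predicate. The sketch below is an illustration only and not the algorithm of [ZhangHC05]: the dictionary layout of predicates and templates, the two handlers, and the price example are all hypothetical.

```python
def numeric_handler(source, template):
    """Map a numeric range onto the target's predefined ranges that overlap it."""
    lo, hi = source["value"]
    return [r for r in template["ranges"] if r[0] < hi and lo < r[1]]


def text_handler(source, template):
    """Split into keywords for an 'any' operator, else keep the whole phrase."""
    words = source["value"].split()
    return words if template["operator"] == "any" else [" ".join(words)]


HANDLERS = {"numeric": numeric_handler, "text": text_handler}


def recognize_type(source):
    """Type recognizer: a (lo, hi) tuple is numeric; anything else is text."""
    return "numeric" if isinstance(source["value"], tuple) else "text"


def translate(source, template):
    """Dispatch the source predicate to the handler for its recognized type."""
    return HANDLERS[recognize_type(source)](source, template)


# Hypothetical source predicate s (price between 0 and 35) and target template P.
s = {"attribute": "price", "value": (0, 35)}
P = {"attribute": "price range", "ranges": [(0, 25), (25, 45), (45, 65)]}
print(translate(s, P))
```

Keeping the handlers in a lookup table means a domain-specific or datetime handler could be added without touching `translate`.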


Source Crawling by Query Selection [WuWL+06]

Author   Title         Category
Ullman   Compiler      System
Ullman   Data Mining   Application
Ullman   Automata      Theory
Han      Data Mining   Application

[Figure: graph over the values Ullman, Han, Compiler, Automata, Data Mining, Application, Theory, System]

Conceptually, the DB as a graph:
  Node: attribute value
  Edge: occurrence relationship
Crawling is transformed into a graph traversal problem: find a set of nodes N in the graph G such that for every node i in G there exists a node j in N with j -> i, and the total cost of the nodes in N is minimum.
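The node-selection problem stated above resembles a weighted dominating-set problem, for which a greedy heuristic is a common first attempt. The sketch below is such a heuristic and not the query selection method of [WuWL+06]. It assumes that a node also covers itself, that every query costs 1, and that an edge links two values appearing in the same record; the function names are hypothetical.

```python
from itertools import combinations


def build_graph(records):
    """Values are nodes; two values in the same record share an edge."""
    graph = {}
    for rec in records:
        for v in rec:
            graph.setdefault(v, set())
        for a, b in combinations(rec, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_query_set(graph, cost):
    """Repeatedly pick the node that covers the most uncovered nodes per unit cost."""
    uncovered, chosen = set(graph), []
    while uncovered:
        best = max(sorted(graph),
                   key=lambda n: len(({n} | graph[n]) & uncovered) / cost[n])
        chosen.append(best)
        uncovered -= {best} | graph[best]
    return chosen


records = [
    ("Ullman", "Compiler", "System"),
    ("Ullman", "Data Mining", "Application"),
    ("Ullman", "Automata", "Theory"),
    ("Han", "Data Mining", "Application"),
]
graph = build_graph(records)
cost = {node: 1 for node in graph}  # assumed uniform query cost
print(greedy_query_set(graph, cost))
```

On this table the first pick is Ullman, which reaches every value except Han; one more query on a value that co-occurs with Han completes the cover. Greedy selection gives no optimality guarantee in general.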


Object Ranking: Object Relationship Graph [NieZW+05]

Popularity Propagation Factor for each type of relationship link

Popularity of an object is also affected by the popularity of the Web pages containing the object
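The two ideas on this slide, a propagation factor per link type and a contribution from Web-page popularity, can be combined in a PageRank-style iteration. The update rule below is a simplification written for illustration and should not be read as the exact formula of [NieZW+05]. The damping weight `eps`, the factor values, and the tiny graph are hypothetical.

```python
def propagate(links, ppf, web_pop, eps=0.15, iters=50):
    """links: list of (src, dst, link_type). Returns a popularity score per object."""
    out_deg = {}
    for src, _, link_type in links:
        out_deg[(src, link_type)] = out_deg.get((src, link_type), 0) + 1
    score = dict(web_pop)
    for _ in range(iters):
        # Every object keeps a share of its Web popularity ...
        new = {obj: eps * web_pop[obj] for obj in web_pop}
        # ... and receives popularity along each link, scaled by the link type's factor.
        for src, dst, link_type in links:
            share = score[src] / out_deg[(src, link_type)]
            new[dst] += (1 - eps) * ppf[link_type] * share
        score = new
    return score


# Hypothetical graph: author a1 wrote papers p1 and p2; p2 cites p1.
links = [("p2", "p1", "cites"), ("a1", "p1", "writes"), ("a1", "p2", "writes")]
ppf = {"cites": 0.6, "writes": 0.3}
web_pop = {"p1": 1 / 3, "p2": 1 / 3, "a1": 1 / 3}
scores = propagate(links, ppf, web_pop)
print(sorted(scores, key=scores.get, reverse=True))
```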


Object Ranking: Training Process [NieZW+05]
[Figure: training loop. Inputs: Link Graph, Initial Combination of PPFs, Expert Ranking. Components: PopRank Calculator, Ranking Distance Estimator. Steps: draw a new combination from neighbors; "Better than the best?" (Yes: chosen as the best); "Accept the worse one?" (Yes / No)]
Subgraph selection to approximate rank calculation, for speed.


Technical Challenges

4. Data Extraction

How to extract result pages into relations?


Data Extraction: Circa 2000
Need for rapid wrapper construction well recognized.
[Figure: a Mediator over three Wrappers]
Focus: Semi-automatic wrapper construction.
Techniques:
  Wrapper-mediator architecture [Wiederhold92].
  Manual construction.
  Semi-automatic: Learning-based: HLRT [KushmerickWD97], Stalker [MusleaMK99], Softmealy [HsuD98].


Data Extraction: for Large Scale
[Figure: a Mediator over three Wrappers]
Focus: Even more automatic approaches.
Techniques:
  Semi-automatic: Learning-based: [ZhaoMWRY05], [IRMKS06].
  Automatic: Syntax-based: RoadRunner [MeccaCM01], ExAlg [ArasuG03], DEPTA [LiuGZ03, ZhaiL05].


HLRT Wrapper: the first “Wrapper Induction” [KushmerickWD97]

A manual wrapper:

  ExtractCCs(page P)
    skip past first occurrence of <B> in P
    while next <B> is before next <HR> in P
      for each <lk,rk> in {< <B>,</B> >, < <I>,</I> >}
        skip past next occurrence of lk in P
        extract attribute from P to next occurrence of rk
    return extracted tuples

A generalized wrapper:

  ExecuteHLRT(<h,t,l1,r1,..,lk,rk>, page P)
    skip past first occurrence of h in P
    while next l1 is before next t in P
      for each <lk,rk> in {<l1,r1>,..,<lk,rk>}
        skip past next occurrence of lk in P
        extract attr from P to next occurrence of rk
    return extracted tuples

Wrapper rules (delimiters): h; l1, r1; l2, r2; …; lk, rk; t

[Figure: labeled data goes into the Induction Algorithm, which outputs the wrapper rules]
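The ExecuteHLRT pseudocode can be turned into a small runnable routine. The sketch below follows the pseudocode's control flow, using a head delimiter, a tail delimiter, and a list of (left, right) pairs. The sample page and the choice of `<P>` as head delimiter are hypothetical, and the early return on a malformed record is an added safeguard that the pseudocode does not mention.

```python
def execute_hlrt(page, h, t, delimiters):
    """Run an HLRT wrapper: head h, tail t, and a list of (left, right) pairs."""
    tuples = []
    pos = page.find(h)
    if pos < 0:
        return tuples
    pos += len(h)                       # skip past the head delimiter
    l1 = delimiters[0][0]
    while True:
        next_l1 = page.find(l1, pos)
        next_t = page.find(t, pos)
        # Stop when no further record starts before the tail delimiter.
        if next_l1 < 0 or (0 <= next_t < next_l1):
            break
        row = []
        for left, right in delimiters:
            i = page.find(left, pos)
            j = page.find(right, i + len(left)) if i >= 0 else -1
            if j < 0:
                return tuples           # malformed record: stop extracting
            row.append(page[i + len(left):j])
            pos = j + len(right)
        tuples.append(tuple(row))
    return tuples


# Hypothetical country-code page.
page = ("<HTML><TITLE>Country Codes</TITLE><BODY><P>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR>"
        "<HR><B>End</B></BODY></HTML>")
print(execute_hlrt(page, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")]))
```

The trailing `<B>End</B>` after `<HR>` shows why the tail delimiter matters: without it the loop would try to read a third record.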


RoadRunner [MeccaCM01]

Basic idea:
  Page generation: filling (encoding) data into a template
  Data extraction: as the reverse, decoding the template
Algorithm:
  Compare two HTML pages at a time, one as wrapper and the other as sample
  Solving the mismatches:
    string mismatch: content slot
    tag mismatch: structure variance
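The string-mismatch half of this comparison is easy to sketch. The code below covers only that case: it aligns two pages token by token and turns differing text into a data slot. Tag mismatches (optionals and repetitions), which are the harder part of RoadRunner, are deliberately not handled and raise an error. The tokenizer, the `#PCDATA` slot marker, and the sample pages are assumptions made for illustration.

```python
import re

TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split a page into tag tokens and text tokens, dropping whitespace-only text."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def generalize(wrapper_page, sample_page):
    """Align two pages token by token; string mismatches become data slots."""
    wrapper, sample = tokenize(wrapper_page), tokenize(sample_page)
    if len(wrapper) != len(sample):
        raise ValueError("tag mismatch (structure variance) is not handled here")
    template = []
    for w, s in zip(wrapper, sample):
        if w == s:
            template.append(w)
        elif not w.startswith("<") and not s.startswith("<"):
            template.append("#PCDATA")   # string mismatch: a content slot
        else:
            raise ValueError("tag mismatch (structure variance) is not handled here")
    return template


a = "<html>Books of: <b>John Smith</b><ul><li>DB Primer</li></ul></html>"
b = "<html>Books of: <b>Paul Jones</b><ul><li>XML at Work</li></ul></html>"
print(generalize(a, b))
```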


RoadRunner


RoadRunner

the template


Technical Challenges

5. System Integration

Putting things together?


Our “system” research often ends up with “components in isolation” [ChangHZ05]


System integration: Sample issues

New challenges:
  How will errors in automatic form extraction impact the subsequent schema matching?
New opportunities:
  Can the result of schema matching help to correct such errors?
  e.g., (adults, children) together form a matching, then?
[Figure: the AA.com query form and the result of extraction]


Current agenda: “Science” of system integration

[Figure: subsystems S_i, S_j, S_k linked by Cascade and Feedback arrows]
New challenge: error cascading
New opportunity: result feedback


Finally, observations
Large scale is not only a challenge, but also an opportunity!


Observation #1: Large scale introduces New Problems!
Several issues arise in the context.
Evidence of new problems:
  Source modeling & selection
  Source querying, crawling, ranking:
    On-the-fly query translation
    Object crawling, ranking
  System integration


Observation #2: Large scale introduces New Semantics!
Relaxed metrics are possible, even for the same problems.
Evidence of new metrics: search-flavored integration, large scale but simplistic
  Function: Simple queries
  Source: Transparency no longer the fundamental doctrine
  User: In the loop of querying
  Techniques: Automatic but error-prone
  Results: Fuzzy, ranked
    meta-querying: ranking of matching sources
    vertical-search-engine: ranking of objects


Observation #3: Large scale introduces New Insights!
The multitude of sources gives a holistic context for study.
Evidence of new insights:
  Schema matching: Many holistic approaches
  Source modeling: “Lego”-based extraction
  System integration: Holistic error correction/feedback


The Web “Trio” (My three circles...)

Integration, Mining, Search


Looking Forward
Recall the first time I heard about Google Base.
DB People: Buckle Up!
Our time has finally come…


Thank You!

For more information:

http://metaquerier.cs.uiuc.edu