Mayssam Sayyadian, AnHai Doan
University of Wisconsin - Madison
Hieu LeKhac
University of Illinois - Urbana
Luis Gravano
Columbia University
Efficient Keyword Search across Heterogeneous Relational Databases
Key Message of Paper
Precise data integration is expensive
But we can do IR-style data integration very cheaply, with no manual cost!
– just apply automatic schema/data matching
– then do keyword search across the databases
– no need to verify anything manually
Already very useful
Build upon keyword search over a single database ...
Keyword Search over a Single Relational Database
A growing field, numerous current works
– DBXplorer [ICDE02], BANKS [ICDE02]
– DISCOVER [VLDB02]
– Efficient IR-style keyword search in databases [VLDB03]
– VLDB-05, SIGMOD-06, etc.
Many related works over XML / other types of data
– XKeyword [ICDE03], XRank [SIGMOD03]
– TeXQuery [WWW04]
– ObjectRank [SIGMOD06]
– TopX [VLDB05], etc.
More are coming at SIGMOD-07 ...
A Typical Scenario
Customers:
tid | custid | name  | contact       | addr
t1  | c124   | Cisco | Michael Jones | …
t2  | c533   | IBM   | David Long    | …
t3  | c333   | MSR   | David Ross    | …

Complaints:
tid | id   | emp-name      | comments
u1  | c124 | Michael Smith | Repair didn't work
u2  | c124 | John          | Deferred work to John Smith

Foreign-key join connects the two tables (custid = id)

Q = [Michael Smith Cisco]
Ranked list of answers:
t1 (c124, Cisco, Michael Jones, …) ⋈ u1 (c124, Michael Smith, Repair didn't work), score = 0.8
t1 (c124, Cisco, Michael Jones, …) ⋈ u2 (c124, John, Deferred work to John Smith), score = 0.7
Our Proposal: Keyword Search across Multiple Databases
IR-style data integration
HR-DB:

Groups:
tid | eid | reports-to
x1  | e23 | e37
x2  | e14 | e37

Employees:
tid | empid | name
v1  | e23   | Mike D. Smith
v2  | e14   | John Brown
v3  | e37   | Jack Lucas

Service-DB:

Customers:
tid | custid | name  | contact       | addr
t1  | c124   | Cisco | Michael Jones | …
t2  | c533   | IBM   | David Long    | …
t3  | c333   | MSR   | Joan Brown    | …

Complaints:
tid | id   | emp-name      | comments
u1  | c124 | Michael Smith | Repair didn't work
u2  | c124 | John          | Deferred work to John Smith

Query: [Cisco Jack Lucas]
Answer: t1 ⋈ u1 in Service-DB, matched across databases to v1 ⋈ x1 ⋈ v3 in HR-DB
(t1: Cisco; u1: Michael Smith; v1: Mike D. Smith; x1: e23 reports to e37; v3: Jack Lucas)
A Naïve Solution
1. Manually identify FK joins across DBs
2. Manually identify matching data instances across DBs
3. Now treat the combination of DBs as a single DB
then apply current keyword search techniques
Just like in traditional data integration, this is too much manual work
Kite Solution
(Same Groups, Employees, Customers, and Complaints tables as on the previous slide.)
Automatically find FK joins / matching data instances across databases
no manual work is required from the user
Automatically Find FK Joins across Databases
Current solutions analyze data values only (e.g., Bellman) → limited accuracy
– e.g., "waterfront" with values yes/no and "electricity" with values yes/no look identical on data alone
Our solution: data analysis + schema matching
– improves accuracy drastically (by as much as 50% F-1)
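The combination of data analysis and schema matching can be sketched as follows. The 50/50 weighting, the Jaccard value overlap, and the name-similarity measure are illustrative assumptions, not Kite's actual matcher:

```python
from difflib import SequenceMatcher

def value_overlap(col_a, col_b):
    """Jaccard overlap of the two columns' value sets (Bellman-style data analysis)."""
    sa, sb = set(col_a), set(col_b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def name_similarity(name_a, name_b):
    """Crude schema-matching signal: string similarity of the attribute names."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

def join_score(name_a, col_a, name_b, col_b, w_data=0.5, w_schema=0.5):
    """Combine data analysis with a schema-matching signal (weights are illustrative)."""
    return (w_data * value_overlap(col_a, col_b)
            + w_schema * name_similarity(name_a, name_b))

# Two yes/no columns look identical on data values alone...
bad = join_score("waterfront", ["yes", "no", "no"],
                 "electricity", ["yes", "yes", "no"])
# ...but schema matching separates them from a genuine FK candidate pair.
good = join_score("empid", ["e23", "e14", "e37"],
                  "emp-id", ["e23", "e37"])
```

Here the data signal alone would rank the spurious yes/no pair highest; adding the schema signal flips the ranking.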
Employees:
tid | empid | name
v1  | e23   | Mike D. Smith
v2  | e14   | John Brown
v3  | e37   | Jack Lucas

Complaints:
tid | id   | emp-name      | comments
u1  | c124 | Michael Smith | Repair didn't work
u2  | c124 | John          | Deferred work to John Smith
Automatic join/data matching can be wrong → incorporate confidence scores into answer scores
Incorporate Confidence Scores into Answer Scores
score(A, Q) = [α·score_kw(A, Q) + β·score_join(A, Q) + γ·score_data(A, Q)] / size(A)
Recall the answer example in the single-DB setting:
t1 (Cisco, Michael Jones) ⋈ u1 (Michael Smith, Repair didn't work), score = 0.8
Recall the answer example in the multiple-DB setting:
t1 ⋈ u1 matched across databases to v1 (Mike D. Smith) ⋈ x1 (e23, e37) ⋈ v3 (Jack Lucas)
– score 0.9 for the FK join, score 0.7 for the data matching
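A minimal sketch of this scoring formula; the dictionary field names, the weights, and the keyword score of 0.8 for the example answer are illustrative assumptions:

```python
def score(answer, alpha=1.0, beta=1.0, gamma=1.0):
    """Kite-style answer score: the IR keyword score is combined with the
    confidence scores of the automatically discovered FK joins and data
    matches, then normalized by answer size (number of joined tuples)."""
    combined = (alpha * answer["score_kw"]
                + beta * answer["score_join"]
                + gamma * answer["score_data"])
    return combined / answer["size"]

# The multi-database answer above: FK-join confidence 0.9, data-matching
# confidence 0.7, five joined tuples (keyword score chosen for illustration).
answer = {"score_kw": 0.8, "score_join": 0.9, "score_data": 0.7, "size": 5}
s = score(answer)
```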
Summary of Trade-Offs
Precise data integration
– the holy grail
– SQL queries
IR-style data integration, naïve way
– manually identify FK joins, matching data
– still too expensive
IR-style data integration, using Kite
– automatic FK join finding / data matching
– cheap
– only approximates the "ideal" ranked list found by the naïve way
Kite Architecture
Offline preprocessing: over databases D1 … Dn, the Foreign-Key Join Finder (a data-based Join Finder plus a Schema Matcher) discovers foreign-key joins across databases; the Index Builder builds IR index1 … IR indexn; a Data-instance Matcher matches data instances across databases.
Online querying: given Q = [Smith Cisco], the Condensed CN Generator and the Top-k Searcher (guided by the refinement rules Partial, Full, and Deep) issue distributed SQL queries over D1 … Dn.
Online Querying
Setting: Database 1 (Relation 1, Relation 2) and Database 2 (Relation 1, Relation 2).
What current solutions do:
1. Create answer templates
2. Materialize answer templates to obtain answers
Create Answer Templates
Find tuples that contain query keywords
– use each DB's IR index
– example: Q = [Smith Cisco]
Tuple sets:
Service-DB: Complaints^Q = {u1, u2}, Customers^Q = {v1}
HR-DB: Employees^Q = {t1}, Groups^Q = {}
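Building these tuple sets can be sketched as follows. Kite consults each database's IR index; this toy version simply scans in-memory strings, and the tuple contents are paraphrased from the slides (note that on these slides Customers holds tuples v1…, Employees t1…):

```python
def tuple_sets(databases, keywords):
    """Build R^Q for every relation R: the set of tuples whose text contains
    at least one query keyword (case-insensitive substring match here)."""
    kws = [k.lower() for k in keywords]
    sets = {}
    for db, relations in databases.items():
        for rel, tuples in relations.items():
            sets[(db, rel)] = {tid for tid, text in tuples.items()
                               if any(k in text.lower() for k in kws)}
    return sets

dbs = {
    "Service-DB": {
        "Customers": {"v1": "c124 Cisco Michael Jones",
                      "v2": "c533 IBM David Long"},
        "Complaints": {"u1": "c124 Michael Smith Repair didn't work",
                       "u2": "c124 John Deferred work to John Smith"},
    },
    "HR-DB": {
        "Employees": {"t1": "e23 Mike D. Smith"},
        "Groups": {"x1": "e23 e37"},
    },
}
ts = tuple_sets(dbs, ["Smith", "Cisco"])
```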
Create the tuple-set graph from the schema graph:
Schema graph: Customers -J1- Complaints, Complaints -J4- Emps, Emps -J2- Groups, Emps -J3- Groups
Tuple-set graph: each relation R is split into R^Q and R^{}, connected by the same joins (J1 between the Customers and Complaints tuple sets, J4 between Complaints and Emps, J2/J3 between Emps and Groups)
(Figure: Service-DB holds Complaints {u1, u2} and Customers {v1, v2, v3}; HR-DB holds Employees {t1, t2, t3} and Groups {x1, x2}.)
Create Answer Templates (cont.)
(Sample tuple-set graph as on the previous slide.)
Search the tuple-set graph to generate answer templates
– also called Candidate Networks (CNs)
Each answer template = one way to join tuples to form an answer
Sample CNs:
CN1: Customers^Q
CN2: Customers^Q -J1- Complaints^Q
CN3: Emps^Q -J2- Groups^{} -J2- Emps^{} -J4- Complaints^Q
CN4: Emps^Q -J2- Groups^{} -J3- Emps^{} -J4- Complaints^Q
Materialize Answer Templates to Generate Answers
By generating and executing SQL queries
CN2: Customers^Q -J1- Complaints^Q   (Customers^Q = {v1}, Complaints^Q = {u1, u2})
SQL: SELECT * FROM Customers C, Complaints P
     WHERE C.cust-id = P.id AND
           (C.tuple-id = v1) AND
           (P.tuple-id = u1 OR P.tuple-id = u2)
Naïve solution
– materialize all answer templates, score, rank, then return answers
Current solutions
– find only top-k answers
– materialize only certain answer templates
– make decisions using refinement rules + statistics
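Generating such SQL from a CN can be sketched as follows; the CN representation (`nodes` with aliases and tuple-id lists, plus `joins` predicates) is an illustrative assumption, not Kite's internal data structure:

```python
def cn_to_sql(cn):
    """Generate the SQL statement that materializes one candidate network:
    join all node tables and restrict each keyword tuple set to its ids."""
    nodes, joins = cn["nodes"], cn["joins"]
    frm = ", ".join(f"{table} {alias}" for alias, table, _ in nodes)
    preds = list(joins)
    for alias, _, tuple_ids in nodes:
        if tuple_ids:  # a keyword tuple set: only these tuples can qualify
            preds.append("(" + " OR ".join(
                f"{alias}.tuple-id = {t}" for t in tuple_ids) + ")")
    return f"SELECT * FROM {frm} WHERE " + " AND ".join(preds)

# CN2 from the slide: Customers^Q = {v1}, Complaints^Q = {u1, u2}.
cn2 = {
    "nodes": [("C", "Customers", ["v1"]), ("P", "Complaints", ["u1", "u2"])],
    "joins": ["C.cust-id = P.id"],
}
sql = cn_to_sql(cn2)
```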
Challenges for the Kite Setting
More databases → way too many answer templates to generate
– can take hours on just 3-4 databases
Materializing an answer template takes way too long
– requires SQL query execution across multiple databases
– invoking each database incurs a large overhead
Difficult to obtain reliable statistics across databases
See paper for our solutions (or backup slides)
Empirical Evaluation
Domains (the slide's table; columns include # DBs, avg # tables per DB, avg # attributes per schema, avg # approximate FK joins total across DBs / per pair, avg # tuples per table, and total size):
DBLP:      2, 3, 3, 11, 6, 11, 500K, 400M
Inventory: 8, 5.8, 5.4, 890, 804, 33.6, 2K, 50M
The DBLP Schemas
DBLP 1: CNF (id, name), CITE (id1, id2), AR (id, title)
DBLP 2: AU (id, name), AR (aid, biblo), PU (aid, uid)
Sample Inventory Schema (Inventory 1): WAREHOUSE, AUTHOR, BOOK, WH2BOOK, CD, ARTIST, WH2CD
Runtime Performance (1)
Legend (algorithms compared): Hybrid algorithm adapted to run over multiple databases; Kite without condensed CNs; Kite without adaptive rule selection and without rule Deep; Kite without rule Deep; full-fledged Kite algorithm.
Runtime vs. maximum CCN size (time in sec, x-axis: maxCCNsize):
– DBLP: 2-keyword queries, k=10, 2 databases
– Inventory: 2-keyword queries, k=10, 5 databases
Runtime vs. # of databases (Inventory, time in sec, x-axis: # of DBs): maximum CCN size = 4, 2-keyword queries, k=10
Runtime Performance (2)
Runtime vs. # of keywords in the query:
– DBLP: max CCN = 6, k = 10, 2 databases (time in sec, x-axis: |q|)
– Inventory: max CCN = 4, k = 10, 5 databases (time in sec, x-axis: |q|)
Runtime vs. # of answers requested:
– Inventory: 2-keyword queries, max CCN = 4, 5 databases (time in sec, x-axis: k = 1 … 30)
Query Result Quality
(Two plots: Pr@k for OR-semantic queries and for AND-semantic queries, k = 1 … 20, Pr@k ranging from 0 to 1.)
Pr@k = the fraction of answers that appear in the “ideal” list
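A minimal sketch of this metric; the answer lists are illustrative:

```python
def precision_at_k(returned, ideal, k):
    """Pr@k: the fraction of the top-k returned answers that also appear in
    the 'ideal' ranked list (the list precise manual integration would give)."""
    ideal_set = set(ideal)
    return sum(1 for a in returned[:k] if a in ideal_set) / k

ideal = ["a1", "a2", "a3", "a4", "a5"]
returned = ["a1", "a3", "x9", "a2", "x7"]
p = precision_at_k(returned, ideal, 5)
```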
Summary
Kite executes IR-style data integration
– performs some automatic preprocessing
– then immediately allows keyword querying
Relatively painless
– no manual work!
– no need to create a global schema, nor to understand SQL
Can be very useful in many settings: e.g., on-the-fly, best-effort, for non-technical people
– enterprises, on the Web, need only a few answers
– emergency (e.g., hospital + police), need answers quickly
Future Directions
Incorporate user feedback → interactive IR-style data integration
More efficient query processing
– large # of databases, network latency
Extend to other types of data
– XML, ontologies, extracted data, Web data

IR-style data integration is feasible and useful
– extends current works on keyword search over DBs
– raises many opportunities for future work
BACKUP
Condensing Candidate Networks
In multi-database settings → unmanageable number of CNs
– many CNs share the same tuple sets and differ only in the associated joins
– group CNs into condensed candidate networks (CCNs)
Sample tuple-set graph (as before) and sample CNs:
CN3: Emps^Q -J2- Groups^{} -J2- Emps^{} -J4- Complaints^Q
CN4: Emps^Q -J2- Groups^{} -J3- Emps^{} -J4- Complaints^Q
Condense: both CNs collapse into one CCN
CCN: Emps^Q -J2- Groups^{} -{J2, J3}- Emps^{} -J4- Complaints^Q
In the condensed tuple-set graph, the two parallel joins between Emps and Groups likewise become a single edge labeled {J2, J3}.
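The grouping step can be sketched as follows, assuming an illustrative CN representation of (tuple-set sequence, join sequence):

```python
from collections import defaultdict

def condense(cns):
    """Group CNs that join the same sequence of tuple sets and differ only
    in which FK join connects each adjacent pair. Each group becomes one
    condensed CN (CCN) whose edges are *sets* of alternative joins."""
    groups = defaultdict(list)
    for tuple_sets, joins in cns:
        groups[tuple(tuple_sets)].append(joins)
    ccns = []
    for tuple_sets, join_lists in groups.items():
        # edge i of the CCN collects the joins used at position i by any CN
        edges = [set(js) for js in zip(*join_lists)]
        ccns.append((list(tuple_sets), edges))
    return ccns

cn3 = (["Emps^Q", "Groups^{}", "Emps^{}", "Complaints^Q"], ["J2", "J2", "J4"])
cn4 = (["Emps^Q", "Groups^{}", "Emps^{}", "Complaints^Q"], ["J2", "J3", "J4"])
ccns = condense([cn3, cn4])
```

CN3 and CN4 collapse into a single CCN whose middle edge is {J2, J3}, matching the slide's example.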
Top-k Search
Main ideas for top-k keyword search:
– no need to materialize all CNs
– sometimes, even partially materializing a CN is enough
– estimate score intervals for CNs, then branch-and-bound search
Iteration 1: CNs with score intervals P [0.6, 1], R [0.4, 0.9], Q [0.5, 0.7]
Iteration 2: P is materialized into P1 [0.6, 0.8], P2 (score 0.9), P3 (score 0.7); K = {P2, P3}, min score = 0.7
Iteration 3: R is materialized into R1 [0.4, 0.6], R2 (score 0.85); Res = {P2, R2}, min score = 0.85
Kite approach: materialize CNs using refinement rules
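The branch-and-bound search above can be sketched as follows. The score intervals mirror the slide's example, but the refinement policy here (fully materializing one CN per step) is a simplification of Kite's refinement rules:

```python
import heapq

def top_k(cns, k):
    """Branch-and-bound top-k: each CN carries a score interval [lo, hi] and
    a list of concrete scored answers. Always refine the CN with the highest
    upper bound; stop once the k-th best answer found so far beats every
    remaining CN's upper bound."""
    frontier = [(-hi, name, answers) for name, (lo, hi), answers in cns]
    heapq.heapify(frontier)
    results = []  # (score, answer_id)
    while frontier:
        neg_hi, name, answers = heapq.heappop(frontier)
        kth = (sorted(results, reverse=True)[k - 1][0]
               if len(results) >= k else None)
        if kth is not None and kth >= -neg_hi:
            break  # no unmaterialized CN can still enter the top-k
        results.extend(answers)  # "materialize" this CN's answers
    return sorted(results, reverse=True)[:k]

# The example's CNs: P [0.6, 1], R [0.4, 0.9], Q [0.5, 0.7].
cns = [("P", (0.6, 1.0), [(0.9, "P2"), (0.7, "P3"), (0.65, "P1")]),
       ("R", (0.4, 0.9), [(0.85, "R2"), (0.5, "R1")]),
       ("Q", (0.5, 0.7), [(0.6, "Q1")])]
best = top_k(cns, 2)
```

As on the slide, the search materializes P, then R, and then prunes Q because its upper bound 0.7 cannot beat the current min score 0.85; the result is {P2, R2}.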
Top-k Search Using Refinement Rules
• In single-database setting: select rules based on database statistics
• In multi-database setting: inaccurate statistics → inappropriate rule selection
Refinement Rules
Full:
– exhaustively extract all answers from a CN (fully materialize it) → too much data to move around the network (data-transfer cost)
Partial:
– try to extract the most promising answer from a CN → invoke remote databases for only one answer (high cost of database invocation)
Deep:
– a middle-ground approach
– once a table in a remote database is invoked, extract all answers involving that table
– takes into account database invocation cost
(Figure: tuple sets T^Q with t1 0.9, t2 0.7, t3 0.4, t4 0.3 and U^Q with u1 0.8, u2 0.6, u3 0.5, u4 0.1; Partial extracts only (t1, u1), while Deep extracts (t1, u1) and (t1, u3).)
Adaptive Search
Question: which refinement rule should we apply next?
– in single-database setting: based on database statistics
– in multi-database setting: inaccurate statistics
Kite approach: adaptively select rules
goodness-score(rule, cn) = benefit(rule, cn) - cost(rule, cn)
– cost(rule, cn): the optimizer's estimated cost for the SQL statements
– benefit(rule, cn): reduced if the rule has been applied for a while without making any progress
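The adaptive selection can be sketched as follows; the benefit-decay schedule and the cost values are assumptions for illustration, not the paper's exact formulas:

```python
def goodness(rule, cn, progress_history, est_cost):
    """goodness-score(rule, cn) = benefit - cost. Cost is the optimizer's
    estimated SQL cost; benefit decays when the rule keeps being applied
    without producing new answers (decay schedule is an assumption)."""
    stalls = sum(1 for made_progress in progress_history if not made_progress)
    benefit = 1.0 / (1 + stalls)  # penalize unproductive rules
    return benefit - est_cost

def pick_rule(candidates):
    """Adaptively choose the (rule, cn) pair with the highest goodness score."""
    return max(candidates, key=lambda c: goodness(*c))

# A productive rule application beats a cheaper but stalled one.
fresh   = ("Deep",    "CN3", [True, True],   0.2)
stalled = ("Partial", "CN3", [False, False], 0.1)
rule, cn, _, _ = pick_rule([fresh, stalled])
```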
Other Experiments
Join Discovery Accuracy
(Left chart: accuracy (F-1) of Join Discovery alone vs. Join Discovery + Schema Matching, on Inventory 1 through Inventory 5. Right chart: runtime (sec) of Kite over a single database vs. max CCN size 1 through 8.)
Schema matching helps improve the join discovery algorithm drastically
Kite also improves on the single-database keyword search algorithm mHybrid