Mayssam Sayyadian, AnHai Doan
University of Wisconsin - Madison
Hieu LeKhac
University of Illinois - Urbana
Luis Gravano
Columbia University
Efficient Keyword Search across Heterogeneous Relational Databases
Key Message of Paper
Precise data integration is expensive
But we can do IR-style data integration very cheaply, with no manual cost!
– just apply automatic schema/data matching
– then do keyword search across the databases
– no need to verify anything manually
Already very useful
Build upon keyword search over a single database ...
Keyword Search over a Single Relational Database
A growing field, numerous current works
– DBXplorer [ICDE02], BANKS [ICDE02]
– DISCOVER [VLDB02]
– Efficient IR-style keyword search in databases [VLDB03]
– VLDB-05, SIGMOD-06, etc.
Many related works over XML / other types of data
– XKeyword [ICDE03], XRank [SIGMOD03]
– TeXQuery [WWW04]
– ObjectRank [SIGMOD06]
– TopX [VLDB05], etc.
More are coming at SIGMOD-07 ...
A Typical Scenario
Customers:
tid | custid | name  | contact       | addr
t1  | c124   | Cisco | Michael Jones | …
t2  | c533   | IBM   | David Long    | …
t3  | c333   | MSR   | David Ross    | …

Complaints:
tid | id   | emp-name      | comments
u1  | c124 | Michael Smith | Repair didn't work
u2  | c124 | John          | Deferred work to John Smith

Foreign-key join connects the two tables (custid = id)

Q = [Michael Smith Cisco]
Ranked list of answers:
t1 (c124, Cisco, Michael Jones, …) ⋈ u1 (c124, Michael Smith, Repair didn't work), score = 0.8
t1 (c124, Cisco, Michael Jones, …) ⋈ u2 (c124, John, Deferred work to John Smith), score = 0.7
Our Proposal: Keyword Search across Multiple Databases
IR-style data integration
HR-DB:

Groups:
tid | eid | reports-to
x1  | e23 | e37
x2  | e14 | e37

Employees:
tid | empid | name
v1  | e23   | Mike D. Smith
v2  | e14   | John Brown
v3  | e37   | Jack Lucas

Service-DB:

Customers:
tid | custid | name  | contact       | addr
t1  | c124   | Cisco | Michael Jones | …
t2  | c533   | IBM   | David Long    | …
t3  | c333   | MSR   | Joan Brown    | …

Complaints:
tid | id   | emp-name      | comments
u1  | c124 | Michael Smith | Repair didn't work
u2  | c124 | John          | Deferred work to John Smith

Query: [Cisco Jack Lucas]
Answer: t1 ⋈ u1 in Service-DB, matched across databases to v1 ⋈ x1 ⋈ v3 in HR-DB
(t1: Cisco; u1: Michael Smith; v1: Mike D. Smith; x1: e23 reports to e37; v3: Jack Lucas)
A Naïve Solution
1. Manually identify FK joins across DBs
2. Manually identify matching data instances across DBs
3. Now treat the combination of DBs as a single DB
then apply current keyword search techniques
Just like in traditional data integration, this is too much manual work
Kite Solution
(Same Groups, Employees, Customers, and Complaints tables as on the previous slide.)
Automatically find FK joins / matching data instances across databases
no manual work is required from the user
Automatically Find FK Joins across Databases
Current solutions analyze data values only (e.g., Bellman) → limited accuracy
– e.g., "waterfront" with values yes/no and "electricity" with values yes/no look identical on data alone
Our solution: data analysis + schema matching
– improves accuracy drastically (by as much as 50% F-1)
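The combination of data analysis and schema matching can be sketched as follows. The 50/50 weighting, the Jaccard value overlap, and the name-similarity measure are illustrative assumptions, not Kite's actual matcher:

```python
from difflib import SequenceMatcher

def value_overlap(col_a, col_b):
    """Jaccard overlap of the two columns' value sets (Bellman-style data analysis)."""
    sa, sb = set(col_a), set(col_b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def name_similarity(name_a, name_b):
    """Crude schema-matching signal: string similarity of the attribute names."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

def join_score(name_a, col_a, name_b, col_b, w_data=0.5, w_schema=0.5):
    """Combine data analysis with a schema-matching signal (weights are illustrative)."""
    return (w_data * value_overlap(col_a, col_b)
            + w_schema * name_similarity(name_a, name_b))

# Two yes/no columns look identical on data values alone...
bad = join_score("waterfront", ["yes", "no", "no"],
                 "electricity", ["yes", "yes", "no"])
# ...but schema matching separates them from a genuine FK candidate pair.
good = join_score("empid", ["e23", "e14", "e37"],
                  "emp-id", ["e23", "e37"])
```

Here the data signal alone would rank the spurious yes/no pair highest; adding the schema signal flips the ranking.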
Employees:
tid | empid | name
v1  | e23   | Mike D. Smith
v2  | e14   | John Brown
v3  | e37   | Jack Lucas

Complaints:
tid | id   | emp-name      | comments
u1  | c124 | Michael Smith | Repair didn't work
u2  | c124 | John          | Deferred work to John Smith
Automatic join/data matching can be wrong → incorporate confidence scores into answer scores
Incorporate Confidence Scores into Answer Scores
score(A, Q) = [α·score_kw(A, Q) + β·score_join(A, Q) + γ·score_data(A, Q)] / size(A)
Recall the answer example in the single-DB setting:
t1 (Cisco, Michael Jones) ⋈ u1 (Michael Smith, Repair didn't work), score = 0.8
Recall the answer example in the multiple-DB setting:
t1 ⋈ u1 matched across databases to v1 (Mike D. Smith) ⋈ x1 (e23, e37) ⋈ v3 (Jack Lucas)
– score 0.9 for the FK join, score 0.7 for the data matching
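A minimal sketch of this scoring formula; the dictionary field names, the weights, and the keyword score of 0.8 for the example answer are illustrative assumptions:

```python
def score(answer, alpha=1.0, beta=1.0, gamma=1.0):
    """Kite-style answer score: the IR keyword score is combined with the
    confidence scores of the automatically discovered FK joins and data
    matches, then normalized by answer size (number of joined tuples)."""
    combined = (alpha * answer["score_kw"]
                + beta * answer["score_join"]
                + gamma * answer["score_data"])
    return combined / answer["size"]

# The multi-database answer above: FK-join confidence 0.9, data-matching
# confidence 0.7, five joined tuples (keyword score chosen for illustration).
answer = {"score_kw": 0.8, "score_join": 0.9, "score_data": 0.7, "size": 5}
s = score(answer)
```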
Summary of Trade-Offs
Precise data integration
– the holy grail
– SQL queries
IR-style data integration, naïve way
– manually identify FK joins, matching data
– still too expensive
IR-style data integration, using Kite
– automatic FK join finding / data matching
– cheap
– only approximates the "ideal" ranked list found by the naïve way
Kite Architecture
Offline preprocessing: over databases D1 … Dn, the Foreign-Key Join Finder (a data-based Join Finder plus a Schema Matcher) discovers foreign-key joins across databases; the Index Builder builds IR index1 … IR indexn; a Data-instance Matcher matches data instances across databases.
Online querying: given Q = [Smith Cisco], the Condensed CN Generator and the Top-k Searcher (guided by the refinement rules Partial, Full, and Deep) issue distributed SQL queries over D1 … Dn.
Online Querying
Setting: Database 1 (Relation 1, Relation 2) and Database 2 (Relation 1, Relation 2).
What current solutions do:
1. Create answer templates
2. Materialize answer templates to obtain answers
Create Answer Templates
Find tuples that contain query keywords
– use each DB's IR index
– example: Q = [Smith Cisco]
Tuple sets:
Service-DB: Complaints^Q = {u1, u2}, Customers^Q = {v1}
HR-DB: Employees^Q = {t1}, Groups^Q = {}
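Building these tuple sets can be sketched as follows. Kite consults each database's IR index; this toy version simply scans in-memory strings, and the tuple contents are paraphrased from the slides (note that on these slides Customers holds tuples v1…, Employees t1…):

```python
def tuple_sets(databases, keywords):
    """Build R^Q for every relation R: the set of tuples whose text contains
    at least one query keyword (case-insensitive substring match here)."""
    kws = [k.lower() for k in keywords]
    sets = {}
    for db, relations in databases.items():
        for rel, tuples in relations.items():
            sets[(db, rel)] = {tid for tid, text in tuples.items()
                               if any(k in text.lower() for k in kws)}
    return sets

dbs = {
    "Service-DB": {
        "Customers": {"v1": "c124 Cisco Michael Jones",
                      "v2": "c533 IBM David Long"},
        "Complaints": {"u1": "c124 Michael Smith Repair didn't work",
                       "u2": "c124 John Deferred work to John Smith"},
    },
    "HR-DB": {
        "Employees": {"t1": "e23 Mike D. Smith"},
        "Groups": {"x1": "e23 e37"},
    },
}
ts = tuple_sets(dbs, ["Smith", "Cisco"])
```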
Create the tuple-set graph from the schema graph:
Schema graph: Customers -J1- Complaints, Complaints -J4- Emps, Emps -J2- Groups, Emps -J3- Groups
Tuple-set graph: each relation R is split into R^Q and R^{}, connected by the same joins (J1 between the Customers and Complaints tuple sets, J4 between Complaints and Emps, J2/J3 between Emps and Groups)
(Figure: Service-DB holds Complaints {u1, u2} and Customers {v1, v2, v3}; HR-DB holds Employees {t1, t2, t3} and Groups {x1, x2}.)
Create Answer Templates (cont.)
(Sample tuple-set graph as on the previous slide.)
Search the tuple-set graph to generate answer templates
– also called Candidate Networks (CNs)
Each answer template = one way to join tuples to form an answer
Sample CNs:
CN1: Customers^Q
CN2: Customers^Q -J1- Complaints^Q
CN3: Emps^Q -J2- Groups^{} -J2- Emps^{} -J4- Complaints^Q
CN4: Emps^Q -J2- Groups^{} -J3- Emps^{} -J4- Complaints^Q
Materialize Answer Templates to Generate Answers
By generating and executing SQL queries
CN2: Customers^Q -J1- Complaints^Q   (Customers^Q = {v1}, Complaints^Q = {u1, u2})
SQL: SELECT * FROM Customers C, Complaints P
     WHERE C.cust-id = P.id AND
           (C.tuple-id = v1) AND
           (P.tuple-id = u1 OR P.tuple-id = u2)
Naïve solution
– materialize all answer templates, score, rank, then return answers
Current solutions
– find only top-k answers
– materialize only certain answer templates
– make decisions using refinement rules + statistics
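Generating such SQL from a CN can be sketched as follows; the CN representation (`nodes` with aliases and tuple-id lists, plus `joins` predicates) is an illustrative assumption, not Kite's internal data structure:

```python
def cn_to_sql(cn):
    """Generate the SQL statement that materializes one candidate network:
    join all node tables and restrict each keyword tuple set to its ids."""
    nodes, joins = cn["nodes"], cn["joins"]
    frm = ", ".join(f"{table} {alias}" for alias, table, _ in nodes)
    preds = list(joins)
    for alias, _, tuple_ids in nodes:
        if tuple_ids:  # a keyword tuple set: only these tuples can qualify
            preds.append("(" + " OR ".join(
                f"{alias}.tuple-id = {t}" for t in tuple_ids) + ")")
    return f"SELECT * FROM {frm} WHERE " + " AND ".join(preds)

# CN2 from the slide: Customers^Q = {v1}, Complaints^Q = {u1, u2}.
cn2 = {
    "nodes": [("C", "Customers", ["v1"]), ("P", "Complaints", ["u1", "u2"])],
    "joins": ["C.cust-id = P.id"],
}
sql = cn_to_sql(cn2)
```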
Challenges for the Kite Setting
More databases → way too many answer templates to generate
– can take hours on just 3-4 databases
Materializing an answer template takes way too long
– requires SQL query execution across multiple databases
– invoking each database incurs a large overhead
Difficult to obtain reliable statistics across databases
See paper for our solutions (or backup slides)
Empirical Evaluation
Domains (the slide's table; columns include # DBs, avg # tables per DB, avg # attributes per schema, avg # approximate FK joins total across DBs / per pair, avg # tuples per table, and total size):
DBLP:      2, 3, 3, 11, 6, 11, 500K, 400M
Inventory: 8, 5.8, 5.4, 890, 804, 33.6, 2K, 50M
The DBLP Schemas
DBLP 1: CNF (id, name), CITE (id1, id2), AR (id, title)
DBLP 2: AU (id, name), AR (aid, biblo), PU (aid, uid)
Sample Inventory Schema (Inventory 1): WAREHOUSE, AUTHOR, BOOK, WH2BOOK, CD, ARTIST, WH2CD
Runtime Performance (1)
Legend (algorithms compared): Hybrid algorithm adapted to run over multiple databases; Kite without condensed CNs; Kite without adaptive rule selection and without rule Deep; Kite without rule Deep; full-fledged Kite algorithm.
Runtime vs. maximum CCN size (time in sec, x-axis: maxCCNsize):
– DBLP: 2-keyword queries, k=10, 2 databases
– Inventory: 2-keyword queries, k=10, 5 databases
Runtime vs. # of databases (Inventory, time in sec, x-axis: # of DBs): maximum CCN size = 4, 2-keyword queries, k=10
Runtime Performance (2)
Runtime vs. # of keywords in the query:
– DBLP: max CCN = 6, k = 10, 2 databases (time in sec, x-axis: |q|)
– Inventory: max CCN = 4, k = 10, 5 databases (time in sec, x-axis: |q|)
Runtime vs. # of answers requested:
– Inventory: 2-keyword queries, max CCN = 4, 5 databases (time in sec, x-axis: k = 1 … 30)
Query Result Quality
(Two plots: Pr@k for OR-semantic queries and for AND-semantic queries, k = 1 … 20, Pr@k ranging from 0 to 1.)
Pr@k = the fraction of answers that appear in the “ideal” list
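A minimal sketch of this metric; the answer lists are illustrative:

```python
def precision_at_k(returned, ideal, k):
    """Pr@k: the fraction of the top-k returned answers that also appear in
    the 'ideal' ranked list (the list precise manual integration would give)."""
    ideal_set = set(ideal)
    return sum(1 for a in returned[:k] if a in ideal_set) / k

ideal = ["a1", "a2", "a3", "a4", "a5"]
returned = ["a1", "a3", "x9", "a2", "x7"]
p = precision_at_k(returned, ideal, 5)
```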
Summary
Kite executes IR-style data integration
– performs some automatic preprocessing
– then immediately allows keyword querying
Relatively painless
– no manual work!
– no need to create a global schema, nor to understand SQL
Can be very useful in many settings: e.g., on-the-fly, best-effort, for non-technical people
– enterprises, on the Web, need only a few answers
– emergency (e.g., hospital + police), need answers quickly
Future Directions
Incorporate user feedback → interactive IR-style data integration
More efficient query processing
– large # of databases, network latency
Extend to other types of data
– XML, ontologies, extracted data, Web data

IR-style data integration is feasible and useful
– extends current works on keyword search over DBs
– raises many opportunities for future work
BACKUP
Condensing Candidate Networks
In multi-database settings → unmanageable number of CNs
– many CNs share the same tuple sets and differ only in the associated joins
– group CNs into condensed candidate networks (CCNs)
Sample tuple-set graph (as before) and sample CNs:
CN3: Emps^Q -J2- Groups^{} -J2- Emps^{} -J4- Complaints^Q
CN4: Emps^Q -J2- Groups^{} -J3- Emps^{} -J4- Complaints^Q
Condense: both CNs collapse into one CCN
CCN: Emps^Q -J2- Groups^{} -{J2, J3}- Emps^{} -J4- Complaints^Q
In the condensed tuple-set graph, the two parallel joins between Emps and Groups likewise become a single edge labeled {J2, J3}.
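The grouping step can be sketched as follows, assuming an illustrative CN representation of (tuple-set sequence, join sequence):

```python
from collections import defaultdict

def condense(cns):
    """Group CNs that join the same sequence of tuple sets and differ only
    in which FK join connects each adjacent pair. Each group becomes one
    condensed CN (CCN) whose edges are *sets* of alternative joins."""
    groups = defaultdict(list)
    for tuple_sets, joins in cns:
        groups[tuple(tuple_sets)].append(joins)
    ccns = []
    for tuple_sets, join_lists in groups.items():
        # edge i of the CCN collects the joins used at position i by any CN
        edges = [set(js) for js in zip(*join_lists)]
        ccns.append((list(tuple_sets), edges))
    return ccns

cn3 = (["Emps^Q", "Groups^{}", "Emps^{}", "Complaints^Q"], ["J2", "J2", "J4"])
cn4 = (["Emps^Q", "Groups^{}", "Emps^{}", "Complaints^Q"], ["J2", "J3", "J4"])
ccns = condense([cn3, cn4])
```

CN3 and CN4 collapse into a single CCN whose middle edge is {J2, J3}, matching the slide's example.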
Top-k Search
Main ideas for top-k keyword search:
– no need to materialize all CNs
– sometimes, even partially materializing a CN is enough
– estimate score intervals for CNs, then branch-and-bound search
Iteration 1: CNs with score intervals P [0.6, 1], R [0.4, 0.9], Q [0.5, 0.7]
Iteration 2: P is materialized into P1 [0.6, 0.8], P2 (score 0.9), P3 (score 0.7); K = {P2, P3}, min score = 0.7
Iteration 3: R is materialized into R1 [0.4, 0.6], R2 (score 0.85); Res = {P2, R2}, min score = 0.85
Kite approach: materialize CNs using refinement rules
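The branch-and-bound search above can be sketched as follows. The score intervals mirror the slide's example, but the refinement policy here (fully materializing one CN per step) is a simplification of Kite's refinement rules:

```python
import heapq

def top_k(cns, k):
    """Branch-and-bound top-k: each CN carries a score interval [lo, hi] and
    a list of concrete scored answers. Always refine the CN with the highest
    upper bound; stop once the k-th best answer found so far beats every
    remaining CN's upper bound."""
    frontier = [(-hi, name, answers) for name, (lo, hi), answers in cns]
    heapq.heapify(frontier)
    results = []  # (score, answer_id)
    while frontier:
        neg_hi, name, answers = heapq.heappop(frontier)
        kth = (sorted(results, reverse=True)[k - 1][0]
               if len(results) >= k else None)
        if kth is not None and kth >= -neg_hi:
            break  # no unmaterialized CN can still enter the top-k
        results.extend(answers)  # "materialize" this CN's answers
    return sorted(results, reverse=True)[:k]

# The example's CNs: P [0.6, 1], R [0.4, 0.9], Q [0.5, 0.7].
cns = [("P", (0.6, 1.0), [(0.9, "P2"), (0.7, "P3"), (0.65, "P1")]),
       ("R", (0.4, 0.9), [(0.85, "R2"), (0.5, "R1")]),
       ("Q", (0.5, 0.7), [(0.6, "Q1")])]
best = top_k(cns, 2)
```

As on the slide, the search materializes P, then R, and then prunes Q because its upper bound 0.7 cannot beat the current min score 0.85; the result is {P2, R2}.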
Top-k Search Using Refinement Rules
• In single-database setting: select rules based on database statistics
• In multi-database setting: inaccurate statistics → inappropriate rule selection
Refinement Rules
Full:
– exhaustively extract all answers from a CN (fully materialize it) → too much data to move around the network (data-transfer cost)
Partial:
– try to extract the most promising answer from a CN → invoke remote databases for only one answer (high cost of database invocation)
Deep:
– a middle-ground approach
– once a table in a remote database is invoked, extract all answers involving that table
– takes into account database invocation cost
(Figure: tuple sets T^Q with t1 0.9, t2 0.7, t3 0.4, t4 0.3 and U^Q with u1 0.8, u2 0.6, u3 0.5, u4 0.1; Partial extracts only (t1, u1), while Deep extracts (t1, u1) and (t1, u3).)
Adaptive Search
Question: which refinement rule should we apply next?
– in single-database setting: based on database statistics
– in multi-database setting: inaccurate statistics
Kite approach: adaptively select rules
goodness-score(rule, cn) = benefit(rule, cn) - cost(rule, cn)
– cost(rule, cn): the optimizer's estimated cost for the SQL statements
– benefit(rule, cn): reduced if the rule has been applied for a while without making any progress
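The adaptive selection can be sketched as follows; the benefit-decay schedule and the cost values are assumptions for illustration, not the paper's exact formulas:

```python
def goodness(rule, cn, progress_history, est_cost):
    """goodness-score(rule, cn) = benefit - cost. Cost is the optimizer's
    estimated SQL cost; benefit decays when the rule keeps being applied
    without producing new answers (decay schedule is an assumption)."""
    stalls = sum(1 for made_progress in progress_history if not made_progress)
    benefit = 1.0 / (1 + stalls)  # penalize unproductive rules
    return benefit - est_cost

def pick_rule(candidates):
    """Adaptively choose the (rule, cn) pair with the highest goodness score."""
    return max(candidates, key=lambda c: goodness(*c))

# A productive rule application beats a cheaper but stalled one.
fresh   = ("Deep",    "CN3", [True, True],   0.2)
stalled = ("Partial", "CN3", [False, False], 0.1)
rule, cn, _, _ = pick_rule([fresh, stalled])
```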
Other Experiments
Join Discovery Accuracy
(Left chart: accuracy (F-1) of Join Discovery alone vs. Join Discovery + Schema Matching, on Inventory 1 through Inventory 5. Right chart: runtime (sec) of Kite over a single database vs. max CCN size 1 through 8.)
Schema matching helps improve the join discovery algorithm drastically
Kite also improves on the single-database keyword search algorithm mHybrid