1 data integration most slides are borrowed from dr. chen li, uc irvine
TRANSCRIPT
![Page 1: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/1.jpg)
1
Data integration
Most slides are borrowed from Dr. Chen Li, UC Irvine
![Page 2: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/2.jpg)
2
Motivation
Legacy database Plain text files
Biblio sever
Support seamless access to autonomous and heterogeneous information sources.
![Page 3: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/3.jpg)
3
Comparison Shopping
Lowest price of the DVD: “The Matrix”?
Applications
Comparison shopping
Supply-chain management
Supplier 2
… Integrator
Supplier M
Supplier 1
Buyer 2
Buyer M
Buyer 1
…
![Page 4: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/4.jpg)
4
Mediation architecture
Mediator
Wrapper
Source 1
Wrapper
Source 2
Wrapper
Source n
![Page 5: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/5.jpg)
5
• Sources are heterogeneous:– Different data models: relational, object-oriented, XML, …– Different schemas and representations. E.g.,
“Keanu Reeves” or “Reeves, Keanu” or “Reeves, K.” etc.• Describe source contents• Use source data to answer queries• Sources have limited query capabilities• Data quality• Performance• … …
Challenges
![Page 6: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/6.jpg)
6
Research projects
• Garlic (IBM),• Information Manifold (AT&T)• InfoSleuth (MCC),• Tsimmis, InfoMaster (Stanford)• Internet Softbot/Razor/Tukwila (U Wash.)• Hermes (Maryland)• Telegraph / Eddies (UC Berkeley)• Niagara (Univ Wisconsin)• DISCO, Agora (INRIA, France)• SIMS/Ariadne (USC/ISI)• Emerac/Havasu (ASU)
![Page 7: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/7.jpg)
7
Industry
• Nimble Technology• Enosys Markets• IBM• BEA
![Page 8: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/8.jpg)
8
Virtual integration
• Leave the data in the sources• When a query comes in:
– Determine the relevant sources to the query– Break down the query into sub-queries for the sources– Get the answers from the sources, filter them if needed and combine
them appropriately• Data is fresh• Otherwise known as
On Demand Integration
Slides from Dr. Michalis Petropoulos
![Page 9: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/9.jpg)
9
Mediator
Virtual Integration Architecture
DataSource
DataSource
GlobalSchema
LocalSchema
LocalSchema
Query Result
Wrapper Wrapper
End User
Design-Time
MediationLanguage
Mapping Tool
Run-Time
QueryReformulation
Optimization& Execution
XML
Web Services
1
Slides from Dr. Michalis Petropoulos
![Page 10: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/10.jpg)
10
Design-Time
Mediator
Virtual Integration Architecture
DataSource
DataSource
GlobalSchema
LocalSchema
LocalSchema
Query Result
Wrapper Wrapper
End User
MediationLanguage
Mapping Tool
Run-Time
QueryReformulation
Optimization& Execution
XML
Web Services
1
2
Slides from Dr. Michalis Petropoulos
![Page 11: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/11.jpg)
11
Mediator
Virtual Integration Architecture
DataSource
DataSource
GlobalSchema
LocalSchema
LocalSchema
Query Result
Wrapper Wrapper
End User
Design-Time
MediationLanguage
Mapping Tool
Run-Time
QueryReformulation
Optimization& Execution
XML
Web Services
1
2
3
Slides from Dr. Michalis Petropoulos
![Page 12: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/12.jpg)
12
Mediator
Virtual Integration Architecture
DataSource
DataSource
GlobalSchema
LocalSchema
LocalSchema
Query Result
Wrapper Wrapper
End User
Design-Time
MediationLanguage
Mapping Tool
Run-Time
QueryReformulation
Optimization& Execution
XML
Web Services
1
2
3
4
Slides from Dr. Michalis Petropoulos
![Page 13: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/13.jpg)
13
Mediator
Virtual Integration Architecture
DataSource
DataSource
GlobalSchema
LocalSchema
LocalSchema
Query Result
Wrapper Wrapper
End User
Design-Time
MediationLanguage
Mapping Tool
Run-Time
QueryReformulation
Optimization& Execution
XML
Web Services
1
2
5
3
4
Slides from Dr. Michalis Petropoulos
![Page 14: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/14.jpg)
14
Mediator
Virtual Integration Architecture
DataSource
DataSource
GlobalSchema
LocalSchema
LocalSchema
Query ResultEnd User
Wrapper Wrapper
Design-Time
MediationLanguage
Mapping Tool
Run-Time
QueryReformulation
Optimization& Execution
XML
Web Services
1
2
5
63
4
Slides from Dr. Michalis Petropoulos
![Page 15: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/15.jpg)
15
Outline
• Basics: theories of conjunctive queries• Global-as-view (GAV) approach to data integration• Local-as-view (LAV) approach to data integration
![Page 16: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/16.jpg)
16
Conjunctive Queries (CQ’s) in Datalog
• Most common form of query; equivalent to select-project-join (SPJ) queries• Useful for data integration• Form: q(X) :- p1(X1), p2(X2),…, pn(Xn).• Head q(X) represents the query answers• Body p1(X1), p2(X2),…, pn(Xn) represents the query conditions
– The head is true if all the subgoals are true.– Each pi(Xi) is called a subgoal. Xi is a vector of variables or constants.– Shared variables represent join conditions– Constants represent “Attribute=const” selection conditions– A relation can appear in multiple predicates (subgoals)
head
body
subgoals
q(X) :- p1(X1), p2(X2), …, pn(Xn)
![Page 17: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/17.jpg)
17
Conjunctive queries
• Head and subgoals are atoms.• An atom consists of a predicate applied to zero or more
arguments• Predicates represent relations.• An atom is true for given values of its variables iff the
arguments form a tuple of the relation.• Whenever an assignment of values to all variables makes all
subgoals true, the rule asserts that the resulting head is also true.
![Page 18: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/18.jpg)
18
Conjunctive Queries: example
• Schemastudent(name, courseNum), course(number, Instructor)
• SQL SELECT name FROM student, course WHERE student.courseNum=course.number AND instructor=‘Li’;• Equal to:
ans(SN) :- student(SN, CN), course(CN,’Li’).
– Predicates student and course correspond to relations names– Two subgoals: student(SN, CN) and course(CN,’Li’)– Variables: SN, CN. Constant: ‘Li’– Shared variable, CN, corresponds to “student.courseNum=course.number”– Variable SN in the head: the answer to the query
![Page 19: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/19.jpg)
19
Why not SQL
• Datalog is more concise• Let us state some general principles
– e.g., containment of rules that are almost impossible to state correctly in SQL.
– Will see that later• Recursion is much easier to express in Datalog.
![Page 20: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/20.jpg)
20
Answer to a CQ
• For a CQ Q on database D, the answer Q(D) is a set of heads of Q if we:– Substitute constants for variables in the body of Q in all possible ways– Require all subgoals to be true
• Example: ans(SN) :- student(SN, CN), course(CN,’Li’).– Tuples are also called facts:
student(Jack, 184), student(Tom,215), …, course(184,Li), course(215,Li), …– Answer “Jack”: SNJack,CN184– Answer “Tom”: SNTom,CN215– Answer “Jack”: SNJack,CN215 (duplicate eliminated)
Number Instructor 184 Li 215 Li 214 Mehrotra 252 Gupta
Student Course
![Page 21: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/21.jpg)
21
Query containment
• For two queries Q1 and Q2, we say Q1 is contained in Q2, denoted Q1Q2, if any database D, we have Q1(D) Q2(D).
• We say Q1 and Q2 are equivalent, denoted Q1Q2, if Q1(D) Q2(D) and Q2(D) Q1(D).
• Example: Q1: ans(SN) :- student(SN, CN), course(CN, ’Li’).
Q2: ans(SN) :- student(SN, CN), course(CN, INS).
We have: Q1(D) Q2(D).
![Page 22: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/22.jpg)
22
Another example
Q1: p(X,Y) :- r(X,W), b(W,Z), r(Z,Y).
Q2: p(X,Y) :- r(X,W), b(W,W), r(W,Y).
• We have: Q2 Q1
• Proof:– For any DB D, suppose p(x,y) is in Q2(D). Then there is a w such that
r(x,w), b(w,w), and r(w,y) are in D.– For Q1, consider the substitution: X x, W w, Z w, Y y.
– Thus the head of Q1 becomes p(x,y), meaning that p(x,y) is also in Q1(D).
• In general, how to test containment of CQ’s?– Containment mappings
![Page 23: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/23.jpg)
23
Test containment
• Two approaches:1. Containment mappings.2. Canonical databases.
• Really the same in the simple CQ case covered so far.• Containment test is NP-complete, but CQ’s tend to be small so
here is one case where intractability doesn’t hurt you.
![Page 24: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/24.jpg)
24
Containment mappings• A containment mapping from Q2 to Q1: Map variables of Q2 to
variables of Q1, such that:– Head of Q2 becomes head of Q1;– Each subgoal of Q2 becomes some subgoal of Q1.
• It is not necessary that every subgoal of Q1 is the target of some subgoal of Q2.
• Q1 Q2 iff there is a containment mapping from Q2 to Q1. – Note that the containment mapping is opposite the containment --- it goes
from the larger (containing CQ) to the smaller (contained CQ).• Example:
Q1: p(X,Y) :- r(X,W), b(W,Z), r(Z,Y).
Q2: p(X,Y) :- r(X,W), b(W,W), r(W,Y).– Containment mapping from Q1 to Q2: X X, Y Y, W W, Z W– No containment mapping from Q2 to Q1:
• For b(W,W) in Q2, its only possible target in Q1 is b(W,Z)• However, we cannot have a mapping WW and WZ, since each variable cannot
be mapped to two different variables
![Page 25: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/25.jpg)
25
A slightly different example
Q1: p(X,Y):- r(X,Z), g(Z,Z), r(Z,Y).
Q2: p(A,B):- r(A,C), g(C,D), r(D,B).
Containment mapping m:m(A)=X;m(B)=Y;m(C)=m(D)=Z.
![Page 26: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/26.jpg)
26
Q1: p(X,Y):- r(X,Y), g(Y,Z).Q2: p(A,B):- r(A,B), r(A,C).
Q1 looks for:
Q2 looks for:
Another Example
X ZY
A B
C
![Page 27: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/27.jpg)
27
Q1: p(X,Y):- r(X,Y), g(Y,Z).
Q2: p(A,B):- r(A,B), r(A,C).
Containment mapping:m(A)=X;m(B)=m(C)=Y.
Example - Continued
Notice twosubgoals canmap to one.
And notevery subgoalneed be atarget.
![Page 28: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/28.jpg)
28
Example - Concluded
Q1: p(X,Y):- r(X,Y), g(Y,Z).Q2: p(A,B):- r(A,B), r(A,C).
• No containment mapping from Q1 to Q2.– g(Y,Z) cannot map anywhere, since there is no g subgoal in Q2.
• Thus, Q1 properly contained in Q2.
![Page 29: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/29.jpg)
29
Extending CQ’s• CQ’s with built-in predicates:
– We can add more conditions to variables in a CQ.– Example:
student(name, GPA, courseNum), course(number,instructor,year)Q1(SN) :- student(SN, G, CN), course(CN, ’Li’), G>=3.5.Q2(SN) :- student(SN, G, CN), course(CN, ’Li’), G>=3.5, Y < 2002.Q2(SN) Q1(SN).
• Datalog queries: – a (possibly infinite) set of CQ’s with (possibly) recursion– Example: parent(Parent, Child)– Query: finding all ancestors of Tom
ancestor(P,C) :- parent(P, C). ancestor(P,C) :- ancestor(P,X), parent(X, C).
result(P) :- ancestor(P, ‘tom’).
![Page 30: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/30.jpg)
30
• Although CQ theory first appeared at a database conference, the AI community has taken CQ’s to heart.
• CQ’s, or similar logics like description logic, are used in a number of AI applications.– Again, their design theory is really containment and equivalence.
![Page 31: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/31.jpg)
31
Outline
• Basics: theories of conjunctive queries• Global-as-view (GAV) approach to data
integration• Local-as-view (LAV) approach to data integration
![Page 32: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/32.jpg)
32
GAV approach to data integration
• Readings:– Jeffrey Ullman, Information Integration Using Logical Views, ICDT
1997. – Ramana Yerneni, Chen Li, Hector Garcia-Molina, and Jeffrey
Ullman, Computing Capabilities of Mediators, SIGMOD 1999.
![Page 33: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/33.jpg)
33
Global-as-view Approach
Mediator
• Mediator exports views defined on source relationsmed(Dealer,City,Make,Year) = R1 R2
• A query is posted on mediator views:SELECT * FROM medWHERE Year = ‘2001’;
ans(D,C,M, ‘2001’) :- med(D,C,M,‘2001’).• Mediator expands query to source queries:
SELECT * FROM R1, R2 WHERE Year = ‘2001’;
ans(D,C,M,’2001’) :- R1(D,C), R2(D,M, ‘2001’).
R1(Dealer,City) R2(Dealer, Make, Year)
med(Dealer,City,Make,Year) = R1 R2
![Page 34: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/34.jpg)
34
• Project: TSIMMIS at Stanford• Advantages:
– User queries are easy to define– Query transformation generation is straightforward
• Disadvantages:– Not all source information is exported:
– Not easily scalable: every time a new source is added, mediator views need to be changed.
• Research issues– Efficient query execution?– Deal with limited source capabilities?
GAV Approach
![Page 35: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/35.jpg)
35
Limited source capabilities
Complete scans of relations not possible
Reasons:– Legacy databases or structured files: limited interfaces
– Security/Privacy
– Performance concerns
Example 1: legacy databases with restrictive interfaces
Ullman DBMS
Knuth TeX
… …
author titleGiven an author, return the books.
![Page 36: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/36.jpg)
36
Another example: Web search forms
www.imdb.com
![Page 37: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/37.jpg)
37
Problems
• How to describe source restrictions?• How to compute mediator restrictions from sources? • How to answer queries efficiently given these restrictions?• How to compute as many answers as possible to a query?• …
![Page 38: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/38.jpg)
38
Computing mediator restrictions
• Motivation: do not want users to be frustrated by submitting a query that cannot be answerable by the mediator
• Example:– Source 1: book(author?, title, price)
• Capability: “bff”• i.e., we must provide an author, and can get title and price info
– Source 2: review(title?, reviewer, rate)• Capability: “bff”• i.e., we must provide a book title, and can get other info
– Mediator view: MedView(A?,T,P,RV,RT) :- book(A,T,P),review(T,RV,RT).
– Query on the mediator view:• Ans(RT) :- MedView(A, ‘db’, P, RV, RT).• I.e., “find the review rates of DB books”
– But the mediator cannot answer this query, since we do not know the authors.• We want to tell the user beforehand what queries can be answered
![Page 39: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/39.jpg)
39
Outline
• Basics: theories of conjunctive queries;• Global-as-view (GAV) approach to data integration;• Local-as-view (LAV) approach to data integration.
![Page 40: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/40.jpg)
40
Local-as-view (LAV) approach
Mediator
• There are global predicates, e.g., “car,” “person,” “book,” etc.• They can been seen as mediator views• The content of each source is described using these global predicates• A query to the mediator is also defined on the global predicates• The mediator finds a way to answer the query using the source
contents
sources
![Page 41: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/41.jpg)
41
Example
Mediator
• Global predicates: Loc(Dealer,City),Sell(Dealer,Make,Year)• Source content defined on global predicates:
S1(Dealer,City) :- Loc(Dealer, City).S2(Dealer,Make,Year) :- Sell(Dealer, Make, Year).
In general, each definition could be more complicated, rather than direct copies.• Queries defined on global predicates.
Q: ans(D,M,Y) :- Loc(D, ’windsor’), Sell(D, M, Y).– Users do not know source views.
• The mediator decides how to use source views to answer queries.– “Answering queries using views”:
ans(D, M, Y) :- S1(D,’windsor’), S2(D,M,Y).
S1(Dealer,City) S2(Dealer,Make,Year)
![Page 42: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/42.jpg)
42
Answering queries using views
Mediator
• Source views can be complicated: SPJs, arithmetic comparisons,…• Not easy to decide how to answer a query using source views
Query: ans(D,M) :- Loc(D,‘windsor'), Sell(D,M,Y).
Rewriting: ans(D,M) :- V3(D,‘windsor’, M,Y). ans(D,M) :- V1(D,’windsor’), V2(D,M,Y).
…– “Equivalent rewriting”: compute the “same” answer as the query– A rewriting can join multiple source views
V1(Dealer,City):- Loc(Dealer, City).V2(Dealer,Make,Year):-Sell(Dealer, Make, Year).V3(D,C,M,Y) :- Loc(D,C),Sell(D,M,Y).V4(D,C,M,Y) :- Loc(D,C),Sell(D,M,Y), Y<1970.
Query
![Page 43: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/43.jpg)
43
Arithmetic comparisons
Mediator
• Comparisons can make the problem even trickier• Query: ans(D,M) :- Loc(D,‘windsor'), Sell(D,M,Y).
Rewriting: ans(D,M) :- V(D,‘windsor’, M,Y).
Contained rewriting: only retrieve cars before 1970. • Query: ans(D,M):- Loc(D, ‘windsor'), Sell(D,M,Y), Y < 1960.
Rewriting: ans(D,M) :- V(D,‘windsor’, M, Y), Y < 1960.
V(D,C,M,Y):- Loc(D,C),Sell(D,M,Y),Y<1970.
![Page 44: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/44.jpg)
44
Local-as-View (LAV)
Source1
Source2
Source3
Source4
Source5
Local Schema LocalSchema
LocalSchema
LocalSchema
Global Schema
BookISBNTitleGenreYear
AuthorISBNName
R1ISBNTitleName
Local Schema
R5ISBNTitle
Books before 1970 Humor Books
Create View R1 ASSELECT B.ISBN, B.Title, A.NameFROM Book B, Author AWHERE A.ISBN = B.ISBN AND B.Year < 1970
R1(ISBN, Title, Name):-Book(ISBN, Title, Genre,Year), Author(ISBN, Name), Year<1970.
Create View R5 ASSELECT B.ISBN, B.TitleFROM Book BWHERE B.Genre = ‘Humor’
R5(ISBN, Title):-Book(ISBN, Title, ‘humor’, Year).
![Page 45: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/45.jpg)
45
LAV details
• Query: Find authors of humor books Q(Name):-Book(ISBN,Title,”humor”,YEAR), Author(ISBN, Name)
• Views:R1(ISBN, Title, Name):- Book(ISBN, Title, Genre,Year), Author(ISBN, Name), Year<1970.R5(ISBN, Title) :-Book(ISBN, Title, ‘humor’, Year).
• Rewriting of Q using views:Q’(Name):-R1(ISBN, Title, Name), R2(ISBN, Title)
• Expansion of Q’Q’’(Name):- Book(ISBN, Title, Genre,Year), Author(ISBN, Name), Year<1970, Book(ISBN, Title, ‘humor’, Year).
Q’’’(Name):- Author(ISBN, Name), Year<1970, Book(ISBN, Title, ‘humor’, Year).
• Q’’’ is contained in Q
![Page 46: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/46.jpg)
46
Query Rewritings
• Given a query Q and a set of views V:– A conjunctive query P is called a “rewriting” of Q using
V if P only uses views in V, and P computes a partial answer of Q. That is: Pexp Q. A rewriting is also called a “contained rewriting” (CR).
– A conjunctive query P is called an “equivalent rewriting” (ER) of Q using V if P only uses views in V, and P computes the exact answer of Q. That is: Pexp Q.
![Page 47: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/47.jpg)
47
Bucket algorithm
• It is the basic method for query rewriting
• Each subgoal must be “covered” by some view• Make a list of candidates (buckets) per query subgoal• Consider combinations of candidates from different buckets• Not all combos are “compatible”• Keep the compatible ones and minimize them• Discard the ones contained in another• Take their union
![Page 48: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/48.jpg)
48
The Bucket Algorithm: Example
V1(Std,Crs,Qtr,Title) :- reg(Std,Crs,Qtr), course(Crs,Title), Crs ≥ 500, Qtr ≥ Aut98V2(Std,Prof,Crs,Qtr) :- reg(Std,Crs,Qtr), teaches(Prof,Crs,Qtr)V3(Std,Crs) :- reg(Std,Crs,Qtr), Qtr ≤ Aut94V4(Prof,Crs,Title,Qtr) :- reg(Std,Crs,Qtr), course(Crs,Title), teaches(Prof,Crs,Qtr), Qtr ≤ Aut97
q(S,C,P) :- teaches(P,C,Q), reg(S,C,Q), course(C,T), C ≥ 300, Q ≥ Aut95
Step 1: For each query subgoal, put the relevant sources into a bucket
![Page 49: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/49.jpg)
49
The Bucket Algorithm: Example
V1(Std,Crs,Qtr,Title) :- reg(Std,Crs,Qtr), course(Crs,Title), Crs ≥ 500, Qtr ≥ Aut98V2(Std,Prof,Crs,Qtr) :- reg(Std,Crs,Qtr), teaches(Prof,Crs,Qtr)V3(Std,Crs) :- reg(Std,Crs,Qtr), Qtr ≤ Aut94V4(Prof,Crs,Title,Qtr) :- reg(Std,Crs,Qtr), course(Crs,Title), teaches(Prof,Crs,Qtr), Qtr ≤ Aut97
q(S,C,P) :- teaches(P,C,Q), reg(S,C,Q), course(C,T), C ≥ 300, Q ≥ Aut95
PProf, CCrs, QQtr
Note: Arithmetic predicates don’t pose a problem in this step
V2
Buckets
V4
teaches reg course
![Page 50: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/50.jpg)
50
The Bucket Algorithm: Example
V1(Std,Crs,Qtr,Title) :- reg(Std,Crs,Qtr), course(Crs,Title), Crs ≥ 500, Qtr ≥ Aut98V2(Std,Prof,Crs,Qtr) :- reg(Std,Crs,Qtr), teaches(Prof,Crs,Qtr)V3(Std,Crs) :- reg(Std,Crs,Qtr), Qtr ≤ Aut94V4(Prof,Crs,Title,Qtr) :- reg(Std,Crs,Qtr), course(Crs,Title), teaches(Prof,Crs,Qtr), Qtr ≤ Aut97
q(S,C,P) :- teaches(P,C,Q), reg(S,C,Q), course(C,T), C ≥ 300, Q ≥ Aut95
SStd, CCrs, QQtr
Note: V3 doesn’t work: arithmetic predicates not consistentV4 doesn’t work: S not in the output of V4
V2
Buckets
V4
teaches reg course
V1V2
![Page 51: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/51.jpg)
51
The Bucket Algorithm: Example
V1(Std,Crs,Qtr,Title) :- reg(Std,Crs,Qtr), course(Crs,Title), Crs ≥ 500, Qtr ≥ Aut98V2(Std,Prof,Crs,Qtr) :- reg(Std,Crs,Qtr), teaches(Prof,Crs,Qtr)V3(Std,Crs) :- reg(Std,Crs,Qtr), Qtr ≤ Aut94V4(Prof,Crs,Title,Qtr) :- reg(Std,Crs,Qtr), course(Crs,Title), teaches(Prof,Crs,Qtr), Qtr ≤ Aut97
q(S,C,P) :- teaches(P,C,Q), reg(S,C,Q), course(C,T), C ≥ 300, Q ≥ Aut95
CCrs, TTitle V2
Buckets
V4
teaches reg course
V1V2
V1V4
![Page 52: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/52.jpg)
52
The Bucket Algorithm: Example
Step 2:• Try all combos of views, one each from a bucket• Test satisfaction of arithmetic predicates in each case
– e.g., two views may not overlap, i.e., they may be inconsistent• Desired rewriting = union of surviving ones
Query rewriting 1:
q1(S,C,P) :- V2(S’,P,C,Q), V1(S,C,Q,T’), V1(S”,C,Q’,T)– no problem from arithmetic predicates (none in V2)– May or may not be minimal (why?)
V2V4
teaches reg course
V1V2
V1V4
![Page 53: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/53.jpg)
53
The Bucket Algorithm: Example
Unfolding of rewriting 1:q1’(S,C,P) :- r(S’,C,Q), t(P,C,Q), r(S,C,Q), c(C,T’), r(S”,C,Q’), c(C,T), C ≥ 500, Q ≥ Aut98, C ≥ 500, Q’ ≥ Aut98
• Black r’s can be mapped to green r:S’S, S”S, Q’Q
• Black c can be mapped to green c:just extend above mapping to TT’
Minimized unfolding of rewriting 1:q1m’(S,C,P) :- t(P,C,Q), r(S,C,Q), c(C,T’), C ≥ 500, Q ≥ Aut98Minimized rewriting 1:q1m(S,C,P) :- V2(S’,P,C,Q), V1(S,C,Q,T’)
![Page 54: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/54.jpg)
54
The Bucket Algorithm: Example
Query Rewriting 2:
q2(S,C,P) :- V2(S’,P,C,Q), V1(S,C,Q,T’), V4(P’,C,T,Q’)q2’(S,C,P) :- r(S’,C,Q), t(P,C,Q), r(S,C,Q), r(S,C,Q), c(C,T’), C ≥ 500, Q ≥ Aut98, r(S”,C,Q’), c(C,T), t(P’,C,Q’), Q’ ≤ Aut97• This combo is infeasible: consider the conjunction of arithmetic predicates
in V1 and V4
Query rewriting 3:
q3(S,C,P) :- V2(S’,P,C,Q), V2(S,P’,C,Q), V4(P”,C,T,Q’)
V2V4
teaches reg course
V1V2
V1V4
V2V4
teaches reg course
V1V2
V1V4
![Page 55: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/55.jpg)
55
The Bucket Algorithm: Example
Unfolding of rewriting 3:q3’(S,C,P) :- r(S’,C,Q), t(P,C,Q), r(S,C,Q), t(P’,C,Q), r(S”,C,Q’), c(C,T), t(P”,C,Q’), Q’ ≤ Aut97• The green subgoals can cover the black ones under the mapping: S’S,
S”S, P’P, P”P, Q’Q
Minimized rewriting 3:q3m(S,C,P) :- V2(S,P,C,Q), V4(P,C,T,Q)
Verify that there are only two rewritings that are not covered by others
Maximally Contained Rewriting:q’ = q1m q3m
![Page 56: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/56.jpg)
56
The Bucket Algorithm: Example 2
Query:q(X) :- cites(X,Y), cites(Y,X), sameTopic(X,Y)
Views:V4(A) :- cites(A,B), cites(B,A)V5(C,D) :- sameTopic(C,D)V6(F,H) :- cites(F,G), cites(G,H), sameTopic(F,G)
Note: Should we list V4(X) twice in the buckets?
V4
Buckets
V6
cites cites sameTopic
V4
V6
V5
V6
![Page 57: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/57.jpg)
57
Bucket algorithm• Query:
q(x):-car(x), sell(x, d), loc(d, ’windsor’).• Views:
v1(x) :- car(x).v2(x) :- car(x), sell(x, d).v3(x,d) :- sell(x, d), loc(d, ’windsor’).v4(x) :- sell(x, d), loc(d, ’windsor’).
Car(x) Sell(x,d) Loc(d,’windsor’)
V1(x)
V2(x) v2(x)
V3(x,d) V3(x,d)
V4(x) V4(x)
q(x):-v1(x), v2(x), v3(x,d).
q(x):-v1(x), v3(x,d).
q(x):-v1(x), v4(x).
q(x):-v2(x), v3(x,d).
q(x):-v2(x), v4(x).
…
![Page 58: 1 Data integration Most slides are borrowed from Dr. Chen Li, UC Irvine](https://reader033.vdocuments.us/reader033/viewer/2022051619/56649e2d5503460f94b1d500/html5/thumbnails/58.jpg)
58
• Projects: Information Manifold, Infomaster, Tukwila, …• Advantages:
– Scalable: new sources easy to add without modifying the mediator views
– All we need to do is to define the new source using the existing mediator views (predicates)
• Disadvantages:– Hard to decide how to answer a query using views
• Reading: Alon Halevy, Answering Queries Using Views: A Survey.
Projects using the LAV approach