![Page 1: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/1.jpg)
Data Integration: Achievements and Perspectives in the Last Ten Years
AiJing
![Page 2: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/2.jpg)
Outline
• Motivation & Background
• Best Paper: Information Manifold
• Building on the Foundation
• Data Integration Industry
• Future Challenges
• Conclusion
![Page 3: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/3.jpg)
Motivation & Background
Data integration is a pervasive challenge faced in applications that need to query across multiple autonomous and heterogeneous data sources.
Data integration is crucial in large enterprises that own a multitude of data sources.
It is also needed for better cooperation among agencies, each with its own data sources.
![Page 4: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/4.jpg)
Data Integration
[Figure: data integration across legacy databases, services and applications, and enterprise databases]
![Page 6: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/6.jpg)
Ten-Year Best Paper
Querying Heterogeneous Information Sources Using Source Descriptions. VLDB 1996
Alon Halevy was a principal member of technical staff at AT&T Bell Laboratories, and then at AT&T Laboratories.
• Main idea: the Information Manifold
• The paper led to tremendous progress on data integration and to quite a few commercial data integration products.
![Page 7: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/7.jpg)
The Information Manifold
An implemented data integration system
Goal: provide a uniform query interface to a heterogeneous collection of Web data sources.
Main contribution: the way it described the contents of the data sources it knew about.
IM contains declarative descriptions of the contents and capabilities of the information sources (source descriptions).
![Page 8: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/8.jpg)
An example of a complex query
Find reviews of movies directed by Woody Allen that are playing in my area. Answering it requires a join across three web sites:
1. a movie site containing actor and director information (IMDB)
2. sources of movie showtimes (e.g., 777film.com)
3. movie review sites (e.g., a newspaper)
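The three-way join can be sketched in a few lines. This is only an illustration: the rows below are made up, and the three lists stand in for what wrappers around the real sites might return.

```python
# Hypothetical, hand-made extracts standing in for the three web sites.
movies = [  # movie site: (title, director)
    ("Annie Hall", "Woody Allen"),
    ("Manhattan", "Woody Allen"),
    ("Heat", "Michael Mann"),
]
showtimes = [  # showtime source: (title, area)
    ("Manhattan", "Seattle"),
    ("Heat", "Seattle"),
    ("Annie Hall", "Boston"),
]
reviews = [  # review site: (title, review text)
    ("Manhattan", "A love letter to New York."),
    ("Heat", "Tense and stylish."),
]

def reviews_for(director, area):
    """Join the three sources on movie title."""
    directed = {t for t, d in movies if d == director}
    playing = {t for t, a in showtimes if a == area}
    wanted = directed & playing
    return [(t, text) for t, text in reviews if t in wanted]

print(reviews_for("Woody Allen", "Seattle"))
```

The point of a data integration system is that the user writes only the query, not this source-by-source plumbing.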
![Page 9: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/9.jpg)
[Figure: architecture of a data integration system, split into design time and run time]
Components shown: wrappers (one per data source), mediated schema, semantic mappings, query reformulation, optimization & execution
![Page 10: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/10.jpg)
Semantic Mappings
Information sources:
Books(Title, ISBN, Price, DiscountPrice, Edition)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
BookCategories(ISBN, Category)
CDCategories(ASIN, Category)
Artists(ASIN, ArtistName, GroupName)
Authors(ISBN, FirstName, LastName)
Mediated schema:
CD: ASIN, Title, Genre, …
Artist: ASIN, name, …
Mapping logic relates the information sources to the mediated schema.
![Page 11: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/11.jpg)
Global-as-View (GAV) (previous approaches)
Sources: R1, R2, R3, R4, R5
Mediated schema:
CD: ASIN, Title, Genre, …
Artist: ASIN, name, …
Mapping: each relation in the mediated schema is defined as a view over the source relations.
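A minimal sketch of a GAV mapping, assuming the mediated relation CD is defined as a join of the sources CDs and CDCategories. The slide does not show the actual mapping, so this definition and the sample rows are assumptions for illustration.

```python
# Source relations (hypothetical sample rows; column names follow the slides).
CDs = [
    {"Album": "Kind of Blue", "ASIN": "A1", "Price": 12.0},
    {"Album": "Nevermind", "ASIN": "A2", "Price": 10.0},
]
CDCategories = [
    {"ASIN": "A1", "Category": "Jazz"},
    {"ASIN": "A2", "Category": "Rock"},
]

def mediated_cd():
    """Assumed GAV mapping: CD(ASIN, Title, Genre) as a join over the sources."""
    for cd in CDs:
        for cat in CDCategories:
            if cd["ASIN"] == cat["ASIN"]:
                yield {"ASIN": cd["ASIN"], "Title": cd["Album"],
                       "Genre": cat["Category"]}

def titles_by_genre(genre):
    """A query over the mediated schema, answered by unfolding the view."""
    return [row["Title"] for row in mediated_cd() if row["Genre"] == genre]

print(titles_by_genre("Jazz"))
```

With GAV, query answering is view unfolding, but adding a source may require rewriting the definitions of the mediated relations.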
![Page 12: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/12.jpg)
Local-as-View (LAV)
Sources: R1, R2, R3, R4, R5
Mediated schema:
CD: ASIN, Title, Genre, Year
Artist: ASIN, Name, …
Mapping: each source R1, …, R5 is described as a view over the mediated schema (a "mediated view").
![Page 13: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/13.jpg)
Benefits of LAV
Describing information sources became easier:
• a data integration system could accommodate new sources easily.
The descriptions of the information sources could be more precise:
• it became easier to describe precise constraints on the contents of the sources.
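Both benefits can be illustrated with a small sketch. The `SourceDescription` class, the source names, and the equality-only constraints are hypothetical simplifications, not the Information Manifold's actual description language.

```python
from dataclasses import dataclass

@dataclass
class SourceDescription:
    name: str          # hypothetical source name
    relation: str      # mediated relation the source is a view over
    constraints: dict  # attribute -> required value (illustrative only)

catalog = [
    SourceDescription("JazzStore", "CD", {"Genre": "Jazz"}),
    SourceDescription("RockStore", "CD", {"Genre": "Rock"}),
]

def relevant_sources(relation, selection, sources):
    """Keep sources whose constraints do not contradict the query's selection."""
    keep = []
    for s in sources:
        if s.relation != relation:
            continue
        clash = any(a in s.constraints and s.constraints[a] != v
                    for a, v in selection.items())
        if not clash:
            keep.append(s.name)
    return keep

# Adding a source is one new description; existing ones are untouched.
catalog.append(SourceDescription("AllMusic", "CD", {}))

print(relevant_sources("CD", {"Genre": "Jazz"}, catalog))
```

Because each source is described independently, the precise constraint on RockStore lets the system skip it for a jazz query.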
![Page 14: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/14.jpg)
Query reformulation
A query posed over the mediated schema, e.g. over CD(A,T,G), is reformulated into a set of queries on the data sources.
Mediated schema:
CD: ASIN, Title, Genre, …
Data sources:
Books(Title, ISBN, Price, DiscountPrice, Edition)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
BookCategories(ISBN, Category)
CDCategories(ASIN, Category)
Artists(ASIN, ArtistName, GroupName)
Authors(ISBN, FirstName, LastName)
![Page 15: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/15.jpg)
Query Answering in LAV = Answering Queries Using Views (AQUV)
AQUV is a problem that was earlier considered in the context of query optimization.
Given a set of views V1,…,Vn and a query Q, can we answer Q using only the answers to V1,…,Vn?
![Page 16: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/16.jpg)
AQUV
Earlier uses: query optimization and supporting physical data independence.
AQUV for data integration:
• the rewriting is not necessarily equivalent to the query
• the goal is to find a maximally contained rewriting
Main AQUV algorithms: Bucket, Inverse rules, MiniCon
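A toy illustration of why the rewriting is contained rather than equivalent. The relation `CD`, the views `V1` and `V2`, and all rows are hypothetical; the rewriting is written by hand here, not produced by Bucket, inverse rules, or MiniCon.

```python
# Hidden "global" relation CD(asin, title, genre); the system never sees it directly.
CD = {("A1", "Kind of Blue", "Jazz"), ("A2", "Nevermind", "Rock"),
      ("A3", "Blue Train", "Jazz")}

# View extents as the sources happen to expose them (sources may be incomplete).
V1 = {(a, t) for a, t, g in CD if g == "Jazz" and a != "A3"}  # jazz CDs, misses A3
V2 = {(a, t, g) for a, t, g in CD if a == "A2"}               # part of the catalog

def q_true():
    """Q(title) :- CD(asin, title, 'Jazz'), evaluated on the hidden relation."""
    return {t for _, t, g in CD if g == "Jazz"}

def q_rewritten():
    """Hand-written rewriting using only the views: V1 union (V2 filtered to jazz)."""
    return {t for _, t in V1} | {t for _, t, g in V2 if g == "Jazz"}

print(sorted(q_rewritten()), "is a subset of", sorted(q_true()))
```

Every answer the rewriting returns is correct, but "Blue Train" is missed because no source exposes it, which is the best a system limited to the views can do.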
![Page 18: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/18.jpg)
Building on the Foundation
• Generating schema mappings
• Adaptive query processing
• XML
• Model management
• Peer-to-Peer data management
• The role of Artificial Intelligence
![Page 19: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/19.jpg)
Generating Schema Mappings
Observation: who is going to write all these LAV/GAV formulas (the semantic mappings between the sources and the mediated schema)?
1. creating the source descriptions
2. writing the semantic mappings
This was the main bottleneck.
![Page 20: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/20.jpg)
Techniques for Schema Mapping
Semi-automatically generating schema mappings
Goal: create tools that speed up the creation of the mappings and reduce the amount of human effort involved.
Compare schema elements based on:
• linguistic similarities
• overlaps in data values or data types
Schema mapping tasks are often repetitive.
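The two comparison signals above can be sketched with the standard library. The equal weighting in `match_score` is an arbitrary assumption for illustration, and real matchers combine many more signals.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Linguistic similarity of two element names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def value_overlap(xs, ys):
    """Jaccard overlap of two columns' data values."""
    xs, ys = set(xs), set(ys)
    return len(xs & ys) / len(xs | ys) if xs | ys else 0.0

def match_score(name_a, vals_a, name_b, vals_b, w=0.5):
    """Weighted combination; the weight w is an arbitrary illustrative choice."""
    return (w * name_similarity(name_a, name_b)
            + (1 - w) * value_overlap(vals_a, vals_b))

# Compare a source column with a mediated-schema attribute (made-up values).
print(match_score("Album", ["Kind of Blue", "Nevermind"],
                  "Title", ["Nevermind", "Blue Train"]))
```

A tool would present the highest-scoring candidate pairs to a human for confirmation, which is what makes the process semi-automatic.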
![Page 21: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/21.jpg)
A Machine Learning Approach
Map multiple schemas in the same domain to the same mediated schema.
Learn from previous experience:
• use the manually created schema mappings as training data
• generalize from them to predict mappings between unseen schemas.
[Figure: sources mapped to a mediated schema; given matches are used to predict new ones]
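A deliberately simple learner shows the idea: it counts which name tokens co-occurred with which mediated attribute in the given matches, then votes on unseen attributes. The token-voting scheme and the training pairs are assumptions for illustration, not the approach of any particular system.

```python
import re
from collections import Counter, defaultdict

def tokens(name):
    """Split 'DiscountPrice' or 'artist_name' into lowercase tokens."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return [p.lower() for p in parts]

def train(matches):
    """matches: list of (source_attribute, mediated_attribute) pairs made by hand."""
    model = defaultdict(Counter)
    for src, med in matches:
        for tok in tokens(src):
            model[tok][med] += 1
    return model

def predict(model, attribute):
    """Vote over the tokens of an unseen attribute; None if no token is known."""
    votes = Counter()
    for tok in tokens(attribute):
        votes.update(model.get(tok, {}))
    return votes.most_common(1)[0][0] if votes else None

training = [("Album", "Title"), ("AlbumName", "Title"),
            ("ArtistName", "Artist"), ("Category", "Genre")]
model = train(training)
print(predict(model, "album_title"), predict(model, "CDCategory"))
```

Each schema mapped by hand adds training data, so later schemas in the same domain need less manual effort.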
![Page 23: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/23.jpg)
Adaptive query processing
Observation: once we have mappings, how can we execute queries?
Traditional plan-then-execute doesn’t work.
Root cause: the dynamic nature of data integration contexts.
![Page 24: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/24.jpg)
Adaptive query processing
In a data integration system, the context is very dynamic and the optimizer has much less information than in the traditional setting.
Two consequences:
• the optimizer may be unable to choose a good plan
• a plan may turn out to be arbitrarily bad.
Solution: dynamically adjust the query plan during execution.
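A minimal sketch of adjusting a plan while it runs: filters are re-ordered periodically using pass rates observed so far, so no statistics are needed up front. The function name, the `reorder_every` parameter, and the selectivity-only ordering are assumptions; real adaptive engines also weigh operator cost and source latency.

```python
def adaptive_filter(rows, predicates, reorder_every=100):
    """Apply all predicates to each row, periodically re-ordering them so that
    the predicate with the lowest observed pass rate runs first."""
    stats = {p: [0, 0] for p in predicates}  # predicate -> [seen, passed]
    order = list(predicates)
    out = []
    for i, row in enumerate(rows, 1):
        for p in order:
            stats[p][0] += 1
            if not p(row):
                break
            stats[p][1] += 1
        else:
            out.append(row)  # the row passed every predicate
        if i % reorder_every == 0:
            order.sort(key=lambda q: stats[q][1] / max(stats[q][0], 1))
    return out, order

def common(x):
    return x % 2 == 0   # passes half of the rows

def rare(x):
    return x % 50 == 0  # passes few rows

result, final_order = adaptive_filter(range(1000), [common, rare])
print(len(result), final_order[0].__name__)
```

The answer does not depend on the order; only the amount of work does, which is why the order can safely change mid-query.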
![Page 26: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/26.jpg)
XML
Characteristics of XML for data integration:
• XML offered a common syntactic format for sharing data among data sources.
• This raised interest in data integration, since it appeared as if data could actually be shared.
• Integration systems emerged that use XML as the underlying data model and XML query languages (XQuery).
![Page 28: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/28.jpg)
Model Management
Goal: provide an algebra for manipulating schemas and mappings.
With such an algebra, complex operations on data sources become simple sequences of operators in the algebra.
Some of the operators in Model Management: create & compose mappings, merge & diff models.
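To give a feel for such operators, here is a toy version in which a schema is a set of attribute names and a mapping is a dictionary between attribute names. These representations and the example names are assumptions; actual model management operates on far richer models and mappings.

```python
def compose(m1, m2):
    """Compose mappings A->B and B->C into A->C (attribute-level, dict-based)."""
    return {a: m2[b] for a, b in m1.items() if b in m2}

def merge(s1, s2):
    """Merge two schemas (sets of attribute names)."""
    return s1 | s2

def diff(schema, mapping):
    """Part of a schema that the mapping does not cover."""
    return schema - set(mapping)

store_schema = {"Album", "ASIN", "Studio"}
store_to_mediated = {"Album": "Title", "ASIN": "ASIN"}
mediated_to_report = {"Title": "Name"}

# A "complex operation" expressed as a short sequence of operators.
store_to_report = compose(store_to_mediated, mediated_to_report)
unmapped = diff(store_schema, store_to_mediated)
print(store_to_report, unmapped)
```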
![Page 30: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/30.jpg)
Peer Data Management Systems
[Figure: a network of peers (Berkeley, Stanford, DBLP, UW (Washington), UW (Wisconsin), UW (Waterloo), CiteSeer), with queries Q, Q1, …, Q6 shown at the peers]
Mappings between peers: LAV, GLAV
![Page 31: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/31.jpg)
Two Additional Benefits
A P2P architecture offers a truly distributed mechanism for sharing data.
• Every data source only provides semantic mappings to a set of neighbors.
• Complex integrations emerge as the system follows semantic paths.
A P2P architecture is more appropriate than a single mediated schema in a data sharing context.
• There is never a single global mediated schema.
• Data sharing occurs in local neighborhoods of the network.
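Following semantic paths can be sketched as a graph search over pairwise mappings. The peer names are borrowed from the previous slide, but the attribute names and the mappings between them are invented for illustration.

```python
from collections import deque

# Hypothetical pairwise mappings between neighbouring peers:
# (from_peer, to_peer) -> {attribute at from_peer: attribute at to_peer}
mappings = {
    ("UW", "Stanford"): {"paper_title": "title"},
    ("Stanford", "DBLP"): {"title": "name"},
    ("UW", "Berkeley"): {"paper_title": "pub_title"},
}

def translate(attr, start, goal):
    """Follow a chain of neighbour mappings (BFS) to express attr at goal."""
    queue = deque([(start, attr)])
    seen = {start}
    while queue:
        peer, name = queue.popleft()
        if peer == goal:
            return name
        for (src, dst), m in mappings.items():
            if src == peer and dst not in seen and name in m:
                seen.add(dst)
                queue.append((dst, m[name]))
    return None  # no semantic path reaches the goal peer

print(translate("paper_title", "UW", "DBLP"))
```

UW never maps to DBLP directly; the translation emerges from composing the two neighbour mappings along the path.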
![Page 33: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/33.jpg)
The Role of Artificial Intelligence
Description Logics describe relationships between data sources.
• Data sources need to be represented declaratively.
• The mediated schema of IM was based on the Classic Description Logic.
Description Logics offered more flexible mechanisms for representing a mediated schema.
Recent work: combine the expressive power of Description Logics with the ability to manage large amounts of data.
![Page 35: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/35.jpg)
The Data Integration Industry
Late 90’s: commercialization
Enterprise Information Integration (EII): integrating data from multiple sources without having to first load all the data into a central warehouse.
Drivers of the development of the EII industry:
• technologies from research labs had matured enough
• the needs of data management
• XML
Inappropriate alternatives: data warehousing solutions, ad-hoc solutions
![Page 36: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/36.jpg)
A data integration scenario
• Build a mediated (virtual) schema and semantic mappings to the data sources that will participate in the application.
• Applications pose queries over the mediated schema.
Query processing
• A query posed over the virtual schema is turned, by query reformulation, into a query over the data sources.
• It is executed with an engine that creates plans that span multiple data sources.
![Page 37: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/37.jpg)
Other EII Products
XML data model and XQuery
Challenge: the research on integration for XML was only in its infancy.
Customer-relationship management
Challenge: how to provide the customer-facing worker with a global view of a customer whose data resides in multiple sources, and to track information from multiple sources in real time.
![Page 39: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/39.jpg)
Future Challenges
Factors that make data integration challenging:
• Social: data integration is fundamentally about getting people to collaborate and share data.
• Complexity of integration: data integration has been referred to as a problem as hard as AI, maybe even harder!
Our goal: create tools that facilitate data integration in a variety of scenarios.
![Page 40: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/40.jpg)
Several Specific Challenges
Dataspaces: Pay-as-you-go data management
Uncertainty and lineage
Reusing human attention
![Page 41: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/41.jpg)
Dataspaces
Database system: create the schema first!
Data integration system: create the semantic mappings first!
Fundamental shortcoming: long setup time!
Dataspaces: the idea of pay-as-you-go data management
![Page 42: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/42.jpg)
Pay-as-you-go
Offer some services immediately without any setup time, and improve the services as more investment is made into creating semantic relationships.
A dataspace should offer keyword search over any data in any source with no setup time.
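The zero-setup baseline can be sketched as a scan over every value in every source, with no schema or mapping involved. The source names and records below are made up, and a real dataspace system would use an index rather than a linear scan.

```python
def keyword_search(sources, keyword):
    """Scan every value in every source; no schema or mapping is required."""
    kw = keyword.lower()
    hits = []
    for source_name, records in sources.items():
        for i, record in enumerate(records):
            if any(kw in str(v).lower() for v in record.values()):
                hits.append((source_name, i))
    return hits

sources = {  # hypothetical, heterogeneous sources
    "crm": [{"customer": "Ada Lovelace", "city": "London"}],
    "orders": [{"buyer": "A. Lovelace", "item": "Engine"},
               {"buyer": "Bob", "item": "Pen"}],
}

print(keyword_search(sources, "lovelace"))
```

The search finds both records even though nobody has yet stated that `customer` and `buyer` mean the same thing; adding that mapping later is the "pay" that improves the service.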
![Page 43: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/43.jpg)
Pay-as-you-go Data Management
[Figure: benefit plotted against investment (time, cost), comparing dataspaces with data integration solutions]
Dataspaces: Franklin, Halevy, Maier [see PODS 2006]
![Page 45: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/45.jpg)
Uncertain data & data lineage
A necessity in a data integration system:
• introspect about the certainty of the data
• when the system cannot automatically determine the certainty, refer the user to the lineage of the data.
Example: Web search engines provide URLs along with their search results, so users can consider the URLs when deciding which results to explore further.
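In the same spirit as URLs next to search results, an integrated answer can carry the list of sources it came from. This is a minimal sketch; the function name and the sample sources are hypothetical.

```python
from collections import defaultdict

def merge_with_lineage(answers):
    """answers: iterable of (value, source). Group equal values, keep their sources."""
    lineage = defaultdict(list)
    for value, source in answers:
        lineage[value].append(source)
    return dict(lineage)

answers = [("Kind of Blue", "JazzStore"), ("Kind of Blue", "AllMusic"),
           ("Blue Train", "AllMusic")]
for value, sources_seen in merge_with_lineage(answers).items():
    print(value, "from", sources_seen)
```

A user who distrusts one source can then judge each answer by where it came from, even when the system itself cannot assign a certainty.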
![Page 47: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/47.jpg)
Reusing human attention
Goal: achieving tighter semantic integration among data sources.
Any operation a user performs on the data sources gives a semantic clue about the data or about relationships between data sources.
Systems that leverage these semantic clues can obtain semantic integration much faster.
This is an area for additional research and development.
![Page 49: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/49.jpg)
Conclusion
Data integration over time:
• not so long ago: a nice feature and an area for intellectual curiosity
• today: a necessity
Today’s economy further emphasizes the need for data integration solutions.
Thomas Friedman: The World is Flat.
![Page 50: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/50.jpg)
A Framework for Deep Web Integration
[Figure: framework connecting an integrated interface to Deep Web databases (Web DBs)]
Modules: Interface Integration Module, Query Process Module, Result Process Module
Components: WDB Discovery, Interface Schema Extraction, WDB Clustering, Interface Integration, WDB Selection, Query Translation, Query Submission, Results Extraction, Results Annotation, Data Merging
Other elements: Integrated Interface, Deep Web, Web DBs, RDB
Legend: developed issue, developing issue, undeveloped issue, our focuses
![Page 51: Data Integration: Achievements and Perspectives in the Last Ten Years AiJing](https://reader034.vdocuments.us/reader034/viewer/2022051019/5697bf9d1a28abf838c9381c/html5/thumbnails/51.jpg)
Q & A