IntroductionData Integration
Summary
Data IntegrationCOCS 6421 Advanced Database Systems
Przemyslaw Pawluk
CSE, York University
March 20, 2008
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Agenda
1 IntroductionProblem descriptionProblems
2 Data IntegrationIntegration processTheoretical perspectivePractical solutions
3 SummaryOpen questions and future workConclusionBibliography
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Problem descriptionProblems
Introduction
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Problem descriptionProblems
Why to do it?
Many data sources and applications in one organization
The need of fast access and understanding of informationfrom different sources.
Elimination of bottlenecks in the data-flow (many pointswhere data is transformed).
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Problem descriptionProblems
Why to do it?
Many data sources and applications in one organization
The need of fast access and understanding of informationfrom different sources.
Elimination of bottlenecks in the data-flow (many pointswhere data is transformed).
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Problem descriptionProblems
Why to do it?
Many data sources and applications in one organization
The need of fast access and understanding of informationfrom different sources.
Elimination of bottlenecks in the data-flow (many pointswhere data is transformed).
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Problem descriptionProblems
Goals of data integration
To answer queries in most efficient way. (When we knowexactly the queries we want to answer and what data areavailable)
To discover the knowledge. (When don’t know the queries inadvance or we don’t know all data sources in advance)
In both we need to analyze data and present results accordingto requirements of users.
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Problem descriptionProblems
Goals of data integration
To answer queries in most efficient way. (When we knowexactly the queries we want to answer and what data areavailable)
To discover the knowledge. (When don’t know the queries inadvance or we don’t know all data sources in advance)
In both we need to analyze data and present results accordingto requirements of users.
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Problem descriptionProblems
Goals of data integration
To answer queries in most efficient way. (When we knowexactly the queries we want to answer and what data areavailable)
To discover the knowledge. (When don’t know the queries inadvance or we don’t know all data sources in advance)
In both we need to analyze data and present results accordingto requirements of users.
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Problem descriptionProblems
Problems in data integration
Identification of the best data sourceHow to get data from the source?
Interface to integrate schemesFormulation of queries to different data sources
Problems with dataincomplete and incorrect datainconsistent dataincomprehensible dataduplicatesdifferent formats
Security issues
Quality of DataQuality of Service
reliabilityefficiency. . .
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Problem descriptionProblems
Problems with data–Example
Many data formats i.e. John Brown, Brown J., Brown John
Two different phone no. or addresses where integrityconstraint allows only one
Zip codes, phone numbers, SIN etc. with ‘–‘ or without
Bad news!
AI cannot solve all problems. Knowledge of meaning of the data isrequired!
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Problem descriptionProblems
Problems with data–Example
Many data formats i.e. John Brown, Brown J., Brown John
Two different phone no. or addresses where integrityconstraint allows only one
Zip codes, phone numbers, SIN etc. with ‘–‘ or without
Bad news!
AI cannot solve all problems. Knowledge of meaning of the data isrequired!
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Problem descriptionProblems
Problems with data–Example
Many data formats i.e. John Brown, Brown J., Brown John
Two different phone no. or addresses where integrityconstraint allows only one
Zip codes, phone numbers, SIN etc. with ‘–‘ or without
Bad news!
AI cannot solve all problems. Knowledge of meaning of the data isrequired!
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Problem descriptionProblems
Problems with data–Example
Many data formats i.e. John Brown, Brown J., Brown John
Two different phone no. or addresses where integrityconstraint allows only one
Zip codes, phone numbers, SIN etc. with ‘–‘ or without
Bad news!
AI cannot solve all problems. Knowledge of meaning of the data isrequired!
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Data Integration
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Integration Process
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Elements of the integration process
Understanding
meta-data analysisdiscover inconsistencies and dependencies
Standardization
target schemadata repair (to achieve consistent data)data identification constraints (How to identify object?)
Specification
build mapping(s)
Execution
materialization (ETL)federation (data exchange or query reformulation)indexing
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Elements of the integration process
Understanding
meta-data analysisdiscover inconsistencies and dependencies
Standardization
target schemadata repair (to achieve consistent data)data identification constraints (How to identify object?)
Specification
build mapping(s)
Execution
materialization (ETL)federation (data exchange or query reformulation)indexing
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Elements of the integration process
Understanding
meta-data analysisdiscover inconsistencies and dependencies
Standardization
target schemadata repair (to achieve consistent data)data identification constraints (How to identify object?)
Specification
build mapping(s)
Execution
materialization (ETL)federation (data exchange or query reformulation)indexing
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Elements of the integration process
Understanding
meta-data analysisdiscover inconsistencies and dependencies
Standardization
target schemadata repair (to achieve consistent data)data identification constraints (How to identify object?)
Specification
build mapping(s)
Execution
materialization (ETL)federation (data exchange or query reformulation)indexing
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Techniques
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Materialization
DS1
DS2
DS3
ODS
ETL CDB DM2
DM1
DM3
Present.
The information pipeline[1]
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Materialization
DS1
DS2
DS3
ODS ETL
CDB DM2
DM1
DM3
Present.
The information pipeline[1]
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Materialization
DS1
DS2
DS3
ODS ETL CDB
DM2
DM1
DM3
Present.
The information pipeline[1]
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Materialization
DS1
DS2
DS3
ODS ETL CDB DM2
DM1
DM3
Present.
The information pipeline[1]
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Materialization
DS1
DS2
DS3
ODS ETL CDB DM2
DM1
DM3
Present.
The information pipeline[1]
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Federation
DS1 DS2 DS3
Query processor
Presentation
Do not materialize data
Usually use queryreformulation or unfolding
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Federation
DS1 DS2 DS3
Query processor
Presentation
Do not materialize data
Usually use queryreformulation or unfolding
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Federation
DS1 DS2 DS3
Query processor
Presentation
Do not materialize data
Usually use queryreformulation or unfolding
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Federation
DS1 DS2 DS3
Query processor
Presentation
Do not materialize data
Usually use queryreformulation or unfolding
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Indexing
XML1 XML2 XML3
Index
Query processor
Presentation
Do not materialize data
Use index to find properdocument
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Indexing
XML1 XML2 XML3
Index
Query processor
Presentation
Do not materialize data
Use index to find properdocument
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Indexing
XML1 XML2 XML3
Index
Query processor
Presentation
Do not materialize data
Use index to find properdocument
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Indexing
XML1 XML2 XML3
Index
Query processor
Presentation
Do not materialize data
Use index to find properdocument
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Theoretical perspective
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Views
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Views in data integration
Architecture based on global schema (GS) and set of sources
Data in the data sources (DS)
Global schema as virtual view of DS
Each DS and GS can be expressed in a different language
The goal is to define some mappings between DSs and GS
Two approaches:
Global as viewLocal as view
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Global as view
Global schema expressed in terms of data sources. Each element ofthe global schema is defined as a view over the sources.
Requirement!
Stable set of data sources!
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Global as view
Global schema expressed in terms of data sources. Each element ofthe global schema is defined as a view over the sources.
Requirement!
Stable set of data sources!
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
GAV–benefits
The benefits of GAV:
Easier to understand
Mapping explicitly tells how to retrieve data from DSs
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Local as view
Global schema defined independently from data-sources, and therelationships between the global schema and the sources areexpressed by defining every source as a view over the globalschema.
Requirement!
Stable global schema!
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Local as view
Global schema defined independently from data-sources, and therelationships between the global schema and the sources areexpressed by defining every source as a view over the globalschema.
Requirement!
Stable global schema!
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
LAV–benefits
The immediate benefits of LAV:
Easier describing sources(doesn’t require any knowledge aboutother DS)
New DS can be easily added
Describing precise constraints on the contents of DS anddescribing sources that have different relational structure thanthe GS is easier
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
GAV vs. LAV
The LAV system can be transformed into a GAV one andvice-versa.
LAV to GAV. . .
. . . can be used to derive theprocedural specification fromdeclarative one.
GAV to LAV. . .
. . . can be useful to derive adeclarative characterization ofthe content of the sourcesstarting from proceduralspecification.
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
GAV vs. LAV
The LAV system can be transformed into a GAV one andvice-versa.
LAV to GAV. . .
. . . can be used to derive theprocedural specification fromdeclarative one.
GAV to LAV. . .
. . . can be useful to derive adeclarative characterization ofthe content of the sourcesstarting from proceduralspecification.
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
GAV vs. LAV
The LAV system can be transformed into a GAV one andvice-versa.
LAV to GAV. . .
. . . can be used to derive theprocedural specification fromdeclarative one.
GAV to LAV. . .
. . . can be useful to derive adeclarative characterization ofthe content of the sourcesstarting from proceduralspecification.
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Answering queries using views
Depends on chosen technique
GAV
gives simple solution how to get the data. Just unfold!
LAV
uses view-based query rewriting or view-based query answering.More complicated than GAV, uses reasoning in the presence ofincomplete information
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Schema mapping
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Schema mapping
Data exchange–How to move data from one schema to theother one?
Data integration–How to integrate data from many differentschemes?
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Schema mapping in data exchange problem
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
LAV and GAV–Example
In general elements of the set Σ can have a formφ(x) → ∃yΨ(x , y)For example(Student(s)∧Enrolls(s, c)) → ∃t∃g(Teaches(t, c)∧Grade(s, c , g))In LAV there is only one element on the left sideP(x) → ∃yΨ(x , y)In GAV there is only one element on the right sideφ(x) → R(x)
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Schema composing
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Schema inverting in schema evolution
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Schema mapping–Example of changes in schemas
Why evolution?
new atributeadd/remove constraint. . .
Example
new information about student or course requiredchanges in the grading system. . .
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Practical solutions
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Example–banking system
Our point of view: CRM system
Questions:
The purpose of the systemThe source/sources of data for the system?. . .
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Example–Banking systems
Customers DB
WebSys. Data Account DB Contracts
Cards DB
CRM System
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Example–Banking systems
Customers DB
WebSys. Data Account DB Contracts
Cards DBCRM System
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Example–Banking systems
Different formats and accuracy of data
Different Database Management Systems
Different owners (security levels)
. . .
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
What is Master Data?
Facts describing core business entities
Provides business context
Does not imply a particular style of usage
Not everything is Master Data . . .
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
What Is Master Data Management?
An approach that decouples master information fromindividual applications and unifies it.
Central application- and process-neutral resource
Ensures consistent, up-to-date master information acrossbusiness processes, and systems
Simplifies ongoing integration tasks and new applicationdevelopment
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Problems . . .
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Integration processTheoretical perspectivePractical solutions
Proposed solution
Master Data Management brakes walls between the systemsand provides fast access to the consistent information.
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Open questions and future workConclusionBibliography
Summary
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Open questions and future workConclusionBibliography
Open questions
Automatic schema mapping
Reference reconciling
Model management
Peer-to-Peer Data Management
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Open questions and future workConclusionBibliography
Conclusion
Data integration–theoretical as well as business problem
Some commercial solutions are available but not sufficient
Many issues without solution = reach research field
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Open questions and future workConclusionBibliography
Bibliography
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Open questions and future workConclusionBibliography
Brazhnik, O., and Jones, J. F.Anatomy of data integration.J. of Biomedical Informatics 40, 3 (2007), 252–269.
Fagin, R., Kolaitis, P. G., Tan, W.-C., and Popa, L.Composing schema mappings: second-order dependencies tothe rescue.In PODS ’04: Proceedings of the twenty-third ACMSIGMOD-SIGACT-SIGART symposium on Principles ofdatabase systems (New York, NY, USA, 2004), ACM,pp. 83–94.
Haas, L. M.Beauty and the beast: The theory and practice of informationintegration.In ICDT (2007), T. Schwentick and D. Suciu, Eds., vol. 4353of Lecture Notes in Computer Science, Springer, pp. 28–43.
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Open questions and future workConclusionBibliography
Halevy, A., Rajaraman, A., and Ordille, J.Data integration: the teenage years.In VLDB ’06: Proceedings of the 32nd international conferenceon Very large data bases (2006), VLDB Endowment, pp. 9–16.
Halevy, A. Y.Answering queries using views: A survey.The VLDB Journal 10, 4 (2001), 270–294.
Kolaitis, P. G.Schema mappings, data exchange, and metadatamanagement.In PODS ’05: Proceedings of the twenty-fourth ACMSIGMOD-SIGACT-SIGART symposium on Principles ofdatabase systems (New York, NY, USA, 2005), ACM,pp. 61–75.
Lenzerini, M.
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Open questions and future workConclusionBibliography
Data integration: a theoretical perspective.In PODS ’02: Proceedings of the twenty-first ACMSIGMOD-SIGACT-SIGART symposium on Principles ofdatabase systems (New York, NY, USA, 2002), ACM,pp. 233–246.
Pankowski, T.Integracja danych w teorii i praktyce–przeglad problemow irozwiazan.In Bazy Danych: Nowe Technologie (2007), WK, pp. 45–55.
Ra, Y.-G., and Rundensteiner, E. A.A transparent schema-evolution system based onobject-oriented view technology.IEEE Transactions on Knowledge and Data Engineering 9, 4(1997), 600–624.
Shahri, H. H., and Shahri, S. H.
Przemyslaw Pawluk Data Integration
IntroductionData Integration
Summary
Open questions and future workConclusionBibliography
Eliminating duplicates in information integration: An adaptive,extensible framework.IEEE Intelligent Systems 21, 5 (2006), 63–71.
Przemyslaw Pawluk Data Integration