
1

Information Integration

2

Information Resides on Heterogeneous Information Sources

• different interfaces
• different data representations
• redundant and conflicting information

[Figure: example sources: WWW, Excel, personal database, flat file]

3

Modes of Information Integration

• Federated Databases: the sources are independent, but one source can call on others to supply information

• Data warehouse: copies of data from several sources are stored in a single database, called a data warehouse. The data stored at the warehouse is first processed in some way before storage; e.g. data may be filtered, and relations may be joined or aggregated. As the data is copied from the sources, it may need to be transformed in certain ways to make all data conform to the schema at the data warehouse

4

Modes of Information Integration

• Mediation: a mediator is a software component that supports a virtual database, which the user may query as if it were materialized (physically constructed like a warehouse). The mediator stores no data of its own. Rather, it translates the user's query into one or more queries to its sources. The mediator then synthesizes the answer to the user's query from the responses of those sources, and returns the answer to the user.

5

Problems of Information Integration

Example: The AAAI Automobile Co. has 1000 dealers, each of which maintains a database of its cars in stock. AAAI wants to create an integrated database containing the information of all 1000 sources. The integrated database will help dealers locate a particular model if they don't have one in stock. It can also be used by corporate analysts to predict the market and adjust production to provide the models most likely to sell.

6


• The 1000 dealers do not all use the same database schema:

Cars (serialNo, model, color, autoTrans, cdPlayer, ...)

or Autos (serialNo, model, color)

Options (serialNo, option)

7


• Schema differences

• Different equivalent names

• Data type differences: numbers may be represented by character strings of varying length at one source and fixed length at another

• Value differences: the same concept may be represented by different constants at different sources (BLACK, BL, 100, etc.)

• Semantic differences: terms can be given different interpretations at different sources (e.g. Cars may or may not include trucks)

• Missing values: a source may not record information of a type that all of the other sources provide

8

Goal: System Providing Integrated View of Heterogeneous Data

• collects and combines information
• provides integrated view, uniform user interface

[Figure: an Integration System placed above the sources: WWW, personal database, Excel, flat file]

9

The Data Warehousing Approach to Integration

[Figure: Client, Mediator holding a Stored Integrated View, two Wrappers, and the sources (Excel, flat file)]

10


• Data from several sources is extracted and combined into a global schema

• The data is stored at the warehouse which looks like an ordinary database

• There are three approaches to maintaining the data in the data warehouse:
– off-line reconstruction of the whole data warehouse
– the data warehouse is updated periodically based on the changes made to the original data sources
– the data warehouse is updated immediately

11


Example: Suppose that there are two dealers in the system and that they use the schemas:

Cars (serialNo, model, color, autoTrans, cdPlayer, ...)

and

Autos (serialNo, model, color)
Options (serialNo, option)

Assume a data warehouse with the schema:

AutoWhse (serialNo, model, color, autoTrans, dealer)

12


• The software that extracts data from the dealers' databases and populates the global schema can be written as SQL queries. The query for the first dealer:

insert into AutoWhse(serialNo, model, color, autoTrans, dealer)
select serialNo, model, color, autoTrans, 'dealer1'
from Cars;

The code for the second dealer is more complex since we have to decide whether or not a given car has an automatic transmission.

13


insert into AutoWhse(serialNo, model, color, autoTrans, dealer)
select Autos.serialNo, model, color, 'yes', 'dealer2'
from Autos, Options
where Autos.serialNo = Options.serialNo and
      option = 'autoTrans';

insert into AutoWhse(serialNo, model, color, autoTrans, dealer)
select serialNo, model, color, 'no', 'dealer2'
from Autos
where not exists (
    select * from Options
    where serialNo = Autos.serialNo and
          option = 'autoTrans');
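The extraction queries for the two dealers can be tried end to end. What follows is a minimal sketch using Python's built-in sqlite3 module with an in-memory database; the sample rows, the model name Taiga, and the helper name `load_warehouse` are assumptions made for illustration and are not part of the original example.

```python
import sqlite3

def load_warehouse(conn):
    """Populate AutoWhse from both dealers' schemas (sample rows are invented)."""
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE Cars(serialNo, model, color, autoTrans, cdPlayer);
        CREATE TABLE Autos(serialNo, model, color);
        CREATE TABLE Options(serialNo, option);
        CREATE TABLE AutoWhse(serialNo, model, color, autoTrans, dealer);
        INSERT INTO Cars VALUES (1, 'Gobi', 'red', 'yes', 'no');
        INSERT INTO Autos VALUES (2, 'Gobi', 'blue'), (3, 'Taiga', 'red');
        INSERT INTO Options VALUES (2, 'autoTrans'), (2, 'cdPlayer');
    """)
    # Dealer 1 already stores autoTrans as a column, so a plain copy suffices.
    cur.execute("""INSERT INTO AutoWhse
        SELECT serialNo, model, color, autoTrans, 'dealer1' FROM Cars""")
    # Dealer 2: cars that have an 'autoTrans' row in Options.
    cur.execute("""INSERT INTO AutoWhse
        SELECT Autos.serialNo, model, color, 'yes', 'dealer2'
        FROM Autos, Options
        WHERE Autos.serialNo = Options.serialNo AND option = 'autoTrans'""")
    # Dealer 2: cars with no such row.
    cur.execute("""INSERT INTO AutoWhse
        SELECT serialNo, model, color, 'no', 'dealer2' FROM Autos
        WHERE NOT EXISTS (SELECT * FROM Options
                          WHERE Options.serialNo = Autos.serialNo
                            AND option = 'autoTrans')""")
    return cur.execute("SELECT * FROM AutoWhse ORDER BY serialNo").fetchall()

rows = load_warehouse(sqlite3.connect(":memory:"))
```

Each dealer-2 car lands in the warehouse exactly once, because the two dealer-2 queries partition Autos by the presence of an autoTrans option.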

14

The Wrapper and Mediator Architecture

[Figure: the Client queries the Mediator; the Mediator uses two Wrappers, which export business reports, portfolios for each company, and stock market prices from the sources (Excel, flat file) in a Common Data Model]

15


• A mediator supports a virtual view, or collection of views, that integrates several sources in much the same way that the materialized relation(s) in a data warehouse integrate sources.

• The mediator doesn't store any data.

Example: Let us consider the same scenario. The mediator integrates the same two data sources into a view that is a single relation with the schema:

AutoMed (serialNo, model, color, autoTrans, dealer)

16


Assume the user asks the mediator about the red cars:

select serialNo, model from AutoMed where color = 'red';

The mediator forwards the query, translated to each source's schema, to each of the two wrappers:

(1) select serialNo, model from Cars where color = 'red';
(2) select serialNo, model from Autos where color = 'red';

The mediator can take the union of the answers and return the result to the user.
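The forward-and-union step can be sketched as follows. This is only an illustration under stated assumptions: each source is an in-memory SQLite database, and `WRAPPER_QUERIES`, `make_source`, `mediate`, and the sample rows are hypothetical names and data, not part of the original example.

```python
import sqlite3

# Hypothetical per-dealer translations of the mediator query on AutoMed.
WRAPPER_QUERIES = {
    "dealer1": "SELECT serialNo, model FROM Cars WHERE color = ?",
    "dealer2": "SELECT serialNo, model FROM Autos WHERE color = ?",
}

def make_source(script):
    """Create an in-memory database standing in for one dealer's source."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(script)
    return conn

def mediate(sources, color):
    """Send the translated query to every wrapper and union the answers."""
    answer = set()
    for dealer, conn in sources.items():
        answer |= set(conn.execute(WRAPPER_QUERIES[dealer], (color,)))
    return sorted(answer)

sources = {
    "dealer1": make_source(
        "CREATE TABLE Cars(serialNo, model, color, autoTrans);"
        "INSERT INTO Cars VALUES (1, 'Gobi', 'red', 'yes');"),
    "dealer2": make_source(
        "CREATE TABLE Autos(serialNo, model, color);"
        "INSERT INTO Autos VALUES (2, 'Gobi', 'blue'), (3, 'Taiga', 'red');"),
}
red_cars = mediate(sources, "red")
```

The mediator itself holds no tables here; it only keeps the per-source translations and combines what the sources return.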

17

The Lazy Integration Approach

[Figure: the Client's query passes through the Mediator and two Wrappers to the sources (Excel, flat file); the answers (IBM portfolio, IBM price, IBM related reports) come back in the common model]

Query Decomposition, Translation and Result Fusion

18

Wrappers in Mediator-Based Systems

• In a data warehouse system, the source extractors consist of:
– one or more built-in queries that are executed at the source to produce data for the data warehouse
– communication mechanisms, so that the wrapper can:
  • pass ad-hoc queries to the source
  • receive responses from the source
  • pass information to the warehouse

• Mediator systems require more complex wrappers - the wrapper must be able to accept a variety of queries from the mediator and translate any of them to the terms of the source.

19


• A systematic way to design a wrapper that connects a mediator to a source is to classify the possible queries that the mediator can ask into templates, which are queries with parameters that represent constants.

• The mediator can provide the constants, and the wrapper executes the query with the given constants.

• T => S: the template T is turned into the source query S

Example:

The source of dealer1:

Cars (serialNo, model,color,autoTrans, cdPlayer, ...)

20


Assume we use the mediator with schema:

AutoMed(serialNo,model,color,autoTrans, dealer)

How could the mediator ask the wrapper for cars of a given color?

The template:

select * from AutoMed where color = '$c';
=>
select serialNo, model, color, autoTrans, 'dealer1'
from Cars where color = '$c';

21


• The wrapper could have another template that specifies the parameter $m representing a model

• In general, there would be 2^N templates for N attributes, one for each subset of attributes that a query may constrain

• The number of templates could grow unreasonably large.
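The growth can be checked by counting: a query may fix any subset of the N attributes to constants, and each subset needs its own template, which gives 2^N templates. A small sketch, where `template_count` is a hypothetical helper and the attribute list is taken from the AutoMed schema:

```python
from itertools import combinations

def template_count(attributes):
    """Count the subsets of attributes that a query could constrain."""
    return sum(1 for r in range(len(attributes) + 1)
               for _ in combinations(attributes, r))

attrs = ["serialNo", "model", "color", "autoTrans", "dealer"]
count = template_count(attrs)  # one template per subset, including the empty one
```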

22

Wrapper Generators

• The template defining a wrapper must be turned into code for the wrapper itself - the software that creates the wrapper is called a wrapper generator

• The wrapper generator creates a table that holds the various query patterns contained in the templates, and the source queries that are associated with each.

• A driver is used in each wrapper. The task of the driver is to:
– accept a query from the mediator
– search the table for a template that matches the query
– send the resulting source query to the source using a communication mechanism
– process the response, if necessary, and return it to the mediator
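The table lookup performed by the driver can be sketched as below. This is a simplified illustration, not the output of any particular wrapper generator: templates are stored as regular expressions, `TEMPLATE_TABLE` and `translate` are hypothetical names, and parameters are restricted to word characters so that plain string substitution stays safe.

```python
import re

# Hypothetical template table: mediator-query pattern -> source-query pattern.
TEMPLATE_TABLE = [
    (r"select \* from AutoMed where color\s*=\s*'(?P<c>\w+)'",
     "select serialNo, model, color, autoTrans, 'dealer1' "
     "from Cars where color = '{c}'"),
    (r"select \* from AutoMed where model\s*=\s*'(?P<m>\w+)'",
     "select serialNo, model, color, autoTrans, 'dealer1' "
     "from Cars where model = '{m}'"),
]

def translate(mediator_query):
    """Return the source query for the first matching template, else None."""
    text = mediator_query.strip().rstrip(";")
    for pattern, source_query in TEMPLATE_TABLE:
        match = re.fullmatch(pattern, text, flags=re.IGNORECASE)
        if match:
            # Fill the template parameters with the constants from the query.
            return source_query.format(**match.groupdict())
    return None

source_sql = translate("select * from AutoMed where color = 'red';")
```

A real driver would then send `source_sql` to the source and post-process the response; those two steps are omitted here.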

23

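Wrappers & Mediators from High-Level Specifications

[Figure: a Mediator Specification is given to the Mediator Specification Interpreter to obtain the Mediator; Wrapper Specifications are given to the Wrapper Generator to obtain the Wrappers; the Client queries the Mediator, which reaches the Sources through the Wrappers]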

24

Filters

• Complex template:

select * from AutoMed where color = '$c' and model = '$m';
=>
select serialNo, model, color, autoTrans, 'dealer1'
from Cars where color = '$c' and model = '$m';

• Wrapper filter approach: if the wrapper has a template that returns a superset of what the query wants, then it is possible to filter the result at the wrapper

• Deciding whether a mediator query asks for a subset of what the pattern of some wrapper template returns is a hard problem

25

Filters

Example:

Given the template:

select * from AutoMed where color = '$c';

The mediator needs to find blue cars of the Gobi model:

select * from AutoMed where color = 'blue' and model = 'Gobi';

• use the template with $c = blue to find all blue cars
• store the result in the temporary relation Temp
• select from Temp the Gobis and return the result
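The three filter steps can be sketched in a few lines. The rows in `CARS` are invented, and `color_template` and `filtered_query` are hypothetical names; the only assumption taken from the slide is that the wrapper supports nothing but the color template.

```python
# Invented rows standing in for dealer 1's Cars relation.
CARS = [
    {"serialNo": 1, "model": "Gobi", "color": "blue"},
    {"serialNo": 2, "model": "Gobi", "color": "red"},
    {"serialNo": 3, "model": "Taiga", "color": "blue"},
]

def color_template(c):
    """The only template the wrapper supports: all cars of color c."""
    return [row for row in CARS if row["color"] == c]

def filtered_query(c, m):
    """Answer color = c AND model = m by filtering the template's superset."""
    temp = color_template(c)                           # temporary relation Temp
    return [row for row in temp if row["model"] == m]  # filter at the wrapper

blue_gobis = filtered_query("blue", "Gobi")
```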

26

Other Wrapper Operations

• It is possible to transform the data at the wrapper in different ways

• The mediator is asked to find dealers and models such that the dealer has two red cars, of the same model, one with and one without automatic transmission. Suppose we have only one template as before.

select A1.model, A1.dealer
from AutoMed A1, AutoMed A2
where A1.model = A2.model and A1.color = 'red' and
      A2.color = 'red' and A1.autoTrans = 'no' and
      A2.autoTrans = 'yes';

27

Other Wrapper Operations

• It is possible to answer the query by first obtaining from dealer 1's source a relation with all the red cars (using the original template); call it the RedAutos relation

• The wrapper then evaluates:

select distinct A1.model, A1.dealer
from RedAutos A1, RedAutos A2
where A1.model = A2.model and
      A1.autoTrans = 'no' and
      A2.autoTrans = 'yes';
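A runnable sketch of this two-step evaluation, using Python's sqlite3 module: the Cars rows are invented, and the temporary table plays the role of RedAutos held at the wrapper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Cars(serialNo, model, color, autoTrans);
    INSERT INTO Cars VALUES
        (1, 'Gobi', 'red', 'yes'), (2, 'Gobi', 'red', 'no'),
        (3, 'Taiga', 'red', 'yes'), (4, 'Gobi', 'blue', 'no');
    -- Step 1: the color template with $c = red, stored at the wrapper.
    CREATE TEMP TABLE RedAutos AS
        SELECT serialNo, model, color, autoTrans, 'dealer1' AS dealer
        FROM Cars WHERE color = 'red';
""")
# Step 2: the self-join runs on the temporary relation, not at the source.
pairs = conn.execute("""
    SELECT DISTINCT A1.model, A1.dealer
    FROM RedAutos A1, RedAutos A2
    WHERE A1.model = A2.model
      AND A1.autoTrans = 'no' AND A2.autoTrans = 'yes'
""").fetchall()
```

Only the Gobi qualifies in this sample: it has one red car with and one without automatic transmission, while the red Taiga appears only with it.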

28

Challenge: Sources Without a Well-Structured Schema

• semistructured
– irregular
– deeply nested
– cross-referenced

• incomplete schema knowledge
– autonomous
– dynamic

Examples:
• HTML pages
• SGML documents
• genome data
• chemical structures
• bibliographic information
• results of the integration process

29

Challenge: Different and Limited Source Capabilities

[Figure: the Client sends "retrieve IBM data" to the Mediator (U = A + B), which sends "retrieve IBM data" to Wrapper(A) and to Wrapper(B)]

30

Mediator has to Adapt to Query Capabilities of Sources

[Figure: the Client sends "retrieve IBM data" to the Mediator (U = A + B). Wrapper(B) receives "retrieve IBM data"; source (A) does not allow selection, so Wrapper(A) receives "retrieve everything"]

31

Part B

• Semistructured Data Representation

• Mediator Generation

• Wrapper Generation

• Capabilities-Based Rewriting

32

Representation of Semistructured Information using OEM

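Each OEM object has the form <object-id, label, value>:
• object-id: semantic (e.g. http://www/~doe) or structural (e.g. &f1)
• label
• value: an atomic value or a set value

<http://www/~doe, faculty, {&f1,&l1,&r1}>
<&f1, first_name, "John">
<&l1, last_name, "Doe">
<&r1, rank, "professor">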

33

Object Exchange Model - Goals

• Easy to read

• Easy to edit

• Easy to generate or parse by a program

• Consistency with Stanford's other projects (developed as part of TSIMMIS)

• Possibility of extensions in the future

34

Graph Representation of OEM Data

[Figure: graph whose root object http://www/~doe is labeled faculty and has three children: first_name "John", last_name "Doe", rank "professor"]

<http://www/~doe, faculty, {&f1,&l1,&r1}>
<&f1, first_name, "John">
<&l1, last_name, "Doe">
<&r1, rank, "professor">

35

OEM Structures Represent Arbitrary Labeled Graphs

[Figure: two faculty objects. http://www/~doe has first_name "John", last_name "Doe", rank "professor". http://www/~smith has name "Mary Smith", project "Air DB", and a paper with title "Thin Air DB" and authors named "John Doe" and "Mary Smith"]

36

Representation of Semistructured Data

• Object Exchange Model

• ACeDB

• XML

• These can be used in the layer between the mediator and the wrappers.

37

Object Exchange Model

• Defined during the construction of the TSIMMIS system, which integrates heterogeneous data sources.

• Used in the Merlin (MQS) and Lorel (QL) projects

38

Object Exchange Model (cont.)

• An OEM node consists of four fields:
– Object-ID: used to uniquely identify a particular OEM node
– Label: a character string that describes what the OEM node represents
– Type: the data type of the node's value (atomic or collection)
– Value: either an atomic value or a reference to a collection of OEM nodes

• It is usually represented as: <Object-ID Label Type Value>
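The four-field node can be modeled directly. The sketch below is one possible in-memory encoding, not TSIMMIS code: the class `OEMObject`, its field names, and the helper `to_dict` are assumptions made for illustration, and the sample objects reuse the John Doe example from the earlier slides.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class OEMObject:
    oid: str                       # Object-ID, e.g. "&f1"
    label: str                     # what the object represents
    kind: str                      # "set" or an atomic type such as "string"
    value: Union[str, int, tuple]  # atomic value, or a tuple of child oids

def to_dict(oid, store):
    """Recursively expand an object into nested label -> value form."""
    obj = store[oid]
    if obj.kind != "set":
        return obj.value
    return {store[child].label: to_dict(child, store) for child in obj.value}

objects = [
    OEMObject("http://www/~doe", "faculty", "set", ("&f1", "&l1", "&r1")),
    OEMObject("&f1", "first_name", "string", "John"),
    OEMObject("&l1", "last_name", "string", "Doe"),
    OEMObject("&r1", "rank", "string", "professor"),
]
store = {o.oid: o for o in objects}
doe = to_dict("http://www/~doe", store)
```

The dictionary form is only convenient for tree-shaped data: it drops repeated labels under one parent and would not terminate on cyclic references, both of which OEM graphs permit.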

39

Object Exchange Model – example

(The labels are Polish: Notowanie = chart, NrNotowania = chart number, Rezultaty = results, Miejsce1 = place 1, Utwor = track.)

<&oid1 Notowanie Set {&oid11 &oid12}>
<&oid11 NrNotowania String "4004">
<&oid12 Rezultaty Set {&oid121 &oid122}>
<&oid121 Miejsce1 Set {&oid1211}>
<&oid1211 Utwor String "metropolis">

<&oid1221 Utwor String "money">

<&oid2 Notowanie Set {&oid21 &oid22}>
<&oid21 NrNotowania String "4005">
<&oid22 Rezultaty Set {&oid221 &oid222}>

<&oid2211 Utwor String "learning to fly">

40

Object Exchange Model – features

• Represented as a graph with objects at the vertices and labels on the edges.

• All entities are objects.

• Every object has a unique identifier (oid).

• Two kinds of objects are distinguished: atomic and complex.

41

Object Exchange Model – features (cont.)

• OEM has so-called names, which can be treated as aliases for objects inside the database.

• A name serves as an entry point into the database.

• Every object in the database should be reachable from a name.

42

ACeDB

• ACeDB (A C. elegans Database) was developed as a database of genetic information about organisms.

• Developed since 1989.

• Has its own query language, AQL (Acedb Query Language)

• http://www.acedb.org/

43

ACeDB – features

• The schema and the data can be viewed as a tree with labeled edges.

• Edges may be labeled with any base type (int, Notowanie), e.g.: array Int unique Int

• Many branches may leave a given vertex of the data tree.

• ACeDB allows any label other than the root label to be omitted.

• Object identifiers are supplied by the user.

44

ACeDB – features (cont.)

• ACeDB requires a schema

• Even so, the fact that data may be omitted, and that labeled data is treated uniformly with the other simple types, makes it very close to a semistructured data model.

45

ACeDB – example

>Book title UNIQUE Text
      authors Text
      chapters int UNIQUE Text
      language UNIQUE english
                      french
                      other
      date UNIQUE month Int
                  year Int

&hock2 title "Computer Simulation Using Particles"
       authors "Hockney"
               "Eastwood"
       chapters 1 "Computer Experiments"
                2 "A One-Dimensional Model"
                ...
       language english

46

XML

• Well known and well liked

47

Differences between XML and OEM

• XML is ordered.

• Labels in OEM are used only as a point of reference and to denote relationships between objects. In XML, every element that is not a text string carries a tag (label) that identifies it.

• XML does not directly support graph structure.

48

Overview


• Mediator Generation
• Example of mediator specification
• Language expressiveness
• Implementation and performance



49

Merge Information Relating to a Faculty

• Schema integration
• Information fusion

[Figure: source s1 holds faculty {name "John Doe", rank "professor", papers ...}; source s2 holds person {name "John Doe", birthday "April 1"}; the merged view holds faculty {name "John Doe", rank "professor", birthday "April 1", papers ...}]

50

Mediator Specification Example


<N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1
<N faculty {<L V>}> :- <person {<name N> <L V>}>@s2

[Figure: sources s1 (faculty: name "John Doe", rank "professor", papers ...) and s2]
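The effect of the two rules can be approximated in plain Python. This is a sketch of the intended semantics only, not an MSL interpreter: objects are encoded as (label, subobjects) pairs, the name N becomes the key of the view object, every subobject <L V> is attached to it, and `fuse`, `S1`, and `S2` are hypothetical names with sample data modeled on the slides.

```python
from collections import defaultdict

# Sample source contents; each object is (label, list of (label, value) subobjects).
S1 = [("faculty", [("name", "John Doe"), ("rank", "professor")]),
      ("faculty", [("name", "Mary Smith"), ("project", "Air DB")])]
S2 = [("person", [("name", "John Doe"), ("birthday", "April 1")])]

def fuse(rules):
    """Apply one rule per (source, label): group subobjects under the name N."""
    view = defaultdict(set)  # semantic object-id N -> set of (L, V) pairs
    for source, wanted_label in rules:
        for label, subobjects in source:
            if label != wanted_label:
                continue
            for lab, value in subobjects:
                if lab == "name":  # <name N> binds the semantic object-id
                    view[value].update(subobjects)
    return {n: sorted(pairs) for n, pairs in view.items()}

faculty_view = fuse([(S1, "faculty"), (S2, "person")])
```

Because the view object is identified by N rather than by a source-local id, subobjects from both sources accumulate on the same object, and objects with different sets of labels need no special handling.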


51

Mediator Specification Example: Semantics of Rule Bodies



[Figure: sources s1 and s2]

52

Mediator Specification Example: Semantics of Rule Heads



[Figure: the view object with semantic object-id "John Doe": faculty {name "John Doe", rank "professor", birthday "April 1", papers ...}, built from sources s1 and s2]

53

Incrementally Add to Semantically Identified Object



[Figure: sources s1 and s2; s2 holds person {name "John Doe", birthday "April 1"}]


54

Irregularities & Incomplete Schema Knowledge

<N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1

[Figure: source s1 holds faculty {name "John Doe", rank "professor", papers} and faculty {name "Mary Smith", project "Air DB"}; source s2 is also shown. The view holds the object "John Doe": faculty {name "John Doe", rank "professor", birthday "April 1", papers} and the object "Mary Smith": faculty {name "Mary Smith", project "Air DB"}]

55

Second Rule Attaches More Subobjects to View Objects



[Figure: sources s1 and s2]

56

The OEM object structure of the cs wrapper

<&e1, employee, set, {&f1,&l1,&t1,&rep1}>
  <&f1, first_name, string, 'Joe'>
  <&l1, last_name, string, 'Chung'>
  <&t1, title, string, 'professor'>
  <&rep1, reports_to, string, 'John Hennessy'>

<&e2, employee, set, {&f2,&l2,&t2}>
  <&f2, first_name, string, 'John'>
  <&l2, last_name, string, 'Hennessy'>
  <&t2, title, string, 'chairman'>

. . .

<&s3, student, set, {&f3,&l3,&y3}>
  <&f3, first_name, string, 'Pierre'>
  <&l3, last_name, string, 'Huyn'>
  <&y3, year, integer, 3>

. . .

57

The OEM object structure of whois

<&p1, person, set, {&n1,&d1,&rel1,&elm1}>
  <&n1, name, string, 'Joe Chung'>
  <&d1, dept, string, 'CS'>
  <&rel1, relation, string, 'employee'>
  <&elm1, e_mail, string, 'chung@cs'>

<&p2, person, set, {&n2,&d2,&rel2,&y2}>
  <&n2, name, string, 'Nick Naive'>
  <&d2, dept, string, 'CS'>
  <&rel2, relation, string, 'student'>
  <&y2, year, integer, 3>

...

58

Object exported by med

<&cp1, cs_person, set, {&mn1,&mrel1,&t1,&rep1,&elm1}>
  <&mn1, name, string, 'Joe Chung'>
  <&mrel1, rel, string, 'employee'>
  <&t1, title, string, 'professor'>
  <&rep1, reports_to, string, 'John Hennessy'>
  <&elm1, e_mail, string, 'chung@cs'>

59

Problems in Creating a Mediator Specification

• Schema-domain mismatch

• Schematic discrepancy

• Schema Evolution

• Structure Irregularities

60

(MSL) Rules:

<cs_person {<name N> <rel R> Rest1 Rest2}>
    :- <person {<name N> <dept 'CS'> <relation R> | Rest1}>@whois
       AND decomp(N, LN, FN)
       AND <R {<first_name FN> <last_name LN> | Rest2}>@cs

External:
decomp(string,string,string)(bound,free,free) impl by name_to_lnfn
decomp(string,string,string)(free,bound,bound) impl by lnfn_to_name

61

(MSL) Rules:

<cs_person {<name N> <rel R>}>
    :- <person {<name N> <dept 'CS'> <relation R>}>@whois

<cs_person {<name N> <rel R>}>
    :- decomp(N, LN, FN)
       AND <R {<first_name FN> <last_name LN>}>@cs

<cs_person {<name N> <rel R> <title T>}> :- ???

62

(MSL) Rules:

<cs_person {<name N> <rel R> <e_mail E>}> :- ???

??? :- ??? <R {<first_name FN> <last_name LN> <title E> <reports_to S>}>@cs

Rewriting:

<cs_person {<name N> <rel R> <e_mail E> <title E> <reports_to S>}> :- ???

63

Language Expressiveness

• Information fusion problems solved by MSL
– Irregularities
– Incomplete knowledge of source structure
– Transformation of cross-referenced structures
– Inconsistent and redundant data
– Use of arbitrary matching criteria

• Theoretical analysis of expressiveness
– Consider the relational representation of OEM graphs. Then MSL is equivalent to "SQL + special form of transitive closure"

64

Inconsistent and Redundant Information

AND NOT <faculty {<name N> <L V1>}>@s1

[Figure: s1 holds faculty {name "John Doe", rank "associate"}; s2 holds person {name "John Doe", rank "assistant"}; the view object "John Doe" (faculty) is shown with name "John Doe", rank "associate", and rank "assistant"]

65

Overview


• Mediator Generation
• Example of mediator specification
• Language expressiveness
• Implementation and performance



66

Mediator Specification Interpreter Architecture

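[Figure: the Query enters the Query Rewriter, which produces a logical datamerge program; the Cost-Based Optimizer turns it into a plan; the Datamerge Engine executes the plan, sending queries to wrappers, collecting their results, and returning the Result]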

67

Query Rewriting When Known Origins of Information

• <N faculty {<salary S>}> :- <faculty {<name N> <salary S>}>@s1

• <N faculty {<rank R>}> :- <person {<name N> <rank R>}>@s2

• <well-paid {<name N> <salary X>}> :- <N faculty {<salary X> <rank assistant>}> AND X>65000

68

Query Rewriter Pushes Conditions to Sources

• <N faculty {<salary S>}> :- <faculty {<name N> <salary S>}>@s1
  <N faculty {<rank R>}> :- <person {<name N> <rank R>}>@s2

• <well-paid {<name N> <salary X>}> :- <N faculty {<salary X> <rank assistant>}> AND X>65000

• logical datamerge program:
  <well-paid {<name N> <salary X>}> :- (<faculty {<name N> <salary X>}> AND X>65000)@s1 AND <person {<name N> <rank assistant>}>@s2
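The benefit of pushing conditions to the sources can be seen by counting the rows each source ships to the mediator. The sketch below uses invented data and hypothetical wrapper functions (`query_s1`, `query_s2`); it compares a plan that fetches everything with one that sends the salary condition to s1 and the rank condition to s2.

```python
# Invented source contents: s1 holds salaries, s2 holds ranks.
S1_FACULTY = [("Ann", 70000), ("Bob", 60000), ("Cy", 90000)]
S2_PERSON = [("Ann", "assistant"), ("Bob", "assistant"), ("Cy", "professor")]

def query_s1(min_salary=None):
    """Wrapper for s1; applies the salary condition at the source if given."""
    return [(n, s) for n, s in S1_FACULTY
            if min_salary is None or s > min_salary]

def query_s2(rank=None):
    """Wrapper for s2; applies the rank condition at the source if given."""
    return [(n, r) for n, r in S2_PERSON if rank is None or r == rank]

def well_paid(push_down):
    salaries = query_s1(65000 if push_down else None)
    ranks = query_s2("assistant" if push_down else None)
    shipped = len(salaries) + len(ranks)  # rows sent to the mediator
    rank_of = dict(ranks)
    # The mediator joins on the name; re-checking the conditions is harmless.
    result = [(n, s) for n, s in salaries
              if rank_of.get(n) == "assistant" and s > 65000]
    return result, shipped

pushed_result, pushed_rows = well_paid(push_down=True)
naive_result, naive_rows = well_paid(push_down=False)
```

Both plans return the same answer; the pushed-down plan simply moves fewer rows, which is what the rewritten datamerge program expresses with the conditions placed inside the @s1 and @s2 parts.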

69

Passing Bindings & Local Join Plans

Passing Bindings:
at s2: <name N> :- <person {<rank assistant>}>
at s1, for each binding of N: <salary X> :- <faculty {<name $N> <salary X>}> AND X>65000

Local Join:
at s2: <name N> :- <person {<rank assistant>}>
at s1: <a {<s X> <n N>}> :- <faculty {<name N> <salary X>}> AND X>65000

70

Query Decomposition When Unknown Origins of Information

<X faculty {<S Y>}> :- <X faculty {<birthday “1/20”> <S Y>}>


71

Plan Considers All Possible Sources of birthday

<X faculty {<S Y>}> :- <X faculty {<birthday “1/20”> <S Y>}>


[Figure: sources s1 and s2, each shown with name and birthday subobjects]

72

Overview

• Semistructured-Data Representation




73

Query Translation in Wrappers

[Figure: inside the Wrapper, the Query Translator turns SELECT * FROM person into find -all, and SELECT * FROM person WHERE name="Smith" into find -n Smith, for the Source; the Result Translator converts the Source's answer]

74

Rapid Query Translation Using Templates and Actions

SELECT * FROM person {emit "find -all"}
SELECT * FROM person WHERE name=$N {emit "find -n $N"}

[Figure: the Template Interpreter applies these templates to turn SELECT * FROM person into find -all, and SELECT * FROM person WHERE name="Smith" into find -n Smith, for the Source; the Result Translator converts the answer]
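A template interpreter of this kind can be sketched with regular expressions. The sketch assumes that `find -all` and `find -n` are the source's native commands, as the slide suggests; `TEMPLATES` and `emit` are hypothetical names, and the parameter $N is restricted to word characters for simplicity.

```python
import re

# (template, action) pairs; N is bound when the template matches the query.
TEMPLATES = [
    (r'SELECT \* FROM person', 'find -all'),
    (r'SELECT \* FROM person WHERE name\s*=\s*"(?P<N>\w+)"', 'find -n {N}'),
]

def emit(query):
    """Return the native command emitted for a supported query."""
    normalized = " ".join(query.split())  # collapse runs of whitespace
    for template, action in TEMPLATES:
        match = re.fullmatch(template, normalized)
        if match:
            return action.format(**match.groupdict())
    raise ValueError("query not supported by this wrapper")

command = emit('SELECT * FROM person WHERE name="Smith"')
```

Adding support for a new query shape then means adding one (template, action) pair rather than writing translation code by hand.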

75

Description of Infinite Sets of Supported Queries

• uses recursive nonterminals

• Example:
– job description contains word w1 and word w2 and ...
– SELECT subset(person) FROM person WHERE \CJob
  \CJob : job LIKE $W AND \CJob
  \CJob : TRUE
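The recursive nonterminal describes infinitely many WHERE clauses: any number of `job LIKE $W` conditions joined by AND and terminated by TRUE. A minimal recognizer, assuming single-quoted one-word constants for $W (the function name `supported` is hypothetical):

```python
import re

def supported(where_clause):
    """Recognize  CJob : job LIKE $W AND CJob | TRUE  by direct recursion."""
    clause = where_clause.strip()
    if clause == "TRUE":  # base case of the nonterminal
        return True
    match = re.match(r"job LIKE '(\w+)' AND (.*)", clause)
    # Recursive case: one LIKE condition followed by another CJob.
    return bool(match) and supported(match.group(2))

ok = supported("job LIKE 'java' AND job LIKE 'sql' AND TRUE")
```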

76

Overview

• Semistructured-Data Representation




77

Capabilities-Based Rewriter in Mediator Architecture

[Figure: the Query enters the Query Rewriter, which produces a logical datamerge program; the Capabilities-Based Rewriter uses the Wrapper Supported Queries Descriptions to derive supported plans; the Cost-Based Optimizer picks the optimal plan, which the Datamerge Engine executes]

78

Capabilities-Based Rewriter Finds Supported Plans

[Figure: the rewriter is given SELECT * FROM A WHERE salary>65000; the Supported Queries of source A contain only SELECT * FROM A]

79

Capabilities-Based Rewriter Finds Most-Selective Supported Plans

[Figure: the rewriter is given SELECT * FROM B WHERE salary>65000; the Supported Queries of source B contain SELECT * FROM B and SELECT * FROM B WHERE salary>65000]
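The choice between these plans can be sketched as a split of the query's conditions into those a source can evaluate and those the mediator must apply afterwards. This is a toy model, not the TSIMMIS algorithm: `CAPABILITIES` is a hypothetical description that simply lists the conditions each wrapper accepts.

```python
# Hypothetical capability descriptions: conditions each wrapper can evaluate.
CAPABILITIES = {
    "A": set(),               # A supports only SELECT * FROM A
    "B": {"salary>65000"},    # B also supports the salary selection
}

def plan(source, conditions):
    """Return (query sent to the source, conditions left for the mediator)."""
    pushed = [c for c in conditions if c in CAPABILITIES[source]]
    residual = [c for c in conditions if c not in CAPABILITIES[source]]
    query = f"SELECT * FROM {source}"
    if pushed:
        query += " WHERE " + " AND ".join(pushed)
    return query, residual

plan_a = plan("A", ["salary>65000"])
plan_b = plan("B", ["salary>65000"])
```

For source A the selection stays at the mediator, while for source B the most selective supported query is sent and nothing remains to filter.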

80

Capabilities-Based Rewriter Architecture

[Figure: the Query and the Query Capabilities Description enter Component SubQuery Discovery, which yields component subqueries; Plan Construction builds plans (not fully optimized); Plan Refinement yields algebraically optimal plans]

81

What TSIMMIS Achieved

• system for integration of heterogeneous sources

• challenges and solutions
– semistructured data & incomplete schema knowledge
  • appropriate specification language and query processing algorithms
– limited and different query capabilities
  • query translation algorithm
  • capabilities-based query rewriting algorithm

82

Overview

• TSIMMIS’ goals, technical challenges, and solutions

• Insufficiencies of the TSIMMIS’ framework

• Going forward

83

Insufficiencies of the TSIMMIS framework

• OEM was really unstructured data
– some loose and partial schematic info may pay off tremendously

• too “databasy” user/mediator/source interaction

84

Overview

• TSIMMIS’ goals, technical challenges, and solutions

• Insufficiencies of the TSIMMIS’ framework

• Going forward

85

Web emerges as a Distributed DB and XML as its Data Model

[Figure: each Data Source (a native XML database, or a legacy source behind a Wrapper) exports XML View Document(s), which are queried with the XMAS query language]

Also export:
1. Schemas & Metadata (XML-Data, RDF, ...)
2. Description of supported queries

86

Definition of Integrated Views

[Figure: three Data Sources export XML View Document(s); the Mediator combines them into an Integrated XML View according to a View Definition in XMAS]

87

Non-Materialized Views in the MIX mediator system

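[Figure: an Application or the Blended Browsing & Querying (BBQ) GUI accesses the MIX Mediator through a DOM for virtual XML documents, sending XMAS queries and receiving XML documents. Inside the mediator, the Query Processor uses the View Definition in XMAS, and DTD Inference derives the Integrated View DTD from the Source DTDs of the XML Sources]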

88

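[Figure: MIX mediator architecture. The Application and the Blended Browsing & Querying (BBQ) GUI use the DOM (VXD) Client API to send XMAS queries and receive XML document fragments. Inside the MIX Mediator, the XMAS Mediator View Definition and the query pass through Resolution (yielding an Unfolded Query), Simplification, Translation to Algebra, Optimization, and Execution; DTD Inference derives the View DTD. The mediator sends XMAS queries to XML Source 1 (with its DTD) and to an RDB2XML Wrapper, and receives XML document fragments]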
