principles of database management systems 10: information integration

32
DBMS2001 Notes 10: Information Int egration 1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

Upload: alia

Post on 14-Feb-2016

47 views

Category:

Documents


0 download

DESCRIPTION

Principles of Database Management Systems 10: Information Integration. Pekka Kilpeläinen University of Kuopio. Information Integration. Large family of applications concerned with combining data from heterogeneous information sources Our goal: Get an overview of the - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

1

Principles of Database Management Systems

10: Information Integration

Pekka KilpeläinenUniversity of Kuopio

Page 2: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

2

Information Integration

• Large family of applications concerned with combining data from heterogeneous information sources

• Our goal: Get an overview of the – modes (or approaches, or architectures)– problems, and– some applications

of information integration

Page 3: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

3

Modes of information integration

• Federated databases– collection of independent, co-operating sources

• Warehousing– single DB containing physical copies of data

from several sources• Mediation

– system providing an integrated view to the sources (non-materialized, virtual database)

Page 4: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

4

Example An integrated DB for dealer DBs of Aardvark

Automobile Co, to support locating requested models and market analysis

DB of Dealer1:Cars(serialNo,model, color, autoTrans,cdPlayer, …)

---------------------------------------------------------- | 123 | BMV 1 | blue| yes | yes | … | 456 | MB 50 | red | yes | no | …. . .

Page 5: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

5

Example (cont.)DB of Dealer2:

Autos(serial, model, color)

------------------------------ | 323 | MZ-6 | BL | | 555 | MB 50| Rd | . . .Options(serial, option)

---------------------------------- | 323 | manualTrans | | 323 | cdPlayer | . . .

Page 6: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

6

Problems of information integration

• Different names – Cars/Autos, serialNo/serial, …– renaming straightforward

• Different data types– variable vs. fixed-length strings vs. integers vs. …

• Conceptual differences– Are trucks included in ’Cars’?– Are station wagons classified as minivans?

• Treatment of missing values– represent with NULLs or some default values?

Page 7: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

7

Information integration architectures

1. Federated database– one-to-one connections between independent

databases

DB1 DB2

DB3 DB4

– Requires n(n-1) interfaces to fully connect n systems together

Page 8: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

8

Dealer1 wants to search requested cars also from the DB of Dealer2, for requests stored in relation NeededCars(model,color,autoTrans)

• Needs to query relations Autos(serial, model, color) and Options(serial, option)

of Dealer2 (with the following embedded SQL):

Example (of federation)

yes/no

Page 9: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

9

for (each tuple (:m, :c, :a) in NeededCars) { if (:a=’yes’) { // autoTrans wanted SELECT serial FROM Autos, Options WHERE Autos.serial = Options.serial AND option=’autoTrans’ AND model = :m AND color = :c;

} else { // manual transmission wanted SELECT serial FROM Autos WHERE model = :m AND color = :c AND NOT EXISTS ( SELECT * FROM OptionsWHERE serial = Autos.serial ANDoption = ’autoTrans’ );

}}

Page 10: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

10

2. Data Warehouse• Database formed by extracting and

combining data from several sources– E.g., sales data for a supermarket chain– generally read-only

Warehouse

Source Source Source

Extractor Extractor Extractor

Combiner

Page 11: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

11

Create a warehouse of Dealer1 and Dealer2 databases with combined schema CarWH(serNo,model,color,autoTrans,dealer)

• Extract Dealer1’s data:INSERT INTO CarWH(serNo,model,color,autoTrans,dealer)SELECT serialNo,model,color,autoTrans,’dealer1’ FROM Cars;

Example (Data warehouse of Aadvark Co.)

• Extract Dealer2’s data:

Page 12: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

12

INSERT INTO CarWH(serNo,model,color, autoTrans,dealer) SELECT serial,model,color, ’yes’,’dealer2’ FROM Autos, Options WHERE Autos.serial = Options.serial AND option=’autoTrans’;

INSERT INTO CarWH(serNo,model,color, autoTrans,dealer) SELECT serial,model,color, ’no’,’dealer2’ FROM Autos WHERE NOT EXISTS ( SELECT * FROM Options WHERE serial = Autos.serial AND option = ’autoTrans’ );

Page 13: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

13

3. Mediators• Support views combining data from

several sourcesClient Client

Wrapper Wrapper Wrapper

Mediator

Source Source Source

Page 14: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

14

Integrate Dealer1 and Dealer2 databases as a virtual database with schema CarMed(serNo,model,color,autoTrans,dealer)

• Suppose user of the mediator gives query Q:SELECT serNo, model FROM CarMed WHERE color=‘red’ ;

• Translation by wrapper of Dealer1 easy:Q1: SELECT serialNo AS serNo, model

FROM Cars WHERE color=‘red’ ;• Query Q2 on relation Autos of Dealer2 similarly• Return UNION of results to Q1 and Q2

Example (mediation of dealer sources)

Page 15: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

15

Extractors and wrappers• An extactor for a warehouse needs

– one or more queries to get data from source• SQL, or other supported language

– communication mechanisms for• passing queries to source• receiving responces from the source• passing data to the warehouse

• A wrapper for a mediator more complex– translates arbitrary queries for the source

Page 16: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

16

An aside: XML wrappers• What if the source does not provide a

query language?– Need to translate a piece of extracted

data (file/report/message/document/…)– developing and maintaining translation

programs is tedious => a wrapper specification language XW (XML Wrapper; Univ. of Kuopio, 2001-2002)

Page 17: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

17

XML in data exchange• XML an industry standard for textual

representation of structured data (W3C Recommendation 1998)– XML-based protocols developed for

e-business, medical messages, … – old message formats need to be

converted to XML– this translation addressed by XW

Page 18: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

18

XW wrapping

AA x1

x2

BBy1 y2

z1 z2

<part-a> <part-a> <e1> <e1>x1x1</e1> </e1>

<e2> <e2>x2x2</e2></e2></part-a></part-a><part-b><part-b> <line-1> <line-1> <d1> <d1>y1y1</d1> </d1>

<d2> <d2>y2y2</d2></d2> </line-1> </line-1> <d3> <d3>z2z2</d3></d3></part-b></part-b>

XW-engineXW-engine

<xw:wrapper … ><xw:wrapper … > … … </xw:wrapper></xw:wrapper>

ssource ource datadata

wwrapprapper er descriptiondescription

result result documentdocument

Page 19: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

19

Back to DB wrappers: Query templates• Translating queries of mediator to queries on

the source possible through templates– with parameters (place-holders for constants of

the original query)– Notation T => S:

template T is translated to source query S• Example: Translating color-based selections

on mediated view to queries on Dealer1 DB

Page 20: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

20

SELECT * FROM CarMed WHERE color = ‘$c’ ;

=>SELECT serialNo AS serNo, model, color, autoTrans, ‘dealer1’ FROM Cars WHERE color = ‘$c’ ;

• Similar templates for model and autoTrans– supporting all combinations of n attributes requires

2n templates!– Can do with fewer by applying filtering (see later)

Example (of query templates)

Page 21: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

21

Wrapper generators• Templates defining a wrapper

translated into code of a wrapper– by a wrapper generator– How?

• Two parts of a wrapper– table of query patterns specified by

templates– a fixed driver

Page 22: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

22

Wrapper-generator architecture

Source

queries results

Wrapper generator

Driver

table Wrapper

Templates Queries from mediator Results

Page 23: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

23

Filtering• Often impossible to write templates for

each possible form of mediator query• Possible to wrap more complex queries

– if some template returns a superset of the original query (by fixing only a part of conditions), then

– filter the result of the template by the additional query conditions

Page 24: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

24

How to evaluate query SELECT * FROM CarMed WHERE color=‘blue’ AND model=‘Gobi’ ;if we have only the below template?

SELECT * FROM CarMed WHERE color = ‘$c’ ; =>SELECT serialNo AS serNo, model, color, autoTrans, ‘dealer1’ FROM Cars WHERE color = ‘$c’ ;

Example (of filtering by wrapper)

Page 25: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

25

1. Apply template with $c = ‘blue’ and store the result: INSERT INTO Temp(serNo,model,color,autoTrans,dealer)SELECT serialNo AS serNo, model, color, autoTrans, ‘dealer1’ FROM Cars WHERE color = ‘blue’ ;

Example (of filtering cont.)

2. Filter out Gobis from relation Temp:

SELECT * FROM Temp WHERE model = ‘Gobi’ ;

Page 26: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

26

Warehouses and their applications

• Warehouses– aimed at executive, decision maker,

analyst– often a copy of operational data

• Advantages– High query performance– Local processing at sources unaffected– Can operate when sources unavailable

Page 27: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

27

Applications of warehouses (1)• OLAP (On-line Analytic Processing)

– finding important patterns or trends from data stored at the warehouse

– involves complex queries on large parts of the database

– Contrast: OLTP (On Line Transaction Processing)• Describes processing at operational sites,

not at the warehouse

Page 28: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

28

OLTP vs. OLAP

• Reads & updates• Many small

transactions• MB-TB of data• Raw, up-to-date data• Clerical users• Consistency,

recoverability critical

• Mostly reads• Queries long,

complex• GB-TB of data• Summarized data,

updated periodically• Decision-makers,

analysts as users

OLTP OLAP

Page 29: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

29

• Data mining– active area of R&D since 1990’s– analysis of large amounts of data for

finding interesting regularities or exceptions

• Why?– To get valuable information out of huge

data collections of research commerce and industry

Applications of warehouses (2)

Page 30: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

30

Example (data mining)• “Association rules” (IBM 1993)

– way to analyse co-occurrence of values – E.g. the market-basket analysis:

bread, cheese butter (53%, 74%)

confidence: how many buyers of bread&cheese also bought butter?

support: how many market-baskets contain bread, cheese and butter

Page 31: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

31

What’s new in Data Mining?• Statistics:

– methods to test hypotheses on data

• Data mining– automatic generation of hypotheses

among hundreds or thousands of explaining variables

• OLAP:– user-driven examination of data

Page 32: Principles of Database Management Systems 10: Information Integration

DBMS2001 Notes 10: Information Integration

32

Summary

• We’ve had an overview of information integration

• Different approaches– federation, warehousing, mediation

• Applications of data warehouses– OLAP– data mining