s4i nanni 1 intro to bi and data warehousing

24
Master of Science in Engineering in Computer Science (MSE-CS) DIPARTIMENTO DI INGEGNERIA INFORMATICA AUTOMATICA E GESTIONALE ANTONIO RUBERTI Seminars in Software and Services for the Information Society Umberto Nanni Introduction to 1 Seminars of Software and Services for the Information Society Umberto Nanni Introduction to Datawarehousing

Upload: others

Post on 16-May-2022

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: S4I Nanni 1 Intro to BI and Data Warehousing

Master of Science in Engineering in Computer Science (MSE-CS)

DIPARTIMENTO DI INGEGNERIA INFORMATICA AUTOMATICA E GESTIONALE

ANTONIO RUBERTI

(MSE-CS)

Seminars in Software and Services for the Information Society

Umberto Nanni

Introduction to

1Seminars of Software and Services for the Information SocietyUmberto Nanni

Introduction to Datawarehousing

Page 2: S4I Nanni 1 Intro to BI and Data Warehousing

Business Intelligence Architecture

management

management

system

goals results

management

system

rep

ort

ing

OLA

P

da

ta m

inin

g

Datawarehouse

Datamart-1 Datamart-2 Datamart-3

KPI DSS MKT HRCRM …

system

2Seminars of Software and Services for the Information SocietyUmberto Nanni

operational

system

ETL systems

external

data

sources

service

systems

ERP

systems

internet /

extranet

operational system

Page 3: S4I Nanni 1 Intro to BI and Data Warehousing

What is Data Warehousing

Collection of methods, technologies and tools to

assist the “knowledge worker” (manager,

analyst) to conduct data analysis aimed at

supporting decision-making and/or improving

3Seminars of Software and Services for the Information SocietyUmberto Nanni

the management of information assets

Page 4: S4I Nanni 1 Intro to BI and Data Warehousing

What is a Data Warehouse

A data warehouse is a collection of data

• integrated (far beyond the organization)

• consistent (despite the heterogeneous origin)

• focused (an interest area is defined)

4Seminars of Software and Services for the Information SocietyUmberto Nanni

• historical (over a consistent timeframe)

• permanent (never delete your data!)

Page 5: S4I Nanni 1 Intro to BI and Data Warehousing

Purpose of a Data Warehouse

A Data Warehouse helps (allows) you:

• to take decisions

• to identify and interpret phenomena

• to make predictions about the future

5Seminars of Software and Services for the Information SocietyUmberto Nanni

• to make predictions about the future

• to control a complex system

Page 6: S4I Nanni 1 Intro to BI and Data Warehousing

Value and quantity of information

value

BD

competitors

marketing

prices

sales

logistics

value

strategic

information

reports

selected

$$$$

competitors prices

6Seminars of Software and Services for the Information SocietyUmberto Nanni

quantity

primary

information

sources

selected

information

$$$

Page 7: S4I Nanni 1 Intro to BI and Data Warehousing

OLTP & OLAP

OLTP - On-Line Transaction Processing

– realm of (write and / or read) transactions, recovery,

consistencyconsistency

– many, fast and frequent operations

– high level of concurrency

– access to a small amount of data

– on-the-fly data update

OLAP - On-Line Analytical Processing

read only

7Seminars of Software and Services for the Information SocietyUmberto Nanni

– read only

– few operations

– low level of concurrency

– access to huge amounts of data

– historical but essentially static data

Page 8: S4I Nanni 1 Intro to BI and Data Warehousing

Separation between:

Operational Database & Data Warehouse

• different computational load

• different needs:• different needs:

– DB: dynamic data, asynchronous updates

– DW: static data, periodic updates

• integration with business activity:

– DB: supporting operations (focused, timely)

– DW: supporting decisions (descriptive, historical)

8Seminars of Software and Services for the Information SocietyUmberto Nanni

– DW: supporting decisions (descriptive, historical)

• data collection:

– DB: minimal

– DW: maximal

Page 9: S4I Nanni 1 Intro to BI and Data Warehousing

Two issues with different perspectives

• Data redundancy

– OLTP (DB): to avoid, bringing to inconsistency – OLTP (DB): to avoid, bringing to inconsistency

and/or inefficiency on updates

– OLAP (DW): redundancy avoids recomputation

and shorten response time

• Indexing

– OLTP (DB): good when you search – bad when you

9Seminars of Software and Services for the Information SocietyUmberto Nanni

– OLTP (DB): good when you search – bad when you

update... you need some trade-off

– OLAP (DW): the more, the best

Page 10: S4I Nanni 1 Intro to BI and Data Warehousing

Some Data Warehouse Systems

• Oracle 12

• IBM InfoSphere Warehouse• IBM InfoSphere Warehouse

• Microsoft SQL-Server 2012 – Analysis Services

• Sybase IQ

• Hyperion (bought by Oracle)

• Teradata (division of NCR)

10Seminars of Software and Services for the Information SocietyUmberto Nanni

• Teradata (division of NCR)

• Netezza – Cognos (bought by IBM)

• Business Objects (bought by SAP)

Page 11: S4I Nanni 1 Intro to BI and Data Warehousing

A comparison by Gartner

11Seminars of Software and Services for the Information SocietyUmberto Nanni

Donald Feinberg, Mark A. Beyer

Magic Quadrant for Data Warehouse

Database Management Systems

Gartner RAS Core Research Note G00173535, 28 January 2010

Page 12: S4I Nanni 1 Intro to BI and Data Warehousing

Architectures for Datawarehousing: issues

• separating OLTP & OLAP

• scalability

• extensibility

• security

12Seminars of Software and Services for the Information SocietyUmberto Nanni

• security

• administrability

Page 13: S4I Nanni 1 Intro to BI and Data Warehousing

Architecture for Datawarehousing

• determined by design choices

• determined by / determines the choice of a • determined by / determines the choice of a

software system

• determines the cost and makes possible

future integration (quantitative and / or

qualitative)

13Seminars of Software and Services for the Information SocietyUmberto Nanni

qualitative)

• affects the cost of data processing

Page 14: S4I Nanni 1 Intro to BI and Data Warehousing

Data Mart

Collection of data focused on particular user profile or

on particular target analysison particular target analysis

Alternatives:

1. dependent Data Mart: it is a subset and/or an aggregation of

data in the primary DW

→ DM extracted from a DW

2. independent Data Mart: it is a subset and/or an aggregation

14Seminars of Software and Services for the Information SocietyUmberto Nanni

2. independent Data Mart: it is a subset and/or an aggregation

of data in the operational DB

→ DW=Ui(DMi), that is, DW is a set of DM

3. hybrid solution, combining 1, 2

Page 15: S4I Nanni 1 Intro to BI and Data Warehousing

DW architecture: 1 Level

• there is only an operational DW

• virtual DB (no OLTP-OLAP separation)

• data coincident with DB operational

• difficult integration with other sources

data - level 1 middleware

(copy of)

operational

15Seminars of Software and Services for the Information SocietyUmberto Nanni

sources warehouse analysis

operational

DB

external

sources

Page 16: S4I Nanni 1 Intro to BI and Data Warehousing

DW architecture: 2 Levels – dependent DMs

• data sources complemented with external sources

• running on dedicated software platform

• ETL: Extraction, Transformation, Loading• ETL: Extraction, Transformation, Loading

• materialization of the DW

• materialization of Data Marts

oper

BDDataMart

ETL

data - level 1 data - level 2

16Seminars of Software and Services for the Information SocietyUmberto Nanni

BD

ext

BD

sources warehouse analysisfeeding

DWMart

DataMart

Page 17: S4I Nanni 1 Intro to BI and Data Warehousing

DW architecture: 2 Levels – independent DMs

• Data Mart are materialized by feeding

• DW = union of DMs• DW = union of DMs

oper

BDDataMart

ETL

data - level 1 data - level 2

17Seminars of Software and Services for the Information SocietyUmberto Nanni

sources warehouse analysisfeeding

BD

ext

BD

Mart

DataMart

Page 18: S4I Nanni 1 Intro to BI and Data Warehousing

DW architecture: 3 Levels

• a level of "reconciled" data (operational data store) is introduced

• separation into two phases of ETL activities:• separation into two phases of ETL activities:

1. extraction / transformation

2. loading

oper

BDDataMart

ET(L)

data - level 1 data - level 2 data - level 3

18Seminars of Software and Services for the Information SocietyUmberto Nanni

BD

ext

BD

DWMart

DataMart

reconcilied

data

loading

sources warehouse analysisfeeding

Page 19: S4I Nanni 1 Intro to BI and Data Warehousing

ETL: Extraction, Transformation, Loading

extraction

Operational Data, External Data

• extraction

• cleaning - validation - filtering

• transformation

Reconciled Data

19Seminars of Software and Services for the Information SocietyUmberto Nanni

• loading

Data Warehouse

Page 20: S4I Nanni 1 Intro to BI and Data Warehousing

Extraction

• initial extraction:

– targeted at the creation of the DW– targeted at the creation of the DW

• furter extractions:

– static (replaces the whole DW)

– incremental

20Seminars of Software and Services for the Information SocietyUmberto Nanni

– incremental

• log

• timestamp

Page 21: S4I Nanni 1 Intro to BI and Data Warehousing

Cleaning

• changing VALUES

• duplicates• duplicates

• inconsistencies

– domain violation

– functional dependency violation

• null values

• misuse of fields

21Seminars of Software and Services for the Information SocietyUmberto Nanni

• misuse of fields

• spelling

• abbreviations (not homogeneous)

Page 22: S4I Nanni 1 Intro to BI and Data Warehousing

Transformation

• changing FORMATS:

• misalignment of formats

• field overloading

• unhomogeneous coding

22Seminars of Software and Services for the Information SocietyUmberto Nanni

• unhomogeneous coding

Page 23: S4I Nanni 1 Intro to BI and Data Warehousing

Loading

• Refresh:

ex-novo load of the whole DWex-novo load of the whole DW

• Update:

differential updates

23Seminars of Software and Services for the Information SocietyUmberto Nanni

Page 24: S4I Nanni 1 Intro to BI and Data Warehousing

Metadata

• internal metadata

– concerning the administration of the DW (i.e., sources, transformations,

schemas, users, etc..)schemas, users, etc..)

• external metadata

– interesting for users (e.g., measurement units, possible combinations)

• STANDARDs

• CWM - Common Warehouse Model (OMG), defined by:

– UML (Unified Modeling Language)

24Seminars of Software and Services for the Information SocietyUmberto Nanni

– UML (Unified Modeling Language)

– XML (eXtensible Markup Language)

– XMI (XML Metadata Interchange)

OMG = Object Management Group: CORBA (Common Object Request Broker Architecture), UML

(Unified Modeling Language), MDA (Model-Driven Architecture)