data warehousing & mining techniques

61
Data Warehousing & Mining Techniques Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de

Upload: others

Post on 16-Oct-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Warehousing & Mining Techniques

Data Warehousing

& Mining Techniques Wolf-Tilo Balke

Silviu Homoceanu

Institut für Informationssysteme

Technische Universität Braunschweig

http://www.ifis.cs.tu-bs.de

Page 2: Data Warehousing & Mining Techniques

• Lecture

– 16. 10.2012 – 29.01.2013

– 09:45-12:15 (3 lecture hours !)

– Exercises, detours, and home work discussion integrated into lecture

• 4-5 Credits

• Exams

– Oral exam

– Achieving more than 50% in homework points is advised

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 2

0. Organizational Issues

Page 3: Data Warehousing & Mining Techniques

• Bad decisions can lead to disaster

– Data Warehousing is at

the base of decision

support systems

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 3

0. Why should you be here?

Page 4: Data Warehousing & Mining Techniques

• Data Warehousing & OLAP is important –

discover information hidden within the

organization’s data

– See data from different angles:

product, client, time, area

– Get adequate statistics to get

your point of argumentation

across

– Get a glimpse of the future…

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 4

0. Why should you be here?

Page 5: Data Warehousing & Mining Techniques

• Because you love databases…

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 5

0. Why should you be here?

Page 6: Data Warehousing & Mining Techniques

• Building the Data Warehouse

– William H. Inmon

– Wiley, ISBN 0-7645-9944-5

• The Data Warehouse Toolkit

– Ralph Kimball & Margy Ross

– Wiley, ISBN 0-471-20024-7

• Data Warehouse-Systeme

– Andreas Bauer & Holger Günzel

– dpunkt.verlag, ISBN 3-89864-251-8

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 6

0. Recommended Literature

Page 7: Data Warehousing & Mining Techniques

• The Data Warehouse ETL Toolkit

– Ralph Kimball & Joe Caserta

– Wiley, ISBN 0-7645-6757-8

• OLAP Solutions

– Erik Thomsen

– Wiley, ISBN 0-471-40030-0

• Data Warehouses and OLAP

– Robert Wrembel & Christian Koncilia

– IRM Press, ISBN 1-59904364-5

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 7

0. Recommended Literature

Page 8: Data Warehousing & Mining Techniques

1 Introduction

1.1 What is a data warehouse (DW)?

1.2 Application and users

1.3 Lifecycle / phases of a data

warehouse

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 8

1. Introduction

Page 9: Data Warehousing & Mining Techniques

• Basically a very large database…

– Not all very large databases are DW, but all data

warehouses are pretty large databases

– Nowadays a warehouse is considered

to start at around a TB and goes

up to several PB

– It spans over several servers and

needs an impressive amount of

computing power

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 9

I.I What is a data warehouse?

Page 10: Data Warehousing & Mining Techniques

• More specific, a collective data repository

– Containing snapshots of the operational data (history)

– Obtained through data cleansing (Extract-Transform-

Load process)

– Useful for analytics

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 10

I.I What is a data warehouse?

Page 11: Data Warehousing & Mining Techniques

• Compared to other solutions it…

– Is suitable for tactical/strategic focus

– Implies a small number of transactions

– Implies large transactions spanning over a long

period of time

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 11

I.I What is a data warehouse?

Page 12: Data Warehousing & Mining Techniques

• Experts say…

– Ralph Kimball: “a copy of transaction

data specifically structured for query

and analysis”

– Bill Inmon: “A data warehouse is a: – Subject oriented

– Integrated

– Non-volatile

– Time variant

collection of data in support of management’s

decisions.”

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 12

I.I Some Definitions

Page 13: Data Warehousing & Mining Techniques

• Subject oriented

– The data in the DW is organized in such a way that

all the data elements relating to the same real-world

event or object are linked together

• Typical subject areas in DWs are

Customer, Product, Order, Claim,

Account,…

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 13

I.I Inmon Definition

Page 14: Data Warehousing & Mining Techniques

• Subject oriented

– Example: customer as central subject in some DW

• The complete DW is organized by customer

• It may consist of hundreds or more physical tables that

are related

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 14

I.I Inmon Definition

Base customer Data 2000 - 2002

Base customer Data 2003 - 2005

Customer activity 2001 - 2004

Customer activity Detail 2002 - 2004

Customer activity Detail 2005 - 2006

CUSTOMER

Page 15: Data Warehousing & Mining Techniques

• Integrated

– The DW contains data from most or all the

organization's operational systems and this data is

made consistent

– E.g. gender, measurement, conflicting keys,

consistency,…

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 15

I.I Inmon Definition

appl A – M,F operational encoding DW

M,F

appl B – 1,0

appl C – male,female

appl A – cm

appl B – inches

cm

Page 16: Data Warehousing & Mining Techniques

• Non-volatile – Data in the DW is hardly ever over-written or deleted -

once committed, the data is static, read-only, and retained for

future reporting

– Data is loaded, but not updated

– When subsequent changes occur, a new version or snapshot

record is written

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 16

I.I Inmon Definition

Record-by-record manipulation

insert

delete

change

access

Operational

Mass load/ access of data

load

DW

access

Page 17: Data Warehousing & Mining Techniques

• Time-varying

– The changes to the data in the DW are tracked and

recorded so that reports show changes over time

– Different environments have different time horizons

associated

• While for operational systems a

60-to-90 day time horizon is

normal, DWs have a 5-to-10 year

horizon

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 17

I.I Inmon Definition

Page 18: Data Warehousing & Mining Techniques

• More general, a DW is…

– A large repository of

some organization’s

electronically stored

data

– Specifically designed

to facilitate reporting

and analysis

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 18

I.I General Definition

Page 19: Data Warehousing & Mining Techniques

• DW typically…

– Reside on computers dedicated to this function

– Run on enterprise scale DBMS such as Oracle,

IBM DB2, Teradata, or Microsoft SQL Server

– Retain data for long periods of time

– Consolidate data obtained from a variety of

sources

– Are built around their own carefully designed data

model

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 19

I.I Typical Features

Page 20: Data Warehousing & Mining Techniques

• Our story of data warehouses begins with… Walmart

– Founded in the 1962 by Sam Walton

– In 1967 Walton already owned 24 stores

– In 1970 the company goes national

– Breaking all records • First company to grow that fast

$1billion in annual sales by 1980

• America’s top retailer by 1990

• 2.2 million employees, 200 million customers a week, 10,000 stores in 27 countries more than $1billion per day sales

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 20

1.1 DW - The Story

Page 21: Data Warehousing & Mining Techniques

• …and the story goes on with Teradata

– 1976-1979: Startup from Caltech (California Institute

of Technology)

– 1983: First beta system shipped to Wells Fargo Bank

– 1984: Teradata releases the world's first parallel

data warehouses and data marts

– 1986: Fortune Magazine names Teradata

"Product of the Year"

– 1989: Teradata partners with

NCR to build new database computers

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 21

1.1 DW - The Story

Page 22: Data Warehousing & Mining Techniques

• Walmart contacts Teradata (in 1990):

– Walmart CIO: I want to keep track of sales in

all my 1900 stores simultaneously

– Teradata consultant: You need our RDBMS

software. You can stuff data in as sales are rung up at cash

registers and simultaneously query data right in your office

– So Walmart buys a $1 milion Sun E10000 multi-CPU

server, a $500 000 Teradata license, expensive

technical consultants, and builds

a normalized SQL data model

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 22

1.1 DW - The Story

Page 23: Data Warehousing & Mining Techniques

• After a few months of collecting large amounts of data a Walmart executive asks… – I have noticed that there was a Colgate promotion recently,

directed to people who live in small towns. How much toothpaste did we sell in those towns yesterday?

– Translation to a query: • select sum(sales.quantity_sold) from sales, products, product_categories, manufacturers,

stores, cities where manufacturer_name = ‘Colgate’

and product_category_name = ‘toothpaste’

and cities.population < 40 000

and trunc(sales.date_time_of_sale) = trunc(sysdate-1)

and sales.product_id = products.product_id

and sales.store_id = stores.store_id

and products.product_category_id = product_categories.product_category_id

and products.manufacturer_id = manufacturers.manufacturer_id

and stores.city_id = cities.city_id

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 23

1.1 DW - The Story

Page 24: Data Warehousing & Mining Techniques

– The tables contain large volumes of data and

the query implies a 6 way join so it will take

some time to execute

– The tables are at the same time also updated by

new sales

– Soon after executive start their quest for marketing

information store employees notice that there are

times during the day when it is impossible to

process a sale Any attempt to update the database results in freezing the

computer up for 20 minutes

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 24

1.1 DW - The Story

Page 25: Data Warehousing & Mining Techniques

• Minutes later… the Walmart CIO calls Teradata (we don’t really know the exact discussion but…)

• Walmart CIO: WE TYPE IN THE TOOTHPASTE QUERY AND OUR SYSTEM HANGS!!!

• Teradata support: Of course it does! You built an on-line transaction processing (OLTP) system. You can’t feed it a decision support system (DSS) query and expect things to work!

– Walmart CIO: !@%$#. I thought this was the whole point of SQL and your RDBMS…to query and insert simultaneously!!

– Teradata support: Uh, not exactly. If you’re reading from the database, nobody can write to the database. If you’re writing to the database, nobody can read from the database. So if you’ve got a query that takes 20 minutes to run and don’t specify special locking instructions, nobody can update those tables for 20 minutes.

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 25

1.1 DW - The Story

Page 26: Data Warehousing & Mining Techniques

– Walmart CIO: It sounds like a bug.

– Teradata support: Actually it is a feature. We call it pessimistic locking.

– Walmart CIO: Can you fix your system so that it doesn’t lock up???

– Teradata support: No. But we made this great loader tool so that you can copy everything from your OLTP system into a separate Data Warehouse system at 100 GB/hour

• 1992 : Teradata creates the first system over

1 terabyte which goes online at Walmart

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 26

1.1 DW - The Story

Page 27: Data Warehousing & Mining Techniques

• OLTP (OnLine Transaction Processing)

– Typically for data entry / retrieval and transaction

processing

– Works on the operational data stores (ODS) and

represents day-to-day operational business activities

• Purchasing, sales, production distribution, …

– Reflects only the current state

of the data

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 27

I.I OLTP

Page 28: Data Warehousing & Mining Techniques

• OLAP (OnLine Analytical Processing)

– Provides information for activities like

• Enterprise resource planning, capital budgeting, marketing

initiatives,…

– Represents front-end analytics based on a DW

repository

– Used for reporting and

it is decision oriented

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 28

I.I OLAP

Page 29: Data Warehousing & Mining Techniques

• Properties of Operational Data Stores and DWs

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 29

I.I ODS vs. DW

ODS DW

Mostly updates Mostly reads

Many small transactions Few, but complex queries

MB-TB of data GB-PB of data

Raw data Summarized data

Clerks Decision makers

Up-to-date data May be slightly outdated

Page 30: Data Warehousing & Mining Techniques

• Consider a normalized database for a store

– The schema would look somewhat like this…

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 30

I.I ODS vs. DW

Customer

Customer_ID (PK) Name Address City Postal_code

Invoice

Invoice_number (PK) Date Customer_ID (FK) Status_code (FK) Total Sales_tax Shipping_charge

Status

Status_code (PK) Status

Sales_unit

Sales_unit_D (PK) Name Address City Postal_code Telephone_number Email

Product

Product_ID (PK) Name Description Cost Sales_unit_ID (FK)

Invoice_line_item

Invoice_number (FK) Item_seq_number Product_ID (FK) Units Unit_cost

Page 31: Data Warehousing & Mining Techniques

• If we were to set up a DW for that store, we

would start by building the following schema

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 31

I.I ODS vs. DW

Customer

Customer_ID (PK) Name Address City Postal_code

Sales_facts

Product_ID (FK) Customer_ID (FK) Time_key (FK) Sales_unit_ID (FK) Total Sales_tax Shipping_charge

Sales_unit

Sales_unit_ID (PK) Name Address City Postal_code Telephone

Time_period

Time_key (PK) Description Day Fiscal_week Fiscal_period Fiscal_year

Product

Product_ID (PK) Name Description Cost

Page 32: Data Warehousing & Mining Techniques

• Basic insights from comparing ODS and DWs

– A DW is a separate (DBMS) installation that

contains copies of data from operational systems

• Physically separate hardware may not be absolutely

necessary if you have lots of extra computing power,

e.g., virtualization

– With an optimistic locking DBMS you might even

be able to get away for a while with keeping just one

copy of your data

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 32

I.I ODS vs. DW

Page 33: Data Warehousing & Mining Techniques

• There is an essentially different pattern of

hardware utilization between transactional and

analytical processing

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 33

I.I ODS vs. DW

Operational Data warehouse

Page 34: Data Warehousing & Mining Techniques

• Typical questions to be be answered by OLAP

– How much did sales unit A earn in January?

– How much did sales unit B earn in February?

– What was their combined sales amount for the first

quarter?

• Answering these questions with SQL-queries is

difficult

– Complex query formulation necessary

– Processing will be slow due to complex joins and

multiple scans

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 34

I.I ODS vs. DW

Page 35: Data Warehousing & Mining Techniques

• Why can such questions be answered faster with

a DW?

– Because in a DW data is rearranged in tables and

pre-aggregated

– The table arrangement is subject oriented, usually

some star schema

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 35

I.I ODS vs. DW

Page 36: Data Warehousing & Mining Techniques

• A DW is the base repository for front-end

analytics (or business cockpits)

– OLAP

– Knowledge discovery in databases (KDD)

and data mining

• Results are used for

– Data visualization

– Reporting

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 36

I.2 Typical Application

Page 37: Data Warehousing & Mining Techniques

• As a form of information processing OLAP

needs to provide timely, accurate and

understandable information

– ‘Timely’ is however a relative term…

• In OLTP we expect a query/update to go

through in a matter of seconds

• In OLAP the time to answer a query can

take minutes, hours or even longer

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 37

I.2 Typical Application

Page 38: Data Warehousing & Mining Techniques

• KDD & Data Mining

– Constructs models of the data in question

• Models can be seen as high level summaries of the

underlying data

• E.g. “customers older than 35 having at least 1 child and

driving a minivan usually spend more than €100 for grocery

shopping”

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 38

I.2 Typical Application

ID Name Age Income Children Car Spent

12 Peter 45 € 65,000 2 Mini Van € 210.00

15 Gabriel 28 € 53,000 0 Coupe € 30.00

… … … … … … …

122 Claire 40 € 52,000 1 Mini Van € 250.00

Page 39: Data Warehousing & Mining Techniques

• Users of DW are called decision support

system (DSS) analysts and usually have a

business background

– Their primary job is to define and discover

information used in corporate decision-making

– The way they think

• “Show me what I say I want…

and then I can tell you what I really want”

• They work in an explorative manner

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 39

I.2 Who are the users?

Page 40: Data Warehousing & Mining Techniques

– Typical explorative line of work

• “When I see what the possibilities are, I can tell what I really

need to see. But until I see what the possibilities are, I

cannot describe exactly what I want…”

– This usage has profound effect on the way a DW is

developed

• The classical system development life cycle assumes

that the requirements are known at the start of design

• The DSS analyst starts with existing requirements, but

factoring in new requirements is almost impossible

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 40

I.2 Who are the users?

Page 41: Data Warehousing & Mining Techniques

• System Development Life Cycle (SDLC)

– Classical SDLC vs. DW SDLC

– DW SDLC is almost the opposite of classical SDLC,

since requirements are not known from the beginning

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 41

I.3 Lifecycle of DWs

Requirements Requirements

DB DB

Program Program

DW DW

Program Program

Requirements Requirements

Page 42: Data Warehousing & Mining Techniques

• Classical SDLC vs. DW SDLC

– Because it is the opposite of SDLC,

DW SDLC is also called CLDS

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 42

I.3 Lifecycle of DWs

Classical SDLC DW SDLC

Requirements gathering Implement warehouse

Analysis Integrate data

Design Test for bias

Programming Program against data

Testing Design DSS system

Integration Analyze results

Implementation Understand requirements

Page 43: Data Warehousing & Mining Techniques

• CLDS is a data driven development life cycle

– It starts with data

• Once data is at hand it is integrated and tested against bias

• Programs are written against the data and the results are

analyzed and finally the requirements of the system are

understood

• Once requirements are understood,

adjustments are made to the design and

the cycle starts all over

– “spiral development methodology”

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 43

I.3 Lifecycle of DW

Page 44: Data Warehousing & Mining Techniques

• Lifecycle phases

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 44

I.3 Lifecycle of DWs

Page 45: Data Warehousing & Mining Techniques

• Design

– Interviewing the end-users in cycles

– Analyzing the data source system (ODS)

– Defining the key performance indicators

– Mapping the decision-making processes to the underlying information needs

– Logical and physical schema design

• Prototype

– Objective is to constrain and in some cases reframe end-user requirements

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 45

I.3 Lifecycle of DWs

Page 46: Data Warehousing & Mining Techniques

• Deployment – Development of documentation

– Personal training

– Operations and management processes

• Operation – Day-to-day maintenance of the DW needs a good

management of ongoing Extraction, Transformation and Loading (ETL) process

• Enhancement requires the modification of – HW - physical components

– Operations and management processes

– Logical schema designs

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 46

I.3 Lifecycle of DWs

Page 47: Data Warehousing & Mining Techniques

• When operating a DW the following phases can

be identified

– Monitoring

– Extraction

– Transforming

– Loading

– Analyzing

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 47

I.3 Operating a DW

Page 48: Data Warehousing & Mining Techniques

• Monitoring

– Surveillance of the data sources

– Identification of data modification which is relevant to

the DW

• Monitoring has an important role over the whole

process deciding on which data to load, and when

to load it into the DW

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 48

I.3 Monitoring

Page 49: Data Warehousing & Mining Techniques

• Monitoring techniques

– Active mechanisms - Event Condition Action (ECA) rules:

– Replication mechanisms:

• Oracle 9i – Snapshots are local copies of data (similar to a view): a snapshot is replaced completely on change

• IBM DB2 – Data replication maintains and replicates data in destination tables through a data propagation processes (data is incrementally updated)

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 49

I.3 Monitoring

EVENT Payment

CONDITION Account sum > 10 000 €

ACTION Transfer to economy account

Page 50: Data Warehousing & Mining Techniques

– Protocol based mechanisms:

• Since DBMSs write protocol data for transaction

management, the protocol can also be used for

monitoring

• Problematic since protocol formats are proprietary

and subject to change

– Application managed mechanisms:

• Hard to implement for legacy systems

• Based on time stamping or data comparison

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 50

I.3 Monitoring

Page 51: Data Warehousing & Mining Techniques

• Extraction

– Reads the data selected during the monitoring phase

and inserts it in the intermediate data structures of

the workplace (“staging area”)

– Due to large data volume, compression can be used

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 51

I.3 Extraction

Page 52: Data Warehousing & Mining Techniques

– The time-point for performing extraction can be

• Periodical: weather or stock market information can be actualized more times in a day, while product specification can be actualized in a longer period of time

• On request: e.g. when a new item is added to a product group

• Event driven: event driven e.g. number of modifications over passing a specified threshold triggers the extraction

• Immediate: in some special cases like the stock market it can be necessary that the changes propagate immediately to the warehouse

– Extraction largely depends on hardware and the software used for the DW and the data source

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 52

I.3 Extraction

Page 53: Data Warehousing & Mining Techniques

• Transforming

– Implies adapting data, schema as well as data

quality to the application requirements

– Data integration:

• Transformation in de-normalized data structures

• Handling of key attributes

• Adaptation of different types of the same data

• Conversion of encoding: “Buy”, “Sell” 1,2 vs. B, S 1,2

• Date handling: “MM-DD-YYYY” “MM.DD.YYYY”

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 53

I.3 Transforming

Page 54: Data Warehousing & Mining Techniques

• String normalization

– “Michael Schumacher” “Michael, Schumacher” vs.

“Schumacher Michael” “Michael, Schumacher”

• Measurement units and scaling

– 10 inch 25,4 cm

– 30 mph 48,279 km/h

• Save calculated values

– Price including tax = Price without tax * 1.19

• Aggregation

– Daily sums can be added into weekly ones

– Different levels of granularity can be used

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 54

I.3 Transforming

Page 55: Data Warehousing & Mining Techniques

– Data cleansing (or data cleaning)

• Consistency check: Delivery Date > Order Date

• Completeness: management of missing values as well as

NULL values

• Dictionary approaches for city or person names

• Regular expressions for phone numbers or email addresses

• Duplicate detection for redundancy elimination

• Outlier detection as a warning system for possible errors

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 55

I.3 Transforming

Page 56: Data Warehousing & Mining Techniques

• Loading

– Loading usually takes place during weekends or nights when the system is not under user stress

– Split between initial load to initialize the DW and the periodical load to keep the DW updated

– Initial loading

• Implies big volumes of data and

for this reason a bulk loader

is used

– Usually optimized by means of parallelization and incremental actualization

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 56

I.3 Loading

Page 57: Data Warehousing & Mining Techniques

• Analysis phase

– Data access - useful for extracting goal oriented information

• How many iPhones 3G were sold in the Braunschweig stores of T-Mobile in the last 3 calendar weeks of 2010?

• Although it’s a common OLTP query, it might be to complex for the operational environment to handle

– OLAP - the class of analytical operations running on the DW

• In which district does a product group register the highest profit? And how did the profit change in comparison to the previous month?

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 57

I.3 Analyzing

Page 58: Data Warehousing & Mining Techniques

– Data mining

• Useful for identifying hidden patterns, e.g. customers buying

wine also buy cheese

• Useful for answering questions like: How does the typical

iPad buyer look like? (for a targeted marketing campaign)

• Methods and procedures for data mining: association rule

mining, sequence pattern mining, classification, clustering,

etc.

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 58

I.3 Analyzing

Page 59: Data Warehousing & Mining Techniques

• Data Warehousing overview

– Simplified, a data warehouse is a collective data repository built for analytical tasks

– Data is extracted from the operational environment, it is transformed (and cleaned) and finally loaded into the DW

– Typical usage scenarios of DW are budgeting, resource planning, marketing, etc.

– Users of the DW are DSS analysts and they work explorative

– Since requirements are not known at the beginning, the lifecycle of the data warehouse is almost the reverse of classical software development projects

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 59

Summary

Page 60: Data Warehousing & Mining Techniques

• Data Warehousing – DW Architecture

– Data Modeling: conceptual, logical, physical

– Optimization: indexes, partitioning, materialized views and join optimization strategies

– Querying the DW: OLAP operations & queries SQL’99 and MDX

– How do we get the data in the DW? The ETL process in detail

• Data Mining – Association rule mining, sequence pattern mining, time series

– Supervised machine learning: (classification) decision trees, naïve Bayes, support vector machines

– Unsupervised machine learning (clustering) k-means, hierarchical clustering

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 60

What Next?

Page 61: Data Warehousing & Mining Techniques

• Data Warehouse Architecture

– Basic architectures

– Storage models

– Layers

– Middleware

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 61

Next Lecture