lecture @dhbw: data warehouse part 05: data …a collaborative data management practice focused on...

57
A company of Daimler AG LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA SECURITY – DATA CULTURE ANDREAS BUCKENHOFER, DAIMLER TSS

Upload: others

Post on 25-Jun-2020

7 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

A company of Daimler AG

LECTURE @DHBW: DATA WAREHOUSE

PART 05: DATA SECURITY – DATA CULTUREANDREAS BUCKENHOFER, DAIMLER TSS

Page 2: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

ABOUT ME

Andreas BuckenhoferSenior DB Professional

Since 2009 at Daimler TSS

Department: Machine Learning Solutions

Business Unit: Analytics

Contact/Connect

vcard

• Oracle ACE Associate

• DOAG responsible for InMemory DB

• Lecturer at DHBW

• Certified Data Vault Practitioner 2.0

• Certified Oracle Professional

• Certified IBM Big Data Architect

• Over 20 years experience with

database technologies

• Over 20 years experience with Data

Warehousing

• International project experience

Page 3: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

After the end of this lecture you will be able to

• Understand importance of Data Security

• Data Classification

• GDPR

• Understand importance of Data Culture

WHAT YOU WILL LEARN TODAY

Data Warehouse / DHBWDaimler TSS 3

Page 4: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

DATA SECURITY

Data Warehouse / DHBWDaimler TSS 4

Page 5: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE

Data Warehouse / DHBWDaimler TSS 5

Data Warehouse

FrontendBackend

External data sources

Internal data sources

Staging

Layer

(Input

Layer)

OLTP

OLTP

Core

Warehouse

Layer

(Storage

Layer)

Mart Layer

(Output

Layer)

(Reporting

Layer)

Integration

Layer

(Cleansing

Layer)

Aggregation

Layer

Metadata Management

Security

DWH Manager incl. Monitor

Page 6: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

• End users should only have access to the Data Marts in a DWH

• Grant privileges and roles to users

• Good practice:

• Grant privileges directly to a role; do not nest roles (grant role to role)

• Grant role to users

• Distinguish end users, administrators and technical users with different password

policies

• Much more difficult to grant access in a Data Lake

• Tool maturity

• Data often unknown with schema-on-read

DATA ACCESS

Data Warehouse / DHBWDaimler TSS 6

Page 7: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

CIA TRIAD – DATA CLASSIFICATION

Data Warehouse / DHBWDaimler TSS 7

Confidentiality

IntegrityAvailability

whether or not

information is kept

secret or private,

e.g. data theft

whether the

information is kept

accurate, e.g.

faking data

ensuring that

information is

available when it is

needed, e.g.

blackout

Page 8: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

Regulation in EU law on data protection and privacy for all individuals in the

EU

Requirements like

• Data protection by design and by default

• Right to erasure / Right to be forgotten

• Records of processing activities

GDPR – GENERAL DATA PROTECTION REGULATION

Data Warehouse / DHBWDaimler TSS 8

Not relevant for exam

Page 9: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

• "capture-it-all" approach causes serious questions of privacy

• always ask yourself why you are capturing or storing data

CHALLENGES DWH & BIG DATA

Data Warehouse / DHBWDaimler TSS 9

Source: https://martinfowler.com/bliki/Datensparsamkeit.html

Challenges like

• Combining of sensitive data

sources allowed?

• Export restrictions that

forbid to combine data from

Germany, US, China

Page 10: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

Anonymization

• Irreversibly destroys any way to identify an individual

• GDPR does not apply

Pseudonymization

• Additional information required to re-identify an individual

• GDPR does apply

ANONYMIZATION VS PSEUDONYMIZATION

Data Warehouse / DHBWDaimler TSS 10

Not relevant for exam

Page 11: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

TYPICAL SECURITY FUNCTIONALITIES AND MEASURES IN DATABASES

Data Warehouse / DHBWDaimler TSS 11

Auditing

Encryption

Masking (static, dynamic)

Row Level Security

Authentication & Password

policies

Patching

Secure installation

SSL – secure

communication

Authorization & roles,

profiles, ressource limits

Security checklists

Page 12: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

Source: https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/

Page 13: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

Source: https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/

Page 14: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

DATA CULTURE

Data Warehouse / DHBWDaimler TSS 14

Page 15: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

• “Taping into the potential of data involves data

sensemaking. I prefer the term over the more

popular term analytics because it better fits the full

range of activities that are needed” (page 44)

• “Data sensemaking requires a significant

investment in the development of human skills”(page 74)

• DWH/Big Data Architecture (instead of

technology/tools)

• Data integration, Data visualization, Data modeling, …

STEPHEN FEW: BIG DATA, BIG DUPE (ANALYTICS PRESS, 2018)

Data Warehouse / DHBWDaimler TSS 15

Page 16: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

DATA LANDSCAPE MANIFESTO@SCOUT24

Data Warehouse / DHBWDaimler TSS 16

Source: Data Festival, Munich 2019

Not relevant for exam

Page 17: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

A collaborative data management practice focused on improving the

communication, integration and automation of data flows between data

managers and consumers across an organization.

The goal of DataOps is to create predictable delivery and change

management of data, data models and related artifacts. DataOps uses

technology to automate data delivery with the appropriate levels of security,

quality and metadata to improve the use and value of data in a dynamic

environment.

DATAOPS

Data Warehouse / DHBWDaimler TSS 17

Source: https://medium.com/data-ops/the-best-dataops-articles-of-q3-2018-c39882be3d7b

Not relevant for exam

Page 18: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

NETFLIX’ THREE CREDOS FOR DATA CULTURE

Data Warehouse / DHBWDaimler TSS 18

Source: Phil Simons’ webinar on “The visual organization: data visualization, big data, and the quest for better decisions” from

Harvard Business Review (2014)

Data catalog: Data should be accessible, easy to discover, and easy to process for everyone

Data storytelling: Whether your dataset is large or small, being able to visualize it makes it easier to explain

Fail fast and iterate: The longer you take to find the data, the less valuable it becomes

Not relevant for exam

Page 19: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

Establish a culture for

• Sharing data across the organization by following privacy and ethics

• Collaborating around data products

• Using data for decisions instead of e.g. experience

• Speed: fail fast and iterate

Being successful as a data-driven company requires the active involvement of

all employees

DATA CULTURE

Data Warehouse / DHBWDaimler TSS 19

Not relevant for exam

Page 20: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

DIGITIZATION - NEW GIANTSBAT – BAIDU + ALIBABA + TENCENT

Data Warehouse / DHBWDaimler TSS 20

Sources:

https://venitism.wordpress.com/2017/12/15/beware-of-the-bats-baidu-alibaba-and-tencent/

https://www.afr.com/brand/business-summit/baidu-alibaba-tencent-to-disrupt-facebook-amazon-netflix-google-in-asia-20180228-h0wrdl

Not relevant for exam

Page 21: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

We’re entering a new world in which

data may be more important than

software.

[Tim O’Reilly, Founder O’Reilly Media]

Data is a precious thing and will last longer than

the systems themselves.

[Tim Berners-Lee, Father of the Worldwide Web]

Information is the oil of the 21st

century

[Peter Sondergaard, Gartner]

Everything we do in the digital realm ... creates a data trail.

And if that trail exists, chances are someone is using it.

[Douglas Rushkoff, Author]

Page 22: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

Data creation is exploding

[Gavin Belson, HBOs Silicon Valley]

Data is the new gold

[Open Data Initiative, European Commission]

In a world deluged by irrelevant

information, clarity is power.

[Yuval Noah Harari, Author]

Big data is not about the data

[Gary King, Harvard University]

Page 23: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

DATA WAREHOUSE

Data Warehouse / DHBWDaimler TSS 23

Applications come, applications go.The data, however, lives forever.

It is not about building applications;it really is about the data

underneath these applications(Tom Kyte, Oracle)

Page 24: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

Daimler TSS GmbHWilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99

[email protected] / Internet: www.daimler-tss.com/ Intranet-Portal-Code: @TSSDomicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle

Data Warehouse / DHBWDaimler TSS 24

THANK YOU

Page 25: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

SQL STANDARD IS EVOLVING

Data Warehouse / DHBWDaimler TSS 25

Source: https://vimeo.com/289497563

Page 26: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

Write an SQL statement that computes the most recent data for each

customer.

Script to create the table including data: https://github.com/abuckenhofer/dwh_course/tree/master/scripts

EXERCISE: COMPUTE MOST RECENT ROWS

Data Warehouse / DHBWDaimler TSS 26

Customer_key Name Status Valid_from

1 Brown Single 01-MAY-2014

2 Bush Married 05-JAN-2015

1 Miller Married 15-DEC-2015

3 Stein 15-DEC-2015

3 Stein Single 18-DEC-2015

SIN1.sql

Page 27: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

EXERCISE: COMPUTE MOST RECENT ROWS

Data Warehouse / DHBWDaimler TSS 27

Page 28: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

create table s_customer (

customer_key integer NOT NULL

, cust_name varchar2(100) NOT NULL

, status varchar2(10)

, valid_from date NOT NULL

, CONSTRAINT s_customer_pk PRIMARY KEY (customer_key, valid_from)

);

insert into s_customer (customer_key, cust_name, status, valid_from) values (1, 'Brown',

'Single', to_date('01.05.2014', 'DD.MM.YYYY'));

insert into s_customer (customer_key, cust_name, status, valid_from) values (2, 'Bush',

'Married', to_date('05.01.2015', 'DD.MM.YYYY'));

insert into s_customer (customer_key, cust_name, status, valid_from) values (1, 'Miller',

'Married', to_date('15.12.2015', 'DD.MM.YYYY'));

insert into s_customer (customer_key, cust_name, status, valid_from) values (3, 'Stein',

NULL, to_date('15.12.2015', 'DD.MM.YYYY'));

insert into s_customer (customer_key, cust_name, status, valid_from) values (3, 'Stein',

'Single', to_date('18.12.2015', 'DD.MM.YYYY'));

commit;

EXERCISE: COMPUTE MOST RECENT ROWS

Data Warehouse / DHBWDaimler TSS 28

Page 29: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

SELECT s.*

FROM S_CUSTOMER s

JOIN (SELECT i.customer_key,

max(i.valid_from) as max_valid_from

FROM S_CUSTOMER i

GROUP BY i.customer_key) b

ON s.customer_key = b.customer_key

AND s.valid_from = b.max_valid_from;

EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 1: MAX-FUNCTION

Data Warehouse / DHBWDaimler TSS 29

S2IN.sql

Page 30: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

SELECT s.*

FROM S_CUSTOMER s

WHERE NOT EXISTS (SELECT 1

FROM S_CUSTOMER i

WHERE s.customer_key = i.customer_key

AND s.valid_from < i.valid_from);

EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 2: EXISTS

Data Warehouse / DHBWDaimler TSS 30

S2IN.sql

Page 31: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

SELECT s.*

FROM S_CUSTOMER s

WHERE s.valid_from = (SELECT MAX(i.valid_from)

FROM S_CUSTOMER i

WHERE s.customer_key = i.customer_key);

EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 3: MAX IN CORRELATED SUB-SELECT

Data Warehouse / DHBWDaimler TSS 31

S2IN.sql

Page 32: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

SELECT *

FROM (SELECT coalesce ((SELECT min (i.valid_from)

FROM S_CUSTOMER i

WHERE s.customer_key = i.customer_key

AND s.valid_from < i.valid_from

), to_date ('31.12.9999',

'DD.MM.YYYY')) as end_ts,

s.*

FROM S_CUSTOMER s)

WHERE end_ts = to_date ('31.12.9999', 'DD.MM.YYYY');

EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 4: COALESCE WITH SUB-SELECT

Data Warehouse / DHBWDaimler TSS 32

S2IN.sql

Page 33: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

WITH max_cust as (

SELECT i.customer_key,

max(i.valid_from) as max_valid_from

FROM S_CUSTOMER i

GROUP BY i.customer_key)

SELECT s.*

FROM S_CUSTOMER s

JOIN max_cust b ON s.customer_key = b.customer_key

AND s.valid_from = b.max_valid_from;

EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 5: MAX-FUNCTION AND WITH-CLAUSE

Data Warehouse / DHBWDaimler TSS 33

S2IN.sql

Page 34: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

partition data

compute functions over these partitions

Rank [sequential order], first [first row], last [last row], lag [previous row], lead [next row]

return result

EXERCISE: COMPUTE MOST RECENT ROWSSQL ANALYTIC / WINDOWING FUNCTIONS

Data Warehouse / DHBWDaimler TSS 34

Page 35: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

WITH lead_cust as (

SELECT lead (s.valid_from, 1) OVER (PARTITION BY

s.customer_key

ORDER BY s.valid_from ASC) as end_ts

, s.*

FROM s_customer s)

SELECT *

FROM lead_cust b

WHERE b.end_ts IS NULL;

EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 6: ANALYTIC / WINDOWING FUNCTION LEAD

Data Warehouse / DHBWDaimler TSS 35

S3IN.sql

Page 36: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

WITH lead_cust as (

SELECT row_number() OVER(PARTITION BY s.customer_key

ORDER BY s.valid_from DESC) as rn

, s.*

FROM s_customer s)

SELECT *

FROM lead_cust b

WHERE b.rn = 1;

EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 7: ANALYTIC FUNCTION ROW_NUMBER

Data Warehouse / DHBWDaimler TSS 36

S3IN.sql

Page 37: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

• Check execution plans, execution time including service + response time,

resource usage for final decision

• Solutions with Analytic / Windowing do not need self-join and show better

statistics compared to the other shown solutions

• Analytic / Windowing functions are very powerful

• Remark: Usage of with-clause in SQL statements is preferable compared

to sub-selects as it improves readability, understandability, maintainability

MAX OR ANALYTIC / WINDOWING FUNCTIONS WHICH ALTERNATIVE WOULD YOU RECOMMEND?

Data Warehouse / DHBWDaimler TSS 37

Page 38: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

TEMPORAL DATA STORAGE (BITEMPORAL DATA)

Data Warehouse / DHBWDaimler TSS 38

Page 39: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

TEMPORAL DATA STORAGE (BITEMPORAL DATA)

Data Warehouse / DHBWDaimler TSS 39

10.09. 20.09. 30.09. 10.10.

Time

Price: 15EUR Price: 16EUR

New Price of 16EUR is

entered into the DB

Valid

Time

(20.09.)

Transaction

Time

(10.09.)

Page 40: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

• Time period when a fact is true in the real world

• The end user determines start and end date/time (or just a date/time for events)

Business validity:

Valid time

• Time period when a fact stored in the database is known

• ETL process determines start and end date/time

Technical validity:Transaction time

• Combines both Valid and Transaction TimeBitemporal data

TEMPORAL DATA STORAGE (BITEMPORAL DATA)DEFINITION

Data Warehouse / DHBWDaimler TSS 40

Page 41: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

• SQL standard SQL:2011

• But different implementations by RDBMSes like Oracle, DB2, SQL Server

and others

• Different syntax!

• Different coverage of standard!

• Very useful for slowly changing dimensions type 2, but also for other

purposes

TEMPORAL DATA STORAGE (BITEMPORAL DATA)SQL STANDARD

Data Warehouse / DHBWDaimler TSS 41

Page 42: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

CREATE TABLE customer_address

( customerID INTEGER NOT NULL

, name VARCHAR(100)

, city VARCHAR(100)

, valid_start DATE NOT NULL

, valid_end DATE NOT NULL

, PERIOD BUSINESS_TIME(valid_start, valid_end)

, PRIMARY KEY(customerID, BUSINESS_TIME WITHOUT OVERLAPS) );

DB2 VALID TIME EXAMPLE

Data Warehouse / DHBWDaimler TSS 42

Page 43: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

INSERT INTO customer_address VALUES

(1, 'Miller', 'Seattle', '01.01.2013', '31.12.2013');

UPDATE customer_address FOR PORTION OF BUSINESS_TIME

FROM '22.05.2013' TO '31.12.2013'

SET city = 'San Diego' WHERE customerID = 1;

DB2 VALID TIME EXAMPLE

Data Warehouse / DHBWDaimler TSS 43

customerID Name City Valid_start Valid_end

1 Miller Seattle 01.01.2013 22.05.2013

1 Miller San Diego 22.05.2013 31.12.2013

Page 44: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

SELECT *

FROM customer_address

FOR BUSINESS_TIME AS OF '17.05.2013';

DB2 VALID TIME EXAMPLE

Data Warehouse / DHBWDaimler TSS 44

Page 45: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

CREATE TABLE customer_info(

customerId INTEGER NOT NULL,

comment VARCHAR(1000) NOT NULL,

sys_start TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,

sys_end TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,

PERIOD SYSTEM_TIME (sys_start, sys_end)

);

DB2 TRANSACTION TIME EXAMPLE

Data Warehouse / DHBWDaimler TSS 45

Page 46: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

Transaction on 15.10.2013:

INSERT INTO customer_info VALUES( 1, 'comment 1');

Transaction on 31.10.2013

UPDATE customer_address SET comment = 'comment 2'

WHERE customerID = 1;

DB2 TRANSACTION TIME EXAMPLE

Data Warehouse / DHBWDaimler TSS 46

CustomerId comment Sys_start Sys_end

1 Comment 2 31.10.2013 31.12.2999

Page 47: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

SELECT *

FROM customer_info FOR SYSTEM_TIME AS OF '17.10.2013';

Data comes from a history table:

Valid Time and Transaction Time can be combined = Bitemporal table

DB2 TRANSACTION TIME EXAMPLE

Data Warehouse / DHBWDaimler TSS 47

CustomerId comment Sys_start Sys_end

1 Comment 1 15.10.2013 31.10.2013

Page 48: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

• Very important performance improvement technique

• Good for many reads with high selectivity, write penalty

• B-trees most common

INDEXING - WHY

Data Warehouse / DHBWDaimler TSS 48

root

branch branch

leaf leaf leaf

Table

Page 49: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

• DBs index Primary Keys by default

• Dimension table columns

that are regularly used in where clauses

are candidates

• Maybe foreign Key columns in Fact table

(see also later Star Transformation)

INDEXING A STAR SCHEMA – WHICH COLUMNS ARE CANDIDATES FOR AN INDEX?

Data Warehouse / DHBWDaimler TSS 49

Page 50: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

• Fact table has normally much more rows compared to dimension tables

• Common join techniques would need to join first dimension table with the

fact table

• Alternative technique: evaluate all dimensions

(cartesian join)

• Then join into fact table in last step

• Oracle uses Bitmap indexes on

foreign key columns in fact tables to achieve

Star Join; not supported by many DBs

STAR TRANSFORMATION

Data Warehouse / DHBWDaimler TSS 50

Page 51: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

PARTITIONING

Data Warehouse / DHBWDaimler TSS 51

Col1 Col2 Col3 col4

1 A AA AAA

2 B BB BBB

3 C CC CCC

Col1 Col2

1 A

2 B

3 C

Col3 col4

AA AAA

BB BBB

CC CCC

Col1 Col2 Col3 col4

3 C CC CCC

Col1 Col2 Col3 col4

1 A AA AAA

2 B BB BBB

Vertical partitioning Horizontal partitioning

Page 52: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

• Very powerful feature in a DWH to reduce workload

• Split table into logical smaller tables

• Avoidance of full table scans

• How could a table be split?

• Introduction to (Oracle) partitioning:

https://asktom.oracle.com/partitioning-for-developers.htm

HORIZONTAL PARTITIONING

Data Warehouse / DHBWDaimler TSS 52

Page 53: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

• By range

• Most common

• Use date field like order data to partition table into months, days, etc

• By list

• Use field that has limited number of different values, e.g. split customer data by

country if end users most likely select customers from within a country

• By hash

• Use a filed that most likely splits the data in evenly distributed chunks

HORIZONTAL PARTITIONING – SPLITTING OPTIONS

Data Warehouse / DHBWDaimler TSS 53

Page 54: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

• Statements are normally executed on one CPU

• Parallelism allows the DB to distribute the execution to several CPUs

• Powerful combination with partitioning

• Parallelism is limited by the number of CPUs: if parallelism is too high, performance

will degrade

• Intra-query parallelism and inter-query parallelism

PARALLELISM

Data Warehouse / DHBWDaimler TSS 54

Page 55: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

• Data compression + Index compression

• Store more data in a block/page = read more data during I/O

• If CPU resources are available, often a very powerful feature to improve performance

• Additionally reduce storage

• Additionally reduce backup time

COMPRESSION

Data Warehouse / DHBWDaimler TSS 55

Page 56: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

• Relational columnar In-Memory DB

• Materialized Views / Query Tables

ALREADY COVERED IN A PREVIOUS LECTURE

Data Warehouse / DHBWDaimler TSS 56

Page 57: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data

• Recapture ETL and DB specific topics

• Which topics do you remember, or do you find important?

• Write down 1-2 topics on stick-it cards.

EXERCISE - RECAPTURE ETL AND DB SPECIFICS

Data Warehouse / DHBWDaimler TSS 57