lecture @dhbw: data warehouse part xxi: db specificsbuckenhofer/20182dwh/buckenhofer-… · user...

A company of Daimler AG

LECTURE @DHBW: DATA WAREHOUSE

PART XXI: DB SPECIFICSANDREAS BUCKENHOFER, DAIMLER TSS

ABOUT ME

https://de.linkedin.com/in/buckenhofer

https://twitter.com/ABuckenhofer

https://www.doag.org/de/themen/datenbank/in-memory/

http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/

https://www.xing.com/profile/Andreas_Buckenhofer2

Andreas BuckenhoferSenior DB [email protected]

Since 2009 at Daimler TSS Department: Big Data Business Unit: Analytics



mailto:[email protected]

ANDREAS BUCKENHOFER, DAIMLER TSS GMBH

Data Warehouse / DHBWDaimler TSS 3

“Forming good abstractions and avoiding complexity is an essential part of a successful data architecture”

Data has always been my main focus during my long-time occupation in the area of data integration. I work for Daimler TSS as Database Professional and Data Architect with over 20 years of experience in Data Warehouse projects. I am working with Hadoop and NoSQL since 2013. I keep my knowledge up-to-date - and I learn new things, experiment, and program every day.

I share my knowledge in internal presentations or as a speaker at international conferences. I'm regularly giving a full lecture on Data Warehousing and a seminar on modern data architectures at Baden-Wuerttemberg Cooperative State University DHBW. I also gained international experience through a two-year project in Greater London and several business trips to Asia.

I’m responsible for In-Memory DB Computing at the independent German Oracle User Group (DOAG) and was honored by Oracle as ACE Associate. I hold current certifications such as "Certified Data Vault 2.0 Practitioner (CDVP2)", "Big Data Architect“, „Oracle Database 12c Administrator Certified Professional“, “IBM InfoSphere Change Data Capture Technical Professional”, etc.

DHBWDOAG

xing

Contact/Connect

mailto:[email protected]






https://www.doag.org/de/themen/datenbank/in-memory/


https://www.xing.com/profile/Andreas_Buckenhofer2

As a 100% Daimler subsidiary, we give

100 percent, always and never less.

We love IT and pull out all the stops to

aid Daimler's development with our

expertise on its journey into the future.

Our objective: We make Daimler the

most innovative and digital mobility

company.

NOT JUST AVERAGE: OUTSTANDING.

Daimler TSS

INTERNAL IT PARTNER FOR DAIMLER

+ Holistic solutions according to the Daimler guidelines

+ IT strategy

+ Security

+ Architecture

+ Developing and securing know-how

+ TSS is a partner who can be trusted with sensitive data

As subsidiary: maximum added value for Daimler

+ Market closeness

+ Independence

+ Flexibility (short decision making process,

ability to react quickly)

Daimler TSS 5

Daimler TSS

LOCATIONS

Data Warehouse / DHBW

Daimler TSS ChinaHub Beijing10 employees

Daimler TSS MalaysiaHub Kuala Lumpur42 employees

Daimler TSS IndiaHub Bangalore22 employees

Daimler TSS Germany

7 locations

1000 employees*

Ulm (Headquarters)

Stuttgart

Berlin

Karlsruhe

* as of August 2017

6

• After the end of this lecture you will be able to

• Understand DB techniques that are specific for DWH

• Analytic/windowing functions

• Bitemporal data

• Indexing, Partitioning, Parallelism, Compression

WHAT YOU WILL LEARN TODAY


SQL STANDARD IS EVOLVING


Source: https://vimeo.com/289497563

https://vimeo.com/289497563

Write an SQL statement that computes the most recent data for each customer.

Script to create the table including data: https://github.com/abuckenhofer/dwh_course/tree/master/scripts

EXERCISE: COMPUTE MOST RECENT ROWS


Customer_key Name Status Valid_from

1 Brown Single 01-MAY-2014

2 Bush Married 05-JAN-2015

1 Miller Married 15-DEC-2015

3 Stein 15-DEC-2015

3 Stein Single 18-DEC-2015

SIN1.sql

create table s_customer (

customer_key integer NOT NULL

, cust_name varchar2(100) NOT NULL

, status varchar2(10)

, valid_from date NOT NULL

, CONSTRAINT s_customer_pk PRIMARY KEY (customer_key, valid_from)

);

insert into s_customer (customer_key, cust_name, status, valid_from) values (1, 'Brown',

'Single', to_date('01.05.2014', 'DD.MM.YYYY'));

insert into s_customer (customer_key, cust_name, status, valid_from) values (2, 'Bush',

'Married', to_date('05.01.2015', 'DD.MM.YYYY'));

insert into s_customer (customer_key, cust_name, status, valid_from) values (1, 'Miller',

'Married', to_date('15.12.2015', 'DD.MM.YYYY'));

insert into s_customer (customer_key, cust_name, status, valid_from) values (3, 'Stein',

NULL, to_date('15.12.2015', 'DD.MM.YYYY'));

insert into s_customer (customer_key, cust_name, status, valid_from) values (3, 'Stein',

'Single', to_date('18.12.2015', 'DD.MM.YYYY'));

commit;



SELECT s.*

FROM S_CUSTOMER s

JOIN (SELECT i.customer_key,

max(i.valid_from) as max_valid_from

FROM S_CUSTOMER i

GROUP BY i.customer_key) b

ON s.customer_key = b.customer_key

AND s.valid_from = b.max_valid_from;

EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 1: MAX-FUNCTION


S2IN.sql

SELECT s.*

FROM S_CUSTOMER s

WHERE NOT EXISTS (SELECT 1

FROM S_CUSTOMER i

WHERE s.customer_key = i.customer_key

AND s.valid_from < i.valid_from);

EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 2: EXISTS


S2IN.sql

SELECT s.*

FROM S_CUSTOMER s

WHERE s.valid_from = (SELECT MAX(i.valid_from)

FROM S_CUSTOMER i

WHERE s.customer_key = i.customer_key);

EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 3: MAX IN CORRELATED SUB-SELECT


S2IN.sql

SELECT *

FROM (SELECT coalesce ((SELECT min (i.valid_from)

FROM S_CUSTOMER i

WHERE s.customer_key = i.customer_key

AND s.valid_from < i.valid_from

), to_date ('31.12.9999',

'DD.MM.YYYY')) as end_ts,

s.*

FROM S_CUSTOMER s)

WHERE end_ts = to_date ('31.12.9999', 'DD.MM.YYYY');

EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 4: COALESCE WITH SUB-SELECT


S2IN.sql

WITH max_cust as (

SELECT i.customer_key,

max(i.valid_from) as max_valid_from

FROM S_CUSTOMER i

GROUP BY i.customer_key)

SELECT s.*

FROM S_CUSTOMER s

JOIN max_cust b ON s.customer_key = b.customer_key

AND s.valid_from = b.max_valid_from;

EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 5: MAX-FUNCTION AND WITH-CLAUSE


S2IN.sql

partition data

compute functions over these partitions

Rank [sequential order], first [first row], last [last row], lag [previous row], lead [next row]

return result

EXERCISE: COMPUTE MOST RECENT ROWSSQL ANALYTIC / WINDOWING FUNCTIONS


WITH lead_cust as (

SELECT lead (s.valid_from, 1) OVER (PARTITION BY

s.customer_key

ORDER BY s.valid_from ASC) as end_ts

, s.*

FROM s_customer s)

SELECT *

FROM lead_cust b

WHERE b.end_ts IS NULL;

EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 6: ANALYTIC / WINDOWING FUNCTION LEAD


S3IN.sql

WITH lead_cust as (

SELECT row_number() OVER(PARTITION BY s.customer_key

ORDER BY s.valid_from DESC) as rn

, s.*

FROM s_customer s)

SELECT *

FROM lead_cust b

WHERE b.rn = 1;

EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 7: ANALYTIC FUNCTION ROW_NUMBER


S3IN.sql

• Check execution plans, execution time including service + response time, resource usage for final decision

• Solutions with Analytic / Windowing do not need self-join and show better statistics compared to the other shown solutions

• Analytic / Windowing functions are very powerful

• Remark: Usage of with-clause in SQL statements is preferable compared to sub-selects as it improves readability, understandability, maintainability

MAX OR ANALYTIC / WINDOWING FUNCTIONS WHICH ALTERNATIVE WOULD YOU RECOMMEND?


TEMPORAL DATA STORAGE (BITEMPORAL DATA)


TEMPORAL DATA STORAGE (BITEMPORAL DATA)


10.09. 20.09. 30.09. 10.10.

Time

Price: 15EUR Price: 16EUR

New Price of 16EUR is entered into the DB

ValidTime

(20.09.)

TransactionTime

(10.09.)

• Time period when a fact is true in the real world

• The end user determines start and end date/time (or just a date/time for events)

Business validity:

Valid time

• Time period when a fact stored in the database is known

• ETL process determines start and end date/time

Technical validity:Transaction time

• Combines both Valid and Transaction TimeBitemporal data

TEMPORAL DATA STORAGE (BITEMPORAL DATA)DEFINITION


• SQL standard SQL:2011

• But different implementations by RDBMSes like Oracle, DB2, SQL Server and others

• Different syntax!

• Different coverage of standard!

• Very useful for slowly changing dimensions type 2, but also for other purposes

TEMPORAL DATA STORAGE (BITEMPORAL DATA)SQL STANDARD


CREATE TABLE customer_address

( customerID INTEGER NOT NULL

, name VARCHAR(100)

, city VARCHAR(100)

, valid_start DATE NOT NULL

, valid_end DATE NOT NULL

, PERIOD BUSINESS_TIME(valid_start, valid_end)

, PRIMARY KEY(customerID, BUSINESS_TIME WITHOUT OVERLAPS) );

DB2 VALID TIME EXAMPLE


INSERT INTO customer_address VALUES

(1, 'Miller', 'Seattle', '01.01.2013', '31.12.2013');

UPDATE customer_address FOR PORTION OF BUSINESS_TIME

FROM '22.05.2013' TO '31.12.2013'

SET city = 'San Diego' WHERE customerID = 1;



customerID Name City Valid_start Valid_end

1 Miller Seattle 01.01.2013 22.05.2013

1 Miller San Diego 22.05.2013 31.12.2013

SELECT *

FROM customer_address

FOR BUSINESS_TIME AS OF '17.05.2013';



CREATE TABLE customer_info(

customerId INTEGER NOT NULL,

comment VARCHAR(1000) NOT NULL,

sys_start TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,

sys_end TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,

PERIOD SYSTEM_TIME (sys_start, sys_end)

);

DB2 TRANSACTION TIME EXAMPLE


Transaction on 15.10.2013:

INSERT INTO customer_info VALUES( 1, 'comment 1');

Transaction on 31.10.2013

UPDATE customer_address SET comment = 'comment 2'

WHERE customerID = 1;



CustomerId comment Sys_start Sys_end

1 Comment 2 31.10.2013 31.12.2999

SELECT *

FROM customer_info FOR SYSTEM_TIME AS OF '17.10.2013';

Data comes from a history table:

Valid Time and Transaction Time can be combined = Bitemporal table



CustomerId comment Sys_start Sys_end

1 Comment 1 15.10.2013 31.10.2013

• Very important performance improvement technique

• Good for many reads with high selectivity, write penalty

• B-trees most common

INDEXING - WHY


root

branch branch

leaf leaf leaf

…

…

Table

• DBs index Primary Keys by default

• Dimension table columns that are regularly used in where clausesare candidates

• Maybe foreign Key columns in Fact table (see also later Star Transformation)

INDEXING A STAR SCHEMA – WHICH COLUMNS ARE CANDIDATES FOR AN INDEX?


• Fact table has normally much more rows compared to dimension tables

• Common join techniques would need to join first dimension table with the fact table

• Alternative technique: evaluate all dimensions(cartesian join)

• Then join into fact table in last step

• Oracle uses Bitmap indexes on foreign key columns in fact tables to achieveStar Join; not supported by many DBs

STAR TRANSFORMATION


PARTITIONING


Col1 Col2 Col3 col4

1 A AA AAA

2 B BB BBB

3 C CC CCC

Col1 Col2

1 A

2 B

3 C

Col3 col4

AA AAA

BB BBB

CC CCC

Col1 Col2 Col3 col4

3 C CC CCC

Col1 Col2 Col3 col4

1 A AA AAA

2 B BB BBB

Vertical partitioning Horizontal partitioning

• Very powerful feature in a DWH to reduce workload

• Split table into logical smaller tables

• Avoidance of full table scans

• How could a table be split?

• Introduction to (Oracle) partitioning: https://asktom.oracle.com/partitioning-for-developers.htm

HORIZONTAL PARTITIONING


https://asktom.oracle.com/partitioning-for-developers.htm

• By range

• Most common

• Use date field like order data to partition table into months, days, etc

• By list

• Use field that has limited number of different values, e.g. split customer data by country if end users most likely select customers from within a country

• By hash

• Use a filed that most likely splits the data in evenly distributed chunks

HORIZONTAL PARTITIONING – SPLITTING OPTIONS


• Statements are normally executed on one CPU

• Parallelism allows the DB to distribute the execution to several CPUs

• Powerful combination with partitioning

• Parallelism is limited by the number of CPUs: if parallelism is too high, performance will degrade

• Intra-query parallelism and inter-query parallelism

PARALLELISM


• Data compression + Index compression

• Store more data in a block/page = read more data during I/O

• If CPU resources are available, often a very powerful feature to improve performance

• Additionally reduce storage

• Additionally reduce backup time

COMPRESSION


• Relational columnar In-Memory DB

• Materialized Views / Query Tables

ALREADY COVERED IN A PREVIOUS LECTURE


• Recapture ETL and DB specific topics

• Which topics do you remember, or do you find important?

• Write down 1-2 topics on stick-it cards.

EXERCISE - RECAPTURE ETL AND DB SPECIFICS


Daimler TSS GmbHWilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99

[email protected] / Internet: www.daimler-tss.com/ Intranet-Portal-Code: @TSSDomicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle


THANK YOU

lecture @dhbw: data warehouse part xxi: db specificsbuckenhofer/20182dwh/buckenhofer-… · user...

Documents