lecture @dhbw: data warehouse part xxi: db specificsbuckenhofer/20182dwh/buckenhofer-… · user...
TRANSCRIPT
A company of Daimler AG
LECTURE @DHBW: DATA WAREHOUSE
PART XXI: DB SPECIFICSANDREAS BUCKENHOFER, DAIMLER TSS
ABOUT ME
https://de.linkedin.com/in/buckenhofer
https://twitter.com/ABuckenhofer
https://www.doag.org/de/themen/datenbank/in-memory/
http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/
https://www.xing.com/profile/Andreas_Buckenhofer2
Andreas BuckenhoferSenior DB [email protected]
Since 2009 at Daimler TSS Department: Big Data Business Unit: Analytics
ANDREAS BUCKENHOFER, DAIMLER TSS GMBH
Data Warehouse / DHBWDaimler TSS 3
“Forming good abstractions and avoiding complexity is an essential part of a successful data architecture”
Data has always been my main focus during my long-time occupation in the area of data integration. I work for Daimler TSS as Database Professional and Data Architect with over 20 years of experience in Data Warehouse projects. I am working with Hadoop and NoSQL since 2013. I keep my knowledge up-to-date - and I learn new things, experiment, and program every day.
I share my knowledge in internal presentations or as a speaker at international conferences. I'm regularly giving a full lecture on Data Warehousing and a seminar on modern data architectures at Baden-Wuerttemberg Cooperative State University DHBW. I also gained international experience through a two-year project in Greater London and several business trips to Asia.
I’m responsible for In-Memory DB Computing at the independent German Oracle User Group (DOAG) and was honored by Oracle as ACE Associate. I hold current certifications such as "Certified Data Vault 2.0 Practitioner (CDVP2)", "Big Data Architect“, „Oracle Database 12c Administrator Certified Professional“, “IBM InfoSphere Change Data Capture Technical Professional”, etc.
DHBWDOAG
Contact/Connect
As a 100% Daimler subsidiary, we give
100 percent, always and never less.
We love IT and pull out all the stops to
aid Daimler's development with our
expertise on its journey into the future.
Our objective: We make Daimler the
most innovative and digital mobility
company.
NOT JUST AVERAGE: OUTSTANDING.
Daimler TSS
INTERNAL IT PARTNER FOR DAIMLER
+ Holistic solutions according to the Daimler guidelines
+ IT strategy
+ Security
+ Architecture
+ Developing and securing know-how
+ TSS is a partner who can be trusted with sensitive data
As subsidiary: maximum added value for Daimler
+ Market closeness
+ Independence
+ Flexibility (short decision making process,
ability to react quickly)
Daimler TSS 5
Daimler TSS
LOCATIONS
Data Warehouse / DHBW
Daimler TSS ChinaHub Beijing10 employees
Daimler TSS MalaysiaHub Kuala Lumpur42 employees
Daimler TSS IndiaHub Bangalore22 employees
Daimler TSS Germany
7 locations
1000 employees*
Ulm (Headquarters)
Stuttgart
Berlin
Karlsruhe
* as of August 2017
6
• After the end of this lecture you will be able to
• Understand DB techniques that are specific for DWH
• Analytic/windowing functions
• Bitemporal data
• Indexing, Partitioning, Parallelism, Compression
WHAT YOU WILL LEARN TODAY
Data Warehouse / DHBWDaimler TSS 7
SQL STANDARD IS EVOLVING
Data Warehouse / DHBWDaimler TSS 8
Source: https://vimeo.com/289497563
Write an SQL statement that computes the most recent data for each customer.
Script to create the table including data: https://github.com/abuckenhofer/dwh_course/tree/master/scripts
EXERCISE: COMPUTE MOST RECENT ROWS
Data Warehouse / DHBWDaimler TSS 9
Customer_key Name Status Valid_from
1 Brown Single 01-MAY-2014
2 Bush Married 05-JAN-2015
1 Miller Married 15-DEC-2015
3 Stein 15-DEC-2015
3 Stein Single 18-DEC-2015
SIN1.sql
create table s_customer (
customer_key integer NOT NULL
, cust_name varchar2(100) NOT NULL
, status varchar2(10)
, valid_from date NOT NULL
, CONSTRAINT s_customer_pk PRIMARY KEY (customer_key, valid_from)
);
insert into s_customer (customer_key, cust_name, status, valid_from) values (1, 'Brown',
'Single', to_date('01.05.2014', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (2, 'Bush',
'Married', to_date('05.01.2015', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (1, 'Miller',
'Married', to_date('15.12.2015', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (3, 'Stein',
NULL, to_date('15.12.2015', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (3, 'Stein',
'Single', to_date('18.12.2015', 'DD.MM.YYYY'));
commit;
EXERCISE: COMPUTE MOST RECENT ROWS
Data Warehouse / DHBWDaimler TSS 11
SELECT s.*
FROM S_CUSTOMER s
JOIN (SELECT i.customer_key,
max(i.valid_from) as max_valid_from
FROM S_CUSTOMER i
GROUP BY i.customer_key) b
ON s.customer_key = b.customer_key
AND s.valid_from = b.max_valid_from;
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 1: MAX-FUNCTION
Data Warehouse / DHBWDaimler TSS 12
S2IN.sql
SELECT s.*
FROM S_CUSTOMER s
WHERE NOT EXISTS (SELECT 1
FROM S_CUSTOMER i
WHERE s.customer_key = i.customer_key
AND s.valid_from < i.valid_from);
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 2: EXISTS
Data Warehouse / DHBWDaimler TSS 13
S2IN.sql
SELECT s.*
FROM S_CUSTOMER s
WHERE s.valid_from = (SELECT MAX(i.valid_from)
FROM S_CUSTOMER i
WHERE s.customer_key = i.customer_key);
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 3: MAX IN CORRELATED SUB-SELECT
Data Warehouse / DHBWDaimler TSS 14
S2IN.sql
SELECT *
FROM (SELECT coalesce ((SELECT min (i.valid_from)
FROM S_CUSTOMER i
WHERE s.customer_key = i.customer_key
AND s.valid_from < i.valid_from
), to_date ('31.12.9999',
'DD.MM.YYYY')) as end_ts,
s.*
FROM S_CUSTOMER s)
WHERE end_ts = to_date ('31.12.9999', 'DD.MM.YYYY');
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 4: COALESCE WITH SUB-SELECT
Data Warehouse / DHBWDaimler TSS 15
S2IN.sql
WITH max_cust as (
SELECT i.customer_key,
max(i.valid_from) as max_valid_from
FROM S_CUSTOMER i
GROUP BY i.customer_key)
SELECT s.*
FROM S_CUSTOMER s
JOIN max_cust b ON s.customer_key = b.customer_key
AND s.valid_from = b.max_valid_from;
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 5: MAX-FUNCTION AND WITH-CLAUSE
Data Warehouse / DHBWDaimler TSS 16
S2IN.sql
partition data
compute functions over these partitions
Rank [sequential order], first [first row], last [last row], lag [previous row], lead [next row]
return result
EXERCISE: COMPUTE MOST RECENT ROWSSQL ANALYTIC / WINDOWING FUNCTIONS
Data Warehouse / DHBWDaimler TSS 17
WITH lead_cust as (
SELECT lead (s.valid_from, 1) OVER (PARTITION BY
s.customer_key
ORDER BY s.valid_from ASC) as end_ts
, s.*
FROM s_customer s)
SELECT *
FROM lead_cust b
WHERE b.end_ts IS NULL;
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 6: ANALYTIC / WINDOWING FUNCTION LEAD
Data Warehouse / DHBWDaimler TSS 18
S3IN.sql
WITH lead_cust as (
SELECT row_number() OVER(PARTITION BY s.customer_key
ORDER BY s.valid_from DESC) as rn
, s.*
FROM s_customer s)
SELECT *
FROM lead_cust b
WHERE b.rn = 1;
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 7: ANALYTIC FUNCTION ROW_NUMBER
Data Warehouse / DHBWDaimler TSS 19
S3IN.sql
• Check execution plans, execution time including service + response time, resource usage for final decision
• Solutions with Analytic / Windowing do not need self-join and show better statistics compared to the other shown solutions
• Analytic / Windowing functions are very powerful
• Remark: Usage of with-clause in SQL statements is preferable compared to sub-selects as it improves readability, understandability, maintainability
MAX OR ANALYTIC / WINDOWING FUNCTIONS WHICH ALTERNATIVE WOULD YOU RECOMMEND?
Data Warehouse / DHBWDaimler TSS 20
TEMPORAL DATA STORAGE (BITEMPORAL DATA)
Data Warehouse / DHBWDaimler TSS 22
10.09. 20.09. 30.09. 10.10.
Time
Price: 15EUR Price: 16EUR
New Price of 16EUR is entered into the DB
ValidTime
(20.09.)
TransactionTime
(10.09.)
• Time period when a fact is true in the real world
• The end user determines start and end date/time (or just a date/time for events)
Business validity:
Valid time
• Time period when a fact stored in the database is known
• ETL process determines start and end date/time
Technical validity:Transaction time
• Combines both Valid and Transaction TimeBitemporal data
TEMPORAL DATA STORAGE (BITEMPORAL DATA)DEFINITION
Data Warehouse / DHBWDaimler TSS 23
• SQL standard SQL:2011
• But different implementations by RDBMSes like Oracle, DB2, SQL Server and others
• Different syntax!
• Different coverage of standard!
• Very useful for slowly changing dimensions type 2, but also for other purposes
TEMPORAL DATA STORAGE (BITEMPORAL DATA)SQL STANDARD
Data Warehouse / DHBWDaimler TSS 24
CREATE TABLE customer_address
( customerID INTEGER NOT NULL
, name VARCHAR(100)
, city VARCHAR(100)
, valid_start DATE NOT NULL
, valid_end DATE NOT NULL
, PERIOD BUSINESS_TIME(valid_start, valid_end)
, PRIMARY KEY(customerID, BUSINESS_TIME WITHOUT OVERLAPS) );
DB2 VALID TIME EXAMPLE
Data Warehouse / DHBWDaimler TSS 25
INSERT INTO customer_address VALUES
(1, 'Miller', 'Seattle', '01.01.2013', '31.12.2013');
UPDATE customer_address FOR PORTION OF BUSINESS_TIME
FROM '22.05.2013' TO '31.12.2013'
SET city = 'San Diego' WHERE customerID = 1;
DB2 VALID TIME EXAMPLE
Data Warehouse / DHBWDaimler TSS 26
customerID Name City Valid_start Valid_end
1 Miller Seattle 01.01.2013 22.05.2013
1 Miller San Diego 22.05.2013 31.12.2013
SELECT *
FROM customer_address
FOR BUSINESS_TIME AS OF '17.05.2013';
DB2 VALID TIME EXAMPLE
Data Warehouse / DHBWDaimler TSS 27
CREATE TABLE customer_info(
customerId INTEGER NOT NULL,
comment VARCHAR(1000) NOT NULL,
sys_start TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
sys_end TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
PERIOD SYSTEM_TIME (sys_start, sys_end)
);
DB2 TRANSACTION TIME EXAMPLE
Data Warehouse / DHBWDaimler TSS 28
Transaction on 15.10.2013:
INSERT INTO customer_info VALUES( 1, 'comment 1');
Transaction on 31.10.2013
UPDATE customer_address SET comment = 'comment 2'
WHERE customerID = 1;
DB2 TRANSACTION TIME EXAMPLE
Data Warehouse / DHBWDaimler TSS 29
CustomerId comment Sys_start Sys_end
1 Comment 2 31.10.2013 31.12.2999
SELECT *
FROM customer_info FOR SYSTEM_TIME AS OF '17.10.2013';
Data comes from a history table:
Valid Time and Transaction Time can be combined = Bitemporal table
DB2 TRANSACTION TIME EXAMPLE
Data Warehouse / DHBWDaimler TSS 30
CustomerId comment Sys_start Sys_end
1 Comment 1 15.10.2013 31.10.2013
• Very important performance improvement technique
• Good for many reads with high selectivity, write penalty
• B-trees most common
INDEXING - WHY
Data Warehouse / DHBWDaimler TSS 31
root
branch branch
leaf leaf leaf
…
…
Table
• DBs index Primary Keys by default
• Dimension table columns that are regularly used in where clausesare candidates
• Maybe foreign Key columns in Fact table (see also later Star Transformation)
INDEXING A STAR SCHEMA – WHICH COLUMNS ARE CANDIDATES FOR AN INDEX?
Data Warehouse / DHBWDaimler TSS 32
• Fact table has normally much more rows compared to dimension tables
• Common join techniques would need to join first dimension table with the fact table
• Alternative technique: evaluate all dimensions(cartesian join)
• Then join into fact table in last step
• Oracle uses Bitmap indexes on foreign key columns in fact tables to achieveStar Join; not supported by many DBs
STAR TRANSFORMATION
Data Warehouse / DHBWDaimler TSS 33
PARTITIONING
Data Warehouse / DHBWDaimler TSS 34
Col1 Col2 Col3 col4
1 A AA AAA
2 B BB BBB
3 C CC CCC
Col1 Col2
1 A
2 B
3 C
Col3 col4
AA AAA
BB BBB
CC CCC
Col1 Col2 Col3 col4
3 C CC CCC
Col1 Col2 Col3 col4
1 A AA AAA
2 B BB BBB
Vertical partitioning Horizontal partitioning
• Very powerful feature in a DWH to reduce workload
• Split table into logical smaller tables
• Avoidance of full table scans
• How could a table be split?
• Introduction to (Oracle) partitioning: https://asktom.oracle.com/partitioning-for-developers.htm
HORIZONTAL PARTITIONING
Data Warehouse / DHBWDaimler TSS 35
• By range
• Most common
• Use date field like order data to partition table into months, days, etc
• By list
• Use field that has limited number of different values, e.g. split customer data by country if end users most likely select customers from within a country
• By hash
• Use a filed that most likely splits the data in evenly distributed chunks
HORIZONTAL PARTITIONING – SPLITTING OPTIONS
Data Warehouse / DHBWDaimler TSS 36
• Statements are normally executed on one CPU
• Parallelism allows the DB to distribute the execution to several CPUs
• Powerful combination with partitioning
• Parallelism is limited by the number of CPUs: if parallelism is too high, performance will degrade
• Intra-query parallelism and inter-query parallelism
PARALLELISM
Data Warehouse / DHBWDaimler TSS 37
• Data compression + Index compression
• Store more data in a block/page = read more data during I/O
• If CPU resources are available, often a very powerful feature to improve performance
• Additionally reduce storage
• Additionally reduce backup time
COMPRESSION
Data Warehouse / DHBWDaimler TSS 38
• Relational columnar In-Memory DB
• Materialized Views / Query Tables
ALREADY COVERED IN A PREVIOUS LECTURE
Data Warehouse / DHBWDaimler TSS 39
• Recapture ETL and DB specific topics
• Which topics do you remember, or do you find important?
• Write down 1-2 topics on stick-it cards.
EXERCISE - RECAPTURE ETL AND DB SPECIFICS
Data Warehouse / DHBWDaimler TSS 40
Daimler TSS GmbHWilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com/ Intranet-Portal-Code: @TSSDomicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle
Data Warehouse / DHBWDaimler TSS 41
THANK YOU