lecture @dhbw: data warehouse part 05: data …a collaborative data management practice focused on...
TRANSCRIPT
A company of Daimler AG
LECTURE @DHBW: DATA WAREHOUSE
PART 05: DATA SECURITY – DATA CULTUREANDREAS BUCKENHOFER, DAIMLER TSS
ABOUT ME
Andreas BuckenhoferSenior DB Professional
Since 2009 at Daimler TSS
Department: Machine Learning Solutions
Business Unit: Analytics
Contact/Connect
vcard
• Oracle ACE Associate
• DOAG responsible for InMemory DB
• Lecturer at DHBW
• Certified Data Vault Practitioner 2.0
• Certified Oracle Professional
• Certified IBM Big Data Architect
• Over 20 years experience with
database technologies
• Over 20 years experience with Data
Warehousing
• International project experience
After the end of this lecture you will be able to
• Understand importance of Data Security
• Data Classification
• GDPR
• Understand importance of Data Culture
WHAT YOU WILL LEARN TODAY
Data Warehouse / DHBWDaimler TSS 3
DATA SECURITY
Data Warehouse / DHBWDaimler TSS 4
LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE
Data Warehouse / DHBWDaimler TSS 5
Data Warehouse
FrontendBackend
External data sources
Internal data sources
Staging
Layer
(Input
Layer)
OLTP
OLTP
Core
Warehouse
Layer
(Storage
Layer)
Mart Layer
(Output
Layer)
(Reporting
Layer)
Integration
Layer
(Cleansing
Layer)
Aggregation
Layer
Metadata Management
Security
DWH Manager incl. Monitor
• End users should only have access to the Data Marts in a DWH
• Grant privileges and roles to users
• Good practice:
• Grant privileges directly to a role; do not nest roles (grant role to role)
• Grant role to users
• Distinguish end users, administrators and technical users with different password
policies
• Much more difficult to grant access in a Data Lake
• Tool maturity
• Data often unknown with schema-on-read
DATA ACCESS
Data Warehouse / DHBWDaimler TSS 6
CIA TRIAD – DATA CLASSIFICATION
Data Warehouse / DHBWDaimler TSS 7
Confidentiality
IntegrityAvailability
whether or not
information is kept
secret or private,
e.g. data theft
whether the
information is kept
accurate, e.g.
faking data
ensuring that
information is
available when it is
needed, e.g.
blackout
Regulation in EU law on data protection and privacy for all individuals in the
EU
Requirements like
• Data protection by design and by default
• Right to erasure / Right to be forgotten
• Records of processing activities
GDPR – GENERAL DATA PROTECTION REGULATION
Data Warehouse / DHBWDaimler TSS 8
Not relevant for exam
• "capture-it-all" approach causes serious questions of privacy
• always ask yourself why you are capturing or storing data
CHALLENGES DWH & BIG DATA
Data Warehouse / DHBWDaimler TSS 9
Source: https://martinfowler.com/bliki/Datensparsamkeit.html
Challenges like
• Combining of sensitive data
sources allowed?
• Export restrictions that
forbid to combine data from
Germany, US, China
Anonymization
• Irreversibly destroys any way to identify an individual
• GDPR does not apply
Pseudonymization
• Additional information required to re-identify an individual
• GDPR does apply
ANONYMIZATION VS PSEUDONYMIZATION
Data Warehouse / DHBWDaimler TSS 10
Not relevant for exam
TYPICAL SECURITY FUNCTIONALITIES AND MEASURES IN DATABASES
Data Warehouse / DHBWDaimler TSS 11
Auditing
Encryption
Masking (static, dynamic)
Row Level Security
Authentication & Password
policies
Patching
Secure installation
SSL – secure
communication
Authorization & roles,
profiles, ressource limits
Security checklists
Source: https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/
Source: https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/
DATA CULTURE
Data Warehouse / DHBWDaimler TSS 14
• “Taping into the potential of data involves data
sensemaking. I prefer the term over the more
popular term analytics because it better fits the full
range of activities that are needed” (page 44)
• “Data sensemaking requires a significant
investment in the development of human skills”(page 74)
• DWH/Big Data Architecture (instead of
technology/tools)
• Data integration, Data visualization, Data modeling, …
STEPHEN FEW: BIG DATA, BIG DUPE (ANALYTICS PRESS, 2018)
Data Warehouse / DHBWDaimler TSS 15
DATA LANDSCAPE MANIFESTO@SCOUT24
Data Warehouse / DHBWDaimler TSS 16
Source: Data Festival, Munich 2019
Not relevant for exam
A collaborative data management practice focused on improving the
communication, integration and automation of data flows between data
managers and consumers across an organization.
The goal of DataOps is to create predictable delivery and change
management of data, data models and related artifacts. DataOps uses
technology to automate data delivery with the appropriate levels of security,
quality and metadata to improve the use and value of data in a dynamic
environment.
DATAOPS
Data Warehouse / DHBWDaimler TSS 17
Source: https://medium.com/data-ops/the-best-dataops-articles-of-q3-2018-c39882be3d7b
Not relevant for exam
NETFLIX’ THREE CREDOS FOR DATA CULTURE
Data Warehouse / DHBWDaimler TSS 18
Source: Phil Simons’ webinar on “The visual organization: data visualization, big data, and the quest for better decisions” from
Harvard Business Review (2014)
Data catalog: Data should be accessible, easy to discover, and easy to process for everyone
Data storytelling: Whether your dataset is large or small, being able to visualize it makes it easier to explain
Fail fast and iterate: The longer you take to find the data, the less valuable it becomes
Not relevant for exam
Establish a culture for
• Sharing data across the organization by following privacy and ethics
• Collaborating around data products
• Using data for decisions instead of e.g. experience
• Speed: fail fast and iterate
Being successful as a data-driven company requires the active involvement of
all employees
DATA CULTURE
Data Warehouse / DHBWDaimler TSS 19
Not relevant for exam
DIGITIZATION - NEW GIANTSBAT – BAIDU + ALIBABA + TENCENT
Data Warehouse / DHBWDaimler TSS 20
Sources:
https://venitism.wordpress.com/2017/12/15/beware-of-the-bats-baidu-alibaba-and-tencent/
https://www.afr.com/brand/business-summit/baidu-alibaba-tencent-to-disrupt-facebook-amazon-netflix-google-in-asia-20180228-h0wrdl
Not relevant for exam
We’re entering a new world in which
data may be more important than
software.
[Tim O’Reilly, Founder O’Reilly Media]
Data is a precious thing and will last longer than
the systems themselves.
[Tim Berners-Lee, Father of the Worldwide Web]
Information is the oil of the 21st
century
[Peter Sondergaard, Gartner]
Everything we do in the digital realm ... creates a data trail.
And if that trail exists, chances are someone is using it.
[Douglas Rushkoff, Author]
Data creation is exploding
[Gavin Belson, HBOs Silicon Valley]
Data is the new gold
[Open Data Initiative, European Commission]
In a world deluged by irrelevant
information, clarity is power.
[Yuval Noah Harari, Author]
Big data is not about the data
[Gary King, Harvard University]
DATA WAREHOUSE
Data Warehouse / DHBWDaimler TSS 23
Applications come, applications go.The data, however, lives forever.
It is not about building applications;it really is about the data
underneath these applications(Tom Kyte, Oracle)
Daimler TSS GmbHWilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com/ Intranet-Portal-Code: @TSSDomicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle
Data Warehouse / DHBWDaimler TSS 24
THANK YOU
SQL STANDARD IS EVOLVING
Data Warehouse / DHBWDaimler TSS 25
Source: https://vimeo.com/289497563
Write an SQL statement that computes the most recent data for each
customer.
Script to create the table including data: https://github.com/abuckenhofer/dwh_course/tree/master/scripts
EXERCISE: COMPUTE MOST RECENT ROWS
Data Warehouse / DHBWDaimler TSS 26
Customer_key Name Status Valid_from
1 Brown Single 01-MAY-2014
2 Bush Married 05-JAN-2015
1 Miller Married 15-DEC-2015
3 Stein 15-DEC-2015
3 Stein Single 18-DEC-2015
SIN1.sql
EXERCISE: COMPUTE MOST RECENT ROWS
Data Warehouse / DHBWDaimler TSS 27
create table s_customer (
customer_key integer NOT NULL
, cust_name varchar2(100) NOT NULL
, status varchar2(10)
, valid_from date NOT NULL
, CONSTRAINT s_customer_pk PRIMARY KEY (customer_key, valid_from)
);
insert into s_customer (customer_key, cust_name, status, valid_from) values (1, 'Brown',
'Single', to_date('01.05.2014', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (2, 'Bush',
'Married', to_date('05.01.2015', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (1, 'Miller',
'Married', to_date('15.12.2015', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (3, 'Stein',
NULL, to_date('15.12.2015', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (3, 'Stein',
'Single', to_date('18.12.2015', 'DD.MM.YYYY'));
commit;
EXERCISE: COMPUTE MOST RECENT ROWS
Data Warehouse / DHBWDaimler TSS 28
SELECT s.*
FROM S_CUSTOMER s
JOIN (SELECT i.customer_key,
max(i.valid_from) as max_valid_from
FROM S_CUSTOMER i
GROUP BY i.customer_key) b
ON s.customer_key = b.customer_key
AND s.valid_from = b.max_valid_from;
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 1: MAX-FUNCTION
Data Warehouse / DHBWDaimler TSS 29
S2IN.sql
SELECT s.*
FROM S_CUSTOMER s
WHERE NOT EXISTS (SELECT 1
FROM S_CUSTOMER i
WHERE s.customer_key = i.customer_key
AND s.valid_from < i.valid_from);
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 2: EXISTS
Data Warehouse / DHBWDaimler TSS 30
S2IN.sql
SELECT s.*
FROM S_CUSTOMER s
WHERE s.valid_from = (SELECT MAX(i.valid_from)
FROM S_CUSTOMER i
WHERE s.customer_key = i.customer_key);
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 3: MAX IN CORRELATED SUB-SELECT
Data Warehouse / DHBWDaimler TSS 31
S2IN.sql
SELECT *
FROM (SELECT coalesce ((SELECT min (i.valid_from)
FROM S_CUSTOMER i
WHERE s.customer_key = i.customer_key
AND s.valid_from < i.valid_from
), to_date ('31.12.9999',
'DD.MM.YYYY')) as end_ts,
s.*
FROM S_CUSTOMER s)
WHERE end_ts = to_date ('31.12.9999', 'DD.MM.YYYY');
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 4: COALESCE WITH SUB-SELECT
Data Warehouse / DHBWDaimler TSS 32
S2IN.sql
WITH max_cust as (
SELECT i.customer_key,
max(i.valid_from) as max_valid_from
FROM S_CUSTOMER i
GROUP BY i.customer_key)
SELECT s.*
FROM S_CUSTOMER s
JOIN max_cust b ON s.customer_key = b.customer_key
AND s.valid_from = b.max_valid_from;
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 5: MAX-FUNCTION AND WITH-CLAUSE
Data Warehouse / DHBWDaimler TSS 33
S2IN.sql
partition data
compute functions over these partitions
Rank [sequential order], first [first row], last [last row], lag [previous row], lead [next row]
return result
EXERCISE: COMPUTE MOST RECENT ROWSSQL ANALYTIC / WINDOWING FUNCTIONS
Data Warehouse / DHBWDaimler TSS 34
WITH lead_cust as (
SELECT lead (s.valid_from, 1) OVER (PARTITION BY
s.customer_key
ORDER BY s.valid_from ASC) as end_ts
, s.*
FROM s_customer s)
SELECT *
FROM lead_cust b
WHERE b.end_ts IS NULL;
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 6: ANALYTIC / WINDOWING FUNCTION LEAD
Data Warehouse / DHBWDaimler TSS 35
S3IN.sql
WITH lead_cust as (
SELECT row_number() OVER(PARTITION BY s.customer_key
ORDER BY s.valid_from DESC) as rn
, s.*
FROM s_customer s)
SELECT *
FROM lead_cust b
WHERE b.rn = 1;
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 7: ANALYTIC FUNCTION ROW_NUMBER
Data Warehouse / DHBWDaimler TSS 36
S3IN.sql
• Check execution plans, execution time including service + response time,
resource usage for final decision
• Solutions with Analytic / Windowing do not need self-join and show better
statistics compared to the other shown solutions
• Analytic / Windowing functions are very powerful
• Remark: Usage of with-clause in SQL statements is preferable compared
to sub-selects as it improves readability, understandability, maintainability
MAX OR ANALYTIC / WINDOWING FUNCTIONS WHICH ALTERNATIVE WOULD YOU RECOMMEND?
Data Warehouse / DHBWDaimler TSS 37
TEMPORAL DATA STORAGE (BITEMPORAL DATA)
Data Warehouse / DHBWDaimler TSS 38
TEMPORAL DATA STORAGE (BITEMPORAL DATA)
Data Warehouse / DHBWDaimler TSS 39
10.09. 20.09. 30.09. 10.10.
Time
Price: 15EUR Price: 16EUR
New Price of 16EUR is
entered into the DB
Valid
Time
(20.09.)
Transaction
Time
(10.09.)
• Time period when a fact is true in the real world
• The end user determines start and end date/time (or just a date/time for events)
Business validity:
Valid time
• Time period when a fact stored in the database is known
• ETL process determines start and end date/time
Technical validity:Transaction time
• Combines both Valid and Transaction TimeBitemporal data
TEMPORAL DATA STORAGE (BITEMPORAL DATA)DEFINITION
Data Warehouse / DHBWDaimler TSS 40
• SQL standard SQL:2011
• But different implementations by RDBMSes like Oracle, DB2, SQL Server
and others
• Different syntax!
• Different coverage of standard!
• Very useful for slowly changing dimensions type 2, but also for other
purposes
TEMPORAL DATA STORAGE (BITEMPORAL DATA)SQL STANDARD
Data Warehouse / DHBWDaimler TSS 41
CREATE TABLE customer_address
( customerID INTEGER NOT NULL
, name VARCHAR(100)
, city VARCHAR(100)
, valid_start DATE NOT NULL
, valid_end DATE NOT NULL
, PERIOD BUSINESS_TIME(valid_start, valid_end)
, PRIMARY KEY(customerID, BUSINESS_TIME WITHOUT OVERLAPS) );
DB2 VALID TIME EXAMPLE
Data Warehouse / DHBWDaimler TSS 42
INSERT INTO customer_address VALUES
(1, 'Miller', 'Seattle', '01.01.2013', '31.12.2013');
UPDATE customer_address FOR PORTION OF BUSINESS_TIME
FROM '22.05.2013' TO '31.12.2013'
SET city = 'San Diego' WHERE customerID = 1;
DB2 VALID TIME EXAMPLE
Data Warehouse / DHBWDaimler TSS 43
customerID Name City Valid_start Valid_end
1 Miller Seattle 01.01.2013 22.05.2013
1 Miller San Diego 22.05.2013 31.12.2013
SELECT *
FROM customer_address
FOR BUSINESS_TIME AS OF '17.05.2013';
DB2 VALID TIME EXAMPLE
Data Warehouse / DHBWDaimler TSS 44
CREATE TABLE customer_info(
customerId INTEGER NOT NULL,
comment VARCHAR(1000) NOT NULL,
sys_start TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
sys_end TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
PERIOD SYSTEM_TIME (sys_start, sys_end)
);
DB2 TRANSACTION TIME EXAMPLE
Data Warehouse / DHBWDaimler TSS 45
Transaction on 15.10.2013:
INSERT INTO customer_info VALUES( 1, 'comment 1');
Transaction on 31.10.2013
UPDATE customer_address SET comment = 'comment 2'
WHERE customerID = 1;
DB2 TRANSACTION TIME EXAMPLE
Data Warehouse / DHBWDaimler TSS 46
CustomerId comment Sys_start Sys_end
1 Comment 2 31.10.2013 31.12.2999
SELECT *
FROM customer_info FOR SYSTEM_TIME AS OF '17.10.2013';
Data comes from a history table:
Valid Time and Transaction Time can be combined = Bitemporal table
DB2 TRANSACTION TIME EXAMPLE
Data Warehouse / DHBWDaimler TSS 47
CustomerId comment Sys_start Sys_end
1 Comment 1 15.10.2013 31.10.2013
• Very important performance improvement technique
• Good for many reads with high selectivity, write penalty
• B-trees most common
INDEXING - WHY
Data Warehouse / DHBWDaimler TSS 48
root
branch branch
leaf leaf leaf
…
…
Table
• DBs index Primary Keys by default
• Dimension table columns
that are regularly used in where clauses
are candidates
• Maybe foreign Key columns in Fact table
(see also later Star Transformation)
INDEXING A STAR SCHEMA – WHICH COLUMNS ARE CANDIDATES FOR AN INDEX?
Data Warehouse / DHBWDaimler TSS 49
• Fact table has normally much more rows compared to dimension tables
• Common join techniques would need to join first dimension table with the
fact table
• Alternative technique: evaluate all dimensions
(cartesian join)
• Then join into fact table in last step
• Oracle uses Bitmap indexes on
foreign key columns in fact tables to achieve
Star Join; not supported by many DBs
STAR TRANSFORMATION
Data Warehouse / DHBWDaimler TSS 50
PARTITIONING
Data Warehouse / DHBWDaimler TSS 51
Col1 Col2 Col3 col4
1 A AA AAA
2 B BB BBB
3 C CC CCC
Col1 Col2
1 A
2 B
3 C
Col3 col4
AA AAA
BB BBB
CC CCC
Col1 Col2 Col3 col4
3 C CC CCC
Col1 Col2 Col3 col4
1 A AA AAA
2 B BB BBB
Vertical partitioning Horizontal partitioning
• Very powerful feature in a DWH to reduce workload
• Split table into logical smaller tables
• Avoidance of full table scans
• How could a table be split?
• Introduction to (Oracle) partitioning:
https://asktom.oracle.com/partitioning-for-developers.htm
HORIZONTAL PARTITIONING
Data Warehouse / DHBWDaimler TSS 52
• By range
• Most common
• Use date field like order data to partition table into months, days, etc
• By list
• Use field that has limited number of different values, e.g. split customer data by
country if end users most likely select customers from within a country
• By hash
• Use a filed that most likely splits the data in evenly distributed chunks
HORIZONTAL PARTITIONING – SPLITTING OPTIONS
Data Warehouse / DHBWDaimler TSS 53
• Statements are normally executed on one CPU
• Parallelism allows the DB to distribute the execution to several CPUs
• Powerful combination with partitioning
• Parallelism is limited by the number of CPUs: if parallelism is too high, performance
will degrade
• Intra-query parallelism and inter-query parallelism
PARALLELISM
Data Warehouse / DHBWDaimler TSS 54
• Data compression + Index compression
• Store more data in a block/page = read more data during I/O
• If CPU resources are available, often a very powerful feature to improve performance
• Additionally reduce storage
• Additionally reduce backup time
COMPRESSION
Data Warehouse / DHBWDaimler TSS 55
• Relational columnar In-Memory DB
• Materialized Views / Query Tables
ALREADY COVERED IN A PREVIOUS LECTURE
Data Warehouse / DHBWDaimler TSS 56
• Recapture ETL and DB specific topics
• Which topics do you remember, or do you find important?
• Write down 1-2 topics on stick-it cards.
EXERCISE - RECAPTURE ETL AND DB SPECIFICS
Data Warehouse / DHBWDaimler TSS 57