dw training
TRANSCRIPT
-
8/8/2019 DW Training
1/32
ETL- Extraction,Transformation & Loading
-
8/8/2019 DW Training
2/32
1-2
What is ETL?
ETL(Extraction, Transformation and Loading) is a process by which
data is integrated and transformed from the operational systems intothe Data warehouse environment
Operational systemsOperational systems
Filters andFilters and
ExtractorsExtractors
TransformationTransformation
RulesRules
Rule 1Rule 1
Rule 2Rule 2
Rule 3Rule 3
IntegratorIntegrator
CleaningCleaningRulesRules
Rule 1Rule 1
Rule 2Rule 2
Rule 3Rule 3
TransformationTransformationEngineEngine
CleanserCleanser
LoaderLoaderWarehouseWarehouse
ErrorError
ViewCheck
Correct
ErrorError
View
Check
Correct
-
8/8/2019 DW Training
3/32
1-3
Operational Data - Challenges
Data from heterogeneous sources
Format differences
Data Variations
Context
Across locations the same code could
represent different customers
Across periods of time a product code could
have been reused
-
8/8/2019 DW Training
4/32
1-4
Extraction
Oracle
Sybase
Text files
Target
80 tables
50 tables
-
8/8/2019 DW Training
5/32
1-5
Transformation
First
Name
Last
Name
Emp id
IndianaJones10001
SherlockHolmes10002
Name =
Concat(First Name,
Last Name)
Indiana Jones
Sherlock Homes
Staging Area
Source
-
8/8/2019 DW Training
6/32
1-6
Loading
StagingArea
SourceData
Warehouse
Direct Load
Cleaning,
Transformation
& Integration of
Raw data
-
8/8/2019 DW Training
7/32
1-7
Volume of ETL in a Data warehouse
Source OLTP
Systems Data MartsData Marts
DesignMapping
Extract
Scrub
Transform
LoadIndex
Aggregation ReplicationData Set Distribution
Access & AnalysisResource Scheduling & Distributio
Meta Data
System Monitoring
Enterprise
DataW
areh
ouse
Metadata
-
8/8/2019 DW Training
8/32
1-8
Factors Influencing ETL Architecture
Volume at each warehouse component.
The time window available for extraction.
The extraction type (Full,Periodic etc.)
Complexity of the processes at each stage.
-
8/8/2019 DW Training
9/32
Extraction Types
-
8/8/2019 DW Training
10/32
1-10
Extraction Types
Extraction
Full Extract
Periodic/
Incremental
Extract
-
8/8/2019 DW Training
11/32
1-11
Source System
Full Extract
Existing data
Data Mart
Full Extract
-
8/8/2019 DW Training
12/32
1-12
Full Extract
Source System
Full Extract
Data Mart
New data
-
8/8/2019 DW Training
13/32
1-13
Full Extract
Data Mart
New data
Source System
Full Extract
-
8/8/2019 DW Training
14/32
1-14
Incremental Extract
Data Mart
Source SystemIncremental Extract
Existing data
Incremental
Data
-
8/8/2019 DW Training
15/32
1-15
Incremental Extract
Data Mart
Source SystemIncremental Extract
New data
Changed data
Existing data
Incremental
Data
-
8/8/2019 DW Training
16/32
1-16
Incremental Extract
Data Mart
Source System
Incremental Extract
New data
Changed dataExisting data updated
using changed data
Incremental
Data
Incremental addition
to data mart
-
8/8/2019 DW Training
17/32
Transformation
-
8/8/2019 DW Training
18/32
1-18
Data Transformation
Conversions
Data type (e.g. Char to Date)
Bring data to common units (Currency,Measuring Units)
Classifications
Changing continuous values to discrete ranges (e.g.
Temperatures to Temperature Ranges)
Splitting of fields
Merging of fields
Aggregations (e.g. Sum, Avg., Count)
Derivations (Percentages, Ratios, Indicators)
-
8/8/2019 DW Training
19/32
1-19
Structural Transformations
Additive
Orders arrive
every
two minutesAggregate
Average
DailyProductivity
figuresAverage
OLTP
OLTP
Data ware
house
Data ware
house
-
8/8/2019 DW Training
20/32
1-20
Format transformation
Splitting
Data Type
Conversions
Source
Schema
32
Transformation
Target
Schema
32
Age as a String Age as an
Integer
15-10-1992
Source
Schema
Date as a
String
Transformation15 10 1999
Target
Schema
Day Month Year
Date as a combination of3
integer fields
-
8/8/2019 DW Training
21/32
1-21
Simple Conversions
Transformations using Simple Conversions
Source
Schema
Rs. 10000Multiply by 1/43
Target
Schema
$232.56
Revenue in
Rupees
Revenue in
Dollars
1000 lbs.Multiply by 0.4536
453.56 kgs.
Production in
Pounds
Production in
Kilograms
SourceSchema
TargetSchema
-
8/8/2019 DW Training
22/32
1-22
Classification
Name Age
John Black 27
Richard Wayne 53
Jennifer Goldman 45
Helmut Koch 37
Anna Ludwig 32
Shito Maketha 28
Tracy Withman 39
Ada Zhesky 25
David Rosenberg 33
Pankaj Sharma 29
Zhu Ling 44
George Kurtz 27
Rita Hartman 34
Grouping
ge Grou Frequency
20-25 1
26-30 4
31-35 3
36-40 2
41-45 2
46-50 1
51-55 1
56-60 0
-
8/8/2019 DW Training
23/32
1-23
Data Consistency Transformations
Source 1
ender
Male M
Female F
Source 2
Gender
Male Male
Female Female
Source 1
Gender
Male 1
Female 2
Target
Gender
Male M
emale
-
8/8/2019 DW Training
24/32
1-24
Reconciliation of Duplicated data
Joe Smith
123 MaineSt.
MA -
70127
Joseph
Smith123 Maine
St.
MA -
70127
J.R.Smith
123 MaineSt.
MA -
70127
Joseph R Smith
123 Maine St.
MA - 70127
-
8/8/2019 DW Training
25/32
1-25
Data Aggregation - Design Requirements
Aggregates must be stored in their own fact tables and each level
should have its own fact table
Dimension tables attached to the aggregate fact tables should where
ever possible be shrunken versions of the dimension tables attached to
the base fact table
The base fact table and all of its related aggregate fact tables must be
associated together as a family of schemas
-
8/8/2019 DW Training
26/32
Loading
-
8/8/2019 DW Training
27/32
1-27
Types of Data warehouse Loading
Target update types
Insert
Update
-
8/8/2019 DW Training
28/32
1-28
Types of Data Warehouse Updates
Insert
Full Replace
Selective Replace
Update
Update plus RetainHistory
Point in Time Snapshots
New Data
Changed Data
Data Warehouse
Source data Data Staging
-
8/8/2019 DW Training
29/32
1-29
New Data and Point-In-Time Data Insert
Source data
New data
OR
Point-in-Time
Snapshot(e.g.. Monthly)
New Data Added toExisting Data
-
8/8/2019 DW Training
30/32
1-30
Changed Data Insert
Source data
Changed Data Added to
Existing Data
Changed
data
-
8/8/2019 DW Training
31/32
1-31
When the value of dimension in a data warehouse changes,
then
History of change needs to be maintained.
Changed data alone needs to be identified
Changed data should be easier to access.
Reconstruction of the dimension table any point in time should be
easier
Change of Dimension values
-
8/8/2019 DW Training
32/32
1-32
ETL - Approach in a nutshell
1) Identify the Operational systems based on data islands in the
target
2) Map source-target dependencies.
3) Define cleaning and transformation rules
4) Validate source-target mapping
5) Consolidate Meta data for ETL
6) Draw the ETL architecture
7) Build the cleaning, transformation and auditing routines using
either a tool or customized programs