65518203 etl concepts
TRANSCRIPT
-
7/28/2019 65518203 ETL Concepts
1/56
ETL Concepts
-
7/28/2019 65518203 ETL Concepts
2/56
Scope of the TrainingWhat is ETL?
Need for ETLETL Glossary
The ETL Process
Data Extraction and Preparation
Data Cleansing
Data Transformation
Data Load
Data Refresh Strategies
ETL Solution Options
Characteristics of ETL Tools
-
7/28/2019 65518203 ETL Concepts
3/56
What is ETL?ETL stands for Extraction, Transformation
and Load
This is the most challenging, costly and
time consuming step towards building any
type of Data warehouse.This step usually determines the success
or failure of a Data warehouse because
any analysis lays a lot of importance ondata and the quality of data that is being
analyzed.
-
7/28/2019 65518203 ETL Concepts
4/56
What is ETL?
-
7/28/2019 65518203 ETL Concepts
5/56
What is ETL? - ExtractionExtraction
The process of culling out data that is
required for the Data Warehouse from the
source system
Can be to a file or to a database
Could involve some degree of cleansing or
transformation
Can be automated since it becomes
repetitive once established
-
7/28/2019 65518203 ETL Concepts
6/56
What is ETL? - Transformation &
Cleansing
Transformation
Modification or transformation of data being
imported into the Data Warehouse.
Usually done with the purpose of ensuring
clean and consistent data
Cleansing
The process of removing errors and
inconsistencies from data being imported to
a data warehouse
Could involve multi le sta es
-
7/28/2019 65518203 ETL Concepts
7/56
What is ETL? - Loading
After extracting, scrubbing, cleaning,validating etc. need to load the data intothe warehouse
Issueshuge volumes of data to be loaded
small time window available when
warehouse can be taken off line (usuallynights)
when to build index and summarytables
-
7/28/2019 65518203 ETL Concepts
8/56
What is ETL? - Loading Techniques
Techniques
batch load utility: sort input records on
clustering key and use sequential I/O;
build indexes and derived tables
sequential loads still too long
use parallelism and incrementaltechniques
-
7/28/2019 65518203 ETL Concepts
9/56
The Need for ETLFacilitates Integration of data from various
data sources for building a Datawarehouse Note: Mergers and acquisitions also create
disparities in data representation and pose
more difficult challenges in ETL.Businesses have data in multiple
databases with different codification andformats
Transformation is required to convert andto summarize operational data into aconsistent, business oriented format
Pre-Computation of any derived
-
7/28/2019 65518203 ETL Concepts
10/56
The Need for ETL - Example
appl A - balanceappl B - bal
appl C - currbal
appl D - balcurr
appl A - pipeline - cm
appl B - pipeline - in
appl C - pipeline - feet
appl D - pipeline - yds
appl A - m,f
appl B - 1,0
appl C - x,y
appl D - male, female
Data Warehouse
-
7/28/2019 65518203 ETL Concepts
11/56
Data Integrity Problems - ScenariosSame person, different spellings
Agarwal, Agrawal, Aggarwal etc...Multiple ways to denote company name
Persistent Systems, PSPL, Persistent
Pvt. LTD.Use of different names
mumbai, bombay
Different account numbers generated bydifferent applications for the same customer
Required fields left blank
Invalid product codes collected at point of
-
7/28/2019 65518203 ETL Concepts
12/56
ETL Glossary
Extracting
Conditioning
House holding
Enrichment
Scoring
-
7/28/2019 65518203 ETL Concepts
13/56
ETL Glossary
Extracting
Capture of data from operational
source in as is statusSources for data generally in legacy
mainframes in VSAM, IMS, IDMS,
DB2; more data today in relationaldatabases on Unix
Conditioning
The conversion of data types from the
-
7/28/2019 65518203 ETL Concepts
14/56
ETL GlossaryHouse holding
Identifying all members of a household(living at the same address)
Ensures only one mail is sent to a
householdCan result in substantial savings: 1 lakh
catalogues at Rs. 50 each costs Rs. 50lakhs. A 2% savings would save Rs. 1 lakh.
Enrichment
Bring data from external sources toaugment/enrich operational data. Data
sources include Dunn and Bradstreet, A. C.
-
7/28/2019 65518203 ETL Concepts
15/56
The ETL ProcessAccess data dictionaries defining source
files
Build logical and physical data models for
target data
Identify sources of data from existing
systems
Specify business and technical rules fordata extraction, conversion and
transformation
Perform data extraction and
-
7/28/2019 65518203 ETL Concepts
16/56
The ETL Process Push vs. PullPull :- A Pull strategy is initiated by the Target
System. As a part of the Extraction Process, thesource data can be pulled from Transactional
system into a staging area by establishing a
connection to the relational/flat/ODBC sources. Advantage :- No additional space required to store
the data that needs to be loaded into to the staging
database
Disadvantage :- Burden on the Transactional
systems when we want to load data into the staging
database
OR
-
7/28/2019 65518203 ETL Concepts
17/56
The ETL Process Push vs. Pull
With a PUSH strategy, the source system areamaintains the application to read the source and
create an interface file that is presented to your ETL.
With a PULL strategy, the DW maintains the
application to read the source.
-
7/28/2019 65518203 ETL Concepts
18/56
The ETL Process - Data Extraction
and Preparation
Extract
Analyze, Cleanand Transform
Data Movementand Load
Stage I
Stage II
Stage III
Periodic
Refresh/
Update
-
7/28/2019 65518203 ETL Concepts
19/56
The ETL Process A simplified
picture
OLTP
Systems
StagingArea
DataWarehouseExtract
OLTP
Systems
OLTPSystems
Load
Transform
Stage I Stage II Stage III
-
7/28/2019 65518203 ETL Concepts
20/56
Static extract = capturing a
snapshot of the source data at a
point in time
Incremental extract = capturing
changes that have occurred since
the last static extract
Capture = extractobtaining a snapshot of
a chosen subset of the source data for
loading into the data warehouse
The ETL Process Step1
-
7/28/2019 65518203 ETL Concepts
21/56
Scrub = cleanseuses pattern recognition
and AI techniques to upgrade data quality
Fixing errors: misspellings, erroneousdates, incorrect field usage, mismatched
addresses, missing data, duplicate data,
inconsistencies
Also: decoding, reformatting, timestamping, conversion, key generation,
merging, error detection/logging, locating
missing data
The ETL Process Step2
-
7/28/2019 65518203 ETL Concepts
22/56
Transform = convert data from format of
operational system to format of data
warehouse
Record-level:Selection data partitioning
Joining data combining
Aggregation data summarization
Field-level:Single-field from one field to one field
Multi-field from many fields to one, or one
field to many
The ETL Process Step3
-
7/28/2019 65518203 ETL Concepts
23/56
Load/Index= place transformed data into the
warehouse and create indexes
Refresh mode: bulk rewriting of targetdata at periodic intervals
Update mode: only changes in sourcedata are written to data warehouse
The ETL Process Step4
-
7/28/2019 65518203 ETL Concepts
24/56
The ETL Process - Data
Transformation
Transforms the data in accordance with
the business rules and standards that
have been established
Example include: format changes, de-
duplication, splitting up fields,
replacement of codes, derived values,
and aggregates
-
7/28/2019 65518203 ETL Concepts
25/56
Scrubbing/Cleansing DataSophisticated transformation tools used for
improving the quality of dataClean data is vital for the success of the
warehouse
Example
Seshadri, Sheshadri, Sesadri, Seshadri S.,
Srinivasan Seshadri, etc. are the same
person
-
7/28/2019 65518203 ETL Concepts
26/56
Dummy Values
Absence of DataMultipurpose Fields
Cryptic Data
Contradicting Data
Inappropriate Use of Address Lines
Violation of Business Rules
Reused Primary Keys
Non-Unique IdentifiersData Integration Problems
Reasons for Dirty data
-
7/28/2019 65518203 ETL Concepts
27/56
The ETL Process - Data Cleansing
Source systems contain dirty data
that must be cleansed
ETL software contains rudimentary
data cleansing capabilities
Specialized data cleansing software is
often used. Important for performing
name and address correction and
house holding functions
Leading data cleansing/Quality
Technology vendors include IBM
-
7/28/2019 65518203 ETL Concepts
28/56
Steps in Data Cleansing
Parsing
Correcting
Standardizing
Matching
Consolidating
-
7/28/2019 65518203 ETL Concepts
29/56
Parsing
Parsing locates and identifies
individual data elements in the source
files and then isolates these data
elements in the target files.Examples include parsing the first,
middle, and last name; street number
and street name; and city and state.
-
7/28/2019 65518203 ETL Concepts
30/56
Correcting
Corrects parsed individual data
components using sophisticated data
algorithms and secondary data
sources.
Example include replacing a vanity
address and adding a zip code.
-
7/28/2019 65518203 ETL Concepts
31/56
Standardizing
Standardizing applies conversion
routines to transform data into its
preferred (and consistent) format using
both standard and custom businessrules.
Examples include adding a pre name,
replacing a nickname, and using apreferred street name.
-
7/28/2019 65518203 ETL Concepts
32/56
Matching
Searching and matching records within
and across the parsed, corrected and
standardized data based on predefined
business rules to eliminateduplications.
Examples include identifying similar
names and addresses.
-
7/28/2019 65518203 ETL Concepts
33/56
Consolidating
Analyzing and identifying relationships
between matched records and
consolidating/merging them into ONE
representation.
-
7/28/2019 65518203 ETL Concepts
34/56
Data Quality Technology Tools
(Vendors)
DataFlux Integration Server & dfPower
Studio (www.DataFlux.com)
Trillium Software Discovery & Trillium
Software System
(www.trilliumsoftware.com)
ProfileStage & QualityStage
(www.ascential.com)
MarketScope Update: Data Quality
-
7/28/2019 65518203 ETL Concepts
35/56
MarketScope Update: Data Quality
Technology ratings, 2005 (Source:
Gartner - June 2005)
-
7/28/2019 65518203 ETL Concepts
36/56
The ETL Process - Data Loading
Data are physically moved to the data
warehouse
The loading takes place within a load
window
The trend is to near real time updates
of the data warehouse as the
warehouse is increasingly used foroperational applications
-
7/28/2019 65518203 ETL Concepts
37/56
Data Loading - First Time LoadFirst load is a complex exercise
Data extracted from tapes, files, archivesetc.
First time load might take a lot of time to
complete
-
7/28/2019 65518203 ETL Concepts
38/56
Data Refresh
Issues:
when to refresh?
on every update: too expensive, onlynecessary if OLAP queries need
current data (e.g., up-the-minute stock
quotes)
periodically (e.g., every 24 hours,
every week) or after significant
events
refresh olic set b administrator
-
7/28/2019 65518203 ETL Concepts
39/56
Data RefreshData refreshing can follow two approaches :
Complete Data Refresh
Completely refresh the target table every
time
Data Trickle Load
Replicate only net changes and update
the target database
-
7/28/2019 65518203 ETL Concepts
40/56
Data Refresh TechniquesSnapshot Approach - Full extract from base
tablesread entire source table or database:
expensive
may be the only choice for legacydatabases or files.
Incremental techniques (related to work onactive DBs)
detect & propagate changes on basetables: replication servers (e.g., Sybase,Oracle, IBM Data Propagator)
snapshots & triggers (Oracle)
-
7/28/2019 65518203 ETL Concepts
41/56
ETL Solution Options
ETL
Custom
Solution
Generic
Solution
-
7/28/2019 65518203 ETL Concepts
42/56
Custom SolutionUsing RDBMS staging tables and stored
proceduresProgramming languages like C, C++, Perl,
Visual Basic etc
Building a code generator
-
7/28/2019 65518203 ETL Concepts
43/56
Custom Solution Typical components
Control Program
Time window based extraction
Restart at point of failure
High level of error handling
Control metadata captured in Oracle tables
Facility to launch failure recovery programs Automatically
Snapshots for dimension
tables
PL/SQL extraction procedure
Complex views for
transformation
Control table and highly
parameterized/generic
extraction process
Extract From Source
Control table driven
Highly configurable process
PL/SQL procedure
Checks performed - referential
integrity, Y2K, elementary
statistics, business rules
Mechanism to flag the records
as bad / reject
Data Quality
Multiple Stars extracted as
separate groups Pro*C programs using
embedded SQL
Surrogate key generation
mechanism
ASCII file downloads
generated for load into
warehouse
Generate Download Files
-
7/28/2019 65518203 ETL Concepts
44/56
Generic SolutionAddress limitations (in scalability &
complexity) of manual codingThe need to deliver quantifiable business
value
Functionality, Reliability and Viability areno longer major issues
-
7/28/2019 65518203 ETL Concepts
45/56
Characteristics of ETL ToolsProvides facility to specify a large number of
transformation rules with a GUIGenerate programs to transform data
Handle multiple data sources
Handle data redundancy
Generate metadata as output
Most tools exploit parallelism by running on
multiple low-cost servers in multi-threaded
environment
Support data extraction, cleansing, aggregation,
reorganization, transformation, and load
operations
-
7/28/2019 65518203 ETL Concepts
46/56
Types of ETL ToolsFirst-generation
Code-generation products
Generate the source code
Second-generation Engine-driven products
Generate directly executable code
Note: Due to more efficient architecture,
second generation tools have significant
-
7/28/2019 65518203 ETL Concepts
47/56
Types of ETL Tools - First-
GenerationExtraction, transformation,load process
run on server or hostGUI interface is used to define extraction/
transformation processes
Detailed transformations require coding inCOBOL or C
Extract program is generated
automatically as source code. Sourcecode is compiled, scheduled, and run in
batch mode
Uses intermediate files
Fi G i ETL T l
-
7/28/2019 65518203 ETL Concepts
48/56
First-Generation ETL Tools
Strengths and LimitationsStrengths
Tools are mature
Programmers are
familiar with code
generation in
COBOL or C
Limitations
High cost of products Complex training
Extract programs have to
compiled from source
Many transformations have tocoded manually
Lack of parallel execution
support
Most metadata to be manually
generated
Fi t G ti ETL T l
-
7/28/2019 65518203 ETL Concepts
49/56
First-Generation ETL Tools
ExamplesSAS/Warehouse Administrator
Prism from Prism Solutions
Passport from Apertus Carleton Corp
ETI-EXTRACT Tool Suite from
Evolutionary Technologies
Copy Manager from Information Builders
T f ETL T l S d
-
7/28/2019 65518203 ETL Concepts
50/56
Types of ETL Tools - Second-
GenerationExtraction/Transformation/Load runs on
serverData directly extracted from source and
processed on server
Data transformation in memory and written
directly to warehouse database. High
throughput since intermediate files are not
usedDirectly executable code
Support for monitoring, scheduling,
extraction, scrubbing, transformation,
S d G ti ETL T l
-
7/28/2019 65518203 ETL Concepts
51/56
Second-Generation ETL Tools
Strengths and LimitationsStrengths
Lower cost suites,
platforms, and
environment
Fast, efficient, and
multi-threaded
ETL functions highly
Limitations
Not mature
Initial tools oriented only to
RDBMS sources
S d G ti ETL T l
-
7/28/2019 65518203 ETL Concepts
52/56
Second-Generation ETL Tools
ExamplesPowerMart from Informatica
DataStage from Ardent
Data Mart Solution from Sagent
Technology
Tapestry from D2K
-
7/28/2019 65518203 ETL Concepts
53/56
ETL Tools - Examples
DataStage from Ascential Software
SAS System from SAS Institute
Power Mart/Power Center from
Informatica
Sagent Solution from Sagent Software
Hummingbird Genio Suite fromHummingbird Communications
-
7/28/2019 65518203 ETL Concepts
54/56
ETL Tool - General Selection criteria Business Vision/Considerations
Overall IT strategy/Architecture Over all cost of Ownership
Vendor Positioning in the Market
Performance
In-house Expertise available
User friendliness
Training requirements to existing users
References from other customers
-
7/28/2019 65518203 ETL Concepts
55/56
ETL Tool Specific Selection criteriaSupport to retrieve, cleanse, transform,
summarize, aggregate, and load dataEngine-driven products for fast, parallel
operation
Generate and manage central metadatarepository
Open metadata exchange architecture
Provide end-users with access tometadata in business terms
Support development of logical and
physical data models
-
7/28/2019 65518203 ETL Concepts
56/56
ETL Tool - Selection criteria
ETI Extract
SAS Warehouse
Administrator
Informatica PowerCenter
Platinum Decision Base
Ardent DataStage
Data Mirror Transformation
Server
Ardent Warehouse
Executive
Carleton Pureview
Ease of Use / Development
Capabilities
Target Database Loading
Data
Transformation
and Repair
Complexity
Operations
Management/
ProcessAutomation
Metadata
Management and
Administration
Data Extraction &
Integration
complexity
Source: Gartner Report
High Rating
Low Rating