ETL Concepts
Scope of the Training
• What is ETL?
• Need for ETL
• ETL Glossary
• The ETL Process
• Data Extraction and Preparation
• Data Cleansing
• Data Transformation
• Data Load
• Data Refresh Strategies
• ETL Solution Options
• Characteristics of ETL Tools
• Types of ETL Tools
• ETL Tool - Selection Criteria
• Key Tools in the Market
What is ETL?
ETL stands for Extraction, Transformation and Load.
This is the most challenging, costly and time-consuming step in building any type of data warehouse.
This step usually determines the success or failure of a data warehouse, because any analysis places great importance on the data and on the quality of the data being analyzed.
What is ETL? - Extraction
Extraction
• The process of culling out the data required for the data warehouse from the source system
• Can be to a file or to a database
• Could involve some degree of cleansing or transformation
• Can be automated, since it becomes repetitive once established
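The extraction step above can be sketched in a few lines. This is a minimal, hypothetical example — it assumes an in-memory SQLite database standing in for the source system, and an `orders` table invented for illustration; a real extract would connect to the actual transactional database.

```python
import csv
import sqlite3

def extract_to_file(conn, query, out_path):
    """Pull the rows needed by the warehouse out of a source system
    and write them to a flat staging file."""
    cur = conn.execute(query)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur)

# Demo against an in-memory stand-in for the source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])
extract_to_file(conn, "SELECT id, amount FROM orders", "orders_extract.csv")
```

Once the query and target file are parameterized like this, the extract becomes a repeatable, schedulable job — which is what makes automation possible.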
What is ETL? - Transformation & Cleansing
Transformation
• Modification or transformation of data being imported into the data warehouse
• Usually done with the purpose of ensuring 'clean' and 'consistent' data
Cleansing
• The process of removing errors and inconsistencies from data being imported into a data warehouse
• Could involve multiple stages
• Feedback could go back to strengthen the OLTP data-capture mechanism
What is ETL? - Loading
After extracting, scrubbing, cleaning, validating etc., the data needs to be loaded into the warehouse.
Issues
• Huge volumes of data to be loaded
• Small time window available when the warehouse can be taken offline (usually nights)
• When to build indexes and summary tables
• Allow system administrators to monitor, cancel, resume, and change load rates
• Recover gracefully: restart after failure from where you were, without loss of data integrity
What is ETL? - Loading Techniques
Techniques
• Batch load utility: sort input records on the clustering key and use sequential I/O; build indexes and derived tables afterwards
• Sequential loads can still be too long
• Use parallelism and incremental techniques
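The batch-load technique above can be illustrated in miniature. This is a sketch, not a production loader — it uses SQLite and an invented `fact_sales` table, and an in-process sort stands in for an external sort utility; the point is the ordering of steps: sort on the clustering key, bulk-insert, then build the index once at the end.

```python
import sqlite3

def batch_load(conn, records):
    """Bulk-load staged records: sort on the clustering key first so
    writes are sequential, and defer index creation until after the load."""
    records.sort(key=lambda r: r[0])  # clustering key = first column
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales (day INTEGER, amount REAL)")
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", records)
    # Index built once, after loading, instead of being maintained per-row.
    conn.execute("CREATE INDEX IF NOT EXISTS ix_day ON fact_sales (day)")
    conn.commit()

conn = sqlite3.connect(":memory:")
batch_load(conn, [(3, 9.0), (1, 4.0), (2, 7.5)])
```

Building the index after the load avoids paying the per-row index-maintenance cost during the bulk insert, which is why load utilities order the work this way.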
The Need for ETL
• Facilitates integration of data from various data sources for building a data warehouse
• Note: mergers and acquisitions also create disparities in data representation and pose more difficult ETL challenges
• Businesses have data in multiple databases with different codification and formats
• Transformation is required to convert and summarize operational data into a consistent, business-oriented format
• Pre-computation of any derived information; summarization is also carried out to pre-compute summaries and aggregates
• Makes data available in a queriable format
The Need for ETL - Example
Four source applications represent the same information differently:
• Field name: appl A - balance; appl B - bal; appl C - currbal; appl D - balcurr
• Unit: appl A - pipeline in cm; appl B - pipeline in in; appl C - pipeline in feet; appl D - pipeline in yds
• Encoding: appl A - m, f; appl B - 1, 0; appl C - x, y; appl D - male, female
All of these must be reconciled before loading into the Data Warehouse.
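The reconciliation described above is typically driven by per-source mapping rules. The sketch below is hypothetical — the mapping tables (including the assumption that appl C's `x`/`y` codes mean male/female) are invented for illustration; in practice these rules come from the source systems' documentation.

```python
# Hypothetical per-source rules mapping each application's field name,
# unit, and gender coding onto the warehouse standard.
FIELD_MAP = {"balance": "balance", "bal": "balance",
             "currbal": "balance", "balcurr": "balance"}
UNIT_TO_METRES = {"cm": 0.01, "in": 0.0254, "feet": 0.3048, "yds": 0.9144}
GENDER_MAP = {"m": "M", "f": "F", "1": "M", "0": "F",
              "x": "M", "y": "F", "male": "M", "female": "F"}

def conform(record):
    """Rewrite one source record into the warehouse's standard representation."""
    out = {}
    for field, value in record.items():
        if field in FIELD_MAP:
            out[FIELD_MAP[field]] = value          # unify field names
        elif field == "pipeline":
            length, unit = value
            out["pipeline_m"] = round(length * UNIT_TO_METRES[unit], 4)  # unify units
        elif field == "gender":
            out["gender"] = GENDER_MAP[str(value).lower()]  # unify encodings
    return out

# appl C stores balance as 'currbal', pipeline in feet, gender as x/y:
print(conform({"currbal": 100.0, "pipeline": (10, "feet"), "gender": "x"}))
```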
Data Integrity Problems - Scenarios
• Same person, different spellings: Agarwal, Agrawal, Aggarwal etc.
• Multiple ways to denote a company name: Persistent Systems, PSPL, Persistent Pvt. Ltd.
• Use of different names: mumbai, bombay
• Different account numbers generated by different applications for the same customer
• Required fields left blank
• Invalid product codes collected at point of sale: manual entry leads to mistakes ("in case of a problem use 9999999")
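The "different spellings" problem above is usually attacked with approximate string matching. A minimal sketch using the standard library's `difflib.SequenceMatcher` (the 0.8 threshold is an illustrative choice, not a recommendation):

```python
from difflib import SequenceMatcher

def likely_same(a, b, threshold=0.8):
    """Flag two spellings as probable duplicates by string similarity.
    The threshold is an assumption; real systems tune it per field."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(likely_same("Agarwal", "Agrawal"))    # variant spellings match
print(likely_same("Agarwal", "Mehta"))      # unrelated names do not
```

Dedicated data-quality tools go further, combining phonetic keys and field-specific rules, but the core idea is the same: score similarity, then decide above a threshold.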
ETL Glossary
• Extracting
• Conditioning
• Householding
• Enrichment
• Scoring
ETL Glossary
Extracting
• Capture of data from the operational source in "as is" status
• Sources for data are generally in legacy mainframes in VSAM, IMS, IDMS, DB2; more data today is in relational databases on Unix
Conditioning
• The conversion of data types from the source to the target data store (warehouse) -- always a relational database
ETL Glossary
Householding
• Identifying all members of a household (living at the same address)
• Ensures only one mailing is sent to a household
• Can result in substantial savings: 1 lakh catalogues at Rs. 50 each cost Rs. 50 lakhs; a 2% saving would save Rs. 1 lakh
Enrichment
• Bringing in data from external sources to augment/enrich operational data. Data sources include Dun & Bradstreet, A. C. Nielsen, CMIE, IMRA etc.
Scoring
• Computation of the probability of an event, e.g. the chance that a customer will defect to AT&T from MCI, or that a customer is likely to buy a new product
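Householding as described above amounts to grouping records by a normalized address key. A toy sketch (the address normalization here is deliberately crude — lowercasing and whitespace collapse — where real tools apply full postal standardization first):

```python
from collections import defaultdict

def household(customers):
    """Group customer records that share a normalized mailing address,
    so only one catalogue is sent per household."""
    groups = defaultdict(list)
    for name, address in customers:
        key = " ".join(address.lower().split())  # crude normalization
        groups[key].append(name)
    return list(groups.values())

# Hypothetical mailing list: two Kumars at the same Pune address.
mailing_list = [("A. Kumar", "12 MG Road, Pune"),
                ("S. Kumar", "12  mg road,  pune"),
                ("R. Shah",  "4 FC Road, Pune")]
print(household(mailing_list))
```

With the two Kumar records collapsed into one household, the mailing drops from three catalogues to two — the 2% savings in the slide comes from exactly this kind of consolidation at scale.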
The ETL Process
• Access data dictionaries defining source files
• Build logical and physical data models for target data
• Identify sources of data from existing systems
• Specify business and technical rules for data extraction, conversion and transformation
• Perform data extraction and transformation
• Load target databases
The ETL Process – Push vs. Pull
Pull: A pull strategy is initiated by the target system. As part of the extraction process, the source data is pulled from the transactional system into a staging area by establishing a connection to the relational/flat-file/ODBC sources.
• Advantage: No additional space is required to store the data that needs to be loaded into the staging database
• Disadvantage: Burden on the transactional systems when we want to load data into the staging database
Push: A push strategy is initiated by the source system. As part of the extraction process, the source data is pushed, exported or dumped to a file location from where it can be loaded into a staging area.
• Advantage: No additional burden on the transactional systems when we want to load data into the staging database
• Disadvantage: Additional space is required to store the data that needs to be loaded into the staging database
The ETL Process – Push vs. Pull
With a push strategy, the source-system area maintains the application that reads the source and creates an interface file presented to your ETL. With a pull strategy, the data warehouse maintains the application that reads the source.
The ETL Process - Data Extraction and Preparation
Stage I: Extract → Stage II: Analyze, Clean and Transform → Stage III: Data Movement and Load → Periodic Refresh/Update
The ETL Process – A simplified picture
Multiple OLTP systems → Extract (Stage I) → Staging Area → Transform (Stage II) → Load (Stage III) → Data Warehouse
The ETL Process – Step 1
Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
• Static extract = capturing a snapshot of the source data at a point in time
• Incremental extract = capturing changes that have occurred since the last static extract
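The static/incremental distinction can be captured in one function: a full snapshot is just an incremental extract "since the beginning of time". A sketch under stated assumptions — SQLite stands in for the source, and the `orders` table with an `updated_at` column is invented for illustration:

```python
import sqlite3

def incremental_extract(conn, last_extract_ts):
    """Capture only rows changed since the previous extract (incremental).
    Passing 0 takes a full static snapshot."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_extract_ts,)).fetchall()
    # Remember the high-water mark so the next run picks up where we left off.
    new_ts = max((r[2] for r in rows), default=last_extract_ts)
    return rows, new_ts

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10.0, 100), (2, 20.0, 150), (3, 30.0, 240)])

snapshot, ts = incremental_extract(conn, 0)    # static extract: all rows
delta, ts2 = incremental_extract(conn, 150)    # incremental: only changes after 150
```

This timestamp-based approach assumes the source reliably maintains `updated_at`; where it does not, the replication and trigger techniques listed later under Data Refresh Techniques are used instead.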
The ETL Process – Step 2
Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality
• Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
• Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
The ETL Process – Step 3
Transform = convert data from the format of the operational system to the format of the data warehouse
• Record-level: selection (data partitioning), joining (data combining), aggregation (data summarization)
• Field-level: single-field (from one field to one field), multi-field (from many fields to one, or one field to many)
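The two field-level cases are easy to show concretely. A minimal sketch with hypothetical fields (the function names and columns are invented for illustration):

```python
def split_name(full_name):
    """One field -> many: break a full-name field into first and last name."""
    first, _, last = full_name.partition(" ")
    return first, last

def total_price(qty, unit_price):
    """Many fields -> one: derive a single target field from two source fields."""
    return qty * unit_price

print(split_name("Srinivasan Seshadri"))
print(total_price(3, 12.5))
```

Record-level transforms (selection, joining, aggregation) are typically expressed in SQL over the staging tables rather than per-field code like this.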
The ETL Process – Step 4
Load/Index = place transformed data into the warehouse and create indexes
• Refresh mode: bulk rewriting of target data at periodic intervals
• Update mode: only changes in source data are written to the data warehouse
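The two load modes map naturally onto delete-and-reload versus upsert. A sketch against SQLite with an invented `dim_product` table (the `ON CONFLICT` upsert syntax shown is SQLite's; other databases use `MERGE` or similar):

```python
import sqlite3

def refresh(conn, rows):
    """Refresh mode: bulk-rewrite the target table from scratch."""
    conn.execute("DELETE FROM dim_product")
    conn.executemany("INSERT INTO dim_product VALUES (?, ?)", rows)

def update(conn, changed_rows):
    """Update mode: write only the changed source rows (upsert)."""
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name", changed_rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (id INTEGER PRIMARY KEY, name TEXT)")
refresh(conn, [(1, "soap"), (2, "shampoo")])
update(conn, [(2, "shampoo 200ml"), (3, "toothpaste")])
```

Refresh mode is simple but rewrites everything each cycle; update mode touches only changed rows, which is what makes shrinking load windows feasible.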
The ETL Process - Data Transformation
• Transforms the data in accordance with the business rules and standards that have been established
• Examples include: format changes, de-duplication, splitting up fields, replacement of codes, derived values, and aggregates
Scrubbing/Cleansing Data
• Sophisticated transformation tools are used for improving the quality of data
• Clean data is vital for the success of the warehouse
• Example: Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same person
Reasons for "Dirty" Data
• Dummy values
• Absence of data
• Multipurpose fields
• Cryptic data
• Contradicting data
• Inappropriate use of address lines
• Violation of business rules
• Reused primary keys
• Non-unique identifiers
• Data integration problems
The ETL Process - Data Cleansing
• Source systems contain "dirty data" that must be cleansed
• ETL software contains rudimentary data cleansing capabilities
• Specialized data cleansing software is often used; important for performing name and address correction and householding functions
• Leading data cleansing/quality technology vendors include IBM (QualityStage & ProfileStage), Harte-Hanks (Trillium Software), SAS (DataFlux) and Firstlogic
Steps in Data Cleansing
• Parsing
• Correcting
• Standardizing
• Matching
• Consolidating
Parsing
Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
Examples include parsing the first, middle, and last name; street number and street name; and city and state.
Correcting
Correcting fixes parsed individual data components using sophisticated data algorithms and secondary data sources.
Examples include replacing a vanity address and adding a zip code.
Standardizing
Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
Examples include adding a pre-name (salutation), replacing a nickname, and using a preferred street name.
Matching
Matching searches for and matches records within and across the parsed, corrected and standardized data, based on predefined business rules, to eliminate duplicates.
Examples include identifying similar names and addresses.
Consolidating
Consolidating analyzes and identifies relationships between matched records and consolidates/merges them into ONE representation.
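The five cleansing steps can be chained into one tiny pipeline. Everything here is a hypothetical sketch — the nickname table, the 0.85 similarity threshold, and the sample names are invented; real data-quality tools use far richer rule sets:

```python
from difflib import SequenceMatcher

NICKNAMES = {"bob": "robert", "bill": "william"}  # hypothetical lookup table

def parse(raw):
    """Parsing: isolate first and last name from a raw string."""
    first, _, last = raw.strip().partition(" ")
    return first.lower(), last.lower()

def standardize(name):
    """Standardizing: replace nicknames with the preferred form."""
    first, last = name
    return NICKNAMES.get(first, first), last

def matches(a, b, threshold=0.85):
    """Matching: flag probable duplicates by full-name similarity."""
    return SequenceMatcher(None, " ".join(a), " ".join(b)).ratio() >= threshold

def consolidate(records):
    """Consolidating: keep one representation per group of matched records."""
    survivors = []
    for rec in (standardize(parse(r)) for r in records):
        if not any(matches(rec, kept) for kept in survivors):
            survivors.append(rec)
    return survivors

print(consolidate(["Bob Smith", "Robert Smith", "Bill Jones"]))
```

Note how standardizing before matching is what lets "Bob Smith" and "Robert Smith" collapse into one record — order of the steps matters.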
Data Quality Technology Tools (Vendors)
DataFlux Integration Server & dfPower ® Studio (www.DataFlux.com)
Trillium Software Discovery & Trillium Software System (www.trilliumsoftware.com)
ProfileStage & QualityStage (www.ascential.com)
MarketScope Update: Data Quality Technology ratings, 2005 (Source: Gartner - June 2005)
The ETL Process - Data Loading
Data are physically moved to the data warehouse
The loading takes place within a “load window”
The trend is to near real time updates of the data warehouse as the warehouse is increasingly used for operational applications
Data Loading - First Time Load
First load is a complex exercise
• Data extracted from tapes, files, archives etc.
• First-time load might take a lot of time to complete
Data Refresh
Issues: when to refresh, and how to refresh?
When to refresh:
• On every update: too expensive; only necessary if OLAP queries need current data (e.g., up-to-the-minute stock quotes)
• Periodically (e.g., every 24 hours, every week) or after "significant" events
• Refresh policy is set by the administrator based on user needs and traffic
• Possibly different policies for different sources
Data Refresh
Data refreshing can follow two approaches:
Complete Data Refresh
• Completely refresh the target table every time
Data Trickle Load
• Replicate only net changes and update the target database
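The "net changes" a trickle load replicates are just the diff between two key-to-row snapshots. A minimal sketch (the product data is invented for illustration):

```python
def net_changes(previous, current):
    """Trickle load: diff two key->row snapshots and replicate only the
    inserts, updates, and deletes to the target."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    deletes = [k for k in previous if k not in current]
    return inserts, updates, deletes

old = {1: "soap", 2: "shampoo"}
new = {2: "shampoo 200ml", 3: "toothpaste"}
print(net_changes(old, new))
```

In practice the diff is rarely computed this way against full snapshots; the replication-server, trigger, and log-shipping techniques listed on the next slide produce the change stream directly.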
Data Refresh Techniques
Snapshot approach - full extract from base tables
• Read the entire source table or database: expensive
• May be the only choice for legacy databases or files
Incremental techniques (related to work on active databases)
• Detect and propagate changes on base tables: replication servers (e.g., Sybase, Oracle, IBM Data Propagator)
• Snapshots and triggers (Oracle)
• Transaction shipping (Sybase)
Logical correctness
• Computing changes to star tables
• Computing changes to derived and summary tables
• Optimization: only significant changes
• Transactional correctness: incremental load
ETL Solution Options
• Custom Solution
• Generic Solution
Custom Solution
• Using RDBMS staging tables and stored procedures
• Programming languages like C, C++, Perl, Visual Basic etc.
• Building a code generator
Custom Solution – Typical components
Control Program
• Time-window-based extraction
• Restart at point of failure
• High level of error handling
• Control metadata captured in Oracle tables
• Facility to launch failure-recovery programs automatically
Extract From Source
• Snapshots for dimension tables
• PL/SQL extraction procedure
• Complex views for transformation
• Control table and highly parameterized/generic extraction process
Data Quality
• Control-table driven
• Highly configurable process
• PL/SQL procedure
• Checks performed: referential integrity, Y2K, elementary statistics, business rules
• Mechanism to flag records as bad / reject
Generate Download Files
• Multiple stars extracted as separate groups
• Pro*C programs using embedded SQL
• Surrogate key generation mechanism
• ASCII file downloads generated for load into the warehouse
Generic Solution
• Addresses the limitations (in scalability and complexity) of manual coding
• The need to deliver quantifiable business value
• Functionality, reliability and viability are no longer major issues
Characteristics of ETL Tools
• Provide a facility to specify a large number of transformation rules with a GUI
• Generate programs to transform data
• Handle multiple data sources
• Handle data redundancy
• Generate metadata as output
• Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment
• Support data extraction, cleansing, aggregation, reorganization, transformation, and load operations
• Generate and maintain centralized metadata
• Closely integrated with various RDBMSs
• Filter data, convert codes, calculate derived values, map many source data fields to one target data field
• Automatic generation of data extract programs
• High-speed loading of target data warehouses
Types of ETL Tools
First-generation
• Code-generation products
• Generate the source code
Second-generation
• Engine-driven products
• Generate directly executable code
Note: Due to their more efficient architecture, second-generation tools have a significant advantage over first-generation tools.
Types of ETL Tools - First-Generation
• Extraction, transformation and load processes run on a server or host
• A GUI is used to define extraction/transformation processes
• Detailed transformations require coding in COBOL or C
• The extract program is generated automatically as source code; the source code is compiled, scheduled, and run in batch mode
• Uses intermediate files
• The program is single-threaded and cannot use parallel processors
• Little metadata is generated automatically
First-Generation ETL Tools – Strengths and Limitations
Strengths
• Tools are mature
• Programmers are familiar with code generation in COBOL or C
Limitations
• High cost of products
• Complex training
• Extract programs have to be compiled from source
• Many transformations have to be coded manually
• Lack of parallel-execution support
• Most metadata has to be manually generated
First-Generation ETL Tools – Examples
• SAS/Warehouse Administrator
• Prism from Prism Solutions
• Passport from Apertus Carleton Corp
• ETI-EXTRACT Tool Suite from Evolutionary Technologies
• Copy Manager from Information Builders
Types of ETL Tools - Second-Generation
• Extraction/transformation/load runs on a server
• Data is directly extracted from the source and processed on the server
• Data is transformed in memory and written directly to the warehouse database; high throughput, since intermediate files are not used
• Directly executable code
• Support for monitoring, scheduling, extraction, scrubbing, transformation, load, index, aggregation, metadata
• Multi-threading with use of parallel processors
• Automatic generation of a high percentage of metadata
Second-Generation ETL Tools – Strengths and Limitations
Strengths
• Lower cost of suites, platforms, and environment
• Fast, efficient, and multi-threaded
• ETL functions highly integrated and automated
• Open, extensible metadata
Limitations
• Not mature
• Initial tools oriented only to RDBMS sources
Second-Generation ETL Tools – Examples
• PowerMart from Informatica
• DataStage from Ardent
• Data Mart Solution from Sagent Technology
• Tapestry from D2K
ETL Tools - Examples
• DataStage from Ascential Software
• SAS System from SAS Institute
• PowerMart/PowerCenter from Informatica
• Sagent Solution from Sagent Software
• Hummingbird Genio Suite from Hummingbird Communications
ETL Tool - General Selection Criteria
• Business vision/considerations
• Overall IT strategy/architecture
• Overall cost of ownership
• Vendor positioning in the market
• Performance
• In-house expertise available
• User friendliness
• Training requirements for existing users
• References from other customers
ETL Tool – Specific Selection Criteria
• Support to retrieve, cleanse, transform, summarize, aggregate, and load data
• Engine-driven products for fast, parallel operation
• Generate and manage a central metadata repository
• Open metadata exchange architecture
• Provide end-users with access to metadata in business terms
• Support development of logical and physical data models
ETL Tool - Selection Criteria
[Chart (Source: Gartner Report): rates ETI Extract, SAS Warehouse Administrator, Informatica PowerCenter, Platinum Decision Base, Ardent DataStage, Data Mirror Transformation Server, Ardent Warehouse Executive and Carleton Pureview on a low-to-high scale against: ease of use / development capabilities; target database loading; data transformation and repair complexity; operations management / process automation; metadata management and administration; data extraction and integration complexity.]