Data Quality and Preprocessing Concepts ETL
TRANSCRIPT
-
8/10/2019 Data Quality and Preprocessing Concepts ETL
1/64
The Need For Data Quality
Difficulty in decision making
Time delays in operation
Organizational mistrust
Data ownership conflicts
Customer attrition
Costs associated with:
- error detection
- error rework
- customer service
- fixing customer problems
Poor Data Quality Impacts The Bottom Line
Data Quality Analysis
Identify Inconsistencies
- Codification differences
- Multiple database entries for the same entity
- Out-of-range or missing values
Frequency Distribution
- Uniqueness of keys
Referential Integrity
Compliance with other business rules
Examine Source Systems For Inconsistencies
Tools In The Market
Business Rule Discovery Tools
- Integrity Data Reengineering Tool from Vality Technology
- Trillium Software System from Harte-Hanks Data Technologies
- Migration Architect from DB Star
- WizRule, WizWhy from WizSoft Inc
Data Reengineering & Cleansing Tools
- Prism from Prism Solutions
- Carlton Pureview from Oracle
- ETI-Extract from Evolutionary Technologies
- PowerMart from Informatica Corp
- Sagent Data Mart from Sagent Technology
Tools In The Market
Name & Address Cleansing Tools
- Centrus Suite from Sagent
- i.d.Centric from First Logic
Quality Defect Prevention Tools
- Centrus Suite from Sagent
- i.d.Centric from First Logic
Data Quality Assessment Tools
- Migration Architect, Evoke Axio from Evoke Software
- WizRule from WizSoft
Why Data Preprocessing?
Data in the real world is dirty:
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation = " "
noisy: containing errors or outliers
e.g., Salary = -10
inconsistent: containing discrepancies in codes or names
e.g., Age = 42, Birthday = 03/07/1997
e.g., was rating 1, 2, 3, now rating A, B, C
e.g., discrepancy between duplicate records
Forms of Data Preprocessing
Measuring the Dispersion of Data
Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 - Q1
Boxplot: ends of the box are the quartiles, the median is marked, and outliers are plotted individually
Variance and standard deviation (sample: s, population: σ)
Variance (algebraic, scalable computation):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right]$$

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{n}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{n}x_i^2 - \mu^2$$

Standard deviation s (or σ) is the square root of variance s² (or σ²)
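The measures above can be checked numerically with a short sketch; the helper names below are ours, and the quartile convention (median of each half) is one of several in common use.

```python
def quartiles(sorted_vals):
    """Quartiles via the median-of-halves convention."""
    def median(v):
        m = len(v) // 2
        return v[m] if len(v) % 2 else (v[m - 1] + v[m]) / 2
    half = len(sorted_vals) // 2
    return median(sorted_vals[:half]), median(sorted_vals[-half:])

def sample_variance(xs):
    """Definition form: s^2 = sum((x - x_bar)^2) / (n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def sample_variance_scalable(xs):
    """Algebraic form: needs only sum(x) and sum(x^2), so it is one-pass."""
    n = len(xs)
    s1, s2 = sum(xs), sum(x * x for x in xs)
    return (s2 - s1 * s1 / n) / (n - 1)

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
q1, q3 = quartiles(data)
iqr = q3 - q1
# Common boxplot rule: outliers fall outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
```

Both variance forms agree on any data set; the scalable form matters when the data cannot fit in memory for a second pass.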
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with:
- a global constant: e.g., unknown (but this creates a new class?!)
- the attribute mean
- the attribute mean for all samples belonging to the same class: smarter
- the most probable value: use an inference-based method such as a Bayesian formula or a decision tree
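The mean-based fill strategies above can be sketched in a few lines; the record layout and field names below are invented for illustration.

```python
records = [
    {"cls": "A", "income": 30.0}, {"cls": "A", "income": None},
    {"cls": "B", "income": 60.0}, {"cls": "B", "income": 80.0},
    {"cls": "B", "income": None},
]

def fill_with_attribute_mean(rows, attr):
    """Replace missing values with the overall attribute mean."""
    vals = [r[attr] for r in rows if r[attr] is not None]
    mean = sum(vals) / len(vals)
    return [dict(r, **{attr: mean if r[attr] is None else r[attr]}) for r in rows]

def fill_with_class_mean(rows, attr, cls_attr):
    """'Smarter' variant: use the mean of samples in the same class."""
    groups = {}
    for r in rows:
        if r[attr] is not None:
            groups.setdefault(r[cls_attr], []).append(r[attr])
    means = {k: sum(v) / len(v) for k, v in groups.items()}
    return [dict(r, **{attr: means[r[cls_attr]] if r[attr] is None else r[attr]})
            for r in rows]
```

The class-conditional mean preserves between-class differences that the global mean would wash out.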
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions
Other data problems which require data cleaning:
- duplicate records
- incomplete data
- inconsistent data
How to Handle Noisy Data?
Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc.
Regression: smooth by fitting the data to regression functions
Clustering: detect and remove outliers
Combined computer and human inspection: detect suspicious values and have a human check them (e.g., deal with possible outliers)
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
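The worked example above can be reproduced with a minimal sketch; the helper names are ours, and bin means are rounded to integers to match the slide.

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equal_frequency_bins(sorted_vals, n_bins):
    """Partition sorted values into bins of equal depth."""
    size = len(sorted_vals) // n_bins
    return [sorted_vals[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by its (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the nearer of the bin's min/max boundary."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

bins = equal_frequency_bins(prices, 3)
```

Running the two smoothers on `bins` yields exactly the bin-mean and bin-boundary rows shown above.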
Regression
[Figure: scatter of data points with a fitted regression line y = x + 1; a point (X1, Y1) is smoothed to its predicted value on the line]
Cluster Analysis
Data Cleaning as a Process
Data discrepancy detection: causes include entry errors (people not wanting to give details), outdated addresses, poorly designed forms, and too many options for questions
Use any knowledge you have, such as metadata (e.g., domain, range, dependency, distribution), or write your own scripts
Check for field overloading (e.g., 2004/12/25 and 25/12/2004 in the same field)
Check the uniqueness rule, consecutive rule and null rule (zero, a person refusing to provide data, blanks)
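Such checks are easy to script; the patterns, null markers and key field below are illustrative assumptions.

```python
import re

DATE_YMD = re.compile(r"^\d{4}/\d{2}/\d{2}$")   # e.g. 2004/12/25
DATE_DMY = re.compile(r"^\d{2}/\d{2}/\d{4}$")   # e.g. 25/12/2004

def normalize_date(value):
    """Field overloading: two date layouts in one field, normalized to one."""
    if DATE_YMD.match(value):
        return value
    if DATE_DMY.match(value):
        d, m, y = value.split("/")
        return f"{y}/{m}/{d}"
    raise ValueError(f"unrecognized date format: {value!r}")

def violates_null_rule(value):
    """Null rule: zeros, blanks and refusal markers all mean 'missing'."""
    return value in (None, "", " ", "0", "N/A")

def duplicate_keys(rows, key):
    """Uniqueness rule: report key values that occur more than once."""
    seen, dups = set(), set()
    for r in rows:
        (dups if r[key] in seen else seen).add(r[key])
    return dups
```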
Data cleaning contd.
Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections; uses parsing and fuzzy matching techniques
Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
Integration of the two processes (www.control.cs.berkeley.edu)
Iterative and interactive (e.g., Potter's Wheel)
Work in progress: declarative languages using SQL for data cleaning
Data Integration
Data integration: combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id = B.cust-#
Integrate metadata from different sources
Entity identification problem: identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources are different
Possible reasons: different representations, different scales, e.g., metric vs. British units
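Both problems can be sketched with tiny rule tables; the nickname map and the conversion factor below are illustrative assumptions, not a complete solution.

```python
# Entity identification: map nicknames to a canonical form so that
# "Bill Clinton" and "William Clinton" resolve to the same entity.
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}

def canonical_name(full_name):
    parts = full_name.lower().split()
    return " ".join(NICKNAMES.get(p, p) for p in parts)

# Value-conflict resolution: store all weights in one unit (kg) so that
# metric and British measurements become comparable.
def to_metric_kg(value, unit):
    factors = {"kg": 1.0, "lb": 0.45359237}
    return value * factors[unit]
```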
ETL
Ralph Speaks
Technical Design Challenges Posed By The Data Warehouse Evolution
Timeliness
Data Volumes
Response Times
The Big Picture!
[Diagram: BI, legacy systems, B2C, B2B and CRM systems feeding the warehouse]
Which Approach Do We Take?
Data Extraction and Preparation
Stage I: Extract
Stage II: Analyze, Clean and Transform
Stage III: Data Movement and Load
Periodic Refresh/Update
The ETL Process
Access data dictionaries defining source files
Build logical and physical data models for target data
Survey existing systems to identify sources of data
Specify business and technical rules for data extraction, conversion and transformation
Perform data extraction and transformation
Load target databases
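The extract/transform/load steps above can be sketched end to end with Python's built-in sqlite3 standing in for both the source system and the target warehouse; the table and column names are invented for illustration.

```python
import sqlite3

# Source system (Stage I input) with messy, duplicated rows
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")
source.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                   [(1, "  Ann ", "in"), (2, "Bob", "IN"), (2, "Bob", "IN")])

# Extract
rows = source.execute("SELECT id, name, country FROM customers").fetchall()

# Transform: simple business rules - trim names, upper-case country
# codes, and drop exact duplicates
clean = sorted({(i, n.strip(), c.upper()) for i, n, c in rows})

# Load into the target warehouse table
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE dim_customer (id INTEGER, name TEXT, country TEXT)")
target.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", clean)
loaded = target.execute("SELECT * FROM dim_customer ORDER BY id").fetchall()
```

Real ETL tools add scheduling, metadata capture and restartability around this same pipeline shape.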
The ETL Process
Data Extraction - Simplified
[Diagram: Stage I - Extract from OLTP systems into the staging area; Stage II - Transform; Stage III - Load into the data warehouse]
ETL Tools - Classification
First-generation: code-generation products; generate the source code
Second-generation: engine-driven products; generate directly executable code
ETL Tools - Classification
Due to their more efficient architecture, second-generation tools have a significant advantage over first-generation tools
ETL Tools - First-Generation
Strengths
- Tools are mature
- Programmers are familiar with code generation in COBOL or C
Limitations
- High cost of products
- Complex training
- Extract programs have to be compiled from source
- Many transformations have to be coded manually
- Lack of parallel execution support
- Most metadata has to be manually generated
Characterized by the generation and deployment of multiple codes
ETL Tools - Second-Generation
Extraction/Transformation/Load runs on a server
Data directly extracted from the source and processed on the server
Data transformed in memory and written directly to the warehouse database; high throughput since intermediate files are not used
Directly executable code
Support for monitoring, scheduling, extraction, scrubbing, transformation, load, index, aggregation, metadata
Characterized by the transformation engine
ETL Tools - Second-Generation
PowerCentre/Mart from Informatica
Data Mart Solution from Sagent Technology
DataStage from Ascential
ETL Tools - Selection
Support to retrieve, cleanse, transform, summarize, aggregate, and load data
Engine-driven products for fast, parallel operation
Generate and manage a central metadata repository
Open metadata exchange architecture
Provide end users with access to metadata in business terms
Support development of logical and physical data models
Data Loading - First Time Loads
The first load is a complex exercise
Data extracted from tapes, files, archives, etc.
A first-time load might take several days to complete
Extract, Clean, Transform, etc.
Source: www.survey.com
ETL Trends
The DWH market is growing at 40-45% p.a.
Metadata management is shaping the market
Real-time CRM requires a real-time DWH
E-commerce and e-business are fuelling DWH & BI
ERP data warehousing is in demand
Major Trends
Source: Cutter Report - May 2000
ETL Trends
ETL technology built into other BI products
XML-enabled, platform-independent data traffic
Near-real-time data warehouses using middleware
Vendors have evolved their products into data mart/analytical platforms
The Data Mart Strategy
The most common approach
Begins with a single mart; architected marts are added over time for more subject areas
Relatively inexpensive and easy to implement
Can be used as a proof of concept for data warehousing
Can perpetuate the silos-of-information problem
Can postpone difficult decisions and activities
Requires an overall integration plan
Data Sources and Types
Primarily from legacy, operational systems
Almost exclusively numerical data at the present time
External data may be included, often purchased from third-party sources
Technology exists for storing unstructured data; expect this to become more important over time
Extraction, Transformation, and Loading (ETL) Processes
The plumbing work of data warehousing
Data are moved from source to target databases
A very costly, time-consuming part of data warehousing
Recent Development: More Frequent Updates
Updates can be done in bulk and trickle modes
Business requirements, such as trading-partner access to a Web site, require current data
For international firms, there is no good time to load the warehouse
Recent Development: Clickstream Data
Results from clicks at web sites
A dialog manager handles user interactions; an ODS (operational data store in the data staging area) helps to custom-tailor the dialog
The clickstream data is filtered, parsed and sent to a data warehouse, where it is analyzed
Software is available to analyze clickstream data
Data Extraction
Often performed by COBOL routines (not recommended because of high program maintenance and no automatically generated metadata)
Sometimes source data is copied to the target database using the replication capabilities of a standard RDBMS (not recommended because of dirty data in the source systems)
Increasingly performed by specialized ETL software
Sample ETL Tools
Teradata Warehouse Builder from Teradata
DataStage from Ascential Software
SAS System from SAS Institute
PowerMart/PowerCenter from Informatica
Sagent Solution from Sagent Software
Hummingbird Genio Suite from Hummingbird Communications
Reasons for Dirty Data
Dummy values
Absence of data
Multipurpose fields
Cryptic data
Contradicting data
Inappropriate use of address lines
Violation of business rules
Reused primary keys
Non-unique identifiers
Data integration problems
Data Cleansing
Source systems contain dirty data that must be cleansed
ETL software contains rudimentary data cleansing capabilities
Specialized data cleansing software is often used; it is important for performing name and address correction and householding functions
Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic (i.d.Centric)
Steps in Data Cleansing
Parsing
Correcting
Standardizing
Matching
Consolidating
Parsing
Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
Examples include parsing the first, middle, and last name; street number and street name; and city and state.
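A minimal parsing sketch for names and street lines; the patterns are simplified assumptions, since real cleansing tools handle far more variation.

```python
import re

def parse_name(raw):
    """Split a full name into first, middle (optional) and last elements."""
    parts = raw.split()
    first, last = parts[0], parts[-1]
    middle = " ".join(parts[1:-1]) or None
    return {"first": first, "middle": middle, "last": last}

def parse_street(raw):
    """Isolate the street number from the street name, when present."""
    m = re.match(r"^(\d+)\s+(.+)$", raw)
    return {"number": m.group(1), "street": m.group(2)} if m else {"street": raw}
```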
Correcting
Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.
Examples include replacing a vanity address and adding a zip code.
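The zip-code example can be sketched with a secondary data source; the toy city-to-zip lookup table below is invented for illustration.

```python
# Secondary data source: a (toy) lookup from city/state to zip code
ZIP_LOOKUP = {("Springfield", "IL"): "62701"}

def add_zip(record):
    """Fill in a missing zip code from the secondary lookup table."""
    fixed = dict(record)
    if not fixed.get("zip"):
        fixed["zip"] = ZIP_LOOKUP.get((fixed["city"], fixed["state"]))
    return fixed
```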
Standardizing
Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
Examples include adding a pre-name, replacing a nickname, and using a preferred street name.
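Standardization rules are often just small conversion tables; the nickname and street-suffix tables below are illustrative assumptions.

```python
NICK_TO_FORMAL = {"Bill": "William", "Bob": "Robert"}
STREET_SUFFIX = {"St.": "Street", "Ave.": "Avenue", "Rd.": "Road"}

def standardize(record):
    """Apply conversion routines: formal first names, preferred street names."""
    out = dict(record)
    out["first"] = NICK_TO_FORMAL.get(out["first"], out["first"])
    words = out["street"].split()
    out["street"] = " ".join(STREET_SUFFIX.get(w, w) for w in words)
    return out
```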
Matching
Searching and matching records within and across the parsed, corrected and standardized data, based on predefined business rules, to eliminate duplications.
Examples include identifying similar names and addresses.
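One simple way to sketch the matching step is string similarity over name and address, using the standard library's difflib; the similarity threshold is an assumption, and production matchers use much richer rules.

```python
import difflib

def match_score(a, b):
    """Similarity of two records' combined name + address strings."""
    key = lambda r: f"{r['name']} {r['address']}".lower()
    return difflib.SequenceMatcher(None, key(a), key(b)).ratio()

def find_matches(records, threshold=0.85):
    """Return index pairs of records similar enough to be duplicates."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if match_score(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```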
Consolidating
Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
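The merge into one representation can be sketched with a simple survivorship rule; "keep the longest non-empty value per field" is an assumption chosen for illustration, not the only possible rule.

```python
def consolidate(matched_records):
    """Merge a group of matched records into one surviving record."""
    merged = {}
    fields = {f for r in matched_records for f in r}
    for f in fields:
        values = [r.get(f) for r in matched_records if r.get(f)]
        # Survivorship rule: keep the longest non-empty value as the
        # "best" (most complete) representation of the field
        merged[f] = max(values, key=len) if values else None
    return merged
```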
Data Staging
Often used as an interim step between data extraction and later steps
Accumulates data from asynchronous sources using native interfaces, flat files, FTP sessions, or other processes
At a predefined cutoff time, data in the staging file is transformed and loaded to the warehouse
There is usually no end-user access to the staging file
An operational data store may be used for data staging
Data Transformation
Transforms the data in accordance with the business rules and standards that have been established
Examples include: format changes, deduplication, splitting up fields, replacement of codes, derived values, and aggregates
Data Loading
Data are physically moved to the data warehouse
The loading takes place within a load window
The trend is toward near-real-time updates of the data warehouse, as the warehouse is increasingly used for operational applications
Meta Data
Data about data
Needed by both information technology personnel and users
IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc.
Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information; etc.
Recent Development: Meta Data Integration
A growing realization that meta data is critical to data warehousing success
Progress is being made on getting vendors to agree on standards and to incorporate the sharing of meta data among their tools
Vendors like Microsoft, Computer Associates, and Oracle have entered the meta data marketplace with significant product offerings
That's lots of ETL. Let us come back to basics and revise.