Data Quality and Preprocessing Concepts ETL


TRANSCRIPT


    The Need For Data Quality

    Difficulty in decision making
    Time delays in operations
    Organizational mistrust
    Data ownership conflicts
    Customer attrition
    Costs associated with:
    error detection
    error rework
    customer service
    fixing customer problems

    Poor Data Quality Impacts The Bottom Line


    Data Quality Analysis

    Identify inconsistencies:
    Codification differences
    Multiple database entries for the same entity
    Out-of-range or missing values
    Frequency distribution
    Uniqueness of keys
    Referential integrity
    Compliance with other business rules

    Examine Source Systems For Inconsistencies (a minimal sketch of such checks follows below)
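
    As an illustration only, here is a minimal sketch of what these profiling checks can look like with pandas. The table names (customers, orders), column names, and thresholds are assumptions for the example, not part of the slides.

```python
import pandas as pd

# Illustrative source extracts; the column names are assumptions.
customers = pd.DataFrame({"cust_id": [1, 2, 2, 4], "age": [34, 151, 28, None]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "cust_id": [1, 2, 99]})

# Uniqueness of keys: duplicate customer identifiers.
dup_keys = customers[customers["cust_id"].duplicated(keep=False)]

# Referential integrity: orders pointing at non-existent customers.
orphans = orders[~orders["cust_id"].isin(customers["cust_id"])]

# Out-of-range or missing values (age range is an assumed business rule).
bad_age = customers[(customers["age"] < 0) | (customers["age"] > 120) | customers["age"].isna()]

print(dup_keys, orphans, bad_age, sep="\n")
```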


    Tools In The Market

    Business Rule Discovery Tools:
    Integrity Data Reengineering Tool from Vality Technology
    Trillium Software System from Harte-Hanks Data Technologies
    Migration Architect from DB Star
    WizRule, WizWhy from WizSoft Inc

    Data Reengineering & Cleansing Tools:
    Prism from Prism Solutions
    Carlton Pureview from Oracle
    ETI-Extract from Evolutionary Technologies
    PowerMart from Informatica Corp
    Sagent Data Mart from Sagent Technology


    Tools In The Market

    Name & Address Cleansing Tools:
    Centrus Suite from Sagent
    I.d.centric from First Logic

    Quality Defect Prevention Tools:
    Centrus Suite from Sagent
    I.d.centric from First Logic

    Data Quality Assessment Tools:
    Migration Architect, Evoke Axio from Evoke Software
    WizRule from WizSoft


    Why Data Preprocessing?

    Data in the real world is dirty:

    Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    e.g., occupation = "" (left blank)

    Noisy: containing errors or outliers
    e.g., Salary = -10

    Inconsistent: containing discrepancies in codes or names
    e.g., Age = 42 but Birthday = 03/07/1997
    e.g., was rating 1, 2, 3, now rating A, B, C
    e.g., discrepancies between duplicate records


    Forms of Data Preprocessing


    Measuring the Dispersion of Data

    Quartiles, outliers, and boxplots:
    Quartiles: Q1 (25th percentile), Q3 (75th percentile)
    Inter-quartile range: IQR = Q3 - Q1
    Boxplot: the ends of the box are the quartiles, the median is marked, and outliers are plotted individually

    Variance and standard deviation (sample: s, population: σ)

    Variance (algebraic, scalable computation):

    Sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$

    Population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$

    The standard deviation s (or σ) is the square root of the variance s^2 (or σ^2).
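
    As a quick numerical check of the two equivalent forms of the sample variance, here is a minimal Python sketch; the values reuse the price list from the binning example later in the deck.

```python
import math

x = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # price data used later for binning
n = len(x)
mean = sum(x) / n

# Definitional form: average squared deviation from the mean (n - 1 denominator).
s2_def = sum((xi - mean) ** 2 for xi in x) / (n - 1)

# Algebraic, scalable form: needs only sum(x) and sum(x^2), so it can be
# accumulated in a single pass over a large dataset.
s2_alg = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)

assert math.isclose(s2_def, s2_alg)
print(s2_def, math.sqrt(s2_def))  # variance and standard deviation
```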


    How to Handle Missing Data?

    Ignore the tuple: usually done when the class label is missing
    Fill in the missing value manually: tedious and often infeasible
    Fill it in automatically with:
    a global constant, e.g., "unknown" (effectively a new class?!)
    the attribute mean
    the attribute mean for all samples belonging to the same class (smarter)
    the most probable value: use an inference-based method such as a Bayesian formula or a decision tree
    (The mean-based strategies are sketched below.)
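
    A minimal sketch of the two mean-based strategies using pandas; the column names "income" and "class" and the sample values are assumptions for illustration.

```python
import pandas as pd

# Illustrative data with missing income values.
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 30.0, None, 40.0],
})

# Fill with the overall attribute mean.
overall = df["income"].fillna(df["income"].mean())

# Smarter: fill with the attribute mean of samples in the same class.
by_class = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(overall.tolist())   # [50.0, 40.0, 30.0, 40.0, 40.0]
print(by_class.tolist())  # [50.0, 50.0, 30.0, 35.0, 40.0]
```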


    Noisy Data

    Noise: random error or variance in a measured variable

    Incorrect attribute values may be due to:
    faulty data collection instruments
    data entry problems
    data transmission problems
    technology limitations
    inconsistency in naming conventions

    Other data problems that require data cleaning:
    duplicate records
    incomplete data
    inconsistent data


    How to Handle Noisy Data?

    Binning: first sort the data and partition it into (equal-frequency) bins, then smooth by bin means, bin medians, bin boundaries, etc.
    Regression: smooth by fitting the data to regression functions
    Clustering: detect and remove outliers
    Combined computer and human inspection: detect suspicious values and have a human check them (e.g., to deal with possible outliers)


    Binning Methods for Data Smoothing

    Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

    * Partition into equal-frequency (equi-depth) bins:
    - Bin 1: 4, 8, 9, 15
    - Bin 2: 21, 21, 24, 25
    - Bin 3: 26, 28, 29, 34
    * Smoothing by bin means:
    - Bin 1: 9, 9, 9, 9
    - Bin 2: 23, 23, 23, 23
    - Bin 3: 29, 29, 29, 29
    * Smoothing by bin boundaries:
    - Bin 1: 4, 4, 4, 15
    - Bin 2: 21, 21, 25, 25
    - Bin 3: 26, 26, 26, 34
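
    A minimal sketch of this smoothing procedure in Python, using the same price data; the choice of three bins matches the slide, everything else is plain standard-library code.

```python
# Equal-frequency binning followed by smoothing by means and by boundaries.
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value is replaced by the closer boundary.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(bins)       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```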


    Regression

    [Figure: scatter plot of data points with a fitted regression line y = x + 1, used to smooth a point (X1, Y1) to its value on the line.]


    Cluster Analysis


    Data Cleaning as a Process

    Data discrepancy detection: discrepancies arise from users not wanting to give details, outdated addresses, poorly designed forms, too many options for questions, etc.
    Use any available knowledge, such as metadata (e.g., domain, range, dependency, distribution), or write your own scripts
    Check for field overloading and inconsistent formats (e.g., 2004/12/25 vs. 25/12/2004)
    Check the uniqueness rule, consecutive rule, and null rule (how is a missing value recorded: zero, a person refusing to provide it, blanks?); a simple format and null check is sketched below
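
    As a rough sketch of such a script-based check, the fragment below flags disguised missing values and conflicting date formats. The sample values, the set of null markers, and the "canonical" format are all assumptions for illustration.

```python
import re

# Illustrative "birth_date" values as they might arrive from a source system.
values = ["2004/12/25", "25/12/2004", "", "0", "N/A", "1999/01/07"]

iso_like = re.compile(r"^\d{4}/\d{2}/\d{2}$")   # assumed canonical format
day_first = re.compile(r"^\d{2}/\d{2}/\d{4}$")  # the conflicting format

for v in values:
    if v in ("", "0", "N/A"):
        print(f"{v!r:12} null-rule violation: disguised missing value")
    elif day_first.match(v):
        print(f"{v!r:12} format discrepancy: day-first date")
    elif not iso_like.match(v):
        print(f"{v!r:12} unrecognized format")
```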


    Data cleaning contd.

    Use commercial tools:
    Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-checking) to detect errors and make corrections (uses parsing and fuzzy-matching techniques)
    Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., use correlation and clustering to find outliers)
    ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface

    Integration of the two processes: iterative and interactive (e.g., Potter's Wheel, control.cs.berkeley.edu/abc)
    Work in progress: declarative languages using SQL for data cleaning


    Data Integration

    Data integration: combines data from multiple sources into a coherent store

    Schema integration: e.g., A.cust-id ≡ B.cust-#
    Integrate metadata from different sources

    Entity identification problem: identify real-world entities across multiple data sources, e.g., Bill Clinton = William Clinton (see the matching sketch below)

    Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources differ
    Possible reasons: different representations, different scales, e.g., metric vs. British units
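
    One common way to approach entity identification is normalization plus fuzzy string matching. The sketch below is illustrative only: the nickname table and the 0.85 similarity threshold are assumptions, not values from the slides.

```python
from difflib import SequenceMatcher

# Illustrative nickname table used to normalize names before comparing.
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}

def normalize(name: str) -> str:
    parts = name.lower().split()
    return " ".join(NICKNAMES.get(p, p) for p in parts)

def same_entity(a: str, b: str, threshold: float = 0.85) -> bool:
    # Fuzzy comparison of the normalized names.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(same_entity("Bill Clinton", "William Clinton"))  # True
print(same_entity("Bill Clinton", "George Bush"))      # False
```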


    ETL


    Ralph Speaks

    Technical Design Challenges Posed By The Data Warehouse Evolution:
    Timeliness
    Data Volumes
    Response Times


    The Big Picture!

    [Diagram: legacy systems, B2C, B2B, and CRM sources feeding BI.]

    Which Approach Do We Take?


    Data Extraction and Preparation

    Stage I: Extract
    Stage II: Analyze, Clean and Transform
    Stage III: Data Movement and Load
    Periodic Refresh/Update


    The ETL Process

    Access data dictionaries defining the source files
    Build logical and physical data models for the target data
    Survey existing systems to identify sources of data
    Specify business and technical rules for data extraction, conversion, and transformation
    Perform data extraction and transformation
    Load the target databases
    (An end-to-end sketch of the last two steps follows below.)
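
    A minimal end-to-end ETL sketch in Python. The sample CSV, the column names, and the SQLite target table are assumptions made for illustration; real pipelines would use the tools discussed later in the deck.

```python
import csv, io, sqlite3

# Illustrative source data (in-memory CSV standing in for a source file).
raw_csv = "cust_id,name,amount\n1, Alice ,100\n2,Bob,\n1, Alice ,100\n"

# Extract: read rows from the source.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: trim names, default missing amounts, drop duplicate records.
seen, clean = set(), []
for r in rows:
    key = (r["cust_id"], r["name"].strip())
    if key in seen:
        continue
    seen.add(key)
    clean.append((int(r["cust_id"]), r["name"].strip(), float(r["amount"] or 0)))

# Load: write the transformed rows into the target (warehouse) table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (cust_id INTEGER, name TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)
print(con.execute("SELECT * FROM sales").fetchall())
```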


    The ETL Process

    Data Extraction - Simplified

    [Diagram: multiple OLTP systems feed a staging area and then the data warehouse. Stage I: Extract, Stage II: Transform, Stage III: Load.]


    ETL Tools - Classification

    First generation: code-generation products; generate the source code
    Second generation: engine-driven products; generate directly executable code


    ETL Tools - Classification

    Due to their more efficient architecture, second-generation tools have a significant advantage over first-generation tools.


    ETL Tools - First-Generation

    Strengths:
    Tools are mature
    Programmers are familiar with code generation in COBOL or C

    Limitations:
    High cost of products
    Complex training
    Extract programs have to be compiled from source
    Many transformations have to be coded manually
    Lack of parallel-execution support
    Most metadata has to be manually generated

    Characterized by the Generation and Deployment of Multiple Codes


    ETL Tools - Second-Generation

    Extraction/transformation/load runs on a server
    Data is extracted directly from the source and processed on the server
    Data is transformed in memory and written directly to the warehouse database; high throughput since intermediate files are not used
    Directly executable code
    Support for monitoring, scheduling, extraction, scrubbing, transformation, load, indexing, aggregation, and metadata

    Characterized by the Transformation Engine


    ETL Tools - Second-Generation

    PowerCentre/Mart from Informatica
    Data Mart Solution from Sagent Technology
    DataStage from Ascential


    ETL Tools - Selection

    Support to retrieve, cleanse, transform, summarize, aggregate, and load data
    Engine-driven products for fast, parallel operation
    Generate and manage a central metadata repository
    Open metadata exchange architecture
    Provide end users with access to metadata in business terms
    Support development of logical and physical data models


    Data Loading - First Time Loads

    The first load is a complex exercise
    Data is extracted from tapes, files, archives, etc.
    The first-time load might take several days to complete
    Extract, clean, transform, etc.


    Source: www.survey.com

    ETL Trends

    The DWH market is growing at 40-45% p.a.
    Metadata management is shaping the market
    Real-time CRM requires real-time DWH
    E-commerce and e-business are fuelling DWH and BI
    ERP data warehousing is in demand

    Major Trends


    Source: Cutter Report - May 2000

    ETL Trends

    ETL technology built into other BI products
    XML-enabled, platform-independent data traffic
    Near-real-time data warehouses using middleware
    Vendors have evolved their products into data mart/analytical platforms



    The Data Mart Strategy

    The most common approach
    Begins with a single mart; architected marts are added over time for more subject areas
    Relatively inexpensive and easy to implement
    Can be used as a proof of concept for data warehousing
    Can perpetuate the silos-of-information problem
    Can postpone difficult decisions and activities
    Requires an overall integration plan


    Data Sources and Types

    Primarily from legacy, operational systems
    Almost exclusively numerical data at the present time
    External data may be included, often purchased from third-party sources
    Technology exists for storing unstructured data, and this is expected to become more important over time


    Extraction, Transformation, and Loading (ETL) Processes

    The plumbing work of data warehousing
    Data are moved from source to target databases
    A very costly, time-consuming part of data warehousing



    Recent Development: More Frequent Updates

    Updates can be done in bulk and trickle modes
    Business requirements, such as trading-partner access to a Web site, require current data
    For international firms, there is no good time to load the warehouse



    Recent Development: Clickstream Data

    Results from clicks at web sites
    A dialog manager handles user interactions; an ODS (operational data store in the data staging area) helps to custom-tailor the dialog
    The clickstream data is filtered, parsed, and sent to a data warehouse where it is analyzed
    Software is available to analyze clickstream data


    Data Extraction

    Often performed by COBOL routines (not recommended because of high program maintenance and no automatically generated metadata)
    Sometimes source data is copied to the target database using the replication capabilities of a standard RDBMS (not recommended because of dirty data in the source systems)
    Increasingly performed by specialized ETL software


    Sample ETL Tools

    Teradata Warehouse Builder from Teradata
    DataStage from Ascential Software
    SAS System from SAS Institute
    Power Mart/Power Center from Informatica
    Sagent Solution from Sagent Software
    Hummingbird Genio Suite from Hummingbird Communications



    Reasons for Dirty Data

    Dummy values
    Absence of data
    Multipurpose fields
    Cryptic data
    Contradicting data
    Inappropriate use of address lines
    Violation of business rules
    Reused primary keys
    Non-unique identifiers
    Data integration problems


    Data Cleansing

    Source systems contain dirty data that must be cleansed
    ETL software contains rudimentary data cleansing capabilities
    Specialized data cleansing software is often used; it is important for performing name and address correction and householding functions
    Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic (i.d.Centric)


    Steps in Data Cleansing

    Parsing

    Correcting

    Standardizing

    Matching

    Consolidating


    Parsing

    Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
    Examples include parsing the first, middle, and last name; street number and street name; and city and state (a minimal sketch follows below).
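
    A minimal parsing sketch using regular expressions; the record layout, field order, and patterns are illustrative assumptions, not a description of any particular tool.

```python
import re

# Illustrative source record to be parsed into individual data elements.
record = "Dr. John Q. Public, 123 Main Street, Springfield, IL 62704"

name_part, street_part, city_part, state_zip = [p.strip() for p in record.split(",")]

# Isolate name elements: optional prefix, first, optional middle initial, last.
prefix, first, middle, last = re.match(
    r"(?:(\w+\.)\s+)?(\w+)\s+(?:(\w\.)\s+)?(\w+)", name_part
).groups()

# Isolate street number and street name.
street_no, street_name = re.match(r"(\d+)\s+(.+)", street_part).groups()

# Isolate state and ZIP code.
state, zip_code = state_zip.split()

print(prefix, first, middle, last)  # Dr. John Q. Public
print(street_no, street_name)       # 123 Main Street
print(city_part, state, zip_code)   # Springfield IL 62704
```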


    Correcting

    Correcting fixes parsed individual data components using sophisticated data algorithms and secondary data sources.
    Examples include replacing a vanity address and adding a ZIP code.


    Standardizing

    Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
    Examples include adding a prename, replacing a nickname, and using a preferred street name (a minimal sketch follows below).
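
    A minimal standardization sketch; the lookup tables below are illustrative stand-ins for real standard and custom business rules.

```python
# Illustrative conversion tables for nicknames and street suffixes.
NICKNAMES = {"Bob": "Robert", "Liz": "Elizabeth", "Bill": "William"}
STREET_SUFFIXES = {"St": "Street", "St.": "Street", "Ave": "Avenue", "Rd": "Road"}

def standardize(record: dict) -> dict:
    # Replace a nickname with the preferred first name.
    first = NICKNAMES.get(record["first_name"], record["first_name"])
    # Replace the street suffix with its preferred form.
    words = record["street"].split()
    words[-1] = STREET_SUFFIXES.get(words[-1], words[-1])
    return {**record, "first_name": first, "street": " ".join(words)}

print(standardize({"first_name": "Bob", "street": "123 Main St."}))
# {'first_name': 'Robert', 'street': '123 Main Street'}
```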


    Matching

    Searching for and matching records within and across the parsed, corrected, and standardized data, based on predefined business rules, to eliminate duplications.
    Examples include identifying similar names and addresses.


    Consolidating

    Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation (a minimal sketch follows below).
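
    A minimal consolidation sketch: matched records for the same entity are merged into one surviving representation. The survivorship rule used here (prefer the most recently updated non-empty value) and the sample records are assumptions for illustration.

```python
# Records already identified as the same entity by the matching step.
matched = [
    {"name": "William Clinton", "phone": "", "city": "Little Rock", "updated": 2001},
    {"name": "Bill Clinton", "phone": "555-0100", "city": "", "updated": 2003},
]

def consolidate(records: list[dict]) -> dict:
    merged = {}
    for rec in sorted(records, key=lambda r: r["updated"]):
        for field, value in rec.items():
            if value not in ("", None):
                merged[field] = value  # later non-empty values win
    return merged

print(consolidate(matched))
# {'name': 'Bill Clinton', 'city': 'Little Rock', 'updated': 2003, 'phone': '555-0100'}
```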


    Data Staging

    Often used as an interim step between data extraction and later steps
    Accumulates data from asynchronous sources using native interfaces, flat files, FTP sessions, or other processes
    At a predefined cutoff time, data in the staging file is transformed and loaded to the warehouse
    There is usually no end-user access to the staging file
    An operational data store may be used for data staging


    Data Transformation

    Transforms the data in accordance with the business rules and standards that have been established
    Examples include format changes, deduplication, splitting up fields, replacement of codes, derived values, and aggregates (see the sketch below)
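
    A minimal pandas sketch covering each of those transformation types in turn; the column names, code table, and sample rows are assumptions for illustration.

```python
import pandas as pd

# Illustrative staged data prior to transformation.
df = pd.DataFrame({
    "full_name":  ["Ann Lee", "Ann Lee", "Raj Patel"],
    "gender":     ["F", "F", "M"],
    "qty":        [2, 2, 5],
    "unit_price": [10.0, 10.0, 4.0],
})

df = df.drop_duplicates()                                          # deduplication
df[["first", "last"]] = df["full_name"].str.split(expand=True)     # splitting up a field
df["gender"] = df["gender"].replace({"F": "Female", "M": "Male"})  # replacement of codes
df["amount"] = df["qty"] * df["unit_price"]                        # derived value
total_by_gender = df.groupby("gender")["amount"].sum()             # aggregate

print(df)
print(total_by_gender)
```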


    Data Loading

    Data are physically moved to the data warehouse
    The loading takes place within a load window
    The trend is toward near-real-time updates of the data warehouse as the warehouse is increasingly used for operational applications


    Meta Data

    Data about data
    Needed by both information technology personnel and users
    IT personnel need to know data sources and targets; database, table, and column names; refresh schedules; data usage measures; etc.
    Users need to know entity/attribute definitions; available reports and query tools; report distribution information; help desk contact information; etc.



    Recent Development: Meta Data Integration

    A growing realization that meta data is critical to data warehousing success
    Progress is being made on getting vendors to agree on standards and to incorporate the sharing of meta data among their tools
    Vendors like Microsoft, Computer Associates, and Oracle have entered the meta data marketplace with significant product offerings


    That's a lot of ETL. Coming back to basics: let us revise.