ipc data analysis and extraction

14
Enterprise Intelligence Enterprise Intelligence Data Analysis, Extraction and Validation Technology Partners:

Upload: pzybrick

Post on 07-Aug-2015

33 views

Category:

Documents


1 download

TRANSCRIPT

Enterprise Intelligence

Enterprise Intelligence

Data Analysis, Extraction and

Validation

Technology Partners:

PresenterPete Zybrick – Enterprise Solutions Architect

Over 30 years of experience designing and delivering complex software solutions.

From Bell Labs to BMW to Big Data, Pete has architected, managed, tested and implemented large scale mission critical systems directly responsible for billions of dollars in annual transaction.

As the leader of the Big Data technical programs within IPC Global, Pete is responsible for building a framework of collaboration between IPC Global and our technology partners, Cloudera and AWS.

Cloudera Certified Apache Hadoop Developer Amazon Web Services Certified Developer - Associate

Data Analysis, Extraction and Validation

Objectives

• IPC Global Experience with Analysis and Extraction of Big Data

• Techniques And Tools To Enable Business Users To Rapidly Access Subsets Of Large Datasets

• Validation And Import

• Not Presentation Capabilities

Data Analysis, Extraction and Validation

• Federal Reserve Economic Database (FRED2)

• SiteCatalyst aka Adobe Analytics

Data Analysis, Extraction and Validation

Applications

• Overview - http://research.stlouisfed.org/fred2/

• 240K Discrete Series’ – README file

• Business Case: Consumer Price Index• FRED: http://research.stlouisfed.org/fred2/series/CPIAUCSL/

• QlikView: CPI demo - Single

Data Analysis, Extraction and Validation

FRED

• FRED data is discrete• Consumer Price Index for All Urban Consumers: All Items

• Consumer Price Index for All Urban Consumers: Apparel

• Consumer Price Index for All Urban Consumers: Energy

• Analytics: Groupings/Categories, Drill Downs• Consumer Price Index

• All Urban Consumers

• All Items

• Apparel

• Energy

Data Analysis, Extraction and Validation

FRED Raw Structure

• “Live the Data”

• Programs/Tools/Spreadsheets to Iteratively Analyze • Pattern Analysis, Parsing Rules – Split into Categories

• Iterative Distinct Values used to generate Reference Files – Separate Code from Data, Standardize Values (example: Countries)

• Parsing ExampleConsumer Price Index for All Urban Consumers: All items in Atlanta, GA

Consumer Price Index for All Urban Consumers: Energy in Atlanta, GA

Consumer Price Index for All Urban Consumers: All items less food and energy in Atlanta, GA

Data Analysis, Extraction and Validation

FRED Programmatic Analysis

• Implementation of Parsing Rules/Reference Files

• Definition of Common Table Structure (example: Create Table’s)

• Output Files in Format Suitable for Database Bulk Import• GZip’ed Tab Separated Values (HDFS for Impala, S3 for Redshift)

• Bulk Import into Impala and Redshift

Data Analysis, Extraction and Validation

FRED Programmatic Extraction

• Programmatically Generated• Distinct Categories Spreadsheet

• QlikView Load/Select Scripts

• Demo:• Find “Home Price Index (High Tier)” in Distinct Categories

• Find File in Generated Scripts based on Row Number

• Copy/Paste the Generated Load/Select into new Dashboard

• Manually modify Select for Middle Tier and Low Tier, Update Dashboard

Data Analysis, Extraction and Validation

FRED Productivity Enhancement Tools/Techniques

• ~12-15MM Main Rows/Day, up to 554 Columns/Row

• ~300MM Event Rows/Day (~24 per Main Row, 12 Before, 12 After)

• Data Falls Into Logical Groupings (i.e. general, video, mobile, etc.)

• Reference Table Lookups Of Varying Complexity

• 2-3% Error Rate

• Example of Inbound Data

Data Analysis, Extraction and Validation

SiteCatalyst – Overview

• Created Spreadsheets With Top 2000 Distinct Values For Each Column, Reviewed Every Column

• Identified Data Type and Criteria (i.e. range) For Each Column

• Defined/Developed Data Validation Framework

• Iterative Application of Data Validation

• Identified Reference Table Lookups, Implemented Simple and Complex

• Defined/Developed Test Data Generator

Data Analysis, Extraction and Validation

SiteCatalyst – Programmatic Analysis

• Specify the Rules For Inbound Columns

• Hadoop Map Program to Process Each Inbound Row• Hadoop Distributed Cache containing Target Columns->Tables, Validation

Rules and Reference Lookups

• Data Validation Performed Against All Inbound Columns

• Rejects Written To Error Table

• Reference Lookups

• Valid Data Written To Separate Tables Based On Category (main, video, mobile, etc.)

Data Analysis, Extraction and Validation

SiteCatalyst – Data Validation and Table Framework

• All Code/Processes Developed Internally by IPC Global

• Multiple Sources, Multiple Target Databases

• Iterative Analysis Utilizing Rapidly Coded Tools

• Iterative Analyze/Extract/Validate/Apply

• Performance Insight – Reference Tables, Caching

• Increased Access to Business Data

Data Analysis, Extraction and Validation

Summary

Enterprise Intelligence

Enterprise Intelligence

Technology Partners:

Data Analysis, Extraction and

Validation