ipc data analysis and extraction
TRANSCRIPT
Enterprise Intelligence
Enterprise Intelligence
Data Analysis, Extraction and
Validation
Technology Partners:
PresenterPete Zybrick – Enterprise Solutions Architect
Over 30 years of experience designing and delivering complex software solutions.
From Bell Labs to BMW to Big Data, Pete has architected, managed, tested and implemented large scale mission critical systems directly responsible for billions of dollars in annual transaction.
As the leader of the Big Data technical programs within IPC Global, Pete is responsible for building a framework of collaboration between IPC Global and our technology partners, Cloudera and AWS.
Cloudera Certified Apache Hadoop Developer Amazon Web Services Certified Developer - Associate
Data Analysis, Extraction and Validation
Objectives
• IPC Global Experience with Analysis and Extraction of Big Data
• Techniques And Tools To Enable Business Users To Rapidly Access Subsets Of Large Datasets
• Validation And Import
• Not Presentation Capabilities
Data Analysis, Extraction and Validation
• Federal Reserve Economic Database (FRED2)
• SiteCatalyst aka Adobe Analytics
Data Analysis, Extraction and Validation
Applications
• Overview - http://research.stlouisfed.org/fred2/
• 240K Discrete Series’ – README file
• Business Case: Consumer Price Index• FRED: http://research.stlouisfed.org/fred2/series/CPIAUCSL/
• QlikView: CPI demo - Single
Data Analysis, Extraction and Validation
FRED
• FRED data is discrete• Consumer Price Index for All Urban Consumers: All Items
• Consumer Price Index for All Urban Consumers: Apparel
• Consumer Price Index for All Urban Consumers: Energy
• Analytics: Groupings/Categories, Drill Downs• Consumer Price Index
• All Urban Consumers
• All Items
• Apparel
• Energy
Data Analysis, Extraction and Validation
FRED Raw Structure
• “Live the Data”
• Programs/Tools/Spreadsheets to Iteratively Analyze • Pattern Analysis, Parsing Rules – Split into Categories
• Iterative Distinct Values used to generate Reference Files – Separate Code from Data, Standardize Values (example: Countries)
• Parsing ExampleConsumer Price Index for All Urban Consumers: All items in Atlanta, GA
Consumer Price Index for All Urban Consumers: Energy in Atlanta, GA
Consumer Price Index for All Urban Consumers: All items less food and energy in Atlanta, GA
Data Analysis, Extraction and Validation
FRED Programmatic Analysis
• Implementation of Parsing Rules/Reference Files
• Definition of Common Table Structure (example: Create Table’s)
• Output Files in Format Suitable for Database Bulk Import• GZip’ed Tab Separated Values (HDFS for Impala, S3 for Redshift)
• Bulk Import into Impala and Redshift
Data Analysis, Extraction and Validation
FRED Programmatic Extraction
• Programmatically Generated• Distinct Categories Spreadsheet
• QlikView Load/Select Scripts
• Demo:• Find “Home Price Index (High Tier)” in Distinct Categories
• Find File in Generated Scripts based on Row Number
• Copy/Paste the Generated Load/Select into new Dashboard
• Manually modify Select for Middle Tier and Low Tier, Update Dashboard
Data Analysis, Extraction and Validation
FRED Productivity Enhancement Tools/Techniques
• ~12-15MM Main Rows/Day, up to 554 Columns/Row
• ~300MM Event Rows/Day (~24 per Main Row, 12 Before, 12 After)
• Data Falls Into Logical Groupings (i.e. general, video, mobile, etc.)
• Reference Table Lookups Of Varying Complexity
• 2-3% Error Rate
• Example of Inbound Data
Data Analysis, Extraction and Validation
SiteCatalyst – Overview
• Created Spreadsheets With Top 2000 Distinct Values For Each Column, Reviewed Every Column
• Identified Data Type and Criteria (i.e. range) For Each Column
• Defined/Developed Data Validation Framework
• Iterative Application of Data Validation
• Identified Reference Table Lookups, Implemented Simple and Complex
• Defined/Developed Test Data Generator
Data Analysis, Extraction and Validation
SiteCatalyst – Programmatic Analysis
• Specify the Rules For Inbound Columns
• Hadoop Map Program to Process Each Inbound Row• Hadoop Distributed Cache containing Target Columns->Tables, Validation
Rules and Reference Lookups
• Data Validation Performed Against All Inbound Columns
• Rejects Written To Error Table
• Reference Lookups
• Valid Data Written To Separate Tables Based On Category (main, video, mobile, etc.)
Data Analysis, Extraction and Validation
SiteCatalyst – Data Validation and Table Framework
• All Code/Processes Developed Internally by IPC Global
• Multiple Sources, Multiple Target Databases
• Iterative Analysis Utilizing Rapidly Coded Tools
• Iterative Analyze/Extract/Validate/Apply
• Performance Insight – Reference Tables, Caching
• Increased Access to Business Data
Data Analysis, Extraction and Validation
Summary
Enterprise Intelligence
Enterprise Intelligence
Technology Partners:
Data Analysis, Extraction and
Validation