My presentation for Surge 2011. You can see the video here: http://omniti.com/surge/2011/speakers/daniel-austin
Wrestling Large Data Volumes to the Ground
Daniel Austin, Yahoo! Exceptional Performance
March 24, 2011
Large-Scale Production Engineering Meetup
Agenda: A Boy and His Prototype
– Project X: Performance + Data
– Project X: The Prototype
– What We Learned
Project X: Hi Performance, Low Budget
– Need to collect and store a lot of performance data for BI purposes
• ~10+ TB working volume
• ~3+ TB inflow daily (tiny!)
– Cheap, fast, good (pick one)
• optimized for query speed
• closely followed by ETL speed
– Needs to be done yesterday!
– Next Step: Build a prototype
Business Intelligence in Three Acts
Data Collection → Data Storage → Data Analysis (out of scope)
Data Collection: Tools – Gomez Last Mile
• HTTP response time testing in the wild
• Instrumented Firefox browser
• On a real user’s machine
• Data collection via FTP (brute force!)
• Custom format provides max data for every object on every page in a sequence
Data Collection: Tools – Talend Open Studio
• Open Source ETL tool
– Based on Eclipse
– Similar to other ETL tools from IBM, Oracle, others
– Java or Perl generation
• But easier to use!
Data Storage: Tools – Infobright MySQL Engine
• Columnar DB for MySQL
• Open Source & Commercial versions
• High compression
• Knowledge grid
• Query Optimizations
– For analytics
– For performance
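As a rough sketch of how measurement data might land in Infobright (table and column names here are hypothetical; the Community Edition registered its columnar engine as BRIGHTHOUSE):

-- Hypothetical response-time table on the Infobright columnar engine
CREATE TABLE http_measurements (
  test_id     INT NOT NULL,
  object_url  VARCHAR(255) NOT NULL,
  measured_at DATETIME NOT NULL,
  response_ms INT NOT NULL
) ENGINE=BRIGHTHOUSE;

-- Analytic queries read only the columns they touch, and the
-- Knowledge Grid skips data packs that cannot match the WHERE clause
SELECT object_url, AVG(response_ms)
FROM http_measurements
WHERE measured_at >= '2011-01-01'
GROUP BY object_url;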
Intro to Data Products
“An organized, verifiable dataset with specific levels of quality and abstraction”
For each data product:
• Data Dictionary
• Data Model
• Pre- and post-validation schemas
• Data Lifecycle Plan
Level 0
– Raw data at measurement-level resolution
– Field-level Syntactic & Semantic Validation
Level 1
– 3NF 5D Data Model
– Concrete aggregates while retaining record-level resolution
Levels 2+
– Time & Space-based aggregates
– We ended up choosing not to do this!
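To make the levels concrete, here is a hedged sketch (all names hypothetical): Level 0 keeps each record exactly as delivered, while a Level 2+ time-based aggregate can be derived on demand from the lower levels rather than materialized up front:

-- Level 0: raw measurement lines, stored verbatim for re-processing
CREATE TABLE l0_raw (
  raw_line  VARCHAR(4096) NOT NULL,
  loaded_at DATETIME NOT NULL
);

-- Level 2+: a time-based aggregate derived from the Level 1 model;
-- we chose to run rollups like this downstream instead of storing them
SELECT DATE(measured_at) AS day, AVG(response_ms) AS avg_ms
FROM http_measurements
GROUP BY DATE(measured_at);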
How to design ETL chains?
– Idempotent
• If the process fails, restart the job (see the sketch after this list)
– Intermediate steps from state to state
– Syntactic vs. semantic validation
• Treat separately
– Don’t skimp on the supporting jobs!
• Retention
• Volumes and roll-off
• Logging and trackback
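A minimal sketch of the idempotency idea, assuming a per-file batch_id and row-store staging tables (names hypothetical; the columnar store itself is append-oriented, so the final Infobright load happens once per batch). Rerunning a failed batch first clears its partial results, so a restart is always safe:

-- Idempotent load step: safe to rerun for the same batch after a failure
-- (state-to-state: l0_parsed -> l1_staging; names are hypothetical)
DELETE FROM l1_staging WHERE batch_id = 20110324;

INSERT INTO l1_staging (batch_id, test_id, object_url, measured_at, response_ms)
SELECT batch_id, test_id, object_url, measured_at, response_ms
FROM l0_parsed
WHERE batch_id = 20110324;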
Diagram: ETL for Data Validation
1. Syntactic Validation Step
2. Semantic Validation Step
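As a sketch of the two steps (table and field names hypothetical): syntactic validation asks whether each field is well-formed for its type; semantic validation asks whether a well-formed record actually makes sense:

-- 1. Syntactic validation: reject rows whose fields are malformed
SELECT * FROM l0_fields
WHERE response_ms_txt NOT REGEXP '^[0-9]+$'   -- timing must be an integer
   OR object_url NOT LIKE 'http%';            -- URL must look like a URL

-- 2. Semantic validation: reject well-formed rows that cannot be true
SELECT * FROM l0_typed
WHERE response_ms > 600000       -- a ten-minute object load is implausible
   OR measured_at > NOW();       -- measurements cannot come from the future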
Simple 3NF Level 1 Data Model for HTTP
• NO xrefs
• 5D User Narrative Model
• High levels of normalization are costly up front…
• …but pay for themselves later when you are making queries!
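The deck names a 5D user narrative model but not the five dimensions, so this is only a shape sketch (all names assumed): small dimension tables, one narrow fact table, and no cross-reference tables:

-- Shape sketch of a normalized Level 1 model (dimension names assumed)
CREATE TABLE dim_page     (page_id INT NOT NULL, page_url VARCHAR(255) NOT NULL);
CREATE TABLE dim_location (loc_id  INT NOT NULL, country CHAR(2) NOT NULL);

CREATE TABLE fact_http (
  page_id     INT NOT NULL,       -- refers to dim_page
  loc_id      INT NOT NULL,       -- refers to dim_location
  measured_at DATETIME NOT NULL,
  response_ms INT NOT NULL
);

-- Normalization costs joins at load time, but queries stay simple:
SELECT p.page_url, AVG(f.response_ms)
FROM fact_http f JOIN dim_page p ON p.page_id = f.page_id
GROUP BY p.page_url;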
Level 1: The Boss Battle!
Some Best Practices
– URIs: Handle with care
• Encode text strings in lexical order
• Use sequential bitfields for searching
– Integer arithmetic only
– Combined fields for per-row consistency checks in every table
– Don’t trade ETL time for integrity risk
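One way to read the combined-field practice (a sketch, assuming a row_crc column added to the hypothetical fact table above): store a checksum over the business fields at load time, so any row that later fails the check was corrupted somewhere in the chain:

-- Per-row consistency field: checksum the business fields at load time
INSERT INTO fact_http (page_id, loc_id, measured_at, response_ms, row_crc)
SELECT page_id, loc_id, measured_at, response_ms,
       CRC32(CONCAT_WS('|', page_id, loc_id, measured_at, response_ms))
FROM l1_staging;

-- Any mismatch later means the row was damaged in flight
SELECT * FROM fact_http
WHERE row_crc <> CRC32(CONCAT_WS('|', page_id, loc_id, measured_at, response_ms));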
Project X: What We Learned
– Design your data products up front
– Higher-level data products are often better created downstream
– Open Source ETL can be made to scale well
• Requires a lot of upfront design
– High levels of normalization may be worth pursuing
Endgame! Analysis in Near Real-Time
Thank You!
Daniel Austin, Yahoo! Exceptional Performance
@daniel_b_austin
March 24, 2011
Large-Scale Production Engineering Meetup