My presentation for Surge 2011. You can see the video here: http://omniti.com/surge/2011/speakers/daniel-austin
Wrestling Large Data Volumes to the Ground
Daniel Austin, Yahoo! Exceptional Performance
March 24, 2011
Large-Scale Production Engineering Meetup
Agenda: A Boy and His Prototype
– Project X: Performance + Data
– Project X: The Prototype
– What We Learned
Project X: Hi Performance, Low Budget
– Need to collect and store a lot of performance data for BI purposes
• ~10+ TB working volume
• ~3+ TB inflow daily (tiny!)
– Cheap, fast, good (pick one)
• optimized for query speed
• closely followed by ETL speed
– Needs to be done yesterday!
– Next Step: Build a prototype
Business Intelligence in Three Acts
Data Collection → Data Storage → Data Analysis (out of scope)
Data Collection: Tools – Gomez Last Mile
• HTTP response time testing in the wild
• Instrumented Firefox browser
• On a real user’s machine
• Data collection via FTP (brute force!)
• Custom format provides max data for every object on every page in a sequence
Data Collection: Tools – Talend Open Studio
• Open Source ETL tool
– Based on Eclipse
– Similar to other ETL tools from IBM, Oracle, others
– Java or Perl generation
• But easier to use!
Data Storage: Tools – Infobright MySQL Engine
• Columnar DB for MySQL
• Open Source & Commercial versions
• High compression
• Knowledge grid
• Query Optimizations
– For analytics
– For performance
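As a rough sketch of how measurement data might land in Infobright (table and column names here are hypothetical; the Community Edition registered its columnar engine as BRIGHTHOUSE):

-- Hypothetical response-time table on the Infobright columnar engine
CREATE TABLE http_measurements (
  test_id     INT NOT NULL,
  object_url  VARCHAR(255) NOT NULL,
  measured_at DATETIME NOT NULL,
  response_ms INT NOT NULL
) ENGINE=BRIGHTHOUSE;

-- Analytic queries read only the columns they touch, and the
-- Knowledge Grid skips data packs that cannot match the WHERE clause
SELECT object_url, AVG(response_ms)
FROM http_measurements
WHERE measured_at >= '2011-01-01'
GROUP BY object_url;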
Intro to Data Products
“An organized, verifiable dataset with specific levels of quality and abstraction”
For each data product:
• Data Dictionary
• Data Model
• Pre- and post-validation schemas
• Data Lifecycle Plan
Level 0
– Raw data at measurement-level resolution
– Field-level Syntactic & Semantic Validation
Level 1
– 3NF 5D Data Model
– Concrete aggregates while retaining record-level resolution
Levels 2+
– Time & Space-based aggregates
– We ended up choosing not to do this!
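To make the levels concrete, here is a hedged sketch (all names hypothetical): Level 0 keeps each record exactly as delivered, while a Level 2+ time-based aggregate can be derived on demand from the lower levels rather than materialized up front:

-- Level 0: raw measurement lines, stored verbatim for re-processing
CREATE TABLE l0_raw (
  raw_line  VARCHAR(4096) NOT NULL,
  loaded_at DATETIME NOT NULL
);

-- Level 2+: a time-based aggregate derived from the Level 1 model;
-- we chose to run rollups like this downstream instead of storing them
SELECT DATE(measured_at) AS day, AVG(response_ms) AS avg_ms
FROM http_measurements
GROUP BY DATE(measured_at);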
How to design ETL chains?
– Idempotent
• If the process fails, restart the job (see the sketch after this list)
– Intermediate steps from state to state
– Syntactic vs. semantic validation
• Treat separately
– Don’t skimp on the supporting jobs!
• Retention
• Volumes and roll-off
• Logging and trackback
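A minimal sketch of the idempotency idea, assuming a per-file batch_id and row-store staging tables (names hypothetical; the columnar store itself is append-oriented, so the final Infobright load happens once per batch). Rerunning a failed batch first clears its partial results, so a restart is always safe:

-- Idempotent load step: safe to rerun for the same batch after a failure
-- (state-to-state: l0_parsed -> l1_staging; names are hypothetical)
DELETE FROM l1_staging WHERE batch_id = 20110324;

INSERT INTO l1_staging (batch_id, test_id, object_url, measured_at, response_ms)
SELECT batch_id, test_id, object_url, measured_at, response_ms
FROM l0_parsed
WHERE batch_id = 20110324;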
Diagram: ETL for Data Validation
1. Syntactic Validation Step
2. Semantic Validation Step
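As a sketch of the two steps (table and field names hypothetical): syntactic validation asks whether each field is well-formed for its type; semantic validation asks whether a well-formed record actually makes sense:

-- 1. Syntactic validation: reject rows whose fields are malformed
SELECT * FROM l0_fields
WHERE response_ms_txt NOT REGEXP '^[0-9]+$'   -- timing must be an integer
   OR object_url NOT LIKE 'http%';            -- URL must look like a URL

-- 2. Semantic validation: reject well-formed rows that cannot be true
SELECT * FROM l0_typed
WHERE response_ms > 600000       -- a ten-minute object load is implausible
   OR measured_at > NOW();       -- measurements cannot come from the future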
Simple 3NF Level 1 Data Model for HTTP
• NO xrefs
• 5D User Narrative Model
• High levels of normalization are costly up front…
• …but pay for themselves later when you are making queries!
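The deck names a 5D user narrative model but not the five dimensions, so this is only a shape sketch (all names assumed): small dimension tables, one narrow fact table, and no cross-reference tables:

-- Shape sketch of a normalized Level 1 model (dimension names assumed)
CREATE TABLE dim_page     (page_id INT NOT NULL, page_url VARCHAR(255) NOT NULL);
CREATE TABLE dim_location (loc_id  INT NOT NULL, country CHAR(2) NOT NULL);

CREATE TABLE fact_http (
  page_id     INT NOT NULL,       -- refers to dim_page
  loc_id      INT NOT NULL,       -- refers to dim_location
  measured_at DATETIME NOT NULL,
  response_ms INT NOT NULL
);

-- Normalization costs joins at load time, but queries stay simple:
SELECT p.page_url, AVG(f.response_ms)
FROM fact_http f JOIN dim_page p ON p.page_id = f.page_id
GROUP BY p.page_url;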
Level 1: The Boss Battle!
Some Best Practices
– URIs: Handle with care
• Encode text strings in lexical order
• Use sequential bitfields for searching
– Integer arithmetic only
– Combined fields for per-row consistency checks in every table
– Don’t trade ETL time for integrity risk
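One way to read the combined-field practice (a sketch, assuming a row_crc column added to the hypothetical fact table above): store a checksum over the business fields at load time, so any row that later fails the check was corrupted somewhere in the chain:

-- Per-row consistency field: checksum the business fields at load time
INSERT INTO fact_http (page_id, loc_id, measured_at, response_ms, row_crc)
SELECT page_id, loc_id, measured_at, response_ms,
       CRC32(CONCAT_WS('|', page_id, loc_id, measured_at, response_ms))
FROM l1_staging;

-- Any mismatch later means the row was damaged in flight
SELECT * FROM fact_http
WHERE row_crc <> CRC32(CONCAT_WS('|', page_id, loc_id, measured_at, response_ms));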
Project X: What We Learned
– Design your data products up front
– Higher-level data products are often better created downstream
– Open Source ETL can be made to scale well
• Requires a lot of upfront design
– High levels of normalization may be worth pursuing
Endgame! Analysis in Near Real-Time
Thank You!
Daniel Austin, Yahoo! Exceptional Performance
@daniel_b_austin
March 24, 2011
Large-Scale Production Engineering Meetup