Wrestling Large Data Volumes to the Ground

Daniel Austin, Yahoo! Exceptional Performance. March 24, 2011. Large-Scale Production Engineering Meetup.

Upload: daniel-austin

Post on 08-May-2015

Category:

Technology


DESCRIPTION

My presentation for Surge 2011. You can see the video here: http://omniti.com/surge/2011/speakers/daniel-austin

TRANSCRIPT

Page 1: Wrestling Large Data Volumes to the Ground

Wrestling Large Data Volumes to the Ground

Daniel Austin

Yahoo! Exceptional Performance

March 24, 2011

Large-Scale Production Engineering Meetup

Page 2: Wrestling Large Data Volumes to the Ground

Agenda: A Boy and His Prototype

– Project X: Performance + Data

– Project X: The Prototype

– What We Learned

Page 3: Wrestling Large Data Volumes to the Ground

Project X: Hi Performance, Low Budget

– Need to collect and store a lot of performance data for BI purposes

• ~ 10+ TB working volume

• ~ 3+ TB inflow daily (tiny!)

– Cheap, fast, good (pick one)

• optimized for query speed

• closely followed by ETL speed

– Needs to be done yesterday!

– Next Step: Build a prototype
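
To put the volumes on this slide in perspective, here is a back-of-envelope sketch. It assumes decimal terabytes and inflow spread evenly over the day; neither assumption comes from the talk, and the "days held" figure ignores compression entirely.

```python
# Back-of-envelope sizing from the figures on this slide.
# Assumptions (not from the talk): 1 TB = 10**12 bytes, and inflow
# arrives evenly across 24 hours.

TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_bytes_per_sec(daily_inflow_bytes: int) -> int:
    """Average ingest rate needed to keep up (integer division)."""
    return daily_inflow_bytes // SECONDS_PER_DAY


def days_of_inflow_held(working_bytes: int, daily_inflow_bytes: int) -> float:
    """Days of raw, uncompressed inflow that fit in the working volume."""
    return working_bytes / daily_inflow_bytes


rate = sustained_rate_bytes_per_sec(3 * TB)
days = days_of_inflow_held(10 * TB, 3 * TB)
print(f"sustained ingest: ~{rate / 10**6:.1f} MB/s")
print(f"raw days held:    ~{days:.1f}")
```

Under these assumptions the ETL chain has to sustain roughly 34.7 MB/s around the clock, and a 10 TB working volume holds only about 3.3 days of raw inflow, which is one reason compression and roll-off matter later in the deck.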

Page 4: Wrestling Large Data Volumes to the Ground

Business Intelligence in Three Acts

1. Data Collection

2. Data Storage

3. Data Analysis (out of scope)

Page 5: Wrestling Large Data Volumes to the Ground

Data Collection: Tools – Gomez Last Mile

• HTTP response time testing in the wild

• Instrumented Firefox browser

• On a real user’s machine

• Data collection via FTP (brute force!)

• Custom format provides max data for every object on every page in a sequence

Page 6: Wrestling Large Data Volumes to the Ground

Data Collection: Tools – Talend Open Studio

• Open Source ETL tool

– Based on Eclipse

– Similar to other ETL tools from IBM, Oracle, others

– Java or Perl generation

• But easier to use!

Page 7: Wrestling Large Data Volumes to the Ground

Data Storage: Tools – Infobright MySQL Engine

• Columnar DB for MySQL

• Open Source & Commercial versions

• High compression

• Knowledge grid

• Query Optimizations

– For analytics

– For performance
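
The idea behind a columnar engine with a knowledge grid can be sketched in a few lines: keep summary statistics for each chunk of a column and use them to avoid reading chunks that cannot affect the answer. The code below is a toy illustration of min/max pruning only. The names Pack, build_packs and count_greater_than are made up for this sketch, and it is not Infobright's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Pack:
    """A fixed-size chunk of one column plus its summary statistics."""
    values: List[int]
    lo: int
    hi: int


def build_packs(column: List[int], pack_size: int) -> List[Pack]:
    """Split a column into packs and record min/max for each."""
    packs = []
    for i in range(0, len(column), pack_size):
        chunk = column[i:i + pack_size]
        packs.append(Pack(chunk, min(chunk), max(chunk)))
    return packs


def count_greater_than(packs: List[Pack], threshold: int) -> Tuple[int, int]:
    """Count values > threshold. Returns (count, packs_actually_scanned)."""
    count = 0
    scanned = 0
    for p in packs:
        if p.hi <= threshold:
            continue                      # no value in this pack can match
        if p.lo > threshold:
            count += len(p.values)        # every value matches: use metadata only
            continue
        scanned += 1                      # mixed pack: must read the values
        count += sum(1 for v in p.values if v > threshold)
    return count, scanned


# Hypothetical column of response times in whole milliseconds.
response_ms = list(range(1000))
packs = build_packs(response_ms, pack_size=100)
print(count_greater_than(packs, 450))
```

In this example only one of the ten packs has to be read to answer the query; the rest are resolved from their min/max values alone. The benefit depends heavily on how well the data is clustered within packs.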

Page 8: Wrestling Large Data Volumes to the Ground

Intro to Data Products

For each data product:

• Data Dictionary

• Data Model

• Pre- and post-validation schemas

• Data Lifecycle Plan

Level 0

– Raw data at measurement-level resolution

– Field-level Syntactic & Semantic Validation

Level 1

– 3NF 5D Data Model

– concrete aggregates while retaining record-level resolution

Levels 2+

– Time & Space-based aggregates

– We ended up choosing not to do this!

“An organized, verifiable dataset with specific levels of quality and abstraction”

Page 9: Wrestling Large Data Volumes to the Ground

How to design ETL chains?

– Idempotent

• If the process fails, restart the job

– Intermediate steps from state to state

– Syntactic vs. semantic validation

• Treat separately

– Don’t skimp on the supporting jobs!

• Retention

• Volumes and roll-off

• Logging and trackback
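
One way to get the "if the process fails, restart the job" property is to make each load replace its own batch inside a single transaction, so running it twice leaves the same rows as running it once. The sketch below uses Python's built-in sqlite3 module as a stand-in database; the measurements table, its columns and the batch_id convention are assumptions for illustration, not the schema from the talk.

```python
import sqlite3
from typing import Iterable, Tuple


def load_batch(conn: sqlite3.Connection, batch_id: str,
               rows: Iterable[Tuple[int, int]]) -> None:
    """Load one batch so that a re-run leaves the table in the same state."""
    with conn:  # one transaction: commit on success, roll back on exception
        # Remove anything a previous (possibly partial) run left behind.
        conn.execute("DELETE FROM measurements WHERE batch_id = ?", (batch_id,))
        conn.executemany(
            "INSERT INTO measurements (batch_id, url_id, response_ms) "
            "VALUES (?, ?, ?)",
            [(batch_id, url_id, ms) for url_id, ms in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (batch_id TEXT, url_id INTEGER, response_ms INTEGER)"
)

batch = [(1, 120), (2, 340)]
load_batch(conn, "2011-03-24T00", batch)
load_batch(conn, "2011-03-24T00", batch)  # simulated restart of the same job

row_count = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
print(row_count)
```

The second call does not duplicate rows, so an operator can restart a failed job without first working out how far it got.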

Page 10: Wrestling Large Data Volumes to the Ground

Diagram: ETL for Data Validation

1. Syntactic Validation Step

2. Semantic Validation Step
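
The diagram itself did not survive in the transcript, but the two steps can be sketched as separate functions: the first asks only whether a record parses, the second asks whether the parsed values make sense. The tab-separated layout, the field names and the thresholds below are hypothetical and are not the Gomez format or the rules used in the project.

```python
from typing import Any, Dict, List, Optional

# Hypothetical record layout: four tab-separated fields.
FIELDS = ["timestamp_ms", "url", "status", "response_ms"]


def _is_uint(text: str) -> bool:
    """True if text is a non-empty run of ASCII digits."""
    return text.isascii() and text.isdigit()


def syntactic_check(line: str) -> Optional[Dict[str, Any]]:
    """Step 1: does the record parse? Returns typed fields, or None."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(FIELDS):
        return None
    ts, url, status, ms = parts
    if not (_is_uint(ts) and _is_uint(status) and _is_uint(ms)):
        return None
    return {"timestamp_ms": int(ts), "url": url,
            "status": int(status), "response_ms": int(ms)}


def semantic_check(rec: Dict[str, Any]) -> List[str]:
    """Step 2: do the parsed values make sense? Returns a list of problems."""
    problems = []
    if not 100 <= rec["status"] <= 599:
        problems.append("status outside HTTP range")
    if not rec["url"].startswith(("http://", "https://")):
        problems.append("url is not http(s)")
    if rec["response_ms"] > 10 * 60 * 1000:  # assumed sanity limit
        problems.append("response time over 10 minutes")
    return problems


good = syntactic_check("1300924800000\thttp://example.com/\t200\t153")
print(good, semantic_check(good))
```

Keeping the steps apart means a malformed line is rejected before any business rule runs, and a change to the semantic rules never requires re-parsing logic to be touched.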

Page 11: Wrestling Large Data Volumes to the Ground

Simple 3NF Level 1 Data Model for HTTP

• NO xrefs

• 5D User Narrative Model

• High levels of normalization are costly up front…

• …but pay for themselves later when you are making queries!
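
The data model diagram is not part of the transcript, so the following is only a guess at the flavor of a normalized layout, not the model from the talk. It uses sqlite3 as a stand-in, and the host, url and measurement tables are invented for illustration. Each fact is stored once, and queries reassemble it with joins on integer keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE host (
    host_id INTEGER PRIMARY KEY,
    name    TEXT UNIQUE NOT NULL
);
CREATE TABLE url (
    url_id  INTEGER PRIMARY KEY,
    host_id INTEGER NOT NULL REFERENCES host(host_id),
    path    TEXT NOT NULL,
    UNIQUE (host_id, path)
);
CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    url_id         INTEGER NOT NULL REFERENCES url(url_id),
    response_ms    INTEGER NOT NULL
);
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO host VALUES (1, 'example.com')")
conn.executemany("INSERT INTO url VALUES (?, 1, ?)", [(1, "/"), (2, "/logo.png")])
conn.executemany(
    "INSERT INTO measurement (url_id, response_ms) VALUES (?, ?)",
    [(1, 120), (1, 180), (2, 40)],
)


def avg_response_by_host(conn: sqlite3.Connection):
    """Average response time per host, in whole ms (integer division)."""
    return conn.execute("""
        SELECT h.name, SUM(m.response_ms) / COUNT(*)
        FROM measurement m
        JOIN url u  ON u.url_id = m.url_id
        JOIN host h ON h.host_id = u.host_id
        GROUP BY h.name
    """).fetchall()


print(avg_response_by_host(conn))
```

The up-front cost is in the ETL, which has to look up or create the host and url keys for every measurement. The payoff is that the large measurement table holds only integers.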

Page 12: Wrestling Large Data Volumes to the Ground

Level 1: The Boss Battle!

Page 13: Wrestling Large Data Volumes to the Ground

Some Best Practices

– URIs: Handle with care

• Encode text strings in lexical order

• Use sequential bitfields for searching

– Integer arithmetic only

– Combined fields for per-row consistency checks in every table

– Don’t trade ETL time for integrity risk
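
The slide does not spell these practices out, so the sketch below is one possible reading of three of them: integer codes for strings that preserve sort order, whole-number units instead of floats, and a check value computed over a row's combined fields. The function names and the choice of CRC32 are assumptions, not the project's actual method.

```python
import zlib
from typing import Dict, List, Tuple


def lexical_codes(strings: List[str]) -> Dict[str, int]:
    """Assign integer codes so that code order matches string sort order."""
    return {s: code for code, s in enumerate(sorted(set(strings)))}


def row_check(fields: Tuple[int, ...]) -> int:
    """CRC32 over the row's integer fields, to be stored alongside the row."""
    payload = ",".join(str(f) for f in fields).encode("ascii")
    return zlib.crc32(payload)


# Hypothetical host names seen in one batch.
hosts = ["img.example.com", "example.com", "cdn.example.com"]
codes = lexical_codes(hosts)

# Host code, HTTP status, response time in whole milliseconds (no floats).
row = (codes["example.com"], 200, 153)
stored = row + (row_check(row),)

# Later, any job can recompute the check and compare it with the stored value.
print(row_check(stored[:3]) == stored[3])
```

Because the codes sort the same way as the strings, range and ORDER BY queries can run on integers alone. The catch is that a batch-built dictionary like this one has to be rebuilt, or given gaps in advance, when new strings arrive.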

Page 14: Wrestling Large Data Volumes to the Ground

Project X: What We Learned

– Design your data products up front

– Higher-level data products are often better created downstream

– Open Source ETL can be made to scale well

• Requires a lot of upfront design

– High levels of normalization may be worth pursuing

Page 15: Wrestling Large Data Volumes to the Ground

Endgame! Analysis in Near Real-Time

Page 16: Wrestling Large Data Volumes to the Ground

Thank You!

Daniel Austin

Yahoo! Exceptional Performance

@daniel_b_austin

[email protected]

March 24, 2011

Large-Scale Production Engineering Meetup