![Page 1: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/1.jpg)
A Bird’s-Eye View of Pig and Scalding
with hRaven
a tale by @gario and @joep
Hadoop Summit 2013
v1.2
![Page 2: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/2.jpg)
@Twitter#HadoopSummit2013 2
About the authors
• Apache HBase PMC member and Committer
• Software Engineer @ Twitter
• Core Storage Team - Hadoop/HBase

• Software Engineer @ Twitter
• Engineering Manager, Hadoop/HBase team @ Twitter
![Page 3: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/3.jpg)
Table of Contents
• Chapter 1: The Problem
• Chapter 2: Why hRaven?
• Chapter 3: How Does it Work?
  • 3a: Loading
  • 3b: Table structure / querying
• Chapter 4: Current Uses
• Appendix: Future Work
![Page 4: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/4.jpg)
Chapter 1: The Problem
Illustration by Sirxlem (CC BY-NC-ND 3.0)
![Page 5: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/5.jpg)
Chapter 1: Mismatched Abstractions
• Most users run Pig and Scalding scripts, not straight map reduce
• JobTracker UI shows jobs, not DAGs of jobs generated by Pig and Scalding
![Page 6: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/6.jpg)
Chapter 1: A Problem of Scale
![Page 7: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/7.jpg)
Chapter 1: Questions
• How many Pig versus Scalding jobs do we run?
• What cluster capacity do jobs in my pool take?
• How many jobs do we run each day?
• What % of jobs have > 30k tasks?
• Why do I need to hand-tune these (hundreds of) jobs? Can’t the cluster learn?
![Page 8: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/8.jpg)
Chapter 1: Questions
#Nevermore
![Page 9: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/9.jpg)
Chapter 2: Why hRaven?
Photo by DAVID ILIFF. License: CC-BY-SA 3.0
![Page 10: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/10.jpg)
Chapter 2: Why hRaven?
• Stores stats, configuration and timing for every map reduce job on every cluster
• Structured around the full DAG of jobs from a Pig or Scalding application
• Easily queryable for historical trending
• Allows for Pig reducer optimization based on historical run stats
• Keeps data online forever (12.6M jobs, 4.5B tasks + attempts)
![Page 11: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/11.jpg)
Chapter 2: Key Concepts
• cluster - each cluster has a unique name mapping to the Job Tracker
• user - map reduce jobs are run as a given user
• application - a Pig or Scalding script (or plain map reduce job)
• flow - the combined DAG of jobs executed from a single run of an application
• version - changes impacting the DAG are recorded as a new version of the same application
![Page 12: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/12.jpg)
Chapter 2: Application Flows
Edgar
![Page 13: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/13.jpg)
Chapter 2: Application Flows
Edgar
![Page 14: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/14.jpg)
Chapter 2: Flow Storage
• All jobs in a flow are ordered together
![Page 15: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/15.jpg)
Chapter 2: Flow Storage
• Most recent flow is ordered first
![Page 16: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/16.jpg)
Chapter 2: Key Features
• All jobs in a flow are ordered together
• Per-job metrics stored
  • Total map and reduce tasks
  • HDFS bytes read / written
  • File bytes read / written
  • Total map and reduce slot milliseconds
• Easy to aggregate stats for an entire flow
• Easy to scan the timeseries of each application’s flows
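To make "easy to aggregate stats for an entire flow" concrete, here is a minimal Python sketch. It assumes the per-job metrics have already been read into plain objects; the class and field names are hypothetical and only mirror the metrics listed above, they are not hRaven's API.

```python
from dataclasses import dataclass, fields
from typing import Iterable


@dataclass(frozen=True)
class JobMetrics:
    """Per-job counters; field names are hypothetical, modeled on the slide."""
    total_maps: int
    total_reduces: int
    hdfs_bytes_read: int
    hdfs_bytes_written: int
    file_bytes_read: int
    file_bytes_written: int
    map_slot_millis: int
    reduce_slot_millis: int


def aggregate_flow(jobs: Iterable[JobMetrics]) -> JobMetrics:
    """Sum every counter across all jobs that belong to one flow."""
    jobs = list(jobs)
    totals = {
        f.name: sum(getattr(job, f.name) for job in jobs)
        for f in fields(JobMetrics)
    }
    return JobMetrics(**totals)


# Two jobs from the same (made-up) flow.
flow = [
    JobMetrics(10, 2, 100, 50, 5, 5, 1000, 200),
    JobMetrics(20, 4, 300, 150, 15, 25, 3000, 800),
]
flow_totals = aggregate_flow(flow)
```

Because all jobs of a flow are stored next to each other, the list passed to `aggregate_flow` would come from a single contiguous scan rather than from scattered lookups.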
![Page 17: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/17.jpg)
Chapter 3: How Does it Work?
![Page 18: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/18.jpg)
Chapter 3: ETL - Step 1: JobFilePreprocessor
![Page 19: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/19.jpg)
Chapter 3: ETL - Step 2: JobFileRawLoader
![Page 20: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/20.jpg)
Chapter 3: ETL - Step 3: JobFileProcessor
![Page 21: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/21.jpg)
Chapter 3: ETL - Step 3: JobFileProcessor
Jobs finish out of order with respect to job_id
![Page 22: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/22.jpg)
Chapter 3: Tables
• job_history_raw
• job_history
• job_history_task
• job_history_app_version
![Page 23: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/23.jpg)
Chapter 3: job_history_raw
Row key: cluster!jobID
Columns:
• jobconf - stores serialized raw job_*_conf.xml file
• jobhistory - stores serialized raw job history log file
• job_processed_success - indicates whether job has been processed
![Page 24: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/24.jpg)
Chapter 3: job_history
Row key: cluster!user!application!timestamp!jobID
• cluster - unique cluster name (e.g. “cluster1@dc1”)
• user - user running the application (“edgar”)
• application - application ID derived from job configuration:
  • uses “batch.desc” property if set
  • otherwise parses a consistent ID from “mapred.job.name”
• timestamp - inverted (Long.MAX_VALUE - value) value of submission time
• jobID - stored as Job Tracker start time (long), concatenated with job sequence number
  • job_201306271100_0001 -> [1372352073732L][1L]
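The row key layout above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not hRaven's own code: the function names are hypothetical, numeric parts are packed as big-endian 8-byte longs, and the job tracker start time is simply parsed from the digits in the job ID (the slide does not spell out how it arrives at the long in its example, so that value is not reproduced here).

```python
import struct

JAVA_LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE
SEP = b"!"


def invert_timestamp(submit_millis: int) -> int:
    """Later submission times map to smaller values, so they sort first."""
    return JAVA_LONG_MAX - submit_millis


def encode_job_id(job_id: str) -> bytes:
    """Pack job_<start>_<sequence> as two fixed-width longs (assumed layout)."""
    prefix, start, sequence = job_id.split("_")
    if prefix != "job":
        raise ValueError(f"not a job ID: {job_id!r}")
    return struct.pack(">qq", int(start), int(sequence))


def job_history_key(cluster: str, user: str, application: str,
                    submit_millis: int, job_id: str) -> bytes:
    """Build cluster!user!application!timestamp!jobID as bytes.

    The packed longs may themselves contain the '!' byte, so a real reader
    would parse the key by position rather than by splitting on '!'.
    """
    return SEP.join([
        cluster.encode("utf-8"),
        user.encode("utf-8"),
        application.encode("utf-8"),
        struct.pack(">q", invert_timestamp(submit_millis)),
        encode_job_id(job_id),
    ])


# Two runs of the same application, submitted at different times.
older = job_history_key("cluster1@dc1", "edgar", "wordcount",
                        1369585634000, "job_201306271100_0001")
newer = job_history_key("cluster1@dc1", "edgar", "wordcount",
                        1372263813000, "job_201306271100_0002")
scan_order = sorted([older, newer])  # byte order, as a sorted store would scan
```

Sorting the two keys by bytes puts the newer run first, which is the "most recent flow is ordered first" property from Chapter 2.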
![Page 25: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/25.jpg)
Chapter 3: job_history_task
Row key: cluster!user!application!timestamp!jobID!taskID
• same components as job_history key (same ordering)
• taskID - (e.g. “m_00001”) uniquely identifies individual task/attempt in job
Two row types:
• Task - “meta” row
  cluster1@dc1!edgar!wordcount!9654...!...[00001]!m_00001
• Task Attempt - individual execution on a Task Tracker
  cluster1@dc1!edgar!wordcount!9654...!...[00001]!m_00001_1
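A small Python sketch of how the two row types can be told apart from the taskID suffix alone. The ID pattern (a one-letter kind such as `m`, a task number, and an optional attempt number) is an assumption extrapolated from the two examples above, and the helper name is hypothetical.

```python
import re

# Assumed shape: <kind>_<task number>, optionally followed by _<attempt number>.
TASK_ID = re.compile(r"^(?P<kind>[mr])_(?P<task>\d+)(?:_(?P<attempt>\d+))?$")


def row_type(task_id: str) -> str:
    """Return 'task' for a meta row and 'attempt' for an individual execution."""
    match = TASK_ID.match(task_id)
    if match is None:
        raise ValueError(f"unrecognized task ID: {task_id!r}")
    return "task" if match.group("attempt") is None else "attempt"


# Within one job the key prefix is identical, so rows order by taskID alone.
task_ids = ["m_00002", "m_00001_1", "m_00001", "m_00001_0"]
scan_order = sorted(task_ids)
row_types = [row_type(t) for t in scan_order]
```

Since the meta row's ID is a prefix of its attempts' IDs, a lexical scan returns each task's meta row immediately followed by that task's attempts.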
![Page 26: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/26.jpg)
Chapter 3: job_history_app_version
Row key: cluster!user!application
Example: cluster1@dc1!edgar!wordcount
Columns:
v1=1369585634000
v2=1372263813000
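One way to use a row like this, sketched in Python: look up which version of an application was current at a given time. This assumes each column value is the timestamp (in milliseconds) at which that version was first seen, which the slide does not state explicitly; the function name is hypothetical.

```python
from bisect import bisect_right
from typing import Dict, Optional


def version_at(versions: Dict[str, int], when_millis: int) -> Optional[str]:
    """Return the latest version first seen at or before when_millis."""
    ordered = sorted(versions.items(), key=lambda item: item[1])
    stamps = [stamp for _, stamp in ordered]
    index = bisect_right(stamps, when_millis)
    return ordered[index - 1][0] if index else None


# Column values from the example row above.
wordcount_versions = {"v1": 1369585634000, "v2": 1372263813000}
current = version_at(wordcount_versions, 1370000000000)
```

A flow's submission time can then be matched to a version, which helps when comparing runs before and after a change to the DAG.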
![Page 27: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/27.jpg)
Chapter 3: Querying hRaven
• Using Pig’s HBaseStorage (or direct HBase APIs)
• Through Client API
• Through REST API
![Page 28: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/28.jpg)
Chapter 4: Current Uses
![Page 29: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/29.jpg)
Chapter 4: Current Uses
• Pig reducer optimizations
• Cluster utilization / capacity planning
• Application performance trending over time
• Identifying common job anti-patterns
• Ad-hoc analysis for troubleshooting cluster problems
![Page 30: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/30.jpg)
Chapter 4: Cluster reads-writes
![Page 31: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/31.jpg)
Chapter 4: Pool / Application reads/writes
• Pool view
  • Spike in File size read
  • Indicates jobs spilling
• Application view
  • Spike in HDFS size read
  • Indicates spiking input
![Page 32: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/32.jpg)
Chapter 4: Pool usage: Used vs. Allocated
![Page 33: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/33.jpg)
Chapter 4: Compute cost
![Page 34: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/34.jpg)
Appendix: Future Work
![Page 35: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/35.jpg)
Appendix: Future Work
• Real-time data loading from Job Tracker / Application Master
• Full flow-centric UI (Job Tracker UI replacement)
• Hadoop 2.0 compatibility (in-progress)
• Ambrose integration
![Page 36: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/36.jpg)
Additional Resources
• hRaven on Github: https://github.com/twitter/hraven
• hRaven Mailing List: [email protected]
![Page 37: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/37.jpg)
Afterword
Now will thou drop your job data on the floor?
Quoth the hRaven, 'Nevermore.'
![Page 38: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/38.jpg)
#TheEnd
@gario and @joep
Come visit us at booth #26 to continue the story
![Page 39: A Birds-Eye View of Pig and Scalding Jobs with hRaven](https://reader035.vdocuments.us/reader035/viewer/2022070315/55510281b4c9057b478b4eb3/html5/thumbnails/39.jpg)
Sort order
Variable length job_id
• Desired order
  job_201306271100_9999
  job_201306271100_10000
  ...
  job_201306271100_99999
  job_201306271100_100000
  ...
  job_201306271100_999999
  job_201306271100_1000000
• Lexical order
  job_201306271100_10000
  job_201306271100_100000
  job_201306271100_1000000
  job_201306271100_9999
  job_201306271100_99999
  job_201306271100_999999
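The difference between the two orderings is easy to reproduce. The Python sketch below sorts the job IDs from this slide twice: once as plain strings, and once by a fixed-width binary encoding (two big-endian 8-byte longs). The encoding helper is hypothetical and stands in for the "start time (long) plus sequence number" jobID layout described for the job_history table.

```python
import struct

job_ids = [
    "job_201306271100_9999",
    "job_201306271100_10000",
    "job_201306271100_99999",
    "job_201306271100_100000",
    "job_201306271100_999999",
    "job_201306271100_1000000",
]


def encode_job_id(job_id: str) -> bytes:
    """Pack both numeric parts as fixed-width longs (assumed layout)."""
    _, start, sequence = job_id.split("_")
    return struct.pack(">qq", int(start), int(sequence))


# Plain string comparison: "10000" sorts before "9999".
lexical_order = sorted(job_ids)

# Fixed-width bytes compare the same way the numbers do.
binary_order = sorted(job_ids, key=encode_job_id)
```

`lexical_order` reproduces the "Lexical order" column and `binary_order` reproduces the "Desired order" column, which is why a variable-length string job_id is a poor sort key once sequence numbers grow past four digits.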