November 2013 HUG: Compute Capacity Calculator


Page 1: November 2013 HUG: Compute Capacity Calculator

Viraj Bhat

viraj@yahoo-inc.com

C3 – Compute Capacity Calculator

Hadoop User Group (HUG) – 20 Nov 2013

Page 2: November 2013 HUG: Compute Capacity Calculator

Why do we need this tool?

o Capacity planning for a multi-tenant system like the Hadoop Grid is critical

o Project owners need to estimate their projects' capacity requirements for provisioning on Hadoop clusters

o BU-POCs need capacity estimates from projects to manage their demand vs. supply equation within their business units

o SEO needs product owners to provide Grid capacity requirements quarterly (CAR)

Page 3: November 2013 HUG: Compute Capacity Calculator

Onboarding Projects - Challenge

o Application developers typically develop and test their Hadoop jobs or Oozie workflows on the limited-capacity, shared prototyping/research Hadoop cluster with partial data sets before onboarding to production Hadoop clusters

o Research and production Hadoop Grids have varying map-reduce slots, container sizes, and compute and communication costs

o Projects may need optimization before being onboarded

o SupportShop is the front-end portal for teams to onboard projects onto Yahoo! Grids

o The onboarding tool, known as Himiko, tracks user requests until the project is provisioned on the cluster

Page 4: November 2013 HUG: Compute Capacity Calculator

Project Onboarding Needs Compute Capacity

Page 5: November 2013 HUG: Compute Capacity Calculator

C3 Tool Requirements

o Self-Serve: deployed as a web-interface tool hosted within the end-user one-stop portal, SupportShop

o Rule-Based: uses a post-job-execution diagnostic rule engine to calculate the compute capacities

o SLA Focus: given a desired SLA, the tool calculates the optimal compute resources required on the cluster for the entire SLA range of [2x to 0.25x] (see the sketch after this list)

o Hide Complexity: takes into account the source and target clusters' map-reduce slot configuration, internal Hadoop scheduling and execution details, as well as hardware-specific "speedup" when calculating the compute capacities

o Pig Jobs Support: analyzes the DAG (Directed Acyclic Graph) of Map-Reduce jobs spawned by Pig to accurately compute the capacities

o Oozie Support: workflows running on our Grids use Oozie
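To illustrate the SLA Focus requirement, here is a minimal sketch of sweeping capacities across the SLA range. Only the 2x and 0.25x endpoints come from the slide; the intermediate factors and the toy capacity function are assumptions.

```python
# Hypothetical sketch of the SLA-range sweep; C3's real implementation
# (PHP) and its exact factor steps are not shown on the slide.
def sla_sweep(compute_capacity, sla_mins, factors=(2, 1.5, 1, 0.75, 0.5, 0.25)):
    """Evaluate a capacity function at multiples of the requested SLA."""
    return {f: compute_capacity(sla_mins * f) for f in factors}

# Toy capacity function: map slots scale inversely with the SLA,
# reduce slots stay fixed (see the rule logic on slide 10).
report = sla_sweep(lambda sla: (round(3000 / sla), 50), sla_mins=30)
print(report)  # tighter SLAs demand proportionally more map slots
```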

Page 6: November 2013 HUG: Compute Capacity Calculator

C3 Architecture

(Architecture diagram, reconstructed from the slide)

Input forms (C3 PHP forms in the browser, SupportShop frontend):

o M/R job: Job Type [MR], Grid Name [..], Job ID [job_202030_1234], SLA [mins] [..] → Submit

o Pig job: Job Type [Pig], Grid Name [..], Pig Console Output [location], SLA [mins] [..] → Submit

SupportShop backend: web server (yphp backend) ↔ C3 DB ↔ C3 cron job ↔ Yahoo! Grid (via the HDFS Proxy)

C3 cron job: 1) parse Pig logs / Oozie job IDs, 2) copy Pig logs, 3) run pending jobs from the DB and record completed jobs back to the DB

C3 core logic: 1) fetch job history and configuration logs using the HDFS Proxy, 2) execute the Hadoop Vaidya rules, 3) send results back to the C3 cron job

Output: a Compute Capacity Report (Job Type [Pig/MR], Grid Name, SLA [mins], map slot capacity, reduce slot capacity, job DAG), emailed to the user
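As a rough illustration of the flow above, here is a minimal, self-contained sketch of one pass of the C3 cron job. Every name in it is hypothetical; the real backend is a PHP (yphp) service with a database and an HDFS Proxy, none of which this toy code models faithfully.

```python
# Hypothetical single tick of the C3 cron job: run pending requests,
# apply the Vaidya rules, record and email the report.
from dataclasses import dataclass, field

@dataclass
class Request:
    job_id: str
    sla_mins: int
    user_email: str
    done: bool = False

@dataclass
class C3Db:
    requests: list = field(default_factory=list)
    def fetch_pending(self):
        return [r for r in self.requests if not r.done]
    def mark_completed(self, req):
        req.done = True

def fetch_job_logs(job_id):
    # Stand-in for pulling job history and conf logs via the HDFS Proxy.
    return {"job_id": job_id, "counters": {}}

def run_vaidya_rules(logs, sla_mins):
    # Stand-in for the Hadoop Vaidya rule-execution step.
    return f"capacity report for {logs['job_id']} at SLA {sla_mins} mins"

def c3_cron_tick(db):
    """One pass: 1) fetch logs, 2) run rules, 3) record and email."""
    for req in db.fetch_pending():
        logs = fetch_job_logs(req.job_id)
        report = run_vaidya_rules(logs, req.sla_mins)
        db.mark_completed(req)
        print(f"emailing {req.user_email}: {report}")

db = C3Db([Request("job_202030_1234", 30, "user@example.com")])
c3_cron_tick(db)
```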

Page 7: November 2013 HUG: Compute Capacity Calculator

C3 – Compute Capacity Calculator

o Calculates the compute capacity needed for M/R jobs to meet the required processing-time Service Level Agreement (SLA)

o Compute capacity is calculated in terms of the number of map and reduce slots/containers

o Machines to be procured are estimated from the map and reduce slots/containers

o Projects normally run their jobs on the research cluster and are then onboarded to the production cluster

o The tool should automatically match the research cluster's map-reduce slot ratio to production (Hadoop 1.x)

o Capacities of M/R jobs that are launched in parallel are added. Example: a fork in an Oozie workflow

o The maximum of the capacities of M/R jobs is taken when they are launched in sequence. Example: a Pig DAG, which produces sequential jobs (see the sketch below)
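A small sketch of the parallel/sequential combination rules in the last two bullets; the Capacity type and the function names are illustrative, not from C3.

```python
# Combining per-job capacities: parallel branches add, sequential
# stages take the maximum (names here are hypothetical).
from dataclasses import dataclass

@dataclass
class Capacity:
    map_slots: int
    reduce_slots: int

def parallel(*caps: Capacity) -> Capacity:
    """Jobs running at the same time (e.g. an Oozie fork): capacities add."""
    return Capacity(sum(c.map_slots for c in caps),
                    sum(c.reduce_slots for c in caps))

def sequential(*caps: Capacity) -> Capacity:
    """Jobs running one after another (e.g. a Pig DAG's stage chain):
    the largest stage alone determines the slots needed."""
    return Capacity(max(c.map_slots for c in caps),
                    max(c.reduce_slots for c in caps))

# Example: an Oozie fork of two jobs, followed by a final aggregation job.
fork = parallel(Capacity(120, 30), Capacity(80, 20))  # 200 maps, 50 reduces
total = sequential(fork, Capacity(150, 10))           # 200 maps, 50 reduces
print(total)
```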

Page 8: November 2013 HUG: Compute Capacity Calculator

C3 Statistics

o C3 and Himiko have helped onboard more than 200 projects

o More than 2,300 requests have been submitted to C3

o C3 has analyzed a Pig DAG consisting of more than 200 individual M/R jobs

o C3 has helped detect performance issues with certain M/R jobs, for example where excessive mappers were being used in a Pig script

Page 9: November 2013 HUG: Compute Capacity Calculator

C3 Backend – Hadoop Vaidya

o Rule-based performance diagnosis of M/R jobs

o M/R performance-analysis expertise is captured and provided as input through a set of pre-defined diagnostic rules

o Detects performance problems by post-mortem analysis of a job, executing the diagnostic rules against the job execution counters

o Provides targeted advice for individual performance problems

o Extensible framework (see the sketch below)

o You can add your own rules based on a rule template and the published job-counter data structures

o Complex rules can be written using existing simpler rules

Vaidya: an expert (versed in his own profession, esp. in medical science), skilled in the art of healing; a physician
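To make the rule-engine idea concrete, here is a tiny hypothetical rule in the spirit of Vaidya's diagnostic tests. It uses real Hadoop counter names but none of Vaidya's actual Java/XML machinery; the "excessive mappers" check echoes the issue C3 caught on slide 8.

```python
# Hypothetical diagnostic rule, Vaidya-style: inspect a finished job's
# counters post-mortem and emit targeted advice. Not Vaidya's real API.
def excessive_mappers_rule(counters, min_bytes_per_map=10_000_000):
    """Flag jobs whose map tasks each read very little input."""
    launched = counters.get("TOTAL_LAUNCHED_MAPS", 0)
    hdfs_read = counters.get("HDFS_BYTES_READ", 0)
    if launched and hdfs_read / launched < min_bytes_per_map:
        return ("Excessive mappers: each map task reads very little input; "
                "consider larger input splits or combining small files.")
    return None

def run_rules(counters, rules):
    """Execute every diagnostic rule; composite rules can call simpler ones."""
    return [advice for rule in rules if (advice := rule(counters))]

counters = {"TOTAL_LAUNCHED_MAPS": 5000, "HDFS_BYTES_READ": 10_000_000}
print(run_rules(counters, [excessive_mappers_rule]))
```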

Page 10: November 2013 HUG: Compute Capacity Calculator

C3 Rule logic at the Backend

o Reduce slot/container capacity equals the number of reduce slots/containers required for the number of reducers specified for the M/R job

o Shuffle time = amount of data per reducer / 4 MBps (a conservative, configurable bandwidth estimate)

o Reduce phase time =~ max(sort + reduce logic time) over all reducers * speedup

o Map phase time = (SLA - shuffle time - reduce phase time) * speedup

o Map slot capacity = MAP_SLOT_MILLIS / map phase time (in millis)

o MAP_SLOT_MILLIS = median of the 10% worst-performing mappers

o Once the initial map and reduce slot capacities are obtained from the above calculations, iteratively bring their ratio close to the per-node slot configuration (Hadoop 1.0)

o Add 10% slots for speculative execution (failed/killed task attempts)
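A minimal numeric sketch of the rules above. Two readings in it are assumptions rather than statements from the slide: MAP_SLOT_MILLIS (the median of the worst 10% of mappers) is scaled by the mapper count to estimate total map work, and the speedup factor multiplies the whole map-phase budget. The iterative adjustment of the map/reduce ratio toward the per-node slot configuration is omitted.

```python
# Sketch of the C3 rule logic; interpretations flagged below are assumptions.
import math
from statistics import median

MB = 1024 * 1024

def shuffle_ms(bytes_per_reducer, bandwidth_mbps=4):
    """Shuffle time: data per reducer / 4 MBps (conservative, configurable)."""
    return bytes_per_reducer / (bandwidth_mbps * MB) * 1000

def map_slot_millis(mapper_ms):
    """Median runtime of the 10% worst-performing mappers."""
    worst = sorted(mapper_ms, reverse=True)
    return median(worst[:max(1, len(worst) // 10)])

def capacities(sla_ms, mapper_ms, sort_reduce_ms, num_reducers,
               bytes_per_reducer, speedup=1.0):
    reduce_slots = num_reducers                   # one slot per reducer
    reduce_phase = max(sort_reduce_ms) * speedup  # slowest reducer * speedup
    # Map-phase budget: what is left of the SLA after shuffle and reduce.
    map_phase = (sla_ms - shuffle_ms(bytes_per_reducer) - reduce_phase) * speedup
    # Assumption: total map work = per-task estimate * number of mappers.
    total_map_work = map_slot_millis(mapper_ms) * len(mapper_ms)
    map_slots = math.ceil(total_map_work / map_phase)
    # Add 10% slots for speculative execution (failed/killed attempts).
    return math.ceil(map_slots * 1.1), math.ceil(reduce_slots * 1.1)

maps, reduces = capacities(
    sla_ms=30 * 60_000,             # 30-minute SLA
    mapper_ms=[90_000] * 400,       # 400 mappers, ~90 s each
    sort_reduce_ms=[300_000] * 50,  # 50 reducers, ~5 min each
    num_reducers=50,
    bytes_per_reducer=512 * MB,
)
print(maps, reduces)  # 30 map slots, 55 reduce slots
```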

Page 11: November 2013 HUG: Compute Capacity Calculator

C3 Input

Page 12: November 2013 HUG: Compute Capacity Calculator

Compute Capacity Tool Output

(Screenshots: Pig DAG and a single M/R job)

Page 13: November 2013 HUG: Compute Capacity Calculator

C3 tool integrated with Hadoop Vaidya

Page 14: November 2013 HUG: Compute Capacity Calculator

C3 results for M/R jobs run on Hadoop 0.23

Page 15: November 2013 HUG: Compute Capacity Calculator

C3 results for a Pig script run on Hadoop 0.23

Page 16: November 2013 HUG: Compute Capacity Calculator

Future Enhancements

o C3 should output the storage requirements for a job

o Display map and reduce runtimes

o Capacity planning for custom Map-Reduce jobs that can provide an XML description of their DAGs

o Introduce more granular estimation using a per-cluster speed-up factor based on the hardware node configuration (processors, memory, etc.)

o C3 should accept the percentage of the full data set used as input, to accurately estimate the capacities

Page 17: November 2013 HUG: Compute Capacity Calculator

Links

o Hadoop Vaidya: https://hadoop.apache.org/docs/r1.2.1/vaidya.html

o Hadoop Vaidya Job History Server integration for Hadoop 2.0: https://issues.apache.org/jira/browse/MAPREDUCE-3202

Page 18: November 2013 HUG: Compute Capacity Calculator

Acknowledgements

Yahoo
o Ryota Egashira – [email protected]
o Kendall Thrapp – [email protected]
o Kimsukh Kundu – [email protected]

Ebay
o Shashank Phadke – [email protected]

Pivotal
o Milind Bhandarkar – [email protected]
o Vitthal Gogate – [email protected]

Page 19: November 2013 HUG: Compute Capacity Calculator

Questions?