starfish: a self-tuning system for big data analytics · mr job word counts for docs workload on a...
TRANSCRIPT
![Page 1: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/1.jpg)
Herodotos Herodotou, Harold Lim,
Gang Luo, Nedyalko Borisov, Liang Dong,
Fatma Bilgen Cetin, Shivnath Babu
Duke University
![Page 2: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/2.jpg)
Analysis in the Big Data Era
01/12/2011 Duke University 2
Automate decision processes
Increase cost savings and revenue
Massive Data
Data Analysis
Insight
Key to Success = Timely and Cost-Effective Analysis
![Page 3: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/3.jpg)
Analysis in the Big Data Era
01/12/2011 Duke University 3
Popular option
Hadoop software stack
MapReduce Execution Engine
Distributed File System
Hadoop
OozieHivePig Elastic MRJava Client …
![Page 4: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/4.jpg)
Analysis in the Big Data Era
01/12/2011 Duke University 4
Popular option
Hadoop software stack
Burden on the users
Responsible for provisioning & configuration
Usually lack expertise to tune the system
Challenges
Tasks expressed in general-purpose programming
languages
Input data stored as files and interpreted at run-time
![Page 5: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/5.jpg)
Starfish: Self-Tuning System
01/12/2011 Duke University 5
NOT our goal: Improve Hadoop’s peak performance
Our goal: Provide good performance automatically
MapReduce Execution Engine
Distributed File System
Hadoop
OozieHivePig Elastic MRJava Client
Starfish
Analytics System
…
![Page 6: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/6.jpg)
MR Job
Word Counts For Docs
Workload on a Starfish Cluster MapReduce (MR) Job
Workflow
Physical: directed graph of MR job nodes
Logical: directed graph of SPJA & UDF nodes
Workload: Collection of workflows
01/12/2011 Duke University 6
Count clicks
per url type
Join
Filter
age < 21
Load users
(id, age, ip)
Load clicks
(id, url, count)
MR Job 3
Compute TF-IDF
MR Job 2
Word Counts For Docs
MR Job 1
Word Frequency in Doc
![Page 7: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/7.jpg)
Starfish Architecture
01/12/2011 Duke University 7
What-if
Engine
Workflow-level tuning
Workflow-aware
Scheduler
Workload-level tuning
Workload Optimizer Elastisizer
Data Manager
Metadata
Mgr.
Intermediate
Data Mgr.
Data Layout &
Storage Mgr.
Just-in-Time Optimizer
Profiler
Job-level tuning
Sampler
![Page 8: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/8.jpg)
What-if
Engine
Starfish Architecture
01/12/2011 Duke University 8
Workflow-level tuning
Workflow-aware
Scheduler
Workload-level tuning
Workload Optimizer Elastisizer
Data Manager
Metadata
Mgr.
Intermediate
Data Mgr.
Data Layout &
Storage Mgr.
Just-in-Time Optimizer
Profiler
Job-level tuning
Sampler
![Page 9: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/9.jpg)
Job Configuration Parameters
01/12/2011 Duke University 9
Over 190 parameters
Many affect performance in complex ways
Impact depends on Job, Data, and Cluster properties
![Page 10: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/10.jpg)
Current Approach
01/12/2011 Duke University 10
Rules of thumb
mapred.reduce.tasks = 0.9 * number_of_reduce_slots
io.sort.record.percent = 16 / (16 + average_record_size)
Rules of thumb may not suffice
Rules of
thumb
![Page 11: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/11.jpg)
Just-in-Time Job Optimization Just-in-Time Optimizer
Searches through the high-dimensional space of
parameter settings
What-if Engine
Uses mix of simulation and model-based estimation
Sampler
Collects statistics about input, intermediate, and output
key-value spaces of MapReduce jobs
Profiler
Collects information about MR job executions
01/12/2011 Duke University 11
![Page 12: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/12.jpg)
Job Profiler Dynamic instrumentation
Monitor specific components in a system
Collect run-time information
Benefits
Zero overhead when it is turned off
Works with unmodified MapReduce programs
Used to construct a job profile
Concise representation of the job execution
Allows for in-depth analysis of the job behavior
12/07/2010 Herodotos Herodotou 12
![Page 13: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/13.jpg)
What-if
Engine
Starfish Architecture
01/12/2011 Duke University 13
Workflow-level tuning
Workflow-aware
Scheduler
Workload-level tuning
Workload Optimizer Elastisizer
Data Manager
Metadata
Mgr.
Intermediate
Data Mgr.
Data Layout &
Storage Mgr.
Just-in-Time Optimizer
Profiler
Job-level tuning
Sampler
![Page 14: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/14.jpg)
Job Workflows Producer-Consumer
relationships among jobs
Data Layout crucial for later
jobs
Effective use of parallelism
Task scheduling
Major Problem
Unbalanced data layouts
01/12/2011 Duke University 14
![Page 15: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/15.jpg)
Unbalanced Data Layouts Issues with data-locality-aware schedulers
Performance degradation due to reduced parallelism
Further unbalanced layout due to job outputs
01/12/2011 Duke University 15
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0
5
10
15
20
Data Node
Dis
k U
sage
(%)
Map-only aggregation
(non-data local)
Map-only aggregation
(data local)
Partition (replication
count = 1)
Initial layout
![Page 16: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/16.jpg)
Workflow-Aware Scheduler Goal: Optimize overall performance of workflow
Select best data layout + job parameters
Space of options
Block placement policy
Replication factor
Block size
Output compression
Approach
Simulate task scheduling and block placement policies
Perform cost-based search
01/12/2011 Duke University 16
![Page 17: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/17.jpg)
Starfish Architecture
01/12/2011 Duke University 17
What-if
Engine
Workflow-level tuning
Workflow-aware
Scheduler
Workload-level tuning
Workload Optimizer Elastisizer
Data Manager
Metadata
Mgr.
Intermediate
Data Mgr.
Data Layout &
Storage Mgr.
Just-in-Time Optimizer
Profiler
Job-level tuning
Sampler
![Page 18: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/18.jpg)
Optimizing Starfish Workloads Data-flow sharing
Materialization
Reorganization
01/12/2011 Duke University 18
MapReduce Execution Engine
Distributed File System
Hadoop
OozieHivePig Elastic MRJava Client …
Starfish
Analytics System
![Page 19: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/19.jpg)
Jumbo Operator Single MapReduce job to process multiple Select-
Project-Aggregate operations over a table
Enables sharing of scans, computation, sorting,
shuffling, and output generation
01/12/2011 Duke University 19
0
200
400
600
800
1000
Ru
nn
ing
Tim
e (s
ec) Serial
Concurrent
Jumbo (Share Scans)
Jumbo (Share Interm)
![Page 20: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/20.jpg)
Challenges and Opportunities1. Study the interactions among optimizations across
different levels
2. Construct a What-if Engine that can model these
interactions
01/12/2011 Duke University 20
What-if
Engine
In
ter
ac
tio
ns
Schedule on Map & Reduce Slots
Translate to MapReduce jobs
Optimize logical workloads
![Page 21: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/21.jpg)
Elastisizer – Hadoop Provisioning Goal: Make provisioning decisions based on workload
requirements (e.g., completion time, cost)
01/12/2011 Duke University 21
0
2000
4000
6000
8000
10000
12000
m1.small m1.large m1.xlarge
Exec
uti
on
Tim
e (
sec)
Node Type on Amazon EC2
2 nodes 4 nodes 6 nodes
$0.00
$1.00
$2.00
$3.00
$4.00
$5.00
$6.00
m1.small m1.large m1.xlarge
Cost
($)
Node Type on Amazon EC2
2 nodes 4 nodes 6 nodes
![Page 22: Starfish: A Self-tuning System for Big Data Analytics · MR Job Word Counts For Docs Workload on a Starfish Cluster MapReduce (MR) Job Workflow Physical: directed graph of MR job](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec88ce63a33f068f4242651/html5/thumbnails/22.jpg)
Starfish: Self-Tuning SystemFocus simultaneously on
Different workload granularities
Workload
Workflows
Jobs (procedural and declarative)
Across various decision points
Provisioning
Optimization
Scheduling
Data layout
01/12/2011 Duke University 22