MATRIX: MAny-Task computing execution fabRIc at eXascale

• Introduction & Motivation • Problem Statement • Proposed Work • Evaluation • Conclusions • Future Work

Introduction & Motivation
• Today (2014): Petascale computing − 34 PFlops, 3 million cores
• Near future (~2020): Exascale computing − 1000 PFlops, 1 billion cores
http://s.top500.org/static/lists/2014/06/TOP500_201406_Poster.pdf http://users.ece.gatech.edu/mrichard/ExascaleComputingStudyReports/ECSS%20report%20101909.pdf
• Manages resources − compute − storage − network
• Job scheduling/management − resource allocation − job launch
• Data management − data movement − caching
• Petascale → Exascale (10^18 Flops, 1B cores) • Workload Diversity
− Traditional large-scale High Performance Computing (HPC) jobs − HPC ensemble runs − Many-Task Computing (MTC) applications
• State-of-the-art Resource Managers − Centralized paradigm for job scheduling − HPC resource managers: SLURM, PBS, Moab, Torque, Cobalt − HTC resource managers: Condor, Falkon, Mesos, YARN
• Next-generation Resource Managers − Distributed paradigm − Scalable, efficient, reliable
Decade | Uses and computers involved
1970s  | Weather forecasting, aerodynamic research (Cray-1)
1980s  | Probabilistic analysis, radiation shielding modeling (CDC Cyber)
1990s  | Brute-force code breaking (EFF DES cracker)
2000s  | 3D nuclear test simulations as a substitute for physical testing under the Nuclear Non-Proliferation Treaty (ASCI Q)
2010s  | Molecular dynamics simulation (Tianhe-1A)
• Medical image processing: functional Magnetic Resonance Imaging (fMRI)
• Chemistry domain: MolDyn
• Molecular dynamics: DOCK
• Production runs in drug design
• Economic modeling: MARS
• Large-scale astronomy application evaluation
• Astronomy domain: Montage
• Data analytics: Sort and WordCount
http://elib.dlr.de/64768/1/EnSIM_-_Ensemble_Simulation_on_HPC_Computes_EN.pdf
Problem Statement
• Scalability − System scale is increasing − Workload size is increasing − Processing capacity needs to increase
• Efficiency − Allocating resources fast − Making fast enough scheduling decisions − Maintaining a high system utilization
• Reliability − Still functioning well under failures
Proposed Work
[Figure: MTC scheduling architectures — centralized (Falkon) vs. fully distributed (MATRIX). In MATRIX, every compute node runs a scheduler, an executor, and a KVS server; the schedulers are fully connected, and clients can submit to any of them.]
[Figure: HPC scheduling architectures — centralized (Slurm) vs. distributed (Slurm++). In Slurm++, each controller with its KVS server manages a partition of compute daemons (cd), and the controllers are fully connected.]
[Figure: the MATRIX architecture — fully connected compute nodes, each running a scheduler, an executor, and a KVS server; clients submit to any scheduler.]

Scheduler Specification
[Figure: scheduler queue organization — Wait Queue, Local Ready Queue, Work-Stealing Ready Queue, and Complete Queue, fed by programs P1 and P2 and drained by executor threads T1–T4; the full specification appears below.]
• Fine-grained Workloads • Load Balancing • Data-Aware Scheduling
MTC application traces and DAGs
[Figure: cumulative distribution of task execution times (percentage vs. task execution time, 0.01–1000 s, log scale).]
• 17-month trace from an IBM BG/P machine − Minimum length: 0 sec − Maximum length: 1470 sec − Median length: 30 sec − Average length: 95 sec
• Directed Acyclic Graphs (DAGs) − Vertices are discrete tasks − Edges are data flows − Tasks can have parents − Tasks can have children
• Load balancing – Distribute workloads as evenly as possible – Difficult because each scheduler has only a local view – Workloads change dynamically
• Work stealing: a distributed load balancing technique − Popular in shared-memory machines at the thread level − Starved processors steal tasks from overloaded ones − We apply it at the node level in a distributed environment − Various parameters (a sketch follows this list)
§ number of tasks to steal § number of neighbors § static vs. dynamic neighbors (dynamic multi-random neighbor selection) § polling interval (exponential back-off)
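A minimal, single-process sketch of this stealing loop, using the parameters the simulations later find optimal (steal half of the victim's tasks, square-root number of dynamic random neighbors); the toy Scheduler type and in-memory queues are stand-ins for MATRIX's networked schedulers, not the actual implementation:

#include <chrono>
#include <cmath>
#include <cstdio>
#include <deque>
#include <random>
#include <thread>
#include <vector>

struct Scheduler { std::deque<int> ready_queue; };

int main() {
  std::mt19937 rng(0);
  std::vector<Scheduler> nodes(64);            // assumed toy setup: 64 nodes
  for (int t = 0; t < 1000; ++t)
    nodes[0].ready_queue.push_back(t);         // all of the load starts on node 0

  Scheduler &self = nodes[1];                  // this scheduler is starved
  int poll_interval_ms = 1;                    // exponential back-off base
  int num_neighbors = (int)std::sqrt((double)nodes.size());  // sqrt(#nodes) neighbors

  while (self.ready_queue.empty()) {
    // dynamic multi-random neighbor selection: a fresh random set every round
    std::uniform_int_distribution<size_t> pick(0, nodes.size() - 1);
    size_t victim = 0, best_load = 0;
    for (int i = 0; i < num_neighbors; ++i) {
      size_t n = pick(rng);
      if (nodes[n].ready_queue.size() > best_load) {
        best_load = nodes[n].ready_queue.size();
        victim = n;
      }
    }
    if (best_load > 1) {
      for (size_t i = 0; i < best_load / 2; ++i) {   // steal half of the victim's tasks
        self.ready_queue.push_back(nodes[victim].ready_queue.back());
        nodes[victim].ready_queue.pop_back();
      }
      poll_interval_ms = 1;                    // reset back-off after a successful steal
    } else {
      std::this_thread::sleep_for(std::chrono::milliseconds(poll_interval_ms));
      poll_interval_ms *= 2;                   // exponential back-off when neighbors are idle
    }
  }
  std::printf("stole %zu tasks\n", self.ready_queue.size());
}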
[Figure: work-stealing example — Tasks 1–12 spread across Schedulers 1–4; starved schedulers steal tasks (e.g. Tasks 4–6 and Tasks 7–8) from overloaded neighbors, which send them over.]
• Big-data era – Data-intensive applications – Data-flow driven programming models (MTC) – Workflow and task execution frameworks
• Work stealing – The original technique is data-locality oblivious – Migrating tasks randomly compromises data locality – We propose a data-aware work stealing technique
Task      | 1    | 2     | 3    | 4     | 5     | 6    | 7     | 8
Length    | 1 ms | 1 ms  | 1 ms | 1 ms  | 1 ms  | 1 ms | 1 ms  | 1 ms
Data size | 5 GB | 20 KB | 1 MB | 20 GB | 30 KB | 0 B  | 20 KB | 1 MB
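To see the problem concretely: all eight tasks run for 1 ms, but stealing Task 4 onto another node means moving its 20 GB of input first. At an assumed 10 Gb/s network, that transfer takes 20 GB × 8 / 10 Gb/s = 16 s, roughly 16,000× the task's runtime, while Task 6 (0 B) can be migrated for free.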
#include <string>
#include <vector>
using std::string;
using std::vector;

typedef struct TaskMetaData {
  int num_wait_parent;        // number of parents not yet completed
  vector<string> parent_list; // scheduler that ran each parent task
  vector<string> data_object; // data object name produced by each parent
  vector<long> data_size;     // data object size (bytes) produced by each parent
  long all_data_size;         // total size (bytes) of data produced by all parents
  vector<string> children;    // children of this task
} TMD;
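Presumably each scheduler keeps one such record per task in the distributed key-value store (ZHT), keyed by task id, so that any scheduler can look up where a task's input data lives before deciding where to run or migrate it.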
[Figure: scheduler queue organization — Tasks 1–6 flowing through the Wait Queue, Local Ready Queue, Work-Stealing Ready Queue, and Complete Queue; specified below.]
Wait Queue (WaitQ): holds tasks that are waiting for parents to complete
Dedicated Local Ready Queue (LReadyQ): holds ready tasks that can only be executed on local node
Shared Work-Stealing Ready Queue (SReadyQ): holds ready tasks that can be shared through work stealing
Complete Queue (CompleteQ): holds tasks that are completed
P1: a program that checks whether a task is ready to run, and moves ready tasks to one of the two ready queues according to the decision-making algorithm
P2: a program that updates the task metadata for each child of a completed task
T1 to T4: the executor has 4 (configurable) executing threads that execute tasks from the ready queues and move each task to the complete queue when it is done
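A minimal sketch of what P2 might do, reusing the TMD record above; KVStore is a hypothetical in-memory stand-in for the actual ZHT calls, not the real API:

#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for the distributed KVS (ZHT in MATRIX).
struct KVStore {
  std::map<std::string, TMD> kv;
  TMD get(const std::string &k) { return kv[k]; }
  void put(const std::string &k, const TMD &v) { kv[k] = v; }
};

// P2 sketch: on task completion, decrement each child's waiting-parent
// count and record where (and how large) the produced data object is;
// children whose count reaches zero are handed to P1 as ready tasks.
void on_task_complete(KVStore &kvs, const TMD &done, const std::string &self_id,
                      const std::string &out_obj, long out_size,
                      std::vector<std::string> &newly_ready) {
  for (const std::string &child_id : done.children) {
    TMD child = kvs.get(child_id);
    child.num_wait_parent -= 1;
    child.parent_list.push_back(self_id);  // this scheduler holds the parent's output
    child.data_object.push_back(out_obj);
    child.data_size.push_back(out_size);
    child.all_data_size += out_size;
    kvs.put(child_id, child);
    if (child.num_wait_parent == 0) newly_ready.push_back(child_id);
  }
}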
• MLB – Maximized Load Balancing
• MDL – Maximized Data-Locality
• RLDS – Rigid Load balancing and Data-locality Segregation
• FLDS – Flexible Load balancing and Data-locality Segregation
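A sketch of how P1's queue choice might implement these four policies, with threshold and overloaded as assumed tunables (the slides do not give exact definitions): a newly ready task goes to the shared work-stealing ready queue when this returns true, and to the dedicated local ready queue otherwise.

enum Policy { MLB, MDL, RLDS, FLDS };

// Sketch of P1's decision for a newly ready task (reusing TMD from above).
bool goes_to_shared_queue(Policy p, const TMD &task, long threshold, bool overloaded) {
  switch (p) {
    case MLB:  return true;                            // always stealable: best load balancing
    case MDL:  return task.all_data_size == 0;         // any task with data stays local
    case RLDS: return task.all_data_size < threshold;  // rigid size cut-off
    case FLDS:                                         // flexible: spill big tasks when overloaded
      return task.all_data_size < threshold || overloaded;
  }
  return true;
}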
• A discrete event simulator – Simulates MTC task execution framework – 1M nodes, 1B cores, 100B tasks
• Explore the scalability – fully distributed MTC-architecture – work stealing technique – data-aware work stealing technique
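The simulator itself is not described in detail here; the sketch below shows only the generic discrete-event idea it rests on — a time-ordered priority queue of events processed in timestamp order, which is what lets one process model millions of simulated nodes:

#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// Minimal discrete-event engine: a priority queue of (time, action) pairs.
struct Event {
  double time;
  std::function<void()> action;
  bool operator>(const Event &o) const { return time > o.time; }
};

int main() {
  std::priority_queue<Event, std::vector<Event>, std::greater<Event>> eq;
  double now = 0;

  // Schedule a few toy "task finished" events, deliberately out of order.
  for (int node = 0; node < 4; ++node)
    eq.push({0.064 * (4 - node), [node] { std::printf("node %d finished a task\n", node); }});

  while (!eq.empty()) {          // advance virtual time event by event
    Event e = eq.top(); eq.pop();
    now = e.time;
    e.action();
  }
  std::printf("simulated time: %.3f s\n", now);
}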
[Figure: simulated efficiency vs. scale (1–8,192 nodes) for the number of tasks to steal: steal_1, steal_2, steal_log, steal_sqrt, steal_half.]
[Figure: simulated efficiency vs. scale (1–8,192 nodes) for the number of static neighbors: nb_2, nb_log, nb_sqrt, nb_eighth, nb_quar, nb_half.]
[Figure: simulated efficiency vs. scale (1–1,048,576 nodes) for the number of dynamic neighbors: nb_1, nb_2, nb_log, nb_sqrt.]
[Figure: data-aware work stealing results.]
The conclusion on the simulation-derived optimal parameters for adaptive work stealing: steal half the tasks from a neighbor, and use a square-root number of dynamic random neighbors. With these settings, the data-aware work stealing technique is scalable at extreme scales.
Simulations have shown that the distributed architecture and the work stealing technique are scalable; the next step is a real implementation.
• MATRIX components – Client (submits tasks, monitors execution progress) – Scheduler (schedules tasks for execution) – Executor (executes tasks)
• MATRIX code (https://github.com/kwangiit/matrix_v2) – 5K lines of C++ code – 8K lines of ZHT code – 1K lines of code auto-generated by Google Protocol Buffers
• MATRIX goal – A scalable distributed scheduling system for MTC applications – The prototype is finished; we are working on running real applications
• Testbeds: – IBM Blue Gene/P/Q supercomputers (4K cores) – Probe Kodiak cluster (200 cores) – Amazon EC2 (256 cores)
Evaluation
For sub-second tasks (64 ms), MATRIX achieves efficiencies above 85% at 1024 nodes (4096 cores), corresponding to throughput as high as 13K tasks/sec.
[Figure: efficiency vs. scale (256–2048 cores) for sleep 1, 2, 4, and 8 second tasks, MATRIX vs. Falkon.]
[Figure: application DAGs — Astroportal image stacking; Biometrics all-pairs.]
[Figure: MATRIX vs. YARN efficiency and average task length vs. scale (2–256 cores). MATRIX efficiency: 99%, 99%, 94%, 91%, 91%; YARN efficiency: 57%, 69%, 55%, 40%, 30%.]
• Bio-informatics application
• Protein-ligand clustering
• 5 phases of MapReduce jobs
• 256MB data per node
• First phase has the majority of tasks
Conclusions
• System scale is approaching exascale
• Applications are becoming fine-grained
• Next-generation resource management systems need to be highly scalable, efficient, and available
• Fully distributed architectures are scalable
• The data-aware work stealing technique is scalable
Future Work
• Hadoop integration of MATRIX − The Hadoop scheduler is centralized − Replace the Hadoop scheduler with MATRIX
• Workflow integration of MATRIX − Swift workflow system − Will enable the execution of real scientific applications
• File system integration of MATRIX − FusionFS distributed file system − Takes care of data storage and management − Exposes the data to MATRIX
• More information: – http://datasys.cs.iit.edu/~kewang/
• Contact: – [email protected]
• Questions?