
MATRIX: MAny-Task computing execution fabRIc at eXascale

Outline
•  Introduction & Motivation
•  Problem Statement
•  Proposed Work
•  Evaluation
•  Conclusions
•  Future Work

Introduction & Motivation

•  Today (2014): Petascale computing
   −  34 PFlops
   −  3 million cores

•  Near future (~2020): Exascale computing
   −  1000 PFlops
   −  1 billion cores

http://s.top500.org/static/lists/2014/06/TOP500_201406_Poster.pdf http://users.ece.gatech.edu/mrichard/ExascaleComputingStudyReports/ECSS%20report%20101909.pdf

A resource management system:
•  Manages resources
   −  compute
   −  storage
   −  network
•  Job scheduling/management
   −  resource allocation
   −  job launch
•  Data management
   −  data movement
   −  caching


•  PetaScale → Exascale (10^18 Flops, 1B cores)
•  Workload diversity
   −  Traditional large-scale High Performance Computing (HPC) jobs
   −  HPC ensemble runs
   −  Many-Task Computing (MTC) applications
•  State-of-the-art resource managers
   −  Centralized paradigm for job scheduling
   −  HPC resource managers: SLURM, PBS, Moab, Torque, Cobalt
   −  HTC resource managers: Condor, Falkon, Mesos, YARN
•  Next-generation resource managers
   −  Distributed paradigm
   −  Scalable, efficient, reliable


Decade  Uses and computers involved
1970s   Weather forecasting, aerodynamic research (Cray-1)
1980s   Probabilistic analysis, radiation shielding modeling (CDC Cyber)
1990s   Brute-force code breaking (EFF DES cracker)
2000s   3D nuclear test simulations as a substitute for live testing banned under the Nuclear Non-Proliferation Treaty (ASCI Q)
2010s   Molecular dynamics simulation (Tianhe-1A)

Medical Image Processing: Functional Magnetic Resonance Imaging (fMRI)

Chemistry Domain: MolDyn

Molecular Dynamics: DOCK

Production Runs in Drug Design

Economic Modeling: MARS

Large-scale Astronomy Application Evaluation

Astronomy Domain: Montage

Data Analytics: Sort and WordCount

http://elib.dlr.de/64768/1/EnSIM_-_Ensemble_Simulation_on_HPC_Computes_EN.pdf

Problem Statement

•  Scalability
   −  System scale is increasing
   −  Workload size is increasing
   −  Processing capacity needs to increase accordingly
•  Efficiency
   −  Allocating resources fast
   −  Making scheduling decisions fast enough
   −  Maintaining high system utilization
•  Reliability
   −  Continuing to function under failures


Proposed Work

[Figure: Centralized vs. distributed scheduling architectures. MTC: a centralized scheduler (Falkon) vs. the fully distributed MATRIX design, where every compute node runs a scheduler, an executor, and a KVS server, all nodes are fully connected, and clients submit to any scheduler. HPC: a centralized controller (Slurm) vs. distributed controllers (Slurm++), each controller paired with a KVS server and managing a partition of compute daemons (cd).]

Scheduler Specification

[Figure: Inside each scheduler: a Wait Queue, a dedicated Local Ready Queue, a shared Work-Stealing Ready Queue, and a Complete Queue, connected by programs P1 and P2 and serviced by executor threads T1 to T4; the queues are detailed on a later slide.]


•  Fine-grained workloads
•  Load balancing
•  Data-aware scheduling

MTC workload characterization: application trace and application DAGs

[Figure: CDF of task execution times in the MTC application trace (percentage of tasks vs. task execution time, 0.01 to 1000 seconds, log scale).]

•  17-month trace from an IBM BG/P machine
   −  Minimum task length: 0 sec
   −  Maximum task length: 1470 sec
   −  Median task length: 30 sec
   −  Average task length: 95 sec

•  Directed Acyclic Graphs (DAGs)
   −  Vertices are discrete tasks
   −  Edges are data flows
   −  A task can have parents
   −  A task can have children

•  Load balancing
   −  Distribute workloads as evenly as possible
   −  Difficult because each scheduler has only a local view
   −  Loads change dynamically
•  Work stealing: a distributed load-balancing technique
   −  Popular on shared-memory machines at the thread level
   −  Starved processors steal tasks from overloaded ones
   −  MATRIX applies it at the node level in a distributed environment
   −  Tuned through several parameters (see the sketch after this list):
      §  number of tasks to steal
      §  number of neighbors
      §  static vs. dynamic neighbors (dynamic multi-random neighbor selection)
      §  polling interval (exponential back-off)
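A minimal sketch of this loop, assuming simplified stand-ins for the remote calls (num_tasks_on, steal_tasks_from, and the parameter values are illustrative, not MATRIX's actual code); per the simulation results later in the talk, num_neighbors would be roughly the square root of the node count:

#include <algorithm>
#include <chrono>
#include <random>
#include <thread>
#include <vector>

// Illustrative stand-ins for RPCs to remote schedulers (not MATRIX's real API).
static int num_tasks_on(int /*node*/) { return 0; }
static int steal_tasks_from(int /*node*/, int /*count*/) { return 0; }

// One scheduler's work-stealing daemon loop: poll a fresh random set of
// neighbors, steal half of the most loaded neighbor's tasks, and back off
// exponentially while every neighbor is idle.
void work_stealing_loop(int self, int num_nodes, int num_neighbors) {
    std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<int> pick(0, num_nodes - 1);
    int poll_interval_ms = 1; // base polling interval

    for (;;) {
        // Dynamic multi-random neighbor selection: new random neighbors each round.
        std::vector<int> neighbors;
        while ((int)neighbors.size() < num_neighbors) {
            int n = pick(rng);
            if (n != self &&
                std::find(neighbors.begin(), neighbors.end(), n) == neighbors.end())
                neighbors.push_back(n);
        }
        // Probe loads and target the most heavily loaded neighbor.
        int victim = -1, max_load = 0;
        for (int n : neighbors) {
            int load = num_tasks_on(n);
            if (load > max_load) { max_load = load; victim = n; }
        }
        if (victim >= 0 && steal_tasks_from(victim, max_load / 2) > 0) {
            poll_interval_ms = 1; // successful steal: reset the back-off
        } else {
            std::this_thread::sleep_for(std::chrono::milliseconds(poll_interval_ms));
            poll_interval_ms *= 2; // exponential back-off while idle
        }
    }
}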


[Figure: Work-stealing example. Twelve tasks arrive unevenly at Schedulers 1 to 4; idle schedulers steal batches of tasks (e.g., Tasks 4 to 6, Tasks 7 and 8) from overloaded neighbors until the load evens out.]

•  Big-data era
   −  Data-intensive applications
   −  Data-flow-driven programming models (MTC)
   −  Workflow and task execution frameworks
•  Work stealing
   −  Original work stealing is data-locality-oblivious
   −  Migrating tasks randomly compromises data locality
   −  We propose a data-aware work-stealing technique



Example workload for data-aware work stealing:

Task:       1     2     3     4     5     6     7     8
Length:     1ms   1ms   1ms   1ms   1ms   1ms   1ms   1ms
Data size:  5GB   20KB  1MB   20GB  30KB  0B    20KB  1MB


#include <string>
#include <vector>
using std::string;
using std::vector;

typedef struct TaskMetaData {
    int num_wait_parent;        // number of parents that have not yet completed
    vector<string> parent_list; // schedulers that ran each parent task
    vector<string> data_object; // data object name produced by each parent
    vector<long> data_size;     // data object size (bytes) produced by each parent
    long all_data_size;         // total size (bytes) of data produced by all parents
    vector<string> children;    // children of this task
} TMD;
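To show how this metadata drives scheduling, here is a hedged sketch of the update a completing task might trigger for its children (the job of the P2 program described on the following slide); the local map stands in for the distributed ZHT key-value store, and the function name is illustrative:

#include <map>

// Stand-in for the distributed key-value store that holds each task's TMD.
std::map<string, TMD> kvs;

// When `task`, run on scheduler `self`, completes and produces data object
// `obj` of `size` bytes, update every child's metadata; a child whose last
// waiting parent just finished becomes ready for P1 to route.
void on_task_complete(const string &task, const string &self,
                      const string &obj, long size) {
    for (const string &child : kvs[task].children) {
        TMD &c = kvs[child];
        c.parent_list.push_back(self);
        c.data_object.push_back(obj);
        c.data_size.push_back(size);
        c.all_data_size += size;
        if (--c.num_wait_parent == 0) {
            // child is now ready: P1 moves it from the wait queue to a ready queue
        }
    }
}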

[Figure: Scheduler queues. P1 moves tasks whose parents have all completed from the Wait Queue to the Local Ready Queue or the Work-Stealing Ready Queue; executor threads T1 to T4 run ready tasks and place them in the Complete Queue, where P2 updates the metadata of each completed task's children.]

•  Wait Queue (WaitQ): holds tasks that are waiting for their parents to complete
•  Dedicated Local Ready Queue (LReadyQ): holds ready tasks that can only be executed on the local node
•  Shared Work-Stealing Ready Queue (SReadyQ): holds ready tasks that can be shared through work stealing
•  Complete Queue (CompleteQ): holds tasks that have completed
•  P1: a program that checks whether a task is ready to run and moves ready tasks to one of the ready queues according to the decision-making algorithm
•  P2: a program that updates the task metadata of each child of a completed task
•  T1 to T4: the executor's 4 (configurable) executing threads, which execute tasks from the ready queues and move each task to the complete queue when it is done


Four policies govern how P1 routes ready tasks between the two ready queues (a sketch of the FLDS decision follows this list):
•  MLB: Maximized Load Balancing
•  MDL: Maximized Data-Locality
•  RLDS: Rigid Load-balancing and Data-locality Segregation
•  FLDS: Flexible Load-balancing and Data-locality Segregation
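A hedged sketch of the routing decision P1 could make under FLDS, assuming a data-size threshold and a queue-length bound (both values are illustrative, not from the paper): tasks with little input data go to the shared work-stealing queue, while tasks with large input data stay in the local ready queue of the node holding the data unless that node is already overloaded.

// Sketch only; uses the TMD struct defined earlier. Thresholds are illustrative.
enum ReadyQueue { SHARED_READY_Q, LOCAL_READY_Q };

const long DATA_THRESHOLD = 10L * 1024 * 1024; // 10 MB: boundary for "large" input
const size_t MAX_LOCAL_QUEUE_LEN = 100;        // FLDS bound before allowing migration

ReadyQueue route_ready_task(const TMD &t, size_t local_queue_len) {
    if (t.all_data_size < DATA_THRESHOLD)
        return SHARED_READY_Q;   // cheap to migrate: favor load balancing
    if (local_queue_len > MAX_LOCAL_QUEUE_LEN)
        return SHARED_READY_Q;   // the flexible part of FLDS: relieve hot spots
    return LOCAL_READY_Q;        // keep the task near its large input data
}

Under this sketch, Tasks 1 and 4 from the example table above (5GB and 20GB inputs) would stay pinned locally, while the KB-sized tasks would flow into the shared work-stealing queue.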


•  A discrete event simulator
   −  Simulates the MTC task execution framework
   −  Up to 1M nodes, 1B cores, 100B tasks
•  Used to explore the scalability of (a simulator skeleton is sketched below):
   −  the fully distributed MTC architecture
   −  the work-stealing technique
   −  the data-aware work-stealing technique
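For flavor, the heart of a discrete event simulator is a loop over a time-ordered event queue; a minimal self-contained skeleton (the event contents and the 64 ms task length are illustrative, not the simulator's actual design):

#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Event {
    double time;                  // simulated time, independent of wall-clock time
    std::function<void()> action; // handler to run when the event fires
    bool operator>(const Event &o) const { return time > o.time; }
};

int main() {
    // Min-heap ordered by event time.
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> pq;
    double now = 0;
    // Seed: a task arrives at t=0; its handler schedules the task's completion.
    pq.push({0.0, [&] { pq.push({now + 0.064, [] { std::puts("task done"); }}); }});
    while (!pq.empty()) {
        Event e = pq.top(); pq.pop();
        now = e.time; // jump simulated time directly to the next event
        e.action();   // handlers may push further events
    }
}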


[Figures: Simulated efficiency vs. scale for the work-stealing parameters. (1) Number of tasks to steal (steal_1, steal_2, steal_log, steal_sqrt, steal_half) at 1 to 8192 nodes; (2) number of static neighbors (nb_2, nb_log, nb_sqrt, nb_eighth, nb_quar, nb_half) at 1 to 8192 nodes; (3) number of dynamic neighbors (nb_1, nb_2, nb_log, nb_sqrt) at 1 to 1M nodes, with data-aware work stealing.]

The simulations indicate the optimal parameters for adaptive work stealing: steal half of a victim neighbor's tasks, and use a number of dynamic random neighbors equal to the square root of the number of nodes. With these settings, the data-aware work-stealing technique remains scalable at extreme scales.

Since the simulations have shown that the fully distributed architecture and the work-stealing technique are scalable, the next step is a real implementation.


•  MATRIX components
   −  Client (submits tasks, monitors execution progress)
   −  Scheduler (schedules tasks for execution)
   −  Executor (executes tasks)

•  MATRIX code (https://github.com/kwangiit/matrix_v2)
   −  5K lines of C++ code
   −  8K lines of ZHT code (a mock of the scheduler/KVS interaction follows this list)
   −  1K lines of code auto-generated by Google Protocol Buffers

•  MATRIX goal
   −  A scalable distributed scheduling system for MTC applications
   −  The prototype is finished; work on running real applications is ongoing

•  Testbeds
   −  IBM Blue Gene/P/Q supercomputers (4K cores)
   −  Probe Kodiak cluster (200 cores)
   −  Amazon EC2 (256 cores)
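To illustrate the scheduler/KVS interaction pattern, a self-contained mock of a put/get key-value client (ZHT's real API differs; the local map only stands in for the distributed hash table, and the key and value here are made up):

#include <iostream>
#include <map>
#include <string>

// Mock key-value store client; a real ZHT client would hash the key to a
// server and issue an RPC instead of touching a local map.
class KVSClient {
    std::map<std::string, std::string> store_;
public:
    bool put(const std::string &key, const std::string &val) {
        store_[key] = val;
        return true;
    }
    bool get(const std::string &key, std::string &val) {
        auto it = store_.find(key);
        if (it == store_.end()) return false;
        val = it->second;
        return true;
    }
};

int main() {
    KVSClient kvs;
    // A scheduler publishes serialized task metadata (e.g., a protobuf blob).
    kvs.put("task42.metadata", "num_wait_parent=0;all_data_size=1048576");
    std::string v;
    if (kvs.get("task42.metadata", v)) std::cout << v << "\n";
}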


Evaluation

For sub-second tasks (64 ms), MATRIX achieves efficiency above 85% at a scale of 1024 nodes (4096 cores), corresponding to throughput of about 13K tasks/sec.


[Figure: Efficiency vs. scale (256 to 2048 cores) for sleep 1, 2, 4, and 8 second tasks, comparing Falkon and MATRIX.]


[Figure: Application DAGs. AstroPortal image stacking: input files f0 … fm feed tasks t0 … tn, whose outputs feed a final task tend. Biometrics all-pairs: tasks t0 … t3 each compare a pair of files (f0,0, f0,1, f1,0, f1,1).]


[Figure: MATRIX vs. YARN efficiency (left axis) and average task length in seconds (right axis, 0 to 50 sec) at 2 to 256 cores.]

Scale (cores):       2     4     16    64    256
MATRIX efficiency:   99%   99%   94%   91%   91%
YARN efficiency:     57%   69%   55%   40%   30%

•  Bio-informatics application
•  Protein-ligand clustering
•  5 phases of MapReduce jobs
•  256MB data per node
•  The first phase has the majority of the tasks

Conclusions

•  System scale is approaching exascale
•  Applications are becoming fine-grained
•  Next-generation resource management systems need to be highly scalable, efficient, and available
•  Fully distributed architectures are scalable
•  The data-aware work-stealing technique is scalable


Future Work

•  Hadoop integration of MATRIX
   −  The Hadoop scheduler is centralized
   −  Replace the Hadoop scheduler with MATRIX

•  Workflow integration of MATRIX
   −  Swift workflow system
   −  Will enable the execution of real scientific applications

•  File system integration of MATRIX
   −  FusionFS distributed file system
   −  Takes care of data storage and management
   −  Exposes the data to MATRIX


•  More information: – http://datasys.cs.iit.edu/~kewang/

•  Contact: – [email protected]

•  Questions?
