On-line Automated Performance Diagnosis on Thousands of Processors

Post on 04-Jan-2016

DESCRIPTION

On-line Automated Performance Diagnosis on Thousands of Processors. Philip C. Roth. Future Technologies Group, Computer Science and Mathematics Division, Oak Ridge National Laboratory. Paradyn Research Group, Computer Sciences Department, University of Wisconsin-Madison.

TRANSCRIPT

1

OAK RIDGE NATIONAL LABORATORY, U. S. DEPARTMENT OF ENERGY

On-line Automated Performance Diagnosis on Thousands of Processors

Philip C. Roth

Future Technologies Group
Computer Science and Mathematics Division
Oak Ridge National Laboratory

Paradyn Research Group
Computer Sciences Department
University of Wisconsin-Madison

2

High Performance Computing Today

• Large parallel computing resources
  - Tightly coupled systems (Earth Simulator, BlueGene/L, XT3)
  - Clusters (LANL Lightning, LLNL Thunder)
  - Grid
• Large, complex applications
  - ASCI Blue Mountain job sizes (2001)
    - 512 CPUs: 17.8%
    - 1024 CPUs: 34.9%
    - 2048 CPUs: 19.9%
• Small fraction of peak performance is the rule

3

Achieving Good Performance

• Need to know what and where to tune
• Diagnosis and tuning tools are critical for realizing potential of large-scale systems
• On-line automated tools are especially desirable
  - Manual tuning is difficult
    - Finding interesting data in large data volume
    - Understanding application, OS, hardware interactions
  - Automated tools require minimal user involvement; expertise is built into the tool
  - On-line automated tools can adapt dynamically
    - Dynamic control over data volume
    - Useful results from a single run
• But: tools that work well in small-scale environments often don’t scale

4

Barriers to Large-Scale Performance Diagnosis

[Diagram: Tool Front End; Tool Daemons d0 … dP-1; App Processes a0 … aP-1]

• Managing performance data volume
• Communicating efficiently between distributed tool components
• Making scalable presentation of data and analysis results

5

Our Approach for Addressing These Scalability Barriers

MRNet: multicast/reduction infrastructure for scalable tools

Distributed Performance Consultant: strategy for efficiently finding performance bottlenecks in large-scale applications

Sub-Graph Folding Algorithm: algorithm for effectively presenting bottleneck diagnosis results for large-scale applications

6

Outline

• Performance Consultant
• MRNet
• Distributed Performance Consultant
• Sub-Graph Folding Algorithm
• Evaluation
• Summary

7

Performance Consultant

• Automated performance diagnosis
• Search for application performance problems
  - Start with global, general experiments (e.g., test CPUbound across all processes)
  - Collect performance data using dynamic instrumentation
    - Collect only the data desired
    - Remove the instrumentation when no longer needed
  - Make decisions about truth of each experiment
  - Refine search: create more specific experiments based on “true” experiments (those whose data is above user-configurable threshold)
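The search strategy on this slide can be illustrated with a minimal sketch. It is not Paradyn's implementation: the `Experiment` class, the `measure` and `refine` callbacks, the CPU fractions, and the callee table are all hypothetical, and `measure` merely stands in for inserting dynamic instrumentation, collecting data, and removing the instrumentation again.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    hypothesis: str                      # e.g. "CPUbound"
    focus: str                           # where the hypothesis is tested
    is_true: bool = False
    children: list = field(default_factory=list)


def search(hypothesis, focus, measure, refine, threshold):
    """Test one experiment and refine it only if it tests true."""
    exp = Experiment(hypothesis, focus)
    # measure() stands in for inserting instrumentation, collecting data,
    # and removing the instrumentation again (an assumption of this sketch).
    exp.is_true = measure(hypothesis, focus) > threshold
    if exp.is_true:
        for narrower in refine(focus):
            exp.children.append(
                search(hypothesis, narrower, measure, refine, threshold))
    return exp


def true_foci(exp):
    """List the foci of all true experiments, depth first."""
    found = [exp.focus] if exp.is_true else []
    for child in exp.children:
        found += true_foci(child)
    return found


# Hypothetical data: fraction of time each focus spends computing.
CPU_FRACTION = {"whole program": 0.9, "main": 0.9,
                "Do_row": 0.1, "Do_col": 0.7, "Do_mult": 0.65}
# Hypothetical refinement relation (callees of each focus).
CALLEES = {"whole program": ["main"], "main": ["Do_row", "Do_col"],
           "Do_col": ["Do_mult"]}

root = search("CPUbound", "whole program",
              measure=lambda hyp, focus: CPU_FRACTION[focus],
              refine=lambda focus: CALLEES.get(focus, []),
              threshold=0.2)
print(true_foci(root))
```

In this sketch `Do_row` is tested but falls below the threshold, so nothing beneath it is ever measured; that pruning is what keeps the amount of collected data small.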

8

Performance Consultant

[Diagram: hosts c001.cs.wisc.edu, c002.cs.wisc.edu, c128.cs.wisc.edu; processes myapp367, myapp4287, myapp27549]

9

Performance Consultant

[Diagram: hosts c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu with processes myapp367, myapp4287, …, myapp27549, together with a search history graph rooted at CPUbound whose nodes are labeled with hosts, processes (myapp{367}, myapp{4287}, myapp{27549}), and functions (main, Do_row, Do_col, Do_mult)]

10

Performance Consultant

[Diagram: hosts cham.cs.wisc.edu, c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu with processes myapp367, myapp4287, …, myapp27549, together with a search history graph rooted at CPUbound whose nodes are labeled with hosts, processes (myapp{367}, myapp{4287}, myapp{27549}), and functions (main, Do_row, Do_col, Do_mult)]

11

Outline

• Performance Consultant
• MRNet
• Distributed Performance Consultant
• Sub-Graph Folding Algorithm
• Evaluation
• Summary

12

MRNet: Multicast/Reduction Overlay Network

• Parallel tool infrastructure providing:
  - Scalable multicast
  - Scalable data synchronization and transformation
• Network of processes between tool front-end and back-ends
• Useful for parallelizing and distributing tool activities
  - Reduce latency
  - Reduce computation and communication load at tool front-end
• Joint work with Dorian Arnold (University of Wisconsin-Madison)
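To make the idea of a multicast/reduction network concrete, here is a small simulation in plain Python. It does not use the MRNet API; the tree layout, the node names (`fe`, `i0`, `i1`, `d0` … `d3`), and the sample values are assumptions chosen only for illustration.

```python
def reduce_tree(node, tree, leaf_values, filt):
    """Return the filtered value for the subtree rooted at node."""
    children = tree.get(node, [])
    if not children:
        # A leaf stands in for a tool daemon (back-end).
        return leaf_values[node]
    # An internal process applies the filter to its children's results,
    # so each process only handles as many inputs as it has children.
    return filt([reduce_tree(c, tree, leaf_values, filt) for c in children])


def multicast(node, tree, message, delivered):
    """Send message from node down to every leaf beneath it."""
    children = tree.get(node, [])
    if not children:
        delivered[node] = message
    for child in children:
        multicast(child, tree, message, delivered)
    return delivered


# Hypothetical topology: front-end, two internal processes, four daemons.
TREE = {"fe": ["i0", "i1"], "i0": ["d0", "d1"], "i1": ["d2", "d3"]}
# Hypothetical per-daemon samples (arbitrary integer units).
SAMPLES = {"d0": 3, "d1": 5, "d2": 2, "d3": 7}

total = reduce_tree("fe", TREE, SAMPLES, sum)
peak = reduce_tree("fe", TREE, SAMPLES, max)
delivered = multicast("fe", TREE, "start", {})
print(total, peak, sorted(delivered))
```

With this layout the front-end combines two partial results instead of four raw samples; the reduction in front-end load grows with the number of daemons and the depth of the tree.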

13

Typical Parallel Tool Organization

[Diagram: Tool Front End; Tool Daemons d0 … dP-1; App Processes a0 … aP-1]

14

MRNet-based Parallel Tool Organization

[Diagram: Tool Front End; Multicast/Reduction Network (labels: Internal Process, Filter); Tool Daemons d0 … dP-1; App Processes a0 … aP-1]

15

Outline

• Performance Consultant
• MRNet
• Distributed Performance Consultant
• Sub-Graph Folding Algorithm
• Evaluation
• Summary

16

Performance Consultant: Scalability Barriers

MRNet can alleviate scalability problem for global performance data (e.g., CPU utilization across all processes)

But front-end still processes local performance data (e.g., utilization of process 5247 on host mcr398.llnl.gov)

17

Performance Consultant

[Diagram: hosts cham.cs.wisc.edu, c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu with processes myapp367, myapp4287, …, myapp27549, together with a search history graph rooted at CPUbound whose nodes are labeled with hosts, processes (myapp{367}, myapp{4287}, myapp{27549}), and functions (main, Do_row, Do_col, Do_mult)]

18

Distributed Performance Consultant

[Diagram: hosts cham.cs.wisc.edu, c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu with processes myapp367, myapp4287, …, myapp27549, together with a search history graph rooted at CPUbound whose nodes are labeled with hosts, processes (myapp{367}, myapp{4287}, myapp{27549}), and functions (main, Do_row, Do_col, Do_mult)]

19

Distributed Performance Consultant: Variants

• Natural steps from traditional centralized approach (CA)
• Partially Distributed Approach (PDA)
  - Distributed local searches, centralized global search
  - Requires complex instrumentation management
• Truly Distributed Approach (TDA)
  - Distributed local searches only
  - Insight into global behavior from combining local search results (e.g., using Sub-Graph Folding Algorithm)
  - Simpler tool design than PDA

20

Distributed Performance Consultant: PDA

[Diagram: hosts cham.cs.wisc.edu, c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu with processes myapp367, myapp4287, …, myapp27549, together with a search history graph rooted at CPUbound whose nodes are labeled with hosts, processes (myapp{367}, myapp{4287}, myapp{27549}), and functions (main, Do_row, Do_col, Do_mult)]

21

Distributed Performance Consultant: TDA

[Diagram: hosts cham.cs.wisc.edu, c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu with processes myapp367, myapp4287, …, myapp27549, together with per-host search sub-graphs labeled with hosts, processes (myapp{367}, myapp{4287}, myapp{27549}), and functions (main, Do_row, Do_col, Do_mult); no CPUbound root node is shown]

22

Distributed Performance Consultant: TDA

[Diagram: hosts cham.cs.wisc.edu, c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu with processes myapp367, myapp4287, …, myapp27549, together with per-host search sub-graphs labeled with hosts, processes (myapp{367}, myapp{4287}, myapp{27549}), and functions (main, Do_row, Do_col, Do_mult); an added label reads "Sub-Graph Folding Algorithm"]

23

Outline

• Paradyn and the Performance Consultant
• MRNet
• Distributed Performance Consultant
• Sub-Graph Folding Algorithm
• Evaluation
• Summary

24

Search History Graph Example

[Diagram: search history graph rooted at CPUbound, with nodes for hosts c33.cs.wisc.edu and c34.cs.wisc.edu, processes myapp{1272}, myapp{1273}, myapp{7624}, and myapp{7625}, and functions main, A, B, C, D; one sub-graph also contains E]

25

Search History Graphs

Search History Graph is effective for presenting search-based performance diagnosis results…

…but it does not scale to a large number of processes because it shows one sub-graph per process

26

Sub-Graph Folding Algorithm

Combines host-specific sub-graphs into composite sub-graphs

Each composite sub-graph represents a behavioral category among application processes

Dynamic clustering of processes by qualitative behavior
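The folding idea can be sketched as grouping processes whose search sub-graphs have the same shape. This is a deliberately simplified illustration, not the published algorithm: representing a sub-graph as a set of function paths, requiring an exact match of shapes, and the assignment of processes to hosts below are all assumptions of this sketch.

```python
from collections import defaultdict


def fold(subgraphs):
    """Group per-process sub-graphs by shape.

    subgraphs maps (host, process) to the function paths that tested true.
    Returns {frozenset(paths): sorted list of (host, process)}; each entry
    plays the role of one composite sub-graph (one behavioral category).
    """
    categories = defaultdict(list)
    for owner, paths in subgraphs.items():
        categories[frozenset(paths)].append(owner)
    return {shape: sorted(owners) for shape, owners in categories.items()}


# Hypothetical search results: three processes behave alike, and one
# additionally has a true experiment for function E.
COMMON = ["main", "main/A", "main/B", "main/A/C", "main/A/C/D"]
SUBGRAPHS = {
    ("c33.cs.wisc.edu", "myapp{1272}"): COMMON,
    ("c33.cs.wisc.edu", "myapp{1273}"): COMMON + ["main/A/C/E"],
    ("c34.cs.wisc.edu", "myapp{7624}"): COMMON,
    ("c34.cs.wisc.edu", "myapp{7625}"): COMMON,
}

folded = fold(SUBGRAPHS)
for shape, owners in folded.items():
    print(len(owners), "process(es):", sorted(shape))
```

The number of composite entries depends on how many distinct behaviors exist, not on how many processes ran, which is why a folded display can stay readable as the process count grows.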

27

SGFA: Example

[Diagram: search history graph rooted at CPUbound, with nodes for hosts c33.cs.wisc.edu and c34.cs.wisc.edu, processes myapp{1272}, myapp{1273}, myapp{7624}, and myapp{7625}, and functions main, A, B, C, D, E, shown with its folded form, in which the host node is labeled c*.cs.wisc.edu and the process node myapp{*}]

28

SGFA: Implementation

• Custom MRNet filter
• Filter in each MRNet process keeps folded graph of search results from all reachable daemons
• Updates periodically sent upstream
• By induction, filter in front-end holds entire folded graph
• Optimization for unchanged graphs
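A rough sketch of how such a filter might behave is given below in plain Python. It does not use the MRNet filter interface, and the `FoldingFilter` class, its method names, the shape tuples, and the process names are hypothetical. The "unchanged graphs" optimization is modeled simply as sending nothing upstream when the folded graph is identical to the last one sent.

```python
class FoldingFilter:
    """Keeps a folded graph: shape -> set of processes with that shape."""

    def __init__(self):
        self.folded = {}
        self.last_sent = None

    def receive(self, update):
        """Merge an update from a child (a daemon or another filter)."""
        for shape, owners in update.items():
            self.folded.setdefault(shape, set()).update(owners)

    def flush(self):
        """Return the folded graph to send upstream, or None if unchanged."""
        snapshot = {shape: frozenset(owners)
                    for shape, owners in self.folded.items()}
        if snapshot == self.last_sent:
            return None
        self.last_sent = snapshot
        return snapshot


# Hypothetical shapes (tuples of function names) and process names.
S1 = ("main", "A", "C", "D")
S2 = ("main", "A", "C", "D", "E")

internal0, internal1, front_end = FoldingFilter(), FoldingFilter(), FoldingFilter()
internal0.receive({S1: {"p0"}})
internal0.receive({S1: {"p1"}})
internal1.receive({S1: {"p2"}})
internal1.receive({S2: {"p3"}})

# Periodic upstream step: each internal filter forwards its folded graph.
for child in (internal0, internal1):
    update = child.flush()
    if update is not None:
        front_end.receive(update)

print({shape: sorted(owners) for shape, owners in front_end.folded.items()})
print(internal0.flush())  # nothing changed since the last flush
```

Because every filter folds what it receives before forwarding, the front-end merges one already-folded graph per child rather than one sub-graph per process.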

29

Outline

• Performance Consultant
• MRNet
• Distributed Performance Consultant
• Sub-Graph Folding Algorithm
• Evaluation
• Summary

30

DPC + SGFA: Evaluation

• Modified Paradyn to perform bottleneck searches using CA, PDA, or TDA approach
• Modified instrumentation cost tracking to support PDA
  - Track global, per-process instrumentation cost separately
  - Simple fixed-partition policy for scheduling global and local instrumentation
• Implemented Sub-Graph Folding Algorithm as custom MRNet filter to support TDA (used by all)
• Instrumented front-end, daemons, and MRNet internal processes to collect CPU, I/O load information
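One way to picture a fixed-partition policy is a cost budget split into two independent shares, one for global and one for local instrumentation. The slides do not give the actual policy, so everything here is an assumption made for illustration: the `CostBudget` class, its method names, the integer cost units, and the 4/6 split.

```python
class CostBudget:
    """Tracks global and local instrumentation cost against fixed shares."""

    def __init__(self, global_cap, local_cap):
        self.cap = {"global": global_cap, "local": local_cap}
        self.used = {"global": 0, "local": 0}

    def try_enable(self, kind, cost):
        """Admit new instrumentation only if its own share has room."""
        if self.used[kind] + cost > self.cap[kind]:
            return False
        self.used[kind] += cost
        return True

    def disable(self, kind, cost):
        """Release cost when instrumentation is removed."""
        self.used[kind] = max(0, self.used[kind] - cost)


budget = CostBudget(global_cap=4, local_cap=6)   # hypothetical cost units
results = [
    budget.try_enable("global", 3),   # fits in the global share
    budget.try_enable("global", 2),   # would exceed the global share
    budget.try_enable("local", 6),    # local share is tracked separately
]
budget.disable("global", 3)
results.append(budget.try_enable("global", 2))   # fits after the release
print(results)
```

The point of a fixed split in this sketch is that a busy local search cannot starve the global search of instrumentation budget, or the reverse.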

31

DPC + SGFA: Evaluation

• su3_rmd
  - QCD pure lattice gauge theory code
  - C, MPI
• Weak scaling scalability study
• LLNL MCR cluster
  - 1152 nodes (1048 compute nodes)
  - Two 2.4 GHz Intel Xeons per node
  - 4 GB memory per node
  - Quadrics Elan3 interconnect (fat tree)
  - Lustre parallel file system

32

DPC + SGFA: Evaluation

PDA and TDA: bottleneck searches with up to 1024 processes so far, limited by partition size

CA: scalability limit at less than 64 processes

Similar qualitative results from all approaches

33-41

DPC: Evaluation

[Nine slides share this title; their bodies are not present in the transcript]

42

SGFA: Evaluation

43

Summary

• Tool scalability is critical for effective use of large-scale computing resources
• On-line automated performance tools are especially important at large scale
• Our approach:
  - MRNet
  - Distributed Performance Consultant (TDA) plus Sub-Graph Folding Algorithm

44

References

P.C. Roth, D.C. Arnold, and B.P. Miller, “MRNet: a Software-Based Multicast/Reduction Network for Scalable Tools,” SC 2003, Phoenix, Arizona, November 2003

P.C. Roth and B.P. Miller, “The Distributed Performance Consultant and the Sub-Graph Folding Algorithm: On-line Automated Performance Diagnosis on Thousands of Processes,” in submission

Publications available from http://www.paradyn.org

MRNet software available from http://www.paradyn.org/mrnet
