predicting communication...

PREDICTING COMMUNICATION

PERFORMANCE

Nikhil JainCASC Seminar, LLNL

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. This work was funded by the Laboratory Directed Research and Development Program at LLNL under project tracking code 13-ERD-055

1

1Thursday, June 27, 13

SUPERCOMPUTERS

2


SUPERCOMPUTERS

2

48 GB/s, 1-2 microsec 40 GB/s, 1-3 microsec 150 GB/s, 0.8 microsec

420 GB/s, 1-2 microsec


SUPERCOMPUTERS

2

48 GB/s, 1-2 microsec 40 GB/s, 1-3 microsec 150 GB/s, 0.8 microsec

420 GB/s, 1-2 microsec

Larger BandwidthLower Latency

Fewer hops


WHY STUDY NETWORK PERFORMANCE?

3



Peak bandwidth and latency are never realized in presence of congestion

3




High raw bandwidth does not guarantee proportionate observed performance

Blue Gene vs Cray’s Gemini

3






Savings are proportionate to core-count

3






Savings are proportionate to core-count

Most importantly, as a graduate student, I do what I am asked to do!

3


QUANTIFYING IMPACT

Mapping via logical operations in Rubik

What about others mappings?

How far are we from the best?

4

0

200

400

600

800

1000

2048 4096 8192 16384 32768 65536

Tim

e pe

r ite

ratio

n (s

)

Number of cores

Execution time for different mappings of pF3D

Default MapBest Map 60%

A. Bhatele, et al Mapping applications with collectives over sub-communicators on torus networks. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’12. IEEE Computer Society, Nov. 2012 (to appear). LLNL-CONF-556491.


ALTERNATIVES

5

*Abhinav Bhatele, Nikhil Jain, William D. Gropp, and Laxmikant V. Kale. 2011b. Avoiding hot-spots on two- level direct networks. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’11). ACM, New York, NY, USA, 76:1–76:11.


ALTERNATIVES

Theoretically: NP hard

Simulations: too slow

15 days to simulate one use case*

5



ALTERNATIVES

Theoretically: NP hard

Simulations: too slow

15 days to simulate one use case*

Real runs: very expensive

Application/allocation specific information

5


2012 2013Intrepid 4.16M 0.73M

Mira 0.17M 7.67M

Total 4.33M 8.40M

13 million core hours!


HEURISTICS - KNOWN METRICS

6

20

30

40

50

60

0 2 4 6 8 10

Tim

e pe

r ite

ratio

n (m

s)

Maximum Dilation

20

30

40

50

60

8e8 1.2e9 1.6e9 2e9

Average bytes per link

20

30

40

50

60

1e09 3e9 6e9 9e9

Maximum bytes on a link

Figure 1: Performance variation with prior metrics. A large variation in performance is observed for the same value of the metric inall three cases.

In Section 2, we describe the common metrics used in literature andmotivate the need for more precise metrics. Sources of contentionon torus networks, methodology for collecting hardware countersdata, and new proposed metrics are discussed in Section 3. Thebenchmarks and supervised learning techniques used in the paperand the measures of prediction success are described in Section 4.In Sections 5, 6, 7, we present results using prior metrics, new met-rics and their combinations. We conclude our work in Section 8.

2. BACKGROUND AND MOTIVATIONSeveral metrics have been proposed in the literature to evaluate taskmappings offline. Let us assume a guest graph, G = (Vg, Eg)

(communication graph between tasks in a parallel application) anda host graph, H = (Vh, Eh) (network topology of the parallel ma-chine). M defines a mapping of the guest graph, G on the hostgraph, H . The earliest metric that was used to compare the effec-tiveness of task mappings is dilation [3, 12]. Dilation for a mappingM can be defined as,

dilation(M) = max

ei2Egdi(M) (1)

where di is the dilation of the edge ei for a mapping M . Dilation ofan edge ei is the number of hops between the end-points of the edgewhen mapped to the host graph. This metric aims at minimizing thelength of the longest wire in a circuit [3]. We will refer to this asmaximum dilation to avoid any confusion. We can also calculatethe average dilation per edge for a mapping as,

average dilation-per-edge(M) =

Pei2Eg

di(M)

|Eg|(2)

Hoefler and Snir overload dilation to describe the “expected” dila-tion for an edge and “average” dilation for a mapping [11]. Theirdefinition of expected dilation for an edge can be reduced to equa-tion 1 above by assuming that messages are only routed on shortestpaths, which is true for the IBM Blue Gene and Cray XT/XE fam-ily (if all nodes are in a healthy state). The average dilation metric,as coined by Hoefler and Snir, is a weighted dilation and has beenpreviously referred to as the hop-bytes metric by Sadayappan [9] in1988 and Agarwal in 2006 [2]. Hop-bytes is the weighted sum ofthe edge dilations where the weights are the message sizes. Hop-bytes can be calculated by the equation,

hop-bytes(M) =

X

ei2Eg

di(M)⇥ wi (3)

where di is the dilation of edge ei and wi is the weight (message

size in bytes) of edge ei.

Hop-bytes gives an indication of the overall communication trafficbeing injected on to the network. We can derive two metrics basedon hop-bytes: the average number of hops traveled by each byte onthe network,

average hops-per-byte(M) =

Pei2Eg

di(M)⇥ wiP

ei2Egwi

(4)

and the average number of bytes that pass through a hardware link,

average bytes-per-link(M) =

Pei2Eg

di(M)⇥ wi

|Eh|(5)

The former gives an indication of how far each byte has to travelon average. The latter gives an indication of the average load orcongestion on a hardware link on the network. They are derivedmetrics (from hop-bytes) and all three are practically equivalentwhen used for prediction. In the rest of the paper, we use averagebytes-per-link.

Another metric that indicates congestion on network links is themaximum number of bytes going through any link on the network,

maximum bytes(M) = max

li2Eh

(

X

ej2Eg |ej=)li

wj) (6)

where ej =) li represents that edge ej in the guest graph goesthrough edge (link) li in the host graph (network). Hoefler andSnir use a second metric in their paper [11], worst case congestion,which is the same as equation 6 above.

We conducted a simple experiment with three of these metrics de-scribed above – maximum dilation, average bytes-per-link and max-imum bytes on a link to analyze their correlation with applicationperformance. Figure 1 shows the communication time for one it-eration of a two-dimensional halo exchange versus the three met-rics in different plots. Although the coefficient of determinationvalues (R2, metric used for prediction success) are high, there isa significant variation in the y-values for different points with thesame x-value. For example, in the maximum bytes plot (right), forx = 6e9, there are mappings with performance varying from 20 to50 ms. These variations make predicting performance using simplemodels with a reasonably high accuracy (±5% error) difficult. Thismotivated us to find new metrics and ways to improve the correla-tion between metrics and application performance.

2D-Halo: predicting performance using a linear regression model for known metrics


SUPERVISED LEARNING

Collect/generate data and summarize

Build models: train performance prediction based on independent metrics

Predict

7


SUPERVISED LEARNING

Collect/generate data and summarize

Build models: train performance prediction based on independent metrics

Predict

7

��

��

��

��

��

��

�� !�� "��#�


COMMUNICATION

8


COMMUNICATION

8

Injection FIFO Contention


COMMUNICATION

8


Link Contention


COMMUNICATION

8


Link Contention

Receive Buffer Contention


COMMUNICATION

8


Link Contention


Reception FIFOContention


COMMUNICATION

8


Link Contention


Reception FIFOContention

MemoryContention

MemoryContention


NETWORK COUNTERS OF BLUEGENE/Q

9


A PMPI based BGQ-Counter collection module


9



Packets sent on links in specific directions: A, B, C, D, E

deterministic, dynamic


9

AB

C





Packets received on a link


9

AB

C





Packets received on a link

Packets in buffers


9

AB

C


ANALYTICAL TOOL

10


ANALYTICAL TOOL

Simulate the injection mechanism

Selection of memory injection FIFO

Mapping of memory FIFO to network injection FIFO

10


ANALYTICAL TOOL

Simulate the injection mechanism

Selection of memory injection FIFO

Mapping of memory FIFO to network injection FIFO

Simulate routing to obtain hops/dilation

10


SUPERVISED LEARNING

11


SUPERVISED LEARNING

Collect raw data - various entities, e.g. bytes on a link, and the observed performance.

11


SUPERVISED LEARNING


Derive metrics from the raw data on entities, e.g. average of bytes on links.

11


SUPERVISED LEARNING



Create a database of derived metrics and performance; we have used 100 mappings.

11


SUPERVISED LEARNING



Create a database of derived metrics and performance; we have used 100 mappings.

Select two-third entries as training set; includes derived metrics and performance.

11


SUPERVISED LEARNING

12


SUPERVISED LEARNING

The training set is used to create a model for prediction

12


SUPERVISED LEARNING


Remaining entries from the database are used as the test set - only derived metrics.

12


SUPERVISED LEARNING



Prediction is compared with observed values.

12


SUPERVISED LEARNING




Experimented with a large number of algorithms - linear, bayesian, SVM, near-neighbors etc.

12


SUPERVISED LEARNING




Experimented with a large number of algorithms - linear, bayesian, SVM, near-neighbors etc.

12

http://scikit-learn.org




SUPERVISED LEARNING

13


SUPERVISED LEARNING

13

X[1] <= 0.4295

X[0] <= 0.0082 X[0] <= 0.2857

X[1] <= 0.0077 X[0] <= 0.0176 X[0] <= 0.1905 leaf

leaf

X[1] <= 0.0212

leaf

Rest of the tree

Decision trees


SUPERVISED LEARNING

13

X[1] <= 0.4295

X[0] <= 0.0082 X[0] <= 0.2857

X[1] <= 0.0077 X[0] <= 0.0176 X[0] <= 0.1905 leaf

leaf

X[1] <= 0.0212

leaf

Rest of the tree

feature 1

feat

ure

2

Decision trees Randomized forest of trees


HOW TO JUDGE A PREDICTION

Rank Correlation Coefficient (RCC): fraction of the number of pairs of task mappings whose ranks are in the same partial order in predicted and observed performance list

Absolute Correlation

Higher the better!

14

X[1] <= 0.4295

X[0] <= 0.0082 X[0] <= 0.2857

X[1] <= 0.0077 X[0] <= 0.0176 X[0] <= 0.1905 leaf

leaf

X[1] <= 0.0212

leaf

Rest of the tree

(a) Decision tree

feature 1

feat

ure

2

(b) Random forests

Figure 3: Example decision tree and random forests generated using scikit.

by the input features. Each color in the figure is a leaf region of oneof the decision trees in the random forest generated by fitting thetraining set. The white circles are the test set being predicted. Topredict the target for these unseen samples, every decision tree istraversed to find the right leaf node that provides a possible tar-get value. Having obtained a set of possible target values from thedecision trees, they are combined to provide the predicted value.

4.4 Metrics for prediction successThe goodness or success of the prediction function (also referredto as the score) can be evaluated using different metrics depend-ing on the definition of success. Our main goal is to compare theperformance of two mappings and determine the correct orderingbetween the mappings in terms of performance. Hence, we focuson a rank correlation metric for determining success but we alsopresent results for a metric that compares absolute values.Rank Correlation Coefficient (RCC): Ranks are assigned to map-pings based on their position in two sorted sets (by execution timesfor observed and predicted performance). RCC is defined as theratio of the number of pairs of task mappings whose ranks werein the same partial order in both the sets to the total number ofpairs. In statistical parlance, RCC equals the ratio of the numberof concordant pairs to that of all pairs (Kendall’s Tau [1]). For-mally speaking, if observed ranks of tasks mappings are given by{x1, x2, · · · , xn}, and the predicted ranks by {y1, y2, · · · , yn},we define RCC as:

concord ij =

8><

>:

1, if xi >= xj & yi >= yj

1, if xi < xj & yi < yj

0, otherwise

RCC =

⇣ X

0<=i<n

X

0<=j<i

concordij⌘/(

n(n� 1)

2

)

Absolute Correlation (R2): To predict the success for absolutepredicted values, we use the coefficient of determination from statis-tics, R-squared,

R2(y, y) = 1�

Pi(yi � yi)

2

Pi(yi � y)2

where yi is the predicted value of the ith sample, yi is the corre-sponding true value, and

y =

1

nsamples

X

i

yi

5. PERFORMANCE PREDICTION OF COM-MUNICATION KERNELS

In this section, we present predictions for the execution times ofcommunication kernels (Section 4.2) for different task mappings.

5.1 Performance variation with mappingFigure 4 presents the execution times for the three benchmarks forfour message sizes – 8 bytes, 512 bytes, 16 KB and 4 MB. Thesesizes represent the amount of data exchanged between a pair ofMPI processes. For example, for 2D Halo, this number is the sizeof a message sent by an MPI process to each of its four neighborsin 2D. For a particular message size, a point on the plot representsthe execution time (on the y-axis) for a mapping (on the x-axis).

For 2D Halo, Figure 4(a) shows that for small messages such as 8and 512 bytes, mapping has an insignificant impact. As the mes-sage size increases to 16 KB, in addition to an increase in the run-time, we observe up to a 7⇥ difference in performance for the bestmapping in comparison to the worst mapping (note the log scaleon the y-axis). Similar variation is seen as we further increase themessage size to 4 MB. For a more communication intensive bench-mark, 3D Halo, we find that mapping impacts performance even formessages of size 512 bytes (Figure 4(b)). As we further increasethe communication in Sub A2A, the effect of task mapping is seeneven for the 8-byte messages as shown in Figure 4(c).

In the sections below, we do not present predictions for the caseswhere the performance variation due to mapping is statistically in-significant: 8- and 512-byte results in case of 2D Halo and 8-byteresults in case of 3D Halo.

5.2 Prior featuresWe begin with showing prediction results using prior metrics/ fea-tures (described in Section 2) and quantify the goodness of the fitor prediction using RCC and R2 (Section 4.4).

X[1] <= 0.4295

X[0] <= 0.0082 X[0] <= 0.2857

X[1] <= 0.0077 X[0] <= 0.0176 X[0] <= 0.1905 leaf

leaf

X[1] <= 0.0212

leaf

Rest of the tree

(a) Decision tree

feature 1

feat

ure

2

(b) Random forests

Figure 3: Example decision tree and random forests generated using scikit.

by the input features. Each color in the figure is a leaf region of oneof the decision trees in the random forest generated by fitting thetraining set. The white circles are the test set being predicted. Topredict the target for these unseen samples, every decision tree istraversed to find the right leaf node that provides a possible tar-get value. Having obtained a set of possible target values from thedecision trees, they are combined to provide the predicted value.

4.4 Metrics for prediction successThe goodness or success of the prediction function (also referredto as the score) can be evaluated using different metrics depend-ing on the definition of success. Our main goal is to compare theperformance of two mappings and determine the correct orderingbetween the mappings in terms of performance. Hence, we focuson a rank correlation metric for determining success but we alsopresent results for a metric that compares absolute values.Rank Correlation Coefficient (RCC): Ranks are assigned to map-pings based on their position in two sorted sets (by execution timesfor observed and predicted performance). RCC is defined as theratio of the number of pairs of task mappings whose ranks werein the same partial order in both the sets to the total number ofpairs. In statistical parlance, RCC equals the ratio of the numberof concordant pairs to that of all pairs (Kendall’s Tau [1]). For-mally speaking, if observed ranks of tasks mappings are given by{x1, x2, · · · , xn}, and the predicted ranks by {y1, y2, · · · , yn},we define RCC as:

concord ij =

8><

>:

1, if xi >= xj & yi >= yj

1, if xi < xj & yi < yj

0, otherwise

RCC =

⇣ X

0<=i<n

X

0<=j<i

concordij⌘/(

n(n� 1)

2

)

Absolute Correlation (R2): To predict the success for absolutepredicted values, we use the coefficient of determination from statis-tics, R-squared,

R2(y, y) = 1�

Pi(yi � yi)

2

Pi(yi � y)2

where yi is the predicted value of the ith sample, yi is the corre-sponding true value, and

y =

1

nsamples

X

i

yi

5. PERFORMANCE PREDICTION OF COM-MUNICATION KERNELS

In this section, we present predictions for the execution times ofcommunication kernels (Section 4.2) for different task mappings.

5.1 Performance variation with mappingFigure 4 presents the execution times for the three benchmarks forfour message sizes – 8 bytes, 512 bytes, 16 KB and 4 MB. Thesesizes represent the amount of data exchanged between a pair ofMPI processes. For example, for 2D Halo, this number is the sizeof a message sent by an MPI process to each of its four neighborsin 2D. For a particular message size, a point on the plot representsthe execution time (on the y-axis) for a mapping (on the x-axis).

For 2D Halo, Figure 4(a) shows that for small messages such as 8and 512 bytes, mapping has an insignificant impact. As the mes-sage size increases to 16 KB, in addition to an increase in the run-time, we observe up to a 7⇥ difference in performance for the bestmapping in comparison to the worst mapping (note the log scaleon the y-axis). Similar variation is seen as we further increase themessage size to 4 MB. For a more communication intensive bench-mark, 3D Halo, we find that mapping impacts performance even formessages of size 512 bytes (Figure 4(b)). As we further increasethe communication in Sub A2A, the effect of task mapping is seeneven for the 8-byte messages as shown in Figure 4(c).

In the sections below, we do not present predictions for the caseswhere the performance variation due to mapping is statistically in-significant: 8- and 512-byte results in case of 2D Halo and 8-byteresults in case of 3D Halo.

5.2 Prior featuresWe begin with showing prediction results using prior metrics/ fea-tures (described in Section 2) and quantify the goodness of the fitor prediction using RCC and R2 (Section 4.4).


RESULTS

Three communication kernel

Five-point 2D stencil

14-point 3D stencil

Sub-communicator all-to-all

Four message sizes to span MPI and routing protocols

15


KNOWN METRICS

EntitiesBytes on a linkDilation

Derivation Methods MaximumAverage Sum

16

20

30

40

50

60

1e09 3e9 6e9 9e9

Maximum bytes on a link

Figure 1: Performance variation with prior metrics. A large variation in performance is observed for the same value of the metric inall three cases.

In Section 2, we describe the common metrics used in literature andmotivate the need for more precise metrics. Sources of contentionon torus networks, methodology for collecting hardware countersdata, and new proposed metrics are discussed in Section 3. Thebenchmarks and supervised learning techniques used in the paperand the measures of prediction success are described in Section 4.In Sections 5, 6, 7, we present results using prior metrics, new met-rics and their combinations. We conclude our work in Section 8.

2. BACKGROUND AND MOTIVATIONSeveral metrics have been proposed in the literature to evaluate taskmappings offline. Let us assume a guest graph, G = (Vg, Eg)

(communication graph between tasks in a parallel application) anda host graph, H = (Vh, Eh) (network topology of the parallel ma-chine). M defines a mapping of the guest graph, G on the hostgraph, H . The earliest metric that was used to compare the effec-tiveness of task mappings is dilation [3, 12]. Dilation for a mappingM can be defined as,

dilation(M) = max

ei2Egdi(M) (1)

where di is the dilation of the edge ei for a mapping M . Dilation ofan edge ei is the number of hops between the end-points of the edgewhen mapped to the host graph. This metric aims at minimizing thelength of the longest wire in a circuit [3]. We will refer to this asmaximum dilation to avoid any confusion. We can also calculatethe average dilation per edge for a mapping as,

average dilation-per-edge(M) =

Pei2Eg

di(M)

|Eg|(2)

Hoefler and Snir overload dilation to describe the “expected” dila-tion for an edge and “average” dilation for a mapping [11]. Theirdefinition of expected dilation for an edge can be reduced to equa-tion 1 above by assuming that messages are only routed on shortestpaths, which is true for the IBM Blue Gene and Cray XT/XE fam-ily (if all nodes are in a healthy state). The average dilation metric,as coined by Hoefler and Snir, is a weighted dilation and has beenpreviously referred to as the hop-bytes metric by Sadayappan [9] in1988 and Agarwal in 2006 [2]. Hop-bytes is the weighted sum ofthe edge dilations where the weights are the message sizes. Hop-bytes can be calculated by the equation,

hop-bytes(M) =

X

ei2Eg

di(M)⇥ wi (3)

where di is the dilation of edge ei and wi is the weight (message

size in bytes) of edge ei.

Hop-bytes gives an indication of the overall communication trafficbeing injected on to the network. We can derive two metrics basedon hop-bytes: the average number of hops traveled by each byte onthe network,

average hops-per-byte(M) =

Pei2Eg

di(M)⇥ wiP

ei2Egwi

(4)

and the average number of bytes that pass through a hardware link,

average bytes-per-link(M) =

Pei2Eg

di(M)⇥ wi

|Eh|(5)

The former gives an indication of how far each byte has to travelon average. The latter gives an indication of the average load orcongestion on a hardware link on the network. They are derivedmetrics (from hop-bytes) and all three are practically equivalentwhen used for prediction. In the rest of the paper, we use averagebytes-per-link.

Another metric that indicates congestion on network links is themaximum number of bytes going through any link on the network,

maximum bytes(M) = max

li2Eh

(

X

ej2Eg |ej=)li

wj) (6)

where ej =) li represents that edge ej in the guest graph goesthrough edge (link) li in the host graph (network). Hoefler andSnir use a second metric in their paper [11], worst case congestion,which is the same as equation 6 above.

We conducted a simple experiment with three of these metrics de-scribed above – maximum dilation, average bytes-per-link and max-imum bytes on a link to analyze their correlation with applicationperformance. Figure 1 shows the communication time for one it-eration of a two-dimensional halo exchange versus the three met-rics in different plots. Although the coefficient of determinationvalues (R2, metric used for prediction success) are high, there isa significant variation in the y-values for different points with thesame x-value. For example, in the maximum bytes plot (right), forx = 6e9, there are mappings with performance varying from 20 to50 ms. These variations make predicting performance using simplemodels with a reasonably high accuracy (±5% error) difficult. Thismotivated us to find new metrics and ways to improve the correla-tion between metrics and application performance.


RESULTSKNOWN METRICS

17

��

��

��

��

��

��

��

��

��

��

��

��

��

��

�� !� �� "#��$�� $��


RESULTSKNOWN METRICS

17

��

��

��

��

��

��

��

��

��

��

��

��

��

��

�� !� �� "#��$�� $��

max bytes is good, but

incorrect in 10% cases


NEW METRICS

18


NEW METRICS

EntitiesBuffer length (on intermediate nodes)FIFO length (packets in injection FIFO)Delay per link (packets in buffer/packets received)

18


NEW METRICS

EntitiesBuffer length (on intermediate nodes)FIFO length (packets in injection FIFO)Delay per link (packets in buffer/packets received)

Derivation methodsAverage Outliers (AO)Top Outliers (TO)

18


��

��

��

��

��

��

��

��

�� !� ��"#��$�� $��

�"#��"#��%&

��!� ��&

�"#��$��&�"#��$��%&

RESULTSNEW METRICS

19

��

��

��

��

��

��


HYBRID METRICS

20


HYBRID METRICS

Combine multiple metrics to complement each other

20


HYBRID METRICS

Combine multiple metrics to complement each other

Some combinationsavg bytes + max bytes + max FIFOavg bytes + max bytes + avg buffer + max FIFOavg bytes + avg buffer + avg delay AO + sum hopsavg bytes TO + avg buffer TO + avg delay TO + sum hops

20


RESULTSHYBRID METRICS

21

��

��

��

��

��

�

��

��

��

��

��

��

��

��

��

�� !� ��"#��$�� $��"#��

�"#��%&

��!� ��&�"#��$��&�"#��$��%&

��

��'�(��


RESULTSHYBRID METRICS

21

��

��

��

��

��

�

��

��

��

��

��

��

��

��

��

�� !� ��"#��$�� $��"#��

�"#��%&

��!� ��&�"#��$��&�"#��$��%&

��

��'�(��

hybrid metrics provide

high accuracy


RESULTS - TREND

22

Figure 7: Prediction success based on hybrid features from Table 3. We obtain RCC and R2 values exceeding 0.99 for 3D Halo andSub A2A. Prediction success improves significantly for 2D Halo also.

0.93 to 0.975 and 0.955 for the 16 KB and 4 MB message sizesrespectively. For the more communication intensive benchmarks,we obtained R2 values as high as 0.99 in general. Hence, the useof hybrid features not only predicts the correct pairwise orderingof mapping pairs but also does so with high accuracy in predictingtheir absolute performance.

5.5 SummaryFigure 8 presents the scatter-plot of predicted performance for thethree benchmarks for the 4 MB message size. On the x-axis are thetask mappings sorted by observed performance, while the y-axisis the predicted performance. The feature set H3: avg bytes, maxbytes, avg buffer, max FIFO was used for these predictions. It isevident from the figure that an almost perfect ordering is achievedfor all three benchmarks.

Figure 9 shows the prediction success for the three benchmarks on65,536 cores of BG/Q. From all the previously presented features(prior, new and hybrid), we selected the ones with the highest RCCscores for 16,384 cores, and present only those in this figure. Weobtain significant improvements in the prediction scores using hy-

brid features for prediction in comparison to single features such asmax bytes and avg bytes TO. For Sub A2A, RCC improved by 14%from 0.86 to 0.98 , with a RCC value of 1.00 for both 512 bytesand 4 MB message sizes. For 2D Halo and 3D Halo, an improve-ment of up to 8% was observed in the prediction success. Similartrends were observed for R2 values.

6. COMBINING ALL TRAINING SETSIn the previous section, we presented high correlation for predict-ing performance of the three benchmarks. For the prediction ofindividual benchmarks, the training and testing sets were generatedfrom the 84 different mappings of the same benchmark for a par-ticular message size on a fixed core count. In this section, we relaxthese requirements, and explore the space where the training andtesting sets are a mix of different benchmarks, message sizes andcore counts.

6.1 Combining samples from different kernelsWe first explore the use of training and testing sets that are a combi-nation of all three benchmarks and both 16 KB and 4 MB message

0 0.02 0.04 0.06 0.08 0.1

0.12 0.14

0 5 10 15 20 25 30Pred

icte

d pe

rfor

man

ce (

s)

Sorted mappings

0

0.5

1

1.5

2

2.5

0 5 10 15 20 25 30

Sorted mappings

0 1 2 3 4 5 6 7 8

0 5 10 15 20 25 30

Sorted mappings

Figure 8: Scatter plot of predicted performance (for 2D Halo, 3D Halo and Sub A2A in order) using hybrid features. Mappingssorted by observed performance are used as the x-axis.2D Halo 3D Halo Sub A2A


RESULTS

23

��

��

��

��

��

��

��

�� !��"�

�#�$�� #�$��!��"�

�#�$�� #�$��!��"�


RESULTS

24


RESULTS

24

Figure 9: Prediction success: summary for all benchmarks on 65,536 cores of BG/Q. H3: avg bytes & max bytes & avg buffer & maxFIFO; H5: avg bytes TO & avg buffer TO & avg delay AO & sum hops A0 & max FIFO

sizes. It is to be noted that the training and testing sets are now sixtimes the size of individual sets (336 vs. 56 for the training set and168 vs. 28 for the testing set). Figure 10 presents the predictionsuccess and the absolute number of mispredictions for this exper-iment. We present selected prior, new and hybrid features in thisexperiment.

High RCC values, such as 0.97 for avg bytes, suggests that thecombination of training sets results in a better prediction than theindividual cases. A comparison of the total number of mispredic-tions, presented in Figure 10, with the sum of mispredictions forindividual cases results in similar values. This suggests that scikitwas successful in classifying the sample data from different kindsof communication patterns and message sizes and in making goodpredictions using them. This suggests that if a large database con-sisting of different communication patterns and message sizes iscreated, predicting performance of different classes of applications(possibly with unknown communication structure) may be feasible.We leave an in-depth study of this aspect for future work.

6.2 Predicting performance on 65,536 cores us-ing 16,384-core samples

We also experimented with predicting the performance on 65,536cores using the same combined training set for 16,384 cores fromthe section above. We obtained a maximum RCC value of 0.975using the feature set H3: avg bytes, max bytes, avg buffer, maxFIFO. In terms of absolute number of pairs with the wrong order-ing, ⇠ 3200 pairs was mispredicted among a full set of 126756.

We find these results to be very encouraging since a strong corre-lation for predicting performance on large node counts using datafrom smaller jobs may provide a scalable method for performanceprediction. Using smaller systems to predict performance at scalehas several advantages. First, generating data sets is more feasiblein this regime because it consumes less resources. Second, man-ually generating various mappings for large systems is impracti-cal, but using prediction on smaller node counts, a large number ofmappings can be explored with low overhead.

0.6

0.7

0.8

0.9

1.0

avg bytes

max bytes

sum dilation AO

avg bytes TO

avg buffer

H1 H3 H4 H5 H6

RC

C

Rank correlation coefficient

Figure 10: Prediction success using combination of bench-marks as training and testing sets.

7. RESULTS WITH PF3DpF3D [17] is a multi-physics code used for studying laser plasma-interactions in the National Ignition Facility (NIF) experiments atLLNL. pF3D is a communication-heavy application and has beenshown to benefit significantly from task mapping on Blue Gene/P [6].This is the first attempt at mapping pF3D on Blue Gene/Q.

pF3D simulations consist of three distinct phases: wave propaga-tion and coupling, advecting light, and solving the hydrodynamicequations. The MPI processes are arranged in a 3D Cartesian grid

Combining all benchmarks


RESULTS

24

Figure 9: Prediction success: summary for all benchmarks on 65,536 cores of BG/Q. H3: avg bytes & max bytes & avg buffer & maxFIFO; H5: avg bytes TO & avg buffer TO & avg delay AO & sum hops A0 & max FIFO

sizes. It is to be noted that the training and testing sets are now sixtimes the size of individual sets (336 vs. 56 for the training set and168 vs. 28 for the testing set). Figure 10 presents the predictionsuccess and the absolute number of mispredictions for this exper-iment. We present selected prior, new and hybrid features in thisexperiment.

High RCC values, such as 0.97 for avg bytes, suggests that thecombination of training sets results in a better prediction than theindividual cases. A comparison of the total number of mispredic-tions, presented in Figure 10, with the sum of mispredictions forindividual cases results in similar values. This suggests that scikitwas successful in classifying the sample data from different kindsof communication patterns and message sizes and in making goodpredictions using them. This suggests that if a large database con-sisting of different communication patterns and message sizes iscreated, predicting performance of different classes of applications(possibly with unknown communication structure) may be feasible.We leave an in-depth study of this aspect for future work.

6.2 Predicting performance on 65,536 cores us-ing 16,384-core samples

We also experimented with predicting the performance on 65,536cores using the same combined training set for 16,384 cores fromthe section above. We obtained a maximum RCC value of 0.975using the feature set H3: avg bytes, max bytes, avg buffer, maxFIFO. In terms of absolute number of pairs with the wrong order-ing, ⇠ 3200 pairs was mispredicted among a full set of 126756.

We find these results to be very encouraging since a strong corre-lation for predicting performance on large node counts using datafrom smaller jobs may provide a scalable method for performanceprediction. Using smaller systems to predict performance at scalehas several advantages. First, generating data sets is more feasiblein this regime because it consumes less resources. Second, man-ually generating various mappings for large systems is impracti-cal, but using prediction on smaller node counts, a large number ofmappings can be explored with low overhead.

0.6

0.7

0.8

0.9

1.0

avg bytes

max bytes

sum dilation AO

avg bytes TO

avg buffer

H1 H3 H4 H5 H6

RC

C


Figure 10: Prediction success using combination of bench-marks as training and testing sets.

7. RESULTS WITH PF3DpF3D [17] is a multi-physics code used for studying laser plasma-interactions in the National Ignition Facility (NIF) experiments atLLNL. pF3D is a communication-heavy application and has beenshown to benefit significantly from task mapping on Blue Gene/P [6].This is the first attempt at mapping pF3D on Blue Gene/Q.

pF3D simulations consist of three distinct phases: wave propaga-tion and coupling, advecting light, and solving the hydrodynamicequations. The MPI processes are arranged in a 3D Cartesian grid

0.6

0.7

0.8

0.9

1.0

avg bytes

max bytes

sum dilation A0

avg bytes TO

avg buffer

H1 H3 H4 H5 H6

RC

C


Figure 11: Predicting performance on 65,536 cores using16,384 cores. RCC of up to 0.975 was achieved - 3,200 mis-predictions in 1,26,756 pairs (2.5%).

and all of the MPI communication is performed along different di-rections (X,Y, Z) of this grid. Wave propagation and couplingconsists of two-dimensional (2D) Fast Fourier Transforms (FFTs)in XY -planes; the 2D FFTs are performed via two non-overlapping1D FFTs along the X and Y directions using MPI_Alltoall.The advection phase involves planar exchange of data with neigh-bors in the Z-direction performed using MPI_Send and MPI_Recv.Finally, the hydrodynamic phase consists of near-neighbor data ex-change in the positive and negative X , Y and Z directions. TheFFT phase and the planar exchange in Z account for most of thetime spent in communication in pF3D. The logical 3D grid of pro-cesses used for pF3D in this paper was 16⇥ 8⇥ 128.

In Figure 12, we present RCC scores for predicting the performanceof pF3D on 16,384 cores of BG/Q. While avg bytes has a low RCC,max bytes correctly predicts the partial ordering for 91% of the taskmapping pairs. Interestingly, sum of dilations for messages that be-long to the average outlier set exhibits a high RCC of 0.94. Similarto the benchmarks, the hybrid features show strong correlation withperformance, and have RCCs exceeding 0.96 for this productionapplication. The highest correlation achieved is for the set H6: avgbytes TO, avg buffer AO, avg delay TO, avg delay AO, sum hopsA0, max FIFO that has an RCC of 0.995. The R2 values are sig-nificantly lower, in contrast with the benchmarks. For max bytes,the R2 value is only 0.76 which increases dramatically to 0.931 forthe hybrid set H6. The prediction results for pF3D on 65, 536 coreswere not as expected and we hope to add those numbers in the finalversion of the paper.

8. CONCLUSIONSignificant time and effort wasted in real runs to evaluate the per-formance of different task mappings suggests the use of simulationor metrics to predict application performance offline. Metrics usedpreviously in literature fall short in providing strong correlationswith execution time. In this paper, we demonstrate the use of newmetrics and machine learning techniques to predict performance ofparallel applications for different task mappings.

In addition to prior metrics, such as maximum bytes, we have de-veloped new metrics, such as buffer length and messages in in-jection FIFOs, to include the effects of contention for network re-sources other than links. Using a combination of these metrics,which includes average and maximum bytes on links, maximum

Figure 12: Prediction success for pF3D using a variety of prior,new, and hybrid metrics. RCC values are very high for the hy-brid metrics, as in previous examples, but are somewhat lowerfor prior and new single metrics. R2 values are lower on aver-age overall.

messages contending for an injection FIFO, and the average num-ber of packets in buffers, we show rank correlation coefficients ofup to 0.99, i.e. for only 1% of pairs the pairwise ordering is pre-dicted incorrectly. This signifies an improvement of 14% in theprediction success for an all-to-all over sub-communicators bench-mark and 8% for 2D and 3D halo exchange. In addition, predictionusing such hybrid metrics also shows high R2 scores which indi-cates good prediction in terms of absolute values.

We also successfully attempted combining of training and testingsets from different benchmarks and and still retaining high pre-diction accuracy. This suggests that if a large database consist-ing of different communication patterns and message sizes is cre-ated, predicting performance of different classes of applications(possibly with unknown communication structure) may be feasi-ble. More importantly, we show that using training sets from smallcore counts, we can predict performance at a larger count with anRCC value of 0.975. This may provide a scalable method for per-formance prediction at large scales and for future machines withouthaving to perform detailed network simulations. Finally, we havedemonstrated that supervised learning and ensemble methods canbe used to predict performance not only for simple communicationkernels but also for complex production applications with severaldiverse communication phases.

AcknowledgmentsThis work was performed under the auspices of the U.S. Depart-ment of Energy by Lawrence Livermore National Laboratory un-der Contract DE-AC52-07NA27344. This work was funded by theLaboratory Directed Research and Development Program at LLNLunder project tracking code 13-ERD-055.

Combining all benchmarks

Predicting performance on 65,536 cores using

16,384 cores data for training


RESULTS - PF3D

25


RESULTS - PF3D

25



In Figure 12, we present RCC scores for predicting the performanceof pF3D on 16,384 cores of BG/Q. While avg bytes has a low RCC,max bytes correctly predicts the pairwise ordering for 91% of thetask mapping pairs. Interestingly, sum of dilations for messagesthat belong to the average outlier set exhibits a high RCC of 0.94.Similar to the benchmarks, the hybrid features show strong corre-lation with performance, and have RCCs exceeding 0.96 for thisproduction application. The highest correlation achieved is for theset H6: avg bytes TO, avg buffer AO, avg delay TO, avg delay AO,sum hops A0, max FIFO that has an RCC of 0.995. The R2 val-ues are significantly lower, in contrast with the benchmarks. Formax bytes, the R2 value is only 0.76 which increases dramaticallyto 0.931 for the hybrid set H6. The prediction results for pF3Don 65, 536 cores were not as expected and we hope to add thosenumbers in the final version of the paper.



0.6

0.7

0.8

0.9

1.0

RC

C


0.6

0.7

0.8

0.9

1.0

avg bytes

max bytes

sum dilation AO

avg bytes TO

avg buffer

H1 H3 H4 H5 H6

R2

Absolute performance correlation




AcknowledgmentsThis work was performed under the auspices of the U.S. Depart-ment of Energy by Lawrence Livermore National Laboratory un-der Contract DE-AC52-07NA27344. This work was funded by theLaboratory Directed Research and Development Program at LLNLunder project tracking code 13-ERD-055 (LLNL-CONF-635857).


RESULTS - PF3D

25

22

23

24

25

26

27

28

29

30

31

0 5 10 15 20 25 30

Exec

utio

n T

ime

(s)

Mappings sorted by actual execution times

Blue Gene/Q (16,384 cores)

pF3D Observed pF3D Predicted



In Figure 12, we present RCC scores for predicting the performanceof pF3D on 16,384 cores of BG/Q. While avg bytes has a low RCC,max bytes correctly predicts the pairwise ordering for 91% of thetask mapping pairs. Interestingly, sum of dilations for messagesthat belong to the average outlier set exhibits a high RCC of 0.94.Similar to the benchmarks, the hybrid features show strong corre-lation with performance, and have RCCs exceeding 0.96 for thisproduction application. The highest correlation achieved is for theset H6: avg bytes TO, avg buffer AO, avg delay TO, avg delay AO,sum hops A0, max FIFO that has an RCC of 0.995. The R2 val-ues are significantly lower, in contrast with the benchmarks. Formax bytes, the R2 value is only 0.76 which increases dramaticallyto 0.931 for the hybrid set H6. The prediction results for pF3Don 65, 536 cores were not as expected and we hope to add thosenumbers in the final version of the paper.



0.6

0.7

0.8

0.9

1.0

RC

C


0.6

0.7

0.8

0.9

1.0

avg bytes

max bytes

sum dilation AO

avg bytes TO

avg buffer

H1 H3 H4 H5 H6

R2

Absolute performance correlation




AcknowledgmentsThis work was performed under the auspices of the U.S. Depart-ment of Energy by Lawrence Livermore National Laboratory un-der Contract DE-AC52-07NA27344. This work was funded by theLaboratory Directed Research and Development Program at LLNLunder project tracking code 13-ERD-055 (LLNL-CONF-635857).


SUMMARY

Communication is not just about peak latency/bandwidth

Simultaneous analysis of various aspects of network is important

Complex models are required for accurate prediction

There are patterns waiting to be identified!

26


FUTURE WORK

More applications!

More metrics

Weighted analysis

Offline prediction of entities

27


FUTURE WORK

More applications!

More metrics

Weighted analysis

Offline prediction of entities

27

Questions?


28


predicting communication...

Documents