
Hardware-Independent Application Characterization
(Extended abstract)

Scott Pakin
Applied Computer Science Group
Los Alamos National Laboratory
Email: [email protected]

Patrick McCormick
Applied Computer Science Group
Los Alamos National Laboratory
Email: [email protected]

The trend in high-performance computing is to include computational accelerators such as GPUs or Xeon Phis in each node of a large-scale system. Qualitatively, such accelerators tend to favor codes that perform large numbers of floating-point and integer operations per branch; that exhibit high degrees of memory locality; and that are highly data-parallel. The question we address in this work is how to quantify those characteristics. To that end we developed an application-characterization tool called Byfl that provides a set of “software performance counters”. These are analogous to the hardware performance counters provided by most modern processors but are implemented via code instrumentation: the equivalent of adding flops = flops + 1 after every floating-point operation, but in fact implemented by modifying the compiler’s internal representation of the code.
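To make the idea concrete, the sketch below shows in plain C the kind of counter increments that Byfl effectively injects; the bf_-prefixed counters and the dot-product kernel are our own illustrative stand-ins, not Byfl's actual implementation or API. (Byfl itself performs the equivalent insertion on the compiler's intermediate representation, so no source changes are needed.)

    /* Hand-written analogue of Byfl's instrumentation (a sketch; counter
       names and the kernel are hypothetical, not part of Byfl). */
    #include <stdio.h>

    static unsigned long long bf_flops = 0;       /* floating-point operations */
    static unsigned long long bf_load_bytes = 0;  /* bytes loaded from memory */

    double dot(const double *x, const double *y, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            bf_load_bytes += 2 * sizeof(double);  /* x[i] and y[i] */
            bf_flops += 2;                        /* one multiply, one add */
            sum += x[i] * y[i];
        }
        return sum;
    }

    int main(void)
    {
        double x[] = {1, 2, 3, 4}, y[] = {4, 3, 2, 1};
        printf("dot = %g\n", dot(x, y, 4));
        printf("flops = %llu, bytes loaded = %llu\n", bf_flops, bf_load_bytes);
        return 0;
    }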

The novelty of our approach is that we report counter values in a hardware-independent manner. Unlike similar tools based on, say, PAPI [1] or Pin [2], Byfl (a) reports the same data regardless of target platform and (b) operates on a machine abstraction that is closer to the application developer’s view of the application than the hardware’s. For example, if a program includes a 64-bit scalar floating-point division, Byfl reports it as such, even if, as on Intel’s Knights Corner, the division is implemented in terms of dozens of primitive floating-point and integer instructions, mostly vector operations. Also, unlike static-analysis tools such as ROSE [3], Byfl can handle control flow and data dependencies that are not known until run time.

Byfl is not constrained to reporting only what hardware performance counters might report. Two metrics that Byfl reports that are worthy of additional mention are unique bytes and flop (or op) bits. Unique bytes represent the program’s memory footprint: the number of unique, byte-level addresses that the program accessed during a run, as opposed to the total number of bytes loaded and stored, often repeatedly. Flop bits is a creation of ours that attempts to rationalize the oft-quoted byte:flop ratio of an application. Consider the assignment “A ← B + C,” where each variable resides in memory. If the variables are all 32 bits wide, this assignment has a byte:flop ratio of 12. If, however, they are 64 bits wide, this same assignment has a byte:flop ratio of 24. In both cases, though, all data loaded from memory are fed into a floating-point operation, and all results of a floating-point operation are written to memory. The bits:flop-bit ratio normalizes this case to 1 by defining a flop bit as the number of bits consumed or produced by a floating-point operation (e.g., 192 for a 64-bit binary operator or 64 for a 32-bit unary operator).

TABLE I: Sample Byfl measurements (xRAGE, first of 1056 MPI ranks)

Raw measurements
        7,021,934,541  basic blocks
        5,273,766,649  conditional or indirect branches
      106,609,718,886  bytes loaded
       33,545,236,192  bytes stored
      140,154,955,078  total bytes
          568,213,078  unique bytes
        5,056,852,452  flops
       82,872,028,639  ops
      852,877,751,088  bits loaded
      268,361,889,536  bits stored
    1,121,239,640,624  total bits
        4,545,704,624  unique bits
      970,915,667,328  flop bits
    7,468,902,667,646  op bits

Derived measurements
       3.1781  loads per store
       0.9589  flops per conditional or indirect branch
      15.7140  ops per conditional or indirect branch
      27.7158  bytes per flop
       1.1548  bits per flop bit
       1.6912  bytes per op
       0.1501  bits per op bit
       0.1124  unique bytes per flop
       0.0047  unique bits per flop bit
       0.0069  unique bytes per op
       0.0006  unique bits per op bit
     246.6592  bytes per unique byte

Bit:flop-bit values less than 1 imply register reuse, and values greater than 1 imply data traffic that is not bound to floating-point operations. Op bits are the analogue to flop bits for all arithmetic and logical operations, not just floating-point.
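The arithmetic behind this normalization can be checked directly; the small program below (ours, purely illustrative) reproduces the byte:flop and bits:flop-bit figures quoted above for “A ← B + C”:

    /* Worked byte:flop vs. bits:flop-bit arithmetic for "A <- B + C" with
       all three operands resident in memory (illustrative, not Byfl code). */
    #include <stdio.h>

    static void ratios(int width_bits)
    {
        int bytes_moved = 3 * (width_bits / 8); /* load B, load C, store A */
        int flops = 1;                          /* the single addition */
        int bits_moved = 8 * bytes_moved;
        int flop_bits = 3 * width_bits;         /* two inputs plus one output */
        printf("%d-bit operands: %d bytes per flop, %.1f bits per flop bit\n",
               width_bits, bytes_moved / flops, (double)bits_moved / flop_bits);
    }

    int main(void)
    {
        ratios(32); /* prints 12 bytes per flop, 1.0 bits per flop bit */
        ratios(64); /* prints 24 bytes per flop, 1.0 bits per flop bit */
        return 0;
    }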

Table I presents the result of instrumenting the first process of a 1056-process run of the xRAGE radiation-hydrodynamics application [4] simulating an asteroid impact. The upper half of Table I represents raw data, and the lower half represents ratios of various raw-data values. As the data show, the instrumented process performed a total of 82 billion operations, of which 5 billion represented floating-point operations. “Bad” branches (those with statically unpredictable target addresses) also numbered 5 billion. Hence, one can conclude that xRAGE is a somewhat control-intensive application, as an average of only 15 operations (<1 floating-point operation) occur between potential pipeline flushes. xRAGE can therefore be expected to be challenging to optimize for most accelerators, which favor large stretches of branchless execution. On the other hand, the xRAGE process performed 140 billion bytes of memory accesses representing only 568 million distinct byte addresses. In other words, each byte of data was reused an average of 246 times. This is an encouraging metric from the perspective of minimizing data transfers between CPUs and accelerators.


Table I also shows bytes per flop. The xRAGE process consumed over 27 bytes of memory for every floating-point operation it performed. Examining instead the number of bits per flop bit, we see a ratio of 1.2, implying that more data were loaded/stored to/from main memory than were consumed/produced by a floating-point unit. While these values sound pessimistic, being beyond the capabilities of modern memory systems, the data also show that the xRAGE process consumed 1.7 bytes per arbitrary (not necessarily floating-point) operation or, arguably more meaningfully, 0.15 bits per operation bit. That is, the ALU operated on more than six bits for each bit loaded or stored, implying that register usage is approximately six times as prevalent as memory usage: not much, but still less worrisome than the initial 27 bytes/flop metric. (Also, caches and prefetchers may alleviate some of the bandwidth requirement.) Furthermore, the data show that xRAGE needs only 0.0069 unique bytes from memory for every operation it performs. That is, if xRAGE could fit its entire working set in cache, it would be able to perform 145 operations for every mandatory (“cold”) cache miss, a far more optimistic characterization of the application’s memory-performance needs than the venerable bytes-per-flop metric indicates.
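For readers who want to trace the numbers, the following snippet (ours, not part of Byfl) recomputes several of Table I’s derived measurements from its raw counters:

    /* Recomputing a few of Table I's derived measurements from the raw
       counters (illustrative; Byfl derives these itself at run time). */
    #include <stdio.h>

    int main(void)
    {
        double total_bytes  =  140154955078.0;
        double unique_bytes =     568213078.0;
        double flops        =    5056852452.0;
        double ops          =   82872028639.0;
        double total_bits   = 1121239640624.0;
        double flop_bits    =  970915667328.0;

        printf("bytes per flop:        %.4f\n", total_bytes / flops);        /* 27.7158 */
        printf("bits per flop bit:     %.4f\n", total_bits / flop_bits);     /* 1.1548 */
        printf("unique bytes per op:   %.4f\n", unique_bytes / ops);         /* 0.0069 */
        printf("bytes per unique byte: %.4f\n", total_bytes / unique_bytes); /* 246.6592 */
        printf("ops per cold miss:     %.1f\n", ops / unique_bytes);         /* ~146; the text
                                                     rounds via 1/0.0069 to 145 */
        return 0;
    }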

Because Byfl can gather data at a per-basic-block level, we can examine the variability in computational intensity across an entire run. Figure 1 presents a box-and-whiskers plot of the flop-bits:bit ratio for three applications (xRAGE [4], Chicoma [5], and S3D [6]) and the SPEC CPU2006 floating-point benchmarks [7]. The center line of each box represents the median flop-bits:bit ratio across all basic blocks. The top and bottom of each box represent, respectively, the 75th and 25th percentiles of the data. The upper and lower whiskers represent ±1.5 interquartile ranges (IQRs) from the box boundaries. Outliers beyond the whisker boundaries are plotted individually as circles. The horizontal line at y = 1 separates floating-point-intensive points from memory-intensive points; higher points are likely to be better for accelerators than lower points.
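The box statistics follow this standard construction; the sketch below (ours, using made-up ratios rather than measured data) computes the median, quartiles, and ±1.5-IQR whisker bounds for one program’s per-basic-block flop-bits:bit ratios:

    /* Box-and-whiskers statistics as described above, applied to an
       illustrative list of per-basic-block flop-bits:bit ratios. */
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp(const void *a, const void *b)
    {
        double d = *(const double *)a - *(const double *)b;
        return (d > 0) - (d < 0);
    }

    /* Linear-interpolation percentile of a sorted array. */
    static double pctl(const double *v, int n, double p)
    {
        double idx = p * (n - 1);
        int lo = (int)idx;
        double frac = idx - lo;
        return lo + 1 < n ? v[lo] + frac * (v[lo + 1] - v[lo]) : v[lo];
    }

    int main(void)
    {
        double r[] = {0.2, 0.5, 0.8, 0.9, 1.0, 1.1, 1.3, 2.0, 9.5};
        int n = sizeof r / sizeof r[0];
        qsort(r, n, sizeof r[0], cmp);

        double q1 = pctl(r, n, 0.25), med = pctl(r, n, 0.50), q3 = pctl(r, n, 0.75);
        double iqr = q3 - q1;
        printf("median %.2f, box [%.2f, %.2f], whiskers [%.2f, %.2f]\n",
               med, q1, q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr);
        /* Anything outside the whiskers (here, 9.5) is plotted as an outlier. */
        return 0;
    }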

Immediately apparent in Figure 1 is the abundance of outliers. This implies that constructing hardware requirements around an average computational intensity fails to take into consideration that some parts of an application are likely to be extremely memory-bound and will observe bad performance even on hardware with the “ideal” memory bandwidth based on the application’s average. Second, we can use a graph like Figure 1 to draw analogies between applications and benchmarks: both Chicoma and GemsFDTD exhibit a median computational intensity of zero; both xRAGE and dealII load or store approximately one memory bit for every bit produced or consumed by a floating-point unit; and both S3D and tonto compute slightly more than they access memory and see similar variance across basic blocks. All three applications see far more outliers than their corresponding benchmark, however.

Fig. 1: Computational intensity of various programs. [Figure omitted: box-and-whiskers plot of the per-basic-block flop-bits:bit ratio (y-axis, 0 to 15) for each program (x-axis): Chicoma, GemsFDTD, calculix, soplex, xRAGE, dealII, leslie3d, povray, S3D, tonto, milc, bwaves, zeusmp, sphinx3, namd, gromacs, lbm, cactusADM.]

The conclusions one can draw from this work are that (1) hardware-independent application characterization can be a useful counterpart to traditional hardware performance counters, binary modification, and source-to-source translation, (2) measuring compute intensity in terms of bits rather than bytes can reduce the misleading effect of different word widths and operand counts, (3) software performance counters enable applications to be evaluated for suitability to emerging architectures, and (4) applications tend to have more variability in compute intensity over time than do benchmarks, even those with similar average compute intensity.

Byfl is available from https://github.com/losalamos/Byfl.

REFERENCES

[1] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci, “A portable programming interface for performance evaluation on modern processors,” The International Journal of High Performance Computing Applications, vol. 14, no. 3, pp. 189–204, 2000, DOI: 10.1177/109434200001400303.

[2] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: Building customized program analysis tools with dynamic instrumentation,” in ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation (PLDI ’05), Chicago, Illinois, Jun. 11–15, 2005, DOI: 10.1145/1065010.1065034.

[3] D. Quinlan, “ROSE: Compiler support for object-oriented frameworks,” Parallel Processing Letters, vol. 10, no. 2–3 (June & September), pp. 215–226, 2000, DOI: 10.1142/S0129626400000214.

[4] M. Gittings, R. Weaver, M. Clover, T. Betlach, N. Byrne, R. Coker, E. Dendy, R. Hueckstaedt, K. New, W. R. Oakes, D. Ranta, and R. Stefan, “The RAGE radiation-hydrodynamic code,” Computational Science & Discovery, vol. 1, no. 1, Oct.–Dec. 2008, DOI: 10.1088/1749-4699/1/1/015005.

[5] J. Waltz, “Performance of a three-dimensional unstructured mesh compressible flow solver on NVIDIA Fermi-class graphics processing unit hardware,” International Journal for Numerical Methods in Fluids, vol. 72, no. 2, pp. 259–268, May 20, 2013, DOI: 10.1002/fld.3744.

[6] J. H. Chen, A. Choudhary, B. de Supinski, M. DeVries, E. R. Hawkes, S. Klasky, W. K. Liao, K. L. Ma, J. Mellor-Crummey, N. Podhorszki, R. Sankaran, S. Shende, and C. S. Yoo, “Terascale direct numerical simulations of turbulent combustion using S3D,” Computational Science & Discovery, vol. 2, no. 1, 2009, DOI: 10.1088/1749-4699/2/1/015001.

[7] J. L. Henning, “SPEC CPU2006 benchmark descriptions,” ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, Sep. 2006, DOI: 10.1145/1186736.1186737.
