
A Data-centric Profiler for Parallel Programs

Xu Liu and John Mellor-Crummey
Department of Computer Science

Rice University
Houston, TX, USA

{xl10, johnmc}@rice.edu

ABSTRACT

It is difficult to manually identify opportunities for enhancing data locality. To address this problem, we extended the HPCToolkit performance tools to support data-centric profiling of scalable parallel programs. Our tool uses hardware counters to directly measure memory access latency and attributes latency metrics to both variables and instructions. Different hardware counters provide insight into different aspects of data locality (or lack thereof). Unlike prior tools for data-centric analysis, our tool employs scalable measurement, analysis, and presentation methods that enable it to analyze the memory access behavior of scalable parallel programs with low runtime and space overhead. We demonstrate the utility of HPCToolkit's new data-centric analysis capabilities with case studies of five well-known benchmarks. In each benchmark, we identify performance bottlenecks caused by poor data locality and demonstrate non-trivial performance optimizations enabled by this guidance.

Categories and Subject Descriptors
C.4 [Performance of Systems]: Measurement techniques, Performance attributes.

General Terms
Performance, Measurement.

Keywords
Data-centric profiling, scalable profiler, data locality.

1. INTRODUCTION

In modern processors, accesses to deep layers of the memory hierarchy incur high latency. Memory accesses with high latency not only degrade performance but also increase energy consumption. For many programs, one can reduce average memory latency by staging data into caches and accessing it thoroughly before it is evicted. Access patterns that do so are said to have excellent data locality. There are two types of data locality common to both sequential and multithreaded programs: spatial and temporal. An access pattern exploits spatial locality when it accesses a memory location and then accesses nearby locations soon afterward. Typically, spatial locality goes unexploited when accessing data with a large stride or indirection. An access pattern exhibits temporal locality when it accesses individual memory locations multiple times.

Multi-socket systems have an extra layer in the memory hierarchy that poses an additional obstacle to performance. Processors in such systems are connected with a high bandwidth link, e.g., HyperTransport for AMD processors and QuickPath for Intel processors. To provide high aggregate memory bandwidth on multi-socket systems, memory is partitioned and some is directly attached to each processor. Each processor has its own memory controller to access directly-attached memory. Such an architecture has Non-Uniform Memory Access (NUMA) latency. Accesses to directly-attached memory are known as local accesses. A processor can also access memory attached to other processors; such accesses are known as remote accesses. Remote accesses have higher latency than local accesses, so each thread in a multi-threaded program must access local data for high performance. A thread that primarily performs local accesses is said to have NUMA locality.

Tools for identifying data locality bottlenecks use either simulation or measurement. Simulation-based data-centric tools, such as CPROF [23], MemSpy [27], ThreadSpotter [33], SLO [6, 5], and MACPO [32], instrument some or all memory accesses and compute the approximate memory hierarchy response with a cache simulator. There are two drawbacks to simulation. First, gathering information about memory accesses with pervasive instrumentation is expensive. Although sampling techniques for monitoring accesses, e.g., [34, 39], can reduce measurement overhead for instrumentation, they do so with some loss of accuracy. Second, the accuracy of simulation-based methods depends on the cache simulators they use. Since it is prohibitively expensive for cache simulators to model all details of memory hierarchies in modern multi-socket systems, the aforementioned tools make simplifying assumptions for simulators that reduce the tools' accuracy.

Unlike simulation-based tools, measurement-based tools collect performance metrics using hardware performance counters. Measurement-based tools can directly assess the performance of a program execution with low overhead. Measurement-based tools can be classified as code-centric and/or data-centric. In the best case, code-centric tools such as VTune [13], Oprofile [24], CodeAnalyst [3], and gprof [12] map performance metrics back to statements accessing data. Such code-centric information is useful for identifying problematic code sections but fails to highlight problematic variables. Data-centric tools, on the other hand, attribute performance metrics to variables or dynamic memory allocations. A tool that combines both data-centric and code-centric analysis is more powerful for pinpointing and diagnosing data locality problems. Section 2 elaborates on the motivation for data-centric tools.

A variety of data-centric tools currently exist; we discuss them in detail in Section 6. Some tools focus on data locality in sequential codes [25, 26, 7]; the others focus on NUMA problems in threaded codes [28, 20]; none of them supports comprehensive analysis of all kinds of data locality problems. Moreover, existing tools work on modest numbers of cores on a single node system; none of them tackles the challenge of scaling and is applicable across a cluster with many hardware threads on each node. Obviously, data-centric measurement tools must be scalable if they are to be used to study codes on modern supercomputers with non-trivial input data.

To address this challenge, we extended the HPCToolkit performance tools [2] with data-centric capabilities to measure and analyze program executions on scalable parallel systems. Our resulting tools have three unique capabilities for data-centric measurement and analysis.

• They report all data locality issues. HPCToolkit can measure and analyze memory latency in threaded programs running on multi-socket systems. Failing to exploit temporal, spatial, and NUMA data locality exacerbates memory latency.

• They work for large-scale hybrid programs that employ both MPI [29] and OpenMP [31]. HPCToolkit collects data-centric measurements for scalable parallel programs with low runtime and space overhead. Moreover, HPCToolkit aggregates measurement data across threads and processes in a scalable way.

• They provide intuitive analysis of results for optimization. HPCToolkit provides an intuitive graphical user interface that enables users to analyze data-centric metrics attributed to variables, memory accesses, and full calling contexts. It provides multiple data-centric views to highlight variables and attribute costs to instructions that access them.

HPCToolkit exploits hardware support for data-centric measurement on both AMD Opteron and IBM POWER processors. To evaluate the effectiveness of our methods, we employ our tools to study five parallel benchmarks, covering MPI, OpenMP, and hybrid programming models. With the help of HPCToolkit, we identified data locality problems in each of these benchmarks and improved their performance using insights gleaned with our tools.

The rest of this paper is organized as follows. Section 2 elaborates on the motivation for this work. Section 3 describes the hardware support that HPCToolkit uses for data- and code-centric analysis.

Figure 1: Code-centric profiling aggregates metrics for memory accesses in the same source line; data-centric profiling decomposes metrics by variable.

Section 4 presents the design and implementation of our new capabilities in HPCToolkit. Section 5 studies five well-known benchmarks and shows how HPCToolkit's data-centric analysis capabilities provide insights about data access patterns that are useful for tuning. Section 6 describes previous work on data-centric tools for analyzing data locality problems and distinguishes our approach. Finally, Section 7 summarizes our conclusions and outlines our future directions.

2. MOTIVATION

There are two principal motivations for extending HPCToolkit to support data-centric measurement and analysis of program executions on scalable parallel systems. In Section 2.1, we illustrate the importance of data-centric profiling for analyzing data locality problems. In Section 2.2, we describe why scalability has become an important concern for data-centric profilers.

2.1 Data-centric profiling

Two capabilities distinguish data-centric profiling from code-centric profiling. First, while code-centric profiling can pinpoint a source code line that suffers from high access latency, without a mapping from machine instructions to character positions on a source line (which to our knowledge hasn't been available since DEC's Alpha compilers), code-centric profilers can't distinguish the contribution to latency associated with different variables accessed by the line. In contrast, data-centric profiling can decompose the latency and attribute it to individual variables accessed by the same source line. Figure 1 illustrates the latency decomposition that data-centric profiling can provide. Data-centric methods can decompose the latency associated with line 4 and attribute it to different variables. From the percentage of latency associated with each variable, one can see that array C is of principal interest for data locality optimization. One can then continue to investigate the program to determine how to improve data locality for C.

for (i = 0; i < 100; i++) {

var[i] = malloc(size);

}

Figure 2: A loop allocating data objects in the heap.

Second, data-centric profiling aggregates metrics from all memory accesses that are associated with the same variable. Aggregate metrics can highlight problematic variables in a program. This can help one identify when a data layout is poorly matched to access patterns that manipulate the data, or pinpoint inefficient memory allocations on a NUMA architecture.
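To make the distinction concrete, consider a small loop analogous to the one decomposed in Figure 1; the array names and the indirect index are illustrative, not taken from any benchmark studied in this paper.

/* Hypothetical kernel: one source line touches three distinct arrays.
 * A code-centric profile attributes all memory latency to the single
 * line containing the statement; a data-centric profile can report,
 * for example, that most of the latency is incurred by the indirectly
 * indexed array C. */
void kernel(double *A, const double *B, const double *C,
            const int *idx, int n)
{
    for (int i = 0; i < n; i++)
        A[i] = B[i] + C[idx[i]];   /* A, B: unit stride; C: indirect */
}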

2.2 Scalability

Scalability is a significant concern for data-centric profilers because (a) application performance on scalable parallel systems is of significant interest to the HPC community; (b) memory latency is a significant factor that limits performance and increases energy consumption; and (c) it is desirable to study executions on large-scale data sets of interest rather than forcing application developers to construct representative test cases that can be studied on small systems. Unfortunately, existing data-centric profilers do not scale to highly-parallel MPI+OpenMP programs.

Data-centric profilers for scalable parallel systems should have low time and space overhead. Runtime overhead mainly comes from two parts: collecting performance metrics and tracking variable allocations. One can reduce the overhead of collecting metrics by sampling hardware performance monitoring units (described in Section 3) with a reasonable sampling period. However, the overhead of tracking variable allocations is difficult to control. If a program frequently allocates data, the overhead of tracking allocations can be unaffordable. Unfortunately, existing tools lack the capability to reduce such overhead; instead, they assume that programs don't frequently allocate heap data. We show a benchmark in our case study that contradicts this assumption.

Space overhead is also critical when profiling parallel executions. Systems like LLNL's Sequoia support millions of threads. If each thread in a large-scale execution generates 1MB of performance data, a million threads would produce a terabyte of performance data. Thus, for analysis at scale a compact profile is necessary to keep the size of measurement data manageable. However, many existing tools trace variable allocations and memory accesses [16, 20]. The size of memory access traces is proportional to execution time and the number of active threads. A trace of variable allocations can easily grow unaffordably large. Consider the code in Figure 2. A memory allocation is called 100 times in a loop, so 100 allocations would be recorded in a thread trace. However, if this loop is in an OpenMP parallel region executed by each MPI process, millions of allocations would be recorded in a large-scale execution on a system like LLNL's Sequoia.

Besides overhead, existing tools do not display data-centric results in a scalable way for effective analysis. Again, consider the code in Figure 2. Metrics may be dispersed among 100 allocations, without showing any hot spot. However, one might be interested in aggregating metrics for these 100 allocations per thread. If this loop is called by multiple threads and MPI processes, metrics associated with these data objects should be coalesced to highlight var as a problematic array.

To address these issues, we added scalable data-centric profiling support to the HPCToolkit performance tools. HPCToolkit employs novel techniques to scale data collection and presentation with low time and space overhead.

3. HARDWARE SUPPORT FOR DATA-CENTRIC PROFILING

To support accurate measurement and attribution of performance metrics on out-of-order processors, Dean et al. [9] developed instruction-based sampling (IBS). When using IBS, a performance monitoring unit (PMU) periodically selects an instruction for monitoring. As a monitored instruction moves through the execution pipeline, the PMU records information about the occurrence of key events (e.g., cache and TLB misses), latencies, and the effective address of a memory operand. When a sampled instruction completes, the PMU triggers an interrupt, signaling that details about a monitored instruction's execution are available, along with the address of the monitored instruction. The first tool using IBS was DEC's Digital Continuous Profiling Infrastructure (DCPI), a low overhead, system-wide, flat profiler [4].

Recent AMD Opteron processors (family 10h and successors) support instruction-based sampling [10]. Using IBS, one can analyze the execution of instructions that complete. The PMU records detailed information about an instruction as it executes, including the effective address (virtual and physical) for a load/store, details about the memory hierarchy response (e.g., cache and TLB miss or misalignment), and various latency measures.

IBM POWER5 and successors use a mechanism similar to IBS to count marked events [35]. When a marked event is counted, POWER processors set two special purpose registers. The sampled instruction address register (SIAR) records the precise instruction address of the sampled instruction. The sampled data address register (SDAR) records the effective address touched by a sampled instruction if this instruction is memory related. A PMU can be configured to trigger an interrupt when a marked event count reaches some threshold. When a sample is triggered, HPCToolkit can use SIAR and SDAR to attribute marked events to both code and data.

Other processors have data-centric measurement capabilities as well. Intel Nehalem processors and their successors support precise event-based sampling [14]. Event Address Registers (EAR) on Itanium processors also support data-centric measurement [15].

Common features for data-centric measurement are (1) support for sampling memory accesses, (2) one or more mechanisms for assessing the cost of an access in terms of performance events or access latency, and (3) support to identify a precise instruction pointer (IP) and an effective data address for a sampled access. We use events or latency to quantify access costs, the precise IP to associate costs with memory accesses, and the effective address to associate costs with variables.
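The fields below sketch, in C, the minimal information such a sample must carry to support this attribution; the struct is illustrative and does not correspond to any particular PMU's register layout.

#include <stdint.h>

/* Illustrative payload of one data-centric sample (not an actual PMU
 * register layout). The precise IP ties the cost to a machine-level
 * memory access; the effective address ties it to a variable. */
typedef struct {
    uint64_t precise_ip;    /* address of the sampled load/store */
    uint64_t effective_va;  /* virtual address the access touched */
    uint32_t latency;       /* access latency in cycles, if reported */
    uint32_t events;        /* bit set of events (cache/TLB miss, ...) */
} mem_sample_t;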

4. DATA-CENTRIC CAPABILITIES FOR HPCTOOLKIT

Figure 3 shows a simplified workflow that illustrates data-centric measurement and analysis in HPCToolkit. HPCToolkit consists of three principal components: an online call path profiler, a post-mortem analyzer, and a graphical user interface (GUI) for presentation.

Figure 3: A simplified workflow that illustrates data-centric measurement and analysis in HPCToolkit. Rectangles are components of HPCToolkit; ellipses are inputs or outputs of different components.

The profiler takes a fully optimized binary executable as its input. For each thread in a program execution, the profiler collects samples and associates costs with memory accesses and data objects. Section 4.1 describes the profiler. The post-mortem analyzer, described in Section 4.2, gathers all profiles collected by the profiler for each process and thread. It also analyzes the binaries, extracts information about static code structure, and maps profiles to source lines, loops, procedures, dynamic calling contexts, and variable names. Finally, the GUI, also described in Section 4.2, displays intuitive views of data-centric analysis results that highlight problematic data and accesses. Each component is designed to scale.

4.1 Online call path profiler

HPCToolkit's call path profiler is loaded into a monitored program's address space at link time for statically linked executables or at runtime using LD_PRELOAD for dynamically linked executables. As a program executes, HPCToolkit's profiler triggers samples, captures full calling contexts for sample events, tracks variables, and attributes samples to both code and variables. To minimize synchronization overhead during execution, each thread records its own profile. The following four sections describe the implementation of data-centric support for each of these capabilities.

4.1.1 Triggering samples

The profiler first programs each core's PMU to enable instruction-based sampling or marked event sampling with a pre-defined period. When a PMU triggers a sample event, the profiler receives a signal and reads PMU registers to extract performance metrics related to the sampled instruction. To map these performance metrics to both code and data, the profiler records the precise instruction pointer of the sampled instruction and the effective address of the sampled instruction if it accesses memory.
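HPCToolkit drives the PMU through platform-specific interfaces (IBS on AMD Opteron, marked events on POWER). As a rough, hedged illustration of the same idea, the Linux perf_event interface can request samples that carry a precise IP, an effective data address, an access weight (latency), and a call chain; the raw event encoding below is a placeholder, not a real IBS or marked-event code, and this is not necessarily the interface HPCToolkit uses.

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>

/* Sketch: open a sampling event for the calling thread whose samples
 * carry a precise IP, the effective data address, an access weight,
 * and a call chain. The raw config value is a placeholder. */
static int open_mem_sampling(unsigned long long period)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size           = sizeof(attr);
    attr.type           = PERF_TYPE_RAW;
    attr.config         = 0x0;              /* placeholder event encoding */
    attr.sample_period  = period;
    attr.sample_type    = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR |
                          PERF_SAMPLE_WEIGHT | PERF_SAMPLE_CALLCHAIN;
    attr.precise_ip     = 2;                /* request a skid-free IP */
    attr.exclude_kernel = 1;
    attr.disabled       = 1;
    /* monitor the calling thread on any CPU */
    return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}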

4.1.2 Capturing full calling contexts

HPCToolkit unwinds the call stack at each sample event. To do so, it uses on-the-fly binary analysis to locate the return address for each procedure frame on the call stack [38]. Call paths are entered into a calling context tree (CCT). The root of a CCT represents main or a thread start function; internal nodes represent call sites; and leaves represent instructions where samples were triggered. A CCT reduces the space needed for performance data by coalescing common call path prefixes.
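A minimal sketch of such a tree in C follows; the field names and the linked-list child representation are illustrative, not HPCToolkit's actual data structures.

#include <stdint.h>
#include <stdlib.h>

/* Illustrative calling context tree (CCT) node. Children are kept in a
 * linked list; a real implementation would use a faster child map. */
typedef struct cct_node {
    uint64_t ip;                   /* call site or sampled instruction */
    uint64_t metric;               /* e.g., accumulated latency */
    struct cct_node *first_child;
    struct cct_node *sibling;
} cct_node_t;

/* Insert a call path (outermost frame first) into the CCT, coalescing
 * common prefixes, and add the sample's cost to the leaf. */
static void cct_insert(cct_node_t *root, const uint64_t *path, int depth,
                       uint64_t cost)
{
    cct_node_t *node = root;
    for (int i = 0; i < depth; i++) {
        cct_node_t *child = node->first_child;
        while (child && child->ip != path[i])
            child = child->sibling;
        if (!child) {                        /* extend the tree */
            child = calloc(1, sizeof(*child));
            child->ip = path[i];
            child->sibling = node->first_child;
            node->first_child = child;
        }
        node = child;
    }
    node->metric += cost;                    /* leaf carries the cost */
}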

Supporting data-centric analysis required two changes to HPCToolkit's call stack unwinder. First, we adjust the leaf node of the calling context we obtain by unwinding from the signal context to use the precise IP recorded by the PMU hardware. This avoids the "skid" between the monitored instruction and the sample event that typically occurs on out-of-order processors. Second, we create multiple CCTs for each thread. Inserting a sample call path into a specific CCT depends on the sample's characteristics. For example, we create a CCT that includes all samples that do not access memory; other CCTs include call paths that touch different types of variables, such as static and heap-allocated variables.

4.1.3 Tracking variables

To support data-centric analysis, we augmented HPCToolkit to track the life cycle of each variable. We track the allocation and deallocation of static and heap allocated data. Variables that do not belong to static or heap allocated data are treated as unknown data.

Static data.

Data allocated in the .bss section of load modules is static data. Each static variable has a named entry in the symbol table that identifies the memory range for the variable with an offset from the beginning of the load module. The life cycle of static variables begins when the enclosing load module (executable or dynamic library) is loaded into memory and ends when the load module is unloaded. The profiler tracks the loading and unloading of load modules. When a load module is loaded into memory, HPCToolkit reads the load module's symbol table to extract information about the memory ranges for all of its static variables. These memory ranges are inserted into a map for future use. All load modules in use are linked in a list for easy traversal. If a load module is unloaded, the load module together with its search tree of static data is removed from the list.
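The lookup that maps an effective address to a static variable can be illustrated as follows, assuming the per-module ranges are kept sorted by start address; the record layout is illustrative, not HPCToolkit's internal representation.

#include <stdint.h>
#include <stddef.h>

/* Illustrative record for one static variable's address range. */
typedef struct {
    uint64_t start, end;   /* [start, end) in the process address space */
    const char *name;      /* symbol name from the load module */
} static_var_t;

/* Binary search over non-overlapping ranges sorted by start address;
 * returns the variable containing addr, or NULL if none matches. */
static const static_var_t *find_static(const static_var_t *vars, size_t n,
                                       uint64_t addr)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (addr < vars[mid].start)       hi = mid;
        else if (addr >= vars[mid].end)   lo = mid + 1;
        else                              return &vars[mid];
    }
    return NULL;
}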

HPCToolkit tracks all static variables used during a program execution. Unlike other tools, HPCToolkit not only tracks static variables in the executable, but also static variables in dynamically-loaded shared libraries. Moreover, it collects fine-grained information for each static variable rather than simply attributing metrics to load modules.

Heap-allocated data.

Variables in the heap are allocated dynamically during execution by one of the malloc family of functions (malloc, calloc, realloc). Since heap-allocated data may be known by different aliases, e.g., function parameters, at different points in an execution, HPCToolkit's profiler uses the full call path of the allocation point for a heap-allocated data block to uniquely identify it throughout its lifetime. To associate a heap-allocated variable with its allocation call path, the profiler wraps memory allocation and free operations in a monitored execution. At each monitored allocation, the profiler enters into a map an association between the address range of a data block and its allocation point. At each monitored free, the profiler deletes an association from the map.
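A sketch of such wrapping for a dynamically linked execution is shown below; the allocation map and unwinder interfaces are hypothetical placeholders for the profiler's internals, and the size threshold anticipates the overhead-reduction strategy described later in this section.

#define _GNU_SOURCE
#include <stddef.h>
#include <dlfcn.h>

/* Hypothetical profiler internals. */
extern void *unwind_allocation_context(void);                  /* hypothetical */
extern void  alloc_map_insert(void *lo, void *hi, void *ctx);   /* hypothetical */
extern void  alloc_map_remove(void *lo);                        /* hypothetical */

#define TRACK_THRESHOLD 4096   /* skip small blocks; see the strategies below */

/* Interpose malloc (e.g., via LD_PRELOAD): forward to the real allocator,
 * then record the block's range and its allocation context. A production
 * interposer must also handle re-entrancy during dlsym bootstrap. */
void *malloc(size_t size)
{
    static void *(*real_malloc)(size_t);
    if (!real_malloc)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");

    void *p = real_malloc(size);
    if (p && size >= TRACK_THRESHOLD)
        alloc_map_insert(p, (char *)p + size, unwind_allocation_context());
    return p;
}

/* All frees are intercepted so stale ranges never shadow new blocks. */
void free(void *p)
{
    static void (*real_free)(void *);
    if (!real_free)
        real_free = (void (*)(void *))dlsym(RTLD_NEXT, "free");

    if (p)
        alloc_map_remove(p);
    real_free(p);
}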

Unknown data.

There are variables that do not belong to static or heap allocated data. Such variables are either not easily tracked or have little impact on performance.

For example, C++ template containers directly use the low-level system call brk to allocate memory. Instead of returning the allocated ranges, brk sets the end of the data segment. For that reason, HPCToolkit does not track C++ template container allocations. HPCToolkit also treats stack variables as unknown data because stack variables seldom become data locality bottlenecks.

Recording address range information for static variables incurs little overhead because it happens once when a load module is loaded into the program's address space. However, the overhead of tracking heap allocations and deallocations hurts the scalability of the profiler. If a program allocates and frees memory with high frequency, wrapping allocates and frees and capturing the full calling context for each allocation may cause large overhead. For example, the execution time of AMG2006, one of our case study benchmarks, increases by 150% when monitoring all allocations and frees. To reduce such overhead, we use the following three strategies.

• We do not track all memory allocations. Usually, large arrays suffer from more severe locality problems than small ones. Typically, there are also more opportunities for optimizing data locality for large arrays. For that reason, HPCToolkit does not track any heap allocated variable smaller than a threshold, which we have chosen as 4K. However, we still track all calls to free to avoid attributing costs to wrong variables. Because we don't collect calling contexts for frees, wrapping all frees is not costly.

• We use inlined assembly code to directly read execution context information from registers to aid in call stack unwinding. Our assembly code incurs lower overhead than libc's getcontext.

• Since unwinding the call stack for frequent allocations in deep calling contexts is costly, we reduce unwinding overhead by identifying the common prefix for adjacent allocations to avoid duplicate unwinds of prefixes that are already known. We accomplish this by placing a marker (known as a trampoline) in the call stack to identify the least common ancestor frame of the calling contexts for two temporally adjacent allocations [11]. Using this technique, each allocation only needs to unwind the call path suffix up to the marked frame.

Because the first method can lead to inaccuracy due to the information loss, we only use it when necessary to avoid unaffordable time overhead. We always enable the other two methods because they are always beneficial. In our case study of AMG2006, these approaches reduce the time overhead of tracking variables from 150% to less than 10%.

4.1.4 Attributing metrics to variables

By correlating information about accesses from PMU samples with memory ranges for variables, HPCToolkit performs data-centric attribution on-the-fly. It first creates three CCTs in each profile, each recording a different storage class: static, heap, and unknown. This aggregation highlights which storage class has more problematic locality.

For each sample, HPCToolkit searches the map of heap allocated variables using the effective address provided by the PMU registers. If a sample in a thread touches a heap allocated variable, the thread prepends the call path for the variable allocation to the call path for the sample and then adds it to its local heap data CCT. It is worth noting that the memory allocation call path may reside in a different thread than the one in which the PMU sample for an access is taken. Copying a call path from another thread doesn't require a lock because once created, a call path is immutable. If a copied call path is already in the local thread CCT, the thread coalesces the common paths. Although a memory access is mapped to a heap allocated variable using the allocated memory ranges, the allocation call path uniquely identifies a heap allocated variable. This CCT copy-and-merge operation successfully addresses the problem of multiple allocations with the same call path, which we illustrated with Figure 2. If multiple heap allocated data objects have the same allocation call path, they are merged online and treated as a single variable. Memory accesses to different variables are separated into different groups identified by the variables' allocation paths.

If HPCToolkit does not find a heap allocated variable matching an access, it looks up the effective address for the access in data structures containing address ranges for static variables. The search is performed on each load module (the executable and dynamically loaded libraries) in the active load module list to look for a variable range. If the sample accesses a static variable, the profiler records the variable name from the symbol table in a dummy node and inserts it into the static data CCT of the thread that takes the sample. The sample with the full call path is inserted under the dummy node. Therefore, if multiple samples touch the same static variable, they all have the same dummy node as their common prefix for merging. Like heap allocated variables, the dummy nodes of static variable names separate the CCT into multiple groups.

If the sample does not access any heap allocated variable or static variable, we insert the sample into the thread's unknown data CCT. All of a thread's call paths touching unknown data are grouped together in this CCT.
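The per-sample dispatch just described can be summarized with the following sketch; the map, variable, and CCT interfaces are hypothetical stand-ins for the profiler's internals.

#include <stdint.h>

/* Hypothetical handles into the profiler's per-thread state. */
typedef struct thread_state thread_state_t;   /* hypothetical */
typedef struct var_info var_info_t;           /* hypothetical */

extern var_info_t *heap_map_lookup(uint64_t ea);     /* hypothetical */
extern var_info_t *static_map_lookup(uint64_t ea);   /* hypothetical */
extern void attribute_to_heap(thread_state_t *, var_info_t *,
                              const uint64_t *path, int depth, uint64_t cost);
extern void attribute_to_static(thread_state_t *, var_info_t *,
                                const uint64_t *path, int depth, uint64_t cost);
extern void attribute_to_unknown(thread_state_t *,
                                 const uint64_t *path, int depth, uint64_t cost);

/* On each PMU sample: classify the effective address, then insert the
 * sample's call path under the matching variable's prefix. */
void attribute_sample(thread_state_t *ts, uint64_t effective_va,
                      const uint64_t *path, int depth, uint64_t cost)
{
    var_info_t *v;
    if ((v = heap_map_lookup(effective_va)) != NULL)
        attribute_to_heap(ts, v, path, depth, cost);   /* prefix: alloc path */
    else if ((v = static_map_lookup(effective_va)) != NULL)
        attribute_to_static(ts, v, path, depth, cost); /* prefix: dummy node */
    else
        attribute_to_unknown(ts, path, depth, cost);   /* unknown-data CCT */
}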

The data-centric attribution strategy used by HPCToolkit's profiler aggregates data access samples in both coarse-grain (storage class) and fine-grain (individual variables) ways. Because each thread records data into its own CCTs, no thread synchronization is needed and data-centric attribution incurs little overhead.

4.2 Post-mortem analyzer and user interface

HPCToolkit's post-mortem analyzer takes load modules and profiles as inputs. It reads the symbol table from the executable and dynamically loaded libraries, and maps CCT nodes to either function or variable symbols. It then extracts line mapping information from debugging sections and maps symbol names to source code.

To generate compact profile results, which is important for scalability, HPCToolkit's post-mortem analyzer merges profiles from different threads and processes. Data-centric profiles, which consist of at most three CCTs per thread (one each for static, heap, and unknown storage classes), are amenable to coalescing. The analyzer merges CCTs of the same storage class across threads and processes. Context paths for individual variables and their memory accesses can be merged recursively across threads and processes. Heap allocated variables are coalesced if their allocation call paths are the same, even if they are from different threads or processes. Static variables from different threads and processes are merged if they have the same symbol name.

Table 1: Measurement configuration and overhead of benchmarks

code           number of cores                       monitored events            execution time   execution time with profiling
AMG2006        4 MPI processes, 128 threads/process  PM_MRK_DATA_FROM_RMEM (1)   551s             604s (+9.6%)
Sweep3D        48 MPI processes, no threads          AMD IBS                     88s              90s (+2.3%)
LULESH         48 threads                            AMD IBS                     17s              19s (+12%)
Streamcluster  128 threads                           PM_MRK_DATA_FROM_RMEM       25s              27s (+8.0%)
NW             128 threads                           PM_MRK_DATA_FROM_RMEM       77s              80s (+3.9%)

Because all memory access call paths have variable CCT nodes as prefixes, they can be automatically merged after the merging of variables. The profile merging overhead grows linearly with the increasing number of threads and processes used by the monitored program. HPCToolkit's post-mortem analyzer uses a scalable MPI-based algorithm to parallelize the merging process using a reduction tree [36].
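Reusing the illustrative cct_node_t sketched in Section 4.1.2, the per-storage-class merge can be pictured as a recursive tree union; this sketch consumes the source tree and is not HPCToolkit's actual merge code.

/* Merge CCT `src` into `dst`: children with matching ip values are
 * combined recursively and their metrics summed; unmatched children
 * are adopted wholesale. `src` is consumed by the operation. */
static void cct_merge(cct_node_t *dst, cct_node_t *src)
{
    dst->metric += src->metric;
    for (cct_node_t *sc = src->first_child; sc; ) {
        cct_node_t *next = sc->sibling;
        cct_node_t *dc = dst->first_child;
        while (dc && dc->ip != sc->ip)
            dc = dc->sibling;
        if (dc) {
            cct_merge(dc, sc);               /* coalesce common subtree */
        } else {
            sc->sibling = dst->first_child;  /* adopt the whole subtree */
            dst->first_child = sc;
        }
        sc = next;
    }
}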

Finally, the analyzer outputs a database for HPCToolkit's GUI. The GUI sorts performance metrics related to each variable and instruction. It provides a top-down view to explore the costs associated with each dynamic calling context in an execution. One can easily identify performance losses within full calling contexts for variables or instructions. Moreover, HPCToolkit's GUI also provides a complementary bottom-up view. If the same malloc function is called in different contexts, the bottom-up view aggregates all performance losses and associates them with the malloc call site. In case studies, we show how these two views guide data locality optimization.

5. CASE STUDIES

We evaluated HPCToolkit's data-centric extensions for analyzing data locality issues on two machines. Our study focused on evaluating measurement and analysis of highly multithreaded executions at the node level. HPCToolkit's MPI-based post-mortem analysis naturally scales for analysis of data from many nodes.

The first test platform for our study is a POWER7 cluster. Each node has four POWER7 processors with a total of 128 hardware threads. Each POWER7 processor in a node has its own memory controller, so there are four NUMA nodes. We reserved four nodes, with up to 512 threads, to evaluate the scalability of HPCToolkit. The applications we studied use one or both of MPI for inter-node communication and OpenMP for intra-node communication. A second test platform for evaluating HPCToolkit's data-centric analysis is a single node server with four AMD Magny-Cours processors. There are 48 cores in this machine and 8 NUMA locality domains.

We studied five well-known application benchmarks coded in C, C++ and Fortran, covering OpenMP, MPI and hybrid programming models. We built these programs using GNU or IBM compilers with full optimization. The programs we studied are the following:

• AMG2006 [22], one of the LLNL Sequoia benchmarks, is a parallel algebraic multi-grid solver for linear systems arising from problems on unstructured grids. The driver for this benchmark builds linear systems for various 3D problems. It consists of three phases: initialization, setup, and solver. AMG2006 is an MPI+OpenMP benchmark written in C.

(1) This is a marked event that measures data access from remote memory.

• Sweep3D [1] is an ASCI benchmark that solves a time-independent discrete ordinates 3D Cartesian geometry neutron transport problem. It is written in Fortran and parallelized with MPI without threading.

• LULESH [21], a UHPC application benchmark developed by Lawrence Livermore National Laboratory, is an Arbitrary Lagrangian Eulerian code that solves the Sedov blast wave problem for one material in 3D. In this paper, we study a highly-tuned LULESH implementation written in C++ with OpenMP.

• Streamcluster [8] is one of the Rodinia benchmarks. For a stream of input points, it finds a predetermined number of medians so that each point is assigned to its nearest center. The quality of the clustering is measured by the sum of squared distances metric. It is written in C++ with OpenMP.

• Needleman-Wunsch (NW) [8], another Rodinia benchmark, is a nonlinear global optimization method for DNA sequence alignments. It is also implemented in C++ with OpenMP.

These benchmarks are configured to scale to the full size of our testbeds. The configuration and measurement overhead for these benchmarks are shown in Table 1. One may ask how we chose events to monitor and why these events affect performance. HPCToolkit either computes derived metrics [26] to identify whether a program is memory-bound enough for data locality optimization or counts occurrences of a specific event with a traditional hardware counter and evaluates its performance impact. We only apply data-centric analysis to memory-bound programs. As the table shows, in our case studies HPCToolkit's measurement overhead was 2.3–12%. The profile size (i.e., space overhead) ranged from 8–33 MB.

5.1 AMG2006

Figure 4 is an annotated top-down view from HPCToolkit's GUI for AMG2006, which executes with four MPI processes and 128 threads per process on our POWER7 cluster. The GUI consists of three panes. The top one shows the source code of a monitored program; the navigation pane at the bottom left shows program contexts such as routines, loops, and statements. To present data-centric measurements, we add variable names or allocation call stacks as context prefixes in the navigation pane. The bottom right pane shows the metrics related to the context at the left. The GUI computes inclusive and exclusive values for all metrics. In the figures for our case studies, we show only inclusive metrics.

In Figure 4, the second line of the metric pane shows that 94.9% of the remote memory accesses are associated with heap allocated variables.

Figure 4: The top-down data-centric view of AMG2006 shows a problematic heap allocated variable and the two accesses to this variable that incur the most remote memory events.

Data allocated on line 175 of hypre_CAlloc$AF6_3 is the target of 22.2% of the remote memory accesses in the execution. Selecting the allocation call site in the navigation pane shows the source code for the allocation, enabling one to identify the variable allocated. The highlighted source code line shows that the column indices for non-zeros in a matrix (S_diag_j) are the target of these remote accesses. Deeper in the allocation call path, one can see that this variable is allocated by calloc. Immediately below the calloc is a dummy node, heap data accesses; this node serves as the root of all memory accesses to this data. Calling contexts for two accesses are shown below this point; the leaf of each access call path is highlighted with a rectangle. One access accounts for 19.3% of total remote memory accesses and the other for 2.9%. One can select a highlighted access to display its source code in the top pane. Insets in Figure 4 show the source code for these accesses. S_diag_j is accessed in different loops. Since these loops are inside OpenMP outlined functions (with suffix $$OL$$), they are executed by multiple threads in a parallel region.

The performance data in the GUI shows that there is a mismatch between the allocation of S_diag_j and initialization by the master thread in one NUMA domain, and accesses by OpenMP worker threads executing in other NUMA domains. The workers all compete for memory bandwidth to access the data in the master's NUMA domain. Besides S_diag_j, there are many other variables in AMG2006 with the same NUMA problem.

phases     initialization  setup  solver  whole program
original   26s             420s   105s    551s
numactl    52s             426s   87s     565s
libnuma    28s             421s   80s     529s

Table 2: Improvement for different phases of AMG2006 using coarse-grained numactl and fine-grained libnuma.

To avoid contending for data allocated in a single NUMA domain, we launch the program with numactl [18, 19] and specify that all memory allocations should be interleaved across all NUMA domains. Table 2 shows the performance improvement. Using numactl reduces the running time of the solver phase from 105s to 87s. However, with numactl the initialization and setup phases are slower because interleaved allocations are more costly. The higher cost of interleaved allocation offsets its benefits for the solver phase.

A more surgical approach is to apply libnuma's interleaved allocator [18] only to problematic variables identified by HPCToolkit. We used the bottom-up view provided by HPCToolkit's GUI, shown in Figure 5, to identify problematic variables. This view shows different call sites that invoke the hypre allocator.

Figure 5: The bottom-up data-centric view of AMG2006 shows problematic allocation call sites, which may reside in different call paths.

Figure 6: The data-centric view of Sweep3D shows heap allocated variables with high latency. Three highlighted arrays have high memory access latency.

Figure 5 shows that while S_diag_j accounts for 22.2% of remote accesses, there are six other variables that are each the target of more than 7% of remote accesses. We focus on these problematic variables. If a variable is allocated and initialized by the master thread, we use libnuma's interleaved allocator instead. If a variable is initialized in parallel, we change the calloc to malloc, which enables the “first-touch” policy to allocate memory near the computation. Table 2 shows that using libnuma to apply interleaved allocation selectively avoids dilating the cost of the setup and initialization phases. Moreover, the execution time of the solver phase is 8% faster than when using numactl and ubiquitous interleaved allocation. Using libnuma, we avoid interleaved allocation for thread local data, which eliminates remote accesses to that data that arise with ubiquitous interleaved allocation.
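The coarse-grained approach corresponds to launching the program under `numactl --interleave=all`. The fine-grained alternative can be sketched in C with libnuma's documented interleaved allocator; the integration with hypre's allocation routines is assumed, not shown.

#include <numa.h>
#include <string.h>

/* Selectively interleave one problematic array across all NUMA nodes
 * using libnuma, instead of interleaving every allocation with numactl.
 * numa_alloc_interleaved spreads the backing pages round-robin across
 * nodes; blocks must later be released with numa_free. */
static void *alloc_interleaved_zeroed(size_t bytes)
{
    void *p = numa_alloc_interleaved(bytes);
    if (p)
        memset(p, 0, bytes);   /* preserve calloc's zero-fill semantics */
    return p;
}

For variables that are initialized in parallel, the simpler fix described above (replacing calloc with malloc so the first-touch policy applies) needs no library support at all.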

Figure 7: The data-centric view of Sweep3D shows a memory access with long latency to array Flux. This access is in a deep call chain.

5.2 Sweep3D

We ran Sweep3D on our 48-core AMD system and used IBS to monitor data fetch latency, which highlights data locality issues. The second line of the navigation pane in Figure 6 shows that 97.4% of total latency is associated with heap allocated variables. The top three heap allocated variables, Flux, Src, and Face, account for 39.4%, 39.1%, and 14.6% of the latency, respectively. Because these three arrays account for 93.1% of total latency, we focus only on optimizing their locality. Figure 7 shows a problematic access to Flux, which accounts for 28.6% of the total latency. This access, residing on line 480, is deeply nested in the call chain and loops. The two inner-most loops, on lines 477 and 478, traverse Flux with the left-most dimension first and the right-most dimension second. Since Fortran uses column-major array layouts, the loop on line 477 has a long access stride. These long strides disrupt spatial locality and hardware prefetching, which leads to elevated cache and TLB miss rates. To improve data locality, one could consider changing the access pattern by interchanging loops or transforming the data layout. In this case, interchanging loops is problematic. Examining the rest of the accesses to Flux, we see all are problematic, so we interchange the dimensions of Flux by inserting the last dimension between the first and second. With this data transformation, accesses to Flux have unit stride and improved spatial locality.

Both Src and Face suffer from the same spatial locality problem as Flux. Similarly, we transpose their layouts to match their access patterns. The optimization reduces the execution time of the whole program by 15%. It is worth noting that marked event sampling on POWER7 can also identify such optimization opportunities. One can sample the PM_MRK_DATA_FROM_L3 event to quantify the locality issue in Sweep3D.

Figure 8: The data-centric view shows problematic heap allocated arrays with high latency in LULESH. The red rectangle encloses the call paths of all heap variables that incur significant memory latency.

Because Sweep3D is a pure MPI program whose processes are always co-located with their data, no NUMA problem exists and there is no need to examine NUMA-related events.

5.3 LULESH

We ran LULESH on our 48-core AMD system. As with Sweep3D, we used IBS for data-centric monitoring. The first metric column in Figure 8 shows the data access latency associated with heap allocated variables in LULESH. The second metric column shows a NUMA-related metric that reflects accesses to remote DRAM; this event is analogous to the POWER7 marked event PM_MRK_DATA_FROM_RMEM. The figure shows that heap allocated variables account for 66.8% of total latency and 94.2% of the execution's remote memory accesses. The individual heap allocated variables are shown along the allocation call path. Annotations to the left of the allocation call sites show the names of the variables allocated. Each of the top seven heap allocated variables accounts for 3.0–9.4% of the total latency. The R_DRAM_ACCESS metric shows that most of these variables are accessed remotely. By examining the source code, we found that all heap allocated variables in LULESH are allocated and initialized by the master thread. According to the Linux “first touch” policy, all of these variables are allocated in the memory attached to the NUMA node containing the master thread. Consequently, the memory bandwidth of that NUMA node becomes a performance bottleneck. To alleviate contention for that bandwidth, we use libnuma to allocate all variables with high remote accesses in an interleaved fashion. This optimization speeds up the program by 13%.

We continue our analysis of LULESH by considering static variables. Figure 9 shows that static variables account for 23.6% of total access latency. The static variable f_elem is a hotspot because it accounts for 17% of total latency. There are two principal loops that have high access latency for f_elem. Both loops have exactly the same structure and one is shown in the source pane of Figure 9.

Figure 9: The data-centric view shows a problematic static variable and its accesses with high latency in LULESH. The ellipse shows that variable's name.

From the figure, one can see that f_elem is a three-dimensional array. The first dimension (left-most) is indexed indirectly using array nodeElemCornerList on line 801. The last dimension (right-most) is computed by the function Find_Pos on line 802. Thus, accesses to f_elem are irregular. Though optimizing data locality for irregular accesses is difficult, we found one opportunity to enhance data locality for f_elem in this loop. The second dimension (highlighted by a rectangle) ranges from 0 to 2. We transposed f_elem to make this dimension the last, which enables these three accesses to exploit spatial locality since C is row-major. This transposition reduces LULESH's execution time by 2.2%.
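The transformation can be illustrated with a simplified declaration; the extents N1 and N3 are hypothetical and the code is not LULESH's actual definition of f_elem.

#define N1 1000   /* hypothetical extent of the indirectly indexed dimension */
#define N3 8      /* hypothetical extent of the Find_Pos-computed dimension */

/* Before: the extent-3 dimension is in the middle, so for fixed i and k
 * the three values used together are N3 doubles apart in memory. */
double f_elem[N1][3][N3];

/* After: with the extent-3 dimension right-most (C is row-major), the
 * three values for fixed i and k are contiguous and can share a cache
 * line, which is what the transposition described above exploits. */
double f_elem_t[N1][N3][3];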

5.4 Streamcluster

We ran Streamcluster on a node of our POWER7 system with 128 threads. Streamcluster suffers from serious NUMA data locality issues. Figure 10 shows that 98.2% of total remote memory accesses are related to heap allocated variables. The annotation to the left of the allocation that accounts for 92.6% of total remote accesses shows that it is associated with the variable block. Line 175 contains problematic accesses to p1.coord and p2.coord, which use pointers to access regions of block. This code is called from two different OpenMP parallel regions and accounts for 55.5% and 37% of total remote accesses from these two contexts. Examining the source code, we found that block is allocated and initialized by the master thread, so all worker threads access it remotely. We address this problem by leveraging the Linux “first touch” policy. Initializing block in parallel allocates parts of it near each thread using it. The optimization reduces both remote accesses and contention for memory bandwidth. We also applied this optimization to point.p, which accounts for 5.5% of remote accesses. This optimization reduces Streamcluster execution time by 28%.
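The first-touch fix can be sketched in C with OpenMP; the buffer shape and allocation site are assumptions rather than Streamcluster's actual code.

#include <stdlib.h>
#include <string.h>

/* Sketch of first-touch-aware initialization: allocate the buffer once,
 * then let each OpenMP thread touch (zero) the rows it will later use,
 * so Linux places those pages in the touching thread's NUMA domain. */
float *alloc_block_first_touch(size_t npoints, size_t dim)
{
    float *block = malloc(npoints * dim * sizeof(*block));
    if (!block)
        return NULL;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)npoints; i++)
        memset(&block[i * dim], 0, dim * sizeof(*block));

    return block;
}

Using the same static schedule for this initialization loop and for the later compute loops keeps each page in the domain of the thread that reuses it.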

5.5 Needleman-Wunsch

Figure 10: The data-centric view associates a large number of NUMA related events with a problematic heap allocated variable and its inefficient accesses in the Streamcluster benchmark.

Figure 11: The data-centric view highlights two problematic heap allocated variables and their inefficient accesses in the Needleman-Wunsch benchmark.

We ran Needleman-Wunsch on a node of our POWER7 system with 128 threads. As with Streamcluster, this code suffers from a high ratio of remote memory accesses. A snapshot of HPCToolkit's GUI shown in Figure 11 indicates that 90.9% of remote memory accesses are associated with heap allocated variables. Two variables, referrence and input_itemsets, are hot spots that account for 61.4% and 29.5% of total remote accesses. The problematic accesses occur on lines 163–165. The maximum function called inside an OpenMP parallel region _Z7runTestiPPc.omp_fn.0 takes both referrence and input_itemsets as inputs. However, both variables are allocated and initialized by the master thread. To address this problem, we use libnuma to distribute the allocation of these two variables across all NUMA nodes to alleviate contention for memory bandwidth. This optimization speeds up the program by 53%.

6. RELATED WORK

Unlike HPCToolkit's integrated view of all data locality problems, prior work on measurement-based data-centric profilers only focuses on part of the problem. Section 6.1 describes tools that pinpoint poor temporal and spatial cache locality bottlenecks; Section 6.2 discusses tools that identify NUMA-related bottlenecks in threaded programs.

6.1 Tools for identifying poor cache locality

Irvin and Miller were perhaps the first to recognize the importance of data for performance [16]. In response, they extended Paradyn [30] to support a data view. They used Paradyn to dynamically instrument an executable to measure the performance of semantic operations, such as collective communication primitives, and they used static analysis to associate measurements with data structures. Since they measure using instrumentation rather than hardware counters, details of hardware behavior (e.g., cache misses or latency) are unobservable.

In the first work to employ asynchronous, event-based sampling (EBS) for data-centric analysis, Itzkowitz et al. introduced memory profiling to Sun ONE Studio [17]. They apply this tool to collect and analyze memory accesses in a sequential program and report measurement data in flat profiles. Code-centric and data-centric views are not related.

Buck and Hollingsworth developed Cache Scope [7] to perform data-centric analysis using Itanium 2 event address registers (EAR). Cache Scope associates latency with data objects and functions that accessed them. Unlike HPCToolkit, Cache Scope does not collect information about calling context and only associates latency metrics with code at the procedure level rather than at the level of loops or individual source lines.

In previous work [25, 26], we extended HPCToolkit to use IBS to analyze memory hierarchy performance. This work focused exclusively on sequential programs and as a result did not consider NUMA-related performance problems. Furthermore, this work didn't support attribution of memory hierarchy performance for static variables.

Tallent and Kerbyson also extended HPCToolkit to support data-centric tracing [37]. However, their work focused on global arrays used in the Partitioned Global Address Space (PGAS) programming models, rather than heap-allocated or static variables used in a common program.

A common drawback of these tools is that, to our knowledge, none of them supports measurement and analysis of scalable parallel programs with low overhead. Though these tools can identify locality bottlenecks in sequential programs, parallel versions (either threaded, MPI, or hybrid) of these programs may suffer from different performance problems, particularly with multithreaded programs running on nodes with a NUMA memory hierarchy.

6.2 Tools for identifying NUMA inefficiencies

Memphis [28] uses IBS for low-overhead collection of NUMA-related performance information on AMD Opteron-based multicore platforms. Memphis collects information about accesses to remote cache or memory and then associates these events with static data variables. Unlike HPCToolkit, Memphis doesn't have mechanisms for attributing access costs back to dynamic allocations and doesn't attribute costs to full calling contexts.

MemProf [20] is another data-centric tool based on AMD IBS for analyzing NUMA problems. It maps IBS samples to both static and heap allocated variables. Unlike HPCToolkit, MemProf records a trace of each IBS sample and variable allocation rather than collapsing it on-the-fly into a compact profile. The resulting high data volume makes this approach problematic to scale to a cluster with a large number of nodes. Moreover, MemProf does not map performance metrics to individual static variables; instead, it treats all static variables from a load module as one group and coarsely attributes metrics to these groups.

It is not clear that either of these tools can be applied to study MPI+OpenMP programs on a cluster. Additionally, these tools exclusively focus on NUMA locality problems. Furthermore, neither of these tools supports precise attribution of metrics to both static and heap allocated variables.

7. CONCLUSION AND FUTURE WORK

Augmenting HPCToolkit with support for data-centric profiling of highly parallelized programs enables it to quantify temporal, spatial, and NUMA locality. HPCToolkit's data-centric measurement and analysis capabilities leverage PMU hardware on IBM POWER7 and AMD Opteron processors. After the experimental work reported in this paper was completed, we extended HPCToolkit to support data-centric analysis using PMU hardware in additional processors, including Intel's Itanium2 and Ivy Bridge. By using sampling and a compact profile representation, HPCToolkit has low time and space overhead; this helps it scale to a large number of cores. In case studies, we applied HPCToolkit's data-centric analysis capabilities to study five well-known parallel benchmarks and attribute data-centric metrics to both code and variables. These capabilities provided intuitive analysis results that enabled us to easily identify and optimize data locality bottlenecks. With data-centric feedback from HPCToolkit, we were able to improve the performance of these benchmarks by 13–53%.

Hardware sampling support for data-centric profiling is insufficient in today's supercomputers. For example, ORNL's Titan supercomputer uses AMD processors that support instruction-based sampling. However, the operating system on Titan disables IBS. LLNL's Sequoia supercomputer is based on Blue Gene/Q processors. Though the A2 cores of a Blue Gene/Q ASIC have SIAR and SDAR registers as part of support for instruction sampling, this capability is not usable because full support for instruction sampling is missing from the multicore ASIC. The utility of measurement-based data-centric analysis of parallel program executions, which we demonstrate, motivates including hardware support for data-centric measurement in future generation systems.
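As a point of reference for what such hardware support looks like from the software side, the sketch below shows how a Linux tool might request data-address samples through the perf_event interface. It is only an illustration, not the measurement path used in this paper: the generic cache-miss event shown here is a placeholder, and capturing meaningful data addresses in practice requires a precise sampling facility such as AMD IBS, Intel PEBS, or POWER marked events, which some systems and kernels do not expose.

    /* Illustrative sketch: request data-address samples via Linux perf_event.
     * PERF_COUNT_HW_CACHE_MISSES is a placeholder; the precise_ip request
     * may be rejected on PMUs or kernels lacking precise-sampling support. */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags) {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size           = sizeof(attr);
        attr.type           = PERF_TYPE_HARDWARE;
        attr.config         = PERF_COUNT_HW_CACHE_MISSES;  /* placeholder */
        attr.sample_period  = 10000;            /* one sample per N events */
        attr.sample_type    = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                              PERF_SAMPLE_ADDR; /* ask for data addresses  */
        attr.precise_ip     = 2;                /* request precise samples */
        attr.disabled       = 1;
        attr.exclude_kernel = 1;

        int fd = perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        /* map the ring buffer into which the kernel writes sample records;
         * each record carries the sampled IP, thread id, and data address,
         * which a profiler would attribute to variables and contexts */
        size_t len = (1 + 8) * (size_t)sysconf(_SC_PAGESIZE);
        void *ring = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (ring == MAP_FAILED) { perror("mmap"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... run the monitored work and drain sample records from ring ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        return 0;
    }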

We plan to extend our work in several ways. First, rather than overlooking heap allocations smaller than 4K, we think that monitoring some of them will enable HPCToolkit to provide useful data-centric feedback for programs with data structures built from many small allocations. Second, we plan to explore extensions that enable us to associate data-centric measurements with stack-allocated variables. Finally, we plan to enhance HPCToolkit's measurement and analysis to provide guidance for where and how to improve data locality by pinpointing initializations that associate data with a memory module and identifying opportunities to apply transformations such as data distribution, array regrouping, and loop fusion.
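For instance, one transformation such guidance might point to is explicit data distribution across NUMA nodes. The hedged sketch below uses libnuma's numa_alloc_onnode to place each block of an array on a chosen memory module; the simple block partition and node assignment are assumptions for illustration, not output of the profiler.

    /* Illustrative sketch of data distribution across NUMA nodes with
     * libnuma; compile with -lnuma on a NUMA-capable Linux system. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA support not available\n");
            return 1;
        }
        int nodes = numa_max_node() + 1;     /* number of memory modules  */
        size_t n  = (size_t)1 << 24;         /* total elements            */
        size_t per_node = n / nodes;         /* simple block distribution */

        double **part = malloc(nodes * sizeof(*part));
        for (int node = 0; node < nodes; node++) {
            /* place this block's pages on memory attached to 'node', so
             * threads bound there access local rather than remote memory */
            part[node] = numa_alloc_onnode(per_node * sizeof(double), node);
            if (part[node] == NULL) { perror("numa_alloc_onnode"); return 1; }
            for (size_t i = 0; i < per_node; i++)  /* touch to commit pages */
                part[node][i] = 0.0;
        }

        /* ... parallel computation: each thread works on part[its_node] ... */

        for (int node = 0; node < nodes; node++)
            numa_free(part[node], per_node * sizeof(double));
        free(part);
        return 0;
    }

An alternative with a similar effect is first-touch initialization inside a parallel loop, which is often the less invasive change when the computation's partitioning already matches the desired distribution.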

Acknowledgements

This work was supported in part by subcontract B602160 from Lawrence Livermore National Laboratory.

8. REFERENCES

[1] Accelerated Strategic Computing Initiative. The ASCI Sweep3D Benchmark Code. http://wwwc3.lanl.gov/pal/software/sweep3d, 2009.

[2] L. Adhianto et al. HPCToolkit: Tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience, 22:685–701, 2010.

[3] Advanced Micro Devices. AMD CodeAnalyst performance analyzer. http://amddevcentral.com/tools/hc/CodeAnalyst/Pages/default.aspx.

[4] J. M. Anderson et al. Continuous profiling: where have all the cycles gone? ACM TOCS, 15(4):357–390, 1997.

[5] K. Beyls and E. D'Hollander. Discovery of locality-improving refactorings by reuse path analysis. Proc. of the 2nd Intl. Conf. on High Performance Computing and Communications (HPCC), 4208:220–229, Sept. 2006.

[6] K. Beyls and E. H. D'Hollander. Refactoring for data locality. Computer, 42(2):62–71, Feb. 2009.

[7] B. R. Buck and J. K. Hollingsworth. Data centric cache measurement on the Intel Itanium 2 processor. In SC '04: Proc. of the 2004 ACM/IEEE Conf. on Supercomputing, page 58, Washington, DC, USA, 2004. IEEE Computer Society.

[8] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Proc. of the 2009 IEEE Intl. Symp. on Workload Characterization (IISWC), pages 44–54, Washington, DC, USA, 2009.

[9] J. Dean et al. ProfileMe: Hardware support for instruction-level profiling on out-of-order processors. In Proc. of the 30th Annual ACM/IEEE Intl. Symposium on Microarchitecture, pages 292–302, Washington, DC, USA, 1997.

[10] P. J. Drongowski. Instruction-based sampling: A new performance analysis technique for AMD family 10h processors. http://developer.amd.com/Assets/AMD_IBS_paper_EN.pdf, November 2007.

[11] N. Froyd, J. Mellor-Crummey, and R. Fowler. Low-overhead call path profiling of unmodified, optimized code. In Proc. of ICS'05, pages 81–90, New York, NY, USA, 2005. ACM Press.

[12] S. L. Graham, P. B. Kessler, and M. K. McKusick. Gprof: A call graph execution profiler. In Proc. of the 1982 SIGPLAN Symposium on Compiler Construction, pages 120–126, New York, NY, USA, 1982. ACM Press.

[13] Intel VTune Amplifier XE 2013. http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe, April 2013.

[14] Intel Corporation. Intel 64 and IA-32 architectures software developer's manual, Volume 3B: System programming guide, Part 2, Number 253669-032US. http://www.intel.com/Assets/PDF/manual/253669.pdf, June 2010.

[15] Intel Corporation. Intel Itanium Processor 9300 series reference manual for software development and optimization. http://www.intel.com/Assets/PDF/manual/323602.pdf, March 2010.

[16] R. B. Irvin and B. P. Miller. Mapping performance data for high-level and data views of parallel program performance. In Proc. of ICS'96, pages 69–77, New York, NY, USA, 1996. ACM.

[17] M. Itzkowitz, B. J. N. Wylie, C. Aoki, and N. Kosche. Memory profiling using hardware counters. In Proc. of the 2003 ACM/IEEE Conf. on Supercomputing, page 17, Washington, DC, USA, 2003. IEEE Computer Society.

[18] A. Kleen. A NUMA API for Linux. http://developer.amd.com/wordpress/media/2012/10/LibNUMA-WP-fv1.pdf, 2005.

[19] A. Kleen. numactl – Linux man page. http://linux.die.net/man/8/numactl, 2005.

[20] R. Lachaize, B. Lepers, and V. Quema. MemProf: A memory profiler for NUMA multicore systems. In Proc. of the 2012 USENIX Annual Technical Conf., USENIX ATC'12, Berkeley, CA, USA, 2012.

[21] Lawrence Livermore National Laboratory. Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH). https://codesign.llnl.gov/lulesh.php.

[22] Lawrence Livermore National Laboratory. LLNL Sequoia Benchmarks. https://asc.llnl.gov/sequoia/benchmarks.

[23] A. R. Lebeck and D. A. Wood. Cache profiling and the SPEC benchmarks: A case study. Computer, 27(10):15–26, 1994.

[24] J. Levon et al. OProfile. http://oprofile.sourceforge.net.

[25] X. Liu and J. Mellor-Crummey. Pinpointing data locality problems using data-centric analysis. In Proc. of CGO'11, pages 171–180, Washington, DC, USA, 2011.

[26] X. Liu and J. Mellor-Crummey. Pinpointing data locality bottlenecks with low overheads. In Proc. of ISPASS 2013, Austin, TX, USA, April 21–23, 2013.

[27] M. Martonosi, A. Gupta, and T. Anderson. MemSpy: Analyzing memory system bottlenecks in programs. SIGMETRICS Perform. Eval. Rev., 20(1):1–12, 1992.

[28] C. McCurdy and J. S. Vetter. Memphis: Finding and fixing NUMA-related performance problems on multi-core platforms. In IEEE Intl. Symp. on Performance Analysis of Systems Software (ISPASS), pages 87–96, Mar. 2010.

[29] Message Passing Interface Forum. MPI: A message passing interface standard. http://www.mcs.anl.gov/research/projects/mpi, 2013.

[30] B. P. Miller et al. The Paradyn parallel performance measurement tool. IEEE Computer, 28(11):37–46, 1995.

[31] OpenMP Architecture Review Board. OpenMP application program interface, version 3.0. http://www.openmp.org/mp-documents/spec30.pdf, May 2008.

[32] A. Rane and J. Browne. Enhancing performance optimization of multicore chips and multichip nodes with data structure metrics. In Proc. of the Intl. Conf. on Parallel Architectures and Compilation Techniques, Minneapolis, MN, USA, 2012. IEEE Computer Society.

[33] Rogue Wave Software. ThreadSpotter manual, version 2012.1. http://www.roguewave.com/documents.aspx?EntryId=1492, August 2012.

[34] D. L. Schuff, M. Kulkarni, and V. S. Pai. Accelerating multicore reuse distance analysis with sampling and parallelization. In Proc. of PACT'10, pages 53–64, New York, NY, USA, 2010. ACM.

[35] M. Srinivas et al. IBM POWER7 performance modeling, verification, and evaluation. IBM JRD, 55(3):4:1–4:19, May–June 2011.

[36] N. R. Tallent, L. Adhianto, and J. Mellor-Crummey. Scalable identification of load imbalance in parallel executions using call path profiles. In Proc. of the Intl. Conf. for High-Performance Computing, Networking, Storage and Analysis, New York, NY, USA, November 2010. ACM.

[37] N. R. Tallent and D. Kerbyson. Data-centric performance analysis of PGAS applications. In Proc. of the Second Intl. Workshop on High-performance Infrastructure for Scalable Tools (WHIST), San Servolo Island, Venice, Italy, 2012.

[38] N. R. Tallent, J. Mellor-Crummey, and M. W. Fagan. Binary analysis for measurement and attribution of program performance. In Proc. of the 2009 ACM PLDI, pages 441–452, New York, NY, USA, 2009. ACM.

[39] Y. Zhong and W. Chang. Sampling-based program locality approximation. In Proc. of the 7th Intl. Symposium on Memory Management, ISMM '08, pages 91–100, New York, NY, USA, 2008. ACM.