Course: Measurement Tools and Techniques (UPC)
Performance Analysis of Multithreaded Applications
Based on Hardware Simulations with Dimemas and the Pin Tool
Maria Stylianou
Universitat Politècnica de Catalunya (UPC)
June 15, 2012
Abstract
It is widely accepted that application development advances rapidly, while hardware design
proceeds at a slower pace and with less success. The high cost of acquiring state-of-the-art
hardware, as well as the absence of machines that can support the latest application
developments, leads scientists to look for other ways to examine the performance of their
applications. Hardware simulation therefore becomes crucial for application analysis. This
project simulates hardware using two simulators: one for exploring the impact of interconnect
parameters such as latency, network bandwidth and congestion, and the other for studying the
effect of parameters related to cache memory, such as cache size, cluster size and number of
processors. Our predictions about the strong effect of these parameters are confirmed by the
results extracted from our experiments.
1 Introduction
Simulation is defined as the imitation of the operation of a real-world process or system over
time [1]. In engineering, an architecture simulation makes it possible to model a real or
hypothetical situation on a computer so that it can later be studied and analyzed. By running
several simulations and varying parameters, researchers predict and draw conclusions about the
behavior of the system. Going a step further, simulation becomes crucial when the real
computer hardware is either not accessible or prohibited from being used, or when the hardware
is not yet built [2].
In this report, the attention revolves around hardware simulations, and more specifically
simulations using two tools: Dimemas and Pin. Both tools are used in this study to analyze
and predict hardware behavior during the execution of parallel programs. The major differences
that distinguish the two tools are briefly described below.
Dimemas is a performance analysis tool for message-passing programs [3], also characterized
as a trace-driven simulator. Taking as inputs a machine configuration and an application trace
file, Dimemas can reconstruct the time behavior of a parallel application and open the door to
experimenting with the performance of the modeled hardware.
Like Dimemas, the Pin tool can be used for program analysis. More precisely, Pin is a
dynamic binary instrumentation tool [4]: instrumentation takes place on compiled binaries at
runtime, so recompiling the source code is not needed. As described later, the two tools are
used in different scenarios to achieve different goals.
This paper continues in the next section with the methodology followed for setting up the proper
environments for both tools and performing the simulations. In Section 3, the results are
presented and discussed. Finally, in Section 4, conclusions are drawn from our observations.
2 Methodology
The study consists of two main parts: simulation with Dimemas and simulation with Pin. In
this section, both parts are explained in more detail.
2.1 Dimemas simulator
As explained above, Dimemas is used for analysing the performance of parallel
message-passing programs. Several message-passing libraries are supported [3], but for
this work an MPI application was chosen. Starting from the configuration parameters of a
machine, several simulations were performed to test and identify the sensitivity of the
application performance to interconnect parameters.
2.1.1 Pre-process
The MPI application used is MG from the NAS Parallel Benchmarks [5], and it was run on the
boada server provided by the Barcelona Supercomputing Center. This server has a dual Intel
Xeon E5645 with 24 cores. Traces were generated by running the program with 2, 4, 8, 16, 32
and 64 threads using Extrae, a dynamic instrumentation package that traces programs running
under a shared-memory or a message-passing programming model. More details on how to set up
Extrae can be found in [6].
Traces generated by Extrae can be visualized with Paraver, a performance analysis tool that
allows visual inspection of an application as well as a more detailed quantitative analysis of
the problems observed [7]. To be used as input to Dimemas, the traces must first be translated
from Paraver format to Dimemas format with the prv2trf translator, which can be found in [8].
The command line for running prv2trf is:
./prv2trf paraver_trace.prv dimemas_trace.trf
The second input to Dimemas is a configuration file that describes an architecture model
idealized from MareNostrum. This machine is ideal in the sense that it has zero latency,
unlimited bandwidth and no limit on the number of concurrent communications.
2.1.2 Simulations
The objective is to test the application under different conditions and machine characteristics.
The parameters changed in the simulations are the latency, the network bandwidth, the number of
buses and the relative processor speed. They are studied one by one in the order given above;
for each parameter, a range of values is specified over which the application is tested. After
choosing the best value for a parameter, we move on to the next parameter, keeping the value
just decided fixed. This process is performed 6 times, once for each trace generated with
Extrae for 2, 4, 8, 16, 32 and 64 threads. With this methodology, it becomes easier to observe
how the application behaves in each circumstance.
The first step after installing Dimemas is to run the Dimemas GUI located in
Dimemas_directory/bin/dimemas-gui.sh. In the window that opens, we choose Configuration →
Load Configuration to load the configuration file of the machine. Afterwards, we specify the
trace file converted by prv2trf by clicking Configuration → Initial Machine, and we Compute
the number of application tasks. After that, we are able to make changes to the machine
characteristics.
The parameters mentioned above can be changed from Configuration → Target Configuration.
Specifically, under Node information, the latency and the relative CPU performance can be
changed through the values of Startup on Remote Communication and Relative Processor Speed,
respectively. Under Environment information, the network bandwidth and the number of buses can
be modified. It is important to mention that after each parameter change, the button “Do all
the same” has to be pressed for the change to be applied to all nodes.
2.1.2.1 Latency
The first parameter studied was latency and its impact when increased or decreased. In
Dimemas, latency represents the local overhead of an MPI implementation. We ran simulations
with different values of latency, starting from 1 ns and multiplying by 10 each time, up to
100,000 ns. After each change, the new configuration was saved as a new configuration file.
2.1.2.2 Network Bandwidth
Another important parameter to study is the network bandwidth. In the ideal machine the
bandwidth is unlimited, so the impact of reducing it is interesting. We ran simulations
starting from 1 Mbyte/s and multiplying by 10 in each scenario, up to 1,000,000 Mbytes/s.
2.1.2.3 Number of Buses
An important question that needs to be answered concerns the impact of contention on the
application. In Dimemas, the number of buses defines how many transfers can take place at any
given time, so congestion can be modeled through it, although this is not the only way. With
these simulations, we examine the possibility that bad routing could cause contention and
negatively affect performance. Initially, the machine has no limit on the number of concurrent
communications. We then ran simulations for 1, 2, 4, 8, 16 and 32 buses.
2.1.2.4 Relative CPU performance
The last parameter examined was the relative processor speed, i.e., the impact of having a
faster processor in the machine. By faster we mean the speed of execution of the sequential
computation bursts between MPI calls. Initially, the relative speed is set to its minimum
value of 1. In our simulations, we tried values from 0.5 up to 5 times the original speed,
increasing by 0.5 in each simulation.
2.1.3 Post-process
As already explained, each parameter is studied in isolation, without changing any other
parameter. When all configuration files for a given parameter have been generated, they are
studied and compared, and the best value is chosen based on its impact on the execution time
and the cost that comes with it. The configuration file with this value is the one loaded in
subsequent simulations, where a new parameter is studied.
For each configuration saved during simulations, a Paraver file should be produced. This is
done with the command below:
./Dimemas3 -S -32K -pa new_paraver_trace.prv new_config_file.cfg
where we specify the name of the configuration file we have saved and the name we want the
new Paraver file to have.
The traces generated are opened with Paraver together with the initial trace files, in order
to compare them, observe performance characteristics and examine any problems indicated by the
simulator. In Section 3, the results of these simulations are presented and discussed.
2.2 Pin simulator
Pin analyzes programs by inserting arbitrary code inside executables [4]. In this project, a
pin-tool was designed to simulate a three-level cache hierarchy with a per-processor L1 data
cache, a cluster-shared L2 data cache and a globally-shared L3 data cache. Each processor
first accesses its dedicated L1 data cache, which is the fastest but usually the smallest.
When an access misses in L1, the L2 data cache is consulted. Each L2 cache serves a cluster of
processors and is usually slower than L1 but larger in size. When an access also misses in L2,
the L3 data cache is consulted. L3 has the highest access cost of the three caches and is
shared by all processors.
The objective is to perform multiprocessor cache simulations with a pthread parallel application,
changing several parameters like the number of processors, the size of L1, L2 and L3 caches
and the number of processors per cluster.
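The L1 → L2 → L3 lookup order described above can be sketched as follows. The class names, the direct-mapped organization and the sizes (taken from Table 1) are a simplified illustration, not the actual mycache.cpp implementation:

```cpp
#include <cstdint>
#include <vector>

// Minimal direct-mapped cache model: tags only, no data payload.
struct Cache {
    std::vector<uint64_t> tags;
    std::vector<bool> valid;
    uint64_t line_size, num_sets;
    uint64_t hits = 0, misses = 0;

    Cache(uint64_t size_bytes, uint64_t line_bytes)
        : line_size(line_bytes), num_sets(size_bytes / line_bytes) {
        tags.resize(num_sets);
        valid.resize(num_sets, false);
    }

    // Returns true on hit; on a miss, installs the line.
    bool access(uint64_t addr) {
        uint64_t line = addr / line_size;
        uint64_t set = line % num_sets;
        uint64_t tag = line / num_sets;
        if (valid[set] && tags[set] == tag) { ++hits; return true; }
        ++misses;
        valid[set] = true;
        tags[set] = tag;
        return false;
    }
};

// Three-level lookup: per-processor L1, cluster-shared L2, global L3.
struct Hierarchy {
    std::vector<Cache> l1;  // one per processor
    std::vector<Cache> l2;  // one per cluster of processors
    Cache l3;               // shared by all processors
    int cluster_size;

    Hierarchy(int procs, int cluster)
        : l3(4 * 1024 * 1024, 32), cluster_size(cluster) {
        for (int p = 0; p < procs; ++p) l1.emplace_back(128 * 1024, 32);
        for (int c = 0; c < (procs + cluster - 1) / cluster; ++c)
            l2.emplace_back(1024 * 1024, 32);
    }

    void access(int proc, uint64_t addr) {
        if (l1[proc].access(addr)) return;                 // L1 hit
        if (l2[proc / cluster_size].access(addr)) return;  // L2 hit
        l3.access(addr);                                   // L3 hit or memory
    }
};
```

Note how the miss path drives the statistics: each level only sees the accesses that missed in the level above it, which is why the parameters of L1 indirectly shape the L2 and L3 hit rates.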
2.2.1 Pre-process
The pthread application chosen is called dotprod.c and was taken from a list of sample pthread
programs provided on the website of the course [9]. The program has to be recompiled after
every change, which is done with the command: gcc dotprod.c -o dotprod -lpthread. After
downloading Pin, we chose an existing pin-tool called dcache.cpp, located in
pin_directory/source/tools/Memory/, as the basis of our final pin-tool. This pin-tool
simulates an L1 cache memory, and was therefore helpful for building the L2 and L3 caches. The
final pin-tool was named mycache.cpp.
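The text of dotprod.c itself is not reproduced in this report; a minimal pthread dot product in the same spirit might look like the sketch below. The function name, Task structure and block partitioning are illustrative assumptions, not the course's actual code:

```cpp
#include <pthread.h>

// Each worker computes a partial dot product over its own index range.
struct Task {
    const double *a, *b;
    long lo, hi;
    double sum;
};

static void *worker(void *arg) {
    Task *t = static_cast<Task *>(arg);
    double s = 0.0;
    for (long i = t->lo; i < t->hi; ++i)
        s += t->a[i] * t->b[i];
    t->sum = s;  // each thread writes only its own Task: no lock needed
    return nullptr;
}

// Split [0, n) across nthreads workers (nthreads <= 64) and reduce the sums.
double dot_product(const double *a, const double *b, long n, int nthreads) {
    pthread_t tid[64];
    Task task[64];
    long chunk = n / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        long lo = t * chunk;
        long hi = (t == nthreads - 1) ? n : lo + chunk;
        task[t] = {a, b, lo, hi, 0.0};
        pthread_create(&tid[t], nullptr, worker, &task[t]);
    }
    double dot = 0.0;
    for (int t = 0; t < nthreads; ++t) {
        pthread_join(tid[t], nullptr);
        dot += task[t].sum;
    }
    return dot;
}
```

A program of this shape is a good instrumentation target because its memory traffic (two streaming reads per iteration, split across threads) is easy to reason about when interpreting the simulated cache hit rates.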
2.2.2 Simulations
The first and largest series of simulations studies the impact of cache size, line size and
associativity. The idea was to study the three parameters for each cache and find which values
increase the hit rate the most. The cluster size and the number of processors were kept fixed
throughout these experiments, with the values 2 and 4 respectively. All parameters can be
changed inside the pin-tool, and every change requires a recompilation, done by running make
inside pin_directory/source/tools/Memory/. Afterwards, the command below is executed from
inside pin_directory to run the pthread program using the cache configuration given in
mycache.cpp:
./pin -t ./source/tools/Memory/obj-intel64/mycache.so -- ./dotprod
Table 1 shows the initial values given to the parameters. Starting with L1, we fixed the best
values for its three parameters and then proceeded to L2 and finally to L3. We call the set of
simulations related to a single parameter a simulation stage. For each cache, after a stage of
simulations was complete, the best value of the parameter under study was chosen and used in
the subsequent stages.
Parameters/Cache     L1       L2     L3
Cache Size           128 KB   1 MB   4 MB
Line Size (bytes)    32       32     32
Associativity        1        1      1
Table 1: Initial Parameter Values
As noted above, L2 is a cluster-shared cache, which means that it is shared among a set of
processors. The second series of simulations focused on the cluster size and how it affects
the L2 hit rate.
Finally, in the third series of simulations we studied how the number of processors devoted to
the execution of the pthread application affects the execution time of the program. This
parameter is set in two places: the pin-tool and the pthread program.
The parameters examined during the Pin simulations are briefly explained below.
2.2.2.1 Cache Size
Cache size is the total capacity of a cache in kilobytes (KB). It is expected that increasing
the cache size will also increase the hit rate. Simulations were performed for L1 cache sizes
of 1, 2, 4, 8, 16, 32 and 64 KB in order to confirm this expectation. The L2 cache size range
depends on the size chosen for L1, since it should be at least double; similarly, the L3 cache
size should be at least double the L2 cache size.
2.2.2.2 Line Size
Line size is the number of bytes brought into the cache at once, i.e., the size of a cache
line. All three caches were tested with values of 32, 64 and 128 bytes.
2.2.2.3 Associativity
Associativity is the number of locations (ways) in the cache to which a given memory address
can be mapped. Three simulations were run, with associativity values of 1, 2 and 4 for all
caches.
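The three parameters above combine into the cache geometry: with a capacity of size bytes, line bytes per line and ways lines per set, the cache has size / (line × ways) sets, and an address maps to set (addr / line) mod sets. A small sketch (the struct name is illustrative; the numbers in the usage note come from Table 1):

```cpp
#include <cstdint>

// Derives the number of sets, and the set index and tag of an address, from
// the cache capacity, line size and associativity.
struct Geometry {
    uint64_t line, sets;
    Geometry(uint64_t size, uint64_t line_, uint64_t ways)
        : line(line_), sets(size / (line_ * ways)) {}
    uint64_t set_index(uint64_t addr) const { return (addr / line) % sets; }
    uint64_t tag(uint64_t addr) const { return (addr / line) / sets; }
};
```

For example, Table 1's initial L1 (128 KB, 32-byte lines, direct-mapped) has 131072 / 32 = 4096 sets; doubling the associativity at the same capacity halves the number of sets, which is why associativity changes the conflict behavior without changing how much data fits.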
2.2.2.4 Cluster size
Cluster size is the number of processors sharing an L2 cache memory. For this study we tried
1, 2, 4 and 8 processors per L2.
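Which L2 a processor uses then follows directly from integer division. This one-line mapping assumes processors are assigned to clusters in order, which is an illustrative convention rather than something the report specifies:

```cpp
// With cluster_size processors per L2 cache, consecutive processors share a
// cache: processor p uses L2 number p / cluster_size (integer division).
inline int l2_bank(int proc, int cluster_size) { return proc / cluster_size; }
```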
2.2.3 Post-process
After each run of the pin-tool, the execution time is printed on the screen, while the L1, L2
and L3 hit rates, together with other relevant information, are written to an output text file
called mycache.out.
3 Results
In this section, the results of both Dimemas and Pin simulations are presented and explained.
3.1 Dimemas Simulations
3.1.1 Latency
The first simulations tested latency and how it affects the execution time of the program.
Several simulations with different values of latency were performed, from 1 ns up to
100,000 ns, increasing exponentially each time. Figure 1 presents, for different numbers of
processors, the latency values on the x-axis and the change of the execution time relative to
the ideal one on the y-axis. This ratio is calculated as Current Execution Time / Ideal
Execution Time. Small values of latency do not affect the time, while after 10,000 ns the
ratio clearly starts to increase. When the last latency value, 1,000,000 ns, is excluded, we
can see that the execution time already rises after 1,000 ns, and we therefore chose 1,000 ns
as the best latency that our application can handle.
Figure 1: Time Ratio based on Latency
3.1.2 Network Bandwidth
With the latency fixed at 1,000 ns, we moved on to the network bandwidth. Beginning with the
ideal, unlimited bandwidth, we tried several values from 1 to 100,000 Mbytes/s, increasing
exponentially; the x-axis shows the bandwidth and the y-axis the change in execution time. As
expected, small amounts of bandwidth cause traffic and lead to longer execution times. The
value of 1,000 Mbytes/s was chosen as the best one, since the improvement in time with larger
bandwidth was minimal and the cost of more bandwidth would be higher.
Figure 2: Time Ratio based on Bandwidth
3.1.3 Number of Buses
After fixing the latency and bandwidth to 1,000 ns and 1,000 Mbytes/s, we studied which number
of buses gives better results. Running simulations for 1, 2, 4, 8, 16 and 32 buses, it is
clear that with more buses the execution time approaches the ideal one. However, having many
buses is not feasible, or at least very difficult to implement. We also notice that the
application still performs well with a very small number of concurrent transfers, and
therefore 2 buses were chosen.
Figure 3: Time Ratio based on Number of Buses
3.1.4 Relative CPU performance
Having fixed values for the latency, bandwidth and number of buses, we tested how the
application performs with faster processors. With relative speeds from 0.5 up to 5 times the
original, increasing by 0.5 in each simulation, it is clearly observed that the speedup is
proportional to the increase. This time the ratio was calculated as Ideal Execution Time /
Current Execution Time, for easier reading of the graph.
Figure 4: Time Ratio based on Relative Processor Speed
3.2 Pin Simulations
3.2.1 Cache Size
Figure 5 presents the hit rate as a function of cache size for all three caches. For L1
(Figure 5-a), sizes of 1, 2, 4, 8, 16, 32 and 64 KB were tested, and 64 KB was chosen as the
best value. For L2, the size should be at least double the L1 size, so the range goes from
128 KB to 1024 KB; 256 KB was chosen, since the difference from larger sizes was not very
high. For the same reason, the L3 range goes from 512 KB up to 4096 KB, and 2048 KB was
selected as the fixed value.
Figure 5: Hit Rate based on Cache Size (a) for L1, (b) for L2 and (c) for L3
3.2.2 Line Size
After observing the effects of cache size, the line size was tested for three values: 32, 64
and 128 bytes. As can be seen in Figure 6, for all caches the hit rate rises as the line size
increases, and therefore 128 bytes was chosen.
Figure 6: Hit Rate based on Line Size, for L1, L2, L3 caches
3.2.3 Associativity
With the cache size and line size fixed, we studied the impact of associativity. Simulations
were performed for the values 1, 2 and 4. From Figure 7 it can be observed that associativity
does not significantly affect the hit rate of any of the caches.
Figure 7: Hit Rate based on Associativity for L1, L2, L3 caches
3.2.4 Cluster Size
The second series of simulations studied the cluster size. Figure 8 shows the hit rate for 1,
2, 4 and 8 processors per L2 cache. With more processors sharing one L2 cache, accesses to it
increase and conflicts become more likely, so the hit rate is expected to drop. Indeed, this
decrease can be seen in Figure 8.
Figure 8: Hit Rate based on Cluster Size (L2)
3.2.5 Number of Processors
The final series of simulations concerns the number of processors working on the application.
As observed in Figure 9, the execution time of the pthread application increases with the
number of processors. The parameters used for these simulations are the ones chosen in the
first series of simulations and are shown in Table 2.
Figure 9: Execution Time based on number of Processors
Parameters/Cache     L1     L2     L3
Cache Size (KB)      64     256    1024
Line Size (bytes)    128    128    128
Associativity        1      1      1
Table 2: Values of the parameters for the last series of simulations
4 Conclusions
This project focused on how hardware simulations are performed using two simulators: Dimemas
and the Pin tool. Hardware behavior was analyzed by examining the performance impact of
various parameters on multithreaded applications.
The simulations with Dimemas show how latency, network bandwidth, number of buses and CPU
speed can affect the execution time of a parallel application. The simulations with Pin,
concerning a multi-level cache memory, show how cache size, line size and associativity affect
a given cache as well as the caches below it. Additionally, the cluster size proves to be an
important factor for the L2 hit rate; since L2 misses proceed to L3, this parameter turns out
to be important for the L3 hit rate as well. Lastly, we examined the number of threads and how
increasing it raises the execution time. With Pin, we were able to measure the performance of
the application on hardware that we do not have.
This last series of experiments with Pin opens the door to further experimentation and
analysis of applications without access to the hardware, because it is either costly or
prohibited to use. Going even further, such simulations can help scientists examine the pros
and cons of implementing hardware as proposed or theoretically designed.
References
[1] J. Banks, J. Carson, B. Nelson, D. Nicol (2001). Discrete-Event System Simulation. Prentice
Hall. p. 3.
[2] J.A. Sokolowski, C.M. Banks (2009). Principles of Modeling and Simulation. Hoboken, NJ:
Wiley. p. 6.
[3] Barcelona Supercomputing Center. Dimemas. [Online]. Available:
http://www.bsc.es/computer-sciences/performance-tools/dimemas.
[4] Intel Software Network. Pin - A Dynamic Binary Instrumentation Tool. [Online]. Available:
http://www.pintool.org.
[5] J. Dunbar (2012, Mar.). NAS Parallel Benchmarks. [Online]. Available:
http://www.nas.nasa.gov/publications/npb.html
[6] H. S. Gelabert, G. L. Sánchez (2011, Nov.). Extrae: User guide manual for version 2.2.0.
Barcelona Supercomputing Center. [Online]. Available:
http://www.bsc.es/ssl/apps/performanceTools/files/docs/extrae-userguide.pdf
[7] Barcelona Supercomputing Center. Paraver [Online]. Available: http://www.bsc.es/computer-
sciences/performance-tools/paraver
[8] Barcelona Supercomputing Center. Software Modules [Online]. Available:
http://www.bsc.es/ssl/apps/performanceTools/
[9] A. Ramirez (2012, Jan). Primavera 2012. Tools and Measurement Techniques [Online].
Available: http://pcsostres.ac.upc.edu/eitm/doku.php/pri12
[10] Wikipedia, the free encyclopedia. CPU cache [Online]. Available:
http://en.wikipedia.org/wiki/CPU_cache