Course: Measurement Tools and Techniques (UPC)
Performance Analysis of Multithreaded Applications
Based on Hardware Simulations with Dimemas and the Pin Tool
Maria Stylianou
Universitat Politècnica de Catalunya (UPC)
June 15, 2012
Abstract
It is widely accepted that application development advances rapidly, while hardware design
proceeds at a slower pace and with less success. The high cost of acquiring state-of-the-art
hardware, as well as the absence of machines that can support the latest application
developments, leads scientists to look for other ways to examine the performance of their
applications. Hardware simulation therefore becomes crucial for application analysis. This
project simulates hardware using two simulators: one for exploring the impact of interconnect
parameters such as latency, network bandwidth and congestion, and the other for studying the
effect of parameters related to cache memory, such as cache size, cluster size and number of
processors. Our predictions about the strong effect of these parameters are confirmed by the
results extracted from our experiments.
1 Introduction
Simulation is defined as the imitation of the operation of a real-world process or system over
time [1]. In engineering, an architecture simulation makes it possible to model a real or
hypothetical situation on a computer so that it can later be studied and analyzed. By running
several simulations and varying parameters, researchers predict and draw conclusions about the
behavior of the system. Going a step further, simulation becomes crucial when the real
computer hardware is either not accessible or prohibited from being used, or when the hardware
is not yet built [2].
In this report, the attention revolves around hardware simulations, and more specifically
simulations using two tools: Dimemas and Pin. Both tools are used in this study to analyze
and predict hardware behavior during the execution of parallel programs. The major differences
that distinguish the two tools are briefly described below.
Dimemas is a performance analysis tool for message-passing programs [3], also characterized
as a trace-driven simulator. Taking as inputs a machine configuration and an application trace
file, Dimemas can reconstruct the time behavior of a parallel application and open the door to
experimenting with the performance of the modeled hardware.
Like Dimemas, the Pin tool can be used for program analysis. More precisely, Pin is a
dynamic binary instrumentation tool [4]: instrumentation takes place on compiled binaries at
runtime, so recompiling the source code is not needed. As described later, the two tools are
used in different scenarios to achieve different goals.
This paper continues in the next section with the methodology followed for setting up the proper
environments for both tools and performing the simulations. In Section 3, the results are
presented and discussed. Finally, in Section 4, conclusions are drawn from our observations.
2 Methodology
The study consists of two main parts: simulation with Dimemas and simulation with Pin. In
this section, both parts are explained in more detail.
2.1 Dimemas simulator
As explained above, Dimemas is used for analysing the performance of parallel
message-passing programs. Several message-passing libraries are supported [3], but for
this work an MPI application was chosen. Starting from the configuration parameters of a
machine, several simulations were performed to test and identify the sensitivity of the
application performance to interconnect parameters.
2.1.1 Pre-process
The MPI application used is MG from the NAS Parallel Benchmarks [5], and it was run on the
boada server provided by the Barcelona Supercomputing Center. This server has a dual Intel
Xeon E5645 with 24 cores. Traces were generated by running the program with 2, 4, 8, 16, 32
and 64 threads using Extrae, a dynamic instrumentation package that traces programs running
under a shared-memory or a message-passing programming model. More details on how to set up
Extrae can be found in [6].
Traces generated by Extrae can be visualized with Paraver, a performance analysis tool that
allows visual inspection of an application as well as a more detailed quantitative analysis of
the problems observed [7]. To be used as input to Dimemas, the traces must first be translated
from Paraver format to Dimemas format with the prv2trf translator, which can be found in [8].
The command line for running prv2trf is:
./prv2trf paraver_trace.prv dimemas_trace.trf
The second input to Dimemas is a configuration file that describes an architecture model
idealized from MareNostrum. This machine is ideal in the sense that it has zero latency,
unlimited bandwidth and no limit on the number of concurrent communications.
2.1.2 Simulations
The objective is to test the application under different conditions and machine characteristics.
The parameters changed in the simulations are the latency, the network bandwidth, the number of
buses and the relative processor speed. They are studied one by one in the order given above;
for each parameter, a range of values is specified over which the application is tested. After
choosing the best value for a parameter, we move on to the next parameter, keeping the value
just decided fixed. This process is performed 6 times, once for each trace generated with
Extrae for 2, 4, 8, 16, 32 and 64 threads. With this methodology, it becomes easier to observe
how the application behaves in each circumstance.
The first step after installing Dimemas is to run the Dimemas GUI located in
Dimemas_directory/bin/dimemas-gui.sh. In the window that opens, we choose Configuration →
Load Configuration to load the configuration file of the machine. Afterwards, we specify the
trace file converted by prv2trf by clicking Configuration → Initial Machine, and we Compute
the number of application tasks. After that, we are able to make changes to the machine
characteristics.
The parameters mentioned above can be changed from Configuration → Target Configuration.
Specifically, under Node information, the latency and the relative CPU performance can be
changed through the values of Startup on Remote Communication and Relative Processor Speed,
respectively. Under Environment information, the network bandwidth and the number of buses can
be modified. It is important to mention that after each parameter change, the button “Do all
the same” has to be pressed for the change to be applied to all nodes.
2.1.2.1 Latency
The first parameter studied was latency and its impact when increased or decreased. In
Dimemas, latency represents the local overhead of an MPI implementation. We ran simulations
with different values of latency, starting from 1 ns and multiplying by 10 each time, up to
100,000 ns. After each change, the new configuration was saved as a new configuration file.
2.1.2.2 Network Bandwidth
Another important parameter to study is the network bandwidth. In the ideal machine the
bandwidth is unlimited, so the impact of reducing it is interesting. We ran simulations
starting from 1 Mbyte/s and multiplying by 10 in each scenario, up to 1,000,000 Mbytes/s.
2.1.2.3 Number of Buses
An important question that needs to be answered concerns the impact of contention on the
application. In Dimemas, the number of buses defines how many transfers can take place at any
given time, so congestion can be modeled through it, although this is not the only way. With
these simulations, we examine the possibility that bad routing could cause contention and
negatively affect performance. Initially, the machine has no limit on the number of concurrent
communications. We then ran simulations for 1, 2, 4, 8, 16 and 32 buses.
2.1.2.4 Relative CPU performance
The last parameter examined was the relative processor speed, i.e., the impact of having a
faster processor in the machine. By faster we mean the speed of execution of the sequential
computation bursts between MPI calls. Initially, the relative speed is set to its minimum
value of 1. In our simulations, we tried values from 0.5 up to 5 times the original speed,
increasing by 0.5 in each simulation.
2.1.3 Post-process
As already explained, each parameter is studied in isolation, without changing any other
parameter. When all configuration files for a given parameter have been generated, they are
studied and compared, and the best value is chosen based on its impact on the execution time
and the cost that comes with it. The configuration file with this value is the one loaded in
subsequent simulations, where a new parameter is studied.
For each configuration saved during simulations, a Paraver file should be produced. This is
done with the command below:
./Dimemas3 -S -32K -pa new_paraver_trace.prv new_config_file.cfg
where we specify the name of the configuration file we have saved and the name we want the
new Paraver file to have.
The traces generated are opened with Paraver together with the initial trace files, in order
to compare them, observe performance characteristics and examine any problems indicated by the
simulator. In Section 3, the results of these simulations are presented and discussed.
2.2 Pin simulator
Pin analyzes programs by inserting arbitrary code inside executables [4]. In this project, a
pin-tool was designed to simulate a three-level cache hierarchy with a per-processor L1 data
cache, a cluster-shared L2 data cache and a globally-shared L3 data cache. Each processor
first accesses its dedicated L1 data cache, which is the fastest but usually the smallest.
When an access misses in L1, the L2 data cache is consulted. Each L2 cache serves a cluster of
processors and is usually slower than L1 but larger in size. When an access also misses in L2,
the L3 data cache is consulted. L3 has the highest access cost of the three caches and is
shared by all processors.
The objective is to perform multiprocessor cache simulations with a pthread parallel application,
changing several parameters like the number of processors, the size of L1, L2 and L3 caches
and the number of processors per cluster.
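The L1 → L2 → L3 lookup order described above can be sketched as follows. The class names, the direct-mapped organization and the sizes (taken from Table 1) are a simplified illustration, not the actual mycache.cpp implementation:

```cpp
#include <cstdint>
#include <vector>

// Minimal direct-mapped cache model: tags only, no data payload.
struct Cache {
    std::vector<uint64_t> tags;
    std::vector<bool> valid;
    uint64_t line_size, num_sets;
    uint64_t hits = 0, misses = 0;

    Cache(uint64_t size_bytes, uint64_t line_bytes)
        : line_size(line_bytes), num_sets(size_bytes / line_bytes) {
        tags.resize(num_sets);
        valid.resize(num_sets, false);
    }

    // Returns true on hit; on a miss, installs the line.
    bool access(uint64_t addr) {
        uint64_t line = addr / line_size;
        uint64_t set = line % num_sets;
        uint64_t tag = line / num_sets;
        if (valid[set] && tags[set] == tag) { ++hits; return true; }
        ++misses;
        valid[set] = true;
        tags[set] = tag;
        return false;
    }
};

// Three-level lookup: per-processor L1, cluster-shared L2, global L3.
struct Hierarchy {
    std::vector<Cache> l1;  // one per processor
    std::vector<Cache> l2;  // one per cluster of processors
    Cache l3;               // shared by all processors
    int cluster_size;

    Hierarchy(int procs, int cluster)
        : l3(4 * 1024 * 1024, 32), cluster_size(cluster) {
        for (int p = 0; p < procs; ++p) l1.emplace_back(128 * 1024, 32);
        for (int c = 0; c < (procs + cluster - 1) / cluster; ++c)
            l2.emplace_back(1024 * 1024, 32);
    }

    void access(int proc, uint64_t addr) {
        if (l1[proc].access(addr)) return;                 // L1 hit
        if (l2[proc / cluster_size].access(addr)) return;  // L2 hit
        l3.access(addr);                                   // L3 hit or memory
    }
};
```

Note how the miss path drives the statistics: each level only sees the accesses that missed in the level above it, which is why the parameters of L1 indirectly shape the L2 and L3 hit rates.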
2.2.1 Pre-process
The pthread application chosen is called dotprod.c and was taken from a list of sample pthread
programs provided on the website of the course [9]. The program has to be recompiled after
every change, which is done with the command: gcc dotprod.c -o dotprod -lpthread. After
downloading Pin, we chose an existing pin-tool called dcache.cpp, located in
pin_directory/source/tools/Memory/, as the basis of our final pin-tool. This pin-tool
simulates an L1 cache memory, and was therefore helpful for building the L2 and L3 caches. The
final pin-tool was named mycache.cpp.
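The text of dotprod.c itself is not reproduced in this report; a minimal pthread dot product in the same spirit might look like the sketch below. The function name, Task structure and block partitioning are illustrative assumptions, not the course's actual code:

```cpp
#include <pthread.h>

// Each worker computes a partial dot product over its own index range.
struct Task {
    const double *a, *b;
    long lo, hi;
    double sum;
};

static void *worker(void *arg) {
    Task *t = static_cast<Task *>(arg);
    double s = 0.0;
    for (long i = t->lo; i < t->hi; ++i)
        s += t->a[i] * t->b[i];
    t->sum = s;  // each thread writes only its own Task: no lock needed
    return nullptr;
}

// Split [0, n) across nthreads workers (nthreads <= 64) and reduce the sums.
double dot_product(const double *a, const double *b, long n, int nthreads) {
    pthread_t tid[64];
    Task task[64];
    long chunk = n / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        long lo = t * chunk;
        long hi = (t == nthreads - 1) ? n : lo + chunk;
        task[t] = {a, b, lo, hi, 0.0};
        pthread_create(&tid[t], nullptr, worker, &task[t]);
    }
    double dot = 0.0;
    for (int t = 0; t < nthreads; ++t) {
        pthread_join(tid[t], nullptr);
        dot += task[t].sum;
    }
    return dot;
}
```

A program of this shape is a good instrumentation target because its memory traffic (two streaming reads per iteration, split across threads) is easy to reason about when interpreting the simulated cache hit rates.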
2.2.2 Simulations
The first and largest series of simulations studies the impact of cache size, line size and
associativity. The idea was to study the three parameters for each cache and find which values
increase the hit rate the most. The cluster size and the number of processors were kept fixed
throughout these experiments, with the values 2 and 4 respectively. All parameters can be
changed inside the pin-tool, and every change requires a recompilation, done by running make
inside pin_directory/source/tools/Memory/. Afterwards, the command below is executed from
inside pin_directory to run the pthread program using the cache configuration given in
mycache.cpp:
./pin -t ./source/tools/Memory/obj-intel64/mycache.so -- ./dotprod
Table 1 shows the initial values given to the parameters. Starting with L1, we fixed the best
values for its three parameters and then proceeded to L2 and finally to L3. We call the set of
simulations related to a single parameter a simulation stage. For each cache, after a stage of
simulations was complete, the best value of the parameter under study was chosen and used in
the subsequent stages.
Parameters/Cache     L1       L2     L3
Cache Size           128 KB   1 MB   4 MB
Line Size (bytes)    32       32     32
Associativity        1        1      1
Table 1: Initial Parameter Values
As noted above, L2 is a cluster-shared cache, which means that it is shared among a set of
processors. The second series of simulations focused on the cluster size and how it affects
the L2 hit rate.
Finally, in the third series of simulations we studied how the number of processors devoted to
the execution of the pthread application affects the execution time of the program. This
parameter is set in two places: the pin-tool and the pthread program.
The parameters examined during the Pin simulations are briefly explained below.
2.2.2.1 Cache Size
Cache size is the total capacity of a cache in kilobytes (KB). It is expected that increasing
the cache size will also increase the hit rate. Simulations were performed for L1 cache sizes
of 1, 2, 4, 8, 16, 32 and 64 KB in order to confirm this expectation. The L2 cache size range
depends on the size chosen for L1, since it should be at least double; similarly, the L3 cache
size should be at least double the L2 cache size.
2.2.2.2 Line Size
Line size is the number of bytes brought into the cache at once, i.e., the size of a cache
line. All three caches were tested with values of 32, 64 and 128 bytes.
2.2.2.3 Associativity
Associativity is the number of locations (ways) in the cache to which a given memory address
can be mapped. Three simulations were run, with associativity values of 1, 2 and 4 for all
caches.
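The three parameters above combine into the cache geometry: with a capacity of size bytes, line bytes per line and ways lines per set, the cache has size / (line × ways) sets, and an address maps to set (addr / line) mod sets. A small sketch (the struct name is illustrative; the numbers in the usage note come from Table 1):

```cpp
#include <cstdint>

// Derives the number of sets, and the set index and tag of an address, from
// the cache capacity, line size and associativity.
struct Geometry {
    uint64_t line, sets;
    Geometry(uint64_t size, uint64_t line_, uint64_t ways)
        : line(line_), sets(size / (line_ * ways)) {}
    uint64_t set_index(uint64_t addr) const { return (addr / line) % sets; }
    uint64_t tag(uint64_t addr) const { return (addr / line) / sets; }
};
```

For example, Table 1's initial L1 (128 KB, 32-byte lines, direct-mapped) has 131072 / 32 = 4096 sets; doubling the associativity at the same capacity halves the number of sets, which is why associativity changes the conflict behavior without changing how much data fits.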
2.2.2.4 Cluster size
Cluster size is the number of processors sharing an L2 cache memory. For this study we tried
1, 2, 4 and 8 processors per L2.
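Which L2 a processor uses then follows directly from integer division. This one-line mapping assumes processors are assigned to clusters in order, which is an illustrative convention rather than something the report specifies:

```cpp
// With cluster_size processors per L2 cache, consecutive processors share a
// cache: processor p uses L2 number p / cluster_size (integer division).
inline int l2_bank(int proc, int cluster_size) { return proc / cluster_size; }
```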
2.2.3 Post-process
After each run of the pin-tool, the execution time is printed on the screen, while the L1, L2
and L3 hit rates, together with other relevant information, are written to an output text file
called mycache.out.
3 Results
In this section, the results of both Dimemas and Pin simulations are presented and explained.
3.1 Dimemas Simulations
3.1.1 Latency
The first simulations tested latency and how it affects the execution time of the program.
Several simulations with different values of latency were performed, from 1 ns up to
100,000 ns, increasing exponentially each time. Figure 1 presents, for different numbers of
processors, the latency values on the x-axis and the change of the execution time relative to
the ideal one on the y-axis. This ratio is calculated as Current Execution Time / Ideal
Execution Time. Small values of latency do not affect the time, while after 10,000 ns the
ratio clearly starts to increase. When the last latency value, 1,000,000 ns, is excluded, we
can see that the execution time already rises after 1,000 ns, and we therefore chose 1,000 ns
as the best latency that our application can handle.
Figure 1: Time Ratio based on Latency
3.1.2 Network Bandwidth
With the latency fixed at 1,000 ns, we moved on to the network bandwidth. Beginning with the
ideal, unlimited bandwidth, we tried several values from 1 to 100,000 Mbytes/s, increasing
exponentially; the x-axis shows the bandwidth and the y-axis the change in execution time. As
expected, small amounts of bandwidth cause traffic and lead to longer execution times. The
value of 1,000 Mbytes/s was chosen as the best one, since the improvement in time with larger
bandwidth was minimal and the cost of more bandwidth would be higher.
Figure 2: Time Ratio based on Bandwidth
3.1.3 Number of Buses
After fixing the latency and bandwidth to 1,000 ns and 1,000 Mbytes/s, we studied which number
of buses gives better results. Running simulations for 1, 2, 4, 8, 16 and 32 buses, it is
clear that with more buses the execution time approaches the ideal one. However, having many
buses is not feasible, or at least very difficult to implement. We also notice that the
application still performs well with a very small number of concurrent transfers, and
therefore 2 buses were chosen.
Figure 3: Time Ratio based on Number of Buses
3.1.4 Relative CPU performance
Having fixed values for the latency, bandwidth and number of buses, we tested how the
application performs with faster processors. With relative speeds from 0.5 up to 5 times the
original, increasing by 0.5 in each simulation, it is clearly observed that the speedup is
proportional to the increase. This time the ratio was calculated as Ideal Execution Time /
Current Execution Time, for easier reading of the graph.
Figure 4: Time Ratio based on Relative Processor Speed
3.2 Pin Simulations
3.2.1 Cache Size
Figure 5 presents the hit rate as a function of cache size for all three caches. For L1
(Figure 5-a), sizes of 1, 2, 4, 8, 16, 32 and 64 KB were tested, and 64 KB was chosen as the
best value. For L2, the size should be at least double the L1 size, so the range goes from
128 KB to 1024 KB; 256 KB was chosen, since the difference from larger sizes was not very
high. For the same reason, the L3 range goes from 512 KB up to 4096 KB, and 2048 KB was
selected as the fixed value.
Figure 5: Hit Rate based on Cache Size (a) for L1, (b) for L2 and (c) for L3
3.2.2 Line Size
After observing the effects of cache size, the line size was tested for three values: 32, 64
and 128 bytes. As can be seen in Figure 6, for all caches the hit rate rises as the line size
increases, and therefore 128 bytes was chosen.
Figure 6: Hit Rate based on Line Size, for L1, L2, L3 caches
3.2.3 Associativity
With the cache size and line size fixed, we studied the impact of associativity. Simulations
were performed for the values 1, 2 and 4. From Figure 7 it can be observed that associativity
does not significantly affect the hit rate of any of the caches.
Figure 7: Hit Rate based on Associativity for L1, L2, L3 caches
3.2.4 Cluster Size
The second series of simulations studied the cluster size. Figure 8 shows the hit rate for 1,
2, 4 and 8 processors per L2 cache. With more processors sharing one L2 cache, accesses to it
increase and conflicts become more likely, so the hit rate is expected to drop. Indeed, this
decrease can be seen in Figure 8.
Figure 8: Hit Rate based on Cluster Size (L2)
3.2.5 Number of Processors
The final series of simulations concerns the number of processors working on the application.
As observed in Figure 9, the execution time of the pthread application increases with the
number of processors. The parameters used for these simulations are the ones chosen in the
first series of simulations and are shown in Table 2.
Figure 9: Execution Time based on number of Processors
Parameters/Cache     L1     L2     L3
Cache Size (KB)      64     256    1024
Line Size (bytes)    128    128    128
Associativity        1      1      1
Table 2: Values of the parameters for the last series of simulations
4 Conclusions
This project focused on how hardware simulations are performed using two simulators: Dimemas
and the Pin tool. Hardware behavior was analyzed by examining the performance impact of
various parameters on multithreaded applications.
The simulations with Dimemas show how latency, network bandwidth, number of buses and CPU
speed can affect the execution time of a parallel application. The simulations with Pin,
concerning a multi-level cache memory, show how cache size, line size and associativity affect
a given cache as well as the caches below it. Additionally, the cluster size proves to be an
important factor for the L2 hit rate; since L2 misses proceed to L3, this parameter turns out
to be important for the L3 hit rate as well. Lastly, we examined the number of threads and how
increasing it raises the execution time. With Pin, we were able to measure the performance of
the application on hardware that we do not have.
This last series of experiments with Pin opens the door to further experimentation and
analysis of applications without access to the hardware, because it is either costly or
prohibited to use. Going even further, such simulations can help scientists examine the pros
and cons of implementing hardware as proposed or theoretically designed.
References
[1] J. Banks, J. Carson, B. Nelson, D. Nicol (2001). Discrete-Event System Simulation. Prentice
Hall. p. 3.
[2] J.A. Sokolowski, C.M. Banks (2009). Principles of Modeling and Simulation. Hoboken, NJ:
Wiley. p. 6.
[3] Barcelona Supercomputing Center. Dimemas. [Online]. Available:
http://www.bsc.es/computer-sciences/performance-tools/dimemas.
[4] Intel Software Network. Pin - A Dynamic Binary Instrumentation Tool. [Online]. Available:
http://www.pintool.org.
[5] J. Dunbar (2012, Mar.). NAS Parallel Benchmarks. [Online]. Available:
http://www.nas.nasa.gov/publications/npb.html
[6] H. S. Gelabert, G. L. Sánchez (2011, Nov.). Extrae: User guide manual for version 2.2.0.
Barcelona Supercomputing Center. [Online]. Available:
http://www.bsc.es/ssl/apps/performanceTools/files/docs/extrae-userguide.pdf
[7] Barcelona Supercomputing Center. Paraver [Online]. Available: http://www.bsc.es/computer-
sciences/performance-tools/paraver
[8] Barcelona Supercomputing Center. Software Modules [Online]. Available:
http://www.bsc.es/ssl/apps/performanceTools/
[9] A. Ramirez (2012, Jan). Primavera 2012. Tools and Measurement Techniques [Online].
Available: http://pcsostres.ac.upc.edu/eitm/doku.php/pri12
[10] Wikipedia, the free encyclopedia. CPU cache [Online]. Available:
http://en.wikipedia.org/wiki/CPU_cache