new the papi performance analysis toolgec.di.uminho.pt/discip/minf/cpd1213/ruis_papi_nov12.pdf ·...

36
THE PAPI PERFORMANCE ANALYSIS TOOL Rui Silva 20 November 2012 Universidade do Minho terça-feira, 20 de Novembro de 2012

Upload: others

Post on 25-Oct-2020

13 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

THE PAPI PERFORMANCE ANALYSIS TOOL

Rui Silva

20 November 2012Universidade do Minho

terça-feira, 20 de Novembro de 2012

Page 2: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

OUTLINE

• Introduction to PAPI

• My experience with PAPI

terça-feira, 20 de Novembro de 2012

Page 3: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

INTRODUCTION TO PAPI

terça-feira, 20 de Novembro de 2012

Page 4: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

Experiences and Lessons Learned with aPortable Interface to Hardware Performance Counters

Jack Dongarra Kevin London Shirley Moore Philip Mucci Daniel TerpstraHaihang You Min Zhou

University of TennesseeInnovative Computing LaboratoryKnoxville, TN 37996-3450 USA

E-mail: [email protected]

Abstract

The PAPI project has defined and implemented a cross-platform interface to the hardware counters available onmost modern microprocessors. The interface has gainedwidespread use and acceptance from hardware vendors,users, and tool developers. This paper reports on experi-ences with the community-based open-source effort to de-fine the PAPI specification and implement it on a varietyof platforms. Collaborations with tool developers who haveincorporated support for PAPI are described. Issues relatedto interpretation and accuracy of hardware counter dataand to the overheads of collecting this data are discussed.The paper concludes with implications for the design of thenext version of PAPI.

1. Introduction

The Performance API (PAPI) is a specification of across-platform interface to hardware performance counterson modern microprocessors [1]. These counters exist as asmall set of registers that count events, which are occur-rences of specific signals related to a processor’s function.Monitoring these events has a variety of uses in applica-tion performance analysis and tuning, benchmarking, anddebugging. The PAPI specification consists of both a stan-dard set of events deemed most relevant for application per-formance tuning, as well as both high-level and low-levelsets of routines for accessing the counters. The high levelinterface simply provides the ability to start, stop, and readthe counters for a specified list of events, and is intendedfor the acquisition of simple but accurate measurements byapplication engineers. The fully programmable low-level

Figure 1. Layered architecture of the PAPI im-plementation

interface provides additional features and options and is in-tended for third-party tool developers or application devel-opers with more sophisticated needs.In addition to the specification, the PAPI project has de-

veloped a reference implementation of the library. Thisimplementation uses a layered approach, as illustrated inFigure 1. The machine-dependent part of the implementa-tion, called the substrate, is all that needs to be rewrittento port PAPI to a new architecture. Platforms supportedby the reference implementation include SGI IRIX, IBMAIX, HP/Compaq Tru64 UNIX, Linux/x86, LInux-IA64,Cray T3E, Sun Solaris, and Windows. For each platform,the reference implementation attempts to map as many ofthe PAPI standard events as possible to native events on thatplatform.PAPI has been under development for four years and

has become widely adopted by application and performanceanalysis tool developers. It is installed and in use for appli-cation performance tuning at a number of high performance

PAPI

• Access to hardware performance counters

• Justification of gains of optimisations

• Identify side effects

terça-feira, 20 de Novembro de 2012

Page 5: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

THE RESULTS. WHAT IS IT SIGNIFICANCE?

• The number is good or bad?

• The question is: it is possible to improve?

• Knowledge of the problem

• Define the bottleneck of the code (Other tools. Gprof...)

terça-feira, 20 de Novembro de 2012

Page 6: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

IMPROVE PERFORMANCE

• Knowledge of the problem

• Knowledge and implementation the optimizations

• Analyse and compare performance counters

• Justify improvements with the new values

• Analyse side effects

• It may be necessary to analyse more counters (using guessing/intuition )

terça-feira, 20 de Novembro de 2012

Page 7: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

EXAMPLE (LOOP FUSION)

for i = 1 to N M[ i ] += C1 for i = 1 to N M[ i ] ∗= C2

for i = 1 to N M[ i ] += C1 M[ i ] ∗= C2

terça-feira, 20 de Novembro de 2012

Page 8: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

gain

s

EXAMPLE (LOOP FUSION)

0.0

0.2

0.4

0.6

0.8

1.0

1.2

instru

ction

clock

cycles

L1 da

ta ca

che m

isses

L2 da

ta ca

che m

isses

L3 to

tal ca

che m

isses

• What is the problem?

• Less instructions

• Better locality in access data

• But more clock cycles

terça-feira, 20 de Novembro de 2012

Page 9: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

EXAMPLE (LOOP FUSION)

• More misses on access the instructionga

ins

0.0

0.2

0.4

0.6

0.8

1.0

1.2

instru

ction

clock

cycles

L1 in

struc

tion c

ache

miss

es

L1 da

ta ca

che m

isses

L2 da

ta ca

che m

isses

L3 to

tal ca

che m

isses

terça-feira, 20 de Novembro de 2012

Page 10: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

PERFORMANCE METRICS

• CPI and misse rate

• Hides increases in instructions and access

• CPE and Misses per element

• Different problems, different values

• It is not possible to compare the performance of different problems

terça-feira, 20 de Novembro de 2012

Page 11: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

EXAMPLE (-O0 VS -O3)

1 / 5

!"#$%&'(')'*+,-.#/#&'

012%/234&'#-'#-+-56-.7&8'9&.,2#&:-+'#-'72:#;2:-'

1.a) e 1.b) -O0 -O3 #I 3.1x106 1.0x106 #CC 2.0x106 1.4x106 CPI 0.6 1.3 Nota: deve converter-se os valores dos contadores para “double” para imprimir o CPI no ecrã. 1.c) e 1d) -O0 -O3 Texe (ms) 2.0x106 / 2.6x109 = 0.769 1.4x106/ 2.6x109 = 0.538 PAPI_usec 0.930 0.694 Nota: a diferença entre o estimado e o real é um pouco superior ao que seria de esperar 1.e) #I diminui em 2.0/1.4 = 3.1 vezes; CPI aumenta =1.3/0.6 = 2.2 vezes Texe diminui com O3 porque #I * CPI diminui 1.f) -O0 -O3 CPE 2,0x106 / 256x256 = 30.5 1,4x106/ 256x256 = 21.4

Convolve 3x1

terça-feira, 20 de Novembro de 2012

Page 12: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

WHAT TO MEASURE?

• All code

• Hide improvements

• Part of the code that was optimised

• The size of input

• Attention to precision of PAPI (papi_cost)

• Data size (example optimisation access data, the problem do not fit in cache levels)

terça-feira, 20 de Novembro de 2012

Page 13: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

HOW TO USE PAPI (EXAMPLE 1)

(...) string nEvents[NUMEVENTS] = { "PAPI_TOT_INS", "PAPI_TOT_CYC" }; int events[NUMEVENTS] = {PAPI_TOT_INS, PAPI_TOT_CYC}; long long values[NUMEVENTS]; int errorcode; char errorstring[PAPI_MAX_STR_LEN+1];

// Initialize Papi and its events! startPAPI();! errorcode = PAPI_start_counters(events, NUMEVENTS);

! convolve3x1 (res_img->buf, img->buf, img->width, img->height);

errorcode = PAPI_stop_counters(values, NUMEVENTS);! for (int w=0; w<NUMEVENTS; w++)! ! cout << nEvents[w] << ":" << values[w] << endl;! (...)

terça-feira, 20 de Novembro de 2012

Page 14: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

PROBLEM IN THIS APPROACH

• The number of counters is limited

• Solution

• Run several times

terça-feira, 20 de Novembro de 2012

Page 15: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

HOW TO USE PAPI (EXAMPLE 1I)

int main(int argc, char **argv) {! events_define_by_user();! for (int i = 0; i < papi_profiler_length_events; i++) { init_program(); papi_profiler_i = i;

main2(argc, argv); //original main

PAPI_shutdown(); } print_cache();

exit(0);

}

(...) // Initialize Papi and its events papi_profiler_start();

! convolve3x1 (res_img->buf, img->buf, img->width, img->height); papi_profiler_stop();! (...)

terça-feira, 20 de Novembro de 2012

Page 16: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

PAPI COMMAND

• papi_avail

• papi_error_codes

• papi_cost

• papi_mem_info

• papi_native_avail

terça-feira, 20 de Novembro de 2012

Page 17: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

MY EXPERIENCE WITH PAPI

terça-feira, 20 de Novembro de 2012

Page 18: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

COUNTERS

• Total of instructions completed

• Total cycles

• Cache accesses (L1, L2 and L3)

• Cache misses (L1, L2 and L3)

terça-feira, 20 de Novembro de 2012

Page 19: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

CASE STUDIES

• Molecular dynamics simulation

• Matrix Multiplication

• Others

terça-feira, 20 de Novembro de 2012

Page 20: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

PAPISEQUENTIAL VERSIONS

terça-feira, 20 de Novembro de 2012

Page 21: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

LOOP REORDER

terça-feira, 20 de Novembro de 2012

Page 22: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

LOOP REORDER

3.2. Optimisation 17

PAPI TOT INS Total of instructions completed.PAPI TOT CYC Total cycles.PAPI LST INS Load/store instructions completed.PAPI L1 DCM Level 1 data cache misses.PAPI L2 DCA Level 2 data cache accesses.PAPI L2 DCM Level 2 data cache misses.PAPI L3 TCA Level 3 data+ instruction cache accesses.PAPI L3 TCM Level 3 data+ instruction cache misses.

3.2.1 Loop Reorder

To analyse this optimisation it was used the Matrix Multiplication. Theloop reorder changes the order of loop from IJK to IKJ (figure 2.1).

!"#$%$&$'$("$)*$$!"#$+$&$'$("$,-$$ $!"#$.$&$'$("$)-$$ $ $,/%0/+0$1&$*/%0/.02-/.0/+0$

The reorder of loop provides better gains on larger matrices (figure3.1). The most significant improvement in performance occurs whenthere is a substantial increase of locality on the last level of cache.However, on small sizes there is no improvement, because the problemfits in L1 cache.

The increase locality in this case is reflected in a significant increasein performance, because there is no change in the number of instruc-tions. The instructions remain constant because the matrices are square.In the case of matrices aren’t square, the number of instructions isn’tconstant (can be larger or smaller) since the loops have different sizes.

3.2.2 Loop Tiling

The loop tiling is analysed in two cases: MM and MD. For the first caseit was necessary to include three additional loops.

!"#$%$&$'$("$)*$$$$$!"#$+$&$'$("$,-$$$$$$$$$!"#$.$&$'$("$)-$$$$$$$$$$$$$,/%0/+0$1&$*/%0/.02-/.0/+0$

!"#$3%$&$'$("$)*4$3%1&35"6.7%89$$$$$!"#$3+$&$'$("$,-4$3+1&35"6.7%89$$$$$$$$$!"#$3.$&$'$("$)-4$3.1&35"6.7%89$$$$$$$$$$$$$!"#$%$&$3%$("$3%$$1$35"6.7%89$$$$$$$$$$$$$$$$$!"#$+$&$3+$("$3+$1$35"6.7%89$$$$$$$$$$$$$$$$$$$$$!"#$.$&$3.$("$3.$1$35"6.7%89$$$$$$$$$$$$$$$$$$$$$$$$$,/%0/+0$1&$*/%0/.02-/.0/+0$

&:$

In these new three loops all matrices are accessed partitioned byblocks (figure 3.2). Therefore, the sub-block C0,0 is done by multiplyingthe sub-block A0,0 with the sub-block B0,0 and added to the result ofmultiplying the sub-block A0,1 with the sub-block B1,0.

6 2. Locality optimisation techniques

!" #"

$" %" &"

'()"

(a) IJK order

!" #"

$" %" &"

'()"

(b) IKJ order

Figure 2.1: Loop Reorder - Data access corresponding to the calculation

of first elements (8 iterations) on matrix multiplication

2.1a), because the data is accessed by columns. So this algorithm don’t

have locality in access to matrix B.

By applying the loop reorder, the order of data access is changed.

In the algorithm IKJ (figure 2.1b), the elements in both matrices are

accessed by rows (increase spatial locality), but this change does not

increase the temporal locality.

2.1.2 Loop Tiling(LT)

The general approach is to use a divide and conquer strategy: the orig-

inal problem is subdivided into sub-problems of smaller size that fits

into a certain level of memory hierarchy [YRP+07]. This optimisation

intends to take advantage of temporal locality in the algorithm since

each sub-problems is to stay in one of the levels of the cache. However,

the loop tiling increases the number of instructions due to two factors:

• more instructions of the control loop.

• more instructions of load and store. In the case of MD study

(section 3.1.2), the calculation of interactions between particle “A”

and the others, particle “A” is accessed several times (depends on

the number of blocks).

In loop tiling it is created an additional parameter, the parameter

that defines the size partitioning of the cycle (number of blocks). The

choice of the parameter has implications in the number of instructions

6 2. Locality optimisation techniques

!" #"

$" %" &"

'()"

(a) IJK order

!" #"

$" %" &"

'()"

(b) IKJ order

Figure 2.1: Loop Reorder - Data access corresponding to the calculation

of first elements (8 iterations) on matrix multiplication

2.1a), because the data is accessed by columns. So this algorithm don’t

have locality in access to matrix B.

By applying the loop reorder, the order of data access is changed.

In the algorithm IKJ (figure 2.1b), the elements in both matrices are

accessed by rows (increase spatial locality), but this change does not

increase the temporal locality.

2.1.2 Loop Tiling(LT)

The general approach is to use a divide and conquer strategy: the orig-

inal problem is subdivided into sub-problems of smaller size that fits

into a certain level of memory hierarchy [YRP+07]. This optimisation

intends to take advantage of temporal locality in the algorithm since

each sub-problems is to stay in one of the levels of the cache. However,

the loop tiling increases the number of instructions due to two factors:

• more instructions of the control loop.

• more instructions of load and store. In the case of MD study

(section 3.1.2), the calculation of interactions between particle “A”

and the others, particle “A” is accessed several times (depends on

the number of blocks).

In loop tiling it is created an additional parameter, the parameter

that defines the size partitioning of the cycle (number of blocks). The

choice of the parameter has implications in the number of instructions

IJK

IKJ

terça-feira, 20 de Novembro de 2012

Page 23: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

LOOP REORDER (MM)

• No gains on small problems

• Fit in L1/L2 cache

• Better locality => Increased performance

• Better usage of L2/L3

• Instruction count remains constant

Ratio Base/Loop Reorder

clock cycles

problem size KB

gain

01234567

24 96 38415

3661

4424

57698

30424 96 38

415

3661

4424

57698

304

L1 misses

problem size KB

02468

24 96 38415

3661

4424

57698

304

L2 misses

problem size KB

gain

0100200300400

24 96 38415

3661

4424

57698

304

L3 misses

problem size KB

020406080

24 96 38415

3661

4424

57698

304

terça-feira, 20 de Novembro de 2012

Page 24: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

LOOP TILING

terça-feira, 20 de Novembro de 2012

Page 25: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

MATRIX MULTIPLICATION

• Matrix computations by blocks

• Block size is adjusted to different levels of the cache

2.1. Algorithm 7

and cache misses. If the parameter is small the number of instructions

will be higher, but if the parameter is too large it cannot take advantage

of the faster cache levels. This technique can be applied multiple times,

thus enabling a better fitting of the problem to multiple levels of the

cache.

!" #"

$" %" &"

'()(*+"

Figure 2.2: Loop tiling - Data access corresponding to the calculation

of first elements (8 iterations) on matrix multiplication

In this algorithm the data is accessed by blocks (figure 2.2), thereby

dividing the problem into sub-problems. With this division it is intended

that sub-problems have a smaller size than the level of the cache that

it is intended to optimise.

2.1.3 Loop Fusion(LF)

The loop fusion merges multiple operations over a data set into the same

structure for control of the loop. Therefore the number of instructions

and the number of data accesses is reduced (less instructions for con-

trolling the loops), which implies a decrease in the number of misses.

However the technique increases the code located in the cycle, so it can

increase the number of misses in instruction cache.

This technique is applied when several functions are applied to the

same data (Table 2.1).

Loop Loop fusion

for i = 1 to NM[ i ] += C1

for i = 1 to NM[ i ] ∗= C2

for i = 1 to NM[ i ] += C1M[ i ] ∗= C2

Table 2.1: Loop Fusion - Differences of the code

In this example it is applied two functions to data. Thus, it is

necessary to go twice over the data, but it is possible to apply two

functions at the same time, thus avoid duplication of data accesses.

terça-feira, 20 de Novembro de 2012

Page 26: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

MATRIX MULTIPLICATION20 3. Impact of optimisations

instructions

block size (KB)

nu

mb

er

of

inst

ruct

ion

s

0e+00

1e+11

2e+11

3e+11

4e+11

5e+11

0,020,

090,

381,5 6 24 9638

415

3661

44

2457

6

clock cycles

block size (KB)

nu

mb

er

of

cycl

es

0e+00

2e+11

4e+11

6e+11

8e+11

1e+12

0,020,

090,

381,5 6 24 9638

415

3661

44

2457

6

L1 refs

block size (KB)

nu

mb

er

of

acc

ess

0.0e+00

5.0e+10

1.0e+11

1.5e+11

2.0e+11

2.5e+11

3.0e+11

0,020,

090,

381,5 6 24 9638

415

3661

44

2457

6

L1 misses

block size (KB)

nu

mb

er

of

mis

ses

0e+00

2e+09

4e+09

6e+09

8e+09

0,020,

090,

381,5 6 24 9638

415

3661

44

2457

6

L2 misses

block size (KB)

nu

mb

er

of

mis

ses

0e+00

2e+09

4e+09

6e+09

8e+09

0,020,

090,

381,5 6 24 9638

415

3661

44

2457

6

L3 misses

block size (KB)

nu

mb

er

of

mis

ses

0e+00

2e+09

4e+09

6e+09

8e+09

0,020,

090,

381,5 6 24 9638

415

3661

44

2457

6

(a) Matrix Multiplication

instructions

block size (KB)

nu

mb

er

of

inst

ruct

ion

s

0e+00

1e+12

2e+12

3e+12

4e+12

5e+12

6e+12

0.35

3.52

35.1

6

351.

56

3515

.63

3515

6.25

clock cycles

block size (KB)

nu

mb

er

of

cycl

es

0e+00

1e+12

2e+12

3e+12

4e+12

0.35

3.52

35.1

6

351.

56

3515

.63

3515

6.25

L1 refs

block size (KB)

nu

mb

er

of

acc

ess

0.0e+00

5.0e+11

1.0e+12

1.5e+12

0.35

3.52

35.1

6

351.

56

3515

.63

3515

6.25

L1 misses

block size (KB)

nu

mb

er

of

mis

ses

0.0e+00

5.0e+10

1.0e+11

1.5e+11

2.0e+11

0.35

3.52

35.1

6

351.

56

3515

.63

3515

6.25

L2 misses

block size (KB)

nu

mb

er

of

mis

ses

0e+00

2e+09

4e+09

6e+09

8e+09

0.35

3.52

35.1

6

351.

56

3515

.63

3515

6.25

L3 misses

block size (KB)

nu

mb

er

of

mis

ses

0.0e+00

5.0e+09

1.0e+10

1.5e+10

2.0e+10

0.35

3.52

35.1

6

351.

56

3515

.63

3515

6.25

(b) Molecular Dynamic Simulation

Figure 3.3: Loop tiling optimisation

terça-feira, 20 de Novembro de 2012

Page 27: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

LOOP TILING (MD)

• Instruction count overhead on small block size

• Inner loop is small

• Tile size can be tuned to L1/L2/L3

• More impact of L2/L3 misses

• Similar results in MM

Counts as function of tile size (problem size ~35MB)

Base version

instructions

block size (KB)

num

ber

0e+001e+122e+123e+124e+125e+126e+12

0.353.5

235

.1635

1.56

3515

.63

3515

6.25

clock cycles

block size (KB)

0e+001e+122e+123e+124e+12

0.353.5

235

.1635

1.56

3515

.63

3515

6.25

L1 refs

block size (KB)

0.0e+005.0e+111.0e+121.5e+12

0.353.5

235

.1635

1.56

3515

.63

3515

6.25

L1 misses

block size (KB)

num

ber

0.0e+005.0e+101.0e+111.5e+112.0e+11

0.353.5

235

.1635

1.56

3515

.63

3515

6.25

L2 misses

block size (KB)

0e+002e+094e+096e+098e+09

0.353.5

235

.1635

1.56

3515

.63

3515

6.25

L3 misses

block size (KB)

0.0e+005.0e+091.0e+101.5e+102.0e+10

0.353.5

235

.1635

1.56

3515

.63

3515

6.25

terça-feira, 20 de Novembro de 2012

Page 28: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

PAPIPARALLEL VERSIONS

terça-feira, 20 de Novembro de 2012

Page 29: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

CODE IN PARALLEL VERSIONS (1)void runiters(MD *md, Particles *particulas) {

Reduction vars[md->threads]; create_newtowsArrays(vars, md); md->move = 0;#pragma omp parallel { papi_profiler_start();

for (; md->move < md->movemx;) {

#pragma omp master cicleDoMove(md, particulas); // Calcular o movimento

cicleForces(md, particulas, vars); // Calcular a força

#pragma omp master { cicleMkekin(md, particulas); // Scale forces, update velocities cicleVelavg(md, particulas); // calcular a velocidade scale_temperature(md, particulas); // temperature scale if required get_full_potential_energy(md); // sum to get full potential energy and virial }

#pragma omp master md->move++;#pragma omp barrier } papi_profiler_stop(); }}

terça-feira, 20 de Novembro de 2012

Page 30: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

CODE IN PARALLEL VERSIONS (2)

int main(int argc, char **argv) {! events_define_by_user();! for (int i = 0; i < papi_profiler_length_events; i++) { init_program(); papi_profiler_i = i;

main2(argc, argv); //original main

PAPI_shutdown(); } print_cache();

exit(0);

}

terça-feira, 20 de Novembro de 2012

Page 31: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

MD (AOP, SOA)instructions

0e+00

1e+12

2e+12

3e+12

4e+12

1 3 6122448 1 3 6122448

clock cycles

0e+00

1e+12

2e+12

3e+12

4e+12

1 3 6122448 1 3 6122448

L1 refs

0e+00

2e+11

4e+11

6e+11

8e+11

1 3 6122448 1 3 6122448

L1 misses

0.0e+00

5.0e+10

1.0e+11

1.5e+11

1 3 6122448 1 3 6122448

L2 cache refs

0.0e+00

5.0e+10

1.0e+11

1.5e+11

1 3 6122448 1 3 6122448

L2 cache misses

0e+00

1e+08

2e+08

3e+08

4e+08

5e+08

1 3 6122448 1 3 6122448

L3 cache acess

0e+001e+082e+083e+084e+085e+086e+08

1 3 6122448 1 3 6122448

L3 cache misses

0.0e+00

5.0e+07

1.0e+08

1.5e+08

2.0e+08

2.5e+08

1 3 6122448 1 3 6122448

AOP SOA

number of threads number of threads number of threads number of threads

number of threads number of threads number of threads number of threads

AOP SOA AOP SOA AOP SOA

terça-feira, 20 de Novembro de 2012

Page 32: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

AOP12 THREADSinstructions

0.0e+005.0e+101.0e+111.5e+112.0e+112.5e+113.0e+113.5e+11

0 1 2 3 4 5 6 7 8 91011

clock cycles

0.0e+00

5.0e+10

1.0e+11

1.5e+11

2.0e+11

0 1 2 3 4 5 6 7 8 91011

L1 refs

0e+001e+102e+103e+104e+105e+106e+107e+10

0 1 2 3 4 5 6 7 8 91011

L1 misses

0.0e+002.0e+094.0e+096.0e+098.0e+091.0e+101.2e+101.4e+10

0 1 2 3 4 5 6 7 8 91011

L2 cache refs

0.0e+002.0e+094.0e+096.0e+098.0e+091.0e+101.2e+101.4e+10

0 1 2 3 4 5 6 7 8 91011

L2 cache misses

0.0e+005.0e+061.0e+071.5e+072.0e+072.5e+073.0e+073.5e+07

0 1 2 3 4 5 6 7 8 91011

L3 cache acess

0e+00

1e+07

2e+07

3e+07

4e+07

0 1 2 3 4 5 6 7 8 91011

L3 cache misses

0.0e+00

5.0e+06

1.0e+07

1.5e+07

2.0e+07

2.5e+07

0 1 2 3 4 5 6 7 8 91011

thread id thread id thread id thread id

thread id thread id thread id thread id

terça-feira, 20 de Novembro de 2012

Page 33: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

SOA12 THREADS

instructions

0.0e+005.0e+101.0e+111.5e+112.0e+112.5e+113.0e+113.5e+11

0 1 2 3 4 5 6 7 8 91011

clock cycles

0.0e+00

5.0e+10

1.0e+11

1.5e+11

2.0e+11

0 1 2 3 4 5 6 7 8 91011

L1 refs

0e+001e+102e+103e+104e+105e+106e+10

0 1 2 3 4 5 6 7 8 91011

L1 misses

0.0e+002.0e+074.0e+076.0e+078.0e+071.0e+081.2e+08

0 1 2 3 4 5 6 7 8 91011

L2 cache refs

0.0e+002.0e+074.0e+076.0e+078.0e+071.0e+081.2e+08

0 1 2 3 4 5 6 7 8 91011

L2 cache misses

0.0e+005.0e+061.0e+071.5e+072.0e+072.5e+073.0e+07

0 1 2 3 4 5 6 7 8 91011

L3 cache acess

0.0e+005.0e+061.0e+071.5e+072.0e+072.5e+073.0e+073.5e+07

0 1 2 3 4 5 6 7 8 91011

L3 cache misses

0.0e+00

5.0e+06

1.0e+07

1.5e+07

2.0e+07

2.5e+07

0 1 2 3 4 5 6 7 8 91011

thread id thread id thread id thread id

thread id thread id thread id thread id

terça-feira, 20 de Novembro de 2012

Page 34: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

AOP 24 THREADSinstructions

0.0e+00

5.0e+10

1.0e+11

1.5e+11

01234567891011121314151617181920212223

clock cycles

0.0e+00

5.0e+10

1.0e+11

1.5e+11

01234567891011121314151617181920212223

L1 refs

0e+00

1e+10

2e+10

3e+10

01234567891011121314151617181920212223

L1 misses

0e+001e+092e+093e+094e+095e+096e+097e+09

01234567891011121314151617181920212223

L2 cache refs

0e+001e+092e+093e+094e+095e+096e+097e+09

01234567891011121314151617181920212223

L2 cache misses

0e+00

1e+07

2e+07

3e+07

4e+07

01234567891011121314151617181920212223

L3 cache acess

0e+00

1e+07

2e+07

3e+07

4e+07

01234567891011121314151617181920212223

L3 cache misses

0.0e+00

5.0e+06

1.0e+07

1.5e+07

01234567891011121314151617181920212223

thread id thread id thread id thread id

thread id thread id thread id thread id

terça-feira, 20 de Novembro de 2012

Page 35: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

SOA 24 THREADS

instructions

0.0e+00

5.0e+10

1.0e+11

1.5e+11

01234567891011121314151617181920212223

clock cycles

0.0e+00

5.0e+10

1.0e+11

1.5e+11

01234567891011121314151617181920212223

L1 refs

0.0e+005.0e+091.0e+101.5e+102.0e+102.5e+103.0e+10

01234567891011121314151617181920212223

L1 misses

0.0e+00

5.0e+07

1.0e+08

1.5e+08

2.0e+08

2.5e+08

01234567891011121314151617181920212223

L2 cache refs

0.0e+00

5.0e+07

1.0e+08

1.5e+08

2.0e+08

2.5e+08

01234567891011121314151617181920212223

L2 cache misses

0e+00

1e+07

2e+07

3e+07

4e+07

01234567891011121314151617181920212223

L3 cache acess

0e+00

1e+07

2e+07

3e+07

4e+07

01234567891011121314151617181920212223

L3 cache misses

0.0e+002.0e+064.0e+066.0e+068.0e+061.0e+071.2e+071.4e+07

01234567891011121314151617181920212223

thread id thread id thread id thread id

thread id thread id thread id thread id

terça-feira, 20 de Novembro de 2012

Page 36: New THE PAPI PERFORMANCE ANALYSIS TOOLgec.di.uminho.pt/Discip/MInf/cpd1213/RuiS_PAPI_nov12.pdf · 2012. 11. 20. · PAPI L2 DCA Level 2 data cache accesses. PAPI L2 DCM Level 2 data

PAPI IN VIRTUAL MACHINE

terça-feira, 20 de Novembro de 2012