lab exercises: lab 1 (performance measurement)

Programming Multi-Core Processors based Embedded Systems A Hands-On Experience on Cavium Octeon based Platforms Lab Exercises: Lab 1 (Performance measurement)

Upload: allene

Post on 15-Jan-2016

Page 1: Lab Exercises: Lab 1 (Performance measurement)

Programming Multi-Core Processors based Embedded Systems

A Hands-On Experience on Cavium Octeon based Platforms

Lab Exercises: Lab 1 (Performance measurement)

Page 2: Lab Exercises: Lab 1 (Performance measurement)

Lab # 1: Parallel Programming and Performance Measurement using MPAC

Page 3: Lab Exercises: Lab 1 (Performance measurement)

Lab 1 – Goals

Objective

Use MPAC benchmarks to measure the performance of different subsystems of multi-core based systems

Use MPAC to learn to develop parallel programs

Mechanism

MPAC CPU and memory benchmarks will exercise the processor and memory unit by generating compute-intensive and memory-intensive workloads

Page 4: Lab Exercises: Lab 1 (Performance measurement)

What to Look for

Observations

Observe the throughput with an increasing number of threads for compute-intensive and memory-intensive workloads

Identify performance bottlenecks

Page 5: Lab Exercises: Lab 1 (Performance measurement)

Measurement of Execution Time

Measuring the elapsed time from the start of a task until its completion is a straightforward procedure in the context of a sequential task.

This procedure becomes complex when the same task is executed concurrently by n threads on n distinct processors or cores.

It is not guaranteed that all tasks start at the same time or complete at the same time. Therefore, the measurement is imprecise due to the concurrent nature of the tasks.

Page 6: Lab Exercises: Lab 1 (Performance measurement)

Cont…

Execution time can be measured either globally or locally.

In the case of global measurement, execution time is equal to the difference of the time stamps taken at the global fork and join instants.

Local times can be measured and recorded by each of the n threads.

After thread joining, the maximum of all these individual execution times provides an estimate of the overall execution time.
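The global and local measurements described above can be sketched with plain POSIX threads. This is a minimal sketch, not MPAC's own code: `measure`, `procedure`, `now_sec` and `MAX_THREADS` are hypothetical names chosen for illustration, and the summation loop merely stands in for a real workload. It assumes a POSIX system where `clock_gettime(CLOCK_MONOTONIC, ...)` is available (for example, built with `cc -pthread`).

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

#define MAX_THREADS 64  /* illustrative upper bound, not an MPAC limit */

/* Monotonic wall-clock time in seconds. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical compute-bound task standing in for the real workload. */
static void procedure(void)
{
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        sum += i;
}

/* Each worker records its own (local) elapsed time. */
static void *worker(void *arg)
{
    double *local = arg;
    double t1 = now_sec();
    procedure();
    double t2 = now_sec();
    *local = t2 - t1;
    return NULL;
}

/*
 * Runs n workers. On success returns 0 and stores the global estimate
 * (join time stamp minus fork time stamp) in *global_time and the local
 * estimate (maximum of the per-thread times) in *local_time.
 */
int measure(int n, double *global_time, double *local_time)
{
    pthread_t tid[MAX_THREADS];
    double local[MAX_THREADS];
    int created = 0;

    if (n < 1 || n > MAX_THREADS)
        return -1;

    double tg1 = now_sec();                     /* global start (fork) */
    while (created < n &&
           pthread_create(&tid[created], NULL, worker, &local[created]) == 0)
        created++;
    for (int i = 0; i < created; i++)
        pthread_join(tid[i], NULL);
    double tg2 = now_sec();                     /* global stop (join) */

    if (created < n)
        return -1;

    *global_time = tg2 - tg1;
    *local_time = local[0];
    for (int i = 1; i < n; i++)
        if (local[i] > *local_time)
            *local_time = local[i];
    return 0;
}
```

Because the global time stamps bracket thread creation and joining, the global estimate also includes fork and join overhead, so it is never smaller than the local one in this sketch.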

Page 7: Lab Exercises: Lab 1 (Performance measurement)

Definitions

LETE: Local Execution Time Estimation

GETE: Global Execution Time Estimation

Page 8: Lab Exercises: Lab 1 (Performance measurement)

Cont…

Sequential case: procedure() is timed with t1 = start time and t2 = stop time.

$$t = t_2 - t_1$$

GETE: n copies of procedure() run concurrently; tg1 = global start time and tg2 = global stop time are taken around all of them.

$$t_g = t_{g2} - t_{g1}$$

LETE: thread i (i = 1, 2, 3, ..., n) runs procedure() and records ti1 = start time and ti2 = stop time; the estimate is the maximum over all threads.

$$t_l = \max_{1 \le i \le n} \left( t_{i2} - t_{i1} \right)$$

Page 9: Lab Exercises: Lab 1 (Performance measurement)

The Problem

Lack of precision

Some tasks finish before others

Synchronization issues with a large number of cores

Results are not repeatable

Page 10: Lab Exercises: Lab 1 (Performance measurement)

Performance Measurement Methodologies

For multithreaded case, with threads (1) (2) (3) ... (K):

Get start time at the barrier

Repeat for N iterations

Get end time at the barrier

For sequential case:

Get start time

Repeat for N iterations

Get end time

Page 11: Lab Exercises: Lab 1 (Performance measurement)

Accurate LETE Measurement Methodology

Threads (1) (2) (3) ... (K) are synchronized before each round using a barrier

Repeat for N rounds

Record the maximum elapsed time for each round
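The per-round barrier synchronization can be sketched as follows. This is an illustration built on POSIX barriers, not MPAC's implementation: `run_rounds`, `procedure`, `now_sec`, `K_THREADS` and `N_ROUNDS` are hypothetical names and values, and the sketch assumes a platform that provides `pthread_barrier_t`.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

#define K_THREADS 4   /* number of threads (K), illustrative value */
#define N_ROUNDS  8   /* number of rounds (N), illustrative value */

static pthread_barrier_t round_barrier;
static double elapsed[N_ROUNDS][K_THREADS];  /* local time per round, thread */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical workload standing in for the benchmarked task. */
static void procedure(void)
{
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        sum += i;
}

static void *worker(void *arg)
{
    int id = *(const int *)arg;
    for (int r = 0; r < N_ROUNDS; r++) {
        /* All K threads line up here, so each round starts together. */
        pthread_barrier_wait(&round_barrier);
        double t1 = now_sec();
        procedure();
        elapsed[r][id] = now_sec() - t1;
    }
    return NULL;
}

/* Fills round_max[r] with the largest local time observed in round r. */
int run_rounds(double round_max[N_ROUNDS])
{
    pthread_t tid[K_THREADS];
    int ids[K_THREADS];

    if (pthread_barrier_init(&round_barrier, NULL, K_THREADS) != 0)
        return -1;
    for (int i = 0; i < K_THREADS; i++) {
        ids[i] = i;
        /* Sketch only: no recovery, since the threads already started
         * would otherwise wait at the barrier forever. */
        if (pthread_create(&tid[i], NULL, worker, &ids[i]) != 0)
            abort();
    }
    for (int i = 0; i < K_THREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&round_barrier);

    for (int r = 0; r < N_ROUNDS; r++) {
        round_max[r] = elapsed[r][0];
        for (int i = 1; i < K_THREADS; i++)
            if (elapsed[r][i] > round_max[r])
                round_max[r] = elapsed[r][i];
    }
    return 0;
}
```

Each thread writes only its own column of `elapsed`, so no locking is needed; the main thread reads the table only after all workers have joined.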

Page 12: Lab Exercises: Lab 1 (Performance measurement)

Measurement Observations

Page 13: Lab Exercises: Lab 1 (Performance measurement)

Accurate MINMAX Approach

Repeat for N iterations

Store the thread-local execution time of each thread for each iteration

For each individual iteration, store the largest execution time amongst the threads

This yields N largest-execution-time values, one per iteration

Choose the minimum of those values as the execution time: the MINMAX value
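The MINMAX selection itself is a small reduction over the recorded times. The function below is a sketch under the assumption that the times are stored row-major, one row per iteration and one column per thread; `minmax_time` and its parameter names are hypothetical.

```c
#include <stddef.h>

/*
 * times[it * n_threads + th] holds the local execution time of thread th
 * in iteration it. Returns the minimum over iterations of the per-iteration
 * maximum (the MINMAX value), or -1.0 if the input is empty.
 */
double minmax_time(const double *times, size_t n_iter, size_t n_threads)
{
    if (times == NULL || n_iter == 0 || n_threads == 0)
        return -1.0;

    double best = 0.0;
    for (size_t it = 0; it < n_iter; it++) {
        const double *row = times + it * n_threads;

        /* Largest execution time amongst the threads in this iteration. */
        double slowest = row[0];
        for (size_t th = 1; th < n_threads; th++)
            if (row[th] > slowest)
                slowest = row[th];

        /* Keep the smallest of the per-iteration maxima. */
        if (it == 0 || slowest < best)
            best = slowest;
    }
    return best;
}
```

With made-up times of (1.0, 3.0, 2.0), (2.5, 2.0, 1.5) and (4.0, 1.0, 1.0) for three iterations of three threads, the per-iteration maxima are 3.0, 2.5 and 4.0, so the MINMAX value is 2.5.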

Page 14: Lab Exercises: Lab 1 (Performance measurement)

Compile and Run (Memory Benchmark)

Memory Benchmark

$ cd /<path-to-mpac>/mpac_1.2
$ ./configure
$ make clean
$ make
$ cd benchmarks/mem
$ ./mpac_mem_bm -n <# of Threads> -s <array size> -r <# of repetitions> -t <data type>

For Help

$ ./mpac_mem_bm -h

Page 15: Lab Exercises: Lab 1 (Performance measurement)

Compile and Run (CPU Benchmark)

CPU Benchmark

$ cd /<path-to-mpac>/mpac_1.2
$ ./configure
$ make clean
$ make
$ cd benchmarks/cpu
$ ./mpac_cpu_bm -n <# of Threads> -r <# of Iterations>

For Help

$ ./mpac_cpu_bm -h

Page 16: Lab Exercises: Lab 1 (Performance measurement)

Performance Measurements (CPU)

The Integer Unit (summation), Floating Point Unit (sine) and Logical Unit (string operation) of the processor are exercised.

Intel Xeon, AMD Opteron (x86) and Cavium Octeon (MIPS64) are used as the Systems under Test (SUTs).

Throughput scales linearly with the number of threads in all cases.

Page 17: Lab Exercises: Lab 1 (Performance measurement)

Performance Measurements (Memory)

With concurrent symmetric threads, one expects the memory-to-memory throughput to scale with the number of threads.

With data sizes of 4 KB, 16 KB and 1 MB, most of the memory accesses should hit the L2 caches rather than the main memory.

For these cases the throughput scales linearly.

Page 18: Lab Exercises: Lab 1 (Performance measurement)

Performance Measurements (Memory)

Copying 16 MB requires extensive main-memory accesses.

In the Intel case, a shared bus is used. Thus, throughput is lower than in the cases where accesses hit the L2 caches, and it saturates as the bus becomes a bottleneck.

Memory copy throughput saturates at around 40 Gbps, which is roughly half of the available bus bandwidth (64 bits x 1333 MHz = 85.3 Gbps).

For the AMD-based and Cavium-based SUTs, throughput scales linearly in the 16 MB case because they use more efficient, low-latency memory controllers instead of a shared system bus.
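The bus-bandwidth figure quoted for the Intel case can be checked by hand. The worked arithmetic below assumes, as the product on the slide implies, one 64-bit transfer per cycle at 1333 MHz; the symbol B_bus is introduced here only for illustration.

```latex
\begin{align*}
B_{\text{bus}} &= 64\ \text{bits} \times 1333 \times 10^{6}\ \text{s}^{-1}
                = 85.312 \times 10^{9}\ \text{bits/s} \approx 85.3\ \text{Gbps} \\
\frac{40\ \text{Gbps}}{85.3\ \text{Gbps}} &\approx 0.47
\end{align*}
```

The observed saturation point is therefore a little under half of the nominal bus bandwidth.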

Page 19: Lab Exercises: Lab 1 (Performance measurement)

MPAC fork and join infrastructure

In MPAC based applications, initialization and argument handling are performed by the main thread.

The tasks to be run in parallel are forked to worker threads.

The worker threads join after completing their tasks. Final processing is done by the main thread.
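The fork and join pattern described above can be illustrated with plain POSIX threads. This is a sketch of the pattern only and does not use MPAC's own API; `parallel_sum`, `sum_range`, `struct work` and `MAX_WORKERS` are hypothetical names, and the integer summation is just a stand-in task.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 64  /* illustrative upper bound */

/* Per-worker context: input range and output slot (hypothetical layout). */
struct work {
    unsigned long begin, end;  /* half-open range [begin, end) */
    unsigned long partial;     /* result written by the worker */
};

/* Task run in parallel: sum the integers in the worker's range. */
static void *sum_range(void *arg)
{
    struct work *w = arg;
    unsigned long s = 0;
    for (unsigned long i = w->begin; i < w->end; i++)
        s += i;
    w->partial = s;
    return NULL;
}

/*
 * Main-thread role: initialize, fork n workers, join them, then do the
 * final processing. Stores the sum of 0 .. limit-1 in *total.
 * Returns 0 on success, -1 on bad arguments or thread-creation failure.
 */
int parallel_sum(int n, unsigned long limit, unsigned long *total)
{
    pthread_t tid[MAX_WORKERS];
    struct work w[MAX_WORKERS];
    int created = 0;

    if (n < 1 || n > MAX_WORKERS || total == NULL)
        return -1;

    /* Initialization: split [0, limit) into n contiguous chunks. */
    unsigned long chunk = limit / (unsigned long)n;
    for (int i = 0; i < n; i++) {
        w[i].begin = (unsigned long)i * chunk;
        w[i].end = (i == n - 1) ? limit : w[i].begin + chunk;
        w[i].partial = 0;
    }

    /* Fork: hand one chunk to each worker thread. */
    while (created < n &&
           pthread_create(&tid[created], NULL, sum_range, &w[created]) == 0)
        created++;

    /* Join: wait for every worker that was started. */
    for (int i = 0; i < created; i++)
        pthread_join(tid[i], NULL);
    if (created < n)
        return -1;

    /* Final processing by the main thread: combine partial results. */
    *total = 0;
    for (int i = 0; i < n; i++)
        *total += w[i].partial;
    return 0;
}
```

Each worker writes only to its own `struct work`, so the main thread can combine the partial sums after the join without any locking.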

Page 20: Lab Exercises: Lab 1 (Performance measurement)

MPAC code structure

Page 21: Lab Exercises: Lab 1 (Performance measurement)

MPAC Hello World

Objective

To write a simple “Hello World” program using MPAC

Mechanism

User specifies the number of worker threads through the command line

Each worker thread prints “Hello World” and exits
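The mechanism can be approximated without MPAC using plain POSIX threads. The sketch below is not the MPAC hello application: `parse_threads`, `run_hello`, `hello` and `MAX_WORKERS` are hypothetical names, the worker id in the message is an addition for readability, and only the `-n` option from the lab's command line is mirrored.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORKERS 64  /* illustrative upper bound */

/* Worker body: print the greeting and exit. */
static void *hello(void *arg)
{
    int id = *(const int *)arg;
    printf("Hello World from worker %d\n", id);
    return NULL;
}

/* Returns the value following "-n", or 1 if it is absent or out of range. */
int parse_threads(int argc, char **argv)
{
    for (int i = 1; i + 1 < argc; i++) {
        if (strcmp(argv[i], "-n") == 0) {
            long v = strtol(argv[i + 1], NULL, 10);
            if (v >= 1 && v <= MAX_WORKERS)
                return (int)v;
        }
    }
    return 1;
}

/* Starts n workers and waits for them. Returns n on success, -1 on error. */
int run_hello(int n)
{
    pthread_t tid[MAX_WORKERS];
    int ids[MAX_WORKERS];
    int created = 0;

    if (n < 1 || n > MAX_WORKERS)
        return -1;

    while (created < n) {
        ids[created] = created;
        if (pthread_create(&tid[created], NULL, hello, &ids[created]) != 0)
            break;
        created++;
    }
    for (int i = 0; i < created; i++)
        pthread_join(tid[i], NULL);

    return (created == n) ? n : -1;
}
```

A `main` that wires the two together could be as short as `return run_hello(parse_threads(argc, argv)) < 0;`. The order of the printed lines is not deterministic, because the workers run concurrently.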

Page 22: Lab Exercises: Lab 1 (Performance measurement)

Compile and Run

$ cd /<path-to-mpac>/mpac_1.2/apps/hello
$ make clean
$ make
$ ./mpac_hello_app -n <# of Threads>