daniel dauwe ece 561 benchmarking results

Benchmarking ECE 561 Sudeep Pasricha Daniel Dauwe 1/9/2014

Post on 14-Jun-2015

Page 1: Daniel dauwe   ece 561 Benchmarking Results

Benchmarking ECE 561

Sudeep Pasricha

Daniel Dauwe

1/9/2014

Page 2

Presentation Outline

• Project Goals
• Tools for Benchmarking:
– Performance counters, PAPI
– HPC Toolkit, Phoronix Test Suite
– Power Measurement
• How testing was accomplished
• List of additional data points for application-to-processor affinity
• A simple continuation of Ryan’s test work
• Results from Memory/Cache Interference Testing for multiple applications run simultaneously pinned to specific cores

Page 3

Project Goals

• Benchmarking Processors
– Monitor both performance counters and the system's power usage
– Gather more data on application affinity for performance on a particular processor architecture
  • Memory Intensive Applications
  • CPU Intensive Applications
– Analyze the interaction/interference of multiple applications run simultaneously on different cores of the same processor

• This data collection is intermediate work for future unspecified projects

Page 4

Performance Counters and PAPI

• Performance counters
– Counters built into processor hardware that record the number of occurrences of user-specified events in hardware

• PAPI – Performance Application Programming Interface
– PAPI was developed in the hope of identifying bottlenecks in current architectural development of high performance computing
– A standardized list of performance counters available for most processors
– PAPI makes it easier to have consistent tests across multiple processor architectures

Page 5

What do the Performance Counter Measurements mean?

• Can mean different things based on which counters are being monitored. Ex:
– PAPI_L1_DCA - Level 1 data cache accesses
– PAPI_FAD_INS - Floating point add instructions
– PAPI_L2_DCM - Level 2 data cache misses

• The raw count data provided by the Performance Counter will need to be meaningfully interpreted by the user

Page 6

Matching Performance counters to Processor Architectures

• Performance Counters used for these tests:
– PAPI_TOT_INS – Total Instructions Executed
– PAPI_L2_TCM – Data and Instruction Level 2 Cache Misses

• These should be pretty universally available across different processor architectures

• Future inclusion of other tests may require other Performance Counters, but available Performance Counters vary greatly between processor architectures…
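One common way to interpret the two raw counts used here is to normalize cache misses by instruction count. The sketch below is only an illustration of that idea, not the analysis used in these tests. The function name and the sample counts are hypothetical, and the numbers are not measured data.

```python
def misses_per_kilo_instruction(l2_misses: int, total_instructions: int) -> float:
    """Return L2 cache misses per 1,000 executed instructions (MPKI)."""
    if total_instructions <= 0:
        raise ValueError("total_instructions must be positive")
    return 1000.0 * l2_misses / total_instructions


# Hypothetical raw counts keyed by PAPI preset name (illustration only).
sample = {"PAPI_TOT_INS": 2_000_000, "PAPI_L2_TCM": 500}
mpki = misses_per_kilo_instruction(sample["PAPI_L2_TCM"], sample["PAPI_TOT_INS"])
print(f"L2 MPKI: {mpki:.3f}")
```

Normalizing by instruction count makes runs of different lengths comparable, which a raw miss count alone does not.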

Page 7

HPC Toolkit

• “An Integrated suite of tools for measurement and analysis of program performance”

• Essentially:
– HPC Toolkit makes it easier to interface with the local machine's performance counters
– Makes collecting program performance data easier

Page 8

Phoronix Test Suite

• Phoronix provides many test applications capable of testing many aspects of processor performance
– Phoronix tests are responsible for all of the benchmarking data gathered for this presentation

• However, many other groups write application suites useful for benchmarking
– SPEC CPU2000 / 2006
– PARSEC

• Several resources such as “OpenBenchmarking.org” provide a substantial amount of results from tests run from these suites on many processor architectures
– This could prove to be a useful resource; however, they do not include information about power usage

Page 9

Applications used for testing Cross-Core cache interference

• C-Ray
– A Ray Tracing Program
– CPU Intensive
– Many Floating Point Calculation Operations
– Relatively Little Memory Access

• Ramspeed
– Integer and Floating Point Writes and Reads to memory
– Memory Intensive
– More interaction with the caches

Page 10

Monitoring Power Usage

• “Watts Up? PRO” power meter
– Measures power consumption from a single standard power outlet
– Has a USB port to interface with a computer and dump recorded power measurements
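Once power readings have been dumped from the meter, they can be turned into an energy figure by integrating power over time. The sketch below assumes the readings are available as (time in seconds, power in watts) pairs. That layout is an assumption for illustration, since the meter's actual log format is not described here, and the sample readings are made up.

```python
from typing import Sequence, Tuple


def energy_joules(samples: Sequence[Tuple[float, float]]) -> float:
    """Integrate (time_s, power_w) samples with the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be sorted by time")
        # Area of the trapezoid between two consecutive readings.
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


# Hypothetical readings (illustration only, not measured data).
readings = [(0.0, 30.0), (1.0, 34.0), (2.0, 32.0)]
print(energy_joules(readings))
```

The trapezoidal rule is used because it tolerates unevenly spaced samples; a single reading yields zero energy because no time has elapsed.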

Page 11

How tests were run

• A minimalist Ubuntu operating system allows the processor's attention to be dedicated to the test applications
– Terminal-based user interface
– Unnecessary background processes not included in the operating system

• Power usage and selected performance counters are recorded and saved while the various test applications are run

• For testing interference between programs:
– “taskset” was used to pin the applications to specific processor cores
– The applications were run concurrently while performance counter results were measured
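The pinning step can be sketched as a small launcher that prefixes each command with `taskset -c <core>` and starts the pinned commands together. This is a minimal sketch and not the scripts used for these tests. The helper names and the `./c-ray-test` and `./ramspeed-test` commands are placeholders, and performance counter collection is left out.

```python
import shlex
import subprocess
from typing import List


def pinned_command(core: int, command: str) -> List[str]:
    """Build an argv list that runs `command` pinned to one core via taskset."""
    if core < 0:
        raise ValueError("core index must be non-negative")
    return ["taskset", "-c", str(core)] + shlex.split(command)


def run_concurrently(commands: List[List[str]]) -> List[int]:
    """Start every command before waiting on any, then return the exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]


# Placeholder benchmark commands; this only prints the argv lists.
jobs = [pinned_command(0, "./c-ray-test"), pinned_command(1, "./ramspeed-test")]
print(jobs)
```

Calling `run_concurrently(jobs)` on a Linux machine with `taskset` installed would launch both pinned processes side by side.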

Page 12

Measuring Memory Interference between Applications

• How this is tested:
– Simultaneously pin different types of applications to run only on specific cores in the processor
– Then use performance counters and the power meter to measure the interference

• Interference could be defined as:
– An increase in the number of cache misses
– An increase in application execution time
– Possibly an increase in power consumption

• Test plan:
– Tests were run:
  • First on an AMD Turion II Dual-Core M520 Processor (2 cores, 5 P-states)
  • Later also on an Intel Pentium Dual Core CPU (2 cores, 4 P-states)
– Run control tests with each application running alone (pinned to a single core)
– Run the tests together and analyze the differences
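Under the definitions above, interference for any one metric can be expressed as the relative change between the control run and the joint run. The sketch below shows that comparison. The function name, the metric keys, and the numbers are hypothetical and are not results from these tests.

```python
def relative_increase(control: float, interference: float) -> float:
    """Return the fractional change of a metric relative to its control run."""
    if control <= 0:
        raise ValueError("control value must be positive")
    return (interference - control) / control


# Hypothetical control and joint-run values (illustration only).
control_run = {"l2_misses": 400.0, "exec_time": 1500.0}
joint_run = {"l2_misses": 500.0, "exec_time": 1650.0}

for metric in control_run:
    change = relative_increase(control_run[metric], joint_run[metric])
    print(f"{metric}: {change:+.1%}")
```

A positive value indicates the metric got worse when the second application was running; the same function applies to energy readings.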

Page 13

Control Results: Intel Pentium dual CPU T2330

[Six charts, each with x-axis ticks 0 to 3; axis labels, units, and data values are not recoverable from this transcript]
• Intel Pentium Dual Core: C-Ray L2 Cache Miss Control Results (series: CPU Control Test)
• Intel Pentium Dual Core: C-Ray Execution Time Control Results (series: CPU Control Test)
• Intel Pentium Dual Core: Ramspeed Execution Time Control Results (series: Memory Control Test)
• Intel Pentium Dual Core: Ramspeed L2 Cache Miss Control Results (series: Memory Control Test)
• Intel Pentium Dual Core: C-Ray Power Usage Control Results (series: CPU Control Energy)
• Intel Pentium Dual Core: Ramspeed Power Usage Control Results (series: Memory Control Energy)

Page 14

Control Results: AMD Turion II Dual Core Mobile M520

[Six charts, each with x-axis ticks 0 to 4; axis labels, units, and data values are not recoverable from this transcript]
• AMD Turion II Dual-Core: C-Ray Execution Time Control Results (series: CPU Control Test)
• AMD Turion II Dual-Core: C-Ray L2 Cache Miss Control Results (series: CPU Control Test)
• AMD Turion II Dual-Core: Ramspeed Execution Time Control Results (series: Memory Control Test)
• AMD Turion II Dual-Core: Ramspeed L2 Cache Miss Control Results (series: Memory Control Test)
• AMD Turion II Dual-Core: C-Ray Power Usage Control Results (series: CPU Control Energy)
• AMD Turion II Dual-Core: Ramspeed Power Usage Control Results (series: Memory Control Energy)

Page 15

Taking a Closer Look at the AMD Control Results from the previous slide:

• It seems suspect that the control test produced the same execution time across all P-states. This result for the C-Ray execution control test was consistent over multiple runs on the AMD Turion II processor, but a test run on a second processor, an Intel Pentium Dual Core, produced results closer to what seems realistic:

[Three charts; axis labels, units, and data values are not recoverable from this transcript]
• C-Ray Execution Time (AMD First Run), x-axis ticks 0 to 4 (series: Control Test, Interference Test)
• C-Ray Execution Time (AMD Second Run), x-axis ticks 0 to 4 (series: CPU Control Test, CPU Interference Test)
• C-Ray Execution Time (Intel Run), x-axis ticks 0 to 3 (series: CPU Control Test, CPU Interference Test)

Page 16

Interference Results (Joint Pinning Results on C-Ray):

Intel Pentium dual CPU T2330

• The third column of data represents adjusted interference results

[Three charts, each with x-axis ticks 0 to 3; axis labels, units, and data values are not recoverable from this transcript]
• C-Ray Execution Time Interference (Ramspeed test on second core) (series: CPU Control Test, Original CPU Interference Test, Adjusted CPU Interference Test)
• C-Ray L2 Cache Misses Interference (Ramspeed test on second core) (series: CPU Control Test, Original CPU Interference Test, Adjusted CPU Interference Test)
• Power usage for C-Ray and Ramspeed tests run together (series: CPU Control Energy, 1 CPU and 1 Memory Interference Test Energy)

Page 17

Interference Results (Joint Pinning Results on Ramspeed):

Intel Pentium dual CPU T2330

[Two charts, each with x-axis ticks 0 to 3; axis labels, units, and data values are not recoverable from this transcript]
• Ramspeed Execution Time Interference (C-Ray test on second core) (series: Memory Control Test, Memory Interference Test)
• Ramspeed L2 Cache Misses Interference (C-Ray test on second core) (series: Memory Control Test, Memory Interference Test)

Page 18

Interference Results (2 CPU Intensive Application Pinning Results):

Intel Pentium dual CPU T2330

[Three charts, each with x-axis ticks 0 to 3; axis labels, units, and data values are not recoverable from this transcript]
• C-Ray Execution Time Interference (C-Ray test on second core) (series: CPU Control Test, CPU Interference Test, CPU Interference Test)
• C-Ray L2 Cache Misses Interference (C-Ray test on second core) (series: CPU Control Test, CPU Interference Test, CPU Interference Test)
• Power usage for 2 C-Ray tests running on separate cores (series: CPU Control Energy, 2 CPU Interference Test Energy)

Page 19

Interference Results (2 Memory Intensive Application Pinning Results):

Intel Pentium dual CPU T2330

[Three charts, each with x-axis ticks 0 to 3; axis labels, units, and data values are not recoverable from this transcript]
• Ramspeed Execution Time Interference (Ramspeed test on second core) (series: Memory Control Test, Memory Interference Test, Memory Interference Test)
• Ramspeed L2 Cache Misses Interference (Ramspeed test on second core) (series: Memory Control Test, Memory Interference Test, Memory Interference Test)
• Power usage for 2 Ramspeed tests running on separate cores (series: Memory Control Energy, 2 Memory Interference Test Energy)

Page 20

Interference between simultaneous applications:

Future Tests

The foundation scripts have been written, so in the future it will be very easy to add support for testing:

– Interference of 1 type of application pinned to N cores for a processor with a substantial number of cores (i.e., > 2)
– Interference from 2 CPU intensive or 2 memory intensive test applications
– Measure memory interference with M applications mapped to N cores (N > 2)
  – Testing a larger sample size might produce more interesting results
– Find which application-to-core mappings can provide the best performance for specific architectures/cache sizes
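The search over application-to-core mappings mentioned above could start from a simple enumeration of candidate assignments. The sketch below is not part of the existing foundation scripts. It assumes each application gets its own core (so M is at most N), and the application names and core numbers are placeholders.

```python
from itertools import permutations
from typing import Dict, Iterator, Sequence


def core_mappings(apps: Sequence[str], cores: Sequence[int]) -> Iterator[Dict[str, int]]:
    """Yield every assignment that gives each application its own distinct core."""
    if len(apps) > len(cores):
        raise ValueError("need at least as many cores as applications")
    for chosen in permutations(cores, len(apps)):
        yield dict(zip(apps, chosen))


# Placeholder applications and core indices (illustration only).
for mapping in core_mappings(["c-ray", "ramspeed"], [0, 1, 2]):
    print(mapping)
```

Each yielded mapping could then be run as one pinned test and compared against the control results; the number of mappings grows quickly with M and N, so larger machines may need sampling instead of a full sweep.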

Page 21

Presentation Outline

• Project Goals
• Tools for Benchmarking:
– Performance counters, PAPI
– HPC Toolkit, Phoronix Test Suite
– Power Measurement
• How testing was accomplished
• List of additional data points for application-to-processor affinity
• A simple continuation of Ryan’s test work
• Results from Interference Testing for applications pinned to specific cores

Page 22

Thank You For Your Attention