
Does Cache Sharing on Modern CMP Matter to the Performance of Contemporary Multithreaded Programs?

Eddy Zheng Zhang, Yunlian Jiang, Xipeng Shen (presenter)

Computer Science Department, The College of William and Mary, VA, USA


Cache Sharing

• A common feature on modern CMP


Cache Sharing on CMP

• A double-edged sword

• Reduces communication latency

• But causes conflicts & contention


Non-Uniformity


Many Efforts for Exploitation

• Example: shared-cache-aware scheduling

• Assigning suitable programs/threads to the same chip

• Independent jobs: job co-scheduling [Snavely+:00, Snavely+:02, El-Moursy+:06, Fedorova+:07, Jiang+:08, Zhou+:09]

• Parallel threads of server applications: thread clustering [Tam+:07]

Overview of this Work (1/3)

• A surprising finding

• Insignificant effects from shared cache on a recent multithreaded benchmark suite (PARSEC)

• Drawn from a systematic measurement

• Thousands of runs

• 7 dimensions at the program, OS, and architecture levels

• Derived from timing results

• Confirmed by hardware performance counters

Overview of this Work (2/3)

• A detailed analysis

• Reason: three mismatches between executables and the CMP cache architecture

• Cause: current development and compilation are oblivious to cache sharing

Overview of this Work (3/3)

• An exploration of the implications

• Exploiting cache sharing deserves not less but more attention.

• But to exert the power, cache-sharing-aware transformations are critical

• Cuts cache misses in half

• Improves performance by 36%

Outline

• Experiment design

• Measurement and findings

• Cache-sharing-aware transformation

• Related work, summary, and conclusion.


Benchmarks (1/3)

• PARSEC suite by Princeton Univ [Bienia+:08]

“focuses on emerging workloads and was designed to be representative of next-generation shared-memory programs for chip-multiprocessors”


Benchmarks (2/3)

• Composed of

• RMS applications

• Systems applications

• ……

• A wide spectrum of

• working sets, locality, data sharing, synch., off-chip traffic, etc.


Benchmarks (3/3)

Program        Description             Parallelism    Working Set
Blackscholes   Black-Scholes equation  data           2MB
Bodytrack      body tracking           data           8MB
Canneal        simulated annealing     unstructured   256MB
Facesim        face simulation         data           256MB
Fluidanimate   fluid dynamics          data           64MB
Streamcluster  online clustering       data           16MB
Swaptions      portfolio pricing       data           0.5MB
X264           video encoding          pipeline       16MB
Dedup          stream compression      pipeline       256MB
Ferret         image search            pipeline       64MB

Factors Covered in Measurements

Dimension        Variations  Description
benchmarks       10          from PARSEC
parallelism      3           data, pipeline, unstructured
inputs           4           simsmall, simmedium, simlarge, native
# of threads     4           1, 2, 4, 8
assignment       3           thread assignment to cores
binding          2           yes, no
subset of cores  7           the cores a program uses
platforms        2           Intel Xeon & AMD Opteron

The dimensions span the program level, the OS level, and the architecture level.

Machines

• Intel (Xeon 5310): eight 32K L1 caches, four 4MB L2 caches, 8GB DRAM

• AMD (Opteron 2352): two chips, each with four 64K L1 caches, four 512K L2 caches, one 2MB L3 cache, and 4GB DRAM

Measurement Schemes

• Running times: built-in hooks in PARSEC

• Hardware performance counters: PAPI (cache miss, mem. bus, shared data accesses)


Observation I: Sharing vs. Non-sharing

[Figure: thread placements compared. Two threads: T1 and T2 on separate caches vs. T1 and T2 on one shared cache. Four threads: T1, T2, T3, and T4 on separate caches vs. T1 with T3 and T2 with T4 on shared caches.]

Sharing vs. Non-sharing

• Performance Evaluation (Intel)

[Figure: performance on Intel for 2t simlarge, 2t native, 4t simlarge, and 4t native; benchmarks: blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, x264; y-axis from 0 to 1.4]

• Performance Evaluation (AMD)

[Figure: performance on AMD for 2t simlarge, 2t native, 4t simlarge, and 4t native; benchmarks: blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, x264; y-axis from 0 to 1.4]

• L2-cache accesses & misses (Intel)


Reasons (1/2)

1) Small amount of inter-thread data sharing

[Figure: sharing ratio of reads (%) on Intel; benchmarks: blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, x264; y-axis from 0 to 7]

Reasons (2/2)

2) Large working sets

Working sets range from 0.5MB (Swaptions) to 256MB (Canneal, Facesim, Dedup); the full list is in the Benchmarks (3/3) table.
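The working-set argument can be checked against the cache sizes given on the Machines slide. The sketch below is only illustrative: the working-set figures are copied from the benchmark table in this talk, the 4MB capacity is the size of one L2 cache listed for the Intel machine, and the helper name `exceeds_cache` is made up for this example.

```python
# Working sets (MB) as listed in the benchmark table of this talk.
WORKING_SET_MB = {
    "blackscholes": 2, "bodytrack": 8, "canneal": 256, "facesim": 256,
    "fluidanimate": 64, "streamcluster": 16, "swaptions": 0.5, "x264": 16,
    "dedup": 256, "ferret": 64,
}

L2_MB = 4  # size of one L2 cache on the Intel machine in the talk


def exceeds_cache(working_set_mb, cache_mb=L2_MB):
    """Return the programs whose working set is larger than one cache."""
    return sorted(name for name, ws in working_set_mb.items() if ws > cache_mb)


too_big = exceeds_cache(WORKING_SET_MB)
print(f"{len(too_big)} of {len(WORKING_SET_MB)} exceed {L2_MB}MB:", ", ".join(too_big))
```

By this rough comparison only Blackscholes and Swaptions fit in a single 4MB cache, which is consistent with the slide's point that large working sets are one reason cache sharing shows little effect.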


Observation II: Different Sharing Cases

• Threads may differ

• Different data to be processed or tasks to be conducted.

• Non-uniform communication and data sharing.

• Different thread placement may give different performance in the sharing case.
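To make the set of sharing cases concrete, the sketch below enumerates every way to group threads into pairs, assuming each shared cache serves exactly two cores and that the caches themselves are interchangeable. The function name `pairings` is hypothetical and is not taken from the talk.

```python
def pairings(threads):
    """Yield every way to split an even-sized list of threads into unordered pairs."""
    if not threads:
        yield []
        return
    first, rest = threads[0], threads[1:]
    for k, partner in enumerate(rest):
        # Pair the first thread with each candidate, then pair up the remainder.
        remaining = rest[:k] + rest[k + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


for placement in pairings(["T1", "T2", "T3", "T4"]):
    print(placement)
```

For four threads this yields three placements, so comparing sharing cases means comparing three thread-to-cache groupings per configuration.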

[Figure: the three ways to pair four threads on two shared caches: T1 with T2 and T3 with T4; T1 with T3 and T2 with T4; T1 with T4 and T2 with T3]

[Figure: Max. Perf. Diff (%) for 2t simlarge, 2t native, 4t simlarge, and 4t native; benchmarks: blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, x264; y-axis from 0 to 16]

• Statistically insignificant: large fluctuations across runs of the same configuration.
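The comparison on this slide, a maximum performance difference set against run-to-run fluctuation, can be sketched as follows. The slide does not define either quantity precisely, so both formulas here are assumptions: the difference is taken as the relative gap between the best and worst mean running time, and the fluctuation as the coefficient of variation within one placement. The running times are made-up numbers for illustration, and this is not a formal significance test.

```python
from statistics import mean, stdev


def max_perf_diff(times_by_placement):
    """Largest relative gap (%) between mean running times of any two placements."""
    means = [mean(runs) for runs in times_by_placement.values()]
    return 100.0 * (max(means) - min(means)) / min(means)


def max_run_variation(times_by_placement):
    """Largest coefficient of variation (%) among repeated runs of one placement."""
    return max(100.0 * stdev(runs) / mean(runs) for runs in times_by_placement.values())


# Made-up running times (seconds): three placements, three runs each.
samples = {
    "T1T2|T3T4": [10.0, 11.0, 12.0],
    "T1T3|T2T4": [10.5, 11.5, 12.5],
    "T1T4|T2T3": [10.2, 11.2, 12.2],
}
diff = max_perf_diff(samples)
noise = max_run_variation(samples)
print(f"max diff {diff:.1f}%, run-to-run variation {noise:.1f}%")
```

When the variation within one placement is larger than the gap between placements, as in these invented numbers, the gap alone says little about whether placement matters.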


Two Possible Reasons

• Similar interactions among threads

• Differences are smoothed by phase shifts

Temporal Traces of L2 misses

[Figure: temporal traces of L2 misses]

Pipeline Programs

• Two such programs: ferret and dedup

• Numerous concurrent stages

• Interactions within and between stages

• Large differences between different thread-core assignments

• Mainly due to load balance rather than differences in cache sharing.


A Short Summary

• Insignificant influence on performance

• Large working sets

• Little data sharing

• Thread placement does not matter

• Due to uniform relations among threads

• Holds across inputs, # of threads, architectures, and phases.


Principle

• Increase data sharing among siblings

• Decrease data sharing otherwise

[Figure labels: Non-uniform threads; Non-uniform cache sharing]

Example: streamcluster original code

thread 1:

    for i = 1 to N, step = 1
        … …
        for j = T1 to T2
            dist = foo(p[j], p[c[i]])
        end
        … …
    end

thread 2:

    for i = 1 to N, step = 1
        … …
        for j = T2+1 to T3
            dist = foo(p[j], p[c[i]])
        end
        … …
    end

Example: streamcluster optimized code

thread 1:

    for i = 1 to N, step = 2
        … …
        for j = T1 to T3
            dist = foo(p[j], p[c[i]])
        end
        … …
    end

thread 2:

    for i = 1 to N, step = 2
        … …
        for j = T1 to T3
            dist = foo(p[j], p[c[i+1]])
        end
        … …
    end
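The effect of the transformation on data sharing can be checked with a small model. This is a sketch under stated assumptions, not the streamcluster source: it tracks only which (center, point) index pairs each thread visits, uses 0-based indices, ignores `foo` and the elided code, and the function names are hypothetical.

```python
def original_schedule(n_centers, n_points, n_threads=2):
    """Original scheme: each thread visits every center but only its own slice of points."""
    chunk = n_points // n_threads
    work = {}
    for t in range(n_threads):
        lo = t * chunk
        hi = n_points if t == n_threads - 1 else lo + chunk
        work[t] = [(i, j) for i in range(n_centers) for j in range(lo, hi)]
    return work


def optimized_schedule(n_centers, n_points, n_threads=2):
    """Transformed scheme: threads interleave the centers and each scans all points."""
    return {
        t: [(i, j) for i in range(t, n_centers, n_threads) for j in range(n_points)]
        for t in range(n_threads)
    }


def points_touched(pairs):
    """Set of point indices j that a thread reads."""
    return {j for _, j in pairs}


orig = original_schedule(n_centers=4, n_points=6)
opt = optimized_schedule(n_centers=4, n_points=6)

# Both schemes cover exactly the same (center, point) pairs.
all_orig = sorted(p for w in orig.values() for p in w)
all_opt = sorted(p for w in opt.values() for p in w)
same_work = all_orig == all_opt

# Points read by both threads: none before the transformation, all after it.
shared_before = points_touched(orig[0]) & points_touched(orig[1])
shared_after = points_touched(opt[0]) & points_touched(opt[1])
print(same_work, len(shared_before), len(shared_after))
```

In this model the total work is unchanged, but the two threads go from reading disjoint slices of `p` to reading the same points, which is the kind of sharing among sibling threads that the Principle slide asks for.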


Performance Improvement (streamcluster)

[Figure: L2 cache misses and memory bus transactions for 4t and 8t; y-axis from 0 to 1]

Other Programs

[Figure: normalized L2 misses (on Intel) for Blackscholes and Bodytrack at 4t and 8t; y-axis from 0 to 1]

Implication

• To exert the potential of shared cache, program-level transformations are critical.

• Limited existing explorations

• Sarkar & Tullsen’08, Kumar & Tullsen’02, Nikolopoulos’03.

* A contrast to the large body of work in OS and architecture.


Related Work

• Co-runs of independent programs: Snavely+:00, Snavely+:02, El-Moursy+:06, Fedorova+:07, Jiang+:08, Zhou+:09, Tian+:09

• Co-runs of parallel threads of multithreaded programs: Liao+:05, Tuck+:03, Tam+:07

• Have focused on certain aspects of CMP

• Simulator-based studies for cache design

• Old benchmarks (e.g., SPLASH-2)

• Specific classes of apps (e.g., server apps)

• Old CMPs with no shared cache

This work: the first systematic examination of the influence of cache sharing in modern CMP on the performance of contemporary multithreaded apps.

Summary

• Measurement: insignificant influence from cache sharing regardless of inputs, architecture, # of threads, thread placement, parallelism, phases, etc.

• Analysis: a mismatch between software and hardware causes the observations.

• Transformation: large potential for cache-sharing-aware code optimizations.

Conclusion

Does cache sharing on CMP matter to contemporary multithreaded programs?

Yes. But the main effects show up only after cache-sharing-aware transformations.

Thanks!

Questions?