Performance of Go on Multicore Systems
Huang Yipeng
19th November 2012, FYP Presentation (NUS) — Transcript
Motivation
• Multicore systems have become common
• But "dual, quad-cores are not useful all the time, they waste batteries..." - Stephen Elop, Nokia CEO
• Because most programs are explicitly parallel: #Threads is tied to #Cores
Motivation: Why Go?
Objective
• To study the parallelism performance of Go, compared with C, using measurements and analytical models (to quantify actual and predicted performance respectively)
Related Work
• Understanding the Off-chip Memory Contention of Parallel Programs in Multicore Systems (B.M. Tudor, Y.M. Teo, 2011)
• A Practical Approach for Performance Analysis of Shared Memory Programs (B.M. Tudor, Y.M. Teo, 2011)
Parallelism of a Shared-memory Program
[Figure: execution decomposed into useful work, data dependency, and memory contention]

Related Work: Differences
• Shared memory programs: implicit parallelism (e.g. Go) vs. explicit parallelism (e.g. C & OpenMP)
• Processor architecture: emerging platforms (e.g. ARM) vs. multicore platforms (e.g. Intel, AMD)
• Parallelism performance analytical models: low memory contention vs. high memory contention
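The implicit vs. explicit distinction above can be illustrated with a minimal Go sketch (illustrative code, not from the presentation): the programmer only expresses concurrency with goroutines, and the Go runtime maps them onto available cores, whereas a C/OpenMP version would manage an explicit thread count.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// sumSquares splits the input into `parts` chunks and sums the squares
// of each chunk in its own goroutine. The runtime, not the programmer,
// decides how goroutines are scheduled onto OS threads and cores.
func sumSquares(nums []int, parts int) int {
	var wg sync.WaitGroup
	partial := make([]int, parts) // one slot per goroutine, no locking needed
	chunk := (len(nums) + parts - 1) / parts
	for p := 0; p < parts; p++ {
		lo, hi := p*chunk, (p+1)*chunk
		if lo >= len(nums) {
			break
		}
		if hi > len(nums) {
			hi = len(nums)
		}
		wg.Add(1)
		go func(p, lo, hi int) {
			defer wg.Done()
			for _, v := range nums[lo:hi] {
				partial[p] += v * v
			}
		}(p, lo, hi)
	}
	wg.Wait()
	total := 0
	for _, s := range partial {
		total += s
	}
	return total
}

func main() {
	fmt.Println("cores:", runtime.NumCPU())
	fmt.Println(sumSquares([]int{1, 2, 3, 4}, 2)) // 1+4+9+16 = 30
}
```

Note that in the Go versions of this era (pre-1.5), GOMAXPROCS defaulted to 1 and had to be raised explicitly before goroutines could run on multiple cores in parallel.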
Contributions
1. Insights about the parallelism performance of Go
2. Extend our analytical parallelism model for programs with lower memory contention
3. Automate performance prediction and model validation with scripts
Outline
• Motivation
• Related Work
• Methodology
  – Approach
  – Validation
• Evaluation
• Conclusion
Process Methodology
[Diagram: Go program → baseline executions → parallelism traces, collected from (1) hardware counters (perf stat 3.0) and (2) the run queue (Proc Reader) → analytical parallelism model → parallelism prediction]
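The two trace sources named above could be collected along these lines (a sketch only: `./matmul` is a placeholder binary name, and the exact perf event names vary by CPU and perf version):

```shell
# Hardware counters for one run of the benchmark binary.
# Guarded so the snippet is a no-op on machines without perf.
command -v perf >/dev/null &&
    perf stat -e instructions,cycles,cache-misses ./matmul

# Run-queue sampling: a simplified stand-in for the "Proc Reader",
# printing the number of runnable tasks from /proc/stat a few times.
for i in 1 2 3; do
    [ -r /proc/stat ] && awk '/procs_running/ {print $2}' /proc/stat
    sleep 1
done
```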
Analytical Parallelism Model
[Figure: parallelism of a shared-memory program with m threads on n cores — number of threads m, exploited parallelism π′, memory contention M(n); execution decomposed into useful work, data dependency, and memory contention]
Experimental Setup: Workloads
[Table: benchmark workloads]

Experimental Setup: Machine
• Non-Uniform Memory Access (24 cores): dual six-core Intel Xeon X5650 2.67 GHz, 2 hardware threads per core, 12 MB L3 cache, 16 GB RAM, running Linux kernel 3.0
Outline
• Motivation
• Related Work
• Methodology
  – Approach
  – Validation
• Evaluation
• Conclusion
The Memory Contention Model
[Graph: SP (Class C)]
Definition: Low contention problems have a contention ≤ 1.2
Observation: Low contention problems exhibit a W-like pattern not captured by the model. Why does this occur?

Validation of Memory Cont. Model
[Graphs: Mandelbrot, Fannkuch-Redux, Spectral Norm, EP (Class C)]
Modification of Memory Cont. Model
[Graphs: Matrix Mul under the original model vs. the revised model]
Model revalidated...
1. For Matrix Multiplication (down from 50% error to 7%)
2. For other low contention programs
3. In Go and C
4. On Intel and ARM multicores
Outline
• Motivation
• Related Work
• Methodology
  – Approach
  – Validation
• Evaluation
• Conclusion
Performance Analysis: Go vs C
1. How much poorer is Go compared to C? Why?
   – Runtime, speedup vs. #Cores
2. Could Go outperform C?
   – Runtime vs. problem size
   – Runtime vs. #Threads
3. Predictability of actual performance
   – Modeled vs. measured
   – Contention vs. #Cores
   – Problem size vs. exploited parallelism / data dependency / contention
Points of Comparison
[2×2 grid: unoptimized vs. optimized, along compiler optimization and programmer optimization]

Experiment 1: Matrix Multiplication (4992×4992), no optimization flags (-N for Go), #threads = 24
– Go is comparable with C

Experiment 2: Matrix Multiplication (4992×4992), -O3 optimization for C, no flag for Go, #threads = 24
– Go is marginally slower than C

Experiment 3: Transposed Matrix Multiplication (4992×4992), -O3 optimization for C, no flag for Go, #threads = 24
– Go is much worse than C
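For reference, a row-parallel matrix multiplication in Go in the spirit of these experiments might look like the following (a minimal sketch, not the actual benchmark code; the transposition of the second operand corresponds to the programmer optimization of Experiment 3):

```go
package main

import (
	"fmt"
	"sync"
)

// matMul computes a×b with `workers` goroutines, each pulling rows
// from a channel. b is transposed up front so the inner loop scans
// both operands sequentially in memory (cache-friendly access).
func matMul(a, b [][]float64, workers int) [][]float64 {
	n := len(a)
	bt := make([][]float64, n) // bt = transpose of b
	for j := range bt {
		bt[j] = make([]float64, n)
		for i := 0; i < n; i++ {
			bt[j][i] = b[i][j]
		}
	}
	c := make([][]float64, n)
	rows := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range rows { // each row is handled by exactly one worker
				c[i] = make([]float64, n)
				for j := 0; j < n; j++ {
					var s float64
					for k := 0; k < n; k++ {
						s += a[i][k] * bt[j][k]
					}
					c[i][j] = s
				}
			}
		}()
	}
	for i := 0; i < n; i++ {
		rows <- i
	}
	close(rows)
	wg.Wait()
	return c
}

func main() {
	a := [][]float64{{1, 2}, {3, 4}}
	b := [][]float64{{5, 6}, {7, 8}}
	fmt.Println(matMul(a, b, 2)) // [[19 22] [43 50]]
}
```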
No Optimization: Runtime vs #Cores
[Graphs: MatrixMul (#threads = 24, P size = 5K) — effect of #cores on runtime; effect of #cores on times ratio]
Observations:
• Sequential: Go is 16% slower
• Parallel: Go is up to 5% faster
Reasons
Observations (in Go):
1. Instructions executed: 12% less
2. #Cycles: sequential (16% higher), parallel (5% less)
3. Cache misses: sequential (27x worse), parallel (similar)
Conclusions:
• Go's poor sequential performance is caused by a heavy cache miss rate, likely a result of parallel overhead.
No Optimization: Parallelism (Speedup) vs #Cores
[Graphs: MatrixMul (#threads = 24, P size = 5K) — effect of #cores on speedup; effect of #cores on normalized speedup (against best sequential execution time)]
Observations:
• Go makes up for poor sequential performance with a higher speedup.
• Normalized Go speedup is marginally better (up to 1.05x), except on 1/24 cores (0.86x/0.97x)
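The two speedup definitions used in these plots can be made concrete with a toy calculation (the runtimes below are illustrative numbers, not measured data from the experiments):

```go
package main

import "fmt"

// speedups returns the own-base speedup T_seq/T_par and the
// normalized speedup T_bestSeq/T_par, where bestSeq is the best
// sequential time across both languages.
func speedups(seq, par, bestSeq float64) (own, norm float64) {
	return seq / par, bestSeq / par
}

func main() {
	// Hypothetical runtimes in seconds: Go ~16% slower sequentially,
	// but slightly faster in parallel, as on the slide above.
	cSeq, cPar := 100.0, 10.0
	goSeq, goPar := 116.0, 9.5
	cOwn, cNorm := speedups(cSeq, cPar, cSeq)
	gOwn, gNorm := speedups(goSeq, goPar, cSeq)
	fmt.Printf("own-base: C %.1fx, Go %.1fx\n", cOwn, gOwn)   // own-base: C 10.0x, Go 12.2x
	fmt.Printf("normalized: C %.1fx, Go %.1fx\n", cNorm, gNorm) // normalized: C 10.0x, Go 10.5x
}
```

With these numbers Go's own-base speedup looks much better than C's, but normalizing against the best sequential time shrinks the gap to about 1.05x, which is the pattern the slide describes.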
Both Optimizations: Runtime vs #Cores
[Graphs: MatrixMul -O3 (#threads = 24, P size = 5K) — effect of #cores on runtime; effect of #cores on times difference]
Observations:
• Sequential: Go is 400% slower
• Parallel: Go is 180-340% slower
Reasons
Observations (in Go):
1. Instructions executed: 5.2x as many
2. #Cycles: sequential (400% higher), parallel (180% higher)
3. Cache misses: sequential (64% less), parallel (56% less)
Conclusions:
• Go's optimization is not as mature as C's: sequential instructions reduced 1.3x vs. 8x, cycles reduced 4x vs. 18x
• Go has better cache management
Both Optimizations: Parallelism vs #Cores
[Graphs: MatrixMul -O3 (#threads = 24, P size = 5K) — effect of #cores on speedup; effect of #cores on normalized speedup]
Observations:
• Go speedup is higher than C's on its own base, but significantly worse when normalized.
• Secondary Objective: Given that Go has a higher own-base speedup, could it beat C if we increase the problem size?
Compiler Optimization: Varying Problem Size
[Graphs: MatrixMul -O3 (#threads = 24) — effect of #cores on times difference, at P size = 5K and P size = 10K]
Observation:
• Variance in the times ratio reduces from 1.0-1.3 to 1.0-1.1
Conclusion:
• In general, Go is increasingly competitive as the problem size increases.
Both Optimizations: Varying Problem Size
[Graph: MatrixMul -O3 (#threads = 24) — effect of problem size and #cores on times difference]
Observation:
• The times ratio decreases as the problem size increases on 1-20 cores.
Conclusion:
• There is a valley of performance at intermediate core numbers.
Both Optimizations: Runtime vs #Threads
[Graph: MatrixMul (#cores = 24, problem size = 5K) — effect of #threads on runtime]
Observation:
• Go's relative performance improves as the #threads increases.
Conclusions:
• The cost of goroutines in Go is extremely low.
• Go's performance may improve on problems with high data dependency.
Predictability of Actual Performance
• Objective: To determine how Go compares to C with regard to multicore predictability as we change the #cores, #threads, and problem size
• Observations (in Go):
  – Model exhibits better accuracy
  – Memory contention does not fluctuate as #cores changes
  – Measurements consistent with assumptions as problem size changes
• Result: Go exhibits properties useful for prediction that C does not.
Predictability of Performance: Modeled vs Measured
[Graph: MatrixMul -O3 (#threads = 24, P = 17K) — effect of #cores on contention factor]
Observations:
• Contention error: C (avg 15%, max 55%), Go (avg 3%, max 14%)
• Parallelism error: C (avg 18%, max 44%), Go (avg 6%, max 15%)
• Runtime error: C (avg 16%, max 47%), Go (avg 5%, max 13%)
Conclusion:
• Go has better predictability than C
Predictability of Performance: Contention vs #Cores
[Graph: MatrixMul -O3 (#threads = 24, P = 17K) — effect of #cores on contention factor]
Observations:
• In C, contention fluctuates (0-5.6)
• Not so much in Go (0-1)
Conclusions:
• Garbage collection, channel utilization
• A contention factor can be easily bounded in Go to guarantee the performance of some other program.
Predictability of Performance: Modeling Across Problem Sizes
• Objective: Can we perform measurements on smaller problem sizes to reduce the runtime of parallelism prediction?
Predictability of Performance: Problem Size vs Exploited Parallelism
[Graphs: Go and C MatrixMul (#threads = 24, P = 17K) — effect of problem size on exploited parallelism]
Observations (in Go):
• Exploited parallelism only decreases slightly as problem size increases
Predictability of Performance: Problem Size vs Data Dependency
[Graphs: Go and C MatrixMul (#threads = 24, P = 17K) — effect of problem size on data dependency]
Observations (in Go):
• Data dependency decreases as expected as problem size increases
Predictability of Performance: Problem Size vs Contention
[Graphs: Go and C MatrixMul (#threads = 24, P = 17K) — effect of problem size on memory contention]
Observations (in Go):
• Memory contention only increases slightly as problem size increases
Conclusion:
• Measurement inputs on small problems are more accurate in Go than in C
Conclusion
1. How does Go compare to C in a multicore environment?
Go's Actual Performance
– Comparable performance before, inferior performance after programmer optimization
– Consequence of different levels of optimization
– Performance margin decreases as the problem size increases at intermediate core numbers
– Cost of goroutines much lower than threads
Go's Predicted Performance
– Model exhibits better accuracy
– Memory contention does not fluctuate as #cores changes
– Measurements consistent with assumptions as problem size changes
Conclusion
2. Is the model extensible beyond C, traditional multicores, and high contention?
– Modified / validated for low contention problems
– Validated for the Go language
– Validated for ARM devices
3. Can we make the model easier to use?
– Formally defined validation criteria
– Wrote a script to perform model validation
– Wrote a script to perform performance prediction
– *Future Work* Front end for prediction
Compiler Optimization: Runtime vs #Cores
[Graphs: MatrixMul -O3 (#threads = 24, P size = 5K) — effect of #cores on runtime; effect of #cores on times difference]
Observations:
• Sequential: Go is 31% slower
• Parallel: Go is 0-28% slower
• On UMA, the times ratio decreases as #cores increases
Reasons
Observations (in Go):
1. Instructions executed: 4.5x as many
2. #Cycles: sequential (30% higher), parallel (similar)
3. Cache misses: sequential (10% higher), parallel (46% less)
Compiler Optimization: Parallelism vs #Cores
[Graphs: MatrixMul -O3 (#threads = 24, P size = 5K) — effect of #cores on exploited parallelism; effect of #cores on normalized speedup]
Observations:
• Go speedup is higher than C's on its own base, but lower when normalized.
• Secondary Objective: Given that Go has a higher own-base speedup, could it beat C if we increase the problem size?
Sequential Optimization
[Graphs: no optimization; compiler optimization; compiler + programmer optimization]
Predictability of Performance: Modeling Across Problem Sizes
• Objective: Can we perform measurements on smaller problem sizes to reduce the runtime of parallelism prediction?
• Observation: The performance profiles in Go are consistent with expectations as problem size changes
• Result: Measurement inputs on small problems are more accurate in Go than in C