Performance of Go on Multicore Systems
Huang Yipeng
19th November 2012, FYP Presentation (NUS) — Transcript
Motivation
• Multicore systems have become common
• But "dual, quad-cores are not useful all the time, they waste batteries..." - Stephen Elop, Nokia CEO
• Because most programs are explicitly parallel: #Threads is tied to #Cores
Motivation: Why Go?
Objective
• To study the parallelism performance of Go, compared with C, using measurements and analytical models (to quantify actual and predicted performance respectively)
Related Work
• Understanding the Off-chip Memory Contention of Parallel Programs in Multicore Systems (B.M. Tudor, Y.M. Teo, 2011)
• A Practical Approach for Performance Analysis of Shared Memory Programs (B.M. Tudor, Y.M. Teo, 2011)
Parallelism of a Shared-memory Program
[Figure: execution decomposed into useful work, data dependency, and memory contention]

Related Work: Differences
• Shared memory programs: implicit parallelism (e.g. Go) vs. explicit parallelism (e.g. C & OpenMP)
• Processor architecture: emerging platforms (e.g. ARM) vs. multicore platforms (e.g. Intel, AMD)
• Parallelism performance analytical models: low memory contention vs. high memory contention
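The implicit vs. explicit distinction above can be illustrated with a minimal Go sketch (illustrative code, not from the presentation): the programmer only expresses concurrency with goroutines, and the Go runtime maps them onto available cores, whereas a C/OpenMP version would manage an explicit thread count.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// sumSquares splits the input into `parts` chunks and sums the squares
// of each chunk in its own goroutine. The runtime, not the programmer,
// decides how goroutines are scheduled onto OS threads and cores.
func sumSquares(nums []int, parts int) int {
	var wg sync.WaitGroup
	partial := make([]int, parts) // one slot per goroutine, no locking needed
	chunk := (len(nums) + parts - 1) / parts
	for p := 0; p < parts; p++ {
		lo, hi := p*chunk, (p+1)*chunk
		if lo >= len(nums) {
			break
		}
		if hi > len(nums) {
			hi = len(nums)
		}
		wg.Add(1)
		go func(p, lo, hi int) {
			defer wg.Done()
			for _, v := range nums[lo:hi] {
				partial[p] += v * v
			}
		}(p, lo, hi)
	}
	wg.Wait()
	total := 0
	for _, s := range partial {
		total += s
	}
	return total
}

func main() {
	fmt.Println("cores:", runtime.NumCPU())
	fmt.Println(sumSquares([]int{1, 2, 3, 4}, 2)) // 1+4+9+16 = 30
}
```

Note that in the Go versions of this era (pre-1.5), GOMAXPROCS defaulted to 1 and had to be raised explicitly before goroutines could run on multiple cores in parallel.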
Contributions
1. Insights about the parallelism performance of Go
2. Extend our analytical parallelism model for programs with lower memory contention
3. Automate performance prediction and model validation with scripts
Outline
• Motivation
• Related Work
• Methodology
  – Approach
  – Validation
• Evaluation
• Conclusion
Process Methodology
[Diagram: Go program → baseline executions → parallelism traces, collected from (1) hardware counters (perf stat 3.0) and (2) the run queue (Proc Reader) → analytical parallelism model → parallelism prediction]
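The two trace sources named above could be collected along these lines (a sketch only: `./matmul` is a placeholder binary name, and the exact perf event names vary by CPU and perf version):

```shell
# Hardware counters for one run of the benchmark binary.
# Guarded so the snippet is a no-op on machines without perf.
command -v perf >/dev/null &&
    perf stat -e instructions,cycles,cache-misses ./matmul

# Run-queue sampling: a simplified stand-in for the "Proc Reader",
# printing the number of runnable tasks from /proc/stat a few times.
for i in 1 2 3; do
    [ -r /proc/stat ] && awk '/procs_running/ {print $2}' /proc/stat
    sleep 1
done
```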
Analytical Parallelism Model
[Figure: parallelism of a shared-memory program with m threads on n cores — number of threads m, exploited parallelism π′, memory contention M(n); execution decomposed into useful work, data dependency, and memory contention]
Experimental Setup: Workloads
[Table: benchmark workloads]

Experimental Setup: Machine
• Non-Uniform Memory Access (24 cores): dual six-core Intel Xeon X5650 2.67 GHz, 2 hardware threads per core, 12 MB L3 cache, 16 GB RAM, running Linux kernel 3.0
Outline
• Motivation
• Related Work
• Methodology
  – Approach
  – Validation
• Evaluation
• Conclusion
The Memory Contention Model
[Graph: SP (Class C)]
Definition: Low contention problems have a contention ≤ 1.2
Observation: Low contention problems exhibit a W-like pattern not captured by the model. Why does this occur?

Validation of Memory Cont. Model
[Graphs: Mandelbrot, Fannkuch-Redux, Spectral Norm, EP (Class C)]
Modification of Memory Cont. Model
[Graphs: Matrix Mul under the original model vs. the revised model]
Model revalidated...
1. For Matrix Multiplication (down from 50% error to 7%)
2. For other low contention programs
3. In Go and C
4. On Intel and ARM multicores
Outline
• Motivation
• Related Work
• Methodology
  – Approach
  – Validation
• Evaluation
• Conclusion
Performance Analysis: Go vs C
1. How much poorer is Go compared to C? Why?
   – Runtime, speedup vs. #Cores
2. Could Go outperform C?
   – Runtime vs. problem size
   – Runtime vs. #Threads
3. Predictability of actual performance
   – Modeled vs. measured
   – Contention vs. #Cores
   – Problem size vs. exploited parallelism / data dependency / contention
Points of Comparison
[2×2 grid: unoptimized vs. optimized, along compiler optimization and programmer optimization]

Experiment 1: Matrix Multiplication (4992×4992), no optimization flags (-N for Go), #threads = 24
– Go is comparable with C

Experiment 2: Matrix Multiplication (4992×4992), -O3 optimization for C, no flag for Go, #threads = 24
– Go is marginally slower than C

Experiment 3: Transposed Matrix Multiplication (4992×4992), -O3 optimization for C, no flag for Go, #threads = 24
– Go is much worse than C
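For reference, a row-parallel matrix multiplication in Go in the spirit of these experiments might look like the following (a minimal sketch, not the actual benchmark code; the transposition of the second operand corresponds to the programmer optimization of Experiment 3):

```go
package main

import (
	"fmt"
	"sync"
)

// matMul computes a×b with `workers` goroutines, each pulling rows
// from a channel. b is transposed up front so the inner loop scans
// both operands sequentially in memory (cache-friendly access).
func matMul(a, b [][]float64, workers int) [][]float64 {
	n := len(a)
	bt := make([][]float64, n) // bt = transpose of b
	for j := range bt {
		bt[j] = make([]float64, n)
		for i := 0; i < n; i++ {
			bt[j][i] = b[i][j]
		}
	}
	c := make([][]float64, n)
	rows := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range rows { // each row is handled by exactly one worker
				c[i] = make([]float64, n)
				for j := 0; j < n; j++ {
					var s float64
					for k := 0; k < n; k++ {
						s += a[i][k] * bt[j][k]
					}
					c[i][j] = s
				}
			}
		}()
	}
	for i := 0; i < n; i++ {
		rows <- i
	}
	close(rows)
	wg.Wait()
	return c
}

func main() {
	a := [][]float64{{1, 2}, {3, 4}}
	b := [][]float64{{5, 6}, {7, 8}}
	fmt.Println(matMul(a, b, 2)) // [[19 22] [43 50]]
}
```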
No Optimization: Runtime vs #Cores
[Graphs: MatrixMul (#threads = 24, P size = 5K) — effect of #cores on runtime; effect of #cores on times ratio]
Observations:
• Sequential: Go is 16% slower
• Parallel: Go is up to 5% faster
Reasons
Observations (in Go):
1. Instructions executed: 12% less
2. #Cycles: sequential (16% higher), parallel (5% less)
3. Cache misses: sequential (27x worse), parallel (similar)
Conclusions:
• Go's poor sequential performance is caused by a heavy cache miss rate, likely a result of parallel overhead.
No Optimization: Parallelism (Speedup) vs #Cores
[Graphs: MatrixMul (#threads = 24, P size = 5K) — effect of #cores on speedup; effect of #cores on normalized speedup (against best sequential execution time)]
Observations:
• Go makes up for poor sequential performance with a higher speedup.
• Normalized Go speedup is marginally better (up to 1.05x), except on 1/24 cores (0.86x/0.97x)
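The two speedup definitions used in these plots can be made concrete with a toy calculation (the runtimes below are illustrative numbers, not measured data from the experiments):

```go
package main

import "fmt"

// speedups returns the own-base speedup T_seq/T_par and the
// normalized speedup T_bestSeq/T_par, where bestSeq is the best
// sequential time across both languages.
func speedups(seq, par, bestSeq float64) (own, norm float64) {
	return seq / par, bestSeq / par
}

func main() {
	// Hypothetical runtimes in seconds: Go ~16% slower sequentially,
	// but slightly faster in parallel, as on the slide above.
	cSeq, cPar := 100.0, 10.0
	goSeq, goPar := 116.0, 9.5
	cOwn, cNorm := speedups(cSeq, cPar, cSeq)
	gOwn, gNorm := speedups(goSeq, goPar, cSeq)
	fmt.Printf("own-base: C %.1fx, Go %.1fx\n", cOwn, gOwn)   // own-base: C 10.0x, Go 12.2x
	fmt.Printf("normalized: C %.1fx, Go %.1fx\n", cNorm, gNorm) // normalized: C 10.0x, Go 10.5x
}
```

With these numbers Go's own-base speedup looks much better than C's, but normalizing against the best sequential time shrinks the gap to about 1.05x, which is the pattern the slide describes.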
Both Optimizations: Runtime vs #Cores
[Graphs: MatrixMul -O3 (#threads = 24, P size = 5K) — effect of #cores on runtime; effect of #cores on times difference]
Observations:
• Sequential: Go is 400% slower
• Parallel: Go is 180-340% slower
Reasons
Observations (in Go):
1. Instructions executed: 5.2x as many
2. #Cycles: sequential (400% higher), parallel (180% higher)
3. Cache misses: sequential (64% less), parallel (56% less)
Conclusions:
• Go's optimization is not as mature as C's: sequential instructions reduced 1.3x vs. 8x, cycles reduced 4x vs. 18x
• Go has better cache management
Both Optimizations: Parallelism vs #Cores
[Graphs: MatrixMul -O3 (#threads = 24, P size = 5K) — effect of #cores on speedup; effect of #cores on normalized speedup]
Observations:
• Go speedup is higher than C's on its own base, but significantly worse when normalized.
• Secondary Objective: Given that Go has a higher own-base speedup, could it beat C if we increase the problem size?
Compiler Optimization: Varying Problem Size
[Graphs: MatrixMul -O3 (#threads = 24) — effect of #cores on times difference, at P size = 5K and P size = 10K]
Observation:
• Variance in the times ratio reduces from 1.0-1.3 to 1.0-1.1
Conclusion:
• In general, Go is increasingly competitive as the problem size increases.
Both Optimizations: Varying Problem Size
[Graph: MatrixMul -O3 (#threads = 24) — effect of problem size and #cores on times difference]
Observation:
• The times ratio decreases as the problem size increases on 1-20 cores.
Conclusion:
• There is a valley of performance at intermediate core numbers.
Both Optimizations: Runtime vs #Threads
[Graph: MatrixMul (#cores = 24, problem size = 5K) — effect of #threads on runtime]
Observation:
• Go's relative performance improves as the #threads increases.
Conclusions:
• The cost of goroutines in Go is extremely low.
• Go's performance may improve on problems with high data dependency.
Predictability of Actual Performance
• Objective: To determine how Go compares to C with regard to multicore predictability as we change the #cores, #threads, and problem size
• Observations (in Go):
  – Model exhibits better accuracy
  – Memory contention does not fluctuate as #cores changes
  – Measurements consistent with assumptions as problem size changes
• Result: Go exhibits properties useful for prediction that C does not.
Predictability of Performance: Modeled vs Measured
[Graph: MatrixMul -O3 (#threads = 24, P = 17K) — effect of #cores on contention factor]
Observations:
• Contention error: C (avg 15%, max 55%), Go (avg 3%, max 14%)
• Parallelism error: C (avg 18%, max 44%), Go (avg 6%, max 15%)
• Runtime error: C (avg 16%, max 47%), Go (avg 5%, max 13%)
Conclusion:
• Go has better predictability than C
Predictability of Performance: Contention vs #Cores
[Graph: MatrixMul -O3 (#threads = 24, P = 17K) — effect of #cores on contention factor]
Observations:
• In C, contention fluctuates (0-5.6)
• Not so much in Go (0-1)
Conclusions:
• Garbage collection, channel utilization
• A contention factor can be easily bounded in Go to guarantee the performance of some other program.
Predictability of Performance: Modeling Across Problem Sizes
• Objective: Can we perform measurements on smaller problem sizes to reduce the runtime of parallelism prediction?
Predictability of Performance: Problem Size vs Exploited Parallelism
[Graphs: Go and C MatrixMul (#threads = 24, P = 17K) — effect of problem size on exploited parallelism]
Observations (in Go):
• Exploited parallelism only decreases slightly as problem size increases
Predictability of Performance: Problem Size vs Data Dependency
[Graphs: Go and C MatrixMul (#threads = 24, P = 17K) — effect of problem size on data dependency]
Observations (in Go):
• Data dependency decreases as expected as problem size increases
Predictability of Performance: Problem Size vs Contention
[Graphs: Go and C MatrixMul (#threads = 24, P = 17K) — effect of problem size on memory contention]
Observations (in Go):
• Memory contention only increases slightly as problem size increases
Conclusion:
• Measurement inputs on small problems are more accurate in Go than in C
Conclusion
1. How does Go compare to C in a multicore environment?
Go's Actual Performance
– Comparable performance before, inferior performance after programmer optimization
– Consequence of different levels of optimization
– Performance margin decreases as the problem size increases at intermediate core numbers
– Cost of goroutines much lower than threads
Go's Predicted Performance
– Model exhibits better accuracy
– Memory contention does not fluctuate as #cores changes
– Measurements consistent with assumptions as problem size changes
Conclusion
2. Is the model extensible beyond C, traditional multicores, and high contention?
– Modified / validated for low contention problems
– Validated for the Go language
– Validated for ARM devices
3. Can we make the model easier to use?
– Formally defined validation criteria
– Wrote a script to perform model validation
– Wrote a script to perform performance prediction
– *Future Work* Front end for prediction
Compiler Optimization: Runtime vs #Cores
[Graphs: MatrixMul -O3 (#threads = 24, P size = 5K) — effect of #cores on runtime; effect of #cores on times difference]
Observations:
• Sequential: Go is 31% slower
• Parallel: Go is 0-28% slower
• On UMA, the times ratio decreases as #cores increases
Reasons
Observations (in Go):
1. Instructions executed: 4.5x as many
2. #Cycles: sequential (30% higher), parallel (similar)
3. Cache misses: sequential (10% higher), parallel (46% less)
Compiler Optimization: Parallelism vs #Cores
[Graphs: MatrixMul -O3 (#threads = 24, P size = 5K) — effect of #cores on exploited parallelism; effect of #cores on normalized speedup]
Observations:
• Go speedup is higher than C's on its own base, but lower when normalized.
• Secondary Objective: Given that Go has a higher own-base speedup, could it beat C if we increase the problem size?
Sequential Optimization
[Graphs: no optimization; compiler optimization; compiler + programmer optimization]
Predictability of Performance: Modeling Across Problem Sizes
• Objective: Can we perform measurements on smaller problem sizes to reduce the runtime of parallelism prediction?
• Observation: The performance profiles in Go are consistent with expectations as problem size changes
• Result: Measurement inputs on small problems are more accurate in Go than in C