Performance evaluation and analysis of SX-Aurora TSUBASA Kazuhiko Komatsu Tohoku University 9 October, 2018 WSSP28


Performance evaluation and analysis of SX-Aurora TSUBASA

Kazuhiko Komatsu
Tohoku University
9 October, 2018

WSSP28


Background

• Supercomputers are important infrastructure
  • Widely used for scientific research as well as in various industries
  • The Top 1 system reaches 122.3 Pflop/s
• Big gap between theoretical performance and sustained performance
  • Only compute-intensive applications stand to benefit from high peak performance
  • Memory-intensive applications are limited by lower memory performance

Memory performance has gained more and more attention


SX-Aurora TSUBASA with the world’s highest memory bandwidth

• Two important concepts of its design
  • High usability
  • High sustained performance
• New architecture
  • Vector host (VH) is attached to vector engines (VEs)
    • VE is responsible for executing an entire application
    • VH is used for processing system calls invoked by the applications

[Figure: a vector host (VH, x86 Linux) attached to vector engines (VEs)]


New execution model

• Conventional model
• New execution model

[Figure: timelines of the two models. Conventional model (Host, GPU): start processing, kernel execution, kernel execution, finish processing. New execution model (VH, VE): exe module load, start processing, system call (I/O, etc.), transparent offload, OS function, finish processing.]


Highlights of the execution model

• Two advantages over the conventional execution model
  • Avoids frequent data transfers between VE and VH
    • Applications are entirely executed on the VE
    → High sustained performance
  • No special programming
    • Explicit specifications of computation kernels are not necessary
    • System calls are transparently offloaded to the VH
    • Programmers do not need to care about system calls
    → High usability


Specification of SX-Aurora TSUBASA

• High memory bandwidth
  • 1.22 TB/s, the world’s highest memory bandwidth
    • Integration of six HBM2 memory modules
  • 3.0 TB/s LLC bandwidth
    • LLC is connected to the cores via a 2D mesh network
• High computational performance
  • 2.15 Tflop/s @ 1.4 GHz
    • 8 powerful vector cores
  • 16 nm FinFET process technology
  • 4.8 billion transistors
  • 14.96 mm x 33.00 mm

[Figure: block diagram of a vector processor]


Performance evaluation of SX-Aurora TSUBASA

• SX-Aurora TSUBASA A300-2
  • 2x VEs Type 10B
  • 1x VH

VE Type 10B
  Frequency                1.4 GHz
  Peak FP / core           268.8 GFLOPS
  # cores                  8
  Peak DP Flops / socket   2.15 TFLOPS
  Memory BW                1.2 TB/s
  Memory capacity          48 GB

VH
  CPU                      Intel Xeon Gold 6126
  Frequency                2.60 GHz / 3.70 GHz (Turbo)
  # cores                  12
  Mem BW                   128 GB/s
  Mem capacity             96 GB
  Mem config               DDR4-2666 DIMM 16 GB x 6

[Figure: a vector host (VH, x86 Linux) attached to vector engines (VEs)]
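The per-socket peak and the machine's memory Bytes/Flop ratio follow from the figures on these slides by simple arithmetic. The short sketch below works them out; the variable names are illustrative, and it uses the 1.22 TB/s bandwidth quoted on the specification slide rather than the rounded 1.2 TB/s.

```python
# Figures quoted on the slides for one VE Type 10B.
cores = 8
peak_per_core_gflops = 268.8   # peak FP per core
memory_bw_gbs = 1220.0         # 1.22 TB/s expressed in GB/s

# Peak DP performance of one VE (socket) in Gflop/s.
peak_gflops = cores * peak_per_core_gflops

# Memory Bytes/Flop: bytes the memory can deliver per peak floating-point operation.
memory_bf = memory_bw_gbs / peak_gflops

print(f"peak DP performance: {peak_gflops:.1f} Gflop/s")
print(f"memory B/F ratio:    {memory_bf:.3f}")
```

The product of 8 cores and 268.8 Gflop/s per core matches the 2.15 Tflop/s quoted for the socket.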


Experimental environments

Processor           SX-Aurora Type 10B   Xeon Gold 6126    SX-ACE        Tesla V100   Xeon Phi KNL 7290
Frequency           1.4 GHz              2.6 GHz           1.0 GHz       1.245 GHz    1.5 GHz
# of cores          8                    12                4             5120         72
DP flop/s           2.15 T               998.4 GF          256 GF        7 TF         3.456 TF
(SP flop/s)         (4.30 T)             (1996.8 GF)                     (14 TF)      (6.912 TF)
Memory subsystem    HBM2 x6              DDR4 x6ch         DDR3 x16ch    HBM2         MCDRAM, DDR4
Memory BW           1.22 TB/s            128 GB/s          256 GB/s      900 GB/s     450+ GB/s, 115.2 GB/s
Memory capacity     48 GB                96 GB             64 GB         16 GB        16 GB, 96 GB
LLC BW              2.66 TB/s            N/A               1.0 TB/s      N/A          N/A
LLC capacity        16 MB shared         19.25 MB shared   1 MB private  6 MB shared  1 MB shared by 2 cores


Applications used for evaluation

• SGEMM/DGEMM
  • Matrix-matrix multiplications to evaluate the peak flop/s
• Stream benchmark
  • Simple kernels (copy, scale, add, triad) to measure sustained memory performance
• Himeno benchmark
  • Jacobi kernel with a 19-point stencil, used as a memory-intensive kernel
• Tohoku Univ.’s kernels
  • Kernels of practical applications of Tohoku Univ. in earthquake, CFD, and electromagnetics
• Microbenchmark for offload evaluation
  • Mixture of vector-friendly Jacobi kernels and I/O kernels
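For readers unfamiliar with the Stream kernels, the sketch below shows the triad operation and how a bandwidth figure is derived from it. It is a pure-Python illustration of the accounting only, not the benchmark used in the talk: the function names, array size, and repeat count are assumptions, and an interpreted loop will report a bandwidth far below what the hardware can sustain.

```python
import time
from array import array


def stream_triad(b, c, scalar):
    """STREAM triad kernel: a[i] = b[i] + scalar * c[i]."""
    return array("d", (bi + scalar * ci for bi, ci in zip(b, c)))


def triad_bandwidth_gbs(n=200_000, scalar=3.0, repeats=3):
    """Run the triad a few times and report the best observed rate in GB/s."""
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a = stream_triad(b, c, scalar)
        best = min(best, time.perf_counter() - t0)
    # STREAM-style accounting: two loads and one store of 8 bytes per element.
    bytes_moved = 3 * n * a.itemsize
    return a, bytes_moved / best / 1e9


if __name__ == "__main__":
    _, bandwidth = triad_bandwidth_gbs()
    print(f"triad rate: {bandwidth:.3f} GB/s (interpreter-bound, illustrative only)")
```

Taking the best of several runs follows the usual Stream convention of reporting the fastest trial.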


SGEMM/DGEMM Performance

• High scalability up to 8 threads
  • High vectorization ratio of 99.36%, good vector length of 253.8
• High efficiency, achieving almost ideal performance
  • Efficiency 97.8~99.2%

[Figure: GEMM performance (Gflop/s) and efficiency (%) versus number of threads (1 to 8); series: DGEMM Gflop/s, SGEMM Gflop/s, DGEMM efficiency, SGEMM efficiency]
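Efficiency here is presumably sustained flop/s divided by peak flop/s. The sketch below shows that calculation for a GEMM, using the standard 2·m·n·k operation count. The matrix size and elapsed time are made-up values chosen only to illustrate the arithmetic; they are not measurements from the talk.

```python
def gemm_flop_count(m, n, k):
    """Floating-point operations in C = A @ B for an (m x k) by (k x n) product."""
    return 2 * m * n * k


def gflops(flop_count, seconds):
    """Sustained rate in Gflop/s."""
    return flop_count / seconds / 1e9


def efficiency_percent(sustained_gflops, peak_gflops):
    """Sustained performance as a percentage of peak."""
    return 100.0 * sustained_gflops / peak_gflops


# Peak DP of one VE: 8 cores x 268.8 Gflop/s per core (from the slides).
PEAK_DP_GFLOPS = 8 * 268.8

# Hypothetical run: an n = 8192 DGEMM with an assumed elapsed time of 0.52 s.
n = 8192
sustained = gflops(gemm_flop_count(n, n, n), 0.52)
print(f"sustained:  {sustained:.1f} Gflop/s")
print(f"efficiency: {efficiency_percent(sustained, PEAK_DP_GFLOPS):.1f} %")
```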


Memory performance (Stream Triad)

• High sustained memory bandwidth of SX-Aurora TSUBASA
  • Efficiency: Aurora 79%, ACE 83%, Skylake 66%, V100 81%
• Scalability
  • Bandwidth saturates with as few as 3 or 4 cores

[Figure: Stream bandwidth (GB/s) on SX-Aurora TSUBASA, SX-ACE, Skylake, Tesla V100, and KNL; ratio labels on the chart: x 4.72, x 11.7, x 1.37, x 2.23]

[Figure: Stream bandwidth (GB/s) versus number of threads; series: SX-Aurora…, SX-ACE, Skylake]
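The efficiency percentages can be turned back into approximate sustained bandwidths by multiplying by each processor's peak memory bandwidth from the experimental-environment table. The sketch below does that, assuming efficiency means sustained bandwidth over peak bandwidth. Because the percentages are rounded, the results are estimates and need not reproduce the ratio labels on the chart exactly. KNL is omitted because no efficiency is quoted for it.

```python
# Peak memory bandwidth in GB/s, from the experimental-environment table.
peak_bw_gbs = {
    "SX-Aurora TSUBASA": 1220.0,
    "SX-ACE": 256.0,
    "Skylake": 128.0,
    "Tesla V100": 900.0,
}

# Stream triad efficiency quoted on the slide (rounded percentages).
efficiency = {
    "SX-Aurora TSUBASA": 0.79,
    "SX-ACE": 0.83,
    "Skylake": 0.66,
    "Tesla V100": 0.81,
}

# Estimated sustained bandwidth = efficiency x peak bandwidth.
sustained_gbs = {name: efficiency[name] * peak_bw_gbs[name] for name in peak_bw_gbs}

aurora = sustained_gbs["SX-Aurora TSUBASA"]
for name, bw in sustained_gbs.items():
    print(f"{name:18s} {bw:7.1f} GB/s   Aurora advantage: x{aurora / bw:.2f}")
```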


Himeno (Jacobi) performance

• Higher performance than the other platforms, except the GPU
  • Vector reduction becomes a bottleneck because of copies among vector pipes
• Nice thread scalability
  • 6.9x speedup with 8 threads => 86% parallel efficiency (OpenMP overhead?)

[Figure: Himeno performance (Gflop/s) on SX-Aurora TSUBASA, SX-ACE, Skylake, Tesla V100, and KNL; ratio labels on the chart: x 3.4, x 7.96, x 0.93, x 2.1]

[Figure: Himeno performance (Gflop/s) versus number of threads for SX-Aurora TSUBASA, SX-ACE, and Skylake; label on the chart: x 6.9]
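The 86% figure is the speedup divided by the thread count. The sketch below reproduces it and, as an addition not made in the talk, also applies the Karp-Flatt metric, which estimates the serial or overhead fraction implied by a measured speedup.

```python
def parallel_efficiency(speedup, threads):
    """Fraction of ideal linear scaling that was achieved."""
    return speedup / threads


def karp_flatt_serial_fraction(speedup, threads):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1.0 / speedup - 1.0 / threads) / (1.0 - 1.0 / threads)


speedup, threads = 6.9, 8   # Himeno thread scaling reported on the slide
eff = parallel_efficiency(speedup, threads)
serial = karp_flatt_serial_fraction(speedup, threads)
print(f"parallel efficiency: {eff:.1%}")
print(f"implied serial/overhead fraction: {serial:.1%}")
```

A small implied serial fraction is consistent with the slide's guess that the loss comes from threading overhead, although the metric alone cannot identify the cause.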


Application kernel performance

• SX-Aurora TSUBASA could achieve high performance
  • Plasma, Turbine => indirect access, memory latency-bound
  • Antenna => computation-bound to memory BW-bound
  • Land mine, Earthquake, Turbulent flow => memory or LLC BW-bound

[Figure: execution time (sec) of the Land mine, Earthquake, Turbulent flow, Antenna, Plasma, and Turbine kernels on SX-Aurora TSUBASA, SX-ACE, and Skylake; ratio labels on the chart: x 4.3, x 2.9, x 3.4, x 9.9, x 2.6, x 3.4]

[Figure: attainable performance (Gflop/s) versus arithmetic intensity (Flops/Byte) for SX-Aurora TSUBASA and SX-ACE, with the kernels Land mine, Earthquake, Antenna, Turbulent flow, Plasma, and Turbine marked]
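The second chart plots attainable performance against arithmetic intensity, which appears to be a roofline-style plot. As a sketch under that assumption, the standard roofline bound is the smaller of the peak flop rate and arithmetic intensity times memory bandwidth. The peak and bandwidth values below come from the experimental-environment table; the function names are illustrative.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, mem_bw_gbs):
    """Roofline bound: limited by either the compute peak or the memory bandwidth."""
    return min(peak_gflops, arithmetic_intensity * mem_bw_gbs)


def ridge_point(peak_gflops, mem_bw_gbs):
    """Arithmetic intensity (Flops/Byte) at which a kernel stops being memory-bound."""
    return peak_gflops / mem_bw_gbs


# (peak DP Gflop/s, memory bandwidth GB/s) from the slides.
systems = {
    "SX-Aurora TSUBASA": (2150.4, 1220.0),
    "SX-ACE": (256.0, 256.0),
}

# Same intensity range as the chart's horizontal axis.
intensities = [0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128]

for name, (peak, bw) in systems.items():
    print(f"{name}: ridge point at {ridge_point(peak, bw):.2f} Flops/Byte")
    for ai in intensities:
        print(f"  AI {ai:>7}: {attainable_gflops(ai, peak, bw):8.1f} Gflop/s")
```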


Memory bound? or LLC bound?

• Further analysis using 4 types of Bytes/Flop ratio
  • Memory B/F = (memory BW) / (peak performance)
  • LLC B/F = (LLC BW) / (peak performance)
  • Code B/F = (necessary data in bytes) / (# FP operations)
  • Actual B/F = (# block memory accesses) * (block size) / (# FP operations)
• Code B/F > Actual B/F * LLC BW / Memory BW => LLC bound
• Code B/F < Actual B/F * LLC BW / Memory BW => memory bound

B/F ratio     Actual < Memory      Actual > Memory
Code < LLC    Computation-bound    Memory BW-bound
Code > LLC    LLC BW-bound         Memory or LLC bound *
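The classification can be sketched as a small function. This is one reading of the slide, not the author's code: it takes the table's second column to mean Actual B/F > Memory B/F, applies the two inequality bullets to resolve the starred cell, and treats the unspecified equality cases as not exceeding the limit. The machine ratios use the 2.66 TB/s LLC bandwidth from the experimental-environment table, and the kernel values in the usage lines are hypothetical.

```python
def classify_bottleneck(code_bf, actual_bf, memory_bf, llc_bf):
    """Classify a kernel from its Code/Actual B/F and the machine's Memory/LLC B/F."""
    exceeds_llc = code_bf > llc_bf         # kernel demands more than the LLC can feed
    exceeds_memory = actual_bf > memory_bf  # kernel demands more than memory can feed

    if not exceeds_llc and not exceeds_memory:
        return "computation-bound"
    if not exceeds_llc:
        return "memory BW-bound"
    if not exceeds_memory:
        return "LLC BW-bound"
    # Both limits exceeded: compare Code B/F with Actual B/F * LLC BW / Memory BW.
    # LLC BW / Memory BW equals llc_bf / memory_bf because both share the same peak.
    if code_bf > actual_bf * llc_bf / memory_bf:
        return "LLC BW-bound"
    return "memory BW-bound"


# Machine ratios for one VE Type 10B, derived from figures on the slides.
PEAK_GFLOPS = 8 * 268.8
MEMORY_BF = 1220.0 / PEAK_GFLOPS   # 1.22 TB/s memory bandwidth
LLC_BF = 2660.0 / PEAK_GFLOPS      # 2.66 TB/s LLC bandwidth

# Hypothetical kernels (Code B/F, Actual B/F); not values from the talk.
for code_bf, actual_bf in [(0.5, 0.3), (0.5, 1.0), (2.0, 0.3), (2.0, 0.7), (1.5, 0.7)]:
    verdict = classify_bottleneck(code_bf, actual_bf, MEMORY_BF, LLC_BF)
    print(f"code B/F {code_bf}, actual B/F {actual_bf}: {verdict}")
```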


Application kernel performance

• Land mine => LLC-bound
• Earthquake => LLC-bound
• Turbulent flow => memory BW-bound
• Antenna => memory BW-bound

[Figure: execution time (sec) of the Land mine, Earthquake, Turbulent flow, Antenna, Plasma, and Turbine kernels on SX-Aurora TSUBASA, SX-ACE, and Skylake; ratio labels on the chart: x 4.3, x 2.9, x 3.4, x 9.9, x 2.6, x 3.4]

B/F ratio     Actual < Memory      Actual > Memory
Code < LLC    Computation-bound    Memory BW-bound
Code > LLC    LLC BW-bound         Memory or LLC bound *


Evaluation of the execution model

• (Transparent/Explicit) offload from VE to VH
• Offload from VH to VE

[Figure: diagrams of the three offload patterns between VH and VE: transparent VE to VH, explicit VE to VH, and VH to VE]

[Figure: execution time (sec) for VE execution, VH offload, and VE offload, broken down into data transfer, Jacobi 1, I/O, and Jacobi 2]


Multi-VE performance on A300-8

• Stream VE-level scalability
  • Almost ideal scalability up to 8 VEs
• Himeno VE-level scalability
  • Good scalability up to 4 VEs
  • Vector lengths become too short with more than 5 VEs
    • Problem size is too small

[Figure: Stream bandwidth (GB/s) versus number of VEs (1 to 8); series: Aurora, Ideal]

[Figure: Himeno performance (Gflop/s) versus number of VEs (1 to 8); series: Aurora, Ideal]


Conclusions

• Performance evaluation and analysis of SX-Aurora TSUBASA
  • Standard benchmark programs
    → High potential of compute and memory performance
  • Kernels of practical applications
    → High memory performance leads to high sustained performance
  • Microbenchmark
    → Effectiveness of the new execution model