
HPCI System Development Team
Software Development Team


[Figure: weak-scaling performance, GFLOPS (log scale) versus number of processes (log scale, 8 to 32768); figure: CG run time [s] (0 to 800) for the original and tuned codes, showing a 19x speed-up]

Result for SC15

Tune: Coloring for SYMGS

Evaluating the original HPCG code on the K computer showed the following: linear scalability was obtained, so parallelization tuning was not necessary. Single-CPU performance was low because SYMGS is not multi-threaded. We therefore focused on single-CPU performance.

Weak-scaling performance / Cost components in CG

Using V2.4, we obtained an HPCG score of 0.461 PFLOPS using 82,944 processors. This score was ranked 2nd on the list at SC15.

Significant improvement obtained. We applied the following additional optimizations:
- Memory serialization for the matrix
- Data access ordering improvement for SYMGS
- Loop optimization for SPMV and SYMGS
- Parameter adjustment
- Refinement of miscellaneous routines
We obtained a 19-times improvement.
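To make the "loop optimization for SPMV and SYMGS" item concrete, here is a minimal C++/OpenMP sketch of a CRS-format sparse matrix-vector product with a thread-parallel row loop. The struct, function, and variable names are illustrative assumptions; this is not the actual K computer code.

// Minimal sketch: CRS-format SpMV with an OpenMP-parallel row loop.
// Illustrative only; not the tuned K computer implementation.
#include <vector>

struct CrsMatrix {
  std::vector<double> values;     // non-zero values, stored row by row
  std::vector<int>    col_index;  // column index of each non-zero
  std::vector<int>    row_start;  // start offset of each row (size nrows + 1)
};

void spmv(const CrsMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
  const int nrows = static_cast<int>(A.row_start.size()) - 1;
  #pragma omp parallel for schedule(static)
  for (int row = 0; row < nrows; ++row) {
    double sum = 0.0;
    for (int idx = A.row_start[row]; idx < A.row_start[row + 1]; ++idx) {
      sum += A.values[idx] * x[A.col_index[idx]];
    }
    y[row] = sum;
  }
}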


In the original code, SYMGS cannot be multi-threaded because there are data dependencies between rows. To avoid these dependencies, we employed colored blocking: the mesh is divided into blocks and the blocks are colored.

Nodes in blocks of the same color are reordered so that they are contiguous in memory. There are no data dependencies between blocks of the same color, so those blocks can be processed with multi-threading.
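As a rough illustration of this scheme, the sketch below shows how a forward symmetric Gauss-Seidel sweep can be multi-threaded over colored blocks once the rows of each block are contiguous in memory, assuming the CRS layout from the SpMV sketch above. The data structures and names are assumptions for illustration, not the actual K computer implementation.

// Minimal sketch: forward Gauss-Seidel sweep threaded over colored blocks.
// Colors are processed one after another; blocks of the same color have no
// mutual data dependencies, so they run on different threads.
#include <vector>

struct CrsMatrix {                 // same CRS layout as in the SpMV sketch
  std::vector<double> values;
  std::vector<int>    col_index;
  std::vector<int>    row_start;   // size nrows + 1
};

struct Block {
  int first_row;   // first row of the block after reordering
  int last_row;    // one past the last row of the block
};

void symgs_forward(const CrsMatrix& A,
                   const std::vector<double>& diag,   // diagonal of A
                   const std::vector<double>& r,      // right-hand side
                   std::vector<double>& x,            // solution, updated in place
                   const std::vector<std::vector<Block>>& blocks_by_color) {
  for (const std::vector<Block>& blocks : blocks_by_color) {  // one color at a time
    #pragma omp parallel for schedule(static)
    for (int b = 0; b < static_cast<int>(blocks.size()); ++b) {
      for (int row = blocks[b].first_row; row < blocks[b].last_row; ++row) {
        double sum = r[row];
        for (int idx = A.row_start[row]; idx < A.row_start[row + 1]; ++idx) {
          sum -= A.values[idx] * x[A.col_index[idx]];
        }
        sum += diag[row] * x[row];   // undo the diagonal term subtracted above
        x[row] = sum / diag[row];
      }
    }
  }
}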

[Figure: the input mesh for a process is divided into blocks R.1 to R.5 and the blocks are colored; each block has almost the same number of nodes]

Rank  Computer    HPL [PFLOPS]  HPCG [PFLOPS]  Ratio to HPL [%]
1     Tianhe-2    33.86         0.580          1.7
2     K computer  10.51         0.461          4.4
3     Titan       17.59         0.322          1.8
4     Trinity     8.10          0.183          2.3
5     Mira        8.59          0.167          1.9
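The last column is the HPCG score as a fraction of the HPL score; for the K computer this is 0.461 / 10.51 ≈ 4.4%, the highest ratio among the five systems listed.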

http://www.hpcg-benchmark.org/custom/index.html?lid=155&slid=282

Evaluate Original HPCG on the K

Cost components in CG: ComputeSYMGS_ref 93.09% (not multi-threaded), ComputeSPMV_ref 5.23%.

Under the measurement rules of V3.0, the performance of the computational kernel is the same as with V2.4, but the HPCG score will decrease.

[Figure: reordered blocks R.1, R.2, R.3, R.4, ..., R.8 assigned to CPU cores for multi-threaded processing]

Comparing CG run time for 5 CG calls

HPCI System Development Team

Software Development Team

HPCI: High Performance Computing Infrastructure


Nagoya University

University of Tsukuba

University of Tokyo

HPCI WEST HUB Tokyo Institute of Technology

Kyoto University

Osaka University

Riken AICS

HPCI EAST HUB

Hokkaido University

Tohoku University

Kyushu University


The Institute of Statistical Mathematics

Collaboration between the system side and the application side to improve the system and its usability.

HPCG Performance Tuning on the K computer

One-stop service for nationwide supercomputer resources: support from project proposal to user report. The K computer and other major supercomputers in Japan, including large-scale MPP systems, GPU clusters, PC clusters, and vector machines, are available.

The HPCI helpdesk is available to HPCI users and provides consultation before applying with a project proposal, technical support for each computer resource, and guidance on the user report after the project finishes.

HPCI Shared Storage: provides 22 PB of single-view file space for all HPCI users, available from all HPCI supercomputer resources. With HPCI single sign-on authentication, a user who wants to transfer computational results to the HPCI shared storage after a computational session does not have to log in to the shared storage again. Automatic file replication: the HPCI shared storage keeps 2 file replicas by default. Secure network communication: HPCI users can specify a data encryption method such as "gsi" in ~/.gfarm2fs.

Single sign-on authentication: allows efficient use of nationwide supercomputer resources. HPCI users can utilize all of their granted HPC resources seamlessly after signing on to one of the resources.

[Figure: power versus time, annotated with ×n, 1/n, and n/n = 1]

Analysis of the correlation between power consumption and application performance: standby power and memory throughput have the greatest impact on the power consumption of the K computer.

A trial of improving power consumption and system operation: an evaluation model of the power consumption of each performance-tuned application.

We estimated how much the performance of each major user's applications differs from the field average.

If both the CPU performance and the memory performance are improved to the field average, NODE × TIME can be reduced as shown below.

If we improve the performance of the 10 (50) major user applications, an additional 25M (66M) NODE × TIME becomes available.

Likewise for power consumption: if we improve the performance of the 10 (50) major user applications, we can achieve energy savings of 2.7 GWh (7.0 GWh).

[Figure: Node × Time and power-saving ratio [%] (0% to 20%) versus the number of applications (0 to 100)]

Power [MW/sys] = 8.8128
               + 1.3659 × Floating Calc. [%]
               + 0.2429 × Integer Calc. [%]
               + 2.3299 × L1D Throughput [%]
               + 0.0857 × L2 Throughput [%]
               + 4.3906 × Memory Throughput [%]
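As a worked illustration of the model above, the small sketch below evaluates the reconstructed formula for one job, assuming the activity values are given as fractions between 0 and 1 (so that full activity on every resource adds up to roughly the machine's maximum power). The function name and the example inputs are hypothetical.

// Minimal sketch: evaluate the linear power model for one job.
// Coefficients are taken from the formula above; inputs are assumed to be
// activity ratios between 0 and 1. Illustrative only.
#include <cstdio>

double predicted_power_mw(double floating_calc, double integer_calc,
                          double l1d_throughput, double l2_throughput,
                          double memory_throughput) {
  return 8.8128                      // constant (standby) term
       + 1.3659 * floating_calc
       + 0.2429 * integer_calc
       + 2.3299 * l1d_throughput
       + 0.0857 * l2_throughput
       + 4.3906 * memory_throughput;
}

int main() {
  // Example: a memory-bound kernel with modest floating-point activity.
  std::printf("%.3f MW/sys\n", predicted_power_mw(0.2, 0.1, 0.3, 0.2, 0.6));
  return 0;
}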


Improvement of Operation Efficiency and Effort for Power Consumption Reduction on the K computer

[Figure: received power, CGS1, and CGS2 [MW] (0 to 16) over day/time, marking the HPCG start and the limit of power consumption]

[Figure: measured power and modeled power components (Floating Calc., Integer Calc., L1D/L2/Memory Throughput) [MW/sys] (0 to 6) for 37 simple kernel loops, grouped into B/F change, Throughput change, STREAM, Floating Calc., Integer Calc., and DGEMM]

Recently, jobs with large power consumption have increased, and the total power consumption has exceeded the power consumption limit set by the contract with the power provider. If the total power consumption exceeds the power limit, we have to pay an additional charge as a penalty.

JAMSTEC