key technologies for 100 pflopsတတတတတတတတ · title: key technologies for 100...

10
Key Technologies for 100 PFLOPS Copyright 2014 FUJITSU LIMITED

Upload: others

Post on 02-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Key Technologies for 100 PFLOPSတတတတတတတတ · Title: Key Technologies for 100 PFLOPSတတတတတတတတ Author: FUJITSU LIMITEDတတတတတတတတ Created Date:

Key Technologies for100 PFLOPS

Copyright 2014 FUJITSU LIMITED

Page 2: Key Technologies for 100 PFLOPSတတတတတတတတ · Title: Key Technologies for 100 PFLOPSတတတတတတတတ Author: FUJITSU LIMITEDတတတတတတတတ Created Date:

How to keep the HPC-tree growing

Copyright 2014 FUJITSU LIMITED

A stable programming model is needed to secure the continuous software development.

FFT

Eigenvalue problem

materials

Big-data analysis

Drug discoveryLife-science

Computational

Fluid dynamicsMolecular dynamics Disaster prevention

Subatomic particle phys.

Astronomy

Earth science

Economics

Quantum chemistry

Structure analysis

1

Page 3: Key Technologies for 100 PFLOPSတတတတတတတတ · Title: Key Technologies for 100 PFLOPSတတတတတတတတ Author: FUJITSU LIMITEDတတတတတတတတ Created Date:

Copyright 2014 FUJITSU LIMITED

The MPI + (OpenMP or automatic-parallelization) programming model will stay important.

VISIMPACT (Virtual Single Processor by Integrated Multi-core Parallel Architecture)

Automated multi-thread parallelization• Inherits vectorization technology

High performance hardware thread barrier

Easy hybrid parallelization reduces number of processes• Reducing communication overhead

• Saving memory usage

Lasting programming model

POST-FX10

K computer (*)

NUMERICAL WIND TUNNEL

1974

FACOM-M190

1993

1999

VPP5000

2000

PRIMEPOWER

2009

FX1

2011

2011

#1 on Top500

#1 on Top500

FUJITSU SupercomputerPRIMEHPC FX10

(*) The K computer is a supercomputerjointly developed by RIKEN and Fujitsu.

2

Page 4: Key Technologies for 100 PFLOPSတတတတတတတတ · Title: Key Technologies for 100 PFLOPSတတတတတတတတ Author: FUJITSU LIMITEDတတတတတတတတ Created Date:

Proven performance in various application fields

Application Research Field # of nodes Efficiency

NICAM 81,920 8%

Seism3D 82,944 18%

PHASE 82,944 20%

RSDFT 82,944 52%

FrontFlow/blue 80,000 3%

Lattice QCD 82,944 16%

ZZ-EFSI 82,944 46%

GreeM 82,944 42%

Copyright 2014 FUJITSU LIMITED

Performance data on K computer are provided by Mr. Minami of RIKEN

Post-FX10 maintains excellent performance efficiency with desirable technologies as well as a well-balanced system design.

*1

*2

Earth science

Disaster prevention

Computational materials

Fluid dynamics

Quantum chemistryComputational materials

Subatomic particle phys.

Life science Fluid dynamics Structure analysis

Astronomy*1: http://www.scls.riken.jp/eng/newsletter/Vol.7/report02.html*2: T.Ishiyama, et.al., 4.45 Pflops astrophysical N-body simulation on K computer, SC’12

3

Page 5: Key Technologies for 100 PFLOPSတတတတတတတတ · Title: Key Technologies for 100 PFLOPSတတတတတတတတ Author: FUJITSU LIMITEDတတတတတတတတ Created Date:

Data intensive applications

Copyright 2014 FUJITSU LIMITED

do k=2,kmax-1do j=2,jmax-1

do i=2,imax-1s0=a(I,J,K,1)*p(I+1,J,K) + a(I,J,K,2)*p(I,J+1,K) + a(I,J,K,3)*p(I,J,K+1) &

+b(I,J,K,1)*(p(I+1,J+1,K)-p(I+1,J-1,K) - p(I-1,J+1,K)+p(I-1,J-1,K)) &+b(I,J,K,2)*(p(I,J+1,K+1)-p(I,J-1,K+1) - p(I,J+1,K-1)+p(I,J-1,K-1)) &+b(I,J,K,3)*(p(I+1,J,K+1)-p(I-1,J,K+1) - p(I+1,J,K-1)+p(I-1,J,K-1)) &+c(I,J,K,1)*p(I-1,J,K) +c(I,J,K,2)*p(I,J-1,K) +c(I,J,K,3)*p(I,J,K-1)+wrk1(I,J,K)

ss=(s0*a(I,J,K,4)-p(I,J,K))*bnd(I,J,K)GOSA1=GOSA1+SS*SSwrk2(I,J,K)=p(I,J,K)+OMEGA *SS

enddoenddo

enddo

Himeno BMT kernel*

HMC is effective for data intensive applications.

5

4

3

2

1

0

FX10 1node Post-FX10 1node

Rat

io

* Incompressible fluid analysis code solving the Poisson’s equation solution using the Jacobi iteration method. Many loads require high memory throughput.

* Tentative result

4

Page 6: Key Technologies for 100 PFLOPSတတတတတတတတ · Title: Key Technologies for 100 PFLOPSတတတတတတတတ Author: FUJITSU LIMITEDတတတတတတတတ Created Date:

Dense matrix problems

Enhanced sector-cache mechanism improve efficiency.

DGEMM is the fundamental function for many applications.

Copyright 2014 FUJITSU LIMITED

Performance

FX10 95.3% (16core)

Post-FX10 97.9% (16core)

DGEMM performance of BLAS:

* Tentative result

5

Page 7: Key Technologies for 100 PFLOPSတတတတတတတတ · Title: Key Technologies for 100 PFLOPSတတတတတတတတ Author: FUJITSU LIMITEDတတတတတတတတ Created Date:

Sparse matrix problems

2 to 4 times faster than FX10 (on 1 core)

The new indirect-load of wider-SIMD operation contributes to performance improvement

Copyright 2014 FUJITSU LIMITED

do j=1,lastrow-firstrow+1sum = 0.d0do k=rowstr(j),rowstr(j+1)-1

sum = sum + a(k) * p(colidx(k))enddow(j) = sum

enddo

Small Array size < 32KBMedium Array size < 12MBLarge Array size > 20MB

0

1

2

3

4

5

Small Small Small Medium Medium

Small Medium Large Medium Large

Re

lati

ve

pe

rf.

on

1 c

ore

FX10 Post-FX10

a(), colidx()

p()

NAS parallel benchmarks CG kernel

* Tentative result

6

Page 8: Key Technologies for 100 PFLOPSတတတတတတတတ · Title: Key Technologies for 100 PFLOPSတတတတတတတတ Author: FUJITSU LIMITEDတတတတတတတတ Created Date:

Communication intensive applications

Fujitsu's math library provides scalable 3D-FFT functionality for massively parallel execution.

Interconnect bandwidth improves significantly.

Tofu Interconnect and job manager enable various processing configuration.

Optimized MPI library provides high performance collective communications for Tofu Interconnect.

Copyright 2014 FUJITSU LIMITED

0

50

100

150

200

0 10000 20000 30000 40000 50000 60000

TFlo

ps

Num. of nodes

3D-FFT(81923) estimated Perf. on Post-FX10

Post-FX10 SSL II

K computer SSL II

K computer FFTW3

Process topology

Global data view

6

4

7

5

2

0

3

1Data assignment

and rank

3-dimensional FFT using volumetric decomposition

7

Page 9: Key Technologies for 100 PFLOPSတတတတတတတတ · Title: Key Technologies for 100 PFLOPSတတတတတတတတ Author: FUJITSU LIMITEDတတတတတတတတ Created Date:

Conclusion

Fujitsu is developing the Post-FX10 system to be capable of 100 PFLOPS:

Enhanced SPARC64 CPU and Tofu Interconnect were designed exclusively for HPC.

Fujitsu's HPC software stack exploits maximum hardware performance.

Post-FX10 employs best technologies for actual application performance:

256 bit-wide SIMD processing and enhanced sector cache mechanism improve flops

New indirect-load/store SIMD operations are effective for complex data access

Micron’s Hybrid Memory Cube fulfills high B/F requirements

Enhanced interconnect and assistant cores enable fast and scalable communication

Post-FX10 strongly drives scientific progress forward.

Copyright 2014 FUJITSU LIMITED9

Page 10: Key Technologies for 100 PFLOPSတတတတတတတတ · Title: Key Technologies for 100 PFLOPSတတတတတတတတ Author: FUJITSU LIMITEDတတတတတတတတ Created Date:

Copyright 2013 FUJITSU LIMITED10