Key Technologies for 100 PFLOPS
Copyright 2014 FUJITSU LIMITED
How to keep the HPC-tree growing
A stable programming model is needed to secure continuous software development.
[Figure: the HPC tree. Labels: FFT; Eigenvalue problem; Computational materials; Big-data analysis; Drug discovery; Life science; Fluid dynamics; Molecular dynamics; Disaster prevention; Subatomic particle phys.; Astronomy; Earth science; Economics; Quantum chemistry; Structure analysis]
The MPI + (OpenMP or automatic-parallelization) programming model will stay important.
VISIMPACT (Virtual Single Processor by Integrated Multi-core Parallel Architecture)
Automated multi-thread parallelization
• Inherits vectorization technology
High performance hardware thread barrier
Easy hybrid parallelization reduces the number of processes
• Reducing communication overhead
• Saving memory usage
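To make the process-count point concrete, here is a minimal sketch (not from the slides) that compares the number of MPI processes under flat MPI with the number under a hybrid MPI + threads layout. The node count is the K computer figure quoted in the application table of this deck. The value of 8 cores per node is an assumption taken from the K computer's published node configuration, and the function name is hypothetical.

```python
def mpi_process_count(nodes: int, cores_per_node: int, threads_per_process: int) -> int:
    """Number of MPI processes when each process runs `threads_per_process` threads."""
    if threads_per_process < 1 or cores_per_node % threads_per_process != 0:
        raise ValueError("threads_per_process must evenly divide cores_per_node")
    return nodes * (cores_per_node // threads_per_process)


NODES = 82_944        # K computer node count used in the application table
CORES_PER_NODE = 8    # assumption: K computer node configuration, not stated on the slide

# Flat MPI: one single-threaded process per core.
flat = mpi_process_count(NODES, CORES_PER_NODE, threads_per_process=1)

# Hybrid: one process per node, threads (OpenMP or automatic parallelization) fill the cores.
hybrid = mpi_process_count(NODES, CORES_PER_NODE, threads_per_process=CORES_PER_NODE)

print(f"flat MPI processes:   {flat}")
print(f"hybrid MPI processes: {hybrid}")
print(f"reduction factor:     {flat // hybrid}")
```

Fewer processes means fewer communication endpoints and fewer per-process buffers, which is where the reduced communication overhead and memory saving named in the bullets would come from.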
[Figure: timeline, "Lasting programming model"]
1974  FACOM-M190
1993  NUMERICAL WIND TUNNEL (#1 on Top500)
1999  VPP5000
2000  PRIMEPOWER
2009  FX1
2011  K computer (*) (#1 on Top500)
2011  FUJITSU Supercomputer PRIMEHPC FX10
      POST-FX10
(*) The K computer is a supercomputer jointly developed by RIKEN and Fujitsu.
Proven performance in various application fields

Application     Research field                                    # of nodes  Efficiency
NICAM           Earth science                                     81,920       8%
Seism3D         Disaster prevention                               82,944      18%
PHASE           Computational materials                           82,944      20%
RSDFT           Quantum chemistry, Computational materials        82,944      52%
FrontFlow/blue  Fluid dynamics                                    80,000       3%
Lattice QCD     Subatomic particle phys.                          82,944      16%
ZZ-EFSI *1      Life science, Fluid dynamics, Structure analysis  82,944      46%
GreeM *2        Astronomy                                         82,944      42%

Performance data on K computer are provided by Mr. Minami of RIKEN
Post-FX10 maintains excellent performance efficiency with desirable technologies as well as a well-balanced system design.
*1: http://www.scls.riken.jp/eng/newsletter/Vol.7/report02.html
*2: T. Ishiyama, et al., 4.45 Pflops astrophysical N-body simulation on K computer, SC’12
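The efficiency column can be turned into sustained performance with simple arithmetic. The sketch below is an illustration, not part of the slides. It assumes a K computer peak of 128 GFLOPS per node (8 cores x 2 GHz x 8 flops per cycle), which the slide does not state, and the function name is hypothetical. Because the percentages on the slide are rounded, the results are approximate.

```python
PEAK_GFLOPS_PER_NODE = 128.0  # assumption: K computer node peak, not given on the slide


def sustained_pflops(nodes: int, efficiency_percent: float,
                     peak_gflops_per_node: float = PEAK_GFLOPS_PER_NODE) -> float:
    """Sustained performance in PFLOPS = node count x node peak x efficiency."""
    peak_pflops = nodes * peak_gflops_per_node / 1.0e6  # GFLOPS -> PFLOPS
    return peak_pflops * efficiency_percent / 100.0


# (nodes, efficiency in percent) as listed in the table above
runs = {
    "NICAM": (81_920, 8),
    "Seism3D": (82_944, 18),
    "PHASE": (82_944, 20),
    "RSDFT": (82_944, 52),
    "FrontFlow/blue": (80_000, 3),
    "Lattice QCD": (82_944, 16),
    "ZZ-EFSI": (82_944, 46),
    "GreeM": (82_944, 42),
}

for name, (nodes, eff) in runs.items():
    print(f"{name:<15s} {sustained_pflops(nodes, eff):5.2f} PFLOPS")
```

Under this assumption the GreeM row works out to roughly 4.46 PFLOPS, which agrees with the 4.45 Pflops quoted in the SC’12 reference to within the rounding of the 42% figure.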
Data intensive applications
do k=2,kmax-1
  do j=2,jmax-1
    do i=2,imax-1
      s0=a(I,J,K,1)*p(I+1,J,K) + a(I,J,K,2)*p(I,J+1,K) + a(I,J,K,3)*p(I,J,K+1) &
        +b(I,J,K,1)*(p(I+1,J+1,K)-p(I+1,J-1,K) - p(I-1,J+1,K)+p(I-1,J-1,K)) &
        +b(I,J,K,2)*(p(I,J+1,K+1)-p(I,J-1,K+1) - p(I,J+1,K-1)+p(I,J-1,K-1)) &
        +b(I,J,K,3)*(p(I+1,J,K+1)-p(I-1,J,K+1) - p(I+1,J,K-1)+p(I-1,J,K-1)) &
        +c(I,J,K,1)*p(I-1,J,K) +c(I,J,K,2)*p(I,J-1,K) +c(I,J,K,3)*p(I,J,K-1)+wrk1(I,J,K)
      ss=(s0*a(I,J,K,4)-p(I,J,K))*bnd(I,J,K)
      GOSA1=GOSA1+SS*SS
      wrk2(I,J,K)=p(I,J,K)+OMEGA *SS
    enddo
  enddo
enddo
Himeno BMT kernel*
HMC is effective for data intensive applications.
[Figure: bar chart of performance ratio (axis 0 to 5), FX10 1 node vs. Post-FX10 1 node]
* Incompressible fluid analysis code that solves Poisson’s equation using the Jacobi iteration method. Its many loads require high memory throughput.
* Tentative result
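A rough bytes-per-flop (B/F) estimate shows why this kernel is bound by memory throughput. The sketch below is not from the slides. It counts the arithmetic in the loop body shown above and estimates memory traffic under stated assumptions: 4-byte reals, every coefficient array streamed from memory once per grid point, neighbouring values of p reused from cache by default, and write-allocate traffic ignored. The function names are hypothetical.

```python
def count_flops() -> int:
    """Floating-point operations per grid point, counted from the loop body."""
    s0 = 9 + 18   # 9 multiplies (a, b, c terms) and 18 additions/subtractions
    ss = 3        # s0*a(..,4), minus p, times bnd
    gosa = 2      # SS*SS and the accumulation into GOSA1
    wrk2 = 2      # OMEGA*SS and the addition to p
    return s0 + ss + gosa + wrk2


def bytes_per_flop(word_bytes: int = 4, p_loads_from_memory: int = 1) -> float:
    """Estimated memory traffic per flop for one grid point.

    word_bytes=4 assumes single-precision reals (an assumption, not stated on the slide).
    p_loads_from_memory=1 assumes stencil neighbours of p are served from cache;
    pass 19 for the worst case where every stencil point of p comes from memory.
    """
    # a(:,:,:,1:4), b(:,:,:,1:3), c(:,:,:,1:3), wrk1, bnd, plus p
    loads = 4 + 3 + 3 + 1 + 1 + p_loads_from_memory
    stores = 1  # wrk2
    return (loads + stores) * word_bytes / count_flops()


print(f"flops per grid point: {count_flops()}")
print(f"B/F with cached p:    {bytes_per_flop():.2f}")
print(f"B/F with uncached p:  {bytes_per_flop(p_loads_from_memory=19):.2f}")
```

Even in the favourable cached case the estimate is above 1.5 bytes per flop, so the kernel speeds up mainly when memory bandwidth grows rather than when peak flops grow.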
Dense matrix problems
Enhanced sector-cache mechanism improves efficiency.
DGEMM is the fundamental function for many applications.
DGEMM performance of BLAS:
            Performance
FX10        95.3% (16 cores)
Post-FX10   97.9% (16 cores)
* Tentative result
Sparse matrix problems
2 to 4 times faster than FX10 (on 1 core)
The new indirect-load operation of the wider SIMD contributes to the performance improvement.
do j=1,lastrow-firstrow+1
  sum = 0.d0
  do k=rowstr(j),rowstr(j+1)-1
    sum = sum + a(k) * p(colidx(k))
  enddo
  w(j) = sum
enddo
Small: Array size < 32KB
Medium: Array size < 12MB
Large: Array size > 20MB
[Figure: bar chart of relative performance on 1 core (axis 0 to 5), FX10 vs. Post-FX10, for five array-size combinations of a(), colidx() and p(): Small/Small, Small/Medium, Small/Large, Medium/Medium, Medium/Large]
NAS parallel benchmarks CG kernel
* Tentative result
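For readers who want to trace the access pattern, the following is a minimal Python sketch of the same sparse matrix-vector product in compressed-row storage. It is an illustration only: indices are 0-based, unlike the 1-based Fortran loop, the function name is hypothetical, and the 3x3 matrix is made up. The array names rowstr, colidx, a, p and w follow the kernel shown on the slide.

```python
def csr_matvec(rowstr, colidx, a, p):
    """Compute w = A * p for a matrix A stored in compressed-row form.

    rowstr[j] .. rowstr[j + 1] - 1 are the positions in `a` and `colidx`
    that hold the nonzeros of row j.
    """
    w = []
    for j in range(len(rowstr) - 1):
        total = 0.0
        for k in range(rowstr[j], rowstr[j + 1]):
            # a[k] is read contiguously; p[colidx[k]] is the indirect load (gather).
            total += a[k] * p[colidx[k]]
        w.append(total)
    return w


# Made-up example matrix:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
rowstr = [0, 2, 3, 5]
colidx = [0, 2, 1, 0, 2]
a = [4.0, 1.0, 3.0, 2.0, 5.0]
p = [1.0, 2.0, 3.0]

w = csr_matvec(rowstr, colidx, a, p)
print(w)
```

The reads of a and colidx stream through memory, while the reads of p jump around according to colidx. That second pattern is the one an indirect-load SIMD instruction can vectorize.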
Communication intensive applications
Fujitsu's math library provides scalable 3D-FFT functionality for massively parallel execution.
Interconnect bandwidth improves significantly.
Tofu Interconnect and job manager enable various processing configurations.
Optimized MPI library provides high performance collective communications for Tofu Interconnect.
[Figure: 3D-FFT (8192^3) estimated Perf. on Post-FX10. x-axis: Num. of nodes (0 to 60000); y-axis: TFlops (0 to 200); series: Post-FX10 SSL II, K computer SSL II, K computer FFTW3]
[Figure: 3-dimensional FFT using volumetric decomposition. Panels: Process topology; Global data view; Data assignment and rank (ranks 0 to 7)]
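As a rough illustration of volumetric decomposition, the sketch below assigns each rank of a 3D process grid a cubic block of the global array. It is not Fujitsu's implementation. The x-fastest rank ordering is an assumption, because the rank numbering of the original figure cannot be recovered from the transcript, and the function names are hypothetical. The 2 x 2 x 2 process grid matches the eight ranks shown in the figure, and the grid size is the 8192-cubed problem from the performance plot.

```python
def rank_to_coords(rank, dims):
    """Map a linear rank to (x, y, z) process coordinates, x varying fastest (assumed)."""
    px, py, pz = dims
    if not 0 <= rank < px * py * pz:
        raise ValueError("rank outside the process grid")
    x = rank % px
    y = (rank // px) % py
    z = rank // (px * py)
    return (x, y, z)


def local_block(rank, dims, n):
    """Half-open index range [start, stop) per axis of the block owned by `rank`."""
    block = []
    for coord, procs in zip(rank_to_coords(rank, dims), dims):
        if n % procs != 0:
            raise ValueError("grid size must be divisible by the process count per axis")
        size = n // procs
        block.append((coord * size, (coord + 1) * size))
    return block


N = 8192            # global grid is N x N x N
DIMS = (2, 2, 2)    # process grid with 8 ranks

for rank in range(DIMS[0] * DIMS[1] * DIMS[2]):
    print(rank, rank_to_coords(rank, DIMS), local_block(rank, DIMS, N))
```

Splitting all three axes keeps the number of usable processes proportional to the cube of the per-axis process count, whereas a slab decomposition along one axis cannot use more processes than there are planes in the grid.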
Conclusion
Fujitsu is developing the Post-FX10 system to be capable of 100 PFLOPS:
• Enhanced SPARC64 CPU and Tofu Interconnect were designed exclusively for HPC.
• Fujitsu's HPC software stack exploits maximum hardware performance.
Post-FX10 employs the best technologies for actual application performance:
• 256-bit wide SIMD processing and enhanced sector cache mechanism improve flops
• New indirect-load/store SIMD operations are effective for complex data access
• Micron’s Hybrid Memory Cube fulfills high B/F requirements
• Enhanced interconnect and assistant cores enable fast and scalable communication
Post-FX10 strongly drives scientific progress forward.