Transforming HPC with Huawei ARM HPC Solution
Huawei HPC Ecosystem and Collaboration
Win-Win, Open HPC Ecosystem
Hardware/Infrastructure
Industry Applications
Management Software System Environment
Community/Organization
Customer Benefits
Solution Integration
Joint Innovation and Certification Test with Partners
• Huawei & ESI signed an MoU to innovate HPC for industrial manufacturing acceleration
• Huawei & Ansys jointly released a Fluent benchmarking white paper based on the Huawei FusionServer
• Huawei & Altair signed an MoU to jointly pursue HPC opportunities and build industrial simulation cloud solutions
• Huawei & Dassault signed an MoU to pursue sustainable innovation opportunities with the 3DEXPERIENCE Platform on Cloud
Embracing the Open Source HPC Community
• Reduce verification and integration workloads
• Ensure version iteration and service support
• Lower thresholds for HPC deployment and new technologies
Huawei HPC Platform
Open Source HPC Software Stack
• Participated in the OpenHPC Community
• Participated in the HPC-AI Advisory Council Application Performance Benchmarking Studies
• Linaro open source collaboration
Rich Experience in HPC Industry Project Construction
• Manufacturing/Carmakers
• Colleges and Universities
• Supercomputing Centers & Research Organizations
• Oil & Gas, Media, Meteorology
Innovation at Huawei
Innovation Makes Computing Simple
Powering a large portfolio of solution platforms
Solution platforms
• SAP HANA Appliance
• HPC
• Big Data
• Hyper-Converged Infrastructure
• Azure Stack Solution
• Edge Computing for Video Analytics Solution
Accelerator parts
• NVMe SSD
• FPGA
• Intelligent NIC
FusionServer families
• FusionServer rack-optimized: standard 2S-8S x86, improved designs for medium and larger businesses
• FusionServer X: high-density server, optimized for large-scale application deployments
• FusionServer E: blade system for converged infrastructure, providing maximized efficiency
• FusionServer G: GPU server for HPC, video, AI/DL and other workloads that need GPU computing environments
Categories: Traditional, Modular, Accelerated; General Purpose, Dedicated Applications
Unique innovations
• FDM
• DEMT
Innovative chips
• CPU: Hi1616
• NC: Hi1503
• NIC: Hi1822
• SSD: Hi1812
• BMC: Hi1710
Cultivating Competitive Differentiation at the Chip Level
Chip categories: computing chips, I/O chips, management chips
• Hi1620: ARM server SoC, supports up to 4 sockets
• Hi1503: CPU interconnect chip (node controller), supports scaling up to 32 sockets
• Hi1812: SSD storage controller, for read/write I/O acceleration
• Hi1822: network controller chip, DC high-speed and flexible interconnect
• Hi1710: BMC management chip, with enhanced RAS features
[Figure: server board layout showing the CPUs, node controllers, BMC, NIC controller and SSD controller]
How Huawei Positions ARM HPC
ARM HPC Driven by ExaScale Competition
Focus on exascale technology and its commercial HPC market opportunities
• Sunway System – NRCPC, Alpha ISA
• Tianhe System – NUDT, ARM ISA
• EPI project with ARM ISA CPU for HPC and RISC-V CPU for future high-throughput computing
• Post-K system with ARM ISA CPU known as A64FX
• A21/Intel, Frontier/?, El Capitan/?
• ARM ISA?
Huawei Leverages ARM Technology to Fulfill Our Cloud-Network-Terminal Strategy
ARM technology is jointly invested in by the Terminal, Cloud, Wireless and IT BUs
IoT Platform
• Rich commercial experience for both OEM and aftermarket
• Full maturity for connected car platform development
• Solid roadmap for future evolution, e.g. AI, Blockchain, V2X server, etc.
Network, Chipsets
• LTE-V: leading PC5 LTE-V standards, 153+ patents
• 5G: industry leader, main founder of 5GAA, dozens of trials in China/Korea/Japan/Europe
• Module, T-Box and HU delivered in BMW, Daimler, …
• Open ecosystem
[Figure: Cloud-Network-Terminal architecture. Services: Telematics, Connected Vehicle, Connected Life. Cloud: IoT Platform (Connectivity Management, Service Orchestration, Big Data, Unified Open Interface APIs) and third-party cloud. Network: 2G/3G/4G, LTE-V, 5G, C-V2X, V2V/V2I, with V2X server, IoT agent and edge computing. Terminal: RSU, Smart Antenna, AR.]
How Huawei Positions ARM HPC
Applying ARM to the Roofline model: let's focus on real application performance
Arithmetic intensity means the ratio of (arithmetic instructions)/(off-chip memory operands). Lower arithmetic intensity means memory bound; higher arithmetic intensity means CPU bound.
[Figure: kernels placed along the arithmetic intensity axis, from memory bound to compute bound]
• Axis bands: 0.1-1.0 FLOPS per byte, 1-2 FLOPS per byte, > 2 FLOPS per byte
• Kernels, in increasing arithmetic intensity: SparseMV and BLAS1,2; Stencils (PDEs); Lattice Boltzmann Methods; FFTs and Spectral Methods; Dense Linear Algebra (BLAS3); Particle Methods
• Annotations: CFD, FEM, WRF; 1%; two "You are here" markers
ARM HPC targets memory-bound applications, that is, applications with lower arithmetic intensity (scalar floating point, integer), such as CAE/CFD, weather, life sciences, and oil & gas
The Majority of Real Applications Have Low Arithmetic Intensity
ARM server can cover 50%+ of HPC workloads
Application | Field | Method | Arithmetic Intensity (FLOPS/Byte)
BCM | CFD | Navier-Stokes | 0.14
OpenFOAM | CFD | Finite Volumes – Finite Element | 0.13
MHD – FDM | Magneto Hydro Dynamics | Finite Difference Method | 0.33
MHD – Spectral | Magneto Hydro Dynamics | Pseudo Spectral Method | 0.45
QSFDM | Seismology | Spherical 2.5D FDM | 0.46
SEISM3D | Seismology | Finite Difference Method | 0.47
Barotropic | Ocean Circulation Model | Shallow Water Model | 0.51
Turbine | CFD | DNS | 0.56
KKRNano | Nanotechnology | Korringa-Kohn-Rostoker | 4 (DP)
BQCD | High-Energy-Physics | Hybrid Monte-Carlo | 0.45 (DP), 0.9 (SP)
B-CALM | Electro-Magnetic Sim. | Finite Difference time-domain | 0.3 (SP)
WRF | Weather Forecast model | Stencil code | 0.5-1.5
HEPSPEC | SPEC2006 selection for HEP (CERN) | NAMD, DEALII, SOPLEX, POVRAY, OMNETPP, ASTAR, XALANCBMK | ≥ 0.5
Gromacs | Molecular dynamics package | Bennett Acceptance Ratios | < 1
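As a rough illustration of why these arithmetic intensities matter, the sketch below applies the standard Roofline bound, attainable GFLOPS = min(peak GFLOPS, arithmetic intensity × memory bandwidth). The machine parameters are assumptions, not figures stated on this slide: the bandwidth reuses the 2P Hi1620 STREAM result quoted elsewhere in this deck (251 GB/s), and the peak assumes 96 cores at 2.6 GHz with 4 double-precision FLOPs per cycle per core. The function names are hypothetical.

```python
# Minimal Roofline sketch. Machine parameters are assumptions for illustration.
PEAK_GFLOPS = 2 * 48 * 2.6 * 4   # assumed: 2 sockets x 48 cores x 2.6 GHz x 4 FLOPs/cycle
BANDWIDTH_GBS = 251.0            # assumed: STREAM result quoted elsewhere in the deck

# Arithmetic intensities (FLOPS/Byte) taken from the table above.
INTENSITY = {
    "OpenFOAM": 0.13,
    "BCM": 0.14,
    "SEISM3D": 0.47,
    "Turbine": 0.56,
    "KKRNano": 4.0,
}


def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, intensity * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which the memory roof meets the compute roof."""
    return peak_gflops / bandwidth_gbs


def classify(intensity, peak_gflops, bandwidth_gbs):
    """Label a kernel by which roof limits it."""
    if intensity < ridge_point(peak_gflops, bandwidth_gbs):
        return "memory bound"
    return "compute bound"


if __name__ == "__main__":
    for name, ai in INTENSITY.items():
        bound = attainable_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
        label = classify(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
        print(f"{name:10s} AI={ai:4.2f}  bound={bound:7.1f} GFLOPS  {label}")
```

Under these assumptions every application in the table with an intensity below roughly 4 FLOPS/Byte is limited by memory bandwidth rather than by peak floating-point rate, which is the argument the slide makes for targeting such workloads.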
CAE/CFD, Weather, Life Sciences and Oil & Gas as Huawei's Target Industries
Applying ARM to the seven-dwarf model
Seven dwarfs
• Dense linear algebra
• Sparse linear algebra
• Spectral methods (FFT)
• N-body methods (Particle Methods)
• Structured grids
• Unstructured grids
• Monte Carlo
Numerical methods
• Partial differential equation: finite element, finite difference, finite volume; others: spectral method, spectral element method, boundary element, meshless
• Quantum chemistry: Hartree-Fock, density functional theory, post-Hartree-Fock, semi-empirical QC
• Molecular dynamics
• Biosciences/Bioinformatics: sequence analysis, gene and protein expression
Domains/Industries
• CAE/CFD, electromagnetism, weather, geophysics
• Chemistry, materials science, drug development
• Genomics, medicine, health service
Huawei ARM HPC Solution Readiness
Huawei ARM Server Portfolio
• Rack based: TaiShan 2280, 2U 2S rack server
• Storage intensive: TaiShan 5280 storage server, 4U 2S rack server
• High density: X6000 with XR320 nodes, 2U 4 nodes, 2 sockets/node
• Accelerator: NVMe SSD, FPGA, programmable intelligent NIC
Labels: Computing Intensive, Massive Production
Huawei ARM HPC Solution is READY to SELL!
• Solution: Manufacturing (commercial ISVs, open source software); Life Science (2nd-generation and 3rd-generation sequencing); Weather Forecast (weather, ocean and environment modeling)
• Cluster management: Job Scheduler / Cluster Monitoring / Web Portal
• System environment: Operating Systems; MPI (Open MPI / HPC-X); Compiler Toolchain / Math Library
• Hardware infrastructure: Server (Hi1616 SoC), Interconnection, Storage
• Open source components: OpenHPC, xCAT, Nagios
• Customers/Demonstration: University of Oslo (UiO), Shanghai Jiao Tong University
Solutions Ecosystem
Manufacturing
• Commercial ISVs (car manufacturing): Altair, ESI, LSTC, Ansys for PoC; other ISVs
• Open source CFD: ESI OpenFOAM, SU2, TAU, CFL3D
Life Science
• 2nd generation sequencing: GATK package
• 3rd generation sequencing: CANU package
Weather Forecast
• Weather modeling: WRF…
• Ocean modeling: ROMS, FVCOM, NEMO…
• Environment modeling: CMAQ, CAMx, SMOKE…
ARM64 Server SoC Roadmap
Timeline: 2014, 2016, 2018, 2020, 2022
Status legend: Production Now, Production, Under Development, Planning
• Hi1610: 16*A57@16FF, 2.1 GHz, 2*64bit DDR3/4, PCIe 3.0/SAS 3.0/10GE
• Hi1612: 32*A57@16FF, 2.1 GHz, 4*64bit DDR3/4, PCIe 3.0/SAS 3.0/10GE
• Hi1616: 32*A72@16FF+, 2.4 GHz, 4P, 4*64bit DDR3/4, PCIe 3.0/SAS 3.0/10GE
• Hi1620: up to 64 custom ARM cores@7nm, up to 3.0 GHz, 4P, 8*64bit DDR4, CCIX, RoCE v2, PCIe 4.0/100GE
• Hi1630: TBD*NG ARM core with SVE, SMT, up to 3.x GHz, 4P, 8*64bit DDR5, TBD
Hi1620 Specifications Overview
Item | Specification
CPU core | Up to 64 ARMv8.2 cores, 3.0 GHz, 48-bit physical address; 4-issue OoO superscalar design; 64 KB L1 I cache and 64 KB L1 D cache
L2 cache | 512 KB private per core, 24 MB total
L3 cache | 48 MB shared for all (1 MB/core), partitioned
Memory | 8-channel DDR4-2400/2666/2933/3200; 16 ranks/channel, 1DPC and 2DPC configurations; x4/x8 support; ECC, SDDC, DDDC
PCIe | 40 lanes of PCIe Gen4.0, 16x
Integrated I/O | 8 lanes of ETH, combo MACs, supporting 2 x 100GE, 2 x 40GE, 8 x 25GE/10GE, 10 x GE, supporting SR-IOV; RoCEv2/RoCEv1; x4 USB 3.0, x8 SAS 3.0, x2 SATA 3.0
Crypto engine | AES, DES/3DES, MD5, SHA1, SHA2, HMAC, CMAC; up to 100 Gbit/s
Compression | GZIP, LZS, LZ4; up to 40 Gbit/s (compression)/100 Gbit/s (decompression)
RAID | RAID5/6, DIF, XOR, PQ acceleration
CCIX | Cache coherency interface for accelerators, like Xilinx FPGA; world's 1st CCIX solution
Scale-up | Coherent SMP interface for 2P/4P; 3*240 Gbps bandwidth
Power | TDP ~150 W (48C, 2.6 GHz)
[Block diagram: Taishan core clusters (4 cores each, 2.6~3.0 GHz, 64 KB/64 KB L1, 512 KB L2 cache per core) attached to a coherent fabric with 48~64 MB LLC. Other blocks: DDR4 64-bit memory channels; HCCS CC ports at 240 Gbps per port; NIC interface (2*100GE/2*40GE/8*25GE/8*10GE/8*GE); PCIe 4.0; SAS/SATA 3.0; HACs; AMBA local bus with low-speed I/O (NANDC/USB/UARTs/GPIO etc.). Lane labels: up to 16 lanes, up to 40 lanes.]
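The cache totals in the specification can be cross-checked against the per-core figures with a little arithmetic. The sketch below is only an illustration; it assumes the totals are simply cores × per-core size, and the function name is hypothetical.

```python
# Cross-check of the Hi1620 cache totals from the per-core figures above.
L2_PER_CORE_KB = 512   # "512 KB private per core"
L3_PER_CORE_MB = 1     # "1 MB/core"
SCALE_UP_PORTS = 3     # "3*240 Gbps bandwidth"
PORT_GBPS = 240


def cache_totals_mb(cores):
    """Return (total L2 in MB, total L3 in MB) for a given core count."""
    l2_total_mb = cores * L2_PER_CORE_KB / 1024
    l3_total_mb = cores * L3_PER_CORE_MB
    return l2_total_mb, l3_total_mb


if __name__ == "__main__":
    for cores in (48, 64):
        l2, l3 = cache_totals_mb(cores)
        print(f"{cores} cores: L2 total {l2:.0f} MB, L3 total {l3} MB")
    print(f"Scale-up bandwidth: {SCALE_UP_PORTS * PORT_GBPS} Gbps")
```

Read this way, the quoted 24 MB of L2 and 48 MB of L3 correspond to a 48-core part, while a 64-core part would carry 32 MB of L2 and 64 MB of L3, which agrees with the "48~64 MB LLC" label in the block diagram.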
Preliminary Benchmark and Application Performance of Hi1620 (Early Sample)
Preliminary Performance on Hi1620 (Early Sample)
Memory Bandwidth and Latency (Performance Optimization Underway)
lmbench latency, L1/L2/L3 cache, single core (lower is better)
Cache level, pattern | Hi1616 | Hi1620ES | Skylake 6148
L1, stride | 4 | 4 | 3
L1, thrash | 4 | 4 | 3
L2, stride | 18 | 8 | 11
L2, thrash | 21 | 8 | 17
L3, stride | 57 | 37 | 51
L3, thrash | 60 | 40 | 54
lmbench latency, local memory (DDR), stride and thrash (lower is better)
[Figure: bar chart with data labels 122, 122; 81.7, 87.5; 83, 81 (legend: Series1, Series2)]
STREAM
Platform | STREAM bandwidth | Efficiency
2P Hi1620 (48 cores, 2.6 GHz, DDR4-2666) | 251 GB/s | 73.59%
2P Skylake 6148 (20 cores, 2.0 GHz, DDR4-2666) | 192 GB/s | 75.08%
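The STREAM efficiency figures can be approximately reproduced as measured bandwidth divided by theoretical peak DRAM bandwidth. The sketch below assumes that every memory channel is populated, that each DDR4 channel is 64 bits (8 bytes) wide, and that the Skylake platform has 6 channels per socket; these are assumptions rather than statements from the slide, and the function names are hypothetical.

```python
# Sketch: STREAM efficiency = measured bandwidth / theoretical peak bandwidth.
def peak_bandwidth_gbs(sockets, channels_per_socket, mega_transfers_per_s,
                       bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s (assumes 64-bit channels)."""
    return sockets * channels_per_socket * mega_transfers_per_s * bytes_per_transfer / 1000.0


def stream_efficiency(measured_gbs, peak_gbs):
    """Fraction of the theoretical peak reached by the measured STREAM result."""
    return measured_gbs / peak_gbs


if __name__ == "__main__":
    # 2P Hi1620: 8 channels per socket (from the specification), DDR4-2666.
    hi1620_peak = peak_bandwidth_gbs(2, 8, 2666)
    # 2P Skylake 6148: 6 channels per socket (assumption), DDR4-2666.
    skylake_peak = peak_bandwidth_gbs(2, 6, 2666)
    print(f"Hi1620:  peak {hi1620_peak:.1f} GB/s, "
          f"efficiency {stream_efficiency(251, hi1620_peak):.2%}")
    print(f"Skylake: peak {skylake_peak:.1f} GB/s, "
          f"efficiency {stream_efficiency(192, skylake_peak):.2%}")
```

The results land within about a tenth of a percentage point of the quoted 73.59% and 75.08%; the small difference is consistent with the measured bandwidths being rounded to whole GB/s on the slide.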
Preliminary Performance on Hi1620 (Early Sample)
HPCG and HPL (Performance Optimization Underway)
Benchmark | Hi1616 | Hi1620
HPCG (GFlops) | 14.31 | 31.3
HPL / LINPACK (GFlops) | 484.3 | 914.58
• HPCG:
  • Code optimization is essential for HPCG performance on Arm
  • Hi1620 delivers about 2.2x the HPCG performance of Hi1616 (31.3 vs 14.31 GFlops)
• HPL:
  • Hi1620 shows almost double the performance of Hi1616 (914.58 vs 484.3 GFlops)
  • Reached 91.6% efficiency
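The 91.6% HPL efficiency can be reproduced as measured GFlops divided by theoretical peak. The sketch below assumes the test system is the 2P, 48-core-per-socket, 2.6 GHz Hi1620 configuration described on the other benchmark slides, and that each core retires 4 double-precision FLOPs per cycle. That last value is an assumption chosen because it matches the quoted efficiency; the slide does not state it. The function names are hypothetical.

```python
# Sketch: HPL efficiency = measured GFlops / theoretical peak GFlops.
def peak_gflops(sockets, cores_per_socket, ghz, flops_per_cycle):
    """Theoretical double-precision peak in GFlops."""
    return sockets * cores_per_socket * ghz * flops_per_cycle


def hpl_efficiency(measured_gflops, peak):
    """Fraction of the theoretical peak reached by the HPL result."""
    return measured_gflops / peak


if __name__ == "__main__":
    # Assumed configuration: 2 sockets x 48 cores, 2.6 GHz, 4 DP FLOPs/cycle/core.
    peak = peak_gflops(2, 48, 2.6, 4)
    eff = hpl_efficiency(914.58, peak)
    print(f"peak {peak:.1f} GFlops, HPL efficiency {eff:.1%}")
```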
Preliminary Performance on Hi1620 (Early Sample)
OpenFOAM (Performance Optimization Underway)
OpenFOAM performance (motorbike), execution time in seconds, lower is better
Platform | Execution time (s)
Hi1616 | 111.88
Hi1620 | 61.74
EPYC 7551 | 63.12
Gold 6148 | 72.03
OpenFOAM motorbike single node test
Platform | Sockets | Total cores | Frequency (GHz) | Memory | OS | Compiler | MPI | OpenFOAM version
Hi1616 | 2 | 64 | 2.4 | 8*32 GB 2400 MHz | EulerOS 2.0 SP2, Linux kernel 4.1.43 | GNU 7.2 | OpenMPI 3.0 | v1806
Hi1620 | 2 | 96 | 2.6 | 16*16 GB 2666 MHz | EulerOS 2.0 SP3, Linux kernel 4.14.10 | GNU 7.2 | OpenMPI 3.0 | v1806
EPYC 7551 | 2 | 64 | 2.0 (2.55 turbo) | 16*16 GB 2666 MHz | RHEL 7.5 beta | GNU 7.3 | OpenMPI 3.1 | v1806
Gold 6148 | 2 | 40 | 2.4 (3-3.1 turbo) | 12*16 GB 2666 MHz | CentOS 7.5 | Intel 2017 update 4 | Intel MPI 2018 update 1 | v1806
Mesh size: 1.02M
Solver: simpleFoam
Solver run for 30 iterations
Preliminary Performance on Hi1620 (Early Sample)
WRF (Performance Optimization Underway)
WRF 3.8.1 (CONUS 2.5 km), single step performance, step time in seconds, lower is better
Item | Hi1620 ES 2p (gnu7 + ompi3.0) | Skylake Gold 6148 2p (intel 2018)
OS | EulerOS | CentOS 7.4
MPI | Open MPI 3.0.1 | Intel MPI 18.0.1
Compiler | gcc 7.2 | icc 18.0.1
Precise mode [s] | 4.18 | 5.4
Fast mode [s] | 3.93 | 4.32
• With precise mode, Hi1620 is 23% better than Skylake Gold 6148 (4.18 s vs 5.4 s)
• With fast mode, Hi1620 is 10% better than Skylake Gold 6148 (3.93 s vs 4.32 s)
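A phrase like "X% better" can mean either a reduction in run time or a speedup ratio, and the two give different numbers. The sketch below computes both for the WRF step times. It assumes the chart's assignment of values to modes (precise mode: 4.18 s vs 5.4 s; fast mode: 3.93 s vs 4.32 s), and the function names are hypothetical.

```python
# Sketch: two common ways to express "X% better" from lower-is-better timings.
def time_reduction_pct(candidate_s, baseline_s):
    """Percentage of the baseline run time saved by the candidate."""
    return (baseline_s - candidate_s) / baseline_s * 100.0


def speedup(candidate_s, baseline_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


if __name__ == "__main__":
    # (Hi1620 step time, Gold 6148 step time), assumed chart assignment.
    wrf = {"precise mode": (4.18, 5.4), "fast mode": (3.93, 4.32)}
    for mode, (hi1620_s, gold_s) in wrf.items():
        print(f"{mode}: {time_reduction_pct(hi1620_s, gold_s):.1f}% less time, "
              f"{speedup(hi1620_s, gold_s):.2f}x speedup")
```

Under this reading, the 23% figure for precise mode matches the time reduction (about 22.6%), while the 10% figure for fast mode is closer to the speedup ratio (about 1.10x) than to the time reduction (about 9%).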
Preliminary Performance on Hi1620 (Early Sample)
SU2 (Performance Optimization Underway)
SU2 (NASA Turbulent Flow Case), step time in seconds, lower is better
Platform | Step time (s)
Hi1616 | 79.76
Hi1620 ES | 41.5
Gold 6148 (gcc + ompi3.0) | 75.101
Gold 6148 (intelmpi2018) | 66.692
Item | Hi1620 2p | Skylake Gold 6148 2p
OS | EulerOS | CentOS 7.4
MPI | Open MPI 3.0.0 | Intel MPI 18.0.1
Compiler | gcc 7.2 | icc 18.0.1
Step time [s] | 41.5 | 66.692
• Hi1620 is 38% better than Gold 6148
Preliminary Performance on Hi1620 (Early Sample)
ROMS (Performance Optimization Underway)
[Figure: ROMS Performance (benchmark3), time in seconds, lower is better. x86 Gold 6148 2p (icc/ifort): 112.75; Hi1620 2p (ahc18.4): 85.148]
Overview
ROMS is a free-surface, terrain-following, primitive equations ocean model widely used by the scientific community for a diverse range of applications. ROMS includes accurate and efficient physical and numerical algorithms and several coupled models for biogeochemical, bio-optical, sediment, and sea ice applications.
Characteristics
• Memory bound
• Low I/O ratio
• Mainly MPI collective communication
Performance
Hi1620 2p performance is about 30% higher than Skylake Gold 6148 2p (85.148 s vs 112.75 s)
Item | Skylake Gold 6148 2p | Hi1620 2p
OS | CentOS 7.4 | EulerOS
MPI | Intel MPI 18.0.1 | OpenMPI 3.0.0
Compiler | icc 18.0.1 | Arm Allinea Studio 18.4
Time (s) | 112.75 | 85.148
Preliminary Performance on Hi1620 (Early Sample)
CAMx (Performance Optimization Underway)
Overview
CAMx is a state-of-the-science photochemical grid model that comprises a "one-atmosphere" treatment of tropospheric air pollution over spatial scales ranging from neighborhoods to continents. It is an open-source system that is computationally efficient, flexible, and available at zero cost.
Characteristics
• Memory bound
• Low I/O ratio
• Mainly small MPI messages (4 bytes)
Performance
Hi1620 2p performance is close to Skylake Gold 6148 2p (198 s vs 201 s)
[Figure: CAMx Performance (CAMx 2-Day Test Case), time in seconds, lower is better. x86 Gold 6148 2p (icc/ifort): 201; Hi1620 2p (ahc18.4): 198]
Item | x86 Gold 6148 2p | Hi1620 2p
OS | CentOS 7.4 | EulerOS
MPI | Intel MPI 18.0.1 | OpenMPI 3.0.0
Compiler | icc 18.0.1 | Arm Allinea Studio 18.4
Time (s) | 201 | 198
Preliminary Performance on Hi1620 (Early Sample)
CANU (Performance Optimization Underway)
Item | x86 Gold 6148 2p | Hi1620 2p
OS | CentOS 7 | EulerOS
Compiler | gcc 7 | gcc 7
Time (min) | 672 | 570
[Figure: CANU performance (5.6G gene data set), time in minutes, lower is better. x86 Gold 6148 2p (gcc 7.0): 672; Hi1620 2p (gcc 7.0): 570]
Overview
CANU is a fork of the Celera Assembler designed for high-noise single-molecule sequencing (such as the PacBio RSII or Oxford Nanopore MinION).
Characteristics
• Memory bound
• Mainly integer calculation
Performance
Hi1620 2p finishes in about 15% less time than Skylake Gold 6148 2p (570 min vs 672 min)
Huawei ARM HPC Ecosystem Development
ARM HPC Open Source Ecosystem
OpenHPC and ARM HPC Users Group
ARM Industry Ecosystem Development
China GCC, European Open ARM Alliance
European ARM Open Alliance
Consolidate the industry ecosystem by involving end customers, ISVs, IHVs and OEMs in the alliance
Huawei has been working with 3 major partners since 2016
• Compute intensive (Juelich)
• Data intensive (CERN)
• Cloud service (TSFR)
Copyright©2018 Huawei Technologies Co., Ltd. All Rights Reserved.
The information in this document may contain predictive statements including, without
limitation, statements regarding the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that could cause actual
results and developments to differ materially from those expressed or implied in the
predictive statements. Therefore, such information is provided for reference purpose
only and constitutes neither an offer nor an acceptance. Huawei may change the
information at any time without notice.
Thank You.