Download - Transforming HPC with Huawei ARM HPC Solution · 2019-06-24 · 4 Joint Innovation and Certification Test with Partners Huawei & ESI signed a MoU to innovate HPC for Industrial Manufacturing


Transforming HPC with

Huawei ARM HPC Solution

Huawei HPC Ecosystem and Collaboration

3

Win-Win, Open HPC Ecosystem

Hardware/Infrastructure

Industry Applications

Management Software System Environment

Community/Organization

Customer Benefits

Solution Integration

4

Joint Innovation and Certification Test with Partners

• Huawei & ESI signed an MoU to innovate HPC for industrial manufacturing acceleration
• Huawei & Ansys jointly released a Fluent benchmarking white paper based on the Huawei FusionServer
• Huawei & Altair signed an MoU to jointly pursue HPC opportunities and build industrial simulation cloud solutions
• Huawei & Dassault signed an MoU to pursue sustainable innovation opportunities with the 3DEXPERIENCE Platform on Cloud

5

Embracing the Open Source HPC Community

• Reduce verification and integration workloads
• Ensure version iteration and service support
• Lower thresholds for HPC deployment and new technologies

Huawei HPC Platform
Open Source HPC Software Stack

• Participated in the OpenHPC community
• Participates in HPC-AI Advisory Council application performance benchmarking studies
• Linaro open source collaboration

6

Rich Experience in HPC Industry Project Construction

• Manufacturing/Carmakers
• Colleges and Universities
• Supercomputing Centers & Research Organizations
• Oil & Gas, Media, Meteorology

7

Innovation at Huawei

8

Innovation Makes Computing Simple
Powering a large portfolio of solution platforms

Solution platforms:
• SAP HANA Appliance
• HPC
• Big Data
• Hyper-Converged Infrastructure
• Azure Stack Solution
• Edge Computing for Video Analytics Solution (Accelerated)

Accelerator parts (ACC): NVMe SSD, FPGA, Intelligent NIC

Server families (traditional and modular):
• FusionServer (rack-optimized): standard 2S-8S x86, improved designs for medium and larger businesses
• FusionServer X (high-density server): optimized for large-scale application deployments
• FusionServer E (blade system): for converged infrastructure, providing maximized efficiency
• FusionServer G (GPU server): for HPC, video, AI/DL and other workloads that need GPU computing environments

Unique Innovations

General Purpose Dedicated Applications

FDM

DEMT

Innovative Chips
• CPU: Hi1616
• NC: Hi1503
• NIC: Hi1822
• SSD: Hi1812
• BMC: Hi1710

9

Cultivating Competitive Differentiation at the Chip Level

[Diagram: server chipset made up of computing chips (CPUs linked by node controllers), I/O chips (NIC controller, SSD controller) and management chips (BMC)]

• Hi1620: ARM server SoC, supports up to 4 sockets
• Hi1503: CPU interconnect chip, supports scaling up to 32 sockets
• Hi1812: SSD storage controller, for read/write I/O acceleration
• Hi1822: network controller chip, for high-speed and flexible data center interconnect
• Hi1710: BMC management chip, with enhanced RAS features

10

How Huawei Positions ARM HPC

11

ARM HPC Driven by ExaScale Competition
Focus on exascale technology and its commercial HPC market opportunities

• Sunway System – NRCPC, Alpha ISA
• Tianhe System – NUDT, ARM ISA
• EPI project with ARM ISA CPU for HPC and RISC-V CPU for future high-throughput computing
• Post-K system with ARM ISA CPU known as A64FX
• A21/Intel, Frontier/?, El Capitan/?
• ARM ISA?

12

Huawei Leverages ARM Technology to Fulfill Our Cloud-Network-Terminal Strategy
ARM technology is jointly invested in by the Terminal, Cloud, Wireless and IT BUs

IoT Platform
• Rich commercial experience for both OEM and aftermarket
• Full maturity for connected car platform development
• Solid roadmap for future evolution, e.g. AI, Blockchain, V2X server, etc.

Network
• LTE-V: leading PC5 LTE-V standards, 153+ patents
• 5G: industry leader, main founder of 5GAA, dozens of trials in China/Korea/Japan/Europe

Chipsets
• Module, T-Box and HU delivered in BMW, Daimler, …
• Open ecosystem

[Diagram: cloud-network-terminal architecture for connected vehicles. Cloud: IoT Platform (connectivity management, service orchestration, big data, unified open interface APIs) serving telematics, connected vehicle and connected life, alongside third-party cloud. Network: 2G/3G/4G, LTE-V, 5G, C-V2X and V2V/V2I links, with V2X servers, IoT agents and edge computing. Terminal: RSUs, smart antenna, AR.]

13

How Huawei Positions ARM HPC
Applying ARM to the Roofline model: let’s focus on real application performance

[Figure: application classes ordered by arithmetic intensity, from memory bound to compute bound; CFD, FEM and WRF are marked "You are here" at the memory-bound end]
• 0.1-1.0 FLOPS per byte: SparseMV, BLAS1,2; Stencils (PDEs); Lattice Boltzmann Methods
• 1-2 FLOPS per byte: FFTs, Spectral Methods
• > 2 FLOPS per byte: Dense Linear Algebra (BLAS3); Particle Methods

Arithmetic intensity is the ratio of arithmetic instructions to off-chip memory operands. Lower arithmetic intensity means an application is memory bound; higher arithmetic intensity means it is CPU bound.

ARM HPC targets memory-bound applications, that is, applications with lower arithmetic intensity (scalar floating point, integer), such as CAE/CFD, weather, life sciences, and oil & gas.
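The roofline idea on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Huawei's methodology: it takes attainable performance to be min(compute ceiling, arithmetic intensity × bandwidth ceiling), and it borrows the 2P Hi1620 early-sample figures reported elsewhere in this deck (914.58 GFLOPS for HPL, 251 GB/s for STREAM) as stand-ins for the two ceilings. The function names are hypothetical.

```python
# Roofline sketch. The two ceilings are assumptions: measured HPL and STREAM
# results for the 2P Hi1620 early sample are used in place of true hardware peaks.
PEAK_GFLOPS = 914.58    # compute ceiling (HPL result from this deck)
BANDWIDTH_GBS = 251.0   # memory ceiling (STREAM result from this deck)


def attainable_gflops(intensity, peak_gflops=PEAK_GFLOPS, bandwidth_gbs=BANDWIDTH_GBS):
    """Upper bound on GFLOPS for a kernel with the given FLOPS-per-byte intensity."""
    return min(peak_gflops, intensity * bandwidth_gbs)


def ridge_point(peak_gflops=PEAK_GFLOPS, bandwidth_gbs=BANDWIDTH_GBS):
    """Arithmetic intensity at which the memory and compute ceilings meet."""
    return peak_gflops / bandwidth_gbs


def is_memory_bound(intensity, peak_gflops=PEAK_GFLOPS, bandwidth_gbs=BANDWIDTH_GBS):
    """True when the bandwidth ceiling, not the compute ceiling, limits the kernel."""
    return intensity < ridge_point(peak_gflops, bandwidth_gbs)


if __name__ == "__main__":
    # Intensities taken from the application table in this deck.
    for name, ai in [("OpenFOAM", 0.13), ("WRF (upper range)", 1.5), ("KKRNano", 4.0)]:
        bound = "memory" if is_memory_bound(ai) else "compute"
        print(f"{name}: at most {attainable_gflops(ai):.1f} GFLOPS, {bound} bound")
```

With these ceilings, any application below the ridge point gains more from extra memory bandwidth than from extra floating-point throughput, which is the argument the slide makes for targeting memory-bound workloads.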

14

The Majority of Real Applications Have Low Arithmetic Intensity
ARM server can cover 50%+ of HPC workloads

Application | Field | Method | Arithmetic intensity (FLOPS/byte)
BCM | CFD | Navier-Stokes | 0.14
OpenFOAM | CFD | Finite Volumes – Finite Element | 0.13
MHD – FDM | Magneto Hydro Dynamics | Finite Difference Method | 0.33
MHD - Spectral | Magneto Hydro Dynamics | Pseudo Spectral Method | 0.45
QSFDM | Seismology | Spherical 2.5D FDM | 0.46
SEISM3D | Seismology | Finite Difference Method | 0.47
Barotropic | Ocean Circulation Model | Shallow Water Model | 0.51
Turbine | CFD | DNS | 0.56
KKRNano | Nanotechnology | Korringa-Kohn-Rostoker | 4 (DP)
BQCD | High-Energy-Physics | Hybrid Monte-Carlo | 0.45 (DP), 0.9 (SP)
B-CALM | Electro-Magnetic Sim. | Finite Difference time-domain | 0.3 (SP)
WRF | Weather Forecast model | Stencil code | 0.5-1.5
HEPSPEC | SPEC2006 selection for HEP (CERN) | NAMD, DEALII, SOPLEX, POVRAY, OMNETPP, ASTAR, XALANCBMK | ≥ 0.5
Gromacs | Molecular dynamics package | Bennett Acceptance Ratios | < 1

15

CAE/CFD, Weather, Life Sciences and Oil & Gas as Huawei’s Target Industries
Applying ARM to the seven-dwarf model

Seven dwarfs:
• Dense linear algebra
• Sparse linear algebra
• Spectral methods (FFT)
• N-body methods (Particle Methods)
• Structured grids
• Unstructured grids
• Monte Carlo

Numerical methods and the domains/industries they serve:
• Partial differential equations (finite element, finite difference, finite volume; others: spectral method, spectral element method, boundary element, meshless): CAE/CFD, electromagnetism, weather, geophysics
• Quantum chemistry (molecular dynamics, Hartree-Fock, density functional theory, post-Hartree-Fock, semi-empirical QC): chemistry, materials science, drug development
• Biosciences/bioinformatics (sequence analysis, gene and protein expression, molecular dynamics): genomics, medicine, health service

16

Huawei ARM HPC Solution Readiness

17

Huawei ARM Server Portfolio

Rack based (computing intensive):
• TaiShan 2280: 2U 2S rack server

Storage intensive (storage server):
• TaiShan 5280: 4U 2S rack server

High density:
• X6000 with XR320 nodes: 2U 4 nodes, 2 sockets/node

Accelerators: NVMe SSD, FPGA, programmable intelligent NIC

Status: Massive Production

18

Huawei ARM HPC Solution is READY to SELL!

Solution stack, from top to bottom:
• Solution: Manufacturing (commercial ISVs, open source software), Life Science (2nd sequencing, 3rd sequencing), Weather Forecast (weather modeling, ocean modeling, environment modeling)
• Cluster management: job scheduler / cluster monitoring / web portal (OpenHPC, xCAT, Nagios)
• System environment: operating systems; MPI (Open MPI / HPC-X); compiler toolchain / math library
• Hardware infrastructure: server (Hi1616 SoC), interconnection, storage

Customers/Demonstration: University of Oslo (UiO), Shanghai Jiao Tong University

19

Solutions Ecosystem

Manufacturing
• Commercial ISV: car manufacturing; Altair, ESI, LSTC, Ansys for PoC; other ISVs
• Open source software: open source CFD; ESI OpenFOAM; SU2, TAU, CFL3D

Life Science
• 2nd generation sequencing: GATK package
• 3rd generation sequencing: CANU package

Weather Forecast
• Weather modeling: WRF…
• Ocean modeling: ROMS, FVCOM, NEMO…
• Environment modeling: CMAQ, CAMx, SMOKE…

20

ARM64 Server SoC Roadmap

Timeline: 2014, 2016, 2018, 2020, 2022. Status labels on the slide: Production Now, Production, Under Development, Planning.

• Hi1610: 16*A57@16FF, 2.1 GHz, 2*64bit DDR3/4
• Hi1612: 32*A57@16FF, 2.1 GHz, 4*64bit DDR3/4
• Hi1616: 32*A72@16FF+, 2.4 GHz, 4P, 4*64bit DDR3/4, PCIe 3.0/SAS 3.0/10GE
• Hi1620: up to 64 custom ARM cores@7nm, up to 3.0 GHz, 4P, 8*64bit DDR4, CCIX, RoCE v2, PCIe 4.0/100GE
• Hi1630 (planning, 2022): TBD*NG ARM core with SVE, SMT, up to 3.x GHz, 4P, 8*64bit DDR5, TBD

21

Hi1620 Specifications Overview

• CPU core: up to 64 ARMv8.2 cores, 3.0 GHz, 48-bit physical address; 4-issue OoO superscalar design; 64 KB L1 I-cache and 64 KB L1 D-cache
• L2 cache: 512 KB private per core, 24 MB total
• L3 cache: 48 MB shared for all (1 MB/core), partitioned
• Memory: 8-channel DDR4-2400/2666/2933/3200; 16 ranks/channel, 1DPC and 2DPC configurations; x4/x8 support; ECC, SDDC, DDDC
• PCIe: 40 lanes of PCIe Gen4.0 16x
• Integrated I/O: 8 lanes of ETH, combo MACs, supporting 2 x 100GE, 2 x 40GE, 8 x 25GE/10GE, 10 x GE, with SR-IOV; RoCEv2/RoCEv1; x4 USB 3.0; x8 SAS 3.0; x2 SATA 3.0
• Crypto engine: AES, DES/3DES, MD5, SHA1, SHA2, HMAC, CMAC; up to 100 Gbit/s
• Compression: GZIP, LZS, LZ4; up to 40 Gbit/s (compression)/100 Gbit/s (decompression)
• RAID: RAID5/6, DIF, XOR, PQ acceleration
• CCIX: cache coherency interface for accelerators, such as Xilinx FPGAs; world's 1st CCIX solution
• Scale-up: coherent SMP interface for 2P/4P; 3*240 Gbps bandwidth
• Power: TDP ~150 W (48C, 2.6 GHz)

[Block diagram: Hi1620 SoC. TaiShan*4 core clusters (2.6~3.0 GHz, 64 KB/64 KB L1, 512 KB L2 cache per core) and a 48~64 MB LLC sit on a coherent fabric, together with DDR4 64-bit memory channels, HCCS CC ports (240 Gbps per port), a NIC interface (2*100GE/2*40GE/8*25GE/8*10GE/8*GE), PCIe 4.0 and SAS/SATA 3.0 ports (up to 40 lanes and up to 16 lanes), HACs, and low-speed I/O over AMBA (local bus, NANDC, USB, UARTs, GPIO, etc.).]





22

Preliminary Benchmark and Application

Performance of Hi1620 (Early Sample)

23

Preliminary Performance on Hi1620 (Early Sample)
Memory Bandwidth and Latency (Performance Optimization Underway)

lmbench latency (L1/L2/L3), single core, lower is better:

Platform | L1 stride | L1 thrash | L2 stride | L2 thrash | L3 stride | L3 thrash
Hi1616 1 core | 4 | 4 | 18 | 21 | 57 | 60
Hi1620ES 1 core | 4 | 4 | 8 | 8 | 37 | 40
Skylake 6148 1 core | 3 | 3 | 11 | 17 | 51 | 54

[Chart: lmbench latency (DDR), local memory, stride and thrash, lower is better. Plotted values: 122, 122; 81.7, 87.5; 83, 81. The legend survives only as "Series1" and "Series2".]

STREAM:

Platform | STREAM bandwidth | Efficiency
2P Hi1620 (48 cores, 2.6 GHz, DDR4-2666) | 251 GB/s | 73.59%
2P Skylake 6148 (20 cores, 2.0 GHz, DDR4-2666) | 192 GB/s | 75.08%
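The STREAM efficiency figures can be roughly reproduced from the memory configuration. The sketch below is an illustration, not the method used for the slide: it assumes efficiency means measured bandwidth divided by theoretical peak DRAM bandwidth, that each DDR4 channel moves 8 bytes per transfer, that all 8 channels per Hi1620 socket are populated (per the specification slide), and that the Skylake 6148 has 6 channels per socket. Function names are hypothetical.

```python
def peak_bandwidth_gbs(sockets, channels_per_socket, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return sockets * channels_per_socket * mega_transfers_per_s * bytes_per_transfer / 1000.0


def stream_efficiency(measured_gbs, peak_gbs):
    """Fraction of the theoretical peak that STREAM actually sustains."""
    return measured_gbs / peak_gbs


# Assumed configurations: 2 sockets, DDR4-2666, all channels populated.
HI1620_PEAK = peak_bandwidth_gbs(sockets=2, channels_per_socket=8, mega_transfers_per_s=2666)
SKYLAKE_PEAK = peak_bandwidth_gbs(sockets=2, channels_per_socket=6, mega_transfers_per_s=2666)

if __name__ == "__main__":
    print(f"Hi1620:  peak {HI1620_PEAK:.1f} GB/s, efficiency {stream_efficiency(251, HI1620_PEAK):.1%}")
    print(f"Skylake: peak {SKYLAKE_PEAK:.1f} GB/s, efficiency {stream_efficiency(192, SKYLAKE_PEAK):.1%}")
```

Under these assumptions the computed efficiencies land within about a tenth of a percentage point of the slide's 73.59% and 75.08%; the small gap is presumably due to rounding of the reported bandwidths.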

24

Preliminary Performance on Hi1620 (Early Sample)
HPCG and HPL (Performance Optimization Underway)

Benchmark | Hi1616 (GFLOPS) | Hi1620 (GFLOPS)
HPCG | 14.31 | 31.3
HPL / LINPACK | 484.3 | 914.58

• HPCG:

• Code optimization is essential for HPCG performance on Arm

• Hi1620 has a major performance boost compared to Hi1616

• HPL:

• Hi1620 shows almost double performance compared to Hi1616

• Reached 91.6% efficiency
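The 91.6% HPL efficiency can be checked with a short calculation. This is a sketch under assumptions the slide does not state: the Hi1620 run is taken to be the 2P, 48-core, 2.6 GHz configuration quoted for STREAM, and each core is assumed to retire 4 double-precision FLOPs per cycle. The latter value is chosen because it reproduces the slide's efficiency, not because the deck specifies it. Function names are hypothetical.

```python
def peak_gflops(sockets, cores_per_socket, ghz, flops_per_cycle):
    """Theoretical peak double-precision GFLOPS of a node."""
    return sockets * cores_per_socket * ghz * flops_per_cycle


def hpl_efficiency(measured_gflops, peak):
    """Fraction of theoretical peak achieved by HPL."""
    return measured_gflops / peak


# Assumptions: 2P x 48 cores at 2.6 GHz, 4 DP FLOPs per cycle per core.
HI1620_PEAK = peak_gflops(sockets=2, cores_per_socket=48, ghz=2.6, flops_per_cycle=4)

if __name__ == "__main__":
    eff = hpl_efficiency(914.58, HI1620_PEAK)
    print(f"Assumed peak: {HI1620_PEAK:.1f} GFLOPS, HPL efficiency: {eff:.1%}")
```

If the real configuration or per-core throughput differs, the same two functions apply with different inputs.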

25

Preliminary Performance on Hi1620 (Early Sample)
OpenFOAM (Performance Optimization Underway)

OpenFOAM performance (motorbike), single-node test, lower is better:

Platform | Execution time (seconds)
Hi1616 | 111.88
Hi1620 | 61.74
EPYC 7551 | 63.12
Gold 6148 | 72.03

Test configuration:

Platform | Sockets | Total cores | Frequency (GHz) | Memory | OS | Compiler | MPI | OpenFOAM version
Hi1616 | 2 | 64 | 2.4 | 8*32 GB 2400 MHz | EulerOS 2.0 SP2, Linux kernel 4.1.43 | GNU 7.2 | OpenMPI 3.0 | v1806
Hi1620 | 2 | 96 | 2.6 | 16*16 GB 2666 MHz | EulerOS 2.0 SP3, Linux kernel 4.14.10 | GNU 7.2 | OpenMPI 3.0 | v1806
EPYC 7551 | 2 | 64 | 2.0 (2.55 turbo) | 16*16 GB 2666 MHz | RHEL 7.5 beta | GNU 7.3 | OpenMPI 3.1 | v1806
Gold 6148 | 2 | 40 | 2.4 (3-3.1 turbo) | 12*16 GB 2666 MHz | CentOS 7.5 | Intel 2017 update 4 | Intel MPI 2018 update 1 | v1806

Mesh size: 1.02M
Solver: simpleFoam
Solver run for 30 iterations

26

Preliminary Performance on Hi1620 (Early Sample)
WRF (Performance Optimization Underway)

WRF 3.8.1 (CONUS 2.5 km), single step performance, step time in seconds, lower is better:

Item | Hi1620 ES 2p (gnu7 + ompi3.0) | Skylake Gold 6148 2p (intel 2018)
OS | EulerOS | CentOS 7.4
MPI | Open MPI 3.0.1 | Intel MPI 18.0.1
Compiler | gcc 7.2 | icc 18.0.1
Precise mode [s] | 4.18 | 5.4
Fast mode [s] | 3.93 | 4.32

• With precise mode, Hi1620 is 23% better than Skylake Gold 6148 (4.18 s vs 5.4 s)
• With fast mode, Hi1620 is 10% better than Skylake Gold 6148 (3.93 s vs 4.32 s)

27

Preliminary Performance on Hi1620 (Early Sample)
SU2 (Performance Optimization Underway)

SU2 (NASA Turbulent Flow Case), lower is better:

Platform | Step time [s]
Hi1616 | 79.76
Hi1620 ES | 41.5
Gold 6148, gcc + ompi3.0 | 75.101
Gold 6148, intelmpi2018 | 66.692

Item | Hi1620 2p | Skylake Gold 6148 2p
OS | EulerOS | CentOS 7.4
MPI | Open MPI 3.0.0 | Intel MPI 18.0.1
Compiler | gcc 7.2 | icc 18.0.1
Step time [s] | 41.5 | 66.692

• Hi1620 step time is 38% lower than that of Gold 6148 (41.5 s vs 66.692 s)

28

Preliminary Performance on Hi1620 (Early Sample)
ROMS (Performance Optimization Underway)

Overview
ROMS is a free-surface, terrain-following, primitive equations ocean model widely used by the scientific community for a diverse range of applications. ROMS includes accurate and efficient physical and numerical algorithms and several coupled models for biogeochemical, bio-optical, sediment, and sea ice applications.

Characteristic
• Memory bound
• Low I/O ratio
• Mainly MPI collective communication

Performance
Hi1620 2p performance is 30% higher than Skylake Gold 6148 2p

ROMS performance (benchmark3), lower is better:

Item | Skylake Gold 6148 2p | Hi1620 2p
OS | CentOS 7.4 | EulerOS
MPI | Intel MPI 18.0.1 | OpenMPI 3.0.0
Compiler | icc 18.0.1 (icc/ifort) | Arm Allinea Studio 18.4 (ahc18.4)
Time (s) | 112.75 | 85.148

29

Preliminary Performance on Hi1620 (Early Sample)
CAMx (Performance Optimization Underway)

Overview
CAMx is a state-of-the-science photochemical grid model that comprises a “one-atmosphere” treatment of tropospheric air pollution over spatial scales ranging from neighborhoods to continents. It is an open-source system that is computationally efficient, flexible, and available at zero cost.

Characteristic
• Memory bound
• Low I/O ratio
• Mainly small MPI packets (4 bytes)

Performance
Hi1620 2p performance is close to Skylake Gold 6148 2p

CAMx performance (CAMx 2-Day Test Case), lower is better:

Item | x86-Gold 6148 2p | Hi1620 2p
OS | CentOS 7.4 | EulerOS
MPI | Intel MPI 18.0.1 | OpenMPI 3.0.0
Compiler | icc 18.0.1 (icc/ifort) | Arm Allinea Studio 18.4 (ahc18.4)
Time (s) | 201 | 198

30

Preliminary Performance on Hi1620 (Early Sample)
CANU (Performance Optimization Underway)

Overview
CANU is a fork of the Celera Assembler designed for high-noise single-molecule sequencing (such as the PacBio RSII or Oxford Nanopore MinION).

Characteristic
• Memory bound
• Mainly integer calculation

Performance
Hi1620 2p performance is 15% higher than Skylake Gold 6148 2p

CANU performance (5.6G gene data set), lower is better:

Item | x86-Gold 6148 2p | Hi1620 2p
OS | CentOS 7 | EulerOS
Compiler | gcc 7 | gcc 7
Time (min) | 672 | 570
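The "X% better" and "X% higher" figures on the application slides can be read in two ways, and the deck does not say which one is meant. The sketch below computes both from the reported run times: relative time reduction, (t_ref - t_new) / t_ref, and speedup, t_ref / t_new. The function names are hypothetical; the timings are the Gold 6148 and Hi1620 values from the SU2, ROMS and CANU slides.

```python
def time_reduction(t_ref, t_new):
    """Fraction of the reference run time that is saved."""
    return (t_ref - t_new) / t_ref


def speedup(t_ref, t_new):
    """How many times faster the new run is than the reference run."""
    return t_ref / t_new


# (benchmark, Gold 6148 2p time, Hi1620 2p time) as reported in this deck.
RESULTS = [
    ("SU2 step time [s]", 66.692, 41.5),
    ("ROMS time [s]", 112.75, 85.148),
    ("CANU time [min]", 672.0, 570.0),
]

if __name__ == "__main__":
    for name, t_x86, t_arm in RESULTS:
        print(f"{name}: {time_reduction(t_x86, t_arm):.1%} less time, "
              f"{speedup(t_x86, t_arm):.2f}x speedup")
```

Stating which of the two metrics is quoted makes results from different slides directly comparable.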

31

Huawei ARM HPC Ecosystem Development

33

ARM HPC Open Source Ecosystem
OpenHPC and ARM HPC Users Group

34

ARM Industry Ecosystem Development
China GCC, European Open ARM Alliance

European ARM Open Alliance
Consolidate the industry ecosystem by involving end customers, ISVs, IHVs and OEMs in the alliance

Huawei has been working with 3 major partners since 2016:
• Compute intensive (Juelich)
• Data intensive (CERN)
• Cloud service (TSFR)

Copyright©2018 Huawei Technologies Co., Ltd. All Rights Reserved.

The information in this document may contain predictive statements including, without

limitation, statements regarding the future financial and operating results, future product

portfolio, new technology, etc. There are a number of factors that could cause actual

results and developments to differ materially from those expressed or implied in the

predictive statements. Therefore, such information is provided for reference purpose

only and constitutes neither an offer nor an acceptance. Huawei may change the

information at any time without notice.

Thank You.
