TRANSCRIPT
Performance Optimization on High Performance Computers
Guangming Tan
2018.5.29
Institute of Computing Technology, Chinese Academy of Sciences
High Parallelism = High Speed

High Speed = High Performance?

[Figure: raw peak speed (HPL) vs. measured HPCG performance of top supercomputers; the "X" column is the gap between the two]

  System (year)        Raw peak (PFLOPS)   HPCG (PFLOPS)   Peak / HPCG
  K-Computer (2011)    11.28               0.60            18X
  Titan (2012)         27.11               0.32            84X
  Tianhe-2A (2013)     54.90               0.58            94X
  TaihuLight (2016)    125.43              0.48            260X

Real-world applications DO NOT get the performance gain?
A Retrospective to HPC Optimization

• 1970-80: Vector machines -- scientific computing (special applications)
• 1980-95: MPP, CC-NUMA -- scientific and engineering computing (narrow application domains)
• 1995-Now: Cluster + xPU -- scientific computing, engineering computing, Internet, big data, biological, cloud computing (diverse application domains)
From Hero to Masses: A System Perspective is Desired

• System: more computing units; slow memory access
• Application: more data; diverse access patterns

A system perspective improves MOST applications, not only an INDIVIDUAL application: promoting application performance by improving vectorization, locality and parallelism.

[Figure: system complexity rising from 1970 to 2020 -- vector processors (vectorization), cache-based CPUs (locality), manycore (parallelism)]
Architecture-Driven Optimization Methodology

Each issue is tied to the architectures of its era: Cray and Convex, Fujitsu and NEC vector machines; Intel, AMD, ARM CPUs; CUDA GPUs, APU, MIC, TPU accelerators.

Issue 1 – Performance Bound on an Emerging Architecture (1 → 1, architecture-specific)
• Given one workload and one targeted architecture
• Goal: approaching the performance upper bound on the given machine
• Techniques: percolation, cache blocking
8
Percolation Execution Model
More cores lead to more serious bottleneck of data movement
��à �
ArchitectureSpecific
��������
����
��cache memory � �����
����
… ���
����
����
���
cache memory
• The massive parallelism is leveraged
to create just-in-time locality, where on-demand data movement happens in parallel
• Transform static dependence to dynamic one
• Couple dynamic parallelism with data movement
Guangming Tan, Ninghui Sun, and Guang R. Gao, Improving Performance of Dynamic Programming via Parallelism and Locality on Multicore Architecture, IEEE Transactions on Parallel and Distributed Systems, 2009.
ü generating the optimal instruction execution order
ü hiding data movement by multiple pipelines
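The just-in-time-locality idea can be sketched as a two-stage software pipeline: while tile i is being computed, tile i+1 is already being moved, so data movement is hidden behind compute. This is a minimal illustrative sketch, not the paper's implementation; the names `load_tile`, `compute`, and `percolate` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def load_tile(data, i, tile):
    """Simulated on-demand data movement: copy one tile toward fast memory."""
    return data[i * tile:(i + 1) * tile]

def compute(tile_data):
    """Placeholder kernel: sum of squares over the tile."""
    return sum(x * x for x in tile_data)

def percolate(data, tile=4):
    """Two-stage pipeline: the 'mover' thread fetches the next tile
    while the main thread computes on the current one."""
    n_tiles = len(data) // tile
    total = 0
    with ThreadPoolExecutor(max_workers=1) as mover:
        pending = mover.submit(load_tile, data, 0, tile)
        for i in range(n_tiles):
            current = pending.result()          # wait for tile i
            if i + 1 < n_tiles:                 # prefetch tile i+1 in parallel
                pending = mover.submit(load_tile, data, i + 1, tile)
            total += compute(current)
    return total

print(percolate(list(range(8))))  # 0^2 + 1^2 + ... + 7^2 = 140
```

The static dependence "load tile, then compute tile" becomes dynamic: each compute step only waits on the one in-flight transfer it actually needs.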
Applying to Numerical Computing Kernels (1 → 1, architecture-specific)

[Figure: DGEMM mapping on the GPU -- each thread block stages tiles from global memory into shared memory through 128-bit load/store instructions, alternating between buffer 0 and buffer 1 (double buffering)]

• The best algorithm may approach the machine's raw performance
• Matrix computation is the kernel that determines deep learning's performance

DGEMM results:
• Ours: 362 GFlops vs. CUBLAS 4.0: 160 GFlops (NVIDIA GPU)
• Ours: 758 GFlops vs. ACML-GPU 1.0: 392 GFlops (AMD GPU)

Jiajia Li, Xingjian Li, Guangming Tan, Mingyu Chen, Ninghui Sun, An Optimized Large-Scale Hybrid DGEMM Design for CPUs and ATI GPUs, The 26th ACM International Conference on Supercomputing (ICS), pp. 377-386, 2012.
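The shared-memory tiling in the figure is architecture-specific, but the underlying cache-blocking idea fits in a few lines. This is an illustrative pure-Python sketch under assumed names (`gemm_blocked`, block size `bs`), not the GPU kernel from the paper.

```python
def gemm_blocked(A, B, n, bs=2):
    """Cache-blocked matrix multiply C = A*B for n x n row-major lists.
    Each (bi, bk, bj) triple works on a bs x bs block so the working set
    fits in fast memory -- the same idea the GPU kernel applies with
    shared-memory tiles and 128-bit loads."""
    C = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, bs):
        for bk in range(0, n, bs):
            for bj in range(0, n, bs):
                for i in range(bi, min(bi + bs, n)):
                    for k in range(bk, min(bk + bs, n)):
                        a = A[i][k]
                        for j in range(bj, min(bj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C

# Hypothetical example: identity times a 2x2 matrix returns the matrix.
print(gemm_blocked([[1, 0], [0, 1]], [[1, 2], [3, 4]], 2))
# [[1.0, 2.0], [3.0, 4.0]]
```

On a real GPU the two innermost loops become per-thread work and the block copy becomes the double-buffered shared-memory load; the loop nest order is what creates reuse.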
Improving Performance of the Math Library on the Dawning 6000 Supercomputer (1 → 1, architecture-specific)

Dawning 6000 (Nebulae), No. 2 on the Top500 (June 2010), Shenzhen Supercomputing Center.

Guangming Tan, Linchuan Li, Sean Triechler, Everett Phillips, Yungang Bao, Ninghui Sun, Fast Implementation of DGEMM on Fermi GPU, ACM/IEEE Supercomputing (SC), 2011.
Improving Performance of a Deep Learning Library (1 → 1, architecture-specific)

• A performance-tuning framework to explore GPU microarchitectural features
• Fully utilizes the fused multiply-add (FMA) dual-issue mechanism
• 20-60% faster than cuDNN

https://github.com/PAA-NCIC/DeepPerf

Xiuxia Zhang, Guangming Tan, Shuangbai Xue, Jiajia Li, Keren Zhou, Mingyu Chen, Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning, ACM PPoPP 2017: 31-43.
Issue 2 – Performance Portability across Different Architectures (1 → N, architecture-oblivious)
• Given one workload and many targeted architectures
• Goal: a high-performance algorithm that adapts to different architectures
• Techniques: empirical autotuning, machine-learning autotuning
Performance Autotuning (1 → N, architecture-oblivious)

N hand-written optimized implementations = one autotuning library with ONE interface, for kernel algorithms such as SpMV and stencil:
• NVIDIA GPUs: Fermi, Kepler, Maxwell, Pascal
• Intel MIC: Knights Ferry, Knights Corner, Knights Landing, Knights Mill
• Intel CPUs: Pentium (MMX), Pentium III (SSE), Sandy Bridge (AVX), Skylake (AVX-512)
Machine Learning Based Autotuning (1 → N, architecture-oblivious)

[Figure: autotuning framework -- an Extractor (feature, environment, and input analysers) feeds features to a Learner (supervised and unsupervised learning) for knowledge mining; a Producer (parameter space, search algorithm, ML model) drives an Optimizer (variant generator) and an Evaluator (metric analyser, score function) that runs instrumented instances step by step until the best variant is found]

• A framework providing basic tools to build autotuning libraries with machine learning
• Composable and reusable across the 5 modules
• Extracts both application features and architecture features with data mining
• Learns a prediction model to generate optimal code

https://github.com/PAA-NCIC/PAK
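The Producer/Optimizer/Evaluator loop can be made concrete with a deliberately tiny empirical tuner. This is a sketch in the spirit of the framework, not PAK's API; the unroll-factor parameter space and the names `make_variant` and `autotune` are hypothetical.

```python
import timeit

def make_variant(unroll):
    """Variant generator (the 'Optimizer' role): a summation kernel
    specialized by an unroll factor."""
    def kernel(xs):
        total, i, n = 0, 0, len(xs)
        while i + unroll <= n:
            for u in range(unroll):      # unrolled body
                total += xs[i + u]
            i += unroll
        while i < n:                     # remainder loop
            total += xs[i]
            i += 1
        return total
    return kernel

def autotune(space, data):
    """Empirical tuning loop (the 'Evaluator' role): time every point in
    the parameter space on the real input and keep the fastest variant."""
    best, best_t = None, float("inf")
    for params in space:
        kernel = make_variant(*params)
        t = timeit.timeit(lambda: kernel(data), number=5)
        if t < best_t:
            best, best_t = kernel, t
    return best

data = list(range(1000))
best = autotune([(1,), (2,), (4,), (8,)], data)
print(best(data))  # 499500
```

The machine-learning variant replaces the exhaustive loop with a model that predicts the best point from extracted features, which matters once the space has thousands of points.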
Case study: SpMV Autotuning (architecture-oblivious)

SMAT workflow (off-line / on-line):
• Architecture parameters (e.g., TLB, cache, registers, prefetch, SIMDization, branches, multithreading) and application parameters (e.g., dimensions, diagonals, number of non-zeros, distribution of non-zeros, power-law structure) populate a feature database
• Off-line: tuned SpMV implementations (hand-written and third-party: MKL, CUSPARSE, ...) are executed and measured; data mining (C5.0) trains a learning model
• On-line: the input matrix's feature values are generated, then the model predicts the best SpMV algorithm and implementation (SMAT_xCSR_SpMV), falling back to search when model confidence is below a threshold

Results: Ours 32 GFlops vs. MKL 10 GFlops; reduces AMG solver execution time by 20%.

J. Li, G. Tan, M. Chen, N. Sun, SMAT: An Input Adaptive Auto-Tuner for Sparse Matrix-Vector Multiplication, ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2013.
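Input-adaptive selection reduces to "extract features, then apply a learned rule". The sketch below stands in a simple hand-written threshold for the C5.0 decision tree the paper actually learns; the feature set and function names are illustrative assumptions.

```python
def features(rows):
    """Extract simple SpMV input features from a sparse matrix given as a
    list of per-row column-index lists: average and variance of nnz/row."""
    nnz = [len(r) for r in rows]
    avg = sum(nnz) / len(nnz)
    var = sum((x - avg) ** 2 for x in nnz) / len(nnz)
    return avg, var

def choose_format(rows):
    """Stand-in decision rule: regular rows make ELL-style padding cheap,
    while irregular rows favor CSR (a learned tree replaces this)."""
    avg, var = features(rows)
    return "ELL" if var <= 1.0 else "CSR"

regular = [[0, 1], [1, 2], [2, 3]]        # 2 nnz per row, variance 0
skewed = [[0], [0, 1, 2, 3, 4, 5], [2]]   # one dense row dominates
print(choose_format(regular), choose_format(skewed))  # ELL CSR
```

The real tuner conditions on many more features (diagonals, power-law degree, architecture parameters) and predicts both the format and the tuned kernel variant.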
Case study: Stencil Autotuning (architecture-oblivious)

[Figure: stencil autotuner -- Producer with an optimal-solution-space (OSS) model, Optimizer, Evaluator (TAU-instrumented running instances), Extractor (platform information via the HPS frontend, compiler flags via the HPS backend), and a Learner over a knowledge database of tuned instances]

Optimal Solution Space (OSS) model vs. an explosive search space:
• The optimal variant is contained in a small OSS
• Two running instances have the same optimal variant if their features are similar
• Algorithm similarity → reduced search time

Autotuning speed:
  PATUS: 8 hours              Ours: 9 steps
  SDSL: >33 hours             PATUS, SDSL: 2725 steps
  PARTANS: 2.5 hours-1 month
  Halide: 2 hours-2 days

Yulong Luo, Guangming Tan, Zeyao Mo, Ninghui Sun, FAST: A Fast Stencil Autotuning Framework Based On An Optimal-solution Space Model, Proceedings of the 29th ACM International Conference on Supercomputing (ICS), June 8-11, 2015.
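The similarity shortcut can be sketched as nearest-neighbor reuse over a knowledge database: if a past instance with close-enough features exists, reuse its tuned variant instead of searching again. Names, the distance metric, and the threshold are illustrative assumptions, not the FAST implementation.

```python
def similarity(a, b):
    """Negative Euclidean distance between feature vectors
    (higher means more similar)."""
    return -sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def tune_or_reuse(feat, knowledge, search, threshold=-0.5):
    """OSS-style shortcut: reuse the best variant of the most similar past
    instance when similarity clears the threshold; otherwise run the full
    search and record the result in the knowledge database."""
    if knowledge:
        past_feat, variant = max(knowledge, key=lambda kv: similarity(feat, kv[0]))
        if similarity(feat, past_feat) >= threshold:
            return variant
    variant = search(feat)
    knowledge.append((feat, variant))
    return variant

kb = []
slow_search = lambda f: ("blocked", 32)          # hypothetical expensive search
v1 = tune_or_reuse((1.0, 2.0), kb, slow_search)  # full search, recorded
v2 = tune_or_reuse((1.1, 2.0), kb, slow_search)  # similar features -> reused
print(v1 == v2)  # True
```

This is why the step counts above differ so sharply: similar instances amortize one search across many inputs.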
Issue 3 – Performance Adaptivity between Application and Architecture (N → N, codesign)
• Given many workloads and many targeted architectures
• Goal: a suitable combination of hardware and algorithm for different workloads
• Techniques: domain-specific frameworks, pattern-driven codesign
Pattern-Driven Software-Hardware Codesign

Application → algorithm → pattern:
• Cryo-electron microscopy → image reconstruction → multi-level/multi-grained parallelism
• Genome, climate, Internet → hash, sparse matrix, graph, reduce, packet detection → word-level scatter-gather computation

Specific design for the abstracted patterns (NOT for individual algorithms), for example:
• Heterogeneous systems combining data-level and task-level parallelism
• Specialized computing units for common operations
• Customized hardware units for irregular memory access
Case study: Accelerating 3D Reconstruction of Cryo-Electron Microscopy Images

• Multigrain parallelism pattern [1]: coarse, middle, and fine grains mapped to MPI, OpenMP, and FPGA respectively
• Separating data-access patterns from computing kernels [2]
• On 19,841 hepatitis B virus images, the FPGA achieves a 2.5x speedup over a 12-core CPU and 14.2x better power efficiency

1. Guangming Tan, Ziyu Guo, Dan Meng, Single-particle 3D Reconstruction from Cryo-Electron Microscopy Images on GPU, The 23rd ACM International Conference on Supercomputing (ICS), pp. 380-389, 2009.
2. Guangming Tan, Chunming Zhang, Wendi Wang, Peiheng Zhang, SuperDragon: A Heterogeneous Parallel System for Accelerating 3D Reconstruction of Cryo-Electron Microscopy Images, ACM Transactions on Reconfigurable Technology and Systems, 8(4): 25, 2015.
Case study: Accelerating Irregular Computation in Genomic Data Analysis

• A specific memory controller for word-level scatter-gather memory access
• A co-processor exploiting massive fine-grained memory-level parallelism: memory requests are generated from every AE to every memory-controller port on every cycle (e.g., for hash-index lookup)
• 2.5x faster than a 64-core Intel Xeon SMP

Guangming Tan, Chunming Zhang, Wen Tang, Peiheng Zhang, Ninghui Sun, Accelerating Irregular Computation in Massive Short Reads Mapping on FPGA Co-processor, IEEE Transactions on Parallel and Distributed Systems, Vol. 27, No. 5, 2016.
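The word-level scatter-gather pattern has a simple software analogue: every index becomes an independent memory request, which the custom controller can issue in parallel. The function names and bucket layout below are illustrative, not from the paper.

```python
def gather(memory, indices):
    """Word-level gather: one independent request per index. On the FPGA
    co-processor such requests go from every AE to every memory-controller
    port each cycle, so the latencies overlap instead of serializing."""
    return [memory[i] for i in indices]

def hash_index_lookup(table, keys, nbuckets):
    """Hash-index lookup expressed as a gather over bucket slots -- the
    irregular access pattern the specialized memory controller serves."""
    return gather(table, [hash(k) % nbuckets for k in keys])

table = [100 + i for i in range(8)]  # hypothetical bucket array
print(hash_index_lookup(table, [3, 5, 1], 8))  # [103, 105, 101]
```

A cache-based CPU serializes these dependent, scattered loads; issuing them all at once is where the memory-level parallelism, and the 2.5x over the Xeon SMP, comes from.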
Case study: Accelerating In-Memory Computing for Genomic Data Analysis

A precision computing system for genomic scenes -- WGS (whole-genome sequencing), WES (whole-exome sequencing), and gene panels -- over data sources such as the 50x Platinum Genome (GRCh37/hg19), the human genome, and the Short Genetic Variations database (dbSNP):
• Algorithm-specific APIs and system APIs for ease of use
• An in-memory execution engine for large-scale genomic data processing, with genomic data compression, redundancy elimination, and genomic-pipeline load balancing

Unlocking the power of the genome: a 24-minute genomic solution for whole-genome studies.

Xueqi Li, Guangming Tan, Bingchen Wang, Ninghui Sun, High-performance Genomic Analysis Framework with In-memory Computing, ACM PPoPP 2018: 317-328.
Wrap-up

• Issue 1 (1 → 1, architecture-specific): percolation, cache blocking, SIMDization, compilers
• Issue 2 (1 → N, architecture-oblivious): libraries, empirical autotuning, machine learning
• Issue 3 (N → N, codesign): domain-specific frameworks, pattern-driven design
What's Next

• 1970-80: Vector
• 1980-95: MPP, CC-NUMA
• 1995-2017: Cluster + GPU
• 2018-: AI supercomputers -- machine/deep learning, AI-supported applications
Optimization on AI Supercomputers

• Architecture support: variable numerical precision; interconnection customized for asynchronous communication
• Library support: abstract programming interfaces; adaptive to diverse accelerators
• Algorithm support: exploiting billion-level parallelism in models; trading off speed against accuracy

Thanks!