TRANSCRIPT
Performance Optimization on High Performance Computers
Guangming Tan
2018.5.29
Institute of Computing Technology, Chinese Academy of Sciences
High Parallelism = High Speed

High Speed = High Performance?

[Figure: raw peak speed (HPL) vs. measured HPCG performance of top supercomputers; the "X" column is the gap between the two]

  System (year)        Raw peak (PFLOPS)   HPCG (PFLOPS)   Peak / HPCG
  K-Computer (2011)    11.28               0.60            18X
  Titan (2012)         27.11               0.32            84X
  Tianhe-2A (2013)     54.90               0.58            94X
  TaihuLight (2016)    125.43              0.48            260X

Real-world applications DO NOT get the performance gain?
A Retrospective to HPC Optimization

• 1970-80: Vector machines -- scientific computing (special applications)
• 1980-95: MPP, CC-NUMA -- scientific and engineering computing (narrow application domains)
• 1995-Now: Cluster + xPU -- scientific computing, engineering computing, Internet, big data, biological, cloud computing (diverse application domains)
From Hero to Masses: A System Perspective is Desired

• System: more computing units; slow memory access
• Application: more data; diverse access patterns

A system perspective improves MOST applications, not only an INDIVIDUAL application: promoting application performance by improving vectorization, locality and parallelism.

[Figure: system complexity rising from 1970 to 2020 -- vector processors (vectorization), cache-based CPUs (locality), manycore (parallelism)]
Architecture-Driven Optimization Methodology

Each issue is tied to the architectures of its era: Cray and Convex, Fujitsu and NEC vector machines; Intel, AMD, ARM CPUs; CUDA GPUs, APU, MIC, TPU accelerators.

Issue 1 – Performance Bound on an Emerging Architecture (1 → 1, architecture-specific)
• Given one workload and one targeted architecture
• Goal: approaching the performance upper bound on the given machine
• Techniques: percolation, cache blocking
8
Percolation Execution Model
More cores lead to more serious bottleneck of data movement
��à �
ArchitectureSpecific
��������
����
��cache memory � �����
����
… ���
����
����
���
cache memory
• The massive parallelism is leveraged
to create just-in-time locality, where on-demand data movement happens in parallel
• Transform static dependence to dynamic one
• Couple dynamic parallelism with data movement
Guangming Tan, Ninghui Sun, and Guang R. Gao, Improving Performance of Dynamic Programming via Parallelism and Locality on Multicore Architecture, IEEE Transactions on Parallel and Distributed Systems, 2009.
ü generating the optimal instruction execution order
ü hiding data movement by multiple pipelines
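The just-in-time-locality idea can be sketched as a two-stage software pipeline: while tile i is being computed, tile i+1 is already being moved, so data movement is hidden behind compute. This is a minimal illustrative sketch, not the paper's implementation; the names `load_tile`, `compute`, and `percolate` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def load_tile(data, i, tile):
    """Simulated on-demand data movement: copy one tile toward fast memory."""
    return data[i * tile:(i + 1) * tile]

def compute(tile_data):
    """Placeholder kernel: sum of squares over the tile."""
    return sum(x * x for x in tile_data)

def percolate(data, tile=4):
    """Two-stage pipeline: the 'mover' thread fetches the next tile
    while the main thread computes on the current one."""
    n_tiles = len(data) // tile
    total = 0
    with ThreadPoolExecutor(max_workers=1) as mover:
        pending = mover.submit(load_tile, data, 0, tile)
        for i in range(n_tiles):
            current = pending.result()          # wait for tile i
            if i + 1 < n_tiles:                 # prefetch tile i+1 in parallel
                pending = mover.submit(load_tile, data, i + 1, tile)
            total += compute(current)
    return total

print(percolate(list(range(8))))  # 0^2 + 1^2 + ... + 7^2 = 140
```

The static dependence "load tile, then compute tile" becomes dynamic: each compute step only waits on the one in-flight transfer it actually needs.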
Applying to Numerical Computing Kernels (1 → 1, architecture-specific)

[Figure: DGEMM mapping on the GPU -- each thread block stages tiles from global memory into shared memory through 128-bit load/store instructions, alternating between buffer 0 and buffer 1 (double buffering)]

• The best algorithm may approach the machine's raw performance
• Matrix computation is the kernel that determines deep learning's performance

DGEMM results:
• Ours: 362 GFlops vs. CUBLAS 4.0: 160 GFlops (NVIDIA GPU)
• Ours: 758 GFlops vs. ACML-GPU 1.0: 392 GFlops (AMD GPU)

Jiajia Li, Xingjian Li, Guangming Tan, Mingyu Chen, Ninghui Sun, An Optimized Large-Scale Hybrid DGEMM Design for CPUs and ATI GPUs, The 26th ACM International Conference on Supercomputing (ICS), pp. 377-386, 2012.
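The shared-memory tiling in the figure is architecture-specific, but the underlying cache-blocking idea fits in a few lines. This is an illustrative pure-Python sketch under assumed names (`gemm_blocked`, block size `bs`), not the GPU kernel from the paper.

```python
def gemm_blocked(A, B, n, bs=2):
    """Cache-blocked matrix multiply C = A*B for n x n row-major lists.
    Each (bi, bk, bj) triple works on a bs x bs block so the working set
    fits in fast memory -- the same idea the GPU kernel applies with
    shared-memory tiles and 128-bit loads."""
    C = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, bs):
        for bk in range(0, n, bs):
            for bj in range(0, n, bs):
                for i in range(bi, min(bi + bs, n)):
                    for k in range(bk, min(bk + bs, n)):
                        a = A[i][k]
                        for j in range(bj, min(bj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C

# Hypothetical example: identity times a 2x2 matrix returns the matrix.
print(gemm_blocked([[1, 0], [0, 1]], [[1, 2], [3, 4]], 2))
# [[1.0, 2.0], [3.0, 4.0]]
```

On a real GPU the two innermost loops become per-thread work and the block copy becomes the double-buffered shared-memory load; the loop nest order is what creates reuse.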
Improving Performance of the Math Library on the Dawning 6000 Supercomputer (1 → 1, architecture-specific)

Dawning 6000 (Nebulae), No. 2 on the Top500 (June 2010), Shenzhen Supercomputing Center.

Guangming Tan, Linchuan Li, Sean Triechler, Everett Phillips, Yungang Bao, Ninghui Sun, Fast Implementation of DGEMM on Fermi GPU, ACM/IEEE Supercomputing (SC), 2011.
Improving Performance of a Deep Learning Library (1 → 1, architecture-specific)

• A performance-tuning framework to explore GPU microarchitectural features
• Fully utilizes the fused multiply-add (FMA) dual-issue mechanism
• 20-60% faster than cuDNN

https://github.com/PAA-NCIC/DeepPerf

Xiuxia Zhang, Guangming Tan, Shuangbai Xue, Jiajia Li, Keren Zhou, Mingyu Chen, Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning, ACM PPoPP 2017: 31-43.
Issue 2 – Performance Portability across Different Architectures (1 → N, architecture-oblivious)
• Given one workload and many targeted architectures
• Goal: a high-performance algorithm that adapts to different architectures
• Techniques: empirical autotuning, machine-learning autotuning
Performance Autotuning (1 → N, architecture-oblivious)

N hand-written optimized implementations = one autotuning library with ONE interface, for kernel algorithms such as SpMV and stencil:
• NVIDIA GPUs: Fermi, Kepler, Maxwell, Pascal
• Intel MIC: Knights Ferry, Knights Corner, Knights Landing, Knights Mill
• Intel CPUs: Pentium (MMX), Pentium III (SSE), Sandy Bridge (AVX), Skylake (AVX-512)
Machine Learning Based Autotuning (1 → N, architecture-oblivious)

[Figure: autotuning framework -- an Extractor (feature, environment, and input analysers) feeds features to a Learner (supervised and unsupervised learning) for knowledge mining; a Producer (parameter space, search algorithm, ML model) drives an Optimizer (variant generator) and an Evaluator (metric analyser, score function) that runs instrumented instances step by step until the best variant is found]

• A framework providing basic tools to build autotuning libraries with machine learning
• Composable and reusable across the 5 modules
• Extracts both application features and architecture features with data mining
• Learns a prediction model to generate optimal code

https://github.com/PAA-NCIC/PAK
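The Producer/Optimizer/Evaluator loop can be made concrete with a deliberately tiny empirical tuner. This is a sketch in the spirit of the framework, not PAK's API; the unroll-factor parameter space and the names `make_variant` and `autotune` are hypothetical.

```python
import timeit

def make_variant(unroll):
    """Variant generator (the 'Optimizer' role): a summation kernel
    specialized by an unroll factor."""
    def kernel(xs):
        total, i, n = 0, 0, len(xs)
        while i + unroll <= n:
            for u in range(unroll):      # unrolled body
                total += xs[i + u]
            i += unroll
        while i < n:                     # remainder loop
            total += xs[i]
            i += 1
        return total
    return kernel

def autotune(space, data):
    """Empirical tuning loop (the 'Evaluator' role): time every point in
    the parameter space on the real input and keep the fastest variant."""
    best, best_t = None, float("inf")
    for params in space:
        kernel = make_variant(*params)
        t = timeit.timeit(lambda: kernel(data), number=5)
        if t < best_t:
            best, best_t = kernel, t
    return best

data = list(range(1000))
best = autotune([(1,), (2,), (4,), (8,)], data)
print(best(data))  # 499500
```

The machine-learning variant replaces the exhaustive loop with a model that predicts the best point from extracted features, which matters once the space has thousands of points.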
Case study: SpMV Autotuning (architecture-oblivious)

SMAT workflow (off-line / on-line):
• Architecture parameters (e.g., TLB, cache, registers, prefetch, SIMDization, branches, multithreading) and application parameters (e.g., dimensions, diagonals, number of non-zeros, distribution of non-zeros, power-law structure) populate a feature database
• Off-line: tuned SpMV implementations (hand-written and third-party: MKL, CUSPARSE, ...) are executed and measured; data mining (C5.0) trains a learning model
• On-line: the input matrix's feature values are generated, then the model predicts the best SpMV algorithm and implementation (SMAT_xCSR_SpMV), falling back to search when model confidence is below a threshold

Results: Ours 32 GFlops vs. MKL 10 GFlops; reduces AMG solver execution time by 20%.

J. Li, G. Tan, M. Chen, N. Sun, SMAT: An Input Adaptive Auto-Tuner for Sparse Matrix-Vector Multiplication, ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2013.
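Input-adaptive selection reduces to "extract features, then apply a learned rule". The sketch below stands in a simple hand-written threshold for the C5.0 decision tree the paper actually learns; the feature set and function names are illustrative assumptions.

```python
def features(rows):
    """Extract simple SpMV input features from a sparse matrix given as a
    list of per-row column-index lists: average and variance of nnz/row."""
    nnz = [len(r) for r in rows]
    avg = sum(nnz) / len(nnz)
    var = sum((x - avg) ** 2 for x in nnz) / len(nnz)
    return avg, var

def choose_format(rows):
    """Stand-in decision rule: regular rows make ELL-style padding cheap,
    while irregular rows favor CSR (a learned tree replaces this)."""
    avg, var = features(rows)
    return "ELL" if var <= 1.0 else "CSR"

regular = [[0, 1], [1, 2], [2, 3]]        # 2 nnz per row, variance 0
skewed = [[0], [0, 1, 2, 3, 4, 5], [2]]   # one dense row dominates
print(choose_format(regular), choose_format(skewed))  # ELL CSR
```

The real tuner conditions on many more features (diagonals, power-law degree, architecture parameters) and predicts both the format and the tuned kernel variant.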
Case study: Stencil Autotuning (architecture-oblivious)

[Figure: stencil autotuner -- Producer with an optimal-solution-space (OSS) model, Optimizer, Evaluator (TAU-instrumented running instances), Extractor (platform information via the HPS frontend, compiler flags via the HPS backend), and a Learner over a knowledge database of tuned instances]

Optimal Solution Space (OSS) model vs. an explosive search space:
• The optimal variant is contained in a small OSS
• Two running instances have the same optimal variant if their features are similar
• Algorithm similarity → reduced search time

Autotuning speed:
  PATUS: 8 hours              Ours: 9 steps
  SDSL: >33 hours             PATUS, SDSL: 2725 steps
  PARTANS: 2.5 hours-1 month
  Halide: 2 hours-2 days

Yulong Luo, Guangming Tan, Zeyao Mo, Ninghui Sun, FAST: A Fast Stencil Autotuning Framework Based On An Optimal-solution Space Model, Proceedings of the 29th ACM International Conference on Supercomputing (ICS), June 8-11, 2015.
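The similarity shortcut can be sketched as nearest-neighbor reuse over a knowledge database: if a past instance with close-enough features exists, reuse its tuned variant instead of searching again. Names, the distance metric, and the threshold are illustrative assumptions, not the FAST implementation.

```python
def similarity(a, b):
    """Negative Euclidean distance between feature vectors
    (higher means more similar)."""
    return -sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def tune_or_reuse(feat, knowledge, search, threshold=-0.5):
    """OSS-style shortcut: reuse the best variant of the most similar past
    instance when similarity clears the threshold; otherwise run the full
    search and record the result in the knowledge database."""
    if knowledge:
        past_feat, variant = max(knowledge, key=lambda kv: similarity(feat, kv[0]))
        if similarity(feat, past_feat) >= threshold:
            return variant
    variant = search(feat)
    knowledge.append((feat, variant))
    return variant

kb = []
slow_search = lambda f: ("blocked", 32)          # hypothetical expensive search
v1 = tune_or_reuse((1.0, 2.0), kb, slow_search)  # full search, recorded
v2 = tune_or_reuse((1.1, 2.0), kb, slow_search)  # similar features -> reused
print(v1 == v2)  # True
```

This is why the step counts above differ so sharply: similar instances amortize one search across many inputs.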
Issue 3 – Performance Adaptivity between Application and Architecture (N → N, codesign)
• Given many workloads and many targeted architectures
• Goal: a suitable combination of hardware and algorithm for different workloads
• Techniques: domain-specific frameworks, pattern-driven codesign
Pattern-Driven Software-Hardware Codesign

Application → algorithm → pattern:
• Cryo-electron microscopy → image reconstruction → multi-level/multi-grained parallelism
• Genome, climate, Internet → hash, sparse matrix, graph, reduce, packet detection → word-level scatter-gather computation

Specific design for the abstracted patterns (NOT for individual algorithms), for example:
• Heterogeneous systems combining data-level and task-level parallelism
• Specialized computing units for common operations
• Customized hardware units for irregular memory access
Case study: Accelerating 3D Reconstruction of Cryo-Electron Microscopy Images

• Multigrain parallelism pattern [1]: coarse, middle, and fine grains mapped to MPI, OpenMP, and FPGA respectively
• Separating data-access patterns from computing kernels [2]
• On 19,841 hepatitis B virus images, the FPGA achieves a 2.5x speedup over a 12-core CPU and 14.2x better power efficiency

1. Guangming Tan, Ziyu Guo, Dan Meng, Single-particle 3D Reconstruction from Cryo-Electron Microscopy Images on GPU, The 23rd ACM International Conference on Supercomputing (ICS), pp. 380-389, 2009.
2. Guangming Tan, Chunming Zhang, Wendi Wang, Peiheng Zhang, SuperDragon: A Heterogeneous Parallel System for Accelerating 3D Reconstruction of Cryo-Electron Microscopy Images, ACM Transactions on Reconfigurable Technology and Systems, 8(4): 25, 2015.
Case study: Accelerating Irregular Computation in Genomic Data Analysis

• A specific memory controller for word-level scatter-gather memory access
• A co-processor exploiting massive fine-grained memory-level parallelism: memory requests are generated from every AE to every memory-controller port on every cycle (e.g., for hash-index lookup)
• 2.5x faster than a 64-core Intel Xeon SMP

Guangming Tan, Chunming Zhang, Wen Tang, Peiheng Zhang, Ninghui Sun, Accelerating Irregular Computation in Massive Short Reads Mapping on FPGA Co-processor, IEEE Transactions on Parallel and Distributed Systems, Vol. 27, No. 5, 2016.
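The word-level scatter-gather pattern has a simple software analogue: every index becomes an independent memory request, which the custom controller can issue in parallel. The function names and bucket layout below are illustrative, not from the paper.

```python
def gather(memory, indices):
    """Word-level gather: one independent request per index. On the FPGA
    co-processor such requests go from every AE to every memory-controller
    port each cycle, so the latencies overlap instead of serializing."""
    return [memory[i] for i in indices]

def hash_index_lookup(table, keys, nbuckets):
    """Hash-index lookup expressed as a gather over bucket slots -- the
    irregular access pattern the specialized memory controller serves."""
    return gather(table, [hash(k) % nbuckets for k in keys])

table = [100 + i for i in range(8)]  # hypothetical bucket array
print(hash_index_lookup(table, [3, 5, 1], 8))  # [103, 105, 101]
```

A cache-based CPU serializes these dependent, scattered loads; issuing them all at once is where the memory-level parallelism, and the 2.5x over the Xeon SMP, comes from.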
Case study: Accelerating In-Memory Computing for Genomic Data Analysis

A precision computing system for genomic scenes -- WGS (whole-genome sequencing), WES (whole-exome sequencing), and gene panels -- over data sources such as the 50x Platinum Genome (GRCh37/hg19), the human genome, and the Short Genetic Variations database (dbSNP):
• Algorithm-specific APIs and system APIs for ease of use
• An in-memory execution engine for large-scale genomic data processing, with genomic data compression, redundancy elimination, and genomic-pipeline load balancing

Unlocking the power of the genome: a 24-minute genomic solution for whole-genome studies.

Xueqi Li, Guangming Tan, Bingchen Wang, Ninghui Sun, High-performance Genomic Analysis Framework with In-memory Computing, ACM PPoPP 2018: 317-328.
Wrap-up

• Issue 1 (1 → 1, architecture-specific): percolation, cache blocking, SIMDization, compilers
• Issue 2 (1 → N, architecture-oblivious): libraries, empirical autotuning, machine learning
• Issue 3 (N → N, codesign): domain-specific frameworks, pattern-driven design
What's Next

• 1970-80: Vector
• 1980-95: MPP, CC-NUMA
• 1995-2017: Cluster + GPU
• 2018-: AI supercomputers -- machine/deep learning, AI-supported applications
Optimization on AI Supercomputers

• Architecture support: variable numerical precision; interconnection customized for asynchronous communication
• Library support: abstract programming interfaces; adaptive to diverse accelerators
• Algorithm support: exploiting billion-level parallelism in models; trading off speed against accuracy

Thanks!