Andrew Kerr, May 21, 2020
DEVELOPING CUDA KERNELS TO PUSH TENSOR CORES TO THE ABSOLUTE LIMIT ON NVIDIA A100
CUTLASS Team
Andrew Kerr, Haicheng Wu, Manish Gupta, Duane Merrill, Pradeep Ramani
Contributors
Mostafa Hagog, Timothy Costa, Alan Kaatz, John Tran, Stephen Jones, Kyrylo Perelygin, Luke Durant, Piotr Majcher, Paul Springer, Markus Hohnerbach
Acknowledgements
Joel McCormack, Julien Demouth, Olivier Giroux, Bryce Lelbach, Cris Cecka
ACKNOWLEDGEMENTS
Overview
NVIDIA Ampere Architecture and CUTLASS 2.2
Tensor Cores on NVIDIA Ampere Architecture
Accelerated matrix operations
Efficient data movements for Tensor Cores
Strategies for maximizing performance
CUTLASS on NVIDIA A100
Optimal CUDA C++ templates for Tensor Cores
AGENDA
OVERVIEW
NVIDIA AMPERE ARCHITECTURE
New and Faster Tensor Core Operations
▪ Floating-point Tensor Core operations 8x and 16x faster than F32 CUDA Cores
▪ Integer Tensor Core operations 32x and 64x faster than F32 CUDA Cores
▪ New IEEE double-precision Tensor Cores 2x faster than F64 CUDA Cores
Additional Data Types and Mode
▪ Bfloat16, double, Tensor Float 32
Asynchronous copy
▪ Copy directly into shared memory – deep software pipelines
Many additional new features – see “Inside NVIDIA Ampere Architecture”
NVIDIA A100
PROGRAMMING NVIDIA AMPERE ARCHITECTURE
Deep Learning and Math Libraries using Tensor Cores (with CUDA kernels under the hood)
• cuDNN, cuBLAS, cuTENSOR, cuSOLVER, cuFFT, cuSPARSE
• “CUDNN V8: New Advances in Deep Learning Acceleration” (GTC 2020 - S21685)
• “How CUDA Math Libraries Can Help you Unleash the Power of the New NVIDIA A100 GPU” (GTC 2020 – S21681)
• “Inside the Compilers, Libraries and Tools for Accelerated Computing” (GTC 2020 – S21766)
CUDA C++ Device Code
• CUTLASS, CUDA Math API, CUB, Thrust, libcu++
CUDA device code
CUDA-accelerated math libraries with host-side API
GPU
GPU-accelerated application
PROGRAMMING NVIDIA AMPERE ARCHITECTURE
This is a talk for CUDA programmers
with CUDA C++
CUDA device code
CUDA-accelerated math libraries with host-side API
GPU
GPU-accelerated application
CUTLASS
https://github.com/NVIDIA/cutlass
CUDA C++ Templates for Deep Learning and Linear Algebra
CUDA releases on the timeline: CUDA 9.1, CUDA 9.2, CUDA 10.1, CUDA 10.2, CUDA 11
CUTLASS releases on the timeline:
CUTLASS Preview Release
CUTLASS 1.0
CUTLASS 1.3 – native NVIDIA V100 Tensor Cores
CUTLASS 2.0 – native NVIDIA Turing Tensor Cores
CUTLASS 2.2 – NVIDIA A100
CUTLASS
CUTLASS 2.2: optimal performance on NVIDIA Ampere Architecture
• Higher throughput Tensor Cores: more than 2x speedup for all data types
• New floating-point types: bfloat16, Tensor Float 32, double
• Deep software pipelines with cp.async: efficient and latency tolerant
CUTLASS 2.1
• Planar complex: complex-valued GEMMs with batching options targeting Volta and Turing Tensor Cores
• BLAS-style host side API
CUTLASS 2.0: significant refactoring using modern C++11 programming
• Efficient: particularly for Turing Tensor Cores
• Tensor Core programming model: reusable components for linear algebra kernels in CUDA
• Documentation, profiling tools, reference implementations, SDK examples, and more
What’s new?
https://github.com/NVIDIA/cutlass
CUTLASS PERFORMANCE ON NVIDIA AMPERE ARCHITECTURE
CUTLASS 2.2 - CUDA 11 Toolkit – NVIDIA A100
m=3456, n=4096
[Figure: three panels plotting GFLOP/s (y-axis) against GEMM K (x-axis)]
Panels: Mixed Precision Floating Point, Double Precision Floating Point, Mixed Precision Integer
Series: Tensor Core – BF16, F16; Tensor Core – TF32; CUDA Core – F32; Tensor Core – F64; CUDA Core – F64; Tensor Core – INT4; Tensor Core – INT8; CUDA Core – INT8
Speedup annotations on the plots: 5.7x, 2x, 13x, 7.7x, 13.8x
TENSOR CORES ON NVIDIA AMPERE ARCHITECTURE
Matrix operations: D = op(A, B) + C
▪ Matrix multiply-add
▪ XOR-POPC
Input Data types: A, B
▪ half, bfloat16, Tensor Float 32, double, int8, int4, bin1
Accumulation Data Types: C, D
▪ half, float, int32_t, double
WHAT ARE TENSOR CORES?
WHAT ARE TENSOR CORES?
Matrix operations: D = op(A, B) + C
▪ Matrix multiply-add
▪ XOR-POPC
M-by-N-by-K matrix operation
▪ Warp-synchronous, collective operation
▪ 32 threads within warp collectively hold A, B, C, and D operands
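As a point of reference for the two operations named above, here is a minimal host-side sketch of what D = op(A, B) + C computes. It is not how the hardware evaluates the operation. The function names, the row-major A / column-major B storage, and the packing of 32 one-bit elements per word are assumptions made for illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference multiply-add: D = A * B + C.
// Assumed storage: A is M x K row-major, B is K x N column-major,
// C and D are M x N row-major.
std::vector<int32_t> reference_mma(int M, int N, int K,
                                   const std::vector<int8_t>& A,
                                   const std::vector<int8_t>& B,
                                   const std::vector<int32_t>& C) {
    std::vector<int32_t> D(C);
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            for (int k = 0; k < K; ++k)
                D[m * N + n] += int32_t(A[m * K + k]) * int32_t(B[n * K + k]);
    return D;
}

// Reference XOR-POPC for one output element: the inner product over K
// one-bit elements is the population count of the XOR of the packed words.
int32_t reference_xor_popc(const std::vector<uint32_t>& a_row,
                           const std::vector<uint32_t>& b_col,
                           int32_t c) {
    int32_t d = c;
    for (std::size_t w = 0; w < a_row.size(); ++w)
        d += int32_t(std::bitset<32>(a_row[w] ^ b_col[w]).count());
    return d;
}
```

A reference like this is mainly useful for checking the output of a Tensor Core kernel on small problems.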
NVIDIA AMPERE ARCHITECTURE - TENSOR CORE OPERATIONS
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends
| PTX | Data Types (A * B + C) | Shape | Speedup on NVIDIA A100 (vs F32 CUDA cores) | Speedup on Turing* (vs F32 Cores) | Speedup on Volta* (vs F32 Cores) |
|---|---|---|---|---|---|
| mma.sync.m16n8k16, mma.sync.m16n8k8 | F16 * F16 + F16, F16 * F16 + F32, BF16 * BF16 + F32 | 16-by-8-by-16, 16-by-8-by-8 | 16x | 8x | 8x |
| mma.sync.m16n8k8 | TF32 * TF32 + F32 | 16-by-8-by-8 | 8x | N/A | N/A |
| mma.sync.m8n8k4 | F64 * F64 + F64 | 8-by-8-by-4 | 2x | N/A | N/A |
| mma.sync.m16n8k32, mma.sync.m8n8k16 | S8 * S8 + S32 | 16-by-8-by-32, 8-by-8-by-16 | 32x | 16x | N/A |
| mma.sync.m16n8k64 | S4 * S4 + S32 | 16-by-8-by-64 | 64x | 32x | N/A |
| mma.sync.m16n8k256 | B1 ^ B1 + S32 | 16-by-8-by-256 | 256x | 128x | N/A |
* Instructions with equivalent functionality for Turing and Volta differ in shape from the NVIDIA Ampere Architecture in several cases.
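The shapes in this table determine how many 32-bit registers each thread contributes to an operation. The sketch below assumes each operand is spread evenly over the 32 threads of a warp, which is consistent with the register counts in the inline PTX examples of this talk. The struct and function names are hypothetical.

```cpp
#include <cassert>

// Shape and element widths of one warp-wide Tensor Core operation.
struct MmaShape {
    int m, n, k;
    int input_bits;        // width of one A or B element
    int accumulator_bits;  // width of one C or D element
};

constexpr int kWarpSize = 32;
constexpr int kRegisterBits = 32;

// 32-bit registers per thread for a rows x cols operand, assuming the
// operand is divided evenly among the threads of the warp.
constexpr int registers_per_thread(int rows, int cols, int element_bits) {
    return rows * cols * element_bits / (kWarpSize * kRegisterBits);
}

constexpr int a_registers(MmaShape s) { return registers_per_thread(s.m, s.k, s.input_bits); }
constexpr int b_registers(MmaShape s) { return registers_per_thread(s.k, s.n, s.input_bits); }
constexpr int c_registers(MmaShape s) { return registers_per_thread(s.m, s.n, s.accumulator_bits); }
```

For example, m16n8k16 with F16 inputs and F32 accumulators gives four registers for A, two for B, and four for C. For F64 m8n8k4 the result is two 32-bit registers for A, that is, one 64-bit value.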
Warp-wide Tensor Core operation: 8-by-8-by-128b
TENSOR CORE OPERATION: FUNDAMENTAL SHAPE
mma.sync.aligned (via inline PTX)
int32_t D[2];
uint32_t const A;
uint32_t const B;
int32_t const C[2];
// Example targets 8-by-8-by-16 Tensor Core operation
asm(
  "mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32 "
  " { %0, %1 }, "
  " %2, "
  " %3, "
  " { %4, %5 }; "
  : "=r"(D[0]), "=r"(D[1])
  : "r"(A), "r"(B),
    "r"(C[0]), "r"(C[1])
);
8-by-8-by-16
S8 * S8 + S32
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends
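In the S8 example each thread passes A and B as a single uint32_t, so one register carries four 8-bit elements. The host-side sketch below shows one way to pack and unpack such a register. The helper names are hypothetical, and the byte order (element 0 in the least significant byte, as produced by a little-endian 32-bit load of four consecutive int8_t values) is an assumption to check against the PTX fragment layout.

```cpp
#include <cassert>
#include <cstdint>

// Pack four signed 8-bit values into one 32-bit register image.
// Assumption: element 0 occupies the least significant byte.
inline uint32_t pack_s8x4(int8_t e0, int8_t e1, int8_t e2, int8_t e3) {
    return  uint32_t(uint8_t(e0))        |
           (uint32_t(uint8_t(e1)) << 8)  |
           (uint32_t(uint8_t(e2)) << 16) |
           (uint32_t(uint8_t(e3)) << 24);
}

// Extract element `index` (0..3) from a packed register image.
inline int8_t unpack_s8(uint32_t packed, int index) {
    return int8_t(uint8_t(packed >> (8 * index)));
}

// Four-element dot product of two packed registers, accumulated in 32 bits.
// This is the per-register slice of the K dimension, computed on the host.
inline int32_t dot_s8x4(uint32_t a, uint32_t b, int32_t c) {
    for (int i = 0; i < 4; ++i)
        c += int32_t(unpack_s8(a, i)) * int32_t(unpack_s8(b, i));
    return c;
}
```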
Warp-wide Tensor Core operation: 16-by-8-by-128b
EXPANDING THE M DIMENSION
mma.sync.aligned (via inline PTX)
float D[4];
uint32_t const A[2];
uint32_t const B;
float const C[4];
// Example targets 16-by-8-by-8 Tensor Core operation
asm(
  "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 "
  " { %0, %1, %2, %3 }, "
  " { %4, %5 }, "
  " %6, "
  " { %7, %8, %9, %10 };"
  : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3])
  : "r"(A[0]), "r"(A[1]),
    "r"(B),
    "f"(C[0]), "f"(C[1]), "f"(C[2]), "f"(C[3])
);
16-by-8-by-8
F16 * F16 + F32
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends
Warp-wide Tensor Core operation: 16-by-8-by-256b
EXPANDING THE K DIMENSION
mma.sync.aligned (via inline PTX)
float D[4];
uint32_t const A[4];
uint32_t const B[2];
float const C[4];
// Example targets 16-by-8-by-16 Tensor Core operation
asm(
  "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
  " { %0, %1, %2, %3 }, "
  " { %4, %5, %6, %7 }, "
  " { %8, %9 }, "
  " { %10, %11, %12, %13 };"
  : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3])
  : "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
    "r"(B[0]), "r"(B[1]),
    "f"(C[0]), "f"(C[1]), "f"(C[2]), "f"(C[3])
);
16-by-8-by-16
F16 * F16 + F32
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends
mma.sync.aligned (via inline PTX)
int32_t D[4];
uint32_t const A[4];
uint32_t const B[2];
int32_t const C[4];
// Example targets 16-by-8-by-32 Tensor Core operation
asm(
  "mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32 "
  " { %0, %1, %2, %3 }, "
  " { %4, %5, %6, %7 }, "
  " { %8, %9 }, "
  " { %10, %11, %12, %13 };"
  : "=r"(D[0]), "=r"(D[1]), "=r"(D[2]), "=r"(D[3])
  : "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
    "r"(B[0]), "r"(B[1]),
    "r"(C[0]), "r"(C[1]), "r"(C[2]), "r"(C[3])
);
16-by-8-by-32
S8 * S8 + S32
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends
mma.sync.aligned (via inline PTX)
uint32_t D[2]; // two registers needed (vs. four)
uint32_t const A[4];
uint32_t const B[2];
uint32_t const C[2]; // two registers needed (vs. four)
// Example targets 16-by-8-by-16 Tensor Core operation
asm(
  "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
  " { %0, %1 }, "
  " { %2, %3, %4, %5 }, "
  " { %6, %7 }, "
  " { %8, %9 }; "
  : "=r"(D[0]), "=r"(D[1])
  : "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
    "r"(B[0]), "r"(B[1]),
    "r"(C[0]), "r"(C[1])
);
16-by-8-by-16
HALF-PRECISION : F16 * F16 + F16
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends
mma.sync.aligned (via inline PTX)
uint64_t D[2]; // two 64-bit accumulators
uint64_t const A; // one 64-bit element for A operand
uint64_t const B; // one 64-bit element for B operand
uint64_t const C[2]; // two 64-bit accumulators
// Example targets 8-by-8-by-4 Tensor Core operation
asm(
  "mma.sync.aligned.m8n8k4.row.col.f64.f64.f64.f64 "
  " { %0, %1 }, "
  " %2, "
  " %3, "
  " { %4, %5 }; "
  : "=l"(D[0]), "=l"(D[1])
  : "l"(A), "l"(B),
    "l"(C[0]), "l"(C[1])
);
8-by-8-by-4
DOUBLE-PRECISION: F64 * F64 + F64
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends
cutlass::arch::Mma
/// Matrix multiply-add operation
template <
  /// Size of the matrix product (concept: GemmShape)
  typename Shape,
  /// Number of threads participating
  int kThreads,
  /// Data type of A elements
  typename ElementA,
  /// Layout of A matrix (concept: MatrixLayout)
  typename LayoutA,
  /// Data type of B elements
  typename ElementB,
  /// Layout of B matrix (concept: MatrixLayout)
  typename LayoutB,
  /// Element type of C matrix
  typename ElementC,
  /// Layout of C matrix (concept: MatrixLayout)
  typename LayoutC,
  /// Inner product operator
  typename Operator
>
struct Mma;
m-by-n-by-k
CUTLASS: wraps PTX in template
https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/arch/mma_sm80.h
cutlass::arch::Mma
__global__ void kernel() {
  // arrays containing logical elements
  Array<half_t, 8> A;
  Array<half_t, 4> B;
  Array< float, 4> C;
  // define the appropriate matrix operation
  arch::Mma< GemmShape<16, 8, 16>, 32, ... > mma;
  // in-place matrix multiply-accumulate
  mma(C, A, B, C);
  ...
}
16-by-8-by-16
CUTLASS: wraps PTX in template
https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/arch/mma_sm80.h
EFFICIENT DATA MOVEMENT FOR TENSOR CORES
CUDA example
__global__ void tensor_core_example_8x8x16(
    int32_t *D,
    uint32_t const *A,
    uint32_t const *B,
    int32_t const *C) {
  // Compute the coordinates of accesses to A and B matrices
  int outer = threadIdx.x / 4; // m or n dimension
  int inner = threadIdx.x % 4; // k dimension
  // Compute the coordinates for the accumulator matrices
  int c_row = threadIdx.x / 4;
  int c_col = 2 * (threadIdx.x % 4);
  // Compute linear offsets into each matrix
  int ab_idx = outer * 4 + inner;
  int cd_idx = c_row * 8 + c_col;
  // Issue Tensor Core operation
  asm(
    "mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32 "
    " { %0, %1 }, "
    " %2, "
    " %3, "
    " { %4, %5 }; "
    : "=r"(D[cd_idx]), "=r"(D[cd_idx + 1])
    : "r"(A[ab_idx]), "r"(B[ab_idx]),
      "r"(C[cd_idx]), "r"(C[cd_idx + 1])
  );
}
HELLO WORLD: TENSOR CORES
Map each thread to coordinates of the matrix operation
Load inputs from memory
Perform the matrix operation
Store the result to memory
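The thread-to-coordinate mapping in the first step can be checked on the host. The sketch below repeats the index arithmetic of the kernel with the thread index passed in explicitly, and verifies that a warp touches every 32-bit word of A and B and every accumulator of the 8-by-8 tile exactly once. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <set>

struct ThreadCoord {
    int ab_idx;  // index of the 32-bit word of A (and of B) this thread loads
    int cd_idx;  // index of the first of this thread's two adjacent accumulators
};

// Same arithmetic as the kernel, with threadIdx.x replaced by `thread`.
inline ThreadCoord map_thread_8x8x16(int thread) {
    int outer = thread / 4;        // m (for A) or n (for B)
    int inner = thread % 4;        // k, in units of four packed 8-bit values
    int c_row = thread / 4;
    int c_col = 2 * (thread % 4);
    return { outer * 4 + inner, c_row * 8 + c_col };
}

// True if the 32 threads cover all 32 operand words and all 64
// accumulators without overlap.
inline bool mapping_covers_tile() {
    std::set<int> ab, cd;
    for (int t = 0; t < 32; ++t) {
        ThreadCoord c = map_thread_8x8x16(t);
        ab.insert(c.ab_idx);
        cd.insert(c.cd_idx);
        cd.insert(c.cd_idx + 1);
    }
    return ab.size() == 32 && cd.size() == 64;
}
```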
PERFORMANCE IMPLICATIONS
Load A and B inputs from memory: 2 x 4B per thread
Perform one Tensor Core operation: 2048 flops per warp
2048 flops require 256 B of loaded data
➔ 8 flops/byte
NVIDIA A100 Specifications:
• 624 TFLOP/s (INT8)
• 1.6 TB/s (HBM2)
➔ 400 flops/byte
8 flops/byte * 1.6 TB/s ➔ 12 TFLOP/s
This kernel is global memory bandwidth limited.
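The arithmetic behind these numbers can be written out as a short calculation. This is a sketch of the roofline-style estimate only; the function names are hypothetical. Like the slide, it counts only the A and B loads and ignores the accumulator traffic.

```cpp
#include <cassert>

// Operations in one M-by-N-by-K multiply-add: one multiply and one add per term.
constexpr long long mma_flops(int m, int n, int k) { return 2LL * m * n * k; }

// Bytes of A and B loaded by a warp when each of the 32 threads loads
// bytes_per_thread for each of the two operands.
constexpr int warp_bytes_loaded(int bytes_per_thread) { return 2 * bytes_per_thread * 32; }

// Throughput bound in Tera-ops/s when every operand byte comes from
// global memory: arithmetic intensity (flops/byte) times bandwidth (TB/s).
constexpr double bandwidth_bound_tflops(double flops_per_byte, double tb_per_s) {
    return flops_per_byte * tb_per_s;
}
```

With the figures above, mma_flops(8, 8, 16) is 2048 and warp_bytes_loaded(4) is 256, giving 8 flops/byte. The bound 8 x 1.6 evaluates to 12.8, which the slide rounds to 12 TFLOP/s, and the machine balance 624 / 1.6 is 390, which the slide rounds to 400 flops/byte.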
FEEDING THE DATA PATH
Efficient storing and loading through Shared Memory
Tiled, hierarchical model: reuse data in Shared Memory and in Registers
See CUTLASS GTC 2018 talk for more details about this model.
FEEDING THE DATA PATH
Move data from Global Memory to Tensor Cores as efficiently as possible
• Latency-tolerant pipeline from Global Memory
• Conflict-free Shared Memory stores
• Conflict-free Shared Memory loads
ASYNCHRONOUS COPY: EFFICIENT PIPELINES
New NVIDIA Ampere Architecture feature: cp.async
• Asynchronous copy directly from Global to Shared Memory
• See “Inside the NVIDIA Ampere Architecture” for more details (GTC 2020 – S21730)
Enables efficient software pipelines
• Minimizes data movement: L2 ➔ L1 ➔ RF ➔ SMEM becomes L2 ➔ SMEM
• Saves registers: RF no longer needed to hold the results of long-latency load instructions
• Indirection: fetch several stages in advance for greater latency tolerance
[Figure: circular buffer in Shared Memory. Labels: committed stage, SMEM write pointer, copies in flight; stages are marked cp.async, cp.async, cp.async, ld.shared]
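The circular-buffer schedule can be modelled on the host to see why several stages are needed. The sketch below is a simplified model, not CUTLASS's mainloop: copies are assumed to complete in issue order, and the function name and return convention are hypothetical. It keeps stages - 1 copies in flight and refills the slot freed by the previous iteration while the oldest committed stage is consumed.

```cpp
#include <cassert>
#include <vector>

// Host-side model of a multistage pipeline over a circular buffer.
// Returns the order in which tiles are consumed, or an empty vector if a
// slot would be overwritten while its tile is still being read.
inline std::vector<int> run_pipeline(int stages, int tiles) {
    std::vector<int> slot(stages, -1);   // tile currently held by each slot
    std::vector<int> consumed;
    int next_copy = 0;                   // next tile to fetch from Global Memory

    // Prologue: put stages - 1 copies in flight before any math is issued.
    for (; next_copy < stages - 1 && next_copy < tiles; ++next_copy)
        slot[next_copy % stages] = next_copy;

    for (int t = 0; t < tiles; ++t) {
        // Issue the next copy into the slot freed by the previous iteration.
        if (next_copy < tiles) {
            int write = next_copy % stages;
            if (write == t % stages) return {};   // would clobber the tile being read
            slot[write] = next_copy++;
        }
        // Consume the oldest committed stage (models wait, ld.shared, and mma).
        if (slot[t % stages] != t) return {};
        consumed.push_back(t);
    }
    return consumed;
}
```

In this model a single stage returns an empty result, because the only slot would have to be written and read at the same time; with two or more stages the tiles are consumed in order.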
GLOBAL MEMORY TO TENSOR CORES
Global Memory
Shared Memory
Tensor Cores
cp.async
M dimension
K dimension
LDMATRIX: FETCH TENSOR CORE OPERANDS
PTX instruction to load a matrix from Shared Memory
Shared Memory Tensor Cores
Each thread supplies a pointer to 128b row of data in Shared Memory
Each 128b row is broadcast to groups of four threads
(potentially different threads than the one supplying the pointer)
Data matches arrangement of inputs to Tensor Core operations
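The broadcast described above can be modelled on the host for a single 8-by-8 matrix of 16-bit elements (the .x1 form, in which threads 0 to 7 supply the eight row pointers). This is an illustrative model, not the instruction itself. The names are hypothetical, and placing the lower-indexed element in the low half of the register is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Row = std::array<uint16_t, 8>;   // one 128-bit row of eight 16-bit elements

// Host-side model of ldmatrix (.x1, .m8n8, .b16, no transpose).
// rows[i] is the 128-bit row that thread i (0..7) points at.
// Returns the 32-bit register received by each of the 32 threads.
inline std::array<uint32_t, 32> ldmatrix_x1_model(const std::array<Row, 8>& rows) {
    std::array<uint32_t, 32> regs{};
    for (int t = 0; t < 32; ++t) {
        const Row& row = rows[t / 4];   // each row is shared by a group of four threads
        int col = 2 * (t % 4);          // each thread receives two adjacent elements
        regs[t] = uint32_t(row[col]) | (uint32_t(row[col + 1]) << 16);
    }
    return regs;
}
```

Thread 5, for example, receives elements 2 and 3 of the row supplied by thread 1, which is the "potentially different threads" case mentioned on the slide.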
LDMATRIX: PTX INSTRUCTION
Each thread supplies a pointer to 128b row of data in Shared Memory
Each 128b row is broadcast to groups of four threads
(potentially different threads than the one supplying the pointer)
Data matches arrangement of inputs to Tensor Core operations
PTX instruction to load a matrix from SMEM
// Inline PTX assembly for ldmatrix
uint32_t R[4];
uint32_t smem_ptr;
asm volatile (
  "ldmatrix.sync.aligned.x4.m8n8.shared.b16 "
  "{%0, %1, %2, %3}, [%4]; "
  : "=r"(R[0]), "=r"(R[1]), "=r"(R[2]), "=r"(R[3])
  : "r"(smem_ptr)
);
GLOBAL MEMORY TO TENSOR CORES
Tensor Cores
ldmatrix
cp.async
Global Memory
Shared Memory
NVIDIA AMPERE ARCHITECTURE – SHARED MEMORY BANK TIMING
Bank conflicts between threads in the same phase
4B words are accessed in 1 phase
8B words are accessed in 2 phases:
• Process addresses of the first 16 threads in a warp
• Process addresses of the second 16 threads in a warp
16B words are accessed in 4 phases:
• Each phase processes 8 consecutive threads of a warp
Slide borrowed from: Guillaume Thomas-Collignon and Paulius Micikevicius. "Volta Architecture and performance optimization." GTC 2018.
http://on-demand.gputechconf.com/gtc/2018/presentation/s81006-volta-architecture-and-performance-optimization.pdf
128 bit access size
Phase 0: T0 .. T7
Phase 1: T8 .. T15
Phase 2: T16 .. T23
Phase 3: T24 .. T31
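The phase rule can be stated as a small calculation. The sketch below follows the timing described on this slide for per-thread access sizes of 4, 8, and 16 bytes; the function names are hypothetical.

```cpp
#include <cassert>

// Number of phases a warp-wide Shared Memory access is split into,
// for per-thread access sizes of 4, 8, or 16 bytes.
constexpr int num_phases(int access_bytes) { return access_bytes / 4; }

// Phase in which the address of a given thread (0..31) is processed.
constexpr int phase_of(int thread, int access_bytes) {
    return thread / (32 / num_phases(access_bytes));
}

// Bank conflicts are only possible between threads serviced in the same phase.
constexpr bool may_conflict(int t0, int t1, int access_bytes) {
    return phase_of(t0, access_bytes) == phase_of(t1, access_bytes);
}
```

For 128-bit accesses this means only groups of eight consecutive threads need to be conflict-free with respect to each other.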
GLOBAL MEMORY TO TENSOR CORES
Global Memory
Shared Memory
Registers
Bank conflict on either store or load from Shared Memory
ldmatrix
cp.async
GLOBAL TO SHARED MEMORY
Load(128 bits per thread)
Store(128 bits per thread)
Permuted Shared Memory layout
XOR function maps thread index to Shared Memory location
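To illustrate why an XOR permutation helps, here is a simplified host-side model. It is not the exact CUTLASS layout. It assumes a tile whose rows hold eight 128-bit vectors (so one row spans all 32 four-byte banks), and a permutation that XORs the column index with the low bits of the row index. The function names are hypothetical.

```cpp
#include <cassert>

constexpr int kVectorsPerRow = 8;   // eight 128-bit vectors cover 32 four-byte banks

// Unpermuted row-major layout, for comparison.
inline int linear_offset(int row, int col) { return row * kVectorsPerRow + col; }

// Permuted layout: XOR the column with the low bits of the row.
inline int swizzled_offset(int row, int col) {
    return row * kVectorsPerRow + (col ^ (row % kVectorsPerRow));
}

// Worst-case number of threads in one 8-thread phase that hit the same
// group of four banks. With column_access, the eight threads read column
// `fixed` of eight different rows; otherwise they access the eight columns
// of row `fixed`.
template <typename Layout>
int worst_conflict(Layout layout, bool column_access, int fixed) {
    int hits[kVectorsPerRow] = {0};
    int worst = 0;
    for (int i = 0; i < 8; ++i) {
        int row = column_access ? i : fixed;
        int col = column_access ? fixed : i;
        int group = layout(row, col) % kVectorsPerRow;   // bank group of this vector
        if (++hits[group] > worst) worst = hits[group];
    }
    return worst;
}
```

In this model the linear layout is conflict-free for row accesses but 8-way conflicted for column accesses, while the permuted layout is conflict-free for both, since XOR with a fixed value is a bijection on the eight column indices.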
GLOBAL TO SHARED MEMORY
Phase 0: T0 .. T7
Phase 1: T8 .. T15
Phase 2: T16 .. T23
Phase 3: T24 .. T31
Load(128 bits per thread)
Store(128 bits per thread)
LOADING FROM SHARED MEMORY TO REGISTERS
ADVANCING TO NEXT K GROUP
K=0..15 K=16..31
ADVANCING TO NEXT K GROUP
smem_ptr = row_idx * 8 + column_idx;
smem_ptr = smem_ptr ^ 2;
K=0..15 K=16..31
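The pointer update above can be checked with a few lines of host code. The sketch assumes smem_ptr counts 128-bit vectors in a tile with eight vectors per row, which is what the factor of 8 suggests; the function names are hypothetical.

```cpp
#include <cassert>

// Offset of (row_idx, column_idx), in 128-bit vectors, for a tile with
// eight vectors per row.
constexpr int smem_offset(int row_idx, int column_idx) { return row_idx * 8 + column_idx; }

// Switch to the other K group by toggling bit 1 of the offset. Only the
// column index changes (by two vectors); the row is untouched.
constexpr int toggle_k_group(int smem_ptr) { return smem_ptr ^ 2; }
```

Because XOR is its own inverse, applying the same update twice returns to the original K group, so one instruction advances the pointer in either direction.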
LOADING FROM SHARED MEMORY TO REGISTERS
Phase 0
K=16..31
![Page 52: DEVELOPING CUDA KERNELS TO PUSH TENSOR ......32 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 s GEMM K Mixed Precision Floating Point CUTLASS 2.2 - CUDA 11 Toolkit –NVIDIA A100 Double Precision](https://reader036.vdocuments.us/reader036/viewer/2022071510/612de4871ecc5158694278b4/html5/thumbnails/52.jpg)
52
LOADING FROM SHARED MEMORY TO REGISTERS
Phase 1
K=16..31
![Page 53: DEVELOPING CUDA KERNELS TO PUSH TENSOR ......32 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 s GEMM K Mixed Precision Floating Point CUTLASS 2.2 - CUDA 11 Toolkit –NVIDIA A100 Double Precision](https://reader036.vdocuments.us/reader036/viewer/2022071510/612de4871ecc5158694278b4/html5/thumbnails/53.jpg)
53
LOADING FROM SHARED MEMORY TO REGISTERS
Phase 2
K=16..31
![Page 54: DEVELOPING CUDA KERNELS TO PUSH TENSOR ......32 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 s GEMM K Mixed Precision Floating Point CUTLASS 2.2 - CUDA 11 Toolkit –NVIDIA A100 Double Precision](https://reader036.vdocuments.us/reader036/viewer/2022071510/612de4871ecc5158694278b4/html5/thumbnails/54.jpg)
54
LOADING FROM SHARED MEMORY TO REGISTERS
Phase 3
K=16..31
![Page 55: DEVELOPING CUDA KERNELS TO PUSH TENSOR ......32 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 s GEMM K Mixed Precision Floating Point CUTLASS 2.2 - CUDA 11 Toolkit –NVIDIA A100 Double Precision](https://reader036.vdocuments.us/reader036/viewer/2022071510/612de4871ecc5158694278b4/html5/thumbnails/55.jpg)
55
CUTLASS
CUDA C++ Templates as an Optimal Abstraction Layer for Tensor Cores
• Latency-tolerant pipeline from Global Memory
• Conflict-free Shared Memory stores
• Conflict-free Shared Memory loads
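To illustrate what "conflict-free" means here, the sketch below counts how many threads of one 8-thread phase land on the same 128-bit bank group. The assumptions are: 32 four-byte banks, rows of 8 x 128-bit vectors (128 bytes), and a permutation that XORs the column with the row index. That permutation is an illustrative choice and may differ from the layout CUTLASS actually uses; all names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

constexpr int kRows = 8;           // threads (rows) served in one phase
constexpr int kVectorsPerRow = 8;  // 8 x 128 bits = 128 bytes = all 32 banks

// Physical column of logical element (row, column).
// permute == false: plain row-major. permute == true: column XOR row
// (assumed permutation for illustration).
int physical_column(int row, int column, bool permute) {
  assert(row >= 0 && row < kRows);
  assert(column >= 0 && column < kVectorsPerRow);
  return permute ? (column ^ row) : column;
}

// Thread t of the phase reads logical element (t, column). Since each row
// spans all banks exactly once, the physical column is the bank group.
// Returns the largest number of threads sharing one bank group.
int max_conflict_degree(int column, bool permute) {
  std::array<int, kVectorsPerRow> hits{};
  for (int t = 0; t < kRows; ++t) {
    ++hits[physical_column(t, column, permute)];
  }
  return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions a column read from the plain layout serializes eight ways, while the permuted layout spreads the eight threads over eight distinct bank groups.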
![Page 56: DEVELOPING CUDA KERNELS TO PUSH TENSOR ......32 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 s GEMM K Mixed Precision Floating Point CUTLASS 2.2 - CUDA 11 Toolkit –NVIDIA A100 Double Precision](https://reader036.vdocuments.us/reader036/viewer/2022071510/612de4871ecc5158694278b4/html5/thumbnails/56.jpg)
56
CUTLASS: OPTIMAL ABSTRACTION FOR TENSOR CORES
using Mma = cutlass::gemm::warp::DefaultMmaTensorOp<
  GemmShape<64, 64, 16>,
  half_t, LayoutA,    // GEMM A operand
  half_t, LayoutB,    // GEMM B operand
  float, RowMajor     // GEMM C operand
>;
__shared__ ElementA smem_buffer_A[Mma::Shape::kM * GemmK];
__shared__ ElementB smem_buffer_B[Mma::Shape::kN * GemmK];
// Construct iterators into SMEM tiles
Mma::IteratorA iter_A({smem_buffer_A, lda}, thread_id);
Mma::IteratorB iter_B({smem_buffer_B, ldb}, thread_id);
Mma::FragmentA frag_A;
Mma::FragmentB frag_B;
Mma::FragmentC accum;
Mma mma;
accum.clear();
#pragma unroll 1
for (int k = 0; k < GemmK; k += Mma::Shape::kK) {
  iter_A.load(frag_A);    // Load fragments from A and B matrices
  iter_B.load(frag_B);
  ++iter_A; ++iter_B;     // Advance along GEMM K to next tile in A
                          // and B matrices
  // Compute matrix product
  mma(accum, frag_A, frag_B, accum);
}
Shared Memory
Tensor Cores
Warp-level matrix multiply
![Page 57: DEVELOPING CUDA KERNELS TO PUSH TENSOR ......32 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 s GEMM K Mixed Precision Floating Point CUTLASS 2.2 - CUDA 11 Toolkit –NVIDIA A100 Double Precision](https://reader036.vdocuments.us/reader036/viewer/2022071510/612de4871ecc5158694278b4/html5/thumbnails/57.jpg)
57
CUTLASS: OPTIMAL ABSTRACTION FOR TENSOR CORES
Tile Iterator Constructors:
Initialize pointers into permuted Shared Memory buffers
Fragments:
Register-backed arrays holding each thread’s data
Warp-level matrix multiply:
Decomposes a large matrix multiply into Tensor Core operations
Tile Iterator:
load() - Fetches data from permuted Shared Memory buffers
operator++() - advances to the next logical matrix in SMEM
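The roles annotated above can be mimicked on the CPU. The following is a plain C++ analogy, not the CUTLASS API: the iterators read ordinary row-major arrays rather than permuted Shared Memory, the fragments are small arrays rather than registers, and `mma` is a scalar loop rather than Tensor Core instructions. The tile sizes and the `gemm_tile` helper are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tiny stand-ins for Mma::Shape::kM, kN, kK (hypothetical sizes).
constexpr int kM = 4, kN = 4, kK = 2;

// Fragments: fixed-size arrays standing in for per-thread registers.
struct FragmentA { float v[kM * kK]; };
struct FragmentB { float v[kK * kN]; };
struct FragmentC {
  float v[kM * kN];
  void clear() { for (float& x : v) x = 0.0f; }
};

// Walks a row-major kM x GemmK matrix A in slices of kK columns.
class IteratorA {
 public:
  IteratorA(const float* ptr, int lda) : ptr_(ptr), lda_(lda) {}
  void load(FragmentA& frag) const {
    for (int m = 0; m < kM; ++m)
      for (int k = 0; k < kK; ++k)
        frag.v[m * kK + k] = ptr_[m * lda_ + k_ + k];
  }
  IteratorA& operator++() { k_ += kK; return *this; }
 private:
  const float* ptr_;
  int lda_;
  int k_ = 0;
};

// Walks a row-major GemmK x kN matrix B in slices of kK rows.
class IteratorB {
 public:
  IteratorB(const float* ptr, int ldb) : ptr_(ptr), ldb_(ldb) {}
  void load(FragmentB& frag) const {
    for (int k = 0; k < kK; ++k)
      for (int n = 0; n < kN; ++n)
        frag.v[k * kN + n] = ptr_[(k_ + k) * ldb_ + n];
  }
  IteratorB& operator++() { k_ += kK; return *this; }
 private:
  const float* ptr_;
  int ldb_;
  int k_ = 0;
};

// d = a * b + c for one K slice; d and c may be the same object.
void mma(FragmentC& d, const FragmentA& a, const FragmentB& b,
         const FragmentC& c) {
  for (int m = 0; m < kM; ++m)
    for (int n = 0; n < kN; ++n) {
      float sum = c.v[m * kN + n];
      for (int k = 0; k < kK; ++k) sum += a.v[m * kK + k] * b.v[k * kN + n];
      d.v[m * kN + n] = sum;
    }
}

// Same loop structure as the slide: load, advance, multiply-accumulate.
FragmentC gemm_tile(const std::vector<float>& A, const std::vector<float>& B,
                    int gemm_k) {
  assert(gemm_k > 0 && gemm_k % kK == 0);
  assert(A.size() == static_cast<std::size_t>(kM) * gemm_k);
  assert(B.size() == static_cast<std::size_t>(gemm_k) * kN);
  IteratorA iter_A(A.data(), gemm_k);
  IteratorB iter_B(B.data(), kN);
  FragmentA frag_A;
  FragmentB frag_B;
  FragmentC accum;
  accum.clear();
  for (int k = 0; k < gemm_k; k += kK) {
    iter_A.load(frag_A);
    iter_B.load(frag_B);
    ++iter_A; ++iter_B;
    mma(accum, frag_A, frag_B, accum);
  }
  return accum;
}
```

Calling `gemm_tile(A, B, gemm_k)` with a `kM x gemm_k` matrix A and a `gemm_k x kN` matrix B returns the accumulated `kM x kN` product. The point of the analogy is the separation of concerns: the main loop never computes an address itself, so the memory layout can change inside the iterators without touching the loop.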
![Page 58: DEVELOPING CUDA KERNELS TO PUSH TENSOR ......32 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 s GEMM K Mixed Precision Floating Point CUTLASS 2.2 - CUDA 11 Toolkit –NVIDIA A100 Double Precision](https://reader036.vdocuments.us/reader036/viewer/2022071510/612de4871ecc5158694278b4/html5/thumbnails/58.jpg)
58
CUTLASS ON NVIDIA A100
![Page 59: DEVELOPING CUDA KERNELS TO PUSH TENSOR ......32 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 s GEMM K Mixed Precision Floating Point CUTLASS 2.2 - CUDA 11 Toolkit –NVIDIA A100 Double Precision](https://reader036.vdocuments.us/reader036/viewer/2022071510/612de4871ecc5158694278b4/html5/thumbnails/59.jpg)
59
CUTLASS RELATIVE PERFORMANCE TO CUBLAS
CUTLASS 2.2 – CUDA 11 Toolkit – NVIDIA A100
[Bar chart. Legend: cuBLAS, CUTLASS. Y-axis: 0% to 100%. X-axis: layouts NN, NT, TN, TT for each of DGEMM, IGEMM, SGEMM, TensorOp (f16), TensorOp (f32), TensorOp (TF32). Bar labels in reading order: 99%, 99%, 98%, 99%, 95%, 93%, 96%, 95%, 98%, 97%, 98%, 97%, 94%, 90%, 97%, 99%, 90%, 90%, 90%, 90%, 80%, 83%, 80%, 80%.]
![Page 60: DEVELOPING CUDA KERNELS TO PUSH TENSOR ......32 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 s GEMM K Mixed Precision Floating Point CUTLASS 2.2 - CUDA 11 Toolkit –NVIDIA A100 Double Precision](https://reader036.vdocuments.us/reader036/viewer/2022071510/612de4871ecc5158694278b4/html5/thumbnails/60.jpg)
60
CUTLASS RELATIVE PERFORMANCE TO CUBLAS
CUTLASS 2.2 – CUDA 11 Toolkit – Three generations of GPU architectures
[Bar chart. Legend: cuBLAS, CUTLASS on TitanV, 2080Ti, and A100. Y-axis: 0% to 100%. X-axis: layouts NN, NT, TN, TT for each of DGEMM, IGEMM, SGEMM, TensorOp (f16), TensorOp (f32), TensorOp (TF32).]
![Page 61: DEVELOPING CUDA KERNELS TO PUSH TENSOR ......32 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 s GEMM K Mixed Precision Floating Point CUTLASS 2.2 - CUDA 11 Toolkit –NVIDIA A100 Double Precision](https://reader036.vdocuments.us/reader036/viewer/2022071510/612de4871ecc5158694278b4/html5/thumbnails/61.jpg)
61
ARBITRARY PROBLEM SIZE
CUTLASS Templates Cover the Design Space
[Chart. Y-axis: GFLOP/s, 0 to 250,000. X-axis: GEMM K, 64 to 4096 in steps of 96. Series: 128b alignment, 64b alignment, 32b alignment, 16b alignment, CUDA 10.2 and before.]
CUTLASS 2.2 – NVIDIA A100 – Tensor Cores: F16 * F16 + F32
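One way to read the alignment series is as the widest vector access that a problem's contiguous dimension permits. The sketch below picks that width for a given extent and element size, assuming accesses of 128, 64, 32, or 16 bits and a suitably aligned base pointer. The function name is hypothetical, and this is not how CUTLASS itself selects kernels.

```cpp
#include <cassert>

// Widest access, in bits, that evenly divides a contiguous run of `extent`
// elements of `element_bits` bits each. Capped at 128 bits and never
// narrower than one element.
int max_access_bits(long long extent, int element_bits) {
  assert(extent > 0);
  assert(element_bits > 0 && element_bits <= 128 && 128 % element_bits == 0);
  const long long total_bits = extent * element_bits;
  int bits = 128;
  while (bits > element_bits && total_bits % bits != 0) {
    bits /= 2;  // fall back to the next narrower access
  }
  return bits;
}
```

For 16-bit elements, an extent that is a multiple of 8 allows 128-bit accesses, a multiple of 4 allows 64-bit, a multiple of 2 allows 32-bit, and an odd extent falls back to 16-bit.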
![Page 62: DEVELOPING CUDA KERNELS TO PUSH TENSOR ......32 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 s GEMM K Mixed Precision Floating Point CUTLASS 2.2 - CUDA 11 Toolkit –NVIDIA A100 Double Precision](https://reader036.vdocuments.us/reader036/viewer/2022071510/612de4871ecc5158694278b4/html5/thumbnails/62.jpg)
62
CONCLUSION
![Page 63: DEVELOPING CUDA KERNELS TO PUSH TENSOR ......32 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 s GEMM K Mixed Precision Floating Point CUTLASS 2.2 - CUDA 11 Toolkit –NVIDIA A100 Double Precision](https://reader036.vdocuments.us/reader036/viewer/2022071510/612de4871ecc5158694278b4/html5/thumbnails/63.jpg)
63
CONCLUSION: NVIDIA A100 IS FAST AND PROGRAMMABLE
Tensor Cores on NVIDIA A100 in CUDA
• Order of magnitude speedup for matrix computations
• Programmable in CUDA via mma.sync with zero overhead
• Kernel design can avoid memory bottlenecks
• CUDA 11 Toolkit capable of near-peak performance
CUTLASS 2.2: May 2020
• Open source CUDA C++ template library for CUDA development
• Reusable building blocks for utilizing Tensor Cores on NVIDIA GPUs
• Near-optimal performance on NVIDIA Ampere Architecture
Try it out! https://github.com/NVIDIA/cutlass
![Page 64: DEVELOPING CUDA KERNELS TO PUSH TENSOR ......32 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 s GEMM K Mixed Precision Floating Point CUTLASS 2.2 - CUDA 11 Toolkit –NVIDIA A100 Double Precision](https://reader036.vdocuments.us/reader036/viewer/2022071510/612de4871ecc5158694278b4/html5/thumbnails/64.jpg)
64
REFERENCES
NVIDIA Ampere Architecture:
“Inside the NVIDIA Ampere Architecture” (GTC 2020 – S21730)
“NVIDIA Ampere Architecture In-Depth” (blog post)
“CUDA New Features and Beyond” (GTC 2020 – S21760)
“Tensor Core Performance on NVIDIA GPUs” (GTC 2020 – S21929)
“Inside the Compilers, Libraries and Tools for Accelerated Computing” (GTC 2020 – S21766)
CUTLASS:
https://github.com/NVIDIA/cutlass (open source software, New BSD license)
GTC 2018 and GTC 2019 talks: GEMM structure and Volta Tensor Cores
CUTLASS Parallel For All blog post
![Page 65: DEVELOPING CUDA KERNELS TO PUSH TENSOR ......32 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 s GEMM K Mixed Precision Floating Point CUTLASS 2.2 - CUDA 11 Toolkit –NVIDIA A100 Double Precision](https://reader036.vdocuments.us/reader036/viewer/2022071510/612de4871ecc5158694278b4/html5/thumbnails/65.jpg)