Accelerator Evaluation on Real Edge-Inference Applications
Vinay Mehta, Inference Technical Marketing Manager, Flex Logix Technologies, Inc.
Linley Spring Processor Conference, April 6-9, 2020, Santa Clara, CA
InferX™ X1
• 54 mm² in TSMC 16FFC
• 933 MHz operation
• 4K MACs @ INT8; 2K MACs @ BF16; Winograd acceleration for INT8
• 8MB distributed L2 SRAM + 4MB L3 SRAM
• x32 LPDDR4 (14.9 GB/s peak bandwidth; sanity-checked below)
• 13.5 W (max)
• Partners: TSMC, GUC, Synopsys, Arteris, Analog Bits, Cadence, Mentor
• Available as chip & PCIe card in Q3
[Block diagram: 4K MACs with 8MB distributed L2 SRAM, 4MB L3 SRAM, eFPGA, x32 GPIO, x32 LPDDR4, host PCIe Gen3/4 x4]
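The quoted peak bandwidth is consistent with an LPDDR4-3733 data rate; the slide states only the result, so the transfer rate in this sketch is an assumption.

```python
# Hedged back-of-envelope check of the quoted 14.9 GB/s peak DRAM bandwidth.
# The slide gives only the result; 3733 MT/s (LPDDR4-3733) is an assumed rate.
interface_bits = 32              # x32 LPDDR4 interface
transfers_per_s = 3733e6         # assumption: LPDDR4-3733 speed grade
peak_gb_s = interface_bits / 8 * transfers_per_s / 1e9
print(f"peak bandwidth ~ {peak_gb_s:.1f} GB/s")   # ~14.9 GB/s
```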
Segmenting Edge Customers by CNN Complexity
All share real-time requirements (streaming: batch=1, low latency) with large input sizes.

Segments: Perception, Learned DSP
Example applications: medical segmentation and classification, image denoising, CCTV with shoplifting alerts, attention monitoring for ADAS, quality assurance and inspections
(Icons by Freepik from www.flaticon.com)
Evaluating Like a Customer
• Choose a representative workload
• Be clear with metrics (e.g., latency and throughput)
• Tie performance back to the design requirements (a minimal harness is sketched below)
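To make these bullets concrete, here is a minimal sketch of such an evaluation loop, assuming a hypothetical `infer(frame)` callable; it reports batch=1 latency percentiles alongside sustained throughput, the two metrics the deck recommends stating explicitly.

```python
# Minimal customer-style benchmark harness. `infer` is a hypothetical
# stand-in for whatever runtime call actually executes the model.
import time
import statistics

def benchmark(infer, frames, warmup=10):
    for f in frames[:warmup]:                    # warm up caches/compilers before timing
        infer(f)
    latencies = []
    start = time.perf_counter()
    for f in frames[warmup:]:                    # single-stream (batch=1) timing
        t0 = time.perf_counter()
        infer(f)
        latencies.append((time.perf_counter() - t0) * 1000)
    elapsed = time.perf_counter() - start
    print(f"p50 latency: {statistics.median(latencies):.1f} ms")
    print(f"p99 latency: {sorted(latencies)[int(0.99 * len(latencies))]:.1f} ms")
    print(f"throughput:  {len(latencies) / elapsed:.1f} inf/s")
```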
Customers’ Evaluation Involves More Than Performance
                  TDP         Die Size
InferX X1         7-13.5 W    54 mm²
Nvidia Xavier NX  15 W        350 mm²
Nvidia Tesla T4   75 W        545 mm²

[Photos shown at relative scale: 21 mm, 70 mm, 175 mm]
Right Benchmark: Characterizing Models (Actual Model Still Best!)
[Chart: Arithmetic Intensity Across Common CNNs; y-axis: operations per input pixel-channel (0-75,000); models: MobileNet, ResNet-50, Inception v4, YOLOv3]

[Chart: ResNet-50, Relative Memory Footprint; y-axis: megabytes (0-40); x-axis: input size in pixels² (200-1440); series: Weights (total), Weights (max layer)]
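The first metric can be computed directly from layer shapes. A sketch follows; the layer dimensions are illustrative, not those of any charted model.

```python
# Hedged sketch of "operations per input pixel-channel": total conv ops
# divided by input pixel-channels. Layer shapes below are made up.
def conv_ops(h_out, w_out, c_in, c_out, k):
    """Ops (2 per MAC) for one KxK conv layer; stride folded into h_out/w_out."""
    return 2 * h_out * w_out * c_in * c_out * k * k

layers = [                       # (h_out, w_out, c_in, c_out, kernel)
    (304, 304, 3,   32, 3),
    (152, 152, 32,  64, 3),
    (76,  76,  64, 128, 3),
]
total_ops = sum(conv_ops(*layer) for layer in layers)
pixel_channels = 608 * 608 * 3   # input H x W x C
print(f"ops per input pixel-channel ~ {total_ops / pixel_channels:,.0f}")
```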
Reporting Benchmarks: Single Stream vs Pooling

YOLOv3-1440 INT8, b=1 on Nvidia Jetson NX

Engine   Latency (ms)   FPS
DLA_0    290            3.4
DLA_1    290            3.4
GPU      95             10.5
Combined claim: "17.3 FPS" (sum of per-engine FPS)

[Timeline, 0-400 ms: images 1-2 on DLA_0, images 3-4 on DLA_1, images 5-8 on GPU; assumes a pool of data ready to dispatch]
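The "17.3 FPS" headline is just the per-engine throughputs summed, which only holds when a backlog of images keeps every engine busy:

```python
# Pooled-throughput arithmetic behind the "17.3 FPS" claim: with a deep pool
# of pending images, independent engines' throughputs simply add.
latency_ms = {"DLA_0": 290.0, "DLA_1": 290.0, "GPU": 95.0}   # batch=1 latencies
pooled_fps = sum(1000.0 / ms for ms in latency_ms.values())
print(f"pooled: {pooled_fps:.1f} FPS")   # ~17.4; rounding per engine gives 17.3
```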
Reporting Benchmarks: Input Data is a Stream

YOLOv3-1440 INT8, b=1 on Nvidia Jetson NX

Engine   Latency (ms)   FPS
DLA_0    290            3.4
DLA_1    290            3.4
GPU      95             10.5
Combined claim: "17.3 FPS" (sum of per-engine FPS)

Can I run inference on a 15 FPS sensor?
Reporting Benchmarks: Input Data is a Stream

YOLOv3-1440 INT8, b=1 on Nvidia Jetson NX

Engine   Latency (ms)   FPS
DLA_0    290            3.4
DLA_1    290            3.4
GPU      95             10.5
Combined claim: "17.3 FPS" (sum of per-engine FPS)

Can I run inference on a 15 FPS sensor? (No, not as you expect.)

[Timeline, 0-200 ms, one frame every 67 ms: frames alternate across DLA_0, DLA_1, and GPU; images buffer while engines are busy, and engines idle between frames]
Throughput Does Not Correspond to Effective Latency

YOLOv3-1440 INT8, b=1 on Nvidia Jetson NX

Engine   Latency (ms)   FPS
DLA_0    290            3.4
DLA_1    290            3.4
GPU      95             10.5

• Cannot use all available resources on the same inference
• Difficult to schedule processing engines
• Accessible performance is demonstrated by latency (see the simulation sketch below)
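A small event-driven sketch shows why pooled FPS hides the latency a 15 FPS stream actually experiences; dispatch-to-earliest-free-engine is an assumption for illustration, not the real NX scheduler.

```python
# Hedged simulation: a 15 FPS sensor feeding the three NX engines one frame
# at a time, using the batch=1 latencies from the table above.
FRAME_MS = 1000.0 / 15                      # ~66.7 ms between frames
latency_ms = {"DLA_0": 290.0, "DLA_1": 290.0, "GPU": 95.0}
free_at = {name: 0.0 for name in latency_ms}

for i in range(8):
    arrival = i * FRAME_MS
    engine = min(free_at, key=free_at.get)  # assumed earliest-free dispatch
    start = max(arrival, free_at[engine])   # frame buffers while engine is busy
    free_at[engine] = start + latency_ms[engine]
    print(f"frame {i}: {engine:5} buffered {start - arrival:5.1f} ms, "
          f"effective latency {free_at[engine] - arrival:5.1f} ms")
```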
Datacenter vs Edge Benchmarks: Summary

Datacenter Inference: Out-of-Order, No Fixed FPS
[Timeline, 0-400 ms: images 1-2 on DLA_0, images 3-4 on DLA_1, images 5-8 on GPU; assumes a pool of data ready to dispatch]

Edge Data: Single Stream, 15 FPS
[Timeline, 0-200 ms, one frame every 67 ms: frames alternate across DLA_0, DLA_1, and GPU; images buffer while engines are busy, and engines idle between frames]
Real World Benchmark: Latency (If Power and Cost Didn't Matter…)

[Bar charts: latency in ms, lower is better; bars for X1, NX: GPU, T4, and NX: DLA (no NX: DLA bar for Model X). Panels: Customer Model X (Bfloat16), 0-14 ms scale; Customer Model Z (Bfloat16), 0-180 ms; YOLOv3, 1440 (INT8), 0-300 ms; YOLOv3, 608 (INT8), 0-60 ms. Underlying numbers are in the appendix.]
InferX X1 Has Superior Performance for the Price

[Bar charts: throughput / die size, higher is better (y-axis 0-1.2); bars for X1, NX: GPU, T4, and NX: DLA (no NX: DLA bar for Model X). Panels: Customer Model X (Bfloat16), Customer Model Z (Bfloat16), YOLOv3, 608 (INT8), YOLOv3, 1440 (INT8).]
Key to X1 Efficiency is in Data Packing

[Diagram: Layer 1 activations packed from 3D to 2D to 1D space, feeding Layer 2's activation space. Activation values are numbered 1-12; 3x3 Window 1 covers values 1-9 and 3x3 Window 2 covers values 4-12, so consecutive windows share six of their nine values.]
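A toy NumPy sketch of the packing idea, reusing the 1-12 numbering from the diagram: consecutive 3x3 windows overlap in six of nine values, so a packed layout can stream one new column per step and reuse the rest.

```python
# Toy illustration of activation packing: consecutive 3x3 windows over a
# packed 2D activation map share two of their three columns.
import numpy as np

act = np.arange(1, 13).reshape(4, 3).T   # 3x4 map; column k holds 3k+1..3k+3
window_1 = act[:, 0:3]                   # values 1-9
window_2 = act[:, 1:4]                   # values 4-12: six values reused from window_1
print(window_1, window_2, sep="\n\n")
```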
X1 Flexibility Achieves Efficiency Where Others Can't, e.g., 3D Convolutions

[Diagram: mapping of 2D convolution vs. 3D convolution]
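For concreteness, here is the 2D-vs-3D distinction in framework terms (PyTorch is my choice here, not the deck's): a 3D convolution slides over an extra depth/time axis, which fixed-function 2D datapaths often cannot map efficiently.

```python
# 2D vs 3D convolution shapes; the extra D axis (e.g., video frames) is what
# a rigid 2D engine struggles to map.
import torch
import torch.nn as nn

x2d = torch.randn(1, 16, 128, 128)       # N, C, H, W
x3d = torch.randn(1, 16, 8, 128, 128)    # N, C, D, H, W (e.g., 8 video frames)

print(nn.Conv2d(16, 32, kernel_size=3, padding=1)(x2d).shape)  # [1, 32, 128, 128]
print(nn.Conv3d(16, 32, kernel_size=3, padding=1)(x3d).shape)  # [1, 32, 8, 128, 128]
```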
InferX™ X1
• 54 mm² in TSMC 16FFC
• 933 MHz operation
• 4K MACs @ INT8; 2K MACs @ BF16; Winograd acceleration for INT8
• 8MB distributed L2 SRAM + 4MB L3 SRAM
• x32 LPDDR4 (14.9 GB/s peak bandwidth)
• 13.5 W (max)
• Partners: TSMC, GUC, Synopsys, Arteris, Analog Bits, Cadence, Mentor
• Available as chip & PCIe card in Q3
[Block diagram: 4K MACs with 8MB distributed L2 SRAM, 4MB L3 SRAM, eFPGA, x32 GPIO, x32 LPDDR4, host PCIe Gen3/4 x4]
Appendix
Benchmark Data
Latency (ms)
Model                         NX-DLA(1)  NX-DLA(2)  NX-GPU  T4     X1*
Customer Model Z (BF/FP16)    157.8      163.1      53.5    15.54  35.7
Customer Model X (BF/FP16)    -          -          12.3    2.01   1.1
YOLOv3-608 (INT8)             50.5       53.7       18.3    4.2    18.5
YOLOv3-1440 (INT8)            279.2      289.8      94.9    19.3   108.6

Throughput (inf/s)
Model                         NX-DLA(1)  NX-DLA(2)  NX-GPU  T4     X1*
Customer Model Z (BF/FP16)    6.3        12.3       18.7    64.4   28.0
Customer Model X (BF/FP16)    -          -          81.3    497.5  909.1
YOLOv3-608 (INT8)             19.8       37.2       54.6    238.1  54.1
YOLOv3-1440 (INT8)            3.6        6.9        10.5    51.8   9.2

Throughput / Die Size
Model                         NX-DLA(1)  NX-GPU  T4
Customer Model Z (BF/FP16)    0.0349     0.1030  0.2276
Customer Model X (BF/FP16)    -          0.0138  0.0542
YOLOv3-608 (INT8)             0.0565     0.1560  0.4364
YOLOv3-1440 (INT8)            0.0600     0.1766  0.5575

*performance estimate