dlu: deep learning unit · dlu deep learning unit host i/f inter-chip i/f hbm2 dpu-0 dpu-1 dpu dpe...
TRANSCRIPT
DLU™ : Deep Learning UnitISC 2018
Copyright 2018 FUJITSU LIMITED0
Copyright 2018 FUJITSU LIMITED
Fujitsu’s HPC Development Timeline
© RIKEN
The K computer is still competitive in various fields; from advanced research to manufacturing.
K computer
The post-K is under development to achieve superior application performance.
Post-K Computer
HPCGNo.1(2016)
Graph500No.1(2016)
Gordon Bell Prize Finalist
(2016)
DLU is a processor designed for deep learning that has the ability to handle large-scale neural networks.
Deep Learning Unit (DLU™)
1
Fujitsu Processor DevelopmentPerpetual Evolution > 60 years: Always Targeting No.1
2000~2003
SPARC64
SPARC64
II
SPARC64
V
SPARC64
GP
GS8900
GS21
600
GS8600
GS8800B
SPARC64
VII
GS21
1600
SPARC64
V+
SPARC64
VI
GS8800
GS21
900
Mainframe
Pe
rform
an
ce
Re
liab
ility
Store Ahead
Branch History
Prefetch
Single-chip CPU
Non-Blocking $
O-O-O Execution
Super-Scalar
L2$ on Die
HPC-ACE
System on Chip
Hardware Barrier
Multi-core Multi-thread
2004~2007 2008~2011
SPARC64
GP
2012~2015 2016~
SPARC64
IXfx
SPARC64
VIIIfx
Virtual Machine Architecture
Software on Chip
High-speed Interconnect
SPARC64
X+
130nm
250nm /
220nm
180nm
:Technology generation
90nm
350nm
28nm
Tr=1B
CMOS Cu
40nm
65nm
HPCUNIX
$ ECC
Register/ALU Parity
Instruction Retry
$ Dynamic Degradation
Error Checkers/History
Mainframe/UNIX/HPC + AIincremental development
GS21
2600
45nm
40nm
Next
GS
SPARC64
XIfx
SPARC64
X
20nm
DLU
SPARC64
XII
Post-K
ARM
AI
2
Next
SPARC
Copyright 2018 FUJITSU LIMITED
DigitalAnnealer
Features
• Architecture designed for deep learning
• Low-power consumption design
Goal: 10x Performance / Watt compared to competitors
• Scalable design with Tofu interconnect technology
Ability to handle large-scale neural networks
Utilizing technologies derived from the K computer
DLU: Processor Designed for Deep Learning
DLUDeep Learning Unit
3 Copyright 2018 FUJITSU LIMITED
What’s the New Architecture for the DLU?
Copyright 2018 FUJITSU LIMITED4
Domain specific, Optimal precision, and Massively parallel.
High Precision
General Use
Conventional Architecture The New Architecture
2. Optimal Precision
1. Domain Specific
Sequential+ Parallel
3. Massively Parallel
Many cores w/ on-chip networkMultiple strong cores
Double/Single precision FP Deep Learning Integer
Complicated O-O-O coresw/ cache memory
Domain specific coresw/ large register file
DLU Architecture
ISA: Newly developed for deep learning
Micro-Architecture
• Simple pipeline to remove HW complexity
• On-chip network to share data between DPUs
Utilizes Fujitsu’s HPC experience, such as high density FMAs and high speed interconnect
Maximizes performance / watt
Copyright 2018 FUJITSU LIMITED
Fujitsu’s interconnect technologyLarge scale DLU interconnect through off-chip network
DLU Deep Learning Unit
Host I/F
Inter-chipI/F
HBM2
DPU-0
DPU-1
DPU
DPE DPE DPE
DPE DPE DPE
DPE DPE DPE
DPU
DPU
DPU-n
DPE DPE DPE
DPE DPE DPE
DPE DPE DPE
DPU: Deep learning Processing Unit,DPE: Deep learning Processing Element
On-chip networkNew ISA for deep learningHigh density FMA
5
DPU: Execution Wide SIMD execution and
Large Register File. Execute DL operations based
on master core’s control
Heterogeneous Cores and Large Register File
Copyright 2018 FUJITSU LIMITED6
DPU
DPU
DPU
DPU
DPU
DPU
DPU
DPUMaster
MemoryMemory
Controller
Instructions/Data …
Master Core:Memory Access andDPU control
• Push & Pull instructions and data for DLUs.• Start/stop execution of DLUs
The combination of few large core (Master) and many small execution cores (DPU) results in more performance with less power consumption, compared to a conventional homogeneous structure
Large
RegFile
Large
RegFile
Large
RegFile
Large
RegFile
Large
RegFile
Large
RegFile
Large
RegFile
Large
RegFile
Wid
e
SIM
D
Wid
e
SIM
D
Wid
e
SIM
D
Wid
e
SIM
D
Wid
e
SIM
D
Wid
e
SIM
D
Wid
e
SIM
D
Wid
e
SIM
D
How to utilize many DPUs e.g. convolution layer
one CH-out / DPUmultiple batch / DPU
CH-in CH-out
DPE & Large RF (Register File)
Copyright 2018 FUJITSU LIMITED7
DPU
CNTL
DPU: 128 SIMD* / 16DPE
DPU consists of 16 DPEs connected with on-chip network
DPE incudes large RF and wide SIMD execution units to realize an efficient Deep Learning engine. RF is fully SW controllable unlike cache to extract full HW potential
DPE: 8SIMD* with large RF(~100x of typical CPU core)
ExecUNIT
ExecUNIT
RF
ExecUNIT
RF
ExecUNIT
RF
ExecUNIT
RF
ExecUNIT
RF
ExecUNIT
RF
ExecUNIT
RF
* For FP32
Register File
Register File
Register File
Register File
Register File
Register File
Register File
Register File
Register File
Register File
Register File
Register File
Register File
Register File
Register File
Register File
Register File
Name RF/$ structure
UNIX SPARC64 XII RF + $
HPC SPARC64 XIfx RF + sector $
AI DLU Large RF
More SW
controllability
Domain specific architecture - Why cache memory removed -
Complex hardware to achieve high performance for any applications with various data access patterns
E.g.
Large cache memory with cache tags and LRU replacement Unit
Hardware Prefetch Engine
General Processors DLU (Domain Specific)
Simple hardware focusing on simple memory access patterns
E.g. Convolution Layer
CH-in data can be shared among CH-out calculation at all DPUs
Memory access patterns are continuously and predictable (software controllable)
L2 $L2 $
Memory
Copyright 2018 FUJITSU LIMITED8
DLINT : Deep Learning Integer
Copyright 2018 FUJITSU LIMITED9
Fujitsu’s “DLINT” realizes necessary accuracy for Deep Learning with only a 16 or 8 bits data size (i.e. less power consumption compared with FP32)
Training results with DLINT8/16 can be converted to the conventional 8/16-bit INT for inference.
Data Size
Effective Precision
FP32
16-bit
INT
8-bit
INT
Required Precision for Deep Learning
>INT8,16
Accumulator
INT16
INT8 INT8
INT16
INT8 INT8
FP16
FP16 FP16
FP32
Deep Learning
Integer
Data Size and Precision DLU Data Type
INT16
INT8 INT8
INT16
INT8 INT8
Small Large
16/8bit area with
minimum accuracy
loss
HW gathered
statistics
Accuracy of Deep Learning Integer
10
(*) ImageNet(subset): image size=96x96, #categories=25
FP32
Deep Learning Integer (16bit/8bit)
DLINT has shown similar accuracy with FP32 precision
INT8
INT8
Deep Learning Integer (8bit)
FP32
Deep Learning Integer (16bit)
Copyright 2018 FUJITSU LIMITED
DLU Roadmap
Multiple generations of DLUs over time, as we currently do for HPC/UNIX/Mainframe processors
Copyright 2018 FUJITSU LIMITED
* Subject to change without notice
Needs Host CPU Inter-DLU Direct
Connection
Embedded Host CPU Neuro Computing Combinational Optimization
Architecture
The 1st
GenerationThe 2nd
Generation Future
Performance / Watt
11
AI will be accelerated by three technologies
Copyright 2018 FUJITSU LIMITED12
Quantum
Computing
Deep
LearningHPC“K” Supercomputer & FX100will provide simulation andpre-processing technology
Zinrai Deep Learning & DLUwill offer a high-speed learning environment
Digital Annealer will identify combinational optimal solutions
Three world-class advanced technologies together will contribute to expansion of customer business
Copyright 2017 FUJITSU LIMITED13