dlu: deep learning unit · dlu deep learning unit host i/f inter-chip i/f hbm2 dpu-0 dpu-1 dpu dpe...

DLU™ : Deep Learning UnitISC 2018

Copyright 2018 FUJITSU LIMITED0

Copyright 2018 FUJITSU LIMITED

Fujitsu’s HPC Development Timeline

© RIKEN

The K computer is still competitive in various fields; from advanced research to manufacturing.

K computer

The post-K is under development to achieve superior application performance.

Post-K Computer

HPCGNo.1(2016)

Graph500No.1(2016)

Gordon Bell Prize Finalist

(2016)

DLU is a processor designed for deep learning that has the ability to handle large-scale neural networks.

Deep Learning Unit (DLU™)

1

Fujitsu Processor DevelopmentPerpetual Evolution > 60 years: Always Targeting No.1

2000～2003

SPARC64

SPARC64

II

SPARC64

V

SPARC64

GP

GS8900

GS21

600

GS8600

GS8800B

SPARC64

VII

GS21

1600

SPARC64

V+

SPARC64

VI

GS8800

GS21

900

Mainframe

Pe

rform

an

ce

Re

liab

ility

Store Ahead

Branch History

Prefetch

Single-chip CPU

Non-Blocking $

O-O-O Execution

Super-Scalar

L2$ on Die

HPC-ACE

System on Chip

Hardware Barrier

Multi-core Multi-thread

2004～2007 2008～2011

SPARC64

GP

2012～2015 2016～

SPARC64

IXfx

SPARC64

VIIIfx

Virtual Machine Architecture

Software on Chip

High-speed Interconnect

SPARC64

X+

130nm

250nm /

220nm

180nm

:Technology generation

90nm

350nm

28nm

Tr=1B

CMOS Cu

40nm

65nm

HPCUNIX

$ ECC

Register/ALU Parity

Instruction Retry

$ Dynamic Degradation

Error Checkers/History

Mainframe/UNIX/HPC + AIincremental development

GS21

2600

45nm

40nm

Next

GS

SPARC64

XIfx

SPARC64

X

20nm

DLU

SPARC64

XII

Post-K

ARM

AI

2

Next

SPARC


DigitalAnnealer

Features

• Architecture designed for deep learning

• Low-power consumption design

Goal: 10x Performance / Watt compared to competitors

• Scalable design with Tofu interconnect technology

Ability to handle large-scale neural networks

Utilizing technologies derived from the K computer

DLU: Processor Designed for Deep Learning

DLUDeep Learning Unit

3 Copyright 2018 FUJITSU LIMITED

What’s the New Architecture for the DLU?


Domain specific, Optimal precision, and Massively parallel.

High Precision

General Use

Conventional Architecture The New Architecture

2. Optimal Precision

1. Domain Specific

Sequential+ Parallel

3. Massively Parallel

Many cores w/ on-chip networkMultiple strong cores

Double/Single precision FP Deep Learning Integer

Complicated O-O-O coresw/ cache memory

Domain specific coresw/ large register file

DLU Architecture

ISA: Newly developed for deep learning

Micro-Architecture

• Simple pipeline to remove HW complexity

• On-chip network to share data between DPUs

Utilizes Fujitsu’s HPC experience, such as high density FMAs and high speed interconnect

Maximizes performance / watt


Fujitsu’s interconnect technologyLarge scale DLU interconnect through off-chip network

DLU Deep Learning Unit

Host I/F

Inter-chipI/F

HBM2

DPU-0

DPU-1

DPU

DPE DPE DPE

DPE DPE DPE

DPE DPE DPE

DPU

DPU

DPU-n

DPE DPE DPE

DPE DPE DPE

DPE DPE DPE

DPU: Deep learning Processing Unit,DPE: Deep learning Processing Element

On-chip networkNew ISA for deep learningHigh density FMA

5

DPU: Execution Wide SIMD execution and

Large Register File. Execute DL operations based

on master core’s control

Heterogeneous Cores and Large Register File


DPU

DPU

DPU

DPU

DPU

DPU

DPU

DPUMaster

MemoryMemory

Controller

Instructions/Data …

Master Core:Memory Access andDPU control

• Push & Pull instructions and data for DLUs.• Start/stop execution of DLUs

The combination of few large core (Master) and many small execution cores (DPU) results in more performance with less power consumption, compared to a conventional homogeneous structure

Large

RegFile

Large

RegFile

Large

RegFile

Large

RegFile

Large

RegFile

Large

RegFile

Large

RegFile

Large

RegFile

Wid

e

SIM

D

Wid

e

SIM

D

Wid

e

SIM

D

Wid

e

SIM

D

Wid

e

SIM

D

Wid

e

SIM

D

Wid

e

SIM

D

Wid

e

SIM

D

How to utilize many DPUs e.g. convolution layer

one CH-out / DPUmultiple batch / DPU

CH-in CH-out

DPE & Large RF (Register File)


DPU

CNTL

DPU: 128 SIMD* / 16DPE

DPU consists of 16 DPEs connected with on-chip network

DPE incudes large RF and wide SIMD execution units to realize an efficient Deep Learning engine. RF is fully SW controllable unlike cache to extract full HW potential

DPE: 8SIMD* with large RF(~100x of typical CPU core)

ExecUNIT

ExecUNIT

RF

ExecUNIT

RF

ExecUNIT

RF

ExecUNIT

RF

ExecUNIT

RF

ExecUNIT

RF

ExecUNIT

RF

* For FP32

Register File

Register File

Register File

Register File

Register File

Register File

Register File

Register File

Register File

Register File

Register File

Register File

Register File

Register File

Register File

Register File

Register File

Name RF/$ structure

UNIX SPARC64 XII RF + $

HPC SPARC64 XIfx RF + sector $

AI DLU Large RF

More SW

controllability

Domain specific architecture - Why cache memory removed -

Complex hardware to achieve high performance for any applications with various data access patterns

E.g.

Large cache memory with cache tags and LRU replacement Unit

Hardware Prefetch Engine

General Processors DLU (Domain Specific)

Simple hardware focusing on simple memory access patterns

E.g. Convolution Layer

CH-in data can be shared among CH-out calculation at all DPUs

Memory access patterns are continuously and predictable (software controllable)

L2 $L2 $

Memory


DLINT : Deep Learning Integer


Fujitsu’s “DLINT” realizes necessary accuracy for Deep Learning with only a 16 or 8 bits data size (i.e. less power consumption compared with FP32)

Training results with DLINT8/16 can be converted to the conventional 8/16-bit INT for inference.

Data Size

Effective Precision

FP32

16-bit

INT

8-bit

INT

Required Precision for Deep Learning

>INT8,16

Accumulator

INT16

INT8 INT8

INT16

INT8 INT8

FP16

FP16 FP16

FP32

Deep Learning

Integer

Data Size and Precision DLU Data Type

INT16

INT8 INT8

INT16

INT8 INT8

Small Large

16/8bit area with

minimum accuracy

loss

HW gathered

statistics

Accuracy of Deep Learning Integer

10

(*) ImageNet(subset): image size=96x96, #categories=25

FP32

Deep Learning Integer (16bit/8bit)

DLINT has shown similar accuracy with FP32 precision

INT8

INT8

Deep Learning Integer (8bit)

FP32

Deep Learning Integer (16bit)


DLU Roadmap

Multiple generations of DLUs over time, as we currently do for HPC/UNIX/Mainframe processors


* Subject to change without notice

Needs Host CPU Inter-DLU Direct

Connection

Embedded Host CPU Neuro Computing Combinational Optimization

Architecture

The 1st

GenerationThe 2nd

Generation Future

Performance / Watt

11

AI will be accelerated by three technologies


Quantum

Computing

Deep

LearningHPC“K” Supercomputer & FX100will provide simulation andpre-processing technology

Zinrai Deep Learning & DLUwill offer a high-speed learning environment

Digital Annealer will identify combinational optimal solutions

Three world-class advanced technologies together will contribute to expansion of customer business

dlu: deep learning unit · dlu deep learning unit host i/f inter-chip i/f hbm2 dpu-0 dpu-1 dpu dpe...

Documents