ljubisa bajićand jasmina vasiljević · 2020. 8. 18. · dynamic execution 1 10 100 1000 10000...

Compute substrate for Software 2.0Ljubisa Bajić and Jasmina Vasiljević

ML vs.

Moore’s Law compute

ML

1

10

100

1000

10000

100000

1000000

10000000

2019 2020 2021 2022 2023 2024 2025 2026

ML vs. Moore's Law (Optimistic)

ML compute demand Moore's law - 50%/yr

Optical, analog/nanowires

ML compute demand

Moore’s law

10K*Moores Law (cluster)

100K*Moores Law (mega cluster)

Dynamic Execution

1

10

100

1000

10000

100000

1000000

10000000

2019 2020 2021 2022 2023 2024 2025 2026

ML vs. Moore's Law

ML compute demand Moore's law - 50%/yr

Scale helps, but the only

long-term solution

is to change the slope of

the curve

20 Watts

Scale out

Scale Out

Large Clusters are Already the Norm

§ Shared memory architectures could not provide the required scale

§ Many modern neural nets are trained and inferenced on clusters

§ Many nodes with 4-16 GPUs

§ Private memory space, explicit data movement

§ Data parallel at first§ Sidesteps many communication/synchronization issues

§ but model parallel has become necessary§ The full complexity of cluster programming is now exposed

Largest. Clusters. Ever.

§ Networking + compute on each chip

§ Computation directly on packets

§ Packet routing controlled by graph compiler

§ Hundreds of thousands of nodes in cluster

§ One device in Pytorch WH WH WH WH

WH WH WH WH

WH WH WH WH

WH WH WH WH

WH

2U system

42U Rack

Module

Cluster

Shared Memory Machines

§ Each processing unit can see the full memory space

§ A processor needs an array element: just issue a LOAD

§ Tensor manipulations and views mostly reduce to

strided access to same buffer in memory

§ Primary compiler challenge – loop nest optimization4 1

3 2

7 6

5 0

9 2

6 4

5 1

0 3

4

1

7

6

3

2

5

0

9

2

5

1

6

4

0

3

0x0

0x4

0x8

4 1

3 2

7 6

5 0

9 2

6 4

5 1

0 3

4

1

7

6

3

2

5

0

9

2

5

1

6

4

0

3

0x0

0x4

0x8Transpose

Stride 4

Stride 1

Clusters and ML Chips Have Private Memory

§ Data is split up between nodes and no local view exists§ Data transfers explicitly managed

§ Tensor manipulations -> inter-node co-ordination§ Example transpose implementation:

§ Data transfer between 1,0 <-> 0,1

§ Transpose of local tiles

§ Hard challenges:§ Data tiling and parallelization

§ Data transfers, synchronization

§ Complexities with tensor manipulations

§ Memory management

§ We solve them holistically

Tenstorrent confidential

Transpose

4 1

3 2

7 6

5 0

9 2

6 4

5 1

0 3

4

1

3

2

7

6

5

0

9

2

6

4

5

1

0

3

Node 0,0 Node 0,1

Node 1,0 Node 1,1

0x0

0x4

0x8

4

1

3

2

7

6

5 0

9

2

6

4

5

10 3

4

3

1

2

9

6

2

4

7

5

6

0

5

0

1

3

Node 0,0 Node 0,1

Node 1,0 Node 1,1

0x0

0x4

0x8

Need XFER

between cores

Compiler

Runtime Hardware

Dynamic ExecutionWhat is it?

Dynamic Execution

Sparse Compute Runtime CompressionControl Flow

Models that dynamically

choose subsets of blocks to

compute during each pass

fp32 fp32

fp16

8-bit 8-bit

mat

mult

Dynamic Precision

Weights and activations

O(n) Matrix MultiplicationWeights/activations (dense)

Activations (dense) Result (dense) Activations (sparse)

Weights (dense)

Result (sparse)

Weights (dense)

Result (sparse)Activations (dense)

Dense: O(n^3) Sparse: linear speed-up Chained sparse MM:

quadratic speed up

Sparsity Max boost

50% 2X

90% 10X

Sparsity Max boost

50% 4X

90% 100X

O(n) continued

§ Generally applicable

§ works for training and inference (unlike pruning)

§ models with general applicability (like GPT3)

§ Requires models that dynamically choose subsets of blocks to

compute during each pass

§ Mixture of experts

§ LSH

§ Pre-pass based

§ Requires hardware that can realize full speed-up from block sparsity

Result (sparse)

16x16 block

The Full Stack SolutionArchitecture & Software

GrayskullCluster On a Chip

§ 2D grid of cores

§ 120 self contained cores

§ Each core executing independent program

§ Network on chip

§ 2D bi-directional torus

§ Optimized for ML-workload

§ Connectivity

§ PCIe

§ DRAM

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T TP

C

I

E

A

R

C

LPDDR4

T

T

T

T

T

T

T

T

T T T T T T T T T

T T

T T

T T

T T

T T

T T

T T

T T

T

T

T

T

T

T

T

T

T T T

T T T T T T T T T T T T

LPDDR4 LPDDR4 LPDDR4

LPDDR4 LPDDR4 LPDDR4 LPDDR4

NoC BW 330 GB/sec

PCIe Gen3 x16

Off-chip memory LPDDR4

Single Core

§ Packet Compute

§ Vector, SIMD

§ Programmable & flexible compute

§ Sparse compute

§ Packet Manager§ Data transfers & storage

§ Tensor manipulation

§ Dynamic compression

§ Storage§ Local SRAM

§ Access to DRAM

§ 5 RISC cores

§ Powerful single-issue processor

§ Runtime software

Packet

Compute

Packet

Manager

Packet

ManagerSRAM

NoC

Local SRAM1MB

660 GB/sec R/W bw

Compute3 TOPs (8-bit)

0.75 TFLOPs (16-bit)

Data formats

Bfloat, half-float, tf32

8-bit

Several custom formats, <=8-bit fp

Challenges of Connecting Compute Layers

§ Parallelization

§ Splitting tensors amongst the cores

§ Moving tensors between the cores

§ Tensor Manipulation (TM) instructions

§ Reshuffle data in various ways

§ Performance

§ Overlapping compute & transfers

§ Efficient utilization of NoC

§ Efficient utilization of memory bandwidth Compute

A

Compute

B

Tensor Manipulation

instructions

Compute

B

Compute

A

Compute

C

Compute

D

Compound

Complexity

Reshape Transpose Squeeze

The Full Stack Approach

AOT

Compiler

Runtime Hardware

Parallelizes compute &

packetizes tensors

Dynamic memory

management

Packet

manager

Graph CompilerCompilation Into Packets

§ Packetization

§ Packet headers: packet IDs & routing information

§ No pointers, everything is expressed in terms of packet IDs

§ Compute layer parallelization optimized by graph compiler

§ Data movement & synchronization expressed explicitly by compiler

§ NoC is visible to compiler

§ Produces an Instruction Queue for the Packet Manager§ Packets re-ordered by NoC

§ In-line TMs

§ Memory access patterns

Graph Compiler

he

ad

er

payload

Single

packet

Mini-tensors

Tensor

Compute

A

Compute

B

Compute

A

Compute

BReshape Transpose Squeeze

Packet Manager

• Packet Compute Engine

• Programmable device, flexibility

• Computes what Packet Manager feeds it

• Packet header triggers a program for the

packet

Packet

Manager

Packet

Compute

Single Core

NoC

SRAM

DRAM

Packet Manager

• Tensor Manipulation Engine

• Reshape, transpose, concat, slice

• In-line, between compute & memory

• Dynamic packet compression

• Data Transfer Engine

• Multi-core synchronization

• Data dependencies, data hazards, data ready,

memory space ready

• Works with runtime software

• Router

• Moves data across the NoC

• Back-pressure, guaranteed ordering, deadlock free

• Optimized multi-cast and gather operation for ML

workloads

Packet Manager

NoC Router

Tensor

Manipulation

Data Transfer

Engine

Packet

Compute Engine

SRAM

DRAM

Runtime Software

§ Five RISC processors per core

§ Dynamic memory allocation § Runtime buffer (de)allocation

§ Runtime controlled memory target

§ Data locality through SRAM

§ Spills to DRAM and host

§ Control flow§ if-statements, for-loops, while-loops

§ Decisions reflected by jumping around the Instruction Queue executed by Packet Compute and Packet Manager

Packet

Compute

Engine

Packet

Manager

Packet

ManagerSRAM

Flexible Scheduling & Parallelization

Clusters of nodes mapped

spatially onto the cores

pipeline

stage 1

pipeline

stage 2

pipeline

stage 3

pipeline

stage 4

pipeline

stage 1

pipeline

stage 2

pipeline

stage 3

pipeline

stage 4

Pipeline Parallelism Model Parallelism

A

B

C

D

Layers without data

dependencies run

concurrently

CB DA

6 cores

4 cores9 cores

4 cores

Combining multiple parallelization methodsleads to higher utilization of large number of cores,

resulting in higher performance

Pe

rfo

rma

nce

No. of CoresPe

rfo

rma

nce

No. of CoresPe

rfo

rma

nce

No. of CoresPe

rfo

rma

nce

No. of Cores

parallelization #1

parallelization #2

parallelization #3

parallelization #4

The Full Stack Approach

§ High-performance through concurrency

§ Asynchronous cores: flexible parallelization & scheduling

§ Packet manager & Packet Compute overlap data transfers & compute

§ High memory utilization

§ AOT graph compiler can not accurately predict buffer lifetimes

§ Dynamic memory management compensates

§ Dynamic execution

§ Runtime packet compression & data locality benefits

§ Sparse compute

§ Control flow graphs

AOT

Compiler

Runtime Hardware

Parallelizes compute &

packetizes tensors

Dynamic memory

management

Packet

manager

Company Overview, Status&

Plans

Company Overview

§ Tenstorrent

§ Founded in 2016

§ ~70 employees in Toronto and Austin

§ Equal mix of CPU, GPU, FPGA backgrounds

§ Goals and targets

§ ML inference and training

§ Edge to data center

§ General purpose high throughput parallel computation

T

E

Compute core

Ethernet (100G)

DRAM interface

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T TP

C

I

E

A

R

C

LPDDR4

T

T

T

T

T

T

T

T

T T T T T T T T T

T T

T T

T T

T T

T T

T T

T T

T T

T

T

T

T

T

T

T

T

T T T

T T T T T T T T T T T T

LPDDR4 LPDDR4 LPDDR4

LPDDR4 LPDDR4 LPDDR4 LPDDR4

Grayskull (2020)

ML processor

• 8 channels of LPDDR4, PCIE g4 x16

• 4 core OoO ARC CPU, runs linux

• 368 TOPS / 92 TFLOPS, 120MB SRAM

• 65W

Evaluation with multiple large customers

Shipping this fall

Wormhole (2021)

Network switch & ML processor

• Integrated network switch

• 16 ports of 100G ethernet

• 6 channels of GDDR6, PCIE g4 x16


T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

T T T T

G6

P

C

I

E

E

E

E

E

E

E

E

E

E

E

EA

R

C

G6

T

T

T

T

T

T

T

T

T T T T T T T T T

G6E E

G6G6 G6E E

E

Jawbridge (2019)

ML processor

• 1 channels of LPDDR4, PCIE g4 x4


• 4 TOPS / 1 TFLOPS, 6MB SRAM

• 1.5W

T T T

T T T

LPDDR4

P

C

I

E

OOO CPU

65W Grayskull BERT Inference Performance

Workload Score

BERT BASE, SQuAD 1.1, fp16 – no conditionals 2,830

BERT BASE, SQuAD 1.1, fp16 + light conditional execution 10,150

BERT BASE, SQuAD 1.1, mixed precision, moderate conditional execution 23,345

Work in progress, BERT model modified with conditional execution

*

Software: Compiler generality

NLP

§ BERT

§ ALBERT

§ GPT2

§ T5 LM

§ GNMT

§ Transformer

§ Electra

NLP

Key verticals:

Healthcare, Financials,

Ecommerce, Retail

Models ready:

§ BERT

§ ALBERT

§ GPT2

§ T5 LM

§ GNMT

§ Transformer

§ Electra

Vision/Imaging

Key verticals:

Retail, Security,

Automotive

Models ready:

§ Resnet50

§ DeepCoNN

§ Googlenet

§ Densenet

§ Inception

§ Alexnet

§ ResNext

§ SqueezeNet

§ Mobilenet

§ VGG

§ YOLO

§ SSD Resnet34

§ SSD Mobilenet

Others

Key verticals:

Gaming, Entertainment,

social media, ecommerce

Models ready:

§ NCF

§ DLRM

§ Autoencoder

§ Stacked denoising autoencoder

Eval with customers

Public beta on our dev cloud

November 1st

Framework Integration + Deployment

Back End

Graph CompilerOptimizer

Front End

§ Full Pytorch integration

§ Native device

§ Torchscript with full support for conditionals

§ ONNX

§ A single device from PyTorch no matter the size of

computer

§ Automated deployment flow

§ Pre-trained models can benefit from Tenstorrent features

WH WH WH WH

WH WH WH WH

WH WH WH WH

WH WH WH WH

2U system

Single card

Summary

compute

ML§ Scale and conditional computation to let ML models grow

§ Flexibility: run anything

§ Usability: easy to use software, hiding all complexity of

programming clusters

ljubisa bajićand jasmina vasiljević · 2020. 8. 18. · dynamic execution 1 10 100 1000 10000...

Documents