MAERI Architecture and Implementation Details

Feb 08, 2019

Synergy Lab, Georgia Tech
Hyoukjun Kwon

http://synergy.ece.gatech.edu

Acknowledgement

• Mr. Charlie Hauck and Dr. Rishiyur Nikhil: for providing the BSV license for the hands-on exercises
• Ananda Samajdar, Eric Qin, and Yehowshua Immanuel: for discussions and ASIC/FPGA synthesis

Outline

• Tool Flow of MAERI
• MAERI Implementation Details
• Using MAERI source code base
• Demo and exercises

Tool Flow of MAERI

MAERI Input
• Resource Description: NumPE 64, DistBW 4, GatrBW 4
• Layer Description: K 16, C 3, R 3, S 3, Y 224, X 224

MAERI Framework
• Building Block Library (BSV RTL): Adder Switch, Mult. Switch, Simple Switch, Cntl
• MAERI Front-end
• BSV Compiler

MAERI Outputs
• RTL Generation: Verilog Files
• Simulation: Cycle-accurate Simulation (# Cycles, # Weight Distribution, # Input Distribution, # Local Communication, …)
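To give a feel for the numbers in the example input, the sketch below estimates the work in that layer. It assumes the usual convolution reading of the parameters (K output channels, C input channels, R x S filter, Y x X input), stride 1, and no padding; none of this is stated on the slide. The names `Resource`, `ConvLayer`, and `ideal_cycles` are made up for illustration and are not part of the MAERI code base. The result is only an ideal lower bound; the cycle-accurate simulation reports the real count.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    num_pe: int   # NumPE
    dist_bw: int  # DistBW
    gatr_bw: int  # GatrBW


@dataclass(frozen=True)
class ConvLayer:
    k: int  # K: output channels (assumed meaning)
    c: int  # C: input channels (assumed meaning)
    r: int  # R: filter rows
    s: int  # S: filter columns
    y: int  # Y: input rows
    x: int  # X: input columns

    def macs(self) -> int:
        # Assumption: stride 1 and no padding.
        out_rows = self.y - self.r + 1
        out_cols = self.x - self.s + 1
        return self.k * self.c * self.r * self.s * out_rows * out_cols


def ideal_cycles(layer: ConvLayer, resource: Resource) -> int:
    # Lower bound: every PE performs one multiply-accumulate each cycle.
    return math.ceil(layer.macs() / resource.num_pe)


resource = Resource(num_pe=64, dist_bw=4, gatr_bw=4)
layer = ConvLayer(k=16, c=3, r=3, s=3, y=224, x=224)
print(layer.macs())                   # 21290688
print(ideal_cycles(layer, resource))  # 332667
```

The gap between this bound and the simulated cycle count is a rough indicator of how much time is spent on distribution and collection rather than computation.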

Tool Flow of MAERI Simulation

[Figure: simulation tool flow]

• MAERI-Compiler inputs: mRNA-generated mapping, target hardware config.
• MAERI-Simulation inputs: switch configurations, tile configurations, machine codes

Bluespec System Verilog (BSV)

• A high-level hardware description language
  • Generates fully synthesizable Verilog
• Inspired by Haskell and System Verilog
  • Strong type-checking system and polymorphism
  • System Verilog-like syntax
  • Intuitive module interfaces
• Based on “guarded atomic action” blocks
  • Provides coarse-grained description of parallel actions

For details, please refer to “BSV by Example” (http://csg.csail.mit.edu/6.S078/6_S078_2012_www/resources/bsv_by_example.pdf)
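As a rough intuition for guarded atomic actions, here is a toy Python model, not BSV and not how the Bluespec compiler schedules hardware. Each rule is a guard plus an action; a rule may fire only when its guard holds, and its action updates the state in one indivisible step. The helper `run_rules` and the rule list are made up for illustration; the example is the classic GCD written as two rules.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Rule = Tuple[str, Callable[[State], bool], Callable[[State], State]]


def run_rules(state: State, rules: List[Rule], max_steps: int = 1000) -> State:
    """Fire one enabled rule at a time until no guard holds."""
    state = dict(state)
    for _ in range(max_steps):
        enabled = [rule for rule in rules if rule[1](state)]
        if not enabled:
            return state
        _, _, action = enabled[0]
        # The action reads the old state and returns all of its updates at
        # once, so no other rule can observe a half-applied change.
        state = {**state, **action(state)}
    raise RuntimeError("rules did not settle within max_steps")


gcd_rules: List[Rule] = [
    ("swap",
     lambda s: s["x"] > s["y"] and s["y"] != 0,
     lambda s: {"x": s["y"], "y": s["x"]}),
    ("subtract",
     lambda s: s["x"] <= s["y"] and s["y"] != 0,
     lambda s: {"y": s["y"] - s["x"]}),
]

print(run_rules({"x": 15, "y": 6}, gcd_rules)["x"])  # 3
```

In this toy model only one rule fires per step. That keeps the reasoning simple, at the cost of hiding the parallelism that real hardware would exploit.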

MAERI Implementation – Distribution Network

[Figure: distribution tree feeding 16 multiplier switches (X) from Input Port 0 through Input Port 3]

• # Multiplier Switches = 16
• Distribution Bandwidth = 4X
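A first-order way to read the bandwidth figure: if the distribution network accepts a fixed number of data items per cycle, then loading one unique value into each multiplier switch takes at least the number of switches divided by the bandwidth, rounded up. The sketch below assumes exactly that and ignores multicast and tree latency; `distribution_cycles` is a hypothetical helper, not a MAERI function.

```python
import math


def distribution_cycles(num_values: int, dist_bw: int) -> int:
    """Minimum cycles to inject num_values unique values at dist_bw per cycle."""
    if dist_bw <= 0:
        raise ValueError("dist_bw must be positive")
    return math.ceil(num_values / dist_bw)


num_mult_switches = 16
for bw in (1, 2, 4):
    # Prints: 1 16, then 2 8, then 4 4
    print(bw, distribution_cycles(num_mult_switches, bw))
```

With the slide's configuration (16 multiplier switches, 4X bandwidth), filling every switch with a distinct value therefore needs at least 4 cycles under these assumptions.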

MAERI Implementation – Multiplier Network

[Figure: array of 16 multiplier switches (X)]

• # Multiplier Switches = 16

MAERI Implementation – Reduction Network

[Figure: reduction tree of adder switches (+) above the 16 multiplier switches (X). Adder switches are either single or double reduction switches, and results leave through Collection Bus 0 through Collection Bus 3.]

• # Multiplier Switches = 16
• Reduction Bandwidth = 4X
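The basic shape of the reduction can be sketched as a plain binary adder tree: products are summed pairwise, level by level, until one value remains. This is a simplified model under that assumption. It does not capture the single/double reduction switches or the collection buses, so the switch counts of the actual MAERI reduction network may differ. `tree_reduce` is a made-up name, and it expects a non-empty list.

```python
def tree_reduce(values):
    """Sum values pairwise, level by level; return (total, levels, adders)."""
    level = list(values)
    levels = 0
    adders = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])
            adders += 1
        if len(level) % 2:
            nxt.append(level[-1])  # odd element passes through unchanged
        level = nxt
        levels += 1
    return level[0], levels, adders


products = list(range(1, 17))  # stand-ins for 16 multiplier outputs
print(tree_reduce(products))   # (136, 4, 15)
```

For 16 inputs the plain tree has 4 levels, so a full reduction takes on the order of log2(16) adder stages rather than 15 sequential additions.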

Source Code Directory Structure

maeri_code_hpca2019_tutorial
• src: MAERI core implementation
• scripts
• types: Custom BSV type definitions
• lib: Custom BSV libraries
• distribution_network: Distribution tree
• reduction_network: Augmented reduction tree
• multiplier_network: Multiplier switch and its array
• ALUs: Fixed point adder/multiplier
• maeri_accelerator: MAERI top module
• …

How to use MAERI front-end

• Changing design parameters
  • Modify AcceleratorConfig.bsv at the top directory
    • Distribution bandwidth
    • Reduction bandwidth
    • Number of multiplier switches
• Cycle-accurate simulation and Verilog code generation
  • ./Maeri -c : Compile a simulation
  • ./Maeri -r : Run compiled simulation
  • ./Maeri -w : Launch GTKwave for waveform analysis
  • ./Maeri -v : Generate Verilog code
  • ./Maeri -clean : Clean up intermediate files
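The usual compile, run, inspect sequence can be scripted. The sketch below is a hypothetical Python wrapper: only the `./Maeri` flags come from the slide (written here with an ASCII hyphen), while `MAERI_STEPS`, `maeri_commands`, and `run_steps` are invented for illustration. It defaults to a dry run that only prints the commands, since actually running them requires the MAERI code base and a BSV installation.

```python
import subprocess

# Flags as listed on the slide; the step names are made up for this sketch.
MAERI_STEPS = {
    "compile": "-c",
    "run": "-r",
    "waveform": "-w",
    "verilog": "-v",
    "clean": "-clean",
}


def maeri_commands(steps, script="./Maeri"):
    """Translate step names into argument lists, one per invocation."""
    return [[script, MAERI_STEPS[step]] for step in steps]


def run_steps(steps, cwd=".", dry_run=True):
    """Print each command and, unless dry_run is set, execute it in cwd."""
    for cmd in maeri_commands(steps):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, cwd=cwd, check=True)  # stop at first failure


run_steps(["compile", "run", "waveform"])
```

Passing `dry_run=False` and setting `cwd` to the top directory of the code base would execute the steps in order and stop as soon as one of them fails.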

How to use MAERI front-end

• Simulation results example
  • Commands: “./Maeri -r” after “./Maeri -c”

How to use MAERI front-end

• Waveform Analysis
  • Commands: “./Maeri -r” and then “./Maeri -w”

How to use MAERI front-end

• Verilog code generation
  • Commands: “./Maeri -v”

* Verilog files are generated in “(Top_Directory)/Verilog”

Generated Verilog code is synthesizable!

MAERI Synthesis and PnR

• Synthesis/PnR Environment
  • Technology: 28 nm
  • Clock frequency: 1 GHz
  • Design: 64 multiplier switches and 31 adder switches
  • Distribution Bandwidth: 32/16/8/4 data per cycle
  • Gather Bandwidth: 32/16/8/4 data per cycle
  • RTL Code: Verilog generated using MAERI code base
  • CAD Tool Chain: Synopsys Design Compiler, Cadence Innovus, PrimePower

Post-layout Area and Power

[Figure: area (um²) broken down into wire, reduction network (RN), multiplier network (MN), and distribution network (DN) for 16, 32, 64, 128, and 256 PEs; bandwidth: 8X]

[Figure: area (um²) broken down into wire, RN, MN, and DN for 4X, 8X, 16X, and 32X bandwidth; NumPEs: 64]

FPGA Resource Usage

* Based on Virtex 7 board, synthesis frequency: 50 MHz

[Figure: LUT, FF, and DSP usage for 32, 64, 128, and 256 PEs; bandwidth: 8X]

[Figure: LUT, FF, and DSP usage for 8X, 16X, and 32X bandwidth; NumPEs: 64]

Demo

• Launching cycle-accurate simulations
• Modifying user configuration
• Compiling simulations
• Launching waveform analysis
• Generating Verilog files

Outline

• Tool Flow of MAERI
• Source code Structure
• Using MAERI source code base
• Demo
• Hands-on Exercises

Testbench Structure

[Figure: testbench block diagram]

• AcceleratorConfig.bsv (configuration file): layer size, NumMultSws, Dist. BW, Red. BW, …
• MAERI mapper-generated optimized configurations: switch configurations, tile configurations, and machine codes (RN_Config.vmh, Layer_Info.vmh)
• Generated simulation model: a testbench that drives weights / inputs and interconnect control into MAERI (multiplier switches grouped into VN0 and VN1 under the adder switches) and collects the outputs

Testbench Dataflow (MAESTRO description)

let vnSize = sizeof(R) × sizeof(S)
let numVNs = floor(NumMultSwitches / vnSize)

• Temporal_Map (1, 1) C
• Spatial_Map (1, 1) K
• Temporal_Map (sizeof(R), 1) Y
• Temporal_Map (sizeof(S), 1) X
• Cluster (vnSize, L)
• Temporal_Map (sizeof(S), sizeof(S)) S
• Spatial_Map (1, 1) R
• Cluster (vnSize, P)
• Spatial_Map (1, 1) S

High weight filter parallelism
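The two definitions at the top of this dataflow are simple arithmetic, worked out below for the 3 x 3 filter from the tool-flow example. The helper name `virtual_neurons` and the extra "idle" count are illustrative additions; the formulas for vnSize and numVNs are the ones given on the slide.

```python
def virtual_neurons(r: int, s: int, num_mult_switches: int):
    """Return (vnSize, numVNs, idle multiplier switches)."""
    vn_size = r * s                         # vnSize = sizeof(R) x sizeof(S)
    num_vns = num_mult_switches // vn_size  # numVNs = floor(NumMultSwitches / vnSize)
    idle = num_mult_switches - num_vns * vn_size
    return vn_size, num_vns, idle


for n in (32, 64):
    print(n, virtual_neurons(3, 3, n))
# 32 (9, 3, 5)
# 64 (9, 7, 1)
```

With 3 x 3 filters, 32 multiplier switches hold 3 virtual neurons and leave 5 switches unused, while 64 switches hold 7 and leave 1 unused. This is worth keeping in mind when comparing the 32-PE and 64-PE runs.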

• Exercise #1: Compile a simulation with default, early, and late layers with 32 PEs (“./Maeri -c”). Run the simulations and compare the results.

• Exercise #2: Compile a simulation with 32 and 64 PEs using the default setting (“./Maeri -c”). Run the simulations and compare the results.

• Exercise #3: Compile a simulation with 4X, 8X, and 16X distribution bandwidth (fix the reduction bandwidth at 8X). Run the simulations and compare the results.

• Exercise #4: Compile a simulation with 4X and 8X reduction bandwidth (fix the distribution bandwidth at 8X). Run the simulations and compare the results.