with physical design awareness unlock the noc ...cbatten/pdfs/batten... · • physical-design...

13
Unlock the NoC: Transforming NoC Research with Physical Design Awareness Christopher Batten (Cornell University) Michael Taylor (University of Washington) NOCS’20 Special Session

Upload: others

Post on 28-Sep-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Unlock the NoC: Transforming NoC Researchwith Physical Design Awareness

Christopher Batten (Cornell University)Michael Taylor (University of Washington)

NOCS’20 Special Session

• Physical-Design Issues for NOCs • Open-Source Hardware for NOCs

Christopher BattenAssociate Professor, Cornell UniversityComputer Architecture, EDA, VLSITapeouts in 180/130/28/16nmEarly work on nanophotonic chip-levelinterconnection networks

Michael TaylorAssociate Professor, University of Washington

Computer Architecture, VLSITapeouts in 180/40/28/16/12nm

Pioneered scalar-operand networks, one of thefirst academic works on NOCs

NOCS’20 Unlock the NoC: Transforming NoC Research with Physical Design Awareness 2 / 13

• Physical-Design Issues for NOCs • Open-Source Hardware for NOCs

MIT RAW Processor [IEEE-Micro’02,ISSCC’03]

I 18× 18mm in IBM 180 nmI 16 MIPS-like cores

I 4×4 mesh network-on-chip

. 2 statically configured NoCs forscalar operands

. 2 dynamically routed Nocs formemory traffic

. XY dimension ordered routing

. Four physical networks, no VCs

NOCS’20 Unlock the NoC: Transforming NoC Research with Physical Design Awareness 3 / 13

• Physical-Design Issues for NOCs • Open-Source Hardware for NOCs

Celerity System-on-Chip [IEEE-Micro’18,VLSI’19]

I 5× 5mm in TSMC 16 nm FFCI 385 million transistorsI 511 RISC-V cores

. 5 Linux-capable Rocket cores

. 496-core tiled manycore

. 10-core low-voltage arrayI 1 BNN acceleratorI 1 synthesizable PLLI 1 synthesizable LDO VregI 3 clock domains

I 16×31 mesh network-on-chip. Remote-store programming. One-cycle router+channel latency. XY dimension ordered routing. One physical network, no VCs

• TSMC 16nm FFC• 25mm2 die area (5mm x 5mm)• ~385 million transistors• 511 RISC-V cores

• 5 Linux-capable “Rocket Cores”• 496-core mesh tiled array “Manycore”• 10-core mesh tiled array “Manycore” (low voltage)

• 1 Binarized Neural Network Specialized Accelerator• On-chip synthesizable PLLs and DC/DC LDO

• Developed in-house• 3 Clock domains

• 400 MHz – DDR I/O• 625 MHz – Rocket core + Specialized accelerator• 1.05 GHz – Manycore array

• 672-pin flip chip BGA package• 9-months from PDK access to tape-out

Celerity: Chip Overview• TSMC 16nm FFC• 25mm2 die area (5mm x 5mm)• ~385 million transistors• 511 RISC-V cores

• 5 Linux-capable “Rocket Cores”• 496-core mesh tiled array “Manycore”• 10-core mesh tiled array “Manycore” (low voltage)

• 1 Binarized Neural Network Specialized Accelerator• On-chip synthesizable PLLs and DC/DC LDO

• Developed in-house• 3 Clock domains

• 400 MHz – DDR I/O• 625 MHz – Rocket core + Specialized accelerator• 1.05 GHz – Manycore array

• 672-pin flip chip BGA package• 9-months from PDK access to tape-out

Celerity: Chip Overview

• TSMC 16nm FFC• 25mm2 die area (5mm x 5mm)• ~385 million transistors• 511 RISC-V cores

• 5 Linux-capable “Rocket Cores”• 496-core mesh tiled array “Manycore”• 10-core mesh tiled array “Manycore” (low voltage)

• 1 Binarized Neural Network Specialized Accelerator• On-chip synthesizable PLLs and DC/DC LDO

• Developed in-house• 3 Clock domains

• 400 MHz – DDR I/O• 625 MHz – Rocket core + Specialized accelerator• 1.05 GHz – Manycore array

• 672-pin flip chip BGA package• 9-months from PDK access to tape-out

Celerity: Chip Overview

NOCS’20 Unlock the NoC: Transforming NoC Research with Physical Design Awareness 4 / 13

• Physical-Design Issues for NOCs • Open-Source Hardware for NOCs

Physical Design Issues

I Timing Closure: Must meet both min and max timing constraints acrossentire chip and multiple corners

I Silicon Utilization: Must effectively use both active area and wiringresources without negatively impacting other design issues

I Power Distribution: Must ensure no static IR drop nor dI/dt voltage noiseissues; carefully balance power grid vs signal routing resources

I Hierarchical Design: Carefully consider hard macros which can addresssome physical design issues while at the same time creating new challenges

I EDA Tool Runtime: Must facilitate an agile chip design methodology wherewe can spin an entire chip in less than a day

I Signal Integrity: Global signals must always operate robustly even in thecontext of voltage noise and aggressor signals

I Custom Circuits: Carefully consider the impact of mixing custom andstandard-cell-based design methodologies

NOCS’20 Unlock the NoC: Transforming NoC Research with Physical Design Awareness 5 / 13

• Physical-Design Issues for NOCs • Open-Source Hardware for NOCs

Figure 5. Bandwidth, Latency, and Area Tradeoffs for Post-Place-and-Route Results – M = message size in bits; (a) = bandwidth and area tradeoffs;(b) = latency and area tradeoffs for small messages (64b); (c) = latency and area tradeoffs for large messages (256b).

185µm

Blockage

DummyCore

OCN Router& Channels

(a) mesh-c1r0-b32

185µm

Blockage

DummyCore

OCN Router& Channels

(b) mesh-c1r0-b64

185µm

Blockage

DummyCore

OCN Router& Channels

(c) mesh-c1r0-b128185µm

BlockageDummyCore

OCNRouter &Channels

(d) mesh-c1r0-b256

185µm

Blockage

DummyCore

OCN Router& Channels

(e) mesh-c1r0q0-b32

185µm

Blockage

DummyCore

OCN Router& Channels

(f) torus-c1r0-b32

375µm

Blockage

DummyCore

OCNRouter &Channels

Blockage

DummyCore

DummyCore

DummyCore

BlockageBlockage

(g) mesh-c4r0-b128

375µm

Blockage

DummyCore

OCNRouter &Channels

Blockage

DummyCore

DummyCore

DummyCore

BlockageBlockage

(h) mesh-c4r2-b64

Figure 6. Example Macro-Level Post-Place-and-Route Layouts – All layouts are to scale and include 1–4 blockages, 1–4 dummy cores, and fencesto constrain placement of the router and channels. (a–d) layouts for mesh-c1r0 at four different channel bandwidths; (e) layout for mesh-c1r0q0-b32 (i.e., no channel queues), OCN requires more area than mesh-c1r0-b32; (f) layout for torus-c1r0-b32, OCN requires comparable area tomesh-c1r0-b32; (g–h) layout for mesh-c4r0-b128 and mesh-c4r2-b64 both of which require comparable area (smaller OCN router “square” inmesh-c4r2-b64 is outweighed by longer and wider channel “rectangles”).

3140µm 230µm 3100µm 275µm

Wrap-Around ChipTop-Level Routing

Cross-Over ChipTop-Level Routing

Global Clock & ResetRouting Over Macro

Straight AcrossChip Top-LevelRouting

Straight AcrossChip Top-Level

Routing

(a) torus-c1r0-b32 Full Chip (b) torus-c1r0-b32 Close Up (c) mesh-c4r2-b64 Full Chip (d) mesh-c4r2-b64 Close Up

Figure 7. Example Chip-Level Post-Place-and-Route Layouts – (a) full-chip layout with 256 instances of the torus-c1r0-b32 hard macro which isshown in Figure 6(f); (b) close-up of required chip top-level routing including cross-over routing to neighboring hard macro, wrap-around routing,and global clock and reset routing over the hard macro; (c) full-chip layout for 64 instances of the mesh-c4r2-b64 hard macro which is shown inFigure 1(h); (d) = close-up of the required chip top-level routing including straight-across routing at the middle of each macro side.

I Ruche Networks:Wire-Maximal, No-Fuss NoCsDai Cheol Jung, Scott Davidson, ChunZhao, Dustin Richmond, Michael BedfordTaylor (University of Washington)

I Implementing Low-Diameter On-ChipNetworks Using a Tiled Physical DesignMethodologyYanghui Ou, Shady Agwa, ChristopherBatten (Cornell University)

I NoC SymbiosisDaniel Petrisko, Chun Zhao, Scott Davidson,Paul Gao, Dustin Richmond, MichaelBedford Taylor (University of Washington)

NOCS’20 Unlock the NoC: Transforming NoC Research with Physical Design Awareness 6 / 13

Physical-Design Issues for NOCs • Open-Source Hardware for NOCs •

Software Innovation Today

Your proprietary code • Instagram • $500K seed with 13 people → $1B

Open-source software • Python • Django • Memcached • Postgres/SQL • Redis • nginx • Apache, Gnuicorn • Linux • GCC

"What Powers Instagram: Hundreds of Instances,

Dozens of Technologies"https://goo.gl/76fWrM

Like climbing an iceberg – much is hidden!

Adapted from M. Taylor, “Open Source HW in 2030,” Arch 2030 Workshop @ ISCA’16

NOCS’20 Unlock the NoC: Transforming NoC Research with Physical Design Awareness 7 / 13

Physical-Design Issues for NOCs • Open-Source Hardware for NOCs •

Hardware Innovation Today

Closed source • ARM A57, A7, M4, M0 • ARM on-chip interconnect • Standard cells, I/O pads, DDR Phy • SRAM memory compilers • VCS, Modelsim • DC, ICC, Formality, Primetime • Stratus, Innovus, Voltus • Calibre DRC/RCX/LVS, SPICE

What you have to build • New machine learning accelerator • Other unrelated components, anything you cannot afford to buy or for which COTS IP does not do

Like climbing a mountain – nothing is hidden!

Adapted from M. Taylor, “Open Source HW in 2030,” Arch 2030 Workshop @ ISCA’16

NOCS’20 Unlock the NoC: Transforming NoC Research with Physical Design Awareness 8 / 13

Physical-Design Issues for NOCs • Open-Source Hardware for NOCs •

Chip Costs Are Skyrocketing

$120M

$500K

Adapted from M. Taylor, “Open Source HW in 2030,” Arch 2030 Workshop @ ISCA’16; original: International Business Strategies & T. Austin.

NOCS’20 Unlock the NoC: Transforming NoC Research with Physical Design Awareness 9 / 13

Physical-Design Issues for NOCs • Open-Source Hardware for NOCs •

How can HW design be more like SW design?

Open-Source Software Hardware

high-levellanguages

Python, Ruby, R,Javascript, Julia

Chisel, PyMTL, PyRTL, MyHDL,JHDL, Cλash

libraries C++ STL,Python std libs

BaseJump, PyOCN

systems Linux, Apache, MySQL,memcached

Rocket, Pulp/Ariane, OpenPiton,Boom, FabScalar, MIAOW, Nyuzi

standards POSIX RISC-V ISA, RoCC, TileLink

tools GCC, LLVM, CPython,MRI, PyPy, V8

Icarus Verilog, Verilator, qflow,Yosys, TimberWolf, qrouter,magic, klayout, ngspice

methodologies agile software design agile hardware design

cloud IaaS, elastic computing IaaS, elastic CAD

NOCS’20 Unlock the NoC: Transforming NoC Research with Physical Design Awareness 10 / 13

Physical-Design Issues for NOCs • Open-Source Hardware for NOCs •

NOCS’20 Unlock the NoC: Transforming NoC Research with Physical Design Awareness 11 / 13

Physical-Design Issues for NOCs • Open-Source Hardware for NOCs •

PyOCN: A Unified Framework forModeling, Testing, and Evaluating OCNs

I Implemented using PyMTL3, a new Python modeling, generation,simulation, and verification framework [IEEE Micro’20, IEEE D&T ’20]

I https://github.com/pymtl/pymtl3-net

NOCS’20 Unlock the NoC: Transforming NoC Research with Physical Design Awareness 12 / 13

Physical-Design Issues for NOCs Open-Source Hardware for NOCs

Figure 5. Bandwidth, Latency, and Area Tradeoffs for Post-Place-and-Route Results – M = message size in bits; (a) = bandwidth and area tradeoffs;(b) = latency and area tradeoffs for small messages (64b); (c) = latency and area tradeoffs for large messages (256b).

185µm

Blockage

DummyCore

OCN Router& Channels

(a) mesh-c1r0-b32

185µm

Blockage

DummyCore

OCN Router& Channels

(b) mesh-c1r0-b64

185µm

Blockage

DummyCore

OCN Router& Channels

(c) mesh-c1r0-b128185µm

BlockageDummyCore

OCNRouter &Channels

(d) mesh-c1r0-b256

185µm

Blockage

DummyCore

OCN Router& Channels

(e) mesh-c1r0q0-b32

185µm

Blockage

DummyCore

OCN Router& Channels

(f) torus-c1r0-b32

375µm

Blockage

DummyCore

OCNRouter &Channels

Blockage

DummyCore

DummyCore

DummyCore

BlockageBlockage

(g) mesh-c4r0-b128

375µm

Blockage

DummyCore

OCNRouter &Channels

Blockage

DummyCore

DummyCore

DummyCore

BlockageBlockage

(h) mesh-c4r2-b64

Figure 6. Example Macro-Level Post-Place-and-Route Layouts – All layouts are to scale and include 1–4 blockages, 1–4 dummy cores, and fencesto constrain placement of the router and channels. (a–d) layouts for mesh-c1r0 at four different channel bandwidths; (e) layout for mesh-c1r0q0-b32 (i.e., no channel queues), OCN requires more area than mesh-c1r0-b32; (f) layout for torus-c1r0-b32, OCN requires comparable area tomesh-c1r0-b32; (g–h) layout for mesh-c4r0-b128 and mesh-c4r2-b64 both of which require comparable area (smaller OCN router “square” inmesh-c4r2-b64 is outweighed by longer and wider channel “rectangles”).

3140µm 230µm 3100µm 275µm

Wrap-Around ChipTop-Level Routing

Cross-Over ChipTop-Level Routing

Global Clock & ResetRouting Over Macro

Straight AcrossChip Top-LevelRouting

Straight AcrossChip Top-Level

Routing

(a) torus-c1r0-b32 Full Chip (b) torus-c1r0-b32 Close Up (c) mesh-c4r2-b64 Full Chip (d) mesh-c4r2-b64 Close Up

Figure 7. Example Chip-Level Post-Place-and-Route Layouts – (a) full-chip layout with 256 instances of the torus-c1r0-b32 hard macro which isshown in Figure 6(f); (b) close-up of required chip top-level routing including cross-over routing to neighboring hard macro, wrap-around routing,and global clock and reset routing over the hard macro; (c) full-chip layout for 64 instances of the mesh-c4r2-b64 hard macro which is shown inFigure 1(h); (d) = close-up of the required chip top-level routing including straight-across routing at the middle of each macro side.

I Ruche Networks:Wire-Maximal, No-Fuss NoCsDai Cheol Jung, Scott Davidson, ChunZhao, Dustin Richmond, Michael BedfordTaylor (University of Washington)

I Implementing Low-Diameter On-ChipNetworks Using a Tiled Physical DesignMethodologyYanghui Ou, Shady Agwa, ChristopherBatten (Cornell University)

I NoC SymbiosisDaniel Petrisko, Chun Zhao, Scott Davidson,Paul Gao, Dustin Richmond, MichaelBedford Taylor (University of Washington)

NOCS’20 Unlock the NoC: Transforming NoC Research with Physical Design Awareness 13 / 13