TRANSCRIPT
Networks-on-Chips (NoCs) Basics
ECE 284 On-Chip Interconnection Networks
Spring 2013
Examples of Tiled Multiprocessors
• 2D-mesh networks often used as on-chip fabric
[figure: annotated die photos; a single tile measures 1.5mm × 2.0mm within a 21.72mm × 12.64mm die, with I/O areas along the edges]
Technology: 65nm, 1 poly, 8 metal (Cu)
Transistors: 100 Million (full-chip), 1.2 Million (tile)
Die Area: 275mm² (full-chip), 3mm² (tile)
C4 bumps: 8390
Tilera Tile64
Intel 80-core
Typical architecture
• Each tile typically comprises the CPU, a local L1 cache, a “slice” of a distributed L2 cache, and a router
[figure: a tile's compute unit (CPU, L1 cache, and slice of L2 cache) attached to a router]
Router function
• The job of the router is to forward packets from a source tile to a destination tile (e.g., when a “cache line” is read from a “remote” L2 slice).
• Two example switching modes:
– Store-and-forward: Bits of a packet are forwarded only after the entire packet is first stored.
– Cut-through: Bits of a packet are forwarded once the header portion is received.
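The latency difference between the two modes can be sketched with a simple per-flit cycle model (an illustrative assumption, not from the slides): store-and-forward pays the full packet serialization at every hop, while cut-through pays it only on the last link, with the body pipelining behind the head flit.

```python
def store_and_forward_latency(hops, packet_flits):
    """The whole packet is serialized over each of `hops` links in turn."""
    return hops * packet_flits

def cut_through_latency(hops, packet_flits):
    """The head flit crosses each link in one cycle; the body pipelines
    behind it, so only the final link's serialization is paid in full."""
    return hops + packet_flits - 1

# Example: 5 hops, 16-flit packet
# store-and-forward: 5 * 16 = 80 cycles; cut-through: 5 + 16 - 1 = 20 cycles
```

Note that for a single link the two models agree, as they must: with no intermediate switches there is nothing to store.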
Store-and-forward switching
• Packets are completely stored at each switch before any portion is forwarded
• Requirement: buffers must be sized to hold an entire packet
[figure: packet progressing from source end node to destination end node, stored and then forwarded at each switch's data-packet buffers]
[adapted from instructional slides of Pinkston & Duato, Computer Architecture: A Quantitative Approach]
Cut-through switching
• Virtual cut-through: when the outgoing link is busy, the blocked packet is completely stored at the switch; buffers must be sized to hold an entire packet (MTU)
• Wormhole: when the outgoing link is busy, the blocked packet is stored along the path; buffers hold flits, so packets can be larger than buffers
[figure: source and destination end nodes with a busy link; the blocked packet either gathers at one switch (virtual cut-through) or stretches across several switches (wormhole)]
[adapted from instructional slides of Pinkston & Duato, Computer Architecture: A Quantitative Approach]
Packets to flits

Transaction   Message Type   Packet Size
Read          Request        1 flit
Read          Reply          1+n flits
Write         Request        1+n flits
Write         Reply          1 flit
[adapted from Becker STM’09 talk]
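The table can be mimicked with a small sketch that splits a message into flits. The `packetize` helper and its flit fields are illustrative assumptions, but they show why a request carrying no payload is a single flit while a message carrying n data words takes 1+n flits.

```python
def packetize(dest, payload_words):
    """Split a message into flits: a head flit carrying the destination,
    body flits carrying payload, and a tail flit closing the packet.
    A message with no payload collapses to a single head/tail flit."""
    if not payload_words:
        return [{"type": "head_tail", "dest": dest}]
    flits = [{"type": "head", "dest": dest}]
    for word in payload_words[:-1]:
        flits.append({"type": "body", "data": word})
    flits.append({"type": "tail", "data": payload_words[-1]})
    return flits

# A read request (no payload) is 1 flit; a reply with n words is 1+n flits.
```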
Wormhole routing
• The head flit establishes the connection from input port to output port; it contains the destination address.
• Body flits go through the established connection (they do not need destination address information).
• The tail flit releases the connection.
• All other flits are blocked until the connection is released.
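A minimal sketch of this connection discipline (class and field names are illustrative, not from the slides): the head flit claims the link, body flits pass only if they belong to the owning packet, and the tail flit releases the link.

```python
class WormholeLink:
    """One output link of a router, owned by at most one packet at a time."""

    def __init__(self):
        self.owner = None  # id of the packet currently holding the link

    def try_send(self, flit):
        """Return True if the flit may traverse the link this cycle."""
        if flit["type"] == "head":
            if self.owner is not None:
                return False          # blocked: another packet holds the link
            self.owner = flit["pkt"]  # head flit establishes the connection
            return True
        if self.owner != flit["pkt"]:
            return False              # flits of other packets are blocked
        if flit["type"] == "tail":
            self.owner = None         # tail flit releases the connection
        return True
```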
Deadlock
[figure: packets holding buffers while waiting for buffers held by other packets, forming a cycle in which no packet can advance]
Virtual channels
• Share channel capacity between multiple data streams
– Interleave flits from different packets
• Provide dedicated buffer space for each virtual channel
– Decouple channels from buffers
• “The Swiss Army Knife for Interconnection Networks”
– Prevent deadlocks
– Reduce head-of-line blocking
– Also useful for providing QoS
[adapted from Becker STM’09 talk]
Using VCs for deadlock prevention
• Protocol deadlock
– Circular dependencies between messages at network edge
– Solution: partition range of VCs into different message classes
• Routing deadlock
– Circular dependencies between resources within network
– Solution: partition range of VCs into different resource classes, and restrict transitions between resource classes to impose a partial order on resource acquisition
• {packet classes} = {message classes} × {resource classes}
[adapted from Becker STM’09 talk]
Using VCs for flow control
• Coupling between channels and buffers causes head-of-line blocking
– Adds false dependencies between packets
– Limits channel utilization
– Increases latency
– Even with VCs for deadlock prevention, still applies to packets in same class
• Solution: assign multiple VCs to each packet class
[adapted from Becker STM’09 talk]
VC router pipeline
• Route Computation (RC): per packet
– Determine candidate output port(s) and VC(s)
– Can be precomputed at upstream router (lookahead routing)
• Virtual Channel Allocation (VA): per packet
– Assign available output VCs to waiting packets at input VCs
• Switch Allocation (SA): per flit
– Assign switch time slots to buffered flits
• Switch Traversal (ST): per flit
– Send flits through crossbar switch to appropriate output
[adapted from Becker STM’09 talk]
Allocation basics
• Arbitration:
– Multiple requestors
– Single resource
– Request + grant vectors
• Allocation:
– Multiple requestors
– Multiple equivalent resources
– Request + grant matrices
• Matching:
– Each grant must satisfy a request
– Each requester gets at most one grant
– Each resource is granted at most once
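The three matching conditions translate directly into a small checker (a sketch; the matrix-of-booleans representation is an assumption).

```python
def is_matching(request, grant):
    """Verify that a grant matrix is a valid matching for a request matrix:
    every grant satisfies a request, each requester (row) gets at most one
    grant, and each resource (column) is granted at most once."""
    rows, cols = len(grant), len(grant[0])
    for i in range(rows):
        for j in range(cols):
            if grant[i][j] and not request[i][j]:
                return False          # grant without a matching request
    if any(sum(row) > 1 for row in grant):
        return False                  # a requester granted twice
    if any(sum(grant[i][j] for i in range(rows)) > 1 for j in range(cols)):
        return False                  # a resource granted twice
    return True
```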
[adapted from Becker STM’09 talk]
Separable allocators
• Matchings have at most one grant per row and per column
• Implement via two phases of arbitration
– Column-wise and row-wise, in either order (input-first or output-first)
– Arbiters in each stage are fully independent
• Fast and cheap
• But bad choices in the first phase can prevent the second stage from generating a good matching!
[figure: input-first and output-first separable allocator structures]
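A minimal input-first separable allocator, using fixed-priority arbiters in both stages (an illustrative simplification; real designs typically use round-robin arbiters for fairness):

```python
def separable_input_first(request):
    """Input-first separable allocation: stage 1 picks one request per
    input (row arbitration); stage 2 picks one winner per output
    (column arbitration). Both stages use fixed lowest-index priority."""
    n_in, n_out = len(request), len(request[0])
    # Stage 1: each input selects at most one of its requested outputs.
    stage1 = [[False] * n_out for _ in range(n_in)]
    for i in range(n_in):
        for j in range(n_out):
            if request[i][j]:
                stage1[i][j] = True
                break
    # Stage 2: each output grants at most one surviving request.
    grant = [[False] * n_out for _ in range(n_in)]
    for j in range(n_out):
        for i in range(n_in):
            if stage1[i][j]:
                grant[i][j] = True
                break
    return grant
```

For the request matrix [[1,1],[1,0]], both inputs pick output 0 in the first stage, so only one grant survives even though a two-grant matching exists; this is exactly the bad first-phase choice the slide warns about.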
[adapted from Becker STM’09 talk]
Wavefront allocators
• Avoid separate phases
– … and bad decisions in the first
• Generate better matchings
• But delay scales linearly
• Also difficult to pipeline
• Principle of operation:
– Pick initial diagonal
– Grant all requests on diagonal (they never conflict!)
– For each grant, delete requests in same row, column
– Repeat for next diagonal
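The diagonal sweep can be sketched as follows (a behavioral model only; hardware evaluates all cells of a diagonal in parallel, since cells on one diagonal never share a row or column):

```python
def wavefront_allocate(request):
    """Wavefront allocation on an n x n request matrix: sweep diagonals
    starting from the priority diagonal (d = 0 here); grant every request
    whose row and column are still free, then move to the next diagonal."""
    n = len(request)
    row_taken = [False] * n
    col_taken = [False] * n
    grant = [[False] * n for _ in range(n)]
    for d in range(n):                 # successive diagonals
        for i in range(n):
            j = (i + d) % n            # cell (i, j) lies on diagonal d
            if request[i][j] and not row_taken[i] and not col_taken[j]:
                grant[i][j] = True
                row_taken[i] = True    # delete requests in this row ...
                col_taken[j] = True    # ... and in this column
    return grant
```

Rotating which diagonal goes first between allocations is how real designs spread priority fairly; this sketch always starts at the main diagonal.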
[adapted from Becker STM’09 talk]
Wavefront allocator timing
• Originally conceived as full-custom design
• Tiled design
• True delay scales linearly
• Signal wraparound creates combinational loops
– Effectively broken at priority diagonal
– But static timing analysis cannot infer that
– Synthesized designs must be modified to avoid loops!
[adapted from Becker STM’09 talk]
Diagonal Propagation Allocator
• Unrolled matrix avoids combinational loops
• Sliding priority window activates sub-matrix cells
• But static timing analysis again sees false paths!
– Actual delay is ~n
– Reported delay is ~(2n-1)
– Hurts synthesized designs
[adapted from Becker STM’09 talk]
VC allocation
• Before a packet can proceed through the router, it needs to acquire ownership of a VC at the downstream router
• VC allocator matches unassigned input VCs with output VCs that are not currently in use– P×V requestors (input VCs), P×V resources (output VCs)
• VC is acquired by head flit, inherited by body & tail flits
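A behavioral sketch of this matching, with busy output VCs masked out and a simple greedy assignment standing in for a real allocator (both are illustrative assumptions):

```python
def vc_allocate(requests, output_vc_busy):
    """Match unassigned input VCs to free output VCs.
    requests[i][j] is True when the head flit waiting at input VC i
    requests output VC j; output_vc_busy[j] masks VCs already owned
    by in-flight packets. Greedy lowest-index matching for illustration."""
    n_in, n_out = len(requests), len(requests[0])
    taken = list(output_vc_busy)      # copy: newly assigned VCs become busy
    assignment = {}
    for i in range(n_in):
        for j in range(n_out):
            if requests[i][j] and not taken[j]:
                assignment[i] = j     # head flit acquires output VC j; body
                taken[j] = True       # and tail flits will inherit it
                break
    return assignment
```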
[adapted from Becker STM’09 talk]
VC allocator implementations
• Not shown:
– Masking logic for busy VCs
[adapted from Becker STM’09 talk]
Typical pipelined router
• RC: route computation
• VA + SA: VC and switch allocation
• ST: switch traversal
• LT: link traversal
[figure: pipeline diagram with stages RC → VA/SA → ST → LT]