TRANSCRIPT
Networks-on-Chips (NoCs) Basics
ECE 284 On-Chip Interconnection Networks
Spring 2013
Examples of Tiled Multiprocessors
• 2D-mesh networks often used as on-chip fabric
[figure: annotated die photos; a single tile measures 1.5mm × 2.0mm within a 21.72mm × 12.64mm die, with I/O areas along the edges]
Technology: 65nm, 1 poly, 8 metal (Cu)
Transistors: 100 Million (full-chip), 1.2 Million (tile)
Die Area: 275mm² (full-chip), 3mm² (tile)
C4 bumps: 8390
Tilera Tile64
Intel 80-core
Typical architecture
• Each tile typically comprises the CPU, a local L1 cache, a “slice” of a distributed L2 cache, and a router
[figure: a tile's compute unit (CPU, L1 cache, and slice of L2 cache) attached to a router]
Router function
• The job of the router is to forward packets from a source tile to a destination tile (e.g., when a “cache line” is read from a “remote” L2 slice).
• Two example switching modes:
– Store-and-forward: Bits of a packet are forwarded only after the entire packet is first stored.
– Cut-through: Bits of a packet are forwarded once the header portion is received.
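The latency difference between the two modes can be sketched with a simple per-flit cycle model (an illustrative assumption, not from the slides): store-and-forward pays the full packet serialization at every hop, while cut-through pays it only on the last link, with the body pipelining behind the head flit.

```python
def store_and_forward_latency(hops, packet_flits):
    """The whole packet is serialized over each of `hops` links in turn."""
    return hops * packet_flits

def cut_through_latency(hops, packet_flits):
    """The head flit crosses each link in one cycle; the body pipelines
    behind it, so only the final link's serialization is paid in full."""
    return hops + packet_flits - 1

# Example: 5 hops, 16-flit packet
# store-and-forward: 5 * 16 = 80 cycles; cut-through: 5 + 16 - 1 = 20 cycles
```

Note that for a single link the two models agree, as they must: with no intermediate switches there is nothing to store.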
Store-and-forward switching
• Packets are completely stored at each switch before any portion is forwarded
• Requirement: buffers must be sized to hold an entire packet
[figure: packet progressing from source end node to destination end node, stored and then forwarded at each switch's data-packet buffers]
[adapted from instructional slides of Pinkston & Duato, Computer Architecture: A Quantitative Approach]
Cut-through switching
• Virtual cut-through: when the outgoing link is busy, the blocked packet is completely stored at the switch; buffers must be sized to hold an entire packet (MTU)
• Wormhole: when the outgoing link is busy, the blocked packet is stored along the path; buffers hold flits, so packets can be larger than buffers
[figure: source and destination end nodes with a busy link; the blocked packet either gathers at one switch (virtual cut-through) or stretches across several switches (wormhole)]
[adapted from instructional slides of Pinkston & Duato, Computer Architecture: A Quantitative Approach]
Packets to flits

Transaction   Message Type   Packet Size
Read          Request        1 flit
Read          Reply          1+n flits
Write         Request        1+n flits
Write         Reply          1 flit
[adapted from Becker STM’09 talk]
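The table can be mimicked with a small sketch that splits a message into flits. The `packetize` helper and its flit fields are illustrative assumptions, but they show why a request carrying no payload is a single flit while a message carrying n data words takes 1+n flits.

```python
def packetize(dest, payload_words):
    """Split a message into flits: a head flit carrying the destination,
    body flits carrying payload, and a tail flit closing the packet.
    A message with no payload collapses to a single head/tail flit."""
    if not payload_words:
        return [{"type": "head_tail", "dest": dest}]
    flits = [{"type": "head", "dest": dest}]
    for word in payload_words[:-1]:
        flits.append({"type": "body", "data": word})
    flits.append({"type": "tail", "data": payload_words[-1]})
    return flits

# A read request (no payload) is 1 flit; a reply with n words is 1+n flits.
```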
Wormhole routing
• The head flit establishes the connection from input port to output port; it contains the destination address.
• Body flits go through the established connection (they do not need destination address information).
• The tail flit releases the connection.
• All other flits are blocked until the connection is released.
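A minimal sketch of this connection discipline (class and field names are illustrative, not from the slides): the head flit claims the link, body flits pass only if they belong to the owning packet, and the tail flit releases the link.

```python
class WormholeLink:
    """One output link of a router, owned by at most one packet at a time."""

    def __init__(self):
        self.owner = None  # id of the packet currently holding the link

    def try_send(self, flit):
        """Return True if the flit may traverse the link this cycle."""
        if flit["type"] == "head":
            if self.owner is not None:
                return False          # blocked: another packet holds the link
            self.owner = flit["pkt"]  # head flit establishes the connection
            return True
        if self.owner != flit["pkt"]:
            return False              # flits of other packets are blocked
        if flit["type"] == "tail":
            self.owner = None         # tail flit releases the connection
        return True
```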
Deadlock
[figure: packets holding buffers while waiting for buffers held by other packets, forming a cycle in which no packet can advance]
Virtual channels
• Share channel capacity between multiple data streams
– Interleave flits from different packets
• Provide dedicated buffer space for each virtual channel
– Decouple channels from buffers
• “The Swiss Army Knife for Interconnection Networks”
– Prevent deadlocks
– Reduce head-of-line blocking
– Also useful for providing QoS
[adapted from Becker STM’09 talk]
Using VCs for deadlock prevention
• Protocol deadlock
– Circular dependencies between messages at network edge
– Solution: partition range of VCs into different message classes
• Routing deadlock
– Circular dependencies between resources within network
– Solution: partition range of VCs into different resource classes, and restrict transitions between resource classes to impose a partial order on resource acquisition
• {packet classes} = {message classes} × {resource classes}
[adapted from Becker STM’09 talk]
Using VCs for flow control
• Coupling between channels and buffers causes head-of-line blocking
– Adds false dependencies between packets
– Limits channel utilization
– Increases latency
– Even with VCs for deadlock prevention, still applies to packets in same class
• Solution: assign multiple VCs to each packet class
[adapted from Becker STM’09 talk]
VC router pipeline
• Route Computation (RC): per packet
– Determine candidate output port(s) and VC(s)
– Can be precomputed at upstream router (lookahead routing)
• Virtual Channel Allocation (VA): per packet
– Assign available output VCs to waiting packets at input VCs
• Switch Allocation (SA): per flit
– Assign switch time slots to buffered flits
• Switch Traversal (ST): per flit
– Send flits through crossbar switch to appropriate output
[adapted from Becker STM’09 talk]
Allocation basics
• Arbitration:
– Multiple requestors
– Single resource
– Request + grant vectors
• Allocation:
– Multiple requestors
– Multiple equivalent resources
– Request + grant matrices
• Matching:
– Each grant must satisfy a request
– Each requester gets at most one grant
– Each resource is granted at most once
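The three matching conditions translate directly into a small checker (a sketch; the matrix-of-booleans representation is an assumption).

```python
def is_matching(request, grant):
    """Verify that a grant matrix is a valid matching for a request matrix:
    every grant satisfies a request, each requester (row) gets at most one
    grant, and each resource (column) is granted at most once."""
    rows, cols = len(grant), len(grant[0])
    for i in range(rows):
        for j in range(cols):
            if grant[i][j] and not request[i][j]:
                return False          # grant without a matching request
    if any(sum(row) > 1 for row in grant):
        return False                  # a requester granted twice
    if any(sum(grant[i][j] for i in range(rows)) > 1 for j in range(cols)):
        return False                  # a resource granted twice
    return True
```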
[adapted from Becker STM’09 talk]
Separable allocators
• Matchings have at most one grant per row and per column
• Implement via two phases of arbitration
– Column-wise and row-wise, in either order (input-first or output-first)
– Arbiters in each stage are fully independent
• Fast and cheap
• But bad choices in the first phase can prevent the second stage from generating a good matching!
[figure: input-first and output-first separable allocator structures]
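A minimal input-first separable allocator, using fixed-priority arbiters in both stages (an illustrative simplification; real designs typically use round-robin arbiters for fairness):

```python
def separable_input_first(request):
    """Input-first separable allocation: stage 1 picks one request per
    input (row arbitration); stage 2 picks one winner per output
    (column arbitration). Both stages use fixed lowest-index priority."""
    n_in, n_out = len(request), len(request[0])
    # Stage 1: each input selects at most one of its requested outputs.
    stage1 = [[False] * n_out for _ in range(n_in)]
    for i in range(n_in):
        for j in range(n_out):
            if request[i][j]:
                stage1[i][j] = True
                break
    # Stage 2: each output grants at most one surviving request.
    grant = [[False] * n_out for _ in range(n_in)]
    for j in range(n_out):
        for i in range(n_in):
            if stage1[i][j]:
                grant[i][j] = True
                break
    return grant
```

For the request matrix [[1,1],[1,0]], both inputs pick output 0 in the first stage, so only one grant survives even though a two-grant matching exists; this is exactly the bad first-phase choice the slide warns about.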
[adapted from Becker STM’09 talk]
Wavefront allocators
• Avoid separate phases
– … and bad decisions in the first
• Generate better matchings
• But delay scales linearly
• Also difficult to pipeline
• Principle of operation:
– Pick initial diagonal
– Grant all requests on diagonal (they never conflict!)
– For each grant, delete requests in same row, column
– Repeat for next diagonal
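The diagonal sweep can be sketched as follows (a behavioral model only; hardware evaluates all cells of a diagonal in parallel, since cells on one diagonal never share a row or column):

```python
def wavefront_allocate(request):
    """Wavefront allocation on an n x n request matrix: sweep diagonals
    starting from the priority diagonal (d = 0 here); grant every request
    whose row and column are still free, then move to the next diagonal."""
    n = len(request)
    row_taken = [False] * n
    col_taken = [False] * n
    grant = [[False] * n for _ in range(n)]
    for d in range(n):                 # successive diagonals
        for i in range(n):
            j = (i + d) % n            # cell (i, j) lies on diagonal d
            if request[i][j] and not row_taken[i] and not col_taken[j]:
                grant[i][j] = True
                row_taken[i] = True    # delete requests in this row ...
                col_taken[j] = True    # ... and in this column
    return grant
```

Rotating which diagonal goes first between allocations is how real designs spread priority fairly; this sketch always starts at the main diagonal.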
[adapted from Becker STM’09 talk]
Wavefront allocator timing
• Originally conceived as full-custom design
• Tiled design
• True delay scales linearly
• Signal wraparound creates combinational loops
– Effectively broken at priority diagonal
– But static timing analysis cannot infer that
– Synthesized designs must be modified to avoid loops!
[adapted from Becker STM’09 talk]
Diagonal Propagation Allocator
• Unrolled matrix avoids combinational loops
• Sliding priority window activates sub-matrix cells
• But static timing analysis again sees false paths!
– Actual delay is ~n
– Reported delay is ~(2n-1)
– Hurts synthesized designs
[adapted from Becker STM’09 talk]
VC allocation
• Before a packet can proceed through the router, it needs to acquire ownership of a VC at the downstream router
• VC allocator matches unassigned input VCs with output VCs that are not currently in use– P×V requestors (input VCs), P×V resources (output VCs)
• VC is acquired by head flit, inherited by body & tail flits
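A behavioral sketch of this matching, with busy output VCs masked out and a simple greedy assignment standing in for a real allocator (both are illustrative assumptions):

```python
def vc_allocate(requests, output_vc_busy):
    """Match unassigned input VCs to free output VCs.
    requests[i][j] is True when the head flit waiting at input VC i
    requests output VC j; output_vc_busy[j] masks VCs already owned
    by in-flight packets. Greedy lowest-index matching for illustration."""
    n_in, n_out = len(requests), len(requests[0])
    taken = list(output_vc_busy)      # copy: newly assigned VCs become busy
    assignment = {}
    for i in range(n_in):
        for j in range(n_out):
            if requests[i][j] and not taken[j]:
                assignment[i] = j     # head flit acquires output VC j; body
                taken[j] = True       # and tail flits will inherit it
                break
    return assignment
```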
[adapted from Becker STM’09 talk]
VC allocator implementations
• Not shown:
– Masking logic for busy VCs
[adapted from Becker STM’09 talk]
Typical pipelined router
• RC: route computation
• VA + SA: VC and switch allocation
• ST: switch traversal
• LT: link traversal
[figure: pipeline diagram with stages RC → VA/SA → ST → LT]