High-Performance Networks for Dataflow Architectures

Post on 25-Feb-2016

Page 1: High-Performance Networks for  Dataflow Architectures

High-Performance Networks for Dataflow Architectures

Pravin Bhat
Andrew Putnam

Page 2: High-Performance Networks for  Dataflow Architectures

Overview

- Motivation & Design Constraints
- Network design
- Performance
- Adaptive Routing
- Conclusion

Page 4: High-Performance Networks for  Dataflow Architectures

Motivation

- Signal delay on wires is more important than transistor switching speed
- Seriously decreased reliability in future processes
- Factory testing will not be possible
- Expect 20% of transistors to be DOA
- Expect 10% more to die over several months
- Dataflow is an answer, but the network is currently a bottleneck

Page 5: High-Performance Networks for  Dataflow Architectures

Dataflow Characteristics

- Unpredictable traffic
  - Cannot pre-allocate resources
- Highly bursty traffic
  - Quick delivery of bursts is critical
- Nodes are not guaranteed to consume messages
  - Potential for livelock & deadlock

Page 7: High-Performance Networks for  Dataflow Architectures

Network Requirements

- High performance during bursts
- Area efficient
- Guarantee message delivery
- Deadlock & livelock free
- Fault tolerant
- Regular 2-D physical structure

Page 8: High-Performance Networks for  Dataflow Architectures

Topology

- On-chip: must be implementable in 2-D
- Regular tiled structure suggests:
  - Grid
  - Torus
  - Hypercube
  - Fat Tree
- Hypercube is difficult to route and scale
- Fat Tree has a single point of failure
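One way to see the difference between the grid and torus candidates is to count hops between two tiles. The sketch below is illustrative only: the function names and the 8x8 array size are assumptions, not figures from the slides. A torus adds wrap-around links, so a packet can take the shorter way around each dimension.

```python
def grid_hops(src, dst):
    """Minimum hop count between two (x, y) tiles on a 2-D grid (mesh)."""
    (x0, y0), (x1, y1) = src, dst
    return abs(x1 - x0) + abs(y1 - y0)


def torus_hops(src, dst, width, height):
    """Minimum hop count on a 2-D torus, where each row and column wraps around."""
    (x0, y0), (x1, y1) = src, dst
    dx = abs(x1 - x0)
    dy = abs(y1 - y0)
    # In each dimension, take the shorter of the direct and wrap-around paths.
    return min(dx, width - dx) + min(dy, height - dy)


# Corner-to-corner on a hypothetical 8x8 array of tiles.
print(grid_hops((0, 0), (7, 7)))         # 14
print(torus_hops((0, 0), (7, 7), 8, 8))  # 2
```

Hop count is only one factor: the wrap-around links of a torus are physically longer than grid links, which affects per-link timing.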

Page 9: High-Performance Networks for  Dataflow Architectures

Routing

- Static routing does not provide essential fault tolerance
- Use a modified Virtual Channel (VC) algorithm
- VC guarantees deadlock freedom if nodes consume messages
- Dynamically adaptive to handle transient faults & congestion
- Initial studies used static routing

Page 10: High-Performance Networks for  Dataflow Architectures

Flow Control

- Resource reservation not possible
- Long-latency wires prohibit handshakes
- Send messages assuming accept
- Buffer just enough to allow receiver to send reject signal on subsequent clock cycle
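The optimistic scheme above can be sketched as a small cycle-by-cycle simulation. This is a simplified model under stated assumptions, not the authors' design: the `Receiver` class, the `capacity` and `drain_every` parameters, and the retry-on-next-cycle policy are all hypothetical, and `capacity` must be at least 1 for the loop to terminate.

```python
from collections import deque


class Receiver:
    """Hypothetical receiver with a small input buffer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()

    def offer(self, msg):
        """Store msg if there is room; return False to signal a reject."""
        if len(self.buffer) < self.capacity:
            self.buffer.append(msg)
            return True
        return False


def simulate(messages, capacity, drain_every):
    """Send one message per cycle without a handshake, assuming acceptance.

    A rejected message is put back at the front of the sender's queue and
    retried on the next cycle. The receiver consumes one buffered message
    every `drain_every` cycles (an assumed consumption rate).
    """
    rx = Receiver(capacity)
    pending = deque(messages)
    delivered, rejects, cycle = [], 0, 0
    while pending or rx.buffer:
        cycle += 1
        if cycle % drain_every == 0 and rx.buffer:
            delivered.append(rx.buffer.popleft())
        if pending:
            msg = pending.popleft()
            if not rx.offer(msg):
                rejects += 1
                pending.appendleft(msg)  # resend after the reject
    return delivered, rejects, cycle


delivered, rejects, cycles = simulate([1, 2, 3], capacity=1, drain_every=2)
print(delivered, rejects, cycles)  # [1, 2, 3] 1 6
```

The sender never waits for permission, so an uncongested link carries one message per cycle; the cost of optimism is paid only when the receiver is full.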

Page 11: High-Performance Networks for  Dataflow Architectures

Deadlock-Free Operation

- Nodes cannot always consume messages
- Add a dedicated channel to and from memory
  - Adds 8% area overhead
- Rotate stalled operands out of PEs to ensure forward progress
- Send first operand back at a faster rate to avoid livelock

Page 13: High-Performance Networks for  Dataflow Architectures

Performance

- Ran network-centric simulations
- 20 billion instructions
- Spec2000, Splash2, and Dataflow benchmarks
- Goal is to find optimum balance of:
  - Number of Virtual Channels
  - Queue Length
  - Link Bandwidth
  - Packets per message

Page 14: High-Performance Networks for  Dataflow Architectures

[Figure: Virtual Channels. y-axis: Performance (Runtime); x-axis: Virtual Channels, 0 to 16. Series: ocean, lu, fir, art, mcf, each as (G) and (T).]

Page 15: High-Performance Networks for  Dataflow Architectures

[Figure: Queue Length. y-axis: Performance (Runtime); x-axis: Queue Length, 0 to 64. Series: ocean, lu, fir, art, mcf, each as (G) and (T).]

Page 16: High-Performance Networks for  Dataflow Architectures

[Figure: Link Bandwidth. y-axis: Performance (Runtime); x-axis: Bandwidth, 0 to 16. Series: ocean, lu, fir, art, mcf, each as (G) and (T).]

Page 17: High-Performance Networks for  Dataflow Architectures

[Figure: Link Width. y-axis: Performance (Runtime); x-axis: Packets per Message, 0 to 64. Series: ocean, lu, fir, art, mcf, each as (G) and (T).]

Page 18: High-Performance Networks for  Dataflow Architectures

ASIC Model

- Performance must be balanced with area
- Developed RTL model of WaveScalar network architecture
- 90 nm process ASIC standard cell library
- Timing per link:
  - Grid links: 2.76 ns
  - Torus links: 6.16 ns
- Network switch is 11.6% of chip area

Page 19: High-Performance Networks for  Dataflow Architectures

[Figure: Virtual Channels. y-axis: Performance / Area; x-axis: Virtual Channels, 0 to 18. Series: ocean, lu, fir, art, mcf, each as (G) and (T).]

Page 20: High-Performance Networks for  Dataflow Architectures

[Figure: Link Bandwidth. y-axis: Performance / Area; x-axis: Number of Links, 0 to 16. Series: ocean, lu, fir, art, mcf, each as (G) and (T).]

Page 21: High-Performance Networks for  Dataflow Architectures

[Figure: Queue Length. y-axis: Performance / Area; x-axis: Queue Length, 0 to 64. Series: ocean, lu, fir, art, mcf, each as (G) and (T).]

Page 23: High-Performance Networks for  Dataflow Architectures

Virtual Channels Flow Control

- In hardware only Head-of-Queue can be dequeued in one clock cycle
- If the first message in a queue is blocked then every message behind it is blocked
- The network utilization suffers due to idle links

Page 24: High-Performance Networks for  Dataflow Architectures

Virtual Channels Flow Control

- Virtual Channels: several small queues instead of one long queue
- Decouples buffer resources from link resources
- Increases network throughput by increasing link usage
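The effect of head-of-queue blocking, and how several small queues relieve it, can be sketched with a toy model of one link. Everything here is an illustrative assumption: messages are represented only by their destination label, `blocked` is the set of destinations that cannot currently accept traffic, and messages are assigned to virtual channels round-robin.

```python
from collections import deque


def drain_single_queue(messages, blocked, cycles):
    """One long FIFO: only the head of the queue may leave each cycle."""
    queue = deque(messages)
    sent = []
    for _ in range(cycles):
        if queue and queue[0] not in blocked:
            sent.append(queue.popleft())
    return sent


def drain_virtual_channels(messages, blocked, cycles, num_vcs):
    """Several small FIFOs share one link.

    Each cycle the link serves the first channel whose head is not blocked.
    """
    vcs = [deque() for _ in range(num_vcs)]
    for i, msg in enumerate(messages):
        vcs[i % num_vcs].append(msg)  # round-robin assignment (assumed policy)
    sent = []
    for _ in range(cycles):
        for vc in vcs:
            if vc and vc[0] not in blocked:
                sent.append(vc.popleft())
                break  # one message per cycle on the shared link
    return sent


traffic = ["A", "B", "B", "B"]  # destination labels; "A" cannot accept
print(drain_single_queue(traffic, {"A"}, cycles=4))         # []
print(drain_virtual_channels(traffic, {"A"}, 4, num_vcs=2))  # ['B', 'B']
```

With a single queue the blocked head idles the link for all four cycles. With two virtual channels the link stays busy while one channel is stalled, although the message queued behind the blocked head in that channel still waits.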

Page 25: High-Performance Networks for  Dataflow Architectures

Dimension Order Routing

- Old WaveScalar Routing Protocol
- Network topology is a static grid
- Packets first travel to the correct x-coordinate and then to the correct y-coordinate
- Low network utilization from not using all available paths
- Not fault tolerant
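A minimal sketch of x-then-y routing on a grid, assuming tiles are addressed by integer (x, y) coordinates. The function name and coordinate convention are illustrative assumptions.

```python
def dimension_order_route(src, dst):
    """Return the tiles visited from src to dst, correcting x first, then y."""
    x, y = src
    path = [(x, y)]
    # Phase 1: move along x until the column matches the destination.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Phase 2: move along y until the row matches the destination.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


print(dimension_order_route((0, 0), (2, 1)))
# [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because the path is fully determined by the source and destination, every packet between the same pair of tiles uses the same links. That is why the scheme leaves alternative paths idle and cannot route around a failed link.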

Page 26: High-Performance Networks for  Dataflow Architectures

Adaptive Routing

- Progressively chooses longer routes instead of waiting for an unavailable resource
- High network utilization
- Fault tolerant
- Can cause deadlock

Page 27: High-Performance Networks for  Dataflow Architectures

Deadlock Free Adaptive Routing

- Some Virtual Channels are reserved for Dimension Order Routing; the rest are used for Adaptive Routing
- Every time a packet is routed in the wrong direction, its Dimension Reversal count is incremented
- No packet is allowed to wait in a virtual channel with a packet that has a lower Dimension Reversal count
- Mathematically proven to be deadlock free
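The waiting rule above can be sketched as a small check on Dimension Reversal (DR) counts. This is a hedged illustration, not the authors' implementation: the `Packet` class, the function names, and the choice to return `None` when no adaptive channel qualifies (read here as "fall back to a reserved dimension-order channel") are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    name: str
    dr: int = 0  # Dimension Reversal count


def record_hop(packet, productive):
    """Increment the DR count when a hop goes in the wrong direction."""
    if not productive:
        packet.dr += 1


def may_wait(packet, occupants):
    """A packet may wait in a channel only if no occupant has a lower DR count."""
    return all(other.dr >= packet.dr for other in occupants)


def choose_adaptive_channel(packet, adaptive_channels):
    """Return the index of the first adaptive channel the packet may wait in.

    Returns None if there is none; in this sketch that is taken to mean the
    packet must use a channel reserved for dimension-order routing.
    """
    for index, occupants in enumerate(adaptive_channels):
        if may_wait(packet, occupants):
            return index
    return None


channels = [[Packet("a", dr=1)], [Packet("b", dr=3)]]
p = Packet("p", dr=2)
print(choose_adaptive_channel(p, channels))  # 1
record_hop(p, productive=False)
record_hop(p, productive=False)
print(p.dr)                                  # 4
print(choose_adaptive_channel(p, channels))  # None
```

The rule means packets wait only behind packets with an equal or higher count, which is what rules out a cycle of packets all waiting on each other.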

Page 28: High-Performance Networks for  Dataflow Architectures

[Figure: Virtual Channels. y-axis: Performance (Runtime); x-axis: Virtual Channels, 0 to 16. Series: ocean, lu, fir, art, mcf, each as (G) and (T).]

Page 29: High-Performance Networks for  Dataflow Architectures

[Figure: Queue Length (Adaptive Speedup). y-axis: Performance (Runtime); x-axis: Queue Length, 0 to 64. Series: ocean, lu, fir, art, mcf, each as (G) and (T).]

Page 30: High-Performance Networks for  Dataflow Architectures

[Figure: Link Bandwidth (Adaptive Speedup). y-axis: Performance (Runtime); x-axis: Bandwidth, 0 to 16. Series: ocean, lu, fir, art, mcf, each as (G) and (T).]

Page 31: High-Performance Networks for  Dataflow Architectures

Conclusion

- Best performance per area with:
  - 2 Virtual Channels
  - 2 Links
  - 2-4 entries per queue
  - Torus Topology
  - Adaptive Routing
- Dataflow chip networks can be high-performance at reasonable area