ece669: lecture 24 asoc: a scalable on-chip communication architecture russell tessier, jian liang,...

ECE669: Lecture 24

aSoC: A Scalable On-Chip aSoC: A Scalable On-Chip Communication Communication

ArchitectureArchitectureRussell Tessier, Jian Liang, Andrew Laffely, and Wayne Russell Tessier, Jian Liang, Andrew Laffely, and Wayne

BurlesonBurlesonUniversity of Massachusetts, AmherstUniversity of Massachusetts, Amherst

Reconfigurable Computing GroupReconfigurable Computing Group

Supported by National Science Foundation Grants CCR-081405 and CCR-Supported by National Science Foundation Grants CCR-081405 and CCR-99882389988238

ECE669: Lecture 24

OutlineOutline

Design philosophyDesign philosophy Communication architectureCommunication architecture Mapping tools / simulation Mapping tools / simulation

environmentenvironment Benchmark designsBenchmark designs Experimental results Experimental results Prototype layoutPrototype layout

ECE669: Lecture 24

Design Goals / Design Goals / PhilosophyPhilosophy

Low-overheadLow-overhead core interface for on-chip core interface for on-chip streamsstreams On-chip bus substitute for streaming applicationsOn-chip bus substitute for streaming applications

Allows for interconnect of heterogeneous Allows for interconnect of heterogeneous corescores Differing sizes and clock ratesDiffering sizes and clock rates

Based on static schedulingBased on static scheduling Support for some dynamic events, run-time Support for some dynamic events, run-time

reconfigurationreconfiguration Development of complete systemDevelopment of complete system

Architecture, prototype layout, simulator, Architecture, prototype layout, simulator, mapping tools, target applicationsmapping tools, target applications

ECE669: Lecture 24

aSoC ArchitectureaSoC Architecture

MultiplierFPGAFPGA

uProc

Multiplier

ctrl

South Core

West

North

East

tile

West

North

South

East

Ctrl

Core

Communication

Interface

Heterogeneous CoresHeterogeneous CoresPoint-to-point connectionsPoint-to-point connectionsCommunication InterfaceCommunication Interface

ECE669: Lecture 24

Point to Point Data TransferPoint to Point Data Transfer

Data transfers from tile to tile on each communication cycleData transfers from tile to tile on each communication cycle Schedule repeats based on required communication patternsSchedule repeats based on required communication patterns

Core

CoreCore

Core Core

Core

Tile A Tile B

Tile D Tile E Tile F

Tile C

Cycle 2 Cycle 3 Cycle 4Cycle 1

ECE669: Lecture 24

Core and Communication Core and Communication InterfaceInterface

NSEWSouth

East

IPCore

core

port

InterfaceCrossbar

CDMC

DM

…PCDecoder

PC logic

Flow control

Interfacecontroller

Scheduleinstruction

CDM

CDM

InterconnectMemory

West

North

NSEW

ECE669: Lecture 24

Communication Interface Communication Interface OverviewOverview

Interconnect memoryInterconnect memory controls crossbar controls crossbar configurationconfiguration Programmable state machineProgrammable state machine Allows multiple streamsAllows multiple streams

Interface controllerInterface controller manages flow control manages flow control Supports simple protocol based on single packet Supports simple protocol based on single packet

bufferingbuffering Communication data memory (CDM)Communication data memory (CDM) buffers buffers

stream datastream data Single storage location per streamSingle storage location per stream

CoreportCoreport provides storage and synchronization provides storage and synchronization Storage for input and output streamsStorage for input and output streams Requires minimal support for synchronizationRequires minimal support for synchronization

ECE669: Lecture 24

Interface Control Interface Control CircuitryCircuitry

PC

+1

Crossbar

!=0?

N S E W C I

3 3 3 3 3 3 123

5

3

Decoder

Iin[31:0]

8

UC-Jump

Nout[31:0]Sout[31:0]Eout[31:0]Wout[31:0]Cout[31:0]

Nin[31:0]Sin[31:0]Ein[31:0]Win[31:0]Cn[31:0]

Next PC En-

Jum

p

5 1 1 3

Coreport

CP

O

CP

I

ECE669: Lecture 24

Data Dependent Stream Data Dependent Stream ControlControl

Two types of branchesTwo types of branches Unconditional branch – end of schedule Unconditional branch – end of schedule

reachedreached Conditional branch – test data value to modify Conditional branch – test data value to modify

schedule sequenceschedule sequence Provides minimal support for Provides minimal support for

reconfigurationreconfiguration Requires core interface support Requires core interface support

ECE669: Lecture 24

Inter-tile Flow Control / Inter-tile Flow Control / BufferBuffer

Provide minimum amount of storage Provide minimum amount of storage per stream at each node (1 packet)per stream at each node (1 packet)

First priority: transfer data from First priority: transfer data from storagestorage

Send and acknowledge simultaneouslySend and acknowledge simultaneously Can’t send same stream on Can’t send same stream on

consecutive cyclesconsecutive cyclesData

Buffer Full

ECE669: Lecture 24

Inter-tile Flow ControlInter-tile Flow Control

InterfaceCrossbar

CDM Addr

…

Read Addr

PC

Clear Addr

NW

E

S

EWNSC

Data

data

clea

r

Add

r

ValidBit

1

0

ValidBit

Flowcontrol

Data fromwest

1

0

CDM

Data Valid Bit

Data

Wr Addr

Rd AddrRead Addr

Flowcontrol

Data fromwest

ToCrossbar

ECE669: Lecture 24

Coreport Interface to Coreport Interface to CommunicationCommunication

Data buffer provides synchronization with Data buffer provides synchronization with flow controlflow control

Stream indicators (CPO, CPI) provide access Stream indicators (CPO, CPI) provide access to flow control bitsto flow control bits

Data

Add

r

ValidBit

Add

r

Dat

a

Cle

ar

Dat

a

InterfaceCrossbar

CoreportAccess?

DeMux

Data

WE

ValidBit

Add

r

Dat

a

Add

r

Dat

a

NSEW

NSEW

CPO CPI

NSEW NSEW

Flow Control Bits

From InterconnectMemory

CPO CPI

CO CI

From Core To Core

Output Coreport Input Coreport

ECE669: Lecture 24

Adapting the IP CoreAdapting the IP Core

Multiplier exampleMultiplier example State machine sequencerState machine sequencer

ValidBit

Data

StateMachine

MU

L LD

LD

LD

Data

Clear

Addr

A

B

ValidBit

Data

Addr

Data

En

Input Coreport Output Coreport

FromCI

ToCI

Multiplier Core

ECE669: Lecture 24

Design Mapping Tool FlowDesign Mapping Tool Flow

Support multiple core clock speeds Support multiple core clock speeds and design formatsand design formats

Automate scheduling/routingAutomate scheduling/routing Allow feedback between core Allow feedback between core

characteristics and mapping decisionscharacteristics and mapping decisions Generate both core and Generate both core and

communication programming communication programming informationinformation

Lots of room for improvement Lots of room for improvement (StreamIt, HW/SW partitioning, (StreamIt, HW/SW partitioning, estimators)estimators)

ECE669: Lecture 24

Design Mapping ToolsDesign Mapping Tools

Front-endparse

SUIFoptimization

Basic blockPartition/Assignment

Inter-coresynchronization

Communicationscheduling

Corecompilation

Codegeneration

exec. timeestimation

code

exe. time

Graph-basedInter. Format

Stream assignment Enhanced I.F.

dependencies

Stream schedules core I.F.

R4000Instructions

Bit streams Communicationinstructions

Source

ECE669: Lecture 24

Design Mapping Tool Front Design Mapping Tool Front EndEnd

Current system isolates computation Current system isolates computation into basic blocksinto basic blocks

Stream-oriented front-end (e.g. StreamIt) Stream-oriented front-end (e.g. StreamIt) more appropriate. more appropriate.

Front-end preprocessingFront-end preprocessing Built on SUIFBuilt on SUIF Performs standards optimizationsPerforms standards optimizations

Intermediate form used for Intermediate form used for subsequent partitioning placement, subsequent partitioning placement, and scheduling (routing)and scheduling (routing)

User interface allows for interaction User interface allows for interaction and feedbackand feedback

ECE669: Lecture 24

Partitioning and AssignmentPartitioning and Assignment Clustering used to collect blocks based on cost Clustering used to collect blocks based on cost

function:function:

Cost function takes both computation and Cost function takes both computation and communication into accountcommunication into account TTcompute compute = estimate overall compute time= estimate overall compute time

TToverlap overlap = estimate overall time of overlapping = estimate overall time of overlapping communicationcommunication

cctotal total = estimate overall communication time= estimate overall communication time

Swap-based approach used to minimize cost across Swap-based approach used to minimize cost across cores based on performance estimates.cores based on performance estimates.

totalcompute czyTx *T

1**cost

overlap

ECE669: Lecture 24

Scheduled RoutingScheduled Routing Number and locations of streams known as Number and locations of streams known as

a result of schedulinga result of scheduling Stream paths routed as a function of Stream paths routed as a function of

required path bandwidth (channel capacity)required path bandwidth (channel capacity) Basic approachBasic approach

Order nets by Manhattan lengthOrder nets by Manhattan length Route streams using Prim’s algorithm across Route streams using Prim’s algorithm across

time slices based on channel costtime slices based on channel cost Determine feasible path for all streams Determine feasible path for all streams Attempt to “fill-in” unused bandwidth in Attempt to “fill-in” unused bandwidth in

schedule with additional stream transfersschedule with additional stream transfers

ECE669: Lecture 24

Back-end Code GenerationBack-end Code Generation

C code targeted to R4000 coresC code targeted to R4000 cores Subsequently compiled with gccSubsequently compiled with gcc

Verilog code for FPGA blocksVerilog code for FPGA blocks Synthesized with Synopsys and Altera Synthesized with Synopsys and Altera

toolstools Interconnect memory instructions for Interconnect memory instructions for

each interconnect memoryeach interconnect memory Limited by size of interconnect memoryLimited by size of interconnect memory

ECE669: Lecture 24

Simulation OverviewSimulation Overview Simulation takes place in two phasesSimulation takes place in two phases

Core simulator determines computation Core simulator determines computation cycles between interface accessescycles between interface accesses

Cycle accurate interconnect simulator Cycle accurate interconnect simulator determines data transfer between cores determines data transfer between cores taking core delay into account.taking core delay into account.

ECE669: Lecture 24

Simulation EnvironmentSimulation Environment

Core codes from AppMapper

R4000 Sim.(SimpleScalar)

MAC Sim.MEM Sim.FPGA Sim.(Quartus)

C codeVerilog Core config.

Coreconfig.

Networksimulation

Coresimulation

Combinedevaluation

System statistics

Computation delays

comm.events

Config.Core speed

TopologyCore locationCI instruction

Simulator Lib.

System performance

C representationOf cores

ECE669: Lecture 24

Core SimulatorsCore Simulators

Simplescalar (D. Burger/T. Austin – U. Simplescalar (D. Burger/T. Austin – U. Wisconsin)Wisconsin)

Models R4000-like architecture at the cycle Models R4000-like architecture at the cycle levellevel

Breakpoints used to calculate cycle counts Breakpoints used to calculate cycle counts between communication network between communication network interactioninteraction

Cadence Verilog –XLCadence Verilog –XL Used to model 484 LUT FPGA block designsUsed to model 484 LUT FPGA block designs Modeled at RTL and LUT levelModeled at RTL and LUT level

Custom C simulationCustom C simulation Cycle counts generated for memory and Cycle counts generated for memory and

multiply accumulate blocksmultiply accumulate blocks Simulators invoked via scriptsSimulators invoked via scripts

ECE669: Lecture 24

Interconnect SimulatorInterconnect Simulator Based on NSIM (MIT NuMesh Simulator Based on NSIM (MIT NuMesh Simulator

– C. Metcalf)– C. Metcalf) Each tile modeled as a separate processEach tile modeled as a separate process Interconnect memory instructions used to Interconnect memory instructions used to

control cycle-by-cycle operationcontrol cycle-by-cycle operation Core speeds and flow control circuitry Core speeds and flow control circuitry

modeled accurately.modeled accurately. Adapted for a series of on-chip Adapted for a series of on-chip

interconnect architectures (bus-based interconnect architectures (bus-based architectures)architectures)

ECE669: Lecture 24

Target Architectural ModelsTarget Architectural Models

FPGA blocks contain 121 4-LUT clustersFPGA blocks contain 121 4-LUT clusters Custom MAC and 32Kx8 SRAM (Mem) Custom MAC and 32Kx8 SRAM (Mem)

blocksblocks Same configurations used for all Same configurations used for all

benchmarksbenchmarks

ECE669: Lecture 24

Example: MPEG-2Example: MPEG-2

Design partitioned across eleven coresDesign partitioned across eleven cores Other applications: IIR filter, image Other applications: IIR filter, image

processing, FFTprocessing, FFT

IDCT

source- recon.

source- recon.

source- recon.

source- recon.

ME

In Buf

DCT

Control

Ref Buf

sourceframe

reconstructedframe

In Buf Ref Buf

control IDCT

ME MAC2 MAC1 DCT

MAC3 MAC4 MAC0

MAC0

MAC1

MAC2

MAC3

MAC4

recon.block

DCTblock

DCTblock

Motionerror

R4000

R4000

R4000 MEM

MEM

ECE669: Lecture 24

Core ParametersCore Parameters

Communication interface, MAC, FPGA, Communication interface, MAC, FPGA, and MEM sizes determined through and MEM sizes determined through layout layout (TSMC 0.18um)(TSMC 0.18um)

** R4000 size from MIPs web page** R4000 size from MIPs web page

SpeedSpeed Area (Area (λλ2 2 ))

Comm. Comm. InterfaceInterface

2.5 ns2.5 ns 2500 x 2500 x 35003500

MIPs MIPs R4000R4000

5 ns5 ns 4.3 x 104.3 x 1077 ****

MACMAC 5 ns5 ns 1500 x 1500 x 10001000

FPGAFPGA 10 ns10 ns 30000 x 30000 x 3000030000

MEM MEM (32Kx8)(32Kx8)

5 ns5 ns 10000x 10000x 1000010000

ECE669: Lecture 24

Mapping StatisticsMapping Statistics

Number of Interconnect Mem instructions Number of Interconnect Mem instructions (CI Instruct) deceptively small(CI Instruct) deceptively small

Likely need to better fold streams in Likely need to better fold streams in scheduleschedule

DesignDesign No. No. CoresCores

No. No. StreamsStreams

Max CI Max CI Instruct.Instruct.

Max Max Streams Streams Per CIPer CI

Max Max CPort CPort Mem. Mem. DepthDepth

IIRIIR 99 1111 22 55 55

IIRIIR 1616 2020 22 55 55

IMGIMG 99 88 22 33 33

IMGIMG 1616 1515 44 44 44

FFTFFT 1616 2525 66 77 77

MPEGMPEG 1616 1919 44 88 88

ECE669: Lecture 24

Comparison to IBM Comparison to IBM CoreConnectCoreConnect9 Core 9 Core

ModelModel16 Core Model16 Core Model

Execution Time Execution Time (ms)(ms)

IIRIIR IMGIMG IIRIIR IMGIMG FFTFFT MPEGMPEG

R4000R4000 0.040.0499

327.327.00

0.3500.350 327.0327.0 0.790.79 152152

CoreConnectCoreConnect 0.010.0122

22.022.0 0.0160.016 30.530.5 0.120.12 173173

Coreconnect Coreconnect (burst)(burst)

0.010.0122

18.918.9 0.0150.015 24.324.3 0.120.12 172172

aSoCaSoC 0.000.0066

9.69.6 0.0060.006 7.37.3 0.090.09 8383

aSoC Speed-up aSoC Speed-up vs. burstvs. burst

2.02.0 2.32.3 2.52.5 3.33.3 1.31.3 2.12.1

Used aSoC LinksUsed aSoC Links 88 88 3333 2727 4141 2626

aSoC max. link aSoC max. link usageusage

10%10% 8%8% 37%37% 28%28% 2%2% 25%25%

aSoC ave. link aSoC ave. link usageusage

7%7% 7%7% 22%22% 25%25% 2%2% 5%5%

CoreConnect CoreConnect busy (burst)busy (burst)

91%91% 100100%%

100%100% 99%99% 32%32% 67%67%

Still work to do on mapping environment to Still work to do on mapping environment to boost aSoC link utilizationboost aSoC link utilization

ECE669: Lecture 24

Comparison to Hierarchical Comparison to Hierarchical CoreConnectCoreConnect

Multiple levels of arbitration slows down Multiple levels of arbitration slows down hierarchical CoreConnecthierarchical CoreConnect

9-core 9-core ModelModel

16-Core Model16-Core Model

Execution Execution Time Time (ms)(ms)

IIRIIR IMGIMG IIRIIR IMGIMG FFTFFT MPEGMPEG

Hier Hier CoreConneCoreConnectct

0.0130.013 26.026.0 15.715.7 37.437.4 0.150.15 178178

aSoCaSoC 0.0060.006 9.69.6 7.07.0 7.37.3 0.090.09 8383

aSoC aSoC speedupspeedup

2.12.1 2.72.7 2.22.2 5.15.1 1.61.6 2.22.2

ECE669: Lecture 24

aSoC Comparison to Dynamic aSoC Comparison to Dynamic NetworkNetwork

Direct comparison to oblivious routing network Direct comparison to oblivious routing network 11

9-Core Model9-Core Model 16 Core Model16 Core Model

ExecutioExecution Time n Time (ms)(ms)

IIRIIR IMGIMG IIRIIR IMGIMG MPEGMPEG

Dynamic Dynamic RoutingRouting

0.0080.008 14.414.4 8.78.7 9.79.7 162.0162.0

aSoCaSoC 0.0060.006 6.16.1 7.07.0 7.37.3 82.582.5

aSoC aSoC SpeedupSpeedup

1.31.3 2.42.4 1.31.3 1.31.3 2.02.0

1. W. Dally and H. Aoki, “Deadlock-free Adaptive Routing in Multi-computer Networks Using Virtual Routing”, IEEE Transactions on Parallel and Distributed Systems, April 1993

ECE669: Lecture 24

aSoC LayoutaSoC Layout

ECE669: Lecture 24

aSoC Multi-core LayoutaSoC Multi-core Layout

Comm. Comm. Interface Interface consumes consumes about 6% of tileabout 6% of tile

Critical path in Critical path in flow control flow control between tilesbetween tiles

Currently Currently integrating integrating additional coresadditional cores

ECE669: Lecture 24

Future Work: Dynamic Voltage Future Work: Dynamic Voltage ScalingScaling

Data transfer Data transfer rate to/from rate to/from core used to core used to control voltage control voltage and clockand clock

Counter and Counter and CAM used to CAM used to select sourcesselect sources

May be software May be software controlledcontrolled

Core

Coreports

DecoderLocal

Frequency& Voltage

North to South & East

Instruction Memory

PC

Controller

North

South

East

West

Local Config.

North

South

East

West

Inputs Outputs

ECE669: Lecture 24

Future Work: Dynamic Voltage Future Work: Dynamic Voltage ScalingScaling

CAM allows CAM allows selectionselection

V1 V2 V3 V4

CAM

Voltage Selection System

/128/64/32/16/8/4/2/1

Critical Path Check

SetReset

Global Clock

ClockSelector

Data RateMeasurement

Core

Coreport In count count

ClockEnable

Coreport Out

Local Clock Local Supply

ECE669: Lecture 24

Future WorkFuture Work Improved software mapping Improved software mapping

environmentenvironment Integration of more coresIntegration of more cores Mapping of substantial applicationsMapping of substantial applications

Turbo codesTurbo codes Viterbi decoderViterbi decoder

More integrated simulation More integrated simulation environmentenvironment

ECE669: Lecture 24

SummarySummary Goal: Create low-overhead interconnect Goal: Create low-overhead interconnect

environment for on-chip stream environment for on-chip stream communicationcommunication

IP core augmented with communication IP core augmented with communication interfaceinterface

Flow control and some stream Flow control and some stream reconfiguration included in the architecturereconfiguration included in the architecture

Mapping tools and simulation environment Mapping tools and simulation environment assist in evaluating designassist in evaluating design

Initial results show favorable comparison to Initial results show favorable comparison to bus and high-overhead dynamic networks.bus and high-overhead dynamic networks.

ece669: lecture 24 asoc: a scalable on-chip communication architecture russell tessier, jian liang,...

Documents

synchronization slide

synchronization storage

output streams storage

stream single storage

stream coreport

synchronization coreport

minimal support

streaming applications