ECE669: Lecture 24
aSoC: A Scalable On-Chip Communication Architecture
Russell Tessier, Jian Liang, Andrew Laffely, and Wayne Burleson
University of Massachusetts, Amherst
Reconfigurable Computing Group
Supported by National Science Foundation Grants CCR-081405 and CCR-9988238

Outline
- Design philosophy
- Communication architecture
- Mapping tools / simulation environment
- Benchmark designs
- Experimental results
- Prototype layout

Design Goals / Philosophy
- Low-overhead core interface for on-chip streams
  - On-chip bus substitute for streaming applications
- Allows for interconnect of heterogeneous cores
  - Differing sizes and clock rates
- Based on static scheduling
  - Support for some dynamic events, run-time reconfiguration
- Development of complete system
  - Architecture, prototype layout, simulator, mapping tools, target applications

aSoC Architecture
- Heterogeneous cores
- Point-to-point connections
- Communication interface
[Figure: array of tiles holding heterogeneous cores (FPGA, uProc, multiplier). Each tile pairs a core with a communication interface that has a controller (ctrl) and North, South, East, and West ports.]

Point to Point Data Transfer
- Data transfers from tile to tile on each communication cycle
- Schedule repeats based on required communication patterns
[Figure: six tiles (Tile A through Tile F), each with a core, showing stream transfers over Cycle 1 through Cycle 4.]
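
The repeating schedule described on this slide can be sketched in a few lines of Python. This is only an illustration: the four-cycle schedule and the tile pairs below are hypothetical and are not taken from the slide's figure.

```python
from itertools import cycle, islice

# Hypothetical four-cycle schedule: each entry lists the (source, destination)
# tile transfers that take place on that communication cycle.
SCHEDULE = [
    [("A", "B")],              # cycle 1
    [("B", "C"), ("D", "E")],  # cycle 2
    [("E", "F")],              # cycle 3
    [("A", "D")],              # cycle 4
]

def transfers(schedule, n_cycles):
    """Return the transfers for the first n_cycles, wrapping around the schedule."""
    return list(islice(cycle(schedule), n_cycles))

if __name__ == "__main__":
    for t, moves in enumerate(transfers(SCHEDULE, 6), start=1):
        print(t, moves)
```

Because the schedule is fixed ahead of time, cycle 5 simply repeats the transfers of cycle 1, with no arbitration at run time.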

Core and Communication Interface
[Figure: tile block diagram. The IP core connects through the coreport to the interface crossbar, which also has North, South, East, and West inputs and outputs. The communication interface contains the interconnect memory (schedule instructions), PC logic and decoder, the interface controller with flow control, and the communication data memory (CDM).]

Communication Interface Overview
- Interconnect memory controls crossbar configuration
  - Programmable state machine
  - Allows multiple streams
- Interface controller manages flow control
  - Supports simple protocol based on single packet buffering
- Communication data memory (CDM) buffers stream data
  - Single storage location per stream
- Coreport provides storage and synchronization
  - Storage for input and output streams
  - Requires minimal support for synchronization

Interface Control Circuitry
[Figure: interface control datapath. A PC with a +1 incrementer and jump logic (Next PC, En-Jump, UC-Jump, "!=0?" test) drives the decoder; the decoded instruction supplies select fields for the N, S, E, W, C, and I crossbar ports. The crossbar has 32-bit inputs (Nin, Sin, Ein, Win, Cin, Iin) and 32-bit outputs (Nout, Sout, Eout, Wout, Cout), and connects to the coreport through CPO and CPI.]

Data Dependent Stream Control
- Two types of branches
  - Unconditional branch: end of schedule reached
  - Conditional branch: test data value to modify schedule sequence
- Provides minimal support for reconfiguration
- Requires core interface support
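
A minimal sketch of the two branch types, tracing the schedule program counter. The instruction encoding (Python dicts with `jump` and `jump_if_nonzero` keys) and the route labels are hypothetical; the nonzero test is an assumption chosen for illustration.

```python
def run_schedule(program, data_values, max_steps=20):
    """Trace the PC of a tiny schedule program.

    Each instruction is a dict with a "route" label plus an optional branch:
      {"jump": target}             unconditional branch (end of schedule)
      {"jump_if_nonzero": target}  conditional branch on the next data value
    """
    pc, trace = 0, []
    values = iter(data_values)
    for _ in range(max_steps):
        instr = program[pc]
        trace.append(pc)
        if "jump" in instr:
            pc = instr["jump"]
        elif "jump_if_nonzero" in instr and next(values, 0) != 0:
            pc = instr["jump_if_nonzero"]
        else:
            pc += 1
    return trace

# Hypothetical program: instruction 1 tests a data value and either falls
# through to instruction 2 or selects the alternate transfer at instruction 3.
PROGRAM = [
    {"route": "W->E"},
    {"route": "C->N", "jump_if_nonzero": 3},
    {"route": "N->C", "jump": 0},
    {"route": "S->C", "jump": 0},
]

if __name__ == "__main__":
    print(run_schedule(PROGRAM, [0, 1], max_steps=6))
```

With data values 0 then 1, the first pass falls through to instruction 2 and the second pass branches to instruction 3, so the data stream changes the schedule sequence without reloading the interconnect memory.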

Inter-tile Flow Control / Buffer
- Provide minimum amount of storage per stream at each node (1 packet)
- First priority: transfer data from storage
- Send and acknowledge simultaneously
- Can't send same stream on consecutive cycles
[Figure: Data and Buffer Full signals between adjacent tiles.]
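
The single-packet buffering rule can be illustrated with a small cycle-level model. This is a sketch under stated assumptions, not the aSoC circuit: each tile on a stream's path is modeled as one buffer slot, and every decision uses the valid bits sampled at the start of the cycle, which is one simple way to obtain the "no consecutive cycles" behavior. The function names are hypothetical.

```python
def step(buffers, source, sink_ready=True):
    """Advance one communication cycle for a single stream.

    buffers: list of one-packet buffers (None = empty), one per tile on the path.
    source:  list of packets still waiting at the producing core.
    Decisions use the buffer state at the start of the cycle, so a buffer
    that drains this cycle cannot also be refilled this cycle.
    """
    full = [b is not None for b in buffers]  # snapshot of valid bits
    delivered = None
    nxt = list(buffers)
    # Last tile hands its packet to the consuming core.
    if full[-1] and sink_ready:
        delivered, nxt[-1] = buffers[-1], None
    # Interior hops: move only into a buffer that was empty at cycle start.
    for i in range(len(buffers) - 1):
        if full[i] and not full[i + 1]:
            nxt[i + 1], nxt[i] = buffers[i], None
    # Producer injects into the first buffer if it was empty.
    if source and not full[0]:
        nxt[0] = source.pop(0)
    return nxt, delivered

def run(n_tiles, packets, n_cycles):
    """Return what the sink receives on each cycle (None = nothing)."""
    buffers, source, out = [None] * n_tiles, list(packets), []
    for _ in range(n_cycles):
        buffers, delivered = step(buffers, source)
        out.append(delivered)
    return out

if __name__ == "__main__":
    print(run(2, [1, 2, 3], 7))
```

In this model a given stream delivers at most one packet every other cycle, which is the cost of keeping only one packet of storage per stream at each node.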

Inter-tile Flow Control
[Figure: inter-tile flow control datapath, showing the interface crossbar (N, S, E, W, C ports), the PC-supplied CDM read and clear addresses, the CDM with a data word and valid bit per entry (write and read addresses), data arriving from the west, and the flow control signal derived from the valid bit.]

Coreport Interface to Communication
- Data buffer provides synchronization with flow control
- Stream indicators (CPO, CPI) provide access to flow control bits
[Figure: output coreport (from core) and input coreport (to core). Each holds data, address, and valid bit; the interface crossbar reaches them through a demux gated by a "Coreport access?" check. CPO and CPI, supplied from the interconnect memory, select among the N, S, E, W flow control bits, and CO and CI connect to the core side.]

Adapting the IP Core
- Multiplier example
- State machine sequencer
[Figure: multiplier core wrapped by a state machine. The input coreport (from CI) supplies valid bit, data, clear, and address; the state machine issues load (LD) signals for operands A and B and for the result; the output coreport (to CI) receives valid bit, data, address, and enable.]
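
One way to picture the state machine sequencer is the toy model below. The state names and the `clock` interface are hypothetical and only loosely follow the signal names in the figure: the wrapper loads operand A, then operand B, then writes the product when the output coreport has space.

```python
class MultiplierWrapper:
    """Toy sequencer: load A, load B, then write A*B to the output coreport."""

    def __init__(self):
        self.state = "LOAD_A"
        self.a = self.b = None

    def clock(self, in_valid, in_data, out_full):
        """One core cycle. Returns (clear_input, out_write, out_data)."""
        if self.state == "LOAD_A" and in_valid:
            self.a, self.state = in_data, "LOAD_B"
            return True, False, None
        if self.state == "LOAD_B" and in_valid:
            self.b, self.state = in_data, "WRITE"
            return True, False, None
        if self.state == "WRITE" and not out_full:
            self.state = "LOAD_A"
            return False, True, self.a * self.b
        return False, False, None  # stall until data or space is available

if __name__ == "__main__":
    w = MultiplierWrapper()
    print(w.clock(True, 6, False))
    print(w.clock(True, 7, False))
    print(w.clock(False, None, False))
```

The core itself needs no knowledge of the network: it only watches the valid bits, which matches the slide's point that adapting an IP core takes a small sequencer.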

Design Mapping Tool Flow
- Support multiple core clock speeds and design formats
- Automate scheduling/routing
- Allow feedback between core characteristics and mapping decisions
- Generate both core and communication programming information
- Lots of room for improvement (StreamIt, HW/SW partitioning, estimators)

Design Mapping Tools
[Figure: tool flow from source through front-end parse, SUIF optimization, basic block partition/assignment, inter-core synchronization, communication scheduling, core compilation, and code generation, with an execution time estimation step exchanging code and execution times. Intermediate products: graph-based intermediate format, stream assignment, enhanced I.F., dependencies, stream schedules, core I.F. Outputs: R4000 instructions, bit streams, communication instructions.]

Design Mapping Tool Front End
- Current system isolates computation into basic blocks
  - Stream-oriented front-end (e.g. StreamIt) more appropriate
- Front-end preprocessing
  - Built on SUIF
  - Performs standard optimizations
- Intermediate form used for subsequent partitioning, placement, and scheduling (routing)
- User interface allows for interaction and feedback

Partitioning and Assignment
- Clustering used to collect blocks based on cost function:

$$\text{cost} = x \cdot T_{compute} + y \cdot \frac{1}{T_{overlap}} + z \cdot c_{total}$$

- Cost function takes both computation and communication into account
  - $T_{compute}$ = estimate of overall compute time
  - $T_{overlap}$ = estimate of overall time of overlapping communication
  - $c_{total}$ = estimate of overall communication time
- Swap-based approach used to minimize cost across cores based on performance estimates
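
The cost function and the swap-based refinement can be sketched as follows. Treating x, y, and z as tunable weights with default value 1.0 is an assumption, as is the greedy pairwise-swap loop; `line_cost` is a hypothetical stand-in for the tools' performance estimators and is used only to exercise the loop.

```python
import itertools

def cost(t_compute, t_overlap, c_total, x=1.0, y=1.0, z=1.0):
    """cost = x*T_compute + y*(1/T_overlap) + z*c_total (weights are assumed)."""
    return x * t_compute + y / t_overlap + z * c_total

def swap_refine(assignment, cost_fn):
    """Greedy pairwise swapping: keep any swap that lowers cost_fn."""
    best = list(assignment)
    improved = True
    while improved:
        improved = False
        for i, j in itertools.combinations(range(len(best)), 2):
            trial = list(best)
            trial[i], trial[j] = trial[j], trial[i]
            if cost_fn(trial) < cost_fn(best):
                best, improved = trial, True
    return best

def line_cost(assignment, edges=((0, 1), (1, 2))):
    """Toy estimator: blocks sit on a row of cores; cost is total edge length."""
    return sum(abs(assignment[a] - assignment[b]) for a, b in edges)

if __name__ == "__main__":
    print(cost(10.0, 2.0, 3.0))
    print(swap_refine([0, 2, 1], line_cost))
```

Note that the overlap term enters as a reciprocal, so more overlap between communication and computation lowers the cost.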

Scheduled Routing
- Number and locations of streams known as a result of scheduling
- Stream paths routed as a function of required path bandwidth (channel capacity)
- Basic approach
  - Order nets by Manhattan length
  - Route streams using Prim's algorithm across time slices based on channel cost
  - Determine feasible path for all streams
  - Attempt to "fill in" unused bandwidth in schedule with additional stream transfers
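
The sketch below illustrates the idea of routing streams across time slices. It deliberately substitutes fixed dimension-order (XY) paths for the Prim's-algorithm search named on the slide, so it shows only the slot-assignment step. Ordering the longest nets first and advancing one hop per time slice are both assumptions, and all names are hypothetical.

```python
def xy_path(src, dst):
    """Dimension-order path as a list of directed links ((x, y), (x', y'))."""
    (x, y), links = src, []
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def schedule_streams(streams, n_slots):
    """Assign each stream a start slot so no link is used twice in a slot.

    streams: dict name -> (src, dst) tile coordinates.
    Hop k of a stream that starts in slot s uses slot (s + k) % n_slots.
    Returns dict name -> start slot; raises if a stream cannot be placed.
    """
    def manhattan(item):
        (sx, sy), (dx, dy) = item[1]
        return abs(sx - dx) + abs(sy - dy)

    busy = set()  # (link, slot) pairs already claimed
    result = {}
    # Assumption: longest nets first; the slide does not give the direction.
    for name, (src, dst) in sorted(streams.items(), key=manhattan, reverse=True):
        path = xy_path(src, dst)
        for start in range(n_slots):
            need = {(link, (start + k) % n_slots) for k, link in enumerate(path)}
            if not need & busy:
                busy |= need
                result[name] = start
                break
        else:
            raise ValueError(f"no feasible slot for stream {name}")
    return result

if __name__ == "__main__":
    demo = {"a": ((0, 0), (2, 0)), "b": ((0, 0), (1, 0))}
    print(schedule_streams(demo, n_slots=4))
```

In the demo, stream `b` shares its only link with the first hop of stream `a`, so it is pushed to the next time slice. Leftover (link, slot) pairs are the unused bandwidth that the "fill in" step on the slide would try to use.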

Back-end Code Generation
- C code targeted to R4000 cores
  - Subsequently compiled with gcc
- Verilog code for FPGA blocks
  - Synthesized with Synopsys and Altera tools
- Interconnect memory instructions for each interconnect memory
  - Limited by size of interconnect memory

Simulation Overview
- Simulation takes place in two phases
  - Core simulator determines computation cycles between interface accesses
  - Cycle-accurate interconnect simulator determines data transfer between cores, taking core delay into account

Simulation Environment
[Figure: simulation flow. Core codes from AppMapper (C code, Verilog, core configuration) feed the core simulation: R4000 Sim. (SimpleScalar), MAC Sim., MEM Sim., and FPGA Sim. (Quartus). Core simulation produces computation delays and communication events. Network simulation takes the configuration (core speed, topology, core location, CI instructions), a simulator library, and a C representation of the cores. Combined evaluation yields system statistics and system performance.]

Core Simulators
- SimpleScalar (D. Burger / T. Austin, U. Wisconsin)
  - Models R4000-like architecture at the cycle level
  - Breakpoints used to calculate cycle counts between communication network interactions
- Cadence Verilog-XL
  - Used to model 484-LUT FPGA block designs
  - Modeled at RTL and LUT level
- Custom C simulation
  - Cycle counts generated for memory and multiply-accumulate blocks
- Simulators invoked via scripts

Interconnect Simulator
- Based on NSIM (MIT NuMesh Simulator, C. Metcalf)
- Each tile modeled as a separate process
- Interconnect memory instructions used to control cycle-by-cycle operation
- Core speeds and flow control circuitry modeled accurately
- Adapted for a series of on-chip interconnect architectures (bus-based architectures)

Target Architectural Models
- FPGA blocks contain 121 4-LUT clusters
- Custom MAC and 32Kx8 SRAM (Mem) blocks
- Same configurations used for all benchmarks

Example: MPEG-2
- Design partitioned across eleven cores
- Other applications: IIR filter, image processing, FFT
[Figure: MPEG-2 encoder dataflow and its tile mapping. Blocks: In Buf, Ref Buf, ME, DCT, IDCT, Control, and MAC0 through MAC4, implemented on R4000, MAC, and MEM cores. Data exchanged: source frame, reconstructed frame, source minus reconstructed blocks, motion error, DCT block, reconstructed block.]

Core Parameters
- Communication interface, MAC, FPGA, and MEM sizes determined through layout (TSMC 0.18 um)

| Core | Speed | Area (λ²) |
|---|---|---|
| Comm. interface | 2.5 ns | 2500 x 3500 |
| MIPS R4000 | 5 ns | 4.3 x 10^7 ** |
| MAC | 5 ns | 1500 x 1000 |
| FPGA | 10 ns | 30000 x 30000 |
| MEM (32Kx8) | 5 ns | 10000 x 10000 |

** R4000 size from MIPS web page

Mapping Statistics
- Number of interconnect memory instructions (CI Instruct.) deceptively small
- Likely need to better fold streams in schedule

| Design | No. Cores | No. Streams | Max CI Instruct. | Max Streams Per CI | Max CPort Mem. Depth |
|---|---|---|---|---|---|
| IIR | 9 | 11 | 2 | 5 | 5 |
| IIR | 16 | 20 | 2 | 5 | 5 |
| IMG | 9 | 8 | 2 | 3 | 3 |
| IMG | 16 | 15 | 4 | 4 | 4 |
| FFT | 16 | 25 | 6 | 7 | 7 |
| MPEG | 16 | 19 | 4 | 8 | 8 |

Comparison to IBM CoreConnect

| Execution time (ms) | IIR (9-core) | IMG (9-core) | IIR (16-core) | IMG (16-core) | FFT (16-core) | MPEG (16-core) |
|---|---|---|---|---|---|---|
| R4000 | 0.049 | 327.0 | 0.350 | 327.0 | 0.79 | 152 |
| CoreConnect | 0.012 | 22.0 | 0.016 | 30.5 | 0.12 | 173 |
| CoreConnect (burst) | 0.012 | 18.9 | 0.015 | 24.3 | 0.12 | 172 |
| aSoC | 0.006 | 9.6 | 0.006 | 7.3 | 0.09 | 83 |
| aSoC speed-up vs. burst | 2.0 | 2.3 | 2.5 | 3.3 | 1.3 | 2.1 |
| Used aSoC links | 8 | 8 | 33 | 27 | 41 | 26 |
| aSoC max. link usage | 10% | 8% | 37% | 28% | 2% | 25% |
| aSoC ave. link usage | 7% | 7% | 22% | 25% | 2% | 5% |
| CoreConnect busy (burst) | 91% | 100% | 100% | 99% | 32% | 67% |

- Still work to do on mapping environment to boost aSoC link utilization
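
The speed-up row can be rechecked from the execution times with a few lines of Python. The lists below are transcribed from the table above; the benchmark labels are shorthand introduced here. With these values, five of the six ratios round to the speed-ups listed on the slide. The 9-core IMG ratio comes out near 2.0 rather than the listed 2.3, which is closer to the non-burst CoreConnect figure (22.0 / 9.6) and may reflect how that entry was computed or transcribed.

```python
# Execution times in ms, transcribed from the table above.
BENCHMARKS = ["IIR-9", "IMG-9", "IIR-16", "IMG-16", "FFT-16", "MPEG-16"]
CORECONNECT_BURST = [0.012, 18.9, 0.015, 24.3, 0.12, 172]
ASOC = [0.006, 9.6, 0.006, 7.3, 0.09, 83]

def speedups(baseline, improved):
    """Element-wise baseline/improved execution-time ratios."""
    return [b / a for b, a in zip(baseline, improved)]

if __name__ == "__main__":
    for name, s in zip(BENCHMARKS, speedups(CORECONNECT_BURST, ASOC)):
        print(f"{name}: {s:.2f}x")
```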

Comparison to Hierarchical CoreConnect
- Multiple levels of arbitration slow down hierarchical CoreConnect

| Execution time (ms) | IIR (9-core) | IMG (9-core) | IIR (16-core) | IMG (16-core) | FFT (16-core) | MPEG (16-core) |
|---|---|---|---|---|---|---|
| Hier. CoreConnect | 0.013 | 26.0 | 15.7 | 37.4 | 0.15 | 178 |
| aSoC | 0.006 | 9.6 | 7.0 | 7.3 | 0.09 | 83 |
| aSoC speedup | 2.1 | 2.7 | 2.2 | 5.1 | 1.6 | 2.2 |

aSoC Comparison to Dynamic Network
- Direct comparison to oblivious routing network [1]

| Execution time (ms) | IIR (9-core) | IMG (9-core) | IIR (16-core) | IMG (16-core) | MPEG (16-core) |
|---|---|---|---|---|---|
| Dynamic routing | 0.008 | 14.4 | 8.7 | 9.7 | 162.0 |
| aSoC | 0.006 | 6.1 | 7.0 | 7.3 | 82.5 |
| aSoC speedup | 1.3 | 2.4 | 1.3 | 1.3 | 2.0 |

1. W. Dally and H. Aoki, "Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels," IEEE Transactions on Parallel and Distributed Systems, April 1993.

aSoC Layout
[Figure: aSoC layout.]

aSoC Multi-core Layout
- Comm. interface consumes about 6% of tile
- Critical path in flow control between tiles
- Currently integrating additional cores

Future Work: Dynamic Voltage Scaling
- Data transfer rate to/from core used to control voltage and clock
- Counter and CAM used to select sources
- May be software controlled
[Figure: communication interface extended for voltage scaling. Inputs and outputs on North, South, East, and West; instruction memory, PC, decoder, and controller; a local configuration block sets the local frequency and voltage for the core and its coreports. Example route shown: North to South and East.]

Future Work: Dynamic Voltage Scaling
- CAM allows selection
[Figure: voltage selection system. Counters on Coreport In and Coreport Out provide a data rate measurement; a CAM selects among supplies V1 through V4 and, through the clock selector, among divided versions of the global clock (/128, /64, /32, /16, /8, /4, /2, /1). A critical path check with set/reset and a clock enable produce the local clock and local supply for the core.]
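
As a rough illustration of how a measured data rate could drive the clock selector, the sketch below picks the slowest divided clock that still keeps up with the coreport traffic. The selection policy, the parameter names, and the `cycles_per_word` figure are all hypothetical; only the list of divide ratios comes from the slide.

```python
DIVIDERS = [128, 64, 32, 16, 8, 4, 2, 1]  # divide ratios shown on the slide

def select_divider(words_per_window, window_cycles, cycles_per_word):
    """Pick the largest divider whose local clock still keeps up.

    words_per_window: words counted at the coreport in one measurement window
    window_cycles:    window length in global clock cycles
    cycles_per_word:  local core cycles needed per word (hypothetical parameter)
    """
    needed = words_per_window * cycles_per_word  # local cycles required
    for d in DIVIDERS:  # slowest clock first
        if window_cycles // d >= needed:
            return d
    return 1  # fall back to full speed

if __name__ == "__main__":
    print(select_divider(10, 1024, 4))
```

A lower clock rate would then allow a lower supply voltage to be selected, subject to the critical path check shown in the figure.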

Future Work
- Improved software mapping environment
- Integration of more cores
- Mapping of substantial applications
  - Turbo codes
  - Viterbi decoder
- More integrated simulation environment

Summary
- Goal: Create low-overhead interconnect environment for on-chip stream communication
- IP core augmented with communication interface
- Flow control and some stream reconfiguration included in the architecture
- Mapping tools and simulation environment assist in evaluating design
- Initial results show favorable comparison to bus and high-overhead dynamic networks