CASE STUDY 4: CUSTOM DESIGNED MULTI-PROCESSOR SYSTEM
Active Cells:
A Programming Model for Configurable Multicore Systems
Vision
[Figure: from a General Purpose Shared Memory Computer (cores with caches, bus, memory) to an Application Specific Multicore Network On Chip (processes P placed on dedicated cores and engines)]
Motivation: Multicore Systems Challenges
Cache Coherence
Shared Memory Communication Bottleneck
Thread Synchronization Overhead
Hard to predict performance of a program
Difficult to scale the design to massively multi-core architectures
Operating System Challenges
Processor Time Sharing
Interrupts
Context Switches
Thread Synchronisation
Memory Sharing
Inter-process: Paging
Intra-process, Inter-Thread: Monitors
Focus: Streaming Applications
Stream Parallelism: Pipelining
Task Parallelism: Parallel Execution
Data Parallelism: Vector Computing, Loop-level Parallelism
4.1. HARDWARE BUILDING BLOCKS: TRM AND INTERCONNECTS
TRM: Tiny Register Machine*
Extremely simple processor on FPGA with Harvard architecture.
Two-stage pipeline.
Each TRM contains
Arithmetic-logic unit (ALU) and a shifter.
32-bit operands and results stored in a bank of 2*8 registers.
Local data memory: d*512 words of 32 bits.
Local program memory: i*1024 instructions of 18 bits.
7 general purpose registers
Register H for storing the high 32 bits of a product, and 4 condition registers C, N, V, Z.
No caches
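To put the memory figures above into numbers, the following Python sketch computes the local memory of a single TRM from the multipliers d and i. The function name and the default choice d = i = 1 are assumptions made for illustration; the slides do not state default values.

```python
def trm_memory_bits(d: int = 1, i: int = 1) -> dict:
    """Local memory of one TRM in bits, using the sizes given on the slide.

    d and i are the configuration multipliers from the slide; the defaults
    (d = i = 1) are an assumption for this example only.
    """
    data_bits = d * 512 * 32        # d*512 words of 32 bits
    program_bits = i * 1024 * 18    # i*1024 instructions of 18 bits
    return {
        "data_bits": data_bits,
        "program_bits": program_bits,
        "total_bits": data_bits + program_bits,
    }


sizes = trm_memory_bits()
print(sizes)
```

With d = i = 1 this gives 16384 bits of data memory and 18432 bits of program memory per TRM.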
* Invented and implemented by Dr. Ling Liu and Prof. Niklaus Wirth
TRM Machine Language
Machine language: binary representation of instructions
18-bit instructions
Three instruction types:
Type a: arithmetical and logical operations
Type b: load and store instructions
Type c: branch instructions (for jumping)
From Lectures on Reconfigurable Computing, Dr. Ling Liu, ETH Zürich
Encoding Overview
[Figure: 18-bit instruction encodings with field boundaries at bits 17, 14, 13, 11, 10, 9 and 0, grouped as Register Operations, Load and Store, Conditional Branches, Branch and Link, and Special Instructions. Recoverable details:]
Register operation with immediate: op (bits 17..14), Rd (bits 13..11), bit 10 = 0, imm (bits 9..0); imm is zero extended to 32 bits
Register operations with register operand: op, Rd (or vector register VRd), bit 10 = 1, modifier bits, Rs (or VRs)
Load and Store: op, Rd (or VRd), Rs, off; off is zero extended
Conditional Branches: 1110 (bits 17..14), cond (bits 13..10), off (bits 9..0); off is sign extended
Branch and Link: 1111 (bits 17..14), off (bits 13..0); off is 14-bit offset
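The field boundaries in the encoding overview (bits 17, 14, 13, 11, 10, 9, 0) can be illustrated with a small field extractor. This Python sketch only decodes what the slide states unambiguously: the 4-bit op field, the Rd field, the conditional branch (op 1110, sign-extended 10-bit offset) and branch-and-link (op 1111, 14-bit offset). The function name, the dictionary keys and the generic "other" case are assumptions for illustration; the modifier bits of register and load/store instructions are deliberately left undecoded.

```python
def decode_trm(word: int) -> dict:
    """Split an 18-bit TRM instruction word into the fields named on the slide."""
    if not 0 <= word < (1 << 18):
        raise ValueError("TRM instructions are 18 bits wide")
    op = (word >> 14) & 0xF                  # bits 17..14
    if op == 0b1111:                         # branch and link
        return {"kind": "branch_and_link", "off": word & 0x3FFF}  # bits 13..0
    if op == 0b1110:                         # conditional branch
        cond = (word >> 10) & 0xF            # bits 13..10
        off = word & 0x3FF                   # bits 9..0
        if off & 0x200:                      # sign extend the 10-bit offset
            off -= 0x400
        return {"kind": "cond_branch", "cond": cond, "off": off}
    # All other instructions: expose the raw fields without interpreting them.
    return {
        "kind": "other",
        "op": op,
        "rd": (word >> 11) & 0x7,            # bits 13..11
        "bit10": (word >> 10) & 0x1,         # 0 selects the immediate form of register operations
        "low10": word & 0x3FF,               # imm, or modifiers/Rs/off depending on the type
    }


branch = decode_trm((0b1110 << 14) | (0b0011 << 10) | 0x3FF)
print(branch)
```

A conditional branch with all offset bits set decodes to an offset of -1, which is what sign extension of a 10-bit field implies.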
TRM architecture
Figure from: Niklaus Wirth, Experiments in Computer System Design, Technical Report, August 2010
http://www.inf.ethz.ch/personal/wirth/Articles/FPGA-relatedWork/ComputerSystemDesign.pdf
Variants of TRM
FTRM
includes floating point unit
VTRM (Master Thesis Dan Tecu)
includes a vector processing unit
supports 8 x 8-word registers
available with / without FP unit
TRM with software-configurable instruction width (Master Thesis Stefan Koster, 2015)
Initial Experiments
[Figure: TRM12 Bus. Twelve processor cores C0 to C11 with network controllers N0 to N11, arranged in four columns (Column 0 to Column 3), each column with an inbound and an outbound arbiter; further labels H0 to H3; peripherals RS232TR, Timer, LCD, LEDs, DDR2. Legend: Ci: processor core, Ni: network controller, RS232TR: RS232 transmitter receiver]
[Figure: TRM12 Ring. TRM0 to TRM11 attached to ring nodes node0 to node11 via TRM-ring interfaces, plus RS232; each node is annotated with the bit pattern 01111110]
4.2. HARDWARE SOFTWARE CODESIGN
Software / Hardware Co-design
Vision: Custom System on Button Push
System design as high-level program code
Electronic circuits
Computing model
Programming Language
Compiler, Synthesizer, Hardware Library, Simulator
Programmable Hardware (FPGA)
Traditional HW/SW co-design for embedded systems
System specification, HW/SW partitioning
Program microcontroller in C/C++ (Compilation)
Program system-specific hardware in HDL (Synthesis)
Microcontroller + machine code + specific hardware (e.g. DSP)
Goal: Active Cells approach for embedded systems development
One Program
One Toolchain
System on FPGA
Active Cells Computing Model
On-chip distributed system
Cell
Scope and environment for a running isolated process.
Integrated control thread(s)
Provides communication ports
Net
Network of communicating cells
Cells connected via channels (FIFOs)
Inspired by
• Kahn Process Networks
• Dataflow Programming
• CSP (e.g. Google's Go)
• Actor Model (e.g. Erlang)
Software Hardware Map
channel
cell
fifo
Consequences of the approach
No global memory
No processor sharing
No peculiarities of a specific processor
No predefined topology (NoC)
No interrupts
No operating system
Cell
type
  BernoulliSampler* = cell (probIn: port in; valOut: port out); (* non-typed communication ports *)
  var
    r: Random.Generator;
    p: real;
  begin
    new(r);
    loop
      p << probIn;                (* blocking receive *)
      valOut << r.Bernoulli(p);   (* asynchronous send *)
    end
  end BernoulliSampler;
[Figure: cell BernoulliSampler with input port probIn and output port valOut; callouts: Cell, Activity]
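The cell semantics named on this slide (blocking receive, asynchronous send, one activity per cell) can be mimicked in ordinary software. The following Python sketch models the BernoulliSampler cell as a thread with two queues acting as ports. The `STOP` sentinel, the seeded random generator and the use of unbounded queues are assumptions added so that the example terminates and runs on its own; a real Active Cells channel is a hardware FIFO.

```python
import queue
import random
import threading

STOP = object()  # sentinel added for this sketch so the thread can terminate


def bernoulli_sampler(prob_in: queue.Queue, val_out: queue.Queue, rng: random.Random) -> None:
    """Cell activity: receive a probability, send one Bernoulli sample."""
    while True:
        p = prob_in.get()                 # blocking receive
        if p is STOP:
            val_out.put(STOP)
            return
        val_out.put(1 if rng.random() < p else 0)  # send does not wait for the receiver


prob_in = queue.Queue()                   # unbounded queue: put never blocks
val_out = queue.Queue()
cell = threading.Thread(target=bernoulli_sampler, args=(prob_in, val_out, random.Random(0)))
cell.start()

for p in (0.0, 1.0, 0.5):
    prob_in.put(p)
prob_in.put(STOP)
cell.join()

samples = []
while True:
    value = val_out.get()
    if value is STOP:
        break
    samples.append(value)
print(samples)
```

The cell shares no state with its environment apart from the two queues, which is the isolation property the computing model relies on.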
Properties
type
Controller = cell {Processor=TRM, FPU, DataMemory=2048, BitWidth=18} (in: port in (64); result: port out);
...
begin
(* ... controller action ... *)
end Controller;
....
Port width: the (64) in "in: port in (64)"
Properties can influence both the generation of hardware and the generation of software code.
[Figure: Controller (TRM) with FPU]
Configurable Processor on PL
Tiny Register Machine
FPU: off / on
Vector Unit: off / on
Instruction Width: 14, 16, 18, 20, 22
Multiplier: off / on
Program Memory: 0k, 1k, 2k, 3k, 4k
Data Memory: 0.5k, 1k, 1.5k
Engine
type
  Merger = cell {Engine, inputs=1} (ind: array inputs of port in; outd: port out);
  var data, i: longint;
  begin
    loop
      for i := 0 to len(ind)-1 do
        data << ind[i];   (* receive from input i *)
        outd << data;     (* forward to the single output *)
      end
    end
  end Merger;
Engines are prebuilt components instantiated as electronic circuits on the target hardware.
[Figure: Merger engine]
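The behaviour described by the Merger code (read one item from each input port in turn and forward it to the single output) can be sketched in software as follows. This Python version is only a behavioural model under stated assumptions: queues stand in for ports, and the `rounds` parameter is an addition so that the example terminates, whereas the cell on the slide loops forever and is realised as a hardware engine.

```python
import queue


def merger(inputs: list, out: queue.Queue, rounds: int) -> None:
    """Round-robin merge: one item per input port per round, in port order."""
    for _ in range(rounds):
        for port in inputs:
            out.put(port.get())   # blocking receive, then forward


a, b = queue.Queue(), queue.Queue()
for value in (1, 2, 3):
    a.put(value)
for value in (10, 20, 30):
    b.put(value)

out = queue.Queue()
merger([a, b], out, rounds=3)
merged = [out.get() for _ in range(6)]
print(merged)  # [1, 10, 2, 20, 3, 30]
```

Because each receive blocks, the output order is fixed by the port order and does not depend on arrival times.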
Unit of Deployment: (Terminal) Cellnet
LearnTest = cellnet;
var
  learner: CRBMNet.CRBMLearner;
  reader: MLUtil.imageReader;
  ims0, ims1: MLUtil.imshow;
  …
begin
  new(learner);                                    (* dynamic construction *)
  new(reader);
  new(ims0 {name='v0debug', posx=0, posy=100});
  new(ims1 {name='v1debug', posx=300, posy=100});
  …
  reader.imageOUT >> learner.imgIN;                (* connection *)
  learner.v0DebugOUT >> ims0.imageIN;
  learner.v1DebugOUT >> ims1.imageIN;
  …
end LearnTest;
[Figure: imageReader connected to CRBMLearner, which is connected to two imShow cells]
Hierarchical Composition
[Figure: the cellnet of imageReader, CRBMLearner and two imShow cells, expanded hierarchically. CRBMLearner contains DownNet, UpNet, Delay cells, Sample and GetGradients, with ports img in, debug and result. UpNet contains split and UpStep, with signals visible, hidden, kernels and P(h|v)]
Hierarchic Composition: non-terminal Cellnet
UpNet = cellnet
  {vr=28, vc=28, kr=5, kc=5, k=9, c=2, name='upstep'}      (* ports and properties *)
  (pvIN, kerIN, bIN: port in; phOUT: array k of port out; pvSideOUT: port out);
var
  i, hr, hc: longint;
  upstep: CRBMUpstepCell;
  vSplit: MLFunctions.SplitterCell;
begin
  hr := vr-kr+1; hc := vc-kc+1;
  new(vSplit {dataSize=vr*vc, numOut=2});
  new(upstep {vr=vr, vc=vc, kr=kr, kc=kc, k=k, c=c});
  pvIN >> vSplit.dataIN;                 (* port delegation *)
  vSplit.dataOUT[0] >> pvSideOUT;
  vSplit.dataOUT[1] >> upstep.vIN;
  kerIN >> upstep.kerIN;
  bIN >> upstep.bIN;
  for i := 0 to k-1 do
    upstep.phOUT[i] >> phOUT[i];         (* port delegation *)
  end;
end UpNet;
[Figure: UpNet with inputs img in, kernIn and biasIn, inner cells split and UpStep, and outputs pvOut and phOut]
Software Hardware Map
[Figure: mapping onto a Zynq with PS (ARM) and PL (TRMs and engines), connected by the AXI4 Stream Interconnect. Software labels: cell, engine, port. Hardware labels: thread, softcore, hw engine, AXI4 FIFO]
Hybrid Compilation
Code body        Role              Compilation method
Cell (Softcore)  Program logic     Software Compilation
Cell (Engine)    Computation unit  Hardware Generation
Cell Net         Architecture      Hardware Compilation
cellnet N;
type A = cell (pi: port in; po: port out);
var x: integer;
begin … pi ? x; … po ! x; …
end A;
var a, b: A;
begin … connect(a.po, b.pi)
end N.
Implementation Alternatives (1)
source -> bitstream: compiler frontend with one backend per target (Zybo Backend, Spartan3 Backend, Spartan3e Backend, …)
+ simple
- redundant
- not flexible
- hard to extend
Implementation Alternatives (2)
source -> bitstream: compiler frontend with a cellnet interpreter backend that uses a Hardware Library and a Component Library
+ extensible
+ not redundant
- not simple
- static configuration limits flexibility
Implementation Alternatives (3)
source -> executable -> bitstream: compiler frontend, hdl runtime, Hardware Library and Component Library provided as executables
+ extensible
+ not redundant
+ simpler
+ simulation becomes execution
Frontend
[Figure: Source Code (Module) -> Compiler Frontend -> Intermediate Code (Module) -> Compiler Backend -> Executable. Executables are combined with Software Libraries and run on one of three runtimes: Simulation Runtime, Visualization Runtime, HDL Backend Runtime]
Backend
[Figure: HDL Backend Runtime. Dependency Analysis produces a Dependency Graph; the HDL Code Generator uses the HW Library to produce HDL Code; the Instruction Code Generator uses the Compiler Backend and Runtime Libraries to turn Intermediate Code (Module) into TRM Code and ARM Code executables, each with its TRM Kernel or ARM Kernel; FPGA vendor implementation tools produce the bitstream for Deployment]
Extensibility: Defining components and platforms
Software: HLL Code
type Gpo = cell {Engine, DataWidth=32, InitState="0H"} (input: port in);
Hardware: HDL Code
module Gpo
#( parameter integer DW = 8 … )
(
input aclk, input aresetn, input [DW-1:0] …
);
Component Specification
Build Command
AcHdlBackend.Build --target="ZyboBoard" -p="Vivado" CRBM.TestCellnet ~
Platform Specification
Hardware Platform and Tools
Hardware Types
Platform Instances
Vendor specific tools
Component Specification
module Gpo;
import Hdl := AcHdlBackend;
var c: Hdl.Engine;
begin
  (* Identification and description *)
  new(c,"Gpo","Gpo");
  c.SetDescription("General Purpose Output … ");
  (* Supported devices *)
  c.SupportedOn("*"); (* portable *)
  (* Parameters *)
  c.NewProperty("DataWidth","DW",Hdl.NewInteger(32),Hdl.IntegerPropertyRangeCheck(1,Hdl.MaxInteger));
  c.NewProperty("InitState","InitState",Hdl.NewBinaryValue("0H"),nil);
  (* Ports *)
  c.SetMainClockInput("aclk"); (* main component's clock *)
  c.SetMainResetInput("aresetn",Hdl.ActiveLow); (* active-low reset *)
  c.NewAxisPort("input","inp",Hdl.In,8);
  c.NewExternalHdlPort("gpo","gpo",Hdl.Out,8);
  (* Dependencies *)
  c.NewDependency("Gpo.v",true,false);
  (* Configuration Actions *)
  c.AddPostParamSetter(Hdl.SetPortWidthFromProperty("inp","DW"));
  c.AddPostParamSetter(Hdl.SetPortWidthFromProperty("gpo","DW"));
  Hdl.hwLibrary.AddComponent(c);
end Gpo.
Generic Peer-to-Peer Communication Interface
Use of AXI4 Stream interconnect standard from ARM
Generic, flexible
Non-redundant
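[Figure: sender port and receiver port connected through a multiplexing network, with Data out, Data in, Select and clk signals on both sides]
TVALID: data valid (driven by the sender port)
TREADY: ready to process (driven by the receiver port)
The master must assert TVALID and keep it asserted until the transfer has taken place.
The slave may wait for the master's TVALID and only then assert TREADY.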
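The TVALID/TREADY rule above can be illustrated with a cycle-by-cycle model: a word is transferred exactly in those clock cycles in which both signals are high. The Python sketch below is a simplified behavioural model, not an AXI4-Stream implementation; the function name and the representation of the receiver as a list of per-cycle TREADY values are assumptions for illustration.

```python
def simulate_handshake(data: list, ready_pattern: list) -> list:
    """Return (cycle, value) pairs for every completed transfer.

    The sender asserts TVALID as long as it has data and holds it until the
    receiver accepts the word; ready_pattern gives TREADY for each cycle.
    """
    transfers = []
    index = 0
    for cycle, tready in enumerate(ready_pattern):
        tvalid = index < len(data)        # held high until the word is accepted
        if tvalid and tready:             # transfer happens only when both are high
            transfers.append((cycle, data[index]))
            index += 1
    return transfers


log = simulate_handshake([7, 8, 9], [False, True, False, False, True, True])
print(log)  # [(1, 7), (4, 8), (5, 9)]
```

The sender never drops a word while the receiver is busy; it simply keeps TVALID asserted, which is why the FIFO channels need no additional flow control.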
Target Platform Specification
module Basys2Board;
import Hdl := AcHdlBackend, AcXilinx;
var t: Hdl.TargetDevice;
  pldPart: AcXilinx.PldPart;
  ioSetup: Hdl.IoSetup;
  pin: Hdl.IoPin;
begin
  (* FPGA Part *)
  new(pldPart,"XC3S100E-4CP132");
  pldPart.SetJtagChainIndex(0);
  new(t,"Basys2Board",pldPart);
  (* System Signals *)
  new(pin,Hdl.In,"B8","LVCMOS33");
  t.NewExternalClock(pin,50000000,50,0); (* ExternalClock0 *)
  t.SetSystemClock(t.clocks.GetClockByName("ExternalClock0"),1,1);
  new(pin,Hdl.In,"G12","LVCMOS33");
  t.SetSystemReset(pin,true);
  (* Mapping of external Ports *)
  new(ioSetup,"Gpo_0");
  ioSetup.NewIoPort("gpo",Hdl.Out,"U16,E19,U19,V19","LVCMOS33");
  t.AddIoSetup(ioSetup);
  Hdl.hwLibrary.AddTarget(t);
end Basys2Board.
Case Study 1: ECG
Focus: Resources and Power
[Figure: Real-time ECG Monitor. Blocks: Signal input, Wave proc_1 … Wave proc_8, QRS detect, HRV analysis, Disease classifier; input: ECG bitstream, output: out stream]
[Figure: implementation on a Xilinx ML505 board with a Virtex-5 LX50T FPGA. TRM1 to TRM12 connected by FIFO1 to FIFO34; UART, CF and LCD controllers attached to RS232, CF and LCD; ECG Sensor as input]
Resources
ECG Monitor*
#TRMs  #LUTs        #BRAMs    #DSPs     TRM load
12     13859 (48%)  52 (86%)  12 (25%)  <5% @ 116 MHz
* 8 physical channels @ 500 Hz sampling frequency, implemented on Virtex 5
Maximum number of TRMs in communication chain
FPGA      #TRMs  #LUTs        #BRAMs     #DSPs
Virtex-5  30     27692 (96%)  60 (100%)  30 (62%)
Virtex 6  500
Comparative Power Usage
Preconfigured FPGA (#TRMs, IM/DM, I/O, Interconnect fixed) versus fully configurable FPGA (Active Cells)
System                   Static Power (W)  Dynamic Power (W)
Preconfigured ("TRM12")  3.44              0.59
Dynamically configured   0.5               0.58
86% saving!
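The saving quoted above can be checked against the table values. The Python sketch below computes the relative saving for static power and for total power; reading the 86% as a statement about static power is an assumption, and the small difference to the computed value presumably stems from rounding in the table entries.

```python
def saving(before: float, after: float) -> float:
    """Relative saving when going from `before` to `after`."""
    return (before - after) / before


static_saving = saving(3.44, 0.5)                   # static power only
total_saving = saving(3.44 + 0.59, 0.5 + 0.58)      # static + dynamic power

print(f"static power saving: {static_saving:.1%}")
print(f"total power saving:  {total_saving:.1%}")
```

With the rounded figures, the static saving comes out at roughly 85 to 86 percent, while the saving in total power is lower because dynamic power is almost unchanged.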
[Figure: the ECG blocks (Signal input, Wave proc_1 … Wave proc_8, QRS detect, HRV analysis, Disease classifier) mapped onto Core 0 to Core 11]
Case Study: Non-Invasive Continuous Blood Pressure Monitor
Focus: Development Cycle Time
A2 Host OS with GUI on ARM
Sensor control and medical algorithms on Zynq PL
Sensors and Motors on Bracelet
Medical Monitor Network On Chip
Dominated by TRM processors. Feedback driven. Not performance critical.
Development Cycle Times
Step                       Medical Monitor  OCT (full)
Software Compilation       4 s              2 s
Hardware Implementation    20 min           20 min
Stream Patching (all)      2 min            -
Stream Patching (typical)  12 s             -
Deployment                 11 s             16 s
Frequency annotations on the slide: sporadic, often
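The effect of stream patching on turnaround time can be estimated from the Medical Monitor column. The Python sketch below assumes that a full cycle consists of software compilation, hardware implementation and deployment, and that a typical patched cycle replaces the hardware implementation step by stream patching; this composition of steps is an assumption, since the slide lists the steps without saying how they combine.

```python
# Medical Monitor timings from the table, in seconds.
STEP_SECONDS = {
    "software_compilation": 4,
    "hardware_implementation": 20 * 60,
    "stream_patching_typical": 12,
    "deployment": 11,
}

# Assumed composition of a development cycle (not stated on the slide).
full_cycle = (STEP_SECONDS["software_compilation"]
              + STEP_SECONDS["hardware_implementation"]
              + STEP_SECONDS["deployment"])
patched_cycle = (STEP_SECONDS["software_compilation"]
                 + STEP_SECONDS["stream_patching_typical"]
                 + STEP_SECONDS["deployment"])

speedup = full_cycle / patched_cycle
print(f"full cycle: {full_cycle} s, patched cycle: {patched_cycle} s, ratio: {speedup:.0f}x")
```

Under these assumptions a software-only change turns around in well under a minute instead of about twenty minutes.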
Case Study 3: Optical Coherence Tomography
Focus: Performance
z-Axis Processing
1. Non uniform sampling: A(λi) -> A(fi), with A(λi) = A(1/fi)
2. Dispersion compensation
3. (Inverse) FFT, yielding f(z)
… for many lines x in a row (2d)
… and many rows y in a column (3d)
[Figure: scan volume with axes z, x, y]
A component of OCT image processing: Dispersion Compensation
Dominated by Engines. Dataflow driven.
Case Study: ANN
[Figure: Perceptron on the Programmable Logic, fed by the ARM SoC with inputs x, weights w and bias b. InnerProdFlt multiplies x0, x1, x2 with y0, y1, y2 and adds the products; Sigmoid is built from Frac, LUT and Fix stages with input x and output y; stages P0/R0, P1/R1, P2/R2 produce the output z of ANN Layer 1 (more layers follow)]
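The perceptron in this case study combines an inner product with a sigmoid. The labels Frac and LUT in the figure suggest a table-based sigmoid with interpolation; that reading is an assumption, as are the table size, the input range and all names in the following Python sketch, which is meant only to show the technique in software.

```python
import math

LUT_SIZE = 64                  # assumed number of table intervals
X_MIN, X_MAX = -8.0, 8.0       # assumed input range covered by the table
STEP = (X_MAX - X_MIN) / LUT_SIZE
LUT = [1.0 / (1.0 + math.exp(-(X_MIN + k * STEP))) for k in range(LUT_SIZE + 1)]


def sigmoid_lut(x: float) -> float:
    """Sigmoid via table lookup with linear interpolation between entries."""
    if x <= X_MIN:
        return LUT[0]
    if x >= X_MAX:
        return LUT[-1]
    position = (x - X_MIN) / STEP
    k = int(position)          # integer part selects the table entry
    frac = position - k        # fractional part drives the interpolation
    return LUT[k] + frac * (LUT[k + 1] - LUT[k])


def perceptron(x: list, w: list, b: float) -> float:
    """Inner product of inputs and weights plus bias, followed by the sigmoid."""
    return sigmoid_lut(sum(xi * wi for xi, wi in zip(x, w)) + b)


y = perceptron([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], 0.2)
print(y)
```

A lookup table replaces the exponential and division of the exact sigmoid with one memory access and one multiply-add, which is the kind of trade-off that suits FPGA logic.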
Performance and Resource Usage
Medical Monitor
Architecture: Spartan 6 XC6SLX75
Resources: 28% Slice LUTs, 4% Slice Registers, 80% BRAMs, 24% DSPs
Clock Rate: 58 MHz
Data Bandwidth: 1.25 Mbit/s (in), 23 kB/s (out)
Performance: --
Power: ~2 W
Simple OCT
Architecture: Zynq 7000 XC7Z020
Resources: 11% Slice LUTs, 6% Slice Registers, 7% BRAMs, 15% DSPs, 1 ARM Cortex A9
Clock Rate: 118 MHz
Data Bandwidth: 236 MWords/s (in), 118 MWords/s (out)
Performance: 8.3 GFPOps*, up to 32 GFPOps**
Power: ~5 W
OCT
Architecture: Zynq 7000 XC7Z020
Resources: 17% Slice LUTs, 8% Slice Registers, 22% BRAMs, 31% DSPs, 1 ARM Cortex A9
Clock Rate: 50 MHz
Data Bandwidth: 50 MWords/s (in), 50 MWords/s (out)
Performance: 4.3 GFPOps*
Power: ~5 W
Perceptron
Architecture: Zynq 7000 XC7Z010
Resources: 83% Slice LUTs, 67% Slice Registers, 13.33% BRAMs, 21% DSPs, 1 ARM Cortex A9
Clock Rate: 147 MHz
Data Bandwidth: 9.6 GBits/s (in), 9.6 GBits/s (out)
Performance: 4.9 GFlops
Power: 2.1 W
** Fixed point operations, 32bit
* when instantiated 4 times
Conclusion
ActiveCells: Computing model and tool-chain for configurable computing
Configurable interconnect -> Simple Computing, Power Saving
Embedding of task engines -> High Performance
Hybrid compilation -> Quick Development
Backend execution -> Eased Flexibility and Extensibility
Mapping to Hardware – Simple Example
IO* = CELL (input: PORT IN; output: PORT OUT; buttons: PORT IN; leds: PORT OUT);
BEGIN ...
END IO;
Controller* = CELL (input: PORT IN; output: PORT OUT);
BEGIN ...
END Controller;
VAR controller: Controller; io: IO;
  gpo{DataWidth=8}: Engines.Gpo;
  gpi{DataWidth=11}: Engines.Gpi;
BEGIN
  NEW(controller); NEW(io); NEW(gpi); NEW(gpo);
  CONNECT(controller.output, io.input, 32);
  CONNECT(io.output, controller.input, 32);
  CONNECT(gpi.output, io.buttons);
  CONNECT(io.leds, gpo.input);
END Simple.
[Figure: Controller (TRM) and IO (TRM) connected by two fifos; engines gpi and gpo attached to IO]
Blackboard / Code inspection
Spartan3 Board
Resources
4-input LUTs  3840
Flip Flops    3840
BRAMs         12
DSPs          -
Resource Usage Scenarios
CELLNET Game;
IMPORT LED;
TYPE
  IO* = CELL {DataMemorySize(512), CodeMemorySize(1024), LED} (in: PORT IN);
  BEGIN
  END IO;
VAR io: IO;
BEGIN
  NEW(io);
END Game.
1 TRM
Resources
4-input LUTs  27%
Flip Flops    5%
BRAMs         16%
VAR io: IO; trm: TRM;
BEGIN
  NEW(io); NEW(trm);
  CONNECT(trm.out, io.in);
END Game.
[Figure: TRM -> TRM]
Resources
4-input LUTs  58%
Flip Flops    9%
BRAMs         33%
[Figure: TRM, TRM -> TRM] (cf. next page)
Resources
4-input LUTs  86%
Flip Flops    18%
BRAMs         75%
TRM ≈ 21-27% (LUTs), 16% (BRAMs); Fifo ≈ 6% (LUTs)
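The rule of thumb on this slide (one TRM costs about 21 to 27% of the LUTs, one FIFO about 6%) can be checked against the three measured scenarios. In the Python sketch below, the component counts per scenario are assumptions: one TRM, two TRMs with one FIFO, and three TRMs with three FIFOs, the last taken from the lab configuration on the following slide, which reports the same percentages. Engines such as Gpo and Gpi are not accounted for.

```python
TRM_LUT_RANGE = (21, 27)   # percent of Spartan3 LUTs per TRM, from the slide
FIFO_LUT = 6               # percent of Spartan3 LUTs per FIFO, from the slide


def estimate_lut_range(trms: int, fifos: int) -> tuple:
    """Lower and upper estimate of LUT usage in percent."""
    low = trms * TRM_LUT_RANGE[0] + fifos * FIFO_LUT
    high = trms * TRM_LUT_RANGE[1] + fifos * FIFO_LUT
    return low, high


# (TRMs, FIFOs) -> measured LUT usage in percent; counts are assumed as described above.
measured = {(1, 0): 27, (2, 1): 58, (3, 3): 86}

for (trms, fifos), lut_percent in measured.items():
    low, high = estimate_lut_range(trms, fifos)
    print(f"{trms} TRM(s), {fifos} FIFO(s): estimate {low}-{high}%, measured {lut_percent}%")
```

All three measurements fall inside the estimated range, and the upper estimate for three TRMs is already close to the full device, which shows how quickly a small Spartan3 fills up.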
Resource Usage (Lab)
controller: Controller;
ioTRM: IO;
digitsDriver: DigitsDriver;
digits: Engines.LEDDigits;
gpo{DataWidth=8}: Engines.Gpo;
gpi{DataWidth=11}: Engines.Gpi;
BEGIN
  NEW(controller); NEW(ioTRM); NEW(digitsDriver); NEW(digits); NEW(gpi); NEW(gpo);
  CONNECT(controller.cmdOUT, ioTRM.cmdIN, 10);
  CONNECT(ioTRM.cmdOUT, controller.cmdIN, 10);
  CONNECT(gpi.output, ioTRM.ButtonsIN);
  CONNECT(ioTRM.LEDOUT, gpo.input);
  CONNECT(ioTRM.digitsOUT, digitsDriver.cmdIN, 10);
  CONNECT(digitsDriver.ledOUT, digits.input);
3 TRMs, 3 FIFOs
Resources
4-input LUTs  86%
Flip Flops    18%
BRAMs         75%