reconfigurable architectures
DESCRIPTION
Reconfigurable Architectures. AMANO, Hideharu hunga @ am . ics . keio . ac . jp. Reconfigurable System ( Custom Computing Machine ). A target algorithm is executed directly with a hardware on SRAM-style FPGA/PLDs. High performance of special purpose machines. - PowerPoint PPT PresentationTRANSCRIPT
Reconfigurable Architectures
AMANO, Hideharu
hunga @ am . ics . keio . ac . jp
Reconfigurable System( Custom Computing Machine ) A target algorithm is executed directly with
a hardware on SRAM-style FPGA/PLDs. High performance of special purpose machine
s. High degree of flexibility of general purpose ma
chines. A completely different execution mechanis
m from a stored program computers.
Recent FPGA/PLDs
More than 1000K Gates (It is difficult to use efficiently.)
The operational frequency is 30MHz – 60MHz.
A large internal data RAMs.
Simple Gate Arrays are replaced with FPGA/PLDs.
Switch
Logic Block
I/O
SRAM FPGA (Xilinx’s)(Field Programmable Gate Array)
5-inputs
2 F . F .
Look Up Table
Switch
Configuration Memory
SRAM(Configuration Memory )
Switch
Logic BlockI/O
SRAM CPLD(Complex Programmable Logic Device)
Reconfigurable Systems
Stand alone type Implemented on boards or cabinet. Splash 1・2, RM-I,II,III,IV , RASH ( Mit
subishi ) , ATTRACTOR ( NTT ) , FLEMING
Co-processor type Improve performance of general purpose process
ors. PRISM I,II, DISC II , Garp, CHEMAE
RA, Chameleon, PipeRench
Reconfigurable Systems
1990 The 1st FPLThe 1st Japanese FPGA/PLD Conf.
1992
1993 The 1st FCCM
2000
1995
SPLASH
SPLASH-2RM-IRM-II
RM-IVRM-III
RM-V
WASMII
PRISM-IPRISM-II
DISCDISC-II
DRL
FIPSOCHOSMII
Cont . Switch .FPGA PCA
RASH
YARDS
ATTRACTOR
PipeRench
ChameleonCHIMERA
Stand Alone
Co-processor
New Device
MPLD
Cache LogicMult . Context
FPGA
Splash-2 (Arnold et.al 92)
米国計算機科学センター
String matching, Image processing, DNA matching, 330 times faster than the supercomputer Cray-II.
Systolic algorithm VHDL, Parallel C Annapolis Micro
Systems ( WILDFIRE)
RM-IV (Kobe Univ.)
FPGAmem .
FPGAmem . FPGAmem .
FPGAmem .
FPGAmem .
FPGAmem . FPGAmem .
FPGAmem .
FPGA mem .
FPGA mem .FPGA mem .
FPGA mem .
FPGA mem .
FPGA mem .FPGA mem .
FPGA mem .
FPIC
Interface
&p 10
RASH(Mitsubishi)
Display
RASH unit
CompactPCI bus
diskCD
EXE-ボード
CPUボード
EthernetLAN
This slide is supported by Dr.Nakajima of Mitsubishi.
1Unit: 6 EXE boards CPU boards (Pentium)
Multiple Units can be connected
&p 11
EXE boards of RASH
FPGA Altera FLEX10K100A (62K-158KGate)
EXE-board controller
FPGA FPGA FPGA FPGA
FPGA FPGA FPGA FPGA
PCI Local-bus
PCI-bus
PCI-bus I/F
Clocks / Cont. signals
SRAM( 2MB )
Local-bus
Mesh links and buses
2 clock lines
PCI bus I/F
A large SRAM
DRAM daughter board
ATTRACTOR ( NTT )
RISC
FPGA
RISC
FPGA
RISC
ATMI/O
RISC
ATMSW
RISC
Buffer
RISC
RAM( LUT)
MPU
Mem.
Compact PCI
Ethernet
High speed serial link ( 1Gbps )
Board level reconfiguration
Using various boardsSpecialized for ATM communication
Co-processor type
Tightly coupled with core CPU A part of program is selected and executed. Recently, on-chip implementation with the cor
e CPU and reconfigurable part is possible. Tightly coupling co-processor
NAPA, Garp, Chameleon, CHIMAERA, PipeRench
PRISM II ( Brown Univ. )
Am2955CPU
Boot ROM
Burst ModeMemory Controller
FPGA Module
DRAM
DRAM
S witch
FPGA ModuleFPGA Module
AddressControl
Data
A program core is executed.
Frontier of co-processors.
Garp (Hauser97)
Proposed in UCB MIPS Core and Reconfigura
ble Array share a cache system.
Loop is extracted with a compiler, and converted to hardware.
Image processing, 43 times faster than Ultrasparc
MIPS
Crossbar
Q Q QCache
Memory queue
32bit buses x 5
Reconfigurable Array
DISC (Wirthlin et al. 95)
Brigham Young Univ. A general purpose processor
using partial reconfigurable chip.
Custom instructions can be attached. Each module can be designed
by the user. Function called by C-language.
FPGA 3
ProcessorCore
FPGA 1
Bus I/FConfiguration
Controller
FPGA 2
CustomInstruction
Space
SystemMemory
Host P/C
CHIMAERA (Ye et.al. 2000)
Northwestern Univ. A reconfigurable array is i
nserted in the datapath of a super-scalar machine.
9 registers can be read in parallel from the shadow-register.
Out of Order control 10 ~ 20% performance i
mprovement
ReconfigurableArrayuP Core
Registerfile
Shadow registers
Controller
Chameleon ( Chameleon Co. ) Field Programmable System Level
Integrated Circuits (FPSLICs) Coarse grain Reconfigurable Processing Fa
bric 、 RISC Core 、 PCI Controller 、 Memory Controller 、 DMA Controller and SRAM are implemented on a single chip.
In Signal processing, Communication protocol processing, It is 5-10 times faster than high speed DSPs.
Chameleon CS2112
ReconfigurableProcessing
Fabric
128-bit RoadRunner Bus
PCI Cont. RISC CoreMemory
Controller
DMASubsystem
ConfigurationSubsystem
160-pin Programmable I/O
32-bit PCI Bus 64-bit Memory Bus
Reconfigurable Processing Fabric in Chameleon
DPU
CTLLM
Tile 0
Slice 0
DPU
CTLLM
Tile 0
Slice 3
108 DPU(Data Path Unit)s consists 4 Slices ( 3Tiles each )
1Tile: 9DPU = 32bit ALU X 7 16bit + 16bit multiplier X 2
8 instructions stored in the CTL are executed in the DPU.
The CTL can select the next instruction in the same cycle.
Configuration can be changed by loading a bit stream.
DPU
Instruction
Rou
ting
MU
XR
ou
ting
MU
X
Register&
Mask
Register&
Mask
OP Register
RegisterBarrelShifter
OP : Operations in C or Verilog
SIMD arrays and pipelines are formed with multiple DPUs.
Problems on Reconfigurable Systems Calculators with SRAM type FPGAs are 10 ti
mes slower than ASIC calculators and requires 10 times wide area.
Weak connection between memory modules. No standard method for generating a efficient
hardware. The size limitation problem.
Toward solving problems (1)
Speed and area problem compared with dedicated calculators. The disadvantage is reduced using a novel
process. Coarse grain FPGA Implementation with the CPU
Weak connection between memory modules Connection with a large scale integrated SRAM DRAM integration
DRAM integrated FPGA ( NEC)
256 X 256DRAMModule
Word Driver(128)
Sen
se
Am
p.
(128
)
Sen
se
Am
p.
(128
)
LogicElement
LogicElement
LogicElement
LogicElement
Word Driver(128)
FPAccA (Hiroshima City Univ.)
Routing Matrix
ALU
Array of floating
ALU(Add/Mult )
model2(0. 35um)
12x 25MFLOPS
Toward solving problems (2)
Algorithm conversion problem Co-processing between integrated CPU High-level synthesis techniques Data-driven execution Systolic algorithm
Size limitation problem Partial Reconfiguration Multi-context FPGA Virtual hardware
Systolic algorithm
Data x
Data y
Computational array
A data stream x, y are inserted with a specific interval intoa special computational array.Suitable for reconfigurable computing.
Band matrix multiply y=Ax
a
x
a11 a12 0 0
a21 a22 a23 0
0 a32 a33 a34
0 0 a43 a44
X+
yiyo
yo= a x + y i
y0
y1
y2
y3
x0
x1
x2
x3
=
Band matrix multiply y=Ax
X +
a11
x1
a12 a21
a22
a23 a32
a11 a12 0 0
a21 a22 a23 0
0 a32 a33 a34
0 0 a43 a44
Band matrix multiply y=Ax
X +X +x1
a12 a21
a22
a23 a32
a11 a12 0 0
a21 a22 a23 0
0 a32 a33 a34
0 0 a43 a44
a33
y1=a11x1
x2
Band matrix multiply y=Ax
X +x1
a34 a43
a22
a23 a32
a11 a12 0 0
a21 a22 a23 0
0 a32 a33 a34
0 0 a43 a44a33
y2=a21 x1
x2x3
y1=a11 x1+ a12 x2
Band matrix multiply y=Ax
X +X +
a34 a43
a44
a23 a32
a11 a12 0 0
a21 a22 a23 0
0 a32 a33 a34
0 0 a43 a44
a33
y2=a21 x1+ a22 x2
x2x3
Band matrix multiply y=Ax
X +
a34 a43
a44
a11 a12 0 0
a21 a22 a23 0
0 a32 a33 a34
0 0 a43 a44
a33y2=a21 x1+ a22 x2+ a23 x3
x2x3
y3= a32 x2
Data flow algorithm
x
+
x
+
a bc
d e
(a+b)x(c+(dxe))
PlasticPart
7
0 7
Built-inPart
16bits
16word x 1bitmemory ( LUT)
output: 1bit
input: 1bit
control: 6bitsdata: 8bits
16bits
14bits
14bits
Hardware controller which executes12 instructions
Logic/Memory
PCA( Plastic Cell Architecture )
Unit cells are connected in a mesh structure
Variable
PCA(Plastic Cell Architecture)
PlasticPart
Built-in Part
ConfigurationPath
Communication Path
.….
Self reconfigurationAsynchronous communication
Multi-context FPGA
Mul
tipl
exer
SRAM slots
n
Logic cells
1
2
Input data
Output data
Logic cellsLogic cellsContext
Configuration RAM can be changeable.Fujitsu’s MPLD(1990)Fujitsu’s MPLD(1990) 、、 WASMII(1992)WASMII(1992) 、、 Xilinx(1997)Xilinx(1997)
NEC’s DRL(1999)NEC’s DRL(1999)
Dynamic Reconfigurable Logic Multi-context(8 context) and partial reconfiguration.
4×12 (Logic Block, LB) Interface logics
Logic Block 4×4 (Unified Cell, UC) Reconfiguration Controller (RC) Bus Connector (BC)
Conf
igur
atio
n/Da
ta In
put(7
9b)
Input Select
External Config. Control(4b)
Address Decoder
Config. Store Control(2b)
Reset CLK Output Select
Conf
igur
atio
n/Da
ta O
utpu
t(79b
)
Configuration Store
Address(10b)LB: 論理ブロック
DRLIn
put S
elec
tor
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
LB
Out
put S
elec
tor
LB
Data Config Vertical Local Bus(4b×2) (3b×2) (4b×2)
Horizontal Local Bus
(4b×2)
Memory
Data
Config(3b×2)
(4b×2)
(4b×2)
UC : Unified Cell
RC : Reconfiguable Circuit
BC:Bus Connecter
RCUC
UC
UC
UC
UC
UC
UC
UC
UC UC
UC UC
UC UC
UC UC
Global Bus
Switch
BC
BC
BC
BC
BC BCBC BC
RC
BC
BC
BC
BC
BC BCBC BC
UC
UC
UC
UC
UC
UC
UC
UC
UC UC
UC UC
UC UC
UC UC
WASMII chip
Execution block
Virtual Hardware WASMII on the DRL( Keio U. +NEC)
ConfigurationData line
Input TokenRegisters
Controller
Token router
Input TokenRegisters
Active Page
Page 1
Page 2Page 3
Page 4
WASMII operation I
ConfigurationData line
Input TokenRegisters
WASMII operation II(Outside chip extension)
ConfigurationData line
Input TokenRegisters
Backup RAM
External InputToken Registers
WASMII Chip
LBLayout of WASMII on DRL
Execution block
32 LBs
Control block
16 LBs
Dynamicallyreconfigured
Staticallyconfigured
WASMII on the DRL
Small applications have been implemented. Continuous System Simulation Neural Network Emulation
Almost the same speed as recent PCs Conservative implementation because of the first
prototype. → Drastically improved in the next version The limitation of the context is an essential problem.
PipeRench Architecture ( CMU )
Interconnection
Interconnection
・ ・ ・
・ ・ ・
stripe
PE
Pass registers
Global buses
PE PE PE
PEPEPEPE
Pipelined Reconfiguration
Stage 1
Stage 2
Stage 3
Stage 4
Stage 5
Cycle:
Virtual pipeline
Stage 1
Stage 2
Stage 3
Cycle:
Physical pipeline
1
1
1
2
2
1
2
3
3
1
2
3
4
4
4
2
3
5
5
4
5
3
6
6
4
5
6
Applications No flexible program change No IEEE standard floating point Not memory bounded
Image processing, analysis, pattern matching, Logic simulation, Fault simulation. Neural network simulation. Encryption /Decryption Queuing Model 、 Markov Analysis Electric Power Flow Censer processing
Efficient use of on the fly processing. Communication control 、 Protocol control Software radio
Summary
Another computing system than stored program computers.
Not a perfect replace of stored program type computers.
Advance of the semiconductor techniques directly enhance the performance.
A lot of problems and subjects to research.
Historical flow of computer systems
ENIAC
EDVAC 、 EDSAC
IBM machines
RISC, Intel’s microprocessorsReconfigurableMachine