reconfigurable architectures

Reconfigurable 　Architectures

AMANO, Hideharu

hunga ＠ am ． ics ． keio ． ac ． jp

Reconfigurable 　 System（ Custom 　 Computing 　Machine ） A target algorithm is executed directly with

a hardware on SRAM-style FPGA/PLDs. High performance of special purpose machine

s. High degree of flexibility of general purpose ma

chines. A completely different execution mechanis

m from a stored program computers.

Recent FPGA/PLDs

More than 1000K Gates (It is difficult to use efficiently.)

The operational frequency is 30MHz – 60MHz.

A large internal data RAMs.

Simple Gate Arrays are replaced with FPGA/PLDs.

Switch

Logic 　 Block

I/O

SRAM FPGA (Xilinx’s)(Field 　 Programmable 　 Gate 　 Array)

5-inputs

2 　 F ． F ．

Look 　 Up 　 Table

Switch

Configuration 　 Memory

SRAM(Configuration 　 Memory ）

Switch

Logic 　 BlockI/O

SRAM CPLD(Complex 　 Programmable 　 Logic 　 Device)

Reconfigurable 　 Systems

Stand alone type Implemented on boards or cabinet. Splash 　１・２，　 RM-I,II,III,IV ， RASH （ Mit

subishi ） , ATTRACTOR （ NTT ） , 　ＦＬＥＭＩＮＧ

Co-processor type Improve performance of general purpose process

ors. PRISM 　 I,II, 　ＤＩＳＣ　ＩＩ , Garp, CHEMAE

RA, Chameleon, PipeRench

Reconfigurable 　 Systems

1990 The 1st FPLThe 1st Japanese 　FPGA/PLD 　Conf.

1992

1993 The 1st FCCM

2000

1995

SPLASH

SPLASH-2RM-IRM-II

RM-IVRM-III

RM-V

WASMII

PRISM-IPRISM-II

DISCDISC-II

DRL

FIPSOCHOSMII

Cont ． Switch ．FPGA PCA

RASH

YARDS

ATTRACTOR

PipeRench

ChameleonCHIMERA

Stand 　Alone

Co-processor

New 　Device

MPLD

Cache 　LogicMult ． Context

　 FPGA

Splash-2 　 (Arnold et.al 92)

米国計算機科学センター

String matching, Image processing, DNA matching, 330 times faster than the supercomputer Cray-II.

Systolic algorithm VHDL, 　 Parallel C Annapolis 　 Micro 　

Systems （ WILDFIRE)

RM-IV (Kobe Univ.)

ＦＰＧＡmem ．

ＦＰＧＡmem ．ＦＰＧＡmem ．

ＦＰＧＡmem ．

ＦＰＧＡmem ．

ＦＰＧＡmem ．ＦＰＧＡmem ．

ＦＰＧＡmem ．

ＦＰＧＡ mem ．

ＦＰＧＡ mem ．ＦＰＧＡ mem ．



ＦＰＧＡ mem ．ＦＰＧＡ mem ．


FPIC

Ｉｎｔｅｒｆａｃｅ

&p 10

RASH(Mitsubishi)

Display

RASH unit

CompactPCI bus

diskCD

EXE-ボード

CPUボード

EthernetLAN

This slide is supported by Dr.Nakajima of Mitsubishi.

1Unit: 6 EXE boards CPU boards (Pentium)

Multiple Units can be connected

&p 11

EXE boards of RASH

FPGA 　 Altera 　 FLEX10K100A 　 (62K-158KGate)

EXE-board controller

FPGA FPGA FPGA FPGA

FPGA FPGA FPGA FPGA

PCI Local-bus

PCI-bus

PCI-bus I/F

Clocks ／ Cont. signals

SRAM（ 2MB ）

Local-bus

Mesh links and buses

2 clock lines

PCI bus I/F

A large SRAM

DRAM daughter board

ATTRACTOR （ NTT ）

RISC

FPGA

RISC

FPGA

RISC

ATMI/O

RISC

ATMSW

RISC

Buffer

RISC

RAM（ LUT)

MPU

Mem.

Compact 　PCI

Ethernet

High speed serial link （ 1Gbps ）

Board level reconfiguration

Using various boardsSpecialized for ATM communication

Co-processor type

Tightly coupled with core CPU A part of program is selected and executed. Recently, on-chip implementation with the cor

e CPU and reconfigurable part is possible. Tightly coupling co-processor

NAPA, 　 Garp, 　 Chameleon, 　 CHIMAERA, 　 PipeRench

PRISM 　 II （ Brown Univ. ）

Am2955CPU

Boot 　 ROM

Burst 　 ModeMemory 　 Controller

ＦＰＧＡ　Ｍｏｄｕｌｅ

DRAM

DRAM

S ｗｉｔｃｈ

ＦＰＧＡ　ＭｏｄｕｌｅＦＰＧＡ　Ｍｏｄｕｌｅ

ＡｄｄｒｅｓｓＣｏｎｔｒｏｌ

Ｄａｔａ

A program core is executed.

Frontier of co-processors.

Garp (Hauser97)

Proposed in UCB MIPS Core and Reconfigura

ble Array share a cache system.

Loop is extracted with a compiler, and converted to hardware.

Image processing, 43 times faster than Ultrasparc

MIPS

Crossbar

Q Q QCache

Memory queue

32bit buses x 5

Reconfigurable Array

DISC (Wirthlin et al. 95)

Brigham Young Univ. A general purpose processor

using partial reconfigurable chip.

Custom instructions can be attached. Each module can be designed

by the user. Function called by C-language.

FPGA 3

ProcessorCore

FPGA 1

Bus I/FConfiguration

Controller

FPGA 2

CustomInstruction

Space

SystemMemory

Host P/C

CHIMAERA (Ye et.al. 2000)

Northwestern Univ. A reconfigurable array is i

nserted in the datapath of a super-scalar machine.

9 registers can be read in parallel from the shadow-register.

Out of Order control 10 ～ 20% performance i

mprovement

ReconfigurableArrayuP Core

Registerfile

Shadow registers

Controller

Chameleon （ Chameleon Co. ）　 Field 　 Programmable 　 System 　 Level 　

Integrated 　 Circuits 　 (FPSLICs) Coarse grain Reconfigurable 　 Processing 　 Fa

bric 、 RISC 　 Core 、 PCI 　 Controller 、 Memory 　 Controller 、 DMA 　 Controller and SRAM are implemented on a single chip.

In Signal processing, Communication protocol processing, It is 5-10 times faster than high speed DSPs.

Chameleon CS2112

ReconfigurableProcessing

Fabric

128-bit RoadRunner Bus

PCI Cont. RISC CoreMemory

Controller

DMASubsystem

ConfigurationSubsystem

160-pin Programmable I/O

32-bit PCI Bus 64-bit Memory Bus

Reconfigurable Processing Fabric in Chameleon

DPU

CTLLM

Tile 　0

Slice 　0

DPU

CTLLM

Tile 　0

Slice 　3

108 DPU(Data 　 Path 　 Unit)s consists 4 Slices （ 3Tiles each ）

1Tile: 　 9DPU ＝ 32bit ALU X 7 16bit + 16bit multiplier 　 X 　２

8 instructions stored in the CTL are executed in the DPU.

The CTL can select the next instruction in the same cycle.

Configuration can be changed by loading a bit 　stream.

DPU

Instruction

Rou

ting

MU

XR

ou

ting

MU

X

Register＆

Mask

Register＆

Mask

OP Register

RegisterBarrelShifter

OP ： Operations in C or Verilog

SIMD arrays and pipelines are formed with multiple DPUs.

Problems on Reconfigurable Systems Calculators with SRAM type FPGAs are 10 ti

mes slower than ASIC calculators and requires 10 times wide area.

Weak connection between memory modules. No standard method for generating a efficient

hardware. The size limitation problem.

Toward solving problems (1)

Speed and area problem compared with dedicated calculators. The disadvantage is reduced using a novel

process. Coarse grain FPGA Implementation with the CPU

Weak connection between memory modules Connection with a large scale integrated SRAM DRAM integration

DRAM integrated FPGA （ NEC)

256 　 X 　 256DRAMModule

Word 　 Driver(128)

Sen

se　

Am

p．

(128

)

Sen

se　

Am

p．

(128

)

LogicElement

LogicElement

LogicElement

LogicElement

Word 　 Driver(128)

FPAccA (Hiroshima City Univ.)

Routing　Matrix

ＡＬＵ

Array 　 of 　 floating

ALU(Add/Mult ）

model2(0． 35um)

12ｘ　25MFLOPS　

Toward solving problems (2)

Algorithm conversion problem Co-processing between integrated CPU High-level synthesis techniques Data-driven execution Systolic algorithm

Size limitation problem Partial 　 Reconfiguration Multi-context 　 FPGA Virtual hardware

Systolic algorithm

Data x

Data y

Computational array

A data stream x, y are inserted with a specific interval intoa special computational array.Suitable for reconfigurable computing.

Band matrix multiply 　 y=Ax

a

x

a11 a12 0 0

a21 a22 a23 0

0 a32 a33 a34

0 0 a43 a44

Ｘ＋

ｙｉｙｏ

ｙｏ＝ａｘ＋ｙｉ

y0

y1

y2

y3

x0

x1

x2

x3

=


Ｘ＋

a11

x1

a12 a21

a22

a23 a32

a11 a12 0 0

a21 a22 a23 0

0 a32 a33 a34

0 0 a43 a44


Ｘ＋Ｘ＋x1

a12 a21

a22

a23 a32

a11 a12 0 0

a21 a22 a23 0

0 a32 a33 a34

0 0 a43 a44

a33

y1=a11x1

x2


X ＋x1

a34 a43

a22

a23 a32

a11 a12 0 0

a21 a22 a23 0

0 a32 a33 a34

0 0 a43 a44a33

y2=a21 x1

x2x3

y1=a11 x1+ a12 x2


Ｘ＋Ｘ＋

a34 a43

a44

a23 a32

a11 a12 0 0

a21 a22 a23 0

0 a32 a33 a34

0 0 a43 a44

a33

y2=a21 x1+ a22 x2

x2x3


Ｘ＋

a34 a43

a44

a11 a12 0 0

a21 a22 a23 0

0 a32 a33 a34

0 0 a43 a44

a33y2=a21 x1+ a22 x2+ a23 x3

x2x3

y3= a32 x2

Data flow algorithm

ｘ

＋

ｘ

＋

ａｂｃ

ｄｅ

（ａ＋ｂ）ｘ（ｃ＋（ｄｘｅ））

PlasticPart

7

0 7

Built-inPart

16bits

16word x 1bitmemory （ LUT)

output: 1bit

input: 1bit

control: 6bitsdata: 8bits

16bits

14bits

14bits

Hardware controller which executes12 instructions

Logic/Memory

ＰＣＡ（ Plastic 　 Cell 　Architecture ）

Unit cells are connected in a mesh structure

Variable

PCA(Plastic Cell Architecture)

PlasticPart

Built-in Part

ConfigurationPath

Communication Path

.….

Self reconfigurationAsynchronous communication

Multi-context FPGA

Mul

tipl

exer

SRAM slots

n

Logic cells

1

2

Input data

Output data

Logic cellsLogic cellsContext

Configuration 　 RAM can be changeable.Fujitsu’s MPLD(1990)Fujitsu’s MPLD(1990) 、、 WASMII(1992)WASMII(1992) 、、 Xilinx(1997)Xilinx(1997)

NEC’s DRL(1999)NEC’s DRL(1999)

Dynamic Reconfigurable Logic Multi-context(8 context) and partial reconfiguration.

4×12 (Logic Block, LB) Interface logics

Logic Block 4×4 (Unified Cell, UC) Reconfiguration Controller (RC) Bus Connector (BC)

Conf

igur

atio

n/Da

ta In

put(7

9b)

Input Select

External Config. Control(4b)

Address Decoder

Config. Store Control(2b)

Reset CLK Output Select

Conf

igur

atio

n/Da

ta O

utpu

t(79b

)

Configuration 　Store 　

Address(10b)LB: 論理ブロック

DRLIn

put S

elec

tor

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

Out

put S

elec

tor

LB

Data Config Vertical Local Bus(4b×2) (3b×2) (4b×2)

Horizontal Local Bus

(4b×2)

Memory

Data

Config(3b×2)

(4b×2)

(4b×2)

UC ： Unified Cell

RC ： Reconfiguable Circuit

BC:Bus Connecter

RCUC

UC

UC

UC

UC

UC

UC

UC

UC UC

UC UC

UC UC

UC UC

Global Bus

Switch

BC

BC

BC

BC

BC BCBC BC

RC

BC

BC

BC

BC

BC BCBC BC

UC

UC

UC

UC

UC

UC

UC

UC

UC UC

UC UC

UC UC

UC UC

WASMII chip

Execution block

Virtual Hardware WASMII on the DRL（ Keio U. +NEC)

ConfigurationData line

Input TokenRegisters

Controller

Token router


Active Page

Page 1

Page 2Page 3

Page 4

WASMII operation I



WASMII operation II(Outside chip extension)



Backup RAM

External InputToken Registers

WASMII Chip

LBLayout of WASMII on DRL

Execution block

32 LBs

Control block

16 LBs

Dynamicallyreconfigured

Staticallyconfigured

WASMII on the DRL

Small applications have been implemented. Continuous System Simulation Neural Network Emulation

Almost the same speed as recent PCs Conservative implementation because of the first

prototype. → Drastically improved in the next version The limitation of the context is an essential problem.

PipeRench Architecture （ CMU ）

Interconnection

Interconnection

・・・

・・・

stripe

PE

Pass registers

Global buses

PE PE PE

PEPEPEPE

Pipelined Reconfiguration

Stage 1

Stage 2

Stage 3

Stage 4

Stage 5

Cycle:

Virtual pipeline

Stage 1

Stage 2

Stage 3

Cycle:

Physical pipeline

1

1

1

2

2

1

2

3

3

1

2

3

4

4

4

2

3

5

5

4

5

3

6

6

4

5

6

Applications No flexible program change No IEEE standard floating point Not memory bounded

Image processing, analysis, pattern matching, Logic simulation, Fault simulation. Neural network simulation. Encryption /Decryption Queuing 　 Model 、 Markov Analysis Electric Power Flow Censer processing

Efficient use of on the fly processing. Communication control 、 Protocol control Software radio

Summary

Another computing system than stored program computers.

Not a perfect replace of stored program type computers.

Advance of the semiconductor techniques directly enhance the performance.

A lot of problems and subjects to research.

Historical flow of computer systems

ENIAC

EDVAC 、 EDSAC

IBM machines

RISC, Intel’s microprocessorsReconfigurableMachine

reconfigurable architectures

Documents