Reconfigurable Architectures

AMANO, Hideharu

hunga@am.ics.keio.ac.jp

Reconfigurable System (Custom Computing Machine)

A target algorithm is executed directly in hardware on SRAM-style FPGAs/PLDs.

High performance of special-purpose machines.

High degree of flexibility of general-purpose machines.

An execution mechanism completely different from that of stored-program computers.

Recent FPGA/PLDs

More than 1000K gates (difficult to use efficiently).

The operating frequency is 30 MHz to 60 MHz.

Large internal data RAMs.

Simple gate arrays are being replaced by FPGAs/PLDs.

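SRAM FPGA (Xilinx) (Field Programmable Gate Array)

[Figure: an array of Logic Blocks connected by Switches. Each Logic Block holds a 5-input Look Up Table and 2 F.F.; the configuration is held in SRAM (Configuration Memory).]

SRAM CPLD (Complex Programmable Logic Device)

[Figure labels: Switch, Logic Block, I/O, SRAM (Configuration Memory).]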
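The look-up table in a logic block can be pictured as a small truth table whose bits come from the configuration memory. The following is a minimal sketch of that idea, not a model of any specific Xilinx device; the names `make_lut` and `lut_eval` are hypothetical and chosen only for illustration.

```python
def make_lut(func, n_inputs=5):
    """Build the configuration bits: one output bit per input combination."""
    table = []
    for index in range(2 ** n_inputs):
        # Input i is bit i of the table index.
        bits = [(index >> i) & 1 for i in range(n_inputs)]
        table.append(func(*bits) & 1)
    return table


def lut_eval(table, inputs):
    """Evaluate the LUT: the input bits form an address into the table."""
    index = 0
    for i, bit in enumerate(inputs):
        index |= (bit & 1) << i
    return table[index]


# Two different "configurations" of the same 5-input LUT (hypothetical examples).
parity_lut = make_lut(lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e)
and_lut = make_lut(lambda a, b, c, d, e: a & b & c & d & e)

print(lut_eval(parity_lut, [1, 0, 1, 1, 0]))
print(lut_eval(and_lut, [1, 1, 1, 1, 1]))
```

Rewriting the 32 table bits changes the logic function without changing the circuit, which is the sense in which an SRAM-based device is reconfigurable.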

Reconfigurable 　 Systems

Stand-alone type: implemented on boards or in a cabinet.

Splash 1 and 2, RM-I, II, III, IV, RASH (Mitsubishi), ATTRACTOR (NTT), FLEMING

Co-processor type: improves the performance of general-purpose processors.

PRISM I, II, DISC II, Garp, CHIMAERA, Chameleon, PipeRench

Reconfigurable 　 Systems

1990: The 1st FPL; the 1st Japanese FPGA/PLD Conf.

1993: The 1st FCCM

[Timeline figure with three tracks (Stand Alone, Co-processor, New Device). Systems shown: SPLASH, SPLASH-2, RM-I, RM-II, RM-III, RM-IV, WASMII, PRISM-I, PRISM-II, DISC, DISC-II, FIPSOC, HOSMII, Cont. Switch. FPGA, PCA, ATTRACTOR, PipeRench, Chameleon, CHIMAERA, Cache Logic, Mult. Context FPGA.]

Splash-2 (Arnold et al. 92)

U.S. Center for Computing Sciences

String matching, image processing, DNA matching: 330 times faster than the Cray-2 supercomputer.

Systolic algorithm

VHDL, Parallel C

Annapolis Micro Systems (WILDFIRE)

RM-IV (Kobe Univ.)

[Figure: an array of FPGAs, each paired with a memory (mem.), connected to an Interface.]

RASH (Mitsubishi)

[Figure: a RASH unit built on a CompactPCI bus with EXE boards and a CPU board; other labels: Display, disk, CD, Ethernet LAN.]

This slide is supported by Dr. Nakajima of Mitsubishi.

1 unit: 6 EXE boards, CPU boards (Pentium)

Multiple units can be connected.

EXE boards of RASH

FPGA: Altera FLEX10K100A (62K-158K gates)

[Figure labels: EXE-board controller, four FPGAs, PCI local bus, PCI bus, PCI-bus I/F, clocks / control signals, SRAM (2 MB), local bus.]

Mesh links and buses

2 clock lines

PCI bus I/F

A large SRAM

DRAM daughter board

ATTRACTOR (NTT)

[Figure labels: ATM I/O, Buffer, RAM (LUT), CompactPCI, Ethernet, high-speed serial link (1 Gbps).]

Board-level reconfiguration

Using various boards

Specialized for ATM communication

Co-processor type

Tightly coupled with the core CPU.

A part of the program is selected and executed.

Recently, on-chip implementation of the core CPU together with the reconfigurable part has become possible.

Tightly coupled co-processors: NAPA, Garp, Chameleon, CHIMAERA, PipeRench

PRISM II (Brown Univ.)

[Figure: Am2955 CPU with Boot ROM and Burst Mode Memory Controller, connected through a Switch (Address, Control, Data) to three FPGA Modules.]

The core part of a program is executed.

A forerunner of co-processor systems.

Garp (Hauser 97)

Proposed at UCB.

A MIPS core and a reconfigurable array share a cache system.

Loops are extracted by a compiler and converted to hardware.

Image processing: 43 times faster than an UltraSPARC.

[Figure labels: Reconfigurable Array, Crossbar, 32-bit buses × 5, Memory queue (Q), Cache.]

DISC (Wirthlin et al. 95)

Brigham Young Univ.

A general-purpose processor using a partially reconfigurable chip.

Custom instructions can be attached.

Each module can be designed by the user.

Functions are called from the C language.

[Figure labels: FPGA 1, FPGA 2, FPGA 3, Processor Core, Custom Instruction, Bus I/F, Configuration Controller, System Memory, Host PC.]

CHIMAERA (Ye et al. 2000)

Northwestern Univ.

A reconfigurable array is inserted in the datapath of a superscalar machine.

9 registers can be read in parallel from the shadow registers.

Out-of-order control

10 to 20% performance improvement

[Figure labels: Reconfigurable Array, uP Core, Register file, Shadow registers, Controller.]

Chameleon (Chameleon Co.)

Field Programmable System Level Integrated Circuits (FPSLICs)

A coarse-grain Reconfigurable Processing Fabric, RISC core, PCI controller, memory controller, DMA controller, and SRAM are implemented on a single chip.

In signal processing and communication protocol processing, it is 5 to 10 times faster than high-speed DSPs.

Chameleon CS2112

[Figure: Reconfigurable Processing Fabric, PCI controller, RISC core, memory controller, DMA subsystem, and configuration subsystem on a 128-bit RoadRunner Bus; external interfaces: 160-pin programmable I/O, 32-bit PCI bus, 64-bit memory bus.]

Reconfigurable Processing Fabric in Chameleon

[Figure labels: Slice 0 to Slice 3, each starting at Tile 0.]

108 DPUs (Data Path Units) are organized as 4 slices of 3 tiles each.

1 tile: 9 DPUs = 32-bit ALU × 7 and 16-bit + 16-bit multiplier × 2.

8 instructions stored in the CTL are executed in the DPU.

The CTL can select the next instruction in the same cycle.

The configuration can be changed by loading a bitstream.

[Figure labels (DPU): Instruction, Register &, OP, Register, Register, Barrel Shifter.]

OP: operations described in C or Verilog

SIMD arrays and pipelines are formed from multiple DPUs.
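As a quick check of the figures quoted for the fabric, the DPU count follows from the slice, tile, and per-tile numbers given on this slide (counting the multipliers as DPUs, as the slide does):

$$
\underbrace{4}_{\text{slices}} \times \underbrace{3}_{\text{tiles per slice}} \times \underbrace{(7 + 2)}_{\text{ALUs + multipliers per tile}} = 4 \times 3 \times 9 = 108 \text{ DPUs}
$$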

Problems on Reconfigurable Systems

Arithmetic units built on SRAM-type FPGAs are 10 times slower than ASIC arithmetic units and require 10 times the area.

Weak connection with memory modules.

No standard method for generating efficient hardware.

The size limitation problem.

Toward solving problems (1)

Speed and area problem compared with dedicated arithmetic units:

The disadvantage is reduced by using a novel process.

Coarse-grain FPGA

Implementation together with the CPU

Weak connection with memory modules:

Connection with a large-scale integrated SRAM

DRAM integration

DRAM-integrated FPGA (NEC)

[Figure labels: 256 × 256 DRAM Module, Word Driver (128), Logic Element.]

FPAccA (Hiroshima City Univ.)

An array of floating-point ALUs (Add/Mult) connected by a Routing Matrix.

Model 2 (0.35 um): 12 × 25 MFLOPS

Toward solving problems (2)

Algorithm conversion problem:

Co-processing with an integrated CPU

High-level synthesis techniques

Data-driven execution

Systolic algorithm

Size limitation problem:

Partial reconfiguration

Multi-context FPGA

Virtual hardware

Systolic algorithm

[Figure: data streams x and y entering a computational array.]

Data streams x and y are fed at specific intervals into a special computational array.

Suitable for reconfigurable computing.

Band matrix multiply y = Ax

$$
A = \begin{pmatrix}
a_{11} & a_{12} & 0 & 0 \\
a_{21} & a_{22} & a_{23} & 0 \\
0 & a_{32} & a_{33} & a_{34} \\
0 & 0 & a_{43} & a_{44}
\end{pmatrix}
$$

Each cell of the array is a multiply-add unit (×, +) with inputs $a$, $x$, $y_i$ and output $y_o$:

$$
y_o = a x + y_i
$$

[Animation frames: the elements of A are fed into the cells diagonal by diagonal (a12, a21, a23, a32, a34, a43, ...) while x1, x2, ... stream through. The partial sums shown in successive frames are:]

$$
y_1 = a_{11} x_1
$$

$$
y_1 = a_{11} x_1 + a_{12} x_2, \qquad y_2 = a_{21} x_1
$$

$$
y_2 = a_{21} x_1 + a_{22} x_2
$$

$$
y_2 = a_{21} x_1 + a_{22} x_2 + a_{23} x_3, \qquad y_3 = a_{32} x_2
$$
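The accumulation pattern of the band matrix example can be sketched in software. This is a functional sketch only: it reuses the cell operation yo = a·x + yi from the slide and visits only the non-zero band, but it is not cycle-accurate and does not model the skewed timing of a real systolic array. The names `mac_cell` and `band_matvec` and the numeric matrix are assumptions made for illustration.

```python
def mac_cell(a, x, y_in):
    """One cell of the array: y_out = a * x + y_in."""
    return a * x + y_in


def band_matvec(A, x, lower=1, upper=1):
    """Compute y = A x for a band matrix, touching only the band entries."""
    n = len(x)
    y = []
    for i in range(n):
        acc = 0
        # Only columns inside the band contribute to row i.
        for j in range(max(0, i - lower), min(n, i + upper + 1)):
            acc = mac_cell(A[i][j], x[j], acc)
        y.append(acc)
    return y


# Same zero pattern as the 4 x 4 band matrix on the slide (values are made up).
A = [
    [1, 2, 0, 0],
    [3, 4, 5, 0],
    [0, 6, 7, 8],
    [0, 0, 9, 10],
]
print(band_matvec(A, [1, 1, 1, 1]))  # [3, 12, 21, 19]
```

For row 1 the accumulator passes through a11·x1 and then a11·x1 + a12·x2, matching the partial sums shown in the animation frames.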

Data flow algorithm

[Figure: data flow graph with inputs a, b, c, d, e computing (a + b) × (c + (d × e)).]
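Data-driven execution of the expression (a + b) × (c + (d × e)) can be sketched as a graph in which a node fires as soon as all of its input tokens are present. This is a minimal illustration under assumed names (`run_dataflow`, `GRAPH`, and the node names are hypothetical), not the mechanism of any particular machine in these slides.

```python
import operator


def run_dataflow(graph, inputs):
    """Fire every node whose input tokens are all available, step by step."""
    tokens = dict(inputs)
    pending = dict(graph)
    fired = []
    while pending:
        ready = [name for name, (_, srcs) in pending.items()
                 if all(s in tokens for s in srcs)]
        if not ready:
            raise ValueError("deadlock: missing input tokens")
        for name in ready:
            op, srcs = pending.pop(name)
            tokens[name] = op(*(tokens[s] for s in srcs))
            fired.append(name)
    return tokens, fired


# Graph for (a + b) x (c + (d x e)); node names are illustrative.
GRAPH = {
    "sum_ab": (operator.add, ("a", "b")),
    "mul_de": (operator.mul, ("d", "e")),
    "sum_c": (operator.add, ("c", "mul_de")),
    "result": (operator.mul, ("sum_ab", "sum_c")),
}

tokens, fired = run_dataflow(GRAPH, {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5})
print(tokens["result"])  # 69
```

In the first step both `sum_ab` and `mul_de` are ready at once, which is the parallelism a hardware array can exploit directly.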

PCA (Plastic Cell Architecture)

Unit cells are connected in a mesh structure.

[Figure labels: Plastic Part, Built-in Part, Variable, Logic/Memory, 16-word × 1-bit memory (LUT), input: 1 bit, output: 1 bit, control: 6 bits, data: 8 bits, 16 bits, 14 bits, hardware controller which executes 12 instructions.]

PCA (Plastic Cell Architecture)

[Figure labels: Plastic Part, Built-in Part, Configuration Path, Communication Path.]

Self-reconfiguration

Asynchronous communication

Multi-context FPGA

[Figure: Logic cells with Input data and Output data; several SRAM slots, each holding one Context.]

The configuration RAM can be switched.

Fujitsu's MPLD (1990), WASMII (1992), Xilinx (1997), NEC's DRL (1999)
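The idea of a multi-context device can be sketched as a logic cell with several configuration slots, only one of which is active at a time. This is a simplified model under stated assumptions: the class `MultiContextCell` and its methods are hypothetical, and a real device switches the configuration of all cells and routing at once.

```python
class MultiContextCell:
    """A LUT-based logic cell with several configuration slots (contexts)."""

    def __init__(self, n_inputs, n_contexts):
        self.n_inputs = n_inputs
        # One truth table per SRAM slot, initially all zeros.
        self.slots = [[0] * (2 ** n_inputs) for _ in range(n_contexts)]
        self.active = 0

    def load(self, slot, func):
        """Write a configuration into a slot without disturbing the active one."""
        self.slots[slot] = [
            func(*[(i >> k) & 1 for k in range(self.n_inputs)]) & 1
            for i in range(2 ** self.n_inputs)
        ]

    def switch(self, slot):
        """Select which slot drives the logic cell."""
        if not 0 <= slot < len(self.slots):
            raise IndexError("no such context")
        self.active = slot

    def evaluate(self, *bits):
        index = sum((b & 1) << k for k, b in enumerate(bits))
        return self.slots[self.active][index]


cell = MultiContextCell(n_inputs=2, n_contexts=8)
cell.load(0, lambda a, b: a & b)   # context 0: AND
cell.load(1, lambda a, b: a ^ b)   # context 1: XOR
print(cell.evaluate(1, 1))         # 1 (AND)
cell.switch(1)
print(cell.evaluate(1, 1))         # 0 (XOR)
```

Because a new configuration is loaded into an inactive slot, switching contexts costs only a slot selection rather than a full reload.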

Dynamic Reconfigurable Logic (DRL)

Multi-context (8 contexts) and partial reconfiguration.

4×12 Logic Blocks (LB) and interface logic.

Each Logic Block: 4×4 Unified Cells (UC), a Reconfiguration Controller (RC), and a Bus Connector (BC).

[Figure labels: Input Select, Output Select, External Config. Control (4b), Address Decoder, Config. Store Control (2b), Reset, CLK, Configuration Store, Address (10b), Data, Config, Memory, Vertical Local Bus, Horizontal Local Bus (bus widths 4b×2 and 3b×2), Global Bus, Switch, BC.]

LB: Logic Block

UC: Unified Cell

RC: Reconfiguration Controller

BC: Bus Connector

Virtual Hardware WASMII on the DRL (Keio U. + NEC)

[Figure labels: WASMII chip, Execution block, Configuration Data line, Input Token Registers, Controller, Token router, Active Page, Page 3.]

WASMII operation I

WASMII operation II (outside-chip extension)

[Figure labels: Backup RAM, External Input Token Registers, WASMII Chip.]

Layout of WASMII on DRL

[Figure: Execution block of 32 LBs (dynamically reconfigured) and Control block of 16 LBs (statically configured).]

WASMII on the DRL

Small applications have been implemented:

Continuous system simulation

Neural network emulation

Almost the same speed as recent PCs.

The implementation is conservative because it is the first prototype; it should be drastically improved in the next version.

The limitation on the number of contexts is an essential problem.

PipeRench Architecture (CMU)

[Figure labels: stripe, PE, Interconnection, Pass registers, Global buses.]

Pipelined Reconfiguration

[Figure: a virtual pipeline of 5 stages (Stage 1 to Stage 5) shown cycle by cycle, mapped onto a physical pipeline of 3 stages (Stage 1 to Stage 3).]
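One way to picture how a 5-stage virtual pipeline fits onto 3 physical stages is a round-robin schedule in which one physical stage is reconfigured per cycle. The sketch below assumes a ring mapping (physical stage = cycle mod 3, virtual stage = cycle mod 5); this mapping and the function name `reconfiguration_schedule` are illustrative assumptions, not details taken from the slide. Stages are numbered from 0 here.

```python
def reconfiguration_schedule(n_virtual, n_physical, n_cycles):
    """Return, per cycle, which virtual stage is loaded into which physical stage."""
    schedule = []
    resident = [None] * n_physical  # virtual stage currently held by each physical stage
    for cycle in range(n_cycles):
        physical = cycle % n_physical   # assumed ring order of physical stages
        virtual = cycle % n_virtual     # virtual stages are loaded in program order
        resident[physical] = virtual
        schedule.append((cycle, physical, virtual, tuple(resident)))
    return schedule


for cycle, physical, virtual, resident in reconfiguration_schedule(5, 3, 6):
    print(f"cycle {cycle}: load virtual stage {virtual} "
          f"into physical stage {physical}, resident = {resident}")
```

From cycle 3 onward a physical stage is overwritten with a later virtual stage, which is how the hardware can run a pipeline longer than itself.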

Applications

Requirements: no flexible program change, no IEEE standard floating point, not memory bound.

Image processing, analysis, pattern matching

Logic simulation, fault simulation

Neural network simulation

Encryption / decryption

Queuing models, Markov analysis

Electric power flow

Sensor processing

Efficient use of on-the-fly processing: communication control, protocol control, software radio

Summary

A computing system different from stored-program computers.

Not a complete replacement for stored-program computers.

Advances in semiconductor technology directly enhance performance.

Many problems and research subjects remain.

Historical flow of computer systems

EDVAC, EDSAC

IBM machines

RISC, Intel's microprocessors

Reconfigurable Machine
