reconfigurable architectures

Post on 11-Jan-2016



Reconfigurable  Architectures

AMANO, Hideharu

hunga @ am . ics . keio . ac . jp

Reconfigurable System (Custom Computing Machine)

A target algorithm is executed directly in hardware on SRAM-style FPGAs/PLDs.

High performance of special purpose machines.

High degree of flexibility of general purpose machines.

An execution mechanism completely different from that of stored program computers.

Recent FPGA/PLDs

More than 1000K gates (difficult to use efficiently).

The operating frequency is 30 MHz to 60 MHz.

Large internal data RAMs.

Simple gate arrays are being replaced with FPGAs/PLDs.

SRAM FPGA (Xilinx's) (Field Programmable Gate Array)

[Figure labels: Switch, Logic Block, I/O, 5-input Look Up Table, 2 F.F., Configuration Memory]
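A look-up table realizes an arbitrary Boolean function by storing its truth table in configuration memory and using the inputs as an address. The following is a minimal Python sketch of that idea for a 5-input table. The names `LookUpTable` and `configure_from_function`, and the bit ordering (input 0 is the least significant address bit), are assumptions made for illustration and are not taken from the slides.

```python
class LookUpTable:
    """Model of a k-input LUT: one stored output bit per input pattern."""

    def __init__(self, num_inputs, config_bits):
        if len(config_bits) != 2 ** num_inputs:
            raise ValueError("need 2**num_inputs configuration bits")
        self.num_inputs = num_inputs
        self.config_bits = list(config_bits)

    def evaluate(self, inputs):
        # The input bits form an address into the configuration memory.
        address = 0
        for position, bit in enumerate(inputs):
            address |= (bit & 1) << position
        return self.config_bits[address]


def configure_from_function(num_inputs, func):
    """Build the truth table (the configuration data) for a Boolean function."""
    table = []
    for address in range(2 ** num_inputs):
        bits = [(address >> i) & 1 for i in range(num_inputs)]
        table.append(func(*bits) & 1)
    return LookUpTable(num_inputs, table)


# Example: configure the table as a 5-input parity (XOR) function.
parity_lut = configure_from_function(5, lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e)
print(parity_lut.evaluate([1, 0, 1, 1, 0]))  # 1
```

Reconfiguring the device corresponds to writing a different truth table into `config_bits`; the evaluation logic itself never changes.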

SRAM CPLD (Complex Programmable Logic Device)

[Figure labels: SRAM (Configuration Memory), Switch, Logic Block, I/O]

Reconfigurable Systems

Stand alone type: implemented on boards or in a cabinet.

Splash 1, 2, RM-I, II, III, IV, RASH (Mitsubishi), ATTRACTOR (NTT), FLEMING

Co-processor type: improves the performance of general purpose processors.

PRISM I, II, DISC II, Garp, CHIMAERA, Chameleon, PipeRench

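Reconfigurable Systems (history)

[Figure: timeline with year marks 1990, 1992, 1993, 1995 and 2000. Milestones shown: the 1st FPL, the 1st Japanese FPGA/PLD Conf., the 1st FCCM (1993).]

Categories in the figure: Stand Alone, Co-processor, New Device.

Systems and devices in the figure: SPLASH, SPLASH-2, RM-I, RM-II, RM-III, RM-IV, RM-V, WASMII, PRISM-I, PRISM-II, DISC, DISC-II, DRL, FIPSOC, HOSMII, Cont. Switch. FPGA, PCA, RASH, YARDS, ATTRACTOR, PipeRench, Chameleon, CHIMAERA, MPLD, Cache Logic, Mult. Context FPGA.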

Splash-2 (Arnold et al. 92)

Center for Computing Sciences (USA)

String matching, image processing, DNA matching: 330 times faster than the supercomputer Cray-II.

Systolic algorithms

VHDL, Parallel C

Annapolis Micro Systems (WILDFIRE)

RM-IV (Kobe Univ.)

[Figure labels: FPGA with mem. (repeated for each FPGA), FPIC, Interface]

RASH (Mitsubishi)

[Figure labels: Display, RASH unit, CompactPCI bus, disk, CD, EXE board, CPU board, Ethernet, LAN]

This slide is supported by Dr. Nakajima of Mitsubishi.

1 unit: 6 EXE boards, CPU boards (Pentium)

Multiple units can be connected.

EXE boards of RASH

FPGA: Altera FLEX10K100A (62K-158K gates)

[Figure labels: EXE-board controller, FPGA (×8), PCI Local-bus, PCI-bus, PCI-bus I/F, Clocks / Cont. signals, SRAM (2MB), Local-bus]

Mesh links and buses

2 clock lines

PCI bus I/F

A large SRAM

DRAM daughter board

ATTRACTOR (NTT)

[Figure labels: RISC, FPGA, ATM I/O, ATM SW, Buffer, RAM (LUT), MPU, Mem., Compact PCI, Ethernet, High speed serial link (1 Gbps)]

Board level reconfiguration

Using various boards

Specialized for ATM communication

Co-processor type

Tightly coupled with the core CPU.

A part of the program is selected and executed.

Recently, on-chip implementation of the core CPU together with the reconfigurable part has become possible.

Tightly coupled co-processors: NAPA, Garp, Chameleon, CHIMAERA, PipeRench

PRISM II (Brown Univ.)

[Figure labels: Am2955 CPU, Boot ROM, Burst Mode Memory Controller, FPGA Module (×3), DRAM (×2), Switch, Address, Control, Data]

The core part of a program is executed.

A pioneer of co-processor type systems.

Garp (Hauser 97)

Proposed at UCB.

A MIPS core and a reconfigurable array share a cache system.

Loops are extracted by a compiler and converted to hardware.

Image processing: 43 times faster than UltraSPARC.

[Figure labels: MIPS, Crossbar, Cache, Memory queue (Q), 32-bit buses × 5, Reconfigurable Array]

DISC (Wirthlin et al. 95)

Brigham Young Univ.

A general purpose processor using a partially reconfigurable chip.

Custom instructions can be attached.

Each module can be designed by the user.

Functions are called from the C language.

[Figure labels: FPGA 1, FPGA 2, FPGA 3, Processor Core, Bus I/F, Configuration Controller, Custom Instruction Space, System Memory, Host PC]

CHIMAERA (Ye et al. 2000)

Northwestern Univ.

A reconfigurable array is inserted in the datapath of a super-scalar machine.

9 registers can be read in parallel from the shadow registers.

Out-of-order control

10 to 20% performance improvement

[Figure labels: Reconfigurable Array, uP Core, Register file, Shadow registers, Controller]

Chameleon (Chameleon Co.)

Field Programmable System Level Integrated Circuits (FPSLICs)

A coarse grain Reconfigurable Processing Fabric, RISC Core, PCI Controller, Memory Controller, DMA Controller and SRAM are implemented on a single chip.

In signal processing and communication protocol processing, it is 5-10 times faster than high speed DSPs.

Chameleon CS2112

[Figure labels: Reconfigurable Processing Fabric, 128-bit RoadRunner Bus, PCI Cont., RISC Core, Memory Controller, DMA Subsystem, Configuration Subsystem, 160-pin Programmable I/O, 32-bit PCI Bus, 64-bit Memory Bus]

Reconfigurable Processing Fabric in Chameleon

[Figure labels: Slice 0 to Slice 3, Tile 0, DPU, CTL, LM]

The 108 DPUs (Data Path Units) are organized as 4 slices (3 tiles each).

1 tile: 9 DPUs = 32-bit ALU × 7 + 16-bit × 16-bit multiplier × 2

The 8 instructions stored in the CTL are executed in the DPU.

The CTL can select the next instruction in the same cycle.

The configuration can be changed by loading a bit stream.
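The DPU count quoted on this slide follows directly from the slice and tile numbers. A small arithmetic check in Python is given below; the constant names are chosen here for illustration, and the multiplier is counted as a DPU as the slide's "9 DPU" figure implies.

```python
# Figures taken from the slide; names are illustrative only.
SLICES = 4
TILES_PER_SLICE = 3
ALUS_PER_TILE = 7          # 32-bit ALUs
MULTIPLIERS_PER_TILE = 2   # multipliers, counted as DPUs on the slide

dpus_per_tile = ALUS_PER_TILE + MULTIPLIERS_PER_TILE
total_dpus = SLICES * TILES_PER_SLICE * dpus_per_tile
total_alus = SLICES * TILES_PER_SLICE * ALUS_PER_TILE
total_multipliers = SLICES * TILES_PER_SLICE * MULTIPLIERS_PER_TILE

print(dpus_per_tile, total_dpus)       # 9 108
print(total_alus, total_multipliers)   # 84 24
```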

DPU

[Figure labels: Instruction, Routing MUX (×2), Register & Mask (×2), OP, Register, Barrel Shifter]

OP: operations in C or Verilog

SIMD arrays and pipelines are formed with multiple DPUs.

Problems on Reconfigurable Systems

Arithmetic units built on SRAM-type FPGAs are 10 times slower than ASIC arithmetic units and require 10 times the area.

Weak connection with memory modules.

No standard method for generating efficient hardware.

The size limitation problem.

Toward solving problems (1)

Speed and area problem compared with dedicated arithmetic units:

The disadvantage is reduced by using a novel process.

Coarse grain FPGA

Implementation together with the CPU

Weak connection with memory modules:

Connection with a large-scale integrated SRAM

DRAM integration

DRAM integrated FPGA (NEC)

[Figure labels: 256 × 256 DRAM Module, Word Driver (128) (×2), Sense Amp. (128) (×2), Logic Element (×4)]

FPAccA (Hiroshima City Univ.)

[Figure labels: Routing Matrix, ALU]

Array of floating-point ALUs (Add/Mult)

Model 2 (0.35 µm): 12 × 25 MFLOPS

Toward solving problems (2)

Algorithm conversion problem:

Co-processing with an integrated CPU

High-level synthesis techniques

Data-driven execution

Systolic algorithm

Size limitation problem:

Partial reconfiguration

Multi-context FPGA

Virtual hardware

Systolic algorithm

[Figure labels: Data x, Data y, Computational array]

Data streams x and y are fed at specific intervals into a special computational array.

Suitable for reconfigurable computing.

Band matrix multiply y = Ax

$$
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix}
=
\begin{pmatrix}
a_{11} & a_{12} & 0 & 0 \\
a_{21} & a_{22} & a_{23} & 0 \\
0 & a_{32} & a_{33} & a_{34} \\
0 & 0 & a_{43} & a_{44}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
$$

Each multiply-add cell (X +) takes a, x and an incoming partial sum y_i, and outputs

$$
y_o = a\,x + y_i
$$

[Figure sequence: the elements of A enter the multiply-add cells in the order a11; a12, a21; a22; a23, a32; a33; a34, a43; a44, while x1, x2, x3 flow through the array. The partial results shown in successive steps are:]

y1 = a11 x1

y1 = a11 x1 + a12 x2, y2 = a21 x1

y2 = a21 x1 + a22 x2

y2 = a21 x1 + a22 x2 + a23 x3, y3 = a32 x2
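The cell operation on these slides can be sketched in Python as follows. This is a functional sketch only: it chains the multiply-add cell over the non-zero band of A and reproduces the accumulated sums, but it does not model the cycle-by-cycle timing of the systolic array. The function names, the 0-based indexing, and the numeric test matrix are assumptions introduced for illustration.

```python
def mac_cell(a, x, y_in):
    """One multiply-add cell: y_out = a * x + y_in."""
    return a * x + y_in


def band_matvec(A, x, lower=1, upper=1):
    """Compute y = A x for a band matrix, touching only the band entries.

    lower/upper are the numbers of sub- and super-diagonals (1 and 1 for the
    tridiagonal matrix on the slides).
    """
    n = len(x)
    y = [0] * n
    for i in range(n):
        first = max(0, i - lower)
        last = min(n, i + upper + 1)
        for j in range(first, last):
            # The partial sum y[i] passes through one cell per band element.
            y[i] = mac_cell(A[i][j], x[j], y[i])
    return y


# Hypothetical numeric instance of the 4 x 4 band matrix.
A = [
    [1, 2, 0, 0],
    [3, 4, 5, 0],
    [0, 6, 7, 8],
    [0, 0, 9, 10],
]
x = [1, 2, 3, 4]
y = band_matvec(A, x)
print(y)  # [5, 26, 65, 67]
```

Because only the band is visited, the number of cell operations grows with the band width rather than with the full matrix size, which is what makes a small fixed array of cells sufficient.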

Data flow algorithm

[Figure: data flow graph with inputs a, b, c, d, e]

(a + b) × (c + (d × e))
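Data-driven execution of this expression can be illustrated with a small Python sketch in which a node fires as soon as both of its operand tokens are available. The graph encoding, the node names (`n_add1`, `n_mul1`, `n_add2`, `n_out`) and the function `run_dataflow` are assumptions made for this illustration, not part of the slides.

```python
import operator

# Data flow graph for (a + b) * (c + (d * e)).
# Each node: (operation, left operand name, right operand name).
GRAPH = {
    "n_add1": (operator.add, "a", "b"),
    "n_mul1": (operator.mul, "d", "e"),
    "n_add2": (operator.add, "c", "n_mul1"),
    "n_out": (operator.mul, "n_add1", "n_add2"),
}


def run_dataflow(graph, inputs):
    """Fire every node whose input tokens are present, step by step."""
    tokens = dict(inputs)
    pending = dict(graph)
    firing_order = []
    while pending:
        ready = [
            name
            for name, (_, lhs, rhs) in pending.items()
            if lhs in tokens and rhs in tokens
        ]
        if not ready:
            raise ValueError("no node can fire: an input token is missing")
        for name in ready:
            op, lhs, rhs = pending.pop(name)
            tokens[name] = op(tokens[lhs], tokens[rhs])
        firing_order.append(sorted(ready))
    return tokens, firing_order


tokens, firing_order = run_dataflow(GRAPH, {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5})
print(tokens["n_out"])  # 69
print(firing_order)     # [['n_add1', 'n_mul1'], ['n_add2'], ['n_out']]
```

The first step fires a + b and d × e together, which is the parallelism a hardware mapping of the graph would exploit.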

PCA (Plastic Cell Architecture)

Unit cells are connected in a mesh structure.

[Figure labels: Plastic Part, Built-in Part, Variable, Logic/Memory, 16-word × 1-bit memory (LUT), input: 1 bit, output: 1 bit, control: 6 bits, data: 8 bits, 16 bits, 14 bits, hardware controller which executes 12 instructions]

[Figure labels: Plastic Part, Built-in Part, Configuration Path, Communication Path]

Self reconfiguration

Asynchronous communication

Multi-context FPGA

[Figure labels: Multiplexer, SRAM slots (1, 2, ..., n), Context, Logic cells, Input data, Output data]

The configuration RAM can be changed.

Fujitsu's MPLD (1990), WASMII (1992), Xilinx (1997)
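The multi-context idea can be sketched in Python as a logic cell that holds several configuration slots and uses a multiplexer (here just an index) to pick the active one. The class `MultiContextCell`, its method names, and the choice of a 2-input cell with two contexts are assumptions for illustration; real devices differ in detail.

```python
class MultiContextCell:
    """Logic cell with several SRAM configuration slots and one active context."""

    def __init__(self, num_inputs, num_contexts):
        self.num_inputs = num_inputs
        self.slots = [[0] * (2 ** num_inputs) for _ in range(num_contexts)]
        self.active = 0

    def load_context(self, index, truth_table):
        # Loading a slot does not disturb the currently active context.
        if len(truth_table) != 2 ** self.num_inputs:
            raise ValueError("truth table has the wrong length")
        self.slots[index] = list(truth_table)

    def switch_context(self, index):
        # Models the multiplexer selecting another SRAM slot.
        if not 0 <= index < len(self.slots):
            raise IndexError("no such context")
        self.active = index

    def evaluate(self, inputs):
        address = sum((bit & 1) << i for i, bit in enumerate(inputs))
        return self.slots[self.active][address]


cell = MultiContextCell(num_inputs=2, num_contexts=2)
cell.load_context(0, [0, 0, 0, 1])  # context 0: AND
cell.load_context(1, [0, 1, 1, 0])  # context 1: XOR

and_result = cell.evaluate([1, 1])
cell.switch_context(1)
xor_result = cell.evaluate([1, 1])
print(and_result, xor_result)  # 1 0
```

Switching contexts only changes an index, so it is much cheaper than reloading a configuration; the cost is the extra storage for every slot.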

NEC's DRL (1999)

Dynamic Reconfigurable Logic

Multi-context (8 contexts) and partial reconfiguration.

4×12 Logic Blocks (LBs), interface logic

Logic Block: 4×4 Unified Cells (UCs), Reconfiguration Controller (RC), Bus Connectors (BCs)

[Figure: DRL chip. Labels: Configuration/Data Input (79b), Input Select, External Config. Control (4b), Address Decoder, Config. Store Control (2b), Reset, CLK, Output Select, Configuration/Data Output (79b), Configuration Store Address (10b), Input Selector, Output Selector, 4×12 array of LBs. LB: Logic Block]

[Figure: DRL logic block. Labels: Data (4b×2), Config (3b×2), Vertical Local Bus (4b×2), Horizontal Local Bus (4b×2), Memory, Global Bus, Switch, array of UCs, RC, BCs. Legend: UC: Unified Cell, RC: Reconfigurable Circuit, BC: Bus Connector]

Virtual Hardware WASMII on the DRL (Keio U. + NEC)

[Figure labels: WASMII chip, Execution block, Configuration Data line, Input Token Registers, Controller, Token router, Active Page, Page 1, Page 2, Page 3, Page 4]

WASMII operation I

[Figure labels: Configuration Data line, Input Token Registers]

WASMII operation II (Outside chip extension)

[Figure labels: Configuration Data line, Input Token Registers, Backup RAM, External Input Token Registers, WASMII Chip]

Layout of WASMII on DRL

[Figure labels: LB; Execution block, 32 LBs; Control block, 16 LBs; Dynamically reconfigured; Statically configured]

WASMII on the DRL

Small applications have been implemented: continuous system simulation and neural network emulation.

Almost the same speed as recent PCs.

The implementation is conservative because it is the first prototype. → Expected to improve drastically in the next version.

The limited number of contexts is an essential problem.

PipeRench Architecture (CMU)

[Figure labels: stripe, PE (repeated), Interconnection, Pass registers, Global buses]

Pipelined Reconfiguration

[Figure: a virtual pipeline (Stage 1 to Stage 5) mapped cycle by cycle onto a physical pipeline (Stage 1 to Stage 3)]
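The mapping of a longer virtual pipeline onto fewer physical stages can be sketched in Python as below. This is a simplified model under stated assumptions: one physical stage is configured per cycle in round-robin order, and a configured stage executes in the following cycles until it is overwritten. The function name and the tuple encoding are illustrative and do not describe the actual PipeRench controller.

```python
def pipelined_reconfiguration(num_virtual, num_physical, num_cycles):
    """Return, for each cycle, the contents of every physical stage.

    Each entry is None (empty) or a tuple (virtual_stage, state) where state
    is "config" in the cycle the stage is loaded and "exec" afterwards.
    """
    stages = [None] * num_physical
    schedule = []
    for cycle in range(num_cycles):
        # Stages loaded in earlier cycles are now executing.
        stages = [(s[0], "exec") if s else None for s in stages]
        target = cycle % num_physical          # physical stage to overwrite
        virtual = cycle % num_virtual + 1      # next virtual stage, 1-based
        stages[target] = (virtual, "config")
        schedule.append(list(stages))
    return schedule


schedule = pipelined_reconfiguration(num_virtual=5, num_physical=3, num_cycles=6)
for cycle, stages in enumerate(schedule, start=1):
    print(cycle, stages)
```

With 5 virtual and 3 physical stages, virtual stage 4 replaces virtual stage 1 in the fourth cycle, so the hardware behaves like a 5-stage pipeline while never holding more than 3 stages at once.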

Applications

No flexible program change

No IEEE standard floating point

Not memory bound

Image processing, analysis, pattern matching

Logic simulation, fault simulation

Neural network simulation

Encryption / decryption

Queuing models, Markov analysis

Electric power flow

Sensor processing

Efficient use of on-the-fly processing:

Communication control, protocol control

Software radio

Summary

A computing system different from stored program computers.

Not a complete replacement for stored program computers.

Advances in semiconductor technology directly enhance the performance.

Many problems and research subjects remain.

Historical flow of computer systems

ENIAC

EDVAC, EDSAC

IBM machines

RISC, Intel's microprocessors

Reconfigurable Machine
