reconfigurable architectures

49
Reconfigurable Architectures AMANO, Hideharu hunga am ics keio ac j p

Upload: ankti

Post on 11-Jan-2016

43 views

Category:

Documents


2 download

DESCRIPTION

Reconfigurable Architectures. AMANO, Hideharu hunga @ am . ics . keio . ac . jp. Reconfigurable System ( Custom Computing Machine ). A target algorithm is executed directly with a hardware on SRAM-style FPGA/PLDs. High performance of special purpose machines. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Reconfigurable Architectures

Reconfigurable  Architectures

AMANO, Hideharu

hunga @ am . ics . keio . ac . jp

Page 2: Reconfigurable Architectures

Reconfigurable   System( Custom   Computing  Machine ) A target algorithm is executed directly with

a hardware on SRAM-style FPGA/PLDs. High performance of special purpose machine

s. High degree of flexibility of general purpose ma

chines. A completely different execution mechanis

m from a stored program computers.

Page 3: Reconfigurable Architectures

Recent FPGA/PLDs

More than 1000K Gates (It is difficult to use efficiently.)

The operational frequency is 30MHz – 60MHz.

A large internal data RAMs.

Simple Gate Arrays are replaced with FPGA/PLDs.

Page 4: Reconfigurable Architectures

Switch

Logic   Block

I/O

SRAM FPGA (Xilinx’s)(Field   Programmable   Gate   Array)

5-inputs

2   F . F .

Look   Up   Table

Switch

Configuration   Memory

Page 5: Reconfigurable Architectures

SRAM(Configuration   Memory )

Switch

Logic   BlockI/O

SRAM CPLD(Complex   Programmable   Logic   Device)

Page 6: Reconfigurable Architectures

Reconfigurable   Systems

Stand alone type Implemented on boards or cabinet. Splash  1・2,  RM-I,II,III,IV , RASH ( Mit

subishi ) , ATTRACTOR ( NTT ) ,  FLEMING

Co-processor type Improve performance of general purpose process

ors. PRISM   I,II,  DISC II , Garp, CHEMAE

RA, Chameleon, PipeRench

Page 7: Reconfigurable Architectures

Reconfigurable   Systems

1990 The 1st FPLThe 1st Japanese  FPGA/PLD  Conf.

1992

1993 The 1st FCCM

2000

1995

SPLASH

SPLASH-2RM-IRM-II

RM-IVRM-III

RM-V

WASMII

PRISM-IPRISM-II

DISCDISC-II

DRL

FIPSOCHOSMII

Cont . Switch .FPGA PCA

RASH

YARDS

ATTRACTOR

PipeRench

ChameleonCHIMERA

Stand  Alone

Co-processor

New  Device

MPLD

Cache  LogicMult . Context

  FPGA

Page 8: Reconfigurable Architectures

Splash-2   (Arnold et.al 92)

米国計算機科学センター

String matching, Image processing, DNA matching, 330 times faster than the supercomputer Cray-II.

Systolic algorithm VHDL,   Parallel C Annapolis   Micro  

Systems ( WILDFIRE)

Page 9: Reconfigurable Architectures

RM-IV (Kobe Univ.)

FPGAmem .

FPGAmem . FPGAmem .

FPGAmem .

FPGAmem .

FPGAmem . FPGAmem .

FPGAmem .

FPGA mem .

FPGA mem .FPGA mem .

FPGA mem .

FPGA mem .

FPGA mem .FPGA mem .

FPGA mem .

FPIC

Interface

Page 10: Reconfigurable Architectures

&p 10

RASH(Mitsubishi)

Display

RASH unit

CompactPCI bus

diskCD

EXE-ボード

CPUボード

EthernetLAN

This slide is supported by Dr.Nakajima of Mitsubishi.

1Unit: 6 EXE boards CPU boards (Pentium)

Multiple Units can be connected

Page 11: Reconfigurable Architectures

&p 11

EXE boards of RASH

FPGA   Altera   FLEX10K100A   (62K-158KGate)

EXE-board controller

FPGA FPGA FPGA FPGA

FPGA FPGA FPGA FPGA

PCI Local-bus

PCI-bus

PCI-bus I/F

Clocks / Cont. signals

SRAM( 2MB )

Local-bus

Mesh links and buses

2 clock lines

PCI bus I/F

A large SRAM

DRAM daughter board

Page 12: Reconfigurable Architectures

ATTRACTOR ( NTT )

RISC

FPGA

RISC

FPGA

RISC

ATMI/O

RISC

ATMSW

RISC

Buffer

RISC

RAM( LUT)

MPU

Mem.

Compact  PCI

Ethernet

High speed serial link ( 1Gbps )

Board level reconfiguration

Using various boardsSpecialized for ATM communication

Page 13: Reconfigurable Architectures

Co-processor type

Tightly coupled with core CPU A part of program is selected and executed. Recently, on-chip implementation with the cor

e CPU and reconfigurable part is possible. Tightly coupling co-processor

NAPA,   Garp,   Chameleon,   CHIMAERA,   PipeRench

Page 14: Reconfigurable Architectures

PRISM   II ( Brown Univ. )

Am2955CPU

Boot   ROM

Burst   ModeMemory   Controller

FPGA Module

DRAM

DRAM

S witch

FPGA ModuleFPGA Module

AddressControl

Data

A program core is executed.

Frontier of co-processors.

Page 15: Reconfigurable Architectures

Garp (Hauser97)

Proposed in UCB MIPS Core and Reconfigura

ble Array share a cache system.

Loop is extracted with a compiler, and converted to hardware.

Image processing, 43 times faster than Ultrasparc

MIPS

Crossbar

Q Q QCache

Memory queue

32bit buses x 5

Reconfigurable Array

Page 16: Reconfigurable Architectures

DISC (Wirthlin et al. 95)

Brigham Young Univ. A general purpose processor

using partial reconfigurable chip.

Custom instructions can be attached. Each module can be designed

by the user. Function called by C-language.

FPGA 3

ProcessorCore

FPGA 1

Bus I/FConfiguration

Controller

FPGA 2

CustomInstruction

Space

SystemMemory

Host P/C

Page 17: Reconfigurable Architectures

CHIMAERA (Ye et.al. 2000)

Northwestern Univ. A reconfigurable array is i

nserted in the datapath of a super-scalar machine.

9 registers can be read in parallel from the shadow-register.

Out of Order control 10 ~ 20% performance i

mprovement

ReconfigurableArrayuP Core

Registerfile

Shadow registers

Controller

Page 18: Reconfigurable Architectures

Chameleon ( Chameleon Co. )  Field   Programmable   System   Level  

Integrated   Circuits   (FPSLICs) Coarse grain Reconfigurable   Processing   Fa

bric 、 RISC   Core 、 PCI   Controller 、 Memory   Controller 、 DMA   Controller and SRAM are implemented on a single chip.

In Signal processing, Communication protocol processing, It is 5-10 times faster than high speed DSPs.

Page 19: Reconfigurable Architectures

Chameleon CS2112

ReconfigurableProcessing

Fabric

128-bit RoadRunner Bus

PCI Cont. RISC CoreMemory

Controller

DMASubsystem

ConfigurationSubsystem

160-pin Programmable I/O

32-bit PCI Bus 64-bit Memory Bus

Page 20: Reconfigurable Architectures

Reconfigurable Processing Fabric in Chameleon

DPU

CTLLM

Tile  0

Slice  0

DPU

CTLLM

Tile  0

Slice  3

108 DPU(Data   Path   Unit)s consists 4 Slices ( 3Tiles each )

1Tile:   9DPU = 32bit ALU X 7 16bit + 16bit multiplier   X  2

8 instructions stored in the CTL are executed in the DPU.

The CTL can select the next instruction in the same cycle.

Configuration can be changed by loading a bit  stream.

Page 21: Reconfigurable Architectures

DPU

Instruction

Rou

ting

MU

XR

ou

ting

MU

X

Register&

Mask

Register&

Mask

OP Register

RegisterBarrelShifter

OP : Operations in C or Verilog

SIMD arrays and pipelines are formed with multiple DPUs.

Page 22: Reconfigurable Architectures

Problems on Reconfigurable Systems Calculators with SRAM type FPGAs are 10 ti

mes slower than ASIC calculators and requires 10 times wide area.

Weak connection between memory modules. No standard method for generating a efficient

hardware. The size limitation problem.

Page 23: Reconfigurable Architectures

Toward solving problems (1)

Speed and area problem compared with dedicated calculators. The disadvantage is reduced using a novel

process. Coarse grain FPGA Implementation with the CPU

Weak connection between memory modules Connection with a large scale integrated SRAM DRAM integration

Page 24: Reconfigurable Architectures

DRAM integrated FPGA ( NEC)

256   X   256DRAMModule

Word   Driver(128)

Sen

se 

Am

p.

(128

)

Sen

se 

Am

p.

(128

)

LogicElement

LogicElement

LogicElement

LogicElement

Word   Driver(128)

Page 25: Reconfigurable Architectures

FPAccA (Hiroshima City Univ.)

Routing Matrix

ALU

Array   of   floating

ALU(Add/Mult )

model2(0. 35um)

12x 25MFLOPS 

Page 26: Reconfigurable Architectures

Toward solving problems (2)

Algorithm conversion problem Co-processing between integrated CPU High-level synthesis techniques Data-driven execution Systolic algorithm

Size limitation problem Partial   Reconfiguration Multi-context   FPGA Virtual hardware

Page 27: Reconfigurable Architectures

Systolic algorithm

Data x

Data y

Computational array

A data stream x, y are inserted with a specific interval intoa special computational array.Suitable for reconfigurable computing.

Page 28: Reconfigurable Architectures

Band matrix multiply   y=Ax

a

x

a11 a12 0 0

a21 a22 a23 0

0 a32 a33 a34

0 0 a43 a44

X+

yiyo

yo= a x + y i

y0

y1

y2

y3

x0

x1

x2

x3

=

Page 29: Reconfigurable Architectures

Band matrix multiply   y=Ax

X +

a11

x1

a12 a21

a22

a23 a32

a11 a12 0 0

a21 a22 a23 0

0 a32 a33 a34

0 0 a43 a44

Page 30: Reconfigurable Architectures

Band matrix multiply   y=Ax

X +X +x1

a12 a21

a22

a23 a32

a11 a12 0 0

a21 a22 a23 0

0 a32 a33 a34

0 0 a43 a44

a33

y1=a11x1

x2

Page 31: Reconfigurable Architectures

Band matrix multiply   y=Ax

X +x1

a34 a43

a22

a23 a32

a11 a12 0 0

a21 a22 a23 0

0 a32 a33 a34

0 0 a43 a44a33

y2=a21 x1

x2x3

y1=a11 x1+ a12 x2

Page 32: Reconfigurable Architectures

Band matrix multiply   y=Ax

X +X +

a34 a43

a44

a23 a32

a11 a12 0 0

a21 a22 a23 0

0 a32 a33 a34

0 0 a43 a44

a33

y2=a21 x1+ a22 x2

x2x3

Page 33: Reconfigurable Architectures

Band matrix multiply   y=Ax

X +

a34 a43

a44

a11 a12 0 0

a21 a22 a23 0

0 a32 a33 a34

0 0 a43 a44

a33y2=a21 x1+ a22 x2+ a23 x3

x2x3

y3= a32 x2

Page 34: Reconfigurable Architectures

Data flow algorithm

a bc

d e

(a+b)x(c+(dxe))

Page 35: Reconfigurable Architectures

PlasticPart

7

0 7

Built-inPart

16bits

16word x 1bitmemory ( LUT)

output: 1bit

input: 1bit

control: 6bitsdata: 8bits

16bits

14bits

14bits

Hardware controller which executes12 instructions

Logic/Memory

PCA( Plastic   Cell  Architecture )

Unit cells are connected in a mesh structure

Variable

Page 36: Reconfigurable Architectures

PCA(Plastic Cell Architecture)

PlasticPart

Built-in Part

ConfigurationPath

Communication Path

.….

Self reconfigurationAsynchronous communication

Page 37: Reconfigurable Architectures

Multi-context FPGA

Mul

tipl

exer

SRAM slots

n

Logic cells

1

2

Input data

Output data

Logic cellsLogic cellsContext

Configuration   RAM can be changeable.Fujitsu’s MPLD(1990)Fujitsu’s MPLD(1990) 、、 WASMII(1992)WASMII(1992) 、、 Xilinx(1997)Xilinx(1997)

NEC’s DRL(1999)NEC’s DRL(1999)

Page 38: Reconfigurable Architectures

Dynamic Reconfigurable Logic Multi-context(8 context) and partial reconfiguration.

4×12 (Logic Block, LB) Interface logics

Logic Block 4×4 (Unified Cell, UC) Reconfiguration Controller (RC) Bus Connector (BC)

Page 39: Reconfigurable Architectures

Conf

igur

atio

n/Da

ta In

put(7

9b)

Input Select

External Config. Control(4b)

Address Decoder

Config. Store Control(2b)

Reset CLK Output Select

Conf

igur

atio

n/Da

ta O

utpu

t(79b

)

Configuration  Store  

Address(10b)LB: 論理ブロック

DRLIn

put S

elec

tor

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

LB

Out

put S

elec

tor

LB

Data Config Vertical Local Bus(4b×2) (3b×2) (4b×2)

Horizontal Local Bus

(4b×2)

Memory

Data

Config(3b×2)

(4b×2)

(4b×2)

UC : Unified Cell

RC : Reconfiguable Circuit

BC:Bus Connecter

RCUC

UC

UC

UC

UC

UC

UC

UC

UC UC

UC UC

UC UC

UC UC

Global Bus

Switch

BC

BC

BC

BC

BC BCBC BC

RC

BC

BC

BC

BC

BC BCBC BC

UC

UC

UC

UC

UC

UC

UC

UC

UC UC

UC UC

UC UC

UC UC

Page 40: Reconfigurable Architectures

WASMII chip

Execution block

Virtual Hardware WASMII on the DRL( Keio U. +NEC)

ConfigurationData line

Input TokenRegisters

Controller

Token router

Input TokenRegisters

Active Page

Page 1

Page 2Page 3

Page 4

Page 41: Reconfigurable Architectures

WASMII operation I

ConfigurationData line

Input TokenRegisters

Page 42: Reconfigurable Architectures

WASMII operation II(Outside chip extension)

ConfigurationData line

Input TokenRegisters

Backup RAM

External InputToken Registers

WASMII Chip

Page 43: Reconfigurable Architectures

LBLayout of WASMII on DRL

Execution block

32 LBs

Control block

16 LBs

Dynamicallyreconfigured

Staticallyconfigured

Page 44: Reconfigurable Architectures

WASMII on the DRL

Small applications have been implemented. Continuous System Simulation Neural Network Emulation

Almost the same speed as recent PCs Conservative implementation because of the first

prototype. → Drastically improved in the next version The limitation of the context is an essential problem.

Page 45: Reconfigurable Architectures

PipeRench Architecture ( CMU )

Interconnection

Interconnection

・ ・ ・

・ ・ ・

stripe

PE

Pass registers

Global buses

PE PE PE

PEPEPEPE

Page 46: Reconfigurable Architectures

Pipelined Reconfiguration

Stage 1

Stage 2

Stage 3

Stage 4

Stage 5

Cycle:

Virtual pipeline

Stage 1

Stage 2

Stage 3

Cycle:

Physical pipeline

1

1

1

2

2

1

2

3

3

1

2

3

4

4

4

2

3

5

5

4

5

3

6

6

4

5

6

Page 47: Reconfigurable Architectures

Applications No flexible program change No IEEE standard floating point Not memory bounded

Image processing, analysis, pattern matching, Logic simulation, Fault simulation. Neural network simulation. Encryption /Decryption Queuing   Model 、 Markov Analysis Electric Power Flow Censer processing

Efficient use of on the fly processing. Communication control 、 Protocol control Software radio

Page 48: Reconfigurable Architectures

Summary

Another computing system than stored program computers.

Not a perfect replace of stored program type computers.

Advance of the semiconductor techniques directly enhance the performance.

A lot of problems and subjects to research.

Page 49: Reconfigurable Architectures

Historical flow of computer systems

ENIAC

EDVAC 、 EDSAC

IBM machines

RISC, Intel’s microprocessorsReconfigurableMachine