
Reconfigurable Computing

Reconfigurable Architectures

Chapter 3.2

Prof. Dr.-Ing. Jürgen Teich

Lehrstuhl für Hardware-Software-Co-Design


Coarse-Grained Reconfigurable Devices


Recall:

1. Brief historical development (Estrin Fix-Plus and Rammig machine)

2. Programmable logic

   1. PALs and PLAs

   2. CPLDs

   3. FPGAs

      1. Technology

      2. Architecture by means of an example

         1. Actel

         2. Xilinx

         3. Altera

Once again: General purpose vs Special purpose

With LUTs as function generators, FPGAs can be seen as general-purpose devices.

Like any general purpose device, they are flexible but often inefficient.

Flexible because any n-variable Boolean function can be implemented using an n-input LUT.

Inefficient since complex functions must be implemented in many LUTs at different locations. The LUTs are connected through the routing matrix, which increases signal delays.

A LUT implementation is usually slower than direct wiring.
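The flexibility claim above (any n-variable Boolean function fits into an n-input LUT) can be illustrated with a small sketch. The `LUT` class and its names are illustrative assumptions, not part of any vendor tool: the truth table plays the role of the configuration bits, and evaluation is a plain table look-up.

```python
from itertools import product


class LUT:
    """n-input look-up table: one stored output bit per input combination."""

    def __init__(self, n_inputs, func):
        self.n_inputs = n_inputs
        # "Configuration": evaluate func once for each of the 2**n input
        # patterns and store the results as the truth table.
        self.table = [int(bool(func(*bits)))
                      for bits in product((0, 1), repeat=n_inputs)]

    def __call__(self, *bits):
        # Evaluation is a table look-up; the inputs form the address,
        # with the first input as the most significant bit.
        address = 0
        for b in bits:
            address = (address << 1) | (b & 1)
        return self.table[address]


# Two different 3-variable functions realised by the same structure.
xor3 = LUT(3, lambda a, b, c: a ^ b ^ c)
majority = LUT(3, lambda a, b, c: (a & b) | (a & c) | (b & c))
```

Only the stored table differs between `xor3` and `majority`, which is why the LUT is a general-purpose building block.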




Example: Implement the function F = ABD + ACD + ABC using 2-input LUTs.

LUTs are grouped in logic blocks (LBs), with two 2-input LUTs per LB.

Connections inside an LB are efficient (direct).

Connections between LBs are slow (they pass through the connection matrix).
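As a rough sketch of this example, the function can be decomposed into 2-input LUTs and checked against a direct evaluation. Two assumptions are made here: the function is read as F = ABD + ACD + ABC without complemented literals, and the pairing of LUTs into logic blocks is one possible choice, not necessarily the mapping on the slide.

```python
from itertools import product


def lut2(func):
    """Build a 2-input LUT as a 4-entry truth table indexed by (x << 1) | y."""
    return [func(x, y) for x, y in product((0, 1), repeat=2)]


AND2 = lut2(lambda x, y: x & y)
OR2 = lut2(lambda x, y: x | y)


def read(lut, x, y):
    """Look up the LUT output for inputs x and y."""
    return lut[(x << 1) | y]


def f_with_lut2(a, b, c, d):
    # Eight 2-input LUTs, i.e. four logic blocks at two LUTs per block
    # (assumed grouping).
    ab = read(AND2, a, b)
    abd = read(AND2, ab, d)       # LB 1: AB, then (AB)D
    cd = read(AND2, c, d)
    acd = read(AND2, a, cd)       # LB 2: CD, then A(CD)
    bc = read(AND2, b, c)
    abc = read(AND2, a, bc)       # LB 3: BC, then A(BC)
    partial = read(OR2, abd, acd)
    return read(OR2, partial, abc)  # LB 4: the two OR stages


def f_reference(a, b, c, d):
    return (a & b & d) | (a & c & d) | (a & b & c)
```

In this mapping the three product terms leave their logic blocks and reach the OR stages through the connection matrix, which is where the extra delay comes from.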


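Figure: F = ABD + ACD + ABC mapped onto 2-input LUTs. Intermediate terms are formed inside logic blocks, and the blocks are joined through the connection matrix to produce F.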


Idea: Implement frequently used blocks as hard-core modules in the device


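Figure: the function F with its product terms implemented as hard-core blocks. The inputs A, B, C, D and the output F are attached to the connection matrix.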

Coarse grained reconfigurable devices

Overcome the inefficiency of FPGAs by providing efficiently implemented coarse-grained functional units (adders, multipliers, integrators, etc.)

Advantage: Very efficient in terms of speed (no need for connections over connection matrices for basic operators)

Advantage: Direct wiring instead of LUT implementation

A coarse-grained device is usually an array of identical, programmable processing elements (PEs), each capable of executing a few operations such as addition and multiplication.

Depending on the manufacturer, the functional units communicate via buses or can be directly connected using programmable routing matrices.



Memory exists between and inside the PEs.

Other functional units may be present, depending on the manufacturer.

A PE is usually a tiny 8-bit, 16-bit or 32-bit ALU that can be configured to execute only one operation during a given period (until the next configuration).

Communication among the PEs can be either packet-oriented (on buses) or point-to-point (using crossbar switches).

Since each vendor has its own implementation approach, the devices are studied by means of a few examples: PACT XPP, Quicksilver ACM, NEC DRP, and TCPA.
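The PE behaviour described above (a small word-level ALU that holds one operation until the next configuration) can be sketched as follows. The class name, the operation set and the `configure`/`execute` methods are hypothetical and do not correspond to any of the devices listed.

```python
import operator


class ProcessingElement:
    """Word-level PE that holds exactly one operation until reconfigured."""

    # Illustrative operation set; real devices define their own.
    OPERATIONS = {
        "add": operator.add,
        "sub": operator.sub,
        "mul": operator.mul,
        "and": operator.and_,
    }

    def __init__(self, width=16):
        self.mask = (1 << width) - 1  # results wrap to the PE word width
        self.op = None

    def configure(self, op_name):
        # Reconfiguration replaces the single active operation.
        self.op = self.OPERATIONS[op_name]

    def execute(self, a, b):
        if self.op is None:
            raise RuntimeError("PE has not been configured")
        return self.op(a, b) & self.mask


pe = ProcessingElement(width=8)
pe.configure("add")
sum_result = pe.execute(200, 100)   # wraps around in 8 bits
pe.configure("mul")
product_result = pe.execute(20, 20)
```

Between two calls to `configure`, every `execute` uses the same operation, which mirrors the "one operation per configuration period" behaviour.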



The PACT XPP – Overall structure

XPP (Extreme Processing Platform) is a hierarchical structure consisting of:

An array of Processing Array Elements (PAE) grouped in clusters called Processing Arrays (PA)

Processing Array Cluster (PAC) = Processing Array (PA) + Configuration Manager (CM)

A hierarchical configuration tree

Local CMs manage the configuration at the PA level

The local CMs access the local configuration memory, while the supervisor CM (SCM) accesses external memory and supervises the whole configuration process on the device


A communication network

Memory elements beside the PACs

A set of I/Os

Source: V. Baumgarten et al., PACT XPP: A Self-Reconfigurable Data Processing Architecture, Journal of Supercomputing, 2003.

Source: PACT XPP Technologies

The PACT XPP – The Processing Array Elements

The PAE: Two types of PAE

The ALU PAE

The RAM PAE

The ALU PAE:

Contains an ALU which can be configured to perform basic operations

Back-register (BREG) provides routing channels for data and events from bottom to top

Forward Register (FREG) provides routing channels from top to bottom

DataFlow Register (DF-REG) can be used at the object outputs for buffering data

Input registers can be preloaded with configuration data

Figure: The ALU PAE. Source: PACT XPP Technologies, XPP-III Processor Overview, 2012.

The RAM PAE:

1. Differs from the ALU PAE only in its function: instead of an ALU, a RAM PAE contains a dual-ported RAM

2. Useful for data storage

3. Data is written or read after an address has been read at the RAM inputs

4. BREG, FREG, and DF-REG of the RAM PAE have the same function as in the ALU PAE

Figure: The RAM PAE. Source: PACT XPP Technologies, XPP-III Processor Overview, 2012.

The PACT XPP - Routing

Routing in PACT XPP: two independent networks

One for data transmission

The other for event transmission

A configuration bus exists alongside the data and event networks (very little information is available about the configuration bus)

All objects can be connected to horizontal routing channels using switch objects

Vertical routing channels are provided by the BREG and FREG

BREGs route from bottom to top

FREGs route from top to bottom

Figure: horizontal and vertical routing channels. Source: PACT XPP Technologies, XPP-III Processor Overview, white paper, 2012.

The PACT XPP - Interface

Interfaces are available inside the chip

The number and type of interfaces vary from device to device

On the XPP42-A1:

6 internal interfaces consisting of:

4 identical general-purpose on-chip I/O interfaces (bottom left, upper left, upper right, and bottom right)

One configuration manager

One JTAG (Joint Test Action Group, IEEE Standard 1149.1) boundary-scan interface for testing purposes (not shown in the picture)


The I/O interfaces can operate independently of each other. Two operation modes are available:

The RAM mode

The streaming mode

RAM mode:

Each port can access external static RAM (SRAM).

Control signals for the SRAM transactions are available.

No additional logic is required.


Streaming mode:

1. For high-speed streaming of data to and from the device

2. Each I/O element provides two bidirectional ports for data streaming

3. Handshake signals are used to synchronize data packets with the external port


The Quicksilver ACM - Architecture

Structure: fractal-like structure

Hierarchical groups of four nodes with full communication among the nodes

4 lower-level nodes are grouped in a higher-level node

The lowest level consists of 4 heterogeneous processing nodes

The connection is done in a Matrix Interconnect Network (MIN)

A system controller

Various I/O

Source: B. Plunkett et al., Adapt2400 ACM architecture overview, QuickSilver Technology, Inc., 2004.

The Quicksilver ACM – The processing node

An ACM processing node consists of:

An algorithmic engine. It is unique to each node type and defines the operation performed by the node.

The node memory for data storage at the node level.

A node wrapper, which is common to all nodes. It is used to hide the complexity of the heterogeneous architecture.




Four types of nodes exist:

The Programmable Scalar Node (PSN) provides a standard 32-bit RISC architecture with 32-bit general-purpose registers

The Adaptive Execution Node (AXN) provides variable-size MAC and ALU operations

The Domain Bit Manipulation (DBM) node provides bit manipulation and byte-oriented operations

The External Memory Controller node provides DDR RAM, SRAM, memory random access, and DMA control interfaces

Figures: ACM PSN-Node, ACM DBM-Node, ACM AXN-Node. Source: B. Plunkett et al., Adapt2400 ACM architecture overview, QuickSilver Technology, Inc., 2004.


The node wrapper envelops the algorithmic engine and presents an identical interface to neighbouring nodes. It features:

1. A MIN interface to support the communication among nodes via the MIN

2. A hardware task manager for task management at the node level

3. A DMA engine

4. Dedicated I/O circuitry

5. Memory controllers

6. Data distributors and aggregators

Figure: The ACM node wrapper. Source: B. Plunkett et al., Adapt2400 ACM architecture overview, QuickSilver Technology, Inc., 2004.


The Quicksilver ACM - The MIN

The Matrix Interconnect Network (MIN) is the communication medium in an ACM chip

1. Hierarchically organized. The MIN at a given level connects many lower-level MINs

2. The MIN root is used for:

   1. Off-chip communication

   2. Configuration

3. Supports the communication among nodes

4. Provides services such as point-to-point dataflow streaming, real-time broadcasting, and DMA
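The hierarchical grouping of four can be illustrated with a small sketch that finds the lowest MIN level shared by two nodes. The node numbering (consecutive integers, with each group of four sharing a parent) and both function names are assumptions made for illustration. They are not taken from the Adapt2400 documentation.

```python
def nodes_at_level(levels, fanout=4):
    """Number of processing nodes under a hierarchy that is `levels` deep."""
    return fanout ** levels


def lowest_common_min(node_a, node_b, fanout=4):
    """Level of the lowest MIN that contains both nodes.

    Level 1 means the two nodes sit in the same group of four,
    level 2 means their groups share a parent MIN, and so on.
    Level 0 is returned when both arguments name the same node.
    """
    level = 0
    while node_a != node_b:
        # Moving one level up merges every `fanout` siblings into one parent.
        node_a //= fanout
        node_b //= fanout
        level += 1
    return level


same_group = lowest_common_min(0, 3)
neighbour_groups = lowest_common_min(0, 4)
far_apart = lowest_common_min(5, 63)
```

Under this numbering, traffic between nodes 0 and 3 stays inside one group, while traffic between nodes 5 and 63 has to climb three levels.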


Figure: Example of ACM chip configuration



The Quicksilver ACM - The System Controller

The system controller is in charge of system management:

Loads tasks into the node ready-to-run queue for execution

Statically or dynamically sets the communication channels between the processing nodes

Carries out the reconfiguration of nodes on a clock-cycle-by-clock-cycle basis

The ACM chip features a set of I/O interface controllers such as:

PCI

PLL

SDRAM and SRAM

Figures: The system controller; the interface controllers. Source: B. Plunkett et al., Adapt2400 ACM architecture overview, QuickSilver Technology, Inc., 2004.


The NEC DRP – Architecture

The NEC Dynamically Reconfigurable Processor (DRP) consists of:

A set of byte-oriented processing elements (PE)

A programmable interconnection network for communication among the PEs

A sequencer, which can be programmed as a finite state machine (FSM) to control the reconfiguration process

Memory around the device for storing configuration and computation data

Various interfaces

Source: C. Bobda, Introduction to Reconfigurable Computing, Springer, 2007. Original image adapted from M. Motomura: A dynamically reconfigurable processor architecture, Microprocessor Forum, 2002.

The NEC DRP - The Processing Element

ALU: ordinary byte arithmetic/logic operations

DMU (data management unit): handles byte select, shift, mask, constant generation, etc., as well as bit manipulations

An instruction dictates ALU/DMU operations and inter-PE connections

Source/destination operands can be either from/to

its own register file

other PEs (i.e., flow through)

The instruction pointer (IP) is provided by the STC (state transition controller)

Adapted from: M. Suzuki et al., Stream Applications on the Dynamically Reconfigurable Processor, International Conference on Field-Programmable Technology, IEEE, 2004.

The NEC DRP – Reconfiguration Process

The instruction pointer (IP) from the STC identifies a datapath plane

Spatial computation uses a customized datapath plane

When the IP changes, the datapath plane switches instantaneously

Taken together, the PE instructions behave like an extreme VLIW

Sequencing through instructions => dynamic reconfiguration
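The mechanism above can be sketched as a state machine that broadcasts an instruction pointer to all PEs, each of which looks up its own operation for that pointer value. The classes, the two-instruction memories and the operand encoding are hypothetical simplifications and do not reproduce the DRP instruction format.

```python
import operator


class PE:
    """Byte-oriented PE with a small local instruction memory."""

    def __init__(self, instructions):
        # Maps an IP value to (operation, indices of the two operands).
        self.instructions = instructions

    def step(self, ip, inputs):
        op, (i, j) = self.instructions[ip]
        return op(inputs[i], inputs[j]) & 0xFF  # keep results to one byte


class StateTransitionController:
    """FSM whose current state is broadcast as the instruction pointer."""

    def __init__(self, transitions, start=0):
        self.transitions = transitions
        self.ip = start

    def advance(self):
        self.ip = self.transitions[self.ip]
        return self.ip


def run(pes, stc, inputs, cycles):
    """Record (IP, PE outputs) per cycle; each IP value selects one plane."""
    trace = []
    for _ in range(cycles):
        trace.append((stc.ip, [pe.step(stc.ip, inputs) for pe in pes]))
        stc.advance()
    return trace


pes = [
    PE({0: (operator.add, (0, 1)), 1: (operator.sub, (0, 1))}),
    PE({0: (operator.and_, (0, 1)), 1: (operator.xor, (0, 1))}),
]
stc = StateTransitionController({0: 1, 1: 0})
trace = run(pes, stc, inputs=[12, 10], cycles=3)
```

The set of operations selected by one IP value across all PEs corresponds to one datapath plane. Changing the IP swaps the whole plane at once, which is why the collection of PE instructions resembles a very wide VLIW word.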


Figure: multiple datapath planes (AES, 3DES, MD5, SHA-1, Compress) between Data In and Data Out. Control performs task selection by descriptor, and dynamic reconfiguration switches between the planes.


Figure: PE array with PEs configured as Add, Sel and Cmp operators. Each PE holds instructions 0, 1, 2 for its ALU and DMU, and IP = "1" selects the active instruction. Steps shown:

1. Identify the instruction to be executed

2. Decode the instruction in the ALU plane

3. Configure the ALU plane according to the instruction

Tightly Coupled Processor Arrays (TCPA)

• Processor elements (PEs) with VLIW (very long instruction word) architecture

• Weakly programmable

– Small local instruction memory

– Limited, parametrizable instruction set focused on digital signal processing

• Data-flow-oriented control path, no global address space, data streaming over the processing field

• Regular interconnect network

• Application areas: digital signal processing, e.g., mobile communication, HDTV, multimedia, ...

Source: D. Kissler et al., A Dynamically Reconfigurable Weakly Programmable Processor Array Architecture Template, International Workshop on Reconfigurable Communication Centric System-on-Chips (ReCoSoC), 2006.

TCPA – Interconnect Network

• Basic structure: grid

• Dynamically reconfigurable

• By using a bypass, more than one hop is possible in a single clock cycle

• The interconnect wrapper is responsible for switching


Adapted from: D. Kissler et al., A Dynamically Reconfigurable Weakly Programmable Processor Array Architecture Template, International Workshop on Reconfigurable Communication Centric System-on-Chips (ReCoSoC), 2006.

TCPA – Network Example – 4D Hypercube


TCPA – Network Example – 2D Torus
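The two example topologies named in the slide titles, a 4D hypercube and a 2D torus, can be described by their neighbour relations. The sketch below uses the standard textbook definitions. The node numbering and function names are assumptions and say nothing about how the TCPA interconnect wrapper is actually programmed.

```python
def hypercube_neighbors(node, dimensions=4):
    """Neighbours differ from `node` in exactly one address bit."""
    return sorted(node ^ (1 << bit) for bit in range(dimensions))


def torus_neighbors(row, col, rows, cols):
    """Up, down, left and right neighbours on a grid whose edges wrap around."""
    return [
        ((row - 1) % rows, col),
        ((row + 1) % rows, col),
        (row, (col - 1) % cols),
        (row, (col + 1) % cols),
    ]


corner_of_cube = hypercube_neighbors(0)
corner_of_torus = torus_neighbors(0, 0, rows=4, cols=4)
```

A 4D hypercube has 16 nodes with four links each. A 4 x 4 torus also gives every PE four neighbours, because the wrap-around links remove the grid boundary.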



TCPA – Dynamic Reconfiguration

• Multicast scheme for partial dynamic reconfiguration

• Differential reconfiguration (program/connections) is also possible


Source: D. Kissler, Power-Efficient Tightly-Coupled Processor Arrays for Digital Signal Processing, PhD Dissertation, 2012.

24 Core TCPA – Lehrstuhl für Informatik 12

• 24 × 16-bit cores

• Technology

– CMOS, 1.0 V

– 9 metal layers

– 90 nm standard cell layout

• FUs per PE: 2× Add, 2× Mul, 1× Shift, 1× DPU

• Registers per PE: 15

• Instruction memory: 1024 × 32 bit = 4 kB

• Clock frequency: 200 MHz

• Peak performance: 24 GOPS

• Power consumption: 133 mW @ 200 MHz (hybrid clock gating)

• Power efficiency: 180 MOPS/mW
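The figures in this list can be cross-checked with a few lines of arithmetic. One assumption is needed: the peak of 24 GOPS corresponds to five operations per PE and cycle, which the slide does not state explicitly (six functional units are listed per PE).

```python
cores = 24
clock_mhz = 200
ops_per_pe_per_cycle = 5      # assumption: chosen so the peak matches 24 GOPS
imem_words = 1024
imem_word_bits = 32
power_mw = 133

# Instruction memory: 1024 words of 32 bits each, expressed in bytes.
imem_bytes = imem_words * imem_word_bits // 8

# Peak throughput in MOPS: cores x operations per cycle x clock in MHz.
peak_mops = cores * ops_per_pe_per_cycle * clock_mhz

# Power efficiency in MOPS per mW.
efficiency = peak_mops / power_mw

print(imem_bytes, peak_mops, round(efficiency, 1))
```

Under that assumption, the instruction memory is 4096 bytes, the peak is 24000 MOPS (24 GOPS), and the efficiency is about 180 MOPS/mW, which agrees with the values in the list.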
