Reconfigurable Computing
Reconfigurable Architectures
Chapter 3.2
Prof. Dr.-Ing. Jürgen Teich
Lehrstuhl für Hardware-Software-Co-Design
Coarse-Grained Reconfigurable Devices
Recall:
1. Brief historical development (Estrin Fix-Plus and Rammig machine)
2. Programmable logic
   1. PALs and PLAs
   2. CPLDs
   3. FPGAs
      1. Technology
      2. Architecture by means of an example
         1. Actel
         2. Xilinx
         3. Altera
Once again: General purpose vs Special purpose
With LUTs as function generators, FPGAs can be seen as general-purpose devices.
Like any general-purpose device, they are flexible but often inefficient.
Flexible because any n-variable Boolean function can be implemented using an n-input LUT.
Inefficient because complex functions must be spread over many LUTs at different locations. The LUTs are connected through the routing matrix, which increases signal delays.
A LUT implementation is usually slower than direct wiring.
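The flexibility argument can be illustrated with a minimal sketch that models an n-input LUT as a stored truth table. The helper names `make_lut` and `lut_read` and the majority function are assumptions chosen for illustration; they are not taken from any vendor tool.

```python
from itertools import product


def make_lut(func, n):
    """Precompute the 2**n-entry truth table of an n-input Boolean function."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=n)]


def lut_read(table, *inputs):
    """Evaluate the LUT: the input bits form the address into the table."""
    address = 0
    for bit in inputs:
        address = (address << 1) | bit  # first input is the most significant bit
    return table[address]


# Hypothetical 3-variable function (majority vote), used only as an example.
majority = make_lut(lambda a, b, c: (a & b) | (a & c) | (b & c), 3)
print(majority)                     # stored configuration bits
print(lut_read(majority, 1, 0, 1))  # lookup instead of computing gates
```

Changing the function only changes the stored bits, not the structure, which is why the LUT is general purpose. The price is a table lookup even for functions that a single wire or gate could realize.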
Once again: General purpose vs Special purpose
Example: Implement the function F using 2-input LUTs.
LUTs are grouped in logic blocks (LBs), with two 2-input LUTs per LB.
Connections inside an LB are efficient (direct).
Connections between LBs are slow (they go through the connection matrix).
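[Figure: F, a sum of three product terms over the inputs A, B, C, D (ABD, ACD, ABC; complement bars not legible), built from 2-input LUTs whose outputs are combined through the connection matrix.]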
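A minimal sketch of the cost of spreading one function over several 2-input LUTs. The function G = (A AND B AND D) OR (B AND C), the netlist layout, and all names below are hypothetical and chosen for illustration; they are not the function from the slide.

```python
from dataclasses import dataclass

AND = (0, 0, 0, 1)  # truth tables indexed by (a << 1) | b
OR = (0, 1, 1, 1)


@dataclass(frozen=True)
class Lut2:
    name: str
    table: tuple
    block: int  # index of the logic block (LB) that holds this LUT

    def __call__(self, a, b):
        return self.table[(a << 1) | b]


# Two LUTs per LB: LB0 builds A&B&D, LB1 builds B&C and the final OR.
NETLIST = [
    (Lut2("t0", AND, block=0), "A", "B"),
    (Lut2("t1", AND, block=0), "t0", "D"),   # t0 -> t1 stays inside LB0
    (Lut2("t2", AND, block=1), "B", "C"),
    (Lut2("g", OR, block=1), "t1", "t2"),    # t1 must cross the matrix
]


def evaluate(netlist, inputs):
    """Evaluate the LUTs in order; each LUT reads inputs or earlier LUTs."""
    values = dict(inputs)
    for lut, a, b in netlist:
        values[lut.name] = lut(values[a], values[b])
    return values


def matrix_crossings(netlist):
    """Count LUT-to-LUT connections that leave their logic block."""
    block_of = {lut.name: lut.block for lut, _, _ in netlist}
    return sum(
        1
        for lut, a, b in netlist
        for src in (a, b)
        if src in block_of and block_of[src] != lut.block
    )


print(evaluate(NETLIST, {"A": 1, "B": 1, "C": 0, "D": 1})["g"])
print(matrix_crossings(NETLIST))
```

Even this small function needs four LUTs in two LBs and one slow inter-block connection; larger functions accumulate more such crossings.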
Once again: General purpose vs Special purpose
Idea: Implement frequently used blocks as hard-core modules in the device
[Figure: the function F realized with hard-core modules instead of LUTs; inputs A, B, C, D; output F; connection matrix.]
Coarse grained reconfigurable devices
Overcome the inefficiency of FPGAs by providing efficiently implemented coarse-grained functional units (adders, multipliers, integrators, etc.).
Advantage: Very efficient in terms of speed (basic operators need no connections over connection matrices).
Advantage: Direct wiring instead of LUT implementation.
A coarse-grained device is usually an array of identical programmable processing elements (PEs), each capable of executing a few operations such as addition and multiplication.
Depending on the manufacturer, the functional units communicate via buses or can be directly connected using programmable routing matrices.
Coarse grained reconfigurable devices
Memory exists between and inside the PEs.
Other functional units are available, depending on the manufacturer.
A PE is usually a tiny 8-bit, 16-bit or 32-bit ALU that is configured to execute a single operation for a given period (until the next configuration).
Communication among the PEs can be either packet oriented (on buses) or point-to-point (using crossbar switches).
Since each vendor has its own implementation approach, the study is done by means of a few examples. Considered are: PACT XPP, Quicksilver ACM, NEC DRP, TCPA.
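The notion of a PE that executes exactly one operation until it is reconfigured can be sketched as follows. The class, its operation set, and the wrap-around behavior are illustrative assumptions, not the design of any of the devices named above.

```python
import operator


class ProcessingElement:
    """Word-level ALU that performs one configured operation at a time."""

    OPERATIONS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

    def __init__(self, width=16):
        self.mask = (1 << width) - 1  # results wrap to the PE word width
        self.op = None

    def configure(self, op_name):
        """Reconfiguration: select the single operation the PE will execute."""
        self.op = self.OPERATIONS[op_name]

    def execute(self, a, b):
        if self.op is None:
            raise RuntimeError("PE has not been configured")
        return self.op(a, b) & self.mask


pe = ProcessingElement(width=8)
pe.configure("add")
print(pe.execute(200, 100))  # wraps around in 8 bits
pe.configure("mul")          # next configuration, same hardware
print(pe.execute(7, 6))
```

In contrast to a LUT, the configuration here is a handful of bits selecting a prebuilt operator, so reconfiguration data is small and the operator itself is hard-wired.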
The PACT XPP – Overall structure
XPP (Extreme Processing Platform) is a hierarchical structure consisting of:
An array of Processing Array Elements (PAEs) grouped in clusters, each called a Processing Array (PA)
PAC (Processing Array Cluster) = Processing Array (PA) + Configuration Manager (CM)
A hierarchical configuration tree
Local CMs manage the configuration at the PA level
The local CMs access the local configuration memory, while the supervisor CM (SCM) accesses external memory and supervises the whole configuration process on the device
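The split of responsibilities in the configuration tree can be sketched as below. The class names, the dictionary-based memories, and in particular the assumption that a local CM keeps a fetched configuration in its local memory for reuse are illustrative; the slides do not specify the protocol.

```python
class LocalCM:
    """Configuration manager of one PAC; only touches its local memory."""

    def __init__(self, name):
        self.name = name
        self.local_memory = {}  # configuration id -> configuration words
        self.loaded = []        # history of configurations applied to the PA

    def configure(self, config_id, fetch):
        if config_id not in self.local_memory:       # miss: ask the supervisor
            self.local_memory[config_id] = fetch(config_id)
        self.loaded.append(config_id)
        return self.local_memory[config_id]


class SupervisorCM:
    """Root of the tree; the only manager that reads external memory."""

    def __init__(self, external_memory, local_cms):
        self.external_memory = external_memory
        self.local_cms = {cm.name: cm for cm in local_cms}
        self.external_reads = 0

    def _fetch(self, config_id):
        self.external_reads += 1
        return self.external_memory[config_id]

    def request(self, cm_name, config_id):
        return self.local_cms[cm_name].configure(config_id, self._fetch)


scm = SupervisorCM({"fir": [0x1, 0x2, 0x3]}, [LocalCM("pac0"), LocalCM("pac1")])
scm.request("pac0", "fir")
scm.request("pac0", "fir")   # served from pac0's local configuration memory
print(scm.external_reads)
```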
Source: V. Baumgarten et al., PACT XPP: A Self-Reconfigurable
Data Processing Architecture, Journal of Supercomputing. 2003.
Source: PACT XPP Technologies
The PACT XPP – The Processing Array Elements
A Communication Network
Memory elements beside the PACs
A set of I/Os
The PAE: Two types of PAE
The ALU PAE
The RAM PAE
The ALU PAE:
Contains an ALU which can be configured to perform basic operations
Back-register (BREG) provides routing channels for data and events from bottom to top
Forward Register (FREG) provides routing channels from top to bottom
The ALU PAE
Source: PACT XPP Technologies, XPP-III Processor
Overview, 2012.
The PACT XPP - The Processing Array Elements
DataFlow Register (DF-REG) can be used at the object outputs for buffering data
Input registers can be preloaded with configuration data.
The RAM PAE:
1. Differs from the ALU-PAE only in its function: instead of an ALU, a RAM-PAE contains a dual-ported RAM.
2. Useful for data storage
3. Data is written or read once an address has been received at the RAM inputs
4. BREG, FREG, and DF-REG of the RAM-PAE have the same function as in the ALU-PAE
The RAM PAE
Source: PACT XPP Technologies, XPP-III Processor
Overview, 2012.
The PACT XPP - Routing
Routing in PACT XPP: Two independent networks
One for data transmission
The other for event transmission
A configuration bus exists alongside the data and event networks (very little information is available about the configuration bus)
All objects can be connected to
horizontal routing channels using
switch-objects
Vertical routing channels are provided
by the BREG and FREG
BREGs route from bottom to top
FREGs route from top to bottom
Horizontal routing channels
Vertical routing channels
Source: PACT XPP Technologies, XPP-III Processor
Overview, white paper, 2012.
The PACT XPP - Interface
Interfaces are available inside the chip
Number and type of interfaces vary
from device to device
On the XPP42-A1:
6 internal interfaces consisting of:
4 identical general purpose I/O on-chip
interfaces (bottom left, upper left, upper
right, and bottom right)
One configuration manager
One JTAG (Joint Test Action Group, IEEE Standard 1149.1) boundary-scan interface for testing purposes (not shown in the picture)
Interfaces
Source: PACT XPP Technologies
The PACT XPP - Interface
The I/O interfaces can operate independently of each other. Two operation modes:
The RAM mode
The streaming mode
RAM mode:
Each port can access external Static
RAM (SRAM).
Control signals for the SRAM
transactions are available.
No additional logic required
Source: PACT XPP Technologies
The PACT XPP - Interface
Streaming mode:
1. For high speed streaming of data to
and from the device
2. Each I/O element provides two
bidirectional ports for data
streaming
3. Handshake signals are used for synchronization of data packets to the external port
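How handshake signals synchronize a data stream can be sketched with a generic ready/valid scheme. The signal names, the buffer capacity, and the drain rate are assumptions; the actual XPP handshake protocol is not described in the slides.

```python
from collections import deque


class StreamPort:
    """Receiving side of a streaming port with a small input buffer."""

    def __init__(self, capacity):
        self.buffer = deque()
        self.capacity = capacity

    @property
    def ready(self):
        return len(self.buffer) < self.capacity

    def offer(self, packet):
        """A transfer happens only if a packet is offered AND the port is ready."""
        if not self.ready:
            return False
        self.buffer.append(packet)
        return True

    def take(self):
        return self.buffer.popleft() if self.buffer else None


def stream(packets, port, drain_every=2):
    """Cycle-by-cycle simulation; returns (received packets, sender stall cycles)."""
    pending = deque(packets)
    received, stalls, cycle = [], 0, 0
    while pending or port.buffer:
        cycle += 1
        if pending:
            if port.offer(pending[0]):
                pending.popleft()
            else:
                stalls += 1           # sender holds the packet and retries
        if cycle % drain_every == 0:  # consumer is slower than the sender
            item = port.take()
            if item is not None:
                received.append(item)
    return received, stalls


print(stream([1, 2, 3, 4], StreamPort(capacity=1)))
```

No packet is lost or reordered although the consumer is slower than the producer; the handshake simply stalls the sender.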
Source: PACT XPP Technologies
The Quicksilver ACM - Architecture
Structure: Fractal-like structure
Hierarchical group of four nodes with
full communication among the nodes
4 lower level nodes are grouped in a
higher level node
The lowest level consists of 4
heterogeneous processing nodes
The connection is done in a Matrix
Interconnect Network (MIN)
A system controller
Various I/O
Source: B. Plunkett et al., Adapt2400 ACM architecture
overview, QuickSilver Technology, Inc., 2004.
The Quicksilver ACM – The processing node
An ACM processing node
consists of:
An algorithmic engine. It is unique to each node type and defines the operation performed by the node.
The node memory for data
storage at the node level.
A node wrapper which is
common to all nodes. It is used to
hide the complexity of the
heterogeneous architecture.
Source: B. Plunkett et al., Adapt2400 ACM architecture
overview, QuickSilver Technology, Inc., 2004.
The Quicksilver ACM – The processing node
Four types of nodes exist:
The Programmable Scalar Node
(PSN) provides a standard 32-bit
RISC architecture with 32-bit
general purpose registers
The Adaptive Execution Node
(AXN) provides variable size MAC
and ALU operations
The Domain Bit Manipulation
(DBM) node provides bit
manipulation and byte oriented
operation
The External Memory Controller node provides DDR RAM, SRAM, memory random access, and DMA control interfaces
ACM PSN-Node
Source: B. Plunkett et al., Adapt2400 ACM architecture overview, QuickSilver Technology, Inc., 2004.
The Quicksilver ACM – The processing node
ACM DBM-Node
ACM AXN-Node
Source: B. Plunkett et al., Adapt2400 ACM architecture overview, QuickSilver Technology, Inc., 2004.
The Quicksilver ACM – The processing node
The node wrapper envelops the algorithmic engine and presents an identical interface to neighbouring nodes. It features:
1. A MIN interface to support the
communication among nodes via
the MIN-network
2. A hardware task manager for task
management at the node level
3. A DMA engine
4. Dedicated I/O circuitry
5. Memory controllers
6. Data distributors and aggregators
The ACM Node Wrapper
Source: B. Plunkett et al., Adapt2400 ACM architecture overview, QuickSilver Technology, Inc., 2004.
The Quicksilver ACM - The MIN
Matrix Interconnect Network is
the communication medium in
an ACM chip
1. Hierarchically organized. The MIN
at a given level connects many
lower-level MINs
2. The MIN-Root is used for:
1. Off-chip communication
2. Configuration
3. Supports the communication
among nodes
4. Provides services such as point-to-point dataflow streaming, real-time broadcasting, DMA, etc.
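The hierarchical organization of the MIN can be made concrete with a small sketch that computes how far up the hierarchy a transfer between two nodes has to go. The consecutive node numbering and the function name `min_level` are assumptions made for illustration; only the grouping by four is taken from the architecture description.

```python
def min_level(node_a, node_b, fanout=4):
    """Level of the lowest MIN shared by two leaf nodes.

    Leaf nodes are assumed to be numbered consecutively so that nodes
    4k .. 4k+3 share a level-1 MIN, groups of four level-1 MINs share a
    level-2 MIN, and so on. Level 0 means the two nodes are the same node.
    """
    level = 0
    while node_a != node_b:
        node_a //= fanout  # move both endpoints one level up the hierarchy
        node_b //= fanout
        level += 1
    return level


print(min_level(0, 3))   # same group of four
print(min_level(0, 4))   # neighbouring groups
print(min_level(0, 16))  # different 16-node clusters
```

Traffic between nodes of the same group never leaves the lowest-level MIN, so placing communicating tasks in the same group keeps the higher levels free.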
Example of ACM
chip configuration
Source: B. Plunkett et al., Adapt2400 ACM architecture
overview, QuickSilver Technology, Inc., 2004.
The Quicksilver ACM - The System Controller
The system controller is in charge of the system management:
Loads tasks into the nodes' ready-to-run queues for execution
Statically or dynamically sets the
communication channels between
the processing nodes
Carries out the reconfiguration of
nodes on a clock cycle-by-clock
cycle basis
The ACM chip features a set of I/O interface controllers such as:
PCI
PLL
SDRAM and SRAM
The system controller
The interface controllers
Source: B. Plunkett et al., Adapt2400 ACM architecture overview, QuickSilver Technology, Inc., 2004.
The NEC DRP – Architecture
The NEC Dynamically
Reconfigurable Processor (DRP)
consists of:
A set of byte oriented processing
elements (PE)
A programmable interconnection
network for communication among
the PEs.
A sequencer, which can be programmed as a finite state machine (FSM) to control the reconfiguration process
Memory around the device for storing
configuration and computation data
Various Interfaces
Source: C. Bobda, Introduction to Reconfigurable
Computing, Springer, 2007. Original image adapted
from M. Motomura: A dynamically reconfigurable
processor architecture, Microprocessor Forum, 2002.
The NEC DRP - The Processing Element
ALU: ordinary byte arithmetic/logic
operations
DMU (data management unit): handles
byte select, shift, mask, constant
generation, etc., as well as bit
manipulations
An instruction dictates ALU/DMU
operations and inter-PE connections
Source/destination operands can
either be from/to
its own register file
other PEs (i.e., flow through)
Instruction pointer (IP) is provided from
STC (state transition controller)
Adapted from: M. Suzuki et al., Stream Applications on
the Dynamically Reconfigurable Processor, International
Conference on Field-Programmable Technology, IEEE,
2004.
The NEC DRP – Reconfiguration Process
Instruction Pointer (IP) from STC
identifies a datapath plane
Spatial computation using a customized datapath plane
When IP changes, datapath
plane switches instantaneously
PE instructions as a collection
behave like an extreme VLIW
Sequencing through instructions
=> Dynamic reconfiguration
[Figure: multiple datapath planes (AES, 3DES, MD5, SHA-1, Compress) between Data In and Data Out; the control block selects the task by descriptor and triggers dynamic reconfiguration.]
The NEC DRP – Reconfiguration Process
[Figure: PE array in which each PE holds an instruction memory (entries 0, 1, 2) feeding its ALU and DMU; IP = "1" selects entry 1 in every PE, which yields a datapath plane of Add, Sel, and Cmp operations.]
1. Identify the instruction to be executed
2. Decode the instruction in the ALU plane
3. Configure the ALU plane according to the instruction
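The role of the instruction pointer can be sketched as follows: every PE holds a small instruction memory, and the STC broadcasts one IP that selects the same entry in all PEs, which together form the active datapath plane. The class names and the concrete semantics of Cmp and Sel below are assumptions for illustration; they are not taken from the NEC documentation.

```python
import operator


def compare(a, b):
    """Hypothetical 'Cmp' semantics: 1 if a > b, else 0."""
    return int(a > b)


def select(a, b):
    """Hypothetical 'Sel' semantics: pass a unless it is zero."""
    return a if a != 0 else b


class ProcessingElement:
    def __init__(self, instruction_memory):
        self.instruction_memory = instruction_memory  # indexed by the IP

    def execute(self, ip, a, b):
        return self.instruction_memory[ip](a, b)


class StateTransitionController:
    """Sequencer modelled as an FSM whose state number is used as the IP."""

    def __init__(self, transitions, start=0):
        self.transitions = transitions
        self.ip = start

    def step(self):
        self.ip = self.transitions[self.ip]
        return self.ip


def run_plane(pes, ip, operands):
    """All PEs execute the instruction selected by the same IP."""
    return [pe.execute(ip, a, b) for pe, (a, b) in zip(pes, operands)]


pes = [ProcessingElement([operator.add, compare]),
       ProcessingElement([select, operator.add])]
stc = StateTransitionController({0: 1, 1: 0})

print(run_plane(pes, stc.ip, [(3, 4), (0, 9)]))      # plane 0: Add, Sel
print(run_plane(pes, stc.step(), [(3, 4), (0, 9)]))  # plane 1: Cmp, Add
```

Switching planes only changes an index into memories that are already loaded, which is why the switch can be treated as instantaneous.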
Tightly Coupled Processor Arrays (TCPA)
• Processor elements (PEs) with VLIW (very long instruction word) architecture
• Weakly programmable
– Small local instruction memory
– Limited parametrizable instruction set focused on digital signal
processing
• Data flow oriented control path, no global address space,
data streaming over the processing field
• Regular interconnect network
• Application areas: Digital signal processing, e.g., mobile
communication, HDTV, multimedia, . . .
Tightly-Coupled Processor Arrays (TCPA)
Source: D. Kissler et al., A Dynamically Reconfigurable Weakly Programmable Processor Array Architecture Template, International Workshop on
Reconfigurable Communication Centric System-on-Chips (ReCoSoC), 2006.
TCPA – Interconnect Network
• Basic structure: Grid
• Dynamically reconfigurable
• By using a bypass, more than one hop is possible in a single clock cycle
• Interconnect wrapper is responsible for switching
Adapted from: D. Kissler et al., A Dynamically Reconfigurable Weakly Programmable Processor Array Architecture Template, International Workshop
on Reconfigurable Communication Centric System-on-Chips (ReCoSoC), 2006.
TCPA – Network Example – 4D Hypercube
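A minimal sketch of the connectivity of a 4D hypercube, assuming the 16 PEs are numbered 0 to 15 so that neighbours differ in exactly one address bit. The numbering and function names are assumptions; the slide shows only the topology.

```python
def hypercube_neighbors(node, dimensions=4):
    """Neighbours of a node: flip one address bit per dimension."""
    return [node ^ (1 << d) for d in range(dimensions)]


def hypercube_distance(a, b):
    """Minimum hop count: number of address bits in which a and b differ."""
    return bin(a ^ b).count("1")


print(hypercube_neighbors(5))
print(hypercube_distance(0, 15))
```

Every PE has four neighbours, and any two of the 16 PEs are at most four hops apart.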
TCPA – Network Example – 2D Torus
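A 2D torus is a grid whose rows and columns wrap around. The sketch below assumes PEs are addressed by (row, column) coordinates; the function names and the 4x4 size in the usage lines are illustrative.

```python
def torus_neighbors(row, col, rows, cols):
    """North, south, west, and east neighbours with wrap-around links."""
    return [
        ((row - 1) % rows, col),
        ((row + 1) % rows, col),
        (row, (col - 1) % cols),
        (row, (col + 1) % cols),
    ]


def torus_distance(a, b, rows, cols):
    """Minimum hop count, taking the shorter way around in each dimension."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return min(dr, rows - dr) + min(dc, cols - dc)


print(torus_neighbors(0, 0, 4, 4))
print(torus_distance((0, 0), (3, 3), 4, 4))
```

Compared with a plain grid, the wrap-around links shorten the path between opposite edges: the two corners of a 4x4 torus are two hops apart instead of six.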
TCPA – Dynamic Reconfiguration
• Multicast scheme for partial dynamic reconfiguration
• Differential reconfiguration (program/connections) also possible
Source: D. Kissler, Power-Efficient Tightly-Coupled Processor Arrays for Digital Signal Processing, PhD Dissertation, 2012.
24 Core TCPA – Lehrstuhl für Informatik 12
• 24x 16 Bit cores
• Technology
• CMOS 1.0 V
• 9 metal layers
• 90 nm standard cell layout
• FUs/PE
• 2xAdd, 2xMul,
• 1xShift, 1xDPU
• Register/PE: 15
• Instruction memory
• 1024x32 = 4kB
• Clock frequency: 200 MHz
• Peak Performance: 24 GOPS
• Power consumption
• 133 mW @ 200 MHz (Hybrid Clock Gating).
• Power efficiency: 180 MOPS/mW
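The figures above are consistent with each other, as a short check shows. All input values are copied from the list; the variable names are illustrative, and the operations-per-cycle value is derived from the stated peak performance rather than quoted from the source.

```python
CORES = 24
CLOCK_HZ = 200e6            # 200 MHz
PEAK_OPS_PER_S = 24e9       # 24 GOPS
POWER_MW = 133              # at 200 MHz with hybrid clock gating
IMEM_WORDS, IMEM_WORD_BITS = 1024, 32

# Operations each core must complete per clock cycle to reach the peak.
ops_per_core_per_cycle = PEAK_OPS_PER_S / (CORES * CLOCK_HZ)

# Power efficiency in MOPS per mW.
efficiency_mops_per_mw = (PEAK_OPS_PER_S / 1e6) / POWER_MW

# Instruction memory size in bytes.
imem_bytes = IMEM_WORDS * IMEM_WORD_BITS // 8

print(ops_per_core_per_cycle)
print(round(efficiency_mops_per_mw, 1))
print(imem_bytes)
```

The peak of 24 GOPS corresponds to five operations per core and cycle, 24 GOPS at 133 mW gives roughly 180 MOPS/mW, and 1024 words of 32 bits are 4096 bytes (4 kB).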