ece 697f reconfigurable computing lecture 19 reconfigurable coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

ECE 697F

Reconfigurable Computing

Lecture 19

Reconfigurable Coprocessors


Overview

• Focus on Processor and Array hybrids.

• Motivation

• Compute Models: how to fit into computation

• Examples: Garp, Prism, ReMarc, OneChip, PRISC

• Some lecture material taken with permission from DeHon's lecture on reconfigurable computing.


Motivation

• Processors efficient at sequential codes, regular arithmetic operations.

• FPGA efficient at fine-grained parallelism, unusual bit-level operations.

• Tight-coupling important: allows sharing of data/control

• Converging technologies: SRAM being migrated to same die as processor anyway. Why not integrate?


Motivational: Other Viewpoints

• Replace interface glue logic.

• I/O pre/post processing

• Handle real-time responsiveness

• Provide powerful, application specific operation.

• Allow migration of function/performance over time.


Compute Models

• Glue logic for buses, adapters.

• Dedicated I/O processor

• Instruction augmentation

- Special instructions/coprocessor ops

- VLIW/microcoded extension to processor

- Configurable vector unit

• Autonomous co/stream processor


Interfacing

• Logic replaces:

- ASIC customization

- External FPGA/CPLD

• Example

- Bus protocols

- Peripherals

- Sensors, actuators

• Argument

- Need customization

- Modern chips have capacity

- Reduce part count

- Migrate to system-on-a-chip

- Performance/power


Triscend E5 Architecture


I/O Processor

• Array dedicated to servicing an I/O channel

- Sensor, LAN, WAN, peripheral

- Many protocols, services

• Provides protocol handling

- Stream computation

- Compression, encrypt

• Effectively looks like I/O peripheral to processor.

- Don’t need all at same time

- Offload function from processor.


I/O Processing

• Single threaded processor created in reconfigurable logic.

• No support for multiple data pipes or multiple contexts.

• Need some minimal, local control to handle events.

• For performance or real-time guarantees, may need to service rapidly.

• Checksum and acknowledge packets, for example
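As one concrete picture of the kind of per-packet work that could be offloaded, here is a minimal software sketch of a 16-bit ones'-complement checksum (the style used by IP/TCP/UDP). The slides do not say which checksum is meant, so the choice of algorithm and the function name `internet_checksum` are assumptions for illustration only.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                 # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


if __name__ == "__main__":
    packet = bytes.fromhex("0001f203f4f5f6f7")
    print(hex(internet_checksum(packet)))
```

In software this is a loop over every word of every packet; in an array it maps naturally onto a small streaming adder that runs as the data arrives.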


Instruction Augmentation

• Processor can only describe a small number of basic computations in a cycle

- I bits -> 2^I operations

• Recall that for Boolean function a total of ______ operations could be performed on 2 W-bit words.

• ALU implementations restrict execution of some simple operations.

- e.g., bit reversal

[Figure: bit reversal of a 32-bit word; input bits a31 a30 ... a0 map to output bits b31 ... b0 with bit positions swapped]
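The bit-reversal example can be made concrete with a short sketch. On a conventional ALU the operation takes a loop of shifts and masks, whereas in reconfigurable logic it is only a fixed wiring permutation. The function names below are hypothetical, and the 32-bit width follows the a31..a0 labels on the slide.

```python
def reverse_bits_loop(a: int, width: int = 32) -> int:
    """Reverse bit order the way an ALU-only processor would: one bit per step."""
    b = 0
    for _ in range(width):
        b = (b << 1) | (a & 1)   # shift result left and append the lowest bit of a
        a >>= 1
    return b


def reverse_bits_wiring(a: int, width: int = 32) -> int:
    """Same mapping written as a fixed permutation: b[width-1-i] = a[i]."""
    return sum(((a >> i) & 1) << (width - 1 - i) for i in range(width))


if __name__ == "__main__":
    print(hex(reverse_bits_loop(0x00000001)))
```

The loop version costs on the order of `width` ALU operations, while the permutation form needs no logic at all in hardware, only routing.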


Instruction Augmentation

• Provide a way to augment the processor instruction set for an application.

• Avoid mismatch between hardware/software

What’s Required?

• Fit augmented instructions into the data and control stream.

• Create a functional unit for augmented instructions.

• Compiler techniques to identify/use the new functional unit.


Chimaera

• Start from the PRISC idea.

- Integrate as a functional unit

- No state

- RFU Ops (like expfu)

- Stall processor on instruction miss

• Add

- Multiple instructions at a time

- More than 2 inputs possible

• Hauck: University of Washington


Chimaera Architecture

• Live copy of register file values feed into array

• Each row of array may compute from register values or intermediates

• Tag on array to indicate RFUOP


Chimaera Architecture

• Array can operate on values as soon as placed in register file.

• Logic is combinational

• When RFUOP matches

- Stall until result ready

- Drive result from matching row
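The tag-match behavior described above can be sketched as a small behavioral model. This is only an illustration under stated assumptions: the class name `ChimaeraLikeRFU`, the `library` of configurations, and the `miss_penalty` value are all hypothetical, and row capacity and eviction are not modeled.

```python
from typing import Callable, Dict, Sequence

MASK32 = 0xFFFFFFFF
RowFn = Callable[[Sequence[int]], int]   # combinational function of register values


class ChimaeraLikeRFU:
    def __init__(self, library: Dict[int, RowFn], miss_penalty: int = 100):
        self.library = library           # all configurations the program may use
        self.miss_penalty = miss_penalty # assumed stall cycles to load a configuration
        self.rows: Dict[int, RowFn] = {} # RFUOP tag -> row logic currently loaded
        self.stall_cycles = 0

    def rfuop(self, op_id: int, regs: Sequence[int]) -> int:
        if op_id not in self.rows:       # instruction miss: stall and load the row
            self.stall_cycles += self.miss_penalty
            self.rows[op_id] = self.library[op_id]
        # The matching row sees the live register values and drives the result.
        return self.rows[op_id](regs) & MASK32


if __name__ == "__main__":
    # A three-input operation, which a two-operand ALU instruction cannot express.
    rfu = ChimaeraLikeRFU({7: lambda r: (r[1] + r[2]) ^ r[3]})
    print(rfu.rfuop(7, [0, 5, 7, 3]), rfu.stall_cycles)
```

Because the rows are combinational and stateless, the second call with the same tag pays no penalty; only the first use of a tag stalls.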


Chimaera Timing

• If R1 presented last then stall

• Might be helped by instruction reordering

• Physical implementation an issue.

[Figure: timing diagram showing register operands R5, R3, R2, R1]


Chimaera Results

• Three Spec92 benchmarks

- Compress 1.11 speedup

- Eqntott 1.8

- Life 2.06

• Small arrays with limited state

• Small speedup

• Perhaps focus on global router rather than local optimization.


Garp

• Integrate as coprocessor

- Similar bandwidth to processor as functional unit

- Own access to memory

• Support multi-cycle operation

- Allow state

- Cycle counter to track operation

• Configuration cache, path to memory


Garp – UC Berkeley

• ISA – coprocessor operations

- Issue gaconfig to make particular configuration present.

- Explicitly move data to/from array

- Processor suspension during coproc operation

- Use cycle counter to track progress

• Array may directly access memory

- Processor and array share memory

- Exploits streaming data operations

- Cache/MMU maintains data consistency
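The coprocessor model above (configure, start a multi-cycle operation, let the array stream from memory, and stall the processor only when it needs the result) can be sketched as follows. Only the name `gaconfig` comes from the slides; the class, the `start` and `clock` methods, the `"sum_stream"` configuration, and the helper `read_with_interlock` are hypothetical stand-ins, not Garp's actual instruction set.

```python
class GarpLikeArray:
    def __init__(self, memory):
        self.memory = memory      # shared with the processor
        self.config = None
        self.counter = 0          # cycle counter: remaining array cycles
        self.addr = 0
        self.acc = 0              # array state kept across cycles

    def gaconfig(self, name):
        """Make the named configuration present (load time is not modeled)."""
        self.config = name

    def start(self, addr, cycles):
        self.addr, self.counter, self.acc = addr, cycles, 0

    def clock(self):
        """Advance the array by one cycle while the counter is non-zero."""
        if self.counter > 0:
            if self.config == "sum_stream":
                self.acc += self.memory[self.addr]   # array reads memory directly
            self.addr += 1
            self.counter -= 1


def read_with_interlock(array):
    """Processor read of the result: stall until the cycle counter reaches zero."""
    stalls = 0
    while array.counter > 0:
        array.clock()
        stalls += 1
    return array.acc, stalls


if __name__ == "__main__":
    array = GarpLikeArray(memory=list(range(1, 9)))
    array.gaconfig("sum_stream")
    array.start(addr=0, cycles=8)
    for _ in range(3):            # processor does unrelated work for three cycles
        array.clock()
    print(read_with_interlock(array))
```

The sketch shows why the cycle counter matters: the processor pays only for the cycles that remain when it asks for the result.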


Garp Instructions

• Interlock indicates if processor waits for array to count to zero.

• Last three instructions useful for context swap

• Processor decode hardware augmented to recognize new instructions.


Garp Array

• Row-oriented logic

• Dedicated path for processor/memory

• Processor does not have to be involved in array-memory path


Garp Results

• General results

- 10-20x improvement on stream, feed-forward operation

- 2-3x when data dependencies limit pipelining

- [Hauser-FCCM97]


PRISC/Chimaera vs. Garp

• PRISC/Chimaera

- Basic op is single cycle: expfu

- No state

- Could have multiple PFUs

- Fine grained parallelism

- Not effective for deep pipelines

• Garp

- Basic op is multi-cycle – gaconfig

- Effective for deep pipelining

- Single array

- Requires state swapping consideration


Common Theme

• To overcome instruction expression limits:

- Define new array instructions. This makes decode hardware slower/more complicated.

- Many bits of configuration mean long swap times. This is an issue; recall the tips for dynamic reconfiguration.

• Give array configuration short “name” which processor can call out.

• Store multiple configurations in array. Access as needed (DPGA)
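The idea of calling a configuration by a short name and keeping several resident at once behaves much like a cache. A minimal sketch follows, assuming an LRU replacement policy, a one-cycle switch between resident contexts, and a fixed load cost for a miss; none of these figures or names come from the slides.

```python
from collections import OrderedDict


class ConfigCache:
    def __init__(self, contexts: int, load_cycles: int):
        self.contexts = contexts          # number of configurations held on-chip
        self.load_cycles = load_cycles    # assumed cost of loading one configuration
        self.resident = OrderedDict()     # short name -> loaded, in LRU order
        self.cycles_spent = 0

    def activate(self, name: str) -> str:
        if name in self.resident:
            self.resident.move_to_end(name)       # mark as most recently used
            self.cycles_spent += 1                # fast switch between contexts
            return "hit"
        if len(self.resident) >= self.contexts:
            self.resident.popitem(last=False)     # evict least recently used
        self.resident[name] = True
        self.cycles_spent += self.load_cycles     # pay the full configuration load
        return "miss"


if __name__ == "__main__":
    cache = ConfigCache(contexts=2, load_cycles=1000)
    print([cache.activate(n) for n in "ABACB"], cache.cycles_spent)
```

With only two contexts, the access pattern A, B, A, C, B hits once; the gap between a one-cycle hit and a thousand-cycle miss is what makes the number of stored contexts matter.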


ReMarc

• Miyamori/Olukotun – Stanford

• Array of “nano-processors”

- 16b, 32 instructions each

- VLIW-like instruction

• Coprocessor interface (similar to Garp)

- No direct array-to-memory path


ReMarc Architecture

• 8x8 array of nanoprocessors

• Reminiscent of DPGA except that processing element is ALU


Nanoprocessor Tile

• Each tile has own instruction RAM

• Communication with near-neighbor tiles

• Global sequencer specifies nano-PC

• 16 bit output.
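A rough behavioral sketch of one row of tiles is given below. It assumes that a global sequencer broadcasts one instruction index per cycle and that each tile looks up its own instruction RAM at that index, which is one reading of the bullets above. The `Tile` class, the tiny opcode set (`add`, `xor`, `pass`), and the left-to-right neighbor link are hypothetical simplifications.

```python
MASK16 = 0xFFFF   # each tile produces a 16-bit output


class Tile:
    def __init__(self, program):
        assert len(program) <= 32         # 32-entry instruction RAM per tile
        self.iram = program               # list of (opcode, immediate) pairs
        self.out = 0

    def step(self, pc, left_in):
        op, imm = self.iram[pc]           # same index everywhere, per-tile instruction
        if op == "add":
            result = left_in + imm
        elif op == "xor":
            result = left_in ^ imm
        elif op == "pass":
            result = left_in
        else:
            raise ValueError(f"unknown opcode {op!r}")
        return result & MASK16


def run_row(tiles, pcs, x):
    """Feed x into the leftmost tile and step all tiles once per broadcast index."""
    for pc in pcs:
        inputs = [x] + [t.out for t in tiles[:-1]]   # sample neighbors first
        new_outs = [t.step(pc, inp) for t, inp in zip(tiles, inputs)]
        for t, v in zip(tiles, new_outs):            # then update synchronously
            t.out = v
    return tiles[-1].out


if __name__ == "__main__":
    row = [Tile([("add", 1), ("pass", 0)]), Tile([("pass", 0), ("xor", 0xFF)])]
    print(run_row(row, pcs=[0, 1], x=5))
```

The point of the sketch is the VLIW-like flavor: one shared index selects a different instruction in every tile on each cycle.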


ReMarc Results

• ReMarc 60X smaller than FPGA

• Performance comparable


Observation

• All coprocessors have been single-threaded

- Performance improvement limited by application parallelism

• Potential for task/thread parallelism

- DPGA

- Fast context switch

• Concurrent threads seen in discussion of I/O/stream processor

• Added complexity needs to be addressed in software.


Scalability?

• Can scale…

- Number of inactive contexts.

- Similar to cache model

- Number of PFUs in PRISC/Chimaera

  – Still limited by single execution thread.

  – Exacerbates pressure/complexity of reconfigurable logic/interconnect

• Cannot scale?

- Amount of active resources.

- Perhaps take coarser-grain focus to parallel processing.


Parallel Computation: Processor and FPGA

• What would it take to let the processor and FPGA run in parallel?

Modern Processors

Deal with:

• Variable data delays

• Dependencies with data

• Multiple heterogeneous functional units

Via:

• Register scoreboarding

• Runtime data flow (Tomasulo)


OneChip – University of Toronto

• Allow array to have more memory-memory operations

• Want to fit into programming model/ISA without forcing exclusive processor/FPGA operation.

• Also allow decoupled processor/array execution.

• Allow interlocking of data in special “scoreboard” area.


OneChip Innovations

• FPGA operates on certain memory regions only

• Makes regions explicit to processor issue.

• Scoreboard memory blocks

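[Figure: memory map with address markers 0x0, 0x1000, and 0x10000; blocks are labeled as in use by either the FPGA or the processor (Proc)]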

Indicates usage of data pages like virtual memory system!
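The memory-block scoreboard can be sketched as a table of blocks that are busy until an outstanding FPGA operation completes. This is an illustrative model only: the block size of 0x1000, the class and method names, and the conservative choice to lock both source and destination blocks against every processor access are assumptions, not details taken from the OneChip design.

```python
BLOCK_SIZE = 0x1000   # assumed block granularity


class BlockScoreboard:
    def __init__(self):
        self.busy_until = {}   # block index -> cycle at which the FPGA releases it

    def fpga_issue(self, now, src, dst, length, latency):
        """Issue an FPGA memory-to-memory op; mark its blocks busy until it is done."""
        done = now + latency
        for base in (src, dst):
            first = base // BLOCK_SIZE
            last = (base + length - 1) // BLOCK_SIZE
            for blk in range(first, last + 1):
                self.busy_until[blk] = max(done, self.busy_until.get(blk, 0))
        return done

    def cpu_access(self, now, addr):
        """Return the cycle at which a processor load/store to addr may proceed."""
        return max(now, self.busy_until.get(addr // BLOCK_SIZE, 0))


if __name__ == "__main__":
    sb = BlockScoreboard()
    sb.fpga_issue(now=0, src=0x0, dst=0x1000, length=0x1000, latency=50)
    print(sb.cpu_access(10, 0x1004), sb.cpu_access(10, 0x10000))
```

Processor accesses outside the FPGA's regions proceed immediately, so the two units run in parallel and only serialize when they touch the same block.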


OneChip

• Basic op is FPGA: Mem -> Mem

• No state between ops

• Ops must appear sequential

• Could have multiple/parallel FPGA compute units

- Scoreboard between all

• Multiprocessing?


Summary

• Several different models and uses for “reconfigurable processor”

• Some move towards parallel computing. Others towards single processors

• Exploit density and expressiveness of fine-grained, parallel operations.

• Number of ways to integrate. Need to work around limitations.
