Lecture 19: Reconfigurable Coprocessors November 15, 2004
ECE 697F
Reconfigurable Computing
Lecture 19
Reconfigurable Coprocessors
Overview
• Focus on Processor and Array hybrids.
• Motivation
• Compute Models: how to fit into computation
• Examples: Garp, Prism, Remarc, OneChip, Prisc
• Some lecture material taken with permission from DeHon's lecture on reconfigurable computing.
Motivation
• Processors are efficient at sequential code and regular arithmetic operations.
• FPGAs are efficient at fine-grained parallelism and unusual bit-level operations.
• Tight coupling is important: it allows sharing of data and control.
• Converging technologies: SRAM is being migrated onto the same die as the processor anyway. Why not integrate?
Motivational: Other Viewpoints
• Replace interface glue logic.
• I/O pre/post processing
• Handle real-time responsiveness
• Provide powerful, application specific operation.
• Allow migration of function/performance over time.
Compute Models
• Glue logic for buses, adapters.
• Dedicated I/O processor
• Instruction augmentation
- Special instructions/coprocessor ops
- VLIW/microcoded extension to processor
- Configurable vector unit
• Autonomous co/stream processor
Interfacing
• Logic replaces:
- ASIC customization
- External FPGA/CPLD
• Example
- Bus protocols
- Peripherals
- Sensors, actuators
• Argument
- Need customization
- Modern chips have capacity
- Reduce part count
- Migrate to system-on-a-chip
- Performance/power
Triscend E5 Architecture
I/O Processor
• Array dedicated to servicing an I/O channel
- Sensor, LAN, WAN, peripheral
- Many protocols, services
• Provides protocol handling
- Stream computation
- Compression, encryption
• Effectively looks like I/O peripheral to processor.
- Don’t need all at same time
- Offload function from processor.
I/O Processing
• Single threaded processor created in reconfigurable logic.
• No support for multiple data pipes or multiple contexts.
• Need some minimal, local control to handle events.
• For performance or real-time guarantees, may need to service events rapidly.
• Checksum and acknowledge packets, for example
Instruction Augmentation
• Processor can only describe a small number of basic computations in a cycle
- I bits -> 2^I operations
• Recall that for Boolean function a total of ______ operations could be performed on 2 W-bit words.
• ALU implementations restrict execution of some simple operations.
- e.g. bit reversal
[Figure: bit reversal of a 32-bit word; bit positions a31 a30 ... a0 are swapped to produce b31 ... b0]
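To make the bit-reversal point concrete, here is a minimal sketch of what a processor without a dedicated instruction has to do. It assumes a 32-bit word and uses only shift, AND, and OR operations; the function name is illustrative. The software version takes roughly twenty ALU operations, whereas in reconfigurable logic the same permutation is only wiring.

```python
def reverse_bits32(x: int) -> int:
    """Reverse the bit order of a 32-bit word using only shift/mask ALU ops."""
    x &= 0xFFFFFFFF
    # Swap adjacent bits, then 2-bit pairs, nibbles, bytes, and half-words.
    x = ((x >> 1) & 0x55555555) | ((x & 0x55555555) << 1)
    x = ((x >> 2) & 0x33333333) | ((x & 0x33333333) << 2)
    x = ((x >> 4) & 0x0F0F0F0F) | ((x & 0x0F0F0F0F) << 4)
    x = ((x >> 8) & 0x00FF00FF) | ((x & 0x00FF00FF) << 8)
    x = (x >> 16) | ((x & 0x0000FFFF) << 16)
    return x & 0xFFFFFFFF
```

For example, `reverse_bits32(1)` moves bit a0 to position 31.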
Instruction Augmentation
• Provide a way to augment the processor instruction set for an application.
• Avoid mismatch between hardware/software
What's required?
• Fit augmented instructions into data and control stream.
• Create a functional unit for augmented instructions.
• Compiler techniques to identify/use new functional unit.
Chimaera
• Start from Prisc idea.
- Integrate as a functional unit
- No state
- RFU Ops (like expfu)
- Stall processor on instruction miss
• Add
- Multiple instructions at a time
- More than 2 inputs possible
• Hauck: University of Washington
Chimaera Architecture
• Live copy of register file values feed into array
• Each row of array may compute from register values or intermediates
• Tag on array to indicate RFUOP
Chimaera Architecture
• Array can operate on values as soon as placed in register file.
• Logic is combinational
• When RFUOP matches
- Stall until result ready
- Drive result from matching row
Chimaera Timing
• If R1 presented last then stall
• Might be helped by instruction reordering
• Physical implementation an issue.
[Figure: timing example with registers R5, R3, R2, R1]
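The stall behavior can be sketched with a small timing model. This is a simplification under stated assumptions: the array is purely combinational, it starts computing as soon as values land in the register file, and its delay is counted in whole cycles from the last-arriving input. The function name, the cycle numbers, and the delay value are all hypothetical.

```python
def rfuop_stall_cycles(issue_cycle, operand_ready, array_delay):
    """Cycles the processor stalls when an RFUOP issues.

    operand_ready maps a register name to the cycle its value lands in the
    register file; array_delay is the assumed combinational delay (in
    cycles) from the last-arriving input to the row output.
    """
    result_ready = max(operand_ready.values()) + array_delay
    return max(0, result_ready - issue_cycle)


# R1 is written just before the RFUOP issues, so the processor must wait.
late_r1 = rfuop_stall_cycles(10, {"R5": 1, "R3": 2, "R2": 3, "R1": 9}, 2)

# Reordering instructions so R1 is written earlier hides the array delay.
early_r1 = rfuop_stall_cycles(10, {"R5": 1, "R3": 2, "R2": 3, "R1": 4}, 2)
```

In this sketch only the latest-written source register matters, which is why instruction reordering can help.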
Chimaera Results
• Three Spec92 benchmarks
- Compress 1.11 speedup
- Eqntott 1.8
- Life 2.06
• Small arrays with limited state
• Small speedup
• Perhaps focus on global router rather than local optimization.
Garp
• Integrate as coprocessor
- Similar bandwidth to processor as functional unit
- Own access to memory
• Support multi-cycle operation
- Allow state
- Cycle counter to track operation
• Configuration cache, path to memory
Garp – UC Berkeley
• ISA – coprocessor operations
- Issue gaconfig to make particular configuration present.
- Explicitly move data to/from array
- Processor suspension during coproc operation
- Use cycle counter to track progress
• Array may directly access memory
- Processor and array share memory
- Exploits streaming data operations
- Cache/MMU maintains data consistency
Garp Instructions
• Interlock indicates if processor waits for array to count to zero.
• Last three instructions useful for context swap
• Processor decode hardware augmented to recognize new instructions.
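The coprocessor model (configure, move data in, let the array count down, read results with an interlock) can be sketched as a toy simulation. This is not the Garp ISA: the class, the method names, and the `accumulate` configuration are hypothetical stand-ins, and only `load_config` is meant to mirror the role of gaconfig.

```python
class ArrayModel:
    """Toy model of a Garp-style array; all names here are hypothetical."""

    def __init__(self):
        self.configs = {}      # configuration cache: name -> per-cycle function
        self.active = None     # currently selected configuration
        self.regs = {}         # array-visible data registers
        self.counter = 0       # array cycles still to run
        self.stall_cycles = 0  # cycles the processor spent waiting

    def load_config(self, name, step_fn):
        # Stands in for gaconfig: make the named configuration present.
        self.configs.setdefault(name, step_fn)
        self.active = self.configs[name]

    def move_to_array(self, reg, value, count=None):
        # Explicit data move; optionally start the array for `count` cycles.
        self.regs[reg] = value
        if count is not None:
            self.counter = count

    def tick(self):
        # One array clock: run the configuration while the counter is nonzero.
        if self.counter > 0:
            self.active(self.regs)
            self.counter -= 1

    def move_from_array(self, reg, interlock=True):
        # With interlock, the processor waits for the counter to reach zero.
        while interlock and self.counter > 0:
            self.tick()
            self.stall_cycles += 1
        return self.regs[reg]


def accumulate(regs):
    # Hypothetical configuration: add x into acc once per array cycle.
    regs["acc"] += regs["x"]


array = ArrayModel()
array.load_config("accumulate", accumulate)
array.move_to_array("x", 3)
array.move_to_array("acc", 0, count=4)
result = array.move_from_array("acc")  # interlocked read waits 4 cycles
```

Reading with `interlock=False` returns whatever the register holds at that moment, which is why the interlock bit matters for correctness.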
Garp Array
• Row-oriented logic
• Dedicated path for processor/memory
• Processor does not have to be involved in array-memory path
Garp Results
• General results
- 10-20x improvement on stream, feed-forward operation
- 2-3x when data dependencies limit pipelining
- [Hauser-FCCM97]
PRISC/Chimaera vs. Garp
• Prisc/Chimaera
- Basic op is single cycle: expfu
- No state
- Could have multiple PFUs
- Fine grained parallelism
- Not effective for deep pipelines
• Garp
- Basic op is multi-cycle – gaconfig
- Effective for deep pipelining
- Single array
- Requires state swapping consideration
Common Theme
• To overcome instruction expression limits:
- Define new array instructions. Make decode hardware slower / more complicated.
- Many bits of configuration, so swap time is an issue -> recall tips for dynamic reconfiguration.
• Give array configuration short “name” which processor can call out.
• Store multiple configurations in array. Access as needed (DPGA)
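The idea of calling configurations by a short name and keeping several resident can be sketched as a small cache model. The LRU replacement policy, the number of contexts, and the reload cost below are assumptions chosen for illustration, not values from any of the architectures discussed.

```python
from collections import OrderedDict


class ConfigCache:
    """Named configurations held in a fixed number of on-array contexts."""

    def __init__(self, contexts, load_cycles):
        self.contexts = contexts        # how many configurations fit at once
        self.load_cycles = load_cycles  # assumed cost of loading one from memory
        self.resident = OrderedDict()   # name -> True, least recently used first
        self.cycles_spent = 0

    def call(self, name):
        """Select a configuration by name; return True if already resident."""
        if name in self.resident:
            self.resident.move_to_end(name)    # fast context switch
            return True
        if len(self.resident) >= self.contexts:
            self.resident.popitem(last=False)  # evict least recently used
        self.resident[name] = True
        self.cycles_spent += self.load_cycles  # pay the full swap time
        return False


cache = ConfigCache(contexts=2, load_cycles=1000)
hits = [cache.call(name) for name in ["fir", "crc", "fir", "dct", "crc"]]
```

With two contexts, only the second call to `fir` hits; every other call pays the reload cost, which is the swap-time pressure the slide refers to.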
ReMarc
• Miyamori/Olukotun – Stanford
• Array of “nano-processors”
- 16b, 32 instructions each
- VLIW-like instruction
• Coprocessor interface (similar to Garp)
- No direct array -> memory
ReMarc Architecture
• 8x8 array of nanoprocessors
• Reminiscent of DPGA except that processing element is ALU
Nanoprocessor Tile
• Each tile has own instruction RAM
• Communication with near-neighbor tiles
• Global sequencer specifies nano-PC
• 16 bit output.
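A minimal sketch of the tile execution model, assuming a global sequencer broadcasts one instruction index to all tiles and each tile uses it to look up an entry in its own instruction RAM. Near-neighbor communication is left out, and the two-tile example and its operations are hypothetical.

```python
def run_array(instruction_rams, pc_sequence, inputs):
    """Run tiles in lockstep; each tile has its own instruction RAM.

    instruction_rams[t][pc] is a function applied by tile t when the
    shared program counter equals pc. Outputs are truncated to 16 bits.
    """
    values = list(inputs)
    for pc in pc_sequence:
        # Same index everywhere, but a different instruction per tile.
        values = [ram[pc](v) & 0xFFFF for ram, v in zip(instruction_rams, values)]
    return values


# Two hypothetical tiles with two instructions each.
tile0 = [lambda v: v + 1, lambda v: v * 2]
tile1 = [lambda v: v - 1, lambda v: v ^ 0xFF]

outputs = run_array([tile0, tile1], pc_sequence=[0, 1, 0], inputs=[1, 5])
```

The shared index with per-tile instruction storage is what gives the array its VLIW-like character in this sketch.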
ReMarc Results
• ReMarc 60X smaller than FPGA
• Performance comparable
Observation
• All coprocessors have been single-threaded
- Performance improvement limited by application parallelism
• Potential for task/thread parallelism
- DPGA
- Fast context switch
• Concurrent threads seen in discussion of IO/stream processor
• Added complexity needs to be addressed in software.
Scalability?
• Can scale….
- Number of inactive contexts.
- Similar to cache model
- Number of PFUs in PRISC/Chimaera
  – Still limited by single execution thread.
  – Exacerbates pressure/complexity of reconfigurable logic/interconnect
• Cannot scale?
- Amount of active resources.
- Perhaps take coarser-grain focus to parallel processing.
Parallel Computation: Processor and FPGA
• What would it take to let the processor and FPGA run in parallel?
Modern Processors
Deal with:
• Variable data delays
• Dependencies with data
• Multiple heterogeneous functional units
Via:
• Register scoreboarding
• Runtime data flow (Tomasulo)
OneChip – University of Toronto
• Allow array to have more memory-memory operations
• Want to fit into programming model/ISA without forcing exclusive processor/FPGA operation.
• Also allow decoupled processor/array execution.
• Allow interlocking of data in special “scoreboard” area.
OneChip Innovations
• FPGA operates on certain memory regions only
• Makes regions explicit to processor issue.
• Scoreboard memory blocks
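[Figure: memory map with block boundaries at 0x0, 0x1000, and 0x10000; blocks are marked as in use by the FPGA or by the processor]
• Indicates usage of data pages, like a virtual memory system!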
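The block-level scoreboard can be sketched as follows. This is an illustrative model only: the 0x1000 block size, the class, and the method names are assumptions, and the real hardware tracks more than a single busy flag per block.

```python
class BlockScoreboard:
    """Tracks which memory blocks an in-flight FPGA operation is using."""

    def __init__(self, block_size=0x1000):  # block size is an assumption
        self.block_size = block_size
        self.busy = set()  # block numbers currently claimed by the FPGA

    def _blocks(self, start, length):
        first = start // self.block_size
        last = (start + length - 1) // self.block_size
        return set(range(first, last + 1))

    def fpga_begin(self, start, length):
        # Issuing an FPGA op makes its memory region explicit to the processor.
        self.busy |= self._blocks(start, length)

    def fpga_done(self, start, length):
        # Completion releases the region.
        self.busy -= self._blocks(start, length)

    def cpu_must_stall(self, address):
        # A processor access interlocks only if it touches a busy block.
        return address // self.block_size in self.busy


sb = BlockScoreboard()
sb.fpga_begin(0x1000, 0x2000)  # FPGA works on 0x1000..0x2FFF
stalls = [sb.cpu_must_stall(a) for a in (0x0FFF, 0x1000, 0x2FFF, 0x3000)]
```

Accesses outside the claimed blocks proceed, so the processor and the array can run decoupled until they actually touch the same data.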
OneChip
• Basic op is FPGA: Mem -> Mem
• No state between ops
• Ops must appear sequential
• Could have multiple/parallel FPGA compute units
- Scoreboard between all
• Multiprocessing?
Summary
• Several different models and uses for “reconfigurable processor”
• Some move towards parallel computing. Others towards single processors
• Exploit density and expressiveness of fine-grained, parallel operations.
• Number of ways to integrate. Need to work around limitations.