The Garp Architecture and C Compiler
Brought to you by
Liao Jirong
liaojiro@comp.nus.edu.sg
http://www.comp.nus.edu.sg/~liaojiro
T.J. Callahan, J.R. Hauser & J. Wawrzynek, U.C. Berkeley
Background
Emergence of reconfigurable hardware (FPGAs, etc.)
Impressive speedups for various tasks: DNA sequence matching, encryption, etc.
Obstacles to be overcome: configuration time, size, floating-point operations, compatibility of the various implementations on the market
Past works (PRISC, NAPA, PRISM, etc.): limited to specific application domains; no fully automatic compilation
The Big Picture
An application splits into a computation kernel and the non-kernel code. The kernel is compiled/synthesized for the coprocessor (FPGA, ASIC, etc.), the non-kernel code is compiled for the processor (CPU), and the two communicate at run time.
Workflow of Kernel Execution
1. Load a configuration
2. Copy any initial register data to the coprocessor
3. Start execution on the coprocessor
4. Copy the result back to the processor
Steps 1, 2 & 4 are overhead.
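The four steps above can be sketched in plain C. This is only an illustration of the flow, not a real Garp API: every name here (garp_load_config, garp_copy_in, and so on) is hypothetical, and the "coprocessor" is just a struct. The point is that the useful work is step 3, bracketed by overhead on both sides.

```c
#include <stdint.h>

/* Hypothetical sketch of the kernel-execution flow, simulated in
 * software.  None of these function names come from the real Garp
 * interface. */

typedef struct {
    int config_loaded;   /* step 1: has a configuration been loaded? */
    uint32_t reg_in;     /* step 2: input copied from the processor  */
    uint32_t reg_out;    /* step 3: result produced on the array     */
} coproc_t;

static void garp_load_config(coproc_t *c) {          /* step 1: overhead */
    c->config_loaded = 1;
}

static void garp_copy_in(coproc_t *c, uint32_t x) {  /* step 2: overhead */
    c->reg_in = x;
}

static void garp_execute(coproc_t *c) {              /* step 3: work */
    c->reg_out = c->reg_in * 2u;  /* stand-in for the accelerated kernel */
}

static uint32_t garp_copy_back(const coproc_t *c) {  /* step 4: overhead */
    return c->reg_out;
}

uint32_t run_kernel(uint32_t x) {
    coproc_t c = {0};
    garp_load_config(&c);      /* overhead */
    garp_copy_in(&c, x);       /* overhead */
    garp_execute(&c);          /* useful work */
    return garp_copy_back(&c); /* overhead */
}
```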
Motivation
Integrate reconfigurable hardware more closely with the processor, to address:
long reconfiguration times
low-bandwidth paths for data transfer
the hardware design expertise otherwise required
Assumptions
A few cycles of overhead for register data transfer is acceptable.
The coprocessor needs its own direct path to the processor's memory system; it is impossible for the processor to handle these accesses on its behalf.
The coprocessor needs to be rapidly reconfigurable.
The Garp Architecture
Single-issue MIPS processor core with reconfigurable hardware (the coprocessor)
The coprocessor is on the same die as the processor
Coprocessor and processor share the same memory
The reconfigurable hardware architecture and interfaces are fully specified
Does not exist as real silicon (simulation only)
The Garp Arch. (Cont.)
For general-purpose applications; fits into an ordinary processing environment
The main thread of control through a program is managed by the processor:
1. A configuration can be loaded only when the coprocessor is idle
2. The coprocessor can work independently
3. Coprocessor execution can be halted or resumed
4. The processor cannot load a configuration or access the coprocessor while it is active
The Reconfigurable Hardware
Two-dimensional array of blocks; the number of rows is implementation-specific, in an upward-compatible fashion
Blocks are interconnected by programmable wiring
A fixed global clock acts as the sequencer
Configuration cache, memory buses, and memory queues
Blocks
Configurable logic blocks (CLBs), each 2 bits wide; 16 CLBs in a row form a 32-bit datapath
Each CLB takes up to four 2-bit inputs; (a<<10)|(b&c) can be implemented in one row
Control blocks: one for each row, in the leftmost column, serving as liaison to the processor
Boolean values support if-conversion, used to build hyperblocks
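The slide's example expression is ordinary C; what makes it notable is that one 32-bit Garp row (16 two-bit CLBs) can evaluate the whole thing in a single step, because each CLB accepts up to four inputs:

```c
#include <stdint.h>

/* The expression from the slide: a shift, an AND, and an OR,
 * all computable by a single row of the Garp array. */
uint32_t clb_row(uint32_t a, uint32_t b, uint32_t c) {
    return (a << 10) | (b & c);
}
```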
Wires
Vertical wires connect blocks in the same column; horizontal wires connect blocks in the same or adjacent rows
A built-in carry chain supports addition, subtraction, and comparison
Multi-bit shifts across a row make multiplication and division by a constant fairly efficient
The wire network is passive: a value cannot jump from one wire to another without passing through a logic block
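The shift-based constant arithmetic mentioned above looks like this in C. The decompositions (x*10 as x*8 + x*2; division by 8 as a right shift) are standard strength reduction; on Garp each shift is a cheap multi-bit move across a row:

```c
#include <stdint.h>

/* Multiplication by a constant as shifts and adds:
 * x * 10 == x*8 + x*2 == (x << 3) + (x << 1). */
uint32_t mul10(uint32_t x) {
    return (x << 3) + (x << 1);
}

/* Division by a power-of-two constant is a single shift. */
uint32_t div8(uint32_t x) {
    return x >> 3;
}
```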
Memory Tricks
Configuration cache: holds recently displaced configurations; reloading from the cache requires only 5 cycles; can hold 4 full-sized configurations
Wide path between the coprocessor and memory, used for both data transfer and configuration loading
Memory buses: four 32-bit data buses and one 32-bit address bus; the coprocessor is the master of the memory buses when active, and can initiate one access every cycle
Memory queues
Comparing Garp with Other Architectures
Garp vs. VLIW
Garp resembles a VLIW machine.
Advantages over VLIW: Garp has no per-cycle limits on instruction issue, functional units, or register-file bandwidth; pipelining on Garp is more straightforward than software pipelining on VLIW, since there is no competition for functional units; and the processor maintains high performance for sequential code.
Disadvantages vs. VLIW: kernel size is limited, and Garp cannot exploit ILP outside of loops.
Garp vs. Vector
Garp resembles a memory-to-memory vector processor when synthesizing a vectorizable loop.
Feedback loops can be constructed arbitrarily, while vector units can handle only very specialized recurrences.
Garp easily handles data-dependent loop exits, which are a problem for vector architectures.
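A data-dependent loop exit is one whose trip count is unknown until the data is seen, which is exactly what defeats a vector unit. A minimal example:

```c
#include <stddef.h>

/* A loop with a data-dependent exit: the trip count depends on the
 * contents of the array, so it cannot be expressed as a fixed-length
 * vector operation.  Returns the index of the first zero, or n if
 * there is none. */
size_t find_first_zero(const int *a, size_t n) {
    size_t i;
    for (i = 0; i < n; i++) {
        if (a[i] == 0)   /* exit condition depends on the data */
            break;
    }
    return i;
}
```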
Garp vs. Superscalar
Because of its modest number of instruction-issue slots, a superscalar processor cannot compete with the Garp coprocessor in cases with a large amount of ILP.
Any Questions About Garp?
For further details:
Garp: A MIPS Processor with a Reconfigurable Coprocessor
J.R. Hauser and J. Wawrzynek, IEEE FCCM 1997
Automatic Compilation
Standard ANSI C as input
SUIF C compiler for the front-end phase
parsing and standard optimizations
Fully automatic compilation
Compilation Flow
The application is split by kernel selection into a kernel and non-kernel part. The kernel goes through optimization and synthesis into a bit-stream for the coprocessor; the non-kernel code goes through optimization into an executable file for the processor.
Kernel Selection
Candidates are loops. Should the whole loop be selected? No:
the loop may be too large
it may contain some infrequently executed code (longer load time, longer interconnects)
some operations cannot be implemented in hardware
ILP is limited within a basic block
Hyperblock
Join all the basic blocks of a loop body using predication (boolean values)
Increases ILP
Precedence edges: array subscript analysis, inter-procedural pointer analysis
Contains the loop back edges, to avoid switching control back and forth between processor and coprocessor
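If-conversion, the transformation that merges a loop body's basic blocks into one hyperblock, replaces a branch with a boolean predicate so both paths become straight-line code. A sketch of the idea in C (both functions compute the same thing; only the second is branch-free):

```c
#include <stdint.h>

/* Branching form: two basic blocks in the loop body. */
int32_t clamp_branch(int32_t x, int32_t hi) {
    if (x > hi)
        return hi;
    return x;
}

/* If-converted form: the predicate p selects the result, so the
 * whole body is a single straight-line block that the array can
 * evaluate with its boolean control values. */
int32_t clamp_predicated(int32_t x, int32_t hi) {
    int32_t p = (x > hi);          /* boolean predicate, 0 or 1 */
    return p * hi + (1 - p) * x;   /* select without a branch   */
}
```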
Hyperblock (Cont.)
Loops whose speedup does not make up for the overhead are rejected, based on profiling and execution-time estimates.
In exceptional exit cases, execution continues on the processor; these occur only a small fraction of the time.
Optimization Techniques
Speculative loads: crucial for pipelining
Pipelining: handles loop-carried dependencies and simultaneous memory accesses
Memory queues: 3 memory queues for buffering, reading ahead, and writing behind; non-cache-allocating
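The pipelining point is easiest to see by contrasting two loop shapes. In the first, iterations are independent and can be fully overlapped; in the second, the accumulator carries a dependency from one iteration to the next, which is what the compiler's pipelining support has to handle:

```c
#include <stdint.h>

/* No loop-carried dependency: every iteration is independent,
 * so iterations can be overlapped (pipelined) freely. */
void add_arrays(const int32_t *a, const int32_t *b,
                int32_t *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Loop-carried dependency: acc flows from each iteration into the
 * next, constraining how deeply the loop can be pipelined. */
int32_t dot(const int32_t *a, const int32_t *b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}
```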
Configuration Synthesis
Module mapping: groups of nodes in the dataflow graph (DFG) are mapped to compound modules in the configuration, minimizing its size and critical path
Placement: connected modules are placed close to one another
Generating the bit-stream file
Simulation Results
32-row array; adapted UltraSPARC processor; cycle-accurate simulator modeling cache misses and interlocks.
Gzip Compression
Gzip has irregular memory accesses, which reduce parallelism and prevent pipelining.
Each loop executes for only a few cycles, so the overhead cost is more significant; the overhead negates the benefit.
Compilation Time & Code Expansion
Compilation time is typically much less than double that of compiling for software only.
Code size typically increases by 10 to 50 percent (wavelet benchmark: 16 percent).
Garp vs. UltraSPARC
UltraSPARC: a four-way superscalar at 167 MHz
Garp: implemented using the same VLSI process, at 133 MHz
Wavelet: Garp is 68% faster than the UltraSPARC
Gzip: the UltraSPARC is 14% faster than Garp
Future Work
More experiments over a broader range of benchmarks
Development of new optimizations
Finding out the strengths and weaknesses of the Garp architecture
Summary
The Garp architecture: processor + coprocessor on one die; configuration cache; memory queues; high-bandwidth, low-latency data access
A synthesizing C compiler for Garp