The Garp Architecture and C Compiler
Brought to you by
Liao Jirong
liaojiro@comp.nus.edu.sg
http://www.comp.nus.edu.sg/~liaojiro
T.J. Callahan, J.R. Hauser & J. Wawrzynek, U.C. Berkeley
Background
Emergence of reconfigurable hardware (FPGAs, etc.)
Impressive speedups for various tasks: DNA sequence matching, encryption, etc.
Obstacles to be overcome: configuration time, size, floating-point operations, compatibility of the various implementations on the market
Past works (PRISC, NAPA, PRISM, etc.): limited to specific application domains; no fully automatic compilation
The Big Picture
An application splits into a computation kernel and the non-kernel code. The kernel is compiled/synthesized for the coprocessor (FPGA, ASIC, etc.), the non-kernel code is compiled for the processor (CPU), and the two communicate at run time.
Workflow of Kernel Execution
1. Load a configuration
2. Copy any initial register data to the coprocessor
3. Start execution on the coprocessor
4. Copy the result back to the processor
Steps 1, 2 & 4 are overhead.
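The four steps above can be sketched in plain C. This is only an illustration of the flow, not a real Garp API: every name here (garp_load_config, garp_copy_in, and so on) is hypothetical, and the "coprocessor" is just a struct. The point is that the useful work is step 3, bracketed by overhead on both sides.

```c
#include <stdint.h>

/* Hypothetical sketch of the kernel-execution flow, simulated in
 * software.  None of these function names come from the real Garp
 * interface. */

typedef struct {
    int config_loaded;   /* step 1: has a configuration been loaded? */
    uint32_t reg_in;     /* step 2: input copied from the processor  */
    uint32_t reg_out;    /* step 3: result produced on the array     */
} coproc_t;

static void garp_load_config(coproc_t *c) {          /* step 1: overhead */
    c->config_loaded = 1;
}

static void garp_copy_in(coproc_t *c, uint32_t x) {  /* step 2: overhead */
    c->reg_in = x;
}

static void garp_execute(coproc_t *c) {              /* step 3: work */
    c->reg_out = c->reg_in * 2u;  /* stand-in for the accelerated kernel */
}

static uint32_t garp_copy_back(const coproc_t *c) {  /* step 4: overhead */
    return c->reg_out;
}

uint32_t run_kernel(uint32_t x) {
    coproc_t c = {0};
    garp_load_config(&c);      /* overhead */
    garp_copy_in(&c, x);       /* overhead */
    garp_execute(&c);          /* useful work */
    return garp_copy_back(&c); /* overhead */
}
```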
Motivation
Integrate reconfigurable hardware more closely with the processor, to address:
long reconfiguration times
low-bandwidth paths for data transfer
the hardware design expertise otherwise required
Assumptions
A few cycles of overhead for register data transfer is acceptable.
The coprocessor needs its own direct path to the processor's memory system; it is impossible for the processor to handle these accesses on its behalf.
The coprocessor needs to be rapidly reconfigurable.
The Garp Architecture
Single-issue MIPS processor core with reconfigurable hardware (the coprocessor)
The coprocessor is on the same die as the processor
Coprocessor and processor share the same memory
The reconfigurable hardware architecture and interfaces are fully specified
Does not exist as real silicon (simulation only)
The Garp Arch. (Cont.)
For general-purpose applications; fits into an ordinary processing environment
The main thread of control through a program is managed by the processor:
1. A configuration can be loaded only when the coprocessor is idle
2. The coprocessor can work independently
3. Coprocessor execution can be halted or resumed
4. The processor cannot load a configuration or access the coprocessor while it is active
The Reconfigurable Hardware
Two-dimensional array of blocks; the number of rows is implementation-specific, in an upward-compatible fashion
Blocks are interconnected by programmable wiring
A fixed global clock acts as the sequencer
Configuration cache, memory buses, and memory queues
Blocks
Configurable logic blocks (CLBs), each 2 bits wide; 16 CLBs in a row form a 32-bit datapath
Each CLB takes up to four 2-bit inputs; (a<<10)|(b&c) can be implemented in one row
Control blocks: one for each row, in the leftmost column, serving as liaison to the processor
Boolean values support if-conversion, used to build hyperblocks
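The slide's example expression is ordinary C; what makes it notable is that one 32-bit Garp row (16 two-bit CLBs) can evaluate the whole thing in a single step, because each CLB accepts up to four inputs:

```c
#include <stdint.h>

/* The expression from the slide: a shift, an AND, and an OR,
 * all computable by a single row of the Garp array. */
uint32_t clb_row(uint32_t a, uint32_t b, uint32_t c) {
    return (a << 10) | (b & c);
}
```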
Wires
Vertical wires connect blocks in the same column; horizontal wires connect blocks in the same or adjacent rows
A built-in carry chain supports addition, subtraction, and comparison
Multi-bit shifts across a row make multiplication and division by a constant fairly efficient
The wire network is passive: a value cannot jump from one wire to another without passing through a logic block
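The shift-based constant arithmetic mentioned above looks like this in C. The decompositions (x*10 as x*8 + x*2; division by 8 as a right shift) are standard strength reduction; on Garp each shift is a cheap multi-bit move across a row:

```c
#include <stdint.h>

/* Multiplication by a constant as shifts and adds:
 * x * 10 == x*8 + x*2 == (x << 3) + (x << 1). */
uint32_t mul10(uint32_t x) {
    return (x << 3) + (x << 1);
}

/* Division by a power-of-two constant is a single shift. */
uint32_t div8(uint32_t x) {
    return x >> 3;
}
```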
Memory Tricks
Configuration cache: holds recently displaced configurations; reloading from the cache requires only 5 cycles; can hold 4 full-sized configurations
Wide path between the coprocessor and memory, used for both data transfer and configuration loading
Memory buses: four 32-bit data buses and one 32-bit address bus; the coprocessor is the master of the memory buses when active, and can initiate one access every cycle
Memory queues
Comparing Garp with Other Architectures
Garp vs. VLIW
Garp resembles a VLIW machine.
Advantages over VLIW: Garp has no per-cycle limits on instruction issue, functional units, or register-file bandwidth; pipelining on Garp is more straightforward than software pipelining on VLIW, since there is no competition for functional units; and the processor maintains high performance for sequential code.
Disadvantages vs. VLIW: kernel size is limited, and Garp cannot exploit ILP outside of loops.
Garp vs. Vector
Garp resembles a memory-to-memory vector processor when synthesizing a vectorizable loop.
Feedback loops can be constructed arbitrarily, while vector units can handle only very specialized recurrences.
Garp easily handles data-dependent loop exits, which are a problem for vector architectures.
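A data-dependent loop exit is one whose trip count is unknown until the data is seen, which is exactly what defeats a vector unit. A minimal example:

```c
#include <stddef.h>

/* A loop with a data-dependent exit: the trip count depends on the
 * contents of the array, so it cannot be expressed as a fixed-length
 * vector operation.  Returns the index of the first zero, or n if
 * there is none. */
size_t find_first_zero(const int *a, size_t n) {
    size_t i;
    for (i = 0; i < n; i++) {
        if (a[i] == 0)   /* exit condition depends on the data */
            break;
    }
    return i;
}
```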
Garp vs. Superscalar
Because of its modest number of instruction-issue slots, a superscalar processor cannot compete with the Garp coprocessor in cases with a large amount of ILP.
Any Questions About Garp?
For further details:
Garp: A MIPS Processor with a Reconfigurable Coprocessor
J.R. Hauser and J. Wawrzynek, IEEE FCCM 1997
Automatic Compilation
Standard ANSI C as input
SUIF C compiler for the front-end phase
parsing and standard optimizations
Fully automatic compilation
Compilation Flow
The application is split by kernel selection into a kernel and non-kernel part. The kernel goes through optimization and synthesis into a bit-stream for the coprocessor; the non-kernel code goes through optimization into an executable file for the processor.
Kernel Selection
Candidates are loops. Should the whole loop be selected? No:
the loop may be too large
it may contain some infrequently executed code (longer load time, longer interconnects)
some operations cannot be implemented in hardware
ILP is limited within a basic block
Hyperblock
Join all the basic blocks of a loop body using predication (boolean values)
Increases ILP
Precedence edges: array subscript analysis, inter-procedural pointer analysis
Contains the loop back edges, to avoid switching control back and forth between processor and coprocessor
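If-conversion, the transformation that merges a loop body's basic blocks into one hyperblock, replaces a branch with a boolean predicate so both paths become straight-line code. A sketch of the idea in C (both functions compute the same thing; only the second is branch-free):

```c
#include <stdint.h>

/* Branching form: two basic blocks in the loop body. */
int32_t clamp_branch(int32_t x, int32_t hi) {
    if (x > hi)
        return hi;
    return x;
}

/* If-converted form: the predicate p selects the result, so the
 * whole body is a single straight-line block that the array can
 * evaluate with its boolean control values. */
int32_t clamp_predicated(int32_t x, int32_t hi) {
    int32_t p = (x > hi);          /* boolean predicate, 0 or 1 */
    return p * hi + (1 - p) * x;   /* select without a branch   */
}
```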
Hyperblock (Cont.)
Loops whose speedup does not make up for the overhead are rejected, based on profiling and execution-time estimates.
In exceptional exit cases, execution continues on the processor; these occur only a small fraction of the time.
Optimization Techniques
Speculative loads: crucial for pipelining
Pipelining: handles loop-carried dependencies and simultaneous memory accesses
Memory queues: 3 memory queues for buffering, reading ahead, and writing behind; non-cache-allocating
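The pipelining point is easiest to see by contrasting two loop shapes. In the first, iterations are independent and can be fully overlapped; in the second, the accumulator carries a dependency from one iteration to the next, which is what the compiler's pipelining support has to handle:

```c
#include <stdint.h>

/* No loop-carried dependency: every iteration is independent,
 * so iterations can be overlapped (pipelined) freely. */
void add_arrays(const int32_t *a, const int32_t *b,
                int32_t *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Loop-carried dependency: acc flows from each iteration into the
 * next, constraining how deeply the loop can be pipelined. */
int32_t dot(const int32_t *a, const int32_t *b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}
```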
Configuration Synthesis
Module mapping: groups of nodes in the dataflow graph (DFG) are mapped to compound modules in the configuration, minimizing its size and critical path
Placement: connected modules are placed close to one another
Generating the bit-stream file
Simulation Results
32-row array; adapted UltraSPARC processor; cycle-accurate simulator modeling cache misses and interlocks.
Gzip Compression
Gzip has irregular memory accesses, which reduce parallelism and prevent pipelining.
Each loop executes for only a few cycles, so the overhead cost is more significant; the overhead negates the benefit.
Compilation Time & Code Expansion
Compilation time is typically much less than double that of compiling for software only.
Code size typically increases by 10 to 50 percent (wavelet benchmark: 16 percent).
Garp vs. UltraSPARC
UltraSPARC: a four-way superscalar at 167 MHz
Garp: implemented using the same VLSI process, at 133 MHz
Wavelet: Garp is 68% faster than the UltraSPARC
Gzip: the UltraSPARC is 14% faster than Garp
Future Work
More experiments over a broader range of benchmarks
Development of new optimizations
Finding out the strengths and weaknesses of the Garp architecture
Summary
The Garp architecture: processor + coprocessor on one die; configuration cache; memory queues; high-bandwidth, low-latency data access
A synthesizing C compiler for Garp