
COMPUTER SCIENCE & ENGINEERING

Compiled code acceleration on FPGAs

W. Najjar, B. Buyukkurt, Z. Guo, J. Villareal, J. Cortes, A. Mitra
Computer Science & Engineering, University of California, Riverside


Why? Are FPGAs a New HPC Platform?

Comparison of a dual-core Opteron (2.5 GHz) to Virtex-4 and Virtex-5 FPGAs on double-precision floating point:

Balanced allocation of adders, multipliers and registers
Use both DSP blocks and logic for multipliers, run at lower speed
Logic and wires for the I/O interfaces

(dp) GFLOP/s    Opteron   Virtex-4   Virtex-5
MAC                10       15.9       28.0
Mult                5       12.0       19.9
Add                 5       23.9       55.3

Watts              95        25        ~35

David Strensky, "FPGAs Floating-Point Performance -- a pencil and paper evaluation," in HPCwire.com
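Dividing the MAC throughput by the power figures gives a rough energy-efficiency comparison (a derived calculation, not on the original slide):

Opteron:  10   / 95 ≈ 0.11 (dp) GFLOP/s per watt
Virtex-4: 15.9 / 25 ≈ 0.64 (dp) GFLOP/s per watt
Virtex-5: 28.0 / 35 ≈ 0.80 (dp) GFLOP/s per watt

i.e. roughly a 6-8x advantage for the FPGAs on this metric.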


ROCCC

Riverside Optimizing Compiler for Configurable Computing

Code acceleration: by mapping circuits onto the FPGA, achieving the same speed as hand-written VHDL code

Improved productivity: allows design- and algorithm-space exploration

Keeps the user fully in control: we automate only what is very well understood


Challenges

The FPGA is an amorphous mass of logic; structure is provided by the code being accelerated, which is applied repeatedly to a large data set: streams

Languages reflect the von Neumann execution model: highly structured and sequential (control driven), with a vast, randomly accessible, uniform memory

CPUs (& GPUs)          FPGAs
Temporal computing     Spatial computing
Sequential             Parallel
Centralized storage    Distributed storage
Control flow driven    Data flow driven


ROCCC Overview

Limitations on the input code: no recursion, no pointers

[Compiler flow, from the slide's diagram: C/C++, Java, or SystemC source goes through high-level transformations (procedure, loop and array optimizations) into Hi-CIRRF; low-level transformations (instruction scheduling, pipelining and storage optimizations) lower it to Lo-CIRRF; code generation then emits VHDL or binary for targets such as an FPGA, CPU, GPU, DSP, or a custom unit]

CIRRF: Compiler Intermediate Representation for Reconfigurable Fabrics
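To make the "no recursion, no pointers" restriction concrete, here is a hedged sketch (not from the slides; the function names and the scaling loop are invented for illustration) contrasting a pointer-walking loop, which falls outside such a C subset, with the array-indexed, fixed-bound form that the FIR example on a later slide uses:

#define N 64

/* Pointer-walking form: relies on pointer arithmetic, so it falls
   outside a "no pointers" C subset. */
void scale_ptr(int *p, int n, int k) {
    int *end = p + n;
    while (p < end) {
        *p = *p * k;
        p = p + 1;
    }
}

/* Array-indexed form with a fixed loop bound: the loop shape that
   can be mapped onto an unrolled, pipelined datapath. */
void scale_array(int a[N], int k) {
    int i;
    for (i = 0; i < N; i = i + 1) {
        a[i] = a[i] * k;
    }
}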


A Decoupled Execution Model

[Figure, from the slide: input memory (on or off chip) -> memory fetch unit -> input buffer -> multiple loop bodies, unrolled and pipelined -> output buffer -> memory store unit -> output memory (on or off chip)]

Memory access is decoupled from the datapath
Parallel loop iterations, pipelined datapath
The smart buffer (input) does data reuse
Memory fetch and store units and the datapath are configured by the compiler
Off-chip accesses are platform specific
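To make "the smart buffer (input) does data reuse" concrete, here is a hedged software model (not ROCCC-generated code; the window size, tap values and helper names are made up for illustration) of a 3-element sliding-window buffer that fetches each input element from memory exactly once:

#include <stdio.h>

#define N 16

static int fetches = 0;                    /* counts simulated memory fetches */

static int mem_fetch(const int *mem, int i) {
    fetches = fetches + 1;
    return mem[i];
}

int main(void) {
    int A[N];
    int i;
    for (i = 0; i < N; i = i + 1)
        A[i] = i;

    /* Prime the window with the first two elements. */
    int w0 = mem_fetch(A, 0);
    int w1 = mem_fetch(A, 1);

    for (i = 0; i <= N - 3; i = i + 1) {
        int w2 = mem_fetch(A, i + 2);      /* one new fetch per iteration */
        int y  = 3*w0 + 5*w1 + 7*w2;       /* 3-tap window, reusing w0 and w1 */
        printf("y[%d] = %d\n", i, y);
        w0 = w1;                           /* slide the window */
        w1 = w2;
    }

    /* Without reuse this loop would issue 3 * (N - 2) = 42 fetches;
       with the sliding window it issues N = 16. */
    printf("total fetches = %d\n", fetches);
    return 0;
}

The "> 98% reduction in memory fetches on image codes" on the next slide is this effect, presumably over 2-D image windows where each pixel is shared by many window positions.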


So far, working compiler with …

Extensive optimizations and transformations, traditional and FPGA specific: systolic arrays, pipelined unrolling, look-up tables

Compile-time + hardware support for data reuse: > 98% reduction in memory fetches on image codes

Efficient code generation and pipelining: within 10% of hand-optimized HDL code

Import of existing IP cores: leverages a huge existing wealth, integrated with the C source code

Support for dynamic partial reconfiguration


Example: 3-tap FIR

/* The slide annotates the A[i], A[i+1], A[i+2] terms as the indices
   of A[] and the T[] entries as the filter coefficients. */
#define N 516
void begin_hw();
void end_hw();
int main() {
  int i;
  const int T[5] = {3, 5, 7};
  int A[N], B[N];
  begin_hw();
L1: for (i = 0; i <= (N-3); i = i + 1) {
    B[i] = T[0]*A[i] + T[1]*A[i+1] + T[2]*A[i+2];
  }
  end_hw();
}
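This loop body is the unit that gets unrolled and pipelined ("multiple loop bodies, unrolled and pipelined" on the earlier slide). As a rough software-level sketch of an unroll factor of 2 (not compiler output; the function name and unroll factor are chosen for illustration):

/* Hedged sketch: the 3-tap FIR loop above unrolled by 2.  In hardware
   the two bodies would be parallel, pipelined copies of the datapath;
   here they are simply written side by side in C. */
#define N 516

void fir_unrolled2(const int A[N], int B[N], const int T[3]) {
    int i;
    for (i = 0; i <= N - 4; i = i + 2) {
        /* body for iteration i */
        B[i]     = T[0]*A[i]   + T[1]*A[i+1] + T[2]*A[i+2];
        /* body for iteration i+1, sharing A[i+1] and A[i+2] */
        B[i + 1] = T[0]*A[i+1] + T[1]*A[i+2] + T[2]*A[i+3];
    }
    /* remainder iteration when the iteration count is odd */
    for (; i <= N - 3; i = i + 1) {
        B[i] = T[0]*A[i] + T[1]*A[i+1] + T[2]*A[i+2];
    }
}

In the generated circuit the two bodies would be fed from the same input buffer, so the overlapping A[i+1] and A[i+2] reads are shared rather than re-fetched.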


RC Platform Models

[Figure, from the slide: three platform models built from CPUs, FPGAs, memory interfaces, SRAM, and a fast network — one with the FPGA attached to the CPU through the memory interface; one with the FPGA and its SRAM sharing the memory interface with multiple CPUs (cf. Model 2 on the next slide); and one with CPU + memory and FPGA + SRAM nodes connected over a fast network (cf. Model 3)]


What we have learned so far

Big speedups are possible: 10x to 1,000x on application codes over Xeon and Itanium (molecular dynamics, bio-informatics, etc.)

Works best with streaming data

New paradigms and tools are needed for spatio-temporal concurrency: algorithms, languages, compilers, run-time systems, etc.


Future? Very wide use of FPGAs

Why? High throughput (> 10x) AND low power (< 25%)

How? Mostly in Models 2 and 3, initially

Model 2: see Intel QuickAssist, Xtremedata & DRC; Model 3: SGI, SRC & Cray

Contingencies: the market brings the price of FPGAs down; some software stack becomes available (for savvy programmers, initially)

Potential: multiple “killer apps” (to be discovered)


Conclusion

We as a research community should be ready

Stamatis was

Thank you