Program Development Environments: Languages & Tools
Kris Gaj, George Mason University

SRC Programming Model

Microprocessor FPGA

main.c

function_1()

function_2()

ANSI C

function_1

function_2

macro_1(a, b, c)

macro_2(b, d)macro_2(c, e)

macro_3(s, t)

macro_1(n, b)macro_4(t, k)

FPGA

Macro_1

Macro_2 Macro_2

a

b c

d e

MAP C (subset of ANSI C)

I/O

I/O

Libraries of macros

VHDL

macro_1 macro_2 macro_3 macro_4 ...

C function for µP

C function for MAP

VHDL macro

SRC Program Partitioning

µP system

FPGA system

HLL

HDL

SRC Compilation Process

Objectfiles

Application sources Macro sources

MAP Compiler    µP Compiler

Logic synthesis

Place & Route

Linker

.v files

.bin files

.ngo files

.o files .o files

Application executable

Configuration bitstreams

HDL sources

Netlists

.c or .f files .vhd or .v files

.mc or .mf files

SRC Libraries of Hardware Macros

User libraries of hardware macros developed by GWU/GMU/USC 2002-2006

• Secret-key cipher encryption & breaking
• Binary Galois Field arithmetic (polynomial basis & normal basis representation)
• Elliptic Curve Arithmetic
• Long integer modular arithmetic (RSA)
• Sorting
• Image processing
• Bioinformatics

See http://hpc.gwu.edu/library

Vendor libraries of hardware macros

• basic integer and floating-point arithmetic
• digital signal processing



Library

Object

Sheets

Star Bridge Programming Environment - Viva

Place & Route

.bin files

.ngo files

Applicationexecutable

Configurationbitstreams

Netlists

Star Bridge Compilation Process

VIVA

Graphical User Interface

User input

Xilinx

Cray XD1 Programming Flows

Source: [Cray, MAPLD05]

Synthesis

process (a, m) is
begin
    z <= a and m;
end process;

int mask(a, m)
{
    return (a & m);
}

VHDL/Verilog Synthesis

Mitrion-C

VHDL,Verilog

Mentor Graphics, Synopsys, Synplicity

Xilinx

a

mz

01001011010101010101101010010100010101101010100101010101

MATLAB/Simulink

The MathWorks

StandardFlow

Mitrion

High-levelFlow

SystemGenerator

Xilinx

Xilinx

Place & Route

Gate-level EDIF

VHDL or Verilog

Xtreme DSP Design Flow

HDL-based SGI Altix Programming Flow

IA-32 Linux

Machine

Design iterations

Design Entry(Verilog, VHDL)

Design Synthesis(Synplify Pro,

Amplify)

Design Implementation

(ISE)

Design Verification

Behavioral Simulation(VCS, Modelsim)

Static Timing Analysis(ISE Timing Analyzer)

.v, .vhd.v, .vhd

.edf

.ncd, .pcf

.bin

MetadataProcessing

(Python)

.v, .vhd

.cfg

Altix Device Programming(RASC Abstraction Layer,

Device Manager, Device Driver)

Real-time Verification

(gdb)

.c

IA-32 Linux

Machine

RTL Generation and Integration with Core Services


Amplify)

Design Verification



.v, .vhd

.v, .vhd

.edf

.ncd, .pcf

.bin

MetadataProcessing

(Python)

.v, .vhd

.cfg




(gdb)

.c

Design Implementation(ISE)

HLL Design Entry(Handel-C, Mitrion C, Viva)

HLL-based SGI Altix Programming Flow

Mitrion-C Programming Model for Cray & SGI

Microprocessor FPGA

main.c

function_1(in1)start_fpga()

ANSI Cbased on Mitrion

API

FPGA

I/O

RAM

Application code

(platform independent)

Mitrion Distributed Processor Architecture(platform dependent)

Mitrion Compiler& Configurator

application on the

distributed processor

Input &output

Mitrion-C

VHDL

function_1(in2)start_fpga()

Compiling A Mitrion Program

ProcessorConfigurator

ProcessorArchitecture

Mitrion-CSource code

ProcessorHW-Design

(VHDL IP Core)

FPGA

Mitrion Software Development Kit

Simulator& Debugger

ProcessorMachine-code

Compiler

The Mitrion Platform

1) The Mitrion Virtual Processor
   – A fine-grain, massively parallel, configurable soft-core processor
   – 10-30 times faster than traditional CPUs

2) The Mitrion-C programming language
   – An intrinsically parallel C-family language

3) The Mitrion Software Development Kit
   – Compiler
   – Debugger/Simulator
   – Processor configurator

A New Processor Architecture Specifically For FPGAs

int:48<30> main()
{
    int:48 prev = 1;
    int:48 fib = 1;

    int:48<30> fibonnacci = for (i in <1..30>)
    {
        fib = fib+prev;
        prev = fib;
    } <>fib;

} fibonnacci;

?

Architecture design goal:
• High silicon utilization
• Take advantage of FPGA reconfigurability

Goal achieved by:
• Allowing the processor to be massively parallel
• Allowing the processor to be fully adapted to the algorithm

Processor Architecture: A Cluster-On-A-Chip

• Non-von Neumann architecture
• Processor architecture more like a cluster
• Very fine-grain parallelism
  – Normal clusters run a block of code on each PE (1)
  – Mitrion runs a single instruction on each PE
  – Each PE is adapted to optimally run its instruction
• Network topology specific to the algorithm
• No instruction stream; instead, a data stream

(1) PE = Processing Element

A C-family Language

• Basic syntax is the same as for other C-family languages

• Examples:
  – Blocks are surrounded by { }
  – Assignment with =
  – Statements end with ;
  – if, for, while
  – Most of the usual C operators
  – C-style comments (though nestable)

Types

• Basic types
  int/uint       signed/unsigned integer
  boolean        boolean value (true/false)
  float          floating-point real value
  bits           bit vector format

• Free bit width
  int:24         24-bit signed integer
  uint:19        19-bit unsigned integer
  float:24.8     IEEE-754 single-precision float

• Collections
  int:24[100]    vector (indexable collection)
  int:14<100>    list (no index)

Language constructs

Operators

if(a>b) ...

while(i<10) ...

for(i in <0..999>) ...

foreach (e in vector) ...

int:8 function(int:8 a) ...

A C-family Language

• Important differences
  – No pointers
  – No dynamic allocation
  – Static general recursion only
    • Though loop structures may be dynamic

Compiler, Simulator And Debugger


Hardware

Software

GraphicalData FlowDiagram

HLLHDL

Increased productivity

Increased capability to describe parallel execution

Program Entry for FPGA Accelerator Boards

Traditional

Extended(e.g.Corefire) Hardware

Software




Star Bridge Hardware

Software

porting EDIF

COMobjects

Program Entry for Reconfigurable Computers

Hardware

SoftwareSRC

HLLHDLGraphical Data FlowDiagram

HDL macros




CrayXD1withSimulink Hardware

Software

Program Entry for Reconfigurable Computers

Hardware

SoftwareSGIor CraywithMitrion

HLLHDLGraphical Data FlowDiagram

Mitrion Processor

Mitrion-C

Xilinx System Generator

Simulink


General hierarchy of library files suggested

by SRC Computers Inc.


Structure of the SRC macro repository

< top of repository >

<lib # 1 >

common rev_d rev_e

hdlfile InfoFile BlkBoxFile

macro1 macro2 macro3

< macros >

<lib # 2 > <lib # 3 >

rev_f

DebugCodeFile

DataSheet


Files describing an SRC macro

Platform independent
– HDL file: macro.v or macro.vh
  • Verilog or VHDL code defining the macro
– Debug Code File: macro.c
  • provides the equivalent C functionality for the macro
– Data sheet file: datasheet
  • contains the documentation for the macro

Platform dependent
– Blk Box File: blackbox.v
  • Interface (black box) definition for the macro in Verilog
– Info File: info
  • Info file entry for this macro


Library Development - SRC

HLL (C, Fortran)

HDL (VHDL, Verilog)

P system

FPGA system

ApplicationProgrammer

LibraryDeveloper

HLL (C, Fortran)

HLL (C, Fortran)

LLL (ASM)

HLL (C, Fortran)


Library Development - StarBridge

GDF (Viva)

HDL (VHDL, Verilog)

P system

FPGA system

ApplicationProgrammer

LibraryDeveloper

GDF (Viva)

GDF (Viva)

HLL, LLL (C++, ASM)

GDF (Viva)


Software libraries and their role in the development

of SRC libraries


1. source of test vectors for VHDL macros

2. emulation of hardware during debugging

3. performance comparison

Roles of software libraries


1. Identify class of applications

2. Identify basic operations required by your applications

3. Determine the existence of the RC library of such operations

4. Determine the existence of the microprocessor library of such operations

5. Determine the right granularity for the required library operations

How to approach porting your application to reconfigurable computers?

1. input/output intensive applications
   • bulk data encryption (DES, IDEA, and RC5 encryption)

2. computationally intensive applications
   • secret-key cipher breaking based on exhaustive key search (DES, IDEA, RC5 breakers)
   • public-key cipher breaking based on factoring

3. latency-critical applications
   • cipher key agreement and signature (ECC schemes, RSA)

Classes of applications

Example 1
Cryptography:

High-throughput encryption

Cipher

message

ciphertext

cryptographickey

K bits

Secret-key ciphers

key of Alice and Bob - KABkey of Alice and Bob - KAB

Alice Bob

Network

Encryption Decryption

High-Throughput Encryption

Encryption

Mi

Mi+1

Mi+2

Ci

Ci+1

Ci+2

. . . .

K0 Encryption algorithms:

DES, 3DES, AES, RC5, IDEA, etc.

Fully Pipelined Architecture

. . . .

. . . .

. . . .

Loop unrolling

Pipeline stages inside of cipher rounds

New input & new output every clock cycle

. . . .

Round 1

Round 2

Round k

. . .

Encryption on SRC-6 – No streaming
encryption.mc (1)

#include <libmap.h>

void encryption (uint64_t sdata[], uint64_t key, uint64_t *hardware_timein,
                 uint64_t *hardware_timeprocess, uint64_t *hardware_timeout,
                 int mapnum)
{
    /* banks A-C hold the input blocks, banks D-F the results */
    OBM_BANK_A (S1OBM, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_B (S2OBM, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_C (S3OBM, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_D (S4OBM, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_E (S5OBM, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_F (S6OBM, uint64_t, MAX_OBM_SIZE)
    uint32_t encrypt_decrypt;   // 0: encrypt, 1: decrypt
    int i, nbytes;
    uint64_t t1, t2, t3, t4;

    encrypt_decrypt = 0;
    nbytes = MAX_OBM_SIZE * 8 * 3;
    start_timer();

    /* data transfer in: host memory to OBM banks A, B, C */
    read_timer(&t1);
    DMA_CPU(CM2OBM, S1OBM, MAP_OBM_stripe(1,"A,B,C"), sdata, 1, nbytes, 0);
    wait_DMA(0);
    read_timer(&t2);

    /* three DES units working in parallel, one per input bank */
    for (i = 0; i < MAX_OBM_SIZE; i++) {
        des (S1OBM[i], key, encrypt_decrypt, &S4OBM[i]);
        des (S2OBM[i], key, encrypt_decrypt, &S5OBM[i]);
        des (S3OBM[i], key, encrypt_decrypt, &S6OBM[i]);
    }
    read_timer(&t3);

    /* data transfer out: OBM banks D, E, F to host memory */
    DMA_CPU(OBM2CM, S4OBM, MAP_OBM_stripe(1,"D,E,F"), sdata, 1, nbytes, 5);
    wait_DMA(5);
    read_timer(&t4);

    *hardware_timein = t2-t1;
    *hardware_timeprocess = t3-t2;
    *hardware_timeout = t4-t3;
}

Encryption on SRC-6 – No streaming
des_blkbx.v

module des ( desOut, desIn, keyin, decrypt, clk )
    /* synthesis syn_black_box syn_noprune=1 */ ;

    output [63:0] desOut;
    input  [63:0] desIn;
    input  [63:0] keyin;
    input         decrypt;
    input         clk /* synthesis syn_noclockbuf=1 */ ;

endmodule

Encryption on SRC-6 – No streaming
des.info (1)

BEGIN_DEF "des"
    MACRO = "des";
    LATENCY = 17;
    STATEFUL = NO;
    EXTERNAL = NO;
    PIPELINED = YES;

    INPUTS = 3:
        I0 = INT 64 BITS (desIn[63:0])
        I1 = INT 64 BITS (keyin[63:0])
        I2 = INT 32 BITS (decrypt)
        ;

    OUTPUTS = 1:
        O0 = INT 64 BITS (desOut[63:0])
        ;

    IN_SIGNAL : 1 BITS "clk" = "CLOCK";

Encryption on SRC-6 – No streaming
des.info (2)

    DEBUG_HEADER = $
        void des__dbg (long long desin, long long keyin, int decrypt, long long *desout);
    $;

    DEBUG_FUNC = $
        #include <des.h>
        void des__dbg(long long desin, long long keyin, int decrypt, long long *desout)
        {
            des_(desout, &desin, &keyin, &decrypt);
        }
    $;
END_DEF

Encryption on SRC-6 – with streaming
encryption.mc

#include <libmap.h>

void encryption (uint64_t sdata[], uint64_t key, uint64_t *hardware_timeprocess,
                 uint64_t *hardware_timeout, int mapnum)
{
    OBM_BANK_A (S1OBM, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_B (S2OBM, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_D (S4OBM, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_E (S5OBM, uint64_t, MAX_OBM_SIZE)
    uint32_t encrypt_decrypt;   // 0: encrypt, 1: decrypt
    int i, nbytes;
    uint64_t t1, t2, t3;
    Stream_64 S0, S1;
    uint64_t v0, v1;

    encrypt_decrypt = 0;
    nbytes = MAX_OBM_SIZE * 8 * 2;

    start_timer();

    read_timer(&t1);

    #pragma src parallel sections
    {
        /* section 1: stream the input data in over DMA */
        #pragma src section
        {
            stream_dma_cpu_dual (&S0, &S1, PORT_TO_STREAM, S1OBM, DMA_A_B, sdata, 1, nbytes);
        }

        /* section 2: encrypt blocks as they arrive on the two streams */
        #pragma src section
        {
            for (i = 0; i < MAX_OBM_SIZE; i++) {
                get_stream (&S0, &v0);
                get_stream (&S1, &v1);

                des (v0, key, encrypt_decrypt, &S4OBM[i]);
                des (v1, key, encrypt_decrypt, &S5OBM[i]);
            }
        }
    }

    read_timer(&t2);

    /* data transfer out: OBM banks D, E to host memory */
    DMA_CPU(OBM2CM, S4OBM, MAP_OBM_stripe(1,"D,E"), sdata, 1, nbytes, 5);
    wait_DMA(5);
    read_timer(&t3);

    *hardware_timeprocess = t2-t1;
    *hardware_timeout = t3-t2;
}

Results
SRC-6 without streaming

All throughputs in Mbits/s; ciphers operate on 64-bit blocks.

Application      Computational   Data transfer in   Data transfer out   End-to-end   End-to-end       Speed-up
                 (SRC-6)         (SRC-6)            (SRC-6)             (SRC-6)      (Xeon 2.8 GHz)
3 DES ciphers    19,200          11,350             10,760              4,240        93               46
3 IDEA ciphers   19,200          11,350             10,760              4,240        113              38
3 RC5 ciphers    19,200          11,350             10,760              4,240        560              7.5
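The figures for the non-streaming case can be cross-checked with a small model. This is only a sketch under stated assumptions, not part of the original measurements: it assumes that each fully pipelined cipher unit accepts one 64-bit block per clock cycle, that the user logic runs at 100 MHz, and that without streaming the three phases (transfer in, computation, transfer out) run strictly one after another. Both function names are hypothetical.

```c
#include <assert.h>

/* Peak throughput in Mbit/s of `units` fully pipelined cipher cores,
   each accepting one block of `block_bits` bits per clock cycle. */
double pipelined_throughput_mbps(int units, int block_bits, double clock_mhz)
{
    assert(units > 0 && block_bits > 0 && clock_mhz > 0.0);
    return units * block_bits * clock_mhz;
}

/* End-to-end throughput when the phases do not overlap: the time needed
   per Mbit is the sum of the times needed by each phase. */
double serial_end_to_end_mbps(const double *phase_mbps, int n_phases)
{
    double time_per_mbit = 0.0;
    assert(n_phases > 0);
    for (int i = 0; i < n_phases; i++) {
        assert(phase_mbps[i] > 0.0);
        time_per_mbit += 1.0 / phase_mbps[i];
    }
    return 1.0 / time_per_mbit;
}
```

With three units, 64-bit blocks and 100 MHz the first function returns 19,200 Mbit/s. Feeding 11,350, 19,200 and 10,760 Mbit/s into the second gives roughly 4,290 Mbit/s, close to the measured end-to-end value of 4,240 Mbit/s.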

Results
SRC-6 with streaming (3 units)

All throughputs in Mbits/s; ciphers operate on 64-bit blocks.

Application      Computational   Data transfer in   Data transfer out   End-to-end   End-to-end       Speed-up
                 (SRC-6)         & processing       (SRC-6)             (SRC-6)      (Xeon 2.8 GHz)
                                 (SRC-6)
3 DES ciphers    NA              9,000              10,760              4,800        93               52
3 IDEA ciphers   NA              9,000              10,760              4,800        113              42.5
3 RC5 ciphers    NA              9,000              10,760              4,800        560              8.5

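Results
SRC-6 with streaming (2 units)

All throughputs in Mbits/s; ciphers operate on 64-bit blocks.

Application      Computational   Data transfer in   Data transfer out   End-to-end   End-to-end       Speed-up
                 (SRC-6)         & processing       (SRC-6)             (SRC-6)      (Xeon 2.8 GHz)
                                 (SRC-6)
2 DES ciphers    NA              11,350             10,760              5,400        93               58
2 IDEA ciphers   NA              11,350             10,760              5,400        113              47.5
2 RC5 ciphers    NA              11,350             10,760              5,400        560              9.5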

SGI Altix MOATB without streaming

All throughputs in Mbits/s; ciphers operate on 64-bit blocks.

Application      Computational      Data transfer in   Data transfer out   End-to-end   End-to-end       Speed-up
                 (Altix)            (Altix)            (Altix)             (Altix)      (Xeon 2.8 GHz)
1 DES cipher     12,800 (200 MHz)   NA                 NA                  2,430        93               26
1 IDEA cipher    6,400 (100 MHz)    NA                 NA                  2,040        113              18
1 RC5 cipher     12,800 (200 MHz)   NA                 NA                  2,430        560              4.5

SGI Altix MOATB with streaming

All throughputs in Mbits/s; ciphers operate on 64-bit blocks.

Application      Computational      Data transfer in   Data transfer out   End-to-end   End-to-end       Speed-up
                 (Altix)            (Altix)            (Altix)             (Altix)      (Xeon 2.8 GHz)
1 DES cipher     12,800 (200 MHz)   NA                 NA                  3,080        93               33
1 IDEA cipher    6,400 (100 MHz)    NA                 NA                  2,480        113              22
1 RC5 cipher     12,800 (200 MHz)   NA                 NA                  3,080        560              5.5

Example 2
Cryptography:

Cipher Breaking

Secret-key cipher breaking

Given:

Looked for:

Method:

remaining plaintext

ciphertext

or key

guessed fragment of the plaintext

exhaustive key search (brute-force) attack

successivekeys cipher

Secret-key cipher breaking

Cipherbreaker

M0 C0

…

K1 K2 K3 KN

Generated by the cipher breaker

Negligibly smallinput/output

Huge amountof computations

Correct key

Message – Ciphertext pair
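The exhaustive key search described above can be sketched in a few lines. This is a minimal illustration, not the DES, IDEA or RC5 breakers from the slides: `toy_encrypt` is a made-up, insecure 16-bit-key cipher used only as a stand-in, and both function names are hypothetical. The structure is the point: the breaker generates successive keys itself, so the only input is one message/ciphertext pair and the only output is the matching key.

```c
#include <assert.h>
#include <stdint.h>

/* Toy cipher with a 16-bit key (assumption for illustration only;
   it is NOT secure and NOT one of the ciphers used in the slides). */
uint32_t toy_encrypt(uint32_t message, uint16_t key)
{
    uint32_t k = (uint32_t)key * 0x00010001u;  /* key repeated in both halves */
    uint32_t x = message ^ k;
    return (x << 7) | (x >> 25);               /* rotate left by 7 bits */
}

/* Try successive keys until one maps `message` to `ciphertext`.
   Returns 1 and stores the key on success, 0 if no key matches. */
int exhaustive_key_search(uint32_t message, uint32_t ciphertext,
                          uint16_t *found_key)
{
    assert(found_key != 0);
    for (uint32_t k = 0; k <= 0xFFFFu; k++) {
        if (toy_encrypt(message, (uint16_t)k) == ciphertext) {
            *found_key = (uint16_t)k;
            return 1;
        }
    }
    return 0;
}
```

In hardware, each loop iteration maps onto one pipelined cipher unit, and several units can test disjoint parts of the key space in parallel, which is why the input/output volume stays negligible while the computation is huge.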

Cipher Breaking Results - SRC-6

All throughputs in million keys/s.

Application                                          Theoretical maximum        Measured end-to-end   End-to-end       Speed-up
                                                     computational throughput   throughput            throughput
                                                     (SRC-6)                    (SRC-6)               (Xeon 2.8 GHz)
DES cipher breaking (20 units working in parallel)   2000                       2000                  1.77             1130
IDEA cipher breaking                                 1000                       1000                  2.19             457
RC5 cipher breaking (2 units working in parallel)    200                        200                   0.71             282

Cipher Breaking Results
SGI Altix MOATB

All throughputs in million keys/s.

Application            Theoretical maximum        Measured end-to-end   End-to-end       Speed-up
                       computational throughput   throughput            throughput
                       (SGI)                      (SGI)                 (Xeon 2.8 GHz)
DES cipher breaking    2000                       2000                  1.77             1130

Example 3
Cryptography:

Key exchange using ECC

Secret-key ciphers

key of Alice and Bob - KABkey of Alice and Bob - KAB

Alice Bob

Network


Key Distribution Problem

$N$ users require $\frac{N \cdot (N-1)}{2}$ keys

Users    Keys
100      ≈ 5,000
1000     ≈ 500,000
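The key counts in the table follow directly from the pairwise formula. A minimal sketch (the function name `pairwise_keys` is an assumption made for this illustration):

```c
#include <assert.h>

/* Number of distinct secret keys needed so that every pair among
   n_users shares its own key: n * (n - 1) / 2. */
unsigned long long pairwise_keys(unsigned long long n_users)
{
    assert(n_users >= 1);
    return n_users * (n_users - 1) / 2;
}
```

For 100 users this gives 4,950 keys and for 1000 users 499,500 keys, which the slide rounds to 5,000 and 500,000.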

Public Key (Asymmetric) Ciphers

Public key of Bob - KB    Private key of Bob - kB

Alice Bob

Network


Alice Bob session key

(random secret-key)

Bob’s public key

Key exchange for secret-key ciphers

Bob’s private key

Network

Session keyencrypted using Bob’s public key

Message encrypted using session key

Message

Hash function

Public keycipher

AliceSignature

Alice’s private key

Bob

Hash function

Alice’s public key

Digital Signature

Hash value 1

Hash value 2

Hash value

Public key cipher

yes no

Message Signature

Why is public-key cryptography a good application for reconfigurable computers?

• computationally intensive arithmetic operations

• unconventionally long operand sizes (160-2048 bits)

• multiple algorithms, parameters, key sizes, and architectures = need for reconfiguration

Elliptic Curve Cryptosystems (ECC)

• a family of cryptosystems, rather than a single cryptosystem = added security but need for reconfiguration

• public key (asymmetric) cryptosystems

• used for key agreement and digital signatures

• implementations must be optimized for minimum latency rather than maximum throughput = limited speed-up from parallel processing

Basic operations of ECC

Basic operations in Galois Field $GF(2^m)$
• addition and subtraction (XOR): $x + y$, $x - y$
• multiplication, squaring: $x \cdot y$, $x^2$
• inversion: $x^{-1}$

Basic operations on points of an Elliptic Curve
• addition of points: $P + Q$
• doubling a point: $2P$
• projective to affine coordinate: P2A

Complex operations on points of an Elliptic Curve
• scalar multiplication: $kP = \underbrace{P + P + \dots + P}_{k \text{ times}}$
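The two lowest layers of this hierarchy can be illustrated with a short sketch. Everything here is an assumption made for illustration, not the library code from the slides: the field is the small $GF(2^8)$ with the reduction polynomial $x^8 + x^4 + x^3 + x + 1$ (the slides use much larger fields), and the double-and-add routine works on a toy additive group (integers modulo a small number) in place of real elliptic-curve points. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Addition and subtraction in GF(2^m) are both a bitwise XOR. */
uint8_t gf256_add(uint8_t x, uint8_t y)
{
    return x ^ y;
}

/* Shift-and-add multiplication in GF(2^8), polynomial basis,
   reduced modulo x^8 + x^4 + x^3 + x + 1 (0x11B). */
uint8_t gf256_mul(uint8_t x, uint8_t y)
{
    uint8_t product = 0;
    while (y != 0) {
        if (y & 1u)
            product ^= x;           /* add the current multiple of x */
        uint8_t carry = x & 0x80u;  /* will x overflow degree 7? */
        x = (uint8_t)(x << 1);      /* multiply x by the polynomial "x" */
        if (carry)
            x ^= 0x1Bu;             /* reduce modulo the field polynomial */
        y >>= 1;
    }
    return product;
}

/* Double-and-add: computes k*P with O(log k) group operations instead
   of k-1 additions. Toy group: integers modulo `modulus` under addition. */
uint32_t scalar_mul_toy(uint32_t k, uint32_t p, uint32_t modulus)
{
    assert(modulus > 0 && modulus < 0x80000000u);
    uint32_t result = 0;                             /* group identity */
    uint32_t addend = p % modulus;
    while (k != 0) {
        if (k & 1u)
            result = (result + addend) % modulus;    /* plays the role of P + Q */
        addend = (addend + addend) % modulus;        /* plays the role of 2P */
        k >>= 1;
    }
    return result;
}
```

The same loop structure carries over to real curve points: only the "add" and "double" steps change, and each of them is built from the field operations above.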

Hierarchy of ECC functions

kP

P+Q 2P projective_to_affine (P2A)

MUL

INV

High level

Medium level

Low level 2

ROTXORLow level 1

C function for P

C function for MAP

VHDLmacro

SRC Program Partitioning

P system

FPGA system

HLL

HDL

Investigated Partitioning Schemes

kPC function for P

C function for FPGA

VHDLmacro

μP Software Only

Based on public-domain code by Rosing M., Implementing Elliptic Curve Cryptography, Manning, 1999

MUL4

C function

for FPGA

VHDL

macrosROTROT XOR

C function

for µP0

H

L1V_ROTVARROT

kPP2A

kP

P+Q 2P

MUL2MUL2 MULMUL

0HL1 Partitioning

INVINV

P2AP+Q2P

MUL4

C function

for FPGA

VHDL

macrosROTROT XOR

C function

for µP0

H

V_ROTINV

kPP2A

kP

P+Q 2P

MUL2MUL2

0HL2 Partitioning

P2AP+Q2P

L2

0HM Partitioning

C function

for FPGA

VHDL

macros

C function

for µP

0

H

MP+QP+Q 2P2P P2AP2A

kPkP

kP

0

0

H

00H Partitioning (VHDL only)

C function for P

C function for FPGA

VHDLmacro

Timing Measurements

MAPAlloc.

MAP

FreeDMA

DataOut

DMA

Data In

FPGA

Computation

.c file

.mc file

End-to-End time (SW)

MAPfunction

MAP function

FPGA

Configure

Configuration time

MAP

Allocation

time

MAP

Release

Time

End-to-End time (HW)

Results (Latency)

Times in µs (the unit shown on the accompanying chart axis).

System-level    End-to-end   Data transfer   FPGA computation   Data transfer   Total      Speed-up
architecture    time         in time         time               out time        overhead   vs. software
Software        772,519
0HL1            866          37              472                14              394        893
0HL2            863          37              469                14              394        895
0HM             592          37              201                12              391        1305
VHDL macro      592          39              201                17              391        1305

[Chart: end-to-end time and FPGA computation time in µs (axis 0 to 900) for the 0HL1, 0HL2, 0HM and 00H (VHDL) architectures]
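The derived columns of the latency results can be recomputed from the measured times. The sketch below assumes, based on the numbers themselves, that total overhead is the end-to-end time minus the FPGA computation time and that speed-up is the software time divided by the end-to-end time; the function names are hypothetical.

```c
#include <assert.h>

/* Overhead = everything in the end-to-end time that is not FPGA computation. */
double total_overhead_us(double end_to_end_us, double fpga_compute_us)
{
    assert(end_to_end_us >= fpga_compute_us);
    return end_to_end_us - fpga_compute_us;
}

/* Ratio of a reference time to a measured end-to-end time. */
double speedup(double reference_us, double end_to_end_us)
{
    assert(end_to_end_us > 0.0);
    return reference_us / end_to_end_us;
}
```

For 0HL1, `total_overhead_us(866, 472)` gives 394 µs, matching the reported overhead. `speedup(772519, 592)` gives about 1305 for the pure VHDL macro, and `speedup(866, 592)` gives about 1.46, i.e. the 0HL1 scheme takes about 46% longer end to end than pure VHDL.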

Results (Area)

System-level    % of CLB slices   CLB increase     % of LUTs         LUT increase     % of FFs          FF count increase
architecture    (out of 33,792)   vs. pure VHDL    (out of 67,584)   vs. pure VHDL    (out of 67,584)   vs. pure VHDL
Software        N/A
0HL1            99                1.68             57                1.3              68                2.61
0HL2            92                1.56             52                1.18             62                2.38
0HM             75                1.27             48                1.09             39                1.5

[Chart: percentage of CLB slices, LUTs and FFs used (axis 0 to 100 %) for the 0HL1, 0HL2, 0HM and 00H architectures]

78

185

349

371

MAP C

15326010070HL1

153

153

153

Main C

1601744

2301291

36

Macro Wrapper

0HM

1960

VHDL

VHDL macro

0HL2

Algorithm Partitioning Scheme

Number of lines of code

Conclusions

Assuming focus on:

Timing Resources

Ease of programming

Conclusions – cont.

The best implementation approach: the 0HL1 partitioning scheme

893x speed-up vs. software and only a 1.46x slowdown (46% longer end-to-end time) versus pure VHDL, with ease of implementation

MUL4

C functionfor MAP

VHDL macros

ROTROT XOR

C function for µP

0

H

L1V_ROTV_ROT

kP

INV

P2AP+Q 2P

kP

INV

P2AP+QP+Q 2P2P

MUL2MUL2 MULMUL

Conclusions – cont.

• Elliptic Curve Cryptosystem implementation is challenging for reconfigurable computers because of
  – optimization for latency rather than throughput
  – limited amount of parallelism

• First publication showing a 1000x speed-up for a reconfigurable computer application optimized for data latency

Summary of results

Type of application                                         End-to-end speed-up of SRC-6 vs. P4
Computationally intensive (cipher breaking)                 300-1100
Latency critical (ECC key exchange)                         890-1300
Input/output intensive (secret key encryption/decryption)   10-60

GWU_GMU secret key cipher libraries

1. Secret key cipher encryption and decryption

2. Secret key cipher breaking

• DES• IDEA• RC5

• DES• IDEA• RC5

GWU_GMU public key cipher libraries

1. Operations in the binary Galois Fields $GF(2^m)$

a. polynomial basis b. normal basis

2. Multiprecision integer arithmetic

3. Elliptic Curve Operations

- addition - doubling - scalar multiplication


Example 4
Image Processing:

Hyperspectral Dimension Reduction


Multi-Spectral Imagery 10’s of bands

Hyperspectral Imagery 100’s-1000’s of bands

Challenges - Curse of Dimensionality

Solution On-Board Dimension

Reduction Needs

Higher performance Higher flexibility

Multispectral / Hyperspectral Imagery Comparison

High-Performance Reconfigurable Computing

Application: Hyperspectral Dimension Reduction


Hyperspectral Dimension Reduction(Techniques)

Principal Component Analysis (PCA):
• Most common method of dimension reduction
• Complex and global computations: difficult for parallel processing and hardware implementations
• Does not preserve spectral signatures

Wavelet-Based Dimension Reduction*:
• Simple and local operations
• High-performance implementation
• Preserves spectral signatures

Multi-resolution wavelet decomposition of each pixel's 1-D spectral signature (preservation of spectral locality)

* S. Kaewpijit, J. Le Moigne, T. El-Ghazawi, “Automatic Reduction of Hyperspectral Imagery Using Wavelet Spectral Analysis”, IEEE Transactions on Geoscience and Remote Sensing, Vol. 41, No. 4, April 2003, pp. 863-871.


Discrete Wavelet Transform (DWT) Decomposition (Mallat Algorithm)

• The input image is first convolved along the rows by the two filters L and H and decimated along the columns by two, resulting in two "column-decimated" images, L and H

• Each of the two images, L and H, is then convolved along the columns by the two filters L and H and decimated along the rows by two

• This decomposition results in four images: LL, LH, HL and HH

• The LL image is taken as the new input to perform the next level of decomposition
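The filter-and-decimate step at the heart of this decomposition can be sketched for a single 1-D signal, such as one pixel's spectral signature. This is a minimal illustration under assumptions, not the implementation from the slides: it uses the standard 4-tap Daubechies low-pass coefficients (written out as decimal constants), derives the high-pass filter by the usual alternating-sign reversal, and wraps around at the signal boundary (periodic extension). The function name `dwt_d4_level` is hypothetical.

```c
#include <assert.h>

/* One level of a 1-D DWT: convolve with low-pass filter h and high-pass
   filter g, keeping every second output (decimation by two).
   in: n samples (n even, n >= 4); low, high: n/2 samples each. */
void dwt_d4_level(const double *in, int n, double *low, double *high)
{
    /* 4-tap Daubechies low-pass coefficients:
       (1+sqrt(3))/(4*sqrt(2)), (3+sqrt(3))/(4*sqrt(2)),
       (3-sqrt(3))/(4*sqrt(2)), (1-sqrt(3))/(4*sqrt(2)) */
    static const double h[4] = {
        0.4829629131445341,  0.8365163037378079,
        0.2241438680420134, -0.1294095225512604
    };
    /* High-pass filter: reversed low-pass filter with alternating signs. */
    const double g[4] = { h[3], -h[2], h[1], -h[0] };

    assert(n >= 4 && n % 2 == 0);
    for (int i = 0; i < n / 2; i++) {
        double lo = 0.0, hi = 0.0;
        for (int k = 0; k < 4; k++) {
            double x = in[(2 * i + k) % n];  /* periodic extension */
            lo += h[k] * x;
            hi += g[k] * x;
        }
        low[i] = lo;    /* approximation (L) coefficients */
        high[i] = hi;   /* detail (H) coefficients */
    }
}
```

Applying the function again to the `low` output gives the next decomposition level. For a 2-D image the same step is applied along the rows and then along the columns, producing the LL, LH, HL and HH images.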


Wavelet-Based Dimension Reduction(Description)


DWT on SRC-6

transfer coefficientsto OBM bank c

transfer image datato OBM bank a

load coefficients from bank c to

on-chip registers

transfer image data from bank b to the host

compute Wavelet

read one pixelfrom bank a

store result into bank b

End of Image

Yes

No

Read Data

MAP Alloc.

Map Free

Write Data

Measurements Scenario


DWT on SRC-6 (cnt’d)
(Main Program)

int main (int argc, char *argv[])
{
    . . .
    /* initialize daubechies coefficients */
    float coeff[8];
    coeff[0] = coeff[1] = (1+sqrt(3))/(4*sqrt(2));
    coeff[2] = coeff[3] = (3+sqrt(3))/(4*sqrt(2));
    coeff[4] = coeff[5] = (3-sqrt(3))/(4*sqrt(2));
    coeff[6] = coeff[7] = (1-sqrt(3))/(4*sqrt(2));
    . . .
    /* allocate images */
    . . .
    map_allocate(1);

    gettimeofday(&time0, NULL);

    proc_fpga (image_in, image_out, coeff, dx, dy, &ht0, &ht1, &ht2, &ht3, mapno);

    gettimeofday(&time1, NULL);

    /* print time difference */
    . . .
    map_free(1);
    . . .
}

Allocate the RP

• configure and start the Program execution on the FPGA

• passing the input image pointer and the output image buffer pointer to be used by DMA

• individual parameters can be passed to the MAP C function such as image dimensions

• large parameter array, such as the wavelet coefficients, can be transferred using DMA by passing the pointer of the coefficients array

Free the RP


DWT on SRC-6 (cnt’d)
MAP C Function (FPGA.mc)

void proc_fpga(int64_t image_in[], int64_t image_out[], int64_t coeff[],
               int dx, int dy,
               long long *ht0, long long *ht1, long long *ht2, long long *ht3,
               int mapnum)
{
    // coefficients
    float LP0, LP1, LP2, LP3, LP4;
    float HP0, HP1, HP2, HP3, HP4;

    // variables
    int i, j, k;
    int64_t in_pixel, out_pixel;

    // input image
    OBM_BANK_A (AL, int64_t, MAX_OBM_SIZE)

    // output image
    OBM_BANK_B (BL, int64_t, MAX_OBM_SIZE)

    // filter coefficients
    OBM_BANK_C (CL, int64_t, MAX_OBM_SIZE)


    start_timer();
    read_timer(ht0);

    // DMA Input Image transfer
    DMA_CPU (CM2OBM, AL, MAP_OBM_stripe (1, "A"), image_in, 1, Image_size, 0);
    wait_DMA (0);

    // DMA coefficients transfer
    DMA_CPU (CM2OBM, CL, MAP_OBM_stripe (1, "C"), coeff, 1, 4*sizeof(int64_t), 0);
    wait_DMA (0);

    read_timer(ht1);

    // shift the four coefficient pairs from the OBM bank into on-chip registers
    for (i = 0; i < 4; i++) {
        LP0 = LP1; HP0 = HP1;
        LP1 = LP2; HP1 = HP2;
        LP2 = LP3; HP2 = HP3;
        split_64to32_flt_flt(CL[i], &HP3, &LP3);
    }


transfer image datato an OBM bank

transfer coefficientsto an OBM bank

load coefficients from the OBM bank to on-chip registers


    for (i = 0; i < Image_Size; i++) {

        in_pixel = AL[i];

        { . . . }

        BL[i] = out_pixel;

    }
    read_timer(ht2);

    DMA_CPU (OBM2CM, BL, MAP_OBM_stripe(1,"B"), image_out, 1, Image_size, 0);
    wait_DMA (0);

    read_timer(ht3);
}


read pixel value from the OBM bank

compute Wavelet

store results to theOBM bank

transfer image datato the host


Overlapping Data Transfer with Computation(SRC-6)

#pragma src parallel sections
{
    #pragma src section
    {
        for (i = 0; i < MAX_OBM_SIZE; i++)
        {
            get_stream (&S0, &v0);
            /* DO COMPUTATION (Current Data Block) */
        }
    } /* end of parallel section with compute loop */

    #pragma src section
    {
        /* Stream DMA_IN */
        stream_dma_cpu_dual (&S0, &S1, PORT_TO_STREAM, S1OBM, DMA_A_B, sdata, 1, nbytes);
    } /* end of parallel section with DMA */
} /* end of parallel sections */

Time

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

Read DMA 1 2 3 X X

Algorithm X 1 2 3 X

Write DMA X X 1 2 3

Improve performance by overlapping algorithm computation with data loading and unloading

Parallel sections: multiple parallel code blocks are active in parallel
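The benefit of the overlap shown in the cycle table can be expressed as a simple count. This is a sketch under the assumption that every stage (read DMA, algorithm, write DMA) takes one cycle per data block; the function names are hypothetical.

```c
#include <assert.h>

/* Cycles needed when each block passes through all stages before the
   next block starts (no overlap). */
unsigned sequential_cycles(unsigned blocks, unsigned stages)
{
    return blocks * stages;
}

/* Cycles needed when the stages work on different blocks at the same
   time: the pipeline fills in (stages - 1) cycles, then one block
   completes per cycle. */
unsigned overlapped_cycles(unsigned blocks, unsigned stages)
{
    assert(stages > 0);
    return blocks == 0 ? 0 : blocks + stages - 1;
}
```

For the three blocks and three stages in the table, the overlapped schedule takes 5 cycles instead of 9, and the gap widens as the number of blocks grows.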


Stream_64 S0;

#pragma src parallel sections

{

#pragma src section

{

int i;

for (i=0; i<sz; i++)

put_stream (&S0, AL[i]+42, 1);

} /* end of parallel section */

#pragma src section

{

int i;

for (i=0; i<sz; i++)

get_stream (&S0, &BL[i]);

} /* end of parallel section */

} /* end of parallel sections */

Streams(SRC-6)

[Figure: conventional data flow versus streams. In the conventional data flow, Compute Loop 1 and Compute Loop 2 exchange data through on-board memory or BRAM. With streams, data flows in the logic between the two loops, which saves accesses to on-board memory.]

A stream is a data structure that allows flexible communication between concurrent producer and consumer loops
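The producer/consumer idea can be mimicked in plain C with a small FIFO. This is only a software analogy under assumptions, not the SRC stream implementation: the `toy_stream` type, its depth and all function names are hypothetical, and the two "concurrent" loops are interleaved in a single loop because ordinary C has no parallel sections.

```c
#include <assert.h>
#include <stdint.h>

#define STREAM_DEPTH 4  /* hypothetical FIFO depth */

typedef struct {
    int64_t slots[STREAM_DEPTH];
    int head;   /* index of the oldest element */
    int count;  /* number of elements currently stored */
} toy_stream;

/* Producer side: returns 0 if the stream is full (producer must wait). */
int toy_put(toy_stream *s, int64_t v)
{
    if (s->count == STREAM_DEPTH)
        return 0;
    s->slots[(s->head + s->count) % STREAM_DEPTH] = v;
    s->count++;
    return 1;
}

/* Consumer side: returns 0 if the stream is empty (consumer must wait). */
int toy_get(toy_stream *s, int64_t *v)
{
    if (s->count == 0)
        return 0;
    *v = s->slots[s->head];
    s->head = (s->head + 1) % STREAM_DEPTH;
    s->count--;
    return 1;
}

/* Interleaves one producer step and one consumer step per iteration,
   mirroring the put_stream/get_stream loops on the slide. */
void run_producer_consumer(const int64_t *al, int64_t *bl, int sz)
{
    toy_stream s = { {0}, 0, 0 };
    int produced = 0, consumed = 0;
    assert(sz >= 0);
    while (consumed < sz) {
        if (produced < sz && toy_put(&s, al[produced] + 42))
            produced++;
        if (toy_get(&s, &bl[consumed]))
            consumed++;
    }
}
```

The intermediate values never touch an array shared by the two loops, which is the software counterpart of saving accesses to on-board memory.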


Cray XD-1


DWT on Cray-XD1
(Main Program)

#define APP_CFG_REG 0x08UL
#define USR_REG1    0x40UL
#define USR_REG2    0x48UL
#define USR_REG3    0x50UL
#define USR_REG4    0x58UL
#define QDR1_OFFSET 0x100UL /* Offset in FPGA memory space. */

int main (int argc, char *argv[])
{
    int fp_id;
    err_e e;
    u_64 coeff[4];
    u_64 * dp_base;
    u_64 * image;

    fp_id = fpga_open ("/dev/ufp0", O_RDWR|O_SYNC, &e);

    fpga_load (fp_id, "top.bin.ufp", &e);

    . . .
    /* Read Image */
    .
    /* initialize daubechies coefficients */
    .
    fpga_wrt_appif_val (fp_id, coeff[0], USR_REG1, TYPE_VAL, &e);
    fpga_wrt_appif_val (fp_id, coeff[1], USR_REG2, TYPE_VAL, &e);
    fpga_wrt_appif_val (fp_id, coeff[2], USR_REG3, TYPE_VAL, &e);
    fpga_wrt_appif_val (fp_id, coeff[3], USR_REG4, TYPE_VAL, &e);

Define the address space for user registers and QDR memory

Open the FPGA Device

Load the FPGA

Transfer coefficients into the FPGA registers


    fpga_wrt_appif_val (fp_id, 0x0UL, APP_CFG_REG, TYPE_VAL, &e);

    dp_base = fpga_memmap (fp_id, QDR_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, QDR1_OFFSET, &e);
    for (i = 0; i < QDR_SIZE/sizeof (u_64); i++)
        dp_base[i] = image[i];

    /* ... */

    fpga_rd_appif_val (fp_id, &status, STATUS_REG, &e);

    /* ... */

    for (i = 0; i < QDR_SIZE/sizeof (u_64); i++)
        output_image[i] = dp_base[i];

    fpga_close (fp_id, &e);
}

Configure the Wavelet for QDR bridging

Start Processing

Read the FPGA status

Map the entire 4 Mbytes of QDR Memory

Read back the Image

Transfer the Image into the QDR

Configure the Wavelet for QDR bridging

Close the FPGA device

DWT on Cray-XD1 (cnt’d)(Main Program)


Accessing µP memory from FPGA(Cray-XD1)

unsigned long order;
void *ftr_mem;

/* ... */

ftr_mem = fpga_set_ftrmem(fpga_fd, order, &err);
if (ftr_mem == NULL) {
    /* Handle error. */
}

fpga_wrt_appif_val (fp_id, (u_64) ftr_mem, BUFF0_PTR_REG, TYPE_ADDR, &e);

fpga_wrt_appif_val (fp_id, 0x1UL, APP_CFG_REG, TYPE_VAL, &e);

/* ... */

fpga_rd_appif_val (fp_id, &status, STATUS_REG, &e);

/* ... */

The APIs support access to a region of the µP memory by the FPGA logic

The program uses the fpga_set_ftrmem function to:
• allocate an FTR
• associate it with the address space of the µP
• set up the FPGA to access it directly

It does not automatically provide the address of this region to the FPGA application logic
• One way is to establish an FPGA register for that purpose and use the fpga_wrt_appif_val function to write the value to the register


Using MPI on Cray-XD1

if (MYTHREAD == 0)
    read_image (image_file_name, image_buffer, &rows, &cols);

MPI_Bcast(&rows, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&cols, 1, MPI_INT, 0, MPI_COMM_WORLD);

local_size = rows*cols/THREADS;

MPI_Scatter(image_buffer, local_size, MPI_UNSIGNED_LONG,
            local_image_buffer, local_data_size, MPI_UNSIGNED_LONG,
            0, MPI_COMM_WORLD);

/* Execute the wavelet on the Hardware */
process_image(fp_id, local_image_buffer, local_output_buffer, rows, cols);

MPI_Gather(local_output_buffer, local_size, MPI_UNSIGNED_LONG,
           output_image_buffer, local_data_size, MPI_UNSIGNED_LONG,
           0, MPI_COMM_WORLD);

if (MYTHREAD == 0)
    write_image (output_file_name, output_image_buffer, rows, cols);

Each Cray XD1 chassis consists of 6 dual-processor Opteron nodes, each with:
• 2 Opteron processors (total 12)
• 1 Xilinx Virtex-II Pro 50 (total 6)

Applications can be parallelized across the 6 FPGAs using MPI

Data are distributed across the 6 FPGAs
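The even block distribution used by the scatter/gather calls can be sketched without MPI. This is a minimal illustration under the assumption that the image size divides evenly by the number of ranks, as a fixed-count scatter requires; the function names are hypothetical.

```c
#include <assert.h>

/* Number of image elements handled by each rank (one rank per FPGA). */
long local_block_size(long rows, long cols, int ranks)
{
    assert(ranks > 0 && (rows * cols) % ranks == 0);
    return rows * cols / ranks;
}

/* Index of the first image element owned by `rank`. */
long local_block_offset(long rows, long cols, int ranks, int rank)
{
    assert(rank >= 0 && rank < ranks);
    return rank * local_block_size(rows, cols, ranks);
}
```

For example, a 600 x 512 image split across the 6 FPGAs of one chassis gives 51,200 elements per FPGA, with the last rank starting at element 256,000.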


rasclib_algorithm_request_t ar;
int alg_id;
int res;

strcpy(ar.alg_id, "Wavelet");
ar.num_devices = 1;
. . .
/* Read Image */
.
/* initialize daubechies coefficients */
.
rasclib_resource_alloc(&ar, 1);
alg_id = rasclib_algorithm_open("Wavelet");

res = rasclib_algorithm_alg_reg_write (alg_id, "coeff0", coeff[0]);
res = rasclib_algorithm_alg_reg_write (alg_id, "coeff1", coeff[1]);

res = rasclib_algorithm_send (alg_id, "a_in", Image_Buff, SIZE);

Parameter Passing

Small parameters
• Connect to Algorithm Defined Registers (alg_def_reg0 - alg_def_reg7)
• Pass parameter mapping to software through an extractor directive, type REG_IN:
  -- extractor REG_IN: coeff0 64 u alg_def_reg0[63:0]
  -- extractor REG_IN: coeff1 64 u alg_def_reg1[63:0]

Large Arrays
• Dedicate a portion of an SRAM bank for the parameter array
• Pass parameter array mapping to software with an extractor comment of type SRAM:
  -- extractor SRAM:a_in 2048 64 sram[0] 0x00 in u fixed

DWT on SGI-Altix(Main Program)


rasclib_algorithm_go (alg_id);

res = rasclib_algorithm_receive (alg_id, "d_out", out_Buff, SIZE);
rasclib_algorithm_commit (alg_id);
rasclib_algorithm_wait (alg_id);
rasclib_algorithm_close (alg_id);

Results Read-Back

Small parameters
• Connect to Algorithm Defined Registers
• Pass parameter mapping to software through an extractor directive, type REG_OUT
• Use the API function rasclib_algorithm_reg_read

Large Arrays
• Dedicate a portion of an SRAM bank for the parameter array
• Pass parameter array mapping to software with an extractor comment of type SRAM:
  -- extractor SRAM:d_out 2048 64 sram[1] 0x00 out u fixed

DWT on SGI-Altix (cnt’d)(Main Program)


Time

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

Read DMA 0 1 2 X X

Algorithm X 0 1 2 X

Write DMA X X 0 1 2

Improve performance by overlapping algorithm computation and data loading and unloading

Extractor directives are used to tell software:
• where input/output data arrays are located (SRAM bank + starting index)
• the sizes of the input/output data arrays
• which arrays have been enabled for streaming

Extractor directive type used: SRAM with attribute stream, e.g.:

-- extractor SRAM:a_in 2048 64 sram[0] 0x00 in u stream
-- extractor SRAM:d_out 2048 64 sram[1] 0x00 out u stream
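To make the structure of an SRAM extractor comment concrete, here is a small C sketch that splits one such line into fields. It is an illustration only, not the SGI extractor tool. The field meanings (element count, element width in bits, bank, starting offset, direction, signedness, fixed or stream) are inferred from the slide examples, and sram_directive_t and parse_sram_directive are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one SRAM extractor directive.
   Field meanings are guesses based on the slide examples. */
typedef struct {
    char name[32];          /* array name, e.g. "a_in" */
    unsigned count;         /* number of elements, e.g. 2048 */
    unsigned width;         /* element width in bits, e.g. 64 */
    unsigned bank;          /* SRAM bank index, e.g. 0 for sram[0] */
    unsigned long offset;   /* starting offset within the bank (hex) */
    char dir[8];            /* "in" or "out" */
    char sign[4];           /* "u" in the slide examples */
    char mode[16];          /* "fixed" or "stream" */
} sram_directive_t;

/* Parse one "-- extractor SRAM:..." comment line.
   Returns 1 if all eight fields were read, 0 otherwise. */
int parse_sram_directive(const char *line, sram_directive_t *d)
{
    assert(line != NULL && d != NULL);
    int n = sscanf(line,
                   "-- extractor SRAM:%31s %u %u sram[%u] %lx %7s %3s %15s",
                   d->name, &d->count, &d->width, &d->bank, &d->offset,
                   d->dir, d->sign, d->mode);
    return n == 8;
}
```

A host-side tool could then check strcmp(d.mode, "stream") == 0 to decide whether an array has been enabled for streaming.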

Streaming (SGI-Altix)

Example 5
Image processing: Thin Plate Splines

The application: Thin Plate Splines - image analysis of protein gels

• Image morphing based on natural logarithm computations

• Essential for comparing protein content

• Speedup per FPGA: 10-30x. Reduces analysis runtime from days to hours.

Host Program - running on Opteron CPU, calling FPGA subroutine

Transfer parameter data to QDRAM

Start Mitrion program and wait until finished

Retrieve computed image data

u_64 fpga_mem, i;
my_fpga = fpga_open(args); // Use normal XD1 API for most operations

...

if (!fpga_is_loaded(args))
    rtn = fpga_load(args);

...

// memory map QDRAMs into host address space
fpga_mem = fpga_memmap(args);

// Upload data to QDRAM
memcpy(fpga_mem, parameter_data, sizeof_parameter_data);

// Control of mitrion processor is internally handled
// with a number of memory mapped registers in the FPGA
// Controlling running/stepping/reset etc.

mitrion_start(my_fpga); // Start mitrion block
mitrion_wait(my_fpga);  // wait for block to finish

// Fetch results from QDRAM
memcpy(image_coordinates, fpga_mem, sizeof_image_data);
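The host fetches the results as raw 64-bit words. The Mitrion-C kernel of this example packs two Fix16.16 values into each word (bits:64 word = [x32, y32]). A minimal host-side sketch of unpacking one such word follows. It assumes that x32 occupies the upper 32 bits, which the slides do not state, and unpack_coords is a hypothetical helper, not part of the XD1 or Mitrion API.

```c
#include <assert.h>
#include <stdint.h>

/* Split one 64-bit result word into two signed 16.16 fixed-point values
   and convert them to double.
   ASSUMPTION: [x32, y32] places x32 in the upper 32 bits of the word. */
void unpack_coords(uint64_t word, double *x, double *y)
{
    assert(x != 0 && y != 0);
    int32_t x32 = (int32_t)(uint32_t)(word >> 32);
    int32_t y32 = (int32_t)(uint32_t)(word & 0xFFFFFFFFu);
    *x = x32 / 65536.0;   /* 16 fractional bits */
    *y = y32 / 65536.0;
}
```

If the hardware uses the opposite order, only the two extraction lines need to be swapped.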

FPGA program (1/3) - accelerated subroutine in Mitrion-C

// Options: -cpp

#define RAMType mem uint:64 [ 0x100000 ]

#include "grint_lib.lqd"
#include "logarithm_rwhile.lqd"

(Fix, RAMType) readFix(RAMType m, uint:24 basicOffset, uint:24 fixOffset)
{
  uint:32 memOffset = basicOffset + fixOffset;
  (result, m2) = _memread(m, memOffset);
} (result, m2);

(RAMType, RAMType, RAMType, RAMType) main (RAMType Am, RAMType Bm, RAMType Cm, RAMType Dm)
{
  Fix<LMS> py; // parameter vectors
  Fix<LMS> px;
  Fix<LMS> koeffx;
  Fix<LMS> koeffy;

  // read parameters from external RAM
  (px, py, koeffx, koeffy, Aml) = foreach(index in <0.. LMS_1>)
  {
    (x, Am2) = readFix(Am, PX_OFF, index);
    (y, Am3) = readFix(Am2, PY_OFF, index);
    (kx, Am4) = readFix(Am3, KOEFFX_OFF, index);
    (ky, Am5) = readFix(Am4, KOEFFY_OFF, index);
  } (x, y, kx, ky, Am5);
  Aut = _wait(Aml);

  Cut = grintpolc(Cm, px, py, koeffx, koeffy);

} (Aut, Bm, Cut, Dm);

Definition of RAM type

readFix fetches input data from QDRAM

Start of program. Matches external RAM interface of the XD1: 4 banks of 1M words each

FPGA program (2/3) - accelerated subroutine in Mitrion-C

RAMType grintpolc ( RAMType coords, // out
                    Fix<LMS> px, Fix<LMS> py,
                    Fix<LMS> koeffx, Fix<LMS> koeffy )
{
  imDonel = foreach(y in <0.. YSIZE_1>)
  {
    uint:32 lineoff = y*XSIZE;
    imDone2l = foreach(x in <0.. XSIZE_1>)
    {
      (distx, disty) = foreach(px, py, koeffx, koeffy in px, py, koeffx, koeffy)
      {
        Fix dx = px - int2fix(x);
        Fix dy = py - int2fix(y);

        Fix r2 = fixmul(dx,dx) + fixmul(dy,dy);

        Fix ext = if(r2 == 0) 0
                  else
                  {
                    Fix ln = fixln(r2);
                    ext = fixmul(r2,ln);
                  } ext;

Input arguments (the image) for Thin Plate Splines transform

Major compute intensive part: high precision ln computation

FPGA program (3/3) - accelerated subroutine in Mitrion-C

        Fix rx = fixmul(ext, koeffx);
        Fix ry = fixmul(ext, koeffy);
      } (rx, ry);
      Fix distcoordx = sum(distx);
      Fix distcoordy = sum(disty);

      // distcoordx and distcoordy are the coordinates
      // of the pixels to be fetched from the distorted image

      uint:32 index = x + lineoff;
      int:32 x32 = (distcoordx >>> 8); // convert into Fix16.16
      int:32 y32 = (distcoordy >>> 8); // convert into Fix16.16
      watch x32;
      watch y32;

      bits:64 word = [x32, y32];
      imDone3 = _memwrite(coords, index, word);
    } imDone3;
    imDone2 = _wait(imDone2l);
  } imDone2;
  imDone = _wait(imDonel);

} imDone;

Output argument is the distorted image

Output arguments (distorted image coordinates) are written to QDRAM
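For checking the FPGA results on the host, the per-pixel computation of grintpolc can be written as a floating-point reference in C. The sketch below follows the structure of the Mitrion-C code (r2 = dx*dx + dy*dy, basis term r2*ln(r2) with the value 0 at r2 == 0, weighted sums over all control points). It is not the authors' code: tps_basis, tps_map_pixel and ln_series are hypothetical names, doubles replace the Fix type, and the logarithm is a textbook range-reduction plus series so that the example needs only standard headers. The slides do not show how fixln is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Natural logarithm for 0 < v < 1e300 by halving/doubling v into [1, 2)
   and summing the series ln(m) = 2*(z + z^3/3 + z^5/5 + ...),
   where z = (m - 1)/(m + 1). Stand-in for fixln; an assumption. */
static double ln_series(double v)
{
    const double LN2 = 0.6931471805599453;
    int k = 0;
    assert(v > 0.0 && v < 1e300);
    while (v >= 2.0) { v *= 0.5; k++; }
    while (v < 1.0)  { v *= 2.0; k--; }
    double z = (v - 1.0) / (v + 1.0);
    double z2 = z * z, term = z, sum = 0.0;
    for (int n = 1; n <= 39; n += 2) {
        sum += term / n;
        term *= z2;
    }
    return k * LN2 + 2.0 * sum;
}

/* Basis term of the kernel: r2 * ln(r2), defined as 0 at r2 == 0. */
double tps_basis(double r2)
{
    assert(r2 >= 0.0);
    return (r2 == 0.0) ? 0.0 : r2 * ln_series(r2);
}

/* Weighted sums over n control points (px[i], py[i]) for pixel (x, y),
   mirroring distcoordx and distcoordy in the Mitrion-C code. */
void tps_map_pixel(double x, double y,
                   const double *px, const double *py,
                   const double *koeffx, const double *koeffy, size_t n,
                   double *distcoordx, double *distcoordy)
{
    double sx = 0.0, sy = 0.0;
    for (size_t i = 0; i < n; i++) {
        double dx = px[i] - x;
        double dy = py[i] - y;
        double ext = tps_basis(dx * dx + dy * dy);
        sx += ext * koeffx[i];
        sy += ext * koeffy[i];
    }
    *distcoordx = sx;
    *distcoordy = sy;
}
```

The inner loop over control points corresponds to the innermost foreach of the FPGA program; on the FPGA the two outer loops over x and y are pipelined, while a host reference would simply call tps_map_pixel once per pixel.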



Challenges


Application Development for Reconfigurable Computers

Program Entry

Compilation

Execution

Platform Mapping

Debugging & Verification

Tasks Addressed in This Presentation

Program Entry

Compilation

Execution

Platform Mapping

Debugging & Verification

Program

Program Entry


Platform Mapping: SW/HW Partitioning

Program

Software (executed in the microprocessor system)

Hardware (executed in the reconfigurable processor system)

SW/HW Partitioning & Coding: Traditional Approach

Specification

SW/HW Partitioning

SW Coding HW Coding

SW Compilation HW Compilation

SW Profiling HW Profiling

SW/HW Partitioning & Coding: New Approach

Specification

SW/HW Coding

SW Compilation HW Compilation

SW Profiling HW Profiling

SW/HW Partitioning

Platform Mapping: FPGA mapping

Program

Software

Hardware

FPGA 1   FPGA 2   FPGA 3   FPGA 4

Example of FPGA Mapping

[Figure: the add, multiply and divide operations of a program mapped onto a single FPGA, and partitioned between FPGA 1 and FPGA 2]

FPGA Mapping in SRC

[Figure: the add, multiply and divide operations divided between FPGA 1 (FPGA1.mc; inputs a and b, output sum) and FPGA 2 (FPGA2.mc)]

FPGA1.mc

void fpga1(int64_t a, int64_t b, int64_t *sum, int mapno)
{
    int64_t c, temp;

    send_to_bridge(b);
    c = a * const1;
    recv_from_bridge(&temp);
    *sum = temp + c;
}

FPGA2.mc

void fpga2()
{
    int64_t a, d;

    recv_from_bridge(&a);
    d = a / const2;
    send_to_bridge(d);
}

Makefile

MAPFILES = FPGA1.mc FPGA2.mc
PRIMARY = FPGA1.mc
SECONDARY = FPGA2.mc
CHIP2 = FPGA2.mc

FPGA Mapping in VIVA™

By changing an object's attributes, one can specify where the object is to be located.

Platform Mapping: FPGA-FPGA data transfer & synchronization

Program

Software

Hardware

FPGA 1   FPGA 2   FPGA 3   FPGA 4

FPGA-FPGA Data Transfer in SRC

[Figure: FPGA 1 (computation 1) and FPGA 2 (computation 2) connected by two 64-bit paths; a, b and c are sent from FPGA 1 to FPGA 2, and d is sent back]

FPGA1.mc

void fpga1(int64_t a, int64_t b, int64_t c, int64_t *d)
{
    send_to_bridge(a, b, c);
    /* computation 1 */
    recv_from_bridge(d);
}

FPGA2.mc

void fpga2()
{
    int64_t a, b, c, d;

    recv_from_bridge(&a, &b, &c);
    /* computation 2 */
    send_to_bridge(d);
}

FPGA-FPGA Data Transfer in SRC

[Figure: the Bridge Port between the FPGAs, built from two FIFOs, each 32 words deep and 64 bits wide, on 64-bit data paths]
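The buffering behavior of one direction of the bridge can be modeled in software as a ring buffer of 32 64-bit words. This is a sketch for reasoning about capacity only, not SRC code: bridge_fifo_t, fifo_init, fifo_push and fifo_pop are hypothetical names, and where this model returns 0 on a full or empty FIFO, real hardware would presumably stall the sender or receiver instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BRIDGE_FIFO_DEPTH 32   /* words per FIFO, as drawn on the slide */

typedef struct {
    int64_t  words[BRIDGE_FIFO_DEPTH];  /* 64-bit entries */
    unsigned head;                      /* index of the oldest word */
    unsigned count;                     /* words currently stored */
} bridge_fifo_t;

void fifo_init(bridge_fifo_t *f)
{
    assert(f != NULL);
    f->head = 0;
    f->count = 0;
}

/* Store one word; returns 1 on success, 0 if the FIFO is full. */
int fifo_push(bridge_fifo_t *f, int64_t w)
{
    assert(f != NULL);
    if (f->count == BRIDGE_FIFO_DEPTH)
        return 0;
    f->words[(f->head + f->count) % BRIDGE_FIFO_DEPTH] = w;
    f->count++;
    return 1;
}

/* Remove the oldest word; returns 1 on success, 0 if the FIFO is empty. */
int fifo_pop(bridge_fifo_t *f, int64_t *w)
{
    assert(f != NULL && w != NULL);
    if (f->count == 0)
        return 0;
    *w = f->words[f->head];
    f->head = (f->head + 1) % BRIDGE_FIFO_DEPTH;
    f->count--;
    return 1;
}
```

In this model a sender can run at most 32 words ahead of the receiver, which illustrates why the two FPGA functions must pair every send_to_bridge with a matching recv_from_bridge.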


FPGA-FPGA Data Transfer in VIVA™

Special partitioning objects placed between the modules to be synthesized automatically map the relevant lines between the FPGAs.

For designs mapped over several FPGAs: the system description must include the FPGAs over which the design is to be mapped.

Platform Mapping: Use of Internal and External Memories

Program

Software

Hardware

FPGA 1   FPGA 2   FPGA 3   FPGA 4

OCM – On-Chip Memory
LM – Local Memory
SM – Shared Memory

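Using On-Chip Memory (OCM) in SRC

void sum(int64_t a[], int *c, int mapno)
{
    BANK_A_ALLOC(AL, int64_t, SIZE);   /* AL[] resides in shared memory (OBM) */
    int64_t ocm_a[SIZE];               /* ocm_a[] resides in on-chip memory */
    int64_t tmp = 0;
    int i;

    cm2obm_0(AL, a, byteLength);
    wait_server_0();

    for (i = 0; i < SIZE; i++) {
        ocm_a[i] = AL[i];
    }

    for (i = 0; i < SIZE; i++) {
        tmp = ocm_a[i] + tmp;
    }

    *c = tmp;
}

[Figure: AL[] in shared memory (SM, i.e. OBM) is copied into ocm_a[] in the on-chip memory (OCM) of the FPGA; the computations produce c. Path widths shown: 64 and 32]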

Using On-Chip Memory (OCM) in VIVA™

Special objects under the Memory Subsystem of the library allow the programmer to use the on-chip memory of the Xilinx Virtex-II chip.

Platform Mapping: I/O

Program

Software

Hardware

FPGA 1   FPGA 2   FPGA 3   FPGA 4

SM   LM   OCM

SRC

Star Bridge

Run-Time Reconfiguration in SRC

[Figure: a main program in C or Fortran calls Function_1(a, d, e) and Function_2(d, e, f). Function_1 uses Macro_1(a, b, c), Macro_2(b, d) and Macro_2(c, e); Function_2 uses Macro_3(s, t), Macro_1(n, b) and Macro_4(t, k). FPGA contents after the Function_1 call: Macro_1 with input a and outputs b and c, followed by two instances of Macro_2 producing d and e.]

Run-Time Reconfiguration in VIVA™

Reconfiguration is possible by using the spawn object. By specifying the FileName attribute, a VIVA executable (.vex file) or a VIVA project can be loaded onto the same or a different FPGA.

Ideal Program Entry

[Figure: Program Entry specifies only the Function]

Actual Program Entry

[Figure: Program Entry specifies the Function and, in addition:]
• SW/HW Partitioning
• SW/HW Interface
• FPGA Mapping
• Data Transfers & Synchronization
• Use of Internal and External Memories
• Sequence of Run-time Reconfigurations
• Use of FPGA Resources (multipliers, μP cores)
• Preferred Architectures

Evolution and the current status of tools

[Table: status of each task (Not implemented, Manual Entry, or Compiler Automated) for SRC, Star Bridge, and other vendors]
• FPGA-FPGA Partitioning
• μP-FPGA Partitioning
• FPGA-FPGA Data Transfer
• μP-FPGA Data Transfer
• Computation-Data transfer Overlapping
• Choosing component version

Debugging & Verification


MAP Board Execution

[Figure: the Application and the MAP Runtime Library run the Subroutine for MAP, which consists of ComList code, wrapper code and user logic. On the MAP board, the Control Processor (ComList Processor, DMA Engine, Registers & Flags) exchanges data & flags with the User FPGAs (logic macros) and the On-board Memory.]

MAP Emulator + DFG Simulator

[Figure: same structure as MAP Board Execution, with the MAP board replaced by an Emulator: the Control Processor (ComList Processor, DMA Engine, Registers & Flags) exchanges data & flags with macros implemented as C code.]

MAP Emulator + Verilog Simulator

[Figure: same structure as MAP Board Execution, with the MAP board replaced by an Emulator: the Control Processor (ComList Processor, DMA Engine, Registers & Flags) exchanges data & flags with macros implemented in Verilog and simulated in VCS.]

X86 System in VIVA™

The FileIn object as it appears when the x86 system is loaded.

X86 System in VIVA™

The FileIn object as it appears when the FPGA system description is loaded.

Debugging in VIVA™

Data can be viewed with the help of widgets, which are basically input and output ‘horns’ placed in a worksheet.

Several display options are available: the viewer can choose the kind of view, and the displayed data can be switched between HEX and INT.

Debugging in the SGI Environment

[Figure: SGI development flow on an IA-32 Linux machine. Stages shown: HLL Design Entry (Handel-C, Mitrion C, Viva) with Compiler, Simulator and Debugger; RTL Generation and Integration with Core Services; Design Verification; synthesis (tool list truncated, ending in "Amplify"); Design Implementation (ISE); Metadata Processing (Python); gdb. File types exchanged: .c, .v, .vhd, .edf, .ncd, .pcf, .cfg, .bin.]

Programming Environments: Summary

SRC Programming Environment

+ very easy to learn and use
+ standard ANSI C
+ hides implementation details
+ good support for debugging
+ vendor and user libraries
+ very well integrated environment
+ good use of 3rd party tools
+ in production use for over 3 years with constant improvements

- subset of C
- legacy C code requires rewriting
- C limitations in describing HW (parallelism, data types)
- closed environment, limited portability of codes to HW platforms other than SRC

Star Bridge Programming Environment: Viva

+ drag-and-drop program entry
+ standard and user libraries
+ separation of designs/programs from system/platform descriptions = portability of codes
+ support for multiple platforms under development

- does not follow any established standards
- no textual description = limited scalability of codes
- control signals (e.g., handshaking between the adjacent library cells) must be specified explicitly
- no clear mechanism to call HW functions from SW

Cray Programming Environment based on Simulink/System Generator

+ drag-and-drop program entry
+ extensive libraries of DSP components
+ good support for debugging

- graphical description = limited scalability of codes
- control signals (e.g., handshaking between the adjacent library cells) must be specified explicitly
- limited library support for applications other than DSP

Cray Programming Environment based on DSPLogic

+ graphical programming language (drag-and-drop program entry)
+ extensive libraries of DSP components
+ single environment (MATLAB™/Simulink™) to analyze, visualize, implement, debug, verify
+ efficient resource usage

- graphical description = limited scalability of codes
- limited library support for applications other than DSP

Cray XD1 and SGI Environments based on Mitrion-C

+ high-level C-like language, easy to learn by an HPC programmer
+ ease of describing parallelism and non-standard (variable size) data types
+ small amount of Mitrion-C generates large number of lines of HDL code
+ suitable for describing classical complex HPC problems
+ Mitrion-C code portable between Cray XD1 and SGI

- new and as yet untested
- non-standard, no support for legacy codes
- language describes only what happens in a single FPGA
- currently, no mechanisms to use HDL macros
