1 Program Development Environments Languages & Tools Kris Gaj George Mason University


Post on 17-Jan-2016


Page 1

Program Development Environments: Languages & Tools

Kris Gaj, George Mason University

Page 2

Acknowledgements: companies, centers, and sponsors

• AMI

• Cray

• Mitrion

• NCSA

• SGI

• SRC

• Star Bridge

• DoD/LUCITE

Page 3

Acknowledgements: GWU/GMU students

• Esmail Chitalwala (GWU/Star Bridge)

• Hatim Diab (GWU)

• Esam El-Araby (GWU)

• Miaoqing Huang (GWU)

• Hoang Le (GMU)

• Allen Michalski (GMU/USC)

• Nandkishore Sastry (GMU)

• Chang Shu (GMU)

• Mohamed Taher (GWU)

• Proshanta Saha (GWU)

Page 4

SRC Programming Model

[Diagram: on the microprocessor side, main.c (ANSI C) calls function_1() and function_2(). On the FPGA side, function_1 and function_2 are written in MAP C (a subset of ANSI C) and invoke macros: macro_1(a, b, c), macro_2(b, d), macro_2(c, e), macro_3(s, t), macro_1(n, b), macro_4(t, k). The macros (macro_1, macro_2, macro_3, macro_4, ...) come from libraries of macros written in VHDL. Inside the FPGA, Macro_1 and two instances of Macro_2 are wired together through signals a, b, c, d, e and connected to I/O.]

Page 5

SRC Program Partitioning

[Diagram: an application is split into three parts.
• C function for µP: HLL, runs on the µP system
• C function for MAP: HLL, runs on the FPGA system
• VHDL macro: HDL, runs on the FPGA system]

Page 6

SRC Compilation Process

[Flow diagram. Blocks and file types shown:
• Application sources: .c or .f files (µP Compiler), .mc or .mf files (MAP Compiler)
• Macro sources: HDL sources, .vhd or .v files
• µP Compiler and MAP Compiler outputs: object files (.o files); the MAP Compiler also emits .v files
• Logic synthesis: netlists (.ngo files)
• Place & Route: configuration bitstreams (.bin files)
• Linker: application executable]

Page 7

SRC Libraries of Hardware Macros

User libraries of hardware macros developed by GWU/GMU/USC 2002-2006

• Secret-key cipher encryption & breaking
• Binary Galois Field arithmetic (polynomial basis & normal basis representation)
• Elliptic Curve Arithmetic
• Long integer modular arithmetic (RSA)
• Sorting
• Image processing
• Bioinformatics

See http://hpc.gwu.edu/library

Vendor libraries of hardware macros

• basic integer and floating-point arithmetic
• digital signal processing

Page 8

Star Bridge Programming Environment - Viva

[Screenshot of the Viva environment with labelled areas: Library, Object, Sheets]

Page 9

Star Bridge Compilation Process

[Flow diagram. Blocks and file types shown:
• User input through the VIVA Graphical User Interface
• Netlists (.ngo files)
• Xilinx Place & Route: configuration bitstreams (.bin files)
• Application executable]

Page 10

Cray XD1 Programming Flows

Source: [Cray, MAPLD05]

[Diagram. Flows shown, all ending in Xilinx Place & Route and a configuration bitstream:
• Standard flow: VHDL/Verilog synthesis (Mentor Graphics, Synopsys, Synplicity) to gate-level EDIF
• High-level flow: Mitrion-C (Mitrion), producing VHDL
• High-level flow: MATLAB/Simulink (The MathWorks) with System Generator (Xilinx), producing VHDL or Verilog]

The same operation, z = a AND m, written in VHDL for synthesis:

    process (a, m) is
    begin
      z <= a and m;
    end process;

and in Mitrion-C:

    int mask(a, m)
    {
      return (a & m);
    }

Page 11

Xtreme DSP Design Flow

Page 12

HDL-based SGI Altix Programming Flow

[Flow diagram with design iterations; the design steps run on an IA-32 Linux machine. Blocks and file types shown:
• Design Entry (Verilog, VHDL): .v, .vhd
• Design Verification: Behavioral Simulation (VCS, Modelsim), Static Timing Analysis (ISE Timing Analyzer)
• Design Synthesis (Synplify Pro, Amplify): .edf
• Design Implementation (ISE): .ncd, .pcf, .bin
• Metadata Processing (Python): .v, .vhd in, .cfg out
• Altix Device Programming (RASC Abstraction Layer, Device Manager, Device Driver)
• Real-time Verification (gdb): .c]

Page 13

HLL-based SGI Altix Programming Flow

[Flow diagram; the design steps run on an IA-32 Linux machine. Blocks and file types shown:
• HLL Design Entry (Handel-C, Mitrion C, Viva)
• RTL Generation and Integration with Core Services: .v, .vhd
• Design Verification: Behavioral Simulation (VCS, Modelsim), Static Timing Analysis (ISE Timing Analyzer)
• Design Synthesis (Synplify Pro, Amplify): .edf
• Design Implementation (ISE): .ncd, .pcf, .bin
• Metadata Processing (Python): .v, .vhd in, .cfg out
• Altix Device Programming (RASC Abstraction Layer, Device Manager, Device Driver)
• Real-time Verification (gdb): .c]

Page 14

Mitrion-C Programming Model for Cray & SGI

[Diagram: on the microprocessor, main.c (ANSI C, based on the Mitrion API) calls function_1(in1), start_fpga(), function_1(in2), start_fpga(). On the FPGA, the application code (platform independent, written in Mitrion-C) passes through the Mitrion Compiler & Configurator and becomes an application on the distributed processor, delivered as VHDL. The Mitrion Distributed Processor Architecture (platform dependent) connects to I/O and RAM for input and output.]

Page 15

Compiling A Mitrion Program

[Diagram of the Mitrion Software Development Kit. Blocks shown: Mitrion-C source code, Compiler, processor machine code, Simulator & Debugger, Processor Configurator, processor architecture, processor HW design (VHDL IP core), FPGA.]

Page 16

The Mitrion Platform

1) The Mitrion Virtual Processor
   – A fine-grain, massively parallel, configurable soft-core processor
   – 10-30 times faster than traditional CPUs

2) The Mitrion-C programming language
   – An intrinsically parallel C-family language

3) The Mitrion Software Development Kit
   – Compiler
   – Debugger/Simulator
   – Processor configurator

Page 17

A New Processor Architecture Specifically For FPGAs

    int:48<30> main()
    {
      int:48 prev = 1;
      int:48 fib = 1;

      int:48<30> fibonnacci = for(i in <1..30>)
      {
        fib = fib+prev;
        prev = fib;
      } <>fib;

    } fibonnacci;

Architecture design goal:
• High silicon utilization
• Take advantage of FPGA re-configurability

Goal achieved by:
• Allow processor to be massively parallel
• Allow processor to be fully adapted to algorithm
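For readers unfamiliar with Mitrion-C, the following is a rough sequential C analogue of what the slide's loop appears intended to compute: a list of 30 successive Fibonacci values held in 48-bit integers. The function name `fibonacci48`, the mask constant, and the reading that `prev` receives the previous iteration's `fib` are assumptions made for this sketch, not statements about Mitrion-C semantics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mask that keeps the low 48 bits, mimicking the slide's int:48 width. */
#define MASK48 ((UINT64_C(1) << 48) - 1)

/* Fill out[0..n-1] with successive Fibonacci numbers, starting from
 * prev = 1, fib = 1 as on the slide. A temporary holds the new value
 * because C assignments take effect immediately. */
void fibonacci48(uint64_t *out, size_t n)
{
    uint64_t prev = 1, fib = 1;
    assert(out != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t next = (fib + prev) & MASK48;
        prev = fib;
        fib = next;
        out[i] = fib;   /* plays the role of collecting fib into the list */
    }
}
```

Calling `fibonacci48(values, 30)` on a 30-element array fills it with 2, 3, 5, 8, and so on. The C version runs one iteration at a time, whereas the slide's point is that the Mitrion processor can be laid out to match the algorithm.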

Page 18

Processor Architecture: A Cluster-On-A-Chip

• Non-Von Neumann architecture
• Processor architecture more like a cluster
• Very fine-grain parallelism
  – Normal clusters run a block of code on each PE (1)
  – Mitrion runs a single instruction on each PE
  – Each PE adapted to optimally run its instruction
• Network topology specific for algorithm
• No instruction stream, instead data stream

(1) PE = Processing Element

Page 19

Page 20

A C-family Language

• Basic syntax is the same as for other C-family languages

• Examples:
  – Blocks are surrounded by { }
  – Assignment with =
  – Statements end with ;
  – if, for, while
  – Most of the usual C operators
  – C-style comments (though nestable)

Page 21

Types

• Basic types
    int/uint     signed/unsigned integer
    boolean      boolean value (true/false)
    float        floating point real value
    bits         bit vector format

• Free bit width
    int:24       24 bit signed integer
    uint:19      19 bit unsigned integer
    float:24.8   IEEE-754 single precision float

• Collections
    int:24[100]  Vector (indexable collection)
    int:14<100>  List (no index)
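To give a feel for what a free bit width such as int:24 or uint:19 can hold, here is a small C sketch that truncates a 64-bit value to a chosen width. It assumes two's-complement wrap-around on overflow, which the slide does not state for Mitrion-C; the helper names `wrap_unsigned` and `wrap_signed` are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the low `bits` bits of v (1 <= bits <= 63). */
uint64_t wrap_unsigned(uint64_t v, unsigned bits)
{
    assert(bits >= 1 && bits <= 63);
    return v & ((UINT64_C(1) << bits) - 1);
}

/* Interpret the low `bits` bits of v as a two's-complement signed value. */
int64_t wrap_signed(uint64_t v, unsigned bits)
{
    uint64_t low = wrap_unsigned(v, bits);
    uint64_t sign = UINT64_C(1) << (bits - 1);
    /* (low ^ sign) - sign performs sign extension using plain arithmetic. */
    return (int64_t)(low ^ sign) - (int64_t)sign;
}
```

Under this assumption a uint:19 covers 0 to 524,287 and an int:24 covers -8,388,608 to 8,388,607. On an FPGA the narrower width saves logic, which is the motivation for letting the programmer pick it.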

Page 22

Language constructs

Operators

if(a>b) ...

while(i<10) ...

for(i in <0..999>) ...

foreach (e in vector) ...

int:8 function(int:8 a) ...

Page 23

A C-family Language

• Important differences
  – No pointers
  – No dynamic allocation
  – Static general recursion only
    • Though loop structures may be dynamic

Page 24

Compiler, Simulator And Debugger

Page 25

Page 26

Program Entry for FPGA Accelerator Boards

[Chart: entry methods HDL, HLL, and Graphical Data Flow Diagram placed along two axes, "Increased productivity" and "Increased capability to describe parallel execution". Two groups are marked, each covering hardware and software: Traditional, and Extended (e.g. Corefire).]

Page 27

Program Entry for Reconfigurable Computers

[Chart: entry methods HDL, HLL, and Graphical Data Flow Diagram placed along two axes, "Increased productivity" and "Increased capability to describe parallel execution". Two systems are marked, each covering hardware and software: SRC (with HDL macros) and Star Bridge (with porting of EDIF and COM objects).]

Page 28

Program Entry for Reconfigurable Computers

[Chart: entry methods HDL, HLL, and Graphical Data Flow Diagram placed along two axes, "Increased productivity" and "Increased capability to describe parallel execution". Two systems are marked, each covering hardware and software: SGI or Cray with Mitrion (Mitrion-C, Mitrion Processor) and Cray XD1 with Simulink (Simulink, Xilinx System Generator).]

Page 29

General hierarchy of library files suggested by SRC Computers Inc.

Page 30

Structure of the SRC macro repository

[Tree diagram. Levels shown, from the top:
• < top of repository >
• libraries: <lib # 1 >, <lib # 2 >, <lib # 3 >
• within a library: common, rev_d, rev_e, rev_f
• < macros >: macro1, macro2, macro3
• files for a macro: hdlfile, InfoFile, BlkBoxFile, DebugCodeFile, DataSheet]

Page 31

Files describing an SRC macro

Platform independent
– HDL file: macro.v or macro.vhd
  • Verilog or VHDL code defining the macro
– Debug Code File: macro.c
  • provides the equivalent C functionality for the macro
– Data sheet file: datasheet
  • contains the documentation for the macro

Platform dependent
– Blk Box File: blackbox.v
  • Interface (black box) definition for the macro in Verilog
– Info File: info
  • Info file entry for this macro

Page 32

Library Development - SRC

                       | µP system                   | FPGA system
Application Programmer | HLL (C, Fortran)            | HLL (C, Fortran)
Library Developer      | HLL (C, Fortran), LLL (ASM) | HLL (C, Fortran), HDL (VHDL, Verilog)

Page 33

Library Development - StarBridge

                       | µP system                       | FPGA system
Application Programmer | GDF (Viva)                      | GDF (Viva)
Library Developer      | GDF (Viva), HLL, LLL (C++, ASM) | GDF (Viva), HDL (VHDL, Verilog)

Page 34

Software libraries and their role in the development of SRC libraries

Page 35

Roles of software libraries

1. source of test vectors for VHDL macros

2. emulation of hardware during debugging

3. performance comparison

Page 36

How to approach porting your application to reconfigurable computers?

1. Identify the class of applications

2. Identify the basic operations required by your applications

3. Determine whether an RC library of such operations exists

4. Determine whether a microprocessor library of such operations exists

5. Determine the right granularity for the required library operations

Page 37

Classes of applications

1. Input/output intensive applications
   • bulk data encryption (DES, IDEA, and RC5 encryption)

2. Computationally intensive applications
   • secret-key cipher breaking based on the exhaustive key search (DES, IDEA, RC5 breakers)
   • public-key cipher breaking based on factoring

3. Latency-critical applications
   • cipher key agreement and signature (ECC schemes, RSA)

Page 38

Example 1

Cryptography: High-throughput encryption

Page 39

[Diagram: a Cipher block takes a message and a cryptographic key of K bits and produces the ciphertext.]

Page 40

Secret-key ciphers

[Diagram: Alice encrypts and Bob decrypts across the network. Both use the same key of Alice and Bob, KAB.]

Page 41

High-Throughput Encryption

[Diagram: a stream of message blocks Mi, Mi+1, Mi+2, ... enters the Encryption unit, keyed with K0, and leaves as ciphertext blocks Ci, Ci+1, Ci+2, ...]

Encryption algorithms: DES, 3DES, AES, RC5, IDEA, etc.

Page 42

Fully Pipelined Architecture

[Diagram: cipher rounds Round 1, Round 2, ..., Round k laid out one after another, each with internal pipeline stages.]

• Loop unrolling
• Pipeline stages inside of cipher rounds
• New input & new output every clock cycle
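Because a fully pipelined unit accepts one new block every clock cycle, its peak throughput is the block size times the clock frequency, times the number of units working in parallel. The helper below is a minimal sketch of that arithmetic; the function name and parameters are illustrative and do not come from the slides.

```c
#include <assert.h>

/* Peak throughput in Mbit/s of `units` fully pipelined cipher cores that
 * each accept one `block_bits`-bit block per clock cycle at `clock_mhz`. */
double pipelined_throughput_mbps(unsigned block_bits, double clock_mhz,
                                 unsigned units)
{
    assert(clock_mhz > 0.0);
    return (double)block_bits * clock_mhz * (double)units;
}
```

For a single 64-bit block cipher this gives 6,400 Mbit/s at 100 MHz and 12,800 Mbit/s at 200 MHz, which agrees with the computational throughputs quoted for the SGI Altix results later in the deck.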

Page 43

Encryption on SRC-6 – No streaming: encryption.mc (1)

    #include <libmap.h>

    void encryption (uint64_t sdata[], uint64_t key,
                     uint64_t *hardware_timein,
                     uint64_t *hardware_timeprocess,
                     uint64_t *hardware_timeout,
                     int mapnum)
    {
        OBM_BANK_A (S1OBM, uint64_t, MAX_OBM_SIZE)
        OBM_BANK_B (S2OBM, uint64_t, MAX_OBM_SIZE)
        OBM_BANK_C (S3OBM, uint64_t, MAX_OBM_SIZE)
        OBM_BANK_D (S4OBM, uint64_t, MAX_OBM_SIZE)
        OBM_BANK_E (S5OBM, uint64_t, MAX_OBM_SIZE)
        OBM_BANK_F (S6OBM, uint64_t, MAX_OBM_SIZE)
        uint32_t encrypt_decrypt;   //0:encrypt 1:decrypt
        int i, nbytes;
        uint64_t t1,t2,t3,t4;

Page 44

Encryption on SRC-6 – No streaming: encryption.mc (2)

        encrypt_decrypt = 0;
        nbytes = MAX_OBM_SIZE * 8*3;
        start_timer();

        read_timer(&t1);
        DMA_CPU(CM2OBM, S1OBM, MAP_OBM_stripe(1,"A,B,C"), sdata, 1, nbytes, 0);
        wait_DMA(0);
        read_timer(&t2);

        for(i=0;i<MAX_OBM_SIZE;i++)
        {
            des (S1OBM[i], key, encrypt_decrypt, &S4OBM[i]);
            des (S2OBM[i], key, encrypt_decrypt, &S5OBM[i]);
            des (S3OBM[i], key, encrypt_decrypt, &S6OBM[i]);
        }
        read_timer(&t3);

Page 45

Encryption on SRC-6 – No streaming: encryption.mc (3)

        DMA_CPU(OBM2CM, S4OBM, MAP_OBM_stripe(1,"D,E,F"), sdata, 1, nbytes, 5);
        wait_DMA(5);
        read_timer(&t4);
        *hardware_timein = t2-t1;
        *hardware_timeprocess = t3-t2;
        *hardware_timeout = t4-t3;
    }

Page 46

Encryption on SRC-6 – No streaming: des_blkbx.v

    module des ( desOut, desIn, keyin, decrypt, clk )
        /* synthesis syn_black_box syn_noprune=1 */ ;

        output [63:0] desOut;
        input  [63:0] desIn;
        input  [63:0] keyin;
        input         decrypt;
        input         clk /* synthesis syn_noclockbuf=1 */ ;

    endmodule

Page 47

Encryption on SRC-6 – No streaming: des.info (1)

    BEGIN_DEF "des"
        MACRO = "des";
        LATENCY = 17;
        STATEFUL = NO;
        EXTERNAL = NO;
        PIPELINED = YES;

        INPUTS = 3:
            I0 = INT 64 BITS (desIn[63:0])
            I1 = INT 64 BITS (keyin[63:0])
            I2 = INT 32 BITS (decrypt)
            ;

        OUTPUTS = 1:
            O0 = INT 64 BITS (desOut[63:0])
            ;

        IN_SIGNAL : 1 BITS "clk" = "CLOCK";

Page 48

Encryption on SRC-6 – No streaming: des.info (2)

        DEBUG_HEADER = $
            void des__dbg (long long desin, long long keyin, int decrypt, long long *desout);
        $;

        DEBUG_FUNC = $
            #include <des.h>
            void des__dbg(long long desin, long long keyin, int decrypt, long long *desout)
            {
                des_(desout, &desin, &keyin, &decrypt);
            }
        $;
    END_DEF

Page 49

Encryption on SRC-6 – with streaming: encryption.mc (1)

    #include <libmap.h>

    void encryption (uint64_t sdata[], uint64_t key,
                     uint64_t *hardware_timeprocess,
                     uint64_t *hardware_timeout,
                     int mapnum)
    {
        OBM_BANK_A (S1OBM, uint64_t, MAX_OBM_SIZE)
        OBM_BANK_B (S2OBM, uint64_t, MAX_OBM_SIZE)
        OBM_BANK_D (S4OBM, uint64_t, MAX_OBM_SIZE)
        OBM_BANK_E (S5OBM, uint64_t, MAX_OBM_SIZE)
        uint32_t encrypt_decrypt;   //0:encrypt 1:decrypt
        int i, nbytes;
        uint64_t t1,t2,t3;
        Stream_64 S0, S1;
        uint64_t v0, v1;
        encrypt_decrypt = 0;
        nbytes = MAX_OBM_SIZE * 8*2;

Page 50

Encryption on SRC-6 – with streaming: encryption.mc (2)

        start_timer();

        read_timer(&t1);

    #pragma src parallel sections
        {
    #pragma src section
            {
                stream_dma_cpu_dual (&S0, &S1, PORT_TO_STREAM, S1OBM, DMA_A_B, sdata, 1, nbytes);
            }

    #pragma src section
            {
                for (i=0; i<MAX_OBM_SIZE; i++)
                {
                    get_stream (&S0, &v0);
                    get_stream (&S1, &v1);

                    des (v0, key, encrypt_decrypt, &S4OBM[i]);
                    des (v1, key, encrypt_decrypt, &S5OBM[i]);
                };
            }
        }

Page 51

Encryption on SRC-6 – with streaming: encryption.mc (3)

        read_timer(&t2);
        DMA_CPU(OBM2CM, S4OBM, MAP_OBM_stripe(1,"D,E"), sdata, 1, nbytes, 5);
        wait_DMA(5);
        read_timer(&t3);
        *hardware_timeprocess = t2-t1;
        *hardware_timeout = t3-t2;
    }

Page 52

Results: SRC-6 without streaming

All throughputs in Mbits/s.

Application                   | Computational throughput (SRC-6) | Data transfer in throughput (SRC-6) | Data transfer out throughput (SRC-6) | End-to-end throughput (SRC-6) | End-to-end throughput (Xeon 2.8 GHz) | Speed-up
3 DES ciphers (64-bit block)  | 19,200 | 11,350 | 10,760 | 4,240 | 93  | 46
3 IDEA ciphers (64-bit block) | 19,200 | 11,350 | 10,760 | 4,240 | 113 | 38
3 RC5 ciphers (64-bit block)  | 19,200 | 11,350 | 10,760 | 4,240 | 560 | 7.5
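One way to read the no-streaming figures is as three stages that run one after another on the same data: transfer in, computation, transfer out. Under that simple serial model, which is an assumption of this sketch and not the measurement method described in the slides, the time per bit is the sum of the per-stage times. The function names are illustrative.

```c
#include <assert.h>

/* End-to-end throughput (Mbit/s) when transfer-in, computation and
 * transfer-out run one after another on the same data volume:
 * the time per bit is the sum of the per-stage times. */
double serial_throughput_mbps(double in_mbps, double compute_mbps,
                              double out_mbps)
{
    assert(in_mbps > 0.0 && compute_mbps > 0.0 && out_mbps > 0.0);
    return 1.0 / (1.0 / in_mbps + 1.0 / compute_mbps + 1.0 / out_mbps);
}

/* Speed-up of an accelerator over a software baseline. */
double speedup(double accelerator_mbps, double baseline_mbps)
{
    assert(baseline_mbps > 0.0);
    return accelerator_mbps / baseline_mbps;
}
```

With the slide's stage throughputs of 11,350, 19,200 and 10,760 Mbit/s the model lands a little under 4,300 Mbit/s, close to the 4,240 Mbit/s reported as measured. Dividing 4,240 by the Xeon figures of 560, 113 and 93 Mbit/s gives the speed-ups of about 7.5, 38 and 46.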

Page 53

Results: SRC-6 with streaming (3 units)

All throughputs in Mbits/s.

Application                   | Computational throughput (SRC-6) | Data transfer in & processing throughput (SRC-6) | Data transfer out throughput (SRC-6) | End-to-end throughput (SRC-6) | End-to-end throughput (Xeon 2.8 GHz) | Speed-up
3 DES ciphers (64-bit block)  | NA | 9,000 | 10,760 | 4,800 | 93  | 52
3 IDEA ciphers (64-bit block) | NA | 9,000 | 10,760 | 4,800 | 113 | 42.5
3 RC5 ciphers (64-bit block)  | NA | 9,000 | 10,760 | 4,800 | 560 | 8.5

Page 54

Results: SRC-6 with streaming (2 units)

All throughputs in Mbits/s.

Application                   | Computational throughput (SRC-6) | Data transfer in & processing throughput (SRC-6) | Data transfer out throughput (SRC-6) | End-to-end throughput (SRC-6) | End-to-end throughput (Xeon 2.8 GHz) | Speed-up
2 DES ciphers (64-bit block)  | NA | 11,350 | 10,760 | 5,400 | 93  | 58
2 IDEA ciphers (64-bit block) | NA | 11,350 | 10,760 | 5,400 | 113 | 47.5
2 RC5 ciphers (64-bit block)  | NA | 11,350 | 10,760 | 5,400 | 560 | 9.5

Page 55

Results: SGI Altix MOATB without streaming

All throughputs in Mbits/s.

Application                  | Computational throughput (Altix) | Data transfer in & processing throughput (Altix) | Data transfer out throughput (Altix) | End-to-end throughput (Altix) | End-to-end throughput (Xeon 2.8 GHz) | Speed-up
1 DES cipher (64-bit block)  | 12,800 (200 MHz) | NA | NA | 2,430 | 93  | 26
1 IDEA cipher (64-bit block) | 6,400 (100 MHz)  | NA | NA | 2,040 | 113 | 18
1 RC5 cipher (64-bit block)  | 12,800 (200 MHz) | NA | NA | 2,430 | 560 | 4.5

Page 56

Results: SGI Altix MOATB with streaming

All throughputs in Mbits/s.

Application                  | Computational throughput (Altix) | Data transfer in & processing throughput (Altix) | Data transfer out throughput (Altix) | End-to-end throughput (Altix) | End-to-end throughput (Xeon 2.8 GHz) | Speed-up
1 DES cipher (64-bit block)  | 12,800 (200 MHz) | NA | NA | 3,080 | 93  | 33
1 IDEA cipher (64-bit block) | 6,400 (100 MHz)  | NA | NA | 2,480 | 113 | 22
1 RC5 cipher (64-bit block)  | 12,800 (200 MHz) | NA | NA | 3,080 | 560 | 5.5

Page 57

Example 2

Cryptography: Cipher Breaking

Page 58

Secret-key cipher breaking

Given: ciphertext, guessed fragment of the plaintext

Looked for: remaining plaintext or key

Method: exhaustive key search (brute-force) attack

[Diagram: successive keys are fed into the cipher.]

Page 59

Secret-key cipher breaking

[Diagram: the cipher breaker receives a message-ciphertext pair (M0, C0), generates the candidate keys K1, K2, K3, ..., KN itself, and outputs the correct key.]

• Negligibly small input/output
• Huge amount of computations

Page 60

Cipher Breaking Results - SRC-6

Application                                        | Theoretical maximum computational throughput, SRC-6 (million keys/s) | Measured end-to-end throughput, SRC-6 (million keys/s) | Measured end-to-end throughput, Xeon 2.8 GHz (million keys/s) | Speed-up
DES cipher breaking (20 units working in parallel)  | 2000 | 2000 | 1.77 | 1130
IDEA cipher breaking (10 units working in parallel) | 1000 | 1000 | 2.19 | 457
RC5 cipher breaking (2 units working in parallel)   | 200  | 200  | 0.71 | 282

Page 61

Cipher Breaking Results - SGI Altix MOATB

Application                                        | Theoretical maximum computational throughput, SGI (million keys/s) | Measured end-to-end throughput, SGI (million keys/s) | Measured end-to-end throughput, Xeon 2.8 GHz (million keys/s) | Speed-up
DES cipher breaking (10 units working in parallel) | 2000 | 2000 | 1.77 | 1130

Page 62

Example 3

Cryptography: Key exchange using ECC

Page 63

Secret-key ciphers

[Diagram: Alice encrypts and Bob decrypts across the network. Both use the same key of Alice and Bob, KAB.]

Page 64

Key Distribution Problem

N users need N · (N-1) / 2 keys, one for each pair of users.

Users | Keys
100   | ≈ 5,000
1000  | ≈ 500,000
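The pairwise key count is easy to check with a one-line function. The name `pairwise_keys` and the overflow guard are choices made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct pairwise secret keys needed by n users: n(n-1)/2. */
uint64_t pairwise_keys(uint64_t n)
{
    assert(n <= UINT64_C(4294967296));   /* keeps n*(n-1) within 64 bits */
    return n * (n - 1) / 2;
}
```

The exact values are 4,950 keys for 100 users and 499,500 keys for 1000 users, which the slide rounds to 5,000 and 500,000. The quadratic growth is what motivates the public-key approach on the following slides.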

Page 65

Public Key (Asymmetric) Ciphers

[Diagram: Alice encrypts with the public key of Bob, KB. The message crosses the network and Bob decrypts it with the private key of Bob, kB.]

Page 66

Key exchange for secret-key ciphers

[Diagram: Alice generates a session key (random secret key) and encrypts it using Bob's public key. The encrypted session key and the message encrypted using the session key travel across the network. Bob recovers the session key with Bob's private key.]

Page 67

Digital Signature

[Diagram: Alice passes the message through a hash function and applies the public key cipher to the hash value with Alice's private key, producing the signature. She sends the message and the signature. Bob applies the hash function to the received message (hash value 1) and applies the public key cipher to the signature with Alice's public key (hash value 2). He compares the two hash values: yes or no.]

Page 68

Why is public-key cryptography a good application for reconfigurable computers?

• computationally intensive arithmetic operations

• unconventionally long operand sizes (160-2048 bits)

• multiple algorithms, parameters, key sizes, and architectures = need for reconfiguration

Page 69

Elliptic Curve Cryptosystems (ECC)

• a family of cryptosystems, rather than a single cryptosystem = added security but need for reconfiguration

• public key (asymmetric) cryptosystems

• used for key agreement and digital signatures

• implementations must be optimized for minimum latency rather than maximum throughput = limited speed-up from parallel processing

Page 70

Basic operations of ECC

Basic operations in Galois Field GF(2^m)
• addition and subtraction: x+y, x-y (XOR)
• multiplication, squaring: x·y, x^2
• inversion: x^(-1)

Basic operations on points of an Elliptic Curve
• addition of points: P + Q
• doubling a point: 2 P
• projective to affine coordinate: P2A

Complex operations on points of an Elliptic Curve
• scalar multiplication: k P = P + P + … + P (k times)
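The field operations listed above can be illustrated in a very small field. The sketch below works in GF(2^8) with polynomial basis and the reduction polynomial x^8 + x^4 + x^3 + x + 1, chosen here only because its worked examples are widely published; the ECC macros in the slides use much larger fields, and the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Addition (and subtraction) in GF(2^m) is a bitwise XOR. */
uint8_t gf256_add(uint8_t x, uint8_t y)
{
    return (uint8_t)(x ^ y);
}

/* Shift-and-add multiplication in GF(2^8), polynomial basis, reduced
 * modulo x^8 + x^4 + x^3 + x + 1. */
uint8_t gf256_mul(uint8_t x, uint8_t y)
{
    uint8_t product = 0;
    for (int bit = 0; bit < 8; bit++) {
        if (y & 1)
            product ^= x;            /* add the current multiple of x */
        uint8_t carry = x & 0x80;    /* would the next shift reach degree 8? */
        x = (uint8_t)(x << 1);
        if (carry)
            x ^= 0x1B;               /* reduce by the field polynomial */
        y >>= 1;
    }
    return product;
}

/* Squaring is multiplication of an element by itself. */
uint8_t gf256_square(uint8_t x)
{
    return gf256_mul(x, x);
}
```

Every step is a shift, an AND, or an XOR, with no carries between bit positions. That is why these operations map well onto FPGA logic even when m is in the hundreds.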

Page 71

Hierarchy of ECC functions

High level:    kP
Medium level:  P+Q, 2P, projective_to_affine (P2A)
Low level 2:   MUL, INV
Low level 1:   ROT, XOR
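The top of the hierarchy, kP, is normally built from the medium-level operations P+Q and 2P with the double-and-add method rather than k-1 additions. The sketch below shows that control structure on a stand-in group, the integers modulo m under addition, so that it stays short and checkable. Real point addition and doubling would replace the two helpers, and all names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in group: integers modulo m under addition. An ECC library would
 * replace these two helpers with point addition and point doubling. */
static uint64_t group_add(uint64_t p, uint64_t q, uint64_t m)
{
    return (p + q) % m;
}

static uint64_t group_double(uint64_t p, uint64_t m)
{
    return (p + p) % m;
}

/* Left-to-right double-and-add: one doubling per bit of k and one
 * addition per set bit of k. */
uint64_t scalar_multiply(uint64_t k, uint64_t p, uint64_t m)
{
    uint64_t result = 0;                 /* identity element of the toy group */
    assert(m > 0 && m <= UINT32_MAX);    /* keeps p + q from overflowing */
    p %= m;
    for (int bit = 63; bit >= 0; bit--) {
        result = group_double(result, m);
        if ((k >> bit) & 1)
            result = group_add(result, p, m);
    }
    return result;
}
```

Each loop iteration depends on the previous one, so the chain cannot simply be spread across parallel units. This matches the slides' remark that ECC has to be optimized for latency and gains little from parallel processing.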

Page 72

SRC Program Partitioning

[Diagram: an application is split into three parts.
• C function for µP: HLL, runs on the µP system
• C function for MAP: HLL, runs on the FPGA system
• VHDL macro: HDL, runs on the FPGA system]

Page 73

Investigated Partitioning Schemes

Page 74

µP Software Only

[Diagram: kP is implemented entirely in the C function for µP. The C function for FPGA and the VHDL macro are not used.]

Based on public-domain code by Rosing M., Implementing Elliptic Curve Cryptography, Manning, 1999

Page 75

0HL1 Partitioning

[Diagram with three layers, marked 0, H, and L1:
• C function for µP (0)
• C function for FPGA (H): kP, P+Q, 2P, P2A, INV, MUL
• VHDL macros (L1): ROT, V_ROT, XOR, MUL2, MUL4]

Page 76

0HL2 Partitioning

[Diagram with three layers, marked 0, H, and L2:
• C function for µP (0)
• C function for FPGA (H): kP, P+Q, 2P, P2A
• VHDL macros (L2): INV, MUL2, MUL4, V_ROT, ROT, XOR]

Page 77

0HM Partitioning

[Diagram with three layers, marked 0, H, and M:
• C function for µP (0)
• C function for FPGA (H): kP
• VHDL macros (M): P+Q, 2P, P2A]

Page 78

00H Partitioning (VHDL only)

[Diagram with three layers, marked 0, 0, and H:
• C function for µP (0)
• C function for FPGA (0)
• VHDL macro (H): kP]

Page 79

Timing Measurements

[Timeline diagram. Phases shown in order: FPGA Configure, MAP Alloc., DMA Data In, FPGA Computation, DMA Data Out, MAP Free. Intervals marked: configuration time, MAP allocation time, MAP release time, End-to-End time (HW) around the MAP function inside the .mc file, and End-to-End time (SW) around the MAP function call in the .c file.]

Page 80

Results (Latency)

Times in µs.

System level architecture | End-to-end time | Data transfer in time | FPGA computation time | Data transfer out time | Total overhead | Speedup vs. software
Software   | 772,519 |    |     |    |     |
0HL1       | 866     | 37 | 472 | 14 | 394 | 893
0HL2       | 863     | 37 | 469 | 14 | 394 | 895
0HM        | 592     | 37 | 201 | 12 | 391 | 1305
VHDL macro | 592     | 39 | 201 | 17 | 391 | 1305

[Bar chart: End-to-End Time and FPGA Computation Time in µs (axis 0 to 900) for the different architectures 0HL1, 0HL2, 0HM, and 00H (VHDL).]

Page 81

Results (Area)

System level architecture | % of CLB slices (out of 33,792) | CLB increase vs. pure VHDL | % of LUTs (out of 67,584) | LUT increase vs. pure VHDL | % of FFs (out of 67,584) | FF count increase vs. pure VHDL
Software | N/A |      |    |      |    |
0HL1     | 99  | 1.68 | 57 | 1.3  | 68 | 2.61
0HL2     | 92  | 1.56 | 52 | 1.18 | 62 | 2.38
0HM      | 75  | 1.27 | 48 | 1.09 | 39 | 1.5

[Bar chart: % of CLB slices, LUTs, and FFs (axis 0 to 100%) for the different architectures 0HL1, 0HL2, 0HM, and 00H.]

Page 82

Number of lines of code

Algorithm partitioning scheme | Main C | MAP C | Macro wrapper | VHDL
0HL1       | 153 | 371 | 260 | 1007
0HL2       | 153 | 349 | 230 | 1291
0HM        | 153 | 185 | 160 | 1744
VHDL macro | 153 | 78  | 36  | 1960

Page 83

Conclusions

Assuming focus on:
• Timing
• Resources
• Ease of programming

Page 84

Conclusions – cont.

The best implementation approach: the 0HL1 partitioning scheme

893x speed-up vs. software and an end-to-end time only 46% longer than pure VHDL (866 µs vs. 592 µs), with ease of implementation

[Diagram of the 0HL1 partitioning with three layers, marked 0, H, and L1:
• C function for µP (0)
• C function for MAP (H): kP, P+Q, 2P, P2A, INV, MUL
• VHDL macros (L1): ROT, V_ROT, XOR, MUL2, MUL4]

Page 85

Conclusions – cont.

• Elliptic Curve Cryptosystem implementation is challenging for reconfigurable computers because of
  • optimization for latency rather than throughput
  • limited amount of parallelism

• First publication showing a 1000x speed-up for a reconfigurable computer application optimized for data latency

Page 86

Summary of results

Type of application                                         | End-to-end speed-up of SRC-6 vs. P4
Computationally intensive (cipher breaking)                 | 300-1100
Latency critical (ECC key exchange)                         | 890-1300
Input/output intensive (secret key encryption/decryption)   | 10-60

Page 87

GWU_GMU secret key cipher libraries

1. Secret key cipher encryption and decryption
   • DES
   • IDEA
   • RC5

2. Secret key cipher breaking
   • DES
   • IDEA
   • RC5

Page 88

GWU_GMU public key cipher libraries

1. Operations in the binary Galois Fields GF(2^m)
   a. polynomial basis
   b. normal basis

2. Multiprecision integer arithmetic

3. Elliptic Curve Operations
   - addition
   - doubling
   - scalar multiplication

Page 89

Example 4

Image Processing: Hyperspectral Dimension Reduction

Page 90

Application: Hyperspectral Dimension Reduction

Multispectral / Hyperspectral Imagery Comparison

• Multi-spectral imagery: 10's of bands
• Hyperspectral imagery: 100's-1000's of bands
• Challenges: curse of dimensionality
• Solution: on-board dimension reduction
• Needs: higher performance, higher flexibility
• High-Performance Reconfigurable Computing

Page 91

Hyperspectral Dimension Reduction (Techniques)

Principal Component Analysis (PCA):
• Most common method of dimension reduction
• Complex and global computations: difficult for parallel processing and hardware implementations
• Does not preserve spectral signatures

Wavelet-Based Dimension Reduction*:
• Simple and local operations
• High-performance implementation
• Preserves spectral signatures
• Multi-resolution wavelet decomposition of each pixel's 1-D spectral signature (preservation of spectral locality)

* S. Kaewpijit, J. Le Moigne, T. El-Ghazawi, “Automatic Reduction of Hyperspectral Imagery Using Wavelet Spectral Analysis”, IEEE Transactions on Geoscience and Remote Sensing, Vol. 41, No. 4, April, 2003, pp. 863-871.


The input image is first convolved along the rows by the two filters L and H and decimated along the columns by two, resulting in two "column-decimated" images, L and H

Each of the two images, L and H, is then convolved along the columns by the two filters L and H and decimated along the rows by two

This decomposition results in four images: LL, LH, HL, and HH

The LL image is taken as the new input to perform the next level of decomposition

Discrete Wavelet Transform (DWT) Decomposition (Mallat Algorithm)
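The two filtering passes described above can be sketched as one decomposition level in plain C. This is a minimal illustration assuming Haar filters (L = pairwise average, H = pairwise half-difference) and even image dimensions; `dwt2_level` is a hypothetical name, and the Daubechies filters used later in the slides would need longer taps and boundary handling.

```c
#include <assert.h>
#include <stdlib.h>

/* img is rows x cols, row-major. Each output holds (rows/2) x (cols/2) values.
 * Returns 0 on success, -1 if the temporary buffers cannot be allocated. */
int dwt2_level(const double *img, int rows, int cols,
               double *LL, double *LH, double *HL, double *HH)
{
    assert(rows > 0 && cols > 0 && rows % 2 == 0 && cols % 2 == 0);
    int hr = rows / 2, hc = cols / 2;
    double *L = malloc(sizeof *L * rows * hc);
    double *H = malloc(sizeof *H * rows * hc);
    if (!L || !H) { free(L); free(H); return -1; }

    /* Pass 1: filter along each row, keep every second column. */
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < hc; c++) {
            double a = img[r * cols + 2 * c];
            double b = img[r * cols + 2 * c + 1];
            L[r * hc + c] = 0.5 * (a + b);
            H[r * hc + c] = 0.5 * (a - b);
        }

    /* Pass 2: filter L and H along each column, keep every second row. */
    for (int r = 0; r < hr; r++)
        for (int c = 0; c < hc; c++) {
            double l0 = L[2 * r * hc + c], l1 = L[(2 * r + 1) * hc + c];
            double h0 = H[2 * r * hc + c], h1 = H[(2 * r + 1) * hc + c];
            LL[r * hc + c] = 0.5 * (l0 + l1);
            LH[r * hc + c] = 0.5 * (l0 - l1);
            HL[r * hc + c] = 0.5 * (h0 + h1);
            HH[r * hc + c] = 0.5 * (h0 - h1);
        }

    free(L);
    free(H);
    return 0;
}
```

Calling the function again on the LL output gives the next decomposition level.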


Wavelet-Based Dimension Reduction (Description)


DWT on SRC-6

Measurements Scenario (flowchart):

1. MAP Alloc.
2. Read Data: transfer coefficients to OBM bank c; transfer image data to OBM bank a
3. Load coefficients from bank c to on-chip registers
4. Loop until End of Image: read one pixel from bank a, compute Wavelet, store result into bank b
5. Write Data: transfer image data from bank b to the host
6. MAP Free


DWT on SRC-6 (cont’d) (Main Program)

int main (int argc, char *argv[])
{
    . . .
    /* initialize Daubechies coefficients */
    float coeff[8];
    coeff[0] = coeff[1] = (1+sqrt(3))/(4*sqrt(2));
    coeff[2] = coeff[3] = (3+sqrt(3))/(4*sqrt(2));
    coeff[4] = coeff[5] = (3-sqrt(3))/(4*sqrt(2));
    coeff[6] = coeff[7] = (1-sqrt(3))/(4*sqrt(2));
    . . .
    /* allocate images */
    . . .
    map_allocate(1);

    gettimeofday(&time0, NULL);

    proc_fpga (image_in, image_out, coeff, dx, dy, &ht0, &ht1, &ht2, &ht3, mapno);

    gettimeofday(&time1, NULL);

    /* print time difference */
    . . .
    map_free(1);
    . . .
}

Allocate the RP

• configure and start the Program execution on the FPGA

• passing the input image pointer and the output image buffer pointer to be used by DMA

• individual parameters can be passed to the MAP C function such as image dimensions

• large parameter array, such as the wavelet coefficients, can be transferred using DMA by passing the pointer of the coefficients array

Free the RP
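The four values initialized in the main program are the Daubechies-4 low-pass taps. A small sketch can make their properties explicit: the taps sum to sqrt(2), their squares sum to 1, and a matching high-pass filter is commonly derived as hp[k] = (-1)^k * lp[3-k]. The slide does not show how the high-pass taps are formed, so that relation and the names `daub4_filters` and `close_to` are assumptions here.

```c
#include <assert.h>

/* Square roots written as literals so the sketch needs no math library. */
#define SQRT2 1.4142135623730951
#define SQRT3 1.7320508075688772

/* Fill lp[0..3] with the Daubechies-4 low-pass taps and hp[0..3] with
 * high-pass taps derived by the (assumed) alternating-flip relation. */
void daub4_filters(double lp[4], double hp[4])
{
    assert(lp && hp);
    const double d = 4.0 * SQRT2;
    lp[0] = (1.0 + SQRT3) / d;
    lp[1] = (3.0 + SQRT3) / d;
    lp[2] = (3.0 - SQRT3) / d;
    lp[3] = (1.0 - SQRT3) / d;
    for (int k = 0; k < 4; k++)
        hp[k] = ((k % 2) ? -1.0 : 1.0) * lp[3 - k];
}

/* Returns 1 when a and b differ by less than tol. */
int close_to(double a, double b, double tol)
{
    double e = a - b;
    return e < tol && e > -tol;
}
```

A zero sum of the high-pass taps means a constant input produces no detail output, which is a quick sanity check before moving the coefficients to hardware.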


DWT on SRC-6 (cont’d) MAP C Function (FPGA.mc)

void proc_fpga(int64_t image_in[], int64_t image_out[], int64_t coeff[], int dx, int dy,
               long long *ht0, long long *ht1, long long *ht2, long long *ht3, int mapnum)
{
    // coefficients
    float LP0, LP1, LP2, LP3, LP4;
    float HP0, HP1, HP2, HP3, HP4;

    // variables
    int i, j, k;
    int64_t in_pixel, out_pixel;

    // input image
    OBM_BANK_A (AL, int64_t, MAX_OBM_SIZE)

    // output image
    OBM_BANK_B (BL, int64_t, MAX_OBM_SIZE)

    // filter coefficients
    OBM_BANK_C (CL, int64_t, MAX_OBM_SIZE)


DWT on SRC-6 (cont’d) MAP C Function (FPGA.mc)

    start_timer();
    read_timer(ht0);

    // transfer image data to an OBM bank (DMA input image transfer)
    DMA_CPU (CM2OBM, AL, MAP_OBM_stripe (1, "A"), image_in, 1, Image_size, 0);
    wait_DMA (0);

    // transfer coefficients to an OBM bank (DMA coefficients transfer)
    DMA_CPU (CM2OBM, CL, MAP_OBM_stripe (1, "C"), coeff, 1, 4*sizeof(int64_t), 0);
    wait_DMA (0);

    read_timer(ht1);

    // load coefficients from the OBM bank to on-chip registers
    for (i = 0; i < 4; i++) {
        LP0 = LP1; HP0 = HP1;
        LP1 = LP2; HP1 = HP2;
        LP2 = LP3; HP2 = HP3;
        split_64to32_flt_flt(CL[i], &HP3, &LP3);
    }


DWT on SRC-6 (cont’d) MAP C Function (FPGA.mc)

    for (i = 0; i < Image_size; i++) {
        // read pixel value from the OBM bank
        in_pixel = AL[i];

        // compute Wavelet
        { . . . }

        // store results to the OBM bank
        BL[i] = out_pixel;
    }
    read_timer(ht2);

    // transfer image data to the host
    DMA_CPU (OBM2CM, BL, MAP_OBM_stripe (1, "B"), image_out, 1, Image_size, 0);
    wait_DMA (0);

    read_timer(ht3);
}
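The coefficient-loading loop in the MAP C function relies on each 64-bit OBM word carrying two 32-bit floats, which split_64to32_flt_flt separates. A host-side sketch of that packing in portable C is shown below. The function names are hypothetical, and which half of the word holds which value is an assumption; the SRC macro's actual layout should be checked against the vendor documentation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack two floats into one 64-bit word: hi in bits 63..32, lo in bits 31..0
 * (an assumed layout). memcpy copies the bit patterns without aliasing issues. */
uint64_t pack_flt_flt(float hi, float lo)
{
    uint32_t h, l;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&h, &hi, sizeof h);
    memcpy(&l, &lo, sizeof l);
    return ((uint64_t)h << 32) | l;
}

/* Inverse operation: recover both floats from a packed word. */
void unpack_flt_flt(uint64_t word, float *hi, float *lo)
{
    uint32_t h = (uint32_t)(word >> 32);
    uint32_t l = (uint32_t)(word & 0xFFFFFFFFu);
    assert(hi && lo);
    memcpy(hi, &h, sizeof h);
    memcpy(lo, &l, sizeof l);
}
```

Packing two taps per word halves the number of 64-bit DMA transfers needed for the coefficient array.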


Overlapping Data Transfer with Computation (SRC-6)

Improve performance by overlapping algorithm computation with data loading and unloading

Parallel sections: multiple parallel code blocks are active in parallel

#pragma src parallel sections
{
    #pragma src section
    {
        for (i = 0; i < MAX_OBM_SIZE; i++)
        {
            get_stream (&S0, &v0);
            DO COMPUTATION (Current Data Block)
        }
    } /* end of parallel section with compute loop */

    #pragma src section
    {
        /* Stream DMA_IN */
        stream_dma_cpu_dual (&S0, &S1, PORT_TO_STREAM, S1OBM, DMA_A_B, sdata, 1, nbytes);
    } /* end of parallel section with DMA */
} /* end of parallel sections */

Time       Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
Read DMA   1        2        3        X        X
Algorithm  X        1        2        3        X
Write DMA  X        X        1        2        3
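The benefit shown in the timing table can be estimated with a short calculation. This is a minimal sketch assuming every stage (read DMA, algorithm, write DMA) takes one equal-length cycle per data block; the function names are hypothetical.

```c
#include <assert.h>

/* Cycles needed when each block passes through all stages before the
 * next block starts (no overlap). */
unsigned serial_cycles(unsigned blocks, unsigned stages)
{
    return blocks * stages;
}

/* Cycles needed when the stages run as a pipeline: the first block takes
 * `stages` cycles, and every further block completes one cycle later. */
unsigned overlapped_cycles(unsigned blocks, unsigned stages)
{
    assert(stages > 0);
    return blocks ? blocks + stages - 1 : 0;
}
```

For the three blocks and three stages in the table, the overlapped schedule takes 5 cycles instead of 9, and the advantage approaches a factor of three as the number of blocks grows.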


Streams (SRC-6)

Stream_64 S0;

#pragma src parallel sections
{
    #pragma src section
    {
        int i;
        for (i=0; i<sz; i++)
            put_stream (&S0, AL[i]+42, 1);
    } /* end of parallel section */

    #pragma src section
    {
        int i;
        for (i=0; i<sz; i++)
            get_stream (&S0, &BL[i]);
    } /* end of parallel section */
} /* end of parallel sections */

[Figure: Conventional Data Flow vs. Streams and Conventional Data Flow. Conventional: On-Board Memory or BRAM, Compute Loop 1, On-Board Memory or BRAM, Compute Loop 2, On-Board Memory or BRAM. With streams: On-Board Memory or BRAM, Compute Loop 1, Streams, Compute Loop 2, On-Board Memory or BRAM. Data is flowing in the logic, which saves access to On-Board Memory and time.]

A stream is a data structure that allows flexible communication between concurrent producer and consumer loops
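The producer and consumer roles can be modelled in software with a bounded FIFO. The sketch below is a single-threaded illustration only: the type `stream64`, the depth of 8, and the functions `stream_init`, `stream_put`, and `stream_get` are hypothetical and are not the SRC `put_stream`/`get_stream` primitives. In hardware the producer or consumer would stall instead of receiving a failure code.

```c
#include <assert.h>
#include <stdint.h>

#define STREAM_DEPTH 8            /* hypothetical buffer depth */

typedef struct {
    int64_t  buf[STREAM_DEPTH];
    unsigned head;                /* index of the next value to read */
    unsigned count;               /* number of values currently buffered */
} stream64;

void stream_init(stream64 *s)
{
    assert(s);
    s->head = 0;
    s->count = 0;
}

/* Producer side: returns 1 on success, 0 if the stream is full. */
int stream_put(stream64 *s, int64_t v)
{
    assert(s);
    if (s->count == STREAM_DEPTH)
        return 0;
    s->buf[(s->head + s->count) % STREAM_DEPTH] = v;
    s->count++;
    return 1;
}

/* Consumer side: returns 1 on success, 0 if the stream is empty. */
int stream_get(stream64 *s, int64_t *v)
{
    assert(s && v);
    if (s->count == 0)
        return 0;
    *v = s->buf[s->head];
    s->head = (s->head + 1) % STREAM_DEPTH;
    s->count--;
    return 1;
}
```

Values leave the stream in the order they were put in, so the consumer loop never needs to touch on-board memory for the intermediate results.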


Cray XD-1


DWT on Cray-XD1 (Main Program)

/* Define the address space for user registers and QDR memory */
#define APP_CFG_REG 0x08UL
#define USR_REG1    0x40UL
#define USR_REG2    0x48UL
#define USR_REG3    0x50UL
#define USR_REG4    0x58UL
#define QDR1_OFFSET 0x100UL /* Offset in FPGA memory space. */

int main (int argc, char *argv[])
{
    int fp_id;
    err_e e;
    u_64 coeff[4];
    u_64 * dp_base;
    u_64 * image;

    /* Open the FPGA device */
    fp_id = fpga_open ("/dev/ufp0", O_RDWR|O_SYNC, &e);

    /* Load the FPGA */
    fpga_load (fp_id, "top.bin.ufp", &e);

    . . .
    /* Read Image */
    . . .
    /* initialize Daubechies coefficients */
    . . .
    /* Transfer coefficients into the FPGA registers */
    fpga_wrt_appif_val (fp_id, coeff[0], USR_REG1, TYPE_VAL, &e);
    fpga_wrt_appif_val (fp_id, coeff[1], USR_REG2, TYPE_VAL, &e);
    fpga_wrt_appif_val (fp_id, coeff[2], USR_REG3, TYPE_VAL, &e);
    fpga_wrt_appif_val (fp_id, coeff[3], USR_REG4, TYPE_VAL, &e);


DWT on Cray-XD1 (cont’d) (Main Program)

    /* Configure the Wavelet for QDR bridging */
    fpga_wrt_appif_val (fp_id, 0x0UL, APP_CFG_REG, TYPE_VAL, &e);

    /* Map the entire 4 Mbytes of QDR memory */
    dp_base = fpga_memmap (fp_id, QDR_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, QDR1_OFFSET, &e);

    /* Transfer the image into the QDR */
    for (i = 0; i < QDR_SIZE/sizeof (u_64); i++)
        dp_base[i] = image[i];

    /* Start processing */
    fpga_wrt_appif_val (fp_id, 0x1UL, APP_CFG_REG, TYPE_VAL, &e);

    /* ... */

    /* Read the FPGA status */
    fpga_rd_appif_val (fp_id, &status, STATUS_REG, &e);

    /* ... */

    /* Configure the Wavelet for QDR bridging */
    fpga_wrt_appif_val (fp_id, 0x0UL, APP_CFG_REG, TYPE_VAL, &e);

    /* Read back the image */
    for (i = 0; i < QDR_SIZE/sizeof (u_64); i++)
        output_image[i] = dp_base[i];

    /* Close the FPGA device */
    fpga_close (fp_id, &e);
}


Accessing µP memory from FPGA (Cray-XD1)

unsigned long order;
void *ftr_mem;

/* ... */

ftr_mem = fpga_set_ftrmem(fpga_fd, order, &err);
if (ftr_mem == NULL) {
    /* Handle error. */
}

fpga_wrt_appif_val (fp_id, (u_64) ftr_mem, BUFF0_PTR_REG, TYPE_ADDR, &e);

fpga_wrt_appif_val (fp_id, 0x1UL, APP_CFG_REG, TYPE_VAL, &e);

/* ... */

fpga_rd_appif_val (fp_id, &status, STATUS_REG, &e);

/* ... */

• The APIs support access to a region of the µP memory by the FPGA logic
• The program uses the fpga_set_ftrmem function, which:
  - allocates an FTR
  - associates it with the address space of the µP
  - sets up the FPGA to access it directly
• It does not automatically provide the address of this region to the FPGA application logic
  - One way is to establish an FPGA register for that purpose and use the fpga_wrt_appif_val function to write the value to the register


Using MPI on Cray-XD1

if (MYTHREAD == 0)
    read_image (image_file_name, image_buffer, &rows, &cols);

MPI_Bcast(&rows, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&cols, 1, MPI_INT, 0, MPI_COMM_WORLD);

local_size = rows*cols/THREADS;

MPI_Scatter(image_buffer, local_size, MPI_UNSIGNED_LONG, local_image_buffer, local_size, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD);

/* Execute the wavelet on the hardware */
process_image(fp_id, local_image_buffer, local_output_buffer, rows, cols);

MPI_Gather(local_output_buffer, local_size, MPI_UNSIGNED_LONG, output_image_buffer, local_size, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD);

if (MYTHREAD == 0)
    write_image (output_file_name, output_image_buffer, rows, cols);

Each Cray XD1 chassis consists of 6 dual-processor Opteron nodes, each with:
• 2 Opteron processors (total 12)
• 1 Xilinx Virtex-II Pro 50 (total 6)

Applications can be parallelized across the 6 FPGAs using MPI

Data are distributed across the 6 FPGAs
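MPI_Scatter sends the same number of elements to every rank, so the listing above implicitly assumes that rows*cols divides evenly by the number of ranks. For the uneven case, a helper can compute per-rank counts and displacements in the form that MPI_Scatterv and MPI_Gatherv expect. The sketch below is plain C with no MPI dependency; `block_partition` is a hypothetical name, not part of MPI or the Cray API.

```c
#include <assert.h>

/* Split `total` elements over `nranks` ranks as evenly as possible.
 * The first (total % nranks) ranks receive one extra element.
 * counts[r] is the size of rank r's block and displs[r] its start offset. */
void block_partition(int total, int nranks, int *counts, int *displs)
{
    assert(total >= 0 && nranks > 0 && counts && displs);
    int base = total / nranks;
    int extra = total % nranks;
    int offset = 0;
    for (int r = 0; r < nranks; r++) {
        counts[r] = base + (r < extra ? 1 : 0);
        displs[r] = offset;
        offset += counts[r];
    }
}
```

With six FPGAs per chassis, calling the helper with nranks = 6 gives each node a contiguous slice of the image whose size differs by at most one element from the others.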


DWT on SGI-Altix (Main Program)

rasclib_algorithm_request_t ar;
int alg_id;
int res;
strcpy(ar.alg_id, "Wavelet");
ar.num_devices = 1;
. . .
/* Read Image */
. . .
/* initialize Daubechies coefficients */
. . .
rasclib_resource_alloc(&ar, 1);
alg_id = rasclib_algorithm_open("Wavelet");
res = rasclib_algorithm_alg_reg_write (alg_id, "coeff0", coeff[0]);
res = rasclib_algorithm_alg_reg_write (alg_id, "coeff1", coeff[1]);
res = rasclib_algorithm_alg_reg_write (alg_id, "coeff2", coeff[2]);
res = rasclib_algorithm_alg_reg_write (alg_id, "coeff3", coeff[3]);
res = rasclib_algorithm_send (alg_id, "a_in", Image_Buff, SIZE);

Parameter Passing

Small parameters
• Connect to Algorithm Defined Registers (alg_def_reg0 - alg_def_reg7)
• Pass parameter mapping to software through an extractor directive, type REG_IN:

-- extractor REG_IN: coeff0 64 u alg_def_reg0[63:0]
-- extractor REG_IN: coeff1 64 u alg_def_reg1[63:0]
-- extractor REG_IN: coeff2 64 u alg_def_reg2[63:0]
-- extractor REG_IN: coeff3 64 u alg_def_reg3[63:0]

Large arrays
• Dedicate a portion of an SRAM bank for the parameter array
• Pass parameter array mapping to software with an extractor comment of type SRAM:

-- extractor SRAM:a_in 2048 64 sram[0] 0x00 in u fixed
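An extractor comment of type SRAM packs several fields into one line, and splitting it into a structure makes them easier to see. The sketch below parses the format as it appears on the slide. The field meanings (array name, element count, element width in bits, SRAM bank, offset, direction, signedness, usage mode) are an interpretation of that example rather than a statement of the vendor's specification, and `sram_directive` and `parse_sram_directive` are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    char     name[32];       /* array name, e.g. "a_in" */
    unsigned elements;       /* assumed: number of elements */
    unsigned width_bits;     /* assumed: element width in bits */
    unsigned bank;           /* index inside sram[...] */
    unsigned offset;         /* assumed: starting offset in the bank */
    char     direction[8];   /* "in" or "out" */
    char     sign;           /* 'u' in the slide's examples */
    char     mode[16];       /* "fixed" or "stream" */
} sram_directive;

/* Returns 1 if `line` matches the SRAM extractor layout shown on the slide,
 * 0 otherwise. Only the layout is checked, not the semantic validity. */
int parse_sram_directive(const char *line, sram_directive *d)
{
    assert(line && d);
    int n = sscanf(line,
                   "-- extractor SRAM:%31s %u %u sram[%u] %x %7s %c %15s",
                   d->name, &d->elements, &d->width_bits, &d->bank,
                   &d->offset, d->direction, &d->sign, d->mode);
    if (n != 8)
        return 0;
    return strcmp(d->direction, "in") == 0 || strcmp(d->direction, "out") == 0;
}
```

Lines of other directive types, such as REG_IN, fail the literal match and are reported as not parsed.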


DWT on SGI-Altix (cont’d) (Main Program)

rasclib_algorithm_go (alg_id);

res = rasclib_algorithm_receive (alg_id, "d_out", out_Buff, SIZE);
rasclib_algorithm_commit (alg_id);
rasclib_algorithm_wait (alg_id);
rasclib_algorithm_close (alg_id);

Results Read-Back

Small parameters
• Connect to Algorithm Defined Registers
• Pass parameter mapping to software through an extractor directive, type REG_OUT
• Use the API function rasclib_algorithm_reg_read

Large arrays
• Dedicate a portion of an SRAM bank for the parameter array
• Pass parameter array mapping to software with an extractor comment of type SRAM:

-- extractor SRAM:d_out 2048 64 sram[1] 0x00 out u fixed


Streaming (SGI-Altix)

Improve performance by overlapping algorithm computation with data loading and unloading

Time       Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
Read DMA   0        1        2        X        X
Algorithm  X        0        1        2        X
Write DMA  X        X        0        1        2

Extractor directives are used to tell software:
• where input/output data arrays are located (SRAM bank + starting index)
• the sizes of the input/output data arrays
• which arrays have been enabled for streaming

Extractor directive type used: SRAM with attribute stream, e.g.:

-- extractor SRAM:a_in 2048 64 sram[0] 0x00 in u stream
-- extractor SRAM:d_out 2048 64 sram[1] 0x00 out u stream


Example 5

Image Processing: Thin Plate Splines


The application: Thin Plate Splines - image analysis of protein gels

• Image morphing based on natural logarithm computations

• Essential for comparing protein content

• Speedup per FPGA: 10-30x. Reduces analysis runtime from days to hours.


Host Program - running on Opteron CPU, calling FPGA subroutine

• Transfer parameter data to QDRAM
• Start Mitrion program and wait until finished
• Retrieve computed image data

u_64 fpga_mem, i;
my_fpga = fpga_open(args); // Use normal XD1 API for most operations

...

if (!fpga_is_loaded(args))
    rtn = fpga_load(args);

...

// memory map QDRAMs into host address space
fpga_mem = fpga_memmap(args);

// Upload data to QDRAM
memcpy(fpga_mem, parameter_data, sizeof_parameter_data);

// Control of mitrion processor is internally handled
// with a number of memory mapped registers in the FPGA
// Controlling running/stepping/reset etc.

mitrion_start(my_fpga); // Start mitrion block
mitrion_wait(my_fpga);  // wait for block to finish

// Fetch results from QDRAM
memcpy(image_coordinates, fpga_mem, sizeof_image_data);


FPGA program (1/3) - accelerated subroutine in Mitrion-C

// Options: -cpp

// Definition of RAM type
#define RAMType mem uint:64 [ 0x100000 ]

#include "grint_lib.lqd"
#include "logarithm_rwhile.lqd"

// readFix fetches input data from QDRAM
(Fix, RAMType) readFix(RAMType m, uint:24 basicOffset, uint:24 fixOffset)
{
    uint:32 memOffset = basicOffset + fixOffset;
    (result, m2) = _memread(m, memOffset);
} (result, m2);

// Start of program. Matches external RAM interface of the XD1:
// 4 banks of 1M word each
(RAMType, RAMType, RAMType, RAMType) main (RAMType Am, RAMType Bm, RAMType Cm, RAMType Dm)
{
    Fix<LMS> py; // parameter vectors
    Fix<LMS> px;
    Fix<LMS> koeffx;
    Fix<LMS> koeffy;

    // read parameters from external RAM
    (px, py, koeffx, koeffy, Aml) = foreach(index in <0.. LMS_1>)
    {
        (x, Am2) = readFix(Am, PX_OFF, index);
        (y, Am3) = readFix(Am2, PY_OFF, index);
        (kx, Am4) = readFix(Am3, KOEFFX_OFF, index);
        (ky, Am5) = readFix(Am4, KOEFFY_OFF, index);
    } (x, y, kx, ky, Am5);
    Aut = _wait(Aml);

    Cut = grintpolc(Cm, px, py, koeffx, koeffy);

} (Aut, Bm, Cut, Dm);


FPGA program (2/3) - accelerated subroutine in Mitrion-C

// Input arguments (the image) for Thin Plate Splines transform
RAMType grintpolc ( RAMType coords, // out
                    Fix<LMS> px, Fix<LMS> py, Fix<LMS> koeffx, Fix<LMS> koeffy )
{
    imDonel = foreach(y in <0.. YSIZE_1>)
    {
        uint:32 lineoff = y*XSIZE;
        imDone2l = foreach(x in <0.. XSIZE_1>)
        {
            (distx, disty) = foreach(px, py, koeffx, koeffy in px, py, koeffx, koeffy)
            {
                Fix dx = px - int2fix(x);
                Fix dy = py - int2fix(y);

                Fix r2 = fixmul(dx,dx) + fixmul(dy,dy);

                // Major compute-intensive part: high-precision ln computation
                Fix ext = if(r2 == 0) 0
                          else
                          {
                              Fix ln = fixln(r2);
                              ext = fixmul(r2,ln);
                          } ext;


FPGA program (3/3) - accelerated subroutine in Mitrion-C

                Fix rx = fixmul(ext, koeffx);
                Fix ry = fixmul(ext, koeffy);
            } (rx, ry);
            Fix distcoordx = sum(distx);
            Fix distcoordy = sum(disty);

            // distcoordx and distcoordy are the coordinates
            // of the pixels to be fetched from the distorted image

            uint:32 index = x + lineoff;
            int:32 x32 = (distcoordx >>> 8); // convert into Fix16.16
            int:32 y32 = (distcoordy >>> 8); // convert into Fix16.16
            watch x32;
            watch y32;

            bits:64 word = [x32, y32];
            imDone3 = _memwrite(coords, index, word);
        } imDone3;
        imDone2 = _wait(imDone2l);
    } imDone2;
    imDone = _wait(imDonel);

} imDone; // Output argument is the distorted image

Output arguments (distorted image coordinates) are written to QDRAM
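The listing ends by narrowing its results into a 16.16 fixed-point format before writing them to memory. A plain-C sketch of such a format is given below. The type `fix16`, the helper names, and the inference that the wider Fix type carries 24 fractional bits (suggested by the shift of 8) are assumptions; they do not describe the Mitrion-C library. Division is used instead of a right shift so that the behavior for negative values is fully defined in C; it truncates toward zero.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fix16;   /* hypothetical 16.16 fixed-point value */

/* Convert an integer to 16.16; the input must fit in 16 signed bits. */
fix16 int2fix16(int32_t i)
{
    assert(i >= -32768 && i <= 32767);
    return (fix16)(i * 65536);
}

/* Multiply two 16.16 values using a 64-bit intermediate product.
 * The caller must ensure the result fits in 16 integer bits. */
fix16 fix16_mul(fix16 a, fix16 b)
{
    int64_t wide = (int64_t)a * (int64_t)b;
    return (fix16)(wide / 65536);
}

/* Narrow a value with `frac_bits` fractional bits (16..62) to 16.16. */
fix16 narrow_to_fix16(int64_t v, unsigned frac_bits)
{
    assert(frac_bits >= 16 && frac_bits < 63);
    return (fix16)(v / ((int64_t)1 << (frac_bits - 16)));
}
```

Packing two such 32-bit coordinates into one 64-bit word matches the width of a single memory write in the listing.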


Program Development Environments

Challenges


Application Development for Reconfigurable Computers

Program Entry

Compilation

Execution

Platform Mapping

Debugging & Verification


Tasks Addressed in This Presentation

Program Entry

Compilation

Execution

Platform Mapping

Debugging & Verification


Program

Program Entry


Platform Mapping: SW/HW Partitioning

Program

• Software (executed in the microprocessor system)
• Hardware (executed in the reconfigurable processor system)


SW/HW Partitioning & Coding: Traditional Approach

Specification

SW/HW Partitioning

SW Coding HW Coding

SW Compilation HW Compilation

SW Profiling HW Profiling


SW/HW Partitioning & Coding: New Approach

Specification

SW/HW Coding

SW Compilation HW Compilation

SW Profiling HW Profiling

SW/HW Partitioning


Platform Mapping: FPGA Mapping

[Figure: the program is split into software and hardware; the hardware part is mapped onto FPGA 1, FPGA 2, FPGA 3, and FPGA 4]


Example of FPGA Mapping

[Figure: the operations add, multiply, and divide placed either in a single FPGA or split in different ways between FPGA 1 and FPGA 2]


FPGA Mapping in SRC

[Figure: add, multiply, and divide split between FPGA 1 and FPGA 2; inputs a and b, output sum]

FPGA1.mc

void fpga1(int64_t a, int64_t b, int64_t *sum, int mapno)
{
    int64_t c, temp;

    send_to_bridge(b);
    c = a * const1;
    recv_from_bridge(&temp);
    *sum = temp + c;
}

FPGA2.mc

void fpga2()
{
    int64_t a, d;

    recv_from_bridge(&a);
    d = a/const2;
    send_to_bridge(d);
}

Makefile

MAPFILES = FPGA1.mc FPGA2.mc
PRIMARY = FPGA1.mc
SECONDARY = FPGA2.mc
CHIP2 = FPGA2.mc


FPGA Mapping in VIVA™

By changing the attributes one can specify where an object is to be located


Platform Mapping: FPGA-FPGA Data Transfer & Synchronization

[Figure: the program is split into software and hardware; the hardware part is mapped onto FPGA 1, FPGA 2, FPGA 3, and FPGA 4]


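FPGA-FPGA Data Transfer in SRC

[Figure: FPGA 1 runs computation 1 and FPGA 2 runs computation 2; the two FPGAs are connected by a 64-bit path in each direction, carrying a, b, c, and d]

FPGA1.mc

void fpga1(int64_t a, int64_t b, int64_t c, int64_t *d)
{
    send_to_bridge(a, b, c);
    /* computation 1 */
    recv_from_bridge(d);
}

FPGA2.mc

void fpga2()
{
    int64_t a, b, c, d;

    recv_from_bridge(&a, &b, &c);
    /* computation 2 */
    send_to_bridge(d);
}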


FPGA-FPGA Data Transfer in SRC

Bridge Port

[Figure: the bridge port provides 64-bit connections, each buffered by a FIFO of 32 words of 64 bits]


FPGA-FPGA Data Transfer in VIVA™

For designs mapped over several FPGAs: the system description must include those FPGAs over which the design is to be mapped.

Special partitioning objects placed between the modules to be synthesized automatically map the relevant lines between the FPGAs.


Platform Mapping: Use of Internal and External Memories

[Figure: the program is split into software and hardware; the hardware part is mapped onto FPGA 1 to FPGA 4, which use three kinds of memory]

• OCM: On-Chip Memory
• LM: Local Memory
• SM: Shared Memory


Using On-Chip Memory (OCM) in SRC

void sum(int64_t a[], int *c, int mapno)
{
    BANK_A_ALLOC(AL, int64_t, SIZE);
    ocm_a [SIZE];
    int i;

    cm2obm_0(AL, a, byteLength);
    wait_server_0();

    for(i=0; i<SIZE; i++) {
        ocm_a[i] = AL[i];
    }

    for(i=0; i<SIZE; i++) {
        tmp = ocm_a[i] + tmp;
    }
}

[Figure: inside the FPGA, array AL[] in SM (OBM) is copied into array ocm_a[] in OCM, which feeds the computations producing c; the connections are labelled 64 and 32]


Using On-Chip Memory (OCM) in VIVA™

Special objects under the Memory Subsystem of the library allow the programmer to use the on-chip memory of the Xilinx Virtex-II chip


Platform Mapping: I/O

[Figure: the program is split into software and hardware; the hardware part is mapped onto FPGA 1 to FPGA 4 with SM, LM, and OCM; labels: SRC, StarBridge]


Main program

Function_1(a, d, e)

Function_2(d, e, f)

Function_1

Function_2

Macro_1(a, b, c)

Macro_2(b, d)Macro_2(c, e)

Macro_3(s, t)

Macro_1(n, b)Macro_4(t, k)

FPGA……

……

……

Macro_1

Macro_2 Macro_2

a

b c

d e

FPGA contents after the Function_1 call

Program in C or Fortran

Run Time Reconfiguration in SRC


Run-time Reconfiguration in VIVA™

Reconfiguration is possible by using the spawn object. By specifying the FileName attribute, a VIVA executable (.vex file) or a VIVA project can be loaded onto the same or a different FPGA.


Ideal Program Entry

Program Entry

Function


Actual Program Entry

Program Entry:

• Function
• SW/HW Partitioning
• Data Transfers & Synchronization
• Use of Internal and External Memories
• Sequence of Run-time Reconfigurations
• Use of FPGA Resources (multipliers, μP cores)
• Preferred Architectures
• FPGA Mapping
• SW/HW Interface


Evolution and the current status of tools

Legend: Not implemented / Manual Entry / Compiler Automated

Vendors: SRC, Star Bridge, and other vendors

Tasks:
• FPGA-FPGA Partitioning
• µP-FPGA Partitioning
• FPGA-FPGA Data Transfer
• µP-FPGA Data Transfer
• Computation-Data Transfer Overlapping
• Choosing component version
• . . .


Debugging & Verification


MAP Board Execution

[Figure: block diagram. Software side: Application, MAP Runtime Library. Subroutine for MAP: ComList Code, Wrapper Code, User Logic. MAP Board: Control Processor (ComList Processor, DMA Engine, Registers & Flags), On-board Memory, and User FPGAs holding the User Logic as Logic and Macro blocks. Data & Flags pass between the software side and the board.]


MAP Emulator + DFG Simulator

[Figure: block diagram. Software side: Application, MAP Runtime Library. Subroutine for MAP: ComList Code, Wrapper Code, User Logic. Emulator: Control Processor (ComList Processor, DMA Engine, Registers & Flags), On-board Memory, and User FPGAs whose User Logic is represented by C Code and Macro blocks. Data & Flags pass between the software side and the emulator.]


MAP Emulator + Verilog Simulator

[Figure: block diagram. Software side: Application, MAP Runtime Library. Subroutine for MAP: ComList Code, Wrapper Code, User Logic. Emulator: Control Processor (ComList Processor, DMA Engine, Registers & Flags), On-board Memory, and User FPGAs whose User Logic is simulated in VCS as Verilog and Macro blocks. Data & Flags pass between the software side and the emulator.]


X86 System in VIVA™

The FileIn Object as it appears when the x86 system is loaded


X86 System in VIVA™

FileIn object as it appears when the FPGA system description is loaded.


Debugging in VIVA™

Data can be viewed with the help of widgets, which are basically input and output ‘horns’ placed in a worksheet.

Various display options are available: the viewer can choose the kind of view desired and switch the displayed data between HEX and INT.


Debugging in the SGI Environment

[Figure: design flow on an IA-32 Linux machine]

• HLL Design Entry (Handel-C, Mitrion C, Viva)
• RTL Generation and Integration with Core Services
• Design Verification: Behavioral Simulation (VCS, Modelsim)
• Design Synthesis (Synplify Pro, Amplify)
• Design Implementation (ISE)
• Design Verification: Static Timing Analysis (ISE Timing Analyzer)
• Metadata Processing (Python)
• Altix Device Programming (RASC Abstraction Layer, Device Manager, Device Driver)
• Real-time Verification (gdb)

File types passed between stages: .v, .vhd, .edf, .ncd, .pcf, .bin, .cfg, .c


Compiler, Simulator And Debugger


Programming Environments: Summary


SRC Programming Environment

+ very easy to learn and use
+ standard ANSI C
+ hides implementation details
+ good support for debugging
+ vendor and user libraries
+ very well integrated environment
+ good use of 3rd party tools
+ in production use for over 3 years with constant improvements

- subset of C
- legacy C code requires rewriting
- C limitations in describing HW (parallelism, data types)
- closed environment, limited portability of codes to HW platforms other than SRC


Star Bridge Programming Environment Viva

+ drag-and-drop program entry
+ standard and user libraries
+ separation of designs/programs from system/platform descriptions = portability of codes
+ support for multiple platforms under development

- does not follow any established standards
- no textual description = limited scalability of codes
- control signals (e.g., handshaking between the adjacent library cells) must be specified explicitly
- no clear mechanism to call HW functions from SW


Cray Programming Environment based on Simulink/System Generator

+ drag-and-drop program entry
+ extensive libraries of DSP components
+ good support for debugging

- graphical description = limited scalability of codes
- control signals (e.g., handshaking between the adjacent library cells) must be specified explicitly
- limited library support for applications other than DSP


Cray Programming Environment based on DSPLogic

+ graphical programming language (drag-and-drop program entry)
+ extensive libraries of DSP components
+ single environment (MATLAB™/Simulink™) to analyze, visualize, implement, debug, verify
+ efficient resource usage

- graphical description = limited scalability of codes
- limited library support for applications other than DSP


Cray XD1 and SGI Environments based on Mitrion-C

+ high-level C-like language, easy to learn by an HPC programmer
+ ease of describing parallelism and non-standard (variable size) data types
+ small amount of Mitrion-C generates large number of lines of HDL code
+ suitable for describing classical complex HPC problems
+ Mitrion-C code portable between Cray XD1 and SGI

- new and as yet untested
- non-standard, no support for legacy codes
- language describes only what happens in a single FPGA
- currently, no mechanisms to use HDL macros