1 program development environments languages & tools kris gaj george mason university
TRANSCRIPT
Program Development Environments
Languages & Tools
Kris Gaj, George Mason University
Acknowledgements
AMI
Cray
Mitrion
NCSA
SGI
SRC
Star Bridge
DoD/LUCITE
Companies, centers, and sponsors
• Esmail Chitalwala (GWU/Star Bridge)
• Hatim Diab (GWU)
• Esam El-Araby (GWU)
• Miaoqing Huang (GWU)
• Hoang Le (GMU)
• Allen Michalski (GMU/USC)
• Nandkishore Sastry (GMU)
• Chang Shu (GMU)
• Mohamed Taher (GWU)
• Proshanta Saha (GWU)
Acknowledgements
GWU/GMU students
SRC Programming Model
Microprocessor FPGA
main.c
function_1()
function_2()
ANSI C
function_1
function_2
macro_1(a, b, c)
macro_2(b, d)macro_2(c, e)
macro_3(s, t)
macro_1(n, b)macro_4(t, k)
FPGA
Macro_1
Macro_2 Macro_2
a
b c
d eMAP C(subset of ANSI C)
I/O
I/O
Libraries of macros
VHDL
macro_1 macro_2macro_3 macro_4……………………….
C function for P
C function for MAP
VHDLmacro
SRC Program Partitioning
P system
FPGA system
HLL
HDL
SRC Compilation Process
Objectfiles
Application sources Macro sources
MAP CompilerP Compiler
Logic synthesis
Place & Route
Linker
.v files
.bin files
.ngo files
.o files .o files
Applicationexecutable
Configurationbitstreams
HDLsources
Netlists
.c or .f files .vhd or .v files
Logic synthesis
Place & Route
Linker
.v files
.bin files
.ngo files
HDLsources
. or.mc or .mf files
SRC Libraries of Hardware Macros
User libraries of hardware macros developed by GWU/GMU/USC 2002-2006
• Secret-key cipher encryption & breaking
• Binary Galois Field arithmetic (polynomial basis & normal basis representation)
• Elliptic Curve Arithmetic
• Long integer modular arithmetic (RSA)
• Sorting
• Image processing
• Bioinformatics
See http://hpc.gwu.edu/library
Vendor libraries of hardware macros
• basic integer and floating-point arithmetic
• digital signal processing
Library
Object
Sheets
Star Bridge Programming Environment - Viva
Place & Route
.bin files
.ngo files
Applicationexecutable
Configurationbitstreams
Netlists
Star Bridge Compilation Process
VIVA
Graphical User Interface
User input
Xilinx
Cray XD1 Programming Flows
Source: [Cray, MAPLD05]
Synthesis
process (a, m) is
begin
  z <= a and m;
end process;
int mask(a, m) {
  return (a & m);
}
VHDL/Verilog Synthesis
Mitrion-C
VHDL,Verilog
Mentor GraphicsSynopsysSynplicity
Xilinx
a
mz
01001011010101010101101010010100010101101010100101010101
MATLAB/Simulink
The MathWorks
StandardFlow
Mitrion
High-levelFlow
SystemGenerator
Xilinx
Xilinx
Place & Route
Gate-level EDIF
VHDL or Verilog
Xtreme DSP Design Flow
HDL-based SGI Altix Programming Flow
IA-32 Linux
Machine
Design iterations
Design Entry(Verilog, VHDL)
Design Synthesis(Synplify Pro,
Amplify)
Design Implementation
(ISE)
Design Verification
Behavioral Simulation(VCS, Modelsim)
Static Timing Analysis(ISE Timing Analyzer)
.v, .vhd.v, .vhd
.edf
.ncd, .pcf
.bin
MetadataProcessing
(Python)
.v, .vhd
.cfg
Altix Device Programming(RASC Abstraction Layer,
Device Manager, Device Driver)
Real-time Verification
(gdb)
.c
IA-32 Linux
Machine
RTL Generation and Integration with Core Services
Design Synthesis(Synplify Pro,
Amplify)
Design Verification
Behavioral Simulation(VCS, Modelsim)
Static Timing Analysis(ISE Timing Analyzer)
.v, .vhd
.v, .vhd
.edf
.ncd, .pcf
.bin
MetadataProcessing
(Python)
.v, .vhd
.cfg
Altix Device Programming(RASC Abstraction Layer,
Device Manager, Device Driver)
Real-time Verification
(gdb)
.c
Design Implementation(ISE)
HLL Design Entry(Handel-C, Mitrion C, Viva)
HLL-based SGI Altix Programming Flow
Mitrion-C Programming Model for Cray & SGI
Microprocessor FPGA
main.c
function_1(in1)start_fpga()
ANSI Cbased on Mitrion
API
FPGA
I/O
RAM
Application code
(platform independent)
Mitrion Distributed Processor Architecture(platform dependent)
Mitrion Compiler& Configurator
application on the
distributed processor
Input &output
Mitrion-C
VHDL
function_1(in2)start_fpga()
Compiling A Mitrion Program
ProcessorConfigurator
ProcessorArchitecture
Mitrion-CSource code
ProcessorHW-Design
(VHDL IP Core)
FPGA
Mitrion Software Development Kit
Simulator& Debugger
ProcessorMachine-code
Compiler
The Mitrion Platform
1) The Mitrion Virtual Processor
   – A fine-grain, massively parallel, configurable soft-core processor
   – 10-30 times faster than traditional CPUs
2) The Mitrion-C programming language
   – An intrinsically parallel C-family language
3) The Mitrion Software Development Kit
   – Compiler
   – Debugger/Simulator
   – Processor configurator
A New Processor Architecture Specifically For FPGAs
int:48<30> main()
{
  int:48 prev = 1;
  int:48 fib = 1;
  int:48<30> fibonnacci = for(i in <1..30>)
  {
    fib = fib+prev;
    prev = fib;
  } <>fib;
} fibonnacci;
Architecture design goal:
• High silicon utilization
• Take advantage of FPGA re-configurability
Goal achieved by:
• Allow processor to be massively parallel
• Allow processor to be fully adapted to algorithm
Processor Architecture: A Cluster-On-A-Chip
• Non-Von Neumann architecture
• Processor architecture more like a cluster
• Very Fine-Grain Parallelism
  – Normal clusters run a block of code on each PE (Processing Element)
  – Mitrion runs a single instruction on each PE
  – Each PE adapted to optimally run its instruction
• Network topology specific for algorithm
• No Instruction Stream, instead Data Stream
A C-family Language
• Basic syntax is the same as for other C-family languages
• Examples:
  – Blocks are surrounded by { }
  – Assignment with =
  – Statements end with ;
  – if, for, while
  – Most of the usual C operators
  – C-style comments (though nestable)
Types
• Basic types
  int/uint     signed/unsigned integer
  boolean      boolean value (true/false)
  float        floating-point real value
  bits         bit vector format
• Free bit width
  int:24       24-bit signed integer
  uint:19      19-bit unsigned integer
  float:24.8   IEEE-754 single-precision float
• Collections
  int:24[100]  Vector (indexable collection)
  int:14<100>  List (no index)
Language constructs
Operators
if(a>b) ...
while(i<10) ...
for(i in <0..999>) ...
foreach (e in vector) ...
int:8 function(int:8 a) ...
A C-family Language
• Important differences
  – No pointers
  – No dynamic allocation
  – Static general recursion only
• Though loop structures may be dynamic
Compiler, Simulator And Debugger
Hardware
Software
GraphicalData FlowDiagram
HLLHDL
Increased productivity
Increased capability to describe parallel execution
Program Entry for FPGA Accelerator Boards
Traditional
Extended(e.g.Corefire) Hardware
Software
Increased productivity
Increased capability to describe parallel execution
Star Bridge Hardware
Software
porting EDIF
COMobjects
Program Entry for Reconfigurable Computers
Hardware
SoftwareSRC
HLLHDLGraphical Data FlowDiagram
HDL macros
Increased productivity
Increased capability to describe parallel execution
CrayXD1withSimulink Hardware
Software
Program Entry for Reconfigurable Computers
Hardware
SoftwareSGIor CraywithMitrion
HLLHDLGraphical Data FlowDiagram
Mitrion Processor
Mitrion-C
Xilinx System Generator
Simulink
General hierarchy of library files suggested
by SRC Computers Inc.
Structure of the SRC macro repository
< top of repository >
<lib # 1 >
common rev_d rev_e
hdlfile InfoFile BlkBoxFile
macro1 macro2 macro3
< macros >
<lib # 2 > <lib # 3 >
rev_f
DebugCodeFile
DataSheet
Files describing an SRC macro
Platform independent
– HDL file: macro.v or macro.vhd
  • Verilog or VHDL code defining the macro
– Debug Code File: macro.c
  • provides the equivalent C functionality for the macro
– Data sheet file: datasheet
  • contains the documentation for the macro
Platform dependent
– Blk Box File: blackbox.v
  • Interface (black box) definition for the macro in Verilog
– Info File: info
  • Info file entry for this macro
Library Development - SRC
HLL (C, Fortran)
HDL (VHDL, Verilog)
P system
FPGA system
ApplicationProgrammer
LibraryDeveloper
HLL (C, Fortran)
HLL (C, Fortran)
LLL (ASM)
HLL (C, Fortran)
Library Development - StarBridge
GDF (Viva)
HDL (VHDL, Verilog)
P system
FPGA system
ApplicationProgrammer
LibraryDeveloper
GDF (Viva)
GDF (Viva)
HLL, LLL (C++, ASM)
GDF (Viva)
Software libraries and their role in the development
of SRC libraries
Roles of software libraries
1. source of test vectors for VHDL macros
2. emulation of hardware during debugging
3. performance comparison
1. Identify class of applications
2. Identify basic operations required by your applications
3. Determine the existence of the RC library of such operations
4. Determine the existence of the microprocessor library of such operations
5. Determine the right granularity for the required library operations
How to approach porting your application to reconfigurable computers?
1. input/output intensive applications
   • bulk data encryption (DES, IDEA, and RC5 encryption)
2. computationally intensive applications
   • secret-key cipher breaking based on the exhaustive key search (DES, IDEA, RC5 breakers)
   • public-key cipher breaking based on factoring
3. latency-critical applications
   • cipher key agreement and signature (ECC schemes, RSA)
Classes of applications
Example 1: Cryptography:
High-throughput encryption
Cipher
message
ciphertext
cryptographickey
K bits
Secret-key ciphers
key of Alice and Bob - KABkey of Alice and Bob - KAB
Alice Bob
Network
Encryption Decryption
High-Throughput Encryption
Encryption
Mi
Mi+1
Mi+2
Ci
Ci+1
Ci+2
. . . .
K0 Encryption algorithms:
DES, 3DES, AES, RC5, IDEA, etc.
Fully Pipelined Architecture
. . . .
. . . .
. . . .
Loop unrolling
Pipeline stages inside of cipher rounds
New input & new output every clock cycle
. . . .
Round 1
Round 2
Round k
. . .
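A small sketch of what "new input & new output every clock cycle" implies for peak throughput. The function name is hypothetical, and the clock rates used in the example (100 MHz and 200 MHz) are taken from the Altix results later in the slides; treating the SRC-6 figure of 19,200 Mbits/s as three units at 100 MHz is an inference from the results table, not something the slides state.

```python
def pipelined_throughput_mbits(block_bits, clock_mhz, units=1):
    """Peak throughput (Mbit/s) of fully pipelined cipher units.

    Assumes every unit accepts one block and emits one block per
    clock cycle once its pipeline is full (the pipeline fill latency
    is ignored).
    """
    return block_bits * clock_mhz * units


# One 64-bit-block cipher unit at 100 MHz and at 200 MHz
one_unit_100mhz = pipelined_throughput_mbits(64, 100)
one_unit_200mhz = pipelined_throughput_mbits(64, 200)

# Three parallel units at 100 MHz (assumed SRC-6 configuration)
three_units_100mhz = pipelined_throughput_mbits(64, 100, units=3)
```

The model is an upper bound on the computational throughput only; data transfer to and from the FPGA is not included.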
Encryption on SRC-6 - No streaming: encryption.mc
#include <libmap.h>

void encryption (uint64_t sdata[], uint64_t key, uint64_t *hardware_timein,
                 uint64_t *hardware_timeprocess, uint64_t *hardware_timeout,
                 int mapnum)
{
    OBM_BANK_A (S1OBM, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_B (S2OBM, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_C (S3OBM, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_D (S4OBM, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_E (S5OBM, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_F (S6OBM, uint64_t, MAX_OBM_SIZE)
    uint32_t encrypt_decrypt;   // 0: encrypt, 1: decrypt
    int i, nbytes;
    uint64_t t1, t2, t3, t4;

    encrypt_decrypt = 0;
    nbytes = MAX_OBM_SIZE * 8 * 3;

    start_timer();
    read_timer(&t1);
    DMA_CPU(CM2OBM, S1OBM, MAP_OBM_stripe(1,"A,B,C"), sdata, 1, nbytes, 0);
    wait_DMA(0);
    read_timer(&t2);

    for (i = 0; i < MAX_OBM_SIZE; i++) {
        des (S1OBM[i], key, encrypt_decrypt, &S4OBM[i]);
        des (S2OBM[i], key, encrypt_decrypt, &S5OBM[i]);
        des (S3OBM[i], key, encrypt_decrypt, &S6OBM[i]);
    }
    read_timer(&t3);

    DMA_CPU(OBM2CM, S4OBM, MAP_OBM_stripe(1,"D,E,F"), sdata, 1, nbytes, 5);
    wait_DMA(5);
    read_timer(&t4);

    *hardware_timein      = t2-t1;
    *hardware_timeprocess = t3-t2;
    *hardware_timeout     = t4-t3;
}
Encryption on SRC-6 - No streaming: des_blkbx.v
module des ( desOut, desIn, keyin, decrypt, clk )
    /* synthesis syn_black_box syn_noprune=1 */ ;

    output [63:0] desOut;
    input  [63:0] desIn;
    input  [63:0] keyin;
    input         decrypt;
    input         clk /* synthesis syn_noclockbuf=1 */ ;

endmodule
Encryption on SRC-6 - No streaming: des.info
BEGIN_DEF "des"
    MACRO = "des";
    LATENCY = 17;
    STATEFUL = NO;
    EXTERNAL = NO;
    PIPELINED = YES;

    INPUTS = 3:
        I0 = INT 64 BITS (desIn[63:0])
        I1 = INT 64 BITS (keyin[63:0])
        I2 = INT 32 BITS (decrypt)
        ;
    OUTPUTS = 1:
        O0 = INT 64 BITS (desOut[63:0])
        ;
    IN_SIGNAL : 1 BITS "clk" = "CLOCK";

    DEBUG_HEADER = $
        void des__dbg (long long desin, long long keyin, int decrypt, long long *desout);
    $;
    DEBUG_FUNC = $
        #include <des.h>
        void des__dbg(long long desin, long long keyin, int decrypt, long long *desout) {
            des_(desout, &desin, &keyin, &decrypt);
        }
    $;
END_DEF
Encryption on SRC-6 - with streaming: encryption.mc
#include <libmap.h>

void encryption (uint64_t sdata[], uint64_t key, uint64_t *hardware_timeprocess,
                 uint64_t *hardware_timeout, int mapnum)
{
    OBM_BANK_A (S1OBM, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_B (S2OBM, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_D (S4OBM, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_E (S5OBM, uint64_t, MAX_OBM_SIZE)
    uint32_t encrypt_decrypt;   // 0: encrypt, 1: decrypt
    int i, nbytes;
    uint64_t t1, t2, t3;
    Stream_64 S0, S1;
    uint64_t v0, v1;

    encrypt_decrypt = 0;
    nbytes = MAX_OBM_SIZE * 8 * 2;

    start_timer();
    read_timer(&t1);

    #pragma src parallel sections
    {
        #pragma src section
        {
            stream_dma_cpu_dual (&S0, &S1, PORT_TO_STREAM, S1OBM, DMA_A_B, sdata, 1, nbytes);
        }
        #pragma src section
        {
            for (i = 0; i < MAX_OBM_SIZE; i++) {
                get_stream (&S0, &v0);
                get_stream (&S1, &v1);
                des (v0, key, encrypt_decrypt, &S4OBM[i]);
                des (v1, key, encrypt_decrypt, &S5OBM[i]);
            }
        }
    }

    read_timer(&t2);
    DMA_CPU(OBM2CM, S4OBM, MAP_OBM_stripe(1,"D,E"), sdata, 1, nbytes, 5);
    wait_DMA(5);
    read_timer(&t3);

    *hardware_timeprocess = t2-t1;
    *hardware_timeout     = t3-t2;
}
Results: SRC-6 without streaming
(all throughputs in Mbits/s)

Application         Computational  Data Transfer In  Data Transfer Out  End-to-End  End-to-End    Speed-up
(64-bit block)      (SRC-6)        (SRC-6)           (SRC-6)            (SRC-6)     (Xeon 2.8GHz)
3 RC5 ciphers       19,200         11,350            10,760             4,240       560           7.5
3 IDEA ciphers      19,200         11,350            10,760             4,240       113           38
3 DES ciphers       19,200         11,350            10,760             4,240       93            46
Results: SRC-6 with streaming (3 units)
(all throughputs in Mbits/s)

Application         Computational  Data Transfer In  Data Transfer Out  End-to-End  End-to-End    Speed-up
(64-bit block)      (SRC-6)        & processing      (SRC-6)            (SRC-6)     (Xeon 2.8GHz)
                                   (SRC-6)
3 RC5 ciphers       NA             9,000             10,760             4,800       560           8.5
3 IDEA ciphers      NA             9,000             10,760             4,800       113           42.5
3 DES ciphers       NA             9,000             10,760             4,800       93            52
Results: SRC-6 with streaming (2 units)
(all throughputs in Mbits/s)

Application         Computational  Data Transfer In  Data Transfer Out  End-to-End  End-to-End    Speed-up
(64-bit block)      (SRC-6)        & processing      (SRC-6)            (SRC-6)     (Xeon 2.8GHz)
                                   (SRC-6)
2 RC5 ciphers       NA             11,350            10,760             5,400       560           9.5
2 IDEA ciphers      NA             11,350            10,760             5,400       113           47.5
2 DES ciphers       NA             11,350            10,760             5,400       93            58
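One rough way to read the SRC-6 results is to treat the phases that cannot overlap as running back to back, so their times add. The sketch below is an assumed first-order model, not the measurement method used for the slides; the function name is hypothetical and the input rates are the per-phase throughputs reported in the tables.

```python
def serial_throughput(*phase_rates_mbits):
    """End-to-end throughput (Mbit/s) when phases run one after another.

    Each phase moves the same amount of data, so the time per bit is the
    sum of the per-phase times per bit (a harmonic combination of rates).
    """
    return 1.0 / sum(1.0 / rate for rate in phase_rates_mbits)


# No streaming: transfer in, compute, transfer out are all serialized.
no_streaming = serial_throughput(11_350, 19_200, 10_760)

# Streaming: transfer in and processing overlap and are reported as a
# single combined rate; only the transfer out remains serialized.
streaming_3_units = serial_throughput(9_000, 10_760)
streaming_2_units = serial_throughput(11_350, 10_760)
```

Under these assumptions the model lands within a few percent of the measured end-to-end figures (4,240, 4,800 and 5,400 Mbits/s); the remaining gap is presumably overhead that the model ignores.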
Results: SGI Altix MOATB without streaming
(all throughputs in Mbits/s)

Application         Computational     Data Transfer In  Data Transfer Out  End-to-End  End-to-End    Speed-up
(64-bit block)      (Altix)           & processing      (Altix)            (Altix)     (Xeon 2.8GHz)
                                      (Altix)
1 RC5 cipher        12,800 (200MHz)   NA                NA                 2,430       560           4.5
1 IDEA cipher       6,400 (100MHz)    NA                NA                 2,040       113           18
1 DES cipher        12,800 (200MHz)   NA                NA                 2,430       93            26
Results: SGI Altix MOATB with streaming
(all throughputs in Mbits/s)

Application         Computational     Data Transfer In  Data Transfer Out  End-to-End  End-to-End    Speed-up
(64-bit block)      (Altix)           & processing      (Altix)            (Altix)     (Xeon 2.8GHz)
                                      (Altix)
1 RC5 cipher        12,800 (200MHz)   NA                NA                 3,080       560           5.5
1 IDEA cipher       6,400 (100MHz)    NA                NA                 2,480       113           22
1 DES cipher        12,800 (200MHz)   NA                NA                 3,080       93            33
Example 2: Cryptography:
Cipher Breaking
Secret-key cipher breaking
Given:
Looked for:
Method:
remaining plaintext
ciphertext
or key
guessed fragment of the plaintext
exhaustive key search (brute-force) attack
successivekeys cipher
Secret-key cipher breaking
Cipherbreaker
M0 C0
…
K1 K2 K3 KN
Generated by the cipher breaker
Negligibly smallinput/output
Huge amountof computations
Correct key
Message – Ciphertext pair
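The exhaustive key search described above can be sketched with a toy cipher. Everything here is an illustrative assumption: `toy_encrypt` is a made-up 16-bit function with no cryptographic value (it merely stands in for DES, IDEA or RC5), and the key space is kept tiny so the search finishes instantly.

```python
def toy_encrypt(block, key):
    """Hypothetical 16-bit toy cipher: XOR, rotate left by 3, add key."""
    x = (block ^ key) & 0xFFFF
    x = ((x << 3) | (x >> 13)) & 0xFFFF
    return (x + key) & 0xFFFF


def exhaustive_key_search(plaintext, ciphertext, key_bits=16):
    """Try every key and keep those that map plaintext to ciphertext.

    The candidate keys are generated inside the breaker, so the only
    inputs are one message-ciphertext pair (negligible I/O), while the
    work grows as 2**key_bits (huge amount of computation).
    """
    return [key for key in range(1 << key_bits)
            if toy_encrypt(plaintext, key) == ciphertext]


secret_key = 0xBEEF
known_plaintext = 0x1234
observed_ciphertext = toy_encrypt(known_plaintext, secret_key)
candidates = exhaustive_key_search(known_plaintext, observed_ciphertext)
```

A single message-ciphertext pair may leave more than one candidate key; a second pair would normally be used to filter them.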
Cipher Breaking Results - SRC-6
(throughputs in million keys/s)

Application                          Theoretical Maximum  Measured End-to-End  Measured End-to-End  Speed-up
                                     Computational        Throughput           Throughput
                                     Throughput (SRC 6)   (SRC 6)              (Xeon 2.8GHz)
DES Cipher Breaking
(20 units working in parallel)       2000                 2000                 1.77                 1130
IDEA Cipher Breaking
(10 units working in parallel)       1000                 1000                 2.19                 457
RC5 Cipher Breaking
(2 units working in parallel)        200                  200                  0.71                 282

Cipher Breaking Results - SGI Altix MOATB
(throughputs in million keys/s)

Application                          Theoretical Maximum  Measured End-to-End  Measured End-to-End  Speed-up
                                     Computational        Throughput           Throughput
                                     Throughput (SGI)     (SGI)                (Xeon 2.8GHz)
DES Cipher Breaking
(10 units working in parallel)       2000                 2000                 1.77                 1130
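To put the key rates in perspective, the sketch below estimates how long a full sweep of the DES key space would take at the reported rates. The 56-bit DES key length is a standard fact; the function name is hypothetical, and the estimate assumes the measured rate is sustained for the whole search.

```python
def full_search_days(key_bits, million_keys_per_second):
    """Days needed to try every key of a key_bits-bit key space."""
    seconds = 2 ** key_bits / (million_keys_per_second * 1e6)
    return seconds / 86_400


# DES (56-bit key) at the reconfigurable-computer rate of 2000 million keys/s
des_days_fpga = full_search_days(56, 2000)

# The same search at the Xeon rate of 1.77 million keys/s, for comparison
des_days_xeon = full_search_days(56, 1.77)
```

On average the correct key is found after half of the key space has been searched, so the expected time is half of these figures.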
Example 3: Cryptography:
Key exchange using ECC
Secret-key ciphers
key of Alice and Bob - KABkey of Alice and Bob - KAB
Alice Bob
Network
Encryption Decryption
Key Distribution Problem
N users require N · (N-1) / 2 keys

Users   Keys (approx.)
100     5,000
1000    500,000
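The key counts above follow from one shared secret key per pair of users. A minimal sketch (the function name is hypothetical):

```python
def pairwise_keys(num_users):
    """Number of secret keys when every pair of users shares its own key."""
    return num_users * (num_users - 1) // 2


keys_for_100_users = pairwise_keys(100)
keys_for_1000_users = pairwise_keys(1000)
```

The exact values are slightly below the rounded figures on the slide; the quadratic growth is what makes key distribution a problem for secret-key ciphers alone.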
Public Key (Asymmetric) Ciphers
Public key of Bob - KBPrivate key of Bob - kB
Alice Bob
Network
Encryption Decryption
Alice Bob session key
(random secret-key)
Bob’s public key
Key exchange for secret-key ciphers
Bob’s private key
Network
Session keyencrypted using Bob’s public key
Message encrypted using session key
Message
Hash function
Public keycipher
AliceSignature
Alice’s private key
Bob
Hash function
Alice’s public key
Digital Signature
Hash value 1
Hash value 2
Hash value
Public key cipher
yes no
Message Signature
Why is public-key cryptography a good application for reconfigurable computers?
• computationally intensive arithmetic operations
• unconventionally long operand sizes (160-2048 bits)
• multiple algorithms, parameters, key sizes, and architectures = need for reconfiguration
Elliptic Curve Cryptosystems (ECC)
• a family of cryptosystems, rather than a single cryptosystem = added security but need for reconfiguration
• public key (asymmetric) cryptosystems
• used for key agreement and digital signatures
• implementations must be optimized for minimum latency rather than maximum throughput = limited speed-up from parallel processing
Basic operations of ECC
Basic operations in Galois Field GF(2^m)
• addition and subtraction (XOR): x+y, x-y
• multiplication, squaring: x·y, x^2
• inversion: x^-1
Basic operations on points of an Elliptic Curve
• addition of points: P + Q
• doubling a point: 2P
• projective to affine coordinate: P2A
Complex operations on points of an Elliptic Curve
• scalar multiplication: kP = P + P + … + P (k times)
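A minimal software sketch of the GF(2^m) operations listed above, in polynomial basis. The field GF(2^4) with reduction polynomial x^4 + x + 1 and all function names are illustrative assumptions; the slides do not state the field sizes or the algorithms used inside the hardware macros.

```python
def gf_add(x, y):
    """Addition and subtraction in GF(2^m) are both a bitwise XOR."""
    return x ^ y


def gf_mul(x, y, m=4, poly=0b10011):
    """Shift-and-add multiplication modulo an irreducible polynomial.

    Field elements are integers whose bits are polynomial coefficients;
    `poly` includes the x^m term (default: x^4 + x + 1).
    """
    result = 0
    for _ in range(m):
        if y & 1:              # add x times the current power of the variable
            result ^= x
        y >>= 1
        x <<= 1                # multiply x by the variable
        if (x >> m) & 1:       # degree reached m: reduce
            x ^= poly
    return result


def gf_inv(x, m=4, poly=0b10011):
    """Inversion via Fermat's little theorem: x^(2^m - 2)."""
    if x == 0:
        raise ZeroDivisionError("zero has no inverse in GF(2^m)")
    result = 1
    for _ in range(2 ** m - 2):
        result = gf_mul(result, x, m, poly)
    return result
```

Squaring is simply `gf_mul(x, x)` in this sketch; dedicated hardware would normally use a cheaper squaring circuit and a faster inversion method.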
Hierarchy of ECC functions
kP
P+Q 2P projective_to_affine (P2A)
MUL
INV
High level
Medium level
Low level 2
ROTXORLow level 1
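The high-level kP operation is built from the medium-level P+Q and 2P operations. One standard way to do that is double-and-add, sketched below; the slides define kP only as repeated addition, so the choice of algorithm is an assumption. To keep the example self-contained, integers modulo 97 under addition stand in for elliptic-curve points, and all names are hypothetical.

```python
def scalar_multiply(k, point, add, double, identity):
    """Compute k*point using only point addition and point doubling.

    Scans the bits of k from least to most significant, so the number of
    group operations grows with the bit length of k, not with k itself.
    """
    result = identity
    addend = point
    while k > 0:
        if k & 1:
            result = add(result, addend)
        addend = double(addend)
        k >>= 1
    return result


# Stand-in group: integers modulo 97 under addition (NOT an elliptic curve).
MODULUS = 97

def toy_add(p, q):
    return (p + q) % MODULUS

def toy_double(p):
    return (2 * p) % MODULUS

k_times_p = scalar_multiply(13, 5, toy_add, toy_double, identity=0)
```

With real curve arithmetic, `add` and `double` would be the P+Q and 2P functions of the hierarchy, followed by one P2A conversion at the end.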
C function for P
C function for MAP
VHDLmacro
SRC Program Partitioning
P system
FPGA system
HLL
HDL
Investigated Partitioning Schemes
kPC function for P
C function for FPGA
VHDLmacro
μP Software Only
Based on public-domain code by Rosing M., Implementing Elliptic Curve Cryptography, Manning, 1999
MUL4
C function
for FPGA
VHDL
macrosROTROT XOR
C function
for µP0
H
L1V_ROTVARROT
kPP2A
kP
P+Q 2P
MUL2MUL2 MULMUL
0HL1 Partitioning
INVINV
P2AP+Q2P
MUL4
C function
for FPGA
VHDL
macrosROTROT XOR
C function
for µP0
H
V_ROTINV
kPP2A
kP
P+Q 2P
MUL2MUL2
0HL2 Partitioning
P2AP+Q2P
L2
0HM Partitioning
C function
for FPGA
VHDL
macros
C function
for µP
0
H
MP+QP+Q 2P2P P2AP2A
kPkP
kP
0
0
H
00H Partitioning (VHDL only)
C function for P
C function for FPGA
VHDLmacro
Timing Measurements
MAPAlloc.
MAP
FreeDMA
DataOut
DMA
Data In
FPGA
Computation
.c file
.mc file
End-to-End time (SW)
MAPfunction
MAP function
FPGA
Configure
Configuration time
MAP
Allocation
time
MAP
Release
Time
End-to-End time (HW)
Results (Latency)
(times in µs)

System Level   End-to-End  Data Transfer  FPGA Computation  Data Transfer  Total     Speedup vs.
Architecture   Time        In Time        Time              Out Time       Overhead  Software
Software       772,519
0HL1           866         37             472               14             394       893
0HL2           863         37             469               14             394       895
0HM            592         37             201               12             391       1305
VHDL macro     592         39             201               17             391       1305
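The derived columns of the latency results can be recomputed from the measured times. The sketch below assumes that total overhead is the end-to-end time minus the FPGA computation time and that speedup is the software time divided by the end-to-end time; both relations are inferred from the numbers, and the function name is hypothetical.

```python
SOFTWARE_TIME_US = 772_519  # software-only end-to-end time from the results


def derive_columns(end_to_end_us, fpga_computation_us):
    """Return (total overhead in microseconds, speedup vs. software)."""
    overhead = end_to_end_us - fpga_computation_us
    speedup = SOFTWARE_TIME_US / end_to_end_us
    return overhead, speedup


overhead_0hl1, speedup_0hl1 = derive_columns(866, 472)
overhead_0hm, speedup_0hm = derive_columns(592, 201)
```

The recomputed values agree closely with the reported overhead and speedup figures, which supports this reading of the column order.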
[Chart: End-to-End Time and FPGA Computation Time in µs (axis 0-900) for the architectures 0HL1, 0HL2, 0HM and 00H (VHDL)]
Results (Area)

System Level   % of CLB slices  CLB increase   % of LUTs        LUT increase   % of FFs         FF count increase
Architecture   (out of 33,792)  vs. pure VHDL  (out of 67,584)  vs. pure VHDL  (out of 67,584)  vs. pure VHDL
Software       N/A
0HL1           99               1.68           57               1.3            68               2.61
0HL2           92               1.56           52               1.18           62               2.38
0HM            75               1.27           48               1.09           39               1.5
[Chart: CLB slices, LUT and FF utilization in % (axis 0-100) for the architectures 0HL1, 0HL2, 0HM and 00H]
78
185
349
371
MAP C
15326010070HL1
153
153
153
Main C
1601744
2301291
36
Macro Wrapper
0HM
1960
VHDL
VHDL macro
0HL2
Algorithm Partitioning Scheme
Number of lines of code
Conclusions
Assuming focus on:
Timing Resources
Ease of programming
Conclusions – cont.
The best implementation approach: 0HL1 partitioning scheme
893x speed-up vs. software and only a 46% slowdown versus pure VHDL (866 µs vs. 592 µs end-to-end), with ease of implementation
MUL4
C functionfor MAP
VHDL macros
ROTROT XOR
C function for µP
0
H
L1V_ROTV_ROT
kP
INV
P2AP+Q 2P
kP
INV
P2AP+QP+Q 2P2P
MUL2MUL2 MULMUL
Conclusions – cont.
• Elliptic Curve Cryptosystem implementation challenging for reconfigurable computers because of
  – optimization for latency rather than throughput
  – limited amount of parallelism
• First publication showing a 1000x speed-up for a reconfigurable computer application optimized for data latency
Summary of results

Type of application                                          End-to-end speed-up of SRC-6 vs. P4
Input/output intensive (secret key encryption/decryption)    10-60
Computationally intensive (cipher breaking)                  300-1100
Latency critical (ECC key exchange)                          890-1300
GWU_GMU secret key cipher libraries
1. Secret key cipher encryption and decryption
   • DES
   • IDEA
   • RC5
2. Secret key cipher breaking
   • DES
   • IDEA
   • RC5
GWU_GMU public key cipher libraries
1. Operations in the binary Galois Fields GF(2^m)
   a. polynomial basis
   b. normal basis
2. Multiprecision integer arithmetic
3. Elliptic Curve Operations
   - addition
   - doubling
   - scalar multiplication
Example 4: Image Processing:
Hyperspectral Dimension Reduction
Application: Hyperspectral Dimension Reduction
Multispectral / Hyperspectral Imagery Comparison
• Multi-Spectral Imagery: 10's of bands
• Hyperspectral Imagery: 100's-1000's of bands
• Challenges: Curse of Dimensionality
• Solution: On-Board Dimension Reduction
• Needs: higher performance, higher flexibility
High-Performance Reconfigurable Computing
Hyperspectral Dimension Reduction (Techniques)
Principal Component Analysis (PCA):
• Most common method of dimension reduction
• Complex and global computations: difficult for parallel processing and hardware implementations
• Does not preserve spectral signatures
Wavelet-Based Dimension Reduction*:
• Simple and local operations
• High-performance implementation
• Preserves spectral signatures
• Multi-resolution wavelet decomposition of each pixel's 1-D spectral signature (preservation of spectral locality)
* S. Kaewpijit, J. Le Moigne, T. El-Ghazawi, “Automatic Reduction of Hyperspectral Imagery Using Wavelet Spectral Analysis”, IEEE Transactions on Geoscience and Remote Sensing, Vol. 41, No. 4, April, 2003, pp. 863-871.
The input image is first convolved along the rows by the two filters L and H and decimated along the columns by two resulting in two "column-decimated" images L and H
Each of the two images, L and H, is then convolved along the columns by the two filters L and H and decimated along the rows by two
This decomposition results in four images: LL, LH, HL and HH
The LL image is taken as the new input to perform the next level of decomposition
Discrete Wavelet Transform (DWT) Decomposition (Mallat Algorithm)
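One level of the filter-and-decimate step can be sketched in one dimension, which is also how the wavelet-based reduction treats each pixel's spectral signature. The sketch uses the four Daubechies coefficients that appear in the SRC-6 main program; the periodic boundary handling, the high-pass construction and the function name are assumptions made for the example, not details taken from the slides.

```python
import math

SQRT3, SQRT2 = math.sqrt(3), math.sqrt(2)

# Daubechies-4 low-pass filter L
LOW = [(1 + SQRT3) / (4 * SQRT2), (3 + SQRT3) / (4 * SQRT2),
       (3 - SQRT3) / (4 * SQRT2), (1 - SQRT3) / (4 * SQRT2)]
# Matching high-pass filter H (quadrature mirror of L)
HIGH = [LOW[3], -LOW[2], LOW[1], -LOW[0]]


def dwt_level(signal):
    """Convolve with L and H and decimate by two (periodic extension).

    Returns (approximation, detail), each half as long as the input.
    The input length is assumed to be even and at least 4.
    """
    n = len(signal)
    approx, detail = [], []
    for start in range(0, n, 2):            # step of 2 = decimation
        window = [signal[(start + k) % n] for k in range(4)]
        approx.append(sum(c * x for c, x in zip(LOW, window)))
        detail.append(sum(c * x for c, x in zip(HIGH, window)))
    return approx, detail


approx, detail = dwt_level([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
```

For a 2-D image the same step is applied along the rows and then along the columns; feeding the approximation back into `dwt_level` gives the next decomposition level.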
Wavelet-Based Dimension Reduction(Description)
DWT on SRC-6
transfer coefficientsto OBM bank c
transfer image datato OBM bank a
load coefficients from bank c to
on-chip registers
transfer image data from bank b to the host
compute Wavelet
read one pixelfrom bank a
store result into bank b
End of Image
Yes
No
Read Data
MAP Alloc.
Map Free
Write Data
Measurements Scenario
DWT on SRC-6 (cont'd) (Main Program)
int main (int argc, char *argv[])
{
    .
    .
    /* initialize daubechies coefficients */
    float coeff[8];
    coeff[0] = coeff[1] = (1+sqrt(3))/(4*sqrt(2));
    coeff[2] = coeff[3] = (3+sqrt(3))/(4*sqrt(2));
    coeff[4] = coeff[5] = (3-sqrt(3))/(4*sqrt(2));
    coeff[6] = coeff[7] = (1-sqrt(3))/(4*sqrt(2));
    .
    .
    /* allocate images */
    .
    /* Allocate the RP */
    map_allocate(1);

    gettimeofday(&time0, NULL);
    /* configure and start the program execution on the FPGA */
    proc_fpga (image_in, image_out, coeff, dx, dy, &ht0, &ht1, &ht2, &ht3, mapno);
    gettimeofday(&time1, NULL);

    /* print time difference */
    .
    .
    /* Free the RP */
    map_free(1);
    .
}
Notes on the proc_fpga call:
• configures and starts the program execution on the FPGA
• passes the input image pointer and the output image buffer pointer to be used by DMA
• individual parameters, such as image dimensions, can be passed to the MAP C function
• a large parameter array, such as the wavelet coefficients, can be transferred using DMA by passing the pointer to the coefficients array
DWT on SRC-6 (cont'd): MAP C Function (FPGA.mc)
void proc_fpga(int64_t image_in[], int64_t image_out[], int64_t coeff[],
               int dx, int dy,
               long long *ht0, long long *ht1, long long *ht2, long long *ht3,
               int mapnum)
{
    // coefficients
    float LP0, LP1, LP2, LP3, LP4;
    float HP0, HP1, HP2, HP3, HP4;
    // variables
    int i, j, k;
    int64_t in_pixel, out_pixel;
    // input image
    OBM_BANK_A (AL, int64_t, MAX_OBM_SIZE)
    // output image
    OBM_BANK_B (BL, int64_t, MAX_OBM_SIZE)
    // filter coefficients
    OBM_BANK_C (CL, int64_t, MAX_OBM_SIZE)

    start_timer();
    read_timer(ht0);

    // DMA input image transfer: transfer image data to an OBM bank
    DMA_CPU (CM2OBM, AL, MAP_OBM_stripe (1, "A"), image_in, 1, Image_size, 0);
    wait_DMA (0);

    // DMA coefficients transfer: transfer coefficients to an OBM bank
    DMA_CPU (CM2OBM, CL, MAP_OBM_stripe (1, "C"), coeff, 1, 4*sizeof(int64_t), 0);
    wait_DMA (0);

    read_timer(ht1);

    // load coefficients from the OBM bank to on-chip registers
    for (i = 0; i < 4; i++) {
        LP0 = LP1; HP0 = HP1;
        LP1 = LP2; HP1 = HP2;
        LP2 = LP3; HP2 = HP3;
        split_64to32_flt_flt(CL[i], &HP3, &LP3);
    }

    for (i = 0; i < Image_size; i++) {
        // read pixel value from the OBM bank
        in_pixel = AL[i];
        // compute Wavelet
        { . . . }
        // store results to the OBM bank
        BL[i] = out_pixel;
    }
    read_timer(ht2);

    // transfer image data to the host
    DMA_CPU (OBM2CM, BL, MAP_OBM_stripe (1, "B"), image_out, 1, Image_size, 0);
    wait_DMA (0);
    read_timer(ht3);
}
Overlapping Data Transfer with Computation(SRC-6)
#pragma src parallel sections {
#pragma src section {
for (i = 0; i < MAX_OBM_SIZE; i++)
{
get_stream (&S0, &v0);
DO COMPUTATION (Current Data Block)
}
} /* end of parallel section with compute loop */
#pragma src section {
/* Stream DMA_IN */
stream_dma_cpu_dual (&S0, &S1, PORT_TO_STREAM, S1OBM, DMA_A_B, sdata, 1, nbytes);
} /* end of parallel section with DMA */
} /* end of parallel sections */

Time        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
Read DMA    1        2        3        X        X
Algorithm   X        1        2        3        X
Write DMA   X        X        1        2        3

Improve performance by overlapping algorithm computation and data loading and unloading
Parallel sections: multiple parallel code blocks are active in parallel
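The overlap timeline can be generated for any number of data blocks. The sketch below is a scheduling model only, under the assumption that each of the three stages takes one cycle per block; the function and stage names are chosen for the example.

```python
def overlap_schedule(num_blocks):
    """Return {stage: [block id or 'X' per cycle]} for a 3-stage overlap.

    Stage s works on block b during cycle b + s, so all blocks finish in
    num_blocks + 2 cycles instead of 3 * num_blocks without overlap.
    """
    stages = ["Read DMA", "Algorithm", "Write DMA"]
    cycles = num_blocks + len(stages) - 1
    table = {}
    for offset, stage in enumerate(stages):
        row = []
        for cycle in range(cycles):
            block = cycle - offset + 1          # blocks are numbered from 1
            row.append(str(block) if 1 <= block <= num_blocks else "X")
        table[stage] = row
    return table


schedule = overlap_schedule(3)
```

For many blocks the total time approaches one cycle per block, which is the benefit the parallel sections are meant to deliver.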
Stream_64 S0;
#pragma src parallel sections
{
#pragma src section
{
int i;
for (i=0; i<sz; i++)
put_stream (&S0, AL[i]+42, 1);
} /* end of parallel section */
#pragma src section
{
int i;
for (i=0; i<sz; i++)
get_stream (&S0, &BL[i]);
} /* end of parallel section */
} /* end of parallel sections */
Streams(SRC-6)
[Figure: Conventional data flow (On-Board Memory or BRAM, Compute Loop 1, On-Board Memory or BRAM, Compute Loop 2, On-Board Memory or BRAM) compared over time with streams plus conventional data flow (On-Board Memory or BRAM, Compute Loop 1, Streams, Compute Loop 2, On-Board Memory or BRAM). Streams save an access to On-Board Memory: data is flowing in the logic.]
A stream is a data structure that allows flexible communication between concurrent producer and consumer loops
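The producer and consumer loops of the stream example can be mimicked in software with a bounded queue and two threads. This is only an analogy for the put_stream / get_stream pattern: on the SRC-6 the loops are concurrent hardware pipelines, and the queue depth and function name below are assumptions made for the sketch.

```python
import queue
import threading


def run_stream(input_values, depth=4):
    """Producer adds 42 to each value; consumer collects the results."""
    stream = queue.Queue(maxsize=depth)   # bounded, like a hardware FIFO
    results = []

    def producer():
        for value in input_values:
            stream.put(value + 42)        # blocks while the stream is full

    def consumer():
        for _ in range(len(input_values)):
            results.append(stream.get())  # blocks while the stream is empty

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


output = run_stream([1, 2, 3, 4, 5, 6])
```

Because the two loops exchange data directly through the stream, no intermediate array is written and read back between them.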
Cray XD-1
DWT on Cray-XD1 (Main Program)
/* Define the address space for user registers and QDR memory */
#define APP_CFG_REG  0x08UL
#define USR_REG1     0x40UL
#define USR_REG2     0x48UL
#define USR_REG3     0x50UL
#define USR_REG4     0x58UL
#define QDR1_OFFSET  0x100UL   /* Offset in FPGA memory space. */

int main (int argc, char *argv[])
{
    int   fp_id;
    err_e e;
    u_64  coeff[4];
    u_64 *dp_base;
    u_64 *image;

    /* Open the FPGA device */
    fp_id = fpga_open ("/dev/ufp0", O_RDWR|O_SYNC, &e);

    /* Load the FPGA */
    fpga_load (fp_id, "top.bin.ufp", &e);
    .
    .
    /* Read Image */
    .
    /* initialize daubechies coefficients */
    .
    /* Transfer coefficients into the FPGA registers */
    fpga_wrt_appif_val (fp_id, coeff[0], USR_REG1, TYPE_VAL, &e);
    fpga_wrt_appif_val (fp_id, coeff[1], USR_REG2, TYPE_VAL, &e);
    fpga_wrt_appif_val (fp_id, coeff[2], USR_REG3, TYPE_VAL, &e);
    fpga_wrt_appif_val (fp_id, coeff[3], USR_REG4, TYPE_VAL, &e);

    /* Configure the Wavelet for QDR bridging */
    fpga_wrt_appif_val (fp_id, 0x0UL, APP_CFG_REG, TYPE_VAL, &e);

    /* Map the entire 4 Mbytes of QDR Memory */
    dp_base = fpga_memmap (fp_id, QDR_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, QDR1_OFFSET, &e);

    /* Transfer the Image into the QDR */
    for (i = 0; i < QDR_SIZE/sizeof (u_64); i++)
        dp_base[i] = image[i];

    /* Start Processing */
    fpga_wrt_appif_val (fp_id, 0x1UL, APP_CFG_REG, TYPE_VAL, &e);
    /* ... */
    /* Read the FPGA status */
    fpga_rd_appif_val (fp_id, &status, STATUS_REG, &e);
    /* ... */
    /* Configure the Wavelet for QDR bridging */
    fpga_wrt_appif_val (fp_id, 0x0UL, APP_CFG_REG, TYPE_VAL, &e);

    /* Read back the Image */
    for (i = 0; i < QDR_SIZE/sizeof (u_64); i++)
        output_image[i] = dp_base[i];

    /* Close the FPGA device */
    fpga_close (fp_id, &e);
}
Accessing µP memory from FPGA(Cray-XD1)
unsigned long order; void *ftr_mem;
/* ... */
ftr_mem = fpga_set_ftrmem(fpga_fd, order, &err); if (ftr_mem == NULL) { /* Handle error. */ }
fpga_wrt_appif_val (fp_id, (u_64) ftr_mem , BUFF0_PTR_REG, TYPE_ADDR, &e);
fpga_wrt_appif_val (fp_id, 0x1UL, APP_CFG_REG,
TYPE_VAL, &e);
/* ... */
fpga_rd_appif_val(fp_id, &status, STATUS_REG, &e);
/* ... */
The APIs support access to a region of the µP memory by the FPGA logic
The program uses the fpga_set_ftrmem function to:
• Allocate an FTR
• Associate it with the address space of the µP
• Set up the FPGA to access it directly
It does not automatically provide the address of this region to the FPGA application logic
• One way is to establish an FPGA register for that purpose and use the fpga_wrt_appif_val function to write the value to the register
Using MPI on Cray-XD1
if (MYTHREAD == 0)
    read_image (image_file_name, image_buffer, &rows, &cols);

MPI_Bcast(&rows, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&cols, 1, MPI_INT, 0, MPI_COMM_WORLD);

local_size = rows*cols/THREADS;

MPI_Scatter(image_buffer, local_size, MPI_UNSIGNED_LONG,
            local_image_buffer, local_size, MPI_UNSIGNED_LONG,
            0, MPI_COMM_WORLD);

/* Execute the wavelet on the Hardware */
process_image(fp_id, local_image_buffer, local_output_buffer, rows, cols);

MPI_Gather(local_output_buffer, local_size, MPI_UNSIGNED_LONG,
           output_image_buffer, local_size, MPI_UNSIGNED_LONG,
           0, MPI_COMM_WORLD);

if (MYTHREAD == 0)
    write_image (output_file_name, output_image_buffer, rows, cols);

Each Cray XD1 chassis consists of 6 dual-processor Opteron nodes, each with:
• 2 Opteron processors (Total 12)
• 1 Xilinx Virtex-II Pro 50 (Total 6)
Applications can be parallelized across the 6 FPGAs using MPI
Data are distributed across the 6 FPGAs
106
rasclib_algorithm_request_t ar; int alg_id; int res; strcpy(ar.alg_id,“Wavelet"); ar.num_devices = 1; . . /* Read Image */ . /* initialize daubechies coefficients */ . rasclib_resource_alloc(&ar, 1); algorithm_id = rasclib_algorithm_open(“Wavelet"); res = rasclib_algorithm_alg_reg_write (alg_id, “coeff0", coeff[0]);
res = rasclib_algorithm_alg_reg_write (alg_id, “coeff1", coeff[1]);
res = rasclib_algorithm_alg_reg_write (alg_id, “coeff2", coeff[2]);
res = rasclib_algorithm_alg_reg_write (alg_id, “coeff3", coeff[3]);
res = rasclib_algorithm_send (alg_id, "a_in", Image_Buff , SIZE);
Parameter Passing
Small parameters
• Connect to Algorithm Defined Registers (alg_def_reg0 - alg_def_reg7)
• Pass parameter mapping to software through an extractor directive, type REG_IN:
-- extractor REG_IN: coeff0 64 u alg_def_reg0[63:0]
-- extractor REG_IN: coeff1 64 u alg_def_reg1[63:0]
-- extractor REG_IN: coeff2 64 u alg_def_reg2[63:0]
-- extractor REG_IN: coeff3 64 u alg_def_reg3[63:0]
Large Arrays
• Dedicate a portion of an SRAM bank for the parameter array
• Pass parameter array mapping to software with an extractor comment of type SRAM:
-- extractor SRAM:a_in 2048 64 sram[0] 0x00 in u fixed
DWT on SGI-Altix (Main Program)
107
rasclib_algorithm_go(alg_id);
res = rasclib_algorithm_receive(alg_id, "d_out", out_Buff, SIZE);
rasclib_algorithm_commit(alg_id);
rasclib_algorithm_wait(alg_id);
rasclib_algorithm_close(alg_id);
Results Read-Back
Small parameters
• Connect to Algorithm Defined Registers
• Pass parameter mapping to software through an extractor directive, type REG_OUT
• Use the API function rasclib_algorithm_reg_read
Large Arrays
• Dedicate a portion of an SRAM bank for the parameter array
• Pass parameter array mapping to software with an extractor comment of type SRAM:
-- extractor SRAM:d_out 2048 64 sram[1] 0x00 out u fixed
DWT on SGI-Altix (cont'd) (Main Program)
108
Time       Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
Read DMA      0        1        2        X        X
Algorithm     X        0        1        2        X
Write DMA     X        X        0        1        2
Improve performance by overlapping algorithm computation with data loading and unloading.
Extractor directives are used to tell software:
• where input/output data arrays are located (SRAM bank + starting index)
• the sizes of the input/output data arrays
• which arrays have been enabled for streaming
Extractor directive type used: SRAM with attribute stream, e.g.:
-- extractor SRAM:a_in 2048 64 sram[0] 0x00 in u stream
-- extractor SRAM:d_out 2048 64 sram[1] 0x00 out u stream
Streaming (SGI-Altix)
Example 5
Image processing: Thin Plate Splines
The application: Thin Plate Splines - image analysis of protein gels
• Image morphing based on natural logarithm computations
• Essential for comparing protein content
• Speedup per FPGA: 10-30x. Reduces analysis runtime from days to hours.
Host Program - running on Opteron CPU, calling FPGA subroutine
Transfer parameter data to QDRAM
Start Mitrion program and wait until finished
Retrieve computed image data
u_64 fpga_mem, i;
my_fpga = fpga_open(args); // Use normal XD1 API for most operations
...
if (!fpga_is_loaded(args))
    rtn = fpga_load(args);
...
// memory map QDRAMs into host address space
fpga_mem = fpga_memmap(args);
// Upload data to QDRAM
memcpy(fpga_mem, parameter_data, sizeof_parameter_data);
// Control of mitrion processor is internally handled
// with a number of memory mapped registers in the FPGA
// Controlling running/stepping/reset etc.
mitrion_start(my_fpga); // Start mitrion block
mitrion_wait(my_fpga); // wait for block to finish
// Fetch results from QDRAM
memcpy(image_coordinates, fpga_mem, sizeof_image_data);
FPGA program (1/3) - accelerated subroutine in Mitrion-C
// Options: -cpp
#define RAMType mem uint:64 [ 0x100000 ]
#include "grint_lib.lqd"
#include "logarithm_rwhile.lqd"
(Fix, RAMType) readFix(RAMType m, uint:24 basicOffset, uint:24 fixOffset)
{
  uint:32 memOffset = basicOffset + fixOffset;
  (result, m2) = _memread(m, memOffset);
} (result, m2);
(RAMType, RAMType, RAMType, RAMType) main (RAMType Am, RAMType Bm, RAMType Cm, RAMType Dm)
{
  Fix<LMS> py; // parameter vectors
  Fix<LMS> px;
  Fix<LMS> koeffx;
  Fix<LMS> koeffy;
  // read parameters from external RAM
  (px, py, koeffx, koeffy, Aml) = foreach(index in <0.. LMS_1>)
  {
    (x, Am2) = readFix(Am, PX_OFF, index);
    (y, Am3) = readFix(Am2, PY_OFF, index);
    (kx, Am4) = readFix(Am3, KOEFFX_OFF, index);
    (ky, Am5) = readFix(Am4, KOEFFY_OFF, index);
  } (x, y, kx, ky, Am5);
  Aut = _wait(Aml);
  Cut = grintpolc(Cm, px, py, koeffx, koeffy);
} (Aut, Bm, Cut, Dm);
readFix fetches input data from QDRAM
Definition of RAM type
Start of program. Matches external RAM interface of the XD1: 4 banks of 1M words each
FPGA program (2/3) - accelerated subroutine in Mitrion-C
RAMType grintpolc ( RAMType coords, // out
                    Fix<LMS> px, Fix<LMS> py,
                    Fix<LMS> koeffx, Fix<LMS> koeffy )
{
  imDonel = foreach(y in <0.. YSIZE_1>)
  {
    uint:32 lineoff = y*XSIZE;
    imDone2l = foreach(x in <0.. XSIZE_1>)
    {
      (distx, disty) = foreach(px, py, koeffx, koeffy in px, py, koeffx, koeffy)
      {
        Fix dx = px - int2fix(x);
        Fix dy = py - int2fix(y);
        Fix r2 = fixmul(dx,dx) + fixmul(dy,dy);
        Fix ext = if(r2 == 0) 0
                  else
                  {
                    Fix ln = fixln(r2);
                    ext = fixmul(r2,ln);
                  } ext;
Input arguments (the image) for Thin Plate Splines transform
Major compute intensive part: high precision ln computation
FPGA program (3/3) - accelerated subroutine in Mitrion-C
        Fix rx = fixmul(ext, koeffx);
        Fix ry = fixmul(ext, koeffy);
      } (rx, ry);
      Fix distcoordx = sum(distx);
      Fix distcoordy = sum(disty);
      // distcoordx and distcoordy are the coordinates
      // of the pixels to be fetched from the distorted image
      uint:32 index = x + lineoff;
      int:32 x32 = (distcoordx >>> 8); // convert into Fix16.16
      int:32 y32 = (distcoordy >>> 8); // convert into Fix16.16
      watch x32;
      watch y32;
      bits:64 word = [x32, y32];
      imDone3 = _memwrite(coords, index, word);
    } imDone3;
    imDone2 = _wait(imDone2l);
  } imDone2;
  imDone = _wait(imDonel);
} imDone;
Output argument is the distorted image
Output arguments (distorted image coordinates) are written to QDRAM
115
Program Development Environments
Challenges
116
Application Development for Reconfigurable Computers
Program Entry
Compilation
Execution
Platform mapping
Debugging & Verification
117
Tasks Addressed in This Presentation
Program Entry
Compilation
Execution
Platform mapping
Debugging & Verification
118
Program
Program Entry
119
Platform Mapping: SW/HW Partitioning
Program
Software (executed in the microprocessor system)
Hardware (executed in the reconfigurable processor system)
120
SW/HW Partitioning & Coding: Traditional Approach
Specification
SW/HW Partitioning
SW Coding HW Coding
SW Compilation HW Compilation
SW Profiling HW Profiling
121
SW/HW Partitioning & Coding: New Approach
Specification
SW/HW Coding
SW Compilation HW Compilation
SW Profiling HW Profiling
SW/HW Partitioning
122
Platform Mapping: FPGA mapping
Software
Hardware
Program
FPGA 1 FPGA 2
FPGA 3
FPGA 4
123
Example of FPGA Mapping
[Figure: an add, multiply, divide dataflow placed on a single FPGA, and the same operations split across FPGA 1 and FPGA 2 in two alternative ways]
124
add
multiply
divide
FPGA 1 FPGA 2
FPGA Mapping in SRC
void fpga1(int64_t a, int64_t b, int64_t *sum, int mapno)
{
    int64_t c, temp;
    send_to_bridge(b);
    c = a * const1;
    recv_from_bridge(&temp);
    *sum = temp + c;
}
void fpga2()
{
    int64_t a, d;
    recv_from_bridge(&a);
    d = a/const2;
    send_to_bridge(d);
}
Makefile
MAPFILES = FPGA1.mc FPGA2.mc
PRIMARY = FPGA1.mc
SECONDARY = FPGA2.mc
CHIP2 = FPGA2.mc
a
FPGA1.mc
FPGA2.mc
b
sum
125
FPGA Mapping in VIVA™
By changing the attributes, one can specify where an object is to be located.
126
Platform Mapping: FPGA-FPGA data transfer & synchronization
Software
Hardware
Program
FPGA 1 FPGA 2
FPGA 3
FPGA 4
127
[Figure: FPGA 1 (computation 1) and FPGA 2 (computation 2) connected by two 64-bit links]
FPGA1.mc
void fpga1(int64_t a, int64_t b, int64_t c, int64_t *d)
{
    send_to_bridge(a, b, c);
    /* computation 1 */
    recv_from_bridge(d);
}
FPGA2.mc
void fpga2()
{
    int64_t a, b, c, d;
    recv_from_bridge(&a, &b, &c);
    /* computation 2 */
    send_to_bridge(d);
}
FPGA-FPGA Data Transfer in SRC
a
b
c
d
128
[Figure: bridge port between the two FPGAs, with 64-bit links and two FIFOs, each 32 words x 64 bits]
FPGA-FPGA Data Transfer in SRC
Bridge Port
129
FPGA-FPGA Data Transfer in VIVA™
Special partitioning objects placed between the modules to be synthesized automatically map the relevant lines between the FPGAs.
For designs mapped over several FPGAs: the system description must include those FPGAs over which the design is to be mapped.
130
Platform Mapping: Use of Internal and External Memories
Software
Hardware
Program
FPGA 1 FPGA 2
FPGA 3
FPGA 4
OCM
OCM – On-Chip Memory LM – Local Memory SM – Shared Memory
SM
LM
131
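Using On-Chip Memory (OCM) in SRC
void sum(int64_t a[], int *c, int mapno)
{
    BANK_A_ALLOC(AL, int64_t, SIZE);  // AL[] resides in on-board memory (OBM)
    int64_t ocm_a[SIZE];              // ocm_a[] resides in on-chip memory (OCM)
    int64_t tmp = 0;
    int i;
    cm2obm_0(AL, a, byteLength);      // copy a[] into OBM
    wait_server_0();
    for (i = 0; i < SIZE; i++) {
        ocm_a[i] = AL[i];             // copy from OBM to OCM
    }
    for (i = 0; i < SIZE; i++) {
        tmp = ocm_a[i] + tmp;         // compute using OCM data only
    }
    *c = (int)tmp;                    // return the result
}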
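[Figure: AL[] in shared memory (SM, i.e. OBM) is copied into ocm_a[] in the FPGA's on-chip memory (OCM), which feeds the computations producing c; the links are labeled 64 and 32]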
132
Using On-Chip Memory (OCM) in VIVA™
Special objects under the Memory Subsystem of the library allow the programmer to use the on-chip memory of the Xilinx Virtex-II chip.
133
Platform Mapping: I/O
Software
Hardware
Program
FPGA 1 FPGA 2
FPGA 3
FPGA 4
SM
LM
OCM
SRC
StarBridge
134
Program in C or Fortran
Main program
  Function_1(a, d, e)
  Function_2(d, e, f)
Function_1
  Macro_1(a, b, c)
  Macro_2(b, d)
  Macro_2(c, e)
Function_2
  Macro_3(s, t)
  Macro_1(n, b)
  Macro_4(t, k)
FPGA contents after the Function_1 call: Macro_1 (a in; b, c out) followed by two instances of Macro_2 (b to d, c to e)
Run-Time Reconfiguration in SRC
135
Run-time Reconfiguration in VIVA™
Reconfiguration is possible by using the spawn object. By specifying the FileName attribute, a VIVA executable (.vex file) or a VIVA project can be loaded onto the same or a different FPGA.
136
Ideal Program Entry
Program Entry
Function
137
Actual Program Entry
SW/HW Partitioning
Data Transfers & Synchronization
Use of Internal and External Memories
Sequence of Run-time Reconfigurations
Use of FPGA Resources (multipliers, μP cores)
Preferred Architectures
Program Entry
Function
FPGA Mapping
SW/HW Interface
138
Evolution and the current status of tools
Legend: Not implemented / Manual Entry / Compiler Automated
Vendors compared: SRC, Star Bridge, and other vendors
Tasks compared:
• FPGA-FPGA Partitioning
• μP-FPGA Partitioning
• FPGA-FPGA Data Transfer
• μP-FPGA Data Transfer
• Computation-Data transfer Overlapping
• Choosing component version
139
Debugging & Verification
140
MAP Board Execution
[Figure]
Software: Application; MAP Runtime Library; ComList Code; Wrapper Code; User Logic Subroutine for MAP
MAP Board: Control Processor (ComList Processor, DMA Engine); On-board Memory (Data & Flags); User FPGAs (User Logic with Registers & Flags and Logic Macros)
141
MAP Emulator + DFG Simulator
[Figure]
Software: Application; MAP Runtime Library; ComList Code; Wrapper Code; User Logic Subroutine for MAP
Emulator: Control Processor (ComList Processor, DMA Engine); On-board Memory (Data & Flags); User FPGAs (User Logic with Registers & Flags; macros simulated as C Code)
142
MAP Emulator + Verilog Simulator
[Figure]
Software: Application; MAP Runtime Library; ComList Code; Wrapper Code; User Logic Subroutine for MAP
Emulator: Control Processor (ComList Processor, DMA Engine); On-board Memory (Data & Flags); User FPGAs (User Logic with Registers & Flags)
VCS: macros simulated as Verilog
143
x86 System in VIVA™
The FileIn Object as it appears when the x86 system is loaded
144
x86 System in VIVA™
FileIn object as it appears when the FPGA system description is loaded.
145
Debugging in VIVA™
Data can be viewed with the help of widgets, which are essentially input and output 'horns' placed in a worksheet.
Various display options are available, including the kind of view desired, and the displayed data can be switched between HEX and INT.
146
Debugging in the SGI Environment
[Figure: design flow]
IA-32 Linux Machine:
• HLL Design Entry (Handel-C, Mitrion C, Viva): Compiler, Simulator and Debugger
• RTL Generation and Integration with Core Services
• Design Verification: Behavioral Simulation (VCS, Modelsim), Static Timing Analysis (ISE Timing Analyzer)
• Design Synthesis (Synplify Pro, Amplify)
• Design Implementation (ISE)
• Metadata Processing (Python)
Altix:
• Device Programming (RASC Abstraction Layer, Device Manager, Device Driver)
• Real-time Verification (gdb)
Files passed between stages: .v, .vhd, .edf, .ncd, .pcf, .bin, .cfg, .c
148
Programming Environments
Summary
149
SRC Programming Environment
+ very easy to learn and use
+ standard ANSI C
+ hides implementation details
+ good support for debugging
+ vendor and user libraries
+ very well integrated environment
+ good use of 3rd party tools
+ in production use for over 3 years with constant improvements
- subset of C
- legacy C code requires rewriting
- C limitations in describing HW (parallelism, data types)
- closed environment, limited portability of codes to HW platforms other than SRC
150
Star Bridge Programming Environment Viva
+ drag-and-drop program entry
+ standard and user libraries
+ separation of designs/programs from system/platform descriptions = portability of codes
+ support for multiple platforms under development
- does not follow any established standards
- no textual description = limited scalability of codes
- control signals (e.g., handshaking between the adjacent library cells) must be specified explicitly
- no clear mechanism to call HW functions from SW
151
Cray Programming Environment based on Simulink/System Generator
+ drag-and-drop program entry
+ extensive libraries of DSP components
+ good support for debugging
- graphical description = limited scalability of codes
- control signals (e.g., handshaking between the adjacent library cells) must be specified explicitly
- limited library support for applications other than DSP
152
Cray Programming Environment based on DSPLogic
+ graphical programming language (drag-and-drop program entry)
+ extensive libraries of DSP components
+ single environment (MATLAB™/Simulink™) to analyze, visualize, implement, debug, verify
+ efficient resource usage
- graphical description = limited scalability of codes
- limited library support for applications other than DSP
153
Cray XD1 and SGI Environments based on Mitrion-C
+ high-level C-like language, easy to learn by an HPC programmer
+ ease of describing parallelism and non-standard (variable size) data types
+ small amount of Mitrion-C generates large number of lines of HDL code
+ suitable for describing classical complex HPC problems
+ Mitrion-C code portable between Cray XD1 and SGI
- new and yet untested
- non-standard, no support for legacy codes
- language describes only what happens in a single FPGA
- currently, no mechanisms to use HDL macros