Octavian Cret, Kalman Pusztai, Cristian Vancea, Balint Szente
Technical University of Cluj-Napoca, Romania
CREC: A Novel Reconfigurable Computing Design Methodology

Introduction
CREC: a low-cost, general-purpose reconfigurable computer;
Dynamically generated architecture;
Built in a Hardware/Software CoDesign manner;
Based on FPGA devices, on the VHDL language and on a high-level language (Java);
No need for integration in a dedicated VLSI chip.

CREC’s Main Features
Reconfigurable RISC computer;
Parallel computer: each register has an associated Execution Unit (EU);
All the EUs have an identical structure, and each one is able to execute any kind of instruction from the CREC Instruction Set;
Having a greater number of EUs has the advantage of introducing Instruction Level Parallelism.

CREC Design Flow
Integrated CREC Development System (flow chart, in order):
Application source code (written in CREC Assembly Language)
Parallel Compiler (determination of the number of slices and instruction scheduling)
VHDL source code Generator (written in Java)
VHDL file Compilation
FPGA Configuration Process
Application Execution

The Parallel Compiler (I.)
Parses the CREC-RISC source code;
Makes important decisions about the execution system that will be generated;
Divides a program written in a sequential manner into portions of code to be executed at the same time;
Determines the minimal number of program slices;
Determines which instructions will be executed in parallel in each slice.

The Parallel Compiler (II.)
Uses a set of rules;
An example: each slice can contain at most one Load, Store or Jump instruction;
Reads the application source code (in CREC assembly language) and generates a file in a specific format, giving a description of the tailored CREC;
The resulting CREC architecture contains only the hardware needed to execute the subset of instructions used in the program.
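
The idea of keeping only the hardware a program needs can be illustrated with a minimal Python sketch. It scans a program for mnemonics and reports which Operating Unit subunits would have to be generated. The grouping of mnemonics follows the subunit names used later in these slides (ADD/SUB, AND/OR/XOR, SHL/ROL/NEG/INC/DEC, SHR/ROR/NOT); the function name `required_subunits` and the approach itself are assumptions made for illustration, not the compiler's actual implementation.

```python
# Mnemonic -> Operating Unit subunit, following the block names on the
# Execution Unit slides. Everything else here is a hypothetical sketch.
SUBUNIT_OF = {
    "ADD": "Arithmetic Unit", "SUB": "Arithmetic Unit",
    "AND": "Logic Unit", "OR": "Logic Unit", "XOR": "Logic Unit",
    "SHL": "Shift Left Unit", "ROL": "Shift Left Unit",
    "NEG": "Shift Left Unit", "INC": "Shift Left Unit",
    "DEC": "Shift Left Unit",
    "SHR": "Shift Right Unit", "ROR": "Shift Right Unit",
    "NOT": "Shift Right Unit",
}


def required_subunits(program):
    """Return the sorted subunit names needed by a list of source lines."""
    needed = set()
    for line in program:
        mnemonic = line.split()[0].upper()
        unit = SUBUNIT_OF.get(mnemonic)
        if unit is not None:  # MOV, JNZ, ... need no Operating Unit block
            needed.add(unit)
    return sorted(needed)


program = ["MOV R1,2", "MOV R2,3", "MOV R3,3",
           "ADD R1,R2", "DEC R3", "JNZ R3,[4]"]
print(required_subunits(program))
# prints ['Arithmetic Unit', 'Shift Left Unit']
```

For this program only two of the four subunits are referenced, so under the stated assumption the Logic Unit and the Shift Right Unit could be left out of the generated EUs.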

Results of the Parallel Compiler
The size of the various functional parts;
The subset of instructions involved;
The number of execution units (N);
The sequence of instructions making up the program;
The resulting CREC architecture contains only the hardware needed to execute the subset of instructions used in the program.

Slices
The instructions that are assigned to the EUs to be executed at the same moment in time make up a program slice;
The whole program is divided into slices;
The slice’s size depends on the number of execution units designed for program execution.

Program Example
Program sequence, and the instruction scheduling:
[1] MOV R1,2
[2] MOV R2,3
[3] MOV R3,3
[4] ADD R1,R2
[5] DEC R3
[6] JNZ R3,[4]
[7] MOV STORB,R1
[8] STORE [200]
[Figure: instruction scheduling of the program across the EUs]
Classical, non-optimal multiplication of two integers without overflow check, using three EUs.
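
How such a program might be packed into slices can be sketched with a simple in-order greedy scheduler in Python. This is not the schedule from the original figure, and it is not the Parallel Compiler's algorithm. The only rule taken from the slides is "at most one Load, Store or Jump per slice". The rest are assumptions for illustration: each instruction runs on the EU of its destination register, `MOV STORB,R1` and `STORE` are placed on EU 1, instructions with register dependencies may not share a slice, a jump closes its slice, and a jump target opens a new one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    label: int
    text: str
    eu: int               # EU assumed to execute the instruction
    reads: frozenset
    writes: frozenset
    kind: str = "alu"     # "alu", "load", "store" or "jump"


def ins(label, text, eu, reads=(), writes=(), kind="alu"):
    return Instr(label, text, eu, frozenset(reads), frozenset(writes), kind)


PROGRAM = [
    ins(1, "MOV R1,2", 1, writes=["R1"]),
    ins(2, "MOV R2,3", 2, writes=["R2"]),
    ins(3, "MOV R3,3", 3, writes=["R3"]),
    ins(4, "ADD R1,R2", 1, reads=["R1", "R2"], writes=["R1"]),
    ins(5, "DEC R3", 3, reads=["R3"], writes=["R3"]),
    ins(6, "JNZ R3,[4]", 3, reads=["R3"], kind="jump"),
    ins(7, "MOV STORB,R1", 1, reads=["R1"], writes=["STORB"]),
    ins(8, "STORE [200]", 1, reads=["STORB"], kind="store"),
]
JUMP_TARGETS = {4}


def conflicts(instr, slice_):
    """True if instr cannot share a slice with the instructions in slice_."""
    for other in slice_:
        if other.eu == instr.eu:                        # one instr per EU
            return True
        if other.kind != "alu" and instr.kind != "alu":  # rule from the slides
            return True
        if other.writes & (instr.reads | instr.writes):  # RAW / WAW
            return True
        if other.reads & instr.writes:                   # WAR (conservative)
            return True
    return False


def schedule(program, jump_targets):
    """Pack instructions into slices in program order."""
    slices = [[]]
    for instr in program:
        current = slices[-1]
        after_jump = bool(current) and current[-1].kind == "jump"
        if current and (instr.label in jump_targets or after_jump
                        or conflicts(instr, current)):
            slices.append([])
        slices[-1].append(instr)
    return slices


print([[i.label for i in s] for s in schedule(PROGRAM, JUMP_TARGETS)])
# prints [[1, 2, 3], [4, 5], [6], [7], [8]]
```

Under these assumptions the eight instructions fit in five slices, with the three MOVs sharing the first slice and the loop body's ADD and DEC sharing the second.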

VHDL Source Code Generator
VHDL files contain already written source code, where the main architecture’s parameters are given as generics and constants;
The following components can be tailored:
  The number of EUs;
  The register’s width in all the EUs;
  The size of the Instructions Memory and Operands Memory for each EU;
  The size of the Data Stack and Slice Stack Memory;
  The slice-mapping block, containing instructions.
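
A generator of this kind can be pictured as filling a handful of parameters into a fixed VHDL template. The sketch below is written in Python for brevity (the slides state that the real generator is written in Java) and emits a VHDL package of constants. The package name `crec_config`, the constant names and the function `vhdl_config_package` are all hypothetical; the slice-mapping block is not modelled.

```python
def vhdl_config_package(num_eus, reg_width, instr_mem_depth,
                        operand_mem_depth, data_stack_depth,
                        slice_stack_depth):
    """Return VHDL source for a package holding the tailored parameters."""
    constants = [
        ("NUM_EUS", num_eus),
        ("REG_WIDTH", reg_width),
        ("INSTR_MEM_DEPTH", instr_mem_depth),
        ("OPERAND_MEM_DEPTH", operand_mem_depth),
        ("DATA_STACK_DEPTH", data_stack_depth),
        ("SLICE_STACK_DEPTH", slice_stack_depth),
    ]
    lines = ["package crec_config is"]
    for name, value in constants:
        if value < 1:
            raise ValueError(f"{name} must be positive, got {value}")
        lines.append(f"  constant {name} : natural := {value};")
    lines.append("end package crec_config;")
    return "\n".join(lines)


print(vhdl_config_package(num_eus=3, reg_width=8, instr_mem_depth=64,
                          operand_mem_depth=64, data_stack_depth=16,
                          slice_stack_depth=16))
```

The pre-written VHDL files would then reference such constants (or receive them as generics), so only this small generated file changes from one application to the next.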

CREC General Architecture
[Figure: block diagram showing EU1, EU2, ..., EUN; Slice Memory; Slice Counter; Slice Stack Memory; Data Stack Memory; Load Buffer; Store Buffer; Data Memory; and, for each EU (1, 2, ..., N), an Operand Memory and an Instructions Memory, each with an Addr input]

The Hardware Architecture
The N Execution Units;
Instruction Memories;
Data Stack Memory (for Push and Pop);
Slice Stack Memory (for Call and Return);
A Slice Program Counter;
A Slice-mapping Memory;
Store Buffer and Load Buffer;
Data Memory (external or internal);
Operand Memories.

The Instruction Set
Relatively large instruction set, with more instructions than typical microcontrollers have;
Every instruction operates only on unsigned integers;
Each EU is potentially able to execute any kind of instruction from the CREC Instruction Set.

Data Manipulation Instructions
Addition with or without Carry;
Subtraction with or without Borrow, and Compare;
Logical functions: And, Or, Xor, Not and Bit Test;
Shift arithmetic and logic to left/right;
Rotate and rotate through Carry to left/right;
Increment/Decrement and 2’s Complement.
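
The effect of addition and subtraction on the two flags can be sketched in Python as follows. The slides only state that operations are unsigned and that a Carry Flag (CF) and a Zero Flag (ZF) exist. That CF doubles as the borrow indicator after a subtraction is an assumption here (it is the common convention), as are the function names.

```python
def add(a, b, width=8, carry_in=0):
    """Unsigned add (with optional carry in). Returns (result, CF, ZF)."""
    mask = (1 << width) - 1
    total = (a & mask) + (b & mask) + carry_in
    result = total & mask
    return result, total >> width, int(result == 0)


def sub(a, b, width=8, borrow_in=0):
    """Unsigned subtract (with optional borrow in). Returns (result, CF, ZF).

    CF is set when a borrow occurs; this convention is an assumption.
    """
    mask = (1 << width) - 1
    diff = (a & mask) - (b & mask) - borrow_in
    result = diff & mask
    return result, int(diff < 0), int(result == 0)


print(add(200, 100))   # prints (44, 1, 0): 300 wraps to 44, carry set
print(add(255, 1))     # prints (0, 1, 1): wraps to zero, both flags set
print(sub(5, 7))       # prints (254, 1, 0): borrow set
```

Chaining the carry out of one `add` into `carry_in` of the next is what "addition with Carry" enables for multi-word arithmetic.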

Instruction Format and Example
“G” defines the Instruction Group (Data Manipulation);
“Code” is the operation code (e.g. Add, Sub);
“Type” specifies the operation type (e.g. with/without Carry);
“Load” contains the load signals for the register and for the Carry and Zero flags;
“D” is the Register/Data selection for the second operand.

Program Control Instructions
Slice counter manipulation: Jump, Call and Return;
Data movement: Move;
Stack manipulation: Push and Pop;
Input from and Output to port: In and Out;
Load from and Store to external memory;
For greater flexibility, every instruction also exists in a conditional form: C (Carry), Z (Zero), E (Equal), A (Above), AE (Above or Equal), B (Below), BE (Below or Equal), each also available negated.
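
One plausible way to derive these conditions from the two flags is sketched below in Python. The slides do not define the mapping, so this assumes the usual unsigned-compare convention: after comparing a with b, CF is set when a < b and ZF when a == b. Writing a negated condition with an "N" prefix (as in the JNZ of the program example) is likewise an assumption, and the function names are hypothetical.

```python
def compare_flags(a, b, width=8):
    """Flags after an unsigned compare of a with b. Returns (CF, ZF)."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    return a < b, a == b


def holds(cond, cf, zf):
    """Evaluate a condition code such as 'A', 'BE' or the negated 'NZ'."""
    table = {
        "C": cf,
        "Z": zf,
        "E": zf,
        "A": (not cf) and (not zf),
        "AE": not cf,
        "B": cf,
        "BE": cf or zf,
    }
    negate = cond.startswith("N")      # no base code starts with "N"
    value = table[cond[1:] if negate else cond]
    return (not value) if negate else value


cf, zf = compare_flags(3, 5)
print(holds("B", cf, zf), holds("AE", cf, zf), holds("NZ", cf, zf))
# prints True False True
```

A conditional instruction would then execute only when its condition evaluates to true and otherwise behave as a no-op.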

Instruction Format and Example
“G” defines the Instruction Group (Program Control);
“Code” is the operation code (e.g. Jump, Call);
The “Conditions” field contains the code for validating the execution of a given instruction;
“R” is the load signal for the Register (e.g. Move);
“D” is the Register/Data selection for the second operand.

The Execution Unit
Decoding Unit: decodes the instruction code;
Control Unit: generates the control signals for the Program Control Instruction group;
Multiplexer Unit: the second operand of the binary instructions is multiplexed by this unit;
Operating Unit: realizes data manipulating operations;
Accumulator Unit: stores the instruction result;
Flag Unit: contains the two flag bits, the Carry Flag (CF) and the Zero Flag (ZF).

[Figure: block diagram of the EXECUTION UNIT. Recoverable labels:
Flag Unit (ZF, CF); Register; Accumulator;
Operating Unit: Shift Left Unit (SHL/ROL/NEG/INC/DEC), Shift Right Unit (SHR/ROR/NOT), Logic Unit (AND/OR/XOR), Arithmetic Unit (ADD/SUB), Carry Generator;
Multiplexer Unit: Reg/Data MUX, Register MUX, Data MUX, with inputs Immediate Operand, Load Buffer, Stack, Input Port and R1, R2, ..., RN;
Control Unit: Control Signal Generator, with signals JMP, CALL, RET, PUSH, POP, LOAD, STORE, W, MOV, STB, OUT, REG/DATA;
Decoding Unit: Instruction Decoder and Condition Generator, connected through the CONDITION BUS;
Signals: Register Value, Operand Value, Instruction Code]

The Optimized Operating Unit
Symmetrical organization: on the right side are the binary instruction blocks, and on the left side are the unary operation blocks (performing operations only on the accumulator);
The blocks use only one level of FPGA slices;
All four subunits use the same number of slices;
Takes advantage of the Fast Carry Lines;
The size of the Operating Unit grows linearly with the word length.

Virtex Optimized Arithmetic Unit
The basic 2-bit ADD/SUB cell using the Fast Carry Lines consumes only one Xilinx VirtexE slice.
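
A behavioural Python model makes the linear scaling concrete: a word-wide adder/subtractor is a chain of 2-bit cells linked by their carries, so a width of w bits needs w/2 cells. This sketch models only the arithmetic, not the actual LUT and carry-chain mapping inside a slice. Implementing subtraction as a + not(b) + 1 is a standard technique assumed here, not something the slides specify, and the function names are hypothetical.

```python
def addsub_cell(a2, b2, carry_in, subtract):
    """Behavioural model of one 2-bit ADD/SUB cell.

    Returns (2-bit sum, carry out). For subtraction the b input is inverted.
    """
    if subtract:
        b2 ^= 0b11
    total = a2 + b2 + carry_in
    return total & 0b11, total >> 2


def addsub(a, b, width, subtract=False):
    """Chain width/2 cells. Returns (result, carry out, cells used)."""
    if width % 2:
        raise ValueError("width must be a multiple of 2")
    carry = 1 if subtract else 0   # the +1 of the two's complement
    result = 0
    for bit in range(0, width, 2):
        s, carry = addsub_cell((a >> bit) & 3, (b >> bit) & 3,
                               carry, subtract)
        result |= s << bit
    return result, carry, width // 2


print(addsub(200, 100, 8))                 # prints (44, 1, 4)
print(addsub(5, 7, 8, subtract=True))      # prints (254, 0, 4)
print(addsub(1000, 24, 16)[2])             # prints 8
```

With one slice per cell, as stated above, an 8-bit arithmetic unit would take 4 slices and a 16-bit one 8 slices in this model. Note that with this subtraction scheme a carry out of 0 signals a borrow.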

Arithmetic and Logic Opcodes
[Table: Opcodes of the arithmetic unit]
[Table: Opcodes of the logic unit]
Where L is the “Not Load” and S is the “Subtract” signal.

Virtex Optimized Shift Left Unit
The basic 2-bit SHL/ROL/NEG/INC/DEC cell using the Fast Carry Lines consumes only one slice.

Virtex Optimized Shift Right Unit
The basic 2-bit SHR/ROR/NOT cell using the Fast Carry Lines consumes only one Xilinx VirtexE slice.

Shift Left and Right Opcodes
[Table: Opcodes of the shift left unit]
Where S is the “Shift” and D is the “Decrement” signal.
[Table: Opcodes of the shift right unit]
Where S is the “Shift” and N is the “Not” signal.

Shift and Rotate Operations
SHL: Shift Left;
SAL: Shift Arithmetic Left;
ROL: Rotate Left;
RCL: Rotate through Carry Left.
SHR: Shift Right;
SAR: Shift Arithmetic Right;
ROR: Rotate Right;
RCR: Rotate through Carry Right.
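
The eight operations differ only in which bit enters the vacated position and which bit leaves into the carry. The Python sketch below models one-position shifts on a w-bit word, each returning the new value and the bit shifted out. It follows the conventional definitions of these mnemonics; whether CREC matches them in every detail (for example, the carry behaviour) is an assumption.

```python
def _mask(w):
    return (1 << w) - 1


def shl(x, w):
    """Shift left: 0 enters at bit 0, old MSB goes to carry."""
    return (x << 1) & _mask(w), (x >> (w - 1)) & 1


sal = shl  # arithmetic and logical left shifts coincide


def shr(x, w):
    """Shift right: 0 enters at the MSB, old bit 0 goes to carry."""
    return (x & _mask(w)) >> 1, x & 1


def sar(x, w):
    """Arithmetic shift right: the MSB is replicated."""
    msb = x & (1 << (w - 1))
    return ((x & _mask(w)) >> 1) | msb, x & 1


def rol(x, w):
    """Rotate left: old MSB re-enters at bit 0."""
    msb = (x >> (w - 1)) & 1
    return ((x << 1) & _mask(w)) | msb, msb


def ror(x, w):
    """Rotate right: old bit 0 re-enters at the MSB."""
    lsb = x & 1
    return ((x & _mask(w)) >> 1) | (lsb << (w - 1)), lsb


def rcl(x, w, cf):
    """Rotate left through carry: old carry enters at bit 0."""
    return ((x << 1) & _mask(w)) | cf, (x >> (w - 1)) & 1


def rcr(x, w, cf):
    """Rotate right through carry: old carry enters at the MSB."""
    return ((x & _mask(w)) >> 1) | (cf << (w - 1)), x & 1


x = 0b10010110  # 150
for name, out in [("SHL", shl(x, 8)), ("SHR", shr(x, 8)), ("SAR", sar(x, 8)),
                  ("ROL", rol(x, 8)), ("ROR", ror(x, 8)),
                  ("RCL", rcl(x, 8, 0)), ("RCR", rcr(x, 8, 1))]:
    print(f"{name}: {out[0]:08b} carry={out[1]}")
```

The rotate-through-carry forms effectively treat the carry as an extra bit of the word, which is what allows shifts to be chained across several registers.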

Execution Unit Resources
A complete Execution Unit (with all the subunits generated) having an 8-bit wide accumulator consumes 20 CLBs, which is approximately 0.6% of a Xilinx Virtex600E FPGA chip;
An Execution Unit with a 16-bit wide register consumes 35 CLBs, which is approximately 1% of the available CLBs.
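
The percentages can be checked with a few lines of Python. The total of 3,456 CLBs (a 48 x 72 array) for the Virtex600E is not given in the slides; it is taken here from the Xilinx Virtex-E device table and should be treated as an assumption.

```python
# Assumption: Virtex600E CLB array of 48 x 72, per the Virtex-E data sheet.
VIRTEX600E_CLBS = 48 * 72


def clb_share(clbs_used, total=VIRTEX600E_CLBS):
    """Percentage of the device's CLBs taken by clbs_used."""
    return 100.0 * clbs_used / total


for width, clbs in [(8, 20), (16, 35)]:
    print(f"{width}-bit EU: {clbs} CLBs = {clb_share(clbs):.2f}% of the device")
# prints:
# 8-bit EU: 20 CLBs = 0.58% of the device
# 16-bit EU: 35 CLBs = 1.01% of the device
```

Both values agree with the rounded figures quoted above. Doubling the register width costs 15 extra CLBs rather than 20, which suggests a fixed overhead of roughly 5 CLBs per EU on top of the width-dependent part.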

Experimental Results
Functional Parallel Compiler;
Execution Units optimized for the Xilinx VirtexE device;
Slice Memory and Stack Memory under test;
A CREC architecture having 4 EUs with 4-bit wide registers occupies 4% of the CLBs and 5% of the BlockRAMs in the Virtex600E device;
A CREC architecture having 4 EUs with 16-bit wide registers occupies 18% of the CLBs and 20% of the BlockRAMs in the Virtex600E device;
The operating clock frequency is 100 MHz.

Performance Evaluation
The performance indexes show how many times faster a given algorithm executes on an optimized CREC system than with a classical execution flow.

Conclusions and Further Work
Create the possibility of writing high-level programs for CREC;
Extend the functionality of the Parallel Compiler, then create a C or PASCAL compiler for CREC applications;
Several variants of CREC architectures;
Hardware distributed computing, using FPGA configuration over the Internet.