General Overview of An Adaptive Dynamic Extensible Processor
Hamid Noori, Kazuaki Murakami, Koji Inoue & Victor Goulart
Kyushu University
Department of Informatics
Workshop on Introspective Architecture (WISA06)
WISA06@AustinKyushu University
Agenda
  Background
  Research goal
  General overview of the architecture
    Modes of operation
    Profiler
    Accelerator
    Sequencer
  Generation of Custom Instructions
  Configuration Data for the Accelerator
  Experiments and Results
  Conclusions & Future work
Background
                          GPP   ASIC   ASIP   Ext. Proc.   Our Proc.
Power consumption          ×     ◎      ◎        ○            ○
Performance (Specific)     ×     ◎      ○        ○            ○
Performance (General)      ○     ×      ×        ×            ○
Flexibility                ◎     ×      ×        ×            ◎
Design time                ○     ×      ×        △            ○
Design cost                ○     ×      △        △            ○
Programmability            ◎     ×      ◎        ○            ◎
Productivity               ◎     ×      △        △            ◎
Some definitions
Hot Basic Block (HBB)
  A basic block whose execution frequency is greater than a given threshold specified in the profiler
Custom Instructions (CIs)
  Extensions to the Instruction Set Architecture (ISA) that are executed on the ACC
Accelerator (ACC)
  Custom hardware for executing CIs
Training mode
  Operation mode for detecting HBBs and generating CIs
Normal mode
  Normal operation mode where CIs are executed on the ACC
Research Goal
Proposal of an Adaptive Dynamic Extensible Processor for Embedded Systems
  Custom instructions are adaptable to the applications
  Custom instructions are detected and created during execution/training
  Generation of custom instructions is done transparently and automatically
Advantages of the novel approach
  Higher performance than GPPs
  Higher flexibility compared to Extensible Processors
  Shorter TAT and cheaper design and verification cost compared to ASIPs and Extensible Processors
General overview of the architecture
Adaptive Dynamic Extensible Processor
  Base Processor (N-way in-order general RISC)
    Reg File
    Pipeline: Fetch, Decode, Execute, Memory, Write
  Augmented Hardware
    Profiler: detects start addresses of Hot Basic Blocks (HBBs)
    ACC: executes Custom Instructions
    Sequencer: switches between main processor and ACC
General overview of the architecture
Modes of operation
  Training mode
    Profiling
    Detecting start address of Hot Basic Blocks (HBBs)
    Generating Custom Instructions
    Generating Configuration Data for the ACC
    Binary rewriting
    Initializing the Sequencer Table
    ♦ Online
      Needs simple hardware for profiling
      All tasks are run on the base processor
    ♦ Offline
      Needs a PC trace after taken branches/jumps
  Normal mode
    Profiling (optional)
    Executing Custom Instructions on the ACC and other parts of the code on the base processor
Components
[Figure: block diagram for Online Training, divided into GPP and Augmented HW. Labels: Register File, ID/EXE Reg, Functional Unit, Mux, EXE/MEM Reg, Accelerator, Multi-Context Memory, Cache, Sequencer, Sequencer Table, Profiler, Profiler Table (HWT), DMA.]
Operation modes
[Figure: Applications running on the Processor with Profiler, ACC and Sequencer, shown once for Training Mode and once for Normal Mode.]
Training Mode
  Profiler: binary-level profiling, detecting start addresses of HBBs
  Processor: running tools for generating Custom Instructions, generating configuration data for the ACC and initializing the Sequencer Table
  Binary rewriting
Normal Mode
  Sequencer: monitors the PC and switches between main processor and ACC
  ACC: executing CIs
Profiler
[Flowchart]
  Compare Current PC with Previous PC
  If the difference is greater than the instruction length: is Current PC in the table?
    No: add it as a new entry and set the counter to one.
    Yes: increment the counter.
  Otherwise: do nothing.
Profiler Table columns: Basic Block Start Addr (BBSA), Counter
After a taken branch or jump, the profiler looks up the target PC in the BBSA column of the table. On a miss, it adds the address and initializes the counter to 1; otherwise it increments the counter.
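The table update described above can be sketched in software as follows. This is only an illustrative model, not the hardware design: the function name `profile`, the use of a Python dict for the Profiler Table, the 8-byte instruction length (taken from the address stride in the listings) and the use of the absolute PC difference to catch backward jumps are all assumptions.

```python
INSTR_LEN = 8  # assumption: 8-byte instructions, matching the address stride in the listings


def profile(pc_trace, instr_len=INSTR_LEN):
    """Return {basic_block_start_addr: counter} for a sequence of PC values."""
    table = {}       # hypothetical software stand-in for the Profiler Table
    prev_pc = None
    for pc in pc_trace:
        # A PC change larger than one instruction is treated as a taken branch/jump.
        if prev_pc is not None and abs(pc - prev_pc) > instr_len:
            # Miss: new entry with counter 1. Hit: increment the counter.
            table[pc] = table.get(pc, 0) + 1
        prev_pc = pc
    return table


trace = [0x400d10, 0x400d18, 0x400db8, 0x400dc0, 0x400dc8, 0x400db8]
print({hex(addr): count for addr, count in profile(trace).items()})
# prints {'0x400db8': 2}
```

A real Profiler Table has a fixed number of entries; replacement when the table is full is not modeled here.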
Detecting Start Addr of HBBs
  400d10: addiu $29,$29,-8
  400d18: addu $8,$0,$4
  400d20: sw $0,0($29)
  400d28: addu $4,$0,$0
  400d30: addu $7,$0,$0
  400d38: lui $9,49152
  400d40: sll $4,$4,0x2
  400d48: and $2,$8,$9
  400d50: bne $2,$0,400db8 <usqrt+0xa8>
  400d58: srl $2,$2,0x1e
  400d60: lw $3,0($29)
  400d68: addu $4,$4,$2
  400d70: sll $8,$8,0x2
  400d78: sll $6,$3,0x1
  400d80: sll $3,$3,0x2
  400d88: addiu $3,$3,1
  400d90: sltu $2,$4,$3
  400d98: sw $6,0($29)
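[Figure: the code after the bne is marked as the "Not taken part". The Profiler Table (columns BBSA, Counter) feeds the HBB Table (columns HBBSA, Counter, BTA, Taken Freq, Exec Freq, subHot?) through the test Counter > Threshold, with Threshold = 100. Entries shown: 400d10 500, 400db8 500X, 400db8 50. Entries that pass the test are marked HBB.]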
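The Counter > Threshold test can be illustrated with a few lines of code. This is a sketch under assumptions: the function name `detect_hbbs` and the dict representation are hypothetical, and only the BBSA/Counter pair is modeled, not the other HBB Table columns.

```python
def detect_hbbs(profiler_table, threshold):
    """Return the start addresses whose counter is strictly greater than the threshold."""
    return sorted(addr for addr, count in profiler_table.items() if count > threshold)


# Two entries and the threshold taken from the figure.
table = {0x400d10: 500, 0x400db8: 50}
print([hex(addr) for addr in detect_hbbs(table, threshold=100)])
# prints ['0x400d10']
```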
Size of Profiler Table
Exec Freq Threshold    128    256    512   1024   2048
adpcm (enc)             28     28     28     28     28
basicmath              126    125    121    120    118
cjpeg                  290    216    192    127    114
djpeg                  163    154    108     48     35
lame                  1109    978    929    852    537
dijkstra               117    116    103    101    101
patricia               290    290    255    228    216
blowfish                87     87     84     23     17
rijndael (enc)         107    107    106     37     37
sha                     73     73     61     17     13
crc                     37     37     36     36     36
fft                     68     68     65     65     65
gsm                    364    362    329    328    319
Number of Basic Blocks with Exec Freq more than Threshold
Accelerator (ACC)
  ACC is a matrix of Functional Units (FUs)
  ACC has a two-level configuration memory
    A multi-context memory (keeps two or four configurations)
    A cache
  FUs support only logical operations, add/subtract, shifts and compare
  ACC updates the PC
  ACC has a variable delay which depends on the size of the Custom Instruction
Connecting ACC to the Base Processor
[Figure: datapath with the labels Decoder, DEC/EXE Pipeline Registers, FU1, FU2, FU3, FU4, ACC, Reg0 ... Reg31, Sequencer, EXE/MEM Pipeline Registers and Config Mem. The figure appears on two consecutive slides; the second copy carries an additional Sequencer label.]
Sequencer
  The sequencer mainly determines the microcode execution sequence.
    Selects between decoder and config memory for reading the RF
    Selects between the output of the Functional Unit and the Accelerator
    Determines when to switch between different contexts of the multi-context memory
    Determines when to load configuration data from cache to multi-context memory
  Checks the configuration data of a custom instruction
    If it is in multi-context memory, the custom instruction is executed on the accelerator
    If it is not in multi-context memory
      If there is enough time to load it from cache to multi-context memory, loads it and executes the CI on the ACC
      If there is not enough time, the original code is executed.
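The decision the sequencer makes for each custom instruction can be sketched as below. This is a simplified model under assumptions: `ConfigMemory`, `dispatch`, `cycles_available` and `load_cycles` are hypothetical names, capacity limits and eviction in the multi-context memory are ignored, and a CI whose configuration is in neither memory is assumed to fall back to the original code.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigMemory:
    """Hypothetical two-level configuration store: multi-context memory plus cache."""
    multi_context: set = field(default_factory=set)
    cache: set = field(default_factory=set)


def dispatch(ci_id, mem, cycles_available, load_cycles):
    """Return "ACC" if the CI runs on the accelerator, "BASE" if the original code runs."""
    if ci_id in mem.multi_context:
        return "ACC"
    # Configuration is only in the cache: load it if there is enough time.
    if ci_id in mem.cache and cycles_available >= load_cycles:
        mem.multi_context.add(ci_id)
        return "ACC"
    # Not enough time (or no configuration at all): execute the original code.
    return "BASE"


mem = ConfigMemory(multi_context={1}, cache={2, 3})
print(dispatch(1, mem, cycles_available=0, load_cycles=5))   # prints ACC
print(dispatch(2, mem, cycles_available=10, load_cycles=5))  # prints ACC
print(dispatch(3, mem, cycles_available=2, load_cycles=5))   # prints BASE
```

Falling back to the original code is what keeps the absence of configuration data free of any penalty beyond the lost speedup.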
Generation of Custom Instructions
Custom instructions
  Exclude floating point, multiply, divide and load instructions
  Include at most one STORE, at most one BRANCH/JUMP and all other fixed point instructions
Simple algorithm for generating custom instructions
  HBBs usually include 10~40 instructions for Mibench
  The custom instruction generator is going to be executed on the base processor (in online training mode)
Generating Custom Instructions
  4052c0 addiu $29,$29,-32
  4052c8 mov.d $f0,$f12
  4052d0 sw $18,24($29)
  4052d8 addu $18,$0,$6
  4052e0 sw $31,28($29)
  4052e8 sw $16,16($29)
  4052f0 mfc1 $16,$f0
  4052f8 mfc1 $17,$f1
  405300 srl $6,$17,0x14
  405308 andi $6,$6,2047
  405310 sltiu $2,$6,2047
  405318 addu $6,$6,$18
  405320 sltiu $2,$6,2047
  405328 lui $2,32783
  405330 and $17,$17,$2
  405338 andi $2,$6,2047
  405340 sll $2,$2,0x14
  405348 or $17,$17,$2
  405350 mtc1 $16,$f0
  405358 mtc1 $17,$f1
  405360 lw $31,28($29)
  405370 lw $16,16($29)
  405378 addiu $29,$29,32
  405380 jr $31
Finding the biggest sequence of instructions in the HBB that can be executed on the ACC
Moving the instructions and appending supportable instructions to the head of the detected instruction sequence after checking flow-dependency and anti-dependency
Moving the instructions and appending supportable instructions to the tail of the detected instruction sequence after checking flow-dependency and anti-dependency
Rewriting object code if instructions have been moved
Moving instructions must not change the logic of the application
Custom instruction generation is done without considering any other constraints.
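The first step, finding the biggest instruction sequence that the ACC can execute, can be sketched as a search for the longest contiguous run of supported opcodes with at most one store and at most one branch/jump. The opcode sets and the function name `longest_ci_candidate` are assumptions made for illustration (for instance, `lui` is assumed to be supported), and the instruction-moving steps are not modeled.

```python
# Assumed opcode classes; the slides only name the categories, not the exact opcodes.
ALU = {"addiu", "addu", "subu", "and", "andi", "or", "ori", "xor", "xori", "nor",
       "sll", "srl", "sra", "slt", "slti", "sltu", "sltiu", "lui"}
STORES = {"sw", "sh", "sb"}
CONTROL = {"beq", "bne", "j", "jr", "jal"}


def longest_ci_candidate(opcodes):
    """Return (start_index, length) of the longest run executable on the ACC."""
    best_start, best_len = 0, 0
    for start in range(len(opcodes)):
        stores = branches = 0
        end = start
        while end < len(opcodes):
            op = opcodes[end]
            if op in STORES:
                stores += 1
            elif op in CONTROL:
                branches += 1
            elif op not in ALU:
                break  # load, multiply/divide or floating point: not supported
            if stores > 1 or branches > 1:
                break  # at most one STORE and one BRANCH/JUMP per CI
            end += 1
        if end - start > best_len:
            best_start, best_len = start, end - start
    return best_start, best_len


hbb = ["addiu", "mov.d", "sw", "addu", "sw", "sw", "mfc1", "mfc1", "srl", "andi",
       "sltiu", "addu", "sltiu", "lui", "and", "andi", "sll", "or", "mtc1", "mtc1",
       "lw", "lw", "addiu", "jr"]
print(longest_ci_candidate(hbb))  # prints (8, 10): srl at 405300 through or at 405348
```

The quadratic scan is acceptable here because HBBs are reported to contain only about 10 to 40 instructions.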
Generating Custom Instructions
Block 3 (B3) is selected as the biggest instruction sequence that can be executed on the ACC
Block 2 (B2) cannot be executed on the ACC
Block 1 (B1) can be executed on the ACC
If there is no flow dependency or anti-dependency between B1 and B2, exchange them.
The same is done for B4 and B5.
[Figure: an HBB split into the blocks B1 (supported instructions), B2 (not supported), B3 (supported), B4 (not supported) and B5 (supported), followed by views of B1, B2 and B3 before and after B1 and B2 are exchanged.]
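The dependency test that decides whether two adjacent blocks may be exchanged can be sketched as follows. The `Instr` record and `can_exchange` are hypothetical names; the register sets are written by hand rather than decoded from object code; the output-dependency (write-after-write) check is an extra safety condition not mentioned on the slide; and memory dependencies between loads and stores are not modeled.

```python
from typing import FrozenSet, NamedTuple


class Instr(NamedTuple):
    text: str
    writes: FrozenSet[str]
    reads: FrozenSet[str]


def _regs(block, attr):
    """Union of the registers read or written by a block, ignoring the constant $0."""
    regs = set()
    for ins in block:
        regs |= getattr(ins, attr)
    regs.discard("$0")
    return regs


def can_exchange(first, second):
    """True if `second` may be moved in front of `first` without changing results."""
    w1, r1 = _regs(first, "writes"), _regs(first, "reads")
    w2, r2 = _regs(second, "writes"), _regs(second, "reads")
    flow = w1 & r2    # second reads a value produced by first
    anti = r1 & w2    # second overwrites a value first still has to read
    output = w1 & w2  # both write the same register (added assumption)
    return not (flow or anti or output)


b1 = [Instr("addu $8,$0,$4", frozenset({"$8"}), frozenset({"$0", "$4"}))]
b2 = [Instr("lw $3,0($29)", frozenset({"$3"}), frozenset({"$29"}))]
b3 = [Instr("addu $4,$4,$3", frozenset({"$4"}), frozenset({"$4", "$3"}))]
print(can_exchange(b1, b2))  # prints True
print(can_exchange(b2, b3))  # prints False (flow dependency on $3)
```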
Example 1
  400d10: addiu $29,$29,-8
  400d18: addu $8,$0,$4
  400d20: sw $0,0($29)
  400d28: addu $4,$0,$0
  400d30: addu $7,$0,$0
  400d38: lui $9,49152
  400d40: sll $4,$4,0x2
  400d48: and $2,$8,$9
  400d50: srl $29,$2,0x1e
  400d58: lw $3,0($29)
  400d60: addu $4,$4,$3
  400d68: sll $8,$8,0x2
  400d70: sll $6,$3,0x1
  400d78: sll $3,$3,0x2
  400d80: addiu $3,$3,1
  400d88: sltu $2,$4,$3
  400d90: sw $6,0($29)
  400d98: bne $2,$0,400db8 <usqrt+0xa8>
Customized Instruction 1
Customized Instruction 2
Example 2 (rewriting obj code)
  400d10: addiu $29,$29,-8
  400d18: addu $8,$0,$4
  400d20: addu $7,$0,$0
  400d28: lui $9,49152
  400d30: sll $4,$4,0x2
  400d38: and $2,$8,$9
  400d40: srl $2,$2,0x1e
  400d48: lw $22,0($29)
  400d50: addu $4,$4,$2
  400d58: sll $8,$8,0x2
  400d60: sll $6,$3,0x1
  400d68: sll $3,$3,0x2
  400d70: sltu $2,$4,$3
  400d78: bne $2,$0,400db8 <usqrt+0xa8>
ACC Config Data Generation Flow
[Figure: flow diagram with the labels Mibench Applications, Simplescalar (PISA Configuration), Base Processor, Profiler, Detecting Start Addr of HBBs, Reading HBBs from Obj Code, DFG and ACC Map.]
A Custom Instruction
  1: SUBU R3, R0, R3
  2: ADDU R10, R0, R0
  3: SRA R8, R10, 0x3
  4: SLT R2, R3, R8
  5: BNE R0,400488, R2
[Figure: Data Flow Graph of the five instructions (nodes SUBU, ADDU, SRA, SLT, BNE; values R0, R3, R10, R8, R2, 0x3, 400488) and the resulting ACC Map with nodes 1 to 5.]
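How a data flow graph could be derived from the five-instruction example is sketched below. This is an assumption-laden illustration, not the authors' tool: `build_dfg` is a hypothetical name, instructions are given as hand-written tuples, and the "level" of a node (one more than its deepest producer) is only a plausible stand-in for the ACC row it would be mapped to.

```python
def build_dfg(instrs):
    """instrs: list of (node_id, opcode, dest_or_None, [source registers]).

    Returns (edges, level): edges are (producer, consumer) pairs and level maps
    each node to 1 + the deepest level among its producers.
    """
    last_writer = {}  # register -> node that most recently wrote it
    edges = []
    level = {}
    for node, _opcode, dest, srcs in instrs:
        preds = [last_writer[reg] for reg in srcs if reg in last_writer]
        edges.extend((pred, node) for pred in preds)
        level[node] = 1 + max((level[pred] for pred in preds), default=0)
        if dest is not None:
            last_writer[dest] = node
    return edges, level


ci = [
    (1, "SUBU", "R3", ["R0", "R3"]),
    (2, "ADDU", "R10", ["R0", "R0"]),
    (3, "SRA", "R8", ["R10"]),
    (4, "SLT", "R2", ["R3", "R8"]),
    (5, "BNE", None, ["R0", "R2"]),
]
edges, level = build_dfg(ci)
print(edges)  # prints [(2, 3), (1, 4), (3, 4), (4, 5)]
print(level)  # prints {1: 1, 2: 1, 3: 2, 4: 3, 5: 4}
```

Registers that no earlier instruction in the CI writes (here R0 and the initial R3) are treated as external inputs of the custom instruction.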
Preliminary Performance Evaluation
  400d10: addiu $29,$29,-8
  400d18: addu $8,$0,$4
  400d20: sw $0,0($29)
  400d28: addu $4,$0,$0
  400d30: addu $7,$0,$0
  400d38: lui $9,49152
  400d40: sll $4,$4,0x2
  400d48: and $2,$8,$9
  400d50: srl $2,$2,0x1e
[Figure: the nine instructions mapped onto the FU matrix of the ACC.]
Depth = 3
1st row = 1 clock, then 0.5 clock + 0.5 clock; Total = 2 clocks
9 – 2 = 7 clock cycles
7 * freq = reduced clock cycles
7 * 50K = 350K clock cycles
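The arithmetic above can be written as a small helper. It is a sketch assuming a single-issue base processor that spends one cycle per instruction; `saved_cycles` and its parameter names are hypothetical.

```python
def saved_cycles(num_instructions, acc_cycles, exec_freq, cycles_per_instruction=1):
    """Clock cycles saved when a CI replaces `num_instructions` base instructions."""
    base_cycles = num_instructions * cycles_per_instruction
    return (base_cycles - acc_cycles) * exec_freq


# Nine instructions, two ACC clocks, executed 50K times.
print(saved_cycles(9, 2, 50_000))  # prints 350000
```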
Results – Number of CI considering their length
[Figure: bar chart. Y-axis: Number (0 to 70, with one bar labeled 82). Series: Length of CIs in the bins 1~5, 6~10, 11~15, 16~20, 21~25, 26~30, 31~35, 36~40 and 41~45. X-axis: basicmath_large_64K, cjpeg_32K, djpeg_8K, lame_32K, dijkstra_64K, patricia_128K, blowfish_128K, rijndael-enc_128K, rijndael-dec_128K, sha_64K, adpcm_enc_2000K, adpcm_dec_2000K, crc_2000K, fft_128K, fft-inv_128K, gsm_128K (cod).]
Results – Percentage of CIs considering their length
[Figure: stacked bar chart. Y-axis: Percent (0 to 100). Series: Length of CIs in the bins 1~5, 6~10, 11~15, 16~20, 21~25, 26~30, 31~35, 36~40 and 41~45. X-axis: basicmath_large_64K, cjpeg_32K, djpeg_8K, lame_32K, dijkstra_64K, patricia_128K, blowfish_128K, rijndael-enc_128K, rijndael-dec_128K, sha_64K, adpcm_enc_2000K, adpcm_dec_2000K, crc_2000K, fft, fft-inv, gsm (cod).]
More info on Custom Instructions
App.              Exe Instr (M)  Threshold (K)  # HBB  # CI  % Speedup  % code size  % exec time
basicmath_large        170             64          37    18     19.6       1.4          31.6
cjpeg                  101             32          42    52     27         1.5          44
djpeg                   25              8          22    32     31.5       0.8          48
lame                   260             32         142   104      8.6       1.1          16
dijkstra               254             64          34    20     21.4       0.7          38.6
patricia               217            128          51    17      7.8       0.6          14.6
blowfish               260            128          18    28     33         2.7          59
rijndael (enc)         260            128          63    92     36         6.1          51.7
rijndael (dec)         259            128          63    78     36         4.5          51.7
sha                    154             64           9    13     52         1.1          73
adpcm (enc)            260           2000          14     8     21         0.32         42
adpcm (dec)            265           2000          12     5     24         0.24         41
crc                    265            512           4     2     20         0.1          44.9
fft                    189            128          43    19     18.6       0.93         30
fft (inv)              190            128          43    19     18.6       0.93         30
gsm (cod)              265            128          34    41     25.1       1.53         47.2
Average                                            39    34     25         1.53         41.45
Conclusions
An Adaptive Dynamic Extensible Processor
  Training mode and Normal mode
Advantages
  It has a simple profiler
  CIs are detected and added after production
  There is no need for a new compiler
  There is no need for new opcodes for CIs
  There is no penalty for absence of CI config data
  Lower design cost and shorter design time
By accelerating a small part of the code that has a high execution frequency, an average speedup of 25% can be obtained. Compared with a single-issue processor, the speedup ranges from 7.8% to 52%.
Future Work
  Linking HBBs
  Providing more details on the architecture (Accelerator, sequencer, etc.)
  Designing an Accelerator to support conditional execution
  Developing a complete framework
  Extending ACC for floating point operations
  Substituting the in-order base processor with an out-of-order one
Thank you for listening
Example
Application X
  CIx1, 100, input = 3
  CIx2, 200, input = 6
  Total executed instructions = 400,000
Application Y
  CIy1, 50, input = 4
  CIy2, 400, input = 6
  Total executed instructions = 800,000
Input < 5
  \frac{(100 \times 2) + 50}{(100 \times 2) + (200 \times 2) + 50 + 400}
Mapping Tool - Example
A Custom Instruction
  1: SUBU R3, R0, R3
  2: ADDU R10, R0, R0
  3: SRA R8, R10, 0x3
  4: SLT R2, R3, R8
  5: BNE R0,400488, R2
[Figure: Data Flow Graph of the five instructions (nodes SUBU, ADDU, SRA, SLT, BNE; values R0, R3, R10, R8, R2, 0x3, 400488) and the resulting ACC Map with nodes 1 to 5.]
RFU Design: A Quantitative Approach
RFU or Accelerator is a matrix of ALUs
  No. of Inputs
  No. of Outputs
  No. of ALUs
  Connections
  Location of Inputs & Outputs
Some definitions:
  Considering frequency and weight in measurement
    CI Execution Frequency
    Weight (to equalize the number of executed instructions)
    Average = for all CIs (ΣFreq*Weight)
  Rejection: Percentage of CIs that could not be mapped on the RFU
  Coverage: Percentage of CIs that could be mapped on the RFU
  Basic Block: A sequence of instructions that terminates in a control instruction
  Hot Basic Block: A basic block executed more than a threshold number of times
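One possible reading of these definitions is that coverage and rejection are computed over Freq*Weight rather than over a plain count of CIs. The sketch below follows that reading and should be taken as an assumption, as should the names `coverage_and_rejection` and `fits`. The frequencies and input counts come from the backup example (applications X and Y); the weights 2 and 1 are assumed values chosen so that both applications count for 800,000 executed instructions.

```python
def coverage_and_rejection(cis, fits):
    """cis: list of (freq, weight, num_inputs); fits: predicate on num_inputs.

    Returns (coverage %, rejection %), both weighted by freq * weight.
    """
    total = sum(freq * weight for freq, weight, _ in cis)
    mapped = sum(freq * weight for freq, weight, inputs in cis if fits(inputs))
    coverage = 100.0 * mapped / total
    return coverage, 100.0 - coverage


cis = [
    (100, 2, 3),  # CIx1 (application X, assumed weight 2)
    (200, 2, 6),  # CIx2
    (50, 1, 4),   # CIy1 (application Y, assumed weight 1)
    (400, 1, 6),  # CIy2
]
cov, rej = coverage_and_rejection(cis, fits=lambda inputs: inputs < 5)
print(round(cov, 2), round(rej, 2))  # prints 23.81 76.19
```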
RFU Inputs (no constraint)
[Figure: "Input No. Analysis - Optimized Version". Coverage (0 to 120) versus Input No. (1 to 19). Marked values: 89.37, 96.37 and 98.48; annotation: 8.]
RFU Outputs (no constraint)
[Figure: "Output No. Analysis - Optimized Version". Coverage (0 to 120) versus Output No. (1 to 14). Marked value: 96.58; annotation: 6.]
RFU Node No (Input=8, Output=8)
[Figure: "Node No. Analysis - Optimized Version". Coverage (0 to 120) versus Node No. (1 to 35). Series: Coverage based on Total CIs, Coverage based on remaining CIs. Marked value: 94.74; annotation: 16.]
RFU Width (Inp=8, Out=8, Node=16)
[Figure: "ACC Width Analysis - Optimized Version". Coverage (0 to 120) versus ACC Width (1 to 10). Marked values: 95.65 and 97.65; annotation: 6.]
RFU Depth (Inp=8, Out=8, Node=16)
[Figure: "ACC Height Analysis - Optimized Version". Coverage (0 to 120) versus ACC Height (1 to 14). Marked value: 93.41; annotation: 6.]
RFU Configuration
  Input = 8
  Output = 8
  Node = 16
  Width = 6,4,3,2,1
  Depth = 5
General overview of RFU (Architecture 1)
  Inputs are applied to the first row
  Outputs of each row are connected only to the inputs of the subsequent row
  MOVE is used for transferring data
  Rejection is 22.47%
General overview of RFU (Architecture 2)
  Distributing inputs in different rows
    Row 1 = 7
    Row 2 = 2
    Row 3 = 2
    Row 4 = 2
    Row 5 = 1
  Connections with variable length
    row1 to row3 = 1
    row1 to row4 = 1
    row1 to row5 = 1
    row2 to row4 = 1
  Rejection is 9.52%
Functional Units
Types of FUs:
  Type 1: Logical (xor, nor, and, or)
  Type 2: add, sub, compare
  Type 3: shift (left/right)
Number of each type in the RFU
  Type 1 = 6
  Type 2 = 14
  Type 3 = 9
RFU with 8 outputs
[Figure: four Reg blocks and the Accelerator, with FU1-Output, FU2-Output, FU3-Output and FU4-Output, selected by Sequencer/control bits.]
Control Bits & Immediate Data
287 bits are needed as Control Bits
  Multiplexers
  Functional Units
204 bits are needed for Immediates
Each CI configuration needs 287 + 204 = 491 bits
CI Configuration Memory
2K x 1-bit multi-context memory
  4 CI configurations
8K x 1-bit cache
  16 CI configurations
A total of 20 CI configurations can be kept in the configuration memories
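The capacity figures can be checked with a short calculation, using 287 control bits plus 204 immediate bits per CI configuration. It assumes that 2K and 8K mean 2048 and 8192 bits and that configurations are stored back to back; `configs_that_fit` is a hypothetical helper name.

```python
CONTROL_BITS = 287
IMMEDIATE_BITS = 204
BITS_PER_CI = CONTROL_BITS + IMMEDIATE_BITS  # 491 bits per CI configuration


def configs_that_fit(memory_bits, bits_per_ci=BITS_PER_CI):
    """Number of whole CI configurations that fit in a memory of `memory_bits` bits."""
    return memory_bits // bits_per_ci


multi_context = configs_that_fit(2 * 1024)  # 2K x 1-bit multi-context memory
cache = configs_that_fit(8 * 1024)          # 8K x 1-bit cache
print(multi_context, cache, multi_context + cache)  # prints 4 16 20
```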
Extension of Custom Instructions over HBBs – Motivating Example
[Figure: control-flow graph with the blocks B1 to B12 and the labels S1 to S10 and J1 to J3.]
Name of the block   No. of Exe. (M)   No. of Instr
B1                       11.6              5
B2                        5.8              1
B3                        5.8              4
B4                        8.6              3
B5                        5.2              3
B6                        5.6              1
B7                        5.8              2
B8                       11.6              2
B9                       11.6              6
B10                      11.6              2
B11                      11.6              4
B12                       5.8              3
Multi-Exit Custom Instructions
Conclusions
Adaptive Dynamic Extensible Processor
  Binary Profiler
  RFU (Inp=8, Out=6, Nodes=16, Width=6,4,3,2,1, Depth=5)
  Sequencer
Adaptive Dynamic Extensible Processor
  No design time
  No extra read port and write port
  No design and verification cost
  No compiler
  No new opcode
  No penalty for absence of configuration data of custom instruction in multi-context memory.
Custom Instruction
Generated from HBBs
  Using HBB table
  Object code
Custom instruction can include
  Logical operations
  add/sub
  Shift
  At most one store
  At most one control instruction (jump/branch)
  No load
  No floating point instructions
New object code
  Is logically equivalent
[Figure: Profiler Table with the columns BBSA and Counter.]
Processor modes (1/2)
Training mode
  Profiling applications
  Detecting critical regions of code
  Generating DFGs for critical regions
  Generating custom instructions from DFGs
  Generating new object code
  Generating data for accelerator configuration memories and initializing the sequencer table
  Training can be done in the gap between two consecutive executions of the application if possible; otherwise it is done just once before the processor starts its normal operation
Processor modes (2/2)
Normal mode
  Profiling applications
  Using the data generated in training mode to execute custom instructions on the accelerator.
Critical regions of the code are executed as custom instructions on the accelerator, and the remaining code is executed on the processor functional units as usual.
Online Profiler-Components
Profiler
  Hardware
  Software
Hardware
  Comparator: compares the current and previous values of the Program Counter (PC).
  Profiler Table: for each taken branch/jump target address, the table holds a corresponding counter. The counter counts how many taken branches or jumps have been made to the target address.
Software
  Hot Basic Block (HBB) detector
*A basic block is a sequence of instructions that ends in a branch or jump.
Architecture Advantages
  No compiler
  No new opcode
  No penalty for absence of configuration data of custom instruction in multi-context memory.
  The ability to use the processor functional unit and the accelerator in parallel.
  Custom instruction detection and execution are done fully automatically and transparently.
General overview of the architecture
  Base processor (1, 2 or 4-way in-order general RISC)
  Profiler
    Detects start address of Hot Basic Blocks (HBBs)
  Accelerator (ACC)
    Executes Custom Instructions
  Sequencer
    Determines the microcode execution sequence using the sequencer table