Efficient IP Design Flow for Low-Power High-Level Synthesis: Quick & Accurate Power Analysis and Optimization Flow
Jan. 20, 2014
Asher Berkovitz
Yaniv Fais
2 © 2014 Freescale Semiconductor, Inc. | External Use
Authors Contact Details
Asher [email protected]
+972- 09-9522511
Yaniv Fais [email protected]
+972- 09-9522179
Freescale Semiconductor
Israel
Herzelia
Shenkar 3
Outline
• Challenges
• High Level Synthesis flow
• Power Efficiency
− Problems at RTL
• Proposed VSIM++ Flow
− Analysis
− Optimization
− Results on Networking Algorithm (Non-Abstract Version)
• Conclusions
Challenges
• IP blocks for networking applications must meet tight power budgets while also meeting aggressive performance requirements.
• Changes to the micro-architecture and other high-abstraction modeling choices can deliver the largest benefits in overall power.
• It is hard to measure power accurately at higher abstraction levels.
• Accurate power measurement at signoff comes late in the design process, when high-level changes are no longer possible.
High-Level Synthesis Design Flow
Flow (from the diagram): algorithm definition, macro-architecture definition, bit-exact SystemC® model, HLS (inputs: commands (uArch) and cell library (.lib)), RTL, then the "normal" RTL2GDSII flow.
• Macro-architecture definition:
− Based on an accelerator base class
− Uses unified modules (FIFOs, interfaces, etc.)
• SystemC® model:
− Architecture evaluation and RTL generation
− Accurate data path description according to the macro-architecture
− Designed to meet processing requirements
• HLS:
− Builds pipelined data path and control logic
− Considers real timings during RTL generation
− Explores implementation tradeoffs
• RTL quick explore (timing/area)
Power Dissipation
• Static power: approximately test independent
• Dynamic power: highly dependent on the application (signal transitions)
• Signal transitions can be divided into:
− Functional changes
− Glitches (signal changes that are not captured by a sequential element)
• Glitches are not visible in RTL simulation and can contribute ~20% of power dissipation
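To make the functional/glitch split concrete, here is a minimal sketch that classifies the toggles of one net. It assumes a simplified model that is not part of the original flow: each clock cycle is given as the ordered list of values the net takes, and the last value of a cycle is the one a sequential element would capture. The names `ToggleCount` and `classify_toggles` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Toggle counts for one net over a simulation window.
struct ToggleCount {
    std::size_t functional = 0;  // changes of the settled (captured) value
    std::size_t glitch = 0;      // extra toggles that settle before capture
};

// cycles[i] holds the values the net takes during clock cycle i, in time
// order; the last entry is the settled value at the capturing clock edge.
// initial is the settled value before the first cycle.
// Assumption: this cycle-based view is a simplification for illustration.
ToggleCount classify_toggles(const std::vector<std::vector<int>>& cycles,
                             int initial) {
    ToggleCount out;
    int settled = initial;
    for (const auto& samples : cycles) {
        std::size_t toggles = 0;
        int prev = settled;
        for (int v : samples) {
            if (v != prev) ++toggles;  // every value change costs dynamic power
            prev = v;
        }
        // At most one toggle per cycle is "functional": the settled value changed.
        std::size_t functional = (prev != settled) ? 1 : 0;
        out.functional += functional;
        out.glitch += toggles - functional;
        settled = prev;
    }
    return out;
}
```

An RTL simulation only sees the settled value per cycle, so it would report the `functional` count and miss the `glitch` count entirely, which is why a gate-level simulation with delays is needed.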
Fast & Accurate power analysis flow (VSIM)
• Quick Physical Design (PD) flow:
− Timing violations allowed
− DRC violations allowed
− Less than 100% RTL to GL equivalence
• A customized test bench enables cycle-accurate gate-level simulation.
• Power analysis is performed using the gate-level netlist and the parasitics file.
• Power analysis results are mapped back to the RTL netlist.
Flow diagram blocks: RTL DB, quick PD flow, test bench generation, GLV simulation, power analysis, mapping GL to RTL.
Test Bench Generation
• Based on RTL to GL mapping, force RTL values on GLV simulation
• Advantages:
− Short run time: simulate a selected window by forcing the correct values at time point X
− GL delay for logic cones (SDF)
Figure: standard test bench vs. "VSIM" test bench. With the standard test bench, a timing violation leaves values a bit "off". The "VSIM" test bench forces the RTL value on the key point, so correct values are forced (GL & SDF).
Gate-Level Results Mapping to RTL Netlist
RTL netlist:
    reg [1:0] cond;
    reg [1:0] count;
    always @(posedge clk) if (cond == 2'b11) count <= count + 1;
GL netlist (figure): key points Cond_0, Cond_1, count_0 and count_1 plus a clock gate, with numeric power annotations.
1. Map RTL to GL
2. For each unmapped GL instance: divide its power between its drive/load key points
3. Assign GL key point power to the RTL key point
4. The power of each RTL hierarchy is the sum of the power assigned to its key points
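The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the power of an unmapped instance is split equally between its drive/load key points (the slides do not say how the division is weighted), hierarchy is taken as the key point path up to the last '/', and all names (`GlInstance`, `power_per_key_point`, `power_per_hierarchy`) are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One gate-level instance with its reported power.
// key_points has a single entry when the instance maps directly to an RTL
// key point; otherwise it lists the key points that drive or load it.
struct GlInstance {
    double power;
    std::vector<std::string> key_points;
};

// Steps 2 and 3: attribute gate-level power to RTL key points.
// Assumption: equal split between the listed key points.
std::map<std::string, double> power_per_key_point(
        const std::vector<GlInstance>& insts) {
    std::map<std::string, double> result;
    for (const auto& inst : insts) {
        if (inst.key_points.empty()) continue;  // nothing to attribute to
        double share = inst.power / static_cast<double>(inst.key_points.size());
        for (const auto& kp : inst.key_points) result[kp] += share;
    }
    return result;
}

// Step 4: sum key point power per RTL hierarchy.
// Assumption: hierarchy is the path before the last '/' separator.
std::map<std::string, double> power_per_hierarchy(
        const std::map<std::string, double>& kp_power) {
    std::map<std::string, double> result;
    for (const auto& entry : kp_power) {
        std::size_t pos = entry.first.rfind('/');
        std::string hier = (pos == std::string::npos)
                               ? std::string("<top>")
                               : entry.first.substr(0, pos);
        result[hier] += entry.second;
    }
    return result;
}
```

For example, an instance with power 10 mapped to `top/u_cnt/count_0` and an unmapped instance with power 4 sitting between `top/u_cnt/count_0` and `top/u_cond/cond_1` would give 12 for `top/u_cnt` and 2 for `top/u_cond` (made-up numbers, not taken from the figure).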
Mapping results to high-level language (VSIM++)
• By annotating C++ class names, variable names, and file names/line numbers, we can map power consumption from the accurate gate-level results back to the C++ code.
• This capability allows us to:
− Analyze and fix clock gating
− Redesign "power hungry" resources
− Consider different architectures
RTL netlist:
    reg [1:0] my_var_Ln123;
    reg [1:0] count_Ln124;
    always @(posedge clk) if (my_var_Ln123 == 2'b11) count_Ln124 <= count_Ln124 + 1;
C++ code (source lines 121 to 127 in the figure; the _Ln123 and _Ln124 suffixes above point at the lines that use my_var and count):
    void process() {
      …
      while (true) {
        if (my_var==3)
          count++;
        …
      }
    }
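The name annotation shown above can be decoded mechanically. The sketch below assumes the suffix format seen on the slide (`<variable>_Ln<line>`); whether the real flow uses exactly this format everywhere is an assumption, and `SourceRef` and `parse_annotated_name` are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// C++ origin recovered from an annotated RTL name.
struct SourceRef {
    std::string variable;  // C++ variable name, e.g. "count"
    int line;              // C++ source line, or -1 when no annotation is found
};

// Splits a name such as "count_Ln124" into {"count", 124}.
// Assumption: the annotation is a trailing "_Ln" followed only by digits.
SourceRef parse_annotated_name(const std::string& rtl_name) {
    const std::string tag = "_Ln";
    SourceRef ref{rtl_name, -1};
    std::size_t pos = rtl_name.rfind(tag);
    if (pos == std::string::npos) return ref;
    std::size_t first_digit = pos + tag.size();
    if (first_digit >= rtl_name.size()) return ref;  // "_Ln" with no number
    int line = 0;
    for (std::size_t i = first_digit; i < rtl_name.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(rtl_name[i]);
        if (!std::isdigit(c)) return ref;  // not a pure line-number suffix
        line = line * 10 + (c - '0');
    }
    ref.variable = rtl_name.substr(0, pos);
    ref.line = line;
    return ref;
}
```

With per-key-point power already attributed to RTL names, decoding each name this way lets the power be accumulated per C++ line, which is what makes "power hungry" statements visible in the source.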
Example problem identified
• The tool automatically inserts "clock gating" enable code into the RTL
    always @(posedge clk)
      if (en)
        data[511:0] <= new_data;
Figure labels: C++ process condition, HLS, DFF, clk, en, new_data, data.
• The gate-level implementation does not use a gated clock; because of timing violations, the enable is implemented as data logic
• Solution: simplify the clock gating enables to meet timing constraints
Clock gating enabler simplification
Original clock gating scheme (figure labels: Hash Key, Header, Process control, clk, en, new_data, data): complicated enable logic, synthesized to a non-efficient enable.
Simplified clock gating scheme (figure labels: Hash Key, Process control, clk, en, new_data, data): the enable is synthesized without changes, leading to high clock gating efficiency.
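The slides do not define "clock gating efficiency". One common way to quantify it, used here purely as an assumption, is the fraction of clock cycles in which the register bank's clock is gated off. A minimal sketch with a hypothetical function name:

```cpp
#include <cstddef>
#include <vector>

// Fraction of clock cycles in which the clock is gated off, given the
// per-cycle value of the enable. Returns 0 for an empty trace.
// Assumption: this metric is one possible definition, not the one used
// by any specific tool.
double gated_fraction(const std::vector<bool>& enable_per_cycle) {
    if (enable_per_cycle.empty()) return 0.0;
    std::size_t gated = 0;
    for (bool en : enable_per_cycle) {
        if (!en) ++gated;  // clock does not reach the flops in this cycle
    }
    return static_cast<double>(gated) /
           static_cast<double>(enable_per_cycle.size());
}
```

If the enable ends up implemented as data logic instead of a gated clock, the clock still toggles at every flop in every cycle, so the effective value of this metric drops to zero regardless of how rarely the enable is asserted.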
Conclusions
• Use High Level Synthesis for IP Design
− Quick and easy to explore architecture alternatives
− Quick front-end flow including verification
• Power analysis:
− Measure power on system level scenario
− Quick (doesn’t require full physical design flow convergence)
− Accurate (done on gate-level)
• Analysis and Optimization in high-level design (C++)
− Manual clock gating enable setting reduced dynamic power consumption by 19.4%
• Early in the design cycle: easy to change the IP architecture!
Backup
Accuracy
• Measured using a similar methodology on a different design
• Si measurement compared to full T/O gate-level data
Test                                                 Dynamic power accuracy
Single core Fast Fourier Transform                   -7.59%
Single core Fast Fourier Transform, no memory miss   -8.40%
Dual core Fast Fourier Transform                     7.57%