Efficient IP Design Flow for Low-Power High-Level Synthesis: Quick & Accurate Power Analysis and Optimization Flow

Jan. 20, 2014

Asher Berkovitz, Yaniv Fais




© 2014 Freescale Semiconductor, Inc. | External Use

Authors Contact Details

Asher Berkovitz, [email protected], +972-09-9522511

Yaniv Fais, [email protected], +972-09-9522179

Freescale Semiconductor, Shenkar 3, Herzelia, Israel


Outline

• Challenges

• High Level Synthesis flow

• Power Efficiency

− Problems at RTL

• Proposed VSIM++ Flow

− Analysis

− Optimization

− Results on Networking Algorithm (Non-Abstract Version)

• Conclusions


Challenges

• IP blocks for networking applications must meet tight power budgets while also meeting aggressive performance requirements.

• Changes to the micro-architecture and other high-abstraction modeling choices can deliver the largest benefits in overall power.

• It is hard to measure power accurately at higher levels of abstraction.

• Accurate power measurement at signoff comes late in the design process, when high-level changes are no longer possible.


High Level Synthesis Design Flow

[Flow diagram; boxes: Algorithms Definition, Macro-Architecture Definition, Bit-exact SystemC® Model, HLS (inputs: Commands (uArch), Cell library (.lib)), RTL, RTL2GDSII "Normal" flow; SystemC® to RTL quick explore (Timing/Area)]

• Macro-architecture definition:

− Based on an accelerator base class

− Uses unified modules (FIFOs, interfaces etc.)

• SystemC® model:

− Architecture evaluation and RTL generation

− Accurate data path description according to the macro-architecture

− Design to meet processing requirements

• HLS:

− Builds pipelined data path and control logic

− Considers real timings during RTL generation

− Explores implementation tradeoffs


Power Dissipation

• Static power: approximately test-independent

• Dynamic power: highly dependent on the application (signal transitions)

• Signal transitions can be divided into:

− Functional changes

− Glitches (signal changes that are not captured by a sequential element)

• Glitches are not visible in RTL simulation and can contribute ~20% of power dissipation
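The split between functional changes and glitches can be illustrated with a minimal C++ sketch. It is not part of the original flow: it assumes a single-bit net that is sampled by a sequential element once per clock period, so an odd number of toggles within a cycle leaves exactly one captured change and every other toggle counts as a glitch. The names `ToggleSplit` and `split_toggles` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toggle counts for one single-bit net, split by whether a flop could capture them.
struct ToggleSplit {
    std::size_t functional = 0;  // toggles that change the value sampled at a clock edge
    std::size_t glitch = 0;      // toggles that cancel out before the next clock edge
};

// toggle_times: times at which the net changes value.
// clock_period: assumed sampling period, with clock edges at 0, T, 2T, ...
ToggleSplit split_toggles(const std::vector<double>& toggle_times, double clock_period) {
    assert(clock_period > 0.0);
    std::vector<std::size_t> per_cycle;  // number of toggles seen in each clock cycle
    for (double t : toggle_times) {
        assert(t >= 0.0);
        std::size_t cycle = static_cast<std::size_t>(t / clock_period);
        if (cycle >= per_cycle.size()) per_cycle.resize(cycle + 1, 0);
        ++per_cycle[cycle];
    }
    ToggleSplit out;
    for (std::size_t n : per_cycle) {
        std::size_t net_change = n % 2;  // odd count: the sampled value differs
        out.functional += net_change;
        out.glitch += n - net_change;
    }
    return out;
}
```

For example, three toggles inside one cycle count as one functional change and two glitch toggles. RTL simulation would only ever show the functional one, which is why the glitch share has to be measured at gate level.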


Fast & Accurate power analysis flow (VSIM)

• Quick Physical Design (PD) flow:

− Timing violations allowed

− DRC violations allowed

− Less than 100% RTL to GL equivalence

• A customized test bench enables cycle-accurate gate-level simulation

• Power analysis is performed using the gate-level netlist and parasitics file

• Power analysis results are mapped back to the RTL netlist

[Flow diagram; boxes: RTL DB, Quick PD flow, Test bench generation, GLV simulation, Power Analysis, Mapping GL 2 RTL]


Test Bench Generation

• Based on the RTL to GL mapping, force RTL values onto the GLV simulation

• Advantages:

− Timing violations are tolerated: with a standard test bench the values are a bit "off"; the "VSIM" test bench forces the RTL value on the key point, so correct values are used

− Short run time: simulate a selected window by forcing the correct values at time point X

− GL delay for logic cones (SDF)

[Figure: standard test bench vs. "VSIM" test bench, shown on flip-flop (D/Q) stages of the GL netlist with SDF]
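A minimal C++ sketch of the forcing step, under stated assumptions: the RTL to GL mapping is available as a name-to-path table, and the RTL simulation provides the value of each key point at the start of the selected window (time point X). The types and the function `build_forces` are hypothetical; the deck does not show how the test bench is actually generated, and the simulator-specific force syntax is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One value to force onto a gate-level key point at the start of the window.
struct Force {
    std::string gl_path;   // gate-level key point (hypothetical path format)
    std::uint64_t value;   // value taken from the RTL simulation
    double time;           // time point X at which the value is forced
};

// rtl_to_gl:        RTL key point name -> gate-level key point path
// rtl_values_at_x:  RTL key point name -> value sampled from RTL simulation at time X
std::vector<Force> build_forces(const std::map<std::string, std::string>& rtl_to_gl,
                                const std::map<std::string, std::uint64_t>& rtl_values_at_x,
                                double time_x) {
    assert(time_x >= 0.0);
    std::vector<Force> forces;
    for (const auto& entry : rtl_to_gl) {
        auto it = rtl_values_at_x.find(entry.first);
        if (it == rtl_values_at_x.end()) continue;  // no RTL sample: nothing to force
        forces.push_back({entry.second, it->second, time_x});
    }
    return forces;
}
```

Each resulting record would then be turned into whatever force statement the gate-level simulator expects. Starting from forced key-point values is what lets the run cover only the selected window.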


Gate level results mapping to RTL netlist

RTL netlist:

reg [1:0] cond;
reg [1:0] count;

always @(posedge clk)
  if (cond == 2'b11)
    count <= count + 1;

[Figure: GL netlist with key points Cond_0, Cond_1, count_0 and count_1 and a clock gate, annotated with power values]

1. Map RTL to GL.

2. For each unmapped GL instance: divide the power between its drive/load key points.

3. Assign the GL key point power to the RTL key point.

4. The power of each RTL hierarchy is the sum of the power assigned to its key points.
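The four steps above can be sketched in C++ as follows. This is an illustration under assumptions, not the authors' tool: key points are identified by hypothetical `/`-separated RTL names, an unmapped instance's power is divided equally between its drive/load key points (the deck does not state the split rule), and a key point's power is added to every enclosing hierarchy level.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct GlInstance {
    double power;                         // power reported by gate-level analysis
    std::vector<std::string> key_points;  // one RTL key point if mapped directly,
                                          // else the drive/load key points around it
};

// Steps 1 to 3: attribute every instance's power to RTL key points.
std::map<std::string, double> power_per_key_point(const std::vector<GlInstance>& insts) {
    std::map<std::string, double> kp;
    for (const auto& inst : insts) {
        assert(inst.power >= 0.0);
        if (inst.key_points.empty()) continue;  // assumption: unattributable power is skipped
        double share = inst.power / inst.key_points.size();  // assumption: equal split
        for (const auto& name : inst.key_points) kp[name] += share;
    }
    return kp;
}

// Step 4: the power of a hierarchy is the sum over the key points inside it.
std::map<std::string, double> power_per_hierarchy(const std::map<std::string, double>& kp) {
    std::map<std::string, double> hier;
    for (const auto& e : kp) {
        std::string path = e.first;
        std::size_t slash;
        while ((slash = path.rfind('/')) != std::string::npos) {
            path.resize(slash);      // strip the last name component
            hier[path] += e.second;  // credit this enclosing level
        }
    }
    return hier;
}
```

With this accounting the key-point totals add up to the gate-level total for every instance that has at least one key point, so no power is counted twice.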



Mapping results to high-level language (VSIM++)

• Using annotations of C++ class names and variable names, as well as file names and line numbers, we can map power consumption from the accurate gate-level results back to the C++ code.

• This capability allows us to:

− Analyze and fix clock gating

− Redesign "power hungry" resources

− Consider different architectures

RTL netlist:

reg [1:0] my_var_Ln123;
reg [1:0] count_Ln124;

always @(posedge clk)
  if (my_var_Ln123 == 2'b11)
    count_Ln124 <= count_Ln124 + 1;

C++ code (shown with line numbers 121 to 127 in the original figure):

void process() {
  …
  while (true) {
    if (my_var == 3)
      count++;
    …
  }
}
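As a rough illustration of how annotated names could carry power back to source lines, here is a C++ sketch. It assumes the `<variable>_Ln<line>` naming seen in the slide's RTL example; the helper names `parse_annotated_name` and `power_per_line` are hypothetical, and the real flow also uses class names and file names, which this sketch ignores.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

struct SourceRef {
    std::string variable;  // C++ variable name recovered from the RTL name
    int line;              // C++ source line recovered from the "_Ln" suffix
    bool valid;            // false if the name carries no usable annotation
};

// Split a name such as "count_Ln124" into {"count", 124}.
SourceRef parse_annotated_name(const std::string& rtl_name) {
    const std::string tag = "_Ln";
    std::size_t pos = rtl_name.rfind(tag);
    if (pos == std::string::npos || pos == 0) return {"", 0, false};
    std::string digits = rtl_name.substr(pos + tag.size());
    if (digits.empty() || digits.size() > 9) return {"", 0, false};
    for (char c : digits) {
        if (!std::isdigit(static_cast<unsigned char>(c))) return {"", 0, false};
    }
    return {rtl_name.substr(0, pos), std::stoi(digits), true};
}

// Sum the power of annotated RTL registers per C++ source line.
std::map<int, double> power_per_line(const std::map<std::string, double>& rtl_power) {
    std::map<int, double> by_line;
    for (const auto& e : rtl_power) {
        assert(e.second >= 0.0);
        SourceRef ref = parse_annotated_name(e.first);
        if (ref.valid) by_line[ref.line] += e.second;  // unannotated names are ignored
    }
    return by_line;
}
```

Sorting the resulting map by power would point directly at the "power hungry" lines of the C++ model.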


Example problem identified

• The tool automatically inserts "clock gating" enabler code in the RTL (C++ process condition, through HLS):

always @(posedge clk)
  if (en)
    data[511:0] <= new_data;

[Figure: DFF with signals clk, en, new_data and data]

• The gate-level implementation is not realized as a gated clock but as data logic, due to timing violations

• Solution: simplify the clock gating enablers to meet timing constraints
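Why the two implementations differ in dynamic power can be seen from a small behavioral count. The C++ sketch below is a simplification under assumptions: it only counts how many flip-flop clock pins see an active clock edge, given the enable value in each cycle, and it ignores the power of the gating cell and of the recirculating data logic. The function name `clocked_flop_edges` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Number of flip-flop clock-pin edges over a trace of enable values.
// gated_clock = true:  the clock reaches the flops only in cycles where en is high.
// gated_clock = false: en is implemented as data logic, so every flop is clocked
//                      in every cycle and simply reloads its old value when en is low.
std::uint64_t clocked_flop_edges(const std::vector<bool>& enable_per_cycle,
                                 std::uint64_t width,
                                 bool gated_clock) {
    assert(width > 0);
    std::uint64_t active_cycles = 0;
    for (bool en : enable_per_cycle) {
        if (en || !gated_clock) ++active_cycles;
    }
    return active_cycles * width;
}
```

For the 512-bit register in the example, an enable that is high in one cycle out of four gives 512 clocked flop edges with a gated clock and 2048 when the enable falls back to data logic.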


Clock gating enabler simplification

• Original clock gating scheme: complicated enable logic, synthesized to an inefficient enabler

[Figure, original scheme: DFF blocks labeled Hash Key, Header and Process control; signals clk, en, new_data and data]

• Simplified clock gating scheme: enable synthesized without changes, leading to high clock gating efficiency

[Figure, simplified scheme: DFF blocks labeled Hash Key and Process control; signals clk, en, new_data and data]


Conclusions

• Use High Level Synthesis for IP Design

− Quick and easy to explore architecture alternatives

− Quick front-end flow including verification

• Power analysis:

− Measure power on a system-level scenario

− Quick (doesn’t require full physical design flow convergence)

− Accurate (done on gate-level)

• Analysis and Optimization in high-level design (C++)

− Manual clock gating enable setting reduced dynamic power consumption by 19.4%

• Early in the design cycle: easy to change the IP architecture!


Backup


Accuracy

• Measured using a similar methodology on a different design

• Si measurement compared to full T/O gate level data

Test                                                  Dynamic power accuracy
Single core Fast Fourier Transform                    -7.59%
Single core Fast Fourier Transform, no memory miss    -8.40%
Dual core Fast Fourier Transform                      7.57%