implementing tta asips on standard cell technologyopenasip.org/move/doc/tta_seminar/sertamo.pdf ·...

22
Implementing TTA ASIPs on Standard Implementing TTA ASIPs on Standard Cell Technology Cell Technology Jaakko Sertamo [email protected] Institute of Digital and Computer Systems Tampere University of Technology Tampere, Finland

Upload: others

Post on 25-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

Implementing TTA ASIPs on Standard Implementing TTA ASIPs on Standard Cell TechnologyCell Technology

Jaakko [email protected]

Institute of Digital and Computer SystemsTampere University of Technology

Tampere, Finland

Page 2: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

OutlineOutline

TTA ASIP design flow @ TUT/DCSOptimization methods for TTA ASIP implementations on standard cells

Replacement strategies for the three-state busClock gating

Examples

Page 3: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

TTA ASIP design flow @ TUT/DCSTTA ASIP design flow @ TUT/DCSBased on MOVE Framework IDE developed at TUD

Functionality (C/C++)

Resources(Machine

Description)

Design SpaceExploration

Parametric compiler Hardware generator

BinaryExecutable Code

VHDLCode

Statistics Statistics

Technology description

Page 4: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

TTA ASIP design flow @ TUT/DCSTTA ASIP design flow @ TUT/DCSAfter a suitable architecture configuration is found, VHDL description of the target processor core is generated using processor generator scripts

Scripts, written in Python, are not part of the MOVE frameworkThey perform HDL to HDL transformation from internal

machine description format to synthesizable VHDL codeLeaf cells, i.e, functional units and register files are not generated

Implementation methodology of leaf cells can be freely chosen

Page 5: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

TTA ASIP design flow @ TUT/DCSTTA ASIP design flow @ TUT/DCSStandard cell implementation approach

Because of the simple control required in the functional units (SVTL pipelining) a full range of units supporting majority of integer operations can be easily described in RTL VHDLVHDL code is useful in two ways

Simulation/Verification at RTL levelInput for the logic synthesis

Also simple scripts for the logic synthesis can be obtained from the processor generator Synthesis itself is a fairly straightforward procedure

Simple top-down strategy is usedHowever, we are still far from working silicon

Little experience from complex, tool/vendor dependent physical design flow

Page 6: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

Optimization methods for TTA ASIP Optimization methods for TTA ASIP implementations on standard cellsimplementations on standard cells

Modularity of TTAs makes HDL writing / generation straightforward

No redundant logic or registers in generated HDL code

Most of the optimizations are performed automatically by the synthesis tool

Tool used: Synopsys Design Compiler Automatic selection of the accurate architecture of arithmetic units

For example, CSA vs. Wallace-tree multiplierIn general, describing structural logic (arithmetic units) independent of synthesis tool is very difficult and gives small benefits

Page 7: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

Optimization methods for TTA ASIP Optimization methods for TTA ASIP implementations on standard cellsimplementations on standard cells

Traditionally, transport (interconnection) network has been proposed to be implemented using three-state buffers

May result in smaller wiring area in full custom designs when only few metal layers are availableNowadays every process has 5+ layers of metalPlace and Route may give better results when no artificial boundaries between units and busses are givenPossible reliability problems

Fortunately, there are alternative topologies to replace the three-state bus

Page 8: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

Optimization methods for TTA ASIP Optimization methods for TTA ASIP implementations on standard cellsimplementations on standard cells

Multiplexer based selection for bus writesEach FU or RF output connected to a bus is an input to a multiplexer that drives the busNo result socketsSeveral alternatives for implementing the multiplexerDoes not affect the programming model

Page 9: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

Optimization methods for TTA ASIP Optimization methods for TTA ASIP implementations on standard cellsimplementations on standard cells

AND-OR structureControl bit for each output connected to a busDescribed explicitly in HDLThe final selection of the used cells is performed by the synthesis tool

Page 10: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

Optimization methods for TTA ASIP Optimization methods for TTA ASIP implementations on standard cellsimplementations on standard cells

Traditional three-state drivers“HDL multiplexer”

Source field from instruction word used to control which output writes to the busSynthesis tools selects the the detailed implementationResults in more or less same structure as in AND-OR based busSelection probably suffers from combined ID and opcode in source field

More bit patterns in SRC field, that than inputs to selection logic They cannot be considered as 'don't cares'

Page 11: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

Optimization methods for TTA ASIP Optimization methods for TTA ASIP implementations on standard cellsimplementations on standard cells

Clock gatingGoal: decrease the amount of nodes to be charged/discharged at clock transitionCan be performed automatically by the synthesis toolHDL coding style may improve results

Basic Idea: Make as many register writes as possible conditional

Modularity of TTA promotes clock gatingNo clock activity in units that are not used

DrawbacksMakes clock tree synthesis (CTS) more complexSynthesis tool dependent

Page 12: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

Optimization methods for TTA ASIP Optimization methods for TTA ASIP implementations on standard cellsimplementations on standard cells

Reduction in both power and area

Reference: Synopsys, Power Compiler Reference Manual

Page 13: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

ExamplesExamplesASIP for 32-point DCT (discrete cosine transform) application

Comparison: No clock gating vs. Gated clockArchitecture configuration

9 busses3 adders, shifter, multiplier, 2 load-store units8 X 4 registers

Simulation parametersClock frequency 100 MHzOperation voltage 1.5 VNominal operating conditions (T=25°C)

Page 14: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

ExamplesExamplesPower consumption by components

Total Power consumption

Saved 33.9 %

FU RF CNTL IC0

5001000150020002500300035004000450050005500600065007000750080008500

No Clock GatingGated ClockP

ower

/ uW

No Clock Gating Gated Clock0

2500

5000

7500

10000

12500

15000

17500

20000

22500

25000

Pow

er /

uW

Page 15: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

ExamplesExamplesArea by components Total area

Saved 10.9 %

FU RF CNTL IC0

2000

4000

6000

8000

10000

12000

14000

16000

18000

20000

No Clock GatingGated ClockA

rea

/ Gat

es

No Clock Gating Gated Clock0

250050007500

100001250015000175002000022500250002750030000325003500037500400004250045000

Area

/ G

ates

Page 16: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

ExamplesExamplesASIP for 8x8 2D-DCT

Comparison: No clock gating vs. Gated clockArchitecture configuration

7 busses2 adders, shifter, multiplier,comparator, 2 load-store units8 X 4 registers

Simulation parametersClock frequency 100 MHzOperation voltage 1.5 VNominal operating conditions (T=25°C)

Page 17: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

ExamplesExamplesPower consumption by components

Total power consumption

Power saved 34.2%

FU RF CNTL IC0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

5500

6000

6500

7000

No Clock GatingGated Clock

Pow

er /

uW

No Clock Gating Gated Clock0

2000

4000

6000

8000

10000

12000

14000

16000

18000

20000

Pow

er /

uW

Page 18: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

ExamplesExamplesArea by components Total area

Area saved: 13.2%

FU RF CNTL IC0

10002000

3000

4000

50006000

7000

80009000

10000

11000

120001300014000

15000

No Clock GatingGated ClockAr

ea /

Gat

es

No Clock Gating Gated Clock0

2500

5000

7500

10000

12500

15000

17500

20000

22500

25000

27500

30000

32500

35000

Are

a / G

ates

Page 19: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

ExamplesExamplesASIP for 8x8 DCT

Comparison: multiplexer/bus write implementationsArchitecture configuration

7 busses2 adders, shifter, multiplier,comparator, 2 load-store units8 X 4 registers

Simulation parametersClock frequency 100 MHzOperation voltage 1.5 VNominal operating conditions (T=25°C)

Page 20: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

ExamplesExamplesPower consumption by components

Total Power

FU RF CNTL IC0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

AND-ORMUXTRISTATE

Pow

er /

uW

AND-OR MUX TRISTATE0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

11000

12000

13000

Pow

er /

uW

Page 21: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

ExamplesExamplesArea by components Total Area

FU RF CNTL IC0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

11000

12000

AND-ORMUXTRISTATE

Are

a / G

ates

AND-OR MUX TRISTATE0

2500

5000

7500

10000

12500

15000

17500

20000

22500

25000

27500

30000

32500

35000

Are

a / G

ates

Page 22: Implementing TTA ASIPs on Standard Cell Technologyopenasip.org/move/doc/TTA_seminar/Sertamo.pdf · TTA ASIP design flow @ TUT/DCS After a suitable architecture configuration is found,

ConclusionsConclusionsMost of the design optimizations are performed at application/architectural levelRegister files are the most critical part of the TTAs implemented with standard cell technologyThere are alternatives for three-state bus which result in smaller area and lower power consumptionClock gating can significantly reduce power consumption of TTA ASIPs