7/28/2019 Directed Study Report

AUTOMATED STANDARD CELL LIBRARY GENERATION & STUDY OF CELL LIBRARY FUNCTIONAL CONTENT

Yunbum Jung
Dept. of Electrical Engineering and Computer Science
University of Michigan

ABSTRACT

As the operating frequency of circuits increases, a fixed cell library increasingly limits the performance that synthesized circuits can reach. One way to go beyond the limit of a fixed cell library is to use a fluid cell library. A fluid cell library provides, for each cell, customized drive strengths that are not available in the fixed cell library but are required for fine circuit tuning. To generate a fluid cell library as well as a fixed cell library effectively, an automated flow is applied to standard cell library generation. This paper presents the procedure for automated standard cell library generation and an overview of cell characterization. It also examines how each logic function in a cell library affects automated circuit design. The experiments show that the target cell library should be chosen selectively to obtain a high-quality synthesized design.

1. INTRODUCTION

Although a standard cell methodology reduces design effort in terms of time and cost, the performance of a synthesized design falls well short of that of a custom design. Many studies have sought to improve the quality of synthesized designs [1, 2, 3, 4, 5], and it has been shown that simple modifications to the cell library significantly affect the performance of a synthesized design [1]. Especially in high-performance applications, a fixed cell library prevents fine device tuning for delay and power optimization [5]. As an alternative that overcomes this limit, a fluid cell library suited to circuit tuning was proposed in [3]. Efforts of this kind increase the need for automated cell library generation, since an automated flow can easily create the many different cells required for circuit optimization.

In generating a new cell library, accurate characterization of each cell is important because other tools that use these cells predict circuit behavior from the characterized cell data. Before discussing the advantages of automated cell characterization, it is worth considering the disadvantages of manual cell characterization. Manual cell characterization requires a cell designer to create netlists and run simulations interactively. With this method, stimulus must be developed and applied to the cell being characterized. Once a simulation is complete, the data is extracted from each run, and the cell designer then inserts the data into the gate-level models and the datasheets. This manual process is error-prone and requires tremendous effort [6].

The cell library generation process includes cell design, layout generation, and physical abstraction as well as cell characterization. Several designers may share these tasks, so if misunderstandings exist among them, a manual method may lose consistency in its procedures.

To reduce both the effort needed to generate a new standard cell library and the number of inadvertent errors introduced when these tasks are done manually, an automated flow is applied to generate the new cell library. A well-organized automated flow provides consistent procedures, increases the range of simulation capabilities at the cell characterization step, and minimizes the risk of errors.

2. STANDARD CELL LIBRARY GENERATION

As shown in Fig. 1, the process of generating a standard cell library consists of four major steps, most of which are carried out automatically. Netlist files in SPICE format are created at the cell design step. The layout, which is the physical implementation of the netlist, is generated in the layout step; verification and parasitic extraction are also performed during the layout step. Stimulus generation, SPICE simulations, and data compilation are part of characterization. Physical abstraction of each cell for the place & route tool can be carried out in parallel with characterization. A physical abstract includes information about blockage layers, pin locations, and cell symmetry.

Fig. 1 Standard cell library generation flow

2.1 CELL DESIGN

The cell design phase consists of circuit design and transistor sizing. After drawing a schematic representing a logic gate, the cell designer determines the transistor sizes of each cell. Since the delay of a circuit depends not only on the drive strength of each stage but also on its P/N width ratio, it is important to give each cell in a standard cell library a good P/N width ratio. The optimal P/N ratio of each cell is derived such that it minimizes path delay [7]. Based on the optimal P/N ratio, a netlist template is generated for each cell, and this template is used to create the cell netlist at the intended size. It has been shown that providing cells with only a single drive strength degrades the speed of a synthesized design [1]. Therefore, providing each cell in a variety of drive strengths is regarded as a standard cell library design guideline. Following this guideline, most of the cells in a fixed standard cell library are designed in 4 drive strengths (xL, x1, x2, and x4), and the buffers and inverters have 5 additional drive strengths (x3, x8, x12, x16, and x20). To provide a greater variety of drive strengths easily, the size of each cell is parameterized. Fig. 2 illustrates how a parameterized netlist template is used: the P/N ratio of the inverter is fixed, but the transistor widths are scaled proportionally. A large number of transistors connected in series degrades the falling/rising time because of the higher series resistance. Therefore, each cell is restricted to no more than 4 series transistors.

Fig. 2 Parameterized Inverter template
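The parameterized template of Fig. 2 can be sketched as a text template whose widths are filled in per drive strength. This is an illustrative sketch only: the unit width, the P/N ratio value, the channel length, and the model names `pch`/`nch` are hypothetical placeholders, not values from this library. It keeps the P/N ratio fixed, scales both widths proportionally, and includes a trivial check for the 4-series-transistor rule.

```python
# Sketch of a parameterized inverter netlist template.
# All numeric values and model names are hypothetical placeholders.

W_N_UNIT_UM = 0.5   # NMOS width of the x1 inverter in micrometres (assumed)
PN_RATIO = 2.0      # fixed P/N width ratio shared by every size (assumed)
L_UM = 0.18         # drawn channel length in micrometres (assumed)
MAX_SERIES = 4      # rule from the text: at most 4 transistors in series

# Inverter drive-strength multipliers; xL is omitted because its scale
# factor is not stated in the text.
DRIVE_STRENGTHS = [1, 2, 3, 4, 8, 12, 16, 20]

TEMPLATE = """\
.SUBCKT INVX{x} A Y VDD VSS
MP Y A VDD VDD pch W={wp:.3f}u L={l:.2f}u
MN Y A VSS VSS nch W={wn:.3f}u L={l:.2f}u
.ENDS INVX{x}
"""


def inverter_netlist(x: int) -> str:
    """Scale both transistor widths by x while keeping the P/N ratio fixed."""
    wn = W_N_UNIT_UM * x
    wp = wn * PN_RATIO
    return TEMPLATE.format(x=x, wp=wp, wn=wn, l=L_UM)


def series_stack_ok(stack_depth: int) -> bool:
    """True if the longest series stack respects the 4-transistor limit."""
    return stack_depth <= MAX_SERIES


if __name__ == "__main__":
    for x in DRIVE_STRENGTHS:
        print(inverter_netlist(x))
```

Generating every size from one template is what keeps the P/N ratio identical across the family; only the template parameters would change for a different process.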

The cells of a standard cell library are categorized into seven groups as follows.

• Negative unate logic cells

• Positive unate logic cells

• Arithmetic cells

• Sequential cells

• Special cells

• Inverted input cells

• Low skew cells

Negative unate logic cells consist of the INV, NAND, NOR, AOI, OAI, and XNOR function families. Positive unate logic cells comprise the BUF, AND, OR, AO, OA, and XOR function families. The FADD (full adder) and HADD (half adder) function families make up the arithmetic cells. The DFF and LATCH function families are included in the sequential cells. MUX and tri-state cells belong to the special cells. There are two more interesting groups: inverted input cells and low skew cells. Inverted input cells such as NAND2B and NOR3BB ('B' denotes an inverted input) provide inverted inputs, effectively making an internal connection between the inverter and the logic function. Low skew cells are used in clock distribution schemes where low skew and high speed are the primary concerns. The CLKINV and CLKBUF families belong to the low skew cells.
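Unateness can be checked mechanically from a truth table: a function is positive unate in an input if raising that input never lowers the output, and negative unate if raising it never raises the output. The sketch below is illustrative; the helper names and the pin order assumed for NAND2B are hypothetical. Under this strict definition XOR and XNOR are binate (neither positive nor negative unate), so the grouping above presumably places those two families by output polarity rather than by strict unateness.

```python
from itertools import product
from typing import Callable, List

BoolFn = Callable[..., int]


def pin_unateness(fn: BoolFn, n_inputs: int, pin: int) -> str:
    """Classify fn in one input as positive, negative, binate, or independent."""
    rises = falls = False
    for others in product((0, 1), repeat=n_inputs - 1):
        low = list(others)
        low.insert(pin, 0)
        high = list(others)
        high.insert(pin, 1)
        delta = fn(*high) - fn(*low)
        if delta > 0:
            rises = True
        elif delta < 0:
            falls = True
    if rises and falls:
        return "binate"
    if rises:
        return "positive"
    if falls:
        return "negative"
    return "independent"


def cell_unateness(fn: BoolFn, n_inputs: int) -> List[str]:
    """Per-pin unateness of a cell function."""
    return [pin_unateness(fn, n_inputs, p) for p in range(n_inputs)]


# Example cell functions with 0/1 integer outputs.
def nand2(a: int, b: int) -> int:
    return 1 - (a & b)


def aoi21(a: int, b: int, c: int) -> int:
    return 1 - ((a & b) | c)


def and2(a: int, b: int) -> int:
    return a & b


def xor2(a: int, b: int) -> int:
    return a ^ b


def nand2b(an: int, b: int) -> int:
    # Hypothetical pin order: the first input is the inverted one.
    return 1 - ((1 - an) & b)


print(cell_unateness(aoi21, 3), cell_unateness(xor2, 2))
```

The NAND2B case shows why inverted input cells form a separate group: each pin is still unate, but with opposite polarities.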

2.2 LAYOUT

As shown in Fig. 3, the layout phase consists of physical layout generation, layout verification, and parasitic extraction.

Fig. 3 Layout step

ProTech, ProSpin, ProGen, and ProSticks from Prolific are used to generate the physical layout from the SPICE netlist. ProTech configures the fabrication technology specification, design styles, cell template, and layer information. ProSpin reads in SPICE netlists describing individual cells and produces the corresponding .cel and .db files, which are read by ProGen and ProSticks, respectively. The .cel file specifies the generators that should be called to create the final layout data and contains cell-specific information such as transistor sizes and node names. The .db file is used to generate the symbolic layout used by ProSticks. ProGen reads in the .cel file and produces the physical layout: it invokes the proper generators to produce a loose physical layout and then compacts this initial layout to produce a final cell that is as small as possible [8]. Fig. 4 shows an example of a layout generated by the Prolific suite.

Fig. 4 AO21X2 layout

Although ProGen tries to satisfy all design rules and layout constraints, it sometimes generates a cell with violations such as a cell height violation or design rule errors. Verification is therefore an essential part of layout generation to ensure a correct layout. Calibre from Mentor Graphics is used for accurate verification; it performs layout versus schematic (LVS) checking as well as design rule checking (DRC).

Fig. 5 Symbolic AO21X2 layout

Once design violations are found, ProSticks can be used to fix them. ProSticks provides a graphical user interface for the symbolic layout; Fig. 5 shows an example of a symbolic layout. Simple routing rearrangement and transistor relocation on the symbolic layout can correct most errors in simple cells. For complex cells such as MUX and ADDER, however, a height violation is hard to fix. Special options such as poly contact merging and diffusion/metal-1 contact wires are used to obtain more routing space. Aggressive poly contact merging eliminates unnecessary metal-1 wires between poly contacts. Diffusion nodes can be connected to the power or ground rails with a diffusion/metal-1 contact wire, which allows other metal-1 wires to pass over the active region.

The last step of the layout phase is parasitic extraction, the process of creating an electrical model of the physical interconnections. Physical interconnect does not behave as an ideal wire; instead, it behaves like a network of resistances, capacitances, and inductances, which can dominate circuit behavior. Power or timing analysis of a design is of little value without the parasitic network. Xcalibre from Mentor Graphics is used for parasitic extraction. The original SPICE netlists combined with the parasitic network are used for characterization simulation.
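To illustrate why the extracted RC network matters, the sketch below compares the Elmore delay of a uniform RC ladder (a standard first-order delay metric for RC trees, not the algorithm used by Xcalibre, and one that ignores inductance) with an estimate that lumps all wire capacitance at the driver and ignores wire resistance. The driver and segment values are hypothetical.

```python
def elmore_far_end(r_drv: float, r_seg: float, c_seg: float,
                   n_seg: int, c_load: float) -> float:
    """Elmore delay (s) at the far end of an n-segment RC ladder.

    Each segment is a series resistor r_seg followed by a shunt capacitor
    c_seg; every resistor is weighted by the capacitance downstream of it.
    """
    total_c = n_seg * c_seg + c_load
    delay = r_drv * total_c                       # driver sees all capacitance
    for k in range(1, n_seg + 1):
        downstream_c = (n_seg - k + 1) * c_seg + c_load
        delay += r_seg * downstream_c
    return delay


def lumped_estimate(r_drv: float, c_seg: float, n_seg: int, c_load: float) -> float:
    """Single-capacitance estimate: wire resistance ignored."""
    return r_drv * (n_seg * c_seg + c_load)


# Hypothetical values: 1 kOhm driver, 10 segments of 50 Ohm / 2 fF, 5 fF load.
elmore = elmore_far_end(1e3, 50.0, 2e-15, 10, 5e-15)
lumped = lumped_estimate(1e3, 2e-15, 10, 5e-15)
print(f"Elmore: {elmore * 1e12:.1f} ps, lumped: {lumped * 1e12:.1f} ps")
```

With these values the Elmore estimate is 33 ps against 25 ps for the lumped estimate; the difference is entirely due to the wire resistance that the lumped model discards.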




2.3 CHARACTERIZATION

Characterization enables a designer to abstract timing and power models. This abstraction moves a design from the transistor level to the gate level, enabling synthesis, floor planning, place & route, and delay and power calculation [9].

Fig. 6 Characterization in terms of input slew rate and load capacitance

As shown in Fig. 6, a cell is usually characterized in terms of the input slew rate $t_{in}$ and the output capacitive load $C_L$ for given supply voltage, temperature, and process values. This traditional approach is challenged in the era of deep sub-micron processes, because a single capacitance is no longer enough to represent a load that acts like an RC network. Nevertheless, the auto-characterization tool is implemented based on the traditional definition. Another concern in cell characterization is stimulus generation. Current standard cell libraries can handle only a single switching input; that is, they cannot deal with the multiple switching inputs frequently encountered in real operation. This is another shortcoming of cell characterization. Although an exhaustive enumeration of all input states satisfies the stimulus requirement, it wastes simulation time. Therefore, selecting a minimal set of input vectors is an important means of reducing simulation time.
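For single-input switching, one way to build a small stimulus set is to keep only the side-input assignments under which toggling the switching pin actually toggles the output. The sketch below finds them by brute force over the truth table (cheap for the small input counts of standard cells) and keeps one assignment per pin; the helper names are hypothetical and this is not the tool's actual algorithm. Toggling the pin up and then down under each kept assignment exercises both the rising and the falling arc. Keeping only one side-input state is a simplification, since delay can depend on which sensitizing state is used.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def sensitizing_vectors(fn: Callable[..., int], n_inputs: int,
                        pin: int) -> List[Tuple[int, ...]]:
    """Side-input assignments under which toggling `pin` toggles the output."""
    vectors = []
    for others in product((0, 1), repeat=n_inputs - 1):
        low = list(others)
        low.insert(pin, 0)
        high = list(others)
        high.insert(pin, 1)
        if fn(*low) != fn(*high):
            vectors.append(others)
    return vectors


def minimal_stimulus(fn: Callable[..., int],
                     n_inputs: int) -> Dict[int, Tuple[int, ...]]:
    """Keep one sensitizing side-input assignment per switching pin."""
    stimulus = {}
    for pin in range(n_inputs):
        vecs = sensitizing_vectors(fn, n_inputs, pin)
        if vecs:
            stimulus[pin] = vecs[0]
    return stimulus


def aoi21(a: int, b: int, c: int) -> int:
    """Example cell function: Y = !((A & B) | C)."""
    return 1 - ((a & b) | c)


print(minimal_stimulus(aoi21, 3))
```

For AOI21, pin A is sensitized only when B = 1 and C = 0, while pin C has three sensitizing states, of which one is kept.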

The characterized cells share several common attributes, such as output transition times, propagation delays, internal switching power, leakage power, input pin capacitances, and cell area. Sequential cells additionally require characterization of relative signals, that is, signals whose timing is critical with respect to another signal's state. Relative signal constraints include setup, hold, recovery, and removal times [6].

2.3.1 Timing characterization

Timing attributes such as output transition time, propagation time, and the setup, hold, recovery, and removal constraints are modeled for timing characterization.

Both output transition time and propagation time are required by gate-level synthesis and delay calculation tools [9]. In those tools, the output transition time (together with interconnect parasitic data, if available) is used to estimate the input slew rates of successive cells, and the propagation time of each cell is looked up in the cell library table based on input slew rate and load. Consequently, the delay between two nodes in a design can be calculated from the propagation time and output transition time of each cell on the path between the two nodes.
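A minimal sketch of that lookup, assuming each cell provides two-dimensional delay and output transition tables indexed by input slew and output load (in the style of a nonlinear delay model) and that values between grid points are found by bilinear interpolation. The class and function names and all table values are hypothetical, and interconnect delay is ignored.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Sequence, Tuple


def _bracket(axis: Sequence[float], x: float) -> int:
    """Index i of the grid interval [axis[i], axis[i+1]] used for x (clamped)."""
    i = bisect_right(axis, x) - 1
    return max(0, min(i, len(axis) - 2))


def bilinear(slews: Sequence[float], loads: Sequence[float],
             table: Sequence[Sequence[float]], slew: float, load: float) -> float:
    """Bilinear interpolation (linear extrapolation outside the grid)."""
    i, j = _bracket(slews, slew), _bracket(loads, load)
    u = (slew - slews[i]) / (slews[i + 1] - slews[i])
    v = (load - loads[j]) / (loads[j + 1] - loads[j])
    return ((1 - u) * (1 - v) * table[i][j] + u * (1 - v) * table[i + 1][j]
            + (1 - u) * v * table[i][j + 1] + u * v * table[i + 1][j + 1])


@dataclass
class CellTiming:
    slews: List[float]             # input slew axis (ns)
    loads: List[float]             # output load axis (pF)
    delay: List[List[float]]       # propagation time [slew][load] (ns)
    transition: List[List[float]]  # output transition [slew][load] (ns)


def path_delay(stages: Sequence[Tuple[CellTiming, float]],
               input_slew: float) -> Tuple[float, float]:
    """Sum stage delays, feeding each output transition forward as the next input slew."""
    total, slew = 0.0, input_slew
    for cell, load in stages:
        total += bilinear(cell.slews, cell.loads, cell.delay, slew, load)
        slew = bilinear(cell.slews, cell.loads, cell.transition, slew, load)
    return total, slew


# Hypothetical 2x2 tables for one cell type.
inv = CellTiming(slews=[0.1, 0.5], loads=[0.01, 0.05],
                 delay=[[0.10, 0.20], [0.15, 0.25]],
                 transition=[[0.08, 0.30], [0.12, 0.34]])
print(path_delay([(inv, 0.03), (inv, 0.03)], input_slew=0.3))
```

The second stage sees the first stage's output transition as its input slew, which is exactly how the output transition time propagates along a path.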

Since a signal arrives at an input pin with a finite ramp time and a sequential cell needs some time to latch the data signal correctly, the data signal arriving at a sequential cell has to be stable before and after clocking; the setup constraint and the hold constraint define these intervals, respectively. For edge-triggered sequential cells, the setup and hold constraints are measured with respect to the clock transition to the active state. If an unstable data signal arrives at an input near this transition, the edge-triggered sequential cell may evaluate a wrong output or go through a metastable state, and the unintended data is held until the next clock transition to the active state. For level-sensitive sequential cells, on the other hand, the setup and hold constraints are measured with respect to the clock transition to the inactive state. If an unstable data signal arrives at an input near this transition, the unintended data may be held while the clock is in the inactive state, as in the case of the edge-triggered sequential cell. In a design with sequential cells, both attributes constrain the delay of the combinational circuits placed between two sequential cells for a given operating frequency. Fig. 7 shows a consecutive pipeline stage formed by two edge-triggered sequential cells, comprising a long path and a short path [10].

Fig. 7 Consecutive pipeline stage formed by two edge-triggered sequential cells




On the long path, the maximum time available for evaluating the combinational logic in one clock period, $t_{CLmax}$, is given by

$$t_{CLmax} = t_{CYeff} - (t_{SU2} + t_{CQ1})$$

where $t_{CYeff}$ is the effective cycle time, $t_{SU2}$ is the setup constraint of the second sequential cell, and $t_{CQ1}$ is the clock-to-Q propagation time of the first sequential cell.

On the short path, if the next state value from the second sequential cell reaches the first sequential cell during the hold time of the first sequential cell, the next state value will corrupt the current state value of the first sequential cell. The minimum propagation time through the combinational logic on the short path, $t_{CLmin}$, is expressed by

$$t_{CLmin} = t_{SK} + (t_{H1} - t_{CQ2})$$

where $t_{SK}$ is the clock skew between the clocking of the two sequential cells, $t_{H1}$ is the hold constraint of the first sequential cell, and $t_{CQ2}$ is the clock-to-Q propagation time of the second sequential cell.
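As a worked example of the two expressions, the sketch below computes the window $[t_{CLmin}, t_{CLmax}]$ and checks whether a combinational path delay falls inside it. All numbers are hypothetical and in nanoseconds; a negative $t_{CLmin}$ simply means that any path satisfies the short-path condition.

```python
from typing import Tuple


def combinational_window(t_cy_eff: float, t_su2: float, t_cq1: float,
                         t_sk: float, t_h1: float,
                         t_cq2: float) -> Tuple[float, float]:
    """Return (t_CLmin, t_CLmax) from the short- and long-path expressions."""
    t_cl_max = t_cy_eff - (t_su2 + t_cq1)
    t_cl_min = t_sk + (t_h1 - t_cq2)
    return t_cl_min, t_cl_max


def path_fits(delay: float, window: Tuple[float, float]) -> bool:
    """A path must be neither faster than t_CLmin nor slower than t_CLmax."""
    t_cl_min, t_cl_max = window
    return t_cl_min <= delay <= t_cl_max


# Hypothetical values in ns.
window = combinational_window(t_cy_eff=5.0, t_su2=0.2, t_cq1=0.3,
                              t_sk=0.1, t_h1=0.15, t_cq2=0.3)
print(path_fits(4.2, window), path_fits(4.7, window))  # True False
```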

Recovery and removal constraints describe the timing requirements on control signals, such as preset or clear, with respect to the clock signal. A sequential cell needs some time to escape the influence of a control signal after the control signal becomes inactive; this time is referred to as the recovery constraint. Therefore, the control signal should become inactive at least one recovery constraint before the clock edge to ensure that the clocking takes effect. The removal constraint, on the other hand, is the minimum time, measured from the active clock edge, for which the control signal must remain active to influence the latched value. If the control signal becomes inactive before the removal constraint has elapsed, it will not affect the operation of the sequential cell.

In the simulations for the timing attributes mentioned above, the minimal stimulus comprises input vectors that cause an output transition, although the necessary input vectors differ for each attribute. Input vectors can be subdivided into data signals, control signals, and the clock. The stimulus for each attribute is described in more detail in the following sections.

Output transition / Propagation time

As shown in Fig. 8, the output transition time is measured between two predetermined edge threshold values of the output signal (e.g., from 10% to 90% of the voltage range). The propagation time is measured between a predetermined delay threshold value of the input signal and that of the output signal (e.g., from 50% of the voltage range of the input signal to 50% of the voltage range of the output signal).

A data transition that can cause an output transition is the required stimulus condition for both attributes. In sequential cells, the clock-to-Q propagation time is obtained from the clock transition. In level-sensitive sequential cells such as LATCH, the D-to-Q propagation time can also be obtained with the clock held in the active state.

Fig. 8 Output transition and propagation time
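The threshold measurements of Fig. 8 can be sketched on sampled waveforms by locating each threshold crossing with linear interpolation between samples. The function names are hypothetical and the waveforms are idealized ramps rather than simulator output; the 10%/90% edge thresholds and 50% delay threshold follow the examples given above.

```python
from typing import Sequence


def crossing_time(t: Sequence[float], v: Sequence[float],
                  level: float, rising: bool) -> float:
    """First time the sampled waveform crosses `level` in the given direction."""
    for k in range(1, len(t)):
        v0, v1 = v[k - 1], v[k]
        crossed = (v0 < level <= v1) if rising else (v0 > level >= v1)
        if crossed:
            # Linear interpolation between the two samples around the crossing.
            return t[k - 1] + (level - v0) * (t[k] - t[k - 1]) / (v1 - v0)
    raise ValueError("waveform never crosses the threshold")


def transition_time(t, v, vdd, rising, lo=0.1, hi=0.9):
    """Time between the lo and hi edge thresholds (fractions of vdd)."""
    first, second = (lo, hi) if rising else (hi, lo)
    return (crossing_time(t, v, second * vdd, rising)
            - crossing_time(t, v, first * vdd, rising))


def propagation_time(t, v_in, v_out, vdd, in_rising, out_rising, th=0.5):
    """Time between the input and output crossings of the delay threshold."""
    return (crossing_time(t, v_out, th * vdd, out_rising)
            - crossing_time(t, v_in, th * vdd, in_rising))


# Idealized inverter waveforms (hypothetical, ns and V): the input ramps
# 0 -> 1.8 V over 0-1 ns and the output falls 1.8 -> 0 V over 0.5-1.5 ns.
t = [0.0, 0.5, 1.0, 1.5, 2.0]
v_in = [0.0, 0.9, 1.8, 1.8, 1.8]
v_out = [1.8, 1.8, 0.9, 0.0, 0.0]
print(propagation_time(t, v_in, v_out, 1.8, in_rising=True, out_rising=False))
print(transition_time(t, v_out, 1.8, rising=False))
```

For these ramps the 50%-to-50% propagation time is 0.5 ns and the 90%-to-10% output transition time is 0.8 ns.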

Setup / Hold constraint

The setup constraint is generally defined as the minimum time allowed between the arrival of the data and the transition of the clock signal. If the data signal makes a transition during the setup time, an incorrect value may be latched [11, 12]. This definition, however, needs to be modified to improve the delay performance of a synthesized design that uses sequential cells, because the minimum setup constraint tends to cause a long clock-to-Q propagation time. Therefore, one more condition can be added to the setup definition: the setup constraint must not degrade the clock-to-Q propagation time by more than a predetermined tolerance (e.g., 5% of the clock-to-Q propagation time obtained when the time between the data arrival and the clock transition is sufficiently long).

The hold constraint describes the minimum time for which the data must remain stable after the transition of the clock signal. If the data signal makes a transition during the hold time, an incorrect value may be latched [11, 12].

Data that can change the output state is the required stimulus condition. For a cell with a control input, the control signal should be held in the inactive state.




Fig. 9 illustrates the setup and hold constraints of an edge-triggered sequential cell with an active-high clock. Fig. 10 illustrates the setup and hold constraints of a level-sensitive sequential cell with an active-high clock.

Fig. 9 Setup and hold constraints of an edge-triggered sequential cell

Fig. 10 Setup and hold constraints of a level-sensitive sequential cell

Recovery / Removal constraint

The recovery constraint describes the minimum allowable time between the transition of the control pin to the inactive state and the active edge of the synchronous clock signal [12]. As with the setup time, the tolerance condition is added to the definition of the recovery constraint.

The removal constraint describes the minimum allowable time between the active edge of the clock pin, while the asynchronous control pin is active, and the inactive edge of the asynchronous control pin [12].

Data that can change the preset/clear value and a control transition to the inactive state are the required stimulus conditions. The only difference when measuring the recovery and removal constraints of a level-sensitive cell is that a clock transition to the inactive state is the required stimulus condition, whereas a clock transition to the active state is required for an edge-triggered sequential cell.

Fig. 11 illustrates the recovery and removal constraints of an edge-triggered sequential cell with an active-high clock and an active-low control.

Fig. 11 Recovery and removal constraints of an edge-triggered sequential cell

Bisection method

To characterize relative signals, the bisection method is used. Bisection is an optimization method that employs a binary search to find the value of an input variable associated with a "goal" value of an output variable. The search locates the goal value within a search range of the input variable by iteratively halving that range, converging rapidly on the target value; in every iteration, the measured value of the output variable is compared with the goal value [13].

Fig. 12 Setup constraint search using bisection

Fig. 12 shows how the setup constraint is determined with the bisection method, where the goals are a correct output transition and the allowable clock-to-Q propagation time. To start the binary search, a lower boundary and an upper boundary are specified. Data transition 1 at the lower boundary is early enough to produce a good output signal, whereas data transition 2 at the upper boundary is too late to change the output signal. The candidate for the setup constraint therefore lies between the lower and upper boundaries, so the bisection algorithm tests a data transition at the midpoint between them. Data transition 3 at the midpoint changes the output signal but causes a long propagation time; that is, data transition 3 does not meet the goals, so the algorithm sets the midpoint as the new upper boundary. Given the new range, the algorithm tests a data transition at the new midpoint. If the output satisfies the goals, the midpoint becomes the new lower boundary; otherwise, it becomes the new upper boundary. The algorithm keeps setting new boundaries and midpoints until the binary search reaches a termination criterion. Data transition 4 is the latest data transition that satisfies the goals. Therefore, the setup constraint $t_{SU}$ is given by

$$t_{SU} = t_2 - t_1$$
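The search can be sketched as follows. The sketch assumes a hypothetical `simulate(t_data)` callback standing in for one SPICE run, returning whether the output switched correctly and the resulting clock-to-Q delay; here it is replaced by a toy analytic flip-flop model, so the numbers are illustrative only. The goal test combines a correct output transition with the 5% clock-to-Q tolerance described earlier, and the result is reported as the clock edge time minus the latest passing data arrival time, which is assumed to correspond to $t_2 - t_1$ in Fig. 12.

```python
from typing import Callable, Tuple

# (output switched correctly, clock-to-Q delay) for one data arrival time
SimResult = Tuple[bool, float]


def bisect_setup(simulate: Callable[[float], SimResult], t_clk: float,
                 t_lower: float, t_upper: float, tcq_nominal: float,
                 tol: float = 0.05, resolution: float = 1e-4) -> float:
    """Binary search for the latest data arrival time that meets the goals.

    t_lower must pass (data early enough); t_upper must fail (data too late).
    """
    def meets_goals(t_data: float) -> bool:
        switched, tcq = simulate(t_data)
        return switched and tcq <= (1.0 + tol) * tcq_nominal

    if not meets_goals(t_lower) or meets_goals(t_upper):
        raise ValueError("search boundaries do not bracket the pass/fail point")
    while t_upper - t_lower > resolution:   # termination criterion
        mid = 0.5 * (t_lower + t_upper)
        if meets_goals(mid):
            t_lower = mid                   # passing mid-point: new lower boundary
        else:
            t_upper = mid                   # failing mid-point: new upper boundary
    return t_clk - t_lower


def toy_flop(t_data: float, t_clk: float = 0.0, tcq0: float = 0.20,
             m_fail: float = 0.05, k: float = 0.001) -> SimResult:
    """Hypothetical stand-in for a SPICE run (times in ns).

    The output fails to switch if data arrives less than m_fail before the
    clock; otherwise clock-to-Q grows as the setup margin shrinks.
    """
    margin = t_clk - t_data
    if margin <= m_fail:
        return False, float("inf")
    return True, tcq0 + k / (margin - m_fail)


setup = bisect_setup(toy_flop, t_clk=0.0, t_lower=-2.0, t_upper=0.0,
                     tcq_nominal=0.20)
print(f"setup constraint ~ {setup:.3f} ns")
```

For this toy model the search converges to about 0.15 ns, the margin at which the clock-to-Q delay reaches 5% above its 0.20 ns nominal value; a margin of only 0.05 ns would still switch the output but with a much longer clock-to-Q delay, which is why the tolerance condition matters.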

2.3.2 Power characterization

Power dissipation can be divided into static power and dynamic power, as shown in Fig. 13. Static power is the power dissipated when the cell is stable, that is, when there is no signal transition on any input or output of the cell. Static power is dissipated in a number of ways. The largest component of static power results from source-to-drain subthreshold leakage, which is caused by a reduced threshold voltage that prevents the transistor from turning off completely. Static power dissipation also occurs when current leaks between the diffusion layers and the substrate. For this reason, static power is often called leakage power [12].

Dynamic power is the power dissipated when a circuit is active. It is divided into switching power and internal power. Switching power results from charging and discharging the load capacitance. Because switching power is calculated by a gate-level power analysis tool once the interconnect parasitics are known [9], it is excluded from the power characterization of cells. While input or output signals switch, power is also dissipated by charging and discharging internal capacitances and by short-circuit current. Since this power is dissipated inside the cell during signal switching, it is called internal power.

Power annotated library

PowerArc from Synopsys is used to generate a power-annotated library from a library containing no power data. This tool automatically computes the stimulus for power characterization and runs the power simulations. PowerArc does not distinguish between rising and falling output transitions when calculating internal power [14]. In both output transition cases, the internal power dissipation (strictly, an energy per transition) is given by

$$P = \int_{t_1}^{t_2} I(t)\,V_{dd}\,dt - \frac{1}{2} C_L V_{dd}^2$$

where $I(t)$ is the current at the power node, $V_{dd}$ is the supply voltage, $C_L$ is the load capacitance, and the interval from $t_1$ to $t_2$ spans the switching event.

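Consequently, it is not surprising that negative power values appear in the power-annotated library generated by PowerArc. Internal power is overestimated for a rising output transition, where the supply delivers $C_L V_{dd}^2$ to charge the load but only half of it is subtracted, and underestimated for a falling output transition, where the supply delivers almost no charge to the load, so subtracting $\frac{1}{2} C_L V_{dd}^2$ can make the result negative. Nevertheless, these internal power values are acceptable if internal power dissipation is considered over a given period, because the errors of the rising and falling transitions cancel.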
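A small numerical sketch of the expression above: it integrates a sampled supply current with the trapezoidal rule and subtracts $\frac{1}{2} C_L V_{dd}^2$, giving an energy per transition. The triangular current pulses and all values are hypothetical and idealized: the rising output draws the load charge $C_L V_{dd}$ plus a small internal charge, and the falling output draws only the internal charge. The example shows the inflated rising value, the negative falling value, and a correct sum over the pair.

```python
from typing import List, Sequence, Tuple


def trapezoid(t: Sequence[float], y: Sequence[float]) -> float:
    """Trapezoidal-rule integral of sampled y(t)."""
    return sum(0.5 * (t[k] - t[k - 1]) * (y[k] + y[k - 1])
               for k in range(1, len(t)))


def internal_energy(t: Sequence[float], i_supply: Sequence[float],
                    vdd: float, c_load: float) -> float:
    """Integral of I(t)*Vdd minus 0.5*C_L*Vdd^2, in joules per transition."""
    return trapezoid(t, [i * vdd for i in i_supply]) - 0.5 * c_load * vdd ** 2


def triangle_pulse(charge: float, width: float) -> Tuple[List[float], List[float]]:
    """Hypothetical triangular supply-current pulse delivering `charge` coulombs."""
    peak = 2.0 * charge / width
    return [0.0, width / 2, width], [0.0, peak, 0.0]


VDD, C_L, Q_INTERNAL = 1.8, 10e-15, 2e-15   # hypothetical values
# Rising output: the supply delivers the load charge C_L*Vdd plus internal charge.
rise = internal_energy(*triangle_pulse(C_L * VDD + Q_INTERNAL, 100e-12), VDD, C_L)
# Falling output: the load discharges to ground, so only internal charge is drawn.
fall = internal_energy(*triangle_pulse(Q_INTERNAL, 100e-12), VDD, C_L)
print(rise, fall, rise + fall)
```

Here the rising value is 19.8 fJ and the falling value is -12.6 fJ, while their sum, 7.2 fJ, equals twice the true internal energy of 3.6 fJ per transition.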

Fig. 13 Power dissipation: (a) static power (leakage current I_leakage); (b) dynamic power, rising output; (c) dynamic power, falling output (short-circuit current I_SC, internal-node current I_inter-node charging C_inter-node, and load current I_CL charging C_L)




2.3.3 Input capacitance and cell area

Input capacitance is used as part of the output load of the preceding cell when switching power is calculated by a gate-level power analysis tool or when path delay is calculated by a timing analysis tool. The input capacitance value is extracted from the parasitic file of each cell created during the layout phase.

The cell area attribute is used to estimate the total cell area. The cell area value is calculated from the layout during the layout phase and stored in temporary files; in the characterization phase, that value is inserted into the cell area table of the cell library.

2.4 PHYSICAL ABSTRACTION AND REMAINING STEPS

The individual cells of the cell library are described in the GDSII layout format. For place & route tools to refer to the cells in a cell library, physical abstracts of the cell layouts are needed. The physical abstracts contain information about blockage layers, pin locations, and cell symmetry. Envisia Abstract Generator from Cadence is used to create physical abstract views for all cells. The physical abstracts are exported in Library Exchange Format (*.lef) and used in Cadence place & route tools.

Once the characterization step is complete, a '*.lib' file in ASCII text format is created. This file is compiled into a '*.db' file in Synopsys database format using Library Compiler from Synopsys. The '*.lib' file is also compiled into a '*.tlf' file in Timing Library Format using syn2tlf from Cadence. The '*.tlf' file is used for timing-driven placement and routing in the Cadence place & route tool.
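As an illustration of what the characterization step writes into the '*.lib' file, the sketch below formats one two-dimensional lookup table group (such as `cell_rise`) in Liberty syntax. The template name `delay_template_2x2` is hypothetical; the sketch assumes that a matching `lu_table_template`, declared elsewhere in the library, maps `index_1` to input transition and `index_2` to output load. The table values are placeholders, not characterized data.

```python
from typing import Sequence


def _fmt(xs: Sequence[float]) -> str:
    """Comma-separated numbers with fixed precision, as used in Liberty strings."""
    return ", ".join(f"{x:.4f}" for x in xs)


def liberty_table(group: str, template: str, index_1: Sequence[float],
                  index_2: Sequence[float], values: Sequence[Sequence[float]],
                  indent: str = "  ") -> str:
    """Format a 2-D lookup-table group (e.g. cell_rise) in Liberty syntax."""
    rows = ", ".join(f'"{_fmt(row)}"' for row in values)
    return (f"{indent}{group} ({template}) {{\n"
            f'{indent}  index_1 ("{_fmt(index_1)}");\n'
            f'{indent}  index_2 ("{_fmt(index_2)}");\n'
            f"{indent}  values ({rows});\n"
            f"{indent}}}\n")


# Placeholder data: index_1 = input transition (ns), index_2 = output load (pF).
table = liberty_table("cell_rise", "delay_template_2x2",
                      [0.1, 0.5], [0.01, 0.05],
                      [[0.10, 0.20], [0.15, 0.25]])
print(table)
```

Each row of `values` corresponds to one `index_1` entry, which is the same [slew][load] ordering a delay calculator reads back when it looks up propagation time.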

3. RELATED STANDARD CELL LIBRARY STUDY

In the standard cell design style, the target cell libraries as well as the design flow affect the quality of the final circuit. Synthesis techniques typically optimize a circuit in two phases: logic minimization and library mapping [15, 16, 17]. During the library-mapping phase, the synthesis tool chooses the structure and size of each gate from the target cell libraries. Clearly, the limited set of sizes available for each gate in the target cell libraries prevents good solutions. To provide finer granularity of gate sizes, transistor-level resynthesis [2] and the fluid cell library [3] were proposed, and both methods achieved performance improvements.

Another possible approach to improving the quality of the final circuit is to choose the logic functions in the target cell library wisely. To examine how each logic function group described in Section 2.1 affects circuit quality and how the synthesis tool picks cells to satisfy constraints, a set of benchmark circuits is synthesized, placed & routed, and resynthesized using the libraries shown in Table 1. These libraries are formed by selectively choosing logic function groups from the Artisan standard cell library in TSMC 0.18-micron technology.

Function group     Number of libraries (Lib 1 to Lib 8) that include the group
Tri-state          8
Sequential         8
Negative unate     8
Positive unate     6
Inverted input     4
Arithmetic         3
Mux                3
Low skew           3

Table 1 Overview of cell libraries

3.1 Design Flow

Fig. 15 Standard cell design flow




As shown in Fig. 15, initial synthesis and place & route are carried out to obtain the initial layout. A more accurate wire load model is then extracted from the initial layout, and the cell library is updated with the extracted wire load model. With the updated cell library, synthesis and place & route are performed again to regenerate the layout. Parasitics extracted from this layout are used as input to a re-optimization of cell sizes, during which cells with new sizes replace the cells chosen at the synthesis step. For input to timing and power analysis, the resynthesized circuit is placed & routed again, and parasitics are extracted from the final layout.

Design Compiler from Synopsys is used for synthesis. Silicon Ensemble from Cadence is used for place & route and clock tree generation. HyperExtract from Cadence is used for wire load model extraction and parasitic extraction. Library Compiler from Synopsys is used to update the cell library with the extracted wire load model. PrimeTime from Synopsys is used for timing analysis, and NanoSim from Synopsys is used for power analysis.

3.2 Benchmark circuits

Four benchmark circuits were used for the experiments. To observe how each library affects datapath-dominated and controller-dominated circuits, two of them (VP2 and CMUDSP) were chosen from digital signal processors (DSPs) and the other two (GPIO and CAN) from controllers. Arithmetic functions are heavily used in DSPs but less frequently in controllers. Using the standard cell design flow described in Section 3.1, each benchmark circuit is designed with various target clocks and different libraries.

3.3 Results

Power (or area) vs. delay plots are an effective way to compare cell libraries, since what matters is how efficiently a particular delay is achieved [4]. To make these plots, the benchmark circuits are designed with each library over a range of target clocks, and power (or area) and delay are obtained through power and timing analysis. This study reveals several interesting phenomena. The largest library, lib 8, does not always produce the best result; instead, the best library depends on the operating speed of the circuit and on the kind of circuit being designed.

3.3.1 DSP circuits

Over the target clock range used in the examination (4 to 9 ns), the target clock vs. delay curves of the VP2 benchmark show a low plateau region in Fig. 16. In this low plateau region, the area of the circuits still increases while performance does not improve. Consequently, an overly small target clock degrades design quality. The lowest delay is obtained in the low plateau region. Therefore, the power (or area) vs. delay plots of the VP2 benchmark cover the low and middle delay regions. In contrast, the target clock vs. delay curves of the CMUDSP benchmark decrease gradually without a low plateau region, as shown in Fig. 17, which means that there is still room to reduce the delay of the CMUDSP benchmark. Therefore, the power (or area) vs. delay plots of CMUDSP cover the middle and high delay regions.

Fig. 16 Target clock vs. delay for VP2 (x-axis: target clock [ns]; y-axis: delay [ns]; curves for Libraries 1 to 8)

Fig. 17 Target clock vs. delay for CMUDSP (x-axis: target clock [ns]; y-axis: delay [ns]; curves for Libraries 1 to 8)

As shown in Figs. 18 and 19, lib 4, which includes complex cells such as arithmetic and mux cells, shows good area efficiency in the high delay region, while adding inverted-input cells to these complex cells (lib 5 and lib 8) somewhat degrades area efficiency in the high delay region. However, as the delay decreases, lib 4 tends to increase the area of the design quickly. Consequently, lib 4 yields worse area efficiency than lib 5 or lib 8 in the middle and low delay regions. This means that inverted-input cells are needed for good area efficiency at high speed.
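The trade-off between lib 4 and lib 5 (or lib 8) amounts to locating the delay at which two delay vs. area curves cross. A minimal sketch, assuming both curves are sampled at the same increasing delay points (a simplification; real sweeps rarely align), finds that crossover by linear interpolation. The function name and the area values are hypothetical and are not read from Fig. 18 or Fig. 19.

```python
from typing import Optional, Sequence


def crossover_delay(delays: Sequence[float],
                    area_a: Sequence[float],
                    area_b: Sequence[float]) -> Optional[float]:
    """Return the delay (ns) at which two delay vs. area curves cross, or None.

    Assumes both curves are sampled at the same, increasing delay points.
    A sign change of (area_a - area_b) between adjacent samples is located
    by linear interpolation.
    """
    diffs = [a - b for a, b in zip(area_a, area_b)]
    for i in range(len(diffs) - 1):
        d0, d1 = diffs[i], diffs[i + 1]
        if d0 == 0:
            return delays[i]  # exact tie at a sample point
        if d0 * d1 < 0:
            frac = d0 / (d0 - d1)  # fraction of the interval where diff hits 0
            return delays[i] + frac * (delays[i + 1] - delays[i])
    if diffs and diffs[-1] == 0:
        return delays[-1]
    return None


# Illustrative numbers only (not read from Fig. 18).
delays = [6, 7, 8, 9, 10]
lib4_area = [250000, 215000, 190000, 170000, 160000]
lib5_area = [235000, 210000, 192000, 175000, 166000]
print(round(crossover_delay(delays, lib4_area, lib5_area), 2))  # 7.71
```

With these made-up curves, the hypothetical lib 5 is smaller below about 7.7 ns and lib 4 is smaller above it, mirroring the qualitative behavior described in the text.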

Fig. 18 Delay vs. area curves for VP2 [plot: delay (ns) vs. cell area, Libraries 1-8]

Fig. 19 Delay vs. area curves for CMUDSP [plot: delay (ns) vs. cell area, Libraries 1-8]

As the delay approaches the lowest achievable delay, all libraries show similar area efficiency (Fig. 18). This can be explained by the decomposition of complex cells. As expected, the synthesis tool primarily increases the ratio of cells with higher drive strength in the circuit as the target clock frequency increases. When this simple upsizing reaches its limit, complex cells such as positive unate and arithmetic cells are additionally decomposed into relatively simple cells. These decompositions come at the cost of a sudden increase in the total cell count as well as in the total cell area, which reduces the area benefit of complex cells. Such decompositions of complex cells are observed in the VP2 benchmark: as the target clock changes from 9 ns to 8 ns, the total cell count increases from 4208 to 7148. Fig. 20 shows the complex cell count and Fig. 21 shows the relatively simple cell count at both target clocks (8 ns and 9 ns).
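For scale, the two VP2 cell counts quoted above can be turned into a relative increase. This is only a worked check of the arithmetic on the numbers in the text; the helper name `percent_increase` is hypothetical.

```python
def percent_increase(before: int, after: int) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# VP2 with lib 4: total cell count at the 9 ns and 8 ns target clocks (from the text).
count_9ns, count_8ns = 4208, 7148
print(f"{percent_increase(count_9ns, count_8ns):.1f}%")  # 69.9%
```

Tightening the target clock by 1 ns (about 11% of 9 ns) thus raises the cell count by roughly 70%, which illustrates how abrupt the decomposition step is.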

Fig. 20 Complex cell count of VP2 where lib 4 is used [bar chart: AND, ADD, XOR, MX cell counts at 8 ns and 9 ns target clocks]

Fig. 21 Simple cell count of VP2 where lib 4 is used [bar chart: INV, NOR, NAND, AOI, OAI, MXI, XNOR, OR cell counts at 8 ns and 9 ns target clocks]

In the low delay region, lib 1 produces the most power-efficient circuits (Fig. 22), while all libraries tend to build circuits of similar area efficiency. In the high delay region, on the other hand, power efficiency is similar across all libraries (Fig. 23). From this result, it is apparent that the power density of complex cells such as arithmetic and mux cells is higher than that of negative cells. Therefore, power efficiency should be the main consideration for high-speed designs, while area efficiency should be the main consideration for low-speed designs.
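The library-selection guideline in this paragraph can be written as a simple decision rule. The sketch below is hypothetical: each candidate library is summarized by an area and an average power measured at roughly the same delay, and the boundary `fast_delay_ns` between the high-speed and low-speed regions is a designer-chosen parameter that the text does not specify. The numbers are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LibResult:
    name: str        # library label, e.g. "lib 1"
    area: float      # total cell area of the synthesized circuit
    power_uw: float  # average power [uW]


def pick_library(results: List[LibResult], delay_ns: float,
                 fast_delay_ns: float) -> str:
    """Choose a library for a datapath circuit at one operating point.

    results: one entry per library, all synthesized to roughly delay_ns.
    fast_delay_ns: hypothetical boundary between the low-delay (high-speed)
    and high-delay (low-speed) regions.
    """
    if delay_ns <= fast_delay_ns:
        # High-speed region: areas tend to converge, so compare power.
        return min(results, key=lambda r: r.power_uw).name
    # Low-speed region: power tends to converge, so compare area.
    return min(results, key=lambda r: r.area).name


# Made-up values for illustration only.
candidates = [LibResult("lib 1", 200000, 40000), LibResult("lib 4", 180000, 45000)]
print(pick_library(candidates, delay_ns=6.5, fast_delay_ns=7.0))  # lib 1
print(pick_library(candidates, delay_ns=9.0, fast_delay_ns=7.0))  # lib 4
```

A single threshold is the simplest reading of the guideline; a weighted cost of area and power would be a natural refinement, but that is beyond what the text states.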

Fig. 22 Delay vs. power curves for VP2 [plot: delay (ns) vs. average power (uW), Libraries 1-8]



Fig. 23 Delay vs. power curves for CMUDSP [plot: delay (ns) vs. average power (uW), Libraries 1-8]

Another interesting observation is that the ratio of cells with the lowest drive strength (xL) increases as the operating frequency of the circuit increases (Fig. 24). This result indicates that the lowest-drive-strength cells are needed to form longer buffer trees, even though they have higher capacitance due to a larger active area than cells with unit drive strength (x1).
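Fig. 24 reports the share of xL cells in the synthesized circuit. A minimal way to compute such shares from a list of cell instance names is sketched below. It assumes a hypothetical naming convention in which every cell name ends with a drive-strength suffix (`XL`, `X1`, `X2`, ...), and that the instance names have already been extracted from the netlist; neither detail is given in the text.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Hypothetical naming convention: every cell name ends in a drive-strength
# suffix "X<n>" or "XL", e.g. "INVXL", "NAND2X1", "AOI21X2".
DRIVE_SUFFIX = re.compile(r"X(L|\d+)$")


def drive_strength_shares(cell_names: Iterable[str]) -> Dict[str, float]:
    """Percentage of cell instances for each drive strength (e.g. "xL", "x1")."""
    counts: Counter = Counter()
    for name in cell_names:
        match = DRIVE_SUFFIX.search(name.upper())
        key = "x" + match.group(1) if match else "unknown"
        counts[key] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: 100.0 * n / total for key, n in counts.items()}


cells = ["INVXL", "NAND2X1", "INVX1", "AOI21XL", "MX2X2"]
print(drive_strength_shares(cells)["xL"])  # 40.0
```

Repeating this for each point of a target-clock sweep would give one curve of the kind shown in Fig. 24.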

Fig. 24 % of cells with xL size for CMUDSP [plot: delay (ns) vs. percentage of xL cells, Libs 1-8]

3.3.2 Controller circuits

Fig. 25 Target clock vs. delay curves of CAN [plot: target clock (ns) vs. delay (ns), Libraries 1-8]

Over the target clock range used in the experiments (0.5~6 ns), the target clock vs. delay curves of the CAN benchmark show a high plateau region as well as a low plateau region (Fig. 25). Unlike in the DSP circuits, the available delay range is narrow. The target clock vs. delay curves of the GPIO benchmark also show a narrow delay range. Therefore, it is hard to distinguish delay regions in controller circuits. The ratios of cells with higher drive strength and with the lowest drive strength also increase as the clock becomes faster. Decomposition of positive cells is observed as well, but it does not increase the total cell count and area as much as the decomposition of arithmetic cells does. Within the narrow delay range, the power (or area) vs. delay curves are irregular and scattered, so it is hard to determine which library produces the best circuits.

4. CONCLUSIONS AND FUTURE WORK

An automatic cell library generation flow was developed to generate new cell libraries easily, without the inadvertent errors introduced when a cell library is generated manually. The parameterized cell methodology in the automatic flow makes it possible to generate a fluid cell library. Since the fluid cell library consists of custom cells with optimal sizes that can be generated once the actual wire load parasitics are known after placement, it is expected to attain power and performance close to those of custom designs.

This work can be extended to generate cell libraries for advanced processes. Although high-performance integrated circuits often use advanced process technology, designs based on standard cells use processes that lag by several generations because cell libraries for advanced processes are not available. A cell library in a Silicon-on-Insulator (SOI) process will be generated to support designs using the SOI process, and we will then examine how the SOI cell library improves the power and performance of circuits.

The study of cell library functional content shows that the largest library does not always produce the best result. For good overall quality of a datapath-dominated circuit, the best cell library should be chosen by considering area efficiency for low-speed circuits and power efficiency for high-speed circuits. For controller-dominated circuits, it is hard to determine which library produces the best circuits.




REFERENCES

[1] Ken Scott and Kurt Keutzer, "Improving Cell Library for Synthesis", Proc. of Custom Integrated Circuit Conference (CICC), pp. 128-131, 1994

[2] S. Gavrilov, A. Glebov, S. Pullela, S. C. Moore, A. Dharchoudhury, R. Panda, G. Vijayan, and D. T. Blaauw, "Library-Less Synthesis for Static CMOS Combinational Logic Circuits", Proc. IEEE Int. Conf. on Computer-Aided Design (ICCAD), pp. 658-662, 1997

[3] Gregory A. Northrop and Pong-Fei Lu, "A Semi-Custom Design Flow in High-Performance Microprocessor Design", Proc. of Design Automation Conference (DAC), pp. 426-431, 2001

[4] Miodrag Vujkovic and Carl Sechen, "Optimized Power-Delay Curve Generation for Standard Cell ICs", Proc. IEEE Int. Conf. on Computer-Aided Design (ICCAD), pp. 387-394, 2002

[5] K. Keutzer and E. Girczyc, "Panel: Cell libraries - build vs. buy; static vs. dynamic", Proc. of Design Automation Conference (DAC), pp. 341-342, 1999

[6] Teri Hike, McFaul and Karl Perrey, "Characterizing a Cell Library using ICCS", Proc. of ASIC Seminar and Exhibit, pp. p12/5.1-p12/5.4, 1990

[7] David S. Kung and Ruchir Puri, "Optimal P/N Width Ratio Selection for Standard Cell Libraries", Proc. IEEE Int. Conf. on Computer-Aided Design (ICCAD), pp. 178-184, 1999

[8] Prolific User Guide

[9] Binay Ackalloor and Dinesh Caitonde, "An Overview of Library Characterization in Semi-Custom Design", Proc. of Custom Integrated Circuit Conference (CICC), pp. 305-312, 1998

[10] Anantha Chandrakasan, William J. Bowhill, and Frank Fox, Ed., "Design of High-Performance Microprocessor Circuits", IEEE Press, New York, pp. 215-218, 2001

[11] Neil H. E. Weste and Kamran Eshraghian, Ed., "Principles of CMOS VLSI Design", Addison-Wesley, pp. 317-325, 1994

[12] Design Compiler User Manual

[13] Star-Hspice Manual

[14] PowerArc User Guide

[15] Eric Lehman, Yosinori Watanabe, Joel Grodstein, and Heather Harkness, "Logic Decomposition during Technology Mapping", Proc. IEEE Int. Conf. on Computer-Aided Design (ICCAD), pp. 242-245, 1995

[16] Chi-Ying Tsui, Massoud Pedram, and Alvin M. Despain, "Technology Decomposition and Mapping Targeting Low Power Dissipation", Proc. of Design Automation Conference (DAC), pp. 68-73, 1993

[17] Randal E. Bryant, "Graph-Based Algorithms for Boolean Function Manipulation", IEEE Transactions on Computers, pp. 677-699, 1985
