fpga2

8/3/2019 fpga2

1/14

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 1, JANUARY 2007 53

Dual-Threshold CAD Framework for SubthresholdLeakage Power Aware FPGAs

Akhilesh Kumar, Student Member, IEEE, and Mohab Anis, Member, IEEE

AbstractManaging leakage power in field programmable gatearrays (FPGAs) has become critical for the FPGA industry toremain competitive in the semiconductor market and enter themobile applications domain. This paper proposes and evaluatesseveral dual-Vt-based designs of FPGA architecture for reduc-ing the leakage power. A dual-Vt FPGA computer-aided design(CAD) framework has been proposed, which is used to developand evaluate different dual-Vt FPGA architectures. The logicelements and the routing resources are considered as candidatesfor dual-Vt assignment. The authors estimate the number of thelogic elements that can be assigned as high-Vt in the ideal caseby using a dual-Vt assignment algorithm in the CAD framework.Based upon this estimate, the authors develop and evaluate two

kinds of architectures, homogenous and heterogenous. The resultsindicate that an average leakage-power savings of up to 50% canbe obtained from these architectures. This CAD framework canalso be used for developing and evaluating different dual-Vt FPGAarchitectures other than the ones proposed in this paper.

Index TermsCMOS digital integrated circuits, delay estima-tion, design automation, field programmable gate arrays (FPGAs),leakage currents.

I. INTRODUCTION

LEAKAGE POWER has been recently recognized as a ma-

jor challenge in the field programmable gate array (FPGA)

industry. This was primarily because other design challenges,

such as performance and area, were given more attention,

but with technology scaling, leakage power has emerged as a

key design challenge [14]. The current-generation FPGAs are

being implemented in the 90-nm CMOS technology, which

necessitates devising techniques for leakage-power reduction.

For the FPGAs to continue retaining its semiconductor market

and competitive advantages over the high-performance custom

very large scale integration (VLSI) designs, the FPGA industry

should adopt new techniques for leakage-power reduction. The

work in [4] showed that a 90-nm FPGA consumes too much

leakage power to be successfully used in mobile applications.

Therefore, for FPGAs to gain popularity in the domains such

as wireless personal communication systems or in biomedicalapplications, the FPGAs need to implement techniques for

reducing leakage power.

The increase in complexity of the current-generation FPGAs

has resulted in more number of transistors in the FPGAs which

directly translate into increased leakage power. The resource

Manuscript received April 21, 2005; revised November 19, 2005 andFebruary 22, 2006. This work was supported by the Natural Sciences andEngineering Research Council of Canada. This paper was recommended byAssociate Editor F. N. Najm.

The authors are with the Department of Electrical and Computer Engineer-ing, University of Waterloo, Waterloo, ON N2L3G1, Canada.

Digital Object Identifier 10.1109/TCAD.2006.882595

utilization of FPGAs is just over 60%, and the unutilized

parts also consume leakage power, which means that reducing

leakage power is important both in the used and unused parts of

the FPGA.

This paper proposes the following for the design of a dual-Vt

FPGA for reducing subthreshold leakage power.

1) Dual-Vt FPGA architecture, which involves determining

the distribution for the high- and low-Vt devices.

2) Computer-aided design (CAD) framework for designing

the dual-Vt FPGA. This CAD framework is used for

determining the parameters of the dual-Vt FPGA design.

3) Algorithms associated with the CAD framework.4) Analysis of different dual-Vt FPGA architectures.

The remainder of this paper is organized as follows. In

Section II, leakage reduction and the previous work are dis-

cussed. Section III explains the static RAM (SRAM)-based

FPGA architecture that is used in this paper. In Section IV,

FPGA architectures based on dual-threshold-voltage method-

ology are proposed, and the proposed dual-Vt FPGA CAD

flow is discussed. Section V presents the implementation of the

proposed dual-Vt FPGA CAD flow. Section VI discusses the

results and the evaluation methodology. Section VII proposes

some guidelines for designing a dual-Vt FPGA. Section VIII

concludes the paper and outlines the future work.

II. LEAKAGE REDUCTION AND PREVIOUS WOR K

The subthreshold leakage current through a MOSFET can be

modeled as [9]

Isub = I0

1 exp

VdsVT

exp

Vgs Vt Voff

n VT

(1)

where I0 is a constant dependent upon device parameters fora given technology, VT is the thermal voltage, Voff is theoffset voltage which determines the channel current at Vgs = 0,

Vt is the threshold voltage, and n is the subthreshold swingparameter. Since the leakage current is exponentially depen-

dent on the threshold voltage, increasing the threshold voltage

would decrease the leakage current substantially. However, the

high-threshold-voltage devices have larger switching delays.

The various leakage-current mechanisms and some leakage-

reduction techniques for CMOS circuits were discussed in [2]

and [15]. The techniques for reducing leakage power involve

static and dynamic approaches. The dynamic approach involves

run-time decision making for leakage reduction. One such pop-

ular technique is the use of sleep transistors in multi-threshold

CMOS (MTCMOS) circuits for controlling the leakage power

[18]. Several optimizations for MTCMOS circuits involving

0278-0070/$25.00 2007 IEEE

8/3/2019 fpga2

2/14

54 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 1, JANUARY 2007

Fig. 1. Dual-Vt design implementation.

the use of sleep transistors have been proposed, such as in

[16] and [17]. However, this technique can reduce only the

standby leakage power and lead to performance degradation.

The static technique does not involve run-time decision making

for leakage control. The dual-threshold-voltage-design tech-

nique, which is a static approach, has been widely used in

the custom VLSI designs for reducing leakage power. The

dual-Vt implementation reduces both the active and standby

leakages. Further, there is no performance degradation in a

dual-Vt implementation for custom VLSI designs.

The dual-threshold-voltage-design technique uses two kinds

of transistors in the same circuit. Some transistors have a high

threshold voltage, while other transistors have a low threshold

voltage. The high-threshold-voltage transistors have less

subthreshold leakage-power dissipation but also have a larger

delay as compared to the low-threshold-voltage transistors.

Fig. 1 shows the concept of dual-threshold-voltage implemen-

tation in custom VLSI designs. Here, the gates on noncritical

paths are assigned as high-Vt, and the gates on the criticalpath are assigned as low-Vt. The objective is to maximize the

number of transistors having high threshold voltage without

sacrificing the performance of the circuit. Several prior works

have proposed algorithms which assign high- and low-Vt to

the logic gates of the given circuit [1], [5], [19], [20]. However,

the dual-threshold-voltage-design technique proposed in the

literature for custom VLSI designs cannot be used for FPGAs.

This is because the FPGAs are programmable and we do not

know the circuit that would be eventually implemented on it

and hence the delays through various paths of the circuit are

not known. In this paper, we present a dual-Vt FPGA CAD

flow for developing and evaluating different dual-Vt FPGAarchitectures. We do not know of any published work in which

a CAD flow for evaluating and developing dual-Vt FPGA

architectures has been proposed.

The leakage-current mechanisms for the FPGAs are the same

as that for the custom VLSI designs. The two main sources of

leakage currents are the subthreshold and gate leakages. There

have been very few works targeted at reducing the leakage

power in FPGAs. The work in [3] used a technique based on the

property that the leakage power consumed by a CMOS circuit is

dependent on the state of its inputs and used the signal statistics

to alter the state of the inputs in order to reduce the leakage

power, in such a way that the functionality of the circuit does

not change. However, this paper addressed only active leakage.Further, this paper addressed leakage reduction only in the used

parts of the FPGA. The work in [6] approached the problem

of reducing leakage power by dividing the FPGA fabric into

small regions, with each of the regions being controlled by a

sleep transistor which would be turned on/off depending upon

whether that region is being used or not, thus reducing the

leakage power. This technique leads to reduction of only the

standby leakage power. The work in [7] explored the dual-Vt,body biasing, and gate biasing of NMOS pass transistors for

reducing leakage power. The dual-Vt technique used in [7] was

based on varying the percentage of high-threshold-voltage ele-

ments in the routing resources of the FPGA and studying its ef-

fect for a number of benchmarks. It did not specify the leakage

savings obtained, and no detailed evaluation of several possible

routing architectures was presented. Body-biasing technique

requires control circuitry and generation of voltages for body

biasing. Further, this technique leads to reduction in leakage

savings with technology scaling [21]. The negative gate biasing

of NMOS pass transistors has implementation issues and also

leads to additional leakage-current component, called gate-

induced drain leakage. The work in [24] used a dual-Vdd/Vt

architecture to reduce leakage power and dynamic power. Only

the configuration SRAM cells were assigned high-Vt to reduce

leakage power. The dual-Vdd approach requires design of

power supply network with two voltage levels and level con-

verters which lead to area overhead and increased complexity.

The dual-Vt technique and the CAD flow that we propose

have the following advantages: 1) reduces both active and

standby leakage powers; 2) provides a CAD framework for

developing and evaluating a dual-Vt FPGA implementation;

3) the inherent area penalty in using sleep transistors is not

present in this design technique; 4) the dual-Vt architecture

does not require any modification in the existing placement androuting tools from the perspective of the users.

III. TARGETED FPGA ARCHITECTURE

The FPGA architecture is very regular in structure. It is

made up of two main componentslogic blocks (CLBs) and

routing resources. The logic blocks implement the functionality

of the given circuit, while the routing resources provide the

connectivity for implementing the logic. The logic blocks and

the routing resources are configurable. Although many types of

architectures have been experimented with, the most popular

one is the SRAM-based architecture which is described belowand is used in this paper [10].

The logic block of the SRAM-based FPGA is lookup table

(LUT)-based and is composed of basic logic elements (BLEs).

LUT is an array of SRAM cells. Each BLE consists of a

k-input LUT, flip-flop, and multiplexer for selecting the output

directly from either the output of LUT or the registered output

value of the LUT stored in the flip-flop. Fig. 2 shows the BLE.

Previous works have shown that the four-input LUT is the most

optimum size as far as logic density and utilization of resources

are concerned, and this has been widely used. Cluster-based

logic blocks were investigated in [10], and it was shown that

the cluster-based logic blocks are better in speed and area. The

structure of a cluster-based logic block is shown in Fig. 3. Inthe cluster-based logic block, the logic block is made up of

8/3/2019 fpga2

3/14

KUMAR AND ANIS: DUAL-THRESHOLD CAD FRAMEWORK FOR SUBTHRESHOLD LEAKAGE POWER AWARE FPGAs 55

Fig. 2. Basic logic element [10].

Fig. 3. Cluster-based logic block [10].

N BLEs. There are I inputs to the logic block, such that eachinput can connect to all the BLEs. Also, the output of each BLE

can drive one of the inputs of each of the BLEs. The clock

feeds all the BLEs. The work in [10] showed that the logicclusters containing four to ten BLEs achieve good performance.

We consider each subblock to be made up of BLE and the

corresponding LUT input multiplexers.

The routing resources are again of various types, but the

one used in this paper is the island-based architecture. In the

island-based architecture, the routing resources form a meshlike

structure with the horizontal and vertical routing channels.

The horizontal and vertical routing channels are connected by

switch boxes which are programmable and thus provide the

flexibility in making the connections. The logic blocks are

connected to the routing channels through the connection boxes

which are also programmable. Fig. 4 shows the island stylerouting architecture [10]. Apart from the logic blocks and the

routing resources, the clock distribution is assumed to have

a dedicated network. Most of the commercial FPGAs have a

structure similar to the one described above or some variant of

the above architecture.

We have used an industrial CMOS 0.13-m technology nodefor the technology dependent data for the FPGA architectures.

The methodology proposed in [10] was used to extract the

required technology dependent data. In extracting the delay data

for the logic blocks, SPICE simulation was done. The delay

through the subblock is the delay through the LUT input multi-

plexer, LUT, flip-flop, and multiplexer for selecting the output

of the flip-flop or LUT. The sizing of the buffers was done forequal rise and fall times, as proposed by the study in [10]. We

have assumed that metal 3 is used for the routing wires. The

sizing of the routing switches was done for minimum area delay

product, as proposed by the study in [10]. We have used the

regular MOS models for low-threshold voltage implementation.

The Vdd for this technology is 1.2 V, and Vt is 0.2 V. The

high-Vt transistors have threshold voltage 100 mV higher than

the low-Vt transistors [27]. SPICE simulation shows that, uponincreasing the high-Vt value further, the delays of the subblocks

increase without leading to any significant leakage savings.

Since the SRAM cells do not contribute to any run-time delay,

we assume that they are implemented with high-Vt transistors

to reduce the leakage power, and hence, our results for leakage

savings do not include any leakage savings from the SRAM

cells [4].

IV. PROPOSED DUAL -T HRESHOLD FPGA ARCHITECTURE

The very nature of programmability of FPGAs makes the

dual-Vt design technique for FPGAs different from dual-Vtcustom VLSI designs. We propose dual-Vt FPGA architectures

and then use a CAD framework for determining the parameters

and subsequently evaluating the proposed dual-Vt FPGA archi-

tectures. This paper targets the logic elements within the CLBs

and the routing resources as candidates for dual-Vt assignment.

We evaluate the following architectures:

1) homogenous architecture with only subblocks within

CLBs considered for dual-Vt assignment (arch1);

2) homogenous architecture with subblocks and routing re-

sources considered for dual-Vt assignment (arch2);

3) heterogenous architecture with only subblocks consid-

ered for dual-Vt assignment (arch3);

4) heterogenous architecture with the subblocks and routing

resources considered for dual-Vt assignment (arch4).

The proposed homogenous and heterogenous architectures are

described as follows.

A. Homogenous Dual-Vt FPGA Architecture

This proposed architecture has a fixed number of high- and

low-Vt subblocks in each of the CLBs. For a cluster size of

N, M(M < N) subblocks are assigned as high-Vt, whereas(N M) subblocks are assigned as low-Vt in each CLB. Forexample, Fig. 5 shows a homogenous dual-Vt FPGA archi-

tecture in which each CLB has 50% high- and 50% low-Vt

subblocks for a cluster size of four. In the case of arch1,

we consider only the subblocks as candidates for dual-Vt

assignment. We extend this architecture to include the routing

resources as candidates for dual-Vt assignment, which leads to

the architecture arch2. The switches in the connection blocks

and the switch blocks, which drive the routing segments, are

considered for dual-Vt assignment. In the architecture arch2,

certain fraction of these switches are assigned as high-Vt. Fig. 6

shows the switch block and the associated routing switches.

The connection block has similar switches for providing con-nections between CLBs and routing segments.

8/3/2019 fpga2

4/14


Fig. 4. Island style routing architecture [10].

Fig. 5. Proposed homogenous FPGA architecture. Each CLB has a fixed ratioof the high- and low-Vt subblocks.

B. Heterogenous Dual-Vt FPGA Architecture

This proposed architecture has two kinds of CLBs distributed

uniformly throughout the array. The first type of CLBs has all

high-Vt subblocks, and the second type has some high- and

low-Vt subblocks. These two kinds of logic blocks are distrib-

uted uniformly throughout the array in such a way that the arrayis regular. If the logic-block cluster size is N, then the proposedFPGA architecture would have one of the two kinds of CLBs

with all N high-Vt subblocks (type1), whereas the second kindof logic blocks would have M high-Vt subblocks (M < N)and N M low-Vt subblocks. Fig. 7 shows the distributionof the two kinds of the logic blocks in the FPGA, such that

50% of the logic blocks are of type1, whereas the other 50%

of the logic blocks are of type2 for a cluster size of four. We

selected such an architecture because the number of subblocks

assigned as high-Vt was very high and so a large percentage

of logic blocks can possibly have all high-Vt subblocks. This

heterogenous architecture is extended to include the routing

resources for dual-Vt assignment, leading to the architecturearch4, in the same way as arch1 is extended to arch2.

Fig. 6. Switch block. (a) Overall architecture of a switch block. (b) Bufferedpass-transistor switch. (c) Pass-transistor-based switch.

C. Proposed Dual-Vt FPGA CAD Framework

Developing and evaluating dual-Vt FPGA architectures, as

outlined previously, would require a CAD framework which

can perform dual-Vt assignment to the subblocks and therouting resources and has a methodology for analyzing different

8/3/2019 fpga2

5/14


Fig. 7. Proposed heterogenous FPGA architecture. Two kinds of CLBs, onehaving all high-Vt subblocks and the other having a fixed ratio of the high- andlow-Vt subblocks.

dual-Vt FPGA architectures. The typical existing FPGA CADflow within the framework of versatile place and route (VPR),

for placing and routing a given netlist, and the proposed generic

dual-Vt FPGA CAD flow are shown in Fig. 8. The proposed

dual-Vt FPGA CAD flow has been developed using the widely

used academic research tools VPR and T-Vpack [10] and a

state-dependent leakage-power model that we developed for

FPGAs [11] and integrated with the power model proposed

by the study in [13]. The salient features of the leakage-power

model developed by us are the following.

1) It is a state-dependent model which takes into account the

dependencies of leakage power on the state of the input.

2) It models both the gate and subthreshold leakages.3) It takes into account the various short-channel effects to

accurately model the leakage.

4) It is based on analytical equations from BSIM4.

The leakage-power model takes into account the state de-

pendence of leakage power by considering the probability

of states for each of the circuit element. Consider a cir-

cuit element which has n states and the probability of thestates are Prob1, Prob2, . . . , Probn, such that

n

i=1 Probi = 1.The leakage power for different states are given as

Pleak1, Pleak2, . . . , Pleakn. The average leakage power canthen be written as

Pavgleak =n

i=1

Probi Pleaki. (2)

We consider both the PMOS and NMOS in various circuit el-

ements as the candidates for subthreshold leakage, but we con-

sider only the NMOS transistors as candidates for gate leakage

because the gate leakage in PMOS is considerably smaller than

NMOS [12]. Furthermore, the back gate leakage of the NMOS

transistors is ignored, and we consider the gate current only

from the gate to channel which then gets partitioned and flows

into the source and drain of the transistor, as shown in Fig. 9(a).

The methodology for computing the leakage through aninverter in the FPGA leakage-power model developed by us is

as follows. We model the subthreshold leakage of the inverters

in both the states, i.e., when the input is zero and when the input

is one, and the gate leakage of the inverter when the input is

one. With the input at zero, only the subthreshold leakage flows

through the NMOS of the inverter and the gate leakage through

PMOS is ignored, as shown in Fig. 9(b). When the input to

the inverter is one, there is subthreshold leakage through thePMOS and gate leakage through the NMOS. In similar ways,

the leakage power through other FPGA circuit elements are

computed.

VPR is used for placement and routing of the benchmark

circuits. After the placement and routing of the given circuit,

the power model computes the probability of states for each

node of the circuit. For computing the probability of the nodes

of the circuit, we have used the work done in [13]. After

the probability of states for each of the nodes is computed,

the power model looks at each of the circuit elements of the

FPGA and computes the leakage for each of the states of that

element using the leakage computation engine (LCE). We have

implemented the LCE, which computes the leakage for each of

the circuit element of the FPGA based on a given input vector

(and the state of the SRAM cells, if they are present in the

given circuit element). The LCE is basically a library of state-

dependent leakage calculation functions. This library has the

basic leakage equation for the subthreshold and gate leakages,

and the computation of the associated parameters. It also has the

models for computing the leakage for each of the FPGA circuit

elements for a given input vector. For a low-Vt subblock with

all the inputs set to zero, the total leakage is 3.46 nA, whereas

for a high-Vt subblock, the leakage is 0.139 nA. For a low-Vt

buffered pass-transistor routing switch, with the input as zero,

the leakage is 0.933 nA, whereas for the high-Vt switch, theleakage is 0.0156 nA.

The typical FPGA CAD flow involves circuit optimization

using sequential interactive synthesis (SIS) [22] and technology

mapping to LUTs using FlowMap [23]. T-Vpack is then used

to do the packing and clustering, which generates the netlist file

of logic blocks. VPR is then used for placement and routing.

The proposed dual-Vt FPGA CAD framework has six stages

as shown. The next section describes various stages of the

CAD flow.

V. CAD FRAMEWORK IMPLEMENTATION

We modified the VPR to support the dual-Vt assignment to

the subblocks and the T-Vpack to support reclustering. Each

of the stages of the dual-Vt FPGA CAD flow [Fig. 8(b)] is

explained below.

A. Stage 1

In this stage, the .net file is generated, which is a netlist of

logic blocks, along with the activity and function files required

for the power estimation. The .net file is generated from the

.blif (Berkeley logic interchange format) file based upon the

specified FPGA architecture, such as cluster size, inputs percluster, LUT size, etc., by T-Vpack. The activity and function

8/3/2019 fpga2

6/14


Fig. 8. (a) Typical FPGA CAD flow within the VPR and T-Vpack framework. (b) Proposed generic dual-Vt FPGA CAD flow.

file generations are a part of the power model framework [11],

[13]. These files serve as inputs to the second stage.

B. Stage 2

The second stage is a delay estimation stage, which is used

to assign high- and low-Vt to various subblocks, based onthe slacks available for each path. The delay calculation can

either be accurate or be based upon some estimation. In this

paper, we have used accurate delay estimation based on the

actual placement and routing of the benchmark. In this stage,

it is assumed that all the subblocks are low-Vt, and the delay

calculation is done based on the low-Vt delay values for the

subblocks and routing resources. VPR does the placement

and routing of the benchmark for delay calculation. Since the

leakage-power model is integrated into the VPR, we calculate

the power for the single low-threshold-voltage implementation

of the FPGA for the benchmark, immediately after placement

and routing. This leakage-power-consumption result is used inthe final stage of the CAD flow for comparison with the leakage

power consumed by the dual-Vt implementation and, hence,

calculation of leakage savings.

C. Stage 3

Initially, it is assumed that all the subblocks are low-Vt

except the configuration SRAM cells which are all high-Vt

for both the single low-Vt and dual-Vt implementations. Based

on the delays of various paths of the circuit computed in the

previous stage, the dual-Vt assignment algorithm assigns high-

Vt to various subblocks. The dual-Vt assignment algorithm is

shown in Algorithm 1 [1]. The algorithm works on the timinggraph created by the VPR for placement and routing. This

Fig. 9. (a) Gate leakage in NMOS. (b) Subthreshold leakage in an inverter.

timing graph contains the information for the delays between

various pins of the circuit. High-Vt assignment to a subblock

is done based on the slack available at the output pin of the

subblock. The assignment of high-Vt would change the delay

through the subblock. This delay value is updated in the timing

graph, and the next subblock is then considered for high-Vtassignment. It should be noted that the dual-Vt assignment is

not constrained to keep the CLBs identical in terms of the

number of high- and low-Vt subblocks in each CLB. This kind

of dual-Vt assignment, which is the ideal assignment, would

result in a dual-Vt FPGA implementation which would not have

any performance degradation. This dual-Vt assignment gives a

theoretical upper limit for the number of subblocks assigned as

high-Vt. This dual-Vt assignment methodology is the same as

the dual-Vt assignment for custom VLSI designs. Therefore,

the dual-Vt assignment produces an output in which there

are different numbers of high- and low-Vt subblocks across

different CLBs. This is, however, not a feasible solution. This

information is needed by the next reclustering stage to create afeasible implementation for the dual-Vt FPGA architectures.

8/3/2019 fpga2

7/14


Algorithm 1 Dual-Vt assignment [1]

Using timing - graph in VPR (after Place and Route)

for each subblockdo

Compute the slack availableifslack > 0 then

Assign high-Vt to the subblock

Recompute the slackifslack < 0 then

Re-assign low-Vt to the subblock

end if

end if

end for.

D. Stage 4

This stage works within the framework of T-Vpack. For

the homogenous architectures (arch1 and arch2), we estimate

the number of high-Vt subblocks per CLB based upon the

results of the dual-Vt assignment to subblocks in the previousstage. Based on these results, we evaluate the homogenous

architecture with a fixed number of high-Vt subblocks per CLB.

The heterogenous architectures (arch3 and arch4) are devel-

oped and evaluated in the following way based on the algorithm

shown in Algorithm 2. We have modified the clustering algo-

rithm proposed by the study in [10]. The clustering algorithm in

[10] picks up a seed BLE and then tries to build up the cluster by

attracting the other BLEs based on an attraction function. The

attraction function used in [10] to add a BLE B to a currentcluster C is shown in the following:

Attraction= Criticality(B)+(1)

Nets(B)

Nets(C)MaxNets

(3)

where Criticality(B) represents that part of the attractionfunction which tries to reduce the delay of the circuit. For

computing the delay of the various paths in the circuit during

the packing stage, the intracluster delay is assumed to be 0.1,

logic delay is assumed to be 0.1, and intercluster delay is

assumed to be one [10]. For the modified clustering algorithm,

we set the delay of high-Vt BLEs to 0.2 to account for the

increased delays of high-Vt BLEs. The exact values of these

delays are not very important as long as they represent the

correct trend [10]. The methodology for criticality computationis explained in [10], and its value lies between zero and one. The

second part of the function tries to cluster the BLE which share

maximum nets with the current cluster. To take into account the

high-Vt BLEs and increase the type1 (all high-Vt) clusters the

modified cost function is shown in the following:

Attraction = Criticality(B)

+ (1 )

Nets(B) Nets(C)

MaxNets

+ Hvt NumHvt + (1 Hvt) NumLvt

ClusterSize(4)

where the third part of the equation denotes the attraction of a

BLE to a cluster based on the number of high-/low-Vt BLEs

in the current cluster. The parameter Hvt takes a value of

one for a high-Vt BLE and zero for a low-Vt BLE. NumHvt

denotes the number of high-Vt BLEs in the current cluster, and

NumLvt denotes the number of low-Vt BLEs in the current

cluster. The value ofdepends upon the benchmark, but a valuebetween 0.01 and 0.05 was found to give good results for all thebenchmarks. The value of was set to 0.8.

The number of low-Vt subblocks in the type2 logic blocks is

determined as follows:

NumLvtBLEs = ClusterSize

SumLvtCrit

NumFracLB

(5)

where SumLvtCrit denotes the sum of criticality of all low-Vt

subblocks, and NumFracLB denotes the number of logic blocks

which have at least one low-Vt subblock so that they can be

categorized into type2 logic block. We use this equation to de-

termine the number of low-Vt subblocks in type2 CLBs becausethe criticality measure takes into account the importance of a

subblock with regard to delay, which directly translates to the

threshold voltage of the transistors of the subblock. The next

step is to assign high-/low-Vt to subblocks in type2 logic blocks

so that each of the type2 logic blocks have the same number of

low-Vt subblocks. We start by arranging all the subblocks in

a logic block in the ascending order of criticality. If the logic

block has more low-Vt subblocks than NumLvtBLEs, the BLEs

are assigned high-Vt, starting with the lowest critical low-Vt

subblock, until the number of low-Vt subblocks is equal to

NumLvtBLEs. If the logic block has low-Vt subblocks less

than NumLvtBLEs, some of the subblocks are assigned low-

Vt, starting with the highest critical high-Vt subblock, so thatmaximum gain in performance might be achieved. This method

of assigning high-/low-Vt to subblocks during the reclustering

stage is also used for the homogenous architecture, such that

the value NumLvtBLEs is used for all the logic blocks.

Algorithm 2 Recluster

Do clustering based on T-Vpack algorithm with modified

attraction function

for each logic blockdo

ifAll-subblocks - high- Vt then

Mark the logic block as type1

elseMark the logic block as type2

end if

end for

Determine the number of low-Vt subblocks in type2 CLB

based on (5)

for Each type2 logic blockdo

Arrange the subblocks in ascending order of criticality

ifNo.-of-low-Vt-subblocks < NumLvtBLEsthen

while No.-of-low-Vt-subblocks < NumLvtBLEs doRe-assign a high-Vt subblock to low-Vt based on

criticality (Highest critical high-Vt subblock as-

signed low-Vt first)end while

8/3/2019 fpga2

8/14


else

while No.-of-low-Vt-subblocks > NumLvtBLEs doRe-assign a low-Vt subblock to high-Vt based on

criticality (Lowest critical low-Vt subblock as-

signed high-Vt first)

end while

end ifend for.

This stage produces an output which gives an estimate of the

number of type1 and type2 CLBs for the heterogenous archi-

tectures (arch3 and arch4) and number of low-Vt subblocks in

the type2 CLBs.

E. Stage 5

Homogenous architectures : For the architecture arch1, place-

ment and routing is done in this stage. For realizing the architec-

ture arch2, a certain fraction of routing resources are assignedhigh-Vt, and then, placement and routing is done.

The high-Vt assignment to routing switches is done as

follows. Consider a routing channel width of 100 with two

kinds of segments, such that 50 segments are driven by pass-

transistor switches in the switch box and 50 segments are driven

by buffered pass-transistor switches, as defined by the VPR

architecture file. The output pins of the CLB drive all these

segments with a buffered switch in the connection box. When

we assign high-Vt to, say, 40% of routing segments driven by

pass-transistor switches, it would mean that 20 segments of

those 50 segments will be driven by high-Vt pass-transistor

switches. Also, all the output pin switches of the CLBs, which

drive these 20 segments, will be assigned high-Vt.We assume that a gate boosting for these switches are done,

which would minimize the voltage degradation when a logic

one is propagated [10]. Other techniques, such as the use of

level restorer, can be used to eliminate voltage degradation in

the routing switches [25]. SPICE simulation shows that logic

level one passing through a high-Vt pass transistor with gate

boosting to 1.5 V (Vdd = 1.2 V) degrades to only 1.198 from1.2 V. The SPICE simulation results indicate that, with gate

boosting only 1.5%, extra static leakage current flows in the

driven buffer through a high-Vt pass transistor as compared

to the case when there is no voltage degradation. Therefore,

we consider that the static power dissipation due to voltagedegradation in high-Vt pass-transistor switches is sufficiently

small.

Heterogenous architectures: Based on the reclustering re-

sults, we determine, for a given FPGA array, the number of

type1 and type2 CLBs. We estimate these values based on

the ratio of type1 and type2 logic blocks in the reclustered

netlist, after providing sufficient margin to account for the

performance degradation because of the physical restriction

of FPGA regularity. These CLBs are assumed to be distrib-

uted evenly throughout the FPGA array. For the architecture

arch3, constrained placement followed by routing is done. For

arch4, a certain fraction x of the total routing switches are

assigned high-Vt, and then, constrained placement followedby routing is done. We vary the value of x to determine the

leakage savings and performance tradeoffs. The algorithm for

constrained placement is shown in Algorithm 3. Since the VPR

uses simulated annealing for timing-driven placement, which

is a very effective placement technique, we have not modified

the basic placement algorithm. However, we have incorporated

delays of the high- and low-Vt parts of the CLBs into the

timing graph accordingly, so that VPR can optimize the overallplacement for performance.

Algorithm 3 Constrained Placement

P-ratio = (Num type1 CLBs)/(Num type2 CLBs)Determine physical locations for the two types of CLBs

based on P-ratio

Assign high-Vt to frac fraction of routing switches

Initial Random placement for all the CLBs

Start placement based on VPR placement algorithm

Allow the type2 CLBs to be placed on physical locations

meant for type1 CLBs and vice versa

End Placement

for each CLB do

ifLogic-Block-not-on-same-type-CLB then

Convert the logic block to type1 or type2 according

to its placement

end if

end for.

Since the VPR router is timing driven, it optimizes the overall

routing. The delays of the high-Vt switches would be more

than the delays of the low-Vt switches, which would force the

router to use more of low-Vt switches. Therefore, we provide

the increased delays of the high-Vt routing switches to the VPR,

so that the router can optimize the overall routing.

F. Stage 6

This stage computes the results obtained from the previous

stages of the CAD flow. After the computation of delays and

power for the dual-Vt FPGA implementation is over, this stage

computes the leakage-power savings obtained and the delay

penalty for the benchmark.

VI. EVALUATION, RESULTS, AN D DISCUSSIONS

This section describes the realization and evaluation of dif-ferent FPGA architectures and their comparison. The method

of developing and evaluating these architectures is by using a

number of benchmarks on the proposed dual-Vt FPGA CAD

flow to find out which subblocks can be high-Vt for maxi-

mizing leakage savings and reducing the delay penalties. It

would be worthwhile to point out the difference between the

physical design of the FPGA and the mapping of an appli-

cation on the FPGA. The physical resources of an FPGA are

fixed, and the applications are mapped later on. The dual-

Vt FPGA CAD flow that we outlined above is meant for the

actual physical design of the FPGA and, later on after the

physical design of the FPGA is complete, for the design of

associated CAD tools. This dual-Vt FPGA CAD flow is to beused (along with other CAD tools) during the design of the

8/3/2019 fpga2

9/14


TABLE ILEAKAGE SAVINGS WIT H LOGIC BLOCKS ASSIGNED AS DUAL -VT FOR A CLUSTER SIZE OF 12 FOR

HOMOGENOUS AND HETEROGENOUS ARCHITECTURES

FPGA, where a number of benchmarks are evaluated before a

design for an FPGA is finalized. This CAD flow is not meant

for mapping of an application on to an FPGA chip, as all

the logic blocks and routing resources would then be fixed,

which would be a routine job using the standard FPGA CADtools. Therefore, the dual-Vt FPGA CAD flow is meant for

developing a dual-Vt FPGA by determining its architectural

parameters.

A. Estimating Power Savings

We use the power model proposed in [11] for estimating the

leakage-power savings. The total power consumption can be

divided into two parts, active power and standby power. The

total average power can be written as

Pavg =tact Pact + toff Poff

tact + toff (6)

T = tact + toff (7)

where Pavg, Pact, and Poff are the average, active, and standby(off) power. For personal wireless communication systems,

typically, the standby time or off time (toff) is 90% of the totaltime (T), and active time (tact) is 10% of the total time. Duringthe active time, the components of power dissipation can be

written as

Pact = [Pdyn + Psckt + Pactleak]used + [Pactleak]unused (8)

where Pdyn, Psckt, and Pactleak are the dynamic, short circuit,and active leakage-power consumptions, respectively. Poff is

the standby-leakage-power consumption of the FPGA, because

during the standby mode, only the leakage power is dissipated.

Hence, reducing leakage power during the standby mode for

mobile applications would increase the battery life significantly.

The dual-Vt technique reduces both the standby and activeleakages, and hence, it is an attractive method to reduce sub-

threshold leakage.

B. Evaluation Methodology

For obtaining the results from the benchmarks, we used the

LUT size of four and a cluster size of 12. It was shown in [8]

that a LUT size of four and a cluster size of eight or 12 leads to

minimization of total power. The inputs per cluster was chosen

as K/2(N + 1), where N is the number of subblocks percluster and K is the LUT size [26]. We used the default routing

architecture present in the FPGA architecture file of VPRhaving 50% routing segments with pass-transistor switches and

50% routing segments with buffered pass-transistor switches.

For computing the leakage savings and performance tradeoffs,

a fixed-size FPGA array of 20 20 was used for all the

benchmarks, except bigkey, clma, des, dsip, and s38584.1, for

which a 35 35 array was used. A routing channel width of

100 was used for the FPGA architecture.

Table I shows the results of the dual-Vt assignment. Based

on these results, homogenous architectures with four, six, and

eight high-Vt subblocks per CLB were considered. The results

shown for the number of type1 CLBs and low-Vt subblocks in

type2 CLBs in Table I correspond to a minimum-sized FPGA

(required to fit a benchmark), and these values are ideal valueswhich does not consider the regularity of the structure of the

8/3/2019 fpga2

10/14


FPGAs. This kind of dual-Vt assignment leads to an FPGA

which would have no delay penalty. This is same as the dual-Vt

assignment for ASICs, in which there is no restriction with

regard to the logic elements that can be assigned as high-Vt.

This kind of dual-Vt assignment leads to nonidentical logic

blocks (in terms of high- and low-Vt subblocks in each logic

block) in the FPGA, and the reclustering stage estimates theparameters of a uniform dual-Vt FPGA (in which all the logic

blocks are identical with respect to number of the high- and

low-Vt subblocks), based on the results of the dual-Vt assign-

ment. Hence, to provide sufficient margin and prevent severe

degradation of final placement and routing imposed because of

the limitation that the FPGA needs to be regular and uniform

in structure, we set the values for type1 CLBs and number of

low-Vt subblocks in type2 CLBs as 50% and 10, respectively.

C. Realizing and Evaluating Different FPGA Architectures for

Leakage Savings and Design Tradeoffs

1) arch1: This is a homogenous architecture in which all

the logic blocks are identical, such that each logic block

has a fixed ratio of the high- and low-Vt subblocks. Based

upon these results for dual-Vt assignment, we evaluated three

different homogenous architectures with CLBs having eight,

six, and four high-Vt subblocks in each CLB. After assigning

high-Vt to, say, eight subblocks in each CLB, we placed and

routed different benchmarks to evaluate the leakage savings

and performance tradeoffs. Table I shows the overall average

leakage saving results for these three different cases. The mean

performance penalties were 3.7%, 4.8%, and 5.04% for the

architectures with four, six, and eight high-Vt subblocks in

each CLB, respectively. The delay penalties are close to eachother because of two reasons: 1) There are sufficient number

of low-Vt subblocks for the critical path(s) and 2) the routing

contributes to a larger portion of total delay, which means that,

even if there are few high-Vt subblocks on the critical path, this

would not cause a large delay penalty.

2) arch2: This is the homogenous architecture with routing

resources considered for dual-Vt assignment. The evaluation

methodology is same as that of the arch1 case, except that we

vary the fraction of the two kinds of routing resources assigned

as high-Vt and estimate the leakage savings and the delay

tradeoffs. The evaluation was done for the three cases, viz.,

FPGA architecture with four, six, and eight high-Vt subblocksper CLB for a cluster size of 12.

Fig. 10 shows the leakage savings and the performance

tradeoffs when the fraction of the high-Vt switches are varied

over the base architecture having six high-Vt subblocks per

CLB. Fig. 10(a) shows the leakage savings and delay penalty

when only the buffered switches are considered for high-Vt

assignment along with the CLB output pin switches which can

drive these segments (i.e., those segments which can be driven

by these high-Vt buffered switches), all the pass-transistor

switches remaining as low-Vt. Fig. 10(b) shows the leakage

savings and delay penalty when pass-transistor switches are

considered for high-Vt assignment along with the CLB output

pin switches which can drive these segments (i.e., thosesegments which can be driven by these high-Vt pass-transistor

Fig. 10. Leakage savings for arch2 with six high-Vt subblocks per CLB.(a) Buffered switches assigned as high-Vt. (b) Pass-transistor switches assignedas high-Vt.

switches), all the buffered switches remaining as low-Vt.

The leakage savings increases considerably when the routing

resources are assigned as high-Vt with some impact on the per-

formance. This is because a major part of leakage comes from

the routing resources and is discussed further in Section VI-E.

Leakage savings for both the cases, when the pass-transistor

switches are assigned as high-Vt and buffered switches are

assigned as high-Vt, are almost the same. Although the buffered

switches have more number of devices but the pass-transistor

switch is eight times the minimum size, the pass transistor in

the buffered switch is two times the minimum size and the

buffer is two times the minimum sized buffer, which leads toalmost similar leakage savings. The delay penalties for these

architectures vary just between 10.7% and 3.32%. This can be

attributed to the fact that a large percentage of routing switches

are unutilized [4], and therefore, they do not contribute to delay

penalties even if they are assigned as high-Vt. Further, the rout-

ing optimization effort of the VPR would also increase, tending

to use more of the low-Vt switches, which compensates, to

some extent, the increased delay of the high-Vt switches.

3) arch3: This is a heterogenous architecture in which there

are two kinds of CLBs in the FPGA. Table I shows the leakage

savings for this architecture. Since the FPGA architecture needs

to be regular in structure, the two kinds of CLBs are distributed

uniformly throughout the array. Based on the clustering results,type1 CLBs were fixed to 50% of the total CLBs, and type2

8/3/2019 fpga2

11/14


Fig. 11. Leakage savings forarch4. (a) Buffered switches assigned as high-Vt.(b) Pass-transistor switches assigned as high-Vt.

CLBs had ten low-Vt subblocks for this architecture. The num-ber of low-Vt subblocks in type2 CLBs was fixed by averaging

this value for all the benchmarks. The ratio of type1 and type2

CLBs was fixed by averaging and providing sufficient margin

to prevent performance degradation for all the benchmarks. The

results shown in this paper were obtained after fixing these

values for all the benchmarks. Table I shows that the overall

mean leakage savings for this architecture is 19.3%. The mean

delay penalty for this architecture is 6.9%.

4) arch4: This architecture is realized using the architecture

arch3. The base architecture is the same as that of arch3,

and over that, architecture routing resources are also assigned

as high-Vt to realize the architecture arch4. The evaluationmethodology is the same as that for arch3, except that the

fraction of high-Vt routing resources is varied and the leakage

savings and design tradeoffs are calculated. Fig. 11(a) shows

the leakage savings and delay penalty when only the buffered

switches are considered for high-Vt assignment along with the

CLB output pin switches which can drive these segments (i.e.,

those segments which can be driven by these high-Vt buffered

switches), all the pass-transistor switches remaining as low-Vt.

Fig. 11(b) shows the leakage savings and delay penalty when

pass-transistor switches are considered for high-Vt assignment

along with the CLB output pin switches which can drive these

segments (i.e., those segments which can be driven by these

high-Vt pass-transistor switches), all the buffered switchesremaining as low-Vt. The delay penalties vary between 12.4%

and 4.8% for different fractions of high-Vt switches. The in-

crease in delay penalty is not too large when routing switches

are assigned high-Vt because a large fraction of routing re-

sources are unutilized. We observe that the leakage savings for

this architecture is about 3% greater than of the arch2 (six high-

Vt subblocks per CLB), with almost the same delay penalty.

D. Design Tradeoffs

Before deciding upon any dual-Vt FPGA architecture, we

need to consider the design tradeoffs. In this paper, we eval-

uated two types of architectureshomogenous and heteroge-

nous. The design tradeoffs are the delay penalties and the

impact on overall power. We evaluated these for all the FPGA

architectures. It was observed that the dynamic power dissipa-

tion remains almost constant for all the architectures.

A dual-Vt custom VLSI design does not lead to any delay

penalty. However, because of the very nature of the program-

mability of the FPGAs, there would be some delay penalty

associated with a circuit implemented on a dual-Vt FPGA

as compared to a single low-Vt FPGA. The delay penalties

would vary with the benchmark. Essentially, the delay penalty

occurs because the critical path of a circuit in the dual-Vt

FPGA changes from that in the single low-Vt FPGA. There

are three sources of delay penalties in a dual-Vt FPGA:

1) the increased delays of the subblocks, some of which might

lie on the critical path; 2) the altered placement and routing

in the dual-Vt FPGA because of the different delays through

the subblocks which impacts the placement and, subsequently,

routing; and 3) increased delays of high-Vt routing switches,

some of which might lie on the critical path. Table II shows the

leakage savings and the delay penalties for arch2, arch4, and allhigh-Vt subblocks architectures. The delay penalties for arch2

and arch4 are comparable. These architectures have a very high

percentage of high-Vt switches and would be most susceptible

to performance degradation. However, the mean delay penalty

for arch2 with 80% high-Vt pass-transistor switches is 6.4%,

with the maximum delay penalty being 15.3% for a mean leak-

age savings of 40.3%. For arch4 with 80% high-Vt switches, the

mean delay penalty is 5.8%, with the maximum delay penalty

being 13.6% for a mean leakage savings of 43.5%. We have

also shown the results of all high-Vt architecture, where all

the subblocks in all the CLBs are high-Vt. The mean leakage

savings is 32.6%, and the mean delay penalty is 19%, withthe maximum delay penalty as high as 48.5%. We see that

keeping all high-Vt subblocks leads to severe performance

degradation in many of the benchmarks. This is because, in

the architectures which contain low-Vt subblocks, the placer

and the router can search for low-Vt subblocks to minimize

the delay, which is not possible in the case of architecture with

all high-Vt subblocks. The performance degradation in the all

high-Vt subblocks is more pronounced for benchmarks which

have a large number of subblocks on the critical path. The

benchmark diffeq, which has a severe delay penalty of 48.5%,

has 30 subblocks on the critical path for the implementation on

the all high-Vt subblocks architecture. The benchmark bigkey

has only six subblocks on the critical path, and the routing ofthe VPR compensates for the increased delay, which leads to

8/3/2019 fpga2

12/14


TABLE IIDESIGN TRADEOFFS FOR HOMOGENOUS ARCHITECTURE arch2 WITH SIX HIG H-VT SUBBLOCKS AND 80% HIG H-VT PAS S-TRANSISTOR SWITCHES,

HETEROGENOUS ARCHITECTURE arch4 WIT H 80% HIG H-VT PAS S-TRANSISTOR SWITCHES, AN D ALL HIG H-VT SUBBLOCKS ARCHITECTURE

Fig. 12. (a) Leakage contributions of routing resources, logic resources, and SRAM cells for single low-Vt implementation and (b) and (c) after dual-Vtimplementation for alu4.

almost no delay penalty. Hence, having an architecture with all

high-Vt subblocks would not be a good design.

There is almost no area penalty as the dual-Vt technique

inherently does not lead to any area penalty. In the experimental

results for all the benchmarks, we kept the routing channel

width and the number of CLBs the same for both the baseline

single-Vt FPGA architecture and the dual-Vt FPGA architec-

tures. This shows that there is no area penalty associated with

the proposed design of the dual-Vt FPGA architectures.

E. Distribution of Leakage Savings

In all these evaluations, we have assumed high-Vt SRAM

cells for both the single low-Vt and dual-Vt implementations.

The leakage savings is solely from the logic and routing parts

of the FPGA. Fig. 12 shows the distribution of leakage among

the logic blocks and the routing resources. Fig. 12(a) shows the

leakage distribution in the single low-Vt implementation for the

benchmark alu4. It can be seen that almost two thirds of leakage

is from the routing resources. This is because of a large amount

of unused routing resources [4]. SRAM leakage is negligible as

they have high-Vt transistors to reduce the leakage. Assigning

dual-Vt to the logic blocks reduces the leakage component ofthe logic blocks, as shown in Fig. 12(b). Fig. 12(c) shows the

Fig. 13. Leakage power for alu4. (a) Single low-Vt implementation.(b) Dual-Vt arch3. (c) Dual-Vt arch4 with 60% high-Vt pass-transistorswitches.

impact on leakage distribution as a result of dual-Vt assignment

to both the logic blocks and the routing resources. The contri-

bution of routing leakage to total leakage increases by a small

amount, even though it results in overall leakage savings. This

can be explained by the leakage-power results for a benchmark

alu4, as shown in Fig. 13. For architecture arch3, the logic

leakage reduces by more than 50%, with the routing leakageremaining the same. When 60% of pass-transistors switches

8/3/2019 fpga2

13/14


Fig. 14. Realizing a dual-Vt FPGA design.

are also assigned as high-Vt, the routing leakage decreases by

28%. This results in an overall leakage savings of 37%, with

19% savings from the logic blocks and 18% savings from the

routing resources. This explanation can be extended to all the

benchmarks, and the other benchmarks have similar leakage

distributions.

The leakage-power model in [11] models both the gate

and the subthreshold leakage. For CMOS 130 nm, the gate

leakage is six orders of magnitude smaller than the subthresholdleakage. Therefore, the gate leakage does not impact our

methodology.

VII. DESIGNING A DUAL -V T FPGA

Designing a dual-Vt FPGA would require selecting a dual-Vt

FPGA architecture and then determining the parameters for the

dual-Vt FPGA. The dual-Vt FPGA CAD framework proposed

in this paper is intended for this purpose. Fig. 14 shows the

steps involved in the design of a dual-Vt FPGA. The previous

sections explored the various dual-Vt FPGA architectures for

leakage-power savings and performance tradeoffs. It can beseen from the results of the previous section that architectures

with high-Vt switches lead to significantly increased leakage

savings with small increase in delay penalties. Further, some

of the architectures are more suitable for some benchmarks,

whereas other benchmarks are more suitable for another archi-

tecture. This is because the number of interconnections between

the logic blocks is different for different benchmarks. This im-

plies that some benchmarks having a large number of high-Vt

switches would not have a large impact on delay because of

a shorter total wire length in the critical path, whereas some

benchmarks having a high percentage of high-Vt subblocks per

CLB would not have a big impact on performance because

there are very few subblocks in the critical path. Homogenous

architectures would, in general, be easier in implementation

as compared to heterogenous architectures. However, a careful

design of heterogenous architectures and the associated CAD

tools can lead to increased leakage savings for same delay

penalties as that for homogenous architectures.

VIII. CONCLUSION

The dual-Vt FPGA architectures explored above indicate

that an average leakage-power savings of up to 50% can be

obtained by the dual-Vt FPGA architectures. We proposed a

dual-Vt FPGA CAD flow for implementation and evaluationof dual-Vt FPGA architectures. The results show that a high

percentage of subblocks are assigned as high-Vt for the ideal

dual-Vt assignment. This indicates that a large amount of slack

is available with the logic blocks, which can be exploited to re-

duce the leakage power. We explored two primary architectures,

homogenous and heterogenous. Results indicate that there is a

large percentage of unused routing and logic resources, such

that dual-Vt FPGA architectures can lead to good leakagesavings without large delay penalties. The main bottleneck in

using any leakage-reduction technique for FPGAs is the very

nature of the programmability of the FPGA, leading to an

architecture not very suitable for leakage-reduction techniques.

For our future paper, we intend to develop novel leakage-

aware FPGA architectures which would be especially adapted

to implement leakage-reduction technique without incurring

large delay penalties.

REFERENCES

[1] L. Wei, Z. Chen, K. Roy, M. C. Johnson, Y. Ye, and V. K. De, Designoptimization of dual-threshold circuits for low-voltage low-power appli-

cations, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 1,pp. 1624, Mar. 1999.

[2] J. Kao, S. Narendra, and A. Chandrakasan, Subthreshold leakage mod-eling and reduction techniques, in Proc. IEEE Int. Conf. Comput.-Aided

Des., 2002, pp. 141148.[3] J. H. Anderson, F. N. Najm, and T. Tuan, Active leakage power optimiza-

tion for FPGAs, in Proc. FPGA, 2004, pp. 3341.[4] T. Tuan and B. Lai, Leakage power analysis of a 90 nm FPGA, in Proc.

IEEE Custom Integr. Circuits Conf., 2003, pp. 5760.[5] S. Sirichotiyakul, T. Edwards, C. Oh, R. Panda, and D. Blaauw,

Duet: An accurate leakage estimation and optimization tool for dual-Vtcircuits, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 2,pp. 7990, Apr. 2002.

[6] A. Gayasen, Y. Tsai, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, andT. Tuan, Reducing leakage energy in FPGAs using region-constrainedplacement, in Proc. FPGA, 2004, pp. 5158.

[7] A. Rahman and V. Polavarapuv, Evaluation of low-leakage designtechniques for field programmable gate arrays, in Proc. FPGA, 2004,pp. 2330.

[8] F. Li, D. Chen, L. He, and J. Cong, Architecture evaluation for powerefficient FPGAs, in Proc. FPGA, 2003, pp. 175184.

[9] C. Hu, A. Niknejad, X. Xi, W. Liu, X. Jin, K. M. Cao, M. Dunga, andJ. Ou. (2005, Jul. 29). BSIM4 MOS Models, Release 4.5.0. [Online].Available. http://www.device.eecs.berkeley.edu/~bsim3/bsim4.html

[10] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs. Norwell, MA: Kluwer, 1999.

[11] A. Kumar and M. Anis, An analytical state dependent leakage powermodel for FPGAs, in Proc. DATE, 2006, pp. 612617.

[12] D. Lee, D. Blaauw, and D. Sylvester, Gate oxide leakage current analysisand reduction for VLSI circuits, IEEE Trans. Very Large Scale Integr.(VLSI) Syst., vol. 12, no. 2, pp. 155166, Feb. 2004.

[13] K. Poon, A. Yan, and S. J. E. Wilton, A flexible power model forFPGAs, in Proc. Int. Conf. Field Programmable Logic and Appl., 2002,

pp. 312321.[14] S. Borkar, Design challenges of technology scaling, IEEE Micro,

vol. 19, no. 4, pp. 2329, Jul./Aug. 1999.[15] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand, Leak-

age current mechanisms and leakage reduction techniques in deep-submicrometer CMOS circuits, Proc. IEEE, vol. 91, no. 2, pp. 305327,Feb. 2003.

[16] M. Anis, M. Mahmoud, M. Elmasry, andS. Areibi, Dynamic andleakagepower reduction in MTCMOS circuits using an automated efficient gateclustering technique, in Proc. DAC, 2002, pp. 480485.

[17] J. Kao, A. Chandrakasan, and D. Antoniadis, Transistor sizing issuesand tools for multi-threshold CMOS technology, in Proc. DAC, 1997,pp. 409414.

[18] S. Mutah, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, andJ. Yamada, 1-V power supply high speed digital circuit technology withmulti-threshold voltage CMOS, IEEE J. Solid-State Circuits, vol. 30,

no. 8, pp. 847853, Aug. 1995.[19] N. Kato, Y. Akita, M. Hiraki, T. Tamashita, T. Shimizu, F. Maki, andK. Yano, Random modulation: Multi-threshold-voltage design

8/3/2019 fpga2

14/14


methodology in sub-2-V power supply CMOS, IEICE Trans. Electron.,vol. E83-C, no. 11, pp. 17471754, Nov. 2000.

[20] L. Wei, K. Roy, and C. Koh, Power minimization by simultaneous dual-Vth assignment and gate-sizing, in Proc. IEEE Custom Integr. CircuitsConf., 2000, pp. 413416.

[21] Y.-F. Tsai, D. Duarte, N. Vijaykrishnan, and M. J. Irwin, Implications oftechnology scaling on leakage reduction techniques, in Proc. DAC, 2003,pp. 187190.

[22] E. M. Sentovich et al., SIS: A System for Sequential Circuit Analysis.Berkeley, CA: Univ. California, 1992.[23] J. Cong and Y. Ding, FlowMap: An optimal technology mapping

algorithm for delay optimization in lookup-table based FPGA designs,IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 13, no. 1,pp. 112, Jan. 1994.

[24] F. Li, Y. Lin, L. He, and J. Cong, Low-power FPGA using predefineddual-Vdd/dual-Vt fabrics, in Proc. FPGA, 2004, pp. 4250.

[25] G. Lemieux and D. Lewis, Design of Interconnection Networks forProgrammable Logic. Norwell, MA: Kluwer, 2004.

[26] E. Ahmed and J. Rose, The effect of LUT and cluster size ondeep submicron FPGA performance and density, in Proc. FPGA, 2000,pp. 312.

[27] M. Hirabayashi, K. Nose, and T. Sakurai, Design methodology andoptimization strategy for dual-Vth scheme using commercially availabletools, in Proc. Int. Symp. Low Power Electron. and Design, Aug. 2001,pp. 283286.

Akhilesh Kumar (S06) received the B.Tech.(honors) degree in electrical engineering from IndianInstitute of Technology, Roorkee, India, in 2001,and the M.A.Sc. degree in electrical and computerengineering from University of Waterloo, Waterloo,ON,Canada, in 2006. He is currentlyworkingtowardthe Ph.D. degree at the Department of Electrical andComputer Engineering, University of Waterloo.

He worked with the ST Microelectronics, India,from November 2002 to December 2003 and withthe Hughes Software Systems, India, from June 2001

to November 2002. His research interests include the areas of design automa-tion, digital circuit design, and architecture design for deep-submicrometertechnologies.

Mohab Anis (M03) was born in Montreal, Canada,on February 19, 1974. He received the B.Sc. degree(honors) in electronics and communication engineer-ing from Cairo University, Cairo, Egypt, in 1997, andthe M.A.Sc. and Ph.D. degrees in electrical engineer-ing from the University of Waterloo, Waterloo, ON,Canada, in 1999 and 2003, respectively.

Currently, he is an Assistant Professor and Co-

Director of the VLSI Research Group with theUniversity of Waterloo. He has authored/coauthoredover 40 papers in international journals and con-

ferences and is the author of the book Multi-Threshold CMOS DigitalCircuitsManaging Leakage Power (Kluwer, 2003). His research interestsinclude integrated circuit design and design automation for VLSI systems inthe deep-submicrometer regime.

Dr. Anis is a member of the program committee for several IEEE conferencesand is a regular reviewer for a number of IEEE journals and conferences. Hewas awarded the 2005 New Opportunity Fund Award from the Canada Founda-tion for Innovation, the 2004 Douglas R. Colton Medal for Research Excellencein recognition of excellence in research leading to new understanding and noveldevelopments in Microsystems in Canada, and the 2002 IEEE InternationalLow-Power Design Contest Prize.

fpga2

Documents