fpga2

Upload: rostamedastan65

Post on 06-Apr-2018

218 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/3/2019 fpga2

    1/14

    IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 1, JANUARY 2007 53

    Dual-Threshold CAD Framework for SubthresholdLeakage Power Aware FPGAs

    Akhilesh Kumar, Student Member, IEEE, and Mohab Anis, Member, IEEE

    AbstractManaging leakage power in field programmable gatearrays (FPGAs) has become critical for the FPGA industry toremain competitive in the semiconductor market and enter themobile applications domain. This paper proposes and evaluatesseveral dual-Vt-based designs of FPGA architecture for reduc-ing the leakage power. A dual-Vt FPGA computer-aided design(CAD) framework has been proposed, which is used to developand evaluate different dual-Vt FPGA architectures. The logicelements and the routing resources are considered as candidatesfor dual-Vt assignment. The authors estimate the number of thelogic elements that can be assigned as high-Vt in the ideal caseby using a dual-Vt assignment algorithm in the CAD framework.Based upon this estimate, the authors develop and evaluate two

    kinds of architectures, homogenous and heterogenous. The resultsindicate that an average leakage-power savings of up to 50% canbe obtained from these architectures. This CAD framework canalso be used for developing and evaluating different dual-Vt FPGAarchitectures other than the ones proposed in this paper.

    Index TermsCMOS digital integrated circuits, delay estima-tion, design automation, field programmable gate arrays (FPGAs),leakage currents.

    I. INTRODUCTION

    LEAKAGE POWER has been recently recognized as a ma-

    jor challenge in the field programmable gate array (FPGA)

    industry. This was primarily because other design challenges,

    such as performance and area, were given more attention,

    but with technology scaling, leakage power has emerged as a

    key design challenge [14]. The current-generation FPGAs are

    being implemented in the 90-nm CMOS technology, which

    necessitates devising techniques for leakage-power reduction.

    For the FPGAs to continue retaining its semiconductor market

    and competitive advantages over the high-performance custom

    very large scale integration (VLSI) designs, the FPGA industry

    should adopt new techniques for leakage-power reduction. The

    work in [4] showed that a 90-nm FPGA consumes too much

    leakage power to be successfully used in mobile applications.

    Therefore, for FPGAs to gain popularity in the domains such

    as wireless personal communication systems or in biomedicalapplications, the FPGAs need to implement techniques for

    reducing leakage power.

    The increase in complexity of the current-generation FPGAs

    has resulted in more number of transistors in the FPGAs which

    directly translate into increased leakage power. The resource

    Manuscript received April 21, 2005; revised November 19, 2005 andFebruary 22, 2006. This work was supported by the Natural Sciences andEngineering Research Council of Canada. This paper was recommended byAssociate Editor F. N. Najm.

    The authors are with the Department of Electrical and Computer Engineer-ing, University of Waterloo, Waterloo, ON N2L3G1, Canada.

    Digital Object Identifier 10.1109/TCAD.2006.882595

    utilization of FPGAs is just over 60%, and the unutilized

    parts also consume leakage power, which means that reducing

    leakage power is important both in the used and unused parts of

    the FPGA.

    This paper proposes the following for the design of a dual-Vt

    FPGA for reducing subthreshold leakage power.

    1) Dual-Vt FPGA architecture, which involves determining

    the distribution for the high- and low-Vt devices.

    2) Computer-aided design (CAD) framework for designing

    the dual-Vt FPGA. This CAD framework is used for

    determining the parameters of the dual-Vt FPGA design.

    3) Algorithms associated with the CAD framework.4) Analysis of different dual-Vt FPGA architectures.

    The remainder of this paper is organized as follows. In

    Section II, leakage reduction and the previous work are dis-

    cussed. Section III explains the static RAM (SRAM)-based

    FPGA architecture that is used in this paper. In Section IV,

    FPGA architectures based on dual-threshold-voltage method-

    ology are proposed, and the proposed dual-Vt FPGA CAD

    flow is discussed. Section V presents the implementation of the

    proposed dual-Vt FPGA CAD flow. Section VI discusses the

    results and the evaluation methodology. Section VII proposes

    some guidelines for designing a dual-Vt FPGA. Section VIII

    concludes the paper and outlines the future work.

    II. LEAKAGE REDUCTION AND PREVIOUS WOR K

    The subthreshold leakage current through a MOSFET can be

    modeled as [9]

    Isub = I0

    1 exp

    VdsVT

    exp

    Vgs Vt Voff

    n VT

    (1)

    where I0 is a constant dependent upon device parameters fora given technology, VT is the thermal voltage, Voff is theoffset voltage which determines the channel current at Vgs = 0,

    Vt is the threshold voltage, and n is the subthreshold swingparameter. Since the leakage current is exponentially depen-

    dent on the threshold voltage, increasing the threshold voltage

    would decrease the leakage current substantially. However, the

    high-threshold-voltage devices have larger switching delays.

    The various leakage-current mechanisms and some leakage-

    reduction techniques for CMOS circuits were discussed in [2]

    and [15]. The techniques for reducing leakage power involve

    static and dynamic approaches. The dynamic approach involves

    run-time decision making for leakage reduction. One such pop-

    ular technique is the use of sleep transistors in multi-threshold

    CMOS (MTCMOS) circuits for controlling the leakage power

    [18]. Several optimizations for MTCMOS circuits involving

    0278-0070/$25.00 2007 IEEE

  • 8/3/2019 fpga2

    2/14

    54 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 1, JANUARY 2007

    Fig. 1. Dual-Vt design implementation.

    the use of sleep transistors have been proposed, such as in

    [16] and [17]. However, this technique can reduce only the

    standby leakage power and lead to performance degradation.

    The static technique does not involve run-time decision making

    for leakage control. The dual-threshold-voltage-design tech-

    nique, which is a static approach, has been widely used in

    the custom VLSI designs for reducing leakage power. The

    dual-Vt implementation reduces both the active and standby

    leakages. Further, there is no performance degradation in a

    dual-Vt implementation for custom VLSI designs.

    The dual-threshold-voltage-design technique uses two kinds

    of transistors in the same circuit. Some transistors have a high

    threshold voltage, while other transistors have a low threshold

    voltage. The high-threshold-voltage transistors have less

    subthreshold leakage-power dissipation but also have a larger

    delay as compared to the low-threshold-voltage transistors.

    Fig. 1 shows the concept of dual-threshold-voltage implemen-

    tation in custom VLSI designs. Here, the gates on noncritical

    paths are assigned as high-Vt, and the gates on the criticalpath are assigned as low-Vt. The objective is to maximize the

    number of transistors having high threshold voltage without

    sacrificing the performance of the circuit. Several prior works

    have proposed algorithms which assign high- and low-Vt to

    the logic gates of the given circuit [1], [5], [19], [20]. However,

    the dual-threshold-voltage-design technique proposed in the

    literature for custom VLSI designs cannot be used for FPGAs.

    This is because the FPGAs are programmable and we do not

    know the circuit that would be eventually implemented on it

    and hence the delays through various paths of the circuit are

    not known. In this paper, we present a dual-Vt FPGA CAD

    flow for developing and evaluating different dual-Vt FPGAarchitectures. We do not know of any published work in which

    a CAD flow for evaluating and developing dual-Vt FPGA

    architectures has been proposed.

    The leakage-current mechanisms for the FPGAs are the same

    as that for the custom VLSI designs. The two main sources of

    leakage currents are the subthreshold and gate leakages. There

    have been very few works targeted at reducing the leakage

    power in FPGAs. The work in [3] used a technique based on the

    property that the leakage power consumed by a CMOS circuit is

    dependent on the state of its inputs and used the signal statistics

    to alter the state of the inputs in order to reduce the leakage

    power, in such a way that the functionality of the circuit does

    not change. However, this paper addressed only active leakage.Further, this paper addressed leakage reduction only in the used

    parts of the FPGA. The work in [6] approached the problem

    of reducing leakage power by dividing the FPGA fabric into

    small regions, with each of the regions being controlled by a

    sleep transistor which would be turned on/off depending upon

    whether that region is being used or not, thus reducing the

    leakage power. This technique leads to reduction of only the

    standby leakage power. The work in [7] explored the dual-Vt,body biasing, and gate biasing of NMOS pass transistors for

    reducing leakage power. The dual-Vt technique used in [7] was

    based on varying the percentage of high-threshold-voltage ele-

    ments in the routing resources of the FPGA and studying its ef-

    fect for a number of benchmarks. It did not specify the leakage

    savings obtained, and no detailed evaluation of several possible

    routing architectures was presented. Body-biasing technique

    requires control circuitry and generation of voltages for body

    biasing. Further, this technique leads to reduction in leakage

    savings with technology scaling [21]. The negative gate biasing

    of NMOS pass transistors has implementation issues and also

    leads to additional leakage-current component, called gate-

    induced drain leakage. The work in [24] used a dual-Vdd/Vt

    architecture to reduce leakage power and dynamic power. Only

    the configuration SRAM cells were assigned high-Vt to reduce

    leakage power. The dual-Vdd approach requires design of

    power supply network with two voltage levels and level con-

    verters which lead to area overhead and increased complexity.

    The dual-Vt technique and the CAD flow that we propose

    have the following advantages: 1) reduces both active and

    standby leakage powers; 2) provides a CAD framework for

    developing and evaluating a dual-Vt FPGA implementation;

    3) the inherent area penalty in using sleep transistors is not

    present in this design technique; 4) the dual-Vt architecture

    does not require any modification in the existing placement androuting tools from the perspective of the users.

    III. TARGETED FPGA ARCHITECTURE

    The FPGA architecture is very regular in structure. It is

    made up of two main componentslogic blocks (CLBs) and

    routing resources. The logic blocks implement the functionality

    of the given circuit, while the routing resources provide the

    connectivity for implementing the logic. The logic blocks and

    the routing resources are configurable. Although many types of

    architectures have been experimented with, the most popular

    one is the SRAM-based architecture which is described belowand is used in this paper [10].

    The logic block of the SRAM-based FPGA is lookup table

    (LUT)-based and is composed of basic logic elements (BLEs).

    LUT is an array of SRAM cells. Each BLE consists of a

    k-input LUT, flip-flop, and multiplexer for selecting the output

    directly from either the output of LUT or the registered output

    value of the LUT stored in the flip-flop. Fig. 2 shows the BLE.

    Previous works have shown that the four-input LUT is the most

    optimum size as far as logic density and utilization of resources

    are concerned, and this has been widely used. Cluster-based

    logic blocks were investigated in [10], and it was shown that

    the cluster-based logic blocks are better in speed and area. The

    structure of a cluster-based logic block is shown in Fig. 3. Inthe cluster-based logic block, the logic block is made up of

  • 8/3/2019 fpga2

    3/14

    KUMAR AND ANIS: DUAL-THRESHOLD CAD FRAMEWORK FOR SUBTHRESHOLD LEAKAGE POWER AWARE FPGAs 55

    Fig. 2. Basic logic element [10].

    Fig. 3. Cluster-based logic block [10].

    N BLEs. There are I inputs to the logic block, such that eachinput can connect to all the BLEs. Also, the output of each BLE

    can drive one of the inputs of each of the BLEs. The clock

    feeds all the BLEs. The work in [10] showed that the logicclusters containing four to ten BLEs achieve good performance.

    We consider each subblock to be made up of BLE and the

    corresponding LUT input multiplexers.

    The routing resources are again of various types, but the

    one used in this paper is the island-based architecture. In the

    island-based architecture, the routing resources form a meshlike

    structure with the horizontal and vertical routing channels.

    The horizontal and vertical routing channels are connected by

    switch boxes which are programmable and thus provide the

    flexibility in making the connections. The logic blocks are

    connected to the routing channels through the connection boxes

    which are also programmable. Fig. 4 shows the island stylerouting architecture [10]. Apart from the logic blocks and the

    routing resources, the clock distribution is assumed to have

    a dedicated network. Most of the commercial FPGAs have a

    structure similar to the one described above or some variant of

    the above architecture.

    We have used an industrial CMOS 0.13-m technology nodefor the technology dependent data for the FPGA architectures.

    The methodology proposed in [10] was used to extract the

    required technology dependent data. In extracting the delay data

    for the logic blocks, SPICE simulation was done. The delay

    through the subblock is the delay through the LUT input multi-

    plexer, LUT, flip-flop, and multiplexer for selecting the output

    of the flip-flop or LUT. The sizing of the buffers was done forequal rise and fall times, as proposed by the study in [10]. We

    have assumed that metal 3 is used for the routing wires. The

    sizing of the routing switches was done for minimum area delay

    product, as proposed by the study in [10]. We have used the

    regular MOS models for low-threshold voltage implementation.

    The Vdd for this technology is 1.2 V, and Vt is 0.2 V. The

    high-Vt transistors have threshold voltage 100 mV higher than

    the low-Vt transistors [27]. SPICE simulation shows that, uponincreasing the high-Vt value further, the delays of the subblocks

    increase without leading to any significant leakage savings.

    Since the SRAM cells do not contribute to any run-time delay,

    we assume that they are implemented with high-Vt transistors

    to reduce the leakage power, and hence, our results for leakage

    savings do not include any leakage savings from the SRAM

    cells [4].

    IV. PROPOSED DUAL -T HRESHOLD FPGA ARCHITECTURE

    The very nature of programmability of FPGAs makes the

    dual-Vt design technique for FPGAs different from dual-Vtcustom VLSI designs. We propose dual-Vt FPGA architectures

    and then use a CAD framework for determining the parameters

    and subsequently evaluating the proposed dual-Vt FPGA archi-

    tectures. This paper targets the logic elements within the CLBs

    and the routing resources as candidates for dual-Vt assignment.

    We evaluate the following architectures:

    1) homogenous architecture with only subblocks within

    CLBs considered for dual-Vt assignment (arch1);

    2) homogenous architecture with subblocks and routing re-

    sources considered for dual-Vt assignment (arch2);

    3) heterogenous architecture with only subblocks consid-

    ered for dual-Vt assignment (arch3);

    4) heterogenous architecture with the subblocks and routing

    resources considered for dual-Vt assignment (arch4).

    The proposed homogenous and heterogenous architectures are

    described as follows.

    A. Homogenous Dual-Vt FPGA Architecture

    This proposed architecture has a fixed number of high- and

    low-Vt subblocks in each of the CLBs. For a cluster size of

    N, M(M < N) subblocks are assigned as high-Vt, whereas(N M) subblocks are assigned as low-Vt in each CLB. Forexample, Fig. 5 shows a homogenous dual-Vt FPGA archi-

    tecture in which each CLB has 50% high- and 50% low-Vt

    subblocks for a cluster size of four. In the case of arch1,

    we consider only the subblocks as candidates for dual-Vt

    assignment. We extend this architecture to include the routing

    resources as candidates for dual-Vt assignment, which leads to

    the architecture arch2. The switches in the connection blocks

    and the switch blocks, which drive the routing segments, are

    considered for dual-Vt assignment. In the architecture arch2,

    certain fraction of these switches are assigned as high-Vt. Fig. 6

    shows the switch block and the associated routing switches.

    The connection block has similar switches for providing con-nections between CLBs and routing segments.

  • 8/3/2019 fpga2

    4/14

    56 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 1, JANUARY 2007

    Fig. 4. Island style routing architecture [10].

    Fig. 5. Proposed homogenous FPGA architecture. Each CLB has a fixed ratioof the high- and low-Vt subblocks.

    B. Heterogenous Dual-Vt FPGA Architecture

    This proposed architecture has two kinds of CLBs distributed

    uniformly throughout the array. The first type of CLBs has all

    high-Vt subblocks, and the second type has some high- and

    low-Vt subblocks. These two kinds of logic blocks are distrib-

    uted uniformly throughout the array in such a way that the arrayis regular. If the logic-block cluster size is N, then the proposedFPGA architecture would have one of the two kinds of CLBs

    with all N high-Vt subblocks (type1), whereas the second kindof logic blocks would have M high-Vt subblocks (M < N)and N M low-Vt subblocks. Fig. 7 shows the distributionof the two kinds of the logic blocks in the FPGA, such that

    50% of the logic blocks are of type1, whereas the other 50%

    of the logic blocks are of type2 for a cluster size of four. We

    selected such an architecture because the number of subblocks

    assigned as high-Vt was very high and so a large percentage

    of logic blocks can possibly have all high-Vt subblocks. This

    heterogenous architecture is extended to include the routing

    resources for dual-Vt assignment, leading to the architecturearch4, in the same way as arch1 is extended to arch2.

    Fig. 6. Switch block. (a) Overall architecture of a switch block. (b) Bufferedpass-transistor switch. (c) Pass-transistor-based switch.

    C. Proposed Dual-Vt FPGA CAD Framework

    Developing and evaluating dual-Vt FPGA architectures, as

    outlined previously, would require a CAD framework which

    can perform dual-Vt assignment to the subblocks and therouting resources and has a methodology for analyzing different

  • 8/3/2019 fpga2

    5/14

    KUMAR AND ANIS: DUAL-THRESHOLD CAD FRAMEWORK FOR SUBTHRESHOLD LEAKAGE POWER AWARE FPGAs 57

    Fig. 7. Proposed heterogenous FPGA architecture. Two kinds of CLBs, onehaving all high-Vt subblocks and the other having a fixed ratio of the high- andlow-Vt subblocks.

    dual-Vt FPGA architectures. The typical existing FPGA CADflow within the framework of versatile place and route (VPR),

    for placing and routing a given netlist, and the proposed generic

    dual-Vt FPGA CAD flow are shown in Fig. 8. The proposed

    dual-Vt FPGA CAD flow has been developed using the widely

    used academic research tools VPR and T-Vpack [10] and a

    state-dependent leakage-power model that we developed for

    FPGAs [11] and integrated with the power model proposed

    by the study in [13]. The salient features of the leakage-power

    model developed by us are the following.

    1) It is a state-dependent model which takes into account the

    dependencies of leakage power on the state of the input.

    2) It models both the gate and subthreshold leakages.3) It takes into account the various short-channel effects to

    accurately model the leakage.

    4) It is based on analytical equations from BSIM4.

    The leakage-power model takes into account the state de-

    pendence of leakage power by considering the probability

    of states for each of the circuit element. Consider a cir-

    cuit element which has n states and the probability of thestates are Prob1, Prob2, . . . , Probn, such that

    n

    i=1 Probi = 1.The leakage power for different states are given as

    Pleak1, Pleak2, . . . , Pleakn. The average leakage power canthen be written as

    Pavgleak =n

    i=1

    Probi Pleaki. (2)

    We consider both the PMOS and NMOS in various circuit el-

    ements as the candidates for subthreshold leakage, but we con-

    sider only the NMOS transistors as candidates for gate leakage

    because the gate leakage in PMOS is considerably smaller than

    NMOS [12]. Furthermore, the back gate leakage of the NMOS

    transistors is ignored, and we consider the gate current only

    from the gate to channel which then gets partitioned and flows

    into the source and drain of the transistor, as shown in Fig. 9(a).

    The methodology for computing the leakage through aninverter in the FPGA leakage-power model developed by us is

    as follows. We model the subthreshold leakage of the inverters

    in both the states, i.e., when the input is zero and when the input

    is one, and the gate leakage of the inverter when the input is

    one. With the input at zero, only the subthreshold leakage flows

    through the NMOS of the inverter and the gate leakage through

    PMOS is ignored, as shown in Fig. 9(b). When the input to

    the inverter is one, there is subthreshold leakage through thePMOS and gate leakage through the NMOS. In similar ways,

    the leakage power through other FPGA circuit elements are

    computed.

    VPR is used for placement and routing of the benchmark

    circuits. After the placement and routing of the given circuit,

    the power model computes the probability of states for each

    node of the circuit. For computing the probability of the nodes

    of the circuit, we have used the work done in [13]. After

    the probability of states for each of the nodes is computed,

    the power model looks at each of the circuit elements of the

    FPGA and computes the leakage for each of the states of that

    element using the leakage computation engine (LCE). We have

    implemented the LCE, which computes the leakage for each of

    the circuit element of the FPGA based on a given input vector

    (and the state of the SRAM cells, if they are present in the

    given circuit element). The LCE is basically a library of state-

    dependent leakage calculation functions. This library has the

    basic leakage equation for the subthreshold and gate leakages,

    and the computation of the associated parameters. It also has the

    models for computing the leakage for each of the FPGA circuit

    elements for a given input vector. For a low-Vt subblock with

    all the inputs set to zero, the total leakage is 3.46 nA, whereas

    for a high-Vt subblock, the leakage is 0.139 nA. For a low-Vt

    buffered pass-transistor routing switch, with the input as zero,

    the leakage is 0.933 nA, whereas for the high-Vt switch, theleakage is 0.0156 nA.

    The typical FPGA CAD flow involves circuit optimization

    using sequential interactive synthesis (SIS) [22] and technology

    mapping to LUTs using FlowMap [23]. T-Vpack is then used

    to do the packing and clustering, which generates the netlist file

    of logic blocks. VPR is then used for placement and routing.

    The proposed dual-Vt FPGA CAD framework has six stages

    as shown. The next section describes various stages of the

    CAD flow.

    V. CAD FRAMEWORK IMPLEMENTATION

    We modified the VPR to support the dual-Vt assignment to

    the subblocks and the T-Vpack to support reclustering. Each

    of the stages of the dual-Vt FPGA CAD flow [Fig. 8(b)] is

    explained below.

    A. Stage 1

    In this stage, the .net file is generated, which is a netlist of

    logic blocks, along with the activity and function files required

    for the power estimation. The .net file is generated from the

    .blif (Berkeley logic interchange format) file based upon the

    specified FPGA architecture, such as cluster size, inputs percluster, LUT size, etc., by T-Vpack. The activity and function

  • 8/3/2019 fpga2

    6/14

    58 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 1, JANUARY 2007

    Fig. 8. (a) Typical FPGA CAD flow within the VPR and T-Vpack framework. (b) Proposed generic dual-Vt FPGA CAD flow.

    file generations are a part of the power model framework [11],

    [13]. These files serve as inputs to the second stage.

    B. Stage 2

    The second stage is a delay estimation stage, which is used

    to assign high- and low-Vt to various subblocks, based onthe slacks available for each path. The delay calculation can

    either be accurate or be based upon some estimation. In this

    paper, we have used accurate delay estimation based on the

    actual placement and routing of the benchmark. In this stage,

    it is assumed that all the subblocks are low-Vt, and the delay

    calculation is done based on the low-Vt delay values for the

    subblocks and routing resources. VPR does the placement

    and routing of the benchmark for delay calculation. Since the

    leakage-power model is integrated into the VPR, we calculate

    the power for the single low-threshold-voltage implementation

    of the FPGA for the benchmark, immediately after placement

    and routing. This leakage-power-consumption result is used inthe final stage of the CAD flow for comparison with the leakage

    power consumed by the dual-Vt implementation and, hence,

    calculation of leakage savings.

    C. Stage 3

    Initially, it is assumed that all the subblocks are low-Vt

    except the configuration SRAM cells which are all high-Vt

    for both the single low-Vt and dual-Vt implementations. Based

    on the delays of various paths of the circuit computed in the

    previous stage, the dual-Vt assignment algorithm assigns high-

    Vt to various subblocks. The dual-Vt assignment algorithm is

    shown in Algorithm 1 [1]. The algorithm works on the timinggraph created by the VPR for placement and routing. This

    Fig. 9. (a) Gate leakage in NMOS. (b) Subthreshold leakage in an inverter.

    timing graph contains the information for the delays between

    various pins of the circuit. High-Vt assignment to a subblock

    is done based on the slack available at the output pin of the

    subblock. The assignment of high-Vt would change the delay

    through the subblock. This delay value is updated in the timing

    graph, and the next subblock is then considered for high-Vtassignment. It should be noted that the dual-Vt assignment is

    not constrained to keep the CLBs identical in terms of the

    number of high- and low-Vt subblocks in each CLB. This kind

    of dual-Vt assignment, which is the ideal assignment, would

    result in a dual-Vt FPGA implementation which would not have

    any performance degradation. This dual-Vt assignment gives a

    theoretical upper limit for the number of subblocks assigned as

    high-Vt. This dual-Vt assignment methodology is the same as

    the dual-Vt assignment for custom VLSI designs. Therefore,

    the dual-Vt assignment produces an output in which there

    are different numbers of high- and low-Vt subblocks across

    different CLBs. This is, however, not a feasible solution. This

    information is needed by the next reclustering stage to create afeasible implementation for the dual-Vt FPGA architectures.

  • 8/3/2019 fpga2

    7/14

    KUMAR AND ANIS: DUAL-THRESHOLD CAD FRAMEWORK FOR SUBTHRESHOLD LEAKAGE POWER AWARE FPGAs 59

    Algorithm 1 Dual-Vt assignment [1]

    Using timing - graph in VPR (after Place and Route)

    for each subblockdo

    Compute the slack availableifslack > 0 then

    Assign high-Vt to the subblock

    Recompute the slackifslack < 0 then

    Re-assign low-Vt to the subblock

    end if

    end if

    end for.

    D. Stage 4

    This stage works within the framework of T-Vpack. For

    the homogenous architectures (arch1 and arch2), we estimate

    the number of high-Vt subblocks per CLB based upon the

    results of the dual-Vt assignment to subblocks in the previousstage. Based on these results, we evaluate the homogenous

    architecture with a fixed number of high-Vt subblocks per CLB.

    The heterogenous architectures (arch3 and arch4) are devel-

    oped and evaluated in the following way based on the algorithm

    shown in Algorithm 2. We have modified the clustering algo-

    rithm proposed by the study in [10]. The clustering algorithm in

    [10] picks up a seed BLE and then tries to build up the cluster by

    attracting the other BLEs based on an attraction function. The

    attraction function used in [10] to add a BLE B to a currentcluster C is shown in the following:

    Attraction= Criticality(B)+(1)

    Nets(B)

    Nets(C)MaxNets

    (3)

    where Criticality(B) represents that part of the attractionfunction which tries to reduce the delay of the circuit. For

    computing the delay of the various paths in the circuit during

    the packing stage, the intracluster delay is assumed to be 0.1,

    logic delay is assumed to be 0.1, and intercluster delay is

    assumed to be one [10]. For the modified clustering algorithm,

    we set the delay of high-Vt BLEs to 0.2 to account for the

    increased delays of high-Vt BLEs. The exact values of these

    delays are not very important as long as they represent the

    correct trend [10]. The methodology for criticality computationis explained in [10], and its value lies between zero and one. The

    second part of the function tries to cluster the BLE which share

    maximum nets with the current cluster. To take into account the

    high-Vt BLEs and increase the type1 (all high-Vt) clusters the

    modified cost function is shown in the following:

    Attraction = Criticality(B)

    + (1 )

    Nets(B) Nets(C)

    MaxNets

    + Hvt NumHvt + (1 Hvt) NumLvt

    ClusterSize(4)

    where the third part of the equation denotes the attraction of a

    BLE to a cluster based on the number of high-/low-Vt BLEs

    in the current cluster. The parameter Hvt takes a value of

    one for a high-Vt BLE and zero for a low-Vt BLE. NumHvt

    denotes the number of high-Vt BLEs in the current cluster, and

    NumLvt denotes the number of low-Vt BLEs in the current

    cluster. The value ofdepends upon the benchmark, but a valuebetween 0.01 and 0.05 was found to give good results for all thebenchmarks. The value of was set to 0.8.

    The number of low-Vt subblocks in the type2 logic blocks is

    determined as follows:

    NumLvtBLEs = ClusterSize

    SumLvtCrit

    NumFracLB

    (5)

    where SumLvtCrit denotes the sum of criticality of all low-Vt

    subblocks, and NumFracLB denotes the number of logic blocks

    which have at least one low-Vt subblock so that they can be

    categorized into type2 logic block. We use this equation to de-

    termine the number of low-Vt subblocks in type2 CLBs becausethe criticality measure takes into account the importance of a

    subblock with regard to delay, which directly translates to the

    threshold voltage of the transistors of the subblock. The next

    step is to assign high-/low-Vt to subblocks in type2 logic blocks

    so that each of the type2 logic blocks have the same number of

    low-Vt subblocks. We start by arranging all the subblocks in

    a logic block in the ascending order of criticality. If the logic

    block has more low-Vt subblocks than NumLvtBLEs, the BLEs

    are assigned high-Vt, starting with the lowest critical low-Vt

    subblock, until the number of low-Vt subblocks is equal to

    NumLvtBLEs. If the logic block has low-Vt subblocks less

    than NumLvtBLEs, some of the subblocks are assigned low-

    Vt, starting with the highest critical high-Vt subblock, so thatmaximum gain in performance might be achieved. This method

    of assigning high-/low-Vt to subblocks during the reclustering

    stage is also used for the homogenous architecture, such that

    the value NumLvtBLEs is used for all the logic blocks.

    Algorithm 2 Recluster

    Do clustering based on T-Vpack algorithm with modified

    attraction function

    for each logic blockdo

    ifAll-subblocks - high- Vt then

    Mark the logic block as type1

    elseMark the logic block as type2

    end if

    end for

    Determine the number of low-Vt subblocks in type2 CLB

    based on (5)

    for Each type2 logic blockdo

    Arrange the subblocks in ascending order of criticality

    ifNo.-of-low-Vt-subblocks < NumLvtBLEsthen

    while No.-of-low-Vt-subblocks < NumLvtBLEs doRe-assign a high-Vt subblock to low-Vt based on

    criticality (Highest critical high-Vt subblock as-

    signed low-Vt first)end while

  • 8/3/2019 fpga2

    8/14

    60 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 1, JANUARY 2007

    else

    while No.-of-low-Vt-subblocks > NumLvtBLEs doRe-assign a low-Vt subblock to high-Vt based on

    criticality (Lowest critical low-Vt subblock as-

    signed high-Vt first)

    end while

    end ifend for.

    This stage produces an output which gives an estimate of the

    number of type1 and type2 CLBs for the heterogenous archi-

    tectures (arch3 and arch4) and number of low-Vt subblocks in

    the type2 CLBs.

    E. Stage 5

    Homogenous architectures : For the architecture arch1, place-

    ment and routing is done in this stage. For realizing the architec-

    ture arch2, a certain fraction of routing resources are assignedhigh-Vt, and then, placement and routing is done.

    The high-Vt assignment to routing switches is done as

    follows. Consider a routing channel width of 100 with two

    kinds of segments, such that 50 segments are driven by pass-

    transistor switches in the switch box and 50 segments are driven

    by buffered pass-transistor switches, as defined by the VPR

    architecture file. The output pins of the CLB drive all these

    segments with a buffered switch in the connection box. When

    we assign high-Vt to, say, 40% of routing segments driven by

    pass-transistor switches, it would mean that 20 segments of

    those 50 segments will be driven by high-Vt pass-transistor

    switches. Also, all the output pin switches of the CLBs, which

    drive these 20 segments, will be assigned high-Vt.We assume that a gate boosting for these switches are done,

    which would minimize the voltage degradation when a logic

    one is propagated [10]. Other techniques, such as the use of

    level restorer, can be used to eliminate voltage degradation in

    the routing switches [25]. SPICE simulation shows that logic

    level one passing through a high-Vt pass transistor with gate

    boosting to 1.5 V (Vdd = 1.2 V) degrades to only 1.198 from1.2 V. The SPICE simulation results indicate that, with gate

    boosting only 1.5%, extra static leakage current flows in the

    driven buffer through a high-Vt pass transistor as compared

    to the case when there is no voltage degradation. Therefore,

    we consider that the static power dissipation due to voltagedegradation in high-Vt pass-transistor switches is sufficiently

    small.

    Heterogenous architectures: Based on the reclustering re-

    sults, we determine, for a given FPGA array, the number of

    type1 and type2 CLBs. We estimate these values based on

    the ratio of type1 and type2 logic blocks in the reclustered

    netlist, after providing sufficient margin to account for the

    performance degradation because of the physical restriction

    of FPGA regularity. These CLBs are assumed to be distrib-

    uted evenly throughout the FPGA array. For the architecture

    arch3, constrained placement followed by routing is done. For

    arch4, a certain fraction x of the total routing switches are

    assigned high-Vt, and then, constrained placement followedby routing is done. We vary the value of x to determine the

    leakage savings and performance tradeoffs. The algorithm for

    constrained placement is shown in Algorithm 3. Since the VPR

    uses simulated annealing for timing-driven placement, which

    is a very effective placement technique, we have not modified

    the basic placement algorithm. However, we have incorporated

    delays of the high- and low-Vt parts of the CLBs into the

    timing graph accordingly, so that VPR can optimize the overallplacement for performance.

    Algorithm 3 Constrained Placement

    P-ratio = (Num type1 CLBs)/(Num type2 CLBs)Determine physical locations for the two types of CLBs

    based on P-ratio

    Assign high-Vt to frac fraction of routing switches

    Initial Random placement for all the CLBs

    Start placement based on VPR placement algorithm

    Allow the type2 CLBs to be placed on physical locations

    meant for type1 CLBs and vice versa

    End Placement

    for each CLB do

    ifLogic-Block-not-on-same-type-CLB then

    Convert the logic block to type1 or type2 according

    to its placement

    end if

    end for.

    Since the VPR router is timing driven, it optimizes the overall

    routing. The delays of the high-Vt switches would be more

    than the delays of the low-Vt switches, which would force the

    router to use more of low-Vt switches. Therefore, we provide

    the increased delays of the high-Vt routing switches to the VPR,

    so that the router can optimize the overall routing.

    F. Stage 6

    This stage computes the results obtained from the previous

    stages of the CAD flow. After the computation of delays and

    power for the dual-Vt FPGA implementation is over, this stage

    computes the leakage-power savings obtained and the delay

    penalty for the benchmark.

    VI. EVALUATION, RESULTS, AN D DISCUSSIONS

    This section describes the realization and evaluation of dif-ferent FPGA architectures and their comparison. The method

    of developing and evaluating these architectures is by using a

    number of benchmarks on the proposed dual-Vt FPGA CAD

    flow to find out which subblocks can be high-Vt for maxi-

    mizing leakage savings and reducing the delay penalties. It

    would be worthwhile to point out the difference between the

    physical design of the FPGA and the mapping of an appli-

    cation on the FPGA. The physical resources of an FPGA are

    fixed, and the applications are mapped later on. The dual-

    Vt FPGA CAD flow that we outlined above is meant for the

    actual physical design of the FPGA and, later on after the

    physical design of the FPGA is complete, for the design of

    associated CAD tools. This dual-Vt FPGA CAD flow is to beused (along with other CAD tools) during the design of the

  • 8/3/2019 fpga2

    9/14

    KUMAR AND ANIS: DUAL-THRESHOLD CAD FRAMEWORK FOR SUBTHRESHOLD LEAKAGE POWER AWARE FPGAs 61

    TABLE ILEAKAGE SAVINGS WIT H LOGIC BLOCKS ASSIGNED AS DUAL -VT FOR A CLUSTER SIZE OF 12 FOR

    HOMOGENOUS AND HETEROGENOUS ARCHITECTURES

    FPGA, where a number of benchmarks are evaluated before a

    design for an FPGA is finalized. This CAD flow is not meant

    for mapping of an application on to an FPGA chip, as all

    the logic blocks and routing resources would then be fixed,

    which would be a routine job using the standard FPGA CADtools. Therefore, the dual-Vt FPGA CAD flow is meant for

    developing a dual-Vt FPGA by determining its architectural

    parameters.

    A. Estimating Power Savings

    We use the power model proposed in [11] for estimating the

    leakage-power savings. The total power consumption can be

    divided into two parts, active power and standby power. The

    total average power can be written as

    Pavg =tact Pact + toff Poff

    tact + toff (6)

    T = tact + toff (7)

    where Pavg, Pact, and Poff are the average, active, and standby(off) power. For personal wireless communication systems,

    typically, the standby time or off time (toff) is 90% of the totaltime (T), and active time (tact) is 10% of the total time. Duringthe active time, the components of power dissipation can be

    written as

    Pact = [Pdyn + Psckt + Pactleak]used + [Pactleak]unused (8)

    where Pdyn, Psckt, and Pactleak are the dynamic, short circuit,and active leakage-power consumptions, respectively. Poff is

    the standby-leakage-power consumption of the FPGA, because

    during the standby mode, only the leakage power is dissipated.

    Hence, reducing leakage power during the standby mode for

    mobile applications would increase the battery life significantly.

    The dual-Vt technique reduces both the standby and activeleakages, and hence, it is an attractive method to reduce sub-

    threshold leakage.

    B. Evaluation Methodology

    For obtaining the results from the benchmarks, we used the

    LUT size of four and a cluster size of 12. It was shown in [8]

    that a LUT size of four and a cluster size of eight or 12 leads to

    minimization of total power. The inputs per cluster was chosen

    as K/2(N + 1), where N is the number of subblocks percluster and K is the LUT size [26]. We used the default routing

    architecture present in the FPGA architecture file of VPRhaving 50% routing segments with pass-transistor switches and

    50% routing segments with buffered pass-transistor switches.

    For computing the leakage savings and performance tradeoffs,

    a fixed-size FPGA array of 20 20 was used for all the

    benchmarks, except bigkey, clma, des, dsip, and s38584.1, for

    which a 35 35 array was used. A routing channel width of

    100 was used for the FPGA architecture.

    Table I shows the results of the dual-Vt assignment. Based

    on these results, homogenous architectures with four, six, and

    eight high-Vt subblocks per CLB were considered. The results

    shown for the number of type1 CLBs and low-Vt subblocks in

    type2 CLBs in Table I correspond to a minimum-sized FPGA

    (required to fit a benchmark), and these values are ideal valueswhich does not consider the regularity of the structure of the

  • 8/3/2019 fpga2

    10/14

    62 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 1, JANUARY 2007

    FPGAs. This kind of dual-Vt assignment leads to an FPGA

    which would have no delay penalty. This is same as the dual-Vt

    assignment for ASICs, in which there is no restriction with

    regard to the logic elements that can be assigned as high-Vt.

    This kind of dual-Vt assignment leads to nonidentical logic

    blocks (in terms of high- and low-Vt subblocks in each logic

    block) in the FPGA, and the reclustering stage estimates theparameters of a uniform dual-Vt FPGA (in which all the logic

    blocks are identical with respect to number of the high- and

    low-Vt subblocks), based on the results of the dual-Vt assign-

    ment. Hence, to provide sufficient margin and prevent severe

    degradation of final placement and routing imposed because of

    the limitation that the FPGA needs to be regular and uniform

    in structure, we set the values for type1 CLBs and number of

    low-Vt subblocks in type2 CLBs as 50% and 10, respectively.

    C. Realizing and Evaluating Different FPGA Architectures for

    Leakage Savings and Design Tradeoffs

    1) arch1: This is a homogenous architecture in which all

    the logic blocks are identical, such that each logic block

    has a fixed ratio of the high- and low-Vt subblocks. Based

    upon these results for dual-Vt assignment, we evaluated three

    different homogenous architectures with CLBs having eight,

    six, and four high-Vt subblocks in each CLB. After assigning

    high-Vt to, say, eight subblocks in each CLB, we placed and

    routed different benchmarks to evaluate the leakage savings

    and performance tradeoffs. Table I shows the overall average

    leakage saving results for these three different cases. The mean

    performance penalties were 3.7%, 4.8%, and 5.04% for the

    architectures with four, six, and eight high-Vt subblocks in

    each CLB, respectively. The delay penalties are close to eachother because of two reasons: 1) There are sufficient number

    of low-Vt subblocks for the critical path(s) and 2) the routing

    contributes to a larger portion of total delay, which means that,

    even if there are few high-Vt subblocks on the critical path, this

    would not cause a large delay penalty.

    2) arch2: This is the homogenous architecture with routing

    resources considered for dual-Vt assignment. The evaluation

    methodology is same as that of the arch1 case, except that we

    vary the fraction of the two kinds of routing resources assigned

    as high-Vt and estimate the leakage savings and the delay

    tradeoffs. The evaluation was done for the three cases, viz.,

    FPGA architecture with four, six, and eight high-Vt subblocksper CLB for a cluster size of 12.

    Fig. 10 shows the leakage savings and the performance

    tradeoffs when the fraction of the high-Vt switches are varied

    over the base architecture having six high-Vt subblocks per

    CLB. Fig. 10(a) shows the leakage savings and delay penalty

    when only the buffered switches are considered for high-Vt

    assignment along with the CLB output pin switches which can

    drive these segments (i.e., those segments which can be driven

    by these high-Vt buffered switches), all the pass-transistor

    switches remaining as low-Vt. Fig. 10(b) shows the leakage

    savings and delay penalty when pass-transistor switches are

    considered for high-Vt assignment along with the CLB output

    pin switches which can drive these segments (i.e., thosesegments which can be driven by these high-Vt pass-transistor

    Fig. 10. Leakage savings for arch2 with six high-Vt subblocks per CLB.(a) Buffered switches assigned as high-Vt. (b) Pass-transistor switches assignedas high-Vt.

    switches), all the buffered switches remaining as low-Vt.

    The leakage savings increases considerably when the routing

    resources are assigned as high-Vt with some impact on the per-

    formance. This is because a major part of leakage comes from

    the routing resources and is discussed further in Section VI-E.

    Leakage savings for both the cases, when the pass-transistor

    switches are assigned as high-Vt and buffered switches are

    assigned as high-Vt, are almost the same. Although the buffered

    switches have more number of devices but the pass-transistor

    switch is eight times the minimum size, the pass transistor in

    the buffered switch is two times the minimum size and the

    buffer is two times the minimum sized buffer, which leads toalmost similar leakage savings. The delay penalties for these

    architectures vary just between 10.7% and 3.32%. This can be

    attributed to the fact that a large percentage of routing switches

    are unutilized [4], and therefore, they do not contribute to delay

    penalties even if they are assigned as high-Vt. Further, the rout-

    ing optimization effort of the VPR would also increase, tending

    to use more of the low-Vt switches, which compensates, to

    some extent, the increased delay of the high-Vt switches.

    3) arch3: This is a heterogenous architecture in which there

    are two kinds of CLBs in the FPGA. Table I shows the leakage

    savings for this architecture. Since the FPGA architecture needs

    to be regular in structure, the two kinds of CLBs are distributed

    uniformly throughout the array. Based on the clustering results,type1 CLBs were fixed to 50% of the total CLBs, and type2

  • 8/3/2019 fpga2

    11/14

    KUMAR AND ANIS: DUAL-THRESHOLD CAD FRAMEWORK FOR SUBTHRESHOLD LEAKAGE POWER AWARE FPGAs 63

    Fig. 11. Leakage savings forarch4. (a) Buffered switches assigned as high-Vt.(b) Pass-transistor switches assigned as high-Vt.

    CLBs had ten low-Vt subblocks for this architecture. The num-ber of low-Vt subblocks in type2 CLBs was fixed by averaging

    this value for all the benchmarks. The ratio of type1 and type2

    CLBs was fixed by averaging and providing sufficient margin

    to prevent performance degradation for all the benchmarks. The

    results shown in this paper were obtained after fixing these

    values for all the benchmarks. Table I shows that the overall

    mean leakage savings for this architecture is 19.3%. The mean

    delay penalty for this architecture is 6.9%.

    4) arch4: This architecture is realized using the architecture

    arch3. The base architecture is the same as that of arch3,

    and over that, architecture routing resources are also assigned

    as high-Vt to realize the architecture arch4. The evaluationmethodology is the same as that for arch3, except that the

    fraction of high-Vt routing resources is varied and the leakage

    savings and design tradeoffs are calculated. Fig. 11(a) shows

    the leakage savings and delay penalty when only the buffered

    switches are considered for high-Vt assignment along with the

    CLB output pin switches which can drive these segments (i.e.,

    those segments which can be driven by these high-Vt buffered

    switches), all the pass-transistor switches remaining as low-Vt.

    Fig. 11(b) shows the leakage savings and delay penalty when

    pass-transistor switches are considered for high-Vt assignment

    along with the CLB output pin switches which can drive these

    segments (i.e., those segments which can be driven by these

    high-Vt pass-transistor switches), all the buffered switchesremaining as low-Vt. The delay penalties vary between 12.4%

    and 4.8% for different fractions of high-Vt switches. The in-

    crease in delay penalty is not too large when routing switches

    are assigned high-Vt because a large fraction of routing re-

    sources are unutilized. We observe that the leakage savings for

    this architecture is about 3% greater than of the arch2 (six high-

    Vt subblocks per CLB), with almost the same delay penalty.

    D. Design Tradeoffs

    Before deciding upon any dual-Vt FPGA architecture, we

    need to consider the design tradeoffs. In this paper, we eval-

    uated two types of architectureshomogenous and heteroge-

    nous. The design tradeoffs are the delay penalties and the

    impact on overall power. We evaluated these for all the FPGA

    architectures. It was observed that the dynamic power dissipa-

    tion remains almost constant for all the architectures.

    A dual-Vt custom VLSI design does not lead to any delay

    penalty. However, because of the very nature of the program-

    mability of the FPGAs, there would be some delay penalty

    associated with a circuit implemented on a dual-Vt FPGA

    as compared to a single low-Vt FPGA. The delay penalties

    would vary with the benchmark. Essentially, the delay penalty

    occurs because the critical path of a circuit in the dual-Vt

    FPGA changes from that in the single low-Vt FPGA. There

    are three sources of delay penalties in a dual-Vt FPGA:

    1) the increased delays of the subblocks, some of which might

    lie on the critical path; 2) the altered placement and routing

    in the dual-Vt FPGA because of the different delays through

    the subblocks which impacts the placement and, subsequently,

    routing; and 3) increased delays of high-Vt routing switches,

    some of which might lie on the critical path. Table II shows the

    leakage savings and the delay penalties for arch2, arch4, and allhigh-Vt subblocks architectures. The delay penalties for arch2

    and arch4 are comparable. These architectures have a very high

    percentage of high-Vt switches and would be most susceptible

    to performance degradation. However, the mean delay penalty

    for arch2 with 80% high-Vt pass-transistor switches is 6.4%,

    with the maximum delay penalty being 15.3% for a mean leak-

    age savings of 40.3%. For arch4 with 80% high-Vt switches, the

    mean delay penalty is 5.8%, with the maximum delay penalty

    being 13.6% for a mean leakage savings of 43.5%. We have

    also shown the results of all high-Vt architecture, where all

    the subblocks in all the CLBs are high-Vt. The mean leakage

    savings is 32.6%, and the mean delay penalty is 19%, withthe maximum delay penalty as high as 48.5%. We see that

    keeping all high-Vt subblocks leads to severe performance

    degradation in many of the benchmarks. This is because, in

    the architectures which contain low-Vt subblocks, the placer

    and the router can search for low-Vt subblocks to minimize

    the delay, which is not possible in the case of architecture with

    all high-Vt subblocks. The performance degradation in the all

    high-Vt subblocks is more pronounced for benchmarks which

    have a large number of subblocks on the critical path. The

    benchmark diffeq, which has a severe delay penalty of 48.5%,

    has 30 subblocks on the critical path for the implementation on

    the all high-Vt subblocks architecture. The benchmark bigkey

    has only six subblocks on the critical path, and the routing ofthe VPR compensates for the increased delay, which leads to

  • 8/3/2019 fpga2

    12/14

    64 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 1, JANUARY 2007

    TABLE IIDESIGN TRADEOFFS FOR HOMOGENOUS ARCHITECTURE arch2 WITH SIX HIG H-VT SUBBLOCKS AND 80% HIG H-VT PAS S-TRANSISTOR SWITCHES,

    HETEROGENOUS ARCHITECTURE arch4 WIT H 80% HIG H-VT PAS S-TRANSISTOR SWITCHES, AN D ALL HIG H-VT SUBBLOCKS ARCHITECTURE

    Fig. 12. (a) Leakage contributions of routing resources, logic resources, and SRAM cells for single low-Vt implementation and (b) and (c) after dual-Vtimplementation for alu4.

    almost no delay penalty. Hence, having an architecture with all

    high-Vt subblocks would not be a good design.

    There is almost no area penalty as the dual-Vt technique

    inherently does not lead to any area penalty. In the experimental

    results for all the benchmarks, we kept the routing channel

    width and the number of CLBs the same for both the baseline

    single-Vt FPGA architecture and the dual-Vt FPGA architec-

    tures. This shows that there is no area penalty associated with

    the proposed design of the dual-Vt FPGA architectures.

    E. Distribution of Leakage Savings

    In all these evaluations, we have assumed high-Vt SRAM

    cells for both the single low-Vt and dual-Vt implementations.

    The leakage savings is solely from the logic and routing parts

    of the FPGA. Fig. 12 shows the distribution of leakage among

    the logic blocks and the routing resources. Fig. 12(a) shows the

    leakage distribution in the single low-Vt implementation for the

    benchmark alu4. It can be seen that almost two thirds of leakage

    is from the routing resources. This is because of a large amount

    of unused routing resources [4]. SRAM leakage is negligible as

    they have high-Vt transistors to reduce the leakage. Assigning

    dual-Vt to the logic blocks reduces the leakage component ofthe logic blocks, as shown in Fig. 12(b). Fig. 12(c) shows the

    Fig. 13. Leakage power for alu4. (a) Single low-Vt implementation.(b) Dual-Vt arch3. (c) Dual-Vt arch4 with 60% high-Vt pass-transistorswitches.

    impact on leakage distribution as a result of dual-Vt assignment

    to both the logic blocks and the routing resources. The contri-

    bution of routing leakage to total leakage increases by a small

    amount, even though it results in overall leakage savings. This

    can be explained by the leakage-power results for a benchmark

    alu4, as shown in Fig. 13. For architecture arch3, the logic

    leakage reduces by more than 50%, with the routing leakageremaining the same. When 60% of pass-transistors switches

  • 8/3/2019 fpga2

    13/14

    KUMAR AND ANIS: DUAL-THRESHOLD CAD FRAMEWORK FOR SUBTHRESHOLD LEAKAGE POWER AWARE FPGAs 65

    Fig. 14. Realizing a dual-Vt FPGA design.

    are also assigned as high-Vt, the routing leakage decreases by

    28%. This results in an overall leakage savings of 37%, with

    19% savings from the logic blocks and 18% savings from the

    routing resources. This explanation can be extended to all the

    benchmarks, and the other benchmarks have similar leakage

    distributions.

    The leakage-power model in [11] models both the gate

    and the subthreshold leakage. For CMOS 130 nm, the gate

    leakage is six orders of magnitude smaller than the subthresholdleakage. Therefore, the gate leakage does not impact our

    methodology.

    VII. DESIGNING A DUAL -V T FPGA

    Designing a dual-Vt FPGA would require selecting a dual-Vt

    FPGA architecture and then determining the parameters for the

    dual-Vt FPGA. The dual-Vt FPGA CAD framework proposed

    in this paper is intended for this purpose. Fig. 14 shows the

    steps involved in the design of a dual-Vt FPGA. The previous

    sections explored the various dual-Vt FPGA architectures for

    leakage-power savings and performance tradeoffs. It can beseen from the results of the previous section that architectures

    with high-Vt switches lead to significantly increased leakage

    savings with small increase in delay penalties. Further, some

    of the architectures are more suitable for some benchmarks,

    whereas other benchmarks are more suitable for another archi-

    tecture. This is because the number of interconnections between

    the logic blocks is different for different benchmarks. This im-

    plies that some benchmarks having a large number of high-Vt

    switches would not have a large impact on delay because of

    a shorter total wire length in the critical path, whereas some

    benchmarks having a high percentage of high-Vt subblocks per

    CLB would not have a big impact on performance because

    there are very few subblocks in the critical path. Homogenous

    architectures would, in general, be easier in implementation

    as compared to heterogenous architectures. However, a careful

    design of heterogenous architectures and the associated CAD

    tools can lead to increased leakage savings for same delay

    penalties as that for homogenous architectures.

    VIII. CONCLUSION

    The dual-Vt FPGA architectures explored above indicate

    that an average leakage-power savings of up to 50% can be

    obtained by the dual-Vt FPGA architectures. We proposed a

    dual-Vt FPGA CAD flow for implementation and evaluationof dual-Vt FPGA architectures. The results show that a high

    percentage of subblocks are assigned as high-Vt for the ideal

    dual-Vt assignment. This indicates that a large amount of slack

    is available with the logic blocks, which can be exploited to re-

    duce the leakage power. We explored two primary architectures,

    homogenous and heterogenous. Results indicate that there is a

    large percentage of unused routing and logic resources, such

    that dual-Vt FPGA architectures can lead to good leakagesavings without large delay penalties. The main bottleneck in

    using any leakage-reduction technique for FPGAs is the very

    nature of the programmability of the FPGA, leading to an

    architecture not very suitable for leakage-reduction techniques.

    For our future paper, we intend to develop novel leakage-

    aware FPGA architectures which would be especially adapted

    to implement leakage-reduction technique without incurring

    large delay penalties.

    REFERENCES

    [1] L. Wei, Z. Chen, K. Roy, M. C. Johnson, Y. Ye, and V. K. De, Designoptimization of dual-threshold circuits for low-voltage low-power appli-

    cations, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 1,pp. 1624, Mar. 1999.

    [2] J. Kao, S. Narendra, and A. Chandrakasan, Subthreshold leakage mod-eling and reduction techniques, in Proc. IEEE Int. Conf. Comput.-Aided

    Des., 2002, pp. 141148.[3] J. H. Anderson, F. N. Najm, and T. Tuan, Active leakage power optimiza-

    tion for FPGAs, in Proc. FPGA, 2004, pp. 3341.[4] T. Tuan and B. Lai, Leakage power analysis of a 90 nm FPGA, in Proc.

    IEEE Custom Integr. Circuits Conf., 2003, pp. 5760.[5] S. Sirichotiyakul, T. Edwards, C. Oh, R. Panda, and D. Blaauw,

    Duet: An accurate leakage estimation and optimization tool for dual-Vtcircuits, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 2,pp. 7990, Apr. 2002.

    [6] A. Gayasen, Y. Tsai, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, andT. Tuan, Reducing leakage energy in FPGAs using region-constrainedplacement, in Proc. FPGA, 2004, pp. 5158.

    [7] A. Rahman and V. Polavarapuv, Evaluation of low-leakage designtechniques for field programmable gate arrays, in Proc. FPGA, 2004,pp. 2330.

    [8] F. Li, D. Chen, L. He, and J. Cong, Architecture evaluation for powerefficient FPGAs, in Proc. FPGA, 2003, pp. 175184.

    [9] C. Hu, A. Niknejad, X. Xi, W. Liu, X. Jin, K. M. Cao, M. Dunga, andJ. Ou. (2005, Jul. 29). BSIM4 MOS Models, Release 4.5.0. [Online].Available. http://www.device.eecs.berkeley.edu/~bsim3/bsim4.html

    [10] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs. Norwell, MA: Kluwer, 1999.

    [11] A. Kumar and M. Anis, An analytical state dependent leakage powermodel for FPGAs, in Proc. DATE, 2006, pp. 612617.

    [12] D. Lee, D. Blaauw, and D. Sylvester, Gate oxide leakage current analysisand reduction for VLSI circuits, IEEE Trans. Very Large Scale Integr.(VLSI) Syst., vol. 12, no. 2, pp. 155166, Feb. 2004.

    [13] K. Poon, A. Yan, and S. J. E. Wilton, A flexible power model forFPGAs, in Proc. Int. Conf. Field Programmable Logic and Appl., 2002,

    pp. 312321.[14] S. Borkar, Design challenges of technology scaling, IEEE Micro,

    vol. 19, no. 4, pp. 2329, Jul./Aug. 1999.[15] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand, Leak-

    age current mechanisms and leakage reduction techniques in deep-submicrometer CMOS circuits, Proc. IEEE, vol. 91, no. 2, pp. 305327,Feb. 2003.

    [16] M. Anis, M. Mahmoud, M. Elmasry, andS. Areibi, Dynamic andleakagepower reduction in MTCMOS circuits using an automated efficient gateclustering technique, in Proc. DAC, 2002, pp. 480485.

    [17] J. Kao, A. Chandrakasan, and D. Antoniadis, Transistor sizing issuesand tools for multi-threshold CMOS technology, in Proc. DAC, 1997,pp. 409414.

    [18] S. Mutah, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, andJ. Yamada, 1-V power supply high speed digital circuit technology withmulti-threshold voltage CMOS, IEEE J. Solid-State Circuits, vol. 30,

    no. 8, pp. 847853, Aug. 1995.[19] N. Kato, Y. Akita, M. Hiraki, T. Tamashita, T. Shimizu, F. Maki, andK. Yano, Random modulation: Multi-threshold-voltage design

  • 8/3/2019 fpga2

    14/14

    66 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 1, JANUARY 2007

    methodology in sub-2-V power supply CMOS, IEICE Trans. Electron.,vol. E83-C, no. 11, pp. 17471754, Nov. 2000.

    [20] L. Wei, K. Roy, and C. Koh, Power minimization by simultaneous dual-Vth assignment and gate-sizing, in Proc. IEEE Custom Integr. CircuitsConf., 2000, pp. 413416.

    [21] Y.-F. Tsai, D. Duarte, N. Vijaykrishnan, and M. J. Irwin, Implications oftechnology scaling on leakage reduction techniques, in Proc. DAC, 2003,pp. 187190.

    [22] E. M. Sentovich et al., SIS: A System for Sequential Circuit Analysis.Berkeley, CA: Univ. California, 1992.[23] J. Cong and Y. Ding, FlowMap: An optimal technology mapping

    algorithm for delay optimization in lookup-table based FPGA designs,IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 13, no. 1,pp. 112, Jan. 1994.

    [24] F. Li, Y. Lin, L. He, and J. Cong, Low-power FPGA using predefineddual-Vdd/dual-Vt fabrics, in Proc. FPGA, 2004, pp. 4250.

    [25] G. Lemieux and D. Lewis, Design of Interconnection Networks forProgrammable Logic. Norwell, MA: Kluwer, 2004.

    [26] E. Ahmed and J. Rose, The effect of LUT and cluster size ondeep submicron FPGA performance and density, in Proc. FPGA, 2000,pp. 312.

    [27] M. Hirabayashi, K. Nose, and T. Sakurai, Design methodology andoptimization strategy for dual-Vth scheme using commercially availabletools, in Proc. Int. Symp. Low Power Electron. and Design, Aug. 2001,pp. 283286.

    Akhilesh Kumar (S06) received the B.Tech.(honors) degree in electrical engineering from IndianInstitute of Technology, Roorkee, India, in 2001,and the M.A.Sc. degree in electrical and computerengineering from University of Waterloo, Waterloo,ON,Canada, in 2006. He is currentlyworkingtowardthe Ph.D. degree at the Department of Electrical andComputer Engineering, University of Waterloo.

    He worked with the ST Microelectronics, India,from November 2002 to December 2003 and withthe Hughes Software Systems, India, from June 2001

    to November 2002. His research interests include the areas of design automa-tion, digital circuit design, and architecture design for deep-submicrometertechnologies.

    Mohab Anis (M03) was born in Montreal, Canada,on February 19, 1974. He received the B.Sc. degree(honors) in electronics and communication engineer-ing from Cairo University, Cairo, Egypt, in 1997, andthe M.A.Sc. and Ph.D. degrees in electrical engineer-ing from the University of Waterloo, Waterloo, ON,Canada, in 1999 and 2003, respectively.

    Currently, he is an Assistant Professor and Co-

    Director of the VLSI Research Group with theUniversity of Waterloo. He has authored/coauthoredover 40 papers in international journals and con-

    ferences and is the author of the book Multi-Threshold CMOS DigitalCircuitsManaging Leakage Power (Kluwer, 2003). His research interestsinclude integrated circuit design and design automation for VLSI systems inthe deep-submicrometer regime.

    Dr. Anis is a member of the program committee for several IEEE conferencesand is a regular reviewer for a number of IEEE journals and conferences. Hewas awarded the 2005 New Opportunity Fund Award from the Canada Founda-tion for Innovation, the 2004 Douglas R. Colton Medal for Research Excellencein recognition of excellence in research leading to new understanding and noveldevelopments in Microsystems in Canada, and the 2002 IEEE InternationalLow-Power Design Contest Prize.