ULTRA LOW-POWER ELECTRONICS AND DESIGN

Ultra Low-Power Electronics and Design

Edited by

Enrico Macii
Politecnico di Torino

KLUWER ACADEMIC PUBLISHERS
NEW YORK, BOSTON, DORDRECHT, LONDON, MOSCOW

eBook ISBN: 1-4020-8076-X
Print ISBN: 1-4020-8075-1

No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher.

Created in the United States of America

Visit Springer's eBookstore at: http://www.ebooks.kluweronline.com
and the Springer Global Website Online at: http://www.springeronline.com

Contents

CONTRIBUTORS

PREFACE

INTRODUCTION

1. ULTRA-LOW-POWER DESIGN: DEVICE AND LOGIC DESIGN APPROACHES

2. ON-CHIP OPTICAL INTERCONNECT FOR LOW-POWER

3. NANOTECHNOLOGIES FOR LOW POWER

4. STATIC LEAKAGE REDUCTION THROUGH SIMULTANEOUS Vt/Tox AND STATE ASSIGNMENT

5. ENERGY-EFFICIENT SHARED MEMORY ARCHITECTURES FOR MULTI-PROCESSOR SYSTEMS-ON-CHIP

6. TUNING CACHES TO APPLICATIONS FOR LOW-ENERGY EMBEDDED SYSTEMS

7. REDUCING ENERGY CONSUMPTION IN CHIP MULTIPROCESSORS USING WORKLOAD VARIATIONS

8. ARCHITECTURES AND DESIGN TECHNIQUES FOR ENERGY EFFICIENT EMBEDDED DSP AND MULTIMEDIA PROCESSING

9. SOURCE-LEVEL MODELS FOR SOFTWARE POWER OPTIMIZATION

10. TRANSMITTANCE SCALING FOR REDUCING POWER DISSIPATION OF A BACKLIT TFT-LCD

11. POWER-AWARE NETWORK SWAPPING FOR WIRELESS PALMTOP PCS

12. ENERGY EFFICIENT NETWORK-ON-CHIP DESIGN

13. SYSTEM LEVEL POWER MODELING AND SIMULATION OF HIGH-END INDUSTRIAL NETWORK-ON-CHIP

14. ENERGY AWARE ADAPTATIONS FOR END-TO-END VIDEO STREAMING TO MOBILE HANDHELD DEVICES

Contributors

A. Acquaviva, Università di Urbino
L. Benini, Università di Bologna
D. Bertozzi, Università di Bologna
D. Blaauw, University of Michigan, Ann Arbor
A. Bogliolo, Università di Urbino
A. Bona, STMicroelectronics
C. Brandolese, Politecnico di Milano
W.C. Cheng, University of Southern California
G. De Micheli, Stanford University
N. Dutt, University of California, Irvine
W. Fornaciari, Politecnico di Milano
F. Gaffiot, Ecole Centrale de Lyon
J. Gautier, CEA-DRT–LETI/D2NT–CEA/GRE
A. Gordon-Ross, University of California, Riverside
R. Gupta, University of California, San Diego
C. Heer, Infineon Technologies AG
M. J. Irwin, Pennsylvania State University
I. Kadayif, Canakkale Onsekiz Mart University
M. Kandemir, Pennsylvania State University
B. Kienhuis, Leiden
I. Kolcu, UMIST
E. Lattanzi, Università di Urbino
D. Lee, University of Michigan, Ann Arbor
A. Macii, Politecnico di Torino
S. Mohapatra, University of California, Irvine
I. O’Connor, Ecole Centrale de Lyon
K. Patel, Politecnico di Torino
M. Pedram, University of Southern California
C. Pereira, University of California, San Diego
C. Piguet, CSEM
M. Poncino, Università di Verona
F. Salice, Politecnico di Milano
P. Schaumont, University of California, Los Angeles
U. Schlichtmann, Technische Universität München
D. Sylvester, University of Michigan, Ann Arbor
F. Vahid, University of California, Riverside and University of California, Irvine
N. Venkatasubramanian, University of California, Irvine
I. Verbauwhede, University of California, Los Angeles and K.U.Leuven
N. Vijaykrishnan, Pennsylvania State University
V. Zaccaria, STMicroelectronics
R. Zafalon, STMicroelectronics
B. Zhai, University of Michigan, Ann Arbor
C. Zhang, University of California, Riverside

Preface

Today we are beginning to have to face up to the consequences of the stunning success of Moore’s Law, that astute observation by Intel’s Gordon Moore which predicts that integrated circuit transistor densities will double every 12 to 18 months. This observation has now held true for the last 25 years or more, and there are many indications that it will continue to hold true for many years to come. This book appears at a time when the first examples of complex circuits in 65nm CMOS technology are beginning to appear, and these products already must take advantage of many of the techniques to be discussed and developed in this book. So why then should our increasing success at miniaturization, as evidenced by the success of Moore’s Law, be creating so many new difficulties in power management in circuit designs?

The principal source and the physical origin of the problem lies in the differential scaling rates of the many factors that contribute to power dissipation in an IC – transistor speed/density product goes up faster than the energy per transition comes down, so the power dissipation per unit area increases in a general sense as the technology evolves.

Secondly, the “natural” transistor switching speed increase from one generation to the next is becoming downgraded due to the greater parasitic losses in the wiring of the devices. The technologists are offsetting this problem to some extent by introducing lower permittivity dielectrics (“low-k”) and lower resistivity conductors (copper) – but nonetheless to get the needed circuit performance, higher speed devices using techniques such as silicon-on-insulator (SOI) substrates, enhanced carrier mobility (“strained silicon”) and higher field (“overdrive”) operation are driving power densities ever upwards. In many cases, these new device architectures are increasingly leaky, so static power dissipation becomes a major headache in power management, especially for portable applications.

A third factor is system or application driven – having all this integration capability available encourages us to combine many different functional blocks into one system IC. This means that in many cases, a large part of the chip’s required functionality will come from software executing on and between multiple on-chip execution units; how the optimum partitioning between hardware architecture and software implementation is obtained is a vast subject, but clearly some implementations will be more energy efficient than others. Given that, in many of today’s designs, more than 50% of the total development effort is on the software that runs on the chip, getting this partitioning right in terms of power dissipation can be critical to the success of (or instrumental in the failure of!) the product.

A final motivation comes from the practical and environmental consequences of how we design our chips – state-of-the-art high performance circuits are dissipating up to 100W per square centimeter – we only need 500 square meters of such silicon to soak up the output of a small nuclear power station. A related argument, based on battery lifetime, shows that the “converged” mobile phone application combining telephony, data transmission, multimedia and PDA functions that will appear shortly is demanding power at the limit of lithium-ion or even methanol-water fuel cell battery technology. We have to solve the power issue by a combination of design and process technology innovations; examples of current approaches to power management include multiple transistor thresholds, triple gate oxide, dynamic supply voltage adjustment and memory architectures.

Multiple transistor thresholds is a technique, practiced for several years now, that allows the designer to use high performance (low Vt) devices where he needs the speed, and low leakage (high Vt) devices elsewhere. This benefits both static power consumption (through less sub-threshold leakage) and dynamic power consumption (through lower overall switching currents). High threshold devices can also be used to gate the supplies to different parts of the circuit, allowing blocks to be put to sleep until needed.

Similar to the previous technique, triple gate oxide (TGO) allows circuit partitioning between those parts that need performance and other areas of the circuit that don’t. It has the additional benefit of acting on both sub-threshold leakage and gate leakage. The third oxide is used for I/O and possibly mixed-signal. It is expected over the next few years that the process technologists will eventually replace the traditional silicon dioxide gate dielectric of the CMOS devices by new materials such as rare earth oxides with much higher dielectric constants that will allow the gate leakage problem to be completely suppressed.

Dynamic supply voltage adjustment allows the supply voltage to different blocks of the circuit to be adjusted dynamically in response to the immediate performance needs for the block – this very sophisticated technique will take some time to mature.

Finally, many, if not most, advanced devices use very large amounts of memory whose contents may have to be maintained during standby; this consumes a substantial amount of power, either through refreshing dynamic RAM or through the array leakage of static RAM. Traditional non-volatile memories have writing times that are orders of magnitude too slow to allow them to substitute for these on-chip memories. New developments, such as MRAM, offer the possibility of SRAM-like performance coupled with unlimited endurance and data retention, making them potential candidates to replace the traditional on-chip memories and remove this component of standby power consumption.

Most of the approaches to power management described briefly above will be employed in 65nm circuits, but there are a lot more good ideas waiting to be applied to the problem, many of which you will find clearly and concisely explained in this book.

Mike Thompson, Philippe Magarshack

STMicroelectronics, Central R&D Crolles, France

Introduction

ULTRA LOW-POWER ELECTRONICS AND DESIGN

Enrico Macii Politecnico di Torino

Power consumption is a key limitation in many electronic systems today, ranging from mobile telecom to portable and desktop computing systems, especially when moving to nanometer technologies. Power is also a showstopper for many emerging applications like ambient intelligence and sensor networks. Consequently, new design techniques and methodologies are needed to control and limit power consumption.

The 2004 edition of the DATE (Design Automation and Test in Europe) conference devoted an entire Special Focus Day to the power problem and its implications on the design of future electronic systems. In particular, keynote presentations and invited talks by outstanding researchers in the field of low-power design, as well as several technical papers from the regular conference sessions, addressed the difficulties ahead and advanced strategies and principles for achieving ultra low-power design solutions. The purpose of this book is to integrate into a single volume a selection of these contributions, duly extended and transformed by the authors into chapters proposing a mix of tutorial material and advanced research results.

The manuscript consists of a total of 14 chapters, addressing different aspects of ultra low-power electronics and design. Chapter 1 opens the volume by providing insight into innovative transistor devices that are capable of operating with a very low threshold voltage, thus contributing to a significant reduction of the dynamic component of power consumption. Solutions for limiting leakage power during stand-by mode are also discussed. The chapter closes with a quick overview of low-power design techniques applicable at the logic level, including multi-Vdd, multi-Vth and hybrid approaches.

Chapter 2 focuses on the problem of reducing power in the interconnect network by investigating alternatives to traditional metal wires. In fact, according to the 2003 ITRS roadmap, metallic interconnections may not be able to provide enough transmission speed and to keep power under control for the upcoming technology nodes (65nm and below). A possible solution, explored in the chapter, consists of the adoption of optical interconnect networks. Two applications are presented: clock distribution and data communication using wavelength division multiplexing.

In Chapter 3, the power consumption problem is addressed from the technology point of view by looking at innovative nano-devices, such as single-electron or few-electron transistors. The low-power characteristics and potential of these devices are reviewed in detail. Other devices, including carbon nano-tube transistors, resonant tunnelling diodes and quantum cellular automata, are also treated.

Chapter 4 is entirely dedicated to advanced design methodologies for reducing sub-threshold and gate leakage currents in deep-submicron CMOS circuits by properly choosing the states to which gates have to be driven when in stand-by mode, as well as the values of the threshold voltage and of the gate oxide thickness. The authors formulate the optimization problem for simultaneous state/Vth and state/Vth/Tox assignments under delay constraints and propose both an exact method for its optimal solution and two practical heuristics with reasonable run-time. Experimental results obtained on a number of benchmark circuits demonstrate the viability of the proposed methodology.

Chapter 5 is concerned with the issue of minimizing power consumption of the memory subsystem in complex, multi-processor systems-on-chip (MPSoCs), such as those employed in multi-media applications. The focus is on design solutions and methods for synthesizing memory architectures containing both single-ported and multi-ported memory banks. Power efficiency is achieved by casting the memory partitioning design paradigm to the case of heterogeneous memory structures, in which data need to be accessed in a shared manner by different processing units.

Chapter 6 addresses the relevant problem of minimizing the power consumed by the cache hierarchy of a microprocessor. Several design techniques are discussed, including application-driven automatic and dynamic cache parameter tuning, adoption of configurable victim buffers and frequent-value data encoding and compression.

Power optimization for parallel, variable-voltage/frequency processors is the subject of Chapter 7. Given a processor with such an architecture, this chapter investigates the energy/performance tradeoffs that can be spanned in parallelizing array-intensive applications, taking into account the possibility that individual processing units can operate at different voltage/frequency levels. In assigning voltage levels to processing units, compiler analysis is used to reveal heterogeneity between the loads of the different units in parallel execution.

Chapter 8 provides guidelines for the design and implementation of DSP and multi-media applications onto programmable embedded platforms. The RINGS architecture is first introduced, followed by a detailed discussion on power-efficient design of some of the platform components, namely, the DSPs. Next, design exploration, co-design and co-simulation challenges are addressed, with the goal of offering designers the capability of including in the final architecture the right level of programmability (or reconfigurability) to guarantee the required balance between system performance and power consumption.

Chapter 9 targets software power minimization through source code optimization. Different classes of code transformations are first reviewed; next, the chapter outlines a flow for the estimation of the effects that the application of such transformations may have on the power consumed by a software application. At the core of the estimation methodology is the development of power models that allow the decoupling of processor-independent analysis from all the aspects that are tightly related to processor architecture and implementation. The proposed approach to software power minimization is validated through several experiments conducted on a number of embedded processors for different types of benchmark applications.

Reduction of the power consumed by TFT liquid crystal displays, such as those commonly used in consumer electronic products, is the subject of Chapter 10. More specifically, techniques for reducing power consumption of transmissive TFT-LCDs using a cold cathode fluorescent lamp backlight are proposed. The rationale behind such techniques is that the transmittance function of the TFT-LCD panel can be adjusted (i.e., scaled) while meeting an upper bound on a contrast distortion metric. Experimental results show that significant power savings can be achieved for still images with very little penalty in image contrast.

Chapter 11 addresses the issue of efficiently accessing remote memories from wireless systems. This problem is particularly important for devices such as palmtops and PDAs, for which local memory space is at a premium and networked memory access is required to support virtual memory swapping. The chapter explores performance and energy of network swapping in comparison with swapping on local microdrives and FLASH memories. Results show that remote swapping over power-manageable wireless network interface cards can be more efficient than local swapping and that both energy and performance can be optimized by means of power-aware reshaping of data requests. In other words, dummy data accesses can be preemptively inserted in the source code to reshape page requests in order to significantly improve the effectiveness of dynamic power management.

Chapter 12 focuses on communication architectures for multi-processor SoCs. The network-on-chip (NoC) paradigm is reviewed, touching upon several issues related to power optimization of such kinds of communication architectures. The analysis goes on a layer-by-layer basis, and particular emphasis is given to customized, domain-specific networks, which represent the most promising scenario for communication-energy minimization in multi-processor platforms.

Chapter 13 provides a natural follow up to the theory of NoCs covered in the previous chapter by describing an industrial application of this type of communication architecture. In particular, the authors introduce an innovative methodology for automatically generating the power models of a versatile and parametric on-chip communication IP, namely the STBus by STMicroelectronics. The methodology is validated on a multi-processor hardware platform including four ARM cores accessing a number of peripheral targets, such as SRAM banks, interrupt slaves and ROM memories.

The last contribution, offered in Chapter 14, proposes an integrated end-to-end power management approach for mobile video streaming applications that unifies low-level architectural optimizations (e.g., CPU, memory, registers), OS power-saving mechanisms (e.g., dynamic voltage scaling) and adaptive middleware techniques (e.g., admission control, trans-coding, network traffic regulation). Specifically, interaction parameters between the different levels are identified and optimized to achieve a reduction in the power consumption.

Closing this introductory chapter, the editor would like to thank all the authors for their effort in producing their outstanding contributions in a very short time. Special thanks go to Mike Thompson and Philippe Magarshack of STMicroelectronics for their keynote presentation at DATE 2004 and for writing the foreword to this book. The editor would also like to acknowledge the support offered by Mark De Jongh and the Kluwer staff during the preparation of the final version of the manuscript. Last, but not least, the editor is grateful to Agnieszka Furman for taking care of most of the “dirty work” related to book editing, paging and preparation of the camera-ready material.

Chapter 1

ULTRA-LOW-POWER DESIGN: DEVICE AND LOGIC DESIGN APPROACHES

Christoph Heer¹ and Ulf Schlichtmann²

¹Infineon Technologies AG; ²Technische Universität München

Abstract: Power consumption is increasingly becoming the bottleneck in the design of ICs in advanced process technologies. We give a brief introduction to the major causes of power consumption. Then we report on experiments in an advanced process technology with ultra-low threshold voltage (Vth) devices. It turns out that, in contrast to older process technologies, this approach is becoming less suitable for industrial usage in advanced process technologies. Next, we describe methodologies to reduce power consumption by optimizations in logic design, specifically by utilizing multiple levels of supply voltage Vdd and threshold voltage Vth. We evaluate them from an industrial product development perspective. We also give a brief outlook on proposals at other levels of the design flow and on future work.

Keywords: Low-power design, dynamic power reduction, leakage power reduction, ultra-low-Vth devices, multi-Vdd, multi-Vth, CVS

1.1 INTRODUCTION

The progress of silicon process technology marches on relentlessly. As predicted by Gordon Moore decades ago, silicon process technology continues to achieve improvements at an astonishing pace [1]. The number of transistors that can be integrated on a single IC approximately doubles every 2 years [2,3]. This engineering success has created innovative new industries (e.g. personal computers and peripherals, consumer electronics) and revolutionized other industries (e.g. communications).

Today, however, it is becoming increasingly difficult to achieve improvements at the pace that the industry has become accustomed to. More and more technical challenges appear that require increasing resources to be solved [4]. One such problem is the increasing power consumption of integrated circuits. It becomes even more critical as an increasing number of today’s high-volume consumer products are battery-powered.

In the following, we will consider the sources of power consumption and their development over time. We will show why reduction of power consumption is increasingly becoming critical to product success and will review traditional approaches in Sections 1.2 and 1.3. In Section 1.4 we will then analyze a potential solution based on the introduction of an optimized transistor with a very low threshold voltage Vth. Thereafter, we will present and discuss logic-level design optimizations for power reduction in Section 1.5. Also, we will briefly point out potential optimizations at higher levels.

Our observations are made from the perspective of industrial IC product development, where technical optimizations must be carefully evaluated against the cost associated with achieving and implementing them. Most of the presented methodologies are already being utilized in leading-edge industrial ICs.

1.2 POWER CONSUMPTION BECOMES CRITICAL

Depending on the type of end-product and its application, different aspects of power consumption are the primary concern: dynamic power or leakage power.

Reduction of dynamic power consumption is a concern for almost all IC products today. For battery-powered products, reduced power consumption directly results in longer operating time for the product, which is a very desirable characteristic. Even for non-battery-powered products, reduced power consumption brings many advantages, such as reduced cost because of cheaper packaging or higher performance because of lower temperatures. Finally, reduced power consumption often leads to lower system cost (no fans required; no or cheaper air conditioning for data / telecom centers, etc.).

Dynamic power consumption is caused by the charging and discharging of capacitances when a circuit switches. In addition, a short-circuit current flows during switching, but this current is typically much smaller and will therefore be neglected in the following. The dynamic power due to capacitance charging and discharging is determined by the following well-known relationship:

$$P_{dyn} \sim f \cdot C_L \cdot V_{dd}^{2}$$

Based on constant electrical field scaling, Vdd and CL are each reduced by 30% in each successive process generation. Delay also decreases by 30%, resulting in a 43% increase in frequency. Therefore, the dynamic power consumption per device is reduced by 50% from one process generation to the next. As scaling also doubles the number of devices that can be implemented in a given die area, dynamic power consumption per area should stay roughly identical. Historically, however, frequency has increased by significantly more than 43% from one process generation to the next (e.g. in microprocessors it has roughly doubled, due to architectural optimizations such as deeper pipelines). In addition, die sizes have increased with each new process technology, further increasing the power consumption due to an increased number of active devices [5]. For these reasons, dynamic power consumption has increased exponentially, as is shown in Figure 1-1 for the example of microprocessors.
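The scaling arithmetic in this paragraph can be checked with a few lines of Python. This is only a sketch: the function name `scale_dynamic_power` and its parameters are illustrative and do not come from the text, and it assumes the proportionality between dynamic power and f · CL · Vdd² with every quantity normalized to 1 in the previous process generation.

```python
def scale_dynamic_power(s=0.7, freq_gain=None):
    """Relative dynamic power per device after one process generation.

    s: scaling factor applied to Vdd, C_L and gate delay (0.7 means -30%).
    freq_gain: frequency multiplier; defaults to 1/s because delay shrinks by s.
    """
    if freq_gain is None:
        freq_gain = 1.0 / s
    vdd, c_load = s, s
    return freq_gain * c_load * vdd ** 2   # P ~ f * C_L * Vdd^2


per_device = scale_dynamic_power()                   # ideal constant-field scaling
per_area = 2 * per_device                            # device count per area doubles
per_area_2x = 2 * scale_dynamic_power(freq_gain=2.0)  # frequency doubles instead

print(f"per device: {per_device:.2f}, per area: {per_area:.2f}")
print(f"per area with doubled frequency: {per_area_2x:.2f}")
```

Under ideal scaling the power per area stays close to 1. If frequency doubles per generation instead of rising by 43%, the same model gives roughly 37% more power per area in each generation, before any growth in die size is counted.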

Reduction of leakage power consumption today is primarily a concern for products that are powered by battery and spend most of their operating hours in some type of standby mode, such as cell phones.

For many process generations, however, leakage has increased roughly by a factor of 10 for every two process nodes [6]. Due to this dramatic increase with newer process generations, leakage is becoming a significant contribution to overall IC power consumption even in normal operating mode, as can be seen in Figure 1-1 as well. Leakage was estimated to increase from 0.01% of overall power consumption in a 1.0µm technology to 10% in a 0.1µm technology [6]. For a microprocessor, Intel estimated leakage power consumption at more than 50W for a 100nm technology node [3]. This figure probably is extreme, and leakage depends strongly on a number of factors, such as threshold voltage (Vth) of the transistor, gate oxide thickness and environmental operating conditions (supply voltage Vdd, temperature T). Nevertheless, for an increasing number of products leakage power consumption is turning into a problem, even when they are not battery-powered.
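As a rough plausibility check of these growth figures, the following Python sketch compounds the "factor of 10 for every two process nodes" trend from 1.0µm down to 0.1µm. The linear shrink of 0.7 per node (`SHRINK`) and the helper name `leakage_growth` are assumptions introduced here for illustration; the text only gives the growth rate and the two endpoint technologies.

```python
import math


def leakage_growth(nodes, factor_per_two_nodes=10.0):
    """Cumulative leakage growth after a (possibly fractional) number of nodes."""
    return factor_per_two_nodes ** (nodes / 2.0)


SHRINK = 0.7  # assumed linear shrink per process node (not stated in the text)

# Number of 0.7x shrinks needed to go from 1.0 um to 0.1 um
nodes = math.log(0.1 / 1.0) / math.log(SHRINK)

print(f"nodes from 1.0 um to 0.1 um: {nodes:.1f}")
print(f"growth per single node: {leakage_growth(1):.2f}x")
print(f"cumulative leakage growth: {leakage_growth(nodes):.0f}x")
```

With these assumptions the trend spans between six and seven nodes and a cumulative growth of a little more than three orders of magnitude. That is the same order as the quoted rise of the leakage share from 0.01% to 10%, although a share of total power and an absolute leakage value are not directly comparable.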

Figure 1-1. Development of dynamic and leakage power consumption over time [3,7]

1.3 TRADITIONAL APPROACHES TO POWER REDUCTION

As outlined above, dynamic power consumption is governed by:

$$P_{dyn} \sim f \cdot C_L \cdot V_{dd}^{2}$$

with f denoting the switching frequency, CL the capacitance being switched, and Vdd the supply voltage. This formula immediately identifies the key levers to reduce dynamic power:

• Reduce operating frequency

• Reduce driven capacitance

• Reduce supply voltage

Traditionally, reduction in supply voltage Vdd has been the most often followed strategy to reduce power consumption. Unfortunately, lowering Vdd has the side effect of reducing performance as well, primarily because gate overdrive (the difference between Vdd and Vth) diminishes if the threshold voltage Vth is kept constant. Based on the alpha power law model [8], the delay td of an inverter is given by

$$t_d = \frac{C_L \cdot V_{dd}}{(V_{dd} - V_{th})^{\alpha}}$$

with α denoting a fitting constant. As supply voltages are driven below 1.0V, the reductions in gate overdrive are more pronounced than previously. In addition, newer process technologies give significantly less of a performance boost compared to the previous process generation than has traditionally been the case, so a further reduction in performance is highly undesirable. Finally, the power reduction achieved by moving to a new process generation has trended down over time, since supply voltages have been scaled by increasingly less than the 30% prescribed by the constant electrical field scaling paradigm.
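The loss of gate overdrive can be made concrete with a small Python sketch of the alpha power law. It assumes the commonly used form in which delay is proportional to Vdd / (Vdd − Vth)^α; the function name `relative_delay`, the exponent α = 1.3 and the threshold of 0.3V are illustrative assumptions, not values taken from the text.

```python
def relative_delay(vdd, vth, alpha=1.3):
    """Inverter delay up to a constant factor: Vdd / (Vdd - Vth)**alpha.

    alpha is a fitting constant; 1.3 is an assumed, illustrative value.
    """
    if vdd <= vth:
        raise ValueError("no gate overdrive: Vdd must exceed Vth")
    return vdd / (vdd - vth) ** alpha


VTH = 0.3  # V, threshold voltage held constant (hypothetical value)
reference = relative_delay(1.5, VTH)

for vdd in (1.5, 1.2, 1.0, 0.8, 0.6):
    slowdown = relative_delay(vdd, VTH) / reference
    print(f"Vdd = {vdd:.1f} V  overdrive = {vdd - VTH:.1f} V  delay x{slowdown:.2f}")
```

The printed slowdown grows faster as Vdd approaches Vth, which is the behavior the paragraph describes for supply voltages below 1.0V.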

Consequently, more advanced approaches are required.

In the following, our main focus will be on dynamic power consumption, but we will also consider leakage power consumption.

1.4 ZERO-VTH DEVICES

The concept of zero-Vth devices was developed in the mid-1990s. It overcomes the diminishing gate overdrive by radically setting the threshold voltage of the active devices to zero. It has been shown [9] that the optimum power dissipation is obtained if Pleak (leakage contribution) is of the same order of magnitude as Pdyn (dynamic switching contribution). This can be achieved for transistors with Vth close to 0V (‘zero-Vth transistor’). As a result, the devices never switch off completely, but from an overall power perspective the gain in active power consumption is tremendous.

Using these transistors, the supply voltage of 130nm circuits can be reduced to values below 0.3V to achieve a Pdyn reduction by 90% without performance degradation. Alternatively, the circuit can be operated at twice the clock frequency when keeping the supply voltage at 1.2V, as shown in Figure 1-2. The corresponding Ion/Ioff ratio for the zero-Vth transistor is about 10-100 instead of >10^5 for the standard transistor options. During standby, the complete circuits are switched off or are set into a low leakage mode to cope with the very high leakage contribution. The low leakage mode is achieved by ‘active well’ control, which denotes the use of the body effect. The well potentials of the PFETs and NFETs are altered to change Vth. To achieve a lower leakage current, the absolute value of Vth is increased by reverse back biasing: a negative well-to-source voltage Usb is used. Therefore, voltages below Vss for NFETs and above Vdd for PFETs have to be generated. Furthermore, active well control is required to compensate for the lot-to-lot or wafer-to-wafer variations of Vth.

The initial ‘zero-Vth’ concept assumed constant junction temperatures Tj below 40°C. For some high-end computer equipment, the costs for active chip cooling are affordable to achieve this junction temperature. But this is definitely not the case for cost-driven consumer products. For this application domain, Tj in active mode ranges between 85°C and 125°C, and in some applications the specified worst-case ambient temperature is even 80°C. The proposed zero-Vth concept is therefore not applicable without changes and adaptations.

Figure 1-2. Simulated performance curves of transistors with ultra-low Vth. Compared to low-Vth, either a performance gain or a Vdd reduction can be achieved. Curves for reg-Vth and high-Vth transistors of a 130 nm technology are included.

A more conservative approach with respect to zero-Vth, but still aggressive compared to current devices, had to be chosen. An ultra-low Vth device with about 150mV threshold voltage proved to be the best compromise between zero-Vth and current low-Vth of about 300mV within a 130 nm CMOS technology.

To identify the optimal choice of Vth and Vdd in combination with the higher junction temperature Tj, simulations with modified parameters of the 130nm low-Vth transistor were performed. Figure 1-3 shows the power dissipation for a high activity circuit (switching activity of 20%) with various options for the transistor threshold voltages: reg-Vth, low-Vth, and transistors whose Vth is reduced to 200mV, 150mV, 100mV and 50mV. The reg-Vth circuit performance was used as the reference (Vdd = 1.5V), and the supply voltages for the other transistor options were reduced to meet that reference performance.

Device Option / Vth (mV)

Figure 1-3. Power dissipation at T=125°C in active mode for several transistor options with

reduced Vth. A minimum power consumption is achieved at 150mV Vth. (At T=55°C the

minimum is achieved for the same option but process variations show less impact).

The reduced supply voltage leads to lower overall active power

consumption Pactive. A minimum power consumption is reached at Vth =

150mV. With even lower threshold voltages Pactive starts to increase again

because of the increase of the leakage current. The steep rise of Pactive

originates from the exponential relation between Vth and leakage current. As

a rule of thumb a 100mV reduction of the threshold voltage allows for a Vdd

reduction by 0.15V but on the other hand results in a tenfold increase of

the leakage current. Figure 1-3 also shows the impact of technology

variations. Due to the high leakage contribution, a power reduction

of only 25% is achieved under fast process conditions. Using back biasing in

reverse mode, the high performance of fast transistors can be reduced

through increasing Vth. The corresponding leakage current therefore

decreases and allows a power reduction by 50% (stippled arrow).
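The trade-off described above can be reproduced with a toy model. The following is a minimal sketch, not the simulation behind Figure 1-3: it assumes that dynamic power scales with the square of Vdd, that leakage power is leakage current times Vdd, and it applies the rule of thumb from the text (per 100mV of Vth reduction, Vdd drops by 0.15V and leakage current rises tenfold). The starting point (300mV, 1.0V) and the initial leakage share of 0.5% are assumed values chosen for illustration; the position of the minimum depends on them.

```python
# Toy model of active power versus threshold voltage.
# All starting values are assumptions for illustration, not data from Figure 1-3.
VTH0_MV = 300        # assumed starting threshold voltage (mV)
VDD0 = 1.0           # assumed supply voltage at the starting point (V)
LEAK_SHARE0 = 0.005  # assumed leakage-to-dynamic power ratio at the starting point


def active_power(vth_mv):
    """Return active power relative to the dynamic power at the starting point."""
    steps = (VTH0_MV - vth_mv) / 100.0      # number of 100 mV Vth reductions
    vdd = VDD0 - 0.15 * steps               # rule of thumb: 0.15 V lower Vdd per step
    p_dyn = (vdd / VDD0) ** 2               # dynamic power scales with Vdd squared
    i_leak = LEAK_SHARE0 * 10.0 ** steps    # rule of thumb: 10x leakage per step
    p_leak = i_leak * (vdd / VDD0)          # leakage power = leakage current x Vdd
    return p_dyn + p_leak


options_mv = [300, 250, 200, 150, 100, 50]
powers = {vth: active_power(vth) for vth in options_mv}
best_vth_mv = min(powers, key=powers.get)

for vth in options_mv:
    print(f"Vth = {vth:3d} mV  ->  relative active power {powers[vth]:.3f}")
print("minimum at", best_vth_mv, "mV")
```

With these assumed numbers the quadratic gain from the lower Vdd dominates at first, and the exponential leakage term takes over below the minimum, which is the qualitative shape described in the text.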

A process modification has been developed to manufacture devices with

the threshold voltage of 150 mV, which proves to be the most efficient for

the target application domain of mobile consumer products [10]. In Table 1-1

the key transistor parameters of our ultra-low-Vth FETs (ulv) and of the

standard low-Vth transistor are listed. The Vth values are 165mV and 161mV

for the ulv-NFET and ulv-PFET respectively, Ion increases by 35% and 22%,

which translates into an average decrease of the CV/I-metric delay by 29%.

Circuit simulations showed a performance increase of 25%. Concerning Vth,

performance, and Ioff the target values have been nearly met.

Table 1-1. Extracted key parameters of the ulv-FETs in comparison with the target values and the low-Vth FETs

Parameter                           130nm low-Vt    130nm ulv-FET   Target
                                    NFET / PFET     NFET / PFET
Ion [µA/µm]                         560 / 240       755 / 295
Ioff [nA/µm]                        1.2 / 1.2       48 / 17         35
Vth [mV]                            295 / 260       165 / 160       150
Body effect [mV/V]                  150 / 135       60 / 65         90
Vth shift @ ΔL = 10nm [mV]          35 / 30         65 / 30
Vth shift @ ΔL = 15nm [mV]          65 / 70         100 / 90
Simulated gate delay [rel. units]   1               0.8             0.75

The sensitivity of Vth to gate length variation (roll-off) is expressed in

Vth-shift per 10nm or 15nm gate length decrease. A comparison with low-

Vth-FETs shows a pronounced increase. Therefore in addition to temperature

compensation, back biasing has also to be used to compensate for this strong

technology variation.

The values of the body effect are also included in Table 1-1. The body

effect is expressed in Vth-shift per 1V well bias. The ulv-FETs yield values

that are more than 50% lower than those of the low-Vth transistors. The

decrease of body effect in combination with the increased roll-off reduces

the leverage of back biasing for ulv-FETs very significantly. The leverage is

not even sufficient to compensate the technology variation, since the value

of the roll-off is higher than that of the body effect. As an example, the ulv-

NFET shows roll-off values of 65mV/10nm and 100mV/15nm and a body

effect of only 60mV/V.
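The comparison between roll-off and body effect can be made concrete with a short calculation. The sketch below uses the NFET values from Table 1-1 and assumes, as a simplification, that the body effect is linear in the well bias; it estimates the well bias that would be needed to shift Vth by as much as the roll-off does.

```python
# NFET values taken from Table 1-1.
ROLL_OFF_MV = {                      # Vth shift (mV) per gate-length decrease (nm)
    "low-Vth NFET": {10: 35.0, 15: 65.0},
    "ulv-NFET": {10: 65.0, 15: 100.0},
}
BODY_EFFECT_MV_PER_V = {"low-Vth NFET": 150.0, "ulv-NFET": 60.0}


def well_bias_needed(device, delta_l_nm):
    """Well bias (V) whose Vth shift equals the roll-off, assuming a linear body effect."""
    return ROLL_OFF_MV[device][delta_l_nm] / BODY_EFFECT_MV_PER_V[device]


for device in ROLL_OFF_MV:
    for delta_l_nm in (10, 15):
        bias = well_bias_needed(device, delta_l_nm)
        print(f"{device}, {delta_l_nm} nm shorter gate: {bias:.2f} V well bias")
```

Under this linear assumption the ulv-NFET needs more than 1V of well bias for a 10nm gate length variation, whereas the low-Vth NFET stays below 0.5V even for 15nm.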

To investigate the migration potential of the ulv-FETs for future

technology generations, Ioff measurement results obtained from recent

90nm hardware were used. Based on these measurement data, the leverage of

active well with the standard reg-Vth and low leakage transistor options has

been analyzed. For supply voltages of 1.2V and 0.75V a reverse back biasing

voltage of 0.5V has been applied. For the NFET, the back biasing results in a

leakage reduction by 50% to 70% for all transistor widths and for both

values of Vdd. In the case of the PFET, the leakage reduction values are

similar (60% to 80%) for transistors with W> 0.5µm. For very narrow

PFETs with Vdd = 1.2V, the reduction is only 20% or even less. Since

narrow FETs are used within SRAMs, which contribute a major part of the

circuit’s standby current, this small reduction for narrow transistors

further reduces the leverage of active well significantly. The root cause is

an additional leakage mechanism based on tunnelling currents across the

drain-well junction, which limits the reverse back biasing to 0.5V. This

tunnelling current depends exponentially on the drain-well voltage and is

working against any reduction of the sub-threshold current via active well.

At Vdd = 0.75V the drain-well voltage is reduced and the tunnelling current

is therefore lower. In this case the effect of back biasing is not compensated

by a rising tunnelling current and a leakage current reduction by 70% is still

achieved.

For a 90nm technology the limit of 0.5V for the well potential swing

limits the reduction of the leakage currents to a factor between 2 and 4. This

is still a major contribution of all feasible measures to reduce standby power

consumption, but the leverage becomes quite small compared to the

reduction ratios of several orders of magnitude obtained in previous

technologies [11,12]. In future technologies, Ileak will become more strongly

affected by the emerging tunnelling current Igate through the gate of the FET.

This is due to the ever-decreasing gate oxide thickness and also due to the

fact that even on-state transistors show gate leakage. Igate is not affected

by well biasing, which reduces the leverage of active well even further.

In summary the zero-Vth-devices have become very susceptible to

process and temperature variations. Significant yield is only achievable with

back biasing via active well control and with active cooling. The latter

approach is not feasible for mobile applications. Therefore a more

conservative approach with respect to zero-Vth, but still aggressive compared

to current devices, had to be chosen. An ultra-low-Vth device with about

150mV threshold voltage proved to be the best compromise between zero-

Vth and current low-Vth of about 300mV within a 130 nm CMOS technology.

But even though fabrication of this ultra-low-Vth device is possible, it

affects some standard methods to overcome short-channel effects. The so-called

halo or pocket implantation had to be removed to bring the threshold

voltage down. Unfortunately short-channel effects are now heavily

increased, leading as shown to a very strong Vth roll-off at slight variations

of the channel length. Finally this effect was prohibitive for the overall

approach and led to cancellation of many zero-Vth projects in the

industry [13].

1.5 DESIGN APPROACHES TO POWER REDUCTION

As outlined above, process technology by itself will not

provide sufficient power reduction. Therefore, solutions must be

found in algorithms, product architecture and logic design. Increasingly,

differentiated device options provided by process technology are utilized on

these levels in the search for optimization of power consumption.

For leading-edge products which need to optimize both power

consumption and system performance, optimization techniques on

architecture and design level have been proposed and partly already been

implemented. While academic research often focuses on the tradeoff

between power consumption and performance, industrial product

development must also take other variables into consideration.

• Product cost: often, power optimization design techniques increase die

area, directly affecting manufacturing cost. Also, utilization of additional

devices (e.g. different Vth devices) increases mask count and

consequently manufacturing cost, and additionally requires up-front

expenditures for the development of such devices. Finally, increased

manufacturing complexity poses the risk of lowered manufacturing yield.

• Product robustness: it must be ensured that optimized products still work

across the specified range of operating conditions, also taking

manufacturing variations into account.

1.5.1 Multi-Vdd Design

As outlined in the introduction, the supply voltage Vdd quadratically

impacts dynamic switching power consumption. Thus, lowering Vdd is the

preferred option to reduce dynamic power consumption. However, as

discussed in Section 1.2, lowering Vdd reduces the system performance.

Thus, the incentive to lower Vdd to reduce power consumption is kept in

check by the need to maintain performance.

Reduction of Vdd can be applied on different abstraction levels of a

design. Most effective regarding power reduction, and also easiest to

implement is to lower Vdd for an entire IC. As this will directly impact the

performance of the IC design, this often is not an option. On a lower

abstraction level, it is possible to lower Vdd for an entire module. This is still

rather simple to implement, but if only modules are chosen such that overall

IC performance is not impacted, the achieved gains in power reduction will

often be very moderate.

Finally, a reduction in supply voltage can be applied specifically to

individual gates, such that the overall system performance is not reduced.

This approach, as shown in Figure 1-4, recognizes that in a typical design,

most logic paths are not critical. They can be slowed down, often

significantly, without reducing the overall system performance. This slowing

down is achieved by lowering the supply voltage Vdd for gates on the non-

critical paths, which results in lowered power consumption.

Figure 1-4. Multi-Vdd design

This technique will modify the distribution of path delays in a design to a

distribution skewed towards paths with higher delay, as indicated in Figure 1-5.

[Figure 1-5 graphic: panels "Single Supply Voltage (SSV)" and "Multiple Supply Voltages (MSV)"; axis marks "crit. paths" and "1/f"]

Figure 1-5. Distribution of path delays under single and multiple supply voltages

[Figure 1-4 graphic: a non-critical path runs with reduced supply voltage Vdd_low and may be delayed; annotated path delays 10ns, 5ns and 8ns]

A number of studies have shown significant variation in dynamic power

reduction results from implementing a multi-Vdd design strategy, ranging

from less than 10% up to almost 50%, with 40% being the average [15,16].

Rules of thumb for selecting appropriate supply voltage levels have been

developed. When using two supply voltages, the lower Vdd was proposed to

be 0.6x-0.7x of the higher Vdd [17]. The optimal supply voltage level also

depends on Vth [18].

The benefit of using multiple supply voltages quickly saturates. The

major gain is obtained by moving from a single Vdd to dual-Vdd. Extending

this to ever more supply voltage levels yields only small incremental benefits

[18,19], even when the overhead introduced by multiple supply voltages (see

below) is not taken into consideration.

The power reduction achieved by this technique roughly depends on two

parameters: the difference between the regular supply voltage Vdd and the

lowered supply voltage Vdd_low, and the percentage of gates to which Vdd_low

is applied.
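The dependence on these two parameters can be written down as a first-order estimate. The sketch below assumes that all gates contribute equal switched capacitance and activity, that dynamic power scales with the square of the supply voltage, and that the overhead of level shifters and converters is ignored; the function name and the example values are illustrative only.

```python
def dual_vdd_dynamic_saving(vdd_ratio, low_fraction):
    """First-order dynamic power saving of a dual-Vdd design.

    vdd_ratio    -- Vdd_low / Vdd, between 0 (exclusive) and 1
    low_fraction -- share of gates supplied by Vdd_low, between 0 and 1
    Assumes equal capacitance and activity per gate and no overhead.
    """
    if not (0.0 < vdd_ratio <= 1.0 and 0.0 <= low_fraction <= 1.0):
        raise ValueError("vdd_ratio must be in (0, 1], low_fraction in [0, 1]")
    return low_fraction * (1.0 - vdd_ratio ** 2)


for ratio in (0.6, 0.7):
    for fraction in (0.3, 0.6):
        saving = dual_vdd_dynamic_saving(ratio, fraction)
        print(f"Vdd_low = {ratio:.1f} x Vdd, {fraction:.0%} of gates: {saving:.1%} saved")
```

Even with the lower voltage at 0.6x to 0.7x of Vdd, the saving remains bounded by the share of gates that can actually be moved to Vdd_low.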

Regarding the first parameter, it was pointed out some years ago that

the leverage of this concept decreases as process technologies are scaled

down further [18].

Recent work has analyzed this in more detail [14]. At least for high-Vth

devices, which are essential for low standby power design due to their lower

leakage current, Vth has scaled much more slowly than Vdd recently. Therefore,

gate overdrive (Vdd - Vth) is diminished, negatively impacting performance.

Thus, even a little reduction in Vdd will have a very significant impact on

performance. Therefore, the potential to lower Vdd while maintaining overall

system performance is greatly reduced. It is shown that from 0.25µm down

to 0.09µm, the effectiveness of dual-Vdd decreases by a factor of 2 (from

60% dynamic power reduction to 30%) for high-Vth designs, whereas it stays

about constant for low-Vth designs. This can however be countered by

introduction of variable threshold voltages, as will be seen later.
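The effect of a shrinking gate overdrive can be illustrated with the alpha-power-law delay model of [8], in which gate delay is proportional to Vdd / (Vdd - Vth)^alpha. The sketch below is an assumption-laden illustration: the exponent alpha = 1.3 and the voltage values are chosen for the example and are not taken from [14].

```python
ALPHA = 1.3  # assumed velocity-saturation index of the alpha-power law


def relative_delay(vdd, vth, alpha=ALPHA):
    """Gate delay up to a constant factor: Vdd / (Vdd - Vth)^alpha."""
    if vdd <= vth:
        raise ValueError("model only valid for Vdd above Vth")
    return vdd / (vdd - vth) ** alpha


def slowdown(vdd, vth, reduction=0.10):
    """Delay ratio after lowering Vdd by the given fraction."""
    return relative_delay(vdd * (1.0 - reduction), vth) / relative_delay(vdd, vth)


VDD = 1.2  # assumed nominal supply voltage (V)
for label, vth in (("low-Vth", 0.25), ("high-Vth", 0.45)):  # assumed thresholds (V)
    print(f"{label}: 10% lower Vdd makes the gate {slowdown(VDD, vth):.3f}x slower")
```

In this model the same 10% reduction in Vdd slows the high-Vth gate noticeably more than the low-Vth gate, which is the mechanism behind the reduced leverage of dual-Vdd for high-Vth designs.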

Regarding the second parameter, experience has shown that especially in

designs using the multi-Vth technique outlined below, path delays tend to be

skewed to higher delays already, thus reducing the number of gates that can

be slowed down further [14].

For the selection of those gates which will receive the lower supply

voltage Vdd_low, a number of techniques have been proposed. Most prevalent

is the concept of clustered voltage scaling (CVS). It recognizes that it is

desirable to have clusters of gates assigned to the same voltage, since

between the output of a gate supplied by Vdd_low and the input of a gate

supplied by Vdd a level shifter is required to avoid static current flow [20].

This concept has been enhanced by extended clustered voltage scaling

(ECVS) [17], which essentially allows an arbitrary assignment of supply

voltage levels to gates. This strategy implies more frequent insertion of level

shifters into the design. However, usually only power consumption and

delay are considered in the literature. The additional area cost is neglected.

In industry, this certainly is not feasible.
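The clustering idea can be sketched in a few lines. The following is a simplified greedy procedure in the spirit of CVS, not the algorithm of [20]: the netlist, the gate delays, the clock period and the delay penalty of a gate at Vdd_low are all hypothetical. A gate is moved to Vdd_low only if every gate it drives is already at Vdd_low and the timing constraint still holds, so that no level shifter is needed inside the logic. A second helper counts the level shifters an arbitrary (ECVS-like) assignment would require.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical netlist: gate -> list of gates driving its inputs.
FANIN = {
    "g1": [], "g2": [],
    "g3": ["g1"], "g4": ["g1", "g2"],
    "g5": ["g3"], "g6": ["g4"],
    "g7": ["g5"],   # output of the critical path
    "g8": ["g6"],   # output of a non-critical path
}
DELAY_NS = {"g1": 2.0, "g2": 1.0, "g3": 3.0, "g4": 1.0,
            "g5": 3.0, "g6": 1.0, "g7": 2.0, "g8": 1.0}
LOW_VDD_SLOWDOWN = 1.5   # assumed delay penalty of a gate supplied by Vdd_low
CLOCK_PERIOD_NS = 10.0   # assumed timing constraint

ORDER = list(TopologicalSorter(FANIN).static_order())  # drivers before loads
FANOUT = {gate: [] for gate in FANIN}
for gate, drivers in FANIN.items():
    for driver in drivers:
        FANOUT[driver].append(gate)


def critical_delay(low):
    """Longest path delay when the gates in `low` run at Vdd_low."""
    arrival = {}
    for gate in ORDER:
        delay = DELAY_NS[gate] * (LOW_VDD_SLOWDOWN if gate in low else 1.0)
        arrival[gate] = delay + max((arrival[d] for d in FANIN[gate]), default=0.0)
    return max(arrival.values())


def clustered_voltage_scaling():
    """Greedy pass from the outputs towards the inputs."""
    low = set()
    for gate in reversed(ORDER):
        drives_only_low = all(load in low for load in FANOUT[gate])
        if drives_only_low and critical_delay(low | {gate}) <= CLOCK_PERIOD_NS:
            low.add(gate)
    return low


def level_shifters_needed(low):
    """Count connections from a Vdd_low gate to a gate at the regular Vdd."""
    return sum(1 for gate in low for load in FANOUT[gate] if load not in low)


cluster = clustered_voltage_scaling()
print("gates at Vdd_low:", sorted(cluster))
print("level shifters inside the logic:", level_shifters_needed(cluster))
```

In this example the whole non-critical output cone is clustered at Vdd_low, while an arbitrary assignment such as {"g2", "g4"} would already need one level shifter, which illustrates the area cost mentioned above.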

While conceptually simple, the implementation of a multi-Vdd concept

poses a number of challenges.

• The additional supply voltage Vdd_low needs to be created on-chip by a dc-

to-dc converter, unless the voltage already exists externally. This results

in area overhead, and in power consumption for the converter.

• The additional supply voltage Vdd_low must be distributed across the chip.

• Level-shifters are required between different supply domains. It is

feasible to integrate level shifters into flip-flops [21].

The penalties in area, power consumption and delay resulting from these

effects are not always taken into account by work published in the literature.

Studies indicate that a 10% area overhead will result from implementing a

dual-Vdd design [22].

An additional consideration for industrial IC product development is that

EDA tool support for implementing a dual-Vdd design is still only

rudimentary. It is not sufficient to have a single point tool which can perform

power-performance tradeoffs. Instead, this methodology needs to encompass

the entire design flow (e.g. power distribution in layout; automated insertion

of level shifters etc.).

1.5.2 Multi-Vth Design

Another essential technique is the use of different transistor threshold

voltages (multi-Vth design). Primarily this technique reduces leakage power

consumption, thus increasing standby time of battery-powered ICs. As

leakage power consumption becomes an increasingly important component

of overall power consumption in modern process technologies, this

technique also contributes more and more to a significant reduction of

overall power consumption. The

idea is similar to multi-Vdd design: paths that do not need highest

performance are implemented with special leakage-reduced transistors

(typically higher Vth transistors, but also thicker gate-oxide Tox), as shown

in Figure 1-6.

Figure 1-6. Multi-Vth design

A typical industrial approach today is to first create a design using lower

Vth transistors to achieve the required performance and then to selectively

replace gates off the critical path with higher Vth (or thicker Tox) transistors

to reduce leakage.
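This two-step flow can be sketched as a greedy swap. The example below is a minimal illustration under stated assumptions: the paths, gate delays, leakage values, the delay penalty of a high-Vth gate and its leakage factor are hypothetical, and timing is checked on an explicit list of paths rather than by a full timing analysis.

```python
# Hypothetical design: each path is a list of gates; a gate may lie on several paths.
PATHS = [["a", "b", "c"], ["a", "d"], ["e", "d"]]
DELAY_NS = {"a": 3.0, "b": 4.0, "c": 3.0, "d": 2.0, "e": 1.0}   # low-Vth delays
LEAK_NA = {"a": 10.0, "b": 12.0, "c": 10.0, "d": 8.0, "e": 5.0}  # low-Vth leakage
HIGH_VTH_SLOWDOWN = 1.3      # assumed delay penalty of a high-Vth gate
HIGH_VTH_LEAK_FACTOR = 0.1   # assumed leakage of a high-Vth gate relative to low-Vth
CLOCK_PERIOD_NS = 10.0       # assumed timing constraint


def meets_timing(high):
    """True if every path still fits into the clock period."""
    for path in PATHS:
        delay = sum(DELAY_NS[g] * (HIGH_VTH_SLOWDOWN if g in high else 1.0) for g in path)
        if delay > CLOCK_PERIOD_NS:
            return False
    return True


def assign_high_vth():
    """Start with all gates at low Vth, then swap the leakiest gates first."""
    high = set()
    for gate in sorted(LEAK_NA, key=LEAK_NA.get, reverse=True):
        if meets_timing(high | {gate}):
            high.add(gate)
    return high


def total_leakage(high):
    return sum(LEAK_NA[g] * (HIGH_VTH_LEAK_FACTOR if g in high else 1.0) for g in LEAK_NA)


high_vth_gates = assign_high_vth()
print("high-Vth gates:", sorted(high_vth_gates))
print(f"leakage: {total_leakage(set()):.1f} nA -> {total_leakage(high_vth_gates):.1f} nA")
```

Gates on the critical path keep their low Vth, so performance is unchanged, and only the gates with slack are swapped.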

Studies in the literature have reported reductions in leakage of around

50% up to 80%. Some approaches assume that different Vth levels are

provided by the process technology (through doping variations) and propose

algorithms to optimally assign Vth levels to transistors, ensuring that

performance is not compromised [23, 24]. Recently, it has also been

proposed to achieve modifications in Vth by modifying transistor length or

gate oxide thickness Tox [25].

Design-tool support for this technique is also rudimentary at best. While

it is becoming established to design different modules of an IC with different

Vth transistors, it is very challenging to do this on the level of individual

transistors within a module. The primary reason is that the entire design flow

must be able to handle cells with identical functionality and size, which

differ in their electrical properties. This poses no fundamental algorithmic

problems, but must be consistently implemented in all EDA tools within a

design flow.

[Figure 1-6 graphic: a non-critical path runs with increased threshold voltage (high Vt); annotated path delays 5ns and 8ns]

1.5.3 Hybrid Approaches

Recently, approaches have been suggested in the literature that combine

implementation of multiple supply voltages and multiple threshold voltages

for further power reduction. Especially for designs where minimization of

total power consumption is key (as compared to e.g. minimization of standby

power for mobile products), it is possible to trade off leakage and dynamic

power, as originally proposed in the zero-Vth concept. Studies in the

literature indicate a total power optimum when leakage power contributes

10% to 30% [26,12]. This ratio depends significantly on the process

technology, operating environment, and clock frequency of a design.

For applications where leakage power minimization is critical (e.g.

mobile products), this approach usually is not feasible, as it requires a

relatively low Vth which causes high leakage currents [14].

With the increasing significance of gate leakage currents, variations of

gate oxide thickness Tox have also been proposed.

An overall framework for combining two supply voltages with two threshold

voltages has been presented [19]. Theoretically, it is shown that

more than 60% of total power consumption can be saved this way (not

considering required overhead such as level shifters, routing etc.). Rules of

thumb are proposed and it is shown that the optimal second Vdd is about 50%

of the original Vdd in this case. It is also argued that the usefulness of multi-

Vdd strategies is not diminished, but actually increased in more advanced

technologies, if also a multi-Vth strategy is followed, since this strategy

allows leakage to be traded off against dynamic power consumption by changing Vth

and Vdd to optimize power consumption, while maintaining a required timing

performance.

This approach has been applied to the practical example of an ARM

processor in [27]. Due to specific layout considerations it was not possible to

implement all four intended combinations of Vdd and Vth. Instead, three

different libraries were implemented. Using a CVS algorithm, a reduction in

dynamic power by 15% was achieved for a 0.18µm process technology.

Leakage power was reduced by 40%. As leakage power was more than

1000x smaller than dynamic power, overall active power reduction was

15%. To achieve this, a 14% increase in area was required.
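The overall figure follows from weighting the two savings by their share of total power. The short calculation below uses the percentages quoted above and takes the leakage-to-dynamic ratio as exactly 1:1000, which is an assumption, since the text only states that leakage was more than 1000x smaller.

```python
def total_reduction(p_dynamic, p_leakage, dynamic_saving, leakage_saving):
    """Overall saving as the power-weighted mean of the two individual savings."""
    saved = p_dynamic * dynamic_saving + p_leakage * leakage_saving
    return saved / (p_dynamic + p_leakage)


# Relative power values: leakage assumed to be exactly 1000x smaller than dynamic power.
overall = total_reduction(p_dynamic=1000.0, p_leakage=1.0,
                          dynamic_saving=0.15, leakage_saving=0.40)
print(f"overall active power reduction: {overall:.2%}")
```

Because leakage carries so little weight in active mode, the 40% leakage saving changes the overall result only in the second decimal place.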

A very recent approach considers also transistor width sizing in addition

to Vdd and Vth assignment [28]. Using a two-stage, sensitivity-based

approach, total power savings of 37% on average over a suite of benchmark

circuits are reported. In this study, the threshold voltage is chosen rather low,

so that leakage represents 20-50% of total power consumption. Therefore,

optimization of both leakage and dynamic power consumption is essential,

which is achieved with the presented approach.

An enhanced approach for leakage power consumption considers

multiple gate oxide thicknesses Tox in addition to multi-Vth [29]. It is

motivated by the fact that gate leakage increases very dramatically with

newer process technologies. Gate leakage is of the same order of magnitude

as subthreshold leakage at the 90nm process node. Their relationship also

depends significantly on the operating temperature T. The key observation

that an OFF transistor suffers from subthreshold leakage, an ON transistor

from gate leakage, motivates the approach to analyze transistor states in

standby mode and assign Vth and Tox such that leakage power consumption

is minimized. Leakage reductions of 5-6x are obtained on benchmark

circuits, compared to designs using a single Vth and Tox.
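The key observation can be turned into a small sketch for a single gate. The example below is a strong simplification and not the method of [29]: the leakage numbers are arbitrary units chosen for illustration, an OFF transistor is assumed to contribute only subthreshold leakage and an ON transistor only gate leakage, and timing constraints, which a practical flow would have to respect, are ignored.

```python
# Assumed leakage per transistor in arbitrary units (illustrative values only).
SUBTHRESHOLD = {"low_vth": 10.0, "high_vth": 1.0}   # applies to OFF transistors
GATE_LEAK = {"thin_tox": 8.0, "thick_tox": 0.5}     # applies to ON transistors


def nand2_states(a, b):
    """ON/OFF state (True = ON) of the four transistors of a 2-input NAND."""
    return {"N1": a == 1, "N2": b == 1, "P1": a == 0, "P2": b == 0}


def leakage(states, assignment):
    """Sum the dominant leakage component of every transistor."""
    total = 0.0
    for name, is_on in states.items():
        vth, tox = assignment[name]
        total += GATE_LEAK[tox] if is_on else SUBTHRESHOLD[vth]
    return total


def assign_by_state(states):
    """Thick oxide for ON transistors, high Vth for OFF transistors."""
    return {name: ("low_vth", "thick_tox") if is_on else ("high_vth", "thin_tox")
            for name, is_on in states.items()}


standby = nand2_states(1, 1)                      # assumed known standby input state
baseline = {name: ("low_vth", "thin_tox") for name in standby}
optimized = assign_by_state(standby)
print("baseline leakage: ", leakage(standby, baseline))
print("optimized leakage:", leakage(standby, optimized))
```

Each transistor receives only the one measure that addresses its dominant leakage mechanism in the standby state, so the faster device option is kept wherever it does not cost leakage.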

Previous approaches that included Tox into the optimization varied Tox

only for different design modules, not on critical paths within modules.

These newer approaches promise further reductions in power

consumption. This will come, however, at a price (as seen e.g. in the ARM

example). Design complexity increases significantly when variations in

many parameters are made available at the same time. In some studies, the

resulting overhead is not considered.

1.5.4 Cost Tradeoffs

This overhead must be considered, however, since it is quite significant:

• Multi-Vdd: level shifters (area, power consumption, delay), routing of

additional supply voltages (area).

• Multi-Vth: additional masks (manufacturing costs); potentially special

design rules at the boundary between different Vth devices (area).

• Multi-Tox: additional masks (manufacturing costs).

• In addition, IC development costs increase due to more complex design

flows. Also, special process options (Vth, Tox) must be developed,

qualified and continuously monitored. For each such option, the design

library must be electrically characterized, modelled for all EDA tools,

and potentially optimized regarding circuit design and layout. It must be

maintained and regularly updated (changes in electrical parameters,

changes in tools in the design flow) over a long period of time as well. If

a very specialized manufacturing flow is developed to fully optimize a

given product, it will be very difficult to shift manufacturing of this

product to a different fab (e.g. a foundry in case additional capacity is

required).

For these and potentially other reasons, we are not yet aware of industrial

products that have implemented such proposals in a fine-grained manner (i.e.

different Vth, Vdd and Tox combined within one design module).

Some approaches in the literature also determine optimum levels of

threshold voltages depending on a given design. In industry, this is rarely

feasible. Typically, a manufacturing process has to be taken as given, with

only predefined values of Vth (and Tox) being available.

1.6 APPROACHES ON HIGHER ABSTRACTION LEVELS

The approaches outlined above on gate level and device level can be (and

often must be) supported by measures on higher levels of abstraction.

Some of the most promising concepts are as follows:

• partitioning the system such that large areas can be powered off for

significant periods of time (block turnoff)

• especially partitioning memory systems such that large parts can be

turned off in standby mode

• clock gating is an essential method which reduces dynamic power

consumption by local off-switching of non-active gates

• coding strategies (e.g. for buses) can reduce switching and thus dynamic

power consumption
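As an illustration of the last item, the sketch below implements bus-invert coding, one well-known bus coding scheme; the chapter does not name a specific scheme, so this choice is an assumption. Each word is sent inverted, together with an extra invert line, whenever that toggles fewer bus lines than sending it unchanged. The bus width and the word sequence are made up for the example.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, width=8):
    """Return (bus_value, invert_bit) pairs; the bus is assumed to start at 0."""
    mask = (1 << width) - 1
    bus = 0
    encoded = []
    for word in words:
        if 2 * hamming(bus, word) > width:   # more than half of the lines would toggle
            bus, invert = ~word & mask, 1
        else:
            bus, invert = word, 0
        encoded.append((bus, invert))
    return encoded


def bus_invert_decode(encoded, width=8):
    mask = (1 << width) - 1
    return [(~bus & mask) if invert else bus for bus, invert in encoded]


def raw_transitions(words):
    """Line toggles when the words are sent unencoded."""
    total, previous = 0, 0
    for word in words:
        total += hamming(previous, word)
        previous = word
    return total


def encoded_transitions(encoded):
    """Line toggles of the encoded bus, including the invert line."""
    total, prev_bus, prev_invert = 0, 0, 0
    for bus, invert in encoded:
        total += hamming(prev_bus, bus) + (prev_invert ^ invert)
        prev_bus, prev_invert = bus, invert
    return total


words = [0x00, 0xFF, 0x00, 0xFE, 0x01]
encoded = bus_invert_encode(words)
print("unencoded toggles:", raw_transitions(words))
print("encoded toggles:  ", encoded_transitions(encoded))
```

The word sequence is deliberately unfavourable for an unencoded bus; for typical data the reduction is smaller, and the extra line and the encoder logic add their own overhead.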

1.7 CONCLUSION AND FUTURE CHALLENGES

There is no single “silver bullet” to solve the challenge of power

reduction. While ultra-low voltage logic based on special ultra-low-Vth

devices is conceptually very convincing, its widespread

implementation is hindered by manufacturing concerns. An extrapolation of

current technology trends indicates that such a concept will become even

more difficult in the future.

Today, design techniques are the most promising approach to reduce

power – both dynamic and leakage.

The concepts outlined here can be further extended. It is feasible to

dynamically adjust supply and threshold voltages. These are theoretically

promising concepts which, however, still require more investigation,

especially with regard to feasibility under industrial boundary conditions.

Quite likely, in the future even more emphasis than today will have to be

placed on power reduction schemes on algorithmic and system level. On

these levels, the levers to reduce power consumption are largest.

Acknowledgement

The authors wish to acknowledge and thank Jörg Berthold and Tim

Schönauer for their contributions and fruitful discussions.

References

[1] G. Moore, Cramming More Components onto Integrated Circuits, Electronics Magazine,

Vol. 38, No. 8, 1965, pp. 114-117.

[2] ITRS, International Technology Roadmap for Semiconductors, 2003, http://public.itrs.net.

[3] F. Pollack, New Microarchitecture Challenges in the Coming Generations of CMOS

Process Technologies, Micro32 Keynote, 1999.

[4] U. Schlichtmann, Systems are Made from Transistors: UDSM Technology Creates New

Challenges for Library and IC Development, IEEE Euromicro Symposium on Digital

System Design, 2002, pp. 1-2.

[5] S. Borkar, Design Challenges of Technology Scaling, IEEE Micro, July/August 1999, pp.

23-29.

[6] S. Thompson, P. Packan, and M. Bohr, MOS Scaling: Transistor Challenges for the 21st

Century, Intel Technology Journal, Q3 1998.

[7] N. Kim et al., Leakage Current: Moore's Law Meets Static Power, IEEE Computer, Vol.

36, No. 12, December 2003, pp. 68-75.

[8] T. Sakurai, A. R. Newton, Alpha-Power Law MOSFET Model and its Application to

CMOS Inverter Delay and Other Formulas, IEEE Journal of Solid-State Circuits, Vol.

25, No. 2, 1990, pp. 584-594.

[9] J.B. Burr, J. Schott, A 200 mV self-testing encoder/decoder using Stanford ultra-low-

power CMOS, 1994 IEEE International Solid-State Circuits Conference.

[10] J. Berthold, R. Nadal, C. Heer, Optionen für Low-Power-Konzepte in den sub-180-nm-

CMOS-Technologien (In German), U.R.S.I. Kleinheubacher Tagung 2002.

[11] V. Svilan, M. Matsui, J. B. Burr, Energy-Efficient 32 x 32-bit Multiplier in Tunable

Near-Zero Threshold CMOS, ISLPED 2000, pp. 268-272.

[12] V. Svilan, J. B. Burr, L. Tyler, Effects of Elevated Temperature on Tunable Near-Zero

Threshold CMOS, ISLPED 2001, pp. 255-258.

[13] C. Heer, Designing low-power circuits: an industrial point of view, PATMOS 2001.

[14] T. Schoenauer, J. Berthold, C. Heer, Reduced Leverage of Dual Supply Voltages in Ultra

Deep Submicron Technologies, International Workshop on Power And Timing

Modeling, Optimization and Simulation PATMOS 2003, pp. 41-50.

[15] K. Usami, M. Igarashi, Low-Power Design Methodology and Applications utilizing Dual

Supply Voltages, Proceedings of the Asia and South Pacific Design Automation

Conference 2000, pp. 123-128.

[16] M. Donno, L. Macchiarulo, A. Macii, E. Macii, M. Poncino, Enhanced Clustered

Voltage Scaling for Low Power, Proceedings of the 12th ACM Great Lakes Symposium

on VLSI, 2002, pp. 18-23.

[17] K. Usami et al., Automated Low-Power Technique Exploiting Multiple Supply Voltages

Applied to a Media Processor, IEEE Journal of Solid-State Circuits, Vol. 33, No. 3,

March 1998, pp. 463-472.

[18] M. Hamada, Y. Ootaguro, T. Kuroda, Utilizing Surplus Timing for Power Reduction,

Proceedings IEEE Custom Integrated Circuits Conference CICC, 2001, pp. 89-92.

[19] A. Srivastava, D. Sylvester, Minimizing Total Power by Simultaneous Vdd/Vth

Assignment, Proceedings of the Asia and South Pacific Design Automation Conference

2003, pp. 400-403.

[20] K. Usami, M. Horowitz, Clustered Voltage Scaling Technique for Low-Power Design,

Proceedings of the International Symposium on Low Power Design ISLPD, 1995, pp. 3-

[21] K. Usami et al., Design Methodology of Ultra Low-power MPEG4 Codec Core

Exploiting Voltage Scaling Techniques, Proceedings of the 35th Design Automation

Conference 1998, pp. 483-488.

[22] C. Yeh, Y.-S. Kang, Layout Techniques Supporting the Use of Dual Supply Voltages for

Cell-Based Designs, Proceedings of the 36th Design Automation Conference 1999, pp.

62-67.

[23] Q. Wang, S. Vrudhula, Algorithms for Minimizing Standby Power in Deep

Submicrometer, Dual-Vt CMOS Circuits, IEEE Transactions on CAD, Vol. 21, No. 3,

March 2002, pp. 306-318.

[24] L. Wei, Z. Chen, K. Roy, M. Johnson, Y. Ye, V. De, Design and Optimization of Dual-

Threshold Circuits for Low-Voltage Low-Power Applications, IEEE Transactions on

Very Large Scale Integration (VLSI), Vol. 7, No. 1, March 1999, pp. 16-24.

[25] N. Sirisantana, K. Roy, Low-Power Design Using Multiple Channel Lengths and Oxide

Thicknesses, IEEE Design & Test of Computers, January-February 2004, pp. 56-63.

[26] K. Nose, T. Sakurai, Optimization of VDD and VTH for Low-Power and High-Speed

Applications, Proceedings of the Asia and South Pacific Design Automation Conference

2000, pp. 469-474.

[27] R. Bai, S. Kulkarni, W. Kwong, A. Srivastava, D. Sylvester, D. Blaauw, An

Implementation of a 32-bit ARM Processor Using Dual Power Supplies and Dual

Threshold Voltages, IEEE International Symposium on VLSI, 2003, pp. 149-154.

[28] A. Srivastava, D. Sylvester, D. Blaauw, Concurrent Sizing, Vdd and Vth Assignment for

Low-Power Design, Proceedings of the Design, Automation and Test in Europe

Conference DATE, 2003, pp. 718-719.

[29] D. Lee, H. Deogun, D. Blaauw, D. Sylvester, Simultaneous State, Vt and Tox

Assignment for Total Standby Power Minimization, Proceedings of the Design,

Automation and Test in Europe Conference DATE, 2003, pp. 494-499.

Chapter 2

ON-CHIP OPTICAL INTERCONNECT FOR LOW-POWER

Ian O’Connor and Frederic Gaffiot

Ecole Centrale de Lyon

Abstract It is an accepted fact that process scaling and operating frequency both contribute to increasing integrated circuit power dissipation due to interconnect. Extrapolating this trend leads to a red brick wall which only radically different interconnect architectures and/or technologies will be able to overcome. The aim of this chapter is to explain how, by exploiting recent advances in integrated optical devices, optical interconnect within systems on chip can be realised. We describe our vision for heterogeneous integration of a photonic “above-IC” communication layer. Two applications are detailed: clock distribution and data communication using wavelength division multiplexing. For the first application, a design method will be described, enabling quantitative comparisons with electrical clock trees. For the second, more long-term, application, our views will be given on the use of various photonic devices to realize a network on chip that is reconfigurable in terms of the wavelength used.

Keywords: Interconnect technology, optical interconnect, optical network on chip

2.1 INTRODUCTION

In the 2003 edition of the ITRS roadmap [17], the interconnect problem was summarised thus: “For the long term, material innovation with traditional scaling will no longer satisfy performance requirements. Interconnect innovation with optical, RF, or vertical integration ... will deliver the solution”. Continually shrinking feature sizes, higher clock frequencies, and growth in complexity are all negative factors as far as switching charges on metallic interconnect are concerned. Even with low resistance metals such as copper and low dielectric constant materials, bandwidths for long interconnect will be insufficient for future operating frequencies. Already the use of metal tracks to transport a signal over a chip has a high cost in terms of power: clock distribution, for instance, requires a significant part (30-50%) of total chip power in high-performance microprocessors.

A promising approach to the interconnect problem is the use of an optical interconnect layer, which could empower an increase in the ratio between data rate and power dissipation. At the same time it would enable synchronous operation within the circuit and with other circuits, relax constraints on thermal dissipation and sensitivity, signal interference and distortion, and also free up routing resources for complex systems. However, this comes at a price. Firstly, high-speed and low-power interface circuits are required, design of which is not easy and has a direct influence on the overall performance of optical interconnect. Another important constraint is the fact that all fabrication steps have to be compatible with future IC technology and also that the additional cost incurred remains affordable. Additionally, predictive design technology is required to quantify the performance gain of optical interconnect solutions, where information is scant and disparate concerning not only the optical technology, but also the CMOS technologies for which optics could be used (post-45nm node).

In section 2.2, we will describe the “above-IC” optical technology. Sections 2.3 and 2.4 describe an optical clock distribution network and a quantitative electrical-optical power comparison respectively. A proposal for a novel optical network on chip is discussed in section 2.5.

2.2 OPTICAL INTERCONNECT TECHNOLOGY

Various technological solutions may be proposed for integrating an optical transport layer in a standard CMOS system. In our opinion, the most promising approach makes use of hybrid (3D) integration of the optical layer above a complete CMOS IC, as shown in fig. 2.1. The basic CMOS process remains the same, since the optical layer can be fabricated independently. The weakness of this approach is in the complex electrical link between the CMOS interface circuits and the optical sources (via stack and advanced bonding).

In the system shown in fig. 2.1, a CMOS source driver circuit modulates the current flowing through a biased III-V microsource by way of a via stack, which makes the electrical connection between the CMOS devices and the optical layer. III-V active devices are chosen in preference to Si-based optical devices for high-speed and long-wavelength operation. The microsource is coupled to the passive waveguide structure, where silicon is used as the core and SiO2 as the cladding material. Si/SiO2 structures are compatible with conventional silicon technology and silicon is an excellent material for transmitting wavelengths above 1.2µm (mono-mode waveguiding with attenuation as low as 0.8dB/cm has been demonstrated [10]). The waveguide structure transports the optical signal to a III-V photodetector (or possibly to several, as in the case of a broadcast function) where it is converted to an electrical photocurrent, which flows through another via stack to a CMOS receiver circuit which regenerates the digital output signal. This signal can then, if necessary, be distributed over a small zone by a local electrical interconnect network.

Figure 2.1. Cross-section of hybridised interconnection structure (labels: driver circuit, receiver circuit, electrical contact, III-V laser source, III-V photodetector, Si photonic waveguide (n=3.5), SiO2 waveguide cladding (n=1.5), CMOS IC)

2.3 AN OPTICAL CLOCK DISTRIBUTION NETWORK

In this section we present the structure of the optical clock distribution network, and detail the characteristics of each component part in the system: active optoelectronic devices (external VCSEL source and PIN detector), passive waveguides, interface (driver and receiver) circuits. The latter represent extremely critical parts to the operation of the overall link and require particularly careful design.

An optical clock distribution network, shown in fig. 2.2, requires a single photonic source coupled to a symmetrical waveguide structure routing to a number of optical receivers. At the receivers the high-speed optical signal is converted to an electrical one and provided to local electrical networks. Hence the primary tree is optical, while the secondary tree is electrical. It is not feasible to route the optical signal all the way down to the individual gate level since each drop point requires a receiver circuit which consumes area and power. The clock signal is thus routed optically to a number of drop points which will cover a zone over which the last part of the clock distribution will be carried out by the electrical secondary clock tree. The size of the zones is determined by calculating the power required to continue in the optical domain and comparing it to the power required to distribute over the zone in the electrical domain. The number of clock distribution points (64 in the figure) is a particularly crucial parameter in the overall system.

The global optical H-tree was optimised to achieve minimal optical losses by designing the bend radii to be as large as possible. For 20mm die width and 64 output nodes in the H-tree at the 70nm technology node, the smallest radius of curvature (r3 in fig. 2.2) is 625µm, which leads to negligible pure bending loss.

Figure 2.2. Optical H-tree clock distribution network (OCDN) with 64 output nodes. r1−3 are the bend radii linked to the chip width D (r1 = D/8, r2 = D/16, r3 = D/32). Elements shown: optical source, optical waveguides, optical receivers, electrical clock trees. Loss contributions marked: LCV, source-waveguide coupling loss; LW, waveguide transmission loss; LY, Y-coupler loss; LB, bending loss; LCR, waveguide-receiver coupling loss.
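As a small illustration of the scaling rule given with fig. 2.2, the three bend radii can be computed directly from the die width. The function name and the use of micrometres are our own choices; the sketch covers only the radii r1 to r3 named in the figure.

```python
def h_tree_bend_radii(die_width_um: float) -> dict[str, float]:
    """Bend radii of the H-tree as stated with fig. 2.2: r1=D/8, r2=D/16, r3=D/32."""
    # i = 1, 2, 3 maps to divisors 8, 16, 32
    return {f"r{i}": die_width_um / 2 ** (i + 2) for i in (1, 2, 3)}


# 20 mm die width, expressed in micrometres
radii = h_tree_bend_radii(20_000.0)
for name, value in radii.items():
    print(f"{name} = {value:.0f} um")
```

For a 20mm die the smallest radius r3 evaluates to 625µm, the figure quoted in the text.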

2.3.1 VCSEL sources

VCSELs (Vertical Cavity Surface Emitting Lasers) are certainly the most mature emitters for on-chip or chip-to-chip interconnections. Commercial VCSELs, when forward biased at a voltage well above 1.5V, can emit optical power of the order of a few mW around 850nm, with an efficiency of some 40%. Threshold currents are typically in the mA range. However, fundamental requirements for integrated semiconductor lasers in optical interconnect applications are small size, low threshold lasing operation and single-mode operation (i.e. only one mode is allowed in the gain spectrum). Additionally, the fact that VCSELs emit light vertically makes coupling less easy. It is clear that significant effort is required from the research community if VCSELs are to compete seriously in the on-chip optical interconnect arena: wavelength and efficiency must be increased, and threshold current reduced, in the same device. Long-wavelength and low-threshold VCSELs are only just beginning to emerge (for example, a 1.5µm, 2.5Gb/s tuneable VCSEL [5], and an 850nm, 70µA threshold current, 2.6µm diameter CMOS compatible VCSEL [11] have been reported). Ultimately however, optical interconnect is more likely to make use of integrated microsources as described in section 2.5, as these devices are intrinsically better suited to this type of application.

2.3.2 PIN photodetectors

In order to optimise the frequency and power dissipation performance of the overall link, photodetectors must exhibit high quantum efficiency, large intrinsic bandwidth and small parasitic capacitance. The photodetector performance is measured by the bandwidth-efficiency product.

Conventional III-V PIN devices suffer from two main limitations. On one hand, their relatively high capacitance per unit area leads to limitations in the design of the transimpedance amplifier interface circuit. On the other hand, due to their vertical structure, there is a tradeoff between their frequency performance and their efficiency (the quantum efficiency increases and the bandwidth decreases with the absorption intrinsic layer thickness) [9].

Metal-semiconductor-metal (MSM) photodetectors offer an alternative to conventional PIN photodetectors. An MSM photodetector consists of interdigitated metal contacts on top of an absorption layer. Because of their lateral structure, MSM photodetectors have very high bandwidths due to their low capacitance and the possibility of reducing the carrier transit time. However, the responsivity is usually low compared to PIN photodetectors [4]. MSM photodiodes with bandwidth greater than 100GHz have been reported.

2.3.3 Waveguides

Optical waveguides are at the heart of the optical interconnect concept. In the Si/SiO2 approach, the high relative refractive index difference $\Delta = (n_1^2 - n_2^2)/(2 n_1^2)$ between the core ($n_1 \approx 3.5$ for Si) and cladding ($n_2 \approx 1.5$ for SiO2) allows the realisation of a compact optical circuit with dimensions compatible with DSM technologies. For example, it is possible to realise monomode waveguides less than 1µm wide (waveguide width of 0.3µm for wavelengths of 1.55µm), with bend radii of the order of a few µm [15].
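To give a feel for the magnitude involved, the approximate indices quoted above can be substituted into the definition of the relative refractive index difference. This is only a worked evaluation of the stated formula with the rounded values of n1 and n2:

```latex
\Delta = \frac{n_1^2 - n_2^2}{2 n_1^2}
       = \frac{3.5^2 - 1.5^2}{2 \times 3.5^2}
       = \frac{12.25 - 2.25}{24.5}
       \approx 0.41
```

A value of roughly 0.4 is what is meant here by a high index difference.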

However, the performance of the complete optical system depends on the minimum optical power required by the receiver and on the efficiency of passive optical devices used in the system. The total loss in any optical link is the sum of losses (in decibels) of all optical components:

$$L_{total} = L_{CV} + L_W + L_B + L_Y + L_{CR} \qquad (2.1)$$

LCV is the coupling loss between the photonic source and optical waveguide. There are currently several methods to couple the beam emitted from the laser into the optical waveguide. In this analysis we assumed 50% coupling efficiency (LCV = 3dB) from the source to a single mode waveguide.

LW is the rectangular waveguide transmission loss per unit distance of the optical power. Due to small waveguide dimensions and large index change at the core/cladding interface in the Si/SiO2 waveguide, the side-wall scattering is the dominant source of loss (fig. 2.3a). For the waveguide fabricated by Lee [10] with roughness of 2nm the calculated transmission loss is 1.3dB/cm.

LB is the bending loss, highly dependent on the refractive index difference ∆ between the core and cladding medium. In Si/SiO2 waveguides, ∆ is relatively high and so, due to this strong optical confinement, bend radii as small as a few µm may be realised. As can be seen from fig. 2.3b, the bending losses associated with a single mode strip waveguide are negligible if the radius of curvature is larger than 3µm.

LY is the Y-coupler loss, and depends on the reflection and scattering attenuation into the propagation path and surrounding medium. For high index difference waveguides the losses for the Y-branch are significantly smaller than for low ∆ structures and the simulated losses are less than 0.2dB per split [14].

LCR is the coupling loss from the waveguide to the optical receiver. Using currently available materials and methods it is possible to achieve an almost 100% coupling efficiency from waveguide to optical receiver. In this analysis the coupling efficiency is assumed to be 87% (LCR = 0.6dB) [16].
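Because the terms of equation (2.1) are expressed in decibels, coupling efficiencies have to be converted before they can be summed. The following minimal sketch shows that conversion for the two coupling efficiencies assumed in this analysis (50% and 87%); the function names are ours and nothing beyond the standard definition of the decibel is assumed.

```python
import math


def efficiency_to_loss_db(efficiency: float) -> float:
    """Convert a power coupling efficiency in (0, 1] into a loss in dB."""
    return -10.0 * math.log10(efficiency)


def loss_db_to_efficiency(loss_db: float) -> float:
    """Inverse conversion: a loss in dB back to a power efficiency."""
    return 10.0 ** (-loss_db / 10.0)


l_cv = efficiency_to_loss_db(0.50)  # source-to-waveguide coupling
l_cr = efficiency_to_loss_db(0.87)  # waveguide-to-receiver coupling

# Adding losses in dB is equivalent to multiplying the efficiencies.
combined = loss_db_to_efficiency(l_cv + l_cr)

print(f"L_CV = {l_cv:.2f} dB, L_CR = {l_cr:.2f} dB")
print(f"combined coupling efficiency = {combined:.3f}")
```

The two values come out at about 3dB and 0.6dB, in line with the figures used for LCV and LCR.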

2.3.4 Interface circuits

High-speed CMOS optoelectronic interface circuits are crucial building blocks to the optical interconnect approach. The electrical power dissipation of the link is defined by these circuits, but it is the receiver circuit that poses the most serious design challenges. The power dissipated by the source driver is mainly determined by the source bias current and is therefore device-dependent. On the receiver side however, most of the receiver power is due to the circuit, while only a small fraction is required for the photodetector device.

Figure 2.3a. Simulated transmission loss for varying sidewall roughness in a 0.5µm × 0.2µm Si/SiO2 strip waveguide (horizontal axis: sidewall roughness, nm)

Figure 2.3b. Simulated pure bending loss for various bend radii in a 0.5µm × 0.2µm Si/SiO2 strip waveguide (horizontal axis: bend radius, µm)

2.3.4.1 Driver circuits. Source driver circuits generally use a current modulation scheme for high-speed operation. The source always has to be biased above its threshold current by a MOS current sink to eliminate turn-on delays, which is why low-threshold sources are so important (figures of the order of 40µA [7] have been reported). A switched current sink modulates the current flowing through the source, and consequently the output optical power injected into the waveguide. As with most current-mode circuits, high bandwidth can be achieved since the voltage over the source is held relatively constant and parasitic capacitances at this node have reduced influence on the speed.

2.3.4.2 Receiver circuits. A typical structure for a high-speed photoreceiver circuit consists of: a transimpedance amplifier (TIA) to convert the photocurrent of a few µA into a voltage of a few mV; a comparator to generate a rail-to-rail signal; and a data recovery circuit to eliminate jitter from the restored signal. Of these, the TIA is arguably the most critical component for high-speed performance, since it has to cope with a generally large photodiode capacitance situated at its input.

The basic transimpedance amplifier structure in a typical configuration is shown in fig. 2.4 [8]. The bandwidth/power ratio of this structure can be maximised by using small-signal analysis and mapping of the individual component values to a filter approximation of Butterworth type.

It is then possible to develop a synthesis procedure which, from desired transimpedance performance criteria (gain Zg0, bandwidth and pole quality factor Q) and operating conditions (photodiode and load capacitances, Cd and Cl respectively), generates component values for the feedback resistance Rf and the voltage amplifier (voltage gain Av and output resistance Ro). Circuits with high Ro/Av ratio (≈ 1/∑gm) require the least quiescent current and area, and this quantity therefore constitutes an important figure of merit in design space exploration (fig. 2.5a). To reach a sized transistor-level circuit, approximate equations for the small-signal characteristics and bias conditions of the circuit are sufficient to allow a first-cut sizing of the amplifier, which can then be fine-tuned by numerical or manual optimisation, using simulation for exact results. The complete process is described in [13].

Figure 2.4. CMOS transimpedance amplifier structure. The expressions given with the figure are:

$$C_x = C_d + C_i, \qquad C_y = C_o + C_l$$

$$M_f = R_f / R_o, \qquad M_x = C_x / C_y, \qquad M_m = C_m / C_y$$

$$Z_{g0} = -\frac{R_f - R_o/A_v}{1 + 1/A_v}$$

$$\omega_0 = \frac{1}{R_o C_y}\sqrt{\frac{1 + A_v}{M_f\,(M_x + M_m(1 + M_x))}}$$

$$Q = \frac{\sqrt{M_f\,(M_x + M_m(1 + M_x))\,(1 + A_v)}}{1 + M_x(1 + M_f) + M_m M_f\,(1 + A_v)}$$
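The expressions accompanying fig. 2.4 can be evaluated numerically to explore how component values map to gain, natural frequency and quality factor. The sketch below is only an illustration: the function name is ours, the component values are arbitrary placeholders rather than figures from this chapter, and Cm is taken to be a capacitance in parallel with Rf, which is an assumption since it is not defined in the text.

```python
import math


def tia_characteristics(rf, ro, av, cx, cy, cm):
    """Evaluate Zg0, omega0 and Q from the fig. 2.4 expressions.

    rf, ro in ohms; av dimensionless; cx, cy, cm in farads.
    cm is assumed to sit in parallel with rf (not defined in the text).
    """
    mf = rf / ro
    mx = cx / cy
    mm = cm / cy
    k = mf * (mx + mm * (1.0 + mx))  # term shared by omega0 and Q
    zg0 = -(rf - ro / av) / (1.0 + 1.0 / av)
    omega0 = math.sqrt((1.0 + av) / k) / (ro * cy)
    q = math.sqrt(k * (1.0 + av)) / (1.0 + mx * (1.0 + mf) + mm * mf * (1.0 + av))
    return zg0, omega0, q


# Placeholder values chosen for illustration only.
zg0, omega0, q = tia_characteristics(
    rf=1e3, ro=100.0, av=10.0, cx=500e-15, cy=100e-15, cm=10e-15
)
print(f"Zg0 = {zg0:.1f} ohm")
print(f"f0  = {omega0 / (2 * math.pi) / 1e9:.2f} GHz")
print(f"Q   = {q:.3f}")
```

A Butterworth-type response corresponds to Q = 1/√2, so a synthesis procedure of the kind described above would adjust Rf, Ro and Av until Q and the bandwidth both reach their targets.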

Figure 2.5a. TIA Ro/Av design space with varying bandwidth and transimpedance gain requirements (plot title: amplifier Ro/Av requirement, Ci=500fF, Cl=100fF; axes: bandwidth requirement /GHz, transimpedance gain requirement /ohms, Ro/Av)

Figure 2.5b. Evolution of TIA characteristics (power, area, noise) with technology node (plot title: 1THzohm transimpedance amplifier characteristics against technology node, Cd = 400fF, Cl = 150fF; horizontal axis: technology node, 350nm to 45nm; plotted quantities include area /µm2 and quiescent power /100µW)

Using this methodology with industrial transistor models for technology nodes from 350nm to 180nm and predictive BSIM3v3/BSIM4 models for technology nodes from 130nm down to 45nm [3], we generated design parameters for 1THzΩ transimpedance amplifiers to evaluate the evolution in critical characteristics with technology node. Fig. 2.5b shows the results of transistor level simulation of fully generated photoreceiver circuits at each technology node.

2.4 QUANTITATIVE POWER COMPARISON BETWEEN ELECTRICAL AND OPTICAL CLOCK DISTRIBUTION NETWORKS

2.4.1 Design methodology

In an optical link there are two main sources of electrical power dissipation: (i) power dissipated by the optical receiver(s) and (ii) energy needed by the optical source(s) to provide the required optical output power. To estimate the electrical power dissipated in the system we developed the methodology shown in fig. 2.6.

Figure 2.6. Methodology used to estimate the electrical power dissipation in an optical clock distribution network (blocks: BER specification (SNR requirement); photodiode characteristics (R, Cd, Idark); transimpedance amplifier; minimum optical power at receiver; losses in passive waveguide network; minimum optical power at source; source efficiency; emitter power; receiver power; electrical power dissipated in optical system)

The first criterion for defining the performance of the optoelectronic link is the required signal transmission quality, represented by the bit error rate (BER) and directly linked to the photoreceiver signal to noise ratio. For an on-chip interconnect network, a BER of $10^{-15}$ is acceptable. To calculate the required signal power at the receiver, the characteristics of the receiver circuit have to be extracted from the transistor-level schematic, which is generated from the photodetector characteristics (responsivity R, Cd, dark current Idark) and from the required operating frequency using the method described in section 2.3. For the given BER and for the noise signal associated with the photodiode and transimpedance circuit, the minimum optical power required by the receiver to operate at the given error probability can be calculated using the Morikuni formula [12].
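To illustrate how a BER specification translates into a signal to noise requirement, the sketch below uses the common Gaussian-noise relation BER = ½ erfc(Q/√2), where Q is the ratio of signal swing to total noise. This is a textbook approximation used here as a stand-in; it is not the Morikuni formula [12] applied in this work, and the function names are ours.

```python
import math


def ber_from_q(q: float) -> float:
    """BER for a binary link with Gaussian noise and Q-factor q."""
    return 0.5 * math.erfc(q / math.sqrt(2.0))


def q_for_ber(target_ber: float, lo: float = 0.0, hi: float = 20.0) -> float:
    """Find the Q-factor giving target_ber by bisection.

    BER falls monotonically as Q rises, so the root is bracketed by [lo, hi].
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if ber_from_q(mid) > target_ber:
            lo = mid  # BER still too high: a larger Q is needed
        else:
            hi = mid
    return 0.5 * (lo + hi)


q_required = q_for_ber(1e-15)
print(f"Q required for BER 1e-15: {q_required:.2f}")
```

Under this approximation a BER of $10^{-15}$ calls for a Q-factor just under 8, which then has to be met by the received photocurrent relative to the combined photodiode and amplifier noise.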

With this figure, and knowing the layout and therefore the optical losses that will be incurred in the waveguides, the minimum required optical power at the source can be estimated. The total electrical power dissipated in the optical link is the sum of the power dissipated by the number of optical receivers and the energy needed by the source to provide the required optical power. The electrical power dissipated by the receivers can be extracted from transistor-level simulations. To estimate the energy needed by the optical source, laser light-current characteristics given by Amann [1] were used.
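The final summation step of this methodology can be sketched as follows. This is a simplification under stated assumptions: the source is described by a single electrical-to-optical conversion efficiency rather than the light-current characteristics of [1], and all numerical values below are placeholders for illustration, not results from this study.

```python
def optical_cdn_electrical_power_mw(
    n_receivers: int,
    receiver_power_mw: float,
    source_optical_power_mw: float,
    source_efficiency: float,
) -> float:
    """Total electrical power: all receivers plus the electrical drive of the source.

    source_efficiency is an assumed lumped electrical-to-optical efficiency (0..1].
    """
    if not 0.0 < source_efficiency <= 1.0:
        raise ValueError("source_efficiency must be in (0, 1]")
    emitter_power_mw = source_optical_power_mw / source_efficiency
    receivers_power_mw = n_receivers * receiver_power_mw
    return emitter_power_mw + receivers_power_mw


# Placeholder inputs: 64 receivers at 0.1 mW each, 2.3 mW optical output,
# 25 % conversion efficiency (all hypothetical).
total_mw = optical_cdn_electrical_power_mw(64, 0.1, 2.3, 0.25)
print(f"total electrical power = {total_mw:.1f} mW")
```

The structure makes the two levers visible: receiver power scales with the number of drop points, while emitter power scales with the optical loss budget and inversely with source efficiency.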

2.4.2 Design performance

Our aim in this work was to quantitatively compare the power dissipation in electrical and optical clock distribution networks for a number of cases, including technology node prediction. For both electrical and optical cases we used technology parameters from the ITRS roadmap (wire geometry, material parameters). For transistor models we used predictive model parameters from Berkeley (BSIM3V3 down to 70nm and BSIM4 down to 45nm). The power dissipated in the electrical system can be attributed to the charging and discharging of the wiring and load capacitance and to the static power dissipated by the buffers. In order to calculate the power we used an internally developed simulator, which allows us to model and calculate the electrical parameters of clock networks for future technology nodes [18]. For optical performance predictions we used existing technology characteristics, while for the optoelectronic devices we took datasheets from two real devices and used these figures.

The power dissipated in clock distribution networks was analysed in both systems at the 70nm technology node. Power dissipation figures for electrical and optical CDNs were calculated based on the system performance summarised in tables 2.1a and 2.1b.

Table 2.1a. Electrical CDN characteristics

Electrical system parameter        Value
Technology (nm)                    70
Vdd (V)                            0.9
Tox (nm)                           1.6
Chip size (mm2)                    400
Global wire width (µm)             1
Metal resistivity (µΩ-cm)          2.2
Dielectric constant                3
Optimal segment length (mm)        1.7
Optimal buffer size (µm)           90

Table 2.1b. Optical CDN characteristics

Optical system parameter           Value
Wavelength λ (nm)                  1550
Waveguide core index (Si)          3.47
Waveguide cladding index (SiO2)    1.44
Waveguide thickness (µm)           0.2
Waveguide width (µm)               0.5
Transmission loss (dB/cm)          1.3
Loss per Y-junction (dB)           0.2
Input coupling coefficient (%)     50
Photodiode capacitance (fF)        100
Photodiode responsivity (A/W)      0.95

What follows are the results of comparisons of the power dissipation in electrical and optical clock distribution networks. The comparison was carried out quantitatively for varying chip size, operating frequency, number of clock distribution points, technology node, and finally sidewall roughness. This latter performance characteristic is the only one that is not system-driven, but it gives some important design information to technology groups working on optical interconnect.

Fig. 2.7a shows a power comparison where we vary square die size from 10 to 37 mm width. This analysis was carried out for the 70nm node at a distribution frequency of 5.6GHz (which is the clock frequency associated with this node) and 256 drop points. Electrical CDN power rises almost linearly with die size, which is understandable since the line lengths increase and therefore require more buffers to drive them. Optical CDN power rises much more slowly since all that is really changing is transmission loss and this has a quite minor effect on the overall power dissipation.

When we vary clock frequency for constant chip width (fig. 2.7b), we observe a similar effect for the electrical CDN. Again, the number of buffers has to increase since the segment lengths have to be reduced in order to attain the lower RC time constants. For the optical CDN, what is changing is the receiver power dissipation. The transimpedance amplifier requires a lower output resistance in order to operate at higher frequencies and this translates to a higher bias current.

In fig. 2.7c, we vary the number of drop points and see that both electrical and optical CDN power dissipation rises, but optical rises much faster than electrical. There are two reasons for this: firstly, every time the number of drop points is doubled, so is the number of receivers and this accounts for a large part of the power dissipation; secondly, the number of splitters is doubled, which in turn means that the power at emission also has to be doubled. These two factors cause the optical power to catch up with the electrical power at around 4000 drop points.

Fig. 2.7e shows a comparison for varying technology node. Not only is the technology changing here; we are also changing the clock frequency associated with the node. We can see that at the 70nm node there is a five-fold difference between electrical and optical clock distribution. As the technology node advances, this difference becomes even more marked.

A final analysis, fig. 2.7f, shows how technological advances are required to improve system performance, concerning in this case waveguide sidewall roughness. 5nm roughness translates to a transmission loss of around 8dB/cm, which in turn corresponds to a power dissipation figure of around 500mW for the 70nm node at 5.6GHz and 20mm chip width. Looking at the 2nm roughness point, achieved at MIT [10] and corresponding to a transmission loss of 1.3dB/cm, we obtain a power dissipation figure of about 10mW, a fifty-fold decrease in the overall power dissipation by going from 5nm roughness to 2nm roughness. This demonstrates the importance of optimising the passive waveguide technology for the whole system.

Figure 2.7a. Comparison of power dissipation in electrical and optical clock distribution networks for varying chip size (70nm technology, 5.6GHz, 256 drop points). Horizontal axis: die size (mm2); curves: electrical CDN, optical CDN.

Figure 2.7b. Comparison of power dissipation in electrical and optical clock distribution networks for varying clock frequency (70nm technology, 400mm2, 256 drop points). Horizontal axis: clock frequency (GHz); curves: electrical CDN 256, optical CDN 256, electrical CDN 128, optical CDN 128.

Figure 2.7c. Comparison of power dissipation in electrical and optical clock distribution networks for varying number of drop points (70nm technology, 5.6GHz, 400mm2). Horizontal axis: number of drop points (nodes).

Figure 2.7d. Comparison of power dissipation in electrical and optical clock distribution networks for varying number of drop points (70nm technology, 5.6GHz, 400mm2). Horizontal axis: number of drop points (nodes).

Figure 2.7e. Comparison of power dissipation in electrical and optical clock distribution networks for varying technology nodes. Horizontal axis: technology node (nm), from 130 down to 45.

Figure 2.7f. Evaluation of power dissipation in optical clock distribution networks for varying waveguide sidewall roughness (70nm technology, 5.6GHz, 400mm2). Horizontal axis: waveguide transmission loss (dB/cm); curves: optical CDN 256, optical CDN 128.

For a BER of $10^{-15}$ the minimal power required by the receiver is -22.3dBm (at 3GHz). Losses incurred by passive components for various nodes in the H-tree are summarised in table 2.2.

Table 2.2. Optical power budget for 20mm die width at 3GHz

Number of nodes in H-tree      16      32      64      128
Loss in straight lines (dB)    1.3     1.3     1.3     1.3
Loss in curved lines (dB)      1.53    1.66    1.78    1.85
Loss in Y-dividers (dB)        12      15      18      21
Loss in Y-couplers (dB)        0.8     1       1.2     1.4
Output coupling loss (dB)      0.6     0.6     0.6     0.6
Input coupling loss (dB)       3       3       3       3
Total optical loss (dB)        19.2    22.5    25.8    29.1
Min. receiver power (dBm)      -22.3   -22.3   -22.3   -22.3
Laser optical power (mW)       0.5     1.1     2.30    4.85
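The arithmetic behind table 2.2 can be retraced with a short script: the losses in dB are summed, added to the minimum receiver power in dBm, and converted to mW. The per-split values of 3dB (Y-divider) and 0.2dB (Y-coupler excess loss) are inferred from the table rows divided by the number of tree levels; the names below are ours.

```python
MIN_RECEIVER_POWER_DBM = -22.3
STRAIGHT_DB = 1.3
INPUT_COUPLING_DB = 3.0
OUTPUT_COUPLING_DB = 0.6
SPLIT_DB = 3.0      # power division per Y-divider (inferred: 12 dB / 4 levels)
Y_EXCESS_DB = 0.2   # excess loss per Y-coupler
CURVED_DB = {16: 1.53, 32: 1.66, 64: 1.78, 128: 1.85}  # taken from table 2.2


def total_loss_db(nodes: int) -> float:
    """Sum of all passive losses for an H-tree with `nodes` outputs."""
    levels = nodes.bit_length() - 1  # log2 of a power of two
    return (STRAIGHT_DB + CURVED_DB[nodes]
            + levels * (SPLIT_DB + Y_EXCESS_DB)
            + INPUT_COUPLING_DB + OUTPUT_COUPLING_DB)


def laser_power_mw(nodes: int) -> float:
    """Optical power the source must emit so each receiver gets its minimum."""
    laser_dbm = MIN_RECEIVER_POWER_DBM + total_loss_db(nodes)
    return 10.0 ** (laser_dbm / 10.0)


for n in (16, 32, 64, 128):
    print(f"{n:4d} nodes: loss {total_loss_db(n):5.2f} dB, "
          f"laser {laser_power_mw(n):.2f} mW")
```

The computed laser powers agree with the last row of the table to within rounding, and the output makes plain that each doubling of the node count costs 3.2dB plus a small increase in bending loss.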

We can conclude from this analysis that power dissipation in optical clock distribution networks is lower than that of electrical clock distribution networks, by a factor of five for example at the 70nm technology node. This factor will become larger in the future for two reasons: firstly, improvements in optical fabrication technology; and secondly, the rise in operating frequencies. However, this figure is probably not sufficient to convince semiconductor manufacturers to introduce such large technological and methodological changes for this application. To improve the figure, weak points can be identified for each main part of an integrated optical link. For the source, the efficiency of electrical to optical power conversion is relatively low. This needs to be improved, and one possible avenue is integrated microsources. For the waveguide structures, most of the losses need to be reduced, especially transmission loss and coupling loss. Sidewall roughness in particular has a direct and considerable impact on the power dissipation in the global system. Finally at the receiver end, the transimpedance amplifier power dissipation is too high. Better circuit structures must be devised, or the photodetector parasitic capacitance needs to be reduced.

2.5 OPTICAL NETWORK ON CHIP

In current SoC architectures, global data throughput between functional blocks can reach up to tens of gigabits per second, the load being shared by several communication buses. In the future the constraints acting on such data exchange networks will continue to increase: the number of IP blocks in an integrated system could be as high as several hundred and the global throughput could reach the Tb/s scale. To provide this level of performance, the communication system itself is designed as an IP block into which the various functional units will be connected. This type of standardised hardware communication architecture is called a network on a chip (NoC).

Using wavelength division multiplexing (WDM) techniques, photonics and optoelectronics may offer new solutions to realise reconfigurable optical networks on chip (ONoC). An ONoC behaves like an electronic router in which routing is based on wavelength λ; it is in effect a circuit-switched topology and can thus ensure data exchanges between IP blocks with very low contention. The advantages of using an optical network are many: independence of interconnect performance from distance and data rate, crosstalk reduction, connectivity increase, interconnect power dissipation reduction, increase in the size of isochronous tiles, use of communication protocols.

Figure 2.8 shows a 4 × 4 ONoC with all electronic interfaces: photodetector and laser in III-V technology and optical network in SOI technology, using similar heterogeneous integration techniques as described in section 2.2. Intellectual property (IP) blocks shown can be processor cores, memory blocks, functional units etc. with standard interfaces to the communication network. This is a multi-domain device with high speed optoelectronic circuits (modulation of the laser current and photodetectors) and passive optics (waveguides and passive filters). In the figure, M are masters (processor, IP, ...) which can communicate with targets T (memory, ...). The network comprises 4 stages, each associated with a single resonant wavelength. The operation of the 4 × 4 network is summarised in table 2.3.

This system is a fully passive circuit-switching network based on wavelength routing and is a non-blocking network. From Mi to Tj, there exists only one physical path associated with one wavelength. At any one time, single-wavelength emitters can make 4 connections and multi-wavelength emitters can make 12 connections. The network is in principle scalable to an infinite number of connections. In practice, this number is severely limited by lithography and etching precision. For a 5nm tolerance on the size of the microdisk, corresponding to state of the art CMOS process technology, the maximum size of the network is 8 × 8.

Table 2.3. Truth table for optical network on chip

        T1    T2    T3    T4
M1      λ2    λ3    λ1    λ4
M2      λ3    λ4    λ2    λ1
M3      λ1    λ2    λ4    λ3
M4      λ4    λ1    λ3    λ2
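The truth table lends itself to a simple lookup model of the wavelength routing. The sketch below encodes table 2.3 as a dictionary (wavelength indices 1 to 4 stand for λ1 to λ4) and checks the property that makes the network non-blocking: at every target, each wavelength arrives from a different master. The function names are ours.

```python
# Table 2.3: ROUTING[master][target] = index of the wavelength to use
ROUTING = {
    "M1": {"T1": 2, "T2": 3, "T3": 1, "T4": 4},
    "M2": {"T1": 3, "T2": 4, "T3": 2, "T4": 1},
    "M3": {"T1": 1, "T2": 2, "T3": 4, "T4": 3},
    "M4": {"T1": 4, "T2": 1, "T3": 3, "T4": 2},
}


def wavelength_for(master: str, target: str) -> int:
    """Wavelength index a master must emit to reach a given target."""
    return ROUTING[master][target]


def target_reached(master: str, wavelength: int) -> str:
    """Target at which a wavelength emitted by `master` arrives."""
    for target, lam in ROUTING[master].items():
        if lam == wavelength:
            return target
    raise ValueError(f"wavelength {wavelength} not used by {master}")


def is_contention_free() -> bool:
    """True if no two masters use the same wavelength towards one target."""
    targets = next(iter(ROUTING.values())).keys()
    for target in targets:
        lams = [ROUTING[m][target] for m in ROUTING]
        if len(set(lams)) != len(lams):
            return False
    return True


print(wavelength_for("M1", "T2"), target_reached("M1", 3), is_contention_free())
```

Since every row and every column of the table contains each wavelength exactly once, any master can address any target without colliding with another master at that target.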

The basic element of the network is an optical filter, described in the next section. The ports 1 − 4 correspond to inputs/outputs of the optical filter. Its operation is the same as that of an electronic cross-bar: the cross function (output at port 4) is activated when the wavelength injected at port 1 does not correspond to a resonant ring wavelength, and the bar function (output at port 3) is activated when the wavelength injected at port 1 corresponds to a resonant ring wavelength. Operation is symmetrical: the same phenomenon occurs if the wavelength is injected at port 4.

Figure 2.8. Architecture of 4x4 optical network on chip (labels: master IP blocks; master interface (driver, laser); passive optical network on chip; target interface (detector, receiver); target IP blocks; elementary optical filter operation)

2.5.1 Microresonators

Microring resonators are ideal device candidates for integrated photonic circuits. Because they render possible the addition or extraction of signals from a waveguide based on wavelength in a WDM flow, they can be considered as basic building blocks to build complex communication networks. The use of standard SOI technology leads to high compactness (structures with radii as small as 4µm have been reported) and the possibility of low-cost photonic integration. Figure 2.9 shows the structure of an elementary add-drop filter based on microring resonators. The size of the structure is typically a few hundred µm2. It consists of two identical disks evanescently side-coupled to two signal waveguides which are crossed at near right angles to facilitate signal directivity. The microdisks make up a selective structure: the electromagnetic field propagates in the rings for discrete propagation modes corresponding to specific wavelengths. The resonant wavelengths depend on geometric and structural parameters (indices of the substrate and of the microrings, thickness and diameter of the disks).

The basic function of a microresonator can be thought of as a wavelength-controlled switching function. If the wavelength of an optical signal passing through a waveguide in proximity to the resonator (for example injected at port 1) is close enough to a resonant wavelength λ1 (tolerance is of the order of a few nm, depending on the coupling strength between the disk and the waveguide), then the electromagnetic field is coupled into the microrings and then out along the second waveguide (in the example, the optical signal is transmitted to the output port 3, as shown in fig. 2.10a). If the wavelength of the optical signal does not correspond to the resonant wavelength, then the electromagnetic field continues to propagate along the waveguide and not through the structure (in the example, the optical signal would then be transmitted to the output port 4, as shown in fig. 2.10b). This device thus operates as an elementary router, the behaviour of which is summarised in the table in fig. 2.9.

Figure 2.9. Micro-disk realisation of an add-drop filter
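At the behavioural level, the switching function just described reduces to a comparison between the injected wavelength and the resonant wavelength. The sketch below models only injection at port 1, as in the example; the function name, the 1nm default tolerance and the wavelengths in the usage lines are illustrative assumptions, the text stating only that the tolerance is of the order of a few nm.

```python
def add_drop_output_port(
    wavelength_nm: float,
    resonant_nm: float,
    tolerance_nm: float = 1.0,  # hypothetical; depends on disk-waveguide coupling
) -> int:
    """Output port for a signal injected at port 1 of the add-drop filter.

    Returns 3 when the signal is on resonance (coupled through the
    microdisks to the second waveguide) and 4 otherwise (signal stays
    in the original waveguide).
    """
    on_resonance = abs(wavelength_nm - resonant_nm) <= tolerance_nm
    return 3 if on_resonance else 4


# Illustrative wavelengths only
print(add_drop_output_port(1550.0, resonant_nm=1550.0))  # on resonance
print(add_drop_output_port(1560.0, resonant_nm=1550.0))  # off resonance
```

A full network model of the kind used in section 2.5.3 would chain such elements, one resonant wavelength per stage, and would also need the finite extinction and insertion loss that this idealised sketch ignores.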

Figure 2.10a. FDTD simulation of add-drop filter in on-state

Figure 2.10b. FDTD simulation of add-drop filter in off-state

First structures have been realised and preliminary results are promising. Fig. 2.11a shows an IR photograph of the structure in the cross state (top) and in the bar state (bottom), while fig. 2.11b represents the transmission coefficient on the cross output: the transmitted power on the cross output reaches 100% for wavelengths corresponding to the resonant frequencies of the microdisk.

2.5.2 Microsource lasers

From the viewpoint of mode field confinement and mirror reflection, microdisk lasers operate on the principle of total internal reflection, as opposed to multiple reflection, as is the case in VCSELs for example. This fact gives this type of source two distinct advantages over VCSELs for on-chip optical interconnect. Firstly, light emission is in-plane (as opposed to vertical), meaning that emitted light can be injected directly into a waveguide with minimum loss [6]. Secondly, for communication schemes requiring multiple wavelengths, it is easier from a technological point of view to control the radius of such a device than it is to control the thickness of an air gap in a VCSEL. In any case such devices, to be compatible with dense photonic integration, must satisfy the requirements of small volume and high optical confinement, with low threshold current and emitting in the 1.3-1.6µm range. Although these devices are not as mature as VCSELs, they seem extremely promising for optical interconnect applications. An overview of microcavity semiconductor lasers can be found in [2].

Figure 2.11a. Infra-red photograph of structure in both cross (top) and bar (bottom) states

Figure 2.11b. Transmission coefficient on cross output for varying wavelength

2.5.3 Demonstration of principle

Behavioural models enable us to verify the operation of the 4 × 4 ONoC in high-level simulation. Four wavelengths (λ1, λ2, λ3, and λ4) are injected simultaneously at port 1 (shown in figure 2.12). The input signal format is a matrix. Figure 2.12 is a 3-dimensional representation with wavelength on the X-axis (representing the 4 channels), time on the Y-axis and power (normalised) on the vertical axis. Each injected wavelength carries two Gaussian pulses in time. The behavioural simulation analyses the 4 outputs T1, T2, T3 and T4 (T2 shown in fig. 2.12). As predicted in table 2.3, only λ3 is detected at the output T2.

Figure 2.12. Simulation of 4x4 optical network on chip

2.6 CONCLUSION

Integrated optical interconnect is one potential technological solution to alleviate some of the more pressing issues involved in moving volumes of data between circuit blocks on integrated circuits. In this chapter, we have shown how novel integrated photonic devices can be fabricated above standard CMOS ICs, designed concurrently with EDA tools and used in clock distribution and NoC applications. The feasibility of on-chip optical interconnect is no longer really in doubt. We have given some partial results to quantitatively demonstrate the advantages of optical clock distribution. Although lower power can be achieved (of the order of a five-fold decrease), more work is required to explore new solutions that benefit from advances both at the architectural and at the technological level. Also the existing basic building blocks need to be integrated together to physically demonstrate on-chip optical links. Research is well under way in several research groups around the world to do this. Looking further ahead, the use of multiple wavelengths in on-chip communication networks and in reconfigurable computing is an extremely promising and exciting field of research.

References

[1] M. Amann, M. Ortsiefer, and R. Shau: 2002, ‘Surface-emitting Laser Diodes for Telecommunications’. In: Proc. Symp. Opto- and Microelectronic Devices and Circuits.

[2] T. Baba: 1997, ‘Photonic Crystals and Microdisk Cavities Based on GaInAsP-InP System’. IEEE J. Selected Topics in Quantum Electronics 3.

[3] Y. Cao, T. Sato, D. Sylvester, M. Orchansky, and C. Hu: 2000, ‘New Paradigm of Predictive MOSFET and Interconnect Modeling for Early Circuit Design’. In: Proc. Custom Integrated Circuit Conference.

[4] S. Cho et al.: 2002, ‘Integrated detectors for embedded optical interconnections on electrical boards, modules and integrated circuits’. IEEE J. Sel. Topics in Quantum Electronics 8.

[5] A. Filios et al.: 2003, ‘Transmission performance of a 1.5-µm 2.5-Gb/s directly modulated tunable VCSEL’. IEEE Phot. Tech. Lett. 15.

[6] M. Fujita, A. Sakai, and T. Baba: 1999, ‘Ultrasmall and ultralow threshold GaInAsP-InP microdisk injection lasers: Design, fabrication, lasing characteristics and spontaneous emission factor’. IEEE J. Sel. Topics in Quantum Electronics 5.

[7] M. Fujita, R. Ushigome, and T. Baba: 2000, ‘Continuous wave lasing in GaInAsP microdisk injection laser with threshold current of 40µA’. IEE Electron. Lett. 36.

[8] M. Ingels and M. S. J. Steyaert: 1999, ‘A 1-Gb/s, 0.7µm CMOS Optical Receiver with Full Rail-to-Rail Output Swing’. IEEE J. Solid-State Circuits 34(7).

[9] I. Kimukin et al.: 2002, ‘InGaAs-Based High-Performance p-i-n Photodiodes’. IEEE Phot. Tech. Lett. 26(3).

[10] K. Lee et al.: 2001, ‘Fabrication of ultralow-loss Si/SiO2 waveguides by roughness reduction’. Optics Letters 26.

[11] J. Liu et al.: 2002, ‘Ultralow-threshold sapphire substrate-bonded top-emitting 850-nm VCSEL array’. IEEE Phot. Lett. 14.

[12] J. Morikuni et al.: 1994, ‘Improvements to the standard theory for photoreceiver noise’. IEEE J. Lightwave Technology 12.

[13] I. O’Connor, F. Mieyeville, F. Tissafi-Drissi, G. Tosik, and F. Gaffiot: 2003, ‘Predictive design space exploration of maximum bandwidth CMOS photoreceiver preamplifiers’. In: Proc. IEEE International Conference on Electronics, Circuits and Systems.

[14] A. Sakai, T. Fukazawa, and T. Baba: 2002, ‘Low Loss Ultra-Small Branches in a Silicon Photonic Wire Waveguide’. IEICE Tran. Electron. E85-C.

[15] A. Sakai, G. Hara, and T. Baba: 2001, ‘Propagation Characteristics of Ultrahigh-∆ Optical Waveguide on Silicon-on-Insulator Substrate’. Jpn. J. Appl. Phys. – Part 2 40.

[16] S. Schultz, E. Glytsis, and T. Gaylord: 2000, ‘Design, Fabrication, and Performance of Preferential-Order Volume Grating Waveguide Couplers’. Applied Optics-IP 39.

[17] Semiconductor Industry Association: 2003, ‘International Technology Roadmap for Semiconductors’.

[18] G. Tosik, F. Gaffiot, Z. Lisik, I. O’Connor, and F. Tissafi-Drissi: 2004, ‘Power dissipation in optical and metallic clock distribution networks in new VLSI technologies’. IEE Elec. Lett. 4(3).

Chapter 3

NANOTECHNOLOGIES FOR LOW POWER

Jacques Gautier

CEA-DRT – LETI/D2NT – CEA/GRE

Abstract The conventional approach to improve the performance of circuits is to scale

down the devices and technologies. This is also convenient to lower the power

consumption per function. In this chapter, we overview the potential of

nanotechnologies for this purpose, with emphasis on few-electron devices in

the case of room-temperature operation. Other devices, especially carbon

nanotube transistors, resonant tunnelling diodes and quantum cellular

automata, are briefly discussed.

Keywords: nanotechnologies; Single Electron Transistor; SET; molecular electronics;

RTD; QCA; low power; Coulomb blockade

3.1 INTRODUCTION

In addition to packing-density increase and speed improvement, the

downscaling of technologies comes with a reduction of the power

consumption per function. However this gain is offset by the tremendous

increase in the number of transistors per chip. A possible solution is to go

further towards nano-scale devices where a lower amount of charge is

needed to code a bit. This is the basis of what is known as single electronics.

The use of molecules could be a realistic way to fabricate these tiny devices

and other useful nanostructures.

In this chapter we overview the potential of nanodevices for low power

electronics with emphasis on few-electron electronics in the case of room-

temperature (RT) operation. Other devices, especially carbon nanotube

transistors, resonant tunnelling diodes (RTD) and quantum cellular automata

(QCA), are briefly discussed.

3.2 SINGLE ELECTRONICS

In CMOS circuits, the total power consumption is the sum of the

dynamic power and of the contribution of leakages. For advanced

technology generations the latter is rapidly rising, but it is still less than the

former. So, we will focus on this dynamic power consumption which is

given by the usual expression

$$P_d = a \cdot N_{gate} \cdot (C_{gate} + C_{inter}) \cdot V_{DD}^2 \cdot f_c \qquad (1)$$

where a is the activity factor, Ngate is the number of gates, (Cgate + Cinter) is the load capacitance (gate and interconnect contributions), VDD is the supply voltage and fc is the clock frequency. This equation shows that the power is proportional to the amount

of charge in transistors and interconnects for coding a bit of information. For

dense circuits with local interconnects, the dominant contribution is usually

the one related to the gate capacitance of transistors which can also be

expressed as Pd=a.Ngate.Q.VDD.fc, where Q is the channel charge. So there is a

strong motivation to reduce it for power saving. This is currently obtained by

the downscaling of technologies. From the extrapolation of the historical

trend and from the ITRS roadmap anticipation[1], we can expect a value of

only 10-20 electrons for sub-10nm MOSFETs. This is much less than the

hundreds to thousands of electrons present in current devices. Is it possible

to go still further, towards only one electron, using what is called a single

electron transistor or SET [2]? That would be advantageous for power

consumption, knowing that the reduction of power per function due to the

scaling is more or less balanced by the tremendous increase of the number of

transistors per chip. However this gain would be effective only if the

capacitances of interconnects are not too large. Another factor in expression

(1) is the electrostatic potential at which the charge Q is brought. At present,

there is a strong incentive for reducing it. Whereas the supply voltage of

current high performance circuits is in the range 1.2-1.8V, operation at only

0.3V on experimental circuits has already been demonstrated [3], which is

close to the bottom limit anticipated by the ITRS. For a lower value the

device is not in well defined On or Off states which results in either leakage

or poor performance. What can be expected from SET's ? Before giving an

answer to this question, their properties and modes of operation are briefly

recalled.
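The orders of magnitude quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch: the function names and all capacitance, voltage and frequency values are illustrative assumptions, not figures taken from the ITRS or from this chapter. It evaluates expression (1) and counts the electrons in the channel charge Q = Cgate.VDD.

```python
E_CHARGE = 1.602176634e-19  # elementary charge (C)


def dynamic_power(a, n_gate, c_gate, c_inter, v_dd, f_c):
    """Dynamic power following expression (1): a*Ngate*(Cgate+Cinter)*VDD^2*fc."""
    return a * n_gate * (c_gate + c_inter) * v_dd ** 2 * f_c


def electrons_per_bit(c_gate, v_dd):
    """Number of electrons in the channel charge Q = Cgate*VDD."""
    return c_gate * v_dd / E_CHARGE


# Hypothetical "current" device: 0.2 fF gate capacitance at 1.2 V.
conventional = electrons_per_bit(c_gate=0.2e-15, v_dd=1.2)  # roughly 1500 electrons

# Hypothetical sub-10nm device: 5 aF gate capacitance at 0.5 V.
scaled = electrons_per_bit(c_gate=5e-18, v_dd=0.5)  # roughly 16 electrons

# With negligible interconnect load, expression (1) reduces to a*Ngate*Q*VDD*fc.
power = dynamic_power(a=0.1, n_gate=1e6, c_gate=0.2e-15, c_inter=0.0,
                      v_dd=1.2, f_c=1e9)
power_from_charge = 0.1 * 1e6 * (0.2e-15 * 1.2) * 1.2 * 1e9

print(f"{conventional:.0f} electrons, {scaled:.1f} electrons, {power * 1e3:.1f} mW")
```

With these assumed values the gate-dominated form Pd=a.Ngate.Q.VDD.fc gives the same result as expression (1), which is the point made in the text.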

3.2.1 Background on single electron transistors

A SET is a device which comprises a Source and a Drain reservoir of

electrons and a control gate, like MOSFET's. In between, there is an island

where carriers should be confined [2] (see Fig. 3-1). A common solution to

obtain this effect is to insert tunnelling or potential barriers between the

reservoirs and the island. This is the main structural difference from

MOSFET's, but it is essential for the operation of SET's. Due to this

confinement, there is always an integer number of electrons in the island.

However, the probability to have a given amount of charge is a continuous

function of the device bias, such that there is also a continuous variation of

the average charge versus the external bias.

Figure 3-1. Schematics of a SET

Provided that just one electron more or less has a significant effect on the

electrostatic energy of the device, it is shown that, for a given device bias,

there are limited possible states of charge in the island [2]. Especially, there

are bias domains for which only one state of charge is possible. In this case,

there is no exchange of charge with the electron reservoirs and the device is

in Off state. This is the Coulomb blockade effect. For the other cases, the

number of electrons oscillates between the most probable states of charge

leading to a flux of carriers between source and drain. For instance, when the

two states n and n+1 are possible, the current is due to the repetition of the

sequence: one electron coming from the source to the island then leaving the

island to the drain.

As shown in Fig. 3-2, the electrical characteristics of SET's are very

different from those of MOSFET's. The ID(VG) curves have periodic

oscillations of current and the output characteristics look like a resistance (or

staircase for non-symmetrical device) with a low drain voltage domain

where the device is periodically Off and On as a function of VG. The period

of Coulomb Blockade Oscillations, CBO, is given by e/Cg. Between two

successive oscillations, the only difference is that the average number of electrons in the island is incremented or decremented by one. At a peak of current, two dominant states of charge have equal probability and, on the average, there is a half-integer number of electrons in the island.

[Figure 3-2 panels. Left: ID versus Vg (V) at VD=0.4V. Right: ID versus Vd (V) at Vg=0.45V and Vg=0.1V.]

Figure 3-2. Typical ID(VG) and ID(VD) characteristics of a SET. They have been obtained by

simulation with the following parameters: Cj=0.1aF, Cg=0.2aF, RT=1MΩ, T=300K

To observe the previous typical characteristics, there are two important

conditions to meet. Firstly, the charging energy, which is the electrostatic

energy increase due to the arrival of one electron in the island, should be

large in comparison to the thermal energy kT:

$$E_C = \frac{e^2}{2 C_\Sigma} \gg kT \qquad (2)$$

where e is the electron charge (absolute value), CΣ is the total capacitance

of the island, CΣ=2Cj+Cg, where Cj is the junction capacitance and Cg is the

gate to island capacitance. For room-temperature operation, CΣ should be

less than 0.3 aF (Ec=10 kT, T=300K), which requires an island smaller than

a few nm. The second condition is related to the confinement of the electron

wave function in the island, which is essential to quantize the charge in this

island. The resistance of the tunnel barriers should exceed the quantum

resistance RK=h/e²≈25.8 kΩ.
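The first condition can be turned into a quick numerical check. The sketch below assumes the usual definition Ec = e²/(2CΣ), which reproduces the 0.3 aF limit quoted above for Ec = 10 kT at 300K; the function names are illustrative only.

```python
E_CHARGE = 1.602176634e-19  # elementary charge (C)
K_B = 1.380649e-23          # Boltzmann constant (J/K)


def total_capacitance(c_j, c_g):
    """Total island capacitance of a SET without control gate: 2*Cj + Cg."""
    return 2.0 * c_j + c_g


def charging_energy(c_sigma):
    """Charging energy, assuming Ec = e^2 / (2*C_sigma)."""
    return E_CHARGE ** 2 / (2.0 * c_sigma)


def max_capacitance(ratio, temperature):
    """Largest C_sigma for which Ec is at least ratio * kT."""
    return E_CHARGE ** 2 / (2.0 * ratio * K_B * temperature)


# Limit for Ec = 10 kT at room temperature: roughly 0.31 aF.
c_max = max_capacitance(ratio=10.0, temperature=300.0)

# Parameters used for the simulation of Fig. 3-2: Cj=0.1aF, Cg=0.2aF.
c_sigma = total_capacitance(c_j=0.1e-18, c_g=0.2e-18)
ec_over_kt = charging_energy(c_sigma) / (K_B * 300.0)  # roughly 7.7

print(f"C_max = {c_max * 1e18:.2f} aF, Ec/kT = {ec_over_kt:.1f}")
```

Under this assumption the simulation parameters of Fig. 3-2 correspond to a charging energy somewhat below 10 kT, i.e. a slightly relaxed constraint.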

For the fabrication of SET’s, there are many different possibilities since

any kind of conductive material can be used for the island, metallic as well

as semiconductor and even molecular. However silicon is advantageous for

CMOS compatibility and also for the stability of devices [4].

3.2.2 Designing a low VDD inverter

With regard to the power consumption of digital circuits, we consider in

this part the case of a simple inverter, since this is a convenient reference to

make comparisons with CMOS. The design of a SET inverter has been

discussed by many authors [5,6,7]. They pointed out that, since there is only

one kind of SET, the complementary action of the pull-up and pull-down

devices is not as easy to obtain as in CMOS where two types of transistor

exist. A first solution is to choose the supply voltage in order that both of

these devices are On or Off in a complementary way in the switching part of

the transfer characteristic. An example of such situation is shown in Fig. 3-3.

The shaded area displays the Coulomb blockade domains of the pull-up and pull-down transistors at zero temperature. Based on that, the transfer characteristic has been schematically drawn. Contrary to CMOS, we can

observe that the voltage swing is less than rail-to-rail and that the DC current

is minimal at the transition point.

[Figure 3-3 axes: Vout (V) versus Vin (V).]

Figure 3-3. Theoretical Coulomb blockade domains, also known as Coulomb diamonds,

(shaded areas) at 0K, for the pull-down and pull-up SET's of an inverter. At RT they are a

little narrower. Cj=0.1aF, Cg=0.2aF, VDD=0.53V. The bold line is a drawing of the transfer

characteristics.

Since a low VDD is advantageous for low power applications, we discuss

now the possibility to minimize it for this simple SET inverter, taking

account of the design constraints and aiming at room-temperature operation:

• Cg + 2.Cj < 0.3aF for RT operation (for Ec~10kT)

• Cg / Cj > 1 for voltage gain

• VDD = e / (Cg + Cj) for complementary action of transistors

As a result, a very low VDD and RT operation would be difficult to

achieve simultaneously. In fact, with the previous equations and for a ratio

of gate to junction capacitances of 2, the minimum VDD would be equal to

0.7V ! However, for temperatures above 0K, the switching of the SET from

Off to On state is not abrupt since there is an exponential variation of the

current, equivalent to the subthreshold current of MOSFET’s. Consequently,

the real Coulomb blockade diamonds are narrower than those shown in Fig.

3-3 and it is possible to reduce VDD. This is demonstrated in Fig. 3-4, where

the DC voltage gain and the DC current at the transition point of an inverter

have been plotted versus VDD. Note also that the constraint on CΣ has been a

little relaxed. As thoroughly discussed by A. Korotkov [6], the acceptable

VDD window is quite narrow. A too low VDD value would be detrimental for

the noise margin and for the speed since the DC current at the transition

point is exponentially decreasing with VDD. On the contrary, a higher value

would increase the power consumption.
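The 0.7V figure follows directly from the three design constraints listed above. As a minimal sketch (the function name is hypothetical), the largest capacitances allowed by Cg + 2.Cj < 0.3aF for a given Cg/Cj ratio are inserted into VDD = e/(Cg + Cj):

```python
E_CHARGE = 1.602176634e-19  # elementary charge (C)


def min_vdd(c_sigma_max, gate_to_junction_ratio):
    """Zero-temperature lower bound on VDD for a SET inverter without control gates.

    Uses Cg = r*Cj and Cg + 2*Cj = c_sigma_max (largest allowed capacitances),
    then VDD = e / (Cg + Cj).
    """
    c_j = c_sigma_max / (gate_to_junction_ratio + 2.0)
    c_g = gate_to_junction_ratio * c_j
    return E_CHARGE / (c_g + c_j)


# Cg/Cj = 2 with Cg + 2*Cj = 0.3 aF gives Cj = 0.075 aF and Cg = 0.15 aF.
vdd_rt = min_vdd(c_sigma_max=0.3e-18, gate_to_junction_ratio=2.0)

# Parameters of Fig. 3-3 (Cj=0.1aF, Cg=0.2aF) give the VDD quoted in its caption.
vdd_fig = E_CHARGE / (0.2e-18 + 0.1e-18)

print(f"min VDD = {vdd_rt:.2f} V, Fig. 3-3 VDD = {vdd_fig:.2f} V")
```

The first value is about 0.71V, matching the bound stated in the text, and the second is about 0.53V, matching the caption of Fig. 3-3.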

[Figure 3-4 plot: DC voltage gain and DC current versus VDD (V), with VDD=e/(Cj+Cg) marked. Cj=0.1aF, Cg=0.2aF, RT=1MΩ, T=300K.]

Figure 3-4. DC voltage gain (solid line) and DC current (dashed line) at the transition point of

a SET inverter at room temperature.

To go further in reducing VDD, a solution is to add control gates to each

SET (Fig. 3-5). Based on this approach, NTT has demonstrated a quasi-

CMOS operation inverter at a supply voltage as low as 20 mV [8], which is

very advantageous for the power consumption, but in this case the

temperature was only 27K. The bias of the control gates shifts the CBO,

making it possible to select the optimal part of the ID(VG) characteristics of

each SET for complementary action. In this way, the equivalent of two types

of transistors can be obtained, like for CMOS. In addition, their equivalent

threshold voltages can be tuned, balancing the influence of possible parasitic

(background) charges in the neighbourhood of the SET:

$$\Delta V_{gc} = -\frac{\Delta Q}{C_{gc}} \qquad (3)$$

To get a symmetrical transfer characteristic, from the Coulomb

diamonds, it can be easily demonstrated that the sum of the control gate

voltages should be equal to VDD:

$$V_{gcss} + V_{gcdd} = V_{DD} \qquad (4)$$

As a result there is one more degree of freedom to design an inverter, in

comparison to the case without control gates. That gives flexibility to fix the

value of VDD. In fact, there is now one optimal supply voltage, leading to

complementary states of pull-up and pull-down transistors, for each bias of

the control gates. Taking equation (4) into account, it is given by:

$$V_{DDopt} = \frac{e - 2\, C_{gc} \cdot V_{gcss}}{C_g + C_j} \qquad (5)$$

There is a consequent reduction of VDDopt thanks to the control gates, but

it is important to note that the constraint on the total capacitance (equation 2)

should also take account of the contribution of Cgc : CΣ=2Cj+Cg+Cgc

A drawback of this approach is the requirement of extra lines to distribute

the control gate voltages. However, this can be avoided in the particular case

where Vgcss=VDD and Vgcdd=0V (VSS=0V). For this condition, the optimum

value of VDD is given by:

$$V_{DDopt} = \frac{e}{C_g + C_j + 2\, C_{gc}} \qquad (6)$$
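The effect of the control gates on the optimal supply voltage can be illustrated numerically. The sketch below rests on an assumption: expressions (5) and (6) are read as VDDopt = (e - 2.Cgc.Vgcss)/(Cg + Cj) and VDDopt = e/(Cg + Cj + 2.Cgc), which is what is obtained by requiring the transition point (VDD/2, VDD/2) to sit on the edge of the zero-temperature Coulomb diamond. The function names are hypothetical.

```python
E_CHARGE = 1.602176634e-19  # elementary charge (C)


def vdd_opt(c_g, c_j, c_gc, v_gcss):
    """Assumed form of expression (5): (e - 2*Cgc*Vgcss) / (Cg + Cj)."""
    return (E_CHARGE - 2.0 * c_gc * v_gcss) / (c_g + c_j)


def vdd_opt_rail_tied(c_g, c_j, c_gc):
    """Assumed form of expression (6), for Vgcss = VDD and Vgcdd = 0V."""
    return E_CHARGE / (c_g + c_j + 2.0 * c_gc)


# Capacitances used in the control-gate simulations: Cj=0.05aF, Cg=Cgc=0.1aF.
c_j, c_g, c_gc = 0.05e-18, 0.1e-18, 0.1e-18

no_control = vdd_opt(c_g, c_j, c_gc=0.0, v_gcss=0.0)  # e/(Cg+Cj), about 1.07 V
rail_tied = vdd_opt_rail_tied(c_g, c_j, c_gc)          # about 0.46 V

# Consistency check: expression (5) evaluated at Vgcss = VDDopt returns VDDopt.
self_consistent = vdd_opt(c_g, c_j, c_gc, v_gcss=rail_tied)

print(f"{no_control:.2f} V without control gates, {rail_tied:.2f} V rail-tied")
```

These are zero-temperature values. As discussed for the inverter without control gates, the narrower Coulomb blockade area at RT allows a lower VDD in practice.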

As discussed previously for the case without control gates, at RT the

Coulomb blockade area of SET is narrower than at 0K which makes possible

a reduction of VDD or a change of bias of the control gates for a given VDD.

However, it also results in a change of the SET current which may affect the

speed of circuits. Consequently, there is a design trade-off. To illustrate it, in

Fig. 3-5 we have plotted the variations of the DC voltage gain and of the

propagation delay along a chain of SET inverters versus the DC current at

the transition point of the transfer characteristics. The shaded area shows the

most advantageous design window. In this example, the load capacitance is

equal to 0.5fF, but for another value the design window would be the same,

since the propagation delay directly scales with this capacitance. This is a

difference with CMOS where the dominant load capacitance of dense logic

is due to the gate capacitance of MOSFETs. Here, the gate capacitance of

SETs is extremely small and the dominant load capacitance comes from the

local interconnects. In fact the latter should be much larger than e/2VDD to

avoid any detrimental effects of the shot noise.

Regarding the dynamic power consumption of SET logic, as long as a

CMOS output buffer is not implemented, the major contribution would also

come from the load capacitance due to the interconnects.

[Figure 3-5 plot: horizontal axis, DC current @ transition point (A). Annotations: T=300K, Cj=0.05aF, CG=0.1aF, Cgc=0.1aF, CL=0.5fF, RT=1MΩ, Master Equation.]

Figure 3-5. Variations of the DC voltage gain (solid line) and of the propagation delay along

a chain of SET inverters (dashed line) versus the DC current at the transition point of the

transfer characteristics. VDD=0.3V. The control gate voltages are varied as follows:

0.7V<VGCss<0.1V and VGCss+VGCdd=VDD. The simulations are performed using a model of the

SET based on the solution of the Master Equation [2]. The shaded area is the design window

for sufficient voltage gain and speed.

An example of switching characteristics is shown in Fig. 3-6 for VDD=0.3V, T=300K and a load capacitance of 0.5fF. The corresponding switching energy is 5.2×10⁻² fJ. In the same figure, transfer characteristics have been plotted for a lower supply voltage, showing that a voltage gain higher than 1 can be obtained down to VDD=0.2V at RT.
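A back-of-the-envelope estimate helps to place this switching energy. The sketch below uses the simple CL.VDD² estimate for the energy drawn from the supply per charging event, which is an assumption and not the simulation model used for the figure; it also evaluates the e/2VDD capacitance scale mentioned earlier for shot noise. All function names are illustrative.

```python
E_CHARGE = 1.602176634e-19  # elementary charge (C)


def switching_energy(c_load, v_dd):
    """Simple estimate of the energy drawn from the supply: CL * VDD^2."""
    return c_load * v_dd ** 2


def electrons_on_load(c_load, v_dd):
    """Number of electrons needed to charge the load to VDD."""
    return c_load * v_dd / E_CHARGE


def shot_noise_capacitance_scale(v_dd):
    """Capacitance e/(2*VDD) that the load should greatly exceed."""
    return E_CHARGE / (2.0 * v_dd)


C_LOAD = 0.5e-15  # load capacitance from the text (F)
V_DD = 0.3        # supply voltage from the text (V)

energy_fj = switching_energy(C_LOAD, V_DD) * 1e15               # 0.045 fJ
electrons = electrons_on_load(C_LOAD, V_DD)                     # roughly 940
margin = C_LOAD / shot_noise_capacitance_scale(V_DD)            # roughly 1900

print(f"{energy_fj:.3f} fJ, {electrons:.0f} electrons, margin x{margin:.0f}")
```

The estimate of 0.045 fJ is of the same order as the simulated 5.2×10⁻² fJ. The source does not state what accounts for the difference. The calculation also shows that, although each SET switches with about one electron on its island, close to a thousand electrons are still moved on a 0.5fF interconnect load.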

[Figure panels. Left: Ipull-down and Ipull-up versus time (s). Right: transfer characteristics versus input (V) at T=300K and T=150K.]

Figure 3-6. On the left, switching characteristics for a SET inverter with control gates. VDD=0.3V, T=300K, CL=0.5fF, Cj=0.05aF, Cg=Cgc=0.1aF, RT=1MΩ, Vgcss=0.3V, Vgcdd=0. On the right, transfer characteristics for the same capacitances and VDD=0.2V.

3.2.3 Designing gates with increased functionality

Another approach to lower the power consumption is to build logic gates

with increased functionality in order to reduce the number of transistors needed

to obtain a given function. This can be done by taking advantage of both the

existence of CBO and the possibility to design SET's with multiple inputs

[9]. The principle is to choose the logic levels such that the multiple-input

SET's are biased at either the minima or at the peaks of current, depending

on the combination of input signals. This is illustrated in Fig. 3-7 in the case

of a double input X-OR function. The logic level "1" is equal to the CBO

period of the equivalent single input SET, e/(2.Cin), where Cin is the input

gate capacitance. From this equivalent SET, it is obvious that the device is

Off when both of the inputs are either "0" or "1" and that the device is On

when one and only one of the inputs is "1".
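The X-OR behaviour can be reproduced with an idealised model of the Coulomb blockade oscillations. The sketch below is an assumption-laden simplification: it ignores temperature, drain bias and background charge, treats the SET as On when the gate-induced charge Cin.(VA+VB) is close to a half-integer multiple of e and Off when it is close to an integer multiple, and uses hypothetical names and an illustrative value for Cin.

```python
E_CHARGE = 1.602176634e-19  # elementary charge (C)


def set_is_on(v_a, v_b, c_in):
    """Idealised double-input SET: On near a CBO peak, Off near a valley.

    The charge induced on the island by the two inputs is Cin*(VA+VB).
    Peaks of current sit at half-integer multiples of e, valleys at integers.
    """
    phase = (c_in * (v_a + v_b) / E_CHARGE) % 1.0
    return abs(phase - 0.5) < 0.25


C_IN = 0.1e-18                    # illustrative input gate capacitance (F)
V_HIGH = E_CHARGE / (2.0 * C_IN)  # logic level "1" = e/(2.Cin), as in the text
V_LOW = 0.0                       # logic level "0"

levels = {0: V_LOW, 1: V_HIGH}
truth_table = {
    (a, b): set_is_on(levels[a], levels[b], C_IN)
    for a in (0, 1)
    for b in (0, 1)
}

for (a, b), on in truth_table.items():
    print(f"A={a} B={b} -> {'On' if on else 'Off'}")
```

With one input at "1" the induced charge is e/2 (a peak of current), and with both at "1" it is e (a valley), so the device conducts only when exactly one input is high.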

[Figure 3-7 plot: current versus Vgeff (V) at VD=100mV and T=300K, with the input states of A and B marked on the curve; inset: equivalent input circuit with input capacitances Cin.]

Figure 3-7. Principle of design of a X-OR gate with a double input SET. The output current is

a X-OR function of VA and VB. The figure shows also the equivalent input circuit and

associated equations.

This can be used to design pass-gate logic functions, as demonstrated by Y. Ono [9]. For instance, for an input current signal C in the previous X-OR gate, the output pass current is given by C.(A ⊕ B). Furthermore, one of the inputs of this gate, B for instance, can be viewed as a control input, leading to the output pass current $C\cdot\overline{A}$ if B is "1". Applying this technique to a gate comprising such a control input in addition to the inputs A and B, we get the function $C\cdot\overline{(A \oplus B)}$ for the output current when the control input is "1" (see Fig. 3-8). This way, cascading SET structures, NTT has designed a 4-b adder with only 40 SET's for operation at 30K [9]. In comparison with CMOS there are fewer transistors and no crossing of pass signal routes thanks to the high functionality of the SET gates. Furthermore, there is a low level signal on the pass route. Consequently, a lower dynamic power is expected. Nevertheless, this power gain has not yet been evaluated.

[Figure 3-8 labels: C.(A⊕B), Control gate, A.]

Figure 3-8. Design of complex gates using multiple input SETs. These gates can be cascaded.

Moreover, there are important issues, especially about the control of the

phase of CBO. Since it will be probably impossible to avoid the existence of

any parasitic charges, charge-tolerant solutions are required. A first approach

consists in incorporating redundancy into the circuit design in order to

replace the defective gates by reconfiguration [9]. This is valuable only if a

reasonable amount of spares is needed and if the area-overhead is not too

large. Another solution would be to balance the influence of parasitic offset

charges by opposite charges stored near the island of the SET. This concept

has been demonstrated in the case of SET in which nanostructures have been

embedded [10-12]. The resulting device is a merge of a SET and of a Non

Volatile Memory function. Further, a feedback loop can be implemented to

automatically control the phase of CBO [13]. The loop is closed to adjust the

amount of charges in a memory node then it is opened for the use of the

device.

There are other potential applications about the possibility to tune and

memorize the phase of CBO. A first example has been the demonstration of

a hybrid SET-MOSFET gate which can be programmed to be inverting or

non-inverting [11]. This feature has been obtained thanks to a SET active

device which can operate either in a positive or in a negative

transconductance region, depending on the amount of charge stored in a

nearby nanostructure. In this case, the SET was fabricated in a very thin

undulated SOI film in which a narrow source-drain percolation channel and

an electron pocket working as a memory node can be naturally formed for a

range of bias. In the hybrid gate, the MOSFET was just used as a load. Since

the output voltage swing was only 10mV, an output buffer has been

implemented. The reproducibility of the structure is not obvious, but RT

operation and peak-to-valley current ratio (PVCR) as high as 102 were

obtained. The most important is the concept of programmable logic which is

feasible with SET based devices, since it has a high potential for low-power

and high packing density. The design of SET programmable logic array

(PLA) has been also reported by K. Uchida [11].

It is important to note that many other functions can be designed with

few-electron devices, taking advantage of their specific features. Especially,

several memory structures that are promising for low power consumption

have been reported [14-17]. For spiking neuron circuits, it has been proposed

to combine NVM MOSFET devices and single-electron circuit based on

multinanodot floating-gate arrays [18]. Also, some analog applications and

devices have been studied, like CCD [8], ADC [19], metrology [20] and

NEMS [21]. However, although some have been demonstrated, most of them

are still at the proof of concept level.

3.3 MOLECULAR ELECTRONICS

For the fabrication of SET's, any kind of conducting material can be used. Whereas the basic research was done on metallic SET's [2], circuit demonstrations are performed mainly on silicon [8-13], for complementarity with MOSFET's and to benefit from the huge investment in silicon technology.

However, it could be advantageous to use molecules for real applications

due to the size requirement discussed previously and because the

reproducibility of nanoscale structures is very challenging. In addition, the

load capacitance of circuits should be very low, for power consumption and

speed considerations, which implies short and narrow interconnects. The

most promising way to achieve it is the bottom-up approach, using naturally

formed tiny structures or self-assembling methods. The best example is the

carbon nanotube (CNT) which can be used to fabricate FET's [22-23], SET's

[24], interconnects [25] and even non volatile memory arrays [26].

CNT's are long cylinders of carbon atoms consisting of rolled-up sheets of

graphite. For Single Wall CNT's the diameter is as small as 1-5 nm.

Depending on their chirality, they are semiconductor or metallic materials.

Their mobility is much higher than the one of silicon, and a ballistic

transport has been demonstrated for lengths less than a few hundreds of nm,

but the subthreshold characteristics of CNFET's are not better than those of

MOSFET's. Worldwide, several teams are conducting research on the

selective growth or deposition of CNT's that would have the right chirality

and on the evaluation of CNFET's as potential candidates to replace

MOSFET's in the future. For low power applications, thanks to their

excellent transport properties [22], it should be possible to reduce the gate

overdrive and VDD, while meeting the ITRS specifications [1].

Different kinds of molecules are also currently investigated to make

nanometer-scale electronic components and circuits, but a single molecule

transistor has not yet been obtained. To date, one of the most advanced achievements is a 1 µm² 64-cell crossbar matrix fabricated by HP Labs [27] in which the switching units are bundles of rotaxane molecules. The operation of such molecules is not yet clear and other mechanisms, like the formation

of tiny filaments across the molecule gap between the electrodes could

explain the switching [28]. However, in the long term, this research subject

has a great potential for high density, low cost and probably ultra low power

electronics.

True single-molecule devices will require interconnects at a similar scale.

This is also essential to reduce parasitic capacitances and the power

consumption. Since the needed resolution is far beyond the possibility of

lithographic tools, including NGL, the solution will come from the bottom-

up approach. An example is the realization by Caltech of a Pt nanowire

lattice with width and pitch of 8 nm and 16 nm respectively [29]. Biology

can also come to the rescue for the self-assembly of nano-circuits [30]. A

very different approach, also mitigating the arduous task of nanoscale

patterning, is the concept of self-assembled nanocells proposed by J. Tour

[31]. These nanocells are disordered arrays of metallic islands that are

interlinked with molecules and that are accessed by metallic input/output

leads. Switching-type functions have been observed but, as for the work of

HP [27], the creation and dissolution of metal filaments is probably

responsible for the behaviour. In fact, the behaviour of electrically active

molecules is strongly influenced by surrounding electrodes and other

materials, which makes a difference between molecular nanotechnology and

bulk or solution-phase chemistry.

3.4 DISCUSSION

For CNT devices, as well as for nano MOSFET's, the supply voltage

reduction is dependent upon the effects of subthreshold leakage on the static

power, leading to trade-off with the speed of circuits. For SET's, the

steepness of the On-Off switching is not better, but they offer an increased

functionality and low charge operation. However, there are important issues,

especially about the sensitivity to offset charges and the fabrication of nano-

scale structures with sufficient level of reproducibility, which require a lot of

Although it is not yet clear if they could achieve a lower VDD, there are

other candidates, like the resonant tunnelling diodes (RTD). Their operation

is based on electron transport via discrete energy levels in double barrier

quantum well structures, leading to the existence of a negative differential

resistance. This implies the fabrication in suitable materials and a perfect

control of the geometry, since the output characteristics are extremely

sensitive to the dimensions. A promising approach is the implementation of

RTD along semiconductor nanowires [31]. There are also prospective

studies for a molecular version and for structures mixing Coulomb blockade

and resonant effects [32].

One of the most important features of nanodevices, especially for

molecular ones, is their size. That offers the possibility to lower the power

consumption by parallel processing. For instance, consider two blocks of

low capacitance molecular devices doing the same task as one block of

conventional devices, but at half the clock frequency. The Ngate . fc product

being unchanged, equation 1 shows that the power consumption is directly

related to the C.V² product, the gate switching energy, which can be

strongly reduced thanks to lower capacitances and to the possibility to have

devices with lower On current, since fc is divided by 2 in this case.
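The parallel-processing argument can be made concrete with expression (1). In the sketch below all numerical values are hypothetical and chosen only for illustration: doubling the number of gates while halving the clock frequency leaves the Ngate.fc product unchanged, so any saving comes from the C.V² term.

```python
def dynamic_power(a, n_gate, c_load, v_dd, f_c):
    """Dynamic power following expression (1), with a single load capacitance per gate."""
    return a * n_gate * c_load * v_dd ** 2 * f_c


# One block of conventional devices (illustrative values).
baseline = dynamic_power(a=0.1, n_gate=1e6, c_load=1e-15, v_dd=1.0, f_c=1e9)

# Two identical blocks at half the clock frequency: Ngate*fc is unchanged,
# so the power is the same if C and V are unchanged.
parallel_same_devices = dynamic_power(a=0.1, n_gate=2e6, c_load=1e-15,
                                      v_dd=1.0, f_c=0.5e9)

# Two blocks of hypothetical low-capacitance devices at a lower supply voltage:
# C is divided by 10 and V by 2, so C*V^2 is divided by 40.
parallel_low_cv2 = dynamic_power(a=0.1, n_gate=2e6, c_load=0.1e-15,
                                 v_dd=0.5, f_c=0.5e9)

saving = baseline / parallel_low_cv2

print(f"baseline {baseline * 1e3:.0f} mW, low-CV2 parallel "
      f"{parallel_low_cv2 * 1e3:.1f} mW, saving x{saving:.0f}")
```

The halved clock frequency does not reduce the power by itself; it relaxes the On-current requirement, which is what makes the lower capacitance and voltage usable.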

Going further, quantum-dot cellular automata (QCA) is an attractive

approach, yet speculative, to reduce the power consumption, since there is

no flow of current but only Coulomb interactions [33]. The principle is to

encode binary information by charge configuration in electrostatically

coupled cells in which there are two extra electrons. It has been shown that a

clock field is needed to control the direction of propagation of information

along the cells and to enable power gain. This clock could also be used for a

quasi-adiabatic switching, leading to extremely low power consumption. To

date, experimental demonstrations are performed at low temperature on

metallic structures, but molecular implementations are being investigated in

view of RT operation [34].

References

[1] Semiconductor Industry Association, International Technology Roadmap for

Semiconductors 2001 Edition, http://public.itrs.net, 2003

[2] H. Grabert and M. H. Devoret, Single charge tunnelling Coulomb blockade phenomena

in nanostructures, volume 294 of NATO ASI Series B, Plenum Press New York and

London, 1992

[3] T. Douseki, T. Shimamura, N. Shibata, A 0.3V 3.6GHz 0.3mW frequency divider with

differential ED-CMOS/SOI circuit technology, in Proc. ISSCC, February 2003

[4] N. M. Zimmerman and W. H. Huber, Excellent charge offset stability in a Si-based

single-electron tunneling transistor, APL Vol. 79, N.19, pp. 3188-3190, 2001

[5] J. R. Tucker, Complementary digital logic based on the Coulomb blockade, JAP 72 (9),

1, pp. 4399-4413, 1992

[6] A. N. Korotkov, R. H. Chen and K. K. Likharev, Possible performance of capacitively

coupled single-electron transistors in digital circuits, JAP 78 (4), pp. 2520-2529, 1995

[7] M-Y. Jeong, B-H. Lee and Y-H. Jeong, Design considerations for low-power single-

electron transistor logic circuits, JJAP. Vol.40, pp. 2054-2057, 2001

[8] Y. Takahashi, Y. Ono, A. Fujiwara and H. Inokawa, Silicon Single-Electron Devices for

logic applications, in Proc. ESSDERC September 2002, Florence, pp. 61-68

[9] Y. Ono, H. Inokawa and T. Takahashi, Binary adders of multigate Single-Electron

Transistors: specific design using Pass-Transistor Logic, IEEE Trans. on Nanotech.

Vol.1 pp. 93-99, 2002

[10] N. Takahashi, H. Hishikuro and T. Hiramoto, A directional current switch using silicon

Single Electron Transistors controlled by charge injection into silicon nano-crystal

floating dots, in Proc. IEDM, pp.371-374, 1999

[11] K. Uchida, J. Koga, R. Ohba and A. Toriumi, Programmable Single-Electron Transistor

logic for future low-power intelligent LSI: proposal and room-temperature operation,

IEEE Trans. on Elec. Dev. Vol.50, pp.1623-1630, 2003

[12] G. Molas, X. Jehl, M. Sanquer, B. de Salvo, M. Gely, D. Lafond and S. Deleonibus,

Manipulation of periodic Coulomb Blockade Oscillations in ultra-scaled memories by

single electron charging of silicon nanocrystals floating gates, Silicon Nano Workshop,

Honolulu, June 2004

[13] K. Nishiguchi, H. Inokawa, Y. Ono, A. Fujiwara and Y. Takahashi, Automatic control

of the oscillation phase of a Single-Electron Transistor, IEEE EDL25 (1), pp. 31-33,

[14] K. Yano, T. Ishii, T. Hashimoto, T. Kobayashi, F. Murai and K. Seki, Room-

Temperature Single-Electron Memory, IEEE Trans. on Elec. Dev. Vol.41,

NO.9,pp.1628-1638, 1994

[15] Z. A. K. Durrani, A. Irvine and H. Ahmed, Coulomb blockade memory using integrated

Single-Electron Transistor/Metal-Oxide-Semiconductor transistor gain cells, IEEE

Trans. on Elec. Dev. Vol.47, pp.2334-2339, 2000

[16] H. Sunamura, H. Kawaura, T. Sakamoto and T. Baba, Multiple-valued memory

operation using a Single-Electron Device: a proposal and an experimental

demonstration of a ten-valued operation, JJAP Vol. 41, pp. L93-L95, 2002

[17] G. Molas, B. de Salvo, D. Mariolle, G. Ghibaudo, A. Toffoli, N. Buffet and S.

Deleonibus, Single electron charging phenomena at room temperature in a silicon

nanocrystal memory, in Proc. WODIM 2002, Grenoble

[18] T. Morie, T. Matsuura, M. Nagata and A. Iwata, A multinanodot floating-gate MOSFET

circuit for spiking neuron models, IEEE Trans. On Nanotechnology, Vol. 2, NO. 3, pp.

158-164, 2003

[19] H. Inokawa, A. Fujiwara and Y. Takahashi, A multiple-valued logic and memory with

combined Single-Electron and Metal-Oxide-Semiconductor transistors, IEEE Trans. on

Elec. Dev. Vol.50, NO.2, pp. 462-470, 2003

[20] H. E. van den Brom et al., Counting electrons one by one – overview of a joint european

research project, IEEE Trans. on Inst. and Meas. Vol. 52, NO.2, pp. 584-588, 2003

[21] S. Mahapatra, V. Pott, S. Ecoffey, A. Schmid, C. Wasshuber, J. W. Tringe, Y. Leblebici,

M. Declercq, K. Banerjee and A. Ionescu, SETMOS: a novel true hybrid SET-CMOS

high current Coulomb Blockade Oscillation cell for future nano-scale analog ICs, in

Proc. IEDM 2003, pp. 703-706

[22] A. Javey, H. Kim, M. Brink, Q. Wang, A. Ural, J. Guo, P. McIntyre, P. McEuen, M.

Lundstrom and H. Dai, High-K dielectrics for advanced carbon-nanotube transistors

and logic gates, Nature Materials, Vol 1, pp. 241-246, December 2002

[23] P. Avouris, Carbon nanotube electronics, Chemical Physics, 281 (2002), pp. 429-445

[24] K. Matsumoto, S. Kinoshita, Y. Gotoh, K. Kurachi, T. Kamimura, M. Maeda, K.

Sakamoto, M. Kuwahara, N. Atoda and Y. Awano, Single-Electron Transistor with

ultra-high Coulomb energy of 5000K using position controlled grown carbon nanotube

as channel, JJAP Vol.42 Part 1 N°4B, pp. 2415-2418, 2003

[25] J. Li, Q. Ye, A. Cassell, H. T. Ng, R. Stevens, J. Han and M. Meyyappan, Bottom-up

approach for carbon nanotube interconnects, APL Vol. 82, N°15, pp. 2491-2493, 2003

[26] Rueckes, K. Kim, E. Joselevich, G. Y. Tseng, C-L. Cheung and C. Lieber, Carbon

nanotubes-based nonvolatile Random Access Memory for molecular computing,

Science, Vol. 289, pp. 94-97, 7 July 2000

[27] Y. Chen, G-Y. Jung, D. A. Ohlberg, X. Li, D. R. Stewart, J. O. Jeppesen, K. A. Nielsen,

J. F. Stoddart and R. S. Williams, Nanoscale molecular-switch crossbar circuits,

Nanotechnology 14 (2003) 462-468

[28] R. F. Service, Next-generation technology hits an early midlife crisis, Science Vol. 302,

pp. 556-559, 24 October 2003

[29] N. Melosh, A. Boukai, F. Diana, B. Gerardot, A. Badolato, P. M. Petroff and J. R.

Heath, Ultrahigh-density nanowire lattices and circuits, Science, Vol. 300, pp.112-115,

4 April 2003

[30] P. Fairley, Germs that build circuits, IEEE Spectrum, pp. 37-41, November 2003

[31] M. T. Björk, B. J. Ohlsson, C. Thelander, A. I. Persson, K. Deppert, L. R. Wallenberg

and L. Samuelson, Nanowire resonant tunneling diodes, APL, Vol. 81, N°23, pp. 4458-

4460, December 2002

[32] M. Saitoh and T. Hiramoto, Room-temperature operation of highly functional Single-

Electron Transistor logic based on quantum mechanical effect in ultra-small silicon dot,

in Proc. IEDM 2003, pp. 753-756

[33] G. Bernstein, Quantum-dot Cellular Automata: computing by field polarization, in Proc.

DAC 2003, June 2-6, Anaheim (CA), pp. 268-273

[34] C. Lent and B. Isaksen, Clocked molecular Quantum-dot Cellular Automata, IEEE

Trans. on Elec. Dev. Vol.50, NO.9, pp. 1890-1896, 2003

Chapter 4

STATIC LEAKAGE REDUCTION THROUGH

SIMULTANEOUS VT/TOX AND STATE ASSIGNMENT

Dongwoo Lee, Bo Zhai, David Blaauw and Dennis Sylvester

University of Michigan, Ann Arbor

Abstract: Standby leakage current minimization is a pressing concern for mobile applica-

tions that rely on standby modes to extend battery life. In this paper, we

propose new leakage current reduction methods in standby mode. First, we pro-

pose a combined approach of sleep-state assignment and threshold voltage (Vt)

assignment in a dual-Vt process for subthreshold leakage (Isub) reduction. Sec-

ond, for the minimization of gate oxide leakage current (Igate) which has

become comparable to Isub in 90nm technologies, we extend the above method

to a combined sleep-state, Vt and gate oxide thickness (Tox) assignment

approach in a dual-Vt and dual-Tox process to minimize both Isub and Igate. By

combining Vt or Vt / Tox assignment with sleep-state assignment, leakage cur-

rent can be dramatically reduced since the circuit is in a known state in standby

mode and only certain transistors are responsible for leakage current and need

to be considered for high-Vt or thick-Tox assignment. A significant improve-

ment in the leakage/performance trade-off is therefore achievable using such

combined methods. We formulate the optimization problem for simultaneous

state/Vt and state/Vt/Tox assignments under delay constraints and propose both

an exact method for its optimal solution as well as two practical heuristics with

reasonable run time. We implemented and tested the proposed methods on a set

of synthesized benchmark circuits and show substantial leakage current reduc-

tion compared to the previous approaches using only state assignment or Vt

assignment alone.

Keywords: Leakage current, reduction, performance, dual threshold voltage, oxide thick-

ness, algorithm.

4.1 INTRODUCTION

There is a growing need for high-performance and low-power systems,

especially for portable and battery-powered applications. Since these applica-

tions often remain in stand-by mode significantly longer than in active mode,

their stand-by (or leakage) current has a dominant impact on battery life.

Standby mode leakage current reduction therefore has been a concern for

some time and a number of such methods have been proposed to address this

problem [1]-[7][9]-[18]. However, with continued process scaling, lower sup-

ply voltages necessitate reduction of threshold voltages to meet performance

goals and result in a dramatic increase in subthreshold leakage current. New

methods for reducing the leakage current in standby mode are therefore criti-

cally needed.

In dual-Vt technology, the MTCMOS approach [1] was proposed where a

high-Vt sleep transistor is inserted between the power supply and the circuit

logic. In standby mode, this sleep transistor is turned off which dramatically

reduces leakage due to its high-Vt. However, the method requires routing of

an additional set of power supply lines in the layout as well as substantially

sized sleep transistors to maintain good supply integrity and circuit perfor-

mance [2]. Also, special latches that maintain state in standby mode need to

be used [3]. In addition, the method does not scale well into sub-1V technolo-

gies due to the increased delay penalty for the high-Vt sleep device [4].

A different approach to standby mode leakage reduction has been proposed

that leverages the state dependence of leakage current due to the so-called

stack effect [5][6]. In [7], the circuit input state that minimizes leakage

current is determined and special flip-flops are inserted in the design to pro-

duce this state in standby mode. The flip-flops in the design are modified to

produce a predetermined state in standby mode while also maintaining the

previously latched state. The required modification to a flip-flop is minor and

can be incorporated in the feedback path of the slave latch with minimal

impact on performance [8]. In general, determining the minimum sleep state

is a difficult problem due to the inherent logic correlations in the circuit. How-

ever, a number of efficient heuristics for this problem have been proposed

[9][10]. The limitation of this approach is that for larger circuits, the reduction

in leakage current is typically only in the range of 10 to 30% [9].

The above techniques are aimed primarily at subthreshold leakage current

reduction which has been the dominant component of leakage in CMOS tech-

nologies to date. However, in 90nm technologies the magnitude of gate

tunneling leakage, Igate, in a device is comparable to the subthreshold leakage,

Isub, at room temperature. With difficulties in achieving manufacturable high-k

insulator solutions to address the gate leakage problem, the burden of addressing

this problem falls primarily on circuit designers and EDA tools. As a result,

there has been recent work in the area of gate leakage analysis and reduction

techniques including pin reordering, PMOS sleep transistors, and the use of

NAND implementations rather than NOR implementations [11]-[13]. Also,

the MTCMOS technique was extended to combat gate leakage by using a

thick-oxide I/O device with a larger gate drive than the logic transistors as the

inserted sleep transistor [14].

Another previous approach to leakage reduction that targets only sub-

threshold leakage is to use individual assignment of transistor threshold

voltages in a dual-Vt process [15]-[18]. In these approaches, the trade-off

between high-Vt transistors with low leakage/low performance and low-Vt

transistors with high leakage/high performance is exploited. Circuit paths that

are non-critical are assigned high-Vt while critical circuit portions are given

low-Vt assignments. The method therefore provides a trade-off between cir-

cuit performance and leakage reduction. It was demonstrated that with a

modest performance reduction of 5–10%, significant reduction of 3-4X in

leakage could be obtained over a circuit with all low-Vt transistors [17]. In

these approaches, high/low-Vt assignments are performed without knowledge

of the states of the circuit. Therefore, in order to obtain sufficient leakage

reduction under all possible circuit states, all or most of the transistors in a

particular gate must be set to high-Vt and hence the gate incurs a substantial

performance degradation.

While such dual-Vt processes have been commonplace for several genera-

tions, the availability of multiple oxide thicknesses in a single process has

only become relevant at the 90nm node due to the rise of Igate [19]. Given a

process technology with dual oxide thicknesses for logic devices, the dual-Vt

approach can be easily extended to also consider gate leakage by assigning

thick-oxide transistors to non-critical paths as well. However, similar to the

dual-Vt assignment approach, a simultaneous dual-Vt and dual oxide thickness

assignment with unknown states of the circuit will set all or most of the tran-

sistors in a particular gate to both high-Vt and thick-oxide, to ensure that under

all possible circuit states in standby mode leakage current is acceptable. How-

ever, transistors that are simultaneously assigned a high-Vt and a thick-oxide

have a dramatic delay penalty compared to low-Vt transistors with thin oxide.

Therefore, this approach carries with it a significant delay penalty for process

technologies where both Isub and Igate need to be addressed.

In this paper, we therefore propose new methods to reduce standby mode

leakage current. We can divide our new methods into two categories: 1)

simultaneous dual-Vt and sleep state assignment for Isub reduction for technol-

ogies in which Isub is dominant in standby mode and 2) simultaneous dual-Vt,

dual oxide thickness and sleep state assignment for both Isub and Igate minimization

for technologies in which Igate is comparable to Isub. First,

we combine the concepts of Vt assignment and sleep state assignment. This

approach is based on the key observation that, given a known input state for a

gate, the leakage of that gate can be dramatically reduced by setting only a

single OFF-transistor on each path from Vdd to Gnd to high-Vt. Since all other

transistors in the gate are kept at low-Vt and continue to have high drive current,

the performance degradation is limited while a significant reduction in

leakage current is obtained. This approach therefore provides a much better

trade-off between leakage and performance compared to Vt assignment with

unknown input state where most or all of the transistors must be set to high-Vt

before a significant improvement in the leakage current is observed. The link

between the effectiveness of Vt assignment and state assignment was previ-

ously observed for Domino logic [8], since these circuits are by their own

nature in a known state in standby mode. However, we extend this concept to

general CMOS circuits by actively controlling the circuit state in standby

mode, thereby dramatically increasing the effectiveness of leakage reduction.

The second proposed approach minimizes the total leakage current (Isub

and Igate) by simultaneous assignment of sleep state, high-Vt and thick-oxide

transistors. In this approach, a key observation is that given a known input

state, a transistor need not be assigned both a high-Vt and a thick oxide since

Isub only occurs in transistors that are OFF while significant Igate occurs only

in transistors that are ON. Furthermore, depending on the input state of a cir-

cuit, only a subset of transistors need to be considered for either high-Vt or

thick-oxide. Therefore, the impact on the delay of the gate is significantly

reduced while obtaining leakage reductions comparable to when all transis-

tors are assigned to both high-Vt and thick-oxides. The proposed method is

compatible with existing library-based design flows, and we explore different

trade-offs between the number of Vt and Tox variations for each library cell

and the obtained leakage reduction. In addition, we compare the obtained

leakage reduction when Vt (the first method) and Vt / Tox (the second method)

assignments can be made individually for transistors in a stack as opposed to

when an entire stack is restricted to a uniform assignment due to manufactur-

ing or area considerations.

Since the circuit state / Vt and the circuit state / Vt / Tox assignments inter-

act, it is necessary to consider their optimization simultaneously. The state / Vt

and state / Vt / Tox assignment task is to find a simultaneous assignment that

minimizes the leakage current in standby mode while meeting a user specified

delay constraint. We formulate this problem as an integer optimization prob-

lem under delay constraints. The search space consists of all input state / Vt

and input states / Vt / Tox assignments and hence is very large. Therefore, in

addition to an exact solution, we also propose a number of heuristics. The pro-

posed methods are implemented on benchmark circuits synthesized using an

industrial cell library in 0.18 µm technology for Isub minimization and in a predictive

65nm technology for both Isub and Igate minimization. On average, the

proposed Isub minimization method by simultaneous state / Vt assignment

approach improves leakage current by a factor of 6X over the traditional

approach using Vt assignment only. The second proposed method that mini-

mizes both Isub and Igate by simultaneous state / Vt / Tox assignment has an

average leakage reduction of 5-6X over an all low-Vt and thin-oxide design

solution with a 5% delay range point and achieves more than a 2X improve-

ment over the first proposed approach using Vt and state assignment only (i.e.,

without dual-Tox).

The remainder of this paper is organized as follows. In Section 4.2, we discuss

the leakage model used and the characteristics of Isub and Igate leakage

current. In Section 4.3, we present the approach using simultaneous Vt and

state assignment for Isub leakage reduction. In Section 4.4, we present the sec-

ond approach that also addresses Igate by performing simultaneous Vt, Tox,

and state assignment. In Section 4.5, we present our results on benchmark cir-

cuits and in Section 4.6 we present our conclusions.

4.2 LEAKAGE MODEL AND CHARACTERISTICS

In this section, we discuss our leakage current model and briefly review

the general characteristics of gate leakage current in CMOS gates.

Since the proposed leakage optimization approach is library-based, we use

precharacterized leakage current tables for each library cell, with specific

leakage table entries for each possible input state of a library cell. The pre-

characterized tables were constructed using SPICE simulation with BSIM3

models from a 0.18 µm technology for the Isub minimization approach. In order to

represent both Isub and Igate components for the state / Vt / Tox assignment

approach, BSIM4 models were used to generate the precharacterized

tables. The device simulation parameters were obtained using leakage estimates

from a predictive 65nm process [20], and had a gate leakage

component that was approximately 36% of the total leakage at room temperature

(at which all analysis is performed; see footnote 1). (Detailed numbers are given in

Section 4.5.2.) Different high- and low-Vt versions of a cell as well as Tox and

Vt versions of a cell will be explained further in Section 4.4.2. Also, the delay

and output slope as a function of cell input slope and output loading were

stored in precharacterized tables.

Footnote 1: Since this work aims at standby mode leakage, we expect junction temperatures during these idle periods

to be lower than under normal operating conditions, making room temperature analysis more valid.

The total gate leakage for a library cell consists of several different components,

depending on the input state of the gate, as illustrated for the inverter

cell in Figure 4.1. The maximum gate tunneling current occurs when the input

is at Vdd and Vs = Vd = 0V for the NMOS device. In this case, Vgs = Vgd = Vdd

and the Igate is at its maximum for the NMOS device. At the same time, the

PMOS device exhibits substantial subthreshold leakage current. When the

input is at Gnd, the output rises to Vdd and Vgs = 0 while Vgd will become -Vdd

for the NMOS device, resulting in a reverse gate tunneling current from the

drain to the gate node. In this case, tunneling is restricted to the gate-to-drain

overlap region, due to the absence of a channel. Since this overlap region is

much smaller than the channel region, reverse tunneling current is signifi-

cantly reduced compared to the forward tunneling current [21]. Note that

BSIM4 intrinsically considers this reverse tunneling current so it is included

in the precharacterized tables described above.
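The two bias conditions of the inverter NMOS device described above can be tabulated with a few lines of code. This is only an illustrative sketch: the supply value and the function names are hypothetical, and the classification simply restates the qualitative behaviour in the text rather than any BSIM4 equation.

```python
VDD = 1.0  # hypothetical supply voltage in volts


def nmos_gate_bias(input_high):
    """Return (Vgs, Vgd) of the inverter NMOS for a static input level."""
    v_gate = VDD if input_high else 0.0
    v_source = 0.0                         # NMOS source is tied to Gnd
    v_drain = 0.0 if input_high else VDD   # output is the inverse of the input
    return v_gate - v_source, v_gate - v_drain


def tunneling_mode(input_high):
    """Classify the NMOS gate tunneling current as described in the text."""
    vgs, vgd = nmos_gate_bias(input_high)
    if vgs > 0 and vgd > 0:
        return "forward (gate to channel, maximum Igate)"
    if vgd < 0:
        return "reverse (drain to gate, overlap region only)"
    return "negligible"


for level in (True, False):
    name = "Vdd" if level else "Gnd"
    print(f"input = {name}: (Vgs, Vgd) = {nmos_gate_bias(level)}, {tunneling_mode(level)}")
```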

When the input voltage is Gnd, the PMOS device also exhibits gate cur-

rent from the channel to the gate since its Vgs = Vgd = -Vdd. The relative

magnitude of the PMOS gate current in comparison to the NMOS gate current

differs for different process technologies. If standard SiO2 is used as the gate

oxide material, then the Igate for a PMOS device is typically one order of mag-

nitude smaller than that for an NMOS device with identical Tox and Vdd

[19][22]. This is due to the much higher energy required for hole tunneling in

SiO2 compared to electron tunneling. However, in alternate dielectric materi-

als, the energy required for electron and hole tunneling can be completely

different. In the case of nitrided gate oxides, in use today in a few processes,

PMOS Igate can actually exceed NMOS Igate for higher nitrogen concentra-

tions [23][24]. In this paper, we assume that standard SiO2 gate oxide material

is used and the PMOS gate current is negligible. However, the presented

methods can be easily extended to include appreciable PMOS gate leakage as well.

Figure 4.1. Inverter circuit with NMOS oxide leakage current.

4.3 SUBTHRESHOLD LEAKAGE REDUCTION

4.3.1 Simultaneous Vt and State Assignment

Consider the leakage and performance of the simple NAND2 circuit

shown in Figure 4.2 under different input states and Vt assignments. It is clear

that given a particular input state, only those transistors that are OFF need to

be considered for high-Vt assignment as the ON-transistors are not leaking.

For instance, in state AB = 01, only transistor tn1 needs to be considered for

high-Vt assignment. Assigning other transistors to high-Vt will only decrease

the performance of the gate with no reduction in leakage current. On the other

hand, in state 11 both tp1 and tp2 must be assigned high-Vt in order to reduce

leakage, since they are parallel devices.

We can partition the transistors into so-called Vt-groups, corresponding to

the minimum sets of transistors that need to be set to high-Vt to reduce leak-

age in a particular state assignment. For the 2-input NAND gate in Figure 4.2,

three Vt-groups exist as shown. The concept of Vt-groups can be easily

applied to more complex structures in which case it may be possible that a

transistor belongs to more than one Vt-group. It is clear that we can restrict ourselves

to setting only entire Vt-groups to either high or low-Vt. By considering

only Vt-groups, instead of individual transistors, we therefore significantly

reduce the number of possible Vt assignments and the optimization complexity.
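The choice of candidate Vt-groups for the NAND2 gate can be sketched as a small lookup. The mapping of groups to transistors is an assumption inferred from the text and Table 4.1 (group 1 = both parallel PMOS devices, group 2 = tn1 driven by input A, group 3 = tn2 driven by input B); the function name is hypothetical.

```python
from itertools import product

# Assumed group membership, inferred from the text rather than read off the figure.
VT_GROUPS = {1: ("tp1", "tp2"), 2: ("tn1",), 3: ("tn2",)}


def candidate_groups(a, b):
    """Vt-groups whose high-Vt assignment can reduce leakage in input state AB."""
    if a == 1 and b == 1:
        # Output is low: both parallel PMOS devices are OFF and leak,
        # so they must be raised to high-Vt together.
        return [1]
    # Output is high: leakage flows through the series NMOS stack, and any
    # OFF device in the stack is a candidate for high-Vt.
    groups = []
    if a == 0:
        groups.append(2)
    if b == 0:
        groups.append(3)
    return groups


for a, b in product((0, 1), repeat=2):
    groups = candidate_groups(a, b)
    devices = [VT_GROUPS[g] for g in groups]
    print(f"AB = {a}{b}: candidate groups {groups}, transistors {devices}")
```

Only state 00 offers a choice between groups; in the other three states the candidate group is fixed by the input state.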

Figure 4.2. The concept of groups for a NAND2 gate

Table 4.1. Leakage current of NAND2 gate

                           Leakage current [pA]
State   Assigned group   with Group Assign.   with All High Vt   with All Low Vt
00      2                24.9                 7.2                286.7
00      3                9.8                  7.2                286.7
00      2 and 3          7.2                  7.2                286.7
01      2                26.6                 26.6               1054.0
10      3                25.7                 24.4               922.6
11      1                14.2                 14.2               357.2

In Table 4.1, we show the leakage current for the NAND2 in Figure 4.2 for

different input states and Vt-group assignments. Column 3 shows the leakage

current when we use high-Vt for one or more Vt-groups that are OFF in a particular

input state. In columns 4 and 5, the leakage current with all transistors

assigned to, respectively, high-Vt and low-Vt is shown. We can see that in

states 01, 10, and 11 only a single Vt-group is a candidate for high-Vt assignment.

Also, setting only this one Vt-group to high-Vt results in equal or nearly

equal leakage compared with the leakage when all transistors are assigned

high-Vt, demonstrating the effectiveness of the approach. In state 00, three

high-Vt assignments are possible: group 2, group 3, and both group 2 and 3.

However, the leakage current with both groups assigned to high-Vt is only

slightly better than that with only one group set to high-Vt, and assigning

group 3 to high-Vt reduces leakage somewhat more than assigning group 2 to

high-Vt. Hence, it is clear that we need only consider assignment of group 3

to high-Vt without significant loss in optimality.
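The comparison drawn from Table 4.1 can be reproduced directly from the tabulated values. The sketch below is illustrative only: the dictionary layout and function name are hypothetical, and the state 00 entries follow the reading of the table that is consistent with the discussion above (group 2: 24.9 pA, group 3: 9.8 pA, both: 7.2 pA).

```python
# Leakage in pA from Table 4.1: state -> {assigned group(s): leakage}.
GROUP_LEAKAGE = {
    "00": {(2,): 24.9, (3,): 9.8, (2, 3): 7.2},
    "01": {(2,): 26.6},
    "10": {(3,): 25.7},
    "11": {(1,): 14.2},
}
ALL_HIGH = {"00": 7.2, "01": 26.6, "10": 24.4, "11": 14.2}
ALL_LOW = {"00": 286.7, "01": 1054.0, "10": 922.6, "11": 357.2}


def best_single_group(state):
    """Lowest-leakage option that sets exactly one Vt-group to high-Vt."""
    singles = {g: leak for g, leak in GROUP_LEAKAGE[state].items() if len(g) == 1}
    best = min(singles, key=singles.get)
    return best[0], singles[best]


for state in sorted(GROUP_LEAKAGE):
    group, leak = best_single_group(state)
    print(
        f"state {state}: group {group} -> {leak} pA "
        f"({ALL_LOW[state] / leak:.1f}x below all low-Vt, "
        f"{leak / ALL_HIGH[state]:.2f}x the all high-Vt value)"
    )
```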

Table 4.1 shows that the leakage current varies considerably as different

groups associated with different input states are set to high-Vt. At the same

time, the impact of different high-Vt group assignments on the performance of

the circuit must be considered. By setting only a single group to high-Vt, the

performance degradation is restricted to only a single signal transition direc-

tion and is also reduced compared to high-Vt assignments where most or all

transistors are set to high-Vt. Therefore, the performance/power trade-off of Vt

assignment with known input state is much improved compared with that with

unknown input state.

The input state of a gate affects which transition direction is degraded by a

high-Vt group assignment to a gate. Also, the position of the high-Vt group in

a stack of transistors changes the impact of a high-Vt group assignment on the

different input to output gate delays. Therefore, the input state of a gate must

be chosen such that its associated high-Vt group results in the least degrada-

tion of the critical paths in the circuit. However, only the input state of the

circuit as a whole can be controlled and the logic correlations of the circuit

restrict the possible assignments of gate input states. Therefore, selection of

the circuit input state and of which gate is assigned a high-Vt group must be

made simultaneously to obtain the maximum improvement in leakage current

with minimum loss in performance.

4.3.2 Exact Solution to Vt and State Assignment

The size of the input state space is 2^n, where n is the number of circuit

inputs. For each input state assignment, there are two possible Vt assignments

for each gate (one high-Vt group which is pre-determined by its input state,

and all low-Vt). The total number of possible Vt assignments is therefore 2^m,

where m is the number of gates in the circuit and the total size of the search

space is 2^(n+m).
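To give a feel for how quickly this search space grows, the following sketch evaluates 2^(n+m) for a few circuit sizes. The sizes are hypothetical and are not taken from the benchmark set.

```python
def search_space_size(n_inputs, n_gates):
    """Number of (input state, Vt assignment) combinations: 2^n * 2^m."""
    return 2 ** n_inputs * 2 ** n_gates


# Hypothetical circuit sizes, for illustration only.
for n, m in [(4, 10), (16, 100), (32, 1000)]:
    size = search_space_size(n, m)
    print(f"n = {n}, m = {m}: roughly 10^{len(str(size)) - 1} combinations")
```

Even a circuit with a few hundred gates is far beyond exhaustive enumeration, which is why pruning and heuristics are needed.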

In order to find an exact solution to the problem, we developed an efficient

branch-and-bound method that simultaneously explores the state and Vt assignments

and that exploits the characteristics of the problem to obtain efficient

pruning of the search space to improve the run time. Due to the exponential

nature of the problem, an exact solution is only possible for very small cir-

cuits. However, the exact approach is still useful as the proposed heuristics

are based on it.

We use two types of branch and bound trees. The first branch-and-bound

tree determines the input state of the circuit and is referred to as the state tree.

The nodes of the state tree correspond to the input variables of the

circuit. Each node of the state tree is associated with a so-called gate tree

which is searched to determine the group Vt assignment. In other words, for a

state tree with k nodes, there exist k copies of the gate tree. Each node in a par-

ticular gate tree corresponds to a gate in the circuit, as shown in Figure 4.3.

Each node has two fanout edges, representing the assignment of that gate with

all low-Vt groups (left branch) or with one high-Vt group, as determined by

the input state of the gate (right branch).

At the root of the state tree, the state of all input variables is unknown. As

the algorithm proceeds down the tree, the state of one input variable becomes

defined with each level that is traversed. At each node in the state tree, a solu-

tion of leakage current can be obtained by traversing the gate tree. Note that

the gate tree may be traversed both with a completely known input state at the

bottom of the state tree as well as with a partially or completely unknown

input state, at higher levels of the state tree.

Figure 4.3. State tree with gate tree at each node

For each node in the state and gate tree, an upper and lower bound on the

leakage current is computed incrementally as explained in Section 4.3.2.1.

Note that early in the state tree the bounds on leakage will be very loose since

the state of the circuit is only partly defined. As the algorithm traverses down

the state tree, the input state becomes more defined and the leakage bounds

become closer. Similarly, the leakage bounds are very wide at the top of each

gate tree, as the Vt assignments of all gates are unknown, and become progressively

tighter as the algorithm traverses down the tree. Only at the bottom of

both the state tree and its associated gate tree do the upper and lower bounds on

leakage coincide. The algorithm first traverses down to the bottom of the tree

and then returns back up, to traverse down unvisited branches in DFS manner.

During the search, a tree branch is pruned if it has a lower bound on

leakage that is worse than the best upper bound on leakage that has been

observed so far. In addition to pruning based on leakage bounds, we also

compute a lower bound on the circuit delay at each node in the gate tree tra-

versal and prune all branches whose lower bound exceeds the specified delay

constraint. Computation of the delay bounds is also performed incrementally

and is discussed in Section 4.3.2.2.

Also, early in the state tree, computation of the exact minimum Vt assignment

by traversing the gate tree is not meaningful since even at the bottom of

the gate tree there is considerable uncertainty in the leakage current due to the

unknown input state. Therefore, the gate tree is searched only partially at the

higher levels of the state-tree which results in slightly more conservative

bounds, but an overall improvement in the run time of the algorithm.

The gate tree is also searched in DFS manner and edges are pruned based

on the computed leakage bounds. During the downward traversal of the gate

tree, the high-Vt branch is always selected, provided it meets the delay constraint.

This is due to the fact that the high-Vt branch always has less leakage

current than the low-Vt branch. Only if the lower bound on the delay of the

high-Vt branch exceeds the delay constraint, is the low-Vt branch selected and

is the high-Vt branch pruned.
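The gate-tree traversal can be illustrated with a minimal depth-first branch-and-bound for one fully known input state. This is a sketch under strong simplifying assumptions, not the implementation used in this work: the gates are assumed to form a single chain so that circuit delay is the sum of gate delays, slope effects are ignored, and all names and numbers are hypothetical.

```python
def gate_tree_search(gates, max_delay):
    """DFS branch-and-bound over per-gate Vt choices for one known input state.

    gates: list of dicts with keys leak_low, leak_high, delay_low, delay_high.
    Returns (best_leakage, assignment) with one 'H' or 'L' per gate,
    or (inf, None) if no assignment meets the delay constraint.
    """
    n = len(gates)
    # Suffix sums: best-case contribution of the gates not yet visited.
    min_leak_rest = [0.0] * (n + 1)   # all unvisited gates at high-Vt
    min_delay_rest = [0] * (n + 1)    # all unvisited gates at low-Vt
    for i in range(n - 1, -1, -1):
        min_leak_rest[i] = min_leak_rest[i + 1] + gates[i]["leak_high"]
        min_delay_rest[i] = min_delay_rest[i + 1] + gates[i]["delay_low"]

    best = [float("inf"), None]

    def visit(i, leak, delay, choice):
        if leak + min_leak_rest[i] >= best[0]:
            return  # leakage lower bound cannot beat the best solution so far
        if delay + min_delay_rest[i] > max_delay:
            return  # delay lower bound already violates the constraint
        if i == n:
            best[0], best[1] = leak, "".join(choice)
            return
        g = gates[i]
        # High-Vt branch first: it always leaks less than the low-Vt branch.
        visit(i + 1, leak + g["leak_high"], delay + g["delay_high"], choice + ["H"])
        visit(i + 1, leak + g["leak_low"], delay + g["delay_low"], choice + ["L"])

    visit(0, 0.0, 0, [])
    return best[0], best[1]


# Hypothetical three-gate chain; leakage in pA, delay in ps.
example_gates = [
    {"leak_low": 1054.0, "leak_high": 26.6, "delay_low": 100, "delay_high": 130},
    {"leak_low": 357.2, "leak_high": 14.2, "delay_low": 100, "delay_high": 120},
    {"leak_low": 922.6, "leak_high": 25.7, "delay_low": 100, "delay_high": 130},
]
best_leak, best_choice = gate_tree_search(example_gates, max_delay=350)
print(best_choice, round(best_leak, 1))
```

Integer delays are used so that the comparison against the delay constraint is not affected by floating-point rounding.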

Finally, the gates in the circuit are assigned to nodes in the gate tree in

topological order to enable incremental delay computation. Gates of equal

topological level are further sorted by decreasing leakage to improve the

pruning of the search space. The input signals of the circuit are also assigned

to nodes in the state tree in specific order. We want to place inputs whose state

assignment strongly influences the total leakage of the circuit near the top of

the state tree. We estimate the influence of each input signal on the circuit

leakage by taking the sum of the leakage current of all gates connected to the

input signal. This input variable ordering is similar to that used in [25].
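
The two orderings described above can be sketched as follows. The data layout (dictionaries keyed by gate name) and the function names are assumptions made for illustration; only the ordering rules themselves come from the text.

```python
from collections import defaultdict


def order_inputs_by_influence(gate_leakage, gate_inputs):
    """Sort circuit inputs by the summed leakage of the gates they connect to.

    gate_inputs maps each gate to the primary inputs wired to it
    (hypothetical representation). Highest influence comes first, so it
    lands nearest the top of the state tree.
    """
    influence = defaultdict(float)
    for gate, inputs in gate_inputs.items():
        for signal in inputs:
            influence[signal] += gate_leakage[gate]
    return sorted(influence, key=lambda s: (-influence[s], s))


def order_gates(gate_level, gate_leakage):
    """Topological level first, then decreasing leakage within a level."""
    return sorted(gate_level, key=lambda g: (gate_level[g], -gate_leakage[g], g))


# Hypothetical three-gate circuit.
gate_leakage = {"g1": 5.0, "g2": 2.0, "g3": 4.0}
gate_inputs = {"g1": ["a", "b"], "g2": ["b"], "g3": ["c"]}
gate_level = {"g1": 0, "g2": 0, "g3": 1}

input_order = order_inputs_by_influence(gate_leakage, gate_inputs)
gate_order = order_gates(gate_level, gate_leakage)
```

Here input b touches gates with a combined leakage of 7.0, so it is placed ahead of a (5.0) and c (4.0).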

4.3.2.1 Incremental leakage bound computation

During the traversal of the gate tree, some of the gates will have a known

Vt assignment and others, which have not been visited, will have an unknown

Vt assignment. As shown in Figure 4.4, a lower bound on the leakage is com-

puted by assuming all unknown gates have a high-Vt group assignment and an

upper bound is computed by assuming all unknown gates have a low-Vt group

assignment. As the high branch is taken in the downward traversal, only the

upper bound is updated (decreased), whereas when a low branch is taken, only the

lower bound must be updated and is increased.
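
A minimal sketch of this incremental update is given below. It assumes, for illustration only, that each gate has one leakage value per Vt group and that gate leakages simply add; the function names and numbers are hypothetical.

```python
def initial_bounds(leak_high, leak_low):
    """Bounds before any gate is decided.

    Lower bound: every gate assumed high-Vt. Upper bound: every gate low-Vt.
    """
    return sum(leak_high), sum(leak_low)


def take_branch(lb, ub, leak_high_i, leak_low_i, branch):
    """Update the bounds when gate i is fixed to 'H' or 'L'."""
    if branch == "H":
        # The upper bound assumed low-Vt for this gate; swap in the high-Vt value.
        return lb, ub - leak_low_i + leak_high_i
    if branch == "L":
        # The lower bound assumed high-Vt for this gate; swap in the low-Vt value.
        return lb - leak_high_i + leak_low_i, ub
    raise ValueError("branch must be 'H' or 'L'")


# Hypothetical two-gate circuit (arbitrary leakage units).
leak_high = [1.0, 2.0]
leak_low = [10.0, 20.0]

lb0, ub0 = initial_bounds(leak_high, leak_low)
lb1, ub1 = take_branch(lb0, ub0, leak_high[0], leak_low[0], "H")
lb2, ub2 = take_branch(lb1, ub1, leak_high[1], leak_low[1], "L")
```

Once every gate has been decided the two bounds meet, which is the exact leakage of that assignment.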

4.3.2.2 Incremental delay bound computation

Similar to the leakage current bounds, a lower bound on the delay is com-

puted assuming all unknown gates have low-Vt group assignments. Delay is

changed only when a high branch is taken in the traversal and is computed

incrementally. We first compute the slack of the circuit for all circuit nodes at

the start of the tree traversal with all Vt assignments assumed to be low-Vt.

When a group changes from a low to a high-Vt group assignment during the

traversal, the slack of that gate will be updated. However, the Vt change of the

gate will affect not only the gate itself but also the delays of fanout gates due

to the slope change at the output of the changed gate. Since the slope at the

output of the changed gate will become slower due to its high-Vt assignment,

the delay of all fanout gates will increase, resulting in an overall increased circuit

delay. Ignoring the effect of slope change on fanout gates will therefore

result in the computation of an optimistic lower bound which ensures that the

optimal solution is not accidentally pruned. It also enables incremental delay

computation, given that the gates are visited in topological ordering. As gates

are visited, the changed input slope, due to high-Vt assignments of a fanin

gate, is processed to ensure that an exact delay bound is computed at the bot-

tom of the gate tree.

4.3.3 Heuristic Solution to Vt and State Assignment

We propose two fast heuristics that can be applied to large circuits and that

produce high quality solutions. The proposed heuristics are based on the exact

method described in Section 4.3.2, and are discussed below.

Heuristic 1

In this heuristic, the state and gate tree search is limited to only one down-

ward traversal. Note that while only a single traversal of the state tree is

performed, at each node of the state tree the decision to follow the left or right

child node is based on the computed bounds of the leakage using the gate tree.

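Bounds when gates g_1 .. g_{i-1} are known and g_i .. g_m are not yet visited:

LB = leak(g_1 .. g_{i-1} = known gates) + leak(g_i .. g_m = H)
UB = leak(g_1 .. g_{i-1} = known gates) + leak(g_i .. g_m = L)

High-Vt branch taken for gate g_i:
UB_i = UB_{i-1} - leak(g_i = L) + leak(g_i = H)
LB_i = (unchanged)
Delay = (increased)

Low-Vt branch taken for gate g_i:
UB_i = (unchanged)
LB_i = LB_{i-1} - leak(g_i = H) + leak(g_i = L)
Delay = (unchanged)

Figure 4.4. Incremental leakage bound computation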

Each downward traversal of the gate tree visits m nodes, where m is the num-

ber of gates in the circuit. We perform exactly two such traversals at each

state tree node, leading to a total run time complexity that is O(nm), where n is

the number of circuit inputs. Since the number of inputs is generally thought

to grow approximately as sqrt(m), the total complexity of this heuristic is

O(m sqrt(m)).

Heuristic 2

In the second heuristic, the state tree is searched more extensively, subject

to a fixed run time constraint, while the gate tree search is kept to a single

downward traversal for each state tree node. Experimentally, it was found that

the quality of the first bottom node reached in the gate tree search is near the

optimal Vt assignment. This is due to the fact that the gate tree always chooses

the high-Vt child in its downward traversal which tends to produce a high

quality result. This is in contrast to the state tree, where choosing the correct

child during the downward traversal was found to be much more difficult.

Therefore, the solution quality was found to improve most by searching the

state tree more extensively, subject to a run time constraint, while limiting the

gate tree search to a single downward traversal.

4.3.4 Vt Assignment Control within Stacks

We assume the ability to assign Vt on an individual basis within stacks of

transistors. Although it is generally possible to assign the Vt of each transistor

in a stack individually, this may result in the need for increased spacing

between the transistors in order not to violate design rules and ensure manu-

facturability [26]. Hence, at times it may be desirable to restrict the

assignment of Vt such that all transistors in a stack are uniform. In this case,

less flexibility exists in the assignment of Vt, and hence the obtained trade-off

in delay and leakage will degrade to some extent. In Section 4.5.1, we present

results showing the impact on the leakage optimization when uniform stack

assignments are enforced in the library.

4.4 LEAKAGE REDUCTION METHOD FOR BOTH SUBTHRESHOLD AND GATE LEAKAGE CURRENT

4.4.1 Leakage Reduction Approach

The proposed leakage optimization method performs simultaneous assign-

ment of standby mode state and high-Vt and thick-oxide transistors. The

proposed method is based on the key observation that given a known input

state, a transistor need not be assigned both a high-Vt and a thick oxide. This

is due to the fact that if a transistor is OFF, its gate leakage is significantly

reduced and hence the transistor only needs to be considered for high-Vt

assignment. Conversely, a transistor that, given a particular input state, is ON

may exhibit significant Igate, but does not impact Isub. Hence, conducting tran-

sistors only need to be considered for thick oxide assignment. If the input state

is unknown in standby mode, it cannot be predicted at design time which tran-

sistors will be ON or OFF and therefore all or most transistors must be

assigned to both high-Vt and thick-oxide in order to significantly reduce the

total average leakage. However, given a known input state, we can avoid

assignment of transistors to both high-Vt and thick oxide, thereby significantly

improving the obtained leakage / delay trade-off.
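
The key observation above can be sketched as a first-level rule: under a known input state, an OFF transistor is only a candidate for high-Vt and an ON transistor is only a candidate for thick oxide. The transistor naming (p1, n1, ...) follows the figures, but the functions themselves are hypothetical, and the sketch deliberately ignores stack position and the negligible PMOS gate leakage, which narrow the candidate set further.

```python
def device_states(inputs):
    """ON/OFF state of each transistor of a 2-level CMOS gate.

    inputs is a sequence of 0/1 values for pins i1, i2, ...
    A PMOS device conducts when its input is 0, an NMOS device when it is 1.
    """
    states = {}
    for k, bit in enumerate(inputs, start=1):
        states[f"p{k}"] = (bit == 0)
        states[f"n{k}"] = (bit == 1)
    return states


def candidate_options(states):
    """OFF devices are high-Vt candidates, ON devices thick-oxide candidates."""
    return {name: ("thick-oxide" if on else "high-Vt") for name, on in states.items()}


# Input state 01 (i1 = 0, i2 = 1).
options_01 = candidate_options(device_states([0, 1]))
```

No transistor is ever a candidate for both options, which is the source of the improved leakage / delay trade-off.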

Furthermore, depending on the input state of a circuit, only a subset of

transistors needs to be considered for high-Vt or thick-oxide, as discussed in

Section 4.3.1. For instance, in a stack of several transistors that are OFF, only

one transistor needs to be assigned to high-Vt to effectively reduce the total

Isub. Similarly, Igate for transistors in a stack also has strong dependence on

their position. If a conducting transistor is positioned above a non-conducting

transistor in a stack, its Vgs and Vgd will be small and gate leakage will be

reduced. Hence, depending on the input state, only a small subset of all ON

transistors needs to be assigned thick-oxide and only a subset of all OFF transistors

needs to be considered for high-Vt assignment.

We illustrate the advantage of high-Vt and thick-oxide assignment with a

known input state for a 2-input NAND and NOR gate in Figure 4.5. In Figure

4.5(a) a 2-input NOR gate is shown with input state 01. Since only PMOS

transistor p2 is OFF in the pull-up stack, it is the only transistor that needs to

Figure 4.5. High Vt and thick oxide assignments at different input states (legend: high-Vt transistor, thick-oxide transistor)

be set to high-Vt to reduce the subthreshold leakage of the gate. Similarly,

only NMOS transistor n2 exhibits gate leakage and needs to be assigned thick

oxide to reduce Igate. Hence only two out of four transistors are affected while

the total leakage current is reduced by nearly the same amount as when all

transistors in the gate are set to high-Vt and thick oxide simultaneously. As a

result, the delay of the rising input transition at input i1 is unaffected by the

high-Vt and thick-oxide assignments, while the other transitions are affected

only moderately.

In Figure 4.5(b), the worst-case input state for a NOR2 gate is shown,

which is when both inputs are 1. In this case, both NMOS devices must be

assigned to thick-oxide to reduce Igate, while at least one PMOS device is set

to high-Vt. Depending on the delay requirements, the best input state is either

the state 01 shown in Figure 4.5(a), or the state 00, shown in Figure 4.5(c),

which requires only two transistors to be set to high-Vt. Hence, it is clear that

the input state significantly impacts the ability to effectively assign high-Vt

and thick-oxides without degrading the performance of the circuit. This leads

to the need for a simultaneous optimization approach where both the input

state and the high-Vt and thick-oxide assignments are considered simulta-

neously under delay constraints.

In addition to high-Vt and thick-oxide assignment, we also take advantage

of the Igate dependence on input pin ordering to reduce leakage current [11].

This is illustrated in Figure 4.5(d), for a 2-input NAND gate with input state

01. In order to effectively reduce the leakage under this input state, NMOS

transistor n1 must be assigned to high-Vt and NMOS transistor n2 must be

assigned to thick-oxide. However, if input pins i1 and i2 are reordered, with i1 positioned at the bottom of the stack, as shown in Figure 4.5(e), the Vgs and

Vgd voltage of NMOS transistor n1 will be reduced from Vdd to approximately

one Vt drop. Hence, the gate leakage current of n1 will be substantially

reduced and can be ignored. After reordering the input pins, it is necessary to

only set NMOS transistor n2 to high-Vt without further assignments of thick-

oxide transistors. It should be noted that pin reordering will impact the delay

of the circuit and hence some performance penalty might be incurred. How-

ever, this penalty will be readily offset by the elimination of the thick-oxide

assignment in the pull-down stack. In this paper, we therefore consider com-

bined input state assignment with pin-reordering and Vt / Tox assignment.

4.4.2 Cell Library Construction

In order to perform simultaneous Vt, Tox and state assignment, it is necessary

to develop a library in which the necessary Vt and Tox versions of each cell

are available. After such a library has been constructed, Vt and Tox

assignment can be performed by simply swapping cells from

the library. Since different Vt and Tox variations do not alter the footprint of a

cell, the leakage optimization can be performed either before or after final

placement and routing.

For each gate and input state, a number of different Tox and Vt assignments

are possible, providing different delay / leakage trade-off points. For the fastest

and highest leakage trade-off point, all transistors are assigned to low-Vt and

thin oxides, such as the NAND2 gate shown in Figure 4.6(a). On the other

hand, for the slowest and lowest leakage version of the cell all transistors con-

tributing to leakage are assigned either high-Vt or thick oxide. For instance,

for the NAND2 gate with input state 11, shown in Figure 4.6(b), all transistors

affect the leakage current and both NMOS transistors are assigned thick Tox

while both PMOS transistors are assigned high-Vt to obtain the minimum

leakage / maximum delay trade-off point.

In addition to the fastest version and minimum leakage version of the cell,

a number of other intermediate trade-off points can be constructed for a cell

by assigning only some of the transistors that contribute to leakage to high-Vt

or thick-Tox. These cell versions would have lower leakage than the fastest

cell version but would be faster than the lowest leakage version. It is clear that

a large number of possible cell versions can be constructed if all possible

trade-off points are considered for each possible input state. While a larger set

of cell versions provides the optimization algorithm with more flexibility, and

hence a more optimal leakage result, it also increases the size of the library,

which is undesirable. Therefore, we initially restrict our library to at most 4

different trade-off points for each input state of a library cell, which are: 1) the

minimum delay, shown in Figure 4.6(a), 2) minimum leakage, shown in Figure

4.6(b), 3) fast falling transition but slow rising transition, with

intermediate leakage, shown in Figure 4.6(c), and 4) fast rising transition but

slow falling transition with intermediate leakage, shown in Figure 4.6(d).

Although other possible trade-off points could be considered, we empirically

found that these four points yield good optimization results and provide a systematic

approach for constructing all versions of a cell.

Figure 4.6. Complete Vt-Tox versions of NAND2 gate

In principle, using four possible trade-off points for each input combina-

tion could result in as many as 16 (4x4) cell versions for a 2 input gate.

However, in practice, many of the cell versions are shared between different

input states. Also, in some cases not all 4 trade-off points are realizable and

hence the total number of cell versions is significantly less. We illustrate this

for the NAND2 gate for input state 00. The fastest cell version is again shown

in Figure 4.6(a) and is shared for all input combinations, and the minimum

leakage version is shown in Figure 4.6(e). Note that only one transistor needs

to be set to high-Vt to achieve minimum leakage for this input state. This

results from the fact that PMOS devices have negligible gate leakage in the

target technology and only one transistor in a stack needs to be set to high-Vt

to reduce the leakage through the entire stack. Hence, for the input state 00,

only two trade-off points are needed and only one additional cell version is

added to the library.

Input state 10 again requires the assignment of only a single transistor to

high-Vt for the minimum leakage version, as shown in Figure 4.6(f). This is

due to the fact that the gate leakage through the top NMOS transistor n1 is

negligible since its Vgs and Vgd are reduced to approximately one Vt drop. Only

two trade-off points are therefore required for this input state and both ver-

sions are shared with the 00 state. Finally, if the 01 state occurs in the circuit,

the optimization will automatically perform input pin swapping for all but the

fastest trade-off point, thereby resulting in no additional cell version. The

NAND2 gate therefore requires a total of 5 cell versions to provide up to 4

trade-off points for each input state. In Table 4.2, we show the delay / leakage

trade-offs obtained for each input state using the described approach for the

NAND2 gate.

Table 4.2. Trade-offs for different Vt-Tox versions of NAND2 gate

State  Cell version          Leakage   Normalized rise delay   Normalized fall delay
                             current   pin A   pin B           pin A   pin B
11     Minimum delay (a)       270.4    1.00    1.00            1.00    1.00
11     Fast rise delay (d)     109.1    1.00    1.36            1.27    1.27
11     Fast fall delay (c)      91.4    1.36    1.36            1.00    1.00
11     Minimum leakage (b)      19.5    1.36    1.37            1.27    1.27
00     Minimum delay (a)        41.2    1.00    1.00            1.00    1.00
00     Minimum leakage (e)      14.0    1.00    1.00            1.12    1.16
10     Minimum delay (a)        91.8    1.00    1.00            1.00    1.00
10     Minimum leakage (f)      13.3    1.00    1.00            1.12    1.16
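
The sharing of NAND2 cell versions between input states can be sketched as a small lookup. This reflects one reading of the text, namely that the minimum-leakage cell used for state 10 (Figure 4.6(f)) is the same physical cell as the one used for state 00 (Figure 4.6(e)), and it simplifies the 01 case by always swapping pins. The dictionary and function names are hypothetical.

```python
# State -> cell versions considered for that state. Under the reading above,
# state 10 reuses cell "e", so no separate "f" entry is added to the library.
NAND2_VERSIONS = {
    "11": ["a", "b", "c", "d"],
    "00": ["a", "e"],
    "10": ["a", "e"],
}


def canonical_state(state):
    """Map state 01 onto state 10 by swapping the two input pins.

    Simplification: the text swaps pins for all but the fastest version,
    which is version "a" in either case.
    """
    return "10" if state == "01" else state


def versions_for(state):
    """Cell versions the optimizer would choose from for this input state."""
    return NAND2_VERSIONS[canonical_state(state)]


def library_size():
    """Number of distinct NAND2 cells that must exist in the library."""
    return len({v for versions in NAND2_VERSIONS.values() for v in versions})


nand2_cells = library_size()
```

Under this reading the count comes to five distinct cells, matching the total stated for the NAND2 gate.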

The same process can be applied to each cell in the library to construct the

full set of cell versions for the leakage characterization method. Table 4.3

shows the number of cell versions required for several common gates. Note

that the number of cell versions is higher for NOR gates than NAND gates.

Since for a library the total number of cells would increase significantly, we

also explored reducing the number of cells by allowing only two trade-off

points for each cell (minimum delay, and minimum leakage), instead of 4

trade-off points. In this case, the number of cells for the NAND2 gate reduces

to only 3 versions. The number of cell versions required for two trade-off

points for different cell types is shown in Table 4.3, column 3. In column 4,

we add one more cell library version: two trade-off points with a reduced number

of cells. In order to minimize the number of needed library cells, one or

two cells of NOR2 or NOR3, respectively, are removed from the library with a

small degradation of the leakage / delay trade-off. Therefore, all gates have only

three cells in this option. In Section 4.5.2 we compare the final leakage results

using the full library with 4 trade-off points, the reduced library with only two

trade-off points, and the library with the minimum number of cells and two trade-off

points.

Finally, we consider Vt and Tox assignment control within stacks similar to

the discussion for Vt stack control in Section 4.3.4. However, Tox assignment

differs from Vt assignment in that the assignment of Tox to transistors in a

stack is already uniform due to the use of pin-swapping. This is evident from

the 5 added cell versions for the NAND2 in Figure 4.6, and can be easily

shown to be true for all cell versions generated under the proposed approach.

This is a significant advantage since spacing design rules for different Tox

assignments are expected to be more severe than those for spacing between

different Vt assignments [26]. However, the Vt assignment is not always uni-

form as shown in Figure 4.6(e), where only a single transistor in a stack is

assigned to high-Vt. In the event that a uniform stack is required, both transis-

tors in the stack need to be set to high-Vt, resulting in a slightly worsened

delay / leakage trade-off. Leakage current comparison results between individual

vs. uniform stack assignment control will be shown in Section 4.5.2.

Table 4.3. The number of needed library cells

Gate       4 trade-off   2 trade-off   2 trade-off points with
           points        points        reduced number of cells
Inverter        5             3             3
NAND2           5             3             3
NAND3           5             3             3
NOR2            8             4             3
NOR3            9             5             3

4.4.3 Optimization - Approach and Heuristics

In this section, we present an exact solution and two heuristics to the prob-

lem of finding a simultaneous input state, high-Vt and thick-Tox assignments

for a circuit under delay constraints. As mentioned, the leakage minimization

problem can be formulated as an integer optimization problem under delay

constraints. The size of the input state space is 2^n, where n is the number of

circuit inputs. As discussed in Section 4.4.2, for each input state assignment,

there are up to four possible Vt-Tox assignments for each gate. Note that while

the total number of cell versions can be larger than 4, only 4 of them need to

be considered for each specific input state. For instance, for the NAND2 gate

in Figure 4.6, only versions (a)-(d) are considered for the 11 input state. Therefore,

the total number of possible Vt-Tox assignments is 4^m, where m is the

number of gates in the circuit, and the total size of the search space is 2^(n+2m).
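
The size of the search space follows from multiplying the two counts: 2^n input states times 4^m Vt-Tox assignments gives 2^(n+2m). The short sketch below checks this identity; the function names are hypothetical, and the example uses the input and gate counts listed for C432 in Table 4.5 (36 inputs, 177 gates). Because at most four options exist per gate, the result is an upper bound.

```python
def search_space_exponent(n_inputs, n_gates):
    """Exponent e such that the search space holds at most 2**e points."""
    return n_inputs + 2 * n_gates


def search_space_size(n_inputs, n_gates):
    """2**n input states times 4**m gate assignments (upper bound)."""
    return (2 ** n_inputs) * (4 ** n_gates)


# C432: 36 inputs and 177 gates, as listed in Table 4.5.
exp_c432 = search_space_exponent(36, 177)
size_c432 = search_space_size(36, 177)
```

Even for this small benchmark the exponent is in the hundreds, which is why exhaustive search is not an option.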

In order to find an exact solution to the problem, we extend the branch-and-bound

method of Section 4.3.2. The branch-and-bound algorithm for

Vt-Tox and state assignment uses two interdependent search trees: state tree

and gate tree. The state tree is searched to determine the input state of the cir-

cuit and the gate tree is searched to determine the Vt-Tox assignment of the

circuit, as shown in Figure 4.7. The only difference from Section 4.3.2 is the

gate tree. Each node in a particular gate tree corresponds to a gate in the cir-

cuit. Since there are four possible Vt-Tox assignments for a gate, each node of

the gate tree has four edges: minimum delay, minimum leakage, fast fall delay

with intermediate leakage, and fast rise delay with intermediate leakage. The

exponential nature of the problem makes it impossible to obtain an exact solution

for substantial circuits, as with the Isub minimization approach in Section

4.3.2. Therefore, we also use the two heuristics discussed in Section 4.3.3.

Figure 4.7. State tree with gate tree at each node

4.5 RESULTS

4.5.1 Subthreshold Leakage Reduction

The proposed methods for simultaneous state and Vt assignment were

tested on the ISCAS benchmark circuits [27] and a 64-bit ALU circuit, synthesized

using a 0.18 µm industrial library with Synopsys. This technology has

a difference of 14X (10X) in Isub and 16% (15%) in delay between low-Vt and

high-Vt NMOS (PMOS) devices. The leakage current for each Vt version of a

cell was computed using SPICE simulation and stored in precharacterized

tables. Delay computation was performed based on the Synopsys table delay

model and was verified to match with Synopsys timing analysis delay reports.

In addition to the proposed methods, traditional methods using only state or Vt

assignment were also implemented for comparison. The state-only assign-

ment was implemented using the approach discussed in [25] while for Vt-only

assignment a method similar to the sensitivity-based approach of [17] was used.

Table 4.4 compares the leakage results obtained by the two proposed

heuristics for three delay constraints to the average leakage computed using

10,000 random input vectors. The columns marked 0%, 5%, and 10% refer to

leakage minimization results when the delay constraints were set at 0%, 5%,

and 10% respectively, of the full delay range between all low-Vt and all high-

Vt circuit delay, as illustrated in Figure 4.8. The 0% column is therefore the

Table 4.4. Leakage current comparison between heuristics

Minimized leakage current Ileak [nA]; X is the reduction factor vs. the average leakage current (Avg., from 10K random vectors); Time is the Heuristic 1 runtime in seconds. Delay points are percentages of the low-Vt / high-Vt delay range.

Circuit   Avg. | 0% of delay range             | 5% of delay range             | 10% of delay range
               | Heuristic 1        Heuristic 2 | Heuristic 1        Heuristic 2 | Heuristic 1        Heuristic 2
               |  Ileak     X Time  Ileak     X |  Ileak     X Time  Ileak     X |  Ileak     X Time  Ileak     X
C432      32.9 |    7.7   4.3    1    4.3   7.7 |    4.9   6.7    1    3.6   9.2 |    4.7   7.0    1    3.6   9.1
C499      94.0 |   13.2   7.1    3   11.3   8.3 |   13.1   7.2    2   11.6   8.1 |    9.7   9.6    2    9.7   9.6
C880      73.4 |    9.7   7.5    4    8.9   8.3 |    8.9   8.2    3    8.3   8.8 |    8.9   8.3    4    8.3   8.8
C1355     85.1 |   19.0   4.5    3   12.7   6.7 |   14.6   5.8    3   11.7   7.3 |   12.0   7.1    3   11.0   7.7
C1908     82.8 |   19.0   4.3    2   15.1   5.5 |   15.5   5.3    2   12.2   6.8 |   13.4   6.2    2   10.3   8.0
C2670    162.5 |   12.7  12.8   58   12.5  13.0 |   12.7  12.8   55   12.4  13.1 |   14.3  11.3   55   12.2  13.3
C3540    173.1 |   20.1   8.6   10   16.4  10.6 |   20.5   8.4   10   14.6  11.8 |   17.4  10.0    9   14.5  11.9
C5315    309.1 |   26.4  11.7  169   25.9  11.9 |   27.5  11.2  164   25.2  12.3 |   28.5  10.9  165   25.2  12.2
C6288    451.5 |  157.5   2.9   47  153.9   2.9 |  145.5   3.1   44  141.4   3.2 |  135.8   3.3   43  128.4   3.5
C7552    385.8 |   31.0  12.4  330   30.6  12.6 |   30.8  12.5  330   30.1  12.8 |   30.7  12.6  328   29.6  13.0
alu64    332.3 |   46.0   7.2  405   43.6   7.6 |   47.2   7.0  408   44.5   7.5 |   43.0   7.7  406   42.0   7.9
AVG            |          7.6               8.6 |          8.0               9.2 |          8.5               9.6

most stringently constrained optimization as it corresponds to the best obtain-

able delay for the circuit (no performance penalty). Note that a simple

replacement of all low-Vt devices with all high-Vt ones would yield a ~20%

circuit delay increase. Thus, when interpreting the results in this section, a

10% delay point indicates that the circuit after Vt assignment has a delay that

is approximately 2% larger than the original fastest implementation. Since the

average leakage current with 10,000 random input vectors is computed with

all low-Vt transistors, it also corresponds to a 0% delay criterion. Runtimes for

heuristic 1 are given in Table 4.4 in seconds. Heuristic 2 was limited to a runt-

ime of 1800 seconds (30 minutes). We report the reduction factor relative to

the average leakage current over the 10,000 random vectors. Heuristic 2 has

~10% lower leakage results than heuristic 1 at the 5% delay point across the

benchmark circuits. However, heuristic 2 has a 4-5X runtime overhead for

large circuits (~1000X for small circuits) over heuristic 1.
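
The relation between a delay point and the actual delay penalty can be sketched as a linear interpolation between the all low-Vt and all high-Vt circuit delays. The function names are hypothetical, and the example uses the approximate 20% range quoted above with delays normalized to the all low-Vt circuit.

```python
def delay_constraint(d_all_low, d_all_high, point_percent):
    """Absolute delay constraint for a given point in the low/high delay range."""
    return d_all_low + (point_percent / 100.0) * (d_all_high - d_all_low)


def delay_penalty_percent(d_all_low, d_all_high, point_percent):
    """Delay increase over the fastest (all low-Vt) implementation, in percent."""
    constraint = delay_constraint(d_all_low, d_all_high, point_percent)
    return 100.0 * (constraint - d_all_low) / d_all_low


# Normalized delays: all low-Vt = 1.0, all high-Vt = 1.2 (about 20% slower).
penalty_at_10 = delay_penalty_percent(1.0, 1.2, 10)
penalty_at_0 = delay_penalty_percent(1.0, 1.2, 0)
```

The 10% point therefore corresponds to a delay roughly 2% above the fastest implementation, and the 0% point to no penalty at all.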

In Table 4.5, we compare the proposed approach with traditional tech-

niques, including state-only and Vt-only assignment methods. The state-only

Figure 4.8. Delay point from all low-Vt to all high-Vt range (a scale running from the delay with all low-Vt to the delay with all high-Vt, with the 5%, 10%, and 50% points marked)

Table 4.5. Leakage current comparison with traditional techniques

Minimized leakage current Ileak [nA]; X is the reduction factor vs. the average leakage current (Avg., from random vectors). In and Gate are the number of inputs and gates of each circuit.

Circuit   In  Gate   Avg. | State-only   | 0% in the delay range    | 5% in the delay range    | 10% in the delay range
                          | assignment   | Vt-only     Heuristic 1  | Vt-only     Heuristic 1  | Vt-only     Heuristic 1
                          |  Ileak     X |  Ileak    X  Ileak     X |  Ileak    X  Ileak     X |  Ileak    X  Ileak     X
C432      36   177   32.9 |   26.3  1.25 |   30.8  1.1    7.7   4.3 |   29.5  1.1    4.9   6.7 |   29.2  1.1    4.7   7.0
C499      41   519   94.0 |   86.1  1.09 |   85.0  1.1   13.2   7.1 |   57.2  1.6   13.1   7.2 |   40.4  2.3    9.7   9.6
C880      60   364   73.4 |   63.7  1.15 |   64.6  1.1    9.7   7.5 |   63.9  1.1    8.9   8.2 |   20.2  3.6    8.9   8.3
C1355     41   528   85.1 |   81.4  1.04 |   94.0  0.9   19.0   4.5 |   65.1  1.3   14.6   5.8 |   53.4  1.6   12.0   7.1
C1908     33   432   82.8 |   74.6  1.11 |   67.0  1.2   19.0   4.3 |   46.5  1.8   15.5   5.3 |   30.3  2.7   13.4   6.2
C2670    233   825  162.5 |  146.2  1.11 |   44.7  3.6   12.7  12.8 |   39.7  4.1   12.7  12.8 |   27.8  5.8   14.3  11.3
C3540     50   940  173.1 |  155.7  1.11 |  161.9  1.1   20.1   8.6 |  148.4  1.2   20.5   8.4 |   82.4  2.1   17.4  10.0
C5315    178  1627  309.1 |  283.1  1.09 |  290.6  1.1   26.4  11.7 |  289.7  1.1   27.5  11.2 |  108.3  2.9   28.5  10.9
C6288     32  2470  451.5 |  412.4  1.09 |  417.0  1.1  157.5   2.9 |  259.5  1.7  145.5   3.1 |  233.0  1.9  135.8   3.3
C7552    207  1994  385.8 |  352.3  1.10 |  360.2  1.1   31.0  12.4 |  353.5  1.1   30.8  12.5 |  350.9  1.1   30.7  12.6
alu64    131  1803  332.3 |  294.5  1.13 |  312.8  1.1   46.0   7.2 |  288.5  1.2   47.2   7.0 |  230.1  1.4   43.0   7.7
Avg.                      |         1.12 |         1.3          7.6 |         1.6          8.0 |         2.4          8.5

assignment method was limited to a runtime of 1800 seconds (30 minutes).

The results demonstrate that substantial improvement in standby leakage cur-

rent can be obtained using the proposed methods, with an average

improvement of ~80% (5-6X) for the 0% and 5% delay constraints over Vt-

only assignment.
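
The two ways of reporting improvement used here, a reduction factor (X) and a percentage reduction, are related by simple arithmetic: a factor of 5 to 6 between two results corresponds to a reduction of 80% to about 83%. The sketch below spells this out; the function names are hypothetical, and the example values are the C432 entries at the 0% delay point in Table 4.5.

```python
def reduction_factor(baseline, minimized):
    """How many times smaller the minimized leakage is than the baseline."""
    return baseline / minimized


def percent_reduction(baseline, improved):
    """Leakage saved relative to the baseline, in percent."""
    return 100.0 * (1.0 - improved / baseline)


# C432 at the 0% delay point (Table 4.5), currents in nA.
avg_leak, vt_only, heuristic1 = 32.9, 30.8, 7.7

x_vs_average = reduction_factor(avg_leak, heuristic1)
saving_vs_vt_only = percent_reduction(vt_only, heuristic1)
```

For this circuit heuristic 1 is about 4.3X below the random-vector average and about 75% below the Vt-only result.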

Table 4.6 compares leakage current results for both individual and uni-

form stack control. Since uniform stack control degrades the delay/leakage

trade-off as discussed in Section 4.3.4, the results for uniform stack assign-

ment exhibit less leakage reduction than those of individual stack control. It is

interesting to note, however, that the leakage current degradation by moving

to a less fine-grained threshold voltage assignment scheme is not overly large

implying that even with manufacturing constraints, the proposed technique

provides significant leakage savings.

Finally, Figure 4.9 plots the leakage results for the proposed method and

the two traditional methods as a function of the delay constraint for circuit

c6288. The optimization was performed for a range of delay constraints. The

proposed method provides its largest improvements at tight delay constraints.

This is due to the fact that, as the delay constraint becomes looser, more tran-

sistors can be set to high-Vt in both approaches, and the relative advantage of

the proposed approach reduces. However, leakage reduction is most challeng-

ing under tight performance constraints at which the proposed technique

holds promise.

Table 4.6. Leakage current comparison between individual and uniform stack control.

Minimized leakage current Ileak [nA] at the 5% point in the low-Vt / high-Vt delay range; X is the reduction factor vs. the average leakage current (Avg., from 10K random vectors).

Circuit   Avg. | Vt-only      | Heuristic 1   | Heuristic 1
               | assignment   | individual    | uniform
               |  Ileak    X  |  Ileak     X  |  Ileak     X
C432      32.9 |   29.5  1.1 |    4.9   6.7 |    6.8   4.8
C499      94.0 |   57.2  1.6 |   13.1   7.2 |   12.5   7.5
C880      73.4 |   63.9  1.1 |    8.9   8.2 |    9.1   8.1
C1355     85.1 |   65.1  1.3 |   14.6   5.8 |   23.7   3.6
C1908     82.8 |   46.5  1.8 |   15.5   5.3 |   15.7   5.3
C2670    162.5 |   39.7  4.1 |   12.7  12.8 |   12.9  12.6
C3540    173.1 |  148.4  1.2 |   20.5   8.4 |   24.1   7.2
C5315    309.1 |  289.7  1.1 |   27.5  11.2 |   28.5  10.9
C6288    451.5 |  259.5  1.7 |  145.5   3.1 |  163.1   2.8
C7552    385.8 |  353.5  1.1 |   30.8  12.5 |   31.3  12.3
alu64    332.3 |  288.5  1.2 |   47.2   7.0 |   44.6   7.5
Avg.           |         1.6 |          8.0 |          7.5

4.5.2 Leakage Reduction for both Subthreshold and Gate Leakage

The proposed methods for simultaneous state, Vt and Tox assignment were

implemented on a number of benchmark circuits [27] synthesized using a

library based on a predictive 65nm process [20]. In this technology, the differ-

ence in Igate for the thick-oxide NMOS devices vs. the thin-oxide device is

11X, whereas Isub is reduced by 17.8X (16.7X) when replacing a low-Vt

NMOS (PMOS) device with a high-Vt version. Table 4.7 shows relative leak-

age and delay values at the four possible Vt and Tox assignments for NMOS

devices in this technology. A comparison of our first and second heuristics

along with average leakage computed using 10,000 random input vectors is

shown in Table 4.8. The total leakage current value is given in A and runt-

ime is given in seconds. In heuristic 2, we set the runtime limit as 1800

Figure 4.9. Leakage current comparison for c6288 (x-axis: delay point from all low-Vt to all high-Vt range [%]; curves: average current with low-Vt, state assignment only with low-Vt, dual-Vt assignment only, our proposed method (heuristic 1), state assignment only with high-Vt)

Table 4.7. Comparison of leakage and delay between four possible Vt-Tox assignments for NMOS (normalized values)

Vt     Oxide thickness   Isub   Forward Igate   Reverse Igate   Delay
Low    Thin              1.00   0.41            0.22            1.00
High   Thin              0.06   0.31            0.22            1.33
Low    Thick             0.73   0.04            0.00            1.26
High   Thick             0.05   0.03            0.02            1.69

seconds (30 minutes). The average leakage computed using the random vec-

tors can be used to approximate the standby mode leakage if state assignment

as well as dual-Vt and dual-Tox techniques were not employed. Again, the

delay range points used in all results are defined by a percentage of the maxi-

mum possible delay that is associated with moving from an all low-Vt and

thin-oxide design to an all high-Vt and thick-oxide implementation. Note that

a simple replacement of all fast devices with their slowest counterparts would

yield a ~70% circuit delay increase. Thus, when interpreting the results in this

section, a 5% delay point indicates that the circuit after Vt and Tox assignment

has a delay that is approximately 4% larger than the original fastest

implementation.

As shown in Table 4.8, heuristic 2 generally provides somewhat better

results but at much greater runtimes. On average, heuristic 2 provides ~10%

lower leakage current than heuristic 1 across these benchmarks at the 5%

delay point, similar to the results in Section 4.5.1. The improvement of the

two proposed heuristics compared to the average leakage without state, Vt or

Tox assignment is dramatic and approaches 7X at the 10% delay point in the

best-worst delay range. More aggressively, with just a 5% delay penalty the

reduction in total standby leakage is 5.3-6X with a maximum improvement of

8.6X for heuristic 2 in circuit c2670.

In Table 4.9 we compare our results to other standby mode techniques,

including state assignment alone and simultaneous state and Vt assignment (as

in the previous section). The total leakage current value is given in A. Again,

we report the reduction factor in relation to the average leakage current with

10,000 random vectors for consistency. We first point out that state assign-

Table 4.8. Leakage current comparison between heuristics with 4-option, individual stack control library

X is the reduction factor vs. the average leakage current (Avg., from random vectors); Time is the Heu1 runtime in seconds. Delay points are percentages of the best-worst delay range.

Circuit   Avg. | 0% of delay range             | 5% of delay range             | 10% of delay range
               | Heu1               Heu2        | Heu1               Heu2        | Heu1               Heu2
               |  Ileak     X Time  Ileak     X |  Ileak     X Time  Ileak     X |  Ileak     X Time  Ileak     X
c432      24.5 |    8.2   3.0    3    5.4   4.6 |    7.7   3.2    2    3.2   7.6 |    5.5   4.5    2    3.0   8.2
c499      65.8 |   32.2   2.0    7   31.1   2.1 |   26.1   2.5    7   24.6   2.7 |   22.7   2.9    6   20.8   3.2
c880      50.1 |   10.3   4.9    8    9.2   5.5 |    8.5   5.9    7    8.3   6.1 |    8.5   5.9    7    7.0   7.1
c1355     70.8 |   20.4   3.5    8   20.4   3.5 |   15.8   4.5    6   13.1   5.4 |    9.9   7.1    6    9.9   7.1
c1908     56.7 |   17.4   3.3    5   16.9   3.4 |   14.8   3.8    4   13.6   4.2 |   13.2   4.3    5   10.5   5.4
c2670    104.7 |   14.9   7.0   82   14.7   7.1 |   12.3   8.5   78   12.2   8.6 |   13.5   7.8   78   11.3   9.3
c3540    128.5 |   27.7   4.6   20   23.7   5.4 |   22.1   5.8   18   19.9   6.4 |   18.6   6.9   17   17.4   7.4
c5315    221.2 |   36.6   6.0  219   35.9   6.2 |   30.0   7.4  213   30.0   7.4 |   28.4   7.8  202   27.6   8.0
c6288    346.8 |  153.6   2.3   75  146.0   2.4 |  112.2   3.1   64  101.4   3.4 |   84.1   4.1   59   75.6   4.6
c7552    270.0 |   34.9   7.7  410   33.4   8.1 |   32.2   8.4  404   31.8   8.5 |   30.3   8.9  399   30.2   8.9
alu64    260.0 |   48.7   5.3  468   46.8   5.6 |   43.4   6.0  464   41.6   6.3 |   34.3   7.6  458   33.1   7.9
AVG            |          4.5               4.9 |          5.4               6.0 |          6.2               7.0

ment alone, which we accomplish by searching the state tree only, achieves

very little improvement in standby mode leakage; about 6%. By adding Vt

assignment, the algorithm of the first proposed method shows an average

reduction of 58% beyond state assignment alone at a 5% delay point. The full

Vt, Tox, and state assignment approach provides an additional 53% reduction

in current beyond state and Vt assignment for the 5% delay point.
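The relative reductions quoted above follow directly from the average reduction factors (the "X" columns) in Table 4.9. The sketch below is a minimal check; the helper name is hypothetical, and it uses the rounded AVG values from the table, so the results agree with the text only to within about one percentage point.

```python
def extra_reduction_pct(x_before, x_after):
    """Percent leakage reduction when the reduction factor relative to the
    random-vector average improves from x_before to x_after."""
    return (1.0 - x_before / x_after) * 100.0


# Rounded AVG reduction factors from Table 4.9 (5% delay point).
X_NONE = 1.0         # average leakage with random vectors (baseline)
X_STATE_ONLY = 1.06  # state assignment alone
X_VT_STATE = 2.5     # simultaneous Vt and state assignment
X_HEU1 = 5.4         # Vt, Tox, and state assignment (heuristic 1)

if __name__ == "__main__":
    print(f"state only vs. average:    {extra_reduction_pct(X_NONE, X_STATE_ONLY):.1f}%")
    print(f"Vt+state vs. state only:   {extra_reduction_pct(X_STATE_ONLY, X_VT_STATE):.1f}%")
    print(f"Vt+Tox+state vs. Vt+state: {extra_reduction_pct(X_VT_STATE, X_HEU1):.1f}%")
```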

Table 4.9. Leakage current comparison with 4-option, individual stack control library. The 0%, 5%, and 10% column groups are delay points in the delay range.

| Circuit | Average Ileak by random vectors | State assignment only Ileak | State assignment only X | 0% Vt & State Ileak | 0% Vt & State X | 0% Heu1 Ileak | 0% Heu1 X | 5% Vt & State Ileak | 5% Vt & State X | 5% Heu1 Ileak | 5% Heu1 X | 10% Vt & State Ileak | 10% Vt & State X | 10% Heu1 Ileak | 10% Heu1 X |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c432 | 24.5 | 22.7 | 1.08 | 13.3 | 1.8 | 8.2 | 3.0 | 12.5 | 2.0 | 7.7 | 3.2 | 12.7 | 1.9 | 5.5 | 4.5 |
| c499 | 65.8 | 63.9 | 1.03 | 41.9 | 1.6 | 32.2 | 2.0 | 35.7 | 1.8 | 26.1 | 2.5 | 32.2 | 2.0 | 22.7 | 2.9 |
| c880 | 50.1 | 46.0 | 1.09 | 18.9 | 2.6 | 10.3 | 4.9 | 17.5 | 2.9 | 8.5 | 5.9 | 16.9 | 3.0 | 8.5 | 5.9 |
| c1355 | 70.8 | 67.4 | 1.05 | 39.9 | 1.8 | 20.4 | 3.5 | 33.0 | 2.1 | 15.8 | 4.5 | 29.8 | 2.4 | 9.9 | 7.1 |
| c1908 | 56.7 | 54.8 | 1.04 | 27.6 | 2.1 | 17.4 | 3.3 | 25.8 | 2.2 | 14.8 | 3.8 | 22.9 | 2.5 | 13.2 | 4.3 |
| c2670 | 104.7 | 101.4 | 1.03 | 33.3 | 3.1 | 14.9 | 7.0 | 32.7 | 3.2 | 12.3 | 8.5 | 31.9 | 3.3 | 13.5 | 7.8 |
| c3540 | 128.5 | 121.8 | 1.05 | 54.5 | 2.4 | 27.7 | 4.6 | 51.5 | 2.5 | 22.1 | 5.8 | 48.5 | 2.7 | 18.6 | 6.9 |
| c5315 | 221.2 | 215.1 | 1.03 | 81.2 | 2.7 | 36.6 | 6.0 | 77.1 | 2.9 | 30.0 | 7.4 | 73.7 | 3.0 | 28.4 | 7.8 |
| c6288 | 346.8 | 306.7 | 1.13 | 209.3 | 1.7 | 153.6 | 2.3 | 180.4 | 1.9 | 112.2 | 3.1 | 153.7 | 2.3 | 84.1 | 4.1 |
| c7552 | 270.0 | 262.6 | 1.03 | 88.9 | 3.0 | 34.9 | 7.7 | 86.6 | 3.1 | 32.2 | 8.4 | 86.1 | 3.1 | 30.3 | 8.9 |
| alu64 | 260.0 | 237.2 | 1.10 | 90.7 | 2.9 | 48.7 | 5.3 | 86.1 | 3.0 | 43.4 | 6.0 | 81.1 | 3.2 | 34.3 | 7.6 |
| AVG | | | 1.06 | | 2.3 | | 4.5 | | 2.5 | | 5.4 | | 2.7 | | 6.2 |

Table 4.10. Leakage current comparison between cell library options (current unit: A). All results are at the 5% point in the best-worst delay range.

| Circuit | Average Ileak by random vectors | Individual stack, 4-option Ileak | X | Individual stack, 2-option Ileak | X | Individual stack, 2-option with 3 cell versions Ileak | X | Uniform stack, 4-option Ileak | X | Uniform stack, 2-option Ileak | X | Uniform stack, 2-option with 3 cell versions Ileak | X |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c432 | 24.5 | 7.7 | 3.2 | 7.4 | 3.3 | 7.1 | 3.4 | 7.3 | 3.4 | 7.9 | 3.1 | 8.6 | 2.8 |
| c499 | 65.8 | 26.1 | 2.5 | 26.7 | 2.5 | 27.8 | 2.4 | 26.0 | 2.5 | 28.0 | 2.3 | 28.9 | 2.3 |
| c880 | 50.1 | 8.5 | 5.9 | 9.7 | 5.2 | 8.0 | 6.3 | 10.0 | 5.0 | 10.7 | 4.7 | 10.8 | 4.6 |
| c1355 | 70.8 | 15.8 | 4.5 | 16.2 | 4.4 | 14.1 | 5.0 | 23.4 | 3.0 | 25.2 | 2.8 | 23.9 | 3.0 |
| c1908 | 56.7 | 14.8 | 3.8 | 14.9 | 3.8 | 14.3 | 4.0 | 15.9 | 3.6 | 15.3 | 3.7 | 16.8 | 3.4 |
| c2670 | 104.7 | 12.3 | 8.5 | 12.1 | 8.7 | 12.4 | 8.4 | 16.1 | 6.5 | 15.4 | 6.8 | 16.5 | 6.3 |
| c3540 | 128.5 | 22.1 | 5.8 | 24.2 | 5.3 | 25.3 | 5.1 | 27.1 | 4.7 | 25.8 | 5.0 | 29.2 | 4.4 |
| c5315 | 221.2 | 30.0 | 7.4 | 30.9 | 7.2 | 30.7 | 7.2 | 32.1 | 6.9 | 32.9 | 6.7 | 33.8 | 6.6 |
| c6288 | 346.8 | 112.2 | 3.1 | 114.2 | 3.0 | 114.2 | 3.0 | 134.0 | 2.6 | 147.8 | 2.3 | 145.4 | 2.4 |
| c7552 | 270.0 | 32.2 | 8.4 | 31.4 | 8.6 | 30.6 | 8.8 | 31.8 | 8.5 | 31.1 | 8.7 | 31.1 | 8.7 |
| alu64 | 260.0 | 43.4 | 6.0 | 44.0 | 5.9 | 43.2 | 6.0 | 42.0 | 6.2 | 47.0 | 5.5 | 46.1 | 5.6 |
| AVG | | | 5.4 | | 5.3 | | 5.4 | | 4.8 | | 4.7 | | 4.6 |

Table 4.10 provides a comparison of results using the various cell library options: 4 and 2 trade-off points with individual stack control, and also with uniform stacks. The main result in Table 4.10 is that there is very little leakage current penalty when moving from a full 4-option library to a simpler 2-option library. There are several cases where the smaller library outperforms the larger library due to the heuristic nature of the algorithm used (heuristic 1 is used in this table). Since the library size required in the 2-option scenario is roughly half that of 4-option, we conclude that the use of 2-option represents a very good trade-off between library complexity and potential leakage reduction. Moreover, we can see that the simplest cell library, 2-option with a reduced number of cells, provides good leakage reduction results. In general, a reduced number of cells degrades the leakage/delay trade-off, as discussed in Section 4.4.2. However, we find that only complex and infrequently used cells, such as 3-input NORs, require appreciable reductions in cell variants, which limits the impact on total leakage reduction. Therefore, very good leakage current minimization can be obtained even with libraries with 3 cell versions for each cell. Also, the restriction that each stack of transistors must use the same Vt and Tox is shown in Table 4.10 to have only a minor impact on leakage. For instance, the uniform stack 4-option case shows a 10.6% average power increase compared to the individual stack 4-option case; this still represents a nearly 5X reduction in standby leakage compared to the average case. Note that library complexity is not reduced in moving from individual to stack-based control; such a change would be dictated by manufacturing issues as well as the trade-off between standby power (lower for individual control) and cell area (expected to be slightly lower for stack-based control).

Finally, Figure 4.10 plots the leakage current results for the proposed method and traditional methods as a function of the delay constraint for circuit c6288. Here, a 100% delay point implies a complete replacement of low-Vt and thin-oxide devices with high-Vt and thick-oxide. This is clearly the lowest leakage solution but is also very slow. The key point in Figure 4.10 is that the proposed approaches (heuristic 2 results are not shown but are nearly identical to heuristic 1) provide substantial improvement beyond the average leakage or the use of state assignment alone and that these gains are achievable with very small and even zero delay penalties. The rapid saturation of the gains as the delay point increases beyond 10% implies that the new approach is best suited for achieving low-leakage standby states with very little performance overhead (e.g., 5% or even less). Note that the leakage current achieved by our proposed method does not converge to that by state assignment using all high-Vt and thick-oxide devices. The reason is that the selected library cells include only a limited number of thick-oxide assignments in order to simplify the library. Many additional library cells would be needed to achieve convergence to the minimal leakage solution; instead the bulk of this leakage savings can be achieved with very little performance penalty.

Figure 4.10. Leakage current comparison for c6288. Horizontal axis: delay point from the best to the worst range [%]. Curves: average current with low-Vt/thin-Tox; state assignment only with low-Vt/thin-Tox; dual-Vt and state assignment; our proposed method (heuristic 1); state assignment only with high-Vt/thick-Tox.

4.6 CONCLUSIONS

In this paper, we propose new approaches for standby leakage current minimization under delay constraints. Our approaches use simultaneous state assignment and Vt or Vt/Tox assignment. Efficient methods for computing the simultaneous state and Vt or Vt/Tox assignments leading to the minimum standby mode leakage current were presented. The proposed methods were implemented and tested on a set of synthesized benchmark circuits. The new state and Vt assignment technique demonstrates 6X lower leakage than previous Vt-only assignment approaches and 5X lower than state assignment alone (at the 5% delay point). In cases where gate leakage is prominent, as in 90nm CMOS technologies, these improvements are increased by an additional factor of 2 using state and Vt/Tox assignment. We also investigate the leakage/complexity trade-off for various cell library configurations and demonstrate that results are still very good even when only 2 additional variants are used for each cell type.

Acknowledgement

The authors would like to thank Harmander Deogun for his work on the leakage current model. The work has been supported by NSF, SRC, GSRC/DARPA, IBM, and Intel.

References

[1] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu and J. Yamada, “1-V power supply high-speed digital circuit technology with multithreshold voltage CMOS,” IEEE Journal of Solid-State Circuits, vol. 30, pp. 847-854, Aug. 1995.

[2] J. Kao, A. Chandrakasan, and D. Antoniadis, “Transistor sizing issues and tool for multi-threshold CMOS technology,” Proc. Design Automation Conference, pp. 409-414, 1997.

[3] S. Shigematsu, S. Mutoh, Y. Matsuya, Y. Tanabe and J. Yamada, “A 1-V high-speed MTCMOS circuit scheme for power-down application circuits,” IEEE Journal of Solid-State Circuits, vol. 32, pp. 861-869, June 1997.

[4] H. Kawaguchi, K. Nose and T. Sakurai, “A super cut-off CMOS (SCCMOS) scheme for 0.5V supply voltage with picoampere standby current,” IEEE Journal of Solid-State Circuits, vol. 35, pp. 1498-1501, October 2000.

[5] R. X. Gu and M. I. Elmasry, “Power dissipation analysis and optimization of deep submicron CMOS digital circuits,” IEEE Journal of Solid-State Circuits, vol. 31, no. 5, pp. 707-713, May 1996.

[6] Z. Chen, M. C. Johnson, L. Wei and K. Roy, “Estimation of standby leakage power in CMOS circuit considering accurate modeling of transistor stacks,” Proc. International Symposium on Low Power Electronics Design, pp. 239-244,

[7] J. Halter and F. Najm, “A gate-level leakage power reduction method for ultra-low-power CMOS circuits,” Proc. CICC, pp. 475-478, 1997.

[8] V. De, Y. Ye, A. Keshavarzi, S. Narendra, J. Kao, D. Somasekhar, R. Nair and S. Borkar, “Techniques for leakage power reduction,” in Design of High-Performance Microprocessor Circuits, New York: IEEE Press, 2001.

[9] M.C. Johnson, D. Somasekhar and K. Roy, “Models and algorithms for bounds on leakage in CMOS circuits,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 18, pp. 714-725, June 1999.

[10] A. Fadi, S. Hassoun, K. A. Sakallaha and D. Blaauw, “Robust SAT-based search algorithm for leakage power reduction,” Proc. International Workshop on Power and Timing Modeling, Optimization and Simulation, 2002.

[11] D. Lee, W. Kwong, D. Blaauw and D. Sylvester, “Analysis and minimization techniques for total leakage considering gate oxide leakage,” Proc. Design Automation Conference, pp. 175-180, 2003.

[12] R.S. Guindi and F.N. Najm, “Design techniques for gate-leakage reduction in CMOS circuits,” Proc. ISQED, pp. 61-65, 2003.

[13] F. Hamzaoglu and M.R. Stan, “Circuit-level techniques to control gate leakage for sub-100nm CMOS,” Proc. International Symposium on Low Power Electronics and Design, pp. 60-63, 2002.

[14] T. Inukai, M. Takamiya, K. Nose, H. Kawaguchi, T. Hiramoto and T. Sakurai, “Boosted Gate MOS (BGMOS): Device/circuit cooperation scheme to achieve leakage-free giga-scale integration,” Proc. Custom Integrated Circuit Conference, pp. 409-412, 2000.

[15] Q. Wang and S.B.K. Vrudhula, “Static power optimization of deep submicron CMOS circuits for dual Vt technology,” International Conference on Computer-Aided Design, pp. 490-496, 1998.

[16] L. Wei, Z. Chen, M. C. Johnson, K. Roy and V. De, “Design and optimization of low voltage high performance dual threshold CMOS circuits,” Proc. Design Automation Conference, pp. 489-494, 1998.

[17] S. Sirichotiyakul, T. Edwards, C. Oh, R. Panda and D. Blaauw, “Duet: an accurate leakage estimation and optimization tool for dual Vt circuits,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 10, pp. 79-90, April

[18] M. Ketkar and S. Sapatnekar, “Standby power optimization via transistor sizing and dual threshold voltage assignment,” Proc. ICCAD, pp. 375-378, 2002.

[19] S. Stiffler, “Optimizing performance and power for 130nm and beyond,” IBM Technology Group New England Forum, 2003.

[20] International Technology Roadmap for Semiconductors, 2002.

[21] N. Yang, W. K. Henson, and J. J. Wortman, “A comparative study of gate direct tunneling and drain leakage currents in N-MOSFETs with sub-2nm gate oxides,” IEEE Trans. Electron Devices, vol. 47, pp. 1636-1644, Aug. 2000.

[22] B. Yu, H. Wang, C. Riccobene, Q. Xiang and M.-R. Lin, “Limits of gate oxide scaling in nano-transistors,” Proc. Symposium on VLSI Tech., pp. 90-91, 2000.

[23] Y.-C. Yeo, Q. Lu, W.-C. Lee, T.-J. King, C. Hu, X. Wang, X. Guo and T. P. Ma, “Direct tunneling gate leakage current in transistors with ultra thin silicon nitride gate dielectric,” IEEE Electron Device Letters, vol. 21, pp. 540-542, Nov. 2000.

[24] Q. Xiang, J. Jeon, P. Sachdey, B. Yu, K. C. Saraswat and M.-R. Lin, “Very high performance 40nm CMOS with ultra-thin nitride/oxynitride stack gate dielectric and pre-doped dual poly-Si gate electrodes,” Proc. International Electron Devices Meeting, pp. 860-862, 2000.

[25] H. Kriplani, F. N. Najm and I. N. Hajj, “Pattern independent maximum current estimation in power and ground buses of CMOS VLSI circuits: algorithms, signal correlations, and their resolution,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 14, pp. 998-1012, Aug. 1995.

[26] Ruchir Puri, IBM T.J. Watson Research, personal communication.

[27] F. Brglez and H. Fujiwara, “A Neutral Netlist of 10 Combinatorial Benchmark Circuits,” Proc. ISCAS, pp. 695-698, 1985.

Chapter 5

ENERGY-EFFICIENT SHARED MEMORY ARCHITECTURES FOR MULTI-PROCESSOR SYSTEMS-ON-CHIP

Kimish Patel¹, Alberto Macii¹ and Massimo Poncino²

¹ Politecnico di Torino; ² Università di Verona

Abstract Most current multi-processor systems-on-chip (MPSoC) platforms rely on a shared-memory architectural paradigm. The shared memory, typically used for storage of shared data, is a significant performance bottleneck because it requires explicit synchronization of memory accesses that could otherwise occur in parallel. Multi-port memories are a widely used solution to this problem; they allow these potentially parallel accesses to occur simultaneously. However, they are not very energy-efficient, since their performance improvement comes at an increased energy cost per access. We propose an energy-efficient architecture for the shared memory that can be used as an alternative to multi-port memories, and that combines their performance advantage with a much smaller energy cost. The proposed scheme is based on the application-driven partitioning of the shared address space into a multi-bank architecture. This optimization can be used to quickly explore different power-performance tradeoffs, thanks to simple analytical models of performance and energy. Experiments on a set of standard parallel benchmarks show energy-delay product (EDP) savings of 50% on average.

Keywords: Multi-Processor Systems, Shared Memory, Systems-on-Chip.

5.1 INTRODUCTION

Modern design paradigms for MPSoCs are pushing towards architectures that are fully distributed, that work as general networks based on a modular layered architecture, and that are able to support non-deterministic communications. Such architectures, called Networks-on-Chip (NoCs) [1], have been devised as an answer to the scaling of SoC complexity, especially in terms of the increased number of hosted processing elements and of the decreased reliability of the communication medium.

In spite of these scalability challenges, most current SoCs are still based on a shared-medium architecture and, consequently, on a shared-memory paradigm. One reason for this slow migration to more complex architectures is cost. Shared on-chip buses represent a convenient, low-overhead interconnection, and they do not require special handling during the physical design flow. Another reason is the limited support provided by system software for such architectures. Although current silicon technology allows building SoCs with a large number of embedded cores, the capabilities currently offered by embedded software (e.g., in terms of OS primitives) do not allow all the potential computational power to be fully exploited; therefore, most SoC implementations consist of few (seldom more than 16) processor cores, for which a shared interconnect is perfectly suitable.

The architecture of these MPSoC platforms is thus reminiscent of traditional multi-processor systems, where inter-processor communication and/or synchronization is provided through the exchange of data through shared memories of different types. Generally speaking, accesses to the shared memories are significantly slower than accesses to local ones. First, shared memories are placed farther away from the processors than private memories; in fact, the latter are often tightly coupled to the cores by means of dedicated local buses, while shared memories are necessarily connected to a shared bus. Moreover, accesses to the shared buses by the processors require some form of arbitration, which may require the insertion of wait cycles in case of simultaneous accesses. As a consequence, the shared memories tend to become a major bottleneck for the bandwidth of the overall system, especially for applications in which parallelism is built around shared data.

Caching of shared data might be a solution, but it raises the well-known issue of cache coherence, i.e., the possible inconsistency between data stored in caches of different processors. Cache coherence can be solved in hardware, yet with an extra overhead that may not be affordable in small-scale, low-cost SoCs such as those considered in this work. Software-based cache coherence is also a viable solution, but it essentially consists of limiting the caching of shared data to safe times [2]. For applications in which parallelism is built around shared data, this basically amounts to avoiding the caching of shared data. In this paper, this will be our assumption: all accesses to shared data will always imply an access to the shared memory.

Providing sufficient memory bandwidth to sustain fast program execution and data communication/transfers is mandatory for most embedded applications. Increasing memory bandwidth can be achieved by making use of different types of on-chip embedded memories, which provide shorter latencies and wider interfaces [3–5]. One typical solution used to match the computational bandwidth with that of memory is to use multi-port memories. This solution increases the sustainable bandwidth by construction, since a $P$-port memory allows in fact up to $P$ accesses in parallel (i.e., in a single memory cycle). Therefore, by properly choosing the number of ports of the memory versus the number of processors, the issue of synchronization of simultaneous accesses can be easily solved.

The adoption of multi-port memories, however, comes at the price of a significant increase in area, wiring resources, and energy consumption. On the other hand, architectures based on multi-port memories seem to be the only viable option in the cases where bandwidth optimization has absolute priority.

In this work we propose an alternative architecture for the shared memory which combines the advantages, in terms of bandwidth, of the multi-port approach, with the advantages, in terms of energy consumption and access time, of partitioned memories [5]. We propose the use of small, single-port memory blocks as a way to achieve memory bandwidth increase together with low energy demand. In our scheme, the memory addressing space is mapped over single-port banks that can be simultaneously accessed by different processors, so as to mimic for a large fraction of the execution time the behavior of a dual-port memory. Energy efficiency is enforced by two facts: first, the single-port blocks have an energy access cost which is smaller than that of monolithic (either single or dual-port) memories; second, address mapping is application-driven, and cell access frequency data is thus used to determine the optimal sizes of the memory blocks.

Based on analytical expressions for performance and energy consumption that allow the energy-performance tradeoff to be explored, we present experimental results showing that the new architecture guarantees energy savings as high as 69% with respect to a dual-port memory configuration (54% with respect to the baseline, single-ported architecture), with comparable improvement of the memory bandwidth.

The rest of the chapter is organized as follows. Section 5.2 provides some background material on memory energy modeling, multi-port memories, and application-driven memory partitioning. Section 5.3 describes how partitioned memories can be used to achieve an energy-efficient shared memory architecture. Section 5.4 illustrates the analytical models used to drive the energy-performance exploration engine, which is discussed in Section 5.5. Section 5.6 presents the optimization results for a set of standard parallel applications. Finally, some concluding remarks are provided in Section 5.7.

5.2 BACKGROUND

5.2.1 Modeling Memory Energy

Unlike generic hardware modules, the energy consumption of memories is basically independent of the input activity. What matters, in fact, is whether we are reading or writing a value from or to the memory, regardless of the value. This property allows memory energy consumption to be modeled in a very abstract way, by explicitly exposing two independent variables affecting it: the cost of an access and the total number of accesses. This translates into the following formula:

$$e_{tot} = \sum_{i=1}^{c_{tot}} e_i \qquad (1)$$

where $c_{tot}$ is the total number of memory accesses, and $e_i$ is the cost of each access. For the sake of simplicity, we equally weigh all accesses (i.e., we do not distinguish the cost of a read from that of a write).

Equation 1 exposes the two quantities we can consider to reduce the energy consumption of a memory system and will be used throughout the paper as a reference. Techniques for reducing memory energy can thus be classified according to which variable is optimized [6].
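Equation 1 and its two optimization levers can be illustrated with a minimal sketch. The function names and the normalized numbers below are hypothetical and chosen only for illustration; they are not measurements from this chapter.

```python
def total_energy(access_costs):
    """Equation 1: sum of the per-access energy costs e_i over all accesses."""
    return sum(access_costs)


def total_energy_uniform(c_tot, e_access):
    """Special case of Equation 1 when every access has the same cost."""
    return c_tot * e_access


if __name__ == "__main__":
    # Hypothetical baseline: 1000 accesses, 1.0 energy unit each.
    baseline = total_energy_uniform(1000, 1.0)
    # Lever 1: reduce the number of accesses c_tot (same cost per access).
    fewer_accesses = total_energy_uniform(700, 1.0)
    # Lever 2: reduce the cost e_i of each access (same number of accesses).
    cheaper_accesses = total_energy_uniform(1000, 0.6)
    print(baseline, fewer_accesses, cheaper_accesses)
```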

5.2.2 Multi-Port Memories

A multi-port memory is simply a memory that allows multiple simultaneous accesses for reads and writes to any location in memory. Multi-port memories are typically employed as shared memories in multiprocessor designs, and are especially popular as dual-ended FIFO buffers for bus interfacing, or for video/graphics buffering.

Multiple simultaneous accesses are made possible by duplicating some of the resources required to access a cell: the address and data pins, the word-lines, and the bit-lines. Figure 5.1 shows the structure of a typical dual-port SRAM cell, and in particular the extra word-line (with the corresponding transistors) and extra bit-line.

Figure 5.1. Structure of a Dual-Port SRAM Cell.

In some devices, additional overhead is also required to handle the synchronization of multiple writes to the same cell; this is managed through a sort of hardware semaphore which serializes the concurrent accesses.

The increase in bandwidth provided by multi-port memories comes at the price of increased area, wiring resources and power consumption. Because of this considerable overhead, multi-port memories are usually limited to a few ports (often 2, and seldom more than 4). One notable exception is represented by register files (although they are not strictly SRAMs), which are typically highly multi-ported (even 16 or more ports) to provide very high bandwidth in superscalar processors.

Multi-port memories can also be characterized by the flexibility of the ports. In some memory devices, some of the ports can be specialized, i.e., they allow only some type of access (read or write). This fact can be expressed by writing the number of ports as $P = p_r + p_w + p_{rw}$, where the three terms denote the number of read, write, and read/write ports, respectively. In this work, without loss of generality, we will assume that $p_r = p_w = 0$ and $p_{rw} = P$, that is, all ports can be used for any type of access at any time.

When analyzing multi-port memories from the energy point of view, we must take into account the two following non-idealities, supported by data from several multi-port memory providers ([7],[8],[9]).

a) Energy consumption of multi-port memories does not scale linearly with the number of ports. For instance, the energy cost for accessing a dual-port memory is more than twice the energy required for accessing a single-port memory of the same size.

b) When a multi-port memory is used as a shared memory in a multiprocessor system, there are cases in which not all the ports are used simultaneously. It may in fact happen that the access pattern of the application does not allow a set of accesses (from the processors) to be grouped into a single, multi-port access. In these cases, we must consider the fact that energy consumption does not scale linearly with the number of ports that are accessed simultaneously. For instance, the energy cost for accessing a single port in a dual-port memory is larger than the one for accessing a single-port memory of the same size.

With reference to the model of Equation 1, the use of multi-port memories reduces $c_{tot}$, but it implies a sizable increase of the access cost $e_i$.
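The effect of the two non-idealities can be sketched numerically. All energy values below are hypothetical, normalized placeholders chosen only to satisfy the two inequalities stated in (a) and (b); they are not vendor data, and the function names are illustrative.

```python
# Hypothetical, normalized per-access energies (illustrative only).
E_SP = 1.0       # access to a single-port memory
E_DP_ONE = 1.4   # one port of a dual-port memory used (larger than E_SP, case b)
E_DP_BOTH = 2.3  # both ports used in one cycle (more than 2 * E_SP, case a)


def single_port(n_paired_cycles, n_single_cycles):
    """Cycles and energy when every access is serialized on one port."""
    accesses = 2 * n_paired_cycles + n_single_cycles
    return accesses, accesses * E_SP


def dual_port(n_paired_cycles, n_single_cycles):
    """Cycles and energy when paired accesses share one dual-port cycle."""
    cycles = n_paired_cycles + n_single_cycles
    energy = n_paired_cycles * E_DP_BOTH + n_single_cycles * E_DP_ONE
    return cycles, energy


if __name__ == "__main__":
    # 300 cycles in which both processors access memory, 400 with only one.
    print("single-port (cycles, energy):", single_port(300, 400))
    print("dual-port   (cycles, energy):", dual_port(300, 400))
```

With these placeholder values the dual-port memory needs fewer cycles (700 instead of 1000) but more energy (about 1250 instead of 1000 units), which is the trade-off described above.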

5.2.3 Application-Driven Memory Partitioning

Partitioning a memory block into multiple blocks, based on the memory access profile, was originally proposed by Benini et al. [10]. Their technique exploits the fact that, due to the high locality exhibited by embedded applications, the distribution of memory references is not uniform. As a consequence, some memory locations will be accessed more frequently than others. The partitioning is realized by splitting the address space (stored onto a single, monolithic memory block) into non-overlapping contiguous sub-spaces (stored onto several, smaller memory blocks).

Reduction of energy consumption is achieved because of two facts. First, each block is smaller than the monolithic one, and thus it has a smaller access cost ($e_i$). Second, and more relevant, only one of the blocks is active at a time. By properly partitioning the address space, it should be possible to access the smallest blocks most of the time, and access the largest ones only occasionally.

The original implementation of [10] employs a sophisticated recursive algorithm to determine the optimal partition with an arbitrary granularity. In this work, we will exploit their idea, yet without employing the same partitioning engine. As a matter of fact, in our case partitioning is driven by the access patterns of more than one processor.

Memory partitioning specifically targets the reduction of the access cost $e_i$, and it does not change $c_{tot}$, since it does not modify the access patterns.
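The idea can be sketched with a toy two-block partition. This is not the recursive algorithm of [10]: it is a minimal exhaustive search over a single split point, and the per-access cost model (square root of the block size) and the access profile are purely hypothetical assumptions. Decoder overhead is ignored.

```python
import math


def access_cost(num_words):
    """Hypothetical per-access energy model: grows with sqrt(block size)."""
    return math.sqrt(num_words)


def partition_energy(counts, split):
    """Energy when addresses [0, split) and [split, n) are stored in two
    separate blocks. split == 0 means a single monolithic block."""
    n = len(counts)
    if split <= 0 or split >= n:
        return sum(counts) * access_cost(n)
    low = sum(counts[:split]) * access_cost(split)
    high = sum(counts[split:]) * access_cost(n - split)
    return low + high


def best_split(counts):
    """Exhaustively pick the split point with the lowest total energy."""
    return min(range(len(counts)), key=lambda s: partition_energy(counts, s))


if __name__ == "__main__":
    # Hypothetical access profile: two hot addresses, six cold ones.
    profile = [900, 800, 50, 20, 10, 10, 5, 5]
    s = best_split(profile)
    print("best split:", s)
    print("monolithic energy: ", round(partition_energy(profile, 0), 1))
    print("partitioned energy:", round(partition_energy(profile, s), 1))
```

The number of accesses (the sum of the profile) is the same in both cases; only the cost of each access changes, which is the point made in the text.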

5.3 PARTITIONED SHARED MEMORY ARCHITECTURE

The target MPSoC architecture considered in this work is depicted in Figure 5.2. Each processor core has a cache and a private memory (PM) containing private data and code, which is accessed through a local bus. Processors are also connected, through a common global bus, to another memory (SM) containing the data that are shared between the various threads executing on the processors. We do not consider here other types of interconnections, such as point-to-point ones (i.e., crossbars).

Figure 5.2. Generic Architectural Template.

In this work, starting from the assumption that the shared memory is implemented as a conventional on-chip, single-port memory, we aim at improving the performance of the accesses to the shared memory, yet in a more energy-efficient way than resorting to a multi-port memory.

The proposed shared memory architecture combines the bandwidth advantages of multi-port memories (and thus the reduction of $c_{tot}$) with the advantages, in terms of energy consumption and access time, of partitioned, single-port memories (and thus the reduction of $e_i$).

In our scheme, the memory address space is mapped over single-port banks that can be simultaneously accessed by the different processors, so as to mimic the behavior of a multi-port memory for a large fraction of the execution time. Each bank covers a subset of the address space, with no replication of memory words; therefore, the address sub-spaces are non-overlapping. This point is essential to understand why the partitioned scheme can only approach the performance of the multi-port architecture. Since the memory blocks are single-ported and contain non-overlapping subsets of addresses, simultaneous accesses from the processors can be parallelized only if they fall into different memory blocks. Otherwise, the potentially parallel accesses must take place in two consecutive memory cycles.
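The bank-conflict rule described above can be sketched as a small cycle counter. This is a simplified illustration, not the chapter's evaluation flow: the trace format (one pair of optional addresses per time slot), the two-bank boundary, and all names are assumptions made only for this example.

```python
def bank_of(addr, boundary):
    """Two-bank mapping: addresses below `boundary` go to bank 0, others to bank 1."""
    return 0 if addr < boundary else 1


def memory_cycles(trace, mode, boundary=None):
    """Count shared-memory cycles for a trace of (addr_p1, addr_p2) slots.

    An address of None means that processor issues no access in that slot.
    mode is "single_port", "dual_port", or "partitioned".
    """
    cycles = 0
    for a1, a2 in trace:
        if a1 is None and a2 is None:
            continue                      # no shared-memory access at all
        if a1 is None or a2 is None:
            cycles += 1                   # a lone access always takes one cycle
        elif mode == "dual_port":
            cycles += 1                   # both ports used in the same cycle
        elif mode == "partitioned" and bank_of(a1, boundary) != bank_of(a2, boundary):
            cycles += 1                   # different banks: served in parallel
        else:
            cycles += 2                   # same bank or single port: serialized
    return cycles


if __name__ == "__main__":
    trace = [(0, 5), (1, 2), (None, 6), (3, 7), (4, None)]
    print("single-port:", memory_cycles(trace, "single_port"))
    print("dual-port:  ", memory_cycles(trace, "dual_port"))
    print("partitioned:", memory_cycles(trace, "partitioned", boundary=4))
```

In this toy trace the partitioned memory needs one cycle more than the dual-port memory, because the pair (1, 2) hits the same bank and must be serialized.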

Energy efficiency is enforced by two facts: first, the single-port blocks have an energy access cost which is far smaller than that of monolithic (either single or dual-port) memories; second, address mapping is application-driven, and thus uses the cell access frequency to determine the size of the memory blocks that is most suitable for energy minimization.

In the following, we will restrict our analysis to systems with two processors. Consequently, we will consider dual-port memories, and the partitioned architecture will also consist of two blocks at most. Although the concepts that will be discussed apply in principle to an arbitrary number of processors (with multi-port memories and multi-bank architectures), the quantitative analysis of energy and performance strictly refers to the case of two processors (with dual-port memory and two memory blocks).

Figure 5.3. Dual-Port (a) and Partitioned Single-Port (b) Architectures.

Figure 5.3 shows a conceptual architecture of the dual-port and the partitioned single-port schemes. Label $A_i$ refers to addresses from processor $i$, while $D_i$ refers to data to/from processor $i$. In the dual-port scheme (Figure 5.3-(a)), the existence of two read/write ports allows each processor to be bound to one port, realizing in fact a point-to-point interconnection.

In the partitioned architecture (Figure 5.3-(b)), addresses and data must be multiplexed (from processor to memory) or de-multiplexed (from memory to processor) properly, to connect the processor to the required memory block. This block diagram just shows the high-level flow of data and addresses; the actual implementation of the decoder is more complex, and will be discussed in the experimental section.

5.3.1 Related Work

The literature on energy optimization of embedded memories is quite rich (see [6] for a comprehensive survey); however, most techniques deal with the optimization of caches, scratch-pad memories, or off-chip memories, and multi-port memories are seldom addressed.

Most energy optimizations for multi-port memories are concerned with the issue of the mapping of data structures (typically, arrays) to multi-port memories, based on the access profiles of the applications. From these profiles, these techniques evaluate simultaneous array accesses (e.g., whether two or more arrays are accessed in the same cycle), and build a so-called compatibility graph, which expresses the potential parallelization of accesses. The various approaches then differ in how this graph is used to decide the optimal allocation of array accesses to memory ports [3, 11–13].

One technique closer to the one proposed in this work has been discussed by Lewis and Brackenbury [14]. Their approach is based on the typical access patterns of DSP applications, and splits highly multi-ported register files into multiple banks of predefined sizes.

5.4 PERFORMANCE AND ENERGY CHARACTERIZATION

In this section we will derive analytical expressions for the number of memory accesses and for the total energy consumption for the architectures of Figure 5.3, referred to the case of a system consisting of two processors (hereafter denoted with $P_1$ and $P_2$).

5.4.1 Performance Characterization

Let $c_1$ and $c_2$ be the number of memory accesses required by the execution of the application on processors $P_1$ and $P_2$, respectively. In the following, we will use the term memory cycle instead of memory access; we adopt this terminology in order to distinguish accesses to the shared memory that can occur in parallel. In fact, the total number of memory accesses by a processor is fixed (and determined by the memory access pattern of the application, which we do not modify); what actually changes is the time (in cycles) required to serve these accesses. Furthermore, we will denote sets with bold symbols, and their cardinalities with lowercase ones.

Our reference performance figure is the total number of memory cycles for the case where shared memory is implemented as a monolithic single-port memory. This value is cspm = c1 + c2.

5.4.1.1 Dual-Port Memory. When the shared memory is implemented by a monolithic dual-port memory, the total number of memory cycles will be smaller than cspm because of the possibility of simultaneous accesses. Only a fraction of the accesses, however, will occur simultaneously.

As Figure 5.4 shows, this fraction can be represented in terms of set notation. We denote with Cpar the set of memory cycles that can access memory simultaneously; Cpar consists of the union of two subsets, Cpar = Cpar,1 ∪ Cpar,2, where Cpar,1 ⊆ C1 and Cpar,2 ⊆ C2. These two subsets have the same cardinality (i.e., cpar,1 ≡ cpar,2) because each element of one set matches one element of the other set to make a parallel access.

Figure 5.4. Classification of Execution Cycles.

The number of cycles for the dual-port configuration is therefore:

$$c_{dpm} = (c_1 - c_{par,1}) + (c_2 - c_{par,2}) + \frac{c_{par}}{2} \qquad (2)$$

where cpar = cpar,1 + cpar,2 denotes the total number of parallel cycles. The division by two in the last term reflects the fact that parallel cycles are actually grouped in pairs, with each pair corresponding to a single memory access. Equation 2 simplifies to cdpm = c1 + c2 − cpar/2, exposing the fact that the magnitude of cpar directly translates into a performance improvement.

5.4.1.2 Partitioned Memory. In the case of partitioned memory, the two memory banks now host two non-overlapping subsets of the address space. This implies that only a subset of the cycles in Cpar can be parallelized; in particular, accesses that fall in the same subset of addresses now need to be serialized, since the two memory blocks are single-ported.

This further sub-setting of the cycles is depicted in Figure 5.5, using the same set notation as above. We can notice that C1 and C2 are now both split into two subsets, where Ci,j denotes the cycles of processor i that fall into block j.

Figure 5.5. Classification of Execution Cycles for the Partitioned Architecture.

This induces a partition onto Cpar, as follows. The shaded areas labeled A and D in Figure 5.5 denote parallel accesses that fall into different memory blocks: in region A (D), P1 accesses Block 1 (Block 2), and P2 accesses Block 2 (Block 1). Conversely, the regions labeled B and C denote accesses that fall in the same memory block (Block 2 for region B, and Block 1 for region C). Cycles belonging to regions B and C cause a performance penalty because, although they can potentially occur in parallel, they must be serialized (and thus require two memory accesses).

These subsets can be characterized by a quantity λ that denotes the fraction of the cycles in Cpar that fall in distinct memory blocks (and can thus be made parallel). λ will be used in the following as a compact metric to evaluate the cost of the partition. In fact, λ depends on how the partition has been made, that is, how many addresses fall in each block. Therefore, Cpar consists of λcpar cycles that can be parallelized, and (1 − λ)cpar cycles that require two separate accesses.
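As an illustration of how λ could be measured, the following minimal sketch counts, for a given boundary address, the potentially parallel access pairs that land in distinct blocks. The trace format (a list of address pairs, one pair per potentially parallel cycle) and the function names are assumptions made for this example only; they are not taken from the chapter.

```python
def block_of(addr, boundary):
    """Return 1 if addr lies in [0, boundary), else 2 (hypothetical helper)."""
    return 1 if addr < boundary else 2


def parallel_fraction(parallel_pairs, boundary):
    """Estimate lambda for one boundary address.

    parallel_pairs: list of (addr_p1, addr_p2) tuples, one per pair of
    accesses that could be issued in the same cycle (assumed trace format).
    Returns the fraction of pairs whose two addresses fall in distinct blocks.
    """
    if not parallel_pairs:
        return 0.0
    distinct = sum(
        1
        for a1, a2 in parallel_pairs
        if block_of(a1, boundary) != block_of(a2, boundary)
    )
    return distinct / len(parallel_pairs)


# Toy trace over a 16-word address space, split at address 8.
pairs = [(0, 9), (3, 4), (12, 1), (14, 15)]
print(parallel_fraction(pairs, 8))  # two of the four pairs hit distinct blocks
```

Since each pair contributes two cycles to Cpar, the fraction of pairs equals the fraction of cycles, so the value can be used directly as λ in the expressions of this section.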

The number of cycles of the partitioned-memory architecture, cspm,part, is therefore:

$$c_{spm,part} = (c_1 - c_{par,1}) + (c_2 - c_{par,2}) + \lambda\,\frac{c_{par}}{2} + (1 - \lambda)\, c_{par} \qquad (3)$$

The formula simplifies to cspm,part = c1 + c2 − λcpar/2, exposing the fact that cspm,part ≥ cdpm, since λ ≤ 1. Analyzing the dependency of cspm,part on λ, we notice that cspm,part (and thus the performance penalty of the partitioned scheme) is minimized when λ is maximized, as expected. In particular, when λ = 1, all accesses in Cpar are parallelized, and the partitioned scheme is equivalent to the dual-port memory, performance-wise. When λ = 0, all accesses in Cpar overlap on the same memory block, and the partitioned scheme is equivalent to the single-port memory architecture.
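The three cycle counts can be compared numerically with a short sketch. The function names and the sample values of c1, c2 and cpar are illustrative assumptions; only the formulas come from the text (the single-port reference and the simplified forms of Equations 2 and 3).

```python
def cycles_single_port(c1, c2):
    """Reference case: every access takes its own memory cycle."""
    return c1 + c2


def cycles_dual_port(c1, c2, c_par):
    """Simplified Equation 2: parallel cycles are served in pairs."""
    return c1 + c2 - c_par / 2


def cycles_partitioned(c1, c2, c_par, lam):
    """Simplified Equation 3: only a fraction lam of c_par is parallelized."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return c1 + c2 - lam * c_par / 2


# Hypothetical access counts, chosen only to exercise the formulas.
c1, c2, c_par = 1000, 800, 400
for lam in (0.0, 0.5, 1.0):
    print(lam, cycles_partitioned(c1, c2, c_par, lam))
```

With these numbers the partitioned count moves from the single-port value at λ = 0 to the dual-port value at λ = 1, which is the behavior described above.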

5.4.2 Energy Characterization

To compute energy, we stick to the high-level model of Equation 1; energy is thus simply obtained by multiplying each access by its cost.

5.4.2.1 Dual-Port Memory. In this case we have to consider two types of access costs, depending on whether one or both ports are accessed. Total energy is thus obtained by properly weighing the terms of Equation 2. In formula:

$$e_{dpm} = (c_1 - c_{par,1}) \cdot e_{dpm,1} + (c_2 - c_{par,2}) \cdot e_{dpm,1} + \frac{c_{par}}{2} \cdot e_{dpm,2} \qquad (4)$$

The term edpm,x denotes the energy per access to the memory, in which the term x = 1, 2 in the subscript denotes the number of ports used in the access.

5.4.2.2 Partitioned Memory. In the case of the partitioned memory, total energy cannot be conveniently expressed by a closed formula, for two reasons. First, the energy per access depends on the size of the memory block that is accessed; the sizes of the blocks, however, are precisely the variables of the partitioning problem we are trying to solve. Second, we have two single-port memories, and each memory access from either processor will fall into one of the two memory blocks. This implies that the energy per access can only be approximated by an “average” cost (i.e., the number of accesses to Block 1 weighted by its energy cost, plus the number of accesses to Block 2 weighted by its energy cost).

The accurate evaluation of energy for the partitioned architecture thus requires a simulation of the dynamic address trace of the two processors, and the application of Equation 1 on an access-by-access basis.

Nevertheless, we can derive an approximate expression of total energy that can be used for a rough comparison with Equation 4:

$$e_{spm,part} = (c_1 - c_{par,1}) \cdot e'_{spm} + (c_2 - c_{par,2}) \cdot e''_{spm} + (1 - \lambda)\, c_{par} \cdot e'''_{spm} + \lambda\,\frac{c_{par}}{2} \cdot (e_{spm1} + e_{spm2}) \qquad (5)$$

The first two terms (e′spm and e′′spm) are the above-mentioned average access costs and represent the non-parallel memory accesses. e′′′spm is the cost of accessing either Block 1 or Block 2 (depending on the subset of addresses), when accesses are potentially parallel but must be serialized. The last term represents the subset of potentially parallel accesses that will access Block 1 and Block 2 simultaneously (espm1 + espm2).

Although approximate, Equation 5 allows a rough comparison with the dual-port scheme. First, all energy costs in Equation 5 are smaller than edpm,2, and, in most cases (when the two blocks are of comparable size), also smaller than edpm,1. This implies that all four terms of Equation 5 are smaller than the corresponding ones in Equation 4, and energy is potentially smaller than in the dual-port memory case, regardless of the value of λ.

The actual dependency of espm,part on λ is not easily observable from Equation 5. A large value of λ increases the probability of accessing both blocks in the same cycle (this corresponds to the largest term, (espm1 + espm2)). Therefore, energy should in principle be reduced by choosing partitions that minimize λ. In this case, in fact, only one of the two blocks (each one smaller than the monolithic memory) will be accessed in each cycle, thus using less energy; a small value of λ, however, tends to increase the number of cycles, as already observed.
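The two energy expressions (Equations 4 and 5) can be evaluated side by side with a small sketch. All per-access energy values below are hypothetical placeholders in arbitrary units, and the parameter names (e_avg1, e_avg2, e_ser, e_blk1, e_blk2) are labels chosen for this example to stand for e′spm, e′′spm, e′′′spm, espm1 and espm2.

```python
def energy_dual_port(c1, c2, c_par1, c_par2, e_dpm1, e_dpm2):
    """Equation 4: one-port accesses cost e_dpm1, paired accesses cost e_dpm2."""
    c_par = c_par1 + c_par2
    return ((c1 - c_par1) * e_dpm1
            + (c2 - c_par2) * e_dpm1
            + c_par / 2 * e_dpm2)


def energy_partitioned(c1, c2, c_par1, c_par2, lam,
                       e_avg1, e_avg2, e_ser, e_blk1, e_blk2):
    """Equation 5 (approximate): average costs for non-parallel accesses,
    e_ser for serialized accesses, e_blk1 + e_blk2 for truly parallel pairs."""
    c_par = c_par1 + c_par2
    return ((c1 - c_par1) * e_avg1
            + (c2 - c_par2) * e_avg2
            + (1 - lam) * c_par * e_ser
            + lam * c_par / 2 * (e_blk1 + e_blk2))


# Hypothetical counts and energy costs (arbitrary units), for illustration only.
e_dp = energy_dual_port(1000, 800, 200, 200, e_dpm1=2.0, e_dpm2=3.0)
e_pm = energy_partitioned(1000, 800, 200, 200, lam=0.5,
                          e_avg1=1.0, e_avg2=1.0, e_ser=1.0,
                          e_blk1=1.0, e_blk2=1.0)
print(e_dp, e_pm)
```

As noted in the text, an accurate figure still requires replaying the address trace access by access; this sketch only reproduces the rough closed-form comparison.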

5.5 EXPLORATION FRAMEWORK

The models described in Section 5.4 show that there exists a tradeoff between energy and performance in partitioning the shared memory. Although we are searching for energy-efficient memory architectures, we cannot ignore performance implications; therefore, in order to search for the best energy/performance tradeoff, we use the energy/delay product (EDP) as a metric, and choose to minimize EDP during the space exploration.

Thanks to the simple models of Section 5.4, the optimization space is relatively small, since λ is the only parameter of the models. λ is a function of the access pattern of the application, but it also depends on how the address space is partitioned. Partitions can be characterized by the boundary address B that splits the address space [0, . . . , N − 1] into two sub-spaces [0, . . . , B − 1] and [B, . . . , N − 1]. Therefore, λ is also a function of B. As an example, Figure 5.6 shows the behavior of λ versus B for a parallel FFT kernel; we can observe that the curve is not monotonic, showing the sensitivity of λ to the access pattern.

Figure 5.6. Behavior of λ(B) vs. B.

These observations lead us to the following exploration procedure, for a shared memory of N words:

1 Compute epm(λ) and cpm(λ) as in Section 5.4;

2 For all possible values of B = 0, . . . , N − 1, compute EDPpm(λ) as epm · cpm. Viewed as a relation of λ, EDPpm(λ(B)) is not a function, since several values of B (and thus of EDP) may correspond to a given value of λ. An example of such a curve is shown in Figure 5.7, for the parallel FFT benchmark.

3 Compute the function EDP^pareto_pm(λ), obtained by selecting, for each value of λ, the smallest value of EDPpm(λ). EDP^pareto_pm contains the Pareto points of EDPpm(λ), and can possibly contain some discontinuities. Figure 5.8 shows the resulting curve for the FFT benchmark.

Figure 5.7. Behavior of EDP (λ(B)) vs. λ.

Figure 5.8. Pareto Points of EDP (λ(B)).

4 Compute the minimum EDPmin of this function, and let λmin be the corresponding value of λ;

5 On the λ vs. B plot, identify the corresponding value Bmin of B. In case of multiple values of B, choose the one that makes the partitions as equal (in size) as possible.
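The five steps above can be sketched as a short sweep over the boundary address. The three callables (lam_of, cycles_of, energy_of) are hypothetical stand-ins for the models of Section 5.4 evaluated at a given B; their names and signatures are assumptions of this example. In step 5 the sketch keeps only the boundaries that actually attain EDPmin before applying the balance rule, which is one possible reading of the procedure.

```python
def explore(n_words, lam_of, cycles_of, energy_of):
    """Return (edp_min, lam_min, b_min) for a shared memory of n_words.

    lam_of(b), cycles_of(b), energy_of(b) are caller-supplied models
    (hypothetical interfaces) evaluated at boundary address b.
    """
    # Steps 1-2: evaluate lambda and EDP for every boundary B = 0..N-1.
    points = [(lam_of(b), energy_of(b) * cycles_of(b), b)
              for b in range(n_words)]

    # Step 3: for each value of lambda keep the smallest EDP (Pareto points).
    pareto = {}
    for lam, edp, _ in points:
        if lam not in pareto or edp < pareto[lam]:
            pareto[lam] = edp

    # Step 4: global minimum of the Pareto curve.
    lam_min = min(pareto, key=pareto.get)
    edp_min = pareto[lam_min]

    # Step 5: among boundaries reaching that point, prefer balanced blocks.
    candidates = [b for lam, edp, b in points
                  if lam == lam_min and edp == edp_min]
    b_min = min(candidates, key=lambda b: abs(b - (n_words - b)))
    return edp_min, lam_min, b_min


# Toy models for an 8-word memory (values invented for illustration).
toy_lam = [0.0, 0.25, 0.5, 0.5, 0.5, 0.5, 0.25, 0.0]
result = explore(
    8,
    lam_of=lambda b: toy_lam[b],
    cycles_of=lambda b: 1800 - 200 * toy_lam[b],
    energy_of=lambda b: 10.0,
)
print(result)
```

In this toy case four boundaries share the best λ, and the tie is broken in favor of B = 4, which splits the address space into two equal halves.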

5.6 EXPERIMENTAL RESULTS

5.6.1 Experimental Setup

We have implemented our partitioned memory scheme in ABSS [15]. ABSS is an execution-driven architectural simulator for multiprocessor systems developed at Stanford University that extends the ideas implemented in the AUGMINT simulator. ABSS is based on the idea of augmentation, that is, the instrumentation of the assembly code with various hooks that allow context switches to the simulator; augmentation translates the program into a functionally equivalent program that runs on the simulated version of the processor.

The memory architecture provided by ABSS includes both private and shared memory. All the memories are connected through a single shared bus. Yet, ABSS does not provide any specific predefined cache or shared bus model; rather, it defines a specific interface to which user-defined cache and bus models can be easily hooked.

We have integrated Dinero [16] into ABSS, in order to provide accurate cache simulation data, and we have derived performance and energy models for the shared memory (both single- and dual-port) by interpolation of the results obtained from an industrial memory generator by ST Microelectronics. The target technology for all the models is 0.18 µm.

Concerning the benchmarks, we have used Stanford’s SPLASH suite [17], which includes a set of kernels and parallel applications widely used in the parallel computing community.

5.6.2 Energy/Performance Tradeoff Analysis

Table 5.1 shows energy-delay product (EDP) results for the above benchmarks, for the monolithic, single-port architecture (EDPmm) and the partitioned one (EDPpm), obtained using the exploration procedure of Section 5.5. The EDP reduction (Column ∆) ranges from 40.5% to 62.3% (50.2% on average). The exploration procedure also allows computing the best performance and energy points; these are summarized in Table 5.2, where performance improvements (number of cycles) and energy savings with respect to the monolithic, single-port architecture are reported (Columns Best Performance and Best Energy).

The comparison of Tables 5.1 and 5.2 shows that the EDP savings are due more to energy savings than to performance improvements. Minimum EDP points are in fact very close to minimum energy points for most of the benchmarks, while performance improvements are less significant. Notice also that only benchmarks that exhibit a sizable amount of parallel cycles (e.g., FFT, LU-Cont, Radix) show a sizable performance improvement. Conversely, energy does not seem to be that sensitive to the amount of parallel cycles.

Table 5.1. Energy-Delay Product Results.

Application   EDPmm      EDPpm      ∆ [%]

Barnes        24987.8    11357.5    54.6
FFT           6.4        3.7        41.2
FMM           853.4      389.6      54.4
LU            3931.3     2339.2     40.5
LU-CONT       3734.4     2073.1     44.5
Radix         59512.5    23180.5    61.0
Volrend       869794.2   453283.8   47.9
Water-N2      150460.7   56710.0    62.3
Water-S       10581.2    5770.5     45.5

Average                             50.2

Table 5.2. Optimal Performance and Energy Points.

Application   Best Performance [%]   Best Energy [%]

Barnes        1.5                    54.4
FFT           34.0                   37.2
FMM           2.1                    54.4
LU            10.9                   40.3
LU-CONT       19.8                   40.5
Radix         25.4                   60.9
Volrend       0.3                    50.3
Water-N2      13.9                   62.3
Water-S       8.7                    45.9

Average       13.0                   49.6

Figure 5.9 shows the energy savings of the partitioned architecture with respect to the dual-port case. Numbers refer to best-performance points, since we want to reduce the performance penalty as much as possible. The savings do not include the cost of the decoding logic. The partitioned architecture results in an average energy saving of 56% (maximum 70%). This energy saving is achieved at the cost of an increase in the total number of memory cycles of 2.4% on average (10.1% maximum).

5.6.3 Decoder Implementation

Figure 5.9. Energy Savings of the Partitioned Architecture vs. the Multi-Port One.

The partitioned architecture requires an ad-hoc decoder which implements the conceptual scheme of Figure 5.3. The decoder must provide two main functionalities. First, it must drive the selectors that decide to which block a given memory access is directed; to do this, it must contain the information about the boundary of the partition of the address space. Second, and more important, it must handle the connection between processors and memory blocks; this requires a sort of arbitration mechanism that serializes accesses that are potentially parallel, but fall in the same subset of addresses (i.e., memory block).

Figure 5.10 shows a more detailed block diagram of the decoder. It takes as inputs the addresses A1 and A2 from the two processors, the corresponding request signals Reqi, and the value B of the address corresponding to the partition boundary. It then generates the addresses to be sent to each memory block, AB1 and AB2, and the signals Granti used to allow the processors to access memory. The latter are both active except in the cases where potentially parallel accesses must be serialized.

The decoder contains two main blocks. The first block (RH, Request Handler) checks the two addresses A1 and A2, and generates the Busyi outputs as well as a signal that determines whether the accesses can be parallelized or not (S/NS). The other block (SEL) uses three inputs to decide which address to send to which memory block: the S/NS input, and the outputs A1i and A2i of two comparators (the boxes labeled with “=”) which determine in which block A1 and A2 fall, respectively. By using the value of B as an external input, it is possible to make the decoder application-independent, and therefore to have one single decoder for any application. We have implemented the decoder in VHDL, and synthesized it on a 0.18 µm technology library by ST Microelectronics, using Synopsys Design Compiler. When applying the memory access trace of the FFT benchmark, the energy dissipation of the decoder is 0.35 µJ, about 1.7% of total memory energy consumption (19.8 µJ).
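A behavioral sketch may help to visualize what the decoder does in one cycle. It is not the authors' VHDL design: the fixed priority given to P1 on a conflict, the use of a "less than B" comparison to select the block, and the forwarding of addresses unchanged (without subtracting B for Block 2) are all assumptions made for this illustration.

```python
def decode(a1, a2, req1, req2, boundary):
    """One-cycle behavioral model of the decoder (illustrative only).

    Returns (ab1, ab2, grant1, grant2): the address driven to Block 1 and
    Block 2 (None when the block is idle) and the two grant signals.
    """
    # Comparators: which block does each address fall in?
    blk1 = 1 if a1 < boundary else 2
    blk2 = 1 if a2 < boundary else 2

    # Request handler: a conflict arises only when both processors
    # request the same single-ported block in the same cycle.
    conflict = req1 and req2 and blk1 == blk2

    # Grants are active unless an access must be serialized.
    # Assumption: P1 always wins, P2 waits one cycle.
    grant1 = True
    grant2 = not conflict

    # Selector: route each granted request to its block.
    to_block = {1: None, 2: None}
    if req1:
        to_block[blk1] = a1
    if req2 and grant2:
        to_block[blk2] = a2
    return to_block[1], to_block[2], grant1, grant2


print(decode(3, 12, True, True, boundary=8))  # distinct blocks: both served
print(decode(3, 5, True, True, boundary=8))   # same block: P2 is stalled
```

A fairer arbiter (for example, alternating priority) could be substituted without changing the interface; the fixed-priority choice here only keeps the sketch short.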

Figure 5.10. Block Diagram of the Decoder.

Concerning delay, although the decoder is on the critical path (its delay adds up to the memory access time), this is not really an issue in the partitioned architecture. In fact, the memory cycle time in this case is smaller than that of the dual-port case, since we are accessing smaller memory blocks. Quantitatively, the partitioned architecture results in a slack equal to ddpm − max(dspm,1, dspm,2), where di denotes the access time of the corresponding memory block. The delay of the decoder obtained from synthesis is 310 ps, well within this slack.

5.7 CONCLUSIONS

We have proposed an energy-efficient alternative to multi-port memories suitable for the implementation of the shared memory of multi-processor systems-on-chip. The architecture is based on application-driven partitioning of the address space into multiple banks.

The target of the architecture is to achieve little or no performance penalty with respect to multi-port memories; therefore, we pursue maximum-performance partitioning solutions, corresponding to the case where the chance of parallelizing the accesses is maximized. The architecture can be enhanced so that zero performance penalty is achieved, thanks to the use of an extra memory buffer.

Experiments on a set of parallel benchmarks have shown average energy-delay product (EDP) reductions of 50% with respect to the baseline case of a single-port memory, and energy savings of 56% with respect to the case of a multi-port memory, with an average 2% performance penalty.

References

[1] L. Benini, G. De Micheli, “Networks on Chips: A New SoC Paradigm,” IEEE Computer, Vol. 35, No. 1, pp. 70–78, January 2002.

[2] P. Stenstrom, “A Survey of Cache Coherence Schemes for Multiprocessors,” IEEE Computer, Vol. 23, No. 6, pp. 12–24, June 1990.

[3] F. Catthoor, et al. Custom Memory Management Methodology Exploration for Memory Optimization for Embedded Multimedia System Design, Kluwer Academic Publishers, 1998.

[4] P. Panda, N. Dutt, Memory Issues in Embedded Systems-on-Chip Optimization and Exploration, Kluwer Academic Publishers, 1999.

[5] A. Macii, L. Benini, M. Poncino, Memory Design Techniques for Low-Energy Embedded Systems, Kluwer Academic Publishers, 2002.

[6] L. Benini, A. Macii, M. Poncino, “Energy-Aware Design of Embedded Memories: A Survey of Technologies, Architectures and Optimization Techniques,” ACM Transactions on Embedded Computing Systems, Vol. 2, No. 1, pp. 5–32, Feb. 2003.

[7] Cypress Semiconductor, http://www.cypress.com/products.

[8] Integrated Devices Technology, http://www.idt.com/products/multi port.html.

[9] Artisan Components, http://www.artisan.com/products/memory.html.

[10] L. Macchiarulo, A. Macii, L. Benini, M. Poncino, “Layout-Driven Memory Synthesis for Embedded Systems-on-Chip,” IEEE Transactions on Very Large Scale Integration (VLSI), Vol. 10, No. 2, pp. 96–105, April 2000.

[11] P.R. Panda, N.D. Dutt, “Behavioral Array Mapping into Multiport Memories Targeting Low-Power,” VLSI’97: International Conference on VLSI Design, pp. 268–272, Jan. 1997.

[12] P.R. Panda, L. Chitturi, “An Energy-Conscious Algorithm for Memory Port Allocation,” ICCAD’02: International Conference on Computer Aided Design, pp. 572–576, Nov. 2002.

[13] W.-T. Shiue, C. Chakrabarti, “Low-Power Multi-Module, Multi-Port Memory Design for Embedded Systems,” Journal of VLSI Signal Processing, pp. 167–178, Nov. 2001.

[14] M. Lewis, L. Brackenbury, “Exploiting Typical DSP Data Access Patterns and Asynchrony for a Low-Power Multi-ported Register Bank,” ASYNC’01: International Symposium on Asynchronous Circuits and Systems, pp. 4–14, March 2001.

[15] D. Sunada, D. Glasco, M. Flynn, ABSS v2.0: A SPARC Simulator, Technical Report CSL-TR-98-755, CSL, Stanford University, April 1998.

[16] M. D. Hill, J. Elder, DineroIV Trace-Driven Uniprocessor Cache Simulator, www.cs.wisc.edu/markhill/DineroIV, 1998.

[17] J. P. Singh, W.-D. Weber, A. Gupta, “SPLASH: Stanford Parallel Applications for Shared-Memory,” Computer Architecture News, Vol. 20, No. 1, pp. 5–44, March 1992.

Chapter 6

TUNING CACHES TO APPLICATIONS FOR

LOW-ENERGY EMBEDDED SYSTEMS

Ann Gordon-Ross1, Chuanjun Zhang1, Frank Vahid1,2, and Nikil Dutt

1 University of California, Riverside; 2 University of California, Irvine

Abstract The power consumed by the memory hierarchy of a microprocessor can

contribute to as much as 50% of the total microprocessor system power, and is

thus a good candidate for power and energy optimizations. We discuss four

methods for tuning a microprocessor’s cache subsystem to the needs of any

executing application for low-energy embedded systems. We introduce on-

chip hardware implementing an efficient cache tuning heuristic that can

automatically, transparently, and dynamically tune a configurable level-one

cache’s total size, associativity and line size to an executing application. We

extend the single-level cache tuning heuristic for a two-level cache using a

methodology applicable to both a simulation-based exploration environment

and a hardware-based system prototyping environment. We show that a victim

buffer can be very effective as a configurable parameter in a memory

hierarchy. We reduce static energy dissipation of on-chip data cache by

compressing the frequent values that widely exist in a data cache memory.

Keywords: Cache; configurable; architecture tuning; low power; low energy; embedded

systems; on-chip CAD; dynamic optimization; cache hierarchy; cache

exploration; cache optimization; victim buffer; frequent value.

6.1 INTRODUCTION

The power consumed by the memory hierarchy of a microprocessor can

contribute to 50% or more of total microprocessor system power [1]. Such a

large contributor to power is a good candidate for power and energy

optimization. The design of the caches in a memory hierarchy plays a major

role in the memory hierarchy’s power and performance.

Tuning cache design parameters to the needs of a particular application

or program region can save energy. Cache design parameters include: cache

size, meaning the total number of bytes of data storage; cache associativity,

meaning the number of tag and data ways simultaneously read per cache

access; cache line size, meaning the number of bytes in a block when

moving data between cache and the next memory level; and victim buffer

use, meaning a small fully-associative buffer storing recently-evicted cache

data lines. Every application has different cache requirements that cannot be

efficiently satisfied with one predetermined cache configuration. For

instance, different applications have vastly different spatial and temporal

locality and thus have different requirements [2] with respect to cache size,

cache line size, cache associativity, victim buffer configuration, etc. In

addition to tunable cache parameters, widely existing frequent values in data

caches for some applications can enable data encoding within the cache for

reduced power consumption. We define cache tuning as the task of

choosing the best configuration of cache design parameters for a particular

application, or for a particular phase of an application, such that

performance, power and/or energy are optimized.

New technologies enable cache tuning. Core-based processors allow a

designer to choose a particular cache configuration [3-7]. Some processor

designs allow caches to be configured during system reset or even during

runtime [2,8,9].

Manual tuning of the cache is hard. A single-level cache may have many

tens of different cache configurations, and interdependent multi-level caches

may have thousands of cache configurations. The configuration space gets

even larger if other dependent configurable architecture parameters are

considered, such as bus and processor parameters. Exhaustively searching

the space may be too slow even if fully automated. With possible average

energy savings of over 40% through tuning [2,10], we sought to develop

automated cache tuning methods.

In this chapter, we discuss four methods of cache tuning for energy

savings. We discuss an in-system method for automatically, transparently,

and dynamically tuning a level-one cache; an automatic tuning methodology

for two-level caches applicable to both a simulation-based exploration

environment and a hardware-based prototyping environment; a configurable

victim buffer; and a data cache that encodes frequent data values.

6.2 BACKGROUND – TUNABLE CACHE

PARAMETERS

Many methods exist for configuring a single level of cache to a particular

application during design time and in-system during runtime. Cache

configuration can be specified during design time for many commercial soft

cores from MIPS [6], ARM [5], and Arc [4] and for environments such as

Tensilica’s Xtensa processor generator [7] and Altera’s Nios embedded

processor system [3].

Configurable cache hardware also exists to assist in cache configuration.

Motorola’s M*CORE [9] processors offer way configuration which allows

the ways of a unified data/instruction cache to individually be specified as

either data or instruction ways. Additionally, ways may be shut down

entirely. Way shut-down is further explored by Albonesi [8] to reduce

dynamic power by an average of 40%. An adaptive cache line size

methodology is proposed by Veidenbaum et al. [11] to reduce memory traffic

by more than 50%.

Exhaustive search methods may be used to find optimal cache

configurations, but the time required for an exhaustive search is often

prohibitive. Several tools do exist for assisting designers in tuning a single

level of cache. Platune [12] is a framework for tuning configurable system-

on-a-chip (SOC) platforms. Platune offers many configurable parameters

beyond just cache parameters, and prunes the search space by isolating

interdependent parameters from independent parameters. The level one

cache parameters, being dependent, are explored exhaustively.

Heuristic methods exist to prune the search space of the configurable

cache. Palesi et al. [13] improve upon the exhaustive search used in Platune

by using a genetic algorithm to produce comparable results in less time.

Zhang et al. [14] present a cache configuration exploration methodology

wherein a cache exploration component searches configurations in order of

their impact on energy, and produces a list of Pareto-optimal points

representing reasonable tradeoffs in energy and performance. Ghosh et

al. [15] use an analytical model to efficiently explore cache size and

associativity and directly compute a cache configuration to meet the

designers’ performance constraints.

Few methods exist for tuning multiple levels of a cache hierarchy.

Balasubramonian et al. [10] propose a hardware-based cache configuration

management algorithm to improve memory hierarchy performance while

considering energy consumption. An average reduction in memory hierarchy

energy of 43% can be achieved with a configurable level two and level three

cache hierarchy coupled with a conventional level one cache.

6.3 A SELF-TUNING LEVEL ONE CACHE

ARCHITECTURE

Tuning a cache to a particular application can be a cumbersome task left

for designers even with the advent of recent computer-aided design (CAD)

tuning aids. Large configuration spaces may take a designer weeks or

months to explore and with a small time-to-market, lengthy tuning iterations

may not be feasible. We propose to move the CAD environment on-chip,

eliminating designer effort for cache tuning. We introduce on-chip hardware

implementing an efficient heuristic that automatically, transparently, and

dynamically tunes the cache to the executing program to reduce energy [16].

6.3.1 Configurable Cache Architecture

The on-chip hardware tunes four cache parameters in the level-one cache:

cache line size (64, 32, or 16 bytes), cache size (8, 4, or 2 Kbytes),

associativity (4, 2, or 1-way), and cache way prediction (on or off). Way

prediction is a method for reducing set-associative cache energy, in which

one way is initially accessed, and other ways accessed only upon a miss.

Figure 6-1. Self-tuning cache architecture (block labels: microprocessor, off-chip memory)

The exploration space is quite large, necessitating an efficient exploration

heuristic implemented with specialized tuning hardware, as illustrated in

Figure 6-1. The tuning phase may be activated during a special software-

selected tuning mode, during startup of a task, whenever a program phase

change is detected, or at fixed time intervals. The choice of approach is

orthogonal to the design of the self-tuning architecture itself.

The cache architecture supports a certain range of configurations [2]. The

base level-one cache of 8 Kbytes consists of four banks that can operate as

four ways. A special configuration register allows the ways to be

concatenated to form either a direct-mapped or 2-way set associative 8

Kbyte cache. The configuration register may also be configured to shut

down ways, resulting in a 4 Kbyte direct-mapped or 2-way set associative

cache or a 2 Kbyte direct-mapped cache. Specifically, due to the bank layout

for way shut down, 2 Kbyte 2- or 4-way set associative and 4 Kbyte 4-way

set associative caches are not possible using the configurable cache

hardware.
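The set of supported configurations can be enumerated mechanically. The sketch below encodes the restrictions listed above with the rule "associativity cannot exceed the number of active 2 Kbyte banks", which is a compact restatement chosen for this example rather than a rule stated by the authors; the tuple layout and names are likewise illustrative.

```python
from itertools import product

SIZES_KB = (2, 4, 8)
WAYS = (1, 2, 4)
LINE_BYTES = (16, 32, 64)


def is_supported(size_kb, ways):
    """True if the size/associativity pair is reachable with way shut-down.

    Assumed rule: each active bank holds 2 Kbytes, so a cache of size_kb
    Kbytes has size_kb // 2 banks and cannot have more ways than banks.
    """
    return ways <= size_kb // 2


def all_configurations():
    """List (size_kb, ways, line_bytes, way_prediction) tuples."""
    configs = []
    for size_kb, ways, line in product(SIZES_KB, WAYS, LINE_BYTES):
        if not is_supported(size_kb, ways):
            continue
        configs.append((size_kb, ways, line, False))
        if ways > 1:
            # Way prediction is only meaningful for set-associative caches.
            configs.append((size_kb, ways, line, True))
    return configs


print(len(all_configurations()))
```

Under these assumptions the enumeration contains six size/associativity pairs, three line sizes each, plus the way-prediction variants of the set-associative ones, for 27 configurations in total.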

6.3.2 Heuristic Development Through Analysis

A naïve tuning approach would simply try all possible combinations of

configurable parameters in an arbitrary order. For each configuration, the

miss rate can be measured and used to estimate the energy consumption of

the particular cache configuration. After all configurations are executed, the

approach would simply choose the configuration with the lowest energy

consumption. However, such an exhaustive method may involve the

inspection of too many configurations. Therefore, we wish to develop a

cache tuning heuristic that minimizes the number of configurations explored.

When developing a good heuristic, the parameter (cache size, line size,

associativity, or way prediction) with the largest impact on performance and

energy would likely be the best parameter to search first. We analyzed each

parameter to determine the parameter’s impact on miss rate and energy by

fixing three parameters and varying the fourth.

We observed that varying the cache size had the largest average impact

on energy and miss rate – changing the cache size can impact the energy by

a factor of two or more. From our analysis, we developed a search heuristic

that first determines the best cache size, then the best line size, then

the best associativity, and finally, if the best associativity is greater than one,

our heuristic determines whether to use way prediction or not.

6.3.3 Search Heuristic

The heuristic developed based on the importance of parameters is

summarized below:

1. Begin with a 2 Kbyte, direct-mapped cache with a 16 byte line size.

Increase the cache size to 4 Kbytes. If the increase in cache size causes a

decrease in energy consumption, increase the cache size to 8 Kbytes.

Choose the cache size with the best energy consumption.

2. For the best cache size determined in step 1, increase the line size from

16 bytes to 32 bytes. If the increase in line size causes a decrease in

energy consumption, increase the line size to 64 bytes. Choose the line

size with the best energy consumption.

3. For the best cache size determined in step 1 and the best line size

determined in step 2, increase the associativity to 2 ways. If the increase

in associativity causes a decrease in energy consumption, increase the

associativity to 4 ways. Choose the associativity with the best energy

consumption.

4. If step 3 determined the best associativity to be greater than 1,

determine if enabling way prediction results in energy savings.
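The four steps above can be sketched as a greedy search in software. The `energy` callback stands for whatever measurement or estimate is available for a configuration and is a hypothetical interface; the configuration tuple, the helper names, and the rule limiting associativity to the number of active 2 Kbyte banks are assumptions of this example. The sketch is not the authors' tuner hardware and is not claimed to reproduce the counts reported in Table 6-1.

```python
SIZES_KB = (2, 4, 8)
LINE_BYTES = (16, 32, 64)
WAYS = (1, 2, 4)


def tune(energy):
    """Greedy search; energy(size_kb, ways, line, way_pred) -> number.

    Returns (best_config, best_energy, configurations_examined).
    """
    examined = 0

    def measure(cfg):
        nonlocal examined
        examined += 1
        return energy(*cfg)

    def climb(cfg, best, index, values):
        # Keep increasing one parameter while energy keeps decreasing.
        for value in values:
            cand = cfg[:index] + (value,) + cfg[index + 1:]
            e = measure(cand)
            if e >= best:
                break
            cfg, best = cand, e
        return cfg, best

    cfg = (2, 1, 16, False)                 # step 1 starting point
    best = measure(cfg)
    cfg, best = climb(cfg, best, 0, SIZES_KB[1:])       # step 1: size
    cfg, best = climb(cfg, best, 2, LINE_BYTES[1:])     # step 2: line size
    # Step 3: only associativities the chosen size supports (assumed rule).
    ways = [w for w in WAYS[1:] if w <= cfg[0] // 2]
    cfg, best = climb(cfg, best, 1, ways)
    if cfg[1] > 1:                                      # step 4
        cand = cfg[:3] + (True,)
        e = measure(cand)
        if e < best:
            cfg, best = cand, e
    return cfg, best, examined


def toy_energy(size_kb, ways, line, way_pred):
    """Invented cost surface with its minimum at 4K, direct-mapped, 32B."""
    return (abs(size_kb - 4) * 10 + abs(line - 32)
            + (ways - 1) * 5 - (1 if way_pred else 0))


print(tune(toy_energy))
```

On this invented cost surface the search settles on a 4 Kbyte, direct-mapped, 32-byte-line cache after examining six configurations, far fewer than an exhaustive sweep would require.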

The cache tuning heuristic can be implemented in either software or

hardware. In a software-based approach, the system processor would execute

the search heuristic. Executing the heuristic on the system processor would

not only change the runtime behavior of the application but also affect the

cache behavior, possibly resulting in the search heuristic choosing a non-

optimal cache configuration. Therefore, we prefer a hardware-based

approach that does not significantly impact overall area or power.

6.3.4 Experiments and Results

We simulated numerous Powerstone [9] and MediaBench [18]

benchmarks using SimpleScalar [19], a cycle-accurate simulator that

includes a MIPS-like microprocessor model, to obtain the number of cache

accesses and cache misses for each benchmark and configuration explored.

For power dissipation, we considered both static power dissipation due to

leakage current and dynamic power dissipation due to logic switching

current and the charging and discharging of the load capacitance. We obtain

the energy of a cache hit from our own CMOS 0.18 µm layout of our

configurable cache (we found our energy values correspond closely with

CACTI values). We obtain the off-chip memory access energy from a

standard Samsung memory, and the stall energy from a 0.18 µm MIPS

microprocessor. Furthermore, we obtained the power consumed by our cache

tuner, through simulation of a synthesized version of our cache tuner written

in VHDL.

Table 6-1. Results of search heuristic. Ben. is the benchmark considered, cfg. is the cache

configuration selected, No. is the number of configurations examined by our heuristic, and

E% is the energy savings of both the I-cache and D-cache.

Ben.      I-cache cfg  No.  D-cache cfg  No.  I-cache E%  D-cache E%

padpcm    8K_1W_64B    7    8K_1W_32B    7    23%         77%
crc       2K_1W_32B    4    4K_1W_64B    6    70%         30%
auto      8K_2W_16B    7    4K_1W_32B    6    3%          97%
bcnt      2K_1W_32B    4    2K_1W_64B    4    70%         30%
bilv      4K_1W_64B    6    2K_1W_64B    4    64%         36%
binary    2K_1W_32B    4    2K_1W_64B    4    54%         46%
blit      2K_1W_32B    4    8K_2W_32B    8    60%         40%
brev      4K_1W_32B    6    2K_1W_64B    4    63%         37%
g3fax     4K_1W_32B    6    4K_1W_16B    5    60%         40%
fir       4K_1W_32B    6    2K_1W_64B    4    29%         71%
jpeg      8K_4W_32B    8    4K_2W_32B    7    6%          94%
pjpeg     4K_1W_32B    6    4K_1W_16B    5    51%         49%
  optimal                   4K_2W_64B
ucbqsort  4K_1W_16B    6    4K_1W_64B    6    63%         37%
tv        8K_1W_16B    7    8K_2W_16B    7    37%         63%
adpcm     2K_1W_16B    5    4K_1W_16B    5    64%         36%
epic      2K_1W_64B    5    8K_1W_16B    6    39%         61%
g721      8K_4W_16B    8    2K_1W_16B    3    15%         85%
pegwit    4K_1W_16B    5    4K_1W_16B    5    37%         63%
mpeg2     4K_1W_32B    6    4K_2W_16B    6    40%         60%
  optimal                   8K_2W_16B

Average                5.8               5.4  45%         55%

Table 6-1 shows the results of our search heuristic, for instruction and

data cache configurations. Our search heuristic is quite effective: it searches

on average only 5.8 configurations, compared to 27 configurations for an

exhaustive approach. Furthermore, our heuristic finds the optimal

configuration in nearly all cases. For the two data cache configurations

where the heuristic does not find the optimal, pjpeg and mpeg2, the

configuration found is only 5% and 12% worse than the optimal,

respectively. On average, the dynamic self-tuning cache can reduce

memory-access energy by 45% to 55%. Additionally, we observed that way prediction

is only beneficial for instruction caches and that only a 4-way set associative

instruction cache has lower energy consumption when way prediction is

used. However, for the benchmarks we examined, the cache configurations

with the lowest energy dissipation were mostly direct mapped caches where

way prediction is not applicable.

To determine the area and power overhead of our cache tuner, we

designed the cache tuner hardware using VHDL and synthesized the tuner

using Synopsys Design Compiler. The total tuner size was about 4,000 gates,

or 0.039 mm² in 0.18 µm CMOS technology. Compared to the reported size

of the MIPS 4Kp with caches [20], this represents an increase in area of just

over 3%. The power consumption of the cache tuner is 2.69 mW at 200

MHz, which is only 0.5% of the power consumed by a MIPS processor.

Furthermore, we only use the tuning hardware during the tuning stage; the

tuner can be shut down after the best configuration is determined, thereby

minimizing the effects of additional static power dissipation due to the tuner.

6.4 AUTOMATIC TUNING OF A TWO-LEVEL

CACHE ARCHITECTURE – THE TCAT

In the previous section, we described an automatic method for tuning a

single level of cache in a system at run time. We extend the single-level

cache tuner to tune two-level caches to embedded applications for reduced

energy consumption [21]. This method is applicable to both a simulation-

based exploration environment and a hardware-based prototyping

environment. We present the two-level cache tuner, or TCaT – a heuristic for

searching the huge solution space of possible configurations. The heuristic

interlaces the exploration of the two cache levels and searches the various

cache parameters in a specific order based on their impact on energy.

6.4.1 Configurable Cache Architecture

The configurable caches in each of the two cache levels explored here are

based on the configurable cache architecture described for a single level

configurable cache in Section 6.3.1. The target architecture for our two-level

cache tuning heuristic contains separate level one instruction and data caches

and separate level two instruction and data caches. For the first level cache,

we explore the same search space as the single level cache tuner: cache line

size (64, 32, or 16 bytes), cache size (8, 4, or 2 Kbytes), and associativity (4,

2, or 1-way). For the second level of cache, we expand the cache size to a

possible 64, 32, or 16 Kbytes while the line size and associativity parameters

are the same. We do not explore way prediction with the TCaT.

An exhaustive exploration of all cache configurations for a two level

cache hierarchy is too costly. For a single level separate instruction and data

cache design, an exhaustive exploration would explore a total of 28 different

cache configurations. However, the addition of a second level of hierarchy

raises the number of cache configurations to 432.

Nevertheless, for comparison purposes, we determined the optimal cache

configuration for each benchmark by generating exhaustive data. It took over

one month of continual simulation time on an UltraSparc compute server to

generate the data for our nine benchmarks.

In addition, we have chosen a base cache hierarchy configuration

consisting of an 8 Kbyte, 4-way set associative level-one cache with a 32

byte line size, and a 64 Kbyte 4-way set associative level two cache with a

64 byte line size – a reasonably common configuration.

6.4.2 Initial Two-Level Cache Tuning Heuristic – Search Each Level

Independently

Initially, we extended the heuristic described in Section 6.3.3 for a two-

level cache by tuning the level-one cache while holding the level-two cache

at the smallest size, then tuning the level-two cache using the same heuristic.

We applied the initial heuristic to the benchmarks and found that this

heuristic did not perform well for two levels (the original heuristic was

intended for only one level, where it works well). The cache configuration

determined by our initial heuristic consumed, on average over all

benchmarks, 1.41 times more energy than the optimal configuration. In the

worst case, our initial heuristic found a cache configuration using 2.7 times

more energy than the optimal configuration. In one benchmark, the initial

heuristic found a cache configuration that was worse than the base cache.

The naïve assumption that the two levels of cache could be configured

independently was the reason that our initial heuristic did not perform well

for a two level system. In a two-level cache hierarchy, the behavior of each

cache level directly affects the behavior of the other level. For example, the

miss rate of the level one cache does not solely determine the performance of

the level two cache. The performance of the level two cache is also

determined by what values are missing in the level one cache. To fully

explore the dependencies between the two levels, we decided to explore both

levels simultaneously.

6.4.3 The Two-Level Cache Tuner - TCaT

To more fully explore the dependencies between the two cache levels, we

expanded our initial heuristic to interlace the exploration of the level one and

level two caches. Instead of entirely configuring the level one cache before

configuring the level two cache, the interlaced heuristic explores one

parameter for both levels of cache before exploring the next parameter,

while adhering to the parameter ordering of the initial heuristic. The basic

intuition behind our heuristic is that interlacing the exploration allows for

better modeling and tuning of the interdependencies between the different

levels of cache hierarchy. We applied the interlaced heuristic to the

benchmarks and found that the interlaced heuristic performed much better

than the initial heuristic, but there was still much room for improvement.

We examined the cases where the interlaced heuristic did not yield the

optimal solution. We discovered that in these cases, the optimal was not

being reached for two reasons. First, the initial heuristic did not fully explore

each parameter. For instance, if an increase from a 2 Kbyte to 4 Kbyte cache

size did not yield an improvement in energy, an 8 Kbyte cache size was not

examined. The second reason the optimal configuration was not being found

was not due to a failure in the heuristic, but rather due to the limitations set

on certain cache configurations by the configurable cache itself. For

example, in the level two cache, if a 16 Kbyte cache is chosen as the best

size, the only associativity available is a direct-mapped cache. With no

energy improvement by increasing the cache from a 16 Kbyte direct-mapped

to a 32 Kbyte direct-mapped cache, no other associativities are searched by

the previous heuristics. To allow for all associativities to be searched, we

added a final adjustment to the associativity search step of the interlaced

heuristic with full parameter exploration. The final adjustment allows the

cache size to be increased for both the level one and level two caches in

order to search larger associativities. We refer to this final heuristic as the

two-level cache tuner - the TCaT.
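The interlaced ordering with full parameter exploration can be sketched as below. This is only an illustration under stated assumptions: the `energy` callable and the toy energy function are hypothetical, and the sketch deliberately omits the size-dependent limits on associativity and the final associativity adjustment, so it is not the complete TCaT.

```python
from typing import Callable, Dict, List, Tuple

# Parameters in interlaced order: each parameter is explored for level one,
# then for level two, before the next parameter is considered.
SEARCH_ORDER: List[Tuple[str, List[int]]] = [
    ("l1_size_kb", [2, 4, 8]),
    ("l2_size_kb", [16, 32, 64]),
    ("l1_line_b", [16, 32, 64]),
    ("l2_line_b", [16, 32, 64]),
    ("l1_ways", [1, 2, 4]),
    ("l2_ways", [1, 2, 4]),
]


def interlaced_search(energy: Callable[[Dict[str, int]], float]) -> Tuple[Dict[str, int], int]:
    """Return the chosen hierarchy and the number of distinct configurations tried."""
    cfg = {name: values[0] for name, values in SEARCH_ORDER}  # smallest of everything
    tried = set()

    def measure(candidate: Dict[str, int]) -> float:
        tried.add(tuple(sorted(candidate.items())))
        return energy(candidate)

    for name, values in SEARCH_ORDER:
        # Full exploration: every value is tried, even if an earlier step got worse.
        cfg[name] = min(values, key=lambda v: measure({**cfg, name: v}))
    return cfg, len(tried)


# Hypothetical target used only to build a toy energy function for the example.
TOY_TARGET = {"l1_size_kb": 4, "l2_size_kb": 64, "l1_line_b": 32,
              "l2_line_b": 16, "l1_ways": 1, "l2_ways": 2}


def toy_energy(cfg: Dict[str, int]) -> float:
    return float(sum(abs(cfg[k] - TOY_TARGET[k]) for k in TOY_TARGET))


if __name__ == "__main__":
    print(interlaced_search(toy_energy))
```

Because the toy energy function is separable, the sketch recovers the toy target exactly; real cache energy is not separable across levels, which is why the ordering of the exploration matters.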

The experimental setup and energy calculations are the same as those

described in Section 6.3.4. We explored nine different benchmarks obtained

from the MediaBench [18] and EEMBC [22] benchmark suites.

Figure 6-2. Energy consumption for the initial heuristic cache configuration, the TCaT cache

configuration, and the optimal cache configuration, normalized to the base cache

configuration for each benchmark.

Figure 6-2 shows the results for the initial heuristic and the TCaT for

each benchmark. The energy consumptions have been normalized to the base

cache configuration for each benchmark’s cache hierarchy. The results show

that the TCaT finds the optimal cache configuration in most cases.

Compared to the base cache configuration and averaged over all

benchmarks, the initial heuristic achieves an average energy savings of 32%

while the TCaT achieves an average energy savings of 53%. Additionally,

we found that for every benchmark, there is no loss of performance due to

cache configuration for optimal energy consumption. In fact, the benchmarks

receive an average of a 28% speedup, which we found was due to the tuning

of the cache line size.

Furthermore, the TCaT reduces the configuration search space

significantly. The exhaustive approach for separate instruction and data

caches for a two level cache hierarchy explores 432 cache configurations.

The improved heuristic explores only 28 cache configurations, or only 6.5%

of the search space. This reduction in the search space speeds up both a

simulation approach and a hardware-based prototyping platform approach.

6.5 USING A VICTIM BUFFER IN AN

APPLICATION SPECIFIC MEMORY

HIERARCHY

In addition to tuning cache parameters such as cache size, line size, and

associativity, the cache subsystem can include a configurable victim buffer

which can be beneficial in systems with a direct-mapped cache. Direct-

mapped caches are popular in embedded microprocessor architecture due to

their simplicity and good hit rates for many applications. A victim buffer is a

small fully-associative cache, whose size is typically 4 to 16 cache lines,

residing between a direct-mapped L1 cache and the next level of memory.

The victim buffer holds lines discarded after an L1 cache miss. The victim

buffer is checked whenever there is an L1 cache miss, before going to the

next level memory. If the desired data is found in the victim buffer, the data

in the victim buffer is swapped back to the L1 cache. Jouppi [23] reported

that a four-entry victim buffer could reduce 20% to 95% of the conflict

misses in a 4 Kbyte direct-mapped data cache. Albera and Bahar [24]

evaluated the power and performance advantages of a victim buffer in a high

performance superscalar, speculative, out-of-order processor. They showed

that adding a victim buffer to an 8 Kbyte direct-mapped data cache results in

10% energy savings and 3.5% performance improvements on average for the

Spec95 benchmark suite.
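The lookup and swap behavior described above can be illustrated with a small functional model. This is a sketch under assumptions: the class name, the FIFO replacement policy of the victim buffer, and the example addresses are choices made for this illustration and are not taken from the cited designs.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Functional model of a direct-mapped L1 cache backed by a small victim buffer."""

    def __init__(self, num_sets: int, line_bytes: int, victim_entries: int) -> None:
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.victim_entries = victim_entries
        self.l1 = [None] * num_sets          # line address held by each set
        self.victim = OrderedDict()          # fully associative, oldest entry first
        self.l1_hits = 0
        self.victim_hits = 0
        self.misses = 0                      # accesses that go to the next level

    def access(self, addr: int) -> str:
        line = addr // self.line_bytes
        index = line % self.num_sets
        if self.l1[index] == line:
            self.l1_hits += 1
            return "l1"
        evicted = self.l1[index]
        # The victim buffer is checked only after an L1 miss.
        if line in self.victim:
            del self.victim[line]            # line is swapped back into L1
            self.victim_hits += 1
            result = "victim"
        else:
            self.misses += 1
            result = "miss"
        self.l1[index] = line
        # The line discarded from L1 moves into the victim buffer.
        if evicted is not None and self.victim_entries > 0:
            self.victim[evicted] = True
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)  # assumed FIFO replacement
        return result


if __name__ == "__main__":
    # 8 Kbyte direct-mapped cache with 16-byte lines: addresses 0 and 8192 conflict.
    cache = DirectMappedWithVictim(num_sets=512, line_bytes=16, victim_entries=4)
    for address in [0, 8192] * 4:
        cache.access(address)
    print(cache.l1_hits, cache.victim_hits, cache.misses)
```

With two addresses that map to the same set and are accessed alternately, every access after the first two is served by the victim buffer; with `victim_entries=0` the same sequence misses every time.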

A victim buffer improves the performance and energy of a direct-mapped

cache on average, but for some applications, a victim buffer actually

degrades performance without much or any energy savings, as we will show

later. Such degradation occurs when the victim buffer hit rate is low.

Checking a victim buffer requires an extra cycle after an L1 miss. If the

victim buffer hit rate is high, that extra cycle actually prevents dozens of

cycles for accessing the next level memory. But if the buffer hit rate is low,

that extra cycle does not save much and thus is wasteful. Whether a victim

buffer’s hit rate is high or low is dependent on what application is running.

Such performance overhead may be one reason that victim buffers are not

always included in embedded processor cache architectures.
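The performance trade-off in this paragraph can be made concrete with a simple model. The symbols are assumptions introduced here: $P$ is the penalty in cycles for accessing the next level of memory, $h$ is the victim buffer hit rate measured over L1 misses, and a victim buffer hit is assumed to cost only the one extra cycle.

```latex
\begin{align*}
  T_{\text{miss}}^{\text{no buffer}} &= P \\
  T_{\text{miss}}^{\text{buffer}}    &= 1 + (1 - h)\,P \\
  T_{\text{miss}}^{\text{buffer}} < T_{\text{miss}}^{\text{no buffer}}
    &\iff 1 + (1 - h)\,P < P
     \iff h > \frac{1}{P}
\end{align*}
```

Under this model, with a next-level penalty of $P = 20$ cycles the buffer improves average miss time only if it catches more than 5% of L1 misses. The model covers execution time only; the energy break-even point also depends on the energy of each buffer lookup and of each next-level access.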

In this section, we will show that treating the victim buffer as a

configurable memory parameter to a direct-mapped cache is superior to

either using a direct-mapped cache without a victim buffer or using a direct-

mapped cache with an always-on victim buffer [25]. Furthermore, we show

that a victim buffer parameter is even useful with a cache that itself is highly

parameterized.

6.5.1 Victim Buffer as a Cache Parameter

We consider adding a victim buffer to both core-based and pre-fabricated

platform based design situations.

A core-based approach involves incorporating a processor (core) into a

chip before the chip has been fabricated, either using a synthesizable core

(soft core) or a layout (hard core). In either case, most core vendors allow a

designer to configure the level 1 cache’s total size (typical sizes range from

no cache to 64 Kbyte), associativity (ranging from direct mapped to 4 or 8

ways), and sometimes line size (ranging from 16 bytes to 64 bytes). Other

parameters include use of write through, write back, and write allocate

policies for writing to a cache, as well as the size of a write buffer. Adding a

victim buffer to a core-based approach is straightforward, involving simply

including or not including a buffer into the design.

A pre-fabricated platform is a chip that has already been designed, but is

intended for use in a variety of possible applications. To perform efficiently

for the largest variety of applications, recent platforms come with

parameterized architectures that a designer can configure for his/her

particular set of applications. Recent architectures include cache parameters

[2,8,9] that can be configured by setting a few configuration register bits. We

therefore developed a configurable victim buffer that could be turned on or

off by setting bits in a configuration register.

The experimental setup and energy calculations are the same as those

described in Section 6.3.4. The benchmarks examined include programs

from the Powerstone [9], MediaBench [18], and Spec2000 [26] benchmark

suites.

6.5.2.1 Victim Buffer with a Direct-Mapped Cache

Figure 6-3 shows the performance and energy improvements when

adding an always-on victim buffer to a direct-mapped cache. Performance is

the program execution time. Energy is estimated as described in Section

6.3.4. 0% represents the performance and energy consumption of an 8 Kbyte

direct-mapped cache. From Figure 6-3, we see that a victim buffer improves

both performance and energy for some benchmarks, like mpeg, epic, and

adpcm. For other benchmarks, energy is not improved but performance is

degraded, as for vpr, fir, and padpcm. A victim buffer should be excluded or

turned off for these benchmarks. Some benchmarks, like jpeg, parser, and

auto2, yield some energy savings at the expense of some performance

degradation using a victim buffer – a designer might choose whether to

include/exclude or turn on/off the buffer in these cases depending on

whether energy or performance is more important.

Figure 6-3. Performance and energy improvements when adding a victim buffer to an 8 Kbyte

direct-mapped cache. Positive values mean the victim buffer improved performance or

energy, with 0% representing an 8 Kbyte direct-mapped cache without a victim buffer.

Benchmarks with both bars positive should turn on the victim buffer, while those with

negative performance improvement and little or no energy improvement should turn off the

victim buffer.

6.5.2.2 Victim Buffer with a Parameterized Cache

Figure 6-4 shows the performance and energy improvement of adding a

victim buffer to a parameterized cache having the same configurability

described by Zhang et al. [2]. 0% represents the performance and energy of

the original configurable cache when tuned optimally to a particular

application. The bars represent the performance and energy of the

configurable cache when optimally tuned to an application assuming a

victim buffer exists and is always on. The optimal cache configurations for a

given benchmark are usually different for each of the two cases (no victim

buffer versus always-on victim buffer).

We see that, even though the configurable cache already represents

significant energy savings compared to either a 4-way or direct-mapped

cache [2], a victim buffer extends the savings of a configurable cache by a

large amount for many examples. For example, a victim buffer yields an

additional 32%, 43%, and 23% energy savings for benchmarks adpcm, epic,

and mpeg2. The savings of adpcm and epic come primarily from the victim

buffer reducing the visits to off-chip memory. The savings for mpeg2 come

primarily from the victim buffer enabling us to configure the configurable

cache to use less associativity without increasing accesses to the next

memory level. Yet, for other benchmarks, like padpcm, auto2, and vpr, the

victim buffer yields performance overhead with no energy savings and thus

should be turned off.

Figure 6-4. Performance and energy improvements when adding a victim buffer to an 8 Kbyte

configurable cache. 0% represents a configurable cache without a victim buffer, tuned

optimally to the particular benchmark.

6.6 LOW STATIC-POWER FREQUENT-VALUE

DATA CACHES

Recently, a frequent value (FV) low power data cache design was

proposed based on the observation that a major portion of data cache

accesses involves frequent values, which can be dynamically captured [27].

Frequent values are encoded in the cache, occupying only a few bits.

We improve upon previous FV data caches by reducing static power by

shutting off the unused bits in the larger sub-array for encoded frequent

values [28]. Since frequent values are stored in encoded form using only the

few bits in the smaller sub-array, the remaining bits in the larger sub-array

serve no purpose as long as the value stays frequent. Such shutoff may be

beneficial since FVs occupy many words in data caches [27].

Furthermore, the original FV low power cache design suffers from an

extra cycle when reading non-FVs [27], which account for 68% of all data

cache accesses, resulting in a 5% increase in execution time. We used circuit

design to remove the extra cycle.

6.6.1 Overview of Original FV Cache Design

In this section, we give a brief overview of the original FV data cache

designed by Yang and Gupta [27].

The FV cache was proposed based on the observation that a small

number of distinct frequently occurring data values often occupy a large

portion of program memory data spaces and therefore account for a large

portion of memory accesses [27]. This frequent value phenomenon was

exploited in designing a data cache that trades off performance with energy

efficiency.

From the perspective of the frequent value cache, data values are divided

into two categories: a small number of frequent values, in our case 32 FVs,

and all remaining values that are referred to as non-frequent values. The

frequent values are stored in encoded form and therefore can be represented

in 5 bits; the non-frequent values are stored in unencoded form in 32 bit

words. Additionally, a flag bit is needed for each word in the cache to

determine if the value stored in that location is encoded or not. The set of

frequent values remains fixed for a given program run.

When reading a word from the cache, initially we simply read from the

low-bit array. Since every word read out contains a flag bit, the flag is

examined to determine what comes next. The flag being 1 means the desired

word is in un-encoded form, so the remaining bits should be read out from

the high-bit array to form the original value. On the other hand, the flag

being 0 means that the desired word is a frequent value and stored in

encoded form. In this case, the access proceeds to decode the value. Since

the access to the high-bit array is avoided, cache activity is reduced.

A write to the FV cache is performed as follows. Before a value is

written, it is first encoded through an encoder. If encoding is successful, it

means that the value is a frequent value and thus a 5-bit code is stored in the

low-bit array and the flag bit is cleared. In this case, accessing the high-bit

array is avoided. If the encoding fails, the value to be written is a non-

frequent value and thus both low-bit and high-bit data arrays are accessed as

well as the flag bit being set. Note that writing non-FVs does not need to

take two cycles as does reading non-FVs, because the value is encoded early

in the pipeline and thus the decision of driving one array or two is clear

before the access.
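The read and write paths described in this section can be sketched as a small model of the two sub-arrays. The class and field names are hypothetical, and storing the low-order 5 bits of a non-frequent value in the low-bit array is an assumption of this sketch; the text does not say which bits of an unencoded word go where.

```python
from typing import Dict, List


class FVWordStore:
    """Model of FV cache words: a 5-bit low array, a 27-bit high array, and a flag bit."""

    def __init__(self, frequent_values: List[int]) -> None:
        assert len(frequent_values) <= 32, "5-bit codes allow at most 32 FVs"
        self.decode_table = list(frequent_values)
        self.encode_table = {v: code for code, v in enumerate(self.decode_table)}
        self.low: Dict[int, int] = {}    # 5 bits per word
        self.high: Dict[int, int] = {}   # 27 bits per word
        self.flag: Dict[int, int] = {}   # 0 = encoded FV, 1 = unencoded value
        self.high_array_accesses = 0

    def write(self, index: int, value: int) -> None:
        code = self.encode_table.get(value)
        if code is not None:
            # Encoding succeeded: store the 5-bit code and clear the flag.
            self.low[index] = code
            self.flag[index] = 0
        else:
            # Encoding failed: both arrays are written and the flag is set.
            self.low[index] = value & 0x1F   # assumed split: low-order 5 bits
            self.high[index] = value >> 5    # remaining 27 bits
            self.flag[index] = 1
            self.high_array_accesses += 1

    def read(self, index: int) -> int:
        low = self.low[index]                # the low-bit array is always read first
        if self.flag[index] == 0:
            return self.decode_table[low]    # frequent value: decode, skip high array
        self.high_array_accesses += 1        # non-frequent value: fetch remaining bits
        return (self.high[index] << 5) | low


if __name__ == "__main__":
    store = FVWordStore([0, 1, 0xFFFFFFFF])
    store.write(0, 0xFFFFFFFF)
    store.write(1, 0x12345678)
    print(hex(store.read(0)), hex(store.read(1)), store.high_array_accesses)
```

The `high_array_accesses` counter shows where the activity reduction comes from: only non-frequent values touch the 27-bit array.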

6.6.2 Improving the FV Cache Design

The FVs are not only accessed frequently, but also distributed widely in

caches [29]. This phenomenon provides a good opportunity for reducing

static power. Our approach is the following. Since the 32-bit FVs are

encoded in 5 bits, the remaining 27 bits do not store any useful information.

Therefore, they can be shut down to save static power and as long as a value

stays frequent, static power is saved. The overall savings depend on the

occupancy of FVs in the cache. Our studies show that on average nearly half

of the cache content contains FVs, which indicates the benefit of reducing

static power through finding FVs.

The flag bits are initially set to 1, which means initially all words are

non-FVs. Any data to the data cache is checked with the FV encoder. If the

word is an FV, the corresponding flag bit is set to 0 and this cache word is

encoded and stored in the 5-bit array. At the same time, the flag bit turns off

the 27-bit portion of the word. Similarly, on reading FVs, only the 5-bit

portion is read and the 27-bit portion is gated off using the flag bit. On a

non-FV read or write, the flag bit is set to 1 and the original 32 bits are

written into the cache as usual. Our new circuit design improves the original

FV cache design in that there is no extra delay in determining accesses of the

27-bit portion.

6.6.3 Designers’ Choices of Using the FV Cache

We have described a low static power FV cache. When integrated into a

processor system, the FV cache can be designed with different degrees of

complexity and flexibility. In this section, we provide three approaches that

are suitable for a variety of processors targeting different types of

applications. Essentially, the complexity comes from how FVs are identified

and if they are allowed to vary for different applications. As always, the

more flexibility the processor provides, the more complex the FV cache is.

The first approach is appropriate to application specific processors. Since

only a single type of application runs on the processor, its FVs tend to be

stable over time. In such cases, the FVs can be first obtained from a profiling

run through simulations, and then synthesized into the cache as part of the

cache data storage. The advantage of this approach is that once the FVs are

hard coded on-chip, the cache does not perform operations other than reads.

Thus, the logic of this component is simple and can be designed to consume

minimum power.

The second approach extends the first one with the ability of changing

the FVs according to different applications. This approach is suitable for a

multi-task environment in which the processor runs multiple programs

instead of a single program. Each program’s FVs are still obtained off-line.

Instead of synthesizing the FVs on-chip, a register file may be used to store

FVs so that they can be rewritten on each activation of a different program.

The size of the register file depends on the number of FVs of interest to the

designer, which is heavily dependent on each program’s behavior.

The third approach provides the maximum flexibility in maintaining FVs.

According to a previous study [29], some programs’ FVs are sensitive to

different inputs. This suggests that another dimension of varying FVs might

be added into the design. Since it is infeasible to profile every program on all

possible inputs to catch FVs, detecting FVs on-line would be useful. Thus,

on top of the second approach, the register file could be extended to

dynamically capture FVs using extra logic. In the scheme proposed by Yang

and Gupta [27], an inexpensive hardware FV finder was developed that

monitored cache accesses. The FV finder was turned on for only the first 5%

of memory accesses assuming that the total memory access numbers are

known a priori. After that, the FVs were captured in the finder and

transmitted to the cache so that the cache starts operating as an FV cache.

The energy overhead of the finder was estimated to be 0.3%-6.1% of the L1

D-cache (8 Kbyte to 64 Kbyte caches were tested). The area overhead is

similar to our second approach, and thus modest. One potential issue is that

the FV finder described detects frequently accessed values, which may or

may not correspond to frequently distributed values in memory, though they

usually are the same. We leave an FV finder for frequently distributed values

for future work.
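As a rough software analogue of the on-line approach, the sketch below counts the values seen during the first 5% of accesses and keeps the most common ones. It is not the hardware finder of [27], whose internal structure is not described here; the function name, its parameters, and the use of exact counting are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def find_frequent_values(accessed_values: Sequence[int],
                         total_accesses: int,
                         num_fvs: int = 32,
                         window_percent: int = 5) -> List[int]:
    """Pick the most frequently accessed values within the initial profiling window."""
    window = total_accesses * window_percent // 100
    counts = Counter(accessed_values[:window])
    return [value for value, _ in counts.most_common(num_fvs)]


if __name__ == "__main__":
    # Toy trace: the profiling window sees mostly 0 and 1; later values are ignored.
    trace = [0, 0, 0, 1, 1, 2] * 10 + [9] * 940
    print(find_frequent_values(trace, total_accesses=len(trace), num_fvs=2))
```

Like the finder described above, this sketch ranks values by how often they are accessed, not by how many cache words they occupy.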

To determine the benefits of our FV cache architecture in reducing static

energy, we ran 11 SPEC2000 [26] benchmarks through the SimpleScalar

tool set [19]. We used a 4-issue out-of-order processor simulator with a 32

Kbyte L1 instruction and data cache. The benchmarks were fast-forwarded

for 1 billion instructions and executed for 500 million instructions

afterwards, using reference inputs.

6.6.4.1 Static Energy Savings

Our main goal is to reduce the static energy consumed by the data cache

without losing performance. As mentioned earlier, the overall static energy

saving depends on the average coverage of FVs inside data cache. Through

experiments, we found that there are abundant FVs in the L1 data cache at

any time for Spec 2000 benchmarks, as shown in Figure 6-5. The percentage

shown is the average for the 500 million instructions execution time. On

average, 49.2% of the total words are FVs, with the highest being 77.0% for

benchmark mcf and the lowest 9.4% for benchmark ammp. The static energy

savings are proportional to the number of FVs in the data cache. Thus, the

corresponding static energy savings on average are 35%

(49.2%×27/33×86%) considering that 27 bits out of 33 bits (we need a flag

bit per 32-bit word) are shut off and 86% of static power can be saved using

a pMOS Gated-Vdd. When compared with the conventional 32-bit per word

cache, the static energy savings can be calculated as 100% - (100% -

35%) × 33/32 = 33%.
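The arithmetic above can be checked with a few lines of code. The function and parameter names are illustrative only; the numbers are the ones quoted in the text.

```python
def fv_static_savings(fv_fraction: float,
                      gated_bits: int = 27,
                      word_bits: int = 33,
                      gating_efficiency: float = 0.86) -> float:
    """Fraction of static energy saved relative to a 33-bit-per-word FV cache."""
    return fv_fraction * gated_bits / word_bits * gating_efficiency


def savings_vs_conventional(savings: float,
                            word_bits: int = 33,
                            baseline_bits: int = 32) -> float:
    """Rescale the savings to a conventional 32-bit-per-word cache (no flag bit)."""
    return 1.0 - (1.0 - savings) * word_bits / baseline_bits


if __name__ == "__main__":
    s = fv_static_savings(0.492)            # 49.2% of words are FVs on average
    print(f"{s:.1%} vs. 33-bit word, {savings_vs_conventional(s):.1%} vs. 32-bit word")
```

Both figures round to the values quoted in the text, 35% and 33%.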

Figure 6-5. Percentage of data cache words that are FVs

6.6.4.2 Performance Improvement

Our second achievement is the performance improvement over the

original FV data cache design. Recall that the original FV cache

performance overhead was due to the prolonged non-FV accesses. The more

non-FV accesses, the slower the execution and the less the overall power

savings (less energy savings), since the system would consume more energy

when the program runs longer. We measured the average percentage of

cache hits that are FVs, as shown in Figure 6-6(a). On average, the hit rate

on data FVs is 32% with the highest being 62.7% for vortex and the lowest

11.4% for mcf. Therefore, we can see that on average, 68% of cache accesses

are non-FVs.

Figure 6-6. (a) Hit rate of FVs in data cache; (b) Performance (IPC) degradation of two-cycle

FV cache

With our improved circuitry (1-cycle latency for non-FVs as well as for

FVs), we are able to maintain the same execution speed as the base case. To

see how much performance we have gained over the original FV cache, we

measured the IPCs for a normal cache and a 2-cycle FV cache and plot them

in Figure 6-6(b). The IPC for our improved design is the same as the normal

cache. Figure 6-6(b) shows the slowdowns of the original FV cache design,

which is the same value as our performance improvement. We can see that

there is a 5.2% difference in the averaged IPCs between the original FV

cache and our improved version. This also means that in addition to the

static energy we saved by shutting off partial FV words, we also saved more

dynamic energy than the original FV cache design.

Another feature in our new design is that it is safe in the sense that it does

not increase power consumption significantly even when FVs are not

abundant. Thus, our improved FV cache design is an appealing approach in

reducing both static and dynamic energy of caches.

Acknowledgements

This work was supported by the National Science Foundation (CCR-

0203829, CCR-9876006) and by the Semiconductor Research Corporation

(2003-HJ-1046G).

References

[1] S. Segars. Low power design techniques for microprocessors, International Solid State

Circuit Conference, February 2001.

[2] C. Zhang, F. Vahid, and W. Najjar. A highly-configurable cache architecture for

embedded systems. 30th Annual International Symposium on Computer Architecture,

June 2003.

[3] Altera, Nios Embedded Processor System Development, http://www.altera.com/corporate/

news_room/releases/products/nr-nios_delivers_goods.html.

[4] Arc International, www.arccores.com.

[5] ARM, www.arm.com.

[6] MIPS Technologies, www.mips.com.

[7] Tensilica, Xtensa Processor Generator, http://www.tensilica.com/.

[8] D. H. Albonesi. Selective Cache Ways: On Demand Cache Resource Allocation. Journal

of Instruction Level Parallelism, May 2002.

[9] A. Malik, W. Moyer, and D. Cermak. A Low Power Unified Cache Architecture

Providing Power and Performance Flexibility. International Symposium on Low Power

Electronics and Design, 2000.

[10] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. Memory

Hierarchy Reconfiguration For Energy and Performance in General-Purpose Processor

Architecture. 33rd International Symposium on Microarchitecture, December 2000.

[11] A. Veidenbaum, W. Tang, R. Gupta, A. Nicolau, and X. Ji. Cache Access and Cache

Time Model. IEEE Journal of Solid-State Circuits, Vol 31, No 5, 1996.

[12] T. Givargis and F. Vahid. Platune: A Tuning Framework For System-On-a-Chip

Platforms. IEEE Transactions on Computer Aided Design, November 2002.

[13] M. Palesi and T. Givargis. Multi-Objective Design Space Exploration Using Genetic

Algorithms. International Workshop on Hardware/Software Codesign, May 2002.

[14] C. Zhang and F. Vahid. Cache Configuration Exploration on Prototyping Platforms. 14th

IEEE International Workshop on Rapid System Prototyping, June 2003.

[15] A. Ghosh and T. Givargis. Cache Optimization For Embedded Processor Cores: An

Analytical approach. International Conference on Computer Aided Design, November

[16] C. Zhang, F. Vahid, and R. Lysecky. A Self-Tuning Cache Architecture for Embedded

Systems. Design Automation and Test in Europe Conference (DATE), February 2004.

[17] M. Powell, A. Agarwal, T. Vijaykumar, B. Falsafi, and K. Roy. Reducing Set-Associative

Cache Energy via Way-Prediction and Selective Direct Mapping, 34th International

Symposium on Microarchitecture, 2001.

[18] C. Lee, M. Potkonjak, and W.H. Mangione-Smith. MediaBench: A Tool For Evaluating

and Synthesizing Multimedia and Communication Systems. Proc 30th Annual

International Symposium on Microarchitecture, December 1997.

[19] D. Burger, T. Austin, and S. Bennet. Evaluating Future Microprocessors: The

Simplescalar Toolset. University of Wisconsin-Madison. Computer Science

Department Tech. Report CS-TR-1308, July 2000.

[20] http://www.mips.com/products/s2p3.html, 2003.

[21] A. Gordon-Ross, F. Vahid, and N. Dutt. Automatic Tuning of Two-Level Caches to

Embedded Applications. Design Automation and Test in Europe Conference (DATE),

February 2004.

[22] EEMBC, the Embedded Microprocessor Benchmark Consortium, www.eembc.org.

[23] N. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a Small

Fully-Associative Cache and Prefetch Buffers, Proceedings of International Symposium

on Computer Architecture, 1990.

[24] G. Albera and R. Bahar. Power/performance Advantages of Victim Buffer in High-

Performance Processors, IEEE Alessandro Volta Memorial Workshop on Low-Power

Design, 1999.

[25] C. Zhang and F. Vahid. Using a Victim Buffer in an Application-Specific Memory

Hierarchy. Design Automation and Test in Europe Conference (DATE), February 2004.

[26] http://www.specbench.org/osg/cpu2000.

[27] J. Yang and R. Gupta. Energy Efficient Frequent Value Data Cache Design, Int. Symp.

on Microarchitecture, Nov. 2002.

[28] C. Zhang, J. Yang, and F. Vahid. Low Static-Power Frequent-Value Data Caches. Design

Automation and Test in Europe Conference (DATE), February 2004.

[29] J. Yang and R. Gupta. “Frequent Value Locality and its Applications,” ACM

Transactions on Embedded Computing Systems (inaugural issue), Vol. 1, No. 1, pages

79-105, November 2000.

Chapter 7

REDUCING ENERGY CONSUMPTION IN CHIP MULTIPROCESSORS USING WORKLOAD VARIATIONS

I. Kadayif1, M. Kandemir2, N. Vijaykrishnan2, M. J. Irwin2 and I. Kolcu3

1Canakkale Onsekiz Mart University; 2Pennsylvania State University; 3UMIST

Abstract Advances in semiconductor technology are enabling designs with several hundred million transistors. Since building sophisticated single processor based systems is a complex process from design, verification, and software development perspectives, the use of chip multiprocessing is inevitable in future microprocessors. In fact, the abundance of explicit loop-level parallelism in many embedded applications helps us identify chip multiprocessing as one of the most promising directions in designing systems for embedded applications. Another architectural trend that we observe in embedded systems, namely, multi-voltage processors, is driven by the need to reduce energy consumption during program execution. Practical implementations such as Transmeta’s Crusoe and Intel’s XScale tune processor voltage/frequency depending on current execution load. Considering these two trends, chip multiprocessing and voltage/frequency scaling, this chapter presents an optimization strategy for an architecture that makes use of both chip parallelism and voltage scaling. In our proposal, the compiler takes advantage of heterogeneity in parallel execution between the loads of different processors and assigns different voltages/frequencies to different processors if doing so reduces energy consumption without increasing overall execution cycles significantly. Our experiments with a set of applications show that this optimization can bring large energy benefits without much performance loss.

Keywords: Chip multiprocessing, voltage scaling, loop-level parallelism, embedded systems, optimizing compilers.

7.1 INTRODUCTION

Rising development costs motivate computer architecture companies to design fewer systems-on-chip, but to make each one they do design more flexible and programmable. Doing so makes it possible to reuse designs to take advantage of economies of scale and shorten time-to-market. Moreover, programmability allows companies to keep products in the market longer, boosting integrated profits.

High-performance embedded processors have traditionally relied mainly on clock frequency and superscalar instruction issue to boost performance. While frequency and superscalarity have served the industry well and will continue to be used, we believe that they have limitations that will diminish the gains they will deliver in the future. The gains in operating frequencies, which have historically come at a rate of about 35 percent per year, are attributable to two major factors: semiconductor feature scaling and deeper pipelining. But each of these factors is approaching the point of diminishing returns. Similarly, superscalar processing is nearing its limits, mainly due to the exponential increase in complexity in dispatch logic with increasing issue width. In addition, superscalar processing is limited by the inherent instruction-level parallelism in the code. Although VLIW implementations are less complex than their superscalar counterparts (since most execution decisions are made by the compiler), they still employ power-hungry components and are limited by the available instruction-level parallelism. It should also be noted that neither superscalar nor VLIW architectures are efficient from an energy consumption viewpoint. Therefore, it is not clear whether current architectures will be sufficient for meeting the continuously increasing power and performance demands of applications.

These observations motivate system designers to investigate different architectures. When one looks at the computer architecture industry today, two different trends in system design can easily be observed: on-chip multi-processing and multi-voltage processors. On-chip multi-processors take advantage of high-level, coarse-grain parallelism that exists due to the natural independence of separate program fragments (e.g., functions and loops). As compared to superscalar and VLIW architectures, they are much more suitable for array-intensive embedded applications. Another advantage of using an on-chip multiprocessor, instead of a more powerful and sophisticated uniprocessor, is that there is less difficulty in designing a smaller, less complex chip. This also speeds up chip verification and validation. Thus, the time required to bring the chip to market becomes shorter. One can see several examples of on-chip multi-processing today in both academia and industry. For example, the four-core Hydra from Stanford University [14] is built around Integrated Device Technology Inc.’s RC32364 processor, which uses a 0.25-micron process, and runs at 250 MHz.

As manufacturing processes keep getting refined, it becomes even easier to replicate the core several times on a single die. The MAJC architecture from Sun Microsystems [11] allows one to four processors to share the same die, and for each to run separate threads. Each processor is limited to four functional units (each of which is able to execute both integer and floating point operations, making the MAJC architecture more flexible). Another example of an on-chip multi-processor from industry is the Power4 processor from IBM [15], where two processors are placed on the same die.

The second trend, multi-voltage processors, is mainly driven by the need to reduce energy consumption during program execution. Practical implementations such as Transmeta’s Crusoe [10] and Intel’s XScale [8] scale processor voltage/frequency depending on execution load. Observing that one rarely needs an application to exercise a processor’s maximum performance and the unused extra performance usually represents wasted energy, Crusoe designers try to match the operating level of the processor (in terms of voltage and frequency) to the performance requirements of the application being executed. Depending on the voltage regulator, a Crusoe processor can change its voltage in steps of 25mV and its frequency in steps of 33MHz.

Considering the continuously pressing power and performance demands, we can expect these two techniques to coexist in future embedded architectures. Specifically, we believe that future architectures will be based on on-chip multi-processors, where each on-chip processor can be individually voltage/frequency scaled. Considering such an architecture, this paper investigates the energy/performance tradeoffs in parallelizing array-intensive applications, taking into account the possibility that individual processors can operate at different voltage/frequency levels. In assigning voltage levels to processors, we make use of compiler analysis that reveals heterogeneity between the loads of different processors in parallel execution. Our experiments with a set of applications show that the proposed optimization can bring large energy benefits without much performance penalty.

The rest of this chapter is organized as follows. The next section describes our chip multiprocessor. Section 7.3 discusses why we may experience load imbalance across on-chip processors at runtime. Section 7.4 discusses the compiler analysis necessary for determining the workloads (on a loop nest basis) of individual processors participating in parallel computation. Section 7.5 discusses additional optimizations to further enhance our power savings. Section 7.6 describes our implementation and experimental platforms, and presents performance and energy numbers. Section 7.7 presents our concluding remarks.

Figure 7.1. Chip multiprocessor under consideration. [Diagram labels: CPU0 through CPU3, each with its own cache (Cache0 through Cache3); optional L2 cache; off-chip memory.]

7.2 CHIP MULTIPROCESSOR ARCHITECTURE AND EXECUTION MODEL

The chip multiprocessor we consider here is a shared-memory architecture; that is, the entire address space is accessible by all processors. Each processor has a private L1 cache, and shared memory is assumed to be off-chip. Optionally, we may include a (shared) L2 cache as well. Note that several architectures from academia and industry fit this description [1, 14, 11, 12]. We keep the subsequent discussion simple by using a shared bus as the interconnect (though one could use fancier/higher bandwidth interconnects as well). We also use the MESI [19] protocol (the choice is orthogonal to the focus of this paper) to keep the caches coherent across the CPUs. We assume that the voltage level and frequency of each processor in this architecture can be set independently of the others, and this is the main mechanism through which we save power. This paper focuses on a single-issue, five-stage (instruction fetch (IF), instruction decode/operand fetch (ID), execution (EXE), memory access (MEM), and write-back (WB) stages) pipelined datapath for each on-chip processor. Currently, this is the only architectural model for which our compiler estimates processor workload.

Note that progress in VLSI technology has allowed chip-makers to pack millions of transistors in a single die. Rather than throwing all these resources into a single, powerful processing core and making this core very complex to design and verify, chip-multiprocessors consisting of several simpler processor cores can offer a more cost-effective and simpler way of exploiting these higher levels of integration. Chip multiprocessors also offer a higher granularity (thread/process level) at which parallelism in programs can be exploited by compiler/runtime support, rather than leaving it to the hardware to extract the parallelism at the instruction level on a single (larger) multiple-issue core. All these compelling reasons motivate the trend toward chip multiprocessor architectures, and there is clear evidence of this trend in several commercial offerings and research projects [1, 14, 11, 12].

Our application execution strategy can be summarized as follows. We focus on array-based applications that are constructed from loop nests. Typically, each loop nest in such an application is small but executes a large number of iterations and accesses/manipulates large datasets (typically multidimensional arrays). We employ a loop nest based application parallelization strategy. More specifically, each loop nest is parallelized independently of the others. In this context, parallelizing a loop nest means distributing its iterations across processors and allowing processors to execute their portions in parallel. For example, a loop with 1000 iterations can be parallelized across 10 processors by allocating 100 iterations to each processor. We also assume that after each loop nest execution, all processors get synchronized before they start executing the next loop nest. Note that dropping this requirement would necessitate a sophisticated compiler analysis to identify the cases under which a processor that finishes its portion of iterations from the previous loop nest can go ahead and start executing its portion from the next loop nest without waiting for the others. Nevertheless, in our experiments to be presented later, we also evaluate such an alternative strategy.

There are many proposals for power management of a dynamic voltage scaling-capable processor. Most of them are at the operating system level and are either task-based [13, 17] or interval-based [21, 5]. While some proposals aim at reducing energy without compromising performance, a recent study by Grunwald et al [6] observed noticeable performance loss for some interval-based algorithms using actual measurements. The existing compiler based studies such as [7, 16] target single processor architectures. In comparison, our work targets a chip multiprocessor based environment.

7.3 LOAD IMBALANCE IN PARALLEL EXECUTION

We can broadly divide loop nest parallelization techniques into two categories: static and dynamic. In the static case, the compiler (or the user) decides a suitable parallelization strategy for each loop nest at compile time. The idea is to assign each loop iteration to a processor. There are at least two ways of doing this. In block assignment, a group of consecutive loop iterations is assigned to the same processor. Since such iterations typically access data stored in consecutive memory locations, this type of assignment can also be expected to be data locality friendly. In cyclic assignment, the iterations assigned to processors are interleaved using some stride. While this type of assignment is known to be good from a load balance viewpoint, it generally exhibits poor data locality. Consider, as an example, the loop nest shown below and the array reference in it:

Figure 7.2. Different array accesses imposed by different iteration assignments (the array is assumed to be row-major). [Panels (a) through (e); processors P0 through P3; voltage levels V0 through V3.]

for i: 1..1024
  for j: 1..1024
    ..X[i,j]..

Assuming that only the i-loop is parallelized across four processors (P0 through P3), Figure 7.2(a) illustrates how array X is accessed by the processors when block iteration assignment is used. In this assignment, each processor executes 256 × 1024 iterations, and accesses a group of consecutive rows of the array as depicted in Figure 7.2(a). However, it is also possible to parallelize this loop (i) by distributing its iterations cyclically across processors using some regular stride. For example, we can give the first 128 × 1024 iterations to the first processor, the next 128 × 1024 to the second one, and so on; once the last processor has received its quota, we repeat the whole process, starting over with the first processor, until all loop iterations have been assigned. Figure 7.2(b) shows how array X is accessed by the processors under this cyclic iteration assignment scheme. Note that the cyclic iteration distribution is flexible in the sense that it can work with any stride. For example, instead of using 128 × 1024 iteration chunks, we could have easily used 16 × 1024 or even 1 × 1024 iteration chunks.
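The two static assignments just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, rows are numbered from 1 as in the loop nest, and the number of rows is assumed to divide evenly among the processors.

```python
def block_assignment(num_rows, num_procs):
    """Give each processor one contiguous group of outer-loop iterations (rows)."""
    chunk = num_rows // num_procs  # assumes num_rows is divisible by num_procs
    return {p: list(range(p * chunk + 1, (p + 1) * chunk + 1))
            for p in range(num_procs)}


def cyclic_assignment(num_rows, num_procs, stride):
    """Hand out chunks of `stride` consecutive rows round-robin across processors."""
    rows = {p: [] for p in range(num_procs)}
    for chunk_index, start in enumerate(range(1, num_rows + 1, stride)):
        p = chunk_index % num_procs
        rows[p].extend(range(start, min(start + stride, num_rows + 1)))
    return rows


block = block_assignment(1024, 4)
cyclic = cyclic_assignment(1024, 4, stride=128)
# Rows owned by P0: one contiguous group under block assignment,
# two separate groups of 128 rows under cyclic assignment with stride 128.
print(block[0][0], block[0][-1])
print(cyclic[0][0], cyclic[0][127], cyclic[0][128], cyclic[0][-1])
```

With a stride of 1, the same cyclic function interleaves single rows, which corresponds to the 1 × 1024 iteration chunks mentioned above.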

In comparison, in a dynamic parallelization strategy, the assignment of iterations to processors is performed dynamically during the course of execution by a central controller. Typically, this controller gives a new set of loop iterations to a processor when that processor is done with executing its current set of assigned iterations. While the dynamic strategy is expected to balance the workloads of processors better than static strategies (as it can take runtime constraints into account), it also incurs a much higher runtime cost, in terms of both execution cycles and power consumption, than the static parallelization schemes, since decisions regarding iteration assignments are made at runtime. Therefore, our focus in this study is on static loop nest parallelization.

Consider now the following loop nest:

for i: 1..1024
  for j: i..1024
    ..X[i,j]..

While this loop nest is similar to the previous one, there is one significant difference: the lower bound of the inner loop (j) is i (instead of 1). Figure 7.2(c) shows how the four processors access the array in question when block iteration assignment is employed. Clearly, there is a significant load imbalance across the processors. Assuming that each iteration of this loop nest has the same cost (in terms of execution cycles) and that all processors must synchronize following the execution of the nest, there is no advantage in having the lightly loaded processors finish their iterations as soon as possible. Instead, they can slow down their execution (by reducing their frequencies) and lower their voltages to save energy, while making sure that their execution does not take more time than that of the processor with the largest load (operating at the highest voltage level). Figure 7.2(d) illustrates such a voltage assignment, assuming that V0 is the highest voltage level available. The work presented in this paper performs such a voltage-to-processor assignment for each loop nest of a given array-based application. In a sense, in our framework the job of the compiler is not just to decide which loop iterations should be assigned to which processors but also which supply voltage/frequency each processor needs to use. Our objective is to save as much power as possible without incurring much performance penalty.

At this point, someone might claim that it would be better in this case (Figure 7.2(c)) to use cyclic assignment instead of block assignment as this would eliminate the load imbalance problem introduced by the latter to a large extent. However, this may not be a viable option in general. Consider, for example, the scenario depicted in Figure 7.2(e), where the direction of parallelization is reversed (due to data dependences for example). In this case, cyclic assignment would be very costly in terms of data locality (cache behavior), assuming that the array in question is stored as row-major. Considering the fact that off-chip memory accesses are getting more and more expensive in terms of processor cycle times, one may not want to degrade data locality.

7.4 COMPILER SUPPORT

As mentioned earlier, the compiler’s job in our setting is not only to assign iterations to processors but also to come up with a suitable voltage level for each processor. To do this, the compiler needs to estimate the workload of each processor and match it with an appropriate voltage/frequency level. Without loss of generality, we assume that there are s voltage/frequency levels available to the compiler. Our compiler-based approach proceeds as follows:

• Parallelization Step. In this step, the compiler parallelizes an application on a loop nest basis. That is, each loop nest is parallelized independently considering the intrinsic data dependences it has. Since we are targeting a chip multiprocessor, our parallelization strategy tries to achieve (for each nest) outer-loop parallelism to the best extent possible. In other words, we parallelize the outermost loop (in the nest) that carries no data dependence. Our baseline results are obtained using this parallelization strategy. Later in our experiments, we change our parallelization strategy to conduct a sensitivity analysis.

• Processor Load Estimation. In this step, the compiler estimates the load of each processor in each nest. To do this, it performs two calculations: (a) iteration count estimation and (b) per-iteration cost estimation. Since in most array-based embedded applications bounds of loops are known before execution starts, estimating the iteration count for each loop nest is not very difficult. The challenge is in determining the cost (in terms of execution cycles) of a single iteration (for a given loop nest). Since the processors employed in our chip multiprocessor are simple single-issue cores, our cost computation is closely dependent on the number and types of the assembly instructions that will be generated for the loop body. Specifically, we associate a base execution cost with each type of assembly instruction. In addition, we also estimate the number of cache misses. Since loop-based embedded applications exhibit very good instruction locality (as they spend most of their execution cycles within loop nests and there are not too many conditional-if executions), we focus on data cache and estimate data cache misses using the method proposed by Carr et al [2]. An important issue is to estimate (at the source level) what assembly instructions will be generated for the loop body in question. We attack this problem as follows. The constructs that are vital to the studied codes include a typical loop, a nested loop, assignment statements, array references, and scalar variable references within and outside loops. Our objective is to estimate the number of assembly instructions of each type associated with the actual execution of these constructs. To achieve this, the assembly equivalents of several codes were obtained using our back-end compiler (a variant of gcc) with the O2-level optimization. Next, the portions of the assembly code were correlated with corresponding high-level constructs to extract the number and type of each instruction associated with the construct. In order to simplify the correlation process and to partially isolate the impact of instruction choice due to low-level optimizations, the assembly instructions with similar functionality and energy consumption are grouped together. For example, both branch-if-not-equal (bne) and branch-if-equal (beq) are grouped as a generic branch instruction (denoted bra).

To illustrate our parameter extraction process in more detail, we focus on some specifics of the following example constructs. First, let us focus on a loop construct. Each loop construct is modeled to have a one-time overhead to load the loop index variable into a register and initialize it. Each loop also has an index comparison and an index increment (or decrement) overhead whose costs are proportional to the number of loop iterations (called trip count or trip). From correlating the high-level loop construct to the corresponding assembly code, each loop initialization code is estimated to execute one load (lw) and one add (add) instruction (in general). Similarly, an estimate of trip+1 load (lw), store-if-less-than (stl), and branch (bra) instructions is associated with the index variable comparison. For index variable increment (resp. decrement), 2×trip addition (resp. subtraction) and trip load, store, and jump instructions are estimated to be performed.

Next, we consider extracting the number of instructions associated with array accesses. First, the number and types of instructions required to compute the address of the element are identified. This requires the evaluation of the base address of the array and the offset provided by the subscript(s). Our current implementation considers the dimensionality of the array in question, and computes the necessary instructions for obtaining each subscript value. Computation of the subscript operations is modeled using multiple shift and addition/subtraction instructions (instead of multiplications) as this is the way our back-end compiler generates code when invoked with the O2 optimization flag. Finally, an additional load/store instruction was associated to read/write the corresponding array element. Note that these correlations between high-level constructs and low-level assembly instructions are a first-level approximation for our simple architecture and array-dominated codes with the O2-level optimization and were obtained through extensive analysis of a large number of code fragments.

Based on the process outlined above, the compiler estimates the iteration count for each processor and the per-iteration cost. Then, by multiplying these two, it calculates the estimated workload for each processor. While this workload estimation may not be 100% accurate, it allows the compiler to rank processors according to their workloads and assign suitable voltage levels and frequencies to them, as will be described in the next item. As an example, consider the second loop nest shown above, parallelized using 4 processors. Assuming that our estimator estimates the cost of the loop body as L instructions, the loads of processors P0, P1, P2, and P3 are 256 × 1024 × L, 256 × (1024-257+1) × L, 256 × (1024-513+1) × L, and 256 × (1024-769+1) × L, respectively. (Each of these figures charges all 256 rows of a processor with the inner-loop trip count of its first row, so it is an upper estimate of that processor's load.)

• Voltage Assignment. In this step, the compiler first orders the processors according to non-increasing workloads. After that, the highest voltage is assigned to the processor with the largest workload (the objective being not to affect the execution time to the greatest extent possible). Then, the processor with the second highest workload gets assigned the minimum voltage level Vk available (where 1 ≤ k ≤ s) that does not cause its execution time to exceed that of the processor with the largest workload. In this way, each processor gets the minimum voltage level (to save the maximum amount of power) without increasing the overall parallel execution time of the nest (which is determined by the processor with the largest workload). Continuing with the example above, suppose that we have two voltage/frequency levels (that is, V1/f1 and V2/f2, assuming s = 2 and V1/f1 > V2/f2). We first determine the execution time taken by processor P0 (denoted T0). Then, for each other processor, we use V2/f2 if doing so does not cause its execution time to exceed T0. If the execution time of a processor exceeds T0 (when using V2/f2), we switch back to V1/f1 for that processor.
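The load estimation and voltage assignment steps above can be sketched as follows. This is a minimal illustration, not the authors' compiler: the function names are hypothetical, each voltage level is represented only by its frequency, and the frequency values in the example are assumed (Table 7.1 specifies eight levels and a 30MHz step, but not the absolute frequencies). The first function returns both the per-processor estimate quoted in the text and the exact iteration count for the triangular loop nest.

```python
def estimated_loads(n, num_procs, body_cost):
    """Loads for `for i: 1..n / for j: i..n`, block-assigned on i.

    Returns (approx, exact). `approx` follows the figures in the text
    (rows * trip count of the first row); `exact` sums the true trip counts.
    """
    rows = n // num_procs  # assumes n is divisible by num_procs
    approx, exact = [], []
    for p in range(num_procs):
        first = p * rows + 1
        approx.append(rows * (n - first + 1) * body_cost)
        exact.append(sum(n - i + 1 for i in range(first, first + rows)) * body_cost)
    return approx, exact


def assign_levels(loads, freqs):
    """Pick, per processor, the slowest frequency that still meets T_max.

    T_max is the time of the most heavily loaded processor at the fastest level.
    """
    levels = sorted(freqs)                 # slowest level first
    t_max = max(loads) / levels[-1]
    # The fastest level always satisfies the bound, so next() finds a match.
    return [next(f for f in levels if load / f <= t_max) for load in loads]


approx, exact = estimated_loads(1024, 4, body_cost=1)
hypothetical_mhz = [190, 220, 250, 280, 310, 340, 370, 400]  # assumed values
print(approx)
print(exact)
print(assign_levels(approx, hypothetical_mhz))
```

Under these assumed frequencies, P0 stays at the fastest level while the other three processors drop to progressively slower ones, mirroring the assignment pictured in Figure 7.2(d).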

The success of our strategy critically depends on two important factors. First, there should be some load imbalance to exploit between different processors. This is because if there is no such imbalance then it is reasonable to execute each processor with the highest voltage/frequency. Second, the compiler-based workload estimation should be reasonably accurate. If this is not the case, then we may assign a wrong voltage level/frequency to a processor, which may in turn impact overall execution time. In fact, in this scheme, the only time we pay some penalty is when our compiler-based workload estimation is not very accurate. In our experiments, we quantify this penalty in detail.
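Since accuracy hinges on the per-construct instruction counts, the loop-overhead model from the Processor Load Estimation step can be written out as a small sketch. The counts follow the text (initialization, index comparison, index increment); the function names, the dictionary keys "store" and "jump", and the unit base costs are assumptions made for illustration only.

```python
from collections import Counter


def loop_overhead(trip):
    """Instruction counts attributed to one loop construct with `trip` iterations."""
    counts = Counter()
    # One-time initialization of the index variable: one load and one add.
    counts.update({"lw": 1, "add": 1})
    # Index comparison: trip+1 each of load, store-if-less-than, and branch.
    counts.update({"lw": trip + 1, "stl": trip + 1, "bra": trip + 1})
    # Index increment: 2*trip additions and trip each of load, store, and jump.
    counts.update({"add": 2 * trip, "lw": trip, "store": trip, "jump": trip})
    return counts


def estimated_cycles(counts, base_cost):
    """Weight each instruction type by its base execution cost (in cycles)."""
    return sum(n * base_cost[name] for name, n in counts.items())


counts = loop_overhead(100)
unit_cost = {name: 1 for name in counts}  # assumed: one cycle per instruction
print(dict(counts))
print(estimated_cycles(counts, unit_cost))
```

The same pattern extends to array references by adding the shift, addition/subtraction, and load/store counts described above; cache-miss penalties would be added separately.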

7.5 ADDITIONAL OPTIMIZATIONS

In this section, we discuss how the effectiveness of our strategy can be further increased using additional optimizations.

7.5.1 Inter-Nest Optimization

In the description of our strategy above, we assumed that the processors will synchronize at the end of each loop nest (before they start executing the next loop nest). As noted by Tseng [20], such a global synchronization presents two major problems. First, to implement such a synchronization, the compiler needs to generate extra (synchronization) code and insert it in the application code. Obviously, this code presents extra performance and power overhead at runtime. Second, since this synchronization requires all processors to wait for the slowest one, it makes poor use of available resources (from the performance angle). Consequently, allowing a processor to continue without waiting for the slower ones can allow small perturbations in processor execution times to even out, thereby improving overall performance (by taking advantage of the loosely-coupled nature of chip multiprocessors). However, determining when it is safe to allow a processor to continue without synchronization requires extra compiler analysis. In this study, we implemented a strategy that takes a number (called b) as a parameter, and for each loop nest, allows a processor to continue for at most b next nests if doing so does not violate any data dependences.

7.5.2 Voltage/Frequency Reuse

Another optimization can be performed by being more careful in voltage assignment. Up to this point in our discussion we assumed that the processor assignment for each loop nest is done independently of the other nests. As a result of this, as we move from one loop nest to another the same processor can get assigned different voltage levels. Consequently, we pay a penalty (in terms of both performance and energy consumption) for changing voltage levels. This penalty can be minimized by reusing the same voltage as much as possible for the same processor throughout the execution. This can be achieved as follows. Suppose that in loop nest i, we used voltage level Vk for processor j. When we move to loop nest i + 1, if we need to assign voltage level Vk to a processor, we use processor j for that. This can be repeated for each neighboring loop nest pair, and in this way, the processors reuse their voltage levels as much as possible.
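The reuse rule for a pair of neighboring nests can be sketched as a simple matching. This is an illustrative reading of the paragraph above, not the authors' implementation: the function name is hypothetical, `needed` is assumed to list exactly one level per processor, and the order in which leftover levels are handed out is an arbitrary choice.

```python
from collections import Counter


def reuse_levels(prev, needed):
    """Map the levels needed by nest i+1 onto processors, reusing nest i's levels.

    prev:   dict processor -> level used in nest i
    needed: list of levels required in nest i+1 (one per processor)
    """
    remaining = Counter(needed)
    nxt = {}
    # First pass: a processor keeps its level whenever that level is still needed.
    for proc, level in prev.items():
        if remaining[level] > 0:
            nxt[proc] = level
            remaining[level] -= 1
    # Second pass: processors without a match take whatever levels are left.
    leftover = sorted(remaining.elements())
    for proc in prev:
        if proc not in nxt:
            nxt[proc] = leftover.pop()
    return nxt


prev = {0: "V1", 1: "V2", 2: "V3", 3: "V3"}
nxt = reuse_levels(prev, ["V3", "V1", "V1", "V2"])
switches = sum(prev[p] != nxt[p] for p in prev)
print(nxt, switches)
```

In this example only one processor changes its level between the two nests, whereas assigning the needed levels in list order would change all four.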

7.5.3 Adaptive Parallelization

So far in our treatment of the subject, we have assumed that we use all available processors in the execution of all nests in the application. However, it is known from prior research [9] that, in some cases, using fewer processors (and shutting off the unused ones along with their L1 caches) can result in better energy consumption behavior. We also conducted experiments with an adaptive strategy, where each loop nest is first profiled using different numbers of processors in conjunction with our optimization strategy. After the profiling, for each loop nest, we identified the ideal number of processors, and used it in the actual execution. It should be noted that in adaptive parallelization we use fewer processors than are available (this means some performance loss); however, turning off unused processors along with their L1 caches can bring energy benefits.

7.5.4 Combining Cyclic and Block Iteration Allocations

As has been discussed earlier in the paper, one may also opt to use cyclic distribution of loop iterations across processors. Since our framework is able to estimate the number of cache misses, we can potentially have a better strategy as follows. For each loop nest, we can calculate the number of misses for both block and cyclic wise allocations and select the strategy that generates the best energy savings under a performance (execution cycles) constraint. We can refer to such a strategy as hybrid since it makes use of both block and cyclic wise allocation.

Table 7.1. Base simulation parameters used in our experiments.

Parameter                               Default Value
Number of Voltage/Frequency Levels      8
Lowest/Highest Voltage Levels           0.8V/1.4V
Frequency Step Size                     30MHz
Voltage/Frequency Transition Penalty    10 cycles/2.10nJ
L1 Size                                 8KB
L1 Line Size                            32 bytes
L1 Associativity                        4-way
L1 Latency                              1 cycle
L2 Size (Shared)                        2MB
L2 Associativity                        4-way
L2 Line Size                            64 bytes
L2 Latency                              10 cycles
Memory Access Latency                   100 cycles
Bus Arbitration Delay                   5 cycles
Replacement Policy                      Strict LRU
L1 Energy (per access)                  1.14nJ
L2 Energy (per access)                  2.56nJ
Main Memory Energy (per access)         23.10nJ
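The hybrid selection of Section 7.5.4 can be sketched as choosing, per loop nest, the lowest-energy allocation among those that meet a cycle budget. The sketch below is an assumption-laden illustration: the function names, the candidate dictionary layout, and the access and miss counts are hypothetical, and the memory energy model simply charges every L1 miss with one L2 access and every L2 miss with one main memory access, using the per-access energies of Table 7.1.

```python
def memory_energy_nj(l1_accesses, l1_misses, l2_misses):
    """Memory-system energy in nJ using the per-access energies of Table 7.1."""
    return l1_accesses * 1.14 + l1_misses * 2.56 + l2_misses * 23.10


def pick_allocation(candidates, max_cycles):
    """Return the lowest-energy allocation that meets the cycle constraint."""
    feasible = {name: c for name, c in candidates.items()
                if c["cycles"] <= max_cycles}
    if not feasible:
        return None
    return min(feasible, key=lambda name: feasible[name]["energy_nj"])


# Hypothetical estimates for one loop nest.
candidates = {
    "block":  {"cycles": 5000, "energy_nj": memory_energy_nj(1000, 40, 10)},
    "cyclic": {"cycles": 4200, "energy_nj": memory_energy_nj(1000, 200, 80)},
}
print(pick_allocation(candidates, max_cycles=5500))
print(pick_allocation(candidates, max_cycles=4500))
```

A fuller version would also include CPU energy at the assigned voltage levels; only the memory component is modeled here because it is the part that differs most between block and cyclic allocation.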

7.6 EXPERIMENTS

We tested the effectiveness of our algorithm in reducing the energy consumption of a chip multiprocessor using six array-intensive programs: 3D, DFE, LU, SPLAT, MGRID, and WAVE5. 3D is an image-based modeling application that simplifies the task of building 3D models and scenes. DFE is a digital image filtering and enhancement code. LU is an LU decomposition program. SPLAT is a volume rendering application used in multi-resolution volume visualization through hierarchical wavelet splatting. Finally, MGRID and WAVE5 are C versions of two Spec95FP applications. These C programs are written in such a fashion that they can operate on inputs of different sizes. The default configuration parameters used in our experiments are given in Table 7.1; these values are used unless explicitly stated or varied in the sensitivity experiments.

To conduct our experiments, we modified Simics [18], a full-system simulation platform that can simulate both uniprocessor and multiprocessor machines. All energy results reported in this section include the energy spent in the CPUs, their caches, and main memory, and have been normalized with respect to the energy consumption when no voltage scaling is used and each processor is operated at the maximum supply voltage and frequency.

Figure 7.3. Normalized energy consumption with different numbers of processors (8 voltage levels).
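As a rough illustration of the memory-side part of this energy accounting, the per-access energies of Table 7.1 can be combined with access counts as sketched below. The assumption that every L1 miss causes exactly one L2 access and every L2 miss exactly one main memory access is ours, as are the function names and the access counts.

```python
# Per-access energies from Table 7.1, in nanojoules.
E_L1, E_L2, E_MEM = 1.14, 2.56, 23.10

def memory_energy_nj(l1_accesses, l1_misses, l2_misses):
    """Memory hierarchy energy, assuming one L2 access per L1 miss and
    one main memory access per L2 miss (a simplification)."""
    return l1_accesses * E_L1 + l1_misses * E_L2 + l2_misses * E_MEM

def normalized(energy, baseline_energy):
    """Energy relative to the run without voltage scaling."""
    return energy / baseline_energy

# Hypothetical access counts.
example = memory_energy_nj(l1_accesses=1000, l1_misses=100, l2_misses=10)
```

For these counts the total is 1140 + 256 + 231 = 1627 nJ; CPU energy would be added before normalizing.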

The graph in Figure 7.3 gives the normalized energy consumption with different numbers of processors. We can make two main observations from this graph. First, all six applications get some energy benefit from our approach for every processor count we experimented with. Second, our energy savings get better as the number of processors increases. This is because a larger number of processors means more load imbalance to optimize, and our approach takes advantage of it. When considering individual applications, one can see that MGRID and WAVE5 perform poorly compared to the others, mainly because these applications have very few cases where our approach is applicable. In comparison, LU benefits considerably from increasing the number of processors since most of its few loops exhibit a significant amount of load imbalance. Overall, the average savings across all six applications are between 16.03% (for the two-processor case) and 41.80% (for the thirty-two-processor case).

Figure 7.4. Normalized energy consumption with different voltage levels (8 processors).

To evaluate the impact of the number of voltage levels on energy savings, we also performed experiments with different numbers of voltage levels. The results are presented in Figure 7.4 for the 8-processor case. One can easily see from this graph that the number of voltage levels has a significant impact on energy behavior. In particular, the difference in going from 4 levels to 8 levels is dramatic; the corresponding savings are 6.63% and 29.02%. Increasing the number of voltage levels further (to 16) does not bring much additional energy benefit since there is little scope left for optimization beyond what could be optimized using 8 levels. It should also be mentioned that when we have only 2 levels, the average saving across all applications is only 2.40%. This poor result is due to the fact that our strategy tries to avoid increasing execution cycles as far as possible. Consequently, in many cases (when we have only 2 voltage levels) the compiler cannot use the lower voltage for a processor (even though the processor has a low workload) since doing so would increase execution cycles dramatically.
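The effect of the number of voltage levels can be illustrated with a small sketch. It is not our compiler algorithm: it assumes evenly spaced supply voltages between 0.8V and 1.4V, a relative speed proportional to the supply voltage, and dynamic energy proportional to the number of cycles times the square of the voltage. The workloads are hypothetical.

```python
def make_levels(n, v_min=0.8, v_max=1.4):
    """n evenly spaced voltages; relative speed assumed proportional to voltage."""
    step = (v_max - v_min) / (n - 1)
    return [(v_min + k * step, (v_min + k * step) / v_max) for k in range(n)]

def pick_level(workload, critical, levels):
    """Lowest voltage whose stretched run time still fits the critical workload."""
    for volt, speed in levels:  # levels are in ascending voltage order
        if workload / speed <= critical:
            return volt
    return levels[-1][0]        # otherwise stay at the highest voltage

def relative_energy(workloads, levels):
    """Energy with per-processor scaling, relative to all processors at v_max."""
    critical = max(workloads)   # the busiest processor sets the finish time
    v_max = levels[-1][0]
    scaled = sum(w * pick_level(w, critical, levels) ** 2 for w in workloads)
    return scaled / sum(w * v_max ** 2 for w in workloads)

workloads = [100, 90, 70, 60]   # hypothetical per-processor cycle counts
rel2 = relative_energy(workloads, make_levels(2))
rel8 = relative_energy(workloads, make_levels(8))
```

With two levels, none of the lightly loaded processors can drop to 0.8V without finishing after the busiest one, so no energy is saved in this example. With eight levels each of them finds an intermediate voltage that still meets the finish time.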

Recall that in Section 7.5 we discussed four different optimization strategies that can further increase energy savings. The graph shown in Figure 7.5 gives the normalized energy consumption with these optimizations. The first bar for each application corresponds to our strategy when none of these four optimizations has been activated. Our first observation is that each application benefits from one or more of these optimizations. Second, not every optimization is effective for each benchmark. For example, using the hybrid iteration allocation brings energy benefits only in 3D and DFE (since the nests in the other applications exhibit uniform behavior and prefer only one type of iteration allocation for the best energy behavior). Similarly, adaptive parallelization is useful only for 3D and SPLAT. To further study the impact of inter-nest optimization (one of the optimizations discussed in Section 7.5), we also performed experiments with different values of b (note that the default value used in Figure 7.5 is 4). We see from the graph in Figure 7.6 that, for the applications that benefit from this optimization, a b value of 4 seems reasonable. This is because in many cases the data dependences in the application prevent a processor from going beyond the next four nests without waiting for the slower processors.

Figure 7.5. Impact of different optimizations on energy consumption (8 processors; 8 voltage levels; and b = 4).

Figure 7.6. Impact of b on energy consumption (8 processors and 8 voltage levels).

While the energy savings reported in this section are significant, for a fair comparison one also needs to consider the impact of our approach on performance. As pointed out earlier, our approach can lead to an increase in execution cycles only if the compiler analysis is largely inaccurate. The graph in Figure 7.7 shows that the performance overhead incurred by our approach is below 2% in all but one application (SPLAT). The reason for the relatively large performance penalty in SPLAT is that this application exhibits a large number of conflict misses (over 68%; the rest are cold and capacity misses), which cannot be captured by the cache miss estimation scheme currently employed by our implementation. Consequently, the compiler is not very successful in attaching suitable voltage levels to processors, and this in turn causes performance degradation. It is conceivable that a more accurate cache miss estimation strategy (e.g., [4]) can help improve the behavior of this benchmark. This will be part of our future research on this topic.

Figure 7.7. Percentage increase in execution cycles (8 processors; 8 voltage levels).

7.7 CONCLUDING REMARKS

A chip multiprocessor lowers the number of functional units per processor, and distributes separate tasks/threads to each processor. This paper has evaluated a compiler-directed strategy that allows different processors to use different voltage levels/frequencies to take advantage of the load imbalances stemming from loop parallelization. Our results with six applications clearly demonstrate the effectiveness of our strategy and make a case for voltage-sensitive loop parallelization. Our results also show that it is possible to increase energy savings further by employing voltage/frequency reuse, adaptive parallelization, and inter-nest optimization.

Acknowledgments

This work was supported in part by NSF Career Awards #0093082 and #0093085, and a grant from GSRC PAS.

References

[1] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. Proceedings of International Symposium on Computer Architecture, Vancouver, Canada, June 12–14 2000.

[2] S. Carr, K. S. McKinley, and C. Tseng. Compiler Optimizations for Improving Data Locality. Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, October 1994.

[3] DAC’02 Sessions: Design Methodologies Meet Network Applications and System on Chip Design, New Orleans, LA, June 2002.

[4] S. Ghosh, M. Martonosi, and S. Malik. Cache Miss Equations: An Analytical Representation of Cache Misses. Proceedings of the 11th ACM International Conference on Supercomputing, July, 1997.

[5] K. Govil, E. Chan, and H. Wasserman. Comparing Algorithms for Dynamic Speed-Setting of a Low-Power CPU. Proceedings of the 1st ACM International Conference on Mobile Computing and Networking, November 1995.

[6] D. Grunwald, P. Levis, K. Farkas, C. Morrey III, and M. Neufeld. Policies for Dynamic Clock Scheduling. Proceedings of the 4th Symposium on Operating System Design and Implementation, October 2000.

[7] C.-H. Hsu and U. Kremer. Dynamic Voltage and Frequency Scaling for Scientific Applications. Proceedings of the 14th Workshop on Languages and Compilers for Parallel Computing, August 2001.

[8] Intel XScale Technology. http://www.intel.com/design/intelxscale/.

[9] I. Kadayif, M. Kandemir, and U. Sezer. An Integer Linear Programming Based Approach for Parallelizing Applications in On-Chip Multiprocessors. In Proc. Design Automation Conference, New Orleans, LA, June 2002.

[10] A. Klaiber. The Technology Behind Crusoe Processors. Transmeta White Paper, January 2000. http://www.transmeta.com/about/press/white papers.html.

[11] MAJC-5200. http://www.sun.com/microelectronics/MAJC/5200wp.html

[12] MP98: A Mobile Processor. http://www.labs.nec.co.jp/MP98/top-e.htm.

[13] T. Okuma, T. Ishihara, and H. Yasuura. Real-Time Task Scheduling for a Variable Voltage Processor. Proceedings of the 12th International Symposium on System Synthesis, 1999.

[14] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang. The Case for a Single Chip Multiprocessor. Proceedings of the 7th Intl Conference on Architectural Support for Programming Languages and Operating Systems, ACM Press, New York, 1996, pp. 2–11.

[15] POWER4 System Microarchitecture, White Paper, http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/power4.html

[16] H. Saputra, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, J. S. Hu, C-H. Hsu, and U. Kremer. Energy-Conscious Compilation Based on Voltage Scaling. Proceedings of ACM SIGPLAN Joint Conference LCTES’02 and SCOPES’02, Berlin, Germany, June, 2002.

[17] Y. Shin, K. Choi, and T. Sakurai. Power Optimization of Real-Time Embedded Systems on Variable Speed Processors. Proceedings of the International Conference on Computer-Aided Design, November 2000.

[18] SIMICS. http://www.virtutech.com/simics/simics.html.

[19] J. P. Singh and D. Culler. Parallel Computer Architecture: A Hardware-Software Approach, Morgan-Kaufmann, 1998.

[20] C.-W. Tseng. Compiler Optimizations for Eliminating Barrier Synchronization. Proceedings of 5th ACM Symposium on Principles and Practice of Parallel Programming, Santa Barbara, CA, July 1995.

[21] M. Weiser, B. Welch, A. Demers, and S. Shenker. Scheduling for Reduced CPU Energy. Proceedings of the 1st Symposium on Operating Systems Design and Implementation, November 1994.

Chapter 8

ARCHITECTURES AND DESIGN TECHNIQUES FOR ENERGY EFFICIENT EMBEDDED DSP AND MULTIMEDIA PROCESSING

Ingrid Verbauwhede1,2, Patrick Schaumont1, Christian Piguet3, Bart Kienhuis4

1University of California, Los Angeles; 2K.U.Leuven; 3CSEM; 4Leiden

Abstract: Energy efficient embedded systems consist of a heterogeneous collection of very specific building blocks, connected together by a complex network of many dedicated busses and interconnect options. The trend to merge multiple functions into one device makes the design and integration of these “systems-on-chip” (SOCs) even more challenging. Yet, specifications and applications are never fixed and require the embedded units to be programmable. This chapter gives the designer architectures and design techniques to find the right balance between energy efficiency and flexibility. The key is to include programmability (or reconfiguration) at the right level of abstraction, tuned to the application domain. The challenge is to provide an exploration and programming environment for this heterogeneous architecture platform.

Keywords: Embedded systems, architectures, low power, design tools, design exploration

8.1 INTRODUCTION

Embedded systems (e.g. a cell phone, a GPS receiver, a portable DVD player, an HDD camcorder) use an architecture that is a heterogeneous collection of very specific building blocks, connected together by a complex network of many dedicated busses and interconnect options. General-purpose programmable processors are not used, for energy efficiency reasons. Typically, multiple small embedded processor cores with accelerators, IP cores, etc. are used. The trend to merge multiple functions into one device (e.g. a cell phone with video capabilities) makes the design and integration of these “systems-on-chip” (SOCs) even more challenging. Yet, specifications and applications are never fixed and require the embedded units to be programmable. A good balance between energy efficiency and programmability can be obtained by using programmable domain-specific processors. A well-known example is the programmable digital signal processor (DSP). DSPs are developed for wireless communication systems (mostly driven by cellular standards). In a first generation this meant that DSPs were adapted to execute many types of filters (e.g. FIR, IIR); later, communication algorithms such as Viterbi decoding and, more recently, Turbo decoding were added.

A first trend we notice is that more applications, and multiple applications at once, run in parallel or on demand on the device, e.g. video decoding, data processing, multiple standards, etc. A second trend is that these new applications tend to run either on a separate domain-specific programmable processor or on a hardware accelerator (the distinction between the two being rather blurry) next to the embedded DSP or micro-controller, instead of being tightly coupled into the instruction set of the host processor.

A third trend we notice is that general-purpose programming environments are getting more heterogeneous and domain-specific. For energy efficiency reasons, the general-purpose solutions are augmented with domain-specific units, accelerators, IP cores, etc. This is clearly visible in FPGAs, as the new generations now include specialized blocks such as embedded cores, block RAMs and large numbers of multipliers. One successful example is the Virtex-Pro family of Xilinx [17]. These devices contain up to four Power PC cores, multiple columns of SRAM, multiple columns of multipliers, Gbit I/O transceivers, etc.

The architecture design of this heterogeneous SOC is a search in a three-dimensional design space, which we call the reconfiguration hierarchy [12]. First, in the Y direction: at what level of abstraction should the programming be introduced? Second, in the X direction: which component of the architecture should be programmable? Third, in the Z direction: what is the timing relation between processing and the configuration/programming? Programming can be introduced at multiple levels of abstraction. When it is introduced at the instruction set level, the result is called a “programmable processor”. When it is introduced at the CLB level of an FPGA, the result is called a reconfigurable device. Regarding components, a processor has four basic components: data paths, control, memory and interconnect. One has a choice of making some or all of them programmable. The third question compares the processing activity to the binding time; depending on the answer, a system is configurable, reconfigurable, or dynamically reconfigurable.
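One way to picture a point in this three-dimensional space is as a small record with one field per axis, as sketched below. The type names, the field names, and the example design point are illustrative assumptions only; the sketch lists just the two abstraction levels named in the text.

```python
from dataclasses import dataclass
from enum import Enum

class Abstraction(Enum):   # Y direction: where programming is introduced
    INSTRUCTION_SET = "programmable processor"
    CLB = "reconfigurable device"

class Component(Enum):     # X direction: which component is programmable
    DATA_PATH = 1
    CONTROL = 2
    MEMORY = 3
    INTERCONNECT = 4

class Binding(Enum):       # Z direction: processing versus binding time
    CONFIGURABLE = 1
    RECONFIGURABLE = 2
    DYNAMIC_RECONFIGURABLE = 3

@dataclass(frozen=True)
class DesignPoint:
    level: Abstraction
    programmable: frozenset
    binding: Binding

# Hypothetical example: a core whose data path and control are programmable
# at the instruction set level.
core = DesignPoint(
    level=Abstraction.INSTRUCTION_SET,
    programmable=frozenset({Component.DATA_PATH, Component.CONTROL}),
    binding=Binding.RECONFIGURABLE,
)
```

Freezing flexibility along an axis then corresponds to fixing one of these fields for a given block.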

The challenge is to develop a design environment to navigate this three-dimensional design space.

Several SOC platforms have been presented in the literature. Most of them focus on general-purpose regular architectures, e.g. [2]. Very few focus on the low power issue and the need to tune the architecture towards the application. One example is the low power Maya platform [18]. Unique to our design approach is that we combine the design and programming of the architecture with an environment to explore the best options.

The chapter is organized as follows. Sections 8.2 and 8.3 look at the architecture design, while Sections 8.4 and 8.5 discuss the design exploration, co-design and co-simulation challenges.

8.2 ENERGY EFFICIENT HETEROGENEOUS SOC’S

The system designer needs an architecture platform that gives the lowest energy consumption, but at the same time provides enough flexibility to allow re-programming or re-configuration. The key to energy efficiency is to tune the architecture to the application domain. This means freezing flexibility in the X (components) and Y (level of abstraction) directions of the reconfiguration hierarchy. A hierarchy of so-called “Y charts” allows us to do this in a top-down fashion [5].

A complex SOC will consist of multiple domain-specific processing engines. Each processor is programmable to a greater or lesser degree. It can be highly programmable if the processor is a micro-controller, a DSP engine or a blank box of CLB units. The efficiency goes up as domain-specific instructions are added. An example of this is the addition of a MAC instruction to a DSP processor. Loosely coupled co-processors will be more energy efficient but less flexible, as they fit a narrower application domain. An example is the Turbo coder acceleration unit. The ultimate energy efficient block is the optimized hard IP unit. Yet, it does not provide any flexibility. An SOC uses a range and collection of these blocks.

Similar arguments can be made for the interconnect component of an SOC. Currently, we see only two extreme options: either dedicated one-to-one connections and specialized busses, which have the lowest power consumption (to a first order), or general-purpose global busses or interconnect, as provided by FPGAs [17] or networks on chip [2]. The latter two are both general-purpose solutions at different levels of abstraction that give the designer maximum flexibility and programmability.

[Figure: domain-specific abstraction pyramids for Networking, Security, Video and Signal Processing, with the levels Protocol or Standard, Algorithm, Architecture, Micro-architecture and Circuit; domain-specific engines (Baseband Processing, Crypto Engine, Video Engine), Memory, Reconfigurable Interconnect, and the Hardware and Software layers.]

Figure 8-1. Example RINGS Architecture.

The proposed RINGS architecture [16] is an architecture platform that gives the designer the option to explore the energy-flexibility trade-offs. An example is shown in Fig. 8-1. A RINGS architecture contains a heterogeneous set of building blocks: programmable cores (both DSPs and micro-controllers), programmable and/or reconfigurable hardware accelerator units, specialized IP building blocks, front-end blocks, and so on. When designing a solution based on RINGS, it is important that the domain expert has the freedom to select the appropriate level of flexibility, ranging from fully programmable approaches, such as embedded micro-controllers or FPGA blocks, to highly optimized IP blocks. For different domains, the flexibility will be supported in different ways, as domains have different characteristics. This domain-specific flexibility can be expressed as a domain-specific abstraction pyramid, as shown for Networking, Video, and Signal Processing in Fig. 8-1. In the case of Video, the engine will consist of elements expressed in the Video pyramid, for example dedicated co-processors.

The SOC is connected together at the top level by a supervising software program, which typically runs on an embedded micro-controller. At the bottom level, the reconfigurable interconnect glues it together. The programming paradigm used in RINGS is a reconfigurable network-on-chip. In this network too, flexibility can be traded for energy efficiency at different levels of abstraction. Designers can instantiate an arbitrary network of 1D and 2D router modules, leading to an architecture such as the one illustrated in Fig. 8-2.

[Figure: processors (Proc A, Proc B, Proc X, Proc Y) connected through router modules.]

Figure 8-2. Example of Network-on-chip.

This network illustrates the three binding time concepts. At the level of configuration, the static network architecture with routers is instantiated. Reconfiguration is done by reprogramming the routing tables in each node, and programming by giving each packet a target address. An alternative to this traditional form of reconfiguration is to use an easy-to-reconfigure physical channel. One example of this is a CDMA-based reconfigurable interconnect [6][16]. Fig. 8-3 shows a conceptual picture of a source-synchronous CDMA implementation. Each sender and receiver gets a unique spreading code. By changing the Walsh code, a different configuration is obtained. Traditional busses, which are TDMA channels, require hardware switches for reconfiguration. CDMA interconnect has the advantage that reconfiguration can occur “on-the-fly.”
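The principle behind a Walsh-code interconnect can be sketched in a few lines. This is a textbook model of code-division multiplexing, not the source-synchronous circuit of Fig. 8-3; the function names and the two-sender scenario are our own assumptions.

```python
def walsh_codes(n):
    """Rows of the n x n Walsh-Hadamard matrix (n a power of two), entries +1/-1."""
    codes = [[1]]
    while len(codes) < n:
        codes = ([row + row for row in codes]
                 + [row + [-x for x in row] for row in codes])
    return codes

def spread(bit, code):
    """Map bit 0/1 to -1/+1 and multiply it by the sender's code."""
    symbol = 1 if bit else -1
    return [symbol * c for c in code]

def bus(*signals):
    """The shared channel carries the chip-wise sum of all senders."""
    return [sum(chips) for chips in zip(*signals)]

def despread(channel, code):
    """Correlate with a code; the sign of the correlation recovers the bit."""
    corr = sum(x * c for x, c in zip(channel, code))
    return 1 if corr > 0 else 0

codes = walsh_codes(4)
# Two senders transmit at the same time, each with its own code.
channel = bus(spread(1, codes[1]), spread(0, codes[2]))
from_first = despread(channel, codes[1])   # receiver tuned to the first sender
from_second = despread(channel, codes[2])  # same receiver after changing its code
```

Because the codes are orthogonal, a receiver selects its sender purely by the code it correlates with, so changing the code changes the connection without any hardware switch.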

Figure 8-3. Reconfigurable Interconnect (a) TDMA (b) SS-CDMA Bus Interface [1].

8.3 ULTRA LOW POWER COMPONENTS

The focus of this section is on the architecture design options for ultra low power processor components, in many cases without losing performance.

DSP processors have real-time constraints or need to maximize their throughput for a given task while at the same time minimizing the power or energy consumption. Therefore, the design of DSP processors is very challenging, as it has to take into account contradictory goals: an increased throughput request at a reduced energy budget. On top of that, there are new issues due to very deep submicron technologies, such as interconnect delays and leakage. For instance, hearing aids used analog filters 15 years ago and were designed as digital ASIC-like circuits 5 years ago. Today they are designed with powerful DSP processors operating below 1 Volt and 1 mW of power consumption [8]. Hearing aid companies require DSP processors because they require flexibility, i.e. the ability to program the applications in-house.

The design of ultra-low power DSP cores has to be performed at all design levels, i.e. the system, architecture, circuit and technology levels. In this section we focus on DSP architectures, but VHDL implementations as well as cell libraries are important too. Latch-based implementations including gated clocks described in VHDL or Verilog, low-power standard cell libraries and leakage reduction circuit techniques are necessary to reduce power consumption at these low levels.

Various DSP architectures can be, and have been, proposed to reduce the power consumption significantly while keeping the largest throughput. Beyond the single-MAC DSP core of 5-10 years ago, it is well known that parallel architectures with several MACs working in parallel allow the designers to reduce the supply voltage and the power consumption at the same throughput. This is why many VLIW or multitask DSP architectures have been proposed and used, even for hearing aids. The key parameter to benchmark these architectures is the number of simple operations executed per clock cycle, up to 50 or more. However, there are some drawbacks. The very large instruction words, up to 256 bits, significantly increase the energy per memory access. Some instructions in the set are still missing for new, better algorithms. Finally, the growing core complexity and transistor count become a problem because leakage is roughly proportional to the transistor count.
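The argument for parallel MACs can be made concrete with the usual first-order model of CMOS switching power, P = C * Vdd^2 * f. The sketch below is only that first-order model: it ignores leakage and the overhead of the extra hardware, and the voltages, capacitance and rate are hypothetical.

```python
def dynamic_power(cap, vdd, freq):
    """First-order CMOS switching power: P = C * Vdd^2 * f."""
    return cap * vdd ** 2 * freq

def parallel_power(n_units, cap, vdd, total_rate):
    """n_units identical MACs, each clocked at total_rate / n_units."""
    return n_units * dynamic_power(cap, vdd, total_rate / n_units)

CAP = 1e-12    # switched capacitance per MAC (hypothetical)
RATE = 100e6   # required MAC operations per second (hypothetical)

single = parallel_power(1, CAP, 1.4, RATE)  # one MAC at a high supply voltage
quad = parallel_power(4, CAP, 0.8, RATE)    # four slower MACs at a lower voltage
ratio = quad / single
```

At equal throughput the ratio reduces to (0.8/1.4)^2, about 0.33 in this example, which is the quadratic benefit of the lower supply voltage that the slower clock allows. The fourfold transistor count, however, is exactly what raises leakage.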

To be significantly more energy efficient, there are basically two ways, each of which, however, impacts either flexibility or the ease of programming:

1. To design specific, very small DSP engines for each task, in such a way that each DSP task is executed in the most energy efficient way on the smallest piece of hardware [9]. For N DSP tasks within a given application, the resulting architecture will be N co-processors or hardware accelerators around a controller or a simple DSP core, as illustrated in Fig. 8-1.

2. To design reconfigurable architectures such as the DART cluster [3], in which configuration bits allow the user to modify the hardware in such a way that it fits the executed algorithms much better. Fig. 8-4 shows an example.

Figure 8-4. Hardware Reconfiguration Example [3].

Option 1 is definitely the best one regarding power consumption. Each DSP task uses the minimal number of transistors and transitions to perform its work. The control code that is unavoidable in every application is also efficiently executed on the controller or on the simple DSP, and some unexpected DSP tasks can be executed on the simple DSP if no accelerator is available. However, the main issue is the software mapping of a given application onto so many heterogeneous processors and co-processors (see Section 8.4). The transistor count could be high, and some co-processors are fully useless for some applications. Regarding leakage, unused engines have to be cut off from the supply voltages, resulting in complex procedures to start and stop them.

Reconfigurable DSP architectures are much more power efficient than FPGAs. The key point is to reconfigure only a limited number of units within the DSP core, such as some execution units and addressing units [11]. The latter are interesting, as fetching operands from memory is generally a severe bottleneck in parallel machines, for which 8-16 operands are required each clock cycle. So, sophisticated addressing modes can be dynamically reconfigured depending on the DSP task to be executed. Fig. 8-5 shows an example in which several addressing modes can be reconfigured depending on the user’s algorithms. This AGU (Address Generation Unit) contains 4 index registers (a0 to a3), 4 offset registers (o0 to o3) and 4 modulo registers (m0 to m3). All these registers can be used to generate a given addressing mode and to compute AGU register updates. The VLIW AGU operation register (AGUOP) is controlled by an AGU reconfiguration register (i0 to i3) that can be reconfigured at any time and allows the programmer to generate new addressing modes. Fig. 8-5 shows two examples of AGU computations. In the first example, register i0 contains configuration data such that the multiplexers and the PREAD adder are configured to generate address a0 + (o2>>1), while at the same time registers a1, a3 and o3 are updated with new values computed through the POSAD1, POSAD2 and PREADR ALUs. The POSAD1 ALU is used to generate WP1 = (a1+o3) modulo m2, while the POSAD2 ALU is used to generate WP2 = m3 + o2<<2, and the result of PREADR is used to update register a0. The second example (i2) generates WP2 using both the POSAD1 and POSAD2 ALUs connected in series. The operation (a0-o2)%m0 is performed in the POSAD1 ALU, while adding o3 is performed in the POSAD2 ALU. This flexibility allows the programmer to generate very complex addressing modes that are not available in conventional DSP cores, whose addressing modes are defined only in their instruction sets.

[Figure: AGU datapath with the A, O and M register banks, read ports RP1 to RP7, write ports WP1 and WP2, results RES1 and RES2, the POSAD1, POSAD2 and PREAD units, and the VLIW AGU reconfigurable instruction registers (in = AGUOP, n = 0..3, selected by SELAGUOP) producing the data memory address DM ADDR.]

Examples of in operations:

i0: DM ADDR = a0+(o1>>1), WP1: a1 = (a1+o3)%m2, WP2: o3 = m3 + o2<<2, WP3: a0 = a0+(o1>>1)

i2: DM ADDR = a2+o1, WP1: none, WP2: a0 = (a0-o2)%m0+o3, WP3: a2 = a2+o1

Figure 8-5. Addressing Modes Reconfiguration Example (MACGIC DSP).

However, the power consumption is necessarily increased due to the relatively large number of reconfiguration bits that have to be loaded into the configuration registers. Similarly, the reconfigurable units are necessarily more complex than non-reconfigurable units in terms of transistor count and therefore consume more. Software issues are also difficult, as users can define new instructions or new addressing modes that are difficult for the development tools to support.
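The two AGU configurations listed in Fig. 8-5 can be emulated in software to see what such a reconfigurable addressing mode computes. The sketch follows the operations as listed in the figure and rests on three assumptions of ours: all write ports read the register values from before the update, `m3 + o2<<2` means `m3 + (o2 << 2)`, and the modulo result is non-negative. The register values are hypothetical.

```python
def agu_i0(r):
    """Configuration i0 as listed in Fig. 8-5; r maps register names to integers."""
    addr = r["a0"] + (r["o1"] >> 1)             # DM ADDR
    new = dict(r)                               # all updates read the old values
    new["a1"] = (r["a1"] + r["o3"]) % r["m2"]   # WP1
    new["o3"] = r["m3"] + (r["o2"] << 2)        # WP2 (shift before add, assumed)
    new["a0"] = addr                            # WP3
    return addr, new

def agu_i2(r):
    """Configuration i2: the two post-modification units are chained for WP2."""
    addr = r["a2"] + r["o1"]                    # DM ADDR
    new = dict(r)
    first_stage = (r["a0"] - r["o2"]) % r["m0"] # first unit: modulo step
    new["a0"] = first_stage + r["o3"]           # second unit: add o3 (WP2)
    new["a2"] = addr                            # WP3
    return addr, new

regs = {"a0": 16, "a1": 5, "a2": 32, "a3": 0,
        "o0": 0, "o1": 6, "o2": 3, "o3": 4,
        "m0": 8, "m1": 0, "m2": 7, "m3": 100}

addr0, after_i0 = agu_i0(regs)
addr2, after_i2 = agu_i2(regs)
```

Each configuration produces one data memory address and up to three register updates in a single step, which a conventional fixed addressing mode would need several instructions to express.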

8.4 DESIGN & ARCHITECTURE EXPLORATION

The way a system behaves depends on the architecture, the way the applications are written, and how these applications are mapped onto the architecture, as compactly expressed by the Y-chart [5]. Examples of architectures for low power have already been given in other sections. On such architectures, mapping is typically done, in the case of reconfigurable fabrics, by the behavioral synthesis tool and the place and route tools. In the case of DSPs and CPUs, the mapping is typically performed by C compilers dedicated to a particular type of DSP or CPU. An important question remains: how to specify the applications so that they can take advantage of the architecture in an effective manner.

A low-power architecture will typically employ different levels of parallelism, like bit-level parallelism, instruction-level parallelism or task-level parallelism, to take advantage of voltage scaling as already explained in the previous section. To successfully map a DSP application at a high level, the application needs to express task-level parallelism. This parallelism is typically not present, as the applications are written in sequential languages like C or Matlab. Therefore, mapping them is often a manual process that is very tedious and time consuming, leading to a suboptimal system.

A designer would like to have tool support that automatically converts the sequential specification into a parallel format. Moreover, the tool should allow the designer to ‘play’ with the amount of parallelism extracted from the specification. In general, such tools are lacking in embedded system design. Some companies, like Pico and Art (ARM/Adelante), try to provide limited commercial solutions, but this field is still very much subject to research. The Compaan tool suite [13] aims at providing designers the option to play with parallelism for applications that are so-called “Nested Loop Programs”, a very natural fit for DSP applications. A DSP application is specified in a subset of Matlab and is automatically converted by Compaan into a network of parallel processes. These processes can be specified in C and mapped, using a conventional C compiler, onto a DSP or CPU. On the other hand, they can also be specified in VHDL and mapped using the appropriate tools onto some reconfigurable fabric or realized as a dedicated IP core [19]. Hence, “programming” the RINGS architecture is reduced to putting some processes onto the CPUs and DSPs while others are mapped onto FPGAs or use dedicated IP cores.

There are many ways to find parallelism in the application and to partition the processes over the CPUs, DSPs and reconfigurable resources. Being able to explore these options early in the design phase is crucial to obtaining efficient embedded low-power systems. To allow designers to do this exploration, Compaan is equipped with a suite of techniques [14], like Unfolding, Skewing and Merging, that let designers play with the level of parallelism exposed in the derived network of processes. Skewing and Unfolding increase the amount of parallelism, while Merging reduces parallelism. By applying these techniques, many different networks can be created that can be mapped in different ways onto the architecture. When the techniques are applied in a systematic way, the design space can be explored and the best performing network of processes can be picked.
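The effect of Unfolding and Merging on the amount of exposed parallelism can be illustrated with a deliberately simplified model. The sketch below is not the Compaan implementation or its interface: a process is reduced to the list of loop iterations it executes, every iteration is assumed to cost one time unit, and communication and data dependences are ignored. Skewing is not modeled.

```python
def unfold(process, k):
    """Split one process into k processes, dealing its iterations round-robin."""
    return [process[i::k] for i in range(k)]

def merge(p, q):
    """Fuse two processes into one that executes both workloads sequentially."""
    return p + q

def makespan(network):
    """Idealized run time: the longest process, ignoring communication."""
    return max(len(p) for p in network)

work = list(range(12))  # iterations of a hypothetical nested loop program

# A systematic sweep over the unfolding factor gives one network per factor.
spans = {k: makespan(unfold(work, k)) for k in (1, 2, 4)}

# Merging the two halves again yields a single process with all iterations.
merged = merge(*unfold(work, 2))
```

In this idealized model the run time drops from 12 to 3 time units as the unfolding factor grows from 1 to 4; a real exploration would weigh this against the extra resources and communication each network needs.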

The difference in utilization of the architecture for a particular network

can be huge. By rewriting a DSP application (like Beam-forming) using the

presented techniques, we are able to achieve performances on a QR

algorithm (7 Antenna’s, 21 updates) ranging from 12MFlops to 472MFlops.

We realized QR using commercial floating point IP cores from QinetiQ,

which are pipelined in 55 (Rotate) and 42 (Vectorize) stages. We achieved

this performance increase without doing anything to the architecture or

mapping tools, but only by playing with the way the QR application is

written, effectively improving the way the pipelines of the IP cores are

utilized. Using a system like Compaan, an experienced designer should be

able to obtain networks with very different performance within days, and so

has the opportunity to explore different systems and pick the one that uses the

least amount of power.

8.5 DOMAIN-SPECIFIC CO-DESIGN ENVIRONMENTS

As discussed in the previous section, parallelism and distributed

processing are key to energy efficient architectures. Because the ensemble of

architecture elements (processors, busses, memories) cooperate towards a

common application, the designer faces a considerable co-simulation and co-

design problem. A key requirement is to have a good design model. Such a

model allows building of simulation tools, compilers and code generators.

We will look at a highly successful design model for programmable systems:

the instruction-set architecture (ISA). Next we will consider the approach

taken by the RINGS architecture.

In a classic von Neumann architecture, the instruction-set architecture

(ISA) model maintains a single, consistent and abstracted view of the

operation of the system. Such a view ties four independent architecture

concepts together: control, interconnect, storage, and data operations [15].

This way the ISA becomes a template for the underlying target architecture,

for which compiler algorithms (scheduling, etc.) can be developed. Often,

however, the ISA is unable to offer the right target template in terms of

parallelism, storage capabilities or other aspects.

In the RINGS architecture, we do not use an ISA as an intermediate

design model, but approach each of the four components that make up an

ISA independently. We enumerate them below and look at the requirements

they impose on co-simulation and co-design.

• Data Operations: Energy efficient operation requires us to specialize each

operator as much as possible. A RINGS system contains multiple

processing cores. These can include hardwired or programmable (DSP or

RISC) processors. We thus need to be able to combine instruction-set

simulation with hardware simulation.

• Storage: Energy efficient operation requires us to distribute storage. In

addition to the high-level design transformations discussed in the

previous section, we aim to minimize storage bandwidth and use

multiple distributed memories. Each processor in RINGS will work

inside of a private memory space. Many operations in multimedia can be

implemented with dedicated storage architectures that take only a

fraction of the energy cost of a full-blown ISA. Examples are matrix

transposition or scan-conversion. Such dedicated storage can be captured

as a hardwired processor.

• Interconnect: The energy efficient interconnect architecture discussed in

section 2 requires explicit expression of interconnect operations – in

contrast to an ISA where this is implicitly encoded in the instruction

format. A network-on-chip can be modeled as a dedicated hardware

architecture [1]. On top of the network-on-chip a suitable network

protocol must be implemented, for example message-passing with the

MPI standard [7]. However, this protocol is also subject to specialization

and/or hard-coding. For example, a hardwired DCT coding unit attached

to a DSP core through RINGS will have a fixed communication pattern.

This pattern can be hard-coded in a collapsed and optimized protocol

stack.

• Control: Energy efficient operation requires us to split the data-flow and

control-flow in a RINGS architecture and handle them independently.

Fig. 8-6 clarifies this point. It shows the effect of moving an AES

encryption operation gradually from high-level software (Java)

implementation to dedicated hardware implementation, while at the same

time maintaining the interface to the high level Java model. It can be seen

that the interface overhead goes from 0.8% for a C-accelerated AES to

8000% for a hardware-accelerated AES! This overhead obviously is

caused by all the interfaces moving data from Java to C to hardware and

back. With the MPI message passing scheme, we have the freedom to

route control flow and data flow independently as messages. This way,

we can eliminate or minimize this interface overhead.

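[Figure 8-6: cycle counts for Rijndael (AES) encryption at three levels of acceleration. Total cycles: 301,034 (Java), 44,430 (C acceleration), 903 (co-processor acceleration). The C-accelerated total splits into 367 interface cycles and 44,063 Rijndael cycles; the co-processor total includes 892 interface cycles.]

Figure 8-6. Overhead of Tightly Coupled Data/Control Flow.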
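The overhead percentages quoted for Fig. 8-6 can be re-derived from the cycle counts in the figure. The short Python sketch below does this arithmetic. It rests on two assumptions that are not stated explicitly in the text: the 367 and 892 interface cycles belong to the C-accelerated and co-processor cases respectively (this matches the totals), and the co-processor Rijndael count is the total minus the interface cycles.

```python
# Cycle counts read from Figure 8-6.
java_total = 301_034                     # pure Java implementation
c_interface, c_rijndael = 367, 44_063    # C-accelerated AES
hw_interface, hw_total = 892, 903        # co-processor-accelerated AES

# Assumption: the co-processor computation itself is total minus interface.
hw_rijndael = hw_total - hw_interface

# Interface overhead relative to the useful Rijndael cycles, in percent.
c_overhead = 100.0 * c_interface / c_rijndael
hw_overhead = 100.0 * hw_interface / hw_rijndael

print(f"C acceleration:            {c_overhead:.2f}% overhead")
print(f"Co-processor acceleration: {hw_overhead:.0f}% overhead")
print(f"Overall speed-up over Java: {java_total / hw_total:.0f}x")
```

Under these assumptions the overhead comes out at roughly 0.8% and slightly above 8000%, in line with the figures given in the text.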

When we put the elements together, we conclude that the RINGS co-

design environment should accommodate multiple instruction-set simulators

with user-specified hardware models. All of these must be embedded in a

model of an on-chip network. The timing accuracy of the simulation should

be precise enough to simulate interactions such as network-on-chip

communication conflicts. On the other hand, the simulation must also be fast

enough to support reasonable design exploration capabilities.

We have built the ARMZILLA environment to evaluate one class of

RINGS architectures, namely those that can be built with one or more ARM

cores, a network-on-chip, and dedicated hardware processors. Fig. 8-7

illustrates the ARMZILLA setup. There are three components: a hardware

simulation kernel (GEZEL), one or more instruction-set simulators (ISS),

and a configuration unit. The GEZEL kernel [4] captures hardware models

with the FSMD (Finite-State-Machine with Datapath) model-of-

computation. It uses a specialized language and a scripted approach to

promote interactive design exploration. The cycle-true models of GEZEL

can also be automatically converted to synthesizable VHDL. For the ARM

ISS we use the cycle-true SimIT-ARM environment [10]. The ARM ISS

uses memory-mapped channels to connect to the GEZEL hardware models.

Finally, the configuration unit specifies a symbolic name for each ARM ISS,

and associates each ISS with an executable. This way the memory-mapped

communication channels can be set up, and the hardware GEZEL models

can address each ARM memory space uniquely.

[Figure 8-7: block diagram of ARMZILLA. Elements shown: several ARM ISS instances, the GEZEL kernel, the configuration unit (Config), a compiler, memory-mapped channels, hardware processors and the network-on-chip.]

Figure 8-7. The ARMZILLA Design Environment for ARM-based RING Processors.

An example of what can be done with the ARMZILLA environment is

shown in Table 8-1. This table shows cycle counts that were obtained after

partitioning a JPEG encoding algorithm. The reference implementation runs

on a single-ARM ISS model. In the second implementation, we separate the

chrominance and luminance channels over two ARM processors. This seems

a logical partition that splits the data operations roughly in two parts. But, it

also creates a communication bottleneck in the on-chip network and the

resulting implementation becomes slower than the O3-level optimized

single-processor implementation. The third implementation shows a better

partitioning. In this case, the data streams are routed out of the ARM and

into dedicated hardware processors for JPEG encoder subtasks. These

processors can communicate directly amongst themselves.

All these simulations are cycle-accurate yet they can run efficiently. For

the H.264 decoding on a dual ARM with network-on-chip for example,

ARMZILLA offers a simulation speed of 176K cycles per second. The

simulation speed varies with the complexity of the hardware model used. A

single, stand-alone SimIT-ARM simulator runs at 1 MHz cycle-true on a

3GHz Pentium.

Table 8-1. Multiprocessor JPEG Encoding Performance

Partition                                               Cycle count (64x64 block)
One single ARM                                          1.223 M
Dual ARM using split chrominance/luminance channels     1.336 M
Single ARM with color conversion, transform coding,
Huffman coding as stand-alone hardware processors

8.6 CONCLUSIONS

In this chapter, we presented architecture design and design exploration

for low power systems-on-chip. Low power is obtained by tuning all

components of the architecture (datapaths, control, memory and

interconnect) to the application. This can occur at different levels of

abstraction. The design of this type of SOC requires support by design

models and methods. The design environments Compaan and GEZEL/ARMZILLA

illustrate supporting tools for this design space

exploration.

References

[1] D. Ching, P. Schaumont, I. Verbauwhede, “Integrated Modeling and Generation of a

Reconfigurable Network-On-Chip,” Proc. 11th Reconfigurable Architectures

Workshop, RAW 2004, Santa Fe, NM, April 2004.

[2] W. Dally, B. Towles, “Route Packets, not wires: on-chip interconnection networks,”

Proc. DAC 2001.

[3] R. David et al., “Low-Power Reconfigurable Processors,” Chapter 20 in “Low Power

Electronics Design,” edited by C. Piguet, CRC Press, 2004.

[4] GEZEL kernel, http://www.ee.ucla.edu/~schaum/gezel

[5] B. Kienhuis, et al., “A Methodology to Design Programmable Embedded Systems,”

LNCS, Vol 2268, Nov. 2001.

[6] J. Kim, et al., “A 2-Gb/s/pin Source Synchronous CDMA Bus Interface with

simultaneous Multi-Chip Access and Reconfigurable I/O capability,” CICC, Sept 2003.

[7] MPICH – A portable implementation of MPI, http://www.unix.mcs.anl.gov/mpi/mpich/

[8] P. Mosch et al., “A 720 µW 50 MOPS 1V DSP for a Hearing Aid Chip Set,” Proc.

ISSCC, pp. 238-239, Feb. 2000.

[9] Özgün Paker et al., “A heterogeneous multi-core platform for low power signal

processing in systems-on-chip,” ESSCIRC 2002.

[10] W. Qin, S. Malik, “Flexible and Formal Modeling of Microprocessors with Application

to Retargetable Simulation,” Proceedings of DATE 2003, Mar, 2003, pp.556-561.

[11] F. Rampogna et al., “Magic, a Low-Power, re-configurable DSP”, Chapter 21 in “Low

Power Electronics Design”, ed. C. Piguet, CRC Press, 2004.

[12] P. Schaumont, I. Verbauwhede, M. Sarrafzadeh, K. Keutzer, “A quick safari through

the reconfiguration jungle,” Proceedings DAC 2001, pg. 172-177, June 2001.

[13] T. Stefanov, C. Zissulescu, A. Turjan, B. Kienhuis, E. Deprettere, “System Design

using Kahn Process Networks: The Compaan/Laura Approach,” DATE 2004, Feb 2004,

Paris, France.

[14] T. Stefanov, B. Kienhuis, E. Deprettere, “Algorithmic Transformation Techniques for

Efficient Exploration of Alternative Application Instances,” Proc. CODES'2002,

Colorado, May 2002.

[15] I. Verbauwhede, J. M. Rabaey. “Synthesis of Real-Time Systems: Solutions and

challenges” Journal of VLSI Signal Processing, Vol. 9, No. 1/2, Jan. 1995, pp. 67-88.

[16] I. Verbauwhede, M.C. F. Chang, “Reconfigurable Interconnect for next generation

systems”, Proc. SLIP, pp. 71-74, April 2002.

[17] Xilinx: Virtex-II-Pro Platform FPGAs: Introduction and Overview and Functional

Description, Aug. 2003, Oct. 2003, www.xilinx.com/bvdocs/publications/ds083-1.pdf,

ds083-2.pdf.

[18] H. Zhang, et al., “A 1V Heterogeneous Reconfigurable Processor IC for Baseband

Wireless Applications,” IEEE Journal on Solid State Circuits, November 2000.

[19] C. Zissulescu, et al., “Laura: Leiden Architecture Research and Exploration Tool,”

Proc. FPL 2003.

Chapter 9

SOURCE-LEVEL MODELS FOR SOFTWARE POWER OPTIMIZATION

Carlo Brandolese, William Fornaciari and Fabio Salice
Politecnico di Milano

Abstract This chapter presents a methodology and a set of models supporting energy-driven source-to-source transformations. The most promising code transformation techniques have been identified and studied, leading to accurate analytical and/or statistical models. Experimental results obtained for some common embedded-system processors over a set of typical benchmarks are discussed, showing the value of the proposed approach as a support tool for embedded software design.

Keywords: Software optimization, Power optimization, Source-level modeling

9.1 INTRODUCTION

In a growing number of complex heterogeneous embedded systems the relevance of the software component is rapidly increasing. Issues such as development time, flexibility and reusability are, in fact, better addressed by software-based solutions. Another trend that is significantly pushing designers to move as much functionality as possible toward software is the increased interest in platform-based designs. In such systems much of the architecture is fixed and can only be configured to match the design constraints. The greatest part of the application-specific functionality is thus naturally shifted from dedicated hardware components to software programs. In such a scenario it is clear that the importance of software is steadily increasing and poses new problems to designers. Though performance, in the sense of computational efficiency, is still the foremost requirement for many embedded systems, power consumption is gaining more and more attention. Optimization of the code is thus one of the key points and is currently addressed almost only by means of compilation techniques. It is still not uncommon for designers to manually code critical sections of the application directly in assembly. The recent technical literature proposes a different approach, based on source-to-source transformations aimed at improving code quality either directly or by enabling better compiler optimizations. Source code transformations are extremely complex to automate since they require a thorough semantic analysis of the code fragments to be optimized. This chapter proposes a sound and flexible methodology for the analysis of the effect of source-to-source transformations, mostly aimed at allowing rapid and accurate design space exploration. The proposed approach is based on a wide set of models studied to decouple the processor-independent analysis from all technology-specific aspects.

9.2 TRANSFORMATIONS OVERVIEW

Source-to-source transformations presented in the literature can be grouped into four main areas according to the code structures they operate on: loops, data structures, procedures, and control structures and operators. It is worth noting that not all the transformations are interesting when operating at source level, since some of them can just as well be performed at RT or assembly level and are thus performed by modern compilers. The most promising transformations, either found in the literature [1, 2] or studied in the present work, are summarized in the following. Particular attention must be devoted to loop transformations [3–6] since most of the execution time of a program is spent in loops.

Loop unrolling replicates the body of a loop a given number of times U (the unrolling factor), and modifies the iteration step from 1 to U. The transformation impacts on energy in two ways: on one hand, it reduces loop overhead by performing fewer compare and branch instructions; on the other hand, it allows the compiler to optimize better and to improve register usage in the larger loop body.

Loop distribution breaks a single loop into multiple loops with the same iteration range but each enclosing only a subset of the statements in the original loop. Distribution is used to create sub-loops with fewer dependencies, improve instruction cache and instruction TLB locality due to shorter loop bodies, reduce memory requirements by iterating over fewer arrays and improve register usage by decreasing register pressure.

Loop fusion performs the opposite action of distribution, i.e. merging. It reduces loop overhead, increases instruction parallelism, and improves register, data cache, TLB or page locality. It also improves the load balance of parallel loops.

Loop interchange exchanges the position of two loops in a loop nest, generally moving one of the outer loops to the innermost position. It is one of the most valuable transformations and can improve performance in many ways: it enables and improves vectorization, increases data access locality and increases the number of loop-invariant expressions in the inner loop.

Loop tiling improves memory locality, primarily at the cache level, by accessing matrices in N×M sized tiles rather than as a whole. It also improves processor, register, TLB, and page locality.

Software pipelining breaks the operations of a single loop iteration into S stages, and arranges the code in such a way that stage 1 is executed on the instructions originally belonging to iteration i, stage 2 on those of iteration i − 1, etc. Startup code must be generated before the loop to initialize the pipeline for the first S − 1 iterations and cleanup code must be generated after the loop to drain the pipeline for the last S − 1 iterations.

Loop unswitch is applied when a loop contains a branch with a loop-invariant test condition. The loop is then replicated inside each branch of the conditional, saving the overhead of conditional branching inside the loop, reducing the code size of the loop body, and possibly enabling the parallelization of one or both branches.
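To make the shape of two of these transformations concrete, the following sketch shows loop unrolling (with U = 4 and the cleanup loop needed when the iteration count is not a multiple of U) and loop fusion (two loops over the same range merged into one). Python is used here only to illustrate the source-level rewriting; the function names are hypothetical, and the energy effects discussed in this chapter concern compiled code on embedded processors, not interpreted Python.

```python
def scale_rolled(a, k):
    """Original loop: one element per iteration."""
    out = [0] * len(a)
    for i in range(len(a)):
        out[i] = k * a[i]
    return out


def scale_unrolled(a, k):
    """Same loop unrolled by U = 4: fewer loop-control operations."""
    n = len(a)
    out = [0] * n
    main = n - n % 4              # largest multiple of 4 not exceeding n
    for i in range(0, main, 4):   # iteration step changed from 1 to U
        out[i] = k * a[i]
        out[i + 1] = k * a[i + 1]
        out[i + 2] = k * a[i + 2]
        out[i + 3] = k * a[i + 3]
    for i in range(main, n):      # cleanup for the remaining n % 4 elements
        out[i] = k * a[i]
    return out


def sums_split(a, b):
    """Two separate loops with the same iteration range."""
    s = 0
    for i in range(len(a)):
        s += a[i]
    t = 0
    for i in range(len(b)):
        t += b[i] * b[i]
    return s, t


def sums_fused(a, b):
    """Fused loop: legal because the two bodies are independent."""
    s = t = 0
    for i in range(len(a)):       # assumes len(a) == len(b)
        s += a[i]
        t += b[i] * b[i]
    return s, t
```

Reading `sums_fused` back into `sums_split` gives the corresponding example of loop distribution.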

The second class collects a number of data-structure and memory access transformations [7, 6].

Local to global array promotion allows compilers to use simpler addressing modes since the address of a global array does not depend on the stack pointer.

Scratch-pad array introduction has the goal of storing the most frequently accessed array elements in a smaller array (the scratch-pad) to improve spatial locality.

Multiple indirection elimination identifies common chains of indirections and stores the address into a temporary variable.

The third group gathers those transformations [7] impacting on procedures and functions.

Function inlining replaces calls to the most frequently invoked functions with the function body. Inline expansion increases the spatial locality and decreases the number of function calls. This transformation increases the number of unique references, which may result in more misses. However, a decrease in the miss rate may also occur, since, without inlining, the callee code might replace the caller code in the instruction cache.

Soft inlining is an intermediate solution between function calling and inlining. The transformation replaces calls and returns with jumps. This reduces the code size with respect to inlining and eliminates context switching overheads.

Code linking directives can be used to suitably reorder the objects of different functions to match the dynamic call graph as closely as possible. This potentially leads to a reduction in instruction misses.

Most of the transformations in the last group are usually performed by compilers. Nevertheless, some of them can still be conveniently considered when operating at source level [7, 8].

Conditional sub-expression reordering exploits shortcut evaluation of conditions usually performed by compilers. The transformation operates by reordering the sub-expressions according to their probability of being true (for OR conditions) or false (for AND conditions). This reduces the number of instructions executed.

Special cases pre-evaluation allows avoiding a function call (usually a mathematical library function) when the argument has a special value for which the result is known. This is done by defining suitable macros testing for the special cases and leads to a reduction of actual calls.

Special cases optimization replaces calls to generic library or user-defined functions with optimized versions, suitable for common special cases. As an example, raising an integer to a power can be coded more efficiently than the same operation on real numbers.
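The last two transformations can be sketched as follows. The wrapper `fast_pow` and the routine `int_pow` are hypothetical names introduced for illustration only; in C the pre-evaluation would typically be a macro, as the text says, while here an ordinary function plays that role. Exponentiation by squaring is one common way to specialize integer powers and is an assumption of this sketch, not necessarily the technique used by the authors.

```python
import math


def fast_pow(x, y):
    """Special cases pre-evaluation: skip the library call when the
    result is known from the value of the exponent alone."""
    if y == 0:
        return 1.0
    if y == 1:
        return float(x)
    if y == 2:
        return float(x * x)
    return math.pow(x, y)         # generic library call for all other cases


def int_pow(base, exp):
    """Special cases optimization: integer power by repeated squaring.
    Valid for integer exp >= 0 only."""
    result = 1
    while exp > 0:
        if exp & 1:               # current exponent bit set: multiply it in
            result *= base
        base *= base              # square the base for the next bit
        exp >>= 1
    return result
```

With this scheme `int_pow` needs a number of multiplications proportional to the number of bits of the exponent, whereas a generic real-valued power routine usually goes through logarithms and exponentials.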

9.3 METHODOLOGY

Transformations applied to source code might lead to very different results depending on a number of factors: the specific structure of the code, the target architecture, the parameters of the transformations, etc. Furthermore, it is not unusual that a transformation applied to the source code as it is leads to poor or no energy reduction, while, when applied to pre-transformed code, its effectiveness is greatly increased. Thus sequences of transformations should be considered, rather than single transformations. For this reason it is crucial to explore different transformations and sequences of transformations in terms of their energy reduction efficiency. The exploration strategy should make it easy to modify the parameters of the transformation and of the target technology, and thus lead to a quick estimate of the expected benefits.

9.3.1 Conceptual Flow

Figure 9.1 shows the conceptual scheme of the estimation flow. The source code is processed and its relevant characteristics are extracted by means of a lexical and syntactical analysis leading to the set of code parameters. Typical parameters are code size, loop body size, number of paths, number of loop iterations, etc.

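[Figure 9.1: flow diagram. The source code and the transformation parameters enter Code Analysis, which produces ∆I, ∆Minst and ∆Mdata; these figures and the technology parameters enter Energy Estimation, which produces ∆E.]

Figure 9.1. Phases of the methodology flow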

The designer then chooses the transformation parameters such as unroll factor, tiling size, etc. and, finally, selects the target technology from a set of libraries. Such libraries are collections of technology parameters specifying architectural figures such as cache sizes, bus width, etc. and electrical figures such as power supply voltages, average core currents, bus and memory capacitances, etc. Based on all this data, the estimation models first provide the three dimensionless figures ∆I, ∆Minst and ∆Mdata expressing the variations of the number of instructions executed, of the number of instruction cache misses and of the number of data cache misses, respectively. These figures, though still rather abstract, already provide the designer with an indication of the potential benefits of a given transformation. To account for the target technology as well, the variations are fed to a set of models, depending on the technology parameters, leading to an estimate of the energy reduction ∆E deriving from the application of the considered transformation.

9.3.2 Technology Models

Experimental results have shown that the energy consumption of an embedded system based on a processor executing some programs can be approximated by considering three major contributions: the processor core and its on-chip caches, the system bus and the main memory. All these components can be modeled at different levels of accuracy by means of equations that involve two sets of parameters: those strictly related to the specific technology and those summarizing the properties and the behavior of the code being executed. In particular, as outlined above in the description of the conceptual flow, the energy estimates can be based on three execution parameters only: the number of assembly instructions executed and the number of instruction and data cache misses. Though simple, the adopted models provide satisfactory results, especially when considering energy variations rather than absolute values. The technology parameters considered and used in the models adopted for the CPU, the cache, the bus and the main memory are summarized in Table 9.1.

Table 9.1. Technology parameters

Symbol  Meaning                           Symbol  Meaning
Tck     CPU clock period                  B       Cache block size
CPI     Average CPI                       S       Cache size
Pcpu    Average CPU power                 Edec    Memory decode energy
Ctot    Total capacitance on the bus      Erw     Memory read/write energy
Vsw     Bus switching voltage             Eref    Memory refresh energy
Asw     Average bus switching activity    Vm      Memory supply voltage
W       Bus width                         Iref    Average memory refresh current

The equations, all expressed in terms of energy variations, are reported in the following using the symbols just introduced. The processor energy variation is modeled as:

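$$\Delta E_{cpu} = T_{ck} \, P_{cpu} \, CPI \, \Delta I \tag{3.1}$$

The contribution of the system bus to the energy variation, ∆Ebus, is:

$$\Delta E_{bus} = \tfrac{1}{2} \, C_{tot} \, V_{sw}^{2} \left( \Delta N_{bus,addr} + \Delta N_{bus,data} + \Delta N_{bus,inst} \right) \tag{3.2}$$

where:

$$\Delta N_{bus,addr} = A_{sw,addr} \, W_{addr} \left( \Delta M_{data} + \Delta M_{inst} \right) \tag{3.3}$$

$$\Delta N_{bus,data} = A_{sw,data} \, W_{data} \, B_{data} \, \Delta M_{data} \tag{3.4}$$

$$\Delta N_{bus,inst} = A_{sw,inst} \, W_{data} \, B_{inst} \, \Delta M_{inst} \tag{3.5}$$

Finally, the adopted memory model expresses the energy variation ∆Em as:

$$\Delta E_{m} = \Delta E_{m,data} + \Delta E_{m,inst} + \Delta E_{m,ref} \tag{3.6}$$

where:

$$\Delta E_{m,data} = \left( E_{dec} + E_{rw} B_{data} \right) \Delta M_{data} \tag{3.7}$$

$$\Delta E_{m,inst} = \left( E_{dec} + E_{rw} B_{inst} \right) \Delta M_{inst} \tag{3.8}$$

$$\Delta E_{m,ref} = T_{ck} \, V_{m} \, I_{ref} \, CPI \, \Delta I \tag{3.9}$$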
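As a minimal sketch of how Equations (3.1)–(3.9) fit together, the Python function below evaluates the three contributions from ∆I, ∆Minst and ∆Mdata. The container `Tech`, its field names and the function `delta_energy` are hypothetical, and the block sizes are assumed to be expressed in bus words so that the products in (3.4), (3.5), (3.7) and (3.8) are dimensionally meaningful. Summing the three contributions into a single ∆E is also an assumption, suggested by the "three major contributions" mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Tech:
    """Technology parameters of Table 9.1 (field names are illustrative)."""
    t_ck: float       # CPU clock period [s]
    cpi: float        # average cycles per instruction
    p_cpu: float      # average CPU power [W]
    c_tot: float      # total bus capacitance [F]
    v_sw: float       # bus switching voltage [V]
    a_sw_addr: float  # average switching activity, address bus
    a_sw_data: float  # average switching activity, data transfers
    a_sw_inst: float  # average switching activity, instruction transfers
    w_addr: int       # address bus width [bits]
    w_data: int       # data bus width [bits]
    b_data: int       # data cache block size [bus words, assumed]
    b_inst: int       # instruction cache block size [bus words, assumed]
    e_dec: float      # memory decode energy [J]
    e_rw: float       # memory read/write energy per word [J]
    v_m: float        # memory supply voltage [V]
    i_ref: float      # average memory refresh current [A]


def delta_energy(t: Tech, d_i: float, d_m_inst: float, d_m_data: float) -> float:
    """Energy variation for given variations of executed instructions
    and of instruction/data cache misses."""
    e_cpu = t.t_ck * t.p_cpu * t.cpi * d_i                            # (3.1)
    n_addr = t.a_sw_addr * t.w_addr * (d_m_data + d_m_inst)           # (3.3)
    n_data = t.a_sw_data * t.w_data * t.b_data * d_m_data             # (3.4)
    n_inst = t.a_sw_inst * t.w_data * t.b_inst * d_m_inst             # (3.5)
    e_bus = 0.5 * t.c_tot * t.v_sw ** 2 * (n_addr + n_data + n_inst)  # (3.2)
    e_mem = ((t.e_dec + t.e_rw * t.b_data) * d_m_data                 # (3.7)
             + (t.e_dec + t.e_rw * t.b_inst) * d_m_inst               # (3.8)
             + t.t_ck * t.v_m * t.i_ref * t.cpi * d_i)                # (3.9)
    return e_cpu + e_bus + e_mem
```

A negative return value corresponds to an energy saving. Because every term is linear in the three variations, the model can be evaluated very cheaply for many candidate transformations.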

9.4 CASE STUDIES

In this section, two case studies are reported: loop unrolling and loop fusion. For each transformation, the source code parameters and the model equations are reported and discussed.

9.4.1 Loop Unrolling

Loop unrolling is a parametric transformation whose results in terms of energy reduction are influenced by the unrolling factor U, i.e. the number of times the loop body is replicated to build the modified loop. The parameter U, thus, completely defines the transformation. The effects of loop unrolling clearly depend also on the characteristics of the source code being transformed. Such properties are captured by the set of source code parameters reported in Table 9.2.

Table 9.2. Source code parameters for loop unrolling

Symbol  Meaning
LI      Number of loop instructions
LS      Size of loop instructions (bytes)
LBI     Number of loop-body instructions
LBS     Size of loop-body instructions (bytes)
N       Loop iterations

The number of instructions of the original loop is:

$$I_o = N \cdot LI \tag{4.1}$$

The transformed loop executes Nt = ⌊N/U⌋ times and:

$$LI_t = LI + (U - 1) \, LBI \tag{4.2}$$

instructions per iteration. Therefore, the total number of instructions executed by the transformed loop is:

$$I_t = N_t \cdot LI_t = \left\lfloor \frac{N}{U} \right\rfloor \cdot \left[ LI + (U - 1) \, LBI \right] \tag{4.3}$$

The instruction gain obtained with unrolling is thus:

$$\Delta I = \left\lfloor \frac{N}{U} \right\rfloor \cdot \left[ LI + (U - 1) \, LBI \right] - I_o \tag{4.4}$$
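Equations (4.1)–(4.4) translate directly into a few lines of code. The sketch below, with the hypothetical function name `unrolling_delta_i`, is given only to make the arithmetic easy to check. Like the model itself, it ignores the leftover iterations when U does not divide N.

```python
def unrolling_delta_i(n: int, u: int, li: int, lbi: int) -> int:
    """Variation of executed instructions due to unrolling by a factor u.

    n   -- loop iterations (N)
    li  -- instructions per iteration of the original loop (LI)
    lbi -- instructions in the loop body only (LBI)
    """
    i_o = n * li                  # (4.1) original instruction count
    n_t = n // u                  # iterations of the unrolled loop, floor(N/U)
    li_t = li + (u - 1) * lbi     # (4.2) instructions per unrolled iteration
    i_t = n_t * li_t              # (4.3) transformed instruction count
    return i_t - i_o              # (4.4) negative values are savings


# Example: N = 100, U = 4, LI = 10, LBI = 7
print(unrolling_delta_i(100, 4, 10, 7))   # -225
```

In this example the original loop executes 1000 instructions and the unrolled one 25 · 31 = 775, so the saving of 225 instructions is exactly the loop-control overhead (LI − LBI = 3 instructions) removed from 75 iterations.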

The transformation also affects the number of instruction cache misses, owing to the increased size of the loop body. A more accurate analysis leads to the results summarized in the following, which show a non-linear dependence of the number of misses on the relative values of the loop size LS and the instruction cache size Sinst (the loop size and the number of instructions are linearly related, assuming a fixed instruction size). Three significant cases have been identified:

LS ≤ Sinst

In this case there are no capacity misses since the entire loop code can be loaded into the cache. Hence, there are only cold misses, during the first iteration. The number of instruction cache misses is thus:

$$M_{inst} = \left\lceil \frac{LS}{B_{inst}} \right\rceil \tag{4.5}$$

Sinst < LS < 2Sinst

In this case capacity misses also take place. The number of cold misses is the same as in the previous case but, in addition, for every additional iteration there are 2(LS mod Sinst)/Binst capacity misses. Therefore, the total number of misses is:

$$M_{inst} = \left\lceil \frac{LS}{B_{inst}} \right\rceil + 2 (N - 1) \left\lceil \frac{LS \bmod S_{inst}}{B_{inst}} \right\rceil \tag{4.6}$$

LS ≥ 2Sinst

The number of misses in every iteration is equal to the number of cold misses, i.e.:

$$M_{inst} = N \left\lceil \frac{LS}{B_{inst}} \right\rceil \tag{4.7}$$
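The three cases can be collected into a single function. The sketch below assumes that the ceilings in Equations (4.5)–(4.7) count cache blocks, i.e. that sizes are divided by the instruction cache block size Binst; the function and parameter names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def inst_misses(ls: int, s_inst: int, b_inst: int, n: int) -> int:
    """Estimated instruction cache misses of a loop.

    ls     -- loop size in bytes (LS)
    s_inst -- instruction cache size in bytes (Sinst)
    b_inst -- instruction cache block size in bytes (Binst)
    n      -- number of loop iterations (N)
    """
    cold = ceil_div(ls, b_inst)
    if ls <= s_inst:
        # (4.5) the loop fits in the cache: cold misses only
        return cold
    if ls < 2 * s_inst:
        # (4.6) partial overflow: capacity misses in each later iteration
        return cold + 2 * (n - 1) * ceil_div(ls % s_inst, b_inst)
    # (4.7) the loop is at least twice the cache: every iteration misses
    return n * cold


# Example with a 256-byte cache, 16-byte blocks and 100 iterations
for size in (200, 300, 600):
    print(size, inst_misses(size, 256, 16, 100))
```

For the three loop sizes in the example the function returns 13, 613 and 3800 misses, which shows how sharply the estimate grows once the loop no longer fits in the cache.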

For all these cases, the relevant figure is the variation of the number of instruction cache misses ∆Minst = IMt − IMo, where IMo and IMt denote the instruction misses of the original and of the transformed loop. Such difference depends on the variation of the loop size due to the transformation:

∆LS = LSt − LSo = (U − 1)LBS (4.8)

and must be calculated for all the 3² = 9 cases. It is worth noting that since the transformed code will always be larger than the original one, only 6 out of the 9 cases are significant. For the sake of conciseness, only the two boundary cases are described in the following.

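(LSo ≤ Sinst) ∧ (LSt ≤ Sinst)

In this case both the original and the transformed code completely fit into the cache and thus only cold misses take place. The variation, recalling Equation (4.5), is:

$$\Delta M_{inst} = \left\lceil \frac{LS_t}{B_{inst}} \right\rceil - \left\lceil \frac{LS_o}{B_{inst}} \right\rceil \approx \left\lceil \frac{(U - 1) \, LBS}{B_{inst}} \right\rceil \tag{4.9}$$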

(LSo ≥ 2Sinst) ∧ (LSt ≥ 2Sinst)

In this other limiting case, both codes are larger than twice the cache size and thus each instruction fetch causes a miss. Recalling Equation (4.7), the instruction miss variation is:

$$\Delta M_{inst} = N_t \left\lceil \frac{LS + (U - 1) \, LBS}{B_{inst}} \right\rceil - N_o \left\lceil \frac{LS}{B_{inst}} \right\rceil \tag{4.10}$$

In a similar manner and referring to Equations (4.5)–(4.7), the variations for the other four cases can be calculated. The last effect to be considered is the variation of data cache misses. Since the transformation does not modify the data access pattern of the code, the term ∆Mdata can be assumed to be 0, at least to a first approximation. A first validation can be performed at this level by comparing the dimensionless estimated figures ∆I and ∆Minst with those derived from simulation. Figure 9.2 shows the results for the variation of the number of instructions executed. It is worth noting that ∆I does not depend on the cache size but only on the structure of the code and the effectiveness of the optimizations that the compiler can exploit on the modified loop.

[Figure 9.2: plot of the actual and estimated ∆I versus the unroll factor U, for U from 0 to 100.]

Figure 9.2. Loop unrolling: ∆I

As for the variation of instruction cache misses, different scenarios have been considered by varying the cache size from 256 to 4096 bytes. Table 9.3 summarizes the results obtained by averaging the estimation error over the interval U ∈ [2, 100] and Figure 9.3 shows the two boundary cases.

Table 9.3. Loop unrolling: ∆Minst average error and standard deviation

Sinst (bytes)   ε%       σ%
256             -1.881   8.026
512             -2.557   7.101
1024            -2.531   6.910
2048            -2.750   9.252
4096            -1.691   5.065

The two contributions ∆I and ∆Minst (remembering that ∆Mdata = 0) can now be fed to the technology models to derive the overall energy saving. Table 9.4 reports the average error and the corresponding standard deviation in terms of energy gain for the five cache-size scenarios just considered.

These results show that the model tends to underestimate the potential gain deriving from loop unrolling. A possible reason is that unrolling a loop leads to a longer loop body, i.e. a larger basic block where the compiler can better perform optimizations. Despite the slight bias of the model, the overall average error is, in absolute value, approximately 4.9% and this can be considered more than satisfactory when operating at source code level.

[Figure 9.3: two plots of the actual and estimated ∆Minst versus the unrolling factor U, for U from 0 to 100. Panels: cache size = 256 bytes and cache size = 4 Kbytes.]

Figure 9.3. Loop unrolling: ∆Minst

Table 9.4. Loop unrolling: ∆E average error and standard deviation

Sinst (bytes)   ε%       σ%
256             -1.754   9.144
512             -4.552   7.322
1024            -7.663   6.966
2048            -6.203   5.777
4096            -4.409   3.011
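The 4.9% figure can be checked against the entries of Table 9.4 with a few lines of Python. The values are copied from the table; the variable names are illustrative.

```python
# Average energy estimation error (percent) per cache size, from Table 9.4.
avg_error = {256: -1.754, 512: -4.552, 1024: -7.663, 2048: -6.203, 4096: -4.409}

mean_signed = sum(avg_error.values()) / len(avg_error)
mean_abs = sum(abs(e) for e in avg_error.values()) / len(avg_error)

print(f"mean signed error:   {mean_signed:.1f}%")   # -4.9%
print(f"mean absolute error: {mean_abs:.1f}%")      # 4.9%
```

Since all five entries are negative, the signed and the absolute means coincide in magnitude, which is the bias towards underestimation mentioned in the text.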

9.4.2 Loop Fusion

This transformation has the purpose of combining into a new single loop the bodies of different subsequent loops. Some constraints must be satisfied: in particular, the loops to be merged need to have the same iteration range and the statements in their bodies must be independent. The only transformation parameter characterizing loop fusion is the number NF of loops to be merged. The source code parameters that influence the effect of this transformation are all those considered for loop unrolling (see Table 9.2) plus the number and size of control instructions, defined as:

LCI = LI − LBI (4.11)

LCS = LS − LBS (4.12)

In the following the subscript k ∈ [1, NF] is used to indicate a specific loop among those to be fused. An additional useful parameter is the average number of control instructions over all the considered loops:

$$\overline{LCI} = \frac{1}{N_F} \sum_{k=1}^{N_F} LCI_k \tag{4.13}$$

Using the symbols just introduced, the numbers of instructions in the original and transformed codes are:

$$I_o = N \sum_{k=1}^{N_F} \left( LBI_k + LCI_k \right) \tag{4.14}$$

$$I_t = N \left( \overline{LCI} + \sum_{k=1}^{N_F} LBI_k \right) \tag{4.15}$$

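The variation ∆I is thus given by:

$$\Delta I = N \left( \overline{LCI} + \sum_{k=1}^{N_F} LBI_k - \sum_{k=1}^{N_F} \left( LBI_k + LCI_k \right) \right) = N \left( \overline{LCI} - \sum_{k=1}^{N_F} LCI_k \right) \tag{4.16}$$

Assuming that $\overline{LCI} = LCI_k \;\; \forall k$ yields:

$$\sum_{k=1}^{N_F} LCI_k = \sum_{k=1}^{N_F} \overline{LCI} = N_F \cdot \overline{LCI} \tag{4.17}$$

and thus Equation (4.16) can be rewritten as:

$$\Delta I = N \left( \overline{LCI} - \sum_{k=1}^{N_F} LCI_k \right) = N \, (1 - N_F) \, \overline{LCI} \tag{4.18}$$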
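The derivation of Equations (4.13)–(4.18) can be replayed numerically. In the sketch below, `fusion_delta_i` is a hypothetical helper that takes the per-loop body and control instruction counts as lists; it evaluates the general form (4.16), which reduces to (4.18) when all loops have the same number of control instructions.

```python
def fusion_delta_i(n, lbi, lci):
    """Variation of executed instructions when fusing len(lbi) loops.

    n   -- common number of iterations (N)
    lbi -- loop-body instruction counts, one per loop (LBI_k)
    lci -- control instruction counts, one per loop (LCI_k)
    """
    nf = len(lbi)
    lci_avg = sum(lci) / nf                            # (4.13)
    i_o = n * sum(b + c for b, c in zip(lbi, lci))     # (4.14)
    i_t = n * (lci_avg + sum(lbi))                     # (4.15)
    return i_t - i_o                                   # (4.16)


# Two loops of 100 iterations with 3 control instructions each
print(fusion_delta_i(100, [5, 8], [3, 3]))   # -300.0
```

With NF = 2 and 3 control instructions per loop, the closed form (4.18) gives N(1 − NF) · 3 = −300, matching the value computed from the general expression.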

To study the effect of loop fusion with respect to instruction misses, the same cases considered for loop unrolling and expressed by Equations (4.5)–(4.7) turn out to be applicable. Nevertheless, when considering the original code composed of NF loops, the number of instruction misses must be estimated for each single loop according to the three mentioned equations and then summed over all loops. On the other hand, the estimates for the transformed code can be obtained by simply substituting LS with the overall transformed code size LS_t, defined as:

$$LS_t = LCS + \sum_{k=1}^{NF} LBS_k \qquad (4.19)$$

According to Equations (4.5)–(4.7) and referring to the original code sizes LS_{o,k} and the transformed code size LS_t, the number of instruction misses of the original loops IM_{o,k} and of the transformed one IM_t can be derived. The resulting overall variation is thus:

$$\Delta M_{inst} = IM_t - \sum_{k=1}^{NF} IM_{o,k} \qquad (4.20)$$
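The structure of Equations (4.19) and (4.20) can be sketched in Python as follows. The miss estimator is passed in as a parameter because Equations (4.5)–(4.7) are not reproduced here; `toy_miss_model` is a deliberately crude, hypothetical stand-in (compulsory misses when the code fits, one miss per line per iteration otherwise) and is not the model used in this chapter. All names and figures are illustrative assumptions.

```python
def fused_code_size(lcs, body_sizes):
    """Equation (4.19): LS_t = LCS + sum of the loop body sizes LBS_k."""
    return lcs + sum(body_sizes)


def delta_inst_misses(orig_sizes, lcs, body_sizes, miss_model):
    """Equation (4.20): IM_t minus the sum of IM_{o,k} over the original loops."""
    im_t = miss_model(fused_code_size(lcs, body_sizes))
    im_o = sum(miss_model(size) for size in orig_sizes)
    return im_t - im_o


def toy_miss_model(cache_size, line_size, n_iter):
    """Hypothetical placeholder for Equations (4.5)-(4.7); sizes in bytes."""
    def model(code_size):
        lines = -(-code_size // line_size)   # ceiling division
        if code_size <= cache_size:
            return lines                     # only compulsory misses
        return lines * n_iter                # pessimistic: miss on every pass
    return model


# Two loops with LBS = 96 and 128 bytes and LCS = 16 bytes each,
# so LS_{o,k} = LBS_k + LCS = 112 and 144 bytes.
for cache in (256, 128):
    model = toy_miss_model(cache_size=cache, line_size=16, n_iter=100)
    print(cache, delta_inst_misses([112, 144], 16, [96, 128], model))
```

Under this toy model, fusion is slightly beneficial when the fused loop still fits in the cache and harmful when it no longer does, which is the qualitative trade-off that the limiting conditions on the cache size capture.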

It is worth noting that the number of possible cases derived from the limiting conditions on the cache size is, in general, 3^(NF+1). Similar considerations apply to the estimation of data cache misses. Since in most cases the different loops operate on different arrays, data misses tend to increase; the best case is that all data fit into the cache, in which case the number of misses is approximately invariant. A validation procedure similar to that used for loop unrolling has also been applied to loop fusion, considering the simplest and most common case where NF = 2. To analyze the behavior of the transformation, loops with different body sizes have been considered and the results for instruction misses are shown in Figure 9.5, where the x axis is an index related to the loop body size ratio. For the same combinations of loop body sizes and for an instruction cache size varying from 256 to 4096 bytes, the gain in terms of instruction misses has also been estimated and compared with actual results, leading to the data collected in Table 9.5 and the graphs of Figure 9.5 relative to the two limiting cases.

Again the accuracy obtained is more than satisfactory, since the average absolute error is approximately 2.1% with a very low standard deviation. Combining the dimensionless figures with the energy models of the different components of the considered system led to the energy estimates. Such estimates show a very limited error, as reported in Table 9.6, and are not biased. It is worth noting, though, that the reported results refer to loops manipulating very small arrays, for which the hypothesis of being fully contained in the data cache may be assumed to hold. This translates into the models by assuming ∆Mdata = 0.

Figure 9.4. Loop fusion: ∆I (actual and estimated values plotted against the loop size index)

Table 9.5. Loop fusion: ∆Minst

Sinst (bytes)    ε%        σ%
256              +2.423    2.701
512              +3.004    2.804
1024             -3.150    4.253
2048             +0.153    1.672
4096             -0.258    1.419

Figure 9.5. Loop fusion: ∆Minst (actual and estimated values plotted against the loop size index; one panel for cache size = 64 Bytes and one for cache size = 4 KBytes)

More complex cases show higher errors, but preliminary experimental results suggest that a 10–15% error is a reasonable and conservative upper bound.

Table 9.6. Loop fusion: ∆E

Sinst (bytes)    ε%        σ%
256              +1.945    3.882
512              +0.177    3.469
1024             -0.194    3.916
2048             +1.592    2.425
4096             +0.168    0.017

9.5 EXPERIMENTAL RESULTS

The estimates of ∆I, ∆Minst and ∆Mdata, combined with the energy models (see Section 9.3.2) adopted to account for the technology-dependent parameters, lead to a new set of results showing the accuracy of the complete methodology in terms of energy reduction (∆E) estimation. The models for five transformations have been tested on a set of SPEC95 benchmarks in order to quantify the energy gain estimation error. The actual energy gain has been obtained by simulating both the original and the transformed code and has then been compared with the estimated gain derived from the models. Experiments have been performed on four architectures based on different processors and operating systems, using third-party timing and/or power profiling tools (see Table 9.7).

Table 9.7. Operating environments for validation

Processor               Operating system    Simulation engine
Intel strongARM         Linux RedHat 9.0    SimpleScalar 3.0 / SimPAnalyzer
IBM PowerPC 405         Linux RedHat 9.0    SimpleScalar 3.0
Sun microSPARC II EP    Solaris 8           SpixTools
MIPS Tech. MIPS-32      Linux RedHat 9.0    SimpleScalar 3.0

Each benchmark has been analyzed varying both the instruction cache size (Sinst) and the input data, and all compatible transformations have been applied in a proper sequence using the predicted optimal values for their parameters (unroll factor, tile size, etc.). Table 9.8 collects the relative error between the estimated gain ∆Eest and the actual value ∆Eact derived from simulation.

The results confirm that the models are reliable, since they can correctly predict both energy reductions and undesirable energy increases. In conclusion, the average estimation error has been shown to be around 3%.

Table 9.8. Energy gain estimation relative errors

           FIB             FIR             WAVE-1          WAVE-2          IIR
Sinst      ε%      σ%      ε%      σ%      ε%      σ%      ε%      σ%      ε%      σ%
256        +4.16   3.90    n/a     n/a     -1.97   2.81    +4.29   3.63    -1.63   1.20
512        +7.18   4.02    -3.67   4.48    -1.83   2.67    +4.63   3.52    -1.82   1.15
1024       +3.31   1.49    -2.11   4.95    -2.87   3.51    +4.81   0.79    -3.93   1.51
2048       -1.42   2.15    +1.03   7.68    -2.37   3.71    +4.20   0.57    -0.53   1.59
4096       -2.08   1.91    +11.25  7.57    -1.86   3.71    +3.74   0.20    +0.03   16.00
Average    3.63    2.69    4.51    6.17    2.18    3.28    4.33    1.74    1.58    4.29
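The Average row of Table 9.8 and the overall figure of about 3% can be reproduced from the tabulated ε% values. The Python sketch below assumes that each average is the mean of the absolute errors and that the n/a entry is simply skipped; both are assumptions about how the row was computed, not something the table states.

```python
# Signed relative errors (eps %) from Table 9.8, ordered by Sinst = 256..4096.
# None marks the n/a entry, which is skipped (an assumption).
EPS = {
    "FIB":    [4.16, 7.18, 3.31, -1.42, -2.08],
    "FIR":    [None, -3.67, -2.11, 1.03, 11.25],
    "WAVE-1": [-1.97, -1.83, -2.87, -2.37, -1.86],
    "WAVE-2": [4.29, 4.63, 4.81, 4.20, 3.74],
    "IIR":    [-1.63, -1.82, -3.93, -0.53, 0.03],
}


def mean_abs(values):
    """Mean of the absolute values, ignoring missing (None) entries."""
    present = [abs(v) for v in values if v is not None]
    return sum(present) / len(present)


per_benchmark = {name: mean_abs(vals) for name, vals in EPS.items()}
overall = sum(per_benchmark.values()) / len(per_benchmark)

for name, value in per_benchmark.items():
    print(f"{name}: {value:.3f}")
print(f"overall: {overall:.3f}")
```

The per-benchmark means agree with the Average row to within 0.01, and their mean is consistent with the average estimation error of around 3% quoted in the text.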

9.6 CONCLUSIONS

The presented work has addressed the problem of the fast estimation of the effects induced by a set of specific source code transformations by using a structured methodological approach based on technology-independent models. In particular, the presented analysis flow, once provided with an appropriate set of both technological and transformation parameters, allows the designer to perform an a priori evaluation of the impact of a specific transformation and/or the effect of a sequence of interdependent transformations. Two specific transformations have been described in detail: loop unrolling and loop fusion. As far as loop unrolling is concerned, it has been shown that the proposed model can be considered more than satisfactory, since the average error between the estimated gain and the simulated gain is approximately 4.9% with a low standard deviation. Concerning loop fusion, the model has produced estimates, for a wide set of technological options, displaying an average absolute error of 2.1% with a high level of reliability. Both the methodology and the models have been validated on a set of benchmarks, showing an overall average error of the estimated energy gain of around 3%. This result is more than satisfactory and confirms that the models of the different transformations are sufficiently accurate and that the methodology, though subject to further improvements, is promising.

References

[1] L. Benini and G. De Micheli. System-level power optimization: Techniques and tools. Transactions on Design Automation of Electronic Systems, 5:115–192, 2000.

[2] F. Catthoor, H. De Man, and C. Kulkarni. Code transformations for low power caching in embedded multimedia processors. Proc. of IPPS/SPDP, pages 292–297, 1998.

[3] D.F. Bacon, S.L. Graham, and O.J. Sharp. Compiler transformations for high performance computing. Technical Report N. UCB/CSD-93-781, University of California at Berkeley, 1993.

[4] M.S. Lam. Software pipelining: An effective scheduling technique for VLIW machines. SIGPLAN Conference on Programming Language Design and Implementation, pages 318–328, 1988.

[5] M.S. Lam, E.E. Rothberg, and M.E. Wolf. The cache performance and optimization of blocked algorithms. Conference on Architectural Support for Programming Languages and Operating Systems, pages 63–74, 1991.

[6] M.J. Wolfe. More iteration space tiling. ACM Proceedings of Supercomputing, pages 655–664, 1989.

[7] C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto. The impact of source code transformations on software power and energy consumption. Journal of Circuits, Systems and Computers, 11(5):477–502, 2002.

[8] C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto. Library functions timing characterization for source-level analysis. Conference on Design Automation and Testing in Europe, pages 1132–1133, March 2003.

Chapter 10

TRANSMITTANCE SCALING FOR REDUCING POWER DISSIPATION OF A BACKLIT TFT-LCD

Wei-Chung Cheng and Massoud Pedram

University of Southern California

Abstract: This chapter presents transmittance scaling, a technique aimed at conserving power in a transmissive TFT-LCD with a cold cathode fluorescent lamp (CCFL) backlight by reducing the backlight illumination while compensating for the luminance loss. This goal is accomplished by adjusting the transmittance function of the TFT-LCD panel while meeting an upper bound on a contrast distortion metric. Experimental results show that an average power saving of 3.7X can be achieved for still images with a mere 10% contrast distortion.

Keywords: CCFL; transmissive LCD; TFT-LCD; backlight luminance dimming; transmittance scaling; concurrent brightness and contrast scaling; power efficiency; low power design.

10.1 INTRODUCTION

TFT-LCD is the most popular flat-panel display used in today's consumer electronics and computer systems. TFT stands for "Thin Film Transistor" and describes the control elements that actively control the individual pixels; for this reason, one speaks of "active-matrix" TFT displays. LCD means "Liquid Crystal Display" and stands for monitors that are based on liquid crystals. To obtain high image quality and low power dissipation in a TFT-LCD, a low off-current and a high on-current are necessary.

Previous studies on battery-powered electronics point out that the display subsystem dominates the energy consumption of the whole system. In the SmartBadge system, for instance, the display consumes 29%, 29%, and 50% of the total power in the active, idle, and standby modes, respectively [1]. Direct-view LCDs can largely be categorized into reflective and transmissive displays, which utilize ambient light and light from an artificial light source (e.g., a fluorescent backlight tube), respectively. In a transmissive TFT-LCD monitor, the backlight accounts for more than 50% of the power consumed by the display subsystem when a cold cathode fluorescent lamp (CCFL) is used [2]. To reduce the backlight power consumption, Choi et al. proposed a technique called backlight luminance dimming. This technique dims the backlight and compensates for the luminance loss by adjusting the grayscale of the image to increase its brightness or contrast. The grayscale of the image is adjusted by multiplying the pixel values by a scaling factor. In this chapter, we describe the transmittance scaling technique, which compensates for the luminance loss by adjusting the transmittance function of the TFT-LCD panel. More precisely, transmittance scaling means “scaling the transmittance function of the TFT-LCD panel.” This is a general technique that can achieve concurrent brightness and contrast scaling of the whole image to compensate for the effects of the backlight dimming.
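The idea of dimming the backlight and compensating by multiplying the pixel values by a scaling factor can be illustrated with a minimal Python sketch. It is not the algorithm of Choi et al. nor the transmittance scaling method of this chapter: it assumes, purely for illustration, an 8-bit grayscale image, a panel transmittance that is linear in the pixel value, and a compensation factor equal to the inverse of the backlight factor. The function names are hypothetical.

```python
def compensate_pixels(pixels, backlight_factor, max_level=255):
    """Scale pixel values by 1/backlight_factor, saturating at max_level."""
    if not 0.0 < backlight_factor <= 1.0:
        raise ValueError("backlight_factor must be in (0, 1]")
    scale = 1.0 / backlight_factor
    return [min(max_level, round(p * scale)) for p in pixels]


def relative_luminance(pixels, backlight_factor, max_level=255):
    """Perceived luminance relative to a full-white pixel at full backlight,
    assuming transmittance is linear in the pixel value (an assumption)."""
    return [backlight_factor * p / max_level for p in pixels]


original = [0, 64, 128, 200]
dimmed = compensate_pixels(original, backlight_factor=0.5)

print(dimmed)                                  # compensated pixel values
print(relative_luminance(original, 1.0))       # before dimming
print(relative_luminance(dimmed, 0.5))         # after dimming + compensation
```

Dark and mid-gray pixels keep their luminance after compensation, while bright pixels saturate at the maximum level and lose luminance. This loss is the kind of image distortion that a dimming policy has to bound.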

In the following sections, we explain how a CCFL works and show how to model the non-linearity between its backlight illumination and power consumption. Next, we propose a contrast distortion metric to quantify the image quality loss after transmittance scaling. Finally, we formulate the transmittance scaling problem subject to a constraint on the contrast distortion and solve it optimally.

10.2 PRELIMINARIES

A transmissive LCD uses a dedicated backlight. A reflective LCD uses the ambient light and/or a dedicated frontlight. A transflective LCD uses both the ambient light and a backlight. The frontlight and the backlight use the same kind of light source; the difference between the two lighting schemes is in the light path from the light source through the LCD panel to the observer. A back-lit or front-lit LCD offers a superior contrast ratio compared to one that is lit by the ambient light. A backlight can be of the direct or the indirect type. A direct backlight is positioned directly beneath the LCD panel. An indirect (or side-lit) backlight is positioned at the side of the LCD panel and requires a carefully designed light-guide and a diffuser to illuminate the LCD panel evenly.

Most TFT-LCD monitors use a CCFL for backlighting because of its unrivaled luminance density: it emits the most light within the minimum form factor. The CCFL can be designed to generate an arbitrary color, which is critical for reproducing pure white in backlighting applications. The technology for CCFL manufacturing is mature; therefore, its production cost is rather low. However, compared to the power consumption of the TFT-LCD panel, the power consumption of the CCFL backlight is quite high.

10.2.1 Radiometry and Photometry Terminology

Radiometry refers to the science of measuring light in any portion of the electromagnetic spectrum [3]. In practice, radiometry is usually limited to the measurement of infrared, visible, and ultraviolet light using optical instruments.

Light is radiant energy. Electromagnetic radiation transports energy through space. Radiant energy (denoted as Q) is measured in joules. A broadband source such as the Sun emits electromagnetic radiation throughout most of the electromagnetic spectrum, from radio waves to gamma rays. However, most of its radiant energy is concentrated within the visible portion of the spectrum. A single-wavelength laser, on the other hand, is a monochromatic source; all of its radiant energy is emitted at one specific wavelength. Energy per unit time is power, which we measure in joules per second, or watts. A laser beam, for example, has so many watts of radiant power. Light “flows” through space, and so radiant power is more commonly referred to as the “time rate of flow of radiant energy” or radiant flux. It is defined as Φ = dQ/dt, where Q is radiant energy and t is time. Radiant flux is measured in watts. In terms of a photographic light meter measuring visible light, the instantaneous magnitude of the electric current is directly proportional to the radiant flux. The total amount of current measured over a period of time is directly proportional to the radiant energy absorbed by the light meter during that time.

Radiant flux density is the radiant flux per unit area at a point on a surface. There are two possible conditions. The flux can be arriving at the surface, in which case the radiant flux density is referred to as irradiance. The flux can also be leaving the surface due to emission and/or reflection. The radiant flux density is then referred to as radiant exitance. Radiant flux density is measured in watts per square meter. The radiant flux density at a point on a surface due to a single ray of light arriving (or leaving) at an angle θ to the surface normal is dΦ/(dA·cos θ). The radiance at that point for the same angle is then d²Φ/(dA·dω·cos θ), or radiant flux density per unit solid angle, where dω is the elemental solid angle. Radiance is measured in watts per square meter per steradian. We can imagine an infinitesimally small point source of light that emits radiant flux in every direction. The amount of radiant flux emitted in a given direction can be represented by a ray of light contained in an elemental cone. This gives us the definition of radiant intensity: I = dΦ/dω. Radiant intensity is measured in watts per steradian.
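As a worked instance of the definition I = dΦ/dω, consider the special case of a point source that radiates uniformly in all directions (an added assumption; the definition itself does not require uniformity). Its flux is spread over the full sphere of 4π steradians, so the radiant intensity is the same in every direction. The Python function name below is hypothetical.

```python
import math


def isotropic_radiant_intensity(radiant_flux_w):
    """Radiant intensity (W/sr) of a point source emitting radiant_flux_w
    watts uniformly over the full sphere of 4*pi steradians."""
    return radiant_flux_w / (4.0 * math.pi)


# A hypothetical 10 W isotropic source.
print(isotropic_radiant_intensity(10.0))
```

For a non-uniform source the ratio of total flux to 4π gives only the average intensity, and the directional value must be obtained from the derivative dΦ/dω.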

Photometry is the science of measuring visible light in units that are weighted according to the sensitivity of the human eye [3]. It is a quantitative science based on a statistical model of the human visual response to light, that is, our perception of light, under carefully controlled conditions. The human visual system is a complex and highly nonlinear detector of electromagnetic radiation with wavelengths ranging from 380 to 770 nanometers (nm). The sensitivity of the human eye to light varies with wavelength. A light source with a radiance of one watt per square meter per steradian of green light (540 nm wavelength), for example, appears much brighter than the same source with a radiance of one watt per square meter per steradian of red (650 nm wavelength) or blue light (450 nm wavelength). In photometry, we attempt to measure the subjective impression produced by stimulating the human eye-brain visual system with radiant energy. This task is complicated immensely by the eye’s nonlinear response to light. It varies not only with wavelength but also with the amount of radiant flux, whether the light is constant or flickering, the adaptation of the iris and retina, the spatial complexity of the scene being perceived, the psychological and physiological state of the observer, and a host of other variables [4].

According to studies done by the Commission Internationale d’Eclairage (CIE), the photopic luminous efficiency of the human visual system as a function of wavelength looks like a near-normal distribution, as depicted in Figure 10-1 (cf. [5]). The CIE photometric curve thus provides a weighting function that can be used to convert radiometric measurements into photometric measurements. Today the international standard for a light source is a point source that has a luminous intensity of one candela (the Latin word for “candle”). It emits monochromatic radiation with a frequency of 540 × 10^12 hertz (or approximately 555 nm, corresponding to the wavelength of maximum photopic luminous efficiency) and has a radiant intensity (in the direction of measurement) of 1/683 watts per steradian.

Figure 10-1: Photopic luminosity function.

Together with the CIE photometric curve, the candela provides the weighting factor needed to convert between radiometric and photometric measurements. Consider, for example, a monochromatic point source with a wavelength of 510 nm and a radiant intensity of 1/683 watts per steradian. The photopic luminous efficiency at 510 nm is 0.503. The source therefore has a luminous intensity of 0.503 candela. Luminous flux is the photometrically-weighted radiant flux (power). Its unit of measurement is the lumen, defined as 1/683 watts of radiant power at a frequency of 540 × 10^12 hertz. As with luminous intensity, the luminous flux of light with other wavelengths can be calculated using the CIE photometric curve. Luminous energy is photometrically-weighted radiant energy. It is measured in lumen seconds. Luminous flux density is photometrically-weighted radiant flux density. Luminous flux density is measured in lumens per square meter. Illuminance is the photometric equivalent of irradiance, whereas luminous exitance is the photometric equivalent of radiant exitance. Illuminance thus characterizes the luminous flux arriving at a surface, and luminous exitance the luminous flux leaving it. Most photographic light meters measure the illuminance. Luminance is photometrically-weighted radiance. In terms of visual perception, we perceive luminance. It is an approximate measure of how “bright” a surface appears when we view it from a given direction. Luminance is measured in lumens per square meter per steradian. The maximum brightness of a CRT or LCD monitor is described by luminance in its specification. Luminous intensity is photometrically-weighted radiant intensity. It is measured in lumens per steradian (i.e., candelas). Luminous intensity can be used to characterize the optical power emitted from a spot light source, such as a light bulb.
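The conversion described above, for monochromatic light, amounts to multiplying the radiometric quantity by 683 lumens per watt and by the photopic luminous efficiency at the wavelength of interest. The Python sketch below reproduces the 510 nm example; the function names are hypothetical, and the efficiency value is passed in by the caller rather than looked up from the CIE curve.

```python
PEAK_LUMINOUS_EFFICACY = 683.0  # lumens per watt at 540e12 Hz (about 555 nm)


def _check_efficiency(photopic_efficiency):
    if not 0.0 <= photopic_efficiency <= 1.0:
        raise ValueError("photopic efficiency must lie in [0, 1]")


def luminous_intensity_cd(radiant_intensity_w_per_sr, photopic_efficiency):
    """Luminous intensity (candela) of a monochromatic source."""
    _check_efficiency(photopic_efficiency)
    return PEAK_LUMINOUS_EFFICACY * photopic_efficiency * radiant_intensity_w_per_sr


def luminous_flux_lm(radiant_flux_w, photopic_efficiency):
    """Luminous flux (lumens) of monochromatic radiant power."""
    _check_efficiency(photopic_efficiency)
    return PEAK_LUMINOUS_EFFICACY * photopic_efficiency * radiant_flux_w


# Example from the text: 510 nm, 1/683 W/sr, photopic efficiency 0.503.
print(luminous_intensity_cd(1.0 / 683.0, 0.503))
```

For a broadband source the same weighting would have to be applied wavelength by wavelength and integrated over the spectrum, which requires the tabulated CIE curve and is outside the scope of this sketch.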

There is much more that we have not covered here, such as reflectance, transmittance, absorption, scattering, diffraction, and polarization. We have also ignored the interaction of the human visual system with light, including scotopic and mesopic luminous efficiency, temporal effects such as flicker, and, most importantly, color perception. The study of light and how we perceive it fills volumes of research papers and textbooks.

10.2.2 Cold Cathode Fluorescent Lamp