
Linköping Studies in Science and Technology

Dissertation No. 1130

Performance and Energy Efficient Network-on-Chip Architectures

Sriram R. Vangal

Electronic Devices

Department of Electrical Engineering Linköping University, SE-581 83 Linköping, Sweden

Linköping 2007

ISBN 978-91-85895-91-5 ISSN 0345-7524


Performance and Energy Efficient Network-on-Chip Architectures
Sriram R. Vangal
ISBN 978-91-85895-91-5

Copyright Sriram R. Vangal, 2007
Linköping Studies in Science and Technology, Dissertation No. 1130
ISSN 0345-7524

Electronic Devices
Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden
Linköping 2007

Author email: [email protected]

Cover image: A chip microphotograph of the industry’s first programmable 80-tile teraFLOPS processor, implemented in a 65-nm eight-metal CMOS technology.

Printed by LiU-Tryck, Linköping University, Linköping, Sweden, 2007


Abstract

The scaling of MOS transistors into the nanometer regime opens the possibility of creating large Network-on-Chip (NoC) architectures containing hundreds of integrated processing elements with on-chip communication. NoC architectures, with structured on-chip networks, are emerging as a scalable and modular solution to global communications within large systems-on-chip. NoCs mitigate the emerging wire-delay problem and address the need for substantial interconnect bandwidth by replacing today’s shared buses with packet-switched router networks. With on-chip communication consuming a significant portion of the chip power and area budgets, there is a compelling need for compact, low-power routers. While applications dictate the choice of the compute core, the advent of multimedia applications, such as three-dimensional (3D) graphics and signal processing, places stronger demands for self-contained, low-latency floating-point processors with increased throughput. This work demonstrates that a computational fabric built using optimized building blocks can provide high levels of performance in an energy efficient manner. The thesis details an integrated 80-tile NoC architecture implemented in a 65-nm process technology. The prototype is designed to deliver over 1.0 TFLOPS of performance while dissipating less than 100 W.

This thesis first presents a six-port four-lane 57 GB/s non-blocking router core based on wormhole switching. The router features double-pumped crossbar channels and destination-aware channel drivers that dynamically configure based on the current packet destination. This enables a 45% reduction in crossbar channel area, a 23% reduction in overall router area, up to a 3.8X reduction in peak channel power, and a 7.2% improvement in average channel power. In a 150-nm six-metal CMOS process, the 12.2 mm2 router contains 1.9 million transistors and operates at 1 GHz with a 1.2-V supply.

We next describe a new pipelined single-precision floating-point multiply accumulator core (FPMAC) featuring a single-cycle accumulation loop using base 32 and internal carry-save arithmetic with delayed addition. A combination of algorithmic, logic and circuit techniques enables multiply-accumulate operations at speeds exceeding 3 GHz with single-cycle throughput. This approach reduces the latency of dependent FPMAC instructions and enables a sustained multiply-add result (2 FLOPS) every cycle. The optimizations allow the costly normalization step to be removed from the critical accumulation loop and conditionally powered down, using dynamic sleep transistors, during long accumulate operations, saving both active and leakage power. In a 90-nm seven-metal dual-VT CMOS process, the 2 mm2 custom design contains 230-K transistors. Silicon achieves 6.2 GFLOPS of performance while dissipating 1.2 W at 3.1 GHz and a 1.3-V supply.

We finally present the industry's first single-chip programmable teraFLOPS processor. The NoC architecture contains 80 tiles arranged as an 8×10 2D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz. Each tile has two pipelined single-precision FPMAC units which feature a single-cycle accumulation loop for high throughput. The five-port router combines 100 GB/s of raw bandwidth with a low fall-through latency of under 1 ns. The on-chip 2D mesh network provides a bisection bandwidth of 2 Tera-bits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm2 custom design contains 100-M transistors. The fully functional first silicon achieves over 1.0 TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and a 1.07-V supply.

It is clear that the realization of successful NoC designs requires well-balanced decisions at all levels: architecture, logic, circuit and physical design. Our results demonstrate that the NoC architecture successfully delivers on its promise of greater integration, high performance, good scalability and high energy efficiency.


Preface

This PhD thesis presents my research as of September 2007 at the Electronic Devices group, Department of Electrical Engineering, Linköping University, Sweden. I started working on building blocks critical to the success of NoC designs: first on crossbar routers, followed by research into floating-point MAC cores. Finally, I integrated the blocks into a large monolithic 80-tile teraFLOPS NoC multiprocessor. The following five publications are included in the thesis:

• Paper 1 - S. Vangal, N. Borkar and A. Alvandpour, "A Six-Port 57GB/s Double-Pumped Non-blocking Router Core", 2005 Symposium on VLSI Circuits, Digest of Technical Papers, June 16-18, 2005, pp. 268–269.

• Paper 2 - S. Vangal, Y. Hoskote, D. Somasekhar, V. Erraguntla, J. Howard, G. Ruhl, V. Veeramachaneni, D. Finan, S. Mathew and N. Borkar, “A 5 GHz floating point multiply-accumulator in 90 nm dual VT CMOS”, ISSCC Digest of Technical Papers, Feb. 2003, pp. 334–335.

• Paper 3 - S. Vangal, Y. Hoskote, N. Borkar and A. Alvandpour, "A 6.2 GFLOPS Floating Point Multiply-Accumulator with Conditional Normalization”, IEEE Journal of Solid-State Circuits, Volume 41, Issue 10, Oct. 2006, pp. 2314–2323.

• Paper 4 - S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote and N. Borkar, “An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS,” ISSCC Dig. Tech. Papers, pp. 98–99, Feb. 2007.


• Paper 5 - S. Vangal, A. Singh, J. Howard, S. Dighe, N. Borkar and A. Alvandpour, "A 5.1GHz 0.34mm2 Router for Network-on-Chip Applications", 2007 Symposium on VLSI Circuits, Digest of Technical Papers, June 14-16, 2007, pp. 42–43.

As a staff member of the Microprocessor Technology Laboratory at Intel Corporation, Hillsboro, OR, USA, I have also been involved in research work that resulted in several publications not discussed as part of this thesis:

• J. Tschanz, N. Kim, S. Dighe, J. Howard, G. Ruhl, S. Vangal, S. Narendra, Y. Hoskote, H. Wilson, C. Lam, M. Shuman, C. Tokunaga, D. Somasekhar, D. Finan, T. Karnik, N. Borkar, N. Kurd, V. De, “Adaptive Frequency and Biasing Techniques for Tolerance to Dynamic Temperature-Voltage Variations and Aging”, ISSCC Digest of Technical Papers, Feb. 2007, pp. 292–293.

• S. Narendra, J. Tschanz, J. Hofsheier, B. Bloechel, S. Vangal, Y. Hoskote, S. Tang, D. Somasekhar, A. Keshavarzi, V. Erraguntla, G. Dermer, N. Borkar, S. Borkar, and V. De, “Ultra-low voltage circuits and processor in 180nm to 90nm technologies with a swapped-body biasing technique”, ISSCC Digest of Technical Papers, Feb. 2004, pp. 156–157.

• Y. Hoskote, B. Bloechel, G. Dermer, V. Erraguntla, D. Finan, J. Howard, D. Klowden, S. Narendra, G. Ruhl, J. Tschanz, S. Vangal, V. Veeramachaneni, H. Wilson, J. Xu, and N. Borkar, ”A TCP offload accelerator for 10Gb/s Ethernet in 90-nm CMOS”, IEEE Journal of Solid-State Circuits, Volume 38, Issue 11, Nov. 2003, pp. 1866–1875.

• Y. Hoskote, B. Bloechel, G. Dermer, V. Erraguntla, D. Finan, J. Howard, D. Klowden, S. Narendra, G. Ruhl, J. Tschanz, S. Vangal, V. Veeramachaneni, H. Wilson, J. Xu, and N. Borkar, “A 10GHz TCP offload accelerator for 10Gb/s Ethernet in 90nm dual-VT CMOS”, ISSCC Digest of Technical Papers, Feb 2003, pp. 258–259.

• S. Vangal, M. Anders, N. Borkar, E. Seligman, V. Govindarajulu, V. Erraguntla, H. Wilson, A. Pangal, V. Veeramachaneni, J. Tschanz, Y. Ye, D. Somasekhar, B. Bloechel, G. Dermer, R. Krishnamurthy, K. Soumyanath, S. Mathew, S. Narendra, M. Stan, S. Thompson, V. De and S. Borkar, “5-GHz 32-bit integer execution core in 130-nm dual-VT CMOS”, IEEE Journal of Solid-State Circuits, Volume 37, Issue 11, Nov. 2002, pp. 1421- 1432.


• T. Karnik, S. Vangal, V. Veeramachaneni, P. Hazucha, V. Erraguntla and S. Borkar, “Selective node engineering for chip-level soft error rate improvement [in CMOS]”, 2002 Symposium on VLSI Circuits, Digest of Technical Papers, June 2002, pp. 204-205.

• S. Vangal, M. Anders, N. Borkar, E. Seligman, V. Govindarajulu, V. Erraguntla, H. Wilson, A. Pangal, V. Veeramachaneni, J. Tschanz, Y. Ye, D. Somasekhar, B. Bloechel, G. Dermer, R. Krishnamurthy, K. Soumyanath, S. Mathew, S. Narendra, M. Stan, S. Thompson, V. De and S. Borkar, “A 5GHz 32b integer-execution core in 130nm dual-VT CMOS”, ISSCC Digest of Technical Papers, Feb. 2002, pp. 412–413.

• S. Narendra, M. Haycock, V. Govindarajulu, V. Erraguntla, H. Wilson, S. Vangal, A. Pangal, E. Seligman, R. Nair, A. Keshavarzi, B. Bloechel, G. Dermer, R. Mooney, N. Borkar, S. Borkar, and V. De, “1.1 V 1 GHz communications router with on-chip body bias in 150 nm CMOS”, ISSCC Digest of Technical Papers, Feb. 2002, Volume 1, pp. 270–466.

• R. Nair, N. Borkar, C. Browning, G. Dermer, V. Erraguntla, V. Govindarajulu, A. Pangal, J. Prijic, L. Rankin, E. Seligman, S. Vangal and H. Wilson, “A 28.5 GB/s CMOS non-blocking router for terabit/s connectivity between multiple processors and peripheral I/O nodes”, ISSCC Digest of Technical Papers, Feb. 2001, pp. 224 – 225.

I have also co-authored one book chapter:

• Yatin Hoskote, Sriram Vangal, Vasantha Erraguntla and Nitin Borkar, “A High-Speed, Multithreaded TCP Offload Engine for 10 Gb/s Ethernet”, chapter 5 in “Network Processor Design, Issues and Practices, Volume 3”, Elsevier, February 2005, ISBN: 978-0-12-088476-6.

I have the following issued patents, with several patent applications pending:

• Sriram Vangal, “Integrated circuit interconnect routing using double pumped circuitry”, patent #6791376.

• Sriram Vangal and Howard A. Wilson, “Method and apparatus for driving data packets”, patent #6853644.

• Sriram Vangal and Dinesh Somasekhar, “Pipelined 4-2 compressor circuit”, patent #6701339.


• Sriram Vangal and Dinesh Somasekhar, “Flip-flop circuit”, patent #6459316.

• Yatin Hoskote, Sriram Vangal and Jason Howard, “Fast single precision floating point accumulator using base 32 system”, patent #6988119.

• Amaresh Pangal, Dinesh Somasekhar, Shekhar Borkar and Sriram Vangal, “Floating point multiply accumulator”, patent #7080111.

• Amaresh Pangal, Dinesh Somasekhar, Sriram Vangal and Yatin Hoskote, “Floating point adder”, patent #6889241.

• Bhaskar Chatterjee, Steven Hsu, Sriram Vangal, Ram Krishnamurthy, “Leakage tolerant register file”, patent #7016239.

• Sriram Vangal, Matthew B. Haycock, Stephen R. Mooney, “Biased control loop circuit for setting impedance of output driver”, patent #6424175.

• Sriram Vangal, “Weak current generation”, patent #6791376.

• Sriram Vangal and Gregory Ruhl, “A scalable skew tolerant tileable clock distribution scheme with low duty cycle variation”, pending patent.

• Sriram Vangal and Arvind Singh, “A compact crossbar router with distributed arbitration scheme”, pending patent.


Contributions

The main contributions of this dissertation are:

• A single-chip 80-tile sub-100 W NoC architecture delivering teraFLOPS (one trillion floating-point operations per second) performance.

• A 2D on-die mesh network with a bisection bandwidth > 2 Tera-bits/s.

• A tileable high-speed low-power differential mesochronous clocking scheme with low duty-cycle variation, applicable to large NoCs.

• A 5+GHz 5-port 100GB/s router architecture with phase-tolerant mesochronous links and sub-ns fall-through latency.

• Implementation of a shared double-pumped crossbar switch enabling up to 50% smaller crossbar and a compact 0.34mm2 router layout (65nm).

• A destination-aware CMOS driver circuit that dynamically configures based on the current packet destination for peak router power reduction.

• A new single-cycle floating-point MAC algorithm with just 15 FO4 stages in the critical path, capable of a sustained multiply-add result (2 FLOPS) every clock cycle (200 ps).

• An FPMAC implementation with a conditional normalization technique that opportunistically saves active and leakage power during long accumulate operations.

• A combination of fine-grained power management techniques for active and standby leakage power reduction.

• Extensive measured silicon data from the TeraFLOPS NoC processor and key building blocks spanning three CMOS technology generations (150-nm, 90-nm and 65-nm).


Abbreviations

AC Alternating Current

ASIC Application Specific Integrated Circuit

CAD Computer Aided Design

CMOS Complementary Metal-Oxide-Semiconductor

CMP Chip Multi-Processor

DC Direct Current

DRAM Dynamic Random Access Memory

DSM Deep SubMicron

FF Flip-Flop

FLOPS Floating-Point Operations per second

FP Floating-Point

FPADD Floating-Point Addition

FPMAC Floating-Point Multiply Accumulator

FPU Floating-Point Unit

FO4 Fan-Out-of-4


GB/s Gigabyte per second

IEEE The Institute of Electrical and Electronics Engineers

ILD Inter-Layer Dielectric

I/O Input Output

ITRS International Technology Roadmap for Semiconductors

LZA Leading Zero Anticipator

MAC Multiply-Accumulate

MOS Metal-Oxide-Semiconductor

MOSFET Metal-Oxide-Semiconductor Field Effect Transistor

MSFF Master-Slave Flip-Flop

MUX Multiplexer

NMOS N-channel Metal-Oxide-Semiconductor

NoC Network-on-Chip

PCB Printed Circuit Board

PLL Phase Locked Loop

PMOS P-channel Metal-Oxide-Semiconductor

RC Resistance-Capacitance

SDFF Semi-Dynamic Flip-Flop

SoC System-on-a-Chip

TFLOPS TeraFLOPS (one Trillion Floating Point Operations per Second)

VLSI Very-Large Scale Integration

VR Voltage Regulator


Acknowledgments

I would like to express my sincere gratitude to my principal advisor and supervisor, Prof. Atila Alvandpour, for motivating me and providing the opportunity to embark on research at Linköping University, Sweden. I also thank Lic. Eng. Martin Hansson for generously assisting me whenever I needed help. I thank Anna Folkeson and Arta Alvandpour at LiU for their assistance.

I am thankful for support from all of my colleagues at the Microprocessor Technology Labs, Intel Corporation. Special thanks to my manager, Nitin Borkar, and lab director Matthew Haycock for their support, and to Shekhar Borkar, Joe Schutz and Justin Rattner for their technical leadership and for funding this research. I also thank Dr. Yatin Hoskote, Dr. Vivek De, Dr. Dinesh Somasekhar, Dr. Tanay Karnik, Dr. Siva Narendra, Dr. Ram Krishnamurthy, Dr. Keith Bowman (for his exceptional reviewing skills), Dr. Sanu Mathew, Dr. Muhammad Khellah and Jim Tschanz for numerous technical discussions. I am also grateful to the silicon design and prototyping team in both Oregon, USA and Bangalore, India: Howard Wilson, Jason Howard, Greg Ruhl, Saurabh Dighe, Arvind Singh, Venkat Veeramachaneni, Amaresh Pangal, Venki Govindarajulu, David Finan, Priya Iyer, Tiju Jacob, Shailendra Jain, Sriram Venkataraman and all the mask designers, for providing an ideal research and prototyping environment that made silicon realization of the research ideas possible. Clark Roberts and Paolo Aseron deserve credit for their invaluable support in the lab with evaluation board design and silicon testing.

All of this was made possible by the love and encouragement from my family. I am indebted to my inspirational grandfather, Sri. V. S. Srinivasan (who passed away recently), and my parents, Sri. Ranganatha and Smt. Chandra, for their constant motivation from the other side of the world. I sincerely thank Prof. K. S. Srinath for being my mentor since high school. Without his visionary guidance and support, my career dreams would have remained unfulfilled. This thesis is dedicated to all of them.

Finally, I am exceedingly thankful for the patience, love and support from my soul mate, Reshma, and the remarkable cooperation from my son, Pratik ☺.

Sriram Vangal

Linköping, September 2007


Contents

Abstract iii

Preface v

Contributions ix

Abbreviations xi

Acknowledgments xiii

List of Figures xix

Part I Introduction 1

Chapter 1 Introduction 3

1.1 The Microelectronics Era and Moore’s Law .......... 4
1.2 Low Power CMOS Technology .......... 5
1.2.1 CMOS Power Components .......... 5
1.3 Technology scaling trends and challenges .......... 7
1.3.1 Technology scaling: Impact on Power .......... 8
1.3.2 Technology scaling: Impact on Interconnects .......... 11
1.3.3 Technology scaling: Summary .......... 14


1.4 Motivation and Scope of this Thesis .......... 15
1.5 Organization of this Thesis .......... 16
1.6 References .......... 17

Part II NoC Building Blocks 21

Chapter 2 On-Chip Interconnection Networks 23

2.1 Interconnection Network Fundamentals .......... 24
2.1.1 Network Topology .......... 25
2.1.2 Message Switching Protocols .......... 26
2.1.3 Virtual Lanes .......... 27
2.1.4 A Generic Router Architecture .......... 28
2.2 A Six-Port 57GB/s Crossbar Router Core .......... 29
2.2.1 Introduction .......... 30
2.2.2 Router Micro-architecture .......... 30
2.2.3 Packet structure and routing protocol .......... 32
2.2.4 FIFO buffer and flow control .......... 33
2.2.5 Router Design Challenges .......... 34
2.2.6 Double-pumped Crossbar Channel .......... 35
2.2.7 Channel operation and timing .......... 36
2.2.8 Location-based Channel Driver .......... 37
2.2.9 Chip Results .......... 39
2.3 Summary and future work .......... 42
2.4 References .......... 42

Chapter 3 Floating-point Units 45

3.1 Introduction to floating-point arithmetic .......... 45
3.1.1 Challenges with floating-point addition and accumulation .......... 46
3.1.2 Scheduling issues with pipelined FPUs .......... 48
3.2 Single-cycle Accumulation Algorithm .......... 49
3.3 CMOS prototype .......... 51
3.3.1 High-performance Flip-Flop Circuits .......... 51
3.3.2 Test Circuits .......... 52

3.4 References....................................................................................... 54

Part III An 80-Tile TeraFLOPS NoC 57

Chapter 4 An 80-Tile Sub-100W TeraFLOPS NoC Processor in 65-nm CMOS 59

4.1 Introduction..................................................................................... 59


4.2 Goals of the TeraFLOPS NoC Processor .......... 61
4.3 NoC Architecture .......... 62
4.4 FPMAC Architecture .......... 63
4.4.1 Instruction Set .......... 64
4.4.2 NoC Packet Format .......... 65
4.4.3 Router Architecture .......... 66
4.4.4 Mesochronous Communication .......... 69
4.4.5 Router Interface Block (RIB) .......... 69
4.5 Design details .......... 70
4.6 Experimental Results .......... 74
4.7 Applications for the TeraFLOP NoC Processor .......... 84
4.8 Conclusion .......... 85
Acknowledgement .......... 85
4.9 References .......... 85

Chapter 5 Conclusions and Future Work 89

5.1 Conclusions .......... 89
5.2 Future Work .......... 91
5.3 References .......... 92

Part IV Papers 95

Chapter 6 Paper 1 97

6.1 Introduction .......... 98
6.2 Double-pumped Crossbar Channel with LBD .......... 100
6.3 Comparison Results .......... 102
6.4 Acknowledgments .......... 103
6.5 References .......... 104

Chapter 7 Paper 2 105

7.1 FPMAC Architecture .......... 106
7.2 Proposed Accumulator Algorithm .......... 107
7.3 Overflow detection and Leading Zero Anticipation .......... 109
7.4 Simulation results .......... 111
7.5 Acknowledgments .......... 112
7.6 References .......... 113

Chapter 8 Paper 3 115

8.1 Introduction................................................................................... 116


8.2 FPMAC Architecture .......... 118
8.2.1 Accumulator algorithm .......... 119
8.2.2 Overflow prediction in carry-save .......... 123
8.2.3 Leading zero anticipation in carry-save .......... 125
8.3 Conditional normalization pipeline .......... 127
8.4 Design details .......... 129
8.5 Experimental Results .......... 132
8.6 Conclusions .......... 136
8.7 Acknowledgments .......... 137
8.8 References .......... 138

Chapter 9 Paper 4 141

9.1 NoC Architecture .......... 142
9.2 FPMAC Architecture .......... 143
9.3 Router Architecture .......... 144
9.4 NoC Power Management .......... 146
9.5 Experimental Results .......... 148
9.6 Acknowledgments .......... 149
9.7 References .......... 150

Chapter 10 Paper 5 151

10.1 Introduction .......... 152
10.2 NoC Architecture and Design .......... 153
10.3 Experimental Results .......... 157
10.4 Acknowledgments .......... 158
10.5 References .......... 158


List of Figures

Figure 1-1: Transistor count per Integrated Circuit. .......... 4
Figure 1-2: Basic CMOS gate showing dynamic and short-circuit currents. .......... 6
Figure 1-3: Main leakage components for a MOS transistor. .......... 7
Figure 1-4: Power trends for Intel microprocessors. .......... 8
Figure 1-5: Dynamic and Static power trends for Intel microprocessors. .......... 9
Figure 1-6: Sub-threshold leakage current as a function of temperature. .......... 9
Figure 1-7: Three circuit level leakage reduction techniques. .......... 10
Figure 1-8: Intel 65 nm, 8-metal copper CMOS technology in 2004. .......... 11
Figure 1-9: Projected fraction of chip reachable in one cycle with an 8 FO4 clock period [20]. .......... 12
Figure 1-10: Current and projected relative delays for local and global wires and for logic gates in nanometer technologies [25]. .......... 12
Figure 1-11: (a) Traditional bus-based communication, (b) An on-chip point-to-point network. .......... 13
Figure 1-12: Homogenous NoC Architecture. .......... 14
Figure 2-1: Direct and indirect networks. .......... 24
Figure 2-2: Popular on-chip fabric topologies. .......... 25
Figure 2-3: A fully connected non-blocking crossbar switch. .......... 26
Figure 2-4: Reduction in header blocking delay using virtual channels. .......... 28
Figure 2-5: Canonical router architecture. .......... 29
Figure 2-6: Six-port four-lane router block diagram. .......... 31
Figure 2-7: Router data and control pipeline diagram. .......... 32
Figure 2-8: Router packet and FLIT format. .......... 33
Figure 2-9: Flow control between routers. .......... 34
Figure 2-10: Crossbar router in [19] (a) Internal lane organization. (b) Corresponding layout. (c) Channel interconnect structure. .......... 34
Figure 2-11: Double-pumped crossbar channel. .......... 36
Figure 2-12: Crossbar channel timing diagram. .......... 37
Figure 2-13: Location-based channel driver and control logic. .......... 38
Figure 2-14: (a) LBD Schematic. (b) LBD encoding summary. .......... 38
Figure 2-15: Double-pumped crossbar channel results compared to work reported in [19]. .......... 39
Figure 2-16: Simulated crossbar channel propagation delay as function of port distance. .......... 40
Figure 2-17: LBD peak current as function of port distance. .......... 41
Figure 2-18: Router layout and characteristics. .......... 41
Figure 3-1: IEEE single precision format. The exponent is biased by 127. .......... 46
Figure 3-2: Single-cycle FP adder with critical blocks high-lighted. .......... 47
Figure 3-3: FPUs optimized for (a) FPMADD (b) FPMAC instruction. .......... 48
Figure 3-4: (a) Five-stage FP adder [11] (b) Six-stage CELL FPU [10]. .......... 49
Figure 3-5: Single-cycle FPMAC algorithm. .......... 50
Figure 3-6: Semi-dynamic resetable flip flop with selectable pulse width. .......... 52
Figure 3-7: Block diagram of FPMAC core and test circuits. .......... 52
Figure 3-8: 32-entry x 32b dual-VT optimized register file. .......... 53
Figure 4-1: NoC architecture. .......... 60
Figure 4-2: NoC block diagram and tile architecture. .......... 62
Figure 4-3: FPMAC 9-stage pipeline with single-cycle accumulate loop. .......... 63
Figure 4-4: NoC protocol: packet format and FLIT description. .......... 65
Figure 4-5: Five-port two-lane shared crossbar router architecture. .......... 67
Figure 4-6: (a) Double-pumped crossbar switch schematic. (b) Area benefit over work in [11]. .......... 68
Figure 4-7: Phase-tolerant mesochronous interface and timing diagram. .......... 69
Figure 4-8: Semi-dynamic flip-flop (SDFF) schematic. .......... 70
Figure 4-9: (a) Global mesochronous clocking and (b) simulated clock arrival times. .......... 71
Figure 4-10: Router and on-die network power management. .......... 72
Figure 4-11: (a) FPMAC pipelined wakeup diagram and simulated peak current reduction and (b) state-retentive memory clamp circuit. .......... 73
Figure 4-12: Full-Chip and tile micrograph and characteristics. .......... 74
Figure 4-13: (a) Package die-side. (b) Land-side. (c) Evaluation board. .......... 75
Figure 4-14: Measured chip FMAX and peak performance. .......... 76
Figure 4-15: Measured chip power for stencil application. .......... 78
Figure 4-16: Measured chip energy efficiency for stencil application. .......... 79
Figure 4-17: Estimated (a) Tile power profile (b) Communication power breakdown. .......... 80
Figure 4-18: Measured global clock distribution waveforms. .......... 81
Figure 4-19: Measured global clock distribution power. .......... 81
Figure 4-20: Measured chip leakage power as percentage of total power vs. Vcc. A 2X reduction is obtained by turning off sleep transistors. .......... 82
Figure 4-21: On-die network power reduction benefit. .......... 82
Figure 4-22: Measured IMEM virtual ground waveform slowing transition to and from sleep. .......... 83
Figure 4-23: Measurement setup. .......... 84
Figure 6-1: Six-port four-lane router block diagram. .......... 99
Figure 6-2: Router data and control pipeline diagram. .......... 99
Figure 6-3: Double-pumped crossbar channel with location based driver (LBD). .......... 101
Figure 6-4: Crossbar channel timing diagram. .......... 101
Figure 6-5: (a) LBD schematic (b) Peak current per LBD vs. port distance. .......... 102
Figure 6-6: Router layout and characteristics. .......... 103
Figure 7-1: FPMAC pipe stages and organization. .......... 107
Figure 7-2: Floating point accumulator mantissa datapath. .......... 108
Figure 7-3: Floating point accumulator exponent logic. .......... 109
Figure 7-4: Toggle detection circuit. .......... 110
Figure 7-5: Post-normalization pipeline diagram. .......... 111
Figure 7-6: Chip layout and process characteristics. .......... 112
Figure 7-7: Simulated FPMAC frequency vs. supply voltage. .......... 112
Figure 8-1: Conventional single-cycle floating-point accumulator with critical blocks high-lighted. .......... 118
Figure 8-2: FPMAC pipe stages and organization. .......... 119
Figure 8-3: Conversion to base 32. .......... 120
Figure 8-4: Accumulator mantissa datapath. .......... 121
Figure 8-5: Accumulator algorithm behavior with examples. .......... 122
Figure 8-6: Accumulator exponent logic. .......... 122
Figure 8-7: Example of overflow prediction in carry-save. .......... 124
Figure 8-8: Toggle detection circuit and overflow prediction. .......... 124
Figure 8-9: Leading-zero anticipation in carry-save. .......... 126
Figure 8-10: Sleep enabled conditional normalization pipeline diagram. .......... 127
Figure 8-11: Sparse-tree adder. Highlighted path shows generation of every 16th carry. .......... 128
Figure 8-12: Accumulator register flip-flop circuit and layout bit-slice. .......... 129
Figure 8-13: (a) Simulated worst-case bounce on virtual ground. (b) Normalization pipeline layout showing sleep devices. (c) Sleep device insertion into power grid. .......... 130
Figure 8-14: Block diagram of FPMAC core and test circuits. .......... 131
Figure 8-15: Chip clock generation and distribution. .......... 133
Figure 8-16: Die photograph and chip characteristics. .......... 133
Figure 8-17: Measurement setup. .......... 134
Figure 8-18: Measured FPMAC frequency and power Vs. Vcc. .......... 134
Figure 8-19: Measured maximum frequency Vs. Vcc. .......... 135
Figure 8-20: Measured switching and leakage power at 85°C. .......... 136
Figure 8-21: Conditional normalization and total FPMAC power for various activation rates (N). .......... 137
Figure 9-1: NoC block diagram and tile architecture. .......... 143
Figure 9-2: FPMAC 9-stage pipeline with single-cycle accumulate loop. .......... 144
Figure 9-3: Shared crossbar router with double-pumped crossbar switch. .......... 145
Figure 9-4: Global mesochronous clocking and simulated clock arrival times. .......... 146
Figure 9-5: FPMAC pipelined wakeup diagram and state-retentive memory clamp circuit. .......... 147
Figure 9-6: Estimated frequency and power versus Vcc, and power efficiency with 80 tiles (N) active. .......... 148
Figure 9-7: Full-Chip and tile micrograph and characteristics. .......... 149
Figure 10-1: Communication fabric for 80-tile NoC. .......... 153
Figure 10-2: Five-port two-lane shared crossbar router architecture. .......... 154
Figure 10-3: NoC protocol: packet format and FLIT description. .......... 154
Figure 10-4: (a) Double-pumped crossbar switch schematic. (b) Area benefit over work in [2]. .......... 155
Figure 10-5: Phase-tolerant mesochronous interface and timing diagram. .......... 156
Figure 10-6: (a) Semi-dynamic flip-flop (SDFF) schematic. (b) Power management. .......... 156
Figure 10-7: (a) Measured router FMAX and power Vs. Vcc. (b) Power reduction benefit. .......... 157
Figure 10-8: (a) Communication power breakdown. (b) Die micrograph of individual tile. .......... 158


Part I

Introduction


Chapter 1

Introduction

For four decades, the semiconductor industry has distinguished itself by the unparalleled pace of improvement in its products and has transformed the world that we live in. The remarkable characteristic of transistors that fuels this rapid growth is that their speed increases and their cost decreases as their size is reduced. Modern high-performance Integrated Circuits (ICs) contain more than one billion transistors. Today, 65 nanometer (nm) Complementary Metal Oxide Semiconductor (CMOS) technology is in high-volume manufacturing, and industry has already demonstrated fully functional static random access memory (SRAM) chips using a 45-nm process. This scaling of CMOS Very Large Scale Integration (VLSI) technology is driven by Moore’s law. CMOS has been the driving force behind high-performance ICs. The attractive scaling properties of CMOS, coupled with its low power, high speed, good noise margins, reliability, wide temperature and voltage operating range, ease of circuit and layout implementation, and ease of manufacturing, have made it the technology of choice for digital ICs. CMOS also enables monolithic integration of both analog and digital circuits on the same die, and shows great promise for future System-on-a-Chip (SoC) implementations.

This chapter reviews trends in silicon CMOS technology and highlights the main challenges involved in keeping up with Moore’s law. CMOS device scaling increases transistor sub-threshold leakage. Interconnect scaling, coupled with higher operating frequencies, requires careful parasitic extraction and modeling. Power delivery and dissipation are fast becoming limiting factors in product design.

1.1 The Microelectronics Era and Moore’s Law

Hardly any other industry has developed as fast as the semiconductor industry over the last 40 years. Within this relatively short period, microelectronics has become the key technology enabler for several industries such as information technology, telecommunications, medical equipment and consumer electronics [1].

In 1965, Intel co-founder Gordon Moore observed that the total number of devices on a chip doubled every 12 months [2]. He predicted that the trend would continue in the 1970s but would slow in the 1980s, when the total number of devices would double every 24 months. Known widely as “Moore’s Law,” these observations made the case for continued wafer and die size growth, defect density reduction, and increased transistor density as manufacturing matured and technology scaled. Figure 1-1 plots the growth in the number of transistors per IC and shows that the transistor count has indeed doubled every 24 months [3].

Figure 1-1: Transistor count per Integrated Circuit.

During these years, the die size has increased at 7% per year, while the operating frequency of leading microprocessors has doubled every 24 months [4] and is now well into the GHz range. Figure 1-1 also shows that the integration density of memory ICs is consistently higher than that of logic chips. Memory circuits are highly regular, allowing better integration with much less interconnect overhead.

1.2 Low Power CMOS Technology

The idea of the CMOS Field Effect Transistor (FET) was first introduced by Wanlass and Sah [5]. By the 1980s, CMOS was widely acknowledged as the dominant technology for microprocessors, memories and application specific integrated circuits (ASICs), owing to its favorable properties over other IC technologies. The biggest advantage of CMOS over NMOS and bipolar technology is its significantly reduced power dissipation, since a CMOS circuit has almost no static (DC) power dissipation.

1.2.1 CMOS Power Components

In order to understand the evolution of CMOS as one of the most popular low-power design approaches, we first examine the sources of power dissipation in a digital CMOS circuit. The total power consumed by a static CMOS circuit consists of three components, given by Eqs. (1.1)–(1.3) below [6].

Pdynamic represents the dynamic or switching power, i.e., the power dissipated in charging and discharging the physical load capacitance contributed by fan-out gate loading, interconnect loading, and parasitic capacitances at the CMOS gate outputs. CL represents this capacitance, lumped together as shown in Figure 1-2. The dynamic power is given by Eq. 1.2, where fclk is the clock frequency with which the gate switches, Vdd is the power supply, and α is the switching activity factor, which determines how frequently the output switches per clock-cycle.

Pshort-circuit represents the short-circuit power, i.e., the power consumed by the direct current path between Vdd and ground during switching. This short-circuit current (Isc) arises when both the PMOS and NMOS transistors are simultaneously active, conducting current directly from Vdd to ground for a short period of time. Equation 1.3 describes the short-circuit power dissipation for a simple CMOS inverter [7], where β is the gain factor of the transistors, τ is the input rise/fall time, and VT is the transistor threshold voltage (assumed to be the same for both PMOS and NMOS).

$$P_{total} = P_{dynamic} + P_{short\text{-}circuit} + P_{static} \qquad (1.1)$$

$$P_{dynamic} = \alpha \cdot f_{clk} \cdot C_L \cdot V_{dd}^{2} \qquad (1.2)$$

$$P_{short\text{-}circuit} = \frac{\beta}{12}\,(V_{dd} - 2V_T)^{3}\,\tau \cdot f_{clk} \qquad (1.3)$$
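As a quick sanity check of Eqs. (1.1)–(1.3), the short sketch below evaluates the dynamic and short-circuit components for a single gate driving a lumped load. All parameter values (activity factor, load capacitance, β, rise time) are illustrative assumptions, not numbers taken from this thesis.

```python
# Illustrative evaluation of Eqs. (1.2) and (1.3) for a single CMOS gate.
# All numbers below are assumed, round values for illustration only.

alpha = 0.1          # switching activity factor (fraction of cycles the output toggles)
f_clk = 1e9          # clock frequency: 1 GHz
C_L   = 10e-15       # lumped load capacitance: 10 fF
V_dd  = 1.2          # supply voltage (V)
V_T   = 0.35         # threshold voltage (V), assumed equal for NMOS and PMOS
beta  = 100e-6       # transistor gain factor (A/V^2), assumed
tau   = 50e-12       # input rise/fall time: 50 ps

# Eq. (1.2): dynamic (switching) power
P_dynamic = alpha * f_clk * C_L * V_dd**2

# Eq. (1.3): short-circuit power of a simple inverter
P_short_circuit = (beta / 12.0) * (V_dd - 2.0 * V_T)**3 * tau * f_clk

print(f"P_dynamic       = {P_dynamic * 1e6:.2f} uW")
print(f"P_short_circuit = {P_short_circuit * 1e6:.3f} uW")

# P_dynamic scales with V_dd^2: lowering V_dd from 1.2 V to 1.0 V cuts the
# switching power by roughly (1.0/1.2)^2, i.e. about 31%.
print(f"P_dynamic at 1.0 V = {alpha * f_clk * C_L * 1.0**2 * 1e6:.2f} uW")
```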


Figure 1-2: Basic CMOS gate showing dynamic and short-circuit currents.

It is important to note that the switching component of power is independent of the rise/fall times at the input of logic gates, whereas the short-circuit power depends on the input signal slope. Short-circuit currents can be significant when the rise/fall times at the input of the gate are much longer than the output rise/fall times.

The third component of power, Pstatic, is due to leakage currents. It is determined by fabrication technology considerations and consists of (1) source/drain junction leakage current, (2) gate direct tunneling leakage, and (3) sub-threshold leakage through the channel of an OFF transistor, as summarized in Figure 1-3.

(1) The junction leakage (I1) occurs from the source or drain to the substrate through the reverse-biased diodes when a transistor is OFF. The magnitude of the diode’s leakage current depends on the area of the drain diffusion and the leakage current density, which is, in turn, determined by the process technology.

(2) The gate direct tunneling leakage (I2) flows from the gate through the “leaky” oxide insulation to the substrate. Its magnitude increases exponentially as the gate oxide thickness is reduced and as the supply voltage increases. According to the 2005 International Technology Roadmap for Semiconductors [8], a high-K gate dielectric is required to control this direct tunneling component of the leakage current.

(3) The sub-threshold current is the drain-source current (I3) of an OFF transistor. It is due to the diffusion current of the minority carriers in the channel of a MOS device operating in the weak inversion mode (i.e., the sub-threshold region). For instance, in the case of an inverter with a low input voltage, the NMOS is turned OFF and the output voltage is high. Even with VGS at 0 V, there is still a current flowing in the channel of the OFF NMOS transistor, since VDS = Vdd. The magnitude of the sub-threshold current is a function of the temperature, supply voltage, device size, and the process parameters, of which the threshold voltage (VT) plays a dominant role. In today’s CMOS technologies, sub-threshold current is the largest of all the leakage currents, and can be computed using the following expression [9]:

$$I_{SUB} = K \cdot e^{(V_{GS} - V_T + \eta V_{DS})/(n \, v_T)} \cdot \left(1 - e^{-V_{DS}/v_T}\right) \qquad (1.4)$$

where K and n are functions of the technology, and η is the drain-induced barrier lowering (DIBL) coefficient, an effect that manifests itself in short-channel MOSFET devices. The transistor sub-threshold swing is modeled with the parameter n, while vT represents the thermal voltage, given by kT/q (~33 mV at 110°C).

Figure 1-3: Main leakage components for a MOS transistor.

In the sub-threshold region, the MOSFET behaves primarily as a bipolar transistor, and the sub-threshold current is exponentially dependent on VGS (Eq. 1.4). Another important figure of merit for low-power CMOS is the sub-threshold slope, which is the amount of gate voltage required to reduce the sub-threshold current by one decade. The lower the sub-threshold slope, the better, since the transistor can then be turned OFF more completely when VGS is reduced below VT.
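The exponential dependences in Eq. (1.4) are easy to see numerically. The sketch below evaluates the sub-threshold current of an OFF transistor using assumed, generic values of K, n, η and VT (they are not fitted to any process described in this thesis); it only illustrates the trends with VT and temperature.

```python
import math

# Illustrative evaluation of the sub-threshold current model of Eq. (1.4).
# K, n, eta and V_T below are assumed, generic values -- not process data.

def i_sub(v_gs, v_ds, v_t=0.35, K=1e-6, n=1.5, eta=0.1, temp_c=110.0):
    """Sub-threshold drain current (A) per Eq. (1.4)."""
    v_therm = 8.617e-5 * (temp_c + 273.15)      # kT/q in volts (~33 mV at 110 C)
    return (K
            * math.exp((v_gs - v_t + eta * v_ds) / (n * v_therm))
            * (1.0 - math.exp(-v_ds / v_therm)))

# An OFF transistor (V_GS = 0) with the full supply across it (V_DS = V_dd)
# still leaks, and the leakage grows exponentially as V_T is scaled down.
for v_t in (0.40, 0.35, 0.30, 0.25):
    print(f"V_T = {v_t:.2f} V -> I_sub = {i_sub(0.0, 1.2, v_t=v_t):.3e} A")

# Leakage also rises with die temperature (the real increase is larger,
# because V_T itself drops with temperature, which this sketch ignores).
for t in (25.0, 70.0, 110.0):
    print(f"T = {t:5.1f} C -> I_sub = {i_sub(0.0, 1.2, temp_c=t):.3e} A")
```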

1.3 Technology scaling trends and challenges

The scaling of CMOS VLSI technology is driven by Moore’s law and is the primary factor driving the speed and performance improvement of both microprocessors and memories. The term “scaling” refers to the reduction of transistor width, length and oxide dimensions by 30%. Historically, CMOS technology, when scaled to the next generation, (1) reduces gate delay by 30%, allowing a 43% increase in clock frequency, (2) doubles the device density (as shown in Figure 1-1), (3) reduces the parasitic capacitance by 30%, and (4) reduces the energy and active energy per transition by 65% and 50%, respectively. The key barriers to continued scaling of supply voltage and technology for microprocessors to achieve low power and high performance are well documented in [10]-[13].
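The 0.7× linear-shrink arithmetic behind points (1)–(4) can be worked through in a few lines. The sketch below uses the ideal constant-field (Dennard-style) scaling rules; one common reading of the 65%/50% figures above is energy per transition and active power per gate, respectively, and that is the interpretation assumed here.

```python
# Ideal constant-field scaling arithmetic for one technology generation.
# Linear dimensions (W, L, t_ox) and V_dd all shrink by the factor s = 0.7.
s = 0.7

gate_delay     = s              # delay scales with dimensions: ~30% faster
frequency      = 1.0 / s        # 1/0.7 ~ 1.43 -> ~43% higher clock frequency
density        = 1.0 / (s * s)  # 1/0.49 ~ 2x transistors per unit area
capacitance    = s              # device/parasitic capacitance: ~30% lower
energy_per_op  = s**3           # C*V^2 with both C and V scaled: ~65% lower
power_per_gate = s**2           # energy/op times (1/s) higher frequency: ~50% lower

print(f"gate delay     x{gate_delay:.2f}")
print(f"frequency      x{frequency:.2f}")
print(f"density        x{density:.2f}")
print(f"capacitance    x{capacitance:.2f}")
print(f"energy per op  x{energy_per_op:.2f}")
print(f"power per gate x{power_per_gate:.2f}")
```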

1.3.1 Technology scaling: Impact on Power

The rapid increase in the number of transistors on chips has enabled a dramatic increase in the performance of computing systems. However, the performance improvement has been accompanied by an increase in power dissipation, requiring more expensive packaging and cooling technology. Figure 1-4 shows the power trends of Intel microprocessors, with the dotted line showing the power trend with classic scaling. Historically, the primary contributor to power dissipation in CMOS circuits has been the charging and discharging of load capacitances, often referred to as the dynamic power dissipation. As shown by Eq. 1.2, this component of power varies as the square of the supply voltage (Vdd). Therefore, in the past, chip designers have relied on scaling down Vdd to reduce the dynamic power dissipation. Despite scaling Vdd, the total power dissipation of microprocessors is expected to increase exponentially, from 100 W in 2005 to over 2 kW by the end of the decade, with a substantial increase in sub-threshold leakage power [10].

Figure 1-4: Power trends for Intel microprocessors.

Maintaining transistor switching speeds (constant electric field scaling) requires a proportionate downscaling of the transistor threshold voltage (VT) in lock step with the Vdd reduction. However, scaling VT results in a significant increase in leakage power due to an exponential increase in the sub-threshold leakage current (Eq. 1.4). Figure 1-5 illustrates the increasing leakage power trend for Intel microprocessors in different CMOS technologies, with leakage power exceeding 50% of the total power budget at the 70-nm technology node [13]. Figure 1-6 is a semi-log plot that shows estimated sub-threshold leakage currents for future technologies as a function of temperature. Borkar [11] predicts a 7.5X increase in leakage current and a 5X increase in total energy dissipation for every new microprocessor chip generation!

Figure 1-5: Dynamic and Static power trends for Intel microprocessors.

Figure 1-6: Sub-threshold leakage current as a function of temperature.

Note that it is possible to substantially reduce the leakage power, and hence the overall power, by reducing the die temperature. Better cooling techniques therefore become more critical in advanced deep-submicron technologies to control both active leakage and total power. Leakage current reduction can also be achieved by utilizing process and/or circuit techniques. At the process level, channel engineering is used to optimize the device doping profile for reduced leakage. Dual-VT technology [14] is used to reduce the total sub-threshold leakage power: NMOS and PMOS devices with both high and low threshold voltages are made by selectively adjusting well doses. The technique implements high-performance critical-path transistors with low-VT devices, while non-critical paths use high-VT transistors, at the cost of additional process complexity. Over 80% reduction in leakage power has been reported in [15] while still meeting performance goals.

In the last decade, a number of circuit solutions for leakage control have been proposed. One solution is to replace a non-stacked device with a stack of two devices without affecting the input load [16], as shown in Figure 1-7(a). This significantly reduces sub-threshold leakage, but incurs a delay penalty, similar to replacing a low-VT device with a high-VT device in a dual-VT design. Combined dynamic body bias and sleep transistor techniques for active leakage power control have been described in [17]. Sub-threshold leakage can be reduced by dynamically changing the body bias applied to the block, as shown in Figure 1-7(b). During active mode, forward body bias (FBB) is applied to increase the operating frequency. When the block enters idle mode, the forward bias is withdrawn, reducing the leakage. Alternatively, reverse body bias (RBB) can be applied during idle mode for further leakage savings.
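Because the sub-threshold current of Eq. (1.4) depends exponentially on VT, the benefit of reverse body bias can be estimated directly from the VT shift it produces. The body-effect coefficient and sub-threshold swing used below are assumed illustrative values, not measurements from [17].

```python
import math

# Rough estimate of leakage reduction from reverse body bias (RBB),
# using the exponential V_T dependence of Eq. (1.4). Assumed values only.

n       = 1.5            # sub-threshold swing coefficient (assumed)
v_therm = 0.026          # kT/q at room temperature (V)
dvt_per_v_rbb = 0.10     # assumed body effect: 100 mV of V_T shift per 1 V of RBB

def leakage_reduction(rbb_volts):
    """Factor by which sub-threshold leakage drops for a given RBB."""
    delta_vt = dvt_per_v_rbb * rbb_volts
    return math.exp(delta_vt / (n * v_therm))

for rbb in (0.25, 0.5, 1.0):
    print(f"RBB = {rbb:.2f} V -> leakage reduced ~{leakage_reduction(rbb):.1f}x")
```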

Figure 1-7: Three circuit level leakage reduction techniques.

The supply gating, or sleep transistor, technique uses a high threshold transistor, as shown in Figure 1-7(c), to cut off the supply to a functional block when the design is in an “idle” or “standby” state. A 37X reduction in leakage power, with a block reactivation time of less than two clock cycles, has been reported in [17]. While this technique can reduce leakage by orders of magnitude, it causes performance degradation and complicates power grid routing.
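Whether supply gating pays off depends on how long the block actually stays idle compared with the energy cost of entering and leaving sleep. The break-even sketch below uses assumed numbers for the block leakage and sleep-transition energy (only the 37X reduction figure is taken from the text); it is not a model of the design in [17].

```python
# Break-even analysis for sleep-transistor (supply-gating) usage.
# All values are assumed for illustration.

P_leak_on    = 10e-3     # block leakage when not gated: 10 mW
P_leak_gated = 0.27e-3   # residual leakage when gated (a ~37x reduction, as in [17])
E_transition = 2e-9      # energy to enter/exit sleep and restore state: 2 nJ (assumed)

# Gating saves energy once the leakage energy avoided exceeds the transition cost:
#   (P_leak_on - P_leak_gated) * t_idle > E_transition
t_break_even = E_transition / (P_leak_on - P_leak_gated)
print(f"break-even idle time ~ {t_break_even * 1e6:.2f} us")

# For a 4 GHz clock this corresponds to roughly this many idle cycles:
print(f"~{t_break_even * 4e9:.0f} cycles at 4 GHz")
```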

1.3.2 Technology scaling: Impact on Interconnects

In deep-submicron designs, chip performance is increasingly limited by interconnect delay. With scaling, the width and thickness of the interconnections are reduced; as a result, the resistance increases, and as the interconnections get closer, the capacitance increases, increasing the RC delay. The cross-coupling capacitance between adjacent wires is also increasing with each technology generation [18]. New designs add more transistors on chip, and the average die size of a chip has been increasing over time. To accommodate the increased RC parasitics, more interconnect layers are being added: the thinner, tighter interconnect layers are used for local interconnections, while the new thicker and sparser layers are used for global interconnections and power distribution. Copper metallization has been adopted to reduce the resistance of interconnects, and fluorinated SiO2 is used as the inter-level dielectric (ILD) to reduce the dielectric constant (k = 3.6). Figure 1-8 is a cross-section SEM image showing the interconnect structure in the 65-nm technology node [19].
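A first-order estimate shows why this RC trend matters. The sketch below computes the delay of an unbuffered distributed RC wire using the usual ~0.4·R·C approximation; the wire geometries and the simple capacitance model are assumptions for illustration, not the parameters of the 65-nm stack in Figure 1-8.

```python
# First-order RC delay of an unbuffered on-chip copper wire (assumed geometry).
# The 50% delay of a distributed line is approximated as ~0.4 * R_total * C_total.

RHO_CU = 2.2e-8        # effective Cu resistivity, ohm*m (assumed, incl. barrier effects)
EPS0   = 8.85e-12      # vacuum permittivity, F/m
K_ILD  = 3.6           # relative permittivity of the fluorinated-SiO2 ILD (from the text)

def wire_delay(length, width, thickness):
    """Approximate 50% delay (s) of a wire with the given dimensions (m)."""
    r_total = RHO_CU * length / (width * thickness)
    c_total = 2.0 * K_ILD * EPS0 * length          # crude plate-plus-fringe assumption
    return 0.4 * r_total * c_total

# A short local wire vs. a long cross-chip global wire (assumed dimensions).
print(f"100 um local wire : {wire_delay(100e-6, 0.2e-6, 0.4e-6) * 1e12:6.2f} ps")
print(f"10 mm global wire : {wire_delay(10e-3, 0.5e-6, 1.0e-6) * 1e12:6.1f} ps")

# Both R and C grow linearly with length, so unrepeated wire delay grows
# quadratically: doubling the length quadruples the delay -- one reason long
# global wires need repeaters or, at the system level, a network-on-chip.
```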


Figure 1-8: Intel 65 nm, 8-metal copper CMOS technology in 2004.

Technology trends show that global on-chip wire delays are growing significantly, increasing cross-chip communication latencies to several clock cycles and reducing the fraction of the chip reachable in a single cycle to less than 1% at the 35nm technology node, as shown in Figure 1-9 [20]. Figure 1-10 shows the projected relative delay, taken from the ITRS roadmap, for local wires, global wires (with and without repeaters), and logic gates in the near future. With gate delays shrinking by about 30% and on-chip interconnect delay growing by about 30% every technology generation, the cost gap between computation and communication is widening. Inductive noise and the skin effect become more pronounced as frequencies reach multi-GHz levels, and both circuit and layout solutions will be required to contain the inductive and capacitive coupling effects. In addition, interconnects now dissipate an increasingly large portion of total chip power [22]. All of these trends indicate that interconnect delay and power will continue to dominate overall chip performance in sub-65nm technologies.

Figure 1-9: Projected fraction of chip reachable in one cycle with an 8 FO4 clock period [20].


Figure 1-10: Current and projected relative delays for local and global wires and for logic gates in nanometer technologies [25].


The challenge for chip designers is to come up with new architectures that achieve both a fast clock rate and high concurrency, despite slow global wires. Shared bus networks (Figure 1-11a) are well understood and widely used in SoCs, but have serious scalability issues as more bus masters (>10) are added. To mitigate the global interconnect problem, new structured, wire-delay-scalable on-chip communication fabrics, called networks-on-chip (NoCs), have emerged for use in SoC designs. The basic concept is to replace today's shared buses with on-chip packet-switched interconnection networks (Figure 1-11b) [22]. The point-to-point network overcomes the scalability issues of shared-medium networks.


Figure 1-11: (a) Traditional bus-based communication, (b) An on-chip point-to-point network.

NoC architectures may have a homogeneous or heterogeneous structure. The NoC architecture shown in Figure 1-12 is homogeneous and consists of a basic building block, the "network tile". These tiles are connected to an on-chip network that routes packets between them. Each tile may contain one or more compute cores (microprocessors) or memory cores, together with routing logic responsible for routing and forwarding packets according to the routing policy of the network. The dedicated point-to-point links used in the network are optimal in terms of bandwidth availability, latency, and power consumption. The structured network wiring of such a NoC design gives well-controlled electrical parameters that simplify timing and allow the use of high-performance circuits to reduce latency and increase bandwidth. An excellent introduction to NoCs, including a survey of research and practices, can be found in [24]–[25].


Figure 1-12: Homogenous NoC Architecture.

1.3.3 Technology scaling: Summary

We have discussed trends in CMOS VLSI technology. The data indicate that silicon performance, integration density, and power have followed the scaling theory. The main challenges to VLSI scaling include lithography, transistor scaling, interconnect scaling, and increasing power delivery and dissipation. Modular on-chip networks are required to resolve the interconnect scaling issues. As the MOSFET channel length is reduced to 45nm and below, suppression of the ever-increasing off-state leakage currents becomes crucial, requiring leakage-aware circuit techniques and newer bulk-CMOS-compatible device structures. While these challenges remain to be overcome, no fundamental barrier exists to scaling CMOS devices well into the nano-CMOS era, with physical gate lengths under 10nm. Predictions from the 2005 ITRS roadmap [8] indicate that "Moore's law" should continue well into the next decade before the ultimate device limits for CMOS are reached.


1.4 Motivation and Scope of this Thesis

Very recently, chip designers have moved from a computation-centric view of chip design to a communication-centric view, owing to the widening cost gap between interconnect delay and gate delay in deep-submicron technologies. As this discrepancy between device delay and on-chip communication latency becomes macroscopic, the practical solution is to use scalable NoC architectures. NoC architectures, with structured on-chip networks, are emerging as a scalable and modular solution to global communications within large systems-on-chip. NoCs mitigate the emerging wire-delay problem and address the need for substantial interconnect bandwidth by replacing today's shared buses with packet-switched router networks.

This work focuses on building blocks critical to the success of NoC designs. Research into high performance, area and energy efficient router architectures allows for larger router networks to be integrated on a single die with reduced power consumption. We next turn our attention to identifying and resolving computation throughput issues in conventional floating-point units by proposing a new single-cycle FPMAC algorithm capable of a sustained multiply-add result every cycle at GHz frequencies. The building blocks are integrated into a large monolithic 80-tile teraFLOP NoC multiprocessor. Such a NoC prototype will demonstrate that a computational fabric built using optimized building blocks can provide high levels of performance in an energy efficient manner. In addition, silicon results from such a prototype would provide in-depth understanding of performance and energy tradeoffs in designing successful NoC architectures well into the nanometer regime.

With on-chip communication consuming a significant portion of the chip power (about 40% [23]) and area budgets, there is a compelling need for compact, low power routers. While applications dictate the choice of the compute core, the advent of multimedia applications, such as three-dimensional (3D) graphics and signal processing, places stronger demands for self-contained, low-latency floating-point processors with increased throughput. The thesis details an integrated 80-Tile NoC architecture implemented in a 65-nm process technology. The prototype is designed to deliver over 1.0 TFLOPS of performance while dissipating less than 100W.

This thesis first presents a six-port four-lane 57 GB/s non-blocking router core based on wormhole switching. The router features double-pumped crossbar channels and destination-aware channel drivers that dynamically configure based on the current packet destination. This enables 45% reduction in crossbar channel area, 23% overall router area, up to 3.8X reduction in peak channel power, and 7.2% improvement in average channel power. In a 150nm six-metal CMOS process, the 12.2mm2 router contains 1.9 million transistors and operates at 1GHz at 1.2V.

We next present a new pipelined single-precision floating-point multiply accumulator core (FPMAC) featuring a single-cycle accumulation loop using base 32 and internal carry-save arithmetic, with delayed addition techniques. A combination of algorithmic, logic and circuit techniques enables multiply-accumulate operations at speeds exceeding 3GHz, with single-cycle throughput. This approach reduces the latency of dependent FPMAC instructions and enables a sustained multiply-add result (2 FLOPS) every cycle. The optimizations allow the costly normalization step to be removed from the critical accumulation loop and conditionally powered down using dynamic sleep transistors during long accumulate operations, saving active and leakage power. In a 90nm seven-metal dual-VT CMOS process, the 2mm2 custom design contains 230K transistors. Silicon achieves 6.2 GFLOPS of performance while dissipating 1.2W at 3.1GHz and 1.3V supply.

We finally describe an integrated NoC architecture containing 80 tiles arranged as an 8×10 2D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz. Each tile has two pipelined single-precision FPMAC units which feature a single-cycle accumulation loop for high throughput. The five-port router combines 100GB/s of raw bandwidth with fall-through latency under 1ns. The on-chip 2D mesh network provides a bisection bandwidth of 2 Tera-bits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm2 custom design contains 100-M transistors. The fully functional first silicon achieves over 1.0TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07-V supply.

1.5 Organization of this Thesis

This thesis is organized into four parts:

• Part I – Introduction

• Part II – NoC Building Blocks

• Part III – An 80-Tile TeraFLOPS NoC

• Part IV – Papers

Part I provides the necessary background for the concepts used in the papers. This chapter reviews trends in silicon CMOS technology and highlights the main challenges involved in keeping up with Moore's law. Properties of CMOS devices, sources of leakage power, and the impact of scaling on power and interconnect are discussed. This is followed by a brief discussion of NoC architectures.

In Part II, we describe the two NoC building blocks in detail. Introductory concepts specific to on-die interconnection networks, including a generic router architecture, are presented in Chapter 2. A more detailed description of the six-port four-lane 57 GB/s non-blocking router core design (Paper 1) is also given. Chapter 3 presents basic concepts involved in floating-point arithmetic, reviews conventional floating-point units (FPUs), and describes the challenges in accomplishing single-cycle accumulation on today's FPUs. A new pipelined single-precision FPMAC design, capable of single-cycle accumulation, is described in Paper 2 and Paper 3. Paper 2 first presents the FPMAC algorithm, logic optimizations, and preliminary chip simulation results. Paper 3 introduces the concept of "Conditional Normalization", applicable to floating-point units, for the first time. It also describes a 6.2 GFLOPS floating-point multiply-accumulator enhanced with a conditional normalization pipeline, with detailed results from silicon measurements.

In Part III (Chapter 4), we bring this work together to realize the 80-Tile TeraFLOPS NoC processor. The motivation and application space for tera-scale computers are also discussed. Paper 4 first introduces the 80-tile NoC architecture details with initial chip simulation results. Paper 5 focuses on the on-chip mesh network, router, and mesochronous links used in the TeraFLOPS NoC. Chapter 4 builds on both papers, describes the NoC prototype in significant detail, and includes results from chip measurements. We conclude in Chapter 5 with final remarks and directions for future research.

Finally, in Part IV, the papers included in this thesis are presented in full.

1.6 References

[1]. R. Smolan and J. Erwitt, One Digital Day – How the Microchip is Changing Our World, Random House, 1998.

[2]. G.E. Moore, “Cramming more components onto integrated circuits”, in Electronics, vol. 38, no. 8, 1965.

[3]. http://www.icknowledge.com, April. 2006.

[4]. http://www.intel.com/technology/mooreslaw/index.htm, April. 2006.

[5]. F. Wanlass and C. Sah, “Nanowatt logic using field-effect metal-oxide semiconductor triodes,” ISSCC Digest of Technical Papers, Volume VI, pp. 32 – 33, Feb 1963.

Page 40: FULLTEXT01_4

18 Introduction

[6]. A. Chandrakasan and R. Brodersen, “Low Power Digital CMOS Design”, Kluwer Academic Publishers, 1995.

[7]. H.J.M. Veendrick, “Short-Circuit Dissipation of Static CMOS Circuitry and its Impact on the Design of Buffer Circuits”, in IEEE Journal of Solid-State Circuits, vol. 19, no. 4, pp. 468-473, August 1984.

[8]. “International Technology Roadmap for Semiconductors.” http://public.itrs.net, April 2006.

[9]. A. Ferre and J. Figueras, “Characterization of leakage power in CMOS technologies,” in Proc. IEEE Int. Conf. on Electronics, Circuits and Systems, vol. 2, 1998, pp. 85–188.

[10]. S. Borkar, “Design challenges of technology scaling”, IEEE Micro, Volume 19, Issue 4, pp. 23 – 29, Jul-Aug 1999

[11]. V. De and S. Borkar, “Technology and Design Challenges for Low Power and High-Performance”, in Proceedings of 1999 International Symposium on Low Power Electronics and Design, pp. 163-168, 1999.

[12]. S. Rusu, “Trends and challenges in VLSI technology scaling towards 100nm”, Proceedings of the 27th European Solid-State Circuits Conference, pp. 194 – 196, Sept. 2001.

[13]. R. Krishnamurthy, A. Alvandpour, V. De, and S. Borkar, “High performance and low-power challenges for sub-70-nm microprocessor circuits,” in Proc. Custom Integrated Circuits Conf., 2002, pp. 125–128.

[14]. J. Kao and A. Chandrakasan, “Dual-threshold voltage techniques for low-power digital circuits,” IEEE J. Solid-State Circuits, vol. 35, pp. 1009–1018, July 2000.

[15]. L. Wei, Z. Chen, K. Roy, M. Johnson, Y. Ye, and V. De, "Design and optimization of dual-threshold circuits for low-voltage low-power applications", IEEE Transactions on VLSI Systems, Volume 7, pp. 16–24, March 1999.

[16]. S. Narendra, S. Borkar, V. De, D. Antoniadis and A. Chandrakasan, “Scaling of stack effect and its application for leakage reduction”, Proceedings of ISLPED '01, pp. 195 – 200, Aug. 2001.

Page 41: FULLTEXT01_4

1.6 References 19

[17]. J.W. Tschanz, S.G. Narendra, Y. Ye, B.A. Bloechel, S. Borkar and V. De, “Dynamic sleep transistor and body bias for active leakage power control of microprocessors” IEEE Journal of Solid-State Circuits, Nov. 2003, pp. 1838–1845.

[18]. J.D. Meindl, “Beyond Moore’s Law: The Interconnect Era”, in Computing in Science & Engineering, vol. 5, no. 1, pp. 20-24, January 2003.

[19]. P. Bai, et al., "A 65nm Logic Technology Featuring 35nm Gate Lengths, Enhanced Channel Strain, 8 Cu Interconnect Layers, Low-k ILD and 0.57 um2 SRAM Cell," International Electron Devices Meeting, 2004, pp. 657 – 660.

[20]. S. Keckler, D. Burger, C. Moore, R. Nagarajan, K. Sankaralingam, V. Agarwal, M. Hrishikesh, N. Ranganathan, and P. Shivakumar, “A wire-delay scalable microprocessor architecture for high performance systems,” ISSCC Digest of Technical Papers, Feb. 2003, pp.168–169.

[21]. R. Ho, K. W. Mai, and M. A. Horowitz, “The Future of Wires,” in Proceedings of the IEEE, pp. 490–504, 2001.

[22]. W. J. Dally and B. Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks,” in Proceedings of the 38th Design Automation Conference, pp. 681-689, June 2001.

[23]. H. Wang, L. Peh, S. Malik, “Power-driven design of router microarchitectures in on-chip networks,” MICRO-36, Proceedings of 36th Annual IEEE/ACM International Symposium on Micro architecture, pp. 105-116, 2003.

[24]. L. Benini and G. D. Micheli, “Networks on Chips: A New SoC Paradigm,” IEEE Computer, vol. 35, no. 1, pp. 70–78, 2002.

[25]. T. Bjerregaard and S. Mahadevan, "A survey of research and practices of Network-on-chip", ACM Computing Surveys (CSUR), Vol. 38, Issue 1, Article No. 1, 2006.


Part II

NoC Building Blocks


Chapter 2

On-Chip Interconnection Networks

As VLSI technology scales and processing power continues to improve, inter-processor communication becomes a performance bottleneck. On-chip networks have been widely proposed as the interconnect fabric for high-performance SoCs [1], and their benefits have been demonstrated in several chip multiprocessors (CMPs) [2]–[3]. NoC architectures are emerging as the candidate for a highly scalable, reliable, and modular on-chip communication infrastructure platform. The NoC architecture uses layered protocols and packet-switched networks which consist of on-chip routers, links, and well-defined network interfaces. With the increasing demand for interconnect bandwidth, on-chip networks are taking up a substantial portion of the system power budget. A case in point: the MIT Raw [2] on-chip network, which connects 16 tiles of processing elements, consumes 36% of total chip power, with each router dissipating 40% of individual tile power. The routers and links of the Alpha 21364 microprocessor consume about 20% of the total chip power [4]. These numbers indicate the significance of managing interconnect power consumption. In addition, any NoC architecture should fit into the limited silicon budget, with an optimal choice of NoC fabric topology providing high bisection bandwidth, efficient routing algorithms, and compact, low-power router implementations.


This chapter first presents introductory concepts specific to on-chip networks. Topics include network topologies, crossbar switches, and message switching techniques. The internal functional blocks for a canonical router architecture are also described. A more detailed description of a six-port four-lane 57 GB/s non-blocking router core design (Paper 1), with area and peak-energy benefits, is also presented.

2.1 Interconnection Network Fundamentals

An interconnection network consists of nodes (or routers) and links (or fabric) and can be broadly classified into direct or indirect networks [5]. A direct network consists of a set of nodes, each one directly connected to a (usually small) subset of other nodes in the network. A common component of each node in such a network is a dedicated router, which handles all message communication between nodes, as shown in Figure 2-1. Usually, two neighboring nodes are connected by a pair of uni-directional links in opposite directions. As the number of nodes in the system increases, the total communication bandwidth also increases. This excellent scaling property has made direct networks very popular in constructing large-scale NoC designs; the MIT Raw [2] design is an example of a direct network. In an indirect network, communication between any two nodes must instead go through a set of switches.


Figure 2-1: Direct and indirect networks

All modern day on-chip networks are buffered, i.e., the routers contain storage for buffering messages when they are unable to obtain an outgoing link.


An interconnection network can be defined by four parameters: (1) its topology, (2) the routing algorithm governing it, (3) the message switching protocol, and (4) the router micro-architecture.

2.1.1 Network Topology

The topology of a network concerns how the nodes and links are connected. The topology dictates the number of alternate paths between nodes, and thus how well the network can handle contention and different traffic patterns. Scalability is an important factor in the selection of the right topology for on-chip networks. Figure 2-2 shows various popular on-chip topologies which have been commercially adopted [5].

Shared bus networks (Figure 2-2a) are the simplest, consisting of a single shared link common to all nodes, which must compete for exclusive access to the bus. While simple, bus-based systems scale very poorly as more nodes are added. In a ring (or 1D torus), every node has exactly two neighbors, which allows more concurrent transfers. Since all nodes are not directly connected, messages have to hop along intermediate nodes until they arrive at the final destination, which causes the ring to saturate at a lower network throughput for most traffic patterns. The 2D mesh/torus, crossbar, and hypercube topologies are examples of direct networks and provide a tremendous improvement in performance, but at a cost that typically increases as the square of the number of nodes.

(a) Shared Bus (b) 1D Torus or Ring (c) 2D Mesh

Node

(d) 2D Torus (e) Crossbar (f) Hypercube

Fabric

(a) Shared Bus (b) 1D Torus or Ring (c) 2D Mesh

Node

(d) 2D Torus (e) Crossbar (f) Hypercube

Fabric

Figure 2-2: Popular on-chip fabric topologies
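To make the scalability contrast concrete, here is a small Python sketch (not from the thesis) comparing worst-case hop counts for a bidirectional ring and a square 2D mesh as the node count grows; the formulas are standard textbook network diameters, and the node counts are arbitrary examples.

```python
import math

def ring_max_hops(n):
    """Worst-case hops (diameter) of a bidirectional ring with n nodes."""
    return n // 2

def mesh_max_hops(n):
    """Worst-case hops (diameter) of a k x k 2D mesh, where n = k * k."""
    k = int(math.isqrt(n))
    assert k * k == n, "n must be a perfect square for this sketch"
    return 2 * (k - 1)

for n in (16, 64, 256):
    print(f"{n:>4} nodes: ring diameter {ring_max_hops(n):>3}, 2D mesh diameter {mesh_max_hops(n):>3}")
# The ring diameter grows linearly with n, while the mesh diameter grows only with sqrt(n).
```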

A crossbar (Figure 2-2e) is a fully connected topology, i.e., the interconnection allows any node to directly communicate with any other node. A crossbar network example connecting p processors to b memory banks is shown in Figure 2-3. Note that it is a non-blocking network, i.e., a connection of one processor to a given memory bank does not block a connection of another processor to a different memory bank.


Figure 2-3: A fully connected non-blocking crossbar switch.

The above crossbar uses p × b switches. The hardware cost of a crossbar is therefore high, at least O(p²) when b = p. An important observation is that the crossbar is not very scalable: its cost increases quadratically as more nodes are added.
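The quadratic cost claim can be checked with a two-line calculation: the number of crosspoint switches is p × b, which grows as p² when b = p. A minimal sketch:

```python
def crossbar_switch_count(p, b=None):
    """Crosspoint switches needed to connect p inputs to b outputs (b defaults to p)."""
    return p * (b if b is not None else p)

for p in (6, 16, 64):
    print(f"p = {p:>2}: {crossbar_switch_count(p):>5} crosspoint switches")
# 36, 256, 4096 -- the switch count grows quadratically as nodes are added.
```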

2.1.2 Message Switching Protocols

The message switching protocol determines when a message gets buffered and when it gets to continue. The goal is to effectively share the network resources among the many messages traversing the network, so as to approach the best latency-throughput performance.

(1) Circuit vs. packet switching: In circuit switching, a physical path from source to destination is reserved prior to the transmission of data, as in telephone networks, and the complete message is transmitted along the pre-reserved path. While circuit switching reduces network latency, it does so at the expense of network throughput. Alternatively, a message can be decomposed into packets which share channels with other packets. The first few bytes of a packet, called the packet header, contain routing and control information. Packet switching improves channel utilization and extends network throughput. Packet-based flow control methods dominate interconnection networks today because buffers and channels are limited resources that need to be efficiently allocated.

(2) Virtual Cut-Through (VCT) Switching: To reduce packet delay at each hop, virtual cut-through switching allows transmission of a packet to begin before the entire packet is received, so the latency experienced by a packet is drastically reduced. However, bandwidth and storage are still allocated in packet-sized units, and packets can move forward only if there is enough storage to hold the entire packet. Examples of routers adopting VCT switching are the IBM SP/2 switch [6] and the Mercury router described in [7].

(3) Wormhole Switching: In a wormhole routing scheme, a packet is divided into FLITs (FLow control unITs) for transmission. A FLIT is the minimum divisible amount of data transmitted within a packet that the control logic in the router can process. As the header flit containing the routing information advances along the network, the remaining flits of the message follow in a pipelined manner. The lower packet latency and greater throughput of wormhole routers increase the efficiency of inter-processor communication. In addition, wormhole routers have substantially reduced buffering requirements [8], thus enabling small, compact, and fast router designs. The simplicity, low cost, and distance-insensitivity of wormhole switching are the main reasons behind its wide acceptance by manufacturers of commercial parallel machines. Several high-bandwidth, low-latency routers based on wormhole switching have been presented in [8]–[10].
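As a rough illustration of why the pipelined (wormhole/cut-through) approach lowers latency, the sketch below compares idealized store-and-forward and wormhole latencies, assuming one flit is transferred per hop per cycle on a contention-free path; the hop count and packet size are arbitrary examples, not parameters of the router described later in this chapter.

```python
def store_and_forward_latency(hops, flits):
    """Each router waits for the whole packet before forwarding it (latency in flit cycles)."""
    return hops * flits

def wormhole_latency(hops, flits):
    """The header flit is pipelined across the hops; body flits stream right behind it."""
    return hops + (flits - 1)

hops, flits = 8, 16   # illustrative values only
print("store-and-forward:", store_and_forward_latency(hops, flits), "cycles")  # 128
print("wormhole:         ", wormhole_latency(hops, flits), "cycles")           # 23
```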

2.1.3 Virtual Lanes

Virtual lanes (or logical channels) [11] improve upon the channel utilization of wormhole flow control by allowing blocked packets to be passed by other packets. This is accomplished by associating several virtual channels, each with a separate flit queue, with each physical channel. Virtual channels arbitrate for physical channel bandwidth on a flit-by-flit basis. When a packet holding a virtual channel gets blocked, other packets can still traverse the physical channel through other virtual channels. Virtual lanes were originally introduced to solve the deadlock avoidance problem, but they can also be used to improve network latency and throughput, as illustrated in Figure 2-4. Assume P0 arrived earlier and acquired the channel between the two routers first. In the absence of virtual channels, packet P1 arriving later would be blocked until the transmission of P0 has been completed. Now assume the physical channel implements two virtual channels: upon arrival of P1, the physical channel is multiplexed between the two packets on a flit-by-flit basis, both proceed at half speed, and both continue to make progress. In effect, virtual lanes decouple the physical channels from the message buffers, allowing multiple messages to share the same physical channel in the same manner that multiple programs may share a central processing unit (CPU).

Figure 2-4: Reduction in header blocking delay using virtual channels.

On the downside, the use of virtual channels, while reducing header blocking delays in the network, increases design complexity of the link controller and flow control mechanisms.
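The two-packet scenario of Figure 2-4 can be modeled in a few lines of Python: each packet is assigned its own virtual-channel queue, and the physical channel is granted flit-by-flit in round-robin order, so the later packet is no longer blocked behind the earlier one. The packet lengths and the queue model below are illustrative assumptions only.

```python
from collections import deque

# Two virtual-channel flit queues feeding one physical channel.
virtual_channels = {
    "P0": deque(f"P0.flit{i}" for i in range(4)),
    "P1": deque(f"P1.flit{i}" for i in range(4)),
}

schedule = []
while any(virtual_channels.values()):
    # Flit-by-flit round robin: each non-empty virtual channel gets the link for one flit.
    for name, queue in virtual_channels.items():
        if queue:
            schedule.append(queue.popleft())

print(schedule)
# ['P0.flit0', 'P1.flit0', 'P0.flit1', 'P1.flit1', ...] -- both packets keep making progress.
```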

2.1.4 A Generic Router Architecture

Router architecture is largely determined by the choice of switching technique. Most modern routers used in high-performance multiprocessor architectures utilize packet switching or cut-through switching techniques (wormhole and virtual cut-through) with either a deterministic or an adaptive routing algorithm. Several high-bandwidth, low-latency routers have been designed [12]–[14]. A generic wormhole-switched router (Figure 2-5) has the following key components [5]:

1. Link Controller (LC). This block implements flow control across the physical channel between adjacent routers. The link controllers on either side of the link coordinate to transfer flow control units. In the presence of virtual channels, this unit also decodes the destination channel of the received flit.

2. Buffers. These are first-in-first-out (FIFO) buffers for storing packets in transit and are associated with input and output physical links. Routers may have buffers only on inputs (input buffered) or only on outputs (output buffered). Buffer size is critical and must account for link delays in the propagation of data and flow control signals.

3. Crossbar switch. This component is responsible for connecting the router input buffers to the output buffers. High-speed routers utilize crossbar networks with full connectivity, enabling multiple simultaneous message transfers. In this thesis, we also use the term crossbar channel to refer to the crossbar switch.


4. Routing and arbitration unit. This unit implements the routing function and the task of selecting an output link for an incoming message. Conflicts for the same output link must be arbitrated; fast arbitration policies are crucial to maintaining low latency through the switch. This unit may also be replicated to reduce arbitration delay.


Figure 2-5: Canonical router architecture.

5. Virtual channel controller (VC). This unit is responsible for multiplexing the contents of the virtual channels onto the physical links. Several efficient arbitration schemes have been proposed [15].

6. Processor interface. This block implements a physical link interface to the processor and contains one or more injection or ejection channels to the processor.

2.2 A Six-Port 57GB/s Crossbar Router Core

We now describe a six-port four-lane 57GB/s router core that features double-pumped crossbar channels and destination-aware channel drivers that dynamically configure based on the current packet destination. This enables 45% reduction in channel area, 23% overall chip area, up to 3.8X reduction in peak channel power, and 7.2% improvement in average channel power, with no performance penalty. In a 150nm six-metal CMOS process, the 12.2mm2 core contains 1.9 million transistors and operates at 1GHz at 1.2V.

2.2.1 Introduction

As described in Section 2.1.2, crossbar routers based on wormhole switching are a popular choice to interconnect large multiprocessor systems. Although small crossbar routers are easily implemented on a single chip, area limitations have constrained cost-effective single-chip realization of larger crossbars with bigger data path widths [16]–[17], since switch area increases as a square function of the total number of I/O ports and the number of bits per port. In addition, larger crossbars exacerbate on-chip simultaneous-switching noise [18] because of the increased number of data drivers. This work focuses on the design and implementation of an area- and peak-power-efficient multiprocessor communication router core.

The rest of this chapter is organized as follows. The first three subsections (2.2.2–2.2.4) present the router architecture, including the core pipeline, packet structure, routing protocol, and flow control. Section 2.2.5 describes the design and layout challenges faced during the implementation of an earlier version of the crossbar router, which form the motivation for this work, and introduces two specific design enhancements to the crossbar channel. Section 2.2.6 presents the double-pumped crossbar channel and Section 2.2.7 its operation and timing. Section 2.2.8 explains details of the location-based channel drivers used in the crossbar; these drivers are destination-aware and dynamically configure based on the current packet destination. Section 2.2.9 presents chip results and compares this work with prior crossbar work in terms of area, performance, and energy. We conclude in Section 2.3 with some final remarks and directions for future research.

2.2.2 Router Micro-architecture

The architecture of the crossbar router is shown in Figure 2-6; it is based on the router described in [19], which forms the reference design for this work. The design is a wormhole-switched, input-buffered, fully non-blocking router: inbound packets at any of the six input ports can be routed to any output port simultaneously. The design uses vector-based routing. Each time a packet enters a crossbar port, a 3-bit destination ID (DID) field at the head of the packet determines the exit port. If more than one input contends for the same output port, arbitration determines the binding sequence of input ports to that output. To support deadlock-free routing, the design implements four logical lanes: each physical port is connected to four identical lanes (0–3). These logical lanes, commonly referred to as "virtual channels", allow the physical ports to accommodate packets from four independent data streams simultaneously.



Figure 2-6: Six-port four-lane router block diagram.

Data flow through the crossbar switch is as follows: incoming data is first buffered into the queues, each of which is basically a first-in-first-out (FIFO) buffer. Data coming out of the queues is examined by routing and sequencing logic to generate output binding requests based on the packet header information. With six ports and four lanes, the crossbar switch requires a 24-to-1 arbiter, implemented in two stages: a 6-to-1 port arbiter followed by a 4-to-1 lane arbiter. The first stage of arbitration, within a particular lane, binds one input port to an output port for the entire duration of the packet. If packets in different lanes are contending for the same physical output resource, a second level of arbitration occurs; this level operates at the FLIT level, a finer granularity than packet boundaries. Both levels of arbitration are accomplished using a balanced, non-weighted, round-robin algorithm.
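For intuition, the following is a minimal software model of the balanced, non-weighted round-robin policy used at both arbitration levels (it sketches the policy only, not the actual arbiter circuit): the most recently granted requester receives the lowest priority in the next round, so grants rotate fairly among contending inputs.

```python
class RoundRobinArbiter:
    """Minimal behavioral model of a balanced, non-weighted round-robin arbiter."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.last_grant = num_ports - 1   # so that port 0 has priority on the first round

    def arbitrate(self, requests):
        """Grant one asserted request; return its index, or None if nothing is requested."""
        for offset in range(1, self.num_ports + 1):
            port = (self.last_grant + offset) % self.num_ports
            if requests[port]:
                self.last_grant = port    # the winner gets the lowest priority next round
                return port
        return None

port_arb = RoundRobinArbiter(6)           # models a 6-to-1 port arbiter within one lane
requests = [True, False, True, False, False, True]
print([port_arb.arbitrate(requests) for _ in range(4)])   # -> [0, 2, 5, 0]: grants rotate fairly
```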

Figure 2-7 shows the 6-stage data and control pipelines, including key signals used in the router crossbar switch. The pipe stages use rising-edge-triggered sequential devices. Stages one, two, and three are used for the queue; stage four for the route decoding and sequencing logic; stage five for the port multiplexer and port arbitration; and stage six for the lane multiplexer and lane arbitration. The shaded portion between stages four and five represents the primary physical crossbar routing channel, where the data and control buses are connected between the six ports within a lane. The layout area of this routing channel increases as a square function of the total number of ports and the number of bits per port.


Figure 2-7: Router data and control pipeline diagram.

The shaded portion between stages five and six represents a second physical routing channel. In this case, data and control buses are connected between the four corresponding lanes. The total die area of this routing channel increases only linearly as a function of the total number of bits per port. These two main routing channels run perpendicular to each other on the die. For high performance and a symmetric layout, the basic unit of logic in Figure 2-7 is duplicated 24 times throughout the crossbar: six times within each lane, across all four lanes.

2.2.3 Packet structure and routing protocol

Figure 2-8 shows the packet format used in the router. Each packet is subdivided into FLITs. Each FLIT contains six control bits and 72 data bits. The control bits are made up of packet header (H), valid (V), and tail (T) bits, two encoded lane ID bits which indicate one of the four lanes, and flow control (FC) information. The flow control bits carry receiver queue status and create "back pressure" in the network to help prevent queue overflows.

The crossbar uses vector-based routing. Data fields in the "header" flit of a packet are reserved for the routing destination information required for each hop, where a "hop" is defined as a flit entering and leaving a router. The router uses the 3-bit destination ID (DID) field in the header to determine the exit port. The least significant 3 bits of the header indicate the current destination ID and are updated by shifting out the current DID after each hop. Each header flit supports a maximum of 24 hops. The minimum packet size the protocol allows is two flits; this was chosen to simplify the sequencer control logic and to allow back-to-back packets to pass through the router core without inserting idle flits between them. The router architecture places no restriction on the maximum packet size.


Figure 2-8: Router packet and FLIT format.
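To illustrate the vector-routing protocol, here is a hedged Python sketch of how a source might pack a sequence of per-hop 3-bit DIDs into the 72-bit header data field, and how each router consumes the least significant DID and shifts the remainder before forwarding. The field widths follow the packet format above; the helper function names are invented for the example and do not correspond to any logic block in the design.

```python
def pack_route(dids):
    """Pack up to 24 per-hop 3-bit destination IDs (DIDs) into the 72-bit header data field."""
    assert len(dids) <= 24 and all(0 <= d < 8 for d in dids)
    header = 0
    for hop, did in enumerate(dids):
        header |= did << (3 * hop)   # the first hop occupies the least significant 3 bits
    return header

def route_one_hop(header):
    """Return (exit_port, updated_header): consume the current DID and shift the rest down."""
    exit_port = header & 0b111
    return exit_port, header >> 3

header = pack_route([4, 1, 5])       # illustrative 3-hop route
for hop in range(3):
    port, header = route_one_hop(header)
    print(f"hop {hop}: exit port {port}")
# hop 0: exit port 4 / hop 1: exit port 1 / hop 2: exit port 5
```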

2.2.4 FIFO buffer and flow control

The queue acts as a FIFO buffer and is implemented as a large-signal, 1-read, 1-write register file. Each queue is 76 bits wide and 64 flits deep, and there are four separate queues [20]–[21] per physical port to support the virtual channels. Reads and writes to two different locations in the queue occur simultaneously in a single clock cycle. The read and write circuits are implemented in a single-ended fashion. Read sensing uses a two-level, local and global bit-line approach, which reduces overall bit-line loading and speeds up address decoding. In addition, data-dependent, self-timed pre-charge circuits are used to lower both clock and total power.

Flow control and buffer management between routers is debit-based, using almost-full bits (Figure 2-9) which the receiver FIFO signals via the flow control (FC) bits when its buffers reach a specified threshold. This mechanism was chosen over conventional credit- and counter-based flow control schemes for improved reliability and recovery in the presence of signaling errors. The protocol allows all four receiving queues across the four lanes to independently signal the sender when they are almost full, regardless of the data traffic at that port. The FIFO almost-full threshold can be programmed via software.



Figure 2-9: Flow control between routers.
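The sketch below models the almost-full handshake in software: the receiving FIFO raises an almost-full flag once its occupancy crosses a programmable threshold, and the sender stalls while the flag is set. The 64-flit depth comes from the text; the threshold value, drain rate, and class structure are illustrative assumptions rather than details of the actual control logic.

```python
from collections import deque

class ReceiverQueue:
    """64-flit FIFO that signals 'almost full' above a programmable threshold."""

    def __init__(self, depth=64, almost_full_threshold=60):
        self.depth = depth
        self.threshold = almost_full_threshold   # programmable via software in the real design
        self.fifo = deque()

    @property
    def almost_full(self):
        return len(self.fifo) >= self.threshold

    def push(self, flit):
        assert len(self.fifo) < self.depth, "FIFO overflow"
        self.fifo.append(flit)

    def pop(self):
        return self.fifo.popleft() if self.fifo else None

rx = ReceiverQueue()
sent = stalled = 0
for cycle in range(100):
    if rx.almost_full:
        stalled += 1                 # back pressure: the sender holds the flit this cycle
    else:
        rx.push(f"flit{sent}")
        sent += 1
    if cycle % 3 == 0:               # the receiver drains more slowly than the sender fills
        rx.pop()

print(f"sent={sent}, stalled cycles={stalled}, final occupancy={len(rx.fifo)}")
```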

2.2.5 Router Design Challenges

This work was motivated by the design challenges and issues faced while implementing an earlier version of the router, built in a six-metal, 180nm technology node (Leff = 150nm) and reported in [19]. Figure 2-10(a) shows the internal organization of the six ports within each lane. We focused our attention on the physical crossbar channel between pipe stages four and five, where data and control buses are connected between the six ports. The corresponding reference layout for one of the four lanes is given in Figure 2-10(b). The crossbar channel interconnect structure used in the original full-custom design is shown in Figure 2-10(c): each wire in the channel is routed using metal 3 (vertical) and metal 4 (horizontal) layers with a 1.1µm pitch.


Figure 2-10: Crossbar router in [19] (a) Internal lane organization. (b) Corresponding layout. (c) Channel interconnect structure.

In each lane, 494 signals (456 data, 38 control) traverse the 3.3mm long, wire-dominated channel. More importantly, the crossbar channel consumed 53% of the core area and presented a significant physical design challenge. To meet GHz operation, the entire core used hand-optimized data path macros. In addition, the design used large bus drivers to forward data across the crossbar channel, sized to meet the worst-case routing and timing requirements and replicated at each port. Over 1800 such channel drivers were used across all four lanes in the router core. When a significant number of these drivers switch simultaneously, they place a substantial transient current demand on the power distribution system. The magnitude of the simultaneous-switching noise spike can be computed using the expression

∆V = L · M · ∆I / ∆t    (2.1)

where L is the effective inductance, M the number of drivers switching simultaneously, and ∆I the current switched by each driver over the transition interval ∆t. For example, with M = 200, a power grid with a parasitic inductance of 1nH, and a current change of 2mA per driver over an overly conservative edge rate of half a nanosecond, the ∆V noise is approximately 0.8 volts. This peak noise is significant in today's deep sub-micron CMOS circuits and causes severe power supply fluctuations and possible logic malfunction. It is important to note that the peak switching noise, while data-activity dependent, is independent of router traffic patterns and of the physical distance of an outbound port from the input port. Several techniques have been proposed to reduce this switching noise [17]–[18], but they have not been addressed in the context of packet-switched crossbar routers.
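As a quick numerical check of Equation 2.1, the short Python snippet below reproduces the approximately 0.8 V figure from the example above, using the same values for L, M, ∆I, and ∆t.

```python
# Numerical check of Equation 2.1: dV = L * M * dI / dt
L = 1e-9      # effective power-grid inductance: 1 nH
M = 200       # number of drivers switching simultaneously
dI = 2e-3     # current switched per driver: 2 mA
dt = 0.5e-9   # transition interval (edge rate): 0.5 ns

dV = L * M * dI / dt
print(f"Simultaneous-switching noise spike: {dV:.2f} V")  # -> 0.80 V
```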

In this work, we present two specific design enhancements to the crossbar channel and a more detailed description of the techniques described in Paper 1. To mitigate crossbar channel routing area, we propose double-pumping the channel data. To reduce peak power, we propose a location-based channel driver (LBD) which adjusts its driver strength as a function of the current packet destination. Reducing peak currents reduces the demand on the power grid and the required decoupling capacitance area, which in turn results in lower gate leakage.

2.2.6 Double-pumped Crossbar Channel

We now describe the design details of the double-pumped crossbar. The crossbar core is a fully synchronous design, and a full clock cycle is allocated for communication between stages 4 and 5 of the core pipeline. As shown in Figure 2-11, one clock phase is allocated for data propagation across the channel. At pipe stage 4, the 76-bit data bus is double-pumped by interleaving alternate data bits using dual-edge-triggered flip-flops. The master latch M0 for data input (i0) is combined with the slave latch S1 for data input (i1), and so on. The slave latch S0 is moved across the crossbar channel to the receiver. A 2:1 multiplexer, enabled by the clock, is used to select between the latched output data on each clock phase.


Figure 2-11: Double-pumped crossbar channel.

This double-pumping effectively cuts the number of channel data wires in each lane by 50%, from 456 to 228, and across all four logical lanes from 1824 to 912 nets. There are additional area savings due to a 50% reduction in the number of channel data drivers. To avoid any timing impact, the 38 control (request and grant) wires in the channel are left untouched.
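Functionally, double-pumping is time-division multiplexing of two data bits per wire per clock cycle, one on each clock phase. The sketch below models only that bit-level interleaving (not the latch and multiplexer circuit): 456 data bits are carried on 228 wires by sending even-indexed bits on the first phase and odd-indexed bits on the second, and the receiver re-interleaves them losslessly.

```python
import random

def double_pump_transmit(data_bits):
    """Split a flat bit vector into (phase0, phase1) words carried on half as many wires."""
    assert len(data_bits) % 2 == 0
    phase0 = data_bits[0::2]   # even-indexed bits sent on the first clock phase
    phase1 = data_bits[1::2]   # odd-indexed bits sent on the second clock phase
    return phase0, phase1

def double_pump_receive(phase0, phase1):
    """Re-interleave the two phases back into the original bit order."""
    data = [None] * (len(phase0) + len(phase1))
    data[0::2] = phase0
    data[1::2] = phase1
    return data

original = [random.randint(0, 1) for _ in range(456)]   # one lane's worth of channel data
p0, p1 = double_pump_transmit(original)
assert len(p0) == len(p1) == 228                        # physical wires per lane after double-pumping
assert double_pump_receive(p0, p1) == original          # lossless round trip
```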

2.2.7 Channel operation and timing

The timing diagram in Figure 2-12 summarizes the communication of data bits D[0–3] across the double-pumped channel. The signal values in the diagram match the corresponding node names in the schematic of Figure 2-11. On the transmit side, during cycles T2 and T3 the master latch M0 latches input data bits D[0] and D[2], while the slave latch S1 retains bits D[1] and D[3]. The 2:1 multiplexer, enabled by the clock, selects between the latched output data on each clock phase, thus effectively double-pumping the data, and a large driver forwards the data across the crossbar channel.

At the receiving end of the crossbar channel, the delayed double-pumped data is first latched by the slave latch S0, which retains bits D[0] and D[2] prior to capture by pipe stage 5. Data bits D[1] and D[3] are extracted using a rising-edge flip-flop. Note that the overall data delay from stage 4 to stage 5 of the pipeline is still one cycle, so there is no additional latency impact due to double-pumping. The technique does, however, impose a close-to-50% duty-cycle requirement on clock generation and distribution.

Figure 2-12: Crossbar channel timing diagram.

2.2.8 Location-based Channel Driver

To reduce peak power, the crossbar channel uses location-based drivers (LBDs) as a drop-in replacement for conventional bus drivers. Figure 2-13 shows the LBD and the associated control logic highlighted. A unique, hard-wired 3-bit location ID (LID) provides information about the spatial location of the driver in the floor-plan, at port-level granularity. By comparing the LID bits with the DID bits in the header flit, an indication of the receiver's distance from the driver is obtained; a 3-bit subtractor accomplishes this task. The subtraction result is encoded to control the location-based drivers. To minimize the delay penalty on the crossbar channel data, the encoder output control signals (en[2:0]) are re-timed to pipe stage 4.

The LBD schematic in Figure 2-14(a) shows a legged driver with individual enable controls for each leg; the legs are binary weighted. The LBD encodings are carefully chosen to allow a full range of driver sizes and communication distances without exceeding the path data delay budget.



Figure 2-13: Location-based channel driver and control logic.

LID[2:0] - DID[2:0]   en[2:0]
000                   001
001                   010
010                   011
011                   101
100                   110
101                   111

Figure 2-14: (a) LBD Schematic. (b) LBD encoding summary.

The LBD encoding is summarized in Figure 2-14(b) and is chosen such that the smallest leg of the driver is turned on when a packet is headed to its own port, a feature called "loopback". A larger leg is turned on if the packet is destined for an adjacent port. All legs are enabled only for the farthest routable distances in the channel, specifically ports 0 → 5 or 5 → 0. Hard-wiring the LID bits enables a modular design and layout that is reusable at each port. It is important to note that the LBD enable bits change only at packet boundaries, not at flit boundaries. The loopback capability, where a packet can be routed from a node to itself, is supported to assist with diagnostics and driver development.
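Behaviorally, the LBD control can be viewed as a lookup from port distance to a binary-weighted 3-bit enable word. The sketch below assumes the distance is the absolute difference between the hard-wired LID and the packet's DID and uses the encoding of Figure 2-14(b); it models the intent only, not the subtractor/encoder or the legged driver circuit.

```python
# Figure 2-14(b): port distance (LID - DID) -> binary-weighted enable bits en[2:0]
LBD_ENCODING = {
    0: 0b001,  # loopback to the same port: smallest leg only
    1: 0b010,
    2: 0b011,
    3: 0b101,
    4: 0b110,
    5: 0b111,  # farthest routable distance (port 0 -> 5 or 5 -> 0): all legs on
}

def lbd_enable(lid, did):
    """Return en[2:0] for a driver at port 'lid' sending to port 'did' (assumes absolute distance)."""
    return LBD_ENCODING[abs(lid - did)]

def drive_strength(en, unit=1.0):
    """Relative strength of a binary-weighted legged driver for a given enable word."""
    return unit * ((en & 1) + 2 * ((en >> 1) & 1) + 4 * ((en >> 2) & 1))

print(bin(lbd_enable(3, 3)), drive_strength(lbd_enable(3, 3)))  # loopback:  0b1,   strength 1.0
print(bin(lbd_enable(0, 5)), drive_strength(lbd_enable(0, 5)))  # farthest:  0b111, strength 7.0
```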

2.2.9 Chip Results

A six-port, four-lane router core utilizing the proposed double-pumped crossbar channel with LBDs has been designed in a 150nm six-metal CMOS process. All parasitic capacitance and resistance data from layout was extracted and included in the circuit simulations, and the results are compared to the reference work [19] in terms of area, performance, and power, as summarized in Figure 2-15. It is important to note that both designs are compared on the same process generation.

(Comparison of ISSCC 2001 [19] vs. this work: channel width 560 vs. 310 µm; chip area 15.8 vs. 12.2 mm2; clock load 3427 vs. 3580 µm; total delay 483 vs. 446 ps; 2:1 mux penalty 483 vs. 512 ps; channel power 333 vs. 309 mW.)

Figure 2-15: Double-pumped crossbar channel results compared to work reported in [19].

Double-pumping the crossbar channel reduces each of the four channel widths from 560µm to 310µm and the full-chip area from 15.8mm2 to 12.2mm2, enabling a 45% reduction in channel area and 23% in overall chip area, with no latency penalty over the original design. The smaller channel enables an 8.3% reduction in worst-case interconnect length and a 7.7% improvement in total signal propagation delay due to reduced capacitance. This also results in a 7.2% improvement in average channel power, independent of traffic patterns. The technique, however, increases the clock load by 4%, primarily because of the 2:1 multiplexer. Note that this is reported as a percentage of the clock load seen in stages 4 and 5 of the router pipeline; the penalty relative to the clock load of the entire core is less than 1%. In addition, the 2:1 multiplexer increases signal propagation delay by 6%, which is more than offset by the 7.7% improvement in channel delay due to the reduced double-pumped interconnect length.

Figure 2-16 plots the simulated logic delay, the wire delay, and the total crossbar channel signal propagation delay as a function of the distance of the receiver from the driver. The logic delay includes a fixed component and a varying location-based driver delay. A port distance of "6" indicates the farthest routable port and "1" indicates packet loopback on the same port. Notice that the total crossbar signal delay remains roughly constant over the range of port distances, because the location-based driver dynamically trades driver (logic) delay against wire delay for the specific destination port, without exceeding the combined delay budget of 500ps.


Figure 2-16: Simulated crossbar channel propagation delay as function of port distance.

Figure 2-17 shows the improvement in peak channel driver current with LBD usage for various router traffic patterns. All 1824 channel drivers across all four lanes were replaced by location-based drivers. The graph plots the peak current per driver (in mA) for varying port distance and shows the current falling from 5.9mA for the worst-case port distance of 6 down to 1.55mA for the best-case port distance of 1. The LBD technique thus achieves up to a 3.8X reduction in peak channel current, depending on the packet destination. As a percentage of the logic in pipe stage 4, the LBD control layout overhead is 3%.


Figure 2-17: LBD peak current as function of port distance.

Router area: 3.7mm × 3.3mm
Technology: 150nm CMOS
Transistors: 1.92 million
Interconnect: 1 poly, 6 metal
Core frequency: 1GHz @ 1.2V
Channel area savings: 45.4%
Core area savings: 23%

Figure 2-18: Router layout and characteristics.

The router core layout and a summary of the chip characteristics are shown in Figure 2-18. In a 150nm six-metal process, the 12.2mm2 full-custom core contains 1.92 million transistors and operates at 1GHz at 1.2V. The tiled nature of the physical implementation is easily visible: the six ports within each lane set run vertically, and the four lane sets are placed side by side. Note the placement of the four lanes, which are mirrored to reduce the number of routing combinations across lanes. Also note the locations of the inbound queues and their area relative to the routing channel in the crossbar. De-coupling capacitors fill the space below the channel and occupy approximately 30% of the die area.

2.3 Summary and future work

This work has demonstrated the area and peak-energy reduction benefits of a six-port, four-lane 57 GB/s communications router core. Design enhancements include four double-pumped crossbar channels and destination-aware crossbar channel drivers that dynamically configure based on the current packet destination. The combined application of both techniques enables a 45% reduction in channel area, 23% in overall core area, up to 3.8X reduction in peak crossbar channel power, and a 7.2% improvement in average channel power, with no performance penalty, when compared to an identical reference design on the same process. The crossbar area and peak power reduction also allows larger crossbar networks to be integrated on a single die. This router design, when configured appropriately in a multiple-node system, yields a total system bandwidth in excess of one Terabyte/s.

As future work, we propose the application of low-swing signaling techniques [22] to the router links and the internal crossbar to significantly reduce energy consumption. In addition, encoding techniques based on [23]–[24] can be applied to the crossbar channel data buses to further mitigate peak power.

2.4 References

[1]. L. Benini and G. D. Micheli, “Networks on Chips: A New SoC Paradigm,” IEEE Computer, vol. 35, no. 1, pp. 70–78, 2002.

[2]. M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffmann, P. Johnson, J. W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, “The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs,” IEEE Micro, vol.22, no.2, pp. 25–35, 2002.

[3]. K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S.W. Keckler, and C. R. Moore, “Exploiting ILP, TLP, and DLP with The Polymorphous TRIPS Architecture,” in Proceedings of the 30th Annual International Symposium on Computer Architecture, pp. 422–433, 2003.


[4]. H. Wang, L. Peh, S. Malik, “Power-driven design of router microarchitectures in on-chip networks,” MICRO-36, Proceedings of 36th Annual IEEE/ACM International Symposium on Micro architecture, pp. 105-116, 2003.

[5]. J. Duato, S. Yalamanchili and L. Ni, "Interconnection Networks: An Engineering Approach, 1st edition", IEEE Computer Society Press, Los Alamitos, CA, 1997.

[6]. Craig B. Stunkel, “The SP2 High-Performance Switch”, IBM System Journal, vol. 34, no. 2, pp. 185-204, 1995.

[7]. W. D. Weber et al., "The Mercury Interconnect Architecture: A Cost-effective Infrastructure for High-Performance Servers", In Proceedings of the Twenty-Fourth International Symposium on Computer Architecture, Denver, Colorado, pp. 98-107, June 1997.

[8]. William J. Dally and Charles L. Seitz, “The Torus Routing Chip”, Distributed Computing, Vol. 1, No. 3, pp. 187-196, 1986.

[9]. J. Carbonaro and F. Verhoorn, “Cavallino: The Teraflops Router and NIC,” Hot Interconnects IV Symposium Record. Aug. 1996, pp. 157-160.

[10]. R. Nair, N.Y. Borkar, C.S. Browning, G.E. Dermer, V. Erraguntla, V. Govindarajulu, A. Pangal, J.D. Prijic, L. Rankin, E. Seligman, S. Vangal, and H.A. Wilson, “A 28.5 GB/s CMOS non-blocking router for terabit/s connectivity between multiple processors and peripheral I/O nodes”, ISSCC Digest of Technical Papers, Feb. 2001, pp. 224-225.

[11]. W.J. Dally, “Virtual Channel Flow Control,” IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 2, Feb. 1992, pp. 194–205.

[12]. J.M. Orduna and J. Duato, "A high performance router architecture for multimedia applications", Proc. Fifth International Conference on Massively Parallel Processing, June 1998, pp. 142 – 149.

[13]. N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick and M. Horowitz, "Tiny Tera: a packet switch core", IEEE Micro, Volume 17, Issue 1, Jan. - Feb. 1997, pp. 26 – 33.

[14]. M. Galles, "Spider: a high-speed network interconnect", IEEE Micro, Volume 17, Issue 1, Jan. - Feb. 1997, Page(s):34 – 39.

[15]. Y. Tamir and H. C. Chi, "Symmetric crossbar arbiters for VLSI communication switches", IEEE Transactions on Parallel and Distributed Systems, Volume 4, Issue 1, Jan. 1993, pp. 13 – 27.

[16]. R. Naik and D. M. H. Walker, “Large integrated crossbar switch”, in Proc. Seventh Annual IEEE International Conference on Wafer Scale Integration, Jan. 1995, pp. 217 – 227.

[17]. J. Ghosh and A. Varma, "Reduction of simultaneous-switching noise in large crossbar networks", IEEE Transactions on Circuits and Systems, Volume 38, Issue 1, Jan. 1991, pp. 86 – 99.

[18]. K.T. Tang and E.G. Friedman, "On-chip ∆I noise in the power distribution networks of high speed CMOS integrated circuits", Proc. 13th Annual IEEE International ASIC/SOC Conference, Sept. 2000, pp. 53 – 57.

[19]. H. Wilson and M. Haycock, “A six-port 30-GB/s non-blocking router component using point-to-point simultaneous bidirectional signaling for high-bandwidth interconnects”, IEEE JSSC, vol. 36, Dec. 2001, pp. 1954 – 1963.

[20]. Y. Tamir and G. Frazier, “High Performance Multiqueue Buffers for VLSI Communication Switches,” Proc. 15th Annual Symposium on Computer Architecture, IEEE Computer Society Press, Los Alamitos, Calif., 1988, pp. 343 – 354.

[21]. B. Prabhakar, N. McKeown and R. Ahuja, “Multicast Scheduling for Input-Queued Switches,” IEEE J. Selected Areas in Communications, Vol. 15, No. 5, Jun. 1997, pp. 855-866.

[22]. R. Ho, K. W. Mai, and M. A. Horowitz, “The Future of Wires,” in Proceedings of the IEEE, pp. 490–504, 2001.

[23]. M. R. Stan and W.P. Burleson, “Low-power encodings for global communication in CMOS VLSI”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume: 5, Issue: 4, Dec. 1997, pp. 444 – 455.

[24]. H. Kaul, D. Sylvester, M. Anders, R. Krishnamurthy, “Spatial encoding circuit techniques for peak power reduction of on-chip high-performance buses”, Proceedings of ISLPED '04, Aug. 2004, pp. 194 – 199.


Chapter 3

Floating-point Units

Scientific and engineering applications require high-performance floating-point units (FPUs). The advent of multimedia applications, such as 3D graphics and signal processing, places stronger demands for self-contained, low-latency FPUs with increased throughput. This chapter presents the basic concepts involved in floating-point arithmetic, reviews conventional FPU designs, and describes the challenges in accomplishing single-cycle accumulation on today’s FPUs. A new pipelined single-precision FPMAC design, capable of single-cycle accumulation, is described in Paper 2 and Paper 3. Special flip-flop circuits used in the FPMAC design and test circuits are also presented.

3.1 Introduction to floating-point arithmetic

The design of FPUs is considered more difficult than most other arithmetic units due to the relatively large number of sequentially dependent steps required for a single FP operation and the extra circuits to deal with special cases such as infinity arithmetic, zeros and NaNs (Not a Number), as demanded by the IEEE-754 standard [3]. As a result, there is opportunity for exploring algorithms, logic and circuit techniques that enable faster FPU implementations. An excellent introduction to FP arithmetic can be found in Omondi [1] and Goldberg [2], which describe logic principles behind FP addition, multiplication and division. Since the mid 1980s, almost all commercial FPU designs have followed the IEEE-754 standard [3], which specifies two basic floating-point formats: single and double. The IEEE single format consists of three fields: a 23-bit fraction, f; an 8-bit biased exponent, e; and a 1-bit sign, s. These fields are stored contiguously in one 32-bit word, as shown in Figure 3-1.

Figure 3-1: IEEE single precision format. The exponent is biased by 127.

If the exponent is not 0 (normalized form), the mantissa is 1.fraction; if the exponent is 0 (denormalized form), the mantissa is 0.fraction. The IEEE double format has a precision of 53 bits and is 64 bits wide overall.
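To make the field layout concrete, the Python sketch below unpacks a 32-bit single-precision word into its sign, biased-exponent and fraction fields and reassembles the value. It is an illustrative software model only; the function name and the use of Python's struct module are choices made here, not part of the thesis hardware.

```python
import struct

def decode_ieee754_single(x):
    """Split a single-precision value into (sign, biased exponent, fraction)
    and recompute it, following the normalized/denormalized rules above.
    (Infinities and NaNs, e = 255, are not handled in this sketch.)"""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    s = (bits >> 31) & 0x1
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    if e != 0:                                # normalized: mantissa = 1.fraction
        value = (-1) ** s * (1.0 + f / 2 ** 23) * 2 ** (e - 127)
    else:                                     # denormalized: mantissa = 0.fraction
        value = (-1) ** s * (f / 2 ** 23) * 2 ** (-126)
    return s, e, f, value

print(decode_ieee754_single(-6.25))           # (1, 129, 4718592, -6.25)
```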

3.1.1 Challenges with floating-point addition and accumulation

FP addition is based on the sequence of mantissa operations: swap, shift, add, normalize and round. A typical floating-point adder (Figure 3-2) first compares the exponents of the two input operands, then swaps and shifts the mantissa of the smaller number to get the operands aligned. It is necessary to adjust the sign if either of the incoming numbers is negative. The two mantissas are then added, with the result requiring another sign adjustment if it is negative. Finally, the adder renormalizes the sum, adjusts the exponent accordingly, and truncates the resulting mantissa using an appropriate rounding scheme [4]. For extra speed, FP adders use leading-zero anticipatory (LZA) logic to carry out predecoding for normalization shifts in parallel with the mantissa addition. Clearly, single-cycle FP addition performance is impeded by the speed of the critical blocks: comparator, variable shifter, carry-propagate adder, normalization logic and rounding hardware. To accumulate a series of FP numbers (e.g., ΣAi), the current sum is looped back as an input operand for the next addition operation, as shown by the dotted path in Figure 3-2. Over the past decade, a number of high-speed FPU designs have been presented [5]–[7]. FP workloads are dominated by multiplication and addition operations, justifying the need for fused multiply-add hardware. Probably the most common use of FPUs is performing matrix operations, and the most frequent matrix operation is a matrix multiplication, which boils down to computing the inner product of two vectors:

$X = \sum_{i=1}^{n} A_i \times B_i$    (3.1)


Figure 3-2: Single-cycle FP adder with critical blocks highlighted.

Computing the dot product requires a series of multiply-add operations. Motivated by this, the IBM RS/6000 first introduced a single instruction that computes the fused multiply-add (FPMADD) operation.

$X_i = (A \times B) + C$    (3.2)

Although this requires a three-operand read in a single instruction, it has the potential for improving the performance of computing inner products. In such cases, one option is to design the FP hardware to accept up to three operands for executing MADD instructions as shown in Figure 3-3(a), while other FP instructions requiring fewer than three operands may utilize the same hardware by forcing constants into the unused operands. For example, to execute the floating-point add instruction, T = A + C, the B operand is forced to the constant 1.0. Similarly, a multiply operation would require forcing operand C to zero. Examples of fused multiply-add FPUs in this category, including the CELL processing element, are described in [8]–[10]. A second option is to optimize the FP hardware for dot product accumulations by accepting two operands, with an implicit third operand (Figure 3-3b). We define a fused multiply-accumulate (FPMAC) instruction as:

$X_i = (A \times B) + X_{i-1}$    (3.3)
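Computing an inner product then reduces to issuing one FPMAC instruction per element, with the running sum carried implicitly. The short Python sketch below is a purely behavioral illustration of equation (3.3); the function name is chosen here and is not part of the design.

```python
def dot_product_fpmac(a, b):
    """Inner product expressed as a chain of FPMAC operations,
    X_i = (A_i * B_i) + X_{i-1}, with X_0 = 0 (equation 3.3)."""
    x = 0.0
    for ai, bi in zip(a, b):
        x = ai * bi + x   # one fused multiply-accumulate per element
    return x

print(dot_product_fpmac([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```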


Figure 3-3: FPUs optimized for (a) FPMADD (b) FPMAC instruction.

An important figure of merit for FPUs is the initiation (or repeat) interval [2], which is the number of cycles that must elapse between issuing two operations of a given type. An ideal FPU design would have an initiation interval of one.

3.1.2 Scheduling issues with pipelined FPUs

To enable GHz operation, all state-of-the-art FPUs are pipelined, resulting in multi-cycle latencies on most FP instructions. An example of a five-stage pipelined FP adder [11] is shown in Figure 3-4(a) and a six-stage multiply-add FPU [10] is given in Figure 3-4(b). An important observation is that neither of these implementations is optimal for accomplishing a steady stream of dot product accumulations, because of pipeline data hazards. For example, the six-stage FPU in Figure 3-4(b) has an initiation interval of 6 for FPMAC instructions, i.e., once an FPMAC instruction is issued, a second one can only be issued a minimum of six cycles later due to a read-after-write (RAW) hazard [2]. In this case, a new accumulation can only start after the current instruction is complete, thus increasing overall latency. Schedulers of multi-cycle FPUs detect such hazards and place code scheduling restrictions between consecutive FPMAC instructions, resulting in throughput loss. The goal of this work is to eliminate such limitations by demonstrating single-cycle accumulate operation and a sustained FPMAC (Xi) result every cycle at GHz frequencies.
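The throughput cost of a long initiation interval can be seen with a simple cycle-count model. The sketch below is a back-of-the-envelope illustration only; the pipeline depth and instruction count are example values, not measurements from any particular design.

```python
def cycles_for_accumulations(n_ops, pipeline_depth, initiation_interval):
    """Cycles to complete n dependent FPMAC accumulations: a new operation
    issues every 'initiation_interval' cycles, and the last result appears
    'pipeline_depth' cycles after its issue."""
    if n_ops == 0:
        return 0
    return (n_ops - 1) * initiation_interval + pipeline_depth

# Six-stage FPU with a RAW hazard on the accumulator: initiation interval 6
print(cycles_for_accumulations(100, 6, 6))  # 600 cycles
# Single-cycle accumulate loop: initiation interval 1
print(cycles_for_accumulations(100, 6, 1))  # 105 cycles
```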

Several solutions have been proposed to minimize data hazards and stalls due to pipelining. Result forwarding (or bypassing) is a popular technique to reduce the effective pipeline latency. Notice that the result bus is forwarded as a possible input operand by the CELL processor FPU in Figure 3-4 (b).


Figure 3-4: (a) Five-stage FP adder [11] (b) Six-stage CELL FPU [10].

Multithreading has emerged as one of the most promising options for hiding multi-cycle FPU latency by exploiting thread-level parallelism (TLP). Multithreaded systems [12] provide support for multiple contexts and fast context switching within the processor pipeline. This allows multiple threads to share the same FPU resources, with the processor switching between threads not only to hide memory latency but also to avoid stalls resulting from data hazards. As a result, multithreaded systems tend to achieve better processor utilization and improved performance. These benefits come with the disadvantage that multithreading can add significant complexity and area overhead to the architecture, thus increasing design time and cost.

3.2 Single-cycle Accumulation Algorithm

In an effort to achieve a fast single-cycle accumulate operation, we first analyzed each of the critical operations involved in conventional FPUs (Figure 3-2), with the intent of eliminating, reducing and/or deferring the logic operations inside the accumulate loop. The proposed FPMAC algorithm is given in Figure 3-5 and employs the following optimizations:


(1) Swap and Shift: To minimize the interaction between the incoming operand and the accumulated result, we choose to self-align the incoming number every cycle rather than following the conventional method of aligning it to the accumulator result. The incoming number is converted to base 32 by shifting the mantissa left by an amount given by the five least significant exponent bits (Exp[4:0]), thus extending the mantissa width from 24 bits to 55 bits (a behavioral sketch of this self-alignment follows the list below). This approach has the benefit of removing the need for a variable mantissa alignment shifter inside the accumulation loop. The reduction in exponent width from 8 to 3 bits (Exp[7:5]) expedites exponent comparison.

(2) Addition: Accumulation is performed in a base 32 system. The accumulator retains the multiplier output in carry-save format and uses an array of 4-2 carry-save adders to “accumulate” the result in an intermediate format [13], delaying the final addition until the end of a repeated calculation such as an accumulation or dot product (a sketch of carry-save accumulation also follows the list below). This idea is frequently used in multipliers built using Wallace trees [14], and it removes the need for an expensive carry-propagate adder in the critical path. The expensive variable shifters in the accumulate loop are replaced with constant shifters, which are easily implemented using a simple multiplexer circuit.

[Figure content: pipelined Wallace-tree multiplier; operand self-align; single-cycle accumulate; pipelined addition, normalize/round.]

Figure 3-5: Single-cycle FPMAC algorithm.

(3) Normalize and round: The costly normalization step is moved outside the accumulate loop, where the accumulation result in carry-save is added, the sum normalized and converted back to base 2, with necessary rounding.
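A minimal Python sketch of the operand self-alignment in optimization (1) is shown below, assuming the field widths given above (a 24-bit mantissa and an 8-bit exponent split into Exp[7:5] and Exp[4:0]). It is an illustrative bit-level model, not the hardware implementation.

```python
def self_align_base32(exp8, mant24):
    """Convert an operand to base 32: shift the 24-bit mantissa left by the
    five exponent LSBs (widening it to as many as 55 bits) and keep only the
    top three exponent bits, so no variable shifter is needed in the loop."""
    shift = exp8 & 0x1F          # Exp[4:0]
    exp_base32 = exp8 >> 5       # Exp[7:5]
    mant55 = mant24 << shift     # up to 24 + 31 = 55 bits
    return exp_base32, mant55

print(self_align_base32(0b10000001, 0xC80000))  # (4, 26214400): shift of 1
```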
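The delayed-addition idea in optimization (2) can likewise be illustrated with a small integer model. The hardware uses 4-2 carry-save compressors on the base-32 mantissas; for simplicity the sketch below uses a 3:2 carry-save adder, which demonstrates the same principle of keeping the running total in redundant sum/carry form and performing a single carry-propagate addition only at the end.

```python
def csa_3to2(a, b, c):
    """3:2 carry-save adder: compress three operands into a sum/carry pair
    without propagating carries across the word."""
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s, carry

def accumulate_carry_save(values):
    """Accumulate in redundant form; defer the carry-propagate add to the end."""
    s, c = 0, 0
    for v in values:
        s, c = csa_3to2(s, c, v)   # per-cycle work: no carry propagation
    return s + c                   # single final carry-propagate addition

print(accumulate_carry_save([3, 7, 11, 5]))  # 26
```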


With a single-cycle accumulate loop, we have successfully eliminated RAW data hazards due to accumulation and enabled scheduling of FPMAC instructions every cycle. Note that a RAW data hazard can still occur between consecutive FPMAC instructions if either input operand to the multiplier (A or B) in the current FPMAC instruction depends on the previous accumulated result. In such cases, the issue of the second FPMAC instruction must be delayed by the full FPMAC pipeline depth (in cycles). A better option is to have the compiler re-order the instruction stream around such dependent operations to improve throughput. With this exception, the proposed design allows FPMAC instructions to issue with an initiation interval of 1, significantly increasing throughput. A detailed description of the work is presented in Paper 2 and Paper 3.

3.3 CMOS prototype

A test chip using the proposed FPMAC architecture has been designed and fabricated in a 90nm communication technology [15]. The 2mm2 full-custom design contains 230K transistors, with the FPMAC core accounting for 151K (67%) of the device count, and 0.88mm2 (44%) of total layout area.

3.3.1 High-performance Flip-Flop Circuits

To enable fast operation, the FPMAC core uses implicit-pulsed semi-dynamic flip-flops [16]–[17], with fast clock-to-Q delay and high skew tolerance. When compared to a conventional static master-slave flip-flop, semi-dynamic flip-flops provide both shorter latency and the capability of incorporating logic functions with minimum delay penalty, properties which make them very attractive for high-performance digital designs. Critical pipe stages, like the FPMAC accumulator registers (Figure 3-5), are built using inverting rising edge-triggered semi-dynamic flip-flops with synchronous reset. The flip-flop (Figure 3-6) has a dynamic master stage coupled to a pseudo-static slave stage. As shown in the schematic, the flip-flops are implicitly pulsed, which gives them several advantages over non-pulsed designs. One main benefit is that they allow time-borrowing across cycle boundaries, because data can arrive coincident with, or even after, the clock edge; this negative setup time can be exploited in the logic. Another benefit of the negative setup time is that the flip-flop becomes less sensitive to clock jitter when the data arrives after the clock. They thus offer better clock-to-output delay and clock skew tolerance than conventional master-slave flip-flops. However, pulsed flip-flops have some important disadvantages. The worst-case hold time of this flip-flop can exceed its clock-to-output delay because of pulse width variations across process, voltage, and temperature conditions. A selectable pulse delay option is available, as shown in Figure 3-6, to avoid failures due to pulse-width variations over process corners and consequent min-delay failures.


Figure 3-6: Semi-dynamic resettable flip-flop with selectable pulse width.

3.3.2 Test Circuits

To aid with testing the FPMAC core, the design includes three 32-bit wide, 32-deep first-in first-out (FIFO) buffers, operating at core speed (Figure 3-7).


Figure 3-7: Block diagram of FPMAC core and test circuits.

FIFOs A and B provide the input operands to the FPMAC and FIFO C captures the results. A 67-bit scan chain feeds the data and control words. Output results are scanned out using a 32-bit scan chain. A control block manages operations of all three FIFOs and scan logic on chip.

Each FIFO is built using a register file (RF) unit that is 32 entries by 32 bits, with single read and write ports (Figure 3-8). The design is implemented as a large-signal memory array [18]. A static design was chosen to reduce power and to provide adequate robustness in the presence of large amounts of leakage. The RF is organized as four identical 8-entry, 32-bit banks. For fast, single-cycle read operation, all four banks are accessed simultaneously and multiplexed to obtain the desired data. A 10-transistor, leakage-tolerant, dual-VT optimized RF cell with 1-read/1-write ports is used. Reads and writes to two different locations in the RF occur simultaneously in a single clock cycle. To reduce routing and area cost, the circuits for reading and writing registers are implemented in a single-ended fashion. Local bit lines are segmented to reduce bit-line capacitive loading and leakage. As a result, address decoding time, read access time, and robustness all improve. The RF read and write paths are dual-VT optimized for best performance with minimum leakage. The RF RAM latch and the access devices in the write path are made high-VT to reduce leakage power. Low-VT devices are used everywhere else to improve the critical read delay by 21% over a fully high-VT design.
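The banked read organization can be summarized with a small behavioral model. The sketch below is illustrative only (class and method names are chosen here, not taken from the design): four 8-entry banks are indexed by the low address bits, and a 4:1 selection picks the addressed bank on a read.

```python
class BankedRegisterFile:
    """Illustrative model of a 32-entry x 32-bit register file built from
    four 8-entry banks that are read in parallel."""

    def __init__(self):
        self.banks = [[0] * 8 for _ in range(4)]

    def write(self, addr, data):
        self.banks[addr >> 3][addr & 0x7] = data & 0xFFFFFFFF

    def read(self, addr):
        # all four banks are accessed simultaneously; the 4:1 mux then
        # selects the bank addressed by the upper two address bits
        per_bank = [bank[addr & 0x7] for bank in self.banks]
        return per_bank[addr >> 3]

rf = BankedRegisterFile()
rf.write(21, 0xDEADBEEF)
print(hex(rf.read(21)))   # 0xdeadbeef
```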


Figure 3-8: 32-entry x 32b dual-VT optimized register file.


3.4 References

[1]. A. Omondi. Computer Arithmetic Systems, Algorithms, Architecture and Implementations. Prentice Hall, 1994.

[2]. D.A. Patterson, J.L. Hennessy, and D. Goldberg, “Computer Architecture, A Quantitative Approach”, Appendix A and H, 3rd edition, Morgan Kaufmann, May 2002.

[3]. IEEE Standards Board, “IEEE Standard for Binary Floating-Point Arithmetic,” Technical Report ANSI/IEEE Std. 754-1985, IEEE, New York, 1985.

[4]. W.-C. Park, S.-W. Lee, O.-Y. Kown, T.-D. Han, and S.-D. Kim, “Floating point adder/subtractor performing IEEE rounding and addition/subtraction in parallel”, IEICE Transactions on Information and Systems, E79-D(4):297–305, Apr. 1996.

[5]. H. Yamada, T. Hotta, T. Nishiyama, F. Murabayashi, T. Yamauchi and H. Sawamoto, “A 13.3ns double-precision floating-point ALU and multiplier”, Proc. IEEE International Conference on Computer Design, Oct. 1995, pp. 466–470.

[6]. S.F. Oberman, H. Al-Twaijry and M.J. Flynn, “The SNAP project: design of floating point arithmetic units”, in Proc. 13th IEEE Symposium on Computer Arithmetic, 1997, pp.156–165.

[7]. P. M. Seidel and G. Even, “On the design of fast IEEE floating-point adders”, Proc. 15th IEEE Symposium on Computer Arithmetic, June 2001, pp. 184–194.

[8]. F. Elguibaly, “A fast parallel multiplier-accumulator using the modified Booth algorithm”, IEEE Transactions on Circuits and Systems II, Sept. 2000, pp. 902–908.

[9]. E. Hokenek, R. K. Montoye and P. W. Cook, “Second-generation RISC floating point with multiply-add fused”, IEEE Journal of Solid-State Circuits, Oct. 1990, pp. 1207–1213.

[10]. H. Oh, S. M. Mueller, C. Jacobi, K. D. Tran, S. R. Cottier, B. W. Michael, H. Nishikawa, Y. Totsuka, T. Namatame, N. Yano, T. Machida and S. H. Dhong, “A fully-pipelined single-precision floating point unit in the synergistic processor element of a CELL processor”, 2005 Symposium on VLSI Circuits, Digest of Technical Papers, pp. 24–27.

[11]. H. Suzuki, H. Morinaka, H. Makino, Y. Nakase, K. Mashiko and T. Sumi, “Leading-Zero Anticipatory Logic for High-speed Floating Point Addition,” IEEE J. Solid State Circuits, Aug. 1996, pp. 1157–1164.

[12]. D. M. Tullsen, S. J. Eggers and H. M. Levy, “Simultaneous multithreading: Maximizing on-chip parallelism”, in Proceedings of the 22nd Annual International Symposium on Computer Architecture, 22-24 June 1995, pp. 392–403.

[13]. Z. Luo, and M. Martonosi, “Accelerating Pipelined Integer and Floating-Point Accumulations in Configurable Hardware with Delayed Addition Techniques,” IEEE Trans. on Computers, Mar. 2000, pp. 208–218.

[14]. C.S. Wallace, “Suggestions for a Fast Multiplier,” IEEE Trans. Electronic Computers, vol. 13, pp. 114-117, Feb. 1964.

[15]. K. Kuhn, M. Agostinelli, S. Ahmed, S. Chambers, S. Cea, S. Christensen, P. Fischer, J. Gong, C. Kardas, T. Letson, L. Henning, A. Murthy, H. Muthali, B. Obradovic, P. Packan, S. Pae, I. Post, S. Putna, K. Raol, A. Roskowski, R. Soman, T. Thomas, P. Vandervoorn, M. Weiss and I. Young, “A 90nm Communication Technology Featuring SiGe HBT Transistors, RF CMOS,” IEDM 2002, pp. 73–76.

[16]. F. Klass, "Semi-Dynamic and Dynamic Flip-Flops with Embedded Logic,” 1998 Symposium on VLSI Circuits, Digest of Technical Papers, pp. 108–109.

[17]. J. Tschanz, S. Narendra, Z. Chen, S. Borkar, M. Sachdev, V. De, “Comparative Delay and Energy of Single Edge-Triggered & Dual Edge-Triggered Pulsed Flip-Flops for High-Performance Microprocessors,” ISLPED 2001, pp. 147-151.

[18]. S. Vangal, M. Anders, N. Borkar, E. Seligman, V. Govindarajulu, V. Erraguntla, H. Wilson, A. Pangal, V. Veeramachaneni, J. Tschanz, Y. Ye, D. Somasekhar, B. Bloechel, G. Dermer, R. Krishnamurthy, K. Soumyanath, S. Mathew, S. Narendra, M. Stan, S. Thompson, V. De and S. Borkar, “5-GHz 32-bit integer execution core in 130-nm dual-VT CMOS,” IEEE Journal of Solid-State Circuits, Volume 37, Issue 11, Nov. 2002, pp. 1421- 1432.


Part III

An 80-Tile TeraFLOPS NoC


Chapter 4

An 80-Tile Sub-100W TeraFLOPS NoC Processor in 65-nm CMOS

We now present an integrated network-on-chip architecture containing 80 tiles arranged as an 8×10 2D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz (Paper 4). Each tile has two pipelined single-precision floating-point multiply accumulators (FPMAC) which feature a single-cycle accumulation loop for high throughput. The on-chip 2D mesh network (Paper 5) provides a bisection bandwidth of 2 Tera-bits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm2 custom design contains 100-M transistors. The fully functional first silicon achieves over 1.0TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07-V supply.

4.1 Introduction

The scaling of MOS transistors into the nanometer regime opens the possibility for creating large scalable Network-on-Chip (NoC) architectures [1] containing hundreds of integrated processing elements with on-chip communication. NoC architectures, with structured on-chip networks, are emerging as a scalable and modular solution to global communications within large systems-on-chip. The basic concept is to replace today’s shared buses with on-chip packet-switched interconnection networks [2]. NoC architectures use layered protocols and packet-switched networks which consist of on-chip routers, links, and well-defined network interfaces. As shown in Figure 4-1, the basic building block of the NoC architecture is the “network tile”. The tiles are connected to an on-chip network that routes packets between them. Each tile may consist of one or more compute cores and includes logic responsible for routing and forwarding the packets, based on the routing policy of the network. The structured network wiring of such a NoC design gives well-controlled electrical parameters that simplify timing and allow the use of high-performance circuits to reduce latency and increase bandwidth. Recent tile-based chip multiprocessors include the RAW [3], TRIPS [4], and ASAP [5] projects. These tiled architectures show promise for greater integration, high performance, good scalability and potentially high energy efficiency.


Figure 4-1: NoC architecture.

With the increasing demand for interconnect bandwidth, on-chip networks are taking up a substantial portion of the system power budget. The 16-tile MIT RAW on-chip network consumes 36% of total chip power, with each router dissipating 40% of individual tile power [6]. The routers and the links of the Alpha 21364 microprocessor consume about 20% of the total chip power. With on-chip communication consuming a significant portion of the chip power and area budgets, there is a compelling need for compact, low-power routers. At the same time, while applications dictate the choice of the compute core, the advent of multimedia applications, such as three-dimensional (3D) graphics and signal processing, places stronger demands for self-contained, low-latency floating-point processors with increased throughput. A computational fabric built using these optimized building blocks is expected to provide high levels of performance in an energy-efficient manner. This chapter describes the design details of an integrated 80-tile NoC architecture implemented in a 65-nm process technology. The prototype is designed to deliver over 1.0TFLOPS of average performance while dissipating less than 100W.

The remainder of the chapter is organized as follows. Section 4.2 lists the motivation for the tera-FLOP processor. Section 4.3 gives an architectural overview of the 80-tile NoC and describes the key building blocks. Section 4.4 explains the FPMAC unit pipeline and the design optimizations used to accomplish single-cycle accumulation; router architecture details, the NoC communication protocol and packet formats are also described there. Section 4.5 describes chip implementation details, including the high-speed mesochronous clock distribution network used in this design. Details of the circuits used for leakage power management in both logic and memory blocks are also discussed. Section 4.6 presents the chip measurement results. We present future tera-scale applications in Section 4.7. Section 4.8 concludes by summarizing the NoC architecture along with key performance and power numbers.

4.2 Goals of the TeraFLOPS NoC Processor

A primary motivation for this work is to demonstrate the industry’s first programmable processor capable of delivering over one trillion mathematical calculations per second (1.0TFLOPS) of performance while dissipating less than 100W. The TeraFLOPS processor extends our research work on key NoC building blocks and integrates them into a large 80-core NoC design using an effective tiled-design methodology. The number “80” is a balance between the performance/watt required and the available die area. The NoC uses a 2D mesh interconnect topology, fast enough to provide terabits of connectivity between the tiles. The mesh network is also attractive from a resiliency or reliability perspective because of its ability to re-route traffic in the event of congestion or link failure.

This research chip is designed to provide specific insights into new silicon design methodologies for large-scale NoCs (100+ cores), high-bandwidth interconnects, scalable clocking solutions and effective energy management techniques. The eventual goal is to develop highly scalable multi-core architectures with an optimal mix of general and special purpose processing cores and scalable, reliable on-chip networks, exploiting state-of-the-art process technology and packaging. An additional goal is to motivate research in the area of parallel programming by developing new programming tools that make highly-threaded and data-parallel applications easier to develop and debug.

4.3 NoC Architecture

The NoC architecture (Figure 4-2) contains 80 tiles arranged as an 8×10 2D mesh network that is designed to operate at 4GHz [7]. Each tile consists of a processing engine (PE) connected to a 5-port router with mesochronous interfaces (MSINT), which forwards packets between the tiles. The 80-tile on-chip network enables a bisection bandwidth of 2Tera-bits/s. The PE contains two independent fully-pipelined single-precision floating-point multiply-accumulator (FPMAC) units, 3KB of single-cycle instruction memory (IMEM), and 2KB of data memory (DMEM). A 96-bit Very Long Instruction Word (VLIW) encodes up to eight operations per cycle. With a 10-port (6-read, 4-write) register file, the architecture allows scheduling to both FPMACs, simultaneous DMEM loads and stores, packet send/receive from the mesh network, program control, and dynamic sleep instructions. A router interface block (RIB) handles packet encapsulation between the PE and router. The fully symmetric architecture allows any PE to send (receive) instruction and data packets to (from) any other tile. The 15 fan-out-of-4 (FO4) design uses a balanced core and router pipeline with critical stages employing performance-setting semi-dynamic flip-flops. In addition, a scalable low-power mesochronous clock distribution is employed, enabling high integration and single-chip realization of the teraFLOPS processor in a 65-nm eight-metal CMOS process.
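One way to arrive at the quoted 2 Tb/s bisection bandwidth is sketched below; the choice of bisection cut (across the eight-row dimension, severing eight router-to-router connections, each made of two 32-bit unidirectional data links at 4 GHz) is an assumption made here for illustration and is not stated explicitly in the text.

```python
# Assumed bisection: cut the 8x10 mesh across its 8-row dimension.
rows, directions, data_bits, freq_ghz = 8, 2, 32, 4
bisection_gbps = rows * directions * data_bits * freq_ghz
print(bisection_gbps, "Gb/s")   # 2048 Gb/s, i.e. ~2 Tera-bits/s
```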


Figure 4-2: NoC block diagram and tile architecture.


4.4 FPMAC Architecture

The 9-stage pipelined FPMAC architecture (Figure 4-3) uses a single-cycle accumulate algorithm [8] with base 32 and internal carry-save arithmetic with delayed addition. The FPMAC contains a fully pipelined multiplier unit (pipe stages S1-S3), and a single-cycle accumulation loop (S5), followed by pipelined addition and normalization units (S6-S8). Operands A and B are 32-bit inputs in IEEE-754 single-precision format [9]. The design is capable of sustained pipelined performance of one FPMAC instruction every 250ps. The multiplier is designed using a Wallace tree of 4-2 carry-save adders. The well-matched delays of each Wallace tree stage allow for highly efficient pipelining (S1-S3). Four Wallace tree stages are used to compress the partial product bits to a sum and carry pair. Notice that the multiplier does not use a carry propagate adder at the final stage. Instead, the multiplier retains the output in carry-save format and converts the result to base 32 (at stage S3), prior to accumulation.


Figure 4-3: FPMAC 9-stage pipeline with single-cycle accumulate loop.


In an effort to achieve fast single-cycle accumulation, we first analyzed each of the critical operations involved in conventional FPUs with the intent of eliminating, reducing or deferring the logic operations inside the accumulate loop and identified the following three optimizations [8].

1) The accumulator (stage S5) retains the multiplier output in carry-save format and uses an array of 4-2 carry save adders to “accumulate” the result in an intermediate format. This removes the need for a carry-propagate adder in the critical path.

2) Accumulation is performed in base 32 system, converting the expensive variable shifters in the accumulate loop to constant shifters.

3) The costly normalization step is moved outside the accumulate loop, where the accumulation result in carry-save is added (stage S6), the sum normalized (stage S7) and converted back to base 2 (stage S8).

These optimizations allow accumulation to be implemented in just 15 FO4 delays. This approach also reduces the latency of dependent FPMAC instructions and enables a sustained multiply-add result (2 FLOPS) every cycle. Careful pipeline re-balancing allows the removal of 3 pipe-stages, resulting in a 25% latency improvement over the work in [8]. The dual FPMACs in each PE provide 16GFLOPS of aggregate performance and are critical to achieving the goal of teraFLOPS performance.

4.4.1 Instruction Set

The architecture defines a 96-bit VLIW which allows up to eight operations to be issued every cycle. The instructions fall into one of six categories (Table 4-1): instruction issue to both floating-point units, simultaneous data memory loads and stores, packet send/receive via the on-die mesh network, program control using jump and branch instructions, synchronization primitives for data transfer between PEs, and dynamic sleep instructions. The data path between the DMEM and the register file supports the transfer of two 32-bit data words per cycle on each load (or store) instruction. The register file issues four 32-bit data words to the dual FPMACs per cycle, while retiring two 32-bit results every cycle. The synchronization instructions aid with data transfer between tiles and allow the PE to stall while waiting for data (WFD) to arrive. To aid with power management, the architecture provides special instructions for dynamic sleep and wakeup of each PE, including independent sleep control of each floating-point unit inside the PE. The architecture allows any PE to issue sleep packets to any other tile or wake it up for processing tasks. With the exception of FPU instructions, which have a pipelined latency of 9 cycles, most instructions execute in 1-2 cycles.


Instruction Type    Latency (cycles)
FPU                 9
LOAD/STORE          2
SEND/RECEIVE        2
JUMP/BRANCH         1
SLEEP/WAKE          1-6
STALL/WFD           N/A

Table 4-1: Instruction types and latency.

4.4.2 NoC Packet Format

Figure 4-4 describes the NoC packet structure and routing protocol. The on-chip 2D mesh topology utilizes a 5-port router based on wormhole switching, where each packet is subdivided into “FLITs” or “Flow control unITs”. Each FLIT contains six control signals and 32 data bits. The packet header (FLIT_0) allows for a flexible source-directed routing scheme, where a 3-bit destination ID field (DID) specifies the router exit port. This field is updated at each hop.

[Figure content — FLIT field legend: V: valid FLIT; T: packet tail; H: packet header; L: lane ID; FC: flow control (lanes 1:0); CH: chained header; NPC: new PC address enable; PCA: PC address; SLP: PE sleep/wake; REN: PE execution enable; ADDR: IMEM/DMEM write address. FLIT_0 carries the 6 control bits and the 3-bit destination IDs (DID) for up to 10 hops; FLIT_1 carries the PE control information; subsequent FLITs carry 32 data bits each.]

Figure 4-4: NoC protocol: packet format and FLIT description.


Flow control and buffer management between routers is debit-based, using almost-full bits which the receiver queue signals via two flow control bits (FC0-FC1) when its buffers reach a specified threshold. Each header FLIT supports a maximum of 10 hops. A chained header (CH) bit in the packet provides support for a larger number of hops. Processing engine control information, including the sleep and wakeup control bits, is specified in FLIT_1, which follows the header FLIT. The minimum packet size required by the protocol is two FLITs. The router architecture places no restriction on the maximum packet size.
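The source-directed routing scheme can be illustrated with a short Python sketch. The field packing below is an assumption made for illustration (it ignores the V/T/H/L/FC/CH control bits); it only shows how a chain of 3-bit destination IDs supports up to 10 hops, with the DID field consumed and "updated" at each hop.

```python
def pack_route(exit_ports):
    """Pack up to 10 3-bit router exit ports (DIDs) into one route field."""
    assert len(exit_ports) <= 10
    route = 0
    for hop, port in enumerate(exit_ports):
        route |= (port & 0x7) << (3 * hop)
    return route

def next_hop(route):
    """At each router: the low 3 bits give the exit port; shifting the field
    updates it so the next router sees its own DID."""
    return route & 0x7, route >> 3

route = pack_route([2, 4, 1])          # three hops: exit ports 2, 4, then 1
for _ in range(3):
    port, route = next_hop(route)
    print("exit port:", port)
```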

4.4.3 Router Architecture

A 4GHz five-port two-lane pipelined packet-switched router core (Figure 4-5) with phase-tolerant mesochronous links forms the key communication fabric for the 80-tile NoC architecture. Each port has two 39-bit unidirectional point-to-point links. The input-buffered wormhole-switched router [18] uses two logical lanes (lanes 0-1) for deadlock-free routing and a fully non-blocking crossbar switch with a total bandwidth of 80GB/s (32 bits × 4 GHz × 5 ports). Each lane has a 16-FLIT queue, an arbiter and flow control logic. The router uses a 5-stage pipeline with a two-stage round-robin arbitration scheme that first binds an input port to an output port in each lane and then selects a pending FLIT from one of the two lanes. A shared data path architecture allows crossbar switch re-use across both lanes on a per-FLIT basis. The router links implement a mesochronous interface with first-in-first-out (FIFO) based synchronization at the receiver.
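The round-robin arbitration can be modeled with a small rotating-priority sketch. This is an illustrative single-stage arbiter only (class and method names chosen here); in the router, one such arbiter per lane binds an input port to an output port, and a second stage then picks a pending FLIT from one of the two lanes.

```python
class RoundRobinArbiter:
    """Rotating-priority (round-robin) arbiter: the most recent winner gets
    the lowest priority on the next arbitration."""

    def __init__(self, n_requesters):
        self.n = n_requesters
        self.last = n_requesters - 1   # most recently granted index

    def grant(self, requests):
        """requests: list of booleans; return the granted index or None."""
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx
                return idx
        return None

arb = RoundRobinArbiter(5)
print(arb.grant([True, False, True, False, True]))   # grants 0
print(arb.grant([True, False, True, False, True]))   # grants 2 next
```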

The router core features a double-pumped crossbar switch [10] to reduce the crossbar interconnect routing area. The schematic in Figure 4-6(a) shows the 36-bit crossbar data bus double-pumped at the 4th pipe-stage of the router by interleaving alternate data bits using dual edge-triggered flip-flops, reducing crossbar area by 50%. In addition, the proposed router architecture shares the crossbar switch across both lanes on an individual FLIT basis. Combined application of both ideas enables a compact 0.34mm2 design, resulting in a 34% reduction in router layout area as shown in Figure 4-6(b), 26% fewer devices, a 13% improvement in average power and a one-cycle latency reduction (from 6 to 5 cycles) over the router design in [11], when ported and compared in the same 65-nm process [12]. Results from the comparison are summarized in Table 4-2.



Figure 4-5: Five-port two-lane shared crossbar router architecture.

Router              This Work    Work in [11]    Benefit
Transistors         210K         284K            26%
Area (mm2)          0.34         0.52            34%
Latency (cycles)    5            6               16.6%

Table 4-2: Router comparison over work in [11].



Figure 4-6: (a) Double-pumped crossbar switch schematic. (b) Area benefit over work in [11].


4.4.4 Mesochronous Communication

The 2mm long point-to-point unidirectional router links implement a phase-tolerant mesochronous interface (Figure 4-7). Four of the five router links are source synchronous, each providing a strobe (Tx_clk) with 38 bits of data. To reduce active power, Tx_clk is driven at half the clock rate. A 4-deep circular FIFO, built using transparent latches, captures data on both edges of the delayed link strobe at the receiver. The strobe delay and duty cycle can be digitally programmed using the on-chip scan chain. A synchronizer circuit sets the latency between the FIFO write and read pointers to 1 or 2 cycles at each port, depending on the phase of the arriving strobe with respect to the local clock. A more aggressive low-latency setting reduces the synchronization penalty by one cycle. The interface includes the first stage of the router pipeline.
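The pointer-based synchronization can be summarized with a simple behavioral model. The sketch below is illustrative only (the function name and the cycle-by-cycle abstraction are choices made here): the transmitter writes one word per cycle into a 4-deep circular FIFO and the receiver starts reading 1 or 2 cycles later, as selected by the synchronizer.

```python
def mesochronous_link(tx_data, latency=2, depth=4):
    """Behavioral sketch of the phase-tolerant link: writes and reads advance
    once per cycle, with the read pointer trailing the write pointer by
    'latency' cycles (must be < depth for correctness)."""
    fifo = [None] * depth
    received = []
    for cycle, word in enumerate(tx_data):
        fifo[cycle % depth] = word                 # write pointer (Tx domain)
        if cycle >= latency:                       # read pointer (Rx domain)
            received.append(fifo[(cycle - latency) % depth])
    for cycle in range(len(tx_data), len(tx_data) + latency):
        received.append(fifo[(cycle - latency) % depth])   # drain
    return received

print(mesochronous_link([10, 11, 12, 13, 14]))   # [10, 11, 12, 13, 14]
```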


Figure 4-7: Phase-tolerant mesochronous interface and timing diagram.

4.4.5 Router Interface Block (RIB)

The RIB is responsible for message-passing and aids with synchronization of data transfer between the tiles and with power management at the PE level. Incoming 38-bit wide FLITs are buffered in a 16-entry queue, where demultiplexing based on the lane ID and framing to 64 bits for data packets (DMEM) and 96 bits for instruction packets (IMEM) are performed. The buffering is required during program execution since DMEM stores from the 10-port register file have priority over data packets received by the RIB. The unit decodes FLIT_1 (Figure 4-4) of an incoming instruction packet and generates several PE control signals. This allows the PE to start execution (REN) at a specified IMEM address (PCA) and is enabled by the new program counter (NPC) bit. After receiving a full data packet, the RIB generates a break signal to continue execution if the IMEM is in a stalled (WFD) mode. Upon receipt of a sleep packet via the mesh network, the RIB unit can also dynamically put the entire PE to sleep or wake it up for processing tasks on demand.

4.5 Design details

To allow 4+ GHz operation, the entire core is designed using hand-optimized data path macros. CMOS static gates are used to implement most of the logic. However, critical registers in the FPMAC and router logic utilize implicit-pulsed semi-dynamic flip-flops (SDFF) [13]–[14]. The SDFF (Figure 4-8) has a dynamic master stage coupled to a pseudo-static slave stage. The FPMAC accumulator register is built using data-inverting rising edge-triggered SDFFs with synchronous reset and enable. The negative setup time of the flip-flop is taken advantage of in the critical path. When compared to a conventional static master-slave flip-flop, SDFF provides both shorter latency and the capability of incorporating logic functions with minimum delay penalty, properties which are desirable in high-performance digital designs.


Figure 4-8: Semi-dynamic flip-flop (SDFF) schematic.


The chip uses a scalable global mesochronous clocking technique, which allows for clock phase-insensitive communication across tiles and synchronous operation within each tile. The on-chip PLL output (Figure 4-9a) is routed using horizontal M8 and vertical M7 spines. Each spine consists of differential clocks for low duty-cycle variation along the worst-case clock route of 26 mm. An op-amp at each tile converts the differential clocks to a single ended clock with a 50% duty cycle prior to distributing the clock across the tile using a balanced H-tree. This clock distribution scales well as tiles are added or removed. The worst-case simulated global duty-cycle variation is 3 ps and local clock skew within the tile is 4ps. Figure 4-9(b) shows simulated clock arrival times for all 80 tiles at 4 GHz operation. Note that multiple cycles are required for the global clock to propagate to all 80 tiles. The systematic clock skews inherent in the distribution help spread peak currents due to simultaneous clock switching over the entire cycle.


Figure 4-9: (a) Global mesochronous clocking and (b) simulated clock arrival times.

Fine-grained clock gating, sleep transistor and body bias circuits [15] are used to reduce active power (Figure 4-9a) and standby leakage power, and are controlled at full-chip, tile-slice, and individual tile levels based on workload. Each tile is partitioned into 21 smaller sleep regions with dynamic control of individual blocks in the PE and router units based on instruction type. The router is partitioned into 10 smaller sleep regions with control of individual router ports, depending on network traffic patterns. The design uses NMOS sleep transistors to reduce frequency penalty and area overhead. Figure 4-10 shows the router and on-die network power management scheme. The enable signals gate the clock to each port, the MSINT and the links. In addition, the enable signals also activate the NMOS sleep transistors in the input queue arrays of both lanes. The 360µm NMOS sleep device in the register file is sized to provide a 4.3X reduction in array leakage power with a 4% frequency impact. The global clock buffer feeding the router is finally gated at the tile level based on port activity.


Figure 4-10: Router and on-die network power management.

Each FPMAC implements unregulated sleep transistors with no data retention (Figure 4-11a). A 6-cycle pipelined wakeup sequence largely mitigates current spikes compared to a single-cycle re-activation scheme, while allowing floating-point unit execution to start one cycle into wakeup. Circuit simulations (SPICE) show up to a 4X reduction in peak currents when compared to single-cycle wakeup. Note the staged activation of the three FPMAC sub-blocks out of sleep. A faster 3-cycle fast-wake option is also supported. Memory arrays, on the other hand, use a regulated active-clamped sleep transistor circuit (Figure 4-11b) that ensures data retention and minimizes standby leakage power [16]. The closed-loop op-amp configuration ensures that the virtual ground voltage (VSSV) is no greater than a VREF input voltage under PVT variations. VREF is set based on the memory cell standby VMIN voltage. The average sleep transistor area overhead is 5.4% with a 4% frequency penalty. About 90% of the FPU logic and 74% of each PE is sleep-enabled. In addition, forward body bias can be externally applied to all NMOS devices during active mode to increase the operating frequency, and reverse body bias can be applied during idle mode for further leakage savings.


Figure 4-11: (a) FPMAC pipelined wakeup diagram and simulated peak current reduction and (b) state-retentive memory clamp circuit.

4.6 Experimental Results

The teraFLOPS processor is fabricated in a 65-nm process technology [12] with a 1.2-nm gate-oxide thickness, nickel salicide for lower resistance and a second-generation strained silicon technology. The interconnect uses eight copper layers and a low-K carbon-doped oxide (k = 2.9) inter-layer dielectric. The functional blocks of the chip and an individual tile are identified in the die photographs in Figure 4-12. The 275 mm2 fully custom design contains 100 million transistors. Using a fully-tiled approach, each 3 mm2 tile is drawn complete with C4 bumps, power, global clock and signal routing, and the tiles are seamlessly arrayed by abutment. Each tile contains 1.2 million transistors, with the processing engine accounting for 1 million (83%) and the router for 17% of the total tile device count. De-coupling capacitors occupy about 20% of the total logic area. The chip has three independent voltage regions: one for the tiles, a separate supply for the PLL, and a third for the I/O circuits. Test and debug features include a TAP controller and full-scan support for all memory blocks on chip.

[Figure: full-chip and single-tile die micrographs with key characteristics. Technology: 65 nm, 1 poly, 8 metal (Cu). Transistors: 100 million (full chip), 1.2 million (tile). Die area: 275 mm2 (full chip, 21.72 mm x 12.64 mm), 3 mm2 (tile, 2.0 mm x 1.5 mm). C4 bumps: 8390. Tile blocks: FPMAC0, FPMAC1, router, IMEM, DMEM, RF, RIB, MSINT, clock buffers and global clock spine; PLL, TAP and I/O areas at the die periphery.]

Figure 4-12: Full-Chip and tile micrograph and characteristics.


The evaluation board with the packaged chip is shown in Figure 4-13. The die has 8390 C4 solder bumps, arrayed with a single uniform bump pitch across the entire die. The chip-level power distribution consists of a uniform M8-M7 grid aligned with the C4 power and ground bump array.


Figure 4-13: (a) Package die-side. (b) Land-side. (c) Evaluation board.


The package is a 66 mm × 66 mm flip-chip LGA (land grid array) and includes an integrated heat spreader. The package has a 14-layer (5-4-5) stack-up to meet the various power-plane and signal-routing requirements and has a total of 1248 pins, of which 343 are signal pins. Decoupling capacitors are mounted on the land side of the package as shown in Figure 4-13(b). A PC running custom software is used to apply test vectors and observe results through the on-chip scan chain. First silicon has been validated to be fully functional.

Frequency versus power supply on a typical part is shown in Figure 4-14. Silicon measurements at a case temperature of 80°C demonstrate a chip maximum frequency (FMAX) of 1 GHz at 670 mV and 3.16 GHz at 950 mV, with frequency increasing to 5.1 GHz at 1.2 V and 5.67 GHz at 1.35 V. With all 80 tiles (N = 80) actively performing single-precision block-matrix operations, the chip achieves a peak performance of 0.32 TFLOPS (670 mV), 1.0 TFLOPS (950 mV), 1.63 TFLOPS (1.2 V) and 1.81 TFLOPS (1.35 V).
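These peak-performance figures follow directly from the tile count and the two single-cycle FPMACs per tile. The following is a minimal Python check; the script and variable names are illustrative only and are not part of the design flow.

    # Peak single-precision performance: 80 tiles x 2 FPMACs x 2 FLOPs per cycle.
    TILES = 80
    FPMACS_PER_TILE = 2
    FLOPS_PER_FPMAC_CYCLE = 2  # one multiply-add result per cycle

    def peak_tflops(freq_ghz):
        return TILES * FPMACS_PER_TILE * FLOPS_PER_FPMAC_CYCLE * freq_ghz / 1e3

    for vcc, freq in [(0.67, 1.0), (0.95, 3.16), (1.2, 5.1), (1.35, 5.67)]:
        print("%.2f V, %.2f GHz -> %.2f TFLOPS" % (vcc, freq, peak_tflops(freq)))
    # Prints 0.32, 1.01, 1.63 and 1.81 TFLOPS, matching the measured peak numbers above.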

[Figure: measured frequency (GHz) versus Vcc (V) at 80°C, N = 80, annotated with the peak-performance points listed above.]

Figure 4-14: Measured chip FMAX and peak performance.


Several application kernels have been mapped to the design and their performance is summarized in Table 4-3. The table shows the single-precision floating-point operation count, the number of active tiles (N) and the average performance in TFLOPS for each application, also reported as a percentage of the peak performance achievable with the design. In each case, task mapping was hand optimized and communication was overlapped with computation as much as possible to increase efficiency. The stencil code solves a steady-state two-dimensional heat diffusion equation with periodic boundary conditions on the left and right boundaries of a rectilinear grid, and prescribed temperatures on the top and bottom boundaries. For the stencil kernel, chip measurements indicate an average performance of 1.0 TFLOPS at 4.27 GHz and 1.07 V supply with 358K floating-point operations, achieving 73.3% of the peak performance. This result is particularly impressive because the computation is fully overlapped with local loads and stores and with communication between neighboring tiles. The SGEMM matrix multiplication code operates on two 100 × 100 matrices with 2.63 million floating-point operations, corresponding to an average performance of 0.51 TFLOPS; here the read bandwidth from local data memory limits performance to half the peak rate. The spreadsheet kernel applies reductions to tables of data consisting of pairs of values and weights; for each table, the weighted sum of each row and each column is computed. A 64-point 2D FFT (Fast Fourier Transform) implementing the Cooley-Tukey algorithm [17] on 64 tiles has also been successfully mapped to the design, with an average performance of 27.3 GFLOPS. Each tile first computes 8-point FFTs and then passes its results to the other 63 tiles to complete the 2D FFT. The complex communication pattern results in high overhead and lower efficiency.

Application Kernels             FLOP count   Number of active tiles (N)   Average Performance (TFLOPS)   % Peak TFLOPS
Stencil                         358K         80                           1.00                           73.3%
SGEMM: Matrix Multiplication    2.63M        80                           0.51                           37.5%
Spreadsheet                     62.4K        80                           0.45                           33.2%
2D FFT                          196K         64                           0.02                           2.73%

Table 4-3: Application performance measured at 1.07 V and 4.27 GHz operation.
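For reference, the following is a minimal serial Python sketch of the stencil computation described above: Jacobi relaxation of the steady-state 2D heat equation with periodic left/right boundaries and prescribed top/bottom temperatures. The grid size, iteration count and boundary values are arbitrary placeholders; on the chip, the grid is partitioned across tiles and neighbour data is exchanged over the 2D mesh.

    import numpy as np

    def heat_stencil(rows=64, cols=64, t_top=100.0, t_bottom=0.0, iters=500):
        grid = np.zeros((rows, cols))
        grid[0, :], grid[-1, :] = t_top, t_bottom        # prescribed top/bottom boundaries
        for _ in range(iters):
            left  = np.roll(grid, 1, axis=1)[1:-1, :]    # periodic in the horizontal direction
            right = np.roll(grid, -1, axis=1)[1:-1, :]
            up, down = grid[:-2, :], grid[2:, :]
            grid[1:-1, :] = 0.25 * (left + right + up + down)   # Jacobi update of interior rows
        return grid

    print(heat_stencil()[32, :4])   # a few interior temperatures after relaxation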


[Figure: measured chip power (W) versus Vcc (V) at 80°C, N = 80, with active and leakage components separated; annotated points include 15.6 W at the lowest supply, 1 TFLOPS at 97 W and 1.33 TFLOPS at 230 W.]

Figure 4-15: Measured chip power for stencil application.

Figure 4-15 shows the total chip power dissipation, with the active and leakage power components separated, as a function of frequency and power supply with the case temperature maintained at 80°C. We report measured power for the stencil application kernel, since it is the most computationally intensive. The chip power consumption ranges from 15.6 W at 670 mV to 230 W at 1.35 V. With all 80 tiles actively executing stencil code, the chip achieves 1.0 TFLOPS of average performance at 4.27 GHz and 1.07 V supply with a total power dissipation of 97 W. The total power increases to 230 W at 1.35 V and 5.67 GHz operation, delivering 1.33 TFLOPS of average performance. Figure 4-16 plots the measured energy efficiency in GFLOPS/W for the stencil application with power supply and frequency scaling. As expected, the chip energy efficiency increases as the power supply is reduced, from 5.8 GFLOPS/W at 1.35 V to 10.5 GFLOPS/W at the 1.0 TFLOPS goal and a maximum of 19.4 GFLOPS/W at 750 mV. Below 750 mV, FMAX degrades faster than the power saved by lowering the tile supply voltage, resulting in an overall performance reduction and a consequent drop in energy efficiency. The chip provides up to 394 GFLOPS of aggregate performance at 750 mV with a measured total power dissipation of just 20 W.
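The energy-efficiency numbers follow from dividing the measured aggregate performance by the measured power at each operating point. A quick Python check is shown below, using the values quoted above; small differences versus the reported 10.5 and 19.4 GFLOPS/W reflect rounding of the measured data.

    points = {
        "1.07 V, 4.27 GHz": (1000.0, 97.0),   # GFLOPS, W
        "1.35 V, 5.67 GHz": (1330.0, 230.0),
        "0.75 V":           (394.0, 20.0),
    }
    for label, (gflops, watts) in points.items():
        print("%s: %.1f GFLOPS/W" % (label, gflops / watts))
    # roughly 10.3, 5.8 and 19.7 GFLOPS/W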


[Figure: measured energy efficiency (GFLOPS/W) versus aggregate performance (GFLOPS) at 80°C, N = 80; annotated points at 15 GFLOPS/W (670 mV), 19.4 GFLOPS/W at the 394 GFLOPS point (750 mV), 10.5 GFLOPS/W (1.07 V) and 5.8 GFLOPS/W (1.35 V).]

Figure 4-16: Measured chip energy efficiency for stencil application.

Figure 4-17 presents the estimated power breakdown at the tile and router levels, simulated at 4 GHz, 1.2 V supply and 110°C. The processing engine, with the dual FPMACs, instruction and data memory, and the register file, accounts for 61% of the total tile power (Figure 4-17a). The communication power is significant at 28% of the tile power, and the synchronous tile-level clock distribution accounts for 11% of the total. Figure 4-17(b) shows a more detailed tile-to-tile communication power breakdown, which includes the router, mesochronous interfaces and links. Clocking power is the largest component, accounting for 33% of the communication power. The input queues on both lanes and the datapath circuits are the second major component, dissipating 22% of the communication power.

Figure 4-18 shows the output differential and single-ended clock waveforms measured at the clock buffer farthest from the PLL at a frequency of 5 GHz. Notice that the duty cycles of the clocks are close to 50%. Figure 4-19 plots the global clock distribution power as a function of frequency and power supply. This is the switching power dissipated in the clock spines from the PLL to the op-amps at the center of each of the 80 tiles. Measured silicon data at 80°C shows that the power is 80 mW at 0.8 V and 1.7 GHz, increasing by 10X to 800 mW at 1 V and 3.8 GHz. The global clock distribution power is 2 W at 1.2 V and 5.1 GHz and accounts for just 1.3% of the total chip power.

[Figure: (a) tile power profile: dual FPMACs 36%, IMEM + DMEM 21%, 10-port RF 4%, router + links 28%, clock distribution 11%; (b) communication power breakdown: clocking 33%, queues + datapath 22%, crossbar 15%, links 17%, arbiters + control 7%, MSINT 6%.]

Figure 4-17: Estimated (a) tile power profile and (b) communication power breakdown.

Figure 4-20 plots the chip leakage power as a percentage of the total power with all 80 processing engines and routers awake and with all the clocks disabled. Measurements show that the worst-case leakage power in active mode varies from a minimum of 9.6% to a maximum of 15.7% of the total power when measured over the power supply range of 670 mV to 1.35 V. In sleep mode, the NMOS sleep transistors are turned off, reducing chip leakage by 2X, while preserving the logic state in all memory arrays.
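To put these percentages in perspective, the rough calculation below uses the 1 TFLOPS operating point (97 W total) purely as an example; the quoted 9.6% to 15.7% range actually spans the full 0.67 V to 1.35 V sweep, so these are only indicative numbers.

    total_power_w = 97.0                      # example operating point (1 TFLOPS, 1.07 V)
    for leak_fraction in (0.096, 0.157):
        awake_leak = leak_fraction * total_power_w
        sleep_leak = awake_leak / 2.0         # 2X reduction with the sleep transistors turned off
        print("leakage ~%.1f W awake, ~%.1f W in sleep mode" % (awake_leak, sleep_leak))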


[Figure: measured differential (clk, clk#) and single-ended (clkse) clock waveforms at the clock buffer farthest (about 20 mm) from the PLL; the global clock is distributed from the PLL at the I/O edge over M8/M7.]

Figure 4-18: Measured global clock distribution waveforms.

[Figure: measured global clock distribution power (W) versus frequency (GHz) at 80°C, with annotated points of 80 mW (0.8 V), 0.8 W (1 V) and 2 W (1.2 V).]

Figure 4-19: Measured global clock distribution power.


[Figure: measured leakage power as a percentage of total power versus Vcc (V) at 80°C, N = 80, with sleep-disabled and sleep-enabled curves showing a 2X gap.]

Figure 4-20: Measured chip leakage power as a percentage of total power vs. Vcc. A 2X reduction is obtained by turning off the sleep transistors.

[Figure: network power per tile (mW) versus number of active router ports at 1.2 V, 5.1 GHz and 80°C, ranging from 126 mW with all ports disabled to 924 mW with all ports active (7.3X).]

Figure 4-21: On-die network power reduction benefit.


Figure 4-21 shows the active and leakage power reduction due to a combination of the selective router port activation, clock gating and sleep transistor techniques described earlier in this chapter. Measured at 1.2 V, 80°C and 5.1 GHz operation, the total network power per tile can be lowered from a maximum of 924 mW with all router ports active to 126 mW, a 7.3X reduction. The network leakage power per tile, with all ports and the global clock buffers feeding the router disabled, is 126 mW. This number includes the power dissipated in the router, MSINT and the links. Figure 4-22 shows a measured scope waveform of the virtual ground (Vssv) node for one of the instruction memory (IMEM) arrays, showing the transition of the block to and from sleep. For this measurement, VREF is set at 400 mV, with a memory cell standby VMIN voltage of 800 mV (Vcc = 1.2 V) for data retention. Notice that the active clamping circuit ensures that Vssv stays within a few mV (5%) of VREF. A photograph of the measurement setup used to characterize the design is shown in Figure 4-23.
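The measured clamp behaviour is consistent with the VREF = Vcc - VMIN setting described earlier. A small sanity check of the quoted values follows; the variable names are illustrative.

    vcc, standby_vmin = 1.20, 0.80          # volts, from the measurement above
    vref = vcc - standby_vmin               # 0.40 V, matching the applied VREF
    vssv = vref                             # the clamp holds Vssv within ~5% of VREF
    cell_voltage = vcc - vssv               # voltage across the sleeping array
    print("VREF = %.2f V, array voltage in sleep ~ %.2f V (the standby VMIN)" % (vref, cell_voltage))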

[Figure: clamp circuit for one IMEM array (Vcc = 1.2 V, VREF = 400 mV, standby VMIN = 0.8 V) with a probe pad on the virtual ground, and the measured Vssv waveform as the array transitions between sleep and active states.]

Figure 4-22: Measured IMEM virtual ground waveform showing the transition to and from sleep.


Figure 4-23: Measurement setup.

4.7 Applications for the TeraFLOPS NoC Processor

We have presented results from mapping a handful of typical application kernels to our NoC processor with good success. A broader question to answer is: what applications would really benefit from a teraFLOP of computing power and terabits of on-die communication bandwidth? The teraFLOPS processor provides a 500X jump in compute capability over today's giga-scale devices. This is required for tomorrow's emerging applications, including real-time recognition, mining and synthesis (RMS) workloads on terabytes of data [19]. Other applications include artificial intelligence (AI) for smarter appliances, virtual reality for 3D modeling, gaming, visualization, physics simulation, financial applications and medical training, as well as applications that are still on the edge of being science fiction. In medicine, a full-body medical scan already contains terabytes of information. Even at home, people are generating large amounts of data, including hundreds of hours of video and thousands of digital photos that need to be indexed and searched. This computing model promises to bring the massive compute capabilities of supercomputers to laptops or even handheld devices.


4.8 Conclusion

In this chapter, we have presented an 80-tile high-performance NoC architecture implemented in a 65-nm process technology. The prototype contains 160 lower-latency FPMAC cores and features a single-cycle accumulator architecture for high throughput. Each tile also contains a fast and compact router operating at core speed, and the 80 tiles are interconnected using a 2D mesh topology providing a high bisection bandwidth of over 2 Tera-bits/s. The design uses a combination of micro-architecture, logic, circuits and a 65-nm process to reach the target performance. Silicon operates over a wide voltage and frequency range, and delivers teraFLOPS performance with high power efficiency. For the most computationally intensive application kernel, the chip achieves an average performance of 1.0 TFLOPS, while dissipating 97 W at 4.27 GHz and 1.07 V supply, corresponding to an energy efficiency of 10.5 GFLOPS/W. Average performance scales to 1.33 TFLOPS at a maximum operational frequency of 5.67 GHz and 1.35 V supply. These results demonstrate the feasibility of high-performance and energy-efficient building blocks for peta-scale computing in the near future.

Acknowledgement

I sincerely thank Prof. A. Alvandpour, V. De, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, S. Borkar, D. Somasekhar, D. Jenkins, P. Aseron, J. Collias, B. Nefcy, P. Iyer, S. Venkataraman, S. Saha, M. Haycock, J. Schutz and J. Rattner for help, encouragement, and support; T. Mattson, R. Wijngaart and M. Frumkin from the SSG and ARL teams at Intel for assistance with mapping the kernels to the design; the LTD and ATD teams for PLL and package design and assembly; and the entire mask design team for chip layout.

4.9 References

[1]. L. Benini and G. D. Micheli., “Networks on Chips: A New SoC Paradigm,” IEEE Computer, vol. 35, pp. 70–78, Jan., 2002.

[2]. W. J. Dally and B. Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks,” in Proceedings of the 38th Design Automation Conference, pp. 681-689, June 2001.

[3]. M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffmann, P. Johnson, J. W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, “The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs,” IEEE Micro, vol. 22, no. 2, pp. 25–35, 2002.

[4]. K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S.W. Keckler, and C. R. Moore, “Exploiting ILP, TLP, and DLP with The Polymorphous TRIPS Architecture,” in Proceedings of the 30th Annual International Symposium on Computer Architecture, pp. 422–433, 2003.

[5]. Z. Yu, M. Meeuwsen, R. Apperson, O. Sattari, M. Lai, J. Webb, E. Work, T. Mohsenin, M. Singh, and B. Baas, “An asynchronous array of simple processors for dsp applications,” ISSCC Dig. Tech. Papers, pp. 428–429, Feb., 2006.

[6]. H. Wang, L. Peh and S. Malik, “Power-driven design of router microarchitectures in on-chip networks,” in Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36), pp. 105–116, 2003.

[7]. S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote and N. Borkar, “An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS,” ISSCC Dig. Tech. Papers, pp. 98–99, Feb. 2007.

[8]. S. Vangal, Y. Hoskote, N. Borkar and A. Alvandpour, “A 6.2-GFlops Floating-Point Multiply-Accumulator with Conditional Normalization,” IEEE J. of Solid-State Circuits, pp. 2314–2323, Oct. 2006.

[9]. IEEE Standards Board, “IEEE Standard for Binary Floating-Point Arithmetic,” Technical Report ANSI/IEEE Std. 754-1985, IEEE, New York, 1985.

[10]. S. Vangal, N. Borkar and A. Alvandpour, "A Six-Port 57GB/s Double-Pumped Non-blocking Router Core", Symposium on VLSI Circuits, pp. 268–269, June 2005.

[11]. H. Wilson and M. Haycock, “A six-port 30-GB/s non-blocking router component using point-to-point simultaneous bidirectional signaling for high-bandwidth interconnects,” IEEE Journal of Solid-State Circuits, vol. 36, pp. 1954–1963, Dec. 2001.


[12]. P. Bai, C. Auth, S. Balakrishnan, M. Bost, R. Brain, V. Chikarmane, R. Heussner, M. Hussein, J. Hwang, D. Ingerly, R. James, J. Jeong, C. Kenyon, E. Lee, S.-H. Lee, N. Lindert, M. Liu, Z. Ma, T. Marieb, A. Murthy, R. Nagisetty, S. Natarajan, J. Neirynck, A. Ott, C. Parker, J. Sebastian, R. Shaheed, S. Sivakumar, J. Steigerwald, S. Tyagi, C. Weber, B. Woolery, A. Yeoh, K. Zhang and M. Bohr, “A 65nm Logic Technology Featuring 35nm Gate Lengths, Enhanced Channel Strain, 8 Cu Interconnect Layers, Low-k ILD and 0.57µm2 SRAM Cell,” IEDM Technical Digest, pp. 657–660, Dec. 2004.

[13]. F. Klass, "Semi-Dynamic and Dynamic Flip-Flops with Embedded Logic,” 1998 Symposium on VLSI Circuits, Digest of Technical Papers, pp. 108–109.

[14]. J. Tschanz, S. Narendra, Z. Chen, S. Borkar, M. Sachdev, V. De, “Comparative Delay and Energy of Single Edge-Triggered & Dual Edge-Triggered Pulsed Flip-Flops for High-Performance Microprocessors,” ISLPED 2001, pp. 147-151.

[15]. J. Tschanz, S. Narendra, Y. Ye, B. Bloechel, S. Borkar and V. De, “Dynamic sleep transistor and body bias for active leakage power control of microprocessors,” IEEE J. of Solid-State Circuits, pp. 1838–1845, Nov. 2003.

[16]. M. Khellah, D. Somasekhar, Y. Ye, N. Kim, J. Howard, G. Ruhl, M. Sunna, J. Tschanz, N. Borkar, F. Hamzaoglu, G. Pandya, A. Farhang, K. Zhang, and V. De, “A 256-Kb Dual-Vcc SRAM Building Block in 65-nm CMOS Process With Actively Clamped Sleep Transistor,” IEEE J. of Solid-State Circuits, pp. 233–242, Jan. 2007.

[17]. J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput. 19, 297–301, 1965.

[18]. S. Vangal, A. Singh, J. Howard, S. Dighe, N. Borkar and A. Alvandpour, "A 5.1GHz 0.34mm2 Router for Network-on-Chip Applications ", 2007 Symposium on VLSI Circuits, Digest of Technical Papers, pp. 42–43, June 2007.

[19]. J. Held, J. Bautista, and S. Koehl, “From a few cores to many: A tera-scale computing research overview,” 2006, http://www.intel.com/research/platform/terascale/.


Chapter 5

Conclusions and Future Work

5.1 Conclusions

Multi-core processors are poised to become the new embodiment of Moore's law. Indeed, the processor industry is now aiming for ten to a hundred cores on a single die within the next decade [1]. To realize this level of integration quickly and efficiently, NoC architectures are rapidly emerging as the candidate of choice, providing a highly scalable, reliable, and modular on-chip communication infrastructure. Since a NoC is constructed from multiple point-to-point links interconnected by switches (i.e., routers) that exchange packets between the various IP blocks, this thesis focused on research into these key building blocks, spanning three process generations (150-nm, 90-nm and 65-nm). NoC routers should be small, fast and energy-efficient. To address this, we demonstrated the area and peak-energy reduction benefits of a six-port four-lane 57 GB/s communications router core. Design enhancements include four double-pumped crossbar channels and destination-aware crossbar channel drivers that dynamically configure based on the current packet destination. Combined application of both techniques enables a 45% reduction in channel area, a 23% reduction in overall router core area, up to a 3.8X reduction in peak crossbar channel power, and a 7.2% improvement in average channel power in a 150-nm six-metal CMOS process. A second-generation 102 GB/s router design with a shared-crossbar architecture increases the router area savings to 34%, enabling a compact 0.34mm2 design. The cost savings are expected to increase for crossbar routers with larger link widths (n > 64 bytes) due to the O(n2) scaling of crossbar area with link width.
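To illustrate the O(n2) remark with a deliberately simplified model (this is not the thesis's area model), assume full-crossbar wiring area grows with the square of the total bit width; a fixed fractional channel-area saving then corresponds to a rapidly growing amount of absolute area as links widen.

    PORTS = 6
    CHANNEL_SAVING = 0.45                       # measured crossbar channel area reduction
    def relative_xbar_area(link_bytes):
        return (PORTS * link_bytes * 8) ** 2    # arbitrary relative units

    for n in (16, 64, 128):
        area = relative_xbar_area(n)
        print("%3d-byte links: baseline %.3g, saved ~%.3g units" % (n, area, CHANNEL_SAVING * area))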

We next presented the design of a pipelined single-precision FPMAC featuring a bit-level pipelined multiplier unit and a single-cycle accumulation loop, with delayed addition and normalization stages. A combination of algorithmic, logic and circuit techniques enabled 3 GHz operation, with just 15 FO4 stages in the critical path. The design eliminates scheduling restrictions between consecutive FPMAC instructions and improves throughput. In addition, an improved leading-zero anticipator and overflow detection logic applicable to the carry-save format have been developed. Fabricated in a 90-nm seven-metal dual-VT CMOS process, silicon achieves 6.2 GFLOPS of performance at 3.1 GHz and 1.3 V supply, dissipating 1.2 W at 40°C and resulting in 5.2 GFLOPS/W. This performance comes at the expense of a larger datapath in the accumulate loop, increasing layout area and power. The conditional normalization technique helps reclaim some of this power, enabling up to a 24.6% reduction in FPMAC active power and a 30% reduction in leakage power.
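A back-of-envelope check of the FPMAC figures above, assuming one multiply-add (two FLOPs) retired per cycle as described:

    freq_ghz, power_w = 3.1, 1.2
    gflops = 2 * freq_ghz                     # sustained multiply-add throughput
    print("%.1f GFLOPS, %.1f GFLOPS/W" % (gflops, gflops / power_w))   # 6.2 GFLOPS, 5.2 GFLOPS/W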

We finally describe an 80-tile high-performance NoC architecture implemented in a 65-nm eight-metal CMOS process technology. The prototype contains 160 lower-latency FPMAC cores and features a single-cycle accumulator architecture for high throughput. Each tile also contains a fast and compact router operating at core speed where the 80 tiles are interconnected using a 2D mesh topology providing a high bisection bandwidth of over 2 Tera-bits/s. The design uses a combination of micro-architecture, logic, circuits and a 65-nm process to reach target performance. Silicon operates over a wide voltage (0.67V–1.35V) and frequency (1 GHz–5.67 GHz) range, delivering teraFLOPS performance with high power efficiency. For the most computationally intensive application kernel, the chip achieves an average performance of 1.0 TFLOPS, while dissipating 97 W at 4.27 GHz and 1.07 V supply, corresponding to an energy efficiency of 10.5 GFLOPS/W. This represents a 100X-150X improvement in power efficiency, while providing a 200X increase in aggregate performance over a typical desktop processor available in the market today.

It is clear that the realization of successful NoC designs requires well-balanced decisions at all levels: architecture, logic, circuit and physical design. Our results demonstrate that the NoC architecture successfully delivers on the promise of greater integration, high performance, good scalability and high energy efficiency. We are one step closer to having supercomputing performance built into our desktop and mobile computers in the near future.


5.2 Future Work

NoC design is a multi-variable optimization problem. Several challenging research problems remain to be solved at all levels of the layered stack, from the physical link level through the network level and all the way up to the system architecture and application software. Network architecture research in the areas of topology, routing and flow control must complement efforts in designing NoCs with a high degree of fault tolerance while satisfying application quality-of-service (QoS) requirements. Heterogeneous NoCs that allocate resources as needed [2] and circuit-switched networks [3] are promising approaches. Designing efficient routers to support such networks is a worthwhile challenge.

The on-die fabric power constitutes a significant and growing portion of the overall design budget. Recall that the 2D mesh network in the teraFLOPS NoC accounted for 28% of total power. It is imperative to contain the power and area needs of the communication fabric without adversely affecting performance. Much work still needs to be done on low-latency, power- and area-efficient router architectures. An aggressive router design should have a latency of just 1-2 cycles, account for less than 10% of the tile power and area budgets, and still allow multi-GHz operation. Future routers should incorporate extensive fine-grained power management techniques that dynamically adapt to changing network workload conditions for energy-efficient operation.

Low-power circuits for fabric (link) power reduction, incorporating low-swing signaling techniques, are a vital research area [4]. The circuits must not only reduce the interconnect swing, but also enable the use of very low supply voltages to obtain quadratic energy savings. While low-power encoding techniques have been proposed at the link level, recent work [5] concludes that non-encoded links with end-to-end data protection through error correction allow aggressive supply voltage scaling and better power savings.
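The quadratic savings referred to above come from the C·V² dependence of dynamic link energy. A simple illustration, with an arbitrary per-wire capacitance:

    c_link_pf = 1.0                               # arbitrary link capacitance in pF
    for v in (1.2, 0.9, 0.6):
        print("%.1f V: %.2f pJ per transition" % (v, c_link_pf * v ** 2))
    # halving the swing/supply cuts the energy per transition by 4X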

Optical interconnects present an attractive alternative to electrical communication for both on-chip and off-chip use, providing high bandwidth at potentially lower power. Research shows that optical links outperform electrical wires for lengths longer than 3-5 mm in a 50 nm CMOS process [6]. Optical communication shows great potential for low-power operation and for use in future clock distribution and global signaling. A fast silicon optical modulator and a Raman silicon laser have recently been demonstrated [7]. Several practical issues and challenges associated with integrating photonic devices on silicon and processing them in a high-volume CMOS manufacturing environment are discussed in [7].


Multi-core processors need substantial memory bandwidth, a challenge commonly referred to as "feeding the beast". To enable memory bandwidths beyond 1 TB/s, the 3D processor + memory stacked-die architecture using multi-chip packaging (MCP) becomes interesting [8]. Since this provides the shortest possible interconnect between the CPU and memory die, the bit rate can exceed 10 Gb/s. In addition, the interconnect density will scale to enable thousands of die-to-die interconnections. Despite the promising advantages of 3D stacking, significant process-flow integration, packaging and thermal challenges [9] need to be overcome.
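A rough feel for the aggregate bandwidth such stacking enables, assuming an illustrative (hypothetical) count of one thousand die-to-die connections at the quoted 10 Gb/s each:

    links, gbps_per_link = 1000, 10
    tbytes_per_s = links * gbps_per_link / 8.0 / 1000.0   # Gb/s -> TB/s
    print("%d links x %d Gb/s ~= %.2f TB/s" % (links, gbps_per_link, tbytes_per_s))
    # already beyond the 1 TB/s memory-bandwidth target mentioned above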

Chip power delivery is becoming harder due to the increase in the current drawn from the supply and in its rate of change (di/dt) caused by faster switching as technology is scaled. Delivering 200 W at a 500 mV supply voltage is a very challenging problem because of the high supply current and low supply voltage involved. Recently, researchers have shown the effectiveness of integrated CMOS voltage regulators (VRs) [10]. These VRs are small, have high current efficiency (>95%) and have fast (<1 ns) response times. The modular nature of NoCs makes on-chip VR integration possible, with ultra-fast control of dozens of independent multi-supply-voltage regions on-die. This should improve the power distribution efficiency and allow excellent dynamic power management.
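The scale of the problem follows from simple arithmetic on the figures quoted above:

    power_w, vcc = 200.0, 0.5
    print("%.0f A of supply current" % (power_w / vcc))   # 400 A at a 500 mV supply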

As a final note, to successfully exploit the pool of integrated hardware resources available on silicon, research into parallel programming is vital. There is a compelling need to develop enhanced compilation and scheduling tools, fine-grained instruction scheduling algorithms and new programming tools that make highly threaded and data-parallel applications a reality.

5.3 References

[1]. J. Held, J. Bautista, and S. Koehl, “From a few cores to many: A tera-scale computing research overview,” 2006, http://www.intel.com/research/platform/terascale/.

[2]. K. Rijpkema et al., “Trade-Offs in the Design of a Router with Both Guaranteed and Best-Effort Services for Networks on Chip,” IEE Proc. Comput. Digit. Tech., vol. 150, no. 5, 2003, pp. 294-302.

[3]. P. Wolkotte et al., “An Energy-Efficient Reconfigurable Circuit-Switched Network-on-Chip,” Proc. IEEE Int’l Parallel and Distributed Symp. (IPDS 05), IEEE CS Press, 2005, pp. 155-162.


[4]. H. Zhang, V. George and J. Rabaey, “Low-swing on-chip signaling techniques: effectiveness and robustness”, IEEE Transactions on VLSI Systems, June 2000, pp. 264 – 272.

[5]. A. Jantsch and R. Vitkowski, “Power analysis of link level and end-to-end data protection in networks-on-chip”, International Symposium on Circuits and Systems (ISCAS), 2005, pp. 1770–1773.

[6]. P. Kapur and K Saraswat, “Optical interconnects for future high performance integrated circuits”, Physica E 16, 3–4, 2003, pp. 620–627.

[7]. M. Paniccia, M. Ansheng, N. Izhaky and A. Barkai, “Integration challenge of silicon photonics with microelectronics”, 2nd IEEE International Conf. on Group IV Photonics, Sept. 2005, pp. 20 – 22.

[8]. P. Reed, G. Yeung and B. Black, “Design Aspects of a Microprocessor Data Cache using 3D Die Interconnect Technology”, Proceedings of the International Conference on Integrated Circuit Design and Technology, May 2005, pp. 15–18.

[9]. Intel Technology Journal, Volume 11, Issue 3, 2007, http://www.intel.com/ technology/itj/.

[10]. P. Hazucha, S. Moon, G. Schrom, F. Paillet, D. Gardner, S. Rajapandian and T. Karnik, “High Voltage Tolerant Linear Regulator With Fast Digital Control for Biasing of Integrated DC-DC Converters”, IEEE J. of Solid-State Circuits, Jan. 2007, pp. 66–73.