
Implementation of ARM® Cortex®-A17 Quad-core in GLOBALFOUNDRIES 22FDX™ technology using Cadence Innovus

Joerg Winkler, Tamer Ragheb | Design Enablement

FinFET & FD-SOI Solve Different Market Needs

• Bulk CMOS: Lowest Cost
• FinFET: High Performance
• FD-SOI: Best Power/Performance/Cost Tradeoffs

22FDX™ Provides Differentiated Performance, Power and Die Cost

GLOBALFOUNDRIES - CDNlive 2016

• Industry’s first 22nm fully-depleted silicon-on-insulator (FD-SOI) technology

• Delivers ultra-low-power, FinFET-like performance at the cost effectiveness of 28nm planar

• Support for Forward and Reverse transistor body-biasing for flexible design trade-offs between power and performance

• Integrated RF for reduced system cost and back-gate feature to reduce RF power up to ~50%

• Enables applications across mobile, IoT and RF markets

Figure: performance vs. cost/die for 28SLP, 28HPP, 14LPP and 22FDX™. Next generation FD transistor boosts performance; cost/die comparable to 28SLP.

Figure: FD-SOI transistor, with labels: ultra-thin buried oxide insulator; fully depleted channel for low leakage; FD-SOI planar process similar to bulk.

22FDX™ Enables Optimized System Solutions

• Application scenario
  – 2 Quad-core CPU clusters
  – 1 cluster using FBB for maximum performance
  – 1 cluster using RBB for minimum leakage

Figure: maximum frequency vs. leakage power. Reverse body-bias (RBB) marks the minimum-leakage end, forward body-bias (FBB) the maximum-performance end.

FBB and RBB are implemented by different devices.

22FDX™ Digital Design Flow – Ready for Early Customers

22FDX™ digital design flow is fully supported by industry-standard EDA tools

Technology Feature | Design Features | Design Collateral Support | Cadence Tool Support
Implant Constraints | Min width and space | Implant-aware placement rules in router tech file | EDI/Innovus since GF 14nm
Source/Drain Constraints | Continuous RX | RX-aware placement properties in cell abstracts |
Double Patterning | Same/diff color spacing | DP-aware placement and routing rules in router tech file |
Double Patterning | Two masks per metal layer | Decomposition deck | PVS since GF 14nm
Body-biasing | Body-bias networks | UPF (IEEE 1801) connectivity | Long-standing CPF/UPF support on Cadence platform
Body-biasing | Body-bias corners | PVT corners become PVTB corners | Long-standing multi-corner support on Cadence platform
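The step from PVT to PVTB corners can be pictured as adding one more axis to the corner enumeration. The following is a minimal Python sketch and not part of the reference flow: the process, voltage and temperature entries follow the SS/FF and VDDnom±10% conventions used elsewhere in this deck, while the body-bias labels and the helper name `pvtb_corners` are illustrative assumptions.

```python
from itertools import product

# Illustrative PVT corners (process, supply, temperature in deg C).
pvt_corners = [
    ("SS", "VDDnom-10%", 125),
    ("SS", "VDDnom-10%", -40),
    ("FF", "VDDnom+10%", 125),
]

# Hypothetical body-bias conditions; real values come from the PDK.
body_bias = ["RBB", "zero-bias", "FBB"]


def pvtb_corners(pvt, bias):
    """Cross every PVT corner with every body-bias condition."""
    return [(p, v, t, b) for (p, v, t), b in product(pvt, bias)]


corners = pvtb_corners(pvt_corners, body_bias)
for corner in corners:
    print(corner)
```

In practice only a subset of such combinations would likely be characterized, since FBB and RBB are implemented by different devices.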

22FDX™ Digital Reference Flow Using Cadence Tool Suite

Flow steps and tools:
• Physical Synthesis: Genus®
• P&R (GigaOpt, CCOpt, NanoRoute): Innovus®
• Parasitics Extraction: Quantus Extraction
• Static Timing Analysis: Tempus™ Timing Signoff Solution
• Logic Equivalence and LP Checks: Conformal® LEC and LP
• Physical Verification: Physical Verification System (PVS)
• Power Analysis: Voltus Power System
• DFM: Litho Physical Analyzer (LPA)

Data passed between steps, as labeled in the diagram:
• RTL, FP, SDC
• Netlist, placement, SDC
• Netlist, layout, SDC, parasitics

Six steps in the diagram carry a signoff label.


22FDX™ Digital Reference Flow Using Cadence Tool Suite


The following Cadence flow modules have been used for the implementations:

Flow Module | Cortex-A17 Quad-core | Cortex-A9 Neon
Genus: RTL Synthesis | X | X
Innovus: Floorplanning, Place & Route | X | X
Tempus: Static Timing Analysis | X | X
Conformal: Logic Equivalence, Power Intent Checking | X | X
Quantus: Parasitic Extraction | X | X
Voltus: Power and Rail Analysis | X | Work in progress
PVS: DRC, LVS | Work in progress | Work in progress
LPA: DFM / DRC+ | Work in progress | Work in progress


ARM Cortex-A17 Quad-core Macro

• GLOBALFOUNDRIES has developed a family of ARM® Cortex®-A test chips for early technology evaluation

• Cortex-A17 quad-core test chip was first implemented and taped out in GLOBALFOUNDRIES 28nm-SLP as part of its ARM Cortex-A test chip strategy

• Same macro used to demonstrate the implementation of ARM Cortex-A multi-core macro in 22FDX™

• Approach has been proven to be very beneficial in early technology evaluation for exploring implementation decisions and implementation flow details

Figure: block diagram of the Cortex-A17 quad-core test chip.
• Cortex-A17 Quad Core Macro: four CPU cores, each with 32KB I$ / 32KB D$ and NEON; PTM0 to PTM3; SCU; L2 Cache Controller; 2MB L2 Cache
• Interconnect: AXI Bus Interconnect NIC400, AXI Synchronisation, AXI Slave, AXI Master, AHB Masters, APB Master, Interrupt Controller GIC400
• Memory: AXI RAM Ctrl BP140, AXI RAM 512 KB, Upper SRAM, ROM, Burn-in ROM
• Debug and trace: DEBUG APB, APB-AP, JTAG-DP, ROM Table, ATB, Funnel, TPIU
• Peripherals and test: RTC PL031, GPIO PL061, Trick Box, Wait for INT, Test Structures, DfT/MBIST Ctrl, Config, PLL
• External signals: TRACEPORT, RTCK, GPIO, JTAG, CFGCLK, CFGDATA, REFCLK


22FDX™ ARM Cortex-A17 Design IP

• Standard Cell Libraries
  – Base libraries
    • Invecas 8-track LVt/SLVt C20
      – Continuous RX (CNRX)
      – Support for body-biasing
  – Power Management Kit
    • GLOBALFOUNDRIES evaluation standard cell kit
      – Support for body-biasing
• Cache Memory Instances
  – GLOBALFOUNDRIES evaluation memory kit
    • 14 different L1 cache memory macros
    • 1 L2 cache memory macro
    • Support for memory periphery body-biasing
    • Support for memory bitcell array body-biasing


22FDX™ Power and Body-biasing Domains

• Supply voltage and body-bias voltage scenarios define power and voltage design intent

• Eventual design architecture depends on specific application scenarios and optimization criteria

• Current Cortex-A17 quad-core macro reference implementation supports the following scenarios
  – 5 power domains
    • 4 CPU cores + 1 nonCPU module
    • Controlled by regular power switches
    • Allows for power-off states of individual CPU cores
  – 1 unified body-bias scenario for quad-core macro
    • 5 body-bias net pairs (n-well, p-well biasing)
      – 1 pair for standard cells
      – 2 pairs for L1 cache periphery, bitcell array
      – 2 pairs for L2 cache periphery, bitcell array
    • Body-bias nets might be shared depending on eventual IP features
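To make the net count concrete, the five body-bias pairs listed above can be enumerated in a few lines. This is a minimal Python sketch under stated assumptions: the target keys and the generated net names (for example `BIAS_NWELL_L1_PERI`) are hypothetical and only loosely modeled on the `BIAS_NWELL` / `BIAS_PWELL` naming that appears in the UPF sample of this deck.

```python
# Body-bias targets from the slide: one pair each, five pairs in total.
TARGETS = [
    "STD",       # standard cells
    "L1_PERI",   # L1 cache periphery
    "L1_ARRAY",  # L1 cache bitcell array
    "L2_PERI",   # L2 cache periphery
    "L2_ARRAY",  # L2 cache bitcell array
]
WELLS = ["NWELL", "PWELL"]  # each pair biases the n-well and the p-well


def body_bias_nets(targets=TARGETS, wells=WELLS):
    """Return {target: (nwell_net, pwell_net)} with hypothetical net names."""
    return {t: tuple(f"BIAS_{w}_{t}" for w in wells) for t in targets}


nets = body_bias_nets()
for target, pair in nets.items():
    print(target, pair)
print("total nets:", sum(len(p) for p in nets.values()))
```

Five pairs give ten individual nets to route; sharing nets between targets, as the slide allows, would reduce that number.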


22FDX™ Cortex-A17 CPU

Cortex-A17 single-core CPU implementation uses a floorplan and place & route approach very similar to the one used in 28SLP for Cortex-A17 PPA optimization.

Figure: 28SLP single-core CPU floorplan, with regions labeled Data Engine, Dside, Iside and Core.


22FDX™ Cortex-A17 Core Body-bias Nets

Body-bias net routing

• Body-bias connections through dedicated pins of
  – Well tap cells
  – Power switches
  – Memory macros

• Well tap cells, power switches, and memory macros vertically aligned for straight body-bias connections

• Body-bias net ring placed at module perimeter


22FDX™ Cortex-A17 Body-bias Connections

Body-bias net routing

• Power switch and well tap cells vertically aligned for straight connections of body-bias nets
• Always-on power nets connected to power switches
• Layer usage approach
  – Use lower layer (vertical) metal for body-bias net routing to power switches, well tap cells, and memory macros
  – Use upper layer (vertical) metal for always-on power routing to power switches (and memory macros)

Figure labels: Well Tap Cell, Header Power Switch, Body-bias nets, Always-on VDD.


22FDX™ Cortex-A17 Quad-core Macro

Multiple power domains vs. body-bias scenario

• 5 power domains
  – CPU 0..3
  – nonCPU
• 1 unified body-bias scenario
  – 5 pairs of body-bias nets
  – Each pair connected across Cortex-A17 quad-core macro
• Body-bias net ring across nonCPU module
  – Provides connectivity to CPU 0..3
  – Provides connectivity to standard cells and memory macros in nonCPU

Figure: quad-core macro floorplan showing CPU 0, CPU 1, CPU 2, CPU 3 and nonCPU.


22FDX™ Cortex-A17 Body-bias Net Connection in Hierarchical Design

• Body-bias net ring across nonCPU module
  – Support for 5 body-bias net pairs
  – Overlapping nonCPU boundary cells
• Sub-module body-bias connections in hierarchical design
  – 3 body-bias net pairs connecting CPU sub-module


22FDX™ Cortex-A17 Memory Macro Body-bias Nets

Body-bias net routing around memory macros

• 1 pair for std cells
• 2 pairs for cache memory macro
• Body-bias net routing needs to obey high-voltage related non-default spacing rules
  – Examples
    • 0.8V: ≥ 50nm
    • 1.2V: ≥ 65nm
    • 1.5V: ≥ 80nm
    • 1.8V: ≥ 90nm

Figure labels: Power Switches; Memory Macro Body-bias nets (Bitcell Array, Periphery); Standard Cell Body-bias nets.
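The example spacing values above behave like a lookup table keyed by voltage. The sketch below, in Python, shows one way to query it. It assumes that a voltage falling between two listed values takes the rule of the next-higher listed voltage; that rounding policy and the function name `min_spacing_nm` are assumptions for illustration, and the actual rule deck is authoritative.

```python
from bisect import bisect_left

# Example rules from the slide: (voltage in V, minimum spacing in nm).
SPACING_RULES = [(0.8, 50), (1.2, 65), (1.5, 80), (1.8, 90)]


def min_spacing_nm(voltage):
    """Return the minimum spacing for a voltage, rounding up to the next rule.

    Assumption: a voltage between two listed values uses the rule of the
    next-higher listed voltage. The real rule deck may differ.
    """
    volts = [v for v, _ in SPACING_RULES]
    idx = bisect_left(volts, voltage)
    if idx == len(volts):
        raise ValueError(f"no example rule covers {voltage} V")
    return SPACING_RULES[idx][1]


for v in (0.8, 1.0, 1.8):
    print(v, "V ->", min_spacing_nm(v), "nm")
```

Voltages above the highest listed example raise an error rather than extrapolate, since the slide gives no rule for them.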


Sample – UPF Script

Define body-bias nets and ports in the same way as for VDD and VSS:

    create_supply_net BIAS_NWELL
    create_supply_net BIAS_PWELL
    create_supply_port BIAS_NWELL_PORT
    create_supply_port BIAS_PWELL_PORT
    connect_supply_net BIAS_NWELL \
        -ports BIAS_NWELL_PORT
    connect_supply_net BIAS_PWELL \
        -ports BIAS_PWELL_PORT

Add body-bias nets to supply sets:

    create_supply_set SS_PDNONCPU \
        -function "power VDD" \
        -function "ground VSS" \
        -function "nwell BIAS_NWELL" \
        -function "pwell BIAS_PWELL"

Explicitly connect body-bias nets to memory macro body-bias pins:

    foreach mem $mems {
        connect_supply_net BIAS_NWELL_MEMP \
            -ports $mem/nwell_mem_peri
        ...
    }

Explicitly connect body-bias nets to sub-module body-bias pins:

    foreach sub_module $sub_modules {
        connect_supply_net BIAS_NWELL \
            -ports $sub_module/bias_nwell
        ...
    }
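When the list of memory instances comes from a spreadsheet or floorplan report rather than from the tool session, the per-instance connections can also be generated as plain UPF text. The following Python sketch illustrates that idea and is not part of the reference flow. It reuses the `connect_supply_net ... -ports` form and the `BIAS_NWELL_MEMP` / `nwell_mem_peri` names from the sample; the instance paths and the helper name `connect_bias_pins` are hypothetical.

```python
def connect_bias_pins(net, instances, pin):
    """Emit one UPF connect_supply_net command per instance pin."""
    return [
        f"connect_supply_net {net} -ports {inst}/{pin}"
        for inst in instances
    ]


# Hypothetical memory instance paths, for illustration only.
mems = ["u_cpu0/u_icache_ram0", "u_cpu0/u_dcache_ram0"]

commands = connect_bias_pins("BIAS_NWELL_MEMP", mems, "nwell_mem_peri")
print("\n".join(commands))
```

The generated lines are equivalent to unrolling the `foreach` loop by hand, which can make the resulting connectivity easier to review.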


Sample – Implementation Script

Provide body-bias connectivity to sub-modules, memory macros, well tap cells and power switches across design; add body-bias ports:

    addRing -nets $BIAS_NETS -extend_corner

Insert well tap cells; align them with power switches for straight body-bias stripe creation:

    addWellTap -inRowOffset

Create body-bias stripes to connect memory macros, well tap cells and power switches across design:

    addStripe -nets $BIAS_NETS

Connect body-bias pins of sub-modules:

    sroute -connect {blockPin} -inst $sub_module -nets $BIAS_NET


ARM Cortex-A9 Neon PPA Comparison

GLOBALFOUNDRIES has been using an ARM Cortex-A9 Neon block for synthesis, place and route PPA analysis across different technology nodes.

This methodology has been utilized for early technology benchmarking and implementation flow development.

• Goal:
  – Evaluate 22FDX™ performance at different body-bias (BB) settings as compared to 28SLP
  – Employ GLOBALFOUNDRIES digital reference flow phase II
    • Available to customers on GLOBALFOUNDRIES FoundryView web site
  – Use a simple apples-to-apples methodology for customers to replicate


ARM Cortex-A9 Neon PPA Comparison

• Corners used:
  – For speed (performance): SS / VDDnom-10% / worst (125°C and -40°C)
  – For power (dynamic and leakage): FF / VDDnom+10% / 125°C
• Design Setup:
  – Testcase: falcon_neon (part of ARM Cortex-A9)
  – Libraries used:
    • 9T for 28SLP (LVT only)
    • 8T for 22FDX™ (LVT for FBB and RVT for RBB, "VT overlap with LVT")
    • Using same cell list
    • Limited list of standard cells used
      – 54 standard cells representing 15 unique logic functions
  – Routing layers: M2-M7
  – EDA Tool: Innovus 15.12 (implementation) / Innovus 15.14 (power reporting)
  – CTS Engine: CCOpt
  – No speedboost replacement cells


ARM Cortex-A9 Neon PPA Results


Figure: normalized frequency (y-axis, 1 to 4) vs. normalized total power (x-axis, 1 to 9). Points are annotated with body-bias settings VNW,VPW = 0,0; 0,-1; 1,-2, for the two well arrangements labeled "VNW,VPW = NMOS,PMOS" and "VNW,VPW = PMOS,NMOS".

• ~45% power reduction at iso-frequency, plus ~45% area reduction
• ~30% higher frequency at iso-power, plus ~45% area reduction
• Bulk nodes need a change of implementation to change the target
• BB control can change frequency vs. power with the same implementation
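The headline percentages from the results plot can be restated as ratios against a reference normalized to 1.0. The Python sketch below does only that arithmetic. It assumes the reference is the comparison node named in the stated goal and treats the approximate slide values as exact, so the outputs are illustrative rather than measured data; the helper name `scaled` is hypothetical.

```python
def scaled(baseline, percent_change):
    """Apply a signed percentage change to a baseline value."""
    return baseline * (1.0 + percent_change / 100.0)


BASE = 1.0  # reference normalized to 1.0 (assumed to be the comparison node)

# Operating point 1: same frequency, ~45% less power, ~45% less area.
iso_freq = {"freq": BASE, "power": scaled(BASE, -45), "area": scaled(BASE, -45)}

# Operating point 2: same power, ~30% more frequency, ~45% less area.
iso_power = {"freq": scaled(BASE, 30), "power": BASE, "area": scaled(BASE, -45)}

print(iso_freq)
print(iso_power)
```

Read this way, the two callouts describe two operating points of one implementation, which matches the slide's point that body-bias control moves along the frequency vs. power curve without re-implementing the design.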

Further Investigations

• Floorplanning optimizations on body-bias net routing
  – Top/bottom, partial ring distributions vs. ring distribution in physical design macros
  – Distribution around clusters of memory macros
• Full flow exercise
  – Including rail analysis
• Full ARM Cortex-A PPA optimization


Summary

• 22FDX™ is industry's first 22nm FD-SOI platform

• Delivers ultra-low-power, FinFET-like performance at the cost effectiveness of 28nm planar

• Full PPA optimization capabilities through Cadence implementation and sign-off tools

• 22FDX™ digital design flow is similar to bulk digital design flow

• 22FDX™ digital design flow exploits EDA techniques which have been deployed on earlier nodes
  – Implant-aware, source/drain-aware, double patterning, UPF support

• Starter kit of 22FDX™ digital design flow available from GLOBALFOUNDRIES now

• 22FDX™ digital design flow has been demonstrated on ARM Cortex-A17 quad-core reference implementation with full Cadence support for FD-SOI design implementation


GLOBALFOUNDRIES: Ralf Flemming, Ingolf Lorenz, Tamer Ragheb, Joerg Winkler
Cadence: Dirk Seidler, Klaus Sigl, Cristen Decoin, Jonathan Smith

Joerg Winkler [email protected]

Tamer Ragheb [email protected]

GLOBALFOUNDRIES Design Enablement

Disclaimer The information contained herein [is confidential and] is the property of GLOBALFOUNDRIES and/or its licensors. This document is for informational purposes only, is current only as of the date of publication and is subject to change by GLOBALFOUNDRIES at any time without notice. GLOBALFOUNDRIES, the GLOBALFOUNDRIES logo and combinations thereof are trademarks of GLOBALFOUNDRIES Inc. in the United States and/or other jurisdictions. Other product or service names are for identification purposes only and may be trademarks or service marks of their respective owners. © GLOBALFOUNDRIES Inc. 2016. Unless otherwise indicated, all rights reserved. Do not copy or redistribute except as expressly permitted by GLOBALFOUNDRIES.

Thank you
