Implementation of ARM® Cortex®-A17 Quad-core in GLOBALFOUNDRIES 22FDX™ Technology Using Cadence Innovus
Joerg Winkler, Tamer Ragheb | Design Enablement
FinFET & FD-SOI Solve Different Market Needs
• Bulk CMOS: Lowest Cost
• FinFET: High Performance
• FD-SOI: Best Power/Performance/Cost Tradeoffs
22FDX™ Provides Differentiated Performance, Power and Die Cost
GLOBALFOUNDRIES - CDNlive 2016
• Industry’s first 22nm fully-depleted silicon-on-insulator (FD-SOI) technology
• Delivers ultra-low-power, FinFET-like performance at the cost-effectiveness of 28nm planar
• Support for forward and reverse transistor body-biasing for flexible design trade-offs between power and performance
• Integrated RF for reduced system cost, and a back-gate feature that reduces RF power by up to ~50%
• Enables applications across mobile, IoT and RF markets
Figure: performance vs. cost/die for 28SLP, 28HPP, 14LPP, and 22FDX™. Next-generation FD transistor boosts performance; cost/die comparable to 28SLP.
Figure: FD-SOI transistor structure. Ultra-thin buried oxide insulator; fully depleted channel for low leakage; FD-SOI planar process similar to bulk.
22FDX™ Enables Optimized System Solutions
• Application scenario
– 2 quad-core CPU clusters
– 1 cluster using FBB for maximum performance
– 1 cluster using RBB for minimum leakage
Figure: maximum frequency vs. leakage power. Reverse body-bias (RBB) gives minimum leakage; forward body-bias (FBB) gives maximum performance.
• FBB and RBB are implemented by different devices
22FDX™ Digital Design Flow – Ready for Early Customers
22FDX™ digital design flow is fully supported by industry-standard EDA tools
Technology Feature | Design Features | Design Collateral Support | Cadence Tool Support
Implant Constraints | Min width and space | Implant-aware placement rules in router tech file | EDI/Innovus since GF 14nm
Source/Drain Constraints | Continuous RX | RX-aware placement properties in cell abstracts | EDI/Innovus since GF 14nm
Double Patterning | Same/diff color spacing | DP-aware placement and routing rules in router tech file | EDI/Innovus since GF 14nm
Double Patterning | Two masks per metal layer | Decomposition deck | PVS since GF 14nm
Body-biasing | Body-bias networks | UPF (IEEE 1801) connectivity | Long-standing CPF/UPF support on Cadence platform
Body-biasing | Body-bias corners | PVT corners become PVTB corners | Long-standing multi-corner support on Cadence platform
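The table notes that, with body-biasing, PVT corners become PVTB corners. As a rough illustration of the effect on corner count, the Python sketch below enumerates corners as the product of process, voltage, temperature, and body-bias settings. Every corner name and value in it is a hypothetical placeholder, not a 22FDX™ PDK value, and a real flow would prune this full product (for example, a given library would only see the bias settings its devices support).

```python
from itertools import product

# Hypothetical corner axes, for illustration only (not PDK values).
PROCESS = ["SS", "TT", "FF"]
VOLTAGE = ["VDDnom-10%", "VDDnom", "VDDnom+10%"]
TEMPERATURE_C = [-40, 25, 125]
# Placeholder names for zero, forward, and reverse body-bias settings.
BODY_BIAS = ["ZBB", "FBB", "RBB"]


def pvt_corners():
    """Classic PVT corners: every (process, voltage, temperature) combination."""
    return list(product(PROCESS, VOLTAGE, TEMPERATURE_C))


def pvtb_corners(body_bias=BODY_BIAS):
    """PVTB corners: each PVT corner repeated once per body-bias setting."""
    return [pvt + (bb,) for pvt in pvt_corners() for bb in body_bias]


print(len(pvt_corners()))   # 27
print(len(pvtb_corners()))  # 81
```

The point of the sketch is only that body bias adds one more axis to the corner definition; which combinations are actually characterized depends on the library and is not stated in the slides.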
22FDX™ Digital Reference Flow Using Cadence Tool Suite
Flow Step | Cadence Tool
Physical Synthesis | Genus®
P&R (GigaOpt, CCOpt, NanoRoute) | Innovus®
Parasitics Extraction | Quantus Extraction
Static Timing Analysis | Tempus™ Timing Signoff Solution
Logic Equivalence and LP Checks | Conformal® LEC and LP
Physical Verification | Physical Verification System (PVS)
Power Analysis | Voltus Power System
DFM | Litho Physical Analyzer (LPA)
Figure: flow diagram. Data labels between stages: RTL, FP, SDC; Netlist, PLACEMENT, SDC; Netlist, Layout, SDC, Parasitics. Several analysis and verification steps are marked as signoff.
22FDX™ Digital Reference Flow Using Cadence Tool Suite
The following Cadence flow modules have been used for the implementations:
Flow Module | Cortex-A17 Quad-core | Cortex-A9 Neon
Genus: RTL Synthesis | X | X
Innovus: Floorplanning, Place & Route | X | X
Tempus: Static Timing Analysis | X | X
Conformal: Logic Equivalence, Power Intent Checking | X | X
Quantus: Parasitic Extraction | X | X
Voltus: Power and Rail Analysis | X | Work in progress
PVS: DRC, LVS | Work in progress | Work in progress
LPA: DFM / DRC+ | Work in progress | Work in progress
ARM Cortex-A17 Quad-core Macro
• GLOBALFOUNDRIES has developed a family of ARM® Cortex®-A test chips for early technology evaluation
• The Cortex-A17 quad-core test chip was first implemented and taped out in GLOBALFOUNDRIES 28nm-SLP as part of its ARM Cortex-A test chip strategy
• The same macro is used to demonstrate the implementation of an ARM Cortex-A multi-core macro in 22FDX™
• This approach has proven very beneficial in early technology evaluation for exploring implementation decisions and implementation flow details
Figure: Cortex-A17 quad-core test chip block diagram.
• Cortex-A17 Quad Core Macro: Cortex-A17 CPU Core 0 to Core 3 (each with 32KB I$ / 32KB D$ and NEON), PTM0 to PTM3, SCU, L2 Cache Controller, 2MB L2 Cache
• Other blocks shown: AXI Bus Interconnect NIC400 (AXI Slave, AXI Master, AHB Master, APB Master), AXI Synchronisation, Interrupt Controller GIC400, AXI RAM Ctrl BP140 with 512 KB Upper SRAM, ROM, Burn-in ROM, Trick Box, Wait for INT, Test Structures, DfT/MBIST Ctrl, Config, PLL, RTC PL031, GPIO PL061, Funnel, TPIU, ATB, DEBUG APB, APB-AP, JTAG-DP, ROMTable
• Signals shown: TRACEPORT, RTCK, GPIO, JTAG, CFGCLK, CFGDATA, REFCLK
22FDX™ ARM Cortex-A17 Design IP
• Standard Cell Libraries
– Base libraries
  • Invecas 8-track LVt/SLVt C20
    – Continuous RX (CNRX)
    – Support for body-biasing
– Power Management Kit
  • GLOBALFOUNDRIES evaluation standard cell kit
    – Support for body-biasing
• Cache Memory Instances
– GLOBALFOUNDRIES evaluation memory kit
  • 14 different L1 cache memory macros
  • 1 L2 cache memory macro
  • Support for memory periphery body-biasing
  • Support for memory bitcell array body-biasing
22FDX™ Power and Body-biasing Domains
• Supply voltage and body-bias voltage scenarios define power and voltage design intent
• Eventual design architecture depends on specific application scenarios and optimization criteria
• Current Cortex-A17 quad-core macro reference implementation supports the following scenarios
– 5 power domains
  • 4 CPU cores + 1 nonCPU module
  • Controlled by regular power switches
  • Allows for power-off states of individual CPU cores
– 1 unified body-bias scenario for quad-core macro
  • 5 body-bias net pairs (n-well, p-well biasing)
    – 1 pair for standard cells
    – 2 pairs for L1 cache periphery, bitcell array
    – 2 pairs for L2 cache periphery, bitcell array
• Body-bias nets might be shared depending on eventual IP features
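To make the bookkeeping concrete, here is a minimal Python sketch of the five power domains and five body-bias net pairs listed on this slide, plus one way to model the remark that nets might be shared. The net names for the standard cell pair follow the UPF sample later in the deck; all other net names, and the particular sharing choice shown, are hypothetical and only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiasPair:
    """One body-bias net pair: an n-well net and a p-well net."""
    target: str
    nwell: str
    pwell: str


# Five power domains: four CPU cores plus the nonCPU module.
POWER_DOMAINS = ["CPU0", "CPU1", "CPU2", "CPU3", "nonCPU"]

# Five bias net pairs for the unified body-bias scenario.
# Net names other than BIAS_NWELL / BIAS_PWELL are hypothetical.
BIAS_PAIRS = [
    BiasPair("standard cells", "BIAS_NWELL", "BIAS_PWELL"),
    BiasPair("L1 periphery", "BIAS_NWELL_L1P", "BIAS_PWELL_L1P"),
    BiasPair("L1 bitcell array", "BIAS_NWELL_L1B", "BIAS_PWELL_L1B"),
    BiasPair("L2 periphery", "BIAS_NWELL_L2P", "BIAS_PWELL_L2P"),
    BiasPair("L2 bitcell array", "BIAS_NWELL_L2B", "BIAS_PWELL_L2B"),
]


def bias_nets(pairs):
    """Flatten pairs into the list of individual nets to route."""
    return [net for p in pairs for net in (p.nwell, p.pwell)]


def apply_sharing(pairs, reuse):
    """Return pairs where each target named in `reuse` borrows another target's nets."""
    by_target = {p.target: p for p in pairs}
    shared = []
    for p in pairs:
        donor = by_target[reuse.get(p.target, p.target)]
        shared.append(BiasPair(p.target, donor.nwell, donor.pwell))
    return shared


# Hypothetical sharing choice: each bitcell array reuses its periphery nets.
SHARED = apply_sharing(BIAS_PAIRS, {
    "L1 bitcell array": "L1 periphery",
    "L2 bitcell array": "L2 periphery",
})

print(len(bias_nets(BIAS_PAIRS)))   # 10
print(len(set(bias_nets(SHARED))))  # 6
```

Whether any such sharing is possible depends on the memory IP, as the slide says; the sketch only shows how the number of distinct nets to route would change.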
22FDX™ Cortex-A17 CPU
The Cortex-A17 single-core CPU implementation uses a floorplan and place & route approach very similar to the one used in 28SLP for Cortex-A17 PPA optimization.
Figure: 28SLP single-core CPU floorplan showing the Data Engine, Dside, Iside, and Core regions.
22FDX™ Cortex-A17 Core Body-bias Nets
Body-bias net routing
• Body-bias connections through dedicated pins of
– Well tap cells
– Power switches
– Memory macros
• Well tap cells, power switches, and memory macros vertically aligned for straight body-bias connections
• Body-bias net ring placed at module perimeter
22FDX™ Cortex-A17 Body-bias Connections
Body-bias net routing
• Power switch and well tap cells vertically aligned for straight connections of body-bias nets
• Always-on power nets connected to power switches
• Layer usage approach
– Use lower layer (vertical) metal for body-bias net routing to power switches, well tap cells, and memory macros
– Use upper layer (vertical) metal for always-on power routing to power switches (and memory macros)
Figure: layout detail showing a well tap cell, a header power switch, body-bias nets, and always-on VDD.
22FDX™ Cortex-A17 Quad-core Macro
Multiple power domains vs. body-bias scenario
• 5 power domains
– CPU 0..3
– nonCPU
• 1 unified body-bias scenario
– 5 pairs of body-bias nets
– Each pair connected across Cortex-A17 quad-core macro
• Body-bias net ring across nonCPU module
– Provides connectivity to CPU 0..3
– Provides connectivity to standard cells and memory macros in nonCPU
Figure: quad-core macro floorplan with CPU 0, CPU 1, CPU 2, CPU 3, and the nonCPU module.
22FDX™ Cortex-A17 Body-bias Net Connection in Hierarchical Design
• Body-bias net ring across nonCPU module
– Support for 5 body-bias net pairs
– Overlapping nonCPU boundary cells
• Sub-module body-bias connections in hierarchical design
– 3 body-bias net pairs connecting CPU sub-module
22FDX™ Cortex-A17 Memory Macro Body-bias Nets
Body-bias net routing around memory macros
• 1 pair for std cells
• 2 pairs for cache memory macro
• Body-bias net routing needs to obey high-voltage related non-default spacing rules
– Examples
  • 0.8V ≥ 50nm
  • 1.2V ≥ 65nm
  • 1.5V ≥ 80nm
  • 1.8V ≥ 90nm
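The example values above can be read as a voltage-to-spacing lookup. The Python sketch below encodes the four examples and returns a minimum spacing for a given voltage. Two things are assumptions, not statements from the slide: that each example covers all voltages up to and including its own value (so an in-between voltage takes the next larger spacing), and what exactly "voltage" refers to (the absolute net voltage or the difference to a neighboring net). The function name is hypothetical.

```python
import bisect

# (voltage in V, minimum spacing in nm), copied from the examples on the slide.
SPACING_EXAMPLES = [(0.8, 50), (1.2, 65), (1.5, 80), (1.8, 90)]


def min_spacing_nm(voltage_v, table=SPACING_EXAMPLES):
    """Return the minimum spacing in nm for a net at `voltage_v`.

    Assumption: each entry covers voltages up to and including its own value,
    so a voltage between two entries takes the larger spacing.
    """
    if voltage_v < 0:
        raise ValueError("voltage must be non-negative")
    limits = [v for v, _ in table]
    # Index of the first entry whose voltage is >= voltage_v.
    idx = bisect.bisect_left(limits, voltage_v)
    if idx == len(table):
        raise ValueError(f"no spacing example covers {voltage_v} V")
    return table[idx][1]


print(min_spacing_nm(0.8))  # 50
print(min_spacing_nm(1.0))  # 65
```

Voltages above the last example raise an error rather than extrapolating, since the slide gives no rule beyond 1.8V.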
Figure: layout detail around a memory macro showing power switches, memory macro body-bias nets (bitcell array and periphery), and standard cell body-bias nets.
Sample – UPF Script

Define body-bias nets and ports in the same way as for VDD and VSS:
    create_supply_net BIAS_NWELL
    create_supply_net BIAS_PWELL
    create_supply_port BIAS_NWELL_PORT
    create_supply_port BIAS_PWELL_PORT
    connect_supply_net BIAS_NWELL \
        -ports BIAS_NWELL_PORT
    connect_supply_net BIAS_PWELL \
        -ports BIAS_PWELL_PORT

Add body-bias nets to supply sets:
    create_supply_set SS_PDNONCPU \
        -function "power VDD" \
        -function "ground VSS" \
        -function "nwell BIAS_NWELL" \
        -function "pwell BIAS_PWELL"

Explicitly connect body-bias nets to memory macro body-bias pins:
    foreach mem $mems {
        connect_supply_net BIAS_NWELL_MEMP \
            -ports $mem/nwell_mem_peri
        ...
    }

Explicitly connect body-bias nets to sub-module body-bias pins:
    foreach sub_module $sub_modules {
        connect_supply_net BIAS_NWELL \
            -ports $sub_module/bias_nwell
        ...
    }
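Since the sample adds nwell and pwell functions to a supply set, one simple sanity check is to confirm that every supply set in a UPF file defines all four functions. The Python sketch below is a hypothetical helper, not part of any Cadence or GLOBALFOUNDRIES flow. It only understands the quoted `-function "name NET"` form used in the sample, and the second supply set in its test input (`SS_NO_BIAS`) is invented for illustration.

```python
import re

REQUIRED = {"power", "ground", "nwell", "pwell"}

# Matches one -function "name NET" option as written in the sample.
FUNCTION_RE = re.compile(r'-function\s+"(\w+)\s+(\w+)"')


def supply_set_functions(upf_text):
    """Map each supply set name to its {function: net} dictionary."""
    # Join backslash-continued lines so each command sits on one line.
    joined = re.sub(r"\\\s*\n", " ", upf_text)
    sets = {}
    for line in joined.splitlines():
        words = line.split()
        if len(words) >= 2 and words[0] == "create_supply_set":
            sets[words[1]] = dict(FUNCTION_RE.findall(line))
    return sets


def missing_functions(upf_text):
    """Return {supply_set: sorted missing functions} for incomplete sets."""
    report = {}
    for name, funcs in supply_set_functions(upf_text).items():
        missing = REQUIRED - set(funcs)
        if missing:
            report[name] = sorted(missing)
    return report


# SS_PDNONCPU is from the slide; SS_NO_BIAS is a made-up incomplete set.
SAMPLE = '''
create_supply_set SS_PDNONCPU \\
    -function "power VDD" \\
    -function "ground VSS" \\
    -function "nwell BIAS_NWELL" \\
    -function "pwell BIAS_PWELL"
create_supply_set SS_NO_BIAS \\
    -function "power VDD" \\
    -function "ground VSS"
'''

print(missing_functions(SAMPLE))  # {'SS_NO_BIAS': ['nwell', 'pwell']}
```

A text-level check like this is no substitute for a proper power intent check; it only catches a supply set that was written without body-bias functions.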
Sample – Implementation Script

Provide body-bias connectivity to sub-modules, memory macros, well tap cells and power switches across design; add body-bias ports:
    addRing -nets $BIAS_NETS -extend_corner

Insert well tap cells; align them with power switches for straight body-bias stripe creation:
    addWellTap -inRowOffset

Create body-bias stripes to connect memory macros, well tap cells and power switches across design:
    addStripe -nets $BIAS_NETS

Connect body-bias pins of sub-modules:
    sroute -connect {blockPin} -inst $sub_module -nets $BIAS_NET
ARM Cortex-A9 Neon PPA Comparison
GLOBALFOUNDRIES has been using an ARM Cortex-A9 Neon block for synthesis, place and route PPA analysis across different technology nodes.
This methodology has been utilized for early technology benchmarking and implementation flow development.
• Goal:
– Evaluate 22FDX™ performance at different body-bias (BB) settings as compared to 28SLP
– Employ GLOBALFOUNDRIES digital reference flow phase II
  • Available to customers on GLOBALFOUNDRIES FoundryView web site
– Use a simple apples-to-apples methodology for customers to replicate
ARM Cortex-A9 Neon PPA Comparison
• Corners used:
– For speed (performance): SS / VDDnom-10% / worst of 125°C and -40°C
– For power (dynamic and leakage): FF / VDDnom+10% / 125°C
• Design Setup:
– Testcase: falcon_neon (part of ARM Cortex-A9)
– Libraries used:
  • 9T for 28SLP (LVT only), using same cell list
  • 8T for 22FDX™ (LVT for FBB and RVT for RBB, "VT overlap with LVT")
  • Limited list of standard cells used
    – 54 standard cells representing 15 unique logic functions
– Routing layers: M2-M7
– EDA Tool: Innovus 15.12 (implementation) / Innovus 15.14 (power reporting)
– CTS Engine: CCOpt
– No speedboost replacement cells
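The corner definitions in this setup can be written down as a small Python sketch: speed is taken at SS and VDDnom-10% at the worse of two temperatures, power at FF and VDDnom+10% at 125°C. The function names are hypothetical, and the nominal supply and frequency numbers used in the usage lines are made-up inputs, not results from the slides.

```python
def corner_voltages(vdd_nom, tolerance=0.10):
    """Return (speed_vdd, power_vdd), i.e. VDDnom-10% and VDDnom+10%."""
    low = round(vdd_nom * (1 - tolerance), 4)
    high = round(vdd_nom * (1 + tolerance), 4)
    return low, high


def ppa_corners(vdd_nom):
    """Build the speed and power corners as (process, vdd in V, temperature in C)."""
    low, high = corner_voltages(vdd_nom)
    return {
        "speed": [("SS", low, 125), ("SS", low, -40)],
        "power": [("FF", high, 125)],
    }


def worst_case_fmax(fmax_by_temp):
    """Speed is reported at the worse of the two temperatures: the lower fmax."""
    return min(fmax_by_temp.values())


# Illustrative inputs only: 0.8 V nominal and the fmax values are assumptions.
print(ppa_corners(0.8)["speed"])                # [('SS', 0.72, 125), ('SS', 0.72, -40)]
print(worst_case_fmax({125: 1.9, -40: 2.1}))    # 1.9
```

Checking both temperature extremes for speed, rather than assuming the hot corner is slowest, is what "worst of 125°C and -40°C" amounts to in this sketch.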
ARM Cortex-A9 Neon PPA Results
Figure: normalized frequency (y-axis, 1 to 4) vs. normalized total power (x-axis, 1 to 9). Data points are labeled with body-bias voltage pairs (VNW,VPW): 0,0; 0,-1; 1,-2. Legend entries: "VNW,VPW= NMOS,PMOS" and "VNW,VPW= PMOS,NMOS".
ARM Cortex-A9 Neon PPA Results
• ~45% power reduction at iso-frequency, plus ~45% area reduction
• ~30% higher frequency at iso-power, plus ~45% area reduction
Figure: normalized frequency (y-axis, 1 to 4) vs. normalized total power (x-axis, 1 to 9).
ARM Cortex-A9 Neon PPA Results
• Bulk nodes need a different implementation to change the frequency/power target
• Body-bias (BB) control can change frequency vs. power with the same implementation
Figure: normalized frequency (y-axis, 1 to 4) vs. normalized total power (x-axis, 1 to 9).
Further Investigations
• Floorplanning optimizations on body-bias net routing
– Top/bottom, partial ring distributions vs. ring distribution in physical design macros
– Distribution around clusters of memory macros
• Full flow exercise
– Including rail analysis
• Full ARM Cortex-A PPA optimization
Summary
• 22FDX™ is industry's first 22nm FD-SOI platform
• Delivers ultra-low-power, FinFET-like performance at the cost-effectiveness of 28nm planar
• Full PPA optimization capabilities through Cadence implementation and sign-off tools
• 22FDX™ digital design flow is similar to bulk digital design flow
• 22FDX™ digital design flow exploits EDA techniques which have been deployed on earlier nodes
– Implant-aware, source/drain-aware, double patterning, UPF support
• Starter kit of 22FDX™ digital design flow available from GLOBALFOUNDRIES now
• 22FDX™ digital design flow has been demonstrated on ARM Cortex-A17 quad-core reference implementation with full Cadence support for FD-SOI design implementation
GLOBALFOUNDRIES: Ralf Flemming, Ingolf Lorenz, Tamer Ragheb, Joerg Winkler
Cadence: Dirk Seidler, Klaus Sigl, Cristen Decoin, Jonathan Smith
Disclaimer The information contained herein [is confidential and] is the property of GLOBALFOUNDRIES and/or its licensors. This document is for informational purposes only, is current only as of the date of publication and is subject to change by GLOBALFOUNDRIES at any time without notice. GLOBALFOUNDRIES, the GLOBALFOUNDRIES logo and combinations thereof are trademarks of GLOBALFOUNDRIES Inc. in the United States and/or other jurisdictions. Other product or service names are for identification purposes only and may be trademarks or service marks of their respective owners. © GLOBALFOUNDRIES Inc. 2016. Unless otherwise indicated, all rights reserved. Do not copy or redistribute except as expressly permitted by GLOBALFOUNDRIES.
Thank you