many voltage domains enabling an energy efficient graphics...
TRANSCRIPT
Many voltage domains enabling an energyefficient graphics processor in 14nm
Rinkle Jain, Pascal Meinerzhagen, Vivek De
October 19 2018
Radio Circuits Integration Lab, Intel Corporation
Hillsboro, OR
Power Supply on Chip Workshop 2018
Outline
State of the Art
Fine grained DVFS: Costs and benefits
Co-design with load
Summary and Conclusion
2/ 24
Fully integrated VR Technologies
The SoC Side of the Story - the FIVR
[Burton et.al APEC ’14] [Kurd et.al ISSCC ’14]
Faster state transitions by 25%, higher performance per watt
Overall idle power slashed by 20x, battery life improvement by > 50%
BOM savings helped all segments
PSoC 2018 Rinkle Jain 3 / 24
Fully integrated VR Technologies
The SoC Side of the Story - the FIVR
Segment Key Challenge What helped the mostServers Bump Imax Iin reductionLaptops Board Area, Iccmax High bandwidth/frequency
Desktops Performance High bandwidth
PSoC 2018 Rinkle Jain 3 / 24
Fully integrated VR Technologies
The Platform Perspective
PSoC 2018 Rinkle Jain 4 / 24
Fully integrated VR Technologies
The Platform Perspective
SoCs inherently require several voltage rails
Input rail consolidation simplifies power delivery significantlyPSoC 2018 Rinkle Jain 4 / 24
Fully integrated VR Technologies
Loadline improvements
Translates to battery life, area and/or performance benefits
PSoC 2018 Rinkle Jain 5 / 24
Fully integrated VR Technologies
MIM-based Switched Capacitor VR
[R.Jain et al. JSSC ’14]
[R.Jain et al. JSSC ’15]
Regulation, few ns response, low area overhead, 880mA/mm2
PSoC 2018 Rinkle Jain 6 / 24
Fully integrated VR Technologies
Proposed Control Law for Adaptive Widths
fsw < FthW = bW ′
a implies higher efficiency at W’ (W’=W/n)
fsw is a good indicator of low voltage and light load conditions
PSoC 2018 Rinkle Jain 7 / 24
Enabling DVFS
Current density & domain area-power trends
0 0.5 1 1.5 2 2.5 3 3.50
1
2
3
4
5
Load area (mm2)
Max
load
cur
rent
(A)
4A/
mm
2
2A/m
m2
1A/mm2
0.6A/mm2
SCVR 22nm
PSoC 2018 Rinkle Jain 8 / 24
Enabling DVFS
Current density & domain area-power trends
0 5 10 150
5
10
15
20
25
30
Load area (mm2)
Max
load
cur
rent
(A)
HSW
4A/m
m2
1A/mm2
2A/mm2
0.6A/mm2
1A/mm2
2A/mm2
0.6A/mm2
FIVR 22nm
PSoC 2018 Rinkle Jain 8 / 24
Enabling DVFS
Current density & domain area-power trends
0 5 10 150
5
10
15
20
25
30
Load area (mm2)
Max
load
cur
rent
(A)
HSW
4A/m
m2
1A/mm2
2A/mm2
0.6A/mm2
HSW
4A/m
m2
1A/mm2
2A/mm2
FIVR 22nm
0.6A/mm2
SCVR 22nm
PSoC 2018 Rinkle Jain 8 / 24
Enabling DVFS
Current density & domain area-power trends
22nm
22nm
o
o
PSoC 2018 Rinkle Jain 8 / 24
Enabling DVFS
Buck converter with thin-film magnetics
H. Krishnamurthy et al. JSSC 2017
PSoC 2018 Rinkle Jain 9 / 24
Enabling DVFS
Current density & domain area-power trends
22nm
22nm
o
o
o
PSoC 2018 Rinkle Jain 10 / 24
14nm Graphics ProcessorI/O
I/O
GPU
Paging Cache
SS2SS1
SS0
EUs
PLL
EUs
8.0 x 8.0 mm2
Sampler
L2$Sam
plerL2$
Sampler
L2$
GPU
Interface
FFs, Cmd Stream
er, L3 Data $
Paging $ Ctrl
Paging $
TLB
Pwr Ctrl
Config
2 MB
Clk Sync
PLLPLL
CLKREF
System Agent
Gen9LP GPU
Bus Interface
& Ctrl
PCIe
FPGA
14nm Prototype
IVR D
ebug & I/O
SS0SS1
SS2÷ JTAG Ctrl JTAG
CLKGPUCKLSAJTAG Config & Debug
2 MB
PCI Bridge
Host PC
GPU data
Pwr
Cfg
New EUs
New EUs
EU EU EU
EU EU EU
New execution unit (EU) has a dedicated integrated voltage regulator
PSoC 2018 Rinkle Jain 11 / 24
Enabling DVFS
Switched Capacitor Voltage RegulatorSwitched Capacitor VR (SCVR)
APR friendly design, 6 distributed tiles with area overhead of 1.1% (680umx520um)
PSoC 2018 Rinkle Jain 12 / 24
Basic Power Gate (PG)
RF
RF
RF
RF
PG
VIN
VOUT
PG
Modified EU: Digital LDO (DLDO)
RF
RF
RF
RF
PGPG Driver
Ctrl Controller
DLDO PG
Driver
PPGa
PPGd
SPG
VIN
VOUT
PG
VUDDAC
Digital PGDriver
PPG
{ {Area overhead: 0.5%
Digital LDO demonstrated in S.Kim et al. JSSC ’15
PSoC 2018 Rinkle Jain 13 / 24
Distributed Implementation
Adaptive Width Simulations
Fast and stable adaptation
PSoC 2018 Rinkle Jain 14 / 24
Distributed Implementation
Adaptive Width Simulations
Fast and stable adaptation
PSoC 2018 Rinkle Jain 14 / 24
Distributed Implementation
Adaptive Width Simulations
Fast and stable adaptation
PSoC 2018 Rinkle Jain 14 / 24
EU Other ...
VGPU
GPU
IVR
VIN
EU Other ...
VEU
GPU
IVRVOther
IVR
VIN
LOWLOW
HIGHEUOther
DEMAND
FGPU
VGPU
PLL PLL
LOW HIGH LOW
VLOW
VHIGH
Slow Fast
Shared voltage railAll blocks at high V/F PLL re-lock time
LOWLOW
HIGHEUOther
DEMAND
FEUFOther
LOW HIGH LOW
Slow Fast
VOther
VEU
VLOW
Dedicated railsFast EU clock No PLL re-lock
Fine grained DVFS
PSoC 2018 Rinkle Jain 15 / 24
EUCLK2X/CLK1X
OtherCLK1X
BYPASS/IVR IVR
VIN
GPU
...
SYNC
10
CLK2X
CLKEU
EU Turbo
÷2
Level Shifter
...
CLK1X
÷2
EU Turbo
EUs at 2X clock (CLK2X) as needed
Sync logic at 1X/2X clock boundaries
IVR for fast turbo/un-turbo transitions
PSoC 2018 Rinkle Jain 16 / 24
LOWLOW
STALLEUOther
DEMAND
FEUFOther
LOW
VLOW
Slow Gated
VGPU
LOWLOW
STALLEUOther
DEMANDLOW
Slow Gated
VRET
FEUFOther
VLOW
VOther
VEU
VLOW
EU Other
CLAMP IVR
VIN
GPU
...
VEU VOther
Retentive Sleep
EU Other
VIN
GPU
...
VGPU IVR
PSoC 2018 Rinkle Jain 17 / 24
Independent V/F Domains at SoC Level
SCVR
GPULow
Demand
CPUHigh
Demand
VIN
SCVR BYPASS
Energy Savings at Iso-Performance: Max 77%, avg 62%
05
1015202530354045
0 1 2 3 4 5 6 7 8G
PU E
nerg
y/O
pera
tion
[µJ]
Performance: Ops/Sec [Normalized]
SCVR 2:1
25℃
Max 58%
Avg 33%
PSoC 2018 Rinkle Jain 18 / 24
EU Turbo Performance Gain
EUCLK1X/CLK2X
OtherCLK1X
DLDO/BYPASS DLDO
VIN
GPU
SYNC
EU Turbo
EUCLK1X
OtherCLK1X
VIN
GPU
Baseline: Shared V/F
Ideal IVR
5
10
15
20
25
30
0 1 2 3 4 5 6 7 8
Ener
gy/O
pera
tion
[µJ]
Performance: Ops/Sec [Normalized]
100% EU Utilization, 12 EUs, 70℃
Energy Savings at Iso-Performance: Max 32%, avg 29%
PSoC 2018 Rinkle Jain 19 / 24
Retentive Sleep: Leakage Savings
02468
1012141618
0 0.25 0.5 0.75 1
I SS(6
EU
s) [m
A]
VOUT [V]
25℃ VRET≈0.5V
VIN=0.8V
VIN=1.15V
0.0%
0.5%
1.0%
1.5%
2.0%
2.5%
3.0%
EU2 EU3 EU4 EU5 EU6
EU V
MIN
Redu
ctio
n [%
]
25MHz 100MHz
PSoC 2018 Rinkle Jain 20 / 24
ConclusionsComplete 14nm GPU features energy efficiency techniques
Hybrid IVR (DLDO + SCVR) for independent V/F domains
• Within GPU: EU turbo provides up to 68% performance gain, 50% EU area reduction, or 32% energy savings
• SoC Level: SCVR enables up to 77% GPU energy savings
Additional DLDO usages
• Retentive sleep: Reduces EU leakage by 63%
• VMIN per block: Up to 2.4% EU VMIN reduction
PSoC 2018 Rinkle Jain 21 / 24
Summary and Conclusion
Summary
Power delivery: VRs inside SoC is fully integrated(>100MHz, >2:1)
Buck for mainstream SoC, SC for relatively small rails
Reduced area => less L, C; worse interconnects with every node; efficiencyreduces
Need: Better passives utilization, current density, die stacking
Power management: Inductor less approach for down-scalability
Distributed, APR-friendly designs desirable for SoC integration and wideadoption
Need down-scalable high current density/die stacking for point-of-loadVRs!
Need battery/source-to-SoC VR integration(3-5 MHz bandwidth,discrete passives)
PSoC 2018 Rinkle Jain 22 / 24
Summary and Conclusion
Summary
Power delivery: VRs inside SoC is fully integrated(>100MHz, >2:1)
Buck for mainstream SoC, SC for relatively small rails
Reduced area => less L, C; worse interconnects with every node; efficiencyreduces
Need: Better passives utilization, current density, die stacking
Power management: Inductor less approach for down-scalability
Distributed, APR-friendly designs desirable for SoC integration and wideadoption
Need down-scalable high current density/die stacking for point-of-loadVRs!
Need battery/source-to-SoC VR integration(3-5 MHz bandwidth,discrete passives)
PSoC 2018 Rinkle Jain 22 / 24
Summary and Conclusion
Summary
Power delivery: VRs inside SoC is fully integrated(>100MHz, >2:1)
Buck for mainstream SoC, SC for relatively small rails
Reduced area => less L, C; worse interconnects with every node; efficiencyreduces
Need: Better passives utilization, current density, die stacking
Power management: Inductor less approach for down-scalability
Distributed, APR-friendly designs desirable for SoC integration and wideadoption
Need down-scalable high current density/die stacking for point-of-loadVRs!
Need battery/source-to-SoC VR integration(3-5 MHz bandwidth,discrete passives)
PSoC 2018 Rinkle Jain 22 / 24
Summary and Conclusion
Summary
Power delivery: VRs inside SoC is fully integrated(>100MHz, >2:1)
Buck for mainstream SoC, SC for relatively small rails
Reduced area => less L, C; worse interconnects with every node; efficiencyreduces
Need: Better passives utilization, current density, die stacking
Power management: Inductor less approach for down-scalability
Distributed, APR-friendly designs desirable for SoC integration and wideadoption
Need down-scalable high current density/die stacking for point-of-loadVRs!
Need battery/source-to-SoC VR integration(3-5 MHz bandwidth,discrete passives)
PSoC 2018 Rinkle Jain 22 / 24
Summary and Conclusion
Summary
Power delivery: VRs inside SoC is fully integrated(>100MHz, >2:1)
Buck for mainstream SoC, SC for relatively small rails
Reduced area => less L, C; worse interconnects with every node; efficiencyreduces
Need: Better passives utilization, current density, die stacking
Power management: Inductor less approach for down-scalability
Distributed, APR-friendly designs desirable for SoC integration and wideadoption
Need down-scalable high current density/die stacking for point-of-loadVRs!
Need battery/source-to-SoC VR integration(3-5 MHz bandwidth,discrete passives)
PSoC 2018 Rinkle Jain 22 / 24
Summary and Conclusion
Summary
Power delivery: VRs inside SoC is fully integrated(>100MHz, >2:1)
Buck for mainstream SoC, SC for relatively small rails
Reduced area => less L, C; worse interconnects with every node; efficiencyreduces
Need: Better passives utilization, current density, die stacking
Power management: Inductor less approach for down-scalability
Distributed, APR-friendly designs desirable for SoC integration and wideadoption
Need down-scalable high current density/die stacking for point-of-loadVRs!
Need battery/source-to-SoC VR integration(3-5 MHz bandwidth,discrete passives)
PSoC 2018 Rinkle Jain 22 / 24
Summary and Conclusion
Summary
Power delivery: VRs inside SoC is fully integrated(>100MHz, >2:1)
Buck for mainstream SoC, SC for relatively small rails
Reduced area => less L, C; worse interconnects with every node; efficiencyreduces
Need: Better passives utilization, current density, die stacking
Power management: Inductor less approach for down-scalability
Distributed, APR-friendly designs desirable for SoC integration and wideadoption
Need down-scalable high current density/die stacking for point-of-loadVRs!
Need battery/source-to-SoC VR integration(3-5 MHz bandwidth,discrete passives)
PSoC 2018 Rinkle Jain 22 / 24
Thank you for your attention!
PSoC 2018 Rinkle Jain 23 / 24
Acknowledgement
Vaibhav Vaidya
Tri Hyunh
Chung-Ching Peng
George Matthew
Carlos tokunaga
Don Gardener
Kaladhar Radhakrishnan
Paul Fischer
Kevin O’brien
Muhammad Khellah
Jim Tschanz
Ravi Mahajan
Jonathan Douglas
Takao Oshita
This research was, in part, funded by the U.S. Government (DARPA). The views and conclusions
contained in this document are those of the authors and should not be interpreted as representing
the official policies, either expressed or implied, of the U.S. Government.
PSoC 2018 Rinkle Jain 24 / 24