carl ramey hotchips 32 silicon photonics for artificial ......analog interfaces to photonics core...
TRANSCRIPT
![Page 1: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/1.jpg)
Silicon Photonics for Artificial Intelligence AccelerationHotChips 32Carl Ramey
1
![Page 2: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/2.jpg)
A photonics compute platform to address AI demands
2
❏ Lightmatter’s Mars device accelerates AI workloads❏ Core computations performed optically using silicon photonics❏ Multi-chip solution to get the best of transistor and photonics technology❏ Photonics provides unprecedented opportunities for performance and efficiency
![Page 3: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/3.jpg)
Section 1 Motivation
33
![Page 4: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/4.jpg)
Transistors aren’t getting more efficient
4
“Moore’s law made CPUs 300xfaster than in 1990...but it’s over.”
Bill DallyNvidia, Stanford
“Moore’s Law lasted for half a century and changed the world as
Moore predicted in his 1965 article, but it has ended.”
David PattersonGoogle, BerkeleyTuring Prize Winner
Physical Limit @ 300K
Source: https://github.com/karlrupp/microprocessor-trend-data
![Page 5: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/5.jpg)
⇒Chips are getting hotter
5
400 Watts per chip package is a practical limit
Dark SiliconYou can’t use the whole chip at once
Clock Frequency SaturationPerformance per core is limited
Exotic Cooling SolutionsExpensive
![Page 6: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/6.jpg)
6Source: OpenAI
Electronics won’t keep up
AI growth rate 5x Moore’s Law
![Page 7: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/7.jpg)
Section 2Photonics Core
77
![Page 8: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/8.jpg)
8
Data In(light)
Data Out(light)
Optical data transportLess energy spent moving data
Optical tensor coreHigher clock frequency, less energy, same size, lower latency
Parallel ProcessingWavelength- and polarization-division multiplexing. N processors, 1 physical resource.
Why photonics?Fundamental benefits independent of process node
LatencyBandwidthPowerArea
100 ps20 GHz1 uW2500um2
100 ns2 GHz1 mW2500um2
Photonics ElectronicsPhotonic Compute Unit Cell
Improvement
103
10103
1
Systolic Array
![Page 9: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/9.jpg)
9
Optical data transportLess energy spent moving data
10s of Watts just for passive data transport with electronics.~Free with optics...and no length-dependent RC time constant!
2 cm
CHIP
Data Flow
![Page 10: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/10.jpg)
10
Mach Zehnder Interferometer (MZI)Compute tile
MZIs provide observation of phase shift
through interference
These become useful when you modulate the
phase on Φ1 and Φ2
Waveguide
Electric Field
![Page 11: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/11.jpg)
➔ Need an electrically-controlled phase shift for MZI. Some examples...◆ Thermal (slow, consumes static power)◆ P/N junction (fast but large and lossy)◆ Mechanical!
➔ Mars uses Nano Optical Electro Mechanical System (NOEMS)◆ Waveguide bends with electrostatic charge◆ Capacitance of the NOEMS stores the phase shift setting◆ Low loss◆ Actuation speed is 100’s of MHz
11
Programmable Phase Shifters
![Page 12: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/12.jpg)
12
Optical Vector MAC Compute tile
Directional couplers combine input signals in pairs
Phase shifters provide programmability
OV MAC computes a 2x2 matrix multiplied by a 1x2 vector
Directional Coupler
Phase Shifter
We just did a 2x2 MVP at the speed of light with near zero power!No RC delays, no dynamic power.
![Page 13: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/13.jpg)
13
Arrays of MZIs
Arrays of 2x2 MZIs provide matrix • vector products
M0,0
MN,N
Vin0
VinN
Vout0
VoutN
...
...... ...
![Page 14: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/14.jpg)
14
Data Conversion at Edges of Square
DACs encode input vectors
Detectors and ADC convert output vector to digital
❏ Mars: 64 DACs and 64 ADCs for 4096 MAC operations
❏ Performance scales with area ← like electronics
❏ Power scales with sqrt(area) ← this is interesting!
![Page 15: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/15.jpg)
15Parallel ProcessingWavelength- and polarization-division multiplexing. N processors, 1 physical resource.
1 Processor ⇒ ‘N’ Parallel Instances
RxRxRxRxRxTx
.
.
.
.
.
.
.
.
.
.
.
.
15
![Page 16: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/16.jpg)
16
❏ 64x64 Matrix * 64 element vector
❏ 8K ops per “cycle”
❏ 1GHz vector rate
❏ 50mW laser
❏ 8-bit signed operands
❏ 200ps latency
❏ 90nm standard photonics process
❏ 150mm2
Mars Photonics Core
![Page 17: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/17.jpg)
Section 3Digital System
1717
![Page 18: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/18.jpg)
18
Mars SoC
❏ 14nm custom ASIC
❏ 50mm2
❏ Analog interfaces to Photonics Core
❏ SRAM for weights and activations
❏ I/O interfaces
❏ Digital offloads:
❏ Non-linearities (GELU, sigmoid etc.)
❏ Scaling and accumulation
![Page 19: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/19.jpg)
Digital Architecture
❏ Buffer organization minimizes data movement
❏ Heterogeneous on chip networks
❏ Single fully synchronous pipeline scheduler
19
![Page 20: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/20.jpg)
20
Activation Pipeline
❏ Digital pipeline maintains operand locality
❏ SRAM access minimized
❏ True pipelining for small batch performance
❏ Photonics outputs physically close to inputs
![Page 21: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/21.jpg)
21
Weight Updates
❏ Weights stored near MZI as charge
❏ Static power near zero
❏ Computational dynamic power near zero
❏ Dynamic power for weight update comparable to
electronics
❏ Batching saves SRAM/transport power
Only pay for data conversion power, not compute
![Page 22: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/22.jpg)
22
Power and Performance
Under 3W TDP
Most power is data movement
ResNet-50 @ 99% accuracy(ImageNet, compared to FP32)
![Page 23: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/23.jpg)
23
3D Integration
❏ Low power compute allows stacking
❏ Stacking reduces data movement
❏ SRAM is closer to compute
14nmASIC
90nmOptical core
![Page 24: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/24.jpg)
Putting it all together → Software
Host:
➔ ML Framework➔ Compiler
Mars Accelerator
PCIe
Neural network model optimization and deployment
24
Provides Services for...
![Page 25: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/25.jpg)
General-purpose AI inference acceleration. Photonics provides core compute.
ImageRecognition
SentimentAnalysis
Machine Translation
Recommendation
Mars Accelerator
25
![Page 26: Carl Ramey HotChips 32 Silicon Photonics for Artificial ......Analog interfaces to Photonics Core SRAM for weights and activations I/O interfaces Digital offloads: Non-linearities](https://reader036.vdocuments.us/reader036/viewer/2022070211/60fdf0fa08e77956b822df11/html5/thumbnails/26.jpg)
Summary
26
❏ Optical computing is here - and focused on AI❏ Lightmatter’s Mars chip leverages photonics for compute, electronics for activation and I/O❏ 3D stacking brings weights and activations closer to the compute core❏ Freedom in power budget allows larger devices and more SRAM