AISTECS 2019
Emerging Silicon Nanophotonic Networks: Time to Bridge the Gap with System Designers
Davide Bertozzi, University of Ferrara (Italy) - Temporary Guest Scientist at IHP Microelectronics (Germany)
• Evolution of the top 10 in the last six years:
• Average total compute power: 0.86 PFlops → 21 PFlops (~24x increase)
• Average node compute power: 31 GFlops → 600 GFlops (~19x increase)
• Average number of nodes: 28k → 35k (~1.3x increase)
Node compute power is the main contributor to performance growth.
Node compute power may keep scaling thanks to customization.
[Chart: averages of top 10 systems, relative to 2010 — 24x, 19x, 1.3x]
[S. Rumley, et al. Optical Interconnects for Extreme Scale Computing Systems, Journal of Parallel Computing, pp.65-80, 2017]
Trends in Extreme HPC
«Like the 1980s, a great time for architects!» (John L. Hennessy & David A. Patterson, Turing Lecture, ISCA 2018)
• Top 10 average node-level evolution:
• Average node compute power: 31 GFlops → 600 GFlops (~19x increase); number of nodes: ~1.3x; total compute power: ~24x
• Average bandwidth available per node: 2.7 GB/s → 7.8 GB/s (~3.2x increase)
• Average byte-per-flop ratio: 0.06 B/Flop → 0.01 B/Flop (~6x decrease); Sunway TaihuLight (#1) shows 0.004 B/Flop!
The growing gap in interconnect bandwidth may prevent aggregate execution performance from keeping up with the available compute power!
[Chart: averages of top 10 systems, relative to 2010 — 19x, 3.2x, 0.17x]
Trends in Extreme HPC
[S. Rumley, et al. Optical Interconnects for Extreme Scale Computing Systems, Journal of Parallel Computing, pp.65-80, 2017]
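A quick sanity check of the byte-per-flop trend above, as a sketch using the top-10 averages exactly as reported on the slides:

```python
# Sanity check of the byte-per-flop trend quoted above
# (top-10 averages as reported on the slides).
bpf_2010 = 0.06      # B/Flop, top-10 average, 2010
bpf_now = 0.01       # B/Flop, top-10 average, ~2016
taihulight = 0.004   # B/Flop, Sunway TaihuLight (#1)

print(f"average decrease: {bpf_2010 / bpf_now:.0f}x")     # 6x
print(f"relative to 2010: {bpf_now / bpf_2010:.2f}x")     # 0.17x, as charted
print(f"TaihuLight sits {bpf_2010 / taihulight:.0f}x below the 2010 average")
```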
What about Connectivity?
Interconnect Power Concern
[Charts: interconnect vs. compute energy — source: W. Dally; data from 28nm NVIDIA chips — source: S. Borkar]
Computation will be relatively inexpensive in energy terms compared with communication.
Bandwidth must be increased within tighter and tighter power budgets.
[Figure: a multi-core SoC built around a network-on-chip — CPU, accelerator, DSP, DMA, MPEG and Ethernet blocks, SRAMs, network interfaces (NI), and a fabric of switches]
Surprisingly, criticalities are showing up even at the lowest layer of the hierarchy (chip-scale communication).
EMERGING Network-on-Chip CRITICALITIES:
• Latency sensitivity of the multi-hop fabric
• Bandwidth criticalities for future kilo-core chips
• The power overhead of moving bits around
• Non-seamless scaling to off-chip communication
The Communication Hierarchy
Courtesy of K.Bergman
WE NEED A GAME CHANGER!
A lot of work is going on at the upper layers of the interconnection hierarchy: PCIe, GEN-Z, OpenCAPI, CCIX, Ethernet, InfiniBand, ...
[Figure: node-to-node path — short-distance node-router links use electrical transceivers; short-distance router-router links use electrical links or VCSEL-based optical technology; the long-distance router-router link uses optical transceivers]
Silicon Photonics: Game Changer?
Silicon photonics uses co-integration techniques of optical components and/or transceivers with a standard CMOS manufacturing process.
Silicon Photonics: Game Changer?
Silicon photonics is delivering integrated optical transceivers and holds the promise of bringing optical communications closer to, and deeper into, the processing node.
[Figure: integrated optical transceivers replace conventional hop-by-hop data movement across node, router and core with flattened end-to-end data movement]
Courtesy of K.Bergman
Key enabler for new paradigms: disaggregated architectures.
Requirements for that to happen:
• Divide cost by at least 1.5 orders of magnitude
• Improve energy efficiency by at least one order of magnitude
• Efficient integration solutions with electronics
• Improve the system-ability of the technology
Improving Technology Maturity & Architecture and System-Level Design
The gap between system-level designers and technology developers is huge!
• Architecture design points stem directly from designers' intuition
• Descriptive information at different abstraction layers is mixed
• Designs are difficult to compare with one another
• The application of well-known optimization techniques is difficult
• No consistent methodologies to explore the design space
• Most of the design space is still largely unknown
Mind the Gap
Golden age of ONoC assessment (~2008-2012): estimated power savings with nanophotonic networks.
Early-stage ONoC analysis: inflated expectations.
[Figure: Gartner Hype Cycle, marking TODAY and the 2008-2012 assessment wave, with an example of the optical parameters used in early-stage analyses]
How do we turn the trough of disillusionment into a slope of enlightenment?
Mind the Gap
Goal:
Bridge the gap between developers of emerging devices and circuit & system designers, thus coupling emerging interconnect technologies and architectures with digital systems and working out novel system-level design concepts.
Focus:
• Photonically-integrated chip-scale parallel computing
• Their coupling with off-chip memory sub-systems
Methodology:
• Addressing the horizontal integration gap
• Addressing the vertical integration gap
A Framework to Bridge the Gap
[Figure: target platform — processor(s), cache hierarchy, ENoC, optical network, DRAM, GPU]
BACKGROUND
OPTICAL NETWORKS‐ON‐CHIP
[Figure: optical NoC initiator and target. At the initiator, an electrical signal drives a 4-stage modulator that imprints the data onto a wavelength-division multiplexed input signal. At the target, each of the four wavelengths is dropped onto a photodetector (PD) followed by a TIA and a comparator that recover the electrical bitstream]
Wavelength‐Routed Optical NoCs
[Figure: 4x4 λ-router — wavelengths λ1..λ4 statically route each initiator I1..I4 to a distinct output O1..O4]
Main feature: static allocation of channels to source‐destination pairs
• The topology must avoid interference between same-wavelength carriers
• No time spent in routing and arbitration
• All-optical interconnect solution
• Performance predictability
• All-to-all communications can take place concurrently
• Hard to scale to a large number of cores
Better topologies exist that reuse the same set of 4 wavelengths across all initiators.
(Naive) Non-blocking Crossbar
[Figure: naive non-blocking crossbar — every initiator I1..I4 sends all 4 wavelengths toward each output O1..O4]
State‐of‐the‐art «Snake» topology
Static Power Overhead
A major source of overhead in optical NoCs is static power, dominated by the laser sources and thermal tuning.
Per-device insertion losses:
• Passing by an off-resonance ring: 0.005 dB each
• Waveguide crossing: 0.05 dB each
• Propagation loss: 0.274 dB/cm
[Figure: example path — a 1 dBm signal drops to 0.763 dBm after 0.1 cm of waveguide, ring pass-bys and crossings; the photodetector sensitivity sets the floor of the budget]
Insertion loss (and hence the laser power requirement) depends on the connectivity pattern.
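As a small illustration of how these per-device figures compose into a laser-power requirement, here is a minimal worst-case budget sketch; the path composition and the photodetector sensitivity are hypothetical example values, not figures from the talk:

```python
# Minimal worst-case insertion-loss / laser-power budget using the
# per-device loss figures above. The path composition and the
# photodetector sensitivity (-17 dBm) are hypothetical examples.
RING_PASSBY_DB = 0.005         # dB per off-resonance ring passed
CROSSING_DB = 0.05             # dB per waveguide crossing
PROPAGATION_DB_PER_CM = 0.274  # dB/cm of waveguide

def insertion_loss_db(rings_passed, crossings, length_cm):
    return (rings_passed * RING_PASSBY_DB
            + crossings * CROSSING_DB
            + length_cm * PROPAGATION_DB_PER_CM)

loss = insertion_loss_db(rings_passed=20, crossings=10, length_cm=1.0)
pd_sensitivity_dbm = -17.0            # hypothetical receiver floor
laser_dbm = pd_sensitivity_dbm + loss # minimum launch power per channel
print(f"path loss {loss:.3f} dB -> laser >= {laser_dbm:.2f} dBm per channel")
```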
Horizontal Integration Challenge
[Figure: target architecture — an array of off-chip CW lasers (λ1..λ4) feeds the photonic layer, stacked below the electronic layer and connected to it through TSVs; clusters of processor cores attach via hubs (H1..H4), and gateways reach the off-chip memories (M1..M4)]
Target Architecture
Solutions such as 3D or 2.5D integration allow the electronic and photonic processes to be separated, opening the door to fully dedicated process optimization for the photonic die.
System View
[Figure: hierarchical system view (source: IBM) — four local domains, each a 2x2 grid of electronic switches (Eswitch), interconnected at the top level]
System View
• Data rate adaptation
• (De-)serialization
• Flow control
• Clock resynchronization
• Message-dependent deadlock avoidance
Not Just E/O and O/E Converters, but an Architecture Integration Challenge
[Figure: two local domains of Eswitches coupled through the photonic path — modulator + driver on the transmit side, photodetector (PD) + TIA on the receive side (source: SSSA Pisa); data/valid/stall signaling at the domain boundaries]
Architecture Integration Challenge
1) Data rate adaptation
Clock speed [0.5 - 3] GHz vs. modulation rate >= 10 GHz
2) (De-)Serialization — Architecture Integration Challenge
32/64/128/256-bit parallel words ↔ serial optical bitstream
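To make the rate mismatch concrete, a back-of-the-envelope sketch (the flit width and clock values are illustrative):

```python
# Illustrative rate-matching arithmetic for the (de-)serialization step:
# a W-bit flit stream at the network-interface clock must fit onto the
# serial optical channel(s) at the modulation rate.
def channels_needed(flit_bits, ni_clock_hz, modulation_rate_bps):
    parallel_bps = flit_bits * ni_clock_hz   # offered load from the ENoC
    return parallel_bps / modulation_rate_bps

# 32-bit flits at 1 GHz against a single 10 Gbit/s channel:
print(channels_needed(32, 1e9, 10e9))   # 3.2 -> more lanes or backpressure
```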
3) Flow control — Architecture Integration Challenge
Buffer size is a function of the round-trip time for full-throughput operation.
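A minimal sketch of the buffer-sizing rule implied here, under the standard credit-based flow-control model (the cycle counts are illustrative, not from the talk):

```python
import math

# Minimum buffer depth for full throughput under credit-based flow
# control: the buffer must cover the full round trip (data forward +
# credit return), assuming one flit per cycle at peak rate.
def min_buffer_flits(forward_cycles, credit_return_cycles,
                     flits_per_cycle=1.0):
    rtt = forward_cycles + credit_return_cycles
    return math.ceil(rtt * flits_per_cycle)

print(min_buffer_flits(forward_cycles=6, credit_return_cycles=6))  # 12
```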
4) Clock resynchronization — Architecture Integration Challenge
The received data stream (>= 10 GHz) must be brought into the local clock domain despite an unknown phase offset Δ between data and local clock.
5) Message-dependent deadlock avoidance — Architecture Integration Challenge
BRIDGE
The bridge is a complex block that takes care of key functional tasks for correct architecture operation; it is built on top of a multi-technology platform and supports GHz-range signaling rates.
Bridge Configuration
One of the key challenges is overcoming the inherently serial nature of optical communications. Options:
• Increase the signaling rate of the optical channels
• Increase the bit-level parallelism (WDM)
• A combination thereof
Research Goal: Explore and Characterize the Configuration Space of the Bridge
Pay attention: CMOS cannot achieve arbitrary speeds!
This has implications for the SerDes architecture, hence for the performance-power trade-off of the bridge.
Technology Partitioning
16 3D-stacked computation clusters, 16x16 optical NoC (ONoC).
In static-power-dominated technologies like silicon photonics, operation at high transmission rates may become a priority to cut down on pJ/bit. Better-performing technologies than CMOS may therefore be required in the back-end of the bridge.
Bridge partitioning: Optics | CMOS | BiCMOS
Our assumption: IHP 130nm SiGe BiCMOS (SG13S)
- fT / fmax = 250 GHz / 340 GHz
- 3.3V I/O CMOS, 1.2V logic CMOS
- 5 thin metal layers, 2 thick ones
Target logic family:
- 2.5V-compatible ECL
- A cell library provides standard-cell gates
- Logic synthesis from HDL enabled (Synopsys DC)
Similar technologies provide monolithic integration of optical components with the BiCMOS process.
Bridge Architecture (Gateway)
[Figure: gateway block diagram. Transmitter side: from the ENoC, a mesochronous synchronizer and VC decoder feed 1x3 demuxes into DC-FIFOs (request/reply), 1x15 routing, 3x1 muxes with arbiters, a serializer (SER) clocked by a PLL, and drivers + modulators on wavelengths λ11..λ13 plus a clock carrier λ1c. Receiver side: per-wavelength photodetector (PD) + TIA + comparator chains and a clock divider on the received clock carrier feed a deserializer (DESER), 1x3 demuxes into DC-FIFOs, credit counters 1..15, and a (15x2)x1 mux with arbiter toward the ENoC; filters select the wavelengths]
Bridge Architecture - Transmitter Side
One transmission module for each target (15 of them in a 16x16 ONoC). Optimization: only one set of buffers for all destinations.
[Figure: transmitter datapath — VC decoder, 1x3 demux, DC-FIFOs (req/reply), 1x15 routing, 3x1 mux with arbiter, serializer (SER) + PLL, drivers and modulators λ15_1..λ15_3 plus clock carrier λ15c]
1 GHz network interface; interface frequency = f(modulation rate), e.g., 10 Gbps.
One virtual channel for each message class, to avoid (message-dependent) deadlock.
Bit-level parallelism. Source-synchronous communication.
Bridge Architecture - Receiver Side
One receiver module for each transmitter (15 of them in a 16x16 ONoC).
[Figure: receiver datapath — per-wavelength PD + TIA + comparator chains (data carriers λ15_1..λ15_3 plus clock carrier λ15c), clock divider, deserializer (DESER), 1x3 demux into DC-FIFOs (req/reply), (15x2)x1 mux with arbiter]
1 GHz network interface; interface frequency = f(transmission frequency); modulation rate e.g. 10 Gbps.
Source-synchronous communication.
Flow Control
[Figure: the gateway datapath reused for credit return — credit counters 1..15, DC-FIFOs, mesochronous synchronizer]
Credit-based flow control:
- Reuses the datapath
- Exploits the low dynamic power of ONoCs
- No round-trip timing assumptions
A flit can fire only if credits are available.
The 2:1 mux cell is the main building block.
[Figure: master D-latch and slave D-latch feeding a 2:1 mux — input data clocked at f, output data at 2f]
Transmission frequency = twice the input clock → lower PLL frequency.
PERFECT BINARY TREE STRUCTURE
• M = log2(N) stages, each working at half the speed of the next
• The number of building blocks per stage is inversely proportional to its operating frequency → energy savings
• No need for additional selectors
Serializer Architecture
[Figure: 16:1 binary-tree serializer — fifteen 2x1 mux cells in four stages clocked at f/8, f/4, f/2 and f by a ÷2 divider chain; N-bit input data enters at f/8, output data leaves at 2f]
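A small model of the tree may help here; it tabulates, per stage, the number of 2:1 cells and the stage clock. This is a sketch assuming each 2:1 cell emits two bits per clock period (both phases), as with the master/slave latch cell above:

```python
import math

# Stage plan for an N:1 binary-tree serializer: log2(N) stages, each
# clocked at twice the previous one, with the number of 2:1 mux cells
# halving toward the output. Assumes the final cell emits two bits per
# clock (both phases), so output bit rate = 2 x final stage clock.
def serializer_plan(n_bits, output_bit_rate_hz):
    stages = int(math.log2(n_bits))
    final_clock = output_bit_rate_hz / 2
    for s in range(stages):
        cells = n_bits >> (s + 1)                   # 8, 4, 2, 1 for N = 16
        clock = final_clock / 2 ** (stages - 1 - s) # f/8, f/4, f/2, f
        yield s + 1, cells, clock

for stage, cells, clock in serializer_plan(16, 10e9):
    print(f"stage {stage}: {cells:2d} cells @ {clock / 1e9:.3f} GHz")
```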
Flexibility
This architecture is very flexible and can easily span a wide bridge configuration space:
• More parallelism: remove stages from the right
• Scale up: add more stages to the left
[Figure: the same 16:1 tree, truncated or extended]
[Figure: 32:1 serializer variant for 2-bit optical parallelism — the 2:1 mux tree is split so that a 16x2 mux stage feeds two output data streams at frequency f, driven by the shared ÷2 divider chain from the input clock]
[Figure: complete gateway with its clock domains. Transmit path (f0 ENoC domain): VC decoder, 1x3 demux, 30 DC-FIFOs (request/reply per target), 1x15 routing, 30x1 mux with arbiter and credit counters, 32:1 binary-tree serializer (clocks clk1..clk5 from a PLL and ÷2 dividers), driver and modulator fed by the laser source. Receive path (f1 domain, recovered from the clock carrier via PD + TIA + comparator and ÷2 dividers, f1..f1/16): 1x32 binary-tree deserializer, VC-ID decoding, 1x3 demux into DC-FIFOs 1..15, credits returned to the transmitter, mesochronous synchronizer toward the ENoC]
[Figure: 8:1 mux (seven 2x1 mux cells across stages at f/4, f/2 and f; input data at f/8, output data at f) and its dual 1x8 demux (seven 1x2 demux cells across stages at f/2, f/4 and f/8; input data at f, output data at f/8)]
Experimental Results
Bridge front-end architecture: (de-)serialization + transceivers (CMOS, two process nodes: bulk 40nm or 28nm FD-SOI; ECL: 130nm), plus opto-electronics + ONoC.
Partitioning options arise from the multi-stage nature of the serializer (DP: design point; CMOS 40nm / FD-SOI 28nm / ECL 130nm).
• Target data rate per source-destination connection @25 Gbit/s: design points DP1..DP5, stage periods 1.28, 0.64, 0.32, 0.16, 0.08 ns
• Target data rate per source-destination connection @40 Gbit/s: design points DP1..DP4, stage periods 0.8, 0.4, 0.2, 0.1, 0.05 ns
Experimental Results
[Chart: energy-per-bit (pJ/bit, 0-7) across design points DP1..DP5 for the 130nm ECL + CMOS 40nm and 130nm ECL + 28nm FD-SOI partitionings. At 25 Gbit/s aggregate: 1 channel x 25 Gbit/s (not feasible fully-CMOS), 2 channels x 12.5 Gbit/s, 4 channels x 6.25 Gbit/s. At 40 Gbit/s aggregate: 1 channel x 40 Gbit/s (not feasible fully-CMOS), 2 channels x 20 Gbit/s, 4 channels x 10 Gbit/s. Fully-CMOS design points are annotated with savings of -31% and -84% relative to hybrid CMOS-ECL ones]
Experimental Results
[Figure: the complete gateway schematic, replicated for the TX/RX instances attached to the 16x16 λ-router]
16x16 λ-Router Topology: 16mm x 16mm optical layer
[Chart: energy-per-bit (pJ/bit, 0-12), broken down into bridge CMOS part, bridge ECL part, thermal tuning, and Tx-Rx-laser, for 1/2/4-bit parallelism at 25 and 40 Gbit/s]
@25 Gbit/s: 1 channel x 25 Gbit/s — hybrid, 10.94 pJ/bit; 2 channels x 12.5 Gbit/s — fully-CMOS, 1.6 pJ/bit; 4 channels x 6.25 Gbit/s — fully-CMOS, 1.23 pJ/bit (-88.7%).
@40 Gbit/s: 1 channel x 40 Gbit/s — hybrid, 8.69 pJ/bit; 2 channels x 20 Gbit/s — hybrid, 8.31 pJ/bit; 4 channels x 10 Gbit/s — fully-CMOS, 1.15 pJ/bit (-85.4%).
Experimental Results (100% bandwidth utilization)
Energy efficiencies in the ballpark of 1 to 2 pJ/bit are possible with more WDM channels, a trend that higher signaling speeds exacerbate.
Experimental Results

| | TSVs (25G / 40G) | Laser sources (25G / 40G) | Total power [W] (25G / 40G) | SNR (25G / 40G) |
| 1 channel | 1216 / 2176 | 32 / 32 | 65.6 / 83.4 | 16.2 / 16.14 |
| 2 channels | 1216 / 2176 | 48 / 48 | 7.4 / 79.7 | 13.13 / 13.1 |
| 4 channels | 2176 / 2176 | 80 / 80 | 9.6 / 11.01 | 8.8 / 8.8 |

Manufacturing requirements (R = [Rmin, Rstep, Rmax]):
• R = [5, 1, 25] um: infeasible
• R = [5, 1, 30] um: up to 1 channel
• R = [5, 0.25, 30] um: up to 19 channels
Network-Level Trade-Offs
[Chart: bridge front-end architecture options at 25 and 40 Gbit/s, ECL 130nm vs. FD-SOI 28nm, with 1/2/4 channels]
Optical parallelism comes with cost and signal-integrity concerns!
Vertical Integration Challenge
The Design Space
ONoC topology design points stem directly from designers' intuition.
HOW DO WE «SYNTHESIZE» THE MOST EFFICIENT ONoC SOLUTION FOR THE REQUIREMENTS OF THE CONNECTIVITY PROBLEM AT HAND?
The design space is currently largely unknown.
Major requirements: start from a high-level description, operate on abstractions, and refine them into an actual implementation with components from a technology library.
Can we extend the paradigms and methodologies of EDA to the context of emerging silicon nanophotonic interconnection networks?
Electronic EDA flow: high-level specification → gate-level netlist (technology-independent logic library) → mapped gate-level netlist (technology library) → planar geometric shapes.
Proposed photonic flow:
0. Routing protocol selection
I. Switching primitives representation
II. Technology mapping
III. Assignment of modulation carriers
IV. Netlist connectivity
V. Device parameter selection
VI. Placement and routing
VII. Physical design
Design Automation Beyond E-Roots
Design automation should not determine which technology to pursue.
Design automation can lead to a concrete evaluation of a new technology.
State-of-the-Art PIC Design Tools
[Flow steps I-IV: switching primitives representation, technology mapping, selection of modulation carriers, netlist connectivity]
Can we understand all topology design points in the context of a unified design framework?
Can we populate the design space of wavelength-routed optical NoC topologies?
Front-End Synthesis Methodology
Basic building block for the implementation of any wavelength-routed topology: the 1x2 drop filter.
[Figure: basic primitive — the on-resonance signal λi is dropped, off-resonance signals pass through]
The same filter implements both the drop function at the initiator side and the add function at the target side.
Basic Primitive: DROP FUNCTION / ADD FUNCTION
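For background, the condition governing which wavelengths a ring filter drops is the standard resonance relation m·λ = 2πR·n_eff; a quick sketch follows, where the effective index n_eff = 2.4 is an assumed value rather than a figure from the talk:

```python
import math

# Resonant wavelengths of a microring of radius R (standard textbook
# condition m * lambda = 2 * pi * R * n_eff; n_eff = 2.4 is an assumed
# effective index, not a figure from the talk).
def resonances_nm(radius_um, n_eff=2.4, band_nm=(1500.0, 1600.0)):
    optical_length_nm = 2 * math.pi * radius_um * 1e3 * n_eff
    lo, hi = band_nm
    m_min = math.ceil(optical_length_nm / hi)
    m_max = math.floor(optical_length_nm / lo)
    return [optical_length_nm / m for m in range(m_min, m_max + 1)]

# A 5 um ring drops a comb of wavelengths; the radius selects the comb.
print([f"{lam:.2f} nm" for lam in resonances_nm(5.0)])
```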
SYNTHESIS METHODOLOGY
1. Wavelength Resolution
Wavelength Resolution Graph (WRG) for a generic 4x4 WRONoC: each channel of the WDM input signal must be resolved so as to be routed to a different output.
2. Technology Mapping
E.g., grouping the 1x2 drop filters into compact 2x2 photonic switching elements (PSEs) from a technology library.
3. Symbolic Wavelength Assignment
Assign a resonant wavelength to the MRRs. Objective: minimize the number of MRR types. Constraints: avoid conflicts; drop channels on rows only once!
4. Topology Connection
Draw the topology logic scheme. It is a λ-router, but optimized with respect to the baseline: only 3 resonator types.
[Figure: WRG nodes AB, CD, BC, AD, AC, BD resolved onto outputs Out1..Out4 using wavelengths λ1, λ2, ...]
Generic Topology from the Front-End Flow
The crossings at this stage are only apparent: we are drawing the logic topology, not the physical one.
Our synthesis methodology can potentially populate the complete design space of WRONoC topologies by spanning all possible technology mappings, subject to the constraints of each stage for legal solutions.
With 2x2 PSEs only, the number of WRONoC topologies in the design space amounts to [(n-1)!]^n.
A 4x4 WRONoC topology can be implemented in 1296 different ways.
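The count quoted above is easy to reproduce (a sketch):

```python
# Size of the WRONoC topology design space when only 2x2 PSEs are
# used, per the [(n-1)!]^n formula above.
from math import factorial

def wronoc_design_points(n):
    return factorial(n - 1) ** n

print(wronoc_design_points(4))   # 1296, matching the 4x4 figure above
print(wronoc_design_points(8))   # ~4.2e29: exhaustive search dies fast
```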
[Figure: wavelength resolution tree for the 4x4 case — the unresolved set (1,2,3,4) at each input A..D is split step by step, e.g., (1)B(2,3,4)A, down to fully resolved leaves such as (1)C(2)B(3)A(4)D, using wavelengths λ1..λ3]
Generic Abstract Solutions
Back-End Synthesis Methodology
The front-end methodology (steps I-IV) produces the LOGIC TOPOLOGY; the back-end steps (V. device parameter selection, VI. placement and routing, VII. physical design) produce the PHYSICAL TOPOLOGY.
DEVICE PARAMETER SELECTION
• What is the exact radius of each MRR, typically in the range 5-20 um?
• What are the exact values of the n wavelengths used by each initiator in an n x n wavelength-routed optical NoC?
• What is the maximum bit-level communication parallelism on the I/O optical channels?
Not just cost and reliability, but also feasibility!
This is not just a refinement step, due to the ROUTING FAULT concern: it has implications on network-level throughput and scalability.
[Figure: as ring radii R1, R2, R3 are assigned, overlapping resonance peaks are progressively forbidden (marked NO); the available parallelism drops from 6 to 4 to 3 in PSEx, from 7 to 5 to 3 in PSEy, and from 9 to 7 in PSEz]
As topology size increases, the proliferation of filter types and wavelength channels may limit the availability of non-overlapped transmission peaks, which can make the topology practically infeasible.
Electromagnetic Model
[Figure: simulated transmission spectra — every pair of overlapping peaks is marked as forbidden (NO)]
Parallelism and Scalability Limitations
[Figure: radius tolerance — for R1 ± Rtol, the resonance peaks of channels λ2,2 and λ3,1 can swap order]
There exists a post-fabrication variation scenario that ends up in a routing fault.
Even without overlapping, proximity raises optical crosstalk concerns.
PARAMETER UNCERTAINTY: variation interval λ ± Δ(λ)
Conservative design-for-reliability constraint: assign device parameters, and state an achievable bit-level parallelism, such that routing faults cannot take place under any variability scenario.
We modeled the ring radius / wavelength channel selection problem, subject to routing-fault avoidance, as a constrained optimization problem, using ASP (Answer Set Programming) as the declarative technology.
This is the first refinement step directly exposed to the underlying technology.
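A toy version of the routing-fault-avoidance constraint, as a sketch (the real formulation uses ASP and a full electromagnetic model; the 1.0 nm peak shift assumed for the radius tolerance is illustrative, while the 0.5 nm laser uncertainty is the figure quoted later):

```python
# Toy routing-fault-avoidance check: a wavelength assignment is accepted
# only if every pair of channel uncertainty windows stays disjoint.
# The peak shift induced by the ring-radius tolerance is an assumed
# 1.0 nm; the 0.5 nm laser uncertainty matches the figure quoted later.
def feasible(channels_nm, laser_unc_nm=0.5, radius_shift_nm=1.0):
    half_window = laser_unc_nm + radius_shift_nm
    chans = sorted(channels_nm)
    return all(b - a > 2 * half_window for a, b in zip(chans, chans[1:]))

print(feasible([1550.0, 1554.0, 1558.0]))   # True: 4 nm spacing is safe
print(feasible([1550.0, 1552.5, 1555.0]))   # False: windows can overlap
```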
Back-End Synthesis Methodology
After device parameter selection, placement and routing (step VI) and physical design (step VII) refine the logic topology into the physical topology.
Optical Layer - Layout Planning: Placement and Routing
Lots of unexpected waveguide crossings arise (which burden the static power budget), and electronic P&R tools cannot be reused here.
We propose PROTON+, a tool for the automatic placement and routing of ONoC topologies (collaboration with Prof. Schlichtmann at TU Munich). The tool tries to strike a good balance between crossing losses and propagation losses, which can be conflicting objectives: minimize waveguide length vs. minimize the number of crossings.
[Chart: maximum insertion loss (15-40 dB) vs. prop/cross weight ratio (100/0 ... 0/100) for the 8x8 λ-router, 8x8 GWOR and 8x8 standard crossbar]
The objective combines Lp and Cp, approximate functions of path lengths and number of crossings; by setting the weights of the objective function, the best physical mapping for the technology/topology at hand can be achieved.
Placement: a non-linear optimization problem solved with an interior-point method (IPM). Routing: an adaptation of Lee's «maze router» algorithm.
Our objective function minimizes the insertion loss across the lossiest path, which indirectly limits total laser power.
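A simplified stand-in for that objective, as a sketch (the real tool solves a non-linear placement problem with an interior-point method; this version only scores candidate layouts with the weighted loss model and the per-device loss figures quoted earlier):

```python
# Simplified version of the weighted objective: score a candidate
# layout by the insertion loss of its lossiest path, trading off
# propagation loss (Lp) against crossing loss (Cp) via w_prop/w_cross.
PROP_DB_PER_CM = 0.274   # propagation loss figure from earlier
CROSS_DB = 0.05          # per-crossing loss figure from earlier

def max_weighted_loss(paths, w_prop=0.5, w_cross=0.5):
    """paths: iterable of (length_um, n_crossings) per connection."""
    return max(w_prop * (l_um * 1e-4) * PROP_DB_PER_CM
               + w_cross * n_x * CROSS_DB
               for l_um, n_x in paths)

# Two hypothetical layouts of the same topology:
layout_a = [(28636, 255), (15000, 120)]   # shorter paths, many crossings
layout_b = [(32000, 140), (18000, 100)]   # longer waveguides, fewer crossings
print(max_weighted_loss(layout_a), max_weighted_loss(layout_b))
```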
Physical Design Space Exploration
Layout of the 16x16 λ-Router with PROTON+
[Figure: placed-and-routed 16x16 λ-router, with hubs and memory controller]
• Maximum insertion loss: 44 dB
• 255 crossings on the critical path
• 28636 um of waveguide length on the critical path
• 24425 sec of CPU time (Intel Core 2 Quad at 2.33 GHz with 8 GB RAM)
PROTON v2.0 (PLATON)
[Charts: computation time drops from 24426 s (PROTON) to 640 s (PLATON) on the 16x16 λ-router; maximum insertion loss (dB) compared across the 8x8 λ-router, 8x8 GWOR, 8x8 standard crossbar and 16x16 λ-router for manual layout, PROTON [Boos+ICCAD'13] and PLATON]
PROTON v2.0 (PLATON) implements a force-directed placement algorithm: better computation times and better insertion losses. PLATON is well suited to large-scale topologies rather than small-scale ones.
(Placement and routing tools by the group of Prof. Ulf Schlichtmann, TU Munich)
There is large variability in the design space: from 18 to 39 crossings! This raises the issue of placement-aware logic topology synthesis, a completely new discipline for optical NoCs. The λ-router and snake topologies proposed in the literature are not the best from the critical-path-length viewpoint!
Design automation helps to get the most out of a technology.
[Chart: distribution of the critical path after physical mapping — number of topologies (0-180) vs. critical path length (max. no. of crossings, 18-39), with the λ-router and snake marked]
Physical Design with PROTON+
[Figure: optical layer — 4x4 ONoCs replicated 3x, with memory controller and gateway]
We exhaustively generated all 4x4 WRONoC topologies and mapped them with PROTON+; a few of them already exist in the literature.
Exploring the Design Space
We performed device parameter selection to assess the scalability of generic topologies.
• Conservative PROCESS VARIATIONS: only 4x4 ONoCs are certainly feasible; multiple wavelength-selection options are useless if uncertainty ranges are not reduced accordingly.
• Ideal fabrication: with an overly fine step and large rings, the upper bound is roughly a 60x60 topology, though with limited parallelism!
• Achievable parallelism is most sensitive to the incremental step of the MRR radii.

Fabrication options (radius selection range and incremental step):

| Option | Rmin | Rstep | Rmax |
| Ropt | 5 μm | 1 μm | 25 μm |
| R'opt | 5 μm | 1 μm | 30 μm |
| R''opt | 5 μm | 0.25 μm | 30 μm |

Radius tolerance: 10 nm; laser uncertainty: 0.5 nm.
[Chart: scalability — achievable parallelism vs. topology radix]
Conclusions
• High-performance computing systems will soon be interconnect-limited again. Emerging technologies can be game changers.
• It is time for a concrete evaluation of emerging silicon nanophotonic networks in small-scale systems. How? By bridging the gap with system designers.
• Horizontal integration gap: the ENoC-ONoC bridge is key to determining the configuration of the optical connections (data rate, parallelism) and their energy efficiency. Communication at 1-2 pJ/bit can realistically be targeted at a 40 Gbps connection rate, with 4 WDM channels at 10 Gbps in parallel (bridge in 28 nm CMOS). Signal integrity is an issue.
• Vertical integration gap: design methods have been developed to populate the largely unknown design space of wavelength-routed topologies. An early-stage, complete cross-layer synthesis methodology has been defined, and more energy-efficient topologies than those in the literature have been «synthesized».
• Design automation: an enabler for emerging technologies.
Acknowledgement