AISTECS 2019
Emerging Silicon Nanophotonic Networks: Time to Bridge the Gap with System Designers
Davide Bertozzi, University of Ferrara (Italy) - Temporary Guest Scientist at IHP Microelectronics (Germany)
• Evolution of the top 10 in the last six years:
• Average total compute power: 0.86 PFlops → 21 PFlops (~24x increase)
• Average node compute power: 31 GFlops → 600 GFlops (~19x increase)
• Average number of nodes: 28k → 35k (~1.3x increase)
Node compute power is the main contributor to performance growth.
Node compute power may keep scaling thanks to customization.
[Chart: averages of top 10 systems, relative to 2010 — 24x, 19x, 1.3x]
[S. Rumley, et al. Optical Interconnects for Extreme Scale Computing Systems, Journal of Parallel Computing, pp.65-80, 2017]
Trends in Extreme HPC
«Like the 1980s, a great time for architects!» (John L. Hennessy & David A. Patterson, Turing Lecture, ISCA 2018)
• Top 10 average node-level evolution:
• Average node compute power: 31 GFlops → 600 GFlops (~19x increase); number of nodes: ~1.3x; total compute power: ~24x
• Average bandwidth available per node: 2.7 GB/s → 7.8 GB/s (~3.2x increase)
• Average byte-per-flop ratio: 0.06 B/Flop → 0.01 B/Flop (~6x decrease); Sunway TaihuLight (#1) shows 0.004 B/Flop!
The growing gap in interconnect bandwidth may prevent aggregate execution performance from keeping up with the available compute power!
[Chart: averages of top 10 systems, relative to 2010 — 19x, 3.2x, 0.17x]
Trends in Extreme HPC
[S. Rumley, et al. Optical Interconnects for Extreme Scale Computing Systems, Journal of Parallel Computing, pp.65-80, 2017]
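A quick sanity check of the byte-per-flop trend above, as a sketch using the top-10 averages exactly as reported on the slides:

```python
# Sanity check of the byte-per-flop trend quoted above
# (top-10 averages as reported on the slides).
bpf_2010 = 0.06      # B/Flop, top-10 average, 2010
bpf_now = 0.01       # B/Flop, top-10 average, ~2016
taihulight = 0.004   # B/Flop, Sunway TaihuLight (#1)

print(f"average decrease: {bpf_2010 / bpf_now:.0f}x")     # 6x
print(f"relative to 2010: {bpf_now / bpf_2010:.2f}x")     # 0.17x, as charted
print(f"TaihuLight sits {bpf_2010 / taihulight:.0f}x below the 2010 average")
```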
What about Connectivity?
Interconnect Power Concern
[Charts: interconnect vs. compute energy — source: W. Dally; data from 28nm NVIDIA chips — source: S. Borkar]
Computation will be relatively inexpensive in energy terms compared with communication.
Bandwidth must be increased within tighter and tighter power budgets.
[Figure: a multi-core SoC built around a network-on-chip — CPU, accelerator, DSP, DMA, MPEG and Ethernet blocks, SRAMs, network interfaces (NI), and a fabric of switches]
Surprisingly, criticalities are showing up even at the lowest layer of the hierarchy (chip-scale communication).
EMERGING Network-on-Chip CRITICALITIES:
• Latency sensitivity of the multi-hop fabric
• Bandwidth criticalities for future kilo-core chips
• The power overhead of moving bits around
• Non-seamless scaling to off-chip communication
The Communication Hierarchy
Courtesy of K.Bergman
WE NEED A GAME CHANGER!
A lot of work is going on at the upper layers of the interconnection hierarchy: PCIe, GEN-Z, OpenCAPI, CCIX, Ethernet, InfiniBand, ...
[Figure: node-to-node path — short-distance node-router links use electrical transceivers; short-distance router-router links use electrical links or VCSEL-based optical technology; the long-distance router-router link uses optical transceivers]
Silicon Photonics: Game Changer?
Silicon photonics uses co-integration techniques of optical components and/or transceivers with a standard CMOS manufacturing process.
Silicon Photonics: Game Changer?
Silicon photonics is delivering integrated optical transceivers and holds the promise of bringing optical communications closer to, and deeper into, the processing node.
[Figure: integrated optical transceivers replace conventional hop-by-hop data movement across node, router and core with flattened end-to-end data movement]
Courtesy of K.Bergman
Key enabler for new paradigms: disaggregated architectures.
Requirements for that to happen:
• Divide cost by at least 1.5 orders of magnitude
• Improve energy efficiency by at least one order of magnitude
• Efficient integration solutions with electronics
• Improve the system-ability of the technology
Improving Technology Maturity & Architecture and System-Level Design
The gap between system-level designers and technology developers is huge!
• Architecture design points stem directly from designers' intuition
• Descriptive information at different abstraction layers is mixed
• Designs are difficult to compare with one another
• The application of well-known optimization techniques is difficult
• No consistent methodologies to explore the design space
• Most of the design space is still largely unknown
Mind the Gap
Golden age of ONoC assessment (~2008-2012): estimated power savings with nanophotonic networks.
Early-stage ONoC analysis: inflated expectations.
[Figure: Gartner Hype Cycle, marking TODAY and the 2008-2012 assessment wave, with an example of the optical parameters used in early-stage analyses]
How do we turn the trough of disillusionment into a slope of enlightenment?
Mind the Gap
Goal:
Bridge the gap between developers of emerging devices and circuit & system designers, thus coupling emerging interconnect technologies and architectures with digital systems and working out novel system-level design concepts.
Focus:
• Photonically-integrated chip-scale parallel computing
• Their coupling with off-chip memory sub-systems
Methodology:
• Addressing the horizontal integration gap
• Addressing the vertical integration gap
A Framework to Bridge the Gap
[Figure: target platform — processor(s), cache hierarchy, ENoC, optical network, DRAM, GPU]
BACKGROUND
OPTICAL NETWORKS‐ON‐CHIP
[Figure: optical NoC initiator and target. At the initiator, an electrical signal drives a 4-stage modulator that imprints the data onto a wavelength-division multiplexed input signal. At the target, each of the four wavelengths is dropped onto a photodetector (PD) followed by a TIA and a comparator that recover the electrical bitstream]
Wavelength‐Routed Optical NoCs
[Figure: 4x4 λ-router — wavelengths λ1..λ4 statically route each initiator I1..I4 to a distinct output O1..O4]
Main feature: static allocation of channels to source‐destination pairs
• The topology must avoid interference between same-wavelength carriers
• No time spent in routing and arbitration
• All-optical interconnect solution
• Performance predictability
• All-to-all communications can take place concurrently
• Hard to scale to a large number of cores
Better topologies exist that reuse the same set of 4 wavelengths across all initiators.
(Naive) Non-blocking Crossbar
[Figure: naive non-blocking crossbar — every initiator I1..I4 sends all 4 wavelengths toward each output O1..O4]
State‐of‐the‐art «Snake» topology
Static Power Overhead
A major source of overhead in optical NoCs is static power, dominated by the laser sources and thermal tuning.
Per-device insertion losses:
• Passing by an off-resonance ring: 0.005 dB each
• Waveguide crossing: 0.05 dB each
• Propagation loss: 0.274 dB/cm
[Figure: example path — a 1 dBm signal drops to 0.763 dBm after 0.1 cm of waveguide, ring pass-bys and crossings; the photodetector sensitivity sets the floor of the budget]
Insertion loss (and hence the laser power requirement) depends on the connectivity pattern.
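As a small illustration of how these per-device figures compose into a laser-power requirement, here is a minimal worst-case budget sketch; the path composition and the photodetector sensitivity are hypothetical example values, not figures from the talk:

```python
# Minimal worst-case insertion-loss / laser-power budget using the
# per-device loss figures above. The path composition and the
# photodetector sensitivity (-17 dBm) are hypothetical examples.
RING_PASSBY_DB = 0.005         # dB per off-resonance ring passed
CROSSING_DB = 0.05             # dB per waveguide crossing
PROPAGATION_DB_PER_CM = 0.274  # dB/cm of waveguide

def insertion_loss_db(rings_passed, crossings, length_cm):
    return (rings_passed * RING_PASSBY_DB
            + crossings * CROSSING_DB
            + length_cm * PROPAGATION_DB_PER_CM)

loss = insertion_loss_db(rings_passed=20, crossings=10, length_cm=1.0)
pd_sensitivity_dbm = -17.0            # hypothetical receiver floor
laser_dbm = pd_sensitivity_dbm + loss # minimum launch power per channel
print(f"path loss {loss:.3f} dB -> laser >= {laser_dbm:.2f} dBm per channel")
```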
Horizontal Integration Challenge
[Figure: target architecture — an array of off-chip CW lasers (λ1..λ4) feeds the photonic layer, stacked below the electronic layer and connected to it through TSVs; clusters of processor cores attach via hubs (H1..H4), and gateways reach the off-chip memories (M1..M4)]
Target Architecture
Solutions such as 3D or 2.5D integration allow the electronic and photonic processes to be separated, opening the door to fully dedicated process optimization for the photonic die.
System View
[Figure: hierarchical system view (source: IBM) — four local domains, each a 2x2 grid of electronic switches (Eswitch), interconnected at the top level]
System View
• Data rate adaptation
• (De-)serialization
• Flow control
• Clock resynchronization
• Message-dependent deadlock avoidance
Not Just E/O and O/E Converters, but an Architecture Integration Challenge
[Figure: two local domains of Eswitches coupled through the photonic path — modulator + driver on the transmit side, photodetector (PD) + TIA on the receive side (source: SSSA Pisa); data/valid/stall signaling at the domain boundaries]
Architecture Integration Challenge
1) Data rate adaptation
Clock speed [0.5 - 3] GHz vs. modulation rate >= 10 GHz
2) (De-)Serialization — Architecture Integration Challenge
32/64/128/256-bit parallel words ↔ serial optical bitstream
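To make the rate mismatch concrete, a back-of-the-envelope sketch (the flit width and clock values are illustrative):

```python
# Illustrative rate-matching arithmetic for the (de-)serialization step:
# a W-bit flit stream at the network-interface clock must fit onto the
# serial optical channel(s) at the modulation rate.
def channels_needed(flit_bits, ni_clock_hz, modulation_rate_bps):
    parallel_bps = flit_bits * ni_clock_hz   # offered load from the ENoC
    return parallel_bps / modulation_rate_bps

# 32-bit flits at 1 GHz against a single 10 Gbit/s channel:
print(channels_needed(32, 1e9, 10e9))   # 3.2 -> more lanes or backpressure
```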
3) Flow control — Architecture Integration Challenge
Buffer size is a function of the round-trip time for full-throughput operation.
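A minimal sketch of the buffer-sizing rule implied here, under the standard credit-based flow-control model (the cycle counts are illustrative, not from the talk):

```python
import math

# Minimum buffer depth for full throughput under credit-based flow
# control: the buffer must cover the full round trip (data forward +
# credit return), assuming one flit per cycle at peak rate.
def min_buffer_flits(forward_cycles, credit_return_cycles,
                     flits_per_cycle=1.0):
    rtt = forward_cycles + credit_return_cycles
    return math.ceil(rtt * flits_per_cycle)

print(min_buffer_flits(forward_cycles=6, credit_return_cycles=6))  # 12
```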
4) Clock resynchronization — Architecture Integration Challenge
The received data stream (>= 10 GHz) must be brought into the local clock domain despite an unknown phase offset Δ between data and local clock.
5) Message-dependent deadlock avoidance — Architecture Integration Challenge
BRIDGE
The bridge is a complex block that takes care of key functional tasks for correct architecture operation; it is built on top of a multi-technology platform and supports GHz-range signaling rates.
Bridge Configuration
One of the key challenges is overcoming the inherently serial nature of optical communications. Options:
• Increase the signaling rate of the optical channels
• Increase the bit-level parallelism (WDM)
• A combination thereof
Research Goal: Explore and Characterize the Configuration Space of the Bridge
Pay attention: CMOS cannot achieve arbitrary speeds!
This has implications for the SerDes architecture, hence for the performance-power trade-off of the bridge.
Technology Partitioning
16 3D-stacked computation clusters, 16x16 optical NoC (ONoC).
In static-power-dominated technologies like silicon photonics, operation at high transmission rates may become a priority to cut down on pJ/bit. Better-performing technologies than CMOS may therefore be required in the back-end of the bridge.
Bridge partitioning: Optics | CMOS | BiCMOS
Our assumption: IHP 130nm SiGe BiCMOS (SG13S)
- fT / fmax = 250 GHz / 340 GHz
- 3.3V I/O CMOS, 1.2V logic CMOS
- 5 thin metal layers, 2 thick ones
Target logic family:
- 2.5V-compatible ECL
- A cell library provides standard-cell gates
- Logic synthesis from HDL enabled (Synopsys DC)
Similar technologies provide monolithic integration of optical components with the BiCMOS process.
Bridge Architecture (Gateway)
[Figure: gateway block diagram. Transmitter side: from the ENoC, a mesochronous synchronizer and VC decoder feed 1x3 demuxes into DC-FIFOs (request/reply), 1x15 routing, 3x1 muxes with arbiters, a serializer (SER) clocked by a PLL, and drivers + modulators on wavelengths λ11..λ13 plus a clock carrier λ1c. Receiver side: per-wavelength photodetector (PD) + TIA + comparator chains and a clock divider on the received clock carrier feed a deserializer (DESER), 1x3 demuxes into DC-FIFOs, credit counters 1..15, and a (15x2)x1 mux with arbiter toward the ENoC; filters select the wavelengths]
Bridge Architecture - Transmitter Side
One transmission module for each target (15 of them in a 16x16 ONoC). Optimization: only one set of buffers for all destinations.
[Figure: transmitter datapath — VC decoder, 1x3 demux, DC-FIFOs (req/reply), 1x15 routing, 3x1 mux with arbiter, serializer (SER) + PLL, drivers and modulators λ15_1..λ15_3 plus clock carrier λ15c]
1 GHz network interface; interface frequency = f(modulation rate), e.g., 10 Gbps.
One virtual channel for each message class, to avoid (message-dependent) deadlock.
Bit-level parallelism. Source-synchronous communication.
Bridge Architecture - Receiver Side
One receiver module for each transmitter (15 of them in a 16x16 ONoC).
[Figure: receiver datapath — per-wavelength PD + TIA + comparator chains (data carriers λ15_1..λ15_3 plus clock carrier λ15c), clock divider, deserializer (DESER), 1x3 demux into DC-FIFOs (req/reply), (15x2)x1 mux with arbiter]
1 GHz network interface; interface frequency = f(transmission frequency); modulation rate e.g. 10 Gbps.
Source-synchronous communication.
Flow Control
[Figure: the gateway datapath reused for credit return — credit counters 1..15, DC-FIFOs, mesochronous synchronizer]
Credit-based flow control:
- Reuses the datapath
- Exploits the low dynamic power of ONoCs
- No round-trip timing assumptions
A flit can fire only if credits are available.
The 2:1 mux cell is the main building block.
[Figure: master D-latch and slave D-latch feeding a 2:1 mux — input data clocked at f, output data at 2f]
Transmission frequency = twice the input clock → lower PLL frequency.
PERFECT BINARY TREE STRUCTURE
• M = log2(N) stages, each working at half the speed of the next
• The number of building blocks per stage is inversely proportional to its operating frequency → energy savings
• No need for additional selectors
Serializer Architecture
[Figure: 16:1 binary-tree serializer — fifteen 2x1 mux cells in four stages clocked at f/8, f/4, f/2 and f by a ÷2 divider chain; N-bit input data enters at f/8, output data leaves at 2f]
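A small model of the tree may help here; it tabulates, per stage, the number of 2:1 cells and the stage clock. This is a sketch assuming each 2:1 cell emits two bits per clock period (both phases), as with the master/slave latch cell above:

```python
import math

# Stage plan for an N:1 binary-tree serializer: log2(N) stages, each
# clocked at twice the previous one, with the number of 2:1 mux cells
# halving toward the output. Assumes the final cell emits two bits per
# clock (both phases), so output bit rate = 2 x final stage clock.
def serializer_plan(n_bits, output_bit_rate_hz):
    stages = int(math.log2(n_bits))
    final_clock = output_bit_rate_hz / 2
    for s in range(stages):
        cells = n_bits >> (s + 1)                   # 8, 4, 2, 1 for N = 16
        clock = final_clock / 2 ** (stages - 1 - s) # f/8, f/4, f/2, f
        yield s + 1, cells, clock

for stage, cells, clock in serializer_plan(16, 10e9):
    print(f"stage {stage}: {cells:2d} cells @ {clock / 1e9:.3f} GHz")
```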
Flexibility
This architecture is very flexible and can easily span a wide bridge configuration space:
• More parallelism: remove stages from the right
• Scale up: add more stages to the left
[Figure: the same 16:1 tree, truncated or extended]
[Figure: 32:1 serializer variant for 2-bit optical parallelism — the 2:1 mux tree is split so that a 16x2 mux stage feeds two output data streams at frequency f, driven by the shared ÷2 divider chain from the input clock]
[Figure: complete gateway with its clock domains. Transmit path (f0 ENoC domain): VC decoder, 1x3 demux, 30 DC-FIFOs (request/reply per target), 1x15 routing, 30x1 mux with arbiter and credit counters, 32:1 binary-tree serializer (clocks clk1..clk5 from a PLL and ÷2 dividers), driver and modulator fed by the laser source. Receive path (f1 domain, recovered from the clock carrier via PD + TIA + comparator and ÷2 dividers, f1..f1/16): 1x32 binary-tree deserializer, VC-ID decoding, 1x3 demux into DC-FIFOs 1..15, credits returned to the transmitter, mesochronous synchronizer toward the ENoC]
[Figure: 8:1 mux (seven 2x1 mux cells across stages at f/4, f/2 and f; input data at f/8, output data at f) and its dual 1x8 demux (seven 1x2 demux cells across stages at f/2, f/4 and f/8; input data at f, output data at f/8)]
Experimental Results
Bridge front-end architecture: (de-)serialization + transceivers (CMOS, two process nodes: bulk 40nm or 28nm FD-SOI; ECL: 130nm), plus opto-electronics + ONoC.
Partitioning options arise from the multi-stage nature of the serializer (DP: design point; CMOS 40nm / FD-SOI 28nm / ECL 130nm).
• Target data rate per source-destination connection @25 Gbit/s: design points DP1..DP5, stage periods 1.28, 0.64, 0.32, 0.16, 0.08 ns
• Target data rate per source-destination connection @40 Gbit/s: design points DP1..DP4, stage periods 0.8, 0.4, 0.2, 0.1, 0.05 ns
Experimental Results
[Chart: energy-per-bit (pJ/bit, 0-7) across design points DP1..DP5 for the 130nm ECL + CMOS 40nm and 130nm ECL + 28nm FD-SOI partitionings. At 25 Gbit/s aggregate: 1 channel x 25 Gbit/s (not feasible fully-CMOS), 2 channels x 12.5 Gbit/s, 4 channels x 6.25 Gbit/s. At 40 Gbit/s aggregate: 1 channel x 40 Gbit/s (not feasible fully-CMOS), 2 channels x 20 Gbit/s, 4 channels x 10 Gbit/s. Fully-CMOS design points are annotated with savings of -31% and -84% relative to hybrid CMOS-ECL ones]
Experimental Results
[Figure: the complete gateway schematic, replicated for the TX/RX instances attached to the 16x16 λ-router]
16x16 λ-Router Topology: 16mm x 16mm optical layer
[Chart: energy-per-bit (pJ/bit, 0-12), broken down into bridge CMOS part, bridge ECL part, thermal tuning, and Tx-Rx-laser, for 1/2/4-bit parallelism at 25 and 40 Gbit/s]
@25 Gbit/s: 1 channel x 25 Gbit/s — hybrid, 10.94 pJ/bit; 2 channels x 12.5 Gbit/s — fully-CMOS, 1.6 pJ/bit; 4 channels x 6.25 Gbit/s — fully-CMOS, 1.23 pJ/bit (-88.7%).
@40 Gbit/s: 1 channel x 40 Gbit/s — hybrid, 8.69 pJ/bit; 2 channels x 20 Gbit/s — hybrid, 8.31 pJ/bit; 4 channels x 10 Gbit/s — fully-CMOS, 1.15 pJ/bit (-85.4%).
Experimental Results (100% bandwidth utilization)
Energy efficiencies in the ballpark of 1 to 2 pJ/bit are possible with more WDM channels, a trend that higher signaling speeds exacerbate.
Experimental Results

| | TSVs (25G / 40G) | Laser sources (25G / 40G) | Total power [W] (25G / 40G) | SNR (25G / 40G) |
| 1 channel | 1216 / 2176 | 32 / 32 | 65.6 / 83.4 | 16.2 / 16.14 |
| 2 channels | 1216 / 2176 | 48 / 48 | 7.4 / 79.7 | 13.13 / 13.1 |
| 4 channels | 2176 / 2176 | 80 / 80 | 9.6 / 11.01 | 8.8 / 8.8 |

Manufacturing requirements (R = [Rmin, Rstep, Rmax]):
• R = [5, 1, 25] um: infeasible
• R = [5, 1, 30] um: up to 1 channel
• R = [5, 0.25, 30] um: up to 19 channels
Network-Level Trade-Offs
[Chart: bridge front-end architecture options at 25 and 40 Gbit/s, ECL 130nm vs. FD-SOI 28nm, with 1/2/4 channels]
Optical parallelism comes with cost and signal-integrity concerns!
Vertical Integration Challenge
The Design Space
ONoC topology design points stem directly from designers' intuition.
HOW DO WE «SYNTHESIZE» THE MOST EFFICIENT ONoC SOLUTION FOR THE REQUIREMENTS OF THE CONNECTIVITY PROBLEM AT HAND?
The design space is currently largely unknown.
Major requirements: start from a high-level description, operate on abstractions, and refine them into an actual implementation with components from a technology library.
Can we extend the paradigms and methodologies of EDA to the context of emerging silicon nanophotonic interconnection networks?
Electronic EDA flow: high-level specification → gate-level netlist (technology-independent logic library) → mapped gate-level netlist (technology library) → planar geometric shapes.
Proposed photonic flow:
0. Routing protocol selection
I. Switching primitives representation
II. Technology mapping
III. Assignment of modulation carriers
IV. Netlist connectivity
V. Device parameter selection
VI. Placement and routing
VII. Physical design
Design Automation Beyond E-Roots
Design automation should not determine which technology to pursue.
Design automation can lead to a concrete evaluation of a new technology.
State-of-the-Art PIC Design Tools
[Flow steps I-IV: switching primitives representation, technology mapping, selection of modulation carriers, netlist connectivity]
Can we understand all topology design points in the context of a unified design framework?
Can we populate the design space of wavelength-routed optical NoC topologies?
Front-End Synthesis Methodology
Basic building block for the implementation of any wavelength-routed topology: the 1x2 drop filter.
[Figure: basic primitive — the on-resonance signal λi is dropped, off-resonance signals pass through]
The same filter implements both the drop function at the initiator side and the add function at the target side.
Basic Primitive: DROP FUNCTION / ADD FUNCTION
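For background, the condition governing which wavelengths a ring filter drops is the standard resonance relation m·λ = 2πR·n_eff; a quick sketch follows, where the effective index n_eff = 2.4 is an assumed value rather than a figure from the talk:

```python
import math

# Resonant wavelengths of a microring of radius R (standard textbook
# condition m * lambda = 2 * pi * R * n_eff; n_eff = 2.4 is an assumed
# effective index, not a figure from the talk).
def resonances_nm(radius_um, n_eff=2.4, band_nm=(1500.0, 1600.0)):
    optical_length_nm = 2 * math.pi * radius_um * 1e3 * n_eff
    lo, hi = band_nm
    m_min = math.ceil(optical_length_nm / hi)
    m_max = math.floor(optical_length_nm / lo)
    return [optical_length_nm / m for m in range(m_min, m_max + 1)]

# A 5 um ring drops a comb of wavelengths; the radius selects the comb.
print([f"{lam:.2f} nm" for lam in resonances_nm(5.0)])
```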
SYNTHESIS METHODOLOGY
1. Wavelength Resolution
Wavelength Resolution Graph (WRG) for a generic 4x4 WRONoC: each channel of the WDM input signal must be resolved so as to be routed to a different output.
2. Technology Mapping
E.g., grouping the 1x2 drop filters into compact 2x2 photonic switching elements (PSEs) from a technology library.
3. Symbolic Wavelength Assignment
Assign a resonant wavelength to the MRRs. Objective: minimize the number of MRR types. Constraints: avoid conflicts; drop channels on rows only once!
4. Topology Connection
Draw the topology logic scheme. It is a λ-router, but optimized with respect to the baseline: only 3 resonator types.
[Figure: WRG nodes AB, CD, BC, AD, AC, BD resolved onto outputs Out1..Out4 using wavelengths λ1, λ2, ...]
Generic Topology from the Front-End Flow
The crossings at this stage are only apparent: we are drawing the logic topology, not the physical one.
Our synthesis methodology can potentially populate the complete design space of WRONoC topologies by spanning all possible technology mappings, subject to the constraints of each stage for legal solutions.
With 2x2 PSEs only, the number of WRONoC topologies in the design space amounts to [(n-1)!]^n.
A 4x4 WRONoC topology can be implemented in 1296 different ways.
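The count quoted above is easy to reproduce (a sketch):

```python
# Size of the WRONoC topology design space when only 2x2 PSEs are
# used, per the [(n-1)!]^n formula above.
from math import factorial

def wronoc_design_points(n):
    return factorial(n - 1) ** n

print(wronoc_design_points(4))   # 1296, matching the 4x4 figure above
print(wronoc_design_points(8))   # ~4.2e29: exhaustive search dies fast
```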
[Figure: wavelength resolution tree for the 4x4 case — the unresolved set (1,2,3,4) at each input A..D is split step by step, e.g., (1)B(2,3,4)A, down to fully resolved leaves such as (1)C(2)B(3)A(4)D, using wavelengths λ1..λ3]
Generic Abstract Solutions
Back-End Synthesis Methodology
The front-end methodology (steps I-IV) produces the LOGIC TOPOLOGY; the back-end steps (V. device parameter selection, VI. placement and routing, VII. physical design) produce the PHYSICAL TOPOLOGY.
DEVICE PARAMETER SELECTION
• What is the exact radius of each MRR, typically in the range 5-20 um?
• What are the exact values of the n wavelengths used by each initiator in an n x n wavelength-routed optical NoC?
• What is the maximum bit-level communication parallelism on the I/O optical channels?
Not just cost and reliability, but also feasibility!
This is not just a refinement step, due to the ROUTING FAULT concern: it has implications on network-level throughput and scalability.
[Figure: as ring radii R1, R2, R3 are assigned, overlapping resonance peaks are progressively forbidden (marked NO); the available parallelism drops from 6 to 4 to 3 in PSEx, from 7 to 5 to 3 in PSEy, and from 9 to 7 in PSEz]
As topology size increases, the proliferation of filter types and wavelength channels may limit the availability of non-overlapped transmission peaks, which can make the topology practically infeasible.
Electromagnetic Model
[Figure: simulated transmission spectra — every pair of overlapping peaks is marked as forbidden (NO)]
Parallelism and Scalability Limitations
[Figure: radius tolerance — for R1 ± Rtol, the resonance peaks of channels λ2,2 and λ3,1 can swap order]
There exists a post-fabrication variation scenario that ends up in a routing fault.
Even without overlapping, proximity raises optical crosstalk concerns.
PARAMETER UNCERTAINTY: variation interval λ ± Δ(λ)
Conservative design-for-reliability constraint: assign device parameters, and state an achievable bit-level parallelism, such that routing faults cannot take place under any variability scenario.
We modeled the ring radius / wavelength channel selection problem, subject to routing-fault avoidance, as a constrained optimization problem, using ASP (Answer Set Programming) as the declarative technology.
This is the first refinement step directly exposed to the underlying technology.
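A toy version of the routing-fault-avoidance constraint, as a sketch (the real formulation uses ASP and a full electromagnetic model; the 1.0 nm peak shift assumed for the radius tolerance is illustrative, while the 0.5 nm laser uncertainty is the figure quoted later):

```python
# Toy routing-fault-avoidance check: a wavelength assignment is accepted
# only if every pair of channel uncertainty windows stays disjoint.
# The peak shift induced by the ring-radius tolerance is an assumed
# 1.0 nm; the 0.5 nm laser uncertainty matches the figure quoted later.
def feasible(channels_nm, laser_unc_nm=0.5, radius_shift_nm=1.0):
    half_window = laser_unc_nm + radius_shift_nm
    chans = sorted(channels_nm)
    return all(b - a > 2 * half_window for a, b in zip(chans, chans[1:]))

print(feasible([1550.0, 1554.0, 1558.0]))   # True: 4 nm spacing is safe
print(feasible([1550.0, 1552.5, 1555.0]))   # False: windows can overlap
```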
Back-End Synthesis Methodology
After device parameter selection, placement and routing (step VI) and physical design (step VII) refine the logic topology into the physical topology.
Optical Layer - Layout Planning: Placement and Routing
Lots of unexpected waveguide crossings arise (which burden the static power budget), and electronic P&R tools cannot be reused here.
We propose PROTON+, a tool for the automatic placement and routing of ONoC topologies (collaboration with Prof. Schlichtmann at TU Munich). The tool tries to strike a good balance between crossing losses and propagation losses, which can be conflicting objectives: minimize waveguide length vs. minimize the number of crossings.
[Chart: maximum insertion loss (15-40 dB) vs. prop/cross weight ratio (100/0 ... 0/100) for the 8x8 λ-router, 8x8 GWOR and 8x8 standard crossbar]
The objective combines Lp and Cp, approximate functions of path lengths and number of crossings; by setting the weights of the objective function, the best physical mapping for the technology/topology at hand can be achieved.
Placement: a non-linear optimization problem solved with an interior-point method (IPM). Routing: an adaptation of Lee's «maze router» algorithm.
Our objective function minimizes the insertion loss across the lossiest path, which indirectly limits total laser power.
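A simplified stand-in for that objective, as a sketch (the real tool solves a non-linear placement problem with an interior-point method; this version only scores candidate layouts with the weighted loss model and the per-device loss figures quoted earlier):

```python
# Simplified version of the weighted objective: score a candidate
# layout by the insertion loss of its lossiest path, trading off
# propagation loss (Lp) against crossing loss (Cp) via w_prop/w_cross.
PROP_DB_PER_CM = 0.274   # propagation loss figure from earlier
CROSS_DB = 0.05          # per-crossing loss figure from earlier

def max_weighted_loss(paths, w_prop=0.5, w_cross=0.5):
    """paths: iterable of (length_um, n_crossings) per connection."""
    return max(w_prop * (l_um * 1e-4) * PROP_DB_PER_CM
               + w_cross * n_x * CROSS_DB
               for l_um, n_x in paths)

# Two hypothetical layouts of the same topology:
layout_a = [(28636, 255), (15000, 120)]   # shorter paths, many crossings
layout_b = [(32000, 140), (18000, 100)]   # longer waveguides, fewer crossings
print(max_weighted_loss(layout_a), max_weighted_loss(layout_b))
```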
Physical Design Space Exploration
Layout of the 16x16 λ-Router with PROTON+
[Figure: placed-and-routed 16x16 λ-router, with hubs and memory controller]
• Maximum insertion loss: 44 dB
• 255 crossings on the critical path
• 28636 um of waveguide length on the critical path
• 24425 sec of CPU time (Intel Core 2 Quad at 2.33 GHz with 8 GB RAM)
PROTON v2.0 (PLATON)
[Charts: computation time drops from 24426 s (PROTON) to 640 s (PLATON) on the 16x16 λ-router; maximum insertion loss (dB) compared across the 8x8 λ-router, 8x8 GWOR, 8x8 standard crossbar and 16x16 λ-router for manual layout, PROTON [Boos+ICCAD'13] and PLATON]
PROTON v2.0 (PLATON) implements a force-directed placement algorithm: better computation times and better insertion losses. PLATON is well suited to large-scale topologies rather than small-scale ones.
(Placement and routing tools by the group of Prof. Ulf Schlichtmann, TU Munich)
There is large variability in the design space: from 18 to 39 crossings! This raises the issue of placement-aware logic topology synthesis, a completely new discipline for optical NoCs. The λ-router and snake topologies proposed in the literature are not the best from the critical-path-length viewpoint!
Design automation helps to get the most out of a technology.
[Chart: distribution of the critical path after physical mapping — number of topologies (0-180) vs. critical path length (max. no. of crossings, 18-39), with the λ-router and snake marked]
Physical Design with PROTON+
[Figure: optical layer — 4x4 ONoCs replicated 3x, with memory controller and gateway]
We exhaustively generated all 4x4 WRONoC topologies and mapped them with PROTON+; a few of them already exist in the literature.
Exploring the Design Space
We performed device parameter selection to assess the scalability of generic topologies.
• Conservative PROCESS VARIATIONS: only 4x4 ONoCs are certainly feasible; multiple wavelength-selection options are useless if uncertainty ranges are not reduced accordingly.
• Ideal fabrication: with an overly fine step and large rings, the upper bound is roughly a 60x60 topology, though with limited parallelism!
• Achievable parallelism is most sensitive to the incremental step of the MRR radii.

Fabrication options (radius selection range and incremental step):

| Option | Rmin | Rstep | Rmax |
| Ropt | 5 μm | 1 μm | 25 μm |
| R'opt | 5 μm | 1 μm | 30 μm |
| R''opt | 5 μm | 0.25 μm | 30 μm |

Radius tolerance: 10 nm; laser uncertainty: 0.5 nm.
[Chart: scalability — achievable parallelism vs. topology radix]
Conclusions
• High-performance computing systems will soon be interconnect-limited again. Emerging technologies can be game changers.
• It is time for a concrete evaluation of emerging silicon nanophotonic networks in small-scale systems. How? By bridging the gap with system designers.
• Horizontal integration gap: the ENoC-ONoC bridge is key to determining the configuration of the optical connections (data rate, parallelism) and their energy efficiency. Communication at 1-2 pJ/bit can realistically be targeted at a 40 Gbps connection rate, with 4 WDM channels at 10 Gbps in parallel (bridge in 28 nm CMOS). Signal integrity is an issue.
• Vertical integration gap: design methods have been developed to populate the largely unknown design space of wavelength-routed topologies. An early-stage, complete cross-layer synthesis methodology has been defined, and more energy-efficient topologies than those in the literature have been «synthesized».
• Design automation: an enabler for emerging technologies.
Acknowledgement