© 2017 SiFive. All Rights Reserved.
TileLink: A free and open-source, high-performance scalable cache-coherent fabric designed for RISC-VWesley W. Terpstra
© 2017SiFive.AllRightsReserved.
TileLink Rationale
3 © 2017SiFive.AllRightsReserved.
So you want to build a chip, ‘eh?
sounds like you need a coherent bus protocol!
4 © 2017SiFive.AllRightsReserved.
• Open standard• Easy to implement• Cache-coherent block motion• Multiple cache layers• Reusable on- and off-chip• High performance
Requirements for a RISC-V bus
5 © 2017SiFive.AllRightsReserved.
Requirements for a RISC-V bus
AHB Wishbone AXI4 ACE CHI TileLink
Openstandard ✔ ✔ ✔ ✔ ✗ ✔
Easytoimplement ✗ ✔ ✗ ? ✔
Cacheblock motion ✗ ✗ ✗ ✔ ✔ ✔
Multiplecache layers - - - ✗ ? ✔
Reusableoff-chip ✗ ? ✔
High performance ✗ ✔ ✔ ? ✔
6 © 2017SiFive.AllRightsReserved.
• Open standard? CHI is not open!“The AMBA 5 CHI specification is currently available to partners integrating SoCs or developing IP or tools that implement it. Please contact your Arm account manager for details on obtaining a copy.”
• Easy to implement?• ACE: 10 probe message types, split control/data, narrow bursts, …
• Multiple cache layers?• ACE: No – designed for snoop-style coherency
• Depend on a standard controlled by a RISC-V competitor?
What about AMBA CHI/ACE?
7 © 2017SiFive.AllRightsReserved.
• Clean slate: avoid prior pitfalls• Decouple message protocol from wire protocol• Competitor cannot undermine RISC-V’s future• Designing for RISC-V requirements = simpler design• Reduced Message Protocol; RISC => RMP• Assume all connected hardware is trusted• Only power-of-2 block transfers
TileLink: Rolling our own has advantages
8 © 2017SiFive.AllRightsReserved.
• Master-slave point-to-point protocol• Message-based with 5 (ABCDE) priorities• Out-of-order design with optional ordering• Progressive conformance level complexity• Cache-coherent memory block motion• Designed for composability and deadlock freedom
TileLink: Bird’s-Eye View
Master
SlaveMaster
Slave
9 © 2017SiFive.AllRightsReserved.
• Rocket-chip open source SoC is entirely built using TileLink• > 30 public modules; including cores, Xbars, and adapters• broadcast-hub coherency manager (ie: ACE-style snoop coherency)• bridges to AXI/AHB/APB, clock crossings, fuzzer, monitor, model checker• for more details, see our TileLink CARRV 2017 paper
• SiFive-blocks: open I2C, SPI, UART, GPIO, PWM TileLink slaves
• Foundation of publicly available SiFive chips (FE310 + FU500)• a banked directory-based wormhole MESI L2$• off-chip coherent TileLink interconnect to FPGA (ChipLink)
TileLink: Open Source and In Production
10
TileLink Protocol Conformance Levels
© 2017SiFive.AllRightsReserved.
TL-UL TL-UH TL-CRead/WriteOperations ✔ ✔ ✔
MultibeatMessages ✔ ✔
Atomic Operations ✔ ✔
Hint(Prefetch) Operations ✔ ✔
Cache BlockTransfers ✔
PrioritiesB+C+E ✔
11
A B C D E
Operations are composed of Messages (TL-UL)
© 2017SiFive.AllRightsReserved.
PutFullDataPutPartialData
Get
AccessAck
AccessAckData
12
A B C D E
Operations are composed of Messages (TL-UL)
© 2017SiFive.AllRightsReserved.
PutFullDataPutPartialData
Get
AccessAck
AccessAckData
13
A B C D E
Operations are composed of Messages (TL-UH)
© 2017SiFive.AllRightsReserved.
PutFullDataPutPartialDataArithmeticDataLogicalData
GetHint
AccessAck
AccessAckData
HintAck
14
A B C D E
Operations are composed of Messages (TL-UH)
© 2017SiFive.AllRightsReserved.
PutFullDataPutPartialDataArithmeticDataLogicalData
GetHint
AccessAck
AccessAckData
HintAck
AtomicOperation
15
A B C D E
Operations are composed of Messages (TL-C)
© 2017SiFive.AllRightsReserved.
PutFullDataPutPartialDataArithmeticDataLogicalData
GetHint
AcquireBlockAcquirePerm
Probe
ReleaseReleaseData
ProbeAckDataProbeAck
GrantGrantData
ReleaseAck
GrantAck
AccessAck
AccessAckData
HintAck
16
A B C D E
Operations are composed of Messages (TL-C)
© 2017SiFive.AllRightsReserved.
PutFullDataPutPartialDataArithmeticDataLogicalData
GetHint
AcquireBlockAcquirePerm
Probe
ReleaseReleaseData
ProbeAckDataProbeAck
GrantGrantData
ReleaseAck
GrantAck
AccessAck
AccessAckData
HintAck
CacheaBlock
ReleaseaBlock
17 © 2017SiFive.AllRightsReserved.
Simplifying Requirement: No agent loops!
• Bus participants (agents) form a Directed Acyclic Graph (DAG)• Prevents message loops that can deadlock the system
18 © 2017SiFive.AllRightsReserved.
Simplifying Requirement: Strict Priorities!
• Messages have one of five priorities (A < B < C < D < E)• Lower priority messages never block higher priority messages• Responses have higher priority than requests
• Simplified protocol (TL-UH) only has two priorities (A < D)• The BCE priorities prevent deadlock during cache block motion
19 © 2017SiFive.AllRightsReserved.
Simplifying Requirement: Ownership tree!
VectorCache
CoreL1$
L2$Bank0
DDRChan0
• Agents which can cache a given cache block form an ownership tree• Multiple blocks means we have an ownership forest of logical trees• Rule applies to LOGICAL tree relationship; NOT physical links!• Uncacheable blocks = a tree of one node
DDRChan1
L2$Bank1
MMIO
© 2017SiFive.AllRightsReserved.
On-chip TileLink Serialization
21 © 2017SiFive.AllRightsReserved.
• Each message priority (ABCDE) has an independent channel• Messages are transmitted using multibeat bursts• Ready-valid used to regulate beats• Variable bus width with 0+ cycle response
• Off-chip TileLink (ChipLink) makes different choices
On-chip TileLink Wire Protocol
22
Physically Independent Channel Design
© 2017SiFive.AllRightsReserved.
L1SystemBusCrossbar DDRTL-C TL-UH
L1L1L1$ L2$TL-C
TL-C TL-UH
ABCDE
A
D
23
Read Operation
© 2017SiFive.AllRightsReserved.
clock
a_ready
a_valid
a_opcode
4 4
a_size
5 5
d_ready
d_valid
d_opcode
1 1
d_size
5 5
Get Get
AccessAckData AccessAckData
24 © 2017SiFive.AllRightsReserved.
TileLink: The Foundation of SiFive’s FU500
U54-MC Coreplex
Platform-Level Interrupt ControlBoot ROM
Debug Module
TileLink Switch
DDR3/4 Controller/PHY
ChipLink
GbE
E51 Core 0
RV64IMACL1 I$
SRAM
Tile
Link
Sw
itch
Clock Generation
Quad SPI
TileLink Coherence Manager
M
M
U54 Core
RV64GC16KiB L1 I$
16KiB L1 D$
U54 Core
RV64GC16KiB L1 I$
16KiB L1 D$
U54 Core
RV64GC16KiB L1 I$
16KiB L1 D$
U54 Core 1-4
RV64GCL1 I$
L1 D$
Banked L2$
Tile
Link
Sw
itch
SD Card
Mask ROM
Clock/Reset Control
GPIOUARTI2CSPI
OTP
FPGA
PCIe/USB/MIPI
ChipLink
TileLink
TileLink Switch
YourIP Block
JTAG
FU500 Base Platform
© 2017 SiFive. All Rights Reserved.
25 © 2017SiFive.AllRightsReserved.
TileLink: The Foundation of SiFive’s FU500
U54-MC Coreplex
Platform-Level Interrupt ControlBoot ROM
Debug Module
TileLink Switch
DDR3/4 Controller/PHY
ChipLink
GbE
E51 Core 0
RV64IMACL1 I$
SRAM
Tile
Link
Sw
itch
Clock Generation
Quad SPI
TileLink Coherence Manager
M
M
U54 Core
RV64GC16KiB L1 I$
16KiB L1 D$
U54 Core
RV64GC16KiB L1 I$
16KiB L1 D$
U54 Core
RV64GC16KiB L1 I$
16KiB L1 D$
U54 Core 1-4
RV64GCL1 I$
L1 D$
Banked L2$
Tile
Link
Sw
itch
SD Card
Mask ROM
Clock/Reset Control
GPIOUARTI2CSPI
OTP
FPGA
PCIe/USB/MIPI
ChipLink
TileLink
TileLink Switch
YourIP Block
JTAG
FU500 Base Platform
© 2017 SiFive. All Rights Reserved.
TL
-UL
x1
0
26 © 2017SiFive.AllRightsReserved.
• Public spec, Open source, RMP, designed for RISC-V
• Can be bridged to your existing AMBA designs
• If you are building a RISC-V chip, why not use TileLink?
• If you want to learn more:• https://www.sifive.com/documentation/tilelink/tilelink-spec/• These slides – much more content than presented!
Conclusion
27 27
28
Messages are composed of Beats, containing:
• The unchanging message header, including• opcode: the message type• size: the base-2 logarithm of the number of bytes in the data payload
• An optional multibeat data payload• presence depends on the opcode (only *Data message type)• number of beats is calculated from the size
© 2017SiFive.AllRightsReserved.
29
Beats are regulated by ready-valid
• The receiver provides ready• If ready is LOW, the receiver must not process the beat• If ready is LOW, the sender may elect to try a different message• Ready may combinationally depend on the beat payload
• The sender provides valid + the beat payload• If valid is LOW, the payload may be an illegal TileLink message• Valid + the payload may not combinationally depend on ready
© 2017SiFive.AllRightsReserved.
30
Ready-Valid example
© 2017SiFive.AllRightsReserved.
clock
a_ready
a_valid
a_opcode 0 0 0 0 0 0 4 0
a_size 5 5 5 0 6 2 4 1
F beat 0 F-1 F-1 F-2 F-3 G-0 H-0 I-0 J-0 K-0
31
Request-Response timing
• The first beat of a response message may be presented• After an arbitrarily long delay• On the same cycle as the first beat of the request, but not before• Before all the beats of burst request have been accepted
• Responses indicate receiver has successfully sequenced the request• The L2 may send an AccessAck before the last beat of a large Put has arrived• Simple slaves can send their response in the same cycle as their request
© 2017SiFive.AllRightsReserved.
32
Read Operation
© 2017SiFive.AllRightsReserved.
Master(DMA/AXI)
L2
Readbackingmemory
D: AccessAckData
Completeoperation
Initiateoperation
A: Get
33
Read Operation
© 2017SiFive.AllRightsReserved.
clock
a_ready
a_valid
a_opcode
4 4
a_size
5 5
d_ready
d_valid
d_opcode
1 1
d_size
5 5
Get Get
AccessAckData AccessAckData
34
Write Operation
© 2017SiFive.AllRightsReserved.
Master(DMA/AXI)
L2
Writebackingmemory
D: AccessAck
Completeoperation
Initiateoperation
A: PutFullData
35
Write Operation
© 2017SiFive.AllRightsReserved.
clock
a_ready
a_valid
a_opcode
0 0
a_size
5 5
d_ready
d_valid
d_opcode
0 0
d_size
5 5
PutFullData PutFullData
AA AA
36
Atomic Operation (TL-UH)
© 2017SiFive.AllRightsReserved.
Master(E51)
L2
Writebackingmemory
D: AccessAckData
Completeoperation
Initiateoperation
A: ArithmeticData
37
Atomic Operation (TL-UH)
© 2017SiFive.AllRightsReserved.
clock
a_ready
a_valid
a_opcode
2 2 1
a_size
4 4 3
d_ready
d_valid
d_opcode
1 1 1 1
d_size
4 4 4 3
AD AD AD
AA AAAA
© 2017SiFive.AllRightsReserved.
TileLink Coherence
39
What does TL-C add?
• Masters can create local replicas of cacheable blocks• This can improve both throughput and access latency
• Additional channels, B+C+E, are needed to maintain coherence• These are used to reclaim and sequence ownership of the block
• Three new operations (Acquire, Probe, Release) for block transfer
© 2017SiFive.AllRightsReserved.
40
Who owns the block? Access Operations
© 2017SiFive.AllRightsReserved.
Reads and Writes do not affect the ownership of the block.
However, the access is coherent with the rest of the system. E51
L2
Access AccessAck
processresult
withoutcaching
access backing memory storage
41
Who owns the block? Acquire Transfer
© 2017SiFive.AllRightsReserved.
A master can acquire a copy of the cache block by requesting it from a slave.
GrantAck is required to synchronize when the L2 can reacquire the block. U54
L2
Acquire Grant
create cached copy
get current copy
GrantAck
complete operation
42
Who owns the block? Probe Transfer
© 2017SiFive.AllRightsReserved.
The L2 can reacquire a copy of the block by probing it from the U54.
ProbeAck is required to synchronize when the L2 can allow a new Acquire.
U54
L2
Probe ProbeAck
write back dirty data
get current copy
43
Who owns the block? Explicit Release Transfer
© 2017SiFive.AllRightsReserved.
A master can release its copy of the cache block back to the L2.
ReleaseAck is required to synchronize when the U54 can reacquire the block.
U54
L2
Release ReleaseAck
complete operation
write back dirty data
44
Who owns the block? Silent Release
© 2017SiFive.AllRightsReserved.
A master can release its copy of the cache block without telling the L2.
When the L2 later needs the block, a Probe will indicate its absence in U54.
Only safe for clean blocks
U54
L2
implicit new owner
drop clean block
45
Composition of TileLink transfers
• The three transfer operations can occur concurrently• Move a block through more than one agent• Acquires often come with evictions (Releases)• Acquire race between cores• Racing Releases and Acquires… more exotic combinations possible
• Correct due to careful operation sequencing rules• C:Release > B:Probe > A:Acquire
© 2017SiFive.AllRightsReserved.
46
Cache miss in the L2, Recursive Acquire
© 2017SiFive.AllRightsReserved.
U54
DDRCork
L2
Read backingmemory
Grant
CompleteoperationInitiate
operation
Grant
Acquire
Acquire GrantAck
GrantAck
47
Cache miss in the L2, Recursive Acquire + Release
© 2017SiFive.AllRightsReserved.
U54
DDRCork
L2
Read backingmemory
Grant
CompleteoperationInitiate
operation
Grant
Acquire
Acquire Release GrantAck
GrantAck
ReleaseAck
Write backingmemory
48
Bounce from L1 to L1, Acquire + Probe
© 2017SiFive.AllRightsReserved.
Acquire
Probe ProbeAck
Grant GrantAck
U54 Core3
L2
U54 Core2
49
Bounce from L1 to L1, Acquire + Probe + Release
© 2017SiFive.AllRightsReserved.
Acquire
Probe ProbeAck
Grant GrantAck
U54 Core3
L2
U54 Core2
Release ReleaseAck
50
Acquire Race
© 2017SiFive.AllRightsReserved.
Acquire
Acquire Grant Probe ProbeAck
Grant
GrantAck
GrantAck
U54 Core3
L2
U54 Core2
51
Acquire Race, Double Probe
© 2017SiFive.AllRightsReserved.
Acquire (Upgrade)
Acquire Grant Probe ProbeAck
ProbeAck GrantProbe
GrantAck
GrantAck
U54 Core3
L2
U54 Core2
52
Acquire + Release Race
© 2017SiFive.AllRightsReserved.
Acquire
Probe
Release
Grant GrantAck
ReleaseAck ProbeAck
U54 Core3
L2
U54 Core2
© 2017SiFive.AllRightsReserved.
Ordering Rules
54 © 2017SiFive.AllRightsReserved.
Block the following message types on
this address:
outer Probesinner Acquires
Master
Slave
Acquire Grant
Block the following message types on
this address:
inner Acquires
GrantAck
Await all pending
ProbeAckscomplete operation
Acquire Sequencing
Acquire is the lowest priority transfer.
Outer probes and inner releases must be nested.
55 © 2017SiFive.AllRightsReserved.
Probe Sequencing
Probes are the middle priority transfer.
Inner acquires are blocked, while inner releases must be nested.
Block the following message types on
this address:
inner Acquiresouter Probes
Master
Slave
Probe ProbeAck
write back dirty data
Await all pending
GrantAcks
56 © 2017SiFive.AllRightsReserved.
Release Sequencing
Release is the highest priority transfer.
Outer probes and inner acquires must be blocked. Block the following
message types on this address:
inner Acquiresouter Probes
inner Releases
Master
Slave
Release ReleaseAck
Await all pending Grants
complete operation
57
Rules for Forward Progress
1. The TileLink network is a directed acyclic graph2. Bounded delays are ok (DDR refresh/etc)3. Messages are atomic
• Deliver all beats of a message before starting another on the same channel• Sending one beat means you must be able to eventually send all
4. Channels have strictly increasing message priority• Sending (or sent) unanswered priority C request? lowering A or B ready is ok
Ø combinationally forwarding A valid to B-E valid is ok, but B-E to A is not
5. You must eventually send a response to every request
© 2017SiFive.AllRightsReserved.