arm® 64-bit coherent scale out over rapidio® · ace-lite asb apb ahb apb2 axi3 atb ahb-lite apb3...
TRANSCRIPT
ARM® 64-bit Coherent Scale Out over RapidIO®
Rick O’Connor [email protected]
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 1
Outline • Open Industry Collaboration
– ARM 64-bit Coherent Scale Out over RapidIO Task Group
• System Challenges & Coherent Scale Out Terminology
• What is RapidIO? • Summary
– Call for Industry Participation • Background Material
– ARM AMBA® & ACE – RapidIO Globally Shared Memory
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 2
Task Group Charter • The ARM 64-bit Coherent Scale Out over RapidIO Task Group shall
be responsible for developing a specification for multi SoC / core coherent scale out of ARM 64-bit cores with the following functionality: – coherent scale out of a few 10s to 100s cores & 10s of sockets – ARM AMBA® protocol mapping to RapidIO protocols
• AMBA 4 AXI4/ACE mapping to RapidIO protocols • AMBA 5 CHI mapping to RapidIO protocols
– Migration path from AXI4/ACE to CHI and future ARM protocols – support for GPU/DSP floating point heterogeneous systems – HW hooks and definition to support RDMA, MPI, secure boot,
authentication, SDN, Open Flow, Open Data Plane, etc – Other functionality as necessary to for performance critical
computing - support Data Center, HPC and Networking Infrastructure system development and deployment
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 3
Task Group Contributors
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 4
System Challenges Networking and Computing Infrastructure
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 5
Networking & Computing Infrastructure View
Fiber Termination 1010101010111
1010101010111 Cable
DSLAM
Residential Access
Wifi
Aggrega&on
Access
Microwave
1010101010111
1010101010111 1010101010111
Macro Basestation
Basestation Aggregation
Radio Access
Small Cell Basestation
Enterprise Data Center
Enterprise
Core
Data Center
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 6
Hybrid Cloud - Networking and Computing functions Colliding
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 7
Application specific boxes transitioning to converged, scaled heterogeneous nodes
Mobile Base
Station
Access Fibre/
uWave Aggregation Gateways Core
Data center power budget • ATCA blade: ~200W
Many power constrained form factors • Small cell: ~5W SoC • Macro cell: ~25W SoC
Move to general purpose processors Virtualisation
Separation of operational plane (hardware) and control plane Consolidation (different boxes/subsystems within a box)
Flexibility for evolving workloads
Access “Cloud”
Data Center
Distribution of Flexible Intelligence End-to-End
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 8
Pool of scalable heterogeneous nodes with increased visibility and programmability
Access to distributed intelligence for flexibility of workloads, traffic,
services Heterogeneous platforms to meet diversity of physical requirements,
workload requirements
Access “Cloud”
Processing
Storage
Acceleration
Networking
Processing
Storage
Acceleration
Networking
Processing
Storage
Acceleration
Networking
P
N
A
P
N
S A P
N
S A
System Level Coherency
• What is it? – Hold the same memory location in multiple
caches – All caches kept up to date with most recent data
• Why do we need it? – Sharing data structures between chips,
without the need for software management • How does it work?
– Multiple copies of data are allowed – Enforce sequential modification of cache lines
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 9
Software View of Coherency
• It should just work! • Without hardware coherency the options
are: – Software cache maintenance
• Requires fine details of data location • Additional processor overhead • Prone to error • Difficult to verify/validate/debug
• Improves performance, without long debug cycle
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 10
Scale Out Model - Terminology
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 11
Inter-Node
Inter-Blade Multi-Node Processing
Intra-Node
Inter-SoC Multi-core Processing
Inter-Chassis
Inter-Box Grid/Cluster Computing
Domain-Specific Middlewares / Frameworks
Distributed Applications
Interconnect Technologies SoC Backplane / Local Network Wide Area Network
Shared Memory MPI (RDMA)
Socket (TCP/UDP)
up to ~100mm
100 cores / node
50 - 100 ns
Up to 100 Gbytes/s
100mm –10m
16 – 200 blades
100ns – 20,000ns
1-10 Gbytes/s
100m+
50,000 boxes
1ms – 5s
0.1-1 Gbytes/s
Distance
Scalability
Latency
Throughput
This is our
target scal
e
out applica
&on space
ARM AMBA & RapidIO Cache Coherency
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 12
Features ARM AMBA Cache Coherency RapidIO Cache Coherency
Architecture CC-‐NUMA CC-‐NUMA
Fabric Cache Coherency for mul;-‐core SoC Cache Coherency for mul;-‐SoC System
Coherency State Model 5-‐State or smaller Coherency Model 3-‐State or larger Coherency Model
Granularity Cache-‐Line and Large Buffer 64-‐byte (baseline)
Cache-‐Line Granularity 64-‐byte (baseline)
Protocol Implementa;on
Snoopy Cache Coherency Snoop Filter Cache Coherency
Directory based Cache Coherency Other extension possible
Complimentary Cache Coherency Protocols for multi-core and multi-SoC Unified Fabric Scale-out System
What is RapidIO? Low Latency 10Gbps per lane Unified Fabric
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 13
RapidIO Overview
• Proven technology > 10 years of market deployment
• 6.25Gbps/lane (6xN) RapidIO on CPUs, DSPs, FPGAs and ASICs
• 10Gbps/lane Specification (10xN) Released Q4 2013
• 25Gbps/lane (25xN) on roadmap • Hardware termination at PHY layer • Lowest Latency Interconnect ~ 100 ns • Inherently scales to 10,000’s of nodes
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing
RapidIO Switch
FPGA DSP CPU
RapidIO Switch
• Over 100 million 10-20 Gbps ports shipped worldwide • 100% 4G/LTE interconnect market share
• 60% Global 3G & 100% China 3G interconnect market share
14
Hardware Terminated Protocol Stack
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing
Impl
emen
ted
in H
ardw
are
15
RapidIO Global Shared Memory
• CC-NUMA model – Cache Coherent – Non-Uniform Memory Access – Directory based
• Coherency for – Instruction cache – Data cache
• Processor to processor • Processor to shared memory
– Non-cached • IO data fetch • Memory management unit page table invalidation/
synchronization
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 16
Coherency Reference System
• Illustrative example – not limited to 4 sockets
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 17
ARM SOC
ARM SOC
ARM SOC
ARM SOC
4 Core ARM Cluster
4 Core ARM Cluster
RapidIO Coherency Interface
Memory
AXI/ACE/AMBA 5
Memory Directory “Coherency Granule” States
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing
ARM SOC
ARM SOC
ARM SOC
ARM SOC
1
2 3
0
“Coherency Granule” States 18
Evolution of AMBA specifications
‘95 ‘99 ‘03 ‘10 ‘11
ACE
ACE-Lite
ASB
APB
AHB
APB2
AXI3
ATB
AHB-Lite
APB3
AXI4
AXI4-Lite
AXI4-Stream
APB4
Time
AMBA 1 AMBA 2 AMBA 3 AMBA 4
‘13
AMBA 5 CHI
‘14 13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 20
ARM ACE Cache Line States
• Unique – It is only in this cache • Shared – It may be in another cache • Clean – Does not have to update main memory • Dirty – Has responsibility for updating main memory
Unique Dirty
Unique Clean
Shared Dirty
Shared Clean
Invalid
Invalid Valid
Unique Shared D
irty
Cle
an
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 21
Join Us! • Open Call for Industry Participation • The ARM 64-bit Coherent Scale Out over RapidIO
Task Group shall be responsible for developing a specification for multi node / core coherent scale out of ARM 64-bit cores with the following functionality:
• coherent scale out of a few 10s to 100s cores & 10s of sockets – ARM AMBA® protocol mapping to RapidIO protocols
• AMBA 4 AXI4/ACE mapping to RapidIO protocols • AMBA 5 CHI mapping to RapidIO protocols
– Migration path from AXI4/ACE to CHI and future ARM protocols
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 24
ARM ACE / MOESI Mapping
ARM ACE MOESI ACE Meaning
UniqueDirty M – Modified Not shared, dirty, must be wriTen back
SharedDirty O – Owned Shared, dirty, must be wriTen back to memory
UniqueClean E – Exclusive Not shared, clean
SharedClean S – Shared Shared, no need to write back, may be clean or dirty
Invalid I – Invalid Invalid
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 26
RapidIO GSM Transactions
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 27
RapidIO GSM Transactions (2)
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 28
RapidIO GSM Example 1
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing
0 0 0 0
0 0 0 1 1 0
SoC 1 Reads SoC 0
0 1 1 0 1 0 2
SoC 2 Reads With Intent to Modify
29
RapidIO GSM Example 2
13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing
1 0 0 0 3
0 SoC 3 Reads SoC 0
0 0 0 0
3 0 SoC 3 Releases Ownership
0 1 1 0
2
30