arm® 64-bit coherent scale out over rapidio® · ace-lite asb apb ahb apb2 axi3 atb ahb-lite apb3...

ARM® 64-bit Coherent Scale Out over RapidIO®

Rick O’Connor [email protected]

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 1

Outline •  Open Industry Collaboration

– ARM 64-bit Coherent Scale Out over RapidIO Task Group

•  System Challenges & Coherent Scale Out Terminology

•  What is RapidIO? •  Summary

– Call for Industry Participation •  Background Material

– ARM AMBA® & ACE – RapidIO Globally Shared Memory


Task Group Charter •  The ARM 64-bit Coherent Scale Out over RapidIO Task Group shall

be responsible for developing a specification for multi SoC / core coherent scale out of ARM 64-bit cores with the following functionality: –  coherent scale out of a few 10s to 100s cores & 10s of sockets –  ARM AMBA® protocol mapping to RapidIO protocols

•  AMBA 4 AXI4/ACE mapping to RapidIO protocols •  AMBA 5 CHI mapping to RapidIO protocols

–  Migration path from AXI4/ACE to CHI and future ARM protocols –  support for GPU/DSP floating point heterogeneous systems –  HW hooks and definition to support RDMA, MPI, secure boot,

authentication, SDN, Open Flow, Open Data Plane, etc –  Other functionality as necessary to for performance critical

computing - support Data Center, HPC and Networking Infrastructure system development and deployment


Task Group Contributors


System Challenges Networking and Computing Infrastructure


Networking & Computing Infrastructure View

Fiber Termination 1010101010111

1010101010111 Cable

DSLAM

Residential Access

Wifi

Aggrega&on

Access

Microwave

1010101010111

1010101010111 1010101010111

Macro Basestation

Basestation Aggregation

Radio Access

Small Cell Basestation

Enterprise Data Center

Enterprise

Core

Data Center


Hybrid Cloud - Networking and Computing functions Colliding


Application specific boxes transitioning to converged, scaled heterogeneous nodes

Mobile Base

Station

Access Fibre/

uWave Aggregation Gateways Core

Data center power budget •  ATCA blade: ~200W

Many power constrained form factors •  Small cell: ~5W SoC •  Macro cell: ~25W SoC

Move to general purpose processors Virtualisation

Separation of operational plane (hardware) and control plane Consolidation (different boxes/subsystems within a box)

Flexibility for evolving workloads

Access “Cloud”

Data Center

Distribution of Flexible Intelligence End-to-End


Pool of scalable heterogeneous nodes with increased visibility and programmability

Access to distributed intelligence for flexibility of workloads, traffic,

services Heterogeneous platforms to meet diversity of physical requirements,

workload requirements

Access “Cloud”

Processing

Storage

Acceleration

Networking

Processing

Storage

Acceleration

Networking

Processing

Storage

Acceleration

Networking

P

N

A

P

N

S A P

N

S A

System Level Coherency

•  What is it? – Hold the same memory location in multiple

caches – All caches kept up to date with most recent data

•  Why do we need it? – Sharing data structures between chips,

without the need for software management •  How does it work?

– Multiple copies of data are allowed – Enforce sequential modification of cache lines


Software View of Coherency

•  It should just work! •  Without hardware coherency the options

are: – Software cache maintenance

•  Requires fine details of data location •  Additional processor overhead •  Prone to error •  Difficult to verify/validate/debug

•  Improves performance, without long debug cycle


Scale Out Model - Terminology


Inter-Node

Inter-Blade Multi-Node Processing

Intra-Node

Inter-SoC Multi-core Processing

Inter-Chassis

Inter-Box Grid/Cluster Computing

Domain-Specific Middlewares / Frameworks

Distributed Applications

Interconnect Technologies SoC Backplane / Local Network Wide Area Network

Shared Memory MPI (RDMA)

Socket (TCP/UDP)

up to ~100mm

100 cores / node

50 - 100 ns

Up to 100 Gbytes/s

100mm –10m

16 – 200 blades

100ns – 20,000ns

1-10 Gbytes/s

100m+

50,000 boxes

1ms – 5s

0.1-1 Gbytes/s

Distance

Scalability

Latency

Throughput

This is our

target scal

e

out applica

&on space

ARM AMBA & RapidIO Cache Coherency


Features ARM AMBA Cache Coherency RapidIO Cache Coherency

Architecture CC-‐NUMA CC-‐NUMA

Fabric Cache Coherency for mul;-‐core SoC Cache Coherency for mul;-‐SoC System

Coherency State Model 5-‐State or smaller Coherency Model 3-‐State or larger Coherency Model

Granularity Cache-‐Line and Large Buffer 64-‐byte (baseline)

Cache-‐Line Granularity 64-‐byte (baseline)

Protocol Implementa;on

Snoopy Cache Coherency Snoop Filter Cache Coherency

Directory based Cache Coherency Other extension possible

Complimentary Cache Coherency Protocols for multi-core and multi-SoC Unified Fabric Scale-out System

What is RapidIO? Low Latency 10Gbps per lane Unified Fabric


RapidIO Overview

•  Proven technology > 10 years of market deployment

•  6.25Gbps/lane (6xN) RapidIO on CPUs, DSPs, FPGAs and ASICs

•  10Gbps/lane Specification (10xN) Released Q4 2013

•  25Gbps/lane (25xN) on roadmap •  Hardware termination at PHY layer •  Lowest Latency Interconnect ~ 100 ns •  Inherently scales to 10,000’s of nodes

13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing

RapidIO Switch

FPGA DSP CPU

RapidIO Switch

•  Over 100 million 10-20 Gbps ports shipped worldwide •  100% 4G/LTE interconnect market share

•  60% Global 3G & 100% China 3G interconnect market share

14

Hardware Terminated Protocol Stack


Impl

emen

ted

in H

ardw

are

15

RapidIO Global Shared Memory

•  CC-NUMA model – Cache Coherent – Non-Uniform Memory Access – Directory based

•  Coherency for –  Instruction cache – Data cache

•  Processor to processor •  Processor to shared memory

– Non-cached •  IO data fetch •  Memory management unit page table invalidation/

synchronization


Coherency Reference System

•  Illustrative example – not limited to 4 sockets


ARM SOC

ARM SOC

ARM SOC

ARM SOC

4 Core ARM Cluster

4 Core ARM Cluster

RapidIO Coherency Interface

Memory

AXI/ACE/AMBA 5

Memory Directory “Coherency Granule” States


ARM SOC

ARM SOC

ARM SOC

ARM SOC

1

2 3

0

“Coherency Granule” States 18

ARM AMBA & ACE


Evolution of AMBA specifications

‘95 ‘99 ‘03 ‘10 ‘11

ACE

ACE-Lite

ASB

APB

AHB

APB2

AXI3

ATB

AHB-Lite

APB3

AXI4

AXI4-Lite

AXI4-Stream

APB4

Time

AMBA 1 AMBA 2 AMBA 3 AMBA 4

‘13

AMBA 5 CHI

‘14 13-Nov-2014 RapidIO - the Unified Fabric for Performance Critical Computing 20

ARM ACE Cache Line States

•  Unique – It is only in this cache •  Shared – It may be in another cache •  Clean – Does not have to update main memory •  Dirty – Has responsibility for updating main memory

Unique Dirty

Unique Clean

Shared Dirty

Shared Clean

Invalid

Invalid Valid

Unique Shared D

irty

Cle

an


Summary


Join Us!


Join Us! •  Open Call for Industry Participation •  The ARM 64-bit Coherent Scale Out over RapidIO

Task Group shall be responsible for developing a specification for multi node / core coherent scale out of ARM 64-bit cores with the following functionality:

•  coherent scale out of a few 10s to 100s cores & 10s of sockets –  ARM AMBA® protocol mapping to RapidIO protocols

•  AMBA 4 AXI4/ACE mapping to RapidIO protocols •  AMBA 5 CHI mapping to RapidIO protocols

–  Migration path from AXI4/ACE to CHI and future ARM protocols


Background Material


ARM ACE / MOESI Mapping

ARM ACE MOESI ACE Meaning

UniqueDirty M – Modified Not shared, dirty, must be wriTen back

SharedDirty O – Owned Shared, dirty, must be wriTen back to memory

UniqueClean E – Exclusive Not shared, clean

SharedClean S – Shared Shared, no need to write back, may be clean or dirty

Invalid I – Invalid Invalid


RapidIO GSM Transactions


RapidIO GSM Transactions (2)


RapidIO GSM Example 1


0 0 0 0

0 0 0 1 1 0

SoC 1 Reads SoC 0

0 1 1 0 1 0 2

SoC 2 Reads With Intent to Modify

29



1 0 0 0 3

0 SoC 3 Reads SoC 0

0 0 0 0

3 0 SoC 3 Releases Ownership

0 1 1 0

2

30



0 0 0 0 3 0

SoC 3 Requests Data Flush to Memory

0 1 1 0

2

0 0 0 0

non-coherent IO Peripheral Reads Memory

1 0

31

arm® 64-bit coherent scale out over rapidio® · ace-lite asb apb ahb apb2 axi3 atb ahb-lite apb3...

Documents