use-cases for remote memory in the unimem architecture
TRANSCRIPT
Use-cases for Remote Memory in the Unimem Architecture
C o m pute r A rc h i te c t ure a n d V L S I Sy st e ms ( C A RV ) L a b o rato r yF O RT H – I C S
Nikolaos Kallimanis Manolis Marazakis Emmanouil Skordalakis
Unimem Architecture
2
Communication mechanisms of the Unimem architecture:
1. Load/Store instructions across remote nodes.
2. Every page of physical memory is cacheable only in a single node.
3. Efficiently copying large amounts of memory from/to remote nodes.
4. Send and receive of small atomic messages in a low latency manner.
Unimem
Interconnect
Node 1 Node 2
Node 3 Node 4
3
How to exercise the Unimem
remote memory?
Unimem’s APIs
4
Apps(Unimem
optimized)
Apps(unmodified)
Apps(unmodified)
Runtimes(Unimem oriented)
Global Shared Address Space
Remote Memory SWAP
Apps(unmodified)
System Software - Unimem testbed(drivers, etc.)
HW - Unimem testbed(RDMA, virtual mailbox, virtual packetizer, remote memory access)
Unimem Sockets(modified libc)
Unimem Interfaces(Remote DMA, etc.)
Exercising Remote Memory
GSAS - Global Shared Address Space
1. Global Shared Address Space across system’s remote nodes.
2. It is mostly implemented based on mechanisms for sending/receiving small atomic messages.
3. API resembles to shared memory communication.
4. Applications can use this API for synchronization and for using remote memory.
5. Data are cached in the node that reside on → cacheable at single node.
KRAM - Remote Memory SWAP
1. It uses remote node's (unused) memory to create a SWAP.
2. It uses the Xilinx CDMA engine, or memcpy for transferring data.
3. Transparent to user application (unmodified apps).
4. Extends the memory that applications can use.
5. Data are cached in the local processor that application runs →cacheable at single node.
5
In both cases data are cached at a single node (Unimem property).
No complex hw-coherence protocols.
Flexibility.
Overview of the GSAS environment
GSAS Address Space
Node 1
0x0001-0000-0000-0000
0x0002-0000-0000-0000
0xFFFF-0000-0000-0000
0x0003-0000-0000-0000 …
Node 2
Node N
6
o64-bit address space.
oThe first 16 bits contain the routing information. node-id.
oThe remaining 48 bits indexing the memory of each node.
node-id (16 bits)
Overview of the GSAS environment
7
oThere is an atomic service at each node that serves requests.
oAtomic service is running on core 0 on every node of the system.
oApps and the atomic service communicate through small atomic messages with low latency.
oThere is a user-space library that handles the requests on the issuer side.
Unimem
Interconnect
Node 2
Packetizer Mailbox
Node 1
Packetizer Mailbox
Node 3
Packetizer Mailbox
Node 4
Packetizer Mailbox
Atomic Service
Overview of the GSAS environment
An application that uses the GSAS API is able to:o Allocation of memory in any
remote node.
o Spawning a new process on any remote node.
o Executing atomic operations (i.e., CAS, FAD, SWAP, etc.) on any remote memory location.
Low latency primitives (current prototype).o ≈ 2.0 μsec for a local issued
atomic instruction.o ≈ 3.9 μsec for a remote issued
atomic instruction.
GSAS Addresses
0x0001-0000-0000-0000
0x0002-0000-0000-0000
0xFFFF-0000-0000-0000
0x0003-0000-0000-0000
App
8
Performance/DHT on GSASGSAS use case example:oAn in-memory concurrent
Distributed Hash Table (DHT) is based on GSAS.
o It supports:DhtPut→ Store pairs of ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩
items in GSAS address space.
DhtGet→ Retrieves the value that corresponds to some 𝑘𝑒𝑦.
o Experiments on Unimem testbed (8 Trenz nodes):‒ Zynq MP Ultrascale+ SoC.
‒ 4 Arm A53 cores & 2 GB of local DDR4.
‒ Each thread executes pairs of DhtPut & DhtGet operations.
‒ Throughput of operations/sec is measured for different # of nodes.
‒ Experiments were performed for 1, 2 and 3 threads per node.
9
0
50000
100000
150000
200000
1 2 3 4 5 6 7 8
Thro
ugh
pu
t o
pe
rati
on
s/se
c
# nodes
1 Thread
2 Threads
3 Threads
KRAM – Concept/Architecture
10
250 MB are dedicated for the remote allocator service.
The maximum amount of local memory that local applications can use is 1.75 GB.
For more memory, KRAM creates a swap device to extend the memory.
𝑛0
𝑛1
Remoteallocator
KRAMKramtool
Remoteallocator
KRAMKramtool
𝑛1attached
unattached1.75 GB
1 GB
2 GB
𝑛0attached
unattached1.75 GB
0.8 GB
2 GB
Unused main memoryUsed main memory
Unused reserved memoryUsed reserved memory
KRAM -Requesting More Memory
11
1. User requests a swap device of a specific size.
2. KRAM requests memory from the local remote allocator service.
3. The remote allocator communicates with neighbor allocators for more memory.
Remoteallocator
KRAMKramtool
𝑛0
𝑛1
Remoteallocator
KRAMKramtool
𝑛1attached
unattached1.75 GB
1 GB
2 GB
𝑛0attached
unattached1.75 GB
0.8 GB
2 GB
Unused main memoryUsed main memory
Unused reserved memoryUsed reserved memory
KRAM – Allocating Remote Memory
12
4. The remote allocator of someneighbor sends the physicaladdress of the free remotememory.
5. The remote allocator sendsthe address to KRAM.
6. KRAM creates a swap device.
Remoteallocator
KRAMKramtool
𝑛0
𝑛1
Remoteallocator
KRAMKramtool
𝑛1attached
unattached1.75 GB
1 GB
2 GB
𝑛0attached
unattached1.75 GB
0.8 GB
2 GB
Unused main memoryUsed main memory
Unused reserved memoryUsed reserved memory
KRAM – Swapping
13
Now, Applications on node 𝑛0 are able to use:
- 1.75GB from local memory
- Some memory from remotes
Remoteallocator
KRAMKramtool
Remoteallocator
KRAMKramtool
𝑛1attached
unattached
1.75 GB
1 GB
2 GB
𝑛0 attached
𝑛0
𝑛1
𝑛0attached
unattached1.75 GB
0.8 GB
2 GB
Unused main memoryUsed main memory
Unused reserved memoryUsed reserved memory
KRAM – Stream Performance
14
3 nodes with specifications:• 2 GB RAM (1.75 GB main, 0.25GB reserved)• Zynq MP Ultrascale+ SoC
The Stream benchmark which calculates MB/s using the Copy, Scale, Add and Triad functions.
6 runs were performed, with dma mode and with memcpy mode
1. Local memory only.2. Remote memory of 48 MB.3. Remote memory of 96 MB.4. Remote memory of 144 MB.
1.6GB local 1.6GB local48MB swap
1.6GB local96MB swap
1.6GB local144MB swap
1.6GB local 1.6GB local48MB swap
1.6GB local96MB swap
1.6GB local144MB swap
memcpy memcpy memcpy memcpy dma dma dma dma
0
500
1000
1500
2000
2500
3000
3500
4000
4500
MB
/s
Copy Scale Add Triad
Thank You
15