DSZOOM@wmpi2001 Uppsala Architecture Research Team (UART) 1
Implementing Low Latency Distributed Software-Based Shared Memory
Zoran Radović and Erik Hagersten
{zoranr, eh}@it.uu.se
Uppsala University, Information Technology
Department of Computer Systems
Problems with Traditional SW-DSMs
Page-sized coherence unit
  • False sharing! [e.g., Ivy, Munin, TreadMarks, Cashmere-2L, Shasta, GeNIMA, …]
Protocol agent messaging is slow
  • Most efficiency lost in interrupt/poll
[Figure: two nodes, each with CPUs, Mem, and a Prot. agent; a CPU issues LD x]
Our proposal: DSZOOM
Run entire protocol in requesting processor
  • No protocol agent communication!
Assumes user-level remote memory access
  • put, get, and atomics [e.g., InfiniBand]
Fine-grain access-control checks
  • [e.g., Shasta, Blizzard-S, Sirocco-S]
[Figure: two nodes, each with CPUs and Mem holding a DIR; the Protocol runs on the node that issues LD x and reaches the other node with an atomic on its DIR and a get]
Outline
Motivation
General DSZOOM Overview
Experimentation Environment
DSZOOM-WF Implementation Details
Performance Results
Improved DSZOOM… [SC2001]
DSZOOM Cluster
DSZOOM Nodes:
  • Each node consists of an unmodified SMP multiprocessor
  • SMP hardware keeps coherence among the caches and the memory within each SMP node
DSZOOM Cluster Network:
  • Non-coherent cluster interconnect
  • Inexpensive user-level remote memory access
  • Remote atomic operations [e.g., InfiniBand]
Current DSZOOM Hardware
Two E6000 connected through a hardware-coherent interface (Sun-WildFire) with a raw bandwidth of 800 MB/s in each direction
  • Data migration and coherent memory replication (CMR) are kept inactive
16 UltraSPARC II (250 MHz) CPUs per node and 8 GB memory
  • Memory access times: 330 ns local / 1700 ns remote (lmbench latency)
Run as 16-way SMP, 2x8 HW-ccNUMA, and 2x8 SW-DSM
Compilation Process
[Figure: compilation flow. Inputs: unmodified SPLASH-2 application, DSZOOM-WF implementation of PARMACS macros, DSZOOM-WF run-time library, coherence protocols. Tools: m4, GNU gcc, EEL. Intermediate: binary. Output: a.out]
Process and Memory Distribution
[Figure: processes are created with fork and bound to Cabinet 1 or Cabinet 2 with pset_bind. Each process has its own Stack, Text & Data, Heap, and PRIVATE_DATA. Two shared segments are created with shmget: shmid = A in the physical memory of Cabinet 1 and shmid = B in the physical memory of Cabinet 2. The processes attach them with shmat as G_MEM, Cabinet_1_G_MEM, and Cabinet_2_G_MEM ("aliasing"). The address 0x80000000 is marked in the address space.]
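The "aliasing" in the figure, where the same physical memory is visible under more than one name, can be illustrated with a minimal sketch. The slide uses System V `shmget`/`shmat`; this sketch swaps in two `mmap` mappings of one temporary file, because Python's standard library does not expose the System V calls directly. The function name `make_aliased_views` and the 4096-byte size are assumptions made for illustration only.

```python
import mmap
import tempfile


def make_aliased_views(size=4096):
    """Map the same backing file twice and return (backing, view_a, view_b)."""
    backing = tempfile.TemporaryFile()
    backing.truncate(size)                      # give the file a fixed size
    view_a = mmap.mmap(backing.fileno(), size)  # first shared mapping
    view_b = mmap.mmap(backing.fileno(), size)  # second mapping of the same pages
    return backing, view_a, view_b


# Two names for the same memory, loosely mirroring G_MEM and Cabinet_1_G_MEM.
backing, g_mem, cabinet_1_g_mem = make_aliased_views()
g_mem[0:4] = b"DATA"                            # store through one mapping
aliased_bytes = bytes(cabinet_1_g_mem[0:4])     # load through the other
print(aliased_bytes)
```

A store through one mapping is visible through the other, which is the property that lets application accesses and protocol accesses use different addresses for the same data.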
So far …
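[Figure: the compilation flow from the Compilation Process slide (unmodified SPLASH-2 application, DSZOOM-WF implementation of PARMACS macros, m4, GNU gcc, DSZOOM-WF run-time library, EEL, coherence protocols, a.out), with the binary labeled "(Un)executable"]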
Squeezing Protocols into Binaries …
Static Binary Instrumentation
EEL: Machine-independent Executable Editing Library implemented in C++
  • Replace global loads with snippets containing fine-grain access control checks
  • Insert coherence protocols
Fine-grain Access Control Checks
The “magic” value is a small integer corresponding to an IEEE floating-point NaN [Blizzard-S, Sirocco-S]
Floating-point load example:
    ld     [address],%reg     // original LD
    fcmps  %fcc0,%reg,%reg    // compare reg with itself
    fbe,pt %fcc0,hit          // if (reg == reg) goto hit
    nop                       // branch delay slot
    // miss: call global coherence load routine (Coherence Protocols)
hit:
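The check in the floating-point load example relies on a NaN never comparing equal to itself. A minimal sketch of that idea follows. The slides do not give the magic value, so `MAGIC_INT = -253` (bit pattern 0xFFFFFF03, which decodes as a single-precision NaN) is an assumption, as are `LINE_WORDS`, `invalidate`, `checked_float_load`, and the stand-in miss handler.

```python
import struct

MAGIC_INT = -253   # hypothetical magic value; 0xFFFFFF03 is a NaN as a 32-bit float
LINE_WORDS = 16    # hypothetical number of 32-bit words per coherence unit


def invalidate(memory, base):
    """Fill one coherence unit with the magic bit pattern."""
    for i in range(LINE_WORDS):
        struct.pack_into(">i", memory, base + 4 * i, MAGIC_INT)


def checked_float_load(memory, addr, miss_handler):
    """Load a float; take the slow path only if it does not equal itself."""
    (value,) = struct.unpack_from(">f", memory, addr)   # original LD
    if value == value:                                   # fcmps / fbe: not a NaN
        return value                                     # hit
    return miss_handler(memory, addr)                    # coherence load routine


def demo():
    memory = bytearray(4 * LINE_WORDS)
    struct.pack_into(">f", memory, 0, 1.5)
    misses = []

    def handler(mem, addr):
        misses.append(addr)
        struct.pack_into(">f", mem, addr, 2.5)           # pretend remote fetch
        return 2.5

    first = checked_float_load(memory, 0, handler)       # valid data: hit
    invalidate(memory, 0)
    second = checked_float_load(memory, 0, handler)      # magic value: miss
    third = checked_float_load(memory, 0, handler)       # refilled: hit again
    return first, second, third, misses
```

A genuine NaN in application data would also take the slow path, so a miss routine built this way would need to consult the coherence state to tell a real miss from a false one.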
Modified-Shared-Invalid (MSI)
Distributed DIR
One DIR_ENTRY per cache line
DIR_ENTRY (LOCK and presence bits):
  Before MEM_STORE: 0 0 0 0 0 0 0 1
  After MEM_STORE:  0 0 0 0 0 0 1 0
[Figure: G_MEM aliased ("aliasing") to Cabinet_1_G_MEM and Cabinet_2_G_MEM, showing a shared cache line, an invalid cache line, and a MEM_STORE]
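The before/after DIR_ENTRY values on this slide can be reproduced with a small sketch of how a store might update the presence bits. The bit layout is an assumption: here bit 7 is taken as LOCK and bits 0 to 6 as one presence bit per node. The function `mem_store` is hypothetical and only models the directory update, not the data transfer or the invalidation itself.

```python
LOCK_BIT = 0x80        # hypothetical position of the LOCK bit
PRESENCE_BITS = 7      # hypothetical number of presence bits


def mem_store(dir_entry, writer):
    """Return (new DIR_ENTRY, nodes to invalidate) once `writer` owns the line."""
    if dir_entry & LOCK_BIT:
        raise RuntimeError("entry is locked by another requester")
    presence = dir_entry & ~LOCK_BIT
    # Every other node that currently holds a copy must be invalidated.
    to_invalidate = [
        node for node in range(PRESENCE_BITS)
        if (presence >> node) & 1 and node != writer
    ]
    new_entry = 1 << writer    # only the writer is present; LOCK is clear again
    return new_entry, to_invalidate


new_entry, invalidated = mem_store(0b00000001, writer=1)
print(f"{new_entry:08b}", invalidated)
```

With node 0 holding the line before the store and node 1 writing, the entry moves from 00000001 to 00000010, matching the two bit patterns shown on the slide.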
Read Data from Home Node: 2-hop read
[Figure: Requestor and home node (Mem, DIR). Messages: 1a. f&s, 1b. get (returns data), 2. put]
Legend: small packet (~10 bytes); large packet (~68 bytes); message on the critical path; message off the critical path
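The message sequence of the 2-hop read can be sketched as a tiny simulation in which the requestor drives every step itself. Several things here are assumptions rather than details from the slides: f&s is read as a fetch-and-set that returns the old directory entry and sets a lock bit, the final put is taken to write back the updated entry and release the lock, and the `HomeNode` class, `remote_read`, and the bit layout are all hypothetical.

```python
LOCK_BIT = 0x80    # hypothetical position of the LOCK bit in a directory entry


class HomeNode:
    """Passive home node: memory and directory, no protocol agent."""

    def __init__(self, lines):
        self.mem = dict(lines)                  # line address -> data
        self.dir = {addr: 0 for addr in lines}  # line address -> DIR_ENTRY
        self.log = []                           # remote operations, in order

    def fetch_and_set(self, addr):
        old = self.dir[addr]
        self.dir[addr] = old | LOCK_BIT         # atomically set the lock
        self.log.append("f&s")
        return old

    def get(self, addr):
        self.log.append("get")
        return self.mem[addr]

    def put(self, addr, entry):
        self.log.append("put")
        self.dir[addr] = entry


def remote_read(home, addr, requester):
    """Run the whole read protocol on the requesting side."""
    old = home.fetch_and_set(addr)              # 1a. lock and read the directory
    if old & LOCK_BIT:
        raise RuntimeError("line busy; a real protocol would retry")
    data = home.get(addr)                       # 1b. fetch the data
    home.put(addr, old | (1 << requester))      # 2. add sharer, release the lock
    return data


home = HomeNode({0x40: b"x" * 64})
data = remote_read(home, 0x40, requester=1)
print(home.log, f"{home.dir[0x40]:08b}")
```

The home node only stores state and answers remote operations; no code runs there on behalf of the protocol, which is the point of avoiding protocol agents.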
Instrumentation Performance
Program         Problem Size                        %LD   %ST   Instrumentation Overhead
FFT             1,048,576 points (48.1 MB)          26.1  22.2  1.43
LU-Cont         1024x1024, block 16 (8.0 MB)        22.7  14.5  1.68
LU-Non-Cont     1024x1024, block 16 (8.0 MB)        23.9  16.6  1.42
Radix           4,194,304 items (36.5 MB)           24.1  14.9  1.15
Barnes-Hut      16,384 bodies (32.8 MB)             37.5  50.5  1.25
FMM             32,768 particles (8.1 MB)           25.5  22.9  1.12
Ocean-Cont      514x514 (57.5 MB)                   28.6  26.2  1.34
Ocean-Non-Cont  258x258 (22.9 MB)                   15.5  31.6  1.21
Radiosity       Room (29.4 MB)                      31.1  35.0  1.11
Raytrace        Car (32.2 MB)                       28.8  31.5  1.53
Water-nsq       2,197 mols., 2 steps (2.0 MB)       24.5  32.4  1.21
Water-sp        2,197 mols., 2 steps (1.5 MB)       25.5  27.6  1.21
Average                                             26.2  27.2  1.30
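As a quick arithmetic check on the table, the sketch below recomputes the mean and the spread of the overhead column. It assumes the overhead is a ratio of instrumented to uninstrumented execution time, so that 1.11 reads as 11% extra; the slides do not spell out the definition.

```python
# Instrumentation overheads copied from the table, one per SPLASH-2 program.
overheads = {
    "FFT": 1.43, "LU-Cont": 1.68, "LU-Non-Cont": 1.42, "Radix": 1.15,
    "Barnes-Hut": 1.25, "FMM": 1.12, "Ocean-Cont": 1.34, "Ocean-Non-Cont": 1.21,
    "Radiosity": 1.11, "Raytrace": 1.53, "Water-nsq": 1.21, "Water-sp": 1.21,
}

mean_overhead = sum(overheads.values()) / len(overheads)
# Convert the smallest and largest ratios to percentage overheads.
lowest_pct = round((min(overheads.values()) - 1) * 100)
highest_pct = round((max(overheads.values()) - 1) * 100)

print(f"mean {mean_overhead:.3f}, range {lowest_pct}% to {highest_pct}%")
```

The mean comes out at about 1.305, in line with the Average row, and the spread runs from 11% (Radiosity) to 68% (LU-Cont).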
Normalized Instrumentation Overhead Breakdown (Seq. Exec.)
[Figure: stacked bar chart, y-axis 0% to 100%; components: E6000 seq, int-LD-snippet, f-p-LD-snippet, int-ST-snippet, f-p-ST-snippet]
Results (1): Execution Times in Seconds (16 CPUs)
[Figure: bar chart, y-axis 0 to 12 seconds; series: E6000 16 Processors, ccNUMA 2x8, DSZOOM-WF 2x8 CL128]
Results (2): Normalized Execution Time Breakdowns (16 CPUs)
[Figure: stacked bar chart, y-axis 0% to 100%; components: Task, Barriers, Locks, Load, Store]
Conclusions
DSZOOM completely eliminates asynchronous messaging between protocol agents
Consistently competitive and stable performance in spite of high instrumentation overhead
  • 35% slowdown compared to hardware
  • State-of-the-art checking overheads are in the range of 5–35% (e.g., Shasta), DSZOOM: 11–68%
Improved DSZOOM… [SC2001]
Protocol/Overall optimizations
  • Coherency unit variations
Synchronization improvements
  • More balanced execution between cabinets
Better instrumentation
  • More detailed backward slice algorithm
SC2001 Teaser: Execution Times in Seconds (16 CPUs)
[Figure: bar chart, y-axis 0 to 12 seconds; series: ccNUMA 2x8, DSZOOM-WF 2x8 CL128, DSZOOM Today]