Uppsala University, Information Technology
Department of Computer Systems, Uppsala Architecture Research Team (UART)

Removing the Overhead from Software-Based Shared Memory

Zoran Radovic and Erik Hagersten
{zoranr, eh}@it.uu.se
Supercomputing 2001 Uppsala Architecture Research Team (UART)
Problems with Traditional SW-DSMs
- Page-sized coherence unit leads to false sharing! [e.g., Ivy, Munin, TreadMarks, Cashmere-2L, GeNIMA, …]
- Protocol-agent messaging is slow; most efficiency is lost in interrupt/poll
[Figure: a LD x on one node's CPUs triggers messaging between the protocol agents attached to each node's memory.]
Our Proposal: DSZOOM
- Run the entire protocol in the requesting processor
- No protocol-agent communication!
- Assumes user-level remote memory access: put, get, and atomics [e.g., InfiniBand(SM)]
- Fine-grain access-control checks [e.g., Shasta, Blizzard-S, Sirocco-S]
[Figure: on a LD x, protocol code on the requesting node issues atomic and get/put operations directly against the distributed DIR and remote memory; no protocol agent runs on the home node.]
Outline
- Motivation
- General DSZOOM Overview
- DSZOOM-WF Implementation Details
- Experimentation Environment
- Performance Results
- Conclusions
DSZOOM Cluster
DSZOOM nodes:
- Each node consists of an unmodified SMP multiprocessor
- SMP hardware keeps coherence among the caches and the memory within each SMP node
DSZOOM cluster network:
- Non-coherent cluster interconnect
- Inexpensive user-level remote memory access
- Remote atomic operations [e.g., InfiniBand(SM)]
Squeezing Protocols into Binaries …
Static binary instrumentation with EEL, a machine-independent executable editing library implemented in C++:
- Instrument global LOADs with snippets containing fine-grain access-control checks
- Instrument global STOREs with MTAG snippets
- Insert calls to coherence protocols implemented in C
Fine-grain Access Control Checks
The "magic" value is a small integer corresponding to an IEEE floating-point NaN [e.g., Blizzard-S, Sirocco-S].
Floating-point load example:

1: ld     [address], %reg    // original LOAD
2: fcmps  %fcc0, %reg, %reg  // compare reg with itself
3: fbe,pt %fcc0, hit         // if (reg == reg) goto hit
4: nop
5:                           // call global coherence load routine (C code)
hit:
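The idea behind this check can be sketched in C: an invalid line is filled with a "magic" NaN bit pattern, so a single self-comparison fails exactly when the data is invalid. The bit pattern and function names below are illustrative assumptions, not DSZOOM's actual values.

```c
#include <stdint.h>
#include <string.h>

/* Assumed magic value: a quiet NaN bit pattern (DSZOOM's real value is
   a small integer that also reads as an IEEE NaN). */
#define MAGIC_NAN_BITS 0x7fc00001u

/* Mark a word invalid by storing the magic NaN bit pattern. */
static void invalidate_word(float *addr) {
    uint32_t bits = MAGIC_NAN_BITS;
    memcpy(addr, &bits, sizeof bits);
}

/* Instrumented load: the fast path is one load plus one self-compare,
   mirroring the fcmps/fbe snippet above. */
static float checked_load(float *addr) {
    float v = *addr;
    if (v == v)      /* a NaN compares unequal to itself */
        return v;    /* hit: data is valid, no protocol action */
    /* miss: the global coherence load routine would run here */
    return v;
}
```

The fast path costs only the compare and a predicted branch, which is why the scheme keeps instrumentation overhead low.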
Blocking Directory Protocols
- Originally proposed to simplify the design and verification of HW-DSMs; eliminates race conditions
- DSZOOM implements a distributed version of a blocking protocol
[Figure: the directory is distributed across the nodes; Node 0's G_MEM holds one DIR_ENTRY per cache line, each with a LOCK flag and presence bits. Example: presence bits 0 1 1 0 1 0 0 0 before a MEM_STORE, 0 0 0 0 0 0 0 1 after.]
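A minimal sketch of such a directory entry, with an assumed one-bit lock and an eight-bit presence vector (the actual DSZOOM encoding may differ):

```c
#include <stdint.h>

/* One blocking directory entry per cache line: a LOCK flag plus
   presence bits (bit i set means node i holds a copy).  Field widths
   are assumptions for illustration. */
typedef struct {
    uint8_t lock;       /* held while a coherence action is in flight */
    uint8_t presence;   /* one bit per node */
} dir_entry;

/* After a MEM_STORE the writer becomes the exclusive holder: all other
   presence bits are cleared, as in the before/after example above. */
static void dir_record_store(dir_entry *e, int writer_node) {
    e->lock = 1;        /* blocking: competing requestors spin until release */
    e->presence = (uint8_t)(1u << writer_node);
    e->lock = 0;
}
```

Because the lock serializes all actions on a line, the protocol never has to handle transient races, which is what makes the blocking design simple to verify.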
Global Coherency Action: Read Data from Home Node (2-hop Read)
[Figure: the requestor executes LD x, then issues (1a) a fetch-and-set on the home node's DIR, (1b) a get of the data, and (2) a put to release. Legend: small packet ≈ 10 bytes, large packet ≈ 68 bytes; 1a and 1b are on the critical path, 2 is off it.]
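The 2-hop read can be sketched as requestor-side code. The remote fetch-and-set, get, and put are simulated with plain memory operations here; on the real cluster they map to user-level interconnect primitives, and all names below are illustrative.

```c
#include <stdint.h>
#include <string.h>

enum { LINE_BYTES = 64 };   /* assumed coherence-unit size */

/* Home-node state for one line: blocking DIR entry plus the data. */
typedef struct {
    uint8_t lock;
    uint8_t presence;
    uint8_t data[LINE_BYTES];
} home_line;

/* 1a. f&s: small packet on the critical path (simulated locally). */
static int remote_fetch_and_set(uint8_t *lock) {
    int old = *lock;
    *lock = 1;
    return old;
}

/* Requestor-side 2-hop read, run entirely by the requesting processor. */
static void two_hop_read(home_line *home, int my_node, uint8_t *dst) {
    while (remote_fetch_and_set(&home->lock))   /* 1a. lock the DIR entry */
        ;                                       /* blocking protocol: spin */
    memcpy(dst, home->data, LINE_BYTES);        /* 1b. get data (large packet) */
    home->presence |= (uint8_t)(1u << my_node); /* record the new sharer */
    home->lock = 0;                             /* 2. put: release, off the critical path */
}
```

Only the f&s and the get are on the critical path; the releasing put overlaps with the requestor's continued execution.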
Global Coherency Action: Read Data Modified in a Third Node (3-hop Read)
[Figure: the requestor executes LD x, then issues (1) a fetch-and-set on the home node's DIR, (2a) a fetch-and-set on the third node's MTAG, (2b) a get of the dirty data, and (3a, 3b) puts that complete the action.]
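One plausible requestor-side reading of the 3-hop sequence, again with the remote operations simulated as plain memory accesses. Treating 3a as writing the dirty data back home and 3b as releasing the locks is our assumption about the figure, not something the slide states.

```c
#include <stdint.h>
#include <string.h>

enum { LINE = 64 };   /* assumed coherence-unit size */

typedef struct { uint8_t lock; uint8_t data[LINE]; } mtag_line;               /* third node */
typedef struct { uint8_t lock; uint8_t presence; uint8_t data[LINE]; } dir_line; /* home node */

static int fetch_and_set(uint8_t *lock) { int old = *lock; *lock = 1; return old; }

/* Requestor-side 3-hop read of a line that is dirty in a third node. */
static void three_hop_read(dir_line *home, mtag_line *third,
                           int my_node, uint8_t *dst) {
    while (fetch_and_set(&home->lock)) ;   /* 1.  f&s on the home DIR */
    while (fetch_and_set(&third->lock)) ;  /* 2a. f&s on the third node's MTAG */
    memcpy(dst, third->data, LINE);        /* 2b. get the dirty data */
    memcpy(home->data, dst, LINE);         /* 3a. put data back home (assumed) */
    home->presence |= (uint8_t)(1u << my_node);
    third->lock = 0;                       /* 3b. put: release locks (assumed) */
    home->lock = 0;
}
```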
Compilation Process
[Figure: the unmodified SPLASH-2 application and the DSZOOM-WF implementation of the PARMACS macros are expanded with m4 and compiled with GNU gcc into an (un)executable a.out; EEL then instruments the binary and links in the coherence protocols (C code) and the DSZOOM-WF run-time library.]
Instrumentation Performance

Program          Problem Size                     %LD    %ST    Instrumentation Overhead
FFT              1,048,576 points (48.1 MB)       19.0   16.5   1.38
LU-Cont          1024×1024, block 16 (8.0 MB)     15.5    9.4   1.59
LU-Non-Cont      1024×1024, block 16 (8.0 MB)     16.7   11.1   1.50
Radix            4,194,304 items (36.5 MB)        15.6   11.6   1.13
Barnes-Hut       16,384 bodies (32.8 MB)          23.8   31.1   1.03
FMM              32,768 particles (8.1 MB)        17.5   13.6   1.06
Ocean-Cont       514×514 (57.5 MB)                27.0   23.9   1.34
Ocean-Non-Cont   258×258 (22.9 MB)                11.6   28.0   1.24
Radiosity        Room (29.4 MB)                   26.3   27.2   1.07
Raytrace         Car (32.2 MB)                    19.0   18.1   1.21
Water-nsq        2,197 mols., 2 steps (2.0 MB)    13.4   16.2   1.06
Water-sp         2,197 mols., 2 steps (1.5 MB)    15.7   13.9   1.09
Average                                           18.4   18.3   1.22
Instrumentation Breakdown: Sequential Execution
[Bar chart: sequential execution time per benchmark (FFT, LU-c, LU-nc, Radix, Barnes, FMM, Ocean-c, Ocean-nc, Radiosity, Raytrace, Water-nsq, Water-sp), normalized to 100%, broken down into E6000 sequential time plus int-LD-snippet, f-p-LD-snippet, int-ST-snippet, and f-p-ST-snippet instrumentation overheads.]
Current DSZOOM Hardware
- Two E6000s connected through a hardware-coherent interface (Sun WildFire) with a raw bandwidth of 800 MB/s in each direction; data migration and coherent memory replication (CMR) are kept inactive
- 16 UltraSPARC II (250 MHz) CPUs and 8 GB memory per node
- Memory access times: 330 ns local / 1700 ns remote (lmbench latency)
- Run as a 16-way SMP, a 2×8 CC-NUMA, and a 2×8 SW-DSM
Process and Memory Distribution
[Figure: processes are created with fork and bound to processor sets with pset_bind, one group per cabinet. Each cabinet's physical memory exports a shared segment via shmget (shmid = A in Cabinet 1, shmid = B in Cabinet 2), and every process attaches both segments with shmat. Each process's address space holds its private Stack, Text & Data, Heap, and PRIVATE_DATA regions, plus the global memory G_MEM (composed of Cabinet_1_G_MEM and Cabinet_2_G_MEM) mapped at the same virtual address, 0x80000000, in every process ("aliasing").]
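The shmget/shmat pattern from the figure can be sketched as below. The segment size is arbitrary, and, unlike DSZOOM, which attaches G_MEM at the fixed address 0x80000000 in every process, the sketch lets the kernel choose the attach address so it runs anywhere; the function names are ours.

```c
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stddef.h>

/* Create one cabinet's shared segment (cf. shmid = A / shmid = B). */
static int create_g_mem(size_t size) {
    return shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
}

/* Attach the segment.  DSZOOM would pass the fixed address 0x80000000
   as the second argument in every process so that global pointers are
   identical cluster-wide ("aliasing"); NULL lets the kernel choose. */
static void *attach_g_mem(int shmid) {
    return shmat(shmid, NULL, 0);
}
```

Attaching at one fixed virtual address in all processes is what lets instrumented loads and stores use a global pointer directly, without per-process address translation.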
Results (1): Execution Times in Seconds (16 CPUs)
[Bar chart: execution time in seconds (0–10) per benchmark for E6000 16 CPUs (HW), CC-NUMA 2×8 (HW), DSZOOM-WF 1×16, DSZOOM-EMU 2×8, and DSZOOM-WF 2×8; the DSZOOM configurations are EEL-instrumented (SW).]
Results (2): Normalized Execution Time Breakdowns (16 CPUs)
[Bar chart: normalized execution time per benchmark for the EEL-instrumented SW 2×8 configuration, broken down into Task, ILC, Barriers, Locks, Load, and Store components.]
Conclusions
- DSZOOM completely eliminates asynchronous messaging between protocol agents
- Consistently competitive and stable performance in spite of high instrumentation overhead: about a 30% slowdown compared to hardware
- State-of-the-art checking overheads are in the range of 5–35% (e.g., Shasta); DSZOOM: 3–59%
DSZOOM's Home Page
http://www.it.uu.se/research/group/uart