Exploiting Store Locality through Permission Caching in Software DSMs

Euro-Par 2004. Håkan Zeffer, Zoran Radovic, Oskar Grenholm and Erik Hagersten. Uppsala University, Dept. of Information Technology, Div. of Computer Systems, Uppsala Architecture Research Team [UART] | www.it.uu.se/research/group/uart


Page 1: Exploiting Store Locality through Permission Caching in Software DSMs

Euro-Par 2004. Uppsala Architecture Research Team [UART] | www.it.uu.se/research/group/uart

Uppsala University, Dept. of Information Technology

Div. of Computer Systems, Uppsala Architecture Research Team [UART]

Exploiting Store Locality through Permission Caching in Software DSMs

Håkan Zeffer, Zoran Radovic, Oskar Grenholm and Erik Hagersten

Page 2: Exploiting Store Locality through Permission Caching in Software DSMs

Software Distributed Shared Memory

Page 3: Exploiting Store Locality through Permission Caching in Software DSMs

Traditional Software DSMs

Page-based coherence [e.g., Ivy, Munin, TreadMarks]

Virtual memory hardware for coherence checks
• Expensive TLB traps

Large coherence unit size
• Problem: False sharing
• Solution: Weak memory consistency models

[Figure: CPUs with data and directory (dir); a store (ST) miss issues a request (req.).]

Page 4: Exploiting Store Locality through Permission Caching in Software DSMs

Fine-Grain Software DSMs

Fine-grain access-control checks [Shasta, Blizzard]
• Relies on binary instrumentation
• Avoids operating system trapping
• Less false sharing
• Extra instructions introduce overhead

[Figure: CPUs with data and directory (dir). Checking code instrumented into the application runs before each store (ST): "if (miss) goto st_protocol", which issues a request (req.).]

Page 5: Exploiting Store Locality through Permission Caching in Software DSMs

Fine-Grain Pros and Cons

Pros
• Small coherence unit
• Hardware-like memory consistency model

Cons
• Extra check instructions to execute

Our proposal: Write Permission Cache (WPC)
• Exploits store locality
• Caches write permission
• Effectively reduces the store instrumentation cost

Page 6: Exploiting Store Locality through Permission Caching in Software DSMs

Outline

• Motivation
• Problem: Instrumentation Overhead
• Solution: Write Permission Cache
• Experimental Setup
• Results on Real HW- and SW-DSM Systems
• Conclusions

Page 7: Exploiting Store Locality through Permission Caching in Software DSMs

Software Fine-Grain Coherence

• Binary instrumentation of global loads and stores
• Inserted code "snippet" maintains coherence

Original program:

        add R1, R2 -> R3
    loop:
        ld  [R1 + R4] -> R6      // G_LD1
        ld  [R2 + R4] -> R7      // G_LD2
        sub R9, 1 -> R9
        add R6, R7 -> R8
        st  R8 -> [R3 + R4]      // G_ST1
        add R4, 4 -> R4
        bnz R9, loop
    L134:
        st  R3 -> [R7 + 4]

Instrumented program:

        add R1, R2 -> R3
    loop:
        load snippet for G_LD1
        call coherence protocol if load miss
        load snippet for G_LD2
        call coherence protocol if load miss
        sub R9, 1 -> R9
        add R6, R7 -> R8
        store snippet for G_ST1
        call coherence protocol if store miss
        add R4, 4 -> R4
        bnz R9, loop
    L134:
        st  R3 -> [R7 + 4]

Page 8: Exploiting Store Locality through Permission Caching in Software DSMs

The Lock Problem (original DSZOOM)

Example store access pattern (array traversal)

    Operation  Address     CUID  Original snippet handling
    ST         0xE22F0000  98    lock dir entry 98; store; unlock dir entry 98
    ST         0xE22F0008  98    lock dir entry 98; store; unlock dir entry 98
    ST         0xE22F0010  98    lock dir entry 98; store; unlock dir entry 98
    ST         0xE22F0018  98    lock dir entry 98; store; unlock dir entry 98
    ST         0xE22F0020  98    lock dir entry 98; store; unlock dir entry 98
    ST         0xE22F0028  98    lock dir entry 98; store; unlock dir entry 98
    ST         0xE22F0030  98    lock dir entry 98; store; unlock dir entry 98
    ST         0xE22F0038  98    lock dir entry 98; store; unlock dir entry 98
    ST         0xE22F0040  99    lock dir entry 99; store; unlock dir entry 99
    ST         0xE22F0048  99    lock dir entry 99; store; unlock dir entry 99
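The cost pattern of the array traversal above can be reproduced with a small model. This is an illustrative Python sketch, not DSZOOM code: the 64-byte coherence unit size and the `cu_id` helper are assumptions inferred from the example addresses (eight 8-byte stores share one CUID), and the unit numbers it computes are not the 98/99 values of the example.

```python
CU_SIZE = 64  # assumed coherence unit size in bytes, inferred from the example


def cu_id(addr):
    """Hypothetical address-to-coherence-unit mapping (not the real CUID numbering)."""
    return addr // CU_SIZE


def original_snippet_cost(addresses):
    """Count directory lock/unlock operations when every store locks and unlocks."""
    locks = unlocks = 0
    for _addr in addresses:
        locks += 1    # lock dir entry
        # ... the store itself would happen here ...
        unlocks += 1  # unlock dir entry
    return locks, unlocks


stores = [0xE22F0000 + 8 * i for i in range(10)]  # the ten stores of the example
distinct_units = len({cu_id(a) for a in stores})
print(original_snippet_cost(stores), distinct_units)  # (10, 10) 2
```

Ten stores pay for ten lock and ten unlock operations although they touch only two coherence units, which is the locality the rest of the talk exploits.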

Page 9: Exploiting Store Locality through Permission Caching in Software DSMs

DSZOOM Fine-Grain Coherence

Magic value (load), atomic operations (store)

Original program:

        add R1, R2 -> R3
    loop:
        ld  [R1 + R4] -> R6      // G_LD1
        ld  [R2 + R4] -> R7      // G_LD2
        sub R9, 1 -> R9
        add R6, R7 -> R8
        st  R8 -> [R3 + R4]      // G_ST1
        add R4, 4 -> R4
        bnz R9, loop
    L134:
        st  R3 -> [R7 + 4]

Instrumented program:

        add R1, R2 -> R3
    loop:
        ld  [R1 + R4] -> R6                  // original load
        if (R6 == MAGIC)                     // test permission
            LD_PROTOCOL();                   // protocol if miss
        ld  [R2 + R4] -> R7                  // original load
        if (R7 == MAGIC)                     // test permission
            LD_PROTOCOL();                   // protocol if miss
        sub R9, 1 -> R9
        add R6, R7 -> R8
        LOCK(LOCAL_DIR);                     // lock local dir
        if (LOCAL_DIR != WRITE_PERMISSION)
            ST_PROTOCOL();                   // protocol if miss
        st  R8 -> [R3 + R4]                  // original store
        UNLOCK(LOCAL_DIR);                   // unlock local dir
        add R4, 4 -> R4
        bnz R9, loop
    L134:
        st  R3 -> [R7 + 4]
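The magic-value load check in the instrumented program can be modelled in a few lines. This is an illustrative Python sketch, not DSZOOM code: the `MAGIC` constant, the `Node` class and its `ld_protocol` method are hypothetical, and letting the protocol resolve the case where real data happens to equal the magic value is an assumption about how such a scheme stays correct.

```python
MAGIC = 0xDEADBEEF  # hypothetical bit pattern stored in invalid local data


class Node:
    """Toy model of one node's private copy of global memory."""

    def __init__(self, home_memory):
        self.home = home_memory                       # stand-in for the valid remote copy
        self.local = {a: MAGIC for a in home_memory}  # all local data starts invalid
        self.protocol_calls = 0

    def ld_protocol(self, addr):
        """Load-miss handler: fetch the valid value from the home copy."""
        self.protocol_calls += 1
        self.local[addr] = self.home[addr]
        return self.local[addr]

    def load(self, addr):
        value = self.local[addr]            # original load
        if value == MAGIC:                  # test permission
            value = self.ld_protocol(addr)  # protocol if miss
        return value


node = Node({0x100: 7, 0x108: MAGIC})
first = node.load(0x100)   # miss: local copy holds MAGIC, protocol fetches 7
second = node.load(0x100)  # hit: only the compare is executed
print(first, second, node.protocol_calls)  # 7 7 1
```

The load path needs no lock or directory lookup on a hit. In this model, address 0x108 holds real data equal to `MAGIC`, so every load of it takes the protocol path and still returns the right value.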

Page 10: Exploiting Store Locality through Permission Caching in Software DSMs

Sequential Instrumentation Overhead

Average instrumentation overhead when run on a single processor (SPLASH2, -O3):

• Integer load instrumentation overhead: 3% (only integer loads instrumented)
• Float load instrumentation overhead: 31% (only floating-point loads instrumented)
• Store instrumentation overhead: 61% (only stores instrumented)

Page 11: Exploiting Store Locality through Permission Caching in Software DSMs

Write Permission Caching in Action

Example store access pattern (array traversal)

    Operation  Address     CUID  WPC snippet handling
    ST         0xE22F0000  98    check WPC; miss; upd. WPC; lock dir entry 98; store
    ST         0xE22F0008  98    check WPC; hit; store
    ST         0xE22F0010  98    check WPC; hit; store
    ST         0xE22F0018  98    check WPC; hit; store
    ST         0xE22F0020  98    check WPC; hit; store
    ST         0xE22F0028  98    check WPC; hit; store
    ST         0xE22F0030  98    check WPC; hit; store
    ST         0xE22F0038  98    check WPC; hit; store
    ST         0xE22F0040  99    check WPC; miss; unlock 98; upd. WPC; lock 99; store
    ST         0xE22F0048  99    check WPC; hit; store

Write Permission Cache contents: 98, then 99

Page 12: Exploiting Store Locality through Permission Caching in Software DSMs

The Write Permission Cache Idea

• Keep the lock
• Rely on store locality
• SPARC application registers

Original program:

        add R1, R2 -> R3
    loop:
        ld  [R1 + R4] -> R6      // G_LD1
        ld  [R2 + R4] -> R7      // G_LD2
        sub R9, 1 -> R9
        add R6, R7 -> R8
        st  R8 -> [R3 + R4]      // G_ST1
        add R4, 4 -> R4
        bnz R9, loop
    L134:
        st  R3 -> [R7 + 4]

Write Permission Cache snippet:

    WPC_FASTPATH:
        if (WPC != CU_ID(ADDR))
            WPC_SLOWPATH();
        st  R8 -> [R3 + R4];     // original store

    WPC_SLOWPATH:
        UNLOCK(WPC);
        WPC = CU_ID(ADDR);
        LOCK(WPC);
        if (LOCAL_DIR != WRITE_PERMISSION)
            ST_PROTOCOL();
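The fast path and slow path of the snippet can be modelled as follows. This is an illustrative Python sketch of a single-entry WPC, not the DSZOOM implementation: the 64-byte coherence unit size, the `WritePermissionCache` class and its bookkeeping fields are assumptions, and `st_protocol` is only a stand-in that grants write permission immediately.

```python
CU_SIZE = 64  # assumed coherence unit size in bytes


class WritePermissionCache:
    """Single-entry model of the WPC store snippet (illustrative only)."""

    def __init__(self, writable_cus=()):
        self.wpc = None                    # CU whose directory lock is being kept
        self.writable = set(writable_cus)  # CUs with local write permission
        self.locked = set()                # directory entries currently locked
        self.fast = self.slow = self.protocol_calls = 0

    def st_protocol(self, cu):
        """Stand-in for ST_PROTOCOL(): obtain write permission for cu."""
        self.protocol_calls += 1
        self.writable.add(cu)

    def slow_path(self, cu):
        self.slow += 1
        if self.wpc is not None:
            self.locked.discard(self.wpc)  # UNLOCK(WPC)
        self.wpc = cu                      # WPC = CU_ID(ADDR)
        self.locked.add(cu)                # LOCK(WPC)
        if cu not in self.writable:        # LOCAL_DIR != WRITE_PERMISSION
            self.st_protocol(cu)

    def store(self, memory, addr, value):
        cu = addr // CU_SIZE               # CU_ID(ADDR)
        if self.wpc != cu:                 # fast-path check
            self.slow_path(cu)
        else:
            self.fast += 1
        memory[addr] = value               # original store


memory = {}
cache = WritePermissionCache()
for i in range(10):                        # ten 8-byte stores, as in the array traversal
    cache.store(memory, 0xE22F0000 + 8 * i, i)
print(cache.fast, cache.slow, cache.protocol_calls)  # 8 2 2
```

For the ten-store traversal the model takes the slow path twice and the fast path eight times, and one directory entry is still locked when the loop ends, which is why the cache has to be flushed at some point.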

Page 13: Exploiting Store Locality through Permission Caching in Software DSMs

Experimental Setup: Software

Benchmarks: unmodified SPLASH2

Compiler: GCC 3.3.3 (-O0 and -O3)

Instrumentation tool: custom made

Page 14: Exploiting Store Locality through Permission Caching in Software DSMs

Experimental Setup: Hardware

SMP: Sun Enterprise E6000 Server
• 16 UltraSPARC II (250 MHz)
• Memory access time 330 ns [lmbench]

HW-DSM: Sun Wildfire (2 E6000 nodes)
• Remote memory access time 1700 ns [lmbench]
• Hardware coherent interconnect, BW 800 MB/s

DSZOOM
• Runs in user space on the Wildfire system
• get (put) = uncacheable block load (store) operation
• atomic = ldstub (load-store unsigned byte, SPARC V9)
• Maintains coherence between private copies of G_MEM

Page 15: Exploiting Store Locality through Permission Caching in Software DSMs

Write Permission Cache Hit Rate

[Figure: WPC hit rate (y-axis, 0.0 to 1.0) per SPLASH2 benchmark (fft, lu-c, lu-nc, radix, barnes, cholesky, fmm, ocean-c, ocean-nc, radiosity, raytrace, water-nsq, water-sp, average) for 1, 2, 4, 8, 16, 32 and 1024 WPC entries.]
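How the hit rate depends on the number of WPC entries can be explored with a small model. This is an illustrative Python sketch: the LRU replacement policy, the 64-byte coherence unit size and the synthetic trace (stores alternating between two arrays) are all assumptions, and the numbers it prints are not measurements from the talk.

```python
from collections import OrderedDict

CU_SIZE = 64  # assumed coherence unit size in bytes


def wpc_hit_rate(addresses, entries):
    """Hit rate of an n-entry WPC; LRU replacement is an assumed policy."""
    cache = OrderedDict()  # CU id -> True, ordered from least to most recently used
    hits = 0
    for addr in addresses:
        cu = addr // CU_SIZE
        if cu in cache:
            hits += 1
            cache.move_to_end(cu)          # mark as most recently used
        else:
            if len(cache) == entries:
                cache.popitem(last=False)  # evict LRU entry (unlock its dir entry)
            cache[cu] = True               # lock the new dir entry
    return hits / len(addresses)


# Synthetic trace: stores alternate between two arrays, a[i] and b[i].
trace = []
for i in range(64):
    trace.append(0x10000 + 8 * i)
    trace.append(0x20000 + 8 * i)

print(wpc_hit_rate(trace, 1))  # 0.0
print(wpc_hit_rate(trace, 2))  # 0.875
```

With one entry the two store streams evict each other on every access. A second entry lets each stream keep its own coherence unit, so only the first store to each unit misses.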

Page 16: Exploiting Store Locality through Permission Caching in Software DSMs

Sequential Instrumentation Overhead

[Figure: instrumentation overhead in percent (y-axis, 0% to 300%) per SPLASH2 benchmark (fft, lu-c, lu-nc, radix, barnes, cholesky, fmm, ocean-c, ocean-nc, radiosity, raytrace, water-nsq, water-sp, average) for the st, st-swpc and st-dwpc configurations.]

Page 17: Exploiting Store Locality through Permission Caching in Software DSMs

Execution Time, 16 processors (2x8)

Performance bug in paper (popc).

[Figure: normalized execution time (y-axis, 0.0 to 3.0) per SPLASH2 benchmark (fft, lu-c, lu-nc, radix, barnes, fmm, radiosity, raytrace, water-nsq, water-sp, average) for HW-DSM, DSZOOM-base and DSZOOM-dwpc.]

Page 18: Exploiting Store Locality through Permission Caching in Software DSMs

Conclusions

Write permission cache (WPC)
• Effectively reduces store instrumentation overhead
• Two entries are sufficient

Results
• Store instrumentation overhead reduction: 42%
• HW-DSM to SW-DSM gap reduction: 28%
• Parallel performance improvement: 9%

Page 19: Exploiting Store Locality through Permission Caching in Software DSMs

http://www.it.uu.se/research/group/uart

Thanks and Questions

Page 20: Exploiting Store Locality through Permission Caching in Software DSMs

Memory Consistency

The base architecture implements sequential consistency by requiring acknowledgements from all sharing nodes before a global store request is granted

Introducing the WPC in an invalidation-based environment will not weaken the memory model

WPC just extends the duration of the permission tenure before the write permission is given up

If the memory model of each node is weaker than SC, the node's memory model determines the memory model of the whole system

Page 21: Exploiting Store Locality through Permission Caching in Software DSMs

Deadlock

WPC entries are flushed at:
• Synchronization points
• Failures to acquire directory locks
• Thread termination

WPC + flag synchronization can lead to deadlock
• Timers
• Interrupt other CPUs
• Lack of forward progress
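The interaction between a kept directory lock and a waiting processor can be sketched with ordinary locks. This is an illustrative Python model, not DSZOOM code: the `DirectoryEntry` and `Wpc` classes are hypothetical, non-blocking `acquire` calls stand in for directory collisions so the example cannot hang, and the scenario (a consumer keeps the lock for the coherence unit that holds a flag the producer must write) is an assumed instance of the flag-synchronization problem.

```python
import threading


class DirectoryEntry:
    """One directory entry guarded by a lock (hypothetical model)."""

    def __init__(self):
        self.lock = threading.Lock()


class Wpc:
    """Single-entry WPC that keeps a directory lock between stores."""

    def __init__(self):
        self.entry = None

    def flush(self):
        """Give up the cached write permission by releasing its lock."""
        if self.entry is not None:
            self.entry.lock.release()
            self.entry = None

    def acquire(self, entry):
        """Slow path: release the cached entry, then try to lock the new one.
        After a directory collision the WPC stays empty (flushed)."""
        self.flush()
        if entry.lock.acquire(blocking=False):
            self.entry = entry
            return True
        return False


flag_cu = DirectoryEntry()  # coherence unit that holds the flag
producer, consumer = Wpc(), Wpc()

consumer.acquire(flag_cu)                # consumer stored near the flag, keeps the lock
blocked = not producer.acquire(flag_cu)  # producer cannot set the flag
consumer.flush()                         # e.g. triggered by a timer or a sync point
unblocked = producer.acquire(flag_cu)
print(blocked, unblocked)  # True True
```

If the consumer spins on the flag without reaching a flush point, the producer in this model never gets the lock. This is the lack of forward progress that timers or interrupts to other CPUs are meant to break.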

Page 22: Exploiting Store Locality through Permission Caching in Software DSMs

Directory Collisions

Directory collision: a requesting processor fails to acquire a directory lock

The number of directory collisions does not increase when fewer than 32 WPC entries are used

More information in the paper