Extended Memory Semantics for Thread Synchronization

Sheng Li, Ying Zhou

Operating System Progress Report

Nov 1st, 2007

Problems

Hardware multithreading is no longer exclusive to supercomputing; it is already part of mainstream microprocessors.

E.g., Sun Niagara 2 has 64 threads/chip and 256 threads/server.

Concurrency management is one of the biggest challenges in multithreaded systems

Key requirement: Low overhead and scalable thread synchronization

Synchronization mechanisms

Atomic primitives (Test-and-Set, Compare-and-Swap, LL/SC)

Software routines built on them have poor performance and scalability

Empty/Full bits: an extension bit on each memory location denotes its empty/full state.

Better performance [1], but still not enough
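
To make the first mechanism concrete, here is a minimal sketch of a software lock built on test-and-set. Python has no user-level test-and-set instruction, so the atomicity is emulated with an internal `threading.Lock`; the class and function names (`EmulatedFlag`, `SpinLock`, `run_spinlock_demo`) are hypothetical and not taken from the slides.

```python
import threading


class EmulatedFlag:
    """One memory word with an emulated atomic test-and-set (assumption)."""

    def __init__(self):
        self._value = False
        self._atomic = threading.Lock()  # stands in for hardware atomicity

    def test_and_set(self):
        # Atomically set the flag and return its previous value.
        with self._atomic:
            old = self._value
            self._value = True
            return old

    def clear(self):
        with self._atomic:
            self._value = False


class SpinLock:
    """Lock built in software on top of test-and-set."""

    def __init__(self):
        self._flag = EmulatedFlag()

    def acquire(self):
        # Every waiting thread keeps retrying on the same word.
        while self._flag.test_and_set():
            pass

    def release(self):
        self._flag.clear()


def run_spinlock_demo(num_threads=4, increments=200):
    lock = SpinLock()
    counter = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            counter[0] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

All waiters retry on one shared word, which is one plausible reason such routines scale poorly as the thread count grows.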


Our Goal

Solve the synchronization bottleneck by using Extended Memory Semantics

Better performance and scalability

Quantify the performance gain when using EMS, compared to other synchronization mechanisms (e.g., Empty/Full bits)


Extended Memory Semantics

Memory instructions are characterized by their synchronization behavior.

Load.ff, Load.fe, Store.xf, Store.ef, Store.xe (f: full, e: empty, x: don't care)

64 bits of data/metadata

Extension bit
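
The slides list the instruction names but not their exact semantics. The sketch below assumes the usual reading of the suffixes: the first letter is the state the word must be in before the access, the second is the state it is left in afterwards. It models a single word and its extension bit with a `threading.Condition`; `EMSWord` and `producer_consumer` are hypothetical names, and a real EMS implementation would block in hardware or in the EMS handler rather than on a Python condition variable.

```python
import threading


class EMSWord:
    """Toy model of one memory word plus its full/empty extension bit."""

    def __init__(self):
        self._value = 0
        self._full = False                  # extension bit: False = empty
        self._cond = threading.Condition()

    def _wait_for(self, want_full):
        # Block the caller until the extension bit matches the precondition.
        while self._full != want_full:
            self._cond.wait()

    def load_ff(self):
        # Assumed semantics: wait for full, read, leave full.
        with self._cond:
            self._wait_for(True)
            return self._value

    def load_fe(self):
        # Assumed semantics: wait for full, read, leave empty.
        with self._cond:
            self._wait_for(True)
            self._full = False
            self._cond.notify_all()
            return self._value

    def store_ef(self, value):
        # Assumed semantics: wait for empty, write, leave full.
        with self._cond:
            self._wait_for(False)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def store_xf(self, value):
        # Assumed semantics: ignore the current state, write, leave full.
        with self._cond:
            self._value = value
            self._full = True
            self._cond.notify_all()

    def store_xe(self, value):
        # Assumed semantics: ignore the current state, write, leave empty.
        with self._cond:
            self._value = value
            self._full = False
            self._cond.notify_all()


def producer_consumer(items):
    """Hand items from the main thread to a consumer through one EMS word."""
    word = EMSWord()
    received = []

    def consumer():
        for _ in items:
            received.append(word.load_fe())

    t = threading.Thread(target=consumer)
    t.start()
    for item in items:
        word.store_ef(item)   # blocks until the consumer has emptied the word
    t.join()
    return received
```

Under the same assumptions, a lock falls out directly: treat "full" as unlocked, acquire with `load_fe`, and release with `store_ef`.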


EMS handler

There is no free lunch… the EMS handler has overhead

Creating the handler threads

Queuing up memory requests and building the data structure
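
The slides do not describe the handler's data structure. As an illustration only, the sketch below assumes a per-word FIFO queue: a request whose precondition is not met is parked in the queue, and parked requests are retried whenever another request changes the word's state. `HandledWord` and its methods are hypothetical, and only `load.fe` and `store.ef` are modeled.

```python
from collections import deque


class HandledWord:
    """Word whose blocked requests are parked in a software queue (toy model)."""

    def __init__(self):
        self.value = None
        self.full = False
        self.pending = deque()   # requests that could not complete yet
        self.completed = []      # (thread_id, op, value) in completion order

    def _try(self, thread_id, op, value):
        # Return True if the request completes in the current state.
        if op == "load.fe" and self.full:
            self.completed.append((thread_id, op, self.value))
            self.full = False
            return True
        if op == "store.ef" and not self.full:
            self.value, self.full = value, True
            self.completed.append((thread_id, op, value))
            return True
        return False

    def _drain(self):
        # Retry parked requests in FIFO order until the head is blocked again.
        while self.pending and self._try(*self.pending[0]):
            self.pending.popleft()

    def issue(self, thread_id, op, value=None):
        if op not in ("load.fe", "store.ef"):
            raise ValueError("unsupported operation: " + op)
        if self._try(thread_id, op, value):
            self._drain()        # the state changed, so waiters may proceed
        else:
            self.pending.append((thread_id, op, value))
```

In this model the queue manipulation is the software cost; requests that find the word in the right state never touch the queue.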


What we have done so far

Built the EMS model, covering both architecture and OS aspects, in the Structural Simulation Toolkit (SST)

SST is the simulation environment for massively lightweight multithreading, developed at Notre Dame and Sandia National Laboratories

Modified glibc to use EMS

Especially the pthread library

Designed benchmarks for different categories

Ran the simulations to evaluate EMS performance


Tightly Coupled Parallel

Each thread competes with the others for the only lock before updating the counter

Very high contention, worst case
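
The benchmark source is not part of the slides. A minimal sketch of the access pattern described above, using Python's `threading.Lock` as a stand-in for the EMS-backed pthread mutex, could look like this (`tightly_coupled` is a hypothetical name):

```python
import threading


def tightly_coupled(num_threads, updates_per_thread):
    counter = [0]
    lock = threading.Lock()   # the only lock; every thread contends for it

    def worker():
        for _ in range(updates_per_thread):
            with lock:
                counter[0] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```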


Loosely Coupled Parallel

Each thread competes with the others for locks before updating the counters.

Mild contention
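
As a sketch of this pattern, the example below gives each counter its own lock, so two threads collide only when they pick the same counter at the same time. The way threads are spread over counters (`(thread_id + i) % num_counters`) is an assumption for illustration, as is the name `loosely_coupled`.

```python
import threading


def loosely_coupled(num_threads, updates_per_thread, num_counters):
    counters = [0] * num_counters
    locks = [threading.Lock() for _ in range(num_counters)]

    def worker(thread_id):
        for i in range(updates_per_thread):
            slot = (thread_id + i) % num_counters   # spread updates over counters
            with locks[slot]:
                counters[slot] += 1

    threads = [threading.Thread(target=worker, args=(tid,))
               for tid in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counters
```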


Embarrassingly Parallel

No contention, no locks
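
For comparison, a sketch of the lock-free case: each thread works on private state and publishes one result into a slot that no other thread writes. `embarrassingly_parallel` is a hypothetical name, and the per-thread work is a placeholder loop.

```python
import threading


def embarrassingly_parallel(num_threads, updates_per_thread):
    partial = [0] * num_threads   # one private slot per thread

    def worker(thread_id):
        local = 0
        for _ in range(updates_per_thread):
            local += 1
        partial[thread_id] = local   # only this thread writes this slot

    threads = [threading.Thread(target=worker, args=(tid,))
               for tid in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return partial
```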


Embarrassingly parallel and loosely coupled parallel

Low synchronization overhead, guaranteed by EMS

EMS shows very good scalability

[Figure: Speedup vs. number of threads (32, 64, 128, 256, 512, 1024). Series: Ideal, Gene-Embarrassingly Parallel, Gene-Loosely Coupled.]

[Figure: Synchronization distribution. Categories: Non-Contention, Hardware Supported, Software Supported (EMS Handler). Labels: 4.78%, 11%, 84.2%; ~5.1G synchronization operations; ~2.5G synchronization operations.]


Tightly Coupled Parallel

Poor performance for EMS in the worst case

Most threads are used for synchronization, not for real work

[Figure: Execution time (cycles, log scale from 10^5 to 10^7) vs. number of competing threads (1 to 128). Series: Serial, Parallel Using EMS.]

[Figure: Number of threads (0 to 1600) vs. number of competing threads (1 to 128). Series: EMS handlers, Total Threads.]


The Road Ahead

Build/complete other synchronization mechanisms (e.g., Empty/Full bits) in SST

Modify glibc to support other synchronization mechanisms

Compare performance between EMS and other synchronization mechanisms


Thank you!

Questions?


Bibliography

[1] Larry Carter, John Feo, Allan Snavely, "Performance and Programming Experience on the Tera MTA," PPSC, 1999.


Backup Slides


Lightweight Threads

Thread context (frame) is 32 double words (256 bytes)

Two double words are reserved for the thread status; the other 30 are general-purpose registers.

No other per-thread state, which makes multithreading easy.

Frames are stored in memory (no register file)

Registers are aliases for memory locations
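
The arithmetic and the register aliasing can be sketched as follows. The slide gives the sizes but not the layout, so placing the two status double words at the start of the frame, the little-endian unsigned encoding, and all names here are assumptions for illustration.

```python
import struct

DOUBLE_WORD_BYTES = 8
FRAME_DOUBLE_WORDS = 32
FRAME_BYTES = DOUBLE_WORD_BYTES * FRAME_DOUBLE_WORDS        # 32 * 8 = 256
STATUS_DOUBLE_WORDS = 2                                     # assumed to come first
NUM_REGISTERS = FRAME_DOUBLE_WORDS - STATUS_DOUBLE_WORDS    # 30


def register_address(frame_base, reg):
    """Memory address aliased by general-purpose register `reg` (assumed layout)."""
    if not 0 <= reg < NUM_REGISTERS:
        raise IndexError("register out of range")
    return frame_base + (STATUS_DOUBLE_WORDS + reg) * DOUBLE_WORD_BYTES


def write_register(memory, frame_base, reg, value):
    # A register write is just a 64-bit store into the thread's frame.
    struct.pack_into("<Q", memory, register_address(frame_base, reg), value)


def read_register(memory, frame_base, reg):
    # A register read is just a 64-bit load from the thread's frame.
    return struct.unpack_from("<Q", memory, register_address(frame_base, reg))[0]


memory = bytearray(4 * FRAME_BYTES)            # room for four thread frames
write_register(memory, FRAME_BYTES, 0, 42)     # second frame, register 0
```

Because a thread's whole state is its 256-byte frame, switching threads in this model only requires changing the frame base address.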