multicore locks: the case is not closed yet · multicore locks: the case is not closed yet hugo...

40
Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu´ ema June 24, 2016 Universit´ e Grenoble Alpes Grenoble INP

Upload: others

Post on 17-Jun-2020

4 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Multicore Locks: The Case is not Closed Yet

Hugo Guiroux, Renaud Lachaize, Vivien Quema

June 24, 2016

Universite Grenoble Alpes

Grenoble INP

Page 2: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Synchronization on Modern

Multicore Machines

Page 3: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Synchronization

• Most multi-threaded applications require synchronization.

• As the number of cores increases, the synchronization

primitives become a bottleneck.

• The design of efficient multicore locks is still a hot research

topic: (e.g., [ASPLOS’10], [ATC’12], [OLS’12], [PPoPPP’12],

[SOSP’13], [OOPSLA’14], [PPoPP’15], [PPoPP’16]).

1

Page 4: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Pthread Mutex Lock

Lock-based synchronization:

pthread mutex lock(&mutex);

// Critical section:

// at most 1 thread here at

// a time

...

pthread mutex unlock(&mutex);

Plethora of locking algorithms.

Goals:

• Performance

• Throughput: at high

contention

• Latency: at low contention

• Fairness

• Energy efficiency

2

Page 5: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Pthread Mutex Lock

Lock-based synchronization:

pthread mutex lock(&mutex);

// Critical section:

// at most 1 thread here at

// a time

...

pthread mutex unlock(&mutex);

Plethora of locking algorithms.

Goals:

• Performance

• Throughput: at high

contention

• Latency: at low contention

• Fairness

• Energy efficiency

2

Page 6: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Problem Statement I

• Applications suffer from lock contention

• Plethora of locks algorithms

• Developers are puzzled:

• Does it really matters for my application/my setup?

• How to choose?

• Will the chosen lock perform reasonably well on most setups?

• Should we simply discard old/simple locks?

3

Page 7: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Problem Statement II

• Previous studies:

• Are mostly based on microbenchmarks . . .

• . . . or on workloads for which a new lock was specifically

designed

• Do not consider state-of-the-art algorithms that are known to

perform well (e.g., recent hierarchical locks) or important

parameters (e.g., the choice of waiting policy)

4

Page 8: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Contributions

1. Extended study:

5

Page 9: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Contributions

1. Extended study:

27 locks

5

Page 10: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Contributions

1. Extended study:

27 locks

39 applications

5

Page 11: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Contributions

1. Extended study:

27 locks

39 applications

3 machines

5

Page 12: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Contributions

1. Extended study:

27 locks

39 applications

3 machines

With/withoutpinning

5

Page 13: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Contributions

1. Extended study:

27 locks

39 applications

3 machines

With/withoutpinning

2. Library for transparent replacement of the lock algorithm

5

Page 14: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Outline

Locks Algorithms

LiTL: Library for Transparent Lock Interposition

Study of lock/application behavior

Conclusion

6

Page 15: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Taxonomy of Multicore Locks Algorithms I

Flat Algorithms

• e.g., Spinlock, Backoff

• Principle:

• Loop on a single memory address

• Use atomic instruction

• Pros:

• Very fast under low lock contention

• Cons:

• Collapse under high contention due to cache coherence traffic

7

Page 16: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Taxonomy of Multicore Locks Algorithms II

Queue-Based Algorithms

• e.g., MCS, CLH

• Principle:

• List of waiting threads

• Each thread spins on a private variable

• Pros:

• Mitigation of cache invalidations

• Fairness

• Cons:

• Inefficient lock handover if successor has been descheduled

• Memory locality issue (lack of NUMA awareness)

8

Page 17: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Taxonomy of Multicore Locks Algorithms III

Hierarchical Algorithms

• e.g., Cohort locks, AHMCS

• Principle:

• One local lock per NUMA node + one global lock

• Per-node batching

• Pros:

• Good behavior on NUMA hierarchies under high contention

• Cons:

• Short-time unfairness

• High costs under low lock contention

9

Page 18: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Taxonomy of Multicore Locks Algorithms IV

Load-control Algorithms

• e.g., MCS-TimePub, Malthusian locks

• Principle:

• Bypass threads in the waiting list

• Reduce the number of threads trying to acquire the lock

• Pros:

• Better resilience under resource contention

• Cons:

• Fairness

10

Page 19: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Taxonomy of Multicore Locks Algorithms V

Delegation-Based Algorithms

• e.g., RCL, CC-Synch

• Principle:

• One thread executes the critical section on behalf of the others

• Not general purpose, designed for highly contended locks

• Not considered here:

• Need to rewrite the code application

• Does not support thread-local data, nested locking, . . .

11

Page 20: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Waiting Policy

• Another design dimension (for most locks)

• What should a thread do while waiting for a lock?

• Park: sleep (default Pthread policy)

• Spin: busy-wait (active)

• Spin-Then-Park: spin a little, then go to sleep

12

Page 21: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

LiTL: Library for Transparent Lock

Interposition

Page 22: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

LiTL: Overview

• Motivation

• Implementing all existing locks into all applications is laborious

• No existing library to try a lock implementation easily

• LiTL: lock library on top of Pthread Mutex lock API

• Support unmodified application via library interposition

• Supports condition variables

• Supports nested critical sections

• 27 locks (easy to add new ones)

https://github.com/multicore-locks/litl

13

Page 23: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

LiTL Design Challenges: Lock Context

• Many lock algorithms rely on “contexts”

• The Pthread Mutex lock API does not consider contexts

• Solution:

• Each lock instance comes with an array of contexts, with one

entry per thread to support nested critical sections

• Pthread Mutex lock → custom lock via hash table (CLHT)

14

Page 24: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

LiTL Design Challenges: Supporting Condition Variables

• Approach: reuse Pthread Condition variable

1. Take an uncontended Pthread lock with the optimized lock

2. Use the Pthread lock on cond wait (paper)

15

Page 25: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Overhead Evaluation

• Comparison with manual implementation of all locks on 3

lock-intensive applications

• General trends are preserved

• Average performance difference is below 5%

16

Page 26: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Study of lock/application behavior

Page 27: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Methodology

• 5% tolerance margin to take into account deviation

• Optimal lock: best or at most 5% of performance degradation

of the best

17

Page 28: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Methodology

• 5% tolerance margin to take into account deviation

• Optimal lock: best or at most 5% of performance degradation

of the best

• Linux (3.17.6)

• 3 machines

• A-64: AMD 64 cores, 8 nodes

• A-48: AMD 48 cores, 8 nodes

• I-48: Intel 48 cores, 4 nodes (no hyperthreading)

• Most results presented here are from the A-64 machine

• We vary the number of threads used to launch the

applications.

17

Page 29: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Lock-Sensitive Applications

• 60% of the studied applications are lock sensitive

0%

10%

25%

50%

75%80%

sra

ytra

cell

face

sim

radio

sity

ll

ssl pr

oxy

sra

ytra

ce

ferre

t

stre

amclu

ster

stre

amclu

ster

ll

dedup

mat

rixm

ultiply

pcall

mys

qldvip

spca

fluidan

imat

e

linea

r regr

essio

n

volre

nd

water

spat

ial

ocean

cp

ocean

ncp

water

nsquar

ed

radio

sityfm

m

barnes

histog

ram

word

count fft

strin

gm

atch

bodytra

ck

kmea

ns

swap

tions

luncb

radix

x264

cannea

l

freqm

inelu

cb

black

schol

es

pra

ytra

ce

Rel

ativ

eS

tan

dar

dD

evia

tion

atM

axN

od

es

Locks impact application performance18

Page 30: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Impact of the Number of Nodes

We consider 2 configurations per application:

• Maximum number of nodes: use all cores of the machine

• Optimized number of nodes: take the number of nodes for agiven lock/application maximizing performance

• not all locks have the same optimized number of nodes

• avoid performance collapse

• take the best of each lock

Number of nodes impacts lock performance

19

Page 31: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

How Much do Locks Impact Applications?

• At 1 node, reduced impact

• From 2% to 683%:

• avg. 4%, med. 7%

• At max nodes, huge impact

• From 42% to 3343%:

• avg. 727%, med. 91%

• At opt nodes, significant impact

• From 6% to 683%:

• avg. 105%, med. 17%

20

Page 32: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Are Some Locks Always Among the Best? I

0%

25%

50%

75%

ahm

cs

alock

-ls

backo

ff

cbom

cssp

in

cbom

csst

pclh

-ls

clhsp

in

clhst

p

c-ptl-

tkt

c-tk

t-tkt

hmcs

hticke

t-ls

mal

thsp

in

mal

thst

p

mcs

-ls

mcs

spin

mcs

stp

mcs

-tim

epub

partit

ioned

pthre

ad

pthre

adad

apt

spin

lock

spin

lock

-ls

ticke

t

ticke

t-ls

ttas

ttas

-ls

1 Node

Fraction of lock-sensitive applications for which a given lock is optimal

At 1 node, no always-winning lock80% coverage

21

Page 33: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Are Some Locks Always Among the Best? II

0%

25%

50%

75%

ahm

cs

alock

-ls

backo

ff

cbom

cssp

in

cbom

csst

pclh

-ls

clhsp

in

clhst

p

c-ptl-

tkt

c-tk

t-tkt

hmcs

hticke

t-ls

mal

thsp

in

mal

thst

p

mcs

-ls

mcs

spin

mcs

stp

mcs

-tim

epub

partit

ioned

pthre

ad

pthre

adad

apt

spin

lock

spin

lock

-ls

ticke

t

ticke

t-ls

ttas

ttas

-ls

Max Nodes Opt Nodes

Fraction of lock-sensitive applications for which a given lock is optimal

At max and opt nodes, even worse52% coverage

22

Page 34: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Are All Locks Potentially Harmful?

0%

42%

25%

50%

75%

ahm

cs

alock

-ls

backo

ff

cbom

cssp

in

cbom

csst

pclh

-ls

clhsp

in

clhst

p

c-ptl-

tkt

c-tk

t-tkt

hmcs

hticke

t-ls

mal

thsp

in

mal

thst

p

mcs

-ls

mcs

spin

mcs

stp

mcs

-tim

epub

partit

ioned

pthre

ad

pthre

adad

apt

spin

lock

spin

lock

-ls

ticke

t

ticke

t-ls

ttas

ttas

-ls

Max Nodes Opt Nodes

Fraction of lock-sensitive applications for which a given lock degrades theperformance w.r.t. the best lock (by at least 15%)

Always several applications for which a givenlock hurts performance

23

Page 35: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Is There a Stable Hierarchy Between Locks?

The lock hierarchy for an application strongly changes with:

• The number of nodes:

• On average, only 27% of the pairwise comparisons are

conserved

• The machine:

• On average, only 30% of the pairwise comparisons are

conserved

24

Page 36: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Additional Remarks

• Using thread pinning does not change the general

observations

• Pthread Mutex locks perform relatively well (i.e., are among

the best locks) for a significant share of the studied

applications

25

Page 37: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Conclusion

Page 38: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Summary of Observations

• 60% of the studied applications are lock sensitive

• Lock behavior is strongly impacted by the number of nodes

• Locks impact applications both at max and opt nodes

• No lock is always among the best

• There is no stable hierarchy between locks

• The number of threads impacts the lock hierarchy

• The machine impacts the lock hierarchy

• All locks are potentially harmful

• Using thread pinning leads to the same conclusions

• Pthreads locks perform reasonably well

26

Page 39: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Conclusion

• Lock algorithms should not be hardwired into the code of

applications.

• The observed trends call for further research on

• new lock algorithms

• runtime support for

• parallel performance

• contention management

Extended version of the paper + Data Sets + Source Code

https://github.com/multicore-locks/litl/

27

Page 40: Multicore Locks: The Case is not Closed Yet · Multicore Locks: The Case is not Closed Yet Hugo Guiroux, Renaud Lachaize, Vivien Qu ema June 24, 2016 Universit e Grenoble Alpes Grenoble

Conclusion

• Lock algorithms should not be hardwired into the code of

applications.• The observed trends call for further research on

• new lock algorithms

• runtime support for

• parallel performance

• contention management

Extended version of the paper + Data Sets + Source Code

https://github.com/multicore-locks/litl/

Thank you for your attention.

27