stm in the small: trading generality for performance in ......stm overheads •book-keeping done in...

Post on 14-Sep-2020

1 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

STM in the Small: Trading Generality for Performance in Software Transactional Memory Aleksandar Dragojević

Our motivation

• Concurrent data structures

• Hash-tables, skip-lists

– In-memory database indices2

– Key-value stores

[2] Larson et al. “High-performance concurrency control mechanisms for main-memory databases” VLDB Dec. 2011

2

Software Transactional Memory

• Simplifies data structures

– Ensures correct implementations

• Enables use of complex data structures

– More complex than when using CAS directly

3

What is STM performance like?

4

What is STM performance like?

Graph from: Cascaval et al. “Software transactional memory: why is it only a research toy” ACMQ September 2008

5

What is STM performance like?

0

1

2

3

4

5

6

7

8

9

10

Spe

ed

up

1

2

4

8

16

6

What is STM performance like?

0

1

2

3

4

5

6

7

8

9

10

Spe

ed

up

1

2

4

8

16

7

What is STM performance like?

0

1

2

3

4

5

6

7

8

9

10

Spe

ed

up

1

2

4

8

16

8

What is STM performance like?

0

1

2

3

4

5

6

7

8

9

10

Spe

ed

up

1

2

4

8

16

9

What is STM performance like?

0

1

2

3

4

5

6

7

8

9

10

Spe

ed

up

1

2

4

8

16

10

What is STM performance like?

0

1

2

3

4

5

6

7

8

9

0 5 10 15

Thro

ugh

pu

t [M

op

s/s]

Threads

Skip-list

STM

CAS

11

What is STM performance like?

0

1

2

3

4

5

6

7

8

9

0 5 10 15

Thro

ugh

pu

t [M

op

s/s]

Threads

Skip-list

STM

71%

78%

12

CAS

SpecTM: Specialized STM for Concurrent Data Structures

Ease of programming easy hard

Perf

orm

ance

fa

st

slo

w

STM

CAS

13

SpecTM: Specialized STM for Concurrent Data Structures

easy hard

Perf

orm

ance

fa

st

slo

w

SpecTM

Ease of programming

14

CAS

STM

This talk

0

1

2

3

4

5

6

7

8

9

0 5 10 15

Thro

ugh

pu

t [M

op

s/s]

Threads

Skip-list

STM

71%

15

CAS

78%

This talk

0

1

2

3

4

5

6

7

8

9

0 5 10 15

Thro

ugh

pu

t [M

op

s/s]

Threads

Skip-list SpecTM 0%

16

CAS

0%

Overview

• STM overheads

• SpecTM

– Short-transaction API

– Collocating data and meta-data

– In-word meta-data

• Evaluation

17

Overview

• STM overheads

• SpecTM

– Short-transaction API

– Collocating data and meta-data

– In-word meta-data

• Evaluation

18

STM overheads

• Book-keeping done in software

• Memory accesses become STM calls

• Significant overheads3

– With a single thread 50% overheads

– Requires 4 threads to outperform sequential code in 75% of cases

[3] Dragojević et al. “Why STM Can Be more than a Research Toy” CACM Apr. 2011

19

STM API

void StartTx();

void CommitTx();

word_t ReadWord(word_t *addr);

void WriteWord(word_t *addr, word_t val);

20

STM read

map(addr)

21

STM read

map(addr)

address

orec table

0 0 c 1 2 a b 0

[c1a2aa]

[c1a2ab]

[c1a2ac]

22

STM read

map(addr)

check RaW

address

orec table

0 0 c 1 2 a b 0

[c1a2aa]

[c1a2ab]

[c1a2ac]

23

STM read

map(addr)

check RaW

address

orec table

0 0 c 1 2 a b 0

[c1a2aa]

[c1a2ab]

[c1a2ac]

[orec idx]

write-set

address-value

orecs

index

… …

24

STM read

map(addr)

check RaW

o1 == o2 && !locked

address

orec table

0 0 c 1 2 a b 0

[c1a2aa]

[c1a2ab]

[c1a2ac]

write-set read orec

read val

read orec

address-value

orecs

… …

25

index

[orec idx]

STM read (cont’d)

log read

26

STM read (cont’d)

log read head curr

count=2

read-set

27

STM read (cont’d)

log read head curr

count=3

read-set

28

STM read (cont’d)

log read

orec <= validts

head curr

count=3

read-set

29

STM read (cont’d)

log read

return val

orec <= validts

head curr

count=3

read-set

30

STM read (cont’d)

log read

validate

return val abort

valid

orec <= validts

head curr

count=3

read-set

31

STM read (cont’d)

log read

validate

return val abort

valid

orec <= validts

head curr

count=3 read-set

read-set

32

STM read fast-path

• Significantly more expensive than CPU read instructions

– >10 instructions

• Writes have comparable costs

– Incurs a compare-and-swap at commit time

33

Overview

• STM overheads

• SpecTM

– Short-transaction API

– Collocating data and meta-data

– In-word meta-data

• Evaluation

34

Overview

• STM overheads

• SpecTM

– Short-transaction API

– Collocating data and meta-data

– In-word meta-data

• Evaluation

35

Short transactions

• Short read-write transactions

• Short read-only transactions

• Single-access transactions

– Read, Write, CAS

36

Short read-write transactions

37

word_t TxRWRead_1(word_t *addr_1);

word_t TxRWRead_2(word_t *addr_2);

word_t TxRWRead_3(word_t *addr_3);

...

void TxRWCommit_1(word_t val);

void TxRWCommit_2(word_t v1, word_t v2);

void TxRWCommit_3(word_t v1,

word_t v2,

word_t v3);

...

Short read-write transactions

word_t TxRWRead_1(word_t *addr_1);

word_t TxRWRead_2(word_t *addr_2);

word_t TxRWRead_3(word_t *addr_3);

...

void TxRWCommit_1(word_t val);

void TxRWCommit_2(word_t v1, word_t v2);

void TxRWCommit_3(word_t v1,

word_t v2,

word_t v3);

...

38

word_t TxRWRead_1(word_t *addr_1);

word_t TxRWRead_2(word_t *addr_2);

word_t TxRWRead_3(word_t *addr_3);

...

void TxRWCommit_1(word_t val);

void TxRWCommit_2(word_t v1, word_t v2);

void TxRWCommit_3(word_t v1,

word_t v2,

word_t v3);

...

Short read-write transactions

39

word_t TxRWRead_1(word_t *addr_1);

word_t TxRWRead_2(word_t *addr_2);

word_t TxRWRead_3(word_t *addr_3);

...

void TxRWCommit_1(word_t val);

void TxRWCommit_2(word_t v1, word_t v2);

void TxRWCommit_3(word_t v1,

word_t v2,

word_t v3);

...

Short read-write transactions

40

Short transaction optimizations

address

orec table

0 0 c 1 2 a b 0

[c1a2aa]

[c1a2ab]

[c1a2ac]

write-set

address-value

orecs

… …

41

index

[orec idx]

map(addr)

check RaW

read orec

read val

read orec

o1 == o2 && !locked

Short transaction optimizations

address

orec table

0 0 c 1 2 a b 0

[c1a2aa]

[c1a2ab]

[c1a2ac]

write-set

address-value

orecs

… …

42

index

[orec idx]

map(addr)

check RaW

read orec

read val

read orec

o1 == o2 && !locked

Short transaction optimizations

address

orec table

0 0 c 1 2 a b 0

[c1a2aa]

[c1a2ab]

[c1a2ac]

write-set

address-value

orecs

… …

43

index

[orec idx]

map(addr)

check RaW

read orec

read val

read orec

o1 == o2 && !locked

Short transaction optimizations

address

orec table

0 0 c 1 2 a b 0

[c1a2aa]

[c1a2ab]

[c1a2ac]

write-set

address-value

orecs

… …

44

index

[orec idx]

map(addr)

check RaW

read orec

read val

read orec

o1 == o2 && !locked

Short transaction optimizations

address

orec table

0 0 c 1 2 a b 0

[c1a2aa]

[c1a2ab]

[c1a2ac]

write-set

address-value

orecs

… …

static

write-set

45

index

[orec idx]

map(addr)

check RaW

read orec

read val

read orec

o1 == o2 && !locked

Short transaction optimizations

address

orec table

0 0 c 1 2 a b 0

[c1a2aa]

[c1a2ab]

[c1a2ac]

write-set

address-value

orecs

… …

static

write-set

46

index

[orec idx]

map(addr)

check RaW

read orec

read val

read orec

o1 == o2 && !locked

CAS (from write)

Short transaction optimizations (cont’d)

log read

validate

return val abort

valid

orec <= validts

head curr

count=3 read-set

read-set

47

Short transaction optimizations (cont’d)

log read

validate

return val abort

valid

orec <= validts

head curr

count=3 read-set

read-set

merged with write-set

48

static

Short transaction optimizations (cont’d)

log read

validate

return val abort

valid

orec <= validts

head curr

count=3 read-set

read-set

49

merged with write-set

static

Using short transactions

Tx

With “traditional” transactions the whole operation is a single transaction

50

Using short transactions

Tx1 Tx2 Tx3 Tx4 Tx5 Tx6

Split operation into a sequence of short transactions

51

Using short transactions

Tx1 Tx2 Tx3 Tx4 Tx5 Tx6

“Glue” them together using techniques similar to lock-free algorithms (e.g. pointer marking)

52

Using short transactions

Tx1 Tx2 Tx3 Tx4 Tx5 Tx6

“Glue” them together using techniques similar to lock-free algorithms (e.g. pointer marking)

Is this simpler than using CAS directly?

53

Using short transactions

Tx1 Tx2 Tx3 Tx4 Tx5 Tx6

“Glue” them together using techniques similar to lock-free algorithms (e.g. pointer marking)

Is this simpler than using CAS directly? Yes: transactions eliminate tricky races

54

Mix short and ordinary transactions

Tx1 Tx2 Tx3 Tx4 Tx5 Tx6

Tx

Implement corner cases using ordinary transactions

55

Example

10 20 30 40

10

10

20 40

40

Remove 20

56

Example

10 20 30 40

10

10

20 40

40

Remove 20

TxSingleRead

57

Example

10 20 30 40

10

10

20 40

40

Remove 20

TxSingleRead

58

Example

10 20 30 40

10

10

20 40

40

Remove 20

TxSingleRead

59

Example

10 20 30 40

10

10

20 40

40

Remove 20

60

Short read-write transaction to mark the node and unlink it

Example

10 20 30 40

10

10

20 40

40

Remove 20

61

Short read-write transaction to mark the node and unlink it

Example

10 20 30 40

10

10

20 40

40

Remove 20

This becomes trickier when using CAS directly

62

Example

10 20 30 40

10

10

20 40

40

Remove 20

We have to use several CASes

63

CAS

Example

10 20 30 40

10

10

20 40

40

Remove 20

64

We have to use several CASes

CAS

Example

10 20 30 40

10

10

20 40

40

Remove 20

65

CAS

We have to use several CASes

Example

10 20 30 40

10

10

20 40

40

Remove 20

66

We have to use several CASes

CAS

Example

10 20 30 40

10

10

20 40

40

Remove 20

67

CAS

We have to use several CASes

Example

10 20 30 40

10

10

20 40

40

Remove 20

68

We have to use several CASes CAS

Example

10 20 30 40

10

10

20 40

40

Remove 20

69

CAS

We have to use several CASes

Example

10 20 30 40

10

10

20 40

40

Remove 20

70

We have to use several CASes

CAS

Example

10 20 30 40

10

10

20 40

40

Remove 20

71

There might be a race at each of these steps

CAS

Example

10 20 30 40

10

10

20 40

40

Remove 20

72

We need to make sure other operations clean-up for us

CAS

Example

10 20 30 40

10

10

20 40

40

Remove 20

73

All interactions between operations have to be handled correctly

CAS

Example

10 20 30 40

10

10

20 40

40

Remove 20

74

E.g. removal of an element that is still being added to the list

CAS

Overview

• STM overheads

• SpecTM

– Short-transaction API

– Collocating data and meta-data

– In-word meta-data

• Evaluation

75

“Traditional” STM data layout

O1

O2

O3

O4

W1

W2

W3

W4

application data

orecs

76

“Traditional” STM data layout

O1

O2

O3

O4

W1

W2

W3

W4

application data

General False conflicts Cache misses

orecs

77

Collocate data and meta-data

W1

O1

W3

O3

W2

O2

W4

O4

Simplify mapping No false conflicts Same cache line

78

Collocated meta-data optimizations

address

orec table

0 0 c 1 2 a b 0

[c1a2aa]

[c1a2ab]

[c1a2ac]

write-set

address-value

orecs

… …

static

write-set

79

index

[orec idx]

map(addr)

check RaW

read orec

read val

read orec

o1 == o2 && !locked

CAS (from write)

Collocated meta-data optimizations

address

orec table

0 0 c 1 2 a b 0

[c1a2aa]

[c1a2ab]

[c1a2ac]

write-set

address-value

orecs

… …

static

write-set

80

index

[orec idx]

map(addr)

check RaW

read orec

read val

read orec

o1 == o2 && !locked

CAS (from write)

simplified mapping

Collocated meta-data optimizations

address

orec table

0 0 c 1 2 a b 0

[c1a2aa]

[c1a2ab]

[c1a2ac]

write-set

address-value

orecs

… …

static

write-set

81

index

[orec idx]

map(addr)

check RaW

read orec

read val

read orec

o1 == o2 && !locked

CAS (from write)

simplified mapping

better cache locality

Pure value-based validation

W1

W2

W3

W4

… Restrict values Restrict access patterns Single-word operations

82

Value-based optimizations

address

orec table

0 0 c 1 2 a b 0

[c1a2aa]

[c1a2ab]

[c1a2ac]

write-set

address-value

orecs

… …

static

write-set

83

index

[orec idx]

map(addr)

check RaW

read orec

read val

read orec

o1 == o2 && !locked

CAS (from write)

simplified mapping

better cache locality

Value-based optimizations

address

orec table

0 0 c 1 2 a b 0

[c1a2aa]

[c1a2ab]

[c1a2ac]

write-set

address-value

orecs

… …

static

write-set

84

index

[orec idx]

map(addr)

check RaW

read orec

read val

read orec

o1 == o2 && !locked

CAS (from write)

simplified mapping

better cache locality

Value-based optimizations

address

orec table

0 0 c 1 2 a b 0

[c1a2aa]

[c1a2ab]

[c1a2ac]

write-set

address-value

orecs

… …

static

write-set

85

index

[orec idx]

map(addr)

check RaW

read orec

read val

read orec

o1 == o2 && !locked

CAS (from write)

simplified mapping

better cache locality

API changes

86

API changes

word_t ReadWord(TVar *addr);

void WriteWord(TVar *addr, word_t val);

87

word_t TVarToWord(TVar addr);

TVar WordToTVar(word_t val);

API changes

word_t ReadWord(TVar *addr);

void WriteWord(TVar *addr, word_t val);

88

Transactions access TVar, not word_t

word_t TVarToWord(TVar addr);

TVar WordToTVar(word_t val);

API changes

89

Transactions access TVar, not word_t

Calls to update TVar non-transactionally

word_t ReadWord(TVar *addr);

void WriteWord(TVar *addr, word_t val);

word_t TVarToWord(TVar addr);

TVar WordToTVar(word_t val);

Pure value-based short read-write

read val

90

Pure value-based short read-write

!locked

91

read val

Pure value-based short read-write

CAS

!locked

92

read val

Pure value-based short read-write

abort

!locked

93

read val

CAS

Pure value-based short read-write

log

abort

success

!locked

94

read val

CAS

Pure value-based short read-write

write-set

abort

!locked

95

static

read val

log

success

CAS

Pure value-based short read-write

return val

abort

!locked

96

read val

static

write-set

log

success

CAS

Pure value-based short read-write

return val

abort

!locked

97

read val

static

write-set

log

success

CAS

STM vs. SpecTM

98

STM SpecTM

Overview

• STM overheads

• SpecTM

– Short-transaction API

– Collocating data and meta-data

– In-word meta-data

• Evaluation

99

Evaluation

0

10

20

30

40

50

60

70

80

0 20 40 60 80 100 120 140

Thro

ugh

pu

t [M

op

s/s]

Threads

Skip-list (90% lookups)

STM

100

CAS

Evaluation

0

10

20

30

40

50

60

70

80

0 20 40 60 80 100 120 140

Thro

ugh

pu

t [M

op

s/s]

Threads

Skip-list (90% lookups)

STM

240%

101

CAS

0

10

20

30

40

50

60

70

80

0 20 40 60 80 100 120 140

Thro

ugh

pu

t [M

op

s/s]

Threads

Skip-list (90% lookups)

Evaluation

48%

STM

SpecTM-Short

102

CAS

0

10

20

30

40

50

60

70

80

0 20 40 60 80 100 120 140

Thro

ugh

pu

t [M

op

s/s]

Threads

Skip-list (90% lookups)

Evaluation

12%

SpecTM-Short-Colloc

SpecTM-Short

STM

103

CAS

0

10

20

30

40

50

60

70

80

0 20 40 60 80 100 120 140

Thro

ugh

pu

t [M

op

s/s]

Threads

Skip-list (90% lookups)

Evaluation

2% SpecTM-Short-Val

STM

SpecTM-Short-Colloc

SpecTM-Short

104

CAS

Evaluation

105

0

50

100

150

200

250

300

350

400

450

0 20 40 60 80 100 120 140

Thro

ugh

pu

t [M

op

s/s]

Threads

Hash-table (90% lookups) CAS

SpecTM-Short-Val

SpecTM-Short-Colloc

STM

SpecTM-Short

0%

Evaluation

106

0

10

20

30

40

50

60

70

80

0 5 10 15

Thro

ugh

pu

t [M

op

s/s]

Threads

Hash-table (90% lookups)

STM

SpecTM-Short

SpecTM-Short-Colloc

CAS SpecTM-Short-Val 0%

Conclusions

• Trade generality for performance in SpecTM

– Restrict STM API

– Control data layout

• STM-based data structures perform as well as the ones built directly from CAS

• Do we need SpecTM with upcoming Intel TSX?

– Best-effort limited-size transactions

107

SpecTM: Specialized STM for Concurrent Data Structures

easy hard

Perf

orm

ance

fa

st

slo

w

SpecTM

Ease of programming

108

CAS

STM

©2011 Microsoft Corporation. All rights reserved.

top related