NUMA-Friendly Stack (using Delegation and Elimination)
Irina Calciu
Justin Gottschlich
Maurice Herlihy
HotPar ’13
Trends for Future Architectures
Uniform Memory Access (UMA)
Non-Uniform Memory Access (NUMA)
• Four NUMA nodes (each with multiple cores and a shared Last Level Cache), connected by an interconnect
• Cache coherency maintained between caches on different NUMA nodes
Overview
• Motivation
• Algorithms
• Results
• Conclusions
Delegation
[Figure: clients on NUMA node 0 and NUMA node 1; a server; a sequential stack (SEQ STACK)]
Delegation
[Figure: NUMA node 0 and NUMA node 1; Clients 1-8 post to two groups of slots; the server loops through all slots; a sequential stack (SEQ STACK)]
Elimination, Rendezvous
Local Rendezvous
[Figure: NUMA node 0 and NUMA node 1; a stack (STACK)]
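A minimal sketch of the elimination idea behind local rendezvous: a push and a concurrent pop meet at an exchange slot and cancel out, so neither touches the stack. It assumes one slot per NUMA node so that partners meet on node-local memory. The names `EliminationSlot`, `try_push`, `try_pop`, and the two-node setup are hypothetical, and the sketch uses a lock and condition variable for clarity rather than the lock-free exchange a real implementation would likely use.

```python
import threading

class _Offer:
    """A value offered by a waiting push."""
    def __init__(self, value):
        self.value = value
        self.taken = False

class EliminationSlot:
    """Meeting point where one push can hand its value to one pop."""
    def __init__(self):
        self._cond = threading.Condition()
        self._offer = None

    def try_push(self, value, timeout):
        """Return True if a pop took the value before the timeout."""
        offer = _Offer(value)
        with self._cond:
            if self._offer is not None:
                return False              # another push is already waiting
            self._offer = offer
            self._cond.notify_all()       # wake a pop waiting for an offer
            self._cond.wait_for(lambda: offer.taken, timeout)
            if not offer.taken:
                self._offer = None        # withdraw; caller falls back to the stack
            return offer.taken

    def try_pop(self, timeout):
        """Return (True, value) if a push was met, else (False, None)."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._offer is not None, timeout):
                return False, None
            offer, self._offer = self._offer, None
            offer.taken = True
            self._cond.notify_all()       # wake the push that made the offer
            return True, offer.value

def demo():
    local = [EliminationSlot() for _ in range(2)]   # one slot per node (assumed)
    out = []
    popper = threading.Thread(
        target=lambda: out.append(local[0].try_pop(timeout=5.0)))
    popper.start()
    pushed = local[0].try_push(42, timeout=5.0)     # meets the pop on node 0
    popper.join()
    lonely = local[1].try_push(7, timeout=0.01)     # no partner on node 1
    return pushed, out[0], lonely
```

Here the push on node 0 is eliminated against the pop on node 0, while the push on node 1 times out and would have to fall back to the stack.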
Delegation + Elimination
[Figure: clients on NUMA node 0 and NUMA node 1; a server; a sequential stack (SEQ STACK)]
Delegation + LOCAL Elimination
[Figure: clients on NUMA node 0 and NUMA node 1; a server; a sequential stack (SEQ STACK)]
Effect of Elimination
[Plots: throughput (higher is better); panels: 50% push / 50% pop and 90% push / 10% pop]
Effect of Delegation
[Plots: throughput (higher is better); panels: 50% push / 50% pop and 90% push / 10% pop]
Number of Slots
[Plots: throughput (higher is better); panels: 50% push / 50% pop and 90% push / 10% pop]
Workloads: Balanced vs. Unbalanced
[Plots: throughput (higher is better); panels: 50% push / 50% pop and 70% push / 30% pop]
Advantages
• Memory and cache locality
• Reduced bus traffic
• Increased parallelism through elimination
Drawbacks
• Communication cost between clients and server thread
  ◦ Insignificant compared to the benefits
• Serializing otherwise parallel data structures
  ◦ Parallelism through elimination
• Elimination opportunities decrease as the workload becomes more unbalanced
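The last drawback can be made concrete with a small calculation. Each elimination cancels exactly one push against one pop, so the share of operations that can eliminate is at most twice the smaller of the two fractions. The function name `max_eliminated_percent` is hypothetical, and the result is only an upper bound; the rate actually achieved depends on timing and is not given in the slides.

```python
def max_eliminated_percent(push_percent):
    """Upper bound (in percent of all operations) on what can cancel in pairs."""
    pop_percent = 100 - push_percent
    return 2 * min(push_percent, pop_percent)

for push in (50, 70, 90):
    print(f"{push}% push / {100 - push}% pop: "
          f"at most {max_eliminated_percent(push)}% of operations can eliminate")
# 50% push / 50% pop: at most 100% of operations can eliminate
# 70% push / 30% pop: at most 60% of operations can eliminate
# 90% push / 10% pop: at most 20% of operations can eliminate
```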
Open Questions
• Are there other data structures where we can use delegation and elimination?
• Are there data structures where direct access is much better?
• What can we do for those data structures?
Thank you! Questions?
References
• A Scalable Lock-free Stack Algorithm
http://www.inf.ufsc.br/~dovicchi/pos-ed/pos/artigos/p206-hendler.pdf
• Flat Combining and the Synchronization-Parallelism Tradeoff
http://www.cs.bgu.ac.il/~hendlerd/papers/flat-combining.pdf
• Fast and Scalable Rendezvousing
http://www.cs.tau.ac.il/~afek/rendezvous.pdf
Cache to Cache Traffic
[Plot: cache-to-cache traffic, with the better direction marked]
Coefficient of Variation
[Plot: coefficient of variation, with the better direction marked]
Flat Combining
Delegation
CLIENT
  Find corresponding slot (by NUMA node and cpuid)
  Post message
  Wait for response
  Get response
SERVER
  Loop through all slots:
    If slot has message:
      Take message
      Process message
      Send response
(steps listed in time order)
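The client and server steps above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the names `Slot`, `DelegatedStack`, `call`, and `serve` are hypothetical, the slot index is computed as `node * cpus_per_node + cpu`, and Python threads stand in for threads that a real system would pin to specific cores and NUMA nodes.

```python
import threading
import time

class Slot:
    """Mailbox for one client: a posted request and the server's response."""
    def __init__(self):
        self.request = None
        self.response = None
        self.ready = threading.Event()

class DelegatedStack:
    def __init__(self, num_nodes, cpus_per_node):
        self.cpus_per_node = cpus_per_node
        self.slots = [Slot() for _ in range(num_nodes * cpus_per_node)]
        self.stack = []                       # sequential stack, server-only
        self.stop = threading.Event()

    def call(self, node, cpu, op, value=None):
        slot = self.slots[node * self.cpus_per_node + cpu]  # find slot
        slot.ready.clear()
        slot.request = (op, value)            # post message
        slot.ready.wait()                     # wait for response
        return slot.response                  # get response

    def serve(self):
        while not self.stop.is_set():
            for slot in self.slots:           # loop through all slots
                message = slot.request
                if message is None:
                    continue
                slot.request = None           # take message
                op, value = message
                if op == "push":              # process message
                    self.stack.append(value)
                    slot.response = None
                else:
                    slot.response = self.stack.pop() if self.stack else None
                slot.ready.set()              # send response
            time.sleep(0)                     # yield between sweeps

def demo():
    ds = DelegatedStack(num_nodes=2, cpus_per_node=2)
    server = threading.Thread(target=ds.serve, daemon=True)
    server.start()
    clients = [threading.Thread(target=ds.call, args=(n, c, "push", (n, c)))
               for n in range(2) for c in range(2)]
    for t in clients:
        t.start()
    for t in clients:
        t.join()
    popped = [ds.call(0, 0, "pop") for _ in range(4)]
    ds.stop.set()
    server.join()
    return sorted(popped)
```

Only the server thread ever touches `stack`, which is why a plain sequential stack is enough; the clients communicate solely through their own slots.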
Delegation
CLIENT
  Find corresponding slot (by NUMA node and cpuid)
  try_elimination:
    if (eliminate) return
    Post message
    Wait for response
    Get response
    else try_elimination
SERVER
  Loop through all slots:
    If slot has message:
      Take message
      Process message
      Send response
(steps listed in time order)
Delegation
CLIENT
  Find corresponding slot (by NUMA node and cpuid)
  try_elimination:
    if (eliminate) return
    if (Acquire slot lock)
      Post message
      Wait for response
      Get response
      Release slot lock
    else try_elimination
SERVER
  Loop through all slots:
    If slot has message:
      Take message
      Process message
      Send response
(steps listed in time order)
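A minimal sketch of the client loop in this last variant, where several clients may share one slot and therefore need a slot lock. All names here (`SharedSlot`, `client_op`, `serve`, and the `try_eliminate` hook returning `(done, result)`) are hypothetical; the elimination step is passed in as a callable so the retry structure stays visible, and the server handles a single slot to keep the example short.

```python
import threading
import time

class SharedSlot:
    """A slot that several clients may share, guarded by a lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.request = None
        self.response = None
        self.ready = threading.Event()

def client_op(slot, request, try_eliminate):
    while True:
        done, result = try_eliminate(request)    # try_elimination
        if done:
            return result                        # eliminated: skip the server
        if slot.lock.acquire(blocking=False):    # acquire slot lock
            try:
                slot.ready.clear()
                slot.request = request           # post message
                slot.ready.wait()                # wait for response
                return slot.response             # get response
            finally:
                slot.lock.release()              # release slot lock
        # Slot is busy: loop back and try elimination again.

def serve(slot, stack, stop):
    """Single-slot server loop, enough to exercise client_op."""
    while not stop.is_set():
        message = slot.request
        if message is not None:
            slot.request = None
            op, value = message
            if op == "push":
                stack.append(value)
                slot.response = None
            else:
                slot.response = stack.pop() if stack else None
            slot.ready.set()
        time.sleep(0)

def demo():
    slot, stack, stop = SharedSlot(), [], threading.Event()
    server = threading.Thread(target=serve, args=(slot, stack, stop), daemon=True)
    server.start()
    never = lambda request: (False, None)        # elimination never succeeds here
    clients = [threading.Thread(target=client_op, args=(slot, ("push", i), never))
               for i in range(4)]
    for t in clients:
        t.start()
    for t in clients:
        t.join()
    stop.set()
    server.join()
    return sorted(stack)
```

With the `never` hook every operation goes through the shared slot one at a time; a hook that succeeds returns immediately and the slot is never touched.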
Open Questions
• Performance
• Scalability
• Power