![Page 1: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/1.jpg)
Cache Craftiness for Fast Multicore Key-Value Storage
Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)
![Page 2: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/2.jpg)
Let’s build a fast key-value store
• KV store systems are important
  – Google Bigtable, Amazon Dynamo, Yahoo! PNUTS
• Single-server KV performance matters
  – Reduce cost
  – Easier management
• Goal: fast KV store for single multi-core server
  – Assume all data fits in memory
  – Redis, VoltDB
![Page 3: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/3.jpg)
Feature wish list
• Clients send queries over network
• Persist data across crashes
• Range query
• Perform well on various workloads
  – Including hard ones!
![Page 4: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/4.jpg)
Hard workloads
• Skewed key popularity
  – Hard! (Load imbalance)
• Small key-value pairs
  – Hard!
• Many puts
  – Hard!
• Arbitrary keys
  – String (e.g. www.wikipedia.org/...) or integer
  – Hard!
![Page 5: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/5.jpg)
First try: fast binary tree
[Figure: bar chart of throughput (req/sec, millions); 140M short KV, put-only, @16 cores]
• Network/disk not bottlenecks
  – High-BW NIC
  – Multiple disks
• 3.7 million queries/second!
• Better?
  – What bottleneck remains?
  – DRAM!
![Page 6: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/6.jpg)
Cache craftiness goes 1.5X farther
[Figure: bar chart of throughput (req/sec, millions) for Binary vs. Masstree; 140M short KV, put-only, @16 cores]
Cache-craftiness: careful use of cache and memory
![Page 7: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/7.jpg)
Contributions
• Masstree achieves millions of queries per second across various hard workloads
  – Skewed key popularity
  – Various read/write ratios
  – Variable, relatively long keys
  – Data >> on-chip cache
• New ideas
  – Trie of B+ trees, permuter, etc.
• Full system
  – New ideas + best practices (network, disk, etc.)
![Page 8: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/8.jpg)
Experiment environment
• A 16-core server
  – three active DRAM nodes
• Single 10Gb Network Interface Card (NIC)
• Four SSDs
• 64 GB DRAM
• A cluster of load generators
![Page 9: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/9.jpg)
Potential bottlenecks in Masstree
[Diagram: a single multi-core server with three potential bottlenecks: network, disk (logs), and DRAM]
![Page 10: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/10.jpg)
NIC bottleneck can be avoided
• Single 10Gb NIC
  – Multiple queues, scale to many cores
  – Target: 100B KV pair => 10M req/sec
• Use network stack efficiently
  – Pipeline requests
  – Avoid copying cost
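The 10M req/sec target can be sanity-checked with a line of arithmetic. The sketch below is only a back-of-envelope bound: it assumes each request moves exactly one 100B KV pair over the link and ignores all protocol and framing overhead (an assumption, not something the slides state). The function name is illustrative.

```python
# Back-of-envelope upper bound for a single 10Gb NIC.
# Assumption: one 100-byte KV pair per request, no protocol overhead.
NIC_BITS_PER_SEC = 10 * 10**9   # single 10Gb NIC (from the slide)
KV_PAIR_BYTES = 100             # target KV pair size (from the slide)


def max_requests_per_sec(link_bits_per_sec: int, bytes_per_request: int) -> float:
    """Requests/sec the link could carry if it moved nothing but payload."""
    return link_bits_per_sec / 8 / bytes_per_request


raw_limit = max_requests_per_sec(NIC_BITS_PER_SEC, KV_PAIR_BYTES)
print(f"{raw_limit / 1e6:.1f}M req/sec")  # 12.5M req/sec
```

Under these assumptions the raw payload limit is 12.5M req/sec, so a 10M req/sec target leaves some room for headers and other overhead.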
![Page 11: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/11.jpg)
Disk bottleneck can be avoided
• 10M puts/sec => 1GB logs/sec!
• Single disk

| Single disk | Write throughput | Cost |
|---|---|---|
| Mainstream disk | 100-300 MB/sec | 1 $/GB |
| High-performance SSD | up to 4.4 GB/sec | > 40 $/GB |

• Multiple disks: split log
  – See paper for details
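The log-bandwidth figure and the case for splitting the log follow from simple arithmetic. This sketch assumes roughly 100 bytes of log per put (matching the 100B KV pair target, but still an assumption) and decimal units (1 MB = 10^6 bytes). It only counts how many mainstream disks would be needed to absorb the log stream; it says nothing about how Masstree actually splits its log.

```python
import math

PUTS_PER_SEC = 10_000_000
LOG_BYTES_PER_PUT = 100  # assumption: about one 100B KV pair logged per put


def log_bytes_per_sec(puts_per_sec: int, bytes_per_put: int) -> int:
    """Sustained log write bandwidth required, in bytes/sec."""
    return puts_per_sec * bytes_per_put


def disks_needed(required_bytes_per_sec: int, disk_bytes_per_sec: int) -> int:
    """Smallest number of equal disks whose combined bandwidth covers the log."""
    return math.ceil(required_bytes_per_sec / disk_bytes_per_sec)


required = log_bytes_per_sec(PUTS_PER_SEC, LOG_BYTES_PER_PUT)
print(required)                              # 1000000000 bytes/sec, i.e. 1 GB/sec
print(disks_needed(required, 300_000_000))   # 4 disks at 300 MB/sec
print(disks_needed(required, 100_000_000))   # 10 disks at 100 MB/sec
```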
![Page 12: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/12.jpg)
DRAM bottleneck – hard to avoid
[Figure: bar chart of throughput (req/sec, millions) for Binary vs. Masstree; 140M short KV, put-only, @16 cores]
Cache-craftiness goes 1.5X farther, including the cost of:
• Network
• Disk
![Page 13: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/13.jpg)
DRAM bottleneck – w/o network/disk
[Figure: bar chart of throughput (req/sec, millions) for Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree; 140M short KV, put-only, @16 cores]
Cache-craftiness goes 1.7X farther!
![Page 14: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/14.jpg)
DRAM latency – binary tree
[Figure: bar chart of throughput (req/sec, millions) for Binary; 140M short KV, put-only, @16 cores]
[Diagram: binary tree; node B with children A and C, …, node Y with children X and Z; label: VoltDB]
• Serial DRAM latencies!
• 10M keys => 2.7 us/lookup, 380K lookups/core/sec
![Page 15: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/15.jpg)
DRAM latency – Lock-free 4-way tree
• Concurrency: same as binary tree
• One cache line per node => 3 KV / 4 children
[Diagram: 4-way tree; a node holding X, Y, Z above child nodes (A, B, …)]
• ½ levels as binary tree
• ½ DRAM latencies as binary tree
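The "½ levels" claim can be checked by counting the levels a perfectly balanced tree needs. This is a minimal sketch under idealized assumptions (every node completely full, which a real tree rarely achieves); `min_levels` is an illustrative helper, not part of Masstree.

```python
def min_levels(num_keys: int, keys_per_node: int) -> int:
    """Fewest levels a completely full tree needs to hold num_keys.

    A node with k keys has k + 1 children, so a full tree with L levels
    holds (k + 1)**L - 1 keys.
    """
    fanout = keys_per_node + 1
    levels = 0
    while fanout ** levels - 1 < num_keys:
        levels += 1
    return levels


for n in (10_000_000, 140_000_000):
    binary = min_levels(n, keys_per_node=1)    # binary tree: 1 key per node
    four_way = min_levels(n, keys_per_node=3)  # 4-tree: 3 keys per node
    print(n, binary, four_way)
# 10000000 24 12
# 140000000 28 14
```

Since each level costs one dependent DRAM fetch when the data does not fit in cache, halving the number of levels halves the chain of serial DRAM latencies in this idealized model.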
![Page 16: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/16.jpg)
4-tree beats binary tree by 40%
[Figure: bar chart of throughput (req/sec, millions) for Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree; 140M short KV, put-only, @16 cores]
![Page 17: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/17.jpg)
4-tree may perform terribly!
• Unbalanced: serial DRAM latencies
  – e.g. sequential inserts
• Want balanced tree w/ wide fanout
[Diagram: degenerate 4-tree; nodes A B C, D E F, G H I, … chained one below the other: O(N) levels!]
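The degenerate case is easy to reproduce with a toy 4-way tree that never rebalances. The sketch below is not the tree measured in the talk: it is single-threaded, assumes unique keys, and simply fills a node with up to 3 keys before descending into one of 4 children. All names are illustrative.

```python
import bisect
import random


class Node:
    def __init__(self):
        self.keys = []              # up to 3 keys, kept sorted
        self.children = [None] * 4  # child i holds keys between keys[i-1] and keys[i]


def insert(root: Node, key) -> int:
    """Insert key without any rebalancing; return the depth it landed at."""
    node, depth = root, 1
    while len(node.keys) == 3:      # node is full: descend to the matching child
        idx = bisect.bisect_left(node.keys, key)
        if node.children[idx] is None:
            node.children[idx] = Node()
        node = node.children[idx]
        depth += 1
    bisect.insort(node.keys, key)
    return depth


def height_after(keys) -> int:
    root = Node()
    return max(insert(root, k) for k in keys)


sequential = list(range(300))
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)

print(height_after(sequential))  # 100: every node has one child, N/3 levels
print(height_after(shuffled))    # far fewer levels for the same keys
```

With sequential keys each new node hangs off the rightmost child of the previous one, so a lookup for the largest key walks N/3 nodes, each a separate DRAM fetch.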
![Page 18: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/18.jpg)
B+tree – Wide and balanced
• Balanced!
• Concurrent main memory B+tree [OLFIT]
  – Optimistic concurrency control: version technique
  – Lookup/scan is lock-free
  – Puts hold ≤ 3 per-node locks
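The version technique can be illustrated with a single-threaded simulation. This is a hedged sketch, not OLFIT's or Masstree's actual code: the even/odd version encoding, the omitted per-node lock, and the `mid_read_hook` parameter (which stands in for a writer running concurrently with the reader) are all assumptions made for illustration.

```python
class Node:
    def __init__(self, keys):
        self.version = 0        # assumed encoding: even = stable, odd = being modified
        self.keys = list(keys)


def writer_insert(node: Node, key) -> None:
    """A put: a real writer would hold the node's lock here (omitted in this sketch)."""
    node.version += 1           # odd: modification in progress
    node.keys.append(key)
    node.keys.sort()
    node.version += 1           # even again: new stable version


def optimistic_contains(node: Node, key, mid_read_hook=None):
    """Lock-free lookup: read, then validate the version; retry if it changed."""
    attempts = 0
    while True:
        attempts += 1
        before = node.version
        if before % 2 == 1:     # a writer is mid-update: wait and retry
            continue
        found = key in node.keys
        if mid_read_hook is not None:        # simulate a writer racing with this read
            hook, mid_read_hook = mid_read_hook, None
            hook(node)
        if node.version == before:           # nothing changed: the read was consistent
            return found, attempts


node = Node("ACD")
print(optimistic_contains(node, "C"))                                   # (True, 1)
print(optimistic_contains(node, "B", lambda n: writer_insert(n, "B")))  # (True, 2)
```

The second call shows the cost the later slides address: the reader's first attempt is invalidated by the concurrent insert, so it has to retry.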
![Page 19: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/19.jpg)
Wide fanout B+tree is 11% slower!
[Figure: bar chart of throughput (req/sec, millions) for Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree; 140M short KV, put-only]
Fanout=15, fewer levels than 4-tree, but
• # cache lines from DRAM >= 4-tree
  – 4-tree: each internal node is full
  – B+tree: nodes are ~75% full
• Serial DRAM latencies >= 4-tree
![Page 20: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/20.jpg)
B+tree – Software prefetch
• Same as [pB+-trees]
• Masstree: B+tree w/ fanout 15 => 4 cache lines
• Always prefetch whole node when accessed
• Result: one DRAM latency per node vs. 2, 3, or 4
[Diagram: with prefetch, fetching 4 lines costs the same latency as 1 line]
![Page 21: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/21.jpg)
B+tree with prefetch
[Figure: bar chart of throughput (req/sec, millions) for Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree; 140M short KV, put-only, @16 cores]
• Beats 4-tree by 9%
• Balanced beats unbalanced!
![Page 22: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/22.jpg)
Concurrent B+tree problem
• Lookups retry in case of a concurrent insert
• Lock-free 4-tree: not a problem
  – keys do not move around
  – but unbalanced
[Diagram: insert(B) turns node A C D into A B C D, passing through an intermediate state that a concurrent lookup may observe]
![Page 23: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/23.jpg)
B+tree optimization - Permuter
• Keys are stored unsorted in tree nodes; the permuter defines their order
• A concurrent lookup does not need to retry
  – Lookup uses permuter to search keys
  – Insert appears atomic to lookups
• Permuter: 64-bit integer
[Diagram: insert(B) appends B to the stored keys A C D, giving A C D B; the permuter changes from 0 1 2 to 0 3 1 2]
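The permuter idea can be sketched in a few lines. The bit layout used here (key count in the low 4 bits, followed by 4-bit slot indices in sorted order) is an assumption chosen to fit 15 keys into 64 bits; the slide only states that the permuter is a 64-bit integer. The sketch is single-threaded and ignores deletion, so it shows the data layout rather than the concurrency protocol.

```python
import bisect

MAX_KEYS = 15
MASK64 = (1 << 64) - 1


def perm_size(perm: int) -> int:
    return perm & 0xF                          # assumed: low 4 bits hold the key count


def perm_slot(perm: int, i: int) -> int:
    return (perm >> (4 * (i + 1))) & 0xF       # slot holding the i-th smallest key


def perm_insert(perm: int, pos: int, slot: int) -> int:
    """Return a new permuter with `slot` placed at sorted position `pos`."""
    n = perm_size(perm)
    assert n < MAX_KEYS and 0 <= pos <= n
    shift = 4 * (pos + 1)
    low_mask = (1 << shift) - 1
    low = perm & low_mask                      # count + entries before pos
    high = perm & ~low_mask                    # entries at pos and after
    new = low | (slot << shift) | (high << 4)  # open a 4-bit gap and fill it
    return ((new & ~0xF) | (n + 1)) & MASK64


class Leaf:
    def __init__(self):
        self.keys = [None] * MAX_KEYS  # physical slots, in arrival order
        self.perm = 0                  # logical (sorted) order

    def sorted_keys(self):
        return [self.keys[perm_slot(self.perm, i)] for i in range(perm_size(self.perm))]

    def insert(self, key) -> None:
        slot = perm_size(self.perm)    # next unused slot (no deletes in this sketch)
        self.keys[slot] = key          # step 1: write the key; readers cannot see it yet
        pos = bisect.bisect_left(self.sorted_keys(), key)
        self.perm = perm_insert(self.perm, pos, slot)  # step 2: publish with one store


leaf = Leaf()
for k in "ACDB":
    leaf.insert(k)
print(leaf.keys[:4])                                   # ['A', 'C', 'D', 'B']
print([perm_slot(leaf.perm, i) for i in range(4)])     # [0, 3, 1, 2]
print(leaf.sorted_keys())                              # ['A', 'B', 'C', 'D']
```

Because existing keys never move and the new order becomes visible in a single 64-bit store, a reader sees either the old permuter or the new one, never a half-shifted node.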
![Page 24: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/24.jpg)
B+tree with permuter
[Figure: bar chart of throughput (req/sec, millions) for Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree; 140M short KV, put-only, @16 cores]
Improves throughput by 4%
![Page 25: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/25.jpg)
Performance drops dramatically when key length increases
[Figure: throughput (req/sec, millions) vs. key length (8 to 48 bytes); short values, 50% updates, @16 cores, no logging; keys differ in last 8B]
Why? The B+tree stores key suffixes indirectly, so each key comparison
• compares the full key
• incurs an extra DRAM fetch
![Page 26: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/26.jpg)
Masstree – Trie of B+trees
• Trie: a tree where each level is indexed by a fixed-length key fragment
• Masstree: a trie with fanout 2^64, but each trie node is a B+tree
• Compress key prefixes!
[Diagram: stacked layers: B+tree indexed by k[0:7], B+tree indexed by k[8:15], B+tree indexed by k[16:23], …]
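The layering can be sketched with nested maps. In this simplified model a plain Python dict stands in for each B+tree (so there is no ordering, range scan, or concurrency), and every 8-byte slice beyond the first always creates a new layer; the real system may handle suffixes differently. `Layer`, `put`, and `get` are illustrative names.

```python
SLICE = 8  # bytes of key consumed per trie layer


class Layer:
    """One trie node. A dict stands in for the B+tree (assumption for brevity)."""

    def __init__(self):
        # key slice (bytes, at most 8 long) -> [value or None, next Layer or None]
        self.entries = {}


def slices(key: bytes):
    """Split a key into 8-byte slices; only the last slice may be shorter."""
    return [key[i:i + SLICE] for i in range(0, len(key), SLICE)] or [b""]


def put(root: Layer, key: bytes, value) -> None:
    layer = root
    parts = slices(key)
    for part in parts[:-1]:                       # walk or create one layer per slice
        entry = layer.entries.setdefault(part, [None, None])
        if entry[1] is None:
            entry[1] = Layer()
        layer = entry[1]
    layer.entries.setdefault(parts[-1], [None, None])[0] = value


def get(root: Layer, key: bytes):
    layer = root
    parts = slices(key)
    for part in parts[:-1]:
        entry = layer.entries.get(part)
        if entry is None or entry[1] is None:
            return None
        layer = entry[1]
    entry = layer.entries.get(parts[-1])
    return None if entry is None else entry[0]


root = Layer()
put(root, b"www.wikipedia.org/a", 1)
put(root, b"www.wikipedia.org/b", 2)
print(get(root, b"www.wikipedia.org/a"), get(root, b"www.wikipedia.org/b"))  # 1 2
print(len(root.entries))  # 1: the shared prefix "www.wiki" is stored once
```

Each layer only ever compares slices of at most 8 bytes, and keys that share a prefix share the upper layers, which is the prefix compression the slide refers to.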
![Page 27: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/27.jpg)
Case Study: Keys share P byte prefix – Better than single B+tree
[Diagram: a chain of trie levels leading to a single B+tree with 8B keys]
• trie levels
• each has one node only

| | Complexity | DRAM access |
|---|---|---|
| Masstree | O(log N) | O(log N) |
| Single B+tree | O(P log N) | O(P log N) |
![Page 28: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/28.jpg)
Masstree performs better for long keys with prefixes
[Figure: throughput (req/sec, millions) vs. key length (8 to 48 bytes) for Masstree and B+tree; short values, 50% updates, @16 cores, no logging; annotation: 8B key comparison vs. full key comparison]
![Page 29: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/29.jpg)
Does trie of B+trees hurt short key performance?
[Figure: bar chart of throughput (req/sec, millions) for Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree; 140M short KV, put-only, @16 cores]
8% faster! More efficient code – internal nodes handle 8B keys only
![Page 30: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/30.jpg)
Evaluation
• How does Masstree compare to other systems?
• How does Masstree compare to partitioned trees?
  – How much do we pay for handling skewed workloads?
• How does Masstree compare with a hash table?
  – How much do we pay for supporting range queries?
• Does Masstree scale on many cores?
![Page 31: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/31.jpg)
Masstree performs well even with persistence and range queries
[Figure: bar chart of throughput (req/sec, millions) for MongoDB, VoltDB, Redis, Memcached, Masstree; 20M short KV, uniform dist., read-only, @16 cores, w/ network; bar labels: 0.04, 0.22]
Unfair: both have a richer data and query model
Memcached: not persistent and no range queries
Redis: no range queries
![Page 32: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/32.jpg)
Multi-core – Partition among cores?
• Multiple instances, one unique set of keys per inst.
  – Memcached, Redis, VoltDB
• Masstree: a single shared tree
  – each core can access all keys
  – reduced imbalance
[Diagram: two tree drawings, each showing node B with children A and C, and node Y with children X and Z]
![Page 33: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/33.jpg)
A single Masstree performs better for skewed workloads
[Figure: throughput (req/sec, millions) vs. δ (0 to 9) for Masstree and 16 partitioned Masstrees; 140M short KV, read-only, @16 cores, w/ network]
• One partition receives δ times more queries
• No remote DRAM access
• No concurrency control
• Partition: 80% idle time
  – 1 partition: 40%
  – 15 partitions: 4%
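One plausible reading of the 40% and 4% figures is that they are the shares of the query stream at δ = 9, the right edge of the plot. The sketch below assumes "δ times more" means the hot partition has weight 1 + δ while each of the other 15 partitions has weight 1 (so δ = 0 is a uniform workload). This is an interpretation, not something the slide spells out.

```python
from fractions import Fraction


def query_shares(delta: int, partitions: int = 16):
    """Share of all queries hitting the hot partition and each other partition.

    Assumed model: the hot partition has weight 1 + delta, every other
    partition has weight 1.
    """
    total = (1 + delta) + (partitions - 1)
    return Fraction(1 + delta, total), Fraction(1, total)


hot, cold = query_shares(9)
print(f"{float(hot):.0%} {float(cold):.0%}")  # 40% 4%
print(query_shares(0))                        # uniform: both shares are 1/16
```

Under this model the core that owns the hot partition saturates while the other 15 cores each see a tenth of its load, which matches the slide's point that partitioned trees leave most cores idle under skew.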
![Page 34: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/34.jpg)
Cost of supporting range queries
• Without range query? One can use hash table
  – No resize cost: pre-allocate a large hash table
  – Lock-free: update with cmpxchg
  – Only support 8B keys: efficient code
  – 30% full, each lookup = 1.1 hash probes
• Measured in the Masstree framework
  – 2.5X the throughput of Masstree
• Range query costs 2.5X in performance
![Page 35: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/35.jpg)
Scale to 12X on 16 cores
[Figure: Get throughput per core (req/sec/core, axis from 0 to 700,000) vs. number of cores (1, 2, 4, 8, 16), compared with a "Perfect scalability" line; short KV, w/o logging]
• Scale to 12X
• Put scales similarly
• Limited by the shared memory system
![Page 36: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/36.jpg)
Related work
• [OLFIT]: Optimistic Concurrency Control
• [pB+-trees]: B+tree with software prefetch
• [pkB-tree]: store fixed # of diff. bits inline
• [PALM]: lock-free B+tree, 2.3X as fast as [OLFIT]
• Masstree: first system to combine them, w/ new optimizations
  – Trie of B+trees, permuter
![Page 37: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/37.jpg)
Summary
• Masstree: a general-purpose high-performance persistent KV store
• 5.8 million puts/sec, 8 million gets/sec
  – More comparisons with other systems in paper
• Using cache-craftiness improves performance by 1.5X
![Page 38: Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)](https://reader036.vdocuments.us/reader036/viewer/2022062713/56649ccc5503460f94996185/html5/thumbnails/38.jpg)
Thank you!