Data Processing on Modern Hardware

Storage
Motivation
• so far, we only talked about CPU caches and DRAM
• both are volatile (data is lost on power failure)
• traditional persistent storage:
  - disk
  - tape
• modern persistent storage:
  - solid state drive (flash)
  - storage class memory
Moving Mountains of Data
[Figure: the memory/storage hierarchy, from per-core registers through L1 and L2 caches and the shared L3 cache to DRAM and HDD. Source: Western Digital estimates.]
Storage Class Memory (SCM)
• also called non-volatile memory (NVM), NVRAM, or persistent memory
• semiconductor-based technologies: Phase Change Memory (PCM), ReRAM, etc.
• promises:
  - similar performance to DRAM
  - byte-addressability
  - persistence
• not yet commercially available, but Intel has announced NVM-DIMMs
SCM Data Access
• connected like DRAM DIMMs (PCIe is too slow)
• mapped into the address space of the application
• automatically benefits from CPU caches and can be used as a (slower but cheaper) form of RAM (without re-writing software)
• how to utilize its persistence?
  - since writes are cached, they are not immediately persisted
  - write order is also a problem
  - solution: explicit write-back instruction: void _mm_clwb(void const* p) (see the sketch below)
  - hardware guarantees that once a cache line is written back from cache, it is persisted
• SCM-aware data structures are an active research topic
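A minimal sketch of how an application could make a single store to SCM-mapped memory durable, assuming a CPU with CLWB support (compile with -mclwb); the helper name persist_store is purely illustrative:

```cpp
#include <cstdint>
#include <immintrin.h>   // _mm_clwb, _mm_sfence

// Hypothetical helper: store a value into SCM-mapped memory and make it durable.
void persist_store(std::uint64_t* scm_slot, std::uint64_t value) {
    *scm_slot = value;   // the store initially lands in the CPU cache
    _mm_clwb(scm_slot);  // explicitly write the containing cache line back to SCM
    _mm_sfence();        // order the write-back before any subsequent stores
}
```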
Game Changer? Non-Volatile Memory (NVRAM)
• advantages:
  - does not consume energy if not used
  - is persistent, byte-addressable
  - x-times denser than DRAM
• drawbacks:
  - has higher latency than DRAM (read latency ~2x slower, write latency ~10x slower than DRAM)
  - number of writes is limited
(Estimates provided by David Wang, Samsung)
Solid State Drive
• based on semiconductors (no moving parts)
• NAND gates
• available as SATA- or PCIe/M.2-attached devices (NVMe interface)
• e.g., Intel P3700:
  - 2.8/1.9 GB/s sequential read/write bandwidth
  - 450K/150K random 4KB read/write IOs/s (requires many concurrent accesses)
SSD Data Access
• data is stored on pages (e.g., 4KB, 8KB, 16KB); access is through the normal OS block device abstraction (i.e., read, write), as sketched below
• physical properties:
  - pages are combined into blocks (e.g., 2MB)
  - it is not possible to overwrite a page in place; the containing block must be erased first
  - erasing is fairly expensive
• these properties are usually not exposed
• SSDs implement a flash translation layer (FTL) that emulates a traditional read/write interface (including in-page updates)
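To make the block device abstraction concrete, here is a small sketch that reads one page from an SSD with a plain pread call; the page size and the idea of addressing pages by a page number are assumptions for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <fcntl.h>    // open
#include <unistd.h>   // pread

constexpr std::size_t kPageSize = 4096;  // assumed page size

// Read page `page_id` from a block device into `buf` (at least kPageSize bytes).
// The kernel's block layer and the SSD's FTL hide flash pages, blocks, and erases.
bool read_page(int fd, std::uint64_t page_id, void* buf) {
    off_t offset = static_cast<off_t>(page_id) * kPageSize;
    return pread(fd, buf, kPageSize, offset) == static_cast<ssize_t>(kPageSize);
}
```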
Cost Scaling for SCM Market Adoption

[Figure: projected cost in $/GB (log scale) from 2016 to 2025 for DRAM, SCM, and BiCS flash, with gaps of roughly 2x and 20x between the curves. Note: 18-month technology transition cadence assumed for all technologies; ReRAM & 3DXP greenfield fabs, NAND & DRAM existing fabs. Source: Western Digital estimates.]
Managing Larger-than-RAM Data Sets
• traditional databases use a buffer pool to support very large data sets:
  - store everything on fixed-size pages (e.g., 8KB)
  - pages are fixed/unfixed (see the interface sketch below)
  - transparently handles arbitrary data (relations, indexes)
• traditional systems are very good at minimizing the number of disk IOs
• however, in-memory systems are much faster than disk-based systems (as long as all data fits into RAM)
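A rough sketch of the fix/unfix interface such a buffer manager exposes; the class and method names are illustrative, not taken from any specific system:

```cpp
#include <cstdint>

using PageId = std::uint64_t;

// Illustrative interface of a canonical buffer manager.
class BufferManager {
public:
    // fix: pin the page in a buffer frame (loading it from disk if necessary)
    // and return a pointer to its in-memory image.
    void* fix(PageId pid, bool exclusive);

    // unfix: release the pin; the page becomes an eviction candidate again.
    void unfix(PageId pid, bool dirty);
};
```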
Canonical Buffer Management
[Figure: a buffer pool where a hash table maps page identifiers to page frames; the root page P1 and pages P2, P3, and P4 are resident in frames.]

• Effelsberg and Härder, TODS 1984
Managing Larger-Than-Memory Data Sets
• paging/mmap
  - no control over eviction (needed for OLTP/HTAP systems)
• Anti-Caching
  - each tuple has a per-relation LRU list
  - move cold tuples to disk/SSD
  - indexes refer to hot and cold tuples
• Siberia
  - record tuple accesses in a log
  - identify cold tuples from the log (offline)
  - move cold tuples to disk or SSD
  - in-memory index only stores hot (in-memory) tuples
  - Bloom (or adaptive range) filter for cold tuples
• buffer managers
  - have full control over eviction
  - handle arbitrary data structures (tables, indexes, etc.) transparently
Hardware-Driven Desiderata
• fast and cheap PCIe-attached NVMe SSDs (1/10th of the price and bandwidth of DRAM)
  ⇒ store everything (relations, indexes) on pages (e.g., 16 KB)
• huge main memory capacities
  ⇒ make in-memory accesses very fast
• dozens (soon hundreds?) of cores
  ⇒ use smart synchronization techniques
Pointer Swizzling
[Figure: pointer swizzling in the buffer pool; the root and pages P1, P2, P3, and P4 are referenced by swips that are either swizzled (direct in-memory pointers to page frames) or unswizzled (on-disk page identifiers).]
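A minimal sketch of such a "swip" as a tagged 64-bit value; the idea of distinguishing a swizzled pointer from a page identifier with a tag bit follows the LeanStore design, but the exact bit layout here is an assumption:

```cpp
#include <cstdint>

// A swip holds either a swizzled pointer to a buffer frame or an on-disk page id.
// Frame pointers are aligned, so the lowest bit is free to use as the tag.
struct Swip {
    std::uint64_t value = 0;

    bool isSwizzled() const { return (value & 1ull) == 0; }   // low bit 0: raw pointer
    void* asPointer() const { return reinterpret_cast<void*>(value); }
    std::uint64_t asPageId() const { return value >> 1; }

    void swizzle(void* frame) { value = reinterpret_cast<std::uint64_t>(frame); }
    void unswizzle(std::uint64_t pid) { value = (pid << 1) | 1ull; }
};
```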
Replacement Strategy (1)
[Figure: page states hot (RAM), cooling (RAM), and cold (SSD), with the transitions unswizzle (hot → cooling), swizzle (cooling → hot), evict (cooling → cold), and load + swizzle (cold → hot).]

1. select a random page as a potential candidate for eviction
2. cooling pages are organized in a FIFO queue (e.g., 10% of all pages are in the cooling state)
Replacement Strategy (2)
[Figure: hot pages P1, P2, P3, P4, P5, P6, P8, and P9 in the buffer pool; P4 holds a swizzled swip to its child page P6.]

1. P4 is randomly selected for speculative unswizzling
2. the buffer manager iterates over all swips on the page
3. it finds the swizzled child page P6 and unswizzles it instead
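A rough sketch of this eviction step, reusing the Swip type from the pointer-swizzling sketch; the Page layout, the cooling FIFO, and the helper unswizzleIntoCooling are assumptions for illustration:

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Minimal stand-in for a buffer-resident page.
struct Page {
    std::uint64_t id;
    std::vector<Swip> swips;   // references to child pages stored on this page
};

std::deque<Page*> coolingFifo;  // FIFO queue of cooling pages (~10% of all pages)

// Unswizzle `page` and append it to the cooling FIFO (details omitted).
void unswizzleIntoCooling(Page* page) {
    coolingFifo.push_back(page);
}

// Speculative unswizzling of a randomly chosen eviction candidate.
void speculativeUnswizzle(Page* candidate) {
    // Iterate over all swips stored on the candidate page.
    for (Swip& swip : candidate->swips) {
        if (swip.isSwizzled()) {
            // A swizzled child was found: unswizzle the child instead of the candidate.
            unswizzleIntoCooling(static_cast<Page*>(swip.asPointer()));
            return;
        }
    }
    // No swizzled children: the candidate itself moves to the cooling stage.
    unswizzleIntoCooling(candidate);
}
```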
Scalable Optimistic Synchronization Primitives
• no hash table synchronization for accessing hot pages
• per-page optimistic latches (see the sketch below)
• epoch-based memory reclamation and eviction
• global latch for cooling stage and I/O manager
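A minimal sketch of a per-page optimistic latch, assuming a version-counter scheme in which writers increment the version and readers validate it after reading (names and memory-order choices are illustrative):

```cpp
#include <atomic>
#include <cstdint>

// Per-page optimistic latch: readers do not modify shared state,
// they only validate a version counter after their reads.
struct OptimisticLatch {
    std::atomic<std::uint64_t> version{0};   // even = unlocked, odd = write-locked

    std::uint64_t readLock() const {         // wait until no writer is active
        std::uint64_t v;
        while ((v = version.load(std::memory_order_acquire)) & 1ull) { /* spin */ }
        return v;
    }
    bool validate(std::uint64_t v) const {   // true if no writer intervened
        return version.load(std::memory_order_acquire) == v;
    }
    void writeLock() {                       // acquire exclusive access
        std::uint64_t v;
        do {
            v = version.load(std::memory_order_relaxed) & ~1ull;   // expect "unlocked"
        } while (!version.compare_exchange_weak(v, v + 1, std::memory_order_acquire));
    }
    void writeUnlock() {                     // release; version becomes a new even value
        version.fetch_add(1, std::memory_order_release);
    }
};
```

A reader uses it as: read the version, read the page optimistically, then validate; if validation fails, restart (or fall back to a pessimistic latch).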
Epochs
[Figure: epoch management for the cooling stage: a global epoch counter (e9), thread-local epochs for threads 1-3 (e9, ∞, e5), and a FIFO queue plus hash table of cooling pages (P2 tagged with epoch e7, P8 with epoch e4).]
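A sketch of the epoch mechanism the figure illustrates, under the assumption that a cooling page may only be evicted and its memory reclaimed once its epoch is older than the minimum epoch of all active threads (all names are illustrative):

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <limits>

constexpr std::uint64_t kInfinity = std::numeric_limits<std::uint64_t>::max();
constexpr std::size_t kMaxThreads = 64;   // assumed number of worker threads

std::atomic<std::uint64_t> globalEpoch{0};                          // e.g., e9 in the figure
std::array<std::atomic<std::uint64_t>, kMaxThreads> threadEpochs;   // initialize to kInfinity at startup

// A thread announces the current global epoch before touching any pages...
void enterEpoch(std::size_t tid) {
    threadEpochs[tid].store(globalEpoch.load(), std::memory_order_release);
}
// ...and resets its slot to "infinity" when it is done.
void exitEpoch(std::size_t tid) {
    threadEpochs[tid].store(kInfinity, std::memory_order_release);
}

// A cooling page tagged with `pageEpoch` may be reclaimed only if no active
// thread can still hold a reference obtained in that epoch or earlier.
bool safeToReclaim(std::uint64_t pageEpoch) {
    std::uint64_t minEpoch = kInfinity;
    for (const auto& e : threadEpochs)
        minEpoch = std::min(minEpoch, e.load(std::memory_order_acquire));
    return pageEpoch < minEpoch;
}
```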
Putting It All Together
[Figure: the complete design: a buffer pool of hot pages (the root, P1, P4, P7, and more), a cooling stage consisting of a FIFO queue and a hash table holding the cooling pages P2 and P8, and a second hash table tracking in-flight OS I/O operations for cold pages that are currently being loaded (P3).]
Evaluation
• 10-core Haswell EP
• Linux 4.8
• Intel NVMe P3700 SSD
• storage managers compared (no logging, no transactions, 16 KB pages):
  - LeanStore: B-tree on top of the buffer manager
  - in-memory B-tree: same B-tree without buffer management
  - BerkeleyDB: B-tree on top of a traditional buffer manager
How Large Is the Overhead? (Single-Threaded)
[Figure: single-threaded TPC-C throughput [txns/sec], roughly 0-60K, for BerkeleyDB and LeanStore (buffer managers) versus B-tree and Silo (in-memory only).]
How Well Does It Scale?
[Figure: TPC-C throughput [txns/sec], roughly 0-600K, for 1 to 20 threads; in-memory B-tree, LeanStore, and BerkeleyDB.]
What Happens If The Data Does Not Fit Into RAM?
[Figure: TPC-C throughput [txns/sec] over 60 seconds of runtime as the working set outgrows RAM; LeanStore, in-memory B-tree (+swapping), and BerkeleyDB.]
TPC-C Ramp Up With Cold Cache
[Figure: TPC-C throughput [txns/sec] (log scale, 10 to 1M) over the first 40 seconds after a cold start, for a PCIe SSD, a SATA SSD, and a disk.]
Replacement Strategy Parameter
[Figure: normalized throughput (0.7-1.1) versus the percentage of pages in the cooling stage (2%, 10%, 50%, log scale), for uniform and Zipf-distributed accesses (z = 1 to z = 2).]
Conclusions
• in-memory performance is competitive with the fastest main-memory systems
• transparent management of arbitrary data structures (e.g., B-trees, hash tables, column-wise storage)
• scales well on multi-core CPUs