database system implementationjarulraj/courses/4420-s19/slides/07-larg… · choice #1: tombstones...
TRANSCRIPT
![Page 1: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/1.jpg)
DATABASE SYSTEM IMPLEMENTATION
GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ
LECTURE #7: LARGER-THAN-MEMORY DATABASES
![Page 2: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/2.jpg)
THE WORLD OF DATABASE SYSTEMS
CitusDataDistributed extensionof a single-nodeDBMS (PostgreSQL)
2
![Page 3: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/3.jpg)
ADMINISTRIVIA
Reminder: Homework 2 was released on last Thursday. It will be due on next Tuesday.
3
![Page 4: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/4.jpg)
TODAY’S AGENDA
Larger-than-memory DatabasesImplementation IssuesReal-world ExamplesEvaluation
4
![Page 5: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/5.jpg)
MOTIVATION
DRAM is expensive.→ It would be nice if our in-memory DBMS could use
cheaper storage.→ 40% of energy in a server is spent on refreshing DRAM
Bringing back the disk in a smart way without having to bring back a buffer pool manager
5
![Page 6: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/6.jpg)
LARGER-THAN-MEMORY DATABASES
Allow an in-memory DBMS to store/access data on disk without bringing back all the slow parts of a disk-oriented DBMS.
Need to be aware of hardware access methods→ In-memory Storage = Tuple-Oriented→ Disk Storage = Block-Oriented
6
![Page 7: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/7.jpg)
OLAP
OLAP queries generally access the entire table.Thus, there isn’t anything about the workload for the DBMS to exploit that a disk-oriented buffer pool can’t handle.
7
Disk DataA
In-MemoryZone Map (A)
MIN=##MAX=##SUM=##
COUNT=##AVG=###
STDEV=###
⋮
A
![Page 8: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/8.jpg)
OLTP
OLTP workloads almost always have hot and coldportions of the database.→ We can assume that txns will almost always access hot
tuples.
The DBMS needs a mechanism to move cold data out to disk and then retrieve it if it is ever needed again.
8
![Page 9: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/9.jpg)
LARGER-THAN-MEMORY DATABASES
9
In-Memory Table Heap
Tuple #01
Tuple #03
Tuple #04
Tuple #00
Tuple #02
Cold-Data Storage
header
Tuple #01
Tuple #03
Tuple #04
In-Memory Index
SELECT * FROM tableWHERE id = <Tuple #01>
???
???
??????
Evicted Tuple Block
???
???
![Page 10: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/10.jpg)
AGAIN, WHY NOT MMAP?
Write-ahead logging requires that a modified page cannot be written to disk before the log records that made those changes is written.
There are no mechanisms for asynchronous read-ahead or writing multiple pages concurrently.
10
IN-MEMORY PERFORMANCE FOR BIG DATAVLDB 2014
![Page 11: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/11.jpg)
OLTP ISSUES
Run-time Operations→ Cold Tuple Identification
Eviction Policies→ Timing→ Evicted Tuple Metadata
Data Retrieval Policies→ Granularity→ Retrieval Mechanism→ Merging back to memory
11
![Page 12: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/12.jpg)
COLD TUPLE IDENTIFICATION
Choice #1: On-line→ The DBMS monitors txn access patterns and tracks how
often tuples are used.→ Embed the tracking meta-data directly in tuples.
Choice #2: Off-line→ Maintain a tuple access log during txn execution.→ Process in background to compute frequencies.
12
![Page 13: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/13.jpg)
EVICTION TIMING
Choice #1: Threshold→ The DBMS monitors memory usage and begins evicting
tuples when it reaches a threshold.→ The DBMS has to manually move data.
Choice #2: OS Virtual Memory→ The OS decides when it wants to move data out to disk.
This is done in the background.
13
![Page 14: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/14.jpg)
EVICTED TUPLE METADATA
Choice #1: Tombstones→ Leave a marker (block id, offset) that points to on-disk tuple.→ Update indexes to point to the tombstone tuples.
Choice #2: In-memory Bloom Filters→ Use approximate data structure for each table. → Check Bloom filter and on-disk index for each query.
Choice #3: OS Virtual Memory→ The OS tracks what data is on disk. The DBMS does not
need to maintain any additional metadata.
14
![Page 15: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/15.jpg)
CHOICE #2: BLOOM FILTERS
Bloom filter is a fast memory-efficient data structure that tells whether a tuple is present in on disk or not. → Price paid for this efficiency is that it is a probabilistic data
structure → It tells us that the element either definitely is not in the set or
may be in the set.
→ Check in-memory bloom filter for each tuple.→ If tuple is definitely not on disk, no need for disk accesses.→ Otherwise, we use the on-disk index to retrieve tuple.
15
![Page 16: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/16.jpg)
EVICTED TUPLE METADATA
16
In-Memory Table Heap
Tuple #01
Tuple #03
Tuple #04
Tuple #00
Tuple #02
Cold-Data Storage
header
Tuple #01
Tuple #03
Tuple #04
In-Memory Index
Access FrequencyTuple #00Tuple #01Tuple #02Tuple #03Tuple #04Tuple #05
<Block,Offset>
<Block,Offset>
<Block,Offset>
Bloom Filter Index
![Page 17: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/17.jpg)
DATA RETRIEVAL GRANULARITY
Choice #1: Only Tuples Needed→ Only merge the tuples that were accessed by a query back
into the in-memory table heap.→ Requires additional bookkeeping to track holes.
Choice #2: All Tuples in Block→ Merge all the tuples retrieved from a block regardless of
whether they are needed.→ More CPU overhead to update indexes.→ Certain tuples may likely be evicted again.
17
![Page 18: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/18.jpg)
RETRIEVAL MECHANISM
Choice #1: Abort-and-Restart→ Abort the txn that accessed the evicted tuple.→ Retrieve the data from disk and merge it into memory
with a separate background thread.→ Restart the txn when the data is ready.
Choice #2: Synchronous Retrieval→ Stall the txn when it accesses an evicted tuple while the
DBMS fetches the data and merges it back into memory.
18
![Page 19: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/19.jpg)
MERGING THRESHOLD
Choice #1: Always Merge→ Retrieved tuples are always put into table heap.
Choice #2: Merge Only on Update→ Retrieved tuples are only merged into table heap if they
are used in an UPDATE query.→ All other tuples are put in a temporary buffer.
Choice #3: Selective Merge→ Keep track of how often each block is retrieved.→ If a block’s access frequency is above some threshold,
merge it back into the table heap.
19
![Page 20: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/20.jpg)
REAL-WORLD IMPLEMENTATIONS
H-Store – Anti-CachingHekaton – Project SiberiaEPFL’s VoltDB PrototypeMemSQL – Columnar Tables
20
![Page 21: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/21.jpg)
H-STORE – ANTI-CACHING
On-line IdentificationAdministrator-defined ThresholdTombstonesAbort-and-restart RetrievalBlock-level GranularityAlways Merge
21
ANTI-CACHING: A NEW APPROACH TO DATABASE MANAGEMENT SYSTEM ARCHITECTUREVLDB 2013
![Page 22: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/22.jpg)
HEKATON – PROJECT SIBERIA
Off-line IdentificationAdministrator-defined ThresholdBloom FiltersSynchronous RetrievalTuple-level GranularityAlways Merge
22
TREKKING THROUGH SIBERIA: MANAGING COLD DATA IN A MEMORY-OPTIMIZED DATABASEVLDB 2014
![Page 23: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/23.jpg)
EPFL VOLTDB
Off-line IdentificationOS Virtual MemorySynchronous RetrievalPage-level GranularityAlways Merge
23
ENABLING EFFICIENT OS PAGING FOR MAIN-MEMORY OLTP DATABASESDAMON 2013
![Page 24: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/24.jpg)
In-Memory Table Heap
Cold-Data Storage
EPFL VOLTDB
24
Tuple #00
Tuple #02
Hot Tuples
Cold Tuples
Tuple #01Tuple #03
Tuple #01
![Page 25: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/25.jpg)
MEMSQL – COLUMNAR TABLES
Administrator manually declares a table as a distinct disk-resident columnar table.→ Appears as a separate logical table to the application.→ Uses mmap to manage buffer pool.→ Pre-computed aggregates per block always in memory.
Manual IdentificationNo Evicted Metadata is needed.Synchronous RetrievalAlways Merge
25
Source: MemSQL Documentation
![Page 26: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/26.jpg)
EVALUATION
Compare different design decisions in H-Store with anti-caching.
Storage Devices:→ Hard-Disk Drive (HDD)→ Shingled Magnetic Recording Drive (SMR)→ Solid-State Drive (SSD)→ 3D XPoint (3DX)→ Non-volatile Memory (NVRAM)
26
LARGER-THAN-MEMORY DATA MANAGEMENT ON MODERN STORAGE HARDWARE FOR IN-MEMORY OLTP DATABASE SYSTEMSDAMON 2016
![Page 27: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/27.jpg)
MICROBENCHMARK
27
1E+00
1E+02
1E+04
1E+06
1E+08
HDD SMR SSD 3D XPoint NVRAM DRAM
Late
ncy
(nan
osec
)
1KB Read 1KB Write 64KB Read 64KB Write
102
100
104
108
106
10m tuples – 1KB each50% Reads / 50% Writes – Synchronization Enabled
![Page 28: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/28.jpg)
MERGING THRESHOLD
28
0
50000
100000
150000
200000
250000
HDD (AR) HDD (SR) SMR (AR) SMR (SR) SSD 3DX NVMRAM
Thro
ughp
ut (t
xn/s
ec)
Merge (Update-Only) Merge (Top-5%) Merge (Top-20%) Merge (All)
YCSB Workload – 90% Reads / 10% Writes10GB Database using 1.25GB Memory
DRAM
![Page 29: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/29.jpg)
CONFIGURATION COMPARISON
Generic Configuration→ Abort-and-Restart Retrieval→ Merge (All) Threshold→ 1024 KB Block Size
Optimized Configuration→ Synchronous Retrieval→ Top-5% Merge Threshold→ Block Sizes (HDD/SMR - 1024 KB) (SSD/3DX - 16 KB)
29
![Page 30: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/30.jpg)
TATP BENCHMARK
30
0
80000
160000
240000
320000
HDD SMR SSD 3D XPoint NVRAM
Thro
ughp
ut (t
xn/s
ec)
Generic Optimized
Optimal Configuration per Storage Device1.25GB Memory
DRAM
![Page 31: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/31.jpg)
PARTING THOUGHTS
Today was about working around the block-oriented access and slowness of secondary storage.
None of these techniques handle index memory.
Fast & cheap byte-addressable NVM will make this lecture unnecessary.
31
![Page 32: DATABASE SYSTEM IMPLEMENTATIONjarulraj/courses/4420-s19/slides/07-larg… · Choice #1: Tombstones →Leave a marker (block id, offset) that points to on-disk tuple. →Update indexes](https://reader033.vdocuments.us/reader033/viewer/2022043017/5f39f9281a7e7440e002dfe6/html5/thumbnails/32.jpg)
NEXT CLASS
Hardware! NVM! GPUs! HTM!
32