what the hekaton? in-memory oltp overview · pdf file1 what the hekaton? in-memory oltp...
TRANSCRIPT
1
What the Hekaton?In-memory OLTP Overview
Kalen Delaneywww.SQLServerInternals.com
join-conference.com
2016-05-10 2join-conference.com© Kalen Delaney, 2016 #2
Kalen DelaneyBackground:
– MS in Computer Science from UC Berkeley– Working exclusively with SQL Server for 28 years
– SQL Server In-Memory OLTP Internals (RedGate 2014)– Primary Author: SQL Server 2012 Internals (MS Press/O’Reilly, 2013)– Author: SQL Server Concurrency (RedGate 2010)– Primary Author: SQL Server 2008 Internals (MS Press, 2009)– Primary Author: Inside SQL Server 2005: Query Tuning and Optimization (MS Press, 2007)– Author: Inside SQL Server 2005: The Storage Engine (MS Press, 2006)– SQL Server Magazine columnist and contributing editor
Website: www.SQLServerInternals.comTwitter: @sqlqueenBlog: www.SQLBlog.com
2
2016-05-10 3join-conference.com© Kalen Delaney, 2016 #3
Memory optimized table and index structures
Stream-based storage
Native compilation of business logic in stored procedures
Fully integrated into SQL Server
Familiar management tools: SSMS/SMO/DMVs
Latch- and lock-free data structures
Multi-versioning built into table structures
In-Memory OLTP Overview
SQL Server 2014 adds in-memory technology to boost performance of OLTP workloads
2016-05-10 4join-conference.com© Kalen Delaney, 2016 #4
Agenda
• In-memory Data Structures• Managing Memory and Garbage Collection• Persistence and Storage Management• Logging and Recovery
3
2016-05-10 5join-conference.com© Kalen Delaney, 2016 #5
In-Memory Data Structures• Rows
– New row format Structure of the row is optimized for in-memory residency and access
– One copy of rowIndexes point to rows, they do not duplicate them
• Indexes– Hash index for equality search – Memory-optimized B-tree for range and equality search – Do not exist on disk – recreated during recovery– Every table must have at least one index!
2016-05-10 6join-conference.com© Kalen Delaney, 2016 #6
In-memory Table: Row FormatRow header Payload (table columns)
Begin Ts End Ts StmtId IdxLinkCount
8 bytes 8 bytes 4 bytes 2 bytes
8 bytes * (IdxLinkCount – 1)
Key Points:Begin/End timestamp determines row’s validityThere is no data or index page - Just rowsRow size limited to 8060 bytes
Allows data to be moved to disk-based tablesNot every SQL table schema is supported
4
2016-05-10 7join-conference.com© Kalen Delaney, 2016 #7
Indexes On Memory-Optimized Tables • Indexes are what combines rows into a table• Hash Indexes
– Defined as NONCLUSTERED HASH– Predefined number of buckets (fixed memory size)– Good for point lookups
• “Range” Indexes– Defined as NONCLUSTERED – Stored as Bw-Tree– Good for range searches and ordered scans
• Indexes ONLY exist in memory
2016-05-10 8join-conference.com© Kalen Delaney, 2016 #8
90, ∞ Susan Bogota
Hash Indexes
50, ∞ Jane Prague
Timestamps NameChain ptrs City
Hash index on City
Hash index on Name
100, ∞ John Prague
90, 150
5
2016-05-10 9join-conference.com© Kalen Delaney, 2016 #9
“Range” Index Core Characteristics• Resizeable – grows and shrinks with utilization• Unidirectional• Good performance for point lookup, excellent performance for
range and table scan• Lock free• Bw-tree based• Leaf pages used to store index keys and pointers to chains of
rows• Value chains have same characteristics as bucket chains for
hash index
2016-05-10 10join-conference.com© Kalen Delaney, 2016 #10
Bw-Tree
10 20 28
5 8 10 11 15 18 21 24 27
PAGE
Page Mapping Table
0
1
2
3
14
15
PAGE
1 2 4 6 7 8 25 26 27
200, ∞ 1 50, 300 2
Root
Non-leaf pages
leaf pages
Data rows
PageID-0
PageID-3 Page-ID-2 PageID -14
• Page size- up to 8K• Logical pointers
– Indirect physical pointers through Page Mapping table
– Page Mapping table grows (doubles) as table grows• Sibling pages linked one direction
– Require two indexes for ASC/DESC• No in-place updates on index pages
– Handled thru delta pages or building new pages
Key Key
Physical
100,200 1
6
2016-05-10 11join-conference.com© Kalen Delaney, 2016 #11
Point Lookups and Range Scans
• Point lookups similar to B-Trees• Range scans
– Search for starting point– Follow keys, duplicate chains and right page
pointers until end key is reached– Uni-directional because pages linked in only
one direction
2016-05-10 12join-conference.com© Kalen Delaney, 2016 #12
Limitations on Tables in SQL 2014
• Optimized for in-memory– Rows are at most 8060 bytes – no off-row data– No Large Object (LOB) types like varchar(max)
• Scoping limitations– No FOREIGN KEY and no CHECK constraints– IDENTITY only (1,1)– No schema changes (ALTER TABLE) – need to
drop/recreate table– No add/remove index – need to drop/recreate table
7
2016-05-10 13join-conference.com© Kalen Delaney, 2016 #13
Memory Management• Table data resides in memory at all times.
– No paging– Must configure SQL box with sufficient memory to store memory-optimized
tables; Max supported 512GB
– Failure to allocate memory will fail transactional workload at runtime– Integrated with SQL Server memory manager and reacts to memory pressure
where possible
• Integration with Resource Governor – “Bind” a database to a resource pool– Ensures memory consumption from recovery is accounted for– Hard limit (80% of phys. memory) to ensure system remains stable under low-
memory situations
2016-05-10 14join-conference.com© Kalen Delaney, 2016 #14
Garbage Collection• Stale Row Versions
Updates, deletes, and aborted insert operations create row versions that (eventually) are no longer visible to any transaction.– Slows down scans of index structures– Creates unused memory that needs to be reclaimed (i.e. Garbage Collected)
• Garbage Collection (GC)– Analogous to version store cleanup task for disk-based tables to support Read
Committed Snapshot (RCSI)– System maintains ‘oldest active transaction’ information
• Design Goals: – Non-blocking, Cooperative, Efficient, Responsive, Scalable– Active transactions work cooperatively and pick up parts of GC work– A dedicated system thread to do GC
8
2016-05-10 15join-conference.com© Kalen Delaney, 2016 #15
Durability• Memory-optimized tables can be durable or non-durable
– Default is ‘durable’– Non-durable tables are useful for transient data
• Durable tables are persisted in a single memory-optimized filegroup– Storage used for memory-optimized has a different access pattern
than for disk tables
• Filegroup can have multiple containers (volumes)– Additional containers aid in parallel recovery; recovery happens at
the speed of I/O
2016-05-10 16join-conference.com© Kalen Delaney, 2016 #16
On-disk Storage• Filestream is the underlying storage mechanism
– Checksums and single-bit correcting ECC on files• Data files
– ~128MB in size (unless machine has <16GB memory), write 256KB chunks at a time
– Stores only the inserted rows (i.e. table content)– Chronologically organized streams of row versions
• Delta files– File size is not constant, write 4KB chunks at a time– Stores IDs of deleted rows
9
2016-05-10 17join-conference.com© Kalen Delaney, 2016 #17
Storage: Data and Delta Files
Data File
Delta File
0 100
TS (ins) RowId TableIdTS (ins) RowId TableIdTS (ins) RowId TableId
TS (ins) RowId TS (del)TS (ins) RowId TS (del)TS (ins) RowId TS (del)
Chec
kpoi
nt F
ile P
air
Row pay loadRow pay loadRow pay load
2016-05-10 18join-conference.com© Kalen Delaney, 2016 #18
Populating Data/Delta files
Offline Checkpoint Thread
Memory-optimized Table Filegroup
Rang
e 100
-199
Rang
e 200
-299
Rang
e 300
-399
Rang
e 400
-499
Rang
e 500
-
New InsertsDelete 450 TSDelete 250 TSDelete 150 TS
Del Tran2(TS 450)
Del Tran3(TS 250)
Del Tran1(TS150)
Insert into Hekaton T1Log in disk TableDel Tran1
(row TS150)Del Tran2
(row TS 450)Del Tran3
(row TS 250) Insert into T1SQL Transaction log
10
2016-05-10 19join-conference.com© Kalen Delaney, 2016 #19
Logging for Memory-Optimized Tables
• Uses SQL transaction log to store content– Each HK log record contains a log record header followed by
opaque memory optimized-specific log content
• All logging for memory-optimized tables is logical– No log records for physical structure modifications– No index-specific / index-maintenance log records– No UNDO information is logged
2016-05-10 20join-conference.com© Kalen Delaney, 2016 #20
Backup for Memory-Optimized Tables
– Memory-Optimized file group is backed up as part SQL database backup
11
2016-05-10 21join-conference.com© Kalen Delaney, 2016 #21
Recovery for Memory-Optimized Tables
• Analysis Phase• Finds the last completed checkpoint in transaction log
• Data Load• Load from set of data/delta files from the last completed checkpoint
– Parallel Load by reading data/delta files using 1 thread / file
• Redo phase to apply tail of the log– Apply the transaction log from last checkpoint– Concurrent with REDO on disk-based tables
• No UNDO phase for memory-optimized tables– Only committed transactions are logged
2016-05-10 22join-conference.com© Kalen Delaney, 2016 #22
Recovery: Parallel load
Delta map
Recovery Data Loader
DeltaFile1
Memory Optimized Tables
Recovery Data Loader
Recovery Data Loader
Delta mapDelta map
DataFile1
DeltaFile2
DataFile2
DeltaFile3
DataFile3
filter filter filter
Memory Optimized Container - 1 Memory Optimized Container - 2