Download - Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 CSG131 Summary Donghui Zhang
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1
CSG131 Summary
Donghui Zhang
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 2
What we learned Storage (Chp 8 & 9) B+-tree (Chp 10) Hash index (Chp 11) External sort (Chp 13) Query evaluation Concurrency control (Chp 17) Recovery (Chp 18) Project: simple DBMS implentation,
NEUStore SB-tree, R-tree.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 3
Components of a Disk
Platters
The platters spin (say, 90rps).
Spindle
The arm assembly is moved in or out to position a head on a desired track. Tracks under heads make a cylinder (imaginary!).
Disk head
Arm movement
Arm assembly
Only one head reads/writes at any one time.
Tracks
Sector
Block size is a multiple of sector size (which is fixed).
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4
Accessing a Disk Page
Time to access (read/write) a disk block: seek time (moving arms to position disk head on track) rotational delay (waiting for block to rotate under head) transfer time (actually moving data to/from disk surface)
Seek time and rotational delay dominate. Seek time varies from about 1 to 20msec Rotational delay varies from 0 to 10msec Transfer rate is about 1msec per 4KB page
Key to lower I/O cost: reduce seek/rotation delays! Hardware vs. software solutions?
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 5
Index Types A primary index is an index which
controls the actual storage of a table. Typically this index is built using the primary key of the table. A data entry is one record of the table.
A secondary index is an index which is built using some other attribute(s). A data entry contains a set of RIDs.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 6
An Example3325 6632
0123
Donghui Zhang
33 617-373-2177
0222
James Bond
55 617-324-6685
2418
Betty Wu
21 951-666-8888332
5Jessica Greene
21
417-332-6659
3891
Frank Gates
33
433-222-1956
6632
Tim Varner
55 617-333-6782
9688
Stephanie Chen
33 951-243-0104
10
20
30
primary index
secondary index21: {10, 20}33: {10, 20, 30}55: {10, 30}
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 7
Buffer Management in a DBMS
Data must be in RAM for DBMS to operate on it! For each page, dirty bit, pin count.
DB
MAIN MEMORY
DISK
disk page
free frame
Page Requests from Higher Levels
BUFFER POOL
choice of frame dictatedby replacement policy
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 8
Page Formats: Variable Length Records
Can move records on page without changing rid; so, attractive for fixed-length records too.
Page iRid = (i,N)
Rid = (i,2)
Rid = (i,1)
Pointerto startof freespace
SLOT DIRECTORY
N . . . 2 120 16 24 N
# slots
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 9
What we learned Storage (Chp 8 & 9) B+-tree (Chp 10) Hash index (Chp 11) External sort (Chp 13) Query evaluation Concurrency control (Chp 17) Recovery (Chp 18) Project: simple DBMS implentation,
NEUStore SB-tree, R-tree
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 10
Example B+ Tree
2* 3*
Root
17
24 30
14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
135
7*5* 8*
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 11
B+ Trees in Practice
Typical order: 100. Typical fill-factor: 67%. average fanout = 133
Can often hold top levels in buffer pool: Level 1 = 1 page = 8 KB Level 2 = 133 pages = 1 MB Level 3 = 17,689 pages = 145 MB Level 4 = 2,352,637 pages = 19 GB With 1 MB buffer, can locate one record
in 19 GB (or 0.3 billion records) in two I/Os!
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 12
B+-tree index Structure Search Insert Delete
Bulk-loading a B+-tree Aggregation Query SB-tree
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 13
What we learned Storage (Chp 8 & 9) B+-tree (Chp 10) Hash index (Chp 11) External sort (Chp 13) Query evaluation Concurrency control (Chp 17) Recovery (Chp 18) Project: simple DBMS implentation,
NEUStore SB-tree, R-tree
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 14
Extendible Hashing Directory is array of size 4. To find bucket for r, take
last `global depth’ # bits of h(r); we denote r by h(r). If h(r) = 5 = binary 101,
it is in bucket pointed to by 01.
Insert: If bucket is full, split it (allocate new page, re-distribute).
If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing global depth with local depth for the split bucket.)
13*00
01
10
11
2
2
2
2
2
LOCAL DEPTH
GLOBAL DEPTH
DIRECTORY
Bucket A
Bucket B
Bucket C
Bucket D
DATA PAGES
10*
1* 21*
4* 12* 32* 16*
15* 7* 19*
5*
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 15
Insert h(r)=20 (Causes Doubling)
20*
00
01
10
11
2 2
2
2
LOCAL DEPTH 2
2
DIRECTORY
GLOBAL DEPTHBucket A
Bucket B
Bucket C
Bucket D
Bucket A2(`split image'of Bucket A)
1* 5* 21*13*
32*16*
10*
15* 7* 19*
4* 12*
19*
2
2
2
000
001
010
011
100
101
110
111
3
3
3DIRECTORY
Bucket A
Bucket B
Bucket C
Bucket D
Bucket A2(`split image'of Bucket A)
32*
1* 5* 21*13*
16*
10*
15* 7*
4* 20*12*
LOCAL DEPTH
GLOBAL DEPTH
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 16
Example of Linear Hashing
On split, hLevel+1 is used to re-distribute entries.
0hh
1
(This infois for illustrationonly!)
Level=0, N=4
00
01
10
11
000
001
010
011
(The actual contentsof the linear hashedfile)
Next=0
PRIMARYPAGES
Data entry rwith h(r)=5
Primary bucket page
44* 36*32*
25*9* 5*
14* 18*10*30*
31*35* 11*7*
0hh
1
Level=0
00
01
10
11
000
001
010
011
Next=1
PRIMARYPAGES
44* 36*
32*
25*9* 5*
14* 18*10*30*
31*35* 11*7*
OVERFLOWPAGES
43*
00100
Let hLevel=(a*key+b) % 4hLevel+1=(a*key+b) % 8
If insert 22……
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 17
Example of Linear Hashing
0hh
1
Level=0
00
01
10
11
000
001
010
011
Next=2
PRIMARYPAGES
44* 36*
32*
25*9*
14* 18*10*30*
31*35* 11*7*
OVERFLOWPAGES
43*
00100
Let hLevel=(a*key+b) % 4hLevel+1=(a*key+b) % 8
22*
5*01101
Important notice: after buckets 010 and 011are split, Next moves back To 000!
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 18
What we learned Storage (Chp 8 & 9) B+-tree (Chp 10) Hash index (Chp 11) External sort (Chp 13) Query evaluation Concurrency control (Chp 17) Recovery (Chp 18) Project: simple DBMS implentation,
NEUStore SB-tree, R-tree
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 19
General External Merge Sort
To sort a file with N pages using B buffer pages: Pass 0: use B buffer pages. Produce sorted
runs of B pages each. Pass 2, …, etc.: merge B-1 runs.
N B/
B Main memory buffers
INPUT 1
INPUT B-1
OUTPUT
DiskDisk
INPUT 2
. . . . . .
. . .
More than 3 buffer pages. How can we utilize them?
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 20
Cost of External Merge Sort
Number of passes: Cost = 2N * (# of passes) E.g., with 5 buffer pages, to sort 108 page
file: Pass 0: = 22 sorted runs of 5 pages
each (last run is only 3 pages) Pass 1: = 6 sorted runs of 20 pages
each (last run is only 8 pages) Pass 2: 2 sorted runs, 80 pages and 28 pages Pass 3: Sorted file of 108 pages
Note: for merge join, no need to write result.
1 1 log /B N B
108 5/
22 4/
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 21
What we learned Storage (Chp 8 & 9) B+-tree (Chp 10) Hash index (Chp 11) External sort (Chp 13) Query evaluation Concurrency control (Chp 17) Recovery (Chp 18) Project: simple DBMS implentation,
NEUStore SB-tree, R-tree
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 22
parse
convert
apply laws
estimate result sizes
consider physical plans estimate costs
pick best
execute
{P1,P2,…..}
{(P1,C1),(P2,C2)...}
Pi
answerSQL query
parse tree
logical query plan
“ improved” l.q.p
l.q.p. +sizes
statistics
query compiling
query optimization
query execution
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 23
Query Compiling Convert
Input: a parse tree. Output: a logical query plan.
Algorithm: followed by . E.Name(E.SSN<5000 AND E.Age>50(E) ) Alternatively, a l.q.p tree.
SELECT E.NameFROM Emp EWHERE E.SSN<5000 AND E.Age>50
Emp E
E.SSN<5000 AND E.Age>50
E.Name
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 24
Query Optimization Consider Physical Plans
Associate each RA operator with an implementation scheme. Multiple implementation schemes? Enumerate all.
SELECT E.NameFROM Emp EWHERE E.SSN<5000 AND E.Age>50
Emp E
E.SSN<5000 AND E.Age>50
E.Name
Plan 1 (always work!)
scan
on-the-fly
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 25
Query Optimization Consider Physical Plans
SELECT E.NameFROM Emp EWHERE E.SSN<5000 AND E.Age>50
Emp E
E.SSN<5000 AND E.Age>50
E.Name
Plan 2
range search in SSN index
on-the-fly
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 26
Query Optimization Consider Physical Plans
SELECT E.NameFROM Emp EWHERE E.SSN<5000 AND E.Age>50
Emp E
E.SSN<5000 AND E.Age>50
E.Name
Plan 3
range search in Age index, follow pointers to SSN index
on-the-fly
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 27
Query Optimization Estimate Costs sample input
Assume for table E: Schema = (SSN: int, Name: string, Age: int, Salary: int) T(E) = 100 tuples. For attribute SSN:
• V(E, SSN)=100, min(E, SSN)=0000, max(E, SSN)=9999 For attribute Age:
• V(E, Age)=20, min(E, Age)=21, max(E, Age)=60 Primary index on SSN: 3 level B+-tree, 50 leaf nodes. Secondary index on Age: 2 level B+-tree, 10 leaf nodes, every
leaf entry points to 3.5 pageIDs (on average). Assumptions: all B+-tree roots are pinned. Can reach the
first leaf page of a B+-tree directly. Memory buffer size: 2 pages.
SELECT E.NameFROM Emp EWHERE E.SSN<5000 AND E.Age>50
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 28
Query Optimization Estimate Costs
Cost = 50. (The primary index has 50 leaf nodes. Assume we can reach the first leaf page of a B+-tree directly.)
SELECT E.NameFROM Emp EWHERE E.SSN<5000 AND E.Age>50
Emp E
E.SSN<5000 AND E.Age>50
E.Name
Plan 1 (always work!)
scan
on-the-fly
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 29
Query Optimization Estimate Costs
SELECT E.NameFROM Emp EWHERE E.SSN<5000 AND E.Age>50
Emp E
E.SSN<5000 AND E.Age>50
E.Name
Plan 2
range search in SSN index
on-the-fly
Cost = 25. SSN<5000 selects half of the employees, so 50/2=25 leaf nodes.
Note: if condition is E.SSN>5000, needs 1 more I/O.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 30
Query Optimization Estimate Costs
SELECT E.NameFROM Emp EWHERE E.SSN<5000 AND E.Age>50
Emp E
E.SSN<5000 AND E.Age>50
E.Name
Plan 3
range search in Age index, follow pointers to SSN index
on-the-fly
Cost = 10/4 + 20/4 * 3.5 = 21.
#I/Os in the Age index #I/Os in the SSN index
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 31
Query Optimization Pick Best
SELECT E.NameFROM Emp EWHERE E.SSN<5000 AND E.Age>50
physical plan I/O cost
Plan 1: scan 50
Plan 2: range search SSN index 25
Plan 3: range search Age index 21
Pick!
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 32
Query Compiling Convert
Algorithm: then then .
E.Name. D.Dname(E.Did=D.Did AND E.SSN<5000 AND D.budget=1000(ED) ) The l.q.p tree:
SELECT E.Name, D.DnameFROM Emp E, Dept DWHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000
Emp E
E.Did=D.Did AND E.SSN<5000 AND D.budget=1000
E.Name, D.Dname
Dept D
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 33
Query Compiling Apply Laws
Always always: (try to) replace with ! Also, push down.
SELECT E.Name, D.DnameFROM Emp E, Dept DWHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000
Emp E
E.SSN<5000
E.Name, D.Dname
Dept D
D.budget=1000
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 34
Four Join Algorithms
Iteration join (nested loop join) Merge join Hash join Join with index
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 35
Example E D over common attribute Did
E: T(E)=10,000 primary index on SSN, 3 levels. |E|= 1,000 leaf nodes.
D: T(D)=5,000 primary index on Did. 3 levels. |D| = 500 leaf nodes.
Memory available = 101 blocks
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 36
• Iteration JoinUse our memory(1) Read 100 blocks of D(2) Read all of E (using 1 block) + join(3) Repeat until done
• I/O cost = |D| + |D|/100 * |E| =
500 + 5*1000 = 5,500.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 37
Merge Join Cost
Recall that |E|=1000, |D|=500. And |D| is already sorted on Did.
External sort E: pass 0, by reading and writing E, produces a file with 10 sorted runs. Another read is enough.
No need to write! Can pipeline to join operator.
Cost = 3*1000 + 500 = 3,500.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 38
Simple example hash: even/odd
R1 R2 Buckets2 5 Even 4 4 R1 R23 12 Odd: 5 38 139 8
1114
2 4 8 4 12 8 14
3 5 9 5 3 13 11
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 39
Hash Join Cost
Read + write both E and D for partitioning, then read to join.
Cost = 3 * (1000 + 500) = 4,500.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 40
Join with index (Conceptually)
For each r E do
Find the corresponding D tuple by probing index.
Assuming the root is pinned in memory,Cost = |E| + T(E)*2 = 1000 + 10,000*2 = 21,000.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 41
What we learned Storage (Chp 8 & 9) B+-tree (Chp 10) Hash index (Chp 11) External sort (Chp 13) Query evaluation Concurrency control (Chp 17) Recovery (Chp 18) Project: simple DBMS implentation,
NEUStore SB-tree, R-tree
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 42
ACID Properties of Transactions
Atomicity: all actions are carried out, or none. Consistency: each transaction preserves the
consistency of the database if executed by itself. (The users make sure of this. E.g. transfer money…)
Isolation: transactions are isolated from the effect of concurrently scheduling other transactions.
Durability: the effect of a committed transaction should last, even if system crash before all changed are flushed to disk.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 43
Scheduling Transactions
Serial schedule: Schedule that does not interleave the actions of different transactions.
Equivalent schedules: For any database state, the effect (on the set of objects in the database) of executing the first schedule is identical to the effect of executing the second schedule.
Serializable schedule: A schedule that is equivalent to some serial execution of the transactions.
(Note: If each transaction preserves consistency, every serializable schedule preserves consistency. )
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 44
Example
Conflict serializable serializable. Serializable conflict serializable
T1: R(A) W(A), CommitT2: W(A), CommitT3: W(A), Commit
T1 T2 precedence graph:1. a node for each
transaction2. an arc from Ti to Tj if an
action in Ti precedes andconflicts with an action inTj.
T3
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 45
Deadlock Prevention
Assign priorities based on timestamps. Assume Ti wants a lock that Tj holds. Two policies are possible: Wait-Die: It Ti has higher priority, Ti waits
for Tj; otherwise Ti aborts Wound-wait: If Ti has higher priority, Tj
aborts; otherwise Ti waits If a transaction re-starts, make sure it
has its original timestamp
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 46
Multiple-Granularity Locks
Why? If a transaction needs to scan all records in a table, do we really want to have a lock on all tuples individually? Significant locking overhead!
Put a single lock on the table!
Tuples
Tables
Pages
Database
contains
A lock on a nodeimplicitly locksall decendents.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 47
Allowed Sharings
-- IS IX
--
IS
IX
S X
S
X
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 48
Tree Locking Algorithm
Search: Start at root and go down; repeatedly, S lock child then unlock parent.
Insert/Delete: Start at root and go down, obtaining X locks as needed. Once child is locked, check if it is safe: If child is safe, release all locks on ancestors.
Safe node: Node such that changes will not propagate up beyond this node. Inserts: Node is not full. Deletes: Node is not half-empty.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 49
Example
ROOT
A
B
C
D E
F
G H I
20
35
20*
38 44
22* 23* 24* 35* 36* 38* 41* 44*
Do:1) Search 38*2) Delete 38*3) Insert 45*4) Insert 25*
23
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 50
Kung-Robinson Model
Xacts have three phases: READ: Xacts read from the database,
but make changes to private copies of objects.
VALIDATE: Check for conflicts. WRITE: Make local copies of changes
public.ROOT
old
new
modifiedobjects
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 51
Timestamp CC
Idea: Give each object a read-timestamp (RTS) and a write-timestamp (WTS), give each Xact a timestamp (TS) when it begins: If action ai of Xact Ti conflicts with
action aj of Xact Tj, and TS(Ti) < TS(Tj), then ai must occur before aj. Otherwise, restart violating Xact.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 52
What we learned Storage (Chp 8 & 9) B+-tree (Chp 10) Hash index (Chp 11) External sort (Chp 13) Query evaluation Concurrency control (Chp 17) Recovery (Chp 18) Project: simple DBMS implentation,
NEUStore SB-tree, R-tree
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 53
The Big Picture: What’s Stored Where
DB
Data pageseachwith apageLSN
Xact TablelastLSNstatus
Dirty Page TablerecLSN
flushedLSN
RAM
prevLSNXIDtype
lengthpageID
offsetbefore-imageafter-image
LogRecords
LOG
master record
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 54
WAL & the Log
Each log record has a unique Log Sequence Number (LSN). LSNs always increasing.
Each data page contains a pageLSN. The LSN of the most recent log record
for an update to that page. System keeps track of flushedLSN.
The max LSN flushed so far. WAL: Before a page is written,
pageLSN flushedLSN
LSNs
DB
pageLSNs
RAM
flushedLSN
pageLSN
Log recordsflushed to disk
“ Log tail” in RAM
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 55
Normal Execution of an Xact
Series of reads & writes, followed by commit or abort. We will assume that write is atomic on disk.
• In practice, additional details to deal with non-atomic writes.
Strict 2PL. STEAL, NO-FORCE buffer management, with
Write-Ahead Logging.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 56
Simple Transaction Abort
For now, consider an explicit abort of a Xact. No crash involved.
We want to “play back” the log in reverse order, UNDOing updates. Get lastLSN of Xact from Xact table. Can follow chain of log records backward via
the prevLSN field. Before starting UNDO, write an Abort log record.
• For recovering from crash during UNDO!
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 57
Transaction Commit
Write commit record to log. All log records up to Xact’s lastLSN are
flushed. Guarantees that flushedLSN lastLSN. Note that log flushes are sequential,
synchronous writes to disk. Many log records per log page.
Commit() returns. Write end record to log.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 58
Crash Recovery: Big Picture
Start from a checkpoint (found via master record).
Three phases. Need to:– Figure out which Xacts
committed since checkpoint, which failed (Analysis).
– REDO all actions. (repeat history)
– UNDO effects of failed Xacts.
Oldest log rec. of Xact active at crash
Smallest recLSN in dirty page table after Analysis
Last chkpt
CRASH
A R U