database management systems · advantages of dbmss •abstract data representation •relational...

27
Database Management Systems Marco Serafini COMPSCI 590S 1. Lecture 7

Upload: others

Post on 12-May-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

Database Management Systems

Marco Serafini

COMPSCI 590S1. Lecture 7

Page 2: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

2

MapReduce vs. DBMSs

Page 3: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

33

Advantages of DBMSs• Abstract data representation

• Relational model• Data storage is delegated to the DBMS

• Functional query language (SQL)• Queries specified as simple relational operators• Actual query execution delegated to the DBMS…• … including parallelism, distribution, pipelining etc.

• Support for indexing

Page 4: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

44

Disadvantages of DBMSs• SQL is a limited interface for complex analytics

• E.g. image analysis, creating maps• Need to define a schema for data a priori• High cost of loading data and indexing

• Can be amortized only if same data and schema reused• Too complex for “one shot” analytics

Page 5: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

55

Advantages of MapReduce• Support for arbitrary UDFs• Support for a variety of arbitrary data formats• Simple API• Scalability

Page 6: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

66

Disadvantages• Many of the optimizations of DBMS must be reimplemented by the application, for example

• Indices• Query execution plans (logical + physical)• Column-based architecture• Data format specifications (ProtoBuf)• Compression

Page 7: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

7

Sorting

Page 8: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

88

External Sorting• Problem

• Large table with two columns sorted on column one• Table does not fit in memory• Sort by column two

• Q: How to sort?• “Out-of-core” or “external” algorithm

Page 9: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

99

External Sorting1. Split file in chunks2. Load each chunk in memory3. Sort and write back4. Recursive merge sort

• log(n) passes

Page 10: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

10

Indexing

Page 11: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

1111

Index• Index maps a value (of a column) to a row• Primary index: Maps to primary key of table (unique)• Secondary index: Maps to other columns (duplicates)

Index

Table

Page 12: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

1212

Which Data Structure for Index?• A hash function?• A tree?• Q: Which one is better? When?

Page 13: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

1313

Hash Index

• Advantages• Fast reads and writes

• Disadvantages• No order• Collisions

Tuple

Tuple Tuple

Tuple

Tuple

Buckets Entries

Page 14: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

14

Binary Tree• Q: Problem?• Balance

• Balanced binary trees• E.g. R/B trees

• Locality• Space consumption

13

7

3 10

12

15

Page 15: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

15

B+ Trees• Designed for out-of-core use

• Large nodes• Sorted (for scans)

13 30 48 61

2 5 10 12 13 17 24 28

< 13 [13,30)

pointers to file positions(or actual data)

pointers to file positions(or actual data)

Internal node

Leaves

Page 16: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

1616

Clustered vs. Non-Clustered Index• Clustered index

• Reordering index elements changes physical layout• E.g. a B-Tree where leaf nodes are actual data pages• Primary index often clustered, one clustered index per table• A table without clustered index: heap (don’t be confused!)

• Non-Clustered index• Contains pointers to actual physical location• Can have multiple non-clustered indices for the same table

Page 17: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

17

Flash/SSD• Most of these algorithms expressed considering disks• What about Flash/SSD?

• Faster but…• Coarse read/write granularity: pages (~8 kB or more)• Wear leveling

• A block can be overwritten only a finite number of times• Controller must spread writes evenly

Page 18: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

18

Query Execution

Page 19: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

19

Query Execution• SQL: Domain Specific Language

• Restricted language for specific task• In this case: relational algebra• Physical data independence: changes in physical storage are transparent to applications

• One SQL statement, many possible executions• Logical plan: reorder the operators• Physical plan: different ways to store the data

Page 20: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

2020

ExampleSELECT e.cidFROM Enrollment e, Student sWHERE s.name= ’John Smith’

and e.sid = s.sid

TablesStudent(sid, name)Enrollment(sid, cid)

Page 21: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

2121

Logical Plan• Plan in terms of relational operators

!".$%&

'(.)*+",-…-

⋈(.(%&,".(%&

Student Enrollment

!".$%&

'(.)*+",-…-

⋈(.(%&,".(%&

Student Enrollment

Page 22: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

22

Physical Plan• Local plan annotated with physical operators

!".$%&

'(.)*+",-…-

⋈(.(%&,".(%&

Student Enrollment(File scan)

(Pipeline select and write temporary table T)

(Block nested loop join)

(File scan)

(Pipeline project) !".$%&

'(.)*+",-…-

⋈(.(%&,".(%&

Student Enrollment(Index lookup on name)

(Use index)

(Index nested loop join)

(Index lookup on sid)

(Pipeline project)

Page 23: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

23

Log Structured Merge Trees

Page 24: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

2424

How Good are B+ trees?• Q: Are they good for reading? Why?• Q: Are they good for writing? Why?

Page 25: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

25

Log Structured Merge Trees• Increasingly popular data structure for DBMSs

• RocksDB (Facebook)• LevelDB (Google)

• Goals• Fast data ingestion• Leverage large memory for caching

• Problems• Write and read amplification

Page 26: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

2626

LSMT Data Structures• Memtable

• Binary tree or skiplist à sorted• Receives writes and serves reads

• Log files (runs)• L0: dump of memtable• Li: merge of multiple Li-1 runs

Page 27: Database Management Systems · Advantages of DBMSs •Abstract data representation •Relational model •Data storage is delegated to the DBMS •Functional query language (SQL)

2727

Operations• Writes go to memtable• Reads

• Search memtables and read caches (if available)• Search log files in reverse chronological order• Bloom filters – indices in log files

• Periodically dump memtable to L0• Periodically merge from Li-1 to Li