data management systems - eth z

Data Management Systems

• Access Methods• Pages and Blocks• Indexing• Access Methods in context

Gustavo AlonsoInstitute of Computing PlatformsDepartment of Computer ScienceETH Zürich

1Pages and Blocks

Data representation in memorySlotted pagesRecordsCompression

Architecture of a database

Blocks, files, segments

Pages in memory

Physical records

Logical records (tuples)

Logical data (tables, schemas)

Relations, views

Queries, Transactions (SQL)

Record Interface

Record Access

Page access

File Access

Application

Logical view (logical data)

Access Paths

Physical data in memory

Page structure

Storage allocation

2Pages and Blocks

Cloud storage and file system (Amazon S3)

Goals of the lecture

• Discuss how data is represented within blocks • Block/Page organization

• Slotted pages

• Record identifiers (tuple identifiers)

• Header information for blocks and tuples

• Column vs Row storage

• Compression

Pages and Blocks 3

In perspective

• Data in a database is stored in blocks

• Blocks are parts of extents

• Extents are part of segments

• The notion of pages in a database• Hardware pages (the atomic unit to write to storage, usually 4KB)

• OS Page (the unit used by the OS to implement virtual memory, usually 4KB)

• Database Page = a block, anywhere between 512B and 32 KB; examples: Oracle 2KB to 32KB (typical 8KB), MS SQL Server 8KB, MySQL 16KB

• Trend is towards larger block sizes (incurs less overhead)

Pages and Blocks 4

Blocks and pages

Pages and Blocks 5

Finding the page you need

• Segment – Extent – Block/Page

• The pages allocated to a given object (a table, an index, and captured in a segment) are managed through lists

• These lists are stored as part of the segment header

• See “Segments and File Storage” slide 28

Pages and Blocks 6

USED

FREE

Block ID (Extent,Offset)

Block ID (Extent,Offset)Space Available

Potential bottleneck

• The free and used lists of a segment are a potential bottleneck, especially for transactions that will result in modifications

• How these lists are implemented affects performance:• Use several free lists so that concurrent transactions can look for free space in

parallel rather than conflict on access to the free list

• Make the traversal of the free list fast and keep the list small (shorter the larger the pages, sort the free list by available size, cache positions …)

• Make sure holes can be efficiently found (store the available space in each page in incremental steps by using a small amount of bits)

Pages and Blocks 7

Finding tuples within a page

• Slotted pages:• Each page has a header (checksum,

version, transaction visibility, compression information, utilization, etc.)

• Each tuple gets an id (typically, block ID and offset)

• The page maintains a list of the “slots” storing tuples in a page by storing a pointer (offset) to the beginning of each tuple

• Needed to store tuples of different sizes

Pages and Blocks 8

HEADER

Why slotted pages

• To support variable length tuples

• Tuples get a permanent tuple id (record id) that does not change

• When data does not change, it uses the space very efficiently (this is the most common case)

• When data changes, it requires careful management:• If a tuple is modified and becomes larger, use the original space to store a pointer to

the new location of the tuple (can be in another block)

• If a tuple is deleted, just remove the pointer

• If a tuple is modified and becomes smaller, just leave the unnecessary space empty

• For insertion, look for a page with enough free space and compact the page if needed

Pages and Blocks 9

Block structure: slotted pages

• A block is structured as follows:

• Header: address and type of segment (index, table, etc.)

• Table directory: schema of the table stored in the block

• Row directory: pointers to the actual tuples stored in the block

• Free space

• Row data (tuples) stored in the block

• The directory grows downwards, the space for tuples is used upwards

Pages and Blocks 10

Optimizing the use of blocks I

• Percentage Free• Determines how much space in

each block is reserved for updating tuples instead of using it for storing new tuples

• This is needed because an update can result in a bigger tuple than the original one

UPDATE TSET Adress = “AStreetWithAVeryLongName”WHERE LegiNr = 12345678

Pages and Blocks 11

Percentage Free

• The use of “Percentage Free” is to avoid the fragmentation that would occur if pages do not have enough space to modify a tuple that becomes larger:• Every modification will result in an

indirection to another page

• A few of them are acceptable, too many would slow down access

• This space is reserved for growing tuples rather than for inserting

Pages and Blocks 12

Optimizing the use of Blocks II

• Percentage used• Determines how much space needs to

be free in a block before the free space can be used to insert new tuples

• Blocks are unavailable to inserting new tuples until they have the given amount of free space

• It is needed because if updates can make tuples smaller (freeing up space), they can also make them bigger (needing space). The combination of both parameters avoids thrashing on the page

Pages and Blocks 13

Percentage Used

• The use of “Percentage Used” is to avoid having to constantly move a block from the used list to the free list• The block keeps track of how much space

it has used and it is free

• Only when enough free can it be used for inserting (probably for several inserts)

• Otherwise a block can constantly go from FREE to USED to FREE with single tuple deletes and inserts, adding overhead in managing the lists

Pages and Blocks 14

INSERTTuple 1Tuple 2Tuple 3 …

Space for one tuple

Pages and Blocks 15

Record Layout

Pages and Blocks 16

Structure of a record (tuple)

• A tuple contains:• Header (validity flags for deletion,

visibility info for concurrency control, bit map of null values, …)

• Attributes• Data for each non-null attribute (or a

pointer to the data)

• Relational engines do not store schema information in the tuple (types of the attributed are known), schema-less systems need to store the structure of the tuple since everyone can be different

Pages and Blocks 17

ID Name Dept. Sem.

1 John D-INFK 3

2 Mary D-ITET 5

4 6John D-INFK 3

4 6Mary D-ITET 5

STRING STRING INTEGER

Schema information

length lengthdata data int

representation

Actual tuples in memory

Optimizing the record layout

Pages and Blocks 18

4 6John D-INFK 3

34 John D-INFK6

Intuitive serial representation but linear time to access each attribute

Instead of length, store offsets. That way the record has a fixed sized part at the beginning and a varied sized part at the tail. Pointers point to tail of attribute. Each attribute can be accessed in constant time

3 4 John D-INFK6 Reorder the attributes, place variable length data at the end. Better performance.

Data Types

• Integer Numbers• Usually represented the same format as C or C++ (standard in hardware

architectures)

• Real Numbers• IEEE-754 standard for variable precision• Fixed point representations for fixed precision (e.g., Oracle numbers). Avoids

rounding errors, variable length by storing all digits plus where the decimal point is (not stored as a string)

• Strings and BLOBS• Length and data

• Time, coordinates, points …• System specific

Pages and Blocks 19

Corner cases

• When tuples are very big or some attribute is very big, instead of storing the whole tuple, one stores the fixed part of the tuple and a pointer to the variable, large size of the tuple (potentially in some other page)• Used for BLOBs (Binary Large Objects)• What is a large object is typically in reference to the page size (larger than

some size, a page, half a page, more than one page …)• Usually, those large attributes are not processed by queries, putting them

somewhere else speeds up scanning of the page as we do not need to scan the large object

• Examples: the attribute is a piece of text (a long string) or a photo• BLOBs can be very large (more than one block!)

Pages and Blocks 20

Pages and Blocks 21

DLN DOB EXP F. Name L. Name M. Name Photo Street #

99999999 08/04/75 08/05/23 Janice SAMPLE Ann Main Street 123

BLOB

Several options• Store as a BLOB on another block(s)

• It can be read with the data• It does not affect scans• Accessible form the database

• Store the name of a file where the BLOB is• Does not take space on the database• It does not affect scans• Requires to go somewhere else to retrieve

Row vs column store

Pages and Blocks 22

Row-store (or n-ary storage model)

• Row store:• Tuples are stored as described so far, all their attributes together

• A tuple can be quickly accessed and retrieved

Pages and Blocks 23

ID Name Dept. Sem.

1 John D-INFK 3

2 Mary D-ITET 5

4 6John D-INFK 3

4 6Mary D-ITET 5

TABLE SLOTTED PAGE

HEADER

HEADER

Pages and Blocks 24

“Data page layouts for relational databases on deep memory hierarchies”; The VLDB Journal, November 2002

Row store is for OLTP

• OLTP = Online Transaction Processing• Typical of banking applications, shopping carts, etc.

• Transactional operations (updates)

• Operations mostly on a single tuple

Pages and Blocks 25

SELECT balance FROM AccountsWHERE acc-no = 123456789

INDEX

Block

Row store cumbersome for complex queries

Pages and Blocks 26

SELECT SUM(balance)FROM AccountsWHERE Accounts.balance > 1000GROUP BY (Accounts.nationality)

Block 1

Block 2

Block 3

Needs to retrieve all pages for the tableNeeds to scan all tuplesDiscards most of the data as it only needs

two attributesProblematic for large tables=> this is a lot of wasted bandwidth

Column store

• Instead of storing by tuples, store the data by columns

• A block now contains columns (organized as slotted pages)

Pages and Blocks 27

Name Dept Sem Age

John D-INFK 3 21

Mary D-ITET 5 20

John

Mary

D-INFK

D-ITET

3

5

21

20

Block

Row Store

John

Mary

D-INFK

D-ITET

3

5

21

20

Block BlockBlockBlock

Column Store

Pages and Blocks 28


Only needs to read the attributes involved in the queryBlocks contain only what is neededNo extra blocks and no extra data has to be read

Processing on row store

Pages and Blocks 29

SELECT SUM(balance)FROM AccountsWHERE Accounts.balance > 1000GROUP BY (Accounts.nationality)

Name NameName

Blocks

Acc_# Acc-#

Balance Balance

Country

Accounts

Pages and Blocks 30

The Design and Implementation of Modern Column-Oriented Database SystemsFoundations and Trends in DatabasesVol. 5, No. 3 (2012) 197–280https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf

Column store and vectorized processing

• Modern processors heavily used vectorized processing to speed things up

• SIMD/AVX allows to perform an operation simultaneously on a vector of values

• A column store presents the data exactly in the vectorized representation needed to exploit SIMD

• Very useful for numeric values and bit comparisons

Pages and Blocks 31

Column stores and memory hierarchy

• Column stores started to be heavily used to address the memory wall

• Column stores are a more compact representation where a cache line is likely to bring data you want to see

• Far better cache utilization than with row stores

Pages and Blocks 32

CACHE

CACHE

ROW STORE

COLUMN STORE

Historical Overview

• Idea is very old and originally applied to data on disks• Horizontal partitioning (sharding)

• Vertical partitioning (stores columns or groups of columns on different files)

• Today, column store is used in all analytical databases and many main memory databases• Improves bandwidth across memory/caches

• Eliminates needs to deal with unwanted attributes

• Compresses better

• Can exploit SIMD

Pages and Blocks 33

When column store is no that good

• Column store suffers when:• Most of the attributes are needed anyway

• When the tuple or parts of it needs to be reconstructed (as intermediate results or final results)

• For updates and modifications

• Alternative: PAX (Partition Attributes Across) representation• A block contains several tuples but organized as a column store

• Reconstructing the tuple does not require to access several pages

• “Data page layouts for relational databases on deep memory hierarchies”; The VLDB Journal, November 2002

Pages and Blocks 34

Pages and Blocks 35


Pages and Blocks 36


PAX makes tuple reconstruction easierbut space management is far more complexdue to the mini-page layoutAlternative = Hybrid NSM/PAX“A hybrid page layout integrating PAX and NSM”HP LaboratoriesHPL-2012-240

Compression

Pages and Blocks 37

Compression is for bandwidth

• Databases optimize performance• Compression is not used to save space but to save bandwidth

• Trade-off CPU cycles to compress/decompress vs memory bandwidth (today, the CPOU wins by a large margin)

• Possible to process data in compressed form

• Many different options that depend on data organization• Dictionary compression

• Run length encoding

• Delta encoding

• Bit-vector representations (bitmaps)

Pages and Blocks 38

Compression approaches

• Dictionary compression• Build a dictionary mapping long entries to, e.g., integers or small numbers

• Example: use country codes (<200) instead of country names as strings = one byte is enough

• Can be applied to any finite collection of names (e.g., departments, cantons, provinces within a country, etc.)

• Dictionary automatically built when data is loaded

• Data can be processed in compressed form (it is encoded rather than compressed), dictionary used for query rewrite and result rewrite

• Very much used in many systems, especially for analytics

Pages and Blocks 39


• Frame of reference

• Many attributes have value locality, they can be represented as a delta over some base:• 1007, 1017, 1090 = 1000, 1, 17, 90

• Useful for many sets of data with a common base (e.g., for dates: days since 1900)

• Can be combined with delta encoding for sorted lists of data (store the difference to the previous value rather than the value)

• Allows operations over the compressed data

Pages and Blocks 40


• Run Length encoding• If a value appears repeated many times, just store it once and how many

times it appears

• Useful for attributes with low cardinality (e.g., departments)

• Used in columnar representation as the data does not even need to be stored

• In row stores, used for long strings with repeated characters

• Can compress the data significantly but makes processing more difficult and makes the encoding variable in size

Pages and Blocks 41


• Bit-vector representations (bitmaps)• For every value that an attribute might take, construct a bitmap as follows:

• Create an array as long as the number of tuples

• If tuple i has value x for that attribute, position i in the bitmap x is set to 1

• Bitmaps act as an index and can be used to process queries just by looking at the bitmap (selections, joins on that attribute, group by on that attribute …)

• Bitmaps can be further compressed using run length encoding (makes query processing slightly more complicated)

Pages and Blocks 42

Summary

• Databases process large amounts of data• How the data is organized significantly affects performance

• Organization in pages

• Finding the data quickly

• Not having to do too many changes when data is updated

• Bandwidth optimization

• Cache optimization

• Compression and processing over compressed data

• The ideas presented apply not only to databases but to any system storing and processing large data collections

Pages and Blocks 43

data management systems - eth z

Documents