column orientation - rotate your thinking 90 degrees

Column based Databases

Shweta Agrawal

What is a column based DB?

1|Shweta Agrawal|M|40|10000...|2|Neha Agrawal|F|25|12000...|3|Anant Agarwal|M|28|15000...|4|Vishal Mehta|M|30|8000...

1|2|3|4...|Shweta Agrawal|Neha Agrawal|Anant Agarwal|Vishal Mehta...|M|F|M|M...|40|25|28|30...|10000|12000|15000|8000...

Row based storage

Column based storage

One page of the table storage

Column stores

1|2|3|4|5|6|....

Shweta Agrawal|Neha Agrawal|Anant Agarwal|Vishal Mehta|Srinivas Pathak|Rubina Mehta....

M|F|M|M|M|F|F...

10000|12000|15000|8000|15000|5000...

40|25|28|30|45|20...

1st page of each column store

...

Query processing on row store

Evaluate condition age>40 possibly using an index on age.

Get a found-set containing row number/ID of rows that satisfy above condition.

Retrieve all rows in the above found-set.

Send only name, and salary from the rows as result to client

SELECT name, salary FROM employee WHERE age > 40

Query processing on a column store

Evaluate condition age > 40 on column age, using an index if present

Get a found-set containing row number/ID of rows that satisfy above condition

Retrieve name's from name's column store for all rows in the found-set

Retrieve salary's from salary column for all rows in the found-set

Associate name with salary by row id/number for final result

SELECT name, salary FROM employee WHERE age > 40

A quick calculation of IO

Table has 10 columns

1 million rows.

Each row is 100 bytes

30% of employees are above age 40

Total amount of data read in row based store = 100MB * 0.3 = 30MB

Total amount of data read in column based store 100MB * 0.3 * 0.2 (only 2 columns) = 6MB

Why is it important?

Wide fact tables in data-warehouses

Analytics queries on data-warehouse tend to aggregate/analyse a few columns but a large number of rows.

Full table scans for analytics queries in row stores

Normalization means more joins

An example star schema

Benefits of column based DB

Low pages read = Less IO = faster queries

Processes CPU bound instead of IO bound

Compression

Page level compression

Column level compression (lookup tables)

Natural intra-query parallelism on conditions on different columns

Row based equivalents

Index every column?

Maintenance: updates/insert/deletes

Storage

Most importantly: Index is value=>id, column is id=>value

Useful for selective queries only

Row based equivalents

Vertical partitioning?

Joins (although fast ones)

Table overhead

Cannot use horizontal partitioning

Row based query engine not geared up to make use of the column based storage.

Summary

For adhoc analytics queries, column based storage reduces IO, and makes queries faster

Column based query engines written ground up for analytics queries make good use of this storage.

Indexing every column, or vertical partioning not same as column based storage.

References

Commercial products

Sybase IQ

Vertica

MySQL's InfoBright storage engine

To know more, read

http://databasecolumn.vertica.com/

Muokkaa otsikon tekstimuotoa napsauttamalla

Muokkaa jsennyksen tekstimuotoa napsauttamalla

Toinen jsennystaso

Kolmas jsennystaso

Neljs jsennystaso

Viides jsennystaso

Kuudes jsennystaso

Seitsems jsennystaso

Kahdeksas jsennystaso

Yhdekss jsennystaso

IDNAMESEXAGESALARYADDRRESSPHONEPAN...

1Sunil SharmaM4010,000.........

2Neha AgarwalF2512,000.........

3Anant AgarwalM2815,000.........

4Vishal MehtaM308,000.........

column orientation - rotate your thinking 90 degrees

Technology