column orientation - rotate your thinking 90 degrees
TRANSCRIPT
Column based Databases
Shweta Agrawal
What is a column based DB?
1|Shweta Agrawal|M|40|10000...|2|Neha Agrawal|F|25|12000...|3|Anant Agarwal|M|28|15000...|4|Vishal Mehta|M|30|8000...
1|2|3|4...|Shweta Agrawal|Neha Agrawal|Anant Agarwal|Vishal Mehta...|M|F|M|M...|40|25|28|30...|10000|12000|15000|8000...
Row based storage
Column based storage
One page of the table storage
Column stores
1|2|3|4|5|6|....
Shweta Agrawal|Neha Agrawal|Anant Agarwal|Vishal Mehta|Srinivas Pathak|Rubina Mehta....
M|F|M|M|M|F|F...
10000|12000|15000|8000|15000|5000...
40|25|28|30|45|20...
1st page of each column store
...
Query processing on row store
Evaluate condition age>40 possibly using an index on age.
Get a found-set containing row number/ID of rows that satisfy above condition.
Retrieve all rows in the above found-set.
Send only name, and salary from the rows as result to client
SELECT name, salary FROM employee WHERE age > 40
Query processing on a column store
Evaluate condition age > 40 on column age, using an index if present
Get a found-set containing row number/ID of rows that satisfy above condition
Retrieve name's from name's column store for all rows in the found-set
Retrieve salary's from salary column for all rows in the found-set
Associate name with salary by row id/number for final result
SELECT name, salary FROM employee WHERE age > 40
A quick calculation of IO
Table has 10 columns
1 million rows.
Each row is 100 bytes
30% of employees are above age 40
Total amount of data read in row based store = 100MB * 0.3 = 30MB
Total amount of data read in column based store 100MB * 0.3 * 0.2 (only 2 columns) = 6MB
Why is it important?
Wide fact tables in data-warehouses
Analytics queries on data-warehouse tend to aggregate/analyse a few columns but a large number of rows.
Full table scans for analytics queries in row stores
Normalization means more joins
An example star schema
Benefits of column based DB
Low pages read = Less IO = faster queries
Processes CPU bound instead of IO bound
Compression
Page level compression
Column level compression (lookup tables)
Natural intra-query parallelism on conditions on different columns
Row based equivalents
Index every column?
Maintenance: updates/insert/deletes
Storage
Most importantly: Index is value=>id, column is id=>value
Useful for selective queries only
Row based equivalents
Vertical partitioning?
Joins (although fast ones)
Table overhead
Cannot use horizontal partitioning
Row based query engine not geared up to make use of the column based storage.
Summary
For adhoc analytics queries, column based storage reduces IO, and makes queries faster
Column based query engines written ground up for analytics queries make good use of this storage.
Indexing every column, or vertical partioning not same as column based storage.
References
Commercial products
Sybase IQ
Vertica
MySQL's InfoBright storage engine
To know more, read
http://databasecolumn.vertica.com/
Muokkaa otsikon tekstimuotoa napsauttamalla
Muokkaa jsennyksen tekstimuotoa napsauttamalla
Toinen jsennystaso
Kolmas jsennystaso
Neljs jsennystaso
Viides jsennystaso
Kuudes jsennystaso
Seitsems jsennystaso
Kahdeksas jsennystaso
Yhdekss jsennystaso
IDNAMESEXAGESALARYADDRRESSPHONEPAN...
1Sunil SharmaM4010,000.........
2Neha AgarwalF2512,000.........
3Anant AgarwalM2815,000.........
4Vishal MehtaM308,000.........