2
M.I.T
Current DBMS Gold StandardCurrent DBMS Gold Standard Current DBMS Gold StandardCurrent DBMS Gold Standard
Store fields in one record contiguously on diskUse B-tree indexingUse small (e.g. 4K) disk blocksAlign fields on byte or word boundariesConventional (row-oriented) query optimizer
and executor (technology from 1979)Aries-style transactions
Store fields in one record contiguously on diskUse B-tree indexingUse small (e.g. 4K) disk blocksAlign fields on byte or word boundariesConventional (row-oriented) query optimizer
and executor (technology from 1979)Aries-style transactions
3
M.I.T
Terminology -- “Row Store”
Record 2
Record 4
Record 1
Record 3
E.g. DB2, Oracle, Sybase, SQLServer, …
4
M.I.T
Row Stores are Row Stores are Write OptimizedWrite Optimized Row Stores are Row Stores are Write OptimizedWrite Optimized
Can insert and delete a record in one physical
writeGood for on-line transaction processing (OLTP)But not for read mostly applications
Data warehousesCRM
Can insert and delete a record in one physical
writeGood for on-line transaction processing (OLTP)But not for read mostly applications
Data warehousesCRM
5
M.I.T
Elephants Have Extended Row StoresElephants Have Extended Row StoresElephants Have Extended Row StoresElephants Have Extended Row Stores
With Bitmap indicesBetter sequential readIntegration of “datacube” productsMaterialized views
With Bitmap indicesBetter sequential readIntegration of “datacube” productsMaterialized views
But there may be a better idea…….
7
M.I.T
At 100K Feet…. At 100K Feet….
Ad-hoc queries read 2 columns out of 20In a very large warehouse, Fact table is
rarely clustered correctlyColumn store reads 10% of what a row
store reads
Ad-hoc queries read 2 columns out of 20In a very large warehouse, Fact table is
rarely clustered correctlyColumn store reads 10% of what a row
store reads
8
M.I.T
C-Store (Column Store) ProjectC-Store (Column Store) Project
Brandeis/Brown/MIT/UMass-Boston
projectUsual suspects participatingEnough coded to get performance
numbers for some queriesComplete status later
Brandeis/Brown/MIT/UMass-Boston
projectUsual suspects participatingEnough coded to get performance
numbers for some queriesComplete status later
9
M.I.T
We Build on Previous Pioneering We Build on Previous Pioneering Work….Work….
Sybase IQ (early ’90s)Monet (see CIDR ’05 for the most recent
description)
Sybase IQ (early ’90s)Monet (see CIDR ’05 for the most recent
description)
10
M.I.T
C-Store Technical IdeasC-Store Technical Ideas
Code the columns to save spaceNo alignmentBig disk blocksOnly materialized views (perhaps many)Focus on Sorting not indexingAutomatic physical DBMS design
Code the columns to save spaceNo alignmentBig disk blocksOnly materialized views (perhaps many)Focus on Sorting not indexingAutomatic physical DBMS design
11
M.I.T
C-store (Column Store) Technical C-store (Column Store) Technical IdeasIdeas
Optimize for grid computingInnovative redundancyXacts – but no need for MohanData ordered on anything, Not just timeColumn optimizer and executor
Optimize for grid computingInnovative redundancyXacts – but no need for MohanData ordered on anything, Not just timeColumn optimizer and executor
12
M.I.T
How to Evaluate This Paper….How to Evaluate This Paper….
None of the ideas in isolation merit
publication Judge the complete system by its
(hopefully intelligent) choice ofSmall collection of inter-related
powerful ideasThat together put performance in a
new sandbox
None of the ideas in isolation merit
publication Judge the complete system by its
(hopefully intelligent) choice ofSmall collection of inter-related
powerful ideasThat together put performance in a
new sandbox
13
M.I.T
Code the ColumnsCode the Columns
Work hard to shrink spaceUse extra space for multiple orders
Fundamentally easier than in a row storeE.g. RLE works well
Work hard to shrink spaceUse extra space for multiple orders
Fundamentally easier than in a row storeE.g. RLE works well
14
M.I.T
No AlignmentNo Alignment
Densepack columnsE.g. a 5 bit field takes 5 bits
Current CPU speed going up faster than
disk bandwidthFaster to shift data in CPU than to
waste disk bandwidth
Densepack columnsE.g. a 5 bit field takes 5 bits
Current CPU speed going up faster than
disk bandwidthFaster to shift data in CPU than to
waste disk bandwidth
15
M.I.T
Big Disk BlocksBig Disk Blocks
TunableBig (minimum size is 64K)
TunableBig (minimum size is 64K)
16
M.I.T
Only Materialized ViewsOnly Materialized Views
Projection (materialized view) is some
number of columns from a fact tablePlus columns in a dimension table – with
a 1-n join between Fact and Dimension
tableStored in order of a storage key(s)Several may be stored!!!!!With a permutation, if necessary, to map
between them
Projection (materialized view) is some
number of columns from a fact tablePlus columns in a dimension table – with
a 1-n join between Fact and Dimension
tableStored in order of a storage key(s)Several may be stored!!!!!With a permutation, if necessary, to map
between them
17
M.I.T
Only Materialized ViewsOnly Materialized Views
Table (as the user specified it and sees
it) is not stored!No secondary indexes (they are a one
column sorted MV plus a permutation, if
you really want one)
Table (as the user specified it and sees
it) is not stored!No secondary indexes (they are a one
column sorted MV plus a permutation, if
you really want one)
18
M.I.T
Example
User view:
EMP (name, age, salary, dept)
Dept (dname, floor)
Possible set of MVs:MV-1 (name, dept, floor) in floor order
MV-2 (salary, age) in age order
MV-3 (dname, salary, name) in salary order
19
M.I.T
Different Indexing
Few values Many values
Sequential
RLE encoded
Conventional B-tree at
the value level
Delta encoded
Conventional B-tree at
the block level
Non sequential Bitmap per value
Conventional Gzip
Conventional B-tree at
the block level
20
M.I.T
Automatic Physical DBMS DesignAutomatic Physical DBMS Design
Not enough 4-star wizards to go aroundAccept a “training set” of queries and a
space budgetChoose the MVs auto-magicallyRe-optimize periodically based on a log
of the interactions
Not enough 4-star wizards to go aroundAccept a “training set” of queries and a
space budgetChoose the MVs auto-magicallyRe-optimize periodically based on a log
of the interactions
21
M.I.T
Optimize for Grid ComputingOptimize for Grid Computing
I.e. shared-nothingDewitt (Gamma) was right
Horizontal partitioning and intra-query
parallelism as in Gamma
I.e. shared-nothingDewitt (Gamma) was right
Horizontal partitioning and intra-query
parallelism as in Gamma
22
M.I.T
Innovative RedundancyInnovative Redundancy
Hardly any warehouse is recovered by a
redo from the logTakes too long!
Store enough MVs at enough places to
ensure K-safetyRebuild dead objects from elsewhere in
the networkK-safety is a DBMS-design problem!
Hardly any warehouse is recovered by a
redo from the logTakes too long!
Store enough MVs at enough places to
ensure K-safetyRebuild dead objects from elsewhere in
the networkK-safety is a DBMS-design problem!
23
M.I.T
XACTS – No MohanXACTS – No Mohan
Undo from a log (that does not need to
be persistent)Redo by rebuild from elsewhere in the
network
Undo from a log (that does not need to
be persistent)Redo by rebuild from elsewhere in the
network
24
M.I.T
XACTS – No MohanXACTS – No Mohan
Snapshot isolation (run queries as of a
tunable time in the recent past)To solve read-write conflicts
Distributed XactsWithout a prepare message (no 2
phase commit)
Snapshot isolation (run queries as of a
tunable time in the recent past)To solve read-write conflicts
Distributed XactsWithout a prepare message (no 2
phase commit)
25
M.I.T
Storage (sort) Key(s) is not Storage (sort) Key(s) is not Necessarily TimeNecessarily Time
That would be too limitingSo how to do fast updates to densepack
column storage that is not in entry
sequence?
That would be too limitingSo how to do fast updates to densepack
column storage that is not in entry
sequence?
26
M.I.T
Solution – a Hybrid StoreSolution – a Hybrid Store
Read-optimized
Column store
Write-optimized
Column store
Tuple mover
(Much like Monet)
(What we have been
talking about so far)
(Batch rebuilder)
27
M.I.T
Column ExecutorColumn Executor
Column operations – not row operationsColumns remain coded – if possibleLate materialization of columns
Column operations – not row operationsColumns remain coded – if possibleLate materialization of columns
28
M.I.T
Column OptimizerColumn Optimizer
Chooses MVs on which to run the queryMost important task
Build in snowflake schemas Which are simple to optimize without
exhaustive searchLooking at extensions
Chooses MVs on which to run the queryMost important task
Build in snowflake schemas Which are simple to optimize without
exhaustive searchLooking at extensions
29
M.I.T
Current PerformanceCurrent Performance
100X popular row store in 40% of the
space10X popular column store in 70% of the
space7X popular row store in 1/6th of the spaceCode available with BSD license
100X popular row store in 40% of the
space10X popular column store in 70% of the
space7X popular row store in 1/6th of the spaceCode available with BSD license
30
M.I.T
Structure Going ForwardStructure Going Forward
Vertica Very well financed start-up to
commercialize C-storeDoing the heavy lifting
University ResearchFunded by Vertica
Vertica Very well financed start-up to
commercialize C-storeDoing the heavy lifting
University ResearchFunded by Vertica
31
M.I.T
VerticaVertica
Complete alpha system in December ‘05Everything, including DBMS designerWith current performance!Looking for early customers to work with
(see me if you are interested)
Complete alpha system in December ‘05Everything, including DBMS designerWith current performance!Looking for early customers to work with
(see me if you are interested)
32
M.I.T
University ResearchUniversity Research
Extension of algorithms to non-snowflake
schemasStudy of L2 cache performanceStudy of coding strategiesStudy of executor optionsStudy of recovery tacticsNon-cursor interfaceStudy of optimizer primitives
Extension of algorithms to non-snowflake
schemasStudy of L2 cache performanceStudy of coding strategiesStudy of executor optionsStudy of recovery tacticsNon-cursor interfaceStudy of optimizer primitives