Download - MOLAP on Cloud
Hongwei Zhao, Xiaojun Ye
MOLAP on CloudInteractive, Cluster Data Warehouse
Tsinghua University [email protected], [email protected]
MotivationExtend the cube model to support OLAP operations on Big Data:»OLAP operations»Interactive queries
OutlineCube modellingBuilding and queryingExperimenting
Data Transform for CubeTPC-DS tables Star views Cube data
User queries
A Simplified Cube Model
Cube Instance
Cuboid InstanceDimension
InstanceDimensionInstance
CubeMetadata
DimensionInstance
Cuboid Instance
Key MemberKey MemberKey
Dimension
Member
Key Measure NodeKey Measure NodeKey Measure Cell
ABC
AB
A
AC
B
BC
C
*
Base Cuboids
Result
Example: TPC-DS Query7select i_item_id, avg(ss_quantity) agg1, avg(ss_list_price) agg2, avg(ss_coupon_amt) agg3, avg(ss_sales_price) agg4 from store_sales, customer_demographics, date_dim, item, promotionwhere ss_sold_date_sk = d_date_sk and ss_item_sk = i_item_sk and ss_cdemo_sk = cd_demo_sk and ss_promo_sk = p_promo_sk and cd_gender = '[GEN]' and cd_marital_status = '[MS]' and cd_education_status = '[ES]' and (p_channel_email = 'N' or p_channel_event = 'N') and d_year = [YEAR] group by i_item_id order by i_item_id
Relation Schema
Store Sales
Date Dim
Item Promotion
Customer Demographic
s
Converting to BitKeyDimensio
n ADimension B
Dimension C
Measure
A1 B1 C1 M1
A2 B1 C2 M2
A3 B2 C2 M3
Dimension Member
BitKey Dimension Mask
A1 000001 000001B1 000010 000010C1 000100 000100A2 001000 001001B1 000010 000010C2 010000 010100
BitKeys
Value
000111 M1011010 M2Result2
Result1
Intermediate
Result1
Fact1
Fact2
Intermediate
Result1
Cube StorageTableRegionColumnFamilyRowColumnVersionValue
Cell
One table for dimension instances storage:Row Key Dimension NameColumn Family
Default
Column Member BitKeyValue Member ValueMultiple tables for cuboids instancesTable Name Cuboid NameRow Key Cell BitKeyColumn Family
Default
Column Measure NameValue Measure Value
MDX for query 7select { i_item_id } on rows,
{ avg(ss_quantity), avg(ss_list_price),avg(ss_coupon_amt),
avg(ss_sales_price) } on columns
from store_sales_cubewhere (cd_gender .[Male], cd_marital_status .[Single], cd_education_status .[College],
d_year.[2000])
Cube Implementation
Base cuboid building with 4 stages:Dimension constructingHive queryAggregationSaving
Query execution with 4 stages:Loading dimensionOther cuboid constructingMappingReducing
OLAP System
Engi
neCo
lum
nar
Data
base
Master Node
Region NodeRegion
Node
Dispatcher Node
Worker Node
Region Node
Worker NodeWorker Node
cachedat
a
Cube data
Cluster FrameworkDispatcher Node
Worker Nodes
• Distribute dynamically cubes data onto worker nodes
• Parallelize OLAP operations into a concurrent model
Actor of AkkaStateBehaviorMailbox
Lifecycle
Fault tolerance
Execute Query
Query Dispatch
erCuboid
ManagerDimensio
n Manager
Mapper Reducer
1 2
34
require
Cuboid ready
Dimension load
data ready
Extract Query
Hit Cell
Hit Cell
Actors for Query • Load dimension
members• Build other cuboids• Mapping• Reducing
Compiling & MappingQuery 7 Condition: GEN=M and MS=S and ES=College and YEAR=2000
GEN Mask: 000000011 Male 000000010MS Mask: 000011100 Single :000001100ES Mask: 001100000College: 001000000YEAR Mask: 110000000 2000:010000000
Mask: 111111111FilterKey: 011001110
Query Dispatch
er
Mapper1
Mapper2
Mapper3
For each cell in mapper{ If (key & mask
== Filter Key) Send to Reducer}
Region 1
Region 2
Region 3
Worker
Worker
Worker
Master
messages
results
Cache 1
Cache 2
Cache 3
Query Execution• Master sends task messages to workers
• Each worker caches each region data
• Sequential tasks reuse the cache data
First query on 1G consume 48 secs, the following queries with various parameters consume 2.4 secs
Experiments On TPC-DS
1g 10g 100g0
50000000100000000150000000200000000250000000300000000
fact recordscells
1G 10G 100Grecords number
2,653,108
26,532,571
265,325,821
cube cell number
1,836,162
10,190,922
41,892,286
4 nodes:• 2*Intel Xeon CPU E5-2630• 4*600G 15000r/s SAS • 256G RAM• 10Gb Network
Dimensions:1. "i_item_id", 2. "cd_gender", 3. "cd_marital_status", 4. "cd_education_status", 5. "p_channel_email", 6. "p_channel_event", 7. "d_year“Measures: 8. ss_quantity_avg,9. ss_list_price_avg, 10. ss_coupon_amt_avg, 11. ss_sales_price_avg
Build Cube for Query 7
1G
10G
100G
0 1000 2000 3000 4000 5000 6000
queryingaggregatingSaving
running time (seconds)
TPC-
DS
data
siz
e
• Partition by the largest Dimension(i_item_id)
• In-Memory aggregation• Saving stage can be
ignore(cache)
1 2 3 4 50
50
100
150
200
250
300
350
400
4 workers8 workers16 workers
iteration number
runn
ing
tim
e (s
econ
ds)
Execute Query 7First execution on the cube includes • Dimension loading• other cuboids construction • Caching• Mapping• Reducing
Sequential execution includes:• Mapping• Reducing
Hive Query for Fact Data select p_channel_email, p_channel_event, cd_gender, cd_marital_status, cd_education_status, i_item_id,d_year, ss_quantity, ss_list_price, ss_coupon_amt, ss_sales_price from store_sales
join date_dim on (store_sales.ss_sold_date_sk
= date_dim.d_date_sk) join item on (store_sales.ss_item_sk =
item.i_item_sk) join customer_demographics on
(store_sales.ss_cdemo_sk = custom-er_demographics.cd_demo_sk)
join promotion on (store_sales.ss_promo_sk = promotion.p_promo_sk)
Compare with Hive
1G 10G 100G0
200
400
600
800
1000
1200
1400
hiveprototype
1G 10G 100G0
200
400
600
800
1000
1200
1400
hiveprototype
First query time compare:2-3X
Sequential execution time:30-50X
Future work• Cube Model:
• Demand-driven & Data-driven • Cube Data:
• Model-driven & Requirement-driven• More experiments on TPC-DS
queries• Report, ad hoc, iterative, data mining,
• MDX/XMLA compliance
Thanks.
Storage for Example
Row Key
Column Family: default
Dimension A
Mask 000001 001000 001001001001 A1 A2 A3
Dimension B
Mask 000010 100000100010 B1 B2
Row Key Column Family: default000111 Mea_count Mea_sum
1 M1
011010 Mea_count Mea_sum1 M2
Table: Dimension
Table: Cuboid_ABC