design cube in apache kylin
TRANSCRIPT
![Page 2: Design cube in Apache Kylin](https://reader035.vdocuments.us/reader035/viewer/2022071814/55a558181a28ab23258b463c/html5/thumbnails/2.jpg)
Before You Start
• Kylin is a MOLAP engine on Hadoop.
• Understand Kylin helps cube design a lot.– http://www.slideshare.net/YangLi43/apache-kylin-deep-dive-2014-dec
• This deck summarizes best practices and patterns on how to design an efficient cube.– For detailed steps to create a cube, check out
https://github.com/KylinOLAP/Kylin/wiki/Kylin-Cube-Creation-Tutorial
![Page 3: Design cube in Apache Kylin](https://reader035.vdocuments.us/reader035/viewer/2022071814/55a558181a28ab23258b463c/html5/thumbnails/3.jpg)
Overview
• Identify Star Schema
• Design Cube– Dimensions
– Measures
– Incremental Build
– Advanced Options
• Build and Verify
![Page 4: Design cube in Apache Kylin](https://reader035.vdocuments.us/reader035/viewer/2022071814/55a558181a28ab23258b463c/html5/thumbnails/4.jpg)
Identify Star Schema
• Kylin creates cube from a star schema of Hive tables.
• One fact table that has ever growing records, like transactions.
• A few dimension tables that are relatively static, like users and products.
• Hive tables must be synced into Kylin first.
![Page 5: Design cube in Apache Kylin](https://reader035.vdocuments.us/reader035/viewer/2022071814/55a558181a28ab23258b463c/html5/thumbnails/5.jpg)
Know Cardinalities of Columns
• Cardinalities have significant impact on cube size and query latency.– High Cardinality: > 1,000– Ultra High Cardinality: > 1,000,000
• Avoid UHC as much as possible.– If it’s used as indicator, then put the indicator in cube.– Try categorize values or derive features from the UHC rather
than putting the original value in cube.
• To know column cardinalities– select count(distinct A) from T– or google for fancy tools
![Page 6: Design cube in Apache Kylin](https://reader035.vdocuments.us/reader035/viewer/2022071814/55a558181a28ab23258b463c/html5/thumbnails/6.jpg)
Cube ConceptsCube = all combination of dimensions Cuboid = one combination of dimensionsCurse of dimensionality: N dimension cube has 2N cuboid
![Page 7: Design cube in Apache Kylin](https://reader035.vdocuments.us/reader035/viewer/2022071814/55a558181a28ab23258b463c/html5/thumbnails/7.jpg)
Design Dimensions
• 15 dimensions or less is most ideal.– More than that causes slowness in cube build and
longer query latency.– Does user really need a report of 15+ dimensions?– You can define multiple cubes on one star schema to
fulfill different analysis scenarios.
• Control the total number of dimensions.– Mandatory dimension– Hierarchy dimension– Derived dimension
![Page 8: Design cube in Apache Kylin](https://reader035.vdocuments.us/reader035/viewer/2022071814/55a558181a28ab23258b463c/html5/thumbnails/8.jpg)
Mandatory Dimension
• Dimension that presents in every query.– like Date
• Mandatory dimension cuts cuboid combinations by half.
Normal Dimensions
A B C
A B -
- B C
A - C
A - -
- B -
- - C
- - -
A is Mandatory
A B C
A B -
A - C
A - -
![Page 9: Design cube in Apache Kylin](https://reader035.vdocuments.us/reader035/viewer/2022071814/55a558181a28ab23258b463c/html5/thumbnails/9.jpg)
Hierarchy Dimension
• Dimensions that form a “contains” relationship where parent level is required for child level to make sense.– like Year -> Month -> Day; or Country -> City
• Hierarchy dimension reduces combination from 2N to N+1.
Normal Dimensions
A B C
A B -
- B C
A - C
A - -
- B -
- - C
- - -
A->B->C is Hierarchy
A B C
A B -
A - -
- - -
![Page 10: Design cube in Apache Kylin](https://reader035.vdocuments.us/reader035/viewer/2022071814/55a558181a28ab23258b463c/html5/thumbnails/10.jpg)
Derived Dimension
• Dimensions on lookup table that can be derived by PK.– like User ID derives [Name, Age, Gender]
• Derived dimension reduces combination from 2N to 2 at the cost of extra runtime aggregation.
Normal Dimensions
A B C
A B -
- B C
A - C
A - -
- B -
- - C
- - -
A, B, C are Derived by ID
ID
-
![Page 11: Design cube in Apache Kylin](https://reader035.vdocuments.us/reader035/viewer/2022071814/55a558181a28ab23258b463c/html5/thumbnails/11.jpg)
The Order of Dimensions
• Finally, define dimensions in following order.– Mandatory dimension– Dimensions that heavily involved in filters– High cardinality dimensions– Low cardinality dimensions
• Filter first, helps to cut down query scan ranges.
• High cardinality first, helps to calculate cube efficiently.
![Page 12: Design cube in Apache Kylin](https://reader035.vdocuments.us/reader035/viewer/2022071814/55a558181a28ab23258b463c/html5/thumbnails/12.jpg)
Define Measures
• Kylin currently support– Sum– Count– Max– Min– Average– Distinct Count (based on HyperLogLog)
• Distinct Count is a very heavy data type.– Error rate<1.22% takes 64KB per cell.– Convince user to use the wildest tolerable error rate.– Distinct Count is slower to build and query comparing to other
measures.
![Page 13: Design cube in Apache Kylin](https://reader035.vdocuments.us/reader035/viewer/2022071814/55a558181a28ab23258b463c/html5/thumbnails/13.jpg)
Incremental Build
• Kylin supports incremental build along a time dimension if enabled.
• Setting a start time, cube segments can be built daily (or any period) processing only the incremental data.
• A segment can be refreshed relatively cheaply to reflect changes in hive table.
• With the increasing number of segments, query would slow down a bit.
• Merge segments to control the total number < 10 for best performance.
![Page 14: Design cube in Apache Kylin](https://reader035.vdocuments.us/reader035/viewer/2022071814/55a558181a28ab23258b463c/html5/thumbnails/14.jpg)
Advanced Options
• Leave advanced options as is if you are not sure what they mean.
• Aggregation groups give finest control on which cuboids to build.– Partial cube -- Only combinations within the same group are built.– For cube with 30 dimensions, if divide the dimensions into 3 groups, the cuboid number will
reduce from 1 Billion to 3 Thousands.• 230 => 210 + 210 + 210
– It’s tradeoff between online aggregation and offline pre-aggregation.
• Query is efficient when involved dimensions all come from a single aggregation group, or otherwise runtime aggregation will slow down queries.– Capture query patterns with your aggregation group.– Keep less than 10 dimensions in one group, or the cube will be huge.– A dimension can appear in multiple groups.– Create a second cube with different aggregation group is also an option.
• Rowkeys, they are generated in order of dimensions. No need to change.
![Page 15: Design cube in Apache Kylin](https://reader035.vdocuments.us/reader035/viewer/2022071814/55a558181a28ab23258b463c/html5/thumbnails/15.jpg)
Build and Verify
• Once the cube is created, build it, and ready to verify.
• Check the expansion rate of your cube.– Under 10 times is ideal.
• Notes on the SQLs– Write queries against the original hive tables, cubes are
transparent at the query time.– Sanity check: select count(*) from fact– Make sure the join relationships (inner or left) matches the cube
definition exactly.– Kylin works best with a group by clause.– Date constant is like date ‘1970-01-01’
![Page 16: Design cube in Apache Kylin](https://reader035.vdocuments.us/reader035/viewer/2022071814/55a558181a28ab23258b463c/html5/thumbnails/16.jpg)
Q & A
Thanks!