apache kylin balance between space and time · 2017-12-14 · sql rest,server kylin architecture...
Post on 25-Apr-2020
Apache Kylin Balance between Space and Time
Debashis Saha | Luke Han 2015-06-09
http://kylin.io | @ApacheKylin
About us
§ Debashis Saha (@debashis_saha) - VP, eBay Cloud Services (Platform, Infrastructure, Data)
§ Luke Han (@lukehq) - Sr. Product Manager, Analytics Data Infrastructure - Committer & PMC Member of Apache Kylin
Agenda
§ About Apache Kylin
§ Feature Highlights
§ Tech Highlights
§ Roadmap
§ Q&A
What
kylin / ˈkiːˈlɪn / 麒麟
-- n. (in Chinese art) a mythical animal of composite form
Extreme OLAP Engine for Big Data
Kylin is an open source distributed analytics engine, contributed by eBay Inc., that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop and supports extremely large datasets.
• Open sourced on Oct 1st, 2014
• Accepted as an Apache Incubator project on Nov 25th, 2014
• http://kylin.io (http://kylin.incubator.apache.org)
@ApacheKylin
Who is using Kylin?
§ External
  § 25+ contributors in community
  § Adoption (from mailing list):
    § In production: Baidu Map
    § Evaluation: Huawei, Bloomberg Law, British Gas, JD.com, Microsoft, Tableau…
§ eBay internal cases - 90th percentile of queries < 5 seconds

Case                    Cube Size   Raw Records
User Session Analysis   26 TB       28+ billion rows
Traffic Analysis        21 TB       20+ billion rows
Behavior Analysis       560 GB      1.2+ billion rows
Why
[Figure: user happiness against query latency (10s marked) and data size]
Balance Between Space and Time
OLAP Cube
[Figure: cuboid lattice over the dimensions time, item, location, supplier]
0-D (apex) cuboid: (no dimensions)
1-D cuboids: time; item; location; supplier
2-D cuboids: time, item; time, location; time, supplier; item, location; item, supplier; location, supplier
3-D cuboids: time, item, location; time, item, supplier; time, location, supplier; item, location, supplier
4-D (base) cuboid: time, item, location, supplier
• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
  1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
  2. (9/15, milk, Urbana, *) - <time, item, location>
  3. (*, milk, Urbana, *) - <item, location>
  4. (*, milk, Chicago, *) - <item, location>
  5. (*, milk, *, *) - <item>
• Cuboid = one combination of dimensions
• Cube = all combinations of dimensions (all cuboids)
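As a minimal sketch of the lattice on this slide, the following Python enumerates every cuboid for the four example dimensions. The function name `enumerate_cuboids` is only illustrative and is not part of Kylin.

```python
from itertools import combinations

# The four dimensions used in the slide's example.
DIMENSIONS = ("time", "item", "location", "supplier")


def enumerate_cuboids(dimensions):
    """Return {level: [cuboid, ...]}; each cuboid is a tuple of dimension names."""
    return {
        level: list(combinations(dimensions, level))
        for level in range(len(dimensions) + 1)
    }


lattice = enumerate_cuboids(DIMENSIONS)
for level, cuboids in lattice.items():
    print(f"{level}-D: {len(cuboids)} cuboid(s)")

# The cube is the union of all levels: 1 + 4 + 6 + 4 + 1 cuboids.
total = sum(len(cuboids) for cuboids in lattice.values())
print(total)  # 16, i.e. 2**4
```

Level 0 holds the apex cuboid (the empty combination) and level 4 holds the base cuboid with all four dimensions.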
How
[Figure: stack diagram with the labels "BI Tools, Web App…", "ANSI SQL", "Kylin", and "Map Reduce"]
Feature Highlights
• Extremely Fast OLAP Engine at scale
• ANSI SQL Interface on Hadoop
• Seamless Integration with BI Tools, like Tableau
• Interactive Query Capability
• MOLAP Cube
• Incremental Build of Cubes
• Approximate Query Capability for Distinct Count (HyperLogLog)
• Leverage HBase Coprocessor to reduce query latency
• Job Management and Monitoring
• User-friendly Web GUI to manage, build, monitor, and query cubes
• Security capability to set ACL at Cube/Project Level
• Support LDAP Integration
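To illustrate the approximate distinct count mentioned above, here is a minimal HyperLogLog sketch in Python. It is a textbook-style illustration under stated assumptions (SHA-1 as the hash, 2**12 registers, the standard bias constant for large register counts), not Kylin's implementation.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog counter; parameters here are illustrative assumptions."""

    def __init__(self, p=12):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")           # 64-bit hash
        index = h >> (64 - self.p)                      # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)           # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1    # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog()
for _ in range(2):                # adding duplicates does not change the estimate
    for user_id in range(50000):
        hll.add(user_id)
estimate = hll.count()
print(round(estimate))            # close to 50000, within a few percent
```

The counter keeps only 4096 small registers regardless of how many values are added, which is the space-for-accuracy trade that makes distinct counts cheap to pre-aggregate.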
Define Data Model
Manage Jobs
Explore the Data
Interact with BI Tools - Tableau
Kylin Architecture Overview
[Figure: architecture diagram. Clients: 3rd Party App (Web App, Mobile…) and SQL-Based Tool (BI Tools: Tableau…). Access: REST API and JDBC/ODBC, both carrying SQL. Kylin components: REST Server, Query Engine, Routing, Metadata, Cube Build Engine (MapReduce…). Storage: Hadoop Hive (star schema data) and Data Cube / OLAP Cube in HBase (key-value data). Routing paths: Low Latency - Seconds and Mid Latency - Minutes.]
➢ Online Analysis Data Flow
➢ Offline Data Flow
➢ Clients/Users interact with Kylin via SQL
➢ OLAP Cube is transparent to users
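As a hedged sketch of the "clients talk SQL through the REST server" path, the Python below only builds an HTTP request and does not send it. The endpoint path `/kylin/api/query`, the JSON fields, port 7070, and the ADMIN/KYLIN credentials reflect my understanding of Kylin's documented defaults and should be treated as assumptions; `my_project` is a hypothetical project name.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, user, password, project, sql, limit=100):
    """Build (but do not send) a SQL query request for a Kylin REST server."""
    payload = {
        "sql": sql,
        "project": project,       # hypothetical project name supplied by caller
        "offset": 0,
        "limit": limit,
        "acceptPartial": False,
    }
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",   # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_query_request(
    "http://localhost:7070",      # assumed host and port
    "ADMIN", "KYLIN",             # assumed default credentials
    "my_project",
    "SELECT COUNT(*) FROM test_kylin_fact",
)
print(request.get_method(), request.full_url)
# Calling urllib.request.urlopen(request) would send it to a running server.
```

The client never names a cube: it sends plain SQL against the star-schema tables, and the server decides how to answer it.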
Data Modeling
[Figure: cube metadata (Cube: …, Fact Table: …, Dimensions: …, Measures: …, Storage(HBase): …) maps a source star schema (one fact table joined to dimension tables) to target HBase storage (rows A, B, C, each with a row key and a column family holding values 1 to 3). Roles shown: End User, Cube Modeler, Admin.]
Cube Build Job Flow
How to Store Cube - HBase Schema
Kylin Query Engine - Explain Plan

SELECT test_cal_dt.week_beg_dt,
       test_category.category_name,
       test_category.lvl2_name,
       test_category.lvl3_name,
       test_kylin_fact.lstg_format_name,
       test_sites.site_name,
       SUM(test_kylin_fact.price) AS GMV,
       COUNT(*) AS TRANS_CNT
FROM test_kylin_fact
LEFT JOIN test_cal_dt
  ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt
LEFT JOIN test_category
  ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id
 AND test_kylin_fact.lstg_site_id = test_category.site_id
LEFT JOIN test_sites
  ON test_kylin_fact.lstg_site_id = test_sites.site_id
WHERE test_kylin_fact.seller_id = 123456
   OR test_kylin_fact.lstg_format_name = 'New'
GROUP BY test_cal_dt.week_beg_dt,
         test_category.category_name,
         test_category.lvl2_name,
         test_category.lvl3_name,
         test_kylin_fact.lstg_format_name,
         test_sites.site_name

OLAPToEnumerableConverter
  OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8])
    OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()])
      OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0])
        OLAPFilterRel(condition=[OR(=($3, 123456), =($5, 'New'))])
          OLAPJoinRel(condition=[=($2, $25)], joinType=[left])
            OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left])
              OLAPJoinRel(condition=[=($4, $12)], joinType=[left])
                OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
                OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]])
              OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]])
            OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])
Cube Optimization
§ Curse of Dimensionality
  § An N-dimension cube has 2^N cuboids
  § Full Cube vs. Partial Cube
§ Huge Data Volume
  § Dictionary Encoding
  § Incremental Building
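Dictionary encoding, listed above as a way to handle huge data volume, can be sketched as follows. This is a simplified illustration that maps each distinct value to a small integer ID through a sorted list; it is an assumption for teaching purposes and not a description of Kylin's actual dictionary structure.

```python
def build_dictionary(values):
    """Map each distinct value to a compact integer ID (sorted order keeps IDs comparable)."""
    ordered = sorted(set(values))
    encode = {value: code for code, value in enumerate(ordered)}
    return encode, ordered


# Hypothetical column values, reusing the city names from the cube example.
location_column = ["Urbana", "Chicago", "Urbana", "Urbana", "Chicago"]

encode, decode = build_dictionary(location_column)
encoded_column = [encode[v] for v in location_column]

print(encoded_column)                       # [1, 0, 1, 1, 0]
print([decode[c] for c in encoded_column])  # round-trips to the original values
```

Long, repeated strings are stored once in the dictionary, and the cube itself only carries the short integer codes.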
Full Cube vs. Partial Cube
§ Full Cube
  - Pre-aggregate all dimension combinations
  - "Curse of dimensionality": an N-dimension cube has 2^N cuboids
§ Partial Cube
  - To avoid dimension explosion, we divide the dimensions into different aggregation groups
  - 2^(N+M+L) → 2^N + 2^M + 2^L
  - For a cube with 30 dimensions, if we divide these dimensions into 3 groups, the number of cuboids drops from about 1 billion to about 3 thousand
  - 2^30 → 2^10 + 2^10 + 2^10
  - Tradeoff between online aggregation and offline pre-aggregation
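The arithmetic on this slide can be checked with a few lines of Python. The function names are illustrative only, and the partial-cube count follows the slide's formula of summing 2^size over the aggregation groups.

```python
def full_cube_cuboids(num_dimensions):
    """Number of cuboids when every dimension combination is pre-aggregated."""
    return 2 ** num_dimensions


def partial_cube_cuboids(group_sizes):
    """Number of cuboids when dimensions are split into aggregation groups."""
    return sum(2 ** size for size in group_sizes)


full = full_cube_cuboids(30)
partial = partial_cube_cuboids([10, 10, 10])

print(full)     # 1073741824, about 1 billion
print(partial)  # 3072, about 3 thousand
```

Queries that combine dimensions from different groups are no longer fully pre-computed, which is the online versus offline aggregation tradeoff named on the slide.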
Partial Cube
Incremental Build
What's Next
§ Improve cube algorithm
  § Cube by segments, 30%-50% faster
  § Build delay down to tens of minutes
§ Streaming cubing
  § Analyze real-time data
  § Build delay down to seconds
§ Spark
Cube by Layer
§ The current algorithm
  - Many MR jobs: as many rounds as the number of dimensions
  - Huge shuffles: aggregation on the reduce side, shuffling about 100x the total cube size
[Figure: Full Data and the 0-D, 1-D, 2-D, 3-D, and 4-D cuboid layers, with one MR job per layer]
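The layer-by-layer idea can be sketched in plain Python: build the base cuboid from the raw rows, then derive each lower layer by dropping one dimension from a cuboid in the layer above. This is a single-process illustration with made-up measure values, not the MapReduce jobs Kylin runs.

```python
from collections import defaultdict


def build_base_cuboid(rows):
    """Aggregate raw rows into the base (all-dimension) cuboid."""
    base = defaultdict(int)
    for key, measure in rows:
        base[tuple(key)] += measure
    return dict(base)


def roll_up(parent_cells, position):
    """Aggregate a parent cuboid into the child that drops one dimension."""
    child = defaultdict(int)
    for key, measure in parent_cells.items():
        child[key[:position] + key[position + 1:]] += measure
    return dict(child)


def build_by_layer(dimensions, rows):
    """Return ({cuboid dims: cells}, number of rounds), one round per layer."""
    base_dims = tuple(dimensions)
    cube = {base_dims: build_base_cuboid(rows)}
    layer, rounds = [base_dims], 1
    while layer[0]:  # stop once the layer is the 0-D (apex) cuboid
        next_layer = []
        for parent in layer:
            for position in range(len(parent)):
                child_dims = parent[:position] + parent[position + 1:]
                if child_dims not in cube:
                    cube[child_dims] = roll_up(cube[parent], position)
                    next_layer.append(child_dims)
        layer, rounds = next_layer, rounds + 1
    return cube, rounds


# Hypothetical fact rows: (time, item, location, supplier) -> measure.
rows = [
    (("9/15", "milk", "Urbana", "Dairy_land"), 10),
    (("9/15", "milk", "Chicago", "Dairy_land"), 20),
    (("9/16", "bread", "Urbana", "Bakers"), 30),
]
cube, rounds = build_by_layer(("time", "item", "location", "supplier"), rows)
print(len(cube), rounds)                             # 16 cuboids in 5 rounds
print(cube[("item", "location")][("milk", "Urbana")])  # 10
```

Each round depends on the output of the previous one, which is why the number of jobs grows with the number of dimensions.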
Cube by Segments
§ The to-be algorithm, 30%-50% faster
  - 1 round of MR
  - Reduced shuffles: map-side aggregation, about 20x the total cube size
  - Hourly incremental build done in tens of minutes
[Figure: each mapper turns one data split into a cube segment; a merge sort (shuffle) combines the cube segments into the final cube]
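A rough single-process sketch of the segment approach: each "mapper" aggregates every cuboid for its own data split, and the segments are then merged by summing matching cells. The splits and measure values are made up, and the `*` placeholder follows the cell notation used earlier in the deck; this is not Kylin's actual job code.

```python
from collections import Counter
from itertools import combinations


def cube_segment(split, num_dimensions):
    """Map side: pre-aggregate all cuboids for one data split."""
    segment = Counter()
    for key, measure in split:
        for level in range(num_dimensions + 1):
            for kept in combinations(range(num_dimensions), level):
                # Dimensions that are not kept are aggregated away and shown as "*".
                cell = tuple(
                    key[i] if i in kept else "*" for i in range(num_dimensions)
                )
                segment[cell] += measure
    return segment


def merge_segments(segments):
    """Merge step: sum the measures of identical cells across segments."""
    final_cube = Counter()
    for segment in segments:
        final_cube.update(segment)
    return final_cube


# Hypothetical data splits: (time, item, location, supplier) -> measure.
splits = [
    [(("9/15", "milk", "Urbana", "Dairy_land"), 10)],
    [
        (("9/15", "milk", "Chicago", "Dairy_land"), 20),
        (("9/16", "milk", "Urbana", "Dairy_land"), 5),
    ],
]
final_cube = merge_segments(cube_segment(split, 4) for split in splits)
print(final_cube[("*", "milk", "Urbana", "*")])  # 15
print(final_cube[("*", "milk", "*", "*")])       # 35
```

Because each split is aggregated before anything is exchanged, only already-reduced cells need to be shuffled, in a single round.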
Streaming Cubing
§ Cube is great but…
  - Cube takes time to build, how about real-time analysis?
  - Sometimes we want to drill down to row-level information
§ Streaming cubing
  - Build micro cube segments from streaming
  - Use inverted index to capture last-minute data
Kylin Lambda Architecture
[Figure: streaming data feeds an inverted index covering the last hour (seconds delay); data from before the last hour is served from the cube (minutes delay); the Query Engine answers ANSI SQL through a hybrid storage interface over both]
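One way to picture the hybrid storage idea is a query that adds a pre-aggregated cube result for older data to a row-level lookup through an inverted index for recent data. The sketch below is purely illustrative: the data, the `price` and `item` fields, and the function names are hypothetical and not taken from Kylin.

```python
from collections import defaultdict


def build_inverted_index(rows, column):
    """Map each value of `column` to the IDs of the recent rows that contain it."""
    index = defaultdict(list)
    for row_id, row in enumerate(rows):
        index[row[column]].append(row_id)
    return index


def hybrid_sum(cube_totals, recent_rows, recent_index, item):
    """Total price for `item`: cube for older data plus index lookup for recent rows."""
    historical = cube_totals.get(item, 0)
    recent = sum(recent_rows[i]["price"] for i in recent_index.get(item, []))
    return historical + recent


# Before the last hour: already aggregated in the cube (hypothetical totals).
cube_totals = {"milk": 300, "bread": 120}

# Last hour: raw rows that have not been cubed yet (hypothetical).
recent_rows = [
    {"item": "milk", "price": 4},
    {"item": "bread", "price": 3},
    {"item": "milk", "price": 5},
]
recent_index = build_inverted_index(recent_rows, "item")

print(hybrid_sum(cube_totals, recent_rows, recent_index, "milk"))  # 309
```

Because the recent rows are kept at row level, the same index could also serve the drill-down to row-level information that the cube alone cannot provide.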
Adding Spark Support
§ Cubing Efficiency
  § MR is not the optimal framework
  § Spark Cubing Engine
§ Source from SparkSQL
  § Read data from SparkSQL instead of Hive
§ Route to SparkSQL
  § Unsupported queries are covered by SparkSQL
Kylin Evolution Roadmap
[Figure: timeline spanning 2013, 2014, 2015, 2016, and Future, with the milestones Sep, 2013; Jan, 2014; Oct, 2014; H1, 2015; and TBD]
§ Initial: Prototype for MOLAP
  • Basic end-to-end POC
§ MOLAP
  • Incremental Refresh • ANSI SQL • ODBC Driver • Web GUI • Tableau • ACL • Open Source
§ StreamingOLAP
  • Streaming OLAP • JDBC Driver • New UI • Excel • SparkSQL • … more
§ HybridOLAP
  • Lambda Arch • Automation • Capacity Management • Spark • … more
§ Next Gen
  • Adv OLAP Functions • In-Memory Analysis (TBD) • Mobile (TBD) • … more
Kylin Ecosystem
■ Kylin Core
  ■ Fundamental framework of Kylin OLAP Engine
■ Extension
  ■ Plugins to support additional functions and features
■ Integration
  ■ Lifecycle management support to integrate with other applications
■ Interface
  ■ Allows third-party users to build more features via a user interface atop Kylin core
Kylin OLAP Core
Extension: Security, Redis Storage, Spark Engine, Docker
Interface: Web Console, Customized BI, Ambari/Hue Plugin
Integration: ODBC Driver, ETL, Drill, SparkSQL
If you want to go fast, go alone. If you want to go far, go together.
-- African Proverb
dev@kylin.incubator.apache.org