apache kylin balance between space and time · 2017-12-14 · sql rest,server kylin architecture...

Apache Kylin Balance between Space and Time

Debashis Saha | Luke Han 2015-06-09

http://kylin.io | @ApacheKylin

About us

§ Debashis Saha (@debashis_saha) - VP, eBay Cloud Services (Platform, Infrastructure, Data)
§ Luke Han (@lukehq) - Sr. Product Manager, Analytics Data Infrastructure; Committer & PMC Member of Apache Kylin

Agenda

§ About Apache Kylin
§ Feature Highlights
§ Tech Highlights
§ Roadmap
§ Q&A

What

kylin / ˈkiːˈlɪn / 麒麟
-- n. (in Chinese art) a mythical animal of composite form

Extreme OLAP Engine for Big Data

Kylin is an open source distributed analytics engine, contributed by eBay Inc., that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.

• Open sourced on Oct 1st, 2014
• Accepted as Apache Incubator Project on Nov 25th, 2014

• http://kylin.io (http://kylin.incubator.apache.org)

Who is using Kylin?

§ External
  § 25+ contributors in the community
  § Adoption:
    § In production: Baidu Map
    § Evaluation: Huawei, Bloomberg Law, British Gas, JD.com, Microsoft, Tableau…
§ eBay internal cases - 90th-percentile query latency < 5 seconds

Case                    Cube Size   Raw Records
User Session Analysis   26 TB       28+ billion rows
Traffic Analysis        21 TB       20+ billion rows
Behavior Analysis       560 GB      1.2+ billion rows

(from mailing list)

Balance Between Space and Time

[Figure: chart with axes "Happiness" and "Latency", with a mark at 10s]

OLAP Cube

[Figure: cuboid lattice for the dimensions time, item, location, supplier]

0-D (apex) cuboid: (no dimensions)
1-D cuboids: time | item | location | supplier
2-D cuboids: time, item | time, location | time, supplier | item, location | item, supplier | location, supplier
3-D cuboids: time, item, location | time, item, supplier | time, location, supplier | item, location, supplier
4-D (base) cuboid: time, item, location, supplier

• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
  1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
  2. (9/15, milk, Urbana, *) - <time, item, location>
  3. (*, milk, Urbana, *) - <item, location>
  4. (*, milk, Chicago, *) - <item, location>
  5. (*, milk, *, *) - <item>
• Cuboid = one combination of dimensions
• Cube = all combinations of dimensions (all cuboids)
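The cuboid lattice on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `enumerate_cuboids` is hypothetical and is not part of Kylin.

```python
from itertools import combinations


def enumerate_cuboids(dimensions):
    """Return {k: [cuboids with exactly k dimensions]} for k = 0..N."""
    return {
        k: [tuple(c) for c in combinations(dimensions, k)]
        for k in range(len(dimensions) + 1)
    }


dims = ["time", "item", "location", "supplier"]
lattice = enumerate_cuboids(dims)

for k, cuboids in lattice.items():
    print(f"{k}-D: {len(cuboids)} cuboid(s)")

# The cube is the union of all layers: 2^N cuboids for N dimensions.
total = sum(len(cuboids) for cuboids in lattice.values())
```

Each layer holds every combination of k dimensions, which is why the total grows as 2^N.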

[Diagram labels: BI Tools, Web App…; ANSI SQL; MapReduce]

Feature Highlights

• Extremely Fast OLAP Engine at scale

• ANSI SQL Interface on Hadoop

• Seamless Integration with BI Tools, like Tableau

• Interactive Query Capability

• MOLAP Cube

• Incremental Build of Cubes

• Approximate Query Capability for Distinct Count (HyperLogLog)

• Leverage HBase Coprocessor to reduce query latency

• Job Management and Monitoring

• User-friendly Web GUI to manage, build, monitor, and query cubes

• Security capability to set ACL at Cube/Project Level

• Support LDAP Integration
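The HyperLogLog bullet above refers to approximate distinct counting. The sketch below is a minimal textbook HyperLogLog, not Kylin's implementation; the class name, the choice of SHA-1 as the hash, and the precision `p=10` are assumptions made for illustration. It shows the property that matters for cubes: sketches built separately can be merged by taking the register-wise maximum.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal textbook HyperLogLog with 2^p registers (illustrative only)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash from the value.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Combine two sketches; equivalent to sketching the union."""
        if self.p != other.p:
            raise ValueError("precision mismatch")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


# Two overlapping batches of hypothetical user IDs; 10,000 distinct in total.
seg1, seg2 = HyperLogLog(), HyperLogLog()
for i in range(6000):
    seg1.add(f"user-{i}")
for i in range(4000, 10000):
    seg2.add(f"user-{i}")

merged = seg1.merge(seg2)
estimate = merged.count()
```

With 1,024 registers the expected relative error is roughly 3%, so the estimate should land close to the true count while using a fixed, small amount of memory.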

Define Data Model

Manage Jobs

Explore the Data

Interact with BI Tools - Tableau

Kylin Architecture Overview

[Architecture diagram. Clients: 3rd Party App (Web App, Mobile…) via REST API; SQL-Based Tool (BI Tools: Tableau…) via JDBC/ODBC. Kylin: REST Server, Query Engine, Routing, Metadata, Cube Build Engine (MapReduce…). Storage: Hadoop Hive (Star Schema Data); Data Cube / OLAP Cube (HBase, Key Value Data). Paths: Low Latency - Seconds; Mid Latency - Minutes.]

➢ Online Analysis Data Flow
➢ Offline Data Flow
➢ Clients/users interact with Kylin via SQL
➢ The OLAP cube is transparent to users
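Since clients reach Kylin through the REST server with plain SQL, a client call might look like the sketch below. The endpoint path `/kylin/api/query`, the JSON field names, the host and port, the project name, and the credentials are all assumptions for illustration and should be checked against the Kylin REST documentation for the version in use. The request is only constructed, not sent, so the sketch runs without a server.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, sql, project, user, password, limit=100):
    """Build (but do not send) an HTTP POST carrying a SQL query.

    The path and payload fields are assumed, not taken from the slides.
    """
    payload = {
        "sql": sql,
        "project": project,
        "offset": 0,
        "limit": limit,
        "acceptPartial": False,
    }
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/kylin/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


req = build_query_request(
    "http://localhost:7070",                      # hypothetical host and port
    "SELECT COUNT(*) FROM test_kylin_fact",
    project="my_project",                         # hypothetical project name
    user="some_user",
    password="some_password",
)
# urllib.request.urlopen(req) would send it to a running server.
```

The caller never names a cube: it sends SQL against the star schema tables, which matches the point that the OLAP cube is transparent to users.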

Data Modeling

[Diagram: Source Star Schema (fact table with Dim tables) → Mapping Cube Metadata → Target HBase Storage (Row Key, Column Family, Column)]

Cube metadata: Cube: … Fact Table: … Dimensions: … Measures: … Storage (HBase): …
Roles: End User, Cube Modeler, Admin

Cube Build Job Flow

How to Store Cube - HBase Schema

Kylin Query Engine - Explain Plan

SELECT test_cal_dt.week_beg_dt,
       test_category.category_name,
       test_category.lvl2_name,
       test_category.lvl3_name,
       test_kylin_fact.lstg_format_name,
       test_sites.site_name,
       SUM(test_kylin_fact.price) AS GMV,
       COUNT(*) AS TRANS_CNT
FROM test_kylin_fact
LEFT JOIN test_cal_dt
       ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt
LEFT JOIN test_category
       ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id
      AND test_kylin_fact.lstg_site_id = test_category.site_id
LEFT JOIN test_sites
       ON test_kylin_fact.lstg_site_id = test_sites.site_id
WHERE test_kylin_fact.seller_id = 123456
   OR test_kylin_fact.lstg_format_name = 'New'
GROUP BY test_cal_dt.week_beg_dt,
         test_category.category_name,
         test_category.lvl2_name,
         test_category.lvl3_name,
         test_kylin_fact.lstg_format_name,
         test_sites.site_name

OLAPToEnumerableConverter
  OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8])
    OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()])
      OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0])
        OLAPFilterRel(condition=[OR(=($3, 123456), =($5, 'New'))])
          OLAPJoinRel(condition=[=($2, $25)], joinType=[left])
            OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left])
              OLAPJoinRel(condition=[=($4, $12)], joinType=[left])
                OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
                OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]])
              OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]])
            OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])

Cube Optimization

§ Curse of Dimensionality
  § An N-dimension cube has 2^N cuboids
  § Full Cube vs. Partial Cube
§ Huge Data Volume
  § Dictionary Encoding
  § Incremental Building
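Dictionary encoding replaces each distinct dimension value with a small integer ID, so keys stay compact even when the raw values are long strings. The sketch below is a generic sorted-dictionary illustration and makes no claim about how Kylin's own dictionaries are built; the function names and the sample values are hypothetical.

```python
def build_dictionary(values):
    """Map each distinct value to a dense integer ID, assigned in sorted order."""
    return {value: idx for idx, value in enumerate(sorted(set(values)))}


def encode(values, dictionary):
    """Replace every value with its integer ID."""
    return [dictionary[v] for v in values]


def decode(ids, dictionary):
    """Recover the original values from their IDs."""
    reverse = {idx: value for value, idx in dictionary.items()}
    return [reverse[i] for i in ids]


# Hypothetical values for a "location" dimension column.
locations = ["Urbana", "Chicago", "Urbana", "Boston", "Chicago"]
location_dict = build_dictionary(locations)
encoded = encode(locations, location_dict)
```

Assigning IDs in sorted order keeps the encoding order-preserving, so range comparisons on the original values can be evaluated on the IDs.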

§ Full Cube
  - Pre-aggregate all dimension combinations
  - "Curse of dimensionality": an N-dimension cube has 2^N cuboids
§ Partial Cube
  - To avoid dimension explosion, we divide the dimensions into different aggregation groups
  - 2^(N+M+L) → 2^N + 2^M + 2^L
  - For a cube with 30 dimensions, dividing the dimensions into 3 groups reduces the number of cuboids from about 1 billion to about 3 thousand
    - 2^30 → 2^10 + 2^10 + 2^10
  - Tradeoff between online aggregation and offline pre-aggregation
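The arithmetic behind the aggregation-group reduction can be checked directly. This sketch simply evaluates the slide's formula; the function names are hypothetical.

```python
def full_cube_cuboids(num_dimensions):
    """Cuboids in a full cube: every subset of the dimensions."""
    return 2 ** num_dimensions


def partial_cube_cuboids(group_sizes):
    """Cuboids when dimensions are split into independent aggregation groups,
    following the slide's formula 2^N + 2^M + 2^L."""
    return sum(2 ** size for size in group_sizes)


full = full_cube_cuboids(30)
partial = partial_cube_cuboids([10, 10, 10])
print(f"full cube: {full:,} cuboids; partial cube: {partial:,} cuboids")
```

Queries that combine dimensions from different groups are no longer covered by a pre-built cuboid, which is the online-versus-offline aggregation tradeoff noted above.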

Full Cube vs. Partial Cube

Partial Cube

Incremental Build

What's Next

§ Improve cube algorithm
  § Cube by segments, 30%-50% faster
  § Build delay down to tens of minutes
§ Streaming cubing
  § Analyze real-time data
  § Build delay down to seconds
§ Spark

Cube by Layer

§ The current algorithm
  - Many MR rounds: as many as the number of dimensions
  - Huge shuffles: aggregation happens on the reduce side, about 100x the total cube size

[Figure: layers labeled Full Data, 0-D Cuboid, 1-D Cuboid, 2-D Cuboid, 3-D Cuboid, 4-D Cuboid]
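The layer-by-layer idea can be sketched in memory: aggregate the full data into the base cuboid, then derive each lower layer from a cuboid in the layer above, one round per layer. This is a single-process illustration under assumed data, not Kylin's MapReduce job; all names and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def aggregate(rows, keep):
    """Sum the measure of (key, measure) rows, keeping only key positions in `keep`."""
    out = defaultdict(int)
    for key, measure in rows:
        out[tuple(key[i] for i in keep)] += measure
    return dict(out)


def build_by_layer(facts, num_dims):
    """Return {cuboid: {cell_key: sum}}, where a cuboid is a tuple of dimension indexes."""
    base = tuple(range(num_dims))
    cube = {base: aggregate(facts, base)}   # first round: base cuboid from full data
    layer = [base]
    while layer and len(layer[0]) > 0:      # stop once the apex cuboid is built
        next_layer = []
        for parent in layer:
            for child in combinations(parent, len(parent) - 1):
                if child in cube:
                    continue                # already derived from another parent
                # Positions of the child's dimensions inside the parent's key.
                keep = [parent.index(d) for d in child]
                cube[child] = aggregate(cube[parent].items(), keep)
                next_layer.append(child)
        layer = next_layer                  # one round per layer
    return cube


# Hypothetical fact rows: (time, item, location, supplier) -> quantity
facts = [
    (("9/15", "milk", "Urbana", "Dairy_land"), 3),
    (("9/15", "milk", "Chicago", "Dairy_land"), 5),
    (("9/16", "bread", "Urbana", "Bakers"), 2),
]
cube = build_by_layer(facts, 4)
```

Each round rereads and regroups the whole previous layer, which hints at why the number of rounds and the shuffle volume grow with the number of dimensions.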

Cube by Segments

§ The to-be algorithm, 30%-50% faster
  - 1 round of MR
  - Reduced shuffles: aggregation happens on the map side, about 20x the total cube size
  - Hourly incremental build done in tens of minutes

[Figure: each Data Split is processed by a mapper into a Cube Segment (repeated for every split); the segments pass through Merge Sort (Shuffle) into the Final Cube]
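The by-segment flow in the figure can be sketched the same way: every split is aggregated into a complete mini cube on the map side, and the segments are then merged by key. This is a deliberately naive, single-process illustration; the names and sample rows are hypothetical, and an actual implementation would avoid recomputing every cuboid from raw rows.

```python
from collections import Counter
from itertools import combinations


def build_segment(split, num_dims):
    """Map side: aggregate one data split into all cuboids at once."""
    segment = Counter()
    for key, measure in split:
        for k in range(num_dims + 1):
            for cuboid in combinations(range(num_dims), k):
                cell = tuple(key[i] for i in cuboid)
                segment[(cuboid, cell)] += measure
    return segment


def merge_segments(segments):
    """Merge step: cells with the same (cuboid, cell) key are summed."""
    final = Counter()
    for segment in segments:
        final.update(segment)   # Counter.update adds values for matching keys
    return final


# Hypothetical fact rows: (time, item, location, supplier) -> quantity
facts = [
    (("9/15", "milk", "Urbana", "Dairy_land"), 3),
    (("9/15", "milk", "Chicago", "Dairy_land"), 5),
    (("9/16", "bread", "Urbana", "Bakers"), 2),
    (("9/16", "milk", "Urbana", "Dairy_land"), 4),
]
splits = [facts[:2], facts[2:]]
final_cube = merge_segments(build_segment(s, 4) for s in splits)
```

Only pre-aggregated cells cross the merge boundary, which is the map-side aggregation the slide credits for the smaller shuffle.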

Streaming Cubing

§ Cube is great but…
  - Cube takes time to build; how about real-time analysis?
  - Sometimes we want to drill down to row-level information
§ Streaming cubing
  - Build micro cube segments from streaming
  - Use inverted index to capture last-minute data

Kylin Lambda Architecture

[Diagram labels: Streaming (seconds delay); Inverted Index (Last Hour); Before Last Hour (minutes delay); Hybrid Storage Interface; Query Engine; ANSI SQL]
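One way to read the hybrid storage idea is that a query combines pre-aggregated cube cells for older data with an on-the-fly scan of recent rows. The sketch below illustrates only that reading; the function name, the hour-based cutoff, and the data shapes are assumptions and do not describe Kylin's actual storage interface.

```python
def hybrid_sum(cube_cells, recent_rows, cutoff_hour):
    """Total a measure across two stores.

    cube_cells:  {hour: pre-aggregated sum}, used for hours before the cutoff
    recent_rows: [(hour, measure)] raw rows, scanned for hours at or after the cutoff
    """
    historical = sum(total for hour, total in cube_cells.items() if hour < cutoff_hour)
    recent = sum(measure for hour, measure in recent_rows if hour >= cutoff_hour)
    return historical + recent


# Hypothetical data: hours 0-2 are already in the cube, hour 3 is still streaming in.
cube_cells = {0: 120, 1: 95, 2: 130}
recent_rows = [(3, 10), (3, 7), (3, 12)]
total = hybrid_sum(cube_cells, recent_rows, cutoff_hour=3)
```

The cutoff keeps the two stores from double counting: each hour is answered by exactly one of them.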

Adding Spark Support

§ Cubing Efficiency
  § MR is not the optimal framework
  § Spark Cubing Engine
§ Source from SparkSQL
  § Read data from SparkSQL instead of Hive
§ Route to SparkSQL
  § Unsupported queries are covered by SparkSQL

Kylin Evolution Roadmap

[Timeline axis: 2013, 2014, 2015, 2016, Future. Milestone dates: Sep, 2013; Jan, 2014; Oct, 2014; H1, 2015]

§ Initial
§ Prototype for MOLAP
  • Basic end-to-end POC
§ MOLAP
  • Incremental Refresh • ANSI SQL • ODBC Driver • Web GUI • Tableau • ACL • Open Source
§ StreamingOLAP
  • Streaming OLAP • JDBC Driver • New UI • Excel • SparkSQL • … more
§ HybridOLAP
  • Lambda Arch • Automation • Capacity Management • Spark • … more
§ Next Gen
  • Adv OLAP Functions • In-Memory Analysis (TBD) • Mobile (TBD) • … more

Kylin Ecosystem

■ Kylin Core
  ■ Fundamental framework of Kylin OLAP Engine
■ Extension
  ■ Plugins to support additional functions and features
■ Integration
  ■ Lifecycle management support to integrate with other applications
■ Interface
  ■ Allows third-party users to build more features via a user interface atop Kylin core

Kylin OLAP Core

Extension
  → Security
  → Redis Storage
  → Spark Engine
  → Docker

Interface
  → Web Console
  → Customized BI
  → Ambari/Hue Plugin

Integration
  → ODBC Driver
  → ETL
  → Drill
  → SparkSQL

If you want to go fast, go alone. If you want to go far, go together.
-- African Proverb

dev@kylin.incubator.apache.org
