apache kylin @ big data europe 2015

50
Using Apache Kylin for Large Scale Data Analytics Apache Big Data Europe 2015

Upload: seshu-adunuthula

Post on 20-Jan-2017

1.618 views

Category:

Technology


4 download

TRANSCRIPT

Page 1: Apache Kylin @ Big Data Europe 2015

Using Apache Kylin for Large Scale Data Analytics

Apache Big Data Europe 2015

Page 2: Apache Kylin @ Big Data Europe 2015

http://kylin.io

Agenda Big Data @ eBay What’s Apache Kylin? Cube Segments &

Streaming Q&A

Page 3: Apache Kylin @ Big Data Europe 2015

Big Data @ eBay

$28BGMV VIA MOBILE

(2014)

266MMOBILE APP

GLOBALLY

1BLISTINGS CREATED

VIA MOBILE

157MACTIVE BUYERS

25MACTIVE SELLERS

800MACTIVE LISTINGS

8.8MNEW LISTINGS

EVERY WEEK

Page 4: Apache Kylin @ Big Data Europe 2015

CHECKOUT

WATCH LIST

SHARE

LOCAL

PREFERENCES

BID

100’s PB 9 Quarters of historical

data

100MQUERIES/MONTH

20TBBehavior Data Capture

every day

CLICK

SEARCHDA

TA

Page 5: Apache Kylin @ Big Data Europe 2015

Hadoop@ eBay

1-10 nodes2007

100+ nodes

1000s + core1 PB

2010

20111000+ node10,000+ core10+ PB

3000+ node10,000+ core

50+ PB2012

2013/201410,000 nodes150,000+ cores170 PB +60,000 Jobs/day> 2000 users

200950+ nodes

Page 6: Apache Kylin @ Big Data Europe 2015

Extreme OLAP Engine for Big DataApache Kylin is an open source Distributed Analytics Engine from eBay that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets

What’s Kylin

kylin / ˈkiːˈlɪn / 麒麟--n. (in Chinese art) a mythical animal of composite form

• Open Sourced on Oct 1st, 2014• Accepted as Apache Incubator Project on Nov 25th, 2014

Page 7: Apache Kylin @ Big Data Europe 2015

Business Needs for Big Data Analysis

Sub-second query latency on billions of rows ANSI SQL for both analysts and engineers Full OLAP capability to offer advanced functionality Seamless Integration with BI Tools

Support of high cardinality and high dimensions High concurrency – thousands of end users Distributed and scale out architecture for large data

volume

Page 8: Apache Kylin @ Big Data Europe 2015

Extremely Fast OLAP Engine at ScaleKylin is designed to reduce query latency on Hadoop for 10+ billions of rows of data

ANSI SQL Interface on Hadoop Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions

Seamless Integration with BI ToolsKylin currently offers integration capability with BI Tools like Tableau.

Interactive Query CapabilityUsers can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same dataset

MOLAP CubeUser can define a data model and pre-build in Kylin with more than 10+ billions of raw data records

Features Highlights

Page 9: Apache Kylin @ Big Data Europe 2015

Apache Kylin AdoptionExternal

25+ contributors in community

Adoption:In Production: eBay, Baidu MapsEvaluation: Huawei, Bloomberg Law, British Gas, JD.com

Microsoft, Tableau… eBay Use Cases

Case Cube Size Raw RecordsUser Session Analysis 14+TB 75+ billion rowsTraffic Analysis 21+ TB 20+ billion rowsBehavior Analysis 560+ GB 1.2+ billion rows

Page 10: Apache Kylin @ Big Data Europe 2015

Cube Designer

Page 11: Apache Kylin @ Big Data Europe 2015

Job Management

Page 12: Apache Kylin @ Big Data Europe 2015

Query and Visualization

Page 13: Apache Kylin @ Big Data Europe 2015

Tableau Integration

Page 14: Apache Kylin @ Big Data Europe 2015

Kylin Architecture Overview

14

Cube Builder (MapReduce…)

SQL

Low Latency - SecondsRouting

3rd Party App(Web App, Mobile…)

Metadata

SQL-Based Tool(BI Tools: Tableau…)

Query Engine

HadoopHive

REST API JDBC/ODBC

Online Analysis Data Flow Offline Data Flow

Clients/Users interactive with Kylin via SQL

OLAP Cube is transparent to users

Star Schema Data Key Value Data

Data Cube

OLAPCubes(HBase)

SQL

REST Server

Data

Sou

rce

Abst

racti

on Engine

Abstraction

Stor

age

Abst

racti

on

Page 15: Apache Kylin @ Big Data Europe 2015

Cube Storage in HBase

Page 16: Apache Kylin @ Big Data Europe 2015

OLAP Cube – Balance between Space and Time

time, item

time, item, location

time, item, location, supplier

time item location supplier

time, location

Time, supplier

item, locationitem, supplier

location, supplier

time, item, supplier

time, location, supplier

item, location, supplier

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid

Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells

1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>2. (9/15, milk, Urbana, *) - <time, item, location>3. (*, milk, Urbana, *) - <item, location>4. (*, milk, Chicago, *) - <item, location>5. (*, milk, *, *) - <item>

Cuboid = one combination of dimensions Cube = all combination of dimensions (all cuboids)

Page 17: Apache Kylin @ Big Data Europe 2015

Dynamic data management framework. Formerly known as Optiq, Calcite is an Apache incubator project, used by

Apache Drill and Apache Hive, among others. http://optiq.incubator.apache.org

Querying the Cube

Page 18: Apache Kylin @ Big Data Europe 2015

Metadata SPI Provide table schema from Kylin metadata

Optimization Rules Translate the logic operators into kylin operators

Relational Operators Find right cube Translate SQL into storage engine API call Generate physical execute plan by linq4j java implementation

Result Enumerator Translate storage engine result into java implementation result.

SQL Function Add HyperLogLog for distinct count Implement date time related functions (i.e. Quarter)

Query Engine based on Calcite

Page 19: Apache Kylin @ Big Data Europe 2015

Full Cube Pre-aggregate all dimension combinations “Curse of dimensionality”: N dimension cube has 2N cuboid.

Partial Cubes To avoid dimension explosion, we divide the dimensions into

different aggregation groups 2N+M+L 2N + 2M + 2L

For cube with 30 dimensions, if we divide these dimensions into 3 group, the cuboid number will reduce from 1 Billion to 3 Thousands

230 210 + 210 + 210

Tradeoff between online aggregation and offline pre-aggregation

Optimizing Cube Builds – Full Cube vs. Partial Cube

Page 20: Apache Kylin @ Big Data Europe 2015

Optimizing Cube Builds: Partial Cubes

Page 21: Apache Kylin @ Big Data Europe 2015

Optimizing Cube Building – Incremental Building

Page 22: Apache Kylin @ Big Data Europe 2015

Optimizing Cube Builds: Improved Parallelism• The current algorithm

– Many MRs, the number of dimensions– Huge shuffles, aggregation at reduce side, 100x of total cube size

Full Data

0-D Cuboid

1-D Cuboid

2-D Cuboid

3-D Cuboid

4-D CuboidMR

MR

MR

MR

MR

Page 23: Apache Kylin @ Big Data Europe 2015

Optimizing Cube Builds: Improved Parallelism

• The to-be algorithm, 30%-50% faster– 1 round MR– Reduced shuffles, map side aggregation, 20x total cube

size– Hourly incremental build done in tens of minutes

Data Split

Cube Segment

Data Split

Cube Segment

Data Split

Cube Segment

……

Final Cube

Merge Sort(Shuffle)

mapper mapper mapper

Page 24: Apache Kylin @ Big Data Europe 2015

Historical DataRecent Data

Kafka Topics

SQL Query

Optimizing Cube Builds: Streaming

Inverted Index

Hybrid Storage Interface

Cube Segments

Page 25: Apache Kylin @ Big Data Europe 2015

Use Case: SEO Operational Dashboard• eBay Site

– ebay.com, ebay.co.uk, ebay.de• Buyer Country

– US, CN, RU• Search Engine

– Google, Bing, Yahoo!• Referrer

– google.com, google.co.uk• Page

– Search, View Item, Product• User Experience

– Desktop, Mobile APP, mWeb

• Visits, GMB $, GMB share, conversion rate, bounce rate, # of view items, # of bought items etc.

Dimensions

Measurements

Page 26: Apache Kylin @ Big Data Europe 2015

Q & A

Page 27: Apache Kylin @ Big Data Europe 2015

http://kylin.io

Agenda What’s Apache Kylin? Features & Tech

Highlights Performance Roadmap Q & A

Page 28: Apache Kylin @ Big Data Europe 2015

Cube Storage in HBase

Page 29: Apache Kylin @ Big Data Europe 2015

Plugin-able storage engine Common iterator interface for storage engine Isolate query engine from underline storage

Translate cube query into HBase table scan Columns, Groups Cuboid ID Filters -> Scan Range (Row Key) Aggregations -> Measure Columns (Row Values)

Scan HBase table and translate HBase result into cube result

HBase Result (key + value) -> Cube Result (dimensions + measures)

How to Query Cube?

Storage Engine

Page 30: Apache Kylin @ Big Data Europe 2015

Compression and Encoding Support

Incremental Refresh of Cubes

Approximate Query Capability for distinct Count (HyperLogLog)

Leverage HBase Coprocessor for query latency

Job Management and Monitoring

Easy Web interface to manage, build, monitor and query cubes

Security capability to set ACL at Cube/Project Level

Support LDAP Integration

Features Highlights…

Page 31: Apache Kylin @ Big Data Europe 2015

Data Modeling

Cube: …Fact Table: …Dimensions: …Measures: …Storage(HBase): …Fact

Dim Dim

Dim

SourceStar Schema

row A

row B

row C

Column Family

Val 1

Val 2

Val 3

Row Key Column

Target HBase Storage

MappingCube Metadata

End User Cube Modeler Admin

Page 32: Apache Kylin @ Big Data Europe 2015

OLAP Cube – Balance between Space and Time

time, item

time, item, location

time, item, location, supplier

time item location supplier

time, location

Time, supplier

item, locationitem, supplier

location, supplier

time, item, supplier

time, location, supplier

item, location, supplier

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid

Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells

1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>2. (9/15, milk, Urbana, *) - <time, item, location>3. (*, milk, Urbana, *) - <item, location>4. (*, milk, Chicago, *) - <item, location>5. (*, milk, *, *) - <item>

Cuboid = one combination of dimensions Cube = all combination of dimensions (all cuboids)

Page 33: Apache Kylin @ Big Data Europe 2015

• Different ages have different analysis need, resulting in different storage architecture

• Inverted index may keep data long enough to support row level analysis

Optimizing Cube Builds: Streaming

ts ts … ts ts ts …

Streaming

ts ts

min min … min min … min min

hour hour … hour hour …

Last Hourinverted index

in memoryrow level

Last Weekcube segments

in memoryminute granularity

Monthscube segments

on diskhour granularity

cubing

cubing

Page 34: Apache Kylin @ Big Data Europe 2015

Transaction

Operation

Strategy

High Level Aggregatio

n

•Very High Level, e.g GMV by site by vertical by weeks

Analysis Query

•Middle level, e.g GMV by site by vertical, by category (level x) past 12 weeks

Drill Down to Detail

•Detail Level (Summary Table)

Low Level Aggregatio

n

•First Level Aggragation

Transaction Level

•Transaction Data

Analytics Query Taxonomy

OLAP

Kylin is designed to accelerate 80+% analytics queries performance on Hadoop

OLTP

Page 35: Apache Kylin @ Big Data Europe 2015

Cube Build Job Flow

Page 36: Apache Kylin @ Big Data Europe 2015

Huge volume data Table scan

Big table joins Data shuffling

Analysis on different granularity Runtime aggregation expensive

Map Reduce job Batch processing

Technical Challenges

Page 37: Apache Kylin @ Big Data Europe 2015

Big Data Era

More and more data becoming available on Hadoop Limitations in existing Business Intelligence (BI) Tools

Limited support for Hadoop Data size growing exponentially High latency of interactive queries Scale-Up architecture

Challenges to adopt Hadoop as interactive analysis system

Majority of analyst groups are SQL savvy No mature SQL interface on Hadoop OLAP capability on Hadoop ecosystem not ready yet

Page 38: Apache Kylin @ Big Data Europe 2015

Streaming Cubes

• Cube takes time to build, how about real-time analysis?

• Sometimes we want to drill down to row level information

• Streaming Cubing– Build micro cube segments from streaming– Use inverted index to capture last minute data

Page 39: Apache Kylin @ Big Data Europe 2015

Query Engine – Kylin Explain Plan

Page 40: Apache Kylin @ Big Data Europe 2015

Streaming…

Historic StoreReal-time Store

streaming Kafka

SQL Query

hourly/daily batchminutes batch

Inverted Index

Hybrid Storage Interface

Cube

Page 41: Apache Kylin @ Big Data Europe 2015

http://kylin.io

Agenda What’s Apache Kylin? Features & Tech

Highlights Performance Roadmap Q & A

Page 42: Apache Kylin @ Big Data Europe 2015

Kylin vs. Hive

# Query Type Return Dataset Query On Kylin (s)

Query On Hive (s)

Comments

1 High Level Aggregation

4 0.129 157.437 1,217 times

2 Analysis Query 22,669 1.615 109.206 68 times

3 Drill Down to Detail

325,029 12.058 113.123 9 times

4 Drill Down to Detail

524,780 22.42 6383.21 278 times

5 Data Dump 972,002 49.054 N/A

SQL #1 SQL #2 SQL #30

50

100

150

200

HiveKylin

High Level Aggregatio

n

Analysis Query

Drill Down to Detail

Low Level Aggregatio

n

Transaction Level

Based on 12+B records case

Page 43: Apache Kylin @ Big Data Europe 2015

Performance -- Concurrency

Linear scale out with more nodes

Page 44: Apache Kylin @ Big Data Europe 2015

Performance - Query Latency

90%tile queries <5s

Green Line: 90%tile queriesGray Line: 95%tile queries

Page 45: Apache Kylin @ Big Data Europe 2015

http://kylin.io

Agenda What’s Apache Kylin? Features & Tech

Highlights Performance Roadmap Q & A

Page 46: Apache Kylin @ Big Data Europe 2015

Kylin Evolution Roadmap

Page 47: Apache Kylin @ Big Data Europe 2015

Kylin Core Fundamental framework of

Kylin OLAP Engine Extension

Plugins to support for additional functions and features

Integration Lifecycle Management

Support to integrate with other applications

Interface Allows for third party users

to build more features via user-interface atop Kylin core

Driver ODBC and JDBC Drivers

Kylin OLAPCore

Extension Security Redis

Storage Spark

Engine Docker

Interface Web Console Customized BI Ambari/Hue

Plugin

Integration ODBC Driver ETL Drill SparkSQL

Kylin Ecosystem

Page 48: Apache Kylin @ Big Data Europe 2015

Cubing Efficiency MR is not optimal framework Spark Cubing Engine

Source from SparkSQL Read data from SparkSQL instead of Hive

Route to SparkSQL Unsupported queries be coved by SparkSQL

Adding Spark Support

Page 49: Apache Kylin @ Big Data Europe 2015

Hive Input source Pre-join star schema during cube building

MapReduce Pre-aggregation metrics during cube building

HDFS Store intermediated files during cube building.

HBase Store data cube. Serve query on data cube. Coprocessor is used for query processing.

Calcite Query Parsing Query Execution

The DRY Principle

Page 50: Apache Kylin @ Big Data Europe 2015

If you want to go fast, go alone.If you want to go far, go together.

--African Proverb

http://kylin.io