apache kylin @ big data europe 2015
TRANSCRIPT
Using Apache Kylin for Large Scale Data Analytics
Apache Big Data Europe 2015
http://kylin.io
Agenda Big Data @ eBay What’s Apache Kylin? Cube Segments &
Streaming Q&A
Big Data @ eBay
$28BGMV VIA MOBILE
(2014)
266MMOBILE APP
GLOBALLY
1BLISTINGS CREATED
VIA MOBILE
157MACTIVE BUYERS
25MACTIVE SELLERS
800MACTIVE LISTINGS
8.8MNEW LISTINGS
EVERY WEEK
CHECKOUT
WATCH LIST
SHARE
LOCAL
PREFERENCES
BID
100’s PB 9 Quarters of historical
data
100MQUERIES/MONTH
20TBBehavior Data Capture
every day
CLICK
SEARCHDA
TA
Hadoop@ eBay
1-10 nodes2007
100+ nodes
1000s + core1 PB
2010
20111000+ node10,000+ core10+ PB
3000+ node10,000+ core
50+ PB2012
2013/201410,000 nodes150,000+ cores170 PB +60,000 Jobs/day> 2000 users
200950+ nodes
Extreme OLAP Engine for Big DataApache Kylin is an open source Distributed Analytics Engine from eBay that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets
What’s Kylin
kylin / ˈkiːˈlɪn / 麒麟--n. (in Chinese art) a mythical animal of composite form
• Open Sourced on Oct 1st, 2014• Accepted as Apache Incubator Project on Nov 25th, 2014
Business Needs for Big Data Analysis
Sub-second query latency on billions of rows ANSI SQL for both analysts and engineers Full OLAP capability to offer advanced functionality Seamless Integration with BI Tools
Support of high cardinality and high dimensions High concurrency – thousands of end users Distributed and scale out architecture for large data
volume
Extremely Fast OLAP Engine at ScaleKylin is designed to reduce query latency on Hadoop for 10+ billions of rows of data
ANSI SQL Interface on Hadoop Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions
Seamless Integration with BI ToolsKylin currently offers integration capability with BI Tools like Tableau.
Interactive Query CapabilityUsers can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same dataset
MOLAP CubeUser can define a data model and pre-build in Kylin with more than 10+ billions of raw data records
Features Highlights
Apache Kylin AdoptionExternal
25+ contributors in community
Adoption:In Production: eBay, Baidu MapsEvaluation: Huawei, Bloomberg Law, British Gas, JD.com
Microsoft, Tableau… eBay Use Cases
Case Cube Size Raw RecordsUser Session Analysis 14+TB 75+ billion rowsTraffic Analysis 21+ TB 20+ billion rowsBehavior Analysis 560+ GB 1.2+ billion rows
Cube Designer
Job Management
Query and Visualization
Tableau Integration
Kylin Architecture Overview
14
Cube Builder (MapReduce…)
SQL
Low Latency - SecondsRouting
3rd Party App(Web App, Mobile…)
Metadata
SQL-Based Tool(BI Tools: Tableau…)
Query Engine
HadoopHive
REST API JDBC/ODBC
Online Analysis Data Flow Offline Data Flow
Clients/Users interactive with Kylin via SQL
OLAP Cube is transparent to users
Star Schema Data Key Value Data
Data Cube
OLAPCubes(HBase)
SQL
REST Server
Data
Sou
rce
Abst
racti
on Engine
Abstraction
Stor
age
Abst
racti
on
Cube Storage in HBase
OLAP Cube – Balance between Space and Time
time, item
time, item, location
time, item, location, supplier
time item location supplier
time, location
Time, supplier
item, locationitem, supplier
location, supplier
time, item, supplier
time, location, supplier
item, location, supplier
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D(base) cuboid
Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>2. (9/15, milk, Urbana, *) - <time, item, location>3. (*, milk, Urbana, *) - <item, location>4. (*, milk, Chicago, *) - <item, location>5. (*, milk, *, *) - <item>
Cuboid = one combination of dimensions Cube = all combination of dimensions (all cuboids)
Dynamic data management framework. Formerly known as Optiq, Calcite is an Apache incubator project, used by
Apache Drill and Apache Hive, among others. http://optiq.incubator.apache.org
Querying the Cube
Metadata SPI Provide table schema from Kylin metadata
Optimization Rules Translate the logic operators into kylin operators
Relational Operators Find right cube Translate SQL into storage engine API call Generate physical execute plan by linq4j java implementation
Result Enumerator Translate storage engine result into java implementation result.
SQL Function Add HyperLogLog for distinct count Implement date time related functions (i.e. Quarter)
Query Engine based on Calcite
Full Cube Pre-aggregate all dimension combinations “Curse of dimensionality”: N dimension cube has 2N cuboid.
Partial Cubes To avoid dimension explosion, we divide the dimensions into
different aggregation groups 2N+M+L 2N + 2M + 2L
For cube with 30 dimensions, if we divide these dimensions into 3 group, the cuboid number will reduce from 1 Billion to 3 Thousands
230 210 + 210 + 210
Tradeoff between online aggregation and offline pre-aggregation
Optimizing Cube Builds – Full Cube vs. Partial Cube
Optimizing Cube Builds: Partial Cubes
Optimizing Cube Building – Incremental Building
Optimizing Cube Builds: Improved Parallelism• The current algorithm
– Many MRs, the number of dimensions– Huge shuffles, aggregation at reduce side, 100x of total cube size
Full Data
0-D Cuboid
1-D Cuboid
2-D Cuboid
3-D Cuboid
4-D CuboidMR
MR
MR
MR
MR
Optimizing Cube Builds: Improved Parallelism
• The to-be algorithm, 30%-50% faster– 1 round MR– Reduced shuffles, map side aggregation, 20x total cube
size– Hourly incremental build done in tens of minutes
Data Split
Cube Segment
Data Split
Cube Segment
Data Split
Cube Segment
……
Final Cube
Merge Sort(Shuffle)
mapper mapper mapper
Historical DataRecent Data
Kafka Topics
SQL Query
Optimizing Cube Builds: Streaming
Inverted Index
Hybrid Storage Interface
Cube Segments
Use Case: SEO Operational Dashboard• eBay Site
– ebay.com, ebay.co.uk, ebay.de• Buyer Country
– US, CN, RU• Search Engine
– Google, Bing, Yahoo!• Referrer
– google.com, google.co.uk• Page
– Search, View Item, Product• User Experience
– Desktop, Mobile APP, mWeb
• Visits, GMB $, GMB share, conversion rate, bounce rate, # of view items, # of bought items etc.
Dimensions
Measurements
Q & A
http://kylin.io
Agenda What’s Apache Kylin? Features & Tech
Highlights Performance Roadmap Q & A
Cube Storage in HBase
Plugin-able storage engine Common iterator interface for storage engine Isolate query engine from underline storage
Translate cube query into HBase table scan Columns, Groups Cuboid ID Filters -> Scan Range (Row Key) Aggregations -> Measure Columns (Row Values)
Scan HBase table and translate HBase result into cube result
HBase Result (key + value) -> Cube Result (dimensions + measures)
How to Query Cube?
Storage Engine
Compression and Encoding Support
Incremental Refresh of Cubes
Approximate Query Capability for distinct Count (HyperLogLog)
Leverage HBase Coprocessor for query latency
Job Management and Monitoring
Easy Web interface to manage, build, monitor and query cubes
Security capability to set ACL at Cube/Project Level
Support LDAP Integration
Features Highlights…
Data Modeling
Cube: …Fact Table: …Dimensions: …Measures: …Storage(HBase): …Fact
Dim Dim
Dim
SourceStar Schema
row A
row B
row C
Column Family
Val 1
Val 2
Val 3
Row Key Column
Target HBase Storage
MappingCube Metadata
End User Cube Modeler Admin
OLAP Cube – Balance between Space and Time
time, item
time, item, location
time, item, location, supplier
time item location supplier
time, location
Time, supplier
item, locationitem, supplier
location, supplier
time, item, supplier
time, location, supplier
item, location, supplier
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D(base) cuboid
Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>2. (9/15, milk, Urbana, *) - <time, item, location>3. (*, milk, Urbana, *) - <item, location>4. (*, milk, Chicago, *) - <item, location>5. (*, milk, *, *) - <item>
Cuboid = one combination of dimensions Cube = all combination of dimensions (all cuboids)
• Different ages have different analysis need, resulting in different storage architecture
• Inverted index may keep data long enough to support row level analysis
Optimizing Cube Builds: Streaming
ts ts … ts ts ts …
Streaming
ts ts
min min … min min … min min
hour hour … hour hour …
Last Hourinverted index
in memoryrow level
Last Weekcube segments
in memoryminute granularity
Monthscube segments
on diskhour granularity
cubing
cubing
Transaction
Operation
Strategy
High Level Aggregatio
n
•Very High Level, e.g GMV by site by vertical by weeks
Analysis Query
•Middle level, e.g GMV by site by vertical, by category (level x) past 12 weeks
Drill Down to Detail
•Detail Level (Summary Table)
Low Level Aggregatio
n
•First Level Aggragation
Transaction Level
•Transaction Data
Analytics Query Taxonomy
OLAP
Kylin is designed to accelerate 80+% analytics queries performance on Hadoop
OLTP
Cube Build Job Flow
Huge volume data Table scan
Big table joins Data shuffling
Analysis on different granularity Runtime aggregation expensive
Map Reduce job Batch processing
Technical Challenges
Big Data Era
More and more data becoming available on Hadoop Limitations in existing Business Intelligence (BI) Tools
Limited support for Hadoop Data size growing exponentially High latency of interactive queries Scale-Up architecture
Challenges to adopt Hadoop as interactive analysis system
Majority of analyst groups are SQL savvy No mature SQL interface on Hadoop OLAP capability on Hadoop ecosystem not ready yet
Streaming Cubes
• Cube takes time to build, how about real-time analysis?
• Sometimes we want to drill down to row level information
• Streaming Cubing– Build micro cube segments from streaming– Use inverted index to capture last minute data
Query Engine – Kylin Explain Plan
Streaming…
Historic StoreReal-time Store
streaming Kafka
SQL Query
hourly/daily batchminutes batch
Inverted Index
Hybrid Storage Interface
Cube
http://kylin.io
Agenda What’s Apache Kylin? Features & Tech
Highlights Performance Roadmap Q & A
Kylin vs. Hive
# Query Type Return Dataset Query On Kylin (s)
Query On Hive (s)
Comments
1 High Level Aggregation
4 0.129 157.437 1,217 times
2 Analysis Query 22,669 1.615 109.206 68 times
3 Drill Down to Detail
325,029 12.058 113.123 9 times
4 Drill Down to Detail
524,780 22.42 6383.21 278 times
5 Data Dump 972,002 49.054 N/A
SQL #1 SQL #2 SQL #30
50
100
150
200
HiveKylin
High Level Aggregatio
n
Analysis Query
Drill Down to Detail
Low Level Aggregatio
n
Transaction Level
Based on 12+B records case
Performance -- Concurrency
Linear scale out with more nodes
Performance - Query Latency
90%tile queries <5s
Green Line: 90%tile queriesGray Line: 95%tile queries
http://kylin.io
Agenda What’s Apache Kylin? Features & Tech
Highlights Performance Roadmap Q & A
Kylin Evolution Roadmap
Kylin Core Fundamental framework of
Kylin OLAP Engine Extension
Plugins to support for additional functions and features
Integration Lifecycle Management
Support to integrate with other applications
Interface Allows for third party users
to build more features via user-interface atop Kylin core
Driver ODBC and JDBC Drivers
Kylin OLAPCore
Extension Security Redis
Storage Spark
Engine Docker
Interface Web Console Customized BI Ambari/Hue
Plugin
Integration ODBC Driver ETL Drill SparkSQL
Kylin Ecosystem
Cubing Efficiency MR is not optimal framework Spark Cubing Engine
Source from SparkSQL Read data from SparkSQL instead of Hive
Route to SparkSQL Unsupported queries be coved by SparkSQL
Adding Spark Support
Hive Input source Pre-join star schema during cube building
MapReduce Pre-aggregation metrics during cube building
HDFS Store intermediated files during cube building.
HBase Store data cube. Serve query on data cube. Coprocessor is used for query processing.
Calcite Query Parsing Query Execution
The DRY Principle
If you want to go fast, go alone.If you want to go far, go together.
--African Proverb
http://kylin.io