Introduction to Apache Tajo: Future of Data Warehouse
Jihoon Son / Gruter Inc.
I am
● Jihoon Son (@jihoonson)
○ Ph.D. at Korea Univ.
○ Tajo project co-founder
○ Committer and PMC member of Apache Tajo
○ Research engineer at Gruter
○ LinkedIn
■ https://www.linkedin.com/in/jihoonson
Today's Topic: Tajo
● What is Tajo?
○ Tajo / tάːzo / 타조
○ Ostrich in Korean
■ Fastest two-legged animal in the world
● What is Apache Tajo?
○ Our Ostrich can do SQL processing on big data!
■ SQL-on-Hadoop system
■ Apache Top-level project
Maybe You Think ...
SQL-on-Hadoop? Boring..
SQL-on-Hadoop Systems
Long-running ETL jobs
Low-latency interactive analysis
Long-running ETL jobs
● Requirements
○ Stable query execution
■ Fault-tolerance
● Can avoid query resubmission
○ Adaptation to dynamic environment
■ Available resources, unpredictable delays, ...
Low-latency interactive analysis
● Requirements
○ Fast query execution
■ Several query execution techniques
■ In-memory processing
Tajo Is Designed for Both Workloads
● Long-running ETL jobs
● Low-latency interactive analysis
Who is using Tajo?
Use Cases: SK Telecom
● Data warehousing & analysis
○ 1st telco in South Korea
■ 40 TB/day compressed data (2014)
SK Telecom: Before Tajo
[Figure: data warehouse architecture. Layers: Operational Systems, Integration Layer, Data Warehouse, linked by ETL steps. Components: Marketing, Sales, ERP, SCM, ODS, Staging Area, Data Vault, Data Marts, Strategic Marts. Platforms: Hadoop, MPP DBMS.]
SK Telecom: After Tajo
[Figure: data warehouse architecture. Layers: Operational Systems, Integration Layer, Data Warehouse, linked by ETL steps. Components: Marketing, Sales, ERP, SCM, ODS, Staging Area, Data Vault, Data Marts, Strategic Marts.]
● Long-running ETL jobs
● Ad-hoc analysis
Use Cases: SK Telecom
● Significantly reduced ETL & analysis time
○ Daily analysis becomes possible
○ More exploratory analysis is newly available with remaining resources
Use Cases: Bluehole Studio
● Game log analysis
○ Finding principal causes of service-quality deficiencies
● Tajo on EMR
● Their first log analysis system
○ Easy and rapid deployment of Tajo
○ Gentle learning curve thanks to standard SQL
● Immediate action becomes possible for user complaints and hidden bugs
Use Cases: Melon
● Data discovery
○ Music streaming service (26 million users)
○ Analysis of purchase history for target marketing
● Significantly reduced analysis time
○ Faster analysis by replacing Hive with Tajo
○ More analysis becomes possible
So, Why should you use Tajo?
● Easy to use
○ ANSI-SQL standard compliance (2003)
■ CTAS, Window functions, ...
○ Mature SQL features
■ Most existing queries can be executed without modification
○ Various data format support
■ Text, JSON, ORC, Parquet, ...
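The two SQL features named above, CTAS and window functions, can be sketched in a few lines. This is only an illustration: SQLite (version 3.25 or later, for window functions) stands in for a Tajo cluster so the example runs anywhere, and the `purchases` table, its columns, and its rows are hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for Tajo here; table and column names
# are hypothetical and exist only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (user_id INTEGER, item TEXT, amount REAL);
    INSERT INTO purchases VALUES
        (1, 'song_a', 1.0), (1, 'album_b', 9.0),
        (2, 'song_c', 1.5), (2, 'song_d', 2.5), (2, 'album_e', 8.0);
""")

# CTAS (CREATE TABLE AS SELECT): materialize a query result as a new table.
# The window function ranks each user's purchases by amount.
conn.execute("""
    CREATE TABLE ranked_purchases AS
    SELECT user_id, item, amount,
           RANK() OVER (PARTITION BY user_id ORDER BY amount DESC) AS amount_rank
    FROM purchases
""")

# Read back the most expensive purchase per user from the new table.
top_purchases = conn.execute("""
    SELECT user_id, item FROM ranked_purchases
    WHERE amount_rank = 1 ORDER BY user_id
""").fetchall()
print(top_purchases)
```

Because both statements are standard SQL, the same text should need little or no change on any engine that follows the standard, which is the point the slide makes.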
● Optimized performance
○ Optimized code
■ Optimized I/O performance
● Nearly max I/O performance (~120MB/s) per disk
■ Off-heap data processing
● Mitigating GC overhead
○ Cost-based query plan optimization
■ Join ordering
■ Best algorithm selection
● According to input size
■ Progressive optimization
● Further optimize the query plan during query execution
● Especially excellent for long-running queries
■ => Efficient star schema processing
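To make "join ordering" and "best algorithm selection according to input size" concrete, here is a minimal sketch of a size-driven planner for a star schema. It is not Tajo's optimizer: the broadcast threshold, the greedy smallest-first ordering, the strategy names, and the table sizes are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical threshold: inputs at or below this size are treated as cheap
# to replicate to every worker. This is not a real Tajo setting.
BROADCAST_LIMIT_BYTES = 64 * 1024 * 1024


@dataclass(frozen=True)
class Table:
    name: str
    size_bytes: int  # estimated input size


def choose_join_algorithm(left: Table, right: Table) -> str:
    """Pick a join strategy from the estimated input sizes."""
    if min(left.size_bytes, right.size_bytes) <= BROADCAST_LIMIT_BYTES:
        return "broadcast"    # ship the small side to every worker
    return "repartition"      # shuffle both sides on the join key


def order_joins(fact: Table, dimensions: List[Table]) -> List[Tuple[str, str]]:
    """Greedy heuristic for a star schema: join the smallest dimension first."""
    plan = []
    for dim in sorted(dimensions, key=lambda t: t.size_bytes):
        plan.append((dim.name, choose_join_algorithm(fact, dim)))
    return plan


# Illustrative sizes only; they are not measurements.
fact = Table("store_sales", 400 * 1024**3)
dims = [
    Table("customer", 2 * 1024**3),
    Table("date_dim", 10 * 1024**2),
    Table("item", 50 * 1024**2),
]
plan = order_joins(fact, dims)
print(plan)
```

A progressive optimizer would rerun a decision like `choose_join_algorithm` during execution, once the actual sizes of intermediate results are known, instead of relying only on estimates made before the query starts.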
● Various storage type support
Logical Data Warehouse with Tajo
[Figure labels: Global view; Application, DBMS, NoSQL, Cloud storage, On-premise storage]
● Fast delivery
● Easy maintenance
● Simple data flow
How fast is Tajo?
Evaluation on Cloud Environment
● Google Cloud Platform
○ Instance type: n1-standard-8
■ 8 cores, 30GB RAM
Target Systems
● Hive (0.12)
○ Baseline performance
○ Default configuration provided by GCP
■ Use the whole CPU and memory
● Tajo (0.11.0)
○ Default configuration provided by GCP
■ Use the whole CPU and memory
● Spark-SQL (1.5.0)
○ Default configuration provided by GCP
■ Use the whole CPU and memory
■ Tungsten enabled by default
○ spark.sql.shuffle.partitions is adjusted for better performance
TPC-DS
● Data
○ 24 tables
■ Plain text format
■ Stored on Google Cloud Storage
● Query
○ Only queries that can be executed on every system without modifications
■ Exception: Hive 0.12 doesn't support implicit joins, so every query had to be rewritten for Hive
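The kind of rewrite the Hive note refers to can be sketched as follows: an implicit join lists tables in FROM and puts the join condition in WHERE, while the explicit form uses JOIN ... ON. SQLite is used here only as a stand-in so the example runs locally; the column names follow TPC-DS naming, but the rows are hypothetical and the query is not one of the benchmark queries.

```python
import sqlite3

# Assumption: SQLite stands in for the benchmarked systems; the rows are
# made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store_sales (ss_item_sk INTEGER, ss_net_paid REAL);
    CREATE TABLE item (i_item_sk INTEGER, i_category TEXT);
    INSERT INTO store_sales VALUES (1, 10.0), (1, 5.0), (2, 7.5);
    INSERT INTO item VALUES (1, 'Music'), (2, 'Books');
""")

# Implicit join: comma-separated tables, join condition in WHERE.
implicit_join = """
    SELECT i_category, SUM(ss_net_paid)
    FROM store_sales, item
    WHERE ss_item_sk = i_item_sk
    GROUP BY i_category ORDER BY i_category
"""

# Explicit join: the same query rewritten with JOIN ... ON.
explicit_join = """
    SELECT i_category, SUM(ss_net_paid)
    FROM store_sales JOIN item ON ss_item_sk = i_item_sk
    GROUP BY i_category ORDER BY i_category
"""

implicit_rows = conn.execute(implicit_join).fetchall()
explicit_rows = conn.execute(explicit_join).fetchall()
print(implicit_rows == explicit_rows)
```

The two forms return the same rows, so a rewrite of this kind changes only the syntax of a query, not what it computes.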
[Figure: SF 1000, 50 instances]
Cannot be run on 1TB
[Figure: SF 10000, 50 instances]
Demo
Simple Demo on EMR
● Using TPC-H data set, but
○ Lineitem table is stored on HDFS
○ Orders table is stored on PostgreSQL
○ Other tables are stored on S3
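The idea behind this demo, one SQL query joining tables that live in different stores, can be imitated in miniature. In the sketch below, two independent in-memory SQLite databases play the roles of HDFS and PostgreSQL. This is an analogy only: it does not use Tajo or its storage configuration, the schema names `hdfs_store` and `pg_store` are hypothetical, and the rows are made up (the column names follow TPC-H naming).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each ATTACH of ':memory:' creates a separate, independent database.
# Assumption: they merely play the role of distinct storage systems.
conn.execute("ATTACH DATABASE ':memory:' AS hdfs_store")
conn.execute("ATTACH DATABASE ':memory:' AS pg_store")

conn.executescript("""
    CREATE TABLE hdfs_store.lineitem (l_orderkey INTEGER, l_extendedprice REAL);
    CREATE TABLE pg_store.orders (o_orderkey INTEGER, o_orderstatus TEXT);
    INSERT INTO hdfs_store.lineitem VALUES (1, 100.0), (1, 50.0), (2, 30.0);
    INSERT INTO pg_store.orders VALUES (1, 'F'), (2, 'O');
""")

# A single query joins tables from both stores.
revenue_by_status = conn.execute("""
    SELECT o.o_orderstatus, SUM(l.l_extendedprice)
    FROM hdfs_store.lineitem AS l
    JOIN pg_store.orders AS o ON l.l_orderkey = o.o_orderkey
    GROUP BY o.o_orderstatus ORDER BY o.o_orderstatus
""").fetchall()
print(revenue_by_status)
```

From the query author's side, only the table qualifiers reveal where the data lives; the join itself is ordinary SQL.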
Apache Tajo
● Is excellent for both long-running ETL jobs and exploratory ad-hoc analysis
● Is very fast
● Supports query federation on diverse data sources
Get Involved!
● We are recruiting contributors!
● General
○ http://tajo.apache.org/
● Getting Started
○ http://tajo.apache.org/docs/current/getting_started.html
● Downloads
○ http://tajo.apache.org/downloads.html
● Issue tracker
○ http://issues.apache.org/jira/browse/TAJO
● Join the mailing list
○ [email protected]
○ [email protected]
Useful Links
● EMR bootstrap
○ https://github.com/awslabs/emr-bootstrap-actions/tree/master/tajo
● How to set up Tajo on EMR
○ http://www.gruter.com/blog/setting-up-a-tajo-cluster-on-amazon-emr/
Q & A