Introduction to Apache Tajo: Future of Data Warehouse

Jihoon Son / Gruter Inc.

I am

● Jihoon Son (@jihoonson)
○ Ph.D. at Korea Univ.
○ Tajo project co-founder
○ Committer and PMC member of Apache Tajo
○ Research engineer at Gruter
○ LinkedIn
■ https://www.linkedin.com/in/jihoonson

Today's Topic: Tajo

● What is Tajo?
○ Tajo / tάːzo / 타조
○ Ostrich in Korean
■ Fastest two-legged animal in the world

Today's Topic: Tajo

● What is Apache Tajo?
○ Our Ostrich can do SQL processing on big data!
■ SQL-on-Hadoop system
■ Apache Top-level project

Maybe You Think ...

SQL-on-Hadoop? Boring...

This Ostrich is Different!


http://www.youtube.com/watch?v=VL30jp4tgnM

SQL-on-Hadoop Systems


Long-running ETL jobs

● Requirements
○ Stable query execution
■ Fault-tolerance
● Can avoid query resubmission
○ Adaptation to dynamic environment
■ Available resources, unpredictable delays, ...

Low-latency interactive analysis

● Requirements
○ Fast query execution
■ Several query execution techniques
■ In-memory processing

Tajo is designed for Both Workloads


Who is using Tajo?

Use Cases: SK Telecom

● Data warehousing & analysis
○ 1st telco in South Korea
■ 40 TB/day compressed data (2014)

SK Telecom: Before Tajo

[Diagram: Operational Systems (Marketing, Sales, ERP, SCM), Integration Layer (ODS, Staging Area, Data Vault), and Data Warehouse (Data Marts, Strategic Marts), connected by ETL steps; labeled with Hadoop and MPP DBMS]

SK Telecom: After Tajo

[Diagram: Operational Systems (Marketing, Sales, ERP, SCM), Integration Layer (ODS, Staging Area, Data Vault), and Data Warehouse (Data Marts, Strategic Marts), connected by ETL steps]

● Long-running ETL jobs
● Ad-hoc analysis

Use Cases: SK Telecom

● Significantly reduced ETL & analysis time
○ Daily analysis becomes possible
○ More exploratory analysis is newly available with remaining resources

Use Cases: Bluehole Studio

● Game log analysis
○ Finding principal causes of service-quality deficiencies


● Tajo on EMR



● Their first log analysis system
○ Easy and rapid deployment of Tajo
○ Low learning curve with standard SQL
● Immediate action becomes possible for user complaints and hidden bugs

Use Cases: Melon

● Data discovery
○ Music streaming service (26 million users)
○ Analysis of purchase history for target marketing
● Significantly reduced analysis time
○ Faster analysis by replacing Hive with Tajo
○ More analysis becomes possible

So, why should you use Tajo?


● Easy to use
○ ANSI-SQL standard compliance (2003)
■ CTAS, Window functions, ...
○ Mature SQL features
■ Most existing queries can be executed without modification
○ Various data format support
■ Text, JSON, ORC, Parquet, …


● Optimized performance
○ Optimized code
■ Optimized I/O performance
● Nearly max I/O performance (~120 MB/s) per disk
■ Off-heap data processing
● Mitigating GC overhead
○ Cost-based query plan optimization
■ Join ordering
■ Best algorithm selection
● According to input size
■ Progressive optimization
● Further optimize the query plan during query execution
● Especially effective for long-running queries
■ => Efficient star schema processing
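As a rough illustration of the two cost-based decisions named above (join ordering and choosing a join algorithm according to input size), the following Python sketch applies a greedy heuristic to a star schema: one large fact table joined with smaller dimension tables. It is not Tajo's optimizer. The function names, the table sizes, the 64 MB broadcast threshold, and the simplification that a join result is as large as its bigger input are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical threshold: an input at or below this size is cheap enough
# to copy to every worker instead of shuffling both sides.
BROADCAST_THRESHOLD_BYTES = 64 * 1024 * 1024


@dataclass(frozen=True)
class Table:
    name: str
    size_bytes: int  # estimated input size, e.g. from catalog statistics


def choose_join_algorithm(left_bytes: int, right_bytes: int) -> str:
    """Pick a join strategy from the estimated sizes of both inputs."""
    if min(left_bytes, right_bytes) <= BROADCAST_THRESHOLD_BYTES:
        return "broadcast"
    return "repartition"


def greedy_join_order(tables: list[Table]) -> list[tuple[str, str, str]]:
    """Return (left, right, algorithm) steps for a star-schema query.

    The largest table is treated as the fact table, and the remaining
    tables are joined onto it from smallest to largest.
    """
    if len(tables) < 2:
        raise ValueError("need at least two tables to join")
    ordered = sorted(tables, key=lambda t: t.size_bytes)
    fact, dims = ordered[-1], ordered[:-1]

    current_name, current_bytes = fact.name, fact.size_bytes
    plan = []
    for dim in dims:
        algorithm = choose_join_algorithm(current_bytes, dim.size_bytes)
        plan.append((current_name, dim.name, algorithm))
        current_name = f"({current_name} JOIN {dim.name})"
        # Toy estimate (assumption): the result is as large as the bigger input.
        current_bytes = max(current_bytes, dim.size_bytes)
    return plan


MB = 1024 * 1024
GB = 1024 * MB

# Hypothetical sizes for four tables of a star schema.
example_tables = [
    Table("store_sales", 400 * GB),
    Table("customer", 2 * GB),
    Table("date_dim", 10 * MB),
    Table("item", 50 * MB),
]

for step in greedy_join_order(example_tables):
    print(step)
```

With these made-up sizes, the two small dimension tables are joined first using a broadcast join, and only the join with the larger `customer` table falls back to repartitioning. A progressive optimizer could rerun the same size-based choice during execution, once the actual sizes of intermediate results are known.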



● Various storage type support

Logical Data Warehouse with Tajo

[Diagram labels: Global view, Application, DBMS, NoSQL, Cloud storage, On-premise storage]

● Fast delivery
● Easy maintenance
● Simple data flow

How fast is Tajo?


Evaluation on Cloud Environment

● Google Cloud Platform
○ Instance type: n1-standard-8
■ 8 cores, 30 GB RAM

Target Systems

● Hive (0.12)
○ Baseline performance
○ Default configuration provided by GCP
■ Uses the whole CPU and memory
● Tajo (0.11.0)
○ Default configuration provided by GCP
■ Uses the whole CPU and memory
● Spark-SQL (1.5.0)
○ Default configuration provided by GCP
■ Uses the whole CPU and memory
■ Tungsten enabled by default
○ spark.sql.shuffle.partitions is adjusted for better performance

TPC-DS

● Data
○ 24 tables
■ Plain text format
■ Stored on Google Cloud Storage
● Query
○ Only queries that can be executed on every system without modification
■ Exception: Hive 0.12 doesn't support implicit joins, so every query had to be rewritten for Hive

SF 1000, 50 instances

[Charts: benchmark results; one chart is annotated "Cannot be run on 1TB"]

Demo


Simple Demo on EMR

● Using TPC-H data set, but
○ Lineitem table is stored on HDFS
○ Orders table is stored on PostgreSQL
○ Other tables are stored on S3
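The storage layout of this demo can be pictured as a small catalog lookup: each table in a query is resolved to the storage system that holds it, and a single query may touch several of them. The Python sketch below is only a toy model of that idea, not Tajo's catalog or API. The table-to-storage assignment comes from the demo description; the function names and the grouping step are hypothetical.

```python
from collections import defaultdict

# The eight tables of the TPC-H schema.
TPCH_TABLES = [
    "lineitem", "orders", "customer", "part",
    "partsupp", "supplier", "nation", "region",
]

# Storage assignment as described for the demo: two tables live elsewhere,
# all other tables are on S3.
STORAGE_OVERRIDES = {"lineitem": "HDFS", "orders": "PostgreSQL"}
DEFAULT_STORAGE = "S3"


def storage_for(table: str) -> str:
    """Resolve a table name to the storage system that holds its data."""
    if table not in TPCH_TABLES:
        raise KeyError(f"unknown table: {table}")
    return STORAGE_OVERRIDES.get(table, DEFAULT_STORAGE)


def scans_by_storage(tables_in_query: list) -> dict:
    """Group the tables referenced by one query by their storage system."""
    groups = defaultdict(list)
    for table in tables_in_query:
        groups[storage_for(table)].append(table)
    return dict(groups)


# A query joining customer, orders, and lineitem reads from three systems.
print(scans_by_storage(["customer", "orders", "lineitem"]))
```

From the user's point of view the query is written against one global view; which storage system each scan goes to is decided by the lookup, not by the query text.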

Apache Tajo

● Is excellent for both long-running ETL jobs and exploratory ad-hoc analysis

● Is very fast
● Supports query federation on diverse data sources

Get Involved!

● We are recruiting contributors!
● General
○ http://tajo.apache.org/
● Getting Started
○ http://tajo.apache.org/docs/current/getting_started.html
● Downloads
○ http://tajo.apache.org/downloads.html
● Issue tracker
○ http://issues.apache.org/jira/browse/TAJO
● Join the mailing list
○ [email protected]
○ [email protected]





Useful Links

● EMR bootstrap
○ https://github.com/awslabs/emr-bootstrap-actions/tree/master/tajo
● How to set up Tajo on EMR
○ http://www.gruter.com/blog/setting-up-a-tajo-cluster-on-amazon-emr/



Q & A
