Introduction to Apache Tajo: Future of Data Warehouse Jihoon Son / Gruter Inc.

Upload: jihoon-son

Post on 13-Jan-2017

TRANSCRIPT

Page 1: Introduction to Apache Tajo: Future of Data Warehouse

Introduction to Apache Tajo: Future of Data Warehouse
Jihoon Son / Gruter Inc.

Page 2: Introduction to Apache Tajo: Future of Data Warehouse

I am

● Jihoon Son (@jihoonson)
  ○ Ph.D. at Korea Univ.
  ○ Tajo project co-founder
  ○ Committer and PMC member of Apache Tajo
  ○ Research engineer at Gruter
  ○ LinkedIn
    ■ https://www.linkedin.com/in/jihoonson

Page 3: Introduction to Apache Tajo: Future of Data Warehouse

Today's Topic: Tajo

● What is Tajo?
  ○ Tajo / tάːzo / 타조
  ○ Ostrich in Korean
    ■ Fastest two-legged animal in the world

Page 4: Introduction to Apache Tajo: Future of Data Warehouse

Today's Topic: Tajo

● What is Apache Tajo?
  ○ Our Ostrich can do SQL processing on big data!
    ■ SQL-on-Hadoop system
    ■ Apache Top-level project

Page 5: Introduction to Apache Tajo: Future of Data Warehouse

Maybe You Think ...

SQL-on-Hadoop? Boring...

Page 6: Introduction to Apache Tajo: Future of Data Warehouse

This Ostrich is Different!

Page 7: Introduction to Apache Tajo: Future of Data Warehouse

SQL-on-Hadoop Systems

Page 9: Introduction to Apache Tajo: Future of Data Warehouse

SQL-on-Hadoop Systems

Long-running ETL jobs

Low-latency interactive analysis

Page 10: Introduction to Apache Tajo: Future of Data Warehouse

SQL-on-Hadoop Systems

Long-running ETL jobs

● Requirements
  ○ Stable query execution
    ■ Fault-tolerance
      ● Can avoid query resubmission
  ○ Adaptation to dynamic environment
    ■ Available resources, unpredictable delays, ...
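The fault-tolerance requirement can be illustrated with a minimal sketch. This is a generic illustration, not Tajo's actual mechanism: the names `run_query` and `make_flaky_task` are hypothetical, and failures are simulated. The point is that only the failed task is retried, so the query as a whole does not need to be resubmitted.

```python
from typing import Callable, Dict, List


def run_query(tasks: List[Callable[[], int]], max_attempts: int = 3) -> Dict[str, object]:
    """Run every task in order; retry only a failed task, not the whole query."""
    results = []
    attempts_used = 0
    for task in tasks:
        for attempt in range(1, max_attempts + 1):
            attempts_used += 1
            try:
                results.append(task())
                break  # task succeeded, move on to the next one
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # the query fails only after repeated task failures
    return {"results": results, "attempts": attempts_used}


def make_flaky_task(value: int, failures: int) -> Callable[[], int]:
    """Hypothetical task that fails `failures` times before returning `value`."""
    state = {"left": failures}

    def task() -> int:
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("simulated worker failure")
        return value

    return task


# The second task fails twice, but the first task's result is never recomputed.
outcome = run_query([make_flaky_task(10, 0), make_flaky_task(20, 2), make_flaky_task(30, 0)])
```

For a long-running ETL job, this kind of task-level retry is what keeps a single worker failure from costing hours of already completed work.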

Page 11: Introduction to Apache Tajo: Future of Data Warehouse

SQL-on-Hadoop Systems

Low-latency interactive analysis

● Requirements
  ○ Fast query execution
    ■ Several query execution techniques
    ■ In-memory processing

Page 12: Introduction to Apache Tajo: Future of Data Warehouse

Tajo is designed for Both Workloads

Long-running ETL jobs

Low-latency interactive analysis

Page 13: Introduction to Apache Tajo: Future of Data Warehouse

Who are using Tajo?

Page 14: Introduction to Apache Tajo: Future of Data Warehouse

Use Cases: SK Telecom

● Data warehousing & analysis
  ○ 1st telco in South Korea
    ■ 40 TB/day compressed data (2014)

Page 15: Introduction to Apache Tajo: Future of Data Warehouse

SK Telecom: Before Tajo

[Diagram: ETL stages connect Operational Systems, Integration Layer, and Data Warehouse. Component labels: Marketing, Sales, ERP, SCM, ODS, Staging Area, Data Vault, Data Marts, Strategic Marts. Platforms: Hadoop, MPP DBMS]

Page 17: Introduction to Apache Tajo: Future of Data Warehouse

SK Telecom: After Tajo

[Diagram: ETL stages connect Operational Systems, Integration Layer, and Data Warehouse. Component labels: Marketing, Sales, ERP, SCM, ODS, Staging Area, Data Vault, Data Marts, Strategic Marts]

● Long-running ETL jobs
● Ad-hoc analysis

Page 18: Introduction to Apache Tajo: Future of Data Warehouse

Use Cases: SK Telecom

● Significantly reduced ETL & analysis time
  ○ Daily analysis becomes possible
  ○ More exploratory analysis is newly available with remaining resources

Page 19: Introduction to Apache Tajo: Future of Data Warehouse

Use Cases: Bluehole Studio

● Game log analysis
  ○ Finding principal causes of service-quality deficiencies

Page 20: Introduction to Apache Tajo: Future of Data Warehouse

Use Cases: Bluehole Studio

● Tajo on EMR

Page 21: Introduction to Apache Tajo: Future of Data Warehouse

Use Cases: Bluehole Studio

● Their first log analysis system
  ○ Easy and rapid deployment of Tajo
  ○ Gentle learning curve thanks to standard SQL
● Immediate action becomes possible for user complaints and hidden bugs

Page 22: Introduction to Apache Tajo: Future of Data Warehouse

Use Cases: Melon

● Data discovery
  ○ Music streaming service (26 million users)
  ○ Analysis of purchase history for targeted marketing
● Significantly reduced analysis time
  ○ Faster analysis by replacing Hive with Tajo
  ○ More analysis becomes possible

Page 23: Introduction to Apache Tajo: Future of Data Warehouse

So, Why should you use Tajo?

Page 27: Introduction to Apache Tajo: Future of Data Warehouse

So, Why should you use Tajo?

● Easy to use
  ○ ANSI SQL standard compliance (SQL:2003)
    ■ CTAS, Window functions, ...
  ○ Mature SQL features
    ■ Most existing queries can be executed without modification
  ○ Various data format support
    ■ Text, JSON, ORC, Parquet, …
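As a small illustration of the two SQL features named above, the sketch below runs a CTAS statement and a window function. It uses SQLite through Python's standard `sqlite3` module purely as a stand-in engine (window functions need SQLite 3.25 or newer); the `sales` table and its data are hypothetical, and Tajo's own DDL options may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('east', 10), ('east', 30), ('west', 20);

    -- CTAS: create a table directly from a query result.
    CREATE TABLE region_totals AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
""")

totals = conn.execute(
    "SELECT region, total FROM region_totals ORDER BY region"
).fetchall()

# Window function: rank each sale within its region without collapsing rows.
ranked = conn.execute("""
    SELECT region, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
    ORDER BY region, rnk
""").fetchall()
```

Both statements are plain standard SQL, which is the sense in which existing queries can often be carried over unchanged.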

Page 29: Introduction to Apache Tajo: Future of Data Warehouse

So, Why should you use Tajo?

● Optimized performance
  ○ Optimized code
    ■ Optimized I/O performance
      ● Nearly max I/O performance (~120 MB/s) per disk
    ■ Off-heap data processing
      ● Mitigating GC overhead

Page 30: Introduction to Apache Tajo: Future of Data Warehouse

So, Why should you use Tajo?

● Optimized performance
  ○ Cost-based query plan optimization
    ■ Join ordering
    ■ Best algorithm selection
      ● According to input size
    ■ Progressive optimization
      ● Further optimize the query plan during query execution
      ● Especially excellent for long-running queries
    ■ => Efficient star schema processing
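The ideas of join ordering and size-based algorithm selection can be sketched in a few lines. This is a simplified heuristic for illustration only, not Tajo's optimizer: the function names, the 64 MB broadcast threshold, and the table sizes are all assumptions (the table names are borrowed from TPC-DS).

```python
from typing import Dict, List

# Hypothetical threshold, not a Tajo default.
BROADCAST_LIMIT_BYTES = 64 * 1024 * 1024


def choose_join_algorithm(left_bytes: int, right_bytes: int) -> str:
    """Pick a join strategy from the estimated input sizes."""
    if min(left_bytes, right_bytes) <= BROADCAST_LIMIT_BYTES:
        return "broadcast"  # ship the small side to every worker
    return "repartition"  # shuffle both sides on the join key


def order_joins(fact: str, dims: Dict[str, int]) -> List[str]:
    """Greedy order for a star schema: join the smallest dimension tables first."""
    return [fact] + sorted(dims, key=lambda name: dims[name])


# Assumed size estimates in bytes for three dimension tables.
sizes = {"date_dim": 10 * 2**20, "item": 50 * 2**20, "customer": 2 * 2**30}
fact_bytes = 400 * 2**30

plan = order_joins("store_sales", sizes)
algos = [choose_join_algorithm(fact_bytes, sizes[table]) for table in plan[1:]]
```

A real cost-based optimizer would also weigh selectivity and intermediate result sizes, and progressive optimization would revisit these choices as actual sizes become known during execution.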

Page 31: Introduction to Apache Tajo: Future of Data Warehouse

So, Why should you use Tajo?

● Various storage type support

Page 34: Introduction to Apache Tajo: Future of Data Warehouse

Logical Data Warehouse with Tajo

[Diagram labels: Global view; Application, DBMS, NoSQL, Cloud storage, On-premise storage]

● Fast delivery
● Easy maintenance
● Simple data flow

Page 35: Introduction to Apache Tajo: Future of Data Warehouse

How fast is Tajo?

Page 36: Introduction to Apache Tajo: Future of Data Warehouse

Evaluation on Cloud Environment

● Google Cloud Platform
  ○ Instance type: n1-standard-8
    ■ 8 cores, 30 GB RAM

Page 37: Introduction to Apache Tajo: Future of Data Warehouse

Target Systems

● Hive (0.12)
  ○ Baseline performance
  ○ Default configuration provided by GCP
    ■ Use the whole CPU and memory
● Tajo (0.11.0)
  ○ Default configuration provided by GCP
    ■ Use the whole CPU and memory

Page 38: Introduction to Apache Tajo: Future of Data Warehouse

Target Systems

● Spark-SQL (1.5.0)
  ○ Default configuration provided by GCP
    ■ Use the whole CPU and memory
    ■ Tungsten enabled by default
  ○ spark.sql.shuffle.partitions is adjusted for better performance

Page 39: Introduction to Apache Tajo: Future of Data Warehouse

TPC-DS

● Data
  ○ 24 tables
    ■ Plain text format
    ■ Stored on Google Cloud Storage
● Query
  ○ Queries that can be executed on every system without modification
    ■ Exception: Hive 0.12 doesn't support implicit joins, so every query had to be rewritten for Hive
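The kind of rewrite needed for Hive can be shown with a small example. The sketch below uses SQLite via Python's `sqlite3` module as a stand-in engine, with hypothetical `orders` and `customer` tables; it only demonstrates that an implicit (comma-style) join and its explicit `JOIN ... ON` rewrite return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (o_id INTEGER, c_id INTEGER, total INTEGER);
    CREATE TABLE customer (c_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 100, 25), (2, 101, 40), (3, 100, 15);
    INSERT INTO customer VALUES (100, 'kim'), (101, 'lee');
""")

# Implicit join: tables listed in FROM, join condition in WHERE.
implicit_join = """
    SELECT c.name, SUM(o.total)
    FROM orders o, customer c
    WHERE o.c_id = c.c_id
    GROUP BY c.name ORDER BY c.name
"""

# Explicit join: the same condition moved into JOIN ... ON.
explicit_join = """
    SELECT c.name, SUM(o.total)
    FROM orders o JOIN customer c ON o.c_id = c.c_id
    GROUP BY c.name ORDER BY c.name
"""

implicit_rows = conn.execute(implicit_join).fetchall()
explicit_rows = conn.execute(explicit_join).fetchall()
```

The rewrite is mechanical, but it has to be applied to every benchmark query that lists more than one table in its FROM clause.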

Page 40: Introduction to Apache Tajo: Future of Data Warehouse

SF 1000, 50 instances

Page 42: Introduction to Apache Tajo: Future of Data Warehouse

SF 1000, 50 instances

Cannot be run on 1TB

Page 43: Introduction to Apache Tajo: Future of Data Warehouse

SF 10000, 50 instances

Page 45: Introduction to Apache Tajo: Future of Data Warehouse

Demo

Page 46: Introduction to Apache Tajo: Future of Data Warehouse

Simple Demo on EMR

● Using TPC-H data set, but
  ○ Lineitem table is stored on HDFS
  ○ Orders table is stored on PostgreSQL
  ○ Other tables are stored on S3
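The demo setup, one query joining tables that live in different storage systems, can be imitated on a small scale. The sketch below is an analogy rather than Tajo itself: it attaches two separate in-memory SQLite databases (named `hdfs_like` and `pg_like` here as hypothetical stand-ins for HDFS and PostgreSQL) and joins across them in a single statement. The column names follow TPC-H; the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each ATTACH of ':memory:' creates a separate in-memory database.
conn.execute("ATTACH DATABASE ':memory:' AS hdfs_like")
conn.execute("ATTACH DATABASE ':memory:' AS pg_like")

conn.executescript("""
    CREATE TABLE hdfs_like.lineitem (l_orderkey INTEGER, l_quantity INTEGER);
    CREATE TABLE pg_like.orders (o_orderkey INTEGER, o_orderstatus TEXT);
    INSERT INTO hdfs_like.lineitem VALUES (1, 5), (1, 7), (2, 3);
    INSERT INTO pg_like.orders VALUES (1, 'F'), (2, 'O');
""")

# One query spans both databases; the caller does not move any data by hand.
rows = conn.execute("""
    SELECT o.o_orderstatus, SUM(l.l_quantity)
    FROM hdfs_like.lineitem AS l
    JOIN pg_like.orders AS o ON l.l_orderkey = o.o_orderkey
    GROUP BY o.o_orderstatus
    ORDER BY o.o_orderstatus
""").fetchall()
```

In a federated engine the same pattern applies across genuinely different systems, with each table registered against the storage that holds it.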

Page 47: Introduction to Apache Tajo: Future of Data Warehouse

Apache Tajo

● Is excellent for both long-running ETL jobs and exploratory ad-hoc analysis
● Is very fast
● Supports query federation on diverse data sources

Page 48: Introduction to Apache Tajo: Future of Data Warehouse

Get Involved!

● We are recruiting contributors!
● General
  ○ http://tajo.apache.org/
● Getting Started
  ○ http://tajo.apache.org/docs/current/getting_started.html
● Downloads
  ○ http://tajo.apache.org/downloads.html
● Issue tracker
  ○ http://issues.apache.org/jira/browse/TAJO
● Join the mailing list
  ○ [email protected]
  ○ [email protected]

Page 50: Introduction to Apache Tajo: Future of Data Warehouse

Q & A
