Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop

Apache Tajo: A Big Data Warehouse System on Hadoop, Hyunsik Choi, Director of Research, Gruter, Big Data Camp LA 2014

Upload: gruter-corp

Post on 10-Sep-2014


DESCRIPTION

Apache Tajo is an open source big data warehouse system on Hadoop. This slide deck was presented at Big Data Camp LA 2014. It introduces Apache Tajo and describes the current status of the project, including cost-based optimization and the currently supported SQL feature set.

TRANSCRIPT

Page 1: Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop

Apache Tajo: A Big Data Warehouse System on Hadoop

Hyunsik Choi

Director of Research, Gruter

Big Data Camp LA 2014

Page 2

Talk Outline

• Introduction to Apache Tajo

• What you can do with Tajo

• Why you should use Tajo

• Current Status of Tajo Project

• Demonstration

Page 3

About Me

• Hyunsik Choi (pronounced “Hyeon-shick Cheh”)
• PhD (Computer Science & Engineering, 2013), Korea Univ.
• Director of Research, Gruter Corp
• Open-source Involvement
  – Full-time contributor to Apache Tajo (2013.6 ~ )
  – Apache Tajo PMC member and committer (2013.3 ~ )
  – Apache Giraph PMC member and committer (2011.8 ~ )
• Contact Info
  – Email: [email protected]
  – LinkedIn: http://linkedin.com/in/hyunsikchoi/

Page 4

Apache Tajo

• Open-source “SQL-on-H” “Big DW” system

• Apache Top-level project since March 2014

• Supports SQL standards

• Low latency, long running batch queries

• Features
  – Supports joins (inner and all outer), group by, and sort
  – Window function
  – Most SQL data types supported (except for Decimal)
• Recent 0.8.0 release
  – https://blogs.apache.org/tajo/entry/apache_tajo_0_8_0

Page 5

Overall Architecture

Page 6

What You Can Do with Tajo

• Batch queries
  – Long-running queries (~ hours)
    • Dynamic Scheduling
    • Fault Tolerance
  – ETL workloads
• Interactive Ad-hoc Queries
  – Very low latency (100 ms ~)
  – A few seconds on a dataset of several TB if your cluster capacity is sufficient

Page 7

Why You Should Use Tajo

• SQL Standards
  – Non-standard features from PostgreSQL and Oracle
• Simple Installation and Operation
  – http://tajo.apache.org/docs/0.8.0/getting_started.html
• Simple Software Stack Requirement
  – No MapReduce and no Tez
  – YARN is supported but not mandatory
  – Tajo + Linux system for a single-node cluster
  – Tajo + HDFS for a distributed cluster

Page 8

Why You Should Use Tajo

• Mature SQL Feature Set
  – Fully distributed query execution
    • Inner join, and left/right/full outer join
    • Group by, sort, multiple distinct aggregation, window function
  – SQL data types
    • CHAR, BOOL, INT, BIGINT, REAL, DOUBLE, and TEXT
    • TIMESTAMP, DATE, TIME, and INTERVAL
    • DECIMAL (in progress)
  – Various file formats
    • Text file (CSV), RCFile, Parquet (flat schema), and Avro (flat schema)

Page 9

Why You Should Use Tajo

• Fully community-driven open source

• Stable development team
  – 5 full-time contributors + many other contributors
• Performance and speed
  – Faster than Hive 0.10 (1.5 to 10 times)
  – Tajo vs. Hive 0.13?
  – Tajo vs. Impala?

Page 10

Why You Should Use Tajo

• Integration with Hadoop Ecosystem
  – Hadoop 2.2.0 to 2.4.0 support
  – Can connect to the Hive Metastore
  – Directly processes tables managed by Hive
  – YARN support (backport)
    • Enables Tajo to deploy and run on a YARN cluster
    • Allows users to add/remove cluster nodes to/from a Tajo cluster at runtime
    • Contributed by Min Zhou (committer), LinkedIn engineer
    • https://github.com/coderplay/tajo-yarn

Page 11

Current Status – Overall

• In beta stage: the majority of key features are nearly ready
• Most SQL features implemented
• Working on hundreds of clusters for production
  – Collaboration with the biggest telco in S. Korea
• We’ve just started work on low-level optimization.
  – Runtime byte code generation (v0.9)
  – Unsafe-based hash table for hash aggregation/join
  – Vectorized execution engine

Page 12

Current Status – Logical Plan Optimizer

• Basic Rewrite Rule
  – Common subexpression elimination
  – Constant folding (CF) and null propagation
• Projection Push Down (PPD)
  – push expressions down to the lowest possible operators
  – narrow the set of columns read
  – remove duplicated expressions
    • if some expressions share a common subexpression
• Filter Push Down (FPD)
  – reduce the number of rows to be processed as early as possible
• Extensible Rewrite Rule
  – Allow developers to write their own rewrite rules
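
As an illustration of the constant folding rule listed above, here is a minimal Python sketch. It is not Tajo's optimizer code: the `Const`, `Column`, and `Mul` node types and the `fold_constants` function are hypothetical names invented for this example. It only shows the idea of collapsing a constant-only subtree into a single literal before execution.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Mul:
    left: "Expr"
    right: "Expr"


# Hypothetical expression tree type used only in this sketch.
Expr = Union[Const, Column, Mul]


def fold_constants(expr: Expr) -> Expr:
    """Recursively replace constant-only subtrees with a single Const."""
    if isinstance(expr, Mul):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Const) and isinstance(right, Const):
            # Both operands are known at planning time: evaluate once.
            return Const(left.value * right.value)
        return Mul(left, right)
    # Constants and column references are already in their simplest form.
    return expr


# sum_price * (1.2 * 0.3): the inner product does not depend on any row.
original = Mul(Column("sum_price"), Mul(Const(1.2), Const(0.3)))
folded = fold_constants(original)
print(folded)
```

After folding, the multiplication by the constant pair is evaluated once during planning instead of once per row.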

Page 13

Current Status – Logical Plan Optimizer

Original:

  SELECT item_id, order_id, sum_price * (1.2 * 0.3) AS total
  FROM (
    SELECT item_id, order_id, sum(price) AS sum_price
    FROM ITEMS
    GROUP BY item_id, order_id
  ) a
  WHERE item_id = 17234

Rewritten:

  SELECT item_id, order_id, sum(price) * 0.36 AS total
  FROM ITEMS
  WHERE item_id = 17234
  GROUP BY item_id, order_id

• CF + PPD: sum_price * (1.2 * 0.3) becomes sum(price) * 0.36
• FPD: the filter item_id = 17234 is applied before the aggregation
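
The equivalence of the two query forms can be checked on a toy table. The sketch below uses Python's built-in `sqlite3` module as a stand-in engine, not Tajo; the sample rows, the `same_rows` helper, and the added `ORDER BY` (used only to make the comparison deterministic) are assumptions of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (item_id INTEGER, order_id INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(17234, 1, 10.0), (17234, 1, 5.0), (17234, 2, 20.0), (99, 3, 7.0)],
)

# Filter applied after the aggregation, constant expression left unfolded.
original = """
SELECT item_id, order_id, sum_price * (1.2 * 0.3) AS total
FROM (SELECT item_id, order_id, sum(price) AS sum_price
      FROM items GROUP BY item_id, order_id) a
WHERE item_id = 17234
ORDER BY order_id
"""

# Constant folded, projection merged, filter pushed below the aggregation.
rewritten = """
SELECT item_id, order_id, sum(price) * 0.36 AS total
FROM items
WHERE item_id = 17234
GROUP BY item_id, order_id
ORDER BY order_id
"""


def same_rows(a, b, tol=1e-9):
    """Compare result sets, allowing for floating-point rounding in totals."""
    return len(a) == len(b) and all(
        x[:2] == y[:2] and abs(x[2] - y[2]) < tol for x, y in zip(a, b)
    )


rows_original = conn.execute(original).fetchall()
rows_rewritten = conn.execute(rewritten).fetchall()
print(same_rows(rows_original, rows_rewritten))
```

The rewritten form aggregates only the rows for item 17234, whereas the original aggregates every item and discards most groups afterwards.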

Page 14

Current Status – Logical Plan Optimizer

• Cost-based Join Order (since v0.2)
  – No need to guess the right join order anymore
  – Greedy heuristic algorithm
    • Results in a bushy join tree instead of a left-deep join tree

Figure: Left-deep join tree vs. bushy join tree
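
A greedy join-ordering heuristic of this general kind can be sketched as follows. This is not Tajo's implementation: the cost model (estimated output rows = left rows × right rows × selectivity), the `greedy_join_order` and `pair_selectivity` functions, and the table statistics are all assumptions made for illustration. At each step the cheapest joinable pair of subplans is merged, which can produce a bushy tree.

```python
from itertools import combinations


def pair_selectivity(left_tables, right_tables, selectivity):
    """Combined selectivity of all predicates linking the two sides, or None."""
    result = None
    for a in left_tables:
        for b in right_tables:
            sel = selectivity.get(frozenset((a, b)))
            if sel is not None:
                result = sel if result is None else result * sel
    return result


def greedy_join_order(cardinalities, selectivity):
    """Repeatedly join the pair of subplans with the smallest estimated output."""
    # Each plan is (join tree, set of base tables, estimated rows).
    plans = [(name, frozenset([name]), rows) for name, rows in cardinalities.items()]
    while len(plans) > 1:
        best = None
        for left, right in combinations(plans, 2):
            sel = pair_selectivity(left[1], right[1], selectivity)
            if sel is None:
                continue  # no join predicate: avoid cross products
            rows = left[2] * right[2] * sel
            if best is None or rows < best[0]:
                best = (rows, left, right)
        if best is None:
            raise ValueError("join graph is disconnected")
        rows, left, right = best
        plans.remove(left)
        plans.remove(right)
        plans.append(((left[0], right[0]), left[1] | right[1], rows))
    return plans[0][0]


# Hypothetical statistics for a chain query A - B - C - D.
cardinalities = {"A": 1000, "B": 1000, "C": 2000, "D": 2000}
selectivity = {
    frozenset(("A", "B")): 0.001,
    frozenset(("B", "C")): 0.01,
    frozenset(("C", "D")): 0.001,
}
tree = greedy_join_order(cardinalities, selectivity)
print(tree)
```

With these numbers the two cheap joins (A with B, C with D) are chosen first and their results are joined last, so neither input of the final join is a base table. A left-deep plan would instead join one base table at a time onto a single growing intermediate result.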

Page 15

Current Status – Window Function

• OVER clause
  – row_number() and rank()
  – Aggregation function support
  – PARTITION BY and ORDER BY clauses

  SELECT depname, empno, salary, enroll_date
  FROM (
    SELECT depname, empno, salary, enroll_date,
           rank() OVER (PARTITION BY depname ORDER BY salary DESC, empno) AS pos
    FROM empsalary
  ) AS ss
  WHERE pos < 3;
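
To make the semantics of the query above concrete, the following pure-Python sketch computes the same result on a small in-memory table. The sample rows and the `top_ranked` function are invented for this example, and `enroll_date` is omitted for brevity; it illustrates what `rank() OVER (PARTITION BY ... ORDER BY ...)` means, not how Tajo evaluates it.

```python
from itertools import groupby

# Hypothetical sample data: (depname, empno, salary).
empsalary = [
    {"depname": "develop", "empno": 10, "salary": 5200},
    {"depname": "develop", "empno": 11, "salary": 5200},
    {"depname": "develop", "empno": 7, "salary": 4200},
    {"depname": "develop", "empno": 8, "salary": 6000},
    {"depname": "sales", "empno": 1, "salary": 5000},
    {"depname": "sales", "empno": 3, "salary": 4800},
    {"depname": "sales", "empno": 4, "salary": 4800},
]


def top_ranked(rows, limit=3):
    """Rows whose rank within their department is below `limit`."""
    # PARTITION BY depname ORDER BY salary DESC, empno
    ordered = sorted(rows, key=lambda r: (r["depname"], -r["salary"], r["empno"]))
    result = []
    for _, group in groupby(ordered, key=lambda r: r["depname"]):
        rank, prev_key = 0, None
        for position, row in enumerate(group, start=1):
            key = (row["salary"], row["empno"])
            if key != prev_key:
                # rank() repeats for ties on the ORDER BY key and skips ahead after them.
                rank, prev_key = position, key
            if rank < limit:
                result.append((row["depname"], row["empno"], row["salary"]))
    return result


top_two = top_ranked(empsalary)
print(top_two)
```

Because `empno` is part of the ordering key, two employees with the same salary get different ranks here, so the query returns exactly the two best-paid employees of each department.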

Page 16

Current Status – Join

• Join
  – NATURAL, INNER, OUTER (LEFT, RIGHT, FULL)
  – SEMI, ANTI join (planned for v0.9)
• Join Predicates
  – WHERE and ON predicates
  – De facto standard outer join behavior with both predicates

  SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num WHERE t2.value = 'xxx';

  SELECT * FROM t1 LEFT JOIN t2 WHERE t1.num = t2.num AND t2.value = 'xxx';
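
The difference between ON and WHERE predicates in an outer join is easy to see on a small example. The sketch below runs on Python's built-in `sqlite3` module as a stand-in, not on Tajo; the table contents are invented, and the second query (predicate moved into the ON clause) is added here for contrast and does not appear on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t1 (num INTEGER);
    CREATE TABLE t2 (num INTEGER, value TEXT);
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (1, 'xxx'), (2, 'yyy');
    """
)

# The WHERE predicate runs after the join, so null-extended rows are dropped.
where_filtered = conn.execute(
    """
    SELECT t1.num, t2.value
    FROM t1 LEFT JOIN t2 ON t1.num = t2.num
    WHERE t2.value = 'xxx'
    ORDER BY t1.num
    """
).fetchall()

# The same predicate in ON only restricts matches; every t1 row is preserved.
on_filtered = conn.execute(
    """
    SELECT t1.num, t2.value
    FROM t1 LEFT JOIN t2 ON t1.num = t2.num AND t2.value = 'xxx'
    ORDER BY t1.num
    """
).fetchall()

print(where_filtered)
print(on_filtered)
```

In the first query only the row with num = 1 survives; in the second, rows 2 and 3 are kept with a NULL value.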

Page 17

Current Status – Table Partitions

• Column Value Partition
  – Hive-compatible partition
• Range Partition (planned for 1.0)
  – Table will be partitioned by disjoint ranges.
  – Will remove the partition granularity problem of Hive partitions

  CREATE TABLE T1 (C1 INT, C2 TEXT) USING PARQUET
  WITH ('parquet.compression' = 'SNAPPY')
  PARTITION BY COLUMN (C3 INT, C4 TEXT);
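
Hive-style column value partitioning stores each distinct combination of partition column values in its own `key=value` directory. The Python sketch below shows how such paths are derived from a row; the `partition_path` function, the `/warehouse/t1` root, and the sample values are hypothetical, and the exact directory layout Tajo writes may differ in detail.

```python
from pathlib import PurePosixPath


def partition_path(table_root, partition_columns, row):
    """Build the key=value directory path that a row would be written under."""
    parts = [f"{col}={row[col]}" for col in partition_columns]
    return PurePosixPath(table_root, *parts)


# Hypothetical rows for a table partitioned by (c3, c4).
rows = [
    {"c1": 1, "c2": "a", "c3": 2014, "c4": "LA"},
    {"c1": 2, "c2": "b", "c3": 2014, "c4": "LA"},
    {"c1": 3, "c2": "c", "c3": 2013, "c4": "NY"},
]

# Group the non-partition columns by their target directory.
layout = {}
for row in rows:
    path = str(partition_path("/warehouse/t1", ["c3", "c4"], row))
    layout.setdefault(path, []).append((row["c1"], row["c2"]))

for path, contents in sorted(layout.items()):
    print(path, contents)
```

A query that filters on `c3` or `c4` only needs to read the matching directories. The granularity problem mentioned above arises because every distinct value gets its own directory, which range partitioning is meant to avoid.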

Page 18

Future Work

• Multi-tenant Scheduler (v0.9)
  – Support multiple users and multiple queries
• Runtime byte code generation for expressions (v0.9)
  – Eliminate the interpretation overhead of expression evaluation
• Authentication and SQL Standard Access Control
• JIT-based Vectorized Processing Engine
  – Refer to the Hadoop Summit 2014 slides (http://goo.gl/jWghhp)
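
The motivation for runtime code generation can be sketched in Python, with the built-in `compile` standing in for bytecode generation. This is an analogy only, not Tajo's design: the tuple-based expression format and the `interpret`, `to_source`, and `compile_expr` functions are invented for this example. The interpreter walks the expression tree for every row, while the generated function is built once and then called directly.

```python
OPERATORS = {"mul": "*", "add": "+"}


def interpret(expr, row):
    """Evaluate an expression tree by walking it for each row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "const":
        return expr[1]
    if kind == "mul":
        return interpret(expr[1], row) * interpret(expr[2], row)
    if kind == "add":
        return interpret(expr[1], row) + interpret(expr[2], row)
    raise ValueError(f"unknown node kind: {kind}")


def to_source(expr):
    """Translate the tree into Python source text for a single expression."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "const":
        return repr(expr[1])
    if kind in OPERATORS:
        return f"({to_source(expr[1])} {OPERATORS[kind]} {to_source(expr[2])})"
    raise ValueError(f"unknown node kind: {kind}")


def compile_expr(expr):
    """Generate and compile a function once; no tree walk happens per row."""
    code = compile(f"lambda row: {to_source(expr)}", "<generated>", "eval")
    return eval(code)


# sum_price * 0.36 + fee, with hypothetical column names.
expr = ("add", ("mul", ("col", "sum_price"), ("const", 0.36)), ("col", "fee"))
evaluate = compile_expr(expr)

row = {"sum_price": 15.0, "fee": 1.0}
print(interpret(expr, row), evaluate(row))
```

Both paths return the same value; the generated version avoids the per-row dispatch on node kinds.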

Page 19

Get Involved!

• We are recruiting contributors!

• General
  – http://tajo.apache.org
• Getting Started
  – http://tajo.apache.org/docs/0.8.0/getting_started.html
• Downloads
  – http://tajo.apache.org/docs/0.8.0/getting_started/downloading_source.html
• Jira (Issue Tracker)
  – https://issues.apache.org/jira/browse/TAJO
• Join the mailing list
  – [email protected]
  – [email protected]

Page 20

Demonstration