Introduction to Apache Tajo

Apache Top-level Project

○ Big data warehouse system

■ ANSI-SQL compliant

■ Mature SQL features

● Various join types and window functions

○ Rapid query execution with its own distributed DAG engine

■ Low-latency queries and long-running batch queries in a single system

■ Fault tolerance

○ Beyond SQL-on-Hadoop

■ Supports various types of storage

Fast and Efficient

Fully distributed SQL query processing engine

Advanced query optimization such as cost-based and progressive query optimization

Interactive analysis on reasonably sized data sets

Scalable

Fault tolerance and dynamic scheduling for long-running queries

Out-of-core algorithms for data sets larger than main memory
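As an illustration of what "out-of-core" means, the following is a minimal sketch of an external merge sort, a textbook out-of-core technique. It is not taken from Tajo's code base. The class name `ExternalSortSketch`, the method `sort`, and the parameter `maxLinesInMemory` are hypothetical names chosen for this example. The idea is to sort chunks that fit in memory, spill each sorted chunk to disk as a "run", and then merge the runs while holding only one line per run in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of external merge sort; not Tajo's actual implementation.
class ExternalSortSketch {

    // One open sorted run on disk plus the line currently at its head.
    private static final class Run {
        final BufferedReader reader;
        String head;

        Run(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }

        void advance() throws IOException {
            head = reader.readLine();
        }
    }

    // Sorts the lines of `input` into `output`. During the first phase at most
    // `maxLinesInMemory` lines (assumed >= 1) are buffered in memory at once.
    static void sort(Path input, Path output, int maxLinesInMemory) throws IOException {
        List<Path> runFiles = new ArrayList<>();

        // Phase 1: read fixed-size chunks, sort each in memory, spill it to a temp file.
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>(maxLinesInMemory);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == maxLinesInMemory) {
                    runFiles.add(spill(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                runFiles.add(spill(buffer));
            }
        }

        // Phase 2: k-way merge of the sorted runs using a min-heap keyed on each run's head line.
        PriorityQueue<Run> heap = new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        List<Run> opened = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (Path runFile : runFiles) {
                Run run = new Run(Files.newBufferedReader(runFile));
                opened.add(run);
                if (run.head != null) {
                    heap.add(run);
                }
            }
            while (!heap.isEmpty()) {
                Run smallest = heap.poll();
                out.write(smallest.head);
                out.newLine();
                smallest.advance();
                if (smallest.head != null) {
                    heap.add(smallest);
                }
            }
        } finally {
            // Close readers and delete the temporary run files.
            for (Run run : opened) {
                run.reader.close();
            }
            for (Path runFile : runFiles) {
                Files.deleteIfExists(runFile);
            }
        }
    }

    // Sorts the buffered lines in memory and writes them to a new temporary run file.
    private static Path spill(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path run = Files.createTempFile("run-", ".txt");
        Files.write(run, buffer);
        return run;
    }
}
```

Calling `ExternalSortSketch.sort(input, output, 2)` on a five-line file would produce three runs on disk before merging them. A production engine would size its buffers in bytes rather than lines and would typically merge in several passes when the number of runs is large.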

Compatible

ANSI/ISO SQL standard compliance

Hive MetaStore access support

JDBC driver support

Support for various file formats, such as CSV, JSON, RCFile, SequenceFile, ORC, and Parquet
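To make the JDBC point concrete, here is a minimal sketch of querying Tajo through the standard `java.sql` API. It assumes the Tajo JDBC driver jar is on the classpath and that a TajoMaster is reachable. The URL shape `jdbc:tajo://host:port/database` and the port 26002 used in the usage note are assumptions to verify against the Tajo documentation for your version. The class `TajoJdbcSketch`, its methods, and the table name passed in are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper class; only standard java.sql calls are used.
class TajoJdbcSketch {

    // Builds a JDBC URL of the assumed form jdbc:tajo://host:port/database.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:tajo://" + host + ":" + port + "/" + database;
    }

    // Runs SELECT count(*) on a trusted, caller-supplied table name and returns the count.
    // Requires the Tajo JDBC driver on the classpath and a running cluster.
    static long countRows(String url, String table) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT count(*) FROM " + table)) {
            rs.next();
            return rs.getLong(1);
        }
    }
}
```

For example, `TajoJdbcSketch.countRows(TajoJdbcSketch.jdbcUrl("localhost", 26002, "default"), "orders")` would count the rows of a hypothetical `orders` table. Because the table name is concatenated into the SQL text, it should only ever come from trusted code, never from user input.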

Easy

User-defined functions

Interactive shell

Convenient Backup/Restore utility

Asynchronous/Synchronous Java API

• Replacement of long-running ETL workloads on several TB datasets

• Lots of daily reports about user behavior

• Ad-hoc analysis on TB datasets

• Analysis of purchase history for target marketing

Introduction

Architecture

Features

Use case

Filter Scan Unions & Joins

Group by & Sort Filters, group by, having & sort

Thank you!

Abhishek Solanki

https://github.com/AbhishekSolanki

https://twitter.com/abhisolanki94
