Introduction to Apache Tajo
Apache Top-level Project
○ Big data warehouse system
■ ANSI-SQL compliant
■ Mature SQL features
● Various types of join, window functions
○ Rapid query execution with its own distributed DAG engine
■ Low-latency and long-running batch queries in a single system
■ Fault tolerance
○ Beyond SQL-on-Hadoop
■ Supports various types of storage
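The "distributed DAG engine" point above can be illustrated with a small sketch. It is not Tajo's planner or scheduler: the stage names and the `execution_waves` helper are hypothetical, and the only idea shown is that a query plan forms a directed acyclic graph whose independent stages can run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical query plan: each stage maps to the set of stages it depends on.
# Stage names are illustrative and are not taken from Tajo's planner.
plan = {
    "scan_orders": set(),
    "scan_users": set(),
    "filter_orders": {"scan_orders"},
    "join": {"filter_orders", "scan_users"},
    "group_by": {"join"},
    "sort": {"group_by"},
}

def execution_waves(dag):
    """Group stages into waves; stages within one wave have no mutual dependencies."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every stage whose inputs are finished
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = execution_waves(plan)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Here the two scans land in the same wave because neither depends on the other, while the join must wait for the filter to finish.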
Fast and Efficient
Fully distributed SQL query processing engine
Advanced query optimization such as cost-based and progressive query optimization
Interactive analysis on reasonably sized data sets
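As a rough illustration of what cost-based optimization means, the toy sketch below enumerates left-deep join orders and keeps the cheapest one. It is not Tajo's optimizer: the row counts, the fixed `SELECTIVITY`, and the cost formula are all assumptions made for the example.

```python
from itertools import permutations

# Hypothetical table statistics (row counts); not taken from any real catalog.
ROWS = {"orders": 1_000_000, "users": 50_000, "countries": 200}
SELECTIVITY = 0.001  # assumed fraction of the cross product that survives each join

def plan_cost(order):
    """Cost of a left-deep join order: total rows produced by all joins in the plan."""
    rows = ROWS[order[0]]
    cost = 0
    for table in order[1:]:
        rows = rows * ROWS[table] * SELECTIVITY  # estimated size of this join's output
        cost += rows
    return cost

def best_join_order(tables):
    """Exhaustively search all orders; feasible only for a handful of tables."""
    return min(permutations(tables), key=plan_cost)

best = best_join_order(sorted(ROWS))
print(best, plan_cost(best))
```

Under these assumed statistics the cheapest plans join the two small tables first and bring in the largest table last, which keeps the intermediate result small.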
Scalable
Fault tolerance and dynamic scheduling for long-running queries
Out-of-core algorithms for data sets larger than main memory
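An external merge sort is a classic example of an out-of-core algorithm. The sketch below is a generic illustration, not code from Tajo: `external_sort` and `chunk_size` are hypothetical names, and integers are spilled to temporary text files as sorted runs that are then merged lazily.

```python
import heapq
import os
import tempfile

def external_sort(values, chunk_size):
    """Yield ints in sorted order while buffering at most chunk_size of them at once."""
    run_paths = []
    chunk = []

    def flush():
        # Write the current in-memory chunk to disk as one sorted run.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in chunk)
        run_paths.append(path)
        chunk.clear()

    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    files = [open(p) for p in run_paths]
    try:
        # heapq.merge streams the runs, keeping one pending value per run in memory.
        readers = [(int(line) for line in f) for f in files]
        yield from heapq.merge(*readers)
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)

result = list(external_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3))
print(result)
```

The same spill-and-merge pattern underlies out-of-core sorting and aggregation in many query engines; a real engine would use a binary format and bounded merge fan-in.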
Compatible
ANSI/ISO SQL standard compliance
Hive MetaStore access support
JDBC driver support
Support for various file formats, such as CSV, JSON, RCFile, SequenceFile, ORC, and Parquet
Easy
User-defined functions
Interactive shell
Convenient Backup/Restore utility
Asynchronous/Synchronous Java API
• Replacement of long-running ETL workloads on several TB datasets
• Lots of daily reports about user behavior
• Ad-hoc analysis on TB datasets
• Analysis of purchase history for target marketing
• Introduction
• Architecture
• Features
• Use case
[Figure: query operators. Labels: Scan; Filter; Unions & Joins; Group by & Sort; Filters, group by, having & sort]