Big Data SQL Support in Apache Apex / Hadoop
SQL on Apache Apex
Chinmay Kolhatkar ([email protected])
January 24th 2017
Apache Apex - Stream Processing
Easily Operable - Exposes an easy API for developing Operators (part of an application) and Applications
Highly Scalable - Scales statically as well as dynamically
Highly Performant - Can reach single digit millisecond end-to-end latency
Fault Tolerant - Automatically recovers from failures - without manual intervention
Stateful - Guarantees that no state will be lost
Apex Malhar library
YARN Native - Uses the Hadoop YARN framework for resource negotiation
Apex Platform Overview
An Apex Application is a DAG (Directed Acyclic Graph)
A DAG is composed of vertices (Operators) and edges (Streams).
A Stream is a sequence of data tuples which connects operators at end-points called Ports.
An Operator takes one or more input streams, performs computations and emits one or more output streams.
● Each operator is the user's business logic, or a built-in operator from our open source library
● An operator may have multiple instances that run in parallel
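The operator/stream model above can be illustrated with a small plain-Java sketch. This is not the Apex API: `ToyDag`, `Operator`, `addOperator` and `run` are hypothetical names invented for illustration, and the sketch only models a linear chain of operators (the simplest kind of DAG) processing tuples in memory, with no ports, partitioning or fault tolerance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical toy model, NOT the Apex API: a linear chain of operators
// connected by in-memory streams of String tuples.
class ToyDag {
    /** A toy operator: consumes one input tuple and emits zero or more output tuples. */
    interface Operator extends Function<String, List<String>> {}

    private final List<Operator> operators = new ArrayList<>();

    /** Appends an operator; its input stream is the previous operator's output. */
    ToyDag addOperator(Operator op) {
        operators.add(op);
        return this;
    }

    /** Pushes every tuple of the input stream through the operators in order. */
    List<String> run(List<String> input) {
        List<String> stream = input;
        for (Operator op : operators) {
            List<String> next = new ArrayList<>();
            for (String tuple : stream) {
                next.addAll(op.apply(tuple));
            }
            stream = next;
        }
        return stream;
    }

    /** Example pipeline: split lines into words, keep long words, upper-case them. */
    static ToyDag example() {
        return new ToyDag()
            .addOperator(line -> Arrays.asList(line.split(" ")))
            .addOperator(word -> word.length() > 3
                ? Collections.singletonList(word)
                : Collections.<String>emptyList())
            .addOperator(word -> Collections.singletonList(word.toUpperCase()));
    }

    public static void main(String[] args) {
        // Prints [DATA]: only "data" is longer than three characters.
        System.out.println(example().run(Arrays.asList("big data sql")));
    }
}
```

In a real Apex application the platform, not a loop like `run`, moves tuples between operators, and each operator can have several parallel instances.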
Typical application example
Brief about SQL
1969 - CODASYL (network database)
1979 - First commercial SQL RDBMSs
1990 - Acceptance - transaction processing on SQL
1993 - Multi-dimensional databases
1996 - SQL EDWs
2006 - Hadoop and other “big data” technologies
2008 - NoSQL
2011 - SQL on Hadoop
2014 - Interactive analytics on {Hadoop, NoSQL, RDBMS} using SQL
SQL remains popular. Why?
“SQL on everything, in memory” by Julian Hyde, Strata NYC, Oct 16 2014
Brief about Calcite - Traditional Architecture
Brief about Calcite - Calcite Architecture
Brief about Calcite - Expression Tree
Brief about Calcite - Expression Tree (Optimized)
Apex-Calcite API
[Diagram: operator pipeline from Kafka to File. Operators: KafkaInput, CSVParser, Filter, Project, CSVFormatter, LineWriter. Streams: Lines, Words, Filtered, Projected, Formatted.]
// Register a Kafka topic as the input table, a file as the output table,
// and a user-defined function, then run a streaming SQL statement on the DAG.
SQLExecEnvironment.getEnvironment()
    .registerTable("ORDERS", new KafkaEndpoint(conf.get("broker"), conf.get("topic"),
        new CSVMessageFormat(conf.get("schemaInDef"))))
    .registerTable("SALES", new FileEndpoint(conf.get("destFolder"), conf.get("destFileName"),
        new CSVMessageFormat(conf.get("schemaOutDef"))))
    .registerFunction("APEXCONCAT", this.getClass(), "apex_concat_str")
    .executeSQL(dag, "INSERT INTO SALES " +
        "SELECT STREAM ROWTIME, FLOOR(ROWTIME TO DAY), APEXCONCAT('OILPAINT', SUBSTRING(PRODUCT, 6, 7)) " +
        "FROM ORDERS WHERE ID > 3 AND PRODUCT LIKE 'paint%'");
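The `registerFunction` call above binds the SQL name APEXCONCAT to a method called `apex_concat_str` on the application class. The slide does not show that method, so the following is only a sketch of what it might look like, assuming it is a public static method that takes two strings and returns their concatenation. The class name `SqlFunctions` and the helper `sqlSubstring` are hypothetical; the helper merely mimics `SUBSTRING(PRODUCT, 6, 7)` under the assumption of a 1-based start position and a length argument, so the projected value can be checked by hand.

```java
// Sketch under stated assumptions; not taken from the slides or the Apex source.
class SqlFunctions {
    /** Assumed shape of the UDF registered as APEXCONCAT: plain string concatenation. */
    public static String apex_concat_str(String s1, String s2) {
        return s1 + s2;
    }

    /**
     * Hypothetical helper imitating SQL SUBSTRING(s, start, length),
     * assuming start is 1-based and start >= 1.
     */
    static String sqlSubstring(String s, int start, int length) {
        int begin = start - 1;
        if (begin >= s.length()) {
            return "";
        }
        int end = Math.min(s.length(), begin + length);
        return s.substring(begin, end);
    }

    public static void main(String[] args) {
        // Mirrors APEXCONCAT('OILPAINT', SUBSTRING(PRODUCT, 6, 7)) for one sample product.
        String product = "paint1234567";
        // Prints OILPAINT1234567
        System.out.println(apex_concat_str("OILPAINT", sqlSubstring(product, 6, 7)));
    }
}
```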
Demo
Resources
• Apache Apex - http://apex.apache.org/
• Subscribe to forums
  ᵒ Apex - http://apex.apache.org/community.html
  ᵒ DataTorrent - https://groups.google.com/forum/#!forum/dt-users
• Download - https://datatorrent.com/download/
• Twitter
  ᵒ @ApacheApex; Follow - https://twitter.com/apacheapex
  ᵒ @DataTorrent; Follow - https://twitter.com/datatorrent
• Meetups - http://meetup.com/topics/apache-apex
• Webinars - https://datatorrent.com/webinars/
• Videos - https://youtube.com/user/DataTorrent
• Slides - http://slideshare.net/DataTorrent/presentations
• Startup Accelerator - Free full featured enterprise product
  ᵒ https://datatorrent.com/product/startup-accelerator/
• Big Data Application Templates Hub - https://datatorrent.com/apphub
We Are Hiring
• [email protected]
• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release
• Community Leaders