bringing bigapps to flink | collaborative predictive intelligence via ddf-on-flink using distributed...
TRANSCRIPT
Collaborative Predictive Intelligence
via DDF-on-Flink using Distributed DataFrame
Christopher Nguyen, PhD—CEO & Co-Founder, Arimo
Rohit Rai—CEO, Tuplejump
Bringing BigApps to Flink
@arimoinc @pentagoniac http://ddf.io
Who Are We?
What Do We Do?
What Are Adatao Big Apps?
§ Predictive: Predictive Analytics for Business Users
§ Collaborative: Real-time Collaboration with Data Scientists
The EXPLOSION
of Data & Compute engines
The CIO Challenge
[Diagram: Scala, Java, Python, and R clients (ScalaClient, JavaClient, PyClient, RClient) connected to many engines and data sources: Spark, Flink, Presto, Ignite, HDFS, S3, Redshift, BigQuery, Cassandra, RDBMS]
[Diagram: Scala, Java, Python, and R clients on top; a DDF layer over each engine and data source (Spark, Flink, Ignite, Presto, HDFS, S3, Redshift, BigQuery, Cassandra, RDBMS); labels: Data in Memory, Data at Rest, DWs, DBs, Enterprise Data Bus]
The Solution: DDF Data Integration
Benefits of DDF Data Integration
§ FOR DATA ENGINEERS
§ Unified API across data sources and engines
§ HDFS, S3, Cassandra, Redshift, BigQuery, RDBMS, Salesforce, Spark, Flink, Ignite …
§ FOR DATA SCIENTISTS
§ Uniform high-level DataFrame abstractions: ETL, ML, Streaming
Custom Apps
Adatao AppBuilder
Adatao PredictiveEngine
Arimo Predictive Intelligence Platform
Big Compute
Big Data
Big Apps
Distributed DataFrame (DDF), Open Sourced
Data Scientist, Business User, Data Engineer
Why Flink?
§ Emerging engine with unique strengths (e.g., streaming)
§ Driven by Customer & Partner conversations
Java Python R
DDF DDF DDF
Spark Flink Redshift
Spark APIs: RDD, DataFrame, DStream, …
Flink APIs: DataSet, Table, DataStream, …
Unified DDF APIs: ETL Interfaces, ML Interfaces, Streaming Interfaces
DDF: “Under the Hood”
DDF API in a Nutshell
// To start working with an engine
DDFManager manager = DDFManager.get("flink"); // or "spark"
// Then, data can be loaded into a DDF as follows:
DDF table = manager.sql2ddf("select * from airline");
// ETL, transform
table = table.transform("dist= round(distance/2, 2)");
// Run machine learning using MLlib, then run prediction
KMeansModel kmeansModel = (KMeansModel) table.ML.train("kmeans", 5, 5).getRawModel();
Int prediction = table.ML.applyModel(kmeansModel, false, true);
Lessons Learned
§ It was easy for us to implement DDF on Flink
§ Flink API close to functional collection API
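To illustrate what a "functional collection API" looks like, here is a minimal sketch using only java.util.stream, not Flink. The class name, method name, and flight data are hypothetical. Flink's DataSet API offers similarly named operators (filter, map, groupBy), which is presumably the closeness the slide refers to.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical example class; uses plain Java streams, not Flink.
final class CollectionStylePipeline {

    // Each flight is {carrier, distance}. Keep flights longer than 500,
    // then count them per carrier: filter -> map -> group -> count.
    static Map<String, Long> countLongFlightsByCarrier(List<String[]> flights) {
        return flights.stream()
                .filter(f -> Integer.parseInt(f[1]) > 500)
                .map(f -> f[0])
                .collect(Collectors.groupingBy(c -> c, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String[]> flights = Arrays.asList(
                new String[] {"AA", "700"},
                new String[] {"UA", "300"},
                new String[] {"AA", "900"},
                new String[] {"DL", "650"});
        System.out.println(countLongFlightsByCarrier(flights));
    }
}
```

The same chain of small transformations is the shape a port to a distributed engine would keep; only the collection type underneath changes.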
Lessons Learned
§ With DDF, it’s easy to port applications on DDF from one engine to another
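The portability claim can be sketched in plain Java. This is not the DDF API: MiniFrame, EagerFrame, LazyFrame, and PortableApp are hypothetical names standing in for an engine-neutral abstraction and two engines behind it. The point is only that application logic written against the shared interface does not change when the backing engine does.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

// Hypothetical engine-neutral frame; NOT the real DDF interface.
interface MiniFrame {
    MiniFrame mapColumn(DoubleUnaryOperator f);
    double mean();
}

// Hypothetical "engine A": applies each transformation eagerly.
final class EagerFrame implements MiniFrame {
    private final double[] values;

    EagerFrame(double... values) { this.values = values.clone(); }

    @Override public MiniFrame mapColumn(DoubleUnaryOperator f) {
        return new EagerFrame(Arrays.stream(values).map(f).toArray());
    }

    @Override public double mean() {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }
}

// Hypothetical "engine B": records transformations and runs them in mean().
final class LazyFrame implements MiniFrame {
    private final double[] source;
    private final DoubleUnaryOperator pending;

    LazyFrame(double... source) { this(source.clone(), DoubleUnaryOperator.identity()); }

    private LazyFrame(double[] source, DoubleUnaryOperator pending) {
        this.source = source;
        this.pending = pending;
    }

    @Override public MiniFrame mapColumn(DoubleUnaryOperator f) {
        return new LazyFrame(source, pending.andThen(f));
    }

    @Override public double mean() {
        return Arrays.stream(source).map(pending).average().orElse(Double.NaN);
    }
}

final class PortableApp {
    // Application logic written once, against the interface only.
    static double halfDistanceMean(MiniFrame distances) {
        return distances.mapColumn(d -> d / 2).mean();
    }

    public static void main(String[] args) {
        System.out.println(halfDistanceMean(new EagerFrame(100.0, 200.0, 300.0)));
        System.out.println(halfDistanceMean(new LazyFrame(100.0, 200.0, 300.0)));
    }
}
```

Switching engines here means changing one constructor call, which mirrors the one-string change in DDFManager.get("flink") versus "spark" shown on the API slide.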
Lessons Learned
§ There’s now an opportunity to use Flink for interactive applications
§ Backtracking scheduler, session management, better graph analysis
Lessons Learned
§ Null/missing value handling in Flink
§ Null value support needed in RowSerializer
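One common way a row serializer can support missing values is to write a null marker before each field. The sketch below shows that idea with java.io only. It is an illustrative assumption, not Flink's RowSerializer and not the fix the speakers applied; NullableRowCodec is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical codec illustrating per-field null markers.
final class NullableRowCodec {

    // Layout: field count, then for each field a boolean null marker
    // followed by the value only when the field is present.
    static byte[] write(Double[] row) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(row.length);
            for (Double field : row) {
                out.writeBoolean(field == null);
                if (field != null) {
                    out.writeDouble(field);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Double[] read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            Double[] row = new Double[in.readInt()];
            for (int i = 0; i < row.length; i++) {
                boolean isNull = in.readBoolean();
                row[i] = isNull ? null : Double.valueOf(in.readDouble());
            }
            return row;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Double[] row = {1.5, null, 3.0};
        Double[] copy = read(write(row));
        System.out.println(java.util.Arrays.toString(copy));
    }
}
```

Without the marker, a serializer that writes every field unconditionally has no way to represent a missing value, which is the gap the slide points at.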
Lessons Learned
§ Map vs MapPartitions vs Accumulators
§ Map for aggregations can cause a lot of object creation overhead
§ Accumulators may fail for huge datasets
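The object-creation point can be sketched without Flink by treating a list of arrays as partitions. PartitionAggregation and its methods are hypothetical names. The per-record version allocates a new {sum, count} pair for every element, while the per-partition version keeps one accumulator per partition, which is the pattern that Flink's mapPartition operator makes possible.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical example; partitions are simulated as a List of double[].
final class PartitionAggregation {

    // Per-record style: wrap every value in a fresh {sum, count} pair and
    // merge pairs, so N records create on the order of N short-lived arrays.
    static double[] perRecord(List<double[]> partitions) {
        return partitions.stream()
                .flatMapToDouble(p -> Arrays.stream(p))
                .mapToObj(v -> new double[] {v, 1})
                .reduce(new double[] {0, 0},
                        (a, b) -> new double[] {a[0] + b[0], a[1] + b[1]});
    }

    // Per-partition style: one mutable accumulator per partition, then a
    // single merge per partition into the total.
    static double[] perPartition(List<double[]> partitions) {
        double[] total = {0, 0};
        for (double[] partition : partitions) {
            double[] local = {0, 0};
            for (double v : partition) {
                local[0] += v;
                local[1] += 1;
            }
            total[0] += local[0];
            total[1] += local[1];
        }
        return total;
    }

    public static void main(String[] args) {
        List<double[]> partitions =
                Arrays.asList(new double[] {1, 2, 3}, new double[] {4, 5});
        System.out.println(Arrays.toString(perRecord(partitions)));
        System.out.println(Arrays.toString(perPartition(partitions)));
    }
}
```

Both methods return the same {sum, count}; they differ only in how many intermediate objects are created along the way.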
Lessons Learned
§ Use caution when doing array copy overs in Table API
DDF: Where is it heading?
§ More Engines: DBs & DWs: BigQuery, Cassandra, Teradata, Presto, Ignite
§ Enterprise Databus to seamlessly move data across sources
§ Richer APIs
Get Started with DDF
§ Increase your productivity & build engine-agnostic apps
• Build your analytics apps on existing modules
• Flink, Spark, JDBC
§ Expand possibilities. Contribute to DDF
• Enrich existing plugins: Data APIs, ML APIs...
• Add new DDF plugins:
• BigQuery, Cassandra
• Marketo
• Ignite, Presto
§ Spread the word!
www.ddf.io/gettingstarted