hybrid transactional/analytics processing with spark and imdgs
TRANSCRIPT
1
Hybrid Transactional/Analytics Processing with Spark and In-Memory Data Grids
Copyright © GigaSpaces 2016. All rights reserved.
Ali HodrojVP, Products and Strategy
2
About me
• Vice President, Products and Strategy @ GigaSpaces• (ex) Director of Solutions Architecture• Blogging at http://blog.gigaspaces.com• @ahodroj
• Email: [email protected]
• Slides at http://slideshare.com/ahodroj
3
About GigaSpaces
Direct customers500+
HeadquartersNew York, NY
Established2001
4
Do we need to bridge online transaction
processing with real-time operational
intelligence?
5
Modern applications: the line is blurred between…
Transactional Analytical
Essential to operate the business
Turning data into value:insights, diagnosis,
decision making
&
6
Stories from the real #enterprise world...
Hyper-personalizationand
Omni-channel
10
Minimize Latency + Strong Consistency
Maximize Data-Analytics Locality
Goal:
11
There’s a name for it...
12
13
In-Memory Data Store?
In-Memory Computing 101
Distribute CachePartitioned cache
nodes
In-Memory Data Grid
Scale-out systemof record
Increased Capacity
No support for write-heavy scenarios
Limited to ID-based reads
Reads are the only part latency path
In-Memory DatabaseScale-up system of record
Heavy Read/Write – sharded/partitioned architecture
Horizontally scalable on commodity HW (or cloud)
Serves as system of record with querying & transaction semantics
Requires modifying your application’s data access layer
Distribute CachePartitioned cache
nodes
In-Memory Data Grid
Scale-out systemof record
In-Memory DatabaseScale-up system of record
In-Memory Computing 101
Read/Write Scalability
Drop-in SQL database replacement
Often lacks horizontal scalability (Joins)
Requires replacing your database
Distribute CachePartitioned cache
nodes
In-Memory Data Grid
Scale-out systemof record
In-Memory DatabaseScale-up system of record
In-Memory Computing 101
IMDG Data Models
@SpaceClasspublic class Product{ private String name; private String brand; private Integer quantity;
// … }
IMDG Data Placement – Fixed Hashing
hash(key) % #nodes
IMDG Fixed Hashing - HA
hash(key) % #nodes
21
The database goes to the background
Partition your data and store it in memory
In-Memory Data Grids: How it works
http://xap.github.io
22
Partitioned, co-located in-memory messaging
In-Memory Data Grids: How it works
http://xap.github.io
23
Business logic, data & messaging co-located & partitioned into processing units
In-Memory Data Grids: How it works
http://xap.github.io
24
Hot backup for each partition for high availability
In-Memory Data Grids: How it works
http://xap.github.io
25
Host your web application on the XAP infrastructure
In-Memory Data Grids: How it works
http://xap.github.io
26
Auto-scale out & in based on real-time performance & load
In-Memory Data Grids: How it works
http://xap.github.io
28
Host Cisco UCS Server
CPU Intel 16core 2.9GHz
Concurrent Threads 2
Throughput 200, 400, 800 ops/sec
29
WHAT’STHE RIGHT DATA STORE TO CHOOSE?
30
● Nope: Your data sources and applications are often distributed.
● In-Memory or not, these databases aren’t built for horizontal scale-out
Approach Challenge
Just an IMDB Thing….
Shove it all in one “Big Iron”?
31
● Not when your apps requires polyglot analytics
● Unless you want to write ML algorithms, MDX engines…etc from scratch
Approach Challenge
One large In-Memory Data Grid to Rule them all?
32
What we needed
Low-latency Scale-Out In-Memory Data Grid
Large-scale distributed analytics framework
Maximize Data-Analytics Locality
Minimize Application Latency
33
Our approach to HTAP
Low-latency Scale-Out In-Memory Data Grid
Large-scale distributed analytics
framework
+
34
SPARK?So why did we bet on
35
• Unified & Concise API
• Highly Flexible Data Store Integration
• Massive Community and Adoption
36
But Spark is already in-memory!
37
Spark is caching over <insert your data store>,
not an in-memory system of record
38
APACHE SPARKFIT INTO THIS?
How does
39
In-Memory Data Grid
In-Memory Store(RAM) Flash, SSD, Off-Heap Store
Spark Spark SQL Spark Steaming Machine Learning High availability
Security & M
anagement
InsightEdge CoreBuilding out the driver
Transactional TierACID-compliantStrong Consistency
Analytics Tier
40
Data Grid + Spark Deployment Layout
node 1
Spark master
Gridmaster
node 2
Spark worker
Gridworker
node 3
Spark worker
Gridworker
41
•List of parent RDDs – Empty •An array of partitions that a dataset is divided to – IMDG Distributed Query
to get partitions and their hosts
•A compute function to do a computation on partitions – Iterator over portion of data
•Optional preferred locations, i.e. hosts for a partition where the data will be loaded – hosts from Distributed Query
Data Grid RDD: resilient distributed dataset
42
node 1
Spark executor
Data Grid RDD: one-to-one partition
Spark Partition
#1
GridPartition #1
Direct connection
Simple, but not enough parallelism for Spark
node 2
Spark executor
Spark Partition
#2
GridPartition #2
node 3
Spark executor
Spark Partition
#3
GridPartition #3
43
node 1
Spark Executor
Grid Primary #1
Data Grid RDD: with bucketing
0..
1..
2..
3..
4..
5..
.
.
.
.
.
.
.
.
.
.
Spark Partition #1
1023
1 Spark partition = M grid buckets
1 Grid partition = N Spark partitions
Spark Partition #2
Spark Partition #1
44
Grid DataFrames: predicates pushdown & columns pruning
Aggregation in Spark
Filtering and columns pruning
in Data Grid
SELECT SUM(amount)
FROM orderWHERE city = ‘NY’ AND year > 2012
Spark SQL architecture:
• Pushing down predicates to Data Grid• Leveraging indexes• Transparent to user• Enabling support for other languages - Python/R
Implementing DataSource API
45
Push-down Predicates
performance
Traditional Spark filtering of 7MM records
Grid-side + Spark filtering of 7MM records
31seconds
1second
vs
46
Eventually, we productized this as an open source Spark distribution
@InsightEdgeIO http://insightedge.io
Apache 2 License
http://insightedge.io/docs
http://insightedge.io/blog
http://github.com/InsightEdge
GigaSpaces InsightEdgehttp://insightedge.io
High Performance Spark with OLTP Capabilities
49
ADDITIONALINNOVATIONS
50
Spark GeoSpatial SQL and Data Frames
51
Multi-Spark Replication / Federated Clusters
In-Memory Replication across local or wide area networks
upcoming: Spark RDD/DF native read/save on Off-Heap (SSD/Flash/Direct Buffers)
Application
Processing
Primaryinstance
s
Backupinstance
s
SyncReplicati
on
StorageArray
StorageArray
In Memory Data Grid
Spark worker
Spark worker
• Significant RAM TCO reduction in Spark clusters
• Direct RDD/DataFrame read write from SSD/Flash device
• Avoid Filesystem hops and write amplification
53
REFERENCEARCHITECTURES
5454
In-Process HTAP
Read any POJO, JSON Document, or Transaction as a DataFrame or RDD
Web services/apps can read any DataFrame as POJO
True closed-loop analytics data pipeline
@SpaceClasspublic class Product{ private String name; private String brand; private Integer quantity;
// … }
5555
In-Memory Data Grid Realtime Replication
• Scoring models• Trigger actions• Events
Transactions Analytics
Point of Decision HTAPXAP + InsightEdge deployed on different grid clusters with bi-directional real-time data replication
5656
Case Study: Fleet Geo-analytics
Challenge
• Stream data from 1,000s of Taxis
• Actively monitor and generate real-time notifications
• Real-time Route Optimization and Geo-Fencing
Solution
• Leverage unified in-memory data fabric as middleware for geo-spatial analytics
• Elastically scale stream processing and transactional apps together
• Location-based tracking, Geo-fencing
Edge components
Data Sources
57
DEMO!
58
THANK YOU!QUESTIONS?