
Hybrid Transactional/Analytics Processing with Spark and In-Memory Data Grids

Copyright © GigaSpaces 2016. All rights reserved.

Ali Hodroj

VP, Products and Strategy

About me

• Vice President, Products and Strategy @ GigaSpaces

• (ex) Director of Solutions Architecture

• Blogging at http://blog.gigaspaces.com

• @ahodroj

• Email: [email protected]

• Slides at http://slideshare.com/ahodroj


About GigaSpaces

Direct customers: 500+

Headquarters: New York, NY

Established: 2001

Do we need to bridge online transaction processing with real-time operational intelligence?

Modern applications: the line is blurred between…

Transactional: essential to operate the business

&

Analytical: turning data into value (insights, diagnosis, decision making)

Stories from the real #enterprise world...


Omni-Channel Retail

https://static.pexels.com/photos/1719/city-people-walking-blur.jpg

Hyper-personalization and omni-channel

#Industrial Internet of Things

https://en.wikipedia.org/wiki/Taiwan_High_Speed_Rail


Goal:

Minimize Latency + Strong Consistency

Maximize Data-Analytics Locality

https://www.pexels.com/search/analytics/

There’s a name for it...


In-Memory Data Store?

In-Memory Computing 101

Distributed Cache: partitioned cache

• Increased capacity
• No support for write-heavy scenarios
• Limited to ID-based reads
• Covers only the read part of the latency path

In-Memory Data Grid: scale-out system of record

• Heavy read/write: sharded/partitioned architecture
• Horizontally scalable on commodity HW (or cloud)
• Serves as system of record with querying & transaction semantics
• Requires modifying your application’s data access layer

In-Memory Database: scale-up system of record

• Read/write scalability
• Drop-in SQL database replacement
• Often lacks horizontal scalability (joins)
• Requires replacing your database

IMDG Data Models

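import com.gigaspaces.annotation.pojo.SpaceClass;

// A plain Java object stored in the grid; @SpaceClass marks it as a space entry.
@SpaceClass
public class Product {
    private String name;
    private String brand;
    private Integer quantity;

    // …
}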

IMDG Data Placement – Fixed Hashing

hash(key) % #nodes

IMDG Fixed Hashing - HA

hash(key) % #nodes
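The `hash(key) % #nodes` rule on these two slides can be sketched in a few lines of Java. This is only an illustration: the class and method names are hypothetical and are not the XAP API, and the backup placement (next node in the ring) is an assumption, since the slides only say that each partition has a hot backup.

```java
/**
 * Illustrative fixed-hash routing for a partitioned grid.
 * Class and method names are hypothetical; this is not the XAP API.
 */
final class FixedHashRouter {
    private final int nodes;

    FixedHashRouter(int nodes) {
        if (nodes < 2) {
            throw new IllegalArgumentException("need at least 2 nodes to host a backup");
        }
        this.nodes = nodes;
    }

    /** hash(key) % #nodes; floorMod keeps the result in 0..nodes-1 for negative hash codes. */
    int primaryNodeFor(Object routingKey) {
        return Math.floorMod(routingKey.hashCode(), nodes);
    }

    /** Assumed placement: the hot backup sits on the next node, so it never shares a node with its primary. */
    int backupNodeFor(Object routingKey) {
        return (primaryNodeFor(routingKey) + 1) % nodes;
    }

    public static void main(String[] args) {
        FixedHashRouter router = new FixedHashRouter(3);
        for (String key : new String[] {"product-1", "product-2", "product-3"}) {
            System.out.println(key + " -> primary " + router.primaryNodeFor(key)
                    + ", backup " + router.backupNodeFor(key));
        }
    }
}
```

Because the node count is part of the formula, changing it remaps most keys, which is the usual trade-off of fixed hashing.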


In-Memory Data Grids: How it works

http://xap.github.io


The database goes to the background

Partition your data and store it in memory





Partitioned, co-located in-memory messaging





Business logic, data & messaging co-located & partitioned into processing units





Hot backup for each partition for high availability





Host your web application on the XAP infrastructure





Auto-scale out & in based on real-time performance & load





Host: Cisco UCS Server

CPU: Intel 16-core 2.9GHz

Concurrent threads: 2

Throughput: 200, 400, 800 ops/sec

WHAT’S THE RIGHT DATA STORE TO CHOOSE?

Approach:

● Just an IMDB thing…
● Shove it all in one “Big Iron”?

Challenge:

● Nope: your data sources and applications are often distributed.
● In-memory or not, these databases aren’t built for horizontal scale-out.

https://en.wikipedia.org/wiki/Amazon_DynamoDB

https://upload.wikimedia.org/wikipedia/commons/thumb/8/8f/SAP-Logo.svg/2000px-SAP-Logo.svg.png

https://upload.wikimedia.org/wikipedia/commons/thumb/3/38/AWS_Simple_Icons_Database_Amazon_RDS_Oracle_DB_Instance.svg/2000px-AWS_Simple_Icons_Database_Amazon_RDS_Oracle_DB_Instance.svg.png


Approach:

● One large In-Memory Data Grid to rule them all?

Challenge:

● Not when your apps require polyglot analytics
● Unless you want to write ML algorithms, MDX engines, etc. from scratch

What we needed

Low-latency Scale-Out In-Memory Data Grid

Large-scale distributed analytics framework

Maximize Data-Analytics Locality

Minimize Application Latency


Our approach to HTAP

Low-latency Scale-Out In-Memory Data Grid

+

Large-scale distributed analytics framework

So why did we bet on SPARK?

• Unified & Concise API

• Highly Flexible Data Store Integration

• Massive Community and Adoption


But Spark is already in-memory!


Spark is caching over <insert your data store>, not an in-memory system of record

How does APACHE SPARK FIT INTO THIS?

InsightEdge Core

Building out the driver

Analytics Tier: Spark, Spark SQL, Spark Streaming, Machine Learning

Transactional Tier (ACID-compliant, strong consistency): In-Memory Data Grid

In-Memory Store (RAM); Flash, SSD, Off-Heap Store

High availability

Security & Management

Data Grid + Spark Deployment Layout

node 1: Spark master, Grid master

node 2: Spark worker, Grid worker

node 3: Spark worker, Grid worker

Data Grid RDD: resilient distributed dataset

• List of parent RDDs – empty
• An array of partitions that a dataset is divided into – IMDG distributed query to get partitions and their hosts
• A compute function to do a computation on partitions – iterator over a portion of the data
• Optional preferred locations, i.e. hosts for a partition where the data will be loaded – hosts from the distributed query
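The four RDD elements on this slide correspond to the methods Spark's `RDD` class asks an implementation to provide (`getDependencies`, `getPartitions`, `compute`, `getPreferredLocations`). The sketch below models them in plain Java without any Spark classes. All names are hypothetical, and the host list and row lists passed to the constructor stand in for what a distributed query against the grid would return; it is not the InsightEdge implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Plain-Java model of the four RDD elements; names are hypothetical, no Spark classes are used. */
final class GridRddSketch {

    /** One partition descriptor: its index and the host that owns the grid partition. */
    static final class GridPartition {
        final int index;
        final String host;

        GridPartition(int index, String host) {
            this.index = index;
            this.host = host;
        }
    }

    // Stand-ins for the result of a distributed query against the grid.
    private final List<String> hosts;
    private final List<List<String>> rowsByPartition;

    GridRddSketch(List<String> hosts, List<List<String>> rowsByPartition) {
        if (hosts.size() != rowsByPartition.size()) {
            throw new IllegalArgumentException("one row list is required per host");
        }
        this.hosts = hosts;
        this.rowsByPartition = rowsByPartition;
    }

    /** 1. Parent RDDs: empty, because the data comes straight from the grid. */
    List<GridRddSketch> parents() {
        return Collections.emptyList();
    }

    /** 2. Partitions: one per grid partition, each tagged with its host. */
    List<GridPartition> partitions() {
        List<GridPartition> result = new ArrayList<>();
        for (int i = 0; i < hosts.size(); i++) {
            result.add(new GridPartition(i, hosts.get(i)));
        }
        return result;
    }

    /** 3. Compute: an iterator over the portion of data held by that partition. */
    Iterator<String> compute(GridPartition partition) {
        return rowsByPartition.get(partition.index).iterator();
    }

    /** 4. Preferred locations: the host reported for that partition. */
    List<String> preferredLocations(GridPartition partition) {
        return Collections.singletonList(partition.host);
    }
}
```

Returning the grid host as the preferred location is what lets a scheduler place the task next to the data.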


Data Grid RDD: one-to-one partition

node 1: Spark executor, Spark Partition #1, Grid Partition #1

node 2: Spark executor, Spark Partition #2, Grid Partition #2

node 3: Spark executor, Spark Partition #3, Grid Partition #3

Direct connection between each Spark partition and its grid partition

Simple, but not enough parallelism for Spark

Data Grid RDD: with bucketing

node 1: Spark Executor, Grid Primary #1

Grid buckets 0 .. 1023, grouped into Spark Partition #1 and Spark Partition #2

1 Spark partition = M grid buckets

1 Grid partition = N Spark partitions
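One way to read the bucketing slide is that each grid partition's buckets are cut into N contiguous ranges, and each range becomes one Spark partition of M = 1024 / N buckets. The Java sketch below shows that arithmetic. The names are hypothetical, and both the contiguous split and the count of 1024 buckets per grid partition are assumptions drawn from the 0..1023 labels on the slide.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits each grid partition's buckets into several Spark partitions; names are hypothetical. */
final class BucketSplitter {
    /** Assumption: buckets are numbered 0..1023 within each grid partition, as drawn on the slide. */
    static final int BUCKETS = 1024;

    /** One Spark partition: a half-open bucket range [fromBucket, toBucket) of one grid partition. */
    static final class SparkPartition {
        final int index;
        final int gridPartition;
        final int fromBucket;
        final int toBucket;

        SparkPartition(int index, int gridPartition, int fromBucket, int toBucket) {
            this.index = index;
            this.gridPartition = gridPartition;
            this.fromBucket = fromBucket;
            this.toBucket = toBucket;
        }
    }

    /** Produces gridPartitions * n Spark partitions, each covering roughly BUCKETS / n buckets. */
    static List<SparkPartition> split(int gridPartitions, int n) {
        if (gridPartitions < 1 || n < 1 || n > BUCKETS) {
            throw new IllegalArgumentException("need gridPartitions >= 1 and 1 <= n <= " + BUCKETS);
        }
        List<SparkPartition> result = new ArrayList<>();
        for (int g = 0; g < gridPartitions; g++) {
            for (int i = 0; i < n; i++) {
                int from = i * BUCKETS / n;       // integer division spreads any remainder
                int to = (i + 1) * BUCKETS / n;   // last range always ends at BUCKETS
                result.add(new SparkPartition(result.size(), g, from, to));
            }
        }
        return result;
    }
}
```

With 3 grid partitions and N = 4, `BucketSplitter.split(3, 4)` yields 12 Spark partitions of 256 buckets each, so Spark gets four times the parallelism of the one-to-one layout while every task still reads from a single grid partition.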


Grid DataFrames: predicate pushdown & column pruning

SELECT SUM(amount)
FROM order
WHERE city = 'NY' AND year > 2012

Aggregation in Spark

Filtering and column pruning in Data Grid

Spark SQL architecture:

Implementing DataSource API

• Pushing down predicates to Data Grid
• Leveraging indexes
• Transparent to user
• Enabling support for other languages - Python/R
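The division of labour for the query on this slide can be sketched in plain Java: the grid applies the `city`/`year` predicate and returns only the `amount` column, and the Spark side just sums what comes back. In Spark's DataSource API, one hook for this is `PrunedFilteredScan.buildScan(requiredColumns, filters)`. The sketch does not use Spark, every name in it is hypothetical, and it is not the InsightEdge code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Plain-Java illustration of predicate pushdown and column pruning; all names are hypothetical. */
final class PushdownSketch {

    static final class Order {
        final String city;
        final int year;
        final double amount;
        final String customer; // a column the query never needs

        Order(String city, int year, double amount, String customer) {
            this.city = city;
            this.year = year;
            this.amount = amount;
            this.customer = customer;
        }
    }

    /** Grid side: filter where the data lives and return only the `amount` column. */
    static List<Double> gridScanAmounts(List<Order> gridPartition, Predicate<Order> pushedFilter) {
        List<Double> amounts = new ArrayList<>();
        for (Order order : gridPartition) {
            if (pushedFilter.test(order)) {
                amounts.add(order.amount);
            }
        }
        return amounts;
    }

    /** Spark side: aggregate values that are already filtered and pruned. */
    static double sum(List<Double> amounts) {
        double total = 0;
        for (double amount : amounts) {
            total += amount;
        }
        return total;
    }

    /** SELECT SUM(amount) FROM order WHERE city = 'NY' AND year > 2012, over all grid partitions. */
    static double sumNyAfter2012(List<List<Order>> gridPartitions) {
        Predicate<Order> filter = o -> "NY".equals(o.city) && o.year > 2012;
        double total = 0;
        for (List<Order> partition : gridPartitions) {
            total += sum(gridScanAmounts(partition, filter));
        }
        return total;
    }
}
```

Only matching `amount` values cross the boundary between grid and Spark, which is where the saving comes from when most rows fail the predicate.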


Push-down predicates performance

Traditional Spark filtering of 7MM records: 31 seconds

vs

Grid-side + Spark filtering of 7MM records: 1 second

Eventually, we productized this as an open source Spark distribution

https://translate.google.com/community

@InsightEdgeIO http://insightedge.io

Apache 2 License

http://insightedge.io/docs

http://insightedge.io/blog

http://github.com/InsightEdge


GigaSpaces InsightEdge: http://insightedge.io

High Performance Spark with OLTP Capabilities


ADDITIONAL INNOVATIONS

Spark GeoSpatial SQL and Data Frames


Multi-Spark Replication / Federated Clusters

In-Memory Replication across local or wide area networks

upcoming: Spark RDD/DF native read/save on Off-Heap (SSD/Flash/Direct Buffers)

[Diagram: Application and Processing on the In Memory Data Grid; primary instances with sync replication to backup instances, each backed by a storage array; two Spark workers]

• Significant RAM TCO reduction in Spark clusters

• Direct RDD/DataFrame read/write from SSD/Flash device

• Avoid Filesystem hops and write amplification


REFERENCE ARCHITECTURES

In-Process HTAP

Read any POJO, JSON Document, or Transaction as a DataFrame or RDD

Web services/apps can read any DataFrame as POJO

True closed-loop analytics data pipeline

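import com.gigaspaces.annotation.pojo.SpaceClass;

// The same POJO can be read as a DataFrame or RDD on the analytics side.
@SpaceClass
public class Product {
    private String name;
    private String brand;
    private Integer quantity;

    // …
}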

Point of Decision HTAP

XAP + InsightEdge deployed on different grid clusters with bi-directional real-time data replication

Transactions and Analytics, linked by In-Memory Data Grid real-time replication

• Scoring models
• Trigger actions
• Events

Case Study: Fleet Geo-analytics

Challenge

• Stream data from 1,000s of Taxis

• Actively monitor and generate real-time notifications

• Real-time Route Optimization and Geo-Fencing

Solution

• Leverage unified in-memory data fabric as middleware for geo-spatial analytics

• Elastically scale stream processing and transactional apps together

• Location-based tracking, Geo-fencing

Edge components

Data Sources


DEMO!

THANK YOU! QUESTIONS?
