Post on 14-Apr-2017


Page 1: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Hybrid Transactional/Analytics Processing with Spark and In-Memory Data Grids

Copyright © GigaSpaces 2016. All rights reserved.

Ali Hodroj, VP, Products and Strategy

Page 2: Hybrid Transactional/Analytics Processing with Spark and IMDGs

About me

• Vice President, Products and Strategy @ GigaSpaces
• (ex) Director of Solutions Architecture
• Blogging at http://blog.gigaspaces.com
• @ahodroj

• Email: [email protected]

• Slides at http://slideshare.com/ahodroj

Page 3: Hybrid Transactional/Analytics Processing with Spark and IMDGs

About GigaSpaces

Direct customers: 500+

Headquarters: New York, NY

Established: 2001

Page 4: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Do we need to bridge online transaction processing with real-time operational intelligence?

Page 5: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Modern applications: the line is blurred between…

Transactional: essential to operate the business

&

Analytical: turning data into value (insights, diagnosis, decision making)

Page 6: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Stories from the real #enterprise world...

Page 8: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Hyper-personalization and Omni-channel

Page 9: Hybrid Transactional/Analytics Processing with Spark and IMDGs

#Industrial Internet of Things

Page 10: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Goal:

Minimize Latency + Strong Consistency

Maximize Data-Analytics Locality

Page 11: Hybrid Transactional/Analytics Processing with Spark and IMDGs

There’s a name for it...

Page 12: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Page 13: Hybrid Transactional/Analytics Processing with Spark and IMDGs

In-Memory Data Store?

Page 14: Hybrid Transactional/Analytics Processing with Spark and IMDGs

In-Memory Computing 101

Increased capacity

No support for write-heavy scenarios

Limited to ID-based reads

Reads are the only part of the latency path

Distributed Cache: partitioned cache nodes

In-Memory Data Grid: scale-out system of record

In-Memory Database: scale-up system of record

Page 15: Hybrid Transactional/Analytics Processing with Spark and IMDGs

In-Memory Computing 101

Heavy Read/Write – sharded/partitioned architecture

Horizontally scalable on commodity HW (or cloud)

Serves as system of record with querying & transaction semantics

Requires modifying your application’s data access layer

Distributed Cache: partitioned cache nodes

In-Memory Data Grid: scale-out system of record

In-Memory Database: scale-up system of record

Page 16: Hybrid Transactional/Analytics Processing with Spark and IMDGs

In-Memory Computing 101

Read/Write Scalability

Drop-in SQL database replacement

Often lacks horizontal scalability (Joins)

Requires replacing your database

Distributed Cache: partitioned cache nodes

In-Memory Data Grid: scale-out system of record

In-Memory Database: scale-up system of record

Page 17: Hybrid Transactional/Analytics Processing with Spark and IMDGs

IMDG Data Models

import com.gigaspaces.annotation.pojo.SpaceClass;

@SpaceClass
public class Product {
    private String name;
    private String brand;
    private Integer quantity;

    // …
}

Page 18: Hybrid Transactional/Analytics Processing with Spark and IMDGs

IMDG Data Placement – Fixed Hashing

hash(key) % #nodes
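A minimal Java sketch of the placement rule on this slide, assuming the routing key's `hashCode()` stands in for `hash(key)`. The names `FixedHashRouter` and `partitionFor` are hypothetical and are not part of the XAP API.

```java
import java.util.List;

/** Hypothetical router illustrating hash(key) % #nodes; not the XAP implementation. */
final class FixedHashRouter {
    private FixedHashRouter() {}

    /** Maps a routing key to a partition index in [0, partitionCount). */
    static int partitionFor(Object routingKey, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(routingKey.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        for (String key : List.of("product-1", "product-2", "product-3")) {
            System.out.println(key + " -> partition " + partitionFor(key, 3));
        }
    }
}
```

Because the partition count is part of the formula, the same key always lands on the same partition as long as the number of partitions stays fixed.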

Page 19: Hybrid Transactional/Analytics Processing with Spark and IMDGs

IMDG Fixed Hashing - HA

hash(key) % #nodes
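The high-availability variant can be sketched the same way. This example assumes one partition per node and places each backup on the next node in the ring; that layout, and the names `BackupPlacement`, `primaryNode`, `backupNode` and `nodeServing`, are illustrative assumptions rather than how XAP actually assigns backups.

```java
import java.util.Set;

/** Hypothetical primary/backup placement on top of hash(key) % #nodes. */
final class BackupPlacement {
    private BackupPlacement() {}

    /** Node holding the primary copy for a key. */
    static int primaryNode(Object key, int nodeCount) {
        return Math.floorMod(key.hashCode(), nodeCount);
    }

    /** Node holding the backup copy; assumed to be the next node in the ring. */
    static int backupNode(Object key, int nodeCount) {
        if (nodeCount < 2) {
            throw new IllegalArgumentException("a backup needs at least two nodes");
        }
        return (primaryNode(key, nodeCount) + 1) % nodeCount;
    }

    /** Node that serves the key: the primary if it is alive, otherwise the backup. */
    static int nodeServing(Object key, int nodeCount, Set<Integer> failedNodes) {
        int primary = primaryNode(key, nodeCount);
        if (!failedNodes.contains(primary)) {
            return primary;
        }
        int backup = backupNode(key, nodeCount);
        if (!failedNodes.contains(backup)) {
            return backup;
        }
        throw new IllegalStateException("primary and backup are both down for key " + key);
    }

    public static void main(String[] args) {
        System.out.println("healthy cluster: node " + nodeServing("product-1", 3, Set.of()));
        int primary = primaryNode("product-1", 3);
        System.out.println("primary failed: node " + nodeServing("product-1", 3, Set.of(primary)));
    }
}
```

The routing formula itself does not change on failure; only the lookup of which copy is currently active does.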

Page 20: Hybrid Transactional/Analytics Processing with Spark and IMDGs

In-Memory Data Grids: How it works

http://xap.github.io

Page 21: Hybrid Transactional/Analytics Processing with Spark and IMDGs

The database goes to the background

Partition your data and store it in memory

Page 22: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Partitioned, co-located in-memory messaging

Page 23: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Business logic, data & messaging co-located & partitioned into processing units

Page 24: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Hot backup for each partition for high availability

Page 25: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Host your web application on the XAP infrastructure

Page 26: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Auto-scale out & in based on real-time performance & load

Page 27: Hybrid Transactional/Analytics Processing with Spark and IMDGs

In-Memory Data Grids: How it works

Page 28: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Host: Cisco UCS Server

CPU: Intel, 16 cores, 2.9 GHz

Concurrent threads: 2

Throughput: 200, 400, 800 ops/sec

Page 29: Hybrid Transactional/Analytics Processing with Spark and IMDGs

WHAT’S THE RIGHT DATA STORE TO CHOOSE?

Page 30: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Just an IMDB Thing….

Approach: Shove it all in one “Big Iron”?

Challenge:

● Nope: Your data sources and applications are often distributed.

● In-Memory or not, these databases aren’t built for horizontal scale-out

Page 31: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Approach: One large In-Memory Data Grid to rule them all?

Challenge:

● Not when your apps require polyglot analytics

● Unless you want to write ML algorithms, MDX engines, etc. from scratch

Page 32: Hybrid Transactional/Analytics Processing with Spark and IMDGs

What we needed

Low-latency Scale-Out In-Memory Data Grid

Large-scale distributed analytics framework

Maximize Data-Analytics Locality

Minimize Application Latency

Page 33: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Our approach to HTAP

Low-latency Scale-Out In-Memory Data Grid

+

Large-scale distributed analytics framework

Page 34: Hybrid Transactional/Analytics Processing with Spark and IMDGs

So why did we bet on SPARK?

Page 35: Hybrid Transactional/Analytics Processing with Spark and IMDGs

• Unified & Concise API

• Highly Flexible Data Store Integration

• Massive Community and Adoption

Page 36: Hybrid Transactional/Analytics Processing with Spark and IMDGs

But Spark is already in-memory!

Page 37: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Spark is caching over <insert your data store>, not an in-memory system of record

Page 38: Hybrid Transactional/Analytics Processing with Spark and IMDGs

How does APACHE SPARK FIT INTO THIS?

Page 39: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Architecture diagram (labels):

• Analytics Tier: Spark, Spark SQL, Spark Streaming, Machine Learning
• InsightEdge Core: building out the driver
• Transactional Tier: ACID-compliant, strong consistency
• In-Memory Data Grid: In-Memory Store (RAM); Flash, SSD, Off-Heap Store
• High availability; Security & Management

Page 40: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Data Grid + Spark Deployment Layout

node 1: Spark master, Grid master

node 2: Spark worker, Grid worker

node 3: Spark worker, Grid worker

Page 41: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Data Grid RDD: resilient distributed dataset

• List of parent RDDs – empty
• An array of partitions that a dataset is divided into – IMDG Distributed Query to get partitions and their hosts
• A compute function to do a computation on partitions – iterator over a portion of the data
• Optional preferred locations, i.e. hosts for a partition where the data will be loaded – hosts from the Distributed Query
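The four properties listed on this slide can be modeled in plain Java as follows. This is only a sketch: it does not extend Spark's `RDD` class (whose corresponding hooks are `getDependencies`, `getPartitions`, `compute` and `getPreferredLocations`), and `GridRddSketch`, `GridPartition` and the in-memory map that stands in for the grid are hypothetical.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the four RDD properties; not the InsightEdge implementation. */
final class GridRddSketch {

    /** A partition as a distributed query might report it: index plus owning host. */
    static final class GridPartition {
        final int index;
        final String host;

        GridPartition(int index, String host) {
            this.index = index;
            this.host = host;
        }
    }

    private final List<GridPartition> partitions;
    private final Map<Integer, List<String>> rowsByPartition; // stand-in for grid contents

    GridRddSketch(List<GridPartition> partitions, Map<Integer, List<String>> rowsByPartition) {
        this.partitions = partitions;
        this.rowsByPartition = rowsByPartition;
    }

    /** 1. Parent RDDs: empty, because the grid itself is the data source. */
    List<GridRddSketch> parents() {
        return List.of();
    }

    /** 2. Partitions and their hosts, as returned by the distributed query. */
    List<GridPartition> partitions() {
        return partitions;
    }

    /** 3. Compute: an iterator over the portion of data held by one partition. */
    Iterator<String> compute(GridPartition partition) {
        return rowsByPartition.getOrDefault(partition.index, List.of()).iterator();
    }

    /** 4. Preferred locations: the host that owns the partition. */
    List<String> preferredLocations(GridPartition partition) {
        return List.of(partition.host);
    }

    public static void main(String[] args) {
        GridRddSketch rdd = new GridRddSketch(
                List.of(new GridPartition(0, "node2"), new GridPartition(1, "node3")),
                Map.of(0, List.of("a", "b"), 1, List.of("c")));
        for (GridPartition p : rdd.partitions()) {
            System.out.println("partition " + p.index + " prefers " + rdd.preferredLocations(p));
        }
    }
}
```

Reporting the owning host as the preferred location is what lets a scheduler place the task next to the data.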

Page 42: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Data Grid RDD: one-to-one partition

node 1: Spark executor, Spark Partition #1, direct connection to Grid Partition #1

node 2: Spark executor, Spark Partition #2, direct connection to Grid Partition #2

node 3: Spark executor, Spark Partition #3, direct connection to Grid Partition #3

Simple, but not enough parallelism for Spark

Page 43: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Data Grid RDD: with bucketing

node 1: Spark Executor, Grid Primary #1

Grid buckets 0..1023, divided among Spark Partition #1 and Spark Partition #2

1 Spark partition = M grid buckets

1 Grid partition = N Spark partitions
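One way to picture the bucketing on this slide is to cut the bucket range of a grid partition into N contiguous slices, one per Spark partition. The sketch below assumes 1024 buckets (0..1023, as on the slide) and an even split; the class `BucketSplitter` and the even-split policy are assumptions for illustration, not the InsightEdge algorithm.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits a grid partition's buckets among Spark partitions. */
final class BucketSplitter {
    static final int BUCKET_COUNT = 1024; // buckets 0..1023

    private BucketSplitter() {}

    /**
     * Splits [0, bucketCount) into sparkPartitions contiguous ranges.
     * Each int[] is {fromInclusive, toExclusive}.
     */
    static List<int[]> split(int bucketCount, int sparkPartitions) {
        if (bucketCount <= 0 || sparkPartitions <= 0 || sparkPartitions > bucketCount) {
            throw new IllegalArgumentException("need 1 <= sparkPartitions <= bucketCount");
        }
        List<int[]> ranges = new ArrayList<>();
        int base = bucketCount / sparkPartitions;
        int remainder = bucketCount % sparkPartitions;
        int from = 0;
        for (int i = 0; i < sparkPartitions; i++) {
            // The first `remainder` ranges take one extra bucket so every bucket is covered.
            int size = base + (i < remainder ? 1 : 0);
            ranges.add(new int[] {from, from + size});
            from += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        int n = 4; // N Spark partitions per grid partition
        List<int[]> ranges = split(BUCKET_COUNT, n);
        for (int i = 0; i < ranges.size(); i++) {
            int[] r = ranges.get(i);
            System.out.println("Spark partition #" + (i + 1) + ": buckets " + r[0] + ".." + (r[1] - 1));
        }
    }
}
```

With N = 4, each Spark partition reads M = 256 buckets, so one grid partition can feed four Spark tasks in parallel.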

Page 44: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Grid DataFrames: predicate pushdown & column pruning

SELECT SUM(amount)
FROM order
WHERE city = 'NY' AND year > 2012

Aggregation in Spark

Filtering and column pruning in Data Grid

Spark SQL architecture:

• Pushing down predicates to Data Grid
• Leveraging indexes
• Transparent to user
• Enabling support for other languages - Python/R

Implementing DataSource API
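The division of labor for the query on this slide can be simulated in plain Java: the "grid side" applies the filter and returns only the `amount` column, and the "Spark side" only aggregates. Everything here (`PushdownSketch`, `Order`, `gridScan`, `sumAmounts`) is a hypothetical stand-in; it uses neither the Spark DataSource API nor the data grid client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Simulates predicate pushdown and column pruning with in-memory lists. */
final class PushdownSketch {

    /** One row of the order table; `notes` is a column the query never needs. */
    static final class Order {
        final String city;
        final int year;
        final double amount;
        final String notes;

        Order(String city, int year, double amount, String notes) {
            this.city = city;
            this.year = year;
            this.amount = amount;
            this.notes = notes;
        }
    }

    private PushdownSketch() {}

    /** Grid side: filter rows and return only the pruned column (amount). */
    static List<Double> gridScan(List<Order> partition, Predicate<Order> pushedFilter) {
        return partition.stream()
                .filter(pushedFilter)
                .map(order -> order.amount)
                .collect(Collectors.toList());
    }

    /** Spark side: aggregate the already-filtered values from every partition. */
    static double sumAmounts(List<List<Double>> perPartition) {
        return perPartition.stream()
                .flatMap(List::stream)
                .mapToDouble(Double::doubleValue)
                .sum();
    }

    /** SELECT SUM(amount) FROM order WHERE city = ? AND year > ? */
    static double sumAmountForCityAfterYear(List<List<Order>> gridPartitions, String city, int year) {
        Predicate<Order> pushed = order -> order.city.equals(city) && order.year > year;
        List<List<Double>> pruned = new ArrayList<>();
        for (List<Order> partition : gridPartitions) {
            pruned.add(gridScan(partition, pushed));
        }
        return sumAmounts(pruned);
    }

    public static void main(String[] args) {
        List<Order> p1 = List.of(new Order("NY", 2013, 100.0, "a"), new Order("NY", 2011, 50.0, "b"));
        List<Order> p2 = List.of(new Order("LA", 2014, 70.0, "c"), new Order("NY", 2015, 25.5, "d"));
        System.out.println(sumAmountForCityAfterYear(List.of(p1, p2), "NY", 2012));
    }
}
```

Only matching `amount` values cross the boundary between the two sides, which is the point of pushing the predicate down.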

Page 45: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Push-down predicates performance

Traditional Spark filtering of 7MM records: 31 seconds

vs

Grid-side + Spark filtering of 7MM records: 1 second

Page 46: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Eventually, we productized this as an open source Spark distribution

Page 47: Hybrid Transactional/Analytics Processing with Spark and IMDGs

@InsightEdgeIO http://insightedge.io

Apache 2 License

http://insightedge.io/docs

http://insightedge.io/blog

http://github.com/InsightEdge

Page 48: Hybrid Transactional/Analytics Processing with Spark and IMDGs

GigaSpaces InsightEdge

http://insightedge.io

High Performance Spark with OLTP Capabilities

Page 49: Hybrid Transactional/Analytics Processing with Spark and IMDGs

ADDITIONAL INNOVATIONS

Page 50: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Spark GeoSpatial SQL and Data Frames

Page 51: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Multi-Spark Replication / Federated Clusters

In-Memory Replication across local or wide area networks

Page 52: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Upcoming: Spark RDD/DF native read/save on Off-Heap (SSD/Flash/Direct Buffers)

Diagram labels: Application; Processing; In-Memory Data Grid with primary instances and backup instances (sync replication), each with a storage array; Spark workers

• Significant RAM TCO reduction in Spark clusters

• Direct RDD/DataFrame read/write from SSD/Flash device

• Avoid filesystem hops and write amplification

Page 53: Hybrid Transactional/Analytics Processing with Spark and IMDGs

REFERENCE ARCHITECTURES

Page 54: Hybrid Transactional/Analytics Processing with Spark and IMDGs

In-Process HTAP

Read any POJO, JSON Document, or Transaction as a DataFrame or RDD

Web services/apps can read any DataFrame as POJO

True closed-loop analytics data pipeline

import com.gigaspaces.annotation.pojo.SpaceClass;

@SpaceClass
public class Product {
    private String name;
    private String brand;
    private Integer quantity;

    // …
}

Page 55: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Point of Decision HTAP

XAP + InsightEdge deployed on different grid clusters with bi-directional real-time data replication

Transactions | Analytics

In-Memory Data Grid Realtime Replication:

• Scoring models
• Trigger actions
• Events

Page 56: Hybrid Transactional/Analytics Processing with Spark and IMDGs

Case Study: Fleet Geo-analytics

Challenge

• Stream data from 1,000s of Taxis

• Actively monitor and generate real-time notifications

• Real-time Route Optimization and Geo-Fencing

Solution

• Leverage unified in-memory data fabric as middleware for geo-spatial analytics

• Elastically scale stream processing and transactional apps together

• Location-based tracking, Geo-fencing
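As an illustration of the geo-fencing step, here is a minimal Java sketch of a circular fence check using the haversine great-circle distance. The circular fence shape, the class `GeoFence`, and the sample coordinates are assumptions for illustration; the deck does not say how the fences in the case study were defined or evaluated.

```java
/** Hypothetical circular geo-fence check based on the haversine formula. */
final class GeoFence {
    private static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius

    private GeoFence() {}

    /** Great-circle distance in kilometers between two latitude/longitude points (degrees). */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** True when the point lies within radiusKm of the fence center. */
    static boolean insideFence(double lat, double lon,
                               double centerLat, double centerLon, double radiusKm) {
        return haversineKm(lat, lon, centerLat, centerLon) <= radiusKm;
    }

    public static void main(String[] args) {
        // Illustrative coordinates only: a vehicle position and a fence center in New York.
        double taxiLat = 40.7580, taxiLon = -73.9855;
        double fenceLat = 40.7128, fenceLon = -74.0060;
        System.out.println("inside 10 km fence: " + insideFence(taxiLat, taxiLon, fenceLat, fenceLon, 10.0));
        System.out.println("inside 1 km fence: " + insideFence(taxiLat, taxiLon, fenceLat, fenceLon, 1.0));
    }
}
```

A check like this would run once per incoming position event, so a stream of vehicle updates can raise a notification whenever a vehicle enters or leaves a fence.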

Edge components

Data Sources

Page 57: Hybrid Transactional/Analytics Processing with Spark and IMDGs

DEMO!

Page 58: Hybrid Transactional/Analytics Processing with Spark and IMDGs

THANK YOU! QUESTIONS?