Using Distributed In-Memory Computing for Fast Data Analysis

Copyright © 2011 by ScaleOut Software, Inc. WSTA Seminar September 14, 2011 Bill Bain ([email protected]) Using Distributed, In-Memory Computing for Fast Data Analysis


DESCRIPTION

This is an overview of how distributed data grids enable data sharing across web servers and virtual cloud environments to provide scalability and high availability. It also covers how distributed data grids can run MapReduce-style analysis across large data sets.

TRANSCRIPT

Page 1: Using Distributed In-Memory Computing for Fast Data Analysis

Copyright © 2011 by ScaleOut Software, Inc.

WSTA Seminar, September 14, 2011

Bill Bain ([email protected])

Page 2

Agenda

• The Need for Memory-Based, Distributed Storage
• What Is a Distributed Data Grid (DDG)?
• Performance Advantages and Architecture
• Migrating Data to the Cloud and Across Global Sites
• Parallel Data Analysis
• Comparison of DDG to File-Based Map/Reduce

Page 3

The Need for Memory-Based Storage

[Diagram: a Web server farm — a load-balancer in front of Web and application server tiers on Ethernet, a database server with a RAID disk array marked as the bottleneck, and a distributed, in-memory data grid shared across the farm.]

Example: Web server farm:
• Load-balancer directs incoming client requests to Web servers.
• Web and app. server farms build Web pages and run business logic.
• Database server holds all mission-critical, LOB data.
• Server farms share fast-changing data using a DDG to avoid bottlenecks and maximize scalability.

Page 4

The Need for Memory-Based Storage

[Diagram: a cloud application running as multiple application virtual servers (App VS), backed by cloud-based storage and by a distributed data grid of grid virtual servers (Grid VS).]

Example: Cloud application:
• Application runs as multiple, virtual servers (VS).
• Application instances store and retrieve LOB data from a cloud-based file system or database.
• Applications need fast, scalable storage for fast-changing data.
• Distributed data grid runs as multiple, virtual servers to provide “elastic,” in-memory storage.

Page 5

What Is a Distributed Data Grid?

• A new “vertical” storage tier:
  – Adds a missing layer to boost performance.
  – Uses in-memory, out-of-process storage.
  – Avoids repeated trips to backing storage.
• A new “horizontal” storage tier:
  – Allows data sharing among servers.
  – Scales performance & capacity.
  – Adds high availability.
  – Can be used independently of backing storage.

[Diagram: the storage hierarchy on each server — processor cache and L2 cache, “in-process” application memory, the “out-of-process” distributed cache, and backing storage.]

Page 6

Distributed Data Grids: A Closer Look

• Incorporates a client-side, in-process cache (“near cache”):
  – Transparent to the application.
  – Holds recently accessed data.
• Boosts performance:
  – Eliminates repeated network data transfers & deserialization.
  – Reduces access times to near “in-process” latency.
  – Is automatically updated if the distributed grid changes.
  – Supports various coherency models (coherent, polled, event-driven).

[Diagram: “in-process” application memory and client-side cache in front of the “out-of-process” distributed cache.]
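To make the near-cache idea concrete, here is a minimal Python sketch (all names are hypothetical illustrations, not the vendor's API): the client keeps a local copy of each object and reuses it as long as a cheap version check says the grid has not changed it.

```python
class NearCache:
    """Client-side 'near cache' sketch: keeps recently read objects
    in-process and falls back to the (slower) distributed grid only
    when the grid holds a newer version of the object."""

    def __init__(self, grid):
        self.grid = grid           # stand-in for the remote grid client
        self.local = {}            # key -> (version, value)

    def get(self, key):
        version = self.grid.version(key)      # cheap coherency check
        cached = self.local.get(key)
        if cached and cached[0] == version:   # still coherent: reuse copy
            return cached[1]
        value = self.grid.read(key)           # network read + deserialize
        self.local[key] = (version, value)
        return value

class FakeGrid:
    """In-process stand-in for a distributed grid server."""
    def __init__(self):
        self.store, self.versions, self.reads = {}, {}, 0
    def put(self, key, value):
        self.store[key] = value
        self.versions[key] = self.versions.get(key, 0) + 1
    def version(self, key):
        return self.versions.get(key, 0)
    def read(self, key):
        self.reads += 1
        return self.store[key]

grid = FakeGrid()
grid.put("cart:42", ["book"])
cache = NearCache(grid)
cache.get("cart:42")
cache.get("cart:42")                  # served locally; no second grid read
print(grid.reads)                     # -> 1
grid.put("cart:42", ["book", "pen"])  # grid changes; version bumps
print(cache.get("cart:42"))           # -> ['book', 'pen'] (cache refreshed)
```

This corresponds to the “coherent” model in the bullet list; a polled or event-driven model would replace the per-read version check with periodic or pushed invalidations.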

Page 7

Performance Benefit of Client-side Cache

• Eliminates repeated network data transfers.
• Eliminates repeated object deserialization.

[Chart: average response time in microseconds (0–3500 scale) for 10KB objects at a 20:1 read/update ratio — DDG vs. DBMS.]

Page 8

Top 5 Benefits of Distributed Data Grids

1. Faster access time for business logic state or database data
2. Scalable throughput to match a growing workload and keep response times low
3. High availability to prevent data loss if a grid server (or network link) fails
4. Shared access to data across the server farm
5. Advanced capabilities for quickly and easily mining data using scalable “map/reduce” analysis

[Chart: access latency (msec) vs. throughput (accesses/sec) — grid vs. DBMS.]

Page 9

Scaling the Distributed Data Grid

• Distributed data grid must deliver scalable throughput.
• To do so, its architecture must eliminate bottlenecks to scaling:
  – Avoid centralized scheduling to eliminate hot spots.
  – Use data partitioning and maintain load-balance to allow scaling.
  – Use fixed vs. full replication to avoid n-fold overhead.
  – Use low-overhead heart-beating.
• Example of linear throughput scaling:

[Chart: read/write throughput for 10KB objects — accesses/second (0–80,000) vs. nodes (4–64), with the object count growing from 16,000 to 256,000.]
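The data-partitioning idea above can be sketched in a few lines of Python (illustrative only; real grids also rebalance and replicate partitions): each key is hashed to a node, so objects and load spread evenly with no central scheduler.

```python
import hashlib

def partition(key, num_nodes):
    """Map a key to a grid node by hashing it, so objects spread
    evenly across nodes without any centralized scheduling."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# With many keys, each of 4 nodes ends up holding roughly a quarter
# of the objects -- no node becomes a hot spot.
counts = [0] * 4
for i in range(10_000):
    counts[partition(f"object-{i}", 4)] += 1
print(counts)
```

A replica for high availability can be placed with the same scheme, e.g. on node `(partition(key, n) + 1) % n`, so only one extra copy exists per object (fixed replication) rather than n copies (full replication).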

Page 10

Typical Commercial Distributed Data Grids

• Partition objects to scale throughput and avoid hot spots.
• Synchronize access to objects across all servers.
• Dynamically rebalance objects to avoid hot spots.
• Replicate each cached object for high availability.
• Detect server or network failures and self-heal.

[Diagram: a client application, via a client library, retrieving a cached copy of an object from a distributed cache of cache-service nodes on Ethernet, with a replica copy held on another node.]

Page 11

Wide Range of Applications

Financial Services
• Portfolio risk analysis
• VaR calculations
• Monte Carlo simulations
• Algorithmic trading
• Market message caching
• Derivatives trading
• Pricing calculations

Other Applications
• Edge servers: chat, email
• Online gaming servers
• Scientific computations
• Command and control

E-commerce
• Session-state storage
• Application state storage
• Online banking
• Loan applications
• Wealth management
• Online learning
• Hotel reservations
• News story caching
• Shopping carts
• Social networking
• Service call tracking
• Online surveys

Page 12

Importance for Cloud Computing

• Cloud computing:
  – Makes elastic resources readily available, but…
  – Clouds have relatively slow interconnects.
• Distributed data grids add significant value in the cloud:
  – Allow data sharing across a group of virtual servers.
  – Elastically scale throughput as needed.
  – Provide low-latency, object-oriented storage.
• Clouds provide the elastic platform for parallel data analysis.
• DDGs provide the efficiency and scalability needed to overcome the cloud’s limited interconnect speed.

Page 13

DDGs Simplify Data Migration to the Cloud

• Distributed data grids can automatically bridge on-premise and cloud-based data grids to unify access.
• This enables seamless access to data across multiple sites.

[Diagram: a user’s on-premise application and on-premise distributed data grid (SOSS hosts, with backing store) bridged to a cloud-hosted distributed data grid (SOSS virtual servers) serving a cloud application of multiple application virtual servers; data is automatically migrated between the on-premise cache and the cloud-based distributed cache.]

Page 14

DDGs Enable Seamless Global Access

[Diagram: a global distributed data grid linking per-site distributed data grids (each a cluster of SOSS servers) at mirrored data centers and satellite data centers.]

Page 15

Introducing Parallel Data Analysis

• The goal:
  – Quickly analyze a large set of data for patterns and trends.
  – How? Run a method E (“eval”) across a set of objects D in parallel.
  – Optionally merge the results using method M (“merge”).
• Evolution of parallel analysis:
  – '80s: “SIMD/SPMD” (Flynn, Hillis)
  – '90s: “Domain decomposition” (Intel, IBM)
  – '00s: “Map/reduce” (Google, Hadoop, Dryad)
• Applications:
  – Search, financial services, business intelligence, simulation

[Diagram: method E applied to each data object D in parallel, with method M merging the partial outputs into a single result.]
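The E/M scheme can be sketched in Python (a stand-in for the grid's parallel invocation, not any vendor API): apply E to every object in parallel, then fold the partial results pairwise with M.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def parallel_analyze(objects, eval_fn, merge_fn):
    """Apply eval_fn (method E) to every object in parallel, then
    combine the partial results pairwise with merge_fn (method M)."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(eval_fn, objects))
    return reduce(merge_fn, partials)

# Toy run: count the even numbers in a 'data set' of 100 integers.
total = parallel_analyze(range(100),
                         lambda d: 1 if d % 2 == 0 else 0,
                         lambda a, b: a + b)
print(total)   # -> 50
```

For the merge tree shown on the slide to be valid, M must be associative (and, if the grid merges in arbitrary order, commutative), as addition is here.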

Page 16

Example in Financial Services

Analyze trading strategies across stock histories:

Why?
• Back-testing systems help guard against risks in deploying new trading strategies.
• Performance is critical for “first to market” advantage.
• Uses a significant amount of market data and computation time.

How?
• Write method E to analyze trading strategies across a single stock history.
• Write method M to merge two sets of results.
• Populate the data store with a set of stock histories.
• Run method E in parallel on all stock histories.
• Merge the results with method M to produce a report.
• Refine and repeat…

Page 17

Stage the Data for Analysis

• Step 1: Populate the distributed data grid with objects, each of which represents a price history for a ticker symbol.
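The transcript does not capture the slide's code snippet, so here is a hedged Python stand-in for Step 1 (a dict plays the role of the grid, and `StockHistory` is an illustrative type, not the product's API):

```python
from dataclasses import dataclass

@dataclass
class StockHistory:
    """One grid object per ticker symbol (illustrative shape)."""
    ticker: str
    closes: list   # daily closing prices

grid = {}          # in-process stand-in for the distributed data grid
feed = {"IBM": [181.9, 183.2, 182.5], "MSFT": [25.8, 26.0, 25.9]}
for ticker, closes in feed.items():
    grid[ticker] = StockHistory(ticker, closes)   # keyed by ticker symbol

print(sorted(grid))   # -> ['IBM', 'MSFT']
```

Keying each object by its ticker symbol is what lets the grid partition the histories across servers and later run the eval method on each one in place.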

Page 18

Code the Eval and Merge Methods

• Step 2: Write a method to evaluate a stock history based on parameters.
• Step 3: Write a method to merge the results of two evaluations.
• Notes:
  – This code can run as a sequential calculation on in-memory data.
  – No explicit accesses to the distributed data grid are used.

Results EvalStockHistory(StockHistory history, Parameters params)
{
    <analyze trading strategy for this stock history>
    return results;
}

Results MergeResults(Results results1, Results results2)
{
    <merge both results>
    return results;
}
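Since the slide leaves the method bodies as placeholders, one possible Python fleshing-out follows, with a deliberately simple stand-in strategy (count days the close exceeds its trailing moving average) in place of a real trading strategy:

```python
from functools import reduce

def eval_stock_history(history, params):
    """EvalStockHistory sketch: score one price history against a toy
    strategy -- days the close exceeds its trailing moving average."""
    window, prices = params["window"], history["prices"]
    days = sum(
        1
        for i in range(window, len(prices))
        if prices[i] > sum(prices[i - window:i]) / window
    )
    return {"symbols": 1, "days_above_avg": days}

def merge_results(r1, r2):
    """MergeResults sketch: combine two partial results field by field."""
    return {key: r1[key] + r2[key] for key in r1}

histories = [{"ticker": "A", "prices": [1, 2, 3, 4]},
             {"ticker": "B", "prices": [4, 3, 2, 1]}]
report = reduce(merge_results,
                (eval_stock_history(h, {"window": 2}) for h in histories))
print(report)   # -> {'symbols': 2, 'days_above_avg': 2}
```

Note the property the slide's notes call out: neither method touches the grid directly, so the grid can run them in place on whichever server holds each history.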

Page 19

Run the Analysis

• Step 4: Invoke parallel evaluation and merging of results:

Results Invoke(EvalStockHistory, MergeResults, querySpec, params);

Page 20

[Diagram: parallel analysis started across the grid — .eval() runs on each stock history to produce per-history results, which pairwise .merge() calls combine in a tree until a single result is returned to the client.]

Page 21

DDG Minimizes Data Motion

• File-based map/reduce must move data to memory for analysis.
• Memory-based DDG analyzes data in place.

[Diagram: file-based map/reduce servers each pulling data (D) from a file system / database server before running eval (E), versus grid servers running E directly on the data already resident in each server’s memory.]

Page 22

[Diagram: the same parallel .eval()/.merge() tree as Page 20, but with file I/O required before each .eval() step on the stock histories.]

Page 23

Performance Impact of Data Motion

Measured random access to DDG data to simulate file I/O.

Page 24

Comparison of DDGs and File-Based M/R

                       DDG                         File-Based M/R
Data set size          Gigabytes -> terabytes      Terabytes -> petabytes
Data repository        In-memory                   File / database
Data view              Queried object collection   File-based key/value pairs
Development time       Low                         High
Automatic scalability  Yes                         Application-dependent
Best use               Quick-turn analysis of      Complex analysis of
                       memory-based data           large datasets
I/O overhead           Low                         High
Cluster mgt.           Simple                      Complex
High availability      Memory-based                File-based

Page 25

Walk-Away Points

• Developers need fast, scalable, highly available, and sharable memory-based storage for scaled-out applications.
• Distributed data grids (DDGs) address these needs with:
  – Fast access time & scalable throughput
  – Highly available data storage
  – Support for parallel data analysis
• Cloud-based and globally distributed applications need DDGs to:
  – Support scalable data access for “elastic” applications.
  – Efficiently and easily migrate data across sites.
  – Avoid relatively slow cloud storage I/O and interconnects.
• DDGs offer simple, fast “map/reduce” parallel analysis:
  – Make it easy to develop applications and configure clusters.
  – Avoid file I/O overhead for datasets that fit in memory-based grids.
  – Deliver automatic, highly scalable performance.

Page 26

Distributed Data Grids for Server Farms & High Performance Computing

www.scaleoutsoftware.com