Using Distributed In-Memory Computing for Fast Data Analysis

Copyright © 2011 by ScaleOut Software, Inc. WSTA Seminar September 14, 2011 Bill Bain ([email protected]) Using Distributed, In-Memory Computing for Fast Data Analysis


DESCRIPTION

This is an overview of how distributed data grids enable data sharing across web servers and virtual cloud environments to provide scalability and high availability. It also covers how distributed data grids can run MapReduce-style analysis across large data sets.

TRANSCRIPT

Page 1: Using Distributed In-Memory Computing for Fast Data Analysis

Copyright © 2011 by ScaleOut Software, Inc.

WSTA Seminar, September 14, 2011

Bill Bain ([email protected])

Page 2

Agenda

• The Need for Memory-Based, Distributed Storage
• What Is a Distributed Data Grid (DDG)?
• Performance Advantages and Architecture
• Migrating Data to the Cloud and Across Global Sites
• Parallel Data Analysis
• Comparison of DDG to File-Based Map/Reduce

Page 3

The Need for Memory-Based Storage

[Diagram: a Web server farm — a load-balancer in front of Web and application server tiers on Ethernet, a database server with a RAID disk array marked as the bottleneck, and a distributed, in-memory data grid shared across the farm.]

Example: Web server farm:
• Load-balancer directs incoming client requests to Web servers.
• Web and app. server farms build Web pages and run business logic.
• Database server holds all mission-critical, LOB data.
• Server farms share fast-changing data using a DDG to avoid bottlenecks and maximize scalability.

Page 4

The Need for Memory-Based Storage

[Diagram: a cloud application running as multiple application virtual servers (App VS), backed by cloud-based storage and by a distributed data grid of grid virtual servers (Grid VS).]

Example: Cloud application:
• Application runs as multiple, virtual servers (VS).
• Application instances store and retrieve LOB data from a cloud-based file system or database.
• Applications need fast, scalable storage for fast-changing data.
• Distributed data grid runs as multiple, virtual servers to provide “elastic,” in-memory storage.

Page 5

What Is a Distributed Data Grid?

• A new “vertical” storage tier:
  – Adds a missing layer to boost performance.
  – Uses in-memory, out-of-process storage.
  – Avoids repeated trips to backing storage.
• A new “horizontal” storage tier:
  – Allows data sharing among servers.
  – Scales performance & capacity.
  – Adds high availability.
  – Can be used independently of backing storage.

[Diagram: the storage hierarchy on each server — processor cache and L2 cache, “in-process” application memory, the “out-of-process” distributed cache, and backing storage.]

Page 6

Distributed Data Grids: A Closer Look

• Incorporates a client-side, in-process cache (“near cache”):
  – Transparent to the application.
  – Holds recently accessed data.
• Boosts performance:
  – Eliminates repeated network data transfers & deserialization.
  – Reduces access times to near “in-process” latency.
  – Is automatically updated if the distributed grid changes.
  – Supports various coherency models (coherent, polled, event-driven).

[Diagram: “in-process” application memory and client-side cache in front of the “out-of-process” distributed cache.]
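To make the near-cache idea concrete, here is a minimal Python sketch (all names are hypothetical illustrations, not the vendor's API): the client keeps a local copy of each object and reuses it as long as a cheap version check says the grid has not changed it.

```python
class NearCache:
    """Client-side 'near cache' sketch: keeps recently read objects
    in-process and falls back to the (slower) distributed grid only
    when the grid holds a newer version of the object."""

    def __init__(self, grid):
        self.grid = grid           # stand-in for the remote grid client
        self.local = {}            # key -> (version, value)

    def get(self, key):
        version = self.grid.version(key)      # cheap coherency check
        cached = self.local.get(key)
        if cached and cached[0] == version:   # still coherent: reuse copy
            return cached[1]
        value = self.grid.read(key)           # network read + deserialize
        self.local[key] = (version, value)
        return value

class FakeGrid:
    """In-process stand-in for a distributed grid server."""
    def __init__(self):
        self.store, self.versions, self.reads = {}, {}, 0
    def put(self, key, value):
        self.store[key] = value
        self.versions[key] = self.versions.get(key, 0) + 1
    def version(self, key):
        return self.versions.get(key, 0)
    def read(self, key):
        self.reads += 1
        return self.store[key]

grid = FakeGrid()
grid.put("cart:42", ["book"])
cache = NearCache(grid)
cache.get("cart:42")
cache.get("cart:42")                  # served locally; no second grid read
print(grid.reads)                     # -> 1
grid.put("cart:42", ["book", "pen"])  # grid changes; version bumps
print(cache.get("cart:42"))           # -> ['book', 'pen'] (cache refreshed)
```

This corresponds to the “coherent” model in the bullet list; a polled or event-driven model would replace the per-read version check with periodic or pushed invalidations.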

Page 7

Performance Benefit of Client-side Cache

• Eliminates repeated network data transfers.
• Eliminates repeated object deserialization.

[Chart: average response time in microseconds (0–3500 scale) for 10KB objects at a 20:1 read/update ratio — DDG vs. DBMS.]

Page 8

Top 5 Benefits of Distributed Data Grids

1. Faster access time for business logic state or database data
2. Scalable throughput to match a growing workload and keep response times low
3. High availability to prevent data loss if a grid server (or network link) fails
4. Shared access to data across the server farm
5. Advanced capabilities for quickly and easily mining data using scalable “map/reduce” analysis

[Chart: access latency (msec) vs. throughput (accesses/sec) — grid vs. DBMS.]

Page 9

Scaling the Distributed Data Grid

• Distributed data grid must deliver scalable throughput.
• To do so, its architecture must eliminate bottlenecks to scaling:
  – Avoid centralized scheduling to eliminate hot spots.
  – Use data partitioning and maintain load-balance to allow scaling.
  – Use fixed vs. full replication to avoid n-fold overhead.
  – Use low-overhead heart-beating.
• Example of linear throughput scaling:

[Chart: read/write throughput for 10KB objects — accesses/second (0–80,000) vs. nodes (4–64), with the object count growing from 16,000 to 256,000.]
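The data-partitioning idea above can be sketched in a few lines of Python (illustrative only; real grids also rebalance and replicate partitions): each key is hashed to a node, so objects and load spread evenly with no central scheduler.

```python
import hashlib

def partition(key, num_nodes):
    """Map a key to a grid node by hashing it, so objects spread
    evenly across nodes without any centralized scheduling."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# With many keys, each of 4 nodes ends up holding roughly a quarter
# of the objects -- no node becomes a hot spot.
counts = [0] * 4
for i in range(10_000):
    counts[partition(f"object-{i}", 4)] += 1
print(counts)
```

A replica for high availability can be placed with the same scheme, e.g. on node `(partition(key, n) + 1) % n`, so only one extra copy exists per object (fixed replication) rather than n copies (full replication).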

Page 10

Typical Commercial Distributed Data Grids

• Partition objects to scale throughput and avoid hot spots.
• Synchronize access to objects across all servers.
• Dynamically rebalance objects to avoid hot spots.
• Replicate each cached object for high availability.
• Detect server or network failures and self-heal.

[Diagram: a client application, via a client library, retrieving a cached copy of an object from a distributed cache of cache-service nodes on Ethernet, with a replica copy held on another node.]

Page 11

Wide Range of Applications

Financial Services
• Portfolio risk analysis
• VaR calculations
• Monte Carlo simulations
• Algorithmic trading
• Market message caching
• Derivatives trading
• Pricing calculations

Other Applications
• Edge servers: chat, email
• Online gaming servers
• Scientific computations
• Command and control

E-commerce
• Session-state storage
• Application state storage
• Online banking
• Loan applications
• Wealth management
• Online learning
• Hotel reservations
• News story caching
• Shopping carts
• Social networking
• Service call tracking
• Online surveys

Page 12

Importance for Cloud Computing

• Cloud computing:
  – Makes elastic resources readily available, but…
  – Clouds have relatively slow interconnects.
• Distributed data grids add significant value in the cloud:
  – Allow data sharing across a group of virtual servers.
  – Elastically scale throughput as needed.
  – Provide low-latency, object-oriented storage.
• Clouds provide the elastic platform for parallel data analysis.
• DDGs provide the efficiency and scalability needed to overcome the cloud’s limited interconnect speed.

Page 13

DDGs Simplify Data Migration to the Cloud

• Distributed data grids can automatically bridge on-premise and cloud-based data grids to unify access.
• This enables seamless access to data across multiple sites.

[Diagram: a user’s on-premise application and on-premise distributed data grid (SOSS hosts, with backing store) bridged to a cloud-hosted distributed data grid (SOSS virtual servers) serving a cloud application of multiple application virtual servers; data is automatically migrated between the on-premise cache and the cloud-based distributed cache.]

Page 14

DDGs Enable Seamless Global Access

[Diagram: a global distributed data grid linking per-site distributed data grids (each a cluster of SOSS servers) at mirrored data centers and satellite data centers.]

Page 15

Introducing Parallel Data Analysis

• The goal:
  – Quickly analyze a large set of data for patterns and trends.
  – How? Run a method E (“eval”) across a set of objects D in parallel.
  – Optionally merge the results using method M (“merge”).
• Evolution of parallel analysis:
  – '80s: “SIMD/SPMD” (Flynn, Hillis)
  – '90s: “Domain decomposition” (Intel, IBM)
  – '00s: “Map/reduce” (Google, Hadoop, Dryad)
• Applications:
  – Search, financial services, business intelligence, simulation

[Diagram: method E applied to each data object D in parallel, with method M merging the partial outputs into a single result.]
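The E/M scheme can be sketched in Python (a stand-in for the grid's parallel invocation, not any vendor API): apply E to every object in parallel, then fold the partial results pairwise with M.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def parallel_analyze(objects, eval_fn, merge_fn):
    """Apply eval_fn (method E) to every object in parallel, then
    combine the partial results pairwise with merge_fn (method M)."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(eval_fn, objects))
    return reduce(merge_fn, partials)

# Toy run: count the even numbers in a 'data set' of 100 integers.
total = parallel_analyze(range(100),
                         lambda d: 1 if d % 2 == 0 else 0,
                         lambda a, b: a + b)
print(total)   # -> 50
```

For the merge tree shown on the slide to be valid, M must be associative (and, if the grid merges in arbitrary order, commutative), as addition is here.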

Page 16

Example in Financial Services

Analyze trading strategies across stock histories:

Why?
• Back-testing systems help guard against risks in deploying new trading strategies.
• Performance is critical for “first to market” advantage.
• Uses a significant amount of market data and computation time.

How?
• Write method E to analyze trading strategies across a single stock history.
• Write method M to merge two sets of results.
• Populate the data store with a set of stock histories.
• Run method E in parallel on all stock histories.
• Merge the results with method M to produce a report.
• Refine and repeat…

Page 17

Stage the Data for Analysis

• Step 1: Populate the distributed data grid with objects, each of which represents a price history for a ticker symbol.
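The transcript does not capture the slide's code snippet, so here is a hedged Python stand-in for Step 1 (a dict plays the role of the grid, and `StockHistory` is an illustrative type, not the product's API):

```python
from dataclasses import dataclass

@dataclass
class StockHistory:
    """One grid object per ticker symbol (illustrative shape)."""
    ticker: str
    closes: list   # daily closing prices

grid = {}          # in-process stand-in for the distributed data grid
feed = {"IBM": [181.9, 183.2, 182.5], "MSFT": [25.8, 26.0, 25.9]}
for ticker, closes in feed.items():
    grid[ticker] = StockHistory(ticker, closes)   # keyed by ticker symbol

print(sorted(grid))   # -> ['IBM', 'MSFT']
```

Keying each object by its ticker symbol is what lets the grid partition the histories across servers and later run the eval method on each one in place.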

Page 18

Code the Eval and Merge Methods

• Step 2: Write a method to evaluate a stock history based on parameters.
• Step 3: Write a method to merge the results of two evaluations.
• Notes:
  – This code can run as a sequential calculation on in-memory data.
  – No explicit accesses to the distributed data grid are used.

Results EvalStockHistory(StockHistory history, Parameters params)
{
    <analyze trading strategy for this stock history>
    return results;
}

Results MergeResults(Results results1, Results results2)
{
    <merge both results>
    return results;
}
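Since the slide leaves the method bodies as placeholders, one possible Python fleshing-out follows, with a deliberately simple stand-in strategy (count days the close exceeds its trailing moving average) in place of a real trading strategy:

```python
from functools import reduce

def eval_stock_history(history, params):
    """EvalStockHistory sketch: score one price history against a toy
    strategy -- days the close exceeds its trailing moving average."""
    window, prices = params["window"], history["prices"]
    days = sum(
        1
        for i in range(window, len(prices))
        if prices[i] > sum(prices[i - window:i]) / window
    )
    return {"symbols": 1, "days_above_avg": days}

def merge_results(r1, r2):
    """MergeResults sketch: combine two partial results field by field."""
    return {key: r1[key] + r2[key] for key in r1}

histories = [{"ticker": "A", "prices": [1, 2, 3, 4]},
             {"ticker": "B", "prices": [4, 3, 2, 1]}]
report = reduce(merge_results,
                (eval_stock_history(h, {"window": 2}) for h in histories))
print(report)   # -> {'symbols': 2, 'days_above_avg': 2}
```

Note the property the slide's notes call out: neither method touches the grid directly, so the grid can run them in place on whichever server holds each history.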

Page 19

Run the Analysis

• Step 4: Invoke parallel evaluation and merging of results:

Results Invoke(EvalStockHistory, MergeResults, querySpec, params);

Page 20

[Diagram: parallel analysis started across the grid — .eval() runs on each stock history to produce per-history results, which pairwise .merge() calls combine in a tree until a single result is returned to the client.]

Page 21

DDG Minimizes Data Motion

• File-based map/reduce must move data to memory for analysis.
• Memory-based DDG analyzes data in place.

[Diagram: file-based map/reduce servers each pulling data (D) from a file system / database server before running eval (E), versus grid servers running E directly on the data already resident in each server’s memory.]

Page 22

[Diagram: the same parallel .eval()/.merge() tree as Page 20, but with file I/O required before each .eval() step on the stock histories.]

Page 23

Performance Impact of Data Motion

Measured random access to DDG data to simulate file I/O.

Page 24

Comparison of DDGs and File-Based M/R

                       DDG                         File-Based M/R
Data set size          Gigabytes -> terabytes      Terabytes -> petabytes
Data repository        In-memory                   File / database
Data view              Queried object collection   File-based key/value pairs
Development time       Low                         High
Automatic scalability  Yes                         Application-dependent
Best use               Quick-turn analysis of      Complex analysis of
                       memory-based data           large datasets
I/O overhead           Low                         High
Cluster mgt.           Simple                      Complex
High availability      Memory-based                File-based

Page 25

Walk-Away Points

• Developers need fast, scalable, highly available, and sharable memory-based storage for scaled-out applications.
• Distributed data grids (DDGs) address these needs with:
  – Fast access time & scalable throughput
  – Highly available data storage
  – Support for parallel data analysis
• Cloud-based and globally distributed applications need DDGs to:
  – Support scalable data access for “elastic” applications.
  – Efficiently and easily migrate data across sites.
  – Avoid relatively slow cloud storage I/O and interconnects.
• DDGs offer simple, fast “map/reduce” parallel analysis:
  – Make it easy to develop applications and configure clusters.
  – Avoid file I/O overhead for datasets that fit in memory-based grids.
  – Deliver automatic, highly scalable performance.

Page 26

Distributed Data Grids for Server Farms & High Performance Computing

www.scaleoutsoftware.com