samson platform architecture streaming big data telefÓnica i+d

Post on 01-Apr-2015

216 Views

Category:

Documents

2 Downloads

Preview:

Click to see full reader

TRANSCRIPT

SAMSON PlatformArchitecture Streaming Big Data

TELEFÓNICA I+D

Index

SAMSON Platform

File-System

Streaming MapReduce

Eco-system

Architecture

0102030405

SAMSON Platform

01

Telefónica I+D

Overview

Samson is a distributed processing engine especially designed for efficient analytics of stream-processing.

Internal distributed file-system optimized for shared data processing

Provides an extension of the MapReduce framework• More efficient MapReduce

• Joins

Streaming MapReduce that allows for the incremental processing of data feeds

Uses existing BigData Storage solutions such as Apache HDFS or MongoDB for fetching and storing data

Built to be deployed on Ubuntu, Redhat Linux and in virtual machines

4

Telefónica I+D

Samson 0.6 DEB Samson 0.6 RPM User guide Available for VM

Telefónica I+D

SAMSON Platform Architecture

6

Telefónica I+D

Key Platform Components

File-system

Streaming MapReduce

Eco-system

Architecture

7

Telefónica I+D

File-system

02

Telefónica I+D

HD

Cores

HD HD HD

Distributed big-data platform for high-performance Processing over unbounded streams of data

SAMSON distributed file - system

MapReduce for streamed processing

Cores Cores Cores

Telefónica I+D

SAMSON distributed file - system

Apache Hadoop HDFS SAMSON

Data is uniformly distributed Binary data is distributed by key

Large heterogeneous clustersNon uniform datasets

Homogenous clusterUniform data-sets

Maximum resource usage Efficient accumulation operationsEfficient JOIN operations

Telefónica I+D

We periodically receive a set of documents.We want to compute the accumulated word-count each time we receive an update.

First input

Second input

Third input

Reduce

Reduce

Reduce

MapReduce

6

MapReduce

12

MapReduce

18

Redistribution

6

Redistribution

6

Redistribution

6

SAMSON distributed file - system

Telefónica I+D

Streaming MapReduce

03

Telefónica I+D

SAMSON

SAMSON

SAMSON

DELILAH

4 Gb

Upload Download

Run operations

Telefónica I+D

SAMSON

SAMSON

SAMSON

DELILAH

Upload new operations

& data types

Run operations

Open API for 3rd party developers

Telefónica I+D

SAMSON

SAMSON

SAMSON

Telefónica I+D

Operation

Operation

Operation

SAMSON

SAMSON

SAMSON

Map

Telefónica I+D

Operation

Operation

Operation

SAMSON

SAMSON

SAMSON

StateInput Output

Reduce

Telefónica I+D

Eco-system

04

Telefónica I+D 19

Top level view…

samsonPush

samsonPush

samsonPush

delilah

samsonPop

samsonClient

delilah

samsonPop

Console-based client • Upload data• Download data• Run commands• Platform monitor

samsonClient

delilah samsonPush

samsonPop

samsonClient

Binaries to stream data into and out of SAMSON

C++ library to develop new plugins

Examples: • samsonPush• samsonPop

module module

modulemodule

module

module

3rd Party C++ shared library

• New data types• New operations

Tools provided for simplified development!!

Telefónica I+D 20

Delilah clientdelilah

Telefónica I+D 21

SAMSON Module example…

Module simple_mobility{ title "Simple mobility example" author "Andreu Urruela" version "0.1.1"}

data UserArea{ system.String name; system.UInt x; system.UInt y; system.UInt radius;}

data Position{ system.UInt x; system.UInt y;

system.TimeUnix time;}

parser parser_cdrs{ out system.UInt simple_mobility.Position

helpLine "Parse input CDRs to get user-position"}

class parser_cdrs : public samson::system::SimpleParser {

std::vector<char*> words; // Vector used to store words parsed at each line

void parseLine( char * line , samson::KVWriter *writer ) { // Split line in words split_in_words( line, words );

// Expected format USER_ID CDR X Y time

if( words.size() < 5 ) return; // No content for a valid instruction

if( strcmp( words[1] , "CDR" ) != 0 ) return; // Non valid format

// Set the key key.value = atoll( words[0] );

// Set the position value.set( atoll( words[2] ) , atoll( words[3] ) , atoll( words[4] ) );

// Emit the key-value writer->emit( 0 , &key, &value ); }

};

module

Telefónica I+D 22

SAMSON ecosystem tools

Tool Description

samsonModuleBootstrap Create necessary files to start developing a new module

samsonModuleParser Create .h .cpp files from a description file

samsonCat Visualize binary content of SAMSON

samsonSetup Edit setup of a SAMSON system

samsonStarter Setup a new SAMSON cluster

samsonData Show a transactional log for debugging

samsonPush Push data from STDIN to a particular queue

samsonPop Pop data from a particular queue to STDOUT

samsonLocal Emmulate a local cluster for fast development of modules

Telefónica I+D

File system Stream MapReduce Ecosystem Architecture Demo

Telefónica I+D

Architecture

05

Telefónica I+D 25

Communication Protocols…

delilah

Platform messagesFlexibilityBack compatibility

Data serializationMaximum data compression ( no field separator )Best for fast-sequential processingEasy job distributionProprietary serialization format

MonitoringNo recompilation neededBest tools for querying ( XPATH )

Goal Solution Why ?

Telefónica I+D26

Worker

Disk Manager

Memory Manager

Process Manager

--- cores --

Runtime engine and notification system

Engine library

Independent development

NetworkManager

Telefónica I+D 27

Disk Manager // Network Manager

• Controller to access local disk and network connections• Asynchronus notifications using engine notification system• If required multiple threads are used

Memory Manager: ( our retain-release model ….. similar to Objective-C )

• Simple system to control memory usage• Used to optimize memory allocation when under heavy load

Process Manager ( similar to Apple’s Grand Central dispatch library )

• System to control independent “heavy” task to be executed• Automatic creation / destruction of threads• Optional “fork” mode with shared-memory system to get output

Runtime Engine & Notification system

• Inspired in message-passing system implemented in Objective-C• Single loop to run all state-update operations• Thread protection to interact with Disk/Network/Memory/Process

Managers

Engine library

Telefónica I+D28

Worker

Engine library

Independent development

multi-core

Block ManagerBlock ManagerDisk – Memory

balancer

Disk Manager

Memory Manager

Process Manager

--- cores --

Runtime engine and notification system

NetworkManager

Telefónica I+D 29

Maintains a reference of all blocks of data contained in a Worker ( in disk or memory )

It keeps a sorted list based on when they will be used ( future operations )• Low priority blocks are flushed to disk first• High priority blocks are loaded from disk first

Connected to the Disk Manager inside the engine using the EngineNotificationSystem

Block Manager

Block Manager

Schedule write operationsTo DiskManager

Schedule read operationsTo DiskManager

Important: Since the order of blocks changes continuously based on the scheduling of new processing operations, the Block Manager is made aware of the new order and is able to react accordingly.

Telefónica I+D30

Worker

Engine library

Independent development

multi-core

Block ManagerBlock ManagerDisk – Memory

balancer

Queues Manager

txt_cdrs

cdrs

Stream Operations Manager

Operation A

Operation BInput data

usersOperation C

pri

ori

ty

Disk Manager

Memory Manager

Process Manager

--- cores --

Runtime engine and notification system

NetworkManager

Telefónica I+D 31

Contains reference to all the blocks contained in queues and stream operationsBoth systems are connected to Block Manager to inform about the priority of blocks

Stream Operation Manager is connected with ProcessManager to schedule 3rd party operations at Engine Subsystem

Queue & Stream Operations Manager

Queues Manager

txt_cdrs

cdrs

Stream Operations Manager

Operation A

Operation Busers

Operation C

pri

ori

ty

top related