SAMSON Platform Architecture: Streaming Big Data
TELEFÓNICA I+D
Index
01 SAMSON Platform
02 File-System
03 Streaming MapReduce
04 Eco-system
05 Architecture
01 SAMSON Platform
Overview
SAMSON is a distributed processing engine designed for efficient analytics over streams of data.
Internal distributed file-system optimized for shared data processing
Provides an extension of the MapReduce framework:
• More efficient MapReduce
• Joins
Streaming MapReduce that allows for the incremental processing of data feeds
Uses existing big-data storage solutions such as Apache HDFS or MongoDB for fetching and storing data
Built to be deployed on Ubuntu, Red Hat Linux and in virtual machines
• Samson 0.6 DEB
• Samson 0.6 RPM
• User guide
• Available for VM
SAMSON Platform Architecture
Key Platform Components
File-system
Streaming MapReduce
Eco-system
Architecture
02 File-system
Distributed big-data platform for high-performance processing over unbounded streams of data
[Figure: cluster nodes, each with cores and hard disks (HD); MapReduce for streamed processing runs on top of the SAMSON distributed file-system]
SAMSON distributed file-system
Apache Hadoop HDFS | SAMSON
Data is uniformly distributed | Binary data is distributed by key
Large heterogeneous clusters, non-uniform datasets | Homogeneous cluster, uniform datasets
Maximum resource usage | Efficient accumulation operations, efficient JOIN operations
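Distributing data by key means that all records sharing a key end up on the same node, which is what lets accumulation and JOIN operations run locally. The following is a minimal sketch of that idea only. The function name `workerForKey` and the hash-modulo placement rule are assumptions made for illustration; the slides do not describe the placement scheme SAMSON uses.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: choose the worker that owns a given key.
// Hash-modulo placement is an assumption made for illustration.
std::size_t workerForKey(const std::string& key, std::size_t numWorkers) {
    assert(numWorkers > 0);
    return std::hash<std::string>{}(key) % numWorkers;
}
```

Because the result depends only on the key and the cluster size, every update for a given key is routed to the same worker, where the accumulated state for that key can be kept.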
We periodically receive a set of documents. We want to compute the accumulated word count each time we receive an update.
[Figure (SAMSON distributed file-system): three successive inputs (first, second, third), each followed by a Reduce step. Values shown for MapReduce: 6, 12, 18. Values shown for Redistribution: 6, 6, 6.]
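One reading of the figure is that recomputing the word count from scratch costs more with every update (6, 12, 18), while keeping accumulated state per key keeps the cost of each update constant (6 each time). The sketch below illustrates the accumulated approach under that reading. The names `WordCounts` and `accumulateBatch` are hypothetical, and a single in-memory map stands in for state that SAMSON would distribute by key.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Accumulated state: one running count per word (the key).
using WordCounts = std::map<std::string, unsigned long>;

// Fold one newly received batch of documents into the existing state.
// Only the new batch is scanned; earlier inputs are never reprocessed.
void accumulateBatch(const std::vector<std::string>& documents, WordCounts& state) {
    for (const std::string& doc : documents) {
        std::istringstream words(doc);
        std::string word;
        while (words >> word) {
            ++state[word];
        }
    }
}
```

Calling `accumulateBatch` once per incoming set of documents leaves `state` holding the accumulated counts after each update.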
03 Streaming MapReduce
[Figure: a DELILAH client connected to three SAMSON nodes. Labels: Upload, Download, Run operations, 4 Gb]
[Figure: a DELILAH client connected to three SAMSON nodes. Labels: Upload new operations & data types, Run operations]
Open API for 3rd party developers
[Figure: Map. An operation runs on each of the three SAMSON nodes]
[Figure: Reduce. An operation runs on each of the three SAMSON nodes. Labels: State, Input, Output]
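The Reduce figure labels three streams: State, Input and Output. A minimal sketch of a reduce step with that shape follows, assuming the operation is a running sum per key. The function `reduceWithState` and the types `Key` and `KeyValue` are hypothetical and are not part of the SAMSON API.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Key = std::uint64_t;
using KeyValue = std::pair<Key, std::int64_t>;

// Hypothetical streaming reduce: for each key, combine the stored state with
// the new input values, keep the result as the new state and emit it as output.
std::vector<KeyValue> reduceWithState(const std::vector<KeyValue>& input,
                                      std::map<Key, std::int64_t>& state) {
    std::map<Key, std::int64_t> touched;  // keys updated by this batch
    for (const KeyValue& kv : input) {
        state[kv.first] += kv.second;          // state + input -> new state
        touched[kv.first] = state[kv.first];   // remember the latest value
    }
    // Output: one record per key that received input in this batch.
    return std::vector<KeyValue>(touched.begin(), touched.end());
}
```

Keys that receive no input in a batch keep their state untouched and produce no output, so the work per batch depends on the size of the input rather than on the size of the state.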
04 Eco-system
Top level view…
delilah: console-based client
• Upload data
• Download data
• Run commands
• Platform monitor
samsonPush, samsonPop: binaries to stream data into and out of SAMSON
samsonClient: C++ library to develop new plugins
Examples:
• samsonPush
• samsonPop
module: 3rd-party C++ shared library
• New data types
• New operations
Tools provided for simplified development!
Delilah client (delilah)
SAMSON Module example…
Module simple_mobility
{
    title "Simple mobility example"
    author "Andreu Urruela"
    version "0.1.1"
}

data UserArea
{
    system.String name;
    system.UInt x;
    system.UInt y;
    system.UInt radius;
}

data Position
{
    system.UInt x;
    system.UInt y;
    system.TimeUnix time;
}

…

parser parser_cdrs
{
    out system.UInt simple_mobility.Position
    helpLine "Parse input CDRs to get user-position"
}

class parser_cdrs : public samson::system::SimpleParser
{
    std::vector<char*> words; // Vector used to store words parsed at each line

    void parseLine( char * line , samson::KVWriter *writer )
    {
        // Split line in words
        split_in_words( line, words );

        // Expected format USER_ID CDR X Y time
        if( words.size() < 5 ) return; // No content for a valid instruction
        if( strcmp( words[1] , "CDR" ) != 0 ) return; // Non valid format

        // Set the key (key and value are not declared on the slide)
        key.value = atoll( words[0] );

        // Set the position
        value.set( atoll( words[2] ) , atoll( words[3] ) , atoll( words[4] ) );

        // Emit the key-value
        writer->emit( 0 , &key, &value );
    }
};
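The parser above relies on a `split_in_words` helper that the slide does not show. The following is a possible stand-in, assuming the helper tokenizes the line in place on whitespace; the real SAMSON helper may behave differently (for example, regarding separators or empty fields).

```cpp
#include <cassert>
#include <cctype>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the split_in_words helper used by the parser.
// It splits `line` in place on whitespace: separators are overwritten with
// '\0' and `words` receives a pointer to the start of each word.
void split_in_words(char* line, std::vector<char*>& words) {
    words.clear();
    char* p = line;
    while (*p != '\0') {
        // Skip leading whitespace.
        while (*p != '\0' && std::isspace(static_cast<unsigned char>(*p))) ++p;
        if (*p == '\0') break;
        words.push_back(p);  // start of a word
        // Advance to the end of the word.
        while (*p != '\0' && !std::isspace(static_cast<unsigned char>(*p))) ++p;
        if (*p != '\0') *p++ = '\0';  // terminate the word and move on
    }
}
```

With this stand-in, a line in the expected "USER_ID CDR X Y time" format yields five words, which matches the `words.size() < 5` check in `parseLine`.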
SAMSON ecosystem tools
Tool Description
samsonModuleBootstrap Create necessary files to start developing a new module
samsonModuleParser Create .h .cpp files from a description file
samsonCat Visualize binary content of SAMSON
samsonSetup Edit setup of a SAMSON system
samsonStarter Setup a new SAMSON cluster
samsonData Show a transactional log for debugging
samsonPush Push data from STDIN to a particular queue
samsonPop Pop data from a particular queue to STDOUT
samsonLocal Emulate a local cluster for fast development of modules
05 Architecture
Communication Protocols…
Goal: Platform messages
Why: flexibility, backward compatibility
Goal: Data serialization
Solution: proprietary serialization format
Why: maximum data compression (no field separator), best for fast sequential processing, easy job distribution
Goal: Monitoring
Why: no recompilation needed, best tools for querying (XPath)
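The SAMSON serialization format is proprietary and is not described in these slides. To illustrate what "no field separator" can mean in practice, the sketch below writes integers back to back using a variable-length encoding (7 data bits per byte), so each value marks its own end. This is a swapped-in, well-known technique chosen for illustration; `writeVarint` and `readVarint` are hypothetical names and not SAMSON functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append `value` as a variable-length integer: 7 bits per byte,
// high bit set on every byte except the last one.
void writeVarint(std::uint64_t value, std::vector<std::uint8_t>& out) {
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Read one variable-length integer starting at `pos` and advance `pos`.
std::uint64_t readVarint(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    std::uint64_t value = 0;
    int shift = 0;
    while (true) {
        assert(pos < in.size());
        const std::uint8_t byte = in[pos++];
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) break;  // last byte of this value
        shift += 7;
    }
    return value;
}
```

A record such as a position (x, y, time) can then be stored as three consecutive values and read back sequentially, with small values taking a single byte.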
[Figure: Worker. Engine library (independent development): Disk Manager, NetworkManager, Memory Manager, Process Manager (cores), runtime engine and notification system]
Disk Manager / Network Manager
• Controller to access local disk and network connections
• Asynchronous notifications using the engine notification system
• Multiple threads are used if required
Memory Manager (our retain-release model, similar to Objective-C)
• Simple system to control memory usage
• Used to optimize memory allocation under heavy load
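A retain-release model means that each owner of a buffer explicitly takes and drops a reference, and the buffer is freed when the last reference is dropped. The sketch below shows that pattern under simplifying assumptions: the `Buffer` class is hypothetical, it is not thread-safe, and it does not reflect the actual Memory Manager interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical buffer with manual retain/release reference counting.
class Buffer {
public:
    // Create a buffer of `size` bytes, owned once by the creator.
    static Buffer* create(std::size_t size) { return new Buffer(size); }

    // Take an additional reference.
    void retain() { ++retainCount_; }

    // Drop one reference; the buffer frees itself when none remain.
    // Returns true if the buffer was destroyed by this call.
    bool release() {
        assert(retainCount_ > 0);
        if (--retainCount_ == 0) {
            delete this;
            return true;
        }
        return false;
    }

    std::size_t size() const { return data_.size(); }
    int retainCount() const { return retainCount_; }

private:
    explicit Buffer(std::size_t size) : data_(size), retainCount_(1) {}
    ~Buffer() = default;  // private: destruction only through release()

    std::vector<char> data_;
    int retainCount_;
};
```

Since every allocation goes through one place, a manager built on this pattern can track how much memory is in use and decide when to delay or refuse new allocations.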
Process Manager (similar to Apple's Grand Central Dispatch library)
• System to control independent "heavy" tasks to be executed
• Automatic creation / destruction of threads
• Optional "fork" mode with shared-memory system to get output
Runtime Engine & Notification system
• Inspired by the message-passing system implemented in Objective-C
• Single loop to run all state-update operations
• Thread protection to interact with Disk/Network/Memory/Process Managers
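The combination of a single loop for state updates and thread protection towards the managers can be sketched as a mutex-protected notification queue drained by one thread. The `Engine` class below is a hypothetical illustration of that pattern, not the SAMSON engine interface.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical engine: any thread may post a notification, but every
// notification is executed by the single thread that calls runPending().
class Engine {
public:
    using Notification = std::function<void()>;

    // Thread-safe: may be called by manager threads (disk, network, ...).
    void post(Notification n) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push(std::move(n));
    }

    // Called only from the engine loop; returns how many notifications ran.
    int runPending() {
        int executed = 0;
        while (true) {
            Notification next;
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (pending_.empty()) break;
                next = std::move(pending_.front());
                pending_.pop();
            }
            next();  // run outside the lock so handlers can post again
            ++executed;
        }
        return executed;
    }

private:
    std::mutex mutex_;
    std::queue<Notification> pending_;
};
```

Because handlers only ever run on the loop thread, the state they update needs no locking of its own; only the queue is shared between threads.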
Engine library
[Figure: Worker (multi-core). Block Manager with Disk-Memory balancer, and the Engine library (independent development): Disk Manager, NetworkManager, Memory Manager, Process Manager (cores), runtime engine and notification system]
Block Manager
Maintains a reference to all blocks of data contained in a Worker (on disk or in memory)
It keeps a list sorted by when the blocks will be used (future operations)
• Low-priority blocks are flushed to disk first
• High-priority blocks are loaded from disk first
Connected to the Disk Manager inside the engine using the EngineNotificationSystem
[Figure: the Block Manager schedules write operations and read operations to the DiskManager]
Important: Since the order of blocks changes continuously based on the scheduling of new processing operations, the Block Manager is made aware of the new order and is able to react accordingly.
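The ordering described above, including re-ordering when new operations are scheduled, can be sketched with an index sorted by next use. The `BlockIndex` class is a hypothetical illustration: it only tracks ordering, it does not track whether a block is currently in memory or on disk, and it does not reflect the real Block Manager interface.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical block index ordered by "time of next use":
// a smaller value means the block is needed sooner (higher priority).
class BlockIndex {
public:
    // Insert a block, or move it to a new position when scheduling changes.
    void setNextUse(const std::string& blockId, long nextUse) {
        auto it = nextUseById_.find(blockId);
        if (it != nextUseById_.end()) {
            eraseFromOrder(blockId, it->second);
        }
        nextUseById_[blockId] = nextUse;
        byNextUse_.emplace(nextUse, blockId);
    }

    // Block to load from disk first: the one needed soonest.
    const std::string& loadFirst() const {
        assert(!byNextUse_.empty());
        return byNextUse_.begin()->second;
    }

    // Block to flush to disk first: the one needed last.
    const std::string& flushFirst() const {
        assert(!byNextUse_.empty());
        return byNextUse_.rbegin()->second;
    }

private:
    // Remove the (nextUse, blockId) entry from the ordered view.
    void eraseFromOrder(const std::string& blockId, long nextUse) {
        auto range = byNextUse_.equal_range(nextUse);
        for (auto it = range.first; it != range.second; ++it) {
            if (it->second == blockId) {
                byNextUse_.erase(it);
                return;
            }
        }
    }

    std::map<std::string, long> nextUseById_;
    std::multimap<long, std::string> byNextUse_;
};
```

Calling `setNextUse` again for an existing block models the situation where a newly scheduled operation changes the order of blocks.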
[Figure: Worker (multi-core). Queues Manager (queues txt_cdrs, cdrs, users) and Stream Operations Manager (Operation A, Operation B, Operation C, ordered by priority, with input data); Block Manager with Disk-Memory balancer; Engine library (independent development): Disk Manager, NetworkManager, Memory Manager, Process Manager (cores), runtime engine and notification system]
Queue & Stream Operations Manager
They contain references to all the blocks held in queues and stream operations
Both systems are connected to the Block Manager to inform it about the priority of blocks
The Stream Operations Manager is connected to the ProcessManager to schedule 3rd-party operations in the Engine subsystem
[Figure: Queues Manager (txt_cdrs, cdrs, users) and Stream Operations Manager (Operation A, Operation B, Operation C), ordered by priority]