Simultaneous Analysis of Massive Data Streams in Real Time and Batch

Post on 02-Jul-2015


DESCRIPTION

My WSO2Con Asia 2014 Talk

TRANSCRIPT

Simultaneous Analysis of Massive Data Streams in Real-Time and Batch

Anjana Fernando

Technical Lead

WSO2

Agenda

• How massive data streams are created
• How to receive them
• How to store them
• How to analyze them: batch vs real-time
• The WSO2 Big Data solution
• Demo

Massive Data Streams -> Data Streams with Big Data

What is Big Data?

❏ The 3 Vs
  ❏ Velocity
  ❏ Volume
  ❏ Variety

Image Source: http://akrayasolutions.com/big-data/

Where does it originate from?

• Machine logs
• Social media
• Archives
• Traffic information
• Weather data
• Sensor data (IoT)

What do I do with it?

Create intelligence..

• Should I take an umbrella to work today?
• What is the best route to go back home?
• What are the current market trends?
• Are my servers running healthily?

Protocols used to publish data..

• HTTP
• MQTT
• Zigbee
• Thrift
• Avro
• ProtoBuf

How to store the data?

• Relational databases
• Block data stores -> HDFS
• Column oriented -> HBase, Cassandra
• Document based -> MongoDB, CouchDB
• In-Memory -> VoltDB

[Figure: CAP theorem triangle - Consistency, Availability, Partition tolerance]

How to analyse data?

• Two options:

-> Batch processing: Schedule data processing jobs and receive the processed data later

-> Real-time processing: The queries are executed and the results are retrieved instantly
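The difference between the two options can be sketched on a toy event list; the events, field names, and function names below are illustrative, not from the talk:

```python
# Toy contrast between batch and real-time processing of the same events.
events = [
    {"brand": "nokia", "quantity": 2},
    {"brand": "samsung", "quantity": 1},
    {"brand": "nokia", "quantity": 3},
]

def batch_totals(events):
    """Batch: a scheduled job scans the whole stored data set at once."""
    totals = {}
    for e in events:
        totals[e["brand"]] = totals.get(e["brand"], 0) + e["quantity"]
    return totals

def realtime_totals(event_iter):
    """Real-time: update running totals as each event arrives."""
    totals = {}
    for e in event_iter:
        totals[e["brand"]] = totals.get(e["brand"], 0) + e["quantity"]
        yield dict(totals)  # a result is available after every event

print(batch_totals(events))               # one answer at the end of the job
for snapshot in realtime_totals(iter(events)):
    print(snapshot)                       # an answer after each event
```

Both arrive at the same totals; what differs is *when* the answer becomes available.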

Analysing data..

• Batch processing -> Apache Hadoop: Map/Reduce processing system and a distributed file system
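The Map/Reduce model Hadoop implements can be illustrated in a few lines of single-process Python (no distribution, made-up log lines; a real job would run mappers and reducers on many nodes):

```python
# Minimal map/reduce sketch: map emits (key, value) pairs, a shuffle groups
# them by key, and reduce aggregates each group.
from itertools import groupby
from operator import itemgetter

lines = ["ERROR disk full", "INFO started", "ERROR timeout"]

def mapper(line):
    level = line.split()[0]
    yield (level, 1)                  # emit (key, value) pairs

def reducer(key, values):
    return (key, sum(values))         # aggregate all values for one key

# shuffle: sort/group intermediate pairs by key, as the framework would
pairs = sorted(kv for line in lines for kv in mapper(line))
result = dict(reducer(k, [v for _, v in g])
              for k, g in groupby(pairs, key=itemgetter(0)))
print(result)   # {'ERROR': 2, 'INFO': 1}
```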

Analysing data..

• Batch processing - Data Warehouse -> Apache Hive - a Hadoop-based framework for working on large-scale data stores with SQL-like queries

INSERT OVERWRITE TABLE UserTable
SELECT userName, COUNT(DISTINCT orderID), SUM(quantity)
FROM PhoneSalesTable
WHERE version = "1.0.0"
GROUP BY userName;

Analysing data..

• Batch processing - In-Memory Computing -> Apache Spark - functional programming model, in-memory computing, claimed to be 10x-100x faster than Hadoop
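Spark's functional style can be approximated with plain Python built-ins (no Spark installed; the data and the unit price are made up). A real RDD would distribute these same map/filter/reduceByKey steps across a cluster and keep the intermediate data in memory:

```python
# Plain-Python sketch of Spark's functional chaining.
sales = [("nokia", 2), ("samsung", 1), ("nokia", 3)]

mapped = map(lambda s: (s[0], s[1] * 10), sales)       # map: quantity -> value at 10 per unit
filtered = filter(lambda kv: kv[1] >= 20, mapped)      # filter: keep larger sales

totals = {}                                            # reduceByKey analogue
for brand, value in filtered:
    totals[brand] = totals.get(brand, 0) + value
print(totals)
```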

Analysing data..

• Real-time processing - Stream Processing -> Apache Storm - Distributed and fault-tolerant

[Figure: Storm topology - spouts emit tuples into bolts]
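The spout/bolt topology can be mimicked in one process with generators; the names (sentence_spout, split_bolt, count_bolt) and input sentences are illustrative, not Storm API:

```python
# Toy pipeline in the shape of a Storm topology: a spout emits tuples,
# bolts transform and aggregate them.
from collections import Counter

def sentence_spout():
    """Spout: the source that emits tuples into the topology."""
    for sentence in ["the cat", "the dog"]:
        yield sentence

def split_bolt(stream):
    """Bolt: transforms incoming tuples (here, split into words)."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Bolt: terminal aggregation over the word stream."""
    return Counter(stream)

counts = count_bolt(split_bolt(sentence_spout()))
print(counts)
```

In real Storm the spout and each bolt run as distributed, fault-tolerant workers; here they are just chained generators.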

Analysing data..

• Real-time processing - Complex Event Processing -> WSO2 Siddhi - the CEP engine used in WSO2 CEP
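A Siddhi query is written in SiddhiQL; the sketch below is an assumed example over the talk's phone-sales stream (the stream and output names are illustrative), filtering high-value sales and aggregating them over a sliding time window:

```sql
define stream phoneSalesStream (brand string, quantity int, total int, user string);

from phoneSalesStream[total > 100]#window.time(1 min)
select brand, sum(quantity) as totalQuantity
group by brand
insert into highValueSalesStream;
```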

Big Data Architecture with WSO2..

• Data Streams

{
  'name': 'phone.retail.shop',
  'version': '1.0.0',
  'nickName': 'Phone_Retail_Shop',
  'description': 'Phone Sales',
  'metaData': [
    {'name': 'clientType', 'type': 'STRING'}
  ],
  'payloadData': [
    {'name': 'brand', 'type': 'STRING'},
    {'name': 'quantity', 'type': 'INT'},
    {'name': 'total', 'type': 'INT'},
    {'name': 'user', 'type': 'STRING'}
  ]
}

The common stream format used in both CEP and BAM; the stream definition contains the stream name, version, and the other attributes that make up the stream.
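A quick sketch of reading such a definition in Python, shortened to two payload fields; strict JSON requires double quotes, so the slide's single quotes are normalized first (illustrative only, not the BAM/CEP loader):

```python
# Parse a (shortened) stream definition like the one on the slide.
import json

slide_text = """
{'name':'phone.retail.shop','version':'1.0.0',
 'payloadData':[{'name':'brand','type':'STRING'},
                {'name':'quantity','type':'INT'}]}
"""
definition = json.loads(slide_text.replace("'", '"'))

stream_id = f"{definition['name']}:{definition['version']}"
fields = [(f["name"], f["type"]) for f in definition["payloadData"]]
print(stream_id, fields)
```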

Big Data Architecture with WSO2..

• WSO2 BAM
-> Data Receiver - high-performance binary-format data publishing with Apache Thrift, shared with WSO2 CEP
-> Data Storage - Cassandra for a highly scalable data store
-> Data Analyzer - Hive-based batch processing

Big Data Architecture with WSO2..

• WSO2 BAM..
-> Activity Monitoring - implemented using a custom indexing mechanism to instantly search for events of a specific activity in the system

Big Data Architecture with WSO2..

• WSO2 BAM..
-> Incremental Data Processing - customized Hive to support incremental data processing:

@Incremental(name="salesAnalysis", tables="PhoneSalesTable")
SELECT brandName, COUNT(DISTINCT orderID), SUM(quantity)
FROM PhoneSalesTable
WHERE version = "1.0.0"
GROUP BY brandName;
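The idea behind incremental processing is to remember how far previous runs got and only aggregate newer rows. The sketch below illustrates the concept with a timestamp high-water mark; it is an assumed simplification, not BAM's actual Hive customization:

```python
# Concept sketch: each run scans only rows newer than the last run's
# high-water mark, then advances the mark.
rows = [
    {"ts": 1, "brand": "nokia", "quantity": 2},
    {"ts": 2, "brand": "nokia", "quantity": 1},
]
state = {"last_ts": 0, "totals": {}}

def incremental_run(rows, state):
    new_rows = [r for r in rows if r["ts"] > state["last_ts"]]
    for r in new_rows:
        state["totals"][r["brand"]] = state["totals"].get(r["brand"], 0) + r["quantity"]
    if new_rows:
        state["last_ts"] = max(r["ts"] for r in new_rows)
    return len(new_rows)                 # how many rows this run processed

print(incremental_run(rows, state))      # first run scans both existing rows
rows.append({"ts": 3, "brand": "samsung", "quantity": 5})
print(incremental_run(rows, state))      # second run scans only the new row
```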

Big Data Architecture with WSO2..

• WSO2 CEP
-> Same data receiver as BAM - this is the point where each event is sent to both servers: BAM for batch processing and CEP for real-time processing of the same data streams
-> Real-time in-memory processing based on the WSO2 Siddhi engine, with data adapters for receiving and sending events over different data types and transports, e.g. XML, JSON, Text, HTTP, JMS, SMTP

Demo

Questions?

Thank you!
