Simultaneous Analysis of Massive Data Streams in Real Time and Batch

Post on 02-Jul-2015


DESCRIPTION

My WSO2Con Asia 2014 Talk

TRANSCRIPT

Simultaneous Analysis of Massive Data Streams in Real-Time and Batch

Anjana Fernando

Technical Lead

WSO2

Agenda

• How massive data streams are created
• How to receive them
• How to store them
• How to analyze them: batch vs real-time
• The WSO2 Big Data solution
• Demo

Massive Data Streams -> Data Streams with Big Data

What is Big Data?

❏ The 3 Vs
  ❏ Velocity
  ❏ Volume
  ❏ Variety

Image Source: http://akrayasolutions.com/big-data/

Where does it originate from?

• Machine logs
• Social media
• Archives
• Traffic information
• Weather data
• Sensor data (IoT)

What do I do with it?

Create intelligence..

• Should I take an umbrella to work today?
• What is the best route to go back home?
• What are the current market trends?
• Are my servers running healthily?

Protocols used to publish data..

• HTTP
• MQTT
• Zigbee
• Thrift
• Avro
• ProtoBuf

How to store the data?

• Relational databases
• Block data stores -> HDFS
• Column oriented -> HBase, Cassandra
• Document based -> MongoDB, CouchDB
• In-Memory -> VoltDB

[Figure: CAP theorem triangle - Consistency, Availability, Partition tolerance]

How to analyse data?

• Two options:

-> Batch processing: Schedule data processing jobs and receive the processed data later

-> Real-time processing: The queries are executed and the results are retrieved instantly
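The difference between the two options can be sketched on a toy event list; the events, field names, and function names below are illustrative, not from the talk:

```python
# Toy contrast between batch and real-time processing of the same events.
events = [
    {"brand": "nokia", "quantity": 2},
    {"brand": "samsung", "quantity": 1},
    {"brand": "nokia", "quantity": 3},
]

def batch_totals(events):
    """Batch: a scheduled job scans the whole stored data set at once."""
    totals = {}
    for e in events:
        totals[e["brand"]] = totals.get(e["brand"], 0) + e["quantity"]
    return totals

def realtime_totals(event_iter):
    """Real-time: update running totals as each event arrives."""
    totals = {}
    for e in event_iter:
        totals[e["brand"]] = totals.get(e["brand"], 0) + e["quantity"]
        yield dict(totals)  # a result is available after every event

print(batch_totals(events))               # one answer at the end of the job
for snapshot in realtime_totals(iter(events)):
    print(snapshot)                       # an answer after each event
```

Both arrive at the same totals; what differs is *when* the answer becomes available.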

Analysing data..

• Batch processing -> Apache Hadoop: Map/Reduce processing system and a distributed file system
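The Map/Reduce model Hadoop implements can be illustrated in a few lines of single-process Python (no distribution, made-up log lines; a real job would run mappers and reducers on many nodes):

```python
# Minimal map/reduce sketch: map emits (key, value) pairs, a shuffle groups
# them by key, and reduce aggregates each group.
from itertools import groupby
from operator import itemgetter

lines = ["ERROR disk full", "INFO started", "ERROR timeout"]

def mapper(line):
    level = line.split()[0]
    yield (level, 1)                  # emit (key, value) pairs

def reducer(key, values):
    return (key, sum(values))         # aggregate all values for one key

# shuffle: sort/group intermediate pairs by key, as the framework would
pairs = sorted(kv for line in lines for kv in mapper(line))
result = dict(reducer(k, [v for _, v in g])
              for k, g in groupby(pairs, key=itemgetter(0)))
print(result)   # {'ERROR': 2, 'INFO': 1}
```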

Analysing data..

• Batch processing - Data Warehouse -> Apache Hive - a Hadoop-based framework for working on large-scale data stores with SQL-like queries

INSERT OVERWRITE TABLE UserTable
SELECT userName, COUNT(DISTINCT orderID), SUM(quantity)
FROM PhoneSalesTable
WHERE version = "1.0.0"
GROUP BY userName;

Analysing data..

• Batch processing - In-Memory Computing -> Apache Spark - functional programming model, in-memory computing, claimed to be 10x-100x faster than Hadoop
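Spark's functional style can be approximated with plain Python built-ins (no Spark installed; the data and the unit price are made up). A real RDD would distribute these same map/filter/reduceByKey steps across a cluster and keep the intermediate data in memory:

```python
# Plain-Python sketch of Spark's functional chaining.
sales = [("nokia", 2), ("samsung", 1), ("nokia", 3)]

mapped = map(lambda s: (s[0], s[1] * 10), sales)       # map: quantity -> value at 10 per unit
filtered = filter(lambda kv: kv[1] >= 20, mapped)      # filter: keep larger sales

totals = {}                                            # reduceByKey analogue
for brand, value in filtered:
    totals[brand] = totals.get(brand, 0) + value
print(totals)
```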

Analysing data..

• Real-time processing - Stream Processing -> Apache Storm - Distributed and fault-tolerant

[Figure: Storm topology - spouts emit tuples into bolts]
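The spout/bolt topology can be mimicked in one process with generators; the names (sentence_spout, split_bolt, count_bolt) and input sentences are illustrative, not Storm API:

```python
# Toy pipeline in the shape of a Storm topology: a spout emits tuples,
# bolts transform and aggregate them.
from collections import Counter

def sentence_spout():
    """Spout: the source that emits tuples into the topology."""
    for sentence in ["the cat", "the dog"]:
        yield sentence

def split_bolt(stream):
    """Bolt: transforms incoming tuples (here, split into words)."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Bolt: terminal aggregation over the word stream."""
    return Counter(stream)

counts = count_bolt(split_bolt(sentence_spout()))
print(counts)
```

In real Storm the spout and each bolt run as distributed, fault-tolerant workers; here they are just chained generators.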

Analysing data..

• Real-time processing - Complex Event Processing -> WSO2 Siddhi - the CEP engine used in WSO2 CEP
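A Siddhi query is written in SiddhiQL; the sketch below is an assumed example over the talk's phone-sales stream (the stream and output names are illustrative), filtering high-value sales and aggregating them over a sliding time window:

```sql
define stream phoneSalesStream (brand string, quantity int, total int, user string);

from phoneSalesStream[total > 100]#window.time(1 min)
select brand, sum(quantity) as totalQuantity
group by brand
insert into highValueSalesStream;
```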

Big Data Architecture with WSO2..

• Data Streams

{
  'name': 'phone.retail.shop',
  'version': '1.0.0',
  'nickName': 'Phone_Retail_Shop',
  'description': 'Phone Sales',
  'metaData': [
    {'name': 'clientType', 'type': 'STRING'}
  ],
  'payloadData': [
    {'name': 'brand', 'type': 'STRING'},
    {'name': 'quantity', 'type': 'INT'},
    {'name': 'total', 'type': 'INT'},
    {'name': 'user', 'type': 'STRING'}
  ]
}

The common stream format used in both CEP and BAM; the stream definition contains the stream name, version, and the other attributes that make up the stream.
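A quick sketch of reading such a definition in Python, shortened to two payload fields; strict JSON requires double quotes, so the slide's single quotes are normalized first (illustrative only, not the BAM/CEP loader):

```python
# Parse a (shortened) stream definition like the one on the slide.
import json

slide_text = """
{'name':'phone.retail.shop','version':'1.0.0',
 'payloadData':[{'name':'brand','type':'STRING'},
                {'name':'quantity','type':'INT'}]}
"""
definition = json.loads(slide_text.replace("'", '"'))

stream_id = f"{definition['name']}:{definition['version']}"
fields = [(f["name"], f["type"]) for f in definition["payloadData"]]
print(stream_id, fields)
```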

Big Data Architecture with WSO2..

• WSO2 BAM
-> Data Receiver - high-performance binary-format data publishing with Apache Thrift, shared with WSO2 CEP
-> Data Storage - Cassandra for a highly scalable data store
-> Data Analyzer - Hive-based batch processing

Big Data Architecture with WSO2..

• WSO2 BAM..
-> Activity Monitoring - implemented using a custom indexing mechanism to instantly search for events of a specific activity in the system

Big Data Architecture with WSO2..

• WSO2 BAM..
-> Incremental Data Processing - customized Hive to support incremental data processing:

@Incremental(name="salesAnalysis", tables="PhoneSalesTable")
SELECT brandName, COUNT(DISTINCT orderID), SUM(quantity)
FROM PhoneSalesTable
WHERE version = "1.0.0"
GROUP BY brandName;
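The idea behind incremental processing is to remember how far previous runs got and only aggregate newer rows. The sketch below illustrates the concept with a timestamp high-water mark; it is an assumed simplification, not BAM's actual Hive customization:

```python
# Concept sketch: each run scans only rows newer than the last run's
# high-water mark, then advances the mark.
rows = [
    {"ts": 1, "brand": "nokia", "quantity": 2},
    {"ts": 2, "brand": "nokia", "quantity": 1},
]
state = {"last_ts": 0, "totals": {}}

def incremental_run(rows, state):
    new_rows = [r for r in rows if r["ts"] > state["last_ts"]]
    for r in new_rows:
        state["totals"][r["brand"]] = state["totals"].get(r["brand"], 0) + r["quantity"]
    if new_rows:
        state["last_ts"] = max(r["ts"] for r in new_rows)
    return len(new_rows)                 # how many rows this run processed

print(incremental_run(rows, state))      # first run scans both existing rows
rows.append({"ts": 3, "brand": "samsung", "quantity": 5})
print(incremental_run(rows, state))      # second run scans only the new row
```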

Big Data Architecture with WSO2..

• WSO2 CEP
-> Same data receiver as BAM - this is the point where each event is sent to both servers: BAM for batch processing and CEP for real-time processing of the same data streams
-> Real-time in-memory processing based on the WSO2 Siddhi engine, with data adapters for receiving and sending events over different data types and transports, e.g. XML, JSON, Text, HTTP, JMS, SMTP

Demo

Questions?

Thank you!
