simultaneous analysis of massive data streams in real time and batch

23
Simultaneous Analysis of Massive Data Streams in Real-Time and Batch Anjana Fernando Technical Lead WSO2

Upload: anjana-fernando

Post on 02-Jul-2015

104 views

Category:

Technology


0 download

DESCRIPTION

My WSO2Con Asia 2014 Talk

TRANSCRIPT

Page 1: Simultaneous analysis of massive data streams in real time and batch

Simultaneous Analysis of Massive Data Streams in Real-Time and Batch

Anjana Fernando

Technical Lead

WSO2

Page 2: Simultaneous analysis of massive data streams in real time and batch

Agenda

• How massive data streams created• How to receive• How to store• How to analyze, batch vs real-time• WSO2 Big Data solution• Demo

Page 3: Simultaneous analysis of massive data streams in real time and batch

Massive Data Streams -> Data Streams with Big Data

Page 4: Simultaneous analysis of massive data streams in real time and batch

What is Big Data?

❏ The 3 Vs❏ Velocity❏ Volume❏ Variety

Image Source: http://akrayasolutions.com/big-data/

Page 5: Simultaneous analysis of massive data streams in real time and batch

Where does it originate from?

• Machine logs• Social media• Archives• Traffic information• Weather data• Sensor data (IoT)

Page 6: Simultaneous analysis of massive data streams in real time and batch

What do I do with it?

Create intelligence..

• Should I take an umbrella to work today?• What is the best route to go back home?• What are the current market trends?• Are my servers running healthily?

Page 7: Simultaneous analysis of massive data streams in real time and batch

Protocols used to publish data..

• HTTP• MQTT • Zigbee• Thrift• Avro• ProtoBuf

Page 8: Simultaneous analysis of massive data streams in real time and batch

How to store the data?• Relational databases • Block data stores -> HDFS• Column oriented -> HBase -> Cassandra• Document based -> MongoDB -> CouchDB• In-Memory -> VoltDB

A

C P

Page 9: Simultaneous analysis of massive data streams in real time and batch

How to analyse data?

• Two options:

-> Batch processing: Schedule data processing jobs and receive the processed data later

-> Real-time processing: The queries are executed and the results are retrieved instantly

Page 10: Simultaneous analysis of massive data streams in real time and batch

Analysing data..

• Batch processing -> Apache Hadoop: Map/Reduce processing system and a distributed file system

Page 11: Simultaneous analysis of massive data streams in real time and batch

Analysing data..

• Batch processing - Data Warehouse -> Apache Hive - Hadoop based framework for working on large scale data stores with SQL-like queries

INSERT OVERWRITE TABLE UserTable SELECT userName, COUNT(DISTINCT orderID),SUM(quantity) FROM PhoneSalesTable WHERE version= "1.0.0" GROUP BY userName;

Page 12: Simultaneous analysis of massive data streams in real time and batch

Analysing data..

• Batch processing - In-Memory Computing -> Apache Spark - Functional programming model, in-memory computing, claims 10x - 100x faster than Hadoop

Page 13: Simultaneous analysis of massive data streams in real time and batch

Analysing data..

• Real-time processing - Stream Processing -> Apache Storm - Distributed and fault-tolerant

Spouts Bolts

Page 14: Simultaneous analysis of massive data streams in real time and batch

Analysing data..

• Real-time processing - Complex Event Processing -> WSO2 Siddhi:

Page 15: Simultaneous analysis of massive data streams in real time and batch
Page 16: Simultaneous analysis of massive data streams in real time and batch

Big Data Architecture with WSO2..

• Data Streams {

'name':'phone.retail.shop','version':'1.0.0','nickName': 'Phone_Retail_Shop','description': 'Phone Sales','metaData':[ {'name':'clientType','type':'STRING'}],'payloadData':[ {'name':'brand','type':'STRING'}, {'name':'quantity','type':'INT'}, {'name':'total','type':'INT'}, {'name':'user','type':'STRING'}]

}

The common stream format used in both CEP and BAM; The stream definition contains the stream name, version and other attributes that makes up the stream.

Page 17: Simultaneous analysis of massive data streams in real time and batch

Big Data Architecture with WSO2..

• WSO2 BAM-> Data Receiver - High performance binary format data publishing with Apache Thrift, shared with WSO2 CEP-> Data Storage - Cassandra for highly scalable data store-> Data Analyzer - Hive based batch processing

Page 18: Simultaneous analysis of massive data streams in real time and batch

Big Data Architecture with WSO2..

• WSO2 BAM..-> Activity Monitoring: Implemented using a custom indexing mechanism to instantly search for events of a specific activity in the system

Page 19: Simultaneous analysis of massive data streams in real time and batch

Big Data Architecture with WSO2..

• WSO2 BAM..-> Incremental Data Processing - Customized Hive to support incremental data processing:

@Incremental(name="salesAnalysis" , tables="PhoneSalesTable") SELECT brandname, Count(DISTINCT orderid), Sum(quantity) FROM phonesalestable WHERE version = "1.0.0" GROUP BY brandname;

Page 20: Simultaneous analysis of massive data streams in real time and batch

Big Data Architecture with WSO2..

• WSO2 CEP-> Same data receiver as BAM, where this is the point where the same event is sent to both servers, where BAM for batch processing and CEP for real-time processing of the same data streams-> Real-time in-memory processing, based on WSO2 Siddhi engine, with data adapters for receiving and sending event with different data types and transports, e.g. XML, JSON, Text, HTTP, JMS, SMTP

Page 21: Simultaneous analysis of massive data streams in real time and batch

Demo

Page 22: Simultaneous analysis of massive data streams in real time and batch

Questions?

Page 23: Simultaneous analysis of massive data streams in real time and batch

Thank you!