simultaneous analysis of massive data streams in real time and batch
DESCRIPTION
My WSO2Con Asia 2014 TalkTRANSCRIPT
![Page 1: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/1.jpg)
Simultaneous Analysis of Massive Data Streams in Real-Time and Batch
Anjana Fernando
Technical Lead
WSO2
![Page 2: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/2.jpg)
Agenda
• How massive data streams created• How to receive• How to store• How to analyze, batch vs real-time• WSO2 Big Data solution• Demo
![Page 3: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/3.jpg)
Massive Data Streams -> Data Streams with Big Data
![Page 4: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/4.jpg)
What is Big Data?
❏ The 3 Vs❏ Velocity❏ Volume❏ Variety
Image Source: http://akrayasolutions.com/big-data/
![Page 5: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/5.jpg)
Where does it originate from?
• Machine logs• Social media• Archives• Traffic information• Weather data• Sensor data (IoT)
![Page 6: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/6.jpg)
What do I do with it?
Create intelligence..
• Should I take an umbrella to work today?• What is the best route to go back home?• What are the current market trends?• Are my servers running healthily?
![Page 7: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/7.jpg)
Protocols used to publish data..
• HTTP• MQTT • Zigbee• Thrift• Avro• ProtoBuf
![Page 8: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/8.jpg)
How to store the data?• Relational databases • Block data stores -> HDFS• Column oriented -> HBase -> Cassandra• Document based -> MongoDB -> CouchDB• In-Memory -> VoltDB
A
C P
![Page 9: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/9.jpg)
How to analyse data?
• Two options:
-> Batch processing: Schedule data processing jobs and receive the processed data later
-> Real-time processing: The queries are executed and the results are retrieved instantly
![Page 10: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/10.jpg)
Analysing data..
• Batch processing -> Apache Hadoop: Map/Reduce processing system and a distributed file system
![Page 11: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/11.jpg)
Analysing data..
• Batch processing - Data Warehouse -> Apache Hive - Hadoop based framework for working on large scale data stores with SQL-like queries
INSERT OVERWRITE TABLE UserTable SELECT userName, COUNT(DISTINCT orderID),SUM(quantity) FROM PhoneSalesTable WHERE version= "1.0.0" GROUP BY userName;
![Page 12: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/12.jpg)
Analysing data..
• Batch processing - In-Memory Computing -> Apache Spark - Functional programming model, in-memory computing, claims 10x - 100x faster than Hadoop
![Page 13: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/13.jpg)
Analysing data..
• Real-time processing - Stream Processing -> Apache Storm - Distributed and fault-tolerant
Spouts Bolts
![Page 14: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/14.jpg)
Analysing data..
• Real-time processing - Complex Event Processing -> WSO2 Siddhi:
![Page 15: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/15.jpg)
![Page 16: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/16.jpg)
Big Data Architecture with WSO2..
• Data Streams {
'name':'phone.retail.shop','version':'1.0.0','nickName': 'Phone_Retail_Shop','description': 'Phone Sales','metaData':[ {'name':'clientType','type':'STRING'}],'payloadData':[ {'name':'brand','type':'STRING'}, {'name':'quantity','type':'INT'}, {'name':'total','type':'INT'}, {'name':'user','type':'STRING'}]
}
The common stream format used in both CEP and BAM; The stream definition contains the stream name, version and other attributes that makes up the stream.
![Page 17: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/17.jpg)
Big Data Architecture with WSO2..
• WSO2 BAM-> Data Receiver - High performance binary format data publishing with Apache Thrift, shared with WSO2 CEP-> Data Storage - Cassandra for highly scalable data store-> Data Analyzer - Hive based batch processing
![Page 18: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/18.jpg)
Big Data Architecture with WSO2..
• WSO2 BAM..-> Activity Monitoring: Implemented using a custom indexing mechanism to instantly search for events of a specific activity in the system
![Page 19: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/19.jpg)
Big Data Architecture with WSO2..
• WSO2 BAM..-> Incremental Data Processing - Customized Hive to support incremental data processing:
@Incremental(name="salesAnalysis" , tables="PhoneSalesTable") SELECT brandname, Count(DISTINCT orderid), Sum(quantity) FROM phonesalestable WHERE version = "1.0.0" GROUP BY brandname;
![Page 20: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/20.jpg)
Big Data Architecture with WSO2..
• WSO2 CEP-> Same data receiver as BAM, where this is the point where the same event is sent to both servers, where BAM for batch processing and CEP for real-time processing of the same data streams-> Real-time in-memory processing, based on WSO2 Siddhi engine, with data adapters for receiving and sending event with different data types and transports, e.g. XML, JSON, Text, HTTP, JMS, SMTP
![Page 21: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/21.jpg)
Demo
![Page 22: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/22.jpg)
Questions?
![Page 23: Simultaneous analysis of massive data streams in real time and batch](https://reader033.vdocuments.us/reader033/viewer/2022060121/5594777e1a28abaa388b4589/html5/thumbnails/23.jpg)
Thank you!