Malhar DataTorrent (Big Data Gurus meetup, 2014-02-27)
DESCRIPTION
Architects from DataTorrent talk about the Malhar framework for processing streaming events in real time at massive scale.
TRANSCRIPT
Hadoop’s Most Powerful Platform for Real-Time Stream Computations
Prepared for Big Data Gurus
February 27th, 2014
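[Figure: timeline of data processing relative to "Now". Data processed by Hadoop (batch) feeds databases (HBase, Oracle, …), standard reports and ad hoc queries, with latencies of hours. Data processed by DataTorrent (real-time) is available within seconds to milliseconds.]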
Real-time ETL and Business Insights
Real-time Predictive Analytics
Real-time Business Actions
Real-time Business logic with HA
Real-time Monitoring and Alerting
Big Data – Done NOW
Business Decisions in Less than a Second
DataTorrent Big Data Platform
• Vision: Ubiquitize Real-Time Big Data Computations
– Enterprise quality: Highly Available, Linearly Scalable, Operable and Easy to Use
– Big data dimensional computations in real time with linear scalability
• Real-Time ETL: De-dup, Staging, Cleanup, Transformations, Load …
• Real-Time Computation Apps and Feed Ingestion (Games, Mobile, Set-top Boxes, Devices, …)
– Multi-Feed Sources
– Run business logic in real-time with HA
• Real-Time Monitoring, and Security: Capacity, DDoS, …
• Real-Time Predictive Analytics: Web Analytics, Business Analytics, …
DataTorrent in Hadoop Ecosystem
© DataTorrent, 2014 - Confidential
DataTorrent in the Modern Data Architecture
[Diagram: layers labeled APPLICATIONS, DATA SYSTEMS and SOURCES. Labeled components: RDBMS, EDW, HANA, BusinessObjects BI, Business Analytics, Business Intelligence Tools, OLAP Clients, OPERATIONAL TOOLS, DEV & DATA TOOLS, INFRASTRUCTURE, Existing Sources (CRM, ERP, Clickstream, Logs), Emerging Sources (Sensor, Sentiment, Geo, Unstructured), and DATA IN MOTION: Real-time Stream Analytics.]
DataTorrent Technology Stack
• StrAM (Stream Application Master): Security, SLA, Scalability, Alerts, Fault Tolerance, Tools, Partitioning, Web Services, Dynamic Modifications, State Snapshot
• Malhar: Open Source Operators and Apps Library (Apache v2 License)
DataTorrent in Hadoop Reference Architecture
[Diagram: reference architecture with the following labeled groups.
SOURCE DATA: MS Queues, Events, Files, Databases, Sensor data, Social.
DATA IN MOTION, REAL TIME STREAMING APPLICATIONS (on YARN and HDFS): Real Time Ingestion, Message Queue, STREAM ETL, REALTIME ALERTS, OPERATIONAL INTELLIGENCE, PREDICTIVE ANALYTICS, BUSINESS ACTIONS.
DATA AT REST, BATCH APPLICATIONS (on YARN, MapReduce and HDFS): Hive, Pig, HBase, Custom.
Enterprise Repositories: RDBMS, EDW, NoSQL.
APPLICATIONS: BusinessObjects BI; Query/Visualization/Reporting/Analytical Tools and Apps.]
Stream Processing
• A Stream is a sequence of data events with a schema
• An Operator takes input streams and computes output streams
• An Application is a Directed Acyclic Graph (DAG) of operators connected by streams
• Computations are in-memory, asynchronous and distributed
• A Streaming Window is an atomic batch of sequential data events
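The terms above can be made concrete with a small sketch. This is not the Malhar API: the class and function names (Operator, Parse, CountByPage, run) and the 'user,page' event format are hypothetical, chosen only for illustration. The sketch also runs synchronously in a single process, whereas the platform described here is asynchronous and distributed. It shows events with a schema flowing through a two-operator DAG, with each streaming window handled as one atomic batch.

```python
from collections import Counter
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Operator:
    """Hypothetical base class: consumes input events, computes output events."""

    def begin_window(self):
        pass

    def process(self, event):
        return []  # events emitted immediately for this input event

    def end_window(self):
        return []  # events emitted when the streaming window closes


class Parse(Operator):
    """Turns a raw 'user,page' line into a dict event (the stream's schema)."""

    def process(self, event):
        user, page = event.split(",")
        return [{"user": user, "page": page}]


class CountByPage(Operator):
    """Counts page views within one window and emits totals at window end."""

    def begin_window(self):
        self.counts = Counter()  # state is reset per streaming window

    def process(self, event):
        self.counts[event["page"]] += 1
        return []

    def end_window(self):
        return sorted(self.counts.items())


def run(operators, streams, source, windows):
    """Push each window (a batch of events) through the DAG in topological order.

    operators: name -> Operator; streams: (upstream, downstream) name pairs;
    source: operator that receives the raw input; windows: list of batches.
    """
    preds = {name: set() for name in operators}
    for up, down in streams:
        preds[down].add(up)
    order = list(TopologicalSorter(preds).static_order())

    results = []
    for batch in windows:
        inbox = {name: [] for name in operators}
        inbox[source] = list(batch)
        for name in order:
            op = operators[name]
            op.begin_window()
            out = []
            for event in inbox[name]:
                out.extend(op.process(event))
            out.extend(op.end_window())
            downstream = [d for u, d in streams if u == name]
            for d in downstream:
                inbox[d].extend(out)
            if not downstream:  # sink operator: collect its output per window
                results.append(out)
    return results


windows = [["a,/home", "b,/home", "a,/cart"], ["c,/cart"]]
output = run(
    operators={"parse": Parse(), "count": CountByPage()},
    streams=[("parse", "count")],
    source="parse",
    windows=windows,
)
print(output)  # [[('/cart', 1), ('/home', 2)], [('/cart', 1)]]
```

Each inner list is the result for one streaming window. Because CountByPage resets its state in begin_window, a window's counts never mix with the next window's, which is the sense in which a window is treated as an atomic batch here.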
DataTorrent Hadoop GRID
[Diagram: a Hadoop cluster with a Resource Manager and four Node Managers (NM). StrAM and application components numbered 1 to 6 run on the nodes alongside MapReduce jobs. DT Gateway, dtCLI and DT Console connect to the cluster.]
Malhar Open Source Project
• Apache 2.0
• Operators (over 400 operators)
– Algorithms
– Ingestion, ETL
– Input and Output Adapters
• UI Widgets (over 50 widgets)
– Console widgets for stats
– Application widgets for app data
• Application Templates
– LogStream
– MapReduce Debugger
– Shuffle-less MapReduce
• Demo Apps (15 demo apps)
Malhar Open Source Project
• Continuous Integration: Unit tests
• Performance tests for operators
• Daily tests of Demos and Apps for memory usage
• More operators and UI widgets added as per new use cases/user requests
• Fully supported: Documentation, Certification
• Input and Output Adapters
– HBase, MongoDB, CouchDB, Redis, Memcache
– Flume, Kafka, RabbitMQ, ActiveMQ, ZeroMQ
– JDBC, MySQL, DerbyDB, TimesTen
– MQTT, Twitter, RSS, HTTP, WebSockets, Socket
– Logs: Apache, SMTP
– DFS, Local cache (Guava)
• Languages: Java, Python, JavaScript, Script, R
DataTorrent’s Platform Differentiators.
Extreme Scalability
• Automatically scales to changing loads. Massive performance per node. Billions of events/sec.
• Sub-second latency with linear scalability.
• Complex monitoring applications with massive computations.
Mission Critical
• Built-in Stateful Fault-tolerance. 24/7 uptime - Highly Available.
• Predictive Analysis, Root cause. Real-time ETL.
• Update your application while it's running! A/B testing (2H 2014).
Hadoop-Native
• Develop faster and implement any business logic with our open-source framework.
• Runs on your existing Apache Hadoop cluster.
• Integrate seamlessly with your existing data flow and monitoring stack.
Live Demonstration
Thank You!
• DataTorrent
• Try Sandbox (https://datatorrent.com)
• Free for
• Startup Program (Contact us for more details)
• Up to 25GB memory usage in production
• Non-production clusters
• Malhar Open Source (Apache 2.0) project
• https://github.com/DataTorrent/Malhar
• [email protected]
• Applications available Jan 2014
• LogStream Application
• Map-Reduce Monitor
DataTorrent Inc.
3200 Patrick Henry, 2nd Fl
Santa Clara, CA 95054
Twitter.com/DataTorrent
Facebook.com/DataTorrent