big data landscape
DESCRIPTION
An overview of several technologies that make up the Big Data landscape. It opens with an introduction to the technology challenges of Big Data, followed by key open-source components that help with various big data aspects such as OLAP, real-time online analytics, and machine learning on Map-Reduce. I conclude with an enumeration of the key areas where those technologies are most likely to unlock new opportunities for businesses.
TRANSCRIPT
Big data: The technology landscape and its applications.
Natalino Busa - 12 Feb. 2013
Outline
● Big Data: Who art thou?
● Big Data: The technology landscape
● Hadoop: Overview
● Analytics & Machine Learning
● Opportunities
Hype cycle on new IT technologies
Gartner 2012
What is big data?
Velocity Diversity Volume
Hardware Software Services
BIG DATA
DATA (structured and un-structured, Logs, ETL, social)
Marketing (e.g. Unica), Analytics (Tableau), Modeling (SAS)
RDBMS, OLAP, Messaging
Infrastructure, (Private) Cloud, Networking
Big Data Heat map
How big is big?
SkyTree (tm) defines the Analytics Requirements Index (ARI):
ARI = (# Rows × # Columns) / Time (secs)
Where # Rows = Number of records being analyzed
# Columns = Number of variables captured in each record
Time (secs) = The timeframe within which to complete the analysis
Example: to produce a personalized banner for each view (1000 views/sec), I need to analyze 100 variables on 1000 records (historic data) every 1 ms.
ARI = (1000*100)/0.001 = 100 M values/sec
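The ARI arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `analytics_requirements_index` and its parameter names are illustrative choices, not part of any SkyTree API.

```python
def analytics_requirements_index(rows: int, columns: int, seconds: float) -> float:
    """Values that must be analyzed per second: (rows * columns) / seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return rows * columns / seconds


# Banner example from the slide: 1000 records, 100 variables, 1 ms per view.
banner_ari = analytics_requirements_index(rows=1000, columns=100, seconds=0.001)
print(f"{banner_ari:,.0f} values/sec")
```

Halving the time budget or doubling the number of variables doubles the index, which is why tight response times push a modest data set into "big data" territory.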
What data?
Big Data can imply:
● Complex data refactoring in batch (lots of rows)
● Real-time event processing (high-speed responses)
● Multidimensional analysis (lots of parameters)
● ... or any combination of those three
[Figure axes: Parameters, Entities, Response time]
More data
Database → Databases → Federated Data → Aggregated Data → Linked Data → Just Data
Structured → Unstructured
customers
customers + products
customers + products + surveys
customers + products + surveys + transactions
customers + products + surveys + transactions + social messages
● In today's IT environments there is a gradual shift from structured data to unstructured data.
● RDBMSs are well suited to structured data, but they require more, and more complex, ETL. How do we deal with new data (structures)?
● Map-Reduce and NoSQL systems are good with unstructured data, but how do we query and analyze this data?
Big Data: how to deal with it
● Big Data at rest (storage, access)
● Big Data in motion (streaming, dataflows)
● Big Data analytics (OLAP, OTAP, BI)
● Big Data modeling (predictive, machine learning)
Big Data at rest
Analytical RDBMSs (EDW): Oracle, IBM, and various MPPs
Hadoop distributed systems: HDFS (distributed file system), HBase (Bigtable)
[Diagram labels: HDFS, Logs, Batch, Real-time, EDW, Analytics, Cassandra, HBase]
● Traditional EDW and distributed Big Data / NoSQL solutions are complementary.
● These systems are not mutually exclusive and can coexist to form a full enterprise-level solution.
Big Data at rest
Not everything has to come from the Hadoop ecosystem:
NoSQL DBMSs:
● Couchbase (++ reads, caching)
● Cassandra (++ writes, OLAP)
... hybrid solutions are also possible:
● HDFS + Cassandra: in-memory analytics + large DFS
● HDFS + Solr/Lucene: fast text search on a distributed file system
Big Data in motion
Stream processing / dataflow architectures
Used to support the automatic analysis of data in motion, in real time or near real time:
- Identify meaningful patterns
- Trigger actions to respond to them as quickly as possible
Frameworks:
- Storm (from Twitter): dataflow processing framework, ++ multi-language
- Akka (from Typesafe): dataflow actor framework, ++ speed
Both are distributed, fault-tolerant, and streaming.
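The "identify a pattern, then trigger an action" loop can be sketched in plain Python. This is a single-process illustration only and does not use the Storm or Akka APIs; the function `detect_bursts`, its parameters, and the sample timestamps are all hypothetical.

```python
from collections import deque
from typing import Callable, Iterable


def detect_bursts(
    timestamps: Iterable[float],
    window_secs: float,
    threshold: int,
    on_burst: Callable[[float, int], None],
) -> None:
    """Call on_burst when `threshold` events fall within `window_secs`."""
    recent: deque = deque()
    for ts in timestamps:
        recent.append(ts)
        # Drop events that have fallen out of the sliding window.
        while ts - recent[0] > window_secs:
            recent.popleft()
        if len(recent) >= threshold:
            on_burst(ts, len(recent))  # trigger the action immediately
            recent.clear()             # start looking for the next burst


alerts = []
detect_bursts(
    timestamps=[0.0, 0.1, 0.2, 5.0, 5.1, 9.0],
    window_secs=1.0,
    threshold=3,
    on_burst=lambda ts, count: alerts.append((ts, count)),
)
print(alerts)
```

Events are handled one at a time as they arrive, and only a small window of state is kept. A distributed framework adds partitioning of the stream across machines and recovery from failures on top of this basic idea.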
Big Data Landscape
[Diagram components: HDFS, Logs, HBase, EDW, Sqoop, HIHO, Flume, REST, Scribe, Cassandra, Hive, Pig, MapR, OTAP, Impala, SAS, R over HDFS, Mahout, OLAP, BI, Storm]
● Real-Time Analytics
● Streaming
● Batch Analytics
● Visualization
● Monitoring
● Marketing
Machine Learning on Big Data
[Diagram labels: FS, Unstructured, Unstructured, Data Interfaces]
Lambda Architecture
Logic layer: Software as a Service, e.g. a real-time predictor
From http://www.manning.com/marz/
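The core idea of the Lambda Architecture is that a query merges a precomputed batch view with a small real-time view covering the data that arrived since the last batch run. The sketch below illustrates this with page-view counts. The function names and sample data are hypothetical, and plain in-memory counters stand in for the batch and speed layers.

```python
from collections import Counter
from typing import Iterable


def build_batch_view(master_dataset: Iterable[str]) -> Counter:
    """Batch layer: recompute the view from the full, immutable data set."""
    return Counter(master_dataset)


def build_realtime_view(recent_events: Iterable[str]) -> Counter:
    """Speed layer: cover only events not yet absorbed by the batch view."""
    return Counter(recent_events)


def query(page: str, batch_view: Counter, realtime_view: Counter) -> int:
    """Serving layer: merge both views to answer with up-to-date counts."""
    return batch_view[page] + realtime_view[page]


batch = build_batch_view(["/home", "/home", "/offers"])
realtime = build_realtime_view(["/home", "/pricing"])
print(query("/home", batch, realtime))
```

Because the batch view is always recomputed from raw data, errors in the real-time view are discarded at the next batch run.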
Why do machine learning on big data
http://www.skytree.net/why-do-machine-learning-on-big-data/
Machine Learning: What?
SIMILARITY SEARCH
Similarity search provides a way to find the objects that are the most similar, in an overall sense, to the object(s) of interest.
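A minimal sketch of similarity search, assuming objects are represented as numeric feature vectors and compared with cosine similarity. That is one common choice among many, and the profile data below is made up for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(
    query: Sequence[float], items: Dict[str, Sequence[float]], k: int = 1
) -> List[str]:
    """Return the names of the k items most similar to the query vector."""
    ranked = sorted(
        items, key=lambda name: cosine_similarity(query, items[name]), reverse=True
    )
    return ranked[:k]


# Hypothetical customer profiles: spend per product category.
profiles = {
    "alice": [5.0, 1.0, 0.0],
    "bob": [4.0, 0.0, 1.0],
    "carol": [0.0, 5.0, 4.0],
}
print(most_similar([5.0, 0.0, 0.0], profiles, k=2))
```

This brute-force scan compares the query with every item. At big data scale the same ranking is usually approximated with an index or distributed across machines.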
PREDICTIVE ANALYTICS
Predictive analytics is the science of analyzing current and historical facts/data to make predictions about future events.
CLUSTERING AND SEGMENTATION
Cluster analysis and segmentation represent a purely data-driven approach to grouping similar objects, behaviors, or whatever is represented by the data.
From http://www.skytree.net/why-do-machine-learning-on-big-data/use-cases/
Word Counting on Map Reduce
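Word counting is the classic Map-Reduce example. The sketch below runs the three phases (map, shuffle, reduce) in a single Python process to show the data flow. It is not Hadoop code, and the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (the word)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: sum the values collected for one key."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


print(word_count(["big data", "Big Data landscape"]))
```

On a cluster, the map calls run in parallel on different blocks of the input, and the framework performs the shuffle over the network before the reducers run.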
Machine learning on Map Reduce
From http://www.slideshare.net/hadoop/modeling-with-hadoop-kdd2011
Machine learning on Map Reduce
From http://www.slideshare.net/hadoop/modeling-with-hadoop-kdd2011
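The transcript does not say which algorithm these slides show. As an illustrative assumption, here is one iteration of k-means clustering written as a map step (assign each point to its nearest centroid) and a reduce step (average the points per centroid). All names and data are hypothetical, and the code runs in a single process.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centroids: List[Point]) -> int:
    """Index of the centroid with the smallest squared distance to point."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points: Iterable[Point], centroids: List[Point]) -> Iterator[Tuple[int, Point]]:
    """Map: emit (centroid index, point) for every input point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_phase(assigned: Iterable[Tuple[int, Point]]) -> Dict[int, Point]:
    """Reduce: average all points assigned to the same centroid."""
    groups: Dict[int, List[Point]] = defaultdict(list)
    for index, point in assigned:
        groups[index].append(point)
    return {
        index: tuple(sum(dim) / len(pts) for dim in zip(*pts))
        for index, pts in groups.items()
    }


def kmeans_step(points: List[Point], centroids: List[Point]) -> List[Point]:
    """One iteration; centroids with no assigned points stay where they are."""
    updated = reduce_phase(map_phase(points, centroids))
    return [updated.get(i, c) for i, c in enumerate(centroids)]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
print(kmeans_step(points, centroids=[(0.0, 0.0), (10.0, 10.0)]))
```

Iterative algorithms like this need one Map-Reduce job per iteration, so the cost of re-reading the data on every pass is a practical concern on Hadoop.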
Machine Learning: Use Cases
E-Commerce / E-Tailing
● Product Recommendation Engines
● Cross Channel Analytics
● Events/Activity Behavior Segmentation
Product Marketing
● Campaign management and optimization
● Market and consumer segmentations
● Pricing Optimization
Customer Marketing
● Customer Churn Management
● (Mobile) User Behavior Prediction
● Offer Personalization
Big Data: Opportunities
Unstructured Data
● Clustering
● Distributed processing
● Distributed Storage
Modeling & Analytics
● Distributed Machine Learning
● Fast Online Analytics Cubes
Streaming and Real-Time processing
● Build RT profiles
● Decision trees and Predictions
● Offer Personalization
Thanks
linkedin:
www.linkedin.com/in/natalinobusa
blog:
www.natalinobusa.com