Open Problems in Big Data Analytics: A Practitioner’s View
Dr. Vijay Srinivas Agneeswaran,
Director and Head, Big-data R&D,
Innovation Labs, Impetus
Invited Talk, National Conference on Distributed Machine Learning, Feb 2015
Contents
State-of-art in Big Data Analytics
Big Data Computations: Characterization
Big Data pipelines: open problems
State of Art in Big Data Analytics
• Start from business questions
• How quickly and accurately can we get answers?
• Data gets stored in HDFS
• Various frameworks to process data
• Spark – machine learning
• Giraph/GraphLab – graph processing
• Storm – real-time processing
• Is HDFS the right storage?
• Alternatives
• Cassandra, MapR – M7, QFS, Cleversafe, Isilon, etc.
http://www.inktank.com/news-events/new/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
• Is Spark the right platform for processing?
• Alternatives
• Flink
• Forge – a meta domain-specific language
• Are Spark Streaming/Storm the right platforms for stream processing?
Big Data Computations
Computations/Operations:
Giant 1 (simple stats) is perfect for Hadoop 1.0.
Giants 2 (linear algebra), 3 (N-body), and 4 (optimization): is Spark from UC Berkeley efficient?
Examples: logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares.
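To make the point about iterative workloads concrete, here is a minimal sketch of logistic regression trained by batch gradient descent in plain Python. It is not taken from the talk: the function names, toy data, learning rate, and step count are all assumptions chosen for illustration. What it shows is that every iteration rescans the full data set, which is the access pattern that favors keeping data in memory between passes.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(points, labels, steps=500, lr=0.5):
    """Batch gradient descent; each step makes one full pass over the data."""
    w = [0.0] * len(points[0])
    b = 0.0
    n = len(points)
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        # Full scan of the data set: this loop is what repeats on every step.
        for x, y in zip(points, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return class 1 if the modelled probability is at least 0.5."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0


# Hypothetical one-dimensional toy data: small values are class 0, large are class 1.
points = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
labels = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(points, labels)
```

With 500 steps the data is read 500 times. On a disk-based Map-Reduce job each of those passes would typically be a separate job that re-reads its input, which is one reason the question of platform efficiency arises for these giants.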
An example is the social group-first approach for customer churn analysis [2].
Interactive/On-the-fly data processing – Storm.
OLAP – data cube operations. Dremel/Drill
Data sets – not embarrassingly parallel?
Deep Learning: Artificial Neural Networks/Deep Belief Networks
Machine vision from Google [3]
Speech analysis from Microsoft
Giant 5 – Graph processing – GraphLab, Pregel, Giraph
[1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.
[2] Yossi Richter, Elad Yom-Tov, Noam Slonim: Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of the SIAM International Conference on Data Mining, 2010, pp. 732-741.
[3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, Andrew Y. Ng: Large Scale Distributed Deep Networks. NIPS 2012: 1232-1240.
Big Data Pipelines
1. Nuance – incompleteness
2. Scale
3. Timeliness
4. Privacy
5. Human Loop
Big Data Pipelines: Data Acquisition
• Needle in a Haystack.
• BlinkDB?
• Automatic metadata discovery
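The BlinkDB question above concerns answering queries approximately from samples rather than scanning everything. The following is only a sketch of that general idea, not BlinkDB's actual design or API: it draws a uniform random sample (an assumed simplification) and reports an estimate with a 95% margin based on the normal approximation. The function name and the synthetic data are hypothetical.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean of `values` from a uniform sample, with a 95% margin."""
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.mean(sample)
    # Standard error of the sample mean; 1.96 gives a 95% normal-approximation margin.
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, 1.96 * std_error


# Hypothetical data: 100,000 values cycling through 0..99.
population = [i % 100 for i in range(100_000)]
estimate, margin = approximate_mean(population, sample_size=1_000)
exact = statistics.mean(population)  # full scan, for comparison only
```

The sample touches 1% of the rows here. The trade-off is the one the acquisition slide points at: a bounded error in exchange for a much smaller scan.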
Big Data Pipelines: Information Extraction
• Error models for data cleaning
• Multimedia data
Big Data Pipelines: Analytics
• Multi-dimensional data
The network to identify the individual digits
from the input image
http://neuralnetworksanddeeplearning.com/chap1.html
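As a rough illustration of what such a digit-recognition network computes, here is a sketch of a feedforward pass in plain Python. The layer sizes (784 inputs for a 28x28 image, 15 hidden units, 10 outputs), the random weights, and the random "image" are assumptions for illustration. The network is untrained, so the predicted digit is meaningless; only the shape of the computation is the point.

```python
import math
import random


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def make_layer(n_in, n_out, rng):
    """Random weights (scaled by 1/sqrt(n_in)) and biases for one dense layer."""
    weights = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
    return weights, biases


def feedforward(layers, activations):
    """Propagate an input vector through every layer in turn."""
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


rng = random.Random(0)
sizes = [784, 15, 10]  # assumed layer sizes: pixels, hidden units, digit classes
layers = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

image = [rng.random() for _ in range(784)]  # stand-in for a flattened 28x28 image
outputs = feedforward(layers, image)
digit = max(range(10), key=lambda d: outputs[d])  # index of the strongest output
```

Each output unit corresponds to one digit class, and the prediction is the index of the largest activation.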
Copyright @Impetus Technologies, 2014
DLNs for Face Recognition
DLN for Face Recognition
http://www.slideshare.net/hammawan/deep-neural-networks
Success stories of DLNs
• Android voice recognition system – based on DLNs. Improves accuracy by 25% compared to state-of-art.
• Microsoft Skype Translate software and digital assistant Cortana
• 1.2 million images, 1000 classes (ImageNet data) – error rate of 15.3%, better than state of art at 26.1%
Success stories of DLNs (continued)
• Senna system – PoS tagging, chunking, NER, semantic role labeling, syntactic parsing. Comparable F1 score to state-of-art with a huge speed advantage (5 days vs. a few hours).
• DLNs vs. TF-IDF: relevance search over 1 million documents, 3.2 ms vs. 1.2 s.
• Robot navigation
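For readers unfamiliar with the TF-IDF baseline in the relevance-search comparison, here is a minimal sketch of TF-IDF ranking with cosine similarity in plain Python. It is not the system measured in the talk. The tokenization (lowercase, whitespace split), the `log(N/df)` weighting, the helper names, and the three toy documents are all assumptions.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Weight each known term by its count times its inverse document frequency."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def build_index(docs):
    """Compute IDF weights and one TF-IDF vector per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(n / c) for t, c in df.items()}
    return idf, [tfidf_vector(toks, idf) for toks in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, idf, vectors):
    """Return document indices ordered from most to least similar to the query."""
    q = tfidf_vector(query.lower().split(), idf)
    scores = [cosine(q, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy corpus.
docs = [
    "spark streaming processes real time data",
    "hdfs stores large files across machines",
    "graph processing with giraph and graphlab",
]
idf, vectors = build_index(docs)
ranking = search("real time streaming", idf, vectors)
```

This sketch compares the query against every document vector, so query cost grows with the size of the corpus and vocabulary.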
Conclusions
• Hadoop = HDFS + Map-Reduce
• Useful for large-scale, embarrassingly parallel processing of data sets
• Not so good for iterative, interactive computing
• Beyond Hadoop Map-Reduce philosophy
• Optimization and other problems
• Real-time computation
• Processing specialized data structures
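As a reminder of the embarrassingly parallel pattern that Map-Reduce handles well, here is a single-process sketch of the map, shuffle, and reduce phases for a word count. It imitates the programming model only; it does not use Hadoop, and the function names and input lines are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map over every line, shuffle by word, then reduce each group."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


lines = ["big data big questions", "data pipelines"]
counts = word_count(lines)
```

Each line can be mapped independently and each word reduced independently, which is why the pattern parallelizes so easily. An iterative or interactive workload would have to repeat this whole cycle on every step.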
Thank You!
Mail • [email protected]
LinkedIn • http://in.linkedin.com/in/vijaysrinivasagneeswaran
Blogs • blogs.impetus.com
Twitter • @a_vijaysrinivas.
References
• Divyakant Agrawal et al., Challenges and Opportunities with Big Data, Computing Research Association White Paper, available from http://www.cra.org/ccc/files/docs/init/bigdatawhitepaper.pdf.
• Vijay Srinivas Agneeswaran et al., Distributed Deep Learning over Spark, available at: http://www.datasciencecentral.com/profiles/blogs/implementing-a-distributed-deep-learning-network-over-spark