are we reaching a data science singularity? how cognitive computing is emerging from machine...
TRANSCRIPT
2 Natalino Busa - @natbusa
What about (data) science?
- technologies and tools are driving innovation in data analytics -
Learning: The Scientific Method
Ørsted's "First Introduction to General Physics" (1811) https://en.m.wikipedia.org/wiki/History_of_scientific_method
observation hypothesis deduction synthesis
Hans Christian Ørsted
experiment
Icons made by Gregor Cresnar from www.flaticon.com is licensed by CC 3.0 BY
Innovation in Data Analytics
Cloud, Community, AI & ML
“we live in an age of open source datacenters, so we can stack all these things together and we have open source from the ground to ceiling.”
Sam Ramji, CEO of Cloud Foundry
https://www.youtube.com/watch?v=7oCSFcUW-Qk
Analytics in the cloud
Bare Metal: Physical Machines
IAAS: Virtual Resources
CAAS: Containers
dPAAS: Datastores, Data Engines
iPAAS: Tools Integration, Flows & Processes
DAAAS: Data Analytics as a Service
DAAAS: AI and ML APIs
Cloud Computing for Deep Neural Networks > Models, Compute (Train, Score), and Data
AI and ML models for:
● Speech (audio)
● Language (text)
● Vision (images/video)
● Data (classification, regression, clustering, anomaly detection)
Ephemeral Computing Clusters on a Cloud
Timeline: create, load data, compute, store, destroy
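The create, load, compute, store, destroy timeline can be sketched as a context manager. This is a minimal illustration under stated assumptions: `EphemeralCluster` and `ephemeral_cluster` are hypothetical names that keep everything in memory, not the API of any cloud provider.

```python
from contextlib import contextmanager


class EphemeralCluster:
    """Hypothetical stand-in for a cloud cluster; everything lives in memory."""

    def __init__(self, name):
        self.name = name
        self.data = None
        self.alive = True

    def load(self, records):
        # load: copy the input data onto the cluster
        self.data = list(records)

    def compute(self, fn):
        # compute: run a job against the loaded data
        return fn(self.data)

    def destroy(self):
        # destroy: release resources and drop the cluster-local data
        self.data = None
        self.alive = False


@contextmanager
def ephemeral_cluster(name):
    cluster = EphemeralCluster(name)  # create
    try:
        yield cluster
    finally:
        cluster.destroy()  # always destroy, even if the job fails


def run_job(records, durable_store):
    with ephemeral_cluster("exploration") as cluster:
        cluster.load(records)
        result = cluster.compute(lambda rows: sum(rows) / len(rows))
        durable_store["mean"] = result  # store: results must outlive the cluster
    return cluster


store = {}
finished = run_job([1, 2, 3, 4], store)
print(store, finished.alive)
```

The point of the design is that only what is written to the durable store survives; the cluster and its local copy of the data are gone once the block exits.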
dPaaS: Analytical clusters
Ephemeral
Short-Lived
Data Exploration
Isolated, Personal
Simple Access Management
vs
Permanent
Long-Lived
Production / Operations
Co-Ordinated
Complex Access Management
GPUs and Distributed Computing
GPU support is coming in Kubernetes, Mesos, Spark
https://www.oreilly.com/learning/accelerating-spark-workloads-using-gpus
http://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
[Figure labels: out, up; CPU; R, Python; Spark; TensorFrames]
Sharing is caring … speed
github.com + Jupyter notebooks: share ideas, code, and data
arxiv.org: share innovation and scientific results
Google: open-sources NLP parser scoring 95% in grammar accuracy
https://github.com/tensorflow/models/tree/master/syntaxnet
Deep Learning in Language Parsing
https://github.com/tensorflow/models/blob/master/syntaxnet/ff_nn_schematic.png
Semantic Search: TDA + NNs
Word2Vec, Par2Vec, Doc2Vec
https://arxiv.org/pdf/1405.4053v2.pdf
https://arxiv.org/pdf/1301.3781v3.pdf
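Semantic search over embeddings comes down to ranking items by vector similarity. The sketch below assumes embeddings are already available; the 3-dimensional vectors are invented toy values, not output of Word2Vec or Doc2Vec, and `search` is a hypothetical helper.

```python
import math

# Toy embeddings, invented for illustration only (real models use
# hundreds of dimensions learned from text).
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
    "pear": [0.2, 0.1, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search(query, k=2):
    """Return the k words whose embeddings are closest to the query word."""
    q = EMBEDDINGS[query]
    scored = [
        (cosine(q, vec), word)
        for word, vec in EMBEDDINGS.items()
        if word != query
    ]
    return [word for _, word in sorted(scored, reverse=True)[:k]]


print(search("king"))
print(search("apple"))
```

The same ranking applies to paragraph or document vectors: only the source of the embeddings changes.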
Lip reading
LipNet achieves 93.4% accuracy on the GRID corpus.
https://arxiv.org/pdf/1611.01599v1.pdf
Ask me Anything
Dynamic Memory Networks for Natural Language Processing
https://arxiv.org/pdf/1603.01417v1.pdf
https://youtu.be/oGk1v1jQITw
Caiming Xiong, Stephen Merity, Richard Socher
Ask me Anything
http://www.socher.org/index.php/DeepLearningTutorial/DeepLearningTutorial
Dynamic Memory Networks for Natural Language Processing
https://arxiv.org/pdf/1603.01417v1.pdf
http://www.socher.org/
Local context
Wider context
NLP, Attention Masks
Semantic Embeddings from Text, Images
Network Intrusion Detection
http://billsdata.net/?p=105
It contains 130 million flow records involving 12,027 distinct computers over 36 days (not the full 58 days claimed for the entire data release).
Each record consists of: time (to the nearest second), duration, source and destination computer ids, source and destination ports, protocol, number of packets, and number of bytes.
Techniques: TDA, Dimensionality Reduction
https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction
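Before TDA or dimensionality reduction can be applied, flow records have to be turned into numeric vectors. A minimal sketch, assuming a comma-separated layout with the fields in the order listed on the slide (the real file layout may differ); `FlowRecord`, `parse_flow`, `features_per_source`, and the sample lines are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FlowRecord:
    time: int          # seconds
    duration: int
    src_computer: str
    dst_computer: str
    src_port: str
    dst_port: str
    protocol: str
    packets: int
    n_bytes: int


def parse_flow(line):
    # Assumed layout: comma-separated, fields in the order the slide lists them.
    t, dur, src, dst, sport, dport, proto, pkts, nbytes = line.strip().split(",")
    return FlowRecord(int(t), int(dur), src, dst, sport, dport, proto,
                      int(pkts), int(nbytes))


def features_per_source(records):
    """One vector per source computer:
    [flow count, distinct destinations, total packets, total bytes]."""
    acc = defaultdict(lambda: {"flows": 0, "dsts": set(), "packets": 0, "bytes": 0})
    for r in records:
        a = acc[r.src_computer]
        a["flows"] += 1
        a["dsts"].add(r.dst_computer)
        a["packets"] += r.packets
        a["bytes"] += r.n_bytes
    return {
        src: [a["flows"], len(a["dsts"]), a["packets"], a["bytes"]]
        for src, a in acc.items()
    }


# Invented sample lines, not taken from the data set.
sample = [
    "1,0,C1,C2,N1,N2,6,10,500",
    "2,5,C1,C3,N1,80,6,4,200",
    "3,1,C2,C1,N5,N6,17,1,60",
]
features = features_per_source(parse_flow(line) for line in sample)
print(features)
```

Vectors like these, one per computer, are the kind of input a dimensionality reduction step would then project to two or three dimensions to look for outliers.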
Approaching (Almost) Any Machine Learning Problem
- Abhishek Thakur, Kaggle Grandmaster -
Pipeline: raw data (tables, files) -> data munging -> useful data -> feature engineering -> tabular data ready for ML (data, labels)
http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/
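The path from raw data to tabular data ready for ML can be sketched with the standard library alone. The records, column names, and the two helpers `munge` and `engineer` are invented for illustration; they are not taken from the linked post.

```python
# Invented raw records: untyped strings, stray whitespace, a missing value.
RAW = [
    {"age": "34", "city": "Paris ", "clicked": "yes"},
    {"age": "", "city": "rome", "clicked": "no"},
    {"age": "51", "city": "PARIS", "clicked": "no"},
]


def munge(rows):
    """Data munging: normalise strings, convert types, mark missing values."""
    clean = []
    for row in rows:
        clean.append({
            "age": int(row["age"]) if row["age"].strip() else None,
            "city": row["city"].strip().lower(),
            "clicked": row["clicked"].strip().lower() == "yes",
        })
    return clean


def engineer(rows):
    """Feature engineering: impute missing ages with the mean and
    one-hot encode the city. Returns (data, labels, column names)."""
    ages = [r["age"] for r in rows if r["age"] is not None]
    mean_age = sum(ages) / len(ages)
    cities = sorted({r["city"] for r in rows})
    X = [
        [r["age"] if r["age"] is not None else mean_age]
        + [1.0 if r["city"] == c else 0.0 for c in cities]
        for r in rows
    ]
    y = [int(r["clicked"]) for r in rows]
    columns = ["age"] + ["city=" + c for c in cities]
    return X, y, columns


X, y, columns = engineer(munge(RAW))
print(columns)
print(X)
print(y)
```

The result is the data/labels pair from the slide: a purely numeric table `X` and a label vector `y`.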
AutoML challenge
- based on scikit-learn
- 15 classifiers
- 14 feature preprocessing methods
- 4 data preprocessing methods
- 110 hyperparameters
- Supervised classification challenge: 100 different datasets
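The core idea, searching over combinations of preprocessing steps and classifiers and keeping the best-scoring one, can be shown at toy scale. This sketch is not the challenge system: it enumerates a hypothetical space of 2 preprocessors x 3 classifiers on invented data, using only the standard library, whereas a real AutoML system searches a far larger space and would typically not enumerate it exhaustively.

```python
import itertools
import math

# Invented data: the label depends on the first feature; the second
# feature is noise on a much larger scale.
TRAIN = [([0.1, 900.0], 0), ([0.2, 100.0], 0), ([0.3, 500.0], 0),
         ([0.8, 200.0], 1), ([0.9, 800.0], 1), ([0.7, 400.0], 1)]
VALID = [([0.15, 700.0], 0), ([0.85, 150.0], 1),
         ([0.25, 300.0], 0), ([0.75, 850.0], 1)]


def fit_identity(X):
    return lambda x: x


def fit_minmax(X):
    """Learn per-column min and max on the training data, scale to [0, 1]."""
    lo = [min(col) for col in zip(*X)]
    hi = [max(col) for col in zip(*X)]
    return lambda x: [(v - l) / (h - l) if h > l else 0.0
                      for v, l, h in zip(x, lo, hi)]


def knn(k):
    """k-nearest-neighbours classifier with majority vote."""
    def fit(X, y):
        def predict(x):
            nearest = sorted(zip(X, y), key=lambda p: math.dist(p[0], x))[:k]
            votes = [label for _, label in nearest]
            return max(set(votes), key=votes.count)
        return predict
    return fit


def nearest_centroid(X, y):
    centroids = {}
    for label in set(y):
        pts = [x for x, l in zip(X, y) if l == label]
        centroids[label] = [sum(c) / len(pts) for c in zip(*pts)]
    return lambda x: min(centroids, key=lambda l: math.dist(centroids[l], x))


PREPROCESSORS = {"identity": fit_identity, "minmax": fit_minmax}
CLASSIFIERS = {"1-nn": knn(1), "3-nn": knn(3), "centroid": nearest_centroid}


def search(train, valid):
    """Score every (preprocessor, classifier) pair on the validation set."""
    X = [x for x, _ in train]
    y = [label for _, label in train]
    results = {}
    for (p_name, fit_p), (c_name, fit_c) in itertools.product(
            PREPROCESSORS.items(), CLASSIFIERS.items()):
        transform = fit_p(X)
        predict = fit_c([transform(x) for x in X], y)
        hits = sum(predict(transform(x)) == label for x, label in valid)
        results[(p_name, c_name)] = hits / len(valid)
    return results


results = search(TRAIN, VALID)
best = max(results, key=results.get)
print(best, results[best])
```

On this toy data the choice of preprocessing matters more than the choice of classifier, which is the kind of interaction an automated search is meant to find without manual trial and error.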
Human cognitive biases:
Too much information
Not enough meaning
What should we remember?
Need to act fast
https://en.wikipedia.org/wiki/List_of_cognitive_biases
Man vs Machine cognitive limits
Model generation
Explanation
Unsupervised
Planning
Too much information
Not enough meaning
Need to act quickly
Memory limits
Theorems often tell us complex truths about the simple things, but only rarely tell us simple truths about the complex ones
Marvin Minsky, K-Lines: A Theory of Memory (1980)
Data Science: wear the AI/ML Lenses
We are entering a new era of intelligent machines
Boost our understanding of data
Focus on higher level analyses
Intelligent Data Systems: Long live the “database”
Wikipedia: A database is an organized collection of data.
DATA
New-SQL
ML
AI
SQL
Python - Scala - R
NLP
UX
Speech
COG