the paradox of big data - dataiku / oxalide aperotech
TRANSCRIPT
![Page 1: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/1.jpg)
The Paradox of Big Data
![Page 2: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/2.jpg)
2001 Programming Languages
2004 Natural Language Processing
2006 Social Recommendation
2008 Distributed Computing
2011 Social Gaming
2012 Advertising
2013 Dataiku
2009 Web Mining
2010
Time Spent Coding: 100%, 100%, 80%, 50%, 20%, 0%, 10%, 50%, 20%
Favorite Language: C, Exascript, Exascript, Exascript, Python, Powerpoint, Python, Java, None
Largest Dataset: 100GB, 100GB, 10GB, 10TB, 100TB, 100kB, 500GB, 100TB, 10TB
I’m Florian and I like data
![Page 3: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/3.jpg)
www.dataiku.com
Dataiku in short
Software vendor behind Data Science Studio, the “Photoshop for Data Science”
COMMUNITY EDITION
http://www.dataiku.com/dss/trynow/
![Page 4: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/4.jpg)
Goals For Today
• Big Data, with the bias of what I know of it (Analytics …)
• Big Data: History and Feelings
• What are the key technologies to watch?
• Some practical use cases?
• How to get started?
![Page 5: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/5.jpg)
Dataiku
Motivation
1/8/144
First Hard Drive: 3.75 Megabytes
Access Time: 1 second
![Page 6: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/6.jpg)
IN 2008 man
invented big data
Volume Variety Velocity
![Page 7: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/7.jpg)
WHAT IF THE MARKETING GUY HAD CHOSEN ANOTHER LETTER?
Capacity Complexity Celerity
![Page 8: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/8.jpg)
OR SIMPLER
Size Serendipity Speed
![Page 9: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/9.jpg)
OR AFTER A DRINK
Big Blur Blazing
![Page 10: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/10.jpg)
Or Combine
C… B.. S….
![Page 11: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/11.jpg)
Or Combine
Complete Bull Sh..
![Page 12: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/12.jpg)
SOOO WHAT IS
BIG DATA ?
![Page 13: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/13.jpg)
PARADOX #1 SIMPLEXITY
![Page 14: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/14.jpg)
SUBTLE PATTERNS
![Page 15: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/15.jpg)
"MORE BUSINESS" BUTTONS
![Page 16: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/16.jpg)
PARADOX #2 SELF-AWARE
![Page 17: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/17.jpg)
DATA SCIENTIST AT NIGHT
![Page 18: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/18.jpg)
DATA CLEANER BY DAY
![Page 19: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/19.jpg)
DATA PLUMBER ON THE WEEKEND
![Page 20: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/20.jpg)
WAITING FOR COMPUTATIONS BETWEEN COFFEES
![Page 21: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/21.jpg)
PARADOX #3 WHERE TO STORE DATA?
![Page 22: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/22.jpg)
MY DATA IS WORTH MILLIONS
![Page 23: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/23.jpg)
I SEND IT TO THE
MARKETING CLOUD
AND BACK IT UP TO GOOGLE
![Page 24: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/24.jpg)
PARADOX #4 IS IT BIG OR NOT ?
![Page 25: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/25.jpg)
WE ALL LIVE IN A BIG DATA
LAKE
![Page 26: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/26.jpg)
ALL MY DATA MAY FIT IN HERE
![Page 27: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/27.jpg)
PARADOX #5 (at last) HUMAN OR NOT ?
![Page 28: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/28.jpg)
TECHCRUNCH SAYS THAT MACHINE LEARNING WILL SAVE
US ALL
![Page 29: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/29.jpg)
I JUST WANT MORE REPORTS
![Page 30: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/30.jpg)
BIG DATA TECH TRENDS
![Page 31: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/31.jpg)
ELEPHANTS MAKE BABIES
![Page 32: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/32.jpg)
Dataiku - Pig, Hive and Cascading
WELCOME TO TECHNOSLAVIA
Regions: Machine Learning Mystery Land, Scalability Central, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House, Real-time Island, NoSQL Nihiland
Tools: Hadoop, Ceph, Sphere, Cassandra, Kafka, Flume, Spark, Scikit-Learn, GraphLab, prediction.io, jubatus, Mahout, WEKA, MLBase, LibSVM, RapidMiner, Pandas, Kibana, InfiniDB, Drill, Spark SQL, Hive, Impala, …, Elasticsearch, Solr, MongoDB, Riak, Membase, Pig, Cascading, Talend, R, Storm
![Page 33: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/33.jpg)
DRIVER 1: BACK TO THE BASICS
RAM - CPU - DISK
![Page 34: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/34.jpg)
A PERSISTENT MEMORY PROBLEM
Memory: $1000 / GB (2000) → $6 / GB (2013), memory cost divided by 150
Disk: $10 / GB (2000) → $0.06 / GB (2013), disk cost divided by 250
MAP REDUCE times
HACK REDUCE times
![Page 35: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/35.jpg)
DATA IS BIGGER
![Page 36: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/36.jpg)
IS USEFUL DATA BIGGER ?
WHOLE DATA
REFINED DATA
![Page 37: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/37.jpg)
GOLD
NEEDLE IN HAYSTACK ?
![Page 38: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/38.jpg)
OIL
REFINE BEFORE USE
![Page 39: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/39.jpg)
HOW BIG IS BIG DATA?
Web Site
– $1 billion revenue per year
– 10 million unique visitors per month
– 100 million orders / actions per day
10TB RAW DATA
1TB REFINED DATA
![Page 40: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/40.jpg)
1 TERABYTE
FITS IN MEMORY
1TB
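As a rough check of this claim, the cost of holding 1 TB in RAM can be worked out from the per-gigabyte memory prices quoted earlier in the deck ($1000 / GB in 2000, $6 / GB in 2013). This is only a back-of-the-envelope sketch: counting 1 TB as 1024 GB is an assumption, and the prices are the slide's round figures.

```python
# RAM prices per GB: the round figures from the deck's 2000 vs 2013 slide.
RAM_PRICE_PER_GB = {2000: 1000.0, 2013: 6.0}
GB_PER_TB = 1024  # assumption: binary terabyte


def ram_cost_for(terabytes: float, year: int) -> float:
    """Approximate dollar cost of `terabytes` of RAM at that year's price."""
    return terabytes * GB_PER_TB * RAM_PRICE_PER_GB[year]


cost_2000 = ram_cost_for(1, 2000)
cost_2013 = ram_cost_for(1, 2013)
print(f"1 TB of RAM at 2000 prices: ${cost_2000:,.0f}")
print(f"1 TB of RAM at 2013 prices: ${cost_2013:,.0f}")
```

At the 2013 price the refined dataset from the previous slide costs a few thousand dollars of memory, which is what makes the in-memory approach plausible.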
![Page 41: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/41.jpg)
DRIVER 2 : ECOSYSTEM GROWS
• 1st Circle: OPEN SOURCE – YAHOO – IBM – LINKEDIN – FACEBOOK
• 2nd Circle: STANFORD – BERKELEY – STARTUPS
![Page 42: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/42.jpg)
STARTUPS
$1B per year Invested in Big Data TECH
Years: 2009, 2010, 2011, 2012, 2013
Funding amounts shown: 64m$, 6.75m$, 14m$, 2m$, 40m$, 20m$, 20.5m$, 19m$, 4m$, 100m$, 1.8m$, 17m$, 11m$, 7.75m$, 1.7m$
Also shown: 223m$, 301m$
![Page 43: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/43.jpg)
ALL > SPARK
Real-Time Resilient Distributed Memory Framework
• Abstraction with any DAG operation on data:
- Filter
- Map
- Reduce
- Cache
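To make the four operations concrete, here is a toy, single-machine sketch of a lazily evaluated chain. It is not Spark and does not use Spark's API: `ToyDataset` and its methods are hypothetical names invented for illustration. It only shows the idea that filter and map are recorded as steps, reduce triggers the computation, and cache keeps an intermediate result in memory for reuse.

```python
from functools import reduce
from typing import Any, Callable, List, Optional, Tuple


class ToyDataset:
    """Single-machine toy that mimics a lazy filter/map/reduce/cache chain."""

    def __init__(self, parent: Any, op: Optional[Tuple[str, Callable]] = None):
        self._parent = parent  # a list, or the upstream ToyDataset
        self._op = op          # None for the source, else (kind, function)
        self._keep = False     # set by cache()
        self._cached: Optional[List[Any]] = None

    def filter(self, predicate: Callable[[Any], bool]) -> "ToyDataset":
        return ToyDataset(self, ("filter", predicate))  # recorded, not run

    def map(self, func: Callable[[Any], Any]) -> "ToyDataset":
        return ToyDataset(self, ("map", func))          # recorded, not run

    def cache(self) -> "ToyDataset":
        self._keep = True
        return self

    def _iterate(self):
        if self._cached is not None:
            return iter(self._cached)
        if self._op is None:
            items = iter(self._parent)
        else:
            kind, func = self._op
            upstream = self._parent._iterate()
            items = filter(func, upstream) if kind == "filter" else map(func, upstream)
        if self._keep:
            self._cached = list(items)  # materialize once, reuse afterwards
            return iter(self._cached)
        return items

    def reduce(self, func: Callable[[Any, Any], Any]) -> Any:
        return reduce(func, self._iterate())  # the action that runs the chain


parsed = []  # records every value that actually went through the map step


def to_int(text: str) -> int:
    parsed.append(text)
    return int(text)


lines = ["3", "x", "10", "7", ""]
numbers = ToyDataset(lines).filter(str.isdigit).map(to_int).cache()
total = numbers.reduce(lambda a, b: a + b)  # first action: runs the chain
largest = numbers.reduce(max)               # second action: served from the cache
print(total, largest, len(parsed))
```

The second reduce does not re-run the map step, which is the point of keeping a working set in memory between computations.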
![Page 44: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/44.jpg)
SPARK AND ITS ECOSYSTEM
SPARK
SHARK: Real-Time Queries
STREAMING: Real-Time Updates
MLBASE: In-Memory Learning
![Page 45: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/45.jpg)
SooOOo WHAT IS IT IN PRACTICE?
![Page 46: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/46.jpg)
Turn Device Logs Into Next Year's Business
Parking ticket machine data
OpenStreetMap data
Cleaning and enrichment of data
Joining data
Data Science Studio
Creation of a predictive algorithm
Availability of the predictions
Each street is segmented into small pieces that are enriched with geospatial information.
The parking ticket history is joined with the points of interest from OpenStreetMap.
Parking availability is predicted per street segment from the joined data.
The algorithm is finally integrated into the iPhone app “Find me a space”.
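The join-then-predict flow of this use case can be sketched as follows. This is a deliberately naive baseline, not the predictive algorithm built for the project: all data, field names (`spaces`, `poi_count`) and the "spaces minus typical ticket count" rule are assumptions made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical inputs: every identifier and value here is invented.
tickets = [  # one row per ticket sold: (street segment, day, hour)
    ("seg-1", "2013-05-06", 9), ("seg-1", "2013-05-06", 9),
    ("seg-1", "2013-05-07", 9), ("seg-2", "2013-05-06", 9),
]
segments = {  # per-segment attributes, e.g. derived from map data
    "seg-1": {"spaces": 4, "poi_count": 3},
    "seg-2": {"spaces": 5, "poi_count": 0},
}


def build_features(tickets, segments):
    """Join tickets to segments and average ticket counts per (segment, hour)."""
    per_day = defaultdict(int)
    for segment_id, day, hour in tickets:
        if segment_id in segments:  # inner join on the segment key
            per_day[(segment_id, hour, day)] += 1
    grouped = defaultdict(list)
    for (segment_id, hour, _day), count in per_day.items():
        grouped[(segment_id, hour)].append(count)
    # Days with no ticket at all are not counted: a known bias of this sketch.
    return {
        key: {"mean_tickets": mean(counts), **segments[key[0]]}
        for key, counts in grouped.items()
    }


def expected_free_spaces(features, segments, segment_id, hour):
    """Baseline: spaces minus the typical ticket count seen at that hour."""
    row = features.get((segment_id, hour))
    demand = row["mean_tickets"] if row else 0
    return max(segments[segment_id]["spaces"] - demand, 0)


features = build_features(tickets, segments)
print(expected_free_spaces(features, segments, "seg-1", 9))
```

A real model would replace the subtraction with a learned predictor and use the joined attributes (such as nearby points of interest) as features.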
![Page 47: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/47.jpg)
Optimizing Last Mile with Data Science Studio
Data Science Studio
Historical delivery and retrieval data
Modeling of a score for each delivery
Cleaning and temporal enrichment of data
Data aggregation by geographic location
Incorporation of new deliveries to the existing model
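The enrichment, aggregation and scoring steps listed on this slide could look roughly like the sketch below. The delivery records, the grid-cell rounding and the "historical success rate of the cell" score are all assumptions for illustration; the slide does not say how the actual score is modeled.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical delivery history: (timestamp, latitude, longitude, succeeded).
history = [
    ("2014-01-06 09:15", 48.8601, 2.3412, True),
    ("2014-01-06 18:40", 48.8603, 2.3418, False),
    ("2014-01-11 10:05", 48.8604, 2.3415, True),
    ("2014-01-11 11:30", 48.8902, 2.3601, False),
]


def enrich(timestamp: str) -> dict:
    """Temporal enrichment: derive hour, weekday and a weekend flag."""
    moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
    return {"hour": moment.hour, "weekday": moment.weekday(),
            "weekend": moment.weekday() >= 5}


def cell(lat: float, lon: float, precision: int = 2) -> tuple:
    """Geographic aggregation key: round coordinates to a coarse grid cell."""
    return (round(lat, precision), round(lon, precision))


def success_rate_by_cell(history) -> dict:
    """Aggregate past deliveries by grid cell."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [successes, deliveries]
    for _timestamp, lat, lon, succeeded in history:
        stats = totals[cell(lat, lon)]
        stats[0] += int(succeeded)
        stats[1] += 1
    return {key: ok / n for key, (ok, n) in totals.items()}


def score(rates: dict, lat: float, lon: float, default: float = 0.5) -> float:
    """Score a new delivery by the past success rate of its cell."""
    return rates.get(cell(lat, lon), default)


rates = success_rate_by_cell(history)
print(enrich("2014-01-11 10:05"), score(rates, 48.8602, 2.3411))
```

New deliveries can be folded in by appending them to `history` and recomputing the rates, which mirrors the last step on the slide.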
![Page 48: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/48.jpg)
• Search reformulation
• No answer
• Click on a professional
• Top search
• Navigation or filter click
HOW CAN WE IMPROVE THE RELEVANCE OF OUR ANSWERS BY ANALYZING USER BEHAVIOR?
>200M searches
20M
>10 occurrences: 1.4M queries
0.5M prioritized queries
✗ ✓
Analysis & corrections
Automation
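The funnel on this slide (keep queries seen more than 10 times, then prioritize them) can be sketched in a few lines. The log format, the outcome labels and the choice to rank by share of failed searches are assumptions for illustration; treating reformulations and empty answers as the failure signals is a reading of the slide, not something it states.

```python
from collections import Counter

# Assumed failure signals: the user reformulated, or got no answer.
FAILURE_OUTCOMES = {"reformulation", "no_answer"}

# Hypothetical search log: (query text, outcome of that search).
log = (
    [("plumber", "click_pro")] * 9
    + [("plumber", "reformulation")] * 3
    + [("open restaurant", "no_answer")] * 11
    + [("xyz", "no_answer")] * 2
)


def prioritize(log, min_occurrences: int = 10):
    """Keep queries seen more than `min_occurrences` times, worst first."""
    totals = Counter(query for query, _outcome in log)
    failures = Counter(
        query for query, outcome in log if outcome in FAILURE_OUTCOMES
    )
    frequent = [query for query, n in totals.items() if n > min_occurrences]
    return sorted(frequent, key=lambda q: failures[q] / totals[q], reverse=True)


ranking = prioritize(log)
print(ranking)
```

Rare queries such as `xyz` drop out at the frequency filter, so analysis effort goes to queries that are both common and poorly served.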
![Page 49: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/49.jpg)
SOLUTION
Machine
Management
Exploration
pagesjaunes.fr
Directory
Hadoop: Pig + Hive
Indexing export
Interpretation engine
Crawl
Other reference datasets
Scikit-learn
![Page 50: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/50.jpg)
Analyst
Panels
1970 : Birth of Computer Analytics
Computer, Expensive Software
Marketing Studies
![Page 51: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/51.jpg)
Multiple Data Sources
Analyst Team
Many Models
CRM
Logs
2015 : BUILD YOUR FACTORY
Server Cluster, Light Software
Personalised Experience Model
Acquisition Cost Opportunity Model
Stock Optimisation Model
Optimize Delivery
![Page 52: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/52.jpg)
Churn
Volume Forecast
Recommender
Segmentation
Lifetime Value
Risk Score
Hot Location
Pricing
Ranking
Fraud
Event Paths
A MODEL: an automated way to make a computer take a decision from raw (historical) data
The model can be used to take immediate (real-time) actions through an API
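A minimal sketch of that definition, using churn as the example: a rule is learned from historical rows, then wrapped in a function shaped like an API handler. The data, the single feature (`days_since_last_order`), the one-threshold "model" and the JSON shape are all assumptions chosen to keep the example small; a real churn model would use more features and a proper learning library.

```python
import json

# Hypothetical history: (days since last order, did the customer churn?).
history = [(2, False), (5, False), (9, False), (30, True), (45, True), (60, True)]


def train_threshold(history) -> int:
    """Pick the cut-off that misclassifies the fewest historical rows."""
    candidates = sorted({days for days, _churned in history})

    def errors(cut: int) -> int:
        return sum((days >= cut) != churned for days, churned in history)

    return min(candidates, key=errors)


def make_model(history):
    """Return a predict function: the automated decision learned from data."""
    cut = train_threshold(history)

    def predict(days_since_last_order: int) -> bool:
        return days_since_last_order >= cut

    return predict


def handle_request(model, body: str) -> str:
    """Shape of an API handler: JSON request in, JSON decision out."""
    payload = json.loads(body)
    return json.dumps({"churn_risk": model(payload["days_since_last_order"])})


model = make_model(history)
response = handle_request(model, '{"days_since_last_order": 40}')
print(response)
```

The caller never sees the historical data, only the decision, which is what lets the same model drive immediate actions.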
![Page 53: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/53.jpg)
![Page 54: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/54.jpg)
SooOOo HOW DO I ENTER WONDERLAND?
![Page 55: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/55.jpg)
STEP 1 : LEARN
• PYTHON + PANDAS + SCIKIT
• R
• SCALA
http://scikit-learn.org/
https://www.coursera.org/course/rprog
![Page 56: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/56.jpg)
STEP 2 : PRACTICE
• Try entering a contest on kaggle.com or datascience.net
• Join a meetup
![Page 57: The paradox of big data - dataiku / oxalide APEROTECH](https://reader034.vdocuments.us/reader034/viewer/2022052700/55a20a1f1a28aba5368b4650/html5/thumbnails/57.jpg)
www.dataiku.com
http://www.dataiku.com/dss/trynow/
Dataiku HQ
2 rue Jean Lantier
75001 Paris France
Dataiku West
2423A Durant Avenue
Berkeley, CA 94704
Florian [email protected]
You have ideas
“My data is too dirty. I don’t even know where to start.”
“We could probably understand our users better. But how?”
“There’s a trend here, but our full historical data is just too big.”
You have data
You need a tool