Dataiku - Google Cloud Platform Roadshow - October 2013
Data Science Studio
19 customers
Founded in January 2013
Data Science For Everyone
(Big) data + machine learning + practical applications = Data Science
The Project
(c) Dataiku 2013 - Confidential
Hal Alowne, BI Manager, Dim’s Private Showroom
Dim Sum, CEO & Founder, Dim’s Private Showroom
Dim’s Private Showroom: medium-size e-commerce • $100M revenue • 1 Data Analyst
Big Guys: $10B+ revenue • 100+ Data Scientists
Hey Hal! We need a big data platform, like the big guys! Let’s just do as they do!
Hal Wish #1: Global Customer Value Funnel
SEO
NewsLetter
Display Retargeting Display AdWords
Marketplace Direct Sales
Delivery
View Basket
Support Returns
$
$ $ $
Orders
Hal Wish #2: Why do people drop their basket?
9/30/13 5
Basket
Payment refused
Credit Refused
Cheaper elsewhere?
Delivery costs?
Waiting for Xmas?
ACTION
Hal Wish #3: Which product to put on top?
Original: Most Popular on top
Better: Machine Learning Score (age/discount/margin…)
Advanced: Machine Learning Score + Personalization
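The three ranking levels on this slide can be sketched in a few lines of Python. This is only an illustration: the `Product` fields, the hand-set `WEIGHTS`, and the per-customer `affinities` dictionary are hypothetical stand-ins for a trained model and are not taken from the deck.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    views: int        # popularity proxy
    age_days: int     # days since the product was listed
    discount: float   # 0.0 to 1.0
    margin: float     # 0.0 to 1.0


# Hypothetical hand-set weights; a real system would learn these from data.
WEIGHTS = {"freshness": 0.3, "discount": 0.3, "margin": 0.4}


def popularity_rank(products):
    """Original: most viewed products first."""
    return sorted(products, key=lambda p: p.views, reverse=True)


def score(product, affinity=0.0):
    """Better: combine age, discount and margin into a single score.

    Passing a non-zero affinity gives the 'Advanced' personalized variant.
    """
    freshness = 1.0 / (1.0 + product.age_days / 30.0)
    base = (WEIGHTS["freshness"] * freshness
            + WEIGHTS["discount"] * product.discount
            + WEIGHTS["margin"] * product.margin)
    return base + affinity


def scored_rank(products, affinities=None):
    """Rank by score; affinities maps product name to a per-customer boost."""
    affinities = affinities or {}
    return sorted(products,
                  key=lambda p: score(p, affinities.get(p.name, 0.0)),
                  reverse=True)


catalog = [
    Product("A", views=1000, age_days=300, discount=0.0, margin=0.1),
    Product("B", views=100, age_days=0, discount=0.5, margin=0.5),
    Product("C", views=500, age_days=30, discount=0.2, margin=0.3),
]

print([p.name for p in popularity_rank(catalog)])
print([p.name for p in scored_rank(catalog)])
print([p.name for p in scored_rank(catalog, {"A": 1.0})])
```

The same sort call serves all three levels; only the key function changes, which is why moving from "most popular" to a scored or personalized ranking is mostly a modelling problem rather than a plumbing one.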
Why is it so complicated?
Partner Data Spaghetti
Mailing Partner
DMP Partner
Mail Optimizer
Retargeter
Market Data Providers
Social Networks
Databases are Full
1 TB BI Database
20 TB BI Database
Any new computing job takes > 1 day
NEED FOR SCALE
Architecture Bingo
BI Real-Time Batch Real Real-Time
Simple Queries
Statistics
Machine Learning
Hive
Pig
Spark
MongoDB
ElasticSearch
Cascading
R
Hadoop Ceph
Sphere Cassandra Spark
Scikit-Learn
Mahout WEKA
MLBase
RapidMiner
Pandas D3 Crossfilter
InfiniDB LucidDB
Impala
Elastic Search SOLR
MongoDB Riak
Membase
Pig Hive Cascading Talend
Machine Learning Mystery Land!
Scalability Central! NoSQL-Slavia!
SQL Columnar Republic!
Visualization County! Data Cleanup Wasteland!
Statistician Old House!
R
Hal’s Bingo!
HADOOP Google Cloud Platform Dataiku
Dataiku Open Source Web Tracker (WT1)
• Apache License
• Javascript & IO
• Write directly to Google Cloud Storage
• Full Java, Easy To Deploy
Step 1 Get your own data
Silent at night
Autoscale during summer and winter sales
Step 2 Mix All Your Data
4 VMs on GCE
Tracking Data
Internal Data
Partner Data
Data Science Studio Pig Hive
HADOOP
auto-sync to BigQuery
Step 3 Mine your Data
Builtin Predictive Models
Advanced Adhoc Models (R or Python)
Shared Web Based Data Mining Platform
• January
  ◦ Choose Partner / Set up the architecture
• February
  ◦ Initial Deployment: 4TB
  ◦ Replace BI
• May
  ◦ New Applications (SEO, …)
• September
  ◦ Scale Deployment to 15TB
  ◦ Integrate all channels
Typical Project Calendar
• Enhance Daily Report Availability
  ◦ Previous architecture: between H+17 and H+26 (!)
  ◦ Hadoop on GCE: between H+3 and H+7
• +21% Email Channel Optimization
• SEO plan optimization
• and a dozen BI-style “apps”
Some Success For the Project