Journée Thématique Big Data
perso.univ-lyon1.fr/khalid.benabdeslem/ihp/o_giroud.pdf
TRANSCRIPT
1
Journée Thématique Big Data
13/03/2015
2
Agenda
• About Flaminem
• What Do We Want To Predict?
• What Is The Machine Learning Theory Behind It?
• How Does It Work In Practice?
• What Is Happening When Data Gets Big? …And How Big Is Big?
• MapReduce And Spark Explained With Card Games
3
We Collect Browsing Data Way Beyond Our Clients' Properties And Analyze Online Behaviors to Predict Intent
[Diagram: our clients' properties within the wider Internet]
Some figures in France:
• We track the online journeys of 150 million cookies (of which 50% mobile, the other half on tablet/PC)…
• …on 1 million websites in France (via HTTP cookies with opt-out)
• …and detect, for each single cookie, about 5 to 30 visits each active day
• We analyze the context of these visits thanks to our automated classification engine (based on semantic analysis of page content)
4
We Find the Best Moments to Engage a Conversation With Our Clients' Customers and Prospective Customers
[Diagram: Customer Decision Journey (Consideration → Intention → Purchase Decision) plotted against level of intent]
• What we predict: complex and rare decisions taking months to consider, evaluate and decide. Examples: car and real-estate purchases, subscriptions to insurance or credit.
• Not in our focus: impulse purchases. Examples: eCommerce, travel.
5
Our Solutions Are Based On Modelling And Predicting Online Behaviors
• Data for Client Acquisition: upper-funnel and mid-funnel targeting segments for RTB; Paid Search optimization data
• Client Scoring and Lead Scoring for Call Centers: churn scoring; cross-sell and up-sell scoring; lead scoring for use in call centers to optimize lead conversions
• Price Sensitivity Scoring (R&D): scores to assess prospects' / clients' price sensitivity to improve product pricing
6
What Do We Want To Predict?
• We predict when a user fills an online form: a “client conversion”
• We represent the conversion with a binary variable $y \in \{0, 1\}$
• In the general case we call $y$ the outputs (or responses)
• To predict conversion, we learn from features (e.g., browsing history), as sketched after this slide
• We represent these input variables (called features or predictors) with a $d$-dimensional vector $x_i \in \mathbb{R}^d$
• For each example, the goal is to use the inputs to predict the value of the outputs. We call this exercise Supervised Learning
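As a concrete illustration (not from the slides), a minimal Python sketch of this setup; the feature names and values are invented:

import numpy as np

# Each row is one cookie: d = 4 hypothetical browsing-history features
# (e.g., counts of visits to car, real-estate, insurance and credit pages).
X = np.array([
    [12, 0, 3, 1],   # cookie 1
    [ 0, 5, 0, 0],   # cookie 2
    [ 7, 1, 9, 4],   # cookie 3
])
# Binary outputs: 1 if the cookie later filled an online form ("client conversion").
y = np.array([1, 0, 1])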
7
What Is The Machine Learning Theory Behind It?
• We consider the outputs $y_i = f(x_i)$
• We want to minimize the risk $\mathbb{E}_{P_{X,Y}}\left[\, l(y, f(x)) \,\right]$, where $l$ is the loss function
• In practice, we minimize the empirical risk (the expectation is approximated by averaging over the sample data): $\frac{1}{n}\sum_{i=1}^{n} l(y_i, f(x_i))$
• After adding a regularization term, the goal is to solve:
$$\min_f \left[\, \underbrace{\frac{1}{n}\sum_{i=1}^{n} l(y_i, f(x_i))}_{\text{loss function}} + \underbrace{\lambda\, \Omega(f)}_{\text{regularization}} \,\right]$$
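To make the objective concrete, here is a small illustrative sketch (my own example, not the presenters' code) of the regularized empirical risk for a linear model $f(x) = w \cdot x$ with logistic loss and L2 regularization $\Omega(f) = \lVert w \rVert^2$:

import numpy as np

def logistic_loss(y, score):
    # l(y, f(x)) for y in {0, 1}: negative log-likelihood of a logistic model.
    return y * np.log1p(np.exp(-score)) + (1 - y) * np.log1p(np.exp(score))

def regularized_empirical_risk(w, X, y, lam):
    # (1/n) * sum_i l(y_i, f(x_i)) + lambda * Omega(f), with f(x) = X @ w.
    scores = X @ w
    return logistic_loss(y, scores).mean() + lam * np.dot(w, w)

# Tiny usage example on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
print(regularized_empirical_risk(np.zeros(5), X, y, lam=0.1))   # ~log(2) for w = 0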
8
How Does It Work In Practice?
Data Ingestion:
• Batch update
• Quality control
• Jobs automation and monitoring
Data Processing (a small sketch follows this slide):
• Events-to-features building
• Feature aggregation
• Feature enrichment with a temporal dimension
• Pattern analysis
• Data formatting
Machine Learning:
• Feature engineering
• Model fitting
• Hybrid optimization: 1st epoch with SGD, then high-precision convergence with L-BFGS
What tools are available when dealing with “small” data?
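For the events-to-features and aggregation steps, a small hypothetical pandas sketch (the column names and values are invented for illustration):

import pandas as pd

# Hypothetical raw events: one row per page visit (cookie, timestamp, page category).
events = pd.DataFrame({
    "cookie_id": ["a", "a", "b", "a", "b"],
    "ts": pd.to_datetime(["2015-03-01", "2015-03-02", "2015-03-02",
                          "2015-03-05", "2015-03-06"]),
    "category": ["car", "car", "travel", "insurance", "travel"],
})

# Events-to-features: count visits per (cookie, category), one row per cookie.
features = (events
            .groupby(["cookie_id", "category"]).size()
            .unstack(fill_value=0))

# Temporal enrichment: e.g., days since each cookie's last visit.
features["days_since_last_visit"] = (
    events["ts"].max() - events.groupby("cookie_id")["ts"].max()
).dt.days

print(features)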
9
What Is Happening When Data Gets Big? …By The Way, How Big Is Big?
• Data Ingestion: 10+ GB ingested daily
• Data Processing: queries (join, sort, …) on very large tables of 300+ GB
• Machine Learning: a 10M-line sub-sampled dataset with 30,000 features, very sparse (e.g., 60 active features per line; see the sketch below)
How do traditional tools behave with such constraints? They become too slow, the data is too large to fit in RAM, and it needs distributed storage.
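As a rough illustration of why sparsity matters at this scale (my own back-of-the-envelope sketch, not figures from the talk): with 30,000 features but only about 60 active per line, a compressed sparse row (CSR) representation takes a few gigabytes where a dense matrix would take terabytes.

import numpy as np
from scipy import sparse

n_rows, n_features, n_active = 10_000_000, 30_000, 60

# Dense storage: every cell materialized as a float64 (~2.4 TB, hopeless in RAM).
dense_bytes = n_rows * n_features * 8

# CSR storage: only the ~60 non-zeros per row (value + column index) plus row pointers.
csr_bytes = n_rows * n_active * (8 + 4) + (n_rows + 1) * 4   # ~7 GB

print(f"dense: {dense_bytes / 1e12:.1f} TB, sparse CSR: {csr_bytes / 1e9:.1f} GB")

# Building a tiny CSR matrix the same way (3 rows, 30,000 columns, a few non-zeros):
rows = [0, 0, 1, 2]
cols = [5, 17, 20_000, 29_999]
vals = [1.0, 3.0, 2.0, 1.0]
X = sparse.csr_matrix((vals, (rows, cols)), shape=(3, n_features))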
10
Our Tools
[Diagram: one tool per stage (logos): Data Ingestion, Data Processing, Machine Learning / Exploration, and Machine Learning at Scale + Interactive Analytics]
11
Let’s play a game!
12
Let’s play a game! Method 1: Use the RAM
[Diagram: cards 2, 3, 4, 5 kept in memory]
13
Let’s play a game!
14
Let’s play a game! Method 2: Sort the deck
[Diagram: the sorted deck; cards 7, 8 and 3]
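Assuming the game is to find which cards are missing from the deck (the slides only show the card diagrams), Methods 1 and 2 could look like this in plain Python:

# Method 1: use the RAM, i.e. hold everything in memory and take a set difference.
deck = set(range(1, 11))
seen = [1, 2, 4, 5, 6, 9, 10]            # hypothetical cards drawn so far
missing_ram = deck - set(seen)
print(sorted(missing_ram))               # [3, 7, 8]

# Method 2: sort the deck, then scan for gaps between consecutive cards.
cards = sorted(seen)
missing_sorted = [v for a, b in zip([0] + cards, cards + [11])
                  for v in range(a + 1, b)]
print(missing_sorted)                    # [3, 7, 8]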
15
Let’s play a game! Method 3: Use MapReduce
[Diagram: the deck is dealt into four hands, each missing one card; 7, 8 and 3 come out of the reduce step. Stages: Split → Map → Partition → Sort → Shuffle → Reduce]
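A toy sketch of the MapReduce version in plain Python (my own illustration of the stages; the exact hands and demo code are assumptions): the map phase emits (card, 1) pairs, the shuffle/sort groups them by card value, and the reduce phase counts copies, so any value with fewer copies than expected is missing.

from itertools import groupby

splits = [
    [1, 2, 3, 4, 5, 6, 8, 9, 10],        # 7 missing from this hand
    [1, 2, 3, 4, 5, 6, 7, 9, 10],        # 8 missing
    [1, 2, 4, 5, 6, 7, 8, 9, 10],        # 3 missing
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],     # complete hand
]

def map_phase(split):
    # Map: emit (card, 1) for every card seen in this split.
    return [(card, 1) for card in split]

# Shuffle and sort: bring all intermediate pairs together, ordered by key.
intermediate = sorted(pair for split in splits for pair in map_phase(split))

def reduce_phase(card, counts):
    # Reduce: total number of copies observed for one card value.
    return card, sum(counts)

counts = dict(reduce_phase(card, (c for _, c in group))
              for card, group in groupby(intermediate, key=lambda kv: kv[0]))

# With four hands, any value seen fewer than 4 times is missing somewhere.
missing = [card for card in range(1, 11) if counts.get(card, 0) < 4]
print(missing)   # [3, 7, 8]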
16
Let’s play a game! Method 4: Use Spark
[Diagram: the deck is split into partitions of cards 1-5 and 6-10 spread across workers and kept in memory across successive steps; the missing cards 7, 3 and 8 come out]
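The Spark version keeps the cards in memory as an RDD and expresses the same computation as a few transformations. A minimal PySpark sketch (illustrative only, not the original demo):

from pyspark import SparkContext

sc = SparkContext(appName="missing-cards")

# Hypothetical hands, as in the MapReduce sketch: each is missing one card.
splits = [
    [1, 2, 3, 4, 5, 6, 8, 9, 10],
    [1, 2, 3, 4, 5, 6, 7, 9, 10],
    [1, 2, 4, 5, 6, 7, 8, 9, 10],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
]

# parallelize plays the role of the split; the RDD stays in memory between steps.
cards = sc.parallelize([card for split in splits for card in split])

# Map each card to (card, 1), then reduceByKey to count the copies of each value.
counts = cards.map(lambda card: (card, 1)).reduceByKey(lambda a, b: a + b)

# A value with fewer than 4 copies is missing from some hand.
missing = counts.filter(lambda kv: kv[1] < 4).keys().collect()
print(sorted(missing))   # [3, 7, 8]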
17
Wrapping up
Used for: Machine Learning (Exploration)
Pros:
• Simple to use
• Wide ML library
Cons:
• You shall not scale
• Code can get messy
18
Wrapping up
Used for: Data storage + Batch Processing + SQL
Pros:
• Scalable to 1000+ machines
• Failure resilience
• Mature
• Very powerful data structures
Cons:
• Heavy to handle
• MapReduce jobs are hard to write
• SQL can’t do everything
19
Wrapping up
Used for: Machine Learning at Scale + Interactive Analytics
Pros:
• Very versatile
• Faster than MapReduce
• Easy to write jobs
• Works (not only) with Hadoop
• Easy to interface with SQL
• Strong community, fast growth
• Many connectors (Python, R)
Cons:
• Hard to master
• Young
• Less stable
• Less feature-complete
20
(A brief, non-exhaustive)
History of Hadoop (and co.)
[Timeline: Google publishes papers on its proprietary Google File System (2003), MapReduce (2004) and BigTable (2006); open-source counterparts follow in 2005-2006, with further open-source projects around 2006, 2007-2008, 2010-2013, 2010, 2011 and 2012]
22
Our Machine Learning Algorithms

Feature Engineering:
• Timestamps added to our features
• Feature selection: features’ frequencies and strengths; noisy features removed; L1 regularization to select top features; stepwise subset selection
• “User groups” classification
• “Sites/pages” classification (semi-supervised)
• “User groups” crossed with “site groups”
• Feature hierarchizing and crossing

Our Models:
• Logistic regression
• L2 regularization

Our Optimizers:
• Distributed algorithms in Spark
• Hybrid models: 1st epoch with SGD, then high-precision convergence with L-BFGS (OWL-QN for non-smooth functions); a small sketch follows this slide

Current explorations:
• Evaluation of factorization machines and multinomial logistic regression
• Finalization of decision-process modelling through a hidden Markov model (HMM)
• Gradient boosted decision trees used to generate more complex features
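To illustrate the hybrid optimization idea, here is a single-machine sketch under my own assumptions (NumPy/SciPy rather than the distributed Spark implementation): one rough SGD epoch over the L2-regularized logistic loss, then L-BFGS warm-started from that point for a high-precision finish.

import numpy as np
from scipy.optimize import minimize

def objective_and_grad(w, X, y, lam):
    # L2-regularized logistic loss and its gradient, for labels y in {0, 1}.
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)) + lam * w @ w
    grad = X.T @ (p - y) / len(y) + 2 * lam * w
    return loss, grad

# Synthetic stand-in for the real training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X @ rng.normal(size=20) + rng.normal(size=1000) > 0).astype(float)
lam = 1e-3

# 1st epoch with SGD: one rough pass over the data, one example at a time.
w = np.zeros(20)
learning_rate = 0.1
for i in rng.permutation(len(y)):
    _, g = objective_and_grad(w, X[i:i + 1], y[i:i + 1], lam)
    w -= learning_rate * g

# High-precision convergence with L-BFGS, warm-started from the SGD solution.
result = minimize(objective_and_grad, w, args=(X, y, lam), jac=True, method="L-BFGS-B")
print(result.fun, result.nit)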
23
What is FlamSpark? A ML Framework sitting on top
of Apache Spark and exploiting existing mathematical
libraries (MLLib, Breeze)
Why Spark? Our models in Python couldn’t scale
above 5 Million observations and 20,000 features.
Distributed Computing was our preferred option to
scale and Spark the best framework for this
Why FlamSpark?
• Simplify data manipulation
• Rapid ML model prototyping in Spark
• Support custom mathematical features beyond
existing Libs (e.g., LBFGS & OWLQN optimizers,
ElasticNet and Adaptative regularization,
Factorization Machine Algorithm)
How does it scale?• Run Logistic Regression with 100 Million
observations and 500,000 features (on a 10
nodes cluster)
ROC + Precision / Recall
+ Segments generation
for use in RTB
Raw Data
Meet FlamSpark, Our Machine Learning Framework
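FlamSpark itself is not shown in the transcript; for a sense of what a distributed logistic regression on Spark looks like, here is a standard Spark MLlib sketch (my own illustration, not FlamSpark’s API), using a tiny sparse dataset as a stand-in for the real one:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("lr-at-scale").getOrCreate()

# Tiny stand-in for the real sparse training set: (label, sparse feature vector).
train = spark.createDataFrame([
    (1.0, Vectors.sparse(5, [0, 3], [1.0, 2.0])),
    (0.0, Vectors.sparse(5, [1], [1.0])),
    (1.0, Vectors.sparse(5, [0, 2, 4], [1.0, 1.0, 3.0])),
], ["label", "features"])

# L2-regularized logistic regression (elasticNetParam=0.0 means pure L2),
# trained as a distributed Spark job; the same code scales with the cluster.
lr = LogisticRegression(maxIter=50, regParam=0.01, elasticNetParam=0.0)
model = lr.fit(train)
print(model.coefficients)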