Uncover Customer Insights with Apache Spark and ML
Bo Zhang, IBM
Joint work with Chunhui Higgins, Chul Sung and Pu Yang
2016/09/23
http://dpda.mybluemix.net
About Me
• Data Science Driven Business at IBM
• Ph.D. from North Carolina State University in Statistics
• LinkedIn: https://www.linkedin.com/in/imbozhang
Challenge: Big Data
• Cognitive Business = Digital Business + Digital Intelligence (Data Science)
• Big Data
  • User online data
  • User “offline” data
  • Watson Internet of Things
  • …
• Good news: cost (5TB of disk: ~$109.99)
• Bad news: speed (time to read 5TB from disk: 15 hours)
Cloud platform for building, running, and managing apps and services
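The 15-hour figure corresponds to a sustained read rate of roughly 93 MB/s from a single disk. The sketch below checks that arithmetic, assuming decimal units (1 TB = 10^12 bytes). The cluster sizes and the even, overhead-free split across machines are hypothetical and only illustrate why parallel reads help.

```python
# Back-of-the-envelope check of the slide's numbers.
# Assumption: decimal units, 1 TB = 10**12 bytes and 1 MB = 10**6 bytes.
TB = 10**12
MB = 10**6


def implied_read_rate_mb_s(size_tb: float, hours: float) -> float:
    """Sustained read rate (MB/s) implied by reading size_tb in the given hours."""
    return size_tb * TB / (hours * 3600) / MB


def parallel_read_hours(size_tb: float, rate_mb_s: float, machines: int) -> float:
    """Hours to read size_tb when it is split evenly over `machines` disks (idealized)."""
    return size_tb * TB / (rate_mb_s * MB * machines) / 3600


rate = implied_read_rate_mb_s(5, 15)
print(f"implied rate: {rate:.1f} MB/s")
for n in (1, 10, 100):  # hypothetical cluster sizes
    print(n, "machines:", round(parallel_read_hours(5, rate, n), 2), "hours")
```

Real clusters add network and coordination overhead, so the idealized numbers are a lower bound on read time.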
Solution: Apache Spark
• One machine cannot process or even store all the data
• Solution is to distribute data over a large cluster of machines
Spark DataFrame (DF)
Churn  Tenure  Cost  …  Type
1      13      44    …  0
1      11      33    …  1
0      68      52    …  1
1      33      33    …  0
1      23      30    …  0
0      41      39    …  0
Partition 1
Churn  Tenure  Cost  …  Type
1      13      44    …  0
1      11      33    …  1
Partition 2
Churn  Tenure  Cost  …  Type
0      68      52    …  1
1      33      33    …  0
Partition 3
Churn  Tenure  Cost  …  Type
1      23      30    …  0
0      41      39    …  0
Spark provides a programming abstraction and parallel runtime that hides the complexities of fault tolerance and slow machines.
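The partitioning idea can be sketched in plain Python, without Spark: each partition computes a small partial result on its own rows, and the partial results are then combined. The helper names `split_into_partitions` and `partial_stats` are hypothetical and are not part of any Spark API. The rows are the toy values from the slide, with the elided columns omitted.

```python
# Toy rows from the slide: (churn, tenure, cost, type).
rows = [
    (1, 13, 44, 0),
    (1, 11, 33, 1),
    (0, 68, 52, 1),
    (1, 33, 33, 0),
    (1, 23, 30, 0),
    (0, 41, 39, 0),
]


def split_into_partitions(data, num_partitions):
    """Split rows into contiguous, near-equal chunks (one per partition)."""
    size, extra = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


def partial_stats(partition):
    """Work done independently on one partition: (row count, churn count, tenure sum)."""
    return (len(partition), sum(r[0] for r in partition), sum(r[1] for r in partition))


partitions = split_into_partitions(rows, 3)
partials = [partial_stats(p) for p in partitions]  # would run in parallel on a cluster
n, churned, tenure_sum = (sum(col) for col in zip(*partials))  # combine step
print(partitions)
print("churn rate:", churned / n, "mean tenure:", tenure_sum / n)
```

Only the small partial tuples need to travel between machines, not the rows themselves.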
Apache Spark Components
• Apache Spark
• Spark SQL: Hive, JSON, …, JDBC to DB
• Spark Streaming
• ML (machine learning): ML Algorithms, Featurization, Pipelines, Persistence, Utilities
• GraphX (graph)
• Spark DF
Spark Driver and Workers
• A Spark program is two programs:
  • a driver program (one machine)
  • a worker program (cluster nodes)
[Diagram: Driver Program (SparkContext, SQLContext), Cluster Manager, and two Workers each running a Spark Executor; DF distributed across workers]
Lifecycle
• DashDB
Create
• DF
Filter • Filtered DF
Show
• Transformed DF
ML • Insights
Present
Transformations
• Filter
• Select
• …
Actions
• Show
• Count
• …
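In Spark, transformations are lazy: they only describe a new DF, and the work runs when an action is called. The toy class below mimics that split in plain Python. `ToyFrame` is a hypothetical stand-in written for this illustration, not the Spark DataFrame API, and its method signatures are assumptions.

```python
class ToyFrame:
    """Toy stand-in for a DataFrame: transformations are recorded, actions run them."""

    def __init__(self, rows, plan=()):
        self._rows = rows  # list of dicts
        self._plan = plan  # tuple of pending steps, each mapping an iterator to an iterator

    # Transformations: return a new ToyFrame and do no work yet.
    def filter(self, predicate):
        step = lambda it: (r for r in it if predicate(r))
        return ToyFrame(self._rows, self._plan + (step,))

    def select(self, *cols):
        step = lambda it: ({c: r[c] for c in cols} for r in it)
        return ToyFrame(self._rows, self._plan + (step,))

    # Actions: execute the recorded plan and produce a result.
    def _run(self):
        it = iter(self._rows)
        for step in self._plan:
            it = step(it)
        return it

    def count(self):
        return sum(1 for _ in self._run())

    def show(self):
        for r in self._run():
            print(r)


df = ToyFrame([
    {"churn": 1, "tenure": 13, "cost": 44},
    {"churn": 0, "tenure": 68, "cost": 52},
    {"churn": 1, "tenure": 33, "cost": 33},
])
churned = df.filter(lambda r: r["churn"] == 1).select("tenure", "cost")  # nothing computed yet
churned.show()          # action: triggers the work
print(churned.count())  # action: runs the recorded plan again
```

Deferring the work lets the runtime see the whole chain of steps before executing any of them.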
Machine Learning
Settings | Common Methods | Use Cases
Regression | Linear Regression, Generalized Linear Regression, Survival Regression | Customer Churn Analysis
Classification | Logistic Regression, Decision Tree, Random Forest, Naïve Bayes | Sales Leads Prediction
Clustering | K-means, Gaussian Mixture | Customer Segmentation
Collaborative Filtering | User-Based Collaborative Filtering, Item-Based Collaborative Filtering, Alternating Least Squares | Service Recommendation
Text Mining | Sentiment Analysis, Topic Classification | NPS Survey Analysis
Typical Supervised Machine Learning Pipeline
Obtain New Data
• Many sources: marketing data, user behavior data, social media data, call center data, survey data and so on
Feature Extraction
• Extract features to represent observations
• Unsupervised learning
Supervised Learning
• Train models
Evaluation
• Determine the quality of the model
Predict
• Predict on future observations
Business Initiative - Customer Churn Reduction
As part of the efforts to reduce customer churn, IBM is interested in modeling the "time to churn" in order to determine the factors associated with customers who left.
AARRR: Acquisition, Activation, Retention, Revenue, Referral
[Figure: timelines for Customers A, B, C and D from register time to now, with churn events marked]
Customer Churn Prediction via Survival Regression
• Survival regression estimates the time to an event, such as death, equipment failure, or customer churn
• We cannot show the real data here, so we use toy data with similar properties, which also makes the problem reproducible.
Tenure  Churn  Type  Analytics  Runtimes  Mobile  …  Watson  Boilerplate  Quantity  Cost
13      1      1     14         12        07      …  1       1            28        45
11      1      1     13         06        02      …  0       0            15        24
68      0      1     02         07        02      …  1       1            10        23
33      1      0     13         09        01      …  1       0            3         3
23      1      0     10         19        03      …  1       1            1         1
41      0      1     19         25        04      …  1       0            17        9
…       …      …     …          …         …       …  …       …            …         …
Survival Regression Model - Accelerated Failure Time
• T: the churn time for a customer, a random variable with probability density function (PDF):
$f(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t)}{\Delta t}$
• Cumulative distribution function (CDF):
$F(t) = P(T \le t)$
• Survival function:
$S(t) = P(T \ge t) = 1 - F(t) = \int_t^{\infty} f(x)\,dx$
• Hazard function:
$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$
• Accelerated Failure Time (AFT):
$\log(T_i) = \beta_0 + \beta_1 z_{i1} + \dots + \beta_p z_{ip} + \sigma \varepsilon_i$
• $z_k = 0$ for trial accounts, $z_k = 1$ for paying accounts:
$S_1(t) = S_2(ct), \quad c = e^{\beta_k}$
The median survival time of paying accounts is c times that of trial accounts
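The acceleration factor can be checked numerically. The sketch below assumes the Weibull form of the AFT model (the distribution named on the next slide), for which the survival function is S(t | z) = exp(-exp((log t - β0 - z'β) / σ)). The coefficient values are hypothetical and chosen only for illustration.

```python
import math


def survival(t, z, beta0, beta, sigma):
    """Weibull AFT survival S(t | z) = exp(-exp((log t - beta0 - z.beta) / sigma))."""
    mu = beta0 + sum(b * zj for b, zj in zip(beta, z))
    return math.exp(-math.exp((math.log(t) - mu) / sigma))


def median_time(z, beta0, beta, sigma):
    """Time at which S(t | z) = 0.5, i.e. exp(mu) * (ln 2) ** sigma."""
    mu = beta0 + sum(b * zj for b, zj in zip(beta, z))
    return math.exp(mu) * math.log(2.0) ** sigma


# Hypothetical coefficients, not estimates from any real data.
beta0, beta, sigma = 3.0, [0.7], 0.8
c = math.exp(beta[0])  # acceleration factor when z_k goes from 0 to 1

t = 12.0
s_trial = survival(t, [0], beta0, beta, sigma)       # S1(t), z_k = 0
s_paying = survival(c * t, [1], beta0, beta, sigma)  # S2(ct), z_k = 1
print(s_trial, s_paying)  # equal up to floating-point rounding
print(median_time([1], beta0, beta, sigma) / median_time([0], beta0, beta, sigma), c)
```

A positive β_k gives c > 1, so paying accounts reach any given survival probability c times later than trial accounts.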
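Maximum Likelihood Estimation of AFT
• Maximize the likelihood, where $\delta_i = 1$ if churn is observed for customer i and $\delta_i = 0$ if the observation is censored:
$L(\beta, \sigma; x, \delta) = \prod_{i=1}^{n} [f(x_i; \beta, \sigma)]^{\delta_i} [S(x_i; \beta, \sigma)]^{1 - \delta_i}$
• Minimize the negative log-likelihood with a Weibull distribution:
$-l(\beta, \sigma; x, \delta) = \sum_{i=1}^{n} (\delta_i \log \sigma - \delta_i \varepsilon_i + e^{\varepsilon_i})$, where $\varepsilon_i = \frac{\log t_i - x_i'\beta}{\sigma}$
• A convex optimization problem: how to solve it efficiently in Spark?
• Quasi-Newton method (L-BFGS): approximate the objective function locally as a quadratic without evaluating the second partial derivatives.
$B_{k+1} = B_k - \frac{(B_k s_k)(B_k s_k)^T}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$, where $s_k = \theta_{k+1} - \theta_k$, $y_k = g_{k+1} - g_k$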
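The Weibull negative log-likelihood and its gradient are short enough to write out directly. This is a minimal sketch in plain Python, not the Spark implementation. The gradient is taken with respect to β and log σ, which is an assumed parameterization that keeps σ positive. The rows reuse the Tenure, Churn and Type columns of the toy data, and the coefficient values are hypothetical.

```python
import math


def neg_log_likelihood(beta, sigma, data):
    """Weibull AFT loss: sum_i (delta_i*log(sigma) - delta_i*eps_i + exp(eps_i)).

    data: iterable of (t_i, delta_i, x_i), where delta_i = 1 if churn was observed,
    0 if censored, and x_i is a feature list (leading 1.0 acts as the intercept).
    """
    total = 0.0
    for t, delta, x in data:
        eps = (math.log(t) - sum(b * xj for b, xj in zip(beta, x))) / sigma
        total += delta * math.log(sigma) - delta * eps + math.exp(eps)
    return total


def gradient(beta, sigma, data):
    """Analytic gradient of the loss with respect to (beta, log sigma)."""
    g_beta = [0.0] * len(beta)
    g_log_sigma = 0.0
    for t, delta, x in data:
        eps = (math.log(t) - sum(b * xj for b, xj in zip(beta, x))) / sigma
        w = delta - math.exp(eps)
        for j, xj in enumerate(x):
            g_beta[j] += w * xj / sigma
        g_log_sigma += delta + w * eps
    return g_beta, g_log_sigma


# Toy rows: (tenure, churn, [1.0, type]); coefficients below are hypothetical.
toy = [(13, 1, [1.0, 1.0]), (11, 1, [1.0, 1.0]), (68, 0, [1.0, 1.0]),
       (33, 1, [1.0, 0.0]), (23, 1, [1.0, 0.0]), (41, 0, [1.0, 1.0])]
print(neg_log_likelihood([3.0, 0.5], 1.0, toy))
print(gradient([3.0, 0.5], 1.0, toy))
```

Both functions are plain sums over rows, which is what makes the computation easy to split across partitions.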
Survival Regression on Spark
• Initialize weights
• Broadcast weights to executors
• On each executor: compute likelihood and gradient for each instance, sum them up locally
• Reduce from executors to get the sum of likelihood and gradient
• Use L-BFGS to find the next step
• Final model weights
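The flow above can be imitated on one machine. In this sketch the "executors" are plain Python lists and the reduce is a local `functools.reduce`. To keep it short, gradient descent with backtracking is swapped in for L-BFGS, so the optimizer is not the one used in the talk. The starting point, iteration count, and the parameterization θ = (β, log σ) are assumptions.

```python
import math
from functools import reduce


def instance_loss_grad(theta, row):
    """Loss and gradient for one (t, delta, x) row; theta = beta + [log sigma]."""
    t, delta, x = row
    beta, sigma = theta[:-1], math.exp(theta[-1])
    eps = (math.log(t) - sum(b * xj for b, xj in zip(beta, x))) / sigma
    w = delta - math.exp(eps)
    loss = delta * math.log(sigma) - delta * eps + math.exp(eps)
    grad = [w * xj / sigma for xj in x] + [delta + w * eps]
    return loss, grad


def add(a, b):
    """Combine two (loss, gradient) pairs by element-wise addition."""
    return a[0] + b[0], [u + v for u, v in zip(a[1], b[1])]


def local_sum(theta, partition):
    """Executor side: sum loss and gradient over the rows of one partition."""
    zero = (0.0, [0.0] * len(theta))
    return reduce(add, (instance_loss_grad(theta, r) for r in partition), zero)


def total_loss_grad(theta, partitions):
    """Driver side: 'broadcast' theta, collect the local sums, reduce them."""
    return reduce(add, (local_sum(theta, p) for p in partitions))


def fit(partitions, theta, iterations=200):
    """Driver loop. Gradient descent with backtracking stands in for L-BFGS."""
    loss, grad = total_loss_grad(theta, partitions)
    for _ in range(iterations):
        step = 1.0
        while step > 1e-12:
            candidate = [p - step * g for p, g in zip(theta, grad)]
            try:
                new_loss, new_grad = total_loss_grad(candidate, partitions)
            except (ArithmeticError, ValueError):
                new_loss = math.inf  # numerically invalid candidate: reject it
            if new_loss < loss:  # accept the first step that lowers the loss
                theta, loss, grad = candidate, new_loss, new_grad
                break
            step /= 2.0
        else:
            break  # no improving step was found: stop iterating
    return theta, loss


partitions = [  # toy (tenure, churn, [1.0, type]) rows split over three "executors"
    [(13, 1, [1.0, 1.0]), (11, 1, [1.0, 1.0])],
    [(68, 0, [1.0, 1.0]), (33, 1, [1.0, 0.0])],
    [(23, 1, [1.0, 0.0]), (41, 0, [1.0, 1.0])],
]
theta0 = [3.0, 0.0, 0.0]  # arbitrary starting point: (beta_0, beta_type, log sigma)
print(total_loss_grad(theta0, partitions)[0])
print(fit(partitions, theta0))
```

Each iteration moves only the weight vector and one (loss, gradient) pair per partition, so communication cost does not grow with the number of rows.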
Demo: Customer Churn Analysis on Spark
• Bluemix: https://console.ng.bluemix.net
• notebook: https://goo.gl/KgxZn5
• web application: http://dpda.mybluemix.net
Takeaways
• Apache Spark helped our team significantly reduce the time from prototype to production
• Survival analysis is useful for identifying not only who will churn but also when and why they will churn
• When customers are ranked by predicted survival probability in ascending order, the top 50% of customers capture 80% of churners.
Thank You