Uncover Customer Insights with Apache Spark and ML
Bo Zhang, IBM
Joint work with Chunhui Higgins, Chul Sung and Pu Yang
2016/09/23
http://dpda.mybluemix.net

About Me

•  Data Science Driven Business at IBM

•  Ph.D. from North Carolina State University in Statistics

•  LinkedIn: https://www.linkedin.com/in/imbozhang

Challenge: Big Data

•  Cognitive Business = Digital Business + Digital Intelligence (Data Science)

•  Big Data
   •  User online data
   •  User “offline” data
   •  Watson Internet of Things
   •  …

•  Good news: cost (5TB of disk: ~$109.99)

•  Bad news: speed (time to read 5TB from disk: 15 hours)

Cloud platform for building, running, and managing apps and services

Solution: Apache Spark

•  One machine cannot process or even store all the data

•  The solution is to distribute the data over a large cluster of machines
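The "15 hours to read 5TB" figure and the case for distributing the data can be checked with a little arithmetic. The sketch below assumes decimal terabytes and ideal linear scaling with no coordination cost; the machine counts are illustrative and do not come from the slides.

```python
# Back-of-the-envelope check of the "15 hours to read 5 TB" figure.
TB = 10**12                      # assumption: decimal terabytes
data_bytes = 5 * TB
read_hours = 15.0                # figure quoted on the slide

# Disk throughput implied by the slide, in bytes per second.
throughput = data_bytes / (read_hours * 3600)


def scan_hours(n_machines: int) -> float:
    """Hours to scan the data when it is split evenly over n_machines.

    Assumes every machine reads its share at the same implied throughput
    and that there is no coordination overhead (an idealization).
    """
    return data_bytes / n_machines / throughput / 3600


print(f"implied disk throughput: {throughput / 1e6:.1f} MB/s")
for n in (1, 10, 100):           # hypothetical cluster sizes
    print(f"{n:>3} machine(s): {scan_hours(n):.2f} h")
```

Under these assumptions the scan time falls in proportion to the number of machines, which is the motivation for spreading the data over a cluster.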

Spark DataFrame (DF)

Churn  Tenure  Cost  …  Type

Partition 1
1      13      44    …  0
1      11      33    …  1

Partition 2
0      68      52    …  1
1      33      33    …  0

Partition 3
1      23      30    …  0
0      41      39    …  0

Spark provides a programming abstraction and parallel runtime that hides the complexities of fault tolerance and slow machines.

Apache Spark Components

•  Spark SQL: Hive, JSON, …, JDBC to DB

•  Spark Streaming

•  ML (machine learning): ML Algorithms, Featurization, Pipelines, Persistence, Utilities

•  GraphX (graph)

•  Base labels in the diagram: Apache Spark, Spark DF

Spark Driver and Workers

•  A Spark program is two programs:
   •  a driver program (one machine)
   •  a worker program (cluster nodes)

[Diagram: Driver Program (SparkContext, SQLContext), Cluster Manager, Workers (each running a Spark Executor); DF distributed across workers]

•  SparkContext

•  SQLContext

Lifecycle

[Diagram, labels in slide order: DashDB → Create → DF → Filter → Filtered DF → Show → Transformed DF → ML → Insights → Present]

•  Transformations: Filter, Select, …

•  Actions: Show, Count, …
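The split between transformations and actions can be illustrated with a toy class. This is not the Spark API: `ToyFrame` is a hypothetical stand-in written for this note. It assumes the usual lazy model in which transformations only record a plan and an action triggers the work. The rows reuse the Churn, Tenure and Cost values from the DataFrame slide.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]


class ToyFrame:
    """Toy stand-in for a DataFrame (NOT the Spark API).

    Transformations append to a plan; actions execute the plan.
    """

    def __init__(self, rows: List[Row], plan: Tuple = ()):
        self._rows = rows
        self._plan = plan

    # Transformations: return a new frame, do no work yet.
    def filter(self, predicate: Callable[[Row], bool]) -> "ToyFrame":
        return ToyFrame(self._rows, self._plan + (("filter", predicate),))

    def select(self, *cols: str) -> "ToyFrame":
        return ToyFrame(self._rows, self._plan + (("select", cols),))

    def _run(self) -> List[Row]:
        rows = self._rows
        for kind, arg in self._plan:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:  # "select"
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows

    # Actions: execute the recorded plan.
    def count(self) -> int:
        return len(self._run())

    def show(self) -> None:
        for r in self._run():
            print(r)


rows = [dict(Churn=c, Tenure=t, Cost=k)
        for c, t, k in [(1, 13, 44), (1, 11, 33), (0, 68, 52),
                        (1, 33, 33), (1, 23, 30), (0, 41, 39)]]

checked = []  # records every row the predicate inspects


def is_churned(row: Row) -> bool:
    checked.append(row)
    return row["Churn"] == 1


churned = ToyFrame(rows).filter(is_churned).select("Tenure", "Cost")
rows_checked_before_action = len(checked)  # nothing has been inspected yet
n_churned = churned.count()                # action: the plan runs now
churned.show()                             # action: the plan runs again
```

In this toy the plan is re-executed on every action; whether and how results are reused in a real cluster is outside the scope of the sketch.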

Machine Learning

Settings, Common Methods and Use Cases:

•  Regression. Common methods: Linear Regression, Generalized Linear Regression, Survival Regression. Use case: Customer Churn Analysis

•  Classification. Common methods: Logistic Regression, Decision Tree, Random Forest, Naïve Bayes. Use case: Sales Leads Prediction

•  Clustering. Common methods: K-means, Gaussian Mixture. Use case: Customer Segmentation

•  Collaborative Filtering. Common methods: User-Based Collaborative Filtering, Item-Based Collaborative Filtering, Alternating Least Squares. Use case: Service Recommendation

•  Text Mining. Common methods: Sentiment Analysis, Topic Classification. Use case: NPS Survey Analysis

Typical Supervised Machine Learning Pipeline

•  Obtain New Data: many sources, including marketing data, user behavior data, social media data, call center data, survey data and so on

•  Feature Extraction: extract features to represent observations; unsupervised learning

•  Supervised Learning: train models

•  Evaluation: determine the quality of the model

•  Predict: predict on future observations

Business Initiative - Customer Churn Reduction

[Figure: timelines for Customer A, Customer B, Customer C and Customer D; axis labels: Register Time, Now; events marked: Churn]

AARRR: Acquisition, Activation, Retention, Revenue, Referral

As part of the efforts to reduce customer churn, IBM is interested in modeling the "time to churn" in order to determine the factors associated with customers who left.

Customer Churn Prediction via Survival Regression

•  Survival regression estimates the time to an event, such as death, equipment failure or customer churn

•  We cannot show the real data here, so we use toy data with similar properties, which also makes the problem reproducible.

Tenure  Churn  Type  Analytics  Runtimes  Mobile  …  Watson  Boilerplate  Quantity  Cost
13      1      1     14         12        07      …  1       1            28        45
11      1      1     13         06        02      …  0       0            15        24
68      0      1     02         07        02      …  1       1            10        23
33      1      0     13         09        01      …  1       0            3         3
23      1      0     10         19        03      …  1       1            1         1
41      0      1     19         25        04      …  1       0            17        9
…       …      …     …          …         …       …  …       …            …         …

Survival Regression Model - Accelerated Failure Time

•  T: the churn time for a customer, a random variable with probability density function (PDF):
   $f(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t)}{\Delta t}$

•  Cumulative distribution function (CDF):
   $F(t) = P(T \le t)$

•  Survival function:
   $S(t) = P(T \ge t) = 1 - F(t) = \int_t^{\infty} f(x)\,dx$

•  Hazard function:
   $h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$

•  Accelerated Failure Time (AFT):
   $\log(T_i) = \beta_0 + \beta_1 z_{i1} + \dots + \beta_p z_{ip} + \sigma \varepsilon_i$

•  $z_k = 0$ for trial accounts, $z_k = 1$ for paying accounts:
   $S_1(t) = S_2(ct), \quad c = e^{\beta_k}$

•  The median survival time of paying accounts is c times that of trial accounts
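The scaling property S1(t) = S2(ct) and the median ratio can be checked numerically. The sketch below assumes the Weibull form of the AFT model, in which the error term follows the standard minimum extreme value distribution; the coefficient values are hypothetical and chosen only for illustration.

```python
import math


def survival(t: float, mu: float, sigma: float) -> float:
    """S(t) when log T = mu + sigma * eps and eps follows the standard
    minimum extreme value distribution (the Weibull AFT case)."""
    return math.exp(-math.exp((math.log(t) - mu) / sigma))


def median(mu: float, sigma: float) -> float:
    """Time at which S(t) = 0.5 under the same model."""
    return math.exp(mu + sigma * math.log(math.log(2)))


# Hypothetical coefficients, not estimates from the talk.
beta0, beta_k, sigma = 3.0, 0.7, 0.8
c = math.exp(beta_k)

mu_trial = beta0             # z_k = 0
mu_paying = beta0 + beta_k   # z_k = 1

# S1(t) = S2(c t): a paying account at time c*t has the same survival
# probability as a trial account at time t.
for t in (5.0, 20.0, 60.0):
    assert math.isclose(survival(t, mu_trial, sigma),
                        survival(c * t, mu_paying, sigma))

ratio = median(mu_paying, sigma) / median(mu_trial, sigma)
print(f"median ratio = {ratio:.4f}, c = {c:.4f}")
```

The same factor c applies to every quantile, not only the median, because the whole time axis is rescaled.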

Maximum Likelihood Estimation of AFT

•  Maximize the likelihood:
   $L(\beta, \sigma; x, \delta) = \prod_{i=1}^{n} [f(x_i; \beta, \sigma)]^{\delta_i} [S(x_i; \beta, \sigma)]^{1 - \delta_i}$

•  Minimize the negative log-likelihood with a Weibull distribution:
   $-l(\beta, \sigma; x, \delta) = \sum_{i=1}^{n} (\delta_i \log\sigma - \delta_i \varepsilon_i + e^{\varepsilon_i})$, where $\varepsilon_i = \frac{\log t_i - x_i'\beta}{\sigma}$

•  A convex optimization problem: how to solve it efficiently in Spark?

•  Quasi-Newton method (L-BFGS): approximate the objective function locally as a quadratic without evaluating the second partial derivatives:
   $B_{k+1} = B_k - \frac{(B_k s_k)(B_k s_k)^T}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$, where $s_k = \theta_{k+1} - \theta_k$, $y_k = g_{k+1} - g_k$
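The negative log-likelihood can be written directly from the sum of δ_i log σ − δ_i ε_i + e^{ε_i}. The sketch below is a plain-Python illustration, not the Spark implementation. It assumes δ_i = 1 when churn was observed and 0 otherwise, uses an intercept plus the Type column as features, takes Tenure, Churn and Type from the toy table, and evaluates the function at hypothetical parameter values.

```python
import math
from typing import List, Sequence


def neg_log_likelihood(beta: Sequence[float], sigma: float,
                       X: List[Sequence[float]], t: Sequence[float],
                       delta: Sequence[int]) -> float:
    """Weibull AFT negative log-likelihood:
    sum_i (delta_i * log(sigma) - delta_i * eps_i + exp(eps_i)),
    with eps_i = (log t_i - x_i' beta) / sigma."""
    total = 0.0
    for x_i, t_i, d_i in zip(X, t, delta):
        eps = (math.log(t_i) - sum(b * v for b, v in zip(beta, x_i))) / sigma
        total += d_i * math.log(sigma) - d_i * eps + math.exp(eps)
    return total


# Toy rows from the table: (Tenure, Churn, Type).
toy = [(13, 1, 1), (11, 1, 1), (68, 0, 1), (33, 1, 0), (23, 1, 0), (41, 0, 1)]
X = [[1.0, float(typ)] for _, _, typ in toy]   # intercept + Type (assumed features)
t = [float(tenure) for tenure, _, _ in toy]
delta = [churn for _, churn, _ in toy]         # assumption: 1 = churn observed

# Hypothetical parameter values, for illustration only.
for beta, sigma in [([3.0, 0.0], 1.0), ([3.5, 0.2], 0.8)]:
    print(beta, sigma, neg_log_likelihood(beta, sigma, X, t, delta))
```

An optimizer such as L-BFGS would search over β and σ for the values that make this quantity smallest.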

Survival Regression on Spark

•  Initialize weights

•  Broadcast weights to executors

•  On each executor: compute the likelihood and gradient for each instance and sum them up locally

•  Reduce from the executors to get the sum of likelihood and gradient

•  Use L-BFGS to find the next step

•  Final model weights
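The per-executor sum followed by a reduce can be simulated on a single machine. This is a plain-Python sketch, not Spark code: the three lists stand in for partitions, σ is held fixed so that only the gradient with respect to β is shown, and the L-BFGS step itself is omitted. The weights are hypothetical; the rows reuse Churn, Tenure and Type from the DataFrame slide.

```python
import math
from functools import reduce


def loss_and_grad(rows, beta, sigma):
    """Sum the Weibull AFT negative log-likelihood and its gradient
    with respect to beta over one partition (sigma held fixed)."""
    loss, grad = 0.0, [0.0] * len(beta)
    for x, t, delta in rows:
        eps = (math.log(t) - sum(b * v for b, v in zip(beta, x))) / sigma
        loss += delta * math.log(sigma) - delta * eps + math.exp(eps)
        for j, v in enumerate(x):
            grad[j] += (delta - math.exp(eps)) * v / sigma
    return loss, grad


def combine(a, b):
    """Reduce step: add two (loss, gradient) partial sums."""
    return a[0] + b[0], [u + v for u, v in zip(a[1], b[1])]


# Each row: (features [intercept, Type], Tenure, Churn).
partitions = [
    [([1.0, 0.0], 13.0, 1), ([1.0, 1.0], 11.0, 1)],   # Partition 1
    [([1.0, 1.0], 68.0, 0), ([1.0, 0.0], 33.0, 1)],   # Partition 2
    [([1.0, 0.0], 23.0, 1), ([1.0, 0.0], 41.0, 0)],   # Partition 3
]

beta, sigma = [3.0, 0.5], 1.0   # hypothetical "broadcast" weights

# "Map": every partition computes its local sums independently.
partials = [loss_and_grad(p, beta, sigma) for p in partitions]

# "Reduce": the driver adds the partial sums together.
total_loss, total_grad = reduce(combine, partials)
print(total_loss, total_grad)
```

Because the loss and gradient are plain sums over instances, the reduced totals equal what a single pass over all rows would give, which is why the computation distributes cleanly.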

Demo: Customer Churn Analysis on Spark

•  Bluemix: https://console.ng.bluemix.net

•  Notebook: https://goo.gl/KgxZn5

•  Web application: http://dpda.mybluemix.net

Takeaways

•  Apache Spark helped our team significantly reduce the time from prototype to production

•  Survival analysis is useful for identifying not only who will churn but also when and why they will churn

•  When customers are ranked by predicted survival probability in ascending order, the top 50% of customers capture 80% of churners.
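A capture rate of this kind can be computed as follows. This is a minimal sketch of one plausible definition (the share of all churners found among the lowest-ranked customers). The ten customers and their scores are made up for illustration and were chosen to mirror the 50%/80% pattern; they are not the talk's data.

```python
from typing import Sequence


def capture_rate(survival_probs: Sequence[float], churned: Sequence[int],
                 top_fraction: float) -> float:
    """Share of all churners found among the top_fraction of customers
    with the lowest predicted survival probability."""
    # Rank customers by survival probability, ascending (riskiest first).
    order = sorted(range(len(survival_probs)), key=lambda i: survival_probs[i])
    k = int(round(top_fraction * len(order)))
    total = sum(churned)
    return sum(churned[i] for i in order[:k]) / total if total else 0.0


# Made-up scores and outcomes for ten hypothetical customers.
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.50, 0.60, 0.70, 0.80, 0.90]
churn = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

rate = capture_rate(probs, churn, 0.5)
print(rate)  # 0.8 on this made-up data
```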

Thank You
