Optimization for Data Science
AMIES – ILB
S. Gaïffas
Motivations
1. Partnership with Caisse Nationale d’Assurance Maladie (CNAM)
World’s largest electronic health records database
Extremely complex data preprocessing
Pharmacovigilance: detect potentially dangerous drugs
2. Time-oriented machine learning
High-frequency financial signals
Social networks
“Causality maps”
Motivation 1. CNAM Project
Context
Electronic health records: SNIIRAM + PMSI
Extremely complex database: 800 SQL tables, 500 TB, all in a SAS-Oracle closed ecosystem (Exadata)
All health-care reimbursements of the French population (with diagnoses, prescriptions, hospital stays, etc.)
Applications with a strong social impact
Goals
Pharmacovigilance: automatically detect potentially dangerous drugs (screening)
Examples: some anti-diabetics and bladder cancer, drug changes and fractures (in elderly persons)
Team
6 engineers
Administration, big data development
Motivation 1. CNAM Project
Big data cluster
“Scalable” architecture
4 masters
15 slaves
240 cores
1.9 TB RAM
480 TB (120 hard drives)
HDFS
Spark (mostly spark.sql), Scala
Only open-source technology
Motivation 1. CNAM project
Administration of two duplicate clusters (CNAM and X)
Understanding of the data
“Flattening” of the data
All from “raw” CSV extracts...
Work/Code management
Production in “agile” mode
Confluence + JIRA + GitHub
Motivation 1. CNAM Project
Types of results
For antidiabetics and bladder cancer
New model for longitudinal data for “self-controlled case series”
Validation: blind detection of a well-known adverse effect of some drug (withdrawn from the French market in 2011)
Motivation 2. Time-oriented machine learning
From observed event streams (e.g. tweets, price moves) [figure omitted],
we want to quantify interactions: [figure omitted]
Motivation 2. Time-oriented machine learning
Hawkes process
$N = [N_1, \dots, N_d]^\top$ where $N_i(t) = \sum_{k \ge 1} \mathbf{1}_{\{t^i_k \le t\}}$
$N_i$ jumps each time $i$ does something (e.g. tweet, price up or down)
Model: $N_i$ has an intensity $\lambda_i$ given by
$$\lambda_i(t) = \mu_i(t) + \sum_{j=1}^{d} \int_{(0,t)} \varphi_{ij}(t - s) \, dN_j(s)$$
[Figure: $\lambda_1(t)$ and corresponding ticks with $d = 1$ and $\varphi_{11}(t) = e^{-t}$]
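To make the model concrete, here is a minimal NumPy sketch (not from the talk) that evaluates this intensity for exponential kernels $\varphi_{ij}(t) = a_{ij}\,\beta\, e^{-\beta t}$; the names mu, a, beta and events are illustrative assumptions.

```python
# Minimal sketch: evaluate lambda_i(t) for exponential kernels
# phi_ij(t) = a_ij * beta * exp(-beta * t), given observed event times.
import numpy as np

def intensity(i, t, mu, a, beta, events):
    """events[j] = sorted array of event times of node j."""
    lam = mu[i]
    for j, t_j in enumerate(events):
        past = t_j[t_j < t]                       # events of j strictly before t
        lam += a[i, j] * beta * np.exp(-beta * (t - past)).sum()
    return lam

# d = 1 and phi_11(t) = e^{-t} (a = 1, beta = 1), as in the figure above
mu = np.array([0.5]); a = np.array([[1.0]]); beta = 1.0
events = [np.array([0.3, 1.1, 1.4])]
print(intensity(0, 2.0, mu, a, beta, events))
```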
Motivation 2. Time-oriented machine learning
Achab, Bacry, Gaïffas, Mastromatteo, Muzy, Uncovering Causality from Multivariate Hawkes Integrated Cumulants, ICML 2017
Granger causality estimation via integrated cumulants
Highly non-convex problem
Application on social networks and financial datasets
[Figures: results on MemeTracker and DAX data]

Method   ODE     GC      ADM4    NPHC
Err      0.162   0.19    0.092   0.071
Corr     0.07    0.053   0.081   0.095
Time     2944    2780    2217    38
Motivation 2. Time-oriented machine learning
Lead/lag + flow prediction
Software: tick library
Python 3 and C++11
Open-source (BSD-3 License)
pip install tick (on macOS and Linux...)
https://x-datainitiative.github.io/tick
Statistical learning for time-dependent models
Point processes (Poisson, Hawkes), Survival analysis, GLMs(parallelized, sparse, etc.)
A strong simulation and optimization toolbox
Partnership with Intel (use-case of a new processor with 180 cores)
Contributors welcome!
Software: tick library
Optimization for data science
You want to train a logistic regression with ridge penalization:
$$\arg\min_{w \in \mathbb{R}^d} \left\{ \frac{1}{n} \sum_{i=1}^{n} \log\big(1 + e^{-y_i x_i^\top w}\big) + \frac{\lambda}{2} \|w\|_2^2 \right\}$$
You have many ways to do it:
Gradient descent
Coordinate descent
Quasi-newton (BFGS)
Stochastic gradient descent
Dual, primal-dual methods
...
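For concreteness, a minimal sketch of one of these options, quasi-Newton via SciPy's L-BFGS-B, on synthetic data; the data and the value of λ are assumptions made for illustration.

```python
# Minimal sketch: ridge-penalized logistic regression trained with L-BFGS-B.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, lam = 1000, 20, 1e-2
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))

def objective(w):
    z = y * (X @ w)
    # log(1 + e^{-z}) computed stably via logaddexp
    return np.logaddexp(0.0, -z).mean() + 0.5 * lam * w @ w

def gradient(w):
    z = y * (X @ w)
    s = -y / (1.0 + np.exp(z))          # derivative of the logistic loss
    return X.T @ s / n + lam * w

res = minimize(objective, np.zeros(d), jac=gradient, method="L-BFGS-B")
print(res.fun, np.linalg.norm(gradient(res.x)))
```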
Optimization for data science
You’re likely to get very different performances:
Many linear methods for supervised learning can be written as the following problem:
$$\arg\min_{w \in \mathbb{R}^d} \left\{ \frac{1}{n} \sum_{i=1}^{n} f_i(w) + \lambda g(w) \right\}$$
where
$f_i(w)$ = "loss" of the model with weights $w$ on the $i$-th data point
$g$ is a penalization
Examples where $f_i(w) = \ell(y_i, x_i^\top w)$:
Linear regression: $\ell(y, y') = \frac{1}{2}(y - y')^2$
Logistic regression: $\ell(y, y') = \log(1 + e^{-y y'})$
Hinge loss (SVM): $\ell(y, y') = (1 - y y')_+$
And let's define $f = \frac{1}{n}\sum_{i=1}^{n} f_i$. NB: the goodness-of-fit is an average.
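In code, these example losses are one-liners (a sketch; here y ∈ {−1, +1} for the two classification losses, and yp stands for the score $x_i^\top w$):

```python
# The three example losses above, vectorized over NumPy arrays.
import numpy as np

def squared(y, yp):  return 0.5 * (y - yp) ** 2             # linear regression
def logistic(y, yp): return np.logaddexp(0.0, -y * yp)      # log(1 + e^{-y yp})
def hinge(y, yp):    return np.maximum(0.0, 1.0 - y * yp)   # (1 - y yp)_+
```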
“Simplest” algorithm: gradient descent
Input: starting point $w^0$, learning rate $\eta > 0$
For $k = 1, 2, \dots$ until converged do:
  $w^k \leftarrow \mathrm{prox}_{\eta g}\big(w^{k-1} - \eta \nabla f(w^{k-1})\big)$
Return last $w^k$
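A minimal sketch of this iteration, assuming the ridge penalty $g(w) = \frac{\lambda}{2}\|w\|_2^2$, whose prox has a closed form, and a user-supplied grad_f:

```python
# Proximal gradient descent sketch for F = f + (lam/2)||.||^2.
import numpy as np

def prox_ridge(w, eta, lam):
    # prox_{eta*g}(w) with g = (lam/2)||.||^2 is a simple shrinkage
    return w / (1.0 + eta * lam)

def gradient_descent(grad_f, w0, eta, lam, n_iter=200):
    w = w0.copy()
    for _ in range(n_iter):
        w = prox_ridge(w - eta * grad_f(w), eta, lam)
    return w
```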
What if sample size n is large?
Each iteration of a full gradient method has complexity O(nd)
You need to wait some time before doing anything...
Idea: we want to minimize an average of losses...
If I choose uniformly at random I ∈ {1, . . . , n}, then
$$\mathbb{E}[\nabla f_I(w)] = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(w) = \nabla f(w)$$
$\nabla f_I(w)$ is an unbiased but very noisy estimate of the full gradient $\nabla f(w)$
Stochastic Gradient Descent: Robbins and Monro (1951)
Stochastic Gradient Descent
Input: starting point $w^0$, sequence of learning rates $\{\eta_t\}_{t \ge 0}$
For $t = 1, 2, \dots$ until convergence do:
  Sample $i_t$ uniformly in $\{1, \dots, n\}$
  $w^t \leftarrow w^{t-1} - \eta_t \nabla f_{i_t}(w^{t-1})$
Return last $w^t$
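A minimal sketch, assuming a user-supplied grad_fi(w, i) returning $\nabla f_i(w)$; the decaying step-size schedule is one common choice, not the only one:

```python
# SGD sketch for F(w) = (1/n) sum_i f_i(w).
import numpy as np

def sgd(grad_fi, w0, n, n_iter=10_000, eta0=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    for t in range(1, n_iter + 1):
        i = rng.integers(n)            # sample i_t uniformly in {0,...,n-1}
        eta_t = eta0 / np.sqrt(t)      # decaying step size (Robbins-Monro)
        w -= eta_t * grad_fi(w, i)
    return w
```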
Remarks
Each iteration has complexity O(d) instead of O(nd) for full gradient methods
Actually faster for sparse datasets (lazy or delayed updates)
Very fast in the early iterations (first passes on the data)
Very slow convergence to a precise minimizer
The problem
Put $X = \nabla f_I(w)$ with $I$ uniformly chosen at random in $\{1, \dots, n\}$.
In SGD we use $X = \nabla f_I(w)$ as an approximation of $\mathbb{E}X = \nabla f(w)$.
How to reduce $\mathrm{var}\, X$?
Recent results improve this:
Bottou and LeCun (2005)
Shalev-Shwartz et al (2007, 2009)
Nesterov et al. (2008, 2009)
Bach et al. (2011, 2012, 2014, 2015)
T. Zhang et al. (2014, 2015)
An idea
Reduce it by finding $C$ s.t. $\mathbb{E}C$ is "easy" to compute and such that $C$ is highly correlated with $X$
Put $Z_\alpha = \alpha(X - C) + \mathbb{E}C$ for $\alpha \in [0, 1]$. We have
$$\mathbb{E}Z_\alpha = \alpha\,\mathbb{E}X + (1 - \alpha)\,\mathbb{E}C$$
and
$$\mathrm{var}\, Z_\alpha = \alpha^2\big(\mathrm{var}\,X + \mathrm{var}\,C - 2\,\mathrm{cov}(X, C)\big)$$
Standard variance reduction: $\alpha = 1$, so that $\mathbb{E}Z_\alpha = \mathbb{E}X$ (unbiased)
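A tiny numeric check of this identity on synthetic data (all values illustrative): when $C$ is highly correlated with $X$, the $\alpha = 1$ estimate keeps the mean of $X$ with a much smaller variance.

```python
# Control-variate demo: Z_1 = (X - C) + E[C] is unbiased with
# var Z_1 = var X + var C - 2 cov(X, C), small when C tracks X.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(100_000)
C = X + 0.1 * rng.standard_normal(100_000)  # highly correlated with X
Z = (X - C) + 0.0                           # alpha = 1; E[C] = 0 by construction
print(X.var(), Z.var())    # variance drops by about 100x
print(X.mean(), Z.mean())  # both estimate E[X] = 0
```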
Variance reduction of the gradient
In the iterations of SGD, replace $\nabla f_{i_t}(w^{t-1})$ by
$$\alpha\big(\nabla f_{i_t}(w^{t-1}) - \nabla f_{i_t}(\tilde w)\big) + \nabla f(\tilde w)$$
where $\tilde w$ is an "old" value of the iterate, namely use
$$w^t \leftarrow w^{t-1} - \eta\Big(\alpha\big(\nabla f_{i_t}(w^{t-1}) - \nabla f_{i_t}(\tilde w)\big) + \nabla f(\tilde w)\Big)$$
Several cases:
$\alpha = 1/n$: SAG (Bach et al., 2013)
$\alpha = 1$: SVRG (T. Zhang et al., 2015)
$\alpha = 1$: SAGA (Bach et al., 2014)
Stochastic Variance Reduced Gradient (SVRG)
Phase size typically chosen as $m = n$ or $m = 2n$. If $F = f + g$ with $g$ prox-capable, use
$$w^{t+1}_k \leftarrow \mathrm{prox}_{\eta g}\big(w^t_k - \eta(\nabla f_i(w^t_k) - \nabla f_i(\tilde w_k) + \nabla f(\tilde w_k))\big)$$
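A minimal sketch of SVRG in the smooth case ($g = 0$), assuming user-supplied grad_fi and grad_f; phase size m and step η are tuning choices:

```python
# SVRG sketch: one full gradient per phase, variance-reduced steps inside.
import numpy as np

def svrg(grad_fi, grad_f, w0, n, n_phases=20, m=None, eta=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    m = m or n                      # phase size, typically n or 2n
    w = w0.copy()
    for _ in range(n_phases):
        w_ref = w.copy()            # "old" iterate, fixed during the phase
        full_grad = grad_f(w_ref)   # one full gradient per phase
        for _ in range(m):
            i = rng.integers(n)
            # variance-reduced gradient estimate (alpha = 1)
            g = grad_fi(w, i) - grad_fi(w_ref, i) + full_grad
            w -= eta * g
    return w
```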
SAGA
If $F = f + g$ with $g$ prox-capable, use
$$w^t \leftarrow \mathrm{prox}_{\eta g}\Big(w^{t-1} - \eta\Big(\nabla f_{i_t}(w^{t-1}) - g^{t-1}(i_t) + \frac{1}{n}\sum_{i=1}^{n} g^{t-1}(i)\Big)\Big)$$
where $g^{t-1}(i)$ denotes the last gradient of $f_i$ computed before iteration $t$ (kept in a table).
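A minimal SAGA sketch (again $g = 0$, user-supplied grad_fi); initializing the gradient table at zero is a simplification of the usual initialization:

```python
# SAGA sketch: maintain a table of the last gradient seen for each i.
import numpy as np

def saga(grad_fi, w0, n, d, n_iter=10_000, eta=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    table = np.zeros((n, d))            # g^{t-1}(i): stored past gradients
    avg = table.mean(axis=0)            # running average of the table
    for _ in range(n_iter):
        i = rng.integers(n)
        g_new = grad_fi(w, i)
        w -= eta * (g_new - table[i] + avg)   # uses the OLD table entry
        avg += (g_new - table[i]) / n         # keep the average in sync
        table[i] = g_new
    return w
```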
Important remark
In these algorithms, the step-size η is kept constant
Leads to linearly convergent algorithms, with a numerical complexity comparable to SGD!
Theoretical knowledge
What is the order of the complexity K required to achieve ε-precision:
$$F(w^K) - F(w^\star) \le \varepsilon \ ?$$
If
$$\mu \le \lambda_{\min}(\nabla^2 f) \le \lambda_{\max}(\nabla^2 f) \le L,$$
we know that, with $\kappa = L/\mu$:
Gradient descent: $K = \Theta\big(d \times n \times \kappa \times \log(\tfrac{1}{\varepsilon})\big)$
SGD: $K = \Theta\big(d \times \kappa \times \tfrac{1}{\varepsilon}\big)$
If
$$\mu \le \lambda_{\min}(\nabla^2 f) \quad \text{and} \quad \lambda_{\max}(\nabla^2 f_i) \le L_i \ \text{for all } i = 1, \dots, n,$$
we know that, with $\kappa = \max_i L_i / \mu$:
SAG, SAGA, SVRG, SDCA: $K = \Theta\big(d \times (n + \kappa) \times \log(\tfrac{1}{\varepsilon})\big)$
Comparison of algorithms
Asynchronous mode
All the above algorithms can be strongly parallelized, using an asynchronous mode involving "lock-free" updates
Lock-free SGD: apply in parallel
  sample $i$ uniformly in $\{1, \dots, n\}$
  $w \leftarrow w - \eta \nabla f_i(w)$
without locking w (hence allowing collisions)
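A Hogwild!-style sketch of this scheme; pure-Python threads only illustrate the idea (the GIL serializes most of the work here), real lock-free implementations live in C++/OpenMP:

```python
# Lock-free SGD sketch: threads update a shared weight vector without locks,
# tolerating collisions.
import threading
import numpy as np

def worker(w, grad_fi, n, n_updates, eta, seed):
    rng = np.random.default_rng(seed)
    for _ in range(n_updates):
        i = rng.integers(n)
        w -= eta * grad_fi(w, i)   # unsynchronized update of the shared w

def lock_free_sgd(grad_fi, d, n, n_threads=4, n_updates=5_000, eta=0.01):
    w = np.zeros(d)                # shared state, read and written lock-free
    threads = [threading.Thread(target=worker,
                                 args=(w, grad_fi, n, n_updates, eta, s))
               for s in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```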
References (with generalizations and variance-reduction)
Niu et al. (2011), Hsieh et al. (2015), Reddi et al. (2015), Mania et al. (2015), Zhao and Li (2016), Leblond, Pedregosa, Lacoste-Julien (2017)
[Figure from Leblond, Pedregosa, Lacoste-Julien (2017)]
The Poisson problem: non-smooth objectives
Great! So, let's use these powerful tools on our problems:
Health (Poisson-type models)
Hawkes processes (Social networks, Finance)
Our general optimization problem is
$$\min_{w \in \mathbb{R}^d} \ \psi^\top w + \frac{1}{n}\sum_{i=1}^{n} f_i(w^\top x_i) + \lambda g(w)$$
subject to $x_i^\top w > 0$ for all $i = 1, \dots, n$,
with $f_i(u) = -y_i \log(u)$ and $y_i > 0$.
Example 1. Poisson regression (linear link)
$$\min_{w \in \mathbb{R}^d} \ \frac{1}{n}\sum_{i=1}^{n} \big( w^\top x_i - y_i \log(w^\top x_i) \big) + g(w)$$
subject to $x_i^\top w > 0$ for all $i = 1, \dots, n$.
The Poisson problem: non-smooth objectives
Example 2. Hawkes process’ log-likelihood
$$\min_{w \in \mathbb{R}^d} \ \psi^\top w - \sum_{i=1}^{d} \sum_{k=1}^{N^i} \log(w^\top x_{i,k})$$
subject to $x_{i,k}^\top w > 0$ for all $i = 1, \dots, d$ and $k = 1, \dots, N^i$,
for some choice of $x_{i,k}$, with $n = \sum_{i=1}^{d} N^i$.
Problem
$-\log$ is non-smooth! (its gradient is not even bounded). Standard theory is useless. Can we still get linear rates?
In practice: hard to tune the step-size
The Poisson problem: non-smooth objectives
Most algorithms won't even converge (example on a non-pathological Hawkes process, with sparse/small parameters)
[Figure: reached precision vs. number of iterations (0 to 50) for L-BFGS-B, Ista, Fista, SCPG, SVRG, SDCA]
The Poisson problem: non-smooth objectives
Idea (work with M. Bompaire)
Use a slightly modified Stochastic Dual Coordinate Ascent (T. Zhang et al., 2015)
$$\sup_{\alpha \in \mathbb{R}^n : \, \alpha_i > 0} \ \frac{1}{n}\sum_{i=1}^{n} -f_i^*(-\alpha_i) - \lambda g^*\Big(\frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i x_i - \frac{1}{\lambda}\psi\Big)$$
With the primal-dual relation:
$$w = \nabla g^*\Big(\frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i x_i - \frac{1}{\lambda}\psi\Big),$$
where $f_i^*$ and $g^*$ are the convex conjugates of $f_i$ and $g$.
Since $f_i(u) = -y_i \log(u)$, we have $f_i^*(v) = -y_i - y_i \log\big(\tfrac{-v}{y_i}\big)$.
The Poisson problem: non-smooth objectives
Algorithm.
Input: starting point $\alpha^{(0)}$
Put $w^{(0)} = \frac{1}{\lambda n}\sum_{i=1}^{n} \alpha^{(0)}_i x_i - \frac{1}{\lambda}\psi$
For $t = 1, 2, \dots, T$ do:
  Randomly pick $i$
  Find $\alpha_i$ that maximizes
  $$y_i + y_i \log\Big(\frac{\alpha_i}{y_i}\Big) - \frac{\lambda n}{2}\Big\|w^{(t-1)} + \frac{1}{\lambda n}\big(\alpha_i - \alpha^{(t-1)}_i\big)x_i\Big\|_2^2 \quad (1)$$
  (explicit solution)
  $\alpha^{(t)} \leftarrow \alpha^{(t-1)} + \Delta\alpha_i e_i$
  $w^{(t)} \leftarrow w^{(t-1)} + (\lambda n)^{-1}\Delta\alpha_i x_i$
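A minimal sketch of these steps (an illustration of the algorithm above, not the tick implementation), taking $g(w) = \frac{1}{2}\|w\|_2^2$ so that $\nabla g^*$ is the identity; setting the derivative of (1) to zero gives a quadratic in $\alpha_i$ whose positive root is the explicit maximizer:

```python
# Modified SDCA sketch for Poisson losses f_i(u) = -y_i * log(u).
import numpy as np

def sdca_poisson(X, y, psi, lam, n_iter=10_000, rng=None):
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    alpha = np.ones(n)                        # starting dual point, alpha_i > 0
    w = X.T @ alpha / (lam * n) - psi / lam   # primal-dual relation
    sq_norms = (X ** 2).sum(axis=1)           # assumes no x_i is all-zero
    for _ in range(n_iter):
        i = rng.integers(n)
        a = sq_norms[i] / (lam * n)
        b = X[i] @ w - alpha[i] * a
        # maximizer of (1): positive root of a*alpha^2 + b*alpha - y_i = 0
        new_alpha = (-b + np.sqrt(b * b + 4.0 * a * y[i])) / (2.0 * a)
        delta = new_alpha - alpha[i]
        alpha[i] = new_alpha
        w += delta * X[i] / (lam * n)         # w update from the algorithm
    return w, alpha
```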
Contribution
A new state-of-the-art for Poisson regression and Hawkes processes
Provable linear rates (using self-concordance)
The Poisson problem: non-smooth objectives
The Poisson problem in tick
Parallelized version of the algorithm in tick (in development)
Conclusion
Optimization for machine learning: many recent developments
Distributed / parallel / lock-free
Variance reduction
Results also for non-convex problems
But many problems from statistical learning don’t fit
Such as the ones mentioned here... still a lot to do
Thank you!