Optimization for Data Science
AMIES – ILB
S. Gaïffas
Motivations
1. Partnership with Caisse Nationale d’Assurance Maladie (CNAM)
World’s largest electronic health records database
Extremely complex data preprocessing
Pharmacovigilance: detect potentially dangerous drugs
2. Time-oriented machine learning
High-frequency financial signals
Social networks
“Causality maps”
Motivation 1. CNAM Project
Context
Electronic health records: SNIIRAM + PMSI
Extremely complex database: 800 SQL tables, 500 TB, all in a SAS-Oracle closed ecosystem (Exadata)
All health-care reimbursements of the French population (with diagnoses, prescriptions, hospital stays, etc.)
Applications with a strong social impact
Goals
Pharmacovigilance: automatically detect potentially dangerous drugs (screening)
Examples: some anti-diabetics and bladder cancer, drug changes and fractures (in elderly persons)
Team
6 engineers
Administration, big data development
Motivation 1. CNAM Project
Big data cluster
“Scalable” architecture
4 masters
15 slaves
240 cores
1.9 TB RAM
480 TB (120 hard drives)
HDFS
Spark (mostly spark.sql), Scala
Only open-source technology
Motivation 1. CNAM project
Administration of two duplicate clusters (CNAM and X)
Understanding of the data
“Flattening” of the data
All from “raw” CSV extracts...
Work/Code management
Production in “agile” mode
Confluence + JIRA + GitHub
Motivation 1. CNAM Project
Types of results
For antidiabetics and bladder cancer
New model for longitudinal data for “self-controlled case series”
Validation: blind detection of a well-known adverse effect of some drug (withdrawn from the French market in 2011)
Motivation 2. Time-oriented machine learning
From observed event streams (e.g. tweets, price moves) [figure omitted],
we want to quantify interactions: [figure omitted]
Motivation 2. Time-oriented machine learning
Hawkes process
$N = [N_1, \dots, N_d]^\top$ where $N_i(t) = \sum_{k \ge 1} \mathbf{1}_{\{t^i_k \le t\}}$
$N_i$ jumps each time $i$ does something (e.g. tweet, price up or down)
Model: $N_i$ has an intensity $\lambda_i$ given by
$$\lambda_i(t) = \mu_i(t) + \sum_{j=1}^{d} \int_{(0,t)} \varphi_{ij}(t - s) \, dN_j(s)$$
[Figure: $\lambda_1(t)$ and corresponding ticks with $d = 1$ and $\varphi_{11}(t) = e^{-t}$]
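To make the model concrete, here is a minimal NumPy sketch (not from the talk) that evaluates this intensity for exponential kernels $\varphi_{ij}(t) = a_{ij}\,\beta\, e^{-\beta t}$; the names mu, a, beta and events are illustrative assumptions.

```python
# Minimal sketch: evaluate lambda_i(t) for exponential kernels
# phi_ij(t) = a_ij * beta * exp(-beta * t), given observed event times.
import numpy as np

def intensity(i, t, mu, a, beta, events):
    """events[j] = sorted array of event times of node j."""
    lam = mu[i]
    for j, t_j in enumerate(events):
        past = t_j[t_j < t]                       # events of j strictly before t
        lam += a[i, j] * beta * np.exp(-beta * (t - past)).sum()
    return lam

# d = 1 and phi_11(t) = e^{-t} (a = 1, beta = 1), as in the figure above
mu = np.array([0.5]); a = np.array([[1.0]]); beta = 1.0
events = [np.array([0.3, 1.1, 1.4])]
print(intensity(0, 2.0, mu, a, beta, events))
```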
Motivation 2. Time-oriented machine learning
Achab, Bacry, Gaïffas, Mastromatteo, Muzy, Uncovering Causality from Multivariate Hawkes Integrated Cumulants, ICML 2017
Granger causality estimation via integrated cumulants
Highly non-convex problem
Application on social networks and financial datasets
[Figures: results on MemeTracker and DAX data]

Method   ODE     GC      ADM4    NPHC
Err      0.162   0.19    0.092   0.071
Corr     0.07    0.053   0.081   0.095
Time     2944    2780    2217    38
Motivation 2. Time-oriented machine learning
Lead/lag + flow prediction
Software: tick library
Python 3 and C++11
Open-source (BSD-3 License)
pip install tick (on macOS and Linux...)
https://x-datainitiative.github.io/tick
Statistical learning for time-dependent models
Point processes (Poisson, Hawkes), Survival analysis, GLMs(parallelized, sparse, etc.)
A strong simulation and optimization toolbox
Partnership with Intel (use-case of a new processor with 180 cores)
Contributors welcome!
Software: tick library
Optimization for data science
You want to train a logistic regression with ridge penalization:
$$\arg\min_{w \in \mathbb{R}^d} \left\{ \frac{1}{n} \sum_{i=1}^{n} \log\big(1 + e^{-y_i x_i^\top w}\big) + \frac{\lambda}{2} \|w\|_2^2 \right\}$$
You have many ways to do it:
Gradient descent
Coordinate descent
Quasi-newton (BFGS)
Stochastic gradient descent
Dual, primal-dual methods
...
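For concreteness, a minimal sketch of one of these options, quasi-Newton via SciPy's L-BFGS-B, on synthetic data; the data and the value of λ are assumptions made for illustration.

```python
# Minimal sketch: ridge-penalized logistic regression trained with L-BFGS-B.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, lam = 1000, 20, 1e-2
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))

def objective(w):
    z = y * (X @ w)
    # log(1 + e^{-z}) computed stably via logaddexp
    return np.logaddexp(0.0, -z).mean() + 0.5 * lam * w @ w

def gradient(w):
    z = y * (X @ w)
    s = -y / (1.0 + np.exp(z))          # derivative of the logistic loss
    return X.T @ s / n + lam * w

res = minimize(objective, np.zeros(d), jac=gradient, method="L-BFGS-B")
print(res.fun, np.linalg.norm(gradient(res.x)))
```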
Optimization for data science
You’re likely to get very different performances:
Many linear methods for supervised learning can be written as the following problem:
$$\arg\min_{w \in \mathbb{R}^d} \left\{ \frac{1}{n} \sum_{i=1}^{n} f_i(w) + \lambda g(w) \right\}$$
where
$f_i(w)$ = "loss" of the model with weights $w$ on the $i$-th data point
$g$ is a penalization
Examples where $f_i(w) = \ell(y_i, x_i^\top w)$:
Linear regression: $\ell(y, y') = \frac{1}{2}(y - y')^2$
Logistic regression: $\ell(y, y') = \log(1 + e^{-y y'})$
Hinge loss (SVM): $\ell(y, y') = (1 - y y')_+$
And let's define $f = \frac{1}{n}\sum_{i=1}^{n} f_i$. NB: the goodness-of-fit is an average.
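In code, these example losses are one-liners (a sketch; here y ∈ {−1, +1} for the two classification losses, and yp stands for the score $x_i^\top w$):

```python
# The three example losses above, vectorized over NumPy arrays.
import numpy as np

def squared(y, yp):  return 0.5 * (y - yp) ** 2             # linear regression
def logistic(y, yp): return np.logaddexp(0.0, -y * yp)      # log(1 + e^{-y yp})
def hinge(y, yp):    return np.maximum(0.0, 1.0 - y * yp)   # (1 - y yp)_+
```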
“Simplest” algorithm: gradient descent
Input: starting point $w^0$, learning rate $\eta > 0$
For $k = 1, 2, \dots$ until converged do:
  $w^k \leftarrow \mathrm{prox}_{\eta g}\big(w^{k-1} - \eta \nabla f(w^{k-1})\big)$
Return last $w^k$
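A minimal sketch of this iteration, assuming the ridge penalty $g(w) = \frac{\lambda}{2}\|w\|_2^2$, whose prox has a closed form, and a user-supplied grad_f:

```python
# Proximal gradient descent sketch for F = f + (lam/2)||.||^2.
import numpy as np

def prox_ridge(w, eta, lam):
    # prox_{eta*g}(w) with g = (lam/2)||.||^2 is a simple shrinkage
    return w / (1.0 + eta * lam)

def gradient_descent(grad_f, w0, eta, lam, n_iter=200):
    w = w0.copy()
    for _ in range(n_iter):
        w = prox_ridge(w - eta * grad_f(w), eta, lam)
    return w
```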
What if sample size n is large?
Each iteration of a full gradient method has complexity O(nd)
You need to wait some time before doing anything...
Idea: we want to minimize an average of losses...
If I choose uniformly at random I ∈ {1, . . . , n}, then
$$\mathbb{E}[\nabla f_I(w)] = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(w) = \nabla f(w)$$
$\nabla f_I(w)$ is an unbiased but very noisy estimate of the full gradient $\nabla f(w)$
Stochastic Gradient Descent: Robbins and Monro (1951)
Stochastic Gradient Descent
Input: starting point $w^0$, sequence of learning rates $\{\eta_t\}_{t \ge 0}$
For $t = 1, 2, \dots$ until convergence do:
  Sample $i_t$ uniformly in $\{1, \dots, n\}$
  $w^t \leftarrow w^{t-1} - \eta_t \nabla f_{i_t}(w^{t-1})$
Return last $w^t$
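A minimal sketch, assuming a user-supplied grad_fi(w, i) returning $\nabla f_i(w)$; the decaying step-size schedule is one common choice, not the only one:

```python
# SGD sketch for F(w) = (1/n) sum_i f_i(w).
import numpy as np

def sgd(grad_fi, w0, n, n_iter=10_000, eta0=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    for t in range(1, n_iter + 1):
        i = rng.integers(n)            # sample i_t uniformly in {0,...,n-1}
        eta_t = eta0 / np.sqrt(t)      # decaying step size (Robbins-Monro)
        w -= eta_t * grad_fi(w, i)
    return w
```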
Remarks
Each iteration has complexity O(d) instead of O(nd) for full gradient methods
Actually faster for sparse datasets (lazy or delayed updates)
Very fast in the early iterations (first passes on the data)
Very slow convergence to a precise minimizer
The problem
Put $X = \nabla f_I(w)$ with $I$ uniformly chosen at random in $\{1, \dots, n\}$.
In SGD we use $X = \nabla f_I(w)$ as an approximation of $\mathbb{E}X = \nabla f(w)$.
How to reduce $\mathrm{var}\, X$?
Recent results improve this:
Bottou and LeCun (2005)
Shalev-Shwartz et al (2007, 2009)
Nesterov et al. (2008, 2009)
Bach et al. (2011, 2012, 2014, 2015)
T. Zhang et al. (2014, 2015)
An idea
Reduce it by finding $C$ s.t. $\mathbb{E}C$ is "easy" to compute and such that $C$ is highly correlated with $X$
Put $Z_\alpha = \alpha(X - C) + \mathbb{E}C$ for $\alpha \in [0, 1]$. We have
$$\mathbb{E}Z_\alpha = \alpha\,\mathbb{E}X + (1 - \alpha)\,\mathbb{E}C$$
and
$$\mathrm{var}\, Z_\alpha = \alpha^2\big(\mathrm{var}\,X + \mathrm{var}\,C - 2\,\mathrm{cov}(X, C)\big)$$
Standard variance reduction: $\alpha = 1$, so that $\mathbb{E}Z_\alpha = \mathbb{E}X$ (unbiased)
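A tiny numeric check of this identity on synthetic data (all values illustrative): when $C$ is highly correlated with $X$, the $\alpha = 1$ estimate keeps the mean of $X$ with a much smaller variance.

```python
# Control-variate demo: Z_1 = (X - C) + E[C] is unbiased with
# var Z_1 = var X + var C - 2 cov(X, C), small when C tracks X.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(100_000)
C = X + 0.1 * rng.standard_normal(100_000)  # highly correlated with X
Z = (X - C) + 0.0                           # alpha = 1; E[C] = 0 by construction
print(X.var(), Z.var())    # variance drops by about 100x
print(X.mean(), Z.mean())  # both estimate E[X] = 0
```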
Variance reduction of the gradient
In the iterations of SGD, replace $\nabla f_{i_t}(w^{t-1})$ by
$$\alpha\big(\nabla f_{i_t}(w^{t-1}) - \nabla f_{i_t}(\tilde w)\big) + \nabla f(\tilde w)$$
where $\tilde w$ is an "old" value of the iterate, namely use
$$w^t \leftarrow w^{t-1} - \eta\Big(\alpha\big(\nabla f_{i_t}(w^{t-1}) - \nabla f_{i_t}(\tilde w)\big) + \nabla f(\tilde w)\Big)$$
Several cases:
$\alpha = 1/n$: SAG (Bach et al., 2013)
$\alpha = 1$: SVRG (T. Zhang et al., 2015)
$\alpha = 1$: SAGA (Bach et al., 2014)
Stochastic Variance Reduced Gradient (SVRG)
Phase size typically chosen as $m = n$ or $m = 2n$. If $F = f + g$ with $g$ prox-capable, use
$$w^{t+1}_k \leftarrow \mathrm{prox}_{\eta g}\big(w^t_k - \eta(\nabla f_i(w^t_k) - \nabla f_i(\tilde w_k) + \nabla f(\tilde w_k))\big)$$
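A minimal sketch of SVRG in the smooth case ($g = 0$), assuming user-supplied grad_fi and grad_f; phase size m and step η are tuning choices:

```python
# SVRG sketch: one full gradient per phase, variance-reduced steps inside.
import numpy as np

def svrg(grad_fi, grad_f, w0, n, n_phases=20, m=None, eta=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    m = m or n                      # phase size, typically n or 2n
    w = w0.copy()
    for _ in range(n_phases):
        w_ref = w.copy()            # "old" iterate, fixed during the phase
        full_grad = grad_f(w_ref)   # one full gradient per phase
        for _ in range(m):
            i = rng.integers(n)
            # variance-reduced gradient estimate (alpha = 1)
            g = grad_fi(w, i) - grad_fi(w_ref, i) + full_grad
            w -= eta * g
    return w
```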
SAGA
If $F = f + g$ with $g$ prox-capable, use
$$w^t \leftarrow \mathrm{prox}_{\eta g}\Big(w^{t-1} - \eta\Big(\nabla f_{i_t}(w^{t-1}) - g^{t-1}(i_t) + \frac{1}{n}\sum_{i=1}^{n} g^{t-1}(i)\Big)\Big)$$
where $g^{t-1}(i)$ denotes the last gradient of $f_i$ computed before iteration $t$ (kept in a table).
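A minimal SAGA sketch (again $g = 0$, user-supplied grad_fi); initializing the gradient table at zero is a simplification of the usual initialization:

```python
# SAGA sketch: maintain a table of the last gradient seen for each i.
import numpy as np

def saga(grad_fi, w0, n, d, n_iter=10_000, eta=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    table = np.zeros((n, d))            # g^{t-1}(i): stored past gradients
    avg = table.mean(axis=0)            # running average of the table
    for _ in range(n_iter):
        i = rng.integers(n)
        g_new = grad_fi(w, i)
        w -= eta * (g_new - table[i] + avg)   # uses the OLD table entry
        avg += (g_new - table[i]) / n         # keep the average in sync
        table[i] = g_new
    return w
```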
Important remark
In these algorithms, the step-size η is kept constant
Leads to linearly convergent algorithms, with a numerical complexity comparable to SGD!
Theoretical knowledge
What is the order of the complexity K required to achieve ε-precision:
$$F(w^K) - F(w^\star) \le \varepsilon \ ?$$
If
$$\mu \le \lambda_{\min}(\nabla^2 f) \le \lambda_{\max}(\nabla^2 f) \le L,$$
we know that, with $\kappa = L/\mu$:
Gradient descent: $K = \Theta\big(d \times n \times \kappa \times \log(\tfrac{1}{\varepsilon})\big)$
SGD: $K = \Theta\big(d \times \kappa \times \tfrac{1}{\varepsilon}\big)$
If
$$\mu \le \lambda_{\min}(\nabla^2 f) \quad \text{and} \quad \lambda_{\max}(\nabla^2 f_i) \le L_i \ \text{for all } i = 1, \dots, n,$$
we know that, with $\kappa = \max_i L_i / \mu$:
SAG, SAGA, SVRG, SDCA: $K = \Theta\big(d \times (n + \kappa) \times \log(\tfrac{1}{\varepsilon})\big)$
Comparison of algorithms
Asynchronous mode
All the above algorithms can be strongly parallelized, using an asynchronous mode involving "lock-free" updates
Lock-free SGD: apply in parallel
  sample $i$ uniformly in $\{1, \dots, n\}$
  $w \leftarrow w - \eta \nabla f_i(w)$
without locking w (hence allowing collisions)
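A Hogwild!-style sketch of this scheme; pure-Python threads only illustrate the idea (the GIL serializes most of the work here), real lock-free implementations live in C++/OpenMP:

```python
# Lock-free SGD sketch: threads update a shared weight vector without locks,
# tolerating collisions.
import threading
import numpy as np

def worker(w, grad_fi, n, n_updates, eta, seed):
    rng = np.random.default_rng(seed)
    for _ in range(n_updates):
        i = rng.integers(n)
        w -= eta * grad_fi(w, i)   # unsynchronized update of the shared w

def lock_free_sgd(grad_fi, d, n, n_threads=4, n_updates=5_000, eta=0.01):
    w = np.zeros(d)                # shared state, read and written lock-free
    threads = [threading.Thread(target=worker,
                                 args=(w, grad_fi, n, n_updates, eta, s))
               for s in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```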
References (with generalizations and variance-reduction)
Niu et al. (2011), Hsieh et al. (2015), Reddi et al. (2015), Mania et al. (2015), Zhao and Li (2016), Leblond, Pedregosa, Lacoste-Julien (2017)
[Figure from Leblond, Pedregosa, Lacoste-Julien (2017)]
The Poisson problem: non-smooth objectives
Great! So, let's use these powerful tools on our problems:
Health (Poisson-type models)
Hawkes processes (Social networks, Finance)
Our general optimization problem is
$$\min_{w \in \mathbb{R}^d} \ \psi^\top w + \frac{1}{n}\sum_{i=1}^{n} f_i(w^\top x_i) + \lambda g(w)$$
subject to $x_i^\top w > 0$ for all $i = 1, \dots, n$,
with $f_i(u) = -y_i \log(u)$ and $y_i > 0$.
Example 1. Poisson regression (linear link)
$$\min_{w \in \mathbb{R}^d} \ \frac{1}{n}\sum_{i=1}^{n} \big( w^\top x_i - y_i \log(w^\top x_i) \big) + g(w)$$
subject to $x_i^\top w > 0$ for all $i = 1, \dots, n$.
The Poisson problem: non-smooth objectives
Example 2. Hawkes process’ log-likelihood
$$\min_{w \in \mathbb{R}^d} \ \psi^\top w - \sum_{i=1}^{d} \sum_{k=1}^{N^i} \log(w^\top x_{i,k})$$
subject to $x_{i,k}^\top w > 0$ for all $i = 1, \dots, d$ and $k = 1, \dots, N^i$,
for some choice of $x_{i,k}$, with $n = \sum_{i=1}^{d} N^i$.
Problem
$-\log$ is non-smooth! (its gradient is not even bounded). Standard theory is useless. Can we still get linear rates?
In practice: hard to tune the step-size
The Poisson problem: non-smooth objectives
Most algorithms won't even converge (example on a non-pathological Hawkes process, with sparse/small parameters)
[Figure: reached precision vs. number of iterations (0 to 50) for L-BFGS-B, Ista, Fista, SCPG, SVRG, SDCA]
The Poisson problem: non-smooth objectives
Idea (work with M. Bompaire)
Use a slightly modified Stochastic Dual Coordinate Ascent (T. Zhang et al., 2015)
$$\sup_{\alpha \in \mathbb{R}^n : \, \alpha_i > 0} \ \frac{1}{n}\sum_{i=1}^{n} -f_i^*(-\alpha_i) - \lambda g^*\Big(\frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i x_i - \frac{1}{\lambda}\psi\Big)$$
With the primal-dual relation:
$$w = \nabla g^*\Big(\frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i x_i - \frac{1}{\lambda}\psi\Big),$$
where $f_i^*$ and $g^*$ are the convex conjugates of $f_i$ and $g$.
Since $f_i(u) = -y_i \log(u)$, we have $f_i^*(v) = -y_i - y_i \log\big(\tfrac{-v}{y_i}\big)$.
The Poisson problem: non-smooth objectives
Algorithm.
Input: starting point $\alpha^{(0)}$
Put $w^{(0)} = \frac{1}{\lambda n}\sum_{i=1}^{n} \alpha^{(0)}_i x_i - \frac{1}{\lambda}\psi$
For $t = 1, 2, \dots, T$ do:
  Randomly pick $i$
  Find $\alpha_i$ that maximizes
  $$y_i + y_i \log\Big(\frac{\alpha_i}{y_i}\Big) - \frac{\lambda n}{2}\Big\|w^{(t-1)} + \frac{1}{\lambda n}\big(\alpha_i - \alpha^{(t-1)}_i\big)x_i\Big\|_2^2 \quad (1)$$
  (explicit solution)
  $\alpha^{(t)} \leftarrow \alpha^{(t-1)} + \Delta\alpha_i e_i$
  $w^{(t)} \leftarrow w^{(t-1)} + (\lambda n)^{-1}\Delta\alpha_i x_i$
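A minimal sketch of these steps (an illustration of the algorithm above, not the tick implementation), taking $g(w) = \frac{1}{2}\|w\|_2^2$ so that $\nabla g^*$ is the identity; setting the derivative of (1) to zero gives a quadratic in $\alpha_i$ whose positive root is the explicit maximizer:

```python
# Modified SDCA sketch for Poisson losses f_i(u) = -y_i * log(u).
import numpy as np

def sdca_poisson(X, y, psi, lam, n_iter=10_000, rng=None):
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    alpha = np.ones(n)                        # starting dual point, alpha_i > 0
    w = X.T @ alpha / (lam * n) - psi / lam   # primal-dual relation
    sq_norms = (X ** 2).sum(axis=1)           # assumes no x_i is all-zero
    for _ in range(n_iter):
        i = rng.integers(n)
        a = sq_norms[i] / (lam * n)
        b = X[i] @ w - alpha[i] * a
        # maximizer of (1): positive root of a*alpha^2 + b*alpha - y_i = 0
        new_alpha = (-b + np.sqrt(b * b + 4.0 * a * y[i])) / (2.0 * a)
        delta = new_alpha - alpha[i]
        alpha[i] = new_alpha
        w += delta * X[i] / (lam * n)         # w update from the algorithm
    return w, alpha
```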
Contribution
A new state-of-the-art for Poisson regression and Hawkes processes
Provable linear rates (using self-concordance)
The Poisson problem: non-smooth objectives
The Poisson problem in tick
Parallelized version of the algorithm in tick (in development)
Conclusion
Optimization for machine learning: many recent developments
Distributed / parallel / lock-free
Variance reduction
Results also for non-convex problems
But many problems from statistical learning don’t fit
Such as the ones mentioned here... still a lot to do
Thank you!