
Page 1

Optimization for Data Science

AMIES – ILB

S. Gaïffas

Page 2

Motivations

1. Partnership with the Caisse Nationale d'Assurance Maladie (CNAM)

World’s largest electronic health records database

Extremely complex data preprocessing

Pharmacovigilance: detect potentially dangerous drugs

2. Time-oriented machine learning

High-frequency financial signals

Social networks

“Causality maps”

Page 3

Motivation 1. CNAM Project

Context

Electronic health records: SNIIRAM + PMSI

Extremely complex database: 800 SQL tables, 500 TB, all in a closed SAS-Oracle ecosystem (Exadata)

All health-care reimbursements of the French population (with diagnoses, prescriptions, hospital stays, etc.)

Applications with a strong social impact

Goals

Pharmacovigilance: automatically detect potentially dangerous drugs (screening)

Examples: some anti-diabetics and bladder cancer, drug changes and fractures (in elderly people)

Team

6 engineers

Administration, big data development

Page 4

Motivation 1. CNAM Project

Big data cluster

“Scalable” architecture

4 masters

15 slaves

240 cores

1.9 TB RAM

480 TB (120 hard drives)

HDFS

Spark (mostly spark.sql), Scala

Only open-source technology

Page 5

Motivation 1. CNAM project

Administration of two duplicate clusters (CNAM and X)

Understanding of the data

“Flattening” of the data

All from “raw” CSV extracts...

Work/Code management

Production in “agile” mode

Confluence + JIRA + GitHub

Page 6

Motivation 1. CNAM Project

Types of results

For antidiabetics and bladder cancer

New model for longitudinal data for “self-controlled case series”

Validation: blind detection of a well-known adverse effect of some drug (withdrawn from the French market in 2011)

Page 7

Motivation 2. Time-oriented machine learning

From event timestamps, we want to quantify interactions:

Page 8

Motivation 2. Time-oriented machine learning

Hawkes process

$N = [N_1, \ldots, N_d]^\top$ where $N_i(t) = \sum_{k \geq 1} \mathbf{1}_{t^i_k \leq t}$

$N_i$ jumps each time $i$ does something (e.g. tweet, price up or down)

Model: $N_i$ has an intensity $\lambda_i$ given by

$$\lambda_i(t) = \mu_i(t) + \int_{(0,t)} \sum_{j=1}^d \varphi_{ij}(t - s)\, dN_j(s),$$

[Figure: $\lambda_1(t)$ and the corresponding ticks, with $d = 1$ and $\varphi_{11}(t) = e^{-t}$]
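To make the intensity model concrete, here is a minimal sketch, assuming only numpy (parameter values are illustrative and this is not the tick implementation), that simulates a one-dimensional Hawkes process with kernel $\varphi_{11}(t) = \alpha e^{-\beta t}$ by Ogata's thinning algorithm:

```python
import numpy as np

def simulate_hawkes_exp(mu=1.0, alpha=0.5, beta=1.0, T=100.0, seed=0):
    """Event times of a 1D Hawkes process with intensity
    lambda(t) = mu + sum_{t_k < t} alpha * exp(-beta * (t - t_k))."""
    rng = np.random.default_rng(seed)
    events, t = [], 0.0
    while True:
        # Between events the intensity is non-increasing, so its current value
        # is a valid upper bound until the next accepted event.
        lam_bar = mu + sum(alpha * np.exp(-beta * (t - tk)) for tk in events)
        t += rng.exponential(1.0 / lam_bar)     # candidate next event time
        if t >= T:
            return np.array(events)
        lam_t = mu + sum(alpha * np.exp(-beta * (t - tk)) for tk in events)
        if rng.uniform() <= lam_t / lam_bar:    # thinning: accept w.p. lambda(t)/lam_bar
            events.append(t)

ticks = simulate_hawkes_exp()
print(f"{len(ticks)} events on [0, 100]")
```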

Page 9

Motivation 2. Time-oriented machine learning

Achab, Bacry, Gaïffas, Mastromatteo, Muzy, Uncovering Causality from Multivariate Hawkes Integrated Cumulants, ICML 2017

Granger causality estimation from integrated cumulants

Highly non-convex problem

Application on social networks and financial datasets

MemeTracker and DAX data

Method   ODE     GC      ADM4    NPHC
Err      0.162   0.19    0.092   0.071
Corr     0.07    0.053   0.081   0.095
Time     2944    2780    2217    38

Page 10

Motivation 2. Time-oriented machine learning

Lead/lag + flow prediction

Page 11

Software: tick library

Python 3 and C++11

Open-source (BSD-3 License)

pip install tick (on macOS and Linux...)

https://x-datainitiative.github.io/tick

Statistical learning for time-dependent models

Point processes (Poisson, Hawkes), survival analysis, GLMs (parallelized, sparse, etc.)

A strong simulation and optimization toolbox

Partnership with Intel (use-case of a new processor with 180 cores)

Contributors welcome!

Page 12

Software: tick library

Page 13

Optimization for data science

You want to train a logistic regression with ridge penalization:

$$\operatorname*{argmin}_{w \in \mathbb{R}^d} \left\{ \frac{1}{n} \sum_{i=1}^n \log\big(1 + e^{-y_i x_i^\top w}\big) + \frac{\lambda}{2} \|w\|_2^2 \right\}$$

You have many ways to do it:

Gradient descent

Coordinate descent

Quasi-Newton (BFGS)

Stochastic gradient descent

Dual, primal-dual methods

...
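As a point of reference for the comparison that follows, here is a minimal sketch of one of the options above, quasi-Newton (L-BFGS), applied to this objective; it assumes numpy and scipy, and the data X, y and the penalty strength lam are illustrative placeholders:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, lam = 1000, 20, 1e-3                       # toy data and penalty strength
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))

def objective(w):
    # (1/n) sum_i log(1 + exp(-y_i x_i^T w)) + (lam/2) ||w||^2
    return np.mean(np.logaddexp(0.0, -y * (X @ w))) + 0.5 * lam * w @ w

def gradient(w):
    s = -y / (1.0 + np.exp(y * (X @ w)))         # s_i = -y_i / (1 + exp(y_i x_i^T w))
    return X.T @ s / n + lam * w

res = minimize(objective, np.zeros(d), jac=gradient, method="L-BFGS-B")
print(res.fun, res.nit)                          # final objective and iteration count
```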

Page 14

Optimization for data science

You're likely to get very different performance:

Page 15

Many linear methods for supervised learning can be written as the following problem:

$$\operatorname*{argmin}_{w \in \mathbb{R}^d} \left\{ \frac{1}{n} \sum_{i=1}^n f_i(w) + \lambda g(w) \right\}$$

where

$f_i(w)$ = "loss" of the model weights $w$ on the $i$-th data point

$g$ is a penalization

Examples where $f_i(w) = \ell(y_i, x_i^\top w)$

Linear regression: $\ell(y, y') = \frac{1}{2}(y - y')^2$

Logistic regression: $\ell(y, y') = \log(1 + e^{-y y'})$

Hinge loss (SVM): $\ell(y, y') = (1 - y y')_+$

And let's define $f = \frac{1}{n} \sum_{i=1}^n f_i$. NB: the goodness-of-fit is an average
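As a small illustration (numpy assumed; function names are mine), the three losses above written as functions of $(y, y')$ with $y' = x_i^\top w$:

```python
import numpy as np

def squared_loss(y, yp):     # linear regression: (1/2) (y - y')^2
    return 0.5 * (y - yp) ** 2

def logistic_loss(y, yp):    # logistic regression: log(1 + exp(-y y')), computed stably
    return np.logaddexp(0.0, -y * yp)

def hinge_loss(y, yp):       # SVM: (1 - y y')_+
    return np.maximum(0.0, 1.0 - y * yp)
```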

Page 16

“Simplest” algorithm: gradient descent

Input: starting point $w^0$, learning rate $\eta > 0$

For $k = 1, 2, \ldots$ until converged do

$$w^k \leftarrow \operatorname{prox}_{\eta g}\big(w^{k-1} - \eta \nabla f(w^{k-1})\big)$$

Return last $w^k$
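A minimal sketch of this iteration (numpy assumed), specialized to the ridge penalty $g(w) = \frac{\lambda}{2}\|w\|_2^2$, whose proximal operator has the closed form $\operatorname{prox}_{\eta g}(v) = v / (1 + \eta\lambda)$; here grad_f stands for any callable returning $\nabla f$:

```python
import numpy as np

def proximal_gradient_descent(grad_f, d, eta, lam, n_iter=200):
    w = np.zeros(d)                      # starting point w^0
    for _ in range(n_iter):
        v = w - eta * grad_f(w)          # gradient step on the smooth part f
        w = v / (1.0 + eta * lam)        # prox step for the ridge penalty
    return w
```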

What if sample size n is large?

Each iteration of a full gradient method has complexity O(nd)

You need to wait some time before doing anything...

Idea: we want to minimize an average of losses...

If I choose uniformly at random $I \in \{1, \ldots, n\}$, then

$$\mathbb{E}[\nabla f_I(w)] = \frac{1}{n} \sum_{i=1}^n \nabla f_i(w) = \nabla f(w)$$

$\nabla f_I(w)$: unbiased but very noisy estimate of the full gradient $\nabla f(w)$

Stochastic Gradient Descent: Robbins and Monro (1951)

Page 17

Stochastic Gradient Descent

Input: starting point $w^0$, sequence of learning rates $\{\eta_t\}_{t \geq 0}$

For $t = 1, 2, \ldots$ until convergence do

Sample $i_t$ uniformly in $\{1, \ldots, n\}$

$w^t \leftarrow w^{t-1} - \eta_t \nabla f_{i_t}(w^{t-1})$

Return last $w^t$
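A minimal sketch of this loop (numpy assumed); grad_fi(w, i) stands for any callable returning $\nabla f_i(w)$, and the decaying step size $\eta_t = \eta_0/(1+t)$ is just one illustrative choice:

```python
import numpy as np

def sgd(grad_fi, n, d, eta0=1.0, n_iter=10_000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    for t in range(1, n_iter + 1):
        i = rng.integers(n)                        # sample i_t uniformly in {1, ..., n}
        w -= eta0 / (1.0 + t) * grad_fi(w, i)      # O(d) update: a single gradient
    return w
```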

Remarks

Each iteration has complexity $O(d)$ instead of $O(nd)$ for full gradient methods

Actually faster for sparse datasets (lazy or delayed updates)

Very fast in the early iterations (first passes on the data)

Very slow convergence to a precise minimizer

Page 18

The problem

Put $X = \nabla f_I(w)$ with $I$ uniformly chosen at random in $\{1, \ldots, n\}$. In SGD we use $X = \nabla f_I(w)$ as an approximation of $\mathbb{E}X = \nabla f(w)$.

How to reduce $\operatorname{var} X$?

Recent results improve this:

Bottou and LeCun (2005)

Shalev-Shwartz et al (2007, 2009)

Nesterov et al. (2008, 2009)

Bach et al. (2011, 2012, 2014, 2015)

T. Zhang et al. (2014, 2015)

Page 19

An idea

Reduce it by finding $C$ such that $\mathbb{E}C$ is "easy" to compute and such that $C$ is highly correlated with $X$

Put $Z_\alpha = \alpha(X - C) + \mathbb{E}C$ for $\alpha \in [0, 1]$. We have

$$\mathbb{E}Z_\alpha = \alpha \mathbb{E}X + (1 - \alpha)\mathbb{E}C$$

and

$$\operatorname{var} Z_\alpha = \alpha^2\big(\operatorname{var} X + \operatorname{var} C - 2\operatorname{cov}(X, C)\big)$$

Standard variance reduction: $\alpha = 1$, so that $\mathbb{E}Z_\alpha = \mathbb{E}X$ (unbiased)
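A small numerical illustration of this identity (numpy assumed, toy Gaussian variables): with $\alpha = 1$, $Z_\alpha$ keeps the mean of $X$ but its variance shrinks when $C$ is highly correlated with $X$:

```python
import numpy as np

rng = np.random.default_rng(0)
shared = rng.standard_normal(100_000)
X = 3.0 + shared + 0.3 * rng.standard_normal(100_000)   # noisy quantity of interest
C = 1.0 + shared                                        # correlated, with known E[C] = 1
Z = (X - C) + 1.0                                       # alpha = 1: unbiased

print(X.mean(), Z.mean())    # both close to 3
print(X.var(), Z.var())      # var(Z) is about 12x smaller here
```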

Page 20

Variance reduction of the gradient

In the iterations of SGD, replace $\nabla f_{i_t}(w^{t-1})$ by

$$\alpha\big(\nabla f_{i_t}(w^{t-1}) - \nabla f_{i_t}(\bar w)\big) + \nabla f(\bar w)$$

where $\bar w$ is an "old" value of the iterate, namely use

$$w^t \leftarrow w^{t-1} - \eta\Big(\alpha\big(\nabla f_{i_t}(w^{t-1}) - \nabla f_{i_t}(\bar w)\big) + \nabla f(\bar w)\Big)$$

Several cases

α = 1/n: SAG (Bach et al. 2013)

α = 1: SVRG (T. Zhang et al., 2015)

α = 1: SAGA (Bach et al., 2014)

Page 21

Stochastic Variance Reduced Gradient (SVRG)

Phase size typically chosen as $m = n$ or $m = 2n$. If $F = f + g$ with $g$ prox-capable, use

$$w_k^{t+1} \leftarrow \operatorname{prox}_{\eta g}\Big(w_k^t - \eta\big(\nabla f_i(w_k^t) - \nabla f_i(\bar w_k) + \nabla f(\bar w_k)\big)\Big)$$

where $\bar w_k$ is the reference point of phase $k$ (kept fixed for the $m$ inner iterations).

SAGA

If $F = f + g$ with $g$ prox-capable, use

$$w^t \leftarrow \operatorname{prox}_{\eta g}\Big(w^{t-1} - \eta\Big(\nabla f_{i_t}(w^{t-1}) - g^{t-1}(i_t) + \frac{1}{n}\sum_{i=1}^n g^{t-1}(i)\Big)\Big)$$

where $g^{t-1}(i)$ is the last stored gradient of $f_i$.

Important remark

In these algorithms, the step-size η is kept constant

Leads to linearly convergent algorithms, with a numerical complexity comparable to SGD!
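A minimal sketch of SVRG (numpy assumed), for the smooth part only (no prox) and phase size $m = n$; grad_fi(w, i) stands for any callable returning $\nabla f_i(w)$:

```python
import numpy as np

def svrg(grad_fi, n, d, eta, n_phases=20, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    for _ in range(n_phases):
        w_bar = w.copy()                              # reference point of the phase
        full_grad = np.mean([grad_fi(w_bar, i) for i in range(n)], axis=0)
        for _ in range(n):                            # m = n inner iterations
            i = rng.integers(n)
            g = grad_fi(w, i) - grad_fi(w_bar, i) + full_grad   # variance-reduced direction
            w = w - eta * g                           # constant step size
    return w
```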

Page 22

Theoretical knowledge

What is the order of the required number of iterations $K$ to achieve $\varepsilon$-precision:

$$F(w^K) - F(w^\star) \leq \varepsilon \ ?$$

where $w^\star$ is a minimizer of $F$.

If

$$\mu \leq \lambda_{\min}(\nabla^2 f) \leq \lambda_{\max}(\nabla^2 f) \leq L$$

we know that, with $\kappa = L/\mu$:

Gradient descent: $K = \Theta\big(d \times n \times \kappa \times \log(\tfrac{1}{\varepsilon})\big)$

SGD: $K = \Theta\big(d \times \kappa \times \tfrac{1}{\varepsilon}\big)$

If

$$\mu \leq \lambda_{\min}(\nabla^2 f) \quad \text{and} \quad \lambda_{\max}(\nabla^2 f_i) \leq L_i \ \text{for all } i = 1, \ldots, n$$

we know that, with $\kappa = \max_i L_i / \mu$:

SAG, SAGA, SVRG, SDCA: $K = \Theta\big(d \times (n + \kappa) \times \log(\tfrac{1}{\varepsilon})\big)$

Page 23

Comparison of the algorithms

Page 24

Asynchronous mode

All the above algorithms can be heavily parallelized using an asynchronous mode, involving "lock-free" updates

Lock-free SGD: apply in parallel

Sample $i$ uniformly in $\{1, \ldots, n\}$

$w \leftarrow w - \eta \nabla f_i(w)$

without locking $w$ (hence allowing collisions)

References (with generalizations and variance-reduction)

Niu et al. (2011), Hsieh et al. (2015), Reddi et al. (2015), Mania et al. (2015), Zhao and Li (2016), Leblond, Pedregosa, Lacoste-Julien (2017)
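A toy, hedged sketch of the lock-free pattern (numpy and the standard library only; this is not the parallel implementation used in tick, and CPython's GIL limits the actual speed-up): several threads update a shared weight vector with no lock, so collisions are allowed:

```python
import threading
import numpy as np

def lock_free_sgd(grad_fi, n, d, eta=0.01, n_threads=4, iters_per_thread=5_000):
    w = np.zeros(d)                       # shared state, deliberately unprotected

    def worker(w, seed):
        rng = np.random.default_rng(seed)
        for _ in range(iters_per_thread):
            i = rng.integers(n)
            w -= eta * grad_fi(w, i)      # in-place update on the shared vector, no lock

    threads = [threading.Thread(target=worker, args=(w, s)) for s in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return w
```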

Page 25

[from Leblond, Pedregosa, Lacoste-Julien (2017)]

Page 26

The Poisson problem: non-smooth objectives

Great! So let's use these powerful tools for our problems:

Health (Poisson-type models)

Hawkes processes (Social networks, Finance)

Our general optimization problem is

$$\min_{w \in \mathbb{R}^d} \ \psi^\top w + \frac{1}{n} \sum_{i=1}^n f_i(w^\top x_i) + \lambda g(w)$$

subject to $x_i^\top w > 0$ for all $i = 1, \ldots, n$,

with $f_i(u) = -y_i \log(u)$ and $y_i > 0$.

Example 1. Poisson regression (linear link)

$$\min_{w \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^n \big( w^\top x_i - y_i \log(w^\top x_i) \big) + g(w)$$

subject to $x_i^\top w > 0$ for all $i = 1, \ldots, n$.
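A minimal sketch (numpy assumed) of this objective and its gradient, which makes the difficulty visible: the $-y_i \log(x_i^\top w)$ terms blow up as $x_i^\top w \to 0$, so the gradient is unbounded near the boundary of the feasible set:

```python
import numpy as np

def poisson_linear_objective(w, X, y):
    z = X @ w
    if np.any(z <= 0.0):                     # outside the constraint set {x_i^T w > 0}
        return np.inf
    return np.mean(z - y * np.log(z))        # the penalization g(w) is omitted here

def poisson_linear_gradient(w, X, y):
    z = X @ w
    return X.T @ (1.0 - y / z) / X.shape[0]  # unbounded as any z_i approaches 0
```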

Page 27

The Poisson problem: non-smooth objectives

Example 2. Hawkes process log-likelihood

$$\min_{w \in \mathbb{R}^d} \ \psi^\top w - \sum_{i=1}^d \sum_{k=1}^{N^i} \log(w^\top x_{i,k})$$

subject to $x_{i,k}^\top w > 0$ for all $i = 1, \ldots, d$ and $k = 1, \ldots, N^i$,

for some choice of $x_{i,k}$, with $n = \sum_{i=1}^d N^i$.

Problem

$-\log$ is non-smooth! (its gradient is not even bounded). Standard theory is useless. Linear rates?

In practice: hard to tune the step-size

Page 28

The Poisson problem: non-smooth objectives

Most algorithms won't even converge (example on a non-pathological Hawkes process, with sparse/small parameters)

[Figure: reached precision (from $10^2$ down to $10^{-10}$) versus number of iterations (0 to 50), comparing L-BFGS-B, ISTA, FISTA, SCPG, SVRG and SDCA]

Page 29

The Poisson problem: non-smooth objectives

Idea (work with M. Bompaire)

Use a slightly modified Stochastic Dual Coordinate Ascent (T. Zhang et al. 2015)

$$\sup_{\alpha \in \mathbb{R}^n : \, \alpha_i > 0} \ \frac{1}{n} \sum_{i=1}^n -f_i^*(-\alpha_i) - \lambda g^*\Big(\frac{1}{\lambda n} \sum_{i=1}^n \alpha_i x_i - \frac{1}{\lambda}\psi\Big)$$

With the primal-dual relation:

$$w = \nabla g^*\Big(\frac{1}{\lambda n} \sum_{i=1}^n \alpha_i x_i - \frac{1}{\lambda}\psi\Big),$$

where $f_i^*$ and $g^*$ are the convex conjugates of $f_i$ and $g$. Since $f_i(u) = -y_i \log(u)$, we have $f_i^*(v) = -y_i - y_i \log\big(\frac{-v}{y_i}\big)$.

Page 30

The Poisson problem: non-smooth objectives

Algorithm.

Input: starting point $\alpha^{(0)}$

Put $w^{(0)} = \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i^{(0)} x_i - \frac{1}{\lambda}\psi$

For $t = 1, 2, \ldots, T$ do:

Randomly pick $i$

Find $\alpha_i$ that maximizes

$$y_i + y_i \log\Big(\frac{\alpha_i}{y_i}\Big) - \frac{\lambda n}{2}\Big\| w^{(t-1)} + \frac{1}{\lambda n}\big(\alpha_i - \alpha_i^{(t-1)}\big) x_i \Big\|_2^2 \qquad (1)$$

(explicit solution)

$\alpha^{(t)} \leftarrow \alpha^{(t-1)} + \Delta\alpha_i \, e_i$

$w^{(t)} \leftarrow w^{(t-1)} + (\lambda n)^{-1} \Delta\alpha_i \, x_i$

Contribution

A new state-of-the-art for Poisson regression and Hawkes processes

Provable linear rates (using self-concordance)
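For intuition on why the maximization (1) has an explicit solution, here is a hedged sketch of the coordinate update (numpy assumed; this is my own derivation written as code, not the tick implementation): setting the derivative of (1) with respect to $\alpha_i$ to zero gives a quadratic equation whose positive root is the update.

```python
import numpy as np

def dual_coordinate_update(w, x_i, y_i, alpha_i_old, lam, n):
    # Up to additive constants in alpha_i, (1) maximizes
    #   y_i log(alpha_i) - (lam*n/2) || w + (alpha_i - alpha_i_old) x_i / (lam*n) ||^2.
    # Stationarity gives  y_i / alpha_i = b + q * alpha_i,  i.e.  q a^2 + b a - y_i = 0.
    q = (x_i @ x_i) / (lam * n)
    b = w @ x_i - alpha_i_old * q
    alpha_i = (-b + np.sqrt(b * b + 4.0 * q * y_i)) / (2.0 * q)   # positive root (alpha_i > 0)
    delta = alpha_i - alpha_i_old
    w_new = w + delta * x_i / (lam * n)       # keep the primal iterate in sync
    return alpha_i, w_new
```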

Page 31

The Poisson problem: non-smooth objectives

Page 32

The Poisson problem in tick

Parallelized version of the algorithm in tick (in development)

Page 33

Conclusion

Optimization for machine learning: many recent developments

Distributed / parallel / lock-free

Variance reduction

Results also for non-convex problems

But many problems from statistical learning don't fit this standard framework

Such as the ones mentioned here... still a lot to do

Page 34

Thank you!