
Page 1:

Large-scale Learning with Kernels & libSkylark

Vikas Sindhwani, IBM Research, NY

MMDS 2014: Algorithms for Modern Massive Data Sets
UC Berkeley, June 19th 2014

Page 2:

What do you see?

Pages 3-7:

Motivation

• Train Lasso on (x_1, y_1), ..., (x_n, y_n), x_i ∈ R^d with n = 1 billion, d = 10K. What do you see?

  – Systems: 80 terabytes ⇒ MapReduce script on a Hadoop cluster?
  – NLA: argmin_{‖w‖_1≤1} ‖Xw − y‖²₂ ⇒ argmin_{‖w‖_1≤1} ‖SXw − Sy‖²₂ (sketched in code below)
  – Statistician: with high probability (1 − δ), a training set with n = O((1/ε²) log(2d²/δ)) is enough for the generalization error to be within ε of that of the best linear model with ‖·‖_1 ≤ 1 ⇒ don't need n = 1B

• Big-data machine learning stack design requires conversations.

• Avoid strong assumptions (linearity, sparsity, ...) upfront to avoid saturation ⇒ non-parametric models (models that grow with the data)

• This setting needs both Distributed computation and Randomization
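
To make the NLA bullet concrete: a minimal numpy sketch-and-solve example (not from the slides, with illustrative sizes). A Gaussian sketching matrix S compresses X and y, and the compressed least-squares solution is nearly as good as the full one; for simplicity this drops the slide's ‖w‖_1 ≤ 1 constraint and solves plain least squares.

# Minimal sketch-and-solve example (not from the slides): a Gaussian sketch S
# with t rows approximately preserves least-squares residuals, which is what
# justifies replacing ||Xw - y||_2 by ||SXw - Sy||_2. Sizes are illustrative,
# and the L1 constraint of the Lasso bullet is dropped for simplicity.
import numpy as np

rng = np.random.default_rng(0)
n, d, t = 20_000, 50, 500
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

S = rng.standard_normal((t, n)) / np.sqrt(t)        # dense Gaussian (JLT-style) sketch
w = np.linalg.lstsq(S @ X, S @ y, rcond=None)[0]    # solve the small sketched problem

res_sketch = np.linalg.norm(X @ w - y)
res_best = np.linalg.norm(X @ np.linalg.lstsq(X, y, rcond=None)[0] - y)
print(res_sketch / res_best)                        # close to 1: near-optimal on the full data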


Page 8:

Outline

• Non-parametric modeling with kernel methods, their scalability problems, and Random Fourier Features (Rahimi & Recht, 2007)

• Kernel methods match DNNs on (knowledge-free) speech recognition benchmarks (ICASSP 2014): parallelization + randomization

• Recent efforts towards improving scalability:

  – Practical implementations of distributed ADMM: handling large numbers of examples and large random feature spaces.
  – Quasi-Monte Carlo feature maps (ICML 2014)

• libskylark: an open-source software library instantiating sketching primitives and randomized Numerical Linear Algebra (NLA) techniques for large-scale machine learning in distributed-memory environments.

Page 9:

Acknowledgements and References

• XDATA Skylark team: Ken Clarkson (PI), Haim Avron, Costas Bekas, Christos Boutsidis, Ilse Ipsen, Yves Ineichien, Anju Kambadur, Giorgios Kollias, Michael Mahoney, Vikas Sindhwani, David Woodruff

• High-performance Kernel Machines with Implicit Distributed Optimization and Randomization, with Haim Avron, 2014

• Quasi-Monte Carlo Feature Maps for Shift Invariant Kernels, with Jiyan Yang, Haim Avron, and Michael Mahoney, ICML 2014

• Kernel Methods Match Deep Neural Networks on TIMIT, with P. Huang, H. Avron, T. Sainath and B. Ramabhadran, ICASSP 2014

• Random Laplace Feature Maps for Semigroup Kernels on Histograms, with J. Yang, Q. Fan, H. Avron, M. Mahoney, CVPR 2014

Page 10:

Kernel Methods (Aronszajn, 1950)

• Symmetric positive definite function k(x, z) on input domain X ⊂ R^d

• k ⇔ a rich Reproducing Kernel Hilbert Space (RKHS) H_k of real-valued functions, with inner product 〈·, ·〉_k and norm ‖·‖_k

• Regularized risk minimization ⇔ linear models in an implicit high-dimensional (often infinite-dimensional) feature space:

  f⋆ = argmin_{f ∈ H_k} (1/n) Σ_{i=1}^n V(y_i, f(x_i)) + λ ‖f‖²_{H_k},   x_i ∈ R^d

• Representer Theorem: f⋆(x) = Σ_{i=1}^n α_i k(x, x_i)

Page 11:

The Issue of Scalability

• Regularized Least Squares with the kernel matrix K_ij = k(x_i, x_j):

  (K + λI) α = Y   ⇒   O(n²) storage, O(n³ + n²d) training, O(nd) test speed

  Hard to parallelize when working directly with K_ij = k(x_i, x_j).

• Linear kernels: k(x, z) = xᵀz, f⋆(x) = xᵀw (with w = Xᵀα):

  (XᵀX + λI) w = XᵀY   ⇒   O(nd) storage, O(nd²) training, O(d) test speed

  (Both solves are illustrated in code below.)
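
A minimal numpy illustration of the two solves (not from the slides; sizes and the Gaussian kernel bandwidth are illustrative): the kernel path forms an n × n matrix, while the linear path only needs a d × d system.

# Minimal numpy illustration (not from the slides) of the two solves above.
# Forming K costs O(n^2) memory and the solve O(n^3); the linear-kernel path
# only requires a d x d solve. Sizes and the bandwidth sigma are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, sigma = 2000, 20, 1e-2, 1.0
X = rng.standard_normal((n, d))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Kernel RLS: (K + lam*I) alpha = Y, Gaussian kernel
sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma**2))
alpha = np.linalg.solve(K + lam * np.eye(n), Y)             # O(n^3)

# Linear RLS: (X^T X + lam*I) w = X^T Y
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)     # O(nd^2 + d^3)

print(alpha.shape, w.shape)   # (n,) vs (d,): test-time cost O(nd) vs O(d)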


Page 12:

Randomized Algorithms: Definitions

• Data-oblivious explicit feature map Ψ : R^d → C^s such that

  k(x, z) ≈ 〈Ψ(x), Ψ(z)〉_{C^s}

  ⇒ (Z(X)ᵀZ(X) + λI) w = Z(X)ᵀY   ⇒   O(ns) storage, O(ns²) training, O(s) test speed

• Shift-invariant kernels: k(x, z) = ψ(x − z), for some complex-valued positive definite function ψ on R^d

  – Given any set of m points z_1, ..., z_m ∈ R^d, the m × m matrix A_ij = ψ(z_i − z_j) is positive definite.

Page 13:

Randomized Algorithms: Bochner's Theorem

Theorem 1 (Bochner, 1932-33). A complex-valued function ψ : R^d → C is positive definite if and only if it is the Fourier transform of a finite non-negative measure µ on R^d, i.e.,

  ψ(x) = µ̂(x) = ∫_{R^d} e^{−ixᵀw} dµ(w),   x ∈ R^d.

[Figure: duality between ψ(x) and the Fourier modes e^{−iwᵀx}.]

Page 14:

Random Fourier Features (Rahimi & Recht, 2007)

• One-to-one correspondence between k and a density p such that

  k(x, z) = ψ(x − z) = ∫_{R^d} e^{−i(x−z)ᵀw} p(w) dw

  Gaussian kernel: k(x, z) = e^{−‖x−z‖²₂/(2σ²)}  ⇐⇒  p = N(0, σ⁻²I_d)

• Monte Carlo approximation to the integral representation (see the numpy sketch below):

  k(x, z) ≈ (1/s) Σ_{j=1}^s e^{−i(x−z)ᵀw_j} = 〈Ψ_S(x), Ψ_S(z)〉_{C^s}

  Ψ_S(x) = (1/√s) [ e^{−ixᵀw_1}, ..., e^{−ixᵀw_s} ] ∈ C^s,   S = [w_1, ..., w_s] ∼ p
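
A minimal numpy sketch of this Monte Carlo feature map (not from the slides), assuming a Gaussian kernel with σ = 1: draw w_1, ..., w_s ~ N(0, I_d), build Ψ_S, and check that 〈Ψ_S(x), Ψ_S(z)〉 approaches k(x, z).

# Minimal numpy sketch (not from the slides) of the random Fourier feature map,
# assuming a Gaussian kernel with sigma = 1: draw w_1..w_s ~ N(0, I_d), form
# Psi_S(x) = exp(-i x^T w_j)/sqrt(s), and compare <Psi(x), Psi(z)> with k(x, z).
import numpy as np

rng = np.random.default_rng(0)
d, s = 10, 20_000
x = rng.standard_normal(d)
z = x + 0.3 * rng.standard_normal(d)                 # a nearby point, so k(x, z) is not tiny

W = rng.standard_normal((d, s))                      # S = [w_1 ... w_s] ~ N(0, I_d)
psi = lambda v: np.exp(-1j * (v @ W)) / np.sqrt(s)   # Psi_S : R^d -> C^s

approx = np.vdot(psi(x), psi(z))                     # <Psi_S(x), Psi_S(z)>_{C^s}
exact = np.exp(-np.linalg.norm(x - z)**2 / 2.0)      # Gaussian kernel, sigma = 1
print(abs(approx - exact))                           # O(1/sqrt(s)) Monte Carlo error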


Page 15:

DNNs vs Kernel Methods on TIMIT (Speech)

[Figure: Classification error (%) vs. number of random features (s / 10000, up to 8) on TIMIT (n = 2M, d = 440, k = 147 classes); curves for a DNN (440-4k-4k-147), Random Fourier features, and the exact kernel (n = 100k, 75 GB).]

% Training (X: n-by-d data, rows are examples; y: labels; s: number of
% random features; lambda: ridge parameter)
G = randn(size(X,2), s);          % w_1..w_s ~ N(0, I_d) for the Gaussian kernel
Z = exp(1i*X*G);                  % random Fourier features, n-by-s
alpha = (lambda*eye(size(Z,2)) + Z'*Z) \ (Z'*y(:));

% Testing
ztest = exp(1i*xtest*G)*alpha;


Page 16:

Learning in High-dimensional Random Feature Spaces

Kernel Methods Match Deep Neural Networks on TIMIT, ICASSP 2014

[Figure: Classification error (%) vs. number of random features (s / 10000, up to 40) on TIMIT (n = 2M, d = 440, k = 147); curves for a DNN (440-4k-4k-147), Random Fourier features, and the exact kernel (n = 100k, 75 GB). Phone error rate: 21.3% (random features) < 22.3% (DNN).]

• Distributed solvers needed (Z ∈ R^{2M×400K} ≈ 6 terabytes)
• More effective feature maps?

Page 17:

Distributed Learning with ADMM

• Alternating Direction Method of Multipliers (1950s; Boyd et al., 2013)

  argmin_{x∈R^n, z∈R^m} f(x) + g(z)   subject to   Ax + Bz = c

• Several variations: row-splitting if examples are split across processors.

  argmin_{x∈R^d} Σ_{i=1}^R f_i(x) + g(x)   ⇒   Σ_{i=1}^R f_i(x_i) + g(z)   s.t.  x_i = z    (1)

  x_i^{k+1} = argmin_x  f_i(x) + (ρ/2) ‖x − z^k + ν_i^k‖²₂    (2)

  z^{k+1} = prox_{g/(Rρ)} [ x̄^{k+1} + ν̄^k ]    (3)

  ν_i^{k+1} = ν_i^k + x_i^{k+1} − z^{k+1}    (4)

  where prox_f[x] = argmin_y (1/2) ‖x − y‖²₂ + f(y)

• Note: extra consensus and dual variables need to be managed.
• Closed-form updates, extensibility, code reuse, parallelism (a consensus-ADMM sketch follows below).
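
A compact Python sketch (not the libskylark implementation) of updates (2)-(4) for row-split ridge regression, assuming f_i(x) = ‖A_i x − b_i‖²₂ on block i and g(z) = λ‖z‖²₂; block counts and sizes are illustrative.

# Compact consensus-ADMM sketch (not the libskylark implementation) of updates
# (2)-(4) above for row-split ridge regression: f_i(x) = ||A_i x - b_i||^2 on
# block i and g(z) = lam*||z||^2. Block counts and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
R, n_i, d, lam, rho = 4, 500, 30, 1e-1, 1.0
A = [rng.standard_normal((n_i, d)) for _ in range(R)]       # data rows held by R workers
x_true = rng.standard_normal(d)
b = [Ai @ x_true + 0.01 * rng.standard_normal(n_i) for Ai in A]

x = np.zeros((R, d)); nu = np.zeros((R, d)); z = np.zeros(d)
# Cache per-block factors: (2) has the closed form
# x_i = (A_i^T A_i + rho*I)^{-1} (A_i^T b_i + rho*(z - nu_i)).
L = [np.linalg.cholesky(Ai.T @ Ai + rho * np.eye(d)) for Ai in A]
Atb = [Ai.T @ bi for Ai, bi in zip(A, b)]

for k in range(100):
    for i in range(R):                                      # (2) local updates (parallel in practice)
        r = Atb[i] + rho * (z - nu[i])
        x[i] = np.linalg.solve(L[i].T, np.linalg.solve(L[i], r))
    z = rho * R / (2 * lam + rho * R) * (x + nu).mean(axis=0)   # (3) prox of g/(R*rho) at the average
    nu += x - z                                             # (4) dual updates

print(np.linalg.norm(z - x_true))                           # small: close to the centralized ridge solution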


Page 18:

ADMM Block Splitting and Hybrid Parallelism

• Implicit optimization problems:

  argmin_{W∈R^{s×m}} Σ_{i=1}^n V(y_i, Wᵀz(x_i)) + λ r(W)

• Distributed memory (MPI) across R nodes, T cores each (OpenMP)
• Implicit R × C block partitioning of Z:

  [Figure: [Y_1, ..., Y_R] and [X_1, ..., X_R] are split row-wise across R MPI processes; each process uses T threads to form its row of feature blocks [Z_i1, Z_i2, ..., Z_iC].]

• ADMM coordinates local models on Z_ij = T[X_i, j], built on the fly.

Page 19:

ADMM Block Splitting and Hybrid Parallelism

• ADMM block-splitting formulation (follows Parikh and Boyd, 2013):

  Σ_{i=1}^R l_i(O_i) + λ Σ_{j=1}^C r_j(W_j) + Σ_{ij} I[O_ij = Z_ij W_ij]

  s.t.  W_j = W_ij,  O_i = Σ_j O_ij   (local-global consistency)

  – Variables live MPI-process-local, thread-local, or on the master process.

• Key steps: prox_f, prox_r (parallel; closed-form) and the graph projection (sketched in code below)

  proj_{Z_ij}[(Y, X)] = argmin_{V = Z_ij U} (1/2) ‖V − Y‖²_fro + (1/2) ‖U − X‖²_fro

  ⇒  U = [Z_ijᵀ Z_ij + λI]⁻¹ (X + Z_ijᵀ Y)   (factor cached),   V = Z_ij U    (5)
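
A small numpy sketch (not libskylark code) of the graph-projection step (5), keeping the λI term as written on the slide; the Cholesky factor of Z_ijᵀZ_ij + λI is formed once and reused across ADMM iterations.

# Small numpy sketch (not libskylark code) of the graph projection (5): map
# (Y, X) to (V, U) with V = Z U. The Cholesky factor of Z^T Z + lam*I is formed
# once ("cached") and reused at every ADMM iteration; lam*I follows the slide.
import numpy as np

def make_graph_projector(Z, lam=1.0):
    """Return (Y, X) -> (V, U) implementing formula (5); lam = 1 gives the exact graph projection."""
    L = np.linalg.cholesky(Z.T @ Z + lam * np.eye(Z.shape[1]))   # cached factor
    def project(Y, X):
        U = np.linalg.solve(L.T, np.linalg.solve(L, X + Z.T @ Y))
        return Z @ U, U
    return project

# Toy usage with illustrative sizes.
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 50))
project = make_graph_projector(Z)
V, U = project(rng.standard_normal((200, 3)), rng.standard_normal((50, 3)))
print(np.allclose(V, Z @ U))   # True: the projected pair lies on the graph of Z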


Page 20:

Making the Implementation Practical

• C ↑ ⇒ cache memory ↓, computation ↓; shared-memory parallelism ↑.

• However, the O_ij blocks grow as n_i·C·m (e.g., 335K examples, C = 64, m = 100 exhausts 16 GB). Fortunately, this can be avoided using incremental updates, shared-memory access, and the structure of the graph projection.

[Figure: update rules from Parikh and Boyd, 2013.]

If C = s/d, per-iteration complexity is linear in n and s.

Page 21:

Scalability and Effect of Splitting

[Figure, left: MNIST strong scaling on Triloka; speedup vs. number of MPI processes (t = 6 threads/process) against the ideal line, plus classification accuracy (%). Figure, right: TIMIT column splitting; accuracy vs. time (secs) for C ∈ {50, 100, 200, 400, 800, 1000}.]

Table 1: Comparison on TIMIT-binary (n = 100k, d = 440, s = 100k)

                   Libsvm    PSVM (p = n^0.5)    BlockADMM
  Training Time    80355     47                  42
  Testing Time     1295      259                 2.9
  Accuracy         85.41%    73.1%               83.47%

Page 22:

Revisit Efficiency of Approximate Integration

  k(x, z) = ∫_{R^d} e^{−i(x−z)ᵀw} p(w) dw ≈ (1/s) Σ_{j=1}^s e^{−i(x−z)ᵀw_j}

• Consider the error in approximating an integral on the unit cube:

  ε_S[f] = | ∫_{[0,1]^d} f(x) dx − (1/s) Σ_{w∈S} f(w) |

• The Monte Carlo approach draws S from U([0,1]^d), with convergence rate

  ( E_S[ ε_S[f]² ] )^{1/2} = σ[f] s^{−1/2},   where σ[f]² = var_{X∼U([0,1]^d)}[f(X)]

  O(s^{−1/2}) ⇒ a 4-fold increase in s only cuts the error in half.

Can we do better with a different S?
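
A quick numpy check (not from the slides) of this s^{−1/2} rate on a toy integrand with a known integral: the root-mean-squared error roughly halves each time s is quadrupled.

# Quick numpy check (not from the slides) of the O(s^(-1/2)) Monte Carlo rate,
# using the toy integrand f(w) = prod_j cos(2*pi*w_j) on [0,1]^d, whose exact
# integral is 0. Quadrupling s roughly halves the root-mean-squared error.
import numpy as np

rng = np.random.default_rng(0)
d, trials = 5, 200
f = lambda W: np.prod(np.cos(2 * np.pi * W), axis=1)

for s in (100, 400, 1600, 6400):
    errs = [abs(f(rng.random((s, d))).mean()) for _ in range(trials)]   # |estimate - 0|
    print(s, np.sqrt(np.mean(np.square(errs))))                         # RMSE ~ sigma[f] * s**-0.5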


Page 23:

Low-discrepancy Quasi-Monte Carlo Point Sets

[Figure: 2D scatter plots on [0,1]² of a uniform (MC) point set and a Halton point set.]

• Deterministic, correlated QMC points avoid the clustering and clumping effects seen in MC point sets.

• Hierarchical structure: sample the integrand from coarse to fine as s increases.

Page 24:

Low-discrepancy Quasi-Monte Carlo Point Sets

[Figure only.]

Page 25:

Star Discrepancy

Integration error depends on the variation of f and the uniformity of S.

Theorem 2 (Koksma-Hlawka inequality (1941, 1961)).

  ε_S[f] ≤ D*(S) V_HK[f],   where   D*(S) = sup_{x∈[0,1]^d} | vol(J_x) − |{i : w_i ∈ J_x}| / s |

Page 26:

Quasi-Monte Carlo Sequences

• Low-discrepancy point sets have D*({w_1, ..., w_s}) = O((log s)^d / s), conjectured to be optimal.

  – Halton, Sobol', Faure, Niederreiter sequences, ... we will treat these as black boxes.
  – Implementations available, e.g. Matlab haltonset, sobolset (see the sketch below for a Python counterpart).
  – Usually very cheap to generate.

• For fixed d, asymptotically, the QMC rate (log s)^d / s beats the MC rate s^{−1/2}.

  – Note the dimension dependence.
  – Empirically, QMC is better even for very high-dimensional integration.
  – Modern analysis: worst-case analysis over a nice space of integrands (an RKHS), or average-case analysis assuming a distribution over integrands.
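
The Python counterpart to the Matlab generators mentioned above is scipy.stats.qmc (SciPy 1.7+); the short sketch below (an illustration, not from the slides) generates Halton points and compares their integration error with plain MC on the toy integrand used earlier.

# Python counterpart (illustrative, not from the slides) to Matlab's haltonset /
# sobolset, using scipy.stats.qmc: generate Halton points and compare their
# integration error against plain MC on f(w) = prod_j cos(2*pi*w_j), whose
# exact integral over [0,1]^d is 0.
import numpy as np
from scipy.stats import qmc

d, s = 5, 4096
f = lambda W: np.prod(np.cos(2 * np.pi * W), axis=1)

halton = qmc.Halton(d=d, scramble=False).random(s)    # low-discrepancy point set
mc = np.random.default_rng(0).random((s, d))          # uniform MC point set

print("Halton error:", abs(f(halton).mean()))
print("MC error:    ", abs(f(mc).mean()))             # typically larger than the Halton error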


Page 27:

QMC Fourier Features: Algorithm

• Transform variables:

  ∫_{R^d} e^{−i(x−z)ᵀw} p(w) dw = ∫_{[0,1]^d} e^{−i(x−z)ᵀΦ⁻¹(t)} dt

• QMC feature maps for shift-invariant kernels, [YSAM] 2014 (sketched in code below):

  1. Given k, compute the density p (= ∏_{j=1}^d p_j) via the inverse Fourier transform.
  2. Generate a low-discrepancy sequence t_1, ..., t_s in [0, 1]^d.
  3. Transform: w_i = (Φ₁⁻¹(t_{i1}), ..., Φ_d⁻¹(t_{id})) and set S = [w_1, ..., w_s].
  4. Compute Z′ = XS.
  5. Compute Z_ij = (1/√s) e^{−i Z′_ij}.
  6. Run a linear method on Z.
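
A short numpy/scipy sketch (not the paper's code) of steps 1-6, assuming a Gaussian kernel with σ = 1, so p = N(0, I_d) and each Φ_j⁻¹ is the Gaussian inverse CDF (scipy.stats.norm.ppf); it compares the Gram-matrix approximation error of QMC features against plain MC features.

# Short numpy/scipy sketch (not the paper's code) of steps 1-6 above, assuming a
# Gaussian kernel with sigma = 1, so p = N(0, I_d) and each Phi_j^{-1} is the
# Gaussian inverse CDF (scipy.stats.norm.ppf). QMC (Halton) features typically
# approximate the Gram matrix better than MC features for the same s.
import numpy as np
from scipy.stats import norm, qmc

rng = np.random.default_rng(0)
n, d, s = 300, 10, 2048
X = rng.standard_normal((n, d))

def fourier_features(W):                      # W = S with shape (d, s)
    Zp = X @ W                                # step 4: Z' = X S
    return np.exp(-1j * Zp) / np.sqrt(s)      # step 5: Z_ij = exp(-i Z'_ij) / sqrt(s)

# Steps 2-3: Halton points pushed through the inverse Gaussian CDF (QMC), vs. plain MC draws.
T = qmc.Halton(d=d, scramble=True, seed=0).random(s)
W_qmc = norm.ppf(T).T                         # shape (d, s)
W_mc = rng.standard_normal((d, s))

sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / 2.0)   # exact Gaussian Gram matrix

for name, W in [("QMC", W_qmc), ("MC", W_mc)]:
    Z = fourier_features(W)
    K_hat = (Z @ Z.conj().T).real             # step 6 would run a linear method on Z
    print(name, np.linalg.norm(K_hat - K) / np.linalg.norm(K))   # relative Frobenius error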


Pages 28-30:

How do standard QMC sequences perform?

[Figure: relative error on ‖K‖₂ vs. number of random features, for MC, Halton, Sobol', Digital net, and Lattice sequences, on USPST (n = 1506) and CPU (n = 6554).]

• QMC methods consistently provide better Gram matrix approximations.

• Why are some QMC sequences better than others?

• Can we learn sequences even better adapted to our problem class?

Page 31:

Characterization via Box Discrepancy

• Define F_□b = { f(w) = e^{−i(x−z)ᵀw} : −b ≤ x − z ≤ b, x, z ∈ X }

Theorem 3 (Expected integration error w.r.t. p over f ∈ F_□b).

  E_{f∼U(F_□b)} [ ε_{S,p}[f]² ] ∝ D□_p(S)².    (6)

• Below, h(u, v) = sinc_b(u, v) = π^{−d} ∏_{j=1}^d sin(b_j(u_j − v_j)) / (u_j − v_j):

  D□_p(S)² = ∫_{R^d} ∫_{R^d} h(ω, φ) p(ω) p(φ) dω dφ
             − (2/s) Σ_{l=1}^s ∫_{R^d} h(w_l, ω) p(ω) dω     [alignment with p (w_l ≈ ω)]
             + (1/s²) Σ_{l=1}^s Σ_{j=1}^s h(w_l, w_j)          [non-uniformity of S]    (7)

  The integrals above can be computed in closed form for the Gaussian density.

Page 32:

Does Box Discrepancy explain performance differences?

[Figure: D□(S)² vs. number of samples on the CPU dataset (d = 21), for Digital net, MC (expected), Halton, Sobol', and Lattice sequences.]

Page 33:

Learning Adaptive QMC Sequences

Unlike the star discrepancy, the box discrepancy admits numerical optimization:

  S* = argmin_{S=(w_1, ..., w_s) ∈ R^{ds}} D□(S).    (8)

[Figure: CPU dataset, s = 100; normalized D□(S)², maximum squared error, mean squared error, and ‖K̂ − K‖₂/‖K‖₂ decreasing over the optimization iterations.]

Page 34:

libskylark: sketching-based matrix computations for ML

http://xdata-skylark.github.io/libskylark/

[Figure: the libskylark software stack: high-performance sketching, NLA, and ML layers with Python bindings, built on Elemental/CombBLAS with MPI communication.]

• Randomized kernel methods: Support Vector Machines, multinomial logistic regression, robust regression, regularized least squares; Gaussian, Laplacian, polynomial, and semigroup kernels. Also matrix completion, multitask learning, PCA and CCA.

• Sketching: oblivious subspace embeddings, applied column-wise (SA) or row-wise (AS′); transforms include JLT, FJLT, CT, FCT, RFT, CWT, WZT, MMT.

• Matrix types (input and output): LocalMatrix (numpy.ndarray, scipy.sparse, elem::Matrix<double>), DistributedDense (Elemental DistMatrix, 1D/2D), DistributedSparse (SpParMat from CombBLAS/KDT); streaming for out-of-core problems.

• NLA: faster least-squares solvers (sketch-and-solve; sketch-based preconditioning as in Blendenpik, LSRN), low-rank approximations, randomized SVD.

• Distributed optimization: block-splitting Alternating Direction Method of Multipliers (ADMM).

Page 35:

Flavor

% Sketch-and-solve least squares in Matlab (m-by-n problem, sketch size t)
A = rand(m, n);
b = rand(m, 1);

% Gaussian random matrix for JLT
S = randn(t, m);

% Sketch A and b
SA = S*A;
Sb = S*b;

% Sketch and solve
X = SA\Sb;

• Sparse vs. dense matrices
• 1D/2D distributions
• Input-output combinations
• Row-wise or column-wise sketching
• SUMMA-based distributed GEMM
• Communication-free random matrices
• Counter-based PRNGs (Random123)

The same flow through libskylark's Python bindings:

import elem
from skylark import sketch, elemhelper
from mpi4py import MPI
import numpy as np

# Set up the random regression problem (sizes m, n and sketch size t assumed defined).
A = elem.DistMatrix_d_VR_STAR()
elem.Uniform(A, m, n)
b = elem.DistMatrix_d_VR_STAR()
elem.Uniform(b, m, 1)

# Create transform with output type "LocalMatrix".
S = sketch.JLT(m, t, defouttype="LocalMatrix")

# Sketch A and b (note the specialized distributed GEMM).
SA = S * A
Sb = S * b

# SA and Sb reside on rank zero, so solve there.
if MPI.COMM_WORLD.Get_rank() == 0:
    # Solve using NumPy.
    [x, res, rank, s] = np.linalg.lstsq(SA, Sb)
else:
    x = None

Page 36:

Implementation References

• Sketching layer:

  Abbreviation   Name                                      Reference
  JLT            Johnson-Lindenstrauss Transform           Johnson and Lindenstrauss, 1984
  FJLT           Fast Johnson-Lindenstrauss Transform      Ailon and Chazelle, 2009
  CT             Cauchy Transform                          Sohler and Woodruff, 2011
  MMT            Meng-Mahoney Transform                    Meng and Mahoney, 2013
  CWT            Clarkson-Woodruff Transform               Clarkson and Woodruff, 2013
  WZT            Woodruff-Zhang Transform                  Woodruff and Zhang, 2013
  PPT            Pham-Pagh Transform                       Pham and Pagh, 2013
  ESRLT          Random Laplace Transform                  Yang et al., 2014
  LRFT           Laplacian Random Fourier Transform        Rahimi and Recht, 2007
  GRFT           Gaussian Random Fourier Transform         Rahimi and Recht, 2007
  FGRFT          Fast Gaussian Random Fourier Transform    Le, Sarlos and Smola, 2013

• Avron, H., Maymounkov, P. and Toledo, S., Supercharging LAPACK's Least Squares Solver, 2010
• Meng, X., Saunders, M. A. and Mahoney, M. W., LSRN: A Parallel Iterative Solver for Strongly Over- or Under-Determined Systems, 2012
• Halko, N., Martinsson, P. G. and Tropp, J., Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM Review, Vol. 53, No. 2, pp. 217-288, 2011
• Parikh, N. and Boyd, S., Block splitting for distributed optimization, Math. Prog. Comp., October 2013
• Sindhwani, V. and Avron, H., High-performance Kernel Machines with Implicit Distributed Optimization and Randomization, 2014

Page 37:

Conclusion

• High-performance implementations of randomized algorithms and distributed optimization, with an emphasis on scaling up non-parametric models.

• Scalable kernel methods may be promising alternatives to Deep Neural Networks.

  – Incorporating prior knowledge (e.g., invariances) and new forms of kernel learning

• libskylark: http://xdata-skylark.github.io/libskylark/

Page 38:

Thank you.

The Machine Learning Group at IBM Research, NY is hiring Research Staff Members and Postdocs!