
Page 1:

Large-scale Learning with Kernels & libSkylark

Vikas Sindhwani, IBM Research, NY

MMDS 2014: Algorithms for Modern Massive Data Sets
UC Berkeley, June 19th 2014

Page 2:

What do you see?

Pages 3-7:

Motivation

• Train Lasso on (x_1, y_1), ..., (x_n, y_n), x_i ∈ R^d with n = 1 billion, d = 10K. What do you see?

  – Systems: 80 terabytes ⇒ MapReduce script on a Hadoop cluster?
  – NLA: argmin_{‖w‖_1≤1} ‖Xw − y‖²₂ ⇒ argmin_{‖w‖_1≤1} ‖SXw − Sy‖²₂ (sketched in code below)
  – Statistician: with high probability (1 − δ), a training set with n = O((1/ε²) log(2d²/δ)) is enough for the generalization error to be within ε of that of the best linear model with ‖·‖_1 ≤ 1 ⇒ don't need n = 1B

• Big-data machine learning stack design requires conversations.

• Avoid strong assumptions (linearity, sparsity, ...) upfront to avoid saturation ⇒ non-parametric models (models that grow with the data)

• This setting needs both Distributed computation and Randomization
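
To make the NLA bullet concrete: a minimal numpy sketch-and-solve example (not from the slides, with illustrative sizes). A Gaussian sketching matrix S compresses X and y, and the compressed least-squares solution is nearly as good as the full one; for simplicity this drops the slide's ‖w‖_1 ≤ 1 constraint and solves plain least squares.

# Minimal sketch-and-solve example (not from the slides): a Gaussian sketch S
# with t rows approximately preserves least-squares residuals, which is what
# justifies replacing ||Xw - y||_2 by ||SXw - Sy||_2. Sizes are illustrative,
# and the L1 constraint of the Lasso bullet is dropped for simplicity.
import numpy as np

rng = np.random.default_rng(0)
n, d, t = 20_000, 50, 500
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

S = rng.standard_normal((t, n)) / np.sqrt(t)        # dense Gaussian (JLT-style) sketch
w = np.linalg.lstsq(S @ X, S @ y, rcond=None)[0]    # solve the small sketched problem

res_sketch = np.linalg.norm(X @ w - y)
res_best = np.linalg.norm(X @ np.linalg.lstsq(X, y, rcond=None)[0] - y)
print(res_sketch / res_best)                        # close to 1: near-optimal on the full data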


Page 8:

Outline

• Non-parametric modeling with kernel methods, their scalability problems, and Random Fourier Features (Rahimi & Recht, 2007)

• Kernel methods match DNNs on (knowledge-free) speech recognition benchmarks (ICASSP 2014): parallelization + randomization

• Recent efforts towards improving scalability:

  – Practical implementations of distributed ADMM: handling large numbers of examples and large random feature spaces.
  – Quasi-Monte Carlo feature maps (ICML 2014)

• libskylark: an open-source software library instantiating sketching primitives and randomized Numerical Linear Algebra (NLA) techniques for large-scale machine learning in distributed-memory environments.

Page 9:

Acknowledgements and References

• XDATA Skylark team: Ken Clarkson (PI), Haim Avron, Costas Bekas, Christos Boutsidis, Ilse Ipsen, Yves Ineichien, Anju Kambadur, Giorgios Kollias, Michael Mahoney, Vikas Sindhwani, David Woodruff

• High-performance Kernel Machines with Implicit Distributed Optimization and Randomization, with Haim Avron, 2014

• Quasi-Monte Carlo Feature Maps for Shift Invariant Kernels, with Jiyan Yang, Haim Avron, and Michael Mahoney, ICML 2014

• Kernel Methods Match Deep Neural Networks on TIMIT, with P. Huang, H. Avron, T. Sainath and B. Ramabhadran, ICASSP 2014

• Random Laplace Feature Maps for Semigroup Kernels on Histograms, with J. Yang, Q. Fan, H. Avron, M. Mahoney, CVPR 2014

Page 10:

Kernel Methods (Aronszajn, 1950)

• Symmetric positive definite function k(x, z) on input domain X ⊂ R^d

• k ⇔ a rich Reproducing Kernel Hilbert Space (RKHS) H_k of real-valued functions, with inner product 〈·, ·〉_k and norm ‖·‖_k

• Regularized risk minimization ⇔ linear models in an implicit high-dimensional (often infinite-dimensional) feature space:

  f⋆ = argmin_{f ∈ H_k} (1/n) Σ_{i=1}^n V(y_i, f(x_i)) + λ ‖f‖²_{H_k},   x_i ∈ R^d

• Representer Theorem: f⋆(x) = Σ_{i=1}^n α_i k(x, x_i)

Page 11:

The Issue of Scalability

• Regularized Least Squares with the kernel matrix K_ij = k(x_i, x_j):

  (K + λI) α = Y   ⇒   O(n²) storage, O(n³ + n²d) training, O(nd) test speed

  Hard to parallelize when working directly with K_ij = k(x_i, x_j).

• Linear kernels: k(x, z) = xᵀz, f⋆(x) = xᵀw (with w = Xᵀα):

  (XᵀX + λI) w = XᵀY   ⇒   O(nd) storage, O(nd²) training, O(d) test speed

  (Both solves are illustrated in code below.)
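
A minimal numpy illustration of the two solves (not from the slides; sizes and the Gaussian kernel bandwidth are illustrative): the kernel path forms an n × n matrix, while the linear path only needs a d × d system.

# Minimal numpy illustration (not from the slides) of the two solves above.
# Forming K costs O(n^2) memory and the solve O(n^3); the linear-kernel path
# only requires a d x d solve. Sizes and the bandwidth sigma are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, sigma = 2000, 20, 1e-2, 1.0
X = rng.standard_normal((n, d))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Kernel RLS: (K + lam*I) alpha = Y, Gaussian kernel
sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma**2))
alpha = np.linalg.solve(K + lam * np.eye(n), Y)             # O(n^3)

# Linear RLS: (X^T X + lam*I) w = X^T Y
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)     # O(nd^2 + d^3)

print(alpha.shape, w.shape)   # (n,) vs (d,): test-time cost O(nd) vs O(d)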


Page 12:

Randomized Algorithms: Definitions

• Data-oblivious explicit feature map Ψ : R^d → C^s such that

  k(x, z) ≈ 〈Ψ(x), Ψ(z)〉_{C^s}

  ⇒ (Z(X)ᵀZ(X) + λI) w = Z(X)ᵀY   ⇒   O(ns) storage, O(ns²) training, O(s) test speed

• Shift-invariant kernels: k(x, z) = ψ(x − z), for some complex-valued positive definite function ψ on R^d

  – Given any set of m points z_1, ..., z_m ∈ R^d, the m × m matrix A_ij = ψ(z_i − z_j) is positive definite.

Page 13:

Randomized Algorithms: Bochner's Theorem

Theorem 1 (Bochner, 1932-33). A complex-valued function ψ : R^d → C is positive definite if and only if it is the Fourier transform of a finite non-negative measure µ on R^d, i.e.,

  ψ(x) = µ̂(x) = ∫_{R^d} e^{−ixᵀw} dµ(w),   x ∈ R^d.

[Figure: duality between ψ(x) and the Fourier modes e^{−iwᵀx}.]

Page 14:

Random Fourier Features (Rahimi & Recht, 2007)

• One-to-one correspondence between k and a density p such that

  k(x, z) = ψ(x − z) = ∫_{R^d} e^{−i(x−z)ᵀw} p(w) dw

  Gaussian kernel: k(x, z) = e^{−‖x−z‖²₂/(2σ²)}  ⇐⇒  p = N(0, σ⁻²I_d)

• Monte Carlo approximation to the integral representation (see the numpy sketch below):

  k(x, z) ≈ (1/s) Σ_{j=1}^s e^{−i(x−z)ᵀw_j} = 〈Ψ_S(x), Ψ_S(z)〉_{C^s}

  Ψ_S(x) = (1/√s) [ e^{−ixᵀw_1}, ..., e^{−ixᵀw_s} ] ∈ C^s,   S = [w_1, ..., w_s] ∼ p
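
A minimal numpy sketch of this Monte Carlo feature map (not from the slides), assuming a Gaussian kernel with σ = 1: draw w_1, ..., w_s ~ N(0, I_d), build Ψ_S, and check that 〈Ψ_S(x), Ψ_S(z)〉 approaches k(x, z).

# Minimal numpy sketch (not from the slides) of the random Fourier feature map,
# assuming a Gaussian kernel with sigma = 1: draw w_1..w_s ~ N(0, I_d), form
# Psi_S(x) = exp(-i x^T w_j)/sqrt(s), and compare <Psi(x), Psi(z)> with k(x, z).
import numpy as np

rng = np.random.default_rng(0)
d, s = 10, 20_000
x = rng.standard_normal(d)
z = x + 0.3 * rng.standard_normal(d)                 # a nearby point, so k(x, z) is not tiny

W = rng.standard_normal((d, s))                      # S = [w_1 ... w_s] ~ N(0, I_d)
psi = lambda v: np.exp(-1j * (v @ W)) / np.sqrt(s)   # Psi_S : R^d -> C^s

approx = np.vdot(psi(x), psi(z))                     # <Psi_S(x), Psi_S(z)>_{C^s}
exact = np.exp(-np.linalg.norm(x - z)**2 / 2.0)      # Gaussian kernel, sigma = 1
print(abs(approx - exact))                           # O(1/sqrt(s)) Monte Carlo error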


Page 15:

DNNs vs Kernel Methods on TIMIT (Speech)

[Figure: Classification error (%) vs. number of random features (s / 10000, up to 8) on TIMIT (n = 2M, d = 440, k = 147 classes); curves for a DNN (440-4k-4k-147), Random Fourier features, and the exact kernel (n = 100k, 75 GB).]

% Training (X: n-by-d data, rows are examples; y: labels; s: number of
% random features; lambda: ridge parameter)
G = randn(size(X,2), s);          % w_1..w_s ~ N(0, I_d) for the Gaussian kernel
Z = exp(1i*X*G);                  % random Fourier features, n-by-s
alpha = (lambda*eye(size(Z,2)) + Z'*Z) \ (Z'*y(:));

% Testing
ztest = exp(1i*xtest*G)*alpha;


Page 16:

Learning in High-dimensional Random Feature Spaces

Kernel Methods Match Deep Neural Networks on TIMIT, ICASSP 2014

[Figure: Classification error (%) vs. number of random features (s / 10000, up to 40) on TIMIT (n = 2M, d = 440, k = 147); curves for a DNN (440-4k-4k-147), Random Fourier features, and the exact kernel (n = 100k, 75 GB). Phone error rate: 21.3% (random features) < 22.3% (DNN).]

• Distributed solvers needed (Z ∈ R^{2M×400K} ≈ 6 terabytes)
• More effective feature maps?

Page 17:

Distributed Learning with ADMM

• Alternating Direction Method of Multipliers (1950s; Boyd et al., 2013)

  argmin_{x∈R^n, z∈R^m} f(x) + g(z)   subject to   Ax + Bz = c

• Several variations: row-splitting if examples are split across processors.

  argmin_{x∈R^d} Σ_{i=1}^R f_i(x) + g(x)   ⇒   Σ_{i=1}^R f_i(x_i) + g(z)   s.t.  x_i = z    (1)

  x_i^{k+1} = argmin_x  f_i(x) + (ρ/2) ‖x − z^k + ν_i^k‖²₂    (2)

  z^{k+1} = prox_{g/(Rρ)} [ x̄^{k+1} + ν̄^k ]    (3)

  ν_i^{k+1} = ν_i^k + x_i^{k+1} − z^{k+1}    (4)

  where prox_f[x] = argmin_y (1/2) ‖x − y‖²₂ + f(y)

• Note: extra consensus and dual variables need to be managed.
• Closed-form updates, extensibility, code reuse, parallelism (a consensus-ADMM sketch follows below).
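
A compact Python sketch (not the libskylark implementation) of updates (2)-(4) for row-split ridge regression, assuming f_i(x) = ‖A_i x − b_i‖²₂ on block i and g(z) = λ‖z‖²₂; block counts and sizes are illustrative.

# Compact consensus-ADMM sketch (not the libskylark implementation) of updates
# (2)-(4) above for row-split ridge regression: f_i(x) = ||A_i x - b_i||^2 on
# block i and g(z) = lam*||z||^2. Block counts and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
R, n_i, d, lam, rho = 4, 500, 30, 1e-1, 1.0
A = [rng.standard_normal((n_i, d)) for _ in range(R)]       # data rows held by R workers
x_true = rng.standard_normal(d)
b = [Ai @ x_true + 0.01 * rng.standard_normal(n_i) for Ai in A]

x = np.zeros((R, d)); nu = np.zeros((R, d)); z = np.zeros(d)
# Cache per-block factors: (2) has the closed form
# x_i = (A_i^T A_i + rho*I)^{-1} (A_i^T b_i + rho*(z - nu_i)).
L = [np.linalg.cholesky(Ai.T @ Ai + rho * np.eye(d)) for Ai in A]
Atb = [Ai.T @ bi for Ai, bi in zip(A, b)]

for k in range(100):
    for i in range(R):                                      # (2) local updates (parallel in practice)
        r = Atb[i] + rho * (z - nu[i])
        x[i] = np.linalg.solve(L[i].T, np.linalg.solve(L[i], r))
    z = rho * R / (2 * lam + rho * R) * (x + nu).mean(axis=0)   # (3) prox of g/(R*rho) at the average
    nu += x - z                                             # (4) dual updates

print(np.linalg.norm(z - x_true))                           # small: close to the centralized ridge solution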


Page 18:

ADMM Block Splitting and Hybrid Parallelism

• Implicit optimization problems:

  argmin_{W∈R^{s×m}} Σ_{i=1}^n V(y_i, Wᵀz(x_i)) + λ r(W)

• Distributed memory (MPI) across R nodes, T cores each (OpenMP)
• Implicit R × C block partitioning of Z:

  [Figure: [Y_1, ..., Y_R] and [X_1, ..., X_R] are split row-wise across R MPI processes; each process uses T threads to form its row of feature blocks [Z_i1, Z_i2, ..., Z_iC].]

• ADMM coordinates local models on Z_ij = T[X_i, j], built on the fly.

Page 19:

ADMM Block Splitting and Hybrid Parallelism

• ADMM block-splitting formulation (follows Parikh and Boyd, 2013):

  Σ_{i=1}^R l_i(O_i) + λ Σ_{j=1}^C r_j(W_j) + Σ_{ij} I[O_ij = Z_ij W_ij]

  s.t.  W_j = W_ij,  O_i = Σ_j O_ij   (local-global consistency)

  – Variables live MPI-process-local, thread-local, or on the master process.

• Key steps: prox_f, prox_r (parallel; closed-form) and the graph projection (sketched in code below)

  proj_{Z_ij}[(Y, X)] = argmin_{V = Z_ij U} (1/2) ‖V − Y‖²_fro + (1/2) ‖U − X‖²_fro

  ⇒  U = [Z_ijᵀ Z_ij + λI]⁻¹ (X + Z_ijᵀ Y)   (factor cached),   V = Z_ij U    (5)
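
A small numpy sketch (not libskylark code) of the graph-projection step (5), keeping the λI term as written on the slide; the Cholesky factor of Z_ijᵀZ_ij + λI is formed once and reused across ADMM iterations.

# Small numpy sketch (not libskylark code) of the graph projection (5): map
# (Y, X) to (V, U) with V = Z U. The Cholesky factor of Z^T Z + lam*I is formed
# once ("cached") and reused at every ADMM iteration; lam*I follows the slide.
import numpy as np

def make_graph_projector(Z, lam=1.0):
    """Return (Y, X) -> (V, U) implementing formula (5); lam = 1 gives the exact graph projection."""
    L = np.linalg.cholesky(Z.T @ Z + lam * np.eye(Z.shape[1]))   # cached factor
    def project(Y, X):
        U = np.linalg.solve(L.T, np.linalg.solve(L, X + Z.T @ Y))
        return Z @ U, U
    return project

# Toy usage with illustrative sizes.
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 50))
project = make_graph_projector(Z)
V, U = project(rng.standard_normal((200, 3)), rng.standard_normal((50, 3)))
print(np.allclose(V, Z @ U))   # True: the projected pair lies on the graph of Z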


Page 20:

Making the Implementation Practical

• C ↑ ⇒ cache memory ↓, computation ↓; shared-memory parallelism ↑.

• However, the O_ij blocks grow as n_i·C·m (e.g., 335K examples, C = 64, m = 100 exhausts 16 GB). Fortunately, this can be avoided using incremental updates, shared-memory access, and the structure of the graph projection.

[Figure: update rules from Parikh and Boyd, 2013.]

If C = s/d, per-iteration complexity is linear in n and s.

Page 21:

Scalability and Effect of Splitting

[Figure, left: MNIST strong scaling on Triloka; speedup vs. number of MPI processes (t = 6 threads/process) against the ideal line, plus classification accuracy (%). Figure, right: TIMIT column splitting; accuracy vs. time (secs) for C ∈ {50, 100, 200, 400, 800, 1000}.]

Table 1: Comparison on TIMIT-binary (n = 100k, d = 440, s = 100k)

                   Libsvm    PSVM (p = n^0.5)    BlockADMM
  Training Time    80355     47                  42
  Testing Time     1295      259                 2.9
  Accuracy         85.41%    73.1%               83.47%

Page 22:

Revisit Efficiency of Approximate Integration

  k(x, z) = ∫_{R^d} e^{−i(x−z)ᵀw} p(w) dw ≈ (1/s) Σ_{j=1}^s e^{−i(x−z)ᵀw_j}

• Consider the error in approximating an integral on the unit cube:

  ε_S[f] = | ∫_{[0,1]^d} f(x) dx − (1/s) Σ_{w∈S} f(w) |

• The Monte Carlo approach draws S from U([0,1]^d), with convergence rate

  ( E_S[ ε_S[f]² ] )^{1/2} = σ[f] s^{−1/2},   where σ[f]² = var_{X∼U([0,1]^d)}[f(X)]

  O(s^{−1/2}) ⇒ a 4-fold increase in s only cuts the error in half.

Can we do better with a different S?
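
A quick numpy check (not from the slides) of this s^{−1/2} rate on a toy integrand with a known integral: the root-mean-squared error roughly halves each time s is quadrupled.

# Quick numpy check (not from the slides) of the O(s^(-1/2)) Monte Carlo rate,
# using the toy integrand f(w) = prod_j cos(2*pi*w_j) on [0,1]^d, whose exact
# integral is 0. Quadrupling s roughly halves the root-mean-squared error.
import numpy as np

rng = np.random.default_rng(0)
d, trials = 5, 200
f = lambda W: np.prod(np.cos(2 * np.pi * W), axis=1)

for s in (100, 400, 1600, 6400):
    errs = [abs(f(rng.random((s, d))).mean()) for _ in range(trials)]   # |estimate - 0|
    print(s, np.sqrt(np.mean(np.square(errs))))                         # RMSE ~ sigma[f] * s**-0.5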


Page 23:

Low-discrepancy Quasi-Monte Carlo Point Sets

[Figure: 2D scatter plots on [0,1]² of a uniform (MC) point set and a Halton point set.]

• Deterministic, correlated QMC points avoid the clustering and clumping effects seen in MC point sets.

• Hierarchical structure: sample the integrand from coarse to fine as s increases.

Page 24:

Low-discrepancy Quasi-Monte Carlo Point Sets

[Figure only.]

Page 25:

Star Discrepancy

Integration error depends on the variation of f and the uniformity of S.

Theorem 2 (Koksma-Hlawka inequality (1941, 1961)).

  ε_S[f] ≤ D*(S) V_HK[f],   where   D*(S) = sup_{x∈[0,1]^d} | vol(J_x) − |{i : w_i ∈ J_x}| / s |

Page 26:

Quasi-Monte Carlo Sequences

• Low-discrepancy point sets have D*({w_1, ..., w_s}) = O((log s)^d / s), conjectured to be optimal.

  – Halton, Sobol', Faure, Niederreiter sequences, ... we will treat these as black boxes.
  – Implementations available, e.g. Matlab haltonset, sobolset (see the sketch below for a Python counterpart).
  – Usually very cheap to generate.

• For fixed d, asymptotically, the QMC rate (log s)^d / s beats the MC rate s^{−1/2}.

  – Note the dimension dependence.
  – Empirically, QMC is better even for very high-dimensional integration.
  – Modern analysis: worst-case analysis over a nice space of integrands (an RKHS), or average-case analysis assuming a distribution over integrands.
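
The Python counterpart to the Matlab generators mentioned above is scipy.stats.qmc (SciPy 1.7+); the short sketch below (an illustration, not from the slides) generates Halton points and compares their integration error with plain MC on the toy integrand used earlier.

# Python counterpart (illustrative, not from the slides) to Matlab's haltonset /
# sobolset, using scipy.stats.qmc: generate Halton points and compare their
# integration error against plain MC on f(w) = prod_j cos(2*pi*w_j), whose
# exact integral over [0,1]^d is 0.
import numpy as np
from scipy.stats import qmc

d, s = 5, 4096
f = lambda W: np.prod(np.cos(2 * np.pi * W), axis=1)

halton = qmc.Halton(d=d, scramble=False).random(s)    # low-discrepancy point set
mc = np.random.default_rng(0).random((s, d))          # uniform MC point set

print("Halton error:", abs(f(halton).mean()))
print("MC error:    ", abs(f(mc).mean()))             # typically larger than the Halton error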


Page 27:

QMC Fourier Features: Algorithm

• Transform variables:

  ∫_{R^d} e^{−i(x−z)ᵀw} p(w) dw = ∫_{[0,1]^d} e^{−i(x−z)ᵀΦ⁻¹(t)} dt

• QMC feature maps for shift-invariant kernels, [YSAM] 2014 (sketched in code below):

  1. Given k, compute the density p (= ∏_{j=1}^d p_j) via the inverse Fourier transform.
  2. Generate a low-discrepancy sequence t_1, ..., t_s in [0, 1]^d.
  3. Transform: w_i = (Φ₁⁻¹(t_{i1}), ..., Φ_d⁻¹(t_{id})) and set S = [w_1, ..., w_s].
  4. Compute Z′ = XS.
  5. Compute Z_ij = (1/√s) e^{−i Z′_ij}.
  6. Run a linear method on Z.
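
A short numpy/scipy sketch (not the paper's code) of steps 1-6, assuming a Gaussian kernel with σ = 1, so p = N(0, I_d) and each Φ_j⁻¹ is the Gaussian inverse CDF (scipy.stats.norm.ppf); it compares the Gram-matrix approximation error of QMC features against plain MC features.

# Short numpy/scipy sketch (not the paper's code) of steps 1-6 above, assuming a
# Gaussian kernel with sigma = 1, so p = N(0, I_d) and each Phi_j^{-1} is the
# Gaussian inverse CDF (scipy.stats.norm.ppf). QMC (Halton) features typically
# approximate the Gram matrix better than MC features for the same s.
import numpy as np
from scipy.stats import norm, qmc

rng = np.random.default_rng(0)
n, d, s = 300, 10, 2048
X = rng.standard_normal((n, d))

def fourier_features(W):                      # W = S with shape (d, s)
    Zp = X @ W                                # step 4: Z' = X S
    return np.exp(-1j * Zp) / np.sqrt(s)      # step 5: Z_ij = exp(-i Z'_ij) / sqrt(s)

# Steps 2-3: Halton points pushed through the inverse Gaussian CDF (QMC), vs. plain MC draws.
T = qmc.Halton(d=d, scramble=True, seed=0).random(s)
W_qmc = norm.ppf(T).T                         # shape (d, s)
W_mc = rng.standard_normal((d, s))

sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / 2.0)   # exact Gaussian Gram matrix

for name, W in [("QMC", W_qmc), ("MC", W_mc)]:
    Z = fourier_features(W)
    K_hat = (Z @ Z.conj().T).real             # step 6 would run a linear method on Z
    print(name, np.linalg.norm(K_hat - K) / np.linalg.norm(K))   # relative Frobenius error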


Pages 28-30:

How do standard QMC sequences perform?

[Figure: relative error on ‖K‖₂ vs. number of random features, for MC, Halton, Sobol', Digital net, and Lattice sequences, on USPST (n = 1506) and CPU (n = 6554).]

• QMC methods consistently provide better Gram matrix approximations.

• Why are some QMC sequences better than others?

• Can we learn sequences even better adapted to our problem class?

Page 31:

Characterization via Box Discrepancy

• Define F_□b = { f(w) = e^{−i(x−z)ᵀw} : −b ≤ x − z ≤ b, x, z ∈ X }

Theorem 3 (Expected integration error w.r.t. p over f ∈ F_□b).

  E_{f∼U(F_□b)} [ ε_{S,p}[f]² ] ∝ D□_p(S)².    (6)

• Below, h(u, v) = sinc_b(u, v) = π^{−d} ∏_{j=1}^d sin(b_j(u_j − v_j)) / (u_j − v_j):

  D□_p(S)² = ∫_{R^d} ∫_{R^d} h(ω, φ) p(ω) p(φ) dω dφ
             − (2/s) Σ_{l=1}^s ∫_{R^d} h(w_l, ω) p(ω) dω     [alignment with p (w_l ≈ ω)]
             + (1/s²) Σ_{l=1}^s Σ_{j=1}^s h(w_l, w_j)          [non-uniformity of S]    (7)

  The integrals above can be computed in closed form for the Gaussian density.

Page 32:

Does Box Discrepancy explain performance differences?

[Figure: D□(S)² vs. number of samples on the CPU dataset (d = 21), for Digital net, MC (expected), Halton, Sobol', and Lattice sequences.]

Page 33:

Learning Adaptive QMC Sequences

Unlike the star discrepancy, the box discrepancy admits numerical optimization:

  S* = argmin_{S=(w_1, ..., w_s) ∈ R^{ds}} D□(S).    (8)

[Figure: CPU dataset, s = 100; normalized D□(S)², maximum squared error, mean squared error, and ‖K̂ − K‖₂/‖K‖₂ decreasing over the optimization iterations.]

Page 34:

libskylark: sketching-based matrix computations for ML

http://xdata-skylark.github.io/libskylark/

[Figure: the libskylark software stack: high-performance sketching, NLA, and ML layers with Python bindings, built on Elemental/CombBLAS with MPI communication.]

• Randomized kernel methods: Support Vector Machines, multinomial logistic regression, robust regression, regularized least squares; Gaussian, Laplacian, polynomial, and semigroup kernels. Also matrix completion, multitask learning, PCA and CCA.

• Sketching: oblivious subspace embeddings, applied column-wise (SA) or row-wise (AS′); transforms include JLT, FJLT, CT, FCT, RFT, CWT, WZT, MMT.

• Matrix types (input and output): LocalMatrix (numpy.ndarray, scipy.sparse, elem::Matrix<double>), DistributedDense (Elemental DistMatrix, 1D/2D), DistributedSparse (SpParMat from CombBLAS/KDT); streaming for out-of-core problems.

• NLA: faster least-squares solvers (sketch-and-solve; sketch-based preconditioning as in Blendenpik, LSRN), low-rank approximations, randomized SVD.

• Distributed optimization: block-splitting Alternating Direction Method of Multipliers (ADMM).

Page 35:

Flavor

% Sketch-and-solve least squares in Matlab (m-by-n problem, sketch size t)
A = rand(m, n);
b = rand(m, 1);

% Gaussian random matrix for JLT
S = randn(t, m);

% Sketch A and b
SA = S*A;
Sb = S*b;

% Sketch and solve
X = SA\Sb;

• Sparse vs. dense matrices
• 1D/2D distributions
• Input-output combinations
• Row-wise or column-wise sketching
• SUMMA-based distributed GEMM
• Communication-free random matrices
• Counter-based PRNGs (Random123)

The same flow through libskylark's Python bindings:

import elem
from skylark import sketch, elemhelper
from mpi4py import MPI
import numpy as np

# Set up the random regression problem (sizes m, n and sketch size t assumed defined).
A = elem.DistMatrix_d_VR_STAR()
elem.Uniform(A, m, n)
b = elem.DistMatrix_d_VR_STAR()
elem.Uniform(b, m, 1)

# Create transform with output type "LocalMatrix".
S = sketch.JLT(m, t, defouttype="LocalMatrix")

# Sketch A and b (note the specialized distributed GEMM).
SA = S * A
Sb = S * b

# SA and Sb reside on rank zero, so solve there.
if MPI.COMM_WORLD.Get_rank() == 0:
    # Solve using NumPy.
    [x, res, rank, s] = np.linalg.lstsq(SA, Sb)
else:
    x = None

Page 36:

Implementation References

• Sketching layer:

  Abbreviation   Name                                      Reference
  JLT            Johnson-Lindenstrauss Transform           Johnson and Lindenstrauss, 1984
  FJLT           Fast Johnson-Lindenstrauss Transform      Ailon and Chazelle, 2009
  CT             Cauchy Transform                          Sohler and Woodruff, 2011
  MMT            Meng-Mahoney Transform                    Meng and Mahoney, 2013
  CWT            Clarkson-Woodruff Transform               Clarkson and Woodruff, 2013
  WZT            Woodruff-Zhang Transform                  Woodruff and Zhang, 2013
  PPT            Pham-Pagh Transform                       Pham and Pagh, 2013
  ESRLT          Random Laplace Transform                  Yang et al., 2014
  LRFT           Laplacian Random Fourier Transform        Rahimi and Recht, 2007
  GRFT           Gaussian Random Fourier Transform         Rahimi and Recht, 2007
  FGRFT          Fast Gaussian Random Fourier Transform    Le, Sarlos and Smola, 2013

• Avron, H., Maymounkov, P. and Toledo, S., Supercharging LAPACK's Least Squares Solver, 2010
• Meng, X., Saunders, M. A. and Mahoney, M. W., LSRN: A Parallel Iterative Solver for Strongly Over- or Under-Determined Systems, 2012
• Halko, N., Martinsson, P. G. and Tropp, J., Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM Review, Vol. 53, No. 2, pp. 217-288, 2011
• Parikh, N. and Boyd, S., Block splitting for distributed optimization, Math. Prog. Comp., October 2013
• Sindhwani, V. and Avron, H., High-performance Kernel Machines with Implicit Distributed Optimization and Randomization, 2014

Page 37:

Conclusion

• High-performance implementations of randomized algorithms and distributed optimization, with an emphasis on scaling up non-parametric models.

• Scalable kernel methods may be promising alternatives to Deep Neural Networks.

  – Incorporating prior knowledge (e.g., invariances) and new forms of kernel learning

• libskylark: http://xdata-skylark.github.io/libskylark/

Page 38:

Thank you.

The Machine Learning Group at IBM Research, NY is hiring Research Staff Members and Postdocs!