Coordinate-wise Power Method
Qi Lei, Kai Zhong¹, and Inderjit S. Dhillon¹²

¹Institute for Computational Sciences and Engineering, ²Department of Computer Science, The University of Texas at Austin

Motivation

Goal: Given a matrix A, we seek to compute its dominant eigenvector v_1:

v_1 = argmax_{‖x‖=1} x^T A^T A x    (1)

Computing the dominant eigenvector of a given matrix/graph is meaningful for:

▶ Graph Centrality/PageRank
▶ Sparse PCA
▶ Spectral Clustering

The classic power method is still powerful in the sense of:
▶ Simplicity
▶ Small memory footprint
▶ Stability: it is resistant to noise

We propose two coordinate-wise versions of the power method, from an optimization viewpoint.

A brief review of the Power Method

▶ Given a matrix A, let its two dominant eigenvalues be λ_1, λ_2 and its dominant eigenvector be v_1. Power iteration conducts:

x^(l+1) ← normalize(A x^(l))    (2)

▶ This is inefficient since some coordinates converge faster than others (illustrated numerically in the sketch after this list), e.g.,

A = [2 0 1; 0 3 0; 1 0 2],
x: (0.71, 0.71, 0) → (0.53, 0.80, 0.27) → (0.45, 0.81, 0.36) → (0.42, 0.82, 0.39) → (0.41, 0.82, 0.40)

Therefore we want to select and update important coordinates only.
▶ One key question: how to select the coordinates?
▶ Another key problem: how to choose these coordinates without too much overhead?
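To make the uneven convergence concrete, here is a minimal NumPy sketch (our illustration, not part of the poster) that runs power iteration (2) on the 3×3 example above and prints how much each coordinate moves per step; the second coordinate settles almost immediately while the first and third keep changing:

```python
import numpy as np

# The 3x3 example from the poster, started from x = (1, 1, 0)/sqrt(2).
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0],
              [1.0, 0.0, 2.0]])
x = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)

for it in range(5):
    x_new = A @ x
    x_new /= np.linalg.norm(x_new)      # classic power iteration, Eq. (2)
    print(f"iter {it + 1}: x = {np.round(x_new, 2)}, "
          f"coordinate changes = {np.round(np.abs(x_new - x), 3)}")
    x = x_new
```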

Algorithm of Coordinate-wise Power Method (CPM)

MAIN IDEA: Choose the k coordinates with the most potential change and update them only.

1. Define auxiliary parameters:
   1.1 z = Ax, maintained for algorithm efficiency.
   1.2 Coordinate selection criterion: c = z/(x^T z) − x.
2. Coordinate selection: let Ω be the set of the k coordinates of c with the largest magnitude.
3. Update the new iterate x+:
   y_i ← z_i/(x^T z) if i ∈ Ω,  y_i ← x_i if i ∉ Ω;  x+ ← y/‖y‖.
4. Update the auxiliary parameters with the k changes in x using O(kn) operations:
   z+ ← z + A^T_{:,Ω}(y_Ω − x_Ω),  z+ ← z+/‖y‖,  c+ = z+/((x+)^T z+) − x+.
5. Repeat steps 2-4.
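A compact NumPy sketch of these five steps follows. It is our illustration rather than the authors' C++/Eigen implementation; the function name `cpm`, its arguments, and the random test matrix are assumptions, and A is taken to be symmetric PSD (the setting of Theorem 1), so that A^T_{:,Ω} coincides with A_{:,Ω}:

```python
import numpy as np

def cpm(A, k, iters=1000, x0=None):
    """Coordinate-wise Power Method: a sketch of steps 1-5 above (A symmetric PSD)."""
    n = A.shape[0]
    x = np.ones(n) / np.sqrt(n) if x0 is None else x0 / np.linalg.norm(x0)
    z = A @ x                                          # auxiliary vector z = A x
    for _ in range(iters):
        c = z / (x @ z) - x                            # potential change of each coordinate
        omega = np.argpartition(np.abs(c), -k)[-k:]    # k coordinates of largest |c_i|
        y = x.copy()
        y[omega] = z[omega] / (x @ z)                  # update only the selected coordinates
        z = z + A[:, omega] @ (y[omega] - x[omega])    # maintain z = A y in O(kn) time
        nrm = np.linalg.norm(y)
        x, z = y / nrm, z / nrm                        # renormalize iterate and auxiliary z
    return x

# toy usage on a random PSD matrix (an assumption made for this sketch)
rng = np.random.default_rng(0)
B = rng.standard_normal((500, 500))
A = B @ B.T
v = cpm(A, k=50)
print("Rayleigh quotient:", v @ A @ v, " top eigenvalue:", np.linalg.eigvalsh(A)[-1])
```

Maintaining z incrementally is what keeps each iteration at O(kn), rather than the O(n²) cost of a full power step.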

Illustration on how CPM works

[Figure: (a) Illustration of one update of CPM. (b) Number of updates of each coordinate.]

▶ (a) One iteration of CPM achieves a result similar to the Power Method, but with fewer operations.
▶ (b) The unevenness of updates suggests that selecting important coordinates saves many useless updates of the Power Method.

Relation to Optimization & Coordinate selection rules

▶ Power method ⇐⇒ Alternating minimization for rank-1 matrix approximation:

argmin_{x, y ∈ R^n} f(x, y) = ‖A − x y^T‖²_F    (3)

▶ Updating rules for alternating minimization:
   x ← argmin_α f(α, y) = Ay/‖y‖²,   y ← argmin_β f(x, β) = A^T x/‖x‖².
▶ The following coordinate selection rules for (3) are equivalent (see the numerical check after the table below):
   1. largest coordinate value change, denoted |δx_i|;
   2. largest partial gradient (Gauss-Southwell rule), |∇_i f(x)|;
   3. largest function value decrease, |f(x + δx_i e_i) − f(x)|.

▶ A simple alteration of the objective function for rank-1 approximation of symmetric matrices:

   Algorithm       Compared to                 Objective function
   Power Method    Alternating Minimization    f(x, y) = ‖A − x y^T‖²_F
   CPM             Greedy Coordinate Descent   f(x, y) = ‖A − x y^T‖²_F
   SGCD            Greedy Coordinate Descent   f(x) = ‖A − x x^T‖²_F
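As referenced above, the equivalence of the three selection rules (with y held fixed and a single coordinate of x updated) can be checked numerically. The sketch below is our illustration, not part of the poster; the matrix sizes and random data are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 40, 30
A = rng.standard_normal((m, n))
x, y = rng.standard_normal(m), rng.standard_normal(n)

def f(x, y):
    return np.linalg.norm(A - np.outer(x, y), "fro") ** 2

Ay, ynorm2 = A @ y, y @ y
delta = Ay / ynorm2 - x                        # rule 1: optimal change of each x_i
grad = 2.0 * (x * ynorm2 - Ay)                 # rule 2: partial gradients of f w.r.t. x_i
decrease = np.array([f(x, y) - f(x + d * np.eye(m)[i], y)   # rule 3: exact decrease
                     for i, d in enumerate(delta)])

# all three rules pick the same coordinate
print(np.argmax(np.abs(delta)), np.argmax(np.abs(grad)), np.argmax(decrease))
```

For fixed y one has ∇_i f(x, y) = 2(x_i‖y‖² − (Ay)_i) = −2‖y‖² δx_i and f(x, y) − f(x + δx_i e_i, y) = ‖y‖² δx_i², so the three rankings coincide.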

Algorithm of Symmetric Greedy Coordinate Descent (SGCD)

▶ We also propose a new method we call Symmetric Greedy Coordinate Descent (SGCD) for symmetric matrices.
▶ MAIN IDEA: use greedy and exact coordinate descent on f(x) = ‖A − x x^T‖²_F.
▶ Main differences:
   1. A different coordinate selection criterion: c = Ax/‖x‖² − x (parallel to the gradient of f(x)).
   2. A different update rule for x+ on Ω:
      x+_i = argmin_α f(x + (α − x_i)e_i) if i ∈ Ω,   x+_i = x_i if i ∉ Ω.

▶ Exact update:
   ▶ Solve for x+_i = α satisfying ∇_i f(x + (α − x_i)e_i) = 0, i.e., α³ + pα + q = 0, where p = ‖x‖² − x_i² − a_ii and q = −a_i^T x + a_ii x_i.
   ▶ O(n) operations.
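A small NumPy sketch of this exact update (our illustration, assuming a symmetric test matrix; `np.roots` is used to solve the depressed cubic, and among its real stationary points we keep the one that minimizes f):

```python
import numpy as np

def sgcd_coordinate_update(A, x, i):
    """Exact minimization of f(x) = ||A - x x^T||_F^2 over coordinate i (A symmetric)."""
    p = x @ x - x[i] ** 2 - A[i, i]
    q = -(A[i] @ x) + A[i, i] * x[i]
    roots = np.roots([1.0, 0.0, p, q])                 # alpha^3 + p*alpha + q = 0
    real_roots = roots[np.abs(roots.imag) < 1e-8].real
    def f_restricted(alpha):                           # f along coordinate i
        xi = x.copy()
        xi[i] = alpha
        return np.linalg.norm(A - np.outer(xi, xi), "fro") ** 2
    return min(real_roots, key=f_restricted)           # real stationary point minimizing f

# toy usage: greedy k = 1 updates monotonically decrease the objective
rng = np.random.default_rng(2)
B = rng.standard_normal((100, 100))
A = (B + B.T) / 2                                      # symmetric test matrix (assumed)
x = rng.standard_normal(100)
f = lambda v: np.linalg.norm(A - np.outer(v, v), "fro") ** 2
print("initial f:", round(f(x), 1))
for _ in range(300):
    i = int(np.argmax(np.abs(A @ x / (x @ x) - x)))    # SGCD selection criterion c
    x[i] = sgcd_coordinate_update(A, x, i)
print("after 300 greedy coordinate updates:", round(f(x), 1))
```

Each update exactly minimizes f along the chosen coordinate, so the objective is non-increasing. The poster's O(n) cost refers to forming p and q; the sketch re-evaluates f in full when choosing among roots, purely for clarity.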

Convergence guarantees for CPM and SGCD

▶ For the Coordinate-wise Power Method (CPM), we prove global linear convergence for any positive semidefinite matrix A.

Theorem 1
Convergence rate: T = O(λ_1/(λ_1 − λ_2) · log(1/ε)) iterations suffice to achieve tan θ_{x^(l), v_1} ≤ ε, provided the "noise rate" satisfies ‖c_{[n]−Ω}‖/‖c‖ ≲ (λ_1 − λ_2)/λ_1.

▶ For Symmetric Greedy Coordinate Descent (SGCD), we prove local linear convergence:

Theorem 2
Convergence rate: T = O(λ_1/(λ_1 − λ_2) · log(1/ε)) iterations suffice to achieve f(x^(l)) − f(v) ≤ ε, provided x^(0) is sufficiently close to v_1: ‖x^(0) − v_1‖ ≲ (λ_1 − λ_2)/√λ_1.

Experimental Results

▶ Scalability experiments comparing our methods with the power method, the Lanczos method, and VRPCA (Ohad Shamir, 2015), implemented in C++ with the Eigen library on a single machine with 16 GB memory.
▶ Performance on dense, synthetic datasets:

[Figure: convergence time (sec, log scale) for CPM, SGCD, PM, Lanczos, and VRPCA, plotted against λ_2/λ_1 (left) and against the dimension n (right).]

▶ Performance on real, sparse datasets:

[Figure: tan θ_{x, v_1} (log scale) versus time (sec) for CPM, SGCD, PM, and Lanczos on LiveJournal, com-Orkut, and web-Stanford.]

   Dataset      LiveJournal    com-Orkut      web-Stanford
   # nodes      4,847,571      3,072,626      281,903
   # nonzeros   86,220,856     234,370,166    3,985,272

▶ Extension to the out-of-core case:

[Figure: tan θ_{x, v_1} (log scale) versus time (sec) for CPM, SGCD, and PM in the out-of-core setting.]

▶ Existing methods cannot easily be applied to out-of-core datasets.
▶ Our methods show that updating only k coordinates of the iterate x still improves the target direction.
▶ We can choose k so that k rows of the data fit in memory and then fully update the corresponding coordinates (see the sketch below).
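A rough NumPy sketch of one such out-of-core pass (our illustration, not the authors' implementation; the file path, `np.memmap` layout, and block schedule are assumptions, and A is again taken to be symmetric): the rows are read block by block from disk, and only the coordinates whose rows are currently in memory are updated.

```python
import numpy as np

def out_of_core_cpm_pass(path, n, block, x, z):
    """One block-wise pass over a symmetric matrix stored row-major on disk.

    x is the current unit-norm iterate and z = A x; only the coordinates whose
    rows are currently in memory get updated (hypothetical file layout).
    """
    A = np.memmap(path, dtype=np.float64, mode="r", shape=(n, n))
    for start in range(0, n, block):
        rows = slice(start, min(start + block, n))     # k rows that fit in memory
        A_block = np.asarray(A[rows])                   # read this block from disk
        y_block = z[rows] / (x @ z)                     # fully update these coordinates
        z = z + A_block.T @ (y_block - x[rows])         # keep z = A y (A symmetric)
        x = x.copy()
        x[rows] = y_block
        nrm = np.linalg.norm(x)
        x, z = x / nrm, z / nrm                         # renormalize after each block
    return x, z
```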

Mail: {leiqi, zhongkai}@ices.utexas.edu, [email protected]