TRANSCRIPT
Nonconvex Optimization for High-Dimensional Signal Estimation: Spectral and Iterative Methods – Part IV
Yuejie Chi (Carnegie Mellon)
Yuxin Chen (Princeton)
Cong Ma (UC Berkeley)
EUSIPCO Tutorial, December 2020
Outline
• Part I: Introduction and Warm-Up. Why nonconvex? Basic concepts and a warm-up example (PCA).
• Part II: Gradient Descent and Implicit Regularization. Phase retrieval, matrix completion, random initialization.
• Part III: Spectral Methods. A general recipe, ℓ2 and ℓ∞ guarantees, community detection.
• Part IV: Robustness to Corruptions and Ill-Conditioning. Median truncation, least absolute deviation, scaled gradient descent.
Robustness to ill-conditioning?
A factorization approach to low-rank matrix sensing
[Figure: the linear map A(·) takes the matrix M to the measurement vector y.]
M ∈ R^{n1×n2} with rank(M) = r;  A(·): linear map;  y ∈ R^m

y = A(M) + noise

find X ∈ R^{n1×r}, Y ∈ R^{n2×r}, such that y ≈ A(XY^⊤)
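A minimal NumPy sketch of this measurement model (the dimensions, the seed, and the Gaussian design below are illustrative assumptions, not the tutorial's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r, m = 30, 20, 3, 2500           # assumed problem sizes

# ground-truth rank-r matrix M = X* Y*^T
X_star = rng.standard_normal((n1, r))
Y_star = rng.standard_normal((n2, r))
M = X_star @ Y_star.T

# i.i.d. Gaussian design: the i-th measurement is <A_i, M>
A_mats = rng.standard_normal((m, n1, n2)) / np.sqrt(m)
y = np.tensordot(A_mats, M, axes=([1, 2], [0, 1]))   # y = A(M), noiseless
```

Later sketches reuse these names (A_mats, y, r, and the factors).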
Prior art: GD with balancing regularization
min_{X,Y} f_reg(X, Y) = (1/2) ‖y − A(XY^⊤)‖_2^2 + (1/8) ‖X^⊤X − Y^⊤Y‖_F^2
• Spectral initialization: find an initial point in the “basin of attraction”:
(X_0, Y_0) ← SVD_r(A^∗(y))
• Gradient iterations: for t = 0, 1, . . .
X_{t+1} = X_t − η ∇_X f_reg(X_t, Y_t)
Y_{t+1} = Y_t − η ∇_Y f_reg(X_t, Y_t)
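As a sketch, the regularized objective and its GD loop might look as follows, continuing the data-generation snippet above (the stepsize and iteration count are arbitrary choices; the gradient expressions are derived from f_reg):

```python
import numpy as np  # reuses A_mats, y, r from the data-generation sketch

def A_op(Z):                     # A(Z) : R^{n1 x n2} -> R^m
    return np.tensordot(A_mats, Z, axes=([1, 2], [0, 1]))

def A_adj(v):                    # adjoint A*(v) = sum_i v_i A_i
    return np.tensordot(v, A_mats, axes=(0, 0))

# spectral initialization: (X0, Y0) <- SVD_r(A*(y))
U, s, Vt = np.linalg.svd(A_adj(y), full_matrices=False)
X = U[:, :r] * np.sqrt(s[:r])
Y = Vt[:r].T * np.sqrt(s[:r])

eta = 0.3                        # assumed stepsize
for t in range(500):
    G = A_adj(A_op(X @ Y.T) - y)           # gradient wrt the product XY^T
    D = X.T @ X - Y.T @ Y                  # balancing-regularization term
    X, Y = X - eta * (G @ Y + 0.5 * X @ D), \
           Y - eta * (G.T @ X - 0.5 * Y @ D)
```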
Prior theory for vanilla GD
Theorem 1 (Tu et al., ICML 2016). Suppose M = X⋆Y⋆^⊤ is rank-r and has a condition number κ = σ_max(M)/σ_min(M). For low-rank matrix sensing with i.i.d. Gaussian design, vanilla GD (with spectral initialization) achieves

‖X_t Y_t^⊤ − M‖_F ≤ ε · σ_min(M)

• Computational: within O(κ log(1/ε)) iterations;
• Statistical: as long as the sample complexity satisfies m ≳ (n1 + n2) r^2 κ^2.
Similar results hold for many low-rank problems.
(Netrapalli et al. ’13; Candes, Li, Soltanolkotabi ’14; Sun and Luo ’15; Chen and Wainwright ’15; Zheng and Lafferty ’15; Ma et al. ’17; . . .)
Convergence slows down for ill-conditioned matrices
min_{X,Y} f(X, Y) = (1/2) ‖P_Ω(XY^⊤ − M)‖_F^2
[Figure: relative error vs. iteration count.]
Vanilla GD converges in O(κ log(1/ε)) iterations.

— Can we provably accelerate the convergence to O(log(1/ε))?
Condition number can be large
chlorine concentration levels: 120 junctions, 180 time slots
[Figure: singular values vs. index; the spectrum follows a power law.]
Data source: www.epa.gov/water-research/epanet
[The same figure, annotated: a rank-5 approximation captures 88%, and a rank-10 approximation captures 96%.]
A new algorithm: scaled gradient descent
f(X, Y) = (1/2) ‖y − A(XY^⊤)‖_2^2
• Spectral initialization: find an initial point in the “basin of attraction”.
• Scaled gradient iterations: for t = 0, 1, . . .
X_{t+1} = X_t − η ∇_X f(X_t, Y_t) (Y_t^⊤ Y_t)^{-1}
Y_{t+1} = Y_t − η ∇_Y f(X_t, Y_t) (X_t^⊤ X_t)^{-1}
where (Y_t^⊤ Y_t)^{-1} and (X_t^⊤ X_t)^{-1} act as preconditioners.
ScaledGD is a preconditioned gradient method without balancing regularization!
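In code, the only change from vanilla GD is the right-multiplication by the inverse Gram matrices; a sketch continuing the matrix-sensing example (the stepsize is an assumed constant):

```python
import numpy as np  # reuses A_op, A_adj, y and the spectral init (X, Y)

eta = 0.5                        # fixed stepsize, an assumed choice
for t in range(300):
    G = A_adj(A_op(X @ Y.T) - y)                              # gradient wrt XY^T
    X_new = X - eta * np.linalg.solve(Y.T @ Y, (G @ Y).T).T   # (Y^T Y)^{-1}
    Y_new = Y - eta * np.linalg.solve(X.T @ X, (G.T @ X).T).T # (X^T X)^{-1}
    X, Y = X_new, Y_new
```

The preconditioners are only r × r, so the per-iteration overhead over vanilla GD is negligible.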
ScaledGD for low-rank matrix completion
[Figure: relative error vs. iteration count on low-rank matrix completion.]
Huge computational saving: ScaledGD converges in a κ-independent manner with minimal overhead!
A closer look at ScaledGD

Invariance to invertible transforms (Tanner and Wei ’16; Mishra ’16):
[Diagram: (X_t, Y_t) and the transformed pair (X_t Q, Y_t Q^{-⊤}) produce the same M_t = X_t Y_t^⊤, and their updates (X_{t+1}, Y_{t+1}) and (X_{t+1} Q, Y_{t+1} Q^{-⊤}) produce the same M_{t+1} = X_{t+1} Y_{t+1}^⊤.]
New distance metric as Lyapunov function:

dist^2([X; Y], [X⋆; Y⋆]) = inf_{Q ∈ GL(r)} ‖(XQ − X⋆) Σ⋆^{1/2}‖_F^2 + ‖(YQ^{-⊤} − Y⋆) Σ⋆^{1/2}‖_F^2.
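The invariance is easy to verify numerically: one ScaledGD step started from (XQ, YQ^{-⊤}) yields the same product matrix as a step started from (X, Y). A self-contained sketch (the full-observation loss 0.5‖M − M⋆‖_F^2 is an illustrative stand-in for a sensing operator):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 8, 2
M_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
grad_M = lambda M: M - M_star        # gradient of 0.5 * ||M - M_star||_F^2

def scaledgd_step(X, Y, eta=0.5):
    G = grad_M(X @ Y.T)
    Xn = X - eta * np.linalg.solve(Y.T @ Y, (G @ Y).T).T
    Yn = Y - eta * np.linalg.solve(X.T @ X, (G.T @ X).T).T
    return Xn, Yn

X, Y = rng.standard_normal((n, r)), rng.standard_normal((n, r))
Q = rng.standard_normal((r, r))      # a generic (invertible) r x r transform

X1, Y1 = scaledgd_step(X, Y)
X2, Y2 = scaledgd_step(X @ Q, Y @ np.linalg.inv(Q).T)
print(np.allclose(X1 @ Y1.T, X2 @ Y2.T))   # True: M_{t+1} is invariant
```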
Theoretical guarantees of ScaledGD
Theorem 2 (Tong, Ma and Chi, 2020). For low-rank matrix sensing with i.i.d. Gaussian design, ScaledGD with spectral initialization achieves

‖X_t Y_t^⊤ − M‖_F ≲ ε · σ_min(M)

• Computational: within O(log(1/ε)) iterations;
• Statistical: the sample complexity satisfies m ≳ n r^2 κ^2.

Compared with Tu et al.: ScaledGD provably accelerates vanilla GD at the same sample complexity!
Stability: a numerical result
For the chlorine concentration levels dataset, ScaledGD converges faster than vanilla GD in a small number of iterations.
[Figure: relative error vs. iteration count on the chlorine dataset.]
Robustness to outliers and corruptions?
Outlier-corrupted low-rank matrix sensing
[Figure: the linear map A(·) takes the matrix M to the measurement vector y, which may be corrupted by sensor failures or malicious attacks.]

M ∈ R^{n1×n2} with rank(M) = r;  A(·): linear map;  y ∈ R^m

y = A(M) + s, where s models the outliers and A(M) = {⟨A_i, M⟩}_{i=1}^m
Arbitrary but sparse outliers: ‖s‖_0 ≤ α · m, where 0 ≤ α < 1 is the fraction of outliers.
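Continuing the sensing sketch, corrupting an α-fraction of the clean measurements takes one line of bookkeeping (the outlier magnitudes below are arbitrary):

```python
import numpy as np  # reuses rng, m, y from the data-generation sketch

alpha = 0.05                                   # assumed outlier fraction
idx = rng.choice(m, size=int(alpha * m), replace=False)
s = np.zeros(m)
s[idx] = 10.0 * rng.standard_normal(idx.size)  # arbitrary sparse corruption
y_corrupt = y + s                              # ||s||_0 <= alpha * m
```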
Existing approaches fail
• Spectral initialization would fail: X_0 ← top-r SVD of Y = (1/m) Σ_{i=1}^m y_i A_i

• Gradient iterations would fail: for t = 0, 1, . . .
X_{t+1} = X_t − (η/m) Σ_{i=1}^m ∇ℓ_i(y_i; X_t)
Even a single outlier can cause the algorithm to fail!
Median-truncated gradient descent
Key idea: “median-truncation” — discard samples adaptively based on how far the sample gradients / values deviate from the median.
• Robustify spectral initialization: X_0 ← top-r SVD of
Y = (1/m) Σ_{i: |y_i| ≲ median({|y_i|})} y_i A_i
• Robustify gradient descent: for t = 0, 1, . . .
X_{t+1} = X_t − (η/m) Σ_{i: |r_t^i| ≲ median({|r_t^i|})} ∇ℓ_i(y_i; X_t)
where r_t^i := |y_i − ⟨A_i, X_t X_t^⊤⟩| is the size of the residual.
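A sketch of one median-truncated step for the symmetric case M = X X^⊤ (the loss ℓ_i = (y_i − ⟨A_i, X X^⊤⟩)^2 / 2 and the truncation constant c are assumptions for illustration, not the paper's tuned values):

```python
import numpy as np

def median_truncated_gd_step(X, A_mats, y, eta=0.2, c=3.0):
    m = len(y)
    resid = y - np.tensordot(A_mats, X @ X.T, axes=([1, 2], [0, 1]))
    keep = np.abs(resid) <= c * np.median(np.abs(resid))   # truncation rule
    # gradient of l_i summed over kept samples: -r_i (A_i + A_i^T) X
    A_sym = A_mats[keep] + np.transpose(A_mats[keep], (0, 2, 1))
    G = -np.tensordot(resid[keep], A_sym, axes=(0, 0)) @ X
    return X - (eta / m) * G
```

Samples whose residuals sit far above the median are simply dropped from the gradient, so a few grossly corrupted measurements cannot steer the update.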
Theoretical guarantees
Theorem 3 (Li, Chi, Zhang, and Liang, Information and Inference 2020). For low-rank matrix sensing with i.i.d. Gaussian design, median-truncated GD (with robust spectral initialization) achieves

‖X_t X_t^⊤ − M‖_F ≤ ε · σ_min(M),

• Computational: within O(κ log(1/ε)) iterations;
• Statistical: provided the sample complexity satisfies m ≳ n r^2 poly(κ, log n);
• Robustness: and the fraction of outliers satisfies α ≲ 1/√r.

Median-truncated GD adds robustness to GD obliviously.
Numerical example
Low-rank matrix sensing: y_i = ⟨A_i, M⟩ + s_i, i = 1, . . . , m

[Figure panels: ground truth; GD, no outliers; GD, 1% outliers; median-TGD, 1% outliers.]

Median-truncated GD achieves performance similar to running GD on the clean data.

Li, Chi, Zhang and Liang, “Non-convex low-rank matrix recovery with arbitrary outliers via median-truncated gradient descent”, Information and Inference: A Journal of the IMA, 2020.
Dealing with outliers: subgradient methods
Least absolute deviation (LAD): (Charisopoulos et al. ’19; Li et al. ’18)

min_{X,Y} f(X, Y) = ‖y − A(XY^⊤)‖_1

Subgradient iterations: for t = 0, 1, . . .
X_{t+1} = X_t − η_t ∂_X f(X_t, Y_t)
Y_{t+1} = Y_t − η_t ∂_Y f(X_t, Y_t)
where η_t is Polyak’s stepsize or a geometrically decaying stepsize.
Dealing with outliers: scaled subgradient methods
Least absolute deviation (LAD):

min_{X,Y} f(X, Y) = ‖y − A(XY^⊤)‖_1

Scaled subgradient iterations: for t = 0, 1, . . .
X_{t+1} = X_t − η_t ∂_X f(X_t, Y_t) (Y_t^⊤ Y_t)^{-1}
Y_{t+1} = Y_t − η_t ∂_Y f(X_t, Y_t) (X_t^⊤ X_t)^{-1}
where the inverse Gram matrices act as preconditioners and η_t is Polyak’s stepsize or a geometrically decaying stepsize.
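A sketch of the scaled subgradient loop on the corrupted measurements (the geometric schedule η_t = η_0 q^t and its constants are assumptions; Polyak's rule would replace the schedule when the optimal value is known):

```python
import numpy as np  # reuses A_op, A_adj, y_corrupt and an initialization (X, Y)

eta0, q = 0.1, 0.98                                 # assumed geometric schedule
for t in range(400):
    g = A_adj(np.sign(A_op(X @ Y.T) - y_corrupt))   # subgradient wrt XY^T
    eta = eta0 * q ** t
    X_new = X - eta * np.linalg.solve(Y.T @ Y, (g @ Y).T).T   # (Y^T Y)^{-1}
    Y_new = Y - eta * np.linalg.solve(X.T @ X, (g.T @ X).T).T # (X^T X)^{-1}
    X, Y = X_new, Y_new
```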
Performance guarantees
Iteration complexity:

Method                                         matrix sensing             quadratic sensing
Subgradient method (Charisopoulos et al. ’19)  κ/(1−2α)^2 · log(1/ε)      rκ/(1−2α)^2 · log(1/ε)
ScaledSM (Tong, Ma, Chi ’20)                   1/(1−2α)^2 · log(1/ε)      r/(1−2α)^2 · log(1/ε)
[Figure: relative error vs. iteration count.]
Robustness to both ill-conditioning and adversarial corruptions!
Demixing sparse and low-rank matrices
Suppose we are given a matrix
M = L + S ∈ R^{n×n}, where L is low-rank and S is sparse
Question: can we hope to recover both L and S from M?
Applications
• Robust PCA
• Video surveillance: separation of background and foreground
Nonconvex approach
• rank(L) ≤ r; writing the SVD of L as L = UΣV^⊤, set X⋆ = UΣ^{1/2}, Y⋆ = VΣ^{1/2}

• non-zero entries of S are “spread out” (no more than an s-fraction of non-zeros per row/column), but otherwise arbitrary:
S_s = {S ∈ R^{n×n} : ‖S_{i,:}‖_0 ≤ s · n, ‖S_{:,j}‖_0 ≤ s · n}
minimize_{X,Y,S ∈ S_s} F(X, Y, S) := ‖M − XY^⊤ − S‖_F^2  (least-squares loss)

where X, Y ∈ R^{n×r}.
Gradient descent and hard thresholding
minimize_{X,Y,S ∈ S_s} F(X, Y, S)
• Spectral initialization: set S_0 = H_{γs}(M); let U_0 Σ_0 V_0^⊤ be the rank-r SVD of M_0 := P_Ω(M − S_0); set X_0 = U_0 Σ_0^{1/2} and Y_0 = V_0 Σ_0^{1/2}

• for t = 0, 1, 2, · · ·
  Hard thresholding: S_{t+1} = H_{γs}(M − X_t Y_t^⊤)
  Scaled gradient updates:
  X_{t+1} = X_t − η ∇_X F(X_t, Y_t, S_{t+1}) (Y_t^⊤ Y_t)^{-1}
  Y_{t+1} = Y_t − η ∇_Y F(X_t, Y_t, S_{t+1}) (X_t^⊤ X_t)^{-1}
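A sketch of the two building blocks. The hard-thresholding operator H_{γs} keeps an entry only if it is among the k = γsn largest (in magnitude) of both its row and its column; this keep-top-k-per-row-and-column rule is one common realization, and the fully observed case (P_Ω = identity) is assumed:

```python
import numpy as np

def hard_threshold(Z, k):
    # keep an entry only if it is among the k largest (in magnitude)
    # of both its row and its column; zero out everything else
    mag = np.abs(Z)
    row_cut = np.sort(mag, axis=1)[:, [-k]]      # k-th largest per row
    col_cut = np.sort(mag, axis=0)[[-k], :]      # k-th largest per column
    return Z * ((mag >= row_cut) & (mag >= col_cut))

def gd_ht_step(X, Y, M, k, eta=0.3):
    S = hard_threshold(M - X @ Y.T, k)    # S_{t+1} = H_{gamma*s}(M - X Y^T)
    G = X @ Y.T + S - M                   # gradient wrt XY^T (up to a factor of 2)
    X_new = X - eta * np.linalg.solve(Y.T @ Y, (G @ Y).T).T
    Y_new = Y - eta * np.linalg.solve(X.T @ X, (G.T @ X).T).T
    return X_new, Y_new, S
```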
Efficient nonconvex recovery
Theorem 4 (Nonconvex RPCA, Tong, Ma, Chi ’20). Set γ = 2 and 0.1 ≤ η ≤ 2/3. Suppose that s ≲ 1/(μ r^{3/2} κ). Then GD+HT satisfies

‖X_t Y_t^⊤ − L‖_F ≲ (1 − 0.6η)^t σ_min(L)

• O(log(1/ε)) iterations to reach ε accuracy;
• for adversarial outliers, the optimal fraction is s = O(1/(μr)); Theorem 4 is suboptimal by a factor of κ√r;
• improves over GD (Yi et al. ’16), which requires s ≲ 1/max{μ r^{3/2} κ^{3/2}, μ r κ^2} ≤ 1/(μ r^{3/2} κ) and O(κ log(1/ε)) iterations.
Concluding remarks
Statistical thinking + Optimization efficiency
[Diagram: benign landscape ↔ statistical models ↔ tractable algorithms]
When data are generated by certain statistical models, problems are often much nicer than worst-case instances.
A growing list of “benign” nonconvex problems
• phase retrieval
• matrix sensing
• matrix completion
• blind deconvolution / self-calibration
• dictionary learning
• tensor decomposition / completion
• robust PCA
• mixed linear regression
• learning one-layer neural networks
• ...
Open problems
• characterize generic landscape properties that enable fastconvergence of gradient methods from random initialization
• relax the stringent assumptions on the statistical modelsunderlying the data
• develop robust and scalable nonconvex methods that can handledistributed data with strong statistical guarantees
• identify new classes of nonconvex problems that admit efficientoptimization procedures
• ...
Advertisement: overview and monographs
• “Nonconvex Optimization Meets Low-Rank Matrix Factorization: An Overview”, Y. Chi, Y. M. Lu and Y. Chen.
• “Spectral Methods for Data Science: A Statistical Perspective”, Y. Chen, Y. Chi, J. Fan and C. Ma.
An incomplete list of references
[1] Dr. Ju Sun’s webpage: http://sunju.org/research/nonconvex/.
[2] “Harnessing Structures in Big Data via Guaranteed Low-Rank Matrix Estimation,” Y. Chen and Y. Chi, IEEE Signal Processing Magazine, 2018.
[3] “Phase retrieval via Wirtinger flow: Theory and algorithms,” E. Candes, X. Li, M. Soltanolkotabi, IEEE Transactions on Information Theory, 2015.
[4] “Solving random quadratic systems of equations is nearly as easy as solving linear systems,” Y. Chen, E. Candes, Communications on Pure and Applied Mathematics, 2017.
[5] “Provable non-convex phase retrieval with outliers: Median truncated Wirtinger flow,” H. Zhang, Y. Chi, and Y. Liang, ICML, 2016.
[6] “Implicit Regularization in Nonconvex Statistical Estimation: Gradient Descent Converges Linearly for Phase Retrieval, Matrix Completion and Blind Deconvolution,” C. Ma, K. Wang, Y. Chi and Y. Chen, Foundations of Computational Mathematics, 2019.
An incomplete list of references
[7] “Gradient Descent with Random Initialization: Fast Global Convergence for Nonconvex Phase Retrieval,” Y. Chen, Y. Chi, J. Fan, C. Ma, Mathematical Programming, 2018.
[8] “Solving systems of random quadratic equations via truncated amplitude flow,” G. Wang, G. Giannakis, and Y. Eldar, IEEE Transactions on Information Theory, 2017.
[9] “Matrix completion from a few entries,” R. Keshavan, A. Montanari, and S. Oh, IEEE Transactions on Information Theory, 2010.
[10] “Guaranteed matrix completion via non-convex factorization,” R. Sun and Z.-Q. Luo, IEEE Transactions on Information Theory, 2016.
[11] “Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees,” Y. Chen and M. Wainwright, arXiv:1509.03025, 2015.
[12] “Fast Algorithms for Robust PCA via Gradient Descent,” X. Yi, D. Park, Y. Chen, and C. Caramanis, NIPS, 2016.
An incomplete list of references
[13] “Matrix completion has no spurious local minimum,” R. Ge, J. Lee, and T. Ma, NIPS, 2016.
[14] “No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis,” R. Ge, C. Jin, and Y. Zheng, ICML, 2017.
[15] “Symmetry, Saddle Points, and Global Optimization Landscape of Nonconvex Matrix Factorization,” X. Li et al., IEEE Transactions on Information Theory, 2019.
[16] “Kaczmarz Method for Solving Quadratic Equations,” Y. Chi and Y. M. Lu, IEEE Signal Processing Letters, vol. 23, no. 9, pp. 1183-1187, 2016.
[17] “A Geometric Analysis of Phase Retrieval,” J. Sun, Q. Qu, and J. Wright, to appear, Foundations of Computational Mathematics, 2016.
[18] “Gradient descent converges to minimizers,” J. Lee, M. Simchowitz, M. Jordan, B. Recht, Conference on Learning Theory, 2016.
[19] “Phase retrieval using alternating minimization,” P. Netrapalli, P. Jain, and S. Sanghavi, NIPS, 2013.
An incomplete list of references
[20] “How to escape saddle points efficiently,” C. Jin, R. Ge, P. Netrapalli, S. Kakade, M. Jordan, arXiv:1703.00887, 2017.
[21] “Complete dictionary recovery over the sphere,” J. Sun, Q. Qu, J. Wright, IEEE Transactions on Information Theory, 2017.
[22] “A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization,” S. Burer and R. Monteiro, Mathematical Programming, 2003.
[23] “Memory-efficient Kernel PCA via Partial Matrix Sampling and Nonconvex Optimization: a Model-free Analysis of Local Minima,” J. Chen, X. Li, Journal of Machine Learning Research, 2019.
[24] “Rapid, robust, and reliable blind deconvolution via nonconvex optimization,” X. Li, S. Ling, T. Strohmer, K. Wei, Applied and Computational Harmonic Analysis, 2019.
[25] “PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming,” E. Candes, T. Strohmer, V. Voroninski, Communications on Pure and Applied Mathematics, 2012.
An incomplete list of references
[26] “Exact matrix completion via convex optimization,” E. Candes, B. Recht, Foundations of Computational Mathematics, 2009.
[27] “Low-rank solutions of linear matrix equations via Procrustes flow,” S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, B. Recht, arXiv:1507.03566, 2015.
[28] “Global optimality of local search for low rank matrix recovery,” S. Bhojanapalli, B. Neyshabur, and N. Srebro, NIPS, 2016.
[29] “Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow,” T. Cai, X. Li, Z. Ma, The Annals of Statistics, 2016.
[30] “The landscape of empirical risk for non-convex losses,” S. Mei, Y. Bai, and A. Montanari, arXiv:1607.06534, 2016.
[31] “Non-convex robust PCA,” P. Netrapalli, U. Niranjan, S. Sanghavi, A. Anandkumar, and P. Jain, NIPS, 2014.
[32] “Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach,” D. Park, A. Kyrillidis, C. Caramanis, and S. Sanghavi, arXiv:1609.03240, 2016.
An incomplete list of references
[33] “From symmetry to geometry: Tractable nonconvex problems,” Y. Zhang, Q. Qu, J. Wright, arXiv:2007.06753, 2020.
[34] “A Nonconvex Approach for Phase Retrieval: Reshaped Wirtinger Flow and Incremental Algorithms,” H. Zhang, Y. Zhou, Y. Liang and Y. Chi, Journal of Machine Learning Research, 2017.
[35] “Nonconvex Matrix Factorization from Rank-One Measurements,” Y. Li, C. Ma, Y. Chen and Y. Chi, arXiv:1802.06286, 2018.
[36] “Manifold Gradient Descent Solves Multi-Channel Sparse Blind Deconvolution Provably and Efficiently,” L. Shi and Y. Chi, arXiv:1911.11167, 2019.
[37] “Median-Truncated Gradient Descent: A Robust and Scalable Nonconvex Approach for Signal Estimation,” Y. Chi, Y. Li, H. Zhang, and Y. Liang, Compressed Sensing and Its Applications, Springer, Birkhauser, 2019.
[38] “Nonconvex Low-Rank Matrix Recovery with Arbitrary Outliers via Median-Truncated Gradient Descent,” Y. Li, Y. Chi, H. Zhang, and Y. Liang, Information and Inference, 2020.
An incomplete list of references
[39] “Nonconvex robust low-rank matrix recovery,” Li et al., SIAM Journal on Optimization, 2020.
[40] “Median-Truncated Nonconvex Approach for Phase Retrieval with Outliers,” H. Zhang, Y. Chi and Y. Liang, IEEE Trans. on Information Theory, 2018.
[41] “Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence,” V. Charisopoulos et al., arXiv:1904.10020, 2019.
[42] “Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent,” T. Tong, C. Ma, and Y. Chi, arXiv:2005.08898, 2020.
[43] “Low-Rank Matrix Recovery with Scaled Subgradient Methods: Fast and Robust Convergence Without the Condition Number,” T. Tong, C. Ma, and Y. Chi, arXiv:2010.13364, 2020.