TRANSCRIPT
Nonconvex Optimization for High-Dimensional Signal Estimation: Spectral and Iterative Methods – Part IV
Yuejie Chi (Carnegie Mellon)
Yuxin Chen (Princeton)
Cong Ma (UC Berkeley)
EUSIPCO Tutorial, December 2020
Outline
• Part I: Introduction and Warm-Up. Why nonconvex? Basic concepts and a warm-up example (PCA).
• Part II: Gradient Descent and Implicit Regularization. Phase retrieval, matrix completion, random initialization.
• Part III: Spectral Methods. A general recipe, ℓ2 and ℓ∞ guarantees, community detection.
• Part IV: Robustness to Corruptions and Ill-Conditioning. Median truncation, least absolute deviation, scaled gradient descent.
Robustness to ill-conditioning?
A factorization approach to low-rank matrix sensing
[Figure: the linear map A(·) takes the matrix M to the measurement vector y.]
M ∈ R^{n1×n2} with rank(M) = r;  A(·): linear map;  y ∈ R^m

y = A(M) + noise

find X ∈ R^{n1×r}, Y ∈ R^{n2×r}, such that y ≈ A(XY^⊤)
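A minimal NumPy sketch of this measurement model (the dimensions, the seed, and the Gaussian design below are illustrative assumptions, not the tutorial's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r, m = 30, 20, 3, 2500           # assumed problem sizes

# ground-truth rank-r matrix M = X* Y*^T
X_star = rng.standard_normal((n1, r))
Y_star = rng.standard_normal((n2, r))
M = X_star @ Y_star.T

# i.i.d. Gaussian design: the i-th measurement is <A_i, M>
A_mats = rng.standard_normal((m, n1, n2)) / np.sqrt(m)
y = np.tensordot(A_mats, M, axes=([1, 2], [0, 1]))   # y = A(M), noiseless
```

Later sketches reuse these names (A_mats, y, r, and the factors).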
Prior art: GD with balancing regularization
min_{X,Y} f_reg(X, Y) = (1/2) ‖y − A(XY^⊤)‖_2^2 + (1/8) ‖X^⊤X − Y^⊤Y‖_F^2
• Spectral initialization: find an initial point in the “basin of attraction”:
(X_0, Y_0) ← SVD_r(A^∗(y))
• Gradient iterations: for t = 0, 1, . . .
X_{t+1} = X_t − η ∇_X f_reg(X_t, Y_t)
Y_{t+1} = Y_t − η ∇_Y f_reg(X_t, Y_t)
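As a sketch, the regularized objective and its GD loop might look as follows, continuing the data-generation snippet above (the stepsize and iteration count are arbitrary choices; the gradient expressions are derived from f_reg):

```python
import numpy as np  # reuses A_mats, y, r from the data-generation sketch

def A_op(Z):                     # A(Z) : R^{n1 x n2} -> R^m
    return np.tensordot(A_mats, Z, axes=([1, 2], [0, 1]))

def A_adj(v):                    # adjoint A*(v) = sum_i v_i A_i
    return np.tensordot(v, A_mats, axes=(0, 0))

# spectral initialization: (X0, Y0) <- SVD_r(A*(y))
U, s, Vt = np.linalg.svd(A_adj(y), full_matrices=False)
X = U[:, :r] * np.sqrt(s[:r])
Y = Vt[:r].T * np.sqrt(s[:r])

eta = 0.3                        # assumed stepsize
for t in range(500):
    G = A_adj(A_op(X @ Y.T) - y)           # gradient wrt the product XY^T
    D = X.T @ X - Y.T @ Y                  # balancing-regularization term
    X, Y = X - eta * (G @ Y + 0.5 * X @ D), \
           Y - eta * (G.T @ X - 0.5 * Y @ D)
```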
Prior theory for vanilla GD
Theorem 1 (Tu et al., ICML 2016). Suppose M = X⋆Y⋆^⊤ is rank-r and has a condition number κ = σ_max(M)/σ_min(M). For low-rank matrix sensing with i.i.d. Gaussian design, vanilla GD (with spectral initialization) achieves

‖X_t Y_t^⊤ − M‖_F ≤ ε · σ_min(M)

• Computational: within O(κ log(1/ε)) iterations;
• Statistical: as long as the sample complexity satisfies m ≳ (n1 + n2) r^2 κ^2.
Similar results hold for many low-rank problems.
(Netrapalli et al. ’13; Candes, Li, Soltanolkotabi ’14; Sun and Luo ’15; Chen and Wainwright ’15; Zheng and Lafferty ’15; Ma et al. ’17; . . .)
Convergence slows down for ill-conditioned matrices
min_{X,Y} f(X, Y) = (1/2) ‖P_Ω(XY^⊤ − M)‖_F^2
[Figure: relative error vs. iteration count.]
Vanilla GD converges in O(κ log(1/ε)) iterations.

— Can we provably accelerate the convergence to O(log(1/ε))?
Condition number can be large
chlorine concentration levels: 120 junctions, 180 time slots
[Figure: singular values vs. index; the spectrum follows a power law.]
Data source: www.epa.gov/water-research/epanet
[The same figure, annotated: a rank-5 approximation captures 88%, and a rank-10 approximation captures 96%.]
A new algorithm: scaled gradient descent
f(X, Y) = (1/2) ‖y − A(XY^⊤)‖_2^2
• Spectral initialization: find an initial point in the “basin of attraction”.
• Scaled gradient iterations: for t = 0, 1, . . .
X_{t+1} = X_t − η ∇_X f(X_t, Y_t) (Y_t^⊤ Y_t)^{-1}
Y_{t+1} = Y_t − η ∇_Y f(X_t, Y_t) (X_t^⊤ X_t)^{-1}
where (Y_t^⊤ Y_t)^{-1} and (X_t^⊤ X_t)^{-1} act as preconditioners.
ScaledGD is a preconditioned gradient method without balancing regularization!
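In code, the only change from vanilla GD is the right-multiplication by the inverse Gram matrices; a sketch continuing the matrix-sensing example (the stepsize is an assumed constant):

```python
import numpy as np  # reuses A_op, A_adj, y and the spectral init (X, Y)

eta = 0.5                        # fixed stepsize, an assumed choice
for t in range(300):
    G = A_adj(A_op(X @ Y.T) - y)                              # gradient wrt XY^T
    X_new = X - eta * np.linalg.solve(Y.T @ Y, (G @ Y).T).T   # (Y^T Y)^{-1}
    Y_new = Y - eta * np.linalg.solve(X.T @ X, (G.T @ X).T).T # (X^T X)^{-1}
    X, Y = X_new, Y_new
```

The preconditioners are only r × r, so the per-iteration overhead over vanilla GD is negligible.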
ScaledGD for low-rank matrix completion
[Figure: relative error vs. iteration count on low-rank matrix completion.]
Huge computational saving: ScaledGD converges in a κ-independent manner with minimal overhead!
A closer look at ScaledGD

Invariance to invertible transforms (Tanner and Wei ’16; Mishra ’16):
[Diagram: (X_t, Y_t) and the transformed pair (X_t Q, Y_t Q^{-⊤}) produce the same M_t = X_t Y_t^⊤, and their updates (X_{t+1}, Y_{t+1}) and (X_{t+1} Q, Y_{t+1} Q^{-⊤}) produce the same M_{t+1} = X_{t+1} Y_{t+1}^⊤.]
New distance metric as Lyapunov function:

dist^2([X; Y], [X⋆; Y⋆]) = inf_{Q ∈ GL(r)} ‖(XQ − X⋆) Σ⋆^{1/2}‖_F^2 + ‖(YQ^{-⊤} − Y⋆) Σ⋆^{1/2}‖_F^2.
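The invariance is easy to verify numerically: one ScaledGD step started from (XQ, YQ^{-⊤}) yields the same product matrix as a step started from (X, Y). A self-contained sketch (the full-observation loss 0.5‖M − M⋆‖_F^2 is an illustrative stand-in for a sensing operator):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 8, 2
M_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
grad_M = lambda M: M - M_star        # gradient of 0.5 * ||M - M_star||_F^2

def scaledgd_step(X, Y, eta=0.5):
    G = grad_M(X @ Y.T)
    Xn = X - eta * np.linalg.solve(Y.T @ Y, (G @ Y).T).T
    Yn = Y - eta * np.linalg.solve(X.T @ X, (G.T @ X).T).T
    return Xn, Yn

X, Y = rng.standard_normal((n, r)), rng.standard_normal((n, r))
Q = rng.standard_normal((r, r))      # a generic (invertible) r x r transform

X1, Y1 = scaledgd_step(X, Y)
X2, Y2 = scaledgd_step(X @ Q, Y @ np.linalg.inv(Q).T)
print(np.allclose(X1 @ Y1.T, X2 @ Y2.T))   # True: M_{t+1} is invariant
```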
Theoretical guarantees of ScaledGD
Theorem 2 (Tong, Ma and Chi, 2020). For low-rank matrix sensing with i.i.d. Gaussian design, ScaledGD with spectral initialization achieves

‖X_t Y_t^⊤ − M‖_F ≲ ε · σ_min(M)

• Computational: within O(log(1/ε)) iterations;
• Statistical: the sample complexity satisfies m ≳ n r^2 κ^2.

Compared with Tu et al.: ScaledGD provably accelerates vanilla GD at the same sample complexity!
Stability: a numerical result
For the chlorine concentration levels dataset, ScaledGD converges faster than vanilla GD in a small number of iterations.
[Figure: relative error vs. iteration count on the chlorine dataset.]
Robustness to outliers and corruptions?
Outlier-corrupted low-rank matrix sensing
[Figure: the linear map A(·) takes the matrix M to the measurement vector y, which may be corrupted by sensor failures or malicious attacks.]

M ∈ R^{n1×n2} with rank(M) = r;  A(·): linear map;  y ∈ R^m

y = A(M) + s, where s models the outliers and A(M) = {⟨A_i, M⟩}_{i=1}^m
Arbitrary but sparse outliers: ‖s‖_0 ≤ α · m, where 0 ≤ α < 1 is the fraction of outliers.
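Continuing the sensing sketch, corrupting an α-fraction of the clean measurements takes one line of bookkeeping (the outlier magnitudes below are arbitrary):

```python
import numpy as np  # reuses rng, m, y from the data-generation sketch

alpha = 0.05                                   # assumed outlier fraction
idx = rng.choice(m, size=int(alpha * m), replace=False)
s = np.zeros(m)
s[idx] = 10.0 * rng.standard_normal(idx.size)  # arbitrary sparse corruption
y_corrupt = y + s                              # ||s||_0 <= alpha * m
```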
Existing approaches fail
• Spectral initialization would fail: X_0 ← top-r SVD of Y = (1/m) Σ_{i=1}^m y_i A_i

• Gradient iterations would fail: for t = 0, 1, . . .
X_{t+1} = X_t − (η/m) Σ_{i=1}^m ∇ℓ_i(y_i; X_t)
Even a single outlier can cause the algorithm to fail!
Median-truncated gradient descent
Key idea: “median-truncation” — discard samples adaptively based on how far the sample gradients / values deviate from the median.
• Robustify spectral initialization: X_0 ← top-r SVD of
Y = (1/m) Σ_{i: |y_i| ≲ median({|y_i|})} y_i A_i
• Robustify gradient descent: for t = 0, 1, . . .
X_{t+1} = X_t − (η/m) Σ_{i: |r_t^i| ≲ median({|r_t^i|})} ∇ℓ_i(y_i; X_t)
where r_t^i := |y_i − ⟨A_i, X_t X_t^⊤⟩| is the size of the residual.
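A sketch of one median-truncated step for the symmetric case M = X X^⊤ (the loss ℓ_i = (y_i − ⟨A_i, X X^⊤⟩)^2 / 2 and the truncation constant c are assumptions for illustration, not the paper's tuned values):

```python
import numpy as np

def median_truncated_gd_step(X, A_mats, y, eta=0.2, c=3.0):
    m = len(y)
    resid = y - np.tensordot(A_mats, X @ X.T, axes=([1, 2], [0, 1]))
    keep = np.abs(resid) <= c * np.median(np.abs(resid))   # truncation rule
    # gradient of l_i summed over kept samples: -r_i (A_i + A_i^T) X
    A_sym = A_mats[keep] + np.transpose(A_mats[keep], (0, 2, 1))
    G = -np.tensordot(resid[keep], A_sym, axes=(0, 0)) @ X
    return X - (eta / m) * G
```

Samples whose residuals sit far above the median are simply dropped from the gradient, so a few grossly corrupted measurements cannot steer the update.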
Theoretical guarantees
Theorem 3 (Li, Chi, Zhang, and Liang, Information and Inference 2020). For low-rank matrix sensing with i.i.d. Gaussian design, median-truncated GD (with robust spectral initialization) achieves

‖X_t X_t^⊤ − M‖_F ≤ ε · σ_min(M),

• Computational: within O(κ log(1/ε)) iterations;
• Statistical: provided the sample complexity satisfies m ≳ n r^2 poly(κ, log n);
• Robustness: and the fraction of outliers satisfies α ≲ 1/√r.

Median-truncated GD adds robustness to GD obliviously.
Numerical example
Low-rank matrix sensing: y_i = ⟨A_i, M⟩ + s_i, i = 1, . . . , m

[Figure panels: ground truth; GD, no outliers; GD, 1% outliers; median-TGD, 1% outliers.]

Median-truncated GD achieves performance similar to running GD on the clean data.

Li, Chi, Zhang and Liang, “Non-convex low-rank matrix recovery with arbitrary outliers via median-truncated gradient descent”, Information and Inference: A Journal of the IMA, 2020.
Dealing with outliers: subgradient methods
Least absolute deviation (LAD): (Charisopoulos et al. ’19; Li et al. ’18)

min_{X,Y} f(X, Y) = ‖y − A(XY^⊤)‖_1

Subgradient iterations: for t = 0, 1, . . .
X_{t+1} = X_t − η_t ∂_X f(X_t, Y_t)
Y_{t+1} = Y_t − η_t ∂_Y f(X_t, Y_t)
where η_t is Polyak’s stepsize or a geometrically decaying stepsize.
Dealing with outliers: scaled subgradient methods
Least absolute deviation (LAD):

min_{X,Y} f(X, Y) = ‖y − A(XY^⊤)‖_1

Scaled subgradient iterations: for t = 0, 1, . . .
X_{t+1} = X_t − η_t ∂_X f(X_t, Y_t) (Y_t^⊤ Y_t)^{-1}
Y_{t+1} = Y_t − η_t ∂_Y f(X_t, Y_t) (X_t^⊤ X_t)^{-1}
where the inverse Gram matrices act as preconditioners and η_t is Polyak’s stepsize or a geometrically decaying stepsize.
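A sketch of the scaled subgradient loop on the corrupted measurements (the geometric schedule η_t = η_0 q^t and its constants are assumptions; Polyak's rule would replace the schedule when the optimal value is known):

```python
import numpy as np  # reuses A_op, A_adj, y_corrupt and an initialization (X, Y)

eta0, q = 0.1, 0.98                                 # assumed geometric schedule
for t in range(400):
    g = A_adj(np.sign(A_op(X @ Y.T) - y_corrupt))   # subgradient wrt XY^T
    eta = eta0 * q ** t
    X_new = X - eta * np.linalg.solve(Y.T @ Y, (g @ Y).T).T   # (Y^T Y)^{-1}
    Y_new = Y - eta * np.linalg.solve(X.T @ X, (g.T @ X).T).T # (X^T X)^{-1}
    X, Y = X_new, Y_new
```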
Performance guarantees
Iteration complexity:

Method                                         matrix sensing             quadratic sensing
Subgradient method (Charisopoulos et al. ’19)  κ/(1−2α)^2 · log(1/ε)      rκ/(1−2α)^2 · log(1/ε)
ScaledSM (Tong, Ma, Chi ’20)                   1/(1−2α)^2 · log(1/ε)      r/(1−2α)^2 · log(1/ε)
[Figure: relative error vs. iteration count.]
Robustness to both ill-conditioning and adversarial corruptions!
Demixing sparse and low-rank matrices
Suppose we are given a matrix
M = L + S ∈ R^{n×n}, where L is low-rank and S is sparse
Question: can we hope to recover both L and S from M?
Applications
• Robust PCA
• Video surveillance: separation of background and foreground
Nonconvex approach
• rank(L) ≤ r; writing the SVD of L as L = UΣV^⊤, set X⋆ = UΣ^{1/2}, Y⋆ = VΣ^{1/2}

• non-zero entries of S are “spread out” (no more than an s-fraction of non-zeros per row/column), but otherwise arbitrary:
S_s = {S ∈ R^{n×n} : ‖S_{i,:}‖_0 ≤ s · n, ‖S_{:,j}‖_0 ≤ s · n}
minimize_{X,Y,S ∈ S_s} F(X, Y, S) := ‖M − XY^⊤ − S‖_F^2  (least-squares loss)

where X, Y ∈ R^{n×r}.
Gradient descent and hard thresholding
minimize_{X,Y,S ∈ S_s} F(X, Y, S)
• Spectral initialization: set S_0 = H_{γs}(M); let U_0 Σ_0 V_0^⊤ be the rank-r SVD of M_0 := P_Ω(M − S_0); set X_0 = U_0 Σ_0^{1/2} and Y_0 = V_0 Σ_0^{1/2}

• for t = 0, 1, 2, · · ·
  Hard thresholding: S_{t+1} = H_{γs}(M − X_t Y_t^⊤)
  Scaled gradient updates:
  X_{t+1} = X_t − η ∇_X F(X_t, Y_t, S_{t+1}) (Y_t^⊤ Y_t)^{-1}
  Y_{t+1} = Y_t − η ∇_Y F(X_t, Y_t, S_{t+1}) (X_t^⊤ X_t)^{-1}
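A sketch of the two building blocks. The hard-thresholding operator H_{γs} keeps an entry only if it is among the k = γsn largest (in magnitude) of both its row and its column; this keep-top-k-per-row-and-column rule is one common realization, and the fully observed case (P_Ω = identity) is assumed:

```python
import numpy as np

def hard_threshold(Z, k):
    # keep an entry only if it is among the k largest (in magnitude)
    # of both its row and its column; zero out everything else
    mag = np.abs(Z)
    row_cut = np.sort(mag, axis=1)[:, [-k]]      # k-th largest per row
    col_cut = np.sort(mag, axis=0)[[-k], :]      # k-th largest per column
    return Z * ((mag >= row_cut) & (mag >= col_cut))

def gd_ht_step(X, Y, M, k, eta=0.3):
    S = hard_threshold(M - X @ Y.T, k)    # S_{t+1} = H_{gamma*s}(M - X Y^T)
    G = X @ Y.T + S - M                   # gradient wrt XY^T (up to a factor of 2)
    X_new = X - eta * np.linalg.solve(Y.T @ Y, (G @ Y).T).T
    Y_new = Y - eta * np.linalg.solve(X.T @ X, (G.T @ X).T).T
    return X_new, Y_new, S
```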
Efficient nonconvex recovery
Theorem 4 (Nonconvex RPCA, Tong, Ma, Chi ’20). Set γ = 2 and 0.1 ≤ η ≤ 2/3. Suppose that s ≲ 1/(μ r^{3/2} κ). Then GD+HT satisfies

‖X_t Y_t^⊤ − L‖_F ≲ (1 − 0.6η)^t σ_min(L)

• O(log(1/ε)) iterations to reach ε accuracy;
• for adversarial outliers, the optimal fraction is s = O(1/(μr)); Theorem 4 is suboptimal by a factor of κ√r;
• improves over GD (Yi et al. ’16), which requires s ≲ 1/max{μ r^{3/2} κ^{3/2}, μ r κ^2} ≤ 1/(μ r^{3/2} κ) and O(κ log(1/ε)) iterations.
Concluding remarks
Statistical thinking + Optimization efficiency
[Diagram: benign landscape ↔ statistical models ↔ tractable algorithms]
When data are generated by certain statistical models, problems are often much nicer than worst-case instances.
A growing list of “benign” nonconvex problems
• phase retrieval
• matrix sensing
• matrix completion
• blind deconvolution / self-calibration
• dictionary learning
• tensor decomposition / completion
• robust PCA
• mixed linear regression
• learning one-layer neural networks
• ...
Open problems
• characterize generic landscape properties that enable fastconvergence of gradient methods from random initialization
• relax the stringent assumptions on the statistical modelsunderlying the data
• develop robust and scalable nonconvex methods that can handledistributed data with strong statistical guarantees
• identify new classes of nonconvex problems that admit efficientoptimization procedures
• ...
Advertisement: overview and monographs
• “Nonconvex Optimization Meets Low-Rank Matrix Factorization: An Overview”, Y. Chi, Y. M. Lu and Y. Chen.
• “Spectral Methods for Data Science: A Statistical Perspective”, Y. Chen, Y. Chi, J. Fan and C. Ma.
An incomplete list of references
[1] Dr. Ju Sun’s webpage: http://sunju.org/research/nonconvex/.
[2] “Harnessing Structures in Big Data via Guaranteed Low-Rank Matrix Estimation,” Y. Chen and Y. Chi, IEEE Signal Processing Magazine, 2018.
[3] “Phase retrieval via Wirtinger flow: Theory and algorithms,” E. Candes, X. Li, M. Soltanolkotabi, IEEE Transactions on Information Theory, 2015.
[4] “Solving random quadratic systems of equations is nearly as easy as solving linear systems,” Y. Chen, E. Candes, Communications on Pure and Applied Mathematics, 2017.
[5] “Provable non-convex phase retrieval with outliers: Median truncated Wirtinger flow,” H. Zhang, Y. Chi, and Y. Liang, ICML, 2016.
[6] “Implicit Regularization in Nonconvex Statistical Estimation: Gradient Descent Converges Linearly for Phase Retrieval, Matrix Completion and Blind Deconvolution,” C. Ma, K. Wang, Y. Chi and Y. Chen, Foundations of Computational Mathematics, 2019.
An incomplete list of references
[7] “Gradient Descent with Random Initialization: Fast Global Convergence for Nonconvex Phase Retrieval,” Y. Chen, Y. Chi, J. Fan, C. Ma, Mathematical Programming, 2018.
[8] “Solving systems of random quadratic equations via truncated amplitude flow,” G. Wang, G. Giannakis, and Y. Eldar, IEEE Transactions on Information Theory, 2017.
[9] “Matrix completion from a few entries,” R. Keshavan, A. Montanari, and S. Oh, IEEE Transactions on Information Theory, 2010.
[10] “Guaranteed matrix completion via non-convex factorization,” R. Sun and Z.-Q. Luo, IEEE Transactions on Information Theory, 2016.
[11] “Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees,” Y. Chen and M. Wainwright, arXiv:1509.03025, 2015.
[12] “Fast Algorithms for Robust PCA via Gradient Descent,” X. Yi, D. Park, Y. Chen, and C. Caramanis, NIPS, 2016.
An incomplete list of references
[13] “Matrix completion has no spurious local minimum,” R. Ge, J. Lee, and T. Ma, NIPS, 2016.
[14] “No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis,” R. Ge, C. Jin, and Y. Zheng, ICML, 2017.
[15] “Symmetry, Saddle Points, and Global Optimization Landscape of Nonconvex Matrix Factorization,” X. Li et al., IEEE Transactions on Information Theory, 2019.
[16] “Kaczmarz Method for Solving Quadratic Equations,” Y. Chi and Y. M. Lu, IEEE Signal Processing Letters, vol. 23, no. 9, pp. 1183-1187, 2016.
[17] “A Geometric Analysis of Phase Retrieval,” J. Sun, Q. Qu, and J. Wright, to appear, Foundations of Computational Mathematics, 2016.
[18] “Gradient descent converges to minimizers,” J. Lee, M. Simchowitz, M. Jordan, B. Recht, Conference on Learning Theory, 2016.
[19] “Phase retrieval using alternating minimization,” P. Netrapalli, P. Jain, and S. Sanghavi, NIPS, 2013.
An incomplete list of references
[20] “How to escape saddle points efficiently,” C. Jin, R. Ge, P. Netrapalli, S. Kakade, M. Jordan, arXiv:1703.00887, 2017.
[21] “Complete dictionary recovery over the sphere,” J. Sun, Q. Qu, J. Wright, IEEE Transactions on Information Theory, 2017.
[22] “A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization,” S. Burer and R. Monteiro, Mathematical Programming, 2003.
[23] “Memory-efficient Kernel PCA via Partial Matrix Sampling and Nonconvex Optimization: a Model-free Analysis of Local Minima,” J. Chen, X. Li, Journal of Machine Learning Research, 2019.
[24] “Rapid, robust, and reliable blind deconvolution via nonconvex optimization,” X. Li, S. Ling, T. Strohmer, K. Wei, Applied and Computational Harmonic Analysis, 2019.
[25] “PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming,” E. Candes, T. Strohmer, V. Voroninski, Communications on Pure and Applied Mathematics, 2012.
An incomplete list of references
[26] “Exact matrix completion via convex optimization,” E. Candes, B. Recht, Foundations of Computational Mathematics, 2009.
[27] “Low-rank solutions of linear matrix equations via Procrustes flow,” S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, B. Recht, arXiv:1507.03566, 2015.
[28] “Global optimality of local search for low rank matrix recovery,” S. Bhojanapalli, B. Neyshabur, and N. Srebro, NIPS, 2016.
[29] “Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow,” T. Cai, X. Li, Z. Ma, The Annals of Statistics, 2016.
[30] “The landscape of empirical risk for non-convex losses,” S. Mei, Y. Bai, and A. Montanari, arXiv:1607.06534, 2016.
[31] “Non-convex robust PCA,” P. Netrapalli, U. Niranjan, S. Sanghavi, A. Anandkumar, and P. Jain, NIPS, 2014.
[32] “Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach,” D. Park, A. Kyrillidis, C. Caramanis, and S. Sanghavi, arXiv:1609.03240, 2016.
An incomplete list of references
[33] “From symmetry to geometry: Tractable nonconvex problems,” Y. Zhang, Q. Qu, J. Wright, arXiv:2007.06753, 2020.
[34] “A Nonconvex Approach for Phase Retrieval: Reshaped Wirtinger Flow and Incremental Algorithms,” H. Zhang, Y. Zhou, Y. Liang and Y. Chi, Journal of Machine Learning Research, 2017.
[35] “Nonconvex Matrix Factorization from Rank-One Measurements,” Y. Li, C. Ma, Y. Chen and Y. Chi, arXiv:1802.06286, 2018.
[36] “Manifold Gradient Descent Solves Multi-Channel Sparse Blind Deconvolution Provably and Efficiently,” L. Shi and Y. Chi, arXiv:1911.11167, 2019.
[37] “Median-Truncated Gradient Descent: A Robust and Scalable Nonconvex Approach for Signal Estimation,” Y. Chi, Y. Li, H. Zhang, and Y. Liang, Compressed Sensing and Its Applications, Springer, Birkhauser, 2019.
[38] “Nonconvex Low-Rank Matrix Recovery with Arbitrary Outliers via Median-Truncated Gradient Descent,” Y. Li, Y. Chi, H. Zhang, and Y. Liang, Information and Inference, 2020.
An incomplete list of references
[39] “Nonconvex robust low-rank matrix recovery,” Li et al., SIAM Journal on Optimization, 2020.
[40] “Median-Truncated Nonconvex Approach for Phase Retrieval with Outliers,” H. Zhang, Y. Chi and Y. Liang, IEEE Trans. on Information Theory, 2018.
[41] “Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence,” V. Charisopoulos et al., arXiv:1904.10020, 2019.
[42] “Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent,” T. Tong, C. Ma, and Y. Chi, arXiv:2005.08898, 2020.
[43] “Low-Rank Matrix Recovery with Scaled Subgradient Methods: Fast and Robust Convergence Without the Condition Number,” T. Tong, C. Ma, and Y. Chi, arXiv:2010.13364, 2020.