Sampling algorithms for l2 regression and applications


Page 1: Sampling algorithms for l2 regression and applications

Sampling algorithms for l2 regression and applications

Michael W. Mahoney

Yahoo Research
http://www.cs.yale.edu/homes/mmahoney

(Joint work with P. Drineas and S. (Muthu) Muthukrishnan)

SODA 2006

Page 2: Sampling algorithms for l2 regression and applications

Regression problems

We seek sampling-based algorithms for solving l2 regression.

We are interested in overconstrained problems, n >> d. Typically, there is no x such that Ax = b.

Page 3: Sampling algorithms for l2 regression and applications

“Induced” regression problems

[Figure: an induced problem formed from sampled rows of A and the corresponding sampled “rows” (elements) of b, with scaling to account for undersampling.]
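In symbols (my notation; the slide's figure shows this construction): let S ∈ R^{n×r} have one nonzero entry per column marking each sampled index, and let D ∈ R^{r×r} be diagonal with entries 1/√(r p_j). The induced problem is then:

```latex
% Induced subproblem (S = sampling matrix, D = rescaling; assumed notation)
\tilde{x}_{\mathrm{opt}} = \arg\min_{x \in \mathbb{R}^{d}}
    \left\| D S^{T} b - D S^{T} A x \right\|_{2}
```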

Page 4: Sampling algorithms for l2 regression and applications

Regression problems, definition

The lp regression problem: given A ∈ R^{n×d} and b ∈ R^n, compute Zp = min_{x ∈ R^d} ||b − A x||_p. For p = 2 this is l2 regression, with optimal value Z2.

There is work by K. Clarkson in SODA 2005 on sampling-based algorithms for l1 regression (p = 1) for overconstrained problems.

Page 5: Sampling algorithms for l2 regression and applications

Exact solution

x_opt = A⁺ b, where A⁺ denotes the pseudoinverse of A.

A x_opt = A A⁺ b is the projection of b on the subspace spanned by the columns of A.

Page 6: Sampling algorithms for l2 regression and applications

Singular Value Decomposition (SVD)

A = U Σ Vᵀ, where:

U (V): orthogonal matrix containing the left (right) singular vectors of A.

Σ: diagonal matrix containing the singular values of A.

ρ: rank of A.

Computing the SVD takes O(d²n) time. The pseudoinverse of A is A⁺ = V Σ⁻¹ Uᵀ.
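A minimal numpy sketch of this exact solution (the random A and b are placeholders; assumes A has full column rank):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10                          # overconstrained: n >> d
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Thin SVD: A = U diag(sigma) V^T, with U of size n x d
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# Pseudoinverse solution: x_opt = A^+ b = V Sigma^{-1} U^T b
x_opt = Vt.T @ ((U.T @ b) / sigma)

# A @ x_opt is the projection of b on the column span of A
Z2 = np.linalg.norm(b - A @ x_opt)       # optimal value Z2
print(Z2)
assert np.allclose(x_opt, np.linalg.lstsq(A, b, rcond=None)[0])
```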

Page 7: Sampling algorithms for l2 regression and applications

Questions …

Can sampling methods provide accurate estimates for l2 regression?

Is it possible to approximate the optimal vector x_opt and the optimal value Z2 by looking only at a small sample from the input?

(Even if it takes some sophisticated oracles to actually perform the sampling …)

Equivalently, is there an induced subproblem of the full regression problem whose optimal solution and value Z2,s approximate the optimal solution and its value Z2?

Page 8: Sampling algorithms for l2 regression and applications

Creating an induced subproblem

Algorithm

• Fix a set of probabilities pi, i=1…n, summing up to 1.

• Pick r indices from {1…n} in r i.i.d. trials, with respect to the pi’s.

• For each sampled index j, keep the j-th row of A and the j-th element of b; rescale both by (1/(r pj))^(1/2).
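A short numpy sketch of these three steps (the function and variable names are my own):

```python
import numpy as np

def sample_and_rescale(A, b, p, r, rng=None):
    """Draw r indices i.i.d. from the probabilities p; keep those rows of A
    and elements of b, rescaling both by (1 / (r * p_j))**0.5."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(A.shape[0], size=r, replace=True, p=p)
    scale = 1.0 / np.sqrt(r * p[idx])
    return A[idx] * scale[:, None], b[idx] * scale

def solve_induced(A, b, p, r, rng=None):
    """Solve the induced l2 regression problem on the sample."""
    As, bs = sample_and_rescale(A, b, p, r, rng)
    x_tilde, *_ = np.linalg.lstsq(As, bs, rcond=None)
    return x_tilde
```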

Page 9: Sampling algorithms for l2 regression and applications

The induced subproblem

[Figure: the induced subproblem min_x ||b_s − A_s x||_2, with b_s the sampled elements of b and A_s the sampled rows of A, both rescaled.]

Page 10: Sampling algorithms for l2 regression and applications

Our results

If the pi satisfy certain conditions, then with probability at least 1 − δ, the solution x̃_opt of the induced subproblem satisfies ||b − A x̃_opt||_2 ≤ (1 + ε) Z2.

The sampling complexity is r = O(d²) rows (up to factors depending on ε and δ).

Page 11: Sampling algorithms for l2 regression and applications

Our results, cont’d

If the pi satisfy certain conditions, then with probability at least 1 − δ, ||x_opt − x̃_opt||_2 is also small: the bound scales with √ε and with κ(A), and it degrades as less of b lies in the span of the columns of A.

The sampling complexity is again r = O(d²) rows.

κ(A): condition number of A
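As an illustration, an end-to-end check using leverage-score probabilities (the squared row norms of U; a simplification on my part, since the paper's conditions also involve b⊥). It reuses solve_induced from the sketch above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20000, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

U, _, _ = np.linalg.svd(A, full_matrices=False)
p = np.sum(U**2, axis=1) / d            # squared row norms of U; sums to 1

x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)
x_tilde = solve_induced(A, b, p, r=20 * d * d, rng=rng)

Z2 = np.linalg.norm(b - A @ x_opt)
Z2_s = np.linalg.norm(b - A @ x_tilde)
print(Z2_s / Z2)                        # close to 1, i.e., a (1 + eps) factor
```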

Page 12: Sampling algorithms for l2 regression and applications

Back to induced subproblems …

[Figure: the induced subproblem, built from sampled elements of b and sampled rows of A, both rescaled.]

The relevant information for l2 regression if n >> d is contained in an induced subproblem of size O(d²)-by-d.

(upcoming writeup: we can reduce the sampling complexity to r = O(d).)

Page 13: Sampling algorithms for l2 regression and applications

Conditions on the probabilities, SVD

Let U_(i) denote the i-th row of U.

Let U⊥ ∈ R^{n×(n−ρ)} denote the orthogonal complement of U.

U (V): orthogonal matrix containing the left (right) singular vectors of A.

Σ: diagonal matrix containing the singular values of A.

ρ: rank of A.

Page 14: Sampling algorithms for l2 regression and applications

Conditions on the probabilities, interpretation

What do the lengths of the rows of the n x d matrix U = U_A “mean”?

Consider possible n x d matrices U of d left singular vectors:

I_{n|k} = k columns from the identity: row lengths are 0 or 1; I_{n|k} x -> x.

H_{n|k} = k columns from the n x n Hadamard (real Fourier) matrix: row lengths are all equal; H_{n|k} x -> maximally dispersed.

U_k = k columns from any orthogonal matrix: row lengths are between 0 and 1.

The lengths of the rows of U = U_A correspond to a notion of information dispersal.
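A quick numerical illustration of this dispersal notion (scipy's hadamard gives the ±1 matrix; dividing by √n makes its columns orthonormal):

```python
import numpy as np
from scipy.linalg import hadamard

n, k = 16, 4
I_cols = np.eye(n)[:, :k]                 # k columns of the identity
H_cols = hadamard(n)[:, :k] / np.sqrt(n)  # k orthonormal Hadamard columns

# Squared row lengths: how dispersed is the information across rows?
print(np.sum(I_cols**2, axis=1))  # 1.0 for k rows, 0.0 for the rest
print(np.sum(H_cols**2, axis=1))  # k/n for every row: maximally dispersed
```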

Page 15: Sampling algorithms for l2 regression and applications

Conditions for the probabilities

The conditions that the pi must satisfy, for some β1, β2, β3 ∈ (0,1], involve:

the lengths of the rows of the matrix U of left singular vectors of A

the component b⊥ of b not in the span of the columns of A

The sampling complexity depends on the βi:

small βi => more sampling

Page 16: Sampling algorithms for l2 regression and applications

Computing “good” probabilities

Open question: can we compute “good” probabilities faster, in a pass-efficient manner?

Some assumptions might be acceptable (e.g., bounded condition number of A, etc.)

In O(nd²) time we can easily compute pi's that satisfy all three conditions, with β1 = β2 = β3 = 1/3.

(Too expensive in practice for this problem!)
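One plausible O(nd²) instantiation is sketched below. The leverage term matches the first condition; mixing in a distribution built from b⊥ is my stand-in for the remaining conditions, whose exact form the slides do not show:

```python
import numpy as np

def sampling_probabilities(A, b):
    """O(n d^2): SVD of A, then mix the row-leverage distribution of U
    with a distribution on the residual b_perp = b - U U^T b.
    A sketch only, not the paper's exact three-condition recipe."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    lev = np.sum(U**2, axis=1) / U.shape[1]   # sums to 1 (full column rank)
    b_perp = b - U @ (U.T @ b)                # component of b off span(A)
    res = b_perp**2 / (b_perp @ b_perp)       # sums to 1; assumes b_perp != 0
    p = 0.5 * lev + 0.5 * res
    return p / p.sum()                        # guard against rounding
```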

Page 17: Sampling algorithms for l2 regression and applications

Critical observation

[Figure: the same sample-and-rescale operation is applied to b and to the rows of A.]

Page 18: Sampling algorithms for l2 regression and applications

Critical observation, cont’d

[Figure: writing A = U Σ Vᵀ, it suffices to sample & rescale only U.]

Page 19: Sampling algorithms for l2 regression and applications

Critical observation, cont’d

Important observation: U_s (the matrix of sampled, rescaled rows of U) is almost orthogonal, and we can bound the spectral and the Frobenius norm of U_sᵀ U_s − I.

(FKV98, DK01, DKM04, RV04)
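A quick empirical check of this almost-orthogonality, sampling rows of U with leverage probabilities and the same rescaling as before (my own setup):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20000, 10
U, _, _ = np.linalg.svd(rng.standard_normal((n, d)), full_matrices=False)

p = np.sum(U**2, axis=1) / d                 # leverage probabilities
r = 4 * d * d
idx = rng.choice(n, size=r, replace=True, p=p)
Us = U[idx] / np.sqrt(r * p[idx])[:, None]   # sampled, rescaled rows of U

E = Us.T @ Us - np.eye(d)                    # in expectation, Us^T Us = I
print(np.linalg.norm(E, 2), np.linalg.norm(E, "fro"))  # both small
```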

Page 20: Sampling algorithms for l2 regression and applications

Application: CUR-type decompositions

Create an approximation to A, using rows and columns of A: O(1) columns, O(1) rows, and a carefully chosen U.

1. How do we draw the rows and columns of A to include in C and R?

2. How do we construct U?

Goal: provide (good) bounds for some norm of the error matrix A − CUR.
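A hedged sketch of one CUR-type construction: sample columns and rows with probabilities proportional to their squared Euclidean norms (a standard choice in this line of work), then set U = C⁺ A R⁺, which minimizes ||A − CUR||_F for the chosen C and R. The paper's actual construction may differ:

```python
import numpy as np

def cur_decomposition(A, c, r, rng=None):
    """Sample c columns and r rows of A with squared-norm probabilities,
    rescaled as in the regression sampling; U = C^+ A R^+ then minimizes
    ||A - C U R||_F for the chosen C and R."""
    rng = rng or np.random.default_rng()
    total = np.sum(A**2)
    pc = np.sum(A**2, axis=0) / total             # column probabilities
    pr = np.sum(A**2, axis=1) / total             # row probabilities
    jc = rng.choice(A.shape[1], size=c, replace=True, p=pc)
    jr = rng.choice(A.shape[0], size=r, replace=True, p=pr)
    C = A[:, jc] / np.sqrt(c * pc[jc])            # n x c, rescaled columns
    R = A[jr, :] / np.sqrt(r * pr[jr])[:, None]   # r x m, rescaled rows
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R
```

Computing U from all of A, as above, is for illustration only; question 2 on this slide is precisely how to construct U without touching all of A, and the slides do not show that construction.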