Subspace Embeddings for the L1 norm with Applications
Christian Sohler (TU Dortmund)    David Woodruff (IBM Almaden)
Subspace Embeddings for the L1 norm with Applications
to...
Robust Regression and Hyperplane Fitting
Outline
Massive data sets
Regression analysis
Our results
Our techniques
Concluding remarks
Massive data sets
Examples: Internet traffic logs, financial data, etc.
Algorithms: we want nearly linear time or less, usually at the cost of a randomized approximation
Regression analysis
Regression: statistical method to study dependencies between variables in the presence of noise.
Regression analysis
Linear Regression: statistical method to study linear dependencies between variables in the presence of noise.
Example: Ohm's law, V = R ∙ I
Find the linear function that best fits the data
Regression analysis
Standard Setting
One measured variable b
A set of predictor variables a1, …, ad
Assumption: b = x0 + a1∙x1 + … + ad∙xd + ε
ε is assumed to be noise and the xi are model parameters we want to learn
Can assume x0 = 0
Now consider n measured variables
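A toy instance of this setup (illustrative only; the physical values and noise level are made up), fitting Ohm's law from noisy measurements by ordinary least squares:

```python
import numpy as np

# Toy instance of the setting above: Ohm's law V = R * I with measurement noise.
rng = np.random.default_rng(7)
n = 200
current = rng.uniform(0.1, 2.0, n)     # predictor variable a1 = I
true_resistance = 5.0                  # model parameter x1 to learn
volts = true_resistance * current + 0.01 * rng.standard_normal(n)  # b = x1*a1 + noise

# With a single predictor and x0 = 0, least squares has a closed form.
estimate = (current @ volts) / (current @ current)
```

With 200 measurements and small noise, `estimate` recovers the resistance to within a fraction of a percent.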
Regression analysis
Matrix form
Input: n × d matrix A and a vector b = (b1, …, bn)
n is the number of observations; d is the number of predictor variables
Output: x* so that Ax* and b are close
Consider the over-constrained case, when n ≫ d
Can assume that A has full column rank
Regression analysis
Least Squares Method: find x* that minimizes Σi (bi − ⟨Ai*, x*⟩)²
Ai* is the i-th row of A
Certain desirable statistical properties
Method of least absolute deviation (l1-regression): find x* that minimizes Σi |bi − ⟨Ai*, x*⟩|
Cost is less sensitive to outliers than least squares
Regression analysis
Geometry of regression: we want to find an x that minimizes |Ax-b|1
The product Ax can be written as A*1 x1 + A*2 x2 + ... + A*d xd, where A*i is the i-th column of A
This is a linear d-dimensional subspace
The problem is equivalent to computing the point of the column space of A nearest to b in the l1-norm
Regression analysis
Solving l1 -regression via linear programming
Minimize (1, …, 1) ∙ (α⁺ + α⁻)
Subject to: Ax + α⁺ − α⁻ = b,  α⁺, α⁻ ≥ 0
Generic linear programming gives poly(nd) time
Best known algorithm is nd^5 log n + poly(d/ε) [Clarkson]
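The LP above can be sketched in code (a minimal illustration assuming `scipy.optimize.linprog`; the variable names and the toy data are mine, not the paper's):

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression(A, b):
    """Solve min_x |Ax - b|_1 via the LP:
    minimize 1^T (a_plus + a_minus)
    subject to Ax + a_plus - a_minus = b,  a_plus, a_minus >= 0."""
    n, d = A.shape
    c = np.concatenate([np.zeros(d), np.ones(2 * n)])    # objective: sum of slacks
    A_eq = np.hstack([A, np.eye(n), -np.eye(n)])         # Ax + a_plus - a_minus = b
    bounds = [(None, None)] * d + [(0, None)] * (2 * n)  # x free, slacks nonnegative
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=bounds, method="highs")
    return res.x[:d], res.fun                            # minimizer and optimal l1 cost

# b equals A @ [1, 2] except for one gross outlier; the l1 fit ignores it.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
b = A @ np.array([1.0, 2.0])
b[2] += 10.0
x, cost = l1_regression(A, b)   # recovers x = [1, 2]; cost is the outlier's 10
```

The example also illustrates the previous slide's point: the l1 fit leaves the single outlier's residual large rather than distorting the fit, which least squares would do.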
Our Results
A (1+ε)-approximation algorithm for l1-regression problem
Time complexity is nd^1.376 + poly(d/ε) (Clarkson's is nd^5 log n + poly(d/ε))
First 1-pass streaming algorithm with small space (poly(d log n /ε) bits)
Similar results for hyperplane fitting
Outline
Massive data sets
Regression analysis
Our results
Our techniques
Concluding remarks
Our Techniques
Notice that for any d × d change of basis matrix U,
minx in Rd |Ax-b|1 = minx in Rd |AUx-b|1
Our Techniques
Notice that for any y ∈ Rd,
minx in Rd |Ax-b|1 = minx in Rd |Ax-b+Ay|1
We call b-Ay the “residual”, denoted b’, and so
minx in Rd |Ax-b|1 = minx in Rd |Ax-b’|1
Rough idea behind algorithm of Clarkson
1. Compute a poly(d)-approximation: find y such that |Ay-b|1 ≤ poly(d) ∙ minx in Rd |Ax-b|1, and let b' = b-Ay be the residual. Takes nd^5 log n time.
2. Compute a well-conditioned basis: find a basis U so that for all x in Rd, |x|1/poly(d) ≤ |AUx|1 ≤ poly(d) |x|1. Note that minx in Rd |Ax-b|1 = minx in Rd |AUx - b'|1. Takes nd^5 log n time.
3. Sample rows from the well-conditioned basis and the residual of the poly(d)-approximation: sample poly(d/ε) rows of AU◦b' proportional to their l1-norm. Takes nd time.
4. Solve l1-regression on the sample, obtaining vector x, and output x. Now generic linear programming is efficient: takes poly(d/ε) time.
Our Techniques
Suffices to show how to quickly compute
1. A poly(d)-approximation
2. A well-conditioned basis
Our main theorem
Theorem: There is a probability space over (d log d) × n matrices R such that for any n × d matrix A, with probability at least 99/100 we have for all x:
|Ax|1 ≤ |RAx|1 ≤ d log d ∙ |Ax|1
The embedding is linear, is independent of A, and preserves lengths of an infinite number of vectors
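An empirical illustration of the theorem (a sketch, not a proof; here m = 60 stands in for the d log d rows, and the crude distortion bounds checked are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 500, 4, 60              # m stands in for the d log d rows of the theorem
A = rng.standard_normal((n, d))
R = rng.standard_cauchy((m, n))   # i.i.d. Cauchy entries, as in the construction

# By 1-stability, each entry of R(Ax) behaves like |Ax|_1 times a Cauchy,
# so |RAx|_1 / |Ax|_1 is a sum of m half-Cauchy variables: bounded below
# with high probability, and rarely astronomically large.
ratios = []
for _ in range(5):
    x = rng.standard_normal(d)
    ratios.append(np.abs(R @ (A @ x)).sum() / np.abs(A @ x).sum())
```

With overwhelming probability each ratio lies in a wide but nontrivial band, matching the (unscaled) |Ax|1 ≤ |RAx|1 ≤ O(d log d)·|Ax|1 shape of the guarantee.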
Application of our main theorem
Computing a poly(d)-approximation
Compute RA and Rb
Solve x’ = argminx |RAx-Rb|1
Main theorem applied to A◦b implies x' is a (d log d)-approximation
RA, Rb have d log d rows, so can solve l1-regression efficiently
Time is dominated by computing RA, a single matrix-matrix product
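A sketch of this approximation step (assuming `scipy` for the small LP; m = 40 stands in for d log d, and the toy instance is mine):

```python
import numpy as np
from scipy.optimize import linprog

def l1_solve(A, b):
    # min_x |Ax - b|_1 as an LP (see the linear-programming slide)
    n, d = A.shape
    c = np.concatenate([np.zeros(d), np.ones(2 * n)])
    A_eq = np.hstack([A, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=b,
                  bounds=[(None, None)] * d + [(0, None)] * (2 * n), method="highs")
    return res.x[:d]

rng = np.random.default_rng(1)
n, d, m = 400, 3, 40                     # m stands in for d log d
A = rng.standard_normal((n, d))
b = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

R = rng.standard_cauchy((m, n))
x_sketch = l1_solve(R @ A, R @ b)        # solve the small m x d sketched problem
x_full = l1_solve(A, b)                  # solve the full n x d problem

cost = lambda x: np.abs(A @ x - b).sum()
approx_factor = cost(x_sketch) / cost(x_full)   # a modest factor >= 1
```

Solving the sketched problem touches only an m × d matrix, so the expensive part is the single product RA, as the slide says.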
Application of our main theorem
Computing a well-conditioned basis
1. Compute RA
2. Compute U so that RAU is orthonormal (in the l2-sense)
3. Output AU
AU is well-conditioned because:
|AUx|1 ≤ |RAUx|1 ≤ (d log d)^{1/2} |RAUx|2 = (d log d)^{1/2} |x|2 ≤ (d log d)^{1/2} |x|1, and
|AUx|1 ≥ |RAUx|1/(d log d) ≥ |RAUx|2/(d log d) = |x|2/(d log d) ≥ |x|1/(d^{3/2} log d)
Life is really simple!
Time dominated by computing RA and AU, two matrix-matrix products
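Steps 1–3 can be sketched with a QR decomposition, which is one standard way to make RAU orthonormal (an illustration with the same stand-in m as before):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 300, 4, 50
A = rng.standard_normal((n, d))
R = rng.standard_cauchy((m, n))

RA = R @ A                  # step 1: the sketch
Q, T = np.linalg.qr(RA)     # RA = Q T, with Q having orthonormal columns
U = np.linalg.inv(T)        # so (RA) U = Q is orthonormal (step 2)
AU = A @ U                  # step 3: the well-conditioned basis
```

The cost is dominated by the two products R @ A and A @ U; the QR and the d × d inverse act only on small matrices.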
Application of our main theorem
It follows that we get an nd^1.376 + poly(d/ε) time algorithm for (1+ε)-approximate l1-regression
What’s left?
We should prove our main theorem
Theorem: There is a probability space over (d log d) × n matrices R such that for any n × d matrix A, with probability at least 99/100 we have for all x:
|Ax|1 ≤ |RAx|1 ≤ d log d ∙ |Ax|1
R is simple: the entries of R are i.i.d. Cauchy random variables
Cauchy random variables
pdf(z) = 1/(π(1+z²)) for z in (−∞, ∞)
Infinite expectation and variance
1-stable: if z1, z2, …, zn are i.i.d. Cauchy, then for a ∈ Rn,
a1∙z1 + a2∙z2 + … + an∙zn ~ |a|1∙z, where z is Cauchy
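A quick numerical check of 1-stability. A Cauchy has no mean, so the natural comparison is of medians: the median of |Cauchy| is 1, so the median of |a1·z1 + … + an·zn| should be about |a|1.

```python
import numpy as np

rng = np.random.default_rng(3)
a = np.array([2.0, -1.0, 0.5])     # arbitrary coefficients, |a|_1 = 3.5
samples = rng.standard_cauchy((200_000, a.size)) @ a   # a1*z1 + a2*z2 + a3*z3

# 1-stability says these samples are distributed as |a|_1 times a Cauchy,
# so the median of their absolute values should be about |a|_1.
median_ratio = np.median(np.abs(samples)) / np.abs(a).sum()
```

With 200,000 samples the estimated median lands within a few percent of |a|1.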
Proof of main theorem
By 1-stability, for all rows r of R,
⟨r, Ax⟩ ~ |Ax|1∙Z, where Z is a Cauchy
RAx ~ (|Ax|1∙Z1, …, |Ax|1∙Z_{d log d}), where Z1, …, Z_{d log d} are i.i.d. Cauchy
|RAx|1 = |Ax|1 ∙ Σi |Zi|
The |Zi| are half-Cauchy
Σi |Zi| = Ω(d log d) with probability 1−exp(−d) by Chernoff
An ε-net argument on {Ax | |Ax|1 = 1} shows |RAx|1 = Ω(d log d)∙|Ax|1 for all x
Scale R by 1/(d log d)
But Σi |Zi| is heavy-tailed
Proof of main theorem
Σi |Zi| is heavy-tailed, so |RAx|1 = |Ax|1 ∙ Σi |Zi| / (d log d) may be large
Each |Zi| has c.d.f. asymptotic to 1 − Θ(1/z) for z in [0, ∞)
No problem!
We know there exists a well-conditioned basis of A; we can assume the basis vectors are A*1, …, A*d
|RA*i|1 ~ |A*i|1 ∙ Σj |Zj| / (d log d)
With constant probability, Σi |RA*i|1 = O(log d) ∙ Σi |A*i|1
Proof of main theorem
Suppose Σi |RA*i|1 = O(log d) ∙ Σi |A*i|1 for a well-conditioned basis A*1, …, A*d
We will use the Auerbach basis, which always exists: for all x, |x|∞ ≤ |Ax|1, and Σi |A*i|1 = d
I don't know how to compute such a basis, but it doesn't matter!
Σi |RA*i|1 = O(d log d)
|RAx|1 ≤ Σi |RA*i xi| ≤ |x|∞ ∙ Σi |RA*i|1 = |x|∞ ∙ O(d log d) = O(d log d) ∙ |Ax|1
Q.E.D.
Main Theorem
Theorem
There is a probability space over (d log d) × n matrices R such that for any n × d matrix A, with probability at least 99/100 we have for all x:
|Ax|1 ≤ |RAx|1 ≤ d log d ∙ |Ax|1
Outline
Massive data sets
Regression analysis
Our results
Our techniques
Concluding remarks
Regression for data streams
Streaming algorithm given additive updates to entries of A and b:
Pick random matrix R according to the distribution of the main theorem
Maintain RA and Rb during the stream
Find x' that minimizes |RAx'-Rb|1 using linear programming
Compute U so that RAU is orthonormal
The hard part is sampling rows from AU◦b' proportional to their norm: we do not know U, b' until the end of the stream
Surprisingly, there is still a way to do this in a single pass, by treating U, x' as formal variables and plugging them in at the end
Uses a noisy sampling data structure (omitted from talk)
Entries of R do not need to be independent
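Maintaining RA under additive updates is just linearity of the sketch. A minimal sketch (the dense matrix A is kept here only to verify correctness; a real streaming algorithm would not store it):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, m = 100, 3, 20
R = rng.standard_cauchy((m, n))   # drawn once, before the stream

RA = np.zeros((m, d))             # the sketch, maintained during the stream
A = np.zeros((n, d))              # full matrix, kept only for verification

# Stream of additive updates (i, j, delta), meaning A[i, j] += delta.
for _ in range(1000):
    i, j = rng.integers(n), rng.integers(d)
    delta = rng.standard_normal()
    A[i, j] += delta
    RA[:, j] += delta * R[:, i]   # R(A + delta e_i e_j^T) = RA + delta (R e_i) e_j^T
```

At the end of the stream the maintained sketch equals R @ A computed from scratch, using only m × d words for the sketch itself.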
Hyperplane Fitting
Given n points in Rd, find the hyperplane minimizing the sum of l1-distances of the points to the hyperplane
Reduces to d invocations of l1-regression
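One way to realize the reduction (a sketch, assuming `scipy`): the l1 distance from a point p to the hyperplane w·x = c is |w·p − c| / |w|∞, so one can normalize the largest coefficient of w, which amounts to l1-regressing each coordinate on the remaining ones (plus an intercept) and keeping the best of the d fits.

```python
import numpy as np
from scipy.optimize import linprog

def l1_solve_cost(A, b):
    # Optimal cost of min_x |Ax - b|_1 via the LP formulation.
    n, d = A.shape
    c = np.concatenate([np.zeros(d), np.ones(2 * n)])
    A_eq = np.hstack([A, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=b,
                  bounds=[(None, None)] * d + [(0, None)] * (2 * n), method="highs")
    return res.fun

def l1_hyperplane_cost(P):
    """Best sum of l1-distances over hyperplanes: for each coordinate j,
    l1-regress column j on the remaining columns (plus an intercept),
    then take the minimum over the d regressions."""
    n, d = P.shape
    best = np.inf
    for j in range(d):
        others = np.hstack([P[:, [k for k in range(d) if k != j]], np.ones((n, 1))])
        best = min(best, l1_solve_cost(others, P[:, j]))
    return best

# Points lying exactly on the hyperplane x3 = 2*x1 + 3*x2 + 1 have cost ~ 0.
rng = np.random.default_rng(5)
X = rng.standard_normal((30, 2))
P = np.hstack([X, 2 * X[:, :1] + 3 * X[:, 1:2] + 1])
cost = l1_hyperplane_cost(P)
```

Each of the d regressions can in turn be solved with the fast (1+ε)-approximation algorithm above, which is how the paper's hyperplane-fitting result follows.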
Conclusion
Main results
Efficient algorithms for l1-regression and hyperplane fitting
nd^1.376 time improves the previous nd^5 log n running time for l1-regression
First oblivious subspace embedding for l1