Title: Sketching as a Tool for Numerical Linear Algebra
1. Sketching as a Tool for Numerical Linear Algebra
- David Woodruff
- IBM Almaden
2. Talk Outline
- Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l1) Regression
- Low Rank Approximation
  - Sketching to speed up Low Rank Approximation
- Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions
3. Regression
- Linear Regression
  - Statistical method to study linear dependencies between variables in the presence of noise.
- Example: Ohm's law $V = R \cdot I$
  - Find the linear function that best fits the data
4. Regression
- Standard Setting
  - One measured variable b
  - A set of predictor variables $a_1, \ldots, a_d$
  - Assumption: $b = x_1 a_1 + \cdots + x_d a_d + x_0 + e$
  - e is assumed to be noise, and the $x_i$ are model parameters we want to learn
  - Can assume $x_0 = 0$
- Now consider n observations of b
5. Regression analysis
- Matrix form
  - Input: an $n \times d$ matrix A and a vector $b = (b_1, \ldots, b_n)$; n is the number of observations, d is the number of predictor variables
  - Output: x so that Ax and b are close
- Consider the over-constrained case, when $n \gg d$
- Assume that A has full column rank
6. Regression analysis
- Least Squares Method
  - Find x that minimizes $\|Ax-b\|_2^2 = \sum_i (b_i - \langle A_i, x \rangle)^2$
  - $A_i$ is the i-th row of A
  - Certain desirable statistical properties
  - Closed form solution: $x = (A^T A)^{-1} A^T b$
- Method of least absolute deviation (l1-regression)
  - Find x that minimizes $\|Ax-b\|_1 = \sum_i |b_i - \langle A_i, x \rangle|$
  - Cost is less sensitive to outliers than least squares
  - Can solve via linear programming
- Time complexities are at least $nd^2$; we want better! (A minimal sketch of the exact least squares solver appears below.)
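To ground the closed-form solution, here is a minimal numpy sketch; the problem sizes and noise level are illustrative assumptions, not from the talk:

```python
import numpy as np

# Synthetic over-constrained instance: n >> d (sizes are illustrative).
rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Closed-form least squares solution x = (A^T A)^{-1} A^T b.
x_normal_eq = np.linalg.solve(A.T @ A, A.T @ b)

# np.linalg.lstsq (QR/SVD based) computes the same minimizer more stably.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x_normal_eq - x_lstsq))  # essentially zero
```

Both routines cost on the order of $nd^2$ time, which is the baseline the sketching methods below aim to beat.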
7. Talk Outline
- Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l1) Regression
- Low Rank Approximation
  - Sketching to speed up Low Rank Approximation
- Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions
8. Sketching to solve least squares regression
- How to find an approximate solution x to $\min_x \|Ax-b\|_2$?
- Goal: output x' for which $\|Ax'-b\|_2 \le (1+\epsilon) \min_x \|Ax-b\|_2$ with high probability
- Draw S from a $k \times n$ random family of matrices, for a value $k \ll n$
- Compute SA and Sb
- Output the solution x' to $\min_x \|(SA)x - (Sb)\|_2$ (see the sketch-and-solve demo below)
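A minimal sketch-and-solve demo with a Gaussian S; the constant in k and the problem sizes are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 5000, 20, 0.5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)

# Gaussian sketch with k = O(d / eps^2) rows; the constant 4 is arbitrary.
k = int(4 * d / eps**2)
S = rng.standard_normal((k, n)) / np.sqrt(k)

x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)         # exact solution
x_sk, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)  # sketched solution

# Ratio of costs should be at most roughly 1 + eps.
print(np.linalg.norm(A @ x_sk - b) / np.linalg.norm(A @ x_opt - b))
```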
9. How to choose the right sketching matrix S?
- Recall: output the solution x' to $\min_x \|(SA)x - (Sb)\|_2$
- Lots of matrices work
- S can be a $d/\epsilon^2 \times n$ matrix of i.i.d. Normal random variables
- Computing SA may be slow
10. How to choose the right sketching matrix S?
- S is a Johnson-Lindenstrauss Transform
- $S = PHD$
  - D is a diagonal matrix with random +1, -1 entries on the diagonal
  - H is the Hadamard transform
  - P just chooses a random (small) subset of the rows of HD
- SA can be computed much faster (a sketch of the construction follows)
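A dense illustration of the $S = PHD$ construction, assuming n is a power of 2; a real implementation would apply a fast Walsh-Hadamard transform rather than the explicit $n \times n$ matrix built here:

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(2)
n, d, k = 1024, 10, 200   # n must be a power of 2 for scipy's hadamard
A = rng.standard_normal((n, d))

D = rng.choice([-1.0, 1.0], size=n)          # D: random diagonal signs
H = hadamard(n) / np.sqrt(n)                 # H: normalized Hadamard transform
rows = rng.choice(n, size=k, replace=False)  # P: random subset of k rows

# SA = P H D A, rescaled so sketched squared norms are unbiased.
SA = np.sqrt(n / k) * (H @ (D[:, None] * A))[rows]
```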
11. Even faster sketching matrices [CW]
- CountSketch matrix
- Define a $k \times n$ matrix S, for $k = O(d^2/\epsilon^2)$
- S is really sparse: a single randomly chosen non-zero entry ($\pm 1$) per column (see the sketch below)
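CountSketch never needs S materialized; applying it is one pass over the rows of A. A minimal sketch (the bucket count k is an illustrative choice):

```python
import numpy as np

def countsketch(A, k, rng):
    """Apply a k x n CountSketch matrix to A in O(nnz(A)) time.

    Column i of S has a single nonzero: a random sign in a random row.
    So SA just adds each (signed) row of A into one of k buckets."""
    n = A.shape[0]
    buckets = rng.integers(0, k, size=n)     # target row for each input row
    signs = rng.choice([-1.0, 1.0], size=n)  # random sign per input row
    SA = np.zeros((k, A.shape[1]))
    np.add.at(SA, buckets, signs[:, None] * A)
    return SA

rng = np.random.default_rng(3)
A = rng.standard_normal((10000, 10))
SA = countsketch(A, k=500, rng=rng)
```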
12. Simpler and Sharper Proofs [MM, NN, N]
- Let $B = [A, b]$ be an $n \times (d+1)$ matrix
- Let U be an orthonormal basis for the columns of B
- Suffices to show $\|SUx\|_2 = 1 \pm \epsilon$ for all unit x
  - Implies $\|S(Ax-b)\|_2 = (1 \pm \epsilon)\,\|Ax-b\|_2$ for all x
- SU is a $(d+1)^2/\epsilon^2 \times (d+1)$ matrix
- Suffices to show $\|U^T S^T S U - I\|_2 \le \|U^T S^T S U - I\|_F \le \epsilon$
- Matrix product result: $\|C S^T S D - CD\|_F^2 \le \frac{1}{\#\mathrm{rows}(S)}\,\|C\|_F^2\,\|D\|_F^2$
- Set $C = U^T$ and $D = U$. Then $\|U\|_F^2 = d+1$ and $\#\mathrm{rows}(S) = (d+1)^2/\epsilon^2$
- $\|SBx\|_2 = (1 \pm \epsilon)\,\|Bx\|_2$ for all x: S is called a subspace embedding (verified numerically below)
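A quick numerical check of the subspace-embedding condition for a Gaussian S (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, eps = 2000, 5, 0.5
B = rng.standard_normal((n, d + 1))
U, _ = np.linalg.qr(B)               # orthonormal basis for the columns of B

k = int((d + 1)**2 / eps**2)         # number of rows of S
S = rng.standard_normal((k, n)) / np.sqrt(k)

SU = S @ U
# ||U^T S^T S U - I||_2 <= eps implies ||SUx||_2 = 1 +/- eps for unit x.
print(np.linalg.norm(SU.T @ SU - np.eye(d + 1), 2))
```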
13. Talk Outline
- Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l1) Regression
- Low Rank Approximation
  - Sketching to speed up Low Rank Approximation
- Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions
14. Sketching to solve l1-regression
- How to find an approximate solution x to $\min_x \|Ax-b\|_1$?
- Goal: output x' for which $\|Ax'-b\|_1 \le (1+\epsilon) \min_x \|Ax-b\|_1$ with high probability
- Natural attempt: draw S from a $k \times n$ random family of matrices, for a value $k \ll n$
- Compute SA and Sb
- Output the solution x' to $\min_x \|(SA)x - (Sb)\|_1$
- Turns out this does not work
15. Sketching to solve l1-regression [SW]
- Why doesn't outputting the solution x' to $\min_x \|(SA)x-(Sb)\|_1$ work?
- We don't know of $k \times n$ matrices S with small k for which, if x' is the solution to $\min_x \|(SA)x-(Sb)\|_1$, then $\|Ax'-b\|_1 \le (1+\epsilon) \min_x \|Ax-b\|_1$ with high probability
- Instead, can find an S so that $\|Ax'-b\|_1 \le (d \log d) \min_x \|Ax-b\|_1$
- S is a matrix of i.i.d. Cauchy random variables
- Property: $\|Ax-b\|_1 \le \|S(Ax-b)\|_1 \le (d \log d)\,\|Ax-b\|_1$
16. Cauchy random variables
- Cauchy random variables are not as nice as Normal (Gaussian) random variables
- They don't have a mean, and they have infinite variance
- The ratio of two independent Normal random variables is Cauchy
- 1-stability: if a and b are scalars and $C_1$ and $C_2$ are independent Cauchys, then $aC_1 + bC_2 \sim (|a| + |b|)\,C$ for a Cauchy C (see the demo below)
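A small demo of both facts; it compares medians of absolute values, since the median of |Cauchy| is 1 while a Cauchy has no mean to compare:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 200000

def cauchy(size):
    # Ratio of two independent standard Normals is standard Cauchy.
    return rng.standard_normal(size) / rng.standard_normal(size)

# 1-stability: a*C1 + b*C2 is distributed as (|a| + |b|) * C.
a, b = 2.0, -3.0
combo = a * cauchy(m) + b * cauchy(m)

# median(|Cauchy|) = 1, so median(|combo|) should be close to |a|+|b| = 5.
print(np.median(np.abs(combo)))
```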
17. Sketching to solve l1-regression
- Main Idea: Let $B = [A, b]$. Compute a QR-factorization of SB
  - Q has orthonormal columns and $QR = SB$
- $BR^{-1}$ is a well-conditioning of B:
  - $\sum_{i=1}^{d} \|BR^{-1}e_i\|_1 \le \sum_{i=1}^{d} \|SBR^{-1}e_i\|_1 \le (d \log d)^{1/2} \sum_{i=1}^{d} \|SBR^{-1}e_i\|_2 \le d\,(d \log d)^{1/2}$
  - $\|x\|_1 \ge \|x\|_2 = \|SBR^{-1}x\|_2 \le \|SBR^{-1}x\|_1 \le (d \log d)\,\|BR^{-1}x\|_1$, so $\|BR^{-1}x\|_1$ is never much smaller than $\|x\|_2$
- These two properties make importance sampling work! (A sketch of the conditioning step follows.)
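A minimal sketch of the conditioning step; the factor of 10 in the Cauchy sketch size is an arbitrary illustrative constant:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 5000, 5
B = rng.standard_normal((n, d))

# Cauchy sketch with O(d log d) rows; the factor 10 is an arbitrary choice.
S = rng.standard_cauchy((10 * d, n))

Q, R = np.linalg.qr(S @ B)      # QR = SB, Q has orthonormal columns
W = B @ np.linalg.inv(R)        # W = B R^{-1}: the well-conditioned basis

# Column 1-norms of W stay modest (the first property above); the row
# 1-norms of W become the importance-sampling scores on the next slides.
print(np.abs(W).sum(axis=0))
```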
18. Importance Sampling
- Want to estimate $\sum_{i=1}^n y_i$ by sampling, for $y_i \ge 0$
- Suppose we sample $y_i$ with probability $p_i$
- $T = \sum_{i=1}^n \delta(y_i \text{ sampled}) \cdot y_i/p_i$
- $E[T] = \sum_{i=1}^n p_i \cdot y_i/p_i = \sum_{i=1}^n y_i$
- $\mathrm{Var}[T] \le \sum_{i=1}^n p_i \cdot (y_i/p_i)^2 \le \left(\sum_{i=1}^n y_i\right) \cdot \max_i\, y_i/p_i$
- Bound $\max_i\, y_i/p_i$ by $\epsilon^2 \sum_{i=1}^n y_i$
- For us, $y_i = |(Ax-b)_i|$, and this holds if $p_i \ge \|e_i^T BR^{-1}\|_1 \cdot \mathrm{poly}(d/\epsilon)$! (A demo of the estimator follows.)
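A minimal demo of the unbiased estimator T; the data and the sampling budget are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100000
y = rng.random(n) ** 2          # nonnegative values to sum

# Sample each i independently with probability p_i proportional to y_i
# (capped at 1); the estimator T sums y_i / p_i over the sampled i.
budget = 2000
p = np.minimum(1.0, budget * y / y.sum())
sampled = rng.random(n) < p
T = np.sum(y[sampled] / p[sampled])

print(y.sum(), T)   # T is unbiased and concentrates around the true sum
```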
19. Importance Sampling
- To get a bound for all x, use Bernstein's inequality and a net argument
- Sample poly(d/ε) rows of $BR^{-1}$, where the i-th row is sampled with probability proportional to its 1-norm
- T is a diagonal matrix with $T_{i,i} = 0$ if row i is not sampled, and $T_{i,i} = 1/\Pr[\text{row } i \text{ sampled}]$ otherwise
- $\|TBx\|_1 = (1 \pm \epsilon)\,\|Bx\|_1$ for all x
- Solve the regression problem on the (reweighted) samples! (An end-to-end sketch follows.)
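An end-to-end sketch under the assumptions above: Cauchy sketch, QR conditioning, row sampling by 1-norm, then an LP on the reweighted sample. The sampling budget of 300 and the LP formulation are illustrative choices, not the talk's exact parameters:

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression(A, b):
    """Solve min_x ||Ax - b||_1 as an LP: min sum(t), -t <= Ax - b <= t."""
    n, d = A.shape
    c = np.concatenate([np.zeros(d), np.ones(n)])
    A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])
    b_ub = np.concatenate([b, -b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + n))
    return res.x[:d]

rng = np.random.default_rng(8)
n, d = 2000, 4
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_cauchy(n)  # heavy tails

# Well-conditioned basis of B = [A, b] via a Cauchy sketch and QR.
B = np.column_stack([A, b])
S = rng.standard_cauchy((10 * (d + 1), n))
_, R = np.linalg.qr(S @ B)
W = B @ np.linalg.inv(R)

# Sample row i with probability proportional to its 1-norm (capped at 1),
# reweight kept rows by 1/p_i, and solve l1-regression on the sample.
scores = np.abs(W).sum(axis=1)
p = np.minimum(1.0, 300 * scores / scores.sum())
keep = rng.random(n) < p
x_hat = l1_regression(A[keep] / p[keep, None], b[keep] / p[keep])
print(x_hat)
```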
20. Sketching to solve l1-regression [MM]
- The most expensive operation is computing SA, where S is the matrix of i.i.d. Cauchy random variables
- All other operations are in the smaller space
- Can speed this up by choosing S as follows (the construction appears as a figure in the slides, not reproduced here)
21. Further sketching improvements [WZ]
- Can show you need fewer sampled rows in later steps if you instead choose S as follows
- Instead of a diagonal of Cauchy random variables, choose a diagonal of reciprocals of exponential random variables
- Uses the max-stability of exponentials [Andoni]: $\max_i\, y_i/e_i \sim \|y\|_1/e$, where the $e_i$ and $e$ are standard exponential random variables
- For recent work on fast sampling-based algorithms, see Richard's talk!
22. Talk Outline
- Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l1) Regression
- Low Rank Approximation
  - Sketching to speed up Low Rank Approximation
- Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions
23. Low rank approximation
- A is an $n \times d$ matrix
  - Typically well-approximated by a low rank matrix
  - E.g., only high rank because of noise
- Want to output a rank-k matrix A', so that $\|A - A'\|_F \le (1+\epsilon)\,\|A - A_k\|_F$, w.h.p., where $A_k = \mathrm{argmin}_{\text{rank-}k\ B}\, \|A - B\|_F$
- (For a matrix C, $\|C\|_F = (\sum_{i,j} C_{i,j}^2)^{1/2}$)
24. Solution to low-rank approximation [S]
- Given the $n \times d$ input matrix A
- Compute SA using a sketching matrix S with $k \ll n$ rows. SA takes random linear combinations of the rows of A
- [Figure: the tall matrix A is compressed to the short, wide matrix SA]
- Project the rows of A onto SA, then find the best rank-k approximation to the projected points inside of SA (a sketch of the whole pipeline follows)
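A minimal end-to-end sketch of this pipeline with a Gaussian S; the sketch size m = 4k and the noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(9)
n, d, k = 2000, 300, 10
# Low-rank signal plus a little noise.
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.01 * rng.standard_normal((n, d))

m = 4 * k                                    # sketch size, O(k/eps) in theory
S = rng.standard_normal((m, n)) / np.sqrt(m)
SA = S @ A                                   # random combinations of rows of A

# Project the rows of A onto the row span of SA ...
Q, _ = np.linalg.qr(SA.T)                    # orthonormal basis of rowspan(SA)
P = A @ Q                                    # coordinates of each row in it

# ... and take the best rank-k approximation inside that subspace.
U, s, Vt = np.linalg.svd(P, full_matrices=False)
A_prime = (U[:, :k] * s[:k]) @ Vt[:k] @ Q.T

# Compare against the true best rank-k approximation A_k (via full SVD).
U0, s0, Vt0 = np.linalg.svd(A, full_matrices=False)
best = np.linalg.norm(A - (U0[:, :k] * s0[:k]) @ Vt0[:k], 'fro')
print(np.linalg.norm(A - A_prime, 'fro') / best)   # close to 1
```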
25. Low Rank Approximation Idea
- S can be a matrix of i.i.d. Normals
- S can be a Fast Johnson-Lindenstrauss Matrix
- S can be a CountSketch matrix
- Regression problem: $\min_X \|A_k X - A\|_F$
  - Solution is $X = I$, and the minimum is $\|A_k - A\|_F$
  - This is a generalized regression problem!
- If S is a subspace embedding for the column space of $A_k$, and also if for any matrices B, C: $\|B S^T S C - BC\|_F^2 \le \frac{1}{\#\mathrm{rows}(S)}\,\|B\|_F^2\,\|C\|_F^2$
- Then if X' is the minimizer of $\min_X \|S A_k X - SA\|_F$, then $\|A_k X' - A\|_F \le (1+\epsilon) \min_X \|A_k X - A\|_F = (1+\epsilon)\,\|A_k - A\|_F$
- But the minimizer $X' = (SA_k)^{-} SA$ is in the row span of SA!
26. Caveat: projecting the points onto SA is slow
- Current algorithm:
  1. Compute SA (easy)
  2. Project each of the rows of A onto SA
  3. Find the best rank-k approximation of the projected points inside of the rowspace of SA (easy)
- Bottleneck is step 2
- [CW]: Turns out you can approximate the projection
  - Sketching for generalized regression again: $\min_X \|X(SA) - A\|_F^2$ (an illustrative right-sketch appears below)
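One way to realize this, sketched under assumptions: compress the columns with a second sketching matrix R, solve the much smaller problem $\min_X \|X(SA)R - AR\|_F$, and use that X in place of the exact projection. This is an illustration of the idea, not the exact [CW] construction:

```python
import numpy as np

rng = np.random.default_rng(10)
n, d, k = 2000, 500, 10
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.01 * rng.standard_normal((n, d))

m = 4 * k
S = rng.standard_normal((m, n)) / np.sqrt(m)
SA = S @ A

# Right sketch R compresses the d columns; solving the small regression
# min_X ||X (SA) R - A R||_F avoids projecting all n rows exactly.
r = 8 * m
R = rng.standard_normal((d, r)) / np.sqrt(r)
X = np.linalg.lstsq((SA @ R).T, (A @ R).T, rcond=None)[0].T   # n x m

# Rank-k approximation of X @ SA, whose rows lie in rowspan(SA).
U, s, Vt = np.linalg.svd(X @ SA, full_matrices=False)
A_prime = (U[:, :k] * s[:k]) @ Vt[:k]
print(np.linalg.norm(A - A_prime, 'fro'))
```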
27. Talk Outline
- Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l1) Regression
- Low Rank Approximation
  - Sketching to speed up Low Rank Approximation
- Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions
28. M-Estimators and Robust Regression
- Solve $\min_x \|Ax - b\|_M$
  - $M: \mathbb{R} \to \mathbb{R}_{\ge 0}$
  - $\|y\|_M = \sum_{i=1}^n M(y_i)$
- Least squares and l1-regression are special cases
- Huber function, given a parameter c:
  - $M(y) = y^2/(2c)$ for $|y| \le c$
  - $M(y) = |y| - c/2$ otherwise
- Enjoys the smoothness properties of l2 and the robustness properties of l1 (see the sketch below)
- [CW15]: For M-estimators with at least linear and at most quadratic growth, can get an O(1)-approximation in $O(\mathrm{nnz}(A)) + \mathrm{poly}(d)$ time
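A direct transcription of the Huber function from the slide; the cost helper is our own addition for evaluating $\|Ax-b\|_M$:

```python
import numpy as np

def huber(y, c):
    """Huber function: quadratic near zero (like l2), linear in the tails (like l1)."""
    y = np.abs(y)
    return np.where(y <= c, y**2 / (2 * c), y - c / 2)

def huber_cost(A, x, b, c):
    """||Ax - b||_M = sum_i M((Ax - b)_i) for the Huber M-estimator."""
    return huber(A @ x - b, c).sum()
```

The two branches agree at |y| = c (both equal c/2), which is the smoothness the slide refers to.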
29. CUR Decompositions
- [BW14]: Can find a CUR decomposition in $O(\mathrm{nnz}(A) \log n) + n \cdot \mathrm{poly}(k/\epsilon)$ time, with $O(k/\epsilon)$ columns, $O(k/\epsilon)$ rows, and $\mathrm{rank}(U) = k$
30. Open Questions
- Recent monograph in NOW Publishers: D. Woodruff, Sketching as a Tool for Numerical Linear Algebra
- Other types of low rank approximation:
  - (Spectral) How quickly can we find a rank-k matrix A' so that $\|A - A'\|_2 \le (1+\epsilon)\,\|A - A_k\|_2$, w.h.p., where $A_k = \mathrm{argmin}_{\text{rank-}k\ B}\, \|A - B\|_2$?
  - (Robust) How quickly can we find a rank-k matrix A' so that $\|A - A'\|_1 \le (1+\epsilon)\,\|A - A_k\|_1$, w.h.p., where $A_k = \mathrm{argmin}_{\text{rank-}k\ B}\, \|A - B\|_1$?
- For other questions regarding Schatten norms and communication-efficiency, see the reference above.
Thanks!