Title: Overcoming the L1 Non-Embeddability Barrier
1. Overcoming the L1 Non-Embeddability Barrier
- Robert Krauthgamer (Weizmann Institute)
- Joint work with Alexandr Andoni and Piotr Indyk (MIT)
2. Algorithms on Metric Spaces
- Fix a metric M (e.g. Hamming distance, Ulam metric, Earthmover distance)
- Fix a computational problem
- Solve problem under M
Example problems:
- Compute distance between x,y
- Nearest Neighbor Search: preprocess n strings so that, given a query string, we can find the closest string to it
Edit distance: ED(x,y) = minimum number of edit operations that transform x into y, where an edit operation is an insertion, deletion, or substitution of a character. Example: ED(0101010, 1010101) = 2.
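The edit distance defined on this slide can be checked with the textbook dynamic program. The following is a minimal sketch, not taken from the talk; the function name `edit_distance` is an illustrative choice.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("0101010", "1010101"))  # the slide's example: 2
```

This runs in time proportional to the product of the two string lengths, which is the kind of cost the embedding approach tries to avoid for nearest neighbor search.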
3. Motivation for Nearest Neighbor
- Many applications:
- Image search (Euclidean dist, Earth-mover dist)
- Processing of genetic information, text processing (edit dist.)
- many others
Generic Search Engine
4. A General Tool: Embeddings
- An embedding of M into a host metric $(H, d_H)$ is a map $f : M \to H$ that preserves distances approximately
- $f$ has distortion $A \ge 1$ if for all $x, y \in M$,
- $d_M(x,y) \le d_H(f(x),f(y)) \le A \cdot d_M(x,y)$
- Why?
- If H is "easy" (= can efficiently solve computational problems like NNS)
- Then get good algorithms for the original space M!
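To make the distortion definition concrete, here is a small sketch that measures the distortion of a map on a finite point set. All names (`distortion`, `d_cycle`, `corners`) are hypothetical, and the sketch assumes the map may be rescaled, so it reports the ratio between the largest and smallest expansion over all pairs.

```python
from itertools import combinations
import math


def distortion(points, d_M, d_H, f):
    """Smallest A with d_M <= c * d_H(f(x), f(y)) <= A * d_M for some scaling c."""
    ratios = [d_H(f(x), f(y)) / d_M(x, y) for x, y in combinations(points, 2)]
    return max(ratios) / min(ratios)


# Source metric M: shortest-path distance on a 4-cycle with vertices 0..3.
def d_cycle(i, j):
    k = abs(i - j)
    return min(k, 4 - k)


# Candidate embedding: place the vertices on the corners of the unit square.
corners = {0: (0, 0), 1: (0, 1), 2: (1, 1), 3: (1, 0)}


def d_l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def d_l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


print(distortion(range(4), d_cycle, d_l1, corners.get))  # isometric into l1: 1.0
print(distortion(range(4), d_cycle, d_l2, corners.get))  # into l2: about 1.414
```

The same corner map is exact under the $\ell_1$ host and loses a factor of $\sqrt{2}$ under the Euclidean host, which illustrates how the choice of host space affects distortion.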
5. Host space?
$\ell_1$ = real space with $d_1(x,y) = \sum_i |x_i - y_i|$
- Popular target metric: $\ell_1$
- Have efficient algorithms:
- Distance estimation: $O(d)$ for d-dimensional space (often less)
- NNS: c-approx with $O(n^{1/c})$ query time and $O(n^{1+1/c})$ space [IM98]
- Powerful enough for some things
Known distortion bounds for embedding into $\ell_1$:
Metric | References | Upper bound | Lower bound
Edit distance over $\{0,1\}^d$ | [OR05], [KN05, KR06, AK07] | $2^{O(\sqrt{\log d})}$ | $\Omega(\log d)$
Ulam (= edit distance over permutations) | [CK06], [AK07] | $O(\log d)$ | $\Omega(\log d)$
Block edit distance over $\{0,1\}^d$ | [MS00, CM07], [Cor03] | $O(\log d)$ | 4/3
Earthmover distance in $\mathbb{R}^2$ (sets of size s) | [Cha02, IT03], [NS07] | $O(\log s)$ | $\Omega(\log^{1/2} s)$
Earthmover distance in $\{0,1\}^d$ (sets of size s) | [AIK08], [KN05] | $O(\log s \cdot \log d)$ | $\Omega(\log s)$
6. Below logarithmic?
- Cannot work with $\ell_1$
- Other possibilities?
- $(\ell_2)^p$ is bigger and algorithmically tractable
- but not rich enough (often same lower bounds)
- $\ell_\infty$ is rich (includes all metrics),
- but usually not computationally efficient (high dimension)
- And that's roughly it
- (at least for efficient NNS)
$(\ell_2)^p$ = real space with $\mathrm{dist}_{2^p}(x,y) = \|x-y\|_2^p$
$\ell_\infty$ = real space with $\mathrm{dist}_\infty(x,y) = \max_i |x_i - y_i|$
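7. Meet our new host
- Iterated product space, $P_{2^2,\infty,1}$, built in three levels with dimensions $(\gamma, \beta, \alpha)$:
- $d_1$: the $\ell_1$ distance on $\alpha$ coordinates, $d_1(x,y) = \sum_{i=1}^{\alpha} |x_i - y_i|$
- $d_{\infty,1}$: the maximum of $d_1$ over $\beta$ blocks, $d_{\infty,1}(x,y) = \max_{j \le \beta} d_1(x_j, y_j)$
- $d_{2^2,\infty,1}$: the sum of squared $d_{\infty,1}$ over $\gamma$ blocks, $d_{2^2,\infty,1}(x,y) = \sum_{k \le \gamma} \big(d_{\infty,1}(x_k, y_k)\big)^2$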
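The three nested distances of the iterated product space can be evaluated directly on nested lists. This is a minimal sketch under the assumption that a point is stored as a list of $\gamma$ blocks, each a list of $\beta$ vectors of length $\alpha$; the function names are illustrative.

```python
def d1(u, v):
    """l1 distance between two vectors of alpha coordinates."""
    return sum(abs(a - b) for a, b in zip(u, v))


def d_inf_1(U, V):
    """l_infinity product: maximum l1 distance over the beta inner vectors."""
    return max(d1(u, v) for u, v in zip(U, V))


def d_22_inf_1(X, Y):
    """(l2)^2 product: sum of squared d_inf_1 over the gamma outer blocks."""
    return sum(d_inf_1(U, V) ** 2 for U, V in zip(X, Y))


# A toy instance with gamma = 2, beta = 2, alpha = 3.
X = [[[0, 0, 0], [1, 0, 0]], [[0, 0, 0], [0, 0, 0]]]
Y = [[[1, 1, 0], [1, 0, 0]], [[0, 0, 3], [0, 1, 0]]]
print(d_22_inf_1(X, Y))  # max(2, 0)^2 + max(3, 1)^2 = 13
```

Evaluating the distance takes time linear in the total number of coordinates, $\alpha\beta\gamma$.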
8. Why $P_{2^2,\infty,1}$?
- Because we can
- Theorem 1. Ulam embeds into $P_{2^2,\infty,1}$ with O(1) distortion
- Dimensions $(\gamma,\beta,\alpha) = (d, \log d, d)$
- Theorem 2. $P_{2^2,\infty,1}$ admits NNS on n points with
- $O(\log\log n)$ approximation
- $O(n^{\varepsilon})$ query time and $O(n^{1+\varepsilon})$ space
- In fact, there is more for Ulam
Rich
Algorithmically tractable
9. Our Algorithms for Ulam
- Ulam = edit distance on strings where each symbol appears at most once. Example: ED(1234567, 7123456) = 2
- A classical distance between rankings
- Exhibits hardness of misalignments (as in general edit)
- All lower bounds same as for general edit (up to $\Theta(\cdot)$)
- Distortion of embedding into $\ell_1$ (and $(\ell_2)^p$, etc.): $\Theta(\log d)$
- Our approach implies new algorithms for Ulam:
- 1. NNS with $O(\log\log n)$ approx, $O(n^{\varepsilon})$ query time
- Can improve to $O(\log\log d)$ approx
- 2. Sketching with O(1)-approx in $\log^{O(1)} d$ space
- 3. Distance estimation with O(1)-approx in time [expression missing in source]
If we ever hope for approximation $\ll \log d$ for NNS under general edit, first we have to get it under Ulam!
[BEKMRRS03]: when $ED \approx d$, approx $d^{\varepsilon}$ in $O(d^{1-2\varepsilon})$ time
10. Theorem 1
- Theorem 1. Can embed Ulam into $P_{2^2,\infty,1}$ with O(1) distortion
- Dimensions $(\gamma,\beta,\alpha) = (d, \log d, d)$
- Proof
- "Geometrization" of Ulam characterizations
- Previously studied in the context of testing monotonicity (sortedness):
- Sublinear algorithms [EKKRV98, ACCL04]
- Data-stream algorithms [GJKK07, GG07, EH08]
11. Thm 1: Characterizing Ulam
- Consider permutations x, y over [d]
- Assume for now x = identity permutation
- Idea:
- Count # chars in y to delete to obtain an increasing sequence ($\approx$ Ulam(x,y))
- Call them faulty characters
- Issues:
- Ambiguity
- How do we count them?
Examples (x = 123456789): y = 234657891 and y = 341256789
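The count in the "Idea" bullet equals the length of y minus the length of its longest increasing subsequence. The sketch below uses the standard patience-sorting computation; it is an illustration rather than the talk's method, and the function names are hypothetical.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence of seq."""
    tails = []  # tails[k] = smallest possible tail of an increasing run of length k + 1
    for v in seq:
        i = bisect_left(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(tails)


def min_deletions_to_sorted(y):
    """Fewest characters to delete from y so that the rest is increasing."""
    return len(y) - lis_length(y)


print(min_deletions_to_sorted([2, 3, 4, 6, 5, 7, 8, 9, 1]))  # 2
print(min_deletions_to_sorted([3, 4, 1, 2, 5, 6, 7, 8, 9]))  # 2
```

Both slide examples need two deletions, but the set to delete is not unique: in the second one, either {1, 2} or {3, 4} works. This is the ambiguity the slide points out, since the count is well defined while the identity of the faulty characters is not.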
12. Thm 1: Characterization, inversions
- Definition: chars a < b form an inversion if b precedes a in y
- How to identify faulty char?
- Has an inversion?
- Doesn't work: all chars might have an inversion
- Has many inversions?
- Still can miss faulty chars
- Has many inversions locally?
- Same problem
Check if either is true!
Examples (x = 123456789): y = 567981234, y = 234567891, y = 213456798
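A short sketch makes the first objection concrete by counting, for each character, how many inversions it takes part in. The helper name `inversion_counts` is illustrative, and the quadratic loop is chosen for clarity only.

```python
def inversion_counts(y):
    """Map each character of y to the number of inversions it participates in."""
    counts = {a: 0 for a in y}
    for i, b in enumerate(y):
        for a in y[i + 1:]:
            if a < b:  # the larger b precedes the smaller a: an inversion
                counts[a] += 1
                counts[b] += 1
    return counts


counts = inversion_counts([2, 3, 4, 5, 6, 7, 8, 9, 1])
print(counts[1])                                   # 8
print(all(c >= 1 for c in counts.values()))        # True
```

For y = 234567891, moving the single character 1 sorts the string, yet every character has at least one inversion, so "has an inversion" cannot identify the faulty characters.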
13. Thm 1: Characterization, faulty chars
- Definition 1: a is faulty if there exists K > 0 s.t.
- a is inverted w.r.t. a majority of the K symbols preceding a in y
- (ok to consider $K = 2^k$)
- Lemma [ACCL04, GJKK07]: # faulty chars = $\Theta(\mathrm{Ulam}(x,y))$.
Example: x = 123456789, y = 234567891. Take the 4 characters preceding 1 in y: all form inversions with 1.
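Definition 1 can be turned into a direct check. The sketch below assumes x is the identity, reads "majority" as a strict majority, and only considers K up to the number of symbols that actually precede a; these readings and the name `faulty_chars` are assumptions made for illustration.

```python
def faulty_chars(y):
    """Characters a of y inverted w.r.t. a strict majority of the K symbols
    immediately preceding a, for some K > 0 (x is taken to be the identity)."""
    faulty = set()
    for pos, a in enumerate(y):
        inverted = 0
        for K in range(1, pos + 1):
            if y[pos - K] > a:    # a larger symbol precedes a: an inversion
                inverted += 1
            if 2 * inverted > K:  # strict majority among the last K symbols
                faulty.add(a)
                break
    return faulty


print(faulty_chars([2, 3, 4, 5, 6, 7, 8, 9, 1]))  # {1}
print(faulty_chars([2, 1, 3, 4, 5, 6, 7, 9, 8]))  # {1, 8}
```

On the slide's example only the character 1 is flagged, consistent with the lemma's claim that the number of faulty characters is within a constant factor of the Ulam distance.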
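14. Thm 1: Characterization → Embedding
- To get embedding, need:
- (1) Symmetrization (neither string is identity)
- (2) Deal with "exists", "majority"?
- To resolve (1), use instead $X[a;K]$, the set of K symbols preceding a in x
- Definition 2: a is faulty if there exists $K = 2^k$ such that
- $|X[a;2^k] \,\triangle\, Y[a;2^k]| > 2^k$ (symmetric difference)
Example: x = 123456789, y = 123467895, with X[5;4] = {1,2,3,4} and Y[5;4] = {6,7,8,9}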
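The symmetric version in Definition 2 can be checked with set operations. This sketch assumes that when fewer than K symbols precede a, the set simply contains all of them; that convention and the names `preceding` and `is_faulty` are illustrative choices, not taken from the talk.

```python
def preceding(perm, a, K):
    """Set of (up to) K symbols immediately preceding a in perm."""
    pos = perm.index(a)
    return set(perm[max(0, pos - K):pos])


def is_faulty(x, y, a):
    """Definition 2: some K = 2^k has |X[a;K] symmetric-difference Y[a;K]| > K."""
    K = 1
    while K <= len(x):
        if len(preceding(x, a, K) ^ preceding(y, a, K)) > K:
            return True
        K *= 2
    return False


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(preceding(x, 5, 4), preceding(y, 5, 4))  # {1, 2, 3, 4} {8, 9, 6, 7}
print(is_faulty(x, y, 5))                      # True
```

Because the test compares the two neighbourhoods directly, swapping the roles of x and y gives the same answer, which is the symmetrization the slide asks for.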
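15. Thm 1: Embedding, final step
- We have [formula not recovered; one term is annotated "equal 1 iff true"]
- Replace by weight?
- Final embedding: [formula not recovered; it contains a squared term $(\cdot)^2$]
[Figure: x = 123456789 and y = 123467895 with the sets $X[5;2^2]$ and $Y[5;2^2]$ marked]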
16. Theorem 2
- Theorem 2. $P_{2^2,\infty,1}$ admits NNS on n points with
- $O(\log\log n)$ approximation
- $O(n^{\varepsilon})$ query time and $O(n^{1+\varepsilon})$ space for any small $\varepsilon > 0$
- (ignoring $(\alpha\beta\gamma)^{O(1)}$)
- A rather general approach:
- "LSH" on $\ell_1$-products of general metric spaces
- Of course, cannot do, but can reduce to $\ell_\infty$-products
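17. Thm 2: Proof
- Let's start from basics: $\ell_1^{\alpha}$
- [IM98]: c-approx with $O(n^{1/c})$ query time and $O(n^{1+1/c})$ space
- (ignoring $\alpha^{O(1)}$)
- Ok, what about the $\ell_\infty$-product?
- Suppose NNS for M with
- $c_M$-approx
- $Q_M$ query time
- $S_M$ space.
- Then NNS for the $\ell_\infty$-product of M [I02] with
- $O(c_M \log\log n)$-approx
- $O(Q_M)$ query time
- $O(S_M \cdot n^{1+\varepsilon})$ space.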
18. Thm 2: What about $(\ell_2)^2$-product?
- Enough to consider [expression missing in source]
- (for us, M is the $\ell_1$-product)
- Off-the-shelf?
- [I04] gives space $n^{?}$ or $> \log n$ approximation
- We reduce to multiple NNS queries under the $\ell_\infty$-product of M
- Instructive to first look at NNS for standard $\ell_1$
19. Thm 2: Review of NNS for $\ell_1$
- LSH family: collection H of hash functions such that:
- For random $h \in H$ (parameter $\phi > 0$)
- $\Pr[h(q) = h(p)] \approx 1 - \|q-p\|_1 / \phi$
- Query just uses primitive: return all points p such that h(q) = h(p)
- Can obtain H by imposing a randomly-shifted grid of side-length $\phi$
- Then for h defined by $r_i \in [0, \phi]$ at random, primitive becomes: return all p s.t. $|q_i - p_i| < r_i$ for all $i \in [d]$
[Figure: points q and p on a randomly-shifted grid of side-length $\phi$]
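The randomly-shifted grid can be sketched in a few lines: each hash function shifts every axis by a random offset and maps a point to the index of the grid cell containing it. The names `make_grid_hash` and `collision_rate` are hypothetical, and the empirical estimate is only meant to illustrate the collision probability, not to reproduce the analysis of [IM98].

```python
import math
import random


def make_grid_hash(dim, side, rng):
    """Draw one hash function: a grid of the given side-length, randomly shifted."""
    shifts = [rng.uniform(0.0, side) for _ in range(dim)]

    def h(point):
        # The hash value is the index of the grid cell containing the point.
        return tuple(math.floor((c + s) / side) for c, s in zip(point, shifts))

    return h


def collision_rate(q, p, side, trials=20000, seed=0):
    """Fraction of independently drawn grid hashes under which q and p collide."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_grid_hash(len(q), side, rng)
        hits += h(q) == h(p)
    return hits / trials


# In one dimension the collision probability is exactly 1 - |q - p| / side.
print(collision_rate((0.0,), (0.3,), side=1.0))  # close to 0.7
```

In higher dimensions the collision probability is the product of the per-axis terms, which is at least $1 - \|q-p\|_1/\text{side}$ and close to it when the points are near each other.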
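20. Thm 2: LSH for $\ell_1$-product
- Intuition: abstract LSH!
- Recall we had:
- for $r_i$ random from $[0, \phi]$,
- point p returned if for all i: $|q_i - p_i| < r_i$
- Equivalently:
- For all i: $|q_i - p_i| / r_i < 1$, i.e. $\max_i |q_i - p_i| / r_i < 1$: an $\ell_\infty$ product of $\mathbb{R}$!
For $\ell_1$: return all p s.t. $|q_i - p_i| < r_i$ for all $i \in [d]$
For the $\ell_1$-product of M: return all points p such that $\max_i d_M(q_i, p_i)/r_i < 1$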
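The abstracted primitive only needs a distance function on M, so it can be written generically. Below is a brute-force sketch that scans all points; the function name `product_primitive` is illustrative, and an actual data structure would answer the same condition through an NNS structure for the $\ell_\infty$-product rather than a linear scan.

```python
def product_primitive(q, points, radii, d_M):
    """Return every point p with max_i d_M(q_i, p_i) / r_i < 1.

    q and each p are tuples of coordinates, each coordinate being an element
    of M. The radii r_i are assumed strictly positive.
    """
    return [
        p for p in points
        if max(d_M(qi, pi) / ri for qi, pi, ri in zip(q, p, radii)) < 1
    ]


def d_l1(u, v):
    """Example base metric M: the plane under the l1 distance."""
    return sum(abs(a - b) for a, b in zip(u, v))


q = ((0, 0), (1, 1))
points = [((0, 1), (1, 1)), ((3, 0), (1, 1)), ((0, 0), (1, 3))]
print(product_primitive(q, points, radii=[2.0, 1.5], d_M=d_l1))
# [((0, 1), (1, 1))]
```

Dividing each coordinate's distance by $r_i$ is the rescaling that lets a single $\ell_\infty$-product query with threshold 1 express the condition.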
21. Thm 2: Final
- Thus, sufficient to solve primitive: return all points p such that $\max_i d_M(q_i,p_i)/r_i < 1$ (in fact, for k independent choices of $(r_1, \dots, r_d)$)
- We reduced NNS over the $\ell_1$-product of M
- to several instances of NNS over the $\ell_\infty$-product of M
- (with appropriately scaled coordinates)
- Approximation is $O(1) \cdot O(\log\log n)$
- Done!
22. Take-home message
- Can embed combinatorial metrics into iterated product spaces
- Works for Ulam (edit on non-repetitive strings)
- Approach bypasses non-embeddability results into usual-suspect spaces like $\ell_1$, $(\ell_2)^2$
- Open:
- Embeddings for edit over $\{0,1\}^d$, EMD, other metrics?
- Understanding product spaces?
- [Jayram-Woodruff]: sketching
Thank you!