Overcoming the L1 Non-Embeddability Barrier

1
Overcoming the L1 Non-Embeddability Barrier
  • Robert Krauthgamer (Weizmann Institute)
  • Joint work with Alexandr Andoni and Piotr Indyk
    (MIT)

2
Algorithms on Metric Spaces
  • Fix a metric M (e.g. Hamming distance, Ulam metric, Earthmover distance, edit distance)
  • Fix a computational problem (e.g. compute the distance between x, y; Nearest Neighbor Search)
  • Solve the problem under M

Edit distance: ED(x,y) = minimum number of edit operations that transform x into y.
Edit operation = insert/delete/substitute a character.
Example: ED(0101010, 1010101) = 2

Nearest Neighbor Search: preprocess n strings so that, given a query string, we can find the closest string to it.
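
The definition of ED above can be checked with the textbook dynamic program. This is a minimal sketch for illustration (the function name `edit_distance` is ours, not from the talk); it runs in quadratic time and is not one of the algorithms the talk proposes.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insert/delete/substitute operations turning x into y."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute (free when symbols match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("0101010", "1010101"))  # 2: delete the leading 0, append a 1
```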


3
Motivation for Nearest Neighbor
  • Many applications:
  • Image search (Euclidean dist., Earth-mover dist.)
  • Processing of genetic information, text processing (edit dist.)
  • Many others

(Figure: generic search engine)
4
A General Tool: Embeddings
  • An embedding of $M$ into a host metric $(H, d_H)$ is a map $f : M \to H$
  • that preserves distances approximately
  • It has distortion $A \ge 1$ if for all $x, y \in M$,
  • $d_M(x,y) \le d_H(f(x), f(y)) \le A \cdot d_M(x,y)$
  • Why?
  • If $H$ is "easy" (= can efficiently solve computational problems like NNS)
  • Then we get good algorithms for the original space $M$!
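
To make the distortion definition concrete, the sketch below measures the distortion of a map on a finite point set as the ratio between the largest and smallest stretch $d_H(f(x),f(y)) / d_M(x,y)$, which equals the smallest $A$ achievable after rescaling $f$. The 4-cycle example and all function names are hypothetical illustrations, not taken from the talk; the sketch assumes the points are distinct and $f$ is injective.

```python
from itertools import combinations


def distortion(points, d_M, d_H, f):
    """Smallest A with d_M <= s * d_H(f(x), f(y)) <= A * d_M for some scaling s."""
    ratios = [d_H(f(x), f(y)) / d_M(x, y) for x, y in combinations(points, 2)]
    return max(ratios) / min(ratios)


def cycle_dist(u, v, n=4):
    """Shortest-path distance on the n-cycle (hypothetical source metric M)."""
    return min((u - v) % n, (v - u) % n)


def line_dist(a, b):
    """Distance on the real line (hypothetical host metric H)."""
    return abs(a - b)


# Lay the 4-cycle out on the line at positions 0, 1, 2, 3.
print(distortion(range(4), cycle_dist, line_dist, lambda v: v))  # 3.0
```

The pair (0, 3) is adjacent on the cycle but at distance 3 on the line, which is what forces the distortion of this particular map up to 3.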
5
Host space?
$\ell_1$ = real space with $d_1(x,y) = \sum_i |x_i - y_i|$
  • Popular target metric: $\ell_1$
  • Have efficient algorithms:
  • Distance estimation: $O(d)$ for $d$-dimensional space (often less)
  • NNS: $c$-approx with $O(n^{1/c})$ query time and $O(n^{1+1/c})$ space [IM98]
  • Powerful enough for some things

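Distortion of embedding into $\ell_1$:

| Metric | References (upper bound; lower bound) | Upper bound | Lower bound |
|---|---|---|---|
| Edit distance over $\{0,1\}^d$ | [OR05]; [KN05, KR06, AK07] | $2^{O(\sqrt{\log d})}$ | $\Omega(\log d)$ |
| Ulam (= edit distance over permutations) | [CK06]; [AK07] | $O(\log d)$ | $\Omega(\log d)$ |
| Block edit distance over $\{0,1\}^d$ | [MS00, CM07]; [Cor03] | $O(\log d)$ | $4/3$ |
| Earthmover distance in $\mathbb{R}^2$ (sets of size $s$) | [Cha02, IT03]; [NS07] | $O(\log s)$ | $\Omega(\log^{1/2} s)$ |
| Earthmover distance in $\{0,1\}^d$ (sets of size $s$) | [AIK08]; [KN05] | $O(\log s \cdot \log d)$ | $\Omega(\log s)$ |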
6
Below logarithmic?
  • Cannot work with $\ell_1$
  • Other possibilities?
  • $(\ell_2)^p$ is bigger and algorithmically tractable,
  • but not rich enough (often same lower bounds)
  • $\ell_\infty$ is rich (includes all metrics),
  • but usually not computationally efficient (high dimension)
  • And that's roughly it
  • (at least for efficient NNS)

$(\ell_2)^p$ = real space with $\mathrm{dist}_{2^p}(x,y) = \|x-y\|_2^p$
$\ell_\infty$ = real space with $\mathrm{dist}_\infty(x,y) = \max_i |x_i - y_i|$
7
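Meet our new host
  • Iterated product space, $(\ell_2)^2(\ell_\infty(\ell_1))$, built in three levels:
  • $d_1(x,y) = \sum_{i} |x_i - y_i|$ on blocks of $\alpha$ real coordinates
  • $d_{\infty,1}(x,y) = \max_{j} d_1(x_j, y_j)$ over $\beta$ such blocks
  • $d_{2^2,\infty,1}(x,y) = \sum_{k} d_{\infty,1}(x_k, y_k)^2$ over $\gamma$ such groups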
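
A small sketch of how the three nested distances compose, assuming points are stored as nested Python lists of shape $\gamma \times \beta \times \alpha$ (that layout and the function names are our choice for illustration):

```python
def d1(x, y):
    """l_1 distance between two blocks of alpha real coordinates."""
    return sum(abs(a - b) for a, b in zip(x, y))


def d_inf_1(x, y):
    """l_infinity-product of l_1: the max over beta blocks of their l_1 distance."""
    return max(d1(xb, yb) for xb, yb in zip(x, y))


def d_22_inf_1(x, y):
    """(l_2)^2-product on top: the sum over gamma groups of squared d_inf_1."""
    return sum(d_inf_1(xg, yg) ** 2 for xg, yg in zip(x, y))


# gamma = 2 groups, beta = 2 blocks per group, alpha = 3 coordinates per block
x = [[[0, 0, 0], [1, 1, 1]], [[2, 0, 0], [0, 0, 0]]]
y = [[[1, 0, 0], [1, 1, 4]], [[0, 0, 0], [0, 0, 0]]]
print(d_22_inf_1(x, y))  # max(1, 3)**2 + max(2, 0)**2 = 13
```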
8
Why $(\ell_2)^2(\ell_\infty(\ell_1))$?
  • Because we can
  • Theorem 1 (rich). Ulam embeds into $(\ell_2)^2(\ell_\infty(\ell_1))$ with $O(1)$ distortion
  • Dimensions $(\gamma, \beta, \alpha) = (d, \log d, d)$
  • Theorem 2 (algorithmically tractable). $(\ell_2)^2(\ell_\infty(\ell_1))$ admits NNS on $n$ points with
  • $O(\log\log n)$ approximation
  • $O(n^{\varepsilon})$ query time and $O(n^{1+\varepsilon})$ space
  • In fact, there is more for Ulam
9
Our Algorithms for Ulam
Example: ED(1234567, 7123456) = 2
  • Ulam = edit distance on strings where each symbol appears at most once
  • A classical distance between rankings
  • Exhibits hardness of misalignments (as in general edit)
  • All lower bounds same as for general edit (up to $\Theta(\cdot)$)
  • Distortion of embedding into $\ell_1$ (and $(\ell_2)^p$, etc.): $\Theta(\log d)$
  • Our approach implies new algorithms for Ulam:
  • 1. NNS with $O(\log\log n)$ approx, $O(n^{\varepsilon})$ query time
  • Can improve to $O(\log\log d)$ approx
  • 2. Sketching with $O(1)$-approx in $\log^{O(1)} d$ space
  • 3. Distance estimation with $O(1)$-approx in time …

If we ever hope for approximation $\ll \log d$ for NNS under general edit, first we have to get it under Ulam!
[BEKMRRS03]: when $ED \approx d$, approx $d^{\varepsilon}$ in $O(d^{1-2\varepsilon})$ time
10
Theorem 1
  • Theorem 1. Can embed Ulam into $(\ell_2)^2(\ell_\infty(\ell_1))$ with $O(1)$ distortion
  • Dimensions $(\gamma, \beta, \alpha) = (d, \log d, d)$
  • Proof:
  • "Geometrization" of Ulam characterizations
  • Previously studied in the context of testing monotonicity (sortedness):
  • Sublinear algorithms [EKKRV98, ACCL04]
  • Data-stream algorithms [GJKK07, GG07, EH08]

11
Thm 1: Characterizing Ulam
  • Consider permutations $x, y$ over $[d]$
  • Assume for now $x$ = identity permutation
  • Idea:
  • Count chars in $y$ to delete to obtain an increasing sequence ($\approx \mathrm{Ulam}(x,y)$)
  • Call them faulty characters
  • Issues:
  • Ambiguity
  • How do we count them?

Examples (x = 123456789): y = 234657891 and y = 341256789
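
With $x$ the identity, the number of characters to delete from $y$ is $d$ minus the length of a longest increasing subsequence. A minimal sketch using the standard patience-sorting technique (the function name `chars_to_delete` is ours):

```python
from bisect import bisect_left


def chars_to_delete(y):
    """Fewest symbols to remove from y so that the remainder is increasing."""
    tails = []  # tails[k] = smallest last symbol of an increasing subsequence of length k + 1
    for symbol in y:
        k = bisect_left(tails, symbol)
        if k == len(tails):
            tails.append(symbol)   # extends the longest subsequence found so far
        else:
            tails[k] = symbol      # a smaller tail for subsequences of length k + 1
    return len(y) - len(tails)


for y in ("234657891", "341256789"):
    print(y, chars_to_delete([int(c) for c in y]))  # both need 2 deletions
```

The second example also shows the ambiguity issue: deleting {3, 4} or deleting {1, 2} both leave an increasing sequence, so the count is well defined but the set of faulty characters is not.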
12
Thm 1: Characterization, inversions
  • Definition: chars $a < b$ form an inversion if $b$ precedes $a$ in $y$
  • How to identify a faulty char?
  • Has an inversion?
  • Doesn't work: all chars might have an inversion
  • Has many inversions?
  • Still can miss faulty chars
  • Has many inversions locally?
  • Same problem

Check if either is true!
Examples (x = 123456789): y = 567981234, y = 234567891, y = 213456798
13
Thm 1: Characterization, faulty chars
  • Definition 1: $a$ is faulty if there exists $K > 0$ s.t.
  • $a$ is inverted w.r.t. a majority of the $K$ symbols preceding $a$ in $y$
  • (ok to consider $K = 2^k$)
  • Lemma [ACCL04, GJKK07]: #faulty chars $= \Theta(\mathrm{Ulam}(x,y))$.

Example (x = 123456789, y = 234567891): take the 4 characters preceding 1 (all form inversions with 1)
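
Definition 1 translates almost directly into code. The sketch below assumes $x$ is the identity, reads "majority" as a strict majority, and tries every window size $K$ rather than only powers of two; those choices and the name `faulty_chars` are ours.

```python
def faulty_chars(y):
    """Symbols of y that are faulty per Definition 1 (x = identity permutation)."""
    faulty = set()
    for pos, a in enumerate(y):
        inverted = 0
        # Grow the window of K symbols immediately preceding a, one symbol at a time.
        for K in range(1, pos + 1):
            if y[pos - K] > a:     # a larger symbol precedes a: an inversion
                inverted += 1
            if 2 * inverted > K:   # strict majority of the window is inverted
                faulty.add(a)
                break
    return faulty


print(faulty_chars([2, 3, 4, 5, 6, 7, 8, 9, 1]))  # {1}
print(faulty_chars([5, 6, 7, 9, 8, 1, 2, 3, 4]))  # {1, 2, 3, 4, 8}
```

In both examples the number of faulty characters (1 and 5) happens to equal the number of deletions needed to make $y$ increasing; the lemma only promises agreement up to constant factors.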
14
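Thm 1: Characterization → Embedding
  • To get an embedding, need:
  • (1) Symmetrization (neither string is identity)
  • (2) Deal with "exists", "majority"?
  • To resolve (1), use instead $X_a[K]$, the set of $K$ symbols preceding $a$ in $x$
  • Definition 2: $a$ is faulty if there exists $K = 2^k$ such that
  • $|X_a[2^k] \,\Delta\, Y_a[2^k]| > 2^k$ (symmetric difference)

Example: x = 123456789 gives $X_5[4] = \{1,2,3,4\}$; y = 123467895 gives $Y_5[4] = \{6,7,8,9\}$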
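
A sketch of Definition 2 on the example above. It assumes that when fewer than $K$ symbols precede $a$, the window is simply truncated at the start of the string; the talk does not say how that boundary case is handled, and the function names are ours.

```python
def preceding(perm, a, K):
    """Set of (up to) K symbols immediately preceding a in perm (truncated at the start)."""
    pos = perm.index(a)
    return set(perm[max(0, pos - K):pos])


def faulty_chars_symmetric(x, y):
    """Symbols a with |X_a[K] symmetric-difference Y_a[K]| > K for some K = 2^k."""
    faulty = set()
    for a in x:
        K = 1
        while K <= len(x):
            if len(preceding(x, a, K) ^ preceding(y, a, K)) > K:
                faulty.add(a)
                break
            K *= 2
    return faulty


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(preceding(x, 5, 4), preceding(y, 5, 4))  # {1, 2, 3, 4} {6, 7, 8, 9}
print(faulty_chars_symmetric(x, y))            # {5, 6}
```

Symbol 6 is flagged at $K = 1$ because its immediate predecessor changes from 5 to 4; the count of faulty symbols is only meant to track the Ulam distance up to constant factors.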
15
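Thm 1: Embedding, final step
  • We have: $\mathrm{Ulam}(x,y) \approx \sum_a \mathbf{1}\left[\exists K = 2^k : |X_a[2^k] \,\Delta\, Y_a[2^k]| > 2^k\right]$ up to constant factors, where each indicator equals 1 iff the condition is true
  • Replace the indicator by a weight?
  • Final embedding: the term for each character is squared, $(\cdot)^2$

Example: x = 123456789 with $X_5[2^2]$; y = 123467895 with $Y_5[2^2]$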
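
The final formula is not legible in this transcript, so the following is only one plausible reading, under stated assumptions: the weight for symbol $a$ at scale $K = 2^k$ is $|X_a[K] \,\Delta\, Y_a[K]| / (2K)$, the scales are combined by a max, and the per-symbol maxima are squared and summed, matching the shape $(\gamma, \beta, \alpha) = (d, \log d, d)$ of the host space. Windows are truncated at the start of the string, and all names are ours.

```python
def embed(perm):
    """Map a permutation of 1..d to nested lists: symbol -> scale -> scaled indicator vector."""
    d = len(perm)
    position = {a: i for i, a in enumerate(perm)}
    scales = []
    K = 1
    while K <= d:
        scales.append(K)
        K *= 2
    image = []
    for a in range(1, d + 1):
        block = []
        for K in scales:
            # Assumed convention: truncate the window when fewer than K symbols precede a.
            window = set(perm[max(0, position[a] - K):position[a]])
            block.append([1.0 / (2 * K) if s in window else 0.0 for s in range(1, d + 1)])
        image.append(block)
    return image


def product_distance(u, v):
    """Sum over symbols of (max over scales of the l_1 distance) squared."""
    return sum(
        max(sum(abs(p - q) for p, q in zip(ru, rv)) for ru, rv in zip(bu, bv)) ** 2
        for bu, bv in zip(u, v)
    )


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(product_distance(embed(x), embed(y)))  # 2.375, while ED(x, y) = 2
```

The $\ell_1$ distance between two scaled indicator vectors is exactly the size of the symmetric difference divided by $2K$, which is why a plain product-space distance can stand in for the combinatorial test of Definition 2.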
16
Theorem 2
  • Theorem 2. $(\ell_2)^2(\ell_\infty(\ell_1))$ admits NNS on $n$ points with
  • $O(\log\log n)$ approximation
  • $O(n^{\varepsilon})$ query time and $O(n^{1+\varepsilon})$ space for any small $\varepsilon$
  • (ignoring $(\alpha\beta\gamma)^{O(1)}$)
  • A rather general approach:
  • "LSH" on $\ell_1$-products of general metric spaces
  • Of course, cannot do, but can reduce to $\ell_\infty$-products

17
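Thm 2: Proof
  • Let's start from basics: $\ell_1^{\alpha}$
  • [IM98]: $c$-approx with $O(n^{1/c})$ query time and $O(n^{1+1/c})$ space
  • (ignoring $\alpha^{O(1)}$)
  • Ok, what about the $\ell_\infty$-product?
  • [I02]: Suppose NNS for $M$ with
  • $c_M$-approx
  • $Q_M$ query time
  • $S_M$ space.
  • Then NNS for the $\ell_\infty$-product of $M$ with
  • $O(c_M \log\log n)$-approx
  • $O(Q_M)$ query time
  • $O(S_M \cdot n^{1+\varepsilon})$ space.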
18
Thm 2: What about the $(\ell_2)^2$-product?
  • Enough to consider
  • (for us, M is the l1-product)
  • Off-the-shelf?
  • [I04] gives space $n^{(?)}$ or $> \log n$ approximation
  • We reduce to multiple NNS queries under $\ell_\infty$-products
  • Instructive to first look at NNS for standard $\ell_1$

19
Thm 2: Review of NNS for $\ell_1$
  • LSH family: collection $H$ of hash functions such that
  • for random $h \in H$ (parameter $w > 0$):
  • $\Pr[h(q) = h(p)] \approx 1 - \|q - p\|_1 / w$
  • Query just uses the primitive: return all points $p$ such that $h(q) = h(p)$
  • Can obtain $H$ by imposing a randomly-shifted grid of side-length $w$
  • Then for $h$ defined by $r_i \in [0, w]$ at random, the primitive becomes: return all $p$ s.t. $|q_i - p_i| < r_i$ for all $i \in [d]$
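
A minimal sketch of the randomly-shifted grid hash, with an empirical check of the collision probability. The symbol `w` for the grid side-length, the helper name `make_grid_hash`, and the test points are assumptions made for this illustration. For a shifted grid the exact collision probability is $\prod_i (1 - |q_i - p_i| / w)$, which is close to $1 - \|q - p\|_1 / w$ when the points are near each other.

```python
import math
import random


def make_grid_hash(d, w, rng):
    """Draw one hash function: the index of the cell of a randomly shifted grid of side w."""
    shifts = [rng.uniform(0, w) for _ in range(d)]

    def h(p):
        return tuple(math.floor((p_i + s_i) / w) for p_i, s_i in zip(p, shifts))

    return h


rng = random.Random(0)
q, p, w = (0.0, 0.0), (0.2, 0.1), 1.0
trials = 20000
hits = 0
for _ in range(trials):
    h = make_grid_hash(2, w, rng)
    hits += h(q) == h(p)

# Exact value is (1 - 0.2) * (1 - 0.1) = 0.72; the first-order estimate is 1 - 0.3 = 0.7.
print(hits / trials)
```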
20
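Thm 2: LSH for $\ell_1$-product
  • Intuition: abstract LSH!
  • Recall we had:
  • for $r_i$ random from $[0, w]$,
  • point $p$ is returned if for all $i$: $|q_i - p_i| < r_i$
  • Equivalently:
  • for all $i$: $|q_i - p_i| / r_i < 1$, i.e. $\max_i |q_i - p_i| / r_i < 1$, an $\ell_\infty$ product of $\mathbb{R}$!

For $\ell_1$: return all $p$ s.t. $|q_i - p_i| < r_i$ for all $i \in [d]$
For the $\ell_1$-product of $M$: return all points $p$ such that $\max_i d_M(q_i, p_i) / r_i < 1$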
21
Thm 2: Final
  • Thus, sufficient to solve the primitive: return all points $p$ such that $\max_i d_M(q_i, p_i) / r_i < 1$ (in fact, for $k$ independent choices of $(r_1, \ldots, r_d)$)
  • We reduced NNS over the $\ell_1$-product of $M$
  • to several instances of NNS over the $\ell_\infty$-product of $M$
  • (with appropriately scaled coordinates)
  • Approximation is $O(1) \cdot O(\log\log n)$
  • Done!
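
The primitive itself is easy to state in code. The sketch below answers it by a linear scan, which is only meant to show what is being asked; the talk's point is that the same query is an NNS query in an $\ell_\infty$-product with coordinate $i$ scaled by $1/r_i$, which an existing data structure can answer. Hamming distance stands in for the base metric $M$, and the points and thresholds are made up for illustration.

```python
def primitive(points, q, r, d_M):
    """Return all p with max_i d_M(q_i, p_i) / r_i < 1 (brute force, illustration only)."""
    return [
        p for p in points
        if max(d_M(q_i, p_i) / r_i for q_i, p_i, r_i in zip(q, p, r)) < 1
    ]


def hamming(u, v):
    """Hypothetical base metric M: Hamming distance between equal-length strings."""
    return sum(a != b for a, b in zip(u, v))


# Each point has two coordinates, each coordinate being a point of M.
points = [("0000", "1111"), ("0001", "1111"), ("1111", "0000")]
q = ("0000", "1110")

print(primitive(points, q, r=(2.0, 2.0), d_M=hamming))  # first two points
print(primitive(points, q, r=(0.5, 1.5), d_M=hamming))  # only the first point
```

Repeating the call for $k$ independent draws of $(r_1, \ldots, r_d)$ corresponds to the "several instances" mentioned on the slide.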
22
Take-home message
  • Can embed combinatorial metrics into iterated product spaces
  • Works for Ulam (= edit on non-repetitive strings)
  • Approach bypasses non-embeddability results into usual-suspect spaces like $\ell_1$, $(\ell_2)^2$
  • Open:
  • Embeddings for edit over $\{0,1\}^d$, EMD, other metrics?
  • Understanding product spaces?
  • [Jayram-Woodruff]: sketching

Thank you!