Title: CS 361A (Advanced Data Structures and Algorithms)
1. CS 361A (Advanced Data Structures and Algorithms)
- Lecture 19 (Dec 5, 2005)
- Nearest Neighbors, Dimensionality Reduction, and Locality-Sensitive Hashing
- Rajeev Motwani
2. Metric Space
- Metric space (M, D)
  - For points p, q in M, D(p,q) is the distance from p to q
  - The only reasonable model for high-dimensional geometric space
- Defining properties
  - Reflexivity: D(p,q) = 0 if and only if p = q
  - Symmetry: D(p,q) = D(q,p)
  - Triangle inequality: D(p,q) ≤ D(p,r) + D(r,q)
- Interesting cases
  - M → points in d-dimensional space
  - D → Hamming distance or Lp-norms (e.g., Euclidean); see the sketch below
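To make the two interesting cases concrete, here is a minimal sketch (not from the lecture) of the Hamming and Euclidean distances; both can be checked against the three defining properties.

```python
def hamming(p, q):
    """Hamming distance: number of coordinates where two bit-vectors differ."""
    assert len(p) == len(q)
    return sum(a != b for a, b in zip(p, q))

def euclidean(x, y):
    """Euclidean (L2) distance between points in R^d."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

# Both are reflexive, symmetric, and satisfy the triangle inequality.
print(hamming([0, 1, 1, 0, 0], [0, 1, 0, 1, 0]))  # 2
print(euclidean((0.0, 0.0), (3.0, 4.0)))          # 5.0
```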
3. High-Dimensional Near Neighbors
- Nearest Neighbors data structure
  - Given N points P = {p1, …, pN} in metric space (M, D)
  - Query: which point p ∈ P is closest to query point q?
  - Complexity: trade off preprocessing space against query time
- Applications
  - Vector quantization
  - Multimedia databases
  - Data mining
  - Machine learning
4. Known Results

| Query Time | Storage | Technique | Paper |
|---|---|---|---|
| dN | dN | Brute force | |
| 2^d log N | N^{2^{d+1}} | Voronoi diagram | Dobkin-Lipton 76 |
| d^{d/2} log N | N^{d/2} | Random sampling | Clarkson 88 |
| d^5 log N | N^d | Combination | Meiser 93 |
| log^{d-1} N | N log^{d-1} N | Parametric search | Agarwal-Matousek 92 |

- Some expressions are approximate
- Bottom line: exponential dependence on d
5. Approximate Nearest Neighbor
- Exact algorithms
  - Benchmark: brute force needs space O(N) and query time O(N)
  - Known results: exponential dependence on dimension
  - Theory/practice: no better than brute-force search
- Approximate near neighbors
  - Given N points P = {p1, …, pN} in metric space (M, D)
  - Given error parameter ε > 0
  - Goal: for query q with nearest neighbor p, return a point r such that D(q,r) ≤ (1+ε)·D(q,p)
- Justification
  - Mapping objects to a metric space is heuristic anyway
  - Get a tremendous performance improvement
6. Results for Approximate NN

| Query Time | Storage | Technique | Paper |
|---|---|---|---|
| d^d ε^{-d} | dN | Balanced trees | Arya et al. 94 |
| d^2 polylog(N,d) | N^{2d} | Random projection | Kleinberg 97 |
| N | dN polylog(N,d) | Random projection | Kleinberg 97 |
| log^3 N | N^{O(1/ε^2)} | Search trees + dimension reduction | Indyk-Motwani 98 |
| dN^{1/(1+ε)} log^2 N | N^{1+1/(1+ε)} log N | Locality-sensitive hashing | Indyk-Motwani 98 |
| (external memory) | (external memory) | Locality-sensitive hashing | Gionis-Indyk-Motwani 99 |

- Will show the main ideas of the last 3 results
- Some expressions are approximate
7. Approximate r-Near Neighbors
- Given N points P = {p1, …, pN} in metric space (M, D)
- Given error parameter ε > 0 and distance threshold r > 0
- Query
  - If there is no point p with D(q,p) < r, may return FAILURE
  - Else, return any p with D(q,p) < (1+ε)r
- Application: solving Approximate Nearest Neighbor
  - Assume the maximum distance is R
  - Run in parallel for thresholds r = 1, (1+ε), (1+ε)^2, …, R (see the sketch below)
  - Time/space overhead: O(log R)
  - Indyk-Motwani reduce the overhead to O(polylog N)
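A minimal sketch of this reduction, assuming a hypothetical solver near_neighbor(q, r) for the fixed-threshold problem; the thresholds are tried in geometric progression (sequentially here, rather than literally in parallel):

```python
def approx_nn(q, near_neighbor, R, eps):
    """Reduce approximate NN to approximate r-near-neighbor queries.

    near_neighbor(q, r) is assumed to return a point within (1+eps)*r
    of q whenever some point lies within r, and None otherwise.
    Trying r = 1, (1+eps), (1+eps)^2, ..., R costs an O(log R) factor.
    """
    r = 1.0
    while r <= R * (1 + eps):
        p = near_neighbor(q, r)
        if p is not None:
            return p  # first success: distance about (1+eps)^2 times optimal
        r *= 1 + eps
    return None  # no point within distance R of q
```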
8. Hamming Metric
- Hamming space
  - Points in M: bit vectors {0,1}^d (can generalize to {0,1,2,…,q}^d)
  - Hamming distance D(p,q) = number of positions where p and q differ
- Remarks
  - Simplest high-dimensional setting
  - Still useful in practice
  - In theory, as hard (or easy) as Euclidean space
  - Trivial in low dimensions
- Example
  - Hypercube in d = 3 dimensions
  - 000, 001, 010, 011, 100, 101, 110, 111
9. Dimensionality Reduction
- Overall idea
  - Map from high dimension to low dimension
  - Preserve distances approximately
  - Solve Nearest Neighbors in the new space
  - Performance improvement at the cost of approximation error
- Mapping?
  - Hash function family H = {H1, …, Hm}
  - Each Hi: {0,1}^d → {0,1}^t with t << d
  - Pick HR from H uniformly at random
  - Map each point in P using the same HR
  - Solve the NN problem on HR(P) = {HR(p1), …, HR(pN)}
10. Reduction for Hamming Spaces
- Theorem: For any r and small ε > 0, there is a hash family H such that for any p, q and random HR ∈ H, with probability > 1-δ:
  - D(p,q) < r ⇒ D(HR(p), HR(q)) < c(1 + ε/12)·t
  - D(p,q) > (1+ε)r ⇒ D(HR(p), HR(q)) > c(1 + ε/12)·t
  provided t ≥ C·log(2/δ)/ε^2, for some constant C
11. Remarks
- For fixed threshold r, can distinguish between
  - Near: D(p,q) < r
  - Far: D(p,q) > (1+ε)r
- For N points, need δ < 1/N^2 (union bound over all pairs)
- Yet, can reduce to an O(log N)-dimensional space while approximately preserving distances
- Works even if the points are not known in advance
12. Hash Family
- Projection function
  - Let S be an ordered multiset of s indexes from {1, …, d}
  - p_S: {0,1}^d → {0,1}^s projects p into an s-dimensional subspace
- Example
  - d = 5, p = 01100
  - s = 3, S = (2,2,4) ⇒ p_S = 110
- Choosing hash function HR in H (see the sketch below)
  - Repeat for i = 1, …, t
    - Pick Si randomly (with replacement) from {1, …, d}
    - Pick a random hash function fi: {0,1}^s → {0,1}
    - hi(p) = fi(p_Si)
  - HR(p) = (h1(p), h2(p), …, ht(p))
- Remark: note the similarity to Bloom filters
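A minimal sketch of this construction; the random functions fi are realized lazily as dictionaries of coin flips (an implementation convenience, not from the lecture):

```python
import random

def make_HR(d, t, s, seed=None):
    """Draw one hash function HR from the family on this slide."""
    rng = random.Random(seed)
    # For i = 1..t: an index multiset S_i of size s, sampled with replacement.
    index_sets = [[rng.randrange(d) for _ in range(s)] for _ in range(t)]
    tables = [{} for _ in range(t)]  # lazy random functions f_i: {0,1}^s -> {0,1}

    def HR(p):  # p: sequence of d bits
        out = []
        for S, f in zip(index_sets, tables):
            proj = tuple(p[j] for j in S)   # the projection p_{S_i}
            if proj not in f:
                f[proj] = rng.randrange(2)  # f_i's value, fixed on first use
            out.append(f[proj])             # h_i(p) = f_i(p_{S_i})
        return tuple(out)

    return HR

HR = make_HR(d=5, t=8, s=3, seed=0)
print(HR((0, 1, 1, 0, 0)))  # the 5-bit point mapped to 8 bits
```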
13. Illustration of Hashing
[Figure: a d-bit point p is projected onto index sets S1, …, St; each s-bit projection p_Si passes through a random function fi to give one bit hi(p), and the results are concatenated into HR(p) = (h1(p), …, ht(p)).]
14. Analysis I
- Choose a random index set S of size s
- Claim: for any p, q: Pr[p_S = q_S] = (1 - D(p,q)/d)^s
- Why?
  - p and q differ in D(p,q) bit positions
  - Need all s indexes of S to avoid these positions
  - Sampling with replacement from {1, …, d}
15. Analysis II
- Choose s = d/r
- Since 1-x ≤ e^{-x} for x < 1, we obtain Pr[p_S = q_S] = (1 - D(p,q)/d)^{d/r} ≈ e^{-D(p,q)/r}
- Thus
  - D(p,q) < r ⇒ Pr[p_S = q_S] ≳ e^{-1}
  - D(p,q) > (1+ε)r ⇒ Pr[p_S = q_S] ≤ e^{-(1+ε)}
16. Analysis III
- Recall hi(p) = fi(p_Si)
- Thus Pr[hi(p) ≠ hi(q)] = ½·(1 - Pr[p_Si = q_Si]), since a random fi separates two distinct projections with probability ½
- Choosing c = ½(1 - e^{-1}):
  - D(p,q) < r ⇒ Pr[hi(p) ≠ hi(q)] ≤ c
  - D(p,q) > (1+ε)r ⇒ Pr[hi(p) ≠ hi(q)] ≥ ½(1 - e^{-(1+ε)}) ≥ c(1 + ε/6)
17. Analysis IV
- Recall HR(p) = (h1(p), h2(p), …, ht(p))
- D(HR(p), HR(q)) = number of i's where hi(p) and hi(q) differ
- By linearity of expectation: E[D(HR(p), HR(q))] = t·Pr[hi(p) ≠ hi(q)]
- Theorem almost proved (in expectation)
- For a high-probability bound, need the Chernoff bound
18. Chernoff Bound
- Consider Bernoulli random variables X1, X2, …, Xn
  - Values are 0-1
  - Pr[Xi = 1] = x and Pr[Xi = 0] = 1-x
- Define X = X1 + X2 + … + Xn, with E[X] = nx
- Theorem: for independent X1, …, Xn and any 0 < β < 1:
  Pr[ |X - nx| > β·nx ] ≤ 2·e^{-β^2·nx/3}
  (a quick numerical check follows below)
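As a sanity check (not part of the lecture), a small Monte Carlo estimate of the deviation probability against the bound, with arbitrary illustrative parameters:

```python
import math, random

n, x, beta = 1000, 0.3, 0.2
rng = random.Random(1)
trials = 5000
# Fraction of trials where X = sum of n Bernoulli(x) deviates from nx by > beta*nx.
hits = sum(
    abs(sum(rng.random() < x for _ in range(n)) - n * x) > beta * n * x
    for _ in range(trials)
)
print(hits / trials, "<=", 2 * math.exp(-beta**2 * n * x / 3))  # bound ~ 0.037
```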
19. Analysis V
- Define
  - Xi = 0 if hi(p) = hi(q), and 1 otherwise
  - n = t
- Then X = X1 + X2 + … + Xt = D(HR(p), HR(q))
- Case 1: D(p,q) < r ⇒ x ≤ c
- Case 2: D(p,q) > (1+ε)r ⇒ x ≥ c(1 + ε/6)
- Observe the sloppy bounding of constants in Case 2
20. Putting It All Together
- Recall: X = D(HR(p), HR(q)) with E[X] = tx, and the two cases are separated by a relative gap of ε/6
- Thus, by the Chernoff bound, the error probability is at most 2·e^{-Ω(ε^2·t·c)}
- Choosing C = 1200/c in t = C·log(2/δ)/ε^2 makes this at most δ
- Theorem is proved!
21. Algorithm I
- Set error probability δ = 1/N^2 (so the guarantee holds for all points simultaneously)
- Select hash HR and map points p → HR(p)
- Processing query q
  - Compute HR(q)
  - Find the nearest neighbor HR(p) of HR(q)
  - If D(HR(q), HR(p)) ≤ c(1 + ε/12)·t, then return p, else FAILURE
- Remarks
  - Brute force for finding HR(p) implies query time O(N·t) = O(N·ε^{-2}·log N)
  - Need another approach for the lower-dimensional space
22. Algorithm II
- Fact: exact nearest neighbors in {0,1}^t requires only
  - Space O(2^t)
  - Query time O(t)
- How? (see the sketch below)
  - Precompute/store the answers to all queries
  - Number of possible queries is 2^t
- Since t = O(ε^{-2}·log N), the table has 2^t = N^{O(1/ε^2)} entries
- Theorem: in Hamming space {0,1}^d, can solve approximate nearest neighbor with
  - Space N^{O(1/ε^2)}
  - Query time O(d·ε^{-2}·log N) (dominated by computing HR(q))
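A minimal sketch of the precomputation, assuming the points have already been mapped into {0,1}^t by HR:

```python
from itertools import product

def build_table(mapped, t):
    """Precompute the exact NN answer for every possible query in {0,1}^t.

    mapped: list of (original_point, image) pairs, where image is a t-bit tuple.
    Space O(2^t); afterwards each query is a single O(t) table lookup.
    """
    def hamming(a, b):
        return sum(u != v for u, v in zip(a, b))

    return {
        cell: min(mapped, key=lambda pair: hamming(pair[1], cell))[0]
        for cell in product((0, 1), repeat=t)
    }

# Usage: table = build_table(mapped, t); answer = table[HR(q)]
```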
23. Different Metric
- Many applications have sparse points
  - Many dimensions but few 1s
  - Example: points = documents, dimensions = words
  - Better to view the points as sets
  - The previous approach would require large s
- For sets A, B, define sim(A,B) = |A ∩ B| / |A ∪ B|
- Observe
  - A = B ⇒ sim(A,B) = 1
  - A, B disjoint ⇒ sim(A,B) = 0
- Question: handling D(A,B) = 1 - sim(A,B)?
24. Min-Hash
- Random permutations π1, …, πt of the universe (the dimensions)
- Define the mapping hj(A) = min_{a ∈ A} πj(a)
- Fact: Pr[hj(A) = hj(B)] = sim(A,B)
- Proof? Already seen!
- Overall hash function: HR(A) = (h1(A), h2(A), …, ht(A))  (sketch below)
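A minimal sketch using explicit permutations (fine for a small universe; real systems would substitute random hash functions):

```python
import random

def make_minhash(universe_size, t, seed=None):
    """t random permutations pi_j; h_j(A) = min over a in A of pi_j(a)."""
    rng = random.Random(seed)
    perms = []
    for _ in range(t):
        pi = list(range(universe_size))
        rng.shuffle(pi)
        perms.append(pi)
    return lambda A: tuple(min(pi[a] for a in A) for pi in perms)

# The fraction of agreeing coordinates estimates sim(A,B) = |A ∩ B| / |A ∪ B|.
HR = make_minhash(universe_size=100, t=500, seed=0)
A, B = {1, 2, 3, 4}, {3, 4, 5, 6}
sa, sb = HR(A), HR(B)
print(sum(u == v for u, v in zip(sa, sb)) / len(sa))  # close to 2/6
```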
25. Min-Hash Analysis
- Select t = C·log(2/δ)/ε^2, as before
- Hamming distance: D(HR(A), HR(B)) = number of j's such that hj(A) ≠ hj(B)
- Theorem: for any A, B, with probability > 1-δ: (1-ε)·t·D(A,B) ≤ D(HR(A), HR(B)) ≤ (1+ε)·t·D(A,B)
- Proof? Exercise (apply the Chernoff bound)
- Obtain an ANN algorithm similar to the earlier result
26. Generalization
- Goal
  - Abstract the technique used for Hamming space
  - Enable application to other metric spaces
  - Handle dynamic ANN
- Dynamic Approximate r-Near Neighbors
  - Fix threshold r
  - Query: if any point lies within distance r of q, return any point within distance (1+ε)r
  - Allow insertions/deletions of points in P
- Recall: the earlier method required preprocessing all possible queries in the hash range space
27. Locality-Sensitive Hashing
- Fix metric space (M, D), threshold r, and error ε > 0
- Choose probability parameters Q1 > Q2 > 0
- Definition: a hash family H = {h: M → S} for (M, D) is called (r, (1+ε)r, Q1, Q2)-sensitive if, for random h and for any p, q in M:
  - D(p,q) < r ⇒ Pr[h(p) = h(q)] > Q1
  - D(p,q) > (1+ε)r ⇒ Pr[h(p) = h(q)] < Q2
- Intuition
  - p, q near ⇒ likely to collide
  - p, q far ⇒ unlikely to collide
28. Examples
- Hamming space M = {0,1}^d
  - Point p = b1…bd
  - H = {hi(b1…bd) = bi, for i = 1, …, d}, i.e., sampling one bit at random
  - Pr[hi(q) = hi(p)] = 1 - D(p,q)/d
- Set similarity D(A,B) = 1 - sim(A,B)
  - Recall Pr[h(A) = h(B)] = sim(A,B)
  - H = {min-hash functions over random permutations}
  - Pr[h(A) = h(B)] = 1 - D(A,B)
29. Multi-Index Hashing
- Overall idea
  - Fix an LSH family H
  - Boost the Q1-Q2 gap by defining G = H^k
  - Using G, each point hashes into l buckets
- Intuition
  - r-near neighbors are likely to collide
  - Few non-near pairs land in any bucket
- Define
  - G = {g : g(p) = h1(p)·h2(p)···hk(p)}  (concatenation of k hashes from H)
  - Hamming metric ⇒ sample k random bits
30. Example (l = 4)
[Figure: a point p and a query q hashed by l = 4 functions g1, …, g4, each concatenating k bit-samples h1, …, hk; an r-near pair collides in at least one of the four tables.]
31. Overall Scheme
- Preprocessing
  - Prepare a hash table for the range of G
  - Select l hash functions g1, g2, …, gl
- Insert(p): add p to buckets g1(p), g2(p), …, gl(p)
- Delete(p): remove p from buckets g1(p), g2(p), …, gl(p)
- Query(q)
  - Check buckets g1(q), g2(q), …, gl(q)
  - Report the nearest of (say) the first 3l points
- Complexity (a code sketch follows below)
  - Assume computing D(p,q) needs O(d) time
  - Assume storing p needs O(d) space
  - Insert/Delete/Query time: O(d·l·k)
  - Preprocessing/Storage: O(dN + N·l·k)
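A minimal sketch of the whole scheme for the Hamming metric, with points stored as bit tuples and each g_j realized as k sampled bit positions:

```python
import random
from collections import defaultdict

class MultiIndexLSH:
    """Multi-index hashing over {0,1}^d via bit sampling (slides 29-31)."""

    def __init__(self, d, k, l, seed=None):
        rng = random.Random(seed)
        # l functions g_j, each a concatenation of k sampled bit positions.
        self.gs = [[rng.randrange(d) for _ in range(k)] for _ in range(l)]
        self.tables = [defaultdict(list) for _ in range(l)]

    def _keys(self, p):
        return [tuple(p[i] for i in g) for g in self.gs]

    def insert(self, p):
        for table, key in zip(self.tables, self._keys(p)):
            table[key].append(p)

    def delete(self, p):
        for table, key in zip(self.tables, self._keys(p)):
            if p in table[key]:
                table[key].remove(p)

    def query(self, q):
        """Check q's l buckets; report the nearest of the first 3l candidates."""
        best, best_dist, seen = None, None, 0
        for table, key in zip(self.tables, self._keys(q)):
            for p in table[key]:
                dist = sum(a != b for a, b in zip(p, q))
                if best is None or dist < best_dist:
                    best, best_dist = p, dist
                seen += 1
                if seen >= 3 * len(self.tables):
                    return best
        return best
```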
32. Collision Probability vs. Distance
[Figure: collision probability as a function of D(p,q), decreasing from 1 to 0; it stays above Q1 for distances below r and drops below Q2 beyond (1+ε)r.]
33. Multi-Index versus Error
- Set k = log_{1/Q2} N and l = N^z, where z = ln(1/Q1) / ln(1/Q2)
- Theorem: for l = N^z, any query returns an r-near neighbor correctly with probability at least 1/6
- Consequently (ignoring k = O(log N) factors)
  - Time O(d·N^z)
  - Space O(N^{1+z})
- Hamming metric ⇒ z ≤ 1/(1+ε)
- Boost probability: use several parallel hash tables (parameter helper below)
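A small helper (a sketch with illustrative numbers) computing these parameters for the bit-sampling family, where Q1 = 1 - r/d and Q2 = 1 - (1+ε)r/d:

```python
import math

def lsh_params(N, Q1, Q2):
    """k = log_{1/Q2} N, z = ln(1/Q1)/ln(1/Q2), l = N^z (slide 33)."""
    k = math.ceil(math.log(N) / math.log(1 / Q2))
    z = math.log(1 / Q1) / math.log(1 / Q2)
    l = math.ceil(N ** z)
    return k, z, l

d, r, eps, N = 128, 8, 1.0, 10**6   # hypothetical numbers
k, z, l = lsh_params(N, 1 - r / d, 1 - (1 + eps) * r / d)
print(k, z, l)  # z comes out below 1/(1+eps) = 0.5, as the slide states
```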
34. Analysis
- Define (for a fixed query q)
  - p = some point with D(q,p) < r (assuming one exists)
  - FAR(q) = all points p' with D(q,p') > (1+ε)r
  - BUCKET(q,j) = all points p' with gj(p') = gj(q)
- Event E_size: the l buckets of q contain at most 3l points from FAR(q)
  (⇒ query cost bounded by O(dl))
- Event E_NN: gj(p) = gj(q) for some j
  (⇒ the nearest point in the l buckets is an r-near neighbor)
- Analysis
  - Show Pr[E_size] = x > 2/3 and Pr[E_NN] = y > 1/2
  - Thus Pr[not(E_size and E_NN)] ≤ (1-x) + (1-y) < 1/3 + 1/2 = 5/6
35. Analysis: Bad Collisions
- Choose k = log_{1/Q2} N
- Fact: for any far point p' ∈ FAR(q), Pr[gj(p') = gj(q)] < Q2^k = 1/N
- Clearly: E[number of far points in the l buckets] < N·(1/N)·l = l
- Markov inequality: Pr[X > r·E[X]] < 1/r, for X > 0
- With r = 3: Pr[more than 3l far points] < 1/3
- Lemma 1: Pr[E_size] > 2/3
36. Analysis: Good Collisions
- Observe: for an r-near neighbor p, Pr[gj(p) = gj(q)] ≥ Q1^k = Q1^{log_{1/Q2} N} = N^{-z}
- Since l = N^z ⇒ Pr[gj(p) ≠ gj(q) for all j] ≤ (1 - N^{-z})^{N^z} ≤ 1/e
- Lemma 2: Pr[E_NN] > 1/2
37. Euclidean Norms
- Recall, for x = (x1, x2, …, xd) and y = (y1, y2, …, yd) in R^d:
- L1-norm: D(x,y) = |x1-y1| + |x2-y2| + … + |xd-yd|
- Lp-norm (for p > 1): D(x,y) = (|x1-y1|^p + … + |xd-yd|^p)^{1/p}
38. Extension to L1-Norm
- Round coordinates to integers in {1, …, M}
- Embed L1-{1,…,M}^d into Hamming {0,1}^{dM}
- Unary mapping: coordinate value v becomes v ones followed by M-v zeros, so the Hamming distance of the images equals the L1 distance (sketch below)
- Apply the algorithm for Hamming spaces
- Error due to rounding: of order 1/M
- Space/time overhead due to the mapping: d → dM
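A minimal sketch of the unary embedding, with a check that Hamming distance of the images equals the original L1 distance:

```python
def unary_embed(x, M):
    """Map a point with integer coordinates in {1,...,M} into {0,1}^(d*M):
    coordinate value v becomes v ones followed by M - v zeros."""
    bits = []
    for v in x:
        bits.extend([1] * v + [0] * (M - v))
    return bits

a, b = (3, 1, 4), (1, 1, 2)
ha, hb = unary_embed(a, M=5), unary_embed(b, M=5)
l1 = sum(abs(u - v) for u, v in zip(a, b))   # L1 distance = 4
ham = sum(x != y for x, y in zip(ha, hb))    # Hamming distance of the images
print(l1 == ham)  # True
```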
39. Extension to L2-Norm
- Observe
  - Little difference between the L1-norm and L2-norm in high dimension d
  - The additional error is small
- More generally, Lp for 1 ≤ p ≤ 2
  - Figiel et al. 1977, Johnson-Schechtman 1982
  - Can embed Lp into L1
  - Dimension: d → O(d)
  - Distances preserved within a factor (1+α)
  - Key idea: random rotation of the space
40. Improved Bounds
- Indyk-Motwani 1998
  - For any Lp-norm
  - Query time O(log^3 N)
  - Space N^{O(1/ε^2)}
- Problem: impractical
- Today: only a high-level sketch
41. Better Reduction
- Recall
  - Reduced Approximate Nearest Neighbors to Approximate r-Near Neighbors
  - Space/time overhead O(log R)
  - R = maximum distance in the metric space
- Ring-Cover Trees
  - Remove the dependence on R
  - Reduce the overhead to O(polylog N)
42. Approximate r-Near Neighbors
- Idea
  - Impose a regular grid on R^d
  - Decompose space into cubes of side length s
  - Label each cube with the points at distance < r from it
- Data structure
  - Query q: determine the cube containing q
  - That cube's labels are the candidate r-near neighbors
- Goals
  - Small s ⇒ lower error
  - Fewer cubes ⇒ smaller storage
43. [Figure: a regular grid with points p1, p2, p3; each cube is labeled with the points whose radius-r balls intersect it.]
44. Grid Analysis
- Assume r = 1
- Choose s = ε/√d
- Cube diameter: s·√d = ε
- Number of cubes: N·O(1/ε)^d (cells within distance 1 of some point)
- Theorem: for any Lp-norm, can solve Approximate r-Near Neighbor using
  - Space N·O(1/ε)^d
  - Time O(d)
- (a code sketch follows below)
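A minimal sketch of the grid structure (practical only for small d, since the label lists grow like O(1/ε)^d per point, exactly the storage the theorem states):

```python
import itertools, math
from collections import defaultdict

def build_grid(points, r=1.0, eps=0.5):
    """Cubes of side s = eps/sqrt(d), so each cube has L2 diameter eps.
    A cube is (conservatively) labeled with point p when its center lies
    within r + eps/2 of p, i.e., the cube may contain r-near queries."""
    d = len(points[0])
    s = eps / math.sqrt(d)
    grid = defaultdict(list)
    reach = math.ceil((r + eps) / s)  # cube offsets to scan per axis
    for p in points:
        base = [math.floor(c / s) for c in p]
        for off in itertools.product(range(-reach, reach + 1), repeat=d):
            cell = tuple(b + o for b, o in zip(base, off))
            center = [(c + 0.5) * s for c in cell]
            if math.dist(center, p) <= r + eps / 2:
                grid[cell].append(p)
    return grid, s

def candidates(grid, s, q):
    """O(d) lookup: the labels of q's cube are the candidate near neighbors."""
    return grid.get(tuple(math.floor(c / s) for c in q), [])
```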
45. Dimensionality Reduction
- Johnson-Lindenstrauss 84, Frankl-Maehara 88: for any 0 < α < 1, can map the points in P into a subspace of dimension O(α^{-2}·log N) while preserving all inter-point distances to within a factor (1+α)
- Proof idea: project onto random lines (sketch below)
- Result for NN (combining the grid with the dimension reduction)
  - Space N^{O(1/ε^2)} (approximately)
  - Time O(polylog N)
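A minimal sketch of the random projection, using Gaussian entries scaled by 1/√k (one standard way to realize the lemma):

```python
import math, random

def jl_project(points, k, seed=None):
    """Project d-dimensional points onto k random directions so that,
    for k = O(log N / alpha^2), all pairwise L2 distances are preserved
    to within a (1 + alpha) factor with high probability."""
    rng = random.Random(seed)
    d = len(points[0])
    R = [[rng.gauss(0, 1) / math.sqrt(k) for _ in range(d)] for _ in range(k)]
    return [tuple(sum(row[i] * p[i] for i in range(d)) for row in R)
            for p in points]

rng = random.Random(42)
pts = [tuple(rng.gauss(0, 1) for _ in range(200)) for _ in range(3)]
low = jl_project(pts, k=64, seed=0)
# Distances before and after projection should be close.
print(round(math.dist(pts[0], pts[1]), 2), "vs", round(math.dist(low[0], low[1]), 2))
```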
46. References
- P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.
- A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. VLDB 1999.