Title: Dimension Reduction in the Hamming Cube (and its Applications)
1. Dimension Reduction in the Hamming Cube (and its Applications)
Rafail Ostrovsky, UCLA (joint works with Rabani, and with Kushilevitz and Rabani)
2. PLAN
- Problem Formulations
- Communication complexity game
- What really happened? (dimension reduction)
- Solutions to 2 problems
- ANN
- k-clustering
- What's next?
3. Problem statements
- Johnson-Lindenstrauss lemma: n points in a high-dimensional Hilbert space can be embedded into an O(log n)-dimensional subspace with small distortion.
- Q: how do we do it for the Hamming cube?
- (We show how to avoid the impossibility result of Charikar-Sahai.)
4. Many different formulations of ANN
- ANN: approximate nearest neighbor search
- (many applications in computational geometry, biology/stringology, IR, and other areas)
- Here are different formulations
5. Approximate Searching
- Motivation: given a DB of names and a user with a target name, find whether any of the DB names are close to the target name, without doing a linear scan.
[Figure: DB of names (Jon, Alice, Bob, Eve, Panconesi, Kate, Fred); query: A.Panconesi]
6. Geometric formulation
- Nearest Neighbor Search (NNS): given N blue points (and a distance function, say Euclidean distance in R^d), store all these points somehow.
7. Data structure question
- Given a new red point, find the closest blue point.
- Naive solution 1: store the blue points as-is and, when given a red point, measure distances to all blue points. Q: can we do better?
8. Can we do better?
- Easy in small dimensions (Voronoi diagrams)
- Curse of dimensionality in high dimensions
- KOR: can get a good approximate solution efficiently!
9. Hamming Cube Formulation for ANN
- Given a DB of N blue n-bit strings, process them somehow. Given an n-bit red string, find its ANN in the hypercube {0,1}^n.
- Naïve solution 2: pre-compute all (exponentially many) answers (we want small data structures!)
[Figure: DB of n-bit strings (00101011, 01011001, 11101001, 10110110, 11010101, 11011000, 10101010, 10101111); query: 11010100]
10. Clustering problem that I'll discuss in detail
11. An example of clustering: find centers
12. A clustering formulation
13. Clustering formulation
- The cost is the sum of distances (formalized below).
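In the notation of the later slides (N data points x_1, ..., x_N and k centers c_1, ..., c_k chosen from the ambient space, as on slides 32 and 36), this cost is

\[
\mathrm{cost}(c_1,\dots,c_k) \;=\; \sum_{i=1}^{N} \min_{1 \le j \le k} d(x_i, c_j).
\]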
14. Main technique
- First, as a communication game
- Second, interpreted as a dimension reduction
15. COMMUNICATION COMPLEXITY GAME
- Given two players, Alice and Bob:
- Alice is secretly given string x
- Bob is secretly given string y
- They want to estimate the Hamming distance between x and y with small communication (and small error), provided that they have common randomness.
- How can they do it? (say the length of x and y is N)
- Much easier: how do we check that x = y? (a sketch follows below)
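As a warm-up for the last bullet (my own minimal sketch, not from the slides): with shared randomness, Alice and Bob can test x = y by comparing parities of random subsets of their bits. If x ≠ y, a uniformly random mask R catches the difference with probability exactly 1/2, so t rounds (t bits of communication) give error at most 2^-t.

```python
import random

def parity(s, R):
    """Parity (inner product mod 2) of the bits of s selected by the mask R."""
    return sum(b & r for b, r in zip(s, R)) % 2

def equality_test(x, y, t=32, seed=0):
    """Randomized equality test with shared randomness: always accepts when
    x == y; when x != y, each round catches the difference with probability 1/2,
    so it wrongly accepts with probability at most 2**-t."""
    rng = random.Random(seed)                       # models the common random string
    n = len(x)
    for _ in range(t):
        R = [rng.randint(0, 1) for _ in range(n)]   # uniform mask known to both players
        if parity(x, R) != parity(y, R):            # Alice sends 1 bit; Bob compares
            return False
    return True

# toy usage
x = [0, 1, 0, 1, 1, 0, 1, 0]
y = [0, 1, 0, 1, 0, 0, 1, 0]
print(equality_test(x, x), equality_test(x, y))     # True False (w.h.p.)
```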
16. Main lemma: an abstract game
- How can Alice and Bob estimate the Hamming distance between X and Y with small CC?
- We assume Alice and Bob share randomness.
ALICE: X1 X2 X3 X4 ... Xn
BOB:   Y1 Y2 Y3 Y4 ... Yn
17. A simpler question
- To estimate the Hamming distance between X and Y (within a factor of (1 + ε)) with small CC, it is sufficient for Alice and Bob, for any L, to be able to distinguish
  - H(X,Y) < L, OR
  - H(X,Y) > (1 + ε) L
- Q: why does sampling not work?
ALICE: X1 X2 X3 X4 ... Xn
BOB:   Y1 Y2 Y3 Y4 ... Yn
18. Alice and Bob pick the SAME n-bit blue R: each bit of R is 1 independently with probability 1/(2L)
[Figure: Alice and Bob hold identical copies of the random mask R; below it, their inputs X and Y.]
19. What is the difference in probabilities between H(X,Y) < L and H(X,Y) > (1 + ε) L?
[Figure: Alice XORs the bits of X selected by R, Bob XORs the bits of Y selected by the same R; each obtains a single 0/1 output bit.]
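The slides leave the calculation implicit; as a hedged reconstruction of the standard argument: the two output bits differ exactly when R selects an odd number of the H(X,Y) coordinates where X and Y disagree, and since each bit of R is 1 independently with probability 1/(2L),

\[
\Pr\bigl[\langle R,X\rangle \oplus \langle R,Y\rangle = 1\bigr]
  \;=\; \frac{1-\left(1-\tfrac{1}{L}\right)^{H(X,Y)}}{2}
  \;\approx\; \frac{1-e^{-H(X,Y)/L}}{2}.
\]

So the probability is roughly (1 - e^{-1})/2 ≈ 0.316 when H(X,Y) ≈ L, and larger by Θ(ε) when H(X,Y) ≥ (1 + ε)L; this Θ(ε) gap is what the next slides amplify by repetition.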
20. How do we amplify?
[Figure: the same single-R experiment as on the previous slide.]
21. How do we amplify?
- Repeat, with many independent R's, but the same distribution!
[Figure: the single-R experiment repeated with independent masks.]
22. A refined game with small communication
- How can Alice and Bob distinguish
  - H(X,Y) < L, OR
  - H(X,Y) > (1 + ε) L?
- ALICE (X1 X2 X3 X4 ... Xn): for each R, XOR the subset of Xi selected by R.
- BOB (Y1 Y2 Y3 Y4 ... Yn): for each R, XOR the same subset of Yi.
- Pick 1/ε² · log N R's with the correct distribution, and compare this linear transformation of the two inputs.
23. Dimension Reduction in the Hamming Cube [OR]
- For each L, we can pick O(log N) R's and boost the probabilities!
- Key property: we get an embedding from the large cube to a small cube that preserves ranges around L very well.
24. Dimension Reduction in the Hamming Cube [OR]
- For each L, we can pick O(log N) R's and boost the probabilities!
- Key property: we get an embedding from the large cube to a small cube that preserves ranges around L.
- Key idea in applications: we can build an inverse lookup table for the small cube! (a sketch of the embedding follows below)
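To make this concrete, here is a minimal sketch of the L-range embedding as I read slides 18-24 (the name make_embedding and the toy parameters are mine; t stands in for the O(ε^-2 · log N) repetitions): the embedding is the linear map x ↦ (⟨R_1,x⟩, ..., ⟨R_t,x⟩) mod 2, and Hamming distances below L versus above (1 + ε)L concentrate around visibly different values in the small cube.

```python
import random

def make_embedding(n, L, t, seed=0):
    """Sample t random masks R_1..R_t over {0,1}^n; each bit of each mask is 1
    independently with probability 1/(2L), as on slide 18.  Returns a map from
    an n-bit string to its t parity bits, i.e. a point of the small cube {0,1}^t."""
    rng = random.Random(seed)                           # the shared randomness
    p = 1.0 / (2 * L)
    masks = [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(t)]

    def embed(x):
        return tuple(sum(xi & ri for xi, ri in zip(x, R)) % 2 for R in masks)

    return embed

def hamming(a, b):
    return sum(ai != bi for ai, bi in zip(a, b))

# toy usage with illustrative (not the paper's) parameters: a pair at distance
# below L lands noticeably closer in the small cube than a pair above (1 + eps) L
n, L, t = 1000, 50, 400
embed = make_embedding(n, L, t, seed=1)
x = [random.randint(0, 1) for _ in range(n)]
y_close, y_far = x[:], x[:]
for i in random.sample(range(n), 25):
    y_close[i] ^= 1                                     # H(x, y_close) = 25  <  L
for i in random.sample(range(n), 100):
    y_far[i] ^= 1                                       # H(x, y_far) = 100 > 1.5 * L
print(hamming(embed(x), embed(y_close)), hamming(embed(x), embed(y_far)))
```

With these toy numbers the two printed distances should come out around 80 and 170 out of t = 400, matching the single-R probabilities from slide 19 times t.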
25. Applications
- Applications of the dimension reduction in the Hamming cube:
- For ANN in the Hamming cube and R^d
- For k-clustering
26. Application to ANN in the Hamming Cube
- For each possible L, build a small cube and project the original DB into it.
- Pre-compute an inverse lookup table for each entry of the small cube.
- Why is this efficient?
- How do we answer any query?
- How do we navigate between different L?
27. Putting it all together: user's private approximate search from a DB
- Each projection is O(log N) R's. The user picks many such projections for each L-range. That defines all the embeddings.
- Now, the DB builds inverse lookup tables for each projection, as new DBs for each L.
- The user can now project its query into the small cube and use binary search on L. (A schematic of the whole data structure follows below.)
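A very rough schematic of slides 26-27 (my own illustration, reusing make_embedding and hamming from the sketch after slide 24; preprocess, ann_query and the 0.35 threshold are hypothetical, the per-scale search is written as a scan where the real scheme uses a precomputed inverse lookup table indexed by small-cube points, and the walk over scales stands in for the binary search on L):

```python
def preprocess(db, n, eps, t, seed=0):
    """For each scale L = (1 + eps)^i up to n, store the L-range embedding and
    the projections of all DB strings into that scale's small cube."""
    scales, L, i = [], 1.0, 0
    while L <= n:
        embed = make_embedding(n, max(1, int(L)), t, seed=seed + i)
        scales.append((L, embed, [embed(x) for x in db]))
        L, i = L * (1 + eps), i + 1
    return scales

def ann_query(q, db, scales, threshold=0.35):
    """Walk the scales from small to large and return a DB point that already
    looks close to q at the smallest such scale.  The threshold 0.35 sits between
    the ~0.32 'distance about L' level and the ~0.5 'much farther' level of the
    single-R probability; it is illustrative only."""
    for L, embed, projected in scales:
        fq = embed(q)
        cand = [i for i, fx in enumerate(projected)
                if hamming(fq, fx) < threshold * len(fq)]
        if cand:                      # real scheme: one inverse-table lookup, not a scan
            return min(cand, key=lambda i: hamming(q, db[i]))
    return None
```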
28. MAIN THM [KOR]
- Can build a poly-size data structure to do ANN for high-dimensional data in time polynomial in d and poly-log in N:
- For the Hamming cube
- L_1
- L_2
- Square of the Euclidean distance
- [IM] had a similar result, with a slightly weaker guarantee.
29. Dealing with R^d
- Project to random lines, choose cut points (a sketch follows below)
- Well, not exactly: we need navigation
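One standard way to read "project to random lines, choose cut points" (my illustration under assumed parameters; not necessarily the exact KOR construction): pick random directions, and for each direction a random cut point; each (direction, cut) pair contributes one bit saying on which side of the cut the point's projection falls, so nearby points in R^d disagree on few bits and far points on many.

```python
import math
import random

def make_cube_map(d, m, radius, seed=0):
    """m random (direction, cut point) tests; each contributes one bit saying
    on which side of the cut the point's projection onto that direction falls.
    'radius' (assumed: the data lies within it) bounds where cuts are drawn."""
    rng = random.Random(seed)
    tests = []
    for _ in range(m):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]            # random direction
        norm = math.sqrt(sum(c * c for c in v))
        tests.append(([c / norm for c in v], rng.uniform(-radius, radius)))

    def to_cube(p):
        return tuple(1 if sum(pi * vi for pi, vi in zip(p, v)) > cut else 0
                     for v, cut in tests)

    return to_cube

# toy usage: the nearby pair disagrees on far fewer bits than the far pair
to_cube = make_cube_map(d=3, m=256, radius=2.0, seed=1)
a, b, c = (0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (1.5, -1.0, 1.2)
fa = to_cube(a)
print(sum(x != y for x, y in zip(fa, to_cube(b))),
      sum(x != y for x, y in zip(fa, to_cube(c))))
```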
30. Clustering
- Huge number of applications (IR, mining, analysis of statistical data, biology, automatic taxonomy formation, web, topic-specific data collections, etc.)
- Two independent issues:
- Representation of data
- Forming clusters (many incomparable methods)
31. Representation of data: examples
- Latent semantic indexing yields points in R^d with l2 distance (distance indicating similarity)
- Min-wise permutation (Broder et al.) approach yields points in the Hamming metric
- Many other representations from the IR literature lead to other metrics, including the edit-distance metric on strings
- Recent news: OR-05 showed that we can embed the edit-distance metric into l1 with small distortion: distortion exp(sqrt(log n · log log n))
32. Geometric Clustering: examples
- Min-sum clustering in R^d: form clusters s.t. the sum of intra-cluster distances is minimized
- k-clustering: pick k centers in the ambient space. The cost is the sum of distances from each data point to the closest center
- Agglomerative clustering (form clusters below some distance threshold)
- Q: which is better?
33. Methods are (in general) incomparable
34. Min-SUM
35. 2-Clustering
36. A k-clustering problem: notation
- N: number of points
- d: dimension
- k: number of centers
37. About k-clustering
- When k is fixed, this is easy for small d
- Kleinberg, Papadimitriou, Raghavan: NP-complete for k=2 for the cube
- Drineas, Frieze, Kannan, Vempala, Vinay: NP-complete for R^d for the square of the Euclidean distance
- When k is not fixed, this is facility location (Euclidean k-median)
- For fixed d but growing k, a PTAS was given by Arora, Raghavan, Rao (using dynamic programming)
- (this talk) OR: a PTAS for fixed k, arbitrary d
38. Common tools in geometric PTAS
- Dynamic programming
- Sampling: Schulman, AS, DLVK
- DFKVV use SVD
- Embeddings/dimension reduction seem useless because:
- Too many candidate centers
- May introduce new centers
39. OR k-clustering result
- A PTAS for fixed k:
- Hamming cube {0,1}^d
- l1^d
- l2^d (Euclidean distance)
- Square of the Euclidean distance
40. Main ideas
- For 2-clustering, finding a good partition is as good as solving the problem
- Switch to the cube
- Try partitions in the embedded low-dimensional data set
- Given a partition, compute centers and cost in the original data set
- Embedding/dimension reduction is used to reduce the number of partitions
41. Stronger property of OR dimension reduction
- Our random linear transformation preserves ranges!
42. THE ALGORITHM
43. The algorithm yet again
- Guess the 2-center distance
- Map to the small cube
- Partition in the small cube
- Measure the partition in the big cube (see the schematic below)
- THM: gets within (1 + ε) of optimal.
- Disclaimer: a PTAS is (almost never) practical; this shows feasibility only, and more ideas are needed for a practical solution.
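A schematic of the four steps above, in the spirit of slide 40 (my own illustration, not the actual OR PTAS: it reuses make_embedding and hamming from the earlier sketches, takes the guessed 2-center distance L as an argument, restricts the "guessed" center projections to projected data points, and brute-forces everything; the real algorithm controls the number of guesses so the whole thing stays polynomial for fixed ε). In the Hamming cube the best center of a fixed cluster is the coordinate-wise majority, which is what the evaluation step uses.

```python
from itertools import product

def majority_center(points):
    """In the Hamming cube, the best single center of a cluster is the
    coordinate-wise majority of its points."""
    n = len(points[0])
    return [1 if 2 * sum(p[i] for p in points) > len(points) else 0 for i in range(n)]

def cluster_cost(points, center):
    return sum(hamming(p, center) for p in points)

def two_cluster(db, n, L, t, seed=0):
    """Guess the 2-center distance L (passed in), map to the small cube,
    try center projections in the small cube, and evaluate every induced
    partition back in the big cube."""
    embed = make_embedding(n, L, t, seed=seed)
    projected = [embed(x) for x in db]
    best = None
    # "guess" the two center projections; here only among projected data points,
    # whereas the real algorithm ranges over small-cube points more carefully
    for ca, cb in product(projected, repeat=2):
        part_a = [x for x, fx in zip(db, projected) if hamming(fx, ca) <= hamming(fx, cb)]
        part_b = [x for x, fx in zip(db, projected) if hamming(fx, ca) >  hamming(fx, cb)]
        if not part_a or not part_b:
            continue
        c1, c2 = majority_center(part_a), majority_center(part_b)
        total = cluster_cost(part_a, c1) + cluster_cost(part_b, c2)   # cost in the big cube
        if best is None or total < best[0]:
            best = (total, c1, c2)
    return best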
44. Dealing with k > 2
- The apex of a tournament is a node of max out-degree
- Fact: the apex has a path of length at most 2 to every node
- Every point is assigned the apex of its center tournament
- Guess all (k choose 2) center distances
- Embed into (k choose 2) small cubes
- Guess the center projections in the small cubes
- For every point, for every pair of centers, define a tournament: which center is closer in the projection (a small sketch of the apex rule follows below)
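A small sketch of the assignment rule described above (apex_assignment and the proj_dist layout are my own names; the distances are assumed to come from the (k choose 2) small cubes mentioned on the slide):

```python
def apex_assignment(k, proj_dist):
    """proj_dist[(i, j)] = (point's distance to center i's projection,
                            point's distance to center j's projection)
    in the small cube built for the pair (i, j), for all i < j.
    Build the tournament over the k guessed centers and return its apex
    (node of maximum out-degree): the center this point gets assigned to."""
    out_degree = [0] * k
    for i in range(k):
        for j in range(i + 1, k):
            di, dj = proj_dist[(i, j)]
            if di <= dj:
                out_degree[i] += 1        # center i beats center j for this point
            else:
                out_degree[j] += 1
    return max(range(k), key=lambda c: out_degree[c])

# toy usage with 3 centers: center 1 wins both of its comparisons, so it is the apex
print(apex_assignment(3, {(0, 1): (5, 2), (0, 2): (4, 6), (1, 2): (3, 7)}))   # -> 1
```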
45. Conclusions
- Dimension reduction in the cube allows us to deal with a huge number of incomparable attributes.
- Embeddings of other metrics into the cube allow fast ANN for those metrics.
- Real applications still require considerable additional ideas.
- It is a fun area to work in.