Title: Dimension Reduction in the Hamming Cube (and its Applications)
1. Dimension Reduction in the Hamming Cube (and its Applications)
Rafail Ostrovsky, UCLA (joint works with Rabani, and with Kushilevitz and Rabani)
2. PLAN
- Problem Formulations
- Communication complexity game
- What really happened? (dimension reduction)
- Solutions to 2 problems
- ANN
- k-clustering
- What's next?
3. Problem statements
- Johnson-Lindenstrauss lemma: n points in a high-dimensional Hilbert space can be embedded into an O(log n)-dimensional subspace with small distortion.
- Q: how do we do it for the Hamming cube?
- (We show how to avoid the impossibility result of Charikar-Sahai.)
4. Many different formulations of ANN
- ANN: approximate nearest neighbor search
- (many applications in computational geometry, biology/stringology, IR, and other areas)
- Here are different formulations
5. Approximate Searching
- Motivation: given a DB of names and a user with a target name, find whether any of the DB names are close to the target name, without doing a linear scan.
[Figure: DB of names (Jon, Alice, Bob, Eve, Panconesi, Kate, Fred); query: A.Panconesi]
6. Geometric formulation
- Nearest Neighbor Search (NNS): given N blue points (and a distance function, say Euclidean distance in R^d), store all these points somehow.
7. Data structure question
- Given a new red point, find the closest blue point.
- Naive solution 1: store the blue points as-is and, when given a red point, measure distances to all blue points. Q: can we do better?
8. Can we do better?
- Easy in small dimensions (Voronoi diagrams)
- Curse of dimensionality in high dimensions
- KOR: can get a good approximate solution efficiently!
9. Hamming Cube Formulation for ANN
- Given a DB of N blue n-bit strings, process them somehow. Given an n-bit red string, find its ANN in the hypercube {0,1}^n.
- Naïve solution 2: pre-compute all (exponentially many) answers (we want small data structures!)
[Figure: DB of n-bit strings (00101011, 01011001, 11101001, 10110110, 11010101, 11011000, 10101010, 10101111); query: 11010100]
10. Clustering problem that I'll discuss in detail
11. An example of clustering: find centers
12. A clustering formulation
13. Clustering formulation
- The cost is the sum of distances (formalized below).
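In the notation of the later slides (N data points x_1, ..., x_N and k centers c_1, ..., c_k chosen from the ambient space, as on slides 32 and 36), this cost is

\[
\mathrm{cost}(c_1,\dots,c_k) \;=\; \sum_{i=1}^{N} \min_{1 \le j \le k} d(x_i, c_j).
\]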
14. Main technique
- First, as a communication game
- Second, interpreted as a dimension reduction
15. COMMUNICATION COMPLEXITY GAME
- Given two players, Alice and Bob:
- Alice is secretly given string x
- Bob is secretly given string y
- They want to estimate the Hamming distance between x and y with small communication (and small error), provided that they have common randomness.
- How can they do it? (say the length of x and y is N)
- Much easier: how do we check that x = y? (a sketch follows below)
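As a warm-up for the last bullet (my own minimal sketch, not from the slides): with shared randomness, Alice and Bob can test x = y by comparing parities of random subsets of their bits. If x ≠ y, a uniformly random mask R catches the difference with probability exactly 1/2, so t rounds (t bits of communication) give error at most 2^-t.

```python
import random

def parity(s, R):
    """Parity (inner product mod 2) of the bits of s selected by the mask R."""
    return sum(b & r for b, r in zip(s, R)) % 2

def equality_test(x, y, t=32, seed=0):
    """Randomized equality test with shared randomness: always accepts when
    x == y; when x != y, each round catches the difference with probability 1/2,
    so it wrongly accepts with probability at most 2**-t."""
    rng = random.Random(seed)                       # models the common random string
    n = len(x)
    for _ in range(t):
        R = [rng.randint(0, 1) for _ in range(n)]   # uniform mask known to both players
        if parity(x, R) != parity(y, R):            # Alice sends 1 bit; Bob compares
            return False
    return True

# toy usage
x = [0, 1, 0, 1, 1, 0, 1, 0]
y = [0, 1, 0, 1, 0, 0, 1, 0]
print(equality_test(x, x), equality_test(x, y))     # True False (w.h.p.)
```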
16. Main lemma: an abstract game
- How can Alice and Bob estimate the Hamming distance between X and Y with small CC?
- We assume Alice and Bob share randomness.
ALICE: X1 X2 X3 X4 ... Xn
BOB:   Y1 Y2 Y3 Y4 ... Yn
17. A simpler question
- To estimate the Hamming distance between X and Y (within a factor of (1 + ε)) with small CC, it is sufficient for Alice and Bob, for any L, to be able to distinguish
  - H(X,Y) < L, OR
  - H(X,Y) > (1 + ε) L
- Q: why does sampling not work?
ALICE: X1 X2 X3 X4 ... Xn
BOB:   Y1 Y2 Y3 Y4 ... Yn
18. Alice and Bob pick the SAME n-bit blue R: each bit of R is 1 independently with probability 1/(2L)
[Figure: Alice and Bob hold identical copies of the random mask R; below it, their inputs X and Y.]
19. What is the difference in probabilities between H(X,Y) < L and H(X,Y) > (1 + ε) L?
[Figure: Alice XORs the bits of X selected by R, Bob XORs the bits of Y selected by the same R; each obtains a single 0/1 output bit.]
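The slides leave the calculation implicit; as a hedged reconstruction of the standard argument: the two output bits differ exactly when R selects an odd number of the H(X,Y) coordinates where X and Y disagree, and since each bit of R is 1 independently with probability 1/(2L),

\[
\Pr\bigl[\langle R,X\rangle \oplus \langle R,Y\rangle = 1\bigr]
  \;=\; \frac{1-\left(1-\tfrac{1}{L}\right)^{H(X,Y)}}{2}
  \;\approx\; \frac{1-e^{-H(X,Y)/L}}{2}.
\]

So the probability is roughly (1 - e^{-1})/2 ≈ 0.316 when H(X,Y) ≈ L, and larger by Θ(ε) when H(X,Y) ≥ (1 + ε)L; this Θ(ε) gap is what the next slides amplify by repetition.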
20. How do we amplify?
[Figure: the same single-R experiment as on the previous slide.]
21. How do we amplify?
- Repeat, with many independent R's, but the same distribution!
[Figure: the single-R experiment repeated with independent masks.]
22. A refined game with small communication
- How can Alice and Bob distinguish
  - H(X,Y) < L, OR
  - H(X,Y) > (1 + ε) L?
- ALICE (X1 X2 X3 X4 ... Xn): for each R, XOR the subset of Xi selected by R.
- BOB (Y1 Y2 Y3 Y4 ... Yn): for each R, XOR the same subset of Yi.
- Pick 1/ε² · log N R's with the correct distribution, and compare this linear transformation of the two inputs.
23. Dimension Reduction in the Hamming Cube [OR]
- For each L, we can pick O(log N) R's and boost the probabilities!
- Key property: we get an embedding from the large cube to a small cube that preserves ranges around L very well.
24. Dimension Reduction in the Hamming Cube [OR]
- For each L, we can pick O(log N) R's and boost the probabilities!
- Key property: we get an embedding from the large cube to a small cube that preserves ranges around L.
- Key idea in applications: we can build an inverse lookup table for the small cube! (a sketch of the embedding follows below)
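To make this concrete, here is a minimal sketch of the L-range embedding as I read slides 18-24 (the name make_embedding and the toy parameters are mine; t stands in for the O(ε^-2 · log N) repetitions): the embedding is the linear map x ↦ (⟨R_1,x⟩, ..., ⟨R_t,x⟩) mod 2, and Hamming distances below L versus above (1 + ε)L concentrate around visibly different values in the small cube.

```python
import random

def make_embedding(n, L, t, seed=0):
    """Sample t random masks R_1..R_t over {0,1}^n; each bit of each mask is 1
    independently with probability 1/(2L), as on slide 18.  Returns a map from
    an n-bit string to its t parity bits, i.e. a point of the small cube {0,1}^t."""
    rng = random.Random(seed)                           # the shared randomness
    p = 1.0 / (2 * L)
    masks = [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(t)]

    def embed(x):
        return tuple(sum(xi & ri for xi, ri in zip(x, R)) % 2 for R in masks)

    return embed

def hamming(a, b):
    return sum(ai != bi for ai, bi in zip(a, b))

# toy usage with illustrative (not the paper's) parameters: a pair at distance
# below L lands noticeably closer in the small cube than a pair above (1 + eps) L
n, L, t = 1000, 50, 400
embed = make_embedding(n, L, t, seed=1)
x = [random.randint(0, 1) for _ in range(n)]
y_close, y_far = x[:], x[:]
for i in random.sample(range(n), 25):
    y_close[i] ^= 1                                     # H(x, y_close) = 25  <  L
for i in random.sample(range(n), 100):
    y_far[i] ^= 1                                       # H(x, y_far) = 100 > 1.5 * L
print(hamming(embed(x), embed(y_close)), hamming(embed(x), embed(y_far)))
```

With these toy numbers the two printed distances should come out around 80 and 170 out of t = 400, matching the single-R probabilities from slide 19 times t.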
25. Applications
- Applications of the dimension reduction in the Hamming cube:
- For ANN in the Hamming cube and R^d
- For k-clustering
26. Application to ANN in the Hamming Cube
- For each possible L, build a small cube and project the original DB into it.
- Pre-compute an inverse lookup table for each entry of the small cube.
- Why is this efficient?
- How do we answer any query?
- How do we navigate between different L?
27. Putting it all together: user's private approximate search from a DB
- Each projection is O(log N) R's. The user picks many such projections for each L-range. That defines all the embeddings.
- Now, the DB builds inverse lookup tables for each projection, as new DBs for each L.
- The user can now project its query into the small cube and use binary search on L. (A schematic of the whole data structure follows below.)
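A very rough schematic of slides 26-27 (my own illustration, reusing make_embedding and hamming from the sketch after slide 24; preprocess, ann_query and the 0.35 threshold are hypothetical, the per-scale search is written as a scan where the real scheme uses a precomputed inverse lookup table indexed by small-cube points, and the walk over scales stands in for the binary search on L):

```python
def preprocess(db, n, eps, t, seed=0):
    """For each scale L = (1 + eps)^i up to n, store the L-range embedding and
    the projections of all DB strings into that scale's small cube."""
    scales, L, i = [], 1.0, 0
    while L <= n:
        embed = make_embedding(n, max(1, int(L)), t, seed=seed + i)
        scales.append((L, embed, [embed(x) for x in db]))
        L, i = L * (1 + eps), i + 1
    return scales

def ann_query(q, db, scales, threshold=0.35):
    """Walk the scales from small to large and return a DB point that already
    looks close to q at the smallest such scale.  The threshold 0.35 sits between
    the ~0.32 'distance about L' level and the ~0.5 'much farther' level of the
    single-R probability; it is illustrative only."""
    for L, embed, projected in scales:
        fq = embed(q)
        cand = [i for i, fx in enumerate(projected)
                if hamming(fq, fx) < threshold * len(fq)]
        if cand:                      # real scheme: one inverse-table lookup, not a scan
            return min(cand, key=lambda i: hamming(q, db[i]))
    return None
```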
28. MAIN THM [KOR]
- Can build a poly-size data structure to do ANN for high-dimensional data in time polynomial in d and poly-log in N:
- For the Hamming cube
- L_1
- L_2
- Square of the Euclidean distance
- [IM] had a similar result, with a slightly weaker guarantee.
29. Dealing with R^d
- Project to random lines, choose cut points (a sketch follows below)
- Well, not exactly: we need navigation
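One standard way to read "project to random lines, choose cut points" (my illustration under assumed parameters; not necessarily the exact KOR construction): pick random directions, and for each direction a random cut point; each (direction, cut) pair contributes one bit saying on which side of the cut the point's projection falls, so nearby points in R^d disagree on few bits and far points on many.

```python
import math
import random

def make_cube_map(d, m, radius, seed=0):
    """m random (direction, cut point) tests; each contributes one bit saying
    on which side of the cut the point's projection onto that direction falls.
    'radius' (assumed: the data lies within it) bounds where cuts are drawn."""
    rng = random.Random(seed)
    tests = []
    for _ in range(m):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]            # random direction
        norm = math.sqrt(sum(c * c for c in v))
        tests.append(([c / norm for c in v], rng.uniform(-radius, radius)))

    def to_cube(p):
        return tuple(1 if sum(pi * vi for pi, vi in zip(p, v)) > cut else 0
                     for v, cut in tests)

    return to_cube

# toy usage: the nearby pair disagrees on far fewer bits than the far pair
to_cube = make_cube_map(d=3, m=256, radius=2.0, seed=1)
a, b, c = (0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (1.5, -1.0, 1.2)
fa = to_cube(a)
print(sum(x != y for x, y in zip(fa, to_cube(b))),
      sum(x != y for x, y in zip(fa, to_cube(c))))
```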
30. Clustering
- Huge number of applications (IR, mining, analysis of statistical data, biology, automatic taxonomy formation, web, topic-specific data collections, etc.)
- Two independent issues:
- Representation of data
- Forming clusters (many incomparable methods)
31. Representation of data: examples
- Latent semantic indexing yields points in R^d with l2 distance (distance indicating similarity)
- Min-wise permutation (Broder et al.) approach yields points in the Hamming metric
- Many other representations from the IR literature lead to other metrics, including the edit-distance metric on strings
- Recent news: OR-05 showed that we can embed the edit-distance metric into l1 with small distortion: distortion exp(sqrt(log n · log log n))
32. Geometric Clustering: examples
- Min-sum clustering in R^d: form clusters s.t. the sum of intra-cluster distances is minimized
- k-clustering: pick k centers in the ambient space. The cost is the sum of distances from each data point to the closest center
- Agglomerative clustering (form clusters below some distance threshold)
- Q: which is better?
33. Methods are (in general) incomparable
34. Min-SUM
35. 2-Clustering
36. A k-clustering problem: notation
- N: number of points
- d: dimension
- k: number of centers
37. About k-clustering
- When k is fixed, this is easy for small d
- Kleinberg, Papadimitriou, Raghavan: NP-complete for k=2 for the cube
- Drineas, Frieze, Kannan, Vempala, Vinay: NP-complete for R^d for the square of the Euclidean distance
- When k is not fixed, this is facility location (Euclidean k-median)
- For fixed d but growing k, a PTAS was given by Arora, Raghavan, Rao (using dynamic programming)
- (this talk) OR: a PTAS for fixed k, arbitrary d
38. Common tools in geometric PTAS
- Dynamic programming
- Sampling: Schulman, AS, DLVK
- DFKVV use SVD
- Embeddings/dimension reduction seem useless because:
- Too many candidate centers
- May introduce new centers
39. OR k-clustering result
- A PTAS for fixed k:
- Hamming cube {0,1}^d
- l1^d
- l2^d (Euclidean distance)
- Square of the Euclidean distance
40. Main ideas
- For 2-clustering, finding a good partition is as good as solving the problem
- Switch to the cube
- Try partitions in the embedded low-dimensional data set
- Given a partition, compute centers and cost in the original data set
- Embedding/dimension reduction is used to reduce the number of partitions
41. Stronger property of OR dimension reduction
- Our random linear transformation preserves ranges!
42. THE ALGORITHM
43. The algorithm yet again
- Guess the 2-center distance
- Map to the small cube
- Partition in the small cube
- Measure the partition in the big cube (see the schematic below)
- THM: gets within (1 + ε) of optimal.
- Disclaimer: a PTAS is (almost never) practical; this shows feasibility only, and more ideas are needed for a practical solution.
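A schematic of the four steps above, in the spirit of slide 40 (my own illustration, not the actual OR PTAS: it reuses make_embedding and hamming from the earlier sketches, takes the guessed 2-center distance L as an argument, restricts the "guessed" center projections to projected data points, and brute-forces everything; the real algorithm controls the number of guesses so the whole thing stays polynomial for fixed ε). In the Hamming cube the best center of a fixed cluster is the coordinate-wise majority, which is what the evaluation step uses.

```python
from itertools import product

def majority_center(points):
    """In the Hamming cube, the best single center of a cluster is the
    coordinate-wise majority of its points."""
    n = len(points[0])
    return [1 if 2 * sum(p[i] for p in points) > len(points) else 0 for i in range(n)]

def cluster_cost(points, center):
    return sum(hamming(p, center) for p in points)

def two_cluster(db, n, L, t, seed=0):
    """Guess the 2-center distance L (passed in), map to the small cube,
    try center projections in the small cube, and evaluate every induced
    partition back in the big cube."""
    embed = make_embedding(n, L, t, seed=seed)
    projected = [embed(x) for x in db]
    best = None
    # "guess" the two center projections; here only among projected data points,
    # whereas the real algorithm ranges over small-cube points more carefully
    for ca, cb in product(projected, repeat=2):
        part_a = [x for x, fx in zip(db, projected) if hamming(fx, ca) <= hamming(fx, cb)]
        part_b = [x for x, fx in zip(db, projected) if hamming(fx, ca) >  hamming(fx, cb)]
        if not part_a or not part_b:
            continue
        c1, c2 = majority_center(part_a), majority_center(part_b)
        total = cluster_cost(part_a, c1) + cluster_cost(part_b, c2)   # cost in the big cube
        if best is None or total < best[0]:
            best = (total, c1, c2)
    return best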
44. Dealing with k > 2
- The apex of a tournament is a node of max out-degree
- Fact: the apex has a path of length at most 2 to every node
- Every point is assigned the apex of its center tournament
- Guess all (k choose 2) center distances
- Embed into (k choose 2) small cubes
- Guess the center projections in the small cubes
- For every point, for every pair of centers, define a tournament: which center is closer in the projection (a small sketch of the apex rule follows below)
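A small sketch of the assignment rule described above (apex_assignment and the proj_dist layout are my own names; the distances are assumed to come from the (k choose 2) small cubes mentioned on the slide):

```python
def apex_assignment(k, proj_dist):
    """proj_dist[(i, j)] = (point's distance to center i's projection,
                            point's distance to center j's projection)
    in the small cube built for the pair (i, j), for all i < j.
    Build the tournament over the k guessed centers and return its apex
    (node of maximum out-degree): the center this point gets assigned to."""
    out_degree = [0] * k
    for i in range(k):
        for j in range(i + 1, k):
            di, dj = proj_dist[(i, j)]
            if di <= dj:
                out_degree[i] += 1        # center i beats center j for this point
            else:
                out_degree[j] += 1
    return max(range(k), key=lambda c: out_degree[c])

# toy usage with 3 centers: center 1 wins both of its comparisons, so it is the apex
print(apex_assignment(3, {(0, 1): (5, 2), (0, 2): (4, 6), (1, 2): (3, 7)}))   # -> 1
```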
45. Conclusions
- Dimension reduction in the cube allows us to deal with a huge number of incomparable attributes.
- Embeddings of other metrics into the cube allow fast ANN for those metrics.
- Real applications still require considerable additional ideas.
- It is a fun area to work in.