Title: Compact Data Representations and their Applications
1. Compact Data Representations and their Applications
- Moses Charikar
- Princeton University
2. Lots and lots of data
- AT&T
- Information about who calls whom
- What information can be extracted from this data?
- Network router
- Sees a high speed stream of packets
- Detect DoS attacks? Fair resource allocation?
3. Lots and lots of data
- Google search engine
- About 3 billion web pages
- Many, many queries every day
- How to efficiently process the data?
- Eliminate near duplicate web pages
- Query log analysis
4. Sketching Paradigm
- Construct a compact representation (sketch) of the data such that interesting functions of the data can be computed (or estimated) from the compact representation
5. Why care about compact representations?
- Practical motivations
- Algorithmic techniques for massive data sets
- Compact representations lead to reduced space and time requirements
- Make impractical tasks feasible
- Theoretical motivations
- Interesting mathematical problems
- Connections to many areas of research
6. Questions
- What is the data?
- What functions do we want to compute on the data?
- How do we estimate functions on the sketches?
- Different considerations arise from different combinations of answers
- Compact representation schemes are functions of the requirements
7. What is the data?
- Sets, vectors, points in Euclidean space, points in a metric space, vertices of a graph
- Mathematical representations of objects (e.g. documents, images, customer profiles, queries)
8. What functions do we want to compute on the data?
- Local functions of pairs of objects, e.g. the distance between objects
- Sketch of each object, such that the function can be estimated from pairs of sketches
- Global functions of the entire data set, e.g. statistical properties of the data
- Sketch of the entire data set, with the ability to update and combine sketches
9. Local functions: distance/similarity
- Distance is a general metric, i.e. it satisfies the triangle inequality
- Normed space: x = (x1, x2, …, xd), y = (y1, y2, …, yd)
- Other special metrics (e.g. Earth Mover Distance)
10. Estimating distance from sketches
- Arbitrary function of sketches
- Information theory / communication complexity question
- Sketches are points in a normed space
- Embedding the original distance function in a normed space [Bourgain 85; Linial, London, Rabinovich 94]
- Original metric is a (or the same) normed space
- Original data points are high dimensional
- Sketches are points in low dimensions
- Dimension reduction in normed spaces [Johnson, Lindenstrauss 84]
11. Streaming algorithms
- Perform computation in one (or a constant number of) pass(es) over the data using a small amount of storage space
- Availability of a sketch function facilitates a streaming algorithm
- Additional requirements: the sketch should allow
- Update to incorporate new data items
- Combination of sketches for different data sets
(Diagram: input stream feeding a small amount of storage)
12. Global functions
- Statistical properties of entire data set
- Frequency moments
- Sortedness of data
- Set membership
- Size of join of relations
- Histogram representation
- Most frequent items in data set
- Clustering of data
13. Goals
- A glimpse of compact representation techniques in the sketching and streaming domains
- Basic ideas, no messy details
14. Talk Outline
- Classical techniques: spectral methods
- Dimension reduction
- Similarity preserving hash functions
- Sketching vector norms
- Sketching Earth Mover Distance (EMD)
15. Spectral methods: approximating matrices
- SVD: Singular Value Decomposition; LSI: Latent Semantic Indexing
- Related to PCA (Principal Component Analysis) and MDS (MultiDimensional Scaling)
16. SVD: Matrix Factorization
X = U × Σ × V^T
where X is m × n, U is m × r (the basis), Σ is the r × r diagonal matrix of singular values, and V^T is r × n (the representation).
Restrictions on the representation: U, V orthonormal; Σ diagonal.
17. Matrix approximation
- X = Σ_i u_i s_i v_i^T
- X(k) = Σ_{i=1..k} u_i s_i v_i^T
- X(k) is the best rank-k approximation to X: it minimizes Σ_{ij} (x_ij − x(k)_ij)² (see the sketch below)
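The rank-k truncation above can be computed directly with any SVD routine. A minimal NumPy sketch (an illustration added here, not part of the original slides):

```python
import numpy as np

def rank_k_approximation(X, k):
    """Best rank-k approximation X(k) = sum_{i<=k} u_i s_i v_i^T,
    obtained by truncating the SVD; by the Eckart-Young theorem this
    minimizes sum_{ij} (x_ij - x(k)_ij)^2 over all rank-k matrices."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```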
18. Dimension Reduction
X_r = U × Σ_r × V^T
where Σ_r keeps the top r singular values and zeroes out the rest.
The columns of X_r represent the docs, but in r << m dimensions. Best rank-r approximation according to the 2-norm.
19. Closely related notions
- Singular Value Decomposition
- Karhunen-Loeve (KL) Transform
- Principal Component Analysis (PCA)
- Latent Semantic Indexing (LSI)
- Information retrieval
20. SVD complexity
- O(min(nm², mn²))
- Less work:
- if we want just the eigenvalues
- if we want the first k eigenvectors
- if the matrix is sparse
- Implemented in any linear algebra package (LINPACK, Matlab, S-Plus, Mathematica, …)
21. Applications
- Image processing and compression
- Low rank approximation leads to compressed representation, noise reduction
- Molecular dynamics
- Characterizing protein molecular dynamics
- Higher principal components correspond to large scale motions
22. Applications
- Information retrieval
- LSI (Latent Semantic Indexing): SVD applied to the term-document matrix
- Compute the best rank k approximation
- Eigenvectors correspond to linguistic concepts
- Gene expression data analysis
- SVD is a useful preprocessing step
- Grouping genes by transcriptional response, grouping assays by expression profile
23. Microarray gene expression data
24. SVD applied to gene expression data
25. Information retrieval
- X is the term-document matrix
- m terms, n documents
- Entry (t,d) for term t and document d is a function of how many times t occurs in d
- SVD of X gives a low dimensional representation Xr
- Latent Semantic Indexing
- Xr^T Xr is the matrix of document similarities
- Columns of Xr represent the documents, but in r << m dimensions
26. Semi-precise intuition
- We accomplish more than dimension reduction here
- Docs with lots of overlapping terms stay together
- Terms from these docs also get pulled together
- Thus "car" and "automobile" get pulled together because both co-occur in docs with "tires", "radiator", "cylinder", etc.
27. Query processing
- View a query as a (short) doc: call it column 0 of Xr
- Now the entries in column 0 of Xr^T Xr give the similarities of the query with each doc
- Entry (j,0) is the score of doc j on the query (see the sketch below)
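One common way to realize this is the usual LSI "fold-in": project the raw query vector the same way the documents were projected, then score by inner products. A minimal sketch (the helper name and the exact normalization are assumptions, not from the slides):

```python
import numpy as np

def lsi_query_scores(docs_r, U_r, s_r, query_terms):
    """Fold a raw m-dim term vector into the r-dim LSI space and score
    every document by inner product -- the column-0 entries of Xr^T Xr
    when the query is treated as an extra (short) document."""
    q_r = (U_r.T @ query_terms) / s_r   # fold-in: q_r = S_r^{-1} U_r^T q
    return docs_r.T @ q_r               # one similarity score per document
```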
28. Talk Outline
- Dimension reduction
- Similarity preserving hash functions
- Sketching vector norms
- Sketching Earth Mover Distance (EMD)
29. Low Distortion Embeddings
- Given metric spaces (X1,d1) and (X2,d2), an embedding f: X1 → X2 has distortion D if the ratio of distances changes by at most D
- Dimension reduction:
- Original space is high dimensional
- Make the target space be of low dimension, while maintaining small distortion
30. Dimension Reduction in L2
- n points in Euclidean space (L2 norm) can be mapped down to O((log n)/ε²) dimensions with distortion at most 1+ε [Johnson, Lindenstrauss 84]
- Two interesting properties:
- Linear mapping
- Oblivious: the choice of linear mapping does not depend on the point set
- Quite simple [JL84, FM88, IM98, DG99, Ach01]: even a random +1/−1 matrix works (see the sketch below)
- Many applications
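A minimal sketch of the oblivious linear mapping with a random ±1 matrix (illustrative; scaling as in the [Ach01]-style construction):

```python
import numpy as np

def jl_project(points, target_dim, seed=0):
    """Map n points (rows) in d dims down to target_dim dims with a
    random +/-1 matrix; pairwise L2 distances are preserved up to a
    (1 +/- eps) factor w.h.p. once target_dim = O((log n) / eps^2)."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    R = rng.choice([-1.0, 1.0], size=(d, target_dim))
    return points @ R / np.sqrt(target_dim)
```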
31. Dimension reduction for L1
- [Charikar, Sahai 02]: Linear embeddings are not good for dimension reduction in L1
- There exist O(n) points in L1 in n dimensions, such that any linear mapping with distortion c needs n/c² dimensions
32. Dimension reduction for L1
- [Charikar, Brinkman 03]: Strong lower bounds for dimension reduction in L1
- There exist n points in L1, such that any embedding with constant distortion c needs n^(1/c²) dimensions
- Simpler proof by [Lee, Naor 04]
- Does not rule out other sketching techniques
33. Talk Outline
- Dimension reduction
- Similarity preserving hash functions
- Sketching vector norms
- Sketching Earth Mover Distance (EMD)
34. Similarity Preserving Hash Functions
- Similarity function sim(x,y), distance d(x,y)
- Family of hash functions F with a probability distribution such that Pr_{h∈F}[h(x) = h(y)] = sim(x,y)
35. Applications
- Compact representation scheme for estimating similarity
- Approximate nearest neighbor search [Indyk, Motwani 98; Kushilevitz, Ostrovsky, Rabani 98]
36. Relaxations of SPH
- Estimate a distance measure, not a similarity measure in [0,1]
- Measure E[f(h(x), h(y))]
- Estimator will approximate the distance function
37. Sketching Set Similarity: Min-wise Independent Permutations
[Broder, Manasse, Glassman, Zweig 97; Broder, Charikar, Frieze, Mitzenmacher 98] (see the sketch below)
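The slide's figures are not in the transcript; for concreteness, here is a minimal min-hashing sketch in the spirit of the cited papers (a seeded hash is an illustrative stand-in for a min-wise independent permutation):

```python
import random

def minhash_sketch(items, num_hashes=100, seed=0):
    """For each of num_hashes random hash functions, record the minimum
    hash value over the set. For two sets A, B the fraction of agreeing
    sketch coordinates estimates the resemblance |A ∩ B| / |A ∪ B|."""
    rnd = random.Random(seed)
    salts = [rnd.getrandbits(64) for _ in range(num_hashes)]
    return [min(random.Random(salt ^ hash(x)).random() for x in items)
            for salt in salts]

def estimate_resemblance(sk_a, sk_b):
    return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)
```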
38. Other similarity functions? [Charikar 02]
- Necessary conditions for the existence of similarity preserving hash functions
- SPH does not exist for the Dice coefficient and the Overlap coefficient
- SPH schemes from rounding algorithms
- Hash function for vectors based on random hyperplane rounding
39. Existence of SPH schemes
- sim(x,y) admits an SPH scheme if there exists a family of hash functions F such that Pr_{h∈F}[h(x) = h(y)] = sim(x,y)
40.
- Theorem: If sim(x,y) admits an SPH scheme, then 1 − sim(x,y) satisfies the triangle inequality
- Proof
41. Non-existence of SPH
42. Stronger Condition
- Theorem: If sim(x,y) admits an SPH scheme, then (1 + sim(x,y))/2 has an SPH scheme with hash functions mapping objects to {0,1}
- Theorem: If sim(x,y) admits an SPH scheme, then 1 − sim(x,y) is isometrically embeddable in the Hamming cube
43.
- For n vectors, the random hyperplane can be chosen using O(log² n) random bits [Indyk; Engebretsen, Indyk, O'Donnell]
- Alternate similarity measure for sets
44. Random Hyperplane Rounding based SPH
- Collection of vectors
- Pick a random hyperplane through the origin (normal vector r); hash each vector u to h_r(u) = sign(r · u)
- Pr[h_r(u) ≠ h_r(v)] = θ(u,v)/π [Goemans, Williamson] (see the sketch below)
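A minimal sketch of this scheme (illustrative; the collision property above is what makes the Hamming distance between sketches track the angle between vectors):

```python
import numpy as np

def hyperplane_sketch(vectors, num_bits=64, seed=0):
    """Each bit records the side of a random hyperplane through the
    origin: bit j of u is sign(r_j . u). For any pair u, v,
    Pr[bit differs] = angle(u, v) / pi."""
    rng = np.random.default_rng(seed)
    normals = rng.standard_normal((vectors.shape[1], num_bits))
    return (vectors @ normals) >= 0   # n x num_bits boolean sketch
```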
45. Sketching L1
- Design a sketch for vectors to estimate the L1 norm
- Hash function to distinguish between small and large distances [KOR 98]
- Map L1 to Hamming space
- Bit vectors a = (a1, a2, …, an) and b = (b1, b2, …, bn)
- Distinguish between distances ≤ (1−ε)n/k versus ≥ (1+ε)n/k
- XOR a random set of k bits
- Pr[h(a) ≠ h(b)] differs by a constant in the two cases (see the sketch below)
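An illustrative version of the XOR hash (a sketch, assuming bit positions sampled with replacement, which gives the collision probability stated in the comment):

```python
import random

def kor_style_hash(bits, k, seed=0):
    """XOR k randomly chosen bit positions. For vectors at Hamming
    distance d, the hashes differ with probability
    (1 - (1 - 2d/n)^k) / 2, which separates d below (1-eps)n/k
    from d above (1+eps)n/k by a constant gap."""
    rnd = random.Random(seed)
    positions = [rnd.randrange(len(bits)) for _ in range(k)]
    return sum(bits[p] for p in positions) % 2
```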
46. Sketching L1 via stable distributions
- a = (a1, a2, …, an) and b = (b1, b2, …, bn)
- Sketching L2:
- f(a) = Σ_i a_i X_i, f(b) = Σ_i b_i X_i, with X_i independent Gaussian
- f(a) − f(b) has a Gaussian distribution scaled by ‖a−b‖₂
- Form many coordinates, estimate ‖a−b‖₂ by taking the L2 norm
- Sketching L1:
- f(a) = Σ_i a_i X_i, f(b) = Σ_i b_i X_i, with X_i independent Cauchy distributed
- f(a) − f(b) has a Cauchy distribution scaled by ‖a−b‖₁
- Form many coordinates, estimate ‖a−b‖₁ by taking the median [Indyk 00] -- streaming applications (see the sketch below)
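A minimal sketch of both constructions (illustrative; the projection matrix must be shared between the two vectors, here via a common seed, so the sketch stays linear and f(a) − f(b) = f(a − b)):

```python
import numpy as np

def stable_sketch(v, r=100, p=1, seed=0):
    """r projections with i.i.d. p-stable entries: Cauchy for L1 (p=1),
    Gaussian for L2 (p=2)."""
    rng = np.random.default_rng(seed)
    X = (rng.standard_cauchy((r, len(v))) if p == 1
         else rng.standard_normal((r, len(v))))
    return X @ v

# Median estimator for L1 [Indyk 00]: the median of |standard Cauchy|
# is 1, so the median of |f(a) - f(b)| concentrates around ||a - b||_1.
a, b = np.arange(5.0), np.array([1.0, 1.0, 4.0, 2.0, 8.0])
est = np.median(np.abs(stable_sketch(a) - stable_sketch(b)))
```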
47. Earth Mover Distance (EMD)
(Figure: point sets P and Q, with EMD(P,Q) the cost of the min-cost matching between them)
48. Bipartite/Bichromatic matching
- Minimum cost matching between two sets of points
- Point weights → multiple copies of points
- Fast estimation of bipartite matching [Agarwal, Varadarajan 04]
- Goal: Sketch a point set to enable estimation of the min cost matching
49. Approximating metrics by trees
50. EMD on trees: embedding into L1 (suggested by Piotr Indyk)
- Each tree node T contributes a coordinate l_T · w_T(P), where w_T(P) is the weight of P in the subtree under T and l_T is the length of the edge above T
- EMD(P,Q) ≈ Σ_T l_T |w_T(P) − w_T(Q)| (see the sketch below)
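For points in the plane, the tree can be a randomly shifted hierarchy of grids, which is the L1 embedding used by [Indyk, Thaper 03]. A minimal sketch (illustrative; it assumes nonnegative numeric coordinates, and both embeddings must be built with the same seed so the grid shifts match):

```python
import random
from collections import Counter

def grid_embedding(points, levels=8, seed=0):
    """Each grid cell at scale 2^i contributes a coordinate
    2^i * (# points in the cell); the L1 distance between two such
    embeddings approximates EMD(P,Q) up to an O(log n)-type factor."""
    rnd = random.Random(seed)
    shifts = [(rnd.random() * 2 ** i, rnd.random() * 2 ** i)
              for i in range(levels)]
    emb = Counter()
    for x, y in points:
        for i, (sx, sy) in enumerate(shifts):
            cell = (i, int((x + sx) // 2 ** i), int((y + sy) // 2 ** i))
            emb[cell] += 2 ** i
    return emb

def l1_distance(e1, e2):
    return sum(abs(e1[c] - e2[c]) for c in set(e1) | set(e2))
```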
51. EMD on general metrics
- Approximate the metric by a probability distribution on trees
- Sample a tree from the distribution and compute the L1 representation
- EMD(P,Q) ≤ E[d(v(P), v(Q))] ≤ O(log n) · EMD(P,Q)
52. Tree approximations for Euclidean points
- Distortion O(d log Δ) [Bartal 96; CCGGP 98]
- Proposed by [Indyk, Thaper 03] for estimating EMD
53. Conclusions
- Compact representations are at the heart of several algorithmic techniques for large data sets
- Compact representations tailored to applications
- Effective for region based image retrieval
54. ISOMAP and LLE
- Nonlinear dimension reduction methods
- Learn hidden structure in the data
- See the slides of Chan-Su Lee and Rong Xu from Michael Littman's course at Rutgers:
- http://www.cs.rutgers.edu/~mlittman/courses/lightai03/chansu.ppt
- http://www.cs.rutgers.edu/~mlittman/courses/lightai03/rongxu.ppt
55. Compact Representations in Streaming Algorithms
- Moses Charikar, Princeton University
56. Compact Representations in Streaming
- Statistical properties of data streams
- Distinct elements
- Frequency moments, norm estimation
- Frequent items
57. Frequency Moments [Alon, Matias, Szegedy 99]
- Stream consists of elements from {1, 2, …, n}
- m_i = number of times i occurs
- Frequency moment F_k = Σ_i m_i^k
- F_0 = number of distinct elements
- F_1 = size of the stream
- F_2 = Σ_i m_i² (exact baseline computed below)
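As a baseline (not the streaming algorithm), the moments can be computed exactly with per-item counters; the point of the following slides is to approximate them in far less space:

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact F_k = sum_i m_i^k. F_0 counts distinct elements,
    F_1 is the stream length, F_2 the self-join size."""
    return sum(m ** k for m in Counter(stream).values())
```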
58. Overall Scheme
- Design an estimator (i.e. a random variable) with the right expectation
- If the estimator is tightly concentrated, maintain a number of independent copies of the estimator E1, E2, …, Er
- Obtain an estimate E from E1, E2, …, Er
- Within a (1±ε) factor with probability 1−δ
59. Randomness
- Design the estimator assuming perfect hash functions, with as much randomness as needed
- Too much space would be required to explicitly store such a hash function
- Fix later by showing that limited randomness suffices
60. Distinct Elements
- Estimate the number of distinct elements in a data stream
- Brute force solution: maintain a list of the distinct elements seen so far
- Ω(n) storage
- Can we do better?
61. Distinct Elements [Flajolet, Martin 83]
- Pick a random hash function h: [n] → [0,1]
- Say there are k distinct elements
- Then the minimum value of h over the k distinct elements is around 1/k
- Apply h() to every element of the data stream, maintaining the minimum value
- Estimator: 1/minimum (see the sketch below)
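A minimal sketch of the estimator (illustrative; a seeded PRNG stands in for the random hash function, and independent copies are averaged as in the analysis on the next slide):

```python
import random

def distinct_estimate(stream, copies=64, seed=0):
    """Track min h(x) over the stream for several hash functions;
    since E[min over k distinct values] = 1/(k+1), report the average
    of 1/min - 1 across the copies."""
    salts = [random.Random(seed + j).getrandbits(64) for j in range(copies)]
    mins = [1.0] * copies
    for x in stream:
        for j, salt in enumerate(salts):
            h = random.Random(salt ^ hash(x)).random()  # "hash" to [0,1)
            if h < mins[j]:
                mins[j] = h
    return sum(1.0 / m - 1.0 for m in mins) / copies
```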
62. (Idealized) Analysis
- Assume a perfectly random hash function h: [n] → [0,1]
- S = set of k elements of [n]
- X = min_{a∈S} h(a)
- E[X] = 1/(k+1)
- Var[X] = O(1/k²)
- The mean of O(1/ε²) independent estimators is within (1±ε) of 1/k with constant probability
63. Analysis
- [Alon, Matias, Szegedy]: Analysis goes through with a pairwise independent hash function h(x) = ax + b
- 2-approximation
- O(log n) space
- Many improvements [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan]
64. Estimating F2
- F2 = Σ_i m_i²
- Brute force solution: maintain counters for all distinct elements
- Sampling?
- n^(1/2) space
65. Estimating F2 [Alon, Matias, Szegedy]
- Pick a random hash function h: [n] → {+1, −1}
- h_i = h(i)
- Z = Σ_i m_i h_i
- Z is initially 0; add h_i every time you see i
- Estimator X = Z² (see the sketch below)
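Since E[Z²] = Σ_i m_i² + Σ_{i≠j} m_i m_j E[h_i h_j] = F2 when the signs are pairwise independent, a single counter gives an unbiased estimate. A minimal sketch (illustrative; a seeded hash stands in for the 4-wise independent family discussed later):

```python
import random

class AMSSketch:
    """Maintain Z = sum_i m_i h(i) for a random +/-1 hash h;
    Z**2 is an unbiased estimator of F2."""
    def __init__(self, seed=0):
        self.salt = random.Random(seed).getrandbits(64)
        self.z = 0

    def _sign(self, item):
        return 1 if random.Random(self.salt ^ hash(item)).random() < 0.5 else -1

    def update(self, item):
        self.z += self._sign(item)   # add h_i on every arrival of i

    def estimate(self):
        return self.z ** 2
```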
66. Analyzing the F2 estimator
67. Analyzing the F2 estimator
- Median of means gives a good estimator
68. What about the randomness?
- The analysis only requires 4-wise independence of the hash function h
- Pick h from a 4-wise independent family
- O(log n) space representation, efficient computation of h(i)
69. Properties of the F2 estimator
- Sketch of the data stream that allows computation of F2
- Linear function of the m_i
- Can be added, subtracted
- Given two streams with frequencies m_i, n_i:
- E[(Z1 − Z2)²] = Σ_i (m_i − n_i)²
- Estimate the L2 norm of the difference
- How about the L1 norm? The Lp norm?
70. Stable Distributions
- p-stable distribution D: if X1, X2, …, Xn are i.i.d. samples from D, then m1X1 + m2X2 + … + mnXn is distributed as ‖(m1, m2, …, mn)‖_p · X
- Defining property, up to a scale factor
- The Gaussian distribution is 2-stable
- The Cauchy distribution is 1-stable
- p-stable distributions exist for all 0 < p ≤ 2
71. Estimating the Lp norm [Indyk 00]
- Compute Z = m1X1 + m2X2 + … + mnXn
- Distributed as ‖(m1, m2, …, mn)‖_p · X
- Estimate the scale factor of the distribution from Z1, Z2, …, Zr
- Given i.i.d. samples from a p-stable distribution, how to estimate the scale?
- Compute a statistical property of the samples and compare it to that of the distribution
72. Estimating the scale factor
- Z_i distributed as ‖(m1, m2, …, mn)‖_p · X
- Estimate the scale factor of the distribution from Z1, Z2, …, Zr
- Mean(Z1, Z2, …, Zr) works for the Gaussian distribution
- For p = 1, the Cauchy distribution does not have a finite mean!
- Median(|Z1|, |Z2|, …, |Zr|) works in this case
- Note: the sketch is linear; nice properties follow
73. What about the randomness?
- Analog of 4-wise independence for the F2 estimator?
- Key insight: the p-stable sketch computation is done in O(log n) space
- Use a pseudo-random number generator which fools any space bounded computation [Nisan 90]
- The difference between using truly random and pseudo-random bits is negligible
- Random seed of polylogarithmic size, efficient generation of the required pseudo-random variables
74. Talk Outline
- Similarity preserving hash functions
- Similarity estimation
- Statistical properties of data streams
- Distinct elements
- Frequency moments, norm estimation
- Frequent items
75. Variants of the F2 estimator [Alon, Gibbons, Matias, Szegedy]
- Estimate the join size of two relations (m1, m2, …), (n1, n2, …): E[Z1 · Z2] = Σ_i m_i n_i
- Variance may be too high
76. Finding Frequent Items [Charikar, Chen, Farach-Colton 02]
- Goal:
- Given a data stream, return an approximate list of the k most frequent items in one pass and sub-linear space
- Applications:
- Analyzing search engine queries, network traffic
77. Finding Frequent Items
- a_i = the ith most frequent element
- m_i = its frequency
- If we had an oracle that gave us exact frequencies, we could find the most frequent items in one pass
- Solution:
- A data structure called a Count Sketch that gives good estimates of the frequencies of the high frequency elements at every point in the stream
78. Intuition
- Consider a single counter X with a single hash function h: a → {+1, −1}
- On seeing each element a_i, update the counter with X += h(a_i)
- X = Σ_i m_i h(a_i)
- Claim: E[X · h(a_i)] = m_i
- Proof idea: cross-terms cancel because of pairwise independence
79. Finding the max element
- Problem with the single counter scheme: the variance is too high
- Replace it with an array of t counters, using independent hash functions h1 … ht, each mapping a → {+1, −1}
80. Analysis of the array-of-counters data structure
- Expectation is still correct
- Claim: Variance of the final estimate < Σ m_i² / t
- Variance of each estimate < Σ m_i²
- Proof idea: cross-terms cancel
- Set t = O(log n · Σ m_i² / (ε m_1)²) to get the answer with high probability
- Proof idea: median of averages
81. Problem with the array-of-counters data structure
- Variance of the estimator is dominated by the contribution of large elements
- Estimates for important elements such as a_k are corrupted by larger elements (variance much more than m_k²)
- To avoid collisions, replace each counter with a hash table of b counters to spread out the large elements (see the sketch below)
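Putting slides 78-81 together (single ±1 counter, then t independent copies, then b buckets per copy) gives the Count Sketch. A minimal version (illustrative; seeded hashes stand in for the pairwise independent families):

```python
import random
import statistics

class CountSketch:
    """t hash tables of b counters; item i adds its sign s_j(i) to
    bucket h_j(i) of table j. The frequency estimate is the median
    over tables of s_j(i) * table[j][h_j(i)]."""
    def __init__(self, t=5, b=256, seed=0):
        rnd = random.Random(seed)
        self.salts = [rnd.getrandbits(64) for _ in range(t)]
        self.tables = [[0] * b for _ in range(t)]
        self.b = b

    def _bucket_and_sign(self, salt, item):
        r = random.Random(salt ^ hash(item))
        return r.randrange(self.b), (1 if r.random() < 0.5 else -1)

    def update(self, item, count=1):
        for salt, table in zip(self.salts, self.tables):
            bucket, sign = self._bucket_and_sign(salt, item)
            table[bucket] += sign * count

    def estimate(self, item):
        estimates = []
        for salt, table in zip(self.salts, self.tables):
            bucket, sign = self._bucket_and_sign(salt, item)
            estimates.append(sign * table[bucket])
        return statistics.median(estimates)
```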
82. In Conclusion
- Simple, powerful ideas are at the heart of several algorithmic techniques for large data sets
- Sketches of data tailored to applications
- Many interesting research questions
83. Content-Based Similarity Search
- Moses Charikar
- Princeton University
- Joint work with Qin Lv, William Josephson, Zhe Wang, Perry Cook, Matthew Hoffman, Kai Li
84. Motivation
- Massive amounts of feature-rich digital data
- Audio, video, digital photos, scientific sensor data
- Noisy, high-dimensional
- Traditional file systems/search tools inadequate
- Exact match
- Keyword-based search
- Annotations
- Need content-based similarity search
85. Motivation
- Recent progress in theoretical studies of sketches
- Compact data representations for estimating pairwise similarity/distance
- Compact data structures for high-quality and efficient content-based similarity search?
86. Compact representation
(Figure: a complex object mapped to a short bit-vector sketch)
- Distance measured by (weighted) l1 distance: d(x,y) = Σ_i w_i |x_i − y_i|
- Better still, Hamming distance between bit vectors
- Distance between sketches estimates distance between objects
- Several theoretical constructions of sketches for sets, vectors, earth mover distance (EMD)
87. Outline
- Motivation
- System architecture
- Implementation details
- Segmentation & feature extraction
- Sketch construction
- Filtering
- Indexing
- Performance evaluation
- Conclusions & future work
88. System Architecture
89. Similarity Search Engine Architecture
(Diagram: pre-processing pipeline and query-time pipeline)
90. Similarity Search Problem
- Similarity search: finding objects similar to a query object, i.e. containing similar features
- Object representation
- Distance function d(X, Y)
- Nearest neighbor query
- K-nearest neighbor (KNN)
- Approximate nearest neighbor (ANN)
91. Object Representation & Distance Function
- Earth Mover Distance (EMD)
92. Segmentation & Feature Extraction (1)
- Derive a small set of features that characterize the important attributes of a data object
- Data-dependent
93. Segmentation & Feature Extraction (1)
- Image data
- JSEG image segmentation tool
- Each segment is described by a 14-dimensional feature vector
- Color moments
- First three moments in HSV color space → 9-D vector
- Bounding box
- Aspect ratio, bounding box size, area ratio, region centroid → 5-D vector
- Segment weight ∝ square root of segment size
- l1 distance between segments, EMD between images
94. Segmentation & Feature Extraction (2)
- Audio data
- Phonetic segmentation & feature extraction using MARSYAS
- Each segment: 50 sliding windows × 6 MFCC parameters = 300 dimensions
- Segment weight ∝ segment length
- Segment distance: l1 distance
- Sentence distance: EMD
95. Segmentation & Feature Extraction (3)
- 3D shape data
- 32 decomposing spheres
- Spherical harmonic descriptor (SHD)
- Spherical harmonic coefficients up to order 16
- 32 × 17 = 544 dimensions
- l2 distance
96. Sketch Construction
- Sketches: tiny data structures that can be used to estimate properties of the original data
- High-dimensional feature vector → N×K bit vector
- Hamming distance ≈ original feature vector distance
- XOR groups of K bits → N bit vector
- Hamming distance ≈ thresholded distance (see the sketch below)
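A minimal sketch of the XOR folding step (illustrative; it assumes the N×K input bits already come from a similarity preserving bit sketch, as described above):

```python
import numpy as np

def xor_fold(bits, k):
    """Fold an (N*k)-length 0/1 integer vector into N bits by XORing
    groups of k consecutive bits. If each input bit differs with
    probability p, a folded bit differs with probability
    (1 - (1 - 2p)^k) / 2, which flattens (thresholds) the distance
    curve; larger k sharpens the threshold."""
    return np.bitwise_xor.reduce(bits.reshape(-1, k), axis=1)
```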
97. Thresholding distance by XORing bits
- The number of bits XORed controls the shape of the flattened curve
98. Filtering for Similarity Search
- EMD computation is expensive
- Filtering:
- Scans through the entire dataset
- Uses a much faster distance function to filter out bad answers
- Computes EMD for a much smaller candidate set
- Criteria for picking candidate objects:
- Has at least one segment that is close enough to one of the top segments of the query object
99. Indexing for Similarity Search
- Cover tree: a leveled tree where each level is a cover for the level beneath it; level i holds a set of points C_i
- Nesting: C_i ⊆ C_{i−1}
- Covering: for every node p ∈ C_{i−1}, there exists a node q ∈ C_i satisfying d(p,q) ≤ 2^i, and exactly one such q is a parent of p
- Separation: for all distinct nodes p, q ∈ C_i, d(p,q) > 2^i
100. Performance Evaluation
- Can we achieve high-quality similarity search results at high speed?
- How small can the sketches be as the metadata of the similarity search engine?
- What are the performance tradeoffs of:
- Brute-force
- Filtering
- Indexing
101. Benchmarks
- Search quality benchmark suite
- VARY image: 10k images, 32 sets
- TIMIT audio: 6300 sentences, 450 sets
- PSB shape: 1814 models, 92 sets
- Search speed benchmark suite
- Mixed image dataset: 600k images
- Mixed audio dataset: 60k sentences
- Mixed shape dataset: 40k shape models
102. Search Quality Metrics
- Given a query q with k similar objects (see the sketch below):
- First-tier: percentage of the similar objects returned within rank k
- Second-tier: percentage of the similar objects returned within rank 2k
- Average precision
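The tier metrics are straightforward to compute from a ranked result list; a minimal helper (illustrative, not from the slides):

```python
def tier_scores(ranked_ids, relevant_ids):
    """First-tier / second-tier: fraction of the k relevant objects
    appearing within the top k (resp. 2k) results, k = |relevant_ids|."""
    k = len(relevant_ids)
    relevant = set(relevant_ids)
    first = len(set(ranked_ids[:k]) & relevant) / k
    second = len(set(ranked_ids[:2 * k]) & relevant) / k
    return first, second
```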
103. Search Quality & Search Speed
104. Search Quality vs. Sketch Size
105. Brute Force, Filtering, Indexing
106. Conclusions & Future Work
- A general purpose content-based similarity search system
- High-quality similarity search with reasonably high speed
- Using sketches reduces metadata size
- Filtering & indexing speed up similarity search
- Future work:
- More efficient distance functions than EMD
- Further investigation of indexing data structures
- More data types
- Video, genomic microarray data, other sensor data