Slide 1: Clustering Algorithms for Perceptual Image Hashing
IEEE Eleventh DSP Workshop, August 3, 2004
Vishal Monga, Arindam Banerjee, and Brian L. Evans
{vishal, abanerje, bevans}@ece.utexas.edu
Embedded Signal Processing Laboratory, Dept. of Electrical and Computer Engineering, The University of Texas at Austin
http://signal.ece.utexas.edu
Research supported by a gift from the Xerox Foundation
Slide 2: Hash Example
- Hash function: projects values from a set with a large (possibly infinite) number of members onto a set with a fixed, smaller number of members
- Irreversible
- Provides a short, simple representation of a large digital message
- Example: sum of the ASCII codes of the characters in a name, modulo a prime number N (here N = 7)
- Database name-search example (see the sketch below)
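A runnable version of this toy example, with the prime N = 7 as on the slide (the specific names are illustrative):

```python
def toy_hash(name: str, n: int = 7) -> int:
    """Toy hash: sum of the ASCII codes of the characters, modulo a prime n."""
    return sum(ord(ch) for ch in name) % n

# A database name search only needs to compare short hash values;
# distinct names may collide (the mapping is many-to-one and irreversible).
for name in ("Alice", "Bob", "Carol"):
    print(name, "->", toy_hash(name))
```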
Slide 3: Perceptual Hash Desirable Properties
- Perceptual robustness
- Fragility to distinct inputs
- Randomization
- Necessary in security applications to minimize vulnerability against malicious attacks
Slide 4: Hashing Framework
- Two-stage hash algorithm: feature extraction followed by feature vector compression
- Goal: retain perceptual significance
- Let (li, lj) denote vectors in the metric space V of feature vectors, and let 0 < ε < δ; it is then desired that perceptually close vectors (within ε) receive the same hash and perceptually distinct vectors (farther apart than δ) receive different hashes (stated formally below)
- Minimizing the average distance between clusters is inappropriate for this goal
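Stated formally, reconstructing the lost equation from the robustness and fragility properties on the previous slide (the notation H for the hash is assumed):

```latex
% Desired behavior of the hash H on feature vectors l_i, l_j in V,
% with thresholds 0 < \epsilon < \delta (reconstructed statement):
D(l_i, l_j) < \epsilon \;\Longrightarrow\; H(l_i) = H(l_j), \qquad
D(l_i, l_j) > \delta \;\Longrightarrow\; H(l_i) \neq H(l_j)
```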
Slide 5: Cost Function for Feature Vector Compression
- Define joint cost matrices C1 and C2 (n × n)
- n is the total number of vectors to be clustered; C(li), C(lj) denote the clusters that these vectors are mapped to
- Exponential cost: ensures a severe penalty when feature vectors that are far apart, i.e. perceptually distinct, are clustered together (a reconstructed form is given below)
- α > 0, Γ > 1 are algorithm parameters
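The matrices themselves did not survive extraction; the following is one plausible form consistent with the bullets above (the exact exponents and the placement of α and Γ are assumptions). C1 charges for separating a perceptually close pair, more heavily the closer the pair; C2 charges for merging a perceptually distinct pair, more heavily the farther apart it is.

```latex
% Assumed reconstruction of the joint cost matrices (metric D, 0 < \epsilon < \delta):
C_1(i,j) = \begin{cases}
  \alpha\,\Gamma^{\,\epsilon - D(l_i,l_j)} & \text{if } D(l_i,l_j) < \epsilon
      \text{ and } C(l_i) \neq C(l_j),\\
  0 & \text{otherwise,}
\end{cases}
\qquad
C_2(i,j) = \begin{cases}
  \alpha\,\Gamma^{\,D(l_i,l_j) - \delta} & \text{if } D(l_i,l_j) > \delta
      \text{ and } C(l_i) = C(l_j),\\
  0 & \text{otherwise.}
\end{cases}
```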
Slide 6: Cost Function for Feature Vector Compression
- Define S1 as the normalization factor for C1; S2 is defined similarly for C2
- Normalize to get C̃1 = C1/S1 and C̃2 = C2/S2
- Then, minimize the expected cost (see the sketch below)
- p(i) = p(li), p(j) = p(lj)
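A runnable sketch of the expected-cost computation under the assumed cost form above. Since the slide's definitions of S1 and S2 were lost, they are taken here to be the total attainable C1 and C2 penalties, which makes the two normalized terms comparable; the Euclidean metric is used for concreteness.

```python
import numpy as np

def expected_cost(X, labels, p, eps, delta, alpha=1.0, gamma=2.0):
    """Normalized expected cost E(C~1) + E(C~2) under the assumed cost form.

    X: (n, d) feature vectors; labels: (n,) cluster ids; p: (n,) probabilities.
    """
    X, labels, p = np.asarray(X, float), np.asarray(labels), np.asarray(p, float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    close, far = D < eps, D > delta
    same = labels[:, None] == labels[None, :]

    c1 = alpha * gamma ** (eps - D) * close   # penalty scale for close pairs
    c2 = alpha * gamma ** (D - delta) * far   # penalty scale for far pairs
    s1, s2 = c1.sum(), c2.sum()               # assumed normalizers S1, S2

    w = p[:, None] * p[None, :]               # joint probability weights p(i) p(j)
    e1 = (w * c1 * ~same).sum() / s1 if s1 else 0.0  # close pairs that were split
    e2 = (w * c2 * same).sum() / s2 if s2 else 0.0   # far pairs that were merged
    return e1 + e2
```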
Slide 7: Basic Clustering Algorithm
1. Obtain ε and δ; set k = 1. Select the data point with the highest probability mass and label it l1.
2. Form the first cluster by including all unclustered points lj such that D(l1, lj) < ε/2.
3. Set k = k + 1. Select the highest-probability data point lk among the unclustered points that lies sufficiently far from every cluster formed so far, where S is any cluster and C is the set of clusters formed up to this step (a sketch with an explicit threshold follows below).
4. Form the k-th cluster Sk by including all unclustered points lj such that D(lk, lj) < ε/2.
5. Repeat steps 3 and 4 until no more clusters can be formed.
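A runnable sketch of the basic algorithm. The slide's selection threshold in step 3 did not extract; here a distance greater than 3ε/2 from every existing cluster is assumed, which, together with the cluster radius ε/2, keeps distinct clusters at least ε apart, matching the observations on the next slide.

```python
import numpy as np

def basic_clustering(X, p, eps):
    """Greedy, probability-ordered clustering (basic algorithm, steps 1-5)."""
    X, p = np.asarray(X, float), np.asarray(p, float)
    labels = np.full(len(X), -1)   # -1 marks unclustered points
    k = 0
    while True:
        free = np.flatnonzero(labels == -1)
        # Candidates must lie > 3*eps/2 from every formed cluster (assumed
        # threshold; with radius eps/2 this keeps clusters at least eps apart).
        ok = [i for i in free
              if all(np.linalg.norm(X[i] - X[labels == c], axis=1).min() > 1.5 * eps
                     for c in range(k))]
        if not ok:
            break                  # leftovers are assigned by Approach 1 or 2
        lk = max(ok, key=lambda i: p[i])   # highest probability mass first
        d = np.linalg.norm(X[free] - X[lk], axis=1)
        labels[free[d < eps / 2]] = k      # k-th cluster: free points within eps/2
        k += 1
    return labels
```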
Slide 8: Observations
- For any (li, lj) in cluster Sk, the triangle inequality gives D(li, lj) ≤ D(li, lk) + D(lk, lj) < ε/2 + ε/2 = ε
- No errors up to this stage of the algorithm
- Each cluster is at least ε away from any other cluster
- Within each cluster, the maximum distance between any two points is at most ε
Slide 9: Approach 1
1. Select the data point l with the highest probability mass among the unclustered data points.
2. For each existing cluster Si, i = 1, 2, ..., k, compute a distance di between l and Si. Let S(δ) = {Si : di ≤ δ}.
3. IF S(δ) = ∅, THEN set k = k + 1 and Sk = {l} becomes a cluster of its own. ELSE, for each Si in S(δ), define a cost F(Si), where S̄i denotes the complement of Si, i.e., all clusters in S(δ) except Si; l is then assigned to the cluster S* = arg min F(Si). (A sketch with explicit choices for di and F follows below.)
4. Repeat steps 1 through 3 until all data points are exhausted.
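A sketch under stated assumptions, since the slide's formulas for di and F(Si) were lost: di is taken as the maximum distance from l to the points of Si, so di ≤ δ means absorbing l adds no fragility error, and F(Si) is taken as the robustness penalty l incurs against the other candidate clusters' points lying within ε of it.

```python
import numpy as np

def approach_1(X, p, eps, delta, labels):
    """Assign points left unclustered by the basic stage (Approach 1 sketch)."""
    X, p, labels = np.asarray(X, float), np.asarray(p, float), np.asarray(labels)
    for l in sorted(np.flatnonzero(labels == -1), key=lambda i: -p[i]):
        k = labels.max() + 1          # clusters are labeled 0 .. k-1
        # d_i: max distance from l to cluster S_i (assumed), so d_i <= delta
        # means joining S_i introduces no fragility (C2-type) error.
        d = [np.linalg.norm(X[l] - X[labels == c], axis=1).max() for c in range(k)]
        cands = [c for c in range(k) if d[c] <= delta]
        if not cands:                 # S(delta) empty: l is a cluster of its own
            labels[l] = k
            continue
        def F(c):
            # Assumed cost: robustness (C1-type) penalty against points of the
            # other candidate clusters lying within eps of l.
            cost = 0.0
            for o in cands:
                if o == c:
                    continue
                dd = np.linalg.norm(X[l] - X[labels == o], axis=1)
                cost += (p[l] * p[labels == o][dd < eps]).sum()
            return cost
        labels[l] = min(cands, key=F)
    return labels
```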
Slide 10: Approach 2
1. Select the data point l with the highest probability mass among the unclustered data points.
2. For each existing cluster Si, i = 1, 2, ..., k, define a cost F(Si) that combines both error terms through a parameter β ∈ [1/2, 1]. Here, S̄i denotes the complement of Si, i.e., all existing clusters except Si; l is then assigned to the cluster S* = arg min F(Si). (A sketch follows below.)
3. Repeat steps 1 and 2 until all data points are exhausted.
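A sketch of Approach 2 under the same caveat: F(Si) is assumed to be β times the robustness penalty against all other clusters plus (1 - β) times the fragility penalty inside Si, which reproduces the β = 1/2 and β = 1 behavior described in the summary. It assumes the basic stage formed at least one cluster.

```python
import numpy as np

def approach_2(X, p, eps, delta, labels, beta=0.5):
    """Assign leftover points, trading off both error types via beta in [1/2, 1].

    Assumed cost (reconstruction): F(S_i) = beta * (robustness penalty against
    all other clusters) + (1 - beta) * (fragility penalty inside S_i).
    """
    X, p, labels = np.asarray(X, float), np.asarray(p, float), np.asarray(labels)
    for l in sorted(np.flatnonzero(labels == -1), key=lambda i: -p[i]):
        k = labels.max() + 1          # clusters are labeled 0 .. k-1
        def F(c):
            cost1 = cost2 = 0.0
            for o in range(k):
                dd = np.linalg.norm(X[l] - X[labels == o], axis=1)
                po = p[labels == o]
                if o == c:   # joining c merges l with far points: C2-type error
                    cost2 += (p[l] * po[dd > delta]).sum()
                else:        # splits l from close points elsewhere: C1-type error
                    cost1 += (p[l] * po[dd < eps]).sum()
            return beta * cost1 + (1 - beta) * cost2
        labels[l] = min(range(k), key=F)
    return labels
```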
Slide 11: Summary
- Approach 1
  - Tries to minimize the normalized robustness cost E(C̃1) conditioned on the normalized fragility cost E(C̃2) = 0
- Approach 2
  - Smoothly trades off the minimization of E(C̃1) vs. E(C̃2) via the parameter β
  - β = 1/2 ⇒ joint minimization
  - β = 1 ⇒ exclusive minimization of E(C̃1)
- Final hash length determined automatically!
- Given by ⌈log2 k⌉ bits, where k is the number of clusters formed
- The proposed clustering can compress feature vectors in any metric space, e.g. Euclidean, Hamming, or Levenshtein
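A one-line check of the hash-length formula (⌈log2 k⌉ is the reading assumed for the bullet above):

```python
import math

def hash_length_bits(k: int) -> int:
    """Bits needed to assign each of k clusters a distinct binary index."""
    return max(1, math.ceil(math.log2(k)))

print(hash_length_bits(1000))  # 10 bits suffice for 1000 clusters
```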
Slide 12: Clustering Results
- Compress binary feature vectors of length L = 240 bits
- Final hash length: 46 bits, with Approach 2, β = 1/2
- The value of the cost function is orders of magnitude lower for the proposed clustering
Slide 13: Conclusion and Future Work
- Two-stage framework for image hashing
  - Feature extraction followed by feature vector compression
  - The second stage is media independent
- Clustering algorithms for compression
  - Novel cost function for hashing applications
  - Applicable to feature vectors in any metric space
  - Trade-offs facilitated between robustness and fragility
  - Final hash length determined automatically
- Future work
  - Randomized clustering for secure hashing
  - Information-theoretically secure hashing