Similarity Search in High Dimensions via Hashing

1
Similarity Search in High Dimensions via Hashing
  • Aristides Gionis, Piotr Indyk, Rajeev Motwani
  • Presented by Srujana Merugu and Yousuf Ahmed

2
Outline
  • Introduction
  • Problem Description
  • Key Idea
  • Experiments and Results
  • Conclusions

3
Introduction
  • Similarity search over high-dimensional data
  • Image databases, document collections, etc.
  • Curse of dimensionality
  • All space-partitioning techniques degrade to
    linear search in high dimensions
  • Exact vs. approximate answers
  • An approximate answer may be good enough and
    much faster
  • Time-quality trade-off

4
Problem Description
  • ? - Nearest Neighbor Search (? - NNS)
  • Given a set P of points in a normed space ,
    preprocess P so as to efficiently return a point
    p ? P for any given query point q, such that
  • dist(q,p) ? (1 ? ) ? min r ? P dist(q,r)
  • Generalizes to K- nearest neighbor search ( K gt1)
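The guarantee in this definition can be checked directly by brute force on small data. A minimal sketch in Python (the function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def satisfies_eps_nns(q, p, P, eps):
    """True iff p is a valid answer to the eps-NNS query q over point
    set P: dist(q, p) <= (1 + eps) * min over r in P of dist(q, r)."""
    dists = np.linalg.norm(P - q, axis=1)   # distance from q to every point
    return np.linalg.norm(q - p) <= (1 + eps) * dists.min()

# Toy check: the exact nearest neighbor always satisfies the guarantee,
# while a distant point does not (for small eps).
P = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
q = np.array([1.0, 1.0])
nearest = P[np.argmin(np.linalg.norm(P - q, axis=1))]
assert satisfies_eps_nns(q, nearest, P, eps=0.0)
assert not satisfies_eps_nns(q, P[2], P, eps=0.1)
```

Any point within a (1 + ε) factor of the true nearest-neighbor distance is an acceptable answer, which is what gives LSH its room to trade accuracy for speed.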

5
Problem Description
6
Key Idea
  • Locality Sensitive Hashing ( LSH ) instead of
    space partitioning to get sub-linear dependence
    on the data-size for high-dimensional data
  • Preprocessing
  • Hash the data-point using several LSH functions
    so that probability of collision is higher for
    closer objects
  • Querying
  • Hash query point and retrieve elements in the
    buckets containing the query point

7
Algorithm: Preprocessing
  • Input
  • Set of n points p1, ..., pn
  • L (number of hash tables)
  • Output
  • Hash tables Ti, i = 1, 2, ..., L
  • For each i = 1, 2, ..., L
  • Initialize Ti with a random hash function
    gi(·)
  • For each i = 1, 2, ..., L
  • For each j = 1, 2, ..., n
  • Store point pj in bucket gi(pj) of hash table
    Ti
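The preprocessing loop above can be sketched in Python. The function name is illustrative, and random-hyperplane sign hashes stand in for the hash family (the paper itself samples bit coordinates in Hamming space):

```python
import numpy as np

def build_tables(points, L=4, k=8, seed=0):
    """Preprocessing: hash every point into L hash tables. Each g_i
    concatenates k single-bit hashes; here the bits are signs of random
    projections (an illustrative stand-in for the paper's family)."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    planes = [rng.standard_normal((k, d)) for _ in range(L)]  # defines g_1..g_L
    tables = [{} for _ in range(L)]
    for i in range(L):
        keys = points @ planes[i].T > 0       # g_i(p_j): k sign bits per point
        for j, key in enumerate(map(tuple, keys)):
            tables[i].setdefault(key, []).append(j)  # store p_j in bucket g_i(p_j)
    return planes, tables
```

Each of the n points lands in exactly one bucket per table, so preprocessing costs O(nL) hash evaluations and O(nL) extra storage, matching the storage overhead noted in the conclusions.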

8
LSH - Algorithm
(figure: each point pi of P is hashed by g1, g2, ..., gL into buckets of hash tables T1, T2, ..., TL)
9
Algorithm: ε-NNS Query
  • Input
  • Query point q
  • K (number of approximate nearest neighbors)
  • Access
  • Hash tables Ti, i = 1, 2, ..., L
  • Output
  • Set S of K (or fewer) approximate nearest neighbors
  • S ← ∅
  • For each i = 1, 2, ..., L
  • S ← S ∪ {points found in bucket gi(q) of hash
    table Ti}
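Putting the preprocessing and query slides together, an end-to-end sketch (the class name is illustrative, and random-hyperplane hashes again stand in for the paper's bit-sampling family):

```python
import numpy as np

class LSHIndex:
    """Minimal sketch of the two slide algorithms: build L hash tables,
    then answer a query from the union of the buckets g_i(q)."""

    def __init__(self, points, L=8, k=6, seed=0):
        rng = np.random.default_rng(seed)
        self.points = points
        # One set of k random hyperplanes per table defines g_i
        self.planes = [rng.standard_normal((k, points.shape[1]))
                       for _ in range(L)]
        self.tables = []
        for W in self.planes:
            table = {}
            for j, key in enumerate(map(tuple, points @ W.T > 0)):
                table.setdefault(key, []).append(j)
            self.tables.append(table)

    def query(self, q, K=1):
        # S <- union over i of the points in bucket g_i(q) of table T_i
        S = set()
        for W, table in zip(self.planes, self.tables):
            S.update(table.get(tuple(W @ q > 0), []))
        # Rank candidates by true distance and return up to K of them;
        # fewer than K may be found (this is the "miss" case).
        return sorted(S, key=lambda j: np.linalg.norm(self.points[j] - q))[:K]

# Querying with a stored point always finds it: q collides with itself
# in every table, so its own index is returned first (distance 0).
pts = np.random.default_rng(3).standard_normal((100, 16))
index = LSHIndex(pts)
assert index.query(pts[0], K=1) == [0]
```

Note that only candidates from the L retrieved buckets are ever examined, which is where the sub-linear query time comes from.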

10
LSH - Analysis
  • Family H of (r1, r2, p1, p2) sensitive functions,
    hi(.)
  • dist(p,q) lt r1 ? ProbH h(q) h(p) ? p1
  • dist(p,q) ? r2 ? ProbH h(q) h(p) ? p2
  • LSH functions gi(.) h1(.) hk(.)
  • For a proper choice of k and l, a simpler
    problem, (r,?)-Neighbor, and hence the actual
    problem can be solved
  • Query Time O(d ?n1/(1?) )
  • d dimensions , n data size
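The sensitivity condition can be illustrated empirically. For the random-hyperplane family h_a(x) = sign(a·x), again a stand-in for the paper's family, a single hash collides with probability 1 - angle(p,q)/π, so nearby points should collide noticeably more often than distant ones:

```python
import numpy as np

rng = np.random.default_rng(1)
d, trials = 16, 2000

def collision_rate(p, q):
    """Fraction of random-hyperplane hashes h_a(x) = sign(a . x)
    on which p and q collide; estimates Prob_H[h(p) = h(q)]."""
    a = rng.standard_normal((trials, d))
    return np.mean((a @ p > 0) == (a @ q > 0))

p = rng.standard_normal(d)
close = p + 0.1 * rng.standard_normal(d)   # small angle to p
far = -p + 0.1 * rng.standard_normal(d)    # nearly opposite direction
p1_hat, p2_hat = collision_rate(p, close), collision_rate(p, far)
assert p1_hat > p2_hat   # closer pair collides more often: locality-sensitive
```

Concatenating k such hashes into gi drives the far-pair collision rate down to roughly p2^k, while using L independent tables keeps the near-pair retrieval probability high; balancing the two yields the n^(1/(1+ε)) query bound.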

11
LSH - Analysis
12
Experiments
  • Data Sets
  • Color images from COREL Draw library (20,000
    points,dimensions up to 64)
  • Texture information of aerial photographs
    (270,000 points, dimensions 60)
  • Evaluation
  • Speed, Miss Ratio, Error () for various data
    sizes, dimensions, and K values
  • Compare Performance with SR-Tree ( Spatial Data
    Structure )

13
Performance Measures
  • Speed
  • Number of disk block accesses in order to answer
    the query ( hash tables)
  • Miss Ratio
  • Fraction of cases when less than K points are
    found for K-NNS
  • Error
  • Average of fractional error in distance to point
    found by LSH as compared to nearest neighbor
    distance taken over entire set of queries
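Under one plausible reading of these definitions, the two quality measures can be computed as follows (helper names and the toy numbers are illustrative):

```python
import numpy as np

def miss_ratio(results, K):
    """Fraction of queries for which LSH returned fewer than K points."""
    return np.mean([len(r) < K for r in results])

def avg_error(lsh_dists, nn_dists):
    """Average fractional error of the LSH answer's distance versus the
    true nearest-neighbor distance, over all queries."""
    return np.mean([(l - n) / n for l, n in zip(lsh_dists, nn_dists)])

# Toy example with hypothetical per-query results, K = 2
results = [[3, 7], [12], []]            # indices returned for each query
assert abs(miss_ratio(results, K=2) - 2/3) < 1e-12
assert abs(avg_error([1.1, 2.0], [1.0, 2.0]) - 0.05) < 1e-12
```

An error of 0 means LSH found the exact nearest neighbor; the experiments trade a small positive error for large speedups over the SR-tree.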

14
Speed vs. Data Size
15
Speed vs. Dimension
16
Speed vs. Nearest Neighbors
17
Speed vs. Error
18
Miss Ratio vs. Data Size
19
Conclusion
  • Better Query Time than Spatial Data Structures
  • Scales well to higher dimensions and larger data
    size ( Sub-linear dependence )
  • Predictable running time
  • Extra storage over-head
  • Inefficient for data with distances concentrated
    around average

20
Future Work
  • Investigate hybrid data structures obtained by
    merging tree-based and hash-based structures
  • Make use of the structure of the data set to
    systematically obtain LSH functions
  • Explore other applications of LSH-type techniques
    to data mining

21
Questions?