Similarity Searching in High Dimensions via Hashing

1
Similarity Searching in High Dimensions via Hashing
  • Paper by
  • Aristides Gionis, Piotr Indyk, Rajeev Motwani

2
  • Similarity searching via hashing is intended for use in high dimensions.
  • The idea behind the approach is that, since the selection of features and the choice of distance metric are rather heuristic, finding an approximate nearest neighbor should suffice for most practical purposes.

3
  • The basic idea is to hash the points of the database so that the probability of collision is much higher for objects that are close to each other than for those that are far apart.
  • The need arises from the so-called curse of dimensionality in large databases.
  • For large dimension, the known exact searching techniques effectively reduce to a linear scan of the database.

4
  • The similarity search problem involves finding the nearest (most similar) object to a given query in a given collection of objects.
  • Typically the objects of interest are represented as points in R^d, and a distance metric is used to measure the similarity of objects.
  • The basic problem is to build an index that supports similarity searching for query objects.

5
  • The problem arises because the existing methods are not entirely satisfactory for large d.
  • The approach is based on the observation that, for most applications, an exact answer is not necessary.
  • It also provides the user with a time-quality trade-off.
  • These statements rest on the assumption that searching for an approximate answer is faster than searching for the exact one.

6
  • The technique is to use locality-sensitive hashing instead of space-partitioning methods.
  • The idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects that are close to each other.
  • Then one can determine near neighbors by hashing the query point and retrieving the elements stored in the buckets containing it.

7
  • Locality-sensitive hashing (LSH) previously achieved worst-case O(dn^(1/ε)) time for approximate nearest neighbor over an n-point database.
  • In the presented paper, the worst-case running time is improved by the new technique to O(dn^(1/(1+ε))), a significant improvement.

8
Preliminaries
  • l_p^d denotes the space R^d under the l_p norm, i.e., the length of a vector (x1, ..., xd) is defined as (|x1|^p + ... + |xd|^p)^(1/p).
  • Further, d(p,q) denotes the distance between the points p and q in l_p^d.
  • We use H^d to represent the Hamming metric space of dimension d.
  • We use d_H(p,q) to denote the Hamming distance, i.e., the number of bits on which p and q differ.
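The two metrics defined above can be sketched directly; a minimal illustration, with all function names chosen here for exposition:

```python
def lp_distance(p, q, ell=2):
    """Distance in l_p^d: (|x1 - y1|^p + ... + |xd - yd|^p)^(1/p)."""
    return sum(abs(a - b) ** ell for a, b in zip(p, q)) ** (1.0 / ell)

def hamming_distance(p, q):
    """d_H(p, q): the number of coordinates on which the binary vectors differ."""
    return sum(1 for a, b in zip(p, q) if a != b)

# Example: two points in l_1^3, and two binary vectors in H^4.
print(lp_distance((1, 2, 3), (4, 2, 3), ell=1))      # -> 3.0
print(hamming_distance((0, 1, 1, 0), (1, 1, 0, 0)))  # -> 2
```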

9
  • The general definition of the problem (KNNS) is to find the K nearest points in the given database, where K > 1.
  • Our algorithm generalizes to finding the K (> 1) approximate nearest neighbors.
  • Here we wish to find K points p1, ..., pK such that the distance of p_i to the query q is at most (1+ε) times the distance from the i-th nearest point to q.

10
The Algorithm
  • The distance is measured in the l1 norm.
  • All coordinates of the points in P are assumed to be positive integers.

11
Locality Sensitive Hashing
  • The new algorithm is in many respects more natural than the earlier ones: it does not require a bucket to store only a single point.
  • It has a better running time.
  • The analysis generalizes to the case of secondary memory.

12
  • Let C be the largest coordinate over all points in P.
  • Then we can embed P into the Hamming cube H^{d'} with d' = C·d, by transforming each point p = (x1, ..., xd) into a binary vector
  • v_p = Unary_C(x1) ... Unary_C(xd),
  • where Unary_C(x) denotes the unary representation of x, i.e., a sequence of x zeroes followed by C − x ones.
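The embedding above preserves l_1 distances as Hamming distances. A short sketch, with the zeroes-then-ones convention from the slide:

```python
def unary(x, C):
    """Unary_C(x): a sequence of x zeroes followed by C - x ones."""
    return [0] * x + [1] * (C - x)

def embed(p, C):
    """Embed p = (x1, ..., xd) into the Hamming cube H^{d'}, d' = C * d."""
    bits = []
    for x in p:
        bits.extend(unary(x, C))
    return bits

# l_1 distance in the original space equals Hamming distance after embedding.
p, q, C = (3, 1), (1, 4), 5
vp, vq = embed(p, C), embed(q, C)
d1 = sum(abs(a - b) for a, b in zip(p, q))     # |3-1| + |1-4| = 5
dH = sum(1 for a, b in zip(vp, vq) if a != b)  # also 5
print(d1, dH)  # -> 5 5
```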

13
  • For an integer l, choose l subsets I_1, ..., I_l of {1, ..., d'}.
  • Let p_I denote the projection of the vector p onto the coordinate positions in I, obtained by concatenating the bits in those positions.
  • Denote g_j(p) = p_{I_j}.
  • For preprocessing, we store each p ∈ P in the buckets g_j(p), for j = 1, ..., l.
  • As the total number of buckets may be large, we compress the buckets by resorting to standard hashing.
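The preprocessing step can be sketched as follows. The parameter k (the number of bit positions per subset) and all names are illustrative; the slides do not fix how the subsets I_j are chosen, so random sampling is assumed here:

```python
import random

def unary_embed(p, C):
    """Map p = (x1, ..., xd) to C*d bits: x zeroes followed by C - x ones."""
    bits = []
    for x in p:
        bits.extend([0] * x + [1] * (C - x))
    return bits

def build_index(points, l, k, C, seed=0):
    """Pick l subsets I_1, ..., I_l of the d' = C*d bit positions
    (k positions each) and store each point p in the bucket keyed by
    g_j(p), its projection onto I_j, for j = 1, ..., l."""
    rng = random.Random(seed)
    d_prime = C * len(points[0])
    subsets = [sorted(rng.sample(range(d_prime), k)) for _ in range(l)]
    tables = [{} for _ in range(l)]
    for p in points:
        v = unary_embed(p, C)
        for I, table in zip(subsets, tables):
            key = tuple(v[i] for i in I)  # g_j(p)
            table.setdefault(key, []).append(p)
    return subsets, tables

# Index four 2-d points with coordinates in {0, ..., 4} using l = 3 indices.
subsets, tables = build_index([(1, 1), (1, 2), (4, 0), (0, 4)], l=3, k=4, C=5)
```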

14
  • Thus we use two levels of hashing.
  • The LSH function maps a point p into the bucket g_j(p), while a standard hash function maps the contents of these buckets into a hash table of size M.
  • If a bucket in a given index is full, a new point is not added to it, since with very high probability it will be stored under another index.
  • This saves the overhead of maintaining a link structure.
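The second level can be sketched as a fixed-capacity hash table with no overflow chains; the table size M, the capacity B, and the polynomial hash are all illustrative choices, not from the slides:

```python
M = 97  # size of the second-level hash table (illustrative)
B = 4   # bucket capacity; a full bucket simply rejects new points

table = [[] for _ in range(M)]

def slot(key):
    """Standard hash mapping a bucket id g_j(p), a tuple of bits, into [0, M)."""
    h = 0
    for bit in key:
        h = (h * 31 + bit) % M
    return h

def insert(key, p):
    bucket = table[slot(key)]
    if len(bucket) < B:          # no overflow chains: if the bucket is full,
        bucket.append((key, p))  # rely on p being stored under another index
```

Dropping overflowing points here is exactly what removes the link-structure overhead: each slot is a fixed-size array, and redundancy across the l indices compensates for the occasional rejected insertion.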

15
  • To process a query q, we search the buckets g_1(q), ..., g_l(q) until we either encounter at least c·l points or use all l indices.
  • The number of disk accesses is therefore always bounded by l, the number of indices.
  • Let p_1, ..., p_t be the points encountered in the process.
  • As output we return the K points nearest to q, or fewer in case the search did not find that many.
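The query procedure above can be sketched as follows; the embedding and index layout match the earlier slides, and the tiny hand-built index in the example is purely illustrative:

```python
def unary_embed(p, C):
    """Map p = (x1, ..., xd) to C*d bits: x zeroes followed by C - x ones."""
    bits = []
    for x in p:
        bits.extend([0] * x + [1] * (C - x))
    return bits

def query(q, subsets, tables, C, K, c, l):
    """Search buckets g_1(q), ..., g_l(q); stop once at least c*l candidate
    points are collected or all l indices are used; return the K candidates
    nearest to q in the l_1 norm (or fewer, if that many were not found)."""
    v = unary_embed(q, C)
    candidates = []
    for I, table in zip(subsets, tables):
        candidates.extend(table.get(tuple(v[i] for i in I), []))
        if len(candidates) >= c * l:
            break
    candidates = list(dict.fromkeys(candidates))  # drop duplicate points
    candidates.sort(key=lambda p: sum(abs(a - b) for a, b in zip(p, q)))
    return candidates[:K]

# Tiny hand-built index: one subset covering the first three bit positions.
C, subsets = 3, [[0, 1, 2]]
tables = [{(1, 1, 1): [(0,)], (0, 0, 1): [(2,)]}]
print(query((0,), subsets, tables, C, K=1, c=4, l=1))  # -> [(0,)]
```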

16
  • The principle behind our method is that the probability of collision of two points p and q is closely related to the distance between them.
  • In particular, the larger the distance, the smaller the collision probability.