Cost-Aware Processing of Similarity Queries in Structured Overlays

Provided by: anthonyeo

1
Cost-Aware Processing of Similarity Queries in
Structured Overlays
  • P2P06

2
Outline
  • Motivation
  • Motivating Examples
  • Similarity Measures and Processing
  • Physical Operators
  • Cost Model
  • Evaluation
  • Conclusions

3
Motivation
  • Most current DHT proposals support only exact
    lookups for key-value pairs.
  • Similarity search, at both the data and the schema
    level, is not yet available in DHTs.
  • We present an approach for efficient processing
    of similarity selections and joins in a
    structured overlay.

4
Motivation
Main Idea
  • Each tuple (oid, v1, ..., vn) of a given
    relation schema R(A1, ..., An) is stored in the
    form of n triples (oid, A1, v1), ..., (oid, An, vn).
  • Each triple (oid, Ai, vi) is inserted three times
    into the DHT, using the oid, Aivi (the
    concatenation of Ai and vi), and vi as keys.
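
The storage scheme above can be illustrated with a minimal sketch. A plain Python dict stands in for the DHT, and the "#" separator in the concatenated key is an assumption made here for readability; the slide does not show the actual key format. The function names are hypothetical.

```python
def vertical_triples(oid, schema, values):
    """Split one tuple into n (oid, attribute, value) triples."""
    return [(oid, attr, val) for attr, val in zip(schema, values)]


def dht_keys(triple, sep="#"):
    """Return the three keys under which one triple is inserted.

    The separator is an assumption; the slides only say that the
    attribute name and the value are concatenated.
    """
    oid, attr, val = triple
    return [str(oid), f"{attr}{sep}{val}", str(val)]


def insert_tuple(dht, oid, schema, values):
    """Insert every triple of a tuple under each of its three keys."""
    for triple in vertical_triples(oid, schema, values):
        for key in dht_keys(triple):
            dht.setdefault(key, []).append(triple)


# A dict stands in for the overlay; a real DHT would hash each key to a peer.
dht = {}
insert_tuple(dht, 1, ("id", "name"), (123, "edna"))
```

Keying by oid supports reassembling a tuple, keying by attribute plus value supports exact selections on one attribute, and keying by value alone supports lookups where the attribute is unknown.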

5
Motivating Examples
We use the following simple relations, similar to
IMDB (http://www.imdb.com).
The basic construct of a query in VQL is a
SELECT-WHERE block, similar to SQL.
6
Similarity Measures and Processing
  • Edit Distance
  • The edit distance of two strings s1 and s2 is the
    minimum number of single-character edit operations
    (insertions, deletions, substitutions) needed to
    transform s1 into s2.
  • For instance, the edit distance of edna and
    eden is 2.
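
As a small illustration of the measure, the following sketch computes the standard Levenshtein distance with dynamic programming. It assumes unit cost for insertions, deletions, and substitutions; the slides do not state which cost weights the authors use.

```python
def edit_distance(s1, s2):
    """Levenshtein distance with unit-cost insert, delete, substitute."""
    # prev[j] holds the distance between the processed prefix of s1
    # and the first j characters of s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


distance = edit_distance("edna", "eden")
```

For the example pair from the slide, `distance` is 2: one insertion of "e" and one deletion of "a".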

7
Similarity Measures and Processing
  • Rather than indexing only whole strings, we
    additionally split them into q-grams and index
    those as well.
  • For a triple t = (oid, Ai, vi) we store the
    following in the DHT:

8
Similarity Measures and Processing
As an example, consider a tuple t = (123, edna)
with schema s = (id, name).
In the original scheme the following data items
would be stored in the DHT (we assume that the
triple's OID is 1):
Extending this by a 3-gram index on instance
level of attribute name produces the following
additional data items to be stored:
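
The stored items themselves are not preserved in this transcript. As a rough sketch of the idea only, the snippet below splits the value into overlapping 3-grams without padding and derives one extra key per q-gram. The key format (attribute name, "#", q-gram) and the absence of padding are assumptions, not taken from the slides.

```python
def qgrams(s, q=3):
    """Overlapping substrings of length q (no padding; empty if s is shorter)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_entries(triple, q=3, sep="#"):
    """Extra (key, triple) pairs for an instance-level q-gram index.

    The key layout is hypothetical and only illustrates the principle.
    """
    oid, attr, val = triple
    return [(f"{attr}{sep}{g}", triple) for g in qgrams(str(val), q)]


grams = qgrams("edna")
entries = qgram_entries((1, "name", "edna"))
```

Under these assumptions the value edna yields the two 3-grams edn and dna, so two additional items would be stored for the triple.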
9
Physical Operators
Similarity Selection
1. Term-based processing
Very expensive, because a query can end up
involving the whole overlay if popular attributes
are distributed among all peers, as in Chord.
10
Physical Operators
Similarity Selection
2. q-gram-based processing
Exploits the additional q-gram indexes
described above. Querying these indexes incurs
additional messages, but in most cases it saves a
lot of bandwidth and message costs for processing
the queries.
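
A simplified local sketch of q-gram-based selection follows: look up each q-gram of the query string in the index, collect the candidates, and verify them with the edit distance. The in-memory index, the function names, and the choice q = 2 are assumptions for illustration; in the overlay each index lookup would be a DHT message. Note that this simplified version only finds strings that share at least one q-gram with the query.

```python
from collections import defaultdict


def qgrams(s, q):
    """Overlapping substrings of length q, without padding."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(s1, s2):
    """Unit-cost Levenshtein distance."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def build_index(strings, q):
    """Map each q-gram to the set of strings containing it."""
    index = defaultdict(set)
    for s in strings:
        for g in qgrams(s, q):
            index[g].add(s)
    return index


def similar(index, query, k, q):
    """Candidates share a q-gram with the query; verify with edit distance."""
    candidates = set()
    for g in qgrams(query, q):
        candidates |= index.get(g, set())  # one index lookup per q-gram
    return sorted(s for s in candidates if edit_distance(query, s) <= k)


index = build_index(["edna", "eden", "anna", "edward"], q=2)
result = similar(index, "edna", k=2, q=2)
```

Only the candidates returned by the index are compared against the query, which is where the savings over term-based processing come from.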
11
Physical Operators
Similarity Join
To process such a join, three basic approaches
exist:
1. Process A and B separately and evaluate the
join on the data gathered locally.
2. Process the left side and apply a nested-loop
approach for querying similar data from the right
side.
3. Include both selections into the join processing.
12
Physical Operators
Similarity Join (variant 2)
The actual join tuples are built in line 4
Each tuple from the left side is joined with all
similar tuples from the right side
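
The algorithm listing from the slide is not preserved here. As a hedged stand-in, this sketch shows the nested-loop idea in memory: for each left tuple, a similarity selection is issued against the right side and every match becomes a join tuple. `right_lookup` is a hypothetical callback representing the selection in the overlay, and the data values are made up for illustration.

```python
def edit_distance(s1, s2):
    """Unit-cost Levenshtein distance."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def similarity_join(left, right_lookup, k):
    """Nested-loop join: one similarity selection per left value."""
    joined = []
    for l in left:
        for r in right_lookup(l, k):  # stands in for a query into the overlay
            joined.append((l, r))     # build the join tuple
    return joined


# Local stand-in for the right side; a real lookup would contact other peers.
right_side = ["eden", "anne", "bob"]


def local_lookup(value, k):
    return [r for r in right_side if edit_distance(value, r) <= k]


pairs = similarity_join(["edna", "anna"], local_lookup, k=2)
```

The number of right-side selections grows with the size of the left input, which is why the message and hop counts of this variant depend strongly on how selective the left side is.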
13
Cost Model
In a distributed environment the main cost
measures are the number of messages m and the
number of hops h needed to process queries.
14
Evaluation
1. Experimental setup: we used a network of
400 peers, each running on a PlanetLab node. Each
node inserted 10 strings of lengths between 8 and
45 characters, randomly chosen from a 4000-entry
sample of movie titles from the IMDB database.
15
Evaluation
2. Experimental results
16
Evaluation
2. Experimental results
17
Evaluation
2. Experimental results
18
Evaluation
2. Experimental results
19
Conclusions
1. We presented similarity queries in DHTs as a
step toward evolving structured overlays into a
viable infrastructure for large-scale public data
management and information retrieval.
2. The cost model allows us to predict the costs
of the available physical operators dynamically
and to choose the optimal one for the current
network conditions.
20
Thanks!!!