Cost-Aware Processing of Similarity Queries in Structured Overlays

Slides: 21

Provided by: anthonyeo

Transcript and Presenter's Notes

1
Cost-Aware Processing of Similarity Queries in
Structured Overlays

P2P06

2
Outline

Motivation
Motivating Examples
Similarity Measures and Processing
Physical Operators
Cost Model
Evaluation
Conclusions

3
Motivation

Most current DHT proposals support only exact
lookups for key-value pairs.
Similarity at both the data and the schema
levels is not available in DHTs so far.
We present an approach for efficient processing
of similarity selections and joins in a
structured overlay.

4
Motivation
Main Idea

Each tuple (oid, v1, ..., vn) of a given
relation schema R(A1, ..., An) is stored in the
form of n triples (oid, A1, v1), ..., (oid, An, vn).
Each triple (oid, Ai, vi) is inserted three times
into the DHT, using oid, Aivi (the
concatenation of Ai and vi), and vi as keys.
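
The storage scheme on this slide can be sketched in a few lines of Python. This is only an illustration: a plain dictionary stands in for the DHT, the function names are hypothetical, and the keys are built by simple string concatenation because the transcript does not show whether a separator is used.

```python
from collections import defaultdict


def tuple_to_triples(oid, schema, values):
    """Split one tuple into one (oid, attribute, value) triple per attribute."""
    return [(oid, attr, val) for attr, val in zip(schema, values)]


def dht_keys(triple):
    """Return the three keys for a triple: oid, attribute+value, and value."""
    oid, attr, val = triple
    # Assumption: plain concatenation without a separator.
    return [str(oid), f"{attr}{val}", str(val)]


def insert_tuple(dht, oid, schema, values):
    """Insert every triple of the tuple under each of its three keys."""
    for triple in tuple_to_triples(oid, schema, values):
        for key in dht_keys(triple):
            dht[key].append(triple)


# Stand-in for the overlay: maps a key to the triples stored under it.
dht = defaultdict(list)
insert_tuple(dht, 1, ("id", "name"), (123, "edna"))
```

A tuple with n attributes therefore produces 3n stored items, which is the price paid for being able to look data up by object, by attribute value, or by value alone.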

5
Motivating Examples
We use the following simple relations, similar to
IMDB (http://www.imdb.com).
The basic construct of a query in VQL is a
SELECT-WHERE block, similar to SQL.
6
Similarity Measures and Processing

Edit Distance
The edit distance of two strings s1 and s2 is the
minimum number of edit operations (insertions, deletions,
or substitutions of single characters) needed to
transform s1 into s2.
For instance, the edit distance of edna and
eden is 2.
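
As a concrete reference, here is a minimal sketch of the standard dynamic-programming computation of edit distance. It assumes unit-cost insertions, deletions, and substitutions (the usual Levenshtein definition); the function name is illustrative and not taken from the slides.

```python
def edit_distance(s1, s2):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]
```

For the example on this slide, `edit_distance("edna", "eden")` evaluates to 2.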

7
Similarity Measures and Processing

Rather than indexing only whole strings, we
additionally split them into q-grams (substrings
of length q) and index these as well.
For a triple t = (oid, Ai, vi) we store the
following in the DHT

8
Similarity Measures and Processing
As an example, consider a tuple
t = (123, edna) with schema s = (id, name).
In the original scheme, the following data items
would be stored in the DHT (we assume that the
triples' OID is 1).
Extending this by a 3-gram index on the instance
level of attribute name produces the following
additional data items to be stored.
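
The item lists for this example are not reproduced in the transcript. As a rough illustration of the q-gram extension, the sketch below splits a value into overlapping 3-grams and pairs each gram with the triple it came from. The entry layout and the function names are assumptions, and the sketch uses no padding characters at the string boundaries, which the original scheme may handle differently.

```python
def qgrams(s, q=3):
    """Return all overlapping substrings of length q (no boundary padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_entries(oid, attr, value, q=3):
    """Pair each q-gram of the value with the triple it was derived from."""
    # Hypothetical layout: (key, triple), with the q-gram acting as the DHT key.
    return [(gram, (oid, attr, value)) for gram in qgrams(str(value), q)]


entries = qgram_entries(1, "name", "edna")
```

Under these assumptions the value edna contributes two extra items, one for each of its 3-grams edn and dna.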
9
Physical Operators
Similarity Selection
1. Term-based processing
Very expensive, because a query can end up
involving the whole overlay if popular attributes
are distributed among all peers, as in Chord.
10
Physical Operators
Similarity Selection
2. q-gram-based processing
Exploits the additional q-gram indexes
described above. It incurs additional messages
for querying these indexes, but in most cases
saves a lot of bandwidth and message costs when
processing the queries.
11
Physical Operators
Similarity Join
To process such a join, three basic approaches
exist:
1. Process A and B separately and evaluate the
join on the data gathered locally.
2. Process the left side and apply a nested-loop
approach for querying similar data from the
right side.
3. Include both selections in the join
processing.
12
Physical Operators
Similarity Join (variant 2)
The actual join tuples are built in line 4.
Each tuple from the left side is joined with all
similar tuples from the right side.
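
The algorithm listing this slide refers to is not part of the transcript. The following is a local, single-process sketch of the nested-loop idea only, not the authors' operator: the inner scan stands in for the similarity selection that would be sent into the overlay for each left tuple, and all names, the dictionary-based tuples, and the sample data are assumptions.

```python
def edit_distance(s1, s2):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity_join(left, right, left_attr, right_attr, max_dist):
    """Join each left tuple with all right tuples within max_dist edits."""
    result = []
    for l in left:
        # In an overlay this would be a similarity selection on the right
        # side; here it is a plain local scan.
        matches = [r for r in right
                   if edit_distance(l[left_attr], r[right_attr]) <= max_dist]
        for r in matches:
            result.append((l, r))  # build the join tuple
    return result


left = [{"title": "eden"}, {"title": "alien"}]
right = [{"name": "edna"}, {"name": "aliens"}]
pairs = similarity_join(left, right, "title", "name", max_dist=2)
```

With a threshold of 2, eden pairs with edna and alien pairs with aliens, while the two cross combinations are rejected. The number of inner selections grows with the size of the left side, which is why the choice between the join variants depends on predicted cost.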
13
Cost Model
In a distributed environment the main cost
measures are the number of messages m and the
number of hops h needed to process queries.
14
Evaluation
1. Experimental setup: we used a network of
400 peers, each running on a PlanetLab node. Each
node inserted 10 strings of lengths between 8 and
45 characters, randomly chosen from a 4000-entry
sample of movie titles from the IMDB database.
15
Evaluation
2. Experimental results
16
Evaluation
2. Experimental results
17
Evaluation
2. Experimental results
18
Evaluation
2. Experimental results
19
Conclusions
1. We presented similarity queries in DHTs as a
step toward evolving structured overlays into a
viable infrastructure for large-scale public data
management and information retrieval.
2. The cost model allows us to predict the costs
of the available physical operators dynamically
and to choose the optimal one for the current
network conditions.
20
Thanks!!!
