Title: Integrating Semantics-Based Access Mechanisms with P2P File Systems
1 Integrating Semantics-Based Access Mechanisms with P2P File Systems
- Yingwu Zhu, Honghao Wang and Yiming Hu
2 Outline
- Background
- System Design
- Related Work
- Conclusions and Future Work
3 Background
- Current P2P file systems (e.g., CFS and PAST)
- Layer FS functionality on a distributed hash table (DHT), e.g., Chord or Pastry
- Do not support semantics-based access
- Because DHTs support only exact-match lookups
4 Background
Layer | Responsibility
FS | Stores and retrieves file objects to and from the DHT; presents a file system interface to applications and users
DHT | Supports a hash-table interface of get(fileID) and put(fileID, file)
Table: Software layering in a P2P file system
5 Motivation
- A problem of P2P file systems
- Support only exact-match lookups given a file object identifier fileID
- get(fileID) retrieves the file corresponding to the fileID
- put(fileID, file) stores the file with the fileID as the DHT key
- Extending exact-match lookups to semantic access is non-trivial
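As a minimal sketch of the hash-table interface described above, the following Python stands in for a DHT with a plain dictionary. The class name `ToyDHT`, the helper `make_file_id`, and the choice of a SHA-1 digest as the fileID are illustrative assumptions, not the API of CFS or PAST. The point is only to show why exact-match lookup does not extend to similar content: a one-character change produces an unrelated key.

```python
import hashlib


class ToyDHT:
    """In-memory stand-in for a DHT (hypothetical; a real DHT spreads keys over peers)."""

    def __init__(self):
        self._table = {}

    def put(self, file_id, file):
        # Store the file under its identifier, which serves as the DHT key.
        self._table[file_id] = file

    def get(self, file_id):
        # Exact-match lookup: returns None unless the key matches exactly.
        return self._table.get(file_id)


def make_file_id(content: bytes) -> str:
    # Assumption for illustration: derive the fileID from a SHA-1 digest of the content.
    return hashlib.sha1(content).hexdigest()


dht = ToyDHT()
doc = b"peer-to-peer file systems"
dht.put(make_file_id(doc), doc)

exact = dht.get(make_file_id(doc))                         # same key, so the file is found
near = dht.get(make_file_id(b"peer-to-peer file system"))  # similar content, different key
```

Here `exact` holds the stored file while `near` is `None`, even though the two contents differ by a single character.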
6 Motivation
- A challenge to P2P file systems
- Provide convenient access to vast amounts of information
- E.g., provide semantics-based search capabilities to efficiently locate semantically close files for browsing, purging, etc.
7 System Design
- Targeted Application
- System Architecture
- Semantic Indexing and Locating
- Evaluation
8 Targeted Application
- Semantic search is expressed in natural language.
- Query: "locate files similar to f1"
- The query results are materialized via semantic directories
- Not a simple keyword match such as "locate files with k1, k2 and k3", where k1, k2 and k3 are three distinct keywords
9 System Architecture
- Extends a P2P file system to support semantics-based access
- Major Components
- Semantic Extractor Registry
- Semantic Indexing and Locating Utility
10 System Architecture
[Figure: Major components of the system architecture. Labeled boxes: Application/User, FS, Extractor Registry, Semantic Indexing and Locating Utility, DHT.]
11 Semantic Extractor Registry
- A set of semantic extractors
- Leverage IR algorithms, VSM and LSI
- Represent a file as a semantic vector (SV), typically 200-300 keywords
- Semantically close files have similar SVs
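One way to picture a semantic extractor is a simple term-frequency extractor in the spirit of VSM. The sketch below keeps the k most frequent terms of a text as its SV. The function name `extract_sv`, the tokenizer, the tiny stop-word list, and the relative-frequency weighting are all assumptions made for illustration; the sketch does not implement LSI and is not the extractor used by the authors.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real extractors use larger lists).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "for", "with"}


def extract_sv(text, k=300):
    """Return up to k (term, weight) pairs, most frequent terms first."""
    terms = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
    counts = Counter(terms)
    total = sum(counts.values()) or 1
    # Sort by descending count, then alphabetically so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Weight each term by its relative frequency (a plain VSM-style term weight).
    return [(term, count / total) for term, count in ranked[:k]]


sv = extract_sv("The DHT stores files. The DHT locates files with a hash of the file.", k=3)
```

Two texts that share most of their frequent terms end up with largely overlapping SVs, which is the property the later hashing steps rely on.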
12 Semantic Indexing and Locating Utility
- Provides semantics-based indexing and retrieval capabilities
- Relies on the property of locality sensitive hash functions (LSH)
- Derives a small number of semantic identifiers (semIDs) from a file's SV as the DHT keys for indexing and locating
13 Semantic Indexing and Locating Utility
- Goals
- The indices of semantically close files are clustered on the same peer nodes with high probability (nearly 100%)
- Efficiently locate semantically close files by searching a small number of peer nodes (e.g., 20)
14 Locality Sensitive Hashing
- A family of hash functions $F$ is locality sensitive if, for $h \in F$ operating on two sets $A$ and $B$, we have $\Pr_{h \in F}[h(A) = h(B)] = \mathrm{sim}(A, B)$
- Min-wise independent permutations are LSH
- Similarity function: $\mathrm{sim}(A, B) = |A \cap B| / |A \cup B|$
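The locality-sensitive property can be checked empirically. The sketch below estimates the collision probability of min-hash functions on two sets and compares it with the set similarity |A ∩ B| / |A ∪ B|. It assumes random linear maps `(a*x + b) mod PRIME` as a stand-in for min-wise independent permutations, which they only approximate; the names `make_minhash` and `PRIME`, the seed, and the set sizes are illustrative choices.

```python
import random

PRIME = 2_147_483_647  # Mersenne prime 2**31 - 1, used as the modulus


def jaccard(a, b):
    """sim(A, B) = |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)


def make_minhash(rng):
    """Return h(S) = min over x in S of (a*x + b) mod PRIME.

    Random linear maps only approximate min-wise independent permutations.
    """
    a = rng.randrange(1, PRIME)
    b = rng.randrange(PRIME)

    def h(s):
        return min((a * x + b) % PRIME for x in s)

    return h


rng = random.Random(42)  # fixed seed so the run is repeatable
universe = rng.sample(range(PRIME), 100)  # 100 distinct elements
A = set(universe[:80])
B = set(universe[20:])   # A and B share 60 of the 100 elements

hashes = [make_minhash(rng) for _ in range(2000)]
collisions = sum(1 for h in hashes if h(A) == h(B))
estimate = collisions / len(hashes)   # empirical Pr[h(A) = h(B)]
exact = jaccard(A, B)                 # 60 / 100
```

With enough hash functions, `estimate` should land close to `exact`; the gap shrinks roughly with the square root of the number of functions sampled.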
15 Semantic Indexing
- Step 1: Derive a small number of semIDs from the SV using LSH
- Step 2: Index the file by using these semIDs as the DHT keys
16 Semantic Indexing
- Using n groups of m hash functions
- Results
- The indices of semantically close files are hashed to the same peers with probability $\ge 1 - (1 - p^m)^n$
- $p$ is expected to be high for semantically close files, and so is this probability
- $p = \mathrm{sim}(f_1, f_2)$, the similarity between the two files' SVs
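The expression 1 - (1 - p^m)^n follows from a short argument, assuming the hash functions behave independently: all m functions of one group agree on two files with probability p^m, so one group fails to match with probability 1 - p^m, and all n groups fail with probability (1 - p^m)^n. The sketch below evaluates the expression for a few similarity values; the function name `cluster_probability` and the sample values of p are hypothetical, chosen only to show the trend.

```python
def cluster_probability(p, m, n):
    """Evaluate 1 - (1 - p**m)**n for similarity p, m functions per group, n groups."""
    return 1.0 - (1.0 - p ** m) ** n


# Hypothetical similarity values, chosen only to show the trend.
trend = {p: cluster_probability(p, m=5, n=20) for p in (0.5, 0.7, 0.9)}

# Fewer functions per group (smaller m) or more groups (larger n) raise the probability.
fewer_functions = cluster_probability(0.9, m=2, n=20)
fewer_groups = cluster_probability(0.9, m=5, n=5)
```

Increasing m makes each group more selective, while increasing n gives similar files more chances to share a semID, so the two parameters trade precision against recall.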
17 Semantic Indexing
- Given a file's SV A
- proc sem_index(A)
-   convert A into A'    \\ A' is a set of integers obtained by using SHA-1
-   for each g_j do    \\ g_j is one of n groups of hash functions
-     semID_j = 0
-     for each h_i in g_j do    \\ g_j has m hash functions
-       semID_j = semID_j XOR h_i(A')    \\ XOR is the bitwise exclusive-or operation
-     endfor
-   endfor
-   for each semID_j do
-     insert the tuple <semID_j, fileID, A> into the DHT with semID_j as the DHT key    \\ semantic indexing
-   endfor
- endproc
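The indexing procedure can be sketched in runnable form as follows. This is an illustration under stated assumptions, not the authors' implementation: a Python dict stands in for the DHT, each hash function is a random linear min-hash `(a*x + b) mod PRIME`, keywords are mapped to integers by truncating their SHA-1 digest to 7 bytes, and the names `to_int_set`, `make_groups`, `min_hash` and `sem_ids` are hypothetical helpers.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus of the toy hash functions


def to_int_set(sv):
    """Convert an SV (iterable of keywords) into a set of integers via SHA-1."""
    return {int.from_bytes(hashlib.sha1(k.encode("utf-8")).digest()[:7], "big") for k in sv}


def make_groups(n, m, seed=0):
    """Build n groups of m min-hash functions, each described by a pair (a, b)."""
    rng = random.Random(seed)
    return [[(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(m)]
            for _ in range(n)]


def min_hash(params, int_set):
    a, b = params
    return min((a * x + b) % PRIME for x in int_set)


def sem_ids(sv, groups):
    """One semID per group: the XOR of the group's m min-hash values."""
    int_set = to_int_set(sv)
    ids = []
    for group in groups:
        sem_id = 0
        for params in group:
            sem_id ^= min_hash(params, int_set)
        ids.append(sem_id)
    return ids


def sem_index(dht, file_id, sv, groups):
    """Insert the tuple (semID, fileID, SV) under each semID used as the key."""
    for sem_id in sem_ids(sv, groups):
        dht.setdefault(sem_id, []).append((sem_id, file_id, sv))


dht = {}  # a dict stands in for the DHT; a real system routes each key to a peer
groups = make_groups(n=20, m=5)
sem_index(dht, "f1", ["p2p", "dht", "semantic", "index"], groups)
sem_index(dht, "f2", ["p2p", "dht", "semantic", "index"], groups)  # identical SV
```

Because `f1` and `f2` have identical SVs, every semID maps to both files; files with merely similar SVs would share only some of their semIDs.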
18 Semantic Locating
- Step 1: Derive a small number of semIDs from the SV using LSH
- Step 2: Locate the semantically close files by using these semIDs as the DHT keys
- Goal: answer a query by consulting only a small number of peer nodes
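The two locating steps can be sketched as a lookup over the same semIDs that indexing uses. As before, this is a toy under stated assumptions: a dict replaces the DHT, linear min-hash functions replace min-wise independent permutations, and the file names, keywords, seed, and the helper names `sem_ids` and `sem_locate` are hypothetical. The query mirrors "locate files similar to f1".

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # modulus of the toy hash functions


def sem_ids(sv, groups):
    """One semID per group: XOR of the group's min-hash values over the SV."""
    ints = {int.from_bytes(hashlib.sha1(k.encode("utf-8")).digest()[:7], "big") for k in sv}
    ids = []
    for group in groups:
        sem_id = 0
        for a, b in group:
            sem_id ^= min((a * x + b) % PRIME for x in ints)
        ids.append(sem_id)
    return ids


def sem_locate(dht, query_sv, groups):
    """Return the fileIDs indexed under any semID of the query SV, and the lookup count."""
    found = set()
    keys = set(sem_ids(query_sv, groups))
    for key in keys:  # one DHT lookup per distinct semID
        for _, file_id, _ in dht.get(key, []):
            found.add(file_id)
    return found, len(keys)


rng = random.Random(7)
# n = 20 groups of m = 2 hash functions (values picked for illustration).
groups = [[(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(2)]
          for _ in range(20)]

base = ["p2p", "dht", "lsh", "semantic", "index", "locate", "peer", "file", "hash"]
files = {
    "f1": base + ["chord"],
    "f2": base + ["pastry"],                               # shares 9 of 11 keywords with f1
    "f3": ["cooking", "recipe", "oven", "flour", "sugar"],  # unrelated
}

dht = {}  # a dict stands in for the DHT
for file_id, sv in files.items():
    for sem_id in sem_ids(sv, groups):
        dht.setdefault(sem_id, []).append((sem_id, file_id, sv))

results, lookups = sem_locate(dht, files["f1"], groups)
```

The query issues at most n lookups, which is what bounds the number of peers consulted; the result set is approximate, since a similar file is found only if it shares at least one semID with the query.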
19 Demonstration of Semantic Indexing and Locating
[Figure: files A, B, C and D, which are semantically close, shown with a peer node and two users, User1 and User2. Query: "locate files similar to D".]
20 Evaluation
- Load distribution of semantic indexing
- Semantic indices per peer node
- Performance of semantic locating
- Percentage of semantically close files that can
be located (Recall)
21 Semantic Indexing
[Figure: Load distribution when the system indexes 10,000 files. Axis labels: number of file indexes per node; number of peer nodes.]
22 Semantic Indexing
[Figure: Load distribution in a 1000-node system. Axis labels: number of file indexes per node; number of indexed files (x1000).]
23 Performance of Semantic Locating
Recall (%) for n groups of m hash functions:
m \ n | 5 | 10 | 15 | 20
5 | 84 | 92 | 94 | 96
2 | 94 | 99 | 100 | 100
Notes:
1. Apply n groups of m hash functions
2. Recall is the percentage of files located (128-byte fingerprint limit as an SV)
3. m and n determine the performance of semantic locating
24 Related Work
- P2P file systems like CFS and PAST
- Exact-match lookups in DHTs
- Traditional semantic file systems like SFS and HAC
- IR algorithms such as VSM and LSI
- LSH and its related applications (e.g., the nearest neighbor problem, cached data location in databases)
25 Conclusions
- The first step to support semantics-based access in P2P file systems
- LSH-based semantic indexing and locating approach
- Imposes small storage overhead (several MBs per node)
- Efficiency: answers a query by consulting a small number of peers (e.g., 20)
- Approximate results, but acceptable
26 Future Work
- Query consistency and refinement
- Evaluation using IR workloads (e.g., TREC data
sets).