Integrating Semantics-Based Access Mechanisms with P2P File Systems

1
Integrating Semantics-Based Access Mechanisms
with P2P File Systems
  • Yingwu Zhu, Honghao Wang and Yiming Hu

2
Outline
  • Background
  • System Design
  • Related Work
  • Conclusions and Future Work

3
Background
  • Current P2P file systems (e.g., CFS and PAST)
  • Layer FS functionalities on a distributed hash
    table (DHT), e.g., Chord, Pastry
  • Do not support semantics-based access
  • Because DHTs support only exact-match lookups

4
Background
Layer   Responsibility
FS      Stores/retrieves file objects into/from the DHT; presents a file system interface to applications/users
DHT     Supports a hash-table interface of get(fileID) and put(fileID, file)
Table: Software layering in a P2P file system
5
Motivation
  • A problem of P2P file systems
  • Supports only exact-match lookups given a file
    object identifier fileID
  • get(fileID) retrieves the file corresponding to
    the fileID
  • put(fileID, file) stores the file with the
    fileID as a DHT key
  • Extending exact-match lookups to semantic access
    is non-trivial
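The exact-match limitation above can be illustrated with a minimal sketch. It assumes a toy in-memory stand-in for the DHT; the class `ToyDHT`, the helper `make_file_id`, and the modulo-based node placement are hypothetical and are not taken from CFS, PAST, Chord, or Pastry (real DHTs route keys with consistent hashing).

```python
import hashlib


class ToyDHT:
    """Hypothetical in-memory stand-in for a DHT with a get/put interface."""

    def __init__(self, num_nodes=8):
        # Each "peer node" is just a local dictionary in this sketch.
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key: bytes) -> dict:
        # Assumption: place a key on a node by hashing it (real DHTs differ).
        digest = hashlib.sha1(key).digest()
        return self.nodes[int.from_bytes(digest, "big") % len(self.nodes)]

    def put(self, file_id: bytes, data: bytes) -> None:
        self._node_for(file_id)[file_id] = data

    def get(self, file_id: bytes):
        # Exact match only: a key that differs in one bit finds nothing.
        return self._node_for(file_id).get(file_id)


def make_file_id(content: bytes) -> bytes:
    # Assumption for illustration: fileID is the SHA-1 digest of the content.
    return hashlib.sha1(content).digest()


dht = ToyDHT()
doc = b"peer-to-peer file systems"
dht.put(make_file_id(doc), doc)

found = dht.get(make_file_id(doc))
# A nearly identical document has an unrelated fileID, so the lookup fails.
near_miss = dht.get(make_file_id(b"peer-to-peer file system"))
```

In this sketch `found` holds the stored bytes while `near_miss` is `None`, which is why similarity search cannot be layered directly on get/put.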

6
Motivation
  • A challenge to P2P file systems
  • Provide convenient access to vast amounts of
    information
  • E.g., provide semantics-based search capabilities
    to efficiently locate semantically close files
    for browsing and purging, etc.

7
System Design
  • Targeted Application
  • System Architecture
  • Semantic Indexing and Locating
  • Evaluation

8
Targeted Application
  • Semantic search is expressed in natural language.
  • Query: "locate files similar to f1"
  • The query results are materialized via semantic
    directories
  • Not a simple keyword match such as "locate files with k1,
    k2 and k3", where k1, k2 and k3 are three distinct
    keywords

9
System Architecture
  • Extends a P2P file system to support
    semantics-based access
  • Major Components
  • Semantic Extractor Registry
  • Semantic Indexing and Locating Utility

10
System Architecture
[Figure: Major components of the system architecture. Components shown: Application/User, FS, Extractor Registry, Semantic Indexing and Locating Utility, DHT]
11
Semantic Extractor Registry
  • A set of semantic extractors
  • Leverage IR algorithms, VSM and LSI
  • Represent a file as a semantic vector (SV),
    typically 200-300 keywords
  • Semantically close files have similar SVs
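As a rough illustration of representing a file as a set of keywords, the sketch below uses plain term frequency as a stand-in for the VSM/LSI extractors named on the slide; it is not the extractor used by the authors. The function `semantic_vector`, the stopword list, and the default `k=300` are assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not a real IR resource.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "with"}


def semantic_vector(text: str, k: int = 300) -> set:
    """Return the k most frequent non-stopword terms as a keyword set."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {word for word, _ in counts.most_common(k)}


sv = semantic_vector("The DHT stores files and the DHT finds files", k=2)
```

Two documents that share most of their frequent terms end up with largely overlapping keyword sets, which is the property the later indexing steps rely on.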

12
Semantic Indexing and Locating Utility
  • Provides semantics-based indexing and retrieval
    capabilities
  • Relies on the property of Locality Sensitive Hash
    Functions (LSH)
  • Derives a small number of semantic identifiers
    (semIDs) from a file's SV as the DHT keys for
    indexing and locating

13
Semantic Indexing and Locating Utility
  • Goals
  • The indices of semantically close files are
    clustered on the same peer nodes with high
    probability (nearly 100%)
  • Efficiently locate semantically close files by
    searching a small number of peer nodes (e.g., 20)

14
Locality Sensitive Hashing
  • A family of hash functions F is locality
    sensitive if, for $h \in F$ operating on two sets A and B,
    we have $\Pr_{h \in F}[h(A) = h(B)] = \mathrm{sim}(A, B)$
  • Min-wise independent permutations are LSH
  • Similarity function: $\mathrm{sim}(A, B) = |A \cap B| / |A \cup B|$
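The locality-sensitive property can be checked empirically with a small sketch, assuming sim(A, B) is the Jaccard similarity |A ∩ B| / |A ∪ B|. Each hash function here is the minimum of a seeded SHA-1 value over the set, used as a practical stand-in for a min-wise independent permutation; the seeding scheme, the function names, and the choice of 400 hash functions are assumptions for illustration.

```python
import hashlib


def jaccard(a: set, b: set) -> float:
    """Exact similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)


def min_hash(seed: int, items: set) -> int:
    """One member of the hash family: minimum seeded SHA-1 value over the set."""
    return min(
        int.from_bytes(hashlib.sha1(f"{seed}:{x}".encode("utf-8")).digest()[:8], "big")
        for x in items
    )


def estimate_similarity(a: set, b: set, num_hashes: int = 400) -> float:
    """Fraction of hash functions on which the two sets collide."""
    matches = sum(min_hash(s, a) == min_hash(s, b) for s in range(num_hashes))
    return matches / num_hashes


A = {f"kw{i}" for i in range(0, 80)}
B = {f"kw{i}" for i in range(20, 100)}

exact = jaccard(A, B)               # 60 shared keywords out of 100 distinct ones
approx = estimate_similarity(A, B)  # should land close to the exact value
```

The collision rate is only an estimate, so `approx` fluctuates around `exact` with an error that shrinks as more hash functions are used.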
15
Semantic Indexing
  • Given a file's SV
  • Step 1: Derive a small number of semIDs from the
    SV using LSH
  • Step 2: Index the file by using these semIDs
    as the DHT keys

16
Semantic Indexing
  • Using n groups of m hash functions
  • Results
  • The indices of semantically close files are hashed
    to the same peers with probability $\geq 1-(1-p^m)^n$
  • p is expected to be high for semantically close
    files, and so is the probability
  • $p = \mathrm{sim}(f_1, f_2)$, the similarity between the two files'
    SVs
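To see how m and n trade off, the expression 1-(1-p^m)^n from the slide can be evaluated directly. The sketch below is only arithmetic on that formula; the function name `cluster_probability` and the sample similarity p = 0.9 are assumptions, and the printed values are not the measured recall figures from the evaluation.

```python
def cluster_probability(p: float, m: int, n: int) -> float:
    """Probability bound that two files with similarity p share at least
    one semID when n groups of m hash functions are used."""
    # All m functions in one group agree with probability p**m;
    # at least one of the n groups agrees with probability 1 - (1 - p**m)**n.
    return 1.0 - (1.0 - p ** m) ** n


# Illustrative sweep for a hypothetical similarity of 0.9.
for m in (2, 5):
    row = [round(cluster_probability(0.9, m, n), 4) for n in (5, 10, 15, 20)]
    print(f"m={m}: {row}")
```

Raising n adds more chances to collide and pushes the probability up, while raising m makes each group stricter and pushes it down, so the two parameters are tuned together.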

17
Semantic Indexing
  • Given a file's SV A
  • proc sem_index(A)
  •   convert A into A'   \\ A' is a set of
      integers obtained by using SHA-1
  •   for each g_j do   \\ g_j is one of n
      groups of hash functions
  •     semID_j = 0
  •     for each h_i in g_j do   \\ g_j
        has m hash functions
  •       semID_j = semID_j ⊕ h_i(A')   \\
          ⊕ is the XOR operation
  •     endfor
  •   endfor
  •   for each semID_j do
  •     insert the tuple <semID_j, fileID, A>
        into the DHT with semID_j as the DHT
        key   \\ semantic indexing
  •   endfor
  • endproc
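The procedure above might be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: each hash function h_i in group g_j is modelled as a seeded min-hash built on SHA-1, semIDs are truncated to 64 bits, and the DHT is replaced by a local dictionary keyed by semID. The helper names and the defaults n=5, m=2 are hypothetical.

```python
import hashlib
from collections import defaultdict


def to_int_set(keywords):
    """Convert SV A into A': one 64-bit integer per keyword, via SHA-1."""
    return {
        int.from_bytes(hashlib.sha1(k.encode("utf-8")).digest()[:8], "big")
        for k in keywords
    }


def min_hash(group, index, items):
    """Assumed form of h_i in group g_j: minimum seeded SHA-1 value over A'."""
    return min(
        int.from_bytes(
            hashlib.sha1(f"{group}:{index}:{x}".encode("utf-8")).digest()[:8], "big"
        )
        for x in items
    )


def sem_ids(keywords, n=5, m=2):
    """Derive n semIDs, each the XOR of the m hash values of one group."""
    a_prime = to_int_set(keywords)
    ids = []
    for j in range(n):
        sem_id = 0
        for i in range(m):
            sem_id ^= min_hash(j, i, a_prime)
        ids.append(sem_id)
    return ids


def sem_index(dht, file_id, keywords, n=5, m=2):
    """Insert <semID, fileID, A> under every semID (dict stands in for the DHT)."""
    for sem_id in sem_ids(keywords, n, m):
        dht[sem_id].append((sem_id, file_id, frozenset(keywords)))


dht = defaultdict(list)
sv = {"p2p", "dht", "semantic", "index"}
sem_index(dht, "file-1", sv)
```

Because every step is deterministic, two files with identical SVs always produce the same semIDs, and files with heavily overlapping SVs share at least one semID with high probability. An empty SV is not handled here and would raise a `ValueError` from `min`.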

18
Semantic Locating
  • Given a query's SV
  • Step 1: Derive a small number of semIDs from the
    SV using LSH
  • Step 2: Locate those semantically close files by
    using these semIDs as the DHT keys
  • Goal: answer a query by consulting only a small
    number of peer nodes
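A self-contained sketch of the two locating steps, paired with a matching indexing routine, is given below. It rests on the same kind of assumptions as any toy model: a dictionary stands in for the DHT, each hash function is a seeded min-hash over SHA-1 applied directly to the keywords, and candidates are filtered by Jaccard similarity against a hypothetical threshold of 0.5. The names `sem_locate`, `sem_index`, and `sem_ids`, and the defaults n=10, m=2, are illustrative.

```python
import hashlib
from collections import defaultdict


def _h64(text: str) -> int:
    """64-bit integer taken from the SHA-1 digest of a string."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")


def sem_ids(keywords, n=10, m=2):
    """n semIDs, each the XOR of m seeded min-hash values (assumed LSH family)."""
    ids = []
    for j in range(n):
        sem_id = 0
        for i in range(m):
            sem_id ^= min(_h64(f"{j}:{i}:{k}") for k in keywords)
        ids.append(sem_id)
    return ids


def sem_index(dht, file_id, keywords):
    """Store (fileID, SV) under every semID of the file."""
    for sem_id in sem_ids(keywords):
        dht[sem_id].append((file_id, frozenset(keywords)))


def sem_locate(dht, query_keywords, threshold=0.5):
    """Look up only the query's semIDs and rank the candidates found there."""
    query = frozenset(query_keywords)
    hits = {}
    for sem_id in sem_ids(query):  # at most n DHT lookups per query
        for file_id, sv in dht.get(sem_id, []):
            sim = len(query & sv) / len(query | sv)
            if sim >= threshold:
                hits[file_id] = sim
    return sorted(hits.items(), key=lambda kv: -kv[1])


dht = defaultdict(list)
file_d = {"p2p", "dht", "lsh", "semantic", "index", "file"}
sem_index(dht, "D", file_d)
sem_index(dht, "unrelated", {"cooking", "recipe", "pasta"})

results = sem_locate(dht, file_d)  # query: "locate files similar to D"
```

The query touches only the buckets named by its own semIDs, so the number of peers consulted is bounded by n regardless of how many files are indexed. Results are approximate: a similar file that shares no semID with the query is missed.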

19
Demonstration of Semantic Indexing and Locating
[Figure: peer nodes with files A, B, C and D, which are semantically close files; User1 and User2; query: "locate files similar to D"]
20
Evaluation
  • Load distribution of semantic indexing
  • Semantic indices per peer node
  • Performance of semantic locating
  • Percentage of semantically close files that can
    be located (Recall)

21
Semantic Indexing
[Figure: Load distribution when the system indexes 10,000 files. Axes: number of file indexes per node; number of peer nodes]
22
Semantic Indexing
[Figure: Load distribution in a 1000-node system. Axes: number of file indexes per node; number of indexed files (x1000)]
23
Perf. of Semantic Locating
Recall (%) when applying n groups of m hash functions
        n=5    n=10   n=15   n=20
m=5     84     92     94     96
m=2     94     99     100    100
  • Recall: percentage of files located (128-byte
    fingerprint limit as a SV)
  • m and n determine the performance of semantic
    locating
24
Related Work
  • P2P file systems like CFS and PAST
  • Exact-match lookups in DHTs
  • Traditional semantic file systems like SFS and
    HAC
  • IR algorithms such as VSM and LSI
  • LSH and its related applications (e.g., the
    nearest neighbor problem, cached data location in
    databases)

25
Conclusions
  • The first step to support semantics-based access
    in P2P file systems
  • LSH-based semantic indexing and locating approach
  • Impose small storage overhead (several MBs per
    node)
  • Efficiency: answers a query by consulting a small
    number of peers (e.g., 20)
  • Approximate results, but acceptable

26
Future Work
  • Query consistency and refinement
  • Evaluation using IR workloads (e.g., TREC data
    sets).