Title: An Evaluation and Comparison of Current Peer-to-Peer Full-Text Keyword Search Techniques
1. An Evaluation and Comparison of Current
Peer-to-Peer Full-Text Keyword Search Techniques
- Ming Zhong, Justin Moore, Kai Shen
- University of Rochester
- Amy Murphy
- University of Lugano
2. Current P2P Full-Text Keyword Search Techniques
- Document-based partitioning: Gnutella, KaZaA
- Keyword-based partitioning (Bhattacharjee03,
Gnawali02, Reynold03, Suel03)
- Hybrid indexing (Tang04)
- Semantic search (Ganesan04, Li04, Tang03)
- There is no comprehensive quantitative
performance evaluation and comparison of these
techniques! (Li03)
3. Our Work
- Quantitative performance evaluation results on
real, large datasets (3.8 million web pages from
www.dmoz.org and 6.8 million web queries from
AskJeeves).
- Performance metrics:
- Total storage consumption
- Communication overhead
- Search latency
- Search quality
- Performance evaluation results linearly projected
to 1 billion web pages.
4. Evaluation Setup
- 8-byte page ID (the MD5 of page URL).
- Each entry in inverted lists has 10 bytes (8-byte
ID + 2-byte term frequency).
- Each query only retrieves the top 20 most related
page IDs, ranked by the TF.IDF term weighting scheme.
- The underlying topology: Chord.
- Network settings:
- P2P overlay link latency: 40 ms.
- The maximum per-query bandwidth consumption:
1.5 Mbps.
- The maximum whole-system bandwidth consumption:
1 Gbps (0.26% of US Internet backbone bandwidth
in 2002).
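The 10-byte inverted-list entry described above can be sketched as follows. This is only an illustration: the function names, the big-endian layout, keeping the first 8 bytes of the 16-byte MD5 digest, and capping the term frequency at 65535 are all assumptions, not details taken from the evaluation.

```python
import hashlib
import struct

# 8-byte page ID followed by a 2-byte unsigned term frequency, no padding.
ENTRY_FORMAT = ">8sH"
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)  # 10 bytes


def page_id(url: str) -> bytes:
    """Derive an 8-byte page ID from the MD5 of the URL.

    Assumption: the first 8 bytes of the 16-byte digest are kept.
    """
    return hashlib.md5(url.encode("utf-8")).digest()[:8]


def pack_entry(url: str, term_frequency: int) -> bytes:
    """Pack one inverted-list entry into 10 bytes."""
    # Assumption: term frequencies above 65535 are capped to fit in 2 bytes.
    return struct.pack(ENTRY_FORMAT, page_id(url), min(term_frequency, 0xFFFF))


def unpack_entry(entry: bytes):
    """Return (page_id_bytes, term_frequency) from a 10-byte entry."""
    return struct.unpack(ENTRY_FORMAT, entry)
```

With this layout, an inverted list of m pages occupies 10 × m bytes, which is the unit behind the storage figures quoted for the data set.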
5. Document-Based Partitioning
- Each node holds a partition of documents. A query
is broadcast to all nodes and each node returns
its top 20 most relevant documents.
- Tree-based (log n depth) message broadcast and
aggregation.
- Total storage consumption: 4.24 GB for our data
set.
- Total communication cost: 300n bytes for our data
set (n is the network size).
- Search latency: 0.08 × log(n) secs.
- Search Quality
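A minimal sketch of the document-based cost formulas above. It assumes the logarithm is base 2 and reads the 0.08 factor as two 40 ms link traversals per tree level (broadcast down, aggregation up). Both readings are assumptions, as are the function and constant names.

```python
import math

LINK_LATENCY_SECS = 0.04   # overlay link latency from the evaluation setup
BYTES_PER_NODE = 300       # per-node communication cost per query (slide figure)


def doc_based_comm_bytes(n: int) -> int:
    """Total per-query communication cost for a network of n peers."""
    return BYTES_PER_NODE * n


def doc_based_latency_secs(n: int) -> float:
    """Latency of one broadcast plus one aggregation over a log2(n)-deep tree."""
    # Assumption: 0.08 = 2 * 0.04, i.e. each tree level is crossed twice.
    return 2 * LINK_LATENCY_SECS * math.log2(n)
```

For n = 1000 peers this gives 300,000 bytes per query and a latency of roughly 0.8 secs.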
6. Baseline Keyword-Based Partitioning
- Each node holds the inverted lists of some
keywords (randomly distributed). To reduce the
communication overhead of inverted list
intersection, a k-word query visits k peers in
ascending order of inverted list size.
- NO quality degradation for baseline keyword-based
partitioning.
- Storage consumption equal to that of
document-based partitioning.
- Average comm. cost per query: 96.61 KB.
- Max comm. cost per query: 18.65 MB.
- Search latency < 0.14 × log(n) + 0.52 secs.
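The visiting order described above can be sketched as follows. This is a simplified, single-process illustration rather than the evaluated system: inverted lists are plain Python sets, the number of page IDs forwarded between peers stands in for communication cost, and the function name and `reverse` parameter are hypothetical.

```python
def intersect_in_size_order(inverted_lists, query, reverse=False):
    """Intersect the inverted lists of the query terms, one peer at a time.

    inverted_lists maps a keyword to the set of page IDs containing it.
    Returns (matching page IDs, number of page IDs shipped between peers).
    With reverse=False, peers are visited in ascending order of list size.
    """
    if not query:
        return set(), 0
    ordered = sorted(
        query,
        key=lambda term: len(inverted_lists.get(term, ())),
        reverse=reverse,
    )
    candidates = set(inverted_lists.get(ordered[0], ()))
    shipped = 0
    for term in ordered[1:]:
        # The current candidate set travels to the peer holding the next list.
        shipped += len(candidates)
        candidates &= set(inverted_lists.get(term, ()))
    return candidates, shipped


lists = {"san": {1, 2, 3, 4, 5, 6}, "diego": {2, 3, 9}, "zoo": {3}}
ascending = intersect_in_size_order(lists, ["san", "diego", "zoo"])
descending = intersect_in_size_order(lists, ["san", "diego", "zoo"], reverse=True)
```

In this toy example both orders return the same result set, but the ascending order ships 2 page IDs between peers while the descending order ships 8, which is why starting from the shortest list reduces the intersection traffic.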
7. Improved Keyword-Based Partitioning (I)
The average comm. cost is reduced to 0.137 times
that of the baseline keyword-based partitioning.
8. Improved Keyword-Based Partitioning (II)
san, ca, diego, francisco, puerto, tx, austin, rico,
antonio, earthquake, jose, juan, vallarta, lucas, rican,
luis, cabo, fransisco, bernardino
9. Improved Keyword-Based Partitioning (III)
10. Hybrid Indexing
- Each page ID in inverted lists has some metadata
about the corresponding page.
- A naive approach: each page ID has the complete
term list of the corresponding page. Too much
storage consumption!
- Tang's approach: the inverted list of x contains
only those pages that have x among their top terms.
Query expansion is also used.
- Storage consumption: 13 times that of
keyword-based partitioning.
- Latency < 0.16 × DHT diameter secs.
- Comm. cost per query: 7.5 KB.
- Search quality
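The top-term idea above can be illustrated with a toy sketch. It is not Tang's implementation: "top terms" are taken here to be the most frequent terms of a page, the stored metadata is assumed to be the page's term set, query expansion is left out, and all names are hypothetical.

```python
from collections import Counter, defaultdict


def build_hybrid_index(pages, top_k):
    """pages maps page ID -> list of words.

    Returns term -> {page ID: set of that page's terms}. A page is listed
    under a term only if the term is among its top_k most frequent terms.
    """
    index = defaultdict(dict)
    for page_id, words in pages.items():
        counts = Counter(words)
        # Assumption: rank terms by raw frequency within the page.
        for term, _ in counts.most_common(top_k):
            index[term][page_id] = set(counts)  # term list kept as metadata
    return index


def hybrid_search(index, query):
    """Return pages that contain every query term.

    Each query term's list is filtered locally using the stored term sets,
    so no inverted list has to be shipped between peers.
    """
    hits = set()
    for term in query:
        for page_id, page_terms in index.get(term, {}).items():
            if all(q in page_terms for q in query):
                hits.add(page_id)
    return hits


pages = {
    1: "zoo zoo zoo san san diego".split(),
    2: "hotel hotel hotel san san zoo".split(),
}
index = build_hybrid_index(pages, top_k=2)
```

Here the query `["san", "zoo"]` finds both pages through the list for "san", while the single-term query `["zoo"]` finds only page 1, because "zoo" is not a top term of page 2. That missed page is the kind of quality loss this design trades for lower communication cost.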
11. Semantic Search
- Documents and queries are mapped into points in a
semantic space, and hence keyword search becomes
nearest-point search.
- pSearch: LSI, rolling index.
- Storage consumption: 30.06 GB, 7.09 times that of
keyword-based partitioning if p = 10.
- Comm. cost per query: 1.29 MB.
- (P = 10, K = 160)
- Search latency: DHT comm. latency
+ network transmission time.
- Search quality
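Nearest-point search in a semantic space can be sketched as below. The sketch assumes the semantic vectors have already been produced (for example by LSI) and uses cosine similarity with a brute-force scan. It does not model pSearch's rolling index or its distributed lookup, and the names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_pages(query_vec, page_vecs, k):
    """Return the k page IDs whose vectors are most similar to query_vec."""
    ranked = sorted(
        page_vecs,
        key=lambda page_id: cosine(query_vec, page_vecs[page_id]),
        reverse=True,
    )
    return ranked[:k]


page_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
top_two = nearest_pages([1.0, 0.1], page_vecs, k=2)
```

In this toy example the two nearest pages to the query point are "a" and then "c".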
12. Scaled Performance on 10^9 pages and n peers
13. Examples of Choosing Techniques
- Application 1: 10^5 peers, 10^7 pages.
- Use keyword-based partitioning
- 29.64 MB storage per peer
- < 12.22 KB communication overhead per query
- < 2.4 sec search latency
- 100% search quality
- Application 2: 10^3 peers, 10^9 pages.
- Use document-based partitioning
- 1.14 GB storage per peer
- 300 KB communication overhead per query
- 0.8 sec search latency
- 75% to 95% search quality
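As a rough consistency check on Application 1, the sketch below takes the keyword-based latency bound to be 0.14 × log2(n) of overlay latency plus the time to transmit the per-query traffic at the 1.5 Mbps per-query cap. The base-2 logarithm, the 1 KB = 1000 bytes convention, and the reading of the additive term as transmission time are assumptions, and the function names are hypothetical.

```python
import math

PER_QUERY_BANDWIDTH_BPS = 1.5e6  # maximum per-query bandwidth (evaluation setup)


def transmission_secs(comm_kb: float) -> float:
    """Time to send comm_kb kilobytes at the per-query bandwidth cap."""
    # Assumption: 1 KB = 1000 bytes.
    return comm_kb * 1000 * 8 / PER_QUERY_BANDWIDTH_BPS


def keyword_latency_bound_secs(n_peers: int, comm_kb: float) -> float:
    """Assumed bound: overlay routing latency plus transmission time."""
    return 0.14 * math.log2(n_peers) + transmission_secs(comm_kb)


baseline_transmission = transmission_secs(96.61)
application_1_latency = keyword_latency_bound_secs(10**5, 12.22)
```

Under these assumptions, 96.61 KB takes about 0.52 secs to transmit, and 10^5 peers with 12.22 KB per query give about 2.39 secs, just under the 2.4 sec figure listed for Application 1.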