Slides: 37

Provided by: Wei84

Transcript and Presenter's Notes

Title: In collaboration with:

1
In collaboration with Matthias Bender, Debora
Donato, Alessandro Linari, Julia Luxenburger,
Sebastian Michel, Nikos Ntarmos, Josiane
Parreira, Peter Triantafillou, Christian Zimmer
2
Peer-to-Peer (P2P) Systems
Decentralized, self-organizing, highly
dynamic loose coupling of many autonomous
computers

unstructured overlay networks
with epidemic dissemination (flooding)
structured overlay networks
based on distributed hash tables (DHTs)

Applications
Large-scale computation (SETI@home, etc.)
File sharing (Napster, Gnutella, KaZaA,
BitTorrent, etc.)
Publish-Subscribe (Blogs, Marketplaces, etc.)
Collaborative work (Games, etc.)
IP telephony (Skype)

3
Peer-to-Peer Web Search
Vision: Self-organizing P2P Web Search Engine
with Google-or-better functionality

Scalable Self-Organizing Data Structures and
Algorithms
(DHTs, Semantic Overlay Networks, Epidemic
Spreading, Distr. Link Analysis, etc.)

Better Search Result Quality (Precision, Recall,
etc.)

Powerful Search Methods for Each Peer
(Concept-based Search, Query Expansion, XML
IR, Personalization, etc.)

Leverage User/Community Input (Wisdom of
Crowds)
(Bookmarks, Feedback, Query Logs, Click
Streams, Evolving Web, etc.)

Collaboration among Peers
(Query Routing, Incentives, Fairness,
Anonymity, etc.)

Benefit of Large-scale Social Networks
Small-World Phenomenon
Breaking Information Monopolies

4
Solution without Problem?
no killer app with business value !?

but
P2P potentially useful also for server farms and
grids
interesting non-business applications

5
Outline
Motivation and Research Directions ✓
P2P Query Routing

Overlap Awareness
Discriminative Posting

P2P Link Analysis

JXP Authority Scoring

Personalized and Community-aware Ranking

QRank and QReward

Conclusion

6
Computational Model

Peers connected by overlay network
(e.g. DHT, random graph) and IP

Each peer has a full-fledged local search engine
with crawler/importer, indexer, query processor

Each peer has autonomously compiled (e.g.
crawled)
its own content according to the user's
thematic interests
→ peer-specific collections

When a query is issued by a peer, it is first
executed locally
and then possibly routed to carefully selected
other peers

Peers can post summaries / synopses / metadata /
QoS info
to a (distributed) network-wide directory (space
O(terms × peers))
with efficient per-key lookup

7
Minerva System Architecture
based on scalable, churn-resilient DHT

Query routing (QR) aims to optimize benefit/cost,
driven by distributed statistics on peers'
content quality, content overlap, freshness,
authority, trust, performability, etc.
Dynamically precompute good peers to
maintain a Semantic Overlay Network (SON)
Exploit community input (bookmarks, etc.)
8
P2P Query Routing (Resource Selection)

Principle:
Select peers with highest benefit/cost ratio,
where
$\mathrm{benefit}(P_i) = \mathrm{quality}(P_i, q) = \alpha \cdot \mathrm{sim}(q, X_i) + (1 - \alpha) \cdot \mathrm{sim}(X_0, X_i)$
$\mathrm{cost}(P_i)$ = estimated response time or
communication costs

e.g. for $\mathrm{sim}(q, X_i)$ use prob.-IR CORI [Callan 95]
and for $\mathrm{sim}(X_0, X_i)$ use rel. entropy

Method:
Precompute per-term peer-quality scores; keep
in directory
QR aggregates PeerLists for query terms and
selects top-k peers

Caveat:
Peer-peer similarity overfits to content quality
and ignores overlap
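
The routing method above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the directory is modeled as a plain dictionary from term to PeerList, per-term scores are aggregated by a simple sum, and the function name `route_query` and all peer ids, scores, and costs are made up for the example.

```python
from collections import defaultdict

def route_query(query_terms, directory, cost, k):
    """Return the k peers with the highest aggregated benefit/cost ratio.

    directory: term -> {peer_id: per-term quality score} (the posted PeerLists)
    cost: peer_id -> estimated cost (e.g. response time), assumed > 0
    """
    benefit = defaultdict(float)
    for term in query_terms:
        for peer, score in directory.get(term, {}).items():
            benefit[peer] += score  # assumption: sum scores over query terms
    ranked = sorted(benefit, key=lambda p: benefit[p] / cost[p], reverse=True)
    return ranked[:k]

# Hypothetical directory content for two query terms.
directory = {
    "hurricane": {"p1": 0.9, "p2": 0.4, "p3": 0.2},
    "center": {"p1": 0.1, "p2": 0.5, "p4": 0.6},
}
cost = {"p1": 1.0, "p2": 1.0, "p3": 0.5, "p4": 2.0}
top_peers = route_query(["hurricane", "center"], directory, cost, k=2)
```

Note that this ranking looks only at quality and cost, so it shares the caveat stated on the slide: two selected peers may hold largely the same documents.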

9
Overlap Awareness [Bender et al.: SIGIR 05,
EDBT 06]
Estimate $\mathrm{overlap}(p_0, p_j) = |X_0 \cap X_j| \,/\, |X_0 \cup X_j|$
between query initiator peer p0 and QR candidate
pj using min-wise independent permutations (MIPs)
[Broder 97] on the URLs in the collections of p0
and pj (with precomputed per-term MIPs
posted to directory)
Consider candidates pj in descending order of
estimated quality for q and re-rank peers by
estimated overlap
Better: estimate the novelty of each additional pj
and rank peers by integrated quality-novelty (IQN)
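
A rough sketch of quality-novelty ranking follows. The slide's own formulas did not survive, so the scoring here is an assumption: each round picks the peer with the highest quality × novelty, where novelty is the fraction of that peer's documents not yet covered by the initiator or by previously chosen peers. Exact Python sets stand in for the MIPs synopses, and `iqn_select` is a hypothetical name.

```python
def iqn_select(initiator_docs, candidates, k):
    """Greedily pick k peers by quality * novelty.

    candidates: peer_id -> (quality, set_of_doc_ids)
    Exact sets stand in for the MIPs synopses of a real system.
    """
    covered = set(initiator_docs)
    remaining = dict(candidates)
    chosen = []
    while remaining and len(chosen) < k:
        def score(peer):
            quality, docs = remaining[peer]
            # Fraction of this peer's documents that would be new to us.
            novelty = len(docs - covered) / len(docs) if docs else 0.0
            return quality * novelty
        best = max(remaining, key=score)
        chosen.append(best)
        # Aggregate the chosen peer's documents into the covered set.
        covered |= remaining.pop(best)[1]
    return chosen

candidates = {
    "a": (0.9, {1, 2, 3, 4}),   # high quality, mostly redundant
    "b": (0.8, {4, 5, 6, 7}),
    "c": (0.5, {8, 9}),
}
selected = iqn_select({1, 2, 3}, candidates, k=2)
```

In this toy input a quality-only ranking would pick peer "a" first, although almost all of its documents are already held by the initiator.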
10
Min-Wise Independent Permutations [Broder 97]
set of ids:
17 21 3 12 24 8
h1(x) = 7x + 3 mod 51:
20 48 24 36 18 8
h2(x) = 5x + 6 mod 51:
40 9 21 15 24 46
...
hN(x) = 3x + 9 mod 51:
9 21 18 45 30 33
compute N random permutations $\pi$ with
$P[\min\{\pi(x) \mid x \in S\} = \pi(x)] = 1/|S|$
MIPs are an unbiased estimator of overlap:
$P[\min\{h(x) \mid x \in A\} = \min\{h(y) \mid y \in B\}] = |A \cap B| \,/\, |A \cup B|$
MIPs can be viewed as repeated sampling of x, y
from A, B
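
The slide's example can be reproduced with a short Python sketch. The three linear hash functions and the id set are taken from the slide; the second id set and the function names are assumptions added for illustration.

```python
def minhash_signature(ids, hash_params, modulus):
    """One minimum per linear hash h(x) = (a*x + b) mod modulus."""
    return [min((a * x + b) % modulus for x in ids) for a, b in hash_params]

def estimate_resemblance(sig_a, sig_b):
    """Fraction of hash functions whose minima agree."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)

# Hash functions from the slide: 7x+3, 5x+6, 3x+9, all mod 51.
slide_params = [(7, 3), (5, 6), (3, 9)]
sig_a = minhash_signature([17, 21, 3, 12, 24, 8], slide_params, 51)

# Hypothetical second collection sharing the ids 17, 21, 3 with the first.
sig_b = minhash_signature([17, 21, 3, 5, 30, 40], slide_params, 51)
resemblance = estimate_resemblance(sig_a, sig_b)
```

Three hash functions are far too few for a reliable estimate. A practical synopsis would use many more, and a prime modulus so that each linear function is a true permutation.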
11
IQN Experimental Results
Experiment based on 100 .Gov partitions (1.25
million docs), assigned to 50 peers, with each peer
holding 10 partitions and 80% overlap for peers
$P_i$, $P_{i+1}$; 50 TREC-2003 Web queries, e.g.
"pest safety control", "juvenile delinquency",
"marijuana legalization", etc.
[Figure: relative recall vs. number of queried peers]
For more experiments see our papers [Bender et
al.: SIGIR 05, EDBT 06, WIRI 06]
12
Discriminative Posting

peer pj posts a term only if pj has term-specific
content
above average (or above quantile) of quality
measure
reduces load on P2P directory
may ease decision on good query routing
requires global statistics on quality measures!

e.g. peer posts only if local df > $\alpha \cdot$ (global df)
with $\alpha < 1$
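
As a minimal sketch, the posting rule can be written as a filter over a peer's local document frequencies. The name `terms_to_post`, the threshold parameter `alpha`, its default value, and the sample frequencies are assumptions chosen for illustration.

```python
def terms_to_post(local_df, global_df, alpha=0.1):
    """Return the terms whose local df exceeds alpha * global df.

    local_df, global_df: term -> document frequency
    alpha: posting threshold, assumed to satisfy 0 < alpha < 1
    """
    return sorted(
        term for term, df in local_df.items()
        if df > alpha * global_df.get(term, 0)
    )

# Hypothetical frequencies for one peer.
local_df = {"hurricane": 40, "center": 5, "weather": 12}
global_df = {"hurricane": 200, "center": 900, "weather": 100}
posted = terms_to_post(local_df, global_df, alpha=0.1)
```

The peer needs an estimate of the global df for each term to apply this rule, which is why global statistics are required.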
Experiment: 250 000 Web pages on 40
peers; popular Google queries (e.g. "national
hurricane center")
13
Efficiently Capturing Global Statistics

gdf (global doc. freq.) of a term is an interesting
key measure,
for discriminative posting or
for P2P result merging,
but overlap among peers makes simple distr.
counting infeasible

hash sketches [Flajolet/Martin 85]:
duplicate-sensitive cardinality estimator for
multisets
hash each multiset element x onto m-bit
bitvector
and remember the least-significant 1 bit $\rho(h(x))$
$\max_{x \in S} \rho(h(x))$ estimates $\log_2(0.77351 \cdot |S|)$
with std.dev. / S
rough intuition
average multiple iid sketches

14
Efficient Accurate gdf Estimation [Bender et
al.: WebDB 06]
Hash sketches of different peers are collected at
the directory peer; distributivity is free:
$\bigcup_i \{\rho(h(x)) \mid x \in S_i\} = \{\rho(h(x)) \mid x \in \bigcup_i S_i\}$

gdf estimation algorithm
each peer p posts hash sketch for each
(discriminative) term t to directory
directory peer for term t forms union of
incoming hash sketches
when a peer needs to know gdf(t), simply ask
directory peer for t
sliding-window techniques for dynamic adjustment
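
The following Python sketch illustrates the idea of posting hash sketches to a directory and estimating gdf from their union. It is a simplified illustration, not the system's implementation: it uses SHA-1 with a seed prefix as the hash family, reads off the lowest unset bit of each bitmap as in the original Flajolet/Martin estimator (rather than the maximum shown on the previous slide), and averages over 64 independent sketches. All names and the sample document ids are assumptions.

```python
import hashlib

PHI = 0.77351   # Flajolet/Martin correction constant
BITS = 32       # length of each bitmap

def _hash(x, seed):
    digest = hashlib.sha1(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def rho(v):
    """0-based position of the least-significant 1 bit of v."""
    return BITS - 1 if v == 0 else (v & -v).bit_length() - 1

def sketch(items, num_sketches=64):
    """One bitmap per hash function; duplicates set the same bits again."""
    bitmaps = [0] * num_sketches
    for x in items:
        for s in range(num_sketches):
            bitmaps[s] |= 1 << rho(_hash(x, s))
    return bitmaps

def union(a, b):
    """Sketch of the union of two multisets: bitwise OR per bitmap."""
    return [x | y for x, y in zip(a, b)]

def lowest_zero(bitmap):
    r = 0
    while bitmap & (1 << r):
        r += 1
    return r

def estimate(bitmaps):
    """Estimated number of distinct elements, averaged over all bitmaps."""
    mean_r = sum(lowest_zero(b) for b in bitmaps) / len(bitmaps)
    return 2 ** mean_r / PHI

directory = {}  # term -> union of all sketches posted for that term

def post(term, peer_sketch):
    current = directory.get(term)
    directory[term] = peer_sketch if current is None else union(current, peer_sketch)

def gdf(term):
    return estimate(directory[term])

# Two hypothetical peers whose documents for a term overlap by half.
docs_a = set(range(0, 1000))
docs_b = set(range(500, 1500))
post("hurricane", sketch(docs_a))
post("hurricane", sketch(docs_b))
estimated_gdf = gdf("hurricane")  # true number of distinct docs is 1500
```

Because the union is a bitwise OR, documents held by both peers are counted once, and the directory peer never needs to see the document ids themselves.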

[Figure: directory peers dir(t), dir(c), dir(f), dir(d), dir(a), dir(e)]
15
gdf Estimation Experiments
Experiment with steady-state P2P system: 1000
peers, each with 1000 randomly chosen docs from 1
million docs
Experiment with churn: peers joining and leaving
according to Poisson processes
16
Outline
Motivation and Research Directions ✓
P2P Query Routing ✓

Overlap Awareness
Discriminative Posting

P2P Link Analysis

JXP Authority Scoring

Personalized and Community-aware Ranking

QRank and QReward

Conclusion

17
Distributed PageRank (PR)
Page authority is important for final result scoring

Exploit locality in Web link graph: construct
block structure
(disjoint graph partitioning) based on sites or
domains

Compute page PR within site/domain and site/domain
weights,
combine page scores with site/domain scores
[Kamvar03, Lee03, Broder04, Wang04, Wu05] or
communicate PR mass propagation across sites
[Abiteboul00, Sankaralingam03, Shi03,
Jelasity05]

18
PageRank (PR) in a P2P Network

Every peer crawls Web fragments at its discretion
and has its own local personalized search
engine
→ overlaps between peers' graphs may occur

19
JXP (Juxtaposed Approximate PageRank) [J.X.
Parreira et al.: WebDB 05, VLDB 06]
based on Markov-chain aggregation (state
lumping) [Courtois 1977, Meyer 1988], cf.
[Chien et al. 2004, Langville/Meyer 2005]
each peer represents the external, a priori unknown
part of the global graph by one superstate, a
world node

peers meet randomly:
exchange their local graph fragments and PR
vectors
learn about incoming edges to nodes of local
graph
compute local PR on merged graphs or enhanced
local graph
keep only improved PR and own local graph
don't keep other peers' graph fragments

converges to global PR (experiments + theoretical
arguments)
convergence sped up by biased p2pDating
strategy: prefer peers whose node set of outgoing
links has high overlap with our node set (use
MIPs as synopses)
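
The state-lumping idea behind the world node can be illustrated with a small Python sketch. This is not the JXP algorithm itself: it assumes the true global PR scores of all external pages are already known, which a real peer only approximates through its meetings. Under that assumption, collapsing all external pages into one world node preserves the PR scores of the local pages exactly. The graph, the damping value 0.85, and all function names are made up for the example.

```python
def pagerank_matrix(links, n, eps=0.85):
    """Row-stochastic PR transition matrix with uniform random jumps."""
    P = [[(1 - eps) / n] * n for _ in range(n)]
    for i in range(n):
        out = links.get(i, [])
        if out:
            for j in out:
                P[i][j] += eps / len(out)
        else:  # dangling page: spread its link mass uniformly
            for j in range(n):
                P[i][j] += eps / n
    return P

def stationary(P, iters=500):
    """Stationary distribution by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def lump_world(P, pi, local):
    """Aggregate all non-local states into one world node (last index)."""
    world = [i for i in range(len(P)) if i not in local]
    mass_w = sum(pi[p] for p in world)
    k = len(local)
    A = [[0.0] * (k + 1) for _ in range(k + 1)]
    for a, q in enumerate(local):
        for b, q2 in enumerate(local):
            A[a][b] = P[q][q2]
        A[a][k] = sum(P[q][s] for s in world)        # local page -> world
    for b, q in enumerate(local):
        # world -> local page, weighted by the external pages' PR scores
        A[k][b] = sum(pi[p] * P[p][q] for p in world) / mass_w
    A[k][k] = sum(pi[p] * P[p][s] for p in world for s in world) / mass_w
    return A

# Hypothetical global graph with 6 pages; the peer holds pages 0, 1, 2.
links = {0: [1, 2], 1: [2], 2: [0, 3], 3: [4], 4: [5, 0], 5: [3]}
P = pagerank_matrix(links, 6)
global_pr = stationary(P)
local = [0, 1, 2]
local_pr = stationary(lump_world(P, global_pr, local))
```

Here `local_pr[:3]` matches the global scores of pages 0 to 2, and the last entry holds the combined mass of the external pages.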
20
JXP Algorithm at Work (1)
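Input:
G: local graph
$G_{OUT} = \{\, q \in G \mid q \to s \wedge s \in W \,\}$
n pages in G, N pages in $U = G \cup W$
$W_{IN}(G) = \{\, p \in W \mid p \to q \wedge q \in G \,\}$
$W'_{IN}(G) \subseteq W_{IN}(G)$: known part of $W_{IN}(G)$
Output:
$\pi(q)$ for $q \in G$: est. stationary probs (PR)
$\pi(G) = \sum_{q \in G} \pi(q) = 1 - \pi(W)$: est. total mass of G

[Figure: graph with nodes labeled F, G, W, H]

At each meeting with another peer:
compute world-node transitions for all $q \in G$
and the world self-loop
compute all $\pi$ values for $G \cup \{w\}$; remember $W_{IN}(G)$
info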

23
JXP Convergence
Theorem: In a fair sequence of P2P meetings, the
JXP scores of every peer converge to the global
PR scores.

Proof:
based on Markov-chain aggr./disaggr. theory
[C.D. Meyer 1988, G.E. Cho / C.D. Meyer 1999]
for world node w:
JXP(w) is non-increasing and $JXP(w) \ge PR(w)$
for nodes q in peer's graph fragment:
JXP(q) is non-decreasing and $JXP(q) \le PR(q)$

24
p2pDating

Each peer pj precomputes two MIPs synopses for
M(pj): URLs in the collection of pj (the nodes
of G) and
O(pj): URLs of the out-neighbors of pages of pj
(OUT(G))

repeat forever:
peer pj randomly picks blind-date candidate
pd
pj and pd exchange their O synopses
they may also recommend to each other a set of
friends pf
and pass on their O synopses
peer pj maintains a list of dating candidates pc
ordered by resemblance(M(pj), O(pc))
peer pj chooses best candidate for next date
(exchange of graphs, local PR computation,
etc.)
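
The candidate ordering can be sketched as follows. Exact sets and the Jaccard coefficient stand in for the MIPs synopses and their resemblance estimate; the function names, peer ids, and URLs are assumptions for illustration.

```python
def resemblance(a, b):
    """Jaccard resemblance; exact sets stand in for MIPs synopses."""
    both = a | b
    return len(a & b) / len(both) if both else 0.0

def rank_dating_candidates(my_pages, out_synopses):
    """Order candidate peers by resemblance(M(p_j), O(p_c)), best first.

    my_pages: URLs in our own collection, M(p_j)
    out_synopses: peer_id -> URLs of that peer's out-neighbors, O(p_c)
    """
    return sorted(
        out_synopses,
        key=lambda pc: resemblance(my_pages, out_synopses[pc]),
        reverse=True,
    )

# Hypothetical synopses collected from blind dates and recommendations.
my_pages = {"a", "b", "c", "d"}
out_synopses = {
    "p1": {"a", "b", "x"},
    "p2": {"x", "y"},
    "p3": {"a", "b", "c"},
}
dating_order = rank_dating_candidates(my_pages, out_synopses)
```

A peer whose pages link heavily into our collection is likely to contribute incoming-edge information, which is what the local PR computation lacks.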

25
JXP Experiments
100 peers with simulated crawls of Amazon
product categories (with recommended similar
products as links)
Ongoing work: peer trust measures, robustness to
cheating
similar and more results for real Web data
also improves precision of query-result
ranking, and query routing by combining
quality-novelty with JXP mass
26
Outline
Motivation and Research Directions ✓
P2P Query Routing ✓

Overlap Awareness
Discriminative Posting

P2P Link Analysis ✓
JXP Authority Scoring

Personalized and Community-aware Ranking

QRank and QReward

Conclusion

27
Personalized PageRank [Haveliwala et al. 2002]
Idea: random jumps favor designated high-quality
pages such as personal bookmarks,
frequently visited pages, etc.
random walk: uniformly random choice of links +
biased jumps to personal favorites (or trusted
pages or ...)
see also [Jeh 2003, Benczur 2004, Gyöngyi 2004,
Guha 2004]
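
A minimal power-iteration sketch of this idea follows. The only change compared with plain PageRank is that the random-jump targets come from a biased distribution. The link-following probability 0.85, the function name, and the four-page graph are assumptions made for the example.

```python
def personalized_pagerank(links, jump, eps=0.85, iters=100):
    """Power iteration with a biased random-jump distribution.

    links: page -> list of successor pages (every page appears as a key)
    jump: page -> jump probability, must sum to 1
    eps: probability of following a link instead of jumping (assumed value)
    """
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        nxt = {p: (1 - eps) * jump.get(p, 0.0) for p in pages}
        for p in pages:
            targets = links[p]
            if targets:
                share = eps * pr[p] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # dangling page: redistribute its mass via the jump vector
                for t in pages:
                    nxt[t] += eps * pr[p] * jump.get(t, 0.0)
        pr = nxt
    return pr

# Hypothetical graph; page "d" has no incoming links.
links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
uniform = personalized_pagerank(links, {p: 0.25 for p in links})
personal = personalized_pagerank(links, {"d": 1.0})  # "d" is a bookmark
```

With uniform jumps the bookmarked page "d" receives only its share of the random-jump mass. With all jumps directed to "d", its score and the scores of the pages it links to increase.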
28
Exploiting Query Logs and Click Streams [J.
Luxenburger et al.: WISE 04]
from PageRank: uniformly random choice of links +
random jumps
to QRank: + query-doc transitions + query-query
transitions + doc-doc
transitions on implicit links (w/ thesaurus), with
probabilities estimated from log statistics
29
Small-Scale Experiments
Setup: 70 000 Wikipedia docs, 18 volunteers
posing Trivial-Pursuit queries; ca. 500 queries,
ca. 300 refinements, ca. 1000 positive clicks; ca.
15 000 implicit links based on doc-doc similarity

Results (assessment by blind-test users):
QRank top-10 result preferred over PageRank in
81% of all cases
QRank has 50.3% precision@10, PageRank has 33.9%

Untrained example query: philosophy
     PageRank                  QRank
1.   Philosophy                Philosophy
2.   GNU free doc. license     GNU free doc. license
3.   Free software foundation  Early modern philosophy
4.   Richard Stallman          Mysticism
5.   Debian                    Aristotle
30
Negative Feedback Assessment

Users give implicit or explicit negative
assessments
non-clicked query results ranked higher than
clicked ones
encountered spam pages or personally disliked
pages
ratings of pages or other users in social
tagging networks
very valuable human input, but typically sparse

Approaches and problems: using biased random walks
for quality/trust propagation [Eiron 2004, Guha
2004, Luxenburger 2006]
penalize neg. pages by reducing their
random-jump prob.
source-specific random-jump probs and
self-loops
force backward step or random jump when reaching
neg. page
but probabilities are non-negative and
L1-normalized
→ ranking models become technically convoluted

Better approach: decouple random walk from trust
propagation → Markov reward models
31
Markov Reward Models

discrete-time or continuous-time Markov chain
with
state-specific lump reward $r_j \in \mathbb{R}$ whenever j is
entered
transition-specific lump reward $r_{ij} \in \mathbb{R}$ when $i \to j$
is traversed
(plus reward rates in CTMC case)
penalties expressed as negative rewards

analysis of transient and stationary properties
(used in queueing and performability models
→ textbooks by H.C. Tijms, R.W. Wolff; surveys by
Haverkort/Trivedi)

gained reward until step n
long-run average reward
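
For a discrete-time chain, the long-run average reward can be sketched as below. The sketch relies on the standard result for ergodic chains that the average reward per step is $\sum_i \pi_i \sum_j p_{ij} (r_{ij} + r_j)$, where $\pi$ is the stationary distribution. The slide's own formulas were not preserved, so the function names and the two-state example are assumptions.

```python
def stationary(P, iters=1000):
    """Stationary distribution of a row-stochastic matrix by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def long_run_average_reward(P, r_state, r_trans):
    """Average reward per step of an ergodic discrete-time Markov chain.

    r_state[j]: lump reward earned whenever state j is entered
    r_trans[i][j]: lump reward earned when transition i -> j is traversed
    """
    pi = stationary(P)
    n = len(P)
    return sum(
        pi[i] * P[i][j] * (r_trans[i][j] + r_state[j])
        for i in range(n) for j in range(n)
    )

# Hypothetical two-state chain with stationary distribution (5/6, 1/6).
P = [[0.9, 0.1],
     [0.5, 0.5]]
# Reward +2 on transition 0 -> 1, penalty -1 on transition 1 -> 0.
g_trans = long_run_average_reward(P, [0.0, 0.0], [[0.0, 2.0], [-1.0, 0.0]])
# Reward +1 whenever state 1 is entered, no transition rewards.
g_state = long_run_average_reward(P, [0.0, 1.0], [[0.0, 0.0], [0.0, 0.0]])
```

Negative entries in `r_trans` play the role of penalties, while the transition matrix itself stays non-negative and row-normalized.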
32
QReward Ranking [J. Luxenburger et al.: WebDB 06]

Add queries and users as nodes to the state
graph
and connect to clicked, non-clicked, rated pages

Associate transition-specific lump rewards:
+1 for each positive assessment
-1 for each negative assessment
0 otherwise

Perform random walk in standard way,
using links and random jumps, yielding
stationary probs $\pi_j$

Compute long-run average reward $g_j$ for each
state j

Quality of page j = $\beta \, g_j + (1 - \beta) \, \pi_j$

33
Fast Computation of QReward
Renewal-Reward Theorem (Wolff p. 60)
Compute $\pi_j$ values as usual by power iteration
(using QRank),
but we need sufficient accuracy for all i, not
just for the high-ranked ones
→ iterate QRank for $\pi_j$, QReward, and quality
score; stop when quality scores of top pages
converge
34
Preliminary Experiments
Setup: 70 000 Wikipedia docs, 18 volunteers
posing queries; ca. 500 queries, ca. 300
refinements, ca. 1000 positive clicks, ca. 2000
implicit negative assessments (cf. [Joachims et
al.: SIGIR 05])

Results (based on relevance majority votes of 3
users):
PR has MAP 0.45 for top-15 of 14 test queries,
QRank has MAP 0.51, QReward has MAP 0.56

Ongoing work: combine with personalized
LMs and trust models; larger-scale experimentation
Example query: political system China
     PageRank                    QReward
1.   China                       One country, two systems
2.   People's Republic of China  China
3.   List of countries           Party discipline
4.   Country                     List of countries
5.   Chinese language            Communist state
35
Outline
Motivation and Research Directions ✓
P2P Query Routing ✓

Overlap Awareness
Discriminative Posting

P2P Link Analysis ✓
JXP Authority Scoring

Personalized and Community-aware Ranking ✓
QRank and QReward

Conclusion

36
Conclusion: Challenges Remain Open