1
In collaboration with Matthias Bender, Debora
Donato, Alessandro Linari, Julia Luxenburger,
Sebastian Michel, Nikos Ntarmos, Josiane
Parreira, Peter Triantafillou, Christian Zimmer
2
Peer-to-Peer (P2P) Systems
Decentralized, self-organizing, highly
dynamic loose coupling of many autonomous
computers
  • unstructured overlay networks with epidemic dissemination (flooding)
  • structured overlay networks based on distributed hash tables (DHTs)
  • Applications:
  • Large-scale computation (SETI@home, etc.)
  • File sharing (Napster, Gnutella, KaZaA, BitTorrent, etc.)
  • Publish-Subscribe (Blogs, Marketplaces, etc.)
  • Collaborative work (Games, etc.)
  • IP telephony (Skype)

3
Peer-to-Peer Web Search
Vision: Self-organizing P2P Web Search Engine
with Google-or-better functionality
  • Scalable Self-Organizing Data Structures and Algorithms
    (DHTs, Semantic Overlay Networks, Epidemic Spreading, Distr. Link Analysis, etc.)
  • Better Search Result Quality (Precision, Recall, etc.)
  • Powerful Search Methods for Each Peer
    (Concept-based Search, Query Expansion, XML IR, Personalization, etc.)
  • Leverage User/Community Input (Wisdom of Crowds)
    (Bookmarks, Feedback, Query Logs, Click Streams, Evolving Web, etc.)
  • Collaboration among Peers
    (Query Routing, Incentives, Fairness, Anonymity, etc.)
  • Benefit of Large-scale Social Networks
  • Small-World Phenomenon
  • Breaking Information Monopolies

4
Solution without Problem?
No killer app with business value!? But:
  • P2P is potentially useful also for server farms and grids
  • interesting non-business applications

5
Outline
  • Motivation and Research Directions ✓
  • P2P Query Routing: Overlap Awareness, Discriminative Posting
  • P2P Link Analysis: JXP Authority Scoring
  • Personalized and Community-aware Ranking: QRank and QReward
  • Conclusion

6
Computational Model
  • Peers connected by overlay network (e.g. DHT, random graph) and IP
  • Each peer has a full-fledged local search engine
    with crawler/importer, indexer, query processor
  • Each peer has autonomously compiled (e.g. crawled)
    its own content according to the user's thematic interests
    → peer-specific collections
  • When a query is issued by a peer, it is first executed locally
    and then possibly routed to carefully selected other peers
  • Peers can post summaries / synopses / metadata / QoS info
    to a (distr.) network-wide directory (space O(terms × peers))
    with efficient per-key lookup

7
Minerva System Architecture
based on scalable, churn-resilient DHT
Query routing (QR) aims to optimize benefit/cost,
driven by distributed statistics on peers'
content quality, content overlap, freshness,
authority, trust, performability, etc.
Dynamically precompute "good peers" to
maintain a Semantic Overlay Network (SON).
Exploit community input (bookmarks, etc.).
8
P2P Query Routing (Resource Selection)
  • Principle: select peers with highest benefit/cost ratio, where
  • $\mathrm{benefit}(P_i) = \mathrm{quality}(P_i, q) = \alpha \cdot \mathrm{sim}(q, X_i) + (1 - \alpha) \cdot \mathrm{sim}(X_0, X_i)$
  • $\mathrm{cost}(P_i)$ = estimated response time or communication costs

e.g. for $\mathrm{sim}(q, X_i)$ use prob.-IR CORI [Callan 95]
and for $\mathrm{sim}(X_0, X_i)$ use rel. entropy
  • Method:
  • Precompute per-term peer-quality scores and keep them in the directory
  • QR aggregates PeerLists for query terms and selects top-k peers
  • Caveat:
  • Peer-peer similarity overfits to content quality
    and ignores overlap
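
The routing method above can be illustrated with a minimal sketch. It assumes that the directory is a plain dictionary from terms to PeerLists and that per-term scores are aggregated by a simple sum; the names `route_query`, `directory`, and `cost` are hypothetical and not taken from the Minerva system.

```python
from collections import defaultdict


def route_query(query_terms, directory, cost, k):
    """Return the k peers with the highest benefit/cost ratio.

    directory: term -> {peer: precomputed per-term quality score}
    cost:      peer -> estimated response time (must be > 0)
    """
    benefit = defaultdict(float)
    for term in query_terms:
        # Aggregate the PeerLists of all query terms (assumption: plain sum).
        for peer, score in directory.get(term, {}).items():
            benefit[peer] += score
    ranked = sorted(benefit, key=lambda p: benefit[p] / cost[p], reverse=True)
    return ranked[:k]


# Toy directory with made-up scores and costs.
directory = {
    "p2p": {"P1": 0.9, "P2": 0.4, "P3": 0.1},
    "search": {"P1": 0.2, "P2": 0.7},
}
cost = {"P1": 1.0, "P2": 2.0, "P3": 0.5}
top = route_query(["p2p", "search"], directory, cost, k=2)
print(top)
```

P1 and P2 reach the same aggregated benefit here, but P1 is ranked first because its estimated cost is lower.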

9
Overlap Awareness [Bender et al. SIGIR 05,
EDBT 06]
Estimate $\mathrm{overlap}(p_0, p_j) = |X_0 \cap X_j| / |X_0 \cup X_j|$
between query initiator peer $p_0$ and QR candidate
$p_j$ using min-wise independent permutations (MIPs)
[Broder 97] on the URLs in the collections of $p_0$
and $p_j$ (with precomputed per-term MIPs
posted to directory).
Consider candidates $p_j$ in desc. order of
estimated quality for q and re-rank peers by
[formula not preserved in the transcript].
Better: estimate novelty of additional $p_j$
[formulas not preserved in the transcript]
and rank peers by integrated quality-novelty (IQN).
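
The idea of combining quality with novelty can be sketched as a greedy selection loop. The exact IQN formulas are not preserved in this transcript, so the scoring rule below (quality times the number of not-yet-covered documents) is only an assumption for illustration, and exact document sets stand in for the MIPs synopses a real peer would use. The name `select_peers_iqn` is hypothetical.

```python
def select_peers_iqn(initiator_docs, candidates, k):
    """Greedy quality-novelty peer selection.

    candidates: peer -> (quality, set of document ids)
    Assumption: score = quality * number of not-yet-covered documents.
    """
    covered = set(initiator_docs)
    remaining = dict(candidates)
    chosen = []
    while remaining and len(chosen) < k:
        def score(peer):
            quality, docs = remaining[peer]
            return quality * len(docs - covered)

        best = max(remaining, key=score)
        chosen.append(best)
        # Documents of the chosen peer are no longer novel afterwards.
        covered |= remaining.pop(best)[1]
    return chosen


candidates = {
    "A": (0.9, {1, 2, 3, 4}),   # high quality, but mostly overlaps the initiator
    "B": (0.8, {4, 5, 6, 7}),
    "C": (0.5, {8, 9, 10}),
}
selected = select_peers_iqn({1, 2, 3}, candidates, k=2)
print(selected)
```

A quality-only ranking would pick A first; the novelty term pushes it behind B and C because A adds only one new document.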
10
Min-Wise Independent Permutations [Broder 97]
set of ids:                  17 21  3 12 24  8
$h_1(x) = (7x + 3) \bmod 51$:  20 48 24 36 18  8
$h_2(x) = (5x + 6) \bmod 51$:  40  9 21 15 24 46
...
$h_N(x) = (3x + 9) \bmod 51$:   9 21 18 45 30 33
Compute N random permutations $\pi$ with
$P[\min\{\pi(x) \mid x \in S\} = \pi(x)] = 1/|S|$
MIPs are an unbiased estimator of overlap:
$P[\min\{h(x) \mid x \in A\} = \min\{h(y) \mid y \in B\}] = |A \cap B| / |A \cup B|$
MIPs can be viewed as repeated sampling of x, y
from A, B
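
A small sketch of the technique: the first part recomputes the minima for the three hash functions shown on the slide, the second part estimates the resemblance of two sets from their MIPs vectors. Using salted BLAKE2b digests as stand-ins for random permutations is an assumption made here for convenience; the function names are hypothetical.

```python
import hashlib


def slide_hashes():
    """The three linear hash functions shown on the slide."""
    return [
        lambda x: (7 * x + 3) % 51,
        lambda x: (5 * x + 6) % 51,
        lambda x: (3 * x + 9) % 51,
    ]


def salted_hash(salt):
    """Assumption: a salted BLAKE2b digest stands in for a random permutation."""
    def h(x):
        digest = hashlib.blake2b(str(x).encode(), digest_size=8,
                                 salt=salt.to_bytes(8, "big")).digest()
        return int.from_bytes(digest, "big")
    return h


def mips(ids, hash_functions):
    """MIPs vector: the minimum hash value of the set under each function."""
    return [min(h(x) for x in ids) for h in hash_functions]


def resemblance(mips_a, mips_b):
    """Fraction of positions in which the two MIPs vectors agree."""
    matches = sum(1 for a, b in zip(mips_a, mips_b) if a == b)
    return matches / len(mips_a)


slide_mips = mips([17, 21, 3, 12, 24, 8], slide_hashes())
print(slide_mips)  # [8, 9, 9]

hashes = [salted_hash(i) for i in range(256)]
set_a, set_b = set(range(300)), set(range(150, 450))  # true resemblance 1/3
estimate = resemblance(mips(set_a, hashes), mips(set_b, hashes))
print(estimate)
```

Only the fixed-length MIPs vectors need to be exchanged or posted to the directory, not the sets themselves.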
11
IQN Experimental Results
Experiment based on 100 .Gov partitions (1.25
Mio. docs), assigned to 50 peers, with each peer
holding 10 partitions and 80% overlap for peers
$P_i$, $P_{i+1}$, with 50 TREC-2003 Web queries, e.g.
"pest safety control", "juvenile delinquency",
"marijuana legalization", etc.
[Figure: relative recall vs. number of queried peers]
For more experiments see our papers [Bender et
al. SIGIR 05, EDBT 06, WIRI 06]
12
Discriminative Posting
  • peer pj posts a term only if pj has term-specific content
    above average (or above quantile) of quality measure
  • reduces load on P2P directory
  • may ease decision on good query routing
  • requires global statistics on quality measures!

e.g. peer posts only if local df > $\alpha$ (global df)
with $\alpha$ < 1
Experiment: 250 000 Web pages on 40 peers,
popular Google queries (e.g. "national
hurricane center")
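
The posting decision can be sketched as a threshold test. This is a minimal illustration that assumes the peer already knows the quality scores of all peers for the term (the global statistic the slide says is required); the function name `should_post` and the simple quantile rule are hypothetical choices.

```python
def should_post(local_score, global_scores, quantile=None):
    """Decide whether a peer posts a term to the directory.

    global_scores: quality scores of all peers for this term
                   (assumed to be available as a global statistic).
    quantile=None  -> post if above the global average;
    quantile=0.75  -> post if above the 75% quantile, and so on.
    """
    if quantile is None:
        threshold = sum(global_scores) / len(global_scores)
    else:
        ordered = sorted(global_scores)
        index = min(int(quantile * len(ordered)), len(ordered) - 1)
        threshold = ordered[index]
    return local_score > threshold


# Made-up local document frequencies of five peers for one term.
scores = [1, 2, 3, 10, 40]
print(should_post(40, scores))                 # above the average of 11.2
print(should_post(10, scores))                 # below the average
print(should_post(10, scores, quantile=0.5))   # above the median of 3
```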
13
Efficiently Capturing Global Statistics
  • gdf (global doc. freq.) of a term is an interesting key measure
    for discriminative posting or for P2P result merging,
    but overlap among peers makes simple distr. counting infeasible
  • hash sketches [Flajolet/Martin 85]:
  • duplicate-insensitive cardinality estimator (counts distinct elements) for multisets
  • hash each multiset element x onto an m-bit bitvector
    and remember the least-significant 1 bit $\rho(h(x))$
  • $\max_{x \in S} \rho(h(x))$ estimates $\approx \log_2(0.77351 \, |S|)$
    with std.dev. [expression not preserved in the transcript] / $|S|$
  • rough intuition: [not preserved in the transcript]
  • average multiple iid sketches
14
Efficient and Accurate gdf Estimation [Bender et
al. WebDB 06]
Hash sketches of different peers are collected at
the directory peer; distributivity is free:
$\bigcup_i \{\rho(h(x)) \mid x \in S_i\} = \{\rho(h(x)) \mid x \in \bigcup_i S_i\}$
  • gdf estimation algorithm
  • each peer p posts hash sketch for each
    (discriminative) term t to directory
  • directory peer for term t forms union of
    incoming hash sketches
  • when a peer needs to know gdf(t), simply ask
    directory peer for t
  • sliding-window techniques for dynamic adjustment

[Figure: directory peers dir(t), dir(c), dir(f), dir(d), dir(a), dir(e)]
15
gdf Estimation Experiments
Experiment with steady-state P2P system: 1000
peers, each with 1000 randomly chosen docs from 1
Mio. docs
Experiment with churn: peers joining and leaving
according to Poisson processes
16
Outline
  • Motivation and Research Directions ✓
  • P2P Query Routing: Overlap Awareness, Discriminative Posting ✓
  • P2P Link Analysis: JXP Authority Scoring
  • Personalized and Community-aware Ranking: QRank and QReward
  • Conclusion

17
Distributed PageRank (PR)
Page authority is important for final result scoring
  • Exploit locality in Web link graph: construct block structure
    (disjoint graph partitioning) based on sites or domains
  • Compute page PR within site/domain and site/domain weights,
    combine page scores with site/domain scores
    [Kamvar03, Lee03, Broder04, Wang04, Wu05], or
  • communicate PR mass propagation across sites
    [Abiteboul00, Sankaralingam03, Shi03, Jelasity05]

18
PageRank (PR) in a P2P Network
  • Every peer crawls Web fragments at its discretion
    and has its own local / personalized search engine
    → overlaps between peers' graphs may occur

19
JXP (Juxtaposed Approximate PageRank) [J.X.
Parreira et al. WebDB 05, VLDB 06]
based on Markov-chain aggregation (state
lumping) [Courtois 1977, Meyer 1988], cf.
[Chien et al. 2004, Langville/Meyer 2005]
Each peer represents the external, a priori unknown
part of the global graph by one superstate, a
"world node"
  • peers meet randomly
  • exchange their local graph fragments and PR vectors
  • learn about incoming edges to nodes of local graph
  • compute local PR on merged graphs or enhanced local graph
  • keep only improved PR and own local graph
  • don't keep other peers' graph fragments

Converges to global PR (experiments and theoretical
arguments).
Convergence is sped up by the biased p2pDating
strategy: prefer peers whose nodeset of outgoing
links has high overlap with our nodeset (use
MIPs as synopses).
20
JXP Algorithm at Work (1)
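Input: local graph $G$; $G_{OUT} = \{q \in G \mid q \to s \wedge s \in W\}$;
n pages in G, N pages in $U = G \cup W$;
$W_{IN}(G) = \{p \in W \mid p \to q \wedge q \in G\}$;
$W'_{IN}(G) \subseteq W_{IN}(G)$: known part of $W_{IN}(G)$
Output: $\pi(q)$ for $q \in G$: est. stationary probs (PR);
$\pi(G) = \sum_{q \in G} \pi(q) = 1 - \pi(W)$: est. total mass of G
[Figure: graph with regions F, G, W, H]
  • At each meeting with another peer:
  • compute [formula not preserved in the transcript] for all $q \in G$
  • world self-loop: [formula not preserved in the transcript]
  • compute all $\pi$ values for $G \cup \{w\}$; remember $W_{IN}(G)$ info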

23
JXP Convergence
Theorem: In a fair sequence of P2P meetings, the
JXP scores of every peer converge to the global
PR scores.
  • Proof:
  • based on Markov-chain aggr./disaggr. theory
    [C.D. Meyer 1988, G.E. Cho and C.D. Meyer 1999]
  • for world node w:
    JXP(w) is non-increasing and $JXP(w) \geq PR(w)$
  • for nodes q in a peer's graph fragment:
    JXP(q) is non-decreasing and $JXP(q) \leq PR(q)$

24
p2pDating
  • Each peer pj precomputes two MIPs synopses:
  • M(pj): URLs in the collection of pj (the nodes of G), and
  • O(pj): URLs of the out-neighbors of pages of pj (OUT(G))
  • repeat forever:
  • peer pj randomly picks "blind date" candidate pd
  • pj and pd exchange their O synopses
  • they may also recommend to each other a set of "friends" pf
    and pass on their O synopses
  • peer pj maintains a list of dating candidates pc
    ordered by resemblance(M(pj), O(pc))
  • peer pj chooses best candidate for next date
    (exchange of graphs, local PR computation, etc.)
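
The candidate ordering in the loop above can be sketched as follows. For brevity the sketch computes the exact resemblance on plain URL sets, whereas a real peer would estimate it from the MIPs synopses; the function names are hypothetical.

```python
def jaccard(a, b):
    """Exact resemblance; a real peer would estimate this from MIPs."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def rank_dating_candidates(own_nodes, candidate_out_sets):
    """Order candidates pc by resemblance(M(pj), O(pc)), best first.

    own_nodes:          M(pj), the URLs in our own collection
    candidate_out_sets: peer -> O(pc), URLs of its pages' out-neighbors
    """
    return sorted(candidate_out_sets,
                  key=lambda pc: jaccard(own_nodes, candidate_out_sets[pc]),
                  reverse=True)


own = {"a", "b", "c", "d"}
out_sets = {
    "p1": {"x", "y"},            # links nowhere into our collection
    "p2": {"a", "b", "z"},       # links to half of our pages
    "p3": {"a", "b", "c", "d"},  # links to exactly our pages
}
order = rank_dating_candidates(own, out_sets)
print(order)
```

Peers whose outgoing links point into our collection come first, since a meeting with them is most likely to reveal new incoming edges.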

25
JXP Experiments
100 peers with simulated crawls of Amazon
product categories (with "recommended similar
products" as links)
Ongoing work: peer trust measures, robustness to
cheating
Similar and more results for real Web data.
JXP also improves the precision of query-result
ranking, and query routing by combining
quality-novelty with JXP mass.
26
Outline
  • Motivation and Research Directions ✓
  • P2P Query Routing: Overlap Awareness, Discriminative Posting ✓
  • P2P Link Analysis: JXP Authority Scoring ✓
  • Personalized and Community-aware Ranking: QRank and QReward
  • Conclusion

27
Personalized PageRank [Haveliwala et al. 2002]
Idea: random jumps favor designated high-quality
pages such as personal bookmarks,
frequently visited pages, etc.
[formula not preserved in the transcript]
random walk: uniformly random choice of links
+ biased jumps to personal favorites (or trusted
pages or ...)
see also [Jeh 2003, Benczur 2004, Gyöngyi 2004,
Guha 2004]
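
The idea can be sketched with a plain power iteration in which the random-jump distribution is concentrated on the favorites. This is the standard textbook formulation, not necessarily the exact formula from the slide; the jump probability `epsilon`, the uniform distribution over favorites, and the handling of dangling pages are assumptions.

```python
def personalized_pagerank(links, favorites, epsilon=0.15, iterations=100):
    """Power iteration with random jumps biased towards `favorites`.

    links:     page -> list of pages it links to
    favorites: pages that receive all random-jump probability (assumed uniform)
    epsilon:   random-jump probability (assumed value)
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    jump = {p: (1.0 / len(favorites) if p in favorites else 0.0) for p in pages}
    rank = dict(jump)
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        jump_mass = 0.0
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Follow a uniformly chosen link with probability 1 - epsilon.
                for t in targets:
                    nxt[t] += (1 - epsilon) * rank[p] / len(targets)
                jump_mass += epsilon * rank[p]
            else:
                # Dangling page: all of its mass follows the jump vector.
                jump_mass += rank[p]
        for p in pages:
            nxt[p] += jump_mass * jump[p]
        rank = nxt
    return rank


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
personal = personalized_pagerank(links, favorites={"b"})
uniform = personalized_pagerank(links, favorites={"a", "b", "c", "d"})
print(personal["b"], uniform["b"])
```

With all jumps directed at page b, its score rises compared to uniform jumps, while page d, which has no in-links and is not a favorite, gets no mass at all.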
28
Exploiting Query Logs and Click Streams [J.
Luxenburger et al. WISE 04]
From PageRank: uniformly random choice of links
+ random jumps
to QRank: + query-doc transitions + query-query
transitions + doc-doc
transitions on implicit links (w/ thesaurus), with
probabilities estimated from log statistics
29
Small-Scale Experiments
Setup: 70 000 Wikipedia docs, 18 volunteers
posing Trivial-Pursuit queries; ca. 500 queries,
ca. 300 refinements, ca. 1000 positive clicks, ca.
15 000 implicit links based on doc-doc similarity
  • Results (assessment by blind-test users):
  • QRank top-10 result preferred over PageRank in
    81% of all cases
  • QRank has 50.3% precision@10, PageRank has 33.9%

Untrained example query "philosophy":
     PageRank                   QRank
  1. Philosophy                 Philosophy
  2. GNU free doc. license      GNU free doc. license
  3. Free software foundation   Early modern philosophy
  4. Richard Stallman           Mysticism
  5. Debian                     Aristotle
30
Negative Feedback Assessment
  • Users give implicit or explicit negative assessments:
  • non-clicked query results ranked higher than clicked ones
  • encountered spam pages or personally disliked pages
  • ratings of pages or other users in social tagging networks
  • very valuable human input, but typically sparse
  • Approaches and problems using biased random walks
    for quality/trust propagation [Eiron 2004, Guha 2004, Luxenburger 2006]:
  • penalize neg. pages by reducing their random-jump prob.
  • source-specific random-jump probs and self-loops
  • force backward step or random jump when reaching neg. page
  • but probabilities are non-negative and L1-normalized
    → ranking models become technically convoluted

Better approach: decouple random walk from trust
propagation → Markov reward models
31
Markov Reward Models
  • discrete-time or continuous-time Markov chain with
  • state-specific lump reward $r_j \in \mathbb{R}$ whenever j is entered
  • transition-specific lump reward $r_{ij} \in \mathbb{R}$ when $i \to j$ is traversed
  • (plus reward rates in CTMC case)
  • penalties expressed as negative rewards
  • analysis of transient and stationary properties
    (used in queueing and performability models
    → textbooks by H.C. Tijms, R.W. Wolff; surveys by Haverkort/Trivedi)

Quantities of interest (formulas not preserved in the transcript):
gained reward until step n
long-run average reward
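
For a discrete-time chain, the long-run average reward can be illustrated with a short sketch. It assumes an ergodic chain given as a row-stochastic matrix and uses the standard steady-state expression (sum over transitions of stationary probability times transition probability times the rewards earned); the function names are hypothetical and the slide's own formulas are not reproduced here.

```python
def stationary_distribution(P, iterations=1000):
    """Power iteration for pi = pi P (assumes an ergodic chain)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def long_run_average_reward(P, transition_reward, state_reward):
    """Expected reward per step in steady state.

    transition_reward[i][j]: lump reward when i -> j is traversed
    state_reward[j]:         lump reward whenever j is entered
    """
    n = len(P)
    pi = stationary_distribution(P)
    return sum(pi[i] * P[i][j] * (transition_reward[i][j] + state_reward[j])
               for i in range(n) for j in range(n))


# Two-state toy chain: state 0 is rewarded, state 1 is penalized.
P = [[0.9, 0.1],
     [0.5, 0.5]]
no_transition_rewards = [[0.0, 0.0], [0.0, 0.0]]
g = long_run_average_reward(P, no_transition_rewards, state_reward=[1.0, -1.0])
print(stationary_distribution(P), g)
```

The chain spends 5/6 of its time in state 0 and 1/6 in state 1, so the penalty for state 1 lowers the average reward from 1 to 2/3 without touching the transition probabilities themselves.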
32
QReward Ranking [J. Luxenburger et al. WebDB 06]
  • Add queries and users as nodes to the state graph
    and connect to clicked, non-clicked, rated pages
  • Associate transition-specific lump rewards:
  • +1 for each positive assessment
  • -1 for each negative assessment
  • 0 otherwise
  • Perform random walk in standard way,
    using links and random jumps, yielding stationary probs $\pi_j$
  • Compute long-run average reward $g_j$ for each state j
  • Quality of page j = $\beta \, g_j + (1 - \beta) \, \pi_j$

33
Fast Computation of QReward
Renewal-Reward Theorem (Wolff p. 60):
[formula not preserved in the transcript]
Compute $\pi_j$ values as usual by power iteration
(using QRank),
but we need sufficient accuracy for all i, not
just for the high-ranked ones
→ iterate QRank for $\pi_j$, QReward, and quality
score; stop when quality scores of top pages
converge
34
Preliminary Experiments
Setup: 70 000 Wikipedia docs, 18 volunteers
posing queries; ca. 500 queries, ca. 300
refinements, ca. 1000 positive clicks, ca. 2000
implicit negative assessments (cf. [Joachims et
al. SIGIR 05])
  • Results (based on relevance majority votes of 3 users):
  • PR has MAP 0.45 for top-15 of 14 test queries,
  • QRank has MAP 0.51, QReward has MAP 0.56

Ongoing work: combine with personalized
LMs and trust models; larger-scale experimentation
Example query "political system China":
     PageRank                       QReward
  1. China                          One country, two systems
  2. People's Republic of China     China
  3. List of countries              Party discipline
  4. Country                        List of countries
  5. Chinese language               Communist state
35
Outline
  • Motivation and Research Directions ✓
  • P2P Query Routing: Overlap Awareness, Discriminative Posting ✓
  • P2P Link Analysis: JXP Authority Scoring ✓
  • Personalized and Community-aware Ranking: QRank and QReward ✓
  • Conclusion

36
Conclusion: Challenges Remain Open
  • Distributed Statistics Management
  • Key to Query Routing, Quality/Overlap
    Estimation, Ranking (PR etc.)
  • Capturing Global Statistics in Decentralized
    Manner
  • Efficiently Disseminating Statistical Synopses
  • Robustness to Churn and Cheating
  • Statistically Semantic Social Overlay
    Networks
  • Experimental Evaluation
  • Benchmarking Methodology
  • Large-scale P2P Testbed
  • Capturing User/Community Behavior