SmartSeer: Continuous Queries over Citeseer

Transcript and Presenter's Notes

1
SmartSeer: Continuous Queries over Citeseer
Jayanthkumar Kannan, Beverly Yang, Scott Shenker,
Sujata Banerjee, Sung-Ju Lee, Puneet Sharma, Sujoy Basu
2
Motivation
  • Problem
  • Increasing number of research papers
  • No easy way for researchers to keep themselves
    informed of new papers
  • Increasingly a problem as online preprint
    libraries become more popular

3
Our Solution: SmartSeer
  • Like CiteSeer
  • Allow users to register continuous queries
  • immediate query → one-time results
  • continuous query → notify the user of incremental
    results (news alerts, etc.)
  • Notification versus polling
  • Makes sense for scientific publications: users can
    keep abreast of pertinent papers
  • Event notification systems are becoming more common
    (news alerts, web alerts)
  • Also provides immediate query capabilities

4
SmartSeer Requirements
  • Rich Continuous Queries
  • Donated Infrastructure

5
Rich Continuous Queries
  • Citeseer documents have rich structured metadata
  • citations
  • authors
  • etc.
  • Allow queries to exploit this structure
  • "Give me all papers that cite me or my
    co-authors"
  • E.g., nested queries, joins
  • Certain SQL constructs are expensive, however
    (more later)

6
Sample Queries
  • Text matching
  • TEXT=database & TEXT=networks
  • Vanity: Who is citing me?
  • CITES(AUTHOR=xxx)
  • Curious Joe
  • AUTHOR(AUTHOR=Joe)
  • Bridges between communities
  • AUTHOR(CONF=Sigcomm) & AUTHOR(CONF=Sigmod)

7
Donated Infrastructure
  • Single server clearly inadequate to handle
    eventual load
  • Costly to buy and maintain large number of
    servers and bandwidth
  • Google does it, but how much does it cost them?
  • Instead, use individual servers and bandwidth
    donated by different organizations
  • Loosely maintained
  • Desire a self-organizing architecture
  • The success of Planetlab is encouraging

8
Related Work
  • Telegraph (and others)
  • Allows fully general SQL queries
  • Works on a tightly coupled distributed system
  • Event notification systems (Scribe, etc.)
  • Only simple event semantics
  • PIER
  • Only immediate queries on a DHT

9
Outline
  • Introduction
  • System architecture
  • Alternative strategies
  • Performance evaluation
  • Complex queries

10
Which distributed paradigm?
  • Query Replication
  • Document Replication
  • Rendezvous Approach

11
Query Replication
  • Query Replication
  • Every query stored on all nodes
  • New document sent to any one node
  • Does not scale with number of queries

12
Document Replication
  • Document Replication
  • Queries randomly partitioned among all nodes
  • New document sent to all nodes
  • Partitioning-by-ID (Gnutella)
  • Does not scale

(Diagram: a new document D is sent to every node A, B, C, each holding
its own partition of the queries Q.)
13
Rendezvous Approach
  • Keyword space partitioned among all nodes
  • Provided by a DHT
  • Essentially distributed hashing
  • Perform lookup in O(log n) hops
  • Rendezvous mechanism ensures related queries and
    documents meet
  • Are queries sent to the documents or vice-versa?

14
Primer: Immediate Queries
  • Inverted list approach
  • Node responsible for keyword W stores the IDs of
    documents containing that word
  • Query Q = W1 & W2 & …
  • Fetch the lists of IDs stored under W1, W2, …
  • Intersect these lists (see the sketch below)
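
As a concrete illustration, here is a minimal Java sketch of the
lookup-and-intersect step; the in-memory map stands in for the DHT
lookups, and the class and method names are ours, not SmartSeer's:

```java
import java.util.*;

// Immediate-query evaluation over inverted lists. The map stands in for
// the DHT: keyword -> IDs of documents containing that keyword.
class ImmediateQuery {
    static Set<Integer> evaluate(Map<String, Set<Integer>> index, List<String> terms) {
        Iterator<String> it = terms.iterator();
        // Start from the first term's list and intersect the rest into it.
        Set<Integer> result = new HashSet<>(index.getOrDefault(it.next(), Set.of()));
        while (it.hasNext()) {
            result.retainAll(index.getOrDefault(it.next(), Set.of()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = Map.of(
            "cat", Set.of(1, 4, 7, 19, 20),
            "dog", Set.of(1, 5, 7, 26),
            "cow", Set.of(2, 4, 8, 18),
            "bat", Set.of(1, 8, 31));
        // "cat & dog" -> docs {1, 7}, matching the next slide's example.
        System.out.println(evaluate(index, List.of("cat", "dog")));
    }
}
```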

15
Immediate Queries
(Diagram: node A holds Cat: 1,4,7,19,20; node B holds Dog: 1,5,7,26;
node D holds Cow: 2,4,8,18 and Bat: 1,8,31. For the query "cat & dog",
the lists under Cat and Dog are fetched and intersected, and result
docs 1 and 7 are sent back.)
16
Immediate Queries
  • Rough Model
  • Bandwidth = C × N × I
  • DHTs provide lookups in O(log n) hops
  • Shipping inverted lists can be expensive
  • Many existing techniques to optimize
  • See the MIT paper "On the Feasibility of
    Peer-to-Peer Web Searching and Indexing"

17
Continuous Queries
  • Interchange the roles of documents and queries
  • Consider AND queries for now
  • Query Q = W1 & W2 is stored at one of the words
    W1, W2
  • The most selective keyword

18
Continuous Queries
(Diagram: the same keyword partition as before (Cat at node A, Dog at
node B, Cow and Bat at node D), now used to store continuous queries
rather than serving one-time lookups.)
19
Continuous Queries
  • Document Insertion
  • Fetch the query list stored under each word in the
    document
  • Notify the owners of satisfied queries (this flow is
    sketched below)
  • Basic "Send Query" strategy
  • Note: we still need document indices
  • For running certain complex queries
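
A small Java sketch of this insertion flow, with a map again standing in
for the DHT's term-to-query-list lookup (all names are illustrative):

```java
import java.util.*;

// Basic send-query matching on document insertion: for each term in the
// document, fetch the continuous queries registered under that term and
// notify owners of AND queries the document satisfies.
class ContinuousMatch {
    record Query(String owner, Set<String> terms) {}   // conjunctive query

    static void onInsert(Map<String, List<Query>> queriesByTerm, Set<String> docTerms) {
        for (String term : docTerms) {
            for (Query q : queriesByTerm.getOrDefault(term, List.of())) {
                // Each query is registered under exactly one of its terms
                // (its most selective keyword), so each match fires once.
                if (docTerms.containsAll(q.terms())) {
                    System.out.println("notify " + q.owner());
                }
            }
        }
    }
}
```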

20
Continuous Queries
(Diagram: node A stores the queries registered under Cat ("cat & dog",
"cat & horse & dog", "cat & horse & cow"), node B those under Dog
("dog & sheep", "dog & cow & sheep", "dog & lamb & sheep"), and node D
those under Bat ("bat & cow", "bat & mouse") and Cow ("cow & horse").
An inserted document satisfies query 1 under Cat, and its owner is
notified.)

21
Continuous Queries
(Diagram: the same query placement; this inserted document satisfies no
registered query, so no action is taken.)

22
Mimic the immediate query strategy
  • Rough Model
  • Bandwidth = C × Q × R
  • Q corresponds to N, R corresponds to I
  • Problems with this method
  • Does not scale with document corpus size
  • Bandwidth linear in both N and Q
  • Key Metric
  • Bandwidth
  • Not latency

23
Outline
  • Introduction
  • System architecture
  • Join strategies
  • Performance evaluation
  • Complex queries

24
Join strategies
  • Must somehow join document metadata and query
    metadata
  • Document Node (DN)
  • Node at which document currently resides
  • e.g., document insertion node
  • For each term in the document, contacts the
    query node responsible for the term
  • Query Node (QN) for term T
  • Manages the list of queries registered on a
    particular term T

25
Send Document
  • DN sends entire document to QN
(Diagram: the DN ships the full document to each query node (A holding
queries under Cat, B under Dog, D under Bat and Cow), and each QN checks
its registered queries locally; here the document matches none, so no
action is taken.)

26
Send Document
  • Pros
  • Simple
  • Low latency
  • Good if shipping the document < shipping the queries
  • Documents are small
  • Many queries
  • Few nodes
  • Cons
  • Potentially (usually) very expensive

27
Term Dialogue
  • Let Q denote the queries stored at the QN
  • Q = {Q1, Q2, …}
  • Let K denote the distinct keywords in Q
  • K = {K1, K2, K3, …}
  • If a keyword K does not occur in document D, all
    queries in Q containing K can be eliminated

28
Term Dialogue
  • QN chooses some term T from K
  • DN replies with yes/no
  • QN then prunes the set of candidate queries
  • On "no": delete all queries from Q that contain T
  • Delete all keywords from K that no longer appear
    in queries in Q
  • Repeat (this loop is sketched below)
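
A Java sketch of the pruning loop, with a local set standing in for the
DN's side of the dialogue; in the real system each membership test is a
network round trip, and the names here are ours:

```java
import java.util.*;

// Term dialogue on the QN: probe one unconfirmed term at a time, prune
// on "no", and strip confirmed terms on "yes". A query whose term set
// becomes empty has had every keyword confirmed, i.e., it is satisfied.
class TermDialogue {
    static List<Set<String>> dialogue(List<Set<String>> candidates, Set<String> docTerms) {
        List<Set<String>> live = new ArrayList<>();
        for (Set<String> q : candidates) live.add(new HashSet<>(q));
        while (true) {
            // Keywords still appearing in some live query.
            Set<String> remaining = new HashSet<>();
            for (Set<String> q : live) remaining.addAll(q);
            if (remaining.isEmpty()) break;
            String t = remaining.iterator().next();   // simple (singleton) choice
            if (docTerms.contains(t)) {               // DN answers "yes"
                for (Set<String> q : live) q.remove(t);
            } else {                                  // DN answers "no"
                live.removeIf(q -> q.contains(t));
            }
        }
        return live;   // every surviving query is satisfied
    }
}
```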

29
Term Dialogue
(Diagram: the singleton heuristic at work on the queries registered under
Cat ("cat & dog", "cat & horse & dog", "cat & horse & cow"): the QN probes
one term per round and prunes candidates until the owner of Q1 can be
notified.)
30
Term Dialogue
  • Pros
  • Good if query terms overlap heavily
  • Worst case still outperforms the send queries
    strategy (ignoring packet headers)
  • Cons
  • Potentially high latency
  • Another issue: packet overhead
  • QN can also ask for several terms at a time to
    reduce the number of rounds
  • Optimal term choice seems hard
  • Can choose a greedy strategy
  • We evaluate the simple heuristic
  • Future work: detailed evaluation of possible
    heuristics

31
Bloom Filter
  • DN sends a Bloom filter of the keywords in the
    document to the QN
  • QN uses this to prune the set of candidate
    queries Q
  • The Bloom filter can have false positives, however
  • Q is pruned to Q′ after probing the Bloom filter
  • Now initiate the term-by-term dialogue on Q′
    (sketched below)
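
A toy Java sketch of the QN-side pruning, using a hand-rolled k-hash bit
set rather than a production Bloom filter; the sizes, hash scheme, and
names are illustrative:

```java
import java.util.*;

// Bloom-filter pre-filter: the DN ships a compact filter of its keywords;
// the QN keeps only queries whose every keyword might be in the document.
class BloomPrune {
    static final int BITS = 1 << 12, HASHES = 3;

    static BitSet filterOf(Collection<String> words) {
        BitSet bits = new BitSet(BITS);
        for (String w : words)
            for (int i = 0; i < HASHES; i++)
                bits.set(Math.floorMod((w + i).hashCode(), BITS));
        return bits;
    }

    static boolean maybeContains(BitSet bits, String w) {
        for (int i = 0; i < HASHES; i++)
            if (!bits.get(Math.floorMod((w + i).hashCode(), BITS))) return false;
        return true;   // may be a false positive
    }

    // Survivors (Q') still need the term dialogue to weed out false positives.
    static List<Set<String>> prune(List<Set<String>> queries, BitSet docFilter) {
        List<Set<String>> survivors = new ArrayList<>();
        for (Set<String> q : queries)
            if (q.stream().allMatch(w -> maybeContains(docFilter, w))) survivors.add(q);
        return survivors;
    }
}
```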

32
Bloom Filter
(Diagram: the Bloom filter probe eliminates most of the queries
registered under Cat; the brief dialogue that follows confirms the
remaining match, and the owner of query 1 is notified.)
33
Bloom Filter
  • Pros
  • Can potentially eliminate many queries with just
    the bloom filter
  • Cons
  • Fixed overhead to every node

34
Note
  • None of these strategies works for immediate
    queries
  • Documents are like OR queries: they would have to
    be registered on every keyword
  • Documents are typically too large to replicate on
    each keyword

35
Outline
  • Introduction
  • System architecture
  • Alternative strategies
  • Performance evaluation
  • Complex queries

36
Simulation
  • Run over 10 (simulated) nodes
  • Evaluated each strategy with 50000 queries
  • Queries generated from words in about 500
    documents (from Citeseer, LA Times)
  • Bandwidth costs measured over the insertion of
    the next 500 documents

37
Query Generation
  • General process
  • Feed in n representative documents
  • Extract document term frequencies from these
    documents
  • Model frequency of query terms after frequency in
    documents

38
Cost Model
  • Send Queries
  • Cost = size of all queries (keywords and
    metadata) in the inverted list
  • Send Document
  • Cost = size of the document
  • Term Dialogue
  • Cost = C × K
  • Bloom Filter
  • Cost = size of filter + C × K
    (a toy comparison follows this list)
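
The model reduces to back-of-the-envelope arithmetic; this Java sketch is
illustrative, and the "+" in the Bloom filter line is our reading of the
slide:

```java
// Rough per-document bandwidth costs under the deck's cost model.
// c is the per-term dialogue cost and k the number of terms probed.
class CostModel {
    static long sendQueries(long bytesPerQuery, long queriesShipped) {
        return bytesPerQuery * queriesShipped;   // ship stored query lists
    }
    static long sendDocument(long docBytes, long nodesContacted) {
        return docBytes * nodesContacted;        // ship the whole document
    }
    static long termDialogue(long c, long k) {
        return c * k;                            // Cost = C x K
    }
    static long bloomFilter(long filterBytes, long c, long k) {
        return filterBytes + c * k;              // filter + residual dialogue
    }
}
```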

39
Which strategy works best?
40
Which strategy works best?
41
Query Term Frequency
  • Problem
  • Will continuous query term frequencies be the
    same as document term frequencies?
  • Our approach: skew the document frequency
    distribution
  • Inverse skew, uniform, unchanged, high skew
  • Skew has a large impact on performance

42
Results (Unchanged skew)
43
Results (Uniform distribution)
44
Results (Inverse distribution)
45
Other Experiments
  • Notification type
  • Does the document node contact one node for every
    distinct term in the document?

46
Other Experiments
  • Notification type

(Diagram: a document with terms Cat, Pig, Sheep; the DN must decide how
to notify the nodes, A and B, responsible for those terms.)
47
Other Experiments
  • Notification type

(Diagram: the same document; terms mapping to the same node are clustered
so that each node receives a single combined notification.)
48
Other Experiments
  • Notification Type
  • Naïve
  • One notification per distinct keyword
  • Can be very expensive
  • Clustered: determine all distinct nodes in the naïve
    strategy, and cluster notifications (sketched after
    this list)
  • One notification per distinct node in naïve
  • Clustering stage can have high latency
  • Broadcast: send to all nodes in the DHT
  • One notification per node
  • Fixed cost, good if the system is small
  • Only works for Send Document
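
A minimal Java sketch of the clustered strategy, with nodeFor standing in
for the DHT's term-to-node mapping (names are illustrative):

```java
import java.util.*;

// Clustered notification: resolve each distinct term in the document to
// the node responsible for it, then send one message per distinct node.
class ClusteredNotify {
    static Map<String, List<String>> cluster(Set<String> docTerms,
                                             Map<String, String> nodeFor) {
        Map<String, List<String>> byNode = new HashMap<>();
        for (String term : docTerms)
            byNode.computeIfAbsent(nodeFor.get(term), n -> new ArrayList<>())
                  .add(term);
        return byNode;   // one notification per node, carrying all its terms
    }
}
```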

49
Other Experiments
  • Number of nodes
  • Average length of query
  • Affects the probability that a query is satisfied
  • Longer (but fewer) queries are better for term
    dialogue and Bloom filter

50
Outline
  • Introduction
  • System architecture
  • Alternative strategies
  • Performance evaluation
  • Complex queries

51
Complex Queries
  • OR queries
  • (Author=Joe OR Author=Beth) AND Year=1999
  • All papers from 1999 written by Joe or Beth
  • Subqueries
  • Year=1999 AND Author(Author=Joe AND Year=1998)
  • All papers from 1999 written by co-authors of Joe,
    where at least one co-authored paper was written
    in 1998
  • Queries with hard predicates
  • e.g., range queries

52
OR Queries
  • OR queries registered on all terms
  • Otherwise, we would miss some results

53
OR Queries
Query: "cat OR dog"
(Diagram: the query is registered under Cat at node A; a document
containing "cat" arrives and the owner is notified.)
54
OR Queries
Query: "cat OR dog"
(Diagram: a document containing neither term arrives; no action is taken
at either node.)
55
OR Queries
Query: "cat OR dog"
(Diagram: with the query registered under both Cat (node A) and Dog
(node B), a document containing both terms triggers a notification from
each node, so the owner is notified twice. The rewrite on the next slide
removes these duplicates.)
56
OR Queries
  • To prevent duplicates, we can rewrite the query
  • Q = T1 OR T2 OR T3 is rewritten into
  • Q1 = T1 (registered on T1)
  • Q2 = ~T1 & T2 (registered on T2)
  • Q3 = ~T1 & ~T2 & T3 (registered on T3)
  • Works well since queries tend to have few terms
    (a sketch follows)
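
A Java sketch of the rewrite. The negate-all-earlier-terms reading is our
reconstruction of the slide (the negation symbols appear to have been
lost in transcription), and the names are illustrative:

```java
import java.util.*;

// Rewrite Q = T1 | T2 | ... | Tn into n subqueries: the i-th is
// registered on Ti and requires Ti present with all earlier terms
// absent, so any matching document fires exactly one subquery.
class OrRewrite {
    record SubQuery(String registeredOn, String positive, List<String> negated) {}

    static List<SubQuery> rewrite(List<String> terms) {
        List<SubQuery> out = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++)
            out.add(new SubQuery(terms.get(i), terms.get(i),
                                 List.copyOf(terms.subList(0, i))));
        return out;
    }

    static boolean matches(SubQuery q, Set<String> docTerms) {
        return docTerms.contains(q.positive())
            && q.negated().stream().noneMatch(docTerms::contains);
    }
}
```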

57
OR Queries
Query: "cat OR dog"
(Diagram: after the rewrite, a matching document triggers exactly one
notification; here the node for Dog notifies the owner while the node for
Cat takes no action.)
58
Subqueries
  • Example
  • Author(Author=Joe) AND Year=1999
  • Translation: find all papers from 1999 written
    by co-authors of Joe

59
Subqueries: immediate
Example: Author(Author=Joe) AND Year=1999
  • Execute the subquery to get D, the set of all
    documents where Author=Joe
  • Extract D′, the set of all authors of documents
    in D
  • e.g., D′ = {Sue, Beth, Scott}
  • Translate the subquery into an OR query
  • Author=Sue OR Author=Beth OR Author=Scott
  • Execute the new query (sketched below)
  • (Author=Sue OR Author=Beth OR Author=Scott)
    AND Year=1999
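
A Java sketch of this translation on a toy corpus; the corpus scan stands
in for the immediate-query machinery above, and the record and method
names are ours:

```java
import java.util.*;
import java.util.stream.*;

// Translating Author(Author=Joe) AND Year=1999: run the subquery,
// collect the co-authors D', and expand them into an OR over authors.
class SubqueryTranslate {
    record Doc(Set<String> authors, int year) {}

    static Set<String> coauthorsOf(String name, List<Doc> corpus) {
        return corpus.stream()
            .filter(d -> d.authors().contains(name))   // subquery: Author=Joe -> D
            .flatMap(d -> d.authors().stream())        // D': all authors in D
            .filter(a -> !a.equals(name))
            .collect(Collectors.toSet());
    }

    static List<Doc> outerQuery(String name, int year, List<Doc> corpus) {
        Set<String> coauthors = coauthorsOf(name, corpus);
        // (Author=Sue OR Author=Beth OR ...) AND Year=1999
        return corpus.stream()
            .filter(d -> d.year() == year)
            .filter(d -> d.authors().stream().anyMatch(coauthors::contains))
            .collect(Collectors.toList());
    }
}
```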

60
Subqueries: continuous
  • Observe a new answer to the query if
  • a new paper is written by an existing co-author of
    Joe, or
  • Joe writes a paper with a new co-author
  • Register two queries
  • The outer query with the translated subquery
  • The subquery itself

61
Subqueries: continuous
Example: Author(Author=Joe) AND Year=1999
(Diagram: the original subquery, Author=Joe, is registered under "Joe" at
node A; the outer query with the translated subquery, Year=1999 AND
(Author=Sue OR Author=Scott OR Author=Beth), is registered under "1999".)
62
Subqueries: continuous
Example: Author(Author=Joe) AND Year=1999
(Diagram: a new 1999 paper by one of the known co-authors arrives at the
node responsible for "1999", and the owner is notified.)
63
Subqueries: continuous
Example: Author(Author=Joe) AND Year=1999
(Diagram: the subquery node learns that Bob is a new co-author of Joe and
updates the translated outer query; Doc 18, with Author=Bob and
Year=1999, then matches, and the owner is notified with Doc 18.)
64
Subqueries
  • Can easily extend in a similar fashion to any
    level of nesting
  • Note: we now need to maintain document indices as
    well!
  • Old documents can now be returned
  • Note: subqueries can have multiple keywords

65
Limits on Expressiveness
  • Negative predicates
  • ~cat
  • Easy to process if there exist other positive
    predicates in the query, difficult otherwise
  • Range predicates
  • year > 2000
  • Can plug in other solutions (e.g., PHTs)
  • Arbitrary conditions
  • Papers with more than 5 co-authors

66
Dealing with hard predicates
  • Register on the easier predicates, if any exist
  • e.g., Q = ~K1 & K2: register at K2
  • If all predicates are hard, then store the query on
    a specialized node and process every document at
    that node
  • e.g., Q = ~K1 & ~K2 (a sketch follows)
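
A tiny Java sketch of this placement rule; the hard/easy flag and the
fallback node name are illustrative assumptions:

```java
import java.util.*;

// Placement of a continuous query: register it on any easy (positive
// keyword) predicate; if every predicate is hard, fall back to a
// designated node that inspects all inserted documents.
class Placement {
    record Predicate(String term, boolean hard) {}

    static String registrationPoint(List<Predicate> preds) {
        return preds.stream()
                    .filter(p -> !p.hard())
                    .map(Predicate::term)
                    .findFirst()
                    .orElse("SPECIALIZED_NODE");   // all predicates hard
    }
}
```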

67
Implementation
  • Implementation in Java
  • Will soon be deployed on Planetlab as a public
    service

68
Future Work
  • Relevance feedback
  • Using semantic vectors to reduce bandwidth
  • by storing commonly co-occurring terms on the same
    node
  • by inserting related documents in batches

69
Demo
  • Document 1: AUTHOR=Jake, AUTHOR=John, YEAR=1999
  • Document 2: AUTHOR=James, YEAR=1998
  • Query 1: AUTHOR=John
  • Query 2: AUTHOR=John & YEAR=1999 & TEXT=word3
  • Document 3: AUTHOR=John, AUTHOR=Johnson, YEAR=1999
  • Document 4: AUTHOR=James, AUTHOR=Johnson,
    YEAR=1999
  • Document 5: AUTHOR=Jake, YEAR=1999
  • Query 3: OR((AUTHOR=John & YEAR=1999),
    AUTHOR=Jake)
  • Document 6: AUTHOR=Jake, YEAR=1992
  • Document 7: AUTHOR=Jason, YEAR=1999
  • Query 4: AUTHOR(AUTHOR=John)
  • Document 8: AUTHOR=Jason, AUTHOR=John, YEAR=1998
  • Document 9: AUTHOR=John, AUTHOR=Johnson, YEAR=1998