Title: SmartSeer: Continuous Queries over Citeseer
1SmartSeer: Continuous Queries over Citeseer
Jayanthkumar Kannan, Beverly Yang, Scott
Shenker, Sujata Banerjee, Sung Ju Lee, Puneet
Sharma, Sujoy Basu
2Motivation
- Problem
- Increasing number of research papers
- No easy way for researchers to keep themselves informed of new papers
- Increasingly a problem as online preprint libraries become more popular
3Our Solution: SmartSeer
- Like CiteSeer
- Allow users to register continuous queries
- Immediate query: one-time results
- Continuous query: notify user of incremental results (news alerts etc.)
- Polling versus notification
- Makes sense for scientific publications: user can keep abreast of pertinent papers
- Event notification systems becoming more common (news alerts, web alerts)
- Also provides immediate query capabilities
4SmartSeer Requirements
- Rich Continuous Queries
- Donated Infrastructure
5Rich Continuous Queries
- Citeseer documents have rich structured metadata
- citations
- authors
- etc.
- Allow queries to exploit this structure
- Give me all papers that cite me or my co-authors
- E.g., nested queries, joins
- Certain SQL constructs are expensive, however (more later)
6Sample Queries
- Text matching
- TEXT=database AND TEXT=networks
- Vanity: Who is citing me?
- CITES(AUTHOR=xxx)
- Curious Joe
- AUTHOR(AUTHOR=Joe)
- Bridges between communities
- AUTHOR(CONF=Sigcomm) AND AUTHOR(CONF=Sigmod)
7Donated Infrastructure
- Single server clearly inadequate to handle eventual load
- Costly to buy and maintain large number of servers and bandwidth
- Google does it, but how much does it cost them?
- Instead, use individual servers and bandwidth donated by different organizations
- Loosely maintained
- Desire a self-organizing architecture
- The success of PlanetLab is encouraging
8Related Work
- Telegraph (and others)
- Allows fully general SQL queries
- Works on a tightly coupled distributed system
- Event notification systems (Scribe etc)
- Only simple event semantics
- PIER
- Only immediate queries on DHT
9Outline
- Introduction
- System architecture
- Alternative strategies
- Performance evaluation
- Complex queries
10Which distributed paradigm?
- Query Replication
- Document Replication
- Rendezvous Approach
11Query Replication
- Query Replication
- Every query stored on all nodes
- New document sent to any one node
- Does not scale with number of queries
12Document Replication
- Document Replication
- Queries randomly partitioned among all nodes
- New document sent to all nodes
- Partitioning-by-ID (Gnutella)
- Does not scale
[Figure: nodes A, B, and C each receive document D; query Q is stored at node A]
13Rendezvous Approach
- Keyword space partitioned among all nodes
- Provided by a DHT
- Essentially distributed hashing
- Perform lookup in O(log n) hops
- Rendezvous mechanism ensures related queries and documents meet
- Are queries sent to the documents or vice versa?
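A minimal sketch of the rendezvous idea, assuming plain modulo hashing in place of a real DHT (so it does not model O(log n) routing). The function name `node_for_keyword` and the node labels are hypothetical; the point is only that a query and a document mentioning the same keyword are sent to the same node.

```python
import hashlib

def node_for_keyword(keyword, nodes):
    """Map a keyword to the node responsible for it.

    Assumption: simple modulo hashing stands in for the DHT lookup.
    """
    digest = hashlib.sha1(keyword.lower().encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

NODES = ["A", "B", "C", "D"]  # hypothetical node labels

# A query on "cat" and a document containing "Cat" rendezvous at one node.
query_home = node_for_keyword("cat", NODES)
doc_home = node_for_keyword("Cat", NODES)
```

Whether the query travels to the document's node or the document's metadata travels to the query's node is the design question the rest of the deck explores.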
14Primer: Immediate Queries
- Inverted list approach
- Node responsible for keyword W stores the IDs of documents containing that word
- Query Q = W1 AND W2 ...
- Fetch lists of IDs stored under W1, W2, ...
- Intersect these lists
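The inverted-list steps can be sketched as follows. This is an illustrative single-process version, assuming the index is an in-memory dictionary rather than lists fetched from remote nodes; the posting lists reuse the Cat/Dog/Cow/Bat example from the deck's Immediate Queries figure, and `immediate_query` is a hypothetical name.

```python
# Posting lists: keyword -> IDs of documents containing that keyword.
inverted_index = {
    "cat": {1, 4, 7, 19, 20},
    "dog": {1, 5, 7, 26},
    "cow": {2, 4, 8, 18},
    "bat": {1, 8, 31},
}

def immediate_query(index, words):
    """Answer an AND query by intersecting the posting lists of its words."""
    lists = [index.get(w, set()) for w in words]
    if not lists:
        return set()
    return set.intersection(*lists)

result = immediate_query(inverted_index, ["cat", "dog"])
```

In the distributed setting each list lives on a different node, so the cost is dominated by shipping lists between nodes before intersecting them.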
15Immediate Queries
[Figure: nodes A-D hold the inverted lists Cat: 1,4,7,19,20; Dog: 1,5,7,26; Cow: 2,4,8,18; Bat: 1,8,31. Outcome: send result docs 1, 7.]
16Immediate Queries
- Rough Model
- Bandwidth: C x N x I
- DHTs provide lookups in O(log n) hops
- Shipping inverted lists can be expensive
- Many existing techniques to optimize
- See MIT paper (On the Feasibility of Peer-to-Peer Web Searching and Indexing)
17Continuous Queries
- Interchange the role of documents and queries
- Consider AND queries for now
- Query Q = W1 AND W2 stored at one of the words W1, W2
- Most selective keyword
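A small sketch of registering an AND query under a single keyword. It assumes "most selective" means the keyword with the lowest document frequency, which the slides do not spell out; `register_query`, `doc_freq`, and `query_lists` are hypothetical names.

```python
from collections import defaultdict

def register_query(query_id, words, doc_freq, query_lists):
    """Store an AND query under its rarest word and return that word.

    Assumption: selectivity is approximated by document frequency;
    ties are broken alphabetically to keep the choice deterministic.
    """
    home = min(words, key=lambda w: (doc_freq.get(w, 0), w))
    query_lists[home].append((query_id, frozenset(words)))
    return home

doc_freq = {"cat": 5, "dog": 4, "horse": 2}  # illustrative counts
query_lists = defaultdict(list)              # keyword -> registered queries
home = register_query("q1", ["cat", "horse"], doc_freq, query_lists)
```

Registering on the rarest word means the query is examined only when a document containing that word arrives.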
18Continuous Queries
[Figure: nodes A-D hold the inverted lists Cat: 1,4,7,19,20; Dog: 1,5,7,26; Cow: 2,4,8,18; Bat: 1,8,31]
19Continuous Queries
- Document Insertion
- Fetch query list stored under each word in the document
- Notify owners of satisfied queries
- Basic "Send Query" strategy
- Note: we still need document indices
- For running certain complex queries
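The document-insertion step can be sketched as below, assuming AND queries stored as term sets and an in-memory `query_lists` dictionary standing in for the per-keyword nodes. The query contents mirror the Cat and Dog query lists used in the deck's figures; the query IDs are hypothetical.

```python
def insert_document(doc_words, query_lists):
    """Return IDs of AND queries satisfied by a newly inserted document."""
    words = set(doc_words)
    satisfied = []
    for word in sorted(words):
        # Fetch the query list stored under this word.
        for query_id, terms in query_lists.get(word, []):
            # An AND query is satisfied when all of its terms occur.
            if terms <= words:
                satisfied.append(query_id)
    return satisfied

query_lists = {
    "cat": [("q1", frozenset({"cat", "dog"})),
            ("q2", frozenset({"cat", "horse", "dog"})),
            ("q3", frozenset({"cat", "horse", "cow"}))],
    "dog": [("q4", frozenset({"dog", "sheep"})),
            ("q5", frozenset({"dog", "cow", "sheep"})),
            ("q6", frozenset({"dog", "lamb", "sheep"}))],
}
to_notify = insert_document(["cat", "dog", "bird"], query_lists)
```

Because each query is stored under exactly one word, a satisfied query is found once and its owner is notified once.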
20Continuous Queries
[Figure: query lists shown at nodes A-D: Cat (dog; horse dog; horse cow), Dog (sheep; cow sheep; lamb sheep). Outcome: notify owner of query 1.]
21Continuous Queries
[Figure: query lists shown at nodes A-D: Cat (dog; horse dog; horse cow), Dog (sheep; cow sheep; lamb sheep). Outcome: no action.]
22Mimic immediate query strategy
- Rough Model
- Bandwidth: C x Q x R
- Q corresponds to N, R corresponds to I
- Problems with this method
- Does not scale with document corpus size
- Bandwidth linear with N/Q
- Key Metric
- Bandwidth
- Not latency
23Outline
- Introduction
- System architecture
- Join strategies
- Performance evaluation
- Complex queries
24Join strategies
- Must somehow join document metadata and query metadata
- Document Node (DN)
- Node at which document currently resides
- e.g., document insertion node
- For each term in the document, contacts the query node responsible for the term
- Query Node (QN) for term T
- Manages the list of queries registered on a particular term T
25Send Document
- DN sends entire document to QN
[Figure: query lists shown at nodes A-D: Cat (cow; horse dog; horse cow), Dog (sheep; horse sheep; lamb sheep). Outcome: no action.]
26Send Document
- Pros
- Simple
- Low latency
- Good if shipping document
- Documents are small
- Many queries
- Few nodes
- Cons
- Potentially (usually) very expensive
27Term Dialogue
- Let Q denote queries stored at QN
- Q = {Q1, Q2, ...}
- Let K denote distinct keywords in Q
- K = {K1, K2, K3}
- If keyword K does not occur in document D, all queries in Q containing K can be eliminated
28Term Dialogue
- QN chooses some term T from K
- DN replies with yes/no
- QN then prunes the set of candidate queries
- If the reply is no, delete all queries from Q that contain T
- Delete all keywords from K that no longer appear in queries in Q
- Repeat
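The dialogue loop can be simulated in one process as below. This is a sketch under stated assumptions: the DN's yes/no reply is modelled as a set membership test, and the next term is chosen alphabetically as a placeholder, not by the heuristic the authors evaluate. `term_dialogue` and the query IDs are hypothetical.

```python
def term_dialogue(queries, doc_words):
    """Simulate the QN/DN term dialogue.

    queries: dict of query id -> set of terms still to be checked.
    Returns (sorted ids of satisfied queries, number of rounds).
    """
    doc = set(doc_words)
    candidates = dict(queries)
    confirmed = set()  # terms the DN has answered "yes" to
    rounds = 0
    while True:
        open_terms = {t for terms in candidates.values() for t in terms}
        remaining = sorted(open_terms - confirmed)
        if not remaining:
            break
        term = remaining[0]  # placeholder choice of the next term
        rounds += 1
        if term in doc:      # DN replies "yes"
            confirmed.add(term)
        else:                # DN replies "no": prune queries containing it
            candidates = {q: t for q, t in candidates.items()
                          if term not in t}
    return sorted(candidates), rounds

# Queries registered under "cat", listing their remaining terms.
cat_queries = {"q1": {"dog"}, "q2": {"horse", "dog"}, "q3": {"horse", "cow"}}
satisfied, rounds = term_dialogue(cat_queries, {"cat", "dog"})
```

Each round costs one small message pair, so the number of rounds is the latency price paid for not shipping the whole document.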
29Term Dialogue
[Figure: singleton heuristic. Query list for Cat (dog; horse dog; horse cow), nodes A and B. Outcome: notify owner of Q1.]
30Term Dialogue
- Pros
- Good if query terms overlap heavily
- Worst case outperforms send queries strategy (ignoring packet headers)
- Cons
- Potentially high latency
- Another issue: packet overhead
- QN can also ask for several terms at one time to reduce the number of rounds
- Optimal: problem seems hard
- Can choose greedy strategy
- We evaluate the simple heuristic
- Future work: detailed evaluation of possible heuristics
31Bloom Filter
- DN sends Bloom filter of keywords in document to QN
- QN uses this to prune the set of candidate queries Q
- Bloom filter can have false positives, however
- Q pruned to Q' after probing Bloom filter
- Now initiate term-by-term dialogue on Q'
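A minimal sketch of Bloom-filter pruning at the QN. The filter size, hash count, and SHA-256-based hashing are illustrative assumptions, not parameters from the slides; `BloomFilter` and `prune` are hypothetical names. The survivors may still include false positives, which is why a term dialogue follows.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter; sizes here are illustrative assumptions."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one integer

    def _positions(self, word):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, word):
        for p in self._positions(word):
            self.bits |= 1 << p

    def might_contain(self, word):
        # True may be a false positive; False is always correct.
        return all((self.bits >> p) & 1 for p in self._positions(word))

def prune(queries, bloom):
    """Keep only queries whose terms all pass the Bloom filter probe."""
    return {qid: terms for qid, terms in queries.items()
            if all(bloom.might_contain(t) for t in terms)}

bloom = BloomFilter()
for word in ["cat", "dog"]:  # keywords of the inserted document
    bloom.add(word)
cat_queries = {"q1": {"dog"}, "q2": {"horse", "dog"}, "q3": {"horse", "cow"}}
survivors = prune(cat_queries, bloom)
```

A query whose terms all occur in the document is never pruned, so correctness depends only on the follow-up dialogue removing false positives.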
32Bloom Filter
[Figure: query list for Cat (dog; horse dog; horse cow), nodes A and B. Outcome: notify owner of query 1.]
33Bloom Filter
- Pros
- Can potentially eliminate many queries with just the Bloom filter
- Cons
- Fixed overhead to every node
34Note
- None of these strategies work for immediate queries
- Documents are like OR queries: have to be registered on every keyword
- Documents typically too large to replicate on each keyword
35Outline
- Introduction
- System architecture
- Alternative strategies
- Performance evaluation
- Complex queries
36Simulation
- Run over 10 (simulated) nodes
- Evaluated each strategy with 50000 queries
- Queries generated from words in about 500 documents (from Citeseer, LA Times)
- Bandwidth costs measured over the insertion of the next 500 documents
37Query Generation
- General process
- Feed in n representative documents
- Extract document term frequencies from these documents
- Model frequency of query terms after frequency in documents
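One way the generation process could look in code, as a hedged sketch: term frequencies are counted over tokenized documents and query terms are drawn with probability proportional to those counts. The tokenization, the query length, and the names `term_frequencies` and `generate_query` are assumptions for illustration, not the authors' generator.

```python
import random
from collections import Counter

def term_frequencies(documents):
    """Count how often each term occurs across the sample documents."""
    return Counter(word for doc in documents for word in doc)

def generate_query(freqs, length, rng):
    """Draw distinct query terms, weighted by document term frequency."""
    terms = sorted(freqs)
    weights = [freqs[t] for t in terms]
    chosen = set()
    while len(chosen) < min(length, len(terms)):
        chosen.add(rng.choices(terms, weights=weights, k=1)[0])
    return chosen

docs = [["cat", "dog", "cat"], ["dog", "horse"], ["cat", "cow"]]
freqs = term_frequencies(docs)
query = generate_query(freqs, 2, random.Random(0))
```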
38Cost Model
- Send Queries
- Cost: size of all queries (keywords and metadata) in inverted list
- Send Document
- Cost: size of document
- Term Dialogue
- Cost: C x K
- Bloom Filter
- Cost: size of filter + C x K
39Which strategy works best?
40Which strategy works best?
41Query Term Frequency
- Problem
- Will continuous query term frequencies be the same as document frequencies?
- Our approach: skew the document frequency distribution
- Inverse skew, uniform, unchanged, high skew
- Skew has large impact on performance
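The slides do not say how the skew was produced. One plausible way, shown here purely as an assumption, is to raise each document frequency to an exponent before sampling: 1 leaves the distribution unchanged, 0 makes it uniform, a negative value inverts it, and a value above 1 exaggerates it. `skew_weights` is a hypothetical helper.

```python
def skew_weights(freqs, exponent):
    """Re-weight term frequencies by raising them to a power.

    Assumed mapping: exponent 1 = unchanged, 0 = uniform,
    negative = inverse skew, greater than 1 = high skew.
    """
    return {term: count ** exponent for term, count in freqs.items()}

freqs = {"cat": 4, "dog": 2, "cow": 1}  # illustrative document frequencies
uniform = skew_weights(freqs, 0)
inverse = skew_weights(freqs, -1)
high = skew_weights(freqs, 2)
```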
42Results (Unchanged skew)
43Results (Uniform distribution)
44Results (Inverse distribution)
[Figure: results plot; axis label: Inv. List Size]
45Other Experiments
- Notification type
- Does the document node contact one node for every distinct term in document?
46Other Experiments
[Figure: terms Cat, Pig, and Sheep; nodes A and B]
47Other Experiments
[Figure: terms Cat, Pig, and Sheep; nodes A and B]
48Other Experiments
- Notification Type
- Naïve
- One notification per distinct keyword
- Can be very expensive
- Clustered: determine all distinct nodes in naïve strategy, and cluster notifications
- One notification per distinct node in naïve
- Clustering stage can have high latency
- Broadcast: send out to all nodes in the DHT
- One notification per node
- Fixed cost, good if system is small
- Only works for Send Document
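The three notification types differ only in how many messages a document insertion triggers. The sketch below counts them, assuming modulo hashing as a stand-in for the DHT's keyword-to-node mapping; `node_for` and `notification_counts` are hypothetical names, and the Cat/Pig/Sheep terms with nodes A and B follow the deck's figure.

```python
import hashlib

def node_for(term, nodes):
    """Assumed keyword-to-node mapping (modulo hashing, not a real DHT)."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

def notification_counts(doc_terms, nodes):
    """Messages sent per document under each notification type."""
    distinct_terms = set(doc_terms)
    target_nodes = {node_for(t, nodes) for t in distinct_terms}
    return {
        "naive": len(distinct_terms),    # one per distinct keyword
        "clustered": len(target_nodes),  # one per distinct target node
        "broadcast": len(nodes),         # one per node in the DHT
    }

counts = notification_counts(["cat", "pig", "sheep", "cat"], ["A", "B"])
```

With few nodes and many terms per document, clustering or broadcast sends far fewer messages than the naïve scheme.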
49Other Experiments
- Number of nodes
- Average length of query
- Affects probability that query is satisfied
- Longer (but fewer) queries better for term dialogue and Bloom filter
50Outline
- Introduction
- System architecture
- Alternative strategies
- Performance evaluation
- Complex queries
51Complex Queries
- OR queries
- (Author=Joe or Author=Beth) and Year=1999
- All papers from 1999 written by Joe or Beth
- Subqueries
- Year=1999 and Author(Author=Joe and Year=1998)
- All papers from 1999 written by co-authors of Joe, where at least one co-authored paper was written in 1998
- Queries with hard predicates
- e.g., range queries
52OR Queries
- OR queries registered on all terms
- Otherwise, we would miss some results
53OR Queries
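[Figure: query "cat OR dog" registered over nodes A-D. Outcome: notify owner.]
54OR Queries
[Figure: query "cat OR dog" registered over nodes A-D. Outcomes: no action; no action.]
55OR Queries
[Figure: query "cat OR dog" registered over nodes A-D. Outcomes: notify owner; notify owner.]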
56OR Queries
- To prevent duplicates, can rewrite query
- Q = T1 OR T2 OR T3 rewritten into
- Q1 = T1
- Q2 = NOT T1 AND T2 (registered on T2)
- Q3 = NOT T1 AND NOT T2 AND T3 (registered on T3)
- Works well since queries tend to have few terms
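A sketch of the duplicate-free rewrite, assuming each sub-query requires its own term and forbids every earlier term of the OR, so at most one sub-query can fire per document. `rewrite_or_query` and `fires` are hypothetical names.

```python
def rewrite_or_query(terms):
    """Rewrite T1 OR T2 OR ... into sub-queries that cannot overlap.

    Returns (home_term, forbidden_terms) pairs: the sub-query is
    registered on home_term and requires all forbidden terms be absent.
    """
    return [(term, tuple(terms[:i])) for i, term in enumerate(terms)]

def fires(sub_query, doc_words):
    """True if the document satisfies this rewritten sub-query."""
    home, forbidden = sub_query
    return home in doc_words and not any(f in doc_words for f in forbidden)

subs = rewrite_or_query(["cat", "dog"])
hits_both = [s for s in subs if fires(s, {"cat", "dog"})]
hits_dog = [s for s in subs if fires(s, {"dog", "sheep"})]
```

A document containing both terms now triggers only the sub-query registered on the first term, so the owner receives a single notification.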
57OR Queries
[Figure: rewritten query "cat OR dog" over nodes A-D. Outcomes: no action; notify owner.]
58Subqueries
- Example
- Author(Author=Joe) and Year=1999
- Translation: find all papers from 1999 written by co-authors of Joe
59Subqueries: immediate
Example: Author(Author=Joe) and Year=1999
- Execute subquery, get D, the set of all documents where Author=Joe
- Extract D', all authors from documents in D
- e.g., D' = {Sue, Beth, Scott}
- Translate subquery into an OR query
- Author=Sue OR Author=Beth OR Author=Scott
- Execute new query
- (Author=Sue OR Author=Beth OR Author=Scott) AND Year=1999
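The immediate subquery steps can be traced on a toy corpus. The documents and the function name `coauthor_query` are invented for illustration, and the sketch assumes Joe himself is left out of the extracted author set, as in the slide's Sue, Beth, Scott example.

```python
# Toy corpus: doc id -> metadata (illustrative data, not from Citeseer).
docs = {
    1: {"authors": {"Joe", "Sue"}, "year": 1998},
    2: {"authors": {"Joe", "Beth", "Scott"}, "year": 1997},
    3: {"authors": {"Sue"}, "year": 1999},
    4: {"authors": {"Bob"}, "year": 1999},
}

def coauthor_query(docs, author, year):
    """Evaluate Author(Author=author) and Year=year as an immediate query."""
    # Step 1: execute the subquery (all documents written by `author`).
    d = [doc for doc in docs.values() if author in doc["authors"]]
    # Step 2: extract the authors of those documents.
    coauthors = set().union(*(doc["authors"] for doc in d)) - {author}
    # Step 3: run the translated OR query together with the year predicate.
    return sorted(doc_id for doc_id, doc in docs.items()
                  if doc["year"] == year and doc["authors"] & coauthors)

result = coauthor_query(docs, "Joe", 1999)
```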
60Subqueries: continuous
- Observe new answer to query if
- New paper written by an existing co-author of Joe
- Joe writes a paper with a new co-author
- Register two queries
- Outer query with translated subquery
- Subquery
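Both triggering cases can be sketched with a small stateful watcher. This is a single-process illustration, assuming a local document index in place of the distributed one; `CoauthorWatch` is a hypothetical name, and the Doc 18 / Bob example follows the deck's figure.

```python
class CoauthorWatch:
    """Continuous version of Author(Author=author) and Year=year."""

    def __init__(self, author, year):
        self.author = author
        self.year = year
        self.coauthors = set()  # current translation of the subquery
        self.index = {}         # doc id -> (authors, year), for old documents

    def insert(self, doc_id, authors, year):
        """Process a new document; return doc ids to report to the owner."""
        authors = set(authors)
        self.index[doc_id] = (authors, year)
        hits = set()
        if self.author in authors:
            # Subquery fired: the author may have new co-authors.
            new = authors - {self.author} - self.coauthors
            self.coauthors |= new
            # Old documents by the new co-authors can now match.
            hits |= {d for d, (a, y) in self.index.items()
                     if y == self.year and a & new}
        if year == self.year and authors & self.coauthors:
            # Outer query fired: paper by an existing co-author.
            hits.add(doc_id)
        return sorted(hits)

watch = CoauthorWatch("Joe", 1999)
first = watch.insert(18, ["Bob"], 1999)          # Bob is not yet a co-author
second = watch.insert(19, ["Joe", "Bob"], 2000)  # Joe gains co-author Bob
third = watch.insert(20, ["Bob"], 1999)          # existing co-author publishes
```

The second insertion returns an old document, which is why the document index has to be kept alongside the registered queries.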
61Subqueries: continuous
Example: Author(Author=Joe) and Year=1999
[Figure: nodes A-D. Labels: original subquery; outer query with translated subquery, registered under 1999: (Year=1999) AND (Author=Sue OR Author=Scott OR Author=Beth).]
62Subqueries: continuous
Example: Author(Author=Joe) and Year=1999
[Figure: nodes A-D. Registered under 1999: (Year=1999) AND (Author=Sue OR Author=Scott OR Author=Beth). Outcome: notify owner.]
63Subqueries: continuous
Example: Author(Author=Joe) and Year=1999
[Figure: nodes A-D. Labels: Bob, Beth. Registered under 1999: (Year=1999) AND (Author=Sue OR Author=Scott OR Author=Beth). Doc 18: Author=Bob, Year=1999. Outcome: notify owner with Doc 18.]
64Subqueries
- Can easily extend in a similar fashion for any level of nesting
- Note: we now need to maintain document indices as well!
- Old documents can now be returned
- Note: subqueries can have multiple keywords
65Limits on Expressiveness
- Negative predicates
- NOT cat
- Easy to process if there exist other positive predicates in the query, difficult otherwise
- Range predicates
- year 2000
- Can plug in other solutions (e.g., PHTs)
- Arbitrary conditions
- Papers with more than 5 co-authors
66Dealing with hard predicates
- Register on easier predicates, if any exist
- e.g., Q K1 K2: register at K2
- If all predicates are hard, then store on specialized node, process every document at that node
- e.g., Q K1 K2
67Implementation
- Implementation in Java
- Will soon be deployed on PlanetLab as a public service
68Future Work
- Relevance Feedback
- Using semantic vector to reduce bandwidth
- by storing commonly co-occurring terms on the same node
- by inserting related documents in batches
69Demo
- Document 1: AUTHOR=Jake AUTHOR=John YEAR=1999
- Document 2: AUTHOR=James YEAR=1998
- Query 1: AUTHOR=John
- Query 2: AUTHOR=John YEAR=1999 TEXT=word3
- Document 3: AUTHOR=John AUTHOR=Johnson YEAR=1999
- Document 4: AUTHOR=James AUTHOR=Johnson YEAR=1999
- Document 5: AUTHOR=Jake YEAR=1999
- Query 3: OR (AUTHOR=John YEAR=1999) AUTHOR=Jake
- Document 6: AUTHOR=Jake YEAR=1992
- Document 7: AUTHOR=Jason YEAR=1999
- Query 4: AUTHOR(AUTHOR=John)
- Document 8: AUTHOR=Jason AUTHOR=John YEAR=1998
- Document 9: AUTHOR=John AUTHOR=Johnson YEAR=1998