SmartSeer: Continuous Queries over Citeseer

1
SmartSeer: Continuous Queries over Citeseer
Jayanthkumar Kannan, Beverly Yang, Scott
Shenker, Sujata Banerjee, Sung Ju Lee, Puneet
Sharma, Sujoy Basu
2
Motivation

Problem
Increasing number of research papers
No easy way for researchers to keep themselves
informed of new papers
Increasingly a problem as online preprint
libraries become more popular

3
Our Solution: SmartSeer

Like CiteSeer
Allow users to register continuous queries
Immediate query: one-time results
Continuous query: notify the user of incremental
results (news alerts, etc.)
Polling versus notification
Makes sense for scientific publications: users can
keep abreast of pertinent papers
Event notification systems are becoming more common
(news alerts, web alerts)
Also provides immediate query capabilities

4
SmartSeer Requirements

Rich Continuous Queries
Donated Infrastructure

5
Rich Continuous Queries

Citeseer documents have rich structured metadata
citations
authors
etc.
Allow queries to exploit this structure
"Give me all papers that cite me or my
co-authors"
E.g., nested queries, joins
Certain SQL constructs are expensive, however
(more later)

6
Sample Queries

Text matching
TEXT=database AND TEXT=networks
Vanity: who is citing me?
CITES(AUTHOR=xxx)
Curious Joe
AUTHOR(AUTHOR=Joe)
Bridges between communities
AUTHOR(CONF=Sigcomm) AND AUTHOR(CONF=Sigmod)

7
Donated Infrastructure

Single server clearly inadequate to handle
eventual load
Costly to buy and maintain large number of
servers and bandwidth
Google does it, but how much does it cost them?
Instead, use individual servers and bandwidth
donated by different organizations
Loosely maintained
Desire a self-organizing architecture
The success of PlanetLab is encouraging

8
Related Work

Telegraph (and others)
Allows fully general SQL queries
Works on a tightly coupled distributed system
Event notification systems (Scribe etc)
Only simple event semantics
PIER
Only immediate queries on DHT

9
Outline

Introduction
System architecture
Alternative strategies
Performance evaluation
Complex queries

10
Which distributed paradigm?

Query Replication
Document Replication
Rendezvous Approach

11
Query Replication

Every query stored on all nodes
New document sent to any one node
Does not scale with number of queries

12
Document Replication

Queries randomly partitioned among all nodes
New document sent to all nodes
Partitioning-by-ID (Gnutella)
Does not scale

[Figure: document D delivered to each of nodes A, B, and C; query Q held at one node]
13
Rendezvous Approach

Keyword space partitioned among all nodes
Provided by a DHT
Essentially distributed hashing
Perform lookup in O(log n) hops
Rendezvous mechanism ensures related queries and
documents meet
Are queries sent to the documents or vice versa?
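
A minimal sketch of the rendezvous idea, assuming a toy hash ring in place of a real DHT. The class `KeywordRing`, the helper `ring_position`, and the node names are illustrative and not from the talk. The sketch shows only the partitioning of the keyword space, not O(log n) routing.

```python
import hashlib
from bisect import bisect_right


def ring_position(key: str) -> int:
    """Map a string to a point on a 32-bit hash ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class KeywordRing:
    """Toy stand-in for a DHT: each keyword belongs to the next node clockwise."""

    def __init__(self, node_names):
        self._ring = sorted((ring_position(name), name) for name in node_names)
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, keyword: str) -> str:
        # First node clockwise from the keyword's position, wrapping around.
        i = bisect_right(self._positions, ring_position(keyword.lower()))
        return self._ring[i % len(self._ring)][1]


ring = KeywordRing(["A", "B", "C", "D"])
# A query registered under "cat" and a document containing "Cat"
# resolve to the same node, which is what lets them meet.
query_node = ring.node_for("cat")
document_node = ring.node_for("Cat")
```

Because both sides hash the same keyword, neither needs to know in advance where the other is stored.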

14
Primer: Immediate Queries

Inverted list approach
Node responsible for keyword W stores the IDs of
documents containing that word
Query Q = W1 AND W2 AND ...
Fetch lists of IDs stored under W1, W2, ...
Intersect these lists
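
The fetch-and-intersect step can be sketched as follows, using the inverted lists from the deck's cat/dog example. The function name `immediate_query` is hypothetical, and all lists live in one dictionary here rather than on separate nodes.

```python
from functools import reduce

# Inverted lists from the deck's example: keyword -> IDs of documents containing it.
inverted_lists = {
    "cat": {1, 4, 7, 19, 20},
    "dog": {1, 5, 7, 26},
    "cow": {2, 4, 8, 18},
    "bat": {1, 8, 31},
}


def immediate_query(words, index):
    """Answer an AND query by intersecting the ID lists of its words."""
    lists = [index.get(word, set()) for word in words]
    if not lists:
        return []
    # Start from the shortest list so intermediate results stay small.
    lists.sort(key=len)
    return sorted(reduce(set.intersection, lists))


result = immediate_query(["cat", "dog"], inverted_lists)  # docs 1 and 7
```

In the distributed setting each list would be shipped between nodes, which is why list size dominates the bandwidth cost.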

15
Immediate Queries
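[Figure: nodes A, B, C, D hold the inverted lists Cat: 1,4,7,19,20; Dog: 1,5,7,26; Cow: 2,4,8,18; Bat: 1,8,31. For the query "cat & dog", the Cat and Dog lists are intersected and the result, docs 1 and 7, is sent back]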
16
Immediate Queries

Rough Model
Bandwidth C x N x I
DHTs provide lookups in O(log n) hops
Shipping inverted lists can be expensive
Many existing techniques to optimize
See MIT paper (On the Feasibility of
Peer-to-Peer Web Searching and Indexing)

17
Continuous Queries

Interchange the roles of documents and queries
Consider AND queries for now
Query Q = W1 AND W2 is stored under one of its words,
W1 or W2
Choose the most selective keyword

18
Continuous Queries
[Figure: nodes A, B, C, D with the document inverted lists Cat: 1,4,7,19,20; Dog: 1,5,7,26; Cow: 2,4,8,18; Bat: 1,8,31]
19
Continuous Queries

Document insertion
Fetch the query list stored under each word in the
document
Notify owners of satisfied queries
This is the basic "Send Query" strategy
Note: we still need document indices
for running certain complex queries
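
A rough sketch of registration and document insertion for AND queries, assuming all query lists sit in one process. The class `ContinuousQueryIndex`, the document-frequency table, and the owner names are illustrative. "Most selective" is taken here to mean the word with the lowest document frequency, which is an assumption.

```python
from collections import defaultdict


class ContinuousQueryIndex:
    def __init__(self, doc_frequency):
        # word -> number of documents containing it (assumed to be known).
        self.doc_frequency = doc_frequency
        # word -> list of (owner, query words) registered under that word.
        self.queries_by_word = defaultdict(list)

    def register(self, owner, words):
        words = frozenset(words)
        # Store the AND query under its rarest word only (ties broken by name).
        home = min(words, key=lambda w: (self.doc_frequency.get(w, 0), w))
        self.queries_by_word[home].append((owner, words))
        return home

    def insert_document(self, doc_words):
        doc_words = set(doc_words)
        notified = []
        for word in sorted(doc_words):
            # Fetch the query list stored under this word of the document.
            for owner, query in self.queries_by_word.get(word, []):
                if query <= doc_words:  # every query word occurs in the document
                    notified.append(owner)
        return notified


index = ContinuousQueryIndex({"cat": 5, "dog": 4, "cow": 4, "horse": 9})
home_q1 = index.register("q1", ["cat", "dog"])
home_q2 = index.register("q2", ["cat", "horse", "cow"])
hits = index.insert_document(["cat", "dog", "sheep"])
```

Since each query is stored under exactly one word, a matching document triggers at most one notification per query.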

20
Continuous Queries

[Figure: query lists stored under each keyword]
Cat: dog | horse dog | horse cow
Dog: sheep | cow sheep | lamb sheep
Bat: cow | mouse
Cow: horse
Nodes B, C, D shown; outcome: notify owner of query 1

21
Continuous Queries

[Figure: query lists stored under each keyword]
Cat: dog | horse dog | horse cow
Dog: sheep | cow sheep | lamb sheep
Bat: cow | mouse
Cow: horse
Nodes B, C, D shown; outcome: no action

22
Mimic immediate query strategy

Rough Model
Bandwidth C x Q x R
Q corresponds to N, R corresponds to I
Problems with this method
Does not scale with document corpus size
Bandwidth linear with N/Q
Key Metric
Bandwidth
Not latency

23
Outline

Introduction
System architecture
Join strategies
Performance evaluation
Complex queries

24
Join strategies

Must somehow join document metadata and query
metadata
Document Node (DN)
Node at which document currently resides
e.g., document insertion node
For each term in the document, contacts the
query node responsible for the term
Query Node (QN) for term T
Manages the list of queries registered on a
particular term T

25
Send Document

DN sends entire document to QN

[Figure: query lists stored under each keyword]
Cat: cow | horse dog | horse cow
Dog: sheep | horse sheep | lamb sheep
Bat: cow | mouse
Cow: dog
Nodes B, C, D shown; outcome: no action


26
Send Document

Pros
Simple
Low latency
Good if shipping the document costs less than shipping the queries
Documents are small
Many queries
Few nodes
Cons
Potentially (usually) very expensive

27
Term Dialogue

Let Q denote the queries stored at QN
Q = {Q1, Q2, ...}
Let K denote the distinct keywords in Q
K = {K1, K2, K3, ...}
If a keyword Ki does not occur in document D, all
queries in Q containing Ki can be eliminated

28
Term Dialogue

QN chooses some term T from K
DN replies with yes/no (does T occur in the document?)
QN then prunes the set of candidate queries
If the answer is no, delete all queries from Q that contain T
Delete all keywords from K that no longer appear
in queries in Q
Repeat
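
The dialogue loop can be sketched as below. The term-selection rule (ask about the term shared by the most remaining candidates) is a hypothetical greedy choice for illustration, not necessarily the heuristic evaluated in the talk. The function name `term_dialogue` and the query IDs are also assumptions.

```python
from collections import Counter


def term_dialogue(queries, document_words):
    """Return (IDs of satisfied queries, number of terms QN asked about)."""
    doc = set(document_words)                                    # held by DN
    candidates = {qid: set(ws) for qid, ws in queries.items()}   # Q, held by QN
    confirmed = set()                                            # terms answered "yes"
    rounds = 0
    while True:
        # K: keywords still appearing in candidate queries and not yet confirmed.
        counts = Counter(
            w for ws in candidates.values() for w in ws if w not in confirmed
        )
        if not counts:
            break
        # Hypothetical greedy choice: most widely shared term, ties by name.
        term = min(counts, key=lambda w: (-counts[w], w))
        rounds += 1
        if term in doc:      # DN replies "yes"
            confirmed.add(term)
        else:                # DN replies "no": prune every query containing the term
            candidates = {q: ws for q, ws in candidates.items() if term not in ws}
    # Every remaining candidate has had all of its terms confirmed.
    return sorted(candidates), rounds


# Query list stored under "cat" in the deck's example (the cat term is implicit).
cat_queries = {"q1": ["dog"], "q2": ["horse", "dog"], "q3": ["horse", "cow"]}
satisfied, asked = term_dialogue(cat_queries, {"cat", "dog"})
```

With this document the QN asks about "dog" (yes) and "horse" (no), leaving only q1 after two rounds.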

29
Term Dialogue
Singleton heuristic

[Figure: Cat query list: dog | horse dog | horse cow; nodes A and B; outcome: notify owner of Q1]
30
Term Dialogue

Pros
Good if query terms overlap heavily
Worst case outperforms the Send Queries strategy
(ignoring packet headers)
Cons
Potentially high latency
Another issue: packet overhead
QN can also ask for several terms at one time to
reduce the number of rounds
Optimal strategy: problem seems hard
Can choose a greedy strategy
We evaluate the simple heuristic
Future work: detailed evaluation of possible
heuristics

31
Bloom Filter

DN sends a Bloom filter of the keywords in the document to
QN
QN uses this to prune the set of candidate
queries Q
A Bloom filter can have false positives, however
Q is pruned to Q' after probing the Bloom filter
Now initiate the term-by-term dialogue on Q'
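
A small sketch of the pruning step, assuming a hand-rolled Bloom filter. The filter size, number of hashes, and the names `BloomFilter` and `prune_with_filter` are illustrative choices, not parameters from the talk.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def prune_with_filter(queries, doc_filter):
    """Keep only queries whose every keyword might occur in the document."""
    return {
        qid: words
        for qid, words in queries.items()
        if all(doc_filter.might_contain(w) for w in words)
    }


# DN side: summarize the document's keywords.
doc_filter = BloomFilter()
for word in ["cat", "dog"]:
    doc_filter.add(word)

# QN side: prune the query list stored under "cat".
cat_queries = {"q1": ["dog"], "q2": ["horse", "dog"], "q3": ["horse", "cow"]}
survivors = prune_with_filter(cat_queries, doc_filter)
```

A query that truly matches always survives, since the filter has no false negatives. Survivors may still include false positives, which is why the term-by-term dialogue follows.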

32
Bloom Filter

[Figure: Cat query list: dog | horse dog | horse cow; nodes A and B; outcome: notify owner of query 1]
33
Bloom Filter

Pros
Can potentially eliminate many queries with just
the bloom filter
Cons
Fixed overhead to every node

34
Note

None of these strategies works for immediate
queries
Documents are like OR queries: they have to be
registered on every keyword
Documents are typically too large to replicate on
each keyword

35
Outline

Introduction
System architecture
Alternative strategies
Performance evaluation
Complex queries

36
Simulation

Run over 10 (simulated) nodes
Evaluated each strategy with 50000 queries
Queries generated from words in about 500
documents (from Citeseer, LA Times)
Bandwidth costs measured over the insertion of
the next 500 documents

37
Query Generation

General process
Feed in n representative documents
Extract document term frequencies from these
documents
Model frequency of query terms after frequency in
documents
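
One way to sketch this process, assuming whitespace tokenization and a fixed number of terms per query. Both are assumptions, as are the names `term_frequencies` and `generate_queries`.

```python
import random
from collections import Counter


def term_frequencies(documents):
    """Count how often each term occurs across the representative documents."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return counts


def generate_queries(documents, num_queries, terms_per_query=2, seed=0):
    """Draw query terms with probability proportional to document term frequency."""
    rng = random.Random(seed)
    counts = term_frequencies(documents)
    terms = sorted(counts)
    weights = [counts[t] for t in terms]
    size = min(terms_per_query, len(terms))
    queries = []
    for _ in range(num_queries):
        query = set()
        while len(query) < size:
            query.add(rng.choices(terms, weights=weights, k=1)[0])
        queries.append(frozenset(query))
    return queries


sample_docs = ["cat dog cat", "dog cow"]
queries = generate_queries(sample_docs, num_queries=5)
```

Frequent document terms then show up in many generated queries, which is the modeling choice described above.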

38
Cost Model

Send Queries
Cost: size of all queries (keywords and
metadata) in the inverted list
Send Document
Cost: size of the document
Term Dialogue
Cost: C x K
Bloom Filter
Cost: size of filter + C x K

39
Which strategy works best?
40
Which strategy works best?
41
Query Term Frequency

Problem
Will continuous query term frequencies be the
same as document frequencies?
Our approach: skew the document frequency
distribution
Inverse skew, uniform, unchanged, high skew
Skew has a large impact on performance
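
The talk does not say how the skew was applied. One purely hypothetical way to get the four variants is to raise each term frequency to a power, as sketched below. The function `skew_weights` and the exponent values are assumptions for illustration only.

```python
def skew_weights(frequencies, exponent):
    """Reweight term frequencies as frequency ** exponent (hypothetical scheme)."""
    return {term: float(freq) ** exponent for term, freq in frequencies.items()}


document_frequencies = {"the": 100, "dht": 10, "bloom": 1}

inverse = skew_weights(document_frequencies, -1)    # rare terms favored
uniform = skew_weights(document_frequencies, 0)     # every term equally likely
unchanged = skew_weights(document_frequencies, 1)   # same as document frequency
high_skew = skew_weights(document_frequencies, 2)   # frequent terms favored even more
```

The resulting weights could then drive the sampling of query terms.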

42
Results (Unchanged skew)
43
Results (Uniform distribution)
44
Results (Inverse distribution)
[Plot; axis label: Inv. List Size]
45
Other Experiments

Notification type
Does the document node contact one node for every
distinct term in document?

46
Other Experiments

Notification type

[Figure: document terms Cat, Pig, Sheep; nodes A and B]
47
Other Experiments

Notification type

[Figure: document terms Cat, Pig, Sheep; nodes A and B]
48
Other Experiments

Notification type
Naïve: one notification per distinct keyword
Can be very expensive
Clustered: determine all distinct nodes in the naïve
strategy, and cluster notifications
One notification per distinct node in naïve
Clustering stage can have high latency
Broadcast: send out to all nodes in the DHT
One notification per node
Fixed cost, good if system is small
Only works for Send Document
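
The three notification counts can be compared with a short sketch. The term-to-node assignment below is hypothetical (the deck's figure shows terms Cat, Pig, Sheep and nodes A and B but not which term lives where), and the four-node system is likewise assumed.

```python
def naive_notifications(doc_terms):
    """One notification per distinct keyword in the document."""
    return len(set(doc_terms))


def clustered_notifications(doc_terms, node_for):
    """One notification per distinct node responsible for some document keyword."""
    return len({node_for[term] for term in set(doc_terms)})


def broadcast_notifications(all_nodes):
    """One notification per node in the system, regardless of the document."""
    return len(all_nodes)


# Hypothetical placement of keywords on nodes.
node_for = {"cat": "A", "pig": "A", "sheep": "B"}
all_nodes = ["A", "B", "C", "D"]
doc_terms = ["cat", "pig", "sheep", "cat"]

naive = naive_notifications(doc_terms)                    # 3 keywords
clustered = clustered_notifications(doc_terms, node_for)  # 2 nodes: A and B
broadcast = broadcast_notifications(all_nodes)            # 4 nodes
```

Clustered never sends more than naïve, and broadcast depends only on system size, which is why it suits small deployments.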

49
Other Experiments

Number of nodes
Average length of query
Affects probability that query is satisfied
Longer (but fewer) queries better for term
dialogue and bloom filter

50
Outline

Introduction
System architecture
Alternative strategies
Performance evaluation
Complex queries

51
Complex Queries

OR queries
(Author=Joe OR Author=Beth) AND Year=1999
All papers from 1999 written by Joe or Beth
Subqueries
Year=1999 AND Author(Author=Joe AND Year=1998)
All papers from 1999 written by co-authors of Joe,
where at least one co-authored paper was written
in 1998
Queries with hard predicates
e.g., range queries

52
OR Queries

OR queries registered on all terms
Otherwise, we would miss some results

53
OR Queries
Query: cat OR dog

[Figure: Cat query list: (cat) dog. Nodes A, B, C, D; "Notify owner" shown at A]
54
OR Queries
Query: cat OR dog

[Figure: Cat query list: (cat) dog. Nodes A, B, C, D; "No action" shown at A and at C]
55
OR Queries
Query: cat OR dog

[Figure: Cat query list: (cat) dog; Dog query list: cat (dog). Nodes A, B, C, D; "Notify owner" shown at A and at C]
56
OR Queries

To prevent duplicates, can rewrite the query
Q = T1 OR T2 OR T3 is rewritten into
Q1 = T1 (registered on T1)
Q2 = NOT T1 AND T2 (registered on T2)
Q3 = NOT T1 AND NOT T2 AND T3 (registered on T3)
Works well since queries tend to have few terms
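
A sketch of the rewrite, reading it as the usual disjoint decomposition in which the sub-query registered on Ti requires Ti and excludes T1 through T(i-1). The transcript has lost the operators, so this reading is an assumption. The names `rewrite_or_query` and `matches` are illustrative.

```python
def rewrite_or_query(terms):
    """Split T1 OR T2 OR ... into disjoint sub-queries, one registered per term."""
    return [
        {"registered_on": term, "require": term, "exclude": tuple(terms[:i])}
        for i, term in enumerate(terms)
    ]


def matches(sub_query, doc_words):
    """True if the document has the required term and none of the excluded ones."""
    return sub_query["require"] in doc_words and not any(
        t in doc_words for t in sub_query["exclude"]
    )


def notifications(sub_queries, doc_words):
    """Terms whose node would notify the owner for this document."""
    return [s["registered_on"] for s in sub_queries if matches(s, doc_words)]


sub_queries = rewrite_or_query(["cat", "dog"])
both = notifications(sub_queries, {"cat", "dog"})
only_dog = notifications(sub_queries, {"dog", "sheep"})
neither = notifications(sub_queries, {"cow"})
```

Any document containing at least one of the terms satisfies exactly one sub-query, so the owner is notified once.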

57
OR Queries
Query: cat OR dog

[Figure: after rewriting. Cat query list: (cat) dog; Dog query list: (dog). Nodes A, B, C, D; "No action" shown at A, "Notify owner" shown at C]
58
Subqueries

Example
Author(Author=Joe) AND Year=1999
Translation: find all papers from 1999 written
by co-authors of Joe

59
Subqueries: immediate
Example: Author(Author=Joe) AND Year=1999

Execute the subquery; get D, the set of all documents
where Author=Joe
Extract A, the set of all authors of documents in D
e.g., A = {Sue, Beth, Scott}
Translate the subquery into an OR query
Author=Sue OR Author=Beth OR Author=Scott
Execute the new query
(Author=Sue OR Author=Beth OR Author=Scott)
AND Year=1999
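
The steps above can be sketched over a small in-memory collection. The document IDs, years, and the function name `coauthor_papers` are made up for illustration. Joe himself is left out of the extracted author set to match the slide's example, which is an assumption.

```python
# Hypothetical document collection: id -> metadata.
documents = {
    1: {"authors": {"Joe", "Sue"}, "year": 1998},
    2: {"authors": {"Joe", "Beth", "Scott"}, "year": 1997},
    3: {"authors": {"Sue"}, "year": 1999},
    4: {"authors": {"Bob"}, "year": 1999},
    5: {"authors": {"Scott"}, "year": 2000},
}


def coauthor_papers(docs, author, year):
    # Step 1: execute the subquery Author=<author>.
    subquery_docs = [doc for doc in docs.values() if author in doc["authors"]]
    # Step 2: extract all authors of those documents.
    coauthors = set().union(*(doc["authors"] for doc in subquery_docs)) - {author}
    # Steps 3 and 4: run the translated OR query together with the year predicate.
    hits = sorted(
        doc_id
        for doc_id, doc in docs.items()
        if doc["year"] == year and doc["authors"] & coauthors
    )
    return coauthors, hits


coauthors, hits = coauthor_papers(documents, "Joe", 1999)
```

Here the subquery yields Sue, Beth, and Scott, and only document 3 is both from 1999 and written by one of them.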

60
Subqueries: continuous

Observe: a new answer to the query appears if
A new paper is written by an existing co-author of
Joe
Joe writes a paper with a new co-author
Register two queries
The outer query with the translated subquery
The subquery itself
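
A simplified sketch of the continuous case, assuming a single process that keeps both the co-author set and a document index. The class `CoauthorWatch` and document IDs 1 and 19 are hypothetical. Re-scanning the whole index on every insertion is a shortcut for brevity, not a claim about how the system does it.

```python
class CoauthorWatch:
    """Continuous form of Author(Author=<author>) AND Year=<year>."""

    def __init__(self, author, year):
        self.author = author
        self.year = year
        self.coauthors = set()   # current translation of the subquery
        self.documents = {}      # document index: id -> (authors, year)
        self.reported = set()    # answers already sent to the owner

    def insert(self, doc_id, authors, year):
        authors = set(authors)
        self.documents[doc_id] = (authors, year)
        if self.author in authors:
            # The subquery fired: the outer query gains any new co-authors.
            self.coauthors |= authors - {self.author}
        # Re-evaluate the outer query over the index so old documents can surface.
        new_hits = sorted(
            d
            for d, (a, y) in self.documents.items()
            if y == self.year and a & self.coauthors and d not in self.reported
        )
        self.reported.update(new_hits)
        return new_hits


watch = CoauthorWatch("Joe", 1999)
first = watch.insert(1, ["Joe", "Sue", "Scott", "Beth"], 1998)
second = watch.insert(18, ["Bob"], 1999)          # Bob is not yet a co-author
third = watch.insert(19, ["Joe", "Bob", "Beth"], 2001)  # Bob becomes a co-author
```

The third insertion makes Bob a co-author, so the older Doc 18 is reported at that point, which is why a document index has to be kept.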

61
Subqueries: continuous
Example: Author(Author=Joe) AND Year=1999

[Figure: nodes A, B, C, D. Registered under Joe: "(Joe)", the original subquery. Registered under 1999: "(1999) (Author=Sue OR Author=Scott OR Author=Beth)", the outer query with the translated subquery]
62
Subqueries: continuous
Example: Author(Author=Joe) AND Year=1999

[Figure: nodes B, C, D. Registered under Joe: "(Joe)". Registered under 1999: "(1999) (Author=Sue OR Author=Scott OR Author=Beth)". Outcome: notify owner]
63
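Subqueries: continuous
Example: Author(Author=Joe) AND Year=1999

[Figure: nodes A, B, C, D. Registered under Joe: "(Joe)"; shown at A: Bob, Beth. Registered under 1999: "(1999) (Author=Sue OR Author=Scott OR Author=Beth)". Doc 18: Author=Bob, Year=1999. Outcome: notify owner with Doc 18]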
64
Subqueries

Can easily extend in a similar fashion for any
level of nesting
Note: we now need to maintain document indices as
well!
Old documents can now be returned
Note: subqueries can have multiple keywords

65
Limits on Expressiveness

Negative predicates
NOT cat
Easy to process if there exist other positive
predicates in the query, difficult otherwise
Range predicates
year > 2000
Can plug in other solutions (e.g., PHTs)
Arbitrary conditions
Papers with more than 5 co-authors

66
Dealing with hard predicates

Register on easier predicates, if any exist
e.g., Q = NOT K1 AND K2: register at K2
If all predicates are hard, then store on a
specialized node, and process every document at that
node
e.g., Q = NOT K1 AND NOT K2

67
Implementation