Title: Query subscription
1. Query subscription
2. The Web changes all the time
- A crawler continuously crawls the web
- The flow of documents is filtered against a set of query subscriptions
3. Subscription Language
- SQL-like language based on atomic events.
- Combines the use of monitoring queries and continuous queries.
- The language can be extended by adding new types of atomic events.
- Uses an XML query language for continuous queries.
4. Example
- subscription myPaintings
- (what are the new painting entries on the Musée d'Orsay site?)
- monitoring newPainting
- select URL
- where URL extends www.musee-orsay.fr/
- and <painter> contains Monet
- (manage the changes in the expositions)
- continuous delta Exposition
- select ... from ... where
- when monthly
- notify daily (send me a daily report)
Atomic events
5. Step 1: Atomic Event Detection
- Input flow: 5 million pages/day
- Atomic event 46: URL matches pattern www.musee-orsay.fr/
- Atomic event 67: XML document contains the tag <painter> with the value Monet
- Diagram components: metadata manager, XML loader, complex event detection
- Document alerts along the flow: d/46, then d/46,67
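A minimal sketch of this step, assuming two hypothetical alerter functions (`url_alerter`, `xml_alerter`) and a hypothetical driver `detect_atomic_events`. The event codes 46 and 67 come from the example; the function names, the sample URL and the sample XML are illustrative assumptions, not the actual Xyleme implementation.

```python
import xml.etree.ElementTree as ET

# Atomic events taken from the example; the table layout is an assumption.
URL_PREFIX_EVENTS = {"www.musee-orsay.fr/": 46}
TAG_WORD_EVENTS = {("painter", "Monet"): 67}


def url_alerter(url):
    """Return the codes of atomic events whose URL prefix matches `url`."""
    bare = url.split("://", 1)[-1]  # drop the scheme if there is one
    return {code for prefix, code in URL_PREFIX_EVENTS.items()
            if bare.startswith(prefix)}


def xml_alerter(xml_text):
    """Return the codes of atomic events of the form '<tag> contains word'."""
    root = ET.fromstring(xml_text)
    codes = set()
    for (tag, word), code in TAG_WORD_EVENTS.items():
        for elem in root.iter(tag):
            if word in (elem.text or ""):
                codes.add(code)
    return codes


def detect_atomic_events(doc_id, url, xml_text):
    """Build the document alert: the document id plus its sorted event codes."""
    codes = url_alerter(url) | xml_alerter(xml_text)
    return doc_id, sorted(codes)


alert = detect_atomic_events(
    "d",
    "http://www.musee-orsay.fr/collection/1.xml",  # hypothetical URL
    "<entry><painter>Claude Monet</painter></entry>",  # hypothetical document
)
print(alert)  # ('d', [46, 67])
```

The resulting pair corresponds to the alert written d/46,67 on the slide.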
6. Alerters
- Each Alerter can be viewed as a plug-in that acts on a document flow.
- All sorts of atomic events can be detected: URL pattern detection, keywords, XPath expressions, page rank.
- Can be distributed.
- Some advanced alerts are:
- Long string look-ups
- Finding XML patterns (e.g. XPath)
- Comparing digital signatures of text documents (copy tracker)
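As an illustration of the plug-in idea, here is a minimal sketch of a copy-tracker alerter. The slides do not say how signatures are computed; this sketch assumes a SHA-256 digest of the whitespace-normalized, lower-cased text, so it only detects exact copies up to case and spacing. The names `text_signature` and `CopyTracker` are hypothetical.

```python
import hashlib


def text_signature(text):
    """Digest of the text after collapsing whitespace and lower-casing (assumed scheme)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class CopyTracker:
    """Alerter plug-in: callable on a document's text, returns atomic event codes."""

    def __init__(self):
        self.watched = {}  # signature -> atomic event code

    def watch(self, reference_text, event_code):
        """Register a text whose copies should raise `event_code`."""
        self.watched[text_signature(reference_text)] = event_code

    def __call__(self, document_text):
        code = self.watched.get(text_signature(document_text))
        return set() if code is None else {code}


tracker = CopyTracker()
tracker.watch("Hello  World", 5)
print(tracker("hello world\n"))  # {5}
print(tracker("something else"))  # set()
```

Because the alerter is just a callable from document text to a set of event codes, several such plug-ins can be applied one after another to the same document flow.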
7. URL Pattern Detection
- Supported patterns: URL prefix, suffix
- Using a hash table, try all possible patterns
- Each test is O(1); total test time is O(n), where n is the length of the URL
- Example: http://www.inria.fr/verso/index.html
- Test:
- http://www.inria.fr/verso/
- http://www.inria.fr/
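The prefix test can be sketched as follows, assuming that candidate prefixes are cut at each "/" after the host, as in the example. `candidate_prefixes` and `UrlPrefixAlerter` are hypothetical names, and suffix patterns are left out. Note that Python rehashes each prefix from scratch, so this sketch shows the lookup scheme but does not by itself achieve the O(n) total stated above; that would need an incrementally computed hash.

```python
def candidate_prefixes(url):
    """Yield every prefix of `url` ending at a '/' after the scheme, longest first."""
    scheme_end = url.find("://")
    start = scheme_end + 3 if scheme_end != -1 else 0
    for i in range(len(url) - 1, start - 1, -1):
        if url[i] == "/":
            yield url[: i + 1]


class UrlPrefixAlerter:
    """Hash table from registered URL prefixes to atomic event codes."""

    def __init__(self):
        self.patterns = {}  # prefix -> set of event codes

    def register(self, prefix, event_code):
        self.patterns.setdefault(prefix, set()).add(event_code)

    def match(self, url):
        """One hash lookup per candidate prefix of the URL."""
        codes = set()
        for prefix in candidate_prefixes(url):
            codes |= self.patterns.get(prefix, set())
        return codes


alerter = UrlPrefixAlerter()
alerter.register("http://www.inria.fr/verso/", 1)   # event codes are illustrative
alerter.register("http://www.inria.fr/", 2)
alerter.register("http://www.inria.fr/other/", 3)

print(list(candidate_prefixes("http://www.inria.fr/verso/index.html")))
# ['http://www.inria.fr/verso/', 'http://www.inria.fr/']
print(alerter.match("http://www.inria.fr/verso/index.html"))  # {1, 2}
```

The number of lookups depends only on the URL, not on how many patterns are registered, which is the point of using a hash table here.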
8. Stemming
- On the Alerter
- Example: Éléphant -> ELEPHANT
- Noise may be introduced
- (Example: tâche and tache become the same word)
- On the Subscription Manager
- To avoid duplicate registration of similar events
- To show users how their query is stemmed
- Real stemmers and concept extraction
- chevaux -> cheval
- Composite words: beau fils -> gendre
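The simple normalization used in the Éléphant example can be sketched with Python's standard `unicodedata` module: decompose accented characters, drop the combining marks, and upper-case. The function name `normalize` is an assumption; real stemming (chevaux to cheval) or concept extraction is not covered by this sketch.

```python
import unicodedata


def normalize(word):
    """Strip diacritics and upper-case a word."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.upper()


print(normalize("Éléphant"))  # ELEPHANT
print(normalize("tâche"), normalize("tache"))  # TACHE TACHE
```

The second line shows the noise mentioned above: two distinct French words collapse to the same key. Applying the same function in the Subscription Manager is what lets two subscriptions on "éléphant" and "ELEPHANT" share one atomic event.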
9. Step 2: Complex Event Detection
- Input: millions of page alerts per day, matched against millions of subscriptions
- Diagram components: HTML parser, XML loader, complex event detection
- Complex event 12: 67 and 46 (XML document contains the tag <painter> with value Monet, and URL matches pattern www.musee-orsay.fr/)
10. Complex Events Algorithm
- The formal problem is NP-hard
- We proposed several possible algorithms
- Experimental (simulation) values showed the effectiveness of our solutions
- The hash-tree based algorithm is well suited for our application:
- 10 million complex events
- 1 million atomic events
- 100 atomic events detected per document
- 0.8 ms to process a document; 2 million documents per day (on each PC)
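To make the matching problem concrete, here is a simplified sketch: each complex event is a conjunction of atomic event codes, stored as a sorted path in a tree of hash tables, and a document's detected codes are matched by walking that tree. This is only a stand-in under stated assumptions, not the authors' hash-tree algorithm; `ComplexEventIndex` and its methods are hypothetical names, and the walk can be exponential in the worst case.

```python
class _Node:
    def __init__(self):
        self.children = {}     # atomic event code -> child node
        self.complex_ids = []  # complex events whose conjunction ends here


class ComplexEventIndex:
    """Index of complex events, each a conjunction of atomic event codes."""

    def __init__(self):
        self.root = _Node()

    def register(self, complex_id, atomic_codes):
        """Store the conjunction as a path of sorted, de-duplicated codes."""
        node = self.root
        for code in sorted(set(atomic_codes)):
            node = node.children.setdefault(code, _Node())
        node.complex_ids.append(complex_id)

    def match(self, detected_codes):
        """Return the complex events whose atomic events were all detected."""
        events = sorted(set(detected_codes))
        found = []

        def walk(node, start):
            found.extend(node.complex_ids)
            for i in range(start, len(events)):
                child = node.children.get(events[i])
                if child is not None:
                    walk(child, i + 1)

        walk(self.root, 0)
        return sorted(found)


index = ComplexEventIndex()
index.register(12, [67, 46])  # complex event 12 from the slides
index.register(13, [46])      # illustrative extra subscriptions
index.register(14, [46, 99])

print(index.match([46, 67]))  # [12, 13]
print(index.match([67]))      # []
```

Only branches whose codes appear in the document are explored, so complex events that share no detected atomic event are never touched.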
11. Step 3: Notification Processor
- Diagram components: complex event detection, Reporter, triggers, clock, continuous queries
- Flows: alerts, notification/monitoring, notification/results
- Volume: millions of notifications/day
12. Architecture
- Diagram components: Xyleme Query Processor, Trigger Engine, Xyleme Alerter, Complex Event Detection, Xyleme Reporter, Xyleme Subscription Manager, SQL, Web Browser
- Flow label: documents