Title: A Dynamic Warehouse for the XML Data of the Web
Slide 1: A Dynamic Warehouse for the XML Data of the Web
Grégory Cobena, INRIA / Xyleme SA (Gregory.Cobena_at_inria.fr)
Serge Abiteboul, INRIA / Xyleme SA (Serge.Abiteboul_at_inria.fr)
http://www-rocq.inria.fr/verso/
http://www.xyleme.com/
Slide 2: Organization
- 1. The Web and XML
- 2. Xyleme
- 3. Data Acquisition and Maintenance
- (XML Repository, Semantic Data Integration and Query Processing)
- 4. Query Subscription
- Conclusion
Slide 3: 1. The Web and XML
Slide 4: The Web today
- Terabytes of data
- A lot of public pages
  - 1 billion in 06/2000
  - several million servers
- Private web: pages that are not publicly available
- Deep web: data hidden behind forms
Slide 5: HTML = Hypertext Language
- Text + presentation
- Where is the data? (hard)
Slide 6: XML = Semistructured Data
    <product-table>
      <product reference="X23">
        <designation> camera </designation>
        <price unit="Dollars"> 359.99 </price>
        <description> </description>
      </product>
      <product reference="R2D2">
        <designation> Robot </designation>
        <price unit="Dollars"> 19350 </price>
        <description> </description>
      </product>
      ...
    </product-table>
- Data + structure
- Semistructured: more flexible
- Where is the data? (easy)
Slide 7: XML = Tree + Types
    product-table
      product
        reference
        price
        designation
        description
- Semantics and structure are in paths
- product-table/product/reference
- product-table/product/price
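The idea that semantics and structure live in paths can be sketched in a few lines. The following is only an illustration, assuming Python's standard `xml.etree.ElementTree`; the function name `label_paths` and the choice to treat attributes as path steps (so that `reference` appears as in the tree above) are assumptions, not part of the talk.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
  </product>
</product-table>
""".strip()

def label_paths(xml_text):
    """Return the set of root-to-node label paths found in a document.

    Attributes are treated as one more path step (an assumption made here
    so that 'product-table/product/reference' shows up as a path).
    """
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(element, prefix):
        path = f"{prefix}/{element.tag}" if prefix else element.tag
        paths.add(path)
        for attribute_name in element.attrib:
            paths.add(f"{path}/{attribute_name}")
        for child in element:
            walk(child, path)

    walk(root, "")
    return paths

for path in sorted(label_paths(SAMPLE)):
    print(path)
```

Two documents with different layouts can then be compared through their sets of paths rather than through their raw text.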
Slide 8: 2. A Dynamic Warehouse for the XML Data of the Web
Slide 9: Xyleme Research
- Project Xyleme at INRIA (1999-2000)
  - Explore XML + Web + DBMS to make the Web a knowledge database
- INRIA
  - Sophie Cluet: databases (OQL)
  - Serge Abiteboul: semi-structured data, Web
  - Guy Ferran: formerly O2 Technology
- Mannheim University
  - Guido Moerkotte
- Université d'Orsay
  - Marie-Christine Rousset
- CNAM
  - Dan Vodislav
Slide 10: Xyleme Company
- Started September 2000
  - (25 employees at the end of 2001)
- Market challenges
  - Few XML documents available on the Web (because of weak software support)
  - Company is focusing on private XML
  - Press, publishers, financial data, biology
- Technology
  - Scalability for large amounts of data
  - Internet (focus) / Intranet support
  - Monitoring and version management
  - Heterogeneous data integration
Slide 11: Architecture
- Cluster of PCs
- Developed with Linux and C++
- Communications
  - local: CORBA
  - external: HTTP
- Distribution between autonomous machines
- Now: Web Services
Slide 12: Functional Architecture
[Diagram: INTERNET at the top; components shown: Web Interface, Query Processor, Repository and Index Manager]
Slide 13: Architecture
[Diagram labels: INTERNET, ETHERNET]
Slide 14: 3. Data Acquisition and Maintenance, Page Importance
Slide 15: Goals
- Discover XML pages on the Web that are of interest for customers
- For this, crawl the Web (HTML + XML)
- Keep them up to date
- Do this under bounded resources
  - Memory for known URLs
  - Bandwidth
Slide 16: Life Cycle of a Page in Xyleme
- The URL of D is discovered as a link in another page (or published by a customer)
- The page scheduler decides to read D
- The metadata of D is read
  - type, last_date_update...
- The document D is loaded
- The document D is (re)read regularly
Slide 17: Main Issues
- Loading of pages
  - we can load up to 5 million pages/day on a standard PC
  - main cost is the Internet connection
- Metadata management (access to disk)
- Page scheduling
  - decide which page to read or refresh next
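To make the scheduling question concrete, here is a toy sketch of one possible policy: read next the page with the largest product of importance and time since its last read. The class `PageScheduler`, the scoring rule, and the use of integer time stamps are all assumptions made for illustration; the talk does not specify the actual policy.

```python
class PageScheduler:
    """Toy page scheduler (hypothetical policy, not Xyleme's).

    Score of a page = importance * age, where age is the time elapsed
    since the page was last read (or since time 0 if never read).
    """

    def __init__(self):
        self.importance = {}  # url -> importance estimate
        self.last_read = {}   # url -> time of last read, or None

    def discover(self, url, importance):
        """Register a newly discovered URL (or update its importance)."""
        self.importance[url] = importance
        self.last_read.setdefault(url, None)

    def score(self, url, now):
        last = self.last_read[url]
        age = now if last is None else now - last
        return self.importance[url] * age

    def next_page(self, now):
        """Pick the page to read or refresh at time `now` and mark it read."""
        url = max(self.importance, key=lambda u: self.score(u, now))
        self.last_read[url] = now
        return url


scheduler = PageScheduler()
scheduler.discover("http://example.org/important", 0.9)
scheduler.discover("http://example.org/minor", 0.1)
print(scheduler.next_page(now=10))
print(scheduler.next_page(now=11))
```

Scanning all known URLs on every call is only acceptable for a toy; with the memory and disk-access constraints listed above, a real scheduler would need a cheaper data structure.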
Slide 18: Page Importance
- Definition: important pages are linked to by important pages
- Offline algorithm (used by Google)
- Our online algorithm (M. Preda, S. Abiteboul, G. Cobena)
  - does not require maintaining graph information
  - faster convergence with focused crawling
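The recursive definition above can be read as a fixpoint computed by repeated propagation over the link graph. The sketch below illustrates the offline style of computation only, not the online algorithm mentioned in the slide; the function name `page_importance`, the damping factor of 0.85 and the handling of pages without outgoing links are assumptions.

```python
def page_importance(links, iterations=50, damping=0.85):
    """Offline importance by power iteration on a link graph.

    links: dict mapping a page to the list of pages it links to.
    The damping factor is an assumption; the slide does not mention it.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without outgoing links: spread its importance evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
for page, value in sorted(page_importance(graph).items()):
    print(page, round(value, 3))
```

Note that this version needs the whole graph in memory, which is precisely the requirement the online algorithm is said to avoid.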
Slide 19: (XML Repository, Semantic Data Integration and Query Processing)
Slide 20: Querying Language
- Today: a mix of OQL and XQL
- We are currently moving to XQuery (which is also a mix of OQL and XQL)
    select boss/Name, boss/Phone
    from comp in BusinessDomain,
         boss in comp//Manager
    where comp/Product contains "Xyleme"
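To show what the query above computes, the following sketch evaluates the same logic by hand over a small in-memory document with Python's `xml.etree.ElementTree`. The sample data, the element name `Company`, and the function name are invented for the illustration; this is not how the Xyleme query processor works.

```python
import xml.etree.ElementTree as ET

BUSINESS_DOMAIN = """
<BusinessDomain>
  <Company>
    <Product>Xyleme warehouse</Product>
    <Staff>
      <Manager><Name>Alice</Name><Phone>555-0100</Phone></Manager>
    </Staff>
  </Company>
  <Company>
    <Product>Other tool</Product>
    <Manager><Name>Bob</Name><Phone>555-0101</Phone></Manager>
  </Company>
</BusinessDomain>
""".strip()

def managers_of_companies_with_product(domain_xml, keyword):
    """Hand-written equivalent of the select/from/where query."""
    domain = ET.fromstring(domain_xml)
    results = []
    for comp in domain:                      # comp in BusinessDomain
        products = [p.text or "" for p in comp.findall("Product")]
        if not any(keyword in text for text in products):
            continue                         # where comp/Product contains keyword
        for boss in comp.iter("Manager"):    # boss in comp//Manager
            results.append((boss.findtext("Name"), boss.findtext("Phone")))
    return results

print(managers_of_companies_with_product(BUSINESS_DOMAIN, "Xyleme"))
```

The `//` step matters here: the first manager sits under `Staff`, so a child-only step would miss it.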
Slide 21: Web Heterogeneity
- Semantic domains, e.g., cinema
- Many possible types for data in this domain, many DTDs
- Semantic integration
  - one abstract DTD for the domain
  - gives the illusion that the system maintains a homogeneous database for this domain
- 1 domain = 1 abstract DTD
Slide 22: Indexing
- Standard inverted index
  - word → documents that contain this word
- Xyleme index
  - word → elements that contain this word
  - (document, element identifier)
- Goal: more work can be performed without accessing data
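The difference between the two kinds of index can be sketched as follows. This is a minimal illustration under assumptions: element identifiers are taken to be preorder positions in the document, only the text directly at the start of an element is indexed, and the function name `build_element_index` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_element_index(documents):
    """Map each word to the set of (document id, element id, tag) containing it.

    documents: dict mapping a document id to its XML text.
    Element ids are preorder positions (an assumption for this sketch).
    """
    index = defaultdict(set)
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for element_id, element in enumerate(root.iter()):
            # Only the text directly at the start of the element is indexed.
            for word in re.findall(r"\w+", (element.text or "").lower()):
                index[word].add((doc_id, element_id, element.tag))
    return index


docs = {
    "d1": "<product-table><product><designation>camera</designation>"
          "<price>359.99</price></product></product-table>",
    "d2": "<review><title>camera test</title></review>",
}
index = build_element_index(docs)

# "Which documents mention 'camera' inside a designation element?"
# can be answered from the index alone, without reopening the documents.
print(sorted(d for d, _, tag in index["camera"] if tag == "designation"))
```

A standard inverted index would only say that both `d1` and `d2` contain the word; the element-level entries let the structural condition be checked without accessing the data.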
Slide 23: 4. Change Control
Slide 24: The Web changes all the time
- Data acquisition and maintenance
  - keep the warehouse up to date
- Version management
  - representation and storage of changes
- Change monitoring
  - query subscription
Slide 25: Subscription Language
- SQL-like language based on atomic events.
- Combines the use of monitoring queries and continuous queries.
- The language can be extended by adding new types of atomic events.
- Uses the XML query language for continuous queries.
  - "Querying the XML Documents of the Web", V. Aguilera, S. Cluet, F. Boiscuvier, Tech. Report
Slide 26: Example
- subscription myPaintings
  - (what are the new painting entries on the Musée d'Orsay site?)
  - monitoring newPainting
    - select URL
    - where URL extends "www.musee-orsay.fr/"
    - and <painter> contains "Monet"
  - (manage the changes in the expositions)
  - continuous delta Exposition
    - select ... from ... where
    - when monthly
  - notify daily (send me a daily report)
- Atomic events: URL extends "www.musee-orsay.fr/"; <painter> contains "Monet"
Slide 27: Step 1: Atomic Event Detection
5 million pages/day
atomic event 46: URL matches pattern "www.musee-orsay.fr/"
atomic event 67: XML document contains the tag <painter> with the value "Monet"
[Diagram components: metadata manager, HTML parser, XML loader, complex event detection]
Slide 28: Step 2: Complex Event Detection
Millions of alerts and of pages per day; millions of subscriptions
complex event 12 = 67 and 46 (XML document contains the tag <painter> with value "Monet" and URL matches pattern "www.musee-orsay.fr/")
[Diagram components: HTML parser, XML loader, complex event detection]
Slide 29: Step 3: Notification Processor
[Diagram components: complex event detection, Reporter, continuous queries]
Slide 30: Architecture
[Diagram components: Xyleme Query Processor, documents, Trigger Engine, Xyleme Alerter, Complex Event Detection, Xyleme Reporter, Reporter, Xyleme Subscription Manager, Subscription Manager, SQL, Web Browser]
Slide 31: Complex Events Algorithm
- The formal problem is NP-hard
- We proposed several possible algorithms
- Experimental (simulation) results showed the effectiveness of our solutions
- The Hash-Tree based algorithm is well suited for our application
- 10 million complex events
- 1 million atomic events
- 100 atomic events detected per document
- 0.8 ms to process a document; 2 million documents per day
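As a rough illustration of the matching problem, a complex event can be seen as a set of atomic event ids, and a document triggers it when all of those ids were detected on the document. The sketch below stores complex events in a prefix tree whose nodes hold hash tables keyed by atomic event id. This is one plausible reading of "hash tree", stated as an assumption; the actual algorithm may differ, and the names `EventNode`, `register` and `detect` are hypothetical.

```python
class EventNode:
    """Node of a prefix tree over sorted atomic event ids."""

    def __init__(self):
        self.children = {}     # atomic event id -> EventNode (a hash table)
        self.complex_ids = []  # complex events whose id list ends here


def register(root, complex_id, atomic_ids):
    """Store a complex event as the sorted sequence of its atomic event ids."""
    node = root
    for atomic_id in sorted(set(atomic_ids)):
        node = node.children.setdefault(atomic_id, EventNode())
    node.complex_ids.append(complex_id)


def detect(root, detected_atomic_ids):
    """Return the complex events whose atomic events were all detected."""
    detected = sorted(set(detected_atomic_ids))
    found = []

    def walk(node, start):
        found.extend(node.complex_ids)
        # Try to extend the current prefix with each later detected id.
        for i in range(start, len(detected)):
            child = node.children.get(detected[i])
            if child is not None:
                walk(child, i + 1)

    walk(root, 0)
    return found


root = EventNode()
register(root, 12, [67, 46])   # complex event 12 = 67 and 46
register(root, 13, [46])
register(root, 14, [46, 99])
print(sorted(detect(root, [67, 46, 5])))
```

Because only branches that start with a detected id are followed, the work per document depends on the detected events rather than on the total number of registered complex events.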
Slide 32: Alerters
- Each Alerter can be viewed as a plug-in that acts on a document flow.
- All sorts of atomic events can be detected: URL pattern detection, keywords, XPath expressions, page rank
- Can be distributed.
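The plug-in view of alerters can be sketched as a small common interface: each alerter inspects a document and returns the ids of the atomic events it detects. Everything named here (`Document`, `UrlPrefixAlerter`, `KeywordAlerter`, `run_alerters`, and event id 101) is hypothetical; only event 46 and its URL pattern come from the example in the slides.

```python
from dataclasses import dataclass


@dataclass
class Document:
    url: str
    text: str


class UrlPrefixAlerter:
    """Raises an atomic event when the URL starts with a given prefix."""

    def __init__(self, event_id, prefix):
        self.event_id = event_id
        self.prefix = prefix

    def detect(self, document):
        return [self.event_id] if document.url.startswith(self.prefix) else []


class KeywordAlerter:
    """Raises an atomic event when the text contains a given keyword."""

    def __init__(self, event_id, keyword):
        self.event_id = event_id
        self.keyword = keyword.lower()

    def detect(self, document):
        return [self.event_id] if self.keyword in document.text.lower() else []


def run_alerters(alerters, document):
    """Pass one document through every plug-in and collect atomic events."""
    events = []
    for alerter in alerters:
        events.extend(alerter.detect(document))
    return events


alerters = [
    UrlPrefixAlerter(46, "www.musee-orsay.fr/"),
    KeywordAlerter(101, "Monet"),  # 101 is a made-up event id
]
page = Document("www.musee-orsay.fr/collections", "Water Lilies, Claude Monet")
print(run_alerters(alerters, page))
```

Since each alerter only needs the document itself, different alerters (or different slices of the document flow) could run on different machines.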
Slide 33: Some Advanced Alerts
- Process document flow (single pass)
- Full strings
- Context Stack
- Reversed look-up
- XML Alerts
- Reversed XPath expressions
- Dual context stack for / and //
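Single-pass processing with a context stack can be illustrated as follows: while the document streams through the parser, a stack holds the tags of the currently open elements, and registered path patterns are checked against that stack at each opening tag. This sketch uses one stack and supports only patterns with a leading `/` or a leading `//`; it is a simplification under assumptions, not the dual-stack technique named in the slide, and `single_pass_matches` is a hypothetical name.

```python
import io
import xml.etree.ElementTree as ET


def single_pass_matches(xml_text, patterns):
    """Match simple path patterns in one streaming pass over a document.

    Supported patterns: '/a/b' (path from the root) and '//a/b'
    (path ending anywhere in the document). Returns (pattern, path) pairs.
    """
    compiled = []
    for pattern in patterns:
        anywhere = pattern.startswith("//")
        steps = tuple(pattern.lstrip("/").split("/"))
        compiled.append((pattern, anywhere, steps))

    stack = []  # context stack: tags of the currently open elements
    hits = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(element.tag)
            path = tuple(stack)
            for pattern, anywhere, steps in compiled:
                if anywhere:
                    matched = path[-len(steps):] == steps
                else:
                    matched = path == steps
                if matched:
                    hits.append((pattern, "/" + "/".join(stack)))
        else:
            stack.pop()
    return hits


doc = "<museum><room><painter>Monet</painter></room><painter>Manet</painter></museum>"
for hit in single_pass_matches(doc, ["/museum/painter", "//painter"]):
    print(hit)
```

Here every pattern is tested at every opening tag; the "reversed" look-ups mentioned above suggest indexing the patterns so that the current tag selects the candidate patterns, which this sketch does not attempt.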
Slide 34: Versions
- Objectives
  - Temporal queries (persistent identification of nodes)
  - Version some documents or some sites (store a delta)
  - Change monitoring (query changes)
- We proposed a representation of changes
  - "Change-Centric Management of Versions" (VLDB 2001)
- We developed a diff algorithm for XML
  - "Detecting Changes in XML Documents", G. Cobena, S. Abiteboul, A. Marian, ICDE 2002 (San Jose)
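To give a feel for what "storing a delta" between two versions means, here is a deliberately naive sketch that compares the leaf values of two documents by positional path and lists insert, delete and update operations. It is not the diff algorithm cited above: positional paths are not persistent node identifiers, and moved subtrees are not recognized. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def leaf_values(xml_text):
    """Map a positional path such as /product/price[1] to the leaf text."""
    root = ET.fromstring(xml_text)
    values = {}

    def walk(element, path):
        children = list(element)
        if not children:
            values[path] = (element.text or "").strip()
        seen = {}
        for child in children:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, "/" + root.tag)
    return values


def naive_delta(old_xml, new_xml):
    """List the operations that turn the old leaf values into the new ones."""
    old, new = leaf_values(old_xml), leaf_values(new_xml)
    operations = []
    for path in sorted(old.keys() | new.keys()):
        if path not in new:
            operations.append(("delete", path, old[path]))
        elif path not in old:
            operations.append(("insert", path, new[path]))
        elif old[path] != new[path]:
            operations.append(("update", path, old[path], new[path]))
    return operations


version_1 = "<product><price>359.99</price></product>"
version_2 = "<product><price>349.99</price><designation>camera</designation></product>"
for operation in naive_delta(version_1, version_2):
    print(operation)
```

Keeping only such a list of operations per version, instead of every full copy, is the storage idea behind a delta; queries on changes can then be asked of the operations themselves.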
Slide 35: Conclusion and Perspectives
- Focus crawling on important pages
- Refine notion of importance
- Improve important pages discovery
- Improve Change control accuracy
- Semantic web
- Real-time advanced processing
Slide 36: Thank you