A Dynamic Warehouse for the XML data of the Web

1
A Dynamic Warehouse for the XML data of the Web
Grégory COBENA, INRIA & Xyleme SA (Gregory.Cobena_at_inria.fr)
Serge Abiteboul, INRIA & Xyleme SA (Serge.Abiteboul_at_inria.fr)
http://www-rocq.inria.fr/verso/
http://www.xyleme.com/
2
Organization

1. The Web and XML
2. Xyleme
3. Data Acquisition and Maintenance
XML Repository, Semantic Data Integration and Query Processing
4. Query Subscription
Conclusion

3
1. The Web and XML
4
The Web today

Terabytes of data
A lot of public pages
  1 billion in 06/2000
  several million servers
Private web: pages that are not publicly available
Deep web: data hidden behind forms

5
HTML: HyperText Markup Language
Text + presentation. Where is the data?
Finding the data: hard
6
XML: Semistructured Data
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> </description>
  </product>
  ...
</product-table>
XML: data + structure
Semistructured: more flexible
Finding the data: easy
7
XML Tree Types
product-table
  product
    reference
    price
    designation
    description

Semantics and structure are in paths
product-table/product/reference
product-table/product/price
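
As a minimal sketch of the idea that semantics and structure live in paths, the following Python snippet lists the distinct root-to-node paths of a small document modelled on the product table shown earlier. The helper name `element_paths`, the sample values, and the choice to treat attributes as an extra path step are assumptions made for illustration, not part of Xyleme.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<product-table>
  <product reference="X23">
    <designation>camera</designation>
    <price unit="Dollars">359.99</price>
    <description/>
  </product>
</product-table>
""".strip()


def element_paths(xml_text):
    """Return the sorted set of root-to-node paths (elements and attributes)."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        paths.add(path)
        # Attributes are treated as one more path step (an assumption).
        for attr in node.attrib:
            paths.add(f"{path}/{attr}")
        for child in node:
            walk(child, path)

    walk(root, "")
    return sorted(paths)


if __name__ == "__main__":
    for p in element_paths(SAMPLE):
        print(p)
```

Two documents with different layouts can then be compared by the paths they share, without inspecting their values.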

8
2. A Dynamic Warehouse for the XML Data of the Web

Xyleme

9
Xyleme Research

Project Xyleme at INRIA (1999-2000)
Explore an XML Web DBMS to make the Web a knowledge database
INRIA
  Sophie Cluet: databases (OQL)
  Serge Abiteboul: semi-structured data, web
  Guy Ferran: ex-O2 Technology
Mannheim University
  Guido Moerkotte
Université d'Orsay
  Marie-Christine Rousset
CNAM
  Dan Vodislav

10
Xyleme Company

Started September 2000 (25 employees at the end of 2001)
Market challenges
  Few XML documents are available on the Web (because of weak software support)
  The company is focusing on private XML: press, publishers, financial data, biology
Technology
  Scalability for large amounts of data
  Internet (focus) / Intranet support
  Monitoring and version management
  Heterogeneous data integration

11
Architecture

Cluster of PCs
Developed with Linux and C++
Communications
  local: CORBA
  external: HTTP
Distribution between autonomous machines
Now: Web Services

12
Functional Architecture
[Figure: layers shown below the Internet: Web Interface, Query Processor, Repository and Index Manager]
13
Architecture
[Figure: architecture diagram with two network layers, Internet and Ethernet]
14
3. Data Acquisition and Maintenance, Page Importance
15
Goals

Discover XML pages on the web that are of interest to customers
  For this, crawl the web (HTML + XML)
Keep them up to date
Do this under bounded resources
  Memory for known URLs
  Bandwidth

16
Life Cycle of a page in Xyleme

The URL of D is discovered as a link in another page (or published by a customer)
The page scheduler decides to read D
The metadata of D is read: type, last_date_update...
The document D is loaded
The document D is (re)read regularly

17
Main Issues

Loading of pages
  we can load up to 5 million pages/day on a standard PC
  main cost is the Internet connection
Metadata management (access to disk)
Page scheduling
  decide which page to read or refresh next

18
Page Importance

Definition: important pages are linked to by important pages
Offline algorithm (used by Google)
Our online algorithm (M. Preda, S. Abiteboul, G. Cobena)
  does not require maintaining graph information
  faster convergence with focused crawling
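
The definition above is a fixed point: a page's importance depends on the importance of the pages that link to it. A minimal sketch of the offline style of computation, assuming the whole link graph fits in memory, is a plain power iteration. The toy graph, the absence of any damping factor, and the function name `importance` are assumptions for illustration. This is not the online algorithm named on the slide, which avoids storing the graph.

```python
def importance(links, iterations=50):
    """Repeatedly let each page split its score equally among its out-links.

    `links` maps a page to the list of pages it links to. Every page is
    assumed to have at least one out-link, and the graph is assumed to be
    strongly connected and aperiodic so that the iteration converges.
    """
    pages = list(links)
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for page, targets in links.items():
            share = score[page] / len(targets)
            for t in targets:
                nxt[t] += share
        score = nxt
    return score


if __name__ == "__main__":
    toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, value in sorted(importance(toy).items()):
        print(page, round(value, 3))
```

Each pass needs the full `links` map, which is the graph information the online approach is said to do without.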

19
(XML Repository, Semantic Data Integration and Query Processing)
20
Querying Language

Today: a mix of OQL and XQL
We are currently moving to XQuery (which is also a mix of OQL and XQL)
Select boss/Name, boss/Phone
From comp in BusinessDomain,
     boss in comp//Manager
Where comp/Product contains Xyleme

21
Web Heterogeneity

Semantic domains, e.g., cinema
  Many possible types for data in this domain, many DTDs
Semantic integration
  one abstract DTD for the domain
  gives the illusion that the system maintains a homogeneous database for this domain
1 domain = 1 abstract DTD

22
Indexing

Standard inverted index
  word → documents that contain this word
Xyleme index
  word → elements that contain this word (document + element identifier)
Goal: more work can be performed without accessing the data
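
To make the contrast concrete, here is a minimal sketch of both index shapes in Python. The document names, the use of a preorder position as the element identifier, and the whitespace tokenizer are assumptions for illustration; the slide does not say how Xyleme assigns identifiers.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_indexes(docs):
    """Build word -> {doc} and word -> {(doc, element_id)} from XML strings.

    element_id is the element's position in a preorder traversal, and a word
    is attached to the element whose leading text contains it (assumptions).
    Text that follows a child element (tail text) is ignored in this sketch.
    """
    doc_index = defaultdict(set)
    elem_index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        for elem_id, elem in enumerate(root.iter()):
            for word in (elem.text or "").lower().split():
                doc_index[word].add(doc_id)
                elem_index[word].add((doc_id, elem_id))
    return doc_index, elem_index


if __name__ == "__main__":
    docs = {
        "d1": "<product><designation>camera</designation>"
              "<price>359.99</price></product>",
        "d2": "<product><designation>robot camera</designation></product>",
    }
    doc_index, elem_index = build_indexes(docs)
    print(sorted(doc_index["camera"]))
    print(sorted(elem_index["camera"]))
```

With element-level postings, a condition such as "designation contains camera" can be narrowed to candidate elements from the index alone, before any document is fetched.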

23
4. Change Control
24
The Web changes all the time

Data acquisition and maintenance
  keep the warehouse up-to-date
Version management
  representation and storage of changes
Change monitoring
  query subscription

25
Subscription Language

SQL-like language based on atomic events.
Combines the use of monitoring queries and continuous queries.
The language can be extended by adding new types of atomic events.
Uses the XML query language for continuous queries; see "Querying the XML Documents of the Web", V. Aguilera, S. Cluet, F. Boiscuvier, Tech. Report.

26
Example

subscription myPaintings
(what are the new painting entries on the Musée d'Orsay site?)
monitoring newPainting
  select URL
  where URL extends www.musee-orsay.fr/
    and <painter> contains Monet
(manage the changes in the expositions)
continuous delta Exposition
  select ... from ... where
  when monthly
notify daily (send me a daily report)

Atomic events: the two conditions of the where clause (URL extends www.musee-orsay.fr/, <painter> contains Monet)
27
Step 1: Atomic Event Detection
5 million pages/day
atomic event 46: URL matches pattern www.musee-orsay.fr/
atomic event 67: XML document contains the tag <painter> with the value Monet
[Figure components: metadata manager, HTML parser, XML loader, complex event detection]
28
Step 2: Complex Event Detection
Millions of alerts per day, millions of subscriptions
complex event 12 = 67 and 46 (XML document contains the tag <painter> with value Monet and URL matches pattern www.musee-orsay.fr/)
[Figure components: HTML parser, XML loader, complex event detection]
29
Step 3: Notification Processor
[Figure components: complex event detection, Reporter, continuous queries]
30
Architecture
[Figure components: Xyleme Query Processor (documents), Trigger Engine, Xyleme Alerter, Complex Event Detection, Xyleme Reporter, Xyleme Subscription Manager (SQL), Web Browser]
31
Complex Events Algorithm

The formal problem is NP-hard
We proposed several possible algorithms
Simulation experiments showed the effectiveness of our solutions
The Hash-Tree based algorithm is well suited to our application
  10 million complex events
  1 million atomic events
  100 atomic events detected per document
  0.8 ms to process a document; 2 million documents per day
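
The detection task itself can be stated compactly: a complex event is a set of atomic event identifiers, and it fires for a document when every one of its atomic events was detected on that document. The sketch below uses simple counting over a map from atomic events to the complex events that need them. It is an assumption made for illustration and is not the Hash-Tree algorithm named on the slide. Event numbers 12, 46, and 67 echo the earlier example; complex event 13 and atomic event 99 are made up.

```python
from collections import defaultdict


def build_lookup(complex_events):
    """Map each atomic event id to the complex events that require it."""
    by_atomic = defaultdict(list)
    for cid, atoms in complex_events.items():
        for a in atoms:
            by_atomic[a].append(cid)
    return by_atomic


def detect(complex_events, by_atomic, detected_atoms):
    """Return ids of complex events whose atomic events were all detected."""
    hits = defaultdict(int)
    for a in set(detected_atoms):
        for cid in by_atomic.get(a, ()):
            hits[cid] += 1
    # A complex event fires only if every one of its atomic events was seen.
    return sorted(c for c, n in hits.items() if n == len(complex_events[c]))


if __name__ == "__main__":
    complex_events = {12: {46, 67}, 13: {46, 99}}
    lookup = build_lookup(complex_events)
    print(detect(complex_events, lookup, [46, 67]))
```

The work per document grows with the number of complex events that share its detected atomic events, which is why a more selective structure matters at the scale quoted above.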

32
Alerters

Each Alerter can be viewed as a plug-in that acts on a document flow.
All sorts of atomic events can be detected: URL patterns, keywords, XPath expressions, page rank.
Alerters can be distributed.
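
One way to picture the plug-in view is a chain of small detectors, each inspecting a document as it flows past and emitting atomic event identifiers. The class names, the `Document` fields, and the event numbering below are hypothetical; only the kinds of detection (URL pattern, keyword) come from the slide.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol


@dataclass
class Document:
    url: str
    text: str


class Alerter(Protocol):
    """Anything with a detect() method can be plugged into the flow."""

    def detect(self, doc: Document) -> Iterable[int]: ...


@dataclass
class UrlPrefixAlerter:
    prefix: str
    event_id: int

    def detect(self, doc: Document) -> Iterable[int]:
        return [self.event_id] if doc.url.startswith(self.prefix) else []


@dataclass
class KeywordAlerter:
    keyword: str
    event_id: int

    def detect(self, doc: Document) -> Iterable[int]:
        found = self.keyword.lower() in doc.text.lower()
        return [self.event_id] if found else []


def run_alerters(alerters: List[Alerter], doc: Document) -> List[int]:
    """Pass one document through every alerter and collect atomic events."""
    events: List[int] = []
    for alerter in alerters:
        events.extend(alerter.detect(doc))
    return sorted(events)
```

Because each alerter only needs the document it is given, separate alerters could run on separate machines and forward their event identifiers to a common detector.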

33
Some Advanced Alerts