A Dynamic Warehouse for the XML Data of the Web

1
A Dynamic Warehouse for the XML Data of the Web
Grégory COBENA, INRIA & Xyleme SA (Gregory.Cobena_at_inria.fr)
Serge Abiteboul, INRIA & Xyleme SA (Serge.Abiteboul_at_inria.fr)
http://www-rocq.inria.fr/verso/ http://www.xyleme.com/
2
Organization
  • 1. The Web and XML
  • 2. Xyleme
  • 3. Data Acquisition and Maintenance
  • XML Repository, Semantic Data Integration and
    Query Processing
  • 4. Query Subscription
  • Conclusion

3
1. The Web and XML
4
The Web today
  • Terabytes of data
  • A lot of public pages
      • 1 billion in 06/2000
      • several million servers
  • Private web: pages not publicly available
  • Deep web: data hidden behind forms

5
HTML: Hypertext Language
Text + presentation. Where is the data? (hard)
6
XML: Semistructured Data
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> </description>
  </product>
  ...
</product-table>
Data + structure. Semistructured: more flexible (easy)
7
XML: Tree Types
product-table
  product
    reference
    designation
    price
    description
  • Semantics and structure are in paths
      • product-table/product/reference
      • product-table/product/price
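
The idea that semantics and structure live in paths can be illustrated with a small sketch. It assumes Python's standard xml.etree.ElementTree module and a cut-down copy of the product-table sample; the function name label_paths and the choice to treat attributes as leaf steps are illustrative assumptions, not part of the original slides.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the product-table sample (assumed for illustration).
SAMPLE = """
<product-table>
  <product reference="X23">
    <designation>camera</designation>
    <price unit="Dollars">359.99</price>
    <description/>
  </product>
</product-table>
"""


def label_paths(xml_text):
    """Return the sorted set of label paths (elements and attributes)."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        paths.add(path)
        # Assumption: an attribute is treated as one more step in the path.
        for attr in node.attrib:
            paths.add(f"{path}/{attr}")
        for child in node:
            walk(child, path)

    walk(root, "")
    return sorted(paths)


if __name__ == "__main__":
    for p in label_paths(SAMPLE):
        print(p)
```

Two documents with different element order or nesting depth can then be compared through their path sets rather than through their raw text.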

8
2. A Dynamic Warehouse for the XML Data of the Web
  • Xyleme

9
Xyleme Research
  • Project Xyleme at INRIA (1999-2000)
  • Explore XML, the Web, and DBMS technology to make the Web a
    knowledge database
  • INRIA
  • Sophie Cluet: databases (OQL)
  • Serge Abiteboul: semi-structured data, web
  • Guy Ferran: ex-O2 Technology
  • Mannheim University
  • Guido Moerkotte
  • Université d'Orsay
  • Marie-Christine Rousset
  • CNAM
  • Dan Vodislav

10
Xyleme Company
  • Started September 2000 (25 employees at the end of 2001)
  • Market challenges
      • Few XML documents are available on the Web (because
        of weak software support)
      • The company is focusing on private XML
      • Press, publishers, financial data, biology
  • Technology
      • Scalability for large amounts of data
      • Internet (focus) / intranet support
      • Monitoring and version management
      • Heterogeneous data integration

11
Architecture
  • Cluster of PCs
  • Developed with Linux and C++
  • Communications
      • local: CORBA
      • external: HTTP
  • Distribution between autonomous machines
  • Now: Web Services

12
Functional Architecture
[Figure labels: Internet, Web Interface, Query Processor, Repository and Index Manager]
13
Architecture
[Figure labels: Internet, Ethernet]
14
3. Data Acquisition and Maintenance, Page Importance
15
Goals
  • Discover XML pages on the web that are of
    interest to customers
  • For this, crawl the web (HTML + XML)
  • Keep them up to date
  • Do this under bounded resources
      • Memory for known URLs
      • Bandwidth

16
Life Cycle of a page in Xyleme
  • The URL of D is discovered as a link in another
    page (or published by a customer)
  • The page scheduler decides to read D
  • The metadata of D is read
      • type, last_date_update...
  • The document D is loaded
  • The document D is (re)read regularly

17
Main Issues
  • Loading of pages
      • we can load up to 5 million pages/day on a
        standard PC
      • main cost is the Internet connection
  • Metadata management (access to disk)
  • Page scheduling
      • decide which page to read or refresh next

18
Page Importance
  • Definition: important pages are those linked to by
    important pages
  • Offline algorithm (used by Google)
  • Our online algorithm
      • (M. Preda, S. Abiteboul, G. Cobena)
      • does not require maintaining graph information
      • faster convergence with focused crawling
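
The recursive definition of importance is a fixpoint, which the offline approach computes by iterating over a stored link graph. The sketch below shows only that offline style, as a plain power iteration; it is not the online algorithm credited on this slide, and the damping factor, iteration count, and function name are assumptions chosen for illustration.

```python
def page_importance(links, damping=0.85, iterations=50):
    """Offline fixpoint: a page is important if important pages link to it.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative assumptions.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline share of importance.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            # A page without outgoing links spreads its importance evenly.
            targets = links.get(page) or list(pages)
            share = damping * rank[page] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(page_importance(graph).items()):
        print(page, round(score, 3))
```

Note that this version needs the whole link graph in memory on every pass, which is exactly the cost an online algorithm that keeps no graph information avoids.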

19
(XML Repository, Semantic Data Integration and Query Processing)
20
Querying Language
  • Today: a mix of OQL and XQL
  • We are currently moving to XQuery (which is also
    a mix of OQL and XQL)
  • Example query:
      Select boss/Name, boss/Phone
      From comp in BusinessDomain,
           boss in comp//Manager
      Where comp/Product contains Xyleme
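
To make the query above concrete, here is a minimal sketch of how it could be evaluated by hand with Python's xml.etree.ElementTree. The two sample documents, the element names Company and Dept, and the reading of "contains" as a substring test on the element text are all assumptions for illustration; the slides do not specify them.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents standing in for the BusinessDomain collection.
BUSINESS_DOMAIN = [
    "<Company><Product>Xyleme warehouse</Product>"
    "<Dept><Manager><Name>Alice</Name><Phone>555-0100</Phone></Manager></Dept>"
    "</Company>",
    "<Company><Product>Other tool</Product>"
    "<Manager><Name>Bob</Name><Phone>555-0199</Phone></Manager>"
    "</Company>",
]


def managers_for_product(documents, keyword):
    """Select boss/Name, boss/Phone for companies whose Product contains keyword."""
    results = []
    for text in documents:
        comp = ET.fromstring(text)
        # Where comp/Product contains keyword (assumed: substring of the text).
        if not any(keyword in (p.text or "") for p in comp.findall("Product")):
            continue
        # boss in comp//Manager: Manager elements at any depth below comp.
        for boss in comp.findall(".//Manager"):
            results.append((boss.findtext("Name"), boss.findtext("Phone")))
    return results


if __name__ == "__main__":
    print(managers_for_product(BUSINESS_DOMAIN, "Xyleme"))
```

The // step is what lets the same query match a Manager nested inside a Dept in one document and directly under the root in another.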

21
Web Heterogeneity
  • Semantic domains, e.g., cinema
      • Many possible types for data in this domain, many
        DTDs
  • Semantic integration
      • one abstract DTD for the domain
      • gives the illusion that the system maintains a
        homogeneous database for this domain
      • 1 domain = 1 abstract DTD

22
Indexing
  • Standard inverted index
      • word → documents that contain this word
  • Xyleme index
      • word → elements that contain this word
      • document + element identifier
  • Goal: more work can be performed without
    accessing data
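
The difference between the two index shapes can be sketched as follows. This is a toy in-memory version, assuming Python's xml.etree.ElementTree; the element identifier used here (position in document order) and the function names are hypothetical, since the slides do not describe the actual identifier scheme.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_element_index(documents):
    """Map each word to the set of (doc_id, element_id) pairs containing it.

    element_id is the element's position in document order
    (a hypothetical identifier scheme chosen for this sketch).
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        root = ET.fromstring(text)
        for element_id, element in enumerate(root.iter()):
            for word in (element.text or "").lower().split():
                index[word].add((doc_id, element_id))
    return index


def elements_with_all(index, words):
    """Find elements containing every word, using only the index."""
    postings = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*postings) if postings else set()


if __name__ == "__main__":
    docs = {
        "d1": "<product><designation>camera</designation>"
              "<price>359.99</price></product>",
        "d2": "<note>camera bag</note>",
    }
    idx = build_element_index(docs)
    print(sorted(idx["camera"]))
    print(elements_with_all(idx, ["camera", "bag"]))
```

Because each posting names an element and not just a document, a conjunction such as "camera" and "bag" in the same element is answered by intersecting postings, without opening the documents.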

23
4. Change Control
24
The Web changes all the time
  • Data acquisition and maintenance
      • keep the warehouse up to date
  • Version management
      • representation and storage of changes
  • Change monitoring
      • query subscription

25
Subscription Language
  • SQL-like language based on atomic events.
  • Combines the use of monitoring queries and
    continuous queries.
  • The language can be extended by adding new types
    of atomic events.
  • Uses the XML query language for continuous
    queries (see "Querying the XML Documents of the Web",
    V. Aguilera, S. Cluet, F. Boiscuvier, Tech. Report)

26
Example
  • subscription myPaintings
  • (what are the new painting entries on the Musée
    d'Orsay site?)
  • monitoring newPainting
  • select URL
  • where URL extends www.musee-orsay.fr/
  • and <painter> contains Monet
  • (manage the changes in the expositions)
  • continuous delta Exposition
  • select ... from ... where
  • when monthly
  • notify daily (send me a daily report)

Atomic events
27
Step 1: Atomic Event Detection
5 million pages/day
atomic event 46: URL matches pattern www.musee-orsay.fr/
atomic event 67: XML document contains the tag <painter> with the value Monet
[Figure labels: metadata manager, HTML parser, XML loader, complex event detection]
28
Step 2: Complex Event Detection
Millions of page alerts per day; millions of subscriptions
complex event 12 = 67 and 46 (XML document contains the tag <painter> with value Monet and URL matches pattern www.musee-orsay.fr/)
[Figure labels: HTML parser, XML loader, complex event detection]
29
Step 3: Notification Processor
complex event detection
Reporter
continuous queries
30
Architecture
[Figure labels: Xyleme Query Processor, documents, Trigger Engine, Xyleme Alerter, Xyleme Reporter, Complex Event Detection, Reporter, Subscription Manager, Xyleme Subscription Manager, SQL, Web Browser]
31
Complex Events Algorithm
  • The formal problem is NP-hard
  • We proposed several possible algorithms
  • Experimental (simulation) results demonstrated the
    effectiveness of our solutions
  • The Hash-Tree based algorithm is well suited to
    our application
      • 10 million complex events
      • 1 million atomic events
      • 100 atomic events detected per document
      • 0.8 ms to process a document; 2 million
        documents per day
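
The matching task behind these figures is: given the atomic events detected on one document, find every complex event (a conjunction of atomic events) that is fully satisfied. The sketch below uses a prefix tree over sorted atomic-event ids. It is in the spirit of a tree-based lookup but is not claimed to be the Hash-Tree algorithm named above, whose details the slides do not give; the class and method names are assumptions.

```python
class EventTrie:
    """Stores complex events as sorted sequences of atomic event ids."""

    def __init__(self):
        self.children = {}     # atomic event id -> EventTrie
        self.complex_ids = []  # complex events completed at this node

    def add(self, complex_id, atomic_ids):
        """Register a complex event as the conjunction of atomic_ids."""
        node = self
        for atomic_id in sorted(set(atomic_ids)):
            node = node.children.setdefault(atomic_id, EventTrie())
        node.complex_ids.append(complex_id)

    def match(self, detected):
        """Return complex events whose atomic events are all in detected."""
        events = sorted(set(detected))
        found = []

        def walk(node, start):
            found.extend(node.complex_ids)
            # Only follow branches labelled with a detected atomic event.
            for i in range(start, len(events)):
                child = node.children.get(events[i])
                if child is not None:
                    walk(child, i + 1)

        walk(self, 0)
        return sorted(found)


if __name__ == "__main__":
    trie = EventTrie()
    trie.add(12, [67, 46])  # <painter> contains Monet and URL matches
    trie.add(13, [46])      # hypothetical: URL matches only
    trie.add(14, [46, 99])  # hypothetical: needs an undetected event
    print(trie.match([46, 67, 5]))
```

The walk only visits branches that correspond to detected atomic events, so the work per document depends on the roughly 100 detected events rather than on the millions of registered complex events.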

32
Alerters
  • Each Alerter can be viewed as a plug-in that acts
    on a document flow.
  • All sorts of atomic events can be detected: URL
    pattern detection, keywords, XPath expressions,
    page rank
  • Can be distributed.
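
The plug-in view of alerters can be sketched as a small interface applied to a stream of documents. Everything named here (Document, Alerter, UrlPrefixAlerter, KeywordAlerter, run_alerters) is hypothetical and chosen for illustration; only the event numbers 46 and 67 and the Musée d'Orsay pattern come from the earlier slides.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Protocol, Set, Tuple


@dataclass
class Document:
    url: str
    text: str


class Alerter(Protocol):
    """A plug-in that maps one document to the atomic events it raises."""

    def detect(self, doc: Document) -> Set[int]: ...


class UrlPrefixAlerter:
    def __init__(self, patterns: Dict[int, str]):
        self.patterns = patterns  # atomic event id -> URL prefix

    def detect(self, doc: Document) -> Set[int]:
        return {eid for eid, prefix in self.patterns.items()
                if doc.url.startswith(prefix)}


class KeywordAlerter:
    def __init__(self, keywords: Dict[int, str]):
        self.keywords = keywords  # atomic event id -> keyword

    def detect(self, doc: Document) -> Set[int]:
        words = set(doc.text.lower().split())
        return {eid for eid, kw in self.keywords.items()
                if kw.lower() in words}


def run_alerters(docs: Iterable[Document],
                 alerters: List[Alerter]) -> Iterator[Tuple[str, Set[int]]]:
    """Single pass over the document flow, collecting atomic events."""
    for doc in docs:
        events: Set[int] = set()
        for alerter in alerters:
            events |= alerter.detect(doc)
        yield doc.url, events


if __name__ == "__main__":
    flow = [Document("www.musee-orsay.fr/expo", "Painting by Monet")]
    plugins = [UrlPrefixAlerter({46: "www.musee-orsay.fr/"}),
               KeywordAlerter({67: "Monet"})]
    print(list(run_alerters(flow, plugins)))
```

Since each alerter sees one document at a time and keeps no shared state, the alerters could run on separate machines, each handling a slice of the document flow.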

33
Some Advanced Alerts
  • Process document flow (single pass)
  • Full strings
  • Context Stack
  • Reversed look-up
  • XML Alerts
  • Reversed XPath expressions
  • Dual context stack for / and //

34
Versions
  • Objectives
      • Temporal queries (persistent identification of
        nodes)
      • Version some documents or some sites (store a
        delta)
      • Change monitoring (query changes)
  • We proposed a representation of changes
      • Change-Centric Management of Versions (VLDB
        2001)
  • We developed a diff algorithm for XML
      • Detecting Changes in XML Documents,
        G. Cobena, S. Abiteboul, A. Marian, ICDE 2002 (San
        Jose)

35
Conclusion and Perspectives
  • Focus crawling on important pages
  • Refine notion of importance
  • Improve important pages discovery
  • Improve Change control accuracy
  • Semantic web
  • Real-time advanced processing

36
Thank you