1
Issues in Monitoring Web Data
  • Serge Abiteboul
  • INRIA and Xyleme
  • Serge.Abiteboul@inria.fr

2
Organization
  • Introduction
  • What is there to monitor?
  • Why monitor?
  • Some applications of web monitoring
  • Web archiving
  • An experience: the archiving of the French web
  • Page importance and change frequency
  • Creation of a warehouse using web resources
  • An experience: the Xyleme Project
  • Monitoring in Xyleme
  • Queries and monitoring
  • Conclusion

3
1. Introduction
4
The Web Today
  • Billions of pages, millions of servers
  • Query: keywords to retrieve URLs
  • Imprecise: query results are useless for further
    processing
  • Applications based on ad-hoc wrapping
  • Expensive, incomplete, short-lived, not adapted
    to the Web's constant changes
  • Poor quality
  • Cannot be trusted: spamming, rumors
  • Often stale
  • Our vision of it is often out-of-date
  • Importance of monitoring

5
The HTML Web Structure
[Figure. Source: IBM, AltaVista, Compaq]
6
HTML: Percentage covered by Crawlers
[Figure. Source: searchenginewatch.com]
7
So much for the world's knowledge
  • Most of the web is not reached by crawlers
    (hidden web)
  • Some of the public HTML pages are never read
  • Most of what is on the web is junk anyway
  • Our knowledge of it may be stale
  • Do not junk the technology; improve it!

8
What is there to monitor?
  • Documents: HTML, but also doc, pdf, ps
  • Many data exchange formats, such as ASN.1, BibTeX
  • The new official data exchange format: XML
  • Hidden web: database queries behind forms or
    scripts
  • Multimedia data (ignored here)
  • Public vs. private (Intranet, or Internet with passwords)
  • Static vs. dynamic

9
What is changing?
  • XML is coming
  • Universal data exchange format
  • Marriage of document and database worlds
  • Standard query language: XQuery
  • Quickly growing on Intranets, very slowly on the
    public web (less than 1%)
  • Web services are coming
  • Format for exporting services
  • Format for encapsulating queries
  • More semantics to be expected
  • RDF for data
  • WSDL/UDDI for services

10
What is not changing fast, or is even getting worse
  • Massive quantity of data, most of it junk
  • Lots of stale data
  • Very primitive HTML query mechanisms (keywords)
  • No real change control mechanism coming soon
  • Compare database queries (fresh data) with web
    search engines (possibly stale)
  • Compare database triggers (based on push) to web
    notification services (most of the time based on
    pull/refresh)

11
The need to monitor the web
  • The web changes all the time
  • Users are often as interested in changes as in the
    data itself: new products, new press articles, new
    prices
  • Discover new resources
  • Keep our vision of the web up-to-date
  • Be aware of changes that may be of interest or have
    an impact on our business

12
Analogy: databases
  • Databases
  • Query: an instantaneous vision of the data
  • Trigger: alert/notification of some changes of
    interest
  • Web
  • Query: monitoring is needed to give a correct answer
  • Monitoring to support alerts/notifications of
    changes of interest

13
Web vs. database monitoring
  • Quantity of data: larger on the web
  • Knowledge of the data
  • structure and semantics are known in databases
  • Reliability and availability
  • High in databases, essentially null on the web
  • Data granularity
  • Tuple vs. page in HTML or element in XML
  • Change control
  • Databases: support from the data sources (triggers)
  • Web: no support, pull only in general

14
2. Some applications of web monitoring
15
Comparative shopping
  • Unique entry point to many catalogs
  • Data integration problem
  • Main issue: wrapping of web catalogs
  • Semi-automatic, so limited to a few sites
  • Simpler, and closer to automatic, with XML
  • Alternatives
  • Mediation when data changes very fast
  • prices and availability of plane tickets
  • Warehousing otherwise → need to monitor changes

16
Web surveillance
  • Applications
  • Anti-criminal and anti-terrorist intelligence,
    e.g., detecting suspicious acquisitions of
    chemical products
  • Business intelligence, e.g., discovering
    potential customers, partners, competitors
  • Find the data (crawl the web)
  • Monitor the changes
  • new pages, deleted pages, changes within a page
  • Classify information and extract data of interest
  • Data mining, text understanding, knowledge
    representation and extraction, linguistics: very
    much an AI problem

17
Copy tracking
  • Example: a press agency wants to check that
    people are not publishing copies of its wires
    without paying

[Figure: copy-detection pipeline. A flow of candidate documents, obtained
by a query to a search engine or by a specific crawl, passes through a
pre-filter, each document is sliced, and a filter performs the final
detection.]
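One simple way to realize such a pre-filter is shingle-based overlap detection. The Python sketch below only illustrates the idea and is not the pipeline of the figure; the shingle size and the similarity threshold are assumptions.

# Sketch: slice documents into word shingles and flag probable copies.
# Shingle size and threshold are illustrative assumptions.

def shingles(text, k=5):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def is_probable_copy(original, candidate, threshold=0.5):
    a, b = shingles(original), shingles(candidate)
    if not a or not b:
        return False
    overlap = len(a & b) / min(len(a), len(b))   # containment-style score
    return overlap >= threshold

# Usage: pre-filter a flow of candidate pages fetched via a search-engine
# query or a specific crawl, keeping only probable copies for inspection.
wire = "Press agency wire text ..."
candidates = {"http://example.org/page": "Some page text ..."}
suspects = [url for url, text in candidates.items()
            if is_probable_copy(wire, text)]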
18
Web archiving
  • We will discuss an experience in archiving the
    French web

19
Creation of a data warehouse with resources found
on the web
  • We will discuss some work in the Xyleme project
    on the construction of XML warehouses

20
3. Web archiving
  • An experience towards the archiving of the French
    web, with the
  • Bibliothèque Nationale de France

21
Dépôt légal (legal deposit)
  • Books have been archived since 1537, by a decision of
    King François I
  • The Web is an important and valuable source of
    information that should also be archived
  • What is different?
  • Number of content providers: 148,000 sites vs.
    5,000 publishers
  • Quantity of information: millions of pages, plus
    video/audio
  • Quality of information: lots of junk
  • Relationship with publishers: freedom of publication
    vs. the traditional push model
  • Updates and changes occur continuously
  • The perimeter is unclear: what is the French web?

22
Goal and Scope
  • Provide future generations with a representative
    archive of the cultural production
  • Provide material for cultural, political,
    sociological studies
  • The mission is to archive a wide range of
    material because nobody knows what will be of
    interest for future research
  • In traditional publishing, publishers filter
    content; there is no such filter on the web

23
Similar Projects
  • The Internet Archive (www.archive.org)
  • The Wayback Machine
  • Largest collection of versions of web pages
  • Human-selection-based approach
  • select a few hundred sites and choose a
    periodicity of archiving
  • Australia and Canada
  • The Nordic experience
  • Use robot crawlers to archive a significant part
    of the surface web
  • Sweden, Finland, Norway
  • Problems encountered
  • Lack of updates of archived pages between two
    snapshots
  • The hidden Web

24
Orientation of our experiment
  • Goals
  • Cover a large portion of the French web
  • Automatic content gathering is necessary
  • Adapt robots to provide a continuous archiving
    facility
  • Have frequent versions of the sites, at least for
    the most important ones
  • Issues
  • The notion of important sites
  • Building a coherent Web archive
  • Discover and manage important sources of the deep Web

25
First issue: the perimeter
  • The perimeter of the French Web: content edited
    in France
  • Many criteria may be used
  • The French language, but many French sites use
    English (e.g., INRIA) and many French-speaking sites
    are from other French-speaking countries or
    regions (e.g., Quebec)
  • Domain name or resource locator: .fr sites, but
    many are also in .com or .org
  • Address of the site: physical location of the web
    servers, or address of the owner
  • Criteria other than the perimeter
  • Little interest in commercial sites
  • Possible interest in foreign sites that discuss
    French issues
  • A purely automatic approach does not work → involve
    librarians

26
Second issue: site vs. page archiving
  • The Web
  • Physical granularity: HTML pages
  • The problem is inconsistent data and links
  • Read page P; one week later, the pages pointed to by
    P may not exist anymore
  • Logical granularity?
  • Snapshot view of a web site
  • What is a site?
  • INRIA is www.inria.fr, www-rocq.inria.fr, ...
  • www.multimania.com is the provider of many sites
  • There are technical issues (rapid firing, ...)

27
Importance of data
28
What is page importance?
  • The Louvre's homepage is more important than an
    unknown person's homepage
  • Important pages are pointed to by
  • Other important pages
  • Many unimportant pages
  • This leads to Google's definition of PageRank
  • Based on the link structure of the web
  • Used with remarkable success by Google for
    ranking results
  • Useful but not sufficient for web archiving

29
Page Importance
  • Importance
  • Link matrix L
  • In short, page importance is the fixpoint X of
    the equation LX = X (a standard iteration is
    sketched below)
  • Storing the link matrix and computing page
    importance uses lots of resources
  • We developed a new, efficient technique to compute
    the fixpoint
  • Without having to store the link matrix
  • The technique adapts automatically to changes
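For concreteness, here is a minimal power-iteration sketch of that fixpoint computation in Python, assuming the whole link structure fits in memory as a dictionary. This is the textbook baseline, not the new technique mentioned above, which avoids storing the link matrix; the damping factor and iteration count are illustrative choices.

def page_importance(links, damping=0.85, iterations=50):
    """Power-iteration sketch of the fixpoint X = L X.
    links: dict mapping each page to the list of pages it points to."""
    pages = list(links)
    n = len(pages)
    x = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * x[p] / len(outs)
                for q in outs:
                    if q in nxt:
                        nxt[q] += share
            else:  # dangling page: spread its weight uniformly
                for q in pages:
                    nxt[q] += damping * x[p] / n
        x = nxt
    return x

# Example: pages a and b both point to c, which points back to a.
print(page_importance({"a": ["c"], "b": ["c"], "c": ["a"]}))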

30
Site vs. pages
  • Limitations of page importance
  • Google page importance works well when links have
    strong semantics
  • More and more web pages are automatically
    generated, and most of their links carry little semantics
  • A further limitation
  • Refresh at the page level presents drawbacks
  • So we also use the link topology between sites, and
    not only between pages

31
Experiments
  • Crawl
  • We used 2 to 8 PCs running Xyleme crawlers
    for 2 months
  • Discovery and refresh based on page importance
  • Discovery
  • We looked at more than 1.5 billion (the most
    interesting) web pages
  • We discovered more than 15 million .fr pages,
    about 1.5% of the web
  • We discovered 150,000 .fr sites
  • Refresh
  • Important pages were refreshed more often
  • The refresh also takes into account the change rate of pages
  • Analysis of the relevance of site importance for
    librarians
  • Comparison with rankings by librarians
  • Strong correlation with their rankings

32
Issues and ongoing work: other criteria for
importance
  • Take into account indications by archivists
  • They know best; this is a man-machine-interface issue
  • Use classification and clustering techniques to
    refine the notion of site
  • Frequent use of infrequent words
  • Find pages dedicated to specific topics
  • Text weight
  • Find pages with text content vs. raw data pages
  • Others

33
4. Creation of a Warehouse from Web data
  • The Xyleme Project

34
Xyleme in short
  • The Xyleme project
  • Initiated at INRIA
  • Joint work with researchers from Orsay, Mannheim
    and CNAM-Paris universities
  • The Xyleme company (www.xyleme.com)
  • Started in 2000
  • About 30 people
  • Mission: deliver a new generation of content
    technologies to unlock the potential of XML
  • Here, we focus on the Xyleme project

35
Goal of the Xyleme project
  • The focus is on XML data (but HTML is also handled)
  • Semantics
  • Understand tags, partition the Web into semantic
    domains, provide a simple view of each domain
  • Dynamicity
  • Find and monitor relevant data on the web
  • Control relevant changes in Web data
  • XML storage, indexing and queries
  • Efficiently manage millions of XML documents and
    process millions of simultaneous queries

36
Corporate information environment with Xyleme
[Figure: the Xyleme server (crawling and interpreting data, XML
repository, query engine) performs systematic updating from the Web and
serves the corporate information system for publishing, searches, and
queries.]
37
XML in short
  • Data exchange format
  • eXtensible Markup Language (child of SGML)
  • Promoted by the W3C and major industry players
  • An XML document is an ordered labeled tree
  • Other essential gadgets: Unicode, namespaces,
    attributes, pointers, typing (XML Schema)

38
XML magic in short
  • Presentation is given elsewhere (style sheet)
  • Semantics and structure are provided by labels
  • So it is easy to extract information
  • Universal format understood by more and more
    software (e.g., exported by most databases,
    read by more and more editors)
  • More and more tools available

39
It is easy to extract information
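As a small illustration of why labeled trees make extraction easy, the following Python sketch pulls names and prices out of a made-up product catalogue (the catalogue structure is invented for the example):

import xml.etree.ElementTree as ET

# Hypothetical XML catalogue; the tags carry the semantics.
doc = ET.fromstring("""
<catalogue>
  <product><name>Camera X</name><price>349</price></product>
  <product><name>Flash Y</name><price>99</price></product>
</catalogue>""")

# Extracting name/price pairs is one line per field.
for product in doc.findall("product"):
    print(product.findtext("name"), product.findtext("price"))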
40
4.1 Xyleme: Functionality and Architecture
41
The goal of the Xyleme project: a dynamic XML
data warehouse
  • Many research issues
  • Query Processor
  • Semantic Classification
  • Data Monitoring
  • Native Storage
  • XML Document Versioning
  • Automatic or user-driven XML acquisition
  • Graphical User Interface through the Web

42
Functional Architecture
[Figure: functional architecture, centered on the Query Processor and
the Repository and Index Manager.]
43
Architecture
[Figure: system architecture, with the Internet at the boundary.]
44
Prototype main choices
  • Network of Linux PCs
  • C on the server side
  • CORBA for communications between PCs
  • HTTP and SOAP for external
    communications
  • Exception for query processing

45
Scaling
  • Parallelism based on
  • Partitioning
  • XML documents
  • URL table
  • Indexes (semantic partitioning)
  • Memory replication
  • Autonomous machines (PCs)
  • Caches are used for data flow

46
4.2 Xyleme: Data Acquisition
47
Data Acquisition
  • The Xyleme crawler visits the HTML/XML web
  • Management of metadata on pages
  • Sophisticated strategy to optimize network
    bandwidth (a scheduling sketch follows below), based on
  • importance ranking of pages
  • change frequency and age of pages
  • publications (owners) and subscriptions (users)
  • Each crawler visits about 4 million pages per day
  • Each indexer may create the index for 1 million pages
    per day
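One possible way to picture such a strategy: score every known page by combining importance, estimated change frequency, and the time since the last read, then fetch the highest-scoring pages within the bandwidth budget. The Python sketch below is a made-up illustration of that idea, not the actual Xyleme scheduler; the scoring formula and page fields are assumptions.

import heapq, time

def refresh_score(page, now):
    # Made-up combination: important pages that change often and have
    # not been read for a long time come first.
    age = now - page["last_read"]
    return page["importance"] * page["change_rate"] * age

def next_pages_to_crawl(pages, budget, now=None):
    now = now or time.time()
    return heapq.nlargest(budget, pages, key=lambda p: refresh_score(p, now))

pages = [
    {"url": "http://a.example", "importance": 0.9, "change_rate": 0.5, "last_read": 0},
    {"url": "http://b.example", "importance": 0.1, "change_rate": 0.9, "last_read": 0},
]
for p in next_pages_to_crawl(pages, budget=1):
    print("crawl", p["url"])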

48
4.3 Xyleme: Change Control
49
Change Management
  • The Web changes all the time
  • Data acquisition
  • automatic and via publication
  • Monitoring
  • subscriptions
  • continuous queries
  • versions

50
Subscription
  • Users can subscribe to certain events, e.g.,
  • changes in all pages of a certain DTD or of a
    certain semantic domain
  • insertion of a new product in a particular
    catalog or in all catalogs with a particular DTD
  • They may request to be notified
  • at the time the event is detected by Xyleme
  • regularly, e.g., once a week
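A minimal sketch of how such subscriptions could be matched against detected change events; the event and subscription fields (DTD, semantic domain) are simplifying assumptions made for the example.

from dataclasses import dataclass

@dataclass
class Subscription:
    user: str
    dtd: str = None        # notify on changes to documents of this DTD
    domain: str = None     # or to documents of this semantic domain

def matches(sub, event):
    """event: dict with 'dtd' and 'domain' keys describing a changed document."""
    return ((sub.dtd is None or sub.dtd == event["dtd"]) and
            (sub.domain is None or sub.domain == event["domain"]))

subs = [Subscription("alice", dtd="catalog.dtd"),
        Subscription("bob", domain="tourism")]
event = {"dtd": "catalog.dtd", "domain": "business", "url": "http://shop.example"}
to_notify = [s.user for s in subs if matches(s, event)]
print(to_notify)   # ['alice'] -- notifications may be sent now or batched weekly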

51
Continuous Queries
  • Queries asked regularly or when some events are
    detected
  • send me each Monday the list of movies in
    Pariscope
  • send me each Monday the list of new movies in
    Pariscope
  • each time you detect that a new member is added
    to the Stanford DB-group, send me their lists of
    publications from their homepages

52
Versions and Deltas
  • Store snapshots of documents
  • For some documents, store changes (deltas); a toy
    sketch follows below
  • storage = last version + sequence of deltas
  • complete deltas reconstruct old versions
  • partial deltas allow changes to be sent to the user
    and allow refresh
  • Deltas are XML documents
  • so changes can be queried as standard data
  • Temporal queries
  • List of products that were introduced in this
    catalog since January 1st, 2002
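A toy sketch of the "last version + sequence of deltas" idea in Python. It works on flat dictionaries of element id → content rather than real XML deltas, but it shows how a complete delta can be undone to reconstruct an old version.

def diff(old, new):
    """Complete delta from old to new: old values are kept so it can be undone."""
    return {"updated": {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]},
            "inserted": {k: new[k] for k in new if k not in old},
            "deleted": {k: old[k] for k in old if k not in new}}

def undo(version, delta):
    """Reconstruct the previous version from the current one and a delta."""
    prev = dict(version)
    for k, (old_value, _) in delta["updated"].items():
        prev[k] = old_value
    for k in delta["inserted"]:
        del prev[k]
    prev.update(delta["deleted"])
    return prev

v1 = {"e1": "price: 10", "e2": "name: camera"}
v2 = {"e1": "price: 12", "e3": "name: flash"}
d = diff(v1, v2)
assert undo(v2, d) == v1   # storage = last version + sequence of deltas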

53
The Information Factory
[Figure: loaders bring documents from the Web into the repository
(documents and deltas, over time); change detection feeds the
subscription processor and the continuous queries, which send
notifications and results; version queries run against the repository.]
54
Results
  • Very efficient XML diff algorithm
  • computes the difference between consecutive versions
  • Representation of deltas based on an original
    naming scheme for XML elements
  • an element is assigned a unique identifier for
    its entire life (a rough sketch follows below)
  • a compact way of representing these IDs
  • Efficient versioning mechanism
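A deliberately simplified sketch of the persistent-identifier idea: elements that survive from one version to the next keep their identifier, so deltas can refer to them compactly. The matching rule used here (identical label and text) is only a stand-in for the real XML diff.

import itertools

_counter = itertools.count(1)

def assign_ids(new_elements, previous=None):
    """new_elements: list of (label, text) pairs for one document version.
    Returns a dict id -> element; an element that also existed in the
    previous version keeps the identifier it already had."""
    previous = previous or {}
    reverse = {elem: eid for eid, elem in previous.items()}
    ids = {}
    for elem in new_elements:
        eid = reverse.get(elem) or next(_counter)
        ids[eid] = elem
    return ids

v1 = assign_ids([("name", "camera"), ("price", "10")])
v2 = assign_ids([("name", "camera"), ("price", "12")], previous=v1)
# The <name> element keeps its ID; the changed <price> gets a new one here,
# which a real diff would instead record as an update of the same element.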

55
Results
  • Sophisticated monitoring algorithm
  • Detection of simple patterns (conjunctions) at
    the document level
  • Detection of changes between consecutive versions
    of the same document
  • Scales to dozens of crawlers loading millions of
    documents per day for a single monitor

56
Issues: languages for monitoring
  • In the spirit of temporal languages for
    relational databases
  • But
  • The data model is richer (trees vs. tables)
  • The context is richer: versions, continuous queries,
    monitoring of data streams

57
4.4 Xyleme: Semantic Data Integration
58
Data Integration
  • One application domain, several schemas
  • heterogeneous vocabularies and structures
  • Xyleme semantic integration:
  • gives the illusion that the system maintains a
    homogeneous database for this domain
  • abstracts a set of DTDs into a hierarchy of
    pertinent terms for a particular domain
    (business, culture, tourism, biology, ...)

59
Technology in short
  • Cluster DTDs into application domains
  • For an application domain, semi-automatically:
  • Organize tags into a hierarchy of concepts using
    thesauri such as WordNet and other linguistic
    tools
  • This provides the abstract DTD for the particular
    domain
  • Generate mappings between the concrete DTDs and the
    abstract one (a small sketch follows below)
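A minimal sketch of what a generated tag-to-concept mapping might look like; the vocabulary and the "catalog" domain are invented for the example, whereas the real mappings come from thesauri plus human validation.

# Hypothetical abstract concepts for a "catalog" domain and the
# concrete tags that map onto them.
CONCEPT_OF = {
    "cost": "price", "price": "price", "tarif": "price",
    "item": "product", "product": "product", "article": "product",
    "info": "description", "description": "description",
}

def abstract_path(concrete_path):
    """Translate a concrete tag path into a path over abstract concepts."""
    return "/".join(CONCEPT_OF.get(tag, tag) for tag in concrete_path.split("/"))

print(abstract_path("catalogue/item/cost"))   # catalogue/product/price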

60
4.5 Xyleme: Query Processing
61
Xyleme Query Language
  • A mix of OQL and XQL; the W3C standard will be
    used once there is one.

Select product/name, product/price
From doc in catalogue, product in doc/product
Where product//components contains "flash"
and product/description contains "camera"
62
Principle of Querying
A query on the abstract DTD is rewritten into a union of concrete
queries (possibly with joins), using the MAPPINGS between concrete and
abstract DTDs, e.g.:
  catalogue/product/price → d1//camera/price, d2/product/cost
  catalogue/product/description → d1//camera/description,
    d2/product/info, ref → d2/description
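And a sketch of the rewriting direction used at query time: one abstract path expands into the union of the concrete paths it maps to. The mapping table below simply mirrors the example above and is not the real mapping format.

# Mappings from abstract paths to concrete paths (illustrative).
MAPPINGS = {
    "catalogue/product/price": ["d1//camera/price", "d2/product/cost"],
    "catalogue/product/description": ["d1//camera/description", "d2/product/info"],
}

def to_concrete_queries(abstract_path):
    """Rewrite one abstract path as a union of concrete paths."""
    return MAPPINGS.get(abstract_path, [])

for path in to_concrete_queries("catalogue/product/price"):
    print(path)   # each concrete path is evaluated where the data lives,
                  # and the results are unioned (possibly with joins)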
63
Query Processing
  1. Partial translation, from abstract to concrete,
    to identify the machines with relevant data
  2. Algebraic rewriting: a linear search strategy based
    on simple heuristics; in priority, use in-memory
    indexes and minimize communication
  3. Decomposition into local physical subplans and
    their installation
  4. Execution of the plans
  5. If needed, relaxation

64
Query processing
  • Essential use of a smart index combining
    full-text and structure

65
4.6 Xyleme: Repository
66
Storage System
  • The Xyleme store
  • efficient storage of trees in variable-length
    records within fixed-length pages (a toy packing
    sketch follows below)
  • balancing of tree branches in case of overflow
  • minimizes the number of I/Os for direct access and
    scanning
  • a good compromise between compaction and access time
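A toy illustration of packing variable-length records into fixed-length pages (first-fit, with an arbitrary page size); the real store additionally balances tree branches when a record overflows.

PAGE_SIZE = 4096   # arbitrary fixed page length for the sketch

def pack_records(records, page_size=PAGE_SIZE):
    """First-fit packing of variable-length records into fixed-length pages.
    records: list of (record_id, byte_length)."""
    pages, current, used = [], [], 0
    for rid, length in records:
        if used + length > page_size and current:
            pages.append(current)
            current, used = [], 0
        current.append(rid)
        used += length
    if current:
        pages.append(current)
    return pages

# Direct access touches one page per record; scanning a subtree stored
# contiguously touches few pages.
print(pack_records([("n1", 1200), ("n2", 2500), ("n3", 900), ("n4", 3000)]))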

67
Tree Balancing in Xyleme Store
[Figure: a document tree stored across records 1 to 4; when a node has
more children than fit in a record, they spill over into additional
records.]
68
5. Conclusion
69
Web monitoring
  • Very challenging problem
  • Complexity due to the volume of data and the
    number of users
  • Complexity due to heterogeneity
  • Complexity due to lack of cooperation from data
    sources
  • Many issues to investigate

70
New directions
  • Active web sites
  • Friendly sites willing to cooperate
  • Web services provide the infrastructure
  • Support for triggers
  • Mobile data
  • Web sites on mobile devices
  • Issues of availability (device unplugged)
  • Issues in synchronization
  • Geography-dependent queries