Title: Web Monitoring
Web Monitoring
Organization
- Introduction
- What is there to monitor?
- Why monitor?
- Some applications of web monitoring
- Web archiving
  - An experience: the archiving of the French web
  - Page importance and change frequency
- Creation of a warehouse using web resources
  - An experience: the Xyleme project
  - Monitoring in Xyleme
- Conclusion
1. Introduction
The Web Today
- Billions of pages, millions of servers
- Query: keywords to retrieve URLs
  - Imprecise: query results are useless for further processing
- Applications based on ad-hoc wrapping
  - Expensive, incomplete, short-lived, not adapted to the Web's constant changes
- Poor quality
  - Cannot be trusted (spamming, rumors)
- Often stale
  - Our vision of it is often out-of-date
- Hence the importance of web monitoring
The HTML Web Structure
(Figure; source: IBM, AltaVista, Compaq)
HTML: Percentage Covered by Crawlers
(Figure; source: searchenginewatch.com)
So much for the world's knowledge
- A lot of the information is private
- Partial
  - Web robots miss a lot of data: not HTML, or HTML that is not read
  - The hidden web contains a lot of the useful knowledge and is also not reached
- Low quality
  - Most of what is on the web is junk anyway
- Stale
  - The knowledge of web robots is most of the time not up-to-date
- Do not junk the technology, improve it: more monitoring!
What data is there to monitor?
- Documents: HTML but also doc, pdf, ps
- Many data exchange formats, such as ASN.1 and BibTeX
- The new official data exchange format: XML
- The hidden web: database queries behind forms or scripts
- Multimedia data (ignored here)
- Public vs. private (Intranet, or Internet with password)
- Static vs. dynamic
The need to monitor the web
- The web changes all the time
- Users are often as interested in the changes as in the data
- To do what?
  - Keep the vision of the web up-to-date: news, real-time data (stock market, weather reports...), new publications, new prices
  - Discover new sites, new pages: new companies, new products, new offers
  - Be aware of changes that may be of interest or have an impact on your business
Analogy: databases
- Databases
  - Query: instantaneous vision of the data
  - Trigger: alert/notification of some changes of interest
  - Based on direct control of the data in database systems
- Web
  - Query: based on robots and often stale indexes
  - Notification of changes of interest
  - Based on monitoring by an outsider
Web vs. database monitoring
- Quantity of data: larger on the web
- Knowledge of data: structure and semantics are known in databases, not on the web
- Reliability and availability: high in databases, null on the web
- Data granularity: tuple vs. page in HTML or element in XML
- Change control
  - Databases: support from the data sources (triggers)
  - Web: no support, pull only in general (a polling sketch follows)
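Since the web offers no trigger mechanism, an outside monitor must emulate notification by polling. Below is a minimal sketch of such a pull-based monitor; the polling interval and the hash-the-whole-page change test are illustrative assumptions, not a prescribed design.

```python
import hashlib
import time
import urllib.request

def fingerprint(url: str) -> str:
    """Fetch a page and hash its content; a new hash signals a change."""
    with urllib.request.urlopen(url) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

def monitor(urls: list[str], interval_s: int = 3600) -> None:
    """Poll the URLs forever, emulating a database trigger from the outside."""
    last: dict[str, str] = {}
    while True:
        for url in urls:
            try:
                h = fingerprint(url)
            except OSError:
                continue  # no availability guarantee on the web
            if url in last and last[url] != h:
                print(f"change detected: {url}")  # the 'notification'
            last[url] = h
        time.sleep(interval_s)
```

Unlike a real trigger, such a monitor only sees the page at poll time: a change that is made and undone between two polls is lost, one reason web monitoring is weaker than database monitoring.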
2. Some applications of web monitoring
Comparative shopping
- A unique entry point to many catalogs
- A data integration problem
- Main issue: wrapping of web catalogs
  - Semi-automatic, so limited to a few sites
  - Simpler, and closer to automatic, with XML
- Alternatives
  - Mediation when the data change very fast (prices and availability of plane tickets)
  - Warehousing otherwise (same for houses)
- ⇒ need to monitor changes
Web surveillance
- Applications
  - Anti-criminal and anti-terrorist intelligence, e.g., detecting suspicious acquisitions of chemical products
  - Business intelligence, e.g., discovering potential customers, partners, competitors
- Find the data (crawl the web)
- Monitor the changes: new pages, deleted pages, changes within a page
- Classify information and extract data of interest: data mining, text understanding, knowledge representation and extraction, linguistics. Very AI.
Copy tracking
- Example: a press agency wants to check that people are not illegally publishing copies of its wires
- Need to react fast to changes: an illegal copy of a wire may last only a couple of days
(Figure: a flow of candidate documents, coming from queries to a search engine or from a specific crawl, goes through (1) a pre-filter, (2) a step that slices the document, and (3) a filter that performs the detection. A copy-detection sketch follows.)
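The slides do not spell out how the filter detects copies; a standard technique for this task is shingling, which compares documents by their sets of overlapping word windows. A minimal sketch, with illustrative texts and an assumed similarity threshold:

```python
import re

def shingles(text: str, k: int = 4) -> set:
    """The set of k-word shingles (overlapping word windows) of a text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def resemblance(a: str, b: str) -> float:
    """Jaccard similarity of shingle sets; values near 1.0 suggest a copy."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

wire = "press agency wire talks resume on monday between the two parties"
candidate = "Talks resume on Monday between the two parties, a source said"
if resemblance(wire, candidate) > 0.3:  # threshold is an illustrative assumption
    print("candidate looks like a copy of the wire")
```

The "slice the document" step in the figure would correspond to shingling fragments of a candidate page rather than the whole page, so that a wire embedded in a larger page is still caught.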
Web portal management
- Standard portal management
  - Unreachable pages
  - Dangling pointers
  - Incorrect pages (e.g., pages that do not parse)
- Detection of interesting pages on the web
- Etc.
- Portal archiving
- Subscription and notification
Web archiving
- We will discuss an experience in archiving the French web
Creation of a data warehouse with resources found on the web
- We will discuss some work in the Xyleme project on the construction of XML warehouses
3. Web archiving
- An experience towards the archiving of the French web, with the Bibliothèque Nationale de France
Dépôt légal (legal deposit)
- Books have been archived since 1537, by a decision of King François I
- The Web is an important and valuable source of information that should also be archived
- What is different?
  - Number of content providers: 148,000 sites vs. 5,000 publishers
  - Quantity of information: millions of pages, plus video/audio
  - Quality of information: lots of junk
  - Relationship with publishers: freedom of publication vs. the traditional push model
  - Updates and changes occur continuously
  - The perimeter is unclear: what is the French web?
Goal and Scope
- Provide future generations with a representative archive of the cultural production
- Provide material for cultural, political, and sociological studies
- The mission is to archive a wide range of material, because nobody knows what will be of interest for future research
- Side issue: legal proof of publication
- In traditional publication, publishers filter contents; there is no such filter on the web
Similar Projects
- The Internet Archive (www.archive.org)
  - The Wayback Machine
  - The largest collection of versions of web pages
- The human-selection-based approach
  - Select a few hundred sites and choose a periodicity of archiving
  - Australia and Canada
- The Nordic experience
  - Use robot crawlers to archive a significant part of the surface web
  - Sweden, Finland, Norway
  - Problems encountered: lack of updates of archived pages between two snapshots; the hidden Web
Orientation of our experiment
- Goals
  - Cover a large portion of the French web
  - Automatic content gathering is necessary
  - Adapt robots to provide a continuous archiving facility
  - Have frequent versions of the sites, at least for the most important ones
- Issues
  - The notion of important sites
  - Building a coherent Web archive
  - Discovering and managing important sources of the deep Web
First issue: the perimeter
- The perimeter of the French Web: contents edited in France
- Many criteria may be used
  - The French language; but many French sites use English (e.g., INRIA), and many French-speaking sites are from other French-speaking countries or regions (e.g., Quebec)
  - Domain name or resource locators: .fr sites; but many French sites are also in .com or .org
  - Site address: the physical location of the web servers, or the address of the owner
- Criteria other than the perimeter
  - Little interest in commercial sites
  - Possible interest in foreign sites that discuss French issues
- Purely librarian-driven does not scale; purely automatic does not work ⇒ involve librarians (a heuristic sketch follows)
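As a rough illustration of how such criteria could be combined before librarians decide the borderline cases, here is a sketch; the `Site` fields, the weights, and the scores are assumptions for illustration, not the project's actual procedure.

```python
from dataclasses import dataclass

@dataclass
class Site:
    domain: str          # e.g. "www.inria.fr"
    language: str        # dominant language detected in the pages
    server_country: str  # physical location of the web server

def perimeter_score(site: Site) -> float:
    """Heuristic 'is this part of the French web?' score; weights are illustrative."""
    score = 0.0
    if site.domain.endswith(".fr"):
        score += 0.5
    if site.language == "fr":          # weak alone: Quebec sites are French too
        score += 0.3
    if site.server_country == "FR":    # weak alone: hosting is easily offshore
        score += 0.2
    return score

# A .fr site written in English (like INRIA) still scores high enough to keep;
# mid-range scores would be routed to librarians rather than decided automatically.
print(perimeter_score(Site("www.inria.fr", "en", "FR")))  # 0.7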
Second issue: site vs. page archiving
- The Web
  - Physical granularity: HTML pages
  - The problem is inconsistent data and links: read page P, and one week later the pages pointed to by P may not exist anymore
- Logical granularity?
  - A snapshot view of a web site
  - What is a site? INRIA is both www.inria.fr and www-rocq.inria.fr; www.multimania.com is the provider of many sites
  - There are technical issues (rapid firing, ...)
Importance of data
What is page importance?
- The Louvre homepage is more important than an unknown person's homepage
- Important pages are pointed to by
  - other important pages
  - many unimportant pages
- This leads to Google's definition of PageRank
  - Based on the link structure of the web
  - Used with remarkable success by Google for ranking results
- Useful but not sufficient for web archiving
Page Importance
- We will look at this in more detail; a sketch of the computation follows
- A nice algorithmic issue
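As a sketch of the algorithmic core, here is the classical fixpoint (power-iteration) computation of PageRank on a toy link graph. The damping factor 0.85 and the handling of dangling pages are standard simplifications, not necessarily the exact variant used in the experiment.

```python
def page_rank(links: dict, d: float = 0.85, iters: int = 50) -> dict:
    """Iterate 'a page is important if important pages point to it' to a fixpoint.

    links maps each page to the list of pages it points to; every link target
    is assumed to be a key of links. d is the damping factor.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - d) / n for p in pages}
        for p, outs in links.items():
            for q in outs:                      # p shares its importance among
                new[q] += d * rank[p] / len(outs)  # the pages it points to
        rank = new  # dangling pages (no outlinks) simply lose their mass here
    return rank

# Toy graph: two pages point to louvre.fr, which therefore ranks highest.
graph = {"a.fr": ["louvre.fr"], "b.fr": ["louvre.fr"], "louvre.fr": ["a.fr"]}
print(page_rank(graph))
```

The (1 - d)/n term models a random surfer who occasionally jumps to a uniformly chosen page, which is what makes the iteration converge.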
Site vs. pages
- Limitations of page importance
  - Google page importance works well when links have strong semantics
  - More and more web pages are automatically generated, and most of their links have little semantics
- A further limitation: refresh at the page level presents drawbacks
- So we also use the link topology between sites, and not only between pages (see the sketch below)
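One simple way to obtain a site-level topology is to collapse the page graph by host name and keep only inter-site links. Grouping by host is itself an approximation, since, as noted earlier, www.inria.fr and www-rocq.inria.fr are one site while www.multimania.com hosts many. A minimal sketch:

```python
from urllib.parse import urlparse

def site_graph(page_links: dict) -> dict:
    """Collapse a page-level link graph into a site-level graph keyed by host."""
    graph = {}
    for page, outs in page_links.items():
        src = urlparse(page).netloc
        graph.setdefault(src, set())
        for out in outs:
            dst = urlparse(out).netloc
            if dst != src:               # keep only inter-site links,
                graph[src].add(dst)      # which carry more semantics
                graph.setdefault(dst, set())
    return graph

pages = {"http://www.inria.fr/a": ["http://www.louvre.fr/home"],
         "http://www.louvre.fr/home": ["http://www.louvre.fr/mona-lisa"]}
print(site_graph(pages))
# {'www.inria.fr': {'www.louvre.fr'}, 'www.louvre.fr': set()}
```

The same importance computation can then run on this much smaller site graph.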
Experiments
- Crawl
  - We used between 2 and 8 PCs running Xyleme crawlers for 2 months
  - Discovery and refresh were based on page importance
- Discovery
  - We looked at more than 1.5 billion of the most interesting web pages
  - We discovered more than 15 million .fr pages (about 1.5%)
  - We discovered 150,000 .fr sites
- Refresh
  - Important pages were refreshed more often
  - The refresh rate also takes into account the change rate of pages (a priority sketch follows below)
- Analysis of the relevance of site importance for librarians
  - Comparison with rankings produced by librarians showed a strong correlation
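The slides state that refresh combines page importance with the observed change rate. Here is a minimal sketch of one way to turn the two into a crawl priority, assuming page changes follow a Poisson process; the multiplicative combination and the example pages are illustrative assumptions, not Xyleme's actual policy.

```python
import math

def refresh_priority(importance: float, changes_per_day: float,
                     days_since_visit: float) -> float:
    """Importance-weighted probability that the page changed since the last visit."""
    p_changed = 1.0 - math.exp(-changes_per_day * days_since_visit)
    return importance * p_changed

# The crawler refreshes the highest-priority pages first.
pages = [("louvre.fr", 0.9, 0.1, 7.0),        # important, slowly changing
         ("news.example.fr", 0.3, 2.0, 1.0)]  # less important, changes often
pages.sort(key=lambda p: refresh_priority(*p[1:]), reverse=True)
print([p[0] for p in pages])
```

With these numbers the important, long-unvisited page wins even though the other page changes more often, which matches the stated policy of refreshing important pages more frequently.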
Issues and ongoing work: other criteria for importance
- Take into account indications by archivists
  - They know best; a man-machine-interface issue
- Use classification and clustering techniques to refine the notion of site
- Frequent use of infrequent words: find pages dedicated to specific topics
- Text weight: find pages with text content vs. raw data pages
- Others
5. Conclusion
Web monitoring
- A very challenging problem
  - Complexity due to the volume of data and the number of users
  - Complexity due to heterogeneity
  - Complexity due to the lack of cooperation from data sources
- Many issues to investigate
- Add monitoring features to the web
  - We will see one: Active XML