Title: Web Monitoring
Web Monitoring
Organization
- Introduction
- What is there to monitor?
- Why monitor?
- Some applications of web monitoring
- Web archiving
  - An experience: the archiving of the French web
  - Page importance and change frequency
- Creation of a warehouse using web resources
  - An experience: the Xyleme project
  - Monitoring in Xyleme
- Conclusion
1. Introduction
The Web Today
- Billions of pages, millions of servers
- Query: keywords to retrieve URLs
  - Imprecise: query results are useless for further processing
- Applications based on ad-hoc wrapping
  - Expensive, incomplete, short-lived, not adapted to the Web's constant changes
- Poor quality
  - Cannot be trusted (spamming, rumors)
- Often stale
  - Our vision of it is often out-of-date
- Hence the importance of web monitoring
The HTML Web Structure
(Figure; source: IBM, AltaVista, Compaq)
HTML: Percentage Covered by Crawlers
(Figure; source: searchenginewatch.com)
So much for the world's knowledge
- A lot of the information is private
- Partial
  - Web robots miss a lot of data: not HTML, or HTML that is not read
  - The hidden web contains a lot of the useful knowledge and is also not reached
- Low quality
  - Most of what is on the web is junk anyway
- Stale
  - The knowledge of web robots is most of the time not up-to-date
- Do not junk the technology, improve it: more monitoring!
What data is there to monitor?
- Documents: HTML but also doc, pdf, ps
- Many data exchange formats, such as ASN.1 and BibTeX
- The new official data exchange format: XML
- The hidden web: database queries behind forms or scripts
- Multimedia data (ignored here)
- Public vs. private (Intranet, or Internet with password)
- Static vs. dynamic
The need to monitor the web
- The web changes all the time
- Users are often as interested in the changes as in the data
- To do what?
  - Keep the vision of the web up-to-date: news, real-time data (stock market, weather reports...), new publications, new prices
  - Discover new sites, new pages: new companies, new products, new offers
  - Be aware of changes that may be of interest or have an impact on your business
Analogy: databases
- Databases
  - Query: instantaneous vision of the data
  - Trigger: alert/notification of some changes of interest
  - Based on direct control of the data in database systems
- Web
  - Query: based on robots and often stale indexes
  - Notification of changes of interest
  - Based on monitoring by an outsider
Web vs. database monitoring
- Quantity of data: larger on the web
- Knowledge of data: structure and semantics are known in databases, not on the web
- Reliability and availability: high in databases, null on the web
- Data granularity: tuple vs. page in HTML or element in XML
- Change control
  - Databases: support from the data sources (triggers)
  - Web: no support, pull only in general (a polling sketch follows)
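Since the web offers no trigger mechanism, an outside monitor must emulate notification by polling. Below is a minimal sketch of such a pull-based monitor; the polling interval and the hash-the-whole-page change test are illustrative assumptions, not a prescribed design.

```python
import hashlib
import time
import urllib.request

def fingerprint(url: str) -> str:
    """Fetch a page and hash its content; a new hash signals a change."""
    with urllib.request.urlopen(url) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

def monitor(urls: list[str], interval_s: int = 3600) -> None:
    """Poll the URLs forever, emulating a database trigger from the outside."""
    last: dict[str, str] = {}
    while True:
        for url in urls:
            try:
                h = fingerprint(url)
            except OSError:
                continue  # no availability guarantee on the web
            if url in last and last[url] != h:
                print(f"change detected: {url}")  # the 'notification'
            last[url] = h
        time.sleep(interval_s)
```

Unlike a real trigger, such a monitor only sees the page at poll time: a change that is made and undone between two polls is lost, one reason web monitoring is weaker than database monitoring.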
2. Some applications of web monitoring
Comparative shopping
- A unique entry point to many catalogs
- A data integration problem
- Main issue: wrapping of web catalogs
  - Semi-automatic, so limited to a few sites
  - Simpler, and closer to automatic, with XML
- Alternatives
  - Mediation when the data change very fast (prices and availability of plane tickets)
  - Warehousing otherwise (same for houses)
- ⇒ need to monitor changes
Web surveillance
- Applications
  - Anti-criminal and anti-terrorist intelligence, e.g., detecting suspicious acquisitions of chemical products
  - Business intelligence, e.g., discovering potential customers, partners, competitors
- Find the data (crawl the web)
- Monitor the changes: new pages, deleted pages, changes within a page
- Classify information and extract data of interest: data mining, text understanding, knowledge representation and extraction, linguistics. Very AI.
Copy tracking
- Example: a press agency wants to check that people are not illegally publishing copies of its wires
- Need to react fast to changes: an illegal copy of a wire may last only a couple of days
(Figure: a flow of candidate documents, coming from queries to a search engine or from a specific crawl, goes through (1) a pre-filter, (2) a step that slices the document, and (3) a filter that performs the detection. A copy-detection sketch follows.)
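The slides do not spell out how the filter detects copies; a standard technique for this task is shingling, which compares documents by their sets of overlapping word windows. A minimal sketch, with illustrative texts and an assumed similarity threshold:

```python
import re

def shingles(text: str, k: int = 4) -> set:
    """The set of k-word shingles (overlapping word windows) of a text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def resemblance(a: str, b: str) -> float:
    """Jaccard similarity of shingle sets; values near 1.0 suggest a copy."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

wire = "press agency wire talks resume on monday between the two parties"
candidate = "Talks resume on Monday between the two parties, a source said"
if resemblance(wire, candidate) > 0.3:  # threshold is an illustrative assumption
    print("candidate looks like a copy of the wire")
```

The "slice the document" step in the figure would correspond to shingling fragments of a candidate page rather than the whole page, so that a wire embedded in a larger page is still caught.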
Web portal management
- Standard portal management
  - Unreachable pages
  - Dangling pointers
  - Incorrect pages (e.g., pages that do not parse)
- Detection of interesting pages on the web
- Etc.
- Portal archiving
- Subscription and notification
Web archiving
- We will discuss an experience in archiving the French web
Creation of a data warehouse with resources found on the web
- We will discuss some work in the Xyleme project on the construction of XML warehouses
3. Web archiving
- An experience towards the archiving of the French web, with the Bibliothèque Nationale de France
Dépôt légal (legal deposit)
- Books have been archived since 1537, by a decision of King François I
- The Web is an important and valuable source of information that should also be archived
- What is different?
  - Number of content providers: 148,000 sites vs. 5,000 publishers
  - Quantity of information: millions of pages, plus video/audio
  - Quality of information: lots of junk
  - Relationship with publishers: freedom of publication vs. the traditional push model
  - Updates and changes occur continuously
  - The perimeter is unclear: what is the French web?
Goal and Scope
- Provide future generations with a representative archive of the cultural production
- Provide material for cultural, political, and sociological studies
- The mission is to archive a wide range of material, because nobody knows what will be of interest for future research
- Side issue: legal proof of publication
- In traditional publication, publishers filter contents; there is no such filter on the web
Similar Projects
- The Internet Archive (www.archive.org)
  - The Wayback Machine
  - The largest collection of versions of web pages
- The human-selection-based approach
  - Select a few hundred sites and choose a periodicity of archiving
  - Australia and Canada
- The Nordic experience
  - Use robot crawlers to archive a significant part of the surface web
  - Sweden, Finland, Norway
  - Problems encountered: lack of updates of archived pages between two snapshots; the hidden Web
Orientation of our experiment
- Goals
  - Cover a large portion of the French web
  - Automatic content gathering is necessary
  - Adapt robots to provide a continuous archiving facility
  - Have frequent versions of the sites, at least for the most important ones
- Issues
  - The notion of important sites
  - Building a coherent Web archive
  - Discovering and managing important sources of the deep Web
First issue: the perimeter
- The perimeter of the French Web: contents edited in France
- Many criteria may be used
  - The French language; but many French sites use English (e.g., INRIA), and many French-speaking sites are from other French-speaking countries or regions (e.g., Quebec)
  - Domain name or resource locators: .fr sites; but many French sites are also in .com or .org
  - Site address: the physical location of the web servers, or the address of the owner
- Criteria other than the perimeter
  - Little interest in commercial sites
  - Possible interest in foreign sites that discuss French issues
- Purely librarian-driven does not scale; purely automatic does not work ⇒ involve librarians (a heuristic sketch follows)
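As a rough illustration of how such criteria could be combined before librarians decide the borderline cases, here is a sketch; the `Site` fields, the weights, and the scores are assumptions for illustration, not the project's actual procedure.

```python
from dataclasses import dataclass

@dataclass
class Site:
    domain: str          # e.g. "www.inria.fr"
    language: str        # dominant language detected in the pages
    server_country: str  # physical location of the web server

def perimeter_score(site: Site) -> float:
    """Heuristic 'is this part of the French web?' score; weights are illustrative."""
    score = 0.0
    if site.domain.endswith(".fr"):
        score += 0.5
    if site.language == "fr":          # weak alone: Quebec sites are French too
        score += 0.3
    if site.server_country == "FR":    # weak alone: hosting is easily offshore
        score += 0.2
    return score

# A .fr site written in English (like INRIA) still scores high enough to keep;
# mid-range scores would be routed to librarians rather than decided automatically.
print(perimeter_score(Site("www.inria.fr", "en", "FR")))  # 0.7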
Second issue: site vs. page archiving
- The Web
  - Physical granularity: HTML pages
  - The problem is inconsistent data and links: read page P, and one week later the pages pointed to by P may not exist anymore
- Logical granularity?
  - A snapshot view of a web site
  - What is a site? INRIA is both www.inria.fr and www-rocq.inria.fr; www.multimania.com is the provider of many sites
  - There are technical issues (rapid firing, ...)
Importance of data
What is page importance?
- The Louvre homepage is more important than an unknown person's homepage
- Important pages are pointed to by
  - other important pages
  - many unimportant pages
- This leads to Google's definition of PageRank
  - Based on the link structure of the web
  - Used with remarkable success by Google for ranking results
- Useful but not sufficient for web archiving
Page Importance
- We will look at this in more detail; a sketch of the computation follows
- A nice algorithmic issue
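As a sketch of the algorithmic core, here is the classical fixpoint (power-iteration) computation of PageRank on a toy link graph. The damping factor 0.85 and the handling of dangling pages are standard simplifications, not necessarily the exact variant used in the experiment.

```python
def page_rank(links: dict, d: float = 0.85, iters: int = 50) -> dict:
    """Iterate 'a page is important if important pages point to it' to a fixpoint.

    links maps each page to the list of pages it points to; every link target
    is assumed to be a key of links. d is the damping factor.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - d) / n for p in pages}
        for p, outs in links.items():
            for q in outs:                      # p shares its importance among
                new[q] += d * rank[p] / len(outs)  # the pages it points to
        rank = new  # dangling pages (no outlinks) simply lose their mass here
    return rank

# Toy graph: two pages point to louvre.fr, which therefore ranks highest.
graph = {"a.fr": ["louvre.fr"], "b.fr": ["louvre.fr"], "louvre.fr": ["a.fr"]}
print(page_rank(graph))
```

The (1 - d)/n term models a random surfer who occasionally jumps to a uniformly chosen page, which is what makes the iteration converge.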
Site vs. pages
- Limitations of page importance
  - Google page importance works well when links have strong semantics
  - More and more web pages are automatically generated, and most of their links have little semantics
- A further limitation: refresh at the page level presents drawbacks
- So we also use the link topology between sites, and not only between pages (see the sketch below)
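One simple way to obtain a site-level topology is to collapse the page graph by host name and keep only inter-site links. Grouping by host is itself an approximation, since, as noted earlier, www.inria.fr and www-rocq.inria.fr are one site while www.multimania.com hosts many. A minimal sketch:

```python
from urllib.parse import urlparse

def site_graph(page_links: dict) -> dict:
    """Collapse a page-level link graph into a site-level graph keyed by host."""
    graph = {}
    for page, outs in page_links.items():
        src = urlparse(page).netloc
        graph.setdefault(src, set())
        for out in outs:
            dst = urlparse(out).netloc
            if dst != src:               # keep only inter-site links,
                graph[src].add(dst)      # which carry more semantics
                graph.setdefault(dst, set())
    return graph

pages = {"http://www.inria.fr/a": ["http://www.louvre.fr/home"],
         "http://www.louvre.fr/home": ["http://www.louvre.fr/mona-lisa"]}
print(site_graph(pages))
# {'www.inria.fr': {'www.louvre.fr'}, 'www.louvre.fr': set()}
```

The same importance computation can then run on this much smaller site graph.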
Experiments
- Crawl
  - We used between 2 and 8 PCs running Xyleme crawlers for 2 months
  - Discovery and refresh were based on page importance
- Discovery
  - We looked at more than 1.5 billion of the most interesting web pages
  - We discovered more than 15 million .fr pages (about 1.5%)
  - We discovered 150,000 .fr sites
- Refresh
  - Important pages were refreshed more often
  - The refresh rate also takes into account the change rate of pages (a priority sketch follows below)
- Analysis of the relevance of site importance for librarians
  - Comparison with rankings produced by librarians showed a strong correlation
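The slides state that refresh combines page importance with the observed change rate. Here is a minimal sketch of one way to turn the two into a crawl priority, assuming page changes follow a Poisson process; the multiplicative combination and the example pages are illustrative assumptions, not Xyleme's actual policy.

```python
import math

def refresh_priority(importance: float, changes_per_day: float,
                     days_since_visit: float) -> float:
    """Importance-weighted probability that the page changed since the last visit."""
    p_changed = 1.0 - math.exp(-changes_per_day * days_since_visit)
    return importance * p_changed

# The crawler refreshes the highest-priority pages first.
pages = [("louvre.fr", 0.9, 0.1, 7.0),        # important, slowly changing
         ("news.example.fr", 0.3, 2.0, 1.0)]  # less important, changes often
pages.sort(key=lambda p: refresh_priority(*p[1:]), reverse=True)
print([p[0] for p in pages])
```

With these numbers the important, long-unvisited page wins even though the other page changes more often, which matches the stated policy of refreshing important pages more frequently.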
Issues and ongoing work: other criteria for importance
- Take into account indications by archivists
  - They know best; a man-machine-interface issue
- Use classification and clustering techniques to refine the notion of site
- Frequent use of infrequent words: find pages dedicated to specific topics
- Text weight: find pages with text content vs. raw data pages
- Others
5. Conclusion
Web monitoring
- A very challenging problem
  - Complexity due to the volume of data and the number of users
  - Complexity due to heterogeneity
  - Complexity due to the lack of cooperation from data sources
- Many issues to investigate
- Add monitoring features to the web
  - We will see one: Active XML