Monitoring du Web (presentation transcript)
1
Monitoring du Web
2
Organization
  • Introduction
  • What is there to monitor?
  • Why monitor?
  • Some applications of web monitoring
  • Web archiving
  • An experience: the archiving of the French web
  • Page importance and change frequency
  • Creation of a warehouse using web resources
  • An experience: the Xyleme project
  • Monitoring in Xyleme
  • Conclusion

3
1. Introduction
4
The Web Today
  • Billions of pages, millions of servers
  • Query: keywords to retrieve URLs
  • Imprecise: query results are useless for further
    processing
  • Applications based on ad-hoc wrapping
  • Expensive, incomplete, short-lived, not adapted
    to the Web's constant changes
  • Poor quality
  • Cannot be trusted: spamming, rumors
  • Often stale
  • Our vision of it is often out-of-date
  • Hence the importance of web monitoring

5
The HTML Web Structure
[Figure: structure of the HTML web. Source: IBM, AltaVista, Compaq]
6
HTML: Percentage Covered by Crawlers
[Figure: percentage of the HTML web covered by crawlers. Source: searchenginewatch.com]
7
So much for the world knowledge
  • A lot of the information is private
  • Partial
  • Web robots miss a lot of data: non-HTML, or HTML
    that is not read
  • The hidden web contains a lot of the useful
    knowledge and is also not reached
  • Low quality
  • Most of what is on the web is junk anyway
  • Stale
  • The knowledge of web robots is most of the time
    not up-to-date
  • Do not junk the technology, improve it!
  • More monitoring

8
What data is there to monitor?
  • Documents: HTML but also doc, pdf, ps
  • Many data exchange formats such as ASN.1, BibTeX
  • New official data exchange format: XML
  • Hidden web: database queries behind forms or
    scripts
  • Multimedia data (ignored here)
  • Public vs. private (Intranet, or Internet with
    password)
  • Static vs. dynamic

9
The need to monitor the web
  • The web changes all the time
  • Users are often as interested in changes as in
    the data itself
  • To do what?
  • Keep the vision of the web up-to-date
  • News, real-time data (stock market, weather
    reports...), new publications, new prices
  • Discover new sites, new pages
  • New companies, new products, new offers
  • Be aware of changes that may be of interest or
    have an impact on your business

10
Analogy: databases
  • Databases
  • Query: instantaneous vision of the data
  • Trigger: alert/notification of some changes of
    interest
  • Based on direct control of the data in database
    systems
  • Web
  • Query: based on robots and often stale indexes
  • Notification of changes of interest
  • Based on monitoring by an outsider

11
Web vs. database monitoring
  • Quantity of data: larger on the web
  • Knowledge of the data
  • structure and semantics are known in databases,
    not on the web
  • Reliability and availability
  • High in databases, null on the web
  • Data granularity
  • Tuple vs. page in HTML, or element in XML
  • Change control
  • Databases: support from data sources/triggers
  • Web: no support, pull only in general (a minimal
    polling sketch follows)
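
Since the web gives an outsider no triggers, change control reduces to pulling: poll each page, hash what comes back, and notify when the hash differs from the previous visit. A minimal sketch in Python, with hypothetical URLs:

```python
# Minimal sketch of pull-based change monitoring (URLs are hypothetical).
import hashlib
import time
import urllib.request

WATCHED = ["https://example.org/a", "https://example.org/b"]  # hypothetical pages
last_seen = {}  # url -> digest of the last version we fetched

def fetch_digest(url):
    """Download a page and return a hash of its content."""
    with urllib.request.urlopen(url) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

while True:
    for url in WATCHED:
        digest = fetch_digest(url)
        if url in last_seen and last_seen[url] != digest:
            print("change detected:", url)  # the outsider's "notification"
        last_seen[url] = digest
    time.sleep(3600)  # poll hourly; choosing this frequency is the hard part
```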

12
2. Some applications of web monitoring
13
Comparative shopping
  • Unique entry point to many catalogs
  • A data integration problem
  • Main issue: wrapping of web catalogs (a sketch
    follows this list)
  • Semi-automatic, so limited to a few sites
  • Simpler, and closer to automatic, with XML
  • Alternatives
  • Mediation when data change very fast
  • e.g., prices and availability of plane tickets
  • Warehousing otherwise
  • Same for houses
  • → need to monitor changes
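
With XML catalogs, wrapping comes close to plain parsing. A minimal sketch, assuming a hypothetical `<catalog><item><name/><price/></item></catalog>` schema, that extracts prices and flags a change between two fetches:

```python
# Sketch of wrapping an XML catalog (the schema below is hypothetical).
import xml.etree.ElementTree as ET

def extract_prices(xml_text):
    """Map each item name to its price."""
    root = ET.fromstring(xml_text)
    return {item.findtext("name"): float(item.findtext("price"))
            for item in root.iter("item")}

old = extract_prices("<catalog><item><name>chair</name><price>40</price></item></catalog>")
new = extract_prices("<catalog><item><name>chair</name><price>35</price></item></catalog>")
for name in old.keys() & new.keys():
    if old[name] != new[name]:
        print(f"{name}: {old[name]} -> {new[name]}")  # a monitored price change
```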

14
Web surveillance
  • Applications
  • Anti-criminal and anti-terrorist intelligence,
    e.g., detecting suspicious acquisition of
    chemical products
  • Business intelligence, e.g., discovering
    potential customers, partners, competitors
  • Find the data (crawl the web)
  • Monitor the changes
  • new pages, deleted pages, changes in a page
  • Classify information and extract data of interest
  • Data mining, text understanding, knowledge
    representation and extraction, linguistic Very AI

15
Copy tracking
  • Example: a press agency wants to check that
    people are not illegally publishing copies of
    its wires
  • Need to react fast to changes: an illegal copy of
    a wire may last only a couple of days

[Pipeline diagram: a flow of candidate documents, obtained by
querying a search engine or by a specific crawl, is sliced into
fragments, pre-filtered, filtered, and passed to detection]
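
The slicing and pre-filtering steps can be approximated with classic near-duplicate detection; word shingles are one such technique (my choice for illustration, not necessarily the one used here). A minimal sketch:

```python
# Sketch of pre-filtering by word shingles (an assumed technique, for illustration).
def shingles(text, k=5):
    """Slice a document into overlapping k-word fragments."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b):
    """Jaccard similarity of the two shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

wire = "the ministry announced on monday a new plan for the coming year"
candidate = "sources say the ministry announced on monday a new plan for farmers"
if similarity(wire, candidate) > 0.2:  # pre-filter threshold, to be tuned
    print("candidate kept for fine-grained detection")
```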
16
Web portal management
  • Standard portal management
  • Unreachable pages
  • Dangling pointers
  • Incorrect pages (e.g., pages that do not parse)
  • Detection of interesting pages on the web
  • Etc.
  • Portal archiving
  • Subscription and notification

17
Web archiving
  • We will discuss an experience in archiving the
    French web

18
Creation of a data warehouse with resources found
on the web
  • We will discuss some work in the Xyleme project
    on the construction of XML warehouses

19
3. Web archiving
  • An experience towards the archiving of the French
    web, with the Bibliothèque Nationale de France

20
Dépôt légal (legal deposit)
  • Books have been archived since 1537, by a
    decision of King François the 1st
  • The Web is an important and valuable source of
    information that should also be archived
  • What is different?
  • Number of content providers: 148,000 sites vs.
    5,000 editors
  • Quantity of information: millions of pages,
    video/audio
  • Quality of information: lots of junk
  • Relationship with editors: freedom of publication
    vs. the traditional push model
  • Updates and changes occur continuously
  • The perimeter is unclear: what is the French web?

21
Goal and Scope
  • Provide future generations with a representative
    archive of the cultural production
  • Provide material for cultural, political, and
    sociological studies
  • The mission is to archive a wide range of
    material, because nobody knows what will be of
    interest to future research
  • Side issue: legal proof of publication
  • In traditional publication, publishers filter
    contents. There is no filter on the web

22
Similar Projects
  • The Internet Archive (www.archive.org)
  • The Wayback Machine
  • Largest collection of versions of web pages
  • Human-selection-based approach
  • select a few hundred sites and choose a
    periodicity of archiving
  • Australia and Canada
  • The Nordic experience
  • Use a robot crawler to archive a significant part
    of the surface web
  • Sweden, Finland, Norway
  • Problems encountered
  • Lack of updates of archived pages between two
    snapshots
  • The hidden Web

23
Orientation of our experiment
  • Goals
  • Cover a large portion of the French web
  • Automatic content gathering is necessary
  • Adapt robots to provide a continuous archiving
    facility
  • Have frequent versions of the sites, at least for
    the most important ones
  • Issues
  • The notion of important sites
  • Building a coherent Web archive
  • Discover and manage important sources of the deep
    Web

24
First issue: the perimeter
  • The perimeter of the French Web: contents edited
    in France
  • Many criteria may be used
  • The French language: but many French sites use
    English (e.g., INRIA), and many French-speaking
    sites are from other French-speaking countries or
    regions (e.g., Quebec)
  • Domain name or resource locator: .fr sites, but
    many are also in .com or .org
  • Site address: physical location of the web
    servers, or address of the owner
  • Criteria other than the perimeter
  • Little interest in commercial sites
  • Possible interest in foreign sites that discuss
    French issues
  • Purely librarian-driven does not scale
  • Purely automatic does not work → involve
    librarians (a scoring sketch follows)
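
No single criterion decides membership, so one way to mix automation with librarians is to combine the criteria into a score and route borderline sites to a human. A minimal sketch, with made-up weights and thresholds:

```python
# Sketch: combining perimeter criteria into one score (weights are illustrative).
def perimeter_score(domain, language, server_country):
    score = 0.0
    if domain.endswith(".fr"):
        score += 0.5              # domain-name criterion
    if language == "fr":
        score += 0.3              # language criterion (imperfect: INRIA, Quebec)
    if server_country == "FR":
        score += 0.2              # physical-location criterion
    return score

s = perimeter_score("www.inria.fr", "en", "FR")  # a French site publishing in English
if s >= 0.6:
    print("archive")             # confidently in the perimeter
elif s >= 0.3:
    print("ask a librarian")     # borderline: route to a human
else:
    print("skip")
```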

25
Second issue: site vs. page archiving
  • The Web
  • Physical granularity: HTML pages
  • The problem is inconsistent data and links
  • Read page P; one week later, the pages pointed to
    by P may not exist anymore
  • Logical granularity?
  • Snapshot view of a web site
  • What is a site? (see the sketch below)
  • INRIA is www.inria.fr, www-rocq.inria.fr, ...
  • www.multimania.com is the provider of many sites
  • There are technical issues (rapid firing, ...)
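
Deciding site boundaries can start from hostname heuristics, with hand-written rules for the known exceptions. A minimal sketch, with hypothetical rules for the two cases above, that maps a URL to a logical site:

```python
# Sketch: mapping a URL to a logical site (the rules below are hypothetical examples).
from urllib.parse import urlparse

def logical_site(url):
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # One organization, many hosts: group all *.inria.fr hosts into one site.
    if host == "inria.fr" or host.endswith(".inria.fr"):
        return "inria.fr"
    # One host, many sites: split a hosting provider by top-level directory.
    if host == "www.multimania.com":
        parts = parsed.path.strip("/").split("/")
        return f"multimania:{parts[0]}" if parts[0] else host
    return host

print(logical_site("http://www-rocq.inria.fr/verso/"))      # inria.fr
print(logical_site("http://www.multimania.com/somebody/"))  # multimania:somebody
```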

26
Importance of data
27
What is page importance?
  • 'Le Louvre' homepage is more important than an
    unknown person's homepage
  • Important pages are pointed to by
  • Other important pages
  • Many unimportant pages
  • This leads to Google's definition of PageRank
    (sketched below)
  • Based on the link structure of the web
  • Used with remarkable success by Google for
    ranking results
  • Useful but not sufficient for web archiving
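
The fixpoint behind PageRank can be computed by power iteration on the link graph. A minimal sketch on a toy graph, using 0.85 as the damping factor commonly cited for PageRank:

```python
# Sketch: PageRank by power iteration on a toy link graph.
links = {"louvre": ["unknown"], "unknown": ["louvre"],
         "fan1": ["louvre"], "fan2": ["louvre"], "fan3": ["louvre"]}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}
d = 0.85  # damping factor

for _ in range(50):  # iterate toward the fixpoint
    new = {p: (1 - d) / len(pages) for p in pages}
    for p, targets in links.items():
        for q in targets:
            new[q] += d * rank[p] / len(targets)  # p hands its rank to its targets
    rank = new

for p in sorted(rank, key=rank.get, reverse=True):
    print(f"{p}: {rank[p]:.3f}")  # 'louvre' comes out most important
```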

28
Page Importance
  • We will see this in more detail
  • A nice algorithmic issue

29
Site vs. pages
  • Limitations of page importance
  • Google page importance works well when links
    carry strong semantics
  • More and more web pages are automatically
    generated, and most of their links carry little
    semantics
  • A further limitation
  • Refresh at the page level presents drawbacks
  • So we also use the link topology between sites,
    not only between pages

30
Experiments
  • Crawl
  • We used between 2 and 8 PCs running Xyleme
    crawlers for 2 months
  • Discovery and refresh based on page importance
  • Discovery
  • We looked at more than 1.5 billion (of the most
    interesting) web pages
  • We discovered more than 15 million .fr pages,
    about 1.5% of them
  • We discovered 150,000 .fr sites
  • Refresh
  • Important pages were refreshed more often
  • The refresh rate also takes into account the
    change rate of pages (sketched after this list)
  • Analysis of the relevance of site importance for
    librarians
  • Comparison with rankings by librarians
  • Strong correlation with their rankings
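
A simple way to combine the two signals is to allocate crawl visits in proportion to importance times change rate. A minimal sketch of such a refresh scheduler, with made-up numbers:

```python
# Sketch: splitting a refresh budget by importance x change rate (numbers made up).
pages = {
    # page: (importance, expected changes per week)
    "louvre-home":  (0.9, 1.0),
    "news-ticker":  (0.5, 20.0),
    "static-legal": (0.7, 0.01),
}
BUDGET = 100  # fetches available this week

weight = {p: imp * rate for p, (imp, rate) in pages.items()}
total = sum(weight.values())
for p, w in weight.items():
    print(f"{p}: {round(BUDGET * w / total)} fetches this week")
# The fast-changing ticker dominates; the important but stable page is rarely refetched.
```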

31
Issues and ongoing work: other criteria for
importance
  • Take into account indications by archivists
  • They know best; this is a man-machine-interface
    issue
  • Use classification and clustering techniques to
    refine the notion of site
  • Frequent use of infrequent words (see the sketch
    below)
  • Find pages dedicated to specific topics
  • Text weight
  • Find pages with text content vs. raw-data pages
  • Others
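
"Frequent use of infrequent words" is the intuition behind TF-IDF weighting (my reading of the bullet, not a description of the project's exact method). A minimal sketch that scores the words of one page against a small background corpus:

```python
# Sketch: TF-IDF-style scoring to surface frequent use of infrequent words.
import math
from collections import Counter

corpus = [
    "the museum opens the paintings exhibition",
    "weather report rain sun and more rain",
    "paintings paintings museum curator exhibition louvre",
]
docs = [Counter(doc.split()) for doc in corpus]

def tfidf(word, doc):
    tf = doc[word] / sum(doc.values())      # how frequent in this page
    df = sum(1 for d in docs if word in d)  # how common across pages
    return tf * math.log(len(docs) / df)

page = docs[2]
for word in sorted(page, key=lambda w: tfidf(w, page), reverse=True)[:3]:
    print(word, round(tfidf(word, page), 3))  # topical words outrank common ones
```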

32
5. Conclusion
33
Web monitoring
  • A very challenging problem
  • Complexity due to the volume of data and the
    number of users
  • Complexity due to heterogeneity
  • Complexity due to the lack of cooperation from
    data sources
  • Many issues to investigate
  • Add monitoring features to the web
  • We will see one: Active XML