Web Structure Mining
1
Web Structure Mining
  • Web structure mining examines the link hierarchy
    of a site in order to improve navigation.

2
Project background
  • ROBACC
  • The purpose of the ROBACC project, as listed on
    its website, is as follows:
  • Validate web sites against open standards (W3C
    and IETF)
  • Provide quantitative accessibility indicators
  • Demonstrate use of Open Source
  • Raise awareness, to encourage competition for
    better web sites, publication tools, and
    requirement specifications
  • EIAO is based on the ROBACC experience
  • The goal of the EIAO project is to contribute to
    better e-accessibility for all citizens and to
    increase the use of standards for online
    resources.

3
Previous work on this project
  • The project so far has resulted in a CGI program
    which classifies each link on a given website.
  • A website is modeled as a directed graph (see the
    sketch below)
  • Node = document
  • Edge = hyperlink
  • From the graph, different quantitative
    measurements are made, and a model is created
    from the measurements.
  • Textual representation of the quantitative
    measurements.
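
As a rough illustration of this model, the sketch below builds a
small site graph as an adjacency list, with documents as nodes and
hyperlinks as directed edges. Python is assumed here, and the page
names are hypothetical; the slides do not show the CGI program's
actual implementation.

    # Sketch only: Python assumed, page names hypothetical.
    class SiteGraph:
        def __init__(self):
            # page URL -> set of page URLs it links to
            self.links = {}

        def add_link(self, from_page, to_page):
            # record a hyperlink (directed edge) between two documents
            self.links.setdefault(from_page, set()).add(to_page)
            self.links.setdefault(to_page, set())

        def pages(self):
            return list(self.links)

    # Build a tiny four-page site and list its nodes (documents).
    site = SiteGraph()
    site.add_link("index.html", "about.html")
    site.add_link("index.html", "products.html")
    site.add_link("about.html", "products.html")
    site.add_link("products.html", "index.html")
    print(site.pages())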

4
Shortest path
5
Previous work on this project
  • Measurements made by the crawler (a sketch of how
    they can be computed follows below)
  • Average cycle length
  • The average cycle length is by definition (almost)
    the same as the average distance between two
    nodes.
  • Longest of the shortest paths
  • The longest of the shortest paths is by definition
    the largest value found when all the shortest
    paths in the graph are computed (the diameter of
    the graph).
  • The shortest path problem is the problem of
    finding a series of edges connecting two nodes
    such that the number of edges is as small as
    possible.
  • Average distance between two pages
  • The average distance between two nodes in a graph
    is by definition the average length of all the
    shortest paths.
  • Average number of internal links per page
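
A minimal sketch of how these measurements could be computed,
assuming Python and a plain adjacency-list graph (the slides do not
show how the actual crawler implements them): breadth-first search
gives the shortest paths, from which the longest shortest path, the
average distance, and the average number of internal links per page
follow.

    # Sketch only: Python and an adjacency-list graph are assumed.
    from collections import deque

    def distances_from(links, start):
        # breadth-first search: shortest distance (in links) from
        # `start` to every page reachable from it
        dist = {start: 0}
        queue = deque([start])
        while queue:
            page = queue.popleft()
            for nxt in links.get(page, ()):
                if nxt not in dist:
                    dist[nxt] = dist[page] + 1
                    queue.append(nxt)
        return dist

    def site_measurements(links):
        pages = list(links)
        shortest = []                 # lengths of all shortest paths
        for start in pages:
            d = distances_from(links, start)
            shortest.extend(v for k, v in d.items() if k != start)
        return {
            "pages": len(pages),
            # longest of the shortest paths (diameter of the graph)
            "longest shortest path": max(shortest),
            # average distance = mean length of all shortest paths
            "average distance": sum(shortest) / len(shortest),
            # average number of internal links per page (mean out-degree)
            "avg internal links": sum(len(v) for v in links.values()) / len(pages),
        }

    links = {"index.html": {"about.html", "products.html"},
             "about.html": {"index.html"},
             "products.html": {"contact.html"},
             "contact.html": {"index.html"}}
    print(site_measurements(links))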

6
Textual output
  • There are 4 pages on the site.
  • The average cycle (from a node and back to the
    same node) is 1.5 pages long.
  • The Longest of the shortest paths from a node to
    a new node is 2 pages long.
  • The average distance between two pages is
    1.69230769231 pages long.
  • There are on average 2.25 internal links per
    page.

7
Graph output
8
Literature review
  • Ji-Hyun Lee and Wei-Kun Shiu: a site's efficiency
    is measured by shortest path.
  • Efficiency decreases as the shortest path
    increases.
  • Users spend much more time even if the shortest
    path increases only slightly.
  • The probability of user mistakes increases with
    shortest path length.
  • Xiao Fang, University of Arizona
  • Effectiveness is the percentage of user-sought
    top-level web pages that can be easily accessed
    from the portal page (illustrated in the sketch
    below).
  • Efficiency measures the usefulness of hyperlinks
    placed in a portal page.
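
The quoted effectiveness definition can be illustrated with a small
sketch. This is only an illustration of the wording above, not Xiao
Fang's actual formulation, and the page names are made up.

    # Sketch only: illustration of the quoted definition, not Fang's
    # actual metric; page names are hypothetical.
    def effectiveness(sought_pages, easily_accessible_pages):
        # percentage of the user-sought top-level pages that can be
        # easily accessed from the portal page
        sought = set(sought_pages)
        if not sought:
            return 0.0
        reached = sought & set(easily_accessible_pages)
        return 100.0 * len(reached) / len(sought)

    # Users look for four top-level pages; three are easy to reach.
    print(effectiveness(["news", "products", "support", "careers"],
                        ["news", "products", "support"]))   # 75.0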

9
Problem description
  • Today a lot of time is spent searching for
    relevant information on the web. Some websites
    are easy to navigate, some are not.
  • Can the navigability of a website be determined
    by measuring the structure of the website?
  • Do the measured values correspond in any way with
    the user experience?

10
Project outline
  • Goal
  • The goal of this project is to analyze websites
    with diverse measured values, and find if/how
    these measurements correspond with the user
    experience.
  • The sites will be measured using the application
    created by last year's students.
  • Get last year's application up and running.

11
Project outline
  • Limitations
  • A number of sites will be measured, but not all
    of them analyzed. A diverse selection will be
    made so that a wide range of site types is
    represented.
  • The user experience will at first be based upon
    one user (me). However, if time allows, several
    users will be asked to give their opinion on the
    different sites.
  • Concentrate on internal links

12
Agenda
  • Get last year's project running
  • Measure a large selection of websites using last
    year's project
  • Make a diverse selection of websites from this
    collection; the selection should give as general
    a representation of the web as possible
  • Take notes as these sites are surfed, to document
    the user experience
  • Find out how these results correspond with users'
    experience of the sites
  • Compare values and notes

13
Plan of progress
14
Requirements
  • Requirements for the analysis
  • The crawler must be working
  • Pages with high, medium, and low measured values
    must be analyzed (diversity)
  • Only pages that are supported by the crawler can
    be used (there are issues with JavaScript and
    HTTPS).

15
Related work
  • PageRank (Brin and Page, 1998) (see the sketch
    after this list)
  • Used by the search engine Google. Interprets a
    link from page A to page B as a vote by page A
    for page B, and thus considers a page to be more
    important if the votes it obtains are from more
    important pages.
  • HITS (Kleinberg, 1998)
  • Assumption: if document A has a hyperlink to
    document B, then the author of document A thinks
    that document B contains valuable information
  • Hub: a web page that links to a collection of
    prominent sites on a common topic
  • Authority: a web page with authoritative content
    on a broad topic; a page pointed to by hubs
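
A minimal sketch of both ideas, assuming Python and simplified,
textbook-style update rules (not the exact published or production
algorithms): PageRank is computed by power iteration with a damping
factor, and HITS alternates between authority and hub updates.

    # Sketch only: simplified update rules, small hypothetical graph.
    def pagerank(links, damping=0.85, iterations=50):
        # power iteration: a page is important if important pages link to it
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new = {p: (1.0 - damping) / n for p in pages}
            for page, outlinks in links.items():
                targets = outlinks or pages   # dangling page: share with all
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            rank = new
        return rank

    def hits(links, iterations=50):
        # hubs point to good authorities; authorities are pointed to by hubs
        pages = list(links)
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
            total = sum(auth.values()) or 1.0
            auth = {p: v / total for p, v in auth.items()}
            hub = {p: sum(auth[q] for q in links[p] if q in auth) for p in pages}
            total = sum(hub.values()) or 1.0
            hub = {p: v / total for p, v in hub.items()}
        return hub, auth

    links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
    print(pagerank(links))
    print(hits(links))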