Improving Web Search Results with Data Quality Metrics

1
Improving Web Search Results with Data Quality
Metrics
  • Jennifer Lynch

2
Introduction
  • Low data quality is a common problem on the Web
    today.
  • Search engines could incorporate data quality
    (DQ) metrics to rank search results by DQ scores.
  • Need automatic metrics, not user reviews!
  • We will propose some new DQ metrics, along with
    some guidelines about which DQ metrics to use in
    various situations.

3
Previous Work: DQ Metrics
  • Source: Zhu and Gauch

4
Previous Work: DQ Metrics
  • Source: Eppler and Muenzenmayer

5
Proposed Metrics
  • Accuracy: spelling and grammar checks
    • What about "errors" that are really correct?
    • Examine the HTML: e.g., sentence fragments are okay if they
      are list items or headers
    • Use different spell checkers for different domains
      (subjects) to recognize scientific words, etc.
    • Perhaps count the number of unique misspelled words to
      lessen the penalty for proper names, etc.
    • Use the ratio of errors to the total number of words
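
A minimal sketch of the error-ratio idea above, assuming a tiny stand-in word list. The names KNOWN_WORDS and accuracy_score are hypothetical and not from the presentation; a real checker would load a full dictionary, possibly one per subject domain. Counting each misspelled word once is what lessens the penalty for a proper name that appears many times.

```python
import re

# Hypothetical stand-in for a real dictionary; a production checker would
# load a full word list, possibly a different one per subject domain.
KNOWN_WORDS = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}


def accuracy_score(text, known_words=KNOWN_WORDS):
    """Return 1 - (unique unknown words / total words), a value in [0, 1]."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 1.0  # assumption: a page with no words is not penalized
    # Each unknown word counts once, however often it is repeated.
    unique_errors = {w for w in words if w not in known_words}
    return 1.0 - len(unique_errors) / len(words)
```

With this word list, a repeated proper name such as "Lynch" is counted as a single error regardless of how often it occurs.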

6
Proposed Metrics
  • Objectivity
    • Limited advertising: count pop-ups, keywords like
      'Sponsored Links' followed by links, etc.
    • Use of personal pronouns could indicate bias or opinion
    • Domain of Web pages: .edu is perhaps more objective than
      .com?
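
The personal-pronoun signal could be measured roughly as follows. This is a sketch under stated assumptions: the pronoun list and the choice to score objectivity as one minus the pronoun ratio are illustrative, not something the presentation specifies.

```python
import re

# Assumed list of first- and second-person pronouns; extend as needed.
PERSONAL_PRONOUNS = {
    "i", "me", "my", "mine", "we", "us", "our", "ours",
    "you", "your", "yours",
}


def pronoun_ratio(text):
    """Fraction of words in the text that are personal pronouns."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in PERSONAL_PRONOUNS)
    return hits / len(words)


def objectivity_score(text):
    """Assumed scoring: fewer personal pronouns means a higher score."""
    return 1.0 - pronoun_ratio(text)
```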

7
Proposed Metrics
  • Authority
    • Personal Web pages could have less authority: e.g.,
      Geocities sites; look for markers of personal pages in the
      URL, etc.
    • Top-level domain: .edu, .gov more authoritative than .com?
  • Verifiability (Traceability)
    • Keyword such as 'Source' or 'References' followed by one or
      more links
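
Both checks can be approximated from the URL and the raw HTML. The sketch below is illustrative only: the numeric values in TLD_SCORES, the halving penalty, the use of "~" in the path as a personal-page marker, and the 200-character window between keyword and link are all assumptions, and the function names are hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumed scores; the slides only suggest .edu and .gov may outrank .com.
TLD_SCORES = {"edu": 1.0, "gov": 1.0, "com": 0.5}


def authority_score(url):
    """Score a URL by top-level domain, penalizing likely personal pages."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    tld = host.rsplit(".", 1)[-1]
    score = TLD_SCORES.get(tld, 0.5)  # assumption: unknown TLDs score like .com
    # Assumed markers of a personal page: "~" in the path or a Geocities host.
    if "~" in parsed.path or "geocities" in host:
        score *= 0.5
    return score


def has_cited_sources(html):
    """True if 'Source(s)' or 'References' is followed shortly by a link."""
    pattern = r"\b(?:sources?|references)\b.{0,200}?<a\s[^>]*href="
    return re.search(pattern, html, flags=re.IGNORECASE | re.DOTALL) is not None
```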

8
Proposed Metrics
  • Understandability
    • Reading level: prompt user for desired level (or 3-4 ranges
      of levels)
  • Readability
    • Colors of text and background
    • Text size (not too big or small)
    • Limited bold, italics, and emphasized text
    • ALT attribute defined for images
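
Of the readability checks listed, the ALT check is the simplest to automate. A minimal sketch using Python's standard html.parser follows; the class and function names are hypothetical, and treating a page with no images as fully compliant is an assumption.

```python
from html.parser import HTMLParser


class AltCounter(HTMLParser):
    """Counts <img> tags and how many of them define an alt attribute."""

    def __init__(self):
        super().__init__()
        self.images = 0
        self.with_alt = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1
            # An empty alt="" still counts as defined here.
            if any(name == "alt" for name, _ in attrs):
                self.with_alt += 1


def alt_coverage(html):
    """Fraction of images with an alt attribute, in [0, 1]."""
    parser = AltCounter()
    parser.feed(html)
    parser.close()
    if parser.images == 0:
        return 1.0  # assumption: no images means nothing to penalize
    return parser.with_alt / parser.images
```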

9
Proposed Metrics
  • Navigation (Convenience)
    • Presence of site map or table of contents
    • Link to homepage on each page
    • Visited links change color
    • Links easily identifiable (different color)
    • Navigational links on large pages (e.g., links to top,
      bottom, subsections)

10
Proposed Metrics
  • Availability
    • Amount of time Web page is down
    • Bandwidth, traffic, loading time
    • Special technology required (e.g., Java, Flash, other
      plug-ins)
  • Timeliness (Currency)
    • How often Web page is updated
  • Documentation
    • Presence of comments

11
Implementation Issues
  • Speed
    • Do not run the metrics at each search; store the information
      each time the crawler visits a page
  • How to determine what pages make up a Web site? See Amento
    et al.
  • Need to be able to combine scores of metrics
    • Normalize each metric to yield a score between 0 and 1, and
      use an average or weighted average
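
One way to combine the scores, sketched under assumptions: the slides do not say how to normalize, so clamped min-max scaling is used here as a stand-in, and the function names and the example weights are hypothetical.

```python
def normalize(value, lo, hi):
    """Min-max scale value into [0, 1], clamping values outside [lo, hi]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))


def combined_score(scores, weights=None):
    """Weighted average of per-metric scores that are already in [0, 1].

    scores:  dict mapping metric name to score
    weights: optional dict mapping metric name to weight; metrics missing
             from it get weight 0. With no weights, a plain average is used.
    """
    if not scores:
        raise ValueError("no scores to combine")
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights.get(name, 0.0) for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(scores[name] * weights.get(name, 0.0) for name in scores)
    return weighted / total_weight
```

For metrics where lower raw values are better, such as loading time, the normalized value can be inverted (1 minus the result) before combining.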

12
Selecting DQ Metrics
  • If using all of the metrics is too restrictive,
    we could group related metrics into simplified
    dimensions and ask the user to select the most
    important qualities in an advanced search.
  • Following is a proposed list of options to
    present the user along with what DQ dimensions
    make up each list option.

13
User-Selected Metrics
  • Up-to-Date: timeliness
  • Accurate/Error-Free: accuracy, objectivity,
    verifiability, applicability
  • Reputable: popularity, authority, documentation
    (maybe?)
  • Easy to Use: availability, accessibility,
    navigation, speed
  • Well-Written and Laid Out: information-to-noise
    ratio, cohesiveness, conciseness, readability
  • Reading Level: understandability
  • Secure: security
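
The option list above maps naturally onto a lookup table. A sketch of how an advanced-search form might resolve the user's selections into the DQ dimensions to score; the names OPTION_DIMENSIONS and dimensions_for are hypothetical.

```python
# Mapping taken from the list above; "documentation" is marked "(maybe?)"
# on the slide and is included here only for illustration.
OPTION_DIMENSIONS = {
    "Up-to-Date": ["timeliness"],
    "Accurate/Error-Free": [
        "accuracy", "objectivity", "verifiability", "applicability",
    ],
    "Reputable": ["popularity", "authority", "documentation"],
    "Easy to Use": ["availability", "accessibility", "navigation", "speed"],
    "Well-Written and Laid Out": [
        "information-to-noise ratio", "cohesiveness", "conciseness",
        "readability",
    ],
    "Reading Level": ["understandability"],
    "Secure": ["security"],
}


def dimensions_for(selected_options):
    """Return the DQ dimensions for the chosen options, without duplicates."""
    seen = []
    for option in selected_options:
        for dim in OPTION_DIMENSIONS[option]:
            if dim not in seen:
                seen.append(dim)
    return seen
```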

14
Selecting DQ Metrics
  • Alternatively, the search engine could choose
    metrics based on the type of search.
  • The following metrics are suggested as the most
    important metrics for various types of searches.

15
Selecting DQ Metrics
  • News: timeliness, accuracy, authority,
    availability, speed, and sometimes objectivity
  • Sports: same as news, but probably not objectivity
  • Business and Money: timeliness, authority,
    accuracy, navigation, security
  • Health, Science, and Technology: accuracy,
    verifiability, applicability, objectivity,
    understandability, and sometimes timeliness

16
Selecting DQ Metrics
  • Entertainment: timeliness, availability,
    accessibility, navigation
    • Amento et al. found that the number of images and the
      number of pages on a site are good metrics for
      entertainment!
  • Games: accessibility, availability,
    maintainability, documentation, popularity
  • Travel: timeliness, authority, accuracy,
    readability, navigation, speed, availability,
    security (if booking flights, etc.)
  • e-Commerce: security, availability,
    accessibility, navigation

17
Future Work
  • Do the proposed metrics yield improved search
    results?
  • Should the average be weighted? What metrics are
    most important?
  • Are the proposed metrics really the best for each
    type of search?
  • Would users actually use DQ features in an
    advanced search?

18
Important References
  • Previously studied Web DQ metrics
    • X. Zhu and S. Gauch, "Incorporating Quality Metrics in
      Centralized/Distributed Information Retrieval on the World
      Wide Web," in Proceedings of the 23rd Annual International
      ACM SIGIR Conference on Research and Development in
      Information Retrieval, 2000, pp. 288-295.
    • M. Eppler and P. Muenzenmayer, "Measuring Information
      Quality in the Web Context: A Survey of State-of-the-Art
      Instruments and an Application Methodology," in Proceedings
      of the 7th International Conference on Information Quality,
      MIT, 2002, pp. 187-196.
    • B. Amento, L. Terveen, and W. Hill, "Does 'Authority' Mean
      Quality? Predicting Expert Quality Ratings of Web
      Documents," in Proceedings of the 23rd Annual International
      ACM SIGIR Conference on Research and Development in
      Information Retrieval, 2000, pp. 296-303.

19
Important References
  • Interesting study on colors and readability
    • A. Hill and L. Scharff, "Readability of Web sites with
      Various Foreground/Background Color Combinations, Font
      Types, and Word Styles," in Proceedings of the 11th
      National Conference of Undergraduate Research, 1997.
  • Guidelines for assessing DQ of Web pages manually
    • A. Smith, "Evaluation of Information Sources," online
      document, 2005 Oct 27 (cited 2005 Nov 7). Available:
      http://www.vuw.ac.nz/staff/alastair_smith/evaln/evaln.htm
    • E. Byrne, "Evaluate Web Resources," online document,
      1999 Nov (rev. 2000 Jun 25; cited 2005 Nov 7). Available:
      http://www.clubi.ie/webserch/resources/