Results and Challenges in Web Search Evaluation

1
Results and Challenges in Web Search Evaluation
  • David Hawking
  • Nick Craswell
  • Donna Harman

2
Problems
  • How do we evaluate different search engines?
  • Numerous algorithms and techniques exist; are they
    effective?
  • Do longer queries result in better answers?
  • Can link information result in better rankings?

3
What is TREC?
  • The Text REtrieval Conference (TREC) is co-sponsored
    by the National Institute of Standards and Technology
    (NIST) and the Defense Advanced Research Projects
    Agency (DARPA). Its purpose is to support research
    within the information retrieval community by
    providing the infrastructure necessary for large-scale
    evaluation of text retrieval methodologies.
  • TREC is overseen by a program committee consisting of
    representatives from government, industry, and
    academia. For each TREC, NIST provides a test set of
    documents and questions. Participants run their own
    retrieval systems on the data and return to NIST a
    list of the top-ranked documents retrieved. NIST pools
    the individual results, judges the retrieved documents
    for correctness, and evaluates the results. The TREC
    cycle ends with a workshop that is a forum for
    participants to share their experiences.

4
Do TREC systems work well on Web data?
5
Prior Work
  • In order to compare TREC retrieval systems with Web
    search engines, short queries (average 2.5 words) were
    fed to five well-known Web search engines. Of course,
    these engines were searching the current Web rather
    than the frozen snapshot. The top 20 results for each
    topic over the real Web were then judged.
  • The following results were obtained.
  • Table 1: P@20 performance for Web search engines,
    using 50 title-only queries (average 2.5 terms) and
    the real Web. P@20 is the proportion of the top 20
    documents retrieved which were judged relevant
    (sketched below). All documents for a query were
    judged by the same person using the same browser,
    regardless of whether they came from the VLC2 or from
    the real Web.
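  • A minimal sketch of this P@20 computation in Python
    (the function name, document identifiers, and example
    figures are illustrative, not taken from the tables):

    def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=20):
        """Proportion of the top-k retrieved documents judged relevant."""
        top_k = ranked_doc_ids[:k]
        hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
        return hits / k  # divide by k, the usual convention for P@k

    # Example: if 7 of the first 20 results were judged relevant,
    # P@20 for that query is 7 / 20 = 0.35.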

6
TREC Data
  • VLC2 Collection: a frozen snapshot of the Web.
  • The Internet Archive forms the basis of a TREC
    collection known as VLC2 (Very Large Collection,
    Second Edition).
  • The 18.5 million page, 100.426 gigabyte VLC2
    collection is the Web snapshot that was used in the
    TREC-8 Web Track.
  • Generally, the topic format comprises three fields:
    title, description, and narrative.

7
Example
  • <top>
  • <num> Number: 351
  • <title> Falkland petroleum exploration
  • <desc> Description:
  • What information is available on petroleum
    exploration in the South Atlantic near the Falkland
    Islands?
  • <narr> Narrative:
  • Any document discussing petroleum exploration in
    the South Atlantic near the Falkland Islands is
    considered relevant. Documents discussing petroleum
    exploration in continental South America are not
    relevant.
  • </top>
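  • A minimal sketch of splitting such a topic into its
    fields in Python (the regular expressions and exact
    labels assumed here are illustrative; real TREC topic
    files may differ slightly in layout):

    import re

    TOPIC = """<top>
    <num> Number: 351
    <title> Falkland petroleum exploration
    <desc> Description:
    What information is available on petroleum exploration
    in the South Atlantic near the Falkland Islands?
    <narr> Narrative:
    Any document discussing petroleum exploration in the
    South Atlantic near the Falkland Islands is considered
    relevant. Documents discussing petroleum exploration in
    continental South America are not relevant.
    </top>"""

    def parse_topic(text):
        """Split a TREC topic into number, title, description, narrative."""
        return {
            "num": re.search(r"<num>\s*Number:\s*(\d+)", text).group(1),
            "title": re.search(r"<title>\s*(.+)", text).group(1).strip(),
            "desc": re.search(r"<desc>\s*Description:(.*?)<narr>", text, re.S).group(1).strip(),
            "narr": re.search(r"<narr>\s*Narrative:(.*?)</top>", text, re.S).group(1).strip(),
        }

    print(parse_topic(TOPIC)["title"])  # -> Falkland petroleum exploration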

8
Methodology
  • Participants in the annual TREC conference must
    process a set of queries over a standard test
    collection of documents provided to them and submit
    ranked lists of documents to NIST for assessment by
    human judges.
  • The TREC approach to objective evaluation of
    effectiveness is to define a large set (at least 50)
    of statements of user need (called topics within
    TREC) and to use human judges to assess whether
    submitted pages are or are not relevant to the user's
    need. The title of the topic may be used as a query
    to the retrieval system, or longer queries may be
    derived from more or all of the topic (see the sketch
    after this slide's bullets). Regardless of what query
    is used, pages are judged against the full topic. The
    documents to be judged are assembled with the pooling
    method described on a later slide.
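  • A small sketch of deriving the three query lengths
    used in the later result tables from a parsed topic,
    reusing the hypothetical parse_topic fields from the
    earlier example (names are assumptions, for
    illustration only):

    def build_queries(topic):
        """Derive title-only, title+description, and full-topic queries."""
        return {
            "title_only": topic["title"],
            "title_desc": topic["title"] + " " + topic["desc"],
            "full_topic": topic["title"] + " " + topic["desc"] + " " + topic["narr"],
        }

    # Whichever of these query strings a system actually runs,
    # the returned pages are judged against the full topic.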

9
Advantages
  • Reproducible results.
  • Blind testing.
    - Document judges do not know which documents were
      retrieved by which systems.
    - Participating researchers do not find out which
      documents are relevant.
  • Sharing of relevance judgments across a large number
    of groups significantly reduces the total cost of
    evaluations.
  • Collaborative experiments.
    - Much more confidence can be placed in a similar
      result obtained by nine out of ten groups performing
      a common task.

10
Judging Issues
  • Relevance is always judged against the full topic
    description, and each document is judged independently
    of all others as either relevant or irrelevant.
  • Topics are assigned to judges on an arbitrary basis.
    All judgments for a particular topic are made by the
    same judge.
  • Every effort was made to ensure that the judgment
    conditions for the live Web documents were as close
    to identical as possible to those for the VLC2 Web
    documents.

11
Relevance Assessments
  • Relevance judgments are of critical importance to a
    test collection. For each topic it is necessary to
    compile a list of relevant documents.
  • TREC uses the pooling method to assemble the
    relevance assessments.

12
Pooling Method
  • A pool of possibly relevant documents is created by
    taking a sample of the documents selected by the
    various participating systems. This pool is then
    shown to the human assessor. The particular sampling
    method used in TREC is to take the top 100 documents
    retrieved in each submitted run for a given topic and
    merge them into the pool for assessment (sketched
    below). This is a valid sampling technique since all
    the systems use ranked retrieval methods, with the
    documents most likely to be relevant returned first.
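  • A minimal sketch of this pooling step in Python,
    assuming each submitted run is a best-first ranked
    list of document identifiers per topic (the data
    layout and the run and document names in the example
    are illustrative):

    def build_pool(runs, topic_id, depth=100):
        """Merge the top-'depth' documents of every submitted run for a topic."""
        pool = set()  # duplicates across runs are judged only once
        for run in runs.values():  # run maps topic_id -> [doc_id, ...], best first
            pool.update(run.get(topic_id, [])[:depth])
        return pool  # the unique documents shown to the human assessor

    # Example (hypothetical runs and documents):
    # runs = {"sysA": {"351": ["d3", "d7", "d1"]},
    #         "sysB": {"351": ["d7", "d9"]}}
    # build_pool(runs, "351", depth=2) -> {"d3", "d7", "d9"}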

13
Results
  • Table 2: P@20 performance for 16 VLC2 runs. Runs 1-4
    made use of the full topics, runs 5-13 made use of
    the Title plus Description fields of the topic
    statement, whereas runs 14-16 used only the Title
    field.
  • Table 3: Summary of P@20 performance for Web search
    engines and VLC2 runs. The median and range for all
    search engine runs are compared with the median and
    range for each of the VLC2 topic-length categories.

14
  • As may be seen, all five search engines performed
    below the median P@20 for title-only VLC2 submissions,
    and substantially below the medians for the longer
    topic runs.
  • The median performance of the VLC2 groups increases
    sharply with increasing use of topic words.
  • A fair comparison of the effectiveness of ranking
    algorithms can be obtained by conducting trials on a
    standardized test collection such as VLC2.
  • It is difficult to draw a firm conclusion here, as
    the groups that were focused on query processing
    speed rather than effectiveness were likely to have
    used shorter queries. It may well be that some of
    these systems performed less well because they chose
    fast but less effective methods, rather than because
    of the length of the queries.

15
TREC-8 Web Track
  • The Web track made use of the VLC2 frozen data set.
  • The TREC-8 Web Track activities centered on two major
    tasks.
    - Small Web Task: a small subset of the VLC2 data
      containing approximately two gigabytes of text
      (250,000 HTML pages) was used.
    - Large Web Task: the full 100 gigabyte, 18.5 million
      page VLC2 collection was used.

16
Efficiency-Effectiveness
  • Efficiency and effectiveness involve tradeoffs across
    five dimensions:
    - Speed of indexing
    - Size of indexes
    - Speed of query processing
    - Query processing effectiveness
    - Cost

17
Conclusion
  • It would have been valuable to have more effectiveness
    comparisons of TREC systems and commercial Web search
    engines.
  • The VLC2 collection and its associated resources
    provide a means of obtaining better evaluation results
    in the context of Web search, as shown by the TREC-8
    Web Track results.