1
Web-Page Summarization Using Clickthrough Data
  • JianTao Sun, Yuchang Lu
  • Dept. of Computer Science, TsingHua University,
    Beijing 100084, China
  • Dou Shen, Qiang Yang
  • Hong Kong University of Science and Technology,
    Clearwater Bay, Kowloon, HK
  • HuaJun Zeng, Zheng Chen
  • Microsoft Research Asia, 5F, Sigma Center, 49
    Zhichun Road, Beijing 100080, China
  • Presenter: Chen Yi-Ting

2
Reference
  • JianTao Sun, Yuchang Lu, Dou Shen, Qiang Yang,
    HuaJun Zeng, Zheng Chen. Web-Page Summarization
    Using Clickthrough Data. SIGIR '05, August
    15-19, 2005.
  • H. P. Luhn. The automatic creation of literature
    abstracts. IBM Journal of Research and
    Development, 2(2):159-165, 1958.

3
Outline
  • Introduction
  • Summarize Web Pages using Clickthrough Data
  • Empirical study on clickthrough data
  • Adapted web-page summarization methods
  • Summarize web pages not covered by clickthrough
    data
  • Experiments
  • Conclusions and future work

4
Introduction(1/2)
  • Why summarize Web pages?
  • Web-page summaries can be abstracts or extracts
  • A Web-page summary can also be either generic or
    query-dependent
  • A query-dependent summary presents the
    information that is most relevant to the initial
    query
  • A generic summary gives an overall sense of the
    document's content
  • A generic summary should meet two conditions:
    maintain wide coverage of the page's topics and
    keep redundancy low at the same time
  • In this paper, we focus on extract-based generic
    Web-page summarization
  • The objective of this research is to utilize
    extra knowledge to improve Web-page summarization
  • Clickthrough data contains users' knowledge of
    Web pages' content
  • A user's query words often reflect the true
    meaning of the target Web page's content

5
Introduction(2/2)
  • This is a challenging task
  • Web pages may have no associated query words,
    since they are not visited by Web users through a
    search engine
  • The clickthrough data are noisy
  • In this paper, a thematic hierarchy of query
    terms is constructed
  • The thematic lexicon can be used to complement
    the scarcity of Web-page content even when no
    clickthrough data has been collected for these
    pages
  • This method can help filter out noise contained
    in the query words of an individual Web page
    through the use of statistics over all Web pages
    of its category
  • Two text-summarization methods are used to
    summarize Web pages
  • The first approach is based on significant-word
    selection, adapted from Luhn's method
  • The second method is based on Latent Semantic
    Analysis (LSA)

6
Summarize web pages using clickthrough data (1/7)
  • Empirical study on clickthrough data
  • Consider the typical search scenario: a user (u)
    submits a query (q) to a search engine, the
    search engine returns a ranked list of Web pages,
    and the user clicks on the pages (p) of interest
  • This can be represented by a set of triples
    <u, q, p>
  • The clickthrough data records how Web users find
    information through queries
  • The collection of queries is assumed to reflect
    the topics of the target Web page well
  • Two experiments:
  • To investigate whether the query words are
    related to the topics of the Web page (45.5% of
    keywords occur in the query words; 13.1% of
    query words appear as keywords)
  • To give evidence that clickthrough data is
    helpful for summarizing Web pages
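As an illustration of the triple representation, here is a minimal sketch that groups query words by clicked page; the users, queries, and pages are invented toy data, not the paper's MSN log:

```python
from collections import defaultdict

# Toy clickthrough log of <u, q, p> triples (illustrative data only).
triples = [
    ("u1", "web page summarization", "pageA"),
    ("u2", "summarization clickthrough", "pageA"),
    ("u3", "weather forecast", "pageB"),
]

# Collect the query words associated with each clicked page; the paper
# assumes this collection reflects the topics of the target page.
page_queries = defaultdict(list)
for user, query, page in triples:
    page_queries[page].extend(query.lower().split())

print(page_queries["pageA"])
# ['web', 'page', 'summarization', 'summarization', 'clickthrough']
```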

7
Summarize web pages using clickthrough data (2/7)
  • Adapted Web-page Summarization Methods (suppose
    that we now have a set of query terms for each
    page)
  • Adapted Significant Word (ASW) Method
  • The first summarization method is adapted from
    Luhn's algorithm, a classical algorithm designed
    for text summarization
  • In Luhn's method, each sentence is assigned a
    significance factor, and the sentences with high
    significance factors are selected to form the
    summary
  • First, a set of significant words is constructed
    (according to word frequency in the document)
  • The significance factor of a sentence is then
    computed as follows: (1) set a limit L for the
    distance at which any two significant words can
    be considered significantly related; (2) find a
    portion of the sentence that is bracketed by
    significant words not more than L non-significant
    words apart; (3) count the number of significant
    words contained in the portion and divide the
    square of this number by the total number of
    words within the portion
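The three-step scoring above can be sketched as follows; the limit value and the toy sentence are illustrative choices, not values from the paper:

```python
def significance_factor(words, significant, limit=4):
    """Luhn's sentence score: locate portions bracketed by significant
    words at most `limit` non-significant words apart, then return
    (number of significant words in the portion)^2 / portion length."""
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for i in positions[1:]:
        if i - prev - 1 <= limit:   # still inside the same portion
            count += 1
        else:                       # close the portion, start a new one
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = i, 1
        prev = i
    best = max(best, count ** 2 / (prev - start + 1))
    return best

sig = {"summarization", "clickthrough", "query"}
sentence = "we study summarization of pages using clickthrough query logs".split()
print(significance_factor(sentence, sig, limit=4))  # 3^2 / 6 words = 1.5
```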

8
Summarize web pages using clickthrough data (3/7)
  • Adapted Web-page Summarization Methods
  • Adapted Significant Word (ASW) Method
  • To customize this procedure to leverage query
    terms for Web-page summarization, the significant
    word selection method is modified
  • The basic idea is to use both the local contents
    of a Web page and the query terms collected from
    the clickthrough data to decide whether a word is
    significant
  • After the significance factors for all words are
    calculated, the words are ranked and the top N
    are selected as significant words
  • Luhn's algorithm is then employed to compute the
    significance factor of each sentence
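A sketch of the modified significant-word selection: the transcript does not give the exact combination of page frequency and query frequency, so a linear mix with weight `alpha` is assumed here as a stand-in:

```python
from collections import Counter

def select_significant(doc_words, query_words, alpha=0.5, top_n=3):
    """Score each word by mixing its normalized frequency in the page
    with its normalized frequency in the associated query terms, then
    keep the top N.  The linear mix with weight `alpha` is an assumed
    stand-in for the paper's unspecified combination."""
    tf, qf = Counter(doc_words), Counter(query_words)
    n_d, n_q = max(sum(tf.values()), 1), max(sum(qf.values()), 1)
    score = {w: alpha * tf[w] / n_d + (1 - alpha) * qf[w] / n_q for w in tf}
    return sorted(score, key=score.get, reverse=True)[:top_n]

doc = "cats chase mice while cats sleep near mice".split()
queries = "cats pets".split()
print(select_significant(doc, queries, top_n=2))  # ['cats', 'mice']
```

"cats" outranks "mice" even though both occur twice in the page, because it also appears in the query words.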

9
Summarize web pages using clickthrough data (4/7)
  • Adapted Web-page Summarization Methods
  • Adapted Latent Semantic Analysis (ALSA) Method
  • Gong et al. proposed an extraction-based
    summarization algorithm
  • First, a term-sentence matrix is constructed
    from the original text document
  • Next, LSA is conducted on the matrix
  • In the last step, a document summary is produced
    incrementally
  • The proposed LSA-based summarization method is a
    variant of Gong's method
  • It utilizes the query-word knowledge by changing
    the term-sentence matrix: if a term occurs as a
    query word, its weight is increased according to
    its frequency in the query-word collection
  • The expectation is to extract sentences whose
    topics are related to the ones reflected by the
    query words
  • The term frequency vector of each sentence can be
    weighted by different weighting (global weighting
    and local weighting) and normalization methods
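The pipeline above can be sketched as follows. Adding the query frequency to the TF weight is an assumed form of the boost, and the toy sentences are invented for illustration:

```python
import numpy as np
from collections import Counter

def alsa_summary(sentences, query_words, k=2):
    """Sketch of the adapted LSA summarizer: build a TF term-sentence
    matrix, boost the weight of terms that also occur as query words,
    run SVD, and for each of the top-k right singular vectors pick the
    sentence with the largest component (Gong et al.'s selection rule)."""
    vocab = sorted({w for s in sentences for w in s})
    qf = Counter(query_words)
    A = np.zeros((len(vocab), len(sentences)))
    for j, sent in enumerate(sentences):
        tf = Counter(sent)
        for i, term in enumerate(vocab):
            if tf[term]:
                A[i, j] = tf[term] + qf[term]  # query terms weigh more
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    picked = []
    for row in Vt[:k]:                  # one sentence per latent topic
        j = int(np.argmax(np.abs(row)))
        if j not in picked:
            picked.append(j)
    return [" ".join(sentences[j]) for j in sorted(picked)]

sents = [s.split() for s in
         ["clickthrough data helps summarization",
          "the weather is sunny today",
          "queries reflect page topics"]]
print(alsa_summary(sents, "clickthrough queries".split(), k=2))
```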

10
Summarize web pages using clickthrough data (5/7)
  • Adapted Web-page Summarization Methods
  • Adapted Latent Semantic Analysis (ALSA) Method
  • In this paper, a term frequency (TF) approach
    without weighting or normalization is used to
    represent the sentences in Web pages
  • Terms in a sentence are augmented by query terms
  • Advantages of the adapted methods:
  • The extra knowledge of query terms is utilized to
    help select significant words and to modify the
    page representation
  • The approach can, to some extent, handle the
    noise in query words
  • Finally, the ASW approach avoids a problem of
    Luhn's method: the frequency-cutoff method may
    produce too many significant words for long pages

11
Summarize web pages using clickthrough data (6/7)
  • Summarize Web Pages Not Covered by Clickthrough
    Data
  • Build a hierarchical lexicon using the
    clickthrough data and apply it to help summarize
    those pages
  • All ODP Web pages have been manually organized
    into a hierarchical taxonomy
  • For each category of the taxonomy, the lexicon
    contains all query terms that users have
    submitted to browse Web pages of this category
  • The lexicon is built as follows:
  • First, the term set (TS) corresponding to each
    category is set to empty
  • Next, for each page covered by the clickthrough
    data, its query words are added into the TS of
    its categories
  • Finally, each term weight in each TS is
    multiplied by its Inverse Category Frequency
    (ICF)
  • For each Web page to be summarized, first look up
    the lexicon for the TS according to the page's
    category
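The lexicon construction can be sketched as follows. The transcript does not give the ICF formula, so an IDF-style form, ICF(t) = log(#categories / #categories whose TS contains t), is assumed here, along with toy pages and categories:

```python
import math
from collections import Counter, defaultdict

def build_lexicon(page_categories, page_queries):
    """Build the per-category term sets (TS) from clickthrough query
    words, then re-weight each term by its Inverse Category Frequency.
    The log(#categories / category frequency) form of ICF is an assumed
    IDF-style definition, not taken from the transcript."""
    ts = defaultdict(Counter)
    for page, cats in page_categories.items():
        for cat in cats:                 # add the page's query words
            ts[cat].update(page_queries.get(page, []))
    n_cats = len(ts)
    cat_freq = Counter()                 # in how many TSs a term occurs
    for counts in ts.values():
        cat_freq.update(counts.keys())
    return {cat: {t: c * math.log(n_cats / cat_freq[t])
                  for t, c in counts.items()}
            for cat, counts in ts.items()}

lex = build_lexicon(
    {"pageA": ["Sports"], "pageB": ["Sports"], "pageC": ["Arts"]},
    {"pageA": ["soccer", "news"], "pageB": ["soccer"], "pageC": ["news"]})
print(lex["Sports"]["soccer"], lex["Sports"]["news"])
```

Note how "news", which occurs in every category's TS, is driven to weight zero, while the category-specific "soccer" keeps a high weight; this is the noise-filtering effect of the ICF re-weighting.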

12
Summarize web pages using clickthrough data (7/7)
  • Summarize Web Pages Not Covered by Clickthrough
    Data
  • Weights of the terms in the TS can be used to
    select significant words or update the
    term-sentence matrix
  • If a page to be summarized has multiple
    categories, the corresponding TSs are merged
    together and the weights are averaged
  • When a TS does not have sufficient terms, the TS
    corresponding to its parent category is used
  • Two advantages:
  • First, the category-specific TS provides a
    distribution of topic terms in this category
  • Second, noisy terms that may be relatively
    frequent in one page's query words will be given
    a low weight through the use of statistics over
    all Web pages of this category

13
Experiments(1/6)
  • Data Set
  • The clickthrough data was collected from the MSN
    search engine
  • A set of Web pages of the ODP directory was
    crawled
  • In total, 1,125,207 Web pages were obtained,
    260,763 of which were clicked by Web users using
    1,586,472 different queries
  • Two different data sets were used for the
    experiments: (1) DAT1 consists of 90 pages
    selected from the 260,763 browsed pages. Three
    human evaluators were employed to summarize them

14
Experiments(2/6)
  • Data Set
  • Two different data sets were used for the
    experiments: (2) DAT2: 10,000 pages are randomly
    selected from the 260,763 and constitute the DAT2
    data set. The description of each page, provided
    by the page editor to give a general description
    of the page, is also extracted and used as the
    ideal summary
  • Performance Evaluation
  • Precision, Recall and F1
  • ROUGE evaluation (ROUGE-1)
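The ROUGE-1 measure used on DAT2 can be sketched as clipped unigram recall against the reference, here the editor's description; this is the standard formulation, with invented example texts:

```python
from collections import Counter

def rouge_1_recall(candidate, reference):
    """ROUGE-1 recall: the fraction of reference unigrams that also
    appear in the candidate summary, with candidate counts clipped by
    the reference counts (standard formulation)."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

summary = "the page discusses soccer news".split()
description = "soccer news and match reports".split()
print(rouge_1_recall(summary, description))  # 2 of 5 unigrams match: 0.4
```

Because the measure is recall-based, a short reference description that the summary fails to cover pulls the score down sharply, which matches the observation on DAT2 below.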

15
Experiments(3/6)
  • Experimental Results and Analysis
  • On DAT1: (1) to investigate whether the adapted
    summarizers can benefit from the query terms
    associated with each page

16
Experiments(4/6)
  • Experimental Results and Analysis
  • On DAT1: (2) to evaluate the proposed
    summarization methods using the thematic lexicon
    approach

17
Experiments(5/6)
  • Experimental Results and Analysis
  • On DAT2
  • Only the ROUGE-1 measure is used for evaluation
  • Since the description length is commonly short
    and the ROUGE-1 measure is recall-based, the
    summarization results are relatively poor
  • The thematic lexicon-based methods can still
    produce better summaries than summarizers based
    on local textual content

18
Experiments(6/6)
  • Discussions
  • The finding is that ICF-based re-weighting can
    help discover the topic terms of a specific
    category
  • The experiments verify the hypothesis that
    clickthrough data can complement the textual
    contents of Web pages for summarization tasks

19
Conclusions and Future work
  • To leverage extra knowledge from clickthrough
    data to improve Web-page summarization
  • It would be interesting to propose a method to
    determine the parameters automatically
  • To study how to leverage other types of knowledge

20
?