1
A Survey on Automatic Text/Speech Summarization
  • Shih-Hsiang Lin
  • Department of Computer Science and Information
    Engineering
  • National Taiwan Normal University
  • References
  • D. Das and A. F. T. Martins, A Survey on
    Automatic Text Summarization, 2007
  • Y. T. Chen et al., A probabilistic generative
    framework for extractive broadcast news speech
    summarization, IEEE Trans. on ASLP, 2009
  • Hovy's tutorial, Automated Text Summarization
    Tutorial, COLING/ACL 1998
  • Radev's tutorial, Text Summarization, SIGIR 2004
  • Berlin Chen's lecture, A Brief Review of Extractive
    Summarization Research, 2008

2
NLP Related Technologies
3
Outline
  • Introduction
  • Single-Document Summarization
  • Early work
  • Supervised Methods
  • Unsupervised Methods
  • Multi-Document Summarization
  • Not available yet
  • Evaluation
  • ROUGE
  • Information-Theoretic Method

4
Introduction
  • The subfield of summarization has been
    investigated by the NLP community for nearly
    half a century
  • A summary is "a text that is produced from one or
    more texts, that conveys important information in
    the original text(s), and that is no longer than
    half of the original text(s) and usually
    significantly less than that" (Radev, 2000)
  • Summaries may be produced from a single document
    or multiple documents
  • Summaries should preserve important information
  • Summaries should be short
  • Terminology used in summarization
  • Extraction: identify important sections of the
    text
  • Abstraction: produce the important material in a
    new way
  • Fusion: combine extracted parts coherently
  • Compression: throw out unimportant sections of
    the text
  • Indicative vs. Informative vs. Critical
  • Generic vs. Query-oriented
  • Single-Document Summarization vs. Multi-Document
    Summarization

5
Introduction (cont.)
  • Input (Jones, 1997)
  • Subject type: domain
  • Genre: newspaper articles, editorials, letters,
    reports...
  • Form: regular text structure vs. free-form
  • Source size: single doc vs. multiple docs (few
    vs. many)
  • Purpose
  • Situation: embedded in a larger system (MT, IR)
    or not?
  • Audience: focused or general
  • Usage: IR, sorting, skimming...
  • Output
  • Completeness: include all aspects, or focus on
    some?
  • Format: paragraph, table, etc.
  • Style: informative, indicative, critical...

This slide was adapted from Prof. Hovy's
presentation
6
Introduction (cont.)
  • A Summarization Machine

This slide was adapted from Prof. Hovy's
presentation
7
Introduction (cont.)
  • A brief history of summarization

8
Speech Summarization
  • Fundamental problems with speech summarization
  • Disfluencies, hesitations, repetitions, repairs,
    etc.
  • Difficulties of sentence segmentation
  • More spontaneous portions of speech (e.g.,
    interviews in broadcast news) are less amenable
    to standard text summarization
  • Speech recognition errors
  • Speech Summarization
  • Speech-to-text summarization
  • The documents can be easily looked through
  • The parts of the documents that are interesting
    to users can be easily extracted
  • Information extraction and retrieval techniques
    can be easily applied to the documents
  • Speech-to-speech summarization
  • Wrong information due to speech recognition
    errors can be avoided
  • Prosodic information such as the emotion of
    speakers that is conveyed only by speech can be
    presented

This slide was adapted from Prof. Furui's
presentation
9
Single-Document Summarization: Early Work
  • The most cited paper on summarization is that of
    Luhn (1958)
  • The frequency of a particular word in an article
    provides a useful measure of its significance
    (see the sketch at the end of this list)
  • There are also several key ideas put forward in
    this paper that have assumed importance in later
    work on summarization
  • words were stemmed to their root forms, and stop
    words were deleted
  • compiled a list of content words sorted by
    decreasing frequency, the index providing a
    significance measure of the word
  • a significance factor was derived that reflects
    the number of occurrences of significant words
    within a sentence
  • all sentences are ranked in order of their
    significance factor, and the top ranking
    sentences are finally selected to form the
    auto-abstract
  • Baxendale also suggested that sentence position
    is helpful in finding salient parts of documents
    (Baxendale, 1958)
  • examined 200 paragraphs to find that in 85% of
    the paragraphs the topic sentence came first,
    while in 7% of them it was the last sentence
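
A minimal sketch of Luhn-style scoring, assuming sentences arrive as lists of stemmed tokens; the stop-word list is illustrative, and Luhn's window-based bracketing of significant words is simplified here to whole-sentence counts:

```python
from collections import Counter

# Illustrative stop-word list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "that", "it"}

def significant_words(sentences, top_k=20):
    """Collect content words and keep the top_k most frequent ones:
    Luhn's frequency-based significance measure."""
    counts = Counter(w for s in sentences for w in s if w not in STOPWORDS)
    return {w for w, _ in counts.most_common(top_k)}

def significance_factor(sentence, sig_words):
    """Simplified Luhn factor: (significant words in the sentence)^2
    divided by the sentence length."""
    hits = sum(1 for w in sentence if w in sig_words)
    return hits ** 2 / len(sentence) if sentence else 0.0

def luhn_extract(sentences, ratio=0.2):
    """Rank sentences by significance factor and keep the top ones,
    restoring document order for the final auto-abstract."""
    sig = significant_words(sentences)
    ranked = sorted(range(len(sentences)), reverse=True,
                    key=lambda i: significance_factor(sentences[i], sig))
    keep = max(1, int(ratio * len(sentences)))
    return [sentences[i] for i in sorted(ranked[:keep])]
```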

10
Single-Document Summarization: Early Work (cont.)
  • Edmundson (1969) describes a system that produces
    document extracts
  • His primary contribution was the development of a
    typical structure for an extractive summarization
    experiment (400 technical documents)
  • Four kinds of features were used
  • Word frequency, positional feature
  • Cue words: presence of words like "significant"
    or "hardly"
  • The skeleton of the document: whether the
    sentence is a title or heading
  • Weights were attached to each of these features
    manually to score each sentence
  • About 44% of the auto-extracts matched the manual
    extracts
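
A tiny sketch of Edmundson-style scoring; the feature names follow his four classes (cue, key, title, location), but the weights and feature values below are illustrative placeholders, not the values tuned in the original experiment:

```python
# Edmundson's four manually weighted feature classes; these weights
# are illustrative placeholders, not his published values.
WEIGHTS = {"cue": 1.0, "key": 1.0, "title": 1.0, "location": 1.0}

def edmundson_score(features):
    """Manual linear combination of per-sentence feature scores."""
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)

# e.g., a heading-like first sentence containing no cue words:
edmundson_score({"cue": 0.0, "key": 0.4, "title": 1.0, "location": 1.0})
```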

11
Single-Document Summarization: Supervised Methods
  • In the 1990s, with the advent of machine learning
    techniques in NLP,
  • a series of seminal publications appeared that
    employed statistical techniques to produce
    document extracts
  • Kupiec et al. (1995) used a naive-Bayes
    classifier to categorize each sentence as
    worthy of extraction or not
  • Let s be a particular sentence, S the set of
    sentences that make up the summary, and
    F_1, ..., F_k the features
  • Assuming independence of the features:
    P(s ∈ S | F_1, ..., F_k)
    = P(s ∈ S) · Π_j P(F_j | s ∈ S) / Π_j P(F_j)
  • Two additional features were used: sentence
    length and the presence of uppercase words
  • Feature analysis revealed that a system using
    only the position and the cue features, along
    with the sentence length, performed best
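
A sketch of the resulting scoring rule, assuming the probability tables have already been estimated by counting over a training corpus of document/summary pairs (the names are hypothetical):

```python
import math

def extraction_log_posterior(sentence_feats, cond_probs, marg_probs, prior):
    """Naive-Bayes score P(s ∈ S | F_1..F_k) ∝
    P(s ∈ S) · Π_j P(F_j | s ∈ S) / P(F_j), computed in log space.

    sentence_feats: {feature_name: discrete_value} for one sentence
    cond_probs[f][v]: P(F_j = v | s ∈ S), counted over summary sentences
    marg_probs[f][v]: P(F_j = v), counted over all training sentences
    prior: P(s ∈ S), e.g. the target compression rate
    """
    score = math.log(prior)
    for f, v in sentence_feats.items():
        score += math.log(cond_probs[f][v]) - math.log(marg_probs[f][v])
    return score  # rank a document's sentences by this score
```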

12
Single-Document Summarization: Supervised Methods
(cont.)
  • Aone et al. (1999) also incorporated a
    naive-Bayes classifier, but with richer features
  • Signature words derived from term frequency (TF)
    and inverse document frequency (IDF)
  • Named-entity tagger
  • Shallow discourse analysis
  • Synonyms and morphological variants were also
    merged (accomplished using WordNet)
  • Lin and Hovy (1997) studied the importance of the
    sentence position feature
  • However, the discourse structure varies
    significantly across domains
  • They made an important contribution by
    investigating techniques for tailoring the
    position method toward optimality over a genre
  • Measured the yield of each sentence position
    against the topic keywords
  • Then ranked the sentence positions by their
    average yield to produce the Optimal Position
    Policy (OPP) for topic positions for the genre

13
Single-Document Summarization: Supervised Methods
(cont.)
  • Lin (1999) broke away from the assumption that
    features are independent of each other
  • He tried to model the problem of sentence
    extraction using decision trees, instead of a
    naive-Bayes classifier
  • Some novel features were introduced in his paper
  • Query Signature: normalized score given to
    sentences depending on the number of query words
    they contain
  • IR signature: score given to sentences depending
    on the number and scores of IR signature words
    included (the m most salient words in the corpus)
  • Average lexical connectivity: the number of terms
    shared with other sentences divided by the total
    number of sentences in the text
  • Numerical data: value 1 when a sentence contains
    a number
  • Proper name, pronoun or adjective, weekday or
    month, quotation (scored similarly to the
    previous feature)
  • Sentence length, Sentence Order
  • Feature analysis suggested that the IR signature
    was a valuable feature, corroborating the early
    findings of Luhn (1958)

14
Single-Document Summarization: Supervised Methods
(cont.)
  • Conroy and O'Leary (2001) modeled the problem of
    extracting a sentence from a document using a
    hidden Markov model (HMM)
  • The HMM was structured as follows
  • 2s + 1 states (alternating between s summary
    states and s + 1 non-summary states)
  • Allowed hesitation only in non-summary states
    and skipping of the next state only in summary
    states
  • The transition matrix can be estimated from a
    training corpus
  • element (i, j) is the empirical probability of
    transitioning from state i to state j
  • Associated with each state i was an output
    function
  • assumes that the features are multivariate
    normally distributed
  • uses the training data to compute the maximum
    likelihood estimate of the mean and covariance
    matrix (shared covariance)
  • Three features were used: position of the
    sentence, number of terms in the sentence, and
    likeliness of the sentence terms given the
    document terms
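
A sketch of the estimation step, assuming training documents provide per-sentence feature vectors aligned with gold state labels; the structural constraints on which transitions are allowed are omitted:

```python
import numpy as np

def estimate_transitions(state_sequences, n_states):
    """Empirical transition matrix: element (i, j) is the fraction of
    times state i was followed by state j in the training corpus."""
    M = np.zeros((n_states, n_states))
    for seq in state_sequences:
        for i, j in zip(seq[:-1], seq[1:]):
            M[i, j] += 1
    return M / np.maximum(M.sum(axis=1, keepdims=True), 1)

def fit_gaussian_outputs(features, states, n_states):
    """Per-state output functions: multivariate normal with
    maximum-likelihood means and a single shared covariance."""
    features, states = np.asarray(features, float), np.asarray(states)
    means = np.stack([features[states == k].mean(axis=0)
                      for k in range(n_states)])
    centered = features - means[states]
    return means, centered.T @ centered / len(features)
```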

15
Single-Document Summarization: Supervised Methods
(cont.)
  • Osborne (2002) used log-linear models to obviate
    the assumption of feature independence
  • Let c be a label, s the item we are interested
    in labeling, f_i the i-th feature, and λ_i the
    corresponding feature weight
  • The conditional log-linear model can be stated
    as follows:
    P(c | s) = (1 / Z(s)) exp(Σ_i λ_i f_i(c, s))
  • He added a non-uniform prior to the model,
    claiming that a log-linear model tends to
    reject too many sentences for inclusion in a
    summary
  • The features included word pairs, sentence
    length, sentence position, and naive discourse
    features like "inside introduction" or "inside
    conclusion"
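
A sketch of the model form, assuming binary labels (extract/skip) and precomputed feature values; the non-uniform prior enters as an additive log term:

```python
import numpy as np

def log_linear_probs(feats, weights, prior):
    """P(c | s) ∝ prior(c) · exp(Σ_i λ_i f_i(c, s)).

    feats: (n_labels, n_features) matrix of f_i(c, s) per label c
    weights: (n_features,) vector of feature weights λ_i
    prior: non-uniform label prior, counteracting the model's
           tendency to reject too many sentences
    """
    scores = feats @ weights + np.log(prior)
    scores -= scores.max()              # numerical stability
    unnorm = np.exp(scores)
    return unnorm / unnorm.sum()

# e.g., an extract/skip decision with three features:
# log_linear_probs(np.array([[1.0, 0.2, 0.0], [0.0, 0.8, 1.0]]),
#                  np.array([0.5, -0.1, 0.3]), np.array([0.3, 0.7]))
```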

16
Single-Document Summarization: Supervised Methods
(cont.)
  • Svore et al. (2007) proposed an algorithm based
    on neural nets and the use of third-party
    datasets to perform extractive summarization
  • They trained a model that could infer the proper
    ranking of sentences
  • The ranking was accomplished using RankNet, a
    neural-network-based ranking algorithm
  • For the training set, they used ROUGE-1 to score
    the similarity of a human-written highlight and a
    sentence in the document
  • These similarity scores were used as
    "soft labels" during training, contrasting with
    other approaches where sentences are
    "hard-labeled" as selected or not (see the
    sketch after this list)
  • Another novelty of the framework lay in the use
    of features that derived information from query
    logs from Microsoft's news search engine and
    Wikipedia entries (third-party datasets)
  • They conjectured that if a document sentence
    contained keywords used in the news search
    engine, or entities found in Wikipedia articles,
    then there was a greater chance of having that
    sentence in the highlight
  • They generated 10 features for each sentence in
    each document
  • Is first sentence, Sentence position, SumBasic
    score (unigram), SumBasic bigram score, Title
    similarity score, Average News Query Term Score,
    News Query Term Sum Score, Relative News Query
    Term Score, Average Wikipedia Entity Score,
    Wikipedia Entity Sum Score
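
A sketch of how such soft labels can be derived, assuming tokenized text; the exact ROUGE-1 variant and normalization used in the paper may differ:

```python
from collections import Counter

def rouge1_recall(reference, candidate):
    """Unigram recall of the reference tokens by the candidate tokens."""
    ref, cand = Counter(reference), Counter(candidate)
    overlap = sum(min(c, cand[w]) for w, c in ref.items())
    return overlap / max(1, sum(ref.values()))

def soft_labels(highlight, sentences):
    """One similarity score per document sentence, used as a soft
    training target instead of a hard selected/not-selected label."""
    return [rouge1_recall(highlight, s) for s in sentences]
```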

17
Single-Document Summarization: Supervised Methods
(cont.)
  • Other kinds of supervised summarizers include
  • Support vector machine (SVM) (Hirao et al. 2002)
  • Gaussian Mixture Models (GMM) (Murray et al.
    2005)
  • Conditional Random Fields (CRFs) (Shen et al.
    2007)
  • In general, extractive summarization can be
    treated as a two-class (summary/non-summary)
    classification problem (Lin et al. 2009)
  • A sentence S is represented by a set of
    features F
  • To summarize documents at different summary
    ratios, the important sentences of a document
    can be selected (or ranked) based on the
    posterior probability P(S ∈ Summary | F) of a
    sentence being included in the summary given the
    feature set

18
Single-Document Summarization: Unsupervised Methods
  • Gong (2001) proposed using the vector space model
    (VSM)
  • Build vector representations of the sentences and
    of the document to be summarized, using
    statistical term weighting such as TF-IDF
  • Sentences are ranked based on their proximity to
    the document vector
  • Maximum Marginal Relevance (MMR) (Murray et al.
    2005) can be applied so that the summary covers
    the important concepts of a document while
    avoiding redundancy
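
A sketch combining both ideas, assuming TF-IDF sentence and document vectors have already been built; λ = 0.7 is an illustrative relevance/redundancy trade-off:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def mmr_select(sent_vecs, doc_vec, k, lam=0.7):
    """Greedily pick k sentences, trading off proximity to the document
    vector against redundancy with sentences already selected."""
    selected, remaining = [], list(range(len(sent_vecs)))
    while remaining and len(selected) < k:
        def mmr(i):
            relevance = cosine(sent_vecs[i], doc_vec)
            redundancy = max((cosine(sent_vecs[i], sent_vecs[j])
                              for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected  # sentence indices in selection order
```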

19
Single-Document Summarization: Unsupervised
Methods (cont.)
  • Latent Semantic Analysis (LSA) (Gong 2001)
  • Construct a term-sentence matrix for a given
    document
  • Perform Singular Value Decomposition (SVD) on the
    term-sentence matrix
  • The right singular vectors with larger singular
    values represent the dimensions of the more
    important latent semantic concepts in the
    document
  • Represent each sentence of a document as a
    semantic vector in the reduced space
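
A sketch of the SVD-based selection, assuming a weighted term-sentence count matrix; this follows Gong's rule of taking one sentence per top latent concept:

```python
import numpy as np

def lsa_select(term_sentence, k=3):
    """SVD of the term-sentence matrix A = U Σ V^T; for each of the k
    largest singular values, select the sentence with the largest
    entry in the corresponding right singular vector."""
    U, s, Vt = np.linalg.svd(np.asarray(term_sentence, float),
                             full_matrices=False)
    return [int(np.argmax(Vt[i])) for i in range(k)]
```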

20
Single-Document Summarization: Unsupervised
Methods (cont.)
  • Probabilistic Generative Framework (Chen et al.
    2009)
  • Criterion: maximum a posteriori (MAP), i.e., rank
    the sentences S of a document D by
    P(S | D) ∝ P(D | S) P(S)
  • Sentence Generative Model
  • Each sentence of the document is treated as a
    probabilistic generative model
  • Language Model (LM), Sentence Topic Model (STM),
    and Word Topic Model (WTM) are initially
    investigated
  • Sentence Prior Model
  • The sentence prior is simply set to uniform here
  • Or it may depend on duration/position,
    correctness of sentence boundaries, confidence
    scores, prosodic information, etc.

21
Single-Document Summarization: Unsupervised
Methods (cont.)
  • Language Model (LM) Approach (Literal Term
    Matching)
  • Sentence Topic Model (STM) Approach (Concept
    Matching)
  • Word Topic Model (WTM) Approach (Concept Matching)

For the LM approach, each sentence S of a document D
is scored by
  P(D | S) = Π_{w ∈ D} [λ P(w | S) + (1 − λ) P(w | C)]^{c(w, D)}
where P(w | S) is the sentence model, P(w | C) the
collection model, and λ a weighting parameter
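
A sketch of this LM scoring rule, assuming tokenized sentences and a precomputed collection word-count table; λ = 0.7 is illustrative:

```python
import math
from collections import Counter

def lm_log_score(doc_tokens, sent_tokens, coll_counts, coll_total, lam=0.7):
    """log P(D | S): the sentence model linearly interpolated with the
    collection model, summed over document words with counts c(w, D)."""
    sent = Counter(sent_tokens)
    score = 0.0
    for w, c in Counter(doc_tokens).items():
        p_s = sent[w] / max(1, len(sent_tokens))     # sentence model
        p_c = coll_counts.get(w, 0) / coll_total     # collection model
        score += c * math.log(lam * p_s + (1 - lam) * p_c + 1e-12)
    return score  # rank the document's sentences by this score
```
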
22
Multi-Document Summarization
  • Task characteristics
  • Input: a set of documents on the same topic
  • Retrieved during an IR search
  • Clustered by a news browser
  • Problem: same topic or same event?
  • Output: a paragraph-length summary
  • Salient information across documents
  • Similarities between topics?
  • Redundancy removal is critical
  • Application-oriented task
  • News portals, presenting articles from different
    sources
  • Corporate emails organized by subject
  • Medical reports about a patient

23
Evaluation
  • Recall-Oriented Understudy for Gisting Evaluation
    (ROUGE) (Lin 2004)
  • Let {R_1, ..., R_M} be a set of reference
    summaries, and let S be a summary generated
    automatically by a system. Let c_n(·) be a
    binary vector representing the n-grams contained
    in a document
  • The metric ROUGE-N is an n-gram recall based
    statistic:
    ROUGE-N = Σ_{j=1..M} ⟨c_n(R_j), c_n(S)⟩
              / Σ_{j=1..M} ⟨c_n(R_j), c_n(R_j)⟩
  • where ⟨·, ·⟩ denotes the usual inner product of
    vectors
  • The various versions of ROUGE were evaluated by
    computing the correlation coefficient between
    ROUGE scores and human judgment scores
  • ROUGE-2 performed the best among the ROUGE-N
    variants
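
A sketch of ROUGE-N using clipped n-gram counts, which coincides with the binary inner-product formulation above when each n-gram occurs at most once:

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))

def rouge_n(references, system, n=2):
    """N-gram recall: matched reference n-grams over the total number
    of n-grams in the reference set."""
    sys_grams = ngrams(system, n)
    matched = total = 0
    for ref in references:
        ref_grams = ngrams(ref, n)
        matched += sum(min(c, sys_grams[g]) for g, c in ref_grams.items())
        total += sum(ref_grams.values())
    return matched / max(1, total)
```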

24
Evaluation (cont.)
  • Lin et al. (2006) also proposed an
    information-theoretic method for the automatic
    evaluation of summaries
  • The central idea is to use a divergence measure
    (i.e., the Jensen-Shannon divergence) between a
    pair of probability distributions
  • The first distribution is derived from an
    automatic summary and the second from a set of
    reference summaries
  • Let D be the set of documents to summarize
  • A distribution parameterized by θ_R generates
    the reference summaries
  • A summarization system is governed by some
    distribution θ_A
  • We may define a good summarizer as one for which
    θ_A is close to θ_R
  • One information-theoretic measure between
    distributions that is adequate for this is the
    KL divergence:
    KL(θ_A || θ_R) = Σ_x θ_A(x) log(θ_A(x) / θ_R(x))
  • However, the KL divergence is unbounded and goes
    to infinity whenever θ_R vanishes and θ_A
    does not
  • Another problem is that the KL divergence is not
    symmetric

25
Evaluation (cont.)
  • Hence, they propose to use the Jensen-Shannon
    divergence, which is bounded and symmetric:
    JS(p || q) = (1/2) [KL(p || m) + KL(q || m)]
  • where m = (p + q) / 2
  • To evaluate a summary S given a reference
    summary R, the negative JS divergence
    −JS(S || R) can be used for the purpose
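
A minimal sketch of the measure, assuming p and q are probability distributions over the same vocabulary, represented as aligned arrays:

```python
import numpy as np

def js_divergence(p, q):
    """JS(p || q) = ½ KL(p || m) + ½ KL(q || m), with m = (p + q) / 2;
    bounded and symmetric, unlike the KL divergence."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                     # treat 0 · log 0 as 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# A system summary is then scored by the negative divergence,
# -js_divergence(p_system, p_reference): larger is better.
```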