1
A Survey on Automatic Text/Speech Summarization
  • Shih-Hsiang Lin
  • Department of Computer Science and Information
    Engineering
  • National Taiwan Normal University
  • References
  • D. Das and A. F. T. Martins, A Survey on
    Automatic Text Summarization, 2007
  • Y. T. Chen et al., A probabilistic generative
    framework for extractive broadcast news speech
    summarization, IEEE Trans. on ASLP, 2009
  • Hovy's tutorial, Automated Text Summarization
    Tutorial, COLING/ACL 1998
  • Radev's tutorial, Text Summarization, SIGIR 2004
  • Berlin Chen's lecture, A Brief Review of Extractive
    Summarization Research, 2008

2
NLP Related Technologies
3
Outline
  • Introduction
  • Single-Document Summarization
  • Early work
  • Supervised Methods
  • Unsupervised Methods
  • Multi-Document Summarization
  • Not available yet
  • Evaluation
  • ROUGE
  • Information-Theoretic Method

4
Introduction
  • The subfield of summarization has been
    investigated by the NLP community for nearly
    half a century
  • A summary is "a text that is produced from one or
    more texts, that conveys important information in
    the original text(s), and that is no longer than
    half of the original text(s) and usually
    significantly less than that" (Radev, 2000)
  • Summaries may be produced from a single document
    or multiple documents
  • Summaries should preserve important information
  • Summaries should be short
  • Terminology used in summarization
  • Extraction: identify important sections of the
    text
  • Abstraction: produce the important material in a
    new way
  • Fusion: combine extracted parts coherently
  • Compression: throw out unimportant sections of
    the text
  • Indicative vs. Informative vs. Critical
  • Generic vs. Query-oriented
  • Single-Document Summarization vs. Multi-Document
    Summarization

5
Introduction (cont.)
  • Input (Jones, 1997)
  • Subject type: domain
  • Genre: newspaper articles, editorials, letters,
    reports...
  • Form: regular text structure vs. free-form
  • Source size: single doc vs. multiple docs (few
    vs. many)
  • Purpose
  • Situation: embedded in a larger system (MT, IR)
    or not?
  • Audience: focused or general
  • Usage: IR, sorting, skimming...
  • Output
  • Completeness: include all aspects, or focus on
    some?
  • Format: paragraph, table, etc.
  • Style: informative, indicative, critical...

This slide was adapted from Prof. Hovy's
presentation
6
Introduction (cont.)
  • A Summarization Machine

This slide was adapted from Prof. Hovy's
presentation
7
Introduction (cont.)
  • A brief history of summarization

8
Speech Summarization
  • Fundamental problems with speech summarization
  • Disfluencies, hesitations, repetitions, repairs,
    etc.
  • Difficulties of sentence segmentation
  • More spontaneous portions of speech (e.g.,
    interviews in broadcast news) are less amenable
    to standard text summarization
  • Speech recognition errors
  • Speech Summarization
  • Speech-to-text summarization
  • The documents can be easily looked through
  • The parts of the documents that are interesting
    to users can be easily extracted
  • Information extraction and retrieval techniques
    can be easily applied to the documents
  • Speech-to-speech summarization
  • Wrong information due to speech recognition
    errors can be avoided
  • Prosodic information such as the emotion of
    speakers that is conveyed only by speech can be
    presented

This slide was adapted from Prof. Furui's
presentation
9
Single-Document Summarization: Early Work
  • The most cited paper on summarization is that of
    Luhn (1958)
  • The frequency of a particular word in an article
    provides a useful measure of its significance
    (see the sketch at the end of this list)
  • There are also several key ideas put forward in
    this paper that have assumed importance in later
    work on summarization
  • words were stemmed to their root forms, and stop
    words were deleted
  • compiled a list of content words sorted by
    decreasing frequency, the index providing a
    significance measure of the word
  • a significance factor was derived that reflects
    the number of occurrences of significant words
    within a sentence
  • all sentences are ranked in order of their
    significance factor, and the top ranking
    sentences are finally selected to form the
    auto-abstract
  • Baxendale also suggested that sentence position
    is helpful in finding salient parts of documents
    (Baxendale, 1958)
  • examined 200 paragraphs to find that in 85% of
    the paragraphs the topic sentence came first,
    while in 7% of them it was the last sentence
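
A minimal sketch of Luhn-style scoring, assuming sentences arrive as lists of stemmed tokens; the stop-word list is illustrative, and Luhn's window-based bracketing of significant words is simplified here to whole-sentence counts:

```python
from collections import Counter

# Illustrative stop-word list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "that", "it"}

def significant_words(sentences, top_k=20):
    """Collect content words and keep the top_k most frequent ones:
    Luhn's frequency-based significance measure."""
    counts = Counter(w for s in sentences for w in s if w not in STOPWORDS)
    return {w for w, _ in counts.most_common(top_k)}

def significance_factor(sentence, sig_words):
    """Simplified Luhn factor: (significant words in the sentence)^2
    divided by the sentence length."""
    hits = sum(1 for w in sentence if w in sig_words)
    return hits ** 2 / len(sentence) if sentence else 0.0

def luhn_extract(sentences, ratio=0.2):
    """Rank sentences by significance factor and keep the top ones,
    restoring document order for the final auto-abstract."""
    sig = significant_words(sentences)
    ranked = sorted(range(len(sentences)), reverse=True,
                    key=lambda i: significance_factor(sentences[i], sig))
    keep = max(1, int(ratio * len(sentences)))
    return [sentences[i] for i in sorted(ranked[:keep])]
```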

10
Single-Document Summarization: Early Work (cont.)
  • Edmundson (1969) describes a system that produces
    document extracts
  • His primary contribution was the development of a
    typical structure for an extractive summarization
    experiment (400 technical documents)
  • Four kinds of features were used
  • Word frequency, positional feature
  • Cue words: presence of words like "significant"
    or "hardly"
  • The skeleton of the document: whether the
    sentence is a title or heading
  • Weights were attached to each of these features
    manually to score each sentence
  • About 44% of the auto-extracts matched the manual
    extracts
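
A tiny sketch of Edmundson-style scoring; the feature names follow his four classes (cue, key, title, location), but the weights and feature values below are illustrative placeholders, not the values tuned in the original experiment:

```python
# Edmundson's four manually weighted feature classes; these weights
# are illustrative placeholders, not his published values.
WEIGHTS = {"cue": 1.0, "key": 1.0, "title": 1.0, "location": 1.0}

def edmundson_score(features):
    """Manual linear combination of per-sentence feature scores."""
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)

# e.g., a heading-like first sentence containing no cue words:
edmundson_score({"cue": 0.0, "key": 0.4, "title": 1.0, "location": 1.0})
```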

11
Single-Document Summarization: Supervised Methods
  • In the 1990s, with the advent of machine learning
    techniques in NLP,
  • a series of seminal publications appeared that
    employed statistical techniques to produce
    document extracts
  • Kupiec et al. (1995) used a naive-Bayes
    classifier to categorize each sentence as
    worthy of extraction or not
  • Let s be a particular sentence, S the set of
    sentences that make up the summary, and
    F_1, ..., F_k the features
  • Assuming independence of the features:
    P(s ∈ S | F_1, ..., F_k)
    = P(s ∈ S) · Π_j P(F_j | s ∈ S) / Π_j P(F_j)
  • Two additional features were used: sentence
    length and the presence of uppercase words
  • Feature analysis revealed that a system using
    only the position and the cue features, along
    with the sentence length, performed best
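
A sketch of the resulting scoring rule, assuming the probability tables have already been estimated by counting over a training corpus of document/summary pairs (the names are hypothetical):

```python
import math

def extraction_log_posterior(sentence_feats, cond_probs, marg_probs, prior):
    """Naive-Bayes score P(s ∈ S | F_1..F_k) ∝
    P(s ∈ S) · Π_j P(F_j | s ∈ S) / P(F_j), computed in log space.

    sentence_feats: {feature_name: discrete_value} for one sentence
    cond_probs[f][v]: P(F_j = v | s ∈ S), counted over summary sentences
    marg_probs[f][v]: P(F_j = v), counted over all training sentences
    prior: P(s ∈ S), e.g. the target compression rate
    """
    score = math.log(prior)
    for f, v in sentence_feats.items():
        score += math.log(cond_probs[f][v]) - math.log(marg_probs[f][v])
    return score  # rank a document's sentences by this score
```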

12
Single-Document Summarization: Supervised Methods
(cont.)
  • Aone et al. (1999) also incorporated a
    naive-Bayes classifier, but with richer features
  • Signature words derived from term frequency (TF)
    and inverse document frequency (IDF)
  • Named-entity tagger
  • Shallow discourse analysis
  • Synonyms and morphological variants were also
    merged (accomplished using WordNet)
  • Lin and Hovy (1997) studied the importance of the
    sentence position feature
  • However, the discourse structure varies
    significantly across domains
  • They made an important contribution by
    investigating techniques for tailoring the
    position method toward optimality over a genre
  • Measured the yield of each sentence position
    against the topic keywords
  • Then ranked the sentence positions by their
    average yield to produce the Optimal Position
    Policy (OPP) for topic positions for the genre

13
Single-Document Summarization: Supervised Methods
(cont.)
  • Lin (1999) broke away from the assumption that
    features are independent of each other
  • He tried to model the problem of sentence
    extraction using decision trees, instead of a
    naive-Bayes classifier
  • Some novel features were introduced in his paper
  • Query Signature: normalized score given to
    sentences depending on the number of query words
    they contain
  • IR signature: score given to sentences depending
    on the number and scores of IR signature words
    included (the m most salient words in the corpus)
  • Average lexical connectivity: the number of terms
    shared with other sentences divided by the total
    number of sentences in the text
  • Numerical data: value 1 when a sentence contains
    a number
  • Proper name, pronoun or adjective, weekday or
    month, quotation (scored similarly to the
    previous feature)
  • Sentence length, Sentence Order
  • Feature analysis suggested that the IR signature
    was a valuable feature, corroborating the early
    findings of Luhn (1958)

14
Single-Document Summarization: Supervised Methods
(cont.)
  • Conroy and O'Leary (2001) modeled the problem of
    extracting a sentence from a document using a
    hidden Markov model (HMM)
  • The HMM was structured as follows
  • 2s + 1 states (alternating between s summary
    states and s + 1 non-summary states)
  • Allowed hesitation only in non-summary states
    and skipping of the next state only in summary
    states
  • The transition matrix can be estimated from a
    training corpus
  • element (i, j) is the empirical probability of
    transitioning from state i to state j
  • Associated with each state i was an output
    function
  • assumes that the features are multivariate
    normally distributed
  • uses the training data to compute the maximum
    likelihood estimate of the mean and covariance
    matrix (shared covariance)
  • Three features were used: position of the
    sentence, number of terms in the sentence, and
    likeliness of the sentence terms given the
    document terms
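
A sketch of the estimation step, assuming training documents provide per-sentence feature vectors aligned with gold state labels; the structural constraints on which transitions are allowed are omitted:

```python
import numpy as np

def estimate_transitions(state_sequences, n_states):
    """Empirical transition matrix: element (i, j) is the fraction of
    times state i was followed by state j in the training corpus."""
    M = np.zeros((n_states, n_states))
    for seq in state_sequences:
        for i, j in zip(seq[:-1], seq[1:]):
            M[i, j] += 1
    return M / np.maximum(M.sum(axis=1, keepdims=True), 1)

def fit_gaussian_outputs(features, states, n_states):
    """Per-state output functions: multivariate normal with
    maximum-likelihood means and a single shared covariance."""
    features, states = np.asarray(features, float), np.asarray(states)
    means = np.stack([features[states == k].mean(axis=0)
                      for k in range(n_states)])
    centered = features - means[states]
    return means, centered.T @ centered / len(features)
```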

15
Single-Document Summarization: Supervised Methods
(cont.)
  • Osborne (2002) used log-linear models to obviate
    the assumption of feature independence
  • Let c be a label, s the item we are interested
    in labeling, f_i the i-th feature, and λ_i the
    corresponding feature weight
  • The conditional log-linear model can be stated
    as follows:
    P(c | s) = (1 / Z(s)) exp(Σ_i λ_i f_i(c, s))
  • He added a non-uniform prior to the model,
    claiming that a log-linear model tends to
    reject too many sentences for inclusion in a
    summary
  • The features included word pairs, sentence
    length, sentence position, and naive discourse
    features like "inside introduction" or "inside
    conclusion"
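
A sketch of the model form, assuming binary labels (extract/skip) and precomputed feature values; the non-uniform prior enters as an additive log term:

```python
import numpy as np

def log_linear_probs(feats, weights, prior):
    """P(c | s) ∝ prior(c) · exp(Σ_i λ_i f_i(c, s)).

    feats: (n_labels, n_features) matrix of f_i(c, s) per label c
    weights: (n_features,) vector of feature weights λ_i
    prior: non-uniform label prior, counteracting the model's
           tendency to reject too many sentences
    """
    scores = feats @ weights + np.log(prior)
    scores -= scores.max()              # numerical stability
    unnorm = np.exp(scores)
    return unnorm / unnorm.sum()

# e.g., an extract/skip decision with three features:
# log_linear_probs(np.array([[1.0, 0.2, 0.0], [0.0, 0.8, 1.0]]),
#                  np.array([0.5, -0.1, 0.3]), np.array([0.3, 0.7]))
```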

16
Single-Document Summarization: Supervised Methods
(cont.)
  • Svore et al. (2007) proposed an algorithm based
    on neural nets and the use of third-party
    datasets to perform extractive summarization
  • They trained a model that could infer the proper
    ranking of sentences
  • The ranking was accomplished using RankNet, a
    neural-network-based ranking algorithm
  • For the training set, they used ROUGE-1 to score
    the similarity of a human-written highlight and a
    sentence in the document
  • These similarity scores were used as
    "soft labels" during training, contrasting with
    other approaches where sentences are
    "hard-labeled" as selected or not (see the
    sketch after this list)
  • Another novelty of the framework lay in the use
    of features that derived information from query
    logs from Microsoft's news search engine and
    Wikipedia entries (third-party datasets)
  • They conjectured that if a document sentence
    contained keywords used in the news search
    engine, or entities found in Wikipedia articles,
    then there was a greater chance of having that
    sentence in the highlight
  • They generated 10 features for each sentence in
    each document
  • Is first sentence, Sentence position, SumBasic
    score (unigram), SumBasic bigram score, Title
    similarity score, Average News Query Term Score,
    News Query Term Sum Score, Relative News Query
    Term Score, Average Wikipedia Entity Score,
    Wikipedia Entity Sum Score
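
A sketch of how such soft labels can be derived, assuming tokenized text; the exact ROUGE-1 variant and normalization used in the paper may differ:

```python
from collections import Counter

def rouge1_recall(reference, candidate):
    """Unigram recall of the reference tokens by the candidate tokens."""
    ref, cand = Counter(reference), Counter(candidate)
    overlap = sum(min(c, cand[w]) for w, c in ref.items())
    return overlap / max(1, sum(ref.values()))

def soft_labels(highlight, sentences):
    """One similarity score per document sentence, used as a soft
    training target instead of a hard selected/not-selected label."""
    return [rouge1_recall(highlight, s) for s in sentences]
```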

17
Single-Document Summarization: Supervised Methods
(cont.)
  • Other kinds of supervised summarizers include
  • Support vector machine (SVM) (Hirao et al. 2002)
  • Gaussian Mixture Models (GMM) (Murray et al.
    2005)
  • Conditional Random Fields (CRFs) (Shen et al.
    2007)
  • In general, extractive summarization can be
    treated as a two-class (summary/non-summary)
    classification problem (Lin et al. 2009)
  • A sentence S is represented by a set of
    features F
  • To summarize documents at different summary
    ratios, the important sentences of a document
    can be selected (or ranked) based on the
    posterior probability P(S ∈ Summary | F) of a
    sentence being included in the summary given the
    feature set

18
Single-Document Summarization: Unsupervised Methods
  • Gong (2001) proposed using the vector space model
    (VSM)
  • Build vector representations of the sentences and
    of the document to be summarized, using
    statistical term weighting such as TF-IDF
  • Sentences are ranked based on their proximity to
    the document vector
  • Maximum Marginal Relevance (MMR) (Murray et al.
    2005) can be applied so that the summary covers
    the important concepts of a document while
    avoiding redundancy
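
A sketch combining both ideas, assuming TF-IDF sentence and document vectors have already been built; λ = 0.7 is an illustrative relevance/redundancy trade-off:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def mmr_select(sent_vecs, doc_vec, k, lam=0.7):
    """Greedily pick k sentences, trading off proximity to the document
    vector against redundancy with sentences already selected."""
    selected, remaining = [], list(range(len(sent_vecs)))
    while remaining and len(selected) < k:
        def mmr(i):
            relevance = cosine(sent_vecs[i], doc_vec)
            redundancy = max((cosine(sent_vecs[i], sent_vecs[j])
                              for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected  # sentence indices in selection order
```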

19
Single-Document Summarization: Unsupervised
Methods (cont.)
  • Latent Semantic Analysis (LSA) (Gong 2001)
  • Construct a term-sentence matrix for a given
    document
  • Perform Singular Value Decomposition (SVD) on the
    term-sentence matrix
  • The right singular vectors with larger singular
    values represent the dimensions of the more
    important latent semantic concepts in the
    document
  • Represent each sentence of a document as a
    semantic vector in the reduced space
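
A sketch of the SVD-based selection, assuming a weighted term-sentence count matrix; this follows Gong's rule of taking one sentence per top latent concept:

```python
import numpy as np

def lsa_select(term_sentence, k=3):
    """SVD of the term-sentence matrix A = U Σ V^T; for each of the k
    largest singular values, select the sentence with the largest
    entry in the corresponding right singular vector."""
    U, s, Vt = np.linalg.svd(np.asarray(term_sentence, float),
                             full_matrices=False)
    return [int(np.argmax(Vt[i])) for i in range(k)]
```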

20
Single-Document Summarization: Unsupervised
Methods (cont.)
  • Probabilistic Generative Framework (Chen et al.
    2009)
  • Criterion: maximum a posteriori (MAP), i.e., rank
    the sentences S of a document D by
    P(S | D) ∝ P(D | S) P(S)
  • Sentence Generative Model
  • Each sentence of the document is treated as a
    probabilistic generative model
  • Language Model (LM), Sentence Topic Model (STM),
    and Word Topic Model (WTM) are initially
    investigated
  • Sentence Prior Model
  • The sentence prior is simply set to uniform here
  • Or it may depend on duration/position,
    correctness of sentence boundaries, confidence
    scores, prosodic information, etc.

21
Single-Document Summarization: Unsupervised
Methods (cont.)
  • Language Model (LM) Approach (Literal Term
    Matching)
  • Sentence Topic Model (STM) Approach (Concept
    Matching)
  • Word Topic Model (WTM) Approach (Concept Matching)

For the LM approach, each sentence S of a document D
is scored by
  P(D | S) = Π_{w ∈ D} [λ P(w | S) + (1 − λ) P(w | C)]^{c(w, D)}
where P(w | S) is the sentence model, P(w | C) the
collection model, and λ a weighting parameter
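
A sketch of this LM scoring rule, assuming tokenized sentences and a precomputed collection word-count table; λ = 0.7 is illustrative:

```python
import math
from collections import Counter

def lm_log_score(doc_tokens, sent_tokens, coll_counts, coll_total, lam=0.7):
    """log P(D | S): the sentence model linearly interpolated with the
    collection model, summed over document words with counts c(w, D)."""
    sent = Counter(sent_tokens)
    score = 0.0
    for w, c in Counter(doc_tokens).items():
        p_s = sent[w] / max(1, len(sent_tokens))     # sentence model
        p_c = coll_counts.get(w, 0) / coll_total     # collection model
        score += c * math.log(lam * p_s + (1 - lam) * p_c + 1e-12)
    return score  # rank the document's sentences by this score
```
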
22
Multi-Document Summarization
  • Task characteristics
  • Input: a set of documents on the same topic
  • Retrieved during an IR search
  • Clustered by a news browser
  • Problem: same topic or same event?
  • Output: a paragraph-length summary
  • Salient information across documents
  • Similarities between topics?
  • Redundancy removal is critical
  • Application-oriented task
  • News portals, presenting articles from different
    sources
  • Corporate emails organized by subject
  • Medical reports about a patient

23
Evaluation
  • Recall-Oriented Understudy for Gisting Evaluation
    (ROUGE) (Lin 2004)
  • Let {R_1, ..., R_M} be a set of reference
    summaries, and let S be a summary generated
    automatically by a system. Let c_n(·) be a
    binary vector representing the n-grams contained
    in a document
  • The metric ROUGE-N is an n-gram recall based
    statistic:
    ROUGE-N = Σ_{j=1..M} ⟨c_n(R_j), c_n(S)⟩
              / Σ_{j=1..M} ⟨c_n(R_j), c_n(R_j)⟩
  • where ⟨·, ·⟩ denotes the usual inner product of
    vectors
  • The various versions of ROUGE were evaluated by
    computing the correlation coefficient between
    ROUGE scores and human judgment scores
  • ROUGE-2 performed the best among the ROUGE-N
    variants
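
A sketch of ROUGE-N using clipped n-gram counts, which coincides with the binary inner-product formulation above when each n-gram occurs at most once:

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))

def rouge_n(references, system, n=2):
    """N-gram recall: matched reference n-grams over the total number
    of n-grams in the reference set."""
    sys_grams = ngrams(system, n)
    matched = total = 0
    for ref in references:
        ref_grams = ngrams(ref, n)
        matched += sum(min(c, sys_grams[g]) for g, c in ref_grams.items())
        total += sum(ref_grams.values())
    return matched / max(1, total)
```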

24
Evaluation (cont.)
  • Lin et al. (2006) also proposed an
    information-theoretic method for the automatic
    evaluation of summaries
  • The central idea is to use a divergence measure
    (i.e., the Jensen-Shannon divergence) between a
    pair of probability distributions
  • The first distribution is derived from an
    automatic summary and the second from a set of
    reference summaries
  • Let D be the set of documents to summarize
  • A distribution parameterized by θ_R generates
    the reference summaries
  • A summarization system is governed by some
    distribution θ_A
  • We may define a good summarizer as one for which
    θ_A is close to θ_R
  • One information-theoretic measure between
    distributions that is adequate for this is the
    KL divergence:
    KL(θ_A || θ_R) = Σ_x θ_A(x) log(θ_A(x) / θ_R(x))
  • However, the KL divergence is unbounded and goes
    to infinity whenever θ_R vanishes and θ_A
    does not
  • Another problem is that the KL divergence is not
    symmetric

25
Evaluation (cont.)
  • Hence, they propose to use the Jensen-Shannon
    divergence, which is bounded and symmetric:
    JS(p || q) = (1/2) [KL(p || m) + KL(q || m)]
  • where m = (p + q) / 2
  • To evaluate a summary S given a reference
    summary R, the negative JS divergence
    −JS(S || R) can be used for the purpose
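
A minimal sketch of the measure, assuming p and q are probability distributions over the same vocabulary, represented as aligned arrays:

```python
import numpy as np

def js_divergence(p, q):
    """JS(p || q) = ½ KL(p || m) + ½ KL(q || m), with m = (p + q) / 2;
    bounded and symmetric, unlike the KL divergence."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                     # treat 0 · log 0 as 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# A system summary is then scored by the negative divergence,
# -js_divergence(p_system, p_reference): larger is better.
```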