Title: A Survey on Automatic Text/Speech Summarization
1 A Survey on Automatic Text/Speech Summarization
- Shih-Hsiang Lin
- Department of Computer Science and Information Engineering
- National Taiwan Normal University
- References
- D. Das and A. F. T. Martins, "A Survey on Automatic Text Summarization," 2007
- Y. T. Chen et al., "A probabilistic generative framework for extractive broadcast news speech summarization," IEEE Trans. on ASLP, 2009
- E. Hovy's tutorial, "Automated Text Summarization Tutorial," COLING/ACL 1998
- D. Radev's tutorial, "Text Summarization," SIGIR 2004
- Berlin Chen's lecture, "A Brief Review of Extractive Summarization Research," 2008
2 NLP-Related Technologies
3 Outline
- Introduction
- Single-Document Summarization
- Early Work
- Supervised Methods
- Unsupervised Methods
- Multi-Document Summarization
- Not available yet
- Evaluation
- ROUGE
- Information-Theoretic Method
4 Introduction
- The subfield of summarization has been investigated by the NLP community for nearly the last half century
- A summary is "a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that" (Radev, 2000)
- Summaries may be produced from a single document or multiple documents
- Summaries should preserve important information
- Summaries should be short
- Terminology used in summarization
- Extraction: identifying important sections of the text
- Abstraction: producing the important material in a new way
- Fusion: combining extracted parts coherently
- Compression: throwing out unimportant sections of the text
- Indicative vs. Informative vs. Critical
- Generic vs. Query-Oriented
- Single-Document Summarization vs. Multi-Document Summarization
5 Introduction (cont.)
- Input (Jones, 1997)
- Subject type: domain
- Genre: newspaper articles, editorials, letters, reports...
- Form: regular text structure vs. free-form
- Source size: single doc vs. multiple docs (few vs. many)
- Purpose
- Situation: embedded in a larger system (MT, IR) or not?
- Audience: focused or general
- Usage: IR, sorting, skimming...
- Output
- Completeness: include all aspects, or focus on some?
- Format: paragraph, table, etc.
- Style: informative, indicative, critical...
This slide was adapted from Prof. Hovy's presentation
6 Introduction (cont.)
This slide was adapted from Prof. Hovy's presentation
7 Introduction (cont.)
- A brief history of summarization
8 Speech Summarization
- Fundamental problems with speech summarization
- Disfluencies, hesitations, repetitions, repairs, etc.
- Difficulties of sentence segmentation
- More spontaneous speech (e.g., interviews in broadcast news) is less amenable to standard text summarization
- Speech recognition errors
- Speech Summarization
- Speech-to-text summarization
- The documents can be easily looked through
- The parts of the documents that are interesting to users can be easily extracted
- Information extraction and retrieval techniques can be easily applied to the documents
- Speech-to-speech summarization
- Wrong information due to speech recognition errors can be avoided
- Prosodic information, such as the emotion of speakers, that is conveyed only by speech can be presented
This slide was adapted from Prof. Furui's presentation
9 Single-Document Summarization: Early Work
- The most cited paper on summarization is that of (Luhn, 1958)
- The frequency of a particular word in an article provides a useful measure of its significance
- There are also several key ideas put forward in this paper that have assumed importance in later work on summarization (a minimal sketch follows this list)
- Words were stemmed to their root forms, and stop words were deleted
- Luhn compiled a list of content words sorted by decreasing frequency, the index providing a significance measure of the word
- A significance factor was derived that reflects the number of occurrences of significant words within a sentence
- All sentences are ranked in order of their significance factor, and the top-ranking sentences are finally selected to form the auto-abstract
- Baxendale also suggested that sentence position is helpful in finding salient parts of documents (Baxendale, 1958)
- He examined 200 paragraphs and found that in 85% of the paragraphs the topic sentence came first, while 7% of the time it was the last sentence
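As a concrete illustration, here is a minimal Python sketch of Luhn-style scoring. The stop word list, the choice of the ten most frequent content words as "significant," and the use of whole sentences in place of Luhn's word spans are simplifying assumptions; stemming is omitted.

```python
# Minimal sketch of Luhn-style extraction (all names and constants are ours).
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "that"}

def luhn_summary(text, top_n=3):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w.lower() for w in re.findall(r"[a-zA-Z]+", text)]
    freq = Counter(w for w in words if w not in STOPWORDS)
    # Treat the most frequent content words as "significant" (Luhn's idea).
    significant = {w for w, _ in freq.most_common(10)}

    def score(sentence):
        tokens = [w.lower() for w in re.findall(r"[a-zA-Z]+", sentence)]
        hits = sum(1 for t in tokens if t in significant)
        # Significance factor: (number of significant words)^2 / span length;
        # the span is approximated by the whole sentence here.
        return hits ** 2 / len(tokens) if tokens else 0.0

    ranked = set(sorted(sentences, key=score, reverse=True)[:top_n])
    return [s for s in sentences if s in ranked]  # restore document order
```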
10 Single-Document Summarization: Early Work (cont.)
- Edmundson (1969) describes a system that produces document extracts
- His primary contribution was the development of a typical structure for an extractive summarization experiment (400 technical documents)
- Four kinds of features are used (a weighted-sum sketch follows this list)
- Word frequency, positional feature
- Cue words: the presence of words like "significant" or "hardly"
- The skeleton of the document: whether the sentence is a title or heading
- Weights were attached to each of these features manually to score each sentence
- About 44% of the auto-extracts matched the manual extracts
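A hedged sketch of the manual weighted-sum scoring; the feature names and the weight values below are illustrative, not taken from the paper.

```python
# Edmundson-style linear scoring over hand-set feature weights.
def edmundson_score(features, weights=None):
    """features: dict mapping 'cue', 'key', 'title', 'position' to the raw
    feature score of one sentence."""
    if weights is None:
        # Manually tuned weights, in the spirit of the paper; these exact
        # values are hypothetical.
        weights = {"cue": 1.0, "key": 0.5, "title": 1.0, "position": 1.0}
    return sum(weights[name] * value for name, value in features.items())
```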
11 Single-Document Summarization: Supervised Methods
- In the 1990s, with the advent of machine learning techniques in NLP, a series of seminal publications appeared that employed statistical techniques to produce document extracts
- Kupiec et al. (1995) used a naive Bayes classifier to categorize each sentence as worthy of extraction or not (a scoring sketch follows this list)
- Let $s$ be a particular sentence, $S$ the set of sentences that make up the summary, and $F_1, \dots, F_k$ the features
- Assuming independence of the features:
$$P(s \in S \mid F_1, \dots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\, P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$$
- Two additional features are used: sentence length and the presence of uppercase words
- Feature analysis revealed that a system using only the position and the cue features, along with the sentence length, performed best
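A minimal sketch of ranking by the naive Bayes posterior above, assuming the per-feature probability tables have already been estimated from a labeled corpus of (document, manual extract) pairs; all names are ours.

```python
# Kupiec-style naive Bayes sentence scoring with discrete feature values.
import math

def naive_bayes_score(features, p_feature_given_summary, p_feature, p_summary):
    """Return log P(s in S | F1..Fk) under the independence assumption.

    features: list of observed feature values for one sentence
    p_feature_given_summary[j][v]: P(F_j = v | s in S)
    p_feature[j][v]: P(F_j = v)
    p_summary: prior P(s in S)
    """
    log_score = math.log(p_summary)
    for j, v in enumerate(features):
        log_score += math.log(p_feature_given_summary[j][v])
        log_score -= math.log(p_feature[j][v])
    return log_score  # rank sentences by this value
```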
12 Single-Document Summarization: Supervised Methods (cont.)
- Aone et al. (1999) also incorporated a naive Bayes classifier, but with richer features
- Signature words derived from term frequency (TF) and inverse document frequency (IDF)
- Named-entity tagger
- Shallow discourse analysis
- Synonyms and morphological variants were also merged (accomplished using WordNet)
- Lin and Hovy (1997) studied the importance of the sentence position feature
- Since the discourse structure varies significantly across domains, they made an important contribution by investigating techniques for tailoring the position method toward optimality over a genre
- They measured the yield of each sentence position against the topic keywords
- They then ranked the sentence positions by their average yield to produce the Optimal Position Policy (OPP) for topic positions for the genre
13 Single-Document Summarization: Supervised Methods (cont.)
- Lin (1999) broke away from the assumption that features are independent of each other
- He tried to model the problem of sentence extraction using decision trees, instead of a naive Bayes classifier (see the sketch after this list)
- Some novel features were introduced in his paper
- Query signature: normalized score given to sentences depending on the number of query words they contain
- IR signature: score given to sentences depending on the number and scores of IR signature words included (the m most salient words in the corpus)
- Average lexical connectivity: the number of terms shared with other sentences divided by the total number of sentences in the text
- Numerical data: value 1 when the sentence contains a number
- Proper name, pronoun or adjective, weekday or month, quotation (similar to the previous feature)
- Sentence length, sentence order
- Feature analysis suggested that the IR signature was a valuable feature, corroborating the early findings of Luhn (1958)
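A hedged sketch of decision-tree sentence extraction, using scikit-learn in place of the original learner. Rows of X are per-sentence feature vectors (query signature, IR signature, lexical connectivity, ...); y marks whether the sentence appeared in the reference extract. The depth limit is our choice.

```python
# Decision-tree extractor in the spirit of Lin (1999).
from sklearn.tree import DecisionTreeClassifier

def train_extractor(X, y):
    clf = DecisionTreeClassifier(max_depth=5)  # depth limit is illustrative
    clf.fit(X, y)
    return clf

def rank_sentences(clf, X_doc):
    # Rank sentences by the probability of the "in summary" class.
    return clf.predict_proba(X_doc)[:, 1].argsort()[::-1]
```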
14 Single-Document Summarization: Supervised Methods (cont.)
- Conroy and O'Leary (2001) modeled the problem of extracting a sentence from a document using a hidden Markov model (HMM); a parameter-estimation sketch follows this list
- The HMM was structured as follows
- $2s+1$ states, alternating between $s$ summary states and $s+1$ non-summary states
- Hesitation (staying in the same state) was allowed only in non-summary states, and skipping the next state only in summary states
- The transition matrix can be estimated from a training corpus
- Element $(i, j)$ is the empirical probability of transitioning from state $i$ to state $j$
- Associated with each state $i$ was an output function $b_i(\mathbf{o})$
- The features are assumed to be multivariate normally distributed
- The training data are used to compute the maximum likelihood estimate of the mean and covariance matrix (shared covariance)
- Three features are used: position of the sentence, number of terms in the sentence, and likeliness of the sentence terms given the document terms
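A hedged sketch of the two estimation steps above: empirical transition probabilities and shared-covariance Gaussian emission parameters. The per-sentence state labels (summary/non-summary, unrolled into the state chain) are assumed given; all names are ours.

```python
import numpy as np

def estimate_transitions(state_sequences, n_states):
    # state_sequences: one list of state indices per training document.
    counts = np.zeros((n_states, n_states))
    for seq in state_sequences:
        for i, j in zip(seq[:-1], seq[1:]):
            counts[i, j] += 1
    return counts / counts.sum(axis=1, keepdims=True)  # row-normalize

def estimate_emissions(features, states, n_states):
    # features: (n_sentences, n_features) array; states: state index per row.
    means = np.array([features[states == k].mean(axis=0)
                      for k in range(n_states)])
    centered = features - means[states]          # deviation from state mean
    cov = centered.T @ centered / len(features)  # covariance shared by states
    return means, cov
```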
15 Single-Document Summarization: Supervised Methods (cont.)
- Osborne (2002) used log-linear models to obviate the assumption of feature independence (a training sketch follows this list)
- Let $y$ be a label, $x$ the item we are interested in labeling, $f_i$ the $i$-th feature, and $\lambda_i$ the corresponding feature weight
- The conditional log-linear model can be stated as follows:
$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big)$$
where $Z(x)$ is a normalization constant
- Osborne added a non-uniform prior to the model, claiming that a plain log-linear model tends to reject too many sentences for inclusion in a summary
- The features included word pairs, sentence length, sentence position, and naive discourse features like "inside introduction" or "inside conclusion"
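A hedged sketch: a binary log-linear model (equivalently, logistic regression) over sentence features, standing in for Osborne's model. The class_weight value below is an illustrative stand-in for the non-uniform prior, not the paper's actual prior.

```python
from sklearn.linear_model import LogisticRegression

def train_loglinear(X, y):
    # class_weight tilts the decision toward including sentences, a rough
    # analogue of the prior discussed above (the value is hypothetical).
    clf = LogisticRegression(max_iter=1000, class_weight={0: 1.0, 1: 2.0})
    clf.fit(X, y)
    return clf
```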
16 Single-Document Summarization: Supervised Methods (cont.)
- Svore et al. (2007) proposed an algorithm based on neural nets and the use of third-party datasets to perform extractive summarization
- They trained a model that could infer the proper ranking of sentences
- The ranking was accomplished using RankNet, which is based on neural networks
- For the training set, they used ROUGE-1 to score the similarity between a human-written highlight and a sentence in the document
- These similarity scores were used as soft labels during training, contrasting with other approaches where sentences are hard-labeled as selected or not
- Another novelty of the framework lay in the use of features that derived information from query logs of Microsoft's news search engine and from Wikipedia entries (third-party datasets)
- They conjectured that if a document sentence contains keywords used in the news search engine, or entities found in Wikipedia articles, then there is a greater chance of that sentence appearing in the highlight
- They generated 10 features for each sentence in each document
- Is First Sentence, Sentence Position, SumBasic score (unigram), SumBasic bigram score, Title similarity score, Average News Query Term Score, News Query Term Sum Score, Relative News Query Term Score, Average Wikipedia Entity Score, Wikipedia Entity Sum Score
17 Single-Document Summarization: Supervised Methods (cont.)
- Other kinds of supervised summarizers include
- Support vector machines (SVM) (Hirao et al. 2002)
- Gaussian mixture models (GMM) (Murray et al. 2005)
- Conditional random fields (CRFs) (Shen et al. 2007)
- In general, extractive summarization can be treated as a two-class (summary/non-summary) classification problem (Lin et al. 2009)
- Each sentence is characterized by a set of representative features
- To summarize documents with different summary ratios, the important sentences of a document can be selected (or ranked) based on the posterior probability of a sentence being included in the summary given its feature set (a ranking sketch follows this list)
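A hedged sketch of this two-class formulation: any probabilistic classifier (naive Bayes, SVM with probability outputs, ...) ranks sentences by posterior probability, and the cut is made at the desired summary ratio. All names are ours.

```python
import numpy as np

def extract(clf, X_doc, sentences, ratio=0.3):
    posteriors = clf.predict_proba(X_doc)[:, 1]   # P(summary | features)
    n_keep = max(1, int(len(sentences) * ratio))  # cut at the summary ratio
    keep = np.argsort(posteriors)[::-1][:n_keep]
    return [sentences[i] for i in sorted(keep)]   # preserve document order
```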
18 Single-Document Summarization: Unsupervised Methods
- Gong (2001) proposed using the vector space model (VSM)
- Vector representations of the sentences and of the document to be summarized are built using statistical weighting, such as TF-IDF
- Sentences are ranked based on their proximity to the document
- Maximum Marginal Relevance (MMR) (Murray et al. 2005) can be applied to summarize more important and more diverse concepts in a document (see the sketch below)
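A hedged sketch of VSM ranking with an MMR-style redundancy penalty, using TF-IDF vectors and cosine similarity from scikit-learn; the trade-off parameter lam and the selection size are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mmr_summary(sentences, n_select=3, lam=0.7):
    vec = TfidfVectorizer()
    S = vec.fit_transform(sentences)             # sentence vectors
    d = np.asarray(S.mean(axis=0))               # document vector
    rel = cosine_similarity(S, d).ravel()        # proximity to the document
    sim = cosine_similarity(S)                   # sentence-sentence similarity

    selected, candidates = [], list(range(len(sentences)))
    while candidates and len(selected) < n_select:
        def mmr(i):
            # Relevance minus the worst redundancy with already-picked ones.
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in sorted(selected)]
```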
19 Single-Document Summarization: Unsupervised Methods (cont.)
- Latent Semantic Analysis (LSA) (Gong 2001); a sketch follows this list
- Construct a term-sentence matrix for a given document
- Perform Singular Value Decomposition (SVD) on the term-sentence matrix
- The right singular vectors with larger singular values represent the dimensions of the more important latent semantic concepts in the document
- Represent each sentence of the document as a semantic vector in the reduced space
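A hedged sketch of the SVD step: build the term-sentence matrix, decompose it, and pick for each leading latent concept the sentence with the largest projection (one common selection strategy attributed to this line of work; the absolute-value choice is ours).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def lsa_summary(sentences, n_select=3):
    # Term-sentence matrix A: rows are terms, columns are sentences.
    A = CountVectorizer().fit_transform(sentences).T.toarray()
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    chosen = []
    # Row k of Vt weights each sentence on the k-th latent concept.
    for k in range(min(n_select, Vt.shape[0])):
        idx = int(np.argmax(np.abs(Vt[k])))
        if idx not in chosen:
            chosen.append(idx)
    return [sentences[i] for i in sorted(chosen)]
```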
20 Single-Document Summarization: Unsupervised Methods (cont.)
- Probabilistic Generative Framework (Chen et al. 2009)
- Criterion: maximum a posteriori (MAP); each sentence $S$ of a document $D$ is ranked by the posterior $P(S \mid D) \propto P(D \mid S)\, P(S)$
- Sentence Generative Model
- Each sentence of the document is treated as a probabilistic generative model
- Language Model (LM), Sentence Topic Model (STM) and Word Topic Model (WTM) are initially investigated
- Sentence Prior Model
- The sentence prior is simply set to uniform here
- Or it may have to do with duration/position, correctness of sentence boundary, confidence score, prosodic information, etc.
21 Single-Document Summarization: Unsupervised Methods (cont.)
- Language Model (LM) Approach (literal term matching)
- Sentence Topic Model (STM) Approach (concept matching)
- Word Topic Model (WTM) Approach (concept matching)
- For the LM approach, the document likelihood can be written as
$$P_{\mathrm{LM}}(D \mid S) = \prod_{w \in D} \big[\lambda\, P(w \mid S) + (1 - \lambda)\, P(w \mid C)\big]^{c(w, D)}$$
where $P(w \mid S)$ is the sentence model, $P(w \mid C)$ is the collection model, and $\lambda$ is a weighting parameter (a scoring sketch follows)
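A hedged sketch of the LM approach above: each sentence is scored by the log-likelihood of the document terms under a smoothed sentence unigram model; all names and the default lam are ours.

```python
from collections import Counter
import math

def lm_score(sentence_tokens, doc_tokens,
             collection_counts, collection_total, lam=0.7):
    sent_counts = Counter(sentence_tokens)
    sent_total = len(sentence_tokens)
    score = 0.0
    for w, c in Counter(doc_tokens).items():
        p_sent = sent_counts[w] / sent_total if sent_total else 0.0
        p_coll = collection_counts.get(w, 0) / collection_total
        p = lam * p_sent + (1 - lam) * p_coll   # interpolated word probability
        if p > 0:
            score += c * math.log(p)            # log P(D | S)
    return score
```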
22 Multi-Document Summarization
- Task Characteristics
- Input: a set of documents on the same topic
- Retrieved during an IR search
- Clustered by a news browser
- Problem: same topic or same event
- Output: a paragraph-length summary
- Salient information across documents
- Similarities between topics?
- Redundancy removal is critical
- Application-oriented task
- News portals, presenting articles from different sources
- Corporate emails organized by subject
- Medical reports about a patient
23 Evaluation
- Recall-Oriented Understudy for Gisting Evaluation (ROUGE) (Lin 2004)
- Let $R = \{r_1, \dots, r_m\}$ be a set of reference summaries, and let $s$ be a summary generated automatically by a system. Let $c_n(\cdot)$ be a binary vector representing the n-grams contained in a document
- The metric ROUGE-N is an n-gram recall based statistic:
$$\mathrm{ROUGE\text{-}N}(s) = \frac{\sum_{i=1}^{m} \langle c_n(r_i), c_n(s) \rangle}{\sum_{i=1}^{m} \langle c_n(r_i), c_n(r_i) \rangle}$$
where $\langle \cdot, \cdot \rangle$ denotes the usual inner product of vectors
- The various versions of ROUGE were evaluated by computing the correlation coefficient between ROUGE scores and human judgment scores
- ROUGE-2 performed the best among the ROUGE-N variants (a computational sketch follows)
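A hedged sketch of ROUGE-N recall using the binary n-gram vectors defined above (the official ROUGE tool uses clipped counts and extra options; this mirrors only the formula given here).

```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def rouge_n(system_tokens, reference_token_lists, n=2):
    sys_ngrams = ngrams(system_tokens, n)
    matched = total = 0
    for ref in reference_token_lists:
        ref_ngrams = ngrams(ref, n)
        matched += len(ref_ngrams & sys_ngrams)  # <c_n(r_i), c_n(s)>
        total += len(ref_ngrams)                 # <c_n(r_i), c_n(r_i)>
    return matched / total if total else 0.0
```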
24 Evaluation (cont.)
- Lin et al. (2006) also proposed an information-theoretic method for the automatic evaluation of summaries
- The central idea is to use a divergence measure (i.e., the Jensen-Shannon divergence) between a pair of probability distributions
- The first distribution is derived from an automatic summary and the second from a set of reference summaries
- Let $D = \{d_1, \dots, d_n\}$ be the set of documents to summarize
- A distribution parameterized by $\theta_R$ generates the reference summaries
- A summarization system is governed by some distribution $\theta_A$
- We may define a good summarizer as one for which $\theta_A$ is close to $\theta_R$
- One information-theoretic measure between distributions that is adequate for this is the KL divergence $KL(\theta_A \,\|\, \theta_R) = \sum_x \theta_A(x) \log \frac{\theta_A(x)}{\theta_R(x)}$
- However, the KL divergence is unbounded: it goes to infinity whenever $\theta_R$ vanishes and $\theta_A$ does not
- Another problem is that the KL divergence is not symmetric
25 Evaluation (cont.)
- Hence, they proposed to use the Jensen-Shannon divergence, which is bounded and symmetric:
$$JS(p \,\|\, q) = \frac{1}{2}\Big[ KL(p \,\|\, m) + KL(q \,\|\, m) \Big]$$
where $m = \frac{1}{2}(p + q)$
- To evaluate a summary $s$ given a reference summary $r$, the negative JS divergence $-JS(\theta_s \,\|\, \theta_r)$ can be used for the purpose (a sketch follows)
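A hedged sketch of the JS-based score: unigram distributions are estimated from the system and reference summaries over their joint vocabulary, and the negative divergence is returned so that higher is better. All names are ours.

```python
from collections import Counter
import math

def distribution(tokens, vocab):
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]

def js_divergence(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # Terms with a_i = 0 contribute nothing; b_i > 0 whenever a_i > 0
        # because m is the average of p and q.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def score_summary(system_tokens, reference_tokens):
    vocab = sorted(set(system_tokens) | set(reference_tokens))
    p = distribution(system_tokens, vocab)
    q = distribution(reference_tokens, vocab)
    return -js_divergence(p, q)   # higher (less negative) is better
```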