Automatic Summarization - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
Automatic Summarization
  • Introduction to summarization
  • The Automatic Creation of Literature Abstracts
    from H.P. Luhn
  • New Methods in Automatic Extracting from H.P.
    Edmundson

2
What is the problem?
  • We cannot develop useful summarizing systems
    unless we pay attention to the context factors and
    purpose factors of a summary
  • We should, nevertheless, concentrate on
    relatively shallow techniques, because we will
    never be able to fully emulate human summarizing
  • The limitations of technology (especially in the
    1960s) demand careful identification of the
    summarization tasks and of the conditions under
    which they can be applied

3
What is a summary?
  • An intuitive, informal and obvious definition:
  • A summary is a reductive transformation of source
    text to summary text through content reduction by
    selection and/or generalization on what is
    important in the source.
  • This definition leads to a basic, simple
    processing model:
  • I: interpretation of the source text into a
    source text representation
  • T: transformation of the source representation
    into a summary text representation
  • G: generation of summary text from the summary
    representation
  • Each stage can have several substages
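
The I-T-G model above can be sketched as a three-stage pipeline. This is a minimal illustrative sketch only; the function names and the naive "keep the longest sentences" transformation are assumptions, not part of the model itself.

```python
# A minimal sketch of the I-T-G processing model.
# The "keep the n longest sentences" transformation is an
# illustrative placeholder, not a real summarization strategy.

def interpret(source_text):
    """I: interpret the source text into a source representation
    (here simply a list of sentences)."""
    return [s.strip() for s in source_text.split(".") if s.strip()]

def transform(source_repr, n=2):
    """T: transform the source representation into a summary
    representation (here: select the n longest sentences)."""
    return sorted(source_repr, key=len, reverse=True)[:n]

def generate(summary_repr):
    """G: generate summary text from the summary representation."""
    return ". ".join(summary_repr) + "."

text = ("Summarization is hard. It requires capturing important content. "
        "Short note. Each stage of the model can have several substages.")
print(generate(transform(interpret(text))))
```

Each of the three functions could itself be decomposed into substages, matching the last bullet above.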

4
Why is summarization so hard?
  • We have to characterize a whole text without
    human intuition!
  • Capture the important content (a matter of both
    information and expression)
  • Efforts:
  • first research in the sixties (Luhn's paper)
  • two decades with little activity
  • marked growth since the eighties (information
    era)
  • All work falls under two headings:
  • text extraction
  • fact extraction
  • The two approaches are complementary

5
Text extraction
  • What you see is what you get!
  • Open approach:
  • no prior assumption about which content is
    important
  • let the important content emerge (individually
    appropriate from each source)
  • key text is identified by a mix of statistical,
    location and cue-word criteria
  • the summary stays close to the source
    (linguistically, structurally and in order of
    presentation)
  • the resulting view of the source can be obscure
    (sentences are not very coherent)
  • Advantage: generality
  • Disadvantage: low quality (weak, indirect methods
    are not effective enough at detecting important
    material and presenting it as well-organized text)

6
Fact extraction
  • What you know is what you get!
  • Closed approach:
  • intended to find individual manifestations of
    important notions regardless of their status in
    the source
  • the decision about what content is sought has
    already been made (prior selection of the type of
    information)
  • no independent source text representation (direct
    insertion of source material, with some
    modification, into a frame such as a template or
    schema)
  • the approach will certainly need natural language
    generation
  • the approach allows only one view of the subject,
    regardless of whether this was the view of the
    author
  • Advantage: better quality output in substance and
    presentation
  • Disadvantage: the required type of information
    has to be specified (and may not be important for
    the source itself) → not very flexible

7
Context factors
  • Why should any one specific technique give the
    best, or even an acceptable, result regardless of
    the properties of the input source?
  • The answer is given by practical research:
  • general summarizing strategies have to pay great
    attention to context factors
  • effective summarizing requires an explicit and
    detailed analysis of context factors
  • capturing context factors precisely enough to
    guide summarizing is very hard
  • Three classes of context factors:
  • Input
  • Purpose
  • Output
  • a summary may cover a source text only partially,
    because only a certain kind of information is
    sought

8
Context factors - INPUT
  • Input
  • Form
  • structure (headings, rhetorical patterns)
  • scale (book, short text)
  • medium (natural language, sublanguage)
  • genre (description, narrative)
  • Subject type
  • Ordinary
  • specialized (source text from specific journals)
  • restricted (local names or facts)
  • Unit
  • summarizing over a single input text (material
    previously brought together)
  • summarizing over multiple sources

9
Context factors - PURPOSE
  • Purpose (most important: it guides the choice of
    summarizing strategy, but in automatic
    summarization it is often not recognized)
  • Situation: the context within which the summary
    is to be used
  • tied (the environment in which the summary is
    used is known)
  • floating (no precise context)
  • Audience:
  • class of readers (domain knowledge, language
    skill, etc.)
  • untargeted (e.g. large variance in experience or
    interest)
  • targeted
  • Use (what is the summary for?)
  • retrieving the source text, as a kind of preview
  • a device for refreshing memory (text already
    read)

10
Context factors - OUTPUT
  • Output
  • Material
  • to what extent should the summary capture the
    source text?
  • special summary types can cover particular source
    information (partial summary)
  • Format
  • running text
  • headed summaries (fields, standardized parts)
  • Style
  • informative (what does the source text say)
  • indicative (notes that the source is about a
    topic)
  • aggregative (multiple documents are set in
    relation to one another)

11
Context factors - Example
  • book review summaries for librarian purchasers
    who have to buy new books for their library
  • input factors
  • simple running text
  • variable in scale
  • literary prose as medium
  • single units
  • purpose factors
  • floating situation (no deep knowledge about
    readers)
  • untargeted audience (general or professional
    education)
  • use as review
  • output factors
  • should not cover only a selection of the books
    (the librarian needs an overview of the whole set)
  • simple running text attached to bibliographic
    header
  • style should be indicative

12
The Automatic Creation of Literature Abstracts
(1957)
  • Presented by H. P. Luhn at the IRE National
    Convention in New York, 1958
  • Principle:
  • the machine selects those sentences of an article
    that are most representative
  • these sentences are enumerated and used to judge
    the character of the article

13
The Automatic Creation of Literature Abstracts
  • How does the machine know which sentences are
    important?
  • the significance factor of a sentence is derived
    from an analysis of its words
  • the significance of a word is based on:
  • the frequency of its occurrence (list of words in
    descending order of frequency)
  • its relative position within a sentence
  • the method does not differentiate between word
    forms (no stemming; nevertheless, similar words as
    well as stop words were eliminated by Luhn's
    algorithm)
  • no attention is paid to the logical or semantic
    relationships the author has produced

14
Was Luhn lazy?
  • Why did Luhn use this technique if he knew about
    the insufficiencies of the approach?
  • The approach is very economical in terms of
    computation time (very important in 1957)
  • In technical writing it is very unlikely that a
    word has more than one notion or that the author
    uses different words to express the same notion
    (often very few synonyms; much repetition)

15
Form of Word Frequency/Cut-Off lines
  • noise in the system is caused by very common
    words
  • reduce noise with a stored common-word list
  • a simpler way: determine a high-frequency and a
    low-frequency cut-off, the so-called confidence
    limits (problem with words like "cell" in
    literature about biology)
  • resolving power / discrimination power

16
Improvement
  • the purely statistical and physical approach,
    without considering meaning or topic, can be
    slightly improved
  • Assumptions:
  • the more closely certain words are associated,
    the more specifically an aspect of the subject is
    being treated
  • wherever the greatest number of frequently
    occurring words is found in physical proximity to
    each other, the probability is high that the
    information covered by this physical part (a
    sentence) is most representative of the article
  • Significance of proximity:
  • based on characteristics of spoken/written human
    language
  • ideas closely linked to each other are also
    closely associated physically
  • the division of a text (chapters, paragraphs,
    sentences, ...) is a physical manifestation of
    the author's structure of thinking

17
Significance factor
  • Luhn therefore wants his significance factor to
    reflect the following:
  • the number of occurrences of significant words in
    a sentence
  • the linear distance between them
  • the number of non-significant words in the
    sentence
  • rank sentences according to significance
  • pick the sentences with the highest ranks
  • this significance factor rates only the
    relationship of significant words to each other,
    not their distribution over the whole sentence
  • Obvious improvement:
  • consider only portions of sentences (clusters)
  • a cluster is bracketed by significant words
  • set a limit on the maximal distance between
    significant words (useful: four or five
    non-significant words between significant words)
  • if two or more clusters exist in a sentence, take
    the cluster with the highest significance and rate
    the corresponding sentence with it

18
Cluster-Significance
  • significance of a cluster:
  • a: the number of significant words in the cluster
  • b: the total number of words in the cluster
  • a·a / b is the significance of the cluster
  • the formula was confirmed by several experiments
  • Problem:
  • the resolving power of this method depends on the
    number of words comprising the article
  • the power decreases with an increasing number of
    words
  • Solution:
  • perform the evaluation on subdivisions of the
    article (subdivisions are often provided by the
    author; otherwise divisions can be made
    arbitrarily)
  • the highest-ranking sentences from each
    subdivision constitute the auto-abstract
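
The clustering and scoring described on these two slides can be sketched as follows. This is an illustrative reconstruction under the stated assumptions: clusters are bracketed by significant words with at most `gap` non-significant words between consecutive ones, and a cluster scores a·a / b (a = significant words in it, b = its total length).

```python
# Sketch of Luhn's cluster scoring. A sentence is rated by its
# best cluster: consecutive significant words at most `gap`
# non-significant words apart, scored as a*a / b where a is the
# number of significant words and b the total cluster length.

def sentence_significance(words, significant, gap=4):
    """Score a tokenized sentence by its highest-scoring cluster."""
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]   # current cluster boundaries
    count = 1                     # significant words in current cluster
    for p in positions[1:]:
        if p - prev - 1 <= gap:   # still inside the same cluster
            count += 1
        else:                     # close the cluster, start a new one
            best = max(best, count * count / (prev - start + 1))
            start, count = p, 1
        prev = p
    best = max(best, count * count / (prev - start + 1))
    return best

sent = "the method ranks significant words by frequency and position".split()
sig = {"significant", "words", "frequency", "position"}
print(sentence_significance(sent, sig))
```

Ranking all sentences of a subdivision by this score and keeping the top ones yields the auto-abstract described above.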

19
Modifications - Use
  • modifications for special abstracts:
  • condensation of a document:
  • adjust the cut-off value of sentence significance
  • print out a certain number of the most
    significant sentences (if a fixed number of
    sentences is required)
  • condensation of a document using a relationship
    to another source or field of interest:
  • assign premium values to a predetermined class of
    words (for example, interest: chemical
    substances; source: article about farming in the
    USA)
  • it is also possible that an article contains only
    sentences of minor importance (none goes beyond a
    certain value) → the article might be rejected as
    too generalized or not suitable
  • auto-abstracting could be used to alleviate the
    translation burden
  • the methods could provide key words for encoding
    documents or books

20
New Methods in Automatic Extracting (ca. 1961)
  • the work was initially conducted at
    Thompson-Ramo-Wooldridge, Inc. with the support
    of the Rome Air Development Center and later
    passed to other research institutions
  • Principle:
  • use characteristics of the abstracting behaviour
    of humans
  • replace the subjective notion of "significance"
    with a procedure
  • use four different methods to produce automatic
    abstracts
  • allow easy modification of these methods by
    adjusting parameters
  • offer a selection of these methods and their
    combinations for usage

21
Select Corpus
  • 200 documents in the fields of physical, life
    and information science, the humanities, ... (a
    heterogeneous corpus) were used to determine
    initial weights, parameters and preliminary
    statistical data (common words, sentence length,
    sentence position)
  • 200 documents in the field of chemistry were used
    for the extracting experiments (technical
    reports, highly formatted, with many equations
    and much experimental data)
  • The experimental corpus was divided into:
  • an experimental library (database for
    experimentation)
  • a test library, reserved for evaluation of the
    program

22
Study summary characteristics
  • Target extracts (produced by humans)
  • Instruction set for the human extractors to
    follow:
  • sentences are selected when eligible in terms of
    content:
  • What? (general subject)
  • Why? (intent of the author)
  • How? (methods used to conduct the research)
  • Conclusions/Findings
  • Generalization
  • ...
  • minimize redundancy
  • maximize coherence
  • number of sentences to use for the abstract (a
    length of 25% of the sentences in the document is
    not optimal)

23
Principles for Automation - Characteristics
  • Detect and use ALL content and format clues to
    the relative importance of sentences that were
    provided by the author
  • Employ mechanizable criteria of selection and
    rejection (reward weights, penalty weights)
  • Employ system parameters
  • Employ a weighting function of several linguistic
    factors
  • Set up computable characteristics of the text:
  • a text characteristic is positively relevant if
    it tends to be associated with the manually
    selected sentences
  • a text characteristic is negatively relevant if
    it tends to be associated with the unselected
    sentences
  • a text characteristic is irrelevant if it tends
    to be associated equally with selected and
    unselected sentences

24
Four Basic Methods
  • System is based on assigning numerical weights to
    sentences
  • Assigned weights are functions of the weights
    assigned to certain characteristics or clues
  • Sentence weights are the sum of the weights of
    these characteristics
  • Four different methods are used which apply
    different sets of clues to the source
  • cue method
  • key method
  • title method
  • location method
  • Several word lists are needed for each method

25
Word lists
  • Necessary to distinguish two types of word lists
  • Dictionary
  • List of words with numerical weights
  • Fixed input to the extracting system
  • Independent of the words in the document to be
    extracted
  • Glossary
  • List of words with numerical weights
  • Variable input to the extracting system
  • Contains words selected from the document to be
    extracted

26
Cue Method
  • Hypothesis:
  • relevance is affected by pragmatic words
    ("significant", "hardly")
  • Uses a prestored Cue dictionary which comprises
    sub-dictionaries for:
  • bonus words (positively relevant)
  • stigma words (negatively relevant)
  • null words (irrelevant)
  • The Cue dictionary is obtained from documents for
    which target extracts have been created,
    considering:
  • frequency
  • dispersion (the number of documents in which the
    word occurs)
  • selection ratio (the ratio of the frequency in
    extractor-selected sentences to the frequency in
    all sentences)
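
Scoring a sentence with such a dictionary is a simple weighted lookup. The tiny dictionary and the weight values below are illustrative assumptions; Edmundson derived his from the documents with target extracts.

```python
# Sketch of the Cue Method: sum prestored dictionary weights over the
# words of a sentence. Bonus words carry positive weights, stigma
# words negative ones; words missing from the dictionary act as null
# words. Entries and weights here are illustrative assumptions.

CUE_DICTIONARY = {
    "significant": 1.0,   # bonus word  (positively relevant)
    "conclude":    1.0,   # bonus word
    "hardly":     -1.0,   # stigma word (negatively relevant)
    "impossible": -1.0,   # stigma word
}

def cue_weight(sentence):
    """Cue weight of a sentence: sum of its words' dictionary weights."""
    return sum(CUE_DICTIONARY.get(w.lower().strip(".,"), 0.0)
               for w in sentence.split())

print(cue_weight("We conclude the results are significant."))
print(cue_weight("The effect is hardly measurable."))
```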

27
Key method
  • equal to the method proposed by Luhn
  • Hypothesis:
  • high-frequency content words are positively
    relevant
  • Compiles a Key glossary:
  • take all words not in the Cue dictionary
  • sort them in decreasing order of frequency
  • cut off all words with a frequency lower than a
    threshold
  • assign positive weights equal to their
    frequencies
  • A later improvement uses a fractional threshold:
  • take a fixed percentage of keywords from the
    document
  • weights are equal to their relative frequency
    over all words in the document
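
Building the Key glossary from a document can be sketched in a few lines. The threshold value and the miniature cue dictionary are assumptions for illustration.

```python
# Sketch of compiling a Key glossary: count words not in the Cue
# dictionary and keep those at or above a frequency threshold,
# weighting each keyword by its frequency. The threshold value is
# an illustrative assumption.

from collections import Counter

def key_glossary(words, cue_dictionary, threshold=2):
    """Return {keyword: weight} for frequent non-cue words."""
    freq = Counter(w for w in words if w not in cue_dictionary)
    return {w: c for w, c in freq.items() if c >= threshold}

doc = "acid reacts with base acid forms salt acid and water".split()
glossary = key_glossary(doc, cue_dictionary={"significant", "hardly"})
print(glossary)
```

Unlike the Cue dictionary, this glossary is variable input: it is recomputed from each document to be extracted.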

28
Title method
  • Clues are characteristics of the skeleton of the
    document
  • Hypotheses:
  • the author conceives the title as circumscribing
    the subject matter
  • the partition of the body of a document into
    major sections calls for summarization by
    appropriate headings
  • words of the title and headings are positively
    relevant
  • Compiles a Title glossary:
  • take all non-null words of the title, subtitle
    and headings
  • assign positive weights
  • the weights assigned were determined on the basis
    of their effect in the combined weighting scheme
    of the four methods
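
A minimal sketch of the Title glossary, under assumptions: the null-word list and the uniform weight of 1 per matching word are illustrative, since the actual weights were tuned in the combined scheme.

```python
# Sketch of the Title Method: collect non-null words from the title
# and headings into a glossary, then weight a sentence by how many
# glossary words it contains. Null-word list and per-hit weight of 1
# are illustrative assumptions.

NULL_WORDS = {"the", "of", "a", "in", "and"}

def title_glossary(title, headings):
    """Non-null words of the title and headings."""
    words = title.split() + [w for h in headings for w in h.split()]
    return {w.lower() for w in words if w.lower() not in NULL_WORDS}

def title_weight(sentence, glossary):
    """Count glossary words occurring in the sentence."""
    return sum(1 for w in sentence.lower().split() if w in glossary)

g = title_glossary("Automatic Extracting of Documents",
                   ["Introduction", "Extracting Methods"])
print(title_weight("automatic methods for extracting sentences", g))
```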

29
Location Method
  • Clues are provided by the skeleton of a document
    (headings, format)
  • Hypotheses:
  • sentences occurring under certain headings are
    positively relevant
  • topic sentences tend to occur early or late in a
    document and in its paragraphs
  • Uses a prestored Heading dictionary of selected
    words (from the corpus) that appear in headings
    ("Introduction", "Purpose", "Conclusions")
  • assign positive weights provided by the Heading
    dictionary
  • assign positive weights to sentences according to
    their ordinal position in the text (first/last
    paragraph, first/last sentence in paragraphs)
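
The two location clues combine into one score per sentence. The specific weight values below are illustrative assumptions; Edmundson's were tuned against the target extracts.

```python
# Sketch of the Location Method: a heading bonus plus ordinal-position
# bonuses for first/last paragraphs and first/last sentences within a
# paragraph. All weight values are illustrative assumptions.

HEADING_DICTIONARY = {"introduction": 1.0, "purpose": 1.0,
                      "conclusions": 1.0}

def location_weight(heading, para_index, n_paras, sent_index, n_sents):
    """Location weight of a sentence from its structural position."""
    w = HEADING_DICTIONARY.get(heading.lower(), 0.0)
    if para_index in (0, n_paras - 1):   # first or last paragraph
        w += 1.0
    if sent_index in (0, n_sents - 1):   # first or last sentence
        w += 1.0
    return w

# First sentence of the first paragraph under "Introduction":
print(location_weight("Introduction", 0, 5, 0, 4))
```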

30
Results
  • A linear function for evaluating the
    significance of a sentence:
  • a1C + a2K + a3T + a4L
  • where the ai (1 ≤ i ≤ 4) are the parameters for
    the Cue, Key, Title and Location weights
  • Evaluation: mean percentages of the number of
    sentences co-selected by both the automatic and
    the target extracts
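
The combined score is just this weighted sum; only the parameter values need tuning. The parameters below are placeholders, since the tuned values came from Edmundson's experimental library.

```python
# The combined weighting a1*C + a2*K + a3*T + a4*L as a function.
# The default parameters are illustrative assumptions, not the
# values Edmundson obtained from tuning.

def sentence_weight(C, K, T, L, a=(1.0, 1.0, 1.0, 1.0)):
    """Linear combination of Cue, Key, Title and Location weights."""
    return a[0] * C + a[1] * K + a[2] * T + a[3] * L

print(sentence_weight(C=2.0, K=3.0, T=1.0, L=2.0))
```

Setting one of the parameters to zero switches the corresponding method off, which is how individual methods and their combinations can be compared.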

31
The Future (beyond the 1960s)
  • research should involve sharper statistical
    analysis
  • discover machine-recognizable clues to determine
    the proper length of an abstract (25% of all
    sentences is inadequate)
  • the extent to which redundancy appears in
    automatic extracts, and ways of minimizing it,
    should be investigated
  • linguistic clues to coherence should be
    identified and expressed in a computable form