Automatic Summarization - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
Automatic Summarization
  • Introduction to summarization
  • The Automatic Creation of Literature Abstracts
    from H.P. Luhn
  • New Methods in Automatic Extracting from H.P.
    Edmundson

2
What is the problem?
  • We cannot develop useful summarizing systems
    unless we pay attention to the context factors and
    purpose factors of a summary
  • We should, nevertheless, concentrate on
    relatively shallow techniques, because we will
    never be able to fully emulate human summarizing
  • The limitations of technology (especially in the
    1960s) demand careful identification of the
    summarization tasks and of the conditions under
    which they can be applied

3
What is a summary?
  • An intuitive, informal and obvious definition:
  • A summary is a reductive transformation of source
    text to summary text through content reduction by
    selection and/or generalization on what is
    important in the source.
  • This definition leads to a basic, simple
    processing model:
  • I: interpretation of the source text into a
    source text representation
  • T: transformation of the source representation
    into a summary text representation
  • G: generation of summary text from the summary
    representation
  • Each stage can have several substages
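
The I-T-G model above can be sketched as a three-stage pipeline. This is a minimal illustrative sketch only; the function names and the naive "keep the longest sentences" transformation are assumptions, not part of the model itself.

```python
# A minimal sketch of the I-T-G processing model.
# The "keep the n longest sentences" transformation is an
# illustrative placeholder, not a real summarization strategy.

def interpret(source_text):
    """I: interpret the source text into a source representation
    (here simply a list of sentences)."""
    return [s.strip() for s in source_text.split(".") if s.strip()]

def transform(source_repr, n=2):
    """T: transform the source representation into a summary
    representation (here: select the n longest sentences)."""
    return sorted(source_repr, key=len, reverse=True)[:n]

def generate(summary_repr):
    """G: generate summary text from the summary representation."""
    return ". ".join(summary_repr) + "."

text = ("Summarization is hard. It requires capturing important content. "
        "Short note. Each stage of the model can have several substages.")
print(generate(transform(interpret(text))))
```

Each of the three functions could itself be decomposed into substages, matching the last bullet above.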

4
Why is summarization so hard?
  • We have to characterize a whole text without
    human intuition!
  • Capture the important content (a matter of both
    information and expression)
  • Efforts:
  • first research in the sixties (Luhn's paper)
  • two decades with little activity
  • marked growth since the eighties (information
    era)
  • All work falls under two headings:
  • text extraction
  • fact extraction
  • The two approaches are complementary

5
Text extraction
  • What you see is what you get!
  • Open approach:
  • no prior assumption about which content is
    important
  • let the important content emerge (individually
    appropriate from each source)
  • key text is identified by a mix of statistical,
    location and cue-word criteria
  • the summary stays close to the source
    (linguistically, structurally and in order of
    presentation)
  • the resulting view of the source can be obscure
    (sentences are not very coherent)
  • Advantage: generality
  • Disadvantage: low quality (weak, indirect methods
    are not effective enough at detecting important
    material and presenting it as well-organized text)

6
Fact extraction
  • What you know is what you get!
  • Closed approach:
  • intended to find individual manifestations of
    important notions regardless of their status in
    the source
  • the decision about what content is sought has
    already been made (prior selection of the type of
    information)
  • no independent source text representation (direct
    insertion of source material, with some
    modification, into a frame such as a template or
    schema)
  • the approach will certainly need natural language
    generation
  • the approach allows only one view of the subject,
    regardless of whether this was the view of the
    author
  • Advantage: better quality output in substance and
    presentation
  • Disadvantage: the required type of information
    has to be specified (and may not be important for
    the source itself) → not very flexible

7
Context factors
  • Why should any one specific technique give the
    best, or even an acceptable, result regardless of
    the properties of the input source?
  • The answer is given by practical research:
  • general summarizing strategies have to pay great
    attention to context factors
  • effective summarizing requires an explicit and
    detailed analysis of context factors
  • capturing context factors precisely enough to
    guide summarizing is very hard
  • Three classes of context factors:
  • Input
  • Purpose
  • Output
  • a summary may cover a source text only partially,
    because only a certain kind of information is
    sought

8
Context factors - INPUT
  • Input
  • Form
  • structure (headings, rhetorical patterns)
  • scale (book, short text)
  • medium (natural language, sublanguage)
  • genre (description, narrative)
  • Subject type
  • Ordinary
  • specialized (source text from specific journals)
  • restricted (local names or facts)
  • Unit
  • summarizing over a single input text (material
    previously brought together)
  • summarizing over multiple sources

9
Context factors - PURPOSE
  • Purpose (most important: it guides the choice of
    summarizing strategy, but in automatic
    summarization it is often not recognized)
  • Situation: the context within which the summary
    is to be used
  • tied (the environment in which the summary is
    used is known)
  • floating (no precise context)
  • Audience:
  • class of readers (domain knowledge, language
    skill, etc.)
  • untargeted (e.g. large variance in experience or
    interest)
  • targeted
  • Use (what is the summary for?)
  • retrieving the source text, as a kind of preview
  • a device for refreshing memory (text already
    read)

10
Context factors - OUTPUT
  • Output
  • Material
  • to what extent should the summary capture the
    source text?
  • special summary types can cover particular source
    information (partial summary)
  • Format
  • running text
  • headed summaries (fields, standardized parts)
  • Style
  • informative (what does the source text say)
  • indicative (notes that the source is about a
    topic)
  • aggregative (multiple documents are set in
    relation to one another)

11
Context factors - Example
  • book review summaries for librarian purchasers
    who have to buy new books for their library
  • input factors
  • simple running text
  • variable in scale
  • literary prose as medium
  • single units
  • purpose factors
  • floating situation (no deep knowledge about
    readers)
  • untargeted audience (general or professional
    education)
  • use as review
  • output factors
  • should not cover only a selection of the books
    (the librarian needs an overview of the whole set)
  • simple running text attached to bibliographic
    header
  • style should be indicative

12
The Automatic Creation of Literature Abstracts
(1957)
  • Presented by H. P. Luhn at the IRE National
    Convention in New York, 1958
  • Principle:
  • the machine selects those sentences of an article
    that are most representative
  • these sentences are enumerated and used to judge
    the character of the article

13
The Automatic Creation of Literature Abstracts
  • How does the machine know which sentences are
    important?
  • the significance factor of a sentence is derived
    from an analysis of its words
  • the significance of a word is based on:
  • the frequency of its occurrence (list of words in
    descending order of frequency)
  • its relative position within a sentence
  • the method does not differentiate between word
    forms (no stemming; nevertheless, similar words as
    well as stop words were eliminated by Luhn's
    algorithm)
  • no attention is paid to the logical or semantic
    relationships the author has produced

14
Was Luhn lazy?
  • Why did Luhn use this technique if he knew about
    the insufficiencies of the approach?
  • The approach is very economical in terms of
    computation time (very important in 1957)
  • In technical writing it is very unlikely that a
    word has more than one notion or that the author
    uses different words to express the same notion
    (often very few synonyms; much repetition)

15
Form of Word Frequency/Cut-Off lines
  • noise in the system is caused by very common
    words
  • reduce noise with a stored common-word list
  • a simpler way: determine a high-frequency and a
    low-frequency cut-off, the so-called confidence
    limits (problem with words like "cell" in
    literature about biology)
  • resolving power / discrimination power

16
Improvement
  • the purely statistical and physical approach,
    without considering meaning or topic, can be
    slightly improved
  • Assumptions:
  • the more closely certain words are associated,
    the more specifically an aspect of the subject is
    being treated
  • wherever the greatest number of frequently
    occurring words is found in physical proximity to
    each other, the probability is high that the
    information covered by this physical part (a
    sentence) is most representative of the article
  • Significance of proximity:
  • based on characteristics of spoken/written human
    language
  • ideas closely linked to each other are also
    closely associated physically
  • the division of a text (chapters, paragraphs,
    sentences, ...) is a physical manifestation of
    the author's structure of thinking

17
Significance factor
  • Luhn therefore wants his significance factor to
    reflect the following:
  • the number of occurrences of significant words in
    a sentence
  • the linear distance between them
  • the number of non-significant words in the
    sentence
  • rank sentences according to significance
  • pick the sentences with the highest ranks
  • this significance factor rates only the
    relationship of significant words to each other,
    not their distribution over the whole sentence
  • Obvious improvement:
  • consider only portions of sentences (clusters)
  • a cluster is bracketed by significant words
  • set a limit on the maximal distance between
    significant words (useful: four or five
    non-significant words between significant words)
  • if two or more clusters exist in a sentence, take
    the cluster with the highest significance and rate
    the corresponding sentence with it

18
Cluster-Significance
  • significance of a cluster:
  • a: the number of significant words in the cluster
  • b: the total number of words in the cluster
  • a·a / b is the significance of the cluster
  • the formula was confirmed by several experiments
  • Problem:
  • the resolving power of this method depends on the
    number of words comprising the article
  • the power decreases with an increasing number of
    words
  • Solution:
  • perform the evaluation on subdivisions of the
    article (subdivisions are often provided by the
    author; otherwise divisions can be made
    arbitrarily)
  • the highest-ranking sentences from each
    subdivision constitute the auto-abstract
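
The clustering and scoring described on these two slides can be sketched as follows. This is an illustrative reconstruction under the stated assumptions: clusters are bracketed by significant words with at most `gap` non-significant words between consecutive ones, and a cluster scores a·a / b (a = significant words in it, b = its total length).

```python
# Sketch of Luhn's cluster scoring. A sentence is rated by its
# best cluster: consecutive significant words at most `gap`
# non-significant words apart, scored as a*a / b where a is the
# number of significant words and b the total cluster length.

def sentence_significance(words, significant, gap=4):
    """Score a tokenized sentence by its highest-scoring cluster."""
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]   # current cluster boundaries
    count = 1                     # significant words in current cluster
    for p in positions[1:]:
        if p - prev - 1 <= gap:   # still inside the same cluster
            count += 1
        else:                     # close the cluster, start a new one
            best = max(best, count * count / (prev - start + 1))
            start, count = p, 1
        prev = p
    best = max(best, count * count / (prev - start + 1))
    return best

sent = "the method ranks significant words by frequency and position".split()
sig = {"significant", "words", "frequency", "position"}
print(sentence_significance(sent, sig))
```

Ranking all sentences of a subdivision by this score and keeping the top ones yields the auto-abstract described above.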

19
Modifications - Use
  • modifications for special abstracts:
  • condensation of a document:
  • adjust the cut-off value of sentence significance
  • print out a certain number of the most
    significant sentences (if a fixed number of
    sentences is required)
  • condensation of a document using a relationship
    to another source or field of interest:
  • assign premium values to a predetermined class of
    words (for example, interest: chemical
    substances; source: article about farming in the
    USA)
  • it is also possible that an article contains only
    sentences of minor importance (none goes beyond a
    certain value) → the article might be rejected as
    too generalized or not suitable
  • auto-abstracting could be used to alleviate the
    translation burden
  • the methods could provide key words for encoding
    documents or books

20
New Methods in Automatic Extracting (ca. 1961)
  • the work was initially conducted at
    Thompson-Ramo-Wooldridge, Inc. with the support
    of the Rome Air Development Center and later
    passed to other research institutions
  • Principle:
  • use characteristics of the abstracting behaviour
    of humans
  • replace the subjective notion of "significance"
    with a procedure
  • use four different methods to produce automatic
    abstracts
  • allow easy modification of these methods by
    adjusting parameters
  • offer a selection of these methods and their
    combinations for usage

21
Select Corpus
  • 200 documents in the fields of physical, life
    and information science, the humanities, ... (a
    heterogeneous corpus) were used to determine
    initial weights, parameters and preliminary
    statistical data (common words, sentence length,
    sentence position)
  • 200 documents in the field of chemistry were used
    for the extracting experiments (technical
    reports, highly formatted, with many equations
    and much experimental data)
  • The experimental corpus was divided into:
  • an experimental library (database for
    experimentation)
  • a test library, reserved for evaluation of the
    program

22
Study summary characteristics
  • Target extracts (produced by humans)
  • Instruction set for the human extractors to
    follow:
  • sentences are selected when eligible in terms of
    content:
  • What? (general subject)
  • Why? (intent of the author)
  • How? (methods used to conduct the research)
  • Conclusions/Findings
  • Generalization
  • ...
  • minimize redundancy
  • maximize coherence
  • number of sentences to use for the abstract (a
    length of 25% of the sentences in the document is
    not optimal)

23
Principles for Automation - Characteristics
  • Detect and use ALL content and format clues to
    the relative importance of sentences that were
    provided by the author
  • Employ mechanizable criteria of selection and
    rejection (reward weights, penalty weights)
  • Employ system parameters
  • Employ a weighting function of several linguistic
    factors
  • Set up computable characteristics of the text:
  • a text characteristic is positively relevant if
    it tends to be associated with the manually
    selected sentences
  • a text characteristic is negatively relevant if
    it tends to be associated with the unselected
    sentences
  • a text characteristic is irrelevant if it tends
    to be associated equally with selected and
    unselected sentences

24
Four Basic Methods
  • System is based on assigning numerical weights to
    sentences
  • Assigned weights are functions of the weights
    assigned to certain characteristics or clues
  • Sentence weights are the sum of the weights of
    these characteristics
  • Four different methods are used which apply
    different sets of clues to the source
  • cue method
  • key method
  • title method
  • location method
  • Several word lists are needed for each method

25
Word lists
  • Necessary to distinguish two types of word lists
  • Dictionary
  • List of words with numerical weights
  • Fixed input to the extracting system
  • Independent of the words in the document to be
    extracted
  • Glossary
  • List of words with numerical weights
  • Variable input to the extracting system
  • Contains words selected from the document to be
    extracted

26
Cue Method
  • Hypothesis:
  • relevance is affected by pragmatic words
    ("significant", "hardly")
  • Uses a prestored Cue dictionary which comprises
    sub-dictionaries for:
  • bonus words (positively relevant)
  • stigma words (negatively relevant)
  • null words (irrelevant)
  • The Cue dictionary is obtained from documents for
    which target extracts have been created,
    considering:
  • frequency
  • dispersion (the number of documents in which the
    word occurs)
  • selection ratio (the ratio of the frequency in
    extractor-selected sentences to the frequency in
    all sentences)
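
Scoring a sentence with such a dictionary is a simple weighted lookup. The tiny dictionary and the weight values below are illustrative assumptions; Edmundson derived his from the documents with target extracts.

```python
# Sketch of the Cue Method: sum prestored dictionary weights over the
# words of a sentence. Bonus words carry positive weights, stigma
# words negative ones; words missing from the dictionary act as null
# words. Entries and weights here are illustrative assumptions.

CUE_DICTIONARY = {
    "significant": 1.0,   # bonus word  (positively relevant)
    "conclude":    1.0,   # bonus word
    "hardly":     -1.0,   # stigma word (negatively relevant)
    "impossible": -1.0,   # stigma word
}

def cue_weight(sentence):
    """Cue weight of a sentence: sum of its words' dictionary weights."""
    return sum(CUE_DICTIONARY.get(w.lower().strip(".,"), 0.0)
               for w in sentence.split())

print(cue_weight("We conclude the results are significant."))
print(cue_weight("The effect is hardly measurable."))
```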

27
Key method
  • equal to the method proposed by Luhn
  • Hypothesis:
  • high-frequency content words are positively
    relevant
  • Compiles a Key glossary:
  • take all words not in the Cue dictionary
  • sort them in decreasing order of frequency
  • cut off all words with a frequency lower than a
    threshold
  • assign positive weights equal to their
    frequencies
  • A later improvement uses a fractional threshold:
  • take a fixed percentage of keywords from the
    document
  • weights are equal to their relative frequency
    over all words in the document
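
Building the Key glossary from a document can be sketched in a few lines. The threshold value and the miniature cue dictionary are assumptions for illustration.

```python
# Sketch of compiling a Key glossary: count words not in the Cue
# dictionary and keep those at or above a frequency threshold,
# weighting each keyword by its frequency. The threshold value is
# an illustrative assumption.

from collections import Counter

def key_glossary(words, cue_dictionary, threshold=2):
    """Return {keyword: weight} for frequent non-cue words."""
    freq = Counter(w for w in words if w not in cue_dictionary)
    return {w: c for w, c in freq.items() if c >= threshold}

doc = "acid reacts with base acid forms salt acid and water".split()
glossary = key_glossary(doc, cue_dictionary={"significant", "hardly"})
print(glossary)
```

Unlike the Cue dictionary, this glossary is variable input: it is recomputed from each document to be extracted.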

28
Title method
  • Clues are characteristics of the skeleton of the
    document
  • Hypotheses:
  • the author conceives the title as circumscribing
    the subject matter
  • the partition of the body of a document into
    major sections calls for summarization by
    appropriate headings
  • words of the title and headings are positively
    relevant
  • Compiles a Title glossary:
  • take all non-null words of the title, subtitle
    and headings
  • assign positive weights
  • the weights assigned were determined on the basis
    of their effect in the combined weighting scheme
    of the four methods
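
A minimal sketch of the Title glossary, under assumptions: the null-word list and the uniform weight of 1 per matching word are illustrative, since the actual weights were tuned in the combined scheme.

```python
# Sketch of the Title Method: collect non-null words from the title
# and headings into a glossary, then weight a sentence by how many
# glossary words it contains. Null-word list and per-hit weight of 1
# are illustrative assumptions.

NULL_WORDS = {"the", "of", "a", "in", "and"}

def title_glossary(title, headings):
    """Non-null words of the title and headings."""
    words = title.split() + [w for h in headings for w in h.split()]
    return {w.lower() for w in words if w.lower() not in NULL_WORDS}

def title_weight(sentence, glossary):
    """Count glossary words occurring in the sentence."""
    return sum(1 for w in sentence.lower().split() if w in glossary)

g = title_glossary("Automatic Extracting of Documents",
                   ["Introduction", "Extracting Methods"])
print(title_weight("automatic methods for extracting sentences", g))
```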

29
Location Method
  • Clues are provided by the skeleton of a document
    (headings, format)
  • Hypotheses:
  • sentences occurring under certain headings are
    positively relevant
  • topic sentences tend to occur early or late in a
    document and in its paragraphs
  • Uses a prestored Heading dictionary of selected
    words (from the corpus) that appear in headings
    ("Introduction", "Purpose", "Conclusions")
  • assign positive weights provided by the Heading
    dictionary
  • assign positive weights to sentences according to
    their ordinal position in the text (first/last
    paragraph, first/last sentence in paragraphs)
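
The two location clues combine into one score per sentence. The specific weight values below are illustrative assumptions; Edmundson's were tuned against the target extracts.

```python
# Sketch of the Location Method: a heading bonus plus ordinal-position
# bonuses for first/last paragraphs and first/last sentences within a
# paragraph. All weight values are illustrative assumptions.

HEADING_DICTIONARY = {"introduction": 1.0, "purpose": 1.0,
                      "conclusions": 1.0}

def location_weight(heading, para_index, n_paras, sent_index, n_sents):
    """Location weight of a sentence from its structural position."""
    w = HEADING_DICTIONARY.get(heading.lower(), 0.0)
    if para_index in (0, n_paras - 1):   # first or last paragraph
        w += 1.0
    if sent_index in (0, n_sents - 1):   # first or last sentence
        w += 1.0
    return w

# First sentence of the first paragraph under "Introduction":
print(location_weight("Introduction", 0, 5, 0, 4))
```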

30
Results
  • A linear function for evaluating the
    significance of a sentence:
  • a1C + a2K + a3T + a4L
  • where the ai (1 ≤ i ≤ 4) are the parameters for
    the Cue, Key, Title and Location weights
  • Evaluation: mean percentages of the number of
    sentences co-selected by both the automatic and
    the target extracts
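
The combined score is just this weighted sum; only the parameter values need tuning. The parameters below are placeholders, since the tuned values came from Edmundson's experimental library.

```python
# The combined weighting a1*C + a2*K + a3*T + a4*L as a function.
# The default parameters are illustrative assumptions, not the
# values Edmundson obtained from tuning.

def sentence_weight(C, K, T, L, a=(1.0, 1.0, 1.0, 1.0)):
    """Linear combination of Cue, Key, Title and Location weights."""
    return a[0] * C + a[1] * K + a[2] * T + a[3] * L

print(sentence_weight(C=2.0, K=3.0, T=1.0, L=2.0))
```

Setting one of the parameters to zero switches the corresponding method off, which is how individual methods and their combinations can be compared.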

31
The Future (beyond the 1960s)
  • research should involve sharper statistical
    analysis
  • discover machine-recognizable clues to determine
    the proper length of an abstract (25% of all
    sentences is inadequate)
  • the extent to which redundancy appears in
    automatic extracts, and ways of minimizing it,
    should be investigated
  • linguistic clues to coherence should be
    identified and expressed in a computable form