Automatic Text Summarization: A Solid Base

1
Automatic Text Summarization: A Solid Base
  • Martijn B. Wieling
  • Rijksuniversiteit Groningen

November 25th, 2004
2
Outline
  • Why should we bother at all? (a.k.a.
    Introduction)
  • A frequency-based ATS: Luhn, 1958
  • An ATS based on multiple features: Edmundson,
    1969
  • Automatically combining the features (1): Kupiec
    et al., 1995
  • Automatically combining the features (2): Teufel
    & Moens, 1997
  • Why should we still bother? (a.k.a. Conclusion)

3
Why should we bother at all?
  • Time saving
  • Large-scale application possible, e.g.
      • Google-xtract
      • Extract translation
  • Abstracts will be consistent and objective

4
And in the beginning there was
  • Hans Peter Luhn (father of Information
    Retrieval): The Automatic Creation of
    Literature Abstracts - 1958

Image Courtesy IBM
5
Luhn's method: basic idea
  • Target documents: technical literature
  • The method is based on the following assumptions:
      • Frequency of word occurrence in an article is a
        useful measurement of word significance
      • Relative position of these significant words
        within a sentence is a useful measurement of
        sentence significance
  • Based on limited capabilities of machines (IBM
    704) → no semantic information

IBM 704 - Courtesy IBM
6
Why word frequency?
  • Important words are repeated throughout the text
      • examples are given in favor of a certain
        principle
      • arguments are given for a certain principle
  • Technical literature → one word, one notion
  • Simple and straightforward algorithm → cheap to
    implement (processing time is costly)
  • Note that different forms of the same word are
    counted as the same word

7
When significant?
  • Words with too low a frequency are not significant
  • Words with too high a frequency are also not
    significant (e.g. the, and)
  • Removing low-frequency words is easy
      • Set a minimum frequency threshold
  • Removing common (high-frequency) words
      • Setting a maximum frequency threshold
        (statistically obtained)
      • Comparing to a common-word list
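
The filtering described on this slide can be sketched roughly as follows. This is only an illustration, not Luhn's original procedure: the function name `significant_words`, the threshold `min_freq` and the tiny `COMMON_WORDS` list are assumptions made for the example (the slides give neither the actual threshold nor the word list), and no merging of word forms is attempted.

```python
from collections import Counter

# Hypothetical common-word list; the list Luhn actually used is not given here.
COMMON_WORDS = {"the", "and", "a", "of", "is", "in", "to"}


def significant_words(words, min_freq=2, common_words=COMMON_WORDS):
    """Keep words that occur at least min_freq times and are not common words."""
    counts = Counter(w.lower() for w in words)
    return {
        w for w, c in counts.items()
        if c >= min_freq and w not in common_words
    }


text = "the summary of the text and the text of the summary is short"
sig = significant_words(text.split())
```

Here the low-frequency words ("short", "is", "and") fall below the minimum threshold, and the high-frequency words ("the", "of") are dropped through the common-word list rather than through a maximum frequency threshold.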

Figure 1 from Luhn, 1958
8
Using relative position
  • Where the greatest number of high-frequency words
    are found closest together → the probability is
    very high that representative information is given
  • Based on the characteristic that an explanation
    of a certain idea is represented by words close
    together (e.g. sentences - paragraphs - chapters)

9
The significance factor
  • The significance factor of a sentence reflects
    the number of occurrences of significant words
    within a sentence and the linear distance between
    them due to non-significant words in between
  • Only consider the portion of the sentence bracketed
    by significant words with a maximum of 5
    non-significant words in between,
    e.g. () - - - - - - - - - - ()
  • Significance factor formula:
    $(\text{number of significant words in the bracketed portion})^2 / (\text{total number of words in the bracketed portion})$
  • (2.5 in the above example)
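
A minimal sketch of the significance factor, under stated assumptions: the function name `significance_factor` and the parameter `max_gap` are made up for this illustration, the gap limit of 5 is taken from the slide, and the best-scoring bracketed portion is taken as the score of the sentence. The example sentence is invented (5 significant words in a bracketed portion of 10 words) so that it reproduces the value 2.5; the slide's own example is not reproduced here.

```python
def significance_factor(sentence_words, significant, max_gap=5):
    """Best (significant words)^2 / (words in bracketed portion) in a sentence."""
    positions = [
        i for i, w in enumerate(sentence_words) if w.lower() in significant
    ]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            # Still inside the same bracketed portion.
            count += 1
        else:
            # Gap too large: close the current portion and start a new one.
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))


# "s" stands for a significant word, "x" for a non-significant one.
example = ["s", "x", "s", "x", "s", "x", "s", "x", "x", "s"]
factor = significance_factor(example, {"s"})  # 5**2 / 10
```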

10
Generating the abstract
  • For every sentence the significance factor is
    calculated
  • The sentences with a significance factor higher
    than a certain cut-off value are returned
    (alternatively the N highest-valued sentences can
    be returned)
  • For large texts, it can also be applied to
    subdivisions of the text
  • No evaluation of the results present in the
    journal paper!

11
A new method by Edmundson
  • H.P. Edmundson: New Methods in Automatic
    Extracting - 1969

IBM 7090 - Courtesy IBM
12
Four methods for weighting
  • Weighting methods
      • Cue Method
      • Key Method
      • Title Method
      • Location Method
  • The weight of a sentence is a linear combination
    of the weights obtained with the above four
    methods
  • The highest-weighted sentences are included in
    the abstract
  • Target documents: technical literature

13
Cue Method
  • Based on the hypothesis that the probable
    relevance of a sentence is affected by the presence
    of pragmatic words (e.g. Significant,
    Greatest, Impossible, Hardly)
  • Three types of Cue words
      • Bonus words: positively affecting the relevance
        of a sentence (e.g. Significant, Greatest)
      • Stigma words: negatively affecting the relevance
        of a sentence (e.g. Impossible, Hardly)
      • Null words: irrelevant

14
Obtaining Cue words
  • The lists were obtained by statistical analyses
    of 100 documents
      • Dispersion: number of documents in which the
        word occurred
      • Selection ratio: ratio of number of
        occurrences in extractor-selected sentences to
        number of occurrences in all sentences
  • Bonus words: selection ratio > t_high
  • Stigma words: selection ratio < t_low
  • Null words: dispersion > t (a dispersion
    threshold) and t_low < selection ratio < t_high
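
The threshold rules on this slide might be sketched like this. The function names, the token-list representation of sentences and all threshold values are assumptions chosen for illustration; the thresholds actually used in the paper are not given on the slide.

```python
def selection_ratio(word, selected_sentences, all_sentences):
    """Occurrences in extractor-selected sentences / occurrences in all sentences."""
    in_selected = sum(s.count(word) for s in selected_sentences)
    in_all = sum(s.count(word) for s in all_sentences)
    return in_selected / in_all if in_all else 0.0


def classify_cue_word(dispersion, ratio, t_disp, t_low, t_high):
    """Return 'bonus', 'stigma', 'null', or None if the word fits no list."""
    if ratio > t_high:
        return "bonus"
    if ratio < t_low:
        return "stigma"
    if dispersion > t_disp:
        return "null"
    return None


# Toy data: sentences are lists of tokens; two of the three were selected.
all_sents = [
    ["a", "significant", "result"],
    ["hardly", "a", "result"],
    ["significant", "gain"],
]
selected = [all_sents[0], all_sents[2]]
ratio_sig = selection_ratio("significant", selected, all_sents)
ratio_hardly = selection_ratio("hardly", selected, all_sents)
```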

15
Resulting Cue lists
  • Bonus list (783): comparatives, superlatives,
    adverbs of conclusion, value terms, etc.
  • Stigma list (73): anaphoric expressions,
    belittling expressions, etc.
  • Null list (139): ordinals, cardinals, the verb
    to be, prepositions, pronouns, etc.

16
Cue weight of sentence
  • Tag all Bonus words with weight b > 0, all Stigma
    words with weight s < 0, all Null words with
    weight n = 0
  • Cue weight of sentence = Σ (Cue weight of each
    word in sentence)
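
As a small illustration of this sum: the word lists below contain only the example words from the Cue Method slide, and the weights b = 1 and s = -1 are placeholders, not the values used in the paper.

```python
# Example Cue words from the slides; the real lists are far longer.
BONUS = {"significant", "greatest"}
STIGMA = {"impossible", "hardly"}


def cue_weight(sentence_words, bonus=BONUS, stigma=STIGMA, b=1, s=-1):
    """Sum the Cue weights of the words; Null and unlisted words add 0."""
    total = 0
    for word in sentence_words:
        word = word.lower()
        if word in bonus:
            total += b
        elif word in stigma:
            total += s
    return total


sentence = "The greatest and most significant gain is hardly visible".split()
weight = cue_weight(sentence)  # two Bonus words, one Stigma word
```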

17
Key Method
  • Principle based on Luhn: counting the frequency
    of words
  • Algorithm differs
      • Create key glossary of all non-Cue words in the
        document which have a frequency larger than a
        certain threshold
      • Weight of each key word in the key glossary is
        set to the frequency with which it occurs in the
        document
      • Assign key weight to each word which can be found
        in the key glossary
      • If word is not in key glossary, key weight = 0
      • No relative position is used (unlike Luhn)
  • Key weight of sentence = Σ (Key weight of each
    word in sentence)
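
The Key Method steps could look roughly like this. The function names and the threshold value are hypothetical; only the steps themselves come from the slide.

```python
from collections import Counter


def key_glossary(doc_words, cue_words, threshold):
    """Non-Cue words with frequency above the threshold, weighted by frequency."""
    counts = Counter(doc_words)
    return {
        w: c for w, c in counts.items()
        if c > threshold and w not in cue_words
    }


def key_weight(sentence_words, glossary):
    """Sum of key weights; words outside the glossary contribute 0."""
    return sum(glossary.get(w, 0) for w in sentence_words)


doc = ("model model model data data noise "
       "significant significant significant").split()
glossary = key_glossary(doc, cue_words={"significant"}, threshold=1)
weight = key_weight("model data noise".split(), glossary)
```

In this toy document "significant" is frequent but excluded as a Cue word, and "noise" falls below the threshold, so only "model" (3) and "data" (2) carry weight.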

18
Title Method
  • Based on the hypothesis that an author conceives
    the title as circumscribing the subject matter of
    the document (similarly for headings vs. paragraphs)
  • Create title glossary consisting of all non-Null
    words in the title, subtitle and headings of the
    document
  • Words are given a positive title weight if they
    appear in this glossary
  • Title words are given a larger weight than
    heading words
  • Title weight of sentence = Σ (Title weight of each
    word in sentence)

19
Location Method
  • Based on the hypothesis that
      • Sentences occurring under certain headings are
        positively relevant
      • Topic sentences tend to occur very early or very
        late in a document and its paragraphs
  • Global idea
      • Give each sentence below its heading the same
        weight as the heading itself (note that this is
        independent of the Title Method): Heading
        weight
      • Give each sentence a certain weight based on its
        position: Ordinal weight
  • Location weight of sentence = Ordinal weight of
    sentence + Heading weight of sentence

20
Location Method: Heading weight
  • Compare each word in a heading with the
    pre-stored Heading dictionary
  • If the word occurs in this dictionary, assign it
    a weight equal to the weight it has in the
    dictionary
  • Heading weight of a heading = Σ (heading weight of
    each word in heading)
  • Heading weight of a sentence = Heading weight of
    its heading

21
Creating the Heading dictionary
  • The Heading dictionary was created by listing all
    words in the headings of 120 documents and
    calculating the selection ratio for each word
  • Selection ratio: ratio of number of
    occurrences in extractor-selected sentences to
    number of occurrences in all headings
  • Deletions from this list were made on the basis
    of low frequency and unrelatedness to the desired
    information types (subject, purpose, conclusion,
    etc.)
  • Weights were given to the words in the Heading
    dictionary proportional to the selection ratio
  • The resulting Heading dictionary contained 90
    words

22
Location Method: Ordinal weight
  • Sentences of the first paragraph are tagged with
    weight O1
  • Sentences of the last paragraph are tagged with
    weight O2
  • The first sentence of a paragraph is tagged with
    weight O3
  • The last sentence of a paragraph is tagged with
    weight O4
  • Ordinal weight of sentence = O1 + O2 + O3 + O4
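
A sketch of the Ordinal weight, assuming that each of the four weights is added only when its condition holds for the sentence. The function signature and the weight values in the example are invented for illustration.

```python
def ordinal_weight(par_idx, sent_idx, n_pars, n_sents_in_par, o1, o2, o3, o4):
    """Add O1..O4 for each positional condition that applies to the sentence."""
    weight = 0
    if par_idx == 0:                     # sentence in the first paragraph
        weight += o1
    if par_idx == n_pars - 1:            # sentence in the last paragraph
        weight += o2
    if sent_idx == 0:                    # first sentence of its paragraph
        weight += o3
    if sent_idx == n_sents_in_par - 1:   # last sentence of its paragraph
        weight += o4
    return weight


# First sentence of the first paragraph in a 5-paragraph document.
w_first = ordinal_weight(0, 0, n_pars=5, n_sents_in_par=4, o1=3, o2=2, o3=2, o4=1)
# A middle sentence of a middle paragraph gets no Ordinal weight.
w_middle = ordinal_weight(2, 1, n_pars=5, n_sents_in_par=4, o1=3, o2=2, o3=2, o4=1)
```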

23
Generating the abstract
  • Calculate the weight of a sentence: aC + bK + cT +
    dL, with a, b, c, d constant positive integers, C =
    Cue weight, K = Key weight, T = Title weight, L =
    Location weight
  • The values of a, b, c and d were obtained by
    manually comparing the generated automatic
    abstracts with the desired (human-made) abstract
  • Return the highest N sentences under their proper
    headings as the abstract (including title)
  • N is calculated by taking a percentage of the
    size of the original document; in this journal
    paper 25% is used
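
The combination and selection step might be sketched as below. The function name `select_extract`, the default coefficients of 1 and the rounding of N are assumptions; the slide only states that a, b, c and d are positive integers and that 25% is used. Headings and the title are left out, and selected sentences are simply returned in document order.

```python
def select_extract(sentences, scores, a=1, b=1, c=1, d=1, fraction=0.25):
    """Rank sentences by aC + bK + cT + dL and keep the top fraction.

    scores holds one (C, K, T, L) tuple per sentence.
    """
    totals = [a * C + b * K + c * T + d * L for C, K, T, L in scores]
    n = max(1, round(fraction * len(sentences)))
    top = sorted(range(len(sentences)), key=lambda i: totals[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]  # restore document order


sentences = [f"s{i}" for i in range(8)]
scores = [
    (0, 0, 0, 1), (3, 0, 2, 0), (0, 1, 0, 0), (1, 0, 0, 1),
    (-1, 0, 0, 0), (2, 0, 1, 1), (0, 0, 0, 0), (0, 2, 0, 1),
]
extract = select_extract(sentences, scores)
```

Setting `b=0` in this sketch corresponds to leaving the Key Method out of the combination.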

24
Which combination is best?
  • All combinations of C, K, T and L were tried to
    see which result had (on average) the most
    overlap with the handmade extract
  • As can be seen in the figure below (only the
    interesting results are shown), the Key method
    was omitted and only C, T and L are used to
    create the best abstract
  • Surprising result! (Luhn used only keywords to
    create the abstract)

Figure 4 from Edmundson, 1969
25
Evaluation
  • Evaluation was done on unseen data (40 technical
    documents), comparison with handmade abstracts
      • Result: 44% of the sentences co-selected, 66%
        similarity between abstracts (human judge)
      • Random abstract: 25% of the sentences
        co-selected, 34% similarity between abstracts
  • Another evaluation criterion: extract-worthiness
      • Result: 84% of the sentences selected are
        extract-worthy
  • Therefore for one document many abstracts are
    possible (differing in length and content)

26
Comments
  • Goldstein et al., 1999: Not good to base length
    of abstract on length of document
      • Summary length is independent of document length
      • The longer the document, the smaller the
        compression ratio ( doc. / abstract )
      • Better to use constant summary length
  • Rath et al., 1961: Human selection of sentences in
    abstracts is very variable
      • 6 abstracts of 20 sentences: only 32% overlap
        between 5 subjects (6 subjects: 8%)
      • Abstracting the same document 2 times by the same
        person with 8 weeks in between: only 55% overlap
        (average for 6 subjects)
  • Perhaps the Key Method algorithm used here is not
    that good (Luhn's algorithm could be better)

27
Time and cost of this system
  • Speed of extracting: 7800 words/minute
  • Cost: 0.015 / word
      • Including keypunching costs: 0.01 / word
  • Used corpus of 29,500 words → 442.50 total cost
  • CPI 2003: 2798.00 total cost

28
A jump in time
  • 1969: First man on the moon
  • 1972: Watergate scandal
  • 1980: John Lennon killed
  • 1981: First identification of AIDS; birth of me
  • 1986: Space Shuttle Challenger explodes after
    launch
  • 1989: Fall of the Berlin Wall
  • 1990: Start of the Gulf War; introduction of the WWW
  • 1991: Soviet Union breaks up
  • 1992: Formal end of the Cold War
  • 1993: Creation of the European Union (Maastricht
    Treaty)
  • 1994: Nelson Mandela president of South Africa

29
1995: Trained summarization
  • Julian Kupiec, Jan Pedersen and Francine Chen: A
    Trainable Document Summarizer - 1995

30
Trained weighting
  • Edmundson used subjective weighting of the
    features (Cue, Key, Title, Location) to create an
    abstract
  • In this journal paper generating the abstract is
    approached as a statistical classification
    problem
      • Given a training set of documents with handmade
        abstracts
      • Develop a classification function that estimates
        the probability that a given sentence is included
        in the abstract
  • This requires a training corpus of documents with
    abstracts
  • Target documents: technical literature

31
Features
  • Five features were used
      • Sentence Length Cut-off Feature
      • Fixed Phrase Feature
      • Paragraph Feature
      • Thematic Word Feature
      • Uppercase Word Feature
  • The above features were chosen by experimentation

32
Sentence Length Cut-off Feature
  • Based on the principle that short sentences are
    often not included in abstracts
  • Given a threshold (e.g. 5 words)
      • SLC-value is true for sentences longer than the
        threshold
      • SLC-value is false otherwise
  • Note that this feature is not similar to any of
    the features Edmundson used

33
Fixed-Phrase Feature
  • Based on the hypothesis that
      • Sentences containing any of a list of fixed
        phrases (mostly 2 words long) are likely to be in
        the abstract (e.g. "in conclusion", "this result";
        26 elements in total)
      • Sentences following a heading containing a
        certain keyword are more likely to be in the
        abstract (e.g. "conclusions", "results",
        "summary")
  • FP-value is true for sentences in the above
    situations, false otherwise
  • Note that this feature is a combination of
    Edmundson's Location Method and Cue Method,
    though in reduced form

34
Paragraph Feature
  • Each sentence in the first ten and last five
    paragraphs is tagged based on its location
      • Paragraph-initial
      • Paragraph-final (P > 1 sentence)
      • Paragraph-medial (P > 2 sentences)
  • Note that this feature is a reduced form of
    Edmundson's Location Method

35
Thematic Word Feature
  • The most frequent words in a document are defined
    as thematic words
  • A small number of thematic words is selected and
    each sentence is scored as a function of
    frequency of these thematic words
  • TW-value is true if it is one of the highest
    scoring sentences
  • TW-value is false otherwise
  • Note that this feature is an adapted version of
    Edmundson's Key Method

36
Uppercase Word Feature
  • Based on the hypothesis that proper names are
    often important, as is the explanatory text
    for acronyms (e.g. the ISO (International
    Standards Organization))
  • Count the frequency of each proper name
      • Constraint: the uppercase thematic word is not
        sentence-initial and begins with a capital letter
      • The word must occur several times and may not be
        an abbreviated measurement unit
  • Score each sentence based on the number of
    frequent proper names in it
      • The score of a sentence in which a frequent
        proper name appears first is twice as high as
        that of later occurrences
  • UW-value is true if it is one of the highest
    scoring sentences, false otherwise
  • Note that this feature is a bit similar to
    Edmundson's Key Method

37
Classification
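  • For each sentence s the probability P is
    calculated that it will be included in the
    summary S given the k features (Bayes' rule):
    $P(s \in S \mid F_1, \ldots, F_k) = \frac{P(F_1, \ldots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, \ldots, F_k)}$
  • Assuming statistical independence of the
    features:
    $P(s \in S \mid F_1, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\, P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$
  • $P(s \in S)$ is constant, and $P(F_j \mid s \in S)$
    and $P(F_j)$ can be estimated directly from the
    training set by counting occurrences
  • This function assigns for each s a score which
    can be used to select sentences for inclusion in
    the abstract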
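
A minimal sketch of such a classifier, estimating every probability by counting in the training set under the independence assumption above. The data layout (tuples of feature values plus a boolean label) and the function names `train` and `score` are assumptions; no smoothing is applied, so the sketch assumes every feature value seen at scoring time also occurred in training.

```python
from collections import Counter


def train(examples):
    """examples: list of (feature_tuple, in_summary) pairs."""
    n = len(examples)
    in_summary = [f for f, label in examples if label]
    k = len(examples[0][0])
    prior = len(in_summary) / n                                      # P(s in S)
    cond = [Counter(f[j] for f in in_summary) for j in range(k)]     # counts for P(F_j | s in S)
    marg = [Counter(f[j] for f, _ in examples) for j in range(k)]    # counts for P(F_j)
    return prior, cond, marg, len(in_summary), n


def score(features, model):
    """P(s in S) * prod_j P(F_j | s in S) / P(F_j), all estimated by counting."""
    prior, cond, marg, n_in, n = model
    p = prior
    for j, value in enumerate(features):
        p *= (cond[j][value] / n_in) / (marg[j][value] / n)
    return p


# Toy training set with two boolean features per sentence.
examples = [
    ((True, True), True),
    ((True, False), False),
    ((False, True), False),
    ((False, False), False),
]
model = train(examples)
best = score((True, True), model)
worst = score((False, False), model)
```

Because the features are treated as independent, the value is best read as a ranking score rather than as a calibrated probability.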

38
The training material
  • 188 documents with professionally created
    abstracts from the scientific/technical domain;
    the average length of the abstracts is 3
    sentences (3.5% of the total size of the
    document)
  • Sentences from the abstract were matched to the
    original document
      • 79%: direct sentence matches
      • 3%: direct joins (2 sentences combined)
      • 18%: no direct match or join possible
  • Therefore the maximum performance of the
    automatic system is 82%

39
Evaluation (1)
  • Too little material → cross-validation used to
    evaluate
  • Two evaluation measures
      • Fraction of manually selected sentences which
        were reproduced correctly: average result 35%
      • Fraction of the matchable selected sentences
        which were reproduced correctly: average result
        42%
  • Performance of features (2nd measure)

Feature          Individual sentences correct   Cumulative sentences correct
Paragraph        33%                            33%
Fixed Phrases    29%                            42%
Length Cut-off   24%                            44%
Thematic Word    20%                            42%
Uppercase Word   20%                            42%
40
Evaluation (2)
  • Best combination is Paragraph + Fixed Phrase +
    Length Cut-off (44% performance)
  • Addition of frequency keyword features results in
    a slight decrease of performance (44% → 42%)
      • Note that Edmundson also reports a decrease in
        performance in this case
  • In the final implementation frequency keyword
    features are retained in favor of robustness
  • Baseline used in this experiment: selecting N
    sentences from the beginning (Length Cut-off,
    thus positively biased)
  • Full feature set has an improvement of 74% over
    baseline (24% → 42%)

41
Evaluation (3)
  • If the size of the generated abstract is
    increased to 25%, the performance improves to 84%
  • Edmundson only had a performance of 44%

42
Comments
  • The features used in this paper were chosen by
    experimentation
      • No results/discussions of these experiments are
        given in the paper, so the reasons for the choices
        remain unclear
  • The comparison to Edmundson is not very fair
      • Handmade reference abstracts of Edmundson had a
        size of 25% (here 3.5%)
  • Also the comments which were given about
    Edmundson apply here
      • Not good to base length of abstract on length of
        document
      • Human selection of sentences in abstracts is very
        variable
      • Perhaps the Key Method algorithm used here is too
        simple (Luhn's algorithm could be better)

43
Revisited: Kupiec et al., 1995
  • Simone Teufel and Marc Moens: Sentence
    Extraction as a Classification Task - 1997

44
Main research questions
  • Could Kupiec et al.'s methodology (training a model
    with a corpus) be used for another evaluation
    criterion?
  • What was the difference in extracting performance
    between the two evaluation criteria for different
    types of documents?
  • Note that a different set of features is used here
    than Kupiec et al. used

45
Another evaluation method
  • Kupiec et al. used the "match sentences" evaluation
    criterion
  • Here the training and test set abstracts are
    created by the authors themselves (as opposed to
    Kupiec et al.)
  • Hence fewer alignable sentences are available in
    the document
      • 32% on average vs. 79% in Kupiec et al.
  • This does not mean there are fewer
    extract-worthy sentences in the document →
    another evaluation method is chosen
  • Evaluation: ask a human to identify abstract-worthy
    non-matchable sentences in the original document

46
Features
  • The features used here are different from Kupiec
    et al.
      • Cue Phrase Method (1670 cue phrases)
      • Location Method
      • Sentence Length Method
      • Thematic Word Method
      • Title Method

47
Cue Phrase Method
  • Similar to Edmundson, with some differences
      • A 5-point scale (-1 to 3) is used instead of 3
        classes (Bonus, Null, Stigma)
      • Cue phrases are used instead of Cue words
      • If a phrase was entered into the list,
        syntactically and semantically similar phrases
        were also manually included in the list
      • A sentence gets the score of its maximum-scored
        Cue phrase; if no Cue phrases are present it gets
        a score of 0
  • The list was manually created by inspecting
    extracted sentences
      • Also based on relative frequency in abstract and
        relative frequency in document
  • Sentences occurring directly after headings like
    Introduction or Conclusion are given a prior
    score of 2 (in Edmundson this is part of the
    Location Method)

48
Location Method
  • As in Edmundson, with the exception of the
    sentences directly after headings mentioned
    previously
  • Sensitive to certain headings (e.g.
    Introduction); if such headings cannot be
    found, only the sentences of the first 7 and last
    3 paragraphs are tagged (initial, medial, final)

49
Sentence Length Method
  • As in Kupiec et al.
  • The threshold is set to 15 tokens (including
    punctuation)

50
Thematic Word Method
  • As in Kupiec et al., with a few differences
  • Selecting (non-Cue) words which occur frequently
    in this document, but rarely in the overall
    collection of documents
  • For each (non-Cue) word the term frequency-inverse
    document frequency (tf-idf) value is calculated
      • $\text{score}(w) = f_{loc} \cdot \log(100N / f_{glob})$
      • with $N$ = total number of documents, $f_{loc}$ =
        frequency of word w in document, $f_{glob}$ = number
        of documents containing word w
  • Top 10 scoring words are defined as thematic
    words
  • Top 40 sentences based on the frequency of
    thematic words (normalized by sentence length) are
    given a TW-value of 1, all others 0
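
The word-scoring step can be sketched as follows. The function name `thematic_words`, the dictionary `doc_freq` and the toy numbers are assumptions; the natural logarithm is used because the slide does not name a base, which does not affect the ranking. Every scored word must appear in `doc_freq` with a count of at least 1.

```python
import math
from collections import Counter


def thematic_words(doc_words, doc_freq, n_docs, cue_words=frozenset(), top_k=10):
    """Rank non-Cue words by f_loc * log(100 * N / f_glob); return the top_k."""
    f_loc = Counter(w for w in doc_words if w not in cue_words)
    scores = {
        w: f * math.log(100 * n_docs / doc_freq[w])
        for w, f in f_loc.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy document: "the" is the most frequent word but occurs in every document.
doc_words = ["parser"] * 3 + ["chart"] * 3 + ["the"] * 5
doc_freq = {"parser": 2, "chart": 1, "the": 100}
top = thematic_words(doc_words, doc_freq, n_docs=100, top_k=2)
```

In this toy collection the rare words "chart" and "parser" outrank "the", even though "the" occurs more often in the document itself.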

51
Title Method
  • As in Edmundson, with the difference that
      • The Title score of the sentence is the mean
        frequency of Title word occurrences in the
        sentence (in Edmundson each Title word was given
        the same score and the scores were summed)
      • Headings are not taken into account here (by
        experimentation)
  • The 18 top-scoring sentences receive a
    Title-value of 1, the others 0

52
The experiment
  • Training set: a corpus of 124 documents from
    different areas of computational linguistics with
    summaries written by the authors
  • A human judge marked additional abstract-worthy
    sentences in each document
  • 32% alignable sentences in the abstracts
  • Two evaluation methods (alignable and
    abstract-worthy), which were also combined

53
Summary of results
                                   Alignability   Abstract-worthy   Combined
Best single feature (Cue Method)   23.2%          46.7%             55.2%
All features                       31.6%          57.2%             68.4%
  • Baseline: 28% (obtained in a similar fashion as
    Kupiec et al.)
  • Bad performance of 31.6% for alignability can be
    explained because there are fewer alignable
    sentences to train on
  • Short abstracts were generated (2-5% of the size
    of the original document)
  • If abstract size were increased to 25%,
    performance would increase to
      • Alignability: 96% (Kupiec et al.: 84%)
      • Abstract-worthy: 98%
      • Combined: 97.3%
  • Therefore compression makes the difference, not
    the evaluation criterion

54
Conclusions of this experiment
  • The method proposed by Kupiec et al. of
    classificatory sentence selection is not
    restricted to texts which have high-quality
    handmade abstracts
  • A higher alignability of the handmade abstract is
    therefore not necessary for the purpose of
    sentence extraction; compression rate is the
    factor which influences the result
  • However, if more flexible abstracts should be
    generated, the addition of other training and
    evaluation criteria is useful
  • Increased training did not improve results;
    improvement can be obtained in the extraction
    methods themselves

55
Comments
  • The features used in this paper were different
    from Kupiec et al.
      • No motivation was given why for instance the
        Uppercase Word feature was omitted, and why
        adapted versions of Edmundson were chosen instead
        of the versions Kupiec et al. used
  • Also comments which were given about Edmundson
    apply here
      • Not good to base length of abstract on length of
        document
      • Human selection of abstract-worthy sentences in
        abstracts is very variable

56
Why should we still bother?
  • In the discussed methods no attention is given
    to
      • Cohesion of the abstract: filtering anaphors out
        of an abstract (e.g. it, that)
      • Filtering out repetition in the abstract
      • The semantics of the document
  • Cohesion: an attempt is made by using Lexical
    Chains
  • Repetition: an attempt is made by using Maximum
    Marginal Relevance
  • Semantics: this can still not be done for the
    general case, but an attempt is made by using
    Rhetorical Tree Structures
  • Interested in these problems?
      • Wicher will explain extraction methods which will
        address repetition and semantics problems in his
        presentation
      • Terrence will explain Lexical Chains in his
        presentation

57
References
  • The Automatic Creation of Literature Abstracts,
    H.P. Luhn, 1958
  • New Methods in Automatic Extracting, H.P.
    Edmundson, 1969
  • A Trainable Document Summarizer, J. Kupiec et al.,
    1995
  • Sentence Extraction as a Classification Task, S.
    Teufel and M. Moens, 1997
  • The Formation of Abstracts by the Selection of
    Sentences, G.J. Rath et al., 1961
  • Constructing Literature Abstracts by Computer:
    Techniques and Prospects, C.D. Paice, 1990
  • Summarizing Text Documents: Sentence Selection
    and Evaluation Metrics, Goldstein et al., 1999

58
Any questions?