1
Reference
  • Julian Kupiec, Jan Pedersen, Francine Chen, "A
    Trainable Document Summarizer", SIGIR '95,
    Seattle, WA, USA, 1995.
  • Xiaodan Zhu, Gerald Penn, "Evaluation of Sentence
    Selection for Speech Summarization", Proceedings
    of the 2nd International Conference on Recent
    Advances in Natural Language Processing
    (RANLP-05), Borovets, Bulgaria, pp. 39-45,
    September 2005.
  • C.D. Paice, "Constructing literature abstracts by
    computer: Techniques and prospects", Information
    Processing and Management, 26(1):171-186, 1990.

2
A Trainable Document Summarizer
  • Julian Kupiec, Jan Pedersen and Francine Chen,
    Xerox Palo Alto Research Center

3
Outline
  • Introduction
  • A Trainable Summarizer
  • Experiments and Evaluation
  • Discussion and Conclusions

4
Introduction
  • To summarize is to reduce in complexity, and
    hence in length, while retaining some of the
    essential qualities of the original
  • This paper focuses on document extracts, a
    particular kind of computed document summary
  • Document extracts consisting of roughly 20% of
    the original can be as informative as the full
    text of a document, which suggests that even
    shorter extracts may be useful indicative
    summaries
  • Titles, key-words, tables-of-contents and
    abstracts might all be considered as forms of
    summary
  • They approach extract selection as a statistical
    classification problem
  • This framework provides a natural evaluation
    criterion: the classification success rate, or
    precision
  • It does require a training corpus of documents
    with labelled extracts

5
A Trainable Summarizer
  • Features
  • Paice groups sentence scoring features into seven
    categories
  • Frequency-keyword heuristics
  • The title-keyword heuristic
  • Location heuristics
  • Indicator phrases (e.g., "this report ...")
  • A related heuristic involves cue words
  • Two sets of words which are positively and
    negatively correlated with summary sentences
  • Bonus words, e.g., "greatest" and "significant"
  • Stigma words, e.g., "hardly" and "impossible"
  • The seven categories: the frequency-keyword
    approach, the title-keyword method, the location
    method, syntactic criteria, the cue method, the
    indicator-phrase method, and relational criteria

6
A Trainable Summarizer
  • Features
  • Sentence Length Cut-off Feature
  • Given a threshold (e.g., 5 words)
  • The feature is true for all sentences longer than
    the threshold, and false otherwise
  • Fixed-phrase Feature
    This feature is true for sentences that contain
    any of 26 indicator phrases, or that follow
    section heads that contain specific key words
  • Paragraph Feature
  • Thematic Word Feature
  • The most frequent content words are defined as
    thematic words
  • This feature is binary, depending on whether a
    sentence is present in the set of highest scoring
    sentences
  • Uppercase Word Feature
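A minimal sketch, under stated assumptions, of how the binary features above might be computed: the length cutoff, indicator-phrase list and top-k value are illustrative stand-ins for the paper's settings, the section-head condition of the fixed-phrase feature is omitted, and the paragraph feature (which depends on document layout) is left out.

import re
from collections import Counter

LENGTH_CUTOFF = 5                                      # example threshold from the slide
INDICATOR_PHRASES = ("this report", "in conclusion")   # illustrative; the paper lists 26

def tokenize(text):
    return re.findall(r"[A-Za-z][A-Za-z'-]*", text)

def thematic_words(sentences, top_n=10):
    # Most frequent content words in the document (stop-word filtering omitted).
    counts = Counter(w.lower() for s in sentences for w in tokenize(s))
    return {w for w, _ in counts.most_common(top_n)}

def thematic_flags(sentences, thematic, top_k=10):
    # True for sentences in the set of highest-scoring sentences by thematic-word count.
    scores = [sum(w.lower() in thematic for w in tokenize(s)) for s in sentences]
    top = set(sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)[:top_k])
    return [i in top for i in range(len(sentences))]

def sentence_features(sentence, thematic_flag):
    words = tokenize(sentence)
    return {
        "length_cutoff": len(words) > LENGTH_CUTOFF,
        "fixed_phrase": any(p in sentence.lower() for p in INDICATOR_PHRASES),
        "thematic": thematic_flag,
        "uppercase": any(w.isupper() and len(w) > 1 for w in words),
    }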

7
A Trainable Summarizer
  • Classifier
  • For each sentence s, compute the probability that
    it will be included in a summary S given its k
    features F_1, ..., F_k
  • By Bayes' rule:
    P(s ∈ S | F_1, ..., F_k)
      = P(F_1, ..., F_k | s ∈ S) P(s ∈ S) / P(F_1, ..., F_k)
  • Assuming statistical independence of the features:
    P(s ∈ S | F_1, ..., F_k)
      = P(s ∈ S) ∏_j P(F_j | s ∈ S) / ∏_j P(F_j)
  • P(s ∈ S) is a constant, and P(F_j | s ∈ S) and
    P(F_j) can be estimated directly from the
    training set by counting occurrences
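A minimal sketch of estimating these probabilities by counting over a labelled training set and scoring new sentences under the independence assumption; the add-one smoothing is an assumption of this sketch, not something stated on the slide.

def train(feature_vectors, labels, smoothing=1.0):
    # feature_vectors: list of {feature_name: bool}; labels: 1 if the sentence is in a summary.
    n = len(labels)
    n_pos = sum(labels)
    p_in_summary = n_pos / n                          # P(s in S)
    p_f_given_s, p_f = {}, {}
    for j in feature_vectors[0]:
        f_true = sum(fv[j] for fv in feature_vectors)
        f_true_pos = sum(fv[j] for fv, y in zip(feature_vectors, labels) if y)
        p_f[j] = (f_true + smoothing) / (n + 2 * smoothing)                  # P(F_j)
        p_f_given_s[j] = (f_true_pos + smoothing) / (n_pos + 2 * smoothing)  # P(F_j | s in S)
    return p_in_summary, p_f_given_s, p_f

def score(fv, p_in_summary, p_f_given_s, p_f):
    # Proportional to P(s in S | F_1, ..., F_k) under the feature-independence assumption.
    s = p_in_summary
    for j, value in fv.items():
        s *= (p_f_given_s[j] if value else 1 - p_f_given_s[j]) / (p_f[j] if value else 1 - p_f[j])
    return s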

8
Experiments and Evaluation
  • The corpus
  • There are 188 document/summary pairs, sampled
    from 21 publications in the scientific/technical
    domain
  • The average number of sentences per document is
    86
  • Each document was normalized so that the first
    line of each file contained the document title

9
Experiments and Evaluation
  • The corpus
  • Sentence Matching
  • Direct sentence match (verbatim or minor
    modification)
  • Direct join (two or more sentences)
  • Unmatchable
  • Incomplete (some overlap, includes a sentence
    from the original document, but also contains
    other information)
  • The correspondences were produced in two passes
  • 79% of the summary sentences have direct matches

10
Experiments and Evaluation
  • The corpus

11
Experiments and Evaluation
  • Evaluation
  • Using a cross-validation strategy for evaluation
  • Unmatchable and incomplete sentences were
    excluded from both training and testing, yielding
    a total of 498 unique sentences
  • Performance
  • First way (this gives the higher figure): a
    sentence produced by the summarizer is counted as
    correct if it is a direct sentence match or a
    direct join
  • Of the 568 summary sentences, 195 direct sentence
    matches and 6 direct joins were correctly
    identified, for a total of 201 correctly
    identified summary sentences (35%)
  • Second way: measured against the 498 matchable
    sentences, performance is 42%

12
Experiments and Evaluation
  • Evaluation
  • The best combination is (paragraph + fixed-phrase
    + sentence-length)
  • Addition of the frequency-keyword features
    (thematic and uppercase word features) results in
    a slight decrease in overall performance
  • As a baseline, selecting sentences from the
    beginning of a document (considering only the
    sentence length cut-off feature) gives 24% (121
    sentences correct)

13
Experiments and Evaluation
  • Figure 3 shows the performance of the summarizer
    (using all features) as a function of summary
    size
  • Edmundson cites a sentence-level performance of
    44%
  • By analogy, 25% of the average document length
    (86 sentences) in the corpus is about 20
    sentences
  • Reference to the table indicates performance at
    84%

14
Discussion and Conclusions
  • The trends in the results agree with those of
    Edmundson, who used a subjectively weighted
    combination of features, as opposed to training
    the feature weights on a corpus
  • Frequency-keyword features also gave the poorest
    individual performance in that evaluation
  • These features were nevertheless retained in the
    final system for several reasons
  • The first is robustness
  • Secondly, as the number of sentences in a summary
    grows, more dispersed informative material tends
    to be included

15
Discussion and Conclusions
  • The goal is to provide a summarization program
    that is of general utility
  • The first concerns robustness
  • The second issue concerns presentation and other
    forms of summary information

16
Reference
  • Julian Kupiec, Jan Pedersen, Francine Chen, "A
    Trainable Document Summarizer", SIGIR '95,
    Seattle, WA, USA, 1995.
  • Xiaodan Zhu, Gerald Penn, "Evaluation of Sentence
    Selection for Speech Summarization", Proceedings
    of the 2nd International Conference on Recent
    Advances in Natural Language Processing
    (RANLP-05), Borovets, Bulgaria, pp. 39-45,
    September 2005.

17
Evaluation of Sentence Selection for Speech
Summarization
  • Xiaodan Zhu and Gerald Penn, Department of
    Computer Science, University of Toronto

18
Outline
  • Introduction
  • Speech Summarization by Sentence Selection
  • Evaluation Metrics
  • Experiments
  • Conclusions

19
Introduction
  • This paper considers whether ASR-inspired
    evaluation metrics produce different results from
    those taken from text summarization
  • The goal of speech summarization is to distill
    important information from speech data
  • In this paper, we will focus on sentence-level
    extraction

20
Speech Summarization by Sentence Selection
  • LEAD: select the first N% of sentences from the
    beginning of the transcript
  • RAND: random selection
  • Knowledge-based approach (SEM)
  • To calculate semantic similarity between a given
    utterance and the dialogue, the noun portion of
    WordNet is used as a knowledge source, with
    semantic distance between senses computed using
    normalized path length
  • The performance of the system is reported as
    better than the LEAD, RAND and TF-IDF-based
    methods
  • Rather than using manually disambiguated senses,
    Brill's POS tagger is applied to acquire the
    nouns
  • A semantic similarity package is used
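A rough sketch of the normalized path-length idea using NLTK's WordNet interface; the authors used Brill's POS tagger and a dedicated semantic similarity package, so this is only an illustrative approximation, and it assumes NLTK with its WordNet data installed (nltk.download('wordnet')).

from nltk.corpus import wordnet as wn

def max_path_similarity(noun_a, noun_b):
    # Best normalized path-length similarity over all noun senses of the two words.
    best = 0.0
    for s1 in wn.synsets(noun_a, pos=wn.NOUN):
        for s2 in wn.synsets(noun_b, pos=wn.NOUN):
            sim = s1.path_similarity(s2)   # in (0, 1], or None if no path exists
            if sim is not None and sim > best:
                best = sim
    return best

def utterance_dialogue_similarity(utterance_nouns, dialogue_nouns):
    # Average, over the utterance's nouns, of the best match against the dialogue's nouns.
    if not utterance_nouns or not dialogue_nouns:
        return 0.0
    return sum(max(max_path_similarity(u, d) for d in dialogue_nouns)
               for u in utterance_nouns) / len(utterance_nouns)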

21
Speech Summarization by Sentence Selection
  • MMR-based approach (MMR): a sentence is preferred
    when it is more similar to the whole dialogue and
    less similar to the sentences that have so far
    been selected (see the sketch after this slide's
    bullets)
  • Classification-Based Approaches
  • To formulate sentence selection as a binary
    classification problem
  • The best two have consistently been SVM and
    logistic regression
  • SVM (OSU-SVM package)
  • SVM seeks an optimal separating hyperplane, where
    the margin is maximal
  • The decision function has the standard SVM form
    f(x) = sign(Σ_i α_i y_i K(x_i, x) + b)
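A minimal greedy sketch of the MMR idea from the bullets above, with bag-of-words cosine similarity and a trade-off weight of 0.7; both choices are illustrative assumptions rather than the paper's configuration.

import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(sentences, n_select, lam=0.7):
    # Greedily prefer sentences similar to the whole dialogue and
    # dissimilar to sentences already selected.
    vecs = [Counter(s.lower().split()) for s in sentences]
    doc_vec = sum(vecs, Counter())
    selected = []
    while len(selected) < min(n_select, len(sentences)):
        best_i, best_score = None, float("-inf")
        for i, v in enumerate(vecs):
            if i in selected:
                continue
            redundancy = max((cosine(v, vecs[j]) for j in selected), default=0.0)
            mmr = lam * cosine(v, doc_vec) - (1 - lam) * redundancy
            if mmr > best_score:
                best_i, best_score = i, mmr
        selected.append(best_i)
    return [sentences[i] for i in selected]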

22
Speech Summarization by Sentence Selection
  • Features

23
Speech Summarization by Sentence Selection
  • Classification-Based Approaches
  • Logistic Regression (LOG)
  • Models the posterior probability of the class
    label with a linear function, i.e. in the
    standard form
    P(Y = 1 | X = x) = 1 / (1 + exp(-(b0 + w·x)))
  • X are feature vectors and Y are class labels
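A minimal sketch of the classification-based formulation with scikit-learn; the original work used the OSU-SVM package and its own feature set, so the library, the toy feature vectors and the default hyperparameters here are assumptions for illustration only.

import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Toy training data: each row is a sentence's feature vector, label 1 = in summary.
X_train = np.array([[0.2, 1.0, 3.0], [0.9, 0.0, 1.0], [0.1, 1.0, 2.0], [0.8, 0.0, 0.0]])
y_train = np.array([1, 0, 1, 0])

svm = SVC(kernel="linear").fit(X_train, y_train)   # maximal-margin separating hyperplane
log = LogisticRegression().fit(X_train, y_train)   # linear model of the posterior P(Y|X)

X_new = np.array([[0.5, 1.0, 2.0]])
print(svm.decision_function(X_new))    # signed distance from the hyperplane
print(log.predict_proba(X_new)[:, 1])  # posterior probability of the 'in summary' class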

24
Evaluation Metrics
  • Precision/Recall
  • When evaluated on binary annotations and using
    precision/recall metrics, sys1 and sys2 achieve
    50% and 0% (in an example given on the slide)
  • Relative Utility
  • For the above example, using relative utility,
    sys1 gets 18/19 and sys2 gets 15/19
  • The values obtained are higher than with P/R, but
    they are higher for all of the systems evaluated
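A minimal sketch of the two metrics described above, for a single summarizer's output, assuming sentence ids for the selection, a gold set for P/R, and per-sentence judge scores for relative utility; the function and variable names are illustrative.

def precision_recall(selected, gold):
    # Binary-annotation P/R: gold is the set of sentence ids judged to be in the summary.
    selected, gold = set(selected), set(gold)
    tp = len(selected & gold)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

def relative_utility(selected, utility):
    # Credit from graded judge scores (e.g. 0-9 per sentence id), normalized by
    # the best total attainable with the same number of sentences.
    achieved = sum(utility[i] for i in selected)
    best = sum(sorted(utility.values(), reverse=True)[:len(selected)])
    return achieved / best if best else 0.0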

25
Evaluation Metrics
  • Word Error Rate
  • Sentence level and word level
  • The sum of insertion errors, substitution errors
    and deletion errors of words, divided by the
    number of all these errors plus the number of
    correct words
  • Zechner's Summarization Accuracy
  • The summarization accuracy is defined as the sum
    of the relevance scores of all the words in the
    automatic summary, divided by the maximum
    achievable relevance score with the same number
    of words
  • ROUGE
  • Measures overlapping units such as n-grams, word
    sequences and word pairs
  • ROUGE-N and ROUGE-L
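Minimal single-reference sketches of ROUGE-N and of summarization accuracy as defined above; the official ROUGE toolkit additionally handles multiple references, stemming and ROUGE-L, and Zechner's per-word relevance scores come from annotators, all of which is outside this sketch.

from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate_tokens, reference_tokens, n=2):
    # Recall-oriented n-gram overlap against a single reference summary.
    cand, ref = ngrams(candidate_tokens, n), ngrams(reference_tokens, n)
    overlap = sum(min(cand[g], ref[g]) for g in ref)
    return overlap / sum(ref.values()) if ref else 0.0

def summarization_accuracy(summary_word_scores, all_word_scores):
    # Sum of relevance scores of the words in the automatic summary, divided by the
    # maximum relevance score achievable with the same number of words.
    k = len(summary_word_scores)
    best = sum(sorted(all_word_scores, reverse=True)[:k])
    return sum(summary_word_scores) / best if best else 0.0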

26
Experiments
  • Corpus: the SWITCHBOARD dataset (a corpus of
    open-domain spoken dialogue)
  • 27 spoken dialogues are randomly selected from
    SWITCHBOARD
  • Three annotators are asked to assign 0/1 labels
    to indicate whether a sentence is in the summary
    or not (required to select around 10% of the
    sentences into the summary)
  • Each judge's annotation is evaluated relative to
    the others' (F-scores)
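A minimal sketch of scoring one judge's binary selection against another's as an F-score, as in the last bullet; how the paper pairs and averages the judges is not stated on this slide, so only the pairwise score is shown.

def pairwise_f_score(annotation_a, annotation_b):
    # F1 of judge A's selected sentence ids evaluated against judge B's.
    a, b = set(annotation_a), set(annotation_b)
    tp = len(a & b)
    if not a or not b or tp == 0:
        return 0.0
    precision, recall = tp / len(a), tp / len(b)
    return 2 * precision * recall / (precision + recall)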

27
Experiments
  • Precision/Recall
  • One standard marks a sentence as in the summary
    only when all three annotators agree
  • LOG and SVM have similar performance and
    outperform the others, with MMR following, and
    then SEM and LEAD
  • A second standard: a sentence is in the summary
    when at least two of the three judges include it

28
Experiments
  • Precision/Recall
  • A third standard: a sentence is in the summary
    when any of the three annotators includes it
  • Relative Utility
  • From the three human judges, each sentence is
    assigned a number between 0 and 9, indicating the
    confidence that this sentence should be included
    in the summary

29
Experiments
  • Relative Utility
  • The performance ranks of the five summarizers are
    the same here as they are in the three P/R
    evaluations
  • First, the P/R agreement among annotators is not
    low
  • Second, the redundancy in the data is much less
    than in multi-document summarization tasks
  • Third, the summarizers compared might tend to
    select the same sentences

30
Experiments
  • Word Error Rate and Summarization Accuracy

31
Experiments
  • Word Error Rate and Summarization Accuracy

32
Experiments
  • ROUGE

33
Conclusion
  • Five summarizers were evaluated on three
    text-summarization-inspired metrics, namely
    precision/recall (P/R), relative utility (RU) and
    ROUGE, as well as on two ASR-inspired evaluation
    metrics, word error rate (WER) and summarization
    accuracy (SA)
  • The preliminary conclusion is that considerably
    greater caution must be exercised when using
    ASR-based measures than has been witnessed to
    date in the speech summarization literature