Title: Automatic Sentence Compression in the MUSA project
1. Automatic Sentence Compression in the MUSA project
- Walter Daelemans, Anja Höthker
- walter.daelemans_at_ua.ac.be
- http://cnts.uia.ac.be
- CNTS, University of Antwerp, Belgium
- Languages & The Media 2004, Berlin
2. MUSA
- MUltilingual Subtitling of multimediA content
- EU IST 5th Framework, Sep. 2002 - Feb. 2005
- Goals
  - Conversion of audio streams into TV subtitles (monolingual)
  - Translation of subtitles into French or Greek
4. Partners
- ILSP, Athens: coordination, integration
- ESAT, KU Leuven: Automatic Speech Recognition
- CNTS, U. Antwerp: sentence compression
- Systran, Paris: Machine Translation
- BBC, London: main user, data provider, evaluation
- Lumiere, Athens: main user, multilingual data provider, evaluation
5. Goals for Sentence Compression
- Automatically and dynamically generate subtitles based on constraints (words and characters)
- Reduce the time an expert subtitler needs to produce subtitles
- Provide an architecture that can easily be ported to other languages
6. Example
- SPEECH
  - The task force is in place and ready to attack without mercy.
- Constraints
  - Delete 3 words and 14 characters
- Compression module output
  - The task force is in place and ready to fight without mercy.
- SUBTITLE
  - The task force is ready ...
  - ...to fight without mercy.
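Turning a compressed sentence into the two-line subtitle shown above is a simple splitting step. A minimal sketch in Python, assuming a hypothetical per-line character limit (here 28) and a "..." continuation marker; the slide does not give MUSA's actual line limits or splitting logic:

```python
def split_subtitle(text, max_chars=28):
    """Split a compressed sentence into at most two subtitle lines,
    linked by '...' as on the slide. max_chars is an assumed
    per-line limit, not a MUSA parameter."""
    words = text.split()
    line1 = []
    # Fill the first line while it still fits, reserving 4 chars
    # for the trailing ' ...' continuation marker.
    while words and len(' '.join(line1 + [words[0]])) + 4 <= max_chars:
        line1.append(words.pop(0))
    if not words:               # everything fits on one line
        return [' '.join(line1)]
    return [' '.join(line1) + ' ...', '...' + ' '.join(words)]
```

With the compressed example sentence, this reproduces the two subtitle lines from the slide.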
7. Approach
- Remove disfluencies: compress the sentence by removing repetitions introduced by hesitation
  - I, I know that this war, this war will last for years
- Paraphrasing: replace part of the input sentence by a shorter paraphrase
  - an increasing number of → more and more
- Rule-based approach: compress sentences based on handcrafted deletion rules that combine
  - shallow-parsing information (identifying the constituents used by the deletion rules)
  - relevance measures (determining in which order to delete constituents)
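The disfluency step can be illustrated with a short sketch. This is a simplified token-level version: the actual MUSA module is rule-based and works on shallow-parser output, so the function below is an assumption about the behaviour, not the implementation:

```python
def remove_disfluencies(sentence: str) -> str:
    """Collapse immediate repetitions such as 'I, I know' or
    'this war, this war' into a single occurrence."""
    tokens = sentence.split()
    out = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try to match a repeated span of 1-4 tokens, ignoring
        # trailing commas on the first copy (hesitation punctuation).
        for span in range(4, 0, -1):
            first = [t.rstrip(',').lower() for t in tokens[i:i + span]]
            second = [t.rstrip(',').lower() for t in tokens[i + span:i + 2 * span]]
            if len(first) == span and first == second:
                i += span          # drop the first copy, keep the second
                matched = True
                break
        if not matched:
            out.append(tokens[i])
            i += 1
    return ' '.join(out)
```

Applied to the slide's example, it yields "I know that this war will last for years".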
8. Shallow Parsing: POS Tagging
- The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN
9. Shallow Parsing: Chunking
- [NP The/Det woman/NN] [VP will/MD give/VB] [NP Mary/NNP] [NP a/Det book/NN]
10. Shallow Parsing: Sense Tagging
- [NP-PERSON The/Det woman/NN] [VP will/MD give/VB] [NP-PERSON Mary/NNP] [NP-MATERIAL-OBJECT a/Det book/NN]
11Shallow Parsing Relation Finding
person
material-object
person
12. MBSP (Perl)
- Pipeline: Text In → Tokenizer (Perl) → POS Tagger (MBT server, with TiMBL servers for known and unknown words) → Phrase Chunker (TiMBL server) → Concept Tagger (MBT server, with TiMBL servers for known and unknown words) → Relation Finder (TiMBL server)
- TiMBL 5.0, MBT 2.0: http://ilk.uvt.nl/
13. Rule-Based Approach (syntax)
- Deletion rules mark phrases for deletion based on shallow-parser output
- Rules for adverbs, adjectives, PNPs, subordinate sentences, interjections, ...
- Phrases are deleted iteratively until the target compression rate is met
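The iterative deletion loop can be sketched as a greedy procedure: remove the least relevant candidate span first, and stop as soon as the constraint is satisfied. The tuple layout and function name below are illustrative, not the MUSA API:

```python
def compress(tokens, candidates, max_words):
    """Greedy sketch of the deletion loop. `candidates` is a list of
    (start, end, relevance) spans proposed by the deletion rules;
    the least relevant span is deleted first, repeating until the
    word limit holds or the candidates run out."""
    remaining = sorted(candidates, key=lambda c: c[2])  # least relevant first
    deleted = set()
    for start, end, _ in remaining:
        if len(tokens) - len(deleted) <= max_words:
            break                       # constraint already met
        deleted.update(range(start, end))
    return [t for i, t in enumerate(tokens) if i not in deleted]
```

For example, "This is a basic summarizer for English used for demonstration purposes" with deletable spans "basic" (relevance 11), "for English" (10), and "for demonstration purposes" (12), and a limit of 8 words, yields "This is a summarizer used for demonstration purposes".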
14. Example Rule: ADJECTIVES
- if (POS(word) == JJ && CHUNK(word) != ADJP-END && word-1 ∉ {most, least, more, less})
  - delete(word)
  - if (word-1 == CC && word-2 == JJ)
    - delete(word-1)
  - elseif (word+1 == CC && word+2 == JJ)
    - delete(word+1)
- Examples
  - Adam's only serious childhood illness had been measles
  - The virus triggered an [1 extremely ]1 [2 rare [3 and ]2 fatal ]3 condition (the numbered brackets mark deletable groups: "extremely", "rare and", "and fatal")
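The pseudocode above can be made executable. A sketch assuming tagged input as (token, POS, chunk) triples, which is an illustrative data layout rather than the shallow parser's actual output format:

```python
def adjective_deletions(tagged):
    """Executable sketch of the ADJECTIVES rule: returns the set of
    token indices proposed for deletion."""
    def pos(i):
        return tagged[i][1] if 0 <= i < len(tagged) else None
    def tok(i):
        return tagged[i][0].lower() if 0 <= i < len(tagged) else None
    delete = set()
    for i, (word, p, chunk) in enumerate(tagged):
        if p == 'JJ' and chunk != 'ADJP-END' and tok(i - 1) not in ('most', 'least', 'more', 'less'):
            delete.add(i)
            # a conjunction joining two adjectives is deleted with them
            if pos(i - 1) == 'CC' and pos(i - 2) == 'JJ':
                delete.add(i - 1)
            elif pos(i + 1) == 'CC' and pos(i + 2) == 'JJ':
                delete.add(i + 1)
    return delete
```

On "The virus triggered an extremely rare and fatal condition", the rule proposes deleting "rare", "and", and "fatal" ("extremely" is an adverb and would be handled by a separate rule).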
15. Relevance Measures (semantics)
- The deletion rules suggest more deletions than necessary for reaching the target compression
- The system rates the different possibilities and starts by deleting the least important phrases
- Relevance measures in MUSA are based on (a weighted combination of)
  - word frequencies (in the BNC corpus)
  - rule probabilities (as encountered in a parallel BBC corpus of transcripts with associated subtitles)
  - word durations (comparing estimates with actual durations)
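A toy version of such a weighted combination is sketched below. The weights, scaling, and the role of the duration signal are invented for illustration; the slide only states that the three measures are weighted and combined:

```python
def relevance(phrase_tokens, freq, rule_prob, duration,
              weights=(0.4, 0.4, 0.2)):
    """Toy weighted combination of the three relevance signals:
    corpus word frequency, deletion-rule probability from a parallel
    corpus, and estimated speech duration. All numbers illustrative.

    freq      -- dict mapping word -> relative frequency in [0, 1]
    rule_prob -- probability that this phrase type is deleted
    duration  -- estimated spoken duration in seconds
    """
    w_freq, w_rule, w_dur = weights
    avg_freq = sum(freq.get(t.lower(), 0) for t in phrase_tokens) / len(phrase_tokens)
    return (w_freq * (1.0 - min(avg_freq, 1.0))   # rare words carry more information
            + w_rule * (1.0 - rule_prob)          # rarely-deleted phrase types rate higher
            + w_dur * min(duration / 2.0, 1.0))   # duration signal, normalised to [0, 1]
```

Under this scheme, a phrase of frequent words that the parallel corpus often deletes scores lower (is deleted earlier) than a rare, rarely-deleted phrase.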
16. Example
- This is a basic summarizer for English used for demonstration purposes.
- (NP This) is (NP a basic[11] summarizer) (PNP for English)[10] used (PNP for demonstration purposes)[12]
- (bracketed numbers are relevance scores; the lowest-scoring phrase is deleted first)
- This is a basic summarizer used for demonstration purposes
- This is a summarizer used for demonstration purposes
- This is a summarizer used
17. Evaluation Data (Lumiere)
- "MMR: Every Parent's Choice"
  - 243 segments
  - 39.5% of the segments need compression
  - Average target compression rate: 4.58 words, 1.98 chars
- "The Tranquiliser Trap"
  - 287 segments
  - 50.52% of the segments need compression
  - Average target compression rate: 3.21 words, 2.0 chars
18. Human Evaluation
- (Results presented as a chart on the original slide; no transcript available.)
19. Conclusions
- We presented the Sentence Compression Module of the MUSA system
- An eclectic system combining statistical techniques for relevance detection with handcrafted deletion rules based on shallow-parser output
- Evaluation suggests usefulness (with transcripts as input)
- Future work
  - Porting to other languages
  - Machine learning of paraphrases
20. Demos
- Sentence Compression: http://cnts.uia.ac.be/cgi-bin/anja/musa
- MUSA demo: http://sifnos.ilsp.gr/musa/demos