Identifying Fragmented Words in Spoken Dialogue


1
Identifying Fragmented Words in Spoken Dialogue
  • Piroska Lendvai
  • Induction of Linguistic Knowledge Research Group
  • Dept. of Computational Linguistics
  • Tilburg University, NL

2
Outline
  • Incomplete words pose problems for NLP
  • Spoken Dutch Corpus
  • Task: classify each lexical item in a sentence as
    a completely or incompletely uttered word
  • Memory-based vs. rule-inducing learner
  • Extensive optimization strategy to
    • find optimal parameter settings
    • select informative features
  • Findings

3
Introduction
  • Spoken input contains disfluent speech
  • Disfluent speech involves fragmented words
    • he did not c-- call
    • the safe usage of interne-- sorry of electronic
      commerce
  • NLP tools often take the marking of fragments for
    granted, but
  • automatic identification of a fragment is not
    straightforward (see the sketch below):
    • a fragment is not always a single letter
    • it is not always absent from the lexicon
    • it may be identical to an existing word
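A minimal sketch of these naive cues, with a toy lexicon of our own; it illustrates why neither the 1-letter test nor the lexicon test is reliable on its own:

LEXICON = {"he", "did", "not", "call", "u", "ik", "in"}   # toy stand-in lexicon

def naive_fragment_cues(token):
    """Return the two testable naive cues for a candidate token."""
    return {
        "one_letter": len(token) == 1,           # 'u' is a real Dutch word
        "not_in_lexicon": token not in LEXICON,  # names and foreign words also fail this
    }

print(naive_fragment_cues("c"))   # true fragment: both cues fire
print(naive_fragment_cues("u"))   # real word: the 1-letter cue misfires
print(naive_fragment_cues("in"))  # fragment of 'interne(t)' identical to a real
                                  # word: neither cue fires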

4
Data
  • Spoken Dutch Corpus, 203 transcribed discourses
  • Various genres, 1-7 speakers
  • 40,840 lexical tokens
  • 44,939 sentences
  • 3,137 tagged fragmented words (0.9%)
  • Instance generation: automatically extracted
    feature values plus a binary class label (see the
    sketch below)
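A hedged sketch of such instance generation; the '--' fragment mark and the feature names here are illustrative conventions of ours, not the Spoken Dutch Corpus annotation format:

def make_instances(tokens):
    """Turn each token into (feature dict, binary class label)."""
    instances = []
    for i, tok in enumerate(tokens):
        label = "FRAGMENT" if tok.endswith("--") else "COMPLETE"
        word = tok.rstrip("-")
        features = {
            "word": word,
            "left": tokens[i - 1].rstrip("-") if i > 0 else "_",
            "right": tokens[i + 1].rstrip("-") if i + 1 < len(tokens) else "_",
            "position": i,                     # a sentence-based numeric cue
        }
        instances.append((features, label))
    return instances

for features, label in make_instances("he did not c-- call".split()):
    print(label, features)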

5
Cues in learning incomplete words
  • Vector of 22 features based on the corpus and the
    literature
  • Readily available, word-based cues:
    • lexical window: the 2 neighbouring unigram words
      left/right of the focus word (string)
    • overlap in letters / matching words (binary; see
      the sketch below)
    • sentence-based, general cues (numeric)
    • context-type (binary)
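A sketch of the letter-overlap cue: a fragment often shares its letters with the repair word that follows ('c--' is a prefix of 'call'). The function name is ours:

def prefix_overlap(focus, right_neighbour):
    """Binary cue: is the focus word a proper prefix of its right neighbour?"""
    return bool(focus) and len(focus) < len(right_neighbour) \
        and right_neighbour.startswith(focus)

print(prefix_overlap("c", "call"))            # True: fragment + repair
print(prefix_overlap("interne", "internet"))  # True
print(prefix_overlap("did", "not"))           # False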

6
22 features
7
Feature vector
8
Learning process
  • Memory-based (lazy) learner: IB1 in TiMBL
  • Rule-inducing (eager) learner: RIPPER
  • Discourse-based partitioning: 10-fold CV (see the
    sketch below)
  • Optimization with iterative deepening on the
    training data
  • The optimal learner per fold classifies the
    held-out test data
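A sketch of this evaluation setup with scikit-learn stand-ins: a 1-nearest-neighbour classifier approximates the memory-based IB1, and GroupKFold enforces discourse-based partitioning so no discourse is split across train and test. The original work used TiMBL and RIPPER, not scikit-learn, and the random data below is a placeholder:

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 22))             # 22 features per token instance
y = rng.integers(0, 2, 200)           # binary class: fragment or not
discourse = rng.integers(0, 20, 200)  # discourse id of each instance

for train_idx, test_idx in GroupKFold(n_splits=10).split(X, y, groups=discourse):
    clf = KNeighborsClassifier(n_neighbors=1)   # IB1-like 1-NN
    clf.fit(X[train_idx], y[train_idx])
    print("fold accuracy:", clf.score(X[test_idx], y[test_idx]))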

9
Task baselines
  • In-lexicon baseline
  • 1-letter baseline
  • Evaluative measures: Accuracy, Precision, Recall,
    Fβ=1 (see the scoring sketch below)

                 Accuracy   Prec    Rec   Fβ=1
    In-lexicon       91.4    2.4   53.6    4.6
    1-letter         97.4   54.3   43.9   48.5
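A sketch of how such a baseline can be scored on the Fragment class; the toy tokens and gold labels below are illustrative, not corpus data:

def precision_recall_f(gold, pred, positive="FRAG"):
    tp = sum(g == p == positive for g, p in zip(gold, pred))
    fp = sum(p == positive and g != positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f

# 1-letter baseline: predict FRAG for every single-character token
tokens = ["he", "did", "not", "c", "call", "u"]
gold = ["OK", "OK", "OK", "FRAG", "OK", "OK"]
pred = ["FRAG" if len(t) == 1 else "OK" for t in tokens]
print(precision_recall_f(gold, pred))   # 'u' costs the baseline precision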

10
Optimization by iterative deepening
  • Construct a large number of different learners by
    • varying algorithm parameters
    • varying the employed feature groups
  • Iterate (see the sketch below):
    • learners are trained on a portion of the 90%
      training set
    • tested on held-out data from the same 90%
      training set
    • ranked on the F-score of the Fragment class
    • good learners are re-trained on more data
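A rough sketch of the loop just described; the data fractions and the survivor ratio are illustrative choices of ours, not values from the slides:

def iterative_deepening(configs, train_data, score,
                        fractions=(0.1, 0.3, 1.0), keep_ratio=0.25):
    """Score candidate configs on growing data slices, keep the best each round."""
    survivors = list(configs)
    for frac in fractions:
        subset = train_data[: int(len(train_data) * frac)]
        survivors.sort(key=lambda cfg: score(cfg, subset), reverse=True)
        survivors = survivors[: max(1, int(len(survivors) * keep_ratio))]
    return survivors[0]     # best configuration after the final round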

11
Optimizing IB1
  • 4,301 learning experiments
  • Algorithm settings tested (optimal value found;
    see the sketch below):
    • number of nearest neighbours used for
      extrapolating the class: 1
    • distance weighting metric of the NNs: none
    • feature importance metric: χ²
    • metric for computing instance similarity:
      overlap
    • frequency threshold in the feature similarity
      metric: none
    • feature representation, varied over all possible
      combinations of feature groups: use all groups
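A scikit-learn stand-in configured to mirror this optimum; the parameter names are sklearn's rather than TiMBL's, the χ² feature weighting has no direct sklearn equivalent, and symbolic features are assumed to be integer-encoded:

from sklearn.neighbors import KNeighborsClassifier

ib1_like = KNeighborsClassifier(
    n_neighbors=1,       # class extrapolated from a single nearest neighbour
    weights="uniform",   # no distance weighting ('-' on the slide)
    metric="hamming",    # overlap-style similarity on symbolic features
)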

12
Optimizing RIPPER
  • 3,187 learning experiments
  • Algorithm settings tested (optimal value found;
    see the sketch below):
    • negative tests on the feature attributes:
      dis/allowed
    • number of optimization rounds on the ruleset: 3
    • minimum number of examples to be covered: 1
    • simplification/complication of hypotheses:
      complicate
    • loss ratio of costs: 0.5
    • adding redundant features: not allowed
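A hedged sketch using the wittgenstein Python RIPPER implementation (not the original C RIPPER); its k parameter is the number of optimization rounds, mirroring the optimum of 3 above, while the toy data and column names are ours and the other slide settings have no direct equivalent in this library:

import pandas as pd
import wittgenstein as lw

rows = [
    {"word_len": 1, "prefix_of_next": True,  "fragment": True},
    {"word_len": 4, "prefix_of_next": False, "fragment": False},
    {"word_len": 3, "prefix_of_next": False, "fragment": False},
    {"word_len": 7, "prefix_of_next": True,  "fragment": True},
]
df = pd.DataFrame(rows * 5)          # replicate so the grow/prune split works

clf = lw.RIPPER(k=3)                 # 3 optimization rounds, as on the slide
clf.fit(df, class_feat="fragment", pos_class=True)
print(clf.ruleset_)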

13
Results of default vs optimized learners
14
Results

15
Error analysis
  • Frequent false negatives: fragmented items that
    resemble true words
  • False positives: short true lexical items
  • Named entities, foreign words, and acronyms cause
    confusion
  • Annotation errors

16
Discussion
  • Readily available, word-based features
  • Context, letter overlap, and word identity are
    reliable cues
  • The iterative deepening search method is
    beneficial: the optimal settings uniformly differ
    from the defaults
  • The example-based learning approach is more
    successful than abstract rule induction
  • RIPPER overfits when trained on less than the
    total data; IB1 does not
  • Beneficial instance-specific behaviour observed
    (<100 rules, 1-7 conditions)

17
Future Work
  • Use ASR lexical output
  • Generate syntactic info from ASR output
  • Extract prosodic properties
  • Extend research to other disfluency types