1
ICASSP 05
2
References
  • Rapid Language Model Development Using External Resources for New Spoken Dialog Domains
  • Ruhi Sarikaya (IBM), Agustin Gravano (Columbia University), Yuqing Gao (IBM)
  • Maximum Entropy Based Generic Filter for Language Model Adaptation
  • Dong Yu, Milind Mahajan, Peter Mau, Alex Acero (Microsoft)
  • Language Model Estimation for Optimizing End-to-End Performance of a Natural Language Call Routing System
  • Vaibhava Goel, Hong-Kwang Jeff Kuo, Sabine Deligne, Cheng Wu (IBM)

3
Introduction
  • LM adaptation consists of four steps:
  • 1. Collection of task-specific adaptation data
  • 2. Normalization (abbreviations, dates and times, punctuation)
  • 3. Analysis of the adaptation data and construction of a task-specific LM
  • 4. Interpolation of the task-specific LM with the task-independent LM (a generic form is sketched below)
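A common form of this interpolation (a generic sketch; the individual papers may tune it differently) is linear:

P_{\text{adapted}}(w \mid h) = \lambda \, P_{\text{task}}(w \mid h) + (1 - \lambda) \, P_{\text{general}}(w \mid h), \qquad 0 \le \lambda \le 1

where the weight \lambda is typically chosen on held-out in-domain data.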

4
Introduction
  • Language modeling research has concentrated in two directions:
  • 1. Improving the language model probability estimation
  • 2. Obtaining additional training material
  • The largest data set is the World Wide Web (WWW): more than 4 billion pages

5
Introduction
  • Using web data for language modeling requires:
  • Query generation
  • Filtering the relevant text from the retrieved pages
  • The web counts are certainly less sparse than the counts in a corpus of fixed size
  • The web counts are also likely to be significantly noisier than counts obtained from a carefully cleaned and normalized corpus
  • Retrieval unit: whole document vs. sentence (utterance)

6
Build LM for new domain
  • In practice, when we start to build an SDS (spoken dialog system) for a new domain, the amount of in-domain data for the target domain is usually small
  • Definitions:
  • Static resources: corpora collected for other tasks
  • Dynamic resources: web data

7
Flow diagram for collecting relevant data
8
Generating Search Queries
  • Google is used as the search engine
  • The more specific a query is, the more relevant the retrieved pages are (a query-generation sketch follows)
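A minimal sketch of query generation, assuming queries are built from content-word n-grams of the in-domain sentences; the stop-word list, n-gram lengths, and phrase quoting are illustrative choices, not details from the paper.

import re

STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "on", "for"}

def make_queries(sentence, max_len=3):
    """Build progressively more specific queries from one in-domain sentence."""
    words = [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]
    queries = []
    for n in range(1, max_len + 1):          # longer n-grams give more specific queries
        queries += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return ['"%s"' % q for q in queries]     # quoted phrases for exact-match search

# Example: make_queries("I would like to book a flight to Boston")
# includes '"flight"', '"book flight"', '"book flight boston"', ...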

9
Similarity based sentence selection
  • The similarity measure is BLEU (BiLingual Evaluation Understudy) from machine translation:
  • BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)
  • N is the maximum n-gram length, w_n and p_n are the corresponding weight and precision, respectively, and BP is the brevity penalty:
  • BP = 1 if c > r, and BP = e^{1 - r/c} otherwise
  • where r and c are the lengths of the reference and candidate sentences, respectively

The similarity threshold is 0.08 (a scoring sketch follows).
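A minimal sketch of the selection step, assuming each retrieved web sentence is scored as the BLEU candidate against the in-domain sentences and kept if its best score reaches the threshold; the uniform weights, smoothing floor, and per-sentence max are illustrative choices, not details taken from the paper.

import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU of a candidate against one reference,
    using clipped n-gram precisions and the brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        p_n = clipped / total if clipped else 1e-9        # small floor avoids log(0)
        log_precisions.append(math.log(p_n))
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1.0 - r / max(c, 1))  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)     # uniform weights w_n = 1/N

def is_relevant(web_sentence, in_domain_sentences, threshold=0.08):
    """Keep a retrieved web sentence if it is similar enough to any in-domain sentence."""
    cand = web_sentence.split()
    return max(bleu(cand, s.split()) for s in in_domain_sentences) >= threshold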
10
Experimental result
  • SCLM: language model built using the static corpora
  • WWW-20 / WWW-100: the number of retrieved pages is limited to a predefined 20 / 100 per sentence

11
E-mail corpus
  • Dictated and non-dictated

12
Filtering the corpus
  • Filtering out these non-dictated texts is not an easy job in general
  • Hand-crafted rules (e.g. regular expressions) can be used, as in the sketch after this list
  • Limitations:
  • They do not generalize well to situations that have not been encountered
  • Rules are usually language dependent
  • Developing and testing rules is very costly
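An illustrative rule-based filter of the kind the slide refers to; the patterns below are hypothetical examples for e-mail data, not rules from the paper.

import re

# Hypothetical hand-crafted rules; each pattern targets one kind of non-dictated text.
NON_DICTATED_PATTERNS = [
    re.compile(r"^(from|to|cc|subject|date):", re.IGNORECASE),  # message headers
    re.compile(r"^>+"),                                         # quoted reply lines
    re.compile(r"https?://\S+"),                                # URLs
    re.compile(r"^-{2,}\s*$"),                                  # signature separators
]

def keep_line(line):
    """Return True if the line looks like dictated text under these rules."""
    return not any(p.search(line) for p in NON_DICTATED_PATTERNS)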

13
Maximum Entropy based filter
  • Consider the filtering task as a labeling problem: segment the adaptation data into two categories
  • Category D (dictated text): text which should be used for LM adaptation
  • Category N (non-dictated text): text which should not be used for LM adaptation
  • The text is divided into a sequence of text units (such as lines)

t_i is the text unit, and l_i is the label associated with t_i
14
Label dependency
  • Assume that the labels of the text units are independent of each other given the complete sequence of text units
  • We further assume that the label for a given unit depends only upon the units in a surrounding window of units
  • Window size k = 1 (the resulting factorization is sketched below)
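Under these two assumptions the labeling model factorizes as follows (a plausible form; the paper's exact notation may differ):

P(l_1, \ldots, l_M \mid t_1, \ldots, t_M) = \prod_{i=1}^{M} P(l_i \mid t_{i-k}, \ldots, t_{i+k}), \qquad k = 1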

15
Classification Model
  • A MaxEnt model has the standard exponential form
  • P(l_i \mid c_i) = \frac{\exp\left(\sum_j \lambda_j f_j(l_i, c_i)\right)}{\sum_{l} \exp\left(\sum_j \lambda_j f_j(l, c_i)\right)}, where c_i = (t_{i-k}, \ldots, t_{i+k}) is the context window
  • \lambda = (\lambda_1, \ldots, \lambda_J) is the vector of model parameters and the f_j are feature functions

16
Classification Model
  • The classification threshold is P_thresh = 0.5
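A plausible reading of the decision rule (an assumption; the slide gives only the threshold value):

\hat{l}_i = \begin{cases} D & \text{if } P(D \mid t_{i-k}, \ldots, t_{i+k}) \ge P_{\text{thresh}} = 0.5 \\ N & \text{otherwise} \end{cases}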

17
Features
18
Space Splitting
19
Evaluation
  • Uses only features RawCompact, EOS, and OOV
  • No space splitting

20
  • Filtering is especially important and effective for adaptation data with a high percentage of non-dictated text (U2)

21
Efficient linear combination for distant n-gram
models
  • David Langlois, Kamel Smaili, Jean-Paul Haton
  • EUROSPEECH 2003, pp. 409-412

22
Introduction
  • Classical n-gram model
  • Distant language models

23
Modeling distance in SLM
  • Cache model (self-relationship)
  • It deals with the self-relationship between a word present in the history and itself: if a word is frequent in the history, it has a greater chance of appearing again
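A generic cache component of this kind (a sketch of the idea, not the paper's exact formulation) scores a word by its relative frequency in the recent history and is interpolated with the static n-gram model:

P_{\text{cache}}(w \mid h) = \frac{C(w, h)}{|h|}, \qquad P(w \mid h) = (1 - \mu)\, P_{\text{ngram}}(w \mid h) + \mu\, P_{\text{cache}}(w \mid h)

where C(w, h) is the number of occurrences of w in the history window h and \mu is an interpolation weight.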

24
Modeling distance in SLM (cont.)
  • Trigger model: the relationship between two words
  • It deals with pairs of words v → w such that, if v (the triggering word) is in the history, w (the triggered word) has a greater chance of appearing
  • But, in fact, the majority of triggers are self-triggers (v → v): a word triggers itself

25
d-n-gram model
  • N_d(.) is the discounted count (a sketch of the resulting estimate follows)
  • The 0-n-gram model is the classical n-gram model
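A plausible reconstruction of the d-bigram case, assuming the distant model skips the d words immediately preceding the predicted word (for d = 0 this reduces to the classical bigram); the paper's exact notation may differ:

P_d(w_i \mid w_{i-1-d}) = \frac{N_d(w_{i-1-d}, w_i)}{\sum_{w} N_d(w_{i-1-d}, w)}

where N_d(u, w) is the discounted count of u followed, d positions later, by w.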

26
Evaluation
  • Vocabulary: 20k words
  • Training set: 38M words
  • Development set: 2M words
  • Test set: 2M words
  • Baseline: classical n-gram models

Model      Perplexity
unigram    739.9
bigram     132.4
trigram    97.8
27
Integration of distant n-gram models
  • A distant n-gram model cannot be used alone: it takes into account only a part of the history
  • The perplexity is 717 for n = 2 and d = 4
  • Several models, with distances up to d, are combined with the baseline model

28
Improvement: 7.1
Improvement: 3.1
The utility of distant n-gram models decreases with the distance: a distance greater than 2 does not provide more information.
29
Distant trigrams lead to an improvement, but it is smaller than for distant bigrams, because of the overlap between the histories of the d-trigram and the (d+1)-trigram.
30
Backoff smoothing
Backoff schemes compared: b_u_z, db_u_z, and (b_u_z) combined with (db_u_z).
31
Improvement: 7.9 (bigram combination)
Improvement: 11.6 (trigram combination)
32
Combination weight
  • Unique weight: a single weight per model
  • History-dependent weights: the model weights depend on each history (on the class of each sub-history)

33
Combination of distant n-gram
  • In order to combine K models M_1, ..., M_K, a set of weights \lambda_1, ..., \lambda_K is defined and the combination is expressed by
  • P(w \mid h) = \sum_{k=1}^{K} \lambda_k P_{M_k}(w \mid h), with \sum_{k} \lambda_k = 1
  • The development corpus is not sufficient to estimate a huge number of parameters
  • Classify the histories and set a weight to each class (a sketch follows this list)
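A minimal sketch of the combination, assuming each model exposes a conditional probability and the weight vector is looked up by the class of the history; the class function and data structures are illustrative, not the paper's implementation.

def combine(models, weights_by_class, history_class, word, history):
    """Linearly interpolate K models with weights chosen by the class of the history."""
    weights = weights_by_class[history_class(history)]   # one weight vector per history class
    return sum(w * m.prob(word, history) for w, m in zip(weights, models))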

34
Classification
  • Break the history into several parts (sub-histories). Each sub-history is analyzed in order to estimate its importance in terms of prediction, and is then put into a class
  • Such a class is directly linked to the value of the sub-history frequency
  • A class gathers all sub-histories which have approximately the same frequency (one possible binning is sketched below)
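One simple way to realize such frequency classes (an illustrative choice; the paper's binning may differ) is to bucket sub-histories by the order of magnitude of their training-set count:

import math

def frequency_class(count, num_classes=16):
    """Map a sub-history's training count to a class index; similar counts share a class."""
    if count <= 0:
        return 0                                  # class 0 for unseen sub-histories
    return min(num_classes - 1, 1 + int(math.log2(count)))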

35
Bigram combination: with 8000 classes the perplexity is 115.4, a 12.8% improvement over the baseline (132.4) and a 5.3% improvement over the single-weight combination (121.9).
Trigram combination: with 4000 classes the perplexity is 85.2, a 12.8% improvement over the baseline (97.8) and a 1.5% improvement over the single-weight combination (86.5).