Title: ICASSP 05
References
- Rapid Language Model Development Using External Resources for New Spoken Dialog Domains - Ruhi Sarikaya (IBM), Agustin Gravano (Columbia University), Yuqing Gao (IBM)
- Maximum Entropy Based Generic Filter for Language Model Adaptation - Dong Yu, Milind Mahajan, Peter Mau, Alex Acero (Microsoft)
- Language Model Estimation for Optimizing End-to-End Performance of a Natural Language Call Routing System - Vaibhava Goel, Hong-Kwang Jeff Kuo, Sabine Deligne, Cheng Wu (IBM)
Introduction
- LM adaptation consists of four steps:
- 1. Collect task-specific adaptation data
- 2. Normalize the data (abbreviations, dates and times, punctuation)
- 3. Analyze the adaptation data and build a task-specific LM
- 4. Interpolate the task-specific LM with the task-independent LM (see the sketch below)
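As a concrete illustration of step 4, here is a minimal sketch of the interpolation, assuming both LMs have been reduced to word-probability lookups; the weight lam = 0.3 is a hypothetical value that would normally be tuned on held-out in-domain data (real systems interpolate full n-gram models, but the combination rule is the same).

    # Minimal sketch of step 4: interpolating a task-specific LM with a
    # task-independent LM. Both models are represented as plain unigram
    # probability dictionaries purely for illustration.
    def interpolate(p_task, p_general, lam=0.3):
        """P_adapted(w) = lam * P_task(w) + (1 - lam) * P_general(w).
        `lam` is a hypothetical weight, normally tuned on held-out data."""
        vocab = set(p_task) | set(p_general)
        return {w: lam * p_task.get(w, 0.0) + (1 - lam) * p_general.get(w, 0.0)
                for w in vocab}

    # Toy usage with made-up probabilities.
    p_task = {"flight": 0.4, "book": 0.3, "to": 0.3}
    p_general = {"the": 0.5, "to": 0.3, "flight": 0.2}
    print(round(interpolate(p_task, p_general)["flight"], 2))  # 0.4*0.3 + 0.2*0.7 = 0.26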
Introduction (cont.)
- Language modeling research has concentrated in two directions:
- 1. Improving the language model probability estimation
- 2. Obtaining additional training material
- The largest data set is the World Wide Web (WWW): more than 4 billion pages
Introduction (cont.)
- Using web data for language modeling involves:
- Query generation
- Filtering the relevant text from the retrieved pages
- Web counts are certainly less sparse than the counts in a corpus of a fixed size
- Web counts are also likely to be significantly noisier than counts obtained from a carefully cleaned and normalized corpus
- Retrieval unit: whole document vs. sentence (utterance)
Building an LM for a new domain
- In practice, when we start to build an SDS (spoken dialog system) for a new domain, the amount of in-domain data for the target domain is usually small
- Definitions:
- Static resource: corpora collected for other tasks
- Dynamic resource: web data
Flow diagram for collecting relevant data
Generating search queries
- Google is used as the search engine
- The more specific a query is, the more relevant the retrieved pages are
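The slide does not spell out how the queries are generated, so the sketch below only illustrates the general idea of turning frequent in-domain n-grams into quoted phrase queries (quoting is what makes a query more specific); the function and parameter names are my own.

    from collections import Counter

    def generate_queries(utterances, n=3, top_k=10):
        """Turn the most frequent in-domain word n-grams into quoted search
        queries. Quoted phrases act as exact-match constraints, so longer
        n-grams yield more specific queries and more relevant pages."""
        counts = Counter()
        for utt in utterances:
            words = utt.lower().split()
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
        return ['"%s"' % " ".join(gram) for gram, _ in counts.most_common(top_k)]

    # Toy in-domain utterances from a flight-booking dialog domain.
    utts = ["i want to book a flight to boston",
            "book a flight to new york for tomorrow",
            "i need to book a flight to boston on monday"]
    print(generate_queries(utts, n=3, top_k=3))  # three most frequent trigrams, quoted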
Similarity-based sentence selection
- Retrieved sentences are scored with machine translation's BLEU (BiLingual Evaluation Understudy) metric:
- BLEU = BP * exp( sum_{n=1..N} w_n * log p_n )
- N is the maximum n-gram length; w_n and p_n are the corresponding weight and precision, respectively; BP is the brevity penalty:
- BP = 1 if c > r, and exp(1 - r/c) otherwise, where r and c are the lengths of the reference and candidate sentences, respectively
- The selection threshold is 0.08
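A minimal sketch of the selection step, under simplifying assumptions: a smoothed sentence-level BLEU with uniform weights is computed against each in-domain sentence, and a retrieved sentence is kept if its best score exceeds the 0.08 threshold from the slide (the smoothing and the use of the maximum over references are my own choices, not necessarily the paper's).

    import math
    from collections import Counter

    def sentence_bleu(candidate, reference, max_n=4):
        """Simplified sentence-level BLEU: geometric mean of modified n-gram
        precisions p_n with uniform weights w_n = 1/N, times the brevity
        penalty BP = min(1, exp(1 - r/c))."""
        cand, ref = candidate.split(), reference.split()
        log_prec = 0.0
        for n in range(1, max_n + 1):
            cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
            ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
            overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
            total = max(sum(cand_ngrams.values()), 1)
            # add-one smoothing so one empty n-gram order does not zero the score
            log_prec += math.log((overlap + 1.0) / (total + 1.0)) / max_n
        bp = min(1.0, math.exp(1.0 - len(ref) / max(len(cand), 1)))
        return bp * math.exp(log_prec)

    def select_sentences(web_sentences, in_domain, threshold=0.08):
        """Keep a retrieved sentence if its best score against any in-domain
        sentence exceeds the threshold."""
        return [s for s in web_sentences
                if max(sentence_bleu(s, r) for r in in_domain) > threshold]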
Experimental results
- SCLM: language model built from the static corpora
- WWW-20 / WWW-100: a predefined limit of 20 / 100 pages per sentence
E-mail corpus
- Dictated and non-dictated
Filtering the corpus
- Filtering out these non-dictated texts is not easy in general
- Hand-crafted rules (e.g. regular expressions) have limitations (see the example below):
- They do not generalize well to situations we have not encountered
- Rules are usually language dependent
- Developing and testing rules is very costly
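For concreteness, here is the kind of hand-crafted rule set the slide is criticizing, sketched for English e-mail text; the patterns are illustrative only, and the last example shows how easily non-dictated text slips through.

    import re

    # A few hand-crafted rules of the kind criticized above: drop quoted
    # replies, signature separators, and forwarded headers from e-mail text.
    # Every new client, language, or quoting convention needs another
    # pattern, which is why this approach scales poorly.
    NON_DICTATED = [
        re.compile(r"^\s*>"),                           # quoted reply line
        re.compile(r"^--\s*$"),                         # signature separator
        re.compile(r"^(from|to|subject|sent):", re.I),  # forwarded mail header
    ]

    def keep_line(line):
        return not any(p.search(line) for p in NON_DICTATED)

    mail = ["Hi Bob, let us meet at noon.", "> On Monday you wrote:", "-- ", "John"]
    print([l for l in mail if keep_line(l)])
    # ['Hi Bob, let us meet at noon.', 'John']  <- the bare signature name slips through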
Maximum Entropy-based filter
- Consider the filtering task as a labeling problem: segment the adaptation data into two categories
- Category D (dictated text): text which should be used for LM adaptation
- Category N (non-dictated text): text which should not be used for LM adaptation
- The text is divided into a sequence of text units (such as lines) t_1, ..., t_n, where t_i is a text unit and l_i is the label associated with t_i
Label dependency
- Assume that the labels of the text units are independent of each other given the complete sequence of text units:
- P(l_1, ..., l_n | t_1, ..., t_n) = prod_i P(l_i | t_1, ..., t_n)
- We further assume that the label for a given unit depends only on the units in a surrounding window of k units, with k = 1:
- P(l_i | t_1, ..., t_n) ≈ P(l_i | t_{i-k}, ..., t_{i+k})
Classification Model
- A MaxEnt model has the form
- P(l_i | t_{i-k}, ..., t_{i+k}) = exp( sum_j lambda_j * f_j(l_i, t_{i-k}, ..., t_{i+k}) ) / Z(t_{i-k}, ..., t_{i+k})
- where Z is the normalizing sum over both labels, the f_j are feature functions, and lambda = (lambda_1, ..., lambda_m) is the vector of model parameters
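A minimal sketch of such a binary MaxEnt classifier over text units. With two labels the model reduces to logistic regression; the features below and the gradient-ascent training loop are illustrative stand-ins, not the paper's RawCompact / EOS / OOV features or its training procedure.

    import math
    import re

    def features(line):
        """Illustrative binary features of a text unit (a line)."""
        words = line.split()
        return {
            "ends_with_punct": float(bool(re.search(r"[.!?]$", line.strip()))),
            "starts_quoted": float(line.lstrip().startswith(">")),
            "short_line": float(len(words) < 3),
            "bias": 1.0,
        }

    def prob_dictated(line, lam):
        """P(D | line) under the two-class MaxEnt model (logistic form)."""
        score = sum(lam.get(f, 0.0) * v for f, v in features(line).items())
        return 1.0 / (1.0 + math.exp(-score))

    def train(lines, labels, epochs=200, lr=0.5):
        """Gradient ascent on the conditional log-likelihood; y = 1 for
        category D (dictated), y = 0 for category N (non-dictated)."""
        lam = {}
        for _ in range(epochs):
            for line, y in zip(lines, labels):
                err = y - prob_dictated(line, lam)
                for f, v in features(line).items():
                    lam[f] = lam.get(f, 0.0) + lr * err * v
        return lam

    lines = ["Please send me the updated report by Friday.",
             "> On Tue, Alice wrote:", "Thanks, I will review it today.", "-- John"]
    lam = train(lines, [1, 0, 1, 0])
    print(round(prob_dictated("Could you confirm the meeting time?", lam), 2))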
Classification Model (cont.)
Features
Space Splitting
Evaluation
- The evaluated system uses only the features RawCompact, EOS, and OOV
- No space splitting
- Filtering is especially important and effective for adaptation data with a high percentage of non-dictated text (U2)
Efficient linear combination for distant n-gram models
- David Langlois, Kamel Smaili, Jean-Paul Haton
- EUROSPEECH 2003, pp. 409-412
Introduction
- Classical n-gram model
- Distant language models
Modelization of distance in SLM
- Cache model (self-relationship)
- The cache model deals with the self-relationship between a word present in the history and itself: if a word is frequent in the history, it has more chance to appear once again
Modelization of distance in SLM (cont.)
- Trigger model: the relationship between two words
- It deals with couples of words v → w such that if v (the triggering word) is in the history, w (the triggered word) has more chance to appear
- But, in fact, the majority of triggers are self-triggers (v → v): a word triggers itself
d-n-gram model
- A d-n-gram model predicts w_i from an (n-1)-word history located d positions further back in the sentence, so the 0-n-gram model is the classical n-gram model
- N_d(.) is the discounted count of events at distance d
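To make the distance concrete, here is a small sketch of collecting distance-d bigram events, where d = 0 gives the ordinary adjacent bigram; the raw counts below stand in for the discounted counts N_d(.) used by the model.

    from collections import Counter

    def distant_bigram_counts(words, d):
        """Count pairs (w[i-1-d], w[i]): the predicted word and a context word
        that sits d positions further back than the usual bigram context.
        With d = 0 this is the classical adjacent-bigram count; the paper
        uses discounted counts, raw counts are shown only for illustration."""
        gap = d + 1
        return Counter((words[i - gap], words[i]) for i in range(gap, len(words)))

    words = "i would like to book a flight to boston".split()
    print(distant_bigram_counts(words, 0)[("to", "book")])    # adjacent bigram -> 1
    print(distant_bigram_counts(words, 2)[("would", "book")]) # two words skipped -> 1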
Evaluation
- Vocabulary: 20k words
- Training set: 38M words
- Development set: 2M words
- Test set: 2M words
- Baseline: classical n-gram models

Model      Perplexity
unigram    739.9
bigram     132.4
trigram    97.8
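As a reminder, the perplexity reported throughout these experiments is the exponentiated average negative log-probability that the model assigns to the test words; a minimal sketch:

    import math

    def perplexity(word_log_probs):
        """PP = exp( -(1/N) * sum_i ln p(w_i | h_i) ), computed from the
        per-word log-probabilities the model assigns on the test set."""
        return math.exp(-sum(word_log_probs) / len(word_log_probs))

    # Sanity check: a uniform model over a 20k-word vocabulary has PP = 20000.
    print(round(perplexity([math.log(1.0 / 20000)] * 5)))  # 20000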
Integration of distant n-gram models
- A distant n-gram model cannot be used alone, since it takes into account only a part of the history
- Perplexity is 717 for n = 2 and d = 4
- Several models, with distances up to d, are combined with the baseline model
- Improvement: 7.1
- Improvement: 3.1
- The utility of distant n-gram models decreases with the distance: a distance greater than 2 does not provide more information
- Distant trigrams lead to an improvement, but it is smaller than for distant bigrams, probably because of the overlap between the history of the d-trigram and that of the (d-1)-trigram
Backoff smoothing
- Configurations compared: b_u_z, db_u_z, and their combination (b_u_z)+(db_u_z)
- Improvements: 7.9 and 11.6
Combination weights
- A unique weight
- Model weights that depend on each history (the class of each sub-history)
Combination of distant n-gram models
- In order to combine K models M_1, ..., M_K, a set of weights lambda_1, ..., lambda_K is defined and the combination is expressed by
- P(w | h) = sum_{k=1..K} lambda_k * P_{M_k}(w | h), with sum_k lambda_k = 1
- The development corpus is not sufficient to estimate a huge number of parameters
- Therefore, histories are classified and a weight is set for each class (see the sketch below)
Classification
- Break the history into several parts (sub-histories). Each sub-history is analyzed in order to estimate its importance in terms of prediction, and is then put into a class
- Such a class is directly linked to the value of the sub-history frequency
- A class gathers all sub-histories which have approximately the same frequency
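A small sketch of the bucketing idea: each sub-history is mapped to a class according to its training-set frequency, and each class later receives its own interpolation weight; the logarithmic binning below is an assumption for illustration, since the slide only says "approximately the same frequency".

    import math
    from collections import Counter

    def history_class(subhistory, counts, num_classes=8):
        """Map a sub-history to a class index based on its training-set
        frequency, so that sub-histories with roughly the same frequency
        share a class (and hence a weight). Log-scale binning is assumed."""
        freq = counts.get(subhistory, 0)
        return min(int(math.log2(freq + 1)), num_classes - 1)

    words = "book a flight book a flight to boston book a ticket".split()
    counts = Counter((words[i - 1],) for i in range(1, len(words)))  # one-word sub-histories
    print(history_class(("book",), counts))    # frequent sub-history -> class 2
    print(history_class(("boston",), counts))  # rare sub-history -> class 1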
- 8000 classes: perplexity 115.4, a 12.8% improvement over the bigram baseline (132.4) and a 5.3% improvement over the single-weight combination (121.9)
- 4000 classes: perplexity 85.2, a 12.8% improvement over the trigram baseline (97.8) and a 1.5% improvement over the single-weight combination (86.5)