1. Adapting Hybrid ANN/HMM to Speech Variations
- Stefano Scanzio, Pietro Laface
- Politecnico di Torino
- Dario Albesano, Roberto Gemello, and Franco Mana
2. Acoustic Model Adaptation
- Adaptation tasks
- Linear Input Network
- Linear Hidden Network
- Catastrophic forgetting
- Conservative Training
- Results on several adaptation tasks
3. Acoustic model adaptation
- Specific speaker
- Speaking style (spontaneous, regional accents)
- Audio channel (telephone, cellular, microphone)
- Environment (car, office, ...)
- Specific vocabulary
[Diagram: Voice Application, ASR, and Data Log]
4. Linear Input Network (LIN) adaptation

[Diagram: a LIN layer is inserted between the speech parameters and the input layer of the speaker/task-independent MLP, whose output layer produces the emission probabilities of the acoustic-phonetic units]
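A minimal sketch of the LIN idea in NumPy (sizes, names, and the frozen network are illustrative assumptions, not taken from the slides): a square linear transform, initialized to the identity so that adaptation starts exactly from the unadapted network, is placed in front of the frozen speaker/task-independent MLP, and only this transform is trained on the adaptation data.

    import numpy as np

    D = 273  # hypothetical input feature dimension

    # LIN weights: identity at the start of adaptation, so the initial
    # adapted network behaves exactly like the original one.
    W_lin = np.eye(D)

    def lin_forward(x, frozen_mlp):
        """Map the speech parameters through the trainable LIN,
        then through the frozen speaker/task-independent MLP."""
        return frozen_mlp(W_lin @ x)

    # During adaptation, gradients flow back through the frozen MLP,
    # but only W_lin is updated.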
5. Linear Hidden Network (LHN)

[Diagram: the LHN layer is inserted after a hidden layer of the MLP, between hidden layer 1 and hidden layer 2; the network maps the speech parameters through the input layer and the hidden layers to the emission probabilities of the acoustic-phonetic units]
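LHN applies the same kind of identity-initialized transform after a hidden layer instead of at the input. A sketch under the same illustrative assumptions (layer widths, the tanh non-linearity, and the softmax output are placeholders for the actual network):

    import numpy as np

    H = 512            # hypothetical width of hidden layer 1
    W_lhn = np.eye(H)  # LHN transform, identity at the start of adaptation

    def lhn_forward(x, W1, b1, W2, b2, W_out, b_out):
        """Frozen MLP with an LHN transform inserted between
        hidden layer 1 and hidden layer 2."""
        h1 = np.tanh(W1 @ x + b1)     # hidden layer 1 (frozen)
        h1 = W_lhn @ h1               # LHN: the only trainable weights
        h2 = np.tanh(W2 @ h1 + b2)    # hidden layer 2 (frozen)
        z = W_out @ h2 + b_out        # output layer (frozen)
        e = np.exp(z - z.max())
        return e / e.sum()            # emission probabilities (softmax)

Because the transform acts on an internal representation shared by all output classes, it adapts learned features rather than raw input parameters.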
6. Catastrophic forgetting
- Acquiring new information can damage previously learned information if the new data do not adequately represent the knowledge included in the original training data.
- This effect is evident when the adaptation data do not contain examples for a subset of the output classes.
- The problem is more severe in the ANN framework than in the Gaussian mixture HMM framework.
7. Catastrophic forgetting
- The back-propagation algorithm penalizes classes with no adaptation examples by setting their target value to zero for every adaptation frame.
- Thus, during adaptation, the weights of the ANN will be biased
  - to favor the activations of the classes with samples in the adaptation set
  - to weaken the other classes (see the sketch below).
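A toy sketch of what this standard policy does over adaptation frames (the class indices follow the 16-class example on the next slide; the code is illustrative):

    import numpy as np

    N_CLASSES = 16
    adaptation_labels = [6, 7, 6, 7]  # only classes 6 and 7 occur

    for label in adaptation_labels:
        target = np.zeros(N_CLASSES)  # standard one-hot assignment
        target[label] = 1.0
        # The 14 missing classes get target 0 on every single frame,
        # so back-propagation steadily drives their activations down.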
8. 16-class training
- 2 input nodes
- 20 x 2 hidden nodes
- 16 output nodes
- 2500 training patterns per class
- The adaptation set includes 5000 patterns belonging only to classes 6 and 7.
9. Adaptation of 2 Classes

[Plot: decision regions of the 16-class network, with the adaptation classes 6 and 7 marked]
10. Conservative Training target assignment policy

[Figure: Conservative Training vs standard target assignment. Legend: P2 is the class corresponding to the current input frame; Px: a class with samples in the adaptation set; Mx: a missing class]
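A sketch of the two policies for a single adaptation frame (hypothetical helpers, not code from the authors; original_posteriors stands for the outputs of the unadapted network on the current frame, and the CT rules follow the Conservative Training slide in the appendix):

    import numpy as np

    def standard_targets(n_classes, correct):
        """Standard policy: one-hot target on the correct class."""
        t = np.zeros(n_classes)
        t[correct] = 1.0
        return t

    def ct_targets(n_classes, correct, adaptation_classes, original_posteriors):
        """Conservative Training: missing classes (Mx) keep the posterior
        of the original network; adaptation-set classes (Px) other than
        the correct one stay at 0; the correct class (P2) takes the
        remaining probability mass."""
        t = np.zeros(n_classes)
        for c in range(n_classes):
            if c not in adaptation_classes:
                t[c] = original_posteriors[c]  # missing class (Mx)
        t[correct] = 1.0 - t.sum()             # correct class (P2)
        return t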
11. Adaptation of 2 classes

[Plots: decision regions for classes 6 and 7 after standard adaptation and after Conservative Training adaptation]
12. Adaptation tasks
- Application data adaptation: Directory Assistance
  - 9325 Italian city names
  - 53713 training / 3917 test utterances
- Vocabulary adaptation: Command words
  - 30 command words
  - 6189 training / 3094 test utterances
- Channel-Environment adaptation: Aurora-3
  - 2951 training / 654 test utterances
- Speaker adaptation: WSJ0
  - 8 speakers, 16 kHz
  - 40 training / 40 test sentences
13. Results on different tasks (WER)

    Adaptation Method   Application:           Vocabulary:      Channel-Environment:
                        Directory Assistance   Command Words    Aurora-3 CH1
    No adaptation       14.6                   3.8              24.0
    LIN                 11.2                   3.4              11.0
    LIN + CT            12.4                   3.4              15.3
    LHN                  9.6                   2.1               9.8
    LHN + CT            10.1                   2.3              10.4
14. Mitigation of Catastrophic Forgetting using Conservative Training

Tests using the models adapted on each task (columns) on a generic Italian continuous speech task (% WER):

    Adaptation Method   Directory Assistance   Command Words   Aurora-3 CH1
    LIN                 36.3                   42.7            108.6
    LIN + CT            36.5                   35.2             42.1
    LHN                 40.6                   63.7            152.1
    LHN + CT            40.7                   45.3             44.2
    No adaptation       29.3                   29.3             29.3
15. Networks used in the Speaker Adaptation Task
- STD (Standard)
  - 2-hidden-layer hybrid MLP-HMM model
  - 273 input features (39 parameters x 7 context frames; see the stacking sketch below)
- IMP (Improved)
  - Uses a wider input window spanning a time context of 25 frames
  - Includes an additional hidden layer
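For reference, the 273 STD input features come from stacking the 39 parameters over the 7 context frames. A sketch of that stacking (the array layout is an assumption):

    import numpy as np

    def stack_context(frames, t, context=7):
        """Concatenate `context` consecutive 39-dimensional frames
        centered on frame t into one 39 * 7 = 273 dimensional vector.
        `frames` has shape (T, 39); t is assumed away from the edges."""
        half = context // 2
        window = frames[t - half : t + half + 1]  # shape (7, 39)
        return window.reshape(-1)                 # shape (273,)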
16. Results on WSJ0 Speaker Adaptation Task

    Net type   Adaptation method   WER (trigram LM)
    STD        No adaptation        8.4
    STD        LIN                  7.9
    STD        LIN + CT             7.1
    STD        LHN + CT             6.6
    STD        LIN + LHN + CT       6.3
    IMP        No adaptation        6.5
    IMP        LHN + CT             5.6
    IMP        LIN + LHN + CT       5.0
17. Conclusions
- LHN adaptation outperforms LIN adaptation
  - Linear transformations at different levels produce different positive effects
  - LIN + LHN performs better than LHN alone
- In adaptation tasks with missing classes, Conservative Training
  - reduces the catastrophic forgetting effect, preserving the performance on another generic task
  - improves the performance in speaker adaptation with few available sentences
18. Weight merging
19. Conservative Training (CT)
- For each observation frame:
  1) Set the target value of each class that has no (or few) adaptation data to its posterior probability computed by the original network.
  2) Set to zero the target value of each class that has adaptation data but does not correspond to the input frame.
  3) Set the target value of the class corresponding to the input frame to 1 minus the sum of the posterior probabilities assigned according to rule 1).
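Written compactly (a restatement of rules 1)-3), with c(x) the class of the current input frame, A the set of classes with adaptation data, and p_orig the posteriors computed by the original network):

    t_i(x) =
    \begin{cases}
      p_{\mathrm{orig}}(i \mid x) & \text{if } i \notin A \\
      0 & \text{if } i \in A,\ i \neq c(x) \\
      1 - \sum_{m \notin A} p_{\mathrm{orig}}(m \mid x) & \text{if } i = c(x)
    \end{cases}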
20. Conclusions on LHN
- LHN outperforms LIN
  - Linear transformations at different levels produce different positive effects
  - LIN + LHN performs better than LHN alone
- For continuous speech, the wide-input IMP network is better than the STD one