1. Adapting Hybrid ANN/HMM to Speech Variations
- Stefano Scanzio, Pietro Laface
- Politecnico di Torino
- Dario Albesano, Roberto Gemello, and Franco Mana
2. Acoustic Model Adaptation
- Adaptation tasks
- Linear Input Network
- Linear Hidden Network
- Catastrophic forgetting
- Conservative Training
- Results on several adaptation tasks
3. Acoustic model adaptation
- Specific speaker
- Speaking style (spontaneous, regional accents)
- Audio channel (telephone, cellular, microphone)
- Environment (car, office, ...)
- Specific vocabulary
[Diagram: Voice Application, ASR, and Data Log]
4. Linear Input Network (LIN) adaptation

[Diagram: a LIN layer is inserted between the speech parameters and the input layer of the speaker/task-independent MLP, whose output layer produces the emission probabilities of the acoustic-phonetic units]
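A minimal sketch of the LIN idea in NumPy (sizes, names, and the frozen network are illustrative assumptions, not taken from the slides): a square linear transform, initialized to the identity so that adaptation starts exactly from the unadapted network, is placed in front of the frozen speaker/task-independent MLP, and only this transform is trained on the adaptation data.

    import numpy as np

    D = 273  # hypothetical input feature dimension

    # LIN weights: identity at the start of adaptation, so the initial
    # adapted network behaves exactly like the original one.
    W_lin = np.eye(D)

    def lin_forward(x, frozen_mlp):
        """Map the speech parameters through the trainable LIN,
        then through the frozen speaker/task-independent MLP."""
        return frozen_mlp(W_lin @ x)

    # During adaptation, gradients flow back through the frozen MLP,
    # but only W_lin is updated.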
5. Linear Hidden Network (LHN)

[Diagram: the LHN layer is inserted after a hidden layer of the MLP, between hidden layer 1 and hidden layer 2; the network maps the speech parameters through the input layer and the hidden layers to the emission probabilities of the acoustic-phonetic units]
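LHN applies the same kind of identity-initialized transform after a hidden layer instead of at the input. A sketch under the same illustrative assumptions (layer widths, the tanh non-linearity, and the softmax output are placeholders for the actual network):

    import numpy as np

    H = 512            # hypothetical width of hidden layer 1
    W_lhn = np.eye(H)  # LHN transform, identity at the start of adaptation

    def lhn_forward(x, W1, b1, W2, b2, W_out, b_out):
        """Frozen MLP with an LHN transform inserted between
        hidden layer 1 and hidden layer 2."""
        h1 = np.tanh(W1 @ x + b1)     # hidden layer 1 (frozen)
        h1 = W_lhn @ h1               # LHN: the only trainable weights
        h2 = np.tanh(W2 @ h1 + b2)    # hidden layer 2 (frozen)
        z = W_out @ h2 + b_out        # output layer (frozen)
        e = np.exp(z - z.max())
        return e / e.sum()            # emission probabilities (softmax)

Because the transform acts on an internal representation shared by all output classes, it adapts learned features rather than raw input parameters.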
6. Catastrophic forgetting
- Acquiring new information can damage previously learned information if the new data do not adequately represent the knowledge included in the original training data.
- This effect is evident when the adaptation data do not contain examples for a subset of the output classes.
- The problem is more severe in the ANN framework than in the Gaussian mixture HMM framework.
7. Catastrophic forgetting
- The back-propagation algorithm penalizes classes with no adaptation examples by setting their target value to zero for every adaptation frame.
- Thus, during adaptation, the weights of the ANN will be biased
  - to favor the activations of the classes with samples in the adaptation set
  - to weaken the other classes (see the sketch below).
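A toy sketch of what this standard policy does over adaptation frames (the class indices follow the 16-class example on the next slide; the code is illustrative):

    import numpy as np

    N_CLASSES = 16
    adaptation_labels = [6, 7, 6, 7]  # only classes 6 and 7 occur

    for label in adaptation_labels:
        target = np.zeros(N_CLASSES)  # standard one-hot assignment
        target[label] = 1.0
        # The 14 missing classes get target 0 on every single frame,
        # so back-propagation steadily drives their activations down.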
8. 16-class training
- 2 input nodes
- 20 x 2 hidden nodes
- 16 output nodes
- 2500 training patterns per class
- The adaptation set includes 5000 patterns belonging only to classes 6 and 7.
9. Adaptation of 2 Classes

[Plot: decision regions of the 16-class network, with the adaptation classes 6 and 7 marked]
10. Conservative Training target assignment policy

[Figure: Conservative Training vs standard target assignment. Legend: P2 is the class corresponding to the current input frame; Px: a class with samples in the adaptation set; Mx: a missing class]
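A sketch of the two policies for a single adaptation frame (hypothetical helpers, not code from the authors; original_posteriors stands for the outputs of the unadapted network on the current frame, and the CT rules follow the Conservative Training slide in the appendix):

    import numpy as np

    def standard_targets(n_classes, correct):
        """Standard policy: one-hot target on the correct class."""
        t = np.zeros(n_classes)
        t[correct] = 1.0
        return t

    def ct_targets(n_classes, correct, adaptation_classes, original_posteriors):
        """Conservative Training: missing classes (Mx) keep the posterior
        of the original network; adaptation-set classes (Px) other than
        the correct one stay at 0; the correct class (P2) takes the
        remaining probability mass."""
        t = np.zeros(n_classes)
        for c in range(n_classes):
            if c not in adaptation_classes:
                t[c] = original_posteriors[c]  # missing class (Mx)
        t[correct] = 1.0 - t.sum()             # correct class (P2)
        return t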
11. Adaptation of 2 classes

[Plots: decision regions for classes 6 and 7 after standard adaptation and after Conservative Training adaptation]
12. Adaptation tasks
- Application data adaptation: Directory Assistance
  - 9325 Italian city names
  - 53713 training / 3917 test utterances
- Vocabulary adaptation: Command words
  - 30 command words
  - 6189 training / 3094 test utterances
- Channel-Environment adaptation: Aurora-3
  - 2951 training / 654 test utterances
- Speaker adaptation: WSJ0
  - 8 speakers, 16 kHz
  - 40 training / 40 test sentences
13. Results on different tasks (WER)

    Adaptation Method   Application:           Vocabulary:      Channel-Environment:
                        Directory Assistance   Command Words    Aurora-3 CH1
    No adaptation       14.6                   3.8              24.0
    LIN                 11.2                   3.4              11.0
    LIN + CT            12.4                   3.4              15.3
    LHN                  9.6                   2.1               9.8
    LHN + CT            10.1                   2.3              10.4
14. Mitigation of Catastrophic Forgetting using Conservative Training

Tests using the models adapted on each task (columns) on a generic Italian continuous speech task (% WER):

    Adaptation Method   Directory Assistance   Command Words   Aurora-3 CH1
    LIN                 36.3                   42.7            108.6
    LIN + CT            36.5                   35.2             42.1
    LHN                 40.6                   63.7            152.1
    LHN + CT            40.7                   45.3             44.2
    No adaptation       29.3                   29.3             29.3
15. Networks used in the Speaker Adaptation Task
- STD (Standard)
  - 2-hidden-layer hybrid MLP-HMM model
  - 273 input features (39 parameters x 7 context frames; see the stacking sketch below)
- IMP (Improved)
  - Uses a wider input window spanning a time context of 25 frames
  - Includes an additional hidden layer
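For reference, the 273 STD input features come from stacking the 39 parameters over the 7 context frames. A sketch of that stacking (the array layout is an assumption):

    import numpy as np

    def stack_context(frames, t, context=7):
        """Concatenate `context` consecutive 39-dimensional frames
        centered on frame t into one 39 * 7 = 273 dimensional vector.
        `frames` has shape (T, 39); t is assumed away from the edges."""
        half = context // 2
        window = frames[t - half : t + half + 1]  # shape (7, 39)
        return window.reshape(-1)                 # shape (273,)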
16. Results on WSJ0 Speaker Adaptation Task

    Net type   Adaptation method   WER (trigram LM)
    STD        No adaptation        8.4
    STD        LIN                  7.9
    STD        LIN + CT             7.1
    STD        LHN + CT             6.6
    STD        LIN + LHN + CT       6.3
    IMP        No adaptation        6.5
    IMP        LHN + CT             5.6
    IMP        LIN + LHN + CT       5.0
17. Conclusions
- LHN adaptation outperforms LIN adaptation
  - Linear transformations at different levels produce different positive effects
  - LIN + LHN performs better than LHN alone
- In adaptation tasks with missing classes, Conservative Training
  - reduces the catastrophic forgetting effect, preserving the performance on another generic task
  - improves the performance in speaker adaptation with few available sentences
18. Weight merging
19. Conservative Training (CT)
- For each observation frame:
  1) Set the target value of each class that has no (or few) adaptation data to its posterior probability computed by the original network.
  2) Set to zero the target value of each class that has adaptation data but does not correspond to the input frame.
  3) Set the target value of the class corresponding to the input frame to 1 minus the sum of the posterior probabilities assigned according to rule 1).
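Written compactly (a restatement of rules 1)-3), with c(x) the class of the current input frame, A the set of classes with adaptation data, and p_orig the posteriors computed by the original network):

    t_i(x) =
    \begin{cases}
      p_{\mathrm{orig}}(i \mid x) & \text{if } i \notin A \\
      0 & \text{if } i \in A,\ i \neq c(x) \\
      1 - \sum_{m \notin A} p_{\mathrm{orig}}(m \mid x) & \text{if } i = c(x)
    \end{cases}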
20. Conclusions on LHN
- LHN outperforms LIN
  - Linear transformations at different levels produce different positive effects
  - LIN + LHN performs better than LHN alone
- For continuous speech, the wide-input IMP network is better than the STD one