HMM-based speech synthesis: the new generation of artificial voices

1
HMM-based speech synthesis: the new generation of artificial voices
  • Thomas Drugman
  • thomas.drugman_at_umons.ac.be

2
TCTS Lab
Laboratoire de Théorie des Circuits et de Traitement du Signal (Circuit Theory and Signal Processing Lab)
  • 25 people: 3 professors, 10 PhD students
  • Research areas: Image and Video, Numerical Arts, Audio and Speech
3
Content
  • Speech synthesis history
  • HMM-based speech synthesis
  • Parametric modeling of speech
  • Statistical generation
  • Conclusions

5
Speech Synthesis
Text-to-speech system (example input: "Hello")
Goal: read aloud any previously unseen text typed by the user
6
Challenges
  • Naturalness
  • Intelligibility
  • Cost-effectiveness
  • Expressivity

7
Challenge 3: Cost-effectiveness
  • Industry expects intelligibility and naturalness
  • Small footprint: a few megabytes
  • Small CPU requirements (embedded market)
  • Easy extension to other languages
  • Ability to create new voices as fast as possible
      • Through an automatic recording/segmentation process
      • Through efficient voice conversion
  • Ability to bootstrap an existing TTS voice into any voice

8
Challenge 4 (new): Expressivity
  • Emotional speech synthesis (?art!)
  • Being able to render an expressive voice
      • In terms of prosody
      • In terms of voice quality
  • Knowing when to do it (still unsolved)
  • Today's holy grail for the industry
  • Strategic advantage for whoever gets it first
  • New markets (e-books?)

9
Methods for Speech Synthesis
  • Expert-based (rule-based) approach
  • Corpus-based approach
      • Diphone concatenation
      • Unit selection
      • Statistical parametric synthesis (HMM-based synthesis)

10
Von Kempelen's talking machine (1791)
11
Homer Dudley's Voder (Bell Labs, 1936)
12
And other developments in articulatory synthesis
  • Work by K. Stevens, G. Fant, P. Mermelstein, R. Carré (GNUSpeech), S. Maeda, J. Schroeter and M. Sondhi
  • More recently: O. Engwall, S. Fels (ArtiSynth), Birkholz and Kröger, A. Alwan and S. Narayanan (MRI)

13
Rule-based synthesis
[Ratings on four criteria: intelligibility, naturalness, memory/CPU/voices, expressivity; the rating symbols are not preserved]
15
Diphone concatenation
[Ratings on four criteria: intelligibility, naturalness, memory/CPU/voices, expressivity; the rating symbols are not preserved]
16
Unit selection
[Ratings on four criteria: intelligibility, naturalness, memory/CPU/voices, expressivity; the rating symbols are not preserved]
18
Statistical Parametric Speech Synthesis
  • Training: database -> speech analysis -> speech parameters -> statistical modeling -> SPS synthesizer
  • Synthesis: text ("Hello!") -> statistical generation -> speech parameters -> speech processing -> speech ("Hello!")
19
HMM-based speech synthesis
http://hts.sp.nitech.ac.jp/
[Ratings on four criteria: intelligibility, naturalness, memory/CPU/voices, expressivity; the rating symbols are not preserved]
20
TRAINING OF THE HMM-BASED SYNTHESIZER
21
Parameter extraction
22
Parameter extraction
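[Figure: source-filter model. The excitation, a pulse train (voiced) or white noise (unvoiced), passes through a filter to produce synthetic speech.]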
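The pulse train, white noise, filter, and synthetic speech blocks of this slide can be read as a source-filter vocoder. Below is a minimal sketch of that idea, assuming a pulse train at the fundamental period for voiced frames, white noise for unvoiced frames, and a simple all-pole (LPC-like) filter. The function names, the 16 kHz sampling rate, and the single filter coefficient are illustrative assumptions, not values from the presentation.

```python
import random


def excitation(n_samples, f0, sample_rate=16000, seed=0):
    """Pulse train when f0 > 0 (voiced), white noise when f0 == 0 (unvoiced)."""
    if f0 > 0:
        period = int(round(sample_rate / f0))  # samples between two pulses
        return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def all_pole_filter(signal, a):
    """Compute y[n] = x[n] - sum_k a[k] * y[n-1-k], with a = [a1, a2, ...]."""
    out = []
    for n, x in enumerate(signal):
        y = x
        for k, coeff in enumerate(a):
            if n - 1 - k >= 0:
                y -= coeff * out[n - 1 - k]
        out.append(y)
    return out


# Hypothetical one-pole filter, used only to show the data flow.
voiced = all_pole_filter(excitation(400, f0=100.0), a=[-0.9])
unvoiced = all_pole_filter(excitation(400, f0=0.0), a=[-0.9])
```

In an actual synthesizer the filter coefficients would change from frame to frame and come from the spectral parameters; here they are fixed to keep the sketch short.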
23
Labels
24
Labels
Labels consist of a description of the phonetic environment
  • Contextual factors
  • Phone identity
  • Syntactic factors
  • Stress-related factors
  • Locational factors, ...
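To make the idea of a context-dependent label concrete, the sketch below builds one label per phone from a phone sequence, encoding the two left and two right neighbours plus the position in the utterance. The layout `LL^L-C+R=RR@position_total` is a hypothetical simplification chosen for illustration; labels used in practice carry many more contextual factors (syntactic, stress-related, and so on).

```python
def make_labels(phones, pad="sil"):
    """Return one label per phone, encoding its two left and two right neighbours."""
    padded = [pad, pad] + list(phones) + [pad, pad]
    labels = []
    for i in range(2, len(padded) - 2):
        ll, l, c, r, rr = padded[i - 2:i + 3]
        position = i - 1  # 1-based position of the phone in the utterance
        labels.append(f"{ll}^{l}-{c}+{r}={rr}@{position}_{len(phones)}")
    return labels


# Phone sequence for "hello", written with assumed phone symbols.
labels = make_labels(["hh", "ax", "l", "ow"])
```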

25
Labels
Example
26
HMM training
27
System architecture
Contextual factors may affect duration, source and filter differently
Context-oriented clustering using decision trees
28
System architecture
  • State duration model: decision tree for state duration
  • HMM for source and filter: decision trees for filter, decision trees for source
29
Training decision trees
An exhaustive list of possible questions is first drawn up
Example
QS "LL-Nasal" m,n,en,ng
QS "LL-Fricative" ch,dh,f,hh,hv,s,sh,th,v,z,zh
QS "LL-Liquid" el,hh,l,r,w,y
QS "LL-Front" ae,b,eh,em,f,ih,ix,iy,m,p,v,w
QS "LL-Central" ah,ao,axr,d,dh,dx,el,en,er,l,n,r,s,t,th,z,zh
QS "LL-Back" aa,ax,ch,g,hh,jh,k,ng,ow,sh,uh,uw,y
QS "LL-Front_Vowel" ae,eh,ey,ih,iy
QS "LL-Central_Vowel" aa,ah,ao,axr,er
QS "LL-Back_Vowel" ax,ow,uh,uw
QS "LL-Long_Vowel" ao,aw,el,em,en,en,iy,ow,uw
QS "LL-Short_Vowel" aa,ah,ax,ay,eh,ey,ih,ix,oy,uh
QS "LL-Dipthong_Vowel" aw,axr,ay,el,em,en,er,ey,oy
QS "LL-Front_Start_Vowel" aw,axr,er,ey
Total: about 1500 questions
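Each question is essentially a named set of phones tested against one slot of the label. The sketch below parses the simplified `QS "name" phone,phone,...` form listed on this slide and answers a question about the phone two positions to the left. The parser and the helper `answer` are assumptions made for illustration; question files used by actual toolkits rely on pattern syntax rather than plain phone lists.

```python
def parse_questions(text):
    """Parse lines of the form: QS "Name" p1,p2,... into {name: set_of_phones}."""
    questions = {}
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[0] != "QS":
            continue  # skip anything that is not a question line
        name = parts[1].strip('"')
        questions[name] = set(parts[2].split(","))
    return questions


QUESTIONS = parse_questions('''
QS "LL-Nasal" m,n,en,ng
QS "LL-Liquid" el,hh,l,r,w,y
QS "LL-Front_Vowel" ae,eh,ey,ih,iy
''')


def answer(question_name, left_left_phone):
    """True if the phone two positions to the left belongs to the question's set."""
    return left_left_phone in QUESTIONS[question_name]
```

For instance, `answer("LL-Nasal", "ng")` is true because `ng` appears in the nasal set, while `answer("LL-Nasal", "l")` is false.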
30
Training decision trees
Decision trees are trained using a maximum-likelihood criterion
Example
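As an illustration of the maximum-likelihood criterion, the sketch below picks, among candidate questions, the one whose yes/no split most increases the log-likelihood of the data when each side is modeled by its own Gaussian. It assumes one-dimensional observations and a single Gaussian per node; the question names, phone sets, and sample values are made up for the example.

```python
import math


def gaussian_loglik(values, var_floor=1e-6):
    """Log-likelihood of the values under a Gaussian fitted to them by ML."""
    n = len(values)
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, var_floor)
    sq_err = sum((v - mean) ** 2 for v in values)
    return -0.5 * n * math.log(2 * math.pi * var) - 0.5 * sq_err / var


def best_question(samples, questions):
    """samples: list of (left_phone, value); questions: {name: set_of_phones}.

    Returns (name, gain) for the split with the largest log-likelihood gain,
    or None if no question separates the samples.
    """
    parent = gaussian_loglik([v for _, v in samples])
    best = None
    for name, phones in questions.items():
        yes = [v for p, v in samples if p in phones]
        no = [v for p, v in samples if p not in phones]
        if not yes or not no:
            continue  # this question does not split the node
        gain = gaussian_loglik(yes) + gaussian_loglik(no) - parent
        if best is None or gain > best[1]:
            best = (name, gain)
    return best


samples = [("m", 5.0), ("m", 5.2), ("n", 4.9), ("n", 5.1),
           ("s", 1.0), ("s", 1.1), ("f", 0.9), ("f", 1.2)]
questions = {"L-Nasal": {"m", "n", "ng"}, "L-Labial": {"m", "f"}}
chosen = best_question(samples, questions)
```

Growing a full tree would repeat this selection recursively on the "yes" and "no" subsets until a stopping rule is met; the stopping rule is not specified on the slide and is left out here.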
31
Emission likelihood and training
Finally, each leaf is modeled by a Gaussian Mixture Model (GMM)
Training relies on the Viterbi and Baum-Welch re-estimation algorithms
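The emission likelihood of a leaf can be sketched as follows, assuming a one-dimensional observation and a mixture given as parallel lists of weights, means, and variances. The function names are illustrative; the log-sum-exp step is a standard way to keep the computation numerically stable.

```python
import math


def log_gauss(x, mean, var):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of x under a 1-D GMM, computed with log-sum-exp."""
    logs = [math.log(w) + log_gauss(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))


# Hypothetical two-component leaf model.
score = gmm_loglik(0.3, weights=[0.7, 0.3], means=[0.0, 2.0], variances=[1.0, 0.5])
```

Baum-Welch re-estimation would use such likelihoods to compute state occupancies and update the weights, means, and variances; that loop is beyond this sketch.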
32
SYNTHESIS BY THE HMM-BASED SYNTHESIZER
33
Text analysis
34
Parameter generation
35
Parameter generation
Given the sequence of labels, state durations are determined by maximizing the likelihood of the state sequence.
The trajectory through the context-dependent HMM states is then known.
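A minimal sketch of this duration step, assuming each state duration is modeled by a Gaussian (a common choice in the HMM-based synthesis literature, not stated on the slide). Under that assumption the most likely duration of a state is its mean, and an optional total length can be imposed by shifting each duration in proportion to its variance. All names and numbers are illustrative.

```python
def state_durations(means, variances, total_frames=None):
    """Most likely duration per state: d_k = mean_k + rho * var_k.

    With total_frames=None, rho is 0 and each duration equals its mean.
    Otherwise rho is chosen so that the durations sum to total_frames.
    """
    rho = 0.0
    if total_frames is not None:
        rho = (total_frames - sum(means)) / sum(variances)
    return [m + rho * v for m, v in zip(means, variances)]


def state_trajectory(durations):
    """Repeat each state index for its rounded duration (at least one frame)."""
    path = []
    for state, d in enumerate(durations):
        path.extend([state] * max(1, round(d)))
    return path


natural = state_durations([5.0, 10.0, 5.0], [1.0, 4.0, 1.0])
slower = state_durations([5.0, 10.0, 5.0], [1.0, 4.0, 1.0], total_frames=26)
path = state_trajectory(natural)
```

States with a larger duration variance absorb more of the stretching, which is why the middle state grows the most in the `slower` case.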
36
Parameter generation
Using this trajectory, source and filter parameters are generated by maximizing the output probability.
Taking dynamic features into account makes the parameter evolution smoother and more realistic.
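The effect of dynamic features can be sketched with a small maximum-likelihood parameter generation example. It assumes a one-dimensional parameter, per-frame Gaussian means and variances for the static value and its delta, and a delta defined as 0.5 * (c[t+1] - c[t-1]) with clamped edges. Under these assumptions the most likely static trajectory c solves the linear system (W' P W) c = W' P mu, where W stacks the static and delta windows and P holds the precisions. The helper names and the plain Gaussian elimination are illustrative choices, not the presenter's implementation.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def mlpg(static_mean, static_var, delta_mean, delta_var):
    """Most likely static trajectory given static and delta Gaussians per frame."""
    T = len(static_mean)
    rows, targets, precisions = [], [], []
    for t in range(T):  # static rows: c[t]
        row = [0.0] * T
        row[t] = 1.0
        rows.append(row)
        targets.append(static_mean[t])
        precisions.append(1.0 / static_var[t])
    for t in range(T):  # delta rows: 0.5 * (c[t+1] - c[t-1]), edges clamped
        row = [0.0] * T
        row[min(t + 1, T - 1)] += 0.5
        row[max(t - 1, 0)] -= 0.5
        rows.append(row)
        targets.append(delta_mean[t])
        precisions.append(1.0 / delta_var[t])
    # Normal equations: (W' P W) c = W' P mu
    A = [[sum(p * r[i] * r[j] for r, p in zip(rows, precisions))
          for j in range(T)] for i in range(T)]
    b = [sum(p * r[i] * m for r, p, m in zip(rows, precisions, targets))
         for i in range(T)]
    return solve(A, b)


# Step-like static means: without delta constraints the output would jump.
means = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
smooth = mlpg(means, [1.0] * 6, [0.0] * 6, [0.01] * 6)
```

With a very large delta variance the delta constraint has almost no weight and the result follows the static means frame by frame; a small delta variance, as above, spreads the step over neighbouring frames.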
37
Speech synthesizers comparison
38
Speech synthesizers comparison
[Figure: quality versus footprint for Unit Selection, HTS, and Diphone Concatenation; footprint axis marked <1 MB, 5 MB, and 200 MB]
40
Problem positioning
Parametric speech synthesizers generally suffer from a typical buzziness, as encountered in LPC-like vocoders
Source-filter approach: enhance the excitation signal
[Figure: source-filter model. The excitation, a pulse train or white noise, passes through a filter to produce synthetic speech.]
41
Proposed solution
[Figure: block diagram of the proposed SOURCE and FILTER stages]
T. Drugman, G. Wilfart, T. Dutoit, "A Deterministic plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis", Interspeech 2009
42
Results
Traditional vs. proposed
44
Problem of oversmoothing
45
Compensation of oversmoothing
46
Global Variance
47
Global Variance
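The global variance of a generated trajectory is its variance over the whole utterance, and oversmoothing shows up as a value that is too small. The sketch below is a deliberately simplified stand-in: it rescales a trajectory around its mean so that its global variance matches a target value. This is plain variance scaling, not the generation algorithm that includes a global-variance likelihood term, and the target value is a made-up number.

```python
def global_variance(traj):
    """Variance of a 1-D parameter trajectory over the whole utterance."""
    mean = sum(traj) / len(traj)
    return sum((x - mean) ** 2 for x in traj) / len(traj)


def match_global_variance(traj, target_gv):
    """Rescale traj around its mean so that its global variance equals target_gv."""
    mean = sum(traj) / len(traj)
    gv = global_variance(traj)
    if gv == 0.0:
        return list(traj)  # a flat trajectory cannot be rescaled
    scale = (target_gv / gv) ** 0.5
    return [mean + scale * (x - mean) for x in traj]


oversmoothed = [0.4, 0.5, 0.6, 0.5]
widened = match_global_variance(oversmoothed, target_gv=0.04)
```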
48
Results
50
Speech synthesizers comparison
  • Rule-based synthesis
  • Diphone concatenation
  • Unit selection
  • HMM-based speech synthesis
[Each method is rated on four criteria: intelligibility, naturalness, memory/CPU/voices, expressivity; the rating symbols are not preserved]
52
Future Work
  • Voice Conversion
  • Expressive/emotional synthesis
  • Better parametric representation
  • Real-time speech synthesis

53
Questions?