Title: Dr. O. Dakkak
1Prosodic Feature Introduction and Emotion Incorporation in an Arabic TTS
Presented by Dr. O. Al Dakkak
- Dr. O. Dakkak, Dr. N. Ghneim (HIAST)
- M. Abu-Zleikha, S. Al-Moubyed (IT fac., Damascus U.)
2Outline
- Arabic TTS
- Why Prosody generation?
- Prosody Analysis and Rule Extraction
- Emotion Inclusion
- Results
- Conclusion
3Arabic Text-to-Speech System
- Arabic Text-to-Phonemes (ATOPH), including open /E/, /O/ phonemes and emphatic vowels
- Use of MBROLA diphone units to synthesize speech until our semi-syllables are ready (the corpus is currently being recorded)
- Prosody Generation and Emotion Inclusion
4Arabic Text-to-Speech System
- MBROLA synthesizes phonemes with control over duration and the F0 contour (a set of segments); we implemented a tool to control the amplitude.
- Absent phonemes are replaced by the nearest present phonemes
- Possibility to generate and test prosody
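As an illustration of the duration and F0 control described above, here is a minimal Python sketch that writes lines in MBROLA's .pho input format (phoneme name, duration in ms, then optional pairs of position in percent and F0 in Hz). The function names, the phoneme labels and the fallback table are hypothetical; the actual substitutions used in the system are not given in these slides.

```python
def pho_line(phoneme, duration_ms, pitch_points=()):
    """Format one .pho line: phoneme, duration (ms), then (position %, F0 Hz) pairs."""
    parts = [phoneme, str(int(round(duration_ms)))]
    for pos_pct, f0_hz in pitch_points:
        parts += [str(int(round(pos_pct))), str(int(round(f0_hz)))]
    return " ".join(parts)


def substitute(phoneme, available, nearest):
    """Keep the phoneme if the voice has it, else use its listed nearest neighbour."""
    if phoneme in available:
        return phoneme
    return nearest[phoneme]


def to_pho(segments, available, nearest):
    """segments: iterable of (phoneme, duration_ms, pitch_points)."""
    lines = []
    for phoneme, duration_ms, pitch_points in segments:
        name = substitute(phoneme, available, nearest)
        lines.append(pho_line(name, duration_ms, pitch_points))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical inventory and fallback table, for illustration only.
    available = {"_", "a", "b", "i"}
    nearest = {"E": "i"}
    segments = [
        ("_", 100, []),
        ("b", 60, []),
        ("E", 90, [(0, 120), (100, 110)]),  # falling F0 over the vowel
        ("_", 100, []),
    ]
    print(to_pho(segments, available, nearest))
```

Keeping the substitution in a lookup table makes it easy to revise the "nearest phoneme" choices by ear without touching the formatting code.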
5Why Prosody Generation?
- Increases intelligibility and expressiveness.
- Provides the context in which speech is interpreted
- Signals speaker intentions (special aids)
- Man-machine communication (airports, ...)
- Dubbing
6Methodology
- Based on the punctuation marks (",", ".", "?" and "!") we classify sentences into continuous affirmation, long affirmation, interrogative and exclamation, respectively.
- Recording a corpus and analysis of its sentences to produce F0 and intensity curves
- Statistical study of the curves and rule extraction to generate them automatically.
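The punctuation-based classification in the first step can be sketched in a few lines of Python. The mapping follows the slide; the function name and the handling of the Arabic comma and question mark are assumptions added for illustration.

```python
# Sentence type by final punctuation mark, as listed on the slide.
SENTENCE_TYPES = {
    ",": "continuous affirmation",
    ".": "long affirmation",
    "?": "interrogative",
    "!": "exclamation",
}

# Assumption: Arabic text may use its own comma and question mark.
ARABIC_EQUIVALENTS = {"\u060C": ",", "\u061F": "?"}


def classify(sentence):
    """Return the sentence type from its last character, or None if unmarked."""
    text = sentence.strip()
    if not text:
        return None
    last = ARABIC_EQUIVALENTS.get(text[-1], text[-1])
    return SENTENCE_TYPES.get(last)


if __name__ == "__main__":
    for s in ["Is it my fault to bear it?", "What a scary scene!", "First clause,"]:
        print(s, "->", classify(s))
```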
7The corpus
- Use of a pre-recorded corpus of 12 short sentences for each type, 5 speakers (4 male, 1 female). Each sentence has 14 phonemes at most.
- Recording of 10 further sentences of variable lengths, pronounced by 3 speakers:
- short: 4-20 phonemes
- medium: 20-40 phonemes
- long: more than 40 phonemes
- The F0 and intensity curves were available for the pre-recorded corpus and were computed for the further set of recordings.
8Rules Extraction
- Re-definition of the length concept, using fuzzy
sets
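A fuzzy definition of length lets a sentence near a boundary (for example 20 phonemes) belong partly to "short" and partly to "medium". The sketch below is one possible formulation in Python; the linear ramps and their breakpoints (15-25 and 35-45 phonemes) are assumptions chosen around the 20 and 40 phoneme boundaries of the corpus, not the membership functions actually used in this work.

```python
def ramp_down(x, a, b):
    """Membership 1 up to a, 0 from b onward, linear in between."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)


def ramp_up(x, a, b):
    """Complement of ramp_down: 0 up to a, 1 from b onward."""
    return 1.0 - ramp_down(x, a, b)


def length_memberships(n_phonemes, low=(15, 25), high=(35, 45)):
    """Degrees of membership in the short, medium and long sets."""
    return {
        "short": ramp_down(n_phonemes, *low),
        "medium": min(ramp_up(n_phonemes, *low), ramp_down(n_phonemes, *high)),
        "long": ramp_up(n_phonemes, *high),
    }


if __name__ == "__main__":
    for n in (10, 20, 30, 50):
        print(n, length_memberships(n))
```

With these memberships, a prosody rule extracted for short sentences and one extracted for medium sentences can be blended in proportion to the two degrees instead of switching abruptly at 20 phonemes.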
9Rules Extraction
- Curve stylization after stochastic analysis (example shown as a figure)
10Emotion Inclusion
- Recording a corpus of sentences in 5 different emotions (joy, anger, sadness, fear and surprise) with their emotionless versions (20 sentences per emotion).
- Measurement of the prosodic features F0, duration and intensity, with their variations (Praat).
- Extraction of rules to automatically produce emotion in synthetic speech.
- Rule validation.
11[Arabic text not recoverable] "Is it my fault to bear it?"
- Range: difference between F0max and F0min
- F0 average: mean value
- Jitter: irregularities between successive glottal pulses
- Pitch variation: variation of F0
- Variability: degree of it (high, low, ...)
- Contour slope: shape of the contour slope (range variation)
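Three of the measures above (range, average and jitter) can be computed directly from an F0 track and a list of glottal periods. The Python sketch below uses a common definition of local jitter (mean absolute difference between consecutive periods, divided by the mean period); whether this matches the exact Praat settings used in this work is an assumption.

```python
def voiced(f0_track):
    """Keep voiced frames only (unvoiced frames are assumed to be 0 or None)."""
    return [f for f in f0_track if f]


def f0_range(f0_track):
    """Difference between F0max and F0min, in Hz."""
    values = voiced(f0_track)
    return max(values) - min(values)


def f0_mean(f0_track):
    """Mean F0 over voiced frames, in Hz."""
    values = voiced(f0_track)
    return sum(values) / len(values)


def local_jitter(periods):
    """Mean absolute difference of consecutive periods over the mean period."""
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods) / len(periods)
    return mean_diff / mean_period


if __name__ == "__main__":
    # Illustrative values only, not measurements from the corpus.
    track = [0, 100.0, 120.0, 110.0, 0]
    periods_ms = [10.0, 12.0, 10.0]
    print(f0_range(track), f0_mean(track), local_jitter(periods_ms))
```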
12 Example Anger emotion
- F0 mean 40-75
- F0 range 50-100
- F0 at vowels and semi-vowels 30
- F0 slope
- Speech rate
- Silence rate -
- Duration of vowels and semi-vowels
- Intensity mean
- Intensity monotonous with F0
- Others: F0 variability, F0 jitter
13Analysis and Rule Extraction: Anger
(Figure panels: emotionless; with emotion)
14Emotion Synthesis Anger
- F0 mean 30
- F0 range 30
- F0 at vowels and semi-vowels 100
- Speech rate 75-80
- Duration of vowels and semi-vowels 30
- Duration of fricatives 20
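A rule set like the one above can be applied to an emotionless phoneme sequence before it is passed to the synthesizer. The Python sketch below shows one way to do this; because the units and directions of the slide's values did not survive, the rule names, the percentages in the example and the phoneme classes are all hypothetical and serve only to illustrate the mechanism.

```python
# Hypothetical phoneme classes, for illustration only.
VOWELS_AND_SEMIVOWELS = {"a", "i", "u", "w", "j"}
FRICATIVES = {"s", "f", "z"}


def scale(value, percent):
    """Change value by a signed percentage (e.g. +20 or -10)."""
    return value * (100 + percent) / 100


def apply_emotion(segments, rules):
    """segments: list of (phoneme, duration_ms, f0_hz or None for unvoiced)."""
    result = []
    for phoneme, duration_ms, f0_hz in segments:
        if f0_hz is not None:
            f0_hz = scale(f0_hz, rules.get("f0_mean_pct", 0))
        if phoneme in VOWELS_AND_SEMIVOWELS:
            duration_ms = scale(duration_ms, rules.get("vowel_duration_pct", 0))
            if f0_hz is not None:
                f0_hz = scale(f0_hz, rules.get("vowel_f0_pct", 0))
        elif phoneme in FRICATIVES:
            duration_ms = scale(duration_ms, rules.get("fricative_duration_pct", 0))
        result.append((phoneme, duration_ms, f0_hz))
    return result


if __name__ == "__main__":
    # Hypothetical rule values, not the ones extracted in this work.
    rules = {"f0_mean_pct": 10, "vowel_f0_pct": 20,
             "vowel_duration_pct": -10, "fricative_duration_pct": 50}
    neutral = [("s", 80, None), ("a", 100, 100.0), ("b", 60, 100.0)]
    for segment in apply_emotion(neutral, rules):
        print(segment)
```

Expressing each rule as a signed percentage keeps the rule table readable and lets the same function serve every emotion by swapping the dictionary.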
15Synthetic examples
- Each sentence: emotionless vs. with emotion
- Anger: "Who do you think you are?"
- Joy: "No more clouds in the sky"
- Sadness: "I'm so sad today"
- Fear: "What a scary scene!"
- Surprise: "What a beautiful scene!"
16EmoGen
- Interface: Text Editor
- Voice
- Input Text
- Speech and emotion properties
- MBROLA Player interface
- Normal text to MBROLA text Converter (NTMTC)
- Prosody Generator
- Emotion Generator
17Results
- Five sentences for each emotion were synthesized and listened to by 10 people.
- Each listener gives the perceived emotion for each sentence (we don't provide our list of emotions)
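Responses from such a listening test can be tallied into a recognition rate per intended emotion. The Python sketch below assumes each answer has already been mapped to one emotion label, which is a non-trivial step when listeners answer freely; the function name and the sample responses are placeholders, not data from this evaluation.

```python
from collections import Counter


def recognition_rates(responses):
    """responses: iterable of (intended, perceived). Returns {intended: fraction correct}."""
    totals = Counter()
    hits = Counter()
    for intended, perceived in responses:
        totals[intended] += 1
        if perceived.strip().lower() == intended:
            hits[intended] += 1
    return {emotion: hits[emotion] / totals[emotion] for emotion in totals}


if __name__ == "__main__":
    # Placeholder answers for illustration only.
    sample = [("anger", "Anger"), ("anger", "joy"), ("joy", "joy")]
    print(recognition_rates(sample))
```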
18Results
19Conclusion
- An automated tool for emotional Arabic synthesis has been developed
- The prosodic model proposed and tested in this work proved successful, especially in conversational contexts
- Further work will follow to include other emotions (disgust, annoyance, ...)