VoiceXML: SSML Speech Synthesis Markup Language Recorded speech and audio

Provided by: Michael2145

1
VoiceXML: SSML (Speech Synthesis Markup Language), recorded speech and audio
2
Acknowledgements

Prof. McTear, Natural Language Processing,
http://www.infj.ulst.ac.uk/nlp/index.html,
University of Ulster.

3
Overview

Speech Synthesis Markup Language (SSML)
Phases of Text to Speech Synthesis
    Structure analysis
    Text normalisation
    Text to phoneme conversion
    Prosody analysis
    Waveform production
Recorded speech

4
SSML

Speech Synthesis Markup Language
Enables developers to override the speech synthesiser's default behaviour at each stage of text-to-speech processing
Stages
    Structure analysis
    Text normalisation
    Text to phoneme conversion
    Prosody analysis
    Waveform production

5
Structure Analysis

Division of text into basic elements, e.g. sentence and paragraph, to support more natural phrasing
<s> - sentence
<p> - paragraph
Structure is inferred from punctuation and formatting, but abbreviations that end in a full stop can mislead sentence detection:
Dr. Lewis works at the clinic on Sunset Dr. in western Portland.
Dr. Smith lives at 214 Elm Dr. He weighs 214 lb. He plays bass guitar. He also likes to fish; last week he caught a 20 lb. bass.
Explicit markup removes the ambiguity:
<p>
    <s>Dr. Smith lives at 214 Elm Dr.</s>
    <s>He weighs 214 lb.</s>
    <s>He plays bass guitar.</s>
    <s>He also likes to fish; last week he caught a 20 lb. bass.</s>
</p>

6
Text Normalisation

Annotation of text so that it is spoken correctly
Ambiguous examples:
1/2 may be spoken as "half", "January second", "February first", or "one of two".
Dr. may be "doctor" or "drive", e.g. "Dr. John Dr." is read as "Doctor John Drive".
St. may be "saint" or "street", e.g. "St. John St." is read as "Saint John Street".
Acronyms: some, e.g. ACM or IEEE, should be spelled out; others, e.g. RAM and ROM, are pronounced as words.
Email addresses, e.g. catazman@bee.com
First part: Cat Azman, C. A. Tazman, or C. Atazman?
Last part: "Bee dot com" or "B. E. E. dot com"?
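
One way to settle ambiguous cases like these is for the author to state the spoken form explicitly. The following is a minimal sketch, assuming a VoiceXML 2.0 platform that supports the <sub> element described on the next slide. The spoken forms chosen here, including the reading of the email address, are only one possible interpretation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <prompt>
        <!-- The author, not the TTS engine, decides how each abbreviation is read -->
        <sub alias="Doctor">Dr.</sub> John <sub alias="Drive">Dr.</sub>
        <sub alias="Saint">St.</sub> John <sub alias="Street">St.</sub>
        <!-- One possible reading of the email address (an assumption) -->
        <sub alias="cat azman at bee dot com">catazman@bee.com</sub>
      </prompt>
    </block>
  </form>
</vxml>
```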

7
<sub>

New in VoiceXML 2.0 (Speech Synthesis Markup).
Syntax:
<sub alias="substituteText">OriginalText</sub>
Description: Language element whose alias attribute provides substitute text to be spoken instead of the contained text. This allows the document to contain both a written and a spoken form for a string.

8
<sub>

<sub alias="doctor">Dr.</sub>
Smith lives at
<sub alias="two fourteen">214</sub>
Elm <sub alias="drive">Dr.</sub>
He weighs <sub alias="two hundred and fourteen">214</sub>
<sub alias="pounds">lb.</sub>
He plays bass guitar.
He also likes to fish; last week he caught a <sub alias="twenty">20</sub> <sub alias="pound">lb.</sub> bass.

<sub alias="doctor">Dr.</sub>
Smith lives at 214 Elm
<sub alias="drive">Dr.</sub>
He weighs 214 <sub alias="pounds">lb.</sub>
He plays bass guitar.
He also likes to fish; last week he caught a 20 <sub alias="pound">lb.</sub> bass.

9
<say-as>

Speak enclosed text in the given style
Implemented (with limitations) in some platforms
Example: numbers
Contained text can be interpreted as a number. The allowed number formats are ordinal, cardinal, and digits.
<say-as type="number:ordinal">12</say-as> is spoken as "twelfth".
<say-as type="number:digits">12</say-as> is spoken as "one two".
Other types: acronyms, currency, time, date, duration, measures, telephone, spell-out, names, and net.
BeVocal provides a set of extended tags for items such as: airline, equity, street, city, state, citystate, address
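
The slide uses the draft-era type attribute. The final SSML 1.0 Recommendation replaced it with interpret-as (plus optional format and detail attributes), which is also the form used in the exercise at the end of this presentation. The sketch below shows both spellings side by side for comparison only. A given platform will normally accept just one of them, and the set of supported values is platform-specific, so treat the values here as assumptions to check against the vendor documentation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <prompt>
        <!-- Draft-style syntax, as shown on this slide -->
        You are <say-as type="number:ordinal">12</say-as> in the queue.
        <!-- SSML 1.0 Recommendation syntax; supported values vary by platform -->
        Your code is <say-as interpret-as="digits">12</say-as>.
      </prompt>
    </block>
  </form>
</vxml>
```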

10
Text to phoneme conversion

Specify the pronunciation of words that are difficult to pronounce or have more than one pronunciation, e.g.
read: "reed" / "red"
wind: Wind the watch when you face into the wind
<phoneme> - uses the standard phonetic alphabet, the International Phonetic Alphabet (IPA).
He plays <phoneme alphabet="ipa" ph="U0062 U0258 U0073">bass</phoneme> guitar.
He also likes to fish; last week he caught a <sub alias="twenty">20</sub> <sub alias="pound">lb.</sub> <phoneme alphabet="ipa" ph="U0062 U00E6 U0073">bass</phoneme>.

The ph values are given as Unicode numbers.
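
In an actual XML document the Unicode numbers are usually written either as the IPA characters themselves or as XML character references. A minimal sketch of the second option, using the same three code points per word as the slide (the surrounding sentences are shortened for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <prompt>
        <!-- b, U+0258, s: the pronunciation the slide gives for the instrument -->
        He plays <phoneme alphabet="ipa" ph="b&#x0258;s">bass</phoneme> guitar.
        <!-- b, U+00E6, s: the pronunciation the slide gives for the fish -->
        He caught a <phoneme alphabet="ipa" ph="b&#x00E6;s">bass</phoneme>.
      </prompt>
    </block>
  </form>
</vxml>
```

The plain letters b and s need no escaping, since their IPA symbols are the same ASCII characters.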
11
Attributes of <phoneme>

alphabet: The phonetic alphabet used to specify the pronunciation of the word contained in the <phoneme> element. The only valid values for this attribute are "ipa" and vendor-defined strings of the form "x-organization" or "x-organization-alphabet".
ph: The phonetic spelling of this word, expressed using the alphabet.
Using the IPA requires some linguistic training.
For an excellent tutorial on the IPA symbols and sounds, see http://www.unil.ch/ling/english/phonetique/table-eng.html.
For an overview of the IPA and a full chart of symbols, see http://www.arts.gla.ac.uk/IPA/ipa.html.
The sounds used in English and their IPA symbols are illustrated in http://www.antimoon.com/how/pronunc-soundsipa.htm. You can hear each sound by clicking the word that contains the sound.
To identify the corresponding Unicode number, go to http://web.uvic.ca/ling/resources/ipa/charts/unicode_intro.htm, move the cursor above the IPA symbol, and the Unicode value will appear.

12
Prosody analysis

Prosody covers pitch (intonation or melody), timing (rhythm), pauses, speech rate, emphasis on words, and the relative timing of segments and pauses.
Most TTS engines have a prosody analysis algorithm responsible for producing the prosody of synthesized speech, which is often based on parts of speech. For example, nouns, verbs, and adjectives may be accented, whereas auxiliary verbs and prepositions may be de-stressed.
The synthesized speech pauses at commas and is inflected according to whether the sentence is declarative, interrogative, or exclamatory.
Prosody rules and algorithms are not perfect and are a topic of ongoing research. Prosody rules may differ considerably between languages and between varieties of one language. For example, the prosody of American, British, Indian, and Jamaican English differs.

13
<prosody> pitch

Refers to the highness or lowness of speech (currently not implemented in BeVocal Cafe)
Measured by the frequency (Hz, vibrations per second) of the sound
Can be specified with:
A number followed by Hz
A relative change expressed as a percentage, for example "+18.2%" or "-10.3%"
A relative change as a relative number, for example "+10" or "-8.7"
One of the following words: "x-high", "high", "medium", "low", "x-low", or "default"
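
A short sketch of three of the forms listed above, assuming a platform that implements the pitch attribute (the slide notes that BeVocal Cafe did not). The prompt wording and the value 180Hz are illustrative only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <prompt>
        <!-- Absolute value: a number followed by Hz -->
        <prosody pitch="180Hz">This is spoken at a fixed pitch.</prosody>
        <!-- Relative change expressed as a percentage -->
        <prosody pitch="+18.2%">This is spoken somewhat higher.</prosody>
        <!-- One of the predefined words -->
        <prosody pitch="x-low">This is spoken very low.</prosody>
      </prompt>
    </block>
  </form>
</vxml>
```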

14
<prosody> range

Range - specifies the variability of the pitch (currently not implemented in BeVocal Cafe).
Specified using the same options as pitch, e.g.
<prosody pitch="medium" range="x-low">

15
<prosody> contour

Describes the actual pitch contour for the text (currently not implemented in BeVocal Cafe).
A set of time segments, with a target pitch specified for each time segment.
Each time segment is defined as a percentage of the total time for speaking the contained text, e.g. (25%, 25%, 25%, 25%) would speak the contained text in four equal segments.
An interpolation algorithm smooths the transitions between the time segments. For example, a contour can be used to describe the increase in pitch at the end of a question as follows:
<prosody contour="(90%, medium) (10%, high)">You said what?</prosody>

16
<prosody> rate, duration

Rate. The speaking rate, expressed in words per minute (currently not implemented in BeVocal Cafe), specified using any of the following:
A number
A relative change expressed as a percentage, for example "+18.2%" or "-10.3%"
A relative change as a relative number, for example "+10" or "-8.7"
One of the following words: "x-fast", "fast", "medium", "slow", "x-slow", or "default"
The student's name is <prosody rate="-10">John Scott</prosody>
Duration. A value in seconds or milliseconds for the desired time to read the element contents, e.g.
<prosody duration="10s">
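
The rate and duration attributes can be sketched inside a complete prompt as follows. This assumes a platform that implements both attributes. The prompt text and the 4s value are illustrative, and how a platform reconciles rate and duration when both are given is not covered here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <prompt>
        <!-- Slow down for the part the caller needs to catch -->
        The student's name is <prosody rate="slow">John Scott</prosody>.
        <!-- Relative change expressed as a percentage -->
        <prosody rate="-10%">Please write this down.</prosody>
        <!-- Ask for the contents to be read in roughly four seconds -->
        <prosody duration="4s">Thank you for calling the Student System.</prosody>
      </prompt>
    </block>
  </form>
</vxml>
```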

17
<prosody> volume

Volume. Specifies how loudly or quietly the words are spoken, specified by:
A number in the range from 0.0 to 100.0
A relative change expressed as a percentage, for example "+18.2%" or "-10.3%"
A relative change as a relative number, for example "+10" or "-8.7"
One of the following words: "loud", "medium", "soft", "low", "x-soft", or "silent"
<prosody volume="loud">text to be spoken</prosody>

18
<emphasis>

Formerly <emph>
level values: strong, moderate, none, and reduced.
none is used to prevent the speech synthesis processor from emphasizing words that it might typically emphasize
<emphasis level="strong">help</emphasis>

19
<break>

Specifies when to insert silence (a pause) in the text
strength - the strength of the prosodic break. Values are "none", "x-small", "small", "medium" (the default value), "large", or "x-large"
time - e.g. "250ms", "3s".
Welcome to the Student System
<break time="250ms"/>
Please say one of the following

20
Waveform Production

Process of converting a textual representation to acoustic signals that humans hear and interpret as human-like speech.
<voice> - uses a different voice from the default specified for TTS
<voice age="3" gender="female">text to speak</voice>
<audio> - specifies what audio to present to the user
<desc> - specifies text-only output describing the audio output (e.g. dog barking)
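
How the three elements can combine in one prompt is sketched below, assuming a platform that supports <voice> and <desc>. The file name dog.wav and the prompt wording are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <prompt>
        <!-- Switch away from the default TTS voice for this sentence -->
        <voice gender="female">Listen to this sound.</voice>
        <!-- dog.wav is a hypothetical file; desc describes it for text-only output -->
        <audio src="dog.wav">
          <desc>dog barking</desc>
          Woof!
        </audio>
      </prompt>
    </block>
  </form>
</vxml>
```

The text after <desc> is spoken only if the audio file cannot be played.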

21
Other SSML elements

<speak> - defines a container for a speech synthesis document
Not required when SSML tags are used in PCDATA within VoiceXML.
<lexicon> - specifies a pronunciation lexicon document which the speech synthesis engine uses to generate the pronunciation of words.
Format not yet defined; see the documentation of the VoiceXML browser vendor.
<mark> - places a marker into the text to be processed by the speech synthesis engine, e.g. <mark name="pause"/>. When encountered, the speech synthesis pauses and throws an event referencing the marker name. A built-in event handler processes the event and causes the speech synthesis engine to resume.
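
For comparison, a standalone SSML document (outside VoiceXML) does need the <speak> root element. A minimal sketch using the SSML 1.0 namespace; the marker name menu_start and the xml:lang value are illustrative choices:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-GB">
  <p>
    <s>Welcome to the Student System.</s>
    <!-- The processor reports this marker name when it reaches this point -->
    <mark name="menu_start"/>
    <s>Please say one of the following.</s>
  </p>
</speak>
```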

22
<audio> playing prerecorded audio files

Output can consist of a combination of prerecorded files, audio streams, or synthesised speech, e.g.
<prompt>Welcome to the Student System <audio src="AudioSample.wav"/> How can I help you?</prompt>
<audio> can have alternative content in case the audio sample is not available, e.g.
<audio src="welcome.wav">Welcome to the Student System</audio>

23
Recording speech input using <record>

<record> is a form element similar to <field>
It is used to collect a recording from the user that can be played back or submitted to a server
It has a <prompt> element and can have a <filled> element
It can have a grammar for a spoken command to terminate the recording

24
Attributes of <record>

name - The name of a variable that holds the value of the recorded item.
expr - The initial value of the recorded item variable.
beep - There are two possible values, beep="true" and beep="false". If true, a beep tone is presented to the user just before the recording begins. The default is false.
maxtime - The maximum duration of the recording, beginning when the recording starts. For example, maxtime="10s", where "10s" means 10 seconds.
finalsilence - The interval of silence indicating the end of speech. For example, finalsilence="3s" (not implemented in IBM Voice Server SDK).
dtmfterm - There are two possible values, dtmfterm="true" and dtmfterm="false". If true, then any DTMF key press not matched by an active grammar will terminate the input. The default is true.
type - Media format of the resulting recording. A media type is written in the form type/subtype. For audio files, the type is always audio.

25
Example using <record>

<form>
  <record name="msg" beep="true" maxtime="5s" finalsilence="5000ms" dtmfterm="true" type="audio/x-wav">
    <prompt timeout="5s">
      Record your message after the beep.
    </prompt>
  </record>
  <filled>
    <!-- when recording is completed, replay recorded message -->
    <prompt>You said <audio expr="msg"/></prompt>
  </filled>
</form>

26
Submitting recording to the server

In this example, a recording has been stored in the variable msg and the system asks the user to confirm whether to keep it
<field name="confirm" type="boolean">
  <prompt>Your message is <audio expr="msg"/>.</prompt>
  <prompt>To keep it, say yes. To discard it, say no.</prompt>
  <filled>
    <if cond="confirm">
      <submit next="save_message.jsp" enctype="multipart/form-data" method="post" namelist="msg"/>
    </if>
    <clear/>
  </filled>
</field>

27
<record> shadow variables (1)

NB: name represents the name of the form item variable
name$.duration - The duration of the recording in milliseconds
name$.size - The size of the recording in bytes
name$.termchar - The DTMF key used by the caller to terminate the recording. This variable is undefined if a key was not used to terminate the recording.
name$.maxtime - true indicates the recording was terminated because the maxtime duration was reached. false indicates the recording was not terminated due to maxtime.

28
<record> shadow variables (2)

name$.utterance - The string of words spoken by the user if the recording was terminated by speech recognition input. This shadow variable is undefined if the recording was not terminated by speech recognition input.
name$.confidence - The confidence level (0.0 to 1.0) if the recording was terminated by speech. This shadow variable is undefined if the recording was not terminated by speech recognition input. The confidence level refers to the speech recognizer's estimate of the accuracy of its results, in this case the accuracy of the contents of name$.utterance.
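
In VoiceXML 2.0, shadow variables are read by appending $ to the form item name, so a <record> named msg exposes msg$.duration, msg$.size and so on. The sketch below reads some of them back to the caller; it assumes a platform that populates these variables, and the prompt wording is illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <record name="msg" beep="true" maxtime="10s" dtmfterm="true">
      <prompt>Record your message after the beep.</prompt>
      <filled>
        <prompt>
          Your recording is <value expr="msg$.duration"/> milliseconds long
          and <value expr="msg$.size"/> bytes in size.
        </prompt>
        <!-- true only if the 10 second limit cut the recording off -->
        <if cond="msg$.maxtime">
          <prompt>The maximum recording time was reached.</prompt>
        </if>
        <!-- termchar is undefined unless a DTMF key ended the recording -->
        <if cond="typeof msg$.termchar != 'undefined'">
          <prompt>You ended the recording with a key press.</prompt>
        </if>
      </filled>
    </record>
  </form>
</vxml>
```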

29
Dealing with user hang up during recording

When a user hangs up during recording, the recording terminates and a connection.disconnect.hangup event is thrown. Audio recorded up until the hangup is available through the <record> variable, e.g.
<catch event="connection.disconnect.hangup">
  <!-- action such as submitting the recording to the server -->
</catch>
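
A fuller sketch of the same idea, with the placeholder action filled in. It reuses the save_message.jsp target from the earlier submit slide, and the server-side script itself is assumed to exist. Whether a partial recording is actually delivered after a hangup may depend on the platform.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <record name="msg" beep="true" maxtime="10s">
      <prompt>Record your message after the beep.</prompt>
      <catch event="connection.disconnect.hangup">
        <!-- Send whatever was recorded before the caller hung up -->
        <submit next="save_message.jsp" enctype="multipart/form-data"
                method="post" namelist="msg"/>
      </catch>
    </record>
  </form>
</vxml>
```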

30
Exercise: SSML markup

Create a file using some SSML markup for TTS.
Examples:
He drove his new car, <prosody pitch="-10" range="-20" volume="-20">not his ugly old car</prosody>, because he wanted to seem more <emphasis level="strong">impressive</emphasis>
My user number is <say-as interpret-as="digits">145678</say-as>
Sample file: tts.vxml

31
Exercise: recording and using audio files

Create a simple application that includes a field in which you ask the user to speak some information, such as name and address, that is recorded by the system for later playback.
Play back a pre-recorded file (e.g. music to be played as an introduction).
