Title: VoiceXML: SSML Speech Synthesis Markup Language Recorded speech and audio
1 VoiceXML: SSML (Speech Synthesis Markup Language), Recorded speech and audio
2 Acknowledgements
- Prof. McTear, Natural Language Processing, http://www.infj.ulst.ac.uk/nlp/index.html, University of Ulster.
3 Overview
- Speech Synthesis Markup Language (SSML)
- Phases of text-to-speech synthesis
  - Structure analysis
  - Text normalisation
  - Text to phoneme conversion
  - Prosody analysis
  - Waveform production
- Recorded speech
4 SSML
- Speech Synthesis Markup Language
- Enables developers to override default specifications
- Stages
  - Structure analysis
  - Text normalisation
  - Text to phoneme conversion
  - Prosody analysis
  - Waveform production
5 Structure Analysis
- Division of text into basic elements, e.g. sentence and paragraph, to support more natural phrasing
  - <s> - sentence
  - <p> - paragraph
- Structure is inferred from punctuation and formatting, but this can be ambiguous:
  - Dr. Lewis works at the clinic on Sunset Dr. in western Portland.
  - Dr. Smith lives at 214 Elm Dr. He weighs 214 lb. He plays bass guitar. He also likes to fish; last week he caught a 20 lb. bass.
- The same text with explicit structure markup:
  <p>
    <s>Dr. Smith lives at 214 Elm Dr.</s>
    <s>He weighs 214 lb.</s>
    <s>He plays bass guitar.</s>
    <s>He also likes to fish; last week he caught a 20 lb. bass.</s>
  </p>
6 Text Normalisation
- Annotation of text so that it is spoken correctly
- Ambiguous examples
  - 1/2 may be spoken as "half", "January second", "February first", or "one of two".
  - Dr. may be "doctor" or "drive", e.g. "Dr. John Dr." is rewritten as "Doctor John Drive".
  - St. may be "saint" or "street", e.g. "St. John St." is rewritten as "Saint John Street".
  - Acronyms: some, e.g. ACM or IEEE, should be spelled out; others are pronounced as words, e.g. RAM, ROM.
  - Email addresses, e.g. catazman@bee.com
    - First part: "Cat Azman", "C. A. Tazman", or "C. Atazman"?
    - Last part: "Bee dot com" or "B.E.E. dot com"?
7 <sub>
- New in VoiceXML 2.0 (Speech Synthesis Markup).
- Syntax
  - <sub alias="substituteText"> OriginalText </sub>
- Description: a language element whose alias attribute provides substitute text to be spoken instead of the contained text. This allows the document to contain both a written and a spoken form of a string.
8 <sub>
- Every abbreviation and number substituted:
  <sub alias="doctor">Dr.</sub> Smith lives at <sub alias="two fourteen">214</sub> Elm <sub alias="drive">Dr.</sub>
  He weighs <sub alias="two hundred and fourteen">214</sub> <sub alias="pounds">lb.</sub>
  He plays bass guitar.
  He also likes to fish; last week he caught a <sub alias="twenty">20</sub> <sub alias="pound">lb.</sub> bass.
- Only the abbreviations substituted:
  <sub alias="doctor">Dr.</sub> Smith lives at 214 Elm <sub alias="drive">Dr.</sub>
  He weighs 214 <sub alias="pounds">lb.</sub>
  He plays bass guitar.
  He also likes to fish; last week he caught a 20 <sub alias="pound">lb.</sub> bass.
9 <say-as>
- Speaks the enclosed text in the given style
- Implemented (with limitations) on some platforms
- Example: numbers
  - The contained text can be interpreted as a number. The allowed number formats are ordinal, cardinal, and digits.
  - <say-as type="number:ordinal">12</say-as> is spoken as "twelfth".
  - <say-as type="number:digits">12</say-as> is spoken as "one two".
- Other types: acronyms, currency, time, date, duration, measures, telephone, spell-out, names, and net.
- BeVocal provides a set of extended tags for items such as airline, equity, street, city, state, citystate, address
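The SSML 1.0 Recommendation replaced the type attribute of earlier drafts with interpret-as, which is the form used in the exercise at the end of these slides. The fragment below is a minimal sketch of that form inside a VoiceXML 2.0 document. It assumes a platform that accepts the values "ordinal" and "digits" (SSML 1.0 itself does not fix the list of values, so the browser documentation should be checked); the form id and the prompt wording are illustrative only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- "numbers" is a hypothetical form id used only for this sketch -->
  <form id="numbers">
    <block>
      <prompt>
        <!-- expected to be read as "twelfth" where "ordinal" is supported -->
        You are caller number <say-as interpret-as="ordinal">12</say-as>.
        <!-- expected to be read as "one two" where "digits" is supported -->
        Your code is <say-as interpret-as="digits">12</say-as>.
      </prompt>
    </block>
  </form>
</vxml>
```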
10 Text to phoneme conversion
- Specify the pronunciation of words that are ambiguous or difficult to pronounce, e.g.
  - read: "reed" / "red"
  - wind: "Wind the watch when you face into the wind"
- <phoneme> - uses the standard phonetic alphabet, the International Phonetic Alphabet (IPA). The ph values below are Unicode numbers.
  He plays <phoneme alphabet="ipa" ph="U0062 U0258 U0073"> bass </phoneme> guitar.
  He also likes to fish; last week he caught a <sub alias="twenty">20</sub> <sub alias="pound">lb.</sub> <phoneme alphabet="ipa" ph="U0062 U00E6 U0073"> bass </phoneme>.
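In a well-formed XML document, a Unicode number is normally written as a numeric character reference of the form &#xNNNN;. The fragment below is a sketch of the two "bass" pronunciations written that way. The code points are the ones quoted on the slide and are reused here as given, not as a checked transcription; the shortened sentence wording is illustrative.

```xml
<prompt>
  <!-- U+0062, U+0258, U+0073: the musical "bass", code points as quoted on the slide -->
  He plays <phoneme alphabet="ipa" ph="&#x0062;&#x0258;&#x0073;">bass</phoneme> guitar.
  <!-- U+0062, U+00E6, U+0073: the fish "bass", code points as quoted on the slide -->
  Last week he caught a twenty pound
  <phoneme alphabet="ipa" ph="&#x0062;&#x00E6;&#x0073;">bass</phoneme>.
</prompt>
```

If the platform does not support <phoneme>, the SSML processor is expected to fall back to the contained text, so the prompt still plays with the default pronunciation.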
11 Attributes of <phoneme>
- alphabet - The phonetic alphabet used to specify the pronunciation of the word contained in the <phoneme> element. The only valid values for this attribute are alphabet="ipa" and vendor-defined strings of the form alphabet="x-organization" or alphabet="x-organization-alphabet".
- ph - The phonetic spelling of the word, expressed using that alphabet.
- Using the IPA requires some linguistic training. For an excellent tutorial on the IPA symbols and sounds, see http://www.unil.ch/ling/english/phonetique/table-eng.html.
- For an overview of the IPA and a full chart of symbols, see http://www.arts.gla.ac.uk/IPA/ipa.html.
- The sounds used in English and their IPA symbols are illustrated in http://www.antimoon.com/how/pronunc-soundsipa.htm. You can hear each sound by clicking the word that contains the sound.
- To identify the corresponding Unicode number, go to http://web.uvic.ca/ling/resources/ipa/charts/unicode_intro.htm, move the cursor over the IPA symbol, and the Unicode value will appear.
12 Prosody analysis
- Prosody covers pitch (intonation or melody), timing (rhythm), pauses, speech rate, emphasis on words, and the relative timing of segments and pauses.
- Most TTS engines have a prosody analysis algorithm that produces the prosody of the synthesized speech, often based on the parts of speech. For example, nouns, verbs, and adjectives may be accented, whereas auxiliary verbs and prepositions may be de-stressed.
- Synthesized speech pauses at commas and is inflected according to whether the sentence is declarative, interrogative, or exclamatory.
- Prosody rules and algorithms are not perfect and are a topic of ongoing research. Prosody rules for different spoken national languages may be quite different. For example, the prosody of American, British, Indian, and Jamaican pronunciations of English differs.
13 <prosody> pitch
- Refers to the highness or lowness of speech
- (currently not implemented in BeVocal Cafe)
- Measured by the frequency (Hz, vibrations per second) of the sound
- Can be specified with
  - A number followed by Hz
  - A relative change expressed as a percentage, for example "+18.2%" or "-10.3%"
  - A relative change as a relative number, for example "+10" or "-8.7"
  - One of the following words: "x-high", "high", "medium", "low", "x-low", or "default"
14 <prosody> range
- Range - specifies the variability of the pitch.
- (currently not implemented in BeVocal Cafe)
- Specified using the same options as pitch, e.g.
  <prosody pitch="medium" range="x-low">
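As a sketch of the ways of writing a pitch or range value listed above, the prompt below uses a word label, an absolute frequency in Hz, and a signed percentage. The sentence wording and the specific numbers are illustrative assumptions, and a platform that does not implement these attributes will normally just speak the text with its default prosody.

```xml
<prompt>
  <!-- word labels for both pitch and range -->
  <prosody pitch="high" range="x-low">This sentence uses word labels.</prosody>
  <!-- absolute pitch: a number followed by Hz (150 is an arbitrary example value) -->
  <prosody pitch="150Hz">This sentence uses an absolute frequency.</prosody>
  <!-- relative change written as a signed percentage -->
  <prosody pitch="-10.3%">This sentence lowers the pitch by a percentage.</prosody>
</prompt>
```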
15 <prosody> contour
- Describes the actual pitch contour for the text.
- (currently not implemented in BeVocal Cafe)
- A set of time segments with a target pitch specified for each time segment.
- Each time segment is defined as a percentage of the total time for speaking the contained text, e.g. (25%, 25%, 25%, 25%) would speak the contained text in four equal segments.
- An interpolation algorithm smooths the transitions between the time segments. For example, a contour can be used to describe the increase in pitch at the end of a question as follows:
  <prosody contour="(90%, medium) (10%, high)"> You said what? </prosody>
16 <prosody> rate, duration
- Rate. The speaking rate expressed in words per minute (currently not implemented in BeVocal Cafe), specified using any of the following
  - A number
  - A relative change expressed as a percentage, for example "+18.2%" or "-10.3%"
  - A relative change as a relative number, for example "+10" or "-8.7"
  - One of the following words: "x-fast", "fast", "medium", "slow", "x-slow", or "default"
  - Example: The student's name is <prosody rate="-10"> John Scott </prosody>
- Duration. A value in seconds or milliseconds for the desired time to read the element contents, e.g.
  <prosody duration="10s">
17 <prosody> volume
- Volume. Specifies how loudly or quietly the words are spoken, specified by
  - A number in the range from 0.0 to 100.0
  - A relative change expressed as a percentage, for example "+18.2%" or "-10.3%"
  - A relative change as a relative number, for example "+10" or "-8.7"
  - One of the following words: "silent", "x-soft", "soft", "medium", "loud", "x-loud", or "default"
- Example: <prosody volume="loud"> text to be spoken </prosody>
18 <emphasis>
- Formerly <emph>
- level values: strong, moderate, none, and reduced.
- none is used to prevent the speech synthesis processor from emphasizing words that it might typically emphasize
- Example: <emphasis level="strong">help</emphasis>
19 <break>
- Specifies where to insert silence (a pause) in the text
- strength - the strength of the prosodic break. In SSML 1.0 the values are "none", "x-weak", "weak", "medium" (the default value), "strong", or "x-strong"
- time - the duration of the pause, e.g. "250ms", "3s"
- Example:
  Welcome to the Student System
  <break time="250ms"/>
  Please say one of the following
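For context, here is a sketch of the same welcome prompt as a complete VoiceXML 2.0 document, using both the time and the strength attribute. The form id and the three menu words are hypothetical; only strength="medium" is used because it is the default value and the least likely to vary between platforms.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-GB">
  <!-- "welcome" and the menu words below are illustrative assumptions -->
  <form id="welcome">
    <block>
      <prompt>
        Welcome to the Student System.
        <!-- explicit pause length -->
        <break time="250ms"/>
        Please say one of the following:
        <!-- pause described by prosodic strength rather than duration -->
        <break strength="medium"/>
        courses, <break time="250ms"/> timetable, <break time="250ms"/> or results.
      </prompt>
    </block>
  </form>
</vxml>
```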
20 Waveform Production
- The process of converting a textual representation into acoustic sounds that humans hear and interpret as human-like speech.
- <voice> - uses a different voice from the default specified for TTS, e.g.
  <voice age="3" gender="female"> text to speak </voice>
- <audio> - specifies what audio to present to the user
- <desc> - specifies text-only output describing the audio output (e.g. dog barking)
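The three elements can be combined in a single prompt. The sketch below assumes a hypothetical audio file dog.wav: <desc> supplies the text description for text-only output, and the remaining text inside <audio> is spoken if the file cannot be played.

```xml
<prompt>
  <!-- request a non-default voice; the platform chooses the closest match it has -->
  <voice gender="female">Here is the sound you asked for.</voice>
  <!-- dog.wav is a hypothetical file name used only for this sketch -->
  <audio src="dog.wav">
    <desc>dog barking</desc>
    A dog barks.
  </audio>
</prompt>
```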
21 Other SSML elements
- <speak> - defines a container for a speech synthesis document
  - not required when SSML tags are used in PCDATA within VoiceXML.
- <lexicon> - specifies a pronunciation lexicon document which the speech synthesis engine uses to generate the pronunciation of words.
  - format not yet defined; see the documentation of the VoiceXML browser vendor
- <mark> - places a marker into the text to be processed by the speech synthesis engine, e.g. <mark name="pause"/>. When the marker is encountered, the speech synthesis pauses and throws an event referencing the marker name. A built-in event handler processes the event and causes the speech synthesis engine to resume.
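Outside VoiceXML, a standalone SSML 1.0 document does need <speak> as its root element. The following is a minimal sketch of such a document with a marker between two sentences. The marker name after_welcome is a made-up example, and how the marker event is reported depends on the synthesis engine.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <p>
    <s>Welcome to the Student System.</s>
    <!-- "after_welcome" is a hypothetical marker name -->
    <mark name="after_welcome"/>
    <s>Please say one of the following.</s>
  </p>
</speak>
```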
22 <audio> playing prerecorded audio files
- Output can consist of a combination of prerecorded files, audio streams, or synthesised speech, e.g.
  <prompt>Welcome to the Student System <audio src="AudioSample.wav"/> How can I help you?</prompt>
- <audio> can have alternative content in case the audio sample is not available, e.g.
  <audio src="welcome.wav"> Welcome to the Student System </audio>
23 Recording speech input using <record>
- <record> is a form item similar to <field>
- It is used to collect a recording from the user that can be played back or submitted to a server
- It has a <prompt> element and can have a <filled> element
- It can have a grammar for a spoken command to terminate the recording
24 Attributes of <record>
- name - The name of the form item variable that holds the recording.
- expr - The initial value of the form item variable.
- beep - There are two possible values, beep="true" and beep="false". If true, a beep tone is presented to the user just before the recording begins. The default is false.
- maxtime - The maximum duration of the recording, measured from when the recording starts. For example, maxtime="10s", where "10s" means 10 seconds.
- finalsilence - The interval of silence that indicates the end of speech. For example, finalsilence="3s" (not implemented in IBM Voice Server SDK).
- dtmfterm - There are two possible values, dtmfterm="true" and dtmfterm="false". If true, then any DTMF key press not matched by an active grammar will terminate the recording. The default is true.
- type - The media format of the resulting recording. A media type is written in the form type/subtype. For audio files, the type is always audio.
25 Example using <record>
  <form>
    <record name="msg" beep="true" maxtime="5s" finalsilence="5000ms" dtmfterm="true" type="audio/x-wav">
      <prompt timeout="5s">
        Record your message after the beep.
      </prompt>
    </record>
    <filled>
      <!-- when recording is completed, replay recorded message -->
      <prompt> You said <audio expr="msg"/> </prompt>
    </filled>
  </form>
26 Submitting recording to the server
- In this example, a recording has been stored in the variable msg and the system asks the user to confirm whether to keep it
  <field name="confirm" type="boolean">
    <prompt> Your message is <audio expr="msg"/>. </prompt>
    <prompt> To keep it, say yes. To discard it, say no. </prompt>
    <filled>
      <if cond="confirm">
        <submit next="save_message.jsp" enctype="multipart/form-data" method="post" namelist="msg"/>
      </if>
      <clear/>
    </filled>
  </field>
27 <record> shadow variables (1)
- NB: name stands for the name of the form item variable
- name$.duration - The duration of the recording in milliseconds
- name$.size - The size of the recording in bytes
- name$.termchar - The DTMF key used by the caller to terminate the recording. This variable is undefined if a key was not used to terminate the recording.
- name$.maxtime - true indicates that the recording was terminated because the maxtime duration was reached; false indicates that it was not terminated due to maxtime.
28 <record> shadow variables (2)
- name$.utterance - The string of words spoken by the user if the recording was terminated by speech recognition input. This shadow variable is undefined if the recording was not terminated by speech recognition input.
- name$.confidence - The confidence level (0.0 to 1.0) if the recording was terminated by speech. This shadow variable is undefined if the recording was not terminated by speech recognition input. The confidence level is the speech recognizer's estimate of the accuracy of its results, in this case the accuracy of the contents of name$.utterance.
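The shadow variables can be read in ECMAScript expressions once the recording has been made. VoiceXML 2.0 writes them as the form item name followed by $ and the property, so a recording named msg has msg$.duration, msg$.size, msg$.termchar and msg$.maxtime. The form below is a sketch under that assumption; the form id, prompt wording, and the 10 second limit are illustrative.

```xml
<form id="leave_message">
  <record name="msg" beep="true" maxtime="10s" dtmfterm="true" type="audio/x-wav">
    <prompt>Record your message after the beep. Press any key when you have finished.</prompt>
    <filled>
      <if cond="msg$.maxtime">
        <!-- the recording was cut off at maxtime -->
        <prompt>You reached the maximum recording time.</prompt>
      <elseif cond="msg$.termchar != undefined"/>
        <!-- the caller ended the recording with a DTMF key -->
        <prompt>You ended the recording by pressing <value expr="msg$.termchar"/>.</prompt>
      </if>
      <!-- duration is reported in milliseconds, so convert it to whole seconds -->
      <prompt>Your message lasted <value expr="Math.round(msg$.duration / 1000)"/> seconds.</prompt>
    </filled>
  </record>
</form>
```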
29 Dealing with user hang up during recording
- When a user hangs up during recording, the recording terminates and a connection.disconnect.hangup event is thrown. Audio recorded up until the hangup is available through the <record> variable, e.g.
  <catch event="connection.disconnect.hangup">
    action such as submit recording to server
  </catch>
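One way to fill in the action is to submit the partial recording from inside the handler. The sketch below reuses the save_message.jsp target and the msg variable from the earlier slides; the form id, the 30 second limit, and the prompt wording are assumptions made for illustration.

```xml
<form id="leave_message">
  <record name="msg" beep="true" maxtime="30s" type="audio/x-wav">
    <prompt>Record your message after the beep.</prompt>
    <!-- the caller hung up: send whatever was recorded before the hangup -->
    <catch event="connection.disconnect.hangup">
      <submit next="save_message.jsp" method="post"
              enctype="multipart/form-data" namelist="msg"/>
    </catch>
    <!-- normal completion: replay the message to the caller -->
    <filled>
      <prompt>You said <audio expr="msg"/></prompt>
    </filled>
  </record>
</form>
```

Because the caller is no longer on the line, the handler contains no prompts and only sends the data to the server.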
30 Exercise: SSML markup
- Create a file using some SSML markup for TTS.
- Examples
  - He drove his new car, <prosody pitch="-10" range="-20" volume="-20">not his ugly old car</prosody>, because he wanted to seem more <emphasis level="strong"> impressive </emphasis>
  - My user number is <say-as interpret-as="digits"> 145678 </say-as>
- Sample file: tts.vxml
31 Exercise: recording and using audio files
- Create a simple application that includes a field in which you ask the user to speak some information, such as name and address, that is recorded by the system for later playback.
- Play back a pre-recorded file (music to be played as an introduction)