Title: Talking to Computers and Computers Talking to You
1Talking to Computers and Computers Talking to You
- CA107 Topics in Computing Lecture
- Nov 7, 2003
- John McKenna
- John.McKenna@computing.dcu.ie
2Overview
- What are we dealing with?
- Sounds and Speech
- Talking to Computers
- Speech Recognition
- Computers Talking to You
- Speech Synthesis
3Preface
- All the software used in the demos today is either available in the CA labs and/or downloadable for free.
- In the CA labs, choose
- > Programs
- > Computational Linguistics
- > Package
- Feel free to experiment
4Sounds and Speech
- Words contain sequences of sounds
- Each sound (phone) is produced by sending signals from the brain to the vocal articulators
- The vocal articulators produce variations in air pressure
- These variations are transmitted through the air as complex waves
- These waves are received by the ear and signals are sent to the brain
5Articulation
Vocal Folds
Vocal Tract
6Sound Production
- Vocal folds open and close rapidly
- Their rate of opening/closing determines what we perceive as pitch
- Some consonants are voiceless
- Vocal tract configuration determines the sound quality
7How Sounds Vary
Phonation? Manner? Place? Nasality?
8Acoustics: Vowels
- All vowels are voiced (except whispered vowels)
- Vocal tract independent of vocal folds
- So we have two things we can vary
- Rate of vocal folds opening/closing
- Vocal tract configuration
- What is it that causes us to perceive differences?
- Let's look at the ear
9The Ear, Waves and Frequencies
- The cochlea in the ear is sensitive to frequency
- What do we mean by frequency?
- We use frequency to describe phenomena that repeat regularly in time
- E.g. a tuning fork vibrates at a certain frequency
- Its oscillations cause air pressure variations
10Waves and Spectra
Simple wave
Complex wave
- Demos
- MATLAB
- Praat
- Analysis Tool
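In the spirit of the "Waves and Spectra" demos above, here is a minimal sketch (in Python with NumPy, not one of the lecture's tools) of the same idea: a complex wave built by summing sinusoids, and a Fourier transform recovering the strengths of its frequency components. The sample rate and component frequencies are illustrative choices.

```python
# A minimal sketch: sum sinusoids into a "complex wave", then recover the
# component frequencies with a Fourier transform. Values are illustrative.
import numpy as np

fs = 16000                          # assumed sampling rate in Hz
t = np.arange(0, 0.5, 1.0 / fs)     # half a second of time samples

# Simple wave: a single 200 Hz sinusoid (a tuning-fork-like tone)
simple = np.sin(2 * np.pi * 200 * t)

# Complex wave: a sum of harmonics with different strengths
complex_wave = (1.0 * np.sin(2 * np.pi * 200 * t) +
                0.5 * np.sin(2 * np.pi * 400 * t) +
                0.25 * np.sin(2 * np.pi * 600 * t))

# Magnitude spectrum: the relative strength of each frequency component
spectrum = np.abs(np.fft.rfft(complex_wave))
freqs = np.fft.rfftfreq(len(complex_wave), 1.0 / fs)

# The three strongest bins should sit at roughly 200, 400 and 600 Hz
peaks = freqs[np.argsort(spectrum)[-3:]]
print(sorted(np.round(peaks)))
```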
11Spectral Envelope
Harmonics of F0 vs. Formants (resonances)
12Computers
- When machines produce sound
- Signals are sent from a program to speakers
- I.e. speakers replace the articulators
- When machines receive sound
- The microphone replaces the ear
- Signals are sent from microphone to program
- Sound card: intermediate controller/processor
- Plays the role of the articulator muscles (when producing sound)
- And of the cochlea in the ear (when receiving sound)
13Conclusions
- If we want to process speech, we analyse/synthesise at the acoustic level
- Acoustically, speech is a series of complex waves which contain oscillations of many frequencies
- The relative strengths of these frequencies characterise sounds
- Knowing/learning these characteristics allows us to process speech
14Note on Speakers
- Acoustics depend on articulators
- Articulators vary across speakers
- So acoustics vary across speakers
- This can be problematic in speech processing
- More later
15Automatic Speech Recognition
- Techniques
- Example
- Template Matching
- Issues
- Related Tasks
- Demos
16Techniques in ASR
- Template Matching
- Used in voice dialling on mobiles
- Calculate distances between test utterance and each stored template
- Choose template with minimum distance
- Probabilistic Modelling
- Train models with multiple utterances
- Calculate the likelihood that the test utterance was produced by each model
- Choose model with highest probability
17Architecture
18Example: Template Matching
- Isolated word recognition
- Task
- Want to build an isolated word recogniser, e.g. voice dialling on mobile phones
- Method (a sketch follows this list)
- Record, parameterise and store a vocabulary of reference words
- Record the test word to be recognised and parameterise it
- Measure the distance between the test word and each reference word
- Choose the reference word closest to the test word
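A minimal sketch of the method above, assuming each word has already been parameterised into an array of frames of feature values of equal length. The feature values and vocabulary here are toy assumptions; real recordings differ in length, which is why dynamic time warping (next slides) is needed.

```python
# A minimal sketch of whole-word template matching: compute a distance from
# the test word to every stored reference template and pick the closest.
import numpy as np

def distance(test, reference):
    """Sum of frame-by-frame Euclidean distances between two feature arrays."""
    return float(np.sum(np.linalg.norm(test - reference, axis=1)))

def recognise(test, templates):
    """Return the vocabulary word whose stored template is closest to the test word."""
    return min(templates, key=lambda word: distance(test, templates[word]))

# Toy "parameterised" words: 5 frames of 3 features each (illustrative numbers)
rng = np.random.default_rng(0)
templates = {"call": rng.normal(0, 1, (5, 3)), "home": rng.normal(5, 1, (5, 3))}
test_word = templates["home"] + rng.normal(0, 0.1, (5, 3))   # a noisy repeat of "home"
print(recognise(test_word, templates))                        # expected: "home"
```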
19Preprocessing
Words are parameterised on a frame-by-frame basis. Choose a frame length over which speech remains reasonably stationary, and overlap the frames, e.g. 40ms frames with a 10ms frame shift.
[Diagram: overlapping analysis frames]
We want to compare frames of the test and reference words, i.e. calculate distances between them (see the framing sketch below).
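A minimal sketch of this preprocessing step, using the 40ms frame length and 10ms shift from the slide. The sampling rate is an assumed value, and a real system would also window each frame and convert it to features such as MFCCs before computing distances.

```python
# Cut a signal into overlapping frames (40 ms long, shifted by 10 ms).
import numpy as np

def frame_signal(signal, fs, frame_ms=40, shift_ms=10):
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    frames = [signal[start:start + frame_len]
              for start in range(0, len(signal) - frame_len + 1, shift)]
    return np.array(frames)

fs = 16000                                   # assumed sampling rate
signal = np.random.randn(fs)                 # 1 second of dummy audio
frames = frame_signal(signal, fs)
print(frames.shape)                          # 97 frames of 640 samples each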
20Dynamic Time Warping
Test frames:      5 3 9 7 3
Reference frames: 4 7 4
Using a dynamic alignment, make the most similar frames correspond. Find the distance between the two utterances using these corresponding frames. There are efficient algorithms to find the minimum-distance path between two sets of frames (see the sketch below).
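A minimal dynamic-programming sketch of DTW, aligning two sequences of different lengths and returning the total distance along the best path. The frame "features" here are single numbers echoing the toy example above; real frames would be feature vectors, and production systems use more refined variants than this plain recursion.

```python
# Dynamic time warping: minimum total distance over all monotonic alignments.
import numpy as np

def dtw_distance(test, reference):
    n, m = len(test), len(reference)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(test[i - 1] - reference[j - 1])       # local frame distance
            # best of: stretch the reference, stretch the test, or advance both
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

print(dtw_distance([5, 3, 9, 7, 3], [4, 7, 4]))   # distance for the slide's toy frames
```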
21Issues in ASR
- Speaker dependent/independent
- Vocabulary size
- Isolated word vs. Continuous speech
- Language modelling constraints
- Level of ambiguity in vocabulary
- Environment e.g. noise considerations
22Language Modelling
- Given the acoustic signal
- P(w1 = nigh) = 0.4
- P(w1 = my) = 0.3
- P(w2 = at) = 0.5
- P(w2 = hat) = 0.2
- Given the language model
- P(w1 w2 = nigh at) = 0.05
- P(w1 w2 = my hat) = 0.4
- What is the most likely sequence? (see the sketch below)
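A minimal sketch using the numbers on the slide: score each candidate sequence by multiplying its per-word acoustic probabilities with its language-model probability, then pick the highest-scoring sequence. The multiplication of the two scores is the usual combination, but treat the code as an illustration rather than a full decoder.

```python
# Combine acoustic scores with a language model to choose a word sequence.
acoustic = {"nigh": 0.4, "my": 0.3, "at": 0.5, "hat": 0.2}
language_model = {("nigh", "at"): 0.05, ("my", "hat"): 0.4}

def score(sequence):
    p = language_model[sequence]
    for word in sequence:
        p *= acoustic[word]
    return p

best = max(language_model, key=score)
print(best, score(best))   # "my hat" wins (0.024) despite weaker acoustic scores
```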
23Related Tasks
- Speaker Recognition
- Speaker Identification
- Speaker Verification
- Speaker Adaptation
- Speaker Normalisation
24Speech Synthesis
- Text-To-Speech (TTS)
- Typical architecture
- Festival
- Demos
- MBROLA
25TTS Architecture
[Block diagram: Text → Text Analysis → Linguistic Analysis → Waveform Generation → Speech]
26Text and Linguistic Processing
- Language modelling generates the phone sequence and prosody of the target utterance
- Tokenisation
- Parsing helps phrasing
- POS tagging helps disambiguate word senses, which may have varying phone sequences and prosodies
- Word pronunciation is guided by a lexicon, with non-entries relying on letter-to-sound rules
- Prosodic modelling is often machine learnt
27Tokenisation
- Specifying units
- Easier for English than Chinese (see the sketch below)
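A minimal sketch of why specifying units is easier for English than for Chinese: English words are largely delimited by whitespace, whereas a Chinese sentence arrives as one unbroken string of characters and needs a separate word-segmentation step. The Chinese sentence is an illustrative example, not taken from the lecture.

```python
# Whitespace gives English its units; Chinese has no such delimiters.
english = "I live in Reading"
chinese = "我住在雷丁"          # "I live in Reading" (illustrative)

print(english.split())        # ['I', 'live', 'in', 'Reading']
print(chinese.split())        # ['我住在雷丁'] – nothing to split on
print(list(chinese))          # individual characters are not the same as words
```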
28Parsing and Tagging
- Parsing helps phrasing
- Phrasing helps decide natural pauses
- He went to the drive in to see the movie
- POS tagging aids disambiguation
- I live in Reading
- I can fish
- Homographs e.g. 1996
29Dictionary/Lexicon
- Phonemic info
- Prosodic info
- E.g.
- ( "present" v ((( p r e ) 0) (( z _at_ n t ) 1)) )
- ( "monument" n ((( m o ) 1) (( n y u ) 0) (( m _at_
n t ) 0)) ) - ( "lives" n ((( l ai v z ) 1)) )
- ( "lives" v ((( l i v z ) 1)) )
- What if no entry?
- E.g. proper nouns
- Letter-to-sound rules (see the sketch below)
- Post-lexical rules
- thirteen men
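A minimal sketch of the lookup order described above: consult a lexicon keyed by word and part of speech (which resolves homographs like "lives"), and fall back to letter-to-sound rules for out-of-vocabulary words such as proper nouns. The entries echo the Festival-style ones on the slide, but the Python data structures and the deliberately trivial letter-to-sound rule are illustrative assumptions, not Festival's actual mechanism.

```python
# Lexicon lookup with a crude letter-to-sound fallback.
lexicon = {
    ("lives", "n"): ["l", "ai", "v", "z"],     # noun pronunciation
    ("lives", "v"): ["l", "i", "v", "z"],      # verb pronunciation
    ("present", "v"): ["p", "r", "e", "z", "@", "n", "t"],
}

def letter_to_sound(word):
    """Fallback for words with no entry, e.g. proper nouns: one 'phone' per letter."""
    return list(word.lower())

def pronounce(word, pos):
    return lexicon.get((word, pos)) or letter_to_sound(word)

print(pronounce("lives", "n"))     # lexicon entry chosen by part of speech
print(pronounce("McKenna", "n"))   # no entry, so letter-to-sound rules apply
```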
30Speech Signal Generation
- Concatenative
- What units?
- Words, syllables, phones, diphones
- Unit selection
- Post-processing
- Other types
- Formant
- Articulatory
31Diphone concatenation
[IPA example: a word segmented into diphones]
32Signal Postprocessing
- Manipulate duration
- Manipulate pitch
- Smooth joins (see the sketch below)
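A minimal sketch of smoothing a join between two concatenated units: instead of butting the segments together, overlap the end of one with the start of the next and crossfade linearly. The toy units and the 5ms overlap are illustrative choices; real systems also manipulate pitch and duration, e.g. with PSOLA-style methods.

```python
# Smooth a concatenation join with a short linear crossfade.
import numpy as np

def crossfade_join(a, b, fs, overlap_ms=5):
    n = int(fs * overlap_ms / 1000)
    fade = np.linspace(1.0, 0.0, n)
    joined = a[-n:] * fade + b[:n] * (1.0 - fade)    # blend the overlapping region
    return np.concatenate([a[:-n], joined, b[n:]])

fs = 16000
unit1 = np.sin(2 * np.pi * 150 * np.arange(0, 0.1, 1 / fs))   # two toy "units"
unit2 = np.sin(2 * np.pi * 180 * np.arange(0, 0.1, 1 / fs))
out = crossfade_join(unit1, unit2, fs)
print(len(unit1) + len(unit2), len(out))   # the overlap makes the result slightly shorter
```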
33Evaluation
- Intelligibility
- Naturalness
- Perceptual tests
- Psychoacoustics
34Future Trends
- Best synthesis units?
- Speech signal modification
- Voice conversion
- Variability: style, mood, ...
- Better models