Talking to Computers and Computers Talking to You
1
Talking to Computers and Computers Talking to You
  • CA107 Topics in Computing Lecture
  • Nov 7, 2003
  • John McKenna
  • John.McKenna@computing.dcu.ie

2
Overview
  • What are we dealing with?
  • Sounds and Speech
  • Talking to Computers
  • Speech Recognition
  • Computers Talking to You
  • Speech Synthesis

3
Preface
  • All the software used in the demos today is
    either available in the CA labs or downloadable
    for free.
  • In the CA labs, choose
  • > Programs
  • > Computational Linguistics
  • > Package
  • Feel free to experiment

4
Sounds and Speech
  • Words contain sequences of sounds
  • Each sound (phone) is produced by sending signals
    from the brain to the vocal articulators
  • The vocal articulators produce variations in air
    pressure
  • These variations are transmitted through the air
    as complex waves
  • These waves are received by the ear and signals
    are sent to the brain

5
Articulation
Vocal Folds
Vocal Tract
6
Sound Production
  • Vocal folds open and close rapidly
  • Their rate of opening/closing determines what we
    perceive as pitch
  • Some consonants are voiceless
  • Vocal tract configuration determines the sound
    quality

7
How Sounds Vary
Phonation? Manner? Place? Nasality?
8
Acoustics: Vowels
  • All vowels are voiced (except whispered vowels)
  • Vocal tract independent of vocal folds
  • So we have two things we can vary
  • Rate of vocal folds opening/closing
  • Vocal tract configuration
  • What is it that causes us to perceive
    differences?
  • Let's look at the ear

9
The Ear, Waves & Frequencies
  • The cochlea in the ear is sensitive to frequency
  • What do we mean by frequency?
  • We use frequency to describe phenomena that
    repeat regularly in time
  • E.g. a tuning fork vibrates at a certain
    frequency
  • Its oscillations cause air pressure variations
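The tuning-fork idea above can be sketched in a few lines: sample one second of a pure oscillation (440 Hz, concert A, is an assumed example value) and count how often it repeats. The frequency is simply the number of cycles per second.

```python
import numpy as np

fs = 16000          # sampling rate in Hz (an assumed value for illustration)
f0 = 440.0          # example tuning-fork pitch: concert A
t = np.arange(fs) / fs              # one second of time points
x = np.sin(2 * np.pi * f0 * t)      # the fork's oscillation as a pure sine wave

# Frequency means repetitions per second, so count the
# positive-going zero crossings: one per cycle of the wave
crossings = np.sum((x[:-1] < 0) & (x[1:] >= 0))
print(crossings)    # close to 440: the wave repeats 440 times per second
</imports>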

10
Waves and Spectra
Simple wave
Complex wave
  • Demos
  • MATLAB?
  • Praat
  • Analysis Tool
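Alongside the MATLAB/Praat demos, the simple-vs-complex wave idea can be sketched directly: a complex wave here is just the sum of two sines (100 Hz and 300 Hz are assumed example components), and a Fourier transform recovers which frequencies it contains.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
simple = np.sin(2 * np.pi * 100 * t)                       # simple wave: one frequency
complex_wave = simple + 0.5 * np.sin(2 * np.pi * 300 * t)  # complex wave: two components

# The spectrum reveals which frequencies the complex wave contains
spectrum = np.abs(np.fft.rfft(complex_wave))
freqs = np.fft.rfftfreq(len(complex_wave), d=1 / fs)
peaks = freqs[spectrum > 0.1 * len(complex_wave)]
print(peaks)   # the two component frequencies: [100. 300.]
```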

11
Spectral Envelope
Harmonics of F0 vs. Formants (resonances)
12
Computers
  • When machines produce sound
  • Signals are sent from a program to speakers
  • I.e. speakers replace the articulators
  • When machines receive sound
  • The microphone replaces the ear
  • Signals are sent from microphone to program
  • The sound card is the intermediate
    controller/processor, playing the role of
  • The articulator muscles (when producing)
  • The cochlea in the ear (when receiving)

13
Conclusions
  • If we want to process speech, we
    analyse/synthesise at the acoustic level
  • Acoustically, speech is a series of complex waves
    which contain oscillations of many frequencies
  • The relative strengths of these frequencies
    characterise sounds
  • Knowing/learning these characteristics allows us
    to process speech

14
Note on Speakers
  • Acoustics depend on articulators
  • Articulators vary across speakers
  • So acoustics vary across speakers
  • This can be problematic in speech processing
  • More later

15
Automatic Speech Recognition
  • Techniques
  • Example
  • Template Matching
  • Issues
  • Related Tasks
  • Demos

16
Techniques in ASR
  • Template Matching
  • Used in voice dialling on mobiles
  • Calculate distances between test utterance and
    each stored template
  • Choose template with minimum distance
  • Probabilistic Modelling
  • Train models with multiple utterances
  • Calculate the likelihood that the test utterance
    was produced by each model
  • Choose model with highest probability

17
Architecture
18
Example Template Matching
  • Isolated word recognition
  • Task
  • Want to build an isolated word recogniser e.g.
    voice dialling on mobile phones
  • Method
  • Record, parameterise and store vocabulary of
    reference words
  • Record test word to be recognised and
    parameterise
  • Measure distance between test word and each
    reference word
  • Choose reference word closest to test word
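The four steps above can be sketched as follows. The feature vectors are hypothetical stand-ins for the parameterised words, the distance is Euclidean, and each word is reduced to a single fixed-length vector for simplicity (real utterances are frame sequences, which is why DTW is needed on the next slide).

```python
import numpy as np

# Hypothetical, pre-parameterised reference templates: one feature
# vector per vocabulary word (real systems store frames per word)
templates = {
    "call":  np.array([0.2, 0.9, 0.4]),
    "home":  np.array([0.8, 0.1, 0.5]),
    "voice": np.array([0.3, 0.3, 0.9]),
}

def recognise(test_vector):
    """Choose the reference word whose template is closest to the test word."""
    distances = {word: np.linalg.norm(ref - test_vector)
                 for word, ref in templates.items()}
    return min(distances, key=distances.get)

print(recognise(np.array([0.75, 0.2, 0.45])))   # closest template: home
```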

19
Preprocessing
Words are parameterised on a frame-by-frame
basis. Choose a frame length over which speech
remains reasonably stationary, and overlap the
frames, e.g. 40ms frames with a 10ms frame shift.
(Figure: overlapping 40ms analysis frames)
We want to compare frames of test and reference
words, i.e. calculate distances between them.
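The framing step can be sketched directly, using the slide's 40ms frames and 10ms shift (the sampling rate and the silent stand-in signal are assumptions for illustration):

```python
import numpy as np

fs = 8000                      # assumed sampling rate in Hz
frame_len = int(0.040 * fs)    # 40ms frames, as on the slide
frame_shift = int(0.010 * fs)  # 10ms shift, so consecutive frames overlap

signal = np.zeros(fs)          # one second of (silent) speech as a stand-in

# Slice the signal into overlapping frames
starts = range(0, len(signal) - frame_len + 1, frame_shift)
frames = np.array([signal[s:s + frame_len] for s in starts])
print(frames.shape)            # (97, 320): 97 frames of 320 samples each
```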
20
Dynamic Time Warping
Test: 5 3 9 7 3
Reference: 4 7 4
Using a dynamic alignment, make the most similar
frames correspond. Find the distance between two
utterances using these corresponding frames. There
are efficient algorithms to find the minimum
distance path between two sets of frames.
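A minimal dynamic-programming sketch of such an algorithm, run on the slide's test and reference sequences with |a − b| as the frame distance (frames are single numbers here for clarity; real frames are feature vectors):

```python
# Dynamic Time Warping: minimum-cost alignment of two frame sequences
def dtw(test, ref, dist=lambda a, b: abs(a - b)):
    n, m = len(test), len(ref)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Each cell extends the cheapest of the three allowed moves
            D[i][j] = dist(test[i - 1], ref[j - 1]) + min(
                D[i - 1][j],       # stretch the reference
                D[i][j - 1],       # compress the reference
                D[i - 1][j - 1])   # advance both together
    return D[n][m]

print(dtw([5, 3, 9, 7, 3], [4, 7, 4]))   # minimum distance: 5.0
```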
21
Issues in ASR
  • Speaker dependent/independent
  • Vocabulary size
  • Isolated word vs. Continuous speech
  • Language modelling constraints
  • Level of ambiguity in vocabulary
  • Environment e.g. noise considerations

22
Language Modelling
  • Given the acoustic signal
  • P(w1 = 'nigh') = 0.4
  • P(w1 = 'my') = 0.3
  • P(w2 = 'at') = 0.5
  • P(w2 = 'hat') = 0.2
  • Given the language model
  • P(w1 w2 = 'nigh at') = 0.05
  • P(w1 w2 = 'my hat') = 0.4
  • What is the most likely sequence?
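Combining the two knowledge sources answers the question. Multiplying the scores is a simplification (real recognisers apply Bayes' rule over full models), but it shows the slide's point: the acoustics alone prefer "nigh at", yet the language model overturns them.

```python
# Acoustic scores from the slide: how well each word matches the signal
acoustic = {("nigh", "at"): 0.4 * 0.5,   # P(w1='nigh') * P(w2='at')  = 0.20
            ("my", "hat"):  0.3 * 0.2}   # P(w1='my')   * P(w2='hat') = 0.06

# Language-model scores from the slide: how plausible each sequence is
language = {("nigh", "at"): 0.05,
            ("my", "hat"):  0.4}

# Combine the two knowledge sources by multiplying their scores
combined = {seq: acoustic[seq] * language[seq] for seq in acoustic}
best = max(combined, key=combined.get)
print(best)   # ('my', 'hat'): the language model overturns the acoustics
```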

23
Related Tasks
  • Speaker Recognition
  • Speaker Identification
  • Speaker Verification
  • Speaker Adaptation
  • Speaker Normalisation

24
Speech Synthesis
  • Text-To-Speech (TTS)
  • Typical architecture
  • Festival
  • Demos
  • MBROLA

25
TTS Architecture
Text → Text Analysis → Linguistic Analysis → Waveform Generation → Speech
26
Text & Linguistic Processing
  • Language modelling generates phone sequence and
    prosody of the target utterance.
  • Tokenisation
  • Parsing helps phrasing
  • POS tagging helps disambiguate word senses which
    may have varying phone sequences and prosodies
  • Word pronunciation guided by a lexicon, with
    non-entries relying on letter to sound rules
  • Prosodic modelling is often machine learnt

27
Tokenisation
  • Specifying units
  • Easier for English than Chinese
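A minimal illustration of why: English whitespace marks the unit boundaries for free, while a Chinese string (the example sentence is an assumption) carries no such markers, so a real system needs a segmentation model or dictionary instead.

```python
english = "the cat sat on the mat"
print(english.split())   # whitespace gives the units: ['the', 'cat', ...]

# Chinese text has no spaces between words, so splitting on
# whitespace returns the whole string as a single "unit"
chinese = "我喜欢猫"   # "I like cats"
print(chinese.split())   # ['我喜欢猫']: no boundaries to exploit
```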

28
Parsing & Tagging
  • Parsing helps phrasing
  • Phrasing helps decide natural pauses
  • He went to the drive in to see the movie
  • POS tagging aids disambiguation
  • I live in Reading
  • I can fish
  • Homographs e.g. 1996

29
Dictionary/Lexicon
  • Phonemic info
  • Prosodic info
  • E.g.
  • ( "present" v ((( p r e ) 0) (( z @ n t ) 1)) )
  • ( "monument" n ((( m o ) 1) (( n y u ) 0) (( m @
    n t ) 0)) )
  • ( "lives" n ((( l ai v z ) 1)) )
  • ( "lives" v ((( l i v z ) 1)) )
  • What if no entry?
  • E.g. proper nouns
  • Letter-to-sound rules
  • Post-lexical rules
  • thirteen men
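The lookup path above can be sketched as a POS-keyed dictionary with a letter-to-sound fallback. The lexicon entries mirror the Festival examples; the letter-to-sound rules here are a toy assumption (Festival's real rules are trained), just to show how non-entries such as proper nouns are handled.

```python
# A toy lexicon keyed by (word, part-of-speech), in the spirit of the
# Festival entries above: "lives" gets different phones as noun vs. verb
lexicon = {
    ("lives", "n"): ["l", "ai", "v", "z"],
    ("lives", "v"): ["l", "i", "v", "z"],
}

# Toy letter-to-sound rules (an assumption, not Festival's trained
# rules): map each letter to a default phone
letter_to_sound = {"b": "b", "o": "o", "n": "n", "d": "d"}

def pronounce(word, pos):
    if (word, pos) in lexicon:           # a lexicon entry always wins
        return lexicon[(word, pos)]
    # Non-entries (e.g. proper nouns) fall back on letter-to-sound rules
    return [letter_to_sound.get(ch, ch) for ch in word.lower()]

print(pronounce("lives", "n"))   # ['l', 'ai', 'v', 'z']
print(pronounce("Bond", "n"))    # proper noun: falls back to letter-to-sound
```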

30
Speech Signal Generation
  • Concatenative
  • What units?
  • Words, syllables, phones, diphones
  • Unit selection
  • Post-processing
  • Other types
  • Formant
  • Articulatory

31
Diphone concatenation
32
Signal Postprocessing
  • Manipulate duration
  • Manipulate pitch
  • Smooth joins
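The "smooth joins" step can be sketched as a linear crossfade over the seam between two concatenated units (the overlap length and the toy waveforms are assumptions for illustration):

```python
import numpy as np

def crossfade_join(a, b, overlap):
    """Concatenate two waveform units, linearly crossfading the seam."""
    fade = np.linspace(0.0, 1.0, overlap)      # ramp from unit a to unit b
    seam = a[-overlap:] * (1 - fade) + b[:overlap] * fade
    return np.concatenate([a[:-overlap], seam, b[overlap:]])

# Two toy "diphone" waveforms whose raw concatenation would click
unit1 = np.ones(100)
unit2 = -np.ones(100)
joined = crossfade_join(unit1, unit2, overlap=20)
print(len(joined))   # 180 samples: 100 + 100 minus the 20-sample overlap
```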

33
Evaluation
  • Intelligibility
  • Naturalness
  • Perceptual tests
  • Psychoacoustics

34
Future Trends
  • Best synthesis units?
  • Speech signal modification
  • Voice conversion
  • Variability style, mood, ...
  • Better models