Title: Advanced Digital Signal Processing
1 DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF JOENSUU, JOENSUU, FINLAND
- Advanced Digital Signal Processing
- Lecture 12
- Speech Processing
- Andrei Mihaila
- Speech and Image Processing Unit
- Courtesy of Tomi Kinnunen, Speaker Recognition course, Spring 2004: http://phon.joensuu.fi/kieliteknologia/kurssit/2004k/puhujantunn/
2Outline
- Introduction
- Speech production
- Speech analysis
3Introduction
4Speech processing
- Speech processing is the study of speech signals and of the methods used to process them.
- Linguistics, Phonetics, Phonology
- Acoustics, DSP
- Cognitive science, Artificial intelligence
- Pattern recognition, Statistics
- Computer science
5What is speech?
- Sociolinguist: our main means of conveying communicative intent
- Phonetician/phonologist: an organized sequence of phonemes and syllables, linked to each other by certain rules
- Acoustician: wave motion in a medium
- DSP engineer: a digital signal to be processed
- Pattern recognizer: a pattern to be classified or analyzed
- Computer scientist: a challenge of finding efficient algorithms for processing it
6Speech acoustics an example
Waveform: air pressure as a function of time
Spectrogram: relative amplitudes of different frequencies over time
Intensity contour: modulation of volume
F0 contour: modulation of fundamental frequency
7Information in speech
- message
- language
- dialect
- sex
- age
- weight/height
- voice quality
- emotional status
8The area of speech processing
- Recognition tasks
- Speech recognition: conversion of speech to text
- Speaker recognition: classifying speakers by their voice
- Language recognition: recognizing the spoken language
- Emotion recognition: recognizing the emotional state of the speaker
- Synthesis: conversion from text to speech
- Coding: processing for economical transmission and storage of speech
- Manipulation
- Speech enhancement: improving the quality of the signal
- Voice mapping: converting the speech of one person to another person's voice
- Etc.
9Complexity of recognition
- Speech is very different from written text
- No markers between letters or even words!
- Coarticulation and deletion effects
- Context dependency
- The acoustic speech signal varies from one repetition to another
- Different kinds of information (message, speaker, language, etc.) can't be separated by any simple method: they are mixed in the acoustic signal in a highly complex way
10Example of complexity
- (The word "Puhujantunnistus" uttered by the same male speaker)
11Speech applications
- Speech recognition
- Voice dictation
- Command and control applications (voice commands)
- Speaker recognition
- Person authentication
- Forensics
- Text to speech (TTS)
- Telephony applications
- Speech email
- Disabled people
12Speech as a biometric
- Biometrics: the science of identifying or verifying the identity of a person based on physiological or behavioral characteristics
- A biometric: a measurable physiological or behavioral characteristic used for identifying a person or verifying their identity
13Example of biometrics
14Pros and Cons of Speech
- Pros
- The most natural way of communicating
- Non-intrusive as a biometric
- Cheap sensor cost
- Easy to integrate with other biometrics
- Cons
- Not accurate
- Can be imitated or resynthesized
- Depends on the speaker's physical and emotional state
- Recognition in noisy environments is demanding
15Speech Production
16Tract: a system of body parts that together serve some particular purpose
17Speech production model
18Phonation modes
- Three types of phonation
- Voiceless phonation: the vocal cords are apart from each other and the airstream passes through the open glottis to the vocal tract
- Whispering: the glottis is open, but the area of the opening is smaller than in voiceless phonation
- Voiced phonation: the vocal cords open and close successively, producing a sequence of air puffs (glottal pulses)
- Phoneme: one of a small set of speech sounds that are distinguished by the speakers of a particular language
19Voicing
- Fundamental frequency (F0)
- The rate at which the vocal folds vibrate (often loosely called pitch)
- Average F0 values (European languages)
- 120 Hz (male), 220 Hz (female), 330 Hz (children)
20Vocal tract
- Includes the pharyngeal, oral and nasal cavities. Their volumes and shapes vary between individuals.
- The length of the vocal tract (from glottis to lips) is on average 17.5 cm for male speakers
- Its resonance frequencies are called formants (F1, F2, ...)
- The vocal tract acts like a filter that modifies the frequency content of the signal (source-filter model)
F1 and F2 are mostly responsible for the phonetic quality of vowels, whereas the higher (vowel) formants depend more on the speaker. Many studies have also reported that phonetic features lie in the mid-frequency range, and that the low and high ends of the spectrum are more speaker-dependent.
21Resonances
- During the production of different speech sounds, the relative volumes of the parts of the vocal tract are different
- E.g. different vowels are produced by making one major constriction in the vocal tract (VT) with different parts of the tongue, effectively dividing the VT into two cavities and the passage between them, each of which has its own resonance characteristics
22Resonances 2
- Assumption usually made about the vocal tract: the tube is closed at one end (glottis) and open at the other end (lips)
- The tube has natural resonance frequencies, regardless of whether sound is present. The air inside the tube forms standing wave patterns at its natural resonances.
- The standing waves arise from reflections and constructive interference between waves travelling in opposite directions
- Reflections occur at both the closed end and the open end.
23Vowels
24Resonance of an acoustic tube
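- Open at one end, closed at the other: F_n = (2n - 1) c / (4L), n = 1, 2, 3, ...
- Open at both ends: F_n = n c / (2L), n = 1, 2, 3, ...
- F_n: nth formant (resonance) frequency [Hz], c: speed of sound in air (approximately 340 m/s), L: length of the tube [m]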
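As a rough illustration of the tube model, the sketch below evaluates the standard quarter-wave (closed-open) and half-wave (open-open) resonance formulas for a uniform tube. The function names, the default speed of sound of 340 m/s and the choice of three resonances are assumptions made for this example. The 17.5 cm length is the average male vocal tract length quoted earlier.

```python
def closed_open_resonances(length_m, n_resonances=3, c=340.0):
    """Resonances of a uniform tube closed at one end and open at the other.

    F_n = (2n - 1) * c / (4 * L). The default c (speed of sound, m/s) is an
    assumed round value, not a measured one.
    """
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, n_resonances + 1)]


def open_open_resonances(length_m, n_resonances=3, c=340.0):
    """Resonances of a uniform tube open at both ends: F_n = n * c / (2 * L)."""
    return [n * c / (2.0 * length_m) for n in range(1, n_resonances + 1)]


# Neutral vocal tract approximated as a 17.5 cm closed-open tube.
vocal_tract_formants = closed_open_resonances(0.175)
for index, freq in enumerate(vocal_tract_formants, start=1):
    print(f"F{index} is roughly {freq:.0f} Hz")
```

In the closed-open case the resonances fall at odd multiples of the first one, which is why the formants of a neutral vowel are roughly evenly spaced.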
25Speech Perception
- Psychoacoustics: the study of subjective human perception of sounds
- Some correspondence between physical and perceptual phenomena
- Intensity -> loudness
- Fundamental frequency -> pitch
- Spectral shape -> timbre
- Phase difference in binaural hearing -> location
- The ear's response to frequency is roughly logarithmic (the perceived pitch of a sound grows approximately with the logarithm of its frequency)
- The frequency resolution of the ear is about 2 Hz (in the middle range)
26Octave
- The ear is very accustomed to hearing a fundamental frequency plus harmonics.
- The term octave means a factor of two in frequency
- Logarithmic representation of frequency
- (e.g. the same amount of audio information is carried in the octave 50 Hz -> 100 Hz as in the octave 10 kHz -> 20 kHz)
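Since an octave is a factor of two in frequency, the number of octaves between two frequencies is the base-2 logarithm of their ratio. A minimal sketch follows; the function name is an assumption made for this example.

```python
import math


def octaves_between(f_low_hz, f_high_hz):
    """Number of octaves spanned by the interval [f_low_hz, f_high_hz]."""
    return math.log2(f_high_hz / f_low_hz)


print(octaves_between(50.0, 100.0))        # 1.0
print(octaves_between(10_000.0, 20_000.0))  # 1.0
# The whole range of hearing, 20 Hz to 20 kHz, spans just under ten octaves.
print(round(octaves_between(20.0, 20_000.0), 2))
```

Both example intervals span exactly one octave, although the second one is 200 times wider in Hz.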
27Frequency scales
- Mel (melody) scale
- A perceptual scale of pitches judged by listeners to be equally distant from one another
- Above 500 Hz, larger and larger intervals are judged by listeners to produce equal pitch increments (e.g. 4 octaves in Hz -> 2 octaves in mel)
- Bark (critical band) scale: based on the idea that the peripheral auditory system contains a bank of analyzing filters within which the energy is summed (24 critical bands of hearing)
28Mel-frequency warping
- Commonly used
- Instead of linear processing, the frequency axis is stretched at the low end and compressed at the high end according to the critical bands. Thus, the spectral resolution at low frequencies is higher than at high frequencies.
- Nonlinear transformation of the frequency axis (referred to as frequency warping)
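The slides do not give a formula for the warping. One widely used analytic approximation of the mel scale is mel = 2595 log10(1 + f / 700). The sketch below assumes that formula and uses hypothetical function names. It shows that equal steps on the mel axis correspond to wider and wider intervals in Hz.

```python
import math


def hz_to_mel(f_hz):
    """Hz to mel, using one common approximation (an assumption here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equally spaced points on the mel axis, 500 mel apart.
edges_hz = [mel_to_hz(m) for m in range(0, 2501, 500)]
widths_hz = [hi - lo for lo, hi in zip(edges_hz, edges_hz[1:])]
for lo, width in zip(edges_hz, widths_hz):
    print(f"band starting at {lo:7.1f} Hz is {width:7.1f} Hz wide")
```

Each band covers the same distance in mel, but its width in Hz grows with frequency, so low frequencies are resolved more finely.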
29Speech analysis
30Signal Acquisition
- Speech signal
- A form of wave motion
- Continuous (analog)
- The wave motion appears as sound pressure changes, which a microphone converts into (continuous) voltage changes
- Analog-to-digital converter (ADC)
- The range of human hearing: 20 Hz -> 20 kHz
- The ear is most sensitive to sounds between 1 kHz and 4 kHz
- High-fidelity music (CD audio): sampling rate 44.1 kHz, 16 bits per sample
- Bandwidths: 20 kHz (music), 3.2 kHz (speech)
- Telephone quality: 8 kHz, 12 or 8 bits per sample
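The figures above can be tied together with two small calculations. The sampling theorem requires a sampling rate of at least twice the signal bandwidth, and the raw bit rate is the sampling rate times the bits per sample. The function names below are assumptions made for this sketch, and the bit rates are per channel.

```python
def nyquist_frequency_hz(sampling_rate_hz):
    """Highest frequency representable at the given sampling rate."""
    return sampling_rate_hz / 2.0


def bit_rate_bps(sampling_rate_hz, bits_per_sample, channels=1):
    """Raw (uncompressed) bit rate in bits per second."""
    return sampling_rate_hz * bits_per_sample * channels


# CD audio: 44.1 kHz covers the 20 kHz music bandwidth.
print(nyquist_frequency_hz(44_100), bit_rate_bps(44_100, 16))  # 22050.0 705600
# Telephone speech: 8 kHz covers the 3.2 kHz speech bandwidth.
print(nyquist_frequency_hz(8_000), bit_rate_bps(8_000, 8))     # 4000.0 64000
```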
32Features
- Feature: a measure of a property of the speech waveform
- Reasons for feature extraction
- Redundant and harmful information is removed
- Reduced computation time
- Easier modeling of the feature distribution
- Speech has many natural (acoustic-phonetic) features
- Fundamental frequency (F0), formant frequencies, formant bandwidths, spectral tilt, intensity, phone durations, articulation, etc.
- Not-so-natural features
- Cepstrum, linear predictive coefficients, line spectral frequencies, vocal tract area function, delta and double-delta coefficients, etc.
33Why do we need feature extraction?
- The acoustic speech signal varies over time, so two waveforms can't be compared directly
- Example: two instances of the vowel /a/ spoken in isolation, with a time interval of < 1 second between the repetitions
34Example of features
- robust against channel effect and noise
- hard to automatically extract
- Requires lots of training data
- Complicated models
- text dependence
- easy to automatically extract
- small amount of data needed
- text independence
- easy models
- easily corrupted by noise and inter-session
variability
35Feature extraction from speech
- The speech signal is processed in short time segments called frames
- Typical frame length: 10-30 ms. Adjacent frames overlap by 30-75% (e.g. 10 ms)
- For each frame, a feature vector is computed using DSP algorithm(s)
- Underlying assumption: the speech signal is locally stationary (its statistics don't change over the frame)
- Valid for steady sounds, but not necessarily for transients
- A shorter frame length gives better time resolution, but at the cost of frequency resolution
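The framing step can be sketched as follows. The function name and the defaults of a 25 ms frame with a 10 ms shift (60% overlap) are assumptions chosen from within the typical ranges above. Trailing samples that do not fill a whole frame are simply dropped in this sketch.

```python
def split_into_frames(signal, sample_rate_hz, frame_ms=25.0, shift_ms=10.0):
    """Split a sequence of samples into overlapping fixed-length frames."""
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    shift = int(round(sample_rate_hz * shift_ms / 1000.0))
    frames = []
    start = 0
    # Stop when a full frame no longer fits into the remaining samples.
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += shift
    return frames


# One second of dummy "speech" sampled at 8 kHz.
samples = list(range(8000))
frames = split_into_frames(samples, 8000)
overlap = 1.0 - 80 / 200  # frame shift / frame length, in samples
print(len(frames), len(frames[0]), f"{overlap:.0%} overlap")
```

At 8 kHz a 25 ms frame holds 200 samples and a 10 ms shift is 80 samples, so one second of signal yields 98 frames. A feature vector would then be computed from each of them.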
36More about windowing
- The DFT assumes that the local waveform is periodic
- The discontinuities at the frame edges are then interpreted as being part of an infinitely long periodic signal
- Windowing suppresses the effect of the discontinuities
- Overlapping framing and windowing provide a thorough analysis, since each speech sound is approximately centered within some frame
37Examples of windowing
- Less spectral leakage in the case of Hamming
window
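The effect can be checked numerically with a small experiment. A sinusoid that does not complete a whole number of periods within the frame is analyzed with a rectangular window (no windowing) and with a Hamming window. The fraction of spectral energy falling outside the neighbourhood of the peak serves as a simple leakage measure. The helper names, the 64-point frame and this particular leakage measure are assumptions made for the sketch. The naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def hamming(n_points):
    """Hamming window: w[n] = 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def dft_power(frame):
    """Power spectrum for bins 0..N/2, computed with a naive O(N^2) DFT."""
    n_points = len(frame)
    power = []
    for k in range(n_points // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_points)
                  for n, x in enumerate(frame))
        power.append(abs(acc) ** 2)
    return power


def leakage_fraction(power, half_width=3):
    """Fraction of energy lying more than half_width bins from the peak."""
    peak = max(range(len(power)), key=power.__getitem__)
    lo = max(0, peak - half_width)
    hi = min(len(power), peak + half_width + 1)
    return 1.0 - sum(power[lo:hi]) / sum(power)


N = 64
# 10.5 cycles per frame: the frame edges do not line up, so leakage occurs.
tone = [math.sin(2.0 * math.pi * 10.5 * n / N) for n in range(N)]
rect_leak = leakage_fraction(dft_power(tone))
hamm_leak = leakage_fraction(dft_power([x * w for x, w in zip(tone, hamming(N))]))
print(f"rectangular: {rect_leak:.4f}, Hamming: {hamm_leak:.6f}")
```

The Hamming window trades a somewhat wider main lobe for much lower side lobes, which is why far less energy leaks away from the peak.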
38Signal pre-emphasis
- The high-frequency portion of the speech signal is usually pre-emphasized prior to frequency analysis
- The pre-emphasis filter is selected to give a boost of approximately 6 dB/octave
- The 6 dB/octave boost is meant to cancel the effect of the glottal source, so that the frequency spectrum describes the vocal tract characteristics
- Typically, a first-order FIR digital filter of the form y[n] = x[n] - a x[n-1], with a slightly below 1, is used
- Since the pre-emphasis filter is meant to cancel the effect of the glottal source, unvoiced/whispered sounds should not be pre-emphasized!
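A minimal sketch of a first-order pre-emphasis filter, y[n] = x[n] - a x[n-1], is given below. The coefficient a = 0.97 is a commonly quoted value and an assumption here, as are the function names. The second function evaluates the filter's magnitude response, so the boost per octave can be inspected directly.

```python
import math


def pre_emphasize(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample is passed through."""
    if not signal:
        return []
    output = [signal[0]]
    for n in range(1, len(signal)):
        output.append(signal[n] - alpha * signal[n - 1])
    return output


def gain_db(freq_hz, sample_rate_hz, alpha=0.97):
    """Magnitude response |1 - alpha * exp(-j w)| of the filter, in dB."""
    omega = 2.0 * math.pi * freq_hz / sample_rate_hz
    magnitude = math.sqrt(1.0 + alpha * alpha - 2.0 * alpha * math.cos(omega))
    return 20.0 * math.log10(magnitude)


# Boost gained when going up one octave, from 1 kHz to 2 kHz, at 8 kHz sampling.
octave_boost = gain_db(2000.0, 8000.0) - gain_db(1000.0, 8000.0)
print(f"boost from 1 kHz to 2 kHz: {octave_boost:.1f} dB")
```

The slope is close to 6 dB/octave in the middle of the band and flattens towards the Nyquist frequency. A constant (DC) input is almost entirely removed.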
39Signal pre-emphasis example
41The filterbank
- Each of the filters just computes a weighted
sum/average of that subband
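The weighted sum per subband can be sketched as below. The triangular filter shape, the band edges given as DFT bin indices and the function names are all assumptions made for illustration, since the slide does not specify them. The edges are chosen so that the filters get wider towards high frequencies.

```python
def triangular_weights(left, center, right, n_bins):
    """Triangular weights rising from `left` to `center`, falling to `right`.

    Requires left < center < right < n_bins (bin indices).
    """
    weights = [0.0] * n_bins
    for k in range(left, center + 1):
        weights[k] = (k - left) / (center - left)
    for k in range(center, right + 1):
        weights[k] = (right - k) / (right - center)
    return weights


def apply_filterbank(power_spectrum, edges):
    """Weighted sum of the power spectrum in each subband.

    Filter i spans edges[i]..edges[i + 2] and peaks at edges[i + 1].
    """
    energies = []
    for i in range(len(edges) - 2):
        weights = triangular_weights(edges[i], edges[i + 1], edges[i + 2],
                                     len(power_spectrum))
        energies.append(sum(w * p for w, p in zip(weights, power_spectrum)))
    return energies


flat_spectrum = [1.0] * 17
band_edges = [0, 2, 4, 8, 16]  # three overlapping filters of growing width
print(apply_filterbank(flat_spectrum, band_edges))  # [2.0, 3.0, 6.0]
```

With a flat spectrum the outputs simply equal the filter areas, which shows how the wider high-frequency filters sum over more bins.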
42Two basic types of features
- Static (instantaneous) features
- Computed from one or more frames. Give a snapshot of the associated articulators at that time interval.
- Loose analogy: physiological/organic features
- Dynamic features
- Dynamic changes (over time) of some static feature
- Assumed to correlate with speaking rate, coarticulation, rhythm, etc.
- Loose analogy: learned/behavioral features
43Cepstrum
- The most widely used feature
- Several variants
- Mel-frequency cepstral coefficients (MFCC)
- Linear predictive cepstral coefficients (LPCC)
- Usually the higher coefficients are thrown away and a small number (10-20) of the lowest coefficients are retained, excluding the lowest coefficient c0
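In the MFCC variant, cepstral coefficients are commonly obtained by taking a discrete cosine transform (DCT) of the logarithms of the filterbank energies. The sketch below assumes an unnormalized DCT-II, 20 filterbank outputs and 12 retained coefficients, and the function names are hypothetical. It also illustrates one usual reason for discarding c0: scaling the overall signal level changes only c0.

```python
import math


def dct_ii(values):
    """Unnormalized DCT-II: c_k = sum_m v_m cos(pi k (m + 0.5) / M)."""
    n_values = len(values)
    return [sum(v * math.cos(math.pi * k * (m + 0.5) / n_values)
                for m, v in enumerate(values))
            for k in range(n_values)]


def cepstral_coefficients(filterbank_energies, n_keep=12):
    """Log filterbank energies -> DCT -> keep c1..c_n_keep (c0 is dropped)."""
    # A small floor avoids log(0) for empty bands.
    log_energies = [math.log(max(e, 1e-10)) for e in filterbank_energies]
    cepstrum = dct_ii(log_energies)
    return cepstrum[1:n_keep + 1]


energies = [1.0 + m for m in range(20)]  # dummy filterbank outputs
quiet = cepstral_coefficients(energies)
loud = cepstral_coefficients([10.0 * e for e in energies])
max_difference = max(abs(a - b) for a, b in zip(quiet, loud))
print(len(quiet), f"{max_difference:.2e}")
```

A gain factor adds the same constant to every log energy, and a constant only affects the k = 0 term of the DCT. The retained coefficients of the quiet and loud versions therefore agree up to rounding error.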
47Computation of dynamic features
- Time sequence of any feature: f_1(i), f_2(i), ..., f_N(i), where f_k(i) is the ith feature of the kth frame
- We want to estimate the rate of change (1st derivative) of this feature at each time instant (frame)
- Differentiator method
- Linear regression method
- The regression formula represents the slope of the least-squares fitted line. Similar formulas can be derived for higher-order polynomials
- Feature trajectories become smoother
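Both estimators can be sketched for the trajectory of a single feature. The differentiator is assumed here to be a symmetric difference, (f[k+M] - f[k-M]) / (2M). The regression estimate is the slope of the least-squares line fitted over 2M + 1 frames, sum over m of m f[k+m] divided by sum over m of m^2. The function names, the window sizes and the edge handling (repeating the first and last frame) are assumptions made for this example.

```python
def delta_differentiator(track, M=1):
    """Symmetric difference (f[k+M] - f[k-M]) / (2M), with clamped edges."""
    last = len(track) - 1
    return [(track[min(k + M, last)] - track[max(k - M, 0)]) / (2.0 * M)
            for k in range(len(track))]


def delta_regression(track, M=2):
    """Slope of the least-squares line through frames k-M .. k+M."""
    last = len(track) - 1
    denominator = 2.0 * sum(m * m for m in range(1, M + 1))
    deltas = []
    for k in range(len(track)):
        numerator = 0.0
        for m in range(1, M + 1):
            # Pairing +m and -m gives sum_m m * f[k+m] over the whole window.
            numerator += m * (track[min(k + m, last)] - track[max(k - m, 0)])
        deltas.append(numerator / denominator)
    return deltas


# A feature that rises by 3 units per frame: both estimates should be 3.
ramp = [3 * k + 1 for k in range(10)]
print(delta_differentiator(ramp)[1:9])
print(delta_regression(ramp)[2:8])
```

Away from the edges both methods recover the slope of a linear trajectory exactly. On noisy trajectories the regression estimate averages over more frames, which is what makes it smoother. Applying the same operation to the delta track gives double-delta coefficients.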