Advanced Digital Signal Processing - PowerPoint PPT Presentation

Title: Advanced Digital Signal Processing

1
Department of Computer Science, University of Joensuu
Joensuu, Finland

Advanced Digital Signal Processing
Lecture 12
Speech Processing
Andrei Mihaila
Speech and Image Processing Unit
Courtesy of Tomi Kinnunen, Speaker Recognition
course, Spring 2004: http://phon.joensuu.fi/kieliteknologia/kurssit/2004k/puhujantunn/

2
Outline

Introduction
Speech production
Speech analysis

3
Introduction
4
Speech processing

Speech processing is the study of speech signals
and the processing methods of these signals.
Linguistics, Phonetics, Phonology
Acoustics, DSP
Cognitive science, Artificial intelligence
Pattern recognition, Statistics
Computer science

5
What is speech?

Sociolinguist: our main means of communication
Phonetician/phonologist: an organized sequence
of phonemes and syllables, linked to each other by
certain rules
Acoustician: wave motion in a medium
DSP engineer: a digital signal to be processed
Pattern recognizer: a pattern to be classified
or analyzed
Computer scientist: a challenge to find
efficient algorithms for processing it

6
Speech acoustics: an example
Waveform: air pressure as a function of time
Spectrogram: relative amplitudes of different
frequencies over time
Intensity contour: modulation of volume
F0 contour: modulation of fundamental frequency
7
Information in speech

message
language
dialect
sex
age
weight/height
voice quality
emotional status

8
The area of speech processing

Recognition tasks
Speech recognition - conversion of speech to
text
Speaker recognition - classify speakers by
their voice
Language recognition - recognizing the spoken
language
Emotion recognition - recognizing the emotional
status of the speaker
Synthesis - conversion from text to speech
Coding - processing for economical
transmission and storage of speech
Manipulation
Speech enhancement - improve the quality of
signal
Voice mapping - convert the speech of one person to
another person's voice
Etc.

9
Complexity of recognition

Speech is very different from written text
No markers between letters or even words!
Coarticulation and deletion effects
Context dependency
The acoustic speech signal varies from one repetition
to another
Different information (message, speaker,
language, etc.) can't be separated by any simple
method: it is all mixed up in the acoustic signal
in a highly complex way

10
Example of complexity

(the word "Puhujantunnistus" uttered by the same male
speaker)

11
Speech applications

speech recognition
Voice dictation
Command and control applications (voice commands)
speaker recognition
Person authentication
Forensics
text to speech (TTS)
Telephony applications
Speech email
Disabled people

12
Speech as a biometric

Biometrics: the science of identifying or
verifying the identity of a person based on
physiological or behavioral characteristics
A biometric: a measurable physiological or
behavioral characteristic used for identifying or
verifying identity

13
Example of biometrics
14
Pros and Cons of Speech

Pros
The most natural way of communicating
Non-intrusive as a biometric
Cheap sensor cost
Easy to integrate with other biometrics
Cons
Not accurate
Can be imitated or resynthesized
Depends on the speaker's physical and emotional
state
Recognition in noisy environments is demanding

15
Speech Production
16
Tract: a system of body parts that together
serve some particular purpose
17
Speech production model
18
Phonation modes

Three types of phonation:
Voiceless phonation: the vocal cords are apart
from each other and the airstream passes through
the open glottis to the vocal tract
Whispering: the glottis is open, but the area of the opening
is smaller than in voiceless phonation
Voiced phonation: the vocal cords successively
open and close, producing a sequence of air
puffs (glottal pulses)
Phoneme: one of a small set of speech sounds
that are distinguished by the speakers of a
particular language

19
Voicing

Fundamental frequency (F0)
The rate at which the vocal folds vibrate (perceived
as pitch)
Average F0 values (European languages):
120 Hz (male), 220 Hz (female), 330 Hz (children)

20
Vocal tract

Includes the pharyngeal, oral and nasal cavities.
Volumes and shapes are individual.
The length of the vocal tract (from glottis to
lips) is on average about 17.5 cm for male speakers
Resonance frequencies are called formants (F1, F2,
...)
Vocal tract acts like a filter that modifies the
frequency content of the signal (source-filter
model)

F1 and F2 are mostly responsible for the phonetic
quality of vowels, whereas higher (vowel)
formants are more dependent on the speaker. It has
also been reported in many studies that phonetic
features are in the mid-frequency range, and that
the low and high ends of the spectrum are more
speaker-dependent.
21
Resonances

During the production of different speech sounds,
the relative volumes of the parts of vocal tract
are different
E.g. different vowels are produced by making one
major constriction in vocal tract with different
parts of the tongue, effectively dividing the VT
into two cavities and the passage between them,
which all have their own resonance characteristics

22
Resonances 2

Assumption usually made about the vocal tract:
the tube is closed at one end (glottis) and open
at the other end (lips)
The tube has natural resonance frequencies,
regardless of whether there is sound or not. The air inside
the tube forms standing wave patterns
at its natural resonances.
The standing wave arises from reflections and
constructive interference between waves
travelling in opposite directions
Reflections occur at both the closed end and
the open end.

23
Vowels
24
Resonance of an acoustic tube

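Open at one end, closed at the other: $F_n = \frac{(2n - 1)\,c}{4L}$, $n = 1, 2, 3, \ldots$
Open at both ends: $F_n = \frac{n\,c}{2L}$, $n = 1, 2, 3, \ldots$
$F_n$: nth formant (resonance) frequency [Hz], $c$: speed of
sound in air (about 340 m/s), $L$: length of the tube [m].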
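As a rough worked example, the sketch below evaluates the standard uniform-tube resonance formulas: (2n - 1)c / 4L for a tube closed at one end and nc / 2L for a tube open at both ends. The function names and the default speed of sound (340 m/s, an approximate room-temperature value) are assumptions made for illustration; the 17.5 cm length is the average male vocal tract length quoted earlier in the lecture.

```python
def closed_open_resonances(length_m, count=3, speed_of_sound=340.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other."""
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, count + 1)]


def open_open_resonances(length_m, count=3, speed_of_sound=340.0):
    """Resonances (Hz) of a uniform tube open at both ends."""
    return [n * speed_of_sound / (2.0 * length_m)
            for n in range(1, count + 1)]


if __name__ == "__main__":
    # Hypothetical usage: a 17.5 cm tube as a crude model of a neutral vocal tract.
    for n, f in enumerate(closed_open_resonances(0.175), start=1):
        print("F%d = %.0f Hz" % (n, f))
```

For the closed-open case the resonances are odd multiples of the lowest one, which is why a neutral vowel shows roughly evenly spaced formants.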

25
Speech Perception

Psychoacoustics: the study of subjective human
perception of sounds
Some correspondence between physical and
perceptual phenomena:
Intensity -> loudness
Fundamental frequency -> pitch
Spectral shape -> timbre
Phase difference in binaural hearing -> location
The ear's response to frequency is roughly
logarithmic (the perceived pitch of a sound is
approximately a logarithmic function of its
frequency)
Frequency resolution of the ear is about 2 Hz (in
the middle range)

26
Octave

The ear is very accustomed to hearing a
fundamental frequency plus harmonics.
The term octave means a factor of two in frequency
Logarithmic representation of frequency
(e.g. the same amount of audio information is
carried in the octave 50 Hz -> 100 Hz as in
10 kHz -> 20 kHz)

27
Frequency scales

Mel (melody) scale:
a perceptual scale of pitches judged by listeners
to be equally spaced from one another
Above 500 Hz, larger and larger intervals are
judged by listeners to produce equal pitch
increments (e.g. 4 octaves in Hz -> 2 octaves in mel)
Bark (critical band) scale: based on the idea
that the peripheral auditory system contains a bank
of analyzing filters within which the energy is
summed (24 critical bands of hearing)
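To make the scales concrete, here is a minimal sketch of a Hz-to-mel conversion and an octave count. The slides do not state which mel formula is meant; the one used here, 2595 * log10(1 + f / 700), is a commonly used approximation and is an assumption, as are the function names.

```python
import math


def hz_to_mel(f_hz):
    """Approximate mel value for a frequency in Hz (assumed formula, see lead-in)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def octaves_between(f_low_hz, f_high_hz):
    """Number of octaves (factors of two) between two frequencies."""
    return math.log2(f_high_hz / f_low_hz)


if __name__ == "__main__":
    # The same 1 kHz step covers fewer mels at high frequencies than at low ones.
    print(hz_to_mel(2000.0) - hz_to_mel(1000.0))
    print(hz_to_mel(4000.0) - hz_to_mel(3000.0))
    # 50 Hz -> 100 Hz and 10 kHz -> 20 kHz are both exactly one octave.
    print(octaves_between(50.0, 100.0), octaves_between(10000.0, 20000.0))
```

With this formula 1000 Hz maps to approximately 1000 mel, and equal steps in Hz correspond to progressively smaller steps in mel as frequency increases.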

28
Mel-frequency warping

Commonly used
Instead of linear processing, the frequency axis is
stretched at the low end and compressed at the
high end according to the critical bands. Thus, the
spectral resolution at low frequencies is
higher than at high frequencies.
Nonlinear frequency axis transformation (referred
to as frequency warping)

29
Speech analysis

Feature extraction

30
Signal Acquisition

Speech signal
a form of wave motion
continuous (analog)
the wave motion appears as sound pressure changes
that are converted with a microphone into
(continuous) voltage changes
Analog-to-digital converter (ADC)
The range of human hearing: 20 Hz -> 20 kHz
More sensitive to sounds between 1 kHz and 4 kHz
High fidelity music (CD audio): sampling rate
44.1 kHz, 16 bits per sample
Bandwidths: 20 kHz (music), 3.2 kHz (speech)
Telephone quality: sampling rate 8 kHz, 12 or 8 bits per sample
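The sampling figures above translate into raw (uncompressed) bit rates by simple arithmetic, and the sampling rate bounds the representable bandwidth at half its value (the Nyquist limit). The sketch below is only an illustration; the function names and the `channels` parameter are assumptions, not part of the lecture.

```python
def bit_rate_kbps(sample_rate_hz, bits_per_sample, channels=1):
    """Raw bit rate in kbit/s for uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels / 1000.0


def max_bandwidth_hz(sample_rate_hz):
    """Highest frequency representable at a given sampling rate (Nyquist limit)."""
    return sample_rate_hz / 2.0


if __name__ == "__main__":
    # CD audio, one channel: 44100 * 16 = 705.6 kbit/s
    print(bit_rate_kbps(44100, 16))
    # Telephone quality at 8 bits per sample: 8000 * 8 = 64 kbit/s
    print(bit_rate_kbps(8000, 8))
    # 8 kHz sampling covers up to 4 kHz, enough for the 3.2 kHz speech bandwidth.
    print(max_bandwidth_hz(8000))
```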

32
Features

Feature: a measure of a property of the speech
waveform
Reasons for feature extraction:
Redundant and harmful information is removed
Reduced computation time
Easier modeling of the feature distribution
Speech has many natural (acoustic-phonetic)
features:
Fundamental frequency (F0), formant frequencies,
formant bandwidths, spectral tilt, intensity,
phone durations, articulation, etc.
Not-so-natural features:
Cepstrum, linear predictive coefficients, line
spectral frequencies, vocal tract area function,
delta and double-delta coefficients, etc.

33
Why do we need feature extraction?

The acoustic speech signal varies over time, so two
waveforms can't be compared directly
Example: two instances of the vowel /a/ spoken in
isolation, with a time interval between repetitions
of < 1 second

34
Example of features

robust against channel effect and noise
hard to automatically extract
Requires lots of training data
Complicated models
text dependence

easy to automatically extract
small amount of data needed
text independence
easy models
easily corrupted by noise and inter-session
variability

35
Feature extraction from speech

The speech signal is processed in short time
segments, called frames
Typical frame length: 10-30 ms. Adjacent frames
overlap by 30-75% (e.g. 10 ms)
For each frame, a feature vector is computed
using DSP algorithm(s)
Underlying assumption: the speech signal is locally
stationary (its statistics don't change over the
frame)
Valid for steady sounds but not necessarily for
transients
A shorter frame length gives better time
resolution, but at the cost of frequency
resolution
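The framing step can be sketched as follows. The 25 ms frame length and 10 ms frame shift are assumed values chosen to fall within the ranges quoted above (they give a 60% overlap); the function name and the choice to drop an incomplete final frame are also assumptions made for illustration.

```python
def split_into_frames(signal, sample_rate_hz, frame_ms=25.0, shift_ms=10.0):
    """Cut a list of samples into overlapping fixed-length frames.

    A trailing piece shorter than one frame is dropped.
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    shift = int(round(sample_rate_hz * shift_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += shift
    return frames


if __name__ == "__main__":
    # Hypothetical input: one second of silence sampled at 8 kHz.
    one_second = [0.0] * 8000
    frames = split_into_frames(one_second, 8000)
    print(len(frames), "frames of", len(frames[0]), "samples")
```

Each frame would then be passed to the feature computation; shortening `frame_ms` improves time resolution but leaves fewer samples per frame, and therefore coarser frequency resolution.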

36
More about windowing

The DFT assumes that the local waveform is periodic:
the discontinuities at the frame edges are
interpreted as being part of a signal that
repeats indefinitely.
Windowing suppresses the effect of the
discontinuities
Framing and windowing provide a thorough analysis,
since each speech sound is approximately centered
within some frame

37
Examples of windowing

Less spectral leakage in the case of Hamming
window
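The leakage claim can be checked numerically with a small experiment. The sketch below builds a sinusoid that does not complete a whole number of cycles in the frame, computes its DFT with and without a Hamming window, and compares how much magnitude ends up far from the spectral peak. The frame length, the test frequency, the `guard` width and the leakage measure itself are all assumptions chosen for illustration; the slide's own figure is not reproduced here.

```python
import cmath
import math


def hamming(n_samples):
    """Symmetric Hamming window: 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) DFT, fine for short frames)."""
    n_samples = len(frame)
    mags = []
    for k in range(n_samples // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_samples)
                  for n, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def leakage_ratio(mags, guard=5):
    """Sum of magnitudes more than `guard` bins from the peak, divided by the peak."""
    peak = max(range(len(mags)), key=lambda k: mags[k])
    far = sum(m for k, m in enumerate(mags) if abs(k - peak) > guard)
    return far / mags[peak]


N = 64
# 10.5 cycles per frame, so the periodic extension is discontinuous at the edges.
frame = [math.sin(2.0 * math.pi * 10.5 * n / N) for n in range(N)]
windowed = [x * w for x, w in zip(frame, hamming(N))]

rect_leak = leakage_ratio(dft_magnitudes(frame))        # no window (rectangular)
hamming_leak = leakage_ratio(dft_magnitudes(windowed))  # Hamming window

if __name__ == "__main__":
    print("rectangular:", rect_leak)
    print("hamming:    ", hamming_leak)
```

The Hamming-windowed frame concentrates its energy closer to the peak, at the price of a wider main lobe.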

38
Signal pre-emphasis

High frequency portion of the speech signal is
usually pre-emphasized prior to frequency
analysis
Pre-emphasis filter is selected to give
approximately 6 dB/octave boost
The 6 dB is meant for canceling the effect of
the glottal source so that the frequency spectrum
describes the vocal tract characteristics
Typically, the following FIR digital filter is
used
Since the pre-emphasis filter is meant for
canceling the effect of the glottal source,
unvoiced/whispered sounds should not be
pre-emphasized!
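The filter equation itself did not survive in this transcript. A common choice, assumed here rather than taken from the slide, is the first-order FIR filter y[n] = x[n] - a * x[n-1] with a close to 1 (0.97 in this sketch). The function name and the treatment of the first sample are also illustrative assumptions.

```python
def pre_emphasize(signal, coeff=0.97):
    """First-order FIR pre-emphasis: y[n] = x[n] - coeff * x[n-1].

    The sample before the start of the signal is taken to be 0.
    """
    out = []
    prev = 0.0
    for x in signal:
        out.append(x - coeff * prev)
        prev = x
    return out


if __name__ == "__main__":
    # A constant (zero-frequency) input is almost cancelled after the first sample...
    print(pre_emphasize([1.0, 1.0, 1.0, 1.0]))
    # ...while the fastest possible alternation is nearly doubled in amplitude.
    print(pre_emphasize([1.0, -1.0, 1.0, -1.0]))
```

The two inputs show the high-pass character of the filter: low frequencies are attenuated and high frequencies are boosted.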

39
Signal pre-emphasis example
41
The filterbank

Each of the filters just computes a weighted
sum/average of that subband
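A weighted sum per subband can be sketched as below. The triangular filter shape and the way neighbouring filters share band edges are assumptions borrowed from common mel-filterbank practice; the slide's own filter shapes are not available in this transcript. The band edges are given as spectrum bin indices and would normally be spaced on a warped (e.g. mel) frequency axis.

```python
def triangular_weights(n_bins, left, center, right):
    """Triangular weights over spectrum bins: 0 at `left` and `right`, 1 at `center`.

    Requires left < center < right.
    """
    weights = [0.0] * n_bins
    for k in range(left + 1, right):
        if k <= center:
            weights[k] = (k - left) / float(center - left)
        else:
            weights[k] = (right - k) / float(right - center)
    return weights


def filterbank_energies(power_spectrum, band_edges):
    """One weighted sum per filter; edges i, i+1, i+2 define filter i."""
    energies = []
    for i in range(len(band_edges) - 2):
        w = triangular_weights(len(power_spectrum),
                               band_edges[i], band_edges[i + 1], band_edges[i + 2])
        energies.append(sum(wk * pk for wk, pk in zip(w, power_spectrum)))
    return energies


if __name__ == "__main__":
    # Hypothetical flat power spectrum with 9 bins and three overlapping filters.
    print(filterbank_energies([1.0] * 9, [0, 2, 4, 6, 8]))
```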

42
Two basic types of features

Static (instantaneous) features
Computed from one or more frames. Give a
snapshot of the associated articulators at that
time interval.
Loose analogy: physiological/organic features
Dynamic features
Dynamic changes (over time) of some static
feature
Assumed to correlate with speaking rate,
coarticulation, rhythm, etc.
Loose analogy: learned/behavioral features

43
Cepstrum

Most widely used feature
Several variants:
Mel-frequency cepstral coefficients (MFCC)
Linear predictive cepstral coefficients (LPCC)
Usually the higher coefficients are thrown away
and a small number of the lowest coefficients
(10-20) are retained, excluding the lowest
coefficient c0
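The slides that derive the cepstrum have no transcript, so the following is only a sketch of one common MFCC-style recipe, assumed here rather than taken from the lecture: take the logarithm of the filterbank energies and apply a DCT-II, keeping coefficients c1 to c12 and dropping c0. The function name, the unnormalized DCT and the default of 12 coefficients are illustrative choices.

```python
import math


def cepstral_coefficients(filterbank_energies, n_keep=12):
    """DCT-II of the log filterbank energies, returning c1..c_n_keep (c0 dropped).

    Energies must be positive; n_keep should be smaller than the number of filters.
    """
    log_e = [math.log(e) for e in filterbank_energies]
    n_filters = len(log_e)
    coeffs = []
    for i in range(1, n_keep + 1):
        c = sum(log_e[j] * math.cos(math.pi * i * (j + 0.5) / n_filters)
                for j in range(n_filters))
        coeffs.append(c)
    return coeffs


if __name__ == "__main__":
    # Hypothetical energies from a 20-filter bank with a gently falling spectrum.
    energies = [100.0 / (j + 1) for j in range(20)]
    print(cepstral_coefficients(energies))
```

Because c0 only reflects the overall log energy, a perfectly flat spectrum gives coefficients c1 and above that are all zero up to rounding error.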

47
Computation of dynamic features

Time sequence of any feature: $f_1^{(i)}, f_2^{(i)}, \ldots, f_N^{(i)}$,
where $f_k^{(i)}$ is the ith feature of the kth
frame
We want to estimate the rate of change (1st
derivative) of this feature at each time instant
(frame)
Differentiator method
Linear regression method
The regression formula represents the slope of
the least-squares fitted line. Similar formulas
can be derived for higher-order polynomials.
Feature trajectories become smoother
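The two formulas are missing from this transcript, so the sketch below uses commonly seen forms that are assumed here: a two-sided difference for the differentiator, and the least-squares slope sum(m * f[k+m]) / sum(m^2) over m = -M..M for the regression method. The function names, the span defaults and the edge handling (repeating the first and last frame) are illustrative assumptions.

```python
def _clamped(track, k):
    """Value at frame k, repeating the first/last value beyond the ends."""
    return track[min(max(k, 0), len(track) - 1)]


def delta_differentiator(track, span=1):
    """Two-sided difference: (f[k+span] - f[k-span]) / (2 * span)."""
    return [(_clamped(track, k + span) - _clamped(track, k - span)) / (2.0 * span)
            for k in range(len(track))]


def delta_regression(track, span=2):
    """Slope of the least-squares line fitted to frames k-span..k+span."""
    denom = 2.0 * sum(m * m for m in range(1, span + 1))
    return [sum(m * (_clamped(track, k + m) - _clamped(track, k - m))
                for m in range(1, span + 1)) / denom
            for k in range(len(track))]


if __name__ == "__main__":
    # Hypothetical trajectory of one feature over ten frames, rising by 2 per frame.
    track = [2.0 * k for k in range(10)]
    print(delta_differentiator(track))
    print(delta_regression(track))
```

Away from the edges both methods recover the slope of a linear trajectory exactly; on noisy data the regression estimate is smoother because it averages over more frames. Applying either function to its own output gives double-delta features.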