Advanced Digital Signal Processing - PowerPoint PPT Presentation

Provided by: andreim7

Transcript and Presenter's Notes

1
DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF
JOENSUU, JOENSUU, FINLAND
  • Advanced Digital Signal Processing
  • Lecture 12
  • Speech Processing
  • Andrei Mihaila
  • Speech and Image Processing Unit
  • Courtesy of Tomi Kinnunen, Speaker Recognition
    course, Spring 2004,
    http://phon.joensuu.fi/kieliteknologia/kurssit/2004k/puhujantunn/

2
Outline
  • Introduction
  • Speech production
  • Speech analysis

3
Introduction
4
Speech processing
  • Speech processing is the study of speech signals
    and the processing methods of these signals.
  • Linguistics, Phonetics and Phonology
  • Acoustics, DSP
  • Cognitive science, Artificial intelligence
  • Pattern recognition, Statistics
  • Computer science

5
What is speech?
  • Sociolinguist: our main means of communication
  • Phonetician/phonologist: an organized sequence
    of phonemes and syllables, linked to each other by
    certain rules
  • Acoustician: wave motion in a medium
  • DSP engineer: a digital signal to be processed
  • Pattern recognizer: a pattern to be classified
    or analyzed
  • Computer scientist: a challenge to find
    efficient algorithms for processing it

6
Speech acoustics: an example
Waveform: air pressure as a function of time
Spectrogram: relative amplitudes of different
frequencies over time
Intensity contour: modulation of volume
F0 contour: modulation of fundamental frequency
7
Information in speech
  • message
  • language
  • dialect
  • sex
  • age
  • weight/height
  • voice quality
  • emotional status

8
The area of speech processing
  • Recognition tasks
  • Speech recognition - conversion of speech to
    text
  • Speaker recognition - classify speakers by
    their voice
  • Language recognition - recognizing the spoken
    language
  • Emotion recognition - recognizing the emotional
    status of the speaker
  • Synthesis - conversion from text to speech
  • Coding - processing for economical
    transmission and storage of speech
  • Manipulation
  • Speech enhancement - improving the quality of
    the signal
  • Voice mapping - converting the speech of one
    person to another person's voice
  • Etc.

9
Complexity of recognition
  • Speech is very different from written text
  • No markers between letters or even words!
  • Coarticulation and deletion effects
  • Context dependency
  • The acoustic speech signal varies from one
    repetition to another
  • Different kinds of information (message, speaker,
    language, etc.) cannot be separated by any simple
    method; they are mixed up in the acoustic signal
    in a highly complex way

10
Example of complexity
  • (The word "Puhujantunnistus" uttered by the same
    male speaker)

11
Speech applications
  • speech recognition
  • Voice dictation
  • Command and control applications (voice commands)
  • speaker recognition
  • Person authentication
  • Forensics
  • text to speech (TTS)
  • Telephony applications
  • Speech email
  • Disabled people

12
Speech as a biometric
  • Biometrics: the science of identifying or
    verifying the identity of a person based on
    physiological or behavioral characteristics
  • A biometric: a measurable physiological or
    behavioral characteristic used for identifying a
    person or verifying their identity

13
Example of biometrics
14
Pros and Cons of Speech
  • Pros
  • The most natural way of communicating
  • Non-intrusive as a biometric
  • Cheap sensor cost
  • Easy to integrate with other biometrics
  • Cons
  • Not accurate
  • Can be imitated or resynthesized
  • Depends on the speaker's physical and emotional
    state
  • Recognition in noisy environments is demanding

15
Speech Production
16
Tract: a system of body parts that together
serve some particular purpose
17
Speech production model
18
Phonation modes
  • Three types of phonation
  • Voiceless phonation: the vocal cords are apart
    from each other and the airstream passes through
    the open glottis to the vocal tract
  • Whispering: the glottis is open, but the area of
    the opening is smaller than in voiceless phonation
  • Voiced phonation: the vocal cords successively
    open and close, producing a sequence of air
    puffs (glottal pulses)
  • Phoneme: one of a small set of speech sounds
    that are distinguished by the speakers of a
    particular language

19
Voicing
  • Fundamental frequency (F0)
  • The rate at which the vocal folds vibrate (often
    loosely called pitch)
  • Average F0 values (European languages)
  • 120 Hz (male), 220 Hz (female), 330 Hz (children)

20
Vocal tract
  • Includes the pharyngeal, oral and nasal cavities.
    Their volumes and shapes differ between individuals.
  • The length of the vocal tract (from glottis to
    lips) is on average 17.5 cm for male speakers
  • Resonance frequencies are called formants (F1, F2,
    ...)
  • The vocal tract acts like a filter that modifies the
    frequency content of the signal (source-filter
    model)

F1 and F2 are mostly responsible for the phonetic
quality of vowels, whereas higher (vowel)
formants are more dependent on the speaker. It has
also been reported in many studies that phonetic
features lie in the mid-frequency range, and that
the low and high ends of the spectrum are more
speaker-dependent.
21
Resonances
  • During the production of different speech sounds,
    the relative volumes of the parts of the vocal
    tract are different
  • E.g. different vowels are produced by making one
    major constriction in the vocal tract with different
    parts of the tongue, effectively dividing the VT
    into two cavities and the passage between them,
    each of which has its own resonance characteristics

22
Resonances 2
  • Assumption usually made about the vocal tract:
    the tube is closed at one end (glottis) and open
    at the other end (lips)
  • The tube has natural resonance frequencies,
    whether or not sound is present. The air inside
    the tube forms standing wave patterns
    at its natural resonances.
  • The standing wave arises from reflections and
    constructive interference between the waves
    travelling in opposite directions
  • Reflections occur at both the closed end and
    the open end.

23
Vowels
24
Resonance of an acoustic tube
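  • Open at one end, closed at the other:
    $F_n = \frac{(2n-1)\,c}{4L}$, $n = 1, 2, \ldots$
  • Open at both ends:
    $F_n = \frac{n\,c}{2L}$, $n = 1, 2, \ldots$
  • $F_n$: $n$th formant (resonance) frequency [Hz],
    $c$: speed of sound in air (about 340 m/s),
    $L$: length of the tube [m]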
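
The tube resonances can be checked numerically. The following is a minimal sketch that assumes the standard uniform-tube formulas (odd quarter-wavelength resonances for a closed-open tube, half-wavelength resonances for an open-open tube). The function names and the default speed of sound (340 m/s) are illustrative assumptions, not taken from the slides.

```python
def closed_open_resonances(length_m, count, c=340.0):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other.

    Assumes F_n = (2n - 1) * c / (4 * L); c is an assumed speed of sound in m/s.
    """
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]


def open_open_resonances(length_m, count, c=340.0):
    """Resonances (Hz) of a uniform tube open at both ends: F_n = n * c / (2 * L)."""
    return [n * c / (2.0 * length_m) for n in range(1, count + 1)]


if __name__ == "__main__":
    # 17.5 cm is the average male vocal tract length quoted on the vocal tract slide.
    for n, f in enumerate(closed_open_resonances(0.175, 3), start=1):
        print(f"F{n} ~ {f:.0f} Hz")
```

For a 17.5 cm closed-open tube the first resonances fall near 500, 1500 and 2500 Hz, which is roughly where the formants of a neutral vowel are usually placed.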

25
Speech Perception
  • Psychoacoustics: the study of subjective human
    perception of sounds
  • Some correspondence between physical and
    perceptual phenomena
  • Intensity -> loudness
  • Fundamental frequency -> pitch
  • Spectral shape -> timbre
  • Phase difference in binaural hearing -> location
  • The ear's response to frequency is roughly
    logarithmic (perceived pitch is approximately a
    logarithmic function of frequency)
  • Frequency resolution of the ear is about 2 Hz (in
    the middle range)

26
Octave
  • The ear is very accustomed to hearing a
    fundamental frequency plus harmonics.
  • The term octave means a factor of two in frequency
  • Logarithmic representation of frequency
  • E.g. the same amount of audio information is
    carried in the octave 50 Hz -> 100 Hz as in the
    octave 10 kHz -> 20 kHz
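
Since an octave is a factor of two in frequency, the number of octaves between two frequencies is a base-2 logarithm of their ratio. A minimal sketch (the function name is an illustrative assumption):

```python
import math


def octaves_between(f_low_hz, f_high_hz):
    """Number of octaves spanned by the interval [f_low_hz, f_high_hz]."""
    return math.log2(f_high_hz / f_low_hz)


if __name__ == "__main__":
    print(octaves_between(50.0, 100.0))        # one octave
    print(octaves_between(10000.0, 20000.0))   # also one octave
    print(octaves_between(20.0, 20000.0))      # nominal range of hearing
```

Both example intervals span exactly one octave even though one is 50 Hz wide and the other 10 kHz wide; the nominal hearing range of 20 Hz to 20 kHz covers just under ten octaves.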

27
Frequency scales
  • Mel (melody) scale
  • A perceptual scale of pitches judged by listeners
    to be equally spaced from one another
  • Above 500 Hz, larger and larger intervals are
    judged by listeners to produce equal pitch
    increments (e.g. 4 octaves in Hz -> 2 octaves in mel)
  • Bark (critical band) scale: based on the idea
    that the peripheral auditory system contains a bank
    of analyzing filters within which the energy is
    summed (24 critical bands of hearing)

28
Mel-frequency warping
  • commonly used
  • Instead of linear processing, the frequency axis is
    stretched at the low end and compressed at the
    high end according to the critical bands. Thus, the
    spectral resolution at low frequencies is
    higher than at high frequencies.
  • Nonlinear transformation of the frequency axis
    (referred to as frequency warping)
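
The slides do not give a warping formula. One widely used analytic approximation of the mel scale is mel = 2595 log10(1 + f/700); the sketch below assumes that particular formula and uses illustrative function names.

```python
import math


def hz_to_mel(f_hz):
    """Map frequency in Hz to mels (assumed approximation: 2595*log10(1 + f/700))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


if __name__ == "__main__":
    # The same 100 Hz step covers far more mels at the low end than at the high end.
    print(hz_to_mel(200.0) - hz_to_mel(100.0))
    print(hz_to_mel(4100.0) - hz_to_mel(4000.0))
```

Placing filter center frequencies at equal steps on the mel axis and mapping them back with `mel_to_hz` gives narrow, densely spaced bands at low frequencies and wide bands at high frequencies, which is the warping described above.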

29
Speech analysis
  • Feature extraction

30
Signal Acquisition
  • Speech signal
  • a form of wave motion
  • continuous or analog
  • the wave motion appears as sound pressure changes
    that are converted with a microphone into
    (continuous) voltage changes
  • Analog-to-digital converter (ADC)
  • The range of human hearing: 20 Hz -> 20 kHz
  • Most sensitive to sounds between 1 kHz and 4 kHz
  • High-fidelity music (CD audio): sampling rate
    44.1 kHz, 16 bits per sample
  • Bandwidths: 20 kHz (music), 3.2 kHz (speech)
  • Telephone quality: sampling rate 8 kHz, 12 or 8
    bits per sample
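
The sampling parameters above translate directly into a raw (uncompressed PCM) bit rate and a maximum representable bandwidth of half the sampling rate. A minimal sketch with illustrative function names; the two-channel CD case is an added assumption, since the slide gives only the per-sample figures.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Raw PCM bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def nyquist_bandwidth(sample_rate_hz):
    """Highest frequency representable at the given sampling rate (Hz)."""
    return sample_rate_hz / 2.0


if __name__ == "__main__":
    print(pcm_bit_rate(44100, 16, channels=2))  # CD audio, stereo assumed
    print(pcm_bit_rate(8000, 8))                # telephone quality, 8 bits
    print(nyquist_bandwidth(44100), nyquist_bandwidth(8000))
```

A 44.1 kHz sampling rate leaves room for the 20 kHz music bandwidth, and 8 kHz leaves room for the 3.2 kHz speech bandwidth.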

31
(No Transcript)
32
Features
  • Feature: a measure of a property of the speech
    waveform
  • Reasons for feature extraction
  • Redundant and harmful information is removed
  • Reduced computation time
  • Easier modeling of the feature distribution
  • Speech has many natural (acoustic-phonetic)
    features
  • Fundamental frequency (F0), formant frequencies,
    formant bandwidths, spectral tilt, intensity,
    phone durations, articulation, etc.
  • Not-so-natural features
  • Cepstrum, linear predictive coefficients, line
    spectral frequencies, vocal tract area function,
    delta and double-delta coefficients, etc.

33
Why do we need feature extraction?
  • The acoustic speech signal varies over time, so two
    waveforms cannot be compared directly
  • Example: two instances of the vowel /a/ spoken in
    isolation, with a time interval of < 1 second
    between repetitions

34
Example of features
  • robust against channel effect and noise
  • hard to automatically extract
  • Requires lots of training data
  • Complicated models
  • text dependence
  • easy to automatically extract
  • small amount of data needed
  • text independence
  • easy models
  • easily corrupted by noise and inter-session
    variability

35
Feature extraction from speech
  • The speech signal is processed in short time
    segments called frames
  • Typical frame length: 10-30 ms. Adjacent frames
    overlap by 30-75% (e.g. 10 ms)
  • For each frame, a feature vector is computed
    using DSP algorithm(s)
  • Underlying assumption: the speech signal is locally
    stationary (its statistics do not change over the
    frame)
  • Valid for steady sounds but not necessarily for
    transients
  • A shorter frame length gives better time
    resolution, but at the cost of frequency
    resolution
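
Framing can be sketched as a sliding window over the sample sequence. The example below assumes a 30 ms frame and a 10 ms frame shift (one possible reading of the "10 ms" example, giving about 67% overlap); the function name and defaults are illustrative.

```python
def split_into_frames(signal, sample_rate_hz, frame_ms=30.0, shift_ms=10.0):
    """Cut a sequence of samples into overlapping, equal-length frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    shift = int(round(sample_rate_hz * shift_ms / 1000.0))
    if frame_len <= 0 or shift <= 0:
        raise ValueError("frame length and shift must be at least one sample")
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += shift
    return frames


if __name__ == "__main__":
    one_second = list(range(8000))  # dummy samples, 1 s at 8 kHz
    frames = split_into_frames(one_second, 8000)
    print(len(frames), len(frames[0]))
```

Each frame would then be passed to the feature computation, so one second of speech yields on the order of a hundred feature vectors at this frame shift.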

36
More about windowing
  • The DFT assumes that the local waveform is periodic
  • The discontinuities at the frame edges are
    therefore interpreted as part of an infinitely
    repeating signal.
  • Windowing suppresses the effect of the
    discontinuities
  • Overlapping frames and windowing together provide
    a thorough analysis, since each speech sound is
    approximately centered within some frame

37
Examples of windowing
  • Less spectral leakage in the case of the Hamming
    window
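
A minimal sketch of the Hamming window, assuming the usual symmetric definition w[n] = 0.54 - 0.46 cos(2*pi*n/(N-1)); the function names are illustrative.

```python
import math


def hamming_window(length):
    """Symmetric Hamming window of the given length."""
    if length < 1:
        raise ValueError("window length must be positive")
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def apply_window(frame, window):
    """Multiply a frame by a window, sample by sample."""
    return [x * w for x, w in zip(frame, window)]


if __name__ == "__main__":
    w = hamming_window(5)
    print([round(v, 2) for v in w])
```

The window tapers each frame toward its edges (down to 0.08 of the original amplitude), which is what reduces the edge discontinuities seen by the DFT.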

38
Signal pre-emphasis
  • The high-frequency portion of the speech signal is
    usually pre-emphasized prior to frequency
    analysis
  • The pre-emphasis filter is selected to give
    an approximately 6 dB/octave boost
  • The 6 dB/octave boost is meant to cancel the effect
    of the glottal source so that the frequency spectrum
    describes the vocal tract characteristics
  • Typically, a first-order FIR digital filter is
    used
  • Since the pre-emphasis filter is meant to
    cancel the effect of the glottal source,
    unvoiced/whispered sounds should not be
    pre-emphasized!
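
The filter formula from the slide is not in the transcript. A common choice is the first-order difference y[n] = x[n] - a*x[n-1] with a slightly below 1; the sketch below assumes that form, and the default a = 0.97 is an assumed typical value rather than the lecturer's.

```python
def pre_emphasize(signal, alpha=0.97):
    """First-order FIR pre-emphasis: y[n] = x[n] - alpha * x[n-1].

    The first sample is passed through unchanged (x[-1] is taken as 0).
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out


if __name__ == "__main__":
    print(pre_emphasize([1.0, 1.0, 1.0, 1.0]))     # slowly varying input is attenuated
    print(pre_emphasize([1.0, -1.0, 1.0, -1.0]))   # rapidly varying input is boosted
```

A constant (low-frequency) input is almost cancelled, while an input alternating at half the sampling rate is nearly doubled, which illustrates the high-frequency boost.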

39
Signal pre-emphasis example
40
(No Transcript)
41
The filterbank
  • Each of the filters just computes a weighted
    sum/average of that subband
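
The weighted-sum idea can be sketched as follows. Triangular filter shapes are assumed here because they are a common choice; the slide's own filter shapes are not in the transcript, and all names and bin indices are illustrative.

```python
def triangular_weights(num_bins, left, center, right):
    """Weights of one triangular filter over DFT bins 0..num_bins-1.

    The weight rises linearly from bin `left` to 1.0 at `center`, then falls
    to zero at `right`. Requires left < center < right.
    """
    weights = [0.0] * num_bins
    for k in range(num_bins):
        if left < k <= center:
            weights[k] = (k - left) / (center - left)
        elif center < k < right:
            weights[k] = (right - k) / (right - center)
    return weights


def filterbank_outputs(magnitude_spectrum, filters):
    """One weighted sum of the spectrum per filter."""
    return [sum(w * m for w, m in zip(f, magnitude_spectrum)) for f in filters]


if __name__ == "__main__":
    bank = [triangular_weights(9, 0, 2, 4), triangular_weights(9, 2, 5, 8)]
    print(filterbank_outputs([1.0] * 9, bank))
```

Each output is a single number summarizing the energy in one subband, so a spectrum with hundreds of bins is reduced to a few dozen filterbank values.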

42
Two basic types of features
  • Static (instantaneous) features
  • Computed from one or more frames. Gives a
    snapshot of the associated articulators during
    that time interval.
  • Loose analogy: physiological/organic features
  • Dynamic features
  • Dynamic changes (over time) of some static
    feature
  • Assumed to correlate with speaking rate,
    coarticulation, rhythm, etc.
  • Loose analogy: learned/behavioral features

43
Cepstrum
  • most widely used feature
  • several variants
  • Mel-frequency cepstral coefficients (MFCC)
  • Linear predictive cepstral coefficients (LPCC)
  • Usually the higher coefficients are discarded
    and a small number of the lowest coefficients
    (10-20) are retained, excluding the lowest
    coefficient c0
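
The slides that show the cepstrum computation have no transcript. As a hedged illustration, the sketch below assumes the step commonly used for MFCCs: a type-II discrete cosine transform of the log filterbank energies, keeping the lowest coefficients and dropping c0. Function names and the unnormalized DCT are assumptions.

```python
import math


def cepstrum_from_energies(energies, num_coeffs):
    """Unnormalized DCT-II of log filterbank energies; returns c0..c(num_coeffs-1).

    All energies must be strictly positive.
    """
    m_total = len(energies)
    log_e = [math.log(e) for e in energies]
    return [
        sum(log_e[m] * math.cos(math.pi * k * (m + 0.5) / m_total)
            for m in range(m_total))
        for k in range(num_coeffs)
    ]


def keep_low_coefficients(coeffs, count):
    """Drop c0 and keep the next `count` coefficients."""
    return coeffs[1:1 + count]


if __name__ == "__main__":
    energies = [4.0, 3.0, 2.5, 2.0, 1.5, 1.2, 1.0, 0.8]  # made-up values
    c = cepstrum_from_energies(energies, 6)
    print(keep_low_coefficients(c, 5))
```

With this definition c0 depends only on the overall log energy, which is one reason it is often left out of the feature vector.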

44
(No Transcript)
45
(No Transcript)
46
(No Transcript)
47
Computation of dynamic features
  • Time sequence of any feature: $f_{1,i}, f_{2,i},
    \ldots, f_{N,i}$, where $f_{k,i}$ is the $i$th
    feature of the $k$th frame
  • We want to estimate the rate of change (1st
    derivative) of this feature at each time instant
    (frame)
  • Differentiator method
  • Linear regression method
  • The regression formula represents the slope of
    the least-squares fitted line. Similar formulas
    can be derived for higher-order polynomials
  • Feature trajectories become smoother
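
The two estimators can be sketched as follows. The exact formulas from the slide are not in the transcript, so this assumes a central difference for the differentiator method and the usual least-squares slope over a window of 2M+1 frames for the regression method; edge frames are handled by repeating the first and last values, which is also an assumption.

```python
def delta_difference(track):
    """Central difference (f[k+1] - f[k-1]) / 2 for one feature trajectory."""
    n = len(track)
    return [(track[min(k + 1, n - 1)] - track[max(k - 1, 0)]) / 2.0
            for k in range(n)]


def delta_regression(track, half_width=2):
    """Slope of the least-squares line fitted over frames k-M..k+M.

    Uses sum_m m * (f[k+m] - f[k-m]) / (2 * sum_m m^2), m = 1..M,
    with M = half_width >= 1.
    """
    n = len(track)
    denom = 2.0 * sum(m * m for m in range(1, half_width + 1))
    deltas = []
    for k in range(n):
        num = 0.0
        for m in range(1, half_width + 1):
            ahead = track[min(k + m, n - 1)]
            behind = track[max(k - m, 0)]
            num += m * (ahead - behind)
        deltas.append(num / denom)
    return deltas


if __name__ == "__main__":
    ramp = [3.0 * k for k in range(10)]  # feature rising by 3 per frame
    print(delta_difference(ramp)[5], delta_regression(ramp)[5])
```

Both estimators recover the slope of a linear ramp away from the edges; on noisy trajectories the regression estimate averages over more frames and is therefore smoother.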