Acoustics of Speech - PowerPoint PPT Presentation

About This Presentation

Title:

Acoustics of Speech

Slides: 30

Provided by: juliahir

Learn more at: http://www1.cs.columbia.edu

Transcript and Presenter's Notes

1
Acoustics of Speech

Julia Hirschberg
CS 4706

2
Claim: How things are said can be critical to
understanding

I.e., varying phrasing, prominence, pitch range,
speaking rate, pitch contour, and voice
quality conveys meaning
What is our evidence? How do we prove it?
Observation
Hypotheses
Experimentation (perception, production)
Speech analysis (independent variables)
Correlation with dependent variable

What does our data look like?
What tools do we have for analysis?

4
What is sound?

Pressure fluctuations in the air caused by a
musical instrument, a car horn, a voice
Cause eardrum to move
Auditory system translates into neural impulses
Brain interprets as sound
Can we tell one sound from another?
Can we distinguish one particular sound in
noise?

From a speech-centric point of view, when sound
is not produced by the human voice, we may term
it noise
Ratio of speech-generated sound to other
simultaneous sound: signal-to-noise ratio
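
A minimal sketch of the ratio above, assuming the speech and the other sound are available as separate lists of samples. The helper names `mean_power` and `snr_db` are illustrative, and reporting the ratio in decibels is a common convention rather than something stated on the slide.

```python
import math

def mean_power(samples):
    """Average of the squared sample values."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))

# Toy data: the "noise" is the same waveform at one tenth of the amplitude,
# so its power is one hundredth of the speech power.
speech = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(8000)]
noise = [0.1 * s for s in speech]
print(round(snr_db(speech, noise), 1))
```

A tenfold amplitude difference corresponds to a hundredfold power difference, which is why the toy example comes out at about 20 dB.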

6
How Loud are Common Sounds?

Event          Pressure (µPa)   dB
Absolute       20               0
Whisper        200              20
Quiet office   2K               40
Conversation   20K              60
Bus            200K             80
Subway         2M               100
Thunder        20M              120
DAMAGE         200M             140

7
Some Sounds are Periodic

Simple periodic waves (sine waves) are defined by:
Frequency: how often the pattern repeats per time
unit
Cycle: one repetition
Period: duration of one cycle
Frequency: cycles per time unit, e.g.
frequency in Hz = 1 / period in sec
Horizontal axis of waveform
Amplitude: peak deviation of pressure from normal
atmospheric pressure
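
The definitions above can be sketched in a few lines. This is only an illustration: the function name `sine_wave`, the 8000 Hz sampling rate, and the 0.01 s period are assumptions chosen for the example.

```python
import math

def sine_wave(freq_hz, amplitude, duration_s, sample_rate):
    """Samples of a simple periodic wave with the given frequency and amplitude."""
    n_samples = int(round(duration_s * sample_rate))
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

period_s = 0.01                 # duration of one cycle
freq_hz = 1.0 / period_s        # frequency in Hz = 1 / period in sec
sample_rate = 8000              # samples per second (assumed)
wave = sine_wave(freq_hz, 1.0, 2 * period_s, sample_rate)   # two cycles

samples_per_cycle = int(round(period_s * sample_rate))
print(freq_hz, samples_per_cycle, len(wave))
```

Because the wave is periodic, `wave[n]` and `wave[n + samples_per_cycle]` agree up to rounding, and the largest sample value is the amplitude.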

Phase: timing of waveform relative to a reference
point
Complex periodic waves
Cyclic but composed of two or more sine waves
Fundamental frequency (F0): rate at which the largest
pattern repeats (also the GCD of the component frequencies)
Components are not always easily identifiable: a power
spectrum graphs amplitude vs. frequency
Any complex waveform can be analyzed into a set
of sine waves with their own frequencies,
amplitudes, and phases (Fourier's theorem)
E.g. some speech sounds (mostly vowels): cat.wav
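
A small sketch of the two ideas above: F0 as the GCD of the component frequencies, and Fourier analysis recovering the components of a complex wave. The 200 Hz and 300 Hz components, the 800 Hz sampling rate, and the function names are assumptions for illustration. The transform is a naive discrete Fourier transform, not an optimized FFT.

```python
import cmath
import math
from functools import reduce

def fundamental_hz(component_freqs_hz):
    """F0 of a sum of sines with integer frequencies: their greatest common divisor."""
    return reduce(math.gcd, component_freqs_hz)

def dft_magnitudes(samples):
    """Naive O(N^2) discrete Fourier transform; magnitude for each bin below Nyquist."""
    n = len(samples)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))) / n
            for k in range(n // 2)]

sample_rate = 800                       # samples per second (assumed)
components = {200: 1.0, 300: 0.5}       # frequency in Hz -> amplitude
wave = [sum(a * math.sin(2 * math.pi * f * t / sample_rate)
            for f, a in components.items())
        for t in range(200)]            # 0.25 s of signal

mags = dft_magnitudes(wave)
hz_per_bin = sample_rate / len(wave)
peaks_hz = [k * hz_per_bin for k, m in enumerate(mags) if m > 0.1]

print(fundamental_hz(list(components)))  # GCD of 200 and 300
print(peaks_hz)                          # frequencies with substantial energy
```

The spectrum shows energy only at the two component frequencies. The overall pattern still repeats at their GCD, 100 Hz, even though no component sits at that frequency.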

9
Some Sounds are Aperiodic

Waveforms with random or non-repeating patterns
Random aperiodic waveforms: white noise
Flat spectrum: equal amplitude for all frequency
components
Transients: sudden bursts of pressure (clicks,
pops, door slams)
Waveform shows a single impulse (click.wav)
Fourier analysis shows a flat spectrum
Some speech sounds, e.g. many consonants (e.g.
cat.wav)
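
To illustrate the claim that a single impulse has a flat spectrum, here is a minimal sketch using a naive discrete Fourier transform. The 64-sample synthetic click stands in for a file such as click.wav and is purely an assumption for the example.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Naive O(N^2) discrete Fourier transform; returns one magnitude per bin."""
    n = len(samples)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples)))
            for k in range(n)]

# A transient: silence except for one sample of pressure.
click = [0.0] * 64
click[10] = 1.0

mags = dft_magnitudes(click)
print(round(min(mags), 6), round(max(mags), 6))
```

Every frequency bin has the same magnitude. Moving the impulse to another position changes only the phases.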

10
Speech Production

Voiced and voiceless sounds
Vocal fold vibration filtered by the vocal tract
produces a complex periodic waveform
Cycles per sec of lowest frequency component of
signal: fundamental frequency (F0)
Fourier analysis yields power spectrum with
component frequencies and amplitudes
F0 is first (lowest frequency) peak
Harmonics are integer multiples of F0, shaped by
the resonances of the vocal tract

11
Vocal fold vibration
UCLA Phonetics Lab demo
12
Places of articulation
http://www.chass.utoronto.ca/danhall/phonetics/sammy.html
13
How do we capture speech for analysis?

Recording conditions
A quiet office, a sound booth, an anechoic
chamber
Microphones
Analog devices (e.g. tape recorders) store and
analyze continuous air pressure variations
(speech) as a continuous signal
Digital devices (e.g. computers, DAT) first
convert continuous signals into discrete signals
(A-to-D conversion)

File format
.wav, .aiff, .ds, .au, .sph,
Conversion programs, e.g. sox
Storage
Function of how much information we store about
speech in digitization
Higher quality, closer to original
More space (1000s of hours of speech take up a
lot of space)

15
Sampling

Sampling rate: how often do we need to sample?
At least 2 samples per cycle to capture
periodicity of a waveform component at a given
frequency
100 Hz waveform needs 200 samples per sec
Nyquist frequency: highest-frequency component
captured with a given sampling rate (half the
sampling rate)
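
The arithmetic on this slide, as a minimal sketch. The function names are illustrative only.

```python
def min_sampling_rate_hz(highest_freq_hz):
    """At least two samples per cycle of the highest component to be captured."""
    return 2 * highest_freq_hz

def nyquist_frequency_hz(sampling_rate_hz):
    """Highest frequency component captured at a given sampling rate."""
    return sampling_rate_hz / 2

print(min_sampling_rate_hz(100))    # 200, as on the slide
print(nyquist_frequency_hz(8000))   # 4000.0
```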

16
Sampling/storage tradeoff

Human hearing: about 20 kHz top frequency
Do we really need to store 40K samples per second
of speech?
Telephone speech: 300 Hz to 4 kHz (8 kHz sampling)
But some speech sounds (e.g. fricatives such as /f/ and
/s/, and stops such as /p/, /t/, /d/) have energy above 4 kHz!
Peter/teeter/Dieter
44.1 kHz (CD-quality audio) vs. 16-22 kHz (usually good
enough to study pitch, amplitude, duration, ...)

17
Sampling Errors

Aliasing
Signal's frequency is higher than half the sampling
rate
Solutions
Increase the sampling rate
Filter out frequencies above half the sampling
rate (anti-aliasing filter)
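
A short sketch of why aliasing is an error: once sampled, a tone above half the sampling rate produces exactly the same samples as a lower tone. The 8000 Hz rate, the 5000 Hz tone, and the helper name `alias_frequency` are assumptions for illustration.

```python
import math

def alias_frequency(freq_hz, sample_rate_hz):
    """Apparent frequency of a sine at freq_hz after sampling at sample_rate_hz."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

sample_rate = 8000   # Nyquist frequency is 4000 Hz
high = [math.cos(2 * math.pi * 5000 * n / sample_rate) for n in range(16)]
low = [math.cos(2 * math.pi * 3000 * n / sample_rate) for n in range(16)]

print(alias_frequency(5000, sample_rate))
print(all(abs(h - l) < 1e-9 for h, l in zip(high, low)))
```

The 5000 Hz tone is recorded as if it were 3000 Hz. An anti-aliasing filter avoids this by removing energy above 4000 Hz before sampling.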

18
Quantization

Measuring the amplitude at sampling points: what
resolution to choose?
Integer representation
8, 12 or 16 bits per sample
Noise due to quantization steps is reduced by higher
resolution, but that requires more storage
How many different amplitude levels do we need to
distinguish?
Choice depends on data and application (44.1 kHz 16-bit
stereo requires about 10 MB of storage per minute)

But clipping occurs when input volume is greater
than range representable in digitized waveform
Increase the resolution
Decrease the amplitude
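
A minimal sketch of the storage arithmetic and of clipping during quantization. The function names are illustrative. Reading the slide's storage figure as roughly one minute of 44.1 kHz, 16-bit stereo audio is an assumption. So is the convention that input amplitudes are scaled to the range -1.0 to 1.0.

```python
def storage_bytes(sample_rate_hz, bits_per_sample, channels, seconds):
    """Uncompressed size of a recording in bytes."""
    return sample_rate_hz * bits_per_sample * channels * seconds // 8

def quantize(value, bits):
    """Map an amplitude in [-1.0, 1.0) to a signed integer level.

    Values outside the representable range are clipped to the nearest limit.
    """
    max_level = 2 ** (bits - 1) - 1
    min_level = -(2 ** (bits - 1))
    level = int(round(value * 2 ** (bits - 1)))
    return max(min_level, min(max_level, level))

print(storage_bytes(44100, 16, 2, 60))   # bytes for one minute of stereo audio
print(2 ** 8, 2 ** 16)                   # amplitude levels at 8 and 16 bits
print(quantize(0.5, 16))                 # within range
print(quantize(1.5, 16))                 # too loud: clipped to the maximum level
```

Halving the input amplitude before conversion, as in `quantize(1.5 / 2, 16)`, keeps the value inside the representable range.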

20
What can we do if our data is noisy?

Acoustic filters block out certain frequencies of
sounds
Low-pass filter: blocks high-frequency components
of a waveform
High-pass filter: blocks low frequencies
Reject band (what to block) vs. pass band (what
to let through)
But what if the frequencies of two sounds overlap? Source
separation
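
One simple way to see low-pass and high-pass behaviour is a moving average. This is a sketch under stated assumptions, not the filter design used in any particular tool: the window width of 4 samples, the 8000 Hz rate, and the 50 Hz and 2000 Hz test tones are chosen for the example.

```python
import math

def moving_average_lowpass(samples, width):
    """Replace each sample by the mean of the most recent `width` samples."""
    out = []
    for i in range(len(samples)):
        window = samples[max(0, i - width + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

def highpass(samples, width):
    """Whatever the low-pass filter removed: original minus smoothed signal."""
    low = moving_average_lowpass(samples, width)
    return [s - l for s, l in zip(samples, low)]

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

sample_rate = 8000
slow = [math.sin(2 * math.pi * 50 * n / sample_rate) for n in range(800)]
fast = [math.sin(2 * math.pi * 2000 * n / sample_rate) for n in range(800)]

print(round(rms(moving_average_lowpass(slow, 4)) / rms(slow), 3))  # mostly kept
print(round(rms(moving_average_lowpass(fast, 4)) / rms(fast), 3))  # mostly blocked
print(round(rms(highpass(fast, 4)) / rms(fast), 3))                # mostly kept
```

This only works when the wanted and unwanted sounds occupy different frequency bands. Overlapping bands need source separation instead.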

21
How can we capture pitch contours, pitch range?

What is the pitch contour of this utterance? Is
the pitch range of X greater than that of Y?
Pitch tracking: estimate F0 over time as a function of
vocal fold vibration
A periodic waveform is correlated with itself
One period looks much like another (cat.wav)
Find the period by finding the lag (offset)
between two windows on the signal for which the
correlation of the windows is highest
Lag duration (T) is 1 period of waveform
Inverse is F0 (1/T)

Errors to watch for
Halving: shortest lag calculated is too long
(underestimate pitch)
Doubling: shortest lag is too short (overestimate
pitch)
Microprosody errors (e.g. /v/)
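
The lag search described above can be sketched as follows. This is a bare-bones illustration on a synthetic signal, not the algorithm of any particular pitch tracker. The function name `estimate_f0`, the 100 to 400 Hz search range, and the 150 Hz test signal are assumptions.

```python
import math

def estimate_f0(samples, sample_rate, min_hz=100.0, max_hz=400.0):
    """Return (F0 in Hz, correlation score) for the best-matching lag.

    The signal is compared with a copy of itself shifted by each candidate lag.
    The lag with the highest normalized correlation is taken as one period T,
    and F0 is 1 / T.
    """
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    best_lag, best_score = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        a, b = samples[:-lag], samples[lag:]
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag, best_score

sample_rate = 12000
# Synthetic "voiced" signal: 150 Hz fundamental plus a weaker second harmonic.
wave = [math.sin(2 * math.pi * 150 * n / sample_rate)
        + 0.5 * math.sin(2 * math.pi * 300 * n / sample_rate)
        for n in range(600)]

f0, score = estimate_f0(wave, sample_rate)
print(round(f0, 1), round(score, 3))
```

The search range matters for the errors listed above. If lags as long as two periods are allowed, they correlate just as well as one period, and the tracker may report half the true pitch.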

23
Sample Analysis File: Pitch Track Header

version 1
type_code 4
frequency 12000.000000
samples 160768
start_time 0.000000
end_time 13.397333
bandwidth 6000.000000
dimensions 1
maximum 9660.000000
minimum -17384.000000
time Sat Nov 2 15:55:50 1991
operation record padding xxxxxxxxxxxx

24
Sample Analysis File: Pitch Track Data

F0        Pvoicing   Energy    A/C Score
147.896   1          2154.07   0.902643
140.894   1          1544.93   0.967008
138.05    1          1080.55   0.92588
130.399   1          745.262   0.595265
0         0          567.153   0.504029
0         0          638.037   0.222939
0         0          670.936   0.370024
0         0          790.751   0.357141
141.215   1          1281.1    0.904345
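
A hedged sketch of reading rows like these and summarizing the voiced frames. The column meanings are inferred from the header line on the slide, and reading "A/C Score" as an autocorrelation score is an assumption. The function name `parse_pitch_track` and the dictionary keys are illustrative.

```python
PITCH_TRACK = """\
147.896 1 2154.07 0.902643
140.894 1 1544.93 0.967008
138.05 1 1080.55 0.92588
130.399 1 745.262 0.595265
0 0 567.153 0.504029
0 0 638.037 0.222939
0 0 670.936 0.370024
0 0 790.751 0.357141
141.215 1 1281.1 0.904345
"""

def parse_pitch_track(text):
    """Parse whitespace-separated rows of F0, voicing flag, energy, A/C score."""
    frames = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed rows
        f0, voiced, energy, ac_score = fields
        frames.append({
            "f0": float(f0),
            "voiced": voiced == "1",
            "energy": float(energy),
            "ac_score": float(ac_score),
        })
    return frames

frames = parse_pitch_track(PITCH_TRACK)
voiced_f0 = [f["f0"] for f in frames if f["voiced"]]
print(len(frames), len(voiced_f0))
print(min(voiced_f0), max(voiced_f0))   # pitch range over the voiced frames
```

Unvoiced frames carry an F0 of 0, so they are excluded before computing the range. Including them would drag the minimum down to zero.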

25
Pitch Perception

But do pitch trackers capture what humans
perceive?
Auditory system's perception of pitch is
non-linear
Sounds at lower frequencies with the same difference
in absolute frequency sound more different than
those at higher frequencies (male vs. female
speech)
Bark scale (Zwicker) and other models of
perceived difference
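
To make the non-linearity concrete, here is a sketch using one published approximation of the Bark scale, commonly attributed to Zwicker and Terhardt. The slide does not give a formula, so the choice of this approximation is an assumption, as is the function name `hz_to_bark`.

```python
import math

def hz_to_bark(freq_hz):
    """Approximate Bark value: 13*atan(0.00076*f) + 3.5*atan((f/7500)^2)."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

# The same 100 Hz difference at a low and at a high frequency.
low_diff = hz_to_bark(200) - hz_to_bark(100)
high_diff = hz_to_bark(5100) - hz_to_bark(5000)
print(round(low_diff, 2), round(high_diff, 2))
```

On this scale the 100 Hz step near the bottom of the range is several times larger than the same step near 5000 Hz, in line with the perceptual claim above.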

26
How do we capture loudness/intensity?

Is one utterance louder than another?
Energy closely correlated experimentally with
perceived loudness
For each window, square the amplitude values of
the samples, take their mean, and take the root
of that mean (RMS energy)
What size window?
Longer windows produce smoother amplitude traces
but miss sudden acoustic events
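
The windowed RMS computation described above, as a minimal sketch. The function name `rms_energy`, the non-overlapping windows, and the toy signal are assumptions for illustration.

```python
import math

def rms_energy(samples, window_size, hop=None):
    """RMS of each window: square the samples, take the mean, take the root."""
    hop = hop or window_size   # by default, windows do not overlap
    energies = []
    for start in range(0, len(samples) - window_size + 1, hop):
        window = samples[start:start + window_size]
        energies.append(math.sqrt(sum(s * s for s in window) / window_size))
    return energies

# Toy signal: a quiet stretch followed by a sudden loud stretch.
signal = [0.1] * 100 + [1.0] * 100

print([round(e, 3) for e in rms_energy(signal, 50)])    # short windows
print([round(e, 3) for e in rms_energy(signal, 200)])   # one long window
```

With 50-sample windows the jump in loudness is visible between the second and third value. The single 200-sample window averages it away.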

27
Perception of Loudness

But the relation is non-linear: sones or decibels
(dB)
Differences in soft sounds are more salient than in loud ones
Intensity is proportional to the square of amplitude,
so the intensity of a sound with pressure $x$ vs. a
reference sound with pressure $r$ is $x^2/r^2$
bel: base-10 log of the ratio
decibel: one tenth of a bel (dB value = 10 times the bel value)
$\mathrm{dB} = 10 \log_{10}(x^2/r^2)$
Absolute (20 µPa, lowest audible pressure
fluctuation of 1000 Hz tone), typical threshold
level for tone at frequency
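
The decibel formula above can be checked against the earlier "How Loud are Common Sounds?" table. This sketch assumes the pressures in that table are in micropascals, with the 20 µPa absolute threshold as the reference. The function name `db_from_pressure` is illustrative.

```python
import math

REFERENCE_UPA = 20.0   # reference pressure r, in micropascals (assumed unit)

def db_from_pressure(x_upa, r_upa=REFERENCE_UPA):
    """dB = 10 * log10(x^2 / r^2), which is the same as 20 * log10(x / r)."""
    return 10.0 * math.log10((x_upa ** 2) / (r_upa ** 2))

pressures_upa = {
    "Absolute": 20,
    "Whisper": 200,
    "Quiet office": 2_000,
    "Conversation": 20_000,
    "Bus": 200_000,
    "Subway": 2_000_000,
    "Thunder": 20_000_000,
    "DAMAGE": 200_000_000,
}

levels = {name: round(db_from_pressure(p)) for name, p in pressures_upa.items()}
for name, level in levels.items():
    print(name, level)
```

Each tenfold increase in pressure adds 20 dB, because the ratio is squared before the logarithm is taken.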

28
How do we capture...

For utterances X and Y
Pitch contour: Same or different?
Pitch range: Is X larger than Y?
Duration: Is utterance X longer than utterance
Y?
Speaker rate: Is the speaker of X speaking
faster than the speaker of Y?
Voice quality...

29
Next Class

Tools for the Masses: Read the Praat tutorial
Download Praat from the course syllabus page and
play with a speech file (e.g. http://www.cs.columbia.edu/julia/cs4706/cc_001_sadness_1669.04_August-second-.wav
or record your own)
