Acoustics of Speech - PowerPoint PPT Presentation

Provided by: juliahir

1
Acoustics of Speech
  • Julia Hirschberg
  • CS 4706

2
Claim: How things are said can be critical to
understanding
  • I.e., varying phrasing, prominence, pitch range,
    speaking rate, pitch contour, and voice
    quality conveys meaning
  • What is our evidence? How do we prove it?
  • Observation
  • Hypotheses
  • Experimentation (perception, production)
  • Speech analysis (independent variables)
  • Correlation with dependent variable

3
  • What does our data look like?
  • What tools do we have for analysis?

4
What is sound?
  • Pressure fluctuations in the air caused by a
    musical instrument, a car horn, a voice
  • Cause eardrum to move
  • Auditory system translates into neural impulses
  • Brain interprets as sound
  • Can we tell one sound from another?
  • Can we distinguish one particular sound in
    noise?

5
  • From a speech-centric point of view, when sound
    is not produced by the human voice, we may term
    it noise
  • Ratio of speech-generated sound to other
    simultaneous sound: signal-to-noise ratio

6
How Loud are Common Sounds?
    Event          Pressure (µPa)   dB
    Absolute       20               0
    Whisper        200              20
    Quiet office   2K               40
    Conversation   20K              60
    Bus            200K             80
    Subway         2M               100
    Thunder        20M              120
    DAMAGE         200M             140

7
Some Sounds are Periodic
  • Simple periodic waves (sine waves) are defined by
  • Frequency: how often the pattern repeats per time
    unit
  • Cycle: one repetition
  • Period: duration of one cycle
  • Frequency: cycles per time unit, e.g.
    frequency in Hz = 1 / period in sec
  • Horizontal axis of waveform
  • Amplitude: peak deviation of pressure from normal
    atmospheric pressure
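The relation between period, frequency, and amplitude can be sketched in a few lines of Python. This is only an illustration: the 10 ms period, the 8000 samples-per-second rate, and the function names are assumptions chosen for the example, not values from the slides.

```python
import math

def frequency_from_period(period_s):
    """Frequency in Hz is the reciprocal of the period in seconds."""
    return 1.0 / period_s

def sine_wave(freq_hz, amplitude, duration_s, sample_rate):
    """Sample amplitude * sin(2*pi*f*t) at evenly spaced times."""
    n_samples = round(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

period_s = 0.01                              # one cycle lasts 10 ms (assumed)
freq_hz = frequency_from_period(period_s)    # 100 cycles per second
sample_rate = 8000                           # samples per second (assumed)
wave = sine_wave(freq_hz, amplitude=1.0, duration_s=0.02, sample_rate=sample_rate)

samples_per_cycle = round(period_s * sample_rate)   # samples in one cycle
peak = max(abs(s) for s in wave)                    # peak deviation, i.e. the amplitude
print(freq_hz, samples_per_cycle, round(peak, 6))
```

The generated wave repeats exactly once per period, which is what makes it periodic.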

8
  • Phase: timing of waveform relative to a reference
    point
  • Complex periodic waves
  • Cyclic but composed of two or more sine waves
  • Fundamental frequency (F0): rate at which largest
    pattern repeats (also GCD of component freqs)
  • Components not always easily identifiable: power
    spectrum graphs amplitude vs. frequency
  • Any complex waveform can be analyzed into a set
    of sine waves with their own frequencies,
    amplitudes, and phases (Fourier's theorem)
  • E.g. some speech sounds (mostly vowels): cat.wav
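As a rough sketch of these ideas, the Python below builds a complex periodic wave from two sine components, takes F0 as the GCD of the component frequencies, and recovers the components with a plain discrete Fourier transform. The component frequencies (200 and 300 Hz), the sampling rate, and the helper name `dft_magnitudes` are all assumptions made for illustration.

```python
import cmath
import math
from functools import reduce

def dft_magnitudes(samples):
    """Amplitude per frequency bin up to the Nyquist bin (plain O(N^2) DFT)."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        # Scaling by 2/N makes a sine of amplitude A appear as A
        # (valid for bins strictly between 0 and the Nyquist bin).
        mags.append(2 * abs(total) / n)
    return mags

sample_rate = 1000                    # samples per second (assumed)
n_samples = 100                       # 0.1 s of signal, so bins are 10 Hz apart
bin_hz = sample_rate // n_samples
components = {200: 1.0, 300: 0.5}     # frequency in Hz -> amplitude (assumed)

wave = [sum(a * math.sin(2 * math.pi * f * i / sample_rate)
            for f, a in components.items())
        for i in range(n_samples)]

f0 = reduce(math.gcd, components.keys())   # GCD of the component frequencies
mags = dft_magnitudes(wave)
peaks = {k * bin_hz: round(m, 3) for k, m in enumerate(mags) if m > 0.01}
print(f0, peaks)
```

Note that the wave repeats at the F0 rate even though no component sits at F0 itself, which is why the GCD definition is useful.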

9
Some Sounds are Aperiodic
  • Waveforms with random or non-repeating patterns
  • Random aperiodic waveforms: white noise
  • Flat spectrum: equal amplitude for all frequency
    components
  • Transients: sudden bursts of pressure (clicks,
    pops, door slams)
  • Waveform shows a single impulse (click.wav)
  • Fourier analysis shows a flat spectrum
  • Some speech sounds, such as many consonants (e.g.
    cat.wav)
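The flat spectrum of a transient can be checked on an idealized click. The sketch below assumes a synthetic single-sample impulse rather than the click.wav file mentioned above, and the helper name is hypothetical.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Unscaled magnitude of each DFT bin from 0 up to the Nyquist bin."""
    n = len(samples)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples)))
            for k in range(n // 2 + 1)]

click = [0.0] * 64
click[10] = 1.0                      # one impulse, silence everywhere else

spectrum = dft_magnitudes(click)
flatness = max(spectrum) - min(spectrum)   # close to zero for a flat spectrum
print(round(flatness, 9))
```

Every frequency bin receives the same magnitude, so no single component stands out.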

10
Speech Production
  • Voiced and voiceless sounds
  • Vocal fold vibration filtered by the vocal tract
    produces a complex periodic waveform
  • Cycles per sec of lowest frequency component of
    signal: fundamental frequency (F0)
  • Fourier analysis yields power spectrum with
    component frequencies and amplitudes
  • F0 is first (lowest frequency) peak
  • Harmonics are integer multiples of F0; resonances
    of the vocal tract shape their amplitudes

11
Vocal fold vibration
UCLA Phonetics Lab demo

12
Places of articulation
http://www.chass.utoronto.ca/~danhall/phonetics/sammy.html

13
How do we capture speech for analysis?
  • Recording conditions
  • A quiet office, a sound booth, an anechoic
    chamber
  • Microphones
  • Analog devices (e.g. tape recorders) store and
    analyze continuous air pressure variations
    (speech) as a continuous signal
  • Digital devices (e.g. computers, DAT) first
    convert continuous signals into discrete signals
    (A-to-D conversion)

14
  • File format
  • .wav, .aiff, .ds, .au, .sph, etc.
  • Conversion programs, e.g. sox
  • Storage
  • Function of how much information we store about
    speech in digitization
  • Higher quality, closer to original
  • More space (1000s of hours of speech take up a
    lot of space)

15
Sampling
  • Sampling rate: how often do we need to sample?
  • At least 2 samples per cycle to capture
    periodicity of a waveform component at a given
    frequency
  • 100 Hz waveform needs 200 samples per sec
  • Nyquist frequency: highest-frequency component
    captured with a given sampling rate (half the
    sampling rate)

16
Sampling/storage tradeoff
  • Human hearing: 20K top frequency
  • Do we really need to store 40K samples per second
    of speech?
  • Telephone speech: 300-4K Hz (8K sampling)
  • But some speech sounds (e.g. fricatives /f/, /s/
    and stops /p/, /t/, /d/) have energy above 4K!
  • Peter/teeter/Dieter
  • 44K (CD-quality audio) vs. 16-22K (usually good
    enough to study pitch, amplitude, duration, ...)
17
Sampling Errors
  • Aliasing
  • Signal's frequency is higher than half the sampling
    rate
  • Solutions
  • Increase the sampling rate
  • Filter out frequencies above half the sampling
    rate (anti-aliasing filter)
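A small numerical sketch of aliasing: at telephone-style 8K sampling, a 7000 Hz cosine produces the same sample values as a 1000 Hz cosine, so the two cannot be told apart after digitization. The function names and the chosen frequencies are assumptions for illustration.

```python
import math

def nyquist_frequency(sample_rate):
    """Highest frequency that a given sampling rate can capture."""
    return sample_rate / 2

def alias_frequency(freq_hz, sample_rate):
    """Apparent frequency of a sampled sinusoid after folding into 0..Nyquist."""
    folded = freq_hz % sample_rate
    return min(folded, sample_rate - folded)

def sample_cosine(freq_hz, sample_rate, n_samples):
    """Sample cos(2*pi*f*t) at the given rate."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

sample_rate = 8000
high = sample_cosine(7000, sample_rate, 32)   # above the 4000 Hz Nyquist frequency
low = sample_cosine(1000, sample_rate, 32)    # the frequency it aliases to

indistinguishable = all(abs(a - b) < 1e-9 for a, b in zip(high, low))
print(nyquist_frequency(sample_rate), alias_frequency(7000, sample_rate), indistinguishable)
```

This is why an anti-aliasing filter has to remove the high component before sampling: afterwards the samples carry no trace of which frequency produced them.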

18
Quantization
  • Measuring the amplitude at sampling points: what
    resolution to choose?
  • Integer representation
  • 8, 12 or 16 bits per sample
  • Noise due to quantization steps is reduced by higher
    resolution, but that requires more storage
  • How many different amplitude levels do we need to
    distinguish?
  • Choice depends on data and application (44K 16-bit
    stereo requires about 10 MB of storage per minute)
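The storage and resolution tradeoff can be worked through numerically. The sketch below assumes uncompressed audio, the standard CD rate of 44,100 samples per second for the "44K" figure, and a simple rounding quantizer. The function names are hypothetical.

```python
def storage_bytes(sample_rate, bits_per_sample, channels, seconds):
    """Uncompressed size: one value per sample per channel."""
    return sample_rate * bits_per_sample * channels * seconds // 8

def quantize(x, bits):
    """Map x in [-1, 1) to the nearest of 2**bits integer codes, then back to a float."""
    half_levels = 2 ** (bits - 1)
    code = max(-half_levels, min(half_levels - 1, round(x * half_levels)))
    return code / half_levels

# One minute of 16-bit stereo at the assumed CD rate.
one_minute_cd = storage_bytes(44100, 16, 2, 60)

# Quantization error for one sample value at each resolution on the slide.
x = 0.3
errors = {bits: abs(quantize(x, bits) - x) for bits in (8, 12, 16)}
print(one_minute_cd, errors)
```

Each extra bit halves the step size, so the rounding error shrinks as resolution grows, while storage grows linearly with the number of bits.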

19
  • But clipping occurs when input volume is greater
    than range representable in digitized waveform
  • Increase the resolution
  • Decrease the amplitude
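Clipping can be illustrated with a sine wave whose peaks exceed the 16-bit range. This is a minimal sketch assuming a nominal input range of -1 to 1 and hypothetical helper names. It shows only the "decrease the amplitude" remedy.

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767

def to_int16(samples):
    """Scale floats (nominal range -1..1) to 16-bit codes, clipping anything outside."""
    return [max(INT16_MIN, min(INT16_MAX, round(s * INT16_MAX))) for s in samples]

def count_clipped(codes):
    """Number of samples pinned at either end of the representable range."""
    return sum(1 for c in codes if c in (INT16_MIN, INT16_MAX))

loud = [1.5 * math.sin(2 * math.pi * n / 40) for n in range(40)]   # peaks beyond the range
quieter = [0.6 * s for s in loud]                                  # peaks scaled down to 0.9

clipped_loud = count_clipped(to_int16(loud))
clipped_quieter = count_clipped(to_int16(quieter))
print(clipped_loud, clipped_quieter)
```

The clipped version has flattened peaks, which distorts the waveform. Scaling the input down before digitizing avoids that.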

20
What can we do if our data is noisy?
  • Acoustic filters block out certain frequencies of
    sounds
  • Low-pass filter blocks high frequency components
    of a waveform
  • High-pass filter blocks low frequencies
  • Reject band (what to block) vs. pass band (what
    to let through)
  • But if frequencies of two sounds overlap ... source
    separation
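As one crude illustration of low-pass and high-pass filtering, the sketch below uses a centered moving average, which is only one simple kind of low-pass filter and not necessarily what any particular tool implements. The 100 Hz and 1000 Hz components, the 9000 samples-per-second rate, and the names are assumptions chosen so the averaging window spans exactly one period of the high component.

```python
import math

def moving_average(samples, width):
    """Centered moving average (odd width); keeps only positions with a full window."""
    half = width // 2
    return [sum(samples[i - half:i + half + 1]) / width
            for i in range(half, len(samples) - half)]

sample_rate = 9000                 # assumed, so that 1000 Hz spans exactly 9 samples
n = 900
low = [math.sin(2 * math.pi * 100 * i / sample_rate) for i in range(n)]
high = [0.5 * math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(n)]
mixed = [a + b for a, b in zip(low, high)]

width = 9                          # one full period of the 1000 Hz component
half = width // 2

# Low-pass: averaging over one period cancels the 1000 Hz component.
low_passed = moving_average(mixed, width)

# High-pass: what remains after subtracting the low-passed signal.
high_passed = [m - l for m, l in zip(mixed[half:], low_passed)]

print(round(max(low_passed), 3), round(max(high_passed), 3))
```

The two components are separable here because their frequencies are far apart. When the frequencies of two sounds overlap, no choice of cutoff separates them.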

21
How can we capture pitch contours, pitch range?
  • What is the pitch contour of this utterance? Is
    the pitch range of X greater than that of Y?
  • Pitch tracking: Estimate F0 over time as a function of
    vocal fold vibration
  • A periodic waveform is correlated with itself
  • One period looks much like another (cat.wav)
  • Find the period by finding the lag (offset)
    between two windows on the signal for which the
    correlation of the windows is highest
  • Lag duration (T) is 1 period of waveform
  • Inverse is F0 (1/T)
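The lag search described above can be sketched as a bare-bones autocorrelation pitch estimator. This is an illustrative simplification, not the algorithm of any particular pitch tracker: the synthetic 150 Hz test signal, the search range, and the function name are assumptions.

```python
import math

def estimate_f0(samples, sample_rate, min_hz, max_hz):
    """Return (F0 in Hz, lag in samples) for the lag with the highest normalized correlation."""
    min_lag = int(sample_rate / max_hz)   # shortest period considered
    max_lag = int(sample_rate / min_hz)   # longest period considered
    window = len(samples) - max_lag       # window length shared by every lag

    def correlation(lag):
        first = samples[:window]
        second = samples[lag:lag + window]
        dot = sum(a * b for a, b in zip(first, second))
        norm = math.sqrt(sum(a * a for a in first) * sum(b * b for b in second))
        return dot / norm if norm > 0 else 0.0

    best_lag = max(range(min_lag, max_lag + 1), key=correlation)
    return sample_rate / best_lag, best_lag

# Synthetic "voiced" signal: 150 Hz fundamental plus two harmonics (assumed values).
sample_rate = 12000
signal = [math.sin(2 * math.pi * 150 * n / sample_rate)
          + 0.5 * math.sin(2 * math.pi * 300 * n / sample_rate)
          + 0.25 * math.sin(2 * math.pi * 450 * n / sample_rate)
          for n in range(480)]

f0, lag = estimate_f0(signal, sample_rate, min_hz=100, max_hz=400)
print(f0, lag)
```

The search range is deliberately narrower than two periods of the test signal. With a wider range, a lag of two periods correlates just as well as one period, which is one way a tracker ends up reporting half the true pitch.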

22
  • Errors to watch for
  • Halving: shortest lag calculated is too long
    (underestimate pitch)
  • Doubling: shortest lag too short (overestimate
    pitch)
  • Microprosody errors (e.g. /v/)

23
Sample Analysis File: Pitch Track Header
  • version 1
  • type_code 4
  • frequency 12000.000000
  • samples 160768
  • start_time 0.000000
  • end_time 13.397333
  • bandwidth 6000.000000
  • dimensions 1
  • maximum 9660.000000
  • minimum -17384.000000
  • time Sat Nov 2 15:55:50 1991
  • operation record padding xxxxxxxxxxxx

24
Sample Analysis File: Pitch Track Data
  • (F0, Pvoicing, Energy, A/C Score)
  • 147.896 1 2154.07 0.902643
  • 140.894 1 1544.93 0.967008
  • 138.05 1 1080.55 0.92588
  • 130.399 1 745.262 0.595265
  • 0 0 567.153 0.504029
  • 0 0 638.037 0.222939
  • 0 0 670.936 0.370024
  • 0 0 790.751 0.357141
  • 141.215 1 1281.1 0.904345

25
Pitch Perception
  • But do pitch trackers capture what humans
    perceive?
  • Auditory system's perception of pitch is
    non-linear
  • Sounds at lower frequencies with same difference
    in absolute frequency sound more different than
    those at higher frequencies (male vs. female
    speech)
  • Bark scale (Zwicker) and other models of
    perceived difference
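To make the non-linearity concrete, the sketch below converts Hz to Bark using one widely cited analytic approximation (Zwicker and Terhardt). Whether this is the exact formula intended on the slide is an assumption, and the function name is hypothetical.

```python
import math

def hz_to_bark(freq_hz):
    """Zwicker and Terhardt's analytic approximation of the Bark scale."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

# The same 100 Hz difference at a low and at a high frequency.
low_step = hz_to_bark(200) - hz_to_bark(100)
high_step = hz_to_bark(5100) - hz_to_bark(5000)
print(round(low_step, 3), round(high_step, 3))
```

The same absolute difference in Hz spans a much larger distance on the Bark scale at low frequencies than at high ones, which matches the claim that equal Hz differences sound more different lower down.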

26
How do we capture loudness/intensity?
  • Is one utterance louder than another?
  • Energy closely correlated experimentally with
    perceived loudness
  • For each window, square the amplitude values of
    the samples, take their mean, and take the root
    of that mean (RMS energy)
  • What size window?
  • Longer windows produce smoother amplitude traces
    but miss sudden acoustic events
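The RMS computation and the effect of window size can be sketched as follows. The non-overlapping windows, the synthetic burst signal, and the function names are assumptions made for the example.

```python
import math

def rms(window):
    """Square the samples, take their mean, then take the root of that mean."""
    return math.sqrt(sum(s * s for s in window) / len(window))

def rms_trace(samples, window_size):
    """RMS energy of consecutive non-overlapping windows."""
    return [rms(samples[i:i + window_size])
            for i in range(0, len(samples) - window_size + 1, window_size)]

# A steady sine with a 40-sample period: RMS is amplitude / sqrt(2) in every window.
sine = [math.sin(2 * math.pi * n / 40) for n in range(400)]
sine_trace = rms_trace(sine, 40)

# Silence with one short burst of 10 full-scale samples.
burst = [0.0] * 200
burst[100:110] = [1.0] * 10
short_trace = rms_trace(burst, 10)     # the burst fills one whole window
long_trace = rms_trace(burst, 100)     # the burst is averaged with silence

print(round(sine_trace[0], 4), max(short_trace), round(max(long_trace), 4))
```

With the short window the burst shows up at full strength. With the long window it is diluted by the surrounding silence, which is the smoothing tradeoff described above.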

27
Perception of Loudness
  • But the relation is non-linear: sones or decibels
    (dB)
  • Differences in soft sounds are more salient than in
    loud ones
  • Intensity is proportional to the square of amplitude,
    so intensity of a sound with pressure x vs. a
    reference sound with pressure r is x^2/r^2
  • bel: base 10 log of that ratio
  • decibel: one tenth of a bel
  • dB = 10 log10(x^2/r^2)
  • Absolute threshold (20 µPa, the lowest audible
    pressure fluctuation of a 1000 Hz tone): typical
    threshold level for a tone at that frequency
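The decibel formula can be checked against the pressures listed earlier for common sounds. The sketch assumes those pressures are in micropascals with a 20 µPa reference. The function name and the dictionary are hypothetical.

```python
import math

REFERENCE_UPA = 20.0   # absolute threshold in micropascals

def db_from_pressure(pressure_upa, reference_upa=REFERENCE_UPA):
    """dB = 10 * log10(x^2 / r^2), with x and r as sound pressures."""
    return 10 * math.log10(pressure_upa ** 2 / reference_upa ** 2)

# Pressures in micropascals, taken from the loudness table of common sounds.
pressures_upa = {
    "Absolute": 20,
    "Whisper": 200,
    "Quiet office": 2_000,
    "Conversation": 20_000,
    "Bus": 200_000,
    "Subway": 2_000_000,
    "Thunder": 20_000_000,
    "DAMAGE": 200_000_000,
}

levels_db = {event: db_from_pressure(p) for event, p in pressures_upa.items()}
for event, level in levels_db.items():
    print(f"{event:<14}{level:6.1f} dB")
```

Each tenfold increase in pressure adds 20 dB, because the pressure ratio is squared inside the logarithm.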

28
How do we capture ...
  • For utterances X and Y
  • Pitch contour: Same or different?
  • Pitch range: Is X larger than Y?
  • Duration: Is utterance X longer than utterance
    Y?
  • Speaker rate: Is the speaker of X speaking
    faster than the speaker of Y?
  • Voice quality ...

29
Next Class
  • Tools for the Masses: Read the Praat tutorial
  • Download Praat from the course syllabus page and
    play with a speech file (e.g.
    http://www.cs.columbia.edu/~julia/cs4706/cc_001_sadness_1669.04_August-second-.wav
    or record your own)