Title: Motor Control Strategies for Chinese Intonation
1Motor Control Strategies for Chinese Intonation
- Greg Kochanski (University of Oxford, UK)
- Chilin Shih (University of Illinois, Urbana-Champaign)
- Tan Lee (Chinese University of Hong Kong)
- Hongyan Jing (IBM)
2http://kochanski.org/gpk
3- The Goal
- Explain intonation in a way that is:
- Consistent with linguistic assumptions.
- Consistent with known physiology and neuroscience.
- The Method
- Motion planning over a phrase.
- Minimize the sum of:
- Error between actual pitch and linguistic target.
- An effort cost term that penalizes rapid, jerky motions.
- The Result
- Intonation in tone languages can be represented by:
- A lexically specified tone template (i.e., you use a dictionary to look up which tone a syllable has).
- A continuous cost-of-error parameter, one per word.
- Evidence that the costs-of-misinterpretation we measure are real:
- Cross-language similarities
- Metrical patterns
- Other
4The Challenge
5- Tone languages provide the ideal test case for motor control strategies:
- tone is important, and
- you can be sure what the speaker is trying to accomplish.
- The meaning of each syllable is determined by the pitch contour over the syllable.
- Ma (high tone): mother
- Ma (rising tone): hemp
- Ma (low falling tone): horse
- Ma (high falling tone): to scold
- You can look up the tone in the dictionary.
- Pitch contour is determined primarily by muscle
tension in the vocal folds.
6Another Challenge
[Figure: typical tone shapes shown in green. Axes: F0 (Hz) vs. time (10 ms intervals).]
7People talk nearly as fast as possible; therefore, dynamics must be important.
Pitch (f0) for a maximum-rate warble.
Conversational Mandarin on the same time scale.
8The Data
- Male speaker of Mandarin (Chinese)
- Female speaker of Cantonese (Chinese)
- Text from newspaper news stories.
- 737 syllables for Mandarin
- 4 ± 1.4 syllables per second
- 1.2 ± 0.7 seconds per phrase (between pauses).
- Segmented into words by three independent native speakers (Mandarin)
- Tracks of fundamental frequency vs. time (pitch) extracted by get_f0 from the ESPS/Waves package.
9Basic assumptions used in modeling
- People plan their utterances several syllables in advance.
- People produce optimal, highly practiced speech.
- Most of what we say is made from bits and pieces we've said before.
- There are only 4 (Mandarin) or 6 (Cantonese) tones to combine.
- A speaker has the chance to practice and optimize all the common 3- and 4-tone sequences.
- A simple model for f0 (pitch): f0 is linearly related to muscle tensions.
- A simple model of the muscle control strategy.
- No reason to believe pitch is controlled differently from other muscle motions.
10Optimize what?
- People want to minimize the chance that they will be significantly misunderstood. Some words will be more important than others.
- Risk = P(misinterpreted) × cost-of-misinterpretation
- Perhaps weight matches importance.
- People want to minimize effort and/or talk faster
- Chairs, Cars
- How to combine the two?
- A weighted sum.
- Cost-of-misinterpretation plays the role of the
weight.
11What is the unit of motion planning? Probably a phrase or a sentence.
(Data courtesy Chilin Shih)
People start at a higher pitch when they begin
longer sentences. Also planning of inhaled air
volume. Therefore, there is some plan 300 ms
before start of speech.
12Modeling math
We're optimizing something: p is the realized pitch.
p is implicitly a function of time.
R is the total risk for the utterance, r_i is the error of the ith target, and s_i is the cost if this particular word is misinterpreted.
(The expression for r_i is an approximation; see elsewhere for the correct, more detailed equation.)
y(t) is the pitch of a point in the ith target. The time dependence is suppressed for clarity.
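The annotations above can be collected into a single objective. The following is a minimal sketch, not the authors' exact formulation: it assumes the total risk is a cost-weighted sum of per-target errors, that each per-target error is a squared pitch deviation summed over the samples of that target, and that the planner minimizes risk plus an effort term G.

```latex
\begin{align}
  p^{*} &= \arg\min_{p}\,\bigl[\,G(p) + R(p)\,\bigr]
        && \text{(assumed combination of effort and risk)} \\
  R     &= \sum_{i} s_i\, r_i
        && \text{(total risk for the utterance)} \\
  r_i   &\approx \sum_{t \,\in\, \text{target } i} \bigl(p(t) - y(t)\bigr)^{2}
        && \text{(approximate error of the $i$th target)}
\end{align}
```

Under these assumptions, a large s_i makes any deviation from the ith target expensive, while a small s_i lets the effort term shape the curve.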
13Modeling math: more detail
s_i is the cost of a misinterpretation of the ith syllable.
R is the total risk for the utterance.
Alpha (α) controls how much the shape of the pitch contour matters.
Beta (β) controls how much the average pitch of the syllable matters.
r_i is the error of the ith target.
y is the pitch of a point in the ith target.
A bar denotes an average over a target.
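A per-target error with separate shape and average-pitch weights could look like the sketch below. The exact functional form is an assumption: here the shape term compares the mean-removed contours, and the average term is scaled by the number of samples so that alpha = beta = 1 reduces to a plain sum of squared deviations. The function name `target_error` is illustrative.

```python
def target_error(pitch, target, alpha, beta):
    """Error r_i of one target (assumed form, not the authors' exact equation).

    alpha weights the mismatch in contour shape (means removed);
    beta weights the mismatch in average pitch over the target.
    """
    n = len(pitch)
    p_bar = sum(pitch) / n    # average realized pitch over the target
    y_bar = sum(target) / n   # average target pitch
    shape = sum(((p - p_bar) - (y - y_bar)) ** 2 for p, y in zip(pitch, target))
    average = n * (p_bar - y_bar) ** 2
    return alpha * shape + beta * average


# A contour with the right shape but shifted up by one unit.
shifted = target_error([2.0, 3.0, 4.0], [1.0, 2.0, 3.0], alpha=1.0, beta=0.0)
offset = target_error([2.0, 3.0, 4.0], [1.0, 2.0, 3.0], alpha=0.0, beta=1.0)
```

With only alpha active, the shifted contour costs nothing, because its shape is correct; with only beta active, the whole error comes from the offset in average pitch.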
14Effort
How does the effort G depend on the form of the pitch curve? Large effort implies a curve with larger slopes and sharper corners, i.e., a wigglier curve.
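One hedged way to turn "larger slopes and sharper corners" into a number is to penalize squared first and second differences of the sampled pitch curve. This is an assumed form for G, and the names `effort`, `dt`, and `curvature_weight` are illustrative.

```python
def effort(pitch, dt=0.01, curvature_weight=1.0):
    """Assumed effort measure: squared slopes plus weighted squared curvature.

    pitch is sampled every dt seconds (10 ms by default).
    """
    slopes = [(b - a) / dt for a, b in zip(pitch, pitch[1:])]
    curvatures = [
        (c - 2.0 * b + a) / dt ** 2
        for a, b, c in zip(pitch, pitch[1:], pitch[2:])
    ]
    slope_term = sum(v * v for v in slopes)
    curvature_term = sum(c * c for c in curvatures)
    return dt * (slope_term + curvature_weight * curvature_term)


flat = effort([100.0] * 5)                    # no motion at all
ramp = effort([0.0, 1.0, 2.0, 3.0, 4.0])      # steady slope, no corners
zigzag = effort([0.0, 1.0, 0.0, 1.0, 0.0])    # same slope magnitude, sharp corners
```

The ramp and the zigzag have the same slope magnitude at every step, so the difference in their effort comes entirely from the corners.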
15Model behavior
- For s ≫ 1, Error (R) dominates, and pitch matches target.
- For s ≪ 1, Effort (G) dominates, both speaker and listener accept large deviations, and pitch smoothly interpolates.
- For s ≈ 1, everything compromises.
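The three regimes can be reproduced with a toy version of the optimization. This sketch makes simplifying assumptions: a single cost s shared by all samples, effort equal to the sum of squared first differences only, and error equal to the sum of squared deviations from the target. Under those assumptions the optimum solves a tridiagonal linear system, handled here with the Thomas algorithm. The function name `optimal_pitch` is illustrative.

```python
def optimal_pitch(target, s):
    """Minimize sum_k (p[k+1]-p[k])**2 + s * sum_k (p[k]-target[k])**2.

    Setting the gradient to zero gives (s*I + L) p = s*target, where L is
    the tridiagonal second-difference matrix.
    Requires len(target) >= 2 and s > 0.
    """
    n = len(target)
    off = -1.0  # value of every sub- and super-diagonal entry
    diag = [s + (1.0 if k in (0, n - 1) else 2.0) for k in range(n)]
    rhs = [s * y for y in target]
    # Forward elimination.
    for k in range(1, n):
        w = off / diag[k - 1]
        diag[k] -= w * off
        rhs[k] -= w * rhs[k - 1]
    # Back substitution.
    p = [0.0] * n
    p[-1] = rhs[-1] / diag[-1]
    for k in range(n - 2, -1, -1):
        p[k] = (rhs[k] - off * p[k + 1]) / diag[k]
    return p


target = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]        # alternating high and low targets
careful = optimal_pitch(target, s=1000.0)      # s >> 1: follows the target
sloppy = optimal_pitch(target, s=0.001)        # s << 1: nearly flat
compromise = optimal_pitch(target, s=1.0)      # s ~ 1: in between
```

In this toy setting the careful curve tracks the targets closely, the sloppy curve collapses toward the mean of the targets, and the compromise curve keeps the alternation but with reduced excursions.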
16The rest of the model
- A model is a sequence of targets.
- The type of the target (tone 1, tone 2, ...) is looked up in a dictionary.
- Each target has a cost-of-misinterpretation.
- The cost is adjustable for each word.
- Syllable costs within a word are derived from the word cost via the metrical pattern for words of a certain length.
- One target per tone.
- Targets are stretched to fit syllable duration.
- Only one phonological rule: 33 → 23.
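The pieces listed above can be sketched as a small pipeline that turns words into per-syllable targets and costs. Everything numeric here is hypothetical: the tone template shapes, the metrical pattern weights, and the choice to multiply the word cost by the pattern weight are placeholders, not fitted values from the study. The tone rule is applied within each word and in a simple left-to-right form.

```python
# Hypothetical tone shapes on a relative pitch scale (not fitted values).
TONE_TEMPLATES = {
    1: [1.0, 1.0, 1.0],   # high
    2: [0.3, 0.5, 0.9],   # rising
    3: [0.3, 0.1, 0.2],   # low falling
    4: [1.0, 0.6, 0.1],   # high falling
}

# Hypothetical metrical patterns: relative weight per syllable, by word length.
METRICAL_PATTERNS = {
    1: [1.0],
    2: [1.0, 0.6],
    3: [1.0, 0.5, 0.8],
    4: [1.0, 0.5, 0.8, 0.6],
}


def apply_tone_rule(tones):
    """Apply the single rule 3 3 -> 2 3 to a list of tone numbers."""
    out = list(tones)
    for k in range(len(tones) - 1):
        if tones[k] == 3 and tones[k + 1] == 3:
            out[k] = 2
    return out


def stretch(template, n_frames):
    """Linearly resample a template (>= 2 points) to n_frames >= 2 samples."""
    last = len(template) - 1
    out = []
    for k in range(n_frames):
        x = k * last / (n_frames - 1)
        lo = min(int(x), last - 1)
        frac = x - lo
        out.append(template[lo] * (1.0 - frac) + template[lo + 1] * frac)
    return out


def build_targets(words, frames_per_syllable=5):
    """words: list of (tone_list, word_cost). Returns one (target, cost) per syllable."""
    targets = []
    for tones, word_cost in words:
        weights = METRICAL_PATTERNS[len(tones)]
        for tone, weight in zip(apply_tone_rule(tones), weights):
            shape = stretch(TONE_TEMPLATES[tone], frames_per_syllable)
            targets.append((shape, word_cost * weight))
    return targets


# A two-syllable word with tones 3 3 and cost 2.0, then a one-syllable word.
targets = build_targets([([3, 3], 2.0), ([1], 0.5)])
```

In this example the first syllable is realized with the tone 2 template because of the 33 → 23 rule, and the second syllable inherits a reduced cost from the assumed two-syllable pattern.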
17What's the procedure?
Sequence of tones (phonology)
Data
Compute the pitch curve as a function of
phonological inputs and the cost of a
misinterpretation.
Predicted F0
Costs of misinterpretations
Nonlinear least-squares fitting algorithm
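The fitting loop can be illustrated in miniature. This sketch swaps in a one-dimensional grid search for the nonlinear least-squares algorithm and a toy predictor for the real pitch model, so it shows only the idea of choosing the cost that makes predicted F0 best match the data. The names `fit_cost` and `toy_predict` are hypothetical, and a real fit would adjust many costs at once.

```python
def fit_cost(observed, predict, grid):
    """Return the cost s in grid whose prediction has the smallest squared error."""
    def sum_squared_error(s):
        return sum((p - o) ** 2 for p, o in zip(predict(s), observed))
    return min(grid, key=sum_squared_error)


def toy_predict(s, target=(1.0, 0.0, 1.0, 0.0)):
    """Hypothetical stand-in model: blend the target with its mean.

    Large s reproduces the target; small s flattens it toward the mean.
    """
    mean = sum(target) / len(target)
    w = s / (1.0 + s)
    return [w * y + (1.0 - w) * mean for y in target]


grid = [10 ** (k / 10) for k in range(-30, 31)]   # log-spaced candidate costs
true_s = 10 ** 0.5
observed = toy_predict(true_s)                    # synthetic "data"
best_s = fit_cost(observed, toy_predict, grid)
```

Because the synthetic data were generated from a cost that lies on the grid, the search recovers it; with measured F0 the minimum would only approximate the underlying value.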
18Model fits for Mandarin Chinese
Tone class (input)
Cost-of-misinterpretation (result)
Inside a word, the cost of a misinterpretation is
distributed by the metrical pattern
19Model fits to Mandarin Chinese
0.61 free parameters per syllable, 13 Hz RMS
error.
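An RMS error of this kind can be computed from predicted and measured F0 tracks as sketched below. The function name and the sample values are illustrative only and are not the authors' data.

```python
import math


def rms_error(predicted_f0, measured_f0):
    """Root-mean-square difference, in the units of the inputs (Hz here)."""
    n = len(predicted_f0)
    total = sum((p - m) ** 2 for p, m in zip(predicted_f0, measured_f0))
    return math.sqrt(total / n)


# Illustrative values only.
predicted = [120.0, 135.0, 150.0, 140.0]
measured = [123.0, 131.0, 150.0, 152.0]
error_hz = rms_error(predicted, measured)
```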
20Results are stable under small changes in the
model.
This model allows extra freedom: different tones are allowed to define their targets differently.
Costs for misinterpreting different syllables.
This model allows less freedom: all tones have the same type of target.
The two models have words defined by different labelers.
21Model parameters
Cantonese
Phrasing is marked in speech.
Cantonese data courtesy of Prof. Tan Lee
Mandarin
22Metrical patterns inside words (Mandarin)
The metrical pattern controls how the cost-of-misinterpretation is split up inside a word. Each syllable is marked with a symbol. The vertical position is proportional to log(s) for each syllable, so higher syllables have larger s and will be executed more carefully. For 4-syllable words, the error bars are shown by the pairs of arrows.
Normal segmentation of characters into words.
Random segmentation of characters into words. Note that the metrical pattern disappears, showing that we are measuring something real that is tied to words.
23Another nice property
- The cost-of-misinterpretation parameter for a syllable is correlated with the mutual information with the preceding syllable:
- r = −0.175
- >95% confidence
- Pitch patterns are implemented
- sloppily for syllables that are unsurprising, and
- precisely for surprising ones.
- (Mutual information values come from a database of 15000 newspaper sentences. Syllable identity was defined by phoneme content and tone.)
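A correlation of this kind could be computed as in the sketch below. It assumes that the "mutual information with the preceding syllable" is the pointwise mutual information of each adjacent syllable pair, estimated from bigram counts; that reading, the tiny corpus, and the cost values are all illustrative assumptions, not the study's data.

```python
import math
from collections import Counter


def pointwise_mutual_information(syllables):
    """Map each adjacent pair (a, b) to log2 of P(a, b) / (P(a) * P(b))."""
    unigrams = Counter(syllables)
    bigrams = Counter(zip(syllables, syllables[1:]))
    n_uni = len(syllables)
    n_bi = len(syllables) - 1
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy corpus: syllable identity is phoneme content plus tone.
corpus = ["ma1", "shi4", "ma1", "shi4", "de0", "ma1"]
pmi = pointwise_mutual_information(corpus)
# Mutual information of each syllable (from the second on) with its predecessor.
mi_per_syllable = [pmi[(a, b)] for a, b in zip(corpus, corpus[1:])]
# Hypothetical fitted costs for the same syllables.
costs = [0.8, 1.5, 0.9, 0.5, 1.0]
r = pearson_r(mi_per_syllable, costs)
```

A negative r in such an analysis means that syllables that are predictable from their predecessor tend to receive lower costs, which matches the "sloppy when unsurprising" pattern described above.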
24Conclusion
- Models with motor planning capture important aspects of speech.
- They allow a very compact representation of complex behaviors.
- Intonation is represented as
- a small set of discrete symbols, in sequence,
- modulated by a cost-of-misinterpretation.
- The cost-of-misinterpretation parameter seems real:
- Similar across languages
- Matches language structure
- This model can be applied broadly:
- Two dialects of Chinese
- Some aspects of English
- Separating different singing and speaking styles from the content
- See http://kochanski.org/papers.