Title: The Vocal Joystick: a voice-based human-computer interface for individuals with motor impairments
Slide 1: The Vocal Joystick: a voice-based human-computer interface for individuals with motor impairments
- J. Bilmes, X. Li, J. Malkin, K. Kilanski, R. Wright, K. Kirchhoff, A. Subramanya, S. Harada, J. Landay, P. Dowden, H. Chizeck
- University of Washington, Seattle
Slide 2: Outline
- Voice-based human-computer interfaces
- What: a new solution, the Vocal Joystick
- How it works
  - classification
  - adaptation
  - acceleration
- Video demonstrations
- User studies
Slide 3: Motivation
- There is a significant population of individuals with poor (or no) motor abilities who nonetheless have perfect (or good) use of their voice.
- Many devices exist for their use: sip-and-puff switches (similar to Morse code), head-tracking mice, eye-tracking mice, etc.
- [Video: eye-tracking mouse]
- [Video: head-mouse]
Slide 4: Issues with existing technology
- Can be expensive, requiring special-purpose hardware
- Might not be the most efficient (leading to user frustration)
- When voice-based, it might not use the full capabilities of the human voice
  - reduced communication bandwidth
  - users with (even not quite) full voice control can do more
- Standard speech recognition is non-ideal for continuous control (e.g., mouse movement, robotic limb control). Imagine saying "move left", "move up", etc.
Slide 5: The Vocal Joystick
- The Vocal Joystick: use the voice to produce real-time continuous control signals for standard computing devices and robotic arms.
- The analogy of a joystick:
  - a small number of discrete commands (button presses) for simple tasks, modality switches, etc.
  - multiple simultaneous continuous degrees of freedom, controlled by continuous aspects of your voice (pitch, amplitude, vowel quality, vibrato)
Slide 6: Design Goals
- easy to learn and remember (by the user)
- keep cognitive load at a minimum
- easy to speak (reduce vocal strain)
- easy to recognize (as noise-robust and non-confusable as possible)
- exploitive: use the full capabilities of the human vocal apparatus
- universal: attempt to use vocal characteristics that minimize the chance that regional languages/dialects preclude its use
- complementary: can be used jointly with existing speech recognition
- computationally cheap: leave enough computational headroom for other important applications to run
- infrastructure: like a library, easy to incorporate into applications
Slide 7: The VJ-Mouse
- The long-term goal of the project is to voice-control arbitrary systems with a multi-dimensional continuous input space.
- So far, we have mostly concentrated on a VJ-controlled mouse (which is still quite general).
- This allows us to perform a variety of tasks on a standard WIMP desktop (mouse movement and mouse clicks, and thus web browsing, slider control, some video games, Dasher typing, etc.).
- Recent work also shows a simple simulated robotic arm.
Slide 8: Vocal Joystick Mapping
- Standard mice map physical space to physical space.
- Here, we must map vocal-tract articulatory change to physical space.
Slide 9: Vocal Joystick: On Our Mapping
- The mapping may seem arbitrary
  - certainly, rotational permutations are likely to be equally preferred after practice.
- The mapping is customizable.
- Ultimately, we need a user study to see, relative to the typical mouse-gesture workload, what mapping, if any, is best.
Slide 10: Vocal Joystick Engine
- Adaptation and acceleration are crucial.
Slide 11: Vocal Joystick Mouse-Control Demonstration
- View movie
- Browsing a news website
- Playing video games
- Google Maps
- Training/visualization tool
Slide 12: How it works
- A number of problems need to be solved:
  - VJ vowel classifier: discriminative, and for high classification accuracy, apply rapid adaptation (as in speech recognition) to the VJ's vowel classifier.
  - Acceleration: use loudness to control speed, and build a mapping from intentional loudness to rate of positional change on the 2-D screen.
Slide 13: Goals for the VJ's vowel classifier
- Real-time performance
  - latency should be no more than reaction time (about 10-50 ms)
- Classification accuracy: high and robust!
- A speaker-independent system is likely to perform worse
  - adaptation is key
- The adaptation algorithm should be:
  - fast: the user shouldn't spend lots of time enrolling
  - accurate (e.g., discriminative or max-margin)
  - no more resources (compute/memory) than the speaker-independent system
Slide 14: Two new ideas
- 1. Maximum-margin MLP
- 2. Max-margin based adaptation
Slide 15: Standard MLP Training
- Forward pass
  - Given input x, matrix-multiply by W_ih and apply a non-linearity (sigmoid) φ, then matrix-multiply by W_ho and apply the final non-linearity (softmax) to produce the output.
- Training: given target t, minimize a cost between output and target.
- Update rule: gradient descent (a minimal sketch follows below).
- Not convex; finds only a local optimum. Adjusts both weight matrices.
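As a concrete illustration, here is a minimal NumPy sketch of the forward pass and gradient-descent update just described; the matrix names W_ih and W_ho, the learning rate, and the toy dimensions are illustrative stand-ins, not the VJ engine's actual code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, W_ih, W_ho):
    """Forward pass: input -> sigmoid hidden layer -> softmax output."""
    h = sigmoid(W_ih @ x)      # hidden activations, phi(x)
    y = softmax(W_ho @ h)      # class posteriors
    return h, y

def sgd_step(x, t, W_ih, W_ho, lr=0.1):
    """One gradient-descent update of both weight matrices for the
    cross-entropy cost with one-hot target vector t."""
    h, y = forward(x, W_ih, W_ho)
    delta_o = y - t                                # output-layer error
    delta_h = (W_ho.T @ delta_o) * h * (1.0 - h)   # back-propagated hidden error
    W_ho -= lr * np.outer(delta_o, h)
    W_ih -= lr * np.outer(delta_h, x)
    return W_ih, W_ho

# Toy usage: 182-dim input frame, 50 hidden units, 4 vowel classes.
rng = np.random.default_rng(0)
W_ih = rng.normal(0, 0.1, (50, 182))
W_ho = rng.normal(0, 0.1, (4, 50))
x, t = rng.normal(size=182), np.eye(4)[2]
W_ih, W_ho = sgd_step(x, t, W_ih, W_ho)
```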
Slide 16: Standard Kernel Training
- [Figure: linear separation example for the vowel /ae/]
- Kernel (support-vector) training method with a linear kernel ⟨w, x_t⟩
- Finds the optimal (max-margin, low VC-dimension) separating hyperplane
- Convex optimization
- Complexity lives in the kernel
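For comparison, a tiny scikit-learn sketch of linear-kernel SVM training; the two-class toy data and the value of C are placeholders, not the VJ vowel corpus.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in data: two vowel classes in a 2-D feature space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (50, 2)), rng.normal(1, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Linear-kernel SVM: a convex problem whose solution is the
# maximum-margin separating hyperplane; only the support vectors matter.
svm = SVC(kernel="linear", C=1.0).fit(X, y)
print(len(svm.support_))  # number of support vectors
```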
Slide 17: Hybrid MLP-SVM Classifier
- Approach (sketched below)
  - Train the MLP parameters using gradient descent.
  - Use the resulting MLP's input-to-hidden layer (the nonlinear mapping φ(x)) as the input-to-feature mapping for SVM training.
  - Replace the MLP's hidden-to-output layer (linear classifiers) by the optimal-margin hyperplane.
- Advantages
  - Unique optimal solution for the last-layer parameters (convex)
  - Same optimal-separating-hyperplane guarantees
  - The nonlinear feature mapping is implicitly optimized in the form of a kernel (kernel learning)
  - Amenable to max-margin adaptation (described a few slides from now); the SVs provide flexibility for regularization
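A hedged sketch of this hybrid recipe using scikit-learn stand-ins; the random placeholder data, model classes, and hyperparameters are assumptions for illustration only, not the original implementation.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy stand-in for stacked MFCC frames and vowel labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 182))
y = rng.integers(0, 4, size=400)

# Step 1: train the MLP as usual (gradient-based training of both layers).
mlp = MLPClassifier(hidden_layer_sizes=(50,), activation="logistic",
                    max_iter=300).fit(X, y)

# Step 2: freeze the input-to-hidden layer and use it as the
# feature map phi(x) -- a data-driven "kernel".
def phi(X):
    return sigmoid(X @ mlp.coefs_[0] + mlp.intercepts_[0])

# Step 3: replace the hidden-to-output layer with a max-margin
# (convex) linear classifier trained on phi(X).
svm = LinearSVC(C=1.0).fit(phi(X), y)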
Slide 18: Speaker-Independent Classifiers and Performance
- GMM: 16 mixtures
- Two-layer MLP
  - Input: 182 MFCC features from 7 consecutive frames
  - Hidden: 50 nodes
  - 7 and 50 were empirically determined to be best
  - Output: 4 or 8 classes
- Results (error rate, percent)
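A sketch of how the two speaker-independent baselines described here could be instantiated with scikit-learn stand-ins; the class names and the per-class GMM arrangement are assumptions, and fitting code is omitted.

```python
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier

# GMM baseline: one 16-component mixture per vowel class
# (class-conditional likelihoods, Bayes rule at classification time).
gmm_per_class = {c: GaussianMixture(n_components=16) for c in range(4)}

# MLP baseline: 182 inputs (MFCCs stacked over 7 consecutive frames),
# 50 sigmoid hidden units, and a 4- or 8-way softmax output.
mlp = MLPClassifier(hidden_layer_sizes=(50,), activation="logistic")
```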
Slide 19: MLP-SVM Adaptation
- Approach
  - Fix the input-to-hidden layer and adapt the last layer.
  - Interpolation is achieved by a weighted combination of the trained support vectors and the new adaptation data.
- Adaptation objective (one plausible form is sketched below):
  - weight ρ_t for the trained SVs
  - weight 1 for the adaptation data
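The objective itself is not in the transcript; a standard weighted soft-margin SVM objective matching this description (ρ_t on the speaker-independent support vectors, weight 1 on adaptation examples) would be:

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;
  \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
  + C \Big( \sum_{t \in \mathrm{SV}} \rho_t\, \xi_t
          + \sum_{n \in \mathrm{Adapt}} \xi_n \Big)
\quad \text{s.t.} \quad
  y_i \big( \langle \mathbf{w}, \phi(\mathbf{x}_i) \rangle + b \big) \ge 1 - \xi_i,
  \qquad \xi_i \ge 0 .
```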
Slide 20: Kernel Adaptation Strategy
- Remove all data points but the support vectors.
- Remove support vectors that are too close.
- Add the adaptation data.
- Train using the max-margin criterion (a convex problem), producing a speaker-dependent separating hyperplane (see the sketch below).
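A sketch of the four steps on toy binary data, using scikit-learn's sample_weight to down-weight the retained speaker-independent support vectors; the data, the margin threshold, and the weight values are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Toy speaker-independent (SI) training data and a small amount of
# adaptation data from a new speaker (placeholders for MLP features).
X_si = np.vstack([rng.normal(-1, 0.6, (100, 2)), rng.normal(1, 0.6, (100, 2))])
y_si = np.array([0] * 100 + [1] * 100)
X_ad = np.vstack([rng.normal(-0.7, 0.4, (10, 2)), rng.normal(1.3, 0.4, (10, 2))])
y_ad = np.array([0] * 10 + [1] * 10)

si = SVC(kernel="linear", C=1.0).fit(X_si, y_si)

# 1) keep only the SI support vectors
sv_X, sv_y = X_si[si.support_], y_si[si.support_]

# 2) drop SVs that are too close to the separating hyperplane
#    (hard threshold d > 0; the value 0.25 is purely illustrative)
keep = np.abs(si.decision_function(sv_X)) >= 0.25
sv_X, sv_y = sv_X[keep], sv_y[keep]

# 3) pool the retained SVs with the adaptation data, weighting SVs
#    by rho < 1 and adaptation examples by 1
X_new = np.vstack([sv_X, X_ad])
y_new = np.concatenate([sv_y, y_ad])
w_new = np.concatenate([np.full(len(sv_X), 0.5), np.ones(len(X_ad))])

# 4) re-solve the convex max-margin problem to obtain a
#    speaker-dependent separating hyperplane
sd = SVC(kernel="linear", C=1.0).fit(X_new, y_new, sample_weight=w_new)
```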
Slide 21: Error-rate results on VJ data
- [Figure: error rate vs. amount of adaptation data (seconds)]
Slide 22: Acceleration in the Vocal Joystick
- Both loudness and pitch were initially considered as candidates (and were implemented) to control speed.
- Pitch ended up being less reliable (even using a state-of-the-art pitch tracker).
- People naturally tended to vocalize more softly when wishing to move smaller distances (and vice versa).
- Loudness ended up working better.
Slide 23: Acceleration in the Vocal Joystick
- Compute the direction value d_j,
  where v_i is the directional unit vector for classifier output i, e_j is the unit vector in direction j ∈ {x, y}, and p_i is the classifier output probability for class i at time t.
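The equation itself did not survive the transcript; a plausible reconstruction from the definitions above, as a probability-weighted sum of projections, is (an assumption, not necessarily the exact VJ formula):

```latex
d_j(t) \;=\; \sum_i p_i(t)\, \langle \mathbf{v}_i, \mathbf{e}_j \rangle,
\qquad j \in \{x, y\}.
```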
Slide 24: Acceleration in the Vocal Joystick
- Next, compute the acceleration scalar s_j,
  where E is the current vowel energy, Ē_i is the average vowel-i energy, and f() and g() are mapping functions (from intentional loudness to its desired effect).
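The expression is again missing from the transcript; one form consistent with the description, normalizing the current energy by a probability-weighted average of the per-vowel energies before applying the mapping functions, would be (assumed, for illustration):

```latex
s_j \;=\; g\!\left( f\!\left( \frac{E}{\sum_i p_i\, \bar{E}_i} \right) \right).
```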
Slide 25: Acceleration in the Vocal Joystick
- The final velocity in direction j is V_j d_j,
  where b and γ are tuning parameters (so far, the best empirically determined values are 1.0 and 0.6, respectively).
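Only the parameter values 1.0 and 0.6 survive in the transcript; treating V_j as a scaled power of the acceleration scalar is one plausible, assumed reading:

```latex
V_j \;=\; b\, s_j^{\,\gamma},
\qquad \text{velocity}_j \;=\; V_j\, d_j,
\qquad b = 1.0,\; \gamma = 0.6 .
```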
Slide 26: User Study: VJ vs. modern desktop mouse
- An earlier version of the VJ engine was used.
- Compared two tasks:
  - link: navigate the web through a prescribed set of links
  - map: use Google Maps to navigate from a USA map to a University of Washington map.
Slide 27: User Study: VJ vs. mouse
- Task-completion time results:
  - link: VJ is about 4 times slower; map: VJ is about 7.5 times slower.
Slide 28: User Study: VJ vs. modern eye-tracking mouse
- Compared two tasks:
  - Target acquisition: how quickly can you move from a starting object to another target object and click on it? Variations as a function of distance, target size, and relative angle.
  - Web browsing: similar to the previous task (navigate a prescribed set of links).
Slide 29: Target Acquisition (VJ vs. ET)
Slide 30: Web Browsing Task (VJ vs. ET)
Slide 31: Summary
- A new voice-based human-computer interface for individuals with motor impairments.
- Uses continuous aspects of the human voice to effect continuous movement of on-screen devices.
- New classification, rapid-adaptation, and acceleration algorithms.
- It appears to work!
Slide 32: Questions?
Slide 34: Standard MLP Training
- Forward pass
  - Input x is presented.
  - Matrix-multiply by W_ih, followed by the non-linearity (sigmoid) φ.
  - Matrix-multiply by W_ho, followed by the final non-linearity (softmax).
  - Final output.
- Training: given the target pattern t, propagate the delta (y - t) back through the network, using gradient descent to update the two weight matrices.
Slide 35: Outline: the Why, What, and Results
- The Why
  - The Vocal Joystick project at the University of Washington
  - continuous control with your voice (mice, robotic arms)
- The What
  - Comparison of adaptation strategies for Gaussian mixture models (GMMs), multi-layer perceptrons (MLPs), and max-margin-trained MLPs.
- The Results
  - Max-margin trained and adapted MLPs outperform standard adaptation methods (maximum-likelihood linear regression (MLLR) and gradient descent (GD)).
Slide 36: Maximum Margin Learning and Adaptation of MLP Classifiers
- Our initial solution: this paper
- Xiao Li, Jeff Bilmes and Jonathan Malkin
- Signal, Speech, and Language Interpretation Laboratory (SSLI-LAB)
- Department of Electrical Engineering
- University of Washington, Seattle
Slide 37: Database
- Vocal Joystick Vowel Database (our own collection)
- Constant-vowel recordings with different
  - duration
  - loudness
  - pitch
- 15 speakers (out of 40) used for the training and test sets. Each speaker has 18 utterances for each vowel.
- Training set: 10 speakers
- Test set: 5 speakers
Slide 38: Adaptation Parameters
- The weight ρ_t is determined by how close an SV is to the adaptation-data distribution.
- Use an SI support vector if it is far enough from the margin.
- Use a hard threshold d > 0, which controls the trade-off between the SI model and the adaptation data.
- In the experiments so far, the coefficient C during adaptation is the same as that used in training
  - this need not be the best assumption, however!
Slide 39: Preliminary Adaptation Experiments
- Vary the amount of adaptation data
  - 1, 2 and 3 utterances (1.2, 1.8 and 3.6 s)
- d determined on the eval set
  - 4-class case: choose all training support vectors
  - 8-class case: choose about half of the support vectors
- Comparison (using the same C as in training)
  - GMM: MLLR
  - MLP: gradient-descent adaptation of the 2nd layer
  - MLP-SVM: SV-based adaptation
Slide 40: Database for adaptation
- Training set: 10 speakers
- Test set: 5 speakers; for each speaker:
  - the 18 utterances for each vowel are divided into 6 subsets
  - adapt on each subset and evaluate on the rest
  - the error rate is an average over the 6 subsets
- The final error rate is an average over the 5 speakers, and hence over 30 subsets for each vowel (averaged again to get the final number).
Slide 41: Previous Work
- Explicitly model the source of variation
  - Vocal tract length normalization (Cohen 94, Lin 95, Eide 96, Welling 02)
- Statistical methods to adapt the classifier itself
  - Gaussian mixture models (GMMs): maximum-likelihood linear regression, MLLR (Gales 96), MAP (Gauvain 94), Eigenvoice (Kuhn 00)
  - Multilayer perceptrons (MLPs):
    - adding speaker-dependent units (Neto 95, Strom 95)
    - re-training part of the last layer (Stadermann 05)
    - adding an additional input layer (Abrash 95)
  - Support vector machines (SVMs): incremental learning approach (Matic 93, Peng 02)
Slide 42: Vocal Joystick 3-joint Robot-Arm Control
Slide 43: Issues
- A number of issues need to be resolved:
  - how to do extremely accurate real-time classification of vowels mixed in with discrete commands, plus amplitude/energy detection, without using up all computing resources
  - how should acceleration (as in a standard mouse) be generalized to the vocal-tract-to-2D-space mapping?
Slide 44: Summary
- Discussion
  - The performance of an MLP classifier can be enhanced by applying maximum-margin training to the last layer.
  - The SVs can be combined, with weights, with the adaptation data to retrain the MLP's last layer for adaptation.
  - Future work will present new ways to adapt MLPs in the max-margin framework that, so far, work even better!
Slide 45: MLP-Kernel Training
- [Figure: linear separation example for the vowel /ae/]
- Kernel (support-vector) training method with a linear kernel
- The only difference is the use of the MLP-optimized input-to-hidden layer as the input-to-feature-space mapping in the SVM.
Slide 46: Two new ideas used here
- 1. Maximum-margin MLP
  - Use the input layer of an already-trained MLP as the input-to-feature-space mapping φ(x).
  - Given this mapping (a data-driven kernel), train the optimal separating hyperplane corresponding to the 2nd layer of the MLP.
- 2. Max-margin based adaptation
  - Keep the support vectors from a speaker-independent trained system (as above), and add (at most) only them to the set of adaptation training data.
  - Weight the speaker-independent support vectors depending on the amount of adaptation data.