Title: The Vocal Joystick: a voice-based human-computer interface for individuals with motor impairments
Slide 1: The Vocal Joystick: a voice-based human-computer interface for individuals with motor impairments
- J. Bilmes, X. Li, J. Malkin, K. Kilanski, R. Wright, K. Kirchhoff, A. Subramanya, S. Harada, J. Landay, P. Dowden, H. Chizeck
- University of Washington, Seattle
Slide 2: Outline
- Voice-based human-computer interfaces
- What: a new solution, the Vocal Joystick
- How it works
  - classification
  - adaptation
  - acceleration
- Video demonstrations
- User studies
Slide 3: Motivation
- There is a significant population of individuals with poor (or no) motor abilities who nonetheless have perfect (or good) use of their voice.
- Many devices exist for their use: sip-and-puff switches (similar to Morse code), head-tracking mice, eye-tracking mice, etc.
- [Video: eye-tracking mouse]
- [Video: head-mouse]
Slide 4: Issues with existing technology
- Can be expensive, requiring special-purpose hardware
- Might not be the most efficient (leading to user frustration)
- When voice-based, it might not use the full capabilities of the human voice
  - reduced communication bandwidth
  - users with (even not quite) full voice control can do more
- Standard speech recognition is non-ideal for continuous control (e.g., mouse movement, robotic limb control). Imagine saying "move left", "move up", etc.
Slide 5: The Vocal Joystick
- The Vocal Joystick: use the voice to produce real-time continuous control signals for standard computing devices and robotic arms.
- The analogy of a joystick:
  - a small number of discrete commands (button presses) for simple tasks, modality switches, etc.
  - multiple simultaneous continuous degrees of freedom, controlled by continuous aspects of your voice (pitch, amplitude, vowel quality, vibrato)
Slide 6: Design Goals
- easy to learn and remember (by the user)
- keep cognitive load at a minimum
- easy to speak (reduce vocal strain)
- easy to recognize (as noise-robust and non-confusable as possible)
- exploitive: use the full capabilities of the human vocal apparatus
- universal: attempt to use vocal characteristics that minimize the chance that regional languages/dialects preclude its use
- complementary: can be used jointly with existing speech recognition
- computationally cheap: leave enough computational headroom for other important applications to run
- infrastructure: like a library, easy to incorporate into applications
Slide 7: The VJ-Mouse
- The long-term goal of the project is to voice-control arbitrary systems with a multi-dimensional continuous input space.
- So far, we have mostly concentrated on a VJ-controlled mouse (which is still quite general).
- This allows us to perform a variety of tasks on a standard WIMP desktop (mouse movement and mouse clicks, and thus web browsing, slider control, some video games, Dasher typing, etc.).
- Recent work also shows a simple simulated robotic arm.
Slide 8: Vocal Joystick Mapping
- Standard mice map physical space to physical space.
- Here, we must map vocal-tract articulatory change to physical space.
Slide 9: Vocal Joystick: On Our Mapping
- The mapping may seem arbitrary
  - certainly, rotational permutations are likely to be equally preferred after practice.
- The mapping is customizable.
- Ultimately, we need a user study to see, relative to the typical mouse-gesture workload, what mapping, if any, is best.
Slide 10: Vocal Joystick Engine
- Adaptation and acceleration are crucial.
Slide 11: Vocal Joystick Mouse-Control Demonstration
- View movie
- Browsing a news website
- Playing video games
- Google Maps
- Training/visualization tool
Slide 12: How it works
- A number of problems need to be solved:
  - VJ vowel classifier: discriminative, and for high classification accuracy, apply rapid adaptation (as in speech recognition) to the VJ's vowel classifier.
  - Acceleration: use loudness to control speed, and build a mapping from intentional loudness to rate of positional change on the 2-D screen.
Slide 13: Goals for the VJ's vowel classifier
- Real-time performance
  - latency should be no more than reaction time (about 10-50 ms)
- Classification accuracy: high and robust!
- A speaker-independent system is likely to perform worse
  - adaptation is key
- The adaptation algorithm should be:
  - fast: the user shouldn't spend lots of time enrolling
  - accurate (e.g., discriminative or max-margin)
  - no more resources (compute/memory) than the speaker-independent system
Slide 14: Two new ideas
- 1. Maximum-margin MLP
- 2. Max-margin based adaptation
Slide 15: Standard MLP Training
- Forward pass
  - Given input x, matrix-multiply by W_ih and apply a non-linearity (sigmoid) φ, then matrix-multiply by W_ho and apply the final non-linearity (softmax) to produce the output.
- Training: given target t, minimize a cost between output and target.
- Update rule: gradient descent (a minimal sketch follows below).
- Not convex; finds only a local optimum. Adjusts both weight matrices.
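As a concrete illustration, here is a minimal NumPy sketch of the forward pass and gradient-descent update just described; the matrix names W_ih and W_ho, the learning rate, and the toy dimensions are illustrative stand-ins, not the VJ engine's actual code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, W_ih, W_ho):
    """Forward pass: input -> sigmoid hidden layer -> softmax output."""
    h = sigmoid(W_ih @ x)      # hidden activations, phi(x)
    y = softmax(W_ho @ h)      # class posteriors
    return h, y

def sgd_step(x, t, W_ih, W_ho, lr=0.1):
    """One gradient-descent update of both weight matrices for the
    cross-entropy cost with one-hot target vector t."""
    h, y = forward(x, W_ih, W_ho)
    delta_o = y - t                                # output-layer error
    delta_h = (W_ho.T @ delta_o) * h * (1.0 - h)   # back-propagated hidden error
    W_ho -= lr * np.outer(delta_o, h)
    W_ih -= lr * np.outer(delta_h, x)
    return W_ih, W_ho

# Toy usage: 182-dim input frame, 50 hidden units, 4 vowel classes.
rng = np.random.default_rng(0)
W_ih = rng.normal(0, 0.1, (50, 182))
W_ho = rng.normal(0, 0.1, (4, 50))
x, t = rng.normal(size=182), np.eye(4)[2]
W_ih, W_ho = sgd_step(x, t, W_ih, W_ho)
```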
Slide 16: Standard Kernel Training
- [Figure: linear separation example for the vowel /ae/]
- Kernel (support-vector) training method with a linear kernel ⟨w, x_t⟩
- Finds the optimal (max-margin, low VC-dimension) separating hyperplane
- Convex optimization
- Complexity lives in the kernel
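For comparison, a tiny scikit-learn sketch of linear-kernel SVM training; the two-class toy data and the value of C are placeholders, not the VJ vowel corpus.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in data: two vowel classes in a 2-D feature space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (50, 2)), rng.normal(1, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Linear-kernel SVM: a convex problem whose solution is the
# maximum-margin separating hyperplane; only the support vectors matter.
svm = SVC(kernel="linear", C=1.0).fit(X, y)
print(len(svm.support_))  # number of support vectors
```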
Slide 17: Hybrid MLP-SVM Classifier
- Approach (sketched below)
  - Train the MLP parameters using gradient descent.
  - Use the resulting MLP's input-to-hidden layer (the nonlinear mapping φ(x)) as the input-to-feature mapping for SVM training.
  - Replace the MLP's hidden-to-output layer (linear classifiers) by the optimal-margin hyperplane.
- Advantages
  - Unique optimal solution for the last-layer parameters (convex)
  - Same optimal-separating-hyperplane guarantees
  - The nonlinear feature mapping is implicitly optimized in the form of a kernel (kernel learning)
  - Amenable to max-margin adaptation (described a few slides from now); the SVs provide flexibility for regularization
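A hedged sketch of this hybrid recipe using scikit-learn stand-ins; the random placeholder data, model classes, and hyperparameters are assumptions for illustration only, not the original implementation.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy stand-in for stacked MFCC frames and vowel labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 182))
y = rng.integers(0, 4, size=400)

# Step 1: train the MLP as usual (gradient-based training of both layers).
mlp = MLPClassifier(hidden_layer_sizes=(50,), activation="logistic",
                    max_iter=300).fit(X, y)

# Step 2: freeze the input-to-hidden layer and use it as the
# feature map phi(x) -- a data-driven "kernel".
def phi(X):
    return sigmoid(X @ mlp.coefs_[0] + mlp.intercepts_[0])

# Step 3: replace the hidden-to-output layer with a max-margin
# (convex) linear classifier trained on phi(X).
svm = LinearSVC(C=1.0).fit(phi(X), y)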
Slide 18: Speaker-Independent Classifiers and Performance
- GMM: 16 mixtures
- Two-layer MLP
  - Input: 182 MFCC features from 7 consecutive frames
  - Hidden: 50 nodes
  - 7 and 50 were empirically determined to be best
  - Output: 4 or 8 classes
- Results (error rate, percent)
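A sketch of how the two speaker-independent baselines described here could be instantiated with scikit-learn stand-ins; the class names and the per-class GMM arrangement are assumptions, and fitting code is omitted.

```python
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier

# GMM baseline: one 16-component mixture per vowel class
# (class-conditional likelihoods, Bayes rule at classification time).
gmm_per_class = {c: GaussianMixture(n_components=16) for c in range(4)}

# MLP baseline: 182 inputs (MFCCs stacked over 7 consecutive frames),
# 50 sigmoid hidden units, and a 4- or 8-way softmax output.
mlp = MLPClassifier(hidden_layer_sizes=(50,), activation="logistic")
```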
Slide 19: MLP-SVM Adaptation
- Approach
  - Fix the input-to-hidden layer and adapt the last layer.
  - Interpolation is achieved by a weighted combination of the trained support vectors and the new adaptation data.
- Adaptation objective (one plausible form is sketched below):
  - weight ρ_t for the trained SVs
  - weight 1 for the adaptation data
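The objective itself is not in the transcript; a standard weighted soft-margin SVM objective matching this description (ρ_t on the speaker-independent support vectors, weight 1 on adaptation examples) would be:

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;
  \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
  + C \Big( \sum_{t \in \mathrm{SV}} \rho_t\, \xi_t
          + \sum_{n \in \mathrm{Adapt}} \xi_n \Big)
\quad \text{s.t.} \quad
  y_i \big( \langle \mathbf{w}, \phi(\mathbf{x}_i) \rangle + b \big) \ge 1 - \xi_i,
  \qquad \xi_i \ge 0 .
```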
Slide 20: Kernel Adaptation Strategy
- Remove all data points but the support vectors.
- Remove support vectors that are too close.
- Add the adaptation data.
- Train using the max-margin criterion (a convex problem), producing a speaker-dependent separating hyperplane (see the sketch below).
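A sketch of the four steps on toy binary data, using scikit-learn's sample_weight to down-weight the retained speaker-independent support vectors; the data, the margin threshold, and the weight values are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Toy speaker-independent (SI) training data and a small amount of
# adaptation data from a new speaker (placeholders for MLP features).
X_si = np.vstack([rng.normal(-1, 0.6, (100, 2)), rng.normal(1, 0.6, (100, 2))])
y_si = np.array([0] * 100 + [1] * 100)
X_ad = np.vstack([rng.normal(-0.7, 0.4, (10, 2)), rng.normal(1.3, 0.4, (10, 2))])
y_ad = np.array([0] * 10 + [1] * 10)

si = SVC(kernel="linear", C=1.0).fit(X_si, y_si)

# 1) keep only the SI support vectors
sv_X, sv_y = X_si[si.support_], y_si[si.support_]

# 2) drop SVs that are too close to the separating hyperplane
#    (hard threshold d > 0; the value 0.25 is purely illustrative)
keep = np.abs(si.decision_function(sv_X)) >= 0.25
sv_X, sv_y = sv_X[keep], sv_y[keep]

# 3) pool the retained SVs with the adaptation data, weighting SVs
#    by rho < 1 and adaptation examples by 1
X_new = np.vstack([sv_X, X_ad])
y_new = np.concatenate([sv_y, y_ad])
w_new = np.concatenate([np.full(len(sv_X), 0.5), np.ones(len(X_ad))])

# 4) re-solve the convex max-margin problem to obtain a
#    speaker-dependent separating hyperplane
sd = SVC(kernel="linear", C=1.0).fit(X_new, y_new, sample_weight=w_new)
```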
Slide 21: Error-rate results on VJ data
- [Figure: error rate vs. amount of adaptation data (seconds)]
Slide 22: Acceleration in the Vocal Joystick
- Both loudness and pitch were initially considered as candidates (and were implemented) to control speed.
- Pitch ended up being less reliable (even using a state-of-the-art pitch tracker).
- People naturally tended to vocalize more softly when wishing to move smaller distances (and vice versa).
- Loudness ended up working better.
Slide 23: Acceleration in the Vocal Joystick
- Compute the direction value d_j,
  where v_i is the directional unit vector for classifier output i, e_j is the unit vector in direction j ∈ {x, y}, and p_i is the classifier output probability for class i at time t.
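The equation itself did not survive the transcript; a plausible reconstruction from the definitions above, as a probability-weighted sum of projections, is (an assumption, not necessarily the exact VJ formula):

```latex
d_j(t) \;=\; \sum_i p_i(t)\, \langle \mathbf{v}_i, \mathbf{e}_j \rangle,
\qquad j \in \{x, y\}.
```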
Slide 24: Acceleration in the Vocal Joystick
- Next, compute the acceleration scalar s_j,
  where E is the current vowel energy, Ē_i is the average vowel-i energy, and f() and g() are mapping functions (from intentional loudness to its desired effect).
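The expression is again missing from the transcript; one form consistent with the description, normalizing the current energy by a probability-weighted average of the per-vowel energies before applying the mapping functions, would be (assumed, for illustration):

```latex
s_j \;=\; g\!\left( f\!\left( \frac{E}{\sum_i p_i\, \bar{E}_i} \right) \right).
```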
Slide 25: Acceleration in the Vocal Joystick
- The final velocity in direction j is V_j d_j,
  where b and γ are tuning parameters (so far, the best empirically determined values are 1.0 and 0.6, respectively).
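Only the parameter values 1.0 and 0.6 survive in the transcript; treating V_j as a scaled power of the acceleration scalar is one plausible, assumed reading:

```latex
V_j \;=\; b\, s_j^{\,\gamma},
\qquad \text{velocity}_j \;=\; V_j\, d_j,
\qquad b = 1.0,\; \gamma = 0.6 .
```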
Slide 26: User Study: VJ vs. modern desktop mouse
- An earlier version of the VJ engine was used.
- Compared two tasks:
  - link: navigate the web through a prescribed set of links
  - map: use Google Maps to navigate from a USA map to a University of Washington map.
Slide 27: User Study: VJ vs. mouse
- Task-completion time results:
  - link: VJ is about 4 times slower; map: VJ is about 7.5 times slower.
Slide 28: User Study: VJ vs. modern eye-tracking mouse
- Compared two tasks:
  - Target acquisition: how quickly can you move from a starting object to another target object and click on it? Variations as a function of distance, target size, and relative angle.
  - Web browsing: similar to the previous task (navigate a prescribed set of links).
Slide 29: Target Acquisition (VJ vs. ET)
Slide 30: Web Browsing Task (VJ vs. ET)
Slide 31: Summary
- A new voice-based human-computer interface for individuals with motor impairments.
- Uses continuous aspects of the human voice to effect continuous movement of on-screen devices.
- New classification, rapid-adaptation, and acceleration algorithms.
- It appears to work!
Slide 32: Questions?
Slide 34: Standard MLP Training
- Forward pass
  - Input x is presented.
  - Matrix-multiply by W_ih, followed by the non-linearity (sigmoid) φ.
  - Matrix-multiply by W_ho, followed by the final non-linearity (softmax).
  - Final output.
- Training: given the target pattern t, propagate the delta (y - t) back through the network, using gradient descent to update the two weight matrices.
Slide 35: Outline: the Why, What, and Results
- The Why
  - The Vocal Joystick project at the University of Washington
  - continuous control with your voice (mice, robotic arms)
- The What
  - Comparison of adaptation strategies for Gaussian mixture models (GMMs), multi-layer perceptrons (MLPs), and max-margin-trained MLPs.
- The Results
  - Max-margin trained and adapted MLPs outperform standard adaptation methods (maximum-likelihood linear regression (MLLR) and gradient descent (GD)).
Slide 36: Maximum Margin Learning and Adaptation of MLP Classifiers
- Our initial solution: this paper
- Xiao Li, Jeff Bilmes and Jonathan Malkin
- Signal, Speech, and Language Interpretation Laboratory (SSLI-LAB)
- Department of Electrical Engineering
- University of Washington, Seattle
Slide 37: Database
- Vocal Joystick Vowel Database (our own collection)
- Constant-vowel recordings with different
  - duration
  - loudness
  - pitch
- 15 speakers (out of 40) used for the training and test sets. Each speaker has 18 utterances for each vowel.
- Training set: 10 speakers
- Test set: 5 speakers
Slide 38: Adaptation Parameters
- The weight ρ_t is determined by how close an SV is to the adaptation-data distribution.
- Use an SI support vector if it is far enough from the margin.
- Use a hard threshold d > 0, which controls the trade-off between the SI model and the adaptation data.
- In the experiments so far, the coefficient C during adaptation is the same as that used in training
  - this need not be the best assumption, however!
Slide 39: Preliminary Adaptation Experiments
- Vary the amount of adaptation data
  - 1, 2 and 3 utterances (1.2, 1.8 and 3.6 s)
- d determined on the eval set
  - 4-class case: choose all training support vectors
  - 8-class case: choose about half of the support vectors
- Comparison (using the same C as in training)
  - GMM: MLLR
  - MLP: gradient-descent adaptation of the 2nd layer
  - MLP-SVM: SV-based adaptation
Slide 40: Database for adaptation
- Training set: 10 speakers
- Test set: 5 speakers; for each speaker:
  - the 18 utterances for each vowel are divided into 6 subsets
  - adapt on each subset and evaluate on the rest
  - the error rate is an average over the 6 subsets
- The final error rate is an average over the 5 speakers, and hence over 30 subsets for each vowel (averaged again to get the final number).
Slide 41: Previous Work
- Explicitly model the source of variation
  - Vocal tract length normalization (Cohen 94, Lin 95, Eide 96, Welling 02)
- Statistical methods to adapt the classifier itself
  - Gaussian mixture models (GMMs): maximum-likelihood linear regression, MLLR (Gales 96), MAP (Gauvain 94), Eigenvoice (Kuhn 00)
  - Multilayer perceptrons (MLPs):
    - adding speaker-dependent units (Neto 95, Strom 95)
    - re-training part of the last layer (Stadermann 05)
    - adding an additional input layer (Abrash 95)
  - Support vector machines (SVMs): incremental learning approach (Matic 93, Peng 02)
Slide 42: Vocal Joystick 3-joint Robot-Arm Control
Slide 43: Issues
- A number of issues need to be resolved:
  - how to do extremely accurate real-time classification of vowels mixed in with discrete commands, plus amplitude/energy detection, without using up all computing resources
  - how should acceleration (as in a standard mouse) be generalized to the vocal-tract-to-2D-space mapping?
Slide 44: Summary
- Discussion
  - The performance of an MLP classifier can be enhanced by applying maximum-margin training to the last layer.
  - The SVs can be combined, with weights, with the adaptation data to retrain the MLP's last layer for adaptation.
  - Future work will present new ways to adapt MLPs in the max-margin framework that, so far, work even better!
Slide 45: MLP-Kernel Training
- [Figure: linear separation example for the vowel /ae/]
- Kernel (support-vector) training method with a linear kernel
- The only difference is the use of the MLP-optimized input-to-hidden layer as the input-to-feature-space mapping in the SVM.
Slide 46: Two new ideas used here
- 1. Maximum-margin MLP
  - Use the input layer of an already-trained MLP as the input-to-feature-space mapping φ(x).
  - Given this mapping (a data-driven kernel), train the optimal separating hyperplane corresponding to the 2nd layer of the MLP.
- 2. Max-margin based adaptation
  - Keep the support vectors from a speaker-independent trained system (as above), and add (at most) only them to the set of adaptation training data.
  - Weight the speaker-independent support vectors depending on the amount of adaptation data.