Title: Tools for Sound, Speech, and Multi-modal Interaction
1 Tools for Sound, Speech, and Multi-modal Interaction
- Johnny Lee
- 05-830 Advanced UI Software
2 Sound
3 Sound
- Authoring Tools
- Recording, Playback
- SFX libraries
- Editing, Mixing
- MIDI
- Developer Tools
- Software APIs
- FFT libraries
4 Recording Sound
- Most laptops have built-in mono microphones
- (Image: Schoeps microphone)
5 Recording Sound
6 Recording Sound
7 Playing Sound
- Most laptops have built-in speakers
8 Multichannel Audio
- Pro Tools by Digidesign: up to 64 channels of 24-bit, 48 kHz audio I/O
9 Multichannel Audio
11 Sound Libraries
- Sound Ideas (http://www.sound-ideas.com/)
- General 6000
- Hanna-Barbera (http://gs304.sp.cs.cmu.edu/sfx/)
- Lots of other smaller suppliers of stock sound libraries
12 Editing/Mixing Sounds
- LogicAudio, SoundForge, Peak, SoundEdit16, many others
- Edits sound rather like a text editor edits text
- Sophisticated DSP (some real-time)
- Synchronization with video and MIDI support
13 MIDI
- Musical Instrument Digital Interface
- Hardware communication layer
- 5-pin DIN, uni-directional with pass-thru
- Software protocol layer
- MIDI commands are 2-3 bytes (see the sketch below)
- Note specification
- Device configuration (128 controllers)
- Device Control/Synchronization
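To make the 2-3 byte command format concrete, here is a minimal sketch (not from the slides) that pushes a Note On/Note Off pair through Java's built-in javax.sound.midi synthesizer; the channel, note number, and velocity are arbitrary values chosen for illustration.

    import javax.sound.midi.MidiSystem;
    import javax.sound.midi.Receiver;
    import javax.sound.midi.ShortMessage;
    import javax.sound.midi.Synthesizer;

    // Send a 3-byte Note On (status 0x90, note, velocity), wait, then Note Off.
    public class MidiNote {
        public static void main(String[] args) throws Exception {
            Synthesizer synth = MidiSystem.getSynthesizer();
            synth.open();
            Receiver out = synth.getReceiver();

            ShortMessage noteOn = new ShortMessage();
            noteOn.setMessage(ShortMessage.NOTE_ON, 0, 60, 96);  // channel 0, middle C, velocity 96
            out.send(noteOn, -1);                                 // -1 = deliver immediately

            Thread.sleep(500);                                    // hold the note for half a second

            ShortMessage noteOff = new ShortMessage();
            noteOff.setMessage(ShortMessage.NOTE_OFF, 0, 60, 0);
            out.send(noteOff, -1);

            synth.close();
        }
    }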
14 MIDI
- Lots of general-purpose fields
- Simple electronics (2 resistors and a PIC processor)
- Semi-popular option for simple control/robotics applications
15 MOD files
- File size can be tiny if a MIDI synthesizer is used at playback time
- Playback quality depends on the quality of the synthesizer
- MOD files (module format) combine MIDI data with WAV samples to produce high-quality, consistent playback in a relatively small file
17 Software APIs for sound
18 Microsoft DirectX 9.0
- DirectX includes:
- DirectDraw - 2D drawing
- Direct3D - 3D drawing
- DirectInput - input/haptic devices
- DirectPlay - network gaming
- DirectShow - video streams
- DirectSound - wave audio I/O
- DirectMusic - soundtrack management and MIDI
- DirectSetup - DirectX installation routines
19 DirectSound
- WAV capture (see the capture sketch below)
- Multi-channel sound playback
- Full duplex
- 3D specification of sound sources
- Some real-time DSP: Chorus, Compression, Flange, Distortion, Echo, Reverb
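DirectSound itself is a C/C++ COM API and is not reproduced here; as a rough, platform-neutral sketch of the WAV-capture capability listed above, this uses javax.sound.sampled instead, with an assumed 44.1 kHz, 16-bit mono format and a three-second buffer.

    import javax.sound.sampled.AudioFormat;
    import javax.sound.sampled.AudioSystem;
    import javax.sound.sampled.DataLine;
    import javax.sound.sampled.TargetDataLine;

    // Capture roughly three seconds of 16-bit, 44.1 kHz mono audio
    // from the default input device into a byte buffer.
    public class CaptureSketch {
        public static void main(String[] args) throws Exception {
            AudioFormat fmt = new AudioFormat(44100f, 16, 1, true, false);
            DataLine.Info info = new DataLine.Info(TargetDataLine.class, fmt);
            TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
            line.open(fmt);
            line.start();

            byte[] buf = new byte[44100 * 2 * 3];   // 3 s * 2 bytes per sample
            int read = 0;
            while (read < buf.length) {
                read += line.read(buf, read, buf.length - read);
            }
            line.stop();
            line.close();
            // buf now holds raw PCM; it could be wrapped in an
            // AudioInputStream and written out as a WAV file.
        }
    }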
20 DirectMusic
- Coordinates several sound files (MIDI, WAV, etc.) into soundtracks
- Sequencing (timelines, cueing, and synchronization)
- Supports dynamic composition, variation, and transitioning between songs/parts
- Dynamic content authored in DirectMusic Producer
21 DirectMusic
- Compositions can be made with DLS (downloadable sound) files, a cross-platform "smart audio" file format designed for dynamic loading in interactive applications
- DLS = MIDI + WAV for interactive apps
22 MacOS X Core Audio
23 MacOS X Core Audio
- Sound Manager - routines for resource management and playing/recording sound
- AudioToolbox - sophisticated DSP architecture, sequencing/composition
- MIDI Services - device abstraction, control, and patching
- Audio HAL - medium-level I/O access (real-time, low-latency, multi-channel; floating point is standard access)
- IOKit - low-level device access (drivers, hardware)
- Full Java API provided
24 Java
- Basic data structures and routines for loading, playing, and stopping sounds (see the sketch below)
- java.applet.AudioClip
- javax.sound.midi
- javax.sound.midi.spi
- javax.sound.sampled
- javax.sound.sampled.spi
- I/O device access is somewhat limited
- I've been told that synchronization is a problem in Java
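A minimal playback sketch with javax.sound.sampled; the file name is a placeholder and error handling is omitted.

    import java.io.File;
    import javax.sound.sampled.AudioInputStream;
    import javax.sound.sampled.AudioSystem;
    import javax.sound.sampled.Clip;

    // Load a WAV file into a Clip and play it to completion.
    public class PlayClip {
        public static void main(String[] args) throws Exception {
            AudioInputStream in = AudioSystem.getAudioInputStream(new File("beep.wav"));
            Clip clip = AudioSystem.getClip();   // obtain a playback line
            clip.open(in);                       // load the sample data
            clip.start();                        // playback is asynchronous
            Thread.sleep(clip.getMicrosecondLength() / 1000);  // wait until it finishes
            clip.close();
        }
    }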
25 Voice as Sound
- "Voice as Sound: Using Non-verbal Voice Input for Interactive Control." Takeo Igarashi, John F. Hughes. UIST 2001, pp. 155-156
- STFT, FFT analysis
- Extension to SUITEKeys
26 Fourier Transform (FT)
- Simple properties of a sound (e.g. duration, volume) can be obtained by looking at the data file
- More interesting analysis requires some DSP, mainly the Fourier Transform
27 Fourier Transform
- FT extracts the frequency content from a given
segment of audio.
28 Fourier Transform
29 Fast Fourier Transform (FFT)
- FFT is a fast computational algorithm for computing the discrete Fourier transform (DFT); see the sketch below
- Implementations available in most languages
- Good reference source: Numerical Recipes in C
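For intuition about what an FFT library returns, here is a naive O(N^2) DFT magnitude sketch (not an FFT); a real FFT computes the same spectrum in O(N log N).

    // Naive DFT magnitude spectrum, shown only to illustrate what an FFT
    // library computes from a segment of audio samples.
    public class Dft {
        // Returns |X[k]| for k = 0..N-1, given N audio samples x[t].
        public static double[] magnitude(double[] x) {
            int n = x.length;
            double[] mag = new double[n];
            for (int k = 0; k < n; k++) {
                double re = 0, im = 0;
                for (int t = 0; t < n; t++) {
                    double angle = 2 * Math.PI * k * t / n;
                    re += x[t] * Math.cos(angle);
                    im -= x[t] * Math.sin(angle);
                }
                mag[k] = Math.hypot(re, im);   // magnitude of bin k
            }
            return mag;
        }
    }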
30 Speech
31 Speech Synthesis
- Three categories of speech synthesizers
- Articulatory synth - uses a physical model of the physiology of speech production and the physics of sound generation in the vocal apparatus
- Formant synth - acoustic-phonetic approach to synthesis; applies hundreds of filters loosely associated with the movement of articulators using rules
- Concatenative synth - segmental database that reflects the major phonological features of a language; creates smooth transitions and basic processing to match prosodic patterns
- (http://cslu.cse.ogi.edu/HLTsurvey/ch5node4.html)
32 AT&T Natural Voices
- US English, UK English, French, Spanish, German, Korean
- Can build a new voice font from an existing person
- Examples
- Male Voice
- Custom UK English
- Voice Font
- French
33 Phoenix Semantic Frame Parser
- Center for Spoken Language Research, University of Colorado, Boulder
- http://communicator.colorado.edu/phoenix/license.html
- System for processing and parsing natural language
34 Phoenix
35 Phoenix
Details and syntax for creating frames and networks: http://communicator.colorado.edu/phoenix/Phoenix_Manual.pdf
36 Universal Speech Interfaces
Universal Speech Interfaces. Ronald Rosenfeld, Dan Olsen, Alex Rudnicky. Interactions, October 2001, Volume 8, Issue 6
- "In essence, we attempt to do for speech what Palm's Graffiti has done for mobile text entry."
- http://www-2.cs.cmu.edu/usi/USI-manifesto.htm
- Speech is an ambient medium.
- Speech is descriptive rather than referential.
- Speech requires modest physical resources.
- Only speech will scale as digital technology progresses.
- 3 speech interaction techniques: Natural Language (NLI, NLP), Dialog Trees, Command and Control
38 Universal Speech Interfaces
- Look and Feel -> Sound and Say
- Universal Metaphors - familiar ways of doing things across applications
- Universal User Primitives - standard dialog interaction techniques: detection, recovering from error, asking for help, navigation, etc.
- Universal Machine Primitives - standardized machine responses and meanings to increase user understanding
39 Java Speech
- JSAPI - Java Speech API (synthesis sketch below)
- Speech Generation
- Structure Analysis - Java Synthesis Markup Language (JSML)
- Text Pre-Processing - abbreviations, acronyms, numbers (e.g. "1998")
- Text-to-Phoneme Conversion
- Prosody Analysis
- Waveform Production
- Speech Recognition
- Grammar Design - Java Speech Grammar Format (JSGF)
- Signal Processing
- Phoneme Recognition
- Word Recognition
- Result Generation
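A minimal JSAPI synthesis sketch; it assumes some JSAPI engine implementation (FreeTTS is one) is installed and registered, since the API itself ships without an engine.

    import java.util.Locale;
    import javax.speech.Central;
    import javax.speech.synthesis.Synthesizer;
    import javax.speech.synthesis.SynthesizerModeDesc;

    // Speak one sentence through whatever JSAPI synthesizer is registered.
    public class SpeakHello {
        public static void main(String[] args) throws Exception {
            Synthesizer synth = Central.createSynthesizer(
                    new SynthesizerModeDesc(Locale.ENGLISH));
            synth.allocate();                               // acquire engine resources
            synth.resume();                                 // make sure the output queue is running
            synth.speakPlainText("Hello from the Java Speech API", null);
            synth.waitEngineState(Synthesizer.QUEUE_EMPTY); // block until speech finishes
            synth.deallocate();
        }
    }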
40 Windows .NET Speech SDK
- Basically the .NET-ified SAPI 5.1 (Speech API)
- Continuous Speech Recognition (US English, Japanese, and Simplified Chinese)
- Concatenative Speech Synthesis (US English and Simplified Chinese)
- Interface is broken into two components
- Application Programming Interface (API)
- Device Driver Interface (DDI)
41 Windows .NET Speech SDK
- Speech Synthesis API
- ISpVoice::Speak("my text", voice)
- Speech Synthesis DDI
- Parses text into an XML doc
- Calls the TTSEngine
- Manages sound and threading details
42 Windows .NET Speech SDK
- Speech Recognition API
- Define context
- Define grammar
- Request type (dictation or command/control)
- Event is fired when recognized
- Speech Recognition DDI
- Interfacing and configuring the SREngine
- Manages sound and threading details.
43 Windows .NET Speech SDK
- Speech Application Language Tags (SALT) - extension to HTML for speech integration into webpages
- Speech Recognition Grammar Specification (SRGS) - support for field parsing
- Telephony Controls - interfaces with telephone technology to develop voice-only apps
44 MacOS X Speech
- Barely changed since 1996, MacInTalk 3
- US English only
- Full Java API
- Speech Synthesis Manager (PlainTalk)
- algorithmic voice generation
- Speech Recognition Manager
- OS wide push-to-talk Command/Control
- Customizable vocabulary w/scripting
- Uses Language Model grammar
- No dictation support
45 Dragon NaturallySpeaking
- Commercial recognition software
- Dictation
- Command and control
- API available to developers for application integration
- http://www.scansoft.com/naturallyspeaking/
46 Sphinx
- Open source speech recognizer from CMU (http://fife.speech.cs.cmu.edu/sphinx/)
- Auto-builds language model/grammar/vocabulary from example sentences
- CMU-Cambridge Statistical Language Modeling Toolkit - semi-machine-learning algorithms for digesting a large example corpus into a usable model
- Uses the CMU Pronouncing Dictionary
- SphinxTrain - builds new acoustic models
- Audio recording, transcript, pronunciation dictionary/vocabulary, phoneme list
47 SUITEKeys
- Manaris, B., McCauley, R., MacGyvers, V., "An Intelligent Interface for Keyboard and Mouse Control - Providing Full Access to PC Functionality via Speech," Proceedings of the 14th International Florida AI Research Symposium (www.cs.cofc.edu/manaris/)
- Developed for individuals with motor disabilities
- Interface layer that generates keyboard and mouse events for the OS (see the sketch below)
- Recognizes keyboard strokes/operations: backspace, function twelve, control-alt-delete, page down, press, release
- Recognizes mouse buttons and movement: left-click, move down, stop, 2 units above clock, move to 5-18
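SUITEKeys' own injection layer is not reproduced here; as a rough illustration of the kind of OS-level keyboard and mouse events such a layer emits after recognition, java.awt.Robot can synthesize them directly.

    import java.awt.Robot;
    import java.awt.event.InputEvent;
    import java.awt.event.KeyEvent;

    // Inject a few synthetic keyboard and mouse events, roughly what a
    // speech-driven keyboard/mouse layer would emit after recognition.
    public class InjectEvents {
        public static void main(String[] args) throws Exception {
            Robot robot = new Robot();

            // spoken "backspace"
            robot.keyPress(KeyEvent.VK_BACK_SPACE);
            robot.keyRelease(KeyEvent.VK_BACK_SPACE);

            // spoken "move to" plus absolute screen coordinates
            robot.mouseMove(5, 18);

            // spoken "left-click"
            robot.mousePress(InputEvent.BUTTON1_MASK);
            robot.mouseRelease(InputEvent.BUTTON1_MASK);
        }
    }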
48 Suede
Scott R. Klemmer, Anoop K. Sinha, Jack Chen, James A. Landay, Nadeem Aboobaker, Annie Wang. Proceedings of the 13th Annual ACM Symposium on User Interface Software and Technology, November 2000
- Wizard of Oz tool for prototyping speech interfaces
- Allows the developer to quickly generate a state machine representing the possible paths through a speech interface, and stores recorded system responses
- Operator simulates a functional system during evaluation by stepping through the state machine
- Runtime transcripts are recorded for later analysis
50 Multimodal Interaction
51 Multimodal Interaction
- According to Scott: the term "multi-modal interface" usually refers to speech and something else, because speech alone wasn't good enough
- Though it should probably mean more than one (simultaneous?) input modality
- Point, click, gesture, type, speak, write, touch, look, bite, shake, think, sweat, etc. (lots of sensing techniques)
52 Multimodal Interaction
- Lots of things have used them, but there were no real tools, or they weren't simultaneous
- Cohen, P.R., Cheyer, A., Wang, M., and Baeg, S.C. An Open Agent Architecture. AAAI '94 Spring Symposium Series on Software Agents, AAAI (Menlo Park, CA, 1994); reprinted in Readings in Agents, Morgan Kaufmann, 1997, 197-204
- Brad Myers, Robert Malkin, Michael Bett, Alex Waibel, Ben Bostwick, Robert C. Miller, Jie Yang, Matthias Denecke, Edgar Seemann, Jie Zhu, Choon Hong Peck, Dave Kong, Jeffrey Nichols, Bill Scherlis. "Flexi-modal and Multi-Machine User Interfaces," IEEE Fourth International Conference on Multimodal Interfaces, Pittsburgh, PA, October 14-16, 2002, pp. 343-348
53 Multimodal Interfaces
- A common concept is mode-ing or modifying interaction
- Gives extra context for recognizers (e.g. point and speak)
- Multiplies the functionality of an interaction (e.g. ToolStone, left/right/no click)
- Rekimoto, J., Sciammarella, E. (2000). ToolStone: effective use of the physical manipulation vocabularies of input devices. Proceedings of the ACM Symposium on User Interface Software and Technology, pp. 109-117, November 2000
- Also, a need for an input interpretation layer for widgets that can be specified in multiple ways