Title: Tools for Sound, Speech, and Multi-modal Interaction
1 Tools for Sound, Speech, and Multi-modal Interaction
- Johnny Lee
- 05-830 Advanced UI Software
2 Sound
3 Sound
- Authoring Tools
- Recording, Playback
- SFX libraries
- Editing, Mixing
- MIDI
- Developer Tools
- Software APIs
- FFT libraries
4 Recording Sound
- Most laptops have built-in mono microphones
- (Image: Schoeps microphone)
5 Recording Sound
6 Recording Sound
7 Playing Sound
- Most laptops have built-in speakers
8 Multichannel Audio
- Pro Tools by Digidesign: up to 64 channels of 24-bit, 48 kHz audio I/O
9 Multichannel Audio
11 Sound Libraries
- Sound Ideas (http://www.sound-ideas.com/)
- General 6000
- Hanna-Barbera (http://gs304.sp.cs.cmu.edu/sfx/)
- Lots of other smaller suppliers of stock sound libraries
12 Editing/Mixing Sounds
- LogicAudio, SoundForge, Peak, SoundEdit16, many others
- Edits sound rather like a text editor edits text
- Sophisticated DSP (some real-time)
- Synchronization with video and MIDI support
13 MIDI
- Musical Instrument Digital Interface
- Hardware communication layer
- 5-pin DIN, uni-directional with pass-thru
- Software protocol layer
- MIDI commands are 2-3 bytes (see the sketch below)
- Note specification
- Device configuration (128 controllers)
- Device Control/Synchronization
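To make the 2-3 byte command format concrete, here is a minimal sketch (not from the slides) that pushes a Note On/Note Off pair through Java's built-in javax.sound.midi synthesizer; the channel, note number, and velocity are arbitrary values chosen for illustration.

    import javax.sound.midi.MidiSystem;
    import javax.sound.midi.Receiver;
    import javax.sound.midi.ShortMessage;
    import javax.sound.midi.Synthesizer;

    // Send a 3-byte Note On (status 0x90, note, velocity), wait, then Note Off.
    public class MidiNote {
        public static void main(String[] args) throws Exception {
            Synthesizer synth = MidiSystem.getSynthesizer();
            synth.open();
            Receiver out = synth.getReceiver();

            ShortMessage noteOn = new ShortMessage();
            noteOn.setMessage(ShortMessage.NOTE_ON, 0, 60, 96);  // channel 0, middle C, velocity 96
            out.send(noteOn, -1);                                 // -1 = deliver immediately

            Thread.sleep(500);                                    // hold the note for half a second

            ShortMessage noteOff = new ShortMessage();
            noteOff.setMessage(ShortMessage.NOTE_OFF, 0, 60, 0);
            out.send(noteOff, -1);

            synth.close();
        }
    }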
14 MIDI
- Lots of general-purpose fields
- Simple electronics (2 resistors and a PIC processor)
- Semi-popular option for simple control/robotics applications
15 MOD files
- File size can be tiny if a MIDI synthesizer is used at playback time
- Playback quality depends on the quality of the synthesizer
- MOD files (module format) combine MIDI data with WAV samples to produce high-quality, consistent playback in a relatively small file
17 Software APIs for sound
18 Microsoft DirectX 9.0
- DirectX includes:
- DirectDraw - 2D drawing
- Direct3D - 3D drawing
- DirectInput - input/haptic devices
- DirectPlay - network gaming
- DirectShow - video streams
- DirectSound - wave audio I/O
- DirectMusic - soundtrack management and MIDI
- DirectSetup - DirectX installation routines
19 DirectSound
- WAV capture (see the capture sketch below)
- Multi-channel sound playback
- Full duplex
- 3D specification of sound sources
- Some real-time DSP: Chorus, Compression, Flange, Distortion, Echo, Reverb
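DirectSound itself is a C/C++ COM API and is not reproduced here; as a rough, platform-neutral sketch of the WAV-capture capability listed above, this uses javax.sound.sampled instead, with an assumed 44.1 kHz, 16-bit mono format and a three-second buffer.

    import javax.sound.sampled.AudioFormat;
    import javax.sound.sampled.AudioSystem;
    import javax.sound.sampled.DataLine;
    import javax.sound.sampled.TargetDataLine;

    // Capture roughly three seconds of 16-bit, 44.1 kHz mono audio
    // from the default input device into a byte buffer.
    public class CaptureSketch {
        public static void main(String[] args) throws Exception {
            AudioFormat fmt = new AudioFormat(44100f, 16, 1, true, false);
            DataLine.Info info = new DataLine.Info(TargetDataLine.class, fmt);
            TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
            line.open(fmt);
            line.start();

            byte[] buf = new byte[44100 * 2 * 3];   // 3 s * 2 bytes per sample
            int read = 0;
            while (read < buf.length) {
                read += line.read(buf, read, buf.length - read);
            }
            line.stop();
            line.close();
            // buf now holds raw PCM; it could be wrapped in an
            // AudioInputStream and written out as a WAV file.
        }
    }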
20 DirectMusic
- Coordinates several sound files (MIDI, WAV, etc.) into soundtracks
- Sequencing (timelines, cueing, and synchronization)
- Supports dynamic composition, variation, and transitioning between songs/parts
- Dynamic content authored in DirectMusic Producer
21 DirectMusic
- Compositions can be made with DLS (downloadable sound) files, a cross-platform "smart audio" file format designed for dynamic loading in interactive applications
- DLS = MIDI + WAV for interactive apps
22 MacOS X Core Audio
23 MacOS X Core Audio
- Sound Manager - routines for resource management and playing/recording sound
- AudioToolbox - sophisticated DSP architecture, sequencing/composition
- MIDI Services - device abstraction, control, and patching
- Audio HAL - medium-level I/O access (real-time, low-latency, multi-channel; floating point is standard access)
- IOKit - low-level device access (drivers, hardware)
- Full Java API provided
24 Java
- Basic data structures and routines for loading, playing, and stopping sounds (see the sketch below)
- java.applet.AudioClip
- javax.sound.midi
- javax.sound.midi.spi
- javax.sound.sampled
- javax.sound.sampled.spi
- I/O device access is somewhat limited
- I've been told that synchronization is a problem in Java
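A minimal playback sketch with javax.sound.sampled; the file name is a placeholder and error handling is omitted.

    import java.io.File;
    import javax.sound.sampled.AudioInputStream;
    import javax.sound.sampled.AudioSystem;
    import javax.sound.sampled.Clip;

    // Load a WAV file into a Clip and play it to completion.
    public class PlayClip {
        public static void main(String[] args) throws Exception {
            AudioInputStream in = AudioSystem.getAudioInputStream(new File("beep.wav"));
            Clip clip = AudioSystem.getClip();   // obtain a playback line
            clip.open(in);                       // load the sample data
            clip.start();                        // playback is asynchronous
            Thread.sleep(clip.getMicrosecondLength() / 1000);  // wait until it finishes
            clip.close();
        }
    }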
25 Voice as Sound
- "Voice as Sound: Using Non-verbal Voice Input for Interactive Control." Takeo Igarashi, John F. Hughes. UIST 2001, pp. 155-156
- STFT, FFT analysis
- Extension to SUITEKeys
26 Fourier Transform (FT)
- Simple properties of a sound (e.g. duration, volume) can be obtained by looking at the data file
- More interesting analysis requires some DSP, mainly the Fourier Transform
27 Fourier Transform
- FT extracts the frequency content from a given
segment of audio.
28 Fourier Transform
29 Fast Fourier Transform (FFT)
- FFT is a fast computational algorithm for computing the discrete Fourier transform (DFT); see the sketch below
- Implementations available in most languages
- Good reference source: Numerical Recipes in C
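For intuition about what an FFT library returns, here is a naive O(N^2) DFT magnitude sketch (not an FFT); a real FFT computes the same spectrum in O(N log N).

    // Naive DFT magnitude spectrum, shown only to illustrate what an FFT
    // library computes from a segment of audio samples.
    public class Dft {
        // Returns |X[k]| for k = 0..N-1, given N audio samples x[t].
        public static double[] magnitude(double[] x) {
            int n = x.length;
            double[] mag = new double[n];
            for (int k = 0; k < n; k++) {
                double re = 0, im = 0;
                for (int t = 0; t < n; t++) {
                    double angle = 2 * Math.PI * k * t / n;
                    re += x[t] * Math.cos(angle);
                    im -= x[t] * Math.sin(angle);
                }
                mag[k] = Math.hypot(re, im);   // magnitude of bin k
            }
            return mag;
        }
    }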
30 Speech
31 Speech Synthesis
- Three categories of speech synthesizers
- Articulatory synth - uses a physical model of the physiology of speech production and the physics of sound generation in the vocal apparatus
- Formant synth - acoustic-phonetic approach to synthesis; applies hundreds of filters loosely associated with the movement of articulators using rules
- Concatenative synth - segmental database that reflects the major phonological features of a language; creates smooth transitions and basic processing to match prosodic patterns
- (http://cslu.cse.ogi.edu/HLTsurvey/ch5node4.html)
32 AT&T Natural Voices
- US English, UK English, French, Spanish, German, Korean
- Can build a new voice font from an existing person
- Examples
- Male Voice
- Custom UK English
- Voice Font
- French
33 Phoenix Semantic Frame Parser
- Center for Spoken Language Research, University of Colorado, Boulder
- http://communicator.colorado.edu/phoenix/license.html
- System for processing and parsing natural language
34 Phoenix
35 Phoenix
Details and syntax for creating frames and networks: http://communicator.colorado.edu/phoenix/Phoenix_Manual.pdf
36 Universal Speech Interfaces
Universal Speech Interfaces. Ronald Rosenfeld, Dan Olsen, Alex Rudnicky. Interactions, October 2001, Volume 8, Issue 6
- "In essence, we attempt to do for speech what Palm's Graffiti has done for mobile text entry."
- http://www-2.cs.cmu.edu/usi/USI-manifesto.htm
- Speech is an ambient medium.
- Speech is descriptive rather than referential.
- Speech requires modest physical resources.
- Only speech will scale as digital technology progresses.
- 3 speech interaction techniques: Natural Language (NLI, NLP), Dialog Trees, Command and Control
38 Universal Speech Interfaces
- Look and Feel -> Sound and Say
- Universal Metaphors - familiar ways of doing things across applications
- Universal User Primitives - standard dialog interaction techniques: detection, recovering from error, asking for help, navigation, etc.
- Universal Machine Primitives - standardized machine responses and meanings to increase user understanding
39 Java Speech
- JSAPI - Java Speech API (synthesis sketch below)
- Speech Generation
- Structure Analysis - Java Synthesis Markup Language (JSML)
- Text Pre-Processing - abbreviations, acronyms, numbers (e.g. "1998")
- Text-to-Phoneme Conversion
- Prosody Analysis
- Waveform Production
- Speech Recognition
- Grammar Design - Java Speech Grammar Format (JSGF)
- Signal Processing
- Phoneme Recognition
- Word Recognition
- Result Generation
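A minimal JSAPI synthesis sketch; it assumes some JSAPI engine implementation (FreeTTS is one) is installed and registered, since the API itself ships without an engine.

    import java.util.Locale;
    import javax.speech.Central;
    import javax.speech.synthesis.Synthesizer;
    import javax.speech.synthesis.SynthesizerModeDesc;

    // Speak one sentence through whatever JSAPI synthesizer is registered.
    public class SpeakHello {
        public static void main(String[] args) throws Exception {
            Synthesizer synth = Central.createSynthesizer(
                    new SynthesizerModeDesc(Locale.ENGLISH));
            synth.allocate();                               // acquire engine resources
            synth.resume();                                 // make sure the output queue is running
            synth.speakPlainText("Hello from the Java Speech API", null);
            synth.waitEngineState(Synthesizer.QUEUE_EMPTY); // block until speech finishes
            synth.deallocate();
        }
    }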
40 Windows .NET Speech SDK
- Basically the .NET-ified SAPI 5.1 (Speech API)
- Continuous Speech Recognition (US English, Japanese, and Simplified Chinese)
- Concatenative Speech Synthesis (US English and Simplified Chinese)
- Interface is broken into two components
- Application Programming Interface (API)
- Device Driver Interface (DDI)
41 Windows .NET Speech SDK
- Speech Synthesis API
- ISpVoice::Speak("my text", voice)
- Speech Synthesis DDI
- Parses text into an XML doc
- Calls the TTSEngine
- Manages sound and threading details
42 Windows .NET Speech SDK
- Speech Recognition API
- Define context
- Define grammar
- Request type (dictation or command/control)
- Event is fired when recognized
- Speech Recognition DDI
- Interfacing and configuring the SREngine
- Manages sound and threading details.
43 Windows .NET Speech SDK
- Speech Application Language Tags (SALT) - extension to HTML for speech integration into webpages
- Speech Recognition Grammar Specification (SRGS) - support for field parsing
- Telephony Controls - interfaces with telephone technology to develop voice-only apps
44 MacOS X Speech
- Barely changed since 1996, MacInTalk 3
- US English only
- Full Java API
- Speech Synthesis Manager (PlainTalk)
- algorithmic voice generation
- Speech Recognition Manager
- OS wide push-to-talk Command/Control
- Customizable vocabulary w/scripting
- Uses Language Model grammar
- No dictation support
45 Dragon NaturallySpeaking
- Commercial recognition software
- Dictation
- Command and control
- API available to developers for application integration
- http://www.scansoft.com/naturallyspeaking/
46 Sphinx
- Open source speech recognizer from CMU (http://fife.speech.cs.cmu.edu/sphinx/)
- Auto-builds language model/grammar/vocabulary from example sentences
- CMU-Cambridge Statistical Language Modeling Toolkit - semi-machine-learning algorithms for digesting a large example corpus into a usable model
- Uses the CMU Pronouncing Dictionary
- SphinxTrain - builds new acoustic models
- Audio recording, transcript, pronunciation dictionary/vocabulary, phoneme list
47 SUITEKeys
- Manaris, B., McCauley, R., MacGyvers, V., "An Intelligent Interface for Keyboard and Mouse Control - Providing Full Access to PC Functionality via Speech," Proceedings of the 14th International Florida AI Research Symposium (www.cs.cofc.edu/manaris/)
- Developed for individuals with motor disabilities
- Interface layer that generates keyboard and mouse events for the OS (see the sketch below)
- Recognizes keyboard strokes/operations: backspace, function twelve, control-alt-delete, page down, press, release
- Recognizes mouse buttons and movement: left-click, move down, stop, 2 units above clock, move to 5-18
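SUITEKeys' own injection layer is not reproduced here; as a rough illustration of the kind of OS-level keyboard and mouse events such a layer emits after recognition, java.awt.Robot can synthesize them directly.

    import java.awt.Robot;
    import java.awt.event.InputEvent;
    import java.awt.event.KeyEvent;

    // Inject a few synthetic keyboard and mouse events, roughly what a
    // speech-driven keyboard/mouse layer would emit after recognition.
    public class InjectEvents {
        public static void main(String[] args) throws Exception {
            Robot robot = new Robot();

            // spoken "backspace"
            robot.keyPress(KeyEvent.VK_BACK_SPACE);
            robot.keyRelease(KeyEvent.VK_BACK_SPACE);

            // spoken "move to" plus absolute screen coordinates
            robot.mouseMove(5, 18);

            // spoken "left-click"
            robot.mousePress(InputEvent.BUTTON1_MASK);
            robot.mouseRelease(InputEvent.BUTTON1_MASK);
        }
    }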
48 Suede
Scott R. Klemmer, Anoop K. Sinha, Jack Chen, James A. Landay, Nadeem Aboobaker, Annie Wang. Proceedings of the 13th Annual ACM Symposium on User Interface Software and Technology, November 2000
- Wizard of Oz tool for prototyping speech interfaces
- Allows the developer to quickly generate a state machine representing the possible paths through a speech interface, and stores recorded system responses
- Operator simulates a functional system during evaluation by stepping through the state machine
- Runtime transcripts are recorded for later analysis
50 Multimodal Interaction
51 Multimodal Interaction
- According to Scott: the term "multi-modal interface" usually refers to speech and something else, because speech alone wasn't good enough
- Though it should probably mean more than one (simultaneous?) input modality
- Point, click, gesture, type, speak, write, touch, look, bite, shake, think, sweat, etc. (lots of sensing techniques)
52 Multimodal Interaction
- Lots of things have used them, but there were no real tools, or they weren't simultaneous
- Cohen, P.R., Cheyer, A., Wang, M., and Baeg, S.C. An Open Agent Architecture. AAAI '94 Spring Symposium Series on Software Agents, AAAI (Menlo Park, CA, 1994); reprinted in Readings in Agents, Morgan Kaufmann, 1997, 197-204
- Brad Myers, Robert Malkin, Michael Bett, Alex Waibel, Ben Bostwick, Robert C. Miller, Jie Yang, Matthias Denecke, Edgar Seemann, Jie Zhu, Choon Hong Peck, Dave Kong, Jeffrey Nichols, Bill Scherlis. "Flexi-modal and Multi-Machine User Interfaces," IEEE Fourth International Conference on Multimodal Interfaces, Pittsburgh, PA, October 14-16, 2002, pp. 343-348
53 Multimodal Interfaces
- A common concept is mode-ing or modifying interaction
- Gives extra context for recognizers (e.g. point and speak)
- Multiplies the functionality of an interaction (e.g. ToolStone, left/right/no click)
- Rekimoto, J., Sciammarella, E. (2000). ToolStone: effective use of the physical manipulation vocabularies of input devices. Proceedings of the ACM Symposium on User Interface Software and Technology, pp. 109-117, November 2000
- Also, a need for an input interpretation layer for widgets that can be specified in multiple ways