Title: Research Directions in Multimodal/Multimedia Systems for Children
1. Research Directions in Multimodal/Multimedia Systems for Children
- Alexandros Potamianos
- Dept. of Electronic and Computer Engineering
- Technical University of Crete, Greece
- September 2004
2. Outline
- Motivation and Goals
- Recent Work
- Acoustic Analysis
- Acoustic Modeling
- Linguistic Analysis/Modeling
- Pragmatics/Dialogue Analysis/Modeling
- HCI and human factors
- Speech/Multimodal Interfaces
- Dialogue/Multimodal Systems
- Research Directions
3. Motivation and Goals
- Dynamics of man-machine interaction differ for children and adults
- Spontaneous children's speech exhibits a greater degree of acoustic and linguistic variability
- Problem-solving skills and approaches differ with age
- Current spoken language technology is not robust enough to handle spontaneous children's speech (open research issues)
- Little work exists on
  - Analysis/modeling of conversational user interfaces for children
  - Investigating multiple modalities of child-machine interaction
4. Previous Work
- Acoustic and linguistic analysis (Eguchi and Hirsh 1969, Kent 1976, Goldstein 1980)
- Babbling and initial language acquisition (Wexler and Culicover 1980)
- Adults speaking to children (Fernald and Mazzie 1991)
- Speech disorders (JSHR)
- Educational systems using speech recognition (Strommen and Frome 1993, Mostow et al. 1995)
5. Recent Work
- Acoustic Analysis
- Acoustic Modeling
- Linguistic Analysis/Modeling
- Pragmatics/Dialogue Analysis/Modeling
- HCI and human factors
- Speech/Multimodal Interfaces
- Dialogue/Multimodal Systems
6. Acoustic Analysis
- What has been done
  - Ages 6-18
  - American English
  - Pitch
  - Formant frequencies
  - Duration
  - Spectral variability
- Other work
  - Language acquisition
  - Speech pathologies
9. Acoustic Analysis Results
- Mean and variance of acoustic correlates reach the adult range around 13-14 years
- Children younger than 10 years show greater within-subject variability than adults
- Formant values scale linearly with age, especially for males
- Variability may reach its minimum around 14-16 years
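The linear-scaling finding above can be sketched as a toy normalization: an age-dependent scale factor that maps a child's formant frequencies toward the adult range. The endpoint ages and the scale value for young children below are illustrative assumptions, not the values measured in the study.

```python
# Toy sketch of "formant values scale linearly with age".
# child_factor and the age endpoints are HYPOTHETICAL placeholders.

def formant_scale(age_years, adult_age=15.0, child_factor=1.4):
    """Linearly interpolate a formant scale factor between a
    hypothetical young-child value (child_factor at age 5) and
    1.0 (adult) at adult_age."""
    if age_years >= adult_age:
        return 1.0
    t = (age_years - 5.0) / (adult_age - 5.0)
    return child_factor + t * (1.0 - child_factor)

def normalize_formants(formants_hz, age_years):
    """Map a child's formant frequencies toward the adult range
    by dividing out the age-dependent scale factor."""
    s = formant_scale(age_years)
    return [f / s for f in formants_hz]
```

A scale factor of this kind is also the intuition behind the VTLN warping factors discussed later in the deck.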
10. Acoustic Analysis
- What could be done short term
  - Investigate non-vowel phonemes for American English
  - Investigate other languages
  - Investigate English as a second language (non-native speech)
- What could be done long term
  - Ages 3-6
- Other work
  - Language acquisition
  - Speech pathologies
11. Acoustic Modeling
- What has been done
  - ASR baseline per age
    - Matched conditions (train and test on children)
    - Mismatched conditions (train on adults, test on children)
  - Vocal Tract Length Normalization (VTLN)
    - Global warping factor
    - Per-utterance computed warping factor
  - Adaptation techniques
    - Bias removal, MLLR, MAP, other
  - Age-dependent acoustic models
  - Combinations of VTLN, adaptation, and acoustic modeling
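The VTLN items above can be sketched as follows: a piecewise-linear frequency warp (one common VTLN formulation) plus a grid search for a per-utterance warping factor. The cutoff fraction, the warp-factor grid, and the `score_fn` likelihood callback are illustrative assumptions, not the study's exact configuration.

```python
def warp_frequency(f, alpha, f_max=8000.0, f_cut=0.85):
    """Piecewise-linear VTLN warp: scale frequencies by alpha up to
    a cutoff, then continue linearly so that f_max maps to f_max.
    f_cut and f_max are hypothetical defaults."""
    f0 = f_cut * f_max
    if f <= f0:
        return alpha * f
    # linear segment from (f0, alpha*f0) to (f_max, f_max)
    slope = (f_max - alpha * f0) / (f_max - f0)
    return alpha * f0 + slope * (f - f0)

def best_warp_factor(score_fn,
                     alphas=tuple(0.80 + 0.02 * i for i in range(21))):
    """Per-utterance warping factor by grid search: score_fn(alpha)
    is assumed to return the acoustic log-likelihood of the utterance
    under features warped by alpha (hypothetical callback)."""
    return max(alphas, key=score_fn)
```

In practice the warp is applied to the filterbank edges during feature extraction, and the grid search is run against the recognizer's acoustic model.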
13. Acoustic Modeling Results
- For matched conditions (train and test on children)
  - For children aged 7 (?) and above, performance is similar to adults
- For mismatched conditions (train on adults, test on children)
  - Performance significantly lower (2-5 times the adult error)
  - Varies with age
- VTLN/adaptation reduces the adult-children performance gap by about 60%
14. Acoustic Modeling
- What could be done
  - Front-end, features
  - Enhanced VTLN algorithms
  - Other adaptation algorithms
  - Better acoustic modeling
15. Linguistic Analysis/Modeling
- What has been done
  - Spontaneous speech
    - Length of utterances, pause time, duration, filled pauses
    - Linguistic exploration, linguistic perplexity
    - Extraneous speech, variability across similar utterances
  - Effects of task experience, age, and gender on language usage
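Linguistic perplexity over child utterances can be estimated in the usual n-gram way; a minimal sketch, assuming a toy add-one-smoothed bigram model (the study's actual language-modeling setup is not specified in the slides):

```python
import math
from collections import Counter

def bigram_perplexity(train, test):
    """Per-word perplexity of `test` under an add-one-smoothed
    bigram model estimated from `train`; both are lists of
    token lists (toy corpora)."""
    vocab = {w for sent in train for w in sent} | {"<s>", "</s>"}
    V = len(vocab)
    uni, bi = Counter(), Counter()
    for sent in train:
        toks = ["<s>"] + sent + ["</s>"]
        uni.update(toks[:-1])                  # bigram histories
        bi.update(zip(toks[:-1], toks[1:]))    # bigram counts
    logp, n = 0.0, 0
    for sent in test:
        toks = ["<s>"] + sent + ["</s>"]
        for a, b in zip(toks[:-1], toks[1:]):
            logp += math.log((bi[(a, b)] + 1) / (uni[a] + V))
            n += 1
    return math.exp(-logp / n)
```

Lower perplexity on a held-out set indicates more predictable language use, which is how perplexity differences across age groups would be compared.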
20. Linguistic Analysis Results
- No major age- or gender-related differences in the 8-14 age group for
  - Linguistic perplexity
  - Length of utterances
  - Linguistic exploration
- Girls aged 11-13 display a larger vocabulary, more exploration, and somewhat longer utterances than boys
- Disfluencies and hesitations as a function of age and gender
  - Frequency of false starts (2% of utts) and mispronunciations (2% of utts) is greater for the younger children than the older ones
  - Breathing noise (4% of utts) is 60% more common in younger children
  - Frequency of filled pauses (8% of utts) for older children is twice that of the younger ones
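Statistics like the false-start and filled-pause frequencies above come down to tallying tagged utterances per age group; a minimal sketch, with hypothetical tag and group names rather than the study's annotation scheme:

```python
from collections import Counter

def disfluency_rates(utterances):
    """utterances: list of (age_group, set_of_tags) pairs.
    Returns, per age group, the fraction of utterances carrying
    each disfluency tag."""
    totals, tagged = Counter(), Counter()
    for group, tags in utterances:
        totals[group] += 1
        for tag in tags:
            tagged[(group, tag)] += 1
    return {
        (group, tag): tagged[(group, tag)] / totals[group]
        for (group, tag) in tagged
    }

# Hypothetical usage with placeholder tags:
utts = [("8-10", {"false_start"}), ("8-10", set()),
        ("11-14", {"filled_pause"}), ("11-14", set())]
rates = disfluency_rates(utts)
```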
21. Linguistic Analysis/Modeling
- What could be done
  - Pronunciation modeling
  - Linguistic analysis of non-native speech
  - Linguistic analysis/modeling of spontaneous speech
  - More tasks (simpler/more complex)
  - More data
  - Better age coverage (challenge)
22. Pragmatics/Dialog Analysis
- What has been done
  - Dialog strategies for problem solving
  - Determine factors related to task completion, time to completion, skipping dialog states, multiple requests
  - Determine the role of age, gender, and experience level on dialog interaction
  - Stereotypical dialogue modeling
23. Game Screen
24. Dialog-Tagging Tool
25. Dialog States
- Navigate - moving within a state (i.e., left or right)
- Talk2Him - stop a person for questioning
- WhereDid - ask "Where did the suspect go?"
- Arrest - arrest the suspect
- TellAbout - ask "Tell me about the suspect"
- Goodbye - tell a person "Thank you, goodbye"
- Merged State - miscellaneous
- Cluebook - go to the Magnifying Glass where clues are written
- CloseDatabase - get out of the Atlas
- Enterfeature - enter a clue about a suspect's appearance
- ActionState - go to a state in the United States
- Warrant - obtain a warrant for a suspect's arrest
- Find - look up a location
- Database - go to the Atlas to look up a location
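The states above form a finite-state machine over which dialog flow can be analyzed. A sketch, assuming a hypothetical subset of the game's transition structure (the slides do not give the actual transition table):

```python
from collections import Counter

# HYPOTHETICAL subset of the dialog-state FSM listed above;
# the real game's transition structure is not given in the slides.
TRANSITIONS = {
    "Navigate":      {"Talk2Him", "Database", "Cluebook", "Navigate"},
    "Talk2Him":      {"WhereDid", "TellAbout", "Goodbye"},
    "WhereDid":      {"TellAbout", "Goodbye"},
    "TellAbout":     {"WhereDid", "Goodbye"},
    "Goodbye":       {"Navigate"},
    "Database":      {"Find", "CloseDatabase"},
    "Find":          {"CloseDatabase"},
    "CloseDatabase": {"Navigate"},
    "Cluebook":      {"Enterfeature", "Navigate"},
    "Enterfeature":  {"Enterfeature", "Navigate"},
}

def transition_counts(session):
    """Count observed state-to-state transitions in a tagged session,
    flagging any transition outside the assumed FSM."""
    counts, unexpected = Counter(), []
    for a, b in zip(session, session[1:]):
        counts[(a, b)] += 1
        if b not in TRANSITIONS.get(a, set()):
            unexpected.append((a, b))
    return counts, unexpected
```

Aggregating such counts per age group, gender, or experience level is the kind of analysis the following slides report.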
26. Navigation Graph
Double queries are about 7 times more common than single queries.
27. Dialog Diagrams
Cluebook Diagram
Database Diagram
Single and multiple feature entries were equally common.
Single-question queries comprised about 60%.
28. Dialog Data Analysis
- Speech utterances assigned to dialog states based on the actions triggered (manual tagging), i.e., a dialog FSM, a superset of the game FSM
  - Talk2Him - commands asking for a character's attention
  - TellMeAbout - queries about a suspect's whereabouts and physical characteristics
- Dialog state transitions analyzed as a function of age (8-10 vs. 11-14 year olds), gender, and experience level
- Extraneous speech patterns in dialog flow modeled
- Dialog flow differences between voice and keyboard/mouse modalities identified
29. Dialog Modeling Results
- Dialog flow patterns of male and female children are very similar
- Dialog flow structure is age-dependent due to differences in game-playing skills; older children
  - complete the game faster (take fewer turns)
  - spend less time in database search (more knowledgeable)
  - attempt multiple sub-tasks simultaneously (double queries)
- Extraneous speech utterances are age-, speaker-, and dialog-context-dependent; for younger children
  - there were twice as many extraneous utterances as for the older ones
  - on average 5% of all utterances were extraneous
  - the number of extraneous utterances ranged between 0-25% among individuals (7% variance)
30. Dialog Modeling
- What could be done
  - Dialogue modeling as a function of age
  - More tasks (simpler/harder)
  - Better age coverage (challenge)
31. Speech Interface Human Factors
- What has been done
  - Determine preference for voice versus conventional interfaces
  - Consider effects of
    - Age and gender
    - Task success (win/loss of game)
    - Level of experience
  - Investigate other human-factor issues
  - Multi-modal interfaces for children (prototypes)
32. Population Statistics and Solicitation
- Permission was obtained from superintendents, and flyers were distributed in the Summit and Berkeley Heights school districts
- 15% response rate
- Consent forms to collect and analyze children's speech were signed by each participant's parent or guardian
33. Exit Interview
- Example questions
  - What did you like about using voice activation?
  - What did you like/dislike about the game?
  - What other things would you like to see become voice activated?
  - Would you like to use voice with a keyboard and a mouse?
- Rate each item on a scale of 1-5 (5 = high)
  - Voice interface
  - Game
  - Use of headset
  - Database search
  - Error messages (TTS)
  - Multi-modal interface
  - Previous computer usage
34. Game and Voice Enjoyment
Children's Response to Voice
- 93% of children rated using their voice at least 4 out of 5, while only 81% rated the game a 4 or 5
- Enthusiasm for voice peaked in the 11-12 age range
35. Effect of Subject's Success
- Game and voice enjoyment increased with the number of games won
36. Effect of Gender
- Gender had a negligible effect on subjects' ratings
37. Multi-modal Interface
- Dislike of having to spell decreased with age
- 2/3 of children preferred a multi-modal interface to voice only
38. Other Effects
- Dislike of error messages (TTS) decreased with age
- Enjoyment of the headset roughly correlated with enjoyment of the game
39. Human Factor Results
- Age-dependent factors had mostly to do with the game rather than the interface
  - Exception: text-to-speech synthesis (younger kids did not like it)
- Kids liked interacting with the computer using voice
- Kids prefer combining interface modalities
- Recognition accuracy and speed are crucial to the success of the application
40. Multiple Modalities
- Voice vs. keyboard and mouse, based on data from 12 children
  - Total number of commands roughly the same for the navigation/query and database-entry sub-tasks
  - 50% more actions in database search with keyboard and mouse
  - Greetings ("Thank you", "Goodbye") reduced by a factor of 3 with keyboard and mouse
Although voice might not always be the most efficient modality, it is the most natural one.
41. Interfaces: Future Work
- Analysis of uni-modal and multi-modal interface usage as a function of age
- Adaptive interfaces
  - Adapt to the age and experience of the child
- Educational systems and interfaces
- Interfaces for children with disabilities
- Interfaces with intelligence and personality
- Multi-modal interfaces
42. Systems
- Prototypes
- Toys
- Desktop
- Educational
- Children with special needs
43. Future Directions Summary
- Acoustic Analysis
  - Investigate non-vowel phonemes for American English
  - Investigate other languages
  - Investigate English as a second language (non-native speech)
  - Ages 3-6
  - Language acquisition
  - Speech pathologies
- Acoustic Modeling
  - Front-end, features
  - Enhanced VTLN algorithms
  - Other adaptation algorithms
  - Better acoustic modeling
44. Future Directions Summary
- Linguistic Analysis/Modeling
  - Pronunciation modeling
  - Linguistic analysis of non-native speech
  - Linguistic analysis/modeling of spontaneous speech
  - More tasks, more data, better age coverage
- Dialogue Analysis/Modeling
  - Dialogue modeling as a function of age
  - More tasks, better age coverage (challenge)
45. Future Directions Summary
- Interfaces/Human Factors
  - Analysis of interface usage as a function of age (challenge)
  - Adaptive interfaces: adapt to the age and experience of the child
  - Educational systems and interfaces
  - Interfaces for children with disabilities
  - Interfaces with intelligence and personality
  - Multi-modal interfaces
- Systems
  - Toys
  - Desktop
  - Education
  - Children with special needs
46. Conclusions
- Good progress on acoustic analysis and acoustic modeling
  - Technology works!
- Less work on language/dialogue/interface aspects
- Interesting prototype systems built (including multi-modal)
- Not many products use speech recognition for children