Title: Some Activities on Speech Translation at ITCirst
1(Some) Activities on Speech Translation at
ITC-irst
- Fabio Pianesi
- and
- Roldano Cattoni
- Erica Costantini
- Emanuele Pianta
- ..
2Scenario
- NESPOLE! is an STST (speech-to-speech translation) system allowing a tourist operator and a user to interact using their own languages.
- Both customer and agent have thin clients (with whiteboard).
- The customer's terminal connects to the Italian (agent-side) mediator, which acts as a multimedia dispatcher.
- The mediator
- opens a connection with the tourist agent
- transmits web pages
- sends the audio to the appropriate HLT servers
- buffers and transmits gestures from the client to the agent and vice versa.
- Feedback facilities give both parties full control over the evolution of the communicative exchange.
3Scenario
CLIENT screen
- The customer wants to organise a trip in Trentino.
- She starts by browsing APT web pages to get information.
4Interchange Format
5Motivations
- Interlingua facilitates translation between as many language pairs as possible with minimal effort.
- SUBJECT DOMAINS: tourist information, medical domain.
- COMMUNICATION DOMAIN: spoken dialogue
6IF
Intermediate Representation Formalism
A lot of work. Goals pursued:
- a general-purpose IRF to be used in conjunction with a more domain-oriented interlingua.
- the generic part exploits a frame-like representation. WordNet 1.6 provides the conceptual repertory.
- Important: the interplay between the general-purpose and the domain-oriented IRF.
- updates and improvements to the domain-oriented IF developed within CSTAR-II, to cope with the new requirements of NESPOLE!.
- extension of coverage to the new features of the application scenarios
- improvements over the existing representation of such linguistic information as referents' novelty, number, nominal.
7IF Design
- CMU, USA
- Karlsruhe University, Germany
- ITC-irst, Italy
- CLIPS, France
- ATR, Japan
- ETRI, Korea
- Chinese Academy of Sciences, China
- Siemens, Germany
8Language Features
- Task oriented (rather than descriptive)
- Many fixed expressions
- Many fragments
9Requirements
- abstract away from the peculiarities of particular languages
- capture the speaker's communicative intent rather than the literal phrasing
- usable at different sites with different language engines
- allow reliable data annotation (inter-coder and inter-site agreement)
- allow for robust language engines (underspecification, fragments)
10Formalism
- DOMAIN ACTION plus ARGUMENTS
- DOMAIN-ACTION: main communicative intention, semantic focus
- ARGUMENTS: semantic details
11IF - syntax
- speaker: speech-act +concept* (arguments)
- Domain action: speech act plus concepts
- Speaker: a, c (agent, client)
- Speech acts: give-information, request-information, greeting, accept, apologize...
- Concepts: disposition, feasibility, obligation, view, arrival, rent, accommodation, trip, ..
- Arguments: accommodation-spec ...
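The structure above (speaker, speech act, concepts, arguments) can be illustrated with a small parser. This is only a sketch: the surface notation `speaker:speech-act+concept+... (arguments)` is assumed here for illustration, and `parse_domain_action` is a hypothetical helper, not part of any NESPOLE! tool.

```python
from typing import NamedTuple


class DomainAction(NamedTuple):
    speaker: str      # "a" (agent) or "c" (client)
    speech_act: str   # e.g. "give-information"
    concepts: tuple   # e.g. ("disposition", "arrival")
    arguments: str    # raw argument text, left unparsed in this sketch


def parse_domain_action(text: str) -> DomainAction:
    """Split an IF string into speaker, speech act, concepts and raw arguments.

    Assumes the illustrative notation 'speaker:speech-act+concept+... (args)'.
    """
    # Everything before the first "(" is the domain action; the rest is arguments.
    head, _, rest = text.strip().partition("(")
    arguments = rest[:-1].strip() if rest.endswith(")") else rest.strip()
    speaker, sep, action = head.strip().partition(":")
    if not sep or speaker not in ("a", "c"):
        raise ValueError(f"missing or unknown speaker in {text!r}")
    # The first element is the speech act, the remaining ones are concepts.
    speech_act, *concepts = action.split("+")
    return DomainAction(speaker, speech_act, tuple(concepts), arguments)
```

For instance, `parse_domain_action("a:offer+help (who=i, to-whom=you)")` yields the speaker `a`, the speech act `offer` and the single concept `help`; the argument list is kept as raw text because its nesting would need a proper recursive parser.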
12Recent advances
- information packaging (new/old)
- number, gender
- attitudes (know that, want, prefer)
- modality (must, need, ...)
- tense
- rhetorical relations (because, after that, ...)
- relatives (partial)
- focalisers (also, only, for example)
- copular constructions (the hotel is cheap and near Trento)
- multimodality coverage (indicate, show, square, .. pen)
13Example1
- "thank you."
- c:thank
- "can I help you?"
- a:offer+help (who=i, to-whom=you)
- "my name is Chad"
- c:give-information+personal-data (person-name=(given-name=chad))
14Example 2
- "and I would like to arrive around September ninth."
- c:give-information+disposition+arrival
- (disposition=(who=i, desire),  / attitude /
- conjunction=discourse,  / rhetorical information /
- time=(exactness=approximate, month=9, md=9))  / time /
15Example 3
- "and I was hoping that you could help me plan a vacation to one of the national parks in the Trentino area."
- c:request-action+help+plan+trip
- (help=(who=you, to-whom=i),
- conjunction=discourse,
- visit-spec=(vacation, identifiability=no),
- destination=(quantity=1,
- specifier=(national_park,
- quantity=plural,
- location=(place-name=trentino))))
- NB: the focus is on communicative intention, not on exact phrasing
16Example4
- "and in a restaurant."
- a:give-information+concept (conjunction=discourse, location=(restaurant, identifiability=no))
- "which town?"
- c:request-information+concept (concept-spec=(town, identifiability=question))
17Quantitative data: IF specifications
- Last release: February 2002
- speech acts: 61 (domain independent, 20 dialog-management SA)
- concepts: 108 (mostly domain dependent)
- arguments: 304 (mostly domain dependent)
- values: 7,652
18Quantitative data - NESPOLE! data base
- Annotated turns (end 2001):
- English: 815 (235 distinct DAs)
- German: 2,873 (367)
- Italian: 1,286 (233)
- French: 234 (94)
- Total distinct DAs: 610
- Annotated turns (by end 2002): some 30/40 more
19Support tools
- IF specifications (available on the web)
- http://www.is.cs.cmu.edu/nespole/db/index.html
- IF discussion board
- http://peace.is.cs.cmu.edu/ISL/get/if.html
- C-STAR and NESPOLE! Data Bases
- http://www.is.cs.cmu.edu/nespole/db/index.html
- IF Checker (web interface)
- http://tcc.itc.it/projects/xig/xig-on-line.html
- IF test suite
- http://tcc.itc.it/projects/xig/xig-ts.html
- IF emacs mode
20- Robust Generation for Speech Translation
21Problems with the IF: Lack of Well Defined Semantics
- Informal definition of the semantic primitives
- No predicate argument structure
- Only loosely compositional (but this is improving)
- No true formal semantics for IF (huge effort, risk of losing flexibility and adaptation to languages and linguistic engines)
22Problems for Generation: Linguistic Underspecification
- Analysis engines may not be able to understand/disambiguate (e.g. information packaging, tense)
- Subject in pro-drop languages (the Italian analyser can produce an IF representation in which the subject is left unspecified)
23Problems for Generation: Ill-formedness
- A legal IF representation must fit certain constraints.
- a:give-information+disposition OK
- a:give-information+arrival OK
- a:give-information+arrival+disposition NO!
24Problems for Generation: Ill-formedness
- Top-level arguments must be licensed by the concepts in the DA
- "a three star hotel would be fine"
25Problems for Generation: Ill-formedness
- Sub-arguments are licensed by super-arguments
- "are there available rooms at Hotel Belvedere?"
26Problems for Generation: Ill-formedness
- Values must be licensed by arguments
- c:give-information+action (action=e-call-2, origin=(place-name=mumbay))
- mumbay is out of the current coverage of the IF specs
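The licensing constraints on the preceding slides (top-level arguments licensed by concepts, values licensed by arguments) can be pictured as simple table lookups. The sketch below is an illustration only: the two tables are made-up fragments, not the real IF specification, and `check_licensing` is a hypothetical name.

```python
# Toy fragments of a specification; every entry here is hypothetical.
LICENSED_ARGS = {"action": {"action", "origin"}, "arrival": {"time"}}
LICENSED_VALUES = {"place-name": {"trentino", "cavalese"}}


def check_licensing(concepts, top_args, values):
    """Return a list of licensing violations (an empty list means well-formed).

    concepts: concepts of the domain action, e.g. ["action"]
    top_args: names of the top-level arguments, e.g. ["action", "origin"]
    values:   (argument, value) pairs, e.g. [("place-name", "mumbay")]
    """
    problems = []
    # Top-level arguments must be licensed by at least one concept in the DA.
    allowed = set().union(*(LICENSED_ARGS.get(c, set()) for c in concepts))
    for arg in top_args:
        if arg not in allowed:
            problems.append(f"argument {arg!r} not licensed by concepts")
    # Values must be licensed by the argument they fill.
    for arg, value in values:
        if arg in LICENSED_VALUES and value not in LICENSED_VALUES[arg]:
            problems.append(f"value {value!r} not licensed by {arg!r}")
    return problems
```

With these toy tables, the representation containing `mumbay` is flagged because the value is outside the listed coverage, which mirrors the kind of ill-formedness a generator has to tolerate.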
27Strategy: Generating Fragments
- Some 20-30% of IF representations produced during translation are ill-formed.
- If you cannot generate a complete sentence
- then generate the main phrases of the sentence and adopt a default order (e.g. NP, Verb, NP, Adjuncts)
- If you cannot generate a main phrase
- then generate fragments adopting some default order (e.g. Det, Noun, Modifiers)
- If you cannot generate a lexical item out of an IF value
- then return the value as it is
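The fallback strategy on this slide can be sketched as three nested attempts. This is a minimal illustration, not the actual generator: the rule functions and the lexicon are hypothetical stand-ins supplied by the caller, and the input phrases are assumed to arrive already in the default order.

```python
def realize_value(value, lexicon):
    # Last resort: an IF value with no lexical entry is returned as it is.
    return lexicon.get(value, value)


def realize_phrase(values, phrase_rule, lexicon):
    # Try a phrase-level rule first; it returns None when it cannot apply.
    text = phrase_rule(values)
    if text is not None:
        return text
    # Otherwise emit word-level fragments in the given default order.
    return " ".join(realize_value(v, lexicon) for v in values)


def realize_sentence(phrases, sentence_rule, phrase_rule, lexicon):
    # Try a complete sentence first; it returns None when it cannot apply.
    text = sentence_rule(phrases)
    if text is not None:
        return text
    # Otherwise emit the main phrases one by one in the given default order.
    return " ".join(realize_phrase(p, phrase_rule, lexicon) for p in phrases)
```

Each level degrades gracefully to the next, so some output is always produced even for an ill-formed representation.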
28Defaults
- Agent: "Can you see the map?"
- a:request-information+feasibility+view+information-object (info-object=map)
- ??? Who is the subject of view ???
- Default rule: when the Agent requests information about viewing something, the agent of the viewing is the client
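A default rule of this kind can be written as a small function that fills in the missing participant. The sketch is illustrative only: the argument name `who` and the value `client` are assumptions, and `apply_view_default` is a hypothetical helper.

```python
def apply_view_default(speaker, speech_act, concepts, arguments):
    """Fill in the unspecified viewer when the agent asks about viewing.

    Returns a new argument dict and leaves the input untouched.
    """
    filled = dict(arguments)
    if (
        speaker == "a"
        and speech_act == "request-information"
        and "view" in concepts
        and "who" not in filled  # only apply when the subject is unspecified
    ):
        filled["who"] = "client"
    return filled
```

The rule never overrides a subject that the analyser did supply, so it only compensates for underspecification.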
29The Italian Generation Component (continued)
- XIG-IF sentence planner
- Maps IF representations into functional representations.
- 4 layers of mapping rules (sentences, NPs, adjuncts, lexicon)
- Cascade of increasingly less specific rules.
- Functional representations are mixed representations (morphology, potential words, strings, ..)
- HTPL solver: maps mixed representations into text.
30Multimodality in STST
31Multimodality
- NESPOLE! allows users to perform gestures (pointing, selection, etc.) on maps.
- Gestures are performed by means of a tablet and/or a mouse on maps displayed through the system's whiteboard.
- Anchoring between gestures and language is obtained through a simple time-based procedure.
- More complex procedures, aiming at conceptual anchoring, have a greater impact on HLT modules. Their investigation has been postponed.
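The slides do not describe the time-based anchoring procedure in detail. One plausible reading, sketched here purely as an assumption, is to attach each gesture to the latest spoken turn that started before it and ended no more than a fixed gap earlier; the function name and the `max_gap` parameter are hypothetical.

```python
def anchor_gestures(turns, gestures, max_gap=10.0):
    """Attach each gesture to a spoken turn using timestamps only.

    turns:    list of (turn_id, start, end), sorted by start time (seconds)
    gestures: list of (gesture_id, time)
    Returns a dict mapping gesture_id to a turn_id, or to None if no turn fits.
    """
    anchors = {}
    for gesture_id, t in gestures:
        best = None
        for turn_id, start, end in turns:
            # The gesture fits if it falls inside the turn or shortly after it.
            if start <= t <= end + max_gap:
                best = turn_id  # later turns override earlier ones
        anchors[gesture_id] = best
    return anchors
```

Allowing a gap after the end of the turn matters when the gesture follows the utterance rather than overlapping it.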
32Multimodality
Previous results
- The advantages of multimodal input over speech-only input include faster task completion, fewer errors, fewer spontaneous disfluencies, and a strong preference for multimodal interaction (Oviatt, 1997)
- when combined with spoken input, pen-based input can disambiguate badly understood sentences (Oviatt, 2000)
37Multimodality
Usability study
- Goal: assess the impact of multimodality in a real speech-to-speech translation environment
- Evaluation of the added value of multimodality in a multilingual and multimedia environment.
- Evaluation of the degree of integration of multimodality in the multilingual system.
38MODALITY x LANGUAGE
Multimodality - experiment
Experimental Design
- MODALITY
- SO (Speech only)
- MM (Multimodal)
39Experimental Design
Users: Customers
- TOTAL NUMBER: 28
- FEATURES
- English and German speakers
- similar level of computer literacy and web expertise
- paid volunteers
- DESIGN: between-subjects (each client took part in one dialogue and experienced only one modality)
- Sex balanced across conditions
40Table 1. Group composition
Experimental Design
Users Customers
E = English speakers, G = German speakers
41Experimental Design
Users: Agents
- TOTAL NUMBER: 7
- Italian volunteers (not involved in the NESPOLE! project) acting as Trentino tourist board agents
- DESIGN: within-subjects (each agent took part in more than one dialogue, and experienced both modalities)
- Sex balanced across conditions and languages
42Experimental Design
Dependent Variables
- Variables targeted
- spoken input
- gestures
- effectiveness of the dialogue
- usability self-reports
43Experimental Design
Dependent Variables
- Speech
- Spontaneous events
- agrammatical phrases (repetitions, corrections, false starts)
- empty pauses (silence, breathing)
- filled pauses (vowels, nasal, other)
- human noises (laugh, noise)
- word interruptions (speaker)
- understandability
- technical breaks (word break, word missing)
- turn breaks (the utterance is broken)
44Experimental Design
Dependent Variables
- Speech
- TURNS AND WORDS
- turns per dialogue
- tokens (spoken words) per dialogue
- types (vocabulary) per dialogue
- tokens per turn
- types per turn
- token/type rate (how many words were used before a new word was introduced)
- returns to topics already treated
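The turn and word measures listed on this slide are simple counts. The sketch below shows one way to compute them from transcribed turns, assuming whitespace tokenisation and case-insensitive types; both choices, and the function name, are assumptions rather than the study's actual procedure.

```python
def lexical_stats(turns):
    """Compute turn and word figures for one dialogue.

    turns: list of strings, one transcribed spoken turn each.
    """
    per_turn = [[w.lower() for w in turn.split()] for turn in turns]
    tokens = [w for words in per_turn for w in words]
    types = set(tokens)
    n_turns = len(turns)
    return {
        "turns": n_turns,
        "tokens": len(tokens),   # spoken words
        "types": len(types),     # vocabulary size
        "tokens_per_turn": len(tokens) / n_turns if n_turns else 0.0,
        # Average number of distinct words within each turn.
        "types_per_turn": (
            sum(len(set(words)) for words in per_turn) / n_turns if n_turns else 0.0
        ),
        # Average number of tokens used per distinct word.
        "token_type_rate": len(tokens) / len(types) if types else 0.0,
    }
```

A higher token/type rate means that words are repeated more often before a new word is introduced.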
45Experimental Design
Dependent Variables
- Pen-based Gestures
- Number and types of collected gestures
- loading of an image
- scroll
- zoom
- running a browser
- selection of an area (only MM condition)
- pointing to an area (only MM condition)
- connecting different areas (only MM condition)
- The other gestures (loading an image, scroll, zoom, running a browser) occur in the SO modality too: they are not properly multimodal inputs, but commands concerning multimedia.
46Experimental Design
Dependent Variables
- Dialogue effectiveness
- number of successful turns
- ambiguities concerning place names (ski-areas, towns, hotels)
- reached goal: did the client find a hotel which meets his/her budget?
47Experimental Design
Dependent Variables
- Usability self-reports
- S.U.S. (System Usability Scale) (agents and clients)
- Preference concerning experimental conditions (agents)
48Experimental Design
Material
- Microphone
- Pen and tablet
- 3 maps
- Two web pages
- Same translation systems for the two conditions
- Different instructions for agents and customers
49Experimental Design
Material -screen
- Netmeeting window with
- Push-to-talk button
- Check-uncheck button
- Feedback window with
- Hypothesed string
- Hypothesed meaning
- Textual translation of remote speech
50Experiment - results
Successful dialogues
- CANCELED DIALOGUES: N = 22
- client didn't show up: 3
- interrupted (connection or HLT server crashes): 4
- connection problems (connection failed): 4
- the system was not yet frozen: 5
- incomplete recordings: 6
- FULLY RECORDED DIALOGUES: N = 28
- delays due to connection problems (about 20 minutes): 3
- interruption and restart during dialogue: 3
- synthesis crashed 10 minutes before the end of the dialogue (but the dialogue continued in text mode): 2
51Experiment - results
Speech-related variables
- No significant differences among conditions as to spontaneous events, turns and words figures, dialogue length.
- One spoken turn every 33 seconds (average) in both conditions.
- Average duration per dialogue
- SO: 36 min; MM: 34.5 min
52Experiment - results
Successful turns
- Real turns (excluding non-understandable cases)
- SO: 486 (83); MM: 368 (79)
- Average duration of real turns (from the start of turn i to the start of turn i+1)
- SO: 33.78 s; MM: 32.45 s
53Experiment - results
Repetitions
Percentages of repeated turns, repetitions, and
other turns on real turns
54Percentages of repeated turns and repetitions of
the repeated turns
55Percentages of successful turns (yes), partially
successful turns (par) non-successful turns (no)
and false turns (false).
56Experiment - results
Dialogue fluency
Return rate: number of turns / number of returns
57Experiment - results
Ambiguities
- Number of dialogues containing ambiguities concerning place names (ski-areas, towns, hotels):
- yes: MM 2, SO 5
- no: MM 5, SO 2
- All ambiguities were immediately solved in MM
- Ambiguities were harder to solve in SO
- Original: "It is not Panchià, it is Cavalese"
- Translation: "Pachià not Cavalese"
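With counts this small, a 2x2 table such as the one above is usually examined with an exact test. The slides do not say which test, if any, was applied to these counts, so the following is only a sketch of how a two-sided Fisher exact test could be computed with the standard library.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(
        p
        for p in (prob(k) for k in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Rows: ambiguity yes / no; columns: MM / SO (counts taken from the slide).
p_value = fisher_exact_two_sided(2, 5, 5, 2)
```

The small tolerance factor guards against floating-point ties when comparing table probabilities.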
58Experiment - results
Gesture-related variables
- All gestures but 2 were performed by agents
- Total gestures
- SO: 63; MM: 182
- Few or no deictics used. Gestures mostly accompanied speech ("I'll show it to you on the map")
59Experiment - results
Gesture-related variables
- Average figures for gestures
- loading of an image: 2.7 (both MM and SO; no significant differences)
- scroll: 1.7 (both MM and SO; no significant differences)
- zoom: 0
- running a browser: 0.4 (both MM and SO; no significant differences)
- MM-only gestures: 7.6
- selection of an area: 4.71
- pointing to an area: 1.36
- gestures connecting different areas: 1.4
60Experiment - results
Gesture-related variables
- All gestures were performed at the end of the turn
- Typical sequence
- "I'll show you the ice skating rink on the map"
- Microphone is switched off
- Gesture is performed
- Despite the absence of deictics, gestures were always appropriately introduced by language.
- Hence multilinguality and multimodality are suitably integrated.
61Experiment - results
Goals achievement
- No differences in the number of dialogues in which the client found/didn't find the hotel meeting the requirements
- yes: MM 5, SO 5
- no: MM 2, SO 2
62Experiment - results
Usability
- No differences among conditions as to S.U.S. scores.
- No differences between the clients group and the agents group as to S.U.S. scores.
- Average score: 55
- System Usability Scale (developed by Digital Equipment Co. Ltd, Reading, UK)
- S.U.S. scores have a range of 0 to 100
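For reference, the 0 to 100 range comes from the standard S.U.S. scoring rule: ten items rated 1 to 5, where odd-numbered items contribute (rating - 1), even-numbered items contribute (5 - rating), and the sum is multiplied by 2.5. A minimal sketch follows; the function name is ours.

```python
def sus_score(responses):
    """Compute a System Usability Scale score from ten ratings (1 to 5 each)."""
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("S.U.S. needs exactly ten ratings between 1 and 5")
    total = 0
    for item, rating in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (rating - 1) if item % 2 == 1 else (5 - rating)
    return total * 2.5
```

A respondent who answers every item with the neutral rating 3 therefore scores 50.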
63Experiment - results
Usability
- Strong preference of agents for multimodal interaction
- Weak preference of agents for the English language
Legend: X = strong preference, x = weak preference
Agents no. 5, 6 and 7 took part in only 3 or 4 dialogues (less than half as many as the other agents). Agents 5 and 6 expressed no preferences; agent 7 expressed no preference concerning language.
64Experiment
Conclusions
- Tendency for dialogues to be shorter in MM than in SO
- Tendency for repeated turns to be fewer in MM than in SO
- If returns can be taken as an indicator of dialogue fluency, then there is a tendency for fluency to be better in MM than in SO.
- Moreover, this is even clearer for dialogue segments dealing with spatial information.
65Experiment
Conclusions
- No, or very rare, spontaneous use of deictics.
- All MM gestures were used by agents, with a clear preference for area selection.
- Tendency for MM to exhibit less ambiguity
- Moreover, when present, the ambiguity was immediately solved by resorting to MM resources.
- However, there doesn't seem to be a difference in effectiveness (goal achievement) between SO and MM.
- Strong preference for MM by agents.
66Experiment
Conclusions
- Pen-based input increases the probability of successful interaction, reducing the impact of translation errors
- The advantages of multimodal input are more relevant when spatial information is to be conveyed.
- The greater complexity of the MM system does not prevent users from enjoying the interaction (and from rating it as friendlier and more usable than the SO system)
67Experiment
Conclusions
- The presence/absence of multimodality does not seem to systematically affect low-level linguistic variables
- This seems to be a consequence of the low number of turns with gestures and of the very high frequency of bad turns (technical problems)
68Experiment
Conclusions
- The number of cases is very low considering the number of independent (and confounding) variables, which negatively affects the power of the statistical tests
- A between-subjects design is not able to capture preferences for one modality with complex systems: when users can experience both the SO and the MM versions, the preference for the MM condition is very strong.