1. Vision-Language Integration in AI: a reality check
Katerina Pastra and Yorick Wilks
Department of Computer Science, Natural Language Processing Group,
University of Sheffield, U.K.
2. Setting the context
Artificial Intelligence: from technical integration of modalities to multimodal meaning integration; from Multimedia to Intellimedia and Intelligent Interfaces. Purpose: intelligent, natural, coherent communication.
We focus on vision and language integration:
- Visual modalities: images (visual perception and/or visualisation representations, physically realised as e.g. 2D/3D graphics, photos)
- Linguistic modalities: text and/or speech
3. The problem
- Multimodal integration: an old AI aspiration (cf. Kirsch 1964)
- A wide variety of V-L integration prototypes in AI
Three questions:
- What is computational V-L integration? (definition)
- How is it achieved computationally? (state of the art, practices, tendencies, needs)
- How far can we go? (implementation suggestions, the VLEMA prototype)
4. In search of a definition
Defining computational V-L integration: could a review of related applied AI research hold the answer?
Related work: Srihari 1994, a review of V-L integration prototypes:
- limited number of prototypes reviewed
- suggestions and implementations are mixed
- no clear focus on how integration is achieved
- systems classified according to input type
- includes cases of quasi-integration
5. The notion of quasi-integration
Quasi-integration: the fusion of results obtained by modality-dependent processes (intersection or combination of results, or even the results of one process constraining the search space of another), as in the sketch below.
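To make the notion concrete, here is a minimal sketch in Prolog (one of the languages the deck later names for VLEMA) of quasi-integration by intersection of results; the predicates and the labels are illustrative assumptions, not code from any reviewed prototype.

```prolog
% Quasi-integration sketch: vision and language are handled by
% independent, modality-dependent processes; only their *results* meet.

% Candidate labels from a (hypothetical) image-analysis process.
vision_candidate(region1, sofa).
vision_candidate(region1, bench).

% Candidate labels from a (hypothetical) caption/text-analysis process.
language_candidate(region1, sofa).
language_candidate(region1, table).

% Fusion by intersection: keep only the labels on which both
% modality-dependent processes agree.
fused_label(Region, Label) :-
    vision_candidate(Region, Label),
    language_candidate(Region, Label).

% ?- fused_label(region1, L).
% L = sofa.
```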
6. Defining integration through classification
Main criterion for considering a prototype for review: V-L integration must be essential for the task the prototype is built for.
Specifics of the review:
- It is diachronic: from SHRDLU (Winograd 1972) to the conversational robots of the new millennium (e.g. Shapiro and Ismail 2003, Roy et al. 2003)
- It crosses over diverse AI areas and applications: more than 60 prototypes reviewed, from IR to Robotics
- System classification criterion: the integration purpose served
7. Classification of V-L integration prototypes
8. Examples
9. Beyond differences
- Different visual and linguistic modalities are involved
- Different tasks are performed
- Different integration purposes are served
but similar integration resources are used (though represented and instantiated differently).
Integration resources: associations between visual and corresponding linguistic information, e.g. words/concepts and visual features or image models. Form: lists, integrated KBs, scene/event models in KR. Integration mechanisms: KR instantiation, translation rules, media selection, coordination. A sketch of such a resource and mechanism follows.
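A minimal sketch, in Prolog, of what an association-style integration resource and a KR-instantiation mechanism might look like; the feature vocabulary and the matching scheme are assumptions for illustration, not the representation of any particular prototype.

```prolog
% Association-list style integration resource: each fact pairs a
% word/concept with a (simplified) visual feature model.
associates(heater, features(shape(box),   attached_to(wall))).
associates(sofa,   features(shape(block), attached_to(floor))).

% Integration mechanism in the KR-instantiation style: name a
% segmented object by unifying its extracted features with a stored model.
name_object(SceneFeatures, Word) :-
    associates(Word, SceneFeatures).

% ?- name_object(features(shape(box), attached_to(wall)), W).
% W = heater.
```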
10. A descriptive definition
A descriptive definition combines: a) an intensional definition (what the term is, i.e. its genus et differentia) with b) an extensional definition (what the term applies to).
a) Computational Vision-Language Integration is a process of associating visual and corresponding linguistic pieces of information (with indirect backing from Cognitive Science: cf. the notion of learned associations in Minsky's "Society of Mind", 1986, and Jackendoff's theory of associating concepts and 3D models, 1987).
b) Computational Vision-Language Integration may take the form of one of four integration processes, according to the integration purpose to be served.
11. The AI quest for V-L Integration
Argument: by relying on human-created data, state-of-the-art V-L integration systems avoid core integration challenges and therefore fail to perform real integration.
- Simulated or manually abstracted visual input is used, to avoid the difficulties of image analysis
- Applications are restricted to blocks-worlds/mini-worlds, raising scaling issues
- Manually constructed integration resources are used, to avoid the difficulties of associating V and L
The difficulties of integration (the correspondence problem etc.) lie exactly where developers intervene...
12. How far can we go?
Challenging current practices in V-L integration system development requires that an ambitious system specification be formulated. A prototype should:
- work with real visual scenes
- analyse its visual data automatically
- associate images and language automatically
Is it feasible to develop such a prototype?
13. An optimistic answer
VLEMA: A Vision-Language intEgration MechAnism
- Input: automatically reconstructed static scenes in 3D (VRML format) from RESOLV (a robot-surveyor)
- Integration task: medium translation, from images (3D sitting rooms) to text (what and where, in English)
- Domain: estate surveillance
- A horizontal prototype
- Implemented in shell scripting and Prolog
14. The Input
15. System Architecture
- OntoVis KB
- Object Segmentation
- Object Naming
- Data Transformations
A sketch of how these stages might chain follows.
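A minimal Prolog sketch of how the architecture's stages might compose into a medium-translation pipeline; the predicate names and the stub facts are assumptions for illustration, not VLEMA's actual code.

```prolog
% Stub stages standing in for the architecture's modules.
segment_objects(scene, [region(r1), region(r2)]).           % Object Segmentation
name_object_list([region(r1), region(r2)], [sofa, heater]). % Object Naming (via OntoVis KB)
transform_data([sofa, heater], facts([sofa, heater])).      % Data Transformations
generate_text(facts([sofa, heater]),
              'We can see a sofa and a heater.').           % surface realisation

% The pipeline: from a reconstructed scene to an English description.
describe_scene(Scene, Text) :-
    segment_objects(Scene, Regions),
    name_object_list(Regions, Objects),
    transform_data(Objects, Facts),
    generate_text(Facts, Text).

% ?- describe_scene(scene, T).
% T = 'We can see a sofa and a heater.'.
```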
16. The Output
Wed Jul 7 13:22:22 GMTDT 2004, VLEMA V1.0, Katerina Pastra, University of Sheffield
Description of the automatically constructed VRML file development-scene.wrl:
"This is a general view of a room. We can see the front wall, the left-side wall, the floor, a heater on the lower part of the front wall, and a sofa with 3 seats. The heater is shorter in length than the sofa. It is on the right of the sofa."
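The comparative and spatial statements in this output can be derived from scene geometry. Below is a minimal Prolog sketch under assumed, simplified bounding-extent facts; the coordinates and predicate names are illustrative, not VLEMA's data.

```prolog
% Assumed, simplified geometry: horizontal extent and length per object.
object(sofa,   x_extent(0.0, 2.0), length(2.0)).
object(heater, x_extent(2.5, 3.3), length(0.8)).

% "A is on the right of B": A starts beyond B's right edge.
right_of(A, B) :-
    object(A, x_extent(MinA, _), _),
    object(B, x_extent(_, MaxB), _),
    MinA > MaxB.

% "A is shorter in length than B".
shorter_than(A, B) :-
    object(A, _, length(LA)),
    object(B, _, length(LB)),
    LA < LB.

% ?- right_of(heater, sofa), shorter_than(heater, sofa).
% true.
```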
17. Conclusion
Could occasional reality checks re-direct (part of) AI research?
- A descriptive definition of V-L integration in AI; a theoretical, explanatory one is given in K. Pastra (2004), "Viewing Vision-Language Integration as a Double-Grounding Case", Proceedings of the AAAI Fall Symposium Series, Washington DC.
- A review and critique of the state of the art in AI
- The VLEMA prototype: a baseline for future research that will challenge current practices