Title: Image-Text Relations in Multimedia Systems
1. Image-Text Relations in Multimedia Systems
- School of Computing, Mathematical and Information Sciences - University of Brighton, 28 April 2004
- Dr Andrew Salway
- Dept. of Computing, University of Surrey
- a.salway_at_surrey.ac.uk
2. BACKGROUND
- Image-text combinations are widespread in human communication:
  - Spoken language and gesture
  - Printed media
  - Moving images and dialogue
  - Digital multimedia → new media
6. BACKGROUND
- Multimedia systems process correlated media for:
  - Multimedia Information Retrieval and Fusion
  - Summarization, Adaptation and Generation
7. Overview PART 1
- Processing experts' descriptions and interpretations into machine-executable surrogates of visual information, e.g. to index image and video data
8. Background
9. Motivation
- "One way to resolve the semantic gap comes from integrating other sources of information about the image... Information about an image can come from a number of different sources: the image content, labels attached to the image, images embedded in a text, and so on. We still have very primitive ways of integrating this information in order to optimize access to images."
- Smeulders et al. (2000), Content-Based Image Retrieval at the End of the Early Years, IEEE Trans. Pattern Analysis and Machine Intelligence 22(12), 1349-1380.
10. Less Primitive Ways of Integrating Textual Information About Images?
- Considering, for now, keyword-based representations of image content, perhaps two important questions are:
  - How do we pick good keywords from the mass of text apparently associated with an image?
  - How do we associate keywords with specific image attributes (e.g. place of creation, literal content, abstract meanings), specific regions of the image, or specific intervals of video data?
11. Less Primitive Ways of Integrating Textual Information About Images?
- Consider specialist images and the texts produced by trained experts to be informative about them:
  - Extensive and diverse textual information produced by art experts, scene of crime officers and forensic scientists, film makers, audio describers, etc.
  - Other text can be elicited from experts relatively cheaply (?), e.g. during creation of images
13.
- I can see what appears to be a male laying in the prone position on the floor.
- He is wearing a maroon striped shirt with white collar and cuffs, blue jeans, and has a pair of left and right training shoes which have become slightly dis-extended from the foot.
- There appears to be a green tie down by his right hand and I can see a possible footwear impression in blood on his right hand.
- Surrounding the body there are droplets of blood, footwear impressions in blood and several pieces of broken glass and bottles.
16. Issues
- Keyword contexts: an indication of what image attribute (or region / interval) they relate to?
- Description and Interpretation: are there systematically different languages used to describe and to interpret visual information?
- Experts' consistency: to what extent do trained experts analyse visual information in the same way?
- Image-Text Relations
17. Approach
- Analyse collateral text corpora, including extant texts and experts' analyses elicited in controlled scenarios
- Develop systems that integrate image/video data and collateral text
18. Cues for content
- Which words in collateral text refer to image content?
- Art Corpus:
  - 804,939 words from Tate Gallery WWW-site
  - Painting captions - 691,121 words
  - Artist biographies - 113,818 words
- depict (295 occurrences in captions) and convey (119):
  - "this painting depicts a glass, two pears and a box"
  - "this work depicts a group struggling in a wind"
  - "this composition conveys the claustrophobia of the interior of an omnibus"
  - "an expressive use of colour and shape to convey the subject's mood"
19. Cues for content
- Which of Panofsky's levels do words in a collateral text refer to?
- depict / convey cue references to:
  - pre-iconographical: 56% / 0%
  - iconographical: 41% / 9%
  - iconological: 3% / 91%
- Results from an earlier 305,913-word corpus: 145 occurrences of depict, 47 of convey
20. Cues for content
- Frequent words in left-hand contexts:
  - depict: painting, portrait, work
  - convey: colours, elements, surfaces
- Frequent words in right-hand contexts:
  - depict: movement, scene, landscape, Christ
  - convey: sense, essence, mood, spiritual
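The left-hand / right-hand context counts above can be computed with a simple concordance-style pass over a tokenised corpus. A minimal sketch, assuming plain whitespace tokenisation; the function name, window size and toy corpus are illustrative, not from the original study:

```python
from collections import Counter

def context_frequencies(tokens, cue, window=3):
    """Count the words that occur within `window` tokens to the left
    and to the right of each occurrence of `cue`."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok == cue:
            left.update(tokens[max(0, i - window):i])
            right.update(tokens[i + 1:i + 1 + window])
    return left, right

# Toy corpus built from example captions on the previous slides.
tokens = ("this painting depicts a glass this work depicts "
          "a group struggling in a wind").split()
left, right = context_frequencies(tokens, "depicts")
# Left contexts are dominated by nouns naming the artwork (painting,
# work); right contexts by nouns naming the depicted content.
```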
21. Cues to chart the history of art
- influence (403 occurrences) and inspire (442); about 80% passive:
  - "his paintings of the Thames were influenced by Whistler"
  - "where he was influenced by expressionism"
  - "this picture was inspired by a performance of Shakespeare's play Macbeth"
  - "Severini was inspired by modern machinery"
22. Cues to chart the history of art
- Classifying collocations of influence and inspire:
  - 49 PERSON influenced by PERSON
  - 22 PERSON influenced by MOVEMENT
  - 31 WORK inspired by PERSON / ENVIRONMENT / WORLD / WORK
  - 16 PERSON inspired by PERSON
23. Eliciting spoken commentaries
- Five dance experts each asked to "Describe" then to "Interpret" five dance sequences as they watched them (20 minutes in total) →
  - 11,300 words of description
  - 9,754 words of interpretation
- There appear to be systematic contrasts between description and interpretation
25. Descriptions
- Utterances:
  - Single words in rapid sequence to identify movements
  - Spatio-temporal details and relationships between dancers
- Most frequent open-class words referred literally to dancers, their movements and space: woman, arm, leg, turn, jump, spin, arabesque, pirouette, left, right
- Note that when allowed to stop and start the video, one expert spoke for about 30 minutes about two minutes of dance
26. Interpretations
- Most frequent open-class words referred non-literally to dancers, their movements and themes of the dance: swan, prince, wing, flight, ethereality
- Longer utterances either referring to larger video intervals, or linking literal descriptions to interpreted meaning, conjoined by "seems", "as if", "like", "a sense of", "suggest", "appears to be":
  - "The stretching of the neck, like a swan"
  - "Aerial steps which could suggest flight"
  - "Moving faster as if something is driving him"
28. Descriptions shaded by dance
29. Descriptions shaded by expert
30. Audio Description
- Enhances the enjoyment of most kinds of films and television programs for visually impaired viewers
- In between existing dialogue, a describer gives important information about on-screen scenes and events, and about characters' actions, appearance, gestures and expressions
- Provided with some digital television broadcasts and with films in some cinemas and on VHS/DVD releases; currently 500 films with British English audio description, and up to 10% of television broadcasts
- In effect, that part of the story told by the moving image is retold in words
31. Audio Description Script
- 11.43 Hanna passes Jan some banknotes.
- 11.55 Laughing, Jan falls back into her seat as the jeep overtakes the line of lorries.
- 12.01 An explosion on the road ahead.
- 12.08 The jeep has hit a mine.
- 12.09 Hanna jumps from the lorry.
- 12.20 Desperately she runs towards the mangled jeep.
- 12.27 Soldiers try to stop her.
- 12.31 She struggles with the soldier who grabs hold of her firmly.
- 12.35 He lifts her bodily from the ground, holding her tightly in his arms.
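Scripts in this timestamped form are straightforward to process automatically. A minimal parser sketch, assuming one "MM.SS description" entry per line; the function name is illustrative, not from the TIWO system:

```python
import re

def parse_ad_script(lines):
    """Parse audio-description entries of the form 'MM.SS text'
    into (seconds, text) pairs for time-based analysis."""
    entries = []
    for line in lines:
        m = re.match(r"(\d+)\.(\d{2})\s+(.+)", line.strip())
        if m:
            minutes, seconds, text = m.groups()
            entries.append((int(minutes) * 60 + int(seconds), text))
    return entries

script = ["11.43 Hanna passes Jan some banknotes.",
          "12.01 An explosion on the road ahead."]
entries = parse_ad_script(script)
# entries[0] is (703, 'Hanna passes Jan some banknotes.')
```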
32. Computing Narrative: analysis of emotion tokens
- METHOD:
  - Create a list of Emotion Tokens for each of the 22 Emotion Types proposed by Ortony, Clore and Collins (1988)
  - Plot the occurrence of these tokens in audio description scripts over time
  - Analyse the distribution of emotion tokens as a representation of film content
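The method above can be sketched as a token-matching pass over a parsed script: count lexicon hits per time bucket. The two-type mini-lexicon and bucket size below are placeholders (the study used token lists for all 22 Ortony, Clore and Collins types), and the function name is an assumption:

```python
from collections import Counter

# Placeholder mini-lexicon; the real study listed tokens for each of
# the 22 emotion types of Ortony, Clore and Collins (1988).
EMOTION_TOKENS = {"fear": {"desperately", "terrified"},
                  "joy": {"laughing", "delighted"}}

def emotion_distribution(entries, bucket_seconds=300):
    """Count emotion-token hits per time bucket, given (seconds, text)
    pairs from an audio-description script."""
    lexicon = set().union(*EMOTION_TOKENS.values())
    hits = Counter()
    for seconds, text in entries:
        words = {w.strip(".,").lower() for w in text.split()}
        hits[seconds // bucket_seconds] += len(words & lexicon)
    return hits  # clusters of hits suggest a dramatic sequence

entries = [(703, "Laughing, Jan falls back into her seat."),
           (740, "Desperately she runs towards the mangled jeep.")]
buckets = emotion_distribution(entries)
```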
33. Plot of Emotion Tokens in Audio Description for Captain Corelli's Mandolin
34. Plot of Emotion Tokens in Audio Description for Captain Corelli's Mandolin
- 52 tokens of 8 emotion types
- 15-20 minutes: Pelagia's betrothal to Mandras
- 20-30 minutes: invasion of the island
- 68-74 minutes: Pelagia and Corelli's growing relationship
- 92-95 minutes: German soldiers disarm Italians
35. Plot of Emotion Tokens in Audio Description for The Postman
36. Computing Narrative: analysis of emotion tokens
- Results suggest that we can access some aspects of narrative structure in this way:
  - → Video Retrieval by Story Similarity
  - → Video Summarisation by Dramatic Sequences
  - → Video Browsing by Cause-Effect Relationships
37. Overview PART 2
- II. Classifying image-text relations to process a greater diversity of image-text combinations
42. Classifying Image-Text Relations
- Interested in developing a computational understanding of how to "read" an image-text combination
- Not simply a question of adding the result of text content analysis to the result of image content analysis!
- It seems that a key aspect of understanding an image-text combination is the way in which the image and the text relate to one another:
  - in terms of relative importance
  - and, in terms of how they function to convey meaning
43. Classifying Image-Text Relations
- Words like "illustrate", "describe" and "equivalent" capture some of our intuitions about how images and texts relate.
- Maybe a computationally tractable framework for classifying image-text relations is required to facilitate better processing of image-text combinations in a wide range of applications.
44. Classifying Image-Text Relations
- With regard to an image and a text in combination:
  - How can we tell which is more important for successful communication?
  - What correspondence is there between the information conveyed by one and by the other?
  - What information, or other value, does one add to the other?
  - If we understand the content of one, then what can we infer about the content of the other?
  - What conventions are there for combining images and texts in particular genres of communication?
45. Proposed Classification Scheme
- In our classification of image-text relations we distinguish two kinds of relations that we take to be mutually independent:
  - Status relations are to do with the relative importance of the text and the image, or the dependence of one on the other.
  - Logico-semantic relations are to do with the functions that images and texts serve for one another.
- Different relations may hold between different parts of images and texts, i.e. between image regions and text fragments.
- Based on Barthes (1977) and Halliday (1994)
46. Status Relations
- The relation between an image and a text is equal when:
  - both the image and the text are required for successful communication, in which case they are equal-complementary, OR
  - both the image and the text can be understood individually, in which case they are equal-independent.
- The relation between an image and a text is unequal when either the image or the text can be understood individually - that which cannot be understood individually is subordinate to the other.
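These definitions reduce to a small decision rule over two observations: can the image be understood alone, and can the text? A sketch of that rule; the function name and label strings are illustrative:

```python
def status_relation(image_alone, text_alone):
    """Classify the status relation from whether each mode can be
    understood individually, following the definitions above."""
    if image_alone and text_alone:
        return "equal-independent"
    if not image_alone and not text_alone:
        # both are required for successful communication
        return "equal-complementary"
    # exactly one stands alone; the other is subordinate to it
    return ("unequal: text subordinate to image" if image_alone
            else "unequal: image subordinate to text")
```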
47. Logico-Semantic Relations
- A text elaborates the meaning of an image, and
vice versa, by further specifying or describing
it - A text extends the meaning of an image, and vice
versa, by adding new information - A text enhances the meaning of an image, and vice
versa, by qualifying it with reference to time,
place and/or cause-effect
48. Automatic Classification?
- Features of interest to us include:
  - Page layout and formatting: the relative size and position of the image and the text; font type and size; image border
  - Lexical references in text: for example, "This picture shows...", "See Figure 1", "on the left", "...is shown by..."
  - Grammatical characteristics of the text: tense (past / present); quantification (single / many); full sentences or short phrases
  - Modality of images: a scale from realistic to abstract, or from photographic to graphic; a function of depth, colour saturation, colour differentiation, colour modulation, contextualisation, pictorial detail, illumination and degree of brightness; may correlate with use of GIF / JPEG
  - Framing of images: for example, one centred subject, or no particular subject
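As a small illustration of the lexical-reference features, the following sketch checks a caption against cue patterns taken from the examples above; the pattern list and function name are assumptions, not the project's actual feature set:

```python
import re

# Cue patterns drawn from the lexical-reference examples above.
LEXICAL_CUES = [r"this picture shows", r"see figure \d+",
                r"on the left", r"is shown by"]

def lexical_reference_features(text):
    """Return one binary feature per cue pattern: does the text
    explicitly reference an accompanying image?"""
    lowered = text.lower()
    return {p: bool(re.search(p, lowered)) for p in LEXICAL_CUES}

feats = lexical_reference_features("See Figure 1 on the left.")
# The "see figure" and "on the left" features fire; the others do not.
```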
49. Applying Image-Text Relations?
- Cross-modal Information Retrieval
- Hypermedia Systems
- Multimedia Generation
50. Closing Remarks
- NEXT STEPS:
  - Refine the classification scheme, especially for image-text combinations with diagrams
  - Build image-text corpora to train a classification system
  - Evaluate image-text relations in multimedia applications
51. Acknowledgements
- This work has been carried out with:
  - Chris Frehen - Tate analysis
  - Mike Graham - TIWO
  - Radan Martinec - Image-Text Relations
  - Elia Tomadaki - TIWO
  - Yan Xu - TIWO
- Television in Words (TIWO): EPSRC GR/R67194/01
52. Publications
- Salway and Graham (2003), "Extracting Information about Emotions in Films". Procs. 11th ACM Conference on Multimedia 2003, 4th-6th Nov. 2003, pp. 299-302. ISBN 1-58113-722-2.
- Salway, Graham, Tomadaki and Xu (2003), "Linking Video and Text via Representations of Narrative". AAAI Spring Symposium on Intelligent Multimedia Knowledge Management, Palo Alto, 24-26 March 2003.
- Salway and Tomadaki (2002), "Temporal Information in Collateral Texts for Indexing Moving Images". Proceedings of LREC 2002 Workshop on Annotation Standards for Temporal Information in Natural Language, eds. A. Setzer and R. Gaizauskas, pp. 36-43.
- Salway and Frehen (2002), "Words for Pictures: analysing a corpus of art texts". Procs. TKE 2002 - Terminology and Knowledge Engineering.
- Salway and Ahmad (1999), "Multimedia Systems and Semiotics: Collateral Texts for Video Annotation". IEE Colloquium Digest, Multimedia Databases and MPEG-7, 29 Jan. 1999, London: IEE.
- Salway and Ahmad (1998), "Talking Pictures: Indexing and Representing Video with Collateral Texts". 14th Twente Workshop on Language Technology - Language Technology for Multimedia Information Retrieval.