1
Image-Text Relations in Multimedia Systems
  • School of Computing, Mathematical and Information
    Sciences
  • University of Brighton, 28 April 2004
  • Dr Andrew Salway
  • Dept. of Computing, University of Surrey
  • a.salway@surrey.ac.uk

2
BACKGROUND
  • Image-text combinations widespread in human
    communication
  • Spoken language and gesture
  • Printed media
  • Moving images and dialogue
  • Digital multimedia → 'new media'

6
BACKGROUND
  • Multimedia systems process correlated media for:
  • Multimedia Information Retrieval and Fusion
  • Summarization, Adaptation and Generation

7
Overview PART 1
  • Processing experts' descriptions and
    interpretations into machine-executable
    surrogates of visual information, e.g. to index
    image and video data

8
Background
  • GOOGLE Image Search

9
Motivation
  • "One way to resolve the semantic gap comes from
    integrating other sources of information about
    the image. Information about an image can come
    from a number of different sources: the image
    content, labels attached to the image, images
    embedded in a text, and so on. We still have
    very primitive ways of integrating this
    information in order to optimize access to
    images."
  • Smeulders et al. (2000), 'Content-Based Image
    Retrieval at the End of the Early Years', IEEE
    Trans. Pattern Analysis and Machine Intelligence
    22(12), 1349-1380.

10
Less Primitive Ways of Integrating Textual
Information About Images?
  • Considering, for now, keyword-based
    representations of image content, perhaps two
    important questions are:
  • How do we pick good keywords from the mass of
    text apparently associated with an image?
  • How do we associate keywords with specific image
    attributes (e.g. place of creation, literal
    content, abstract meanings), specific regions of
    the image, or specific intervals of video data?

11
Less Primitive Ways of Integrating Textual
Information About Images?
  • Consider specialist images and the texts produced
    by trained experts to be informative about them
  • Extensive and diverse textual information
    produced by art experts, scene of crime officers
    and forensic scientists, film makers, audio
    describers, etc.
  • Other text can be elicited from experts
    relatively cheaply (?), e.g. during creation of
    images

13
  • I can see what appears to be a male laying in the
    prone position on the floor.
  • He is wearing a maroon striped shirt with white
    collar and cuffs, blue jeans, and has a pair of
    left and right training shoes which have become
    slightly dis-extended from the foot.
  • There appears to be a green tie down by his right
    hand and I can see a possible footwear impression
    in blood on his right hand.
  • Surrounding the body there are droplets of blood,
    footwear impression in blood and several pieces
    of broken glass and bottles.

16
Issues
  • Keywords' contexts: an indication of what image
    attribute (or region / interval) they relate to?
  • Description and Interpretation: are there
    systematically different languages used to
    describe and to interpret visual information?
  • Experts' consistency: to what extent do trained
    experts analyse visual information in the same
    way?
  • Image-Text Relations

17
Approach
  • Analyse collateral text corpora, including extant
    texts and experts' analyses elicited in
    controlled scenarios
  • Develop systems that integrate image/video data
    and collateral text

18
Cues for content
  • Which words in collateral text refer to image
    content?
  • Art Corpus
  • 804,939 words from Tate Gallery WWW-site
  • Painting captions - 691,121 words
  • Artist biographies - 113,818 words
  • 'depict' (295 occurrences in captions) and
    'convey' (119)
  • 'this painting depicts a glass, two pears and a
    box'
  • 'this work depicts a group struggling in a wind'
  • 'this composition conveys the claustrophobia of
    the interior of an omnibus'
  • 'an expressive use of colour and shape to convey
    the subject's mood'
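The cue-word search above can be sketched as a simple concordance step; the three-sentence corpus below is a hypothetical stand-in for the Tate caption texts.

```python
import re

def cue_sentences(text, cue):
    """Return the sentences containing an inflection of the cue verb."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Match the base form plus simple inflections: depict / depicts / depicted / depicting
    pattern = re.compile(r"\b" + cue + r"(s|ed|ing)?\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

corpus = ("This painting depicts a glass, two pears and a box. "
          "The artist was born in 1900. "
          "This composition conveys the claustrophobia of the interior.")
print(cue_sentences(corpus, "depict"))
print(cue_sentences(corpus, "convey"))
```

A real run would iterate this over the full 804,939-word corpus and tally occurrence counts per cue.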

19
Cues for content
  • Which of Panofsky's levels do words in a
    collateral text refer to?
  • 'depict' / 'convey' cue references to:
  • pre-iconographical: 56% / 0%
  • iconographical: 41% / 9%
  • iconological: 3% / 91%
  • Results from an earlier 305,913-word corpus:
    145 occurrences of 'depict', 47
    of 'convey'

20
Cues for content
  • Frequent words in left-hand contexts
  • 'depict': painting, portrait, work
  • 'convey': colours, elements, surfaces
  • Frequent words in right-hand contexts
  • 'depict': movement, scene, landscape, Christ
  • 'convey': sense, essence, mood, spiritual
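The left-hand / right-hand context lists come from counting words in a fixed window around each cue occurrence; a minimal sketch, using a hypothetical toy caption in place of the real corpus:

```python
import re
from collections import Counter

def context_counts(tokens, cue, window=3):
    """Count words in a fixed window to the left and right of each cue token."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok.startswith(cue):  # crude stemming: depict / depicts / depicted
            left.update(tokens[max(0, i - window):i])
            right.update(tokens[i + 1:i + 1 + window])
    return left, right

text = "This painting depicts a landscape and this work depicts a scene"
tokens = re.findall(r"[a-z]+", text.lower())
left, right = context_counts(tokens, "depict")
print(left.most_common(3))
print(right.most_common(3))
```

Sorting the counters by frequency yields exactly the kind of ranked context lists shown on the slide.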

21
Cues to chart the history of art
  • 'influence' (403 occurrences) and 'inspire' (442);
    about 80% passive
  • 'his paintings of the Thames were influenced by
    Whistler'
  • 'where he was influenced by expressionism'
  • 'this picture was inspired by a performance of
    Shakespeare's play Macbeth'
  • 'Severini was inspired by modern machinery'

22
Cues to chart the history of art
  • Classifying collocations of 'influence' and 'inspire'
  • 49: PERSON influenced by PERSON
  • 22: PERSON influenced by MOVEMENT
  • 31: WORK inspired by PERSON / ENVIRONMENT / WORLD
    / WORK
  • 16: PERSON inspired by PERSON
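One way to sketch this collocation classification is pattern matching over entity-tagged sentences; it assumes mentions have already been replaced by tags such as PERSON, MOVEMENT and WORK (the tagging step itself is not shown, and the patterns are illustrative).

```python
import re

# Each pattern pairs an entity-tag template with its collocation class.
PATTERNS = [
    (re.compile(r"PERSON \w+ influenced by PERSON"), "PERSON influenced by PERSON"),
    (re.compile(r"PERSON \w+ influenced by MOVEMENT"), "PERSON influenced by MOVEMENT"),
    (re.compile(r"WORK \w+ inspired by PERSON"), "WORK inspired by PERSON"),
    (re.compile(r"PERSON \w+ inspired by PERSON"), "PERSON inspired by PERSON"),
]

def classify(tagged_sentence):
    """Return the first collocation class whose pattern matches the sentence."""
    for pattern, label in PATTERNS:
        if pattern.search(tagged_sentence):
            return label
    return "unclassified"

print(classify("PERSON was influenced by PERSON"))
print(classify("WORK was inspired by PERSON"))
```

Counting the labels over a tagged corpus would reproduce the frequency table above.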

23
Eliciting spoken commentaries
  • Five dance experts each asked to 'Describe' then
    to 'Interpret' five dance sequences as they
    watched them (20 minutes in total)
  • 11,300 words of description
  • 9,754 words of interpretation
  • There appear to be systematic contrasts between
    description and interpretation

25
Descriptions
  • Utterances
  • Single words in rapid sequence to identify
    movements
  • Spatio-temporal details and relationships between
    dancers
  • Most frequent open-class words referred literally
    to dancers, their movements and space: woman,
    arm, leg, turn, jump, spin, arabesque, pirouette,
    left, right
  • Note that when allowed to stop and start the
    video, one expert spoke for about 30 minutes
    about two minutes of dance

26
Interpretations
  • Most frequent open-class words referred
    non-literally to dancers, their movements and
    themes of the dance: swan, prince, wing, flight,
    ethereality
  • Longer utterances either referring to larger
    video intervals, or linking literal descriptions
    to interpreted meaning, conjoined by 'seems', 'as
    if', 'like', 'a sense of', 'suggest', 'appears to be'
  • 'The stretching of the neck, like a swan'
  • 'Aerial steps which could suggest flight'
  • 'Moving faster as if something is driving him'

28
Descriptions shaded by dance
29
Descriptions shaded by expert
30
Audio Description
  • Enhances the enjoyment of most kinds of films and
    television programs for visually impaired
    viewers
  • In between existing dialogue a describer gives
    important information about on-screen scenes and
    events, and about characters' actions,
    appearance, gestures and expressions
  • Provided with some digital television broadcasts
    and with films in some cinemas and on VHS/DVD
    releases: currently 500 films with British
    English audio description, and up to 10% of
    television broadcasts
  • In effect, that part of the story told by the
    moving image is retold in words

31
Audio Description Script
  • 11.43 Hanna passes Jan some banknotes.
  • 11.55 Laughing, Jan falls back into her seat as
    the jeep overtakes the line of the lorries.
  • 12.01 An explosion on the road ahead.
  • 12.08 The jeep has hit a mine.
  • 12.09 Hanna jumps from the lorry.
  • 12.20 Desperately she runs towards the mangled
    jeep.
  • 12.27 Soldiers try to stop her.
  • 12.31 She struggles with the soldier who grabs
    hold of her firmly.
  • 12.35 He lifts her bodily from the ground,
    holding her tightly in his arms.

32
Computing Narrative: analysis of emotion tokens
  • METHOD
  • Create list of Emotion Tokens for each of 22
    Emotion Types proposed by Ortony, Clore and
    Collins (1988)
  • Plot the occurrence of these tokens in audio
    description scripts over time
  • Analyse distribution of emotion tokens as a
    representation of film content
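The method above can be sketched as parsing the timestamped script and binning token hits per minute; the token lists here cover only two of the 22 OCC emotion types and are hypothetical, as is the three-line script.

```python
import re
from collections import Counter

# Hypothetical token lists for two of the 22 OCC emotion types.
EMOTION_TOKENS = {
    "distress": {"desperately", "struggles"},
    "joy": {"laughing", "delighted"},
}

def emotion_timeline(script):
    """Count emotion-type hits per minute of a timestamped description script."""
    counts = Counter()
    for line in script.strip().splitlines():
        m = re.match(r"\s*(\d+)\.(\d+)\s+(.*)", line)  # e.g. "12.20 Desperately she..."
        if not m:
            continue
        minute = int(m.group(1))
        words = set(re.findall(r"[a-z]+", m.group(3).lower()))
        for emotion, tokens in EMOTION_TOKENS.items():
            if words & tokens:
                counts[(minute, emotion)] += 1
    return counts

script = """
11.55 Laughing, Jan falls back into her seat.
12.20 Desperately she runs towards the mangled jeep.
12.31 She struggles with the soldier.
"""
print(emotion_timeline(script))
```

Plotting these per-minute counts over a whole film gives the kind of emotion-token plots shown on the next slides.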

33
Plot of Emotion Tokens in Audio Description for
Captain Corelli's Mandolin
34
Plot of Emotion Tokens in Audio Description for
Captain Corelli's Mandolin
  • 52 tokens of 8 emotion types
  • 15-20 minutes: Pelagia's betrothal to Mandras
  • 20-30 minutes: invasion of the island
  • 68-74 minutes: Pelagia and Corelli's growing relationship
  • 92-95 minutes: German soldiers disarm Italians
35
Plot of Emotion Tokens in Audio Description for
The Postman
36
Computing Narrative: analysis of emotion tokens
  • Results suggest that we can access some aspects
    of narrative structure in this way
  • → Video Retrieval by Story Similarity
  • → Video Summarisation by Dramatic Sequences
  • → Video Browsing by Cause-Effect Relationships

37
Overview PART 2
  • II. Classifying image-text relations to process a
    greater diversity of image-text combinations

42
Classifying Image-Text Relations
  • Interested in developing a computational
    understanding of how to read an image-text
    combination
  • not simply a question of adding the result of
    text content analysis to the result of image
    content analysis!
  • It seems that a key aspect of understanding an
    image-text combination is the way in which the
    image and the text relate to one another
  • in terms of relative importance
  • and, in terms of how they function to convey
    meaning.

43
Classifying Image-Text Relations
  • Words like 'illustrate', 'describe' and
    'equivalent' capture some of our intuitions about
    how images and texts relate.
  • Maybe a computationally tractable framework for
    classifying image-text relations is required to
    facilitate better processing of image-text
    combinations in a wide range of applications.

44
Classifying Image-Text Relations
  • With regard to an image and a text in
    combination:
  • How can we tell which is more important for
    successful communication?
  • What correspondence is there between the
    information conveyed by one and by the other?
  • What information, or other value, does one add to
    the other?
  • If we understand the content of one, then what
    can we infer about the content of the other?
  • What conventions are there for combining images
    and texts in particular genres of communication?

45
Proposed Classification Scheme
  • In our classification of image-text relations we
    distinguish two kinds of relations that we take
    to be mutually independent.
  • Status relations are to do with the relative
    importance of the text and the image, or the
    dependence of one on the other.
  • Logico-semantic relations are to do with the
    functions that images and texts serve for one
    another.
  • Different relations may hold between different
    parts of images and texts, i.e. between image
    regions and text fragments.
  • Based on Barthes (1977) and Halliday (1994)

46
Status Relations
  • The relation between an image and a text is equal
    when
  • both the image and the text are required for
    successful communication, in which case they are
    equal-complementary OR
  • both the image and the text can be understood
    individually, in which case they are
    equal-independent.
  • The relation between an image and a text is
    unequal when either the image or the text can be
    understood individually - that which cannot be
    understood individually is subordinate to the
    other.
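The status-relation definitions above reduce to a small decision rule; a minimal sketch, assuming the two booleans ("can this mode be understood on its own?") come from a human annotator or an upstream classifier.

```python
from enum import Enum

class Status(Enum):
    EQUAL_COMPLEMENTARY = "equal-complementary"
    EQUAL_INDEPENDENT = "equal-independent"
    IMAGE_SUBORDINATE = "image subordinate to text"
    TEXT_SUBORDINATE = "text subordinate to image"

def status_relation(image_standalone, text_standalone):
    """Classify the status relation between an image and its text."""
    if image_standalone and text_standalone:
        return Status.EQUAL_INDEPENDENT      # both understood individually
    if not image_standalone and not text_standalone:
        return Status.EQUAL_COMPLEMENTARY    # both required for communication
    # Unequal: the mode that cannot stand alone is subordinate to the other.
    return Status.TEXT_SUBORDINATE if image_standalone else Status.IMAGE_SUBORDINATE

print(status_relation(False, False).value)  # equal-complementary
```

Encoding the scheme this way makes the mutual exclusivity of the four outcomes explicit.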

47
Logico-Semantic Relations
  • A text elaborates the meaning of an image, and
    vice versa, by further specifying or describing
    it
  • A text extends the meaning of an image, and vice
    versa, by adding new information
  • A text enhances the meaning of an image, and vice
    versa, by qualifying it with reference to time,
    place and/or cause-effect

48
Automatic Classification?
  • Features of interest to us include:
  • Page layout and formatting: the relative size
    and position of the image and the text; font
    type and size; image border
  • Lexical references in text: for example, 'This
    picture shows ...', 'See Figure 1 on the left',
    '... is shown by ...'
  • Grammatical characteristics of the text: tense
    (past / present); quantification (single /
    many); full sentences or short phrases
  • Modality of images: a scale from realistic to
    abstract, or from photographic to graphic; a
    function of depth, colour saturation, colour
    differentiation, colour modulation,
    contextualisation, pictorial detail, illumination
    and degree of brightness; may correlate with use
    of GIF / JPEG
  • Framing of images: for example, one centred
    subject, or no particular subject
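The lexical-reference features, at least, are straightforward to extract; a minimal sketch with illustrative cue patterns (a full system would add the layout, grammar, modality and framing features as well).

```python
import re

# Illustrative deictic cue patterns taken from the examples on the slide.
DEICTIC_CUES = [
    r"this (picture|figure|image) shows",
    r"see figure \d+",
    r"is shown by",
]

def lexical_features(text):
    """One binary feature per cue: does the text contain that pattern?"""
    lowered = text.lower()
    return [bool(re.search(cue, lowered)) for cue in DEICTIC_CUES]

print(lexical_features("This picture shows a swan; see Figure 1 on the left."))
# [True, True, False]
```

Feature vectors like this could then feed a standard classifier trained on annotated image-text pairs.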

49
Applying Image-Text Relations?
  • Cross-modal Information Retrieval
  • Hypermedia Systems
  • Multimedia Generation

50
Closing Remarks
  • NEXT STEPS
  • Refine classification scheme, especially for
    image-text combinations with diagrams
  • Build image-text corpora to train a
    classification system
  • Evaluate image-text relations in multimedia
    applications

51
Acknowledgements
  • This work has been carried out with
  • Chris Frehen - Tate analysis
  • Mike Graham - TIWO
  • Radan Martinec - Image-Text Relations
  • Elia Tomadaki - TIWO
  • Yan Xu - TIWO
  • Television in Words (TIWO): EPSRC GR/R67194/01

52
Publications
  • Salway and Graham (2003), 'Extracting Information
    about Emotions in Films', Procs. 11th ACM
    Conference on Multimedia 2003, 4th-6th Nov. 2003,
    pp. 299-302. ISBN 1-58113-722-2.
  • Salway, Graham, Tomadaki and Xu (2003), 'Linking
    Video and Text via Representations of Narrative',
    AAAI Spring Symposium on Intelligent Multimedia
    Knowledge Management, Palo Alto, 24-26 March
    2003.
  • Salway and Tomadaki (2002), 'Temporal Information
    in Collateral Texts for Indexing Moving Images',
    Proceedings of LREC 2002 Workshop on Annotation
    Standards for Temporal Information in Natural
    Language, eds. A. Setzer and R. Gaizauskas, pp.
    36-43.
  • Salway and Frehen (2002), 'Words for Pictures:
    analysing a corpus of art texts', Procs. TKE
    2002: Terminology and Knowledge Engineering.
  • Salway and Ahmad (1999), 'Multimedia Systems and
    Semiotics: Collateral Texts for Video Annotation',
    IEE Colloquium Digest, Multimedia Databases and
    MPEG-7, 29 Jan. 1999, London: IEE.
  • Salway and Ahmad (1998), 'Talking Pictures:
    Indexing and Representing Video with Collateral
    Texts', 14th Twente Workshop on Language
    Technology - Language Technology for Multimedia
    Information Retrieval.

53
Image-Text Relations in Multimedia Systems
  • School of Computing, Mathematical and Information
    Sciences
  • University of Brighton, 28 April 2004
  • Dr Andrew Salway
  • Dept. of Computing, University of Surrey
  • a.salway@surrey.ac.uk