1
Semantic Content based Modeling
  • Video semantics are captured and organized to
    support video retrieval
  • Difficult to automate
  • Relies on manual annotation
  • Capable of supporting natural-language-like
    queries

2
Video Content Extraction
  • Other forms of information extraction can be
    employed:
  • Closed-caption text
  • Speech recognition
  • Descriptive information from the screenplay
  • Key frames that characterize a shot
  • This content information can be associated with
    the video story units.

3
Existing Semantic-Level Models
  • Segmentation-based Models
  • Stratification-based Models
  • Temporal Coherent Models

4
Segmentation-based Modeling
  • A video stream is segmented into temporally
    continuous segments.
  • Each segment is associated with a description,
    which could be natural text, keywords, or other
    kinds of annotation.
  • Disadvantages:
  • Lack of flexibility
  • Limited capability of representing semantics
5
Stratification-based Modeling
(Figure: timeline from 0 to 90 showing overlapping strata.)
We partition the contextual information into
single events. Each event is associated with a
video segment called a stratum. Strata can
overlap or encompass each other.
6
Temporal Coherent Models
  • Each event is associated with a set of video
    segments where it happens.
  • More flexible in structuring video semantics.
7
Stratum
The concept of stratification can be used to
assign descriptions to video footage.
- Each stratum refers to a sequence of video frames.
- The strata may overlap or totally encompass each other.
(Figure: overlapping strata over video frames, e.g.,
"Car wreck rescue mission"; "Medics"; "Victim" with
"in ambulance", "in stretcher", "pulled free";
"Siren"; "Ambulance".)
Advantage: allows easy retrieval by keyword.
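The stratum idea above can be sketched in a few lines of Python: each stratum is a labeled frame interval, and keyword retrieval returns every stratum whose description mentions the keyword. The labels echo the figure, but the frame ranges are illustrative assumptions, not values from the slide.

```python
# A stratum is a labeled interval of video frames; strata may overlap or
# encompass one another. Keyword retrieval returns every stratum whose
# description mentions the keyword. Frame ranges below are made up.

strata = [
    ("car wreck rescue mission", 0, 900),
    ("medics", 80, 600),
    ("victim in stretcher", 100, 400),
    ("victim pulled free", 50, 150),
    ("siren", 0, 200),
    ("ambulance", 500, 900),
]

def retrieve(keyword):
    """Return the (label, start_frame, end_frame) strata matching keyword."""
    return [s for s in strata if keyword.lower() in s[0].lower()]

print(retrieve("victim"))   # both overlapping "victim" strata
```

Because strata are independent intervals, a single keyword can return several overlapping segments, which is exactly what makes this model more flexible than disjoint segmentation.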
8
Video Algebra
  • Goal: to provide a high-level abstraction that
  • models complex information associated with
    digital video data, and
  • supports content-based access
  • Strategy:
  • The algebraic video data model consists of
    hierarchical compositions of video expressions
    with high-level semantic descriptions.
  • The video expressions are constructed using video
    algebra operations.

9
Presentation
  • In the algebraic video data model, the
    fundamental entity is a presentation.
  • A presentation is a multiwindow spatial,
    temporal, and content combination of video
    segments.
  • Presentations are described by video
    expressions.
  • The most primitive video expression creates a
    single-window presentation from a raw video
    segment.
  • Compound video expressions are constructed from
    simpler ones using video algebra operations.

(Figure: a compound video expression composed, via
algebraic video nodes, of simpler video expressions,
down to primitive video expressions over raw video.)
Note: An algebraic video node provides a means of
abstraction by which video expressions can be
named, stored, and manipulated as units.
10
Video Algebra Operations
  • The video algebra operations fall into four
    categories:
  • 1. Creation: defines the construction of
    video expressions from raw video.
  • 2. Composition: defines temporal relationships
    between component video expressions.
  • 3. Output: defines spatial layout and audio
    output for component video expressions.
  • 4. Description: associates content attributes
    with a video expression.

11
Composition
The composition operations can be combined
to produce complex scheduling definitions and
constraints.
Example (create builds a video presentation from a
raw video segment):
C1 = create Cnn.HeadlineNews.rv 10 30
C2 = create Cnn.HeadlineNews.rv 20 40
C3 = create Cnn.HeadlineNews.rv 32 65
D1 = (description C1 "Anchor speaking")
D2 = (description C2 "Professor Smith")
D3 = (description C3 "Economic reform")
D3 follows D2, which follows D1, and common
footage is not repeated. (This creates a
non-redundant video stream from three overlapping
segments.)
(Figure: timeline of overlapping segments C1 "Anchor
speaking", C2 "Professor Smith", C3 "Economic reform".)
12
Composition Operators (1)
  • E1 ; E2 defines the presentation where E2
    follows E1.
  • E1 ∪ E2 defines the presentation where E2
    follows E1 and common footage is not repeated.
  • E1 ∩ E2 defines the presentation where only
    common footage of E1 and E2 is played.
  • E1 - E2 defines the presentation where only
    footage of E1 that is not in E2 is played.
  • E1 ∥ E2: E1 and E2 are played concurrently and
    terminate simultaneously.
  • (test) ? E1 | E2 | ... | En: Ei is played if test
    evaluates to i.
  • loop E1 time defines a repetition of E1 for a
    duration of time.
  • stretch E1 factor sets the duration of the
    presentation equal to factor times the duration of
    E1 by changing the playback speed of the video
    segment.
  • limit E1 time sets the duration of the
    presentation equal to the minimum of time and the
    duration of E1, but the playback speed is not
    changed.
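The temporal semantics of a few of these operators can be sketched by modeling segments of the same raw video as (start, end) intervals. The interval arithmetic below is our reading of the operator descriptions on the slide, not the system's actual implementation.

```python
# Segments of one raw video as (start, end) intervals in seconds.

def union(a, b):
    """E1 ∪ E2: play both, common footage not repeated (assumes overlap)."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def intersection(a, b):
    """E1 ∩ E2: only common footage; None if the segments do not overlap."""
    s, e = max(a[0], b[0]), min(a[1], b[1])
    return (s, e) if s < e else None

def difference(a, b):
    """E1 - E2: footage of E1 not in E2, as a list of intervals."""
    parts = []
    if a[0] < b[0]:
        parts.append((a[0], min(a[1], b[0])))
    if a[1] > b[1]:
        parts.append((max(a[0], b[1]), a[1]))
    return parts

C1, C2 = (10, 30), (20, 40)   # as in the Cnn.HeadlineNews.rv example
print(union(C1, C2))          # (10, 40)
print(intersection(C1, C2))   # (20, 30)
print(difference(C1, C2))     # [(10, 20)]
```

Applied to the C1/C2/C3 example of the previous slide, the union operator is what yields a non-redundant stream from overlapping segments.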

13
Composition Operators (2)
  • transition E1 E2 type time defines a transition
    effect of the given type between E1 and E2; time
    defines the duration of the transition effect.
  • The transition type is one of a set of
    transition effects, such as dissolve, fade, and
    wipe.
  • contains E1 query defines the presentation that
    contains the component expressions of E1 that
    match query.
  • A query is a Boolean combination of attributes.
  • Example: text = "smith" AND text = "question"

14
Descriptions
  • description E1 content specifies that E1 is
    described by content.
  • A content is a Boolean combination of attributes,
    each consisting of a field name and a value.
  • Some field names have predefined semantics (e.g.,
    title), while other fields are user-definable.
  • Values can assume a variety of types, including
    strings and video node names.
  • Field names or values do not have to be unique
    within a description.
  • hide-content E1 defines a presentation that
    hides the content of E1 (i.e., E1 does not
    contain any description).
  • This operation provides a method for creating
    abstraction barriers for content-based access.

Example: title = "CNN Headline News"
15
Output Characteristics
  • Video expressions include output characteristics
    that specify the screen layout and audio output
    for playing back child streams.
  • Since expressions can be nested, the spatial
    layout of any particular video expression is
    defined relative to the parent rectangle.
  • window E1 (X1, Y1) - (X2, Y2) priority
  • specifies that E1 will be displayed with
    priority in the window defined by the top-left
    corner (X1, Y1) and the bottom-right corner
    (X2, Y2), such that Xi ∈ [0, 1] and Yi ∈ [0, 1].
  • Window priorities are used to resolve overlap
    conflicts of screen display.
  • audio E1 channel force priority
  • specifies that the audio of E1 will be output to
    channel with priority; if force is true, then the
    audio operation overrides any channel
    specifications of the component video expressions.
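Because window coordinates are relative to the parent rectangle, the absolute screen rectangle of a nested window is obtained by composing the transforms. The helper below is our own sketch of that computation, not part of the algebraic video system.

```python
# Map a child window, given in [0,1]^2 coordinates relative to its parent
# rectangle, into the parent's own coordinate frame.

def absolute_rect(parent, child):
    """Both arguments are ((x1, y1), (x2, y2)) rectangles."""
    (px1, py1), (px2, py2) = parent
    w, h = px2 - px1, py2 - py1
    (cx1, cy1), (cx2, cy2) = child
    return ((px1 + cx1 * w, py1 + cy1 * h),
            (px1 + cx2 * w, py1 + cy2 * h))

# A window in the bottom-right quadrant of a bottom-right quadrant:
quadrant = ((0.5, 0.5), (1.0, 1.0))
print(absolute_rect(quadrant, quadrant))  # ((0.75, 0.75), (1.0, 1.0))
```

Applying the function repeatedly walks down a nesting of window expressions, which is how a recursive picture-in-picture layout gets its ever-smaller rectangles.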

16
Output Characteristics An example
  • C1 = create MavericksvsBulls.rv 300 500
  • P1 = window C1 (0, 0) - (0.5, 0.5) 10
  • P2 = window C1 (0, 0.5) - (0.5, 1) 20
  • P3 = window C1 (0.5, 0.5) - (1, 1) 30
  • P4 = window C1 (0.5, 0) - (1, 0.5) 40
  • P5 = (P1 ∥ P2 ∥ P4)
  • P6 = (P1 ∥ P2 ∥ P3 ∥ P4)
  • (P5 ∥ (window
            (P5 ∥ (window P6 (0.5, 0.5) - (1, 1) 60))
            (0.5, 0.5) - (1, 1) 50))

(Figure: coordinates run from (0, 0) at the top-left to
(1, 1) at the bottom-right; a larger priority value means
higher priority. P1, P2, and P4 fill three quadrants, and
the bottom-right quadrant recursively shows the four-window
layout again.)
17
Scope of a video node description
  • The scope of a given algebraic video node
    description is the subgraph that originates from
    the node.
  • The components of a video expression inherit
    descriptions by context.
  • All the content attributes associated with a
    parent video node are also associated with all of
    its descendant nodes.

18
Content-Based Access
  • Search query: search a collection of video nodes
    for video expressions that match query.
  • Strategy: matching a query to the attributes of
    an expression must take into account all of
    the attributes of that expression, including the
    attributes of its encompassing expressions.
  • Example: search text = "smith" AND text = "question"

(Figure: a node tree. The node "Smith on economic reform"
is the result of the query. A descendant node, "Question
from audience", also satisfies the query but is not
returned because it is a descendant of a node already in
the result set.)
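The two rules above — attributes are inherited by context, and descendants of a matching node are not returned — can be sketched as a recursive tree search. The node names and attribute sets below are made up to reproduce the slide's example behavior.

```python
# Search over a tree of algebraic video nodes: a node matches when the query
# attributes are covered by its own description plus everything inherited
# from its ancestors; once a node matches, its subtree is not descended.

class Node:
    def __init__(self, name, attrs, children=()):
        self.name, self.attrs, self.children = name, set(attrs), list(children)

def search(node, query, inherited=frozenset()):
    """Return the top-most node names whose attributes contain query."""
    attrs = inherited | node.attrs
    if query <= attrs:
        return [node.name]          # children are subsumed by this result
    results = []
    for child in node.children:
        results += search(child, query, attrs)
    return results

question = Node("Question from audience", {"question"})
smith = Node("Smith on economic reform", {"smith", "question"}, [question])
root = Node("CNN Headline News", set(), [smith])
print(search(root, {"smith", "question"}))
```

Here "Smith on economic reform" is returned, while "Question from audience" also satisfies the query (it inherits "smith") but is suppressed as a descendant of a node already in the result set.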
19
Browsing and Navigation
  • Playback presentation
  • Plays back the video expression. It enables the
    user to view the presentation defined by the
    expression.
  • Display video-expression
  • Displays the video expression. It allows the user
    to inspect the video expression.
  • Get-parent video-expression
  • Returns the set of nodes that directly point to
    video-expression.
  • Get-children video-expression
  • Returns the set of nodes that video-expression
    directly points to.

20
Algebraic Video System Prototype
  • The Algebraic Video System is a prototype
    implementation of the algebraic video data model
    and its associated operations.
  • The implementation is built on top of three
    existing subsystems:
  • The VuSystem is used for managing raw video data
    and for its support of Tcl (Tool command
    language) programming. It provides an environment
    for recording, processing, and playing video.
  • The Semantic File System is used as a storage
    subsystem with content-based access to data for
    indexing and retrieving files that represent
    algebraic video nodes.
  • The WWW server provides a graphical interface to
    the system that includes facilities for querying,
    navigating, video editing and composing, and
    invoking the video player.

21
Multimedia Objects in Relational Databases
  • The most straightforward and fundamental support
    for multimedia data types in an RDBMS is the
    ability to declare variable-length fields in
    tables.
  • Names of variable-length bit or character string
    types used in commercial products include:
  • VARCHAR
  • BLOB
  • TEXT
  • IMAGE
  • CHARACTER VARYING /SQL92/
  • VARGRAPHIC
  • LONG RAW
  • BYTE VARYING
  • BIT VARYING /SQL92/
  • Some systems cap variable-length fields at as
    little as 256 bytes; others allow field values as
    large as 2 GBytes.
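The same idea can be tried in miniature with Python's built-in sqlite3, whose BLOB type stores arbitrary bytes in a variable-length column. This is a stand-in for the commercial types listed above; the table and column names are made up.

```python
import sqlite3

# An in-memory database with one BLOB column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media (id INTEGER PRIMARY KEY, frame BLOB)")

payload = bytes(range(256))              # pretend this is binary image data
conn.execute("INSERT INTO media (frame) VALUES (?)", (payload,))

# Round-trip: the bytes come back unchanged.
(stored,) = conn.execute("SELECT frame FROM media WHERE id = 1").fetchone()
assert stored == payload
print(len(stored))  # 256
```

Real products differ mainly in the maximum size they allow and in whether the value is stored inline, in linked pages, or in separate segments, as the next slides describe.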

22
BLOBs in InterBase
  • InterBase stores BLOBs in collections of
    segments. A segment in InterBase can be thought
    of as a fixed-length page or I/O block.
  • InterBase provides special API calls to retrieve
    and modify the segments.
  • open-BLOB: opens the BLOB for reading
  • get-segment: reads the next segment
  • create-BLOB: opens the BLOB for writes
    or updates
  • put-segment: saves the changes to the
    BLOB
  • Users can specify the length of each segment.

23
IMAGE and TEXT in Sybase's SQL Server
  • TEXT and IMAGE data types are supported in
    Sybase's Transact-SQL, which is an enhanced
    version of the SQL standard.
  • TEXT and IMAGE data types can be as large as 2
    GBytes.
  • Internally, TEXT and IMAGE column values contain
    pointers to the first page of a linked list of
    pages.
  • Some of the functions supported:
  • PATINDEX(pattern, column): returns the
    starting position of the first occurrence of the
    pattern in the column.
  • TEXTPTR(column): returns a pointer to the
    variable-length field.
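PATINDEX's behavior for the plain-substring case can be sketched with `str.find`: SQL Server returns a 1-based position of the first match, and 0 when the pattern is absent. (The real PATINDEX also supports wildcard patterns, which this sketch ignores.)

```python
# 1-based first-occurrence position, 0 if not found (simple substring case).

def patindex(pattern, column):
    return column.find(pattern) + 1   # str.find is 0-based, -1 if missing

print(patindex("smith", "professor smith speaking"))  # 11
print(patindex("jones", "professor smith speaking"))  # 0
```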

24
OODBs and Multimedia Applications
  • Object-oriented databases are more suitable for
    multimedia application development.
  • Better complex object support By their nature,
    many multimedia database applications, such as
    compound documents, need complex object support.
  • Extensibility and ability to add new types
    (classes) Users can add new types and extend
    the existing class hierarchies to address the
    specific needs of the multimedia application.
  • Better concurrency control and transaction model
    support Transaction concepts such as long
    transactions and nested transactions are
    important for multimedia applications.

25
Multimedia Data Types in UniSQL/X
  • UniSQL/X supports a class hierarchy rooted at
    generalized large object (GLO) class.
  • GLO class serves as the root of multimedia data
    type classes and provides a number of built-in
    attributes and methods.
  • For the content of GLO objects, the user can
    create either a Large Object (LO) or a File-Based
    Object (FBO).
  • LOs can only be accessed through UniSQL/X.
  • FBOs are stored in the host file system. The
    database stores a reference or a path for each
    FBO.
  • In addition to the base class GLO, UniSQL/X
    supports subclasses of GLO for specific
    multimedia data types
  • Audio class
  • Image class

26
Programming Multimedia Applications
  • An application is considered to be a multimedia
    object.
  • An application object uses or consists of many
    Basic Multimedia Objects (BMOs) and Compound
    Multimedia Objects (CMOs).
  • The specification of an object includes
  • binding information to a file
  • methods
  • event-driven processing (e.g., displaying the
    last image if the video ends before the audio).
  • The use of methods and events allows the
    application to create a script which expresses the
    interactions of different objects precisely and
    relatively simply.

27
A Multimedia-Program Example
28
Multimedia Information Retrieval (and Indexing)
  • Multimedia information retrieval
  • deals with the storage, retrieval, transport, and
    presentation of different types of multimedia
    data (e.g., images, video clips, audio clips,
    text)
  • there is a real need for managing multimedia
    data, including their retrieval
  • Multimedia information retrieval in general
  • retrieval process
  • queries
  • indexing the media
  • matching media and query representations

29
MMDBMS and Retrieval What is that? A first
attempt at a clearer meaning
  • Example
  • an insurance company's accident claim report as
    a multimedia object; it includes
  • images (or video) of the accident
  • insurance forms with structured data
  • audio recordings of the parties involved in the
    accident
  • a text report by the insurance company's
    representative
  • Multimedia databases store structured data and
    unstructured data
  • Multimedia retrieval systems must retrieve
    structured and unstructured data

30
MMDBMS and Retrieval (cont.)
  • Retrieval of structured data from databases
  • typically handled by a Database Management System
    (DBMS)
  • DBMS provides a query language (e.g., Structured
    Query Language, SQL for the relational data
    model)
  • deterministic matching of query and data
  • Retrieval of unstructured data from databases
  • typically handled by Information Retrieval (IR)
    system
  • similarity matching of uncertain query and
    document representations
  • result list of documents according to relevance

31
MMDBMS and Retrieval (cont.)
  • Multimedia database management systems should
    combine the Database Management System (DBMS) and
    information retrieval (IR) technology
  • data modeling capabilities of DBMSs with the
    advanced and similarity based query capabilities
    of IR systems
  • Challenge: finding a data model that ensures
  • effective query formulation and document
    representation
  • efficient storage
  • efficient matching
  • effective delivery

32
MMDBMS and Retrieval (cont.)
  • Query formulation
  • must accommodate information needs of users of
    multimedia systems
  • Document representations and their storage
  • an appropriate modeling of the structure and
    content of the wide range of data in many
    different formats (= indexing) - XML? - MPEG-7?
  • cf. dealing with thousands of images, documents,
    audio and video segments, and free text
  • at the same time, modeling of physical properties
    for
  • compression/decompression, synchronization,
    delivery - MPEG-21

33
MMDBMS and Retrieval (cont.)
  • Matching of query and document representations
  • taking into account the variety of attributes and
    their relationships of query and document
    representations
  • combination of exact matching of structured data
    with uncertain matching of unstructured data
  • Delivery of data
  • browsing, retrieval
  • temporal constraints of video and audio
    presentation
  • merging of data from different sources (e.g., in
    medical networks)

34
MMDBMS Queries
  • 1) As in many retrieval systems, the user has the
    opportunity to browse and navigate through
    hyperlinks without querying; this requires
  • topic maps
  • summary descriptions of the multimedia objects
  • 2) Queries specifying the conditions of the
    objects of interest
  • idea of multimedia query language
  • should provide predicates for expressing
    conditions on the attributes, structure and
    content (semantics) of multimedia objects

35
MMDBMS Queries (cont.)
  • attribute predicates
  • concern the attributes of multimedia objects with
    an exact value (cf. traditional DB attributes)
  • e.g., date of a picture, name of a show
  • structural predicates
  • temporal predicates to specify temporal
    synchronization
  • for continuous media such as audio and video
  • for expressing temporal relationships between the
    frame representations of a single audio or video
  • e.g., "Find all the objects in which a jingle is
    playing for the duration of an image display"

36
MMDBMS Queries (cont.)
  • spatial predicates to specify spatial layout
    properties for the presentation of multimedia
    objects
  • examples of predicates: contains, is contained
    in, intersects, is adjacent to
  • e.g., "Find all the objects containing an image
    overlapping the associated text"
  • temporal and spatial predicates can be combined
  • e.g., "Find all the objects in which the logo of
    the car company is displayed, and when it
    disappears, a graphic (showing the increase in
    the company's sales) is shown in the same position
    where the logo was"
  • temporal and spatial predicates can
  • refer to whole objects
  • refer to subcomponents of objects, given a data
    model that supports complex object representation
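The spatial predicates named above can be sketched over axis-aligned rectangles given as ((x1, y1), (x2, y2)). The definitions below are the usual geometric ones, chosen by us to match the predicate names; an actual MMDBMS would evaluate them over stored layout metadata.

```python
def contains(a, b):
    """Rectangle a fully contains rectangle b."""
    return (a[0][0] <= b[0][0] and a[0][1] <= b[0][1]
            and a[1][0] >= b[1][0] and a[1][1] >= b[1][1])

def intersects(a, b):
    """Rectangles a and b overlap with positive area."""
    return (a[0][0] < b[1][0] and b[0][0] < a[1][0]
            and a[0][1] < b[1][1] and b[0][1] < a[1][1])

def is_adjacent(a, b):
    """Rectangles share an edge coordinate but do not overlap."""
    share_x = a[1][0] == b[0][0] or b[1][0] == a[0][0]
    share_y = a[1][1] == b[0][1] or b[1][1] == a[0][1]
    return (share_x or share_y) and not intersects(a, b)

image = ((0.0, 0.0), (0.5, 0.5))
text = ((0.4, 0.4), (0.9, 0.9))
print(intersects(image, text))   # True: the image overlaps its text
```

A query like "image overlapping the associated text" then reduces to evaluating `intersects` over the layout rectangles of the two components.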

37
MMDBMS Queries (cont.)
  • semantic predicates
  • concern the semantic and unstructured content of
    the data involved
  • represented by the features that have been
    extracted and stored for each multimedia object
  • e.g., "Find all the objects containing the word
    OFFICE" or "Find all red houses"
  • uncertainty, proximity, and weights can be
    expressed in the query
  • multimedia query language
  • structured language
  • users do not formulate queries in this language,
    but enter query conditions by means of interfaces
  • natural language queries?
  • interface translates query to correct query syntax

38
MMDBMS Queries
  • 3) Query by example
  • e.g., video, audio
  • the query is composed by picking an example and
    choosing the features the object must comply with
  • e.g., in a graphical user interface (GUI), the
    user chooses the image of a house and domain
    features for the query "Retrieve all houses of
    similar shape and different color"
  • e.g., music: a recorded melody, or a note
    sequence entered via a Musical Instrument Digital
    Interface (MIDI)
  • 4) Question-answering?
  • e.g., questioning video images How many
    helicopters were involved in the attack on Kabul
    of December 20, 2001?

39
MMDBMS Example Oracle's interMedia
  • Enables Oracle 9i to manage rich content,
    including images, audio, and video information in
    an integrated fashion with other traditional
    business data.
  • interMedia can parse, index, and store rich
    content; it is used to develop content-rich Web
    applications, deploy rich content on the Web, and
    tune Oracle9i content repositories.
  • interMedia enables data management services to
    support the rich data types used in electronic
    commerce catalogs, corporate repositories, Web
    publishing, corporate communications and
    training, media asset management, and other
    applications for internet, intranet, extranet,
    and traditional application in an integrated
    fashion
  • http://technet.oracle.com

40
MMDBMS Indexing
  • Remember: indexing and retrieval systems.
  • Indexing: assigning or extracting features that
    will be used for unstructured and structured
    queries (unfortunately, the term often refers
    only to low-level features)
  • Often also segmentation: detection of retrieval
    units
  • Two main approaches
  • manual
  • segmentation
  • indexing: naming of objects and their
    relationships with key terms (natural language or
    controlled language)
  • automatic analysis
  • identify the mathematical characteristics of the
    contents
  • different techniques depending on the type of
    multimedia source (image, text, video, or audio)
  • possible manual correction

41
Indexing multimedia and features
  • a multimedia object is typically represented as
    a set of features (e.g., as a vector of features)
  • features can be weighted (expressing uncertainty
    or the significance of a value)
  • features can be stored and searched in an index
    tree
  • features have to be tied to the semantic
    content

42
Indexing images
  • Automatic indexing of images
  • segmentation into homogeneous segments
  • homogeneity predicate defines the conditions for
    automatically grouping the cells
  • e.g., in a color image, cells that are adjacent
    to one another and whose pixel values are close
    are grouped into a segment
  • indexing: recognition of objects and simple
    patterns
  • recognition of low-level features: color
    histograms, textures, shapes (e.g., person,
    house), position
  • appearance features are often not important in
    retrieval
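The color-histogram feature mentioned above can be sketched by quantizing each RGB channel to a few bins and counting pixels per (r, g, b) bin. This is a pure-Python illustration with pixels as 0-255 triples; a real indexer would work on image arrays with numpy or an image library.

```python
from collections import Counter

def rgb_histogram(pixels, bins=4):
    """Normalized bins-x-bins-x-bins RGB histogram of (r, g, b) pixels."""
    step = 256 // bins                       # 4 bins -> 64 levels per bin
    hist = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    n = len(pixels)
    return {bin_: count / n for bin_, count in hist.items()}

pixels = [(255, 0, 0)] * 3 + [(0, 0, 255)]   # 3 red pixels, 1 blue
print(rgb_histogram(pixels))  # {(3, 0, 0): 0.75, (0, 0, 3): 0.25}
```

Two images can then be compared by a distance between their histogram vectors, which is the basis of color-based similarity retrieval in systems like QBIC.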

43
Indexing audio
  • Automatic indexing of audio
  • segmentation into sequences (= basic units for
    retrieval), often done manually
  • indexing
  • speech recognition and indexing of the resulting
    transcripts (cf. indexing in written-text
    retrieval)
  • acoustic analysis (e.g., sounds, music, songs:
    melody transcription, note encoding, interval and
    rhythm detection, and chord information),
    translated into strings
  • e.g., key melody extraction (Tseng, 1999)

44
Scene Segmentation based on Audio Information
  • Short Time Energy (STE) is a reliable indicator
    for silence detection.
  • Zero-Crossing Rate (ZCR) is a useful feature to
    characterize different non-silence audio signals
    (especially to discern unvoiced speech).
  • Pitch (P value) is the fundamental frequency of
    an audio waveform.
  • Spectrum Flux (SF) is defined as the average
    variation of the spectrum between two adjacent
    frames in a short-time analysis window; it is
    used to discriminate speech from environmental
    sound.
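The first two features, STE and ZCR, can be computed directly from the samples of an analysis frame. The sketch below uses a synthetic 440 Hz tone at an assumed 8 kHz sample rate and no windowing function, just to show the arithmetic.

```python
import math

def short_time_energy(frame):
    """Mean squared amplitude of the frame (high for loud, 0 for silence)."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)

silence = [0.0] * 400
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(400)]

print(short_time_energy(silence))          # 0.0 -> silence detected
print(zero_crossing_rate(tone))            # grows with signal frequency
```

Thresholding STE flags silent frames, while ZCR separates noisy, unvoiced sounds (many crossings) from voiced speech and tones (fewer crossings), which is how these features contribute to scene segmentation.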

45
Indexing video
  • Automatic indexing of video
  • segment = basic unit for retrieval
  • objects and activities identified in each video
    segment can be used to index the segment
  • segmentation
  • detection of video shot breaks and camera motions
  • boundaries in audio material (e.g., a different
    music tune, changes in speaker)
  • textual topic segmentation of transcripts of
    audio and of closed captions (see below)
  • heuristic rules based on knowledge of
  • the type-specific schematic structure of video
    (e.g., documentary, sports)
  • certain cues: appearance of the anchor person in
    news = new topic

46
An example of indexing
  • Learning of textual descriptions of images from
    surrounding text (Mori et al., 2000)
  • training
  • images are segmented into image parts of equal size
  • feature extraction for each image part (by
    quantization)
  • 4 x 4 x 4 RGB color histogram
  • 8 directions x 4 resolutions intensity histogram
  • words that accompany the image are inherited by
    each image part
  • words are selected from the text of the document
    that contains the image by selecting nouns and
    adjectives that occur with a frequency above a
    threshold
  • cluster similar image parts based on their
    extracted features
  • single-pass partitioning algorithm with minimum
    similarity threshold value

47
An example of indexing
  • for each word wi and each cluster cj, P(wi|cj)
    is estimated as
  •   P(wi|cj) = mji / Mj
  • where mji = total frequency of word wi in
    cluster cj
  • Mj = total frequency of all words in cj
  • testing
  • unknown image is divided into parts and image
    features are extracted
  • for each part, the nearest cluster is found as
    the cluster whose centroid is most similar to the
    part
  • the average likelihood of all the words of the
    nearest clusters is computed
  • the k words with the largest average likelihood
    are chosen to index the new image (in the
    example, k = 3)
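The last two steps can be sketched directly from the estimate P(wi|cj) = mji / Mj: compute word likelihoods per cluster from word counts, then index a new image part with the k most likely words of its nearest cluster. The clusters and counts below are made-up toy data; finding the nearest cluster by feature similarity is omitted.

```python
from collections import Counter

# Word frequencies per cluster (toy data standing in for the training step).
cluster_words = {
    "sky":   Counter({"blue": 8, "cloud": 4, "sun": 2}),
    "house": Counter({"house": 6, "red": 3, "door": 1}),
}

def word_likelihoods(cluster):
    """P(w|c) = m_ji / M_j for every word w in the cluster."""
    counts = cluster_words[cluster]
    total = sum(counts.values())                       # M_j
    return {w: m / total for w, m in counts.items()}   # m_ji / M_j

def index_words(cluster, k=3):
    """The k most likely words of the cluster, used to index a new image."""
    probs = word_likelihoods(cluster)
    return sorted(probs, key=probs.get, reverse=True)[:k]

print(index_words("sky"))     # ['blue', 'cloud', 'sun']
```

In the full method the likelihoods of all words are averaged over the nearest clusters of all parts of the unknown image before the top-k words are picked.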

48
(No Transcript)
49
source: Mori et al.
50
source: Mori et al.
51
Demo Systems
  • Hermitage Museum Web Site (QBIC)
  • http://hermitagemuseum.org/
  • http://hermitagemuseum.org/fcgi-bin/db2www/qbicColor.mac/qbic?selLang=English
  • Media Portal WebSEEk
  • http://www.ctr.columbia.edu/webseek/
  • Video Search Engine VideoQ
  • http://www.ctr.columbia.edu/videoq
  • Geographical Application
  • http://nayana.ece.ucsb.edu/M7TextureDemo/Demo/client/M7TextureDemo.html
  • http://www-db.stanford.edu/IMAGE/

52
QBIC features
  • Color: QBIC computes the average Munsell
    (Miyahara et al., 1988) coordinates of each
    object and image, plus a k-element color
    histogram (k is typically 64 or 256) that gives
    the percentage of the pixels in each image in
    each of the k colors.
  • Texture: QBIC's texture features are based on
    modified versions of the coarseness, contrast,
    and directionality features proposed in (Tamura
    et al., 1978). Coarseness measures the
    scale of the texture (pebbles vs. boulders),
    contrast describes the vividness of the pattern,
    and directionality describes whether or not the
    image has a favored direction or is isotropic
    (grass versus a smooth object).
  • Shape: QBIC has used several different sets of
    shape features. One is based on a combination of
    area, circularity, eccentricity, major axis
    orientation and a set of algebraic moment
    invariants. A second is the turning angles or
    tangent vectors around the perimeter of an
    object, computed from smooth splines fit to the
    perimeter. The result is a list of 64 values of
    turning angle.

53
WebSeek
54
WebSeek (cont.)
55
VideoQ
56
VideoQ (cont.)