Title: Semantic Content-based Modeling
1. Semantic Content-based Modeling
- Video semantics are captured and organized to support video retrieval.
- Difficult to automate; relies on manual annotation.
- Capable of supporting natural-language-like queries.
2. Video Content Extraction
- Other forms of information extraction can be employed:
  - closed-captioned text
  - speech recognition
  - descriptive information from the screenplay
  - key frames that characterize a shot
- This content information can be associated with the video story units.
3. Existing Semantic-Level Models
- Segmentation-based models
- Stratification-based models
- Temporally coherent models
4. Segmentation-based Modeling
- A video stream is segmented into temporally continuous segments.
- Each segment is associated with a description, which could be natural text, keywords, or other kinds of annotation.
- Disadvantages:
  - lack of flexibility
  - limited capability of representing semantics
5. Stratification-based Modeling
[Figure: a frame timeline (ticks at 0, 5, 15, 20, 30, 35, 60, 70, 85, 90) with overlapping strata.]
- The contextual information is partitioned into single events.
- Each event is associated with a video segment called a stratum.
- Strata can overlap or encompass each other.
6. Temporally Coherent Modeling
- Each event is associated with the set of video segments in which it happens.
- More flexible in structuring video semantics.
7. Stratum
- The concept of stratification can be used to assign descriptions to video footage.
- Each stratum refers to a sequence of video frames.
- The strata may overlap or totally encompass each other.
[Figure: strata over the video frames of a car wreck rescue mission: "medics" (in ambulance, in stretcher), "victim" (pulled free), "siren", "ambulance".]
- Advantage: allows easy retrieval by keyword.
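The stratification model above can be sketched in a few lines. The strata and frame numbers below are hypothetical, loosely following the rescue-mission example:

```python
# Each stratum is a (start_frame, end_frame, description) triple
# over a shared frame axis. Data below is made up for illustration.
strata = [
    (0, 90, "car wreck rescue mission"),
    (5, 35, "medics in ambulance"),
    (20, 60, "victim pulled free"),
    (0, 15, "siren"),
    (30, 85, "ambulance"),
]

def find(keyword):
    """Return the frame ranges of all strata whose description
    contains the keyword (easy retrieval by keyword)."""
    return [(s, e) for s, e, d in strata if keyword in d]

def overlapping(a, b):
    """Strata may overlap or encompass each other: two strata
    overlap when their frame ranges intersect."""
    (s1, e1, _), (s2, e2, _) = a, b
    return s1 <= e2 and s2 <= e1

print(find("ambulance"))                  # [(5, 35), (30, 85)]
print(overlapping(strata[1], strata[2]))  # True: (5, 35) and (20, 60) intersect
```

Because strata are independent intervals rather than a fixed partition, a keyword query simply scans the descriptions and returns frame ranges, however the strata nest or overlap.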
8. Video Algebra
- Goal: to provide a high-level abstraction that
  - models complex information associated with digital video data, and
  - supports content-based access.
- Strategy:
  - The algebraic video data model consists of hierarchical compositions of video expressions with high-level semantic descriptions.
  - The video expressions are constructed using video algebra operations.
9. Presentation
- In the algebraic video data model, the fundamental entity is a presentation.
- A presentation is a multi-window spatial, temporal, and content combination of video segments.
- Presentations are described by video expressions.
- The most primitive video expression creates a single-window presentation from a raw video segment.
- Compound video expressions are constructed from simpler ones using video algebra operations.
[Figure: a tree of algebraic video nodes: a compound video expression at the root, built from simpler video expressions, with primitive video expressions over raw video at the leaves.]
- Note: An algebraic video node provides a means of abstraction by which video expressions can be named, stored, and manipulated as units.
10. Video Algebra Operations
- The video algebra operations fall into four categories:
  1. Creation: defines the construction of video expressions from raw video.
  2. Composition: defines temporal relationships between component video expressions.
  3. Output: defines spatial layout and audio output for component video expressions.
  4. Description: associates content attributes with a video expression.
11. Composition
- The composition operations can be combined to produce complex scheduling definitions and constraints.
- Example (create builds a video presentation from a raw video segment):
  - C1 = create Cnn.HeadlineNews.rv 10 30
  - C2 = create Cnn.HeadlineNews.rv 20 40
  - C3 = create Cnn.HeadlineNews.rv 32 65
  - D1 = (description C1 "Anchor speaking")
  - D2 = (description C2 "Professor Smith")
  - D3 = (description C3 "Economic reform")
- Composing them so that D3 follows D2, which follows D1, with common footage not repeated, creates a non-redundant video stream from the three overlapping segments.
[Figure: the overlapping segments C1 ("Anchor speaking"), C2 ("Professor Smith"), and C3 ("Economic reform") on a common timeline.]
12Composition Operators (1)
- E1 ? E2 defines the presentation where E2
follows E1 - E1 È E2 defines the presentation where E2
follows E1 and common footage is not repeated. - E1 Ç E2 defines the presentation where only
common footage of E1 and E2 is played. - E1 - E2 defines the presentation where only
footage of E1 that is not in E2 is played. - E1 E2 E1 and E2 are played concurrently and
terminate simultaneously. - (test) ? E1E2...En Ei is played if test
evaluates to i. - loop E1 time defines a repetition of E1 for a
duration of time - stretch E1 factor sets the duration of the
presentation equal to factor times duration of E1
by changing the playback speed of the video
segment. - limit E1 time sets the duration of the
presentation equal to the minimum of time and the
duration of E1, but the playback speed is not
changed.
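As a sketch (not the original implementation), the set-like operators above can be modeled over ordered lists of (source, frame) pairs; the frame ranges reuse the CNN example from the previous slide:

```python
# Model a video expression as an ordered list of (source, frame) pairs.
def create(source, start, end):
    return [(source, f) for f in range(start, end)]

def union(e1, e2):
    """E1 ∪ E2: E2 follows E1, common footage not repeated."""
    seen = set(e1)
    return e1 + [f for f in e2 if f not in seen]

def intersect(e1, e2):
    """E1 ∩ E2: only footage common to both is played."""
    common = set(e2)
    return [f for f in e1 if f in common]

def difference(e1, e2):
    """E1 - E2: only footage of E1 that is not in E2."""
    excluded = set(e2)
    return [f for f in e1 if f not in excluded]

# The CNN example: three overlapping segments, unioned into one stream.
c1 = create("Cnn.HeadlineNews.rv", 10, 30)
c2 = create("Cnn.HeadlineNews.rv", 20, 40)
c3 = create("Cnn.HeadlineNews.rv", 32, 65)
stream = union(union(c1, c2), c3)
print(len(stream))  # 55: frames 10..64 each played exactly once
```

The union of the three segments plays frames 10 through 64 once each, which is exactly the "non-redundant video stream" behavior the composition example describes.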
13. Composition Operators (2)
- transition E1 E2 type time: defines a transition effect of the given type between E1 and E2; time defines the duration of the transition effect.
  - The transition type is one of a set of transition effects, such as dissolve, fade, and wipe.
- contains E1 query: defines the presentation that contains the component expressions of E1 that match query.
  - A query is a Boolean combination of attributes.
  - Example: text "smith" and text "question"
14. Descriptions
- description E1 content: specifies that E1 is described by content.
  - A content is a Boolean combination of attributes, each consisting of a field name and a value.
  - Some field names have predefined semantics (e.g., title), while other fields are user-definable.
  - Values can assume a variety of types, including strings and video node names.
  - Field names and values do not have to be unique within a description.
- hide-content E1: defines a presentation that hides the content of E1 (i.e., E1 does not contain any description).
  - This operation provides a method for creating abstraction barriers for content-based access.
- Example: title "CNN Headline News"
15. Output Characteristics
- Video expressions include output characteristics that specify the screen layout and audio output for playing back child streams.
- Since expressions can be nested, the spatial layout of any particular video expression is defined relative to the parent rectangle.
- window E1 (X1, Y1) - (X2, Y2) priority
  - Specifies that E1 will be displayed with the given priority in the window defined by the top-left corner (X1, Y1) and the bottom-right corner (X2, Y2), such that Xi ∈ [0, 1] and Yi ∈ [0, 1].
  - Window priorities are used to resolve overlap conflicts in the screen display.
- audio E1 channel force priority
  - Specifies that the audio of E1 will be output to channel with the given priority; if force is true, the audio operation overrides any channel specifications of the component video expressions.
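Because window rectangles are expressed relative to the parent, nested layouts compose by a simple coordinate map. A minimal sketch (the function name is made up):

```python
# Map a child rectangle, given relative to its parent, into the
# parent's coordinate frame. Coordinates run from the top-left
# (0, 0) to the bottom-right (1, 1).
def absolute(parent, child):
    px1, py1, px2, py2 = parent
    w, h = px2 - px1, py2 - py1
    cx1, cy1, cx2, cy2 = child
    return (px1 + cx1 * w, py1 + cy1 * h,
            px1 + cx2 * w, py1 + cy2 * h)

root = (0.0, 0.0, 1.0, 1.0)
# window E1 (0.5, 0.5) - (1, 1): the bottom-right quadrant of the screen.
quadrant = absolute(root, (0.5, 0.5, 1.0, 1.0))
# Nesting the same relative rectangle again selects the bottom-right
# quadrant *of that quadrant*.
nested = absolute(quadrant, (0.5, 0.5, 1.0, 1.0))
print(quadrant)  # (0.5, 0.5, 1.0, 1.0)
print(nested)    # (0.75, 0.75, 1.0, 1.0)
```

Applying the map once per nesting level resolves any depth of window nesting to absolute screen coordinates; priorities are then only needed where the resulting rectangles overlap.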
16. Output Characteristics: An Example
- C1 = create MavericksvsBulls.rv 300 500
- P1 = window C1 (0, 0) - (0.5, 0.5) 10
- P2 = window C1 (0, 0.5) - (0.5, 1) 20
- P3 = window C1 (0.5, 0.5) - (1, 1) 30
- P4 = window C1 (0.5, 0) - (1, 0.5) 40
- P5 = (P1 ∥ P2 ∥ P4)
- P6 = (P1 ∥ P2 ∥ P3 ∥ P4)
- (P5 ∥ (window (P5 ∥ (window P6 (0.5, 0.5) - (1, 1) 60)) (0.5, 0.5) - (1, 1) 50))
- Coordinates run from the top-left (0, 0) to the bottom-right (1, 1); a larger priority value means higher priority.
[Figure: the recursive layout, with P1, P2, and P4 in three quadrants and the nested composition in the bottom-right quadrant.]
17. Scope of a Video Node Description
- The scope of a given algebraic video node description is the subgraph that originates from the node.
- The components of a video expression inherit descriptions by context:
  - all the content attributes associated with a parent video node are also associated with all its descendant nodes.
18. Content-Based Access
- Search query: search a collection of video nodes for video expressions that match the query.
- Strategy: matching a query to the attributes of an expression must take into account all of the attributes of that expression, including the attributes of its encompassing expressions.
- Example: search text "smith" AND text "question"
[Figure: a tree of video nodes for "Smith on economic reform" over raw video, with nodes "Anchor", "Smith", "Question from audience", and "Question". One node is marked as the result of the query; another node also satisfies the query but is not returned, because it is a descendant of a node already in the result set.]
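A toy sketch of this search strategy, with hypothetical node names: attributes are inherited down the tree, and once a node matches, its descendants are not returned:

```python
# Each node carries its own attributes; descendants inherit the
# attributes of their ancestors. The tree below is made up.
tree = {
    "Smith on economic reform": {"attrs": {("text", "smith")},
                                 "children": ["Anchor", "Question from audience"]},
    "Anchor": {"attrs": {("text", "anchor")}, "children": []},
    "Question from audience": {"attrs": {("text", "question")},
                               "children": ["Question"]},
    "Question": {"attrs": {("text", "question")}, "children": []},
}

def search(node, query, inherited=frozenset()):
    """Return the topmost nodes whose effective (own + inherited)
    attributes satisfy `query`, a set of required attributes (an AND)."""
    effective = inherited | tree[node]["attrs"]
    if query <= effective:
        return [node]          # matched: do not descend further
    results = []
    for child in tree[node]["children"]:
        results += search(child, query, effective)
    return results

print(search("Smith on economic reform",
             {("text", "smith"), ("text", "question")}))
# ['Question from audience']: it inherits ("text", "smith") from its
# parent, and its own descendant "Question" is suppressed.
```

This reproduces the two behaviors on the slide: a node can match through inherited attributes of its encompassing expressions, and a matching descendant of a returned node is excluded from the result set.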
19. Browsing and Navigation
- Playback presentation
  - Plays back the video expression, enabling the user to view the presentation defined by the expression.
- Display video-expression
  - Displays the video expression, allowing the user to inspect it.
- Get-parent video-expression
  - Returns the set of nodes that directly point to video-expression.
- Get-children video-expression
  - Returns the set of nodes that video-expression directly points to.
20. Algebraic Video System Prototype
- The Algebraic Video System is a prototype implementation of the algebraic video data model and its associated operations.
- The implementation is built on top of three existing subsystems:
  - The VuSystem is used for managing raw video data and for its support of Tcl (Tool Command Language) programming. It provides an environment for recording, processing, and playing video.
  - The Semantic File System is used as a storage subsystem with content-based access to data, for indexing and retrieving the files that represent algebraic video nodes.
  - The WWW server provides a graphical interface to the system, including facilities for querying, navigating, video editing and composing, and invoking the video player.
21. Multimedia Objects in Relational Databases
- The most straightforward and fundamental support for multimedia data types in an RDBMS is the ability to declare variable-length fields in tables.
- Some of the names of variable-length bit or character string types used in commercial products include:
  - VARCHAR
  - BLOB
  - TEXT
  - IMAGE
  - CHARACTER VARYING (SQL-92)
  - VARGRAPHIC
  - LONG RAW
  - BYTE VARYING
  - BIT VARYING (SQL-92)
- Some systems have a maximal variable-length field as small as 256 bytes; other systems allow field values as large as 2 GBytes.
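As a concrete illustration of variable-length fields (using SQLite rather than any of the products above, since it ships with Python), TEXT and BLOB columns play the same role as the type names listed; the table and column names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE clip (
        id    INTEGER PRIMARY KEY,
        title TEXT,   -- variable-length character string
        frame BLOB    -- variable-length binary data (e.g. image bytes)
    )
""")
raw = bytes(range(256)) * 4   # stand-in for real image data
conn.execute("INSERT INTO clip (title, frame) VALUES (?, ?)",
             ("CNN Headline News", raw))
row = conn.execute("SELECT title, length(frame) FROM clip").fetchone()
print(row)  # ('CNN Headline News', 1024)
```

The database stores however many bytes the value actually contains; the per-product differences mentioned above are in the maximum size such a field may reach, not in how it is declared.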
22. BLOBs in InterBase
- InterBase stores BLOBs in collections of segments. A segment in InterBase can be thought of as a fixed-length page or I/O block.
- InterBase provides special API calls to retrieve and modify the segments:
  - open-BLOB: opens the BLOB for reading
  - get-segment: reads the next segment
  - create-BLOB: opens the BLOB for writes or updates
  - put-segment: saves the changes to the BLOB
- Users can specify the length of each segment.
23. IMAGE and TEXT in Sybase's SQL Server
- The TEXT and IMAGE data types are supported in Sybase's Transact-SQL, which is an enhanced version of the SQL standard.
- TEXT and IMAGE values can be as large as 2 GBytes.
- Internally, TEXT and IMAGE column values contain pointers to the first page of a linked list of pages.
- Some of the functions supported:
  - PATINDEX(pattern, column): returns the starting position of the first occurrence of the pattern in the column.
  - TEXTPTR(column): returns a pointer to the variable-length field.
24. OODBs and Multimedia Applications
- Object-oriented databases are more suitable for multimedia application development:
  - Better complex-object support: by their nature, many multimedia database applications, such as compound documents, need complex object support.
  - Extensibility and the ability to add new types (classes): users can add new types and extend the existing class hierarchies to address the specific needs of the multimedia application.
  - Better concurrency control and transaction-model support: transaction concepts such as long transactions and nested transactions are important for multimedia applications.
25. Multimedia Data Types in UniSQL/X
- UniSQL/X supports a class hierarchy rooted at the generalized large object (GLO) class.
- The GLO class serves as the root of the multimedia data type classes and provides a number of built-in attributes and methods.
- For the content of GLO objects, the user can create either a Large Object (LO) or a File-Based Object (FBO):
  - LOs can only be accessed through UniSQL/X.
  - FBOs are stored in the host file system; the database stores a reference or a path for each FBO.
- In addition to the base class GLO, UniSQL/X supports subclasses of GLO for specific multimedia data types:
  - Audio class
  - Image class
26. Programming Multimedia Applications
- An application is considered to be a multimedia object.
- An application object uses or consists of many Basic Multimedia Objects (BMOs) and Compound Multimedia Objects (CMOs).
- The specification of an object includes:
  - binding information to a file
  - methods
  - event-driven processing (e.g., displaying the last image if the video ends before the audio)
- The use of methods and events allows the application to create a script which expresses the interactions of the different objects precisely and relatively simply.
27. A Multimedia-Program Example
28. Multimedia Information Retrieval (and Indexing)
- Multimedia information retrieval:
  - deals with the storage, retrieval, transport, and presentation of different types of multimedia data (e.g., images, video clips, audio clips, texts)
  - there is a real need for managing multimedia data, including their retrieval
- Multimedia information retrieval in general covers:
  - the retrieval process
  - queries
  - indexing the media
  - matching media and query representations
29. MMDBMS and Retrieval: What Is That? A First Attempt at a Clearer Meaning
- Example: an insurance company's accident claim report as a multimedia object. It includes:
  - images (or video) of the accident
  - insurance forms with structured data
  - audio recordings of the parties involved in the accident
  - the text report of the insurance company's representative
- Multimedia databases store structured data and unstructured data.
- Multimedia retrieval systems must retrieve structured and unstructured data.
30. MMDBMS and Retrieval (cont.)
- Retrieval of structured data from databases:
  - typically handled by a Database Management System (DBMS)
  - the DBMS provides a query language (e.g., the Structured Query Language, SQL, for the relational data model)
  - deterministic matching of query and data
- Retrieval of unstructured data from databases:
  - typically handled by an Information Retrieval (IR) system
  - similarity matching of uncertain query and document representations
  - results in a list of documents ranked according to relevance
31. MMDBMS and Retrieval (cont.)
- Multimedia database management systems should combine Database Management System (DBMS) and Information Retrieval (IR) technology:
  - the data modeling capabilities of DBMSs with the advanced, similarity-based query capabilities of IR systems
- Challenge: finding a data model that ensures:
  - effective query formulation and document representation
  - efficient storage
  - efficient matching
  - effective delivery
32. MMDBMS and Retrieval (cont.)
- Query formulation:
  - must accommodate the information needs of the users of multimedia systems
- Document representations and their storage:
  - appropriate modeling of the structure and content of a wide range of data in many different formats (= indexing): XML? MPEG-7?
  - cf. dealing with thousands of images, documents, audio and video segments, and free texts
  - at the same time, modeling of physical properties for compression/decompression, synchronization, and delivery: MPEG-21
33. MMDBMS and Retrieval (cont.)
- Matching of query and document representations:
  - taking into account the variety of attributes, and their relationships, in query and document representations
  - combining exact matching of structured data with uncertain matching of unstructured data
- Delivery of data:
  - browsing, retrieval
  - temporal constraints of video and audio presentation
  - merging of data from different sources (e.g., in medical networks)
34. MMDBMS Queries
- 1) As in many retrieval systems, the user has the opportunity to browse and navigate through hyperlinks without querying; this requires:
  - topic maps
  - summary descriptions of the multimedia objects
- 2) Queries specifying the conditions of the objects of interest:
  - the idea of a multimedia query language
  - it should provide predicates for expressing conditions on the attributes, structure, and content (semantics) of multimedia objects
35. MMDBMS Queries (cont.)
- Attribute predicates:
  - concern the attributes of multimedia objects with an exact value (cf. traditional DB attributes)
  - e.g., the date of a picture, the name of a show
- Structural predicates:
  - temporal predicates, to specify temporal synchronization:
    - for continuous media such as audio and video
    - for expressing temporal relationships between the frame representations of a single audio or video stream
    - e.g., "Find all the objects in which a jingle is playing for the duration of an image display"
36. MMDBMS Queries (cont.)
- Spatial predicates, to specify spatial layout properties for the presentation of multimedia objects:
  - examples of predicates: contains, is contained in, intersects, is adjacent to
  - e.g., "Find all the objects containing an image overlapping the associated text"
- Temporal and spatial predicates can be combined:
  - e.g., "Find all the objects in which the logo of the car company is displayed, and when it disappears, a graphic (showing the increase in the company's sales) is shown in the same position where the logo was"
- Temporal and spatial predicates can:
  - refer to whole objects
  - refer to subcomponents of objects, given a data model that supports complex object representation
37. MMDBMS Queries (cont.)
- Semantic predicates:
  - concern the semantic and unstructured content of the data involved
  - represented by the features that have been extracted and stored for each multimedia object
  - e.g., "Find all the objects containing the word OFFICE" or "Find all red houses"
  - uncertainty, proximity, and weights can be expressed in the query
- Multimedia query language:
  - a structured language
  - users do not formulate queries in this language, but enter query conditions by means of interfaces
  - natural language queries? an interface translates the query into the correct query syntax
38. MMDBMS Queries (cont.)
- 3) Query by example:
  - e.g., video, audio
  - the query is composed by picking an example and choosing the features the object must comply with
  - e.g., in a graphical user interface (GUI), the user chooses the image of a house and domain features for the query "Retrieve all houses of similar shape and different color"
  - e.g., music: a recorded melody, or a note sequence entered via the Musical Instrument Digital Interface (MIDI)
- 4) Question answering?
  - e.g., questioning video images: "How many helicopters were involved in the attack on Kabul of December 20, 2001?"
39. MMDBMS Example: Oracle's interMedia
- Enables Oracle9i to manage rich content, including images, audio, and video information, in an integrated fashion with other traditional business data.
- With interMedia, one can parse, index, and store rich content, develop content-rich Web applications, deploy rich content on the Web, and tune Oracle9i content repositories.
- interMedia enables data management services to support the rich data types used in electronic commerce catalogs, corporate repositories, Web publishing, corporate communications and training, media asset management, and other applications for internet, intranet, extranet, and traditional applications, in an integrated fashion.
- http://technet.oracle.com
40. MMDBMS Indexing
- Remember: indexing and retrieval systems.
- Indexing: assigning or extracting the features that will be used for unstructured and structured queries (unfortunately, the term often refers only to low-level features).
- Often also segmentation: the detection of retrieval units.
- Two main approaches:
  - manual:
    - segmentation
    - indexing: naming of objects and their relationships with key terms (natural language or a controlled language)
  - automatic analysis:
    - identify the mathematical characteristics of the contents
    - different techniques depending on the type of multimedia source (image, text, video, or audio)
    - possible manual correction
41. Indexing Multimedia and Features
- A multimedia object is typically represented as a set of features (e.g., as a vector of features).
- Features can be weighted (expressing uncertainty, or the significance of a value).
- Features can be stored and searched in an index tree.
- Features have to be connected with the semantic content.
42. Indexing Images
- Automatic indexing of images:
  - segmentation into homogeneous segments:
    - a homogeneity predicate defines the conditions for automatically grouping cells
    - e.g., in a color image, cells that are adjacent to one another and whose pixel values are close are grouped into a segment
  - indexing: recognition of objects and simple patterns:
    - recognition of low-level features: color histograms, textures, shapes (e.g., person, house), position
    - appearance features are often not important in retrieval
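The homogeneity predicate above can be illustrated with a toy flood fill: adjacent cells whose values differ by at most a tolerance are grouped into the same segment. The 4x4 "image" and the tolerance are made up:

```python
image = [
    [10, 10, 11, 50],
    [10, 11, 11, 51],
    [90, 90, 12, 51],
    [90, 91, 91, 52],
]

def segments(img, tol=1):
    """Label each cell; adjacent cells whose values differ by at most
    `tol` (the homogeneity predicate) share a segment label."""
    h, w = len(img), len(img[0])
    label = [[None] * w for _ in range(h)]
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if label[sy][sx] is not None:
                continue
            stack = [(sy, sx)]          # flood fill from a new seed
            label[sy][sx] = next_label
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and label[ny][nx] is None
                            and abs(img[ny][nx] - img[y][x]) <= tol):
                        label[ny][nx] = next_label
                        stack.append((ny, nx))
            next_label += 1
    return label, next_label

labels, count = segments(image)
print(count)  # 3 homogeneous segments in the toy image
```

Real segmenters apply the same idea to color vectors per pixel rather than single grey values, but the structure (a predicate deciding whether two adjacent cells merge) is the same.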
43. Indexing Audio
- Automatic indexing of audio:
  - segmentation into sequences (= the basic units for retrieval), often done manually
  - indexing:
    - speech recognition and indexing of the resulting transcripts (cf. indexing written text for retrieval)
    - acoustic analysis (e.g., of sounds, music, and songs: melody transcription, note encoding, interval and rhythm detection, and chord information), translated into a string
    - e.g., key melody extraction (Tseng, 1999)
44. Scene Segmentation Based on Audio Information
- Short-Time Energy (STE) is a reliable indicator for silence detection.
- Zero-Crossing Rate (ZCR) is a useful feature for characterizing different non-silence audio signals (especially for discerning unvoiced speech).
- Pitch (P value) is the fundamental frequency of an audio waveform.
- Spectrum Flux (SF) is defined as the average variation of the spectrum between two adjacent frames in a short-time analysis window; it is used to discriminate speech from environmental sound.
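Two of these features are simple enough to sketch directly; the synthetic signals below (a low tone, a near-Nyquist "hiss", and silence) are made up for illustration:

```python
import math

def short_time_energy(frame):
    """STE: mean squared amplitude of the samples in a frame.
    Near zero for silence."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """ZCR: fraction of adjacent sample pairs whose signs differ.
    High for noise-like (unvoiced) signals, low for tonal ones."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

rate = 8000
tone = [math.sin(2 * math.pi * 100 * t / rate) for t in range(400)]    # tonal
noise = [math.sin(2 * math.pi * 3900 * t / rate) for t in range(400)]  # hiss-like
silence = [0.0] * 400

print(short_time_energy(silence))                            # 0.0
print(zero_crossing_rate(tone) < zero_crossing_rate(noise))  # True
```

A scene segmenter would compute these per short frame (e.g. 20 ms) and place boundaries where the feature profile changes abruptly.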
45. Indexing Video
- Automatic indexing of video:
  - the segment is the basic unit for retrieval
  - objects and activities identified in each video segment can be used to index the segment
  - segmentation:
    - detection of video shot breaks and camera motions
    - boundaries in the audio material (e.g., a different music tune, changes in speaker)
    - textual topic segmentation of transcripts of the audio and of the closed captions (see below)
    - heuristic rules based on knowledge of:
      - the type-specific schematic structure of the video (e.g., documentary, sports)
      - certain cues: the appearance of the anchor person in news signals a new topic
46. An Example of Indexing
- Learning textual descriptions of images from surrounding text (Mori et al., 2000).
- Training:
  - images are segmented into image parts of equal size
  - features are extracted for each image part (by quantization):
    - a 4 x 4 x 4 RGB color histogram
    - an 8-direction x 4-resolution intensity histogram
  - the words that accompany the image are inherited by each image part:
    - words are selected from the text of the document that contains the image, by taking the nouns and adjectives that occur with a frequency above a threshold
  - similar image parts are clustered based on their extracted features:
    - a single-pass partitioning algorithm with a minimum similarity threshold value
47. An Example of Indexing (cont.)
- For each word wi and each cluster cj, the probability P(wi | cj) is estimated as
  P(wi | cj) = mji / Mj
  - where mji is the total frequency of word wi in cluster cj
  - and Mj is the total frequency of all words in cj
- Testing:
  - an unknown image is divided into parts, and the image features are extracted
  - for each part, the nearest cluster is found as the cluster whose centroid is most similar to the part
  - the average likelihood of all the words of the nearest clusters is computed
  - the k words with the largest average likelihood are chosen to index the new image (in the example, k = 3)
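The estimate and the top-k selection above can be sketched directly; the cluster word counts below are made up:

```python
# m_ji per word for two hypothetical clusters.
clusters = {
    "c1": {"house": 6, "red": 3, "sky": 1},   # M_1 = 10
    "c2": {"house": 1, "tree": 3, "sky": 4},  # M_2 = 8
}

def p_word_given_cluster(word, cluster):
    """P(w_i | c_j) = m_ji / M_j."""
    counts = clusters[cluster]
    total = sum(counts.values())        # M_j
    return counts.get(word, 0) / total  # m_ji / M_j

def top_k_words(nearest_clusters, k=3):
    """Average each word's likelihood over the nearest clusters of all
    image parts, and keep the k most likely words as the image index."""
    vocab = {w for c in nearest_clusters for w in clusters[c]}
    avg = {w: sum(p_word_given_cluster(w, c) for c in nearest_clusters)
              / len(nearest_clusters)
           for w in vocab}
    return sorted(avg, key=avg.get, reverse=True)[:k]

print(p_word_given_cluster("house", "c1"))  # 0.6
print(top_k_words(["c1", "c2"], k=3))       # ['house', 'sky', 'tree']
```

With these counts, "house" dominates because it is frequent in one nearest cluster and present in the other, matching the averaging step described for testing.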
49. (Figure; source: Mori et al.)
50. (Figure; source: Mori et al.)
51. Demo Systems
- Hermitage Museum Web Site (QBIC)
  - http://hermitagemuseum.org/
  - http://hermitagemuseum.org/fcgi-bin/db2www/qbicColor.mac/qbic?selLang=English
- Media Portal: WebSEEk
  - http://www.ctr.columbia.edu/webseek/
- Video Search Engine: VideoQ
  - http://www.ctr.columbia.edu/videoq
- Geographical Application
  - http://nayana.ece.ucsb.edu/M7TextureDemo/Demo/client/M7TextureDemo.html
  - http://www-db.stanford.edu/IMAGE/
52. QBIC Features
- Color: QBIC computes the average Munsell (Miyahara et al., 1988) coordinates of each object and image, plus a k-element color histogram (k is typically 64 or 256) that gives the percentage of the pixels of each image in each of the k colors.
- Texture: QBIC's texture features are based on modified versions of the coarseness, contrast, and directionality features proposed in (Tamura et al., 1978). Coarseness measures the scale of the texture (pebbles vs. boulders), contrast describes the vividness of the pattern, and directionality describes whether or not the image has a favored direction or is isotropic (grass versus a smooth object).
- Shape: QBIC has used several different sets of shape features. One is based on a combination of area, circularity, eccentricity, major-axis orientation, and a set of algebraic moment invariants. A second is the turning angles, or tangent vectors, around the perimeter of an object, computed from smooth splines fitted to the perimeter. The result is a list of 64 values of turning angle.
53. WebSeek
54. WebSeek (cont.)
55. VideoQ
56. VideoQ (cont.)