Title: Automatic Musical Video Creation with Media Analysis
1 Automatic Musical Video Creation with Media Analysis
- 2004/02/16
- Student: Chen-Hsiu Huang
- Advisor: Prof. Ja-Ling Wu
2 Outline
- Problem Formulation
- Current Solutions
- Our Goal
- Gory Details
- Performance Evaluation
- What's Next?
- Questions and Discussion
3 Problem Formulation
- Digital video capture devices such as DV camcorders have become more affordable for end users.
- Shooting video is fun, but editing it is frustrating.
- There is still a tremendous barrier between amateurs (home users) and powerful video editing software.
- As a result, people leave their precious shots in piles of DV tapes, unedited and unmanaged.
4
- A survey on DVworld relates the length of a video to how many times users review it in the days after shooting.
- Video clips of no more than 5 minutes are best for human concentration.
http://www.DVworld.com.tw/
5 Facts about Musical Video
- People are impatient with videos that have no scenario or voice-over, especially those with no music.
- Improved soundtrack quality improves the perceived video image quality.
- Synchronizing video and audio segments enhances the perception of both.
- One study at MIT showed that listeners judge an identical video image to be of higher quality when it is accompanied by higher-fidelity audio.
6
- Home videos can be roughly classified by the nature of their content.
- Four profiles are proposed to deal with videos of different natures.
7 Current Solutions
- A consumer product called muvee autoProducer has been announced to ease the burden of professional video editing.
- Its application scenario is quite simple:
Pick your video
Choose your favorite music
Select profiles to apply
Produce a quality musical video
8 Our Goal
- Although there are commercial products on the market, only a few related academic publications exist.
- Jonathan Foote, Matthew D. Cooper, Andreas Girgensohn, "Creating music videos using automatic media analysis," ACM Multimedia 2002, pp. 553-560.
- Content-analysis technologies have been developed for years. Can we adopt them to help the automatic creation of musical videos?
- Goal: To achieve comparable or better quality in a similar application scenario, using the content-analysis technologies developed in the multimedia domain.
9 Proposed Framework
[Figure: framework diagram. Audio side: Volume, ZCR, Brightness, Bandwidth; Audio segment cutting. Video side: Shot change, Scene change, Human face, Flash light, Motion strength, Color variance, Camera Operation, ...]
10 Audio Analysis
- We cut the input audio into several clips according to its audio features.
- Frame-level features
- Volume: defined as the root mean square (RMS) of the audio samples in each frame.
- ZCR: the number of times that the audio waveform crosses the zero axis in each frame.
- Spectral features
- Brightness: the centroid of the frequency spectrum.
- Bandwidth: the standard deviation of the frequency spectrum.
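The four features listed above can be sketched with NumPy as follows. This is a minimal illustration, not the implementation used in this work: the function name `frame_features`, the frame length, and the use of the power spectrum as the weighting are all assumptions.

```python
import numpy as np

def frame_features(samples, sample_rate, frame_len=1024):
    """Return per-frame volume, ZCR, brightness and bandwidth arrays."""
    n_frames = len(samples) // frame_len
    frames = np.asarray(samples[: n_frames * frame_len], dtype=float)
    frames = frames.reshape(n_frames, frame_len)

    # Volume: root mean square of the samples in each frame.
    volume = np.sqrt(np.mean(frames ** 2, axis=1))

    # ZCR: count sign changes between neighbouring samples.
    signs = np.signbit(frames)
    zcr = np.sum(signs[:, 1:] != signs[:, :-1], axis=1)

    # Power spectrum and the frequency (Hz) of each bin.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    total = power.sum(axis=1) + 1e-12  # avoid division by zero on silence

    # Brightness: centroid of the spectrum (assumed power-weighted).
    brightness = (power * freqs).sum(axis=1) / total
    # Bandwidth: standard deviation of the spectrum around the centroid.
    spread = (freqs - brightness[:, None]) ** 2
    bandwidth = np.sqrt((power * spread).sum(axis=1) / total)
    return volume, zcr, brightness, bandwidth
```

For a pure tone, the brightness sits at the tone's frequency and the bandwidth is close to zero, while noisy or percussive frames give larger values of both.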
11
- The brightness curve is generally almost the same as the ZCR curve, so we use only the ZCR feature here.
- Bandwidth is an important audio feature, but it is hard to tell what a high or low bandwidth value physically means in music.
- Furthermore, the relationship between musical perception and bandwidth values is neither clear nor regular.
12 Audio Segmentation
- First, we cut the input audio into clips wherever the volume changes dramatically.
- Within each clip, we define a burst of ZCR as an attack, which may be a beat of the bass drum or the singer's voice.
13
- A dramatic volume change defines an audio clip boundary, while the bursts of ZCR (attacks) in each clip define the finer-grained sub-segments within it.
- Here we define the dynamic of each clip as
- The dynamic feature can be used as a good reference later for video/audio synchronization.
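The two-level segmentation described above (volume changes for clip boundaries, ZCR bursts for attacks) can be sketched as follows. The thresholds `vol_ratio` and `zcr_factor` and the exact burst test are hypothetical; the slides do not give the actual criteria.

```python
import numpy as np

def segment_audio(volume, zcr, vol_ratio=2.0, zcr_factor=1.5):
    """Split frames into clips and mark ZCR bursts (attacks) in each clip.

    vol_ratio and zcr_factor are hypothetical thresholds.
    Returns a list of (start, end, attacks) with end exclusive.
    """
    volume = np.asarray(volume, dtype=float)
    zcr = np.asarray(zcr, dtype=float)
    eps = 1e-9

    # A clip boundary is placed where the volume jumps or drops by vol_ratio.
    ratio = (volume[1:] + eps) / (volume[:-1] + eps)
    cuts = np.where((ratio > vol_ratio) | (ratio < 1.0 / vol_ratio))[0] + 1
    bounds = [0, *cuts.tolist(), len(volume)]

    clips = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        clip_zcr = zcr[start:end]
        # An attack is a frame whose ZCR bursts above the clip's own average.
        burst = clip_zcr > zcr_factor * clip_zcr.mean()
        # Keep only the first frame of each run of burst frames.
        onset = burst & ~np.concatenate(([False], burst[:-1]))
        attacks = (np.where(onset)[0] + start).tolist()
        clips.append((start, end, attacks))
    return clips
```

Each returned tuple gives a clip's frame range and the frame indices of its attacks, which split the clip into sub-segments.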
14 Video Analysis
- First, we apply shot change detection to segment the video into shots.
- We combine the pixel MAD (mean absolute difference) method and the pixel histogram method to perform shot change detection.
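A combined MAD and histogram test could look like the sketch below. The thresholds, the histogram distance, and the rule that both measures must agree are assumptions made for illustration; the slides only state that the two methods are combined.

```python
import numpy as np

def detect_shot_changes(frames, mad_thresh=30.0, hist_thresh=0.4, bins=32):
    """Return indices i where a shot change occurs between frame i-1 and i.

    frames: sequence of 2-D grayscale arrays (values 0..255, same shape).
    """
    changes = []
    prev = np.asarray(frames[0], dtype=float)
    prev_hist = np.histogram(prev, bins=bins, range=(0, 256))[0] / prev.size
    for i in range(1, len(frames)):
        cur = np.asarray(frames[i], dtype=float)
        cur_hist = np.histogram(cur, bins=bins, range=(0, 256))[0] / cur.size

        # Pixel MAD: mean absolute difference of co-located pixels.
        mad = np.mean(np.abs(cur - prev))
        # Histogram distance: half the L1 distance, which lies in [0, 1].
        hist_dist = 0.5 * np.sum(np.abs(cur_hist - prev_hist))

        # Assumed combination rule: both measures must exceed their thresholds.
        if mad > mad_thresh and hist_dist > hist_thresh:
            changes.append(i)
        prev, prev_hist = cur, cur_hist
    return changes
```

Requiring both tests to fire suppresses false alarms from object motion, which raises the MAD but leaves the histogram nearly unchanged.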
15
- Flashlight detection
- A flashlight event would be detected as a shot change.
- When a shot change is found, check if
- If so, it is a flashlight event and should not be treated as a shot change.
- Sub-shot segmentation
- We use the MPEG-7 ColorLayout descriptor to measure the similarity between frames.
- The first frame in each shot is selected as the basis, and each subsequent frame is compared with the basis. If
- Then we say that a sub-shot occurs at frame i.
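The basis-frame comparison for sub-shot segmentation can be sketched as below. Several parts are assumptions: `grid_descriptor` is a simple stand-in (block averages on an 8x8 grid), not the real MPEG-7 ColorLayout descriptor; the threshold test and its value are hypothetical because the condition is missing from the slide; and resetting the basis at each new sub-shot is also assumed.

```python
import numpy as np

def grid_descriptor(frame, grid=8):
    """Stand-in for ColorLayout: mean value of each cell of a grid x grid layout."""
    frame = np.asarray(frame, dtype=float)
    h, w = frame.shape[:2]
    ys = np.linspace(0, h, grid + 1).astype(int)
    xs = np.linspace(0, w, grid + 1).astype(int)
    cells = [[frame[ys[r]:ys[r + 1], xs[c]:xs[c + 1]].mean()
              for c in range(grid)] for r in range(grid)]
    return np.array(cells).ravel()

def find_sub_shots(shot_frames, threshold=20.0):
    """Return frame indices (within the shot) where a new sub-shot starts."""
    basis = grid_descriptor(shot_frames[0])
    starts = []
    for i in range(1, len(shot_frames)):
        desc = grid_descriptor(shot_frames[i])
        # Assumed test: mean absolute descriptor distance exceeds the threshold.
        if np.mean(np.abs(desc - basis)) > threshold:
            starts.append(i)
            basis = desc  # assumed: the new sub-shot becomes the basis
    return starts
```

Frames must be at least 8x8 pixels so that every grid cell is non-empty.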
16 Camera Operation
- Camera operations such as pan and zoom are widely used in amateur home videos. Detecting them helps capture the video taker's intention.
- Our camera operation detection is based on the motion vectors in the P-frames of MPEG video.
Pan
Zoom
- This method is simple and efficient, yet it does well when detecting camera operations.
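One way to classify a P-frame's motion field into pan or zoom is sketched below. The actual pan and zoom criteria are not given in the slides, so the rules here (a large mean vector for pan, a large radial component for zoom), the function name, and both thresholds are assumptions.

```python
import numpy as np

def classify_camera_operation(mv, pan_thresh=2.0, zoom_thresh=0.5):
    """Classify one P-frame's motion field as 'pan', 'zoom' or 'static'.

    mv: array of shape (rows, cols, 2) holding (dx, dy) per macroblock.
    """
    mv = np.asarray(mv, dtype=float)
    rows, cols = mv.shape[:2]

    # Pan: most vectors share one direction, so their mean is large.
    mean_vec = mv.reshape(-1, 2).mean(axis=0)

    # Zoom: vectors point away from (or towards) the frame centre.
    ys, xs = np.mgrid[0:rows, 0:cols]
    radial = np.stack([xs - (cols - 1) / 2.0, ys - (rows - 1) / 2.0], axis=-1)
    norm = np.linalg.norm(radial, axis=-1, keepdims=True)
    unit = np.divide(radial, norm, out=np.zeros_like(radial), where=norm > 0)
    # Average radial component after removing the global translation.
    radial_flow = np.mean(np.sum((mv - mean_vec) * unit, axis=-1))

    if abs(radial_flow) > zoom_thresh:
        return "zoom"
    if np.linalg.norm(mean_vec) > pan_thresh:
        return "pan"
    return "static"
```

Working on motion vectors that the MPEG stream already contains avoids any extra motion estimation, which is what keeps this kind of detector cheap.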
17 Video Features
- Frame-level features
- The presence of human faces.
- Use OpenCV library as face detection module.
- Motion intensity
- Flashlight detection
- Mean and standard deviation of luminance plane
- (Dcolor(i) Thcolor Dhist(i) defines the unsuitable frames
- Shot-level features
- The number and types of camera operations in each shot.
- The number of faces and flashlight events in each shot.
- The accumulated distance between each frame and the first frame can be used to describe the shot's homogeneity.
18 Importance Measure
A scaling coefficient according to the synchronized audio clip's feature
- Frame-level score function
- Face and flashlight events have the highest weighting.
- Camera operations and higher motion intensity reflect the video taker's intention, so such frames are more important.
- Frames with higher luminance and a larger standard deviation are more suitable.
- The penalty for unsuitable frames will be discussed later.
19
- The shot-level importance is motivated by the following observations:
- Shots with larger motion intensity can take a longer duration.
- The presence of faces attracts viewers.
- Shots of higher heterogeneity can take a longer playing time.
- Shots with more camera operations are more important.
- Of course, shots that are longer in the original video are more important.
- Shot-level importance
- The shot-level importance function is used in the medium profile to reassign each shot's length according to its importance.
- Static shots take a shorter time, while dynamic shots can take longer. This gives better results after editing.
- muvee autoProducer does not reassign each shot's length!
20 Example 1
- ????? (31:55)
- Music: SHE / ?????
- Length: 4:25
- Profile: Sequential Medium
21 Proposed Profiles
- Profiles allow users to customize their videos easily according to the content's properties and their own preferences.
- We said that home videos have four types:
- Causal, Non-causal, Recreational, Memorial
- For causal or non-causal videos, we use the sequential or non-sequential parameter, respectively.
- For memorial or recreational videos, the rhythmic or medium parameter is used.
- In rhythmic, the music tempo/rhythm is better preserved, while some video shots will be dropped.
- In medium, the match with the music tempo/rhythm is not as clear as in rhythmic, but most of the shots are guaranteed to be shown. The medium parameter preserves the original video the most.
22
- Thus we have four profiles:
- Sequential Rhythmic, Sequential Medium
- Non-Sequential Rhythmic, Non-Sequential Medium
23 Rhythmic vs. Medium
- The video is segmented according to the audio clips and sub-clips.
- After projecting each audio segment onto the video timeline, we search the corresponding video range for the video segment that has the highest score and the same length as the audio segment.
- Finally, we concatenate all the selected segments.
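The search for the highest-scoring video segment of a given length can be sketched with a sliding window over per-frame scores. The function names, the frame-based units, and the way search ranges are passed in are assumptions for illustration.

```python
import numpy as np

def best_window(scores, length, lo=0, hi=None):
    """Return the start index of the highest-scoring window of `length`
    frames inside scores[lo:hi]. `length` must be positive."""
    scores = np.asarray(scores, dtype=float)
    hi = len(scores) if hi is None else hi
    if hi - lo < length:
        raise ValueError("search range is shorter than the audio segment")
    # Prefix sums give every window total with one subtraction.
    prefix = np.concatenate(([0.0], np.cumsum(scores[lo:hi])))
    totals = prefix[length:] - prefix[:-length]
    return lo + int(np.argmax(totals))

def select_segments(scores, audio_lengths, video_ranges):
    """For each audio segment length, pick the best video window in its range."""
    picks = []
    for length, (lo, hi) in zip(audio_lengths, video_ranges):
        start = best_window(scores, length, lo, hi)
        picks.append((start, start + length))
    return picks
```

Concatenating the returned (start, end) frame ranges in order gives the edited video for the rhythmic case.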
24
- Each shot is reassigned a new length according to its shot importance; shots may become longer or shorter in proportion to the total length.
- After projection onto the video space, the length budget is calculated according to the reduction rate and then allocated to each inner shot according to its length.
- If the allocated shot length is too short (frames), its budget is transferred to nearby shots.
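A proportional budget allocation with transfer of too-short budgets can be sketched as follows. The cutoff `min_frames`, the choice of the nearest kept shot as the receiver, and the handling of rounding leftovers are all assumptions; the slide does not state them.

```python
def allocate_lengths(importance, total_frames, min_frames=15):
    """Split total_frames among shots in proportion to importance.

    Shots whose share falls below min_frames (a hypothetical cutoff) get 0,
    and their budget is handed to the nearest shot that is kept.
    """
    total_imp = float(sum(importance))
    lengths = [int(total_frames * w / total_imp) for w in importance]
    # Give rounding leftovers to the most important shot.
    top = max(range(len(lengths)), key=lambda k: importance[k])
    lengths[top] += total_frames - sum(lengths)

    kept = [k for k, n in enumerate(lengths) if n >= min_frames]
    if not kept:
        return lengths  # nothing is long enough; leave the split untouched
    for k, n in enumerate(lengths):
        if 0 < n < min_frames:
            nearest = min(kept, key=lambda j: abs(j - k))
            lengths[nearest] += n
            lengths[k] = 0
    return lengths
```

The total number of frames is preserved, so the result still fills the audio segment exactly.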
25
- However, there are some issues:
- A fast tempo/rhythm audio clip may be aligned with a static video shot, which is annoying for the viewer.
- A slow audio clip may be aligned with a dynamic video shot.
- We apply an audio scaling coefficient in the synchronization stage. The weight of a video shot's motion intensity is decreased when the shot is aligned with a slow audio clip, and nearly preserved when it is synchronized with a fast audio clip.
- Another issue arises when the media lengths differ.
- This is unavoidable when the sequential policy is enforced.
26
- For some video sources, the order of shots is not so important, and reordering the shots does not degrade the original.
- If we allow the input video shots to be reordered, things may be better.
- It sounds simple and intuitive, but developing an efficient algorithm to find such a permutation is not an easy problem.
- Furthermore, the best solution may not exist, and the optimal solution may not be a single permutation.
27 Non-Sequential Permutation
- So we developed a randomized algorithm to find a reasonably good solution within predictable computation time.
- First, randomly permute the video shots.
- Then we compute Ravc, the audio-to-video coverage, in the corresponding timeline for each shot.
- Then we calculate the average Ravc; each permutation has its own Ravc.
- After many iterations, we take the permutation with the minimal Ravc. In theory, we can approach the optimal solution efficiently and predictably, depending only on how many iterations we perform.
28
- For example, 10000 iterations are performed.
- We can get a better solution with more iterations, but experiments show that 10000 iterations are quite enough and do not burden our computation power (actually, it is really fast).
- Because of the random property, each synchronization result will be different. But as discussed before, it is normal to have many solutions.
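The randomized search described above can be sketched as follows. Because the definition of Ravc is not given in the slides, the sketch takes a caller-supplied `cost` function as a stand-in for the average Ravc of a permutation; that function, the function name, and the seed parameter are assumptions.

```python
import random

def random_permutation_search(n_shots, cost, iterations=10000, seed=None):
    """Try random shot orders and keep the one with the lowest cost.

    cost(order) is a stand-in for the average Ravc of a permutation.
    """
    rng = random.Random(seed)
    order = list(range(n_shots))
    best_order, best_cost = order[:], cost(order)
    for _ in range(iterations):
        rng.shuffle(order)
        c = cost(order)
        if c < best_cost:
            best_order, best_cost = order[:], c
    return best_order, best_cost
```

The running time depends only on the iteration count and the price of one cost evaluation, which is why the computation time is predictable. A hypothetical cost could, for instance, sum the mismatch between each audio clip's dynamic and the motion intensity of the shot placed at that position.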
29 Example 2
- ??? (19:08)
- Music: ????
- Length: 4:25
- Profile: Sequential Medium and Non-Sequential Medium
30 Performance Evaluation
- Development environment
- AMD Duron 1.2 GHz with 386 MB RAM
- Analysis complexity
- For video, about 1.2 to 1.3 times the original video time.
- For audio, about 2 minutes for a 5-minute audio clip; if the spectral analysis is performed, 4-5 minutes are needed.
- The audio/video analysis results are saved as description files, so the analysis is required only once.
- The synchronization can be regarded as having O(n) complexity.
- Analysis usually requires less than 20 MB of RAM (depending on how many shots the video has).
- The synchronization result is saved as an AviSynth script. We then use VirtualDub to encode the produced musical video.
31 Sample Videos
????? (31:55)
??????? (17:59)
??? (19:08)
???? (43:42)
????? (60:34)
littleco ??? (20:22)
32 What's Next?
- How should the evaluation experiment be designed?
- The subjective test should not over-burden the viewer.
- Adding shot transition effects, such as dissolve, fade in, and fade out?
- I've tried, but it is not as easy as I thought.
- The automatic approach may not always produce a satisfactory result, and the experience is highly subjective and differs from person to person.
- Semi-automatic is probably the best compromise. The automatic result serves only as a pre-processing basis and a labor-saving tool.
- But a video editing tool is hard to develop, and I doubt whether it is necessary to develop one from scratch for the purpose of the thesis.
33 Questions and Discussion
- Any comments are welcome.
- Acknowledgment
- Special thanks to Mr. ??? for his videos and suggestions.
- Thanks to the friends at DVworld who provided lots of ideas and comments.
- Thanks to Chih-Hao Shen for his dancing video.