1
Automatic Musical Video Creation with Media Analysis
  • 2004/02/16
  • Student: Chen-Hsiu Huang
  • Advisor: Prof. Ja-Ling Wu

2
Outline
  • Problem Formulation
  • Current Solutions
  • Our Goal
  • Gory Details
  • Performance Evaluation
  • What's Next?
  • Questions and Discussion

3
Problem Formulation
  • Digital video capture devices such as DV
    camcorders have become more affordable for end
    users.
  • It is fun to shoot videos but frustrating to edit
    them.
  • There is still a tremendous barrier between
    amateurs (home users) and the powerful video
    editing software.
  • As a result, people leave their precious footage
    in piles of DV tapes without editing or
    management.

4
  • According to a survey on DVworld, there is a
    relation between video length and how many times
    users review the footage in the following days.
  • Video clips of no more than 5 minutes are best
    suited to viewers' concentration.

http://www.DVworld.com.tw/
5
Facts about Musical Video
  • People are impatient with videos that have no
    scenario or voice-over, especially those with no
    music.
  • Improved soundtrack quality improves perceived
    video image quality.
  • Synchronizing video and audio segments enhances
    the perception of both.
  • One study at MIT showed that listeners judge
    identical video images to be of higher quality
    when accompanied by higher-fidelity audio.

6
  • Home videos can be roughly classified by their
    nature.
  • Four profiles are proposed to deal with videos of
    different natures.

7
Current Solutions
  • A consumer product called muvee autoProducer has
    been announced to ease the burden of professional
    video editing.
  • Its application scenario is quite simple:

Pick up your video → Choose your favorite music → Select profiles to apply → Produce a quality musical video
8
Our Goal
  • Although there are commercial products on the
    market, only a few related academic publications
    exist.
  • Jonathan Foote, Matthew D. Cooper, and Andreas
    Girgensohn, "Creating music videos using
    automatic media analysis," ACM Multimedia 2002,
    pp. 553-560.
  • Content-analysis technologies have been developed
    for years; can we adopt them to help auto-create
    musical videos?
  • Goal: to achieve quality near or beyond that of
    existing products in a similar application
    scenario, using the content-analysis technologies
    developed in the multimedia domain.

9
Proposed Framework
  • [Framework diagram] Audio features (volume, ZCR,
    brightness, bandwidth) drive audio segment
    cutting; video features (shot change, scene
    change, human face, flashlight, motion strength,
    color variance, camera operation, ...) drive the
    video analysis.
10
Audio Analysis
  • We cut the input audio into several clips
    according to its audio features (a sketch of the
    frame-level features follows below).
  • Frame-level features
  • Volume: defined as the root mean square (RMS) of
    the audio samples in each frame.
  • ZCR: the number of times the audio waveform
    crosses the zero axis in each frame.
  • Spectral features
  • Brightness: the centroid of the frequency
    spectrum.
  • Bandwidth: the standard deviation of the
    frequency spectrum.
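
As an illustration only (not the thesis's implementation), a minimal NumPy sketch of the two frame-level features on a mono PCM signal; the frame size is an assumed value:

```python
import numpy as np

def frame_level_features(samples, frame_size=512):
    """Per-frame volume (root mean square) and zero-crossing rate of a mono signal."""
    volumes, zcrs = [], []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size].astype(np.float64)
        # Volume: root mean square of the samples in the frame.
        volumes.append(float(np.sqrt(np.mean(frame ** 2))))
        # ZCR: number of sign changes of the waveform within the frame.
        signs = np.signbit(frame)
        zcrs.append(int(np.count_nonzero(signs[1:] != signs[:-1])))
    return np.array(volumes), np.array(zcrs)
```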

11
  • Generally the brightness distribution curve is
    almost the same as the ZCR curve, so here we use
    the ZCR feature only.
  • Bandwidth is an important audio feature, but we
    cannot easily tell its real physical meaning in
    music when the bandwidth reaches a high or low
    value.
  • Furthermore, the relations between musical
    perception and bandwidth values are not clear and
    not regular.

12
Audio Segmentation
  • First we cut the input audio into clips where the
    volume changes dramatically.
  • For each clip, we define a burst of ZCR as an
    attack, which may be a beat of the bass drum or
    the singer's voice.

13
  • The dramatic volume change defines the audio clip
    boundary, while the burst of ZCR (attack) in each
    clip defines the granular sub-segments within it.
  • Here we define the dynamic of each clip as
    [equation omitted].
  → The dynamic feature can later serve as a good
    reference for video/audio synchronization. (A
    segmentation sketch follows below.)
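
A rough sketch of the two-level segmentation described on this slide, assuming the per-frame volume and ZCR arrays from the sketch above. The thresholds and the "dynamic" definition (attacks per second) are assumptions standing in for the formulas not reproduced here:

```python
import numpy as np

def segment_audio(volumes, zcrs, frames_per_sec, vol_jump=2.0, zcr_burst=2.5):
    """Cut audio into clips at dramatic volume changes, then mark ZCR bursts
    (attacks) inside each clip.  All thresholds are illustrative assumptions."""
    boundaries = [0]
    for i in range(1, len(volumes)):
        prev = float(np.mean(volumes[max(0, i - 10):i])) + 1e-9
        # A clip boundary wherever the volume jumps or drops dramatically.
        if volumes[i] / prev > vol_jump or prev / (volumes[i] + 1e-9) > vol_jump:
            boundaries.append(i)
    boundaries.append(len(volumes))

    clips = []
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        clip_zcr = zcrs[lo:hi]
        mean_zcr = float(np.mean(clip_zcr)) + 1e-9
        # Attacks: frames whose ZCR bursts well above the clip's mean ZCR.
        attacks = [lo + j for j, z in enumerate(clip_zcr) if z > zcr_burst * mean_zcr]
        duration = (hi - lo) / frames_per_sec
        # Assumed stand-in for the clip's "dynamic": attacks per second.
        clips.append({"range": (lo, hi), "attacks": attacks,
                      "dynamic": len(attacks) / max(duration, 1e-9)})
    return clips
```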
14
Video Analysis
  • First we need to apply shot change detection to
    segment the video into scenes.
  • Here we use a combination of the pixel MAD and
    pixel histogram methods to perform shot change
    detection (a sketch follows below).
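
A simplified sketch of the combined cue (per-pixel mean absolute difference plus color-histogram difference between consecutive frames). The thresholds are placeholders, not the values used in the thesis:

```python
import numpy as np

def detect_shot_changes(frames, mad_th=30.0, hist_th=0.5):
    """frames: list of HxWx3 uint8 arrays.  Declare a cut at frame i when both
    the pixel MAD and the color-histogram difference exceed their thresholds."""
    cuts = []
    for i in range(1, len(frames)):
        prev = frames[i - 1].astype(np.float64)
        curr = frames[i].astype(np.float64)
        mad = float(np.mean(np.abs(curr - prev)))        # pixel-wise MAD
        hist_diff = 0.0
        for ch in range(3):                              # per-channel histograms
            ha, _ = np.histogram(frames[i - 1][..., ch], bins=32, range=(0, 256))
            hb, _ = np.histogram(frames[i][..., ch], bins=32, range=(0, 256))
            hist_diff += float(np.sum(np.abs(ha - hb))) / frames[i][..., ch].size
        if mad > mad_th and hist_diff > hist_th:
            cuts.append(i)
    return cuts
```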

15
  • Flashlight detection
  • A flashlight event would otherwise be detected as
    a shot change.
  • When a shot change is found, we check whether the
    flashlight condition holds.
  • If so, it is a flashlight event and should not be
    treated as a shot change.
  • Sub-shot segmentation
  • Here we use the MPEG-7 ColorLayout descriptor to
    measure each frame's similarity.
  • The first frame in each shot is selected as the
    basis, and each subsequent frame is compared with
    the basis. If the distance exceeds a threshold,
    we say that a sub-shot starts at frame i (a
    sketch follows below).
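
The transcript omits the actual ColorLayout distance and threshold, so the sketch below substitutes a coarse grid-of-mean-colors descriptor just to show the "compare every frame with the shot's first frame" logic; the descriptor, the threshold, and the basis-reset behavior are assumptions:

```python
import numpy as np

def grid_color_descriptor(frame, grid=8):
    """Coarse stand-in for the MPEG-7 ColorLayout descriptor: the mean color
    of each cell in a grid x grid partition of the frame (HxWx3 uint8)."""
    h, w, _ = frame.shape
    desc = np.zeros((grid, grid, 3))
    for r in range(grid):
        for c in range(grid):
            cell = frame[r * h // grid:(r + 1) * h // grid,
                         c * w // grid:(c + 1) * w // grid]
            desc[r, c] = cell.reshape(-1, 3).mean(axis=0)
    return desc

def sub_shot_boundaries(shot_frames, dist_th=40.0):
    """Mark frame i as a sub-shot boundary when its descriptor drifts far
    enough from the basis (the shot's first frame)."""
    basis = grid_color_descriptor(shot_frames[0])
    boundaries = []
    for i in range(1, len(shot_frames)):
        desc = grid_color_descriptor(shot_frames[i])
        if np.mean(np.abs(desc - basis)) > dist_th:
            boundaries.append(i)
            basis = desc   # assumption: restart comparison from the new sub-shot
    return boundaries
```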

16
Camera Operation
  • Camera operations such as pan or zoom are widely
    used in amateur home videos. Detecting those
    camera operations helps capture the video taker's
    intention.
  • Our camera operation detection is performed based
    on the MPEG video's motion vectors in P-frames.

[Diagram: motion-vector patterns for pan and zoom]
  • This method is simple and efficient, yet it does
    well at detecting camera operations (a rough
    pan/zoom classification sketch follows below).
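
One common way to tell pan from zoom with P-frame motion vectors is to check whether the vectors share a dominant direction (pan) or point radially with respect to the frame center (zoom). This sketch shows that idea, not the thesis's exact rules; the thresholds are assumed:

```python
import numpy as np

def classify_camera_operation(block_centers, motion_vectors,
                              mag_th=1.0, coherence_th=0.7):
    """block_centers: (N, 2) macroblock positions relative to the frame center.
    motion_vectors: (N, 2) motion vectors of the corresponding P-frame blocks."""
    pos = np.asarray(block_centers, dtype=float)
    mv = np.asarray(motion_vectors, dtype=float)
    mags = np.linalg.norm(mv, axis=1)
    if float(np.mean(mags)) < mag_th:
        return "static"                       # hardly any motion at all
    # Pan: one dominant direction explains most of the motion.
    if np.linalg.norm(mv.mean(axis=0)) / (np.mean(mags) + 1e-9) > coherence_th:
        return "pan"
    # Zoom: vectors are mostly radial, i.e. aligned with the line from the
    # frame center to the block (pointing inward or outward).
    radial = np.abs(np.sum(pos * mv, axis=1))
    norms = np.linalg.norm(pos, axis=1) * mags + 1e-9
    if float(np.mean(radial / norms)) > coherence_th:
        return "zoom"
    return "other"
```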

17
Video Features
  • Frame-level features
  • The presence of human faces
  • We use the OpenCV library as the face detection
    module.
  • Motion intensity
  • Flashlight detection
  • Mean and standard deviation of the luminance
    plane
  • Frames whose D_color(i) or D_hist(i) exceeds its
    threshold (Th_color, Th_hist) are defined as
    unsuitable frames.
  • Shot-level features
  • Number and types of camera operations in each
    shot.
  • Number of faces and flashlight events in each
    shot.
  • The accumulated distance between each frame and
    the first frame can be used to describe the
    shot's homogeneity.

18
Importance Measure
  • Frame-level score function, with a scaling
    coefficient according to the synchronized audio
    clip's features (weights sketched below).
  • Face and flashlight events have the highest
    weighting.
  • Camera operations and higher motion intensity
    represent the video taker's intention, so they
    are more important.
  • Frames with higher luminance and larger standard
    deviation are more suitable.
  • The penalty for unsuitable frames will be
    discussed later.
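
The transcript does not reproduce the score function itself, so the sketch below only illustrates the shape of such a weighted sum, respecting the ordering stated above (face and flashlight weighted highest, then camera operation and motion, then luminance statistics), plus the penalty and the audio scaling coefficient. Every number is an assumption:

```python
def frame_score(has_face, has_flashlight, has_camera_op, motion,
                lum_mean, lum_std, unsuitable, audio_coeff=1.0):
    """Illustrative frame-level importance.  motion, lum_mean and lum_std are
    assumed to be normalized to [0, 1]; the weights only respect the ordering
    described on the slide."""
    score = (4.0 * has_face            # faces weigh the most ...
             + 4.0 * has_flashlight    # ... together with flashlight events
             + 2.0 * has_camera_op     # camera operation shows the taker's intention
             + 2.0 * motion            # higher motion intensity matters more
             + 1.0 * lum_mean          # brighter frames ...
             + 1.0 * lum_std)          # ... with larger contrast are more suitable
    if unsuitable:
        score -= 10.0                  # penalty for frames flagged as unsuitable
    # Scaling coefficient derived from the synchronized audio clip's features.
    return audio_coeff * score
```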

19
  • The shot-level importance is motivated by the
    following observations:
  • Shots with larger motion intensity take longer
    duration.
  • The presence of a face attracts the viewer.
  • Shots of higher heterogeneity can take longer
    playing time.
  • Shots with more camera operations are more
    important.
  • Of course, shots that are longer in the original
    are more important.
  • Shot-level importance
  • The shot-level importance function is used in the
    medium profile to reassign each shot's length
    according to its importance (see the sketch
    below).
  • Static shots take less time, while dynamic shots
    can take longer. → This gets better results after
    editing.
  • muvee autoProducer does not reassign each shot's
    length!
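
A sketch of the length-reassignment idea: combine the cues listed above into a shot importance value, then hand out the total output length in proportion to it. The combination below is an assumption; the thesis's actual function is not shown in the transcript:

```python
def shot_importance(motion, has_face, heterogeneity, num_camera_ops, length):
    """Illustrative shot-level importance; every factor rewards one of the
    observations listed on the slide."""
    return (1.0 + motion) * (1.0 + has_face) * (1.0 + heterogeneity) \
        * (1.0 + 0.5 * num_camera_ops) * length

def reassign_lengths(shots, total_length):
    """Give each shot a new length proportional to its importance, so static
    shots get shorter while dynamic shots get longer."""
    weights = [shot_importance(**s) for s in shots]
    return [total_length * w / sum(weights) for w in weights]
```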

20
Example 1
  • ????? (31:55)
  • Music: SHE / ?????
  • Length: 4:25
  • Profile: Sequential Medium

21
Proposed Profiles
  • The use of profiles allows users to customize
    their videos according to the content's
    properties and the user's preferences in an easy
    way.
  • We said that home videos have four types:
  • Causal, Non-causal, Recreational, Memorial
  • For causal or non-causal videos, we use the
    sequential or non-sequential parameter,
    respectively.
  • For memorial or recreational videos, the rhythmic
    or medium parameter is developed.
  • In rhythmic, the music tempo/rhythm is better
    preserved, while some shots of the video are
    neglected.
  • In medium, the accompaniment of the music
    tempo/rhythm is not as tight as in rhythmic, but
    most of the shots are guaranteed to be shown. The
    medium parameter preserves the original video the
    most.

22
  • Thus we have four profiles:
  • Sequential Rhythmic, Sequential Medium
  • Non-Sequential Rhythmic, Non-Sequential Medium

23
Rhythmic vs. Medium
  • The video is segmented according to the audio
    clips and sub-clips.
  • After projecting onto the video time-line, we
    search within the video range for the video
    segment with the highest score and the same
    length as the audio segment (see the sketch
    below).
  • Finally, all the selected segments are
    concatenated.
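
A sketch of the segment-selection step: for each audio segment, scan the allowed video range for the equal-length window whose accumulated frame score is highest. The variable names are mine, not the thesis's:

```python
def best_window(frame_scores, start, end, length):
    """Start index of the length-frame window inside frame_scores[start:end]
    with the highest total score (simple sliding-window search)."""
    best_start = start
    window_sum = best_sum = sum(frame_scores[start:start + length])
    for s in range(start + 1, end - length + 1):
        # Slide by one frame: drop the leftmost score, add the new rightmost.
        window_sum += frame_scores[s + length - 1] - frame_scores[s - 1]
        if window_sum > best_sum:
            best_start, best_sum = s, window_sum
    return best_start
```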

24
  • Each shot is reassigned a new length according to
    its shot importance; shots may become longer or
    shorter in proportion to the total length.
  • After projection to the video space, the length
    budget is calculated according to the reduction
    rate, and the budget is then allocated to the
    inner shots according to their lengths.
  • If the allocated shot length is too short (below
    a minimum number of frames), its budget is
    transferred to nearby shots (see the sketch
    below).
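
A sketch of the budget allocation with the too-short rule; the minimum length and the choice of neighbor are assumptions:

```python
def allocate_budget(shot_lengths, budget, min_frames=15):
    """Allocate the total length budget to shots in proportion to their
    original lengths; an allocation shorter than min_frames is dropped and
    its budget is handed to a neighboring shot."""
    total = sum(shot_lengths)
    alloc = [budget * length / total for length in shot_lengths]
    for i, a in enumerate(alloc):
        if len(alloc) > 1 and 0 < a < min_frames:
            neighbor = i + 1 if i + 1 < len(alloc) else i - 1
            alloc[neighbor] += a          # transfer the budget to a nearby shot
            alloc[i] = 0.0
    return alloc
```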

25
  • However, there are some issues:
  • A fast-tempo/rhythm audio clip may be aligned
    with a static video shot, which is annoying for
    the viewer.
  • A slow audio clip may be aligned with a dynamic
    video shot.
  → We apply an audio scaling coefficient in the
    synchronization stage. The motion-intensity
    weight of a video shot is decreased when it is
    aligned with a slow audio clip and nearly
    preserved when it is synchronized with a fast
    audio clip.
  • Another issue arises when the media lengths
    differ:
  • It is unavoidable when the sequential policy is
    enforced.

26
  • For some video sources, the order of shots is not
    so important, and re-ordering shots will not
    degrade the original.
  • If we allow re-ordering of the input video shots,
    things may get better.
  • It sounds simple and intuitive, but it is not an
    easy problem if we want an efficient algorithm to
    find such a permutation.
  • Furthermore, the best solution may not exist, and
    the optimal solution may not be a unique
    permutation.

27
Non-Sequential Permutation
  • So we developed a randomized algorithm to find a
    not-bad solution within predictable computation
    time (a sketch follows below).
  • First, randomly permute the video shots.
  • Then compute Ravc, the audio-to-video coverage in
    the corresponding time-line, for each shot.
  • Then calculate the average Ravc; each permutation
    has its own Ravc.
  • After many iterations, keep the permutation with
    the minimal Ravc. Theoretically we can approach
    the optimal solution efficiently and predictably;
    the cost depends only on how many iterations we
    perform.
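
A sketch of the randomized search: try many random shot orderings, score each by its average audio-to-video coverage Ravc (whose exact definition is not in the transcript, so it is passed in as a function), and keep the lowest:

```python
import random

def search_permutation(num_shots, average_ravc, iterations=10000, seed=None):
    """Randomized search over shot orderings.  average_ravc(order) is assumed
    to return the average audio-to-video coverage of that ordering (lower is
    better); more iterations approach the optimum at predictable cost."""
    rng = random.Random(seed)
    order = list(range(num_shots))
    best_order, best_score = list(order), average_ravc(order)
    for _ in range(iterations):
        rng.shuffle(order)                       # randomly permute the shots
        score = average_ravc(order)
        if score < best_score:
            best_order, best_score = list(order), score
    return best_order, best_score
```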

28
  • For example, 10,000 iterations are performed.
  • We can get a better solution with more
    iterations, but experiments show that 10,000
    iterations are quite enough and are not a burden
    for our computation power (actually it is really
    fast).
  • Because of its random nature, each
    synchronization result will be different. But as
    discussed before, it is normal to have many
    solutions.

29
Example 2
  • ??? (19:08)
  • Music: ????
  • Length: 4:25
  • Profile: Sequential Medium and Non-Sequential
    Medium
30
Performance Evaluation
  • Development environment
  • AMD Duron 1.2 GHz with 386 MB RAM
  • Analysis complexity
  • For video, about 1.2-1.3 times the original video
    length.
  • For audio, about 2 minutes for a 5-minute clip;
    if spectral analysis is performed, 4-5 minutes
    are needed.
  • The audio/video analysis results are saved as
    description files, so the analysis is required
    only once.
  • The synchronization can be regarded as O(n)
    complexity.
  • During analysis, usually less than 20 MB RAM is
    required (depending on how many shots the video
    contains).
  • The synchronization result is saved as an
    AviSynth script. We then use VirtualDub to encode
    the produced musical video.

31
Sample Videos
????? (31:55)
??????? (17:59)
??? (19:08)
???? (43:42)
????? (60:34)
littleco ??? (20:22)
32
Whats Next?
  • How should the experimental evaluation be
    designed?
  • The subjective test should not over-burden the
    viewer.
  • Adding shot transition effects, such as dissolve,
    fade in, and fade out?
  • I have tried, but it is not as easy as I thought.
  • The automatic approach may not always produce a
    satisfactory result, and the experience is highly
    subjective and differs from person to person.
  • Semi-automatic is probably the best compromise:
    the automatic result serves only as a
    pre-processing basis and a labor-saving tool.
  • But a video editing tool is hard to develop, and
    I doubt whether it is necessary to develop one
    from scratch for the purpose of a thesis.

33
Questions and Discussion
  • Any comments are welcome.
  • Acknowledgments
  • Special thanks to Mr. ??? for his videos and
    suggestions.
  • Thanks to friends at DVworld who provided lots of
    ideas and comments.
  • Thanks to Chih-Hao Shen for his dancing video.