1
Automatic Summarization of Rushes Video using
Bipartite Graphs
  • Liang Bai, Songyang Lao (National University of
    Defense Technology, PR China)
  • Alan F. Smeaton, Noel E. O'Connor (Dublin City
    University, Ireland)

2
Contents
  • Introduction
    - Video summaries
    - Video summarization for rushes in TRECVID
    - Various approaches in 2007 and 2008
  • Our Proposed Approach
    - Video Structuring
    - Useless Content Removal
    - Re-take Shot Detection and Removal
    - Representative Shots and Summary Generation
  • Experiments and Our Results
    - Ground Truth
    - Experimental Results
  • Conclusion and Discussion

3
Video Summarisation
  • A summary is a condensed version of something,
    allowing judgments about the full item to be made
    with less time and effort than viewing it in full
  • In a world of information overload, summaries
    have widespread application: as surrogates
    resulting from searches, as previews, and as
    familiarisation with unknown collections
  • Video summaries can be keyframes (static
    storyboards, dynamic slideshows), skims (fixed or
    variable speed) or multi-dimensional browsers
  • Previous work in the literature shows interest in
    evaluating summaries, but datasets have always
    been small, single-site and closed
  • TVS07 and TVS08 were tracks in TRECVID which
    energised work in the area and made available
    data, evaluation metrics, etc.

4
Summarisation Data
  • BBC provided 11 boxes of BETA SP tapes: 250 hours
    of rushes from the dramatic series Casualty, House
    of Elliot, Jonathan Creek, Ancient Greece and
    Between the Lines, plus other miscellaneous
    material
  • Rushes video is pre-production footage with lots
    of noise and redundancy; it was digitised into
    MPEG-1 and used for training and testing

5
TRECVid Measures and Process
  • TRECVid created groundtruth and invited
    summaries of at most 4% (2007) and 2% (2008) of
    the duration of the original video
  • The task is to reduce the time needed to grasp
    content; the constraint is a single playback, with
    no interaction except play/pause
  • Subjective measures used:
    - Fraction of (12 items of) ground truth found
    - Ease of use
    - Amount of near-redundancy
  • Objective measures used:
    - Assessment time to judge included ground truth
    - Summary duration
    - Summary creation compute time

6
TRECVid Measures and Process (continued)
  • Lack of junk video
  • Lack of redundancy
  • Pleasant tempo and rhythm
  • GT Assessment Time

7
22 Participating groups 2007
  • AT&T: shot clustering to remove redundancy, use
    shot with most speech/faces
  • Brno Univ.: cluster shots using PCA, remove junk
    shots
  • CMU: k-means clustering using iterative colour
    matching, audio coherence
  • City U HK: object detection, camera motion,
    keypoint matching for repetitive shots
  • Columbia: duplicate shot detection and ASR
  • COST292: face, camera motion, audio excitement
  • Curtin U: shot clustering using SIFT matching
  • DCU: amount of motion and faces for keyframe
    selection
  • FXPAL: colour distribution, camera motion, for
    repetition detection
  • HUT: SOMs for shot pruning to eliminate
    redundancy
  • HKPU: junk shot removal, visual and aural
    redundancy

8
22 Participating groups 2007
  • Eurecom: determine the most non-redundant shots
  • Joanneum: variant of LCSS to cluster re-takes of
    the same scene
  • KDDI: use only low-level features for fast
    summarisation
  • LIP6: eliminate repeating shots using stacking
    technique
  • NII: feature extraction and clustering
  • Natl. Taiwan U: low-level shot similarity and
    motion vectors, then cluster
  • Tsinghua/Intel: keyframe clustering, repetitive
    segments, main scenes/actors
  • UCSB: k-means clustering on high-level features,
    speech, camera motion
  • Glasgow: 0-1 knapsack optimisation problem, shot
    clustering
  • UA Madrid: single pass for realtime clustering
    on-the-fly, colour-based
  • Sheffield: concatenate some frames from the
    middle of each shot

9
31 Participating groups 2008
  • Almost all groups explicitly searched for and
    removed junk frames
  • The majority used some form of clustering of
    shots/scenes in order to detect redundancy
  • Several groups included face detection as a
    component
  • Most groups used visual features only, though
    some also used audio in selecting segments to
    include in the summary
  • Camera motion/optical flow was used by some
  • Most groups used the whole frame for selection,
    though some also used frame regions

10
Summary generation
  • There was even more variety among techniques for
    summary generation than for summary selection
  • Many groups used fast-forward (FF) or
    variable-speed FF (VSFF) video playback
  • Several groups incorporated visual indicator(s)
    of offset into the original video source, within
    the summary
  • Some used an overall storyboard of keyframes
  • Presentation styles observed:
    - Plain keyframes
    - Plain clips
    - Clips of 1s duration
    - FF and VSFF clips
    - Main scene/actor re-caps
    - Clips with indicators of offset/re-takes
    - Clips with picture-in-picture
    - Clips in 4-windows
11
Fraction of GT found vs. ease of use, 2007 (results chart)
12
What is the best combination?
  • From a high-level analysis of participant
    approaches and performance in 2007, we were able
    to pick out promising-looking techniques, which
    is what we did.

13
Our Approach
  • Our Approach (take 2!)

14
Our Approach
  • Video Structuring
  • Used a mutual information (MI) measure for shot
    detection
  • Let $C_{t,t+1}^R(i,j)$ be the probability that a
    pixel with gray level $i$ in frame $f_t$ has gray
    level $j$ in frame $f_{t+1}$ (similarly for G
    and B)
  • The mutual information of frames $f_k$, $f_l$ for
    the R component is expressed as
    $I_{k,l}^R = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1}
    C_{k,l}^R(i,j) \log \frac{C_{k,l}^R(i,j)}
    {C_k^R(i)\, C_l^R(j)}$
  • The total mutual information between frames $f_k$
    and $f_l$ is defined as
    $I_{k,l} = I_{k,l}^R + I_{k,l}^G + I_{k,l}^B$
  • Local mutual information mean values on a
    temporal window $W$ of size $N_w$ for frame $f_t$
    are calculated as
    $\bar{I}_t = \frac{1}{N_w} \sum_{f_j \in W}
    I_{j,j+1}$

15
Proposed Approach
  • Mutual information measure for shot detection
  • The standard deviation of mutual information on
    the window is calculated as
    $\sigma_t = \sqrt{\frac{1}{N_w} \sum_{f_j \in W}
    (I_{j,j+1} - \bar{I}_t)^2}$
  • Step 1: calculate the mutual information time
    series $I_{t,t+1}$
  • Step 2: calculate $\bar{I}_t$ and $\sigma_t$ at
    each temporal window in which $f_t$ is the first
    frame
  • Step 3: if $I_{t,t+1}$ drops far enough below
    $\bar{I}_t$ relative to $\sigma_t$, frame $f_t$
    is determined to be a shot boundary (see the
    sketch below)
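To make the three steps concrete, here is a minimal Python sketch of MI-based shot boundary detection. The window size and threshold are illustrative placeholders, not the values tuned in this work, and frames are assumed to be already-decoded uint8 RGB arrays:

```python
import numpy as np

def frame_mutual_information(fa, fb, levels=256):
    """Total MI between two frames: sum of per-channel (R, G, B) MI,
    following the definitions above."""
    total = 0.0
    for c in range(3):
        # C(i, j): joint probability of gray level i in fa and j in fb
        joint, _, _ = np.histogram2d(fa[..., c].ravel(), fb[..., c].ravel(),
                                     bins=levels, range=[[0, levels]] * 2)
        joint /= joint.sum()
        pa, pb = joint.sum(axis=1), joint.sum(axis=0)   # marginals
        nz = joint > 0                                  # avoid log(0)
        total += np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
    return total

def shot_boundaries(frames, n_w=10, thresh=3.0):
    """Steps 1-3: flag f_t as a boundary when I_{t,t+1} drops well below
    the local window mean, measured in local standard deviations."""
    mi = np.array([frame_mutual_information(frames[t], frames[t + 1])
                   for t in range(len(frames) - 1)])
    cuts = []
    for t in range(len(mi) - n_w):
        w = mi[t:t + n_w]                # window with f_t as its first frame
        if (w.mean() - mi[t]) / (w.std() + 1e-9) > thresh:
            cuts.append(t)
    return cuts
```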

16
Proposed Approach
  • Sub-shot partitioning
  • Partition each frame with an 8x8 pixel grid and
    calculate the mean and variance of RGB colour in
    each grid cell
  • Use Euclidean distance to measure the difference
    between neighbouring frames
  • Within one sub-shot the cumulative frame
    difference shows gradual change; high curvature
    points on the cumulative frame difference curve
    are very likely to indicate sub-shot boundaries
    (see the sketch below)
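A sketch of the sub-shot partitioning step under the same assumptions (decoded RGB frames); the curvature approximation and its threshold are illustrative choices, not the authors' exact settings:

```python
import numpy as np

def grid_features(frame, grid=8):
    """Per-cell mean and variance of RGB over a grid of 8x8 cells."""
    frame = frame.astype(np.float64)
    h, w, _ = frame.shape
    gh, gw = h // grid, w // grid
    cells = frame[:gh * grid, :gw * grid].reshape(grid, gh, grid, gw, 3)
    return np.concatenate([cells.mean(axis=(1, 3)).ravel(),
                           cells.var(axis=(1, 3)).ravel()])

def subshot_boundaries(frames):
    """Split where the cumulative frame-difference curve bends sharply;
    high curvature is approximated by the curve's second difference."""
    diffs = [np.linalg.norm(grid_features(a) - grid_features(b))
             for a, b in zip(frames, frames[1:])]
    cum = np.cumsum(diffs)
    curvature = np.abs(np.diff(cum, 2))
    thr = curvature.mean() + 2.0 * curvature.std()   # illustrative threshold
    return [i + 1 for i in np.flatnonzero(curvature > thr)]
```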

17
Proposed Approach
  • Useless content removal
  • Shots of less than 1 second are removed outright
  • We extracted four features from frames: colour
    layout, scalable colour, edge histogram and
    homogeneous texture
  • Built SVM classifiers for detecting colour bars
    and monochromatic shots (see the sketch below)
  • Used the algorithm for Near-Duplicate Keyframe
    (NDK) detection described by C. W. Ngo et al. to
    detect clapperboards
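A hedged sketch of the junk-shot removal stage, assuming the four descriptors have already been extracted into fixed-length vectors and that labelled training keyframes exist; the shot/keyframe data layout here is hypothetical:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_junk_classifier(X, y):
    """X: one row per keyframe, concatenating colour layout, scalable
    colour, edge histogram and homogeneous texture descriptors.
    y: 1 for junk (colour bars, monochromatic frames), 0 otherwise."""
    return make_pipeline(StandardScaler(), SVC(kernel='rbf')).fit(X, y)

def remove_useless_shots(shots, clf, min_duration=1.0):
    """Drop shots shorter than 1 s, then shots whose keyframes the SVM
    mostly labels as junk (clapperboard/NDK detection not shown here)."""
    kept = []
    for shot in shots:                     # hypothetical dict layout
        if shot['duration'] < min_duration:
            continue
        feats = np.vstack([kf['features'] for kf in shot['keyframes']])
        if clf.predict(feats).mean() > 0.5:
            continue
        kept.append(shot)
    return kept
```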

18
Proposed Approach
  • Re-take Shot Detection and Removal
  • In rushes video, the same shot can be re-taken
    many times in order to eliminate actor or filming
    mistakes.

19
Proposed Approach
  • Re-take Shot Detection and Removal: the principle
  • The similarity between shots can be measured
    according to the similarity of key frames
    extracted from the corresponding shots.
  • Re-take shots can be detected by modeling the
    continuity of similarity of key frames.
  • We use maximal matching in bipartite graphs for
    similarity detection between video shots, and
    patterns in these graphs indicate shot re-takes.
  • The similarity measure between video shots is
    divided into two phases: key frame similarity and
    shot similarity.

20
Proposed Approach
  • Re-take Shot Detection and Removal
  • Key frame similarity component:
    - A video shot is partitioned into several
      sub-shots and one key frame is extracted from
      each sub-shot.
    - The similarity between the corresponding key
      frames is used in place of the similarity
      between sub-shots.
    - Key frame similarity is measured according to
      spatial colour histogram and texture features.
  • Shot similarity using maximal matching in
    bipartite graphs:
    - A shot is expressed as $S_x = \{x_1, x_2,
      \dots, x_n\}$, where $x_i$ represents the ith
      key frame.
    - For two shots $S_x$ and $S_y$, the similar key
      frames between $S_x$ and $S_y$ can be expressed
      by a bipartite graph $G = (V, E)$, where
      $V = S_x \cup S_y$ and an edge $(x_i, y_j) \in
      E$ indicates that $x_i$ is similar to $y_j$
      (see the sketch below).
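A minimal sketch of building the bipartite graph between two shots; the feature vectors, the cosine similarity measure and the threshold are assumptions standing in for the paper's spatial colour histogram and texture comparison:

```python
import numpy as np

def keyframes_similar(kf_a, kf_b, thresh=0.8):
    """Cosine similarity between two key frame feature vectors."""
    sim = kf_a @ kf_b / (np.linalg.norm(kf_a) * np.linalg.norm(kf_b) + 1e-9)
    return sim > thresh

def bipartite_graph(shot_x, shot_y):
    """Adjacency matrix E of G: E[i, j] is True when the ith key frame
    of S_x is similar to the jth key frame of S_y."""
    return np.array([[keyframes_similar(x, y) for y in shot_y]
                     for x in shot_x])
```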

21
Proposed Approach
  • Re-take Shot Detection and Removal
  • Shot similarity using maximal matching in
    bipartite graphs
  • There can exist many similar pairs of key frames
    between two re-take shots, and often within a
    single re-take shot.
  • This results in one-to-many, many-to-one and
    many-to-many relations in the bipartite graph; in
    that case, many similar key frame pairs may be
    found even between two dissimilar shots.

22
Proposed Approach
  • Re-take Shot Detection and Removal
  • So, the similarity between two shots is measured
    by the maximal matching M of similar key frames.
  • The Hungarian algorithm is used to calculate the
    maximal matching M.
  • If the size of M is large enough relative to n
    and m, the numbers of key frames in the two
    shots, one shot is determined to be similar to
    the other (see the sketch below).
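A sketch of the matching step in SciPy. Note that scipy's maximum_bipartite_matching computes a maximum matching on a 0/1 graph and is used here as a stand-in for the Hungarian algorithm; the acceptance ratio is an assumed placeholder, not the paper's tuned criterion:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching

def shots_similar(edges, ratio=0.6):
    """edges: boolean (n x m) adjacency from bipartite_graph() above.
    Returns True when the maximal matching M covers enough key frames."""
    match = maximum_bipartite_matching(csr_matrix(edges), perm_type='column')
    matched = int((match != -1).sum())      # |M|, size of the matching
    n, m = edges.shape
    return matched / max(n, m) >= ratio
```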

23
Proposed Approach
  • Selecting Representative Shots and Summary
    Generation
  • With a 4% duration limit, the most representative
    clips need to be selected to generate the final
    summary.
  • Motion and face factors are used to rank how
    representative each remaining sub-shot is in the
    context of the overall video.
  • Sub-shot duration is also important for sub-shot
    selection, so we use a simple weighting to
    combine the factors (see the sketch below).
  • The sub-shots with the highest scores are
    selected subject to the summary duration limit.
  • Finally, a 1-second clip centred around the
    keyframe of each selected sub-shot is extracted
    to generate the final summary.
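A sketch of the selection step described above; the per-sub-shot motion, face and duration scores are assumed to be precomputed and normalised to [0, 1], and the weights and dict layout are illustrative placeholders:

```python
def summarise(subshots, limit_sec, clip_sec=1.0,
              w_motion=0.4, w_face=0.3, w_duration=0.3):
    """Greedily take the highest-scoring sub-shots until the summary
    duration limit is hit; one 1-second clip per selected sub-shot,
    centred on its keyframe (hypothetical dict layout)."""
    def score(s):
        return (w_motion * s['motion'] + w_face * s['face']
                + w_duration * s['duration_norm'])
    selected, used = [], 0.0
    for s in sorted(subshots, key=score, reverse=True):
        if used + clip_sec > limit_sec:
            break
        selected.append((s['keyframe_time'] - clip_sec / 2,
                         s['keyframe_time'] + clip_sec / 2))
        used += clip_sec
    return sorted(selected)                 # restore temporal order
```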

24
Experiments & Results
  • Experimental Setup
  • The seven criteria set by the TRECVid guidelines
    for summarization evaluation are:
    - EA: Easy to understand (1 strongly disagree - 5
      strongly agree)
    - RE: Little duplicate video (1 strongly disagree
      - 5 strongly agree)
    - IN: Fraction of inclusions found in the summary
      (0 - 1)
    - DU: Duration of the summary (sec)
    - XD: Difference between target and actual
      summary size (sec)
    - TT: Total time spent judging the inclusions
      (sec)
    - VT: Video play time (vs. pause) to judge the
      inclusions (sec)
  • Ten participants were selected to review the
    summaries under exactly the same guidelines as
    provided by NIST and to give their scores for the
    four subjective criteria.

25
Experiments & Results
  • Investigation into subjective variation in the
    evaluation process
  • By running our own evaluation we could
    potentially introduce new subjective variation
    into the evaluation process.
  • So, we first evaluated three sets of results: the
    two TRECVid baselines and our own original
    submission.

26
Experiments & Results
  • Experimental results on our summaries
  • Our enhanced approach results in a big
    improvement in IN (0.40) with a slightly longer
    duration of summaries (0.71 sec) compared with
    our original approach.
  • Our enhanced approach's XD is 18.83, which is 8.5
    sec longer than the mean of the other 22 teams.
  • (IN: fraction of inclusions; DU: duration; XD:
    target vs. actual duration difference)

27
Experiments & Results
  • Experimental results on all of our summaries
  • We obtained very encouraging results for EA and
    RE.
  • Results show our enhanced approach performs
    competitively compared with the other teams and
    the baselines.
  • (EA: ease of understanding; RE: little
    duplication; TT: time taken to judge; VT: video
    play time)

28
Conclusions
  • Rushes videos include many useless and redundant
    shots and are organised based on (filmic) shot
    structure
  • Shot and sub-shot detection for video structuring
    really seems to help in selecting material for
    rushes summaries
  • The SVM-based method for removing useless content
    proved useful
  • We introduced modelling the similarity of key
    frames between two shots with bipartite graphs,
    measuring shot similarity by maximal matching for
    re-take shot detection, and using mutual
    information for shot boundary detection
  • Selecting the most representative clips based on
    the motion, faces and duration of sub-shots seems
    to work
  • We obtained improvements compared to our original
    approach, but more importantly compared to the
    other participating teams and the TRECVID
    baselines
  • Video summarization clearly remains challenging
  • Most submissions could not significantly
    outperform the two baselines; we were lucky, and
    we made good, informed guesses!
  • A deeper semantic understanding of the content
    could help in this regard

29
Thank You