Automatic Lip-Synchronization Using Linear Prediction of Speech

Learn more at: http://www.cs.uccs.edu

1
Automatic Lip-Synchronization Using Linear
Prediction of Speech
  • Christopher Kohnert
  • SK Semwal
  • University of Colorado, Colorado Springs

2
Topics of Presentation
  • Introduction and Background
  • Linear Prediction Theory
  • Sound Signatures
  • Viseme Scoring
  • Rendering System
  • Results
  • Conclusions

3
Justification
  • Need
  • Existing methods are labor intensive
  • Poor results
  • Expensive
  • Solution
  • Automatic method
  • Decent results

4
Applications of Automatic System
  • Typical applications benefiting from an automatic
    method
  • Real-time video communication
  • Synthetic computer agents
  • Low-budget animation scenarios
  • Video games industry

5
Automatic Is Possible
  • Spoken word is broken into phonemes
  • Phonemes are comprehensive
  • Visemes are visual correlates
  • Used in lip-reading and traditional animation

6
Existing Methods of Synchronization
  • Text Based
  • Analyze text to extract phonemes
  • Speech Based
  • Volume tracking
  • Speech recognition front-end
  • Linear Prediction
  • Hybrids
  • Text + Speech
  • Image + Speech

7
Speech Based is Best
  • Doesn't need a script
  • Fully automatic
  • Can use original sound sample (best quality)
  • Can use source-filter model

8
Source-Filter Model
  • Models a sound signal as a source passed through
    a filter
  • Source: lungs and vocal cords
  • Filter: vocal tract
  • Implemented using Linear Prediction

9
Speech Related Topics
  • Phoneme recognition
  • How many to use?
  • Mapping phonemes to visemes
  • Use visually distinctive ones (e.g. vowel sounds)
  • Coarticulation effect

10
The Coarticulation Effect
  • The blending of sound based on adjacent phonemes
    (common in every-day speech)
  • Artifact of discrete phoneme recognition
  • Causes poor visual synchronization (transitions
    are jerky and unnatural)

11
Speech Encoding Methods
  • Pulse Code Modulation (PCM)
  • Vocoding
  • Linear Prediction

12
Pulse Code Modulation
  • Raw digital sampling
  • High quality sound
  • Very high bandwidth requirements
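
The bandwidth cost of raw sampling follows from simple arithmetic: bit rate = sample rate × bits per sample × channels. The sketch below works that out for two common sampling setups. The slides do not say which rates the authors used, so the figures are illustrative assumptions, and `pcm_bit_rate` is a hypothetical helper name.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Bits per second needed to carry uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels


# Assumed example settings (not taken from the presentation).
print(pcm_bit_rate(8000, 16))    # telephone-style rate, 16-bit mono -> 128000
print(pcm_bit_rate(44100, 16))   # CD-style rate, 16-bit mono -> 705600
```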

13
Vocoding
  • Stands for VOice-enCODing
  • Origins in military applications
  • Models physical entities (tongue, vocal cord,
    jaw, etc.)
  • Poor sound quality (tin can voices)
  • Very low bandwidth requirements

14
Linear Prediction
  • Hybrid method (of PCM and Vocoding)
  • Models sound source and filter separately
  • Uses original sound sample to calculate
    recreation parameters (minimum error)
  • Low bandwidth requirements
  • Pitch and intonation independence

15
Linear Prediction Theory
  • Source-Filter model
  • P coefficients are calculated

[Diagram: source-filter model, with Source and Filter blocks]
16
Linear Prediction Theory (cont.)
  • The a_k coefficients are found by minimizing the squared error
    between the original sound (s_t) and the reconstructed sound
    (ŝ_t).
  • Can be solved using Levinson-Durbin recursion.
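
As a rough illustration of the step above, here is a minimal pure-Python sketch of the autocorrelation method solved with the Levinson-Durbin recursion. It assumes the predictor convention ŝ_t = Σ a_k · s_(t-k) for k = 1..P. The function names are hypothetical and the presentation does not show the authors' actual implementation.

```python
def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], the (unnormalized) autocorrelation of one frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the linear-prediction normal equations.

    Returns (a, err): a[k-1] is the coefficient a_k of the predictor
    s_hat[t] = sum_k a_k * s[t-k], and err is the remaining
    prediction-error energy.
    """
    a = [0.0] * (order + 1)          # a[0] is unused padding
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:               # silent or degenerate frame
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= (1.0 - k * k)
    return a[1:], err


# A decaying exponential obeys s[t] = 0.5 * s[t-1], so the recursion
# should recover a_1 close to 0.5 and a_2 close to 0.
frame = [0.5 ** t for t in range(50)]
coeffs, error = levinson_durbin(autocorrelation(frame, 2), 2)
print(coeffs, error)
```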

17
Linear Prediction Theory (cont.)
  • Coefficients represent the filter part
  • The filter is assumed constant for small
    windows on the original sample (10-30ms
    windows)
  • Each window has its own coefficients
  • Sound source is either Pulse Train (voiced) or
    white noise (unvoiced)
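
The windowing and the two source types can be sketched as follows. This is an illustrative sketch only: the 20 ms default, the pitch period, and all function names are assumptions, and the all-pole synthesis uses the convention s_t = e_t + Σ a_k · s_(t-k).

```python
import random


def split_into_windows(samples, sample_rate, window_ms=20.0):
    """Cut a signal into non-overlapping windows; any short tail is dropped."""
    size = int(sample_rate * window_ms / 1000.0)
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, size)]


def excitation(n, voiced, pitch_period=80, seed=0):
    """Source signal: a pulse train if voiced, white noise if unvoiced."""
    if voiced:
        return [1.0 if t % pitch_period == 0 else 0.0 for t in range(n)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def synthesize(source, a):
    """Pass the source through the all-pole filter given by a[0..P-1] = a_1..a_P."""
    out = []
    for t, e in enumerate(source):
        past = sum(a[k] * out[t - 1 - k]
                   for k in range(len(a)) if t - 1 - k >= 0)
        out.append(e + past)
    return out


# One 20 ms voiced window at an assumed 8 kHz sample rate, filtered
# with a single assumed coefficient a_1 = 0.5.
window = synthesize(excitation(160, voiced=True), [0.5])
print(window[:4])
```

In a full system each window would get its own coefficient list, so `synthesize` would be called once per window with that window's coefficients.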

18
Linear Prediction for Recognition
  • Recognition on raw coefficients is poor
  • Better to FFT the values
  • Take only the first half of the FFT'd values
  • This is the signature of the sound
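
A minimal sketch of this signature step, under stated assumptions: the coefficients are zero-padded to 32 points (so the first half has 16 values, matching the signature size on the next slide), and the magnitude of each bin is kept. The slides do not say whether magnitudes, padding, or any sign convention on the coefficients were used, so those choices and the function names are hypothetical. The second half is dropped because the magnitude spectrum of real-valued input is symmetric and carries no extra information.

```python
import cmath


def dft(values):
    """Plain O(n^2) discrete Fourier transform (stands in for an FFT)."""
    n = len(values)
    return [sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def sound_signature(coeffs, fft_size=32):
    """Zero-pad the LP coefficients, transform, keep first-half magnitudes."""
    if len(coeffs) > fft_size:
        raise ValueError("more coefficients than transform points")
    padded = list(coeffs) + [0.0] * (fft_size - len(coeffs))
    spectrum = dft(padded)
    return [abs(x) for x in spectrum[:fft_size // 2]]


# Made-up coefficients for one window, just to show the output shape.
signature = sound_signature([0.9, -0.4, 0.1])
print(len(signature))
```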

19
Sound Signatures
  • 16 values represent the sound
  • Speaker independent
  • Unique for each phoneme
  • Easily recognized by machine
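
One simple way a machine could recognize a signature is nearest-neighbor matching against stored reference signatures. The slides do not state which matching rule the authors used, so the Euclidean distance below is an assumption, and the reference values are made-up two-element placeholders rather than real 16-value signatures.

```python
import math


def distance(sig_a, sig_b):
    """Euclidean distance between two equal-length signatures."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(sig_a, sig_b)))


def classify(signature, references):
    """Return the name of the reference signature closest to `signature`."""
    return min(references,
               key=lambda name: distance(signature, references[name]))


# Hypothetical reference set: phoneme label -> signature.
references = {"ah": [1.0, 0.0], "oo": [0.0, 1.0]}
print(classify([0.9, 0.2], references))
```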

20
Viseme Scoring
  • Phonemes were chosen judiciously
  • Map one-to-one to visemes
  • Visemes scored independently using history
  • $V_i = 0.9\,V_{i-1} + 0.1 \cdot m_i$, where $m_i = 1$ if matched
    at $i$, else $0$
  • Ramps up and down with successive
    matches/mismatches
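
The scoring rule can be sketched as an exponential moving average, reading the update above as 0.9 times the previous score plus 0.1 times a 0/1 match flag. The viseme names and the function name are hypothetical.

```python
def update_scores(scores, matched_viseme, decay=0.9):
    """Advance every viseme score by one window.

    The matched viseme is pulled toward 1, all others decay toward 0.
    """
    return {name: decay * value
                  + (1.0 - decay) * (1.0 if name == matched_viseme else 0.0)
            for name, value in scores.items()}


# Two matches for "ah" ramp its score up; a mismatch ramps it back down.
scores = {"ah": 0.0, "oo": 0.0}
for match in ["ah", "ah", "oo"]:
    scores = update_scores(scores, match)
    print(match, scores)
```

Because each score moves only a tenth of the way toward its target per window, transitions are smooth, which is also why smoothing shows up as lag in the timing results.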

21
Rendering System
  • Uses Alias|Wavefront's Maya package
  • Built-in support for blend shapes
  • Mapped directly to viseme scores
  • Very expressive and flexible
  • Script generated and later read in
  • Rendered to movie, QuickTime used to add in
    original sound and produce final movie.

22
Results (Timing)
  • Precise timing can be achieved
  • Smoothing introduces lag

23
Results (Other Examples)
  • A female speaker using male phoneme set

  • Slower speech, male speaker
24
Results (Other Examples) (cont.)
  • Accented speech with fast pace

25
Results (Summary)
  • Good with basic speech
  • Good speaker independence (for normal speech)
  • Poor performance when speech:
  • Is too fast
  • Is accented
  • Contains phonemes not in the reference set (e.g.
    w and th)

26
Conclusion
  • Linear Prediction provides several benefits
  • Speaker independence
  • Easy to recognize automatically
  • Results are reasonable, but can be improved

27
Future Work
  • Identify best set of phonemes and visemes
  • Phoneme classification could be improved with
    better matching algorithm (neural net?)
  • Larger phoneme reference set for more robust
    matching

28
Results
  • Simple cases work very well
  • Timing is good and very responsive
  • Robust with respect to speaker
  • Cross-gender, multiple male speakers
  • Fails on accents, speed, unknown phonemes
  • Problems with noisy samples
  • Can be smoothed but introduces lag

29
End
30
Automatic Is Possible
  • Spoken word is broken into phonemes
  • Phonemes are comprehensive
  • Visemes are visual correlates
  • Used in lip-reading and traditional animation
  • Physical speech (vocal cords, vocal tract) can be
    modeled
  • Source-filter model

31
Sound Signatures (Speaker Independence)
32
Sound Signatures (For Phonemes)
33
Results (Normal Speech)
  • Normal speech, moderate pace