Talking Heads - PowerPoint PPT Presentation

Transcript and Presenter's Notes


1
Talking Heads
  • Rakhi Motwani

Sample-Based Synthesis of Photo-Realistic Talking
Heads Eric Cosatto, Hans Peter Graf AT&T
Labs-Research, 100 Schulz Drive, Room
3-124,134, Red Bank, NJ 07701-7033, USA
2
Introduction
  • What are Talking Heads?
  • - Talking Heads are photorealistic video
    animations based on computational and cognitive
    models of audio-visual speech.
  • Applications
  • - Talking Heads increase the intelligibility of
    the human-machine interface.
  • - Video conferencing, entertainment industry,
    gaming, e-commerce, e-learning.

3
Introduction
  • This paper describes a system that generates
    photorealistic video animations of talking heads.
  • The system derives head models from existing
    video footage using image recognition techniques.
  • It locates, extracts and labels facial parts such
    as mouth, eyes, and eyebrows into a compact
    library.
  • Then, using these face models and a
    text-to-speech synthesizer, it synthesizes new
    video sequences of the head where the lips are in
    synchrony with the accompanying soundtrack.
    Emotional cues and conversational signals are
    produced by combining head movements, raising
    eyebrows, wide open eyes, etc. with the mouth
    animation.

4
Topics of Discussion
  • 3D Head Modeling and Facial Animation
  • - Head and face parts
  • - Visemes
  • - Head movement and emotional expressions
  • - Synthesis

5
Related Work
  • In traditional computer graphics, a 3D wireframe
    model is created that represents the shape of the
    head.
  • Then an "intelligent layer" is added to provide
    high-level control such as muscle-actions.

6
- Then a texture map is used to provide natural
appearance of the skin.
Figure: The 3D face mesh is parameterized over a
2D domain and the texture is resampled from
several input photographs.
7
- Due to the complexities of the face's
appearance (highly deformable lips and tongue,
countless creases and wrinkles of the skin) and
the complex dynamics of speech (mechanics of the
jaw, tongue and lips, co-articulation effects
stemming from the brain's adaptation to the vocal
apparatus) such systems result in unnatural
animations.
8
Authors' Approach
  • The head model is a hierarchical set of facial
    parts containing 3D shape information as well as
    a set of image samples.
  • The 3D shape models the "overall" shape of the
    facial parts and does not attempt to model the
    details, which are instead captured and rendered
    via the sample images.
  • These 3D shapes are used to render the facial
    parts under a range of views by texture-mapping
    the image samples, thus allowing head movements.

9
Head and Face Parts
Separating the head into facial parts is essential
for two reasons.
- First, it reduces the total number of parts
needed to animate a talking head.
- Second, it reduces the number of parameters
needed to describe a part.
To generate a novel appearance, the base head (1)
is combined with the mouth (5), eyes (7) and brows
(6). The mouth area is generated by overlaying the
lips (2) over the upper teeth (3) and the lower
teeth and jaw (4). This allows animating a jaw
rotation independently of the lip shape.
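The layered combination can be sketched as alpha compositing; a minimal sketch, assuming each part sample carries an alpha mask (the function names and blending scheme are illustrative, not the paper's implementation):

```python
import numpy as np

def overlay(base, part, alpha, y, x):
    """Alpha-blend `part` onto a copy of `base`, top-left corner at (y, x)."""
    out = base.copy()
    h, w = part.shape[:2]
    a = alpha[..., None]                       # broadcast mask over RGB
    out[y:y+h, x:x+w] = a * part + (1.0 - a) * out[y:y+h, x:x+w]
    return out

def compose_head(base, layers):
    """Composite face-part layers in order (teeth and jaw first, then
    lips, then eyes and brows), so the jaw rotation can change without
    touching the lip layer."""
    frame = base
    for part, alpha, (y, x) in layers:
        frame = overlay(frame, part, alpha, y, x)
    return frame
```

Because each part is a separate layer, swapping the jaw sample leaves the lip layer untouched, which is exactly what makes jaw rotation independent of lip shape.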
10
Visemes
  • phoneme - any of the abstract units of the
    phonetic system of a language that correspond to
    a set of similar speech sounds (as the velar \k\
    of cool and the palatal \k\ of keel) which are
    perceived to be a single distinctive sound in the
    language
  • A mouth shape articulating a phoneme is often
    referred to as a viseme.
  • The authors use twelve visemes, namely a, e, ee,
    o, u, f, k, l, m, t, w, closed.
  • All the other phonemes are mapped onto this set.
  • In order to compare mouth shapes, we need to
    describe them with a set of parameters. For a
    first classification of the mouth shapes we
    choose three parameters: mouth width, the
    position of the upper lip, and the position of
    the lower lip.
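This three-parameter space with a Euclidean distance can be sketched as follows; the prototype coordinates are made-up values for illustration, not measurements from the paper:

```python
import math

# The paper's twelve visemes; all other phonemes are mapped onto this set.
VISEMES = ["a", "e", "ee", "o", "u", "f", "k", "l", "m", "t", "w", "closed"]

def mouth_distance(p, q):
    """Euclidean distance in (width, upper-lip position, lower-lip
    position) space -- the distance metric between mouth shapes."""
    return math.dist(p, q)

def nearest_viseme(shape, prototypes):
    """Return the viseme whose prototype mouth shape is closest to `shape`."""
    return min(prototypes, key=lambda v: mouth_distance(shape, prototypes[v]))
```

For example, with hypothetical prototypes `{"closed": (30, 0, 0), "a": (40, 10, -15)}` (in pixels), a measured shape `(38, 8, -12)` classifies as "a".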

11
  • The table shows this parameterization for a few
    visemes.
  • This representation is very convenient, since it
    defines a distance metric between mouth shapes
    and represents all possible shapes in a compact,
    low-dimensional space.
  • Every mouth sample is mapped to a point in this
    parameter space. Now we can cover all the
    possible appearances of the mouth simply by
    populating the parameter space with samples at
    regular intervals.
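Populating the parameter space at regular intervals amounts to quantizing each sample's parameter vector to a grid cell and keeping one sample per cell; a minimal sketch with hypothetical names and a made-up grid step:

```python
def grid_index(shape, step):
    """Quantize a mouth-shape parameter vector to its grid cell."""
    return tuple(round(x / step) for x in shape)

def build_library(samples, step=2.0):
    """Keep one extracted sample per grid cell, so the compact library
    covers the low-dimensional shape space at regular intervals."""
    library = {}
    for frame_id, shape in samples:
        # First sample to land in a cell wins; near-duplicates are dropped.
        library.setdefault(grid_index(shape, step), frame_id)
    return library
```

At synthesis time, the cell index of a desired mouth shape then retrieves a matching sample directly.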

12
Sample Extraction
  • Using image recognition techniques, samples of
    face parts, such as the lips, are extracted from
    video sequences of a talking person.

Figure: Examples of processing steps to identify
facial features. Panel 1 shows an intermediate
step of the shape analysis used to locate the head
and give an estimate of the positions of a few
facial features, as shown in panel 2. Panel 3
shows a result of the color segmentation used to
locate the eyes and eyebrows, as shown in panel 4.
Panel 5 illustrates an intermediate step of the
shape analysis, used to locate the nostrils and
mouth. Panel 6 shows the results of the shape and
color analysis, locating the outer and inner edges
of the lips, the teeth and the nostrils.
13
  • From the 2D position of the eyes-nostrils plane
    in the image and the relative 3D position of
    these features in the "real world", the pose of
    the head is derived. The 2D position of the mouth
    plane is then derived from the head pose (1).
    Then the 2D mouth plane is warped and features
    can be measured (2). Once labeled, each "viseme"
    is stored in a grid for indexing (3).
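As a rough illustration of deriving pose from 2D-3D feature correspondences, here is a generic least-squares affine-camera fit; this is a sketch under simplifying assumptions (an affine camera, made-up feature coordinates), not the paper's pose algorithm:

```python
import numpy as np

def fit_pose(points3d, points2d):
    """Least-squares affine camera (2x4) mapping known 3D feature
    positions (eyes, nostrils) to their observed 2D image positions."""
    n = len(points3d)
    A = np.hstack([points3d, np.ones((n, 1))])        # n x 4 homogeneous
    P, *_ = np.linalg.lstsq(A, points2d, rcond=None)  # 4 x 2
    return P.T                                        # 2 x 4 camera

def project(P, points3d):
    """Project further 3D points (e.g. the mouth-plane corners) with
    the fitted camera, as in step (1) of the pipeline above."""
    A = np.hstack([points3d, np.ones((len(points3d), 1))])
    return A @ P.T
```

Once the camera is fitted from the eye and nostril features, projecting the mouth plane localizes the region to warp and measure.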

14
Synthesis
  • A text-to-speech synthesizer (AT&T's Flextalk)
    provides the audio track of the animation as well
    as phoneme information at each time step. Figure
    4 shows the block diagram of the whole system,
    including speech synthesizer and talking head
    generator.
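The driving loop can be sketched as turning the synthesizer's phoneme timing into a per-frame viseme sequence; the phoneme-to-viseme table below is hypothetical (the paper maps all phonemes onto its twelve visemes but does not list the exact mapping), and the function names are illustrative:

```python
# Hypothetical phoneme-to-viseme table using ARPAbet-style phoneme labels.
PHONEME_TO_VISEME = {"AA": "a", "IY": "ee", "OW": "o", "F": "f",
                     "M": "m", "T": "t", "W": "w", "SIL": "closed"}

def frames_for_track(phoneme_track, fps=30):
    """Turn (phoneme, start_s, end_s) tuples, as reported by the TTS
    engine at each time step, into a per-frame viseme sequence that
    drives the mouth animation in sync with the audio track."""
    frames = []
    for phoneme, start, end in phoneme_track:
        viseme = PHONEME_TO_VISEME.get(phoneme, "closed")
        n = max(1, round((end - start) * fps))  # duration in video frames
        frames.extend([viseme] * n)
    return frames
```

Each frame's viseme label then selects a mouth sample from the library, which is composited onto the base head.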

15
Conclusion and Future Work
  • There are several possible ways of generating
    animated talking heads, and the preferred method
    depends on the specific requirements of the
    application.
  • The strengths of 3D head models and those of
    sample-based techniques are complementary to a
    large extent. In an environment with a limited
    number of movements and views a sample-based
    approach looks promising. It can deliver
    photo-realism that is still hard to match with
    texture-mapped models.
  • However, if the emphasis is on a wide range of
    motions, the greater flexibility makes a 3D model
    advantageous.
  • Future directions include the use of a generic 3D
    model onto which samples can be texture mapped.
    This would increase the flexibility of the
    synthesis and would allow richer movements. To
    maintain a photo-realistic appearance, sample
    views from all sides of the head are then
    required.

16
Questions?