1
Speech
  • Dr. Björn Gambäck
  • SICS Swedish Institute of Computer Science AB
  • Stockholm, Sweden

2
Reading Instructions
  • Daniel Jurafsky and James H. Martin: Speech and
    Language Processing
  • Ch. 1; Ch. 2; Ch. 3; Ch. 4, pp. 91-112, 120-133
  • Ch. 6; Ch. 7.1-7; Ch. 8; Ch. 9; Ch. 10
  • Ch. 11.1-3; Ch. 12.4-5; Ch. 14, pp. 501-527;
    Ch. 15.1-2, 4-5
  • Ch. 16; Ch. 17; Ch. 18.1-3; Ch. 19; Ch. 20.1-2;
    Ch. 21
  • Douglas Arnold et al. 1994: Machine Translation:
    An Introductory Guide, Ch. 3, 4, 6, 8, 9.
  • Björn Gambäck 1999: Human Language Technology:
    The Babel Fish.
  • John Kimball 1973: "Seven Principles of Surface
    Structure Parsing in Natural Languages",
    Cognition 2(1), pp. 15-47.
  • Yorick Wilks & Roberta Catizone 2000:
    "Human-Computer Conversation", Encyclopedia of
    Microcomputers, Dekker, New York.
  • Victor Zue & James Glass 2000: "Conversational
    Interfaces: Advances and Challenges", Proceedings
    of the IEEE 88(8), pp. 1166-1180.

3
Left Recursion
  • A left-recursive rule is of the form
  • A → A α
  • A → β
  • where α and β are sequences of symbols that
    don't start with A
  • The rule can be rewritten as (right-recursive,
    see the sketch below)
  • A → β A′
  • A′ → α A′
  • A′ → ε
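A minimal sketch of this rewrite in Python, assuming a grammar stored as a dictionary mapping each non-terminal to a list of right-hand sides; the rule format and the NP/PP example are illustrative, not from the slides:

```python
def eliminate_left_recursion(nonterminal, rules):
    """Rewrite A -> A alpha | beta as A -> beta A', A' -> alpha A' | epsilon."""
    recursive, nonrecursive = [], []
    for rhs in rules[nonterminal]:
        if rhs and rhs[0] == nonterminal:
            recursive.append(rhs[1:])      # the alpha part of A -> A alpha
        else:
            nonrecursive.append(rhs)       # a beta alternative
    if not recursive:
        return rules                       # rule set is not left-recursive
    fresh = nonterminal + "'"              # the new helper non-terminal A'
    rules[nonterminal] = [beta + [fresh] for beta in nonrecursive]
    rules[fresh] = [alpha + [fresh] for alpha in recursive] + [[]]  # [] = epsilon
    return rules

# NP -> NP PP | Det N   becomes   NP -> Det N NP',  NP' -> PP NP' | epsilon
grammar = {"NP": [["NP", "PP"], ["Det", "N"]]}
print(eliminate_left_recursion("NP", grammar))
```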

4
Dialogue
  • Turn-taking
  • Utterances
  • Spoken language phenomena
  • Discourse markers
  • Pauses

5
Content Relevance
  • Relevant to Domain
  • Non Relevant
  • Uninterpretable
  • SelfTalk (one speaker talking to himself or
    herself)
  • OffTopic

6
Verbmobil: Speech-to-speech translation
  • Ca. 900 researchers; 166 million DM (ca. 1
    billion birr)
  • Speaker-independent spontaneous speech.
  • Offers assistance in dialogue situations

7
Verbmobil, phase 1 (1993-1996)
  • Appointment negotiation
  • German / Japanese → English
  • around 2,500 words
  • less than six times real time
  • 74.2% of proposed translations approx. correct

8
Verbmobil, phase 2 (1997-2000)
  • Bidirectional
  • Travel planning
  • Hotel reservation

9
Verbmobil Dialogue Phases
  • Hello
  • The dialogue participants greet each other and
    introduce themselves.
  • Opening
  • The topic to be discussed is introduced.
  • Negotiation
  • The actual negotiation, between opening and
    closing.
  • Closing
  • The discussion is finished (all the participants
    have agreed).
  • Good-Bye
  • The dialogue participants say good-bye to each
    other.

10
Speaker Variation
  • Lexical variation
  • (which words?)
  • Allophonic variation
  • (which sounds?)
  • Dialect
  • Sociolect
  • Hearer
  • Style
  • ...

11
Coarticulation
  • A segment is influenced by its neighbours
  • Assimilation
  • change to be more like the neighbours
  • Deletion

12
Text-to-Speech Synthesis (TTS)
  • Fundamental frequency contour, F0
  • Duration
  • Prosody
  • Mood
  • Canned speech
  • Fill in the blanks
  • Free speech (generation)

13
Text-to-Speech Synthesis, cont.
  • Concatenative synthesis
  • Formant synthesis
  • Articulatory synthesis
  • Diphones
  • Polyphones
  • PSOLA (Pitch-Synchronous Overlap and Add)
  • MBROLA (Multi-Band Resynthesis Overlap and Add)

14
Multimodal Speech Synthesis
  • Lip movements
  • Eyelids

15
Automatic Speech Recognition (ASR)
  • Speaker-dependent vs. speaker-independent
  • (speaker-adaptive?!)
  • Isolated words vs. continuous speech
  • Small vs. large vocabulary
  • (large: 10,000+ words / word units)
  • Broad vocabulary vs. restricted domain
  • Speaker recognition/verification

16
ASR System Components
  • Bayes decision rule
  • All state-of-the-art ASR is based on statistical
    approaches
  • Reasoning under uncertainty
  • "The theory of probability is a system for
    making better guesses" (Feynman)
  • Language model
  • predicts input based on previous words (N-grams)
  • Search engine
  • chooses amongst competing hypotheses

17
ASR System Components, cont.
  • Acoustic analysis
  • Analog speech signal → a sequence of acoustic
    feature vectors (containing characteristic
    information about the spoken utterance)
  • Global search
  • Determines the word sequence (of unknown length)
    that most probably caused the observed sequence
    of acoustic feature vectors

18
Acoustic Confusability
  • Overlap → classification error
  • Acoustic and linguistic context → reduce overlap
19
Decoding
  • Choose the most likely word given an observation
  • P(word | observation) = P(w|O) → maximize P(w|O)
  • P(w|O) = P(O|w) P(w) / P(O)   (Bayes' rule)
  • max P(w|O) = max P(O|w) P(w)
  • (since P(O) is the same for all words)
  • P(w): prior probability (e.g., frequency)
  • P(O|w): likelihood
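A toy numeric illustration of this decision rule in Python; the candidate words and all probability values are made up for the sketch, and P(O) is dropped since it is constant across candidates:

```python
# Hypothetical priors P(w) and likelihoods P(O|w) for one observation O.
candidates = {
    "knee": (0.00024, 2.0e-6),   # (P(w), P(O|w))
    "new":  (0.00360, 1.0e-6),
    "neat": (0.00056, 5.0e-7),
}

# Bayes decision rule: pick the word maximizing P(O|w) * P(w).
best = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
print(best)  # "new": 3.6e-9 beats 4.8e-10 and 2.8e-10
```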

20
Probabilistic Formulation
  • Objective: Minimize word error rate by maximizing
  • P(O|W) P(W) / P(O)
  • Approach: Maximize P(O|W) during training
  • Components
  • P(O|W): observation likelihood (Acoustic Model)
  • P(W): prior probability (Language Model)
  • P(O): acoustic probability
  • (same for all word sequences; ignored during
    maximization)

21
Probabilistic Method
  • Objective: Minimize word error rate by maximizing
  • P(O|W) P(W)
  • Global search
  • Combine two statistical knowledge sources
  • Acoustic Model
  • Language Model
  • Optimal combination → search process

22
ASR System Architecture (Ney 1990)
23
Acoustic Model
  • P(O|W): observation likelihood (Acoustic Model)
  • (Class-dependent probability distribution)
  • How closely does the hypothesized sequence mimic
    the observed acoustic sequence?
  • (the conditional probability of observing the
    acoustic feature vectors x1 … xT when a speaker
    utters the word sequence w1 … wN)
  • Acoustic analysis (spectral analysis)
  • transforms the analog speech signal into a
    sequence of acoustic feature vectors
  • Hidden Markov Models for sub-word units
  • approximate speech speed variations
  • Pronunciation lexicon
  • Decomposition of words into sub-word units

24
Acoustic Analysis
  • Transforms the analog speech signal into a
    sequence of acoustic feature vectors
  • Dependencies on the speaker's voice, the acoustic
    channel, and the environmental conditions should
    be suppressed
  • Short-time spectral analysis of the speech signal
  • performed every 10 milliseconds on a short
    segment
  • (e.g., with a length of 25 milliseconds)
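A minimal framing sketch of this short-time analysis in Python/NumPy; the 25 ms window and 10 ms shift are from the slide, while the 16 kHz sampling rate and the Hamming taper are illustrative assumptions:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, window_ms=25, shift_ms=10):
    window = int(sample_rate * window_ms / 1000)   # 400 samples per segment
    shift = int(sample_rate * shift_ms / 1000)     # a new frame every 160 samples
    n_frames = 1 + (len(signal) - window) // shift
    frames = np.stack([signal[i * shift : i * shift + window]
                       for i in range(n_frames)])
    return frames * np.hamming(window)             # taper before spectral analysis

speech = np.random.randn(16000)        # one second of dummy "speech"
print(frame_signal(speech).shape)      # (98, 400): about 100 frames per second
```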

25
Decomposition into sub-units
  • Not feasible to estimate probability
    distributions for each possible word sequence →
    break into units
  • Small vocabulary recognition tasks: the words
    themselves
  • Larger vocabularies → decompose further into
    sub-word units
  • Pronunciation lexicon
  • phoneme sequences representing the words
  • Coarticulation
  • Triphones
  • (modeled by Hidden Markov Models)

26
Pronunciation Lexicon
  • Store the most likely pronunciations
  • Weighted automaton
  • (each arc has a probability)
  • (the probabilities of all arcs leaving a node
    sum to 1)
  • Pronunciation networks (sketched below)
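A sketch of such a weighted pronunciation network as a Python adjacency map; the phone labels and arc probabilities (loosely modeled on the familiar "tomato" example) are hypothetical, and the probabilities of the arcs leaving each node sum to 1:

```python
# node -> [(phone label, next node, arc probability), ...]
network = {
    "start": [("t", "n1", 1.0)],
    "n1":    [("ax", "n2", 0.68), ("ow", "n2", 0.32)],  # schwa vs. full vowel
    "n2":    [("m", "n3", 1.0)],
    "n3":    [("ey", "n4", 0.74), ("aa", "n4", 0.26)],  # "-mey-" vs. "-mah-"
    "n4":    [("t", "n5", 1.0)],
    "n5":    [("ow", "end", 1.0)],
}

def path_probability(phones, network, node="start"):
    """Multiply the arc probabilities along the path labelled by phones."""
    prob = 1.0
    for phone in phones:
        arcs = {label: (dest, p) for label, dest, p in network[node]}
        if phone not in arcs:
            return 0.0                      # pronunciation not in the network
        node, prob = arcs[phone][0], prob * arcs[phone][1]
    return prob

print(path_probability(["t", "ow", "m", "aa", "t", "ow"], network))  # 0.0832
```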

27
Hidden Markov Models (HMMs)
  • Stochastic finite automata: states and
    transitions
  • State set
  • models the acoustic characteristics of part of
    a triphone
  • Transition probabilities
  • a_ij: probability of a transition from state i
    to state j
  • Observation likelihoods (emission probabilities)
  • b_i(o_t): probability of observation o_t coming
    from state i
  • (acoustic emission probability distribution)

28
Acoustic Modeling, HMMs
  • Hidden Markov Models
  • temporal variation in the transition
    probabilities
  • Gaussian mixture distributions
  • variations in speaker, accent, and pronunciation

(Hamaker 2002)
29
ANN Hybrids
  • Flexible, discriminative classifiers for emission
    probabilities
  • Avoid HMM independence assumptions (can use wider
    acoustic context)
  • Prone to overfitting
  • require cross-validation to determine when to
    stop training
  • No substantial recognition improvements over
    HMM/GMM

30
Recombination Search Strategies
  • The global decision about the most probable word
    sequence is decomposed into local decisions
    within the network.
  • On HMM states and/or on words.
  • Considerable reduction of search paths
  • Two strategies
  • A* (best-first)
  • Viterbi (dynamic programming)

31
Best-First (A*) Search
  • The score of each partial word sequence
    hypothesis is enhanced by an estimate of the
    probability of the not yet decoded part of the
    sentence
  • Ĉ(n) = Ŝ(n) + Ĝ(n)
  • Ĉ(n): evaluation function for node n, estimates
    C(n)
  • C(n): actual cost of the optimal path from start
    to goal through n
  • Ŝ(n): cost of the path followed so far to n,
    Ŝ(n) ≥ S(n)
  • Ĝ(n): estimates the cost of the remaining path
    from n to the goal
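A minimal best-first sketch using Python's heapq, stated in terms of costs (lower is better): Ŝ(n) accumulates along the path and Ĝ is the supplied estimate of the remaining cost. The toy graph and estimates are hypothetical, and revisited nodes are not pruned, which a real decoder would add:

```python
import heapq

def a_star(start, goal, successors, g_hat):
    """successors(n) -> iterable of (next node, arc cost);
    g_hat(n) estimates the remaining cost from n to the goal."""
    frontier = [(g_hat(start), 0.0, start, [start])]   # (C, S, node, path)
    while frontier:
        c, s, node, path = heapq.heappop(frontier)     # cheapest estimate first
        if node == goal:
            return path, s
        for nxt, cost in successors(node):
            s_next = s + cost                          # S(n): cost so far
            heapq.heappush(frontier,
                           (s_next + g_hat(nxt), s_next, nxt, path + [nxt]))
    return None, float("inf")

graph = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 1.0)], "c": []}
estimate = {"a": 2.0, "b": 1.0, "c": 0.0}          # hypothetical G estimates
print(a_star("a", "c", lambda n: graph[n], estimate.get))  # (['a', 'b', 'c'], 2.0)
```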

32
The Viterbi Algorithm (Dynamic Time Warping,
breadth-first)
  • All possible word sequences are hypothesized in
    parallel
  • Threshold excludes improbable hypotheses
  • Based on
  • previous path probability
  • (getting to state i)
  • transition probability
  • (getting from i to j)
  • observation likelihood
  • (state j matches input)
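A compact Viterbi sketch over a toy HMM, combining exactly the three quantities above: previous path probability, transition probability, and observation likelihood. The states, transition and emission values are hypothetical, and the threshold (beam) pruning mentioned above is omitted for brevity:

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[t][j]: probability of the best path ending in state j at time t
    best = [{j: start_p[j] * emit_p[j][observations[0]] for j in states}]
    back = [{}]
    for t, obs in enumerate(observations[1:], start=1):
        best.append({})
        back.append({})
        for j in states:
            # previous path probability * transition probability ...
            prev, p = max(((i, best[t - 1][i] * trans_p[i][j]) for i in states),
                          key=lambda pair: pair[1])
            best[t][j] = p * emit_p[j][obs]   # ... * observation likelihood
            back[t][j] = prev
    # follow the backpointers from the most probable final state
    state = max(best[-1], key=best[-1].get)
    path = [state]
    for t in range(len(observations) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1], max(best[-1].values())

states = ("S1", "S2")
start = {"S1": 0.8, "S2": 0.2}
trans = {"S1": {"S1": 0.6, "S2": 0.4}, "S2": {"S1": 0.3, "S2": 0.7}}
emit = {"S1": {"x": 0.7, "y": 0.3}, "S2": {"x": 0.1, "y": 0.9}}
print(viterbi(["x", "y", "y"], states, start, trans, emit))
```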

33
Language Model
  • P(W): prior probability (Language Model)
  • independent of the acoustic feature vectors
  • provides the a priori probability of a given
    word sequence
  • Perplexity: the average number of choices
  • N-gram: P(W) ≈ ∏ P(wi | wi−1 wi−2 … wi−N+1)
  • Part-of-Speech
  • Grammar
  • Class pairs
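A minimal bigram (N = 2) language model sketch in Python; the toy corpus and the add-one smoothing are illustrative assumptions, and perplexity is computed as the inverse geometric mean of the bigram probabilities:

```python
from collections import Counter
import math

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)

def p_bigram(w_prev, w):
    """Add-one smoothed estimate of P(w | w_prev)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

def perplexity(words):
    log_p = sum(math.log(p_bigram(a, b)) for a, b in zip(words, words[1:]))
    return math.exp(-log_p / (len(words) - 1))

print(perplexity("the cat sat".split()))   # about 3.5: few choices on average
```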

34
Class Pair Grammar
  • A word graph: all possible word combinations
    that the A* search can generate with the help of
    a search network
  • Search network created by
  • lexical transcriptions
  • class pair grammar
  • Class pair grammar defines which words may
    follow which
  • Can't recognize utterances not covered by the
    grammar

35
A Class Pair Grammar
  Start class pairs     Class pairs          End class pairs
  SILENCE PRON          ART PLACE            PLACE SILENCE
  SILENCE FRAGMENT      PRON OBJECT          OBJECT SILENCE
  SILENCE INTRO         FRAGMENT OBJECT
                        FRAGMENT PLACE
                        OBJECT ART
                        OBJECT VERB
                        VERB ART
                        INTRO FRAGMENT2
                        FRAGMENT2 OBJECT
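A sketch of how such a class pair grammar constrains recognition, with the table above turned into Python pair sets; an utterance's class sequence is licensed only if it starts with a start pair, ends with an end pair, and every pair in between is in the grammar:

```python
start_pairs = {("SILENCE", "PRON"), ("SILENCE", "FRAGMENT"),
               ("SILENCE", "INTRO")}
pairs = {("ART", "PLACE"), ("PRON", "OBJECT"), ("FRAGMENT", "OBJECT"),
         ("FRAGMENT", "PLACE"), ("OBJECT", "ART"), ("OBJECT", "VERB"),
         ("VERB", "ART"), ("INTRO", "FRAGMENT2"), ("FRAGMENT2", "OBJECT")}
end_pairs = {("PLACE", "SILENCE"), ("OBJECT", "SILENCE")}

def licensed(classes):
    """True if every adjacent pair of word classes is allowed by the grammar."""
    steps = list(zip(classes, classes[1:]))
    return (steps[0] in start_pairs and steps[-1] in end_pairs
            and all(p in pairs for p in steps[1:-1]))

print(licensed(["SILENCE", "PRON", "OBJECT", "ART", "PLACE", "SILENCE"]))  # True
```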