An efficient architecture for speech synthesis using Unit Selection Synthesis

1
An efficient architecture for speech synthesis
using Unit Selection Synthesis.
  • CS422 Project Presentation
  • Anil Muthineni
  • Ashish Kumar Agarwal
  • Atul Singh

2
Speech Synthesis
  • A very important branch of Natural Language
    Processing.
  • Reads text aloud for us, e.g. Microsoft Narrator,
    Plain Talk, TalkIt.
  • Essential component of bilingual computers, so
    that a person in India can talk to a person in
    the US without the help of any third person.
  • Widespread usage in railway announcement systems
    and IVRS.
  • Makes using a computer equally easy for blind
    people.

3
Steps involved in Speech Synthesis
Normalization
Morphimization
Prosodization
4
Basic Terminology and Techniques for Speech
Synthesis
  • The text is first normalized.
  • It is then converted into phonetic transcriptions.
  • These are then divided into prosodic units.
  • This linguistic representation is sent to the
    output unit.
  • The output unit synthesizes the digital sound
    signal from this input.
  • The output signal is fed to the last stage, where
    it undergoes inverse quantization (IQ) and an
    inverse discrete cosine transform (IDCT).

5
Challenges
  • Normal text is full of heteronyms, numbers and
    abbreviations.
  • Entrance ??
  • St. John St.
  • 1234 1,234 1234
  • in
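
The normalization problem above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the method used in this project: the rule that "St." before a capitalized word means "Saint" (and "Street" otherwise) is a guess made for this example, and the helper names `number_to_words` and `normalize` are hypothetical.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Matches "1234" as well as "1,234" (comma-grouped digits).
NUMBER = re.compile(r"\d{1,3}(,\d{3})+|\d+")


def number_to_words(n: int) -> str:
    """Spell out a cardinal number in the range 0..999999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = "" if rest == 0 else " " + number_to_words(rest)
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = "" if rest == 0 else " " + number_to_words(rest)
    return number_to_words(thousands) + " thousand" + tail


def normalize(text: str) -> str:
    """Expand numbers and the ambiguous abbreviation 'St.' into words."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        if tok == "St.":
            # Assumed heuristic: followed by a capitalized word -> "Saint".
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            out.append("Saint" if nxt[:1].isupper() else "Street")
        elif NUMBER.fullmatch(tok) and int(tok.replace(",", "")) < 1_000_000:
            out.append(number_to_words(int(tok.replace(",", ""))))
        else:
            out.append(tok)  # left unchanged; heteronyms need more context
    return " ".join(out)
```

Heteronyms such as "entrance" are deliberately left untouched here, since choosing a reading needs part-of-speech or semantic context that a token-level rule does not have.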

6
Major approaches for text to phoneme conversion
  • Dictionary based: Each token is looked up in a
    lexicon. If it is found, the word is pronounced.
    Requires a large database but is very fast. Words
    not in the dictionary cannot be pronounced.
  • Syllable based: Each token is parsed into
    syllables using pronunciation rules; the
    syllables are concatenated and converted into
    phonemes. Requires a small database but is
    slower. Not all words are pronounced correctly,
    e.g. "put" vs. "cut".
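
The trade-off between the two approaches can be sketched in Python as a dictionary lookup with a rule-based fallback. The tiny lexicon, the one-letter-per-phoneme rules and the function names are all hypothetical and chosen only to reproduce the "put" vs. "cut" problem; real letter-to-sound rules work on syllables and context.

```python
# Hypothetical mini-lexicon: word -> phonemes (ARPAbet-style symbols).
LEXICON = {"put": "P UH T", "cut": "K AH T"}

# Hypothetical, deliberately naive letter-to-sound rules.
LETTER_RULES = {"p": "P", "u": "AH", "t": "T", "c": "K", "b": "B"}


def rules_only(word: str) -> str:
    """Rule-based conversion: small table, but some words come out wrong."""
    return " ".join(LETTER_RULES[ch] for ch in word.lower() if ch in LETTER_RULES)


def to_phonemes(word: str) -> tuple[str, str]:
    """Try the dictionary first and fall back to rules for unknown words."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key], "dictionary"
    return rules_only(key), "rules"
```

With these assumed tables, `rules_only("cut")` gives the right result, while `rules_only("put")` yields `"P AH T"`, which rhymes with "cut" and is wrong. The dictionary path fixes "put" but has nothing to say about a word like "but", which falls through to the rules.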

7
Major approaches used for synthesis of sound
  • Concatenative Synthesis: Based on concatenation
    of segments of recorded speech. Can sometimes
    produce audible glitches. Soft thresholding is
    implemented as a solution to this problem.
  • Formant Synthesis: Based on synthesizing speech
    from the frequency, amplitude and other
    characteristics of the sound waveform.
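
To show why joins between recorded segments cause glitches, and one common way to soften them, here is a minimal Python sketch of concatenation with a linear crossfade. Note that this is a different, swapped-in smoothing technique used only for illustration; it is not the soft thresholding mentioned above, whose details the slides do not give. The function name and the `overlap` parameter are assumptions.

```python
def crossfade_concat(a, b, overlap):
    """Join two lists of samples.

    The last `overlap` samples of `a` are blended with the first `overlap`
    samples of `b` using a linear ramp, so the waveform does not jump
    abruptly at the join. With overlap <= 0 this is plain concatenation.
    """
    overlap = min(overlap, len(a), len(b))
    if overlap <= 0:
        return list(a) + list(b)
    start = len(a) - overlap
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of b rises across the overlap
        blended.append((1 - w) * a[start + i] + w * b[i])
    return list(a[:start]) + blended + list(b[overlap:])
```

Joining `[1.0] * 4` and `[0.0] * 4` with `overlap=0` produces a step from 1.0 straight to 0.0, which is the kind of discontinuity heard as a click; with `overlap=2` the two samples around the join are replaced by intermediate values.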

8
Requirement Analysis
  • Because of the complexity of the algorithms
    involved, the processing cannot be done in real
    time on current processors.
  • An application-specific instruction set processor
    provides the desired flexibility and
    extensibility.
  • Scalability with minimal changes is required.
  • Naturalness and intelligibility are two essential
    components too.

9
Requirement Analysis Contd.
  • Need to minimize the glitches.
  • For this, work must be done well in advance.
  • "In advance" here means on a nanosecond scale.
  • For this, computation must be fast for the data
    that occurs frequently.
  • This suggests that frequently used data should be
    stored in a memory with cache-like access time.

10
Requirement Analysis Contd.
  • Also need well-balanced pipelines, because the
    time taken for text to morpheme conversion is
    almost half of the time taken for morpheme to
    sound conversion (source: various research
    papers on speech synthesis).
  • This is basically due to bandwidth considerations.
  • Either increase the bandwidth or have two
    pipelines work in parallel to hide the latency.
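
The balance argument can be checked with a small timing model in Python. This is an idealized sketch, not a measurement: it assumes fixed stage times in arbitrary units (1 for text to morpheme, 2 for morpheme to sound, matching the "almost half" ratio), no stalls, and that duplicating a stage divides its effective time per item. The function name and parameters are hypothetical.

```python
def pipeline_time(n_items, stage_times, copies=None):
    """Idealized time to push n_items through a linear pipeline.

    stage_times[i] is the time one item spends in stage i;
    copies[i] is how many parallel units implement stage i.
    The first item pays the full latency; after that, one item
    finishes per "effective" time of the slowest stage.
    """
    if copies is None:
        copies = [1] * len(stage_times)
    bottleneck = max(t / c for t, c in zip(stage_times, copies))
    return sum(stage_times) + (n_items - 1) * bottleneck


unbalanced = pipeline_time(100, [1, 2])             # slow back end limits rate
balanced = pipeline_time(100, [1, 2], copies=[1, 2])  # back end duplicated
sequential = 100 * (1 + 2)                          # no pipelining at all
```

Under these assumptions the unbalanced pipeline needs 201 time units for 100 items, against 300 without pipelining, and the faster stage sits idle about half the time. Doubling the capacity of the slower stage brings the total down to 102.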

11
Concatenative Synthesis
  • Unit Selection Synthesis: Uses a large recorded
    speech database created by segmenting recorded
    utterances into syllables, phones, morphemes,
    words, phrases and sentences.
  • Diphone Synthesis: Uses a minimal speech database
    containing only diphones.
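
With a large database there are usually several recorded candidates for each unit, so one of them has to be chosen. The slides do not describe how this is done; the Python sketch below shows the commonly used formulation, given here as an assumption: pick the sequence that minimizes the sum of a target cost (how well a candidate fits the requested unit) and a join cost (how badly two neighbours concatenate), using dynamic programming. The cost functions and the `(phone, pitch)` unit format are hypothetical.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return (total_cost, units) for the cheapest candidate sequence.

    candidates[i] is the non-empty list of database units available
    for targets[i]. Assumes at least one target.
    """
    # best holds, for each candidate of the current target, the cheapest
    # (cost, path) that ends in that candidate.
    best = [(target_cost(targets[0], u), [u]) for u in candidates[0]]
    for target, units in zip(targets[1:], candidates[1:]):
        new_best = []
        for u in units:
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best),
                key=lambda cp: cp[0],
            )
            new_best.append((cost + target_cost(target, u), path + [u]))
        best = new_best
    return min(best, key=lambda cp: cp[0])


# Toy example: units are (phone, pitch in Hz); joins prefer similar pitch.
targets = ["h", "i"]
candidates = [[("h", 100), ("h", 200)], [("i", 205), ("i", 150)]]
cost, units = select_units(
    targets,
    candidates,
    target_cost=lambda t, u: 0 if u[0] == t else 10,
    join_cost=lambda a, b: abs(a[1] - b[1]),
)
```

In the toy example the selected pair is the 200 Hz "h" followed by the 205 Hz "i", because that join has the smallest pitch jump.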

12
Implementation
  • Unique Approach with novel ideas.
  • Frequent things have been designed to be
    processed fast while things that occur rather
    infrequently, like symbols and numbers have been
    left out in the cold.
  • The design is simple, with the least possible
    complexity.
  • Power consumption has been kept in mind, with
    real-time processing being the primary motive.

13
Major features of the proposed architecture
  • Two parallel pipelines, each with specialized
    tasks.
  • The design does not require a data cache, because
    the data we are dealing with is basically a
    stream, without any significant spatial locality.
  • Besides the conventional cache, we need one local
    store with access time equal to that of the
    cache, split into three parts:
  • The first part stores the recently processed
    words, in accordance with the principle of
    temporal and spatial locality.
  • The second part stores the database of all the
    frequently occurring morphemes.
  • The third part is a lookup table, built using
    hashing algorithms, that maps a morpheme to its
    linguistic transcription. This minimizes hard
    disk access.
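
The three-part local store can be modelled in software to see how it cuts down on disk accesses. The Python sketch below is only a behavioural model under assumptions: the replacement policy for recent words is taken to be least-recently-used, and the class name, method names and the `slow_lookup` callback (standing in for the hard disk) are hypothetical.

```python
from collections import OrderedDict


class LocalStore:
    """Software model of the three-part local store."""

    def __init__(self, recent_capacity, frequent_units, transcriptions):
        self.recent = OrderedDict()                  # part 1: recent words
        self.recent_capacity = recent_capacity
        self.frequent_units = dict(frequent_units)   # part 2: frequent morphemes
        self.transcriptions = dict(transcriptions)   # part 3: hash lookup table
        self.disk_accesses = 0

    def transcribe(self, word, slow_lookup):
        """Return the transcription of `word`, touching disk only on a miss."""
        if word in self.recent:                      # part 1 hit
            self.recent.move_to_end(word)
            return self.recent[word]
        if word in self.transcriptions:              # part 3 hit
            result = self.transcriptions[word]
        else:                                        # miss: go to disk
            self.disk_accesses += 1
            result = slow_lookup(word)
        self.recent[word] = result
        if len(self.recent) > self.recent_capacity:
            self.recent.popitem(last=False)          # evict least recently used
        return result

    def unit_for(self, morpheme, slow_lookup):
        """Return stored sound data for a morpheme (part 2), else go to disk."""
        if morpheme in self.frequent_units:
            return self.frequent_units[morpheme]
        self.disk_accesses += 1
        return slow_lookup(morpheme)
```

A repeated word is served from the first part, a known morpheme from the second or third, and only words seen for the first time and absent from the table count as a disk access.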

14
Pipelines
  • One of the pipelines converts the text into
    phonetic transcriptions.
  • This requires a lot of partitioning algorithms
    that basically depend on loops. This code cannot
    be vectorized, so this pipeline has an integer
    ALU with scalar registers.
  • Meanwhile, the other pipeline processes the
    morphemes into linguistic units.
  • This pipeline has a SIMD FPU because of its
    intense number-crunching requirements.
  • Care must be taken that the pipelines are
    interlocked.
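
A software analogy for two interlocked pipelines is a producer and a consumer connected by a bounded queue: the front end stalls when the queue is full and the back end stalls when it is empty. The Python sketch below models only that hand-off, under the assumption that "interlocked" means stalling until the other side is ready; it is not a description of the hardware. The function name, the `depth` parameter and the two stage callbacks are hypothetical.

```python
import queue
import threading


def run_pipelines(words, to_transcription, to_sound, depth=4):
    """Run a text front end and a sound back end concurrently."""
    handoff = queue.Queue(maxsize=depth)  # bounded buffer between pipelines
    done = object()                       # sentinel marking end of input
    sounds = []

    def front_end():
        for word in words:
            # put() blocks while the buffer is full, so the front end
            # cannot run ahead of the slower back end indefinitely.
            handoff.put(to_transcription(word))
        handoff.put(done)

    def back_end():
        while True:
            item = handoff.get()          # blocks while the buffer is empty
            if item is done:
                break
            sounds.append(to_sound(item))

    threads = [threading.Thread(target=front_end),
               threading.Thread(target=back_end)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sounds
```

For example, `run_pipelines(["a", "b"], str.upper, lambda s: s + "!")` processes the items in order while both stages run side by side; the queue depth controls how far the front end may run ahead.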

15
Block diagram of proposed architecture
16
Cost and Performance Analysis
  • As we have just shown, the processor has a very
    simple design.
  • This was done to minimize the hardware complexity.
  • It has two-pronged benefits in terms of cost:
  • Fewer actual hardware resources.
  • Less control logic, meaning less power
    expenditure.

17
Cost and Performance Analysis Contd.
  • Performance is expected to be good because the
    design is very straightforward.
  • No quantitative results so far, because the idea
    is in its infancy and has not yet been marketed.
  • Yet intuitively, it is possible to conclude that
    the performance will be at least on par with
    that of current commodity processors on general
    mathematical and scientific computations.

18
Instruction Set Architecture
  • Instruction set is small yet powerful with a
    variety of instructions.
  • There are a couple of special purpose registers
    in addition to general purpose registers
  • A status register has been kept to store
    information about pitch, intensity and
    speed/tempo.
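
One way to picture such a status register is as a packed bit field. The layout in the Python sketch below is purely an assumption for illustration (a 16-bit register with 8 bits of pitch, 4 bits of intensity and 4 bits of tempo); the slides do not give field widths or encodings.

```python
# Assumed layout of a 16-bit status register (hypothetical):
#   bits 15..8  pitch       (0..255)
#   bits  7..4  intensity   (0..15)
#   bits  3..0  tempo       (0..15)
PITCH_SHIFT, INTENSITY_SHIFT = 8, 4
PITCH_MAX, INTENSITY_MAX, TEMPO_MAX = 0xFF, 0xF, 0xF


def pack_status(pitch: int, intensity: int, tempo: int) -> int:
    """Pack the three prosody fields into one register value."""
    if not (0 <= pitch <= PITCH_MAX
            and 0 <= intensity <= INTENSITY_MAX
            and 0 <= tempo <= TEMPO_MAX):
        raise ValueError("field out of range for the assumed layout")
    return (pitch << PITCH_SHIFT) | (intensity << INTENSITY_SHIFT) | tempo


def unpack_status(reg: int) -> tuple[int, int, int]:
    """Recover (pitch, intensity, tempo) from a register value."""
    return ((reg >> PITCH_SHIFT) & PITCH_MAX,
            (reg >> INTENSITY_SHIFT) & INTENSITY_MAX,
            reg & TEMPO_MAX)
```

Keeping all three values in a single register would let one instruction read or update the current prosody state, which fits the small instruction set described above.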

19
(No Transcript)
20
Q & A