Title: An efficient architecture for speech synthesis using Unit Selection Synthesis
1An efficient architecture for speech synthesis
using Unit Selection Synthesis.
- CS422 Project Presentation
- Anil Muthineni
- Ashish Kumar Agarwal
- Atul Singh
2Speech Synthesis
- A very important branch of Natural Language
Processing.
- Reads text aloud for us, e.g. Microsoft Narrator,
Plain Talk, TalkIt.
- Essential component of bilingual computers, so
that a person in India can talk to a person in the
US without the help of a third person.
- Widely used in railway announcement systems and
IVRS.
- Makes computers equally easy to use for blind
people.
3Steps involved in Speech Synthesis
Normalization
Morphemization
Prosodization
4Basic Terminology and Techniques for Speech
Synthesis
- Text is first normalized.
- It is then converted into phonetic transcriptions.
- These are then divided into prosodic units.
- This linguistic representation is sent to
the output unit.
- The output unit synthesizes the digital
sound signal on the basis of this input.
- The output signal is fed to the last stage, where
it undergoes IQ (inverse quantization) and IDCT
(inverse discrete cosine transform).
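The first three stages listed above can be sketched as a chain of functions. This is a minimal illustration under stated assumptions, not the presenters' implementation: the two-word lexicon, the phoneme symbols, and all function names are hypothetical, and prosodic units are approximated by splitting at punctuation.

```python
import re

# Hypothetical toy lexicon: word -> phoneme symbols (illustrative only).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}
PUNCTUATION = {",", ".", "!", "?"}

def normalize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"[a-z']+|[,.!?]", text.lower())

def to_phonetic(tokens):
    """Replace each word by its phonemes; punctuation passes through unchanged."""
    result = []
    for token in tokens:
        if token in PUNCTUATION:
            result.append(token)
        else:
            # Unknown words are flagged rather than guessed.
            result.append(LEXICON.get(token, ["<unk>"]))
    return result

def to_prosodic_units(items):
    """Group transcribed words into units, cutting at each punctuation mark."""
    units, current = [], []
    for item in items:
        if isinstance(item, str):      # a punctuation token marks a boundary
            if current:
                units.append(current)
                current = []
        else:
            current.append(item)
    if current:
        units.append(current)
    return units

def front_end(text):
    """Normalization, then phonetic transcription, then prosodic grouping."""
    return to_prosodic_units(to_phonetic(normalize(text)))
```

Calling `front_end("Hello, world.")` yields two prosodic units, one per phrase, each holding the phoneme list of its single word.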
5Challenges
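- Normal text is full of heteronyms, numbers and
abbreviations.
- Heteronym: "Entrance" (which pronunciation?)
- Abbreviation: "St. John St." (Saint, then Street)
- Number: 1234, 1,234, 1234 (same digits, several
possible readings)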
- in
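The abbreviation and number ambiguities above can be made concrete with a small normalizer. This is only a sketch: the "St." rule (read "Saint" when a capitalized word follows, otherwise "Street") is an assumed heuristic that will fail on some inputs, and the function names are hypothetical.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer in the range 0..9999 as a quantity."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = "" if rest == 0 else " " + number_to_words(rest)
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = "" if rest == 0 else " " + number_to_words(rest)
    return ONES[thousands] + " thousand" + tail

def digits_to_words(digits):
    """Read a digit string one digit at a time (e.g. a phone extension)."""
    return " ".join(ONES[int(ch)] for ch in digits)

def expand_st(text):
    """Assumed heuristic: 'St.' before a capitalized word is 'Saint', else 'Street'."""
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    return re.sub(r"\bSt\.", "Street", text)
```

The same token "1234" gives "one thousand two hundred thirty four" through `number_to_words(1234)` and "one two three four" through `digits_to_words("1234")`; choosing between them needs context that the token alone does not carry.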
6Major approaches for text to phoneme conversion
- Dictionary based: each token is looked up in a
lexicon; if found, the word is pronounced. Very
fast, but requires a large database, and words
outside the dictionary cannot be pronounced.
- Syllable based: the token is parsed into syllables
using pronunciation rules; the syllables are
converted into phonemes and concatenated. Requires
only a small database but is slower, and not all
words are pronounced correctly, e.g. "put" vs.
"cut".
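The two approaches can be combined: try the dictionary first and fall back to rules. The sketch below is a simplification under stated assumptions: the rules work letter by letter rather than on syllables, and the lexicon entry, rule table, and phoneme symbols are hypothetical.

```python
# Hypothetical exception dictionary: only words the rules get wrong.
LEXICON = {"put": ["P", "UH", "T"]}

# Hypothetical letter-to-sound rules (one phoneme per letter).
LETTER_RULES = {"c": "K", "p": "P", "t": "T", "u": "AH"}

def rule_based(word):
    """Apply the letter rules; letters without a rule are skipped."""
    return [LETTER_RULES[ch] for ch in word.lower() if ch in LETTER_RULES]

def to_phonemes(word, lexicon=LEXICON):
    """Dictionary lookup first, rule-based conversion as the fallback."""
    key = word.lower()
    if key in lexicon:
        return lexicon[key]
    return rule_based(key)
```

The rules alone pronounce "cut" correctly but make "put" rhyme with it; the single dictionary entry corrects that, at the cost of storage for every such exception.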
7Major approaches used for synthesis of sound
- Concatenative Synthesis: based on concatenation
of segments of recorded speech. Can sometimes
produce audible glitches; soft thresholding is
implemented as a solution to this problem.
- Formant Synthesis: based on synthesizing speech
from the frequency, amplitude and other
characteristics of the sound waveform.
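To show where glitches in concatenative synthesis come from, the sketch below joins two sample sequences. The slides name soft thresholding as the remedy but do not say how it is applied, so this example uses a different, widely used smoothing technique instead: a linear crossfade over a short overlap. The function name and the overlap length are assumptions.

```python
def crossfade_concat(a, b, overlap):
    """Join sample lists a and b, blending the end of a into the start of b.

    With overlap == 0 this is plain concatenation, which can leave an
    abrupt jump (an audible click) at the join.
    """
    overlap = max(0, min(overlap, len(a), len(b)))
    head = list(a[:len(a) - overlap])
    tail = list(b[overlap:])
    blended = []
    for i in range(overlap):
        # Weight of b rises linearly across the overlap region.
        w = (i + 1) / (overlap + 1)
        blended.append((1 - w) * a[len(a) - overlap + i] + w * b[i])
    return head + blended + tail
```

For example, `crossfade_concat([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 1)` replaces the hard step from 1.0 to 0.0 with an intermediate sample of 0.5.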
8Requirement Analysis
- Because of the complexity of the algorithms
involved, the processing cannot be done in real
time on current processors.
- An application-specific instruction set processor
provides the desired flexibility and
extensibility.
- Scalability with minimal changes is required.
- Naturalness and intelligibility are two essential
requirements too.
9Requirement Analysis Contd.
- Need to minimize the glitches.
- For this, work must be done well in advance.
- "In advance" here means on a nanosecond scale.
- So computation must be fast for the data that
occurs frequently.
- This suggests that frequently used data should be
stored in a memory with cache-like access time.
10Requirement Analysis Contd.
- Well-balanced pipelines are also needed.
- This is because text to morpheme conversion takes
almost half the time of morpheme to sound
conversion (source: various research papers on
speech synthesis).
- The difference is basically due to bandwidth
considerations.
- Either increase the bandwidth or have two pipelines
work in parallel to hide the latency.
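The balance argument can be checked with simple arithmetic. In the sketch below the stage times 1 and 2 only encode the "almost half" ratio stated above; they are not measurements. The `factors` parameter is a hypothetical bandwidth or parallelism multiplier per stage.

```python
def effective_times(stage_times, factors):
    """Per-item time of each stage after applying its bandwidth factor."""
    return [t / f for t, f in zip(stage_times, factors)]

def throughput(stage_times, factors):
    """Items per time unit; the slowest stage sets the pace."""
    return 1.0 / max(effective_times(stage_times, factors))

def utilization(stage_times, factors):
    """Fraction of time each stage is busy when the pipeline is full."""
    times = effective_times(stage_times, factors)
    bottleneck = max(times)
    return [t / bottleneck for t in times]
```

With `stage_times = [1.0, 2.0]` and no extra bandwidth, the text stage is busy only half the time and throughput is 0.5 items per time unit. Doubling the bandwidth of the sound stage (`factors = [1, 2]`) balances the two stages and doubles throughput.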
11Concatenative Synthesis
- Unit Selection Synthesis: uses a large recorded
speech database created by segmenting recorded
utterances into syllables, phones, morphemes,
words, phrases and sentences.
- Diphone Synthesis: uses a minimal speech database
containing only diphones.
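A diphone spans the transition from one phone to the next, so a phone sequence maps onto a sequence of overlapping pairs. The sketch below shows that mapping; the "_" silence marker, the "-" separator, and the function name are assumptions made for illustration.

```python
def to_diphones(phones, silence="_"):
    """Turn a phone sequence into diphone names, padded with silence at both ends."""
    padded = [silence] + list(phones) + [silence]
    return [padded[i] + "-" + padded[i + 1] for i in range(len(padded) - 1)]
```

For the phones `["K", "AE", "T"]` this gives `["_-K", "K-AE", "AE-T", "T-_"]`. Each name would then index one recorded segment in the diphone database.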
12Implementation
- Unique approach with novel ideas.
- Frequent items are designed to be processed fast,
while items that occur rather infrequently, like
symbols and numbers, are not optimized.
- The design is simple, with the least possible
complexity.
- Power consumption has been kept in mind, with
real-time processing being the primary goal.
13Major features of the proposed architecture
- Two parallel pipelines, each with specialized
tasks.
- The cache memory does not need a data cache,
because the data we are dealing with is basically
a stream, without any significant spatial
locality.
- Besides the conventional cache, we need one local
store with access time equal to that of the
cache, split into three parts:
- The first part stores the recently processed
words, in accordance with the principle of
temporal and spatial locality.
- The second part stores the database of all the
frequently occurring morphemes.
- The third part is a lookup table, built using
hashing algorithms, that maps a morpheme to its
linguistic transcription. This minimizes hard
disk access.
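The three-part local store can be modeled in software to show the intended lookup order. This is a toy model, not the hardware design: the class and method names, the capacity, the least-recently-used eviction policy, and the `disk_lookup` callback standing in for a hard disk access are all assumptions.

```python
from collections import OrderedDict

class LocalStore:
    """Toy model of the three-part local store (contents are hypothetical)."""

    def __init__(self, recent_capacity, frequent_units, transcriptions, disk_lookup):
        self.recent = OrderedDict()                 # part 1: recently processed words
        self.recent_capacity = recent_capacity
        self.frequent_units = dict(frequent_units)  # part 2: data for frequent morphemes
        self.transcriptions = dict(transcriptions)  # part 3: hashed morpheme -> transcription
        self.disk_lookup = disk_lookup              # slow path, stands in for the hard disk
        self.disk_accesses = 0

    def unit(self, morpheme):
        """Return stored data for a frequent morpheme, or None if absent."""
        return self.frequent_units.get(morpheme)

    def transcription(self, word):
        """Look in part 1, then part 3, and only then go to disk."""
        if word in self.recent:
            self.recent.move_to_end(word)           # mark as most recently used
            return self.recent[word]
        if word in self.transcriptions:
            result = self.transcriptions[word]
        else:
            self.disk_accesses += 1
            result = self.disk_lookup(word)
        self.recent[word] = result
        if len(self.recent) > self.recent_capacity:
            self.recent.popitem(last=False)         # evict the least recently used word
        return result
```

A word found in part 1 or part 3 never touches the disk; a word that had to be fetched once is served from part 1 on its next use, as long as it has not been evicted.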
14Pipelines
- One of the pipelines converts the text into
phonetic transcriptions.
- This requires a lot of partitioning algorithms
that depend on loops; this code cannot be
vectorized, so this pipeline has an integer ALU
with scalar registers.
- Meanwhile, the other pipeline processes the
morphemes into linguistic units.
- This pipeline has a SIMD FPU because of its
intense number-crunching requirements.
- Care must be taken that the pipelines are
interlocked.
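A software analogy may help picture the two interlocked pipelines: two threads connected by a bounded buffer, where a full buffer stalls the text side and an empty one stalls the sound side. This is only an analogy under stated assumptions; the function names, the buffer size, and the `transcribe` and `synthesize` callbacks are hypothetical.

```python
import queue
import threading

def run_pipelines(words, transcribe, synthesize, buffer_size=2):
    """Run a text pipeline and a sound pipeline joined by a bounded buffer."""
    handoff = queue.Queue(maxsize=buffer_size)
    output = []
    done = object()                      # sentinel marking the end of the stream

    def text_pipeline():
        for word in words:
            handoff.put(transcribe(word))    # blocks while the buffer is full
        handoff.put(done)

    def sound_pipeline():
        while True:
            item = handoff.get()             # blocks while the buffer is empty
            if item is done:
                break
            output.append(synthesize(item))

    threads = [threading.Thread(target=text_pipeline),
               threading.Thread(target=sound_pipeline)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output
```

Because items pass through a single first-in, first-out buffer, the output order matches the input order regardless of how the two threads are scheduled.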
15Block diagram of proposed architecture
16Cost and Performance Analysis
- As we have just shown, the processor has a very
simple design.
- This was done to minimize the hardware complexity.
- It has a two-pronged benefit in terms of cost:
- Fewer actual hardware resources.
- Less control logic, meaning less power
expenditure.
17Contd
- Performance is expected to be good because
the design is very straightforward.
- There are no quantitative results so far, because
the idea is in its infancy and has not been
marketed.
- Yet intuitively, it is possible to conclude that
the performance will be at least on par
with that of current commodity processors
on general mathematical and scientific
computations.
18Instruction Set Architecture
- The instruction set is small yet powerful, with a
variety of instructions.
- There are a couple of special-purpose registers
in addition to the general-purpose registers.
- A status register stores information about
pitch, intensity and speed/tempo.
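One way to picture such a status register is as packed bit fields. The layout below (8 bits of pitch, 4 of intensity, 4 of tempo, in a 16-bit register) is purely an assumption for illustration; the slides do not give field widths or ordering.

```python
# Hypothetical field widths; the slides do not specify a layout.
PITCH_BITS, INTENSITY_BITS, TEMPO_BITS = 8, 4, 4

def pack_status(pitch, intensity, tempo):
    """Pack the three fields into one integer: [pitch | intensity | tempo]."""
    fields = ((pitch, PITCH_BITS), (intensity, INTENSITY_BITS), (tempo, TEMPO_BITS))
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError("field value out of range")
    return (pitch << (INTENSITY_BITS + TEMPO_BITS)) | (intensity << TEMPO_BITS) | tempo

def unpack_status(register):
    """Split a packed register back into (pitch, intensity, tempo)."""
    tempo = register & ((1 << TEMPO_BITS) - 1)
    intensity = (register >> TEMPO_BITS) & ((1 << INTENSITY_BITS) - 1)
    pitch = (register >> (INTENSITY_BITS + TEMPO_BITS)) & ((1 << PITCH_BITS) - 1)
    return pitch, intensity, tempo
```

Under this assumed layout, `pack_status(0xAB, 0x3, 0x5)` gives `0xAB35`, and `unpack_status` recovers the three fields.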
20Q & A