Title: Development of a Korean Large Vocabulary Continuous Speech Recognition Platform ECHOS
Oriental COCOSDA 2007, December 4-6, Hanoi, Vietnam
Presented December 5, 2007
Oh-Wook Kwon (1), Hoirin Kim (2), Sukbong Kwon (2), Sungrack Yun (3), Gyucheol Jang (3), Yong-Rae Kim (1), Bong-Wan Kim (4), Changdong Yoo (3), Yong-Ju Lee (4)
(1) Chungbuk National University, (2) ICU, (3) KAIST, (4) SITEC, Wonkwang University, Korea
Outline
- 1. Introduction
- 2. ECHOS
- 3. Search Algorithm
- 4. Performance Evaluation
- 5. Conclusion
1. Introduction
- Motivations
- Hard to know the details of conventional speech recognition platforms (HTK, Sphinx, Julius, ISIP)
- The source code lacks readability and reusability
- Hard to modify the source code to implement a new idea
[Cartoon: "I have an idea, but how should I modify the source code (HTK, ISIP, Julius, Sphinx) for it?"]
Contributions
- Developed a speech recognition platform (ECHOS) for education and research purposes
- Easy and compact
- Object-oriented structure
- Programmer's manual
- Application Programming Interface (API)
- Implemented FSN-based and statistical language model (LM)-based search algorithms
- Lexical tree search
- Two-pass search
- Compared its performance with HTK and Julius
2. ECHOS
- Easy: easy-to-understand UML-based documentation, high-level API
- Compact: Standard Template Library (STL)
- Hangeul: Korean processing modules, automatic text-to-pronunciation conversion with morphological analysis
- Object-oriented: modular structure, sample code for improved reusability
- Speech recognizer: noise reduction modules, decoder only, HTK-compatible acoustic models
- ECHOS serves as an educational, research, and baseline platform
Block Diagram
[Block diagram: Speech → Speech detection → Noise reduction → Feature extraction → Search → Post-processing → Word sequence. The search module draws on a search tree/search network, acoustic model, language model, and dictionary; a speaker adaptation module uses adaptation text.]
Specifications
- Input
- 8/16 kHz, 16-bit PCM
- Isolated word, continuous speech
- Speech detection with continuous-listening capability
- Output
- Recognition results: 1-best, N-best, lattice
- Additional information: word likelihood, state segmentation information
- Tasks
- Isolated word recognition
- Continuous speech recognition: finite state network (FSN)
- Large vocabulary continuous speech recognition (LVCSR): lexical tree, 30,000 words
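For the FSN-based continuous-speech task above, a finite state network is essentially a word-labelled state graph that accepts exactly the word sequences the grammar allows. A minimal sketch follows; the `FSN` struct and the example grammar are invented for illustration and are not the ECHOS data structures:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// A toy finite state network: arcs map (state, word) to a next state,
// and a word sequence is accepted if it ends in a final state.
struct FSN {
    std::map<std::pair<int, std::string>, int> arcs;  // (state, word) -> next state
    std::set<int> finals;                             // accepting states

    bool accepts(const std::vector<std::string>& words) const {
        int s = 0;  // state 0 is the start state by convention here
        for (const auto& w : words) {
            auto it = arcs.find({s, w});
            if (it == arcs.end()) return false;       // no arc: sequence rejected
            s = it->second;
        }
        return finals.count(s) > 0;
    }
};
```

For example, a grammar "call (john|mary)" would use arcs (0,"call")→1, (1,"john")→2, (1,"mary")→2 with final state 2; the decoder then only searches word sequences this network accepts.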
Supported Algorithms (1)
Supported Algorithms (2)
Application Programming Interface
- Two-level APIs for beginners and experts
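The slide gives no API details, but a two-level design typically pairs a one-call interface for beginners with stage-by-stage calls for experts. A hypothetical sketch of that pattern; every class and method name here is invented for illustration and is not the actual ECHOS API:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy recognizer illustrating a two-level API (names are hypothetical).
class ToyRecognizer {
public:
    // --- low-level (expert) API: drive each pipeline stage yourself ---
    std::vector<float> extractFeatures(const std::vector<short>& pcm) {
        std::vector<float> feat;
        for (short s : pcm) feat.push_back(static_cast<float>(s) / 32768.0f);
        return feat;
    }
    std::string search(const std::vector<float>& feat) {
        // stand-in for Viterbi search over acoustic and language models
        return feat.empty() ? "" : "hello world";
    }

    // --- high-level (beginner) API: one call that wraps the stages above ---
    std::string recognize(const std::vector<short>& pcm) {
        return search(extractFeatures(pcm));
    }
};
```

The high-level call gives the same result as chaining the low-level calls, but the low-level path lets an expert insert extra processing (e.g. noise reduction) between stages.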
Documentation
- Manuals for beginners and experts
- User's manual
- Programmer's manual
- Documentation based on the Unified Modeling Language (UML)
- Requirements: use case diagram
- Design: package diagram, class diagram
- Implementation: sequence diagram, state-chart diagram
Package Diagram / Sequence Diagram
[Figures: package diagram of the platform; sequence diagram of the search module]
Class Diagram
Programmer's Manual
- Describes the details of the source code
- Algorithms
- Implementation: classes, member variables
3. Search Algorithm
- Lexical tree search
- Combining the lexical tree with a flat lexicon for single-phone words (Fig. a)
- Incorporating a duration model to handle short words (Fig. b): word transitions with short duration are checked against duration models
[Figures: (a) leaf nodes of the lexical tree connect through null nodes to a flat lexicon of single-phone words; (b) duration models check word transitions for single-phone or short words]
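The lexical-tree/flat-lexicon split can be sketched as a phone-prefix trie for multi-phone words plus a flat word list for single-phone words. A minimal illustration; the phone strings and data layout are assumptions, not the ECHOS implementation:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// One node of the lexical tree: children indexed by phone label.
struct TreeNode {
    std::map<std::string, std::unique_ptr<TreeNode>> children;
    std::string word;  // non-empty at nodes where a word ends
};

struct Lexicon {
    TreeNode root;                  // lexical tree (multi-phone words)
    std::vector<std::string> flat;  // flat lexicon (single-phone words)
    int treeNodes = 0;              // arcs created, to show prefix sharing

    void add(const std::string& word, const std::vector<std::string>& phones) {
        if (phones.size() == 1) {   // single-phone word: keep in flat lexicon
            flat.push_back(word);
            return;
        }
        TreeNode* n = &root;
        for (const auto& p : phones) {
            auto& child = n->children[p];
            if (!child) { child.reset(new TreeNode); ++treeNodes; }
            n = child.get();        // shared prefixes reuse existing nodes
        }
        n->word = word;
    }
};
```

For instance, "speech" (s p iy ch) and "speed" (s p iy d) share the prefix s-p-iy, so the tree needs 5 nodes instead of 8, which is what makes tree search cheaper than a flat lexicon.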
Search Algorithm (continued)
- Two-pass search
- Forward: Viterbi beam search with a bigram (knowledge source 1), then word graph generation and optimization (unfolding into a tree structure, boundary optimization to remove identical word sequences, pruning, merging)
- Backward: stack decoding with a trigram (knowledge source 2)
[Flow: Viterbi beam search (bigram) → back pointer table → 1-best backtracking → 1-best or N-best results; back pointer table → word graph generation → word graph optimization → word graph → stack decoding (trigram) → 1-best or N-best results]
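The backward pass can be sketched as best-first "stack decoding" over the word graph left by the forward pass, adding a trigram score to each extension. A toy version; the graph, the scores, and the trigram function are all invented for illustration:

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <string>
#include <vector>

// One word-graph arc carrying the first-pass (bigram) score.
struct Arc { std::string word; int to; double score; };

// Toy trigram model: a bonus when the last three words match a known triple.
double trigram(const std::vector<std::string>& h) {
    if (h.size() >= 3 && h[h.size() - 3] == "<s>" &&
        h[h.size() - 2] == "this" && h.back() == "is") return 1.0;
    return 0.0;
}

struct Hyp {
    double score;
    int node;
    std::vector<std::string> words;
    bool operator<(const Hyp& o) const { return score < o.score; }  // max-heap order
};

std::vector<std::string> stackDecode(const std::map<int, std::vector<Arc>>& graph,
                                     int start, int goal) {
    std::priority_queue<Hyp> stack;          // the "stack": partial hypotheses, best first
    stack.push({0.0, start, {"<s>"}});
    Hyp best{-1e18, -1, {}};
    while (!stack.empty()) {
        Hyp h = stack.top(); stack.pop();    // expand the best partial hypothesis
        if (h.node == goal) {                // complete hypothesis: keep the best one
            if (h.score > best.score) best = h;
            continue;
        }
        auto it = graph.find(h.node);
        if (it == graph.end()) continue;
        for (const Arc& a : it->second) {
            Hyp n = h;
            n.node = a.to;
            n.words.push_back(a.word);
            n.score += a.score + trigram(n.words);  // first-pass score + trigram rescore
            stack.push(n);
        }
    }
    return best.words;  // exhaustive for this toy; a real decoder prunes with A*-style estimates
}
```

On a graph where the bigram scores alone prefer one path, the trigram bonus can flip the decision, which is exactly the point of rescoring in the second pass.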
4. Performance Evaluation
- Database
- 8,000-word CSR task
- SiTEC Dict01 database
- Test data: 1,050 sentences (10 speakers)
- Features
- MFCC, Δ-MFCC, delta energy
- Acoustic and language models
- Triphone acoustic models
- Bigram language model
- Search
- Lexical tree search
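Δ-MFCC features are commonly computed with the standard regression formula d_t = Σ_{n=1..N} n·(c_{t+n} − c_{t−n}) / (2·Σ n²). The slide does not state the window size or edge handling, so N = 2 and index clamping below are assumptions following common practice (e.g. HTK):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Delta coefficients over one cepstral dimension, regression window N.
std::vector<double> deltas(const std::vector<double>& c, int N = 2) {
    int T = static_cast<int>(c.size());
    double denom = 0.0;
    for (int n = 1; n <= N; ++n) denom += 2.0 * n * n;   // 2 * sum of n^2
    std::vector<double> d(T, 0.0);
    for (int t = 0; t < T; ++t) {
        double num = 0.0;
        for (int n = 1; n <= N; ++n) {
            int hi = std::min(t + n, T - 1);             // clamp at the edges
            int lo = std::max(t - n, 0);
            num += n * (c[hi] - c[lo]);
        }
        d[t] = num / denom;
    }
    return d;
}
```

On a unit ramp (0, 1, 2, ...) the interior deltas come out to exactly 1, which is a quick sanity check: the delta is the local slope of the cepstral trajectory.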
Evaluation Results (1)
- Testing search algorithms
- Flat lexicon: performance similar to HTK
- Lexical tree: recognition time reduced by 50% at the cost of a 40% relative increase in error rate
Evaluation Results (2)
- Performance of two-pass search
- Comparison with Julius
5. Conclusion
- Korean speech recognition platform for educational and research purposes
- Signal processing: noise reduction
- Feature extraction: MFCC, PLP, ETSI
- Acoustic model: HMM
- Language model: FSN, bigram, trigram
- Search: FSN search, lexical tree search
- Post-processing: lattice-based confidence scoring
- Recent activities
- Distributed to 20 Korean universities
- SiTEC technical seminars
- Thank you