Title: Hardware Speech Recognition for User Interfaces in Low Cost, Low Power Devices
1. Hardware Speech Recognition for User Interfaces in Low Cost, Low Power Devices
Sergiu Nedevschi, Rabin Patra, Eric Brewer
University of California, Berkeley
2. Motivation
- ICTs are empowering technologies, but only for those with access to them
- Current technology is built with different assumptions about
  - Cost
  - Power
  - User interfaces
- 3 billion people earn below US$2000 per annum; 862 million adults are illiterate
- User interfaces relying on low-cost, low-power hardware speech recognition
  - Make IT accessible to illiterate and semi-literate people
  - Reduce cost by replacing expensive displays
  - Low power means lower battery costs
3. Solution Requirements
- Low cost and power
- Flexibility
  - Various languages, vocabularies
  - Different recognition algorithms and coding techniques
  - Re-trainability
- Scalability
  - Simple design, add more processors
  - Extensible to bigger vocabulary
4. Speech Recognition Basics
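5. Speech Decoding Computation
- HMM models for speech utterances
- Observation probabilities
  - Multivariate Gaussian mixtures
[Figure: HMM state network running from START to END, labelled with the observation probability $b_1(O_1)$ and the path likelihood $P(O, q_1 q_2 q_4 q_4 q_5 q_5 q_6 \mid M)$]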
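The observation probability of a state under a Gaussian mixture can be sketched as follows. This is a minimal illustration, assuming diagonal covariances and log-domain arithmetic; the slides say only that multivariate Gaussian mixtures are used. The function and parameter names are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x (assumed form)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_mixture(x, weights, means, variances):
    """log b(x) for a mixture: log of sum_k w_k * N(x; mean_k, var_k)."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)  # log-sum-exp keeps the sum numerically stable
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

Working in the log domain turns the products along a path into sums, which is one common way to keep the dynamic range manageable on small hardware.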
6. Speech Decoding Computation
$\phi_j(t) = \max_i \{ \phi_i(t-1)\, a_{ij} \}\, b_j(O_t)$
7. $\phi_j(t) = \max_i \{ \phi_i(t-1)\, a_{ij} \}\, b_j(O_t)$
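The max-product recursion on these slides (best predecessor score times transition probability, times the observation probability of the current state) can be sketched in the log domain as below. This is an illustrative software version under assumed data layouts, not the hardware data path; all names are hypothetical.

```python
import math

NEG_INF = float("-inf")

def log_prob(p):
    """Log of a probability, mapping 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi_step(prev, log_a, log_b_t):
    """One frame: new[j] = max_i(prev[i] + log_a[i][j]) + log_b_t[j]."""
    n = len(prev)
    return [
        max(prev[i] + log_a[i][j] for i in range(n)) + log_b_t[j]
        for j in range(n)
    ]

def viterbi_score(log_init, log_a, log_b):
    """Best-path log likelihood; log_b[t][j] is the log observation score."""
    phi = [log_init[j] + log_b[0][j] for j in range(len(log_init))]
    for log_b_t in log_b[1:]:
        phi = viterbi_step(phi, log_a, log_b_t)
    return max(phi)
```

For a two-state left-to-right model, the score of the best path equals the log of the largest product of transition and observation probabilities along any state sequence.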
8. Algorithmic Design Decisions
- Use regular grammar language models
- Unified recognition network (big HMM)
- Use the Token Passing algorithm for likelihood computation
- Reduce complexity by partitioning the vocabulary into active sets of words (< 100 words each)
9. Hardware Design Decisions
- Parallel design
  - Reduced frequency, voltage scaling
  - Set of small, simple Processing Elements (PEs)
- On-chip embedded FLASH and SRAM
  - No additional packaging, shorter wires, lower voltages
  - Multiple memory modules can deliver high throughput at low frequencies
- Scaled fixed-point arithmetic
  - Much simpler and smaller, almost no accuracy penalties
  - Data scaling determined by search for each operator
- Single-cycle data path and gated clocks
  - Low frequency allows for long critical paths and gated clocks
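The idea of choosing a data scaling per operator can be illustrated with a small search over the number of fractional bits. This is a sketch under stated assumptions: the 16-bit word size and the range-based search criterion are illustrative choices, and the slides do not specify how the actual search was performed.

```python
def to_fixed(x, frac_bits, word_bits=16):
    """Quantize x to a signed integer with frac_bits fractional bits, saturating."""
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, round(x * (1 << frac_bits))))

def from_fixed(q, frac_bits):
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)

def best_frac_bits(samples, word_bits=16):
    """Most fractional bits that hold every sample without saturating."""
    hi = (1 << (word_bits - 1)) - 1
    peak = max(abs(s) for s in samples)
    frac_bits = word_bits - 1
    while frac_bits > 0 and round(peak * (1 << frac_bits)) > hi:
        frac_bits -= 1
    return frac_bits
```

Running the search on representative values for one operator gives the scale for that operator; values with a larger range get fewer fractional bits and therefore coarser resolution.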
10. Architecture
- General-purpose CPU
  - Computes the PE allocation, initialization
  - Provides the observation vectors from the DSP front-end
- Processing Elements
  - Each is assigned a set of Gaussians and a set of HMM nodes
- Aggregators
  - Aggregate and filter data from PEs, pass results to PEs
11. Processing Element
12. Workload Allocation
- 3 important steps
  - Language loading
    - Phoneme speech models loaded
    - Common pool of phonemes
  - Application loading
    - Active set of words loaded
  - User Interface Context loading
    - Regular grammar for allowable phrases
    - Word interconnections
13. Observation Probability Scheduling
- Computation
  - Multivariate Gaussian mixtures
  - Common pool of phonemes, often repeated
- Solution
  - Phonemes equally divided among PEs
  - Results propagated by aggregator
- Performed at language load time
  - Writing parameters in every PE's local FLASH
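Dividing the common phoneme pool equally among PEs and then sharing the results could look like the sketch below. The round-robin split and the dictionary-based merge are assumptions made for illustration; the slides state only that phonemes are divided equally and that an aggregator propagates the results.

```python
def split_phonemes(phonemes, num_pes):
    """Deal phonemes round-robin so PE shares differ in size by at most one."""
    return [phonemes[pe::num_pes] for pe in range(num_pes)]

def aggregate(per_pe_scores):
    """Merge each PE's {phoneme: score} results into one shared table."""
    merged = {}
    for scores in per_pe_scores:
        merged.update(scores)
    return merged
```

Because each phoneme is scored on exactly one PE, a phoneme that appears in many words is evaluated once per frame and its result is reused everywhere through the merged table.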
14. Token Probability Scheduling
$\phi_j(t) = \max_i \{ \phi_i(t-1)\, a_{ij} \}\, b_j(O_t)$
- Best if adjacent states are assigned to the same PE
- Step 1, at application load time
  - Full words assigned to PEs
  - Writing RAM of each PE
- Step 2, at UI context change
  - Word interconnections handled by aggregator
  - Writing the aggregator's RAM at each UI context change
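Assigning whole words to PEs keeps adjacent HMM states on the same PE. One possible way to balance the load is the greedy heuristic sketched below, which places the largest words first on the least-loaded PE. The heuristic, the use of state counts as the load measure, and the example words are all assumptions; the slides do not describe the allocation policy.

```python
import heapq

def assign_words(word_states, num_pes):
    """Greedy balance: place each word, largest first, on the least-loaded PE."""
    heap = [(0, pe) for pe in range(num_pes)]  # (total states, PE index)
    heapq.heapify(heap)
    allocation = {pe: [] for pe in range(num_pes)}
    for word, n_states in sorted(word_states.items(), key=lambda kv: -kv[1]):
        load, pe = heapq.heappop(heap)
        allocation[pe].append(word)  # the whole word stays on one PE
        heapq.heappush(heap, (load + n_states, pe))
    return allocation
```

Only transitions between words cross PE boundaries under this scheme, which matches the slide's point that word interconnections are left to the aggregator.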
15. Implementation
- HTK (Hidden Markov Model Toolkit)-based software simulator
- FPGA implementation on BEE (Berkeley Emulation Engine)
  - A real-time hardware emulation engine using 20 high-density Xilinx Virtex-E FPGA chips
  - Needed due to memory constraints
  - Memory contents loaded at synthesis time
- ASIC implementation
  - Synthesised in 0.18 micron CMOS process, at 1.08V
  - Synopsys Design Compiler
16. Area and Power Estimates
- Area estimate (for 8 PEs): 2.5 mm^2
- Power estimate (at 5 MHz): 20 mW
  - 5 mW excluding memory (only 0.1 mW leakage)
  - 15 mW in memory (12.5 mW in FLASH, 2.5 mW in SRAM)
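As a quick consistency check on these figures, the components add up to the stated total, and memory accounts for three quarters of it. The symbol names are introduced here for illustration only.

```latex
\begin{align*}
P_{\text{total}} &= P_{\text{logic}} + P_{\text{FLASH}} + P_{\text{SRAM}}
                  = 5 + 12.5 + 2.5 = 20\ \text{mW} \\
\frac{P_{\text{FLASH}} + P_{\text{SRAM}}}{P_{\text{total}}} &= \frac{15}{20} = 75\%
\end{align*}
```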
17. Energy Savings
- Compare with the power consumption of a low-power general-purpose ARM processor with similar throughput
- 2 methods to estimate power consumption for ARM
18. Language Independence
- Tested our recognizer (software simulator) on a Tamil dataset
  - 4600 speech samples, 30 Tamil speakers
  - Collection performed in 3 villages in Tamil Nadu by briefly trained volunteers
  - Simple word-based recognition
19. Envisioned Recognition Platform
20. Questions/Comments?