Title: Automatic Speaker Recognition for Series 60 Mobile Devices
Specom2004, Sep 20, 2004
Juhani Saastamoinen, Evgeny Karpov, Ville
Hautamäki, and Pasi Fränti
- University of Joensuu,
- Department of Computer Science
2Background
- Project in National FENIX programme
- New Methods and Applications in Speech Technology
- 7 research institutes
- Project partners: NRC, Lingsoft, National Bureau of Investigation, etc.
- Joensuu Speaker Recognition
- http://cs.joensuu.fi/pages/pums
3PUMS project
Juhani Saastamoinen Project manager
Pasi Fränti Professor
Evgeny Karpov Project researcher
Tomi Kinnunen Researcher
Ismo Kärkkäinen Clustering algorithms
Ville Hautamäki Project researcher
4Application Scenarios
5Project Goal
Port speaker recognition to a Series 60 mobile phone
6Symbian Phones
- Series 60 phone features
- 16 MB ROM
- 8 MB RAM
- 176 x 208 display
- ARM-processor
- No floating-point unit!!!
[Figure labels: UIQ, Series 60, Series 80]
7Symbian OS
- Defined by Symbian consortium
- Based on EPOC
- Operating system for mobile phones
- Real-time system
- Long uptime required
- Multitasking, multithreading
8Problems of Porting
- Usual considerations when porting to phone
- GUI event driven program(ming)
- Platform specific programming model
- Real-time system, exceptions
- Application specific porting problems
- Number crunching without floating point unit!!!
- Signal processing numerically challenging
9Identification System
Add speaker profiles during training
Read and use all profiles during recognition
Speaker Profile Database
Decision
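The train/recognize flow above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes each speaker profile is a VQ codebook (the slides mention VQ/GLA later) and that the decision picks the profile with the smallest average squared-distance distortion. The names `avg_distortion`, `identify` and the toy profiles are hypothetical.

```python
def avg_distortion(features, codebook):
    """Mean squared distance from each feature vector to its nearest code vector."""
    total = 0.0
    for f in features:
        total += min(sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook)
    return total / len(features)


def identify(features, profiles):
    """Read all profiles and return the speaker whose codebook fits best."""
    return min(profiles, key=lambda name: avg_distortion(features, profiles[name]))


# Toy speaker profile database (assumed 2-dimensional features for brevity).
profiles = {
    "alice": [(0.0, 0.0), (1.0, 1.0)],
    "bob": [(5.0, 5.0), (6.0, 4.0)],
}
test_features = [(0.9, 1.1), (0.1, -0.1)]
decision = identify(test_features, profiles)
```

In a real system the feature vectors would be the 12-dimensional MFCC vectors and the codebooks would come from training.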
10MFCC Signal Processing
- Pre-emphasis coefficient 0.97, Hamming window, 30 triangular mel filters, base-2 logarithm, output 12 MFCCs
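The first two steps of this front end can be sketched in floating point as below. It is only an illustration under assumptions: the slide's "Hamm window" is read as the Hamming window, and leaving the first sample unfiltered is a choice made here, not something the slides state. The function names are hypothetical.

```python
import math


def pre_emphasis(signal, coeff=0.97):
    """y[n] = x[n] - coeff * x[n-1]; the first sample is passed through (assumption)."""
    return [signal[0]] + [signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))]


def hamming(n_samples):
    """Hamming window of length n_samples (n_samples >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def window_frame(frame):
    """Multiply a frame sample by sample with the Hamming window."""
    return [s * w for s, w in zip(frame, hamming(len(frame)))]
```

The remaining stages (FFT, mel filter bank, logarithm, DCT) would follow on each windowed frame.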
11Fixed-Point Implementation
- Numerical analysis needed for fixed-point arithmetic implementation
- Truncation and re-scaling to avoid overflows in the converted algorithm
- Minimize information loss caused by computation in fixed-point arithmetic
- Minimize relative error
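As a small illustration of truncation, re-scaling and relative error, the sketch below multiplies two numbers in a fixed-point format. The Q15 format (15 fractional bits) and all function names are assumptions chosen for the example; the slides do not specify a format. Python integers do not overflow, so this shows only the scaling and the error, not the overflow itself.

```python
FRAC_BITS = 15  # assumed Q15 format: value = integer / 2**15


def to_fixed(x, frac_bits=FRAC_BITS):
    """Convert a real number to its nearest fixed-point integer."""
    return int(round(x * (1 << frac_bits)))


def to_float(q, frac_bits=FRAC_BITS):
    """Convert a fixed-point integer back to a real number."""
    return q / (1 << frac_bits)


def fixed_mul(a, b, frac_bits=FRAC_BITS):
    """Multiply two fixed-point numbers; the shift re-scales and truncates the product."""
    return (a * b) >> frac_bits


def relative_error(approx, exact):
    """Relative error of an approximation (exact must be nonzero)."""
    return abs(approx - exact) / abs(exact)


product = to_float(fixed_mul(to_fixed(0.3), to_fixed(0.7)))
error = relative_error(product, 0.3 * 0.7)
```

The shift in `fixed_mul` is where information is lost, which is why the choice of how many bits to keep matters.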
12FFT, Fixed-Point
- Frequency spectrum of speech
- Biggest source of numerical error
- Butterflies have multiplications
- Layers repeat truncation errors
- Fixed number of bits per element
- 32 bits, the native integer size in many systems
- Reference implementation: FFTGEN
- http://www.jjj.de/fft/fftgen.tgz
13FFTGEN (16/16)
- Multiplication: the result of a 32 x 32-bit product must fit in 32 bits, so truncate the inputs
- FFTGEN: truncate inputs to 16/16 bits
[Diagram labels: FFT layer input (16-bit integer) x FFT twiddle factor (16-bit integer) -> 32-bit multiplication result (16 used bits, 16 crop-off bits) -> FFT layer output (part of it, 16-bit integer); crop-off for next layer: 16 bits]
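The 16/16 scheme can be sketched as follows: two signed 16-bit operands always give a product that fits in a signed 32-bit integer, and the low 16 bits of that product are then dropped. This is an illustration only; how FFTGEN actually lays out and scales its bits is not given in the slides, and the function names and sample values are hypothetical.

```python
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


def mul_16_16(signal16, twiddle16):
    """Multiply two signed 16-bit integers; the product always fits in 32 bits."""
    assert INT16_MIN <= signal16 <= INT16_MAX
    assert INT16_MIN <= twiddle16 <= INT16_MAX
    product32 = signal16 * twiddle16
    assert INT32_MIN <= product32 <= INT32_MAX
    return product32


def crop_to_16(product32):
    """Keep the 16 used (high) bits and discard the 16 crop-off (low) bits."""
    return product32 >> 16


# Assumed example values: a signal sample and a twiddle factor near cos(pi/4).
product = mul_16_16(12000, 23170)
next_layer_input = crop_to_16(product)
```

Because the crop is repeated at every FFT layer, the truncation errors accumulate from layer to layer.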
14Info Preserving FFT (22/10)
- Approximate the DFT operator F with G
- Allow a larger operator error F - G in order to preserve more signal information
- Minimize the maximum relative error in the scaled sine values with respect to the scale; 980 is good for FFT sizes up to 1024
- Truncate multiplication inputs to 22/10 bits (signal/operator)
[Diagram labels: FFT layer input (32-bit integer, 22 bits used; 10 crop-off bits) x FFT twiddle factor (16-bit integer, 10 bits used) -> 32-bit multiplication result -> FFT layer output (part of it); crop-off for next layer: 10 bits]
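The trade-off between signal bits and operator bits can be illustrated with a toy model. It assumes the signal is a fraction of full scale in (-1, 1), truncated to the given number of bits, and that the twiddle factor is rounded to a signed integer of the given width. The slide's scale of 980 is not modeled, and all names and sample values are hypothetical.

```python
def quantize_signal(x, bits):
    """Truncate x in (-1, 1) to a signed integer with `bits` bits."""
    return int(x * (1 << (bits - 1)))


def quantize_twiddle(w, bits):
    """Round w in [-1, 1] to a signed integer with `bits` bits."""
    return round(w * ((1 << (bits - 1)) - 1))


def split_product(x, w, signal_bits, twiddle_bits):
    """Product x * w computed with a signal_bits/twiddle_bits split."""
    xs = quantize_signal(x, signal_bits)
    wq = quantize_twiddle(w, twiddle_bits)
    product = xs * wq  # needs at most signal_bits + twiddle_bits bits
    scale = (1 << (signal_bits - 1)) * ((1 << (twiddle_bits - 1)) - 1)
    return product / scale


def relative_error(x, w, signal_bits, twiddle_bits):
    """Relative error of the split product against the exact product."""
    exact = x * w
    return abs(split_product(x, w, signal_bits, twiddle_bits) - exact) / abs(exact)


quiet_sample, twiddle = 0.001, 0.7071  # assumed low-amplitude sample
err_16_16 = relative_error(quiet_sample, twiddle, 16, 16)
err_22_10 = relative_error(quiet_sample, twiddle, 22, 10)
```

In this toy model the quiet sample is reproduced far more accurately with the 22/10 split, because the error from truncating the signal dominates. For samples near full scale the coarser twiddle factor would dominate instead, so the result depends on the signal amplitude.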
15FFT Spectrum, Fixed-Point
- x-axis: fixed-point FFT element absolute values
- y-axis: correct FFT element absolute values
[Figure panels: 16/16 and 22/10 absolute values, for the original TIMIT signal and for the TIMIT signal x 4]
16Scale of Error in Proposed FFT
17Magnitude Spectrum, Fixed-Point
- Compute complex absolute values using the maximum coordinate and the coordinate ratio
- Suppose x > y for z = x + iy; then $|z| = |x|\sqrt{1 + (y/x)^2}$
- Denote the squared ratio $(y/x)^2$ by t
- Approximate the square root $\sqrt{1 + t}$ by a polynomial P(t)
- Constant-time algorithm (vs. Newton iteration)
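A sketch of this magnitude computation is given below, in floating point for readability. The slides do not give the polynomial; the quadratic here is an assumption, chosen to match sqrt(1 + t) exactly at t = 0, 0.5 and 1. The function names are hypothetical.

```python
import math

# Assumed quadratic P(t) = 1 + B*t + C*t^2 interpolating sqrt(1 + t) at t = 0, 0.5, 1.
B = 4 * math.sqrt(1.5) - math.sqrt(2.0) - 3.0
C = 2 * math.sqrt(2.0) - 4 * math.sqrt(1.5) + 2.0


def poly_sqrt1p(t):
    """Polynomial approximation of sqrt(1 + t) for t in [0, 1]."""
    return 1.0 + t * (B + C * t)


def approx_abs(x, y):
    """Approximate |x + iy| from the larger coordinate and the squared ratio."""
    big, small = max(abs(x), abs(y)), min(abs(x), abs(y))
    if big == 0:
        return 0.0
    t = (small / big) ** 2  # always in [0, 1]
    return big * poly_sqrt1p(t)
```

The cost is one division and a fixed number of multiplications, with no iteration. A fixed-point port would evaluate the same steps in integer arithmetic.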
18Logarithm, Fixed-Point
- Use base 2 instead of base 10
- Corresponds to multiplying the output by a constant
- Standard technique
- Return the problem to the interval [1, 2)
- Use linear interpolation from values stored in a look-up table
- 8 bits used for indexing the look-up table values
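The look-up-table logarithm can be sketched as follows. The 8-bit table index is from the slide; the 16 fractional output bits, the 24-bit working mantissa and the function name are assumptions made for this example. The table is built with floating-point `math.log2` here, whereas on the phone it would be stored as precomputed constants.

```python
import math

INDEX_BITS = 8       # from the slide: 8 bits index the table
OUT_FRAC_BITS = 16   # assumed output format (16 fractional bits)
MANT_BITS = 24       # assumed working precision of the mantissa

# 257 entries so that index + 1 is always valid for the interpolation.
LOG2_TABLE = [int(round(math.log2(1 + i / 256) * (1 << OUT_FRAC_BITS)))
              for i in range(257)]


def log2_fixed(n):
    """Approximate log2(n) for an integer n >= 1, with OUT_FRAC_BITS fractional bits."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    exponent = n.bit_length() - 1
    # Normalize so that the mantissa lies in [1, 2) with MANT_BITS fractional bits.
    if exponent >= MANT_BITS:
        mant = n >> (exponent - MANT_BITS)
    else:
        mant = n << (MANT_BITS - exponent)
    frac = mant - (1 << MANT_BITS)
    rest_bits = MANT_BITS - INDEX_BITS
    index = frac >> rest_bits
    rest = frac & ((1 << rest_bits) - 1)
    lo, hi = LOG2_TABLE[index], LOG2_TABLE[index + 1]
    interp = lo + (((hi - lo) * rest) >> rest_bits)  # linear interpolation
    return (exponent << OUT_FRAC_BITS) + interp
```

Dividing the result by 2**16 gives the logarithm as a real number, for example `log2_fixed(1024) / 65536`.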
19Rest of System, Fixed-Point
- No improvement needed in VQ/GLA
- Should apply a similar technique as with the FFT to other signal processing
- Pre-emphasis: utilize the full 32 bits
- Time windowing: use fewer bits in the windowing function
- Filter bank (FB): use fewer bits in the frequency responses
- DCT: use fewer bits for the cosines
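As one possible reading of the pre-emphasis item, the sketch below runs the filter in integer arithmetic. It assumes 16-bit input samples and a Q15 coefficient, so that the 16 x 16-bit product uses the 32-bit range. The slides do not give the actual bit allocation, so the format and the names are hypothetical.

```python
COEFF_FRAC_BITS = 15                               # assumed Q15 coefficient
COEFF_Q15 = round(0.97 * (1 << COEFF_FRAC_BITS))   # 0.97 from the MFCC slide


def pre_emphasis_fixed(samples):
    """Integer pre-emphasis y[n] = x[n] - (c * x[n-1] >> 15); first sample passed through."""
    out = [samples[0]]
    for n in range(1, len(samples)):
        scaled_prev = (COEFF_Q15 * samples[n - 1]) >> COEFF_FRAC_BITS
        out.append(samples[n] - scaled_prev)
    return out
```

Note that `>>` rounds toward negative infinity in Python, so negative samples are truncated slightly differently than in C on some platforms.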
20Effect of Signal Processing
- TIMIT data sets, varying number of speakers (N)
- For each N, repeat (6x, 5x, 2x) train/recognize cycles (to eliminate the randomness of the GLA initial solution)
- FFTGEN: FFT with 16/16 multiplication
- Fixed-point: uses the proposed 22/10 FFT
- Mixed: floating-point DSP, fixed-point GLA/VQ
21Effect of Signal Quality
- GSM/PC data: 16 aligned dual recordings
- All computations in floating-point arithmetic
- Signal recorded with a laptop and PC microphone gives an average recognition rate of 100 %
- Signal recorded with a Nokia 3660 results in an average recognition rate of 84.9 %
22Conclusion
- Speaker identification was ported to a Symbian Series 60 mobile phone
- 22/10 bit usage in multiplication is proposed instead of the standard 16/16
- Experiments indicate that recognition accuracy improves from 68 % to 95 %