Automatic Speaker Recognition for Series 60 Mobile Devices

1
Automatic Speaker Recognition for Series 60
Mobile Devices
SPECOM 2004, Sep 20, 2004
Juhani Saastamoinen, Evgeny Karpov, Ville
Hautamäki, and Pasi Fränti
  • University of Joensuu,
  • Department of Computer Science

2
Background
  • Project in National FENIX programme
  • New Methods and Applications in Speech Technology
  • 7 research institutes
  • Project partners: NRC, Lingsoft, National Bureau
    of Investigation, etc.
  • Joensuu Speaker Recognition
  • http://cs.joensuu.fi/pages/pums

3
PUMS project
Juhani Saastamoinen: Project manager
Pasi Fränti: Professor
Evgeny Karpov: Project researcher
Tomi Kinnunen: Researcher
Ismo Kärkkäinen: Clustering algorithms
Ville Hautamäki: Project researcher
4
Application Scenarios
5
Project Goal
Port speaker recognition to a Series 60 mobile phone
6
Symbian Phones
  • Series 60 phone features:
    • 16 MB ROM
    • 8 MB RAM
    • 176 × 208 pixel display
    • ARM processor
    • No floating-point unit!!!

Figure: UIQ, Series 60 and Series 80 phones
7
Symbian OS
  • Defined by Symbian consortium
  • Based on EPOC
  • Operating system for mobile phones
  • Real-time system
  • Long uptime required
  • Multitasking, multithreading

8
Problems of Porting
  • Usual considerations when porting to a phone:
    • Event-driven GUI programming
    • Platform-specific programming model
    • Real-time system, exceptions
  • Application-specific porting problems:
    • Number crunching without a floating-point unit!!!
    • Signal processing is numerically challenging

9
Identification System
Figure: system diagram with a Speaker Profile Database and a Decision stage
  • Training: add speaker profiles to the database
  • Recognition: read and use all profiles to reach the decision
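The slides do not say how a speaker profile is stored or compared. Assuming each profile is a VQ codebook (the deck later mentions VQ/GLA), the decision step might be sketched as follows: every stored codebook is read, and the speaker whose codebook gives the lowest average quantization distortion on the test vectors wins. The function names, the squared Euclidean distance and the toy data are illustrative assumptions, not the authors' code.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def avg_distortion(test_vectors: List[Vector], codebook: List[Vector]) -> float:
    """Mean distance from each test vector to its nearest code vector."""
    total = sum(min(sq_dist(v, c) for c in codebook) for v in test_vectors)
    return total / len(test_vectors)


def identify(test_vectors: List[Vector], profiles: Dict[str, List[Vector]]) -> str:
    """Read every stored profile and return the speaker with the lowest distortion."""
    return min(profiles, key=lambda name: avg_distortion(test_vectors, profiles[name]))


# Hypothetical database: two speakers with 2-D code vectors
profiles = {
    "speaker_a": [(0.0, 0.0), (1.0, 1.0)],
    "speaker_b": [(5.0, 5.0), (6.0, 6.0)],
}
print(identify([(0.1, 0.2), (0.9, 1.1)], profiles))  # speaker_a
```

Reading all profiles at recognition time is what makes identification cost grow linearly with the number of enrolled speakers.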
10
MFCC Signal Processing
  • Pre-emphasis coefficient 0.97, Hamming window, 30 triangular
    mel filters, base-2 logarithm, output of 12 MFCCs
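A floating-point reference for some of the front-end steps listed above could look like the following sketch. The mel formula 2595·log10(1 + f/700) and the 16 kHz sample rate (TIMIT's rate) are assumptions, since the slide does not state them; `mel_filter_edges` returns the 32 edge and center frequencies that 30 overlapping triangular filters would need. All function names are hypothetical.

```python
import math
from typing import List


def pre_emphasis(x: List[float], alpha: float = 0.97) -> List[float]:
    """y[n] = x[n] - alpha * x[n-1], with x[-1] taken as 0."""
    if not x:
        return []
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def hamming(n: int) -> List[float]:
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def hz_to_mel(f: float) -> float:
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m: float) -> float:
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filter_edges(num_filters: int = 30, sample_rate: float = 16000.0) -> List[float]:
    """Frequencies (Hz) equally spaced on the mel scale from 0 to Nyquist.

    Filter i rises from edges[i], peaks at edges[i+1] and falls to edges[i+2].
    """
    top = hz_to_mel(sample_rate / 2.0)
    return [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]


edges = mel_filter_edges()
print(len(edges), edges[0], round(edges[-1]))  # 32 0.0 8000
```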

11
Fixed-Point Implementation
  • Numerical analysis is needed for the fixed-point
    arithmetic implementation
  • Truncation and re-scaling to avoid overflows in
    the converted algorithm
  • Minimize information loss caused by computation
    in fixed-point arithmetic
  • Minimize relative error
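The conversion can be illustrated with a small sketch, assuming a signed Q-format in which a real value v is stored as round(v·2^f). The helper names are hypothetical. Because Python integers never overflow, the sketch checks ranges explicitly; on the phone, the product of two 32-bit values would need either a 64-bit intermediate or pre-truncated inputs, which is the problem the FFT slides address. The final loop shows why relative error matters: the same absolute rounding step costs far more, relatively, on small values.

```python
def to_fixed(value: float, frac_bits: int, total_bits: int = 32) -> int:
    """Store a real value as round(value * 2**frac_bits) in a signed integer."""
    q = round(value * (1 << frac_bits))
    if not -(1 << (total_bits - 1)) <= q < (1 << (total_bits - 1)):
        raise OverflowError("value does not fit: re-scale before converting")
    return q


def from_fixed(q: int, frac_bits: int) -> float:
    """Convert a fixed-point integer back to a real value."""
    return q / (1 << frac_bits)


def fixed_mul(a: int, b: int, frac_bits: int, total_bits: int = 32) -> int:
    """Multiply two fixed-point values and re-scale by truncating frac_bits bits."""
    p = (a * b) >> frac_bits  # floor shift: the truncation step
    if not -(1 << (total_bits - 1)) <= p < (1 << (total_bits - 1)):
        raise OverflowError("product does not fit in total_bits")
    return p


def relative_error(exact: float, approx: float) -> float:
    return abs(approx - exact) / abs(exact)


# The same Q15 rounding step hurts small values much more, relatively
for v in (0.1, 0.001):
    q = to_fixed(v, 15)
    print(v, q, relative_error(v, from_fixed(q, 15)))
```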

12
FFT, Fixed-Point
  • Computes the frequency spectrum of speech
  • Biggest source of numerical error:
    • Butterflies contain multiplications
    • Each layer repeats the truncation errors
  • Fixed number of bits per element:
    • 32, the native integer size in many systems
  • Reference implementation: FFTGEN
    • http://www.jjj.de/fft/fftgen.tgz
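To see how truncation repeats layer after layer, here is a sketch of a radix-2 decimation-in-time FFT on integers in which every twiddle multiplication is followed by a right shift. It is not FFTGEN's code; the twiddle precision `tw_bits` and the function names are assumptions, and overflow is avoided only because Python integers are unbounded. Comparing the result with `dft_reference` shows the accumulated error.

```python
import cmath
import math
from typing import List, Tuple


def fft_fixed(re: List[int], im: List[int], tw_bits: int = 14) -> Tuple[List[int], List[int]]:
    """Iterative radix-2 DIT FFT on integer data (hypothetical sketch).

    Twiddles are round(2**tw_bits * cos/sin). Each twiddle product is shifted
    right by tw_bits (floor), so every layer truncates again.
    """
    n = len(re)
    if n == 0 or n & (n - 1) or len(im) != n:
        raise ValueError("length must be a power of two")
    re, im = list(re), list(im)
    # Bit-reversal permutation of the input
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            re[i], re[j] = re[j], re[i]
            im[i], im[j] = im[j], im[i]
    scale = 1 << tw_bits
    size = 2
    while size <= n:  # one pass per FFT layer
        half = size // 2
        for k in range(half):
            angle = -2.0 * math.pi * k / size
            wr = round(scale * math.cos(angle))
            wi = round(scale * math.sin(angle))
            for start in range(0, n, size):
                a, b = start + k, start + k + half
                # Butterfly multiplication followed by truncation
                tr = (re[b] * wr - im[b] * wi) >> tw_bits
                ti = (re[b] * wi + im[b] * wr) >> tw_bits
                re[b], im[b] = re[a] - tr, im[a] - ti
                re[a], im[a] = re[a] + tr, im[a] + ti
        size *= 2
    return re, im


def dft_reference(x: List[int]) -> List[complex]:
    """Floating-point DFT used as the 'correct' spectrum."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) for k in range(n)]
```

Lowering `tw_bits` or increasing the FFT size (more layers) makes the gap to `dft_reference` grow.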

13
FFTGEN (16/16)
  • Multiplication: the result of a 32 × 32-bit product must fit in
    32 bits, so truncate the inputs
  • FFTGEN: truncate inputs to 16/16 bits

Figure: FFT layer input (16-bit integer) × FFT twiddle factor (16-bit integer) gives a 32-bit multiplication result; its 16 used bits become part of the FFT layer output, and the other 16 bits are cropped off for the next layer.
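A sketch of the 16/16 multiplication in the figure, under the assumption that the twiddle factor is stored as round(32767·cos θ) in a signed 16-bit integer (FFTGEN's exact scaling is not shown on the slide). The function name is hypothetical. Both operands fit in 16 bits, so the product fits in 32 bits, and the low 16 bits are then cropped off.

```python
import math


def mul_16_16(x: int, w: int) -> int:
    """16/16 twiddle multiplication (hypothetical sketch).

    x: FFT layer input already truncated to a signed 16-bit integer
    w: twiddle factor as a signed 16-bit integer
    The product fits in 32 bits; its low 16 bits are cropped off.
    """
    assert -(1 << 15) <= x < (1 << 15), "layer input must fit in 16 bits"
    assert -(1 << 15) <= w < (1 << 15), "twiddle must fit in 16 bits"
    p = x * w
    assert -(1 << 31) <= p < (1 << 31), "product must fit in 32 bits"
    return p >> 16  # arithmetic shift: keep the upper 16 bits


# Assumed twiddle scaling: round(32767 * cos(theta))
W_PI_4 = round(32767 * math.cos(math.pi / 4))  # 23170
print(mul_16_16(1000, W_PI_4))  # 353
print(mul_16_16(3, W_PI_4))     # 1
```

With this assumed scaling, cropping 16 bits rather than 15 also halves each product, so 1000·cos(π/4) comes out as 353 rather than about 707. An input of 3 yields just 1, so almost nothing of a low-amplitude signal survives the crop.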
14
Info Preserving FFT (22/10)
  • Approximate the DFT operator F with G
  • Increase ‖F − G‖ to preserve more signal
    information
  • Choose the twiddle scale that minimizes the maximum relative error
    of the scaled sine values; a scale of 980 is good for FFT
    sizes up to 1024
  • Truncate multiplication inputs to 22/10 bits
    (signal/operator)
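The scale selection in the third bullet can be sketched as a search over integer scales, assuming the criterion is the maximum relative rounding error of round(scale·sin(2πk/N)) over all k with a nonzero sine. The slides do not give the exact criterion, so `max_rel_error` is a hypothetical reconstruction, and the scale it favours need not be exactly 980.

```python
import math


def max_rel_error(scale: int, n: int) -> float:
    """Worst relative error of round(scale * sin(2*pi*k/n)) over k = 0..n-1.

    Entries whose exact value is (numerically) zero are skipped, because their
    relative error is undefined.
    """
    worst = 0.0
    for k in range(n):
        exact = scale * math.sin(2.0 * math.pi * k / n)
        if abs(exact) < 1e-6:
            continue
        worst = max(worst, abs(round(exact) - exact) / abs(exact))
    return worst


print(max_rel_error(980, 1024), max_rel_error(1024, 1024))

# Hypothetical search over candidate scales for FFT size 1024
best = min(range(900, 1025), key=lambda s: max_rel_error(s, 1024))
print(best)
```

The smallest nonzero sine tends to dominate: for N = 1024, 1024·sin(2π/1024) ≈ 6.28 rounds to 6 (about 4.5% error), whereas 980·sin(2π/1024) ≈ 6.01 rounds to 6 (about 0.2% error). A power-of-two scale is therefore not automatically the best choice.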

Figure: FFT layer input (32-bit integer, 22 bits used, 10 bits cropped off) × FFT twiddle factor (16-bit integer, 10 bits used) gives a 32-bit multiplication result; 10 bits are cropped off before it becomes part of the FFT layer output for the next layer.
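A sketch of the 22/10 multiplication, assuming twiddle components are round(980·cos θ), so that their magnitude stays below 2^10. The names are hypothetical. The worst-case check shows why 22 + 10 bits is safe: 2^21 · 980 = 2,055,208,960, which is below 2^31. The slides do not show how the 980/1024 gain per layer is compensated, so the sketch leaves it uncorrected.

```python
import math

SCALE = 980  # twiddle scale given on the slide for FFT sizes up to 1024


def twiddle_10(angle: float) -> int:
    """Twiddle component scaled by 980; |value| <= 980 < 2**10."""
    return round(SCALE * math.cos(angle))


def mul_22_10(x: int, w: int) -> int:
    """22/10 multiplication (hypothetical sketch): 22-bit signal, 10-bit twiddle, crop 10 bits."""
    assert -(1 << 21) <= x < (1 << 21), "signal must fit in 22 bits"
    assert -SCALE <= w <= SCALE, "twiddle magnitude must not exceed the scale"
    p = x * w
    assert -(1 << 31) <= p < (1 << 31), "product must fit in 32 bits"
    return p >> 10  # crop off 10 low bits for the next layer


w = twiddle_10(math.pi / 4)             # 693
print(mul_22_10(1000, w))               # 676
print(mul_22_10(-(1 << 21), -SCALE))    # 2007040: the worst case still fits
```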
15
FFT Spectrum, Fixed-Point
  • x-axis: absolute values of the fixed-point FFT elements
  • y-axis: correct absolute values of the FFT elements

Figure: 16/16 vs. 22/10 absolute values, for the original TIMIT signal and for the TIMIT signal × 4
16
Scale of Error in Proposed FFT
17
Magnitude Spectrum, Fixed-Point
  • Compute complex absolute values using the maximum
    coordinate and the coordinate ratio
  • Suppose x > y for z = x + iy; then |z| = x·√(1 + (y/x)²)
  • Denote the (squared) ratio y/x by t
  • Approximate the square root by a polynomial P(t)
  • Constant-time algorithm (vs. Newton iteration)
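The magnitude approximation can be sketched as follows. The polynomial is not given on the slide; here P(t) is a hypothetical quadratic that interpolates √(1 + t) at t = 0, 1/2 and 1, with t = (y/x)², which by the standard interpolation error bound keeps the relative error below about 0.3% on [0, 1]. The sketch uses floating point for readability; a fixed-point version would store the three coefficients as scaled integers. Either way the cost is one division and a few multiplications and additions, independent of the input, unlike an iterative Newton square root.

```python
import math
from typing import Tuple


def fit_quadratic() -> Tuple[float, float, float]:
    """Coefficients (a, b, c) of P(t) = a + b*t + c*t**2 interpolating sqrt(1 + t)
    at t = 0, 1/2, 1 via Newton divided differences (hypothetical choice of P)."""
    t0, t1, t2 = 0.0, 0.5, 1.0
    f0, f1, f2 = 1.0, math.sqrt(1.5), math.sqrt(2.0)
    d01 = (f1 - f0) / (t1 - t0)
    d12 = (f2 - f1) / (t2 - t1)
    d012 = (d12 - d01) / (t2 - t0)
    # Expand f0 + d01*(t - t0) + d012*(t - t0)*(t - t1) into monomials
    a = f0 - d01 * t0 + d012 * t0 * t1
    b = d01 - d012 * (t0 + t1)
    c = d012
    return a, b, c


A, B, C = fit_quadratic()


def approx_abs(x: float, y: float) -> float:
    """|x + iy| as max(|x|, |y|) * P(t) with t = (min/max)**2, in constant time."""
    big, small = abs(x), abs(y)
    if big < small:
        big, small = small, big
    if big == 0.0:
        return 0.0
    t = (small / big) ** 2               # t lies in [0, 1]
    return big * (A + t * (B + t * C))   # Horner evaluation of P(t)


print(approx_abs(3.0, 4.0), math.hypot(3.0, 4.0))
```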

18
Logarithm, Fixed-Point
  • Use base 2 instead of base 10:
    • The change of base corresponds to a multiplication of the output
  • Standard technique:
    • Reduce the problem to the interval [1, 2)
    • Use linear interpolation between values stored in a
      look-up table
    • 8 bits are used to index the look-up table
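The table-based base-2 logarithm can be sketched with integers only. Switching bases is just a constant multiplication, log10 x = log2 x · log10 2, which is why base 2 is enough. In this hypothetical version, the input is normalized to m ∈ [1, 2) via its bit length, the top 8 fraction bits of m index a 257-entry table of log2(1 + i/256), and the remaining bits interpolate linearly between neighbouring entries. The Q16 output format, the 31-bit mantissa and all names are assumptions.

```python
import math

FRAC_BITS = 16   # output format: Q16 (assumption)
TABLE_BITS = 8   # 8 bits index the look-up table, as on the slide
MANT_BITS = 31   # mantissa fraction precision used for normalization (assumption)

# TABLE[i] = log2(1 + i/256) in Q16; one extra entry for interpolation at the top
TABLE = [round(math.log2(1.0 + i / (1 << TABLE_BITS)) * (1 << FRAC_BITS))
         for i in range((1 << TABLE_BITS) + 1)]


def log2_fixed(x: int) -> int:
    """Approximate log2(x) for a positive integer x, returned in Q16."""
    if x <= 0:
        raise ValueError("log2 is defined only for positive inputs")
    e = x.bit_length() - 1                 # x = 2**e * m with m in [1, 2)
    # f = (m - 1) * 2**MANT_BITS: the fraction bits of the normalized mantissa
    if e >= MANT_BITS:
        f = (x >> (e - MANT_BITS)) - (1 << MANT_BITS)
    else:
        f = (x << (MANT_BITS - e)) - (1 << MANT_BITS)
    shift = MANT_BITS - TABLE_BITS
    idx = f >> shift                       # top 8 fraction bits: table index
    rem = f & ((1 << shift) - 1)           # remaining bits: interpolation weight
    lo, hi = TABLE[idx], TABLE[idx + 1]
    return (e << FRAC_BITS) + lo + (((hi - lo) * rem) >> shift)


print(log2_fixed(1024) / (1 << FRAC_BITS))  # 10.0
print(log2_fixed(1000) / (1 << FRAC_BITS), math.log2(1000))
```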

19
Rest of System, Fixed-Point
  • No improvement needed in VQ/GLA
  • A technique similar to the FFT one should be applied to
    the other signal processing steps:
    • Pre-emphasis: utilize the full 32 bits
    • Time windowing: use fewer bits in the windowing
      function
    • Filter bank (FB): use fewer bits in the frequency responses
    • DCT: use fewer bits for the cosines
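One way to read "utilize the full 32 bits" for pre-emphasis is sketched below: with 16-bit PCM input and the coefficient 0.97 stored in Q15 (31785), the output is kept in Q15 inside a 32-bit integer instead of being rounded back to 16 bits. The format choice and the names are assumptions. The largest magnitudes reach about 2.115·10^9, still below 2^31 ≈ 2.147·10^9.

```python
from typing import List

ALPHA_Q15 = round(0.97 * (1 << 15))  # 31785: pre-emphasis coefficient in Q15 (assumed format)


def pre_emphasis_fixed(samples: List[int]) -> List[int]:
    """y[n] = x[n] - 0.97 * x[n-1] for 16-bit samples, kept in Q15 in 32 bits.

    Instead of rounding y back to 16 bits, the result keeps 15 fraction bits,
    using almost the whole signed 32-bit range. x[-1] is taken as 0.
    """
    out = []
    prev = 0
    for x in samples:
        assert -(1 << 15) <= x < (1 << 15), "input must be 16-bit PCM"
        y = (x << 15) - ALPHA_Q15 * prev
        assert -(1 << 31) <= y < (1 << 31), "result must fit in 32 bits"
        out.append(y)
        prev = x
    return out


print(pre_emphasis_fixed([100, 100]))  # [3276800, 98300]
```

The second output, 98300 / 2^15 ≈ 2.9999, keeps the fractional part of 100 − 0.97·100 = 3 that a 16-bit result would round away.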

20
Effect of Signal Processing
  • TIMIT data sets, varying number of speakers (N)
  • For each N, repeat train/recognize cycles (6x, 5x, 2x)
    to eliminate the randomness of the GLA initial
    solution
  • FFTGEN: FFT with 16/16 multiplication
  • Fixed-point: uses the proposed 22/10 FFT
  • Mixed: floating-point DSP, fixed-point GLA/VQ
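The repeated train/recognize cycles are needed because the GLA (generalized Lloyd algorithm) starts from a random initial codebook. A minimal GLA sketch, assuming random-sample initialization, squared Euclidean distance and a fixed iteration count, shows where that randomness enters: different `seed` values can lead to different codebooks. The names and parameters are hypothetical.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gla(vectors: List[Vector], k: int, iterations: int = 20, seed: int = 0) -> List[List[float]]:
    """Generalized Lloyd algorithm: alternate nearest-neighbour partition and centroid update."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, k)]  # random initial solution
    for _ in range(iterations):
        cells: List[List[Vector]] = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda j: sq_dist(v, codebook[j]))
            cells[nearest].append(v)
        for j, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous code vector
                codebook[j] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook


def distortion(vectors: List[Vector], codebook: List[List[float]]) -> float:
    return sum(min(sq_dist(v, c) for c in codebook) for v in vectors) / len(vectors)


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
for seed in range(3):
    print(seed, round(distortion(data, gla(data, 2, seed=seed)), 3))
```

On real MFCC data the final distortion, and hence the recognition result, can vary from seed to seed, which is what averaging over several cycles smooths out.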

21
Effect of Signal Quality
  • GSM/PC data: 16 aligned dual recordings
  • All computations in floating-point arithmetic
  • Signal recorded with a laptop and a PC microphone gives an
    average recognition rate of 100%
  • Signal recorded with a Nokia 3660 results in an
    average recognition rate of 84.9%

22
Conclusion
  • Speaker identification was ported to a Symbian
    Series 60 mobile phone
  • 22/10-bit usage in multiplication is proposed
    instead of the standard 16/16
  • Experiments indicate that recognition accuracy
    improves from 68% to 95%