Title: Speaker Recognition
1. Speaker Recognition
PUMS 2003-2004 seminar, 14.10.2004, Turku
Pasi Fränti, Juhani Saastamoinen, Evgeny Karpov,
Ville Hautamäki, Tomi Kinnunen, Ismo Kärkkäinen
- University of Joensuu
- Department of Computer Science
2. PUMS project
Juhani Saastamoinen: Project manager
Pasi Fränti: Professor
Evgeny Karpov: Project researcher
Tomi Kinnunen: Researcher
Ismo Kärkkäinen: Clustering algorithms
Ville Hautamäki: Project researcher
3. PUMS JoY
- Speaker recognition
- PUMS season 2003-2004
- Identification only, no verification
- Port to a mobile phone
- Feature fusion
- Real-time operation
- http://cs.joensuu.fi/pages/pums
4. Application Scenarios
5. Identification System
[Diagram; labels as recovered from the slide]
- Add trained speaker profiles to the speaker profile database
- Use all profiles in recognition
- Output: decision
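6. Results 2003-2004
[Architecture diagram; labels as recovered from the slide]
- UI platforms: TCL/TK (HY), console UI, Windows, Series60
- Applications: SpeakerProfiler, sprofiler, Winsprofiler, ProfMatch, Epocsprofiler
- Common speaker recognition application interface
- Library and components: srlib, Fusion, Real-time, Speech features (HY), DB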
7. Planned Results
[Architecture diagram; labels as recovered from the slide]
- Applications: large-scale database, teleconference, access control, mobile phone login?
- Results 2003-2004: sprofiler, SpeakerProfiler, Winsprofiler, ProfMatch, Epocsprofiler
- Common speaker recognition application interface
- Library and components: srlib, Fusion, Verification, Real-time, Segmentation, Speech features (HY), VAD, DB
8. System in Mobile Phone
Port to Symbian OS with the Series 60 UI platform
9. Symbian Phones
- Series 60 phone features:
  - 16 MB ROM
  - 8 MB RAM
  - 176 x 208 display
  - 32-bit ARM processor
  - No floating-point unit!
- UI platforms shown: UIQ, Series 60, Series 80
10. FFTGEN
- Multiplication results must fit in 32 bits, so the multiplication inputs are truncated
- FFTGEN: truncate to 16/16 bits (16/16 FFT)
[Diagram] FFT layer input (16-bit integer) x FFT twiddle factor (16-bit integer) = 32-bit multiplication result: 16 used bits, 16 crop-off bits. FFT layer output (part of it): 16-bit integer. Crop-off for next layer: 16 bits!
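The 16/16 truncation can be sketched as follows. This is a minimal illustration, not FFTGEN's actual code: the function name `mul_16_16`, the use of signed 16-bit inputs, and the arithmetic right shift by 16 are assumptions made for the example.

```python
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def mul_16_16(signal: int, twiddle: int) -> int:
    """Multiply two 16-bit integers and crop the low 16 bits of the product.

    Hypothetical helper: models one multiplication inside an FFT layer.
    """
    if not (INT16_MIN <= signal <= INT16_MAX):
        raise ValueError("signal must already be truncated to 16 bits")
    if not (INT16_MIN <= twiddle <= INT16_MAX):
        raise ValueError("twiddle must already be truncated to 16 bits")
    product = signal * twiddle            # magnitude at most 2**30
    assert INT32_MIN <= product <= INT32_MAX
    return product >> 16                  # keep 16 used bits, drop 16 crop-off bits


# A large input keeps most of its precision; a small one loses nearly all of it.
large = mul_16_16(16384, 32767)
small = mul_16_16(3, 32767)
tiny = mul_16_16(1, 32767)
```

With only 16 bits left for the signal after every layer, small spectral components are quickly rounded away, which is the motivation for giving the signal more bits than the twiddle factor.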
11. Proposed Information Preserving 22/10 FFT
- Approximate the DFT operator F with G
- Allow a larger operator error F-G in order to preserve more signal information
- Minimize the maximum relative error in scaled sine values with respect to the scale; 980 is good for FFT sizes up to 1024
- Truncate multiplication inputs to 22/10 bits (signal/op)
[Diagram] FFT layer input (32-bit integer, 22 bits used) x FFT twiddle factor (16-bit integer, 10 bits used) = 32-bit multiplication result: 22 used bits, 10 crop-off bits. FFT layer output (part of it). Crop-off for next layer: 10 bits.
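A minimal sketch of the two ideas on this slide, under stated assumptions. `max_relative_sine_error` measures how well integer-rounded sine values at a given scale approximate the true sines (evaluated over the first quadrant only), and `mul_22_10` models one 22/10 multiplication. Both function names, the rounding rule, and the arithmetic right shift are hypothetical choices for illustration; the scale 980 and the 22/10 split are from the slide.

```python
import math

SCALE = 980  # twiddle scale quoted on the slide for FFT sizes up to 1024


def max_relative_sine_error(scale: int, fft_size: int) -> float:
    """Worst relative error of round(scale * sin) / scale over the first quadrant."""
    worst = 0.0
    for k in range(1, fft_size // 4 + 1):
        exact = math.sin(2.0 * math.pi * k / fft_size)
        approx = round(scale * exact) / scale
        worst = max(worst, abs(approx - exact) / exact)
    return worst


def mul_22_10(signal: int, twiddle: int) -> int:
    """Multiply a 22-bit signal value by a 10-bit twiddle value, crop 10 bits."""
    if not (-2**21 <= signal < 2**21):
        raise ValueError("signal must fit in 22 bits")
    if not (-2**10 < twiddle < 2**10):
        raise ValueError("twiddle must fit in 10 bits plus sign")
    product = signal * twiddle            # magnitude below 2**31, fits in 32 bits
    return product >> 10                  # keep 22 used bits, drop 10 crop-off bits


# Compare the slide's scale with a nearby round number.
error_980 = max_relative_sine_error(SCALE, 1024)
error_1000 = max_relative_sine_error(1000, 1024)
```

In this sketch the smallest sine values land close to integers at scale 980, so its worst-case relative error is far lower than at scale 1000. The result of `mul_22_10` carries the twiddle scale divided by 1024, which a full FFT would have to compensate for; that bookkeeping is omitted here.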
12. Scale of Error in Proposed FFT
Log10 of relative error in FFT elements:
                      FFTGEN    22/10 FFT
average               -0.775    -2.118
standard deviation     0.797     0.590
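To read the table: an average log10 of -0.775 corresponds to a typical (geometric-mean) relative error of roughly 0.17, while -2.118 corresponds to roughly 0.0076. The sketch below shows one way such statistics could be computed from a reference FFT and a fixed-point FFT. The function names and the use of the sample standard deviation are assumptions; the slide does not say how its figures were obtained.

```python
import math
import statistics


def log10_relative_errors(reference, approximate):
    """log10(|approx - ref| / |ref|) for each element with a nonzero error."""
    logs = []
    for ref, approx in zip(reference, approximate):
        error = abs(approx - ref)
        if error > 0 and abs(ref) > 0:
            logs.append(math.log10(error / abs(ref)))
    return logs


def summarize(logs):
    """Return (average, standard deviation) of the log10 relative errors."""
    return statistics.mean(logs), statistics.stdev(logs)


# Converting the slide's averages back from log10 to plain relative error.
typical_fftgen = 10 ** -0.775
typical_22_10 = 10 ** -2.118
```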
13. Mobile Phone Results
TIMIT, 100 speakers       recog. rate (%)    std. dev. (%)
FLOAT                     100.0              N/A
FFTGEN                    9.7                1.6
FIXED                     95.8               1.2
MIXED                     100.0              N/A
MIXED2                    98.0               0.6

implementation, signal    recog. rate (%)    std. dev. (%)
FLOAT, Symbian audio      83.2               4.38
FLOAT, PC audio           100.0              N/A
FIXED, Symbian audio      76.0               2.83
FIXED, PC audio           100.0              N/A
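14. Improving Accuracy by Information Fusion
[Diagram] The feature vector is split into feature sets, each scored by its own classifier:
- Feature set 1 (e.g. 5 MFCCs) -> Classifier 1 -> score 1
- Feature set 2 (e.g. F0, Δ-F0) -> Classifier 2 -> score 2
- Feature set 3 (e.g. formants F1, F2, F3) -> Classifier 3 -> score 3
- ...
Score combiner (score 1, score 2, score 3) -> Decision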
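A minimal sketch of score-level fusion as drawn in the diagram. The slide does not say which combination rule is used, so the min-max normalisation and the weighted sum below are assumptions, as are the function names, speaker names, scores, and weights. Scores are treated as distances (lower is better).

```python
def fuse_scores(score_sets, weights):
    """Combine per-classifier scores into one fused score per speaker.

    score_sets: one dict per classifier, mapping speaker -> distance score.
    weights: one weight per classifier (hypothetical values).
    """
    if len(score_sets) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    fused = {}
    for scores, weight in zip(score_sets, weights):
        low, high = min(scores.values()), max(scores.values())
        span = (high - low) or 1.0        # avoid division by zero
        for speaker, score in scores.items():
            normalised = (score - low) / span
            fused[speaker] = fused.get(speaker, 0.0) + weight * normalised
    return fused


def decide(fused):
    """Pick the speaker with the smallest fused distance."""
    return min(fused, key=fused.get)


# Illustrative scores from three classifiers (MFCC, F0, formants).
mfcc = {"anna": 0.2, "ben": 0.9, "carl": 0.5}
f0 = {"anna": 12.0, "ben": 10.0, "carl": 30.0}
formants = {"anna": 3.0, "ben": 4.0, "carl": 3.5}
winner = decide(fuse_scores([mfcc, f0, formants], [0.5, 0.25, 0.25]))
```

Normalising each classifier's scores before summing keeps a feature set with a large numeric range (here F0) from dominating the decision.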
15. Information Fusion Results
N/A
16. Real-Time Speaker Identification
[Flow diagram; labels as recovered from the slide]
- Speech input stream -> all frames -> non-silent frames -> feature vectors -> reduced set of vectors
- Speaker database: Speaker 1 model ... Speaker N model, divided into active speakers and pruned speakers
- Database pruning -> list of candidate speakers
- Decision? Yes: END. No: continue.
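The loop in the diagram can be sketched as follows, assuming VQ codebooks as speaker models and a simple rule that drops a fixed number of the worst-matching speakers after each block of vectors. The block size, the drop rule, and all names are hypothetical; silence removal and feature extraction are assumed to have happened already.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_code_distance(vector, codebook):
    """Distance from a feature vector to its nearest code vector."""
    return min(squared_distance(vector, code) for code in codebook)


def identify(vectors, codebooks, block_size=2, drop_per_block=1):
    """Return the best-matching speaker, pruning the database as input arrives.

    codebooks: dict mapping speaker -> list of code vectors.
    """
    totals = {speaker: 0.0 for speaker in codebooks}    # active speakers
    for start in range(0, len(vectors), block_size):
        block = vectors[start:start + block_size]
        for speaker in totals:
            totals[speaker] += sum(
                nearest_code_distance(v, codebooks[speaker]) for v in block
            )
        # Database pruning: drop the worst-matching active speakers.
        ranked = sorted(totals, key=totals.get)
        keep = max(1, len(ranked) - drop_per_block)
        totals = {speaker: totals[speaker] for speaker in ranked[:keep]}
        if len(totals) == 1:                            # decision reached
            break
    return min(totals, key=totals.get)


codebooks = {
    "s1": [(0.0, 0.0), (1.0, 1.0)],
    "s2": [(5.0, 5.0), (6.0, 6.0)],
    "s3": [(10.0, 10.0), (11.0, 11.0)],
}
vectors = [(5.1, 5.0), (6.0, 5.9), (5.2, 5.1), (5.9, 6.1)]
best = identify(vectors, codebooks)
```

Each pruned speaker removes one codebook search per remaining vector, so the work shrinks as the utterance proceeds.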
17. Results: Baseline System (TIMIT)
(Average length of test utterance: 8.9 s)
Real-time requirement satisfied
18. Results: Pre-Quantization (TIMIT)
(Codebook size 64)
- Averaging performs worst, clustering best
- About 21 speed-up to full search (no pre-quantization) without degradation in accuracy
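Pre-quantization reduces the number of test vectors before they are matched against the speaker models. The sketch below shows the two variants named on the slide in their simplest form: averaging consecutive vectors, and clustering with plain k-means. The function names, the initialisation, and the iteration count are assumptions; the actual methods evaluated may differ.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return tuple(sum(component) / count for component in zip(*vectors))


def prequantize_by_averaging(vectors, group_size):
    """Replace each run of group_size consecutive vectors by its mean."""
    return [
        mean_vector(vectors[i:i + group_size])
        for i in range(0, len(vectors), group_size)
    ]


def prequantize_by_clustering(vectors, target_size, iterations=10):
    """Reduce vectors to at most target_size centroids with plain k-means."""
    step = max(1, len(vectors) // target_size)
    centroids = list(vectors[::step][:target_size])     # evenly spaced seeds
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for vector in vectors:
            nearest = min(
                range(len(centroids)),
                key=lambda j: squared_distance(vector, centroids[j]),
            )
            clusters[nearest].append(vector)
        # Keep the old centroid when a cluster ends up empty.
        centroids = [
            mean_vector(members) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
    return centroids


frames = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
averaged = prequantize_by_averaging(frames, 2)
clustered = prequantize_by_clustering(frames, 2)
```

Averaging merges vectors only because they are adjacent in time, whereas clustering merges vectors that are actually similar, which is one plausible reason for the ranking reported on the slide.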
19. Results: Pruning Variants (TIMIT)
(Codebook size 64)
- Recommended method: adaptive pruning (AP)
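A minimal sketch of what an adaptive pruning step might look like. The slide does not define the rule, so the threshold used here (mean plus `eta` times the standard deviation of the active speakers' accumulated distances) is an assumption, as are the function name and the parameter `eta`.

```python
import statistics


def adaptive_prune(distances, eta=1.0):
    """Keep speakers whose accumulated distance is at most mean + eta * std.

    distances: dict mapping active speaker -> accumulated match distance.
    eta: hypothetical control parameter; larger values prune fewer speakers.
    """
    if len(distances) < 2:
        return dict(distances)
    values = list(distances.values())
    threshold = statistics.mean(values) + eta * statistics.pstdev(values)
    return {
        speaker: distance
        for speaker, distance in distances.items()
        if distance <= threshold
    }


active = {"a": 1.0, "b": 1.2, "c": 0.9, "d": 5.0}
survivors = adaptive_prune(active, eta=1.0)
```

Because the threshold follows the current score distribution, the number of speakers removed adapts to how clearly the candidates are separated. For non-negative `eta` the best speaker always survives, since the minimum distance never exceeds the mean.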
20. Results: PQ, Pruning and PQP (TIMIT)
(Codebook size 64)
- Recommended method: combination of pre-quantization and pruning (PQP)
21. Results: VQ vs. GMM (TIMIT)
(Average length of test utterance: 8.9 s)
VQ
- Best time: 0.27 s (33 x realtime) at error rate 0.32 %
- Smallest error: 0.00 % at 0.31 s (28 x realtime)
GMM
- Best time: 0.18 s (49 x realtime) at error rate 0.16 %
- Smallest error: 0.16 % at 0.18 s (49 x realtime)
22. Results: VQ vs. GMM (NIST-1999)
(Average length of test utterance: 30.4 s)
VQ
- Best time: 0.82 s (37 x realtime) at error rate 19.36 %
- Smallest error: 16.90 % at 37.9 s (0.8 x realtime)
GMM
- Best time: 0.48 s (63 x realtime) at error rate 19.22 %
- Smallest error: 17.34 % at 11.4 s (3 x realtime)
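The "x realtime" figures appear to be the average utterance length divided by the processing time. This reading is inferred from the numbers on the slides, not stated there. A short check, with a hypothetical function name:

```python
def realtime_factor(utterance_seconds, processing_seconds):
    """How many times faster than realtime the identification runs."""
    return utterance_seconds / processing_seconds


# TIMIT (8.9 s utterances) and NIST-1999 (30.4 s utterances), values from the slides.
timit_best_time = realtime_factor(8.9, 0.27)        # about 33
nist_best_time = realtime_factor(30.4, 0.82)        # about 37
nist_smallest_error = realtime_factor(30.4, 37.9)   # below 1: slower than realtime
```

A factor below 1 means the system cannot keep up with incoming speech.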