Speaker Recognition - PowerPoint PPT Presentation

1
Speaker Recognition
PUMS 2003-2004 seminar, 14.10.2004, Turku
Pasi Fränti, Juhani Saastamoinen, Evgeny Karpov,
Ville Hautamäki, Tomi Kinnunen, Ismo Kärkkäinen
  • University of Joensuu,
  • Department of Computer Science

2
PUMS project
Juhani Saastamoinen: Project manager
Pasi Fränti: Professor
Evgeny Karpov: Project researcher
Tomi Kinnunen: Researcher
Ismo Kärkkäinen: Clustering algorithms
Ville Hautamäki: Project researcher
3
PUMS JoY
  • Speaker Recognition
  • PUMS season 2003-2004
  • Identification, no verification
  • Port to a mobile phone
  • Feature fusion
  • Real-time
  • http://cs.joensuu.fi/pages/pums

4
Application Scenarios
5
Identification System
Add trained speaker profiles
Use all profiles in recognition
Speaker Profile Database
Decision
6
Results 2003-2004
[Diagram labels]
Programs: SpeakerProfiler, sprofiler, Winsprofiler, ProfMatch, Epocsprofiler
UI and platform labels: TCL/TK (HY), console UI, Windows, Series60
Shared layer: common speaker recognition app. interface
Other components: srlib, Fusion, Real-time, Speech features (HY), DB
7
Planned Results
[Diagram labels]
Large scale database
Applications: Teleconference, Access control, Mobile phone login?
Results 2003-2004: sprofiler, SpeakerProfiler, Winsprofiler, ProfMatch, Epocsprofiler
Shared layer: common speaker recognition app. interface
Other components: srlib, Fusion, Verification, Real-time, Segmentation, Speech features (HY), VAD, DB
8
System in Mobile Phone
Port to Symbian OS with Series 60 UI platform
9
Symbian Phones
  • Series 60 phone features
  • 16 MB ROM
  • 8 MB RAM
  • 176 x 208 display
  • 32-bit ARM-processor
  • No floating-point unit!!!

UIQ
Series 60
Series 80
10
FFTGEN
  • Multiplication results must fit in 32 bits, so the
    multiplication inputs are truncated
  • FFTGEN: truncate to 16/16 bits (16/16 FFT)

[Diagram: 16/16 multiplication]
FFT layer input (16-bit integer) x FFT twiddle factor (16-bit integer)
= 32-bit multiplication result: 16 used bits, 16 crop-off bits
FFT layer output (part of it): 16-bit integer; 16 bits are cropped off for the next layer
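
The 16/16 scheme can be illustrated with a minimal Python sketch. It is not the FFTGEN code: the Q15 representation of the twiddle factor and the names `to_q15` and `mul_16_16` are assumptions made only for this example.

```python
Q15_ONE = 1 << 15  # hypothetical Q15 scale for 16-bit twiddle factors

def to_q15(x: float) -> int:
    """Convert a value in [-1, 1) to a 16-bit signed Q15 integer (saturating)."""
    return max(-32768, min(32767, round(x * Q15_ONE)))

def mul_16_16(signal16: int, twiddle16: int) -> int:
    """Multiply two 16-bit integers and keep only the top 16 bits of the product."""
    product32 = signal16 * twiddle16   # magnitude at most 2**30, fits in signed 32 bits
    return product32 >> 16             # crop off the 16 low bits for the next layer

half = to_q15(0.5)
print(mul_16_16(20000, half))  # large input survives the crop
print(mul_16_16(3, half))      # small input is cropped to zero
```

Because the twiddle factor carries 15 fractional bits in this sketch, the 16-bit crop also halves the result, so the first call gives 5000 rather than 10000. The second call gives 0, which shows how low-amplitude signal content disappears when 16 bits are cropped at every layer.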
11
Proposed Information Preserving 22/10 FFT
  • Approximate the DFT operator F with G
  • Increase the approximation error F-G to preserve more signal
    information
  • Minimize the maximum relative error in scaled sine values with
    respect to the scale; 980 is good for FFT sizes up to 1024
  • Truncate multiplication inputs to 22/10 bits (signal/op)
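
The scale criterion can be sketched as follows. This is one possible reading of "maximum relative error in scaled sine values", not the authors' procedure: the error is measured on rounded values of scale * sin(2*pi*k/N) over the first quadrant, and the names `max_relative_error` and `best_scale` are hypothetical.

```python
import math

def max_relative_error(scale: int, fft_size: int) -> float:
    """Largest relative rounding error of round(scale * sin(2*pi*k/N)), k = 1..N/4."""
    worst = 0.0
    for k in range(1, fft_size // 4 + 1):          # first quadrant, sine is positive
        exact = scale * math.sin(2 * math.pi * k / fft_size)
        worst = max(worst, abs(round(exact) - exact) / exact)
    return worst

def best_scale(fft_size: int, op_bits: int = 10) -> int:
    """Search the scales that need exactly `op_bits` bits for the smallest worst-case error."""
    candidates = range(1 << (op_bits - 1), 1 << op_bits)
    return min(candidates, key=lambda s: max_relative_error(s, fft_size))

print(max_relative_error(980, 1024))   # scale quoted on the slide
print(max_relative_error(1023, 1024))  # largest 10-bit scale, for comparison
```

Under this reading, a scale of 980 places the smallest sine values close to integers (980 * sin(2*pi/1024) is about 6.01), which is why it can beat the largest 10-bit scale even though it uses slightly less of the available range.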

[Diagram: 22/10 multiplication]
FFT layer input (32-bit integer: 22 used bits, 10 crop-off bits) x FFT twiddle factor (16-bit integer, 10 bits used)
= 32-bit multiplication result
FFT layer output (part of it): 10 bits are cropped off for the next layer
12
Scale of Error in Proposed FFT
Log10 of relative error in FFT elements
                      FFTGEN    22/10 FFT
average               -0.775    -2.118
standard deviation     0.797     0.590
13
Mobile Phone Results
TIMIT, 100 speakers       recog. rate (%)   std. dev. (%)
FLOAT                     100.0             N/A
FFTGEN                      9.7             1.6
FIXED                      95.8             1.2
MIXED                     100.0             N/A
MIXED2                     98.0             0.6

implementation, signal    recog. rate (%)   std. dev. (%)
FLOAT, Symbian audio       83.2             4.38
FLOAT, PC audio           100.0             N/A
FIXED, Symbian audio       76.0             2.83
FIXED, PC audio           100.0             N/A
14
Improving Accuracy by Information Fusion
[Diagram]
Feature vector, split into feature sets with one classifier each:
  Feature set 1 (e.g. 5 MFCCs) -> Classifier 1 -> score 1
  Feature set 2 (e.g. F0, Δ-F0) -> Classifier 2 -> score 2
  Feature set 3 (e.g. formants F1, F2, F3) -> Classifier 3 -> score 3
  ...
Scores -> Score combiner -> Decision
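
A score combiner of this kind can be sketched in a few lines. The slide does not say how the scores are combined, so the min-max normalization and the weighted sum below are assumptions, as are the function names and the toy speaker names; scores are treated as distances, so lower is better.

```python
def normalize(scores: dict) -> dict:
    """Min-max normalize one classifier's scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {spk: ((s - lo) / span if span else 0.0) for spk, s in scores.items()}

def fuse(score_sets: list, weights: list) -> dict:
    """Weighted sum of the normalized scores from each classifier."""
    fused = {}
    for scores, weight in zip(score_sets, weights):
        for spk, s in normalize(scores).items():
            fused[spk] = fused.get(spk, 0.0) + weight * s
    return fused

def decide(fused: dict) -> str:
    """Pick the speaker with the smallest fused distance."""
    return min(fused, key=fused.get)

# Toy scores from three hypothetical classifiers (MFCC, F0, formants).
mfcc = {"anna": 1.0, "ben": 3.0, "carl": 2.0}
f0 = {"anna": 5.0, "ben": 4.0, "carl": 9.0}
formants = {"anna": 2.0, "ben": 2.5, "carl": 3.0}
fused = fuse([mfcc, f0, formants], [1 / 3, 1 / 3, 1 / 3])
print(decide(fused))
```

Normalizing before summing keeps a classifier with a large numeric range, here the F0 scores, from dominating the decision; the weights could instead be tuned per feature set.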
15
Information Fusion Results
16
Real-Time Speaker Identification
[Flowchart]
Speech input stream
  -> All frames
  -> Non-silent frames
  -> Feature vectors
  -> Reduced set of vectors
  -> Database pruning
       Speaker database: Speaker 1 model ... Speaker N model
       Active speakers are kept, pruned speakers are dropped
  -> List of candidate speakers
  -> Decision? Yes: END. No: continue with more input.
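
The loop in the flowchart can be sketched as below. This is an illustration under several assumptions, not the PUMS implementation: input arrives as blocks of already extracted feature vectors (silence removal and feature extraction are omitted), each speaker model is a plain list of code vectors, and pruning simply keeps a fixed fraction of the best speakers after each block. All names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def distortion(vec, codebook):
    """Distance from a vector to its nearest code vector."""
    return min(sq_dist(vec, c) for c in codebook)

def identify_stream(blocks, speaker_db, keep_fraction=0.5):
    """Accumulate distortion block by block and prune speakers until one remains."""
    active = {name: 0.0 for name in speaker_db}
    for block in blocks:
        for name in active:
            active[name] += sum(distortion(v, speaker_db[name]) for v in block)
        ranked = sorted(active, key=active.get)          # best (lowest) first
        keep = max(1, int(len(ranked) * keep_fraction))  # static pruning rule
        active = {name: active[name] for name in ranked[:keep]}
        if len(active) == 1:                             # decision reached: END
            break
    return min(active, key=active.get)

speaker_db = {
    "a": [(0, 0), (1, 0)],
    "b": [(5, 5), (6, 5)],
    "c": [(10, 0), (10, 1)],
}
blocks = [[(5.1, 5.0), (5.9, 5.2)], [(6.0, 4.9)]]
print(identify_stream(blocks, speaker_db))
```

With three speakers and `keep_fraction=0.5`, only one speaker survives the first block, so the second block is never examined; the saving comes from stopping before the whole utterance has been matched against every model.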
17
Results: Baseline System (TIMIT)
(Average length of test utterance 8.9 s)
Real-time requirement satisfied
18
Results: Pre-Quantization (TIMIT)
(Codebook size 64)
  • Averaging performs worst, clustering best
  • About 21 speed-up over full search (no
    pre-quantization) without degradation in
    accuracy
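
Two ways of pre-quantizing the test vectors, averaging and clustering, can be sketched as follows. The slide does not specify either procedure, so both functions are assumptions: `pq_average` replaces runs of consecutive vectors by their mean, and `pq_cluster` is a plain k-means with evenly spaced initial centroids.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(group):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(col) / len(group) for col in zip(*group))

def pq_average(vectors, group_size):
    """Replace each run of `group_size` consecutive vectors by its mean."""
    return [mean_vector(vectors[i:i + group_size])
            for i in range(0, len(vectors), group_size)]

def pq_cluster(vectors, k, iterations=10):
    """Reduce the vectors to at most k centroids with plain k-means."""
    step = max(1, len(vectors) // k)
    centroids = [vectors[i] for i in range(0, len(vectors), step)][:k]
    for _ in range(iterations):
        cells = [[] for _ in centroids]
        for v in vectors:
            nearest = min(range(len(centroids)), key=lambda j: sq_dist(v, centroids[j]))
            cells[nearest].append(v)
        # An empty cell keeps its previous centroid.
        centroids = [mean_vector(cell) if cell else old
                     for cell, old in zip(cells, centroids)]
    return centroids

vectors = [(0, 0), (0, 2), (10, 10), (10, 12)]
print(pq_average(vectors, 2))
print(pq_cluster(vectors, 2))
```

Either way, the matching stage sees fewer vectors than the raw frame sequence. Averaging groups vectors by position in time, whereas clustering groups them by similarity.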

19
Results: Pruning Variants (TIMIT)
(Codebook size 64)
  • Recommended method: adaptive pruning (AP)
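
One plausible form of an adaptive pruning rule is sketched below. The slide gives no formula, so the threshold (mean plus `eta` times the standard deviation of the current accumulated distortions) and the parameter name `eta` are assumptions for illustration only.

```python
import statistics

def adaptive_prune(scores: dict, eta: float = 1.0) -> dict:
    """Keep speakers whose accumulated distortion is at or below an adaptive threshold.

    The threshold follows the current score distribution, so the number of
    pruned speakers varies from one pruning step to the next.
    """
    mean = statistics.mean(scores.values())
    spread = statistics.pstdev(scores.values())
    threshold = mean + eta * spread
    return {name: d for name, d in scores.items() if d <= threshold}

scores = {"a": 1.0, "b": 2.0, "c": 9.0}
print(sorted(adaptive_prune(scores, eta=1.0)))
```

For a non-negative `eta` the best speaker always survives, because the smallest distortion cannot exceed the mean. A smaller `eta` prunes more aggressively.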

20
Results: PQ, Pruning and PQP (TIMIT)
(Codebook size 64)
  • Recommended method: combination of
    pre-quantization and pruning (PQP)

21
Results: VQ vs. GMM (TIMIT)
(Average length of test utterance 8.9 s)
VQ
  Best time: 0.27 s (33 x realtime) at error rate 0.32 %
  Smallest error: 0.00 % at 0.31 s (28 x realtime)
GMM
  Best time: 0.18 s (49 x realtime) at error rate 0.16 %
  Smallest error: 0.16 % at 0.18 s (49 x realtime)
22
Results: VQ vs. GMM (NIST-1999)
(Average length of test utterance 30.4 s)
VQ
  Best time: 0.82 s (37 x realtime) at error rate 19.36 %
  Smallest error: 16.90 % at 37.9 s (0.8 x realtime)
GMM
  Best time: 0.48 s (63 x realtime) at error rate 19.22 %
  Smallest error: 17.34 % at 11.4 s (3 x realtime)
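
The "x realtime" figures on the VQ vs. GMM slides appear to be the average utterance length divided by the processing time. The short check below assumes that definition; `realtime_factor` is a name chosen for this example.

```python
def realtime_factor(utterance_s: float, processing_s: float) -> float:
    """How many times faster than realtime the utterance was processed."""
    return utterance_s / processing_s

# (utterance length in s, processing time in s) pairs taken from the two slides.
for utterance_s, processing_s in [(8.9, 0.27), (8.9, 0.18), (30.4, 0.82),
                                  (30.4, 0.48), (30.4, 37.9), (30.4, 11.4)]:
    print(utterance_s, processing_s, round(realtime_factor(utterance_s, processing_s), 2))
```

A factor below 1, as for 37.9 s of processing on a 30.4 s utterance, means the system is slower than realtime. The computed values land close to the quoted ones; small differences are expected because the utterance lengths and times on the slides are themselves rounded.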