Title: Development of the SPACE intelligibility assessment method
1. Development of the SPACE intelligibility assessment method
- Catherine Middag, Gwen Van Nuffelen, Jean-Pierre Martens, Marc De Bodt
2. Introduction
- Intelligibility: a popular measure for pathological speech assessment
- Perceptual assessment is affected by non-speech information
  - familiarity of the listener with the speaker and the type of disorder
    → hard to eliminate this subjective bias
  - guessing on the basis of linguistic context
    → test material design must eliminate this bias
- Replacing the human listener by an automatic speech recognizer (ASR) can solve the two problems, but is the ASR sufficiently reliable?
  - test case: automation of the Dutch Intelligibility Assessment (DIA)
3. Dutch Intelligibility Assessment (DIA)
- 50 isolated (nonsense) words
- intelligibility = percent phonemes correct
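The scoring rule can be illustrated with a minimal sketch. It assumes each test item contributes one target phoneme and one phoneme identified by the listener; the function name and the example items are hypothetical and are not taken from the DIA material.

```python
def percent_phonemes_correct(targets, responses):
    """Percentage of target phonemes that the listener identified correctly."""
    if len(targets) != len(responses):
        raise ValueError("one response is needed per target phoneme")
    if not targets:
        raise ValueError("at least one item is needed")
    correct = sum(t == r for t, r in zip(targets, responses))
    return 100.0 * correct / len(targets)


# Hypothetical items: four target phonemes, one of them misheard (/p/ heard as /b/).
score = percent_phonemes_correct(["p", "o", "l", "s"], ["b", "o", "l", "s"])
print(score)
```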
4. How to apply ASR in the DIA?
- Two approaches
  - let the ASR recognize the words and count the percentage of correct decisions
  - let the ASR check how well, on average, the acoustics support the phonetic transcription of the target word (alignment)
- Our experience
  - intelligibility emerging from the first approach is insufficiently reliable
  - therefore we developed a system based on alignment
5. System architecture (flow chart)
Speech aligner → speaker features → Intelligibility Prediction Model → objective score
6. System architecture (flow chart)
Speech aligner
- Two systems
  - complex state-of-the-art HMM-based system (ASR-ESAT)
  - simple system with a phonological layer (ASR-ELIS), which points more directly to articulatory problems
- Acoustic models trained on speech of normal adult speakers
7. ASR-ESAT
- Acoustic models
  - state-of-the-art semi-continuous HMM
  - triphone models trained on normal speech
  - states tied using decision trees with phonological questions
- Output
  - each frame t is assigned to a state s_t
  - per frame: s_t, P(s_t | X_t)
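8. ASR-ELIS
- Inputs: acoustic frame X_t and the target speech transcription
- PLF extractor: X_t → P(K_1 | X_t), ..., P(K_24 | X_t)
  - 24 binary phonological features concerning
    - voicing
    - manner of articulation
    - place of articulation
- Probability product model: P(K_1 | X_t), ..., P(K_24 | X_t) → P(S_1 | X_t), ..., P(S_n | X_t)
- Viterbi decoder: aligns the frames with the target speech transcription
- Output per frame: s_t, P(s_t | X_t) and P(K_1 | X_t), ..., P(K_24 | X_t)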
9. System architecture (flow chart)
Speech aligner → speaker features → Intelligibility Prediction Model → objective score
Speaker features
- Three feature sets
  - Phonemic features (patient has trouble pronouncing a certain phoneme)
  - Phonological features (patient has problems with voicing, manner or place of articulation)
  - NEW: context-dependent features (patient has problems with a desired change of voicing, manner or place of articulation)
10. Extraction of phonemic features (PMF)
Frame  Phoneme    P(s_t | X_t)
1      (silence)  0.7
2      (silence)  0.5
3      /p/        0.4
4      /p/        0.8
5      /o/        0.6
6      /o/        0.8
7      /l/        0.6
8      (silence)  0.3
Speech aligner: ASR-ESAT
Phonemic features
- (silence): (0.7 + 0.5 + 0.3) / 3
- /p/: (0.4 + 0.8) / 2
- /o/: (0.6 + 0.8) / 2
- /l/: 0.6
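The averaging on this slide can be sketched as follows: each phonemic feature is the mean of the frame posteriors P(s_t | X_t) over all frames aligned to that phoneme. The label "sil" for the unlabelled frames 1, 2 and 8 and the function name are assumptions made for the illustration.

```python
from collections import defaultdict

# (phoneme label, P(s_t | X_t)) per frame, copied from the table above.
# "sil" is an assumed label for the frames that carry no phoneme label.
frames = [("sil", 0.7), ("sil", 0.5), ("p", 0.4), ("p", 0.8),
          ("o", 0.6), ("o", 0.8), ("l", 0.6), ("sil", 0.3)]


def phonemic_features(frames):
    """Mean posterior per phoneme over all frames aligned to it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for phoneme, posterior in frames:
        sums[phoneme] += posterior
        counts[phoneme] += 1
    return {phoneme: sums[phoneme] / counts[phoneme] for phoneme in sums}


pmf = phonemic_features(frames)
for phoneme, value in pmf.items():
    print(phoneme, round(value, 2))
```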
11. Extraction of phonological features (PLF)
Frame  Phone      voiced P(K_1 | X_t)  back P(K_2 | X_t)  burst P(K_3 | X_t)
1      (silence)  0.1                  0.1                0.2
2      (silence)  0.1                  0.1                0.1
3      /pcl/      0.2                  0.1                0.1
4      /p/        0.2                  0.2                0.6
5      /o/        0.8                  0.7                0.2
6      /o/        0.6                  0.9                0.0
7      /l/        0.5                  0.5                0.1
8      (silence)  0.1                  0.1                0.0
Speech aligner: ASR-ELIS
Phonological features
- Burst: 0.6
- Back: (0.7 + 0.9) / 2
- Voiced: (0.8 + 0.6 + 0.5) / 3
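A minimal sketch of the three averages on this slide: for each phonological feature, the posterior is averaged over the frames of the phones in which the feature should be present. The `PRESENT` mapping is an assumption read off the slide's own sums (voiced for /o/ and /l/, back for /o/, burst for /p/); it is not a full phonological specification.

```python
# Per-frame posteriors from the table above: (phone, voiced, back, burst).
# None marks the frames that carry no phone label.
frames = [
    (None, 0.1, 0.1, 0.2),
    (None, 0.1, 0.1, 0.1),
    ("pcl", 0.2, 0.1, 0.1),
    ("p", 0.2, 0.2, 0.6),
    ("o", 0.8, 0.7, 0.2),
    ("o", 0.6, 0.9, 0.0),
    ("l", 0.5, 0.5, 0.1),
    (None, 0.1, 0.1, 0.0),
]
FEATURES = ("voiced", "back", "burst")
# Assumed targets: the phones for which each feature should be present.
PRESENT = {"voiced": {"o", "l"}, "back": {"o"}, "burst": {"p"}}


def positive_plf(frames):
    """Mean posterior of each feature over the frames where it should be present."""
    result = {}
    for column, name in enumerate(FEATURES, start=1):
        values = [row[column] for row in frames if row[0] in PRESENT[name]]
        result[name] = sum(values) / len(values)
    return result


plf = positive_plf(frames)
print({name: round(value, 3) for name, value in plf.items()})
```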
12. Extraction of phonological features (PLF)
(frame table as on slide 11)
Speech aligner: ASR-ELIS
Phonological features
- Not burst: (0.2 + 0.1 + ...
- Not back: (0.1 + 0.1 + ...
- Not voiced: (0.1 + 0.1 + ...
13. Extraction of phonological features (PLF)
(frame table as on slide 11)
Speech aligner: ASR-ELIS
Phonological features
- Irrelevant features for these phones
14. Extraction of context-dependent phonological features (CD-PLF)
- How well is a change in PLF realized?
  - use the PLF target in the preceding/succeeding phone as context
  - binary features → two values for the target (present/absent)
  - binary features → restricted number of left and right contexts
- Left or right context can be
  - present, absent, not relevant, silence
- Model selection (preliminary)
  - maximum 4 × 2 × 4 = 32 CD-PLFs per PLF
    → 768 in total
  - select only those CD-PLFs occurring at least twice in every test
    → 123 in total
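The counts on this slide follow from enumerating the context combinations, as the sketch below shows: 4 left contexts × 2 target values × 4 right contexts per PLF, times the 24 PLFs of ASR-ELIS. The selection helper and its `counts_per_test` input (one dictionary of occurrence counts per test) are hypothetical names for the "at least twice in every test" rule.

```python
from itertools import product

CONTEXTS = ("present", "absent", "not relevant", "silence")  # left or right context
TARGETS = ("present", "absent")                               # binary target value
N_PLF = 24                                                    # PLFs in ASR-ELIS

cd_plfs_per_plf = list(product(CONTEXTS, TARGETS, CONTEXTS))
total = N_PLF * len(cd_plfs_per_plf)
print(len(cd_plfs_per_plf), total)


def select_cd_plfs(counts_per_test, min_count=2):
    """Keep the CD-PLFs that occur at least `min_count` times in every test."""
    if not counts_per_test:
        return set()
    selected = set(counts_per_test[0])
    for counts in counts_per_test:
        selected = {key for key in selected if counts.get(key, 0) >= min_count}
    return selected
```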
15. Extraction of context-dependent phonological features (CD-PLF)
Segment  Phone      voiced  burst
2        (silence)  0.1     0.2
3        /pcl/      0.2     0.2
4        /p/        0.2     0.6
6        /o/        0.6     0.1
7        /s/        0.4     0.3
8        (silence)  0.2     0.1
9        /m/        0.7     0.3
10       /A/        0.8     0.0
11       /l/        0.6     0.1
12       (silence)  0.1     0.1
Speech aligner: ASR-ELIS
CD-PLF features (left context, target, right context)
- voicing
  - off, on, off: 0.6
  - on, on, on: 0.8
- burst
  - yes, no, no: 0.1
  - no, no, no: 0.0
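The two voicing examples on this slide can be reproduced with a small sketch: each segment's posterior is filed under the triple (left context, target, right context) and averaged per triple. The canonical voicing values in `VOICED`, the treatment of unlabelled segments as silence, and the choice to use silence only as context are assumptions made for the illustration.

```python
from collections import defaultdict

# (phone, mean voiced posterior) per segment, from the table above.
# None marks the segments that carry no phone label.
segments = [(None, 0.1), ("pcl", 0.2), ("p", 0.2), ("o", 0.6), ("s", 0.4),
            (None, 0.2), ("m", 0.7), ("A", 0.8), ("l", 0.6), (None, 0.1)]
# Assumed canonical voicing of each phone.
VOICED = {"pcl": False, "p": False, "o": True, "s": False,
          "m": True, "A": True, "l": True}


def label(phone):
    """Context label of a phone for the voicing feature."""
    if phone is None:
        return "silence"
    return "on" if VOICED[phone] else "off"


def cd_plf(segments):
    """Mean posterior per (left context, target, right context) triple."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, (phone, posterior) in enumerate(segments):
        if phone is None:
            continue  # assumption: silence serves as context only, never as target
        left = label(segments[i - 1][0]) if i > 0 else "silence"
        right = label(segments[i + 1][0]) if i + 1 < len(segments) else "silence"
        key = (left, label(phone), right)
        sums[key] += posterior
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


features = cd_plf(segments)
print(features[("off", "on", "off")], features[("on", "on", "on")])
```

The same routine applies to the burst column once a matching table of canonical burst values is supplied.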
16. System architecture (flow chart)
Speech aligner → speaker features → Intelligibility Prediction Model → objective score
17. Intelligibility prediction model (IPM)
- Objective
  - map speaker features (PMF, PLF, CD-PLF or combinations) to a speaker intelligibility score
- Model training
  - train on DIA recordings
  - pathological speakers (+ some normal control speakers)
- Model type and size
  - limited number of pathological speakers
  - high number of features
    → linear regression model
    → feature selection
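As a rough sketch of what a linear regression IPM with feature selection could look like, the code below fits a least-squares model with NumPy and picks features greedily. Everything specific here is an assumption: the function names, the synthetic data, and the greedy in-sample criterion, which merely stands in for whatever selection procedure was actually used.

```python
import numpy as np


def fit_ipm(features, scores):
    """Least-squares fit of score ~ bias + features @ weights; returns [bias, weights...]."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, scores, rcond=None)
    return coef


def predict_ipm(coef, features):
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef


def forward_select(features, scores, max_features):
    """Greedily add the feature that lowers the training RMS error the most."""
    chosen, remaining = [], list(range(features.shape[1]))

    def rmse_with(candidate):
        cols = chosen + [candidate]
        pred = predict_ipm(fit_ipm(features[:, cols], scores), features[:, cols])
        return float(np.sqrt(np.mean((pred - scores) ** 2)))

    while remaining and len(chosen) < max_features:
        best = min(remaining, key=rmse_with)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Synthetic, made-up data: 5 "speakers", 2 features, exactly linear scores.
features = np.array([[0.2, 0.9], [0.5, 0.4], [0.7, 0.6], [0.9, 0.1], [0.3, 0.3]])
scores = 40.0 + 50.0 * features[:, 0] + 10.0 * features[:, 1]
coef = fit_ipm(features, scores)
print(np.round(coef, 3), forward_select(features, scores, max_features=1))
```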
18. Reference material (DIA)
- 211 speakers
  - 51 normals
  - 60 dysarthric
  - 12 clefts (children)
  - 42 hearing impaired
  - 37 with laryngectomy
  - 7 with dysphonia
  - 2 others
- Pathological speakers: mean of 78.7
- Normals: mean of 93.3
- Few with very low score
19. Solving microphone issues
- Two microphones were used.
- The difference can be found in the cepstral means (→ cepstral mean subtraction was performed)
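Cepstral mean subtraction removes the per-coefficient average over a recording, which cancels a fixed channel effect such as a microphone's, since that effect appears as a constant offset in the cepstral domain. The sketch below uses made-up two-dimensional "cepstra"; the function name and the data are illustrative only.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over all frames of one recording."""
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n_frames for d in range(n_coeffs)]
    return [[frame[d] - means[d] for d in range(n_coeffs)] for frame in frames]


# Made-up cepstra; rec_b is rec_a plus a constant offset, a crude stand-in
# for the same speech recorded through a different microphone.
rec_a = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
rec_b = [[c + 5.0 for c in frame] for frame in rec_a]

norm_a = cepstral_mean_subtraction(rec_a)
norm_b = cepstral_mean_subtraction(rec_b)
print(norm_a == norm_b)
```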
20. Training / validation
- Models chosen with five-fold cross-validation
- Measure: standard deviation (STD); in case of normality, 67% of the computed scores lie in an interval of ±STD around the perceptual score
- More features → more chance of overfitting
  - Rule of thumb: take 1 feature for every 10 training examples
    → restrict the number of features to a maximum of 15
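A minimal sketch of the two ingredients named on this slide: a five-fold split in which every speaker is validated exactly once, and the error measure. It assumes that STD means the root-mean-square difference between computed and perceptual scores (which equals the standard deviation when the errors have zero mean); the interleaved fold assignment and the example scores are hypothetical.

```python
import math


def five_fold_splits(n_speakers, n_folds=5):
    """Yield (train, validation) index lists; each speaker is validated once."""
    indices = list(range(n_speakers))
    for k in range(n_folds):
        validation = indices[k::n_folds]
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation


def error_std(computed, perceptual):
    """Root-mean-square difference between computed and perceptual scores."""
    errors = [c - p for c, p in zip(computed, perceptual)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


splits = list(five_fold_splits(211))
print([len(validation) for _, validation in splits])

# Hypothetical scores for four validation speakers.
std = error_std([80.0, 92.0, 60.0, 75.0], [85.0, 90.0, 70.0, 75.0])
print(round(std, 2))
```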
21. Results: individual systems
- PMFelis: STD 9.52
- PMFesat: STD 8.57
22. Results: individual systems
- PLF (elis): STD 9.35
- CD-PLF (elis): STD 8.48
23. Results: all systems
- New models with CD-PLF outperform old PLF models
- CD-PLFs form the best system with one feature set
- PMFesat + CD-PLF is the best system with combined feature sets
- Using the three ELIS feature sets yields the next best result and needs only one recognizer (the simplest one)
  → less complex system
Model                     STD   N
PMFesat                   8.57  15
PMFelis                   9.52  15
PLF                       9.35  15
CD-PLF                    8.48  15
PMFelis + PLF             8.20  15
PMFesat + PLF             8.00  13
PMFelis + CD-PLF          7.63  15
PLF + CD-PLF              8.04  15
PMFesat + CD-PLF          7.34  15
PMFelis + PLF + CD-PLF    7.48  15
24. Results: combined system
25. Results: pathology-specific IPM
- Instead of creating one general IPM, one can create IPMs for specific pathologies
  - trained on all speakers (to have enough speakers)
  - model selection based on performance on speakers of that pathology (importance of features depends on type of disorder)
26. Results: pathology-specific IPM (2)
STD per pathology (DYS: dysarthria, LAR: laryngectomy, HEAR: hearing impaired)
Model                     DYS   LAR   HEAR
PMFesat                   8.44  8.32  7.48
PMFelis                   8.10  5.88  9.73
PLF                       8.27  7.17  8.05
CD-PLF                    6.49  5.70  6.87
PMFelis + PLF             6.97  5.14  6.63
PMFesat + PLF             6.87  6.49  6.20
PMFelis + CD-PLF          6.50  3.54  6.05
PLF + CD-PLF              6.32  5.82  6.17
PMFesat + CD-PLF          6.69  4.86  5.27
PMFelis + PLF + CD-PLF    6.32  3.68  5.73
- Very good match in case CD-PLFs are involved
- New models with CD-PLF outperform old PLF models
- CD-PLFs form the best system with one feature set
- Using the three ELIS feature sets yields (almost) the best result and needs only one recognizer (the simplest one)
  → less complex system
27. Results: pathology-specific IPM
- Dysarthria: STD 6.32 (red circles)
- Dispersion of other speakers is increased
- Largest deviations in the low-intelligibility area
  - scarce data in that area
  - can be solved by adding more weight to patients with very low intelligibility
28. Conclusions and future work
- PMF, PLF and CD-PLF can predict the intelligibility of pathological speech
- CD-PLFs seem to play an important role
  - STD of 7.34 for the general model combining CD-PLF and PMFesat
  - STDs of 6.32 or less for the pathology-specific models using the 3 ELIS feature sets
    → not the articulation pattern but the change in the articulation pattern matters?
  - More research is needed before adding this feature set to the tool
- Results on the validation set compete with human inter-rater agreements.
- Future work
  - more profound articulatory assessment, which is directly related to the determination of appropriate therapy
  - monitoring of the effectiveness of the chosen therapy
  - using more natural speech (words, phrases) in tests