Development of the SPACE intelligibility assessment method

1
  • Development of the SPACE intelligibility
    assessment method
  • Catherine Middag, Gwen Van Nuffelen,
  • Jean-Pierre Martens, Marc De Bodt

2
Introduction
  • Intelligibility = popular measure for
    pathological speech assessment
  • Perceptual assessment is affected by non-speech
    information
  • familiarity of the listener with the speaker and
    the type of disorder
  • → hard to eliminate this subjective bias
  • guessing on the basis of linguistic context
  • → test material design must eliminate this
    bias
  • Replacing the human listener by an automatic
    speech recognizer (ASR) can solve both
    problems, but is the ASR sufficiently reliable?
  • Test case: automation of the Dutch
    Intelligibility Assessment (DIA)

3
Dutch Intelligibility Assessment (DIA)
  • 50 isolated (nonsense) words
  • intelligibility = percent phonemes correct

4
How to apply ASR in the DIA?
  • Two approaches
  • let ASR recognize the words and count the
    percentage of correct decisions
  • let ASR check how well on average the acoustics
    support the phonetic transcription of the target
    word (alignment)
  • Our experience
  • intelligibility emerging from first approach
    insufficiently reliable
  • therefore we developed a system based on alignment

5
System architecture flow chart
Speech aligner → speaker features → Intelligibility Prediction Model → objective score
6
System architecture flow chart
Speech aligner
  • Two systems
  • complex state-of-the-art HMM-based system
    (ASR-ESAT)
  • simple system with a phonological layer
    (ASR-ELIS)
  • (points more directly to articulatory
    problems)
  • Acoustic models trained on speech of normal
    adult speakers

7
ASR - ESAT
  • Acoustic models
  • state-of-the-art Semi-Continuous HMM
  • triphone models trained on normal speech
  • states tied using decision trees (phonological
    questions)
  • Output
  • each frame t is assigned to a state s_t
  • per frame: s_t, P(s_t|X_t)

8
ASR - ELIS
  • 24 binary phonological features concerning
  • voicing
  • manner of articulation
  • place of articulation

Flow chart: X_t → PLF extractor → P(K1|X_t), ..., P(K24|X_t) → probability product model → P(S1|X_t), ..., P(Sn|X_t) → Viterbi decoder (with the target speech transcription) → s_t, P(s_t|X_t) and P(K1|X_t), ..., P(K24|X_t)
9
System architecture flow chart
Speech aligner → speaker features → Intelligibility Prediction Model → objective score
  • Three feature sets
  • Phonemic features (patient has trouble
    pronouncing a certain phoneme)
  • Phonological features (patient has problems with
    voicing, manner or place of articulation)
  • NEW: context-dependent features (patient has
    problems with a desired change of voicing, manner
    or place of articulation)
10
Extraction of phonemic features (PMF)
Frame  Phoneme  P(s_t|X_t)
1               0.7
2               0.5
3      /p/      0.4
4      /p/      0.8
5      /o/      0.6
6      /o/      0.8
7      /l/      0.6
8               0.3
Speech aligner ASR-ESAT
Phonemic features
  • (frames without a phoneme label) (0.7 + 0.5 + 0.3) / 3
  • /p/ (0.4 + 0.8) / 2
  • /o/ (0.6 + 0.8) / 2
  • /l/ 0.6
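
The averaging on this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the authors' implementation: the function name `phonemic_features` and the use of `None` for frames without a phoneme label are assumptions made here.

```python
from collections import defaultdict

# (frame, phoneme label, P(s_t|X_t)) copied from the slide.
# None marks frames that carry no phoneme label (an assumption of this sketch).
frames = [
    (1, None, 0.7),
    (2, None, 0.5),
    (3, "/p/", 0.4),
    (4, "/p/", 0.8),
    (5, "/o/", 0.6),
    (6, "/o/", 0.8),
    (7, "/l/", 0.6),
    (8, None, 0.3),
]


def phonemic_features(frames):
    """Average the frame-level posteriors separately for every phoneme label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for _, phoneme, posterior in frames:
        sums[phoneme] += posterior
        counts[phoneme] += 1
    return {phoneme: sums[phoneme] / counts[phoneme] for phoneme in sums}


pmf = phonemic_features(frames)
for phoneme, value in pmf.items():
    print(phoneme, round(value, 3))
```

Each phoneme contributes one feature per speaker, so a speaker who systematically mispronounces a given phoneme should end up with a low average for it.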

11
Extraction of phonological features (PLF)
Frame  Phone  voiced P(K1|X_t)  back P(K2|X_t)  burst P(K3|X_t)
1             0.1               0.1             0.2
2             0.1               0.1             0.1
3      /pcl/  0.2               0.1             0.1
4      /p/    0.2               0.2             0.6
5      /o/    0.8               0.7             0.2
6      /o/    0.6               0.9             0.0
7      /l/    0.5               0.5             0.1
8             0.1               0.1             0.0
Speech aligner ASR-ELIS
Phonological features
Burst: 0.6
Back: (0.7 + 0.9) / 2
Voiced: (0.8 + 0.6 + 0.5) / 3
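
A minimal sketch of how the three "feature present" averages above could be computed. The mapping `POSITIVE_PHONES` (which phones are expected to show which feature) is an assumption read off the averages on this slide, and the function name is hypothetical; the real system derives these targets from the phonetic transcription.

```python
# Frame-level PLF posteriors from the slide: (phone, voiced, back, burst).
# None marks frames without a phone label.
frames = [
    (None,    0.1, 0.1, 0.2),
    (None,    0.1, 0.1, 0.1),
    ("/pcl/", 0.2, 0.1, 0.1),
    ("/p/",   0.2, 0.2, 0.6),
    ("/o/",   0.8, 0.7, 0.2),
    ("/o/",   0.6, 0.9, 0.0),
    ("/l/",   0.5, 0.5, 0.1),
    (None,    0.1, 0.1, 0.0),
]
FEATURES = ("voiced", "back", "burst")

# Assumption: phones for which each feature is expected to be present,
# inferred from the averages shown on the slide.
POSITIVE_PHONES = {
    "voiced": {"/o/", "/l/"},
    "back": {"/o/"},
    "burst": {"/p/"},
}


def positive_plf(frames):
    """Average each feature's posterior over the frames where it should be present."""
    result = {}
    for column, feature in enumerate(FEATURES, start=1):
        values = [row[column] for row in frames
                  if row[0] in POSITIVE_PHONES[feature]]
        if values:  # skip features that never occur as a positive target
            result[feature] = sum(values) / len(values)
    return result


plf = positive_plf(frames)
print({name: round(value, 3) for name, value in plf.items()})
```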
12
Extraction of phonological features (PLF)
(frame table as on slide 11)
Speech aligner ASR-ELIS
Phonological features
Not burst: (0.2 + 0.1 + ...
Not back: (0.1 + 0.1 + ...
Not voiced: (0.1 + 0.1 + ...
13
Extraction of phonological features (PLF)
(frame table as on slide 11)
Speech aligner ASR-ELIS
Phonological features
Irrelevant features for these phones
14
Extraction of context-dependent phonological
features (CD-PLF)
  • How well is a change in a PLF realized?
  • use the PLF target in the preceding/succeeding
    phone as context
  • binary features → two values for the target
    (present/absent)
  • binary features → restricted number of left and
    right contexts
  • Left or right context can be
  • present, absent, not relevant, silence
  • Model selection (preliminary)
  • maximum 4 × 2 × 4 = 32 CD-PLFs per PLF
  • → 768 in total (32 × 24 PLFs)
  • select only those CD-PLFs occurring at least
    twice in every test
  • → 123 in total
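
The counting argument on this slide can be checked by enumerating the combinations. The sketch below only reproduces the upper bound; the constant names are chosen here for illustration.

```python
from itertools import product

# Possible values of the left and right context, as listed on the slide.
CONTEXTS = ("present", "absent", "not relevant", "silence")
# A binary feature has two possible target values.
TARGETS = ("present", "absent")
# Number of binary phonological features produced by ASR-ELIS.
NUM_PLF = 24

# Every CD-PLF is a (left context, target, right context) triple.
cd_plf_per_feature = list(product(CONTEXTS, TARGETS, CONTEXTS))
total = len(cd_plf_per_feature) * NUM_PLF

print(len(cd_plf_per_feature), total)  # 32 768
```

Only 123 of these 768 candidates remain after the occurrence filter, because most context combinations appear too rarely in the test words.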

15
Extraction of context-dependent phonological
features (CD-PLF)
Segment  Phone  voiced  burst
2               0.1     0.2
3        /pcl/  0.2     0.2
4        /p/    0.2     0.6
6        /o/    0.6     0.1
7        /s/    0.4     0.3
8               0.2     0.1
9        /m/    0.7     0.3
10       /A/    0.8     0.0
11       /l/    0.6     0.1
12              0.1     0.1
Speech aligner ASR-ELIS
CD-PLF features (left context, target, right context)
voicing               burst
off, on, off   0.6    yes, no, no   0.1
on, on, on     0.8    no, no, no    0.0
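
A minimal sketch of the grouping behind the two voicing values above. The voicing targets in `VOICED_TARGET` are assumed canonical values added for illustration, segments without a phone label are treated as silence (also an assumption), and the "not relevant" context is left out for brevity.

```python
from collections import defaultdict

# (phone, voiced posterior) per segment, copied from the slide.
# None marks a segment without a phone label, treated here as silence.
segments = [
    (None, 0.1), ("/pcl/", 0.2), ("/p/", 0.2), ("/o/", 0.6), ("/s/", 0.4),
    (None, 0.2), ("/m/", 0.7), ("/A/", 0.8), ("/l/", 0.6), (None, 0.1),
]

# Assumption: canonical voicing target of every phone in the example.
VOICED_TARGET = {
    "/pcl/": "off", "/p/": "off", "/o/": "on", "/s/": "off",
    "/m/": "on", "/A/": "on", "/l/": "on",
}


def context_label(phone):
    """Voicing target of a neighbouring segment, or 'silence' if it has no phone."""
    return "silence" if phone is None else VOICED_TARGET[phone]


def cd_plf(segments):
    """Average the posterior per (left context, target, right context) triple."""
    buckets = defaultdict(list)
    for i, (phone, posterior) in enumerate(segments):
        if phone is None:
            continue  # silence is only used as context, never as a target
        left = context_label(segments[i - 1][0]) if i > 0 else "silence"
        right = (context_label(segments[i + 1][0])
                 if i + 1 < len(segments) else "silence")
        buckets[(left, VOICED_TARGET[phone], right)].append(posterior)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


voicing = cd_plf(segments)
print(voicing[("off", "on", "off")], voicing[("on", "on", "on")])
```

With these assumed targets, /o/ between /p/ and /s/ yields the "off, on, off" value and /A/ between /m/ and /l/ yields the "on, on, on" value shown on the slide.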
16
System architecture flow chart
Speech aligner → speaker features → Intelligibility Prediction Model → objective score
17
Intelligibility prediction model (IPM)
  • Objective
  • map speaker features (PMF, PLF, CD-PLF or
    combinations) to speaker intelligibility score
  • Model training
  • train on DIA recordings
  • pathological speakers (+ some normal control
    speakers)
  • Model type and size
  • limited number of pathological speakers
  • high number of features
  • → linear regression model
  • → feature selection

18
Reference material (DIA)
  • 211 speakers
  • 51 normals
  • 60 dysarthric
  • 12 clefts (children)
  • 42 hearing impaired
  • 37 with laryngectomy
  • 7 with dysphonia
  • 2 others
  • Pathological speakers: mean of 78.7 %
  • Normals: mean of 93.3 %
  • Few with very low score

19
Solving microphone issues
  • Two microphones were used.
  • The difference shows up in the cepstral means
    (→ cepstral mean subtraction was performed)
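
Cepstral mean subtraction removes a constant per-coefficient offset, which is how a fixed microphone characteristic appears in the cepstral domain. The sketch below shows the standard per-utterance form with made-up numbers; whether the authors normalized per utterance or per speaker is not stated on the slide.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over the utterance from every frame."""
    if not frames:
        return []
    n = len(frames)
    dims = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n for d in range(dims)]
    return [[frame[d] - means[d] for d in range(dims)] for frame in frames]


# Toy example (made-up numbers): the same utterance recorded with a second
# microphone that adds a constant offset to every cepstral coefficient.
mic_a = [[1.0, 2.0], [3.0, 0.0], [2.0, 1.0]]
mic_b = [[c + 0.5 for c in frame] for frame in mic_a]

norm_a = cepstral_mean_subtraction(mic_a)
norm_b = cepstral_mean_subtraction(mic_b)
print(norm_a)
print(norm_b)
```

After normalization the two recordings coincide, so the acoustic models no longer see the microphone difference.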

20
Training / validation
  • Models chosen with five-fold cross-validation
  • Measure: standard deviation (STD); in case of
    normality, 67 % of the computed scores lie in an
    interval of ±STD around the perceptual score
  • More features → more chance of overfitting
  • Rule of thumb: take 1 feature for every 10
    training examples
  • → restrict the number of features to a maximum of 15
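
The evaluation procedure can be sketched as follows, under two assumptions: STD is taken to be the root-mean-square deviation between computed and perceptual scores, and the model is reduced to a single feature so that the example stays short (the slides allow up to 15). The data are synthetic and noise-free, not from the study, and all function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def cross_validated_std(xs, ys, folds=5):
    """RMS deviation between predicted and reference scores, where every
    speaker is predicted by a model trained without that speaker's fold."""
    errors = []
    for k in range(folds):
        test_idx = set(range(k, len(xs), folds))
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys))
                 if i not in test_idx]
        a, b = fit_line([x for x, _ in train], [y for _, y in train])
        errors += [a * xs[i] + b - ys[i] for i in sorted(test_idx)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def max_features(n_training_examples, per_feature=10):
    """Rule of thumb from the slide: one feature per ten training examples."""
    return n_training_examples // per_feature


# Synthetic toy data: one feature, perfectly linear scores.
xs = [i / 10 for i in range(20)]
ys = [50.0 + 20.0 * x for x in xs]
std = cross_validated_std(xs, ys)
print(round(std, 6), max_features(150))
```

Because the scores are always predicted for held-out speakers, adding features only lowers this STD when they generalize, which is what makes it usable for model selection.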

21
Results individual systems
PMFelis: STD 9.52
PMFesat: STD 8.57
22
Results individual systems
PLF (elis): STD 9.35
CD-PLF (elis): STD 8.48
23
Results all systems
  • New models with CD-PLF outperform old PLF models
  • CD-PLFs form best system with one feature set
  • PMFesat + CD-PLF: best system with combined
    feature sets
  • Using three ELIS feature sets yields next best
    result and needs only one recognizer (the
    simplest one)
  • → less complex system

Model                    STD    N
PMFesat                  8.57   15
PMFelis                  9.52   15
PLF                      9.35   15
CD-PLF                   8.48   15
PMFelis + PLF            8.20   15
PMFesat + PLF            8.00   13
PMFelis + CD-PLF         7.63   15
PLF + CD-PLF             8.04   15
PMFesat + CD-PLF         7.34   15
PMFelis + PLF + CD-PLF   7.48   15
24
Results combined system
  • CD-PLF + PMFesat
  • STD = 7.34

25
Results pathology-specific IPM
  • Instead of creating one general IPM, one can
    create IPMs for specific pathologies
  • trained on all speakers (to have enough speakers)
  • model selection based on performance on speakers
    of that pathology (importance of features depends
    on type of disorder)

26
Results pathology-specific IPM (2)
STD per pathology (DYS = dysarthria, LAR = laryngectomy, HEAR = hearing impaired)
Model                    DYS    LAR    HEAR
PMFesat                  8.44   8.32   7.48
PMFelis                  8.10   5.88   9.73
PLF                      8.27   7.17   8.05
CD-PLF                   6.49   5.70   6.87
PMFelis + PLF            6.97   5.14   6.63
PMFesat + PLF            6.87   6.49   6.20
PMFelis + CD-PLF         6.50   3.54   6.05
PLF + CD-PLF             6.32   5.82   6.17
PMFesat + CD-PLF         6.69   4.86   5.27
PMFelis + PLF + CD-PLF   6.32   3.68   5.73
  • Very good match in case CD-PLFs are involved
  • New models with CD-PLF outperform old PLF models
  • CD-PLFs form best system with one feature set
  • Using three ELIS feature sets yields (almost)
    best result and needs only one recognizer (the
    simplest one)
  • → less complex system

27
Results pathology-specific IPM
  • Dysarthria: STD 6.32 (red circles)
  • Dispersion of other speakers is increased
  • Largest deviations in low intelligibility area
  • scarce data in that area
  • can be solved by adding more weight to patients
    with very low intelligibility

28
Conclusions and future work
  • PMF, PLF and CD-PLF can predict intelligibility
    of pathological speech
  • CD-PLFs seem to play an important role
  • STD of 7.34 for the general model combining
    CD-PLF and PMFesat
  • STDs of 6.32 or less for the pathology-specific
    models using the 3 ELIS feature sets
  • → not the articulation pattern but the change in
    the articulation pattern matters?
  • More research is needed before adding this
    feature set to the tool
  • Results on validation set compete with human
    inter-rater agreements.
  • Future work
  • more profound articulatory assessment, which is
    directly related to determination of appropriate
    therapy
  • monitoring of effectiveness of chosen therapy
  • using more natural speech (words, phrases) in
    tests

29
  • Questions?