Novel Speech Recognition Models for Arabic
Transcript and Presenter's Notes
1
Novel Speech Recognition Models for Arabic
  • The Arabic Speech Recognition Team
  • JHU Workshop Final Presentations
  • August 21, 2002

2
Arabic ASR Workshop Team
  • Senior Participants: Katrin Kirchhoff (UW), Jeff Bilmes (UW), John Henderson (MITRE), Mohamed Noamany (BBN), Pat Schone (DoD), Rich Schwartz (BBN)
  • Graduate Students: Sourin Das (JHU), Gang Ji (UW)
  • Undergraduate Students: Melissa Egan (Pomona College), Feng He (Swarthmore College)
  • Affiliates: Dimitra Vergyri (SRI), Daben Liu (BBN), Nicolae Duta (BBN), Ivan Bulyko (UW), Mari Ostendorf (UW)

3
Arabic
Dialects: used for informal conversation
Modern Standard Arabic (MSA): cross-regional standard, used for formal communication
4
Arabic ASR Previous Work
  • dictation: IBM ViaVoice for Arabic
  • Broadcast News: BBN TIDES OnTap
  • conversational speech: 1996/1997 NIST CallHome evaluations
  • little work compared to other languages
  • few standardized ASR resources

5
Arabic ASR State of the Art (before WS02)
  • BBN TIDES OnTap: 15.3% WER
  • BBN CallHome system: 55.8% WER
  • WER on conversational speech noticeably higher than for other languages
  • (e.g. 30% WER for English CallHome)
  • → focus on recognition of conversational Arabic

6
Problems for Arabic ASR
  • language-external problems
  • data sparsity, only 1 (!) standardized corpus of
    conversational Arabic available
  • language-internal problems
  • complex morphology, large number of possible word forms
  • (similar to Russian, German, Turkish, ...)
  • differences between written and spoken representation: lack of short vowels and other pronunciation information
  • (similar to Hebrew, Farsi, Urdu, Pashto, ...)

7
Corpus LDC ECA CallHome
  • phone conversations between family
    members/friends
  • Egyptian Colloquial Arabic (Cairene dialect)
  • high degree of disfluencies (9%), out-of-vocabulary words (9.6%), foreign words (1.6%)
  • noisy channels
  • training: 80 calls (14 hrs), dev: 20 calls (3.5 hrs), eval: 20 calls (1.5 hrs)
  • very small amount of data for language modeling (150K words)!

8
MSA - ECA differences
  • Phonology
  • /th/ → /s/ or /t/: thalatha → talata (three)
  • /dh/ → /z/ or /d/: dhahab → dahab (gold)
  • /zh/ → /g/: zhadeed → gideed (new)
  • /ay/ → /e/: Sayf → Seef (summer)
  • /aw/ → /o/: lawn → loon (color)
  • Morphology
  • inflections: yatakallamu → yitkallim (he speaks)
  • Vocabulary
  • different terms: TAwila → tarabeeza (table)
  • Syntax
  • word order differences (SVO vs. VSO)

9
Workshop Goals
improvements to Arabic ASR through
  • developing novel models to better exploit available data
  • developing techniques for using out-of-corpus data
Approaches:
  • Automatic romanization
  • Integration of MSA text data
  • Factored language modeling
10
Factored Language Models
  • complex morphological structure leads to a large number of possible word forms
  • break words up into separate components
  • build statistical n-gram models over individual morphological components rather than complete word forms

11
Automatic Romanization
  • Arabic script lacks short vowels and other
    pronunciation markers
  • comparable English example:
    th fsh stcks f th nrth tlntc hv bn dpletd
    the fish stocks of the north atlantic have been depleted
  • lack of vowels results in lexical ambiguity; affects acoustic and language model training
  • try to predict vowelization automatically from data and use the result for recognizer training
12
Out-of-corpus text data
  • no corpora of transcribed conversational speech
    available
  • large amounts of written (Modern Standard Arabic)
    data available (e.g. Newspaper text)
  • Can MSA text data be used to improve language
    modeling for conversational speech?
  • Try to integrate data from newspapers,
    transcribed TV broadcasts, etc.

13
Recognition Infrastructure
  • baseline system: BBN recognition system
  • N-best list rescoring
  • Language model training: SRI LM toolkit with significant additions implemented during this workshop
  • Note: no work on acoustic modeling, speaker adaptation, noise robustness, etc.
  • two different recognition approaches: grapheme-based vs. phoneme-based

14
Summary of Results (WER)
Grapheme-based recognizer
Phone-based recognizer
15
Novel research
  • new strategies for language modeling based on
    morphological features
  • new graph-based backoff schemes allowing wider
    range of smoothing techniques in language
    modeling
  • new techniques for automatic vowel insertion
  • first investigation of use of automatically
    vowelized data for ASR
  • first attempt at using MSA data for language
    modeling for conversational Arabic
  • morphology induction for Arabic

16
Key Insights
  • Automatic romanization improves grapheme-based
    Arabic recognition systems
  • trend: morphological information helps in language modeling
  • needs to be confirmed on a larger data set
  • Using MSA text data does not help
  • We need more data!

17
Resources
  • significant add-on to SRILM toolkit for general
    factored language modeling
  • techniques/software for automatic romanization of
    Arabic script
  • part-of-speech tagger for MSA tagged text

18
Outline of Presentations
  • 1:30 - 1:45 Introduction (Katrin Kirchhoff)
  • 1:45 - 1:55 Baseline system (Rich Schwartz)
  • 1:55 - 2:20 Automatic romanization (John Henderson, Melissa Egan)
  • 2:20 - 2:35 Language modeling - overview (Katrin Kirchhoff)
  • 2:35 - 2:50 Factored language modeling (Jeff Bilmes)
  • 2:50 - 3:05 Coffee Break
  • 3:05 - 3:10 Automatic morphology learning (Pat Schone)
  • 3:15 - 3:30 Text selection (Feng He)
  • 3:30 - 4:00 Graduate student proposals (Gang Ji, Sourin Das)
  • 4:00 - 4:30 Discussion and Questions

19
Thank you!
  • Fred Jelinek, Sanjeev Khudanpur, Laura Graham, Jacob Laderman and assistants
  • Workshop sponsors
  • Mark Liberman, Chris Cieri, Tim Buckwalter
  • Kareem Darwish, Kathleen Egan
  • Bill Belfield and colleagues from BBN
  • Apptek

20
(No Transcript)
21
BBN Baseline System for Arabic
  • Richard Schwartz, Mohamed Noamany,
  • Daben Liu, Bill Belfield, Nicolae Duta
  • JHU Workshop
  • August 21, 2002

22
BBN BYBLOS System
  • Rough'n'Ready / OnTAP / OASIS system
  • Version of BYBLOS optimized for Broadcast News
  • OASIS system fielded in Bangkok and Amman
  • Real-time operation with 1-minute delay
  • 10-20% WER, depending on data

23
BYBLOS Configuration
  • 3 passes of recognition
  • Forward Fast-match uses PTM models and
    approximate bigram search
  • Backward pass uses SCTM models and approximate
    trigram search, creates N-best.
  • Rescoring pass uses cross-word SCTM models and
    trigram LM
  • All runs in real time
  • Minimal difference from running slowly

24
Use for Arabic Broadcast News
  • Transcriptions are in normal Arabic script,
    omitting short vowels and other diacritics.
  • We used each Arabic letter as if it were a
    phoneme.
  • This allowed addition of large text corpora for
    language modeling.

25
Initial BN Baseline
  • 37.5 hours of acoustic training
  • Acoustic training data (230K words) used for LM
    training
  • 64K-word vocabulary (4% OOV)
  • Initial word error rate (WER): 31.2%

26
Speech Recognition Performance
27
Call Home Experiments
  • Modified OnTAP system to make it more appropriate
    for Call Home data.
  • Added features from LVCSR research to OnTAP
    system for Call Home data.
  • Experiments
  • Acoustic training: 80 conversations (15 hours)
  • Transcribed with diacritics
  • Acoustic training data (150K words) used for LM
  • Real-time

28
Using OnTAP system for Call Home
29
Additions from LVCSR
30
Output Provided for Workshop
  • OASIS was run on various sets of training as
    needed
  • Systems were run either for Arabic script
    phonemes or Romanized phonemes with
    diacritics.
  • In addition to workshop participants, others at
    BBN provided assistance and worked on workshop
    problems.
  • Output provided for workshop was N-best sentences
  • with separate scores for HMM, LM, words,
    phones, silences
  • Due to high error rate (56%), the oracle error rate for 100 N-best was about 46%.
  • Unigram lattices were also provided, with oracle error rate of 15%

31
Phoneme HMM Topology Experiment
  • The phoneme HMM topology was increased for the
    Arabic script system from 5 states to 10 states
    in order to accommodate a consonant and possible
    vowel.
  • The gain was small (0.3% WER)

32
OOV Problem
  • OOV rate is 10%
  • 50% are morphological variants of words in the training set
  • 10% are proper names
  • 40% are other unobserved words
  • Tried adding words from BN and from morphological
    transducer
  • Added too many words with too small gain

33
Use BN to Reduce OOV
  • Can we add words from BN to reduce OOV?
  • BN text contains 1.8M distinct words.
  • Adding the entire 1.8M words reduces OOV from 10% to 3.9%.
  • Adding the top 15K words reduces OOV to 8.9%.
  • Adding the top 25K words reduces OOV to 8.4%.

34
Use Morphological Transducer
  • Use LDC Arabic transducer to expand verbs to all
    forms
  • Produces > 1M words
  • Reduces OOV to 7%

35
Language Modeling Experiments
  • Described in other talks
  • Searched for available dialect transcriptions
  • Combine BN (300M words) with CH (230K words)
  • Use BN to define word classes
  • Constrained back-off for BN+CH

36
(No Transcript)
37
Autoromanization of Arabic Script
  • Melissa Egan and John Henderson

38
Autoromanization (AR) goal
  • Expand Arabic script representation to include
    short vowels and other pronunciation information.
  • Phenomena not typically marked in non-diacritized
    script include
  • Short vowels a, i, u
  • Repeated consonants (shadda)
  • Extra phonemes for Egyptian Arabic: f/v, j/g
  • Grammatical marker that adds an "n" to the pronunciation (tanween)
  • Example
  • Non-diacritized form: ktb (write)
  • Expansions: kitab (book)
  • aktib (I write)
  • kataba (he wrote)
  • kattaba (he caused to write)

39
AR motivation
  • Romanized text can be used to produce better
    output from an ASR system.
  • Acoustic models will be able to better
    disambiguate based on extra information in text.
  • Conditioning events in LM will contain more
    information.
  • Romanized ASR output can be converted to script
    for alternative WER measurement.
  • Eval96 results (BBN recognizer, 80 conv. train)
  • script recognizer: 61.1% WERG (grapheme)
  • romanized recognizer: 55.8% WERR (roman)

40
AR data
  • CallHome Arabic from LDC
  • Conversational speech transcripts (ECA) in both
    script and a roman specification that includes
    short vowels, repeats, etc.
  • set: conversations / words
  • asrtrain: 80 / 135K
  • dev: 20 / 35K
  • eval96 (asrtest): 20 / 15K
  • eval97: 20 / 18K
  • h5_new: 20 / 18K

Romanizer Testing
Romanizer Training
41
Data format
  • Script without and with diacritics
  • CallHome in script and roman forms

Our task:
Script: AlHmd_llh kwIsB w AntI AzIk
Roman: ilHamdulillA kuwayyisaB wi inti izzayyik
42
Autoromanization (AR) WER baseline
  • Train on 32K words in eval97 + h5_new
  • Test on 137K words in ASR_train + h5_new

Status      portion in test   error    of total test error
unambig.    68.0%             1.8%     6.2%
ambig.      15.5%             13.9%    10.8%
unknown     16.5%             99.8%    83.0%
total       100%              19.9%    100.0%
Biggest potential error reduction would come from
predicting romanized forms for unknown words.
43
AR knitting example
unknown: tbqwA
1. Find close known word
   known: ybqwA
   known: y b q w A
2. Record ops required to make roman from known
   kn. roman: yibqu    ops: c i c c r d
   unknown: t b q w A
3. Construct new roman using same ops
   kn. roman: yibqu    ops: c i c c r d
   new roman: tibqu
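A minimal sketch of the op-reuse step in this knitting example, assuming the edit operations c/i/r/d (copy, insert, replace, delete) have already been obtained by aligning the known script form with its known roman form; the function name and op encoding are illustrative, not the workshop code.

```python
def knit(unknown_script, known_roman, ops):
    """Build a roman form for unknown_script by reusing the known word's ops."""
    out = []
    u = r = 0                      # positions in unknown_script / known_roman
    for op in ops:
        if op == "c":              # copy: keep the unknown word's character
            out.append(unknown_script[u]); u += 1; r += 1
        elif op == "i":            # insert: take the character from the known roman form
            out.append(known_roman[r]); r += 1
        elif op == "r":            # replace: substitute the known roman character
            out.append(known_roman[r]); u += 1; r += 1
        elif op == "d":            # delete: drop the unknown word's character
            u += 1
    return "".join(out)

# Example from the slide: known ybqwA -> yibqu with ops c i c c r d
print(knit("tbqwA", "yibqu", "ciccrd"))   # -> tibqu
```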
44
Experiment 1 (best match)
  • Observed patterns in the known short/long pairs
  • Some characters in the short forms are
    consistently found with particular, non-identical
    characters in the long forms.
  • Example rule:
  • A → a

45
Experiment 2 (rules)
Environments in which "w" occurs in the training dictionary long forms:
  Env: C _ V (149), V _ (8), _ V (81), C _ (5), V _ V (121), V _ C (118)
Environments in which "u" occurs in the training dictionary long forms:
  Env: C _ C (1179), C _ (301), _ C (29)
  • Some output forms depend on output context.
  • Rule:
  • "u" occurs only between two non-vowels.
  • "w" occurs elsewhere.
  • Accurate for 99.7% of the instances of "u" and "w" in the training dictionary long forms. A similar rule may be formulated for "i" and "y".
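A sketch of this hand-written context rule applied to a candidate roman string in which the u/w decision is still open. The "#" boundary symbol and the "UW" placeholder are assumptions for illustration, not the workshop's representation.

```python
VOWELS = set("aiu")

def resolve_uw(chars):
    """chars: list of roman characters with 'UW' marking an undecided u/w slot."""
    padded = ["#"] + chars + ["#"]          # '#' = word boundary, counts as a non-vowel
    out = []
    for i, c in enumerate(padded[1:-1], start=1):
        if c != "UW":
            out.append(c)
            continue
        left, right = padded[i - 1], padded[i + 1]
        if left not in VOWELS and right not in VOWELS:
            out.append("u")                 # between two non-vowels -> vowel u
        else:
            out.append("w")                 # elsewhere -> consonant w
    return "".join(out)

print(resolve_uw(list("yibq") + ["UW"]))    # -> yibqu
```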

46
Experiment 3 (local model)
  • Move to a more data-driven model
  • Found some rules manually.
  • Look for all of them, systematically.
  • Use the best-scoring candidate for replacement, scored by:
  • Environment likelihood score
  • Character alignment score

47
Experiment 4 (n-best)
  • Instead of generating romanized form using the
    single best short form in the dictionary,
    generate romanized forms using top n best short
    forms.
  • Example (n = 5)

48
Character error rate (CER)
  • Measurement of insertions, deletions, and
    substitutions in character strings should more
    closely track phoneme error rate.
  • More sensitive than WER
  • Stronger statistics from same data
  • Test set results
  • Baseline: 49.89% character error rate (CER)
  • Best model: 24.58% CER
  • Oracle 2-best list: 17.60% CER, which suggests more room for gain.
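A minimal CER sketch in the usual sense (insertions, deletions, and substitutions counted by character-level edit distance and normalized by the reference length); the workshop's actual scoring script is not shown on the slides.

```python
def cer(ref: str, hyp: str) -> float:
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return 100.0 * d[n][m] / max(n, 1)

print(cer("yibqu", "tibqu"))   # 20.0: one substitution over five reference characters
```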

49


Summary of performance (dev set)

Model                              Accuracy   CER
Baseline                           8.4%       41.4%
Knitting                           16.9%      29.5%
Knitting + best match rules        18.4%      28.6%
Knitting + local model             19.4%      27.0%
Knitting + local model, n-best     30.0%      23.1%   (n = 25)
50

Varying the number of dictionary matches




51
ASR scenarios
  • 1) Have a script recognizer, but want to produce romanized form
  • → postprocessing: romanize the ASR output
  • 2) Have a small amount of romanized data and a large amount of script data available for recognizer training
  • → preprocessing: romanize the ASR training set

52
ASR experiments
(Block diagram: in the preprocessing scenario, AR is applied to the script training data, a roman ASR system is trained, its roman result is scored as WERR, and an R2S (roman-to-script) conversion of the output is scored as WERG; in the postprocessing scenario, the system is trained on script, its script result is scored as WERG, and AR converts the output to roman for WERR.)
53
Experiment: adding script data
(Future training set: AR train = 40 conversations, ASR train = 100 conversations)
  • Script LM training data could be acquired from found text.
  • Script transcription is cheaper than roman transcription
  • Simulate a preponderance of script by training AR on a separate set.
  • ASR is then trained on the output of AR.

54
Eval 96 experiments, 80 conv
Config            WERR     WERG
script baseline   N/A      59.8%
postprocessing    61.5%    59.8%
preprocessing     59.9%    59.2% (-0.6)
roman baseline    55.8%    55.6% (-4.2)
  • Bounding experiment
  • No overlap between ASR train and AR train.
  • Poor pronunciations for made-up words.

55
Eval 96 experiments, 100 conv
Config            WERR     WERG
script baseline   N/A      59.0%
postprocessing    60.7%    59.0%
preprocessing     58.5%    57.5% (-1.5)
roman baseline    55.1%    54.9% (-4.1)
  • More realistic experiment
  • 20 conversation overlap between ASR train and AR
    train.
  • Better pronunciations for made-up words.

56
Remaining challenges
  • Correct dangling tails in short matches
  • Merge unaligned characters

57
Bigram translation model
input (s):      t b q w A
output (r):     ? t i b q u ?
kn. roman (dl): y i b q u
58
Trigram translation model
input (s):      t b q w A
output (r):     t i b q u
kn. roman (dl): y i b q u
59
Future work
  • Context provides information for disambiguating
    both known and unknown words
  • Bigrams for unknown words will also be unknown, so use part-of-speech tags or morphology.
  • Acoustics
  • Use acoustics to help disambiguate vowels?
  • Provide n-best output as alternative
    pronunciations for ASR training.

60
(No Transcript)
61
Factored Language Modeling
Katrin Kirchhoff, Jeff Bilmes, Dimitra
Vergyri, Pat Schone, Gang Ji, Sourin Das
62
Arabic morphology
  • structure of Arabic derived words, e.g. "so I lived":
    particle: fa- ("so")
    root: s k n (LIVE)
    pattern: (past)
    affixes: -tu (1st-sg-past)
63
Arabic morphology
  • 5000 roots
  • several hundred patterns
  • dozens of affixes
  • large number of possible word forms
  • problems training robust language model
  • large number of OOV words

64
Vocabulary Growth - full word forms
65
Vocabulary Growth - stemmed words
66
Particle model
  • Break words into sequences of stems + affixes
  • Approximate the probability of the word sequence by the probability of the particle sequence
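A toy sketch of the particle idea: each word is split into affix/stem particles and the n-gram model is then trained and scored over the particle stream instead of full word forms. The prefix/suffix lists and the splitting heuristic are hypothetical placeholders, not the workshop's decomposition.

```python
PREFIXES = ["fa", "wi", "il"]      # illustrative only
SUFFIXES = ["tu", "ha"]

def to_particles(word):
    parts = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 2:
            parts.append(p + "+"); word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix = "+" + s; word = word[:-len(s)]
            break
    parts.append(word)
    if suffix:
        parts.append(suffix)
    return parts

# P(w_1 ... w_n) is then approximated by an n-gram model over the particle stream,
# e.g. the "so I lived" example (illustrative spelling):
print(to_particles("fasakantu"))   # -> ['fa+', 'sakan', '+tu']
```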

67
Factored Language Model
  • Problem: how can we estimate P(W_t | W_t-1, W_t-2, ...)?
  • Solution: decompose W into its morphological components: affixes, stems, roots, patterns
  • words can be viewed as bundles of features

(Diagram: parallel factor streams over time:
  patterns: P_t, P_t-1, P_t-2
  roots: R_t, R_t-1, R_t-2
  affixes: A_t, A_t-1, A_t-2
  stems: S_t, S_t-1, S_t-2
  words: W_t, W_t-1, W_t-2)
68
Statistical models for factored representations
  • Class-based LM
  • Single-stream LM

69
Full Factored Language Model
  • assume ... where
  • w = word, r = root, f = pattern, a = affixes
  • Goal: find appropriate conditional independence statements to simplify this model.

70
Experimental Infrastructure
  • All language models were tested using n-best rescoring
  • two baseline word-based LMs:
  • B1: BBN LM, WER 55.1%
  • B2: WS02 baseline LM, WER 54.8%
  • combination of baselines: 54.5%
  • new language models were used in combination with
    one or both baseline LMs
  • log-linear score combination scheme

71
Log-linear combination
  • For m information sources, each producing a maximum-likelihood estimate for W:
  • I = total information available
  • I_i = the i-th information source
  • k_i = weight for the i-th information source
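The combination formula itself is not transcribed above; what follows is only a minimal sketch of the log-linear idea described here (a weighted sum of per-source log scores, with the best-scoring hypothesis winning). The weights and data are illustrative.

```python
def combine(hypotheses, weights):
    """hypotheses: list of (text, [log_score_from_each_source])."""
    best, best_score = None, float("-inf")
    for text, log_scores in hypotheses:
        score = sum(k * s for k, s in zip(weights, log_scores))
        if score > best_score:
            best, best_score = text, score
    return best

nbest = [("hyp one", [-120.3, -35.1]),   # e.g. [acoustic/HMM score, LM score]
         ("hyp two", [-121.0, -33.8])]
print(combine(nbest, weights=[1.0, 8.0]))   # LM weight k_2 = 8 is illustrative
```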

72
Discriminative combination
  • We optimize the combination weights jointly with
    the language model and insertion penalty to
    directly minimize WER of the maximum likelihood
    hypothesis.
  • The normalization factor can be ignored since it
    is the same for all alternative hypotheses.
  • Used the simplex optimization method on the 100-best lists provided by BBN (the optimization algorithm is available in the SRILM toolkit).
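A sketch of how such discriminative weight tuning could look. The workshop used the simplex search in SRILM; here scipy's Nelder-Mead stands in, and dev_wer() is a stand-in for rescoring the dev n-best lists and computing WER. A smooth toy surface replaces the real (piecewise-constant) WER so the example runs.

```python
import numpy as np
from scipy.optimize import minimize

def dev_wer(params):
    lm_weight, ins_penalty = params
    # In the real setup: rescore every dev n-best list with these parameters,
    # pick the top hypothesis, align against the reference, and return WER (%).
    return (lm_weight - 8.0) ** 2 + (ins_penalty - 0.5) ** 2 + 54.0  # toy surface

result = minimize(dev_wer, x0=np.array([1.0, 0.0]), method="Nelder-Mead")
print(result.x)   # tuned [LM weight, insertion penalty]
```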

73
Word decomposition
  • Linguistic decomposition (expert knowledge)
  • automatic morphological decomposition: acquire morphological units from data without using human knowledge
  • data-driven classes: assign words to classes based not on characteristics of the word form but on distributional properties

74
(Mostly) Linguistic Decomposition
  • Stems / morph class information from the LDC CH lexicon
  • roots determined by K. Darwish's morphological analyzer for MSA
  • pattern determined by subtracting the root from the stem

Example: atamna → atam (stem) + verb:past-1st-plural (morph. tag)
         atam → tm (root)
         atam → CaCaC (pattern)
75
Automatic Morphology
  • Classes defined by morphological components
    derived from data
  • no expert knowledge
  • based on statistics of word forms
  • more details in Pat's presentation

76
Data-driven Classes
  • Word clustering based on distributional statistics
  • Exchange algorithm (Martin et al. '98; see the sketch after this list)
  • initially assign words to individual clusters
  • temporarily move each word to all other clusters, compute the change in perplexity (class-based trigram)
  • keep the assignment that minimizes perplexity
  • stop when the class assignment no longer changes
  • bottom-up clustering (SRI toolkit)
  • initially assign words to individual clusters
  • successively merge the pairs of clusters with the highest average mutual information
  • stop at a specified number of classes
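A sketch of an exchange-style clustering loop as described above: each word is tentatively moved to every class and the best move is kept, iterating until nothing changes. The objective is a simplified class-bigram log likelihood, not the exact perplexity criterion or toolkit used in the workshop, and the toy data is illustrative.

```python
from collections import Counter
from math import log

def class_bigram_ll(text, assign):
    """Simplified log likelihood of a class-bigram + class-membership model."""
    cls = [assign[w] for w in text]
    big = Counter(zip(cls, cls[1:]))
    uni = Counter(cls)
    wrd = Counter(text)
    ll = sum(n * log(n / uni[c1]) for (c1, c2), n in big.items())      # P(c2|c1)
    ll += sum(n * log(n / uni[assign[w]]) for w, n in wrd.items())     # P(w|c)
    return ll

def exchange(text, assign, n_classes, max_iter=20):
    vocab = sorted(set(text))
    for _ in range(max_iter):
        moved = False
        for w in vocab:
            orig = assign[w]
            best_c, best_ll = orig, None
            for c in range(n_classes):          # tentatively move w to every class
                assign[w] = c
                ll = class_bigram_ll(text, assign)
                if best_ll is None or ll > best_ll:
                    best_c, best_ll = c, ll
            assign[w] = best_c                  # keep the best move
            moved = moved or (best_c != orig)
        if not moved:                           # stop when assignments are stable
            break
    return assign

words = "il walad gih wi il bint gat".split()                 # toy text
init = {w: i % 2 for i, w in enumerate(sorted(set(words)))}   # arbitrary start
print(exchange(words, init, n_classes=2))
```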

77
Results
  • Best word error rates obtained with:
  • particle model: 54.0% (B1 + particle LM)
  • class-based models: 53.9% (B1 + MorphStem)
  • automatic morphology: 54.3% (B1 + B2 + Rule)
  • data-driven classes: 54.1% (B1 + SRILM, 200 classes)
  • combination of best models: 53.8%

78
Conclusions
  • The overall improvement in WER gained from language modeling (1.3%) is significant
  • individual differences between LMs are not significant
  • but adding morphological class models always helps language model combination
  • morphological models get the highest weights in combination (in addition to word-based LMs)
  • trend needs to be verified on a larger data set
  • → application to a script-based system?

79
(No Transcript)
80
Factored Language Models and Generalized Graph
Backoff
  • Jeff Bilmes, Katrin Kirchhoff
  • University of Washington, Seattle
  • JHU-WS02 ASR Team

81
Outline
  • Language Models, Backoff, and Graphical Models
  • Factored Language Models (FLMs) as Graphical
    Models
  • Generalized Graph Backoff algorithm
  • New features to SRI Language Model Toolkit (SRILM)

82
Standard Language Modeling
  • Example: standard trigram

83
Typical Backoff in LM
  • In typical LM, there is one natural (temporal)
    path to back off along.
  • Well motivated since information often decreases
    with word distance.

84
Factored LM Proposed Approach
  • Decompose words into smaller morphological or
    class-based units (e.g., morphological classes,
    stems, roots, patterns, or other automatically
    derived units).
  • Produce probabilistic models over these units to
    attempt to improve WER.

85
Example with Words, Stems, and Morphological
classes
86
Example with Words, Stems, and Morphological
classes
87
In general
88
General Factored LM
  • A word is equivalent to a collection of factors.
  • E.g., if K = 3:
  • Goal: find appropriate conditional independence statements to simplify this sort of model while keeping perplexity and WER low. This is the structure-learning problem in graphical models.

89
The General Case
90
The General Case
91
The General Case
92
A Backoff Graph (BG)
93
Example 4-gram Word Generalized Backoff
94
How to choose backoff path?
  • Four basic strategies
  • Fixed path (based on what seems reasonable (e.g.,
    temporal constraints))
  • Generalized all-child backoff
  • Constrained multi-child backoff
  • Child combination rules

95
Choosing a fixed back-off path
96
How to choose backoff path?
  • Four basic strategies
  • Fixed path (based on what seems reasonable (e.g.,
    temporal constraints))
  • Generalized all-child backoff
  • Constrained multi-child backoff
  • Child combination rules

97
Generalized Backoff
  • In typical backoff, we drop the 2nd parent and use the resulting conditional probability.
  • More generally, g() can be any positive function, but we need a new algorithm for computing the backoff weight (BOW).
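The generalized backoff equations are not reproduced on the slide; the following is only a simplified sketch of the idea as described (use a discounted ML estimate above a count threshold, otherwise back off to an arbitrary positive g() function with a normalizing weight), not the SRILM implementation. Counts, discounting, and the example g() are assumptions.

```python
def generalized_backoff(vocab, counts, history, g, tau=0, discount=0.8):
    """Return p(f | history) for every f in vocab."""
    kept, backed_off = {}, []
    hist_total = sum(counts.get((f,) + history, 0) for f in vocab)
    for f in vocab:
        c = counts.get((f,) + history, 0)
        if c > tau:
            kept[f] = discount * c / hist_total   # discounted ML estimate
        else:
            backed_off.append(f)
    mass_left = 1.0 - sum(kept.values())
    g_total = sum(g(f, history) for f in backed_off) or 1.0
    alpha = mass_left / g_total                   # backoff weight (BOW)
    p = dict(kept)
    for f in backed_off:
        p[f] = alpha * g(f, history)              # normalized backoff mass
    return p

counts = {("kalam", "il"): 3, ("kitab", "il"): 1}
g = lambda f, hist: 1.0 / 3                       # e.g. a uniform lower-order stand-in
print(generalized_backoff(["kalam", "kitab", "da"], counts, ("il",), g, tau=1))
```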

98
Computing BOWs
  • Many possible choices for g() functions (next few
    slides)
  • Caveat: certain g() functions can make the LM much more computationally costly than standard LMs.

99
g() functions
  • Standard backoff
  • Max counts
  • Max normalized counts

100
More g() functions
  • Max backoff graph node.

101
More g() functions
  • Max back off graph node.

102
How to choose backoff path?
  • Four basic strategies
  • Fixed path (based on what seems reasonable
    (time))
  • Generalized all-child backoff
  • Constrained multi-child backoff
  • Same as before, but choose a subset of possible
    paths a-priori
  • Child combination rules
  • Combine child nodes via a combination function (mean, weighted avg., etc.)

103
Significant Additions to Stolcke's SRILM, the SRI Language Modeling Toolkit
  • New features added to SRILM include:
  • Can specify an arbitrary number of
    graphical-model based factorized models to train,
    compute perplexity, and rescore N-best lists.
  • Can specify any (possibly constrained) set of
    backoff paths from top to bottom level in BG.
  • Different smoothing (e.g., Good-Turing,
    Kneser-Ney, etc.) or interpolation methods may be
    used at each backoff graph node
  • Supports the generalized backoff algorithms with
    18 different possible g() functions at each BG
    node.

104
Example with Words, Stems, and Morphological
classes
105
How to specify a model
word given stem, morph:
  W 2 S(0) M(0)
    S0,M0   M0   wbdiscount gtmin 1 interpolate
    S0      S0   wbdiscount gtmin 1
    0       0    wbdiscount gtmin 1

morph given word, word:
  M 2 W(-1) W(-2)
    W1,W2   W2   kndiscount gtmin 1 interpolate
    W1      W1   kndiscount gtmin 1 interpolate
    0       0    kndiscount gtmin 1

stem given morph, word, word:
  S 3 M(0) W(-1) W(-2)
    M0,W1,W2   W2   kndiscount gtmin 1 interpolate
    M0,W1      W1   kndiscount gtmin 1 interpolate
    M0         M0   kndiscount gtmin 1
    0          0    kndiscount gtmin 1
106
Summary
  • Language Models, Backoff, and Graphical Models
  • Factored Language Models (FLMs) as Graphical
    Models
  • Generalized Graph Backoff algorithm
  • New features to SRI Language Model Toolkit (SRILM)

107
Coffee Break
  • Back in 10 minutes

108
Knowledge-Free Induction of Arabic Morphology
  • Patrick Schone
  • 21 August 2002

109
Why induce Arabic morphology?
  • (1) Has not been done before
  • (2) If it can be done, and if it has value in LM,
  • it can generalize across languages without
  • needing an expert

110
Original Algorithm (Schone & Jurafsky, '00/'01)
Looking for word inflections on words w/ freq > 9
Use a character tree to find word pairs with similar beginnings/endings. Ex: car/cars, car/cares, car/caring
Use Latent Semantic Analysis to induce semantic vectors for each word, then compare word-pair semantics
Use frequencies of word stems/rules to improve the initial semantic estimates
111
Algorithmic Expansions
IR-Based Minimum Edit Distance
Trie-based approach could be a problem for Arabic templates, e.g. aGlaB → {aGlaB, ilAGil, aGlu, AGil}
Result: 3576 words in the CallHome lexicon w/ 50 relationships!
Use Minimum Edit Distance to find the relationships (can be weighted)
Use an information-retrieval-based approach to facilitate the search for MED candidates
112
Algorithmic Expansions
Agglomerative Clustering Using Rules + Stems

Word pairs w/ stem:            Word pairs w/ rule:
  Gayyar     507                 NULL → il   1178
  xallaS     503                 NULL → u     635
  makallim   468                 NULL → i     455
  qaddim     434                 i → u        377
  itgawwiz   332                 NULL → fa    375
  tkallim    285                 NULL → bi    366

Do bottom-up clustering, where the weight between two words is [Ct(Rule) · Ct(PairedStem)]^(1/2)
113
Algorithmic Expansions
Updated Transitivity
If X↔Y and Y↔Z and XY > 2 and XY < Z, then X↔Z
114
Scoring Induced Morphology
  • Score in terms of conflation set agreement
  • Conflation set(W) = all words morphologically related to W
  • Example: aGlaB → {aGlaB, ilAGil, aGlu, AGil}

If X_W = the induced set for W and Y_W = the truth set for W, compute the total correct (C), inserted (I), and deleted (D) members, and
ErrorRate = 100 · (I + D) / (C + D)
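A small sketch of this scoring, assuming set-valued conflation maps; the names and toy data are illustrative.

```python
def conflation_error_rate(induced, truth):
    C = I = D = 0
    for w in truth:
        x, y = induced.get(w, {w}), truth[w]
        C += len(x & y)          # correct: in both induced and truth sets
        I += len(x - y)          # inserted: induced but not true
        D += len(y - x)          # deleted: true but missed
    return 100.0 * (I + D) / (C + D)

truth   = {"aGlaB": {"aGlaB", "ilAGil", "aGlu", "AGil"}}
induced = {"aGlaB": {"aGlaB", "aGlu", "AGil"}}           # missed ilAGil
print(conflation_error_rate(induced, truth))             # 100*(0+1)/(3+1) = 25.0
```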
115
Scoring Induced Morphology
Induction error rates on words from the original 80-conversation set
116
Using Morphology for LM Rescoring
  • For each word W, use the induced morphology to generate:
  • Stem: the smallest word z from X_W where z < w
  • Root: character intersection across X_W
  • Rule: map of word-to-stem
  • Pattern: map of stem-to-root
  • Class: map of word-to-root

117
Other Potential Benefits of Morphology: Morphology-driven Word Generation
  • Generate probability-weighted words using morphologically-derived rules (like NULL → il+NULL)
  • Generate only if the initial and final n characters of the stem have been seen before.

118
(No Transcript)
119
Text Selection for Conversational Arabic
  • Feng He
  • ASR (Arabic Speech Recognition) Team
  • JHU Workshop

120
Motivation
  • Group goal: conversational Arabic speech recognition.
  • One of the problems: not enough training data to build a language model; most available text is in MSA (Modern Standard Arabic) or a mixture of MSA and conversational Arabic.
  • One solution: select from the mixed text the segments that are conversational, and use them in training.

121
Task Text Selection
  • Use POS-based language models because they have been shown to better indicate differences in style, such as formal vs. conversational.
  • Method:
  • Train a POS (part-of-speech) tagger on available data
  • Train POS-based language models on formal vs. conversational data
  • Tag new data
  • Select segments from the new data that are closest to the conversational model, using scores from the POS-based language models.

122
Data
  • For building the tagger and language models:
  • Arabic Treebank: 130K words of hand-tagged newspaper text in MSA.
  • Arabic CallHome: 150K words of transcribed phone conversations. Tags are only in the lexicon.
  • For text selection:
  • Al Jazeera: 9M words of transcribed TV broadcasts. We want to select segments that are closer to conversational Arabic, such as talk shows and interviews.

123
Implementation
  • Model (bigram)

124
About unknown words
  • These are words that are not seen in training
    data, but appear in test data.
  • Assume unknown words behave like singletons
    (words that appear only once in training data).
  • This is done by duplicating the training data with singletons replaced by a special token, then training the tagger on both the original and the duplicate.
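A minimal sketch of that duplication trick; the function name, token, and toy sentences are illustrative, not the workshop scripts.

```python
from collections import Counter

def with_unk_copy(sentences, unk="<unk>"):
    counts = Counter(w for s in sentences for w in s)
    singletons = {w for w, c in counts.items() if c == 1}
    duplicate = [[unk if w in singletons else w for w in s] for s in sentences]
    return sentences + duplicate     # train the tagger on both copies

print(with_unk_copy([["il", "kalam", "da"], ["il", "kitab", "da"]]))
```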

125
  • Tools
  • GMTK (Graphical Model Toolkit)
  • Algorithms
  • Training: EM; set parameters so that the joint probability of hidden states and observations is maximized.
  • Decoding (tagging): Viterbi; find the hidden state sequence that maximizes the joint probability of hidden states and observations.

126
Experiments
  • Exp 1: Data: first 100K of the English Penn Treebank. Trigram model. Sanity check.
  • Exp 2: Data: Arabic Treebank. Trigram model.
  • Exp 3: Data: Arabic Treebank and CallHome. Trigram model.
  • The above three experiments all used 10-fold cross validation, and are unsupervised.
  • Exp 4: Data: Arabic Treebank. Supervised trigram model.
  • Exp 5: Data: Arabic Treebank and CallHome. Partially supervised training using the Treebank's tagged data. Test on the portion of the Treebank not used in training. Trigram model.

127
Results
128
Building Language Models and Text Selection
  • Use existing scripts to build formal and
    conversational language models from tagged Arabic
    Treebank and CallHome data.
  • Text selection: use the log likelihood ratio

S_i = the i-th sentence in the data set; C = conversational language model; F = formal language model; N_i = length of S_i
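The selection formula itself is not transcribed above; the sketch below assumes the usual length-normalized log likelihood ratio, score(S_i) = (log P_C(S_i) - log P_F(S_i)) / N_i, with P_C the conversational LM and P_F the formal LM. The toy unigram "LMs" and threshold are placeholders.

```python
from math import log

def sentence_logprob(sentence, unigram_lm, floor=1e-6):
    return sum(log(unigram_lm.get(w, floor)) for w in sentence)

def select_segments(sentences, conv_lm, formal_lm, threshold=0.0):
    selected = []
    for s in sentences:
        score = (sentence_logprob(s, conv_lm) - sentence_logprob(s, formal_lm)) / max(len(s), 1)
        if score > threshold:            # closer to the conversational model
            selected.append(s)
    return selected

conv_lm   = {"da": 0.2, "kwayyis": 0.1}          # toy probabilities
formal_lm = {"hatha": 0.2, "jayyid": 0.1}
print(select_segments([["da", "kwayyis"], ["hatha", "jayyid"]], conv_lm, formal_lm))
```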
129
Score Distribution
(Figure: histograms of the log likelihood ratio scores; y-axes show percentage and log count, x-axes show the log likelihood ratio.)
130
Assessment
  • A subset of Al Jazeera equal in size to Arabic
    CallHome (150K words) is selected, and added to
    training data for speech recognition language
    model.
  • No reduction in perplexity.
  • Possible reasons: Al Jazeera has no conversational Arabic, or has only conversational Arabic of a very different style.

131
Text Selection Work Done at BBN
  • Rich Schwartz
  • Mohamed Noamany
  • Daben Liu
  • Nicolae Duta

132
Search for Dialect Text
  • We have an insufficient amount of CH text for estimating an LM.
  • Can we find additional data?
  • Many words are unique to dialect text.
  • Searched the Internet for 20 common dialect words.
  • Most of the data found were jokes or chat rooms; very little data.

133
Search BN Text for Dialect Data
  • Search BN text for the same 20 dialect words.
  • Found less data than in CH
  • Each occurrence was typically an isolated lapse
    by the speaker into dialect, followed quickly by
    a recovery to MSA for the rest of the sentence.

134
Combine MSA text with CallHome
  • Estimate separate models for MSA text (300M
    words) and CH text (150K words).
  • Use SRI toolkit to determine single optimal
    weight for the combination, using deleted
    interpolation (EM)
  • Optimal weight for MSA text was 0.03
  • Insignificant reduction in perplexity and WER

135
Classes from BN
  • Hypothesis
  • Even if MSA n-grams are different, perhaps the classes are the same.
  • Experiment
  • Determine classes (using the SRI toolkit) from BN+CH data.
  • Use CH data to estimate n-grams of classes and/or p(w | class)
  • Combine resulting model with CH word trigram
  • Result
  • No gain

136
Hypothesis Test Constrained Back-Off
  • Hypothesis
  • In combining BN and CH, if a probability is different, it could be for 2 reasons:
  • CH has insufficient training
  • BN and CH truly have different probabilities (likely)
  • Algorithm
  • Interpolate BN and CH, but limit the probability change to no more than would be likely due to insufficient training (see the sketch after this list).
  • An n-gram count cannot change by more than its square root
  • Result
  • No gain
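A sketch of the constraint described above: the CH n-gram count is allowed to move toward the BN-based estimate by at most the square root of the CH count, on the view that larger changes reflect a genuine dialect difference rather than sparse data. The exact BBN implementation is not given on the slide; the function below is illustrative.

```python
from math import sqrt

def constrained_count(ch_count, bn_expected_count):
    """Return the adjusted CH count, limited to +/- sqrt(ch_count)."""
    max_change = sqrt(max(ch_count, 1))
    change = bn_expected_count - ch_count
    change = max(-max_change, min(max_change, change))
    return ch_count + change

print(constrained_count(9, 25))   # 12.0: change capped at sqrt(9) = 3
print(constrained_count(9, 10))   # 10.0: small change accepted as-is
```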

137
(No Transcript)
138
Learning Using Factored Language Models
  • Gang Ji
  • Speech, Signal, and Language Interpretation
  • University of Washington
  • August 21, 2002

139
Outline
  • Factored Language Models (FLMs) overview
  • Part I: automatically finding FLM structure
  • Part II: first-pass decoding in ASR with FLMs using graphical models

140
Factored Language Models
  • Along with words, consider factors as components
    of the language model
  • Factors can be words, stems, morphs, patterns,
    roots, which might contain complementary
    information about language
  • FLMs also provide new possibilities for designing LMs (e.g., multiple back-off paths)
  • Problem: we don't know the best model, and the space is huge!

141
Factored Language Models
  • How to learn FLMs:
  • Solution 1: do it by hand, using expert linguistic knowledge
  • Solution 2: data-driven; let the data help decide the model
  • Solution 3: combine both linguistic and data-driven techniques

142
Factored Language Models
  • A Proposed Solution
  • Learn FLMs using an evolution-inspired search algorithm
  • Idea: survival of the fittest
  • A collection (generation) of models
  • In each generation, only good ones survive
  • The survivors produce the next generation

143
Evolution-Inspired Search
  • Selection: choose the good LMs
  • Combination: retain useful characteristics
  • Mutation: some small change in the next generation

144
Evolution-Inspired Search
  • Advantages
  • Can quickly find a good model
  • Retain goodness of the previous generation while
    covering significant portion of the search space
  • Can run in parallel
  • How to judge the quality of each model?
  • Perplexity on a development set
  • Rescore WER on development set
  • Complexity-penalized perplexity

145
Evolution-Inspired Search
  • Three steps form new models:
  • Selection (based on perplexity, etc.)
  • E.g., stochastic universal sampling: models are selected in proportion to their fitness
  • Combination
  • Mutation

146
Moving from One Generation to Next
  • Combination Strategies
  • Inherit structures horizontally
  • Inherit structures vertically
  • Random selection
  • Mutation
  • Add/remove edges randomly
  • Change back-off/smoothing strategies

147
Combination according to Frames
148
Combination according to Factors
149
Outline
  • Factored Language Models (FLMs) overview
  • Part I: automatically finding FLM structure
  • Part II: first-pass decoding with FLMs

150
Problem
  • May be difficult to improve WER just by rescoring
    n-best lists
  • More gains can be expected from using better
    models in first-pass decoding
  • Solution
  • do first-pass decoding using FLMs
  • Since FLMs can be viewed as graphical models, use GMTK (most existing tools don't support general graph-based models)
  • To speed up inference, use generalized
    graphical-model-based lattices.

151
FLMs as Graphical Models
(Diagram: factor nodes F1, F2, F3 and the word node, connected to the graph for the acoustic model.)
152
FLMs as Graphical Models
  • Problem: decoding can be expensive!
  • Solution: multi-pass graphical lattice refinement
  • In the first pass, generate graphical lattices using a simple model (i.e., more independencies)
  • Rescore the lattices using a more complicated model (fewer independencies) but on a much smaller search space

153
Example Lattices in a Markov Chain
This is the same as a word-based lattice
154
Lattices in General Graphs
(No Transcript)
155
Research Plan
  • Data
  • Arabic CallHome data
  • Tools
  • Tools for evolution-inspired search
  • most parts already developed during the workshop
  • Training/Rescoring FLMs
  • Modified SRI LM toolkit developed during this
    workshop
  • Multi-pass decoding
  • Graphical Models Toolkit (GMTK), developed in the last workshop

156
Summary
  • Factored Language Models (FLMs) overview
  • Part I: automatically finding FLM structure
  • Part II: first-pass decoding of FLMs using GMTK and graphical lattices

157
Combination and mutation
(No Transcript)
158
Stochastic Universal Sampling
  • Idea: the probability of surviving is proportional to fitness
  • Fitness is the quantity we described before
  • SUS:
  • Models have bins with length proportional to fitness, laid out on the x-axis
  • Choose N evenly-spaced samples uniformly
  • Choose a model k times, where k is the number of samples in its bin.
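A minimal sketch of stochastic universal sampling as described above: each model gets a bin whose width is proportional to its fitness, N evenly spaced pointers (with a single random offset) are laid over the bins, and a model survives k times if k pointers land in its bin. Names and fitness values are illustrative.

```python
import random

def sus(models, fitness, n):
    total = sum(fitness)
    step = total / n
    start = random.uniform(0, step)
    pointers = [start + i * step for i in range(n)]
    chosen, cum, j = [], 0.0, 0
    for m, f in zip(models, fitness):
        cum += f                      # right edge of this model's bin
        while j < len(pointers) and pointers[j] < cum:
            chosen.append(m)          # one copy per pointer landing in the bin
            j += 1
    return chosen

print(sus(["lm_a", "lm_b", "lm_c"], fitness=[0.6, 0.3, 0.1], n=3))
```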

159
FLMs as Graphical Models
(Diagram: graphical model with factor nodes F1, F2, F3, the word node, word transition, word position, phone transition, phone, and acoustic observation nodes.)
160
(No Transcript)
161
Minimum Divergence Adaptation of an MSA-Based Language Model to Egyptian Arabic
  • A proposal by
  • Sourin Das
  • JHU Workshop Final Presentation
  • August 21, 2002

162
Motivation for LM Adaptation
  • Transcripts of spoken Arabic are expensive to obtain; MSA text is relatively inexpensive (AFP newswire, ELRA Arabic data, Al Jazeera, ...)
  • MSA text ought to help; after all, it is Arabic
  • However, there are considerable dialectal differences
  • Inferences drawn from Callhome knowledge or data ought to overrule those from MSA whenever the inferences drawn from them disagree, e.g. estimates of N-gram probabilities
  • Cannot interpolate models or merge data naïvely
  • Need to instead fall back to MSA knowledge only when the Callhome model or data is agnostic about an inference

163
Motivation for LM Adaptation
  • The minimum K-L divergence framework provides a
    mechanism to achieve this effect
  • First estimate a language model Q from MSA text
    only
  • Then find a model P which matches all major
    Callhome statistics and is close to Q.
  • Anecdotal evidence: MDI methods were successfully used to adapt models based on NABN text to SWBD, giving a 2% WER reduction in LM95 from a 50% baseline WER.

164
An Information Geometric View
(Diagram, information-geometric view: within the space of all language models, the set of models satisfying the Callhome marginals contains the MaxEnt Callhome LM and the minimum divergence Callhome LM, the set of models satisfying the MSA-text marginals contains the MaxEnt MSA-text LM, and the uniform distribution is also marked.)
165
A Parametric View of MaxEnt Models
  • The MSA-text based MaxEnt LM is the ML estimate among exponential models of the form
    Q(x) = Z^{-1}(α,β) exp{ Σ_i α_i f_i(x) + Σ_j β_j g_j(x) }
  • The Callhome based MaxEnt LM is the ML estimate among exponential models of the form
    P(x) = Z^{-1}(β,γ) exp{ Σ_j β_j g_j(x) + Σ_k γ_k h_k(x) }
  • Think of the Callhome LM as being from the family
    P(x) = Z^{-1}(α,β,γ) exp{ Σ_i α_i f_i(x) + Σ_j β_j g_j(x) + Σ_k γ_k h_k(x) }
    where we set α = 0 based on the MaxEnt principle.
  • One could also be agnostic about the values of the α_i's, since no examples with f_i(x) > 0 are seen in Callhome
  • Features (e.g. N-grams) from MSA text which are not seen in Callhome always have f_i(x) = 0 in the Callhome training data

166
A Pictorial Interpretation of the Minimum
Divergence Model
(Diagram labels:
  The ML model for MSA text: Q(x) = Z^{-1}(α,β) exp{ Σ_i α_i f_i(x) + Σ_j β_j g_j(x) }
  Subset of all exponential models with α fixed at the MSA-text values: P(x) = Z^{-1}(α,β,γ) exp{ Σ_i α_i f_i(x) + Σ_j β_j g_j(x) + Σ_k γ_k h_k(x) }
  The ML model for Callhome, with α fixed at the MSA-text values instead of α = 0: P(x) = Z^{-1}(α,β,γ) exp{ Σ_i α_i f_i(x) + Σ_j β_j g_j(x) + Σ_k γ_k h_k(x) }
  Subset of all exponential models with γ = 0: Q(x) = Z^{-1}(α,β) exp{ Σ_i α_i f_i(x) + Σ_j β_j g_j(x) }
  All exponential models of the form P(x) = Z^{-1}(α,β,γ) exp{ Σ_i α_i f_i(x) + Σ_j β_j g_j(x) + Σ_k γ_k h_k(x) })
167
Details of Proposed Research (1): A Factored LM for MSA Text
  • Notation: W = romanized word, A = script form, S = stem, R = root, M = tag
  • Q(A_i | A_i-1, A_i-2) = Q(A_i | A_i-1, A_i-2, S_i-1, S_i-2, M_i-1, M_i-2, R_i-1, R_i-2)
  • Examine all 8C2 = 28 trigram templates of two variables from the history with A_i.
  • Set observations w/ counts above a threshold as features
  • Examine all 8C1 = 8 bigram templates of one variable from the history with A_i.
  • Set observations w/ counts above a threshold as features
  • Build a MaxEnt model (use Jun Wu's toolkit):
    Q(A_i | A_i-1, A_i-2) = Z^{-1}(α,β) exp{ α_1 f_1(A_i, A_i-1, S_i-2) + α_2 f_2(A_i, M_i-1, M_i-2) + ... + α_i f_i(A_i, A_i-1) + β_j g_j(A_i, R_i-1) + ... + β_J g_J(A_i) }
  • Build the romanized language model:
    Q(W_i | W_i-1, W_i-2) = U(W_i | A_i) Q(A_i | A_i-1, A_i-2)

168
A Pictorial Interpretation of the Minimum
Divergence Model
(Diagram labels:
  The ML model for MSA text: Q(x) = Z^{-1}(α,β) exp{ Σ_i α_i f_i(x) + Σ_j β_j g_j(x) }
  The ML model for Callhome, with α fixed at the MSA-text values instead of α = 0: P(x) = Z^{-1}(α,β,γ) exp{ Σ_i α_i f_i(x) + Σ_j β_j g_j(x) + Σ_k γ_k h_k(x) }
  All exponential models of the form P(x) = Z^{-1}(α,β,γ) exp{ Σ_i α_i f_i(x) + Σ_j β_j g_j(x) + Σ_k γ_k h_k(x) })
169
Details of Proposed Research (2): Additional Factors in the Callhome LM
  • P(W_i | W_i-1, W_i-2) = P(W_i, A_i | W_i-1, W_i-2, A_i-1, A_i-2, S_i-1, S_i-2, M_i-1, M_i-2, R_i-1, R_i-2)
  • Examine all 10C2 = 45 trigram templates of two variables from the history with W or A.
  • Set observations w/ counts above a threshold as features
  • Examine all 10C1 = 10 bigram templates of one variable from the history with W or A.
  • Set observations w/ counts above a threshold as features
  • Compute a Min Divergence model of the form
    P(W_i | W_i-1, W_i-2) = Z^{-1}(α,β,γ) exp{ α_1 f_1(A_i, A_i-1, S_i-2) + α_2 f_2(A_i, M_i-1, M_i-2) + ... + α_i f_i(A_i, A_i-1) + β_j g_j(A_i, R_i-1) + ... + β_J g_J(A_i) }
    · exp{ γ_1 h_1(W_i, W_i-1, S_i-2) + γ_2 h_2(A_i, W_i-1, S_i-2) + ... + γ_k h_k(A_i, A_i-1) + ... + γ_K h_K(W_i) }

170
Research Plan and Conclusion
  • Use baseline Callhome results from WS02
  • Investigate treating romanized forms of a script
    form as alternate pronunciations
  • Build the MSA-text MaxEnt model
  • Feature selection is not critical; use high cutoffs
  • Choose features for the Callhome model
  • Build and test the minimum divergence model
  • Plug in induced structure
  • Experiment with subsets of MSA text

171
A Pictorial Interpretation of the Minimum
Divergence Model
(Diagram labels:
  The ML model for MSA text: Q(x) = Z^{-1}(α,β) exp{ Σ_i α_i f_i(x) + Σ_j β_j g_j(x) }
  The ML model for Callhome, with α fixed at the MSA-text values instead of α = 0: P(x) = Z^{-1}(α,β,γ) exp{ Σ_i α_i f_i(x) + Σ_j β_j g_j(x) + Σ_k γ_k h_k(x) }
  All exponential models of the form P(x) = Z^{-1}(α,β,γ) exp{ Σ_i α_i f_i(x) + Σ_j β_j g_j(x) + Σ_k γ_k h_k(x) })
172
(No Transcript)