Title: Machine Transliteration
1. Machine Transliteration
2. Overview
- Words written in a language with alphabet A → written in a language with alphabet B (e.g. ???? → shalom)
- Importance for MT, for cross-language IR
- Forward transliteration, Romanization,
back-transliteration
3. Is there a convergence towards standards?
- Perhaps for really famous names. Even for such standard names, there are multiple acceptable spellings. Whether there is someone regulating such spellings probably depends on the culture. In the meantime, there is a lot of variance, especially on the Web. E.g. the holiday of Succot (?????, ???????).
- Variance in pronunciation across different cultural groups (soo-kot, suh-kes), i.e. dialect, and variance in how one chooses to transliterate different Hebrew letters (kk, cc, gemination).
- Web hit counts:
  - Sukkot: 7.1 million
  - Succot: 173 thousand
  - Succos: 153 thousand
  - Sukkoth: 113 thousand
  - Succoth: 199 thousand
  - Sukos: 112 thousand
  - Sucos: 927 thousand, but probably almost none related to the holiday
  - Sucot: 101 thousand; Spanish transliteration of the holiday
  - Sukkes: 1.4 thousand; Yiddish rendition
  - Succes: 68 million; misspelling of "success"
  - Sukket: 45 thousand; but not Yiddish, because Yiddish wouldn't have a t ending
- Recently in the news: AP has Emad Borat; Arutz Sheva has Imad Muhammad Intisar Boghnat.
4. Can we enforce standards?
- Would make task easier.
- News articles, perhaps
- However
- Would they listen to us?
- Does the standard make sense across the board? Once again, dialectal differences, e.g. ?, ?, vowels. Also, fold-over of the alphabet: ?-?, ?-?, ?-?, ?-?, ?-?
- 2N for N languages
5. (No Transcript)
6. Four Papers
- Cross Linguistic Name Matching in English and Arabic
  - For IR search. Fuzzy string matching. A modification of Soundex to use a cross-language mapping, via character equivalence classes.
- Machine Transliteration
  - For machine translation. Back-transliteration. Five steps in the transliteration process. Uses Bayes' rule.
- Transliteration of Proper Names in Cross-Language Applications
  - Forward transliteration, purely statistically based.
- Statistical Transliteration for English-Arabic Cross Language Information Retrieval
  - Forward transliteration. For IR: generate every possible transliteration, then evaluate, using a selected n-gram model.
7. Cross Linguistic Name Matching in English and Arabic: A One to Many Mapping Extension of the Levenshtein Edit Distance Algorithm
- Dr. Andrew T. Freeman, Dr. Sherri L. Condon, and Christopher M. Ackerman
- The Mitre Corporation
8. Cross Linguistic Name Matching
- What?
  - Match personal names in English to the same names in Arabic script.
- Why is this not a trivial problem?
  - There are multiple transcription schemes, so it is not one-to-one.
  - E.g. ???? ??????? can be Muammar Gaddafi, Muammar Qaddafi, Moammar Gadhafi, Muammar Qadhafi, Muammar al Qadhafi.
  - Because certain consonants and vowels can be represented multiple ways in English.
  - (Note: Arabic is just an example of this phenomenon.)
  - So standard string comparison is insufficient.
- For what purpose?
  - For search on, say, news articles. How do you match all occurrences of Qadhafi?
- Their solution
  - Enter the search term in Arabic; use Character Equivalence Classes (CEQ) to generate possible transliterations; supplement the Levenshtein Edit Distance Algorithm.
9. Elaboration on Multiple Transliteration Schemes
- Why?
  - There is no standard English phoneme corresponding to Arabic /q/.
  - In different dialects in Libya, this is pronounced g.
  - (Note: similar for Hebrew dialects.)
10. Fuzzy String Matching
- Def.: matching strings based on similarity rather than identity.
- Examples
  - edit distance
  - n-gram matching
  - normalization procedures like Soundex
11. Survey of Fuzzy Matching Methods: Soundex
- Soundex
- Odell and Russell, 1918
- Some obvious pluses
- (not mentioned explicitly by paper)
- We eliminate vowels, so Moammar/Muammar is not a problem.
- Groups of letters will take care of different English letters corresponding to Arabic.
- Elimination of repetition and of h will remove gemination/fricatives.
- Some minuses
  - Perhaps dialects will transgress Soundex phonetic code boundaries. E.g. ? in Hebrew can be t, th, or s; ? can be ch or h. Is a ? to be w or v? But one could modify the algorithm to match.
  - (Note: "al" in al-Qadafi.)
  - Perhaps it would match too many inappropriate results.
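The classic Soundex code the slide refers to can be sketched in a few lines. This is a minimal version of the Odell and Russell scheme (keep the first letter, bucket remaining consonants into digit classes, drop vowels and h/w, collapse runs); the variants discussed in the paper differ in details.

```python
def soundex(name):
    """Minimal classic Soundex: keep the first letter, map remaining
    consonants to digit classes, drop vowels and h/w, collapse runs of
    equal codes, pad/truncate to four characters."""
    codes = {}
    for digit, letters in enumerate(
            ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
        for ch in letters:
            codes[ch] = str(digit)
    name = name.lower()
    out, prev = [], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out.append(code)
        if ch not in "hw":       # h and w do not break a run of equal codes
            prev = code
    return (name[0].upper() + "".join(out) + "000")[:4]
```

As the slide notes, dropping vowels makes Moammar and Muammar collide onto the same code.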
12. Noisy Channel Model
13. Levenshtein Edit Distance
- AKA Minimum Edit Distance
- Minimum number of operations of insertion, deletion, and substitution; cost per operation is 1.
- Computed via dynamic programming.
- Example taken from Jurafsky and Martin, but with corrections.
- Each cell is the minimum of the diagonal (substitution) cost or the down/left (insertion/deletion) cost.
14. Minimum Edit Distance Example (substitution cost 2)
N 9 8 9 10 11 12 11 10 9 8
O 8 7 8 9 10 11 10 9 8 9
I 7 6 7 8 9 10 9 8 9 10
T 6 5 6 7 8 9 8 9 10 11
N 5 4 5 6 7 8 9 10 11 12
E 4 3 4 5 6 7 8 9 10 11
T 3 4 5 6 7 8 7 8 9 10
N 2 3 4 5 6 7 8 9 10 11
I 1 2 3 4 5 6 7 8 9 10
0 1 2 3 4 5 6 7 8 9
E X E C U T I O N
15. Minimum Edit Distance Example (substitution cost 1)
N 9 7 7 7 7 8 8 7 6 5
O 8 6 6 6 7 7 7 6 5 6
I 7 5 5 6 6 6 6 5 6 7
T 6 4 5 5 5 5 5 6 7 8
N 5 4 4 5 4 5 6 7 7 7
E 4 3 4 3 4 5 6 6 7 8
T 3 3 3 3 4 5 5 6 7 8
N 2 2 2 3 4 5 6 7 7 8
I 1 1 2 3 4 5 6 6 7 8
0 1 2 3 4 5 6 7 8 9
E X E C U T I O N
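The two matrices above come from the standard dynamic program. A minimal implementation, with the substitution cost as a parameter so it reproduces both examples:

```python
def edit_distance(s, t, sub_cost=1):
    """Minimum edit distance via dynamic programming: cell (i, j) is the
    cheapest way to turn s[:i] into t[:j] using insertion/deletion
    (cost 1) and substitution (cost sub_cost, 0 when characters match)."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subst = 0 if s[i - 1] == t[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + subst)
    return d[m][n]
```

With sub_cost=2 this gives 8 for intention/execution, and with sub_cost=1 it gives 5, matching the top-right cells of the two matrices.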
16. Minimum Edit Distance
- A score of 0 is a perfect match, since no edit ops are needed.
- s of length m, t of length n.
- Fuzzy matching: divide the edit score by the length of the shortest (or longest) string, take 1 minus this number, and set a threshold for strings to count as a match. Then longer pairs of strings are more likely to be matched than shorter pairs with the same number of edits: we get the percentage of characters that need ops. Otherwise "A" vs. "I" has the same edit distance as "tuning" vs. "turning".
- A good algorithm for fuzzy string comparison can see that Muammar Gaddafi, Muammar Qaddafi, Moammar Gadhafi, Muammar Qadhafi, and Muammar al Qadhafi are relatively close.
- But we don't really want the full substitution cost for G/Q, O/U, DD/DH, or certain insertion/deletion costs. That is why they supplement it with Character Equivalence Classes (CEQ), which we'll get to a bit later.
17. Editex
- Zobel and Dart (1996): Soundex + Levenshtein edit distance.
- Replace e(si, tj), which was basically 1 if unequal and 0 if equal (that is, the cost of an op), with r(si, tj), which makes use of Soundex-style equivalences: 0 if identical, 1 if in the same group, 2 if different.
- Also neutralizes h and w in general. (Show example based on the chart from before.) When initializing or calculating the cost of insertion/deletion, these do not count; otherwise the cost is 1.
- Other enhancements to standard Soundex and edit distance for the purpose of comparison: e.g. tapering (differences count less later in the word) and phonometric methods (input strings mapped to phonemic representations, e.g. "rough").
- They report that it performed better than Soundex, minimum edit distance, counting n-gram sequences, 10 permutations of tapering, and phonometric enhancements to the standard algorithms.
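An Editex-style distance can be sketched by swapping the unit substitution cost for the group-based cost r(si, tj). This is a simplified sketch, not the exact Zobel and Dart recurrence: substitutions cost 0/1/2 by letter group, and h/w are cheap to insert or delete.

```python
# Phonetic letter groups in the spirit of Editex (overlap is allowed).
GROUPS = ["aeiouy", "bp", "ckq", "dt", "lr", "mn", "gj", "fpv", "sxz", "csz"]

def r(a, b):
    """Editex letter cost: 0 if identical, 1 if in the same group, else 2."""
    if a == b:
        return 0
    if any(a in g and b in g for g in GROUPS):
        return 1
    return 2

def editex_like(s, t):
    """Simplified Editex-style distance: group-aware substitution costs,
    and h/w are cheap (cost 1 instead of 2) to insert or delete."""
    def indel(ch):
        return 1 if ch in "hw" else 2
    m, n = len(s), len(t)
    e = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        e[i][0] = e[i - 1][0] + indel(s[i - 1])
    for j in range(1, n + 1):
        e[0][j] = e[0][j - 1] + indel(t[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            e[i][j] = min(e[i - 1][j] + indel(s[i - 1]),
                          e[i][j - 1] + indel(t[j - 1]),
                          e[i - 1][j - 1] + r(s[i - 1], t[j - 1]))
    return e[m][n]
```

Vowel and consonant groups make muammar/moammar cost 1 rather than a full substitution.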
18. SecondString (Tool)
- A Java-based implementation of many of these string matching algorithms; they use it for comparison purposes. SecondString also allows hybrid algorithms by mixing and matching, and provides tools for string matching metrics and for matching tokens within strings.
19. Baseline Task (??)
- Took 106 Arabic and 105 English texts from newswire articles.
- Took names from these articles: 408 names from English, 255 names from Arabic.
- Manual cross-script matching yielded 29 common names (rather than manually coming up with all possible transliterations).
- But to get a baseline, they tried matching all names in Arabic (transliterated using Atrans, by Basis, 2004) to all names in English, using algorithms from SecondString. Thus they have one standard transliteration and try to match it to all other English transliterations.
- Empirically set the threshold to something that yielded good results.
- R (recall) = correctly matched English names / available correct English matches in the set: what percentage of the total correct did they get?
- P (precision) = total correct names / total names returned: what percentage of their guesses were accurate?
- Defined F-score as 2 × (P × R) / (P + R).
20. Other Algorithms Used For Comparison
- Smith-Waterman: Levenshtein edit distance with some parameterization of the gap score.
- SLIM: an iterative statistical learning algorithm based on a variety of expectation-maximization, in which a Levenshtein edit-distance matrix is iteratively processed to find the statistical probabilities of the overlap between two strings.
- Jaro: n-gram based.
- The last one is edit distance.
21. Their Enhancements
- Motivation: an Arabic letter has more than one possible English letter equivalent. Also, Arabic transliterations of English names are not predictable: there are 6 different ways to represent Milosevic in Arabic.
22. Some Real World Knowledge
23. Character Equivalence Classes
- Same idea as Editex, except using Ar(si, tj), where s is an Arabic word (so si is an Arabic letter) and t is an English word (so tj is an English letter).
- So they compare Arabic to English directly, rather than via a standard transliteration.
- The sets within Ar handle the (modified) Buckwalter transliteration, the default transliteration of the Basis software.
- Basis uses English digraphs for certain letters.
24. Buckwalter Transliteration Scheme
- A scholarly transliteration scheme, unlikely to be found in newspaper articles.
- Wikipedia: "The Buckwalter Arabic transliteration was developed at Xerox by Tim Buckwalter in the 1990s. It is an ASCII-only transliteration scheme, representing Arabic orthography strictly one-to-one, unlike the more common romanization schemes that add morphological information not expressed in Arabic script. Thus, for example, a waw will be transliterated as w regardless of whether it is realized as a vowel u or a consonant w. Only when the waw is modified by a hamza (?) does the transliteration change. The unmodified letters are straightforward to read (except for maybe dhaal and Eayin, vthaa), but the transliteration of letters with diacritica and the harakat take some time to get used to; for example the nunated i'rab -un, -an, -in appear as N, F, K, and the sukun ('no vowel') as o. Ta marbouta ? is p."
- hamza
  - lone hamza: '
  - hamza on alif: >
  - hamza on waw
  - hamza on ya
- alif
  - madda on alif
  - alif al-wasla
  - dagger alif
  - alif maqsura: Y
25. The Equivalence Classes
26. Normalization
- They normalize both the Buckwalter and the English in the newspaper articles.
- Thus ? → sh from Buckwalter, ph → f in English, eliminate duplicated letters, etc.
- Move vowels from each language closer to one another by only retaining matching vowels (that is, those that exist in both).
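A toy version of the normalization step, with two illustrative rules from the slide (the paper's rule set is longer):

```python
import re

def normalize_english(s):
    """Toy normalization sketch: ph -> f, collapse doubled letters.
    The paper applies more rules, including vowel reconciliation."""
    s = s.lower()
    s = s.replace("ph", "f")            # ph -> f
    s = re.sub(r"(.)\1+", r"\1", s)     # eliminate duplicated letters
    return s
```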
27. (No Transcript)
28. Why Different from Soundex and Editex
- What we do here is the opposite of the approach
taken by the Soundex and Editex algorithms. They
try to reduce the complexity by collapsing groups
of characters into a single super-class of
characters. The algorithm here does some of that
with the steps that normalize the strings.
However, the largest boost in performance is with
CEQ, which expands the number of allowable
cross-language matches for many characters.
29. Machine (Back-) Transliteration
- Kevin Knight and Jonathan Graehl
- University of Southern California
30. Machine Transliteration
- For translation purposes.
- Foreign words are commonly transliterated, using approximate phonemic equivalents: computer → konpyuuta.
- Problem: usually we translate by looking words up in dictionaries, but these words often don't show up in dictionaries.
- Usually not a problem for some language pairs, like Spanish/English, since they have similar alphabets. But non-alphabetic languages, or those with different alphabets, are more problematic (e.g. Japanese, Arabic).
- Popular on the Internet: "The Coca-Cola name in China was first read as 'Ke-kou-ke-la,' meaning 'Bite the wax tadpole' or 'female horse stuffed with wax,' depending on the dialect. Coke then researched 40,000 characters to find a phonetic equivalent, 'ko-kou-ko-le,' translating into 'happiness in the mouth.'"
- Solution: backwards transliteration to recover the original word, using a generative model.
31. Machine Transliteration
- Japanese transliterates e.g. English in katakana: foreign names and loan-words.
- Compromises, e.g. golfbag:
  - L/R map to the same character.
  - Japanese has an alternating consonant-vowel pattern, so it cannot have the consonant cluster LFB.
  - Syllabary instead of alphabet.
  - goruhubaggu
- There is a dot separator, but it is inconsistent, so:
  - aisukuriimu can be "I scream"
  - or "ice cream"
32. Back Transliteration
- Going from katakana back to the original English word.
- For translation: katakana terms are not found in bilingual dictionaries, so just generate the original English (assuming it is English).
- Yamrom 1994: pattern matching.
- Arbabi 1994: neural net/expert system.
- Information is lost, so it is not easy to invert.
33. More Difficult Than
- Forward transliteration:
  - There are several ways to transliterate into katakana, all valid, so you might encounter any of them.
  - But there is only one English spelling: you can't say "arture" for "archer".
- Romanization:
  - We have seen examples of this: the katakana examples above.
  - More difficult because of spelling variations.
- Certain things cannot be handled by back-transliteration:
  - Onomatopoeia
  - Shorthand, e.g. waapuro for "word processing"
34. Desired Features
- Accuracy
- Portability to other languages
- Robustness against OCR errors
- Relevance to ASR where the speaker has a heavy accent
- Ability to take context (topical/syntactic) into account, or at least return a ranked list of possibilities
- Really requires 100% knowledge
35. Learning Approach: Initial Attempt
- Can learn which letters transliterate to which by training on a corpus of katakana phrases in bilingual dictionaries.
- Drawbacks:
  - With the naive approach, how can we make sure we get a normal transliteration?
  - E.g. we can get "iskrym" as a back-transliteration of aisukuriimu.
  - Take letter frequency into account! Then we can get "isclim".
  - Restrict to real words! "Is crime."
  - We want "ice cream"!
36. Modular Learning Approach
- Build a generative model of the transliteration process:
  - An English phrase is written.
  - A translator pronounces it in English.
  - The pronunciation is modified to fit the Japanese sound inventory.
  - The sounds are converted into katakana.
  - The katakana is written.
- Solve and coordinate solutions to these subproblems; use the generative models in the reverse direction.
- Use probabilities and Bayes' rule.
37. Bayes' Rule Example
- Example (conditional probabilities, from Wikipedia):
- Suppose there are two bowls full of cookies. Bowl 1 has 10 chocolate chip cookies and 30 plain cookies, while bowl 2 has 20 of each. Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl 1?
- Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl 1. The precise answer is given by Bayes's theorem. But first, we can clarify the situation by rephrasing the question to "what's the probability that Fred picked bowl 1, given that he has a plain cookie?" Thus, to relate to our previous explanation, the event A is that Fred picked bowl 1, and the event B is that Fred picked a plain cookie. To compute Pr(A|B), we first need to know:
- Pr(A), the probability that Fred picked bowl 1 regardless of any other information. Since Fred is treating both bowls equally, it is 0.5.
- Pr(B), the probability of getting a plain cookie regardless of any information on the bowls. It is computed as the sum, over bowls, of the probability of getting a plain cookie from that bowl multiplied by the probability of selecting that bowl. We know from the problem statement that the probability of getting a plain cookie from bowl 1 is 0.75, the probability of getting one from bowl 2 is 0.5, and since Fred is treating both bowls equally the probability of selecting either of them is 0.5. Thus, the probability of getting a plain cookie overall is 0.75 × 0.5 + 0.5 × 0.5 = 0.625.
- Pr(B|A), the probability of getting a plain cookie given that Fred has selected bowl 1. From the problem statement, we know this is 0.75, since 30 out of 40 cookies in bowl 1 are plain.
- Given all this information, we can compute the probability of Fred having selected bowl 1 given that he got a plain cookie: Pr(A|B) = Pr(B|A) × Pr(A) / Pr(B) = 0.75 × 0.5 / 0.625 = 0.6.
- As we expected, it is more than half.
38. Application to the Task at Hand
- An English phrase generator produces word sequences according to probability distribution P(w).
- An English pronouncer probabilistically assigns a set of pronunciations to a word sequence, according to P(p|w).
- Given pronunciation p, find the word sequence that maximizes P(w|p).
- Based on Bayes' rule: P(w|p) = P(p|w) × P(w) / P(p).
- But P(p) will be the same regardless of the specific word sequence, so we can just search for the word sequence that maximizes P(p|w) × P(w), which are the two distributions we just modeled.
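A toy noisy-channel decoder over two candidate phrases, with made-up probabilities (the real system composes weighted automata rather than enumerating a dictionary):

```python
# Hypothetical model: P(w) over English phrases and P(p|w) for one
# katakana-derived pronunciation string; all numbers are made up.
P_w = {"ice cream": 0.7, "I scream": 0.3}
P_p_given_w = {("aisukuriimu", "ice cream"): 0.6,
               ("aisukuriimu", "I scream"): 0.4}

def decode(p):
    """Return the word sequence w maximizing P(p|w) * P(w); P(p) is a
    constant for a fixed observation p, so it can be dropped."""
    return max(P_w, key=lambda w: P_p_given_w.get((p, w), 0.0) * P_w[w])
```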
39. Five Probability Distributions
- Extending this notion, they built 5 probability distributions:
  - P(w): generates written English word sequences
  - P(e|w): pronounces English word sequences
  - P(j|e): converts English sounds into Japanese sounds
  - P(k|j): converts Japanese sounds into katakana writing
  - P(o|k): introduces misspellings caused by OCR
- These parallel the 5 steps above:
  - An English phrase is written.
  - A translator pronounces it in English.
  - The pronunciation is modified to fit the Japanese sound inventory.
  - The sounds are converted into katakana.
  - The katakana is written.
- Given a katakana string o observed by OCR, we wish to maximize P(w) × P(e|w) × P(j|e) × P(k|j) × P(o|k) over all e, j, k.
- Why? Say we have e and want to determine the most probable w given e, that is, P(w|e); we would maximize P(w) × P(e|w) / P(e).
40. Implementation of the Probability Distributions
- P(w) as a WFSA (weighted finite-state acceptor), the others as WFSTs (transducers).
- WFSA: a state-transition diagram with both symbols and weights on the transitions, such that some transitions are more likely than others.
- WFST: the same, but with both input and output symbols.
- They implemented a composition algorithm to yield P(x|z) from models P(x|y) and P(y|z), treating a WFSA simply as a WFST with identical input and output.
- This yields one large WFSA; they use Dijkstra's shortest-path algorithm to extract the most probable result.
- No pruning; they use the Viterbi approximation, searching for the best path through the WFSA rather than the best sequence.
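The composition step can be illustrated with plain conditional-probability tables; real WFST composition shares transition structure, but the probability algebra is the same: P(x|z) = Σ_y P(x|y) P(y|z).

```python
def compose(p_x_given_y, p_y_given_z):
    """Compose two conditional tables keyed by (output, input) pairs:
    P(x|z) = sum over y of P(x|y) * P(y|z). A dictionary stand-in for
    WFST composition, showing only the probability algebra."""
    out = {}
    for (y2, z), pyz in p_y_given_z.items():
        for (x, y), pxy in p_x_given_y.items():
            if y == y2:
                out[(x, z)] = out.get((x, z), 0.0) + pxy * pyz
    return out
```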
41. First Model: Word Sequences
- ice cream → ice crème → aice kreme
- A unigram scoring mechanism which multiplies the scores of known words and phrases in a sequence.
- Corpus: WSJ corpus + an online English name list + an online gazetteer of place names.
- Should really e.g. ignore auxiliaries and favor surnames; they approximate this by removing high-frequency words.
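A minimal unigram scorer of the kind described, in log space (toy probabilities; the floor value for unknown words is an assumption):

```python
import math

def unigram_score(phrase, unigram_p):
    """Score a candidate phrase as the product of unigram probabilities,
    computed in log space to avoid underflow. Words missing from the
    model get a tiny floor probability (assumed value)."""
    return sum(math.log(unigram_p.get(w, 1e-9)) for w in phrase.split())
```

Known-word candidates such as "ice cream" score above invented spellings like "aice kreme".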
42. Model 2: English Word Sequences → English Sound Sequences
- Use the English phoneme inventory from the CMU Pronouncing Dictionary, minus stress marks.
- 40 sounds: 14 vowel sounds, 25 consonant sounds (e.g. K, HH, R), plus an additional symbol, PAUSE.
- The dictionary has 100,000 (125,000) word pronunciations.
- They used the top 50,000 words because of memory limitations.
- Capital letters denote English sounds; lowercase words denote English words.
43. (No Transcript)
44. (No Transcript)
45. Example: Second WFST
- Note: why not letters instead of phonemes? Letters would not match; the Japanese transliteration follows the (mis)pronunciation, and that is modeled in the next step.
46. Model 3: English Sounds → Japanese Sounds
- An information-losing process: R, L → r; 14 vowels → 5 Japanese vowels.
- Identify the Japanese sound inventory.
- Build a WFST to perform the sequence mapping.
- The Japanese sound inventory has 39 symbols: 5 vowels, 33 consonants (including the doubled kk), and the special symbol PAUSE.
- (P R OW PAUSE S AA K ER) (pro-soccer) maps to (p u r o pause s a kk a a).
- Use machine learning to train the WFST from 8,000 pairs of English/Japanese sound sequences (for example, soccer). They created this corpus by modifying an English/katakana dictionary, converting it into these sounds, and used the EM (expectation-maximization) algorithm to generate symbol-matching probabilities. See the table on the next page.
47. (No Transcript)
48. The EM Algorithm
- Note: pays no heed to context.
49. Model 4: Japanese Sounds → Katakana
- Manually construct 2 WFSTs.
- WFST 1 just merges sequential doubled sounds into a single sound: o o → oo.
- WFST 2 just does the mapping, accounting for different spelling variations.
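The first transducer's merge of doubled sounds is a one-pass scan; a sketch as a plain function:

```python
def merge_doubled(sounds):
    """WFST 1 as a plain function (sketch): merge a repeated Japanese
    sound into its long form, e.g. ['o', 'o'] -> ['oo']."""
    out = []
    for s in sounds:
        if out and out[-1] == s:
            out[-1] = s + s          # o o -> oo
        else:
            out.append(s)
    return out
```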
50. Model 5: Katakana → OCR
51. Example
52. Transliteration of Proper Names in Cross-Language Applications
- Paola Virga, Sanjeev Khudanpur
- Johns Hopkins University
53. Abstract
- For MT and for IR, specifically cross-language IR.
- Names are important, particularly for short queries.
- Transliteration: writing a name in a foreign language, preserving the way it sounds.
  - Render the English name in phonemic form.
  - Convert the phonemic string into foreign orthography, e.g. Mandarin Chinese.
- Mentions back-transliteration for Japanese, and its application to Arabic, by Knight et al.
- For Korean, a strongly phonetic orthography allows good transliteration using simple HMMs.
- Prior work: hand-crafted rules change English spelling to accord with Mandarin syllabification, then the system learns to convert an English phoneme sequence into a Mandarin syllable sequence.
- They extend the previous work, making it fully data-driven rather than relying on hand-crafted rules, to accomplish English → Mandarin transliteration.
54. Four Steps in the Transliteration Process
- English → phonemic English (using Festival).
  - Festival: free, source available, multilingual, interfaces to shell, Scheme, Java, C, Emacs (see next page).
- English phonemes → initials and finals.
- Initial-final sequence → pin-yin symbols.
  - Wikipedia: "Pinyin is a system of romanization (phonemic notation and transcription to Roman script) for Standard Mandarin, where pin means 'spell' and yin means 'sound'."
  - Pinyin is a romanization and not an anglicization; that is, it uses Roman letters to represent sounds in Standard Mandarin. The way these letters represent sounds in Standard Mandarin will differ from how other languages that use the Roman alphabet represent sound. For example, the sounds indicated in pinyin by b and g are not as heavily voiced as in the Western use of the Latin script. Other letters, like j, q, x or zh, indicate sounds that do not correspond to any exact sound in English. Some of the transcriptions in pinyin, such as the ang ending, do not correspond to English pronunciations, either.
  - By letting Roman characters refer to specific Chinese sounds, pinyin produces a compact and accurate romanization, which is convenient for native Chinese speakers and scholars. However, it also means that a person who has not studied Chinese or the pinyin system is likely to severely mispronounce words, which is a less serious problem with some earlier romanization systems such as Wade-Giles.
  - Different from katakana.
- Pin-yin → Chinese character sequence.
- Steps 1 and 3 are deterministic; steps 2 and 4 are statistical.
55. (No Transcript)
56. Noisy Channel Model
- We had this concept before.
- Think of e as an i-word English sentence output from the noisy channel and c as a j-word Chinese sentence input into the noisy channel, except with phonemes instead of words.
- Find the most likely Chinese sentence to have generated the English output. Use Bayes' rule.
57. How to Train and Use the Transliteration System (see next slide)
58. Training
- Obtained their corpus from the authors of papers 3 and 4.
- 3,875 English names, their Chinese transliterations, and pin-yin counterparts; used Festival to generate the phonemic English, and based the pronunciation of pinyin on an initial/final inventory from a Mandarin phonology text.
- First corpus: lines 2, 3.
- Second corpus: lines 4, 5.
- Compared to paper 4, they do a more general test.
59. Spoken Document Retrieval
- Infrastructure developed at a Johns Hopkins summer workshop: Mandarin audio to be searched using English text queries.
- English proper names were unavailable in the translation lexicon, and thus ignored during retrieval.
- Adding name transliteration improved mean average precision from 0.501 to 0.515.
60. Statistical Transliteration for English-Arabic Cross Language Information Retrieval
- Nasreen AbdulJaleel, Leah Larkey
61. Overview
- For IR.
- Motivation: not just proper nouns but OOV (out of vocabulary) words generally; when a word has no corresponding word in the dictionary, simply transliterate it.
- They train an English-to-Arabic transliteration model from pairs of names.
- A selected n-gram model.
- A two-stage training process:
  - Learn which n-gram segments should be added to the unigram inventory for the source language.
  - Then learn a translation model over this inventory.
- No need for heuristics.
- No need for knowledge of either language.
62. The Problem
- OOV words are a problem in cross-language information retrieval:
  - Named entities
  - Numbers
  - Technical terms
  - Acronyms
- These compose a significant portion of OOV words, and when a named-entity translation is not available, there is a reduction in average precision of 50%.
- Variability in the spelling of foreign words, e.g. Qaddafi from before.
- It is OK to use a word's own spelling in the foreign language when the languages share the same alphabet (e.g. Italian, Spanish, German), but not when the target language has a different alphabet. Then: transliteration.
63. Multiple Spellings in Arabic
- Thus, it is useful to have a way to generate multiple spellings in Arabic from a single source.
- Use statistical transliteration to generate them: no heuristics, no linguistic knowledge.
- Statistical transliteration is a special case of statistical translation, in which the words are letters.
64. Selected N-gram Transliteration Model
- A generative statistical model, producing a string of Arabic characters from a string of English characters.
- The model is a set of conditional probability distributions over Arabic characters and NULL.
- Each English character n-gram ei can be mapped to an Arabic character or sequence of characters ai with probability P(ai|ei).
- Most probabilities are 0 in practice.
- (Probabilities of s, z, tz.)
- Also, the English source-symbol inventory has, besides unigrams (such as single letters), some end symbols and n-grams such as sh, bb, eE.
65. Training of the Model
- Trained from lists of English/Arabic name pairs.
- 2 alignment stages:
  - 1: select n-grams for the model.
  - 2: determine translation probabilities for the n-grams.
- Used GIZA for letter alignment rather than word alignment, treating letters as words.
- Corpus: 125,000 English proper nouns and their Arabic translations, retaining only those existing in an AP news article corpus.
- Some normalization: made lowercase, prefixed with B and ended with E.
- Alignment 1: align using GIZA, count instances in which an English character sequence aligned to a single Arabic character; take the top 50 of these n-grams and add them to the English symbol inventory.
- Resegment based on the new inventory, using a greedy-ish method:
  - Ashcroft → a sh c r o f t
- Alignment 2, using GIZA.
- Count up the alignments and use them as conditional probabilities, removing alignments below a probability threshold of 0.01.
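The greedy-ish resegmentation can be sketched as longest-match segmentation against the learned inventory (here only the digraph "sh" is assumed to have been selected):

```python
def segment(word, ngrams, max_n=2):
    """Greedy longest-match segmentation against the symbol inventory:
    unigrams plus the selected multi-character n-grams (e.g. 'sh')."""
    out, i = [], 0
    while i < len(word):
        for n in range(max_n, 0, -1):
            chunk = word[i:i + n]
            if chunk in ngrams or n == 1:   # unigrams always available
                out.append(chunk)
                i += n
                break
    return out
```

This reproduces the Ashcroft example from the slide.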
66. Generation of Arabic Transliterations
- Take an English word ew.
- Segment it, greedily (?), based on the n-gram inventory.
- All possible transliterations wa are generated.
- Rank them according to their probabilities, obtained by multiplying the per-segment probabilities.
- They ran experiments; improvement over a unigram-only model. Etc.
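Generation and ranking amount to taking the cross product of per-segment options and sorting by probability. The table below uses made-up probabilities and Buckwalter-ish placeholder symbols:

```python
from itertools import product

# Hypothetical per-segment options P(arabic_chunk | english_segment);
# symbols and numbers are invented for illustration.
TABLE = {"a": {"A": 0.8, "": 0.2}, "sh": {"$": 1.0}, "t": {"t": 0.9, "T": 0.1}}

def transliterations(segments):
    """Generate every possible output string and rank candidates by the
    product of per-segment probabilities, best first."""
    cands = []
    for combo in product(*(TABLE[s].items() for s in segments)):
        word = "".join(ch for ch, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        cands.append((word, prob))
    return sorted(cands, key=lambda wp: -wp[1])
```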
67. (No Transcript)