Title: Web-based Acquisition of Japanese Katakana Variants
1Web-based Acquisition of Japanese Katakana
Variants
- Hiroshi Nakagawa (University of Tokyo, Japan)
- Takeshi Masuyama (University of Tokyo, Japan; now at Yahoo! Japan)
2- We apologize for the Katakana font printing problem in the proceedings; we could not check the final printing.
- Please read the English transliterations given for the Katakana parts.
3Cooperation with
- Satoshi Sekine
- Computer Science, New York University
- And
- Language Craft Co.
4The mapping from sound to spelling differs from language to language
???,???(nyaa,nyaa) ?????????(nyah,nyah)
5History of Katakana
- Every country and every language has its own history of meanings, codes and fonts.
- Phonogram vs. ideogram
6漢字 (Hanji)
Kanji (Hanji) characters (ideograms) were imported to Japan 1300 years ago
7Almost 1000 years ago, women writers worked out phonograms (Hiragana and Katakana) from Kanji (ideograms) to express the Japanese people's mentality.
???
?(hiragana)
? (Kanji) ideogram
?(katakana) phonogram
8Modern history of Katakana
- Japanese Katakana and Hiragana have a one-to-one mapping.
- After the Meiji Restoration (1868), Japanese people used
  - Katakana to express function words
  - Hiragana to express words imported from Western countries.
- After World War II (1945), the two were exchanged.
  - Hiragana came to be used for function words such as case markers
  - Katakana came to be used for words imported from Western countries.
- Thus the majority of Katakana words are transliterations of English words.
9However,
- Japanese Katakana has only five vowels (a, i, u, e, o) and 19 consonants (k, g, s, z, j, t, d, n, h, b, m, y, r, w, c, sh, ch, ny, my).
- Pronunciations are always CV or V.
  - No CC.
  - No distinction between (b, v), (h, f), (l, r), ...
- There is no orthographic way to express English sounds with the Katakana character set.
- Thus the Japanese language accepted several Katakana spellings for one English word.
- → Katakana variants
10Transliterated into Katakana variants
- detail: ディテール (dhiteeru), ディティール (dhithiiru), ??????(dhitheeru), ディテイル (dhiteiru)
- Cameron Diaz: キャメロンディアス (kyameronndhiasu), キャメロンディアズ (kyameronndhiazu), ?????????(kyameronndhiasu)
11An example of search results: Google hits for spaghetti
- Overlap between the hit counts of distinct Katakana variants is avoided by using Google's search options (including the - exclusion option).
Katakana variant                  Hits of Google search (%)
スパゲッティ (supagettuthi)       187,000 (32.7)
スパゲッティー (supagettuthii)    57,600 (10.1)
スパゲッテイ (supagettutei)       6,850 (1.2)
スパゲティ (supagetuthi)          240,000 (41.9)
スパゲティー (supagethii)         77,400 (13.5)
スパゲテイ (supagetei)            3,800 (0.7)
total                             572,650 (100)
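As a small check of the arithmetic in the table above, the percentages can be recomputed from the raw hit counts. This is only a sketch: the dictionary keys are the romanized spellings from the table, and the function name `shares` is made up for this example.

```python
# Hit counts copied from the table above; keys are the romanized variants.
hits = {
    "supagettuthi": 187_000,
    "supagettuthii": 57_600,
    "supagettutei": 6_850,
    "supagetuthi": 240_000,
    "supagethii": 77_400,
    "supagetei": 3_800,
}


def shares(counts):
    """Return each variant's percentage of the total hit count, rounded to 0.1."""
    total = sum(counts.values())
    return {variant: round(100 * n / total, 1) for variant, n in counts.items()}


percentages = shares(hits)
for variant, percent in percentages.items():
    print(f"{variant:15s} {hits[variant]:>8,d} {percent:5.1f}")
```

The two most frequent spellings together account for roughly three quarters of the hits, which is why no single spelling can be treated as canonical.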
12A Katakana variant extraction system is needed to enhance the cross-language ability of
- Information Retrieval
- Search engine
- Machine translation
- Information Extraction
- Summarization
- Question Answering
13Previous research 1
- Manually constructed rewriting rules generate and/or extract Katakana variants from a given Katakana word (Shishibori et al., 1993, 1994; Kubota, 1994)
- Samples of rewrite rules
  - ベ (Be) ⇔ ヴェ (Ve)
  - チ (chi) ⇔ ティ (thi)
- Input: ベネチア (Benechia)
- Output: ベネティア (Benethia), ヴェネチア (Venechia), ヴェネティア (Venethia)
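A minimal sketch of how such rewrite rules could generate variants, assuming each rule may be applied in either direction and at any position. It works on the romanized strings from the slide as stand-ins for the Katakana; `REWRITE_RULES` and `generate_variants` are illustrative names, not taken from the cited systems.

```python
# Hypothetical rule table: the two strings of each pair are interchangeable.
REWRITE_RULES = [("Be", "Ve"), ("chi", "thi")]


def generate_variants(word, rules=REWRITE_RULES):
    """Return every spelling reachable from `word` by applying the rules.

    Assumes the rule set has a finite closure; a rule that keeps
    lengthening a string would make this loop forever.
    """
    seen = {word}
    queue = [word]
    while queue:
        current = queue.pop()
        for left, right in rules:
            # Try the rule in both directions.
            for source, target in ((left, right), (right, left)):
                position = current.find(source)
                while position != -1:
                    candidate = (current[:position] + target
                                 + current[position + len(source):])
                    if candidate not in seen:
                        seen.add(candidate)
                        queue.append(candidate)
                    position = current.find(source, position + 1)
    seen.discard(word)  # report only the new spellings
    return seen


variants = generate_variants("Benechia")
print(sorted(variants))
```

For the input Benechia this yields the three outputs listed on the slide. The coverage of such a generator is limited by the rules someone has written down.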
14Previous research 2
- Extract Katakana variants with weighted edit distance (Magari et al., 2004; Ohtake et al., 2004)
- Edit distance is defined as
  - the number of operations needed to transform one Katakana word into another Katakana word
  - Operations: insert, delete, replace
  - Ex. レポート (Repooto) ⇔ リポート (Ripooto) → edit dist. 1
- Weighted edit distance
  - The weight of each operation is given manually
  - Ex. weight of edit dist. (?????????) → 0.8
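The following sketch shows one way a weighted edit distance can be computed with dynamic programming. It is not the implementation of the cited papers: the function name, the default costs of 1.0, and the 0.8 weight attached to the pair (レ, リ) are assumptions chosen for illustration.

```python
def weighted_edit_distance(source, target, replace_weights=None,
                           insert_cost=1.0, delete_cost=1.0):
    """Edit distance where selected character replacements get a custom cost.

    `replace_weights` maps a pair of characters to a replacement cost;
    pairs not listed cost 1.0. Weights are treated as symmetric.
    """
    replace_weights = replace_weights or {}
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + delete_cost
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + insert_cost
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            if a == b:
                substitution = 0.0
            else:
                substitution = replace_weights.get(
                    (a, b), replace_weights.get((b, a), 1.0))
            dist[i][j] = min(dist[i - 1][j] + delete_cost,
                             dist[i][j - 1] + insert_cost,
                             dist[i - 1][j - 1] + substitution)
    return dist[-1][-1]


# Plain edit distance: one replacement (レ -> リ).
plain = weighted_edit_distance("レポート", "リポート")
# Hypothetical manual weight for that replacement.
weighted = weighted_edit_distance("レポート", "リポート", {("レ", "リ"): 0.8})
print(plain, weighted)
```

The manual part is the `replace_weights` table: every entry has to be chosen by a person.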
15Previous research 3: a more direct way
- String penalty to extract Katakana variants (Masuyama et al., 2004)
- String penalty (SP)
  - Based on weighted edit distance, but extended to treat strings of two or three characters
  - Weights are given manually to combinations of edit operations, i.e. string-replacing operations.
  - Ex. SP(ボイス (Boisu), ヴォイス (Voisu)) = 4: replace and insert
16Previous research 4: combination method (Masuyama, Nakagawa, Sekine, COLING 2004)
- Combination of string penalty and context
- String penalty (SP)
  - The SP value is given by an expert
- Similarity of the contexts in which each Katakana variant appears
  - Vector space model (calculated automatically)
  - If the words around two Katakana words are similar, the Katakana words are variants of each other
17Problems of previous research
- Low coverage
- Intensive human intellectual work is needed for
  - working out rewrite rules
  - determining the weights of weighted edit distance
  - determining the string penalty value of each pair of Katakana strings
- Dependence on the specific corpus used to calculate the weights of
  - weighted edit distance
  - string penalty
18Purpose of this work
- The problems of a manually given string penalty
  - Labor intensive (even in the combination of SP and context)
  - Low coverage
Determine the string penalty mechanically and automatically build the Katakana variants for each Katakana word
19Calculating string penalties mechanically
For this, we need an accurate, high-quality Katakana variants database!
21Process
- English words and their Katakana variants: (idea, アイデア), (report, レポート)
- Web search (WWW) finds pages such as: レポート report リポート
- Pairs of variant candidates: (レポート repooto, リポート ripooto), (レポート repooto, サポート sapooto)
- Pairs of variants: (レポート repooto, リポート ripooto), (レファレンス refarensu, リファレンス rifarensu), (アーキテクト aakitekuto, アーキテクツ aakitekutu)
- String penalty: レ (re) ⇔ リ (ri): 1; ト (to) ⇔ ッ (ttu): 3
22How to find candidates of Katakana variant pairs (1/3)
- To collect English words and their Katakana variants, e.g. (vodka, ウォッカ),
- we used four Web sites from which we collected a number of English words and their Japanese translations.
  - http://homepage2.nifty.com/katakanaEnglish/
  - http://www.hoshi.cis.ibaraki.ac.jp/usefull/usefull15.html
  - http://ke.ics.saitama-u.ac.jp/jsgs/keywords.html
  - http://smalltown.ne.jp/uasa/pub/distfiles/skk-extra-200307/SKK-JISYO.edit
- This gave 14,958 distinct pairs of English words and their Katakana translations.
23How to find candidates of Katakana variant pairs (2/3)
- Extract many English words and their Katakana variants
  - 14,958 English-Katakana pairs
- To collect more Katakana variants for each English word, we use Google search to get pages that include both the English word and the Katakana word of its translation. Two queries are used:
  - the English word (search language: Japanese)
  - the English word plus 英和 (English to Japanese), in order to find English-Japanese dictionary sites
- Gather Katakana words from the search results
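The last step, gathering Katakana words from retrieved text, could be sketched as below. The slides do not say how the words were extracted; this version simply assumes that a Katakana word is a maximal run of characters from the Unicode Katakana block, and the names `KATAKANA_RUN` and `katakana_words` are made up for the example.

```python
import re

# Katakana letters U+30A1..U+30FA plus the prolonged sound mark U+30FC.
KATAKANA_RUN = re.compile(r"[\u30A1-\u30FA\u30FC]+")


def katakana_words(text, min_length=2):
    """Return the maximal Katakana runs in `text`, in order of appearance.

    Runs shorter than `min_length` are dropped (an assumed noise filter).
    """
    return [run for run in KATAKANA_RUN.findall(text) if len(run) >= min_length]


# A made-up snippet standing in for a search result page.
snippet = "ウォッカ (vodka) はロシアの酒。ウオッカとも書く。"
print(katakana_words(snippet))
```

A run-based extractor also picks up unrelated Katakana words on the page (ロシア here), so a later filtering step is still needed.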
24Google search with the English word vodka among pages written in Japanese
Query: vodka
25Add the query term 英和 (English-Japanese) and search with Google
Query: 英和 (e-j) vodka
26Process
- English words and their Katakana variants: (idea, アイデア), (report, レポート)
- Web search (WWW) with the query 英和 (e-j) report finds pages such as: レポート report リポート
- Edit dist. 1 → pairs of variant candidates: (レポート repooto, リポート ripooto), (レポート repooto, サポート sapooto)
- Pairs of variants: (レポート repooto, リポート ripooto), (レファレンス refarensu, リファレンス rifarensu), (アーキテクト aakitekuto, アーキテクツ aakitekutu)
- String penalty: レ (re) ⇔ リ (ri): 1; ト (to) ⇔ ッ (ttu): 3
27How to find candidates of Katakana variant pairs (3/3)
- Extract promising candidates: pairs of Katakana words whose edit distance is 1 are treated as candidate Katakana variants
- Ex. (vodka, ウォッカ)
  - (ウォッカ Uottuka, ウォトカ Uotoka)
  - (ウォッカ Uottuka, ウオッカ UOttuka)
  - (ウォッカ Uottuka, ヴォッカ Vuottuka)
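A minimal sketch of this filtering step: compute the plain (unweighted) edit distance over Katakana characters and keep the pairs at distance exactly 1. The word list reuses the vodka spellings romanized on the slide as Uottuka, Uotoka, UOttuka and Vuottuka, plus サポート (sapooto) as an unrelated word; the function names are illustrative.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance with unit costs, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def candidate_pairs(words):
    """Return all unordered pairs of words whose edit distance is exactly 1."""
    return [(a, b) for a, b in combinations(words, 2)
            if edit_distance(a, b) == 1]


words = ["ウォッカ", "ウォトカ", "ウオッカ", "ヴォッカ", "サポート"]
pairs = candidate_pairs(words)
print(pairs)
```

The distance has to be taken over Katakana characters rather than the romanizations: Uottuka and Uotoka differ by two letters but by only one Katakana character.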
28Process
- English words and their Katakana variants: (idea, アイデア), (report, レポート)
- Web search (WWW) by the query 英和 (e-j) report finds pages such as: レポート report リポート
- Edit dist. 1 → pairs of variant candidates: (レポート repooto, リポート ripooto), (レポート repooto, サポート sapooto)
- Cosine sim. > 0.00006 → pairs of variant candidates by context: (レポート repooto, リポート ripooto), (レファレンス refarensu, リファレンス rifarensu), (アーキテクト aakitekuto, アーキテクツ aakitekutu)
- String penalty: レ (re) ⇔ リ (ri): 1; ト (to) ⇔ ッ (ttu): 3
29How to extract the documents in which context similarity is calculated
- Run a Google search with a query consisting of a Katakana word that is a candidate Katakana variant.
- Extract the context of the Katakana variant from the search result pages.
30Search Vodka with Google
31- Calculate the context similarity of a candidate Katakana variant pair
  - "drink vodka (Vuottuka) with a main dish and plate of caviar in the restaurants"
  - "eat some main dish plate after vodka (Uotoka) in that restaurants"
  - cosine similarity between the two contexts
- The 50 words around a candidate Katakana variant are used as its context
- Identify and extract the pair as Katakana variants if the cosine similarity is greater than the threshold of 0.00006.
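The context comparison can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the window is taken as 50 tokens on each side of the target (the slides only say "50 words around"), word weights are log(freq + 1), and the two example sentences are tokenized by whitespace with the romanized spellings standing in for the Katakana words.

```python
import math
from collections import Counter


def context_vector(tokens, target, window=50):
    """Weight each word near `target` by log(freq + 1).

    Counts words within `window` tokens on either side of every
    occurrence of `target`; the target itself is not counted.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return {word: math.log(freq + 1) for word, freq in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


text_a = "drink Vuottuka with a main dish and plate of caviar in the restaurants"
text_b = "eat some main dish plate after Uotoka in that restaurants"
similarity = cosine(context_vector(text_a.split(), "Vuottuka"),
                    context_vector(text_b.split(), "Uotoka"))
is_variant = similarity > 0.00006  # threshold quoted on the slide
print(similarity, is_variant)
```

The two toy contexts share five words (main, dish, plate, in, restaurants), so the pair clears the threshold easily. Real contexts collected from many result pages are far sparser, which is consistent with such a small threshold.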
32Detail of context similarity calculation
- Context: the 50 words around the Katakana word
- Weight of word t in the context: $\log(\mathrm{freq}(t) + 1)$
- Context similarity: cosine
- Selection from the candidates by cosine similarity > 0.00006 (threshold)
- Threshold optimization: $\text{threshold} = \arg\max_{\text{threshold}} \text{F-value}$, evaluated on positive pairs (347 pairs) and negative pairs (111 pairs)
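The threshold optimization can be sketched as a simple sweep: for each candidate threshold, compute the F-value on labeled positive and negative pairs and keep the best one. The similarity scores and candidate thresholds below are invented toy values, not the 347 positive and 111 negative pairs of the study, and the function names are illustrative.

```python
def f_value(threshold, positives, negatives):
    """F-value when every pair scoring above `threshold` is accepted."""
    true_pos = sum(score > threshold for score in positives)
    false_pos = sum(score > threshold for score in negatives)
    false_neg = len(positives) - true_pos
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def best_threshold(positives, negatives, candidates):
    """Return the candidate threshold with the highest F-value."""
    return max(candidates, key=lambda t: f_value(t, positives, negatives))


# Toy cosine similarities for known variant pairs and known non-variants.
positive_scores = [0.9, 0.4, 0.2, 0.05]
negative_scores = [0.3, 0.01, 0.0]
candidates = [0.0, 0.02, 0.1, 0.35]
chosen = best_threshold(positive_scores, negative_scores, candidates)
print(chosen, f_value(chosen, positive_scores, negative_scores))
```

On this toy data a low threshold wins because it keeps every true variant while rejecting most non-variants, which mirrors the recall-oriented setting reported on the slide.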
33Results of context similarity vs cosine threshold
34Process
- English words and their Katakana variants: (idea, アイデア), (report, レポート)
- Web search (WWW) by the query 英和 (e-j) report finds pages such as: レポート report リポート
- Edit dist. 1 → pairs of variant candidates: (レポート repooto, リポート ripooto), (レポート repooto, サポート sapooto)
- Cosine sim. > 0.00006 → pairs of variants: (レポート repooto, リポート ripooto), (レファレンス refarensu, リファレンス rifarensu), (アーキテクト aakitekuto, アーキテクツ aakitekutu)
- String penalty: レ (re) ⇔ リ (ri): SP 1; ト (to) ⇔ ッ (ttu): SP 3
The next step is to calculate SP based on statistics.
35Second stage: Calculation of string penalty (SP)
- String penalty of the operation x ⇔ y (x is replaced with y)
- We focus on
  - the high correlation between replaced strings and their character context, which is composed of several characters around the target string.
- Example: (???????????????) (?????????????) (?????????)
  - → replacing イ (I) with ィ (i) co-occurs with ウ (U) and ン (n)
36Character-level contexts CLC1..CLC5 used to calculate SP
- x: target character
- α, β, γ, δ: characters around x
CLC    String context around x   Description
CLC1   α β x                     preceding two characters of x
CLC2   β x                       preceding one character of x
CLC3   x γ                       succeeding one character of x
CLC4   x γ δ                     succeeding two characters of x
CLC5   β x γ                     preceding and succeeding characters of x
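The five contexts in the table can be read off a string mechanically. The sketch below represents each context as a (left, right) pair of neighbouring characters around the target; padding word boundaries with "#" is an assumption of this example, as are the function and key names.

```python
def character_contexts(chars, start, length=1, pad="#"):
    """Return CLC1..CLC5 for the target chars[start:start + length].

    Each context is a (left, right) tuple of the characters that
    precede and follow the target; `pad` fills in beyond word edges.
    """
    padded = pad * 2 + chars + pad * 2
    begin = start + 2          # target position inside the padded string
    end = begin + length
    alpha, beta = padded[begin - 2], padded[begin - 1]
    gamma, delta = padded[end], padded[end + 1]
    return {
        "CLC1": (alpha + beta, ""),   # preceding two characters
        "CLC2": (beta, ""),           # preceding one character
        "CLC3": ("", gamma),          # succeeding one character
        "CLC4": ("", gamma + delta),  # succeeding two characters
        "CLC5": (beta, gamma),        # one character on each side
    }


contexts = character_contexts("abxcd", 2)
print(contexts)
```

With Katakana input the same function applies unchanged, since Python strings index by character.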
37Calculation of string penalty (SP)
- i = 1, 2, 3, 4, 5
- f(CLCi): frequency of pairs in which CLCi occurs
- f(CLCi, x ⇔ y): frequency of pairs in which both CLCi and x ⇔ y occur
38Calculation of string penalty (SP)
Identify the character context CLCi which most probably co-occurs with the operation x ⇔ y.
Then
$\text{Rank of occurrence} = C \cdot (\text{Prob. of occurrence})^{-1}$ (Zipf's law)
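The exact SP formula did not survive in this transcript, so the following is only a sketch of the first step under an explicit assumption: the strength with which context CLCi co-occurs with an operation x ⇔ y is estimated by the ratio f(CLCi, x ⇔ y) / f(CLCi), and the context with the largest ratio is selected. The counts are invented, and the mapping from this value to an integer penalty is deliberately left out.

```python
def best_context(f_context, f_joint):
    """Pick the context index whose co-occurrence ratio is largest.

    f_context[i] plays the role of f(CLCi); f_joint[i] plays the role of
    f(CLCi, x <=> y). The ratio itself is an assumption of this sketch.
    """
    ratios = {i: f_joint.get(i, 0) / f_context[i]
              for i in f_context if f_context[i] > 0}
    best = max(ratios, key=ratios.get)
    return best, ratios[best]


# Hypothetical frequencies for one operation and its five contexts.
f_context = {1: 12, 2: 40, 3: 25, 4: 10, 5: 8}
f_joint = {1: 6, 2: 10, 3: 5, 4: 4, 5: 6}
index, ratio = best_context(f_context, f_joint)
print(index, ratio)
```

In this toy case CLC5, the context with one character on each side, is the most reliable predictor of the operation.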
39Examples of string penalties
Operation                              SP   Example
Insertion and deletion of              1    ?????????????
Insertion and deletion of macron ー    1    ??????????
Replace オ (O) and ォ (o)              1    ?????????
Replace グ (gu) and ク (ku)            2    ???????
Replace ヴ (vu) and ブ (bu)            2    ???????????
Replace ヴ (vu) and ウ (U)             3    ?????????
40Comparison of SP by hand and SP by the proposed method
- SP by hand, as proposed by Masuyama et al. (2004)
  - An expert worked out the SP values by hand
- Gold standard Katakana variants
  - 682 pairs of Katakana variant candidates extracted from a newspaper corpus, whose string penalties are between 1 and 12
  - We found no correct variants whose SPs are between 10 and 12. Thus, the above gold standard probably covers all correct variants.
41Comparison of SPs
Each cell: correct variants / candidate pairs with that SP (%)
SP   SP by hand        SP by proposed mechanical method
1    216/221 (97.7)    262/286 (91.6)
2    162/207 (78.3)    133/148 (89.9)
3    70/99 (70.7)      51/90 (56.7)
4    2/14 (14.3)       2/26 (7.7)
5    0/29 (0.0)        0/16 (0.0)
6    0/13 (0.0)        2/34 (5.9)
7    1/20 (5.0)        1/39 (2.6)
8    0/13 (0.0)        1/15 (6.7)
9    1/12 (8.3)        0/8 (0.0)
10   0/16 (0.0)        0/5 (0.0)
11   0/17 (0.0)        0/12 (0.0)
12   0/21 (0.0)        0/3 (0.0)
42Comparison of SPs: correlation
Rows: SP by hand; columns: SP by proposed mechanical method
         1    2    3    4    5    6    7    8    9   10   11   12  Total
1      207    7    3    2    0    1    1    0    0    0    0    0    221
2       20  123   59    2    1    1    1    0    0    0    0    0    207
3       59   11   20    3    2    3    1    0    0    0    0    0     99
4        0    2    3    2    2    0    4    0    0    1    0    0     14
5        0    2    2    6    3    4    5    3    1    0    2    1     29
6        0    0    0    3    1    1    2    0    3    1    2    0     13
7        0    0    1    3    2    2    2    4    1    1    3    1     20
8        0    1    0    0    0    4    6    1    0    0    1    0     13
9        0    1    0    0    0    0    2    3    1    1    4    0     12
10       0    1    1    5    0    0    4    2    1    1    0    1     16
11       0    0    1    0    2   13    0    0    1    0    0    0     17
12       0    0    0    0    3    5   11    2    0    0    0    0     21
Total  286  148   90   26   16   34   39   15    8    5   12    3    682
Correlation: 0.76
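The slide does not say which correlation coefficient was used. Assuming it is the Pearson correlation between the two SP values over all 682 pairs, it can be recomputed directly from the contingency table above, and on these counts it comes out close to the reported 0.76. The variable and function names are illustrative.

```python
import math

# Rows: SP by hand (1..12); columns: SP by the mechanical method (1..12).
table = [
    [207, 7, 3, 2, 0, 1, 1, 0, 0, 0, 0, 0],
    [20, 123, 59, 2, 1, 1, 1, 0, 0, 0, 0, 0],
    [59, 11, 20, 3, 2, 3, 1, 0, 0, 0, 0, 0],
    [0, 2, 3, 2, 2, 0, 4, 0, 0, 1, 0, 0],
    [0, 2, 2, 6, 3, 4, 5, 3, 1, 0, 2, 1],
    [0, 0, 0, 3, 1, 1, 2, 0, 3, 1, 2, 0],
    [0, 0, 1, 3, 2, 2, 2, 4, 1, 1, 3, 1],
    [0, 1, 0, 0, 0, 4, 6, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 2, 3, 1, 1, 4, 0],
    [0, 1, 1, 5, 0, 0, 4, 2, 1, 1, 0, 1],
    [0, 0, 1, 0, 2, 13, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 3, 5, 11, 2, 0, 0, 0, 0],
]


def pearson_from_table(counts):
    """Pearson correlation of (row index, column index) weighted by counts."""
    n = sum(sum(row) for row in counts)
    sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0
    for x, row in enumerate(counts, start=1):
        for y, count in enumerate(row, start=1):
            sum_x += x * count
            sum_y += y * count
            sum_xx += x * x * count
            sum_yy += y * y * count
            sum_xy += x * y * count
    covariance = sum_xy / n - (sum_x / n) * (sum_y / n)
    variance_x = sum_xx / n - (sum_x / n) ** 2
    variance_y = sum_yy / n - (sum_y / n) ** 2
    return covariance / math.sqrt(variance_x * variance_y)


correlation = pearson_from_table(table)
print(round(correlation, 2))
```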
43Building the Katakana variants DB automatically
44Summary of comparison, and what is next?
- COLING 2004: SP by hand + context similarity → extracted variants: accurate
- SIGIR 2005: SP by mechanical method + context similarity → extracted variants: accurate?
- Correlation between SP by hand and SP by mechanical method: 0.76
48Variants DB
- Newspaper corpus
- Extract Katakana words: ???? ???? ???? ????
- Candidates of Katakana variants: (????,????)(????,????)(????,????)
- Keep pairs with SP ≤ 3 → candidates of Katakana variants: (????,????)(????,????)
- Keep pairs with context similarity ≥ 0.005 (optimized threshold) → Katakana variants DB: (????,????)
49Comparison of variants DB
SP ≤ 3, context similarity ≥ 0.05
             SP by hand (expert)   SP by the proposed mechanical method
Recall       417/420 (99.3)        415/420 (98.8)
Precision    417/480 (86.9)        415/480 (86.5)
F-value      92.7                  92.2
cf. The whole DB contains 3 million Katakana variants for 1 million distinct Katakana words.
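The figures in the table follow from three counts per method: correctly extracted variants, gold-standard variants (420) and extracted pairs (480). A short sketch, with an illustrative function name, that recomputes them using the usual harmonic-mean definition of the F-value:

```python
def precision_recall_f(correct, gold, extracted):
    """Return (precision, recall, F-value) as fractions between 0 and 1."""
    precision = correct / extracted
    recall = correct / gold
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


by_hand = precision_recall_f(correct=417, gold=420, extracted=480)
mechanical = precision_recall_f(correct=415, gold=420, extracted=480)
for name, (p, r, f) in (("by hand", by_hand), ("mechanical", mechanical)):
    print(f"{name:10s} precision {100 * p:.1f}  recall {100 * r:.1f}  F {100 * f:.1f}")
```

The two methods differ by only two correct pairs out of 420, which is the basis for the claim that the mechanical SP is almost as accurate as the hand-made one.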
50Conclusions
- Mechanical method of calculating SP
  - Uses a Web search engine to extract variant candidates
  - SP is calculated from character context
  - Almost the same accuracy as SP given by hand by an expert
- Katakana variants DB with SP by the mechanical method
  - recall: 98.8
  - precision: 86.5
  - F-value: 92.2
51Future of our research
- Other languages, such as German
  - Arbeit → アルバイト
- Application of our methodology (Web resources and a statistical string penalty) to other language pairs.
  - Londres ⇔ London
  - München ⇔ Munich
- Our hope is a cross-language automatic spelling variant generator for any language pair, based on the proposed method.
52- Thank you!
  - サンキュー (sankyuh)
  - サンキュウ (sankyuu)
- Questions or comments are welcome.
53Error analysis
- grizzly bear
  - ???????? (gurihzurihbea) vs. ???????? (gurihzurih bea)
  - → are not regarded as variants
  - one appears in contexts about the animal, the other in contexts about Norman Schwarzkopf
  - totally different contexts!
- sign pole vs. sign ball
  - サインポール (sainpohru) vs. サインボール (sainbohru)
  - are regarded as variants
  - barber shop vs. baseball
  - → customer, shop, sales (very similar contexts)
54The threshold of SP vs. F-value
55cosine similarity vs. F-value
56If you search for some Katakana variant with Google
Katakana variant                  Found or not
スパゲッティ (spaghetti)          found
スパゲッティー (supagettuthii)    found
スパゲッテイ (supagettutei)       not found
スパゲティ (supagetuthi)          found
スパゲティー (supagethii)         found
スパゲテイ (supagetei)            not found
57How to find candidates of Katakana
- How to extract the documents in which context similarity is calculated
  - Run a Google search with a query consisting of a Katakana word that is a candidate Katakana variant.
  - Extract the context of the Katakana variant from the search result pages and calculate context similarity to identify Katakana variants.
58Example of similarity calculation
- (ウォッカ Uottuka, ウォトカ Uotoka)
  - ウォッカ: liquor 1.1, strong 1.4, alcohol 1.6, western liquor 0.7, ...
  - ウォトカ: liquor 0.7, strong 0.7, alcohol 3.4, western liquor 1.4, ...