Web-based Acquisition of Japanese Katakana Variants - PowerPoint PPT Presentation

1
Web-based Acquisition of Japanese Katakana
Variants
  • Hiroshi Nakagawa (University of Tokyo, Japan)
  • Takeshi Masuyama (University of Tokyo , Japan)

(Now Yahoo! Japan)
2
  • We apologize for the Katakana font printing problem in
    the proceedings. We could not check the final
    printing.
  • Please read the English transliterations of the Katakana
    parts.

3
Cooperation with
  • Satoshi Sekine
  • Computer Science, New York University
  • And
  • Language Craft Co.

4
The mapping from sound to spelling differs from
language to language
???,???(nyaa,nyaa) ?????????(nyah,nyah)
5
History of Katakana
  • Every country and every language has its own
    history of meanings, codes, and fonts.
  • Phonogram vs. Ideogram

6
漢字 (Hanji)
Kanji (Hanji) characters (ideograms) were imported to
Japan 1,300 years ago
7
Almost 1,000 years ago, women writers worked out
phonograms (Hiragana and Katakana) from Kanji
(ideograms) to express the Japanese people's
mentality.
???
?(hiragana)
? (Kanji) ideogram
?(katakana) phonogram
8
Modern history of Katakana
  • Japanese Katakana and Hiragana have a one-to-one
    mapping.
  • After the Meiji Restoration (1868), Japanese people
    used
  • Katakana to express function words
  • Hiragana to express words imported from Western
    countries.
  • After World War II (1945), the two roles were exchanged.
  • Hiragana came to be used for function words
    such as case markers
  • Katakana came to be used for words imported
    from Western countries.
  • Thus the majority of Katakana words are
    transliterations of English words.

9
However,
  • Japanese Katakana has only five vowels
    (a, i, u, e, o) and 19 consonants (k, g, s, z, j, t, d, n, h, b,
    m, y, r, w, c, sh, ch, ny, my).
  • Pronunciations are always CV or V.
  • No CC.
  • No distinction between (b, v), (h, f), (l, r), ...
  • There is no orthographic way to express English
    sounds with the Katakana character set.
  • Thus the Japanese language has accepted several Katakana
    spellings for one English word.
  • → Katakana variants

10
?????(dhiteeru) ??????(dhithiiru) ??????(dhitheeru)
?????(dhiteiru)
detail
Transliterated into Katakana variants
?????????(kyameronndhiasu) ?????????(kyameronndhiazu)
?????????(kyameronndhiasu)
Cameron Diaz
11
Example of search results: Google hits for "spaghetti"
  • Overlap between distinct Katakana variants is avoided
    by using the + and - search options.

Katakana variant            Google hits   (%)
??????(supagettuthi)        187,000       (32.7)
???????(supagettuthii)      57,600        (10.1)
??????(supagettutei)        6,850         (1.2)
?????(supagetuthi)          240,000       (41.9)
??????(supagethii)          77,400        (13.5)
?????(supagetei)            3,800         (0.7)
total                       572,650       (100)
12
A Katakana variant extraction system is needed to
enhance the cross-language ability of
  • Information Retrieval
  • Search engine
  • Machine translation
  • Information Extraction
  • Summarization
  • Question Answering

13
Previous research 1
  • Manually constructed rewriting rules to generate
    and/or extract Katakana variants from a given
    Katakana word (Shishibori et al., 1993, 1994;
    Kubota, 1994)
  • Samples of rewrite rules
  • ベ (Be) → ヴェ (Ve)
  • チ (chi) → ティ (thi)
  • Input: ベネチア (Benechia)
  • Output: ベネティア (Benethia)
  • ヴェネチア (Venechia)
  • ヴェネティア (Venethia)
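
The rule application above can be sketched in a few lines of Python. This is only an illustration, not the cited authors' implementation: the Katakana strings are reconstructed from the romanizations on the slide, and the names `RULES` and `generate_variants` are hypothetical. Each rule is applied to every occurrence of its left-hand side, which is a simplifying assumption.

```python
from itertools import combinations

# Sample rewrite rules from the slide (Katakana reconstructed from the romanization).
RULES = [("ベ", "ヴェ"), ("チ", "ティ")]

def generate_variants(word, rules=RULES):
    """Return every spelling obtained by applying a non-empty subset of the rules."""
    variants = set()
    for size in range(1, len(rules) + 1):
        for subset in combinations(rules, size):
            candidate = word
            for source, target in subset:
                # Simplification: rewrite all occurrences of `source` at once.
                candidate = candidate.replace(source, target)
            if candidate != word:
                variants.add(candidate)
    return variants

variants = generate_variants("ベネチア")  # Benechia
```

For the input Benechia, the two rules yield the three outputs listed on the slide (Benethia, Venechia, Venethia).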

14
Previous research 2
  • Extract Katakana variants with weighted edit
    distance (Magari et al., 2004; Ohtake et al., 2004)
  • Edit distance is defined as
  • The number of operations needed to transform one Katakana
    word into another Katakana word
  • Operations: insert, delete, replace
  • Ex. レポート (Repooto) → リポート (Ripooto): edit distance 1
  • Weighted edit distance
  • The weight of each operation is given manually
  • Ex Weight of edit dist. (?????????) ? 0.8
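
A minimal sketch of weighted edit distance, assuming a standard dynamic-programming formulation. The function name `weighted_edit_distance`, the symmetric lookup of substitution weights, and the example weight for the レ/リ pair are assumptions made for illustration; the slide does not say which pair the 0.8 weight belongs to.

```python
def weighted_edit_distance(a, b, sub_cost=None, indel_cost=1.0):
    """Edit distance where selected substitutions carry a manually given weight."""
    sub_cost = sub_cost or {}
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + indel_cost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            x, y = a[i - 1], b[j - 1]
            if x == y:
                replace = 0.0
            else:
                # Look the pair up in either order; unlisted pairs cost 1.0.
                replace = sub_cost.get((x, y), sub_cost.get((y, x), 1.0))
            d[i][j] = min(d[i - 1][j] + indel_cost,      # delete x
                          d[i][j - 1] + indel_cost,      # insert y
                          d[i - 1][j - 1] + replace)     # replace x with y
    return d[m][n]

plain = weighted_edit_distance("レポート", "リポート")
weighted = weighted_edit_distance("レポート", "リポート", {("レ", "リ"): 0.8})
```

With no weights the Repooto/Ripooto pair has distance 1, matching the slide; a hypothetical weight below 1 lowers the cost of that particular replacement.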

15
Previous research 3: a more direct way
  • String penalty to extract Katakana variants
    (Masuyama et al., 2004)
  • String penalty (SP)
  • Based on weighted edit distance, but extended to
    treat strings of two or three characters
  • Weights are given manually to combinations of edit
    operations (string-replacing operations).
  • Ex. SP(ボイス, ヴォイス) = 4 (replace and insert)
  • Boisu, Voisu

16
Previous research 4: Combination
method (Masuyama, Nakagawa, Sekine, COLING 2004)
  • Combination of string penalty and context
  • String penalty (SP)
  • The SP value is given by an expert
  • Similarity of the contexts in which each Katakana
    variant appears
  • Vector space model (automatically calculated)
  • If the words around two Katakana words are similar,
    then the Katakana words are variants of each other

17
Problems of previous research
  • Low coverage
  • Needs intensive intellectual human work for
  • Working out rewrite rules
  • Determining the weights of weighted edit distance
  • Determining the string penalty value of each
    Katakana string pair
  • Depends on the specific corpus used to
    calculate the weights of
  • weighted edit distance
  • string penalty

18
Purpose of this work
  • The problems of manually given string penalties
  • Labor intensive (even in the combination of SP and
    context)
  • Low coverage

Determine string penalties mechanically and
automatically build Katakana variants for
each Katakana word
19
Calculating string penalties mechanically
For this, we need an accurate, high-quality
Katakana variants database!
20
Process
(idea, ????)(report, ????)
English word and its Katakana variant
???? report ????
WWW
(????repooto,????ripooto)(????repooto,????
sapooto)
Pairs of variant candidates
(????repooto,????repooto)(??????refarensu,
??????rifarenssu) (??????aakitekuto,
??????aakitekutu)
Pairs of variant
String Penalty
?re??ri 1?to??ttu 3
21
Process
(idea, ????)(report, ????)
English word and its Katakana variant
???? report ????
WWW
Web search by
(????repooto,????ripooto)(????repooto,????
sapooto)
Pairs of variant candidates
(????repooto,????repooto)(??????refarensu,
??????rifarenssu) (??????aakitekuto,
??????aakitekutu)
Pairs of variant
String Penalty
?re??ri 1?to??ttu 3
22
How to find candidates of Katakana variant pairs
(1/3)
  • To collect English words and their Katakana
    variants, e.g. (vodka ????)
  • we used four Web sites from which we collected a number
    of English words and their Japanese translations.
  • http://homepage2.nifty.com/katakanaEnglish/
  • http://www.hoshi.cis.ibaraki.ac.jp/usefull/usefull15.html
  • http://ke.ics.saitama-u.ac.jp/jsgs/keywords.html
  • http://smalltown.ne.jp/uasa/pub/distfiles/skk-extra-200307/SKK-JISYO.edit
  • This yielded 14,958 distinct pairs of English words and their
    Katakana translations.

23
How to find candidates of Katakana variant pairs
(2/3)
  • Extract many pairs of an English word and its Katakana
    variant
  • 14,958 English-Katakana pairs
  • To collect more Katakana variants for each
    English word, we use Google search to get pages
    that include the English word and the Katakana word that
    is its translation
  • Query: English word (language: Japanese)
  • Query: English word + ???? (English to Japanese), in
    order to search English-Japanese dictionary sites
  • Gather Katakana words from the search results
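
The last step, gathering Katakana words from retrieved page text, might look like the sketch below. It assumes that a "Katakana word" is a maximal run of characters from the Unicode Katakana block (U+30A1 to U+30FA) plus the long-vowel mark (U+30FC); the slides do not state how words were actually delimited, and `extract_katakana_words` is a hypothetical name. Fetching the search results themselves is left out.

```python
import re

# Maximal runs of Katakana letters and the prolonged sound mark "ー".
KATAKANA_WORD = re.compile(r"[\u30A1-\u30FA\u30FC]+")

def extract_katakana_words(text):
    """Return the set of distinct Katakana strings found in `text`."""
    return set(KATAKANA_WORD.findall(text))

page_text = "vodka ウォッカ とは ウオッカ (酒) ウォッカ"
words = extract_katakana_words(page_text)
```

Returning a set removes repeated occurrences, so each spelling is kept once as a variant candidate.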

24
Google search with the English word "vodka" among
pages written in Japanese
vodka
25
Add the query term ???? (English-Japanese) and search with
Google
?? (e-j) vodka
26
Process
(idea, ????)(report, ????)
English word and its Katakana variant
Web search ??(e-j) report
???? report ????
WWW
(????repooto,????ripooto)(????repooto,????
sapooto)
Pairs of variant candidates
Edit dist. 1
(????repooto,????repooto)(??????refarensu,
??????rifarenssu) (??????aakitekuto,
??????aakitekutu)
Pairs of variant
String Penalty
?re??ri 1?to??ttu 3
27
How to find candidates of Katakana variant pairs
(3/3)
  • Extract Katakana word pairs whose edit distance is 1
    as promising candidates for Katakana variants
  • Ex. (vodka ウォッカ)
  • (ウォッカ Uottuka, ウォトカ Uotoka)
  • (ウォッカ Uottuka, ウオッカ UOttuka)
  • (ウォッカ Uottuka, ヴォッカ Vuottuka)
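
The edit-distance-1 filter can be sketched as follows. The Katakana spellings are reconstructed from the romanizations and are an assumption, as are the function names `is_edit_distance_one` and `candidate_pairs`; distance is counted per character with insert, delete, and replace as the only operations.

```python
from itertools import combinations

def is_edit_distance_one(a, b):
    """True if exactly one insertion, deletion, or replacement turns a into b."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # ensure len(a) <= len(b)
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1                           # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # one replacement at position i
    return a[i:] == b[i + 1:]            # one extra character in b at position i

def candidate_pairs(words):
    """All unordered pairs of distinct words at edit distance 1."""
    return [(a, b) for a, b in combinations(sorted(set(words)), 2)
            if is_edit_distance_one(a, b)]

pairs = candidate_pairs(["ウォッカ", "ウォトカ", "ウオッカ", "ヴォッカ"])
```

With these four spellings, only the three pairs that contain ウォッカ qualify; the other spellings differ from each other in two characters.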

28
Process
(idea, ????)(report, ????)
English word and its Katakana variant
Web search by??(e-j) report
???? report ????
WWW
(????repooto,????ripooto)(????repooto,????
sapooto)
Pairs of variant candidates
Edit dist. 1
(????repooto,????repooto)(??????refarensu,
??????rifarenssu) (??????aakitekuto,
??????aakitekutu)
cosine similarity > 0.00006
Pairs of variant candidates selected by context
?re??ri 1?to??ttu 3
String Penalty
29
How to obtain the documents on which context
similarity is calculated
  • Google search with a query consisting of a Katakana word that
    is a Katakana variant candidate.
  • Extract the context of the Katakana variant from the
    search result pages.

30
Search Vodka with Google
31
  • Calculate the context similarity of a candidate
    Katakana variant pair
  • "drink vodka (Vuottuka) with a main dish and plate
    of caviar in the restaurants"

  • cosine similarity
  • "eat some main dish plate after vodka (Uotoka) in
    that restaurants"
  • The 50 words around a Katakana variant candidate
    are used as its context
  • Identify and extract Katakana variants if the cosine
    similarity is greater than the threshold of
    0.00006.

32
Detail of context similarity calculation
  • Context: 50 words around the Katakana word
  • Weight of word t in the context:
  • log(freq(t) + 1)
  • Context similarity: cosine
  • Selection from candidates by
  • cosine similarity > 0.00006 (threshold)
  • Threshold optimization:
  • argmax of F-value
  • over the threshold
  • on positive pairs (347 pairs) and negative
    pairs (111 pairs)
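
The context-similarity step and the threshold selection can be sketched as below. Several points are assumptions rather than facts from the slides: the weight is read as log(freq(t) + 1), the 50-word window is taken on each side of the target, tokens are whitespace-separated, and the names `context_vector`, `cosine`, `f_value`, and `best_threshold` are hypothetical.

```python
import math
from collections import Counter

def context_vector(tokens, target, window=50):
    """Weight every word t within `window` tokens of `target` by log(freq(t) + 1)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            counts.update(tokens[max(0, i - window):i])       # left context
            counts.update(tokens[i + 1:i + 1 + window])       # right context
    return {t: math.log(f + 1) for t, f in counts.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def f_value(tp, fp, fn):
    """Harmonic mean of precision and recall (0.0 when nothing is correct)."""
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(pos_scores, neg_scores, candidates):
    """Pick the candidate threshold that maximizes the F-value."""
    def f_at(th):
        tp = sum(s > th for s in pos_scores)
        fp = sum(s > th for s in neg_scores)
        return f_value(tp, fp, len(pos_scores) - tp)
    return max(candidates, key=f_at)

a = "drink Vuottuka with a main dish and plate of caviar in the restaurants".split()
b = "eat some main dish plate after Uotoka in that restaurants".split()
similarity = cosine(context_vector(a, "Vuottuka"), context_vector(b, "Uotoka"))
```

In the toy example the two contexts share five words, so the similarity is well above the 0.00006 threshold; on real data the threshold would be chosen with something like `best_threshold` over the labeled positive and negative pairs.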

33
Results of context similarity vs cosine threshold
34
Process
(idea, ????)(report, ????)
English word and its Katakana variant
Web search by??(e-j) report
???? report ????
WWW
(????repooto,????ripooto)(????repooto,????
sapooto)
Pairs of variant candidates
Edit dist. 1
(????repooto,????repooto)(??????refarensu,
??????rifarenssu) (??????aakitekuto,
??????aakitekutu)
cosine similarity > 0.00006
Pairs of variant
?re??ri SP1?to??ttu SP3
Next to do is to calculate SP based on Statistics
35
2nd stage: Calculation of string penalty SP
  • String penalty of the operation x → y (x is replaced
    with y)
  • We focus on
  • The high correlation between replaced strings and
    their character context, which is composed of
    several characters around the target string.
  • Example (???????????????)
    (?????????????) (?????????)
  • ?replace ?I with ?i??U and ?n co-occurs

36
Character-level contexts CLC1..CLC5 used to
calculate SP
  • x: target character
  • α, β, γ, δ: characters around x

CLC    String context around x
CLC1   αβ x    two characters preceding x
CLC2   β x     one character preceding x
CLC3   x γ     one character succeeding x
CLC4   x γδ    two characters succeeding x
CLC5   β x γ   characters preceding and succeeding x
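
The five contexts in the table above can be extracted with a small helper. This is a sketch: padding word boundaries with "#" and the name `character_contexts` are assumptions, and the target x is taken to be a single character at index `pos`.

```python
def character_contexts(word, pos, pad="#"):
    """Return CLC1..CLC5 for the target character word[pos]."""
    padded = pad * 2 + word + pad * 2    # pad so that edge characters have contexts
    p = pos + 2
    x = padded[p]
    return {
        "CLC1": padded[p - 2:p] + x,                        # two preceding characters + x
        "CLC2": padded[p - 1:p] + x,                        # one preceding character + x
        "CLC3": x + padded[p + 1:p + 2],                    # x + one succeeding character
        "CLC4": x + padded[p + 1:p + 3],                    # x + two succeeding characters
        "CLC5": padded[p - 1:p] + x + padded[p + 1:p + 2],  # one on each side of x
    }

contexts = character_contexts("レポート", 1)  # target character: ポ
```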
37
Calculation of string penalty SP
  • i = 1, 2, 3, 4, 5
  • f(CLCi): frequency of pairs in which CLCi occurs
  • f(CLCi, x → y): frequency of pairs in which both
    CLCi and x → y occur

38
Calculation of string penalty SP
Identify the character context CLCi which most
probably co-occurs with the operation x → y
Then
$\text{Rank of occurrence} = C \cdot (\text{Prob. of occurrence})^{-1}$
(Zipf's law)
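
The exact formula is not recoverable from this transcript, so the sketch below is an assumption: it takes the probability of an operation to be the largest ratio f(CLCi, x → y) / f(CLCi) over the observed contexts and returns C divided by that probability as a rank-like penalty. The name `estimate_penalty`, the input layout (one set of contexts and one set of operations per variant pair), and the default C = 1 are all hypothetical.

```python
from collections import Counter

def estimate_penalty(op, pairs, c=1.0):
    """Rank-like penalty C / P, with P = max over contexts of f(ctx, op) / f(ctx).

    `pairs` is a list of (contexts, operations) tuples, one per variant pair,
    where both members are sets.
    """
    f_ctx = Counter()      # f(CLCi): pairs in which the context occurs
    f_joint = Counter()    # f(CLCi, op): pairs in which context and op both occur
    for contexts, operations in pairs:
        for ctx in contexts:
            f_ctx[ctx] += 1
            if op in operations:
                f_joint[ctx] += 1
    if not f_joint:
        return None        # the operation was never observed
    prob = max(f_joint[ctx] / f_ctx[ctx] for ctx in f_joint)
    return c / prob

toy_pairs = [
    ({"A", "B"}, {"x->y"}),
    ({"A"}, {"x->y"}),
    ({"A", "B"}, set()),
    ({"B"}, set()),
]
penalty = estimate_penalty("x->y", toy_pairs)
```

In the toy data, context A co-occurs with the operation in 2 of its 3 pairs, so the estimated probability is 2/3 and the penalty is about 1.5. How such a value is mapped to the integer SPs on the following slides is not specified here.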
39
Examples of string penalties
Operation                              SP   Example
Insertion and deletion of              1    ?????????????
Insertion and deletion of macron ー     1    ??????????
Replace オ (O) and ォ (o)               1    ?????????
Replace グ (gu) and ク (ku)             2    ???????
Replace ヴ (vu) and ブ (bu)             2    ???????????
Replace ヴ (vu) and ウ (U)              3    ?????????
40
Comparison of SP by hand and SP by the proposed
method
  • SP by hand, proposed by Masuyama et al. (2004)
  • An expert worked out the SP values by hand
  • Gold-standard Katakana variants
  • 682 pairs of Katakana variant candidates
    extracted from a newspaper corpus, whose string
    penalties are between 1 and 12
  • We found no correct variants whose SPs are
    between 10 and 12. Thus, the above gold standard
    probably covers all correct variants.

41
Comparison of SPs
(correct variants / candidate pairs, percentage in parentheses)
SP   SP by hand        SP by proposed mechanical method
1    216/221 (97.7)    262/286 (91.6)
2    162/207 (78.3)    133/148 (89.9)
3    70/99 (70.7)      51/90 (56.7)
4    2/14 (14.3)       2/26 (7.7)
5    0/29 (0.0)        0/16 (0.0)
6    0/13 (0.0)        2/34 (5.9)
7    1/20 (5.0)        1/39 (2.6)
8    0/13 (0.0)        1/15 (6.7)
9    1/12 (8.3)        0/8 (0.0)
10   0/16 (0.0)        0/5 (0.0)
11   0/17 (0.0)        0/12 (0.0)
12   0/21 (0.0)        0/3 (0.0)
42
Comparison of SPs: correlation
Rows: SP by hand. Columns: SP by proposed mechanical method.
          1    2    3    4    5    6    7    8    9   10   11   12  Total
1       207    7    3    2    0    1    1    0    0    0    0    0    221
2        20  123   59    2    1    1    1    0    0    0    0    0    207
3        59   11   20    3    2    3    1    0    0    0    0    0     99
4         0    2    3    2    2    0    4    0    0    1    0    0     14
5         0    2    2    6    3    4    5    3    1    0    2    1     29
6         0    0    0    3    1    1    2    0    3    1    2    0     13
7         0    0    1    3    2    2    2    4    1    1    3    1     20
8         0    1    0    0    0    4    6    1    0    0    1    0     13
9         0    1    0    0    0    0    2    3    1    1    4    0     12
10        0    1    1    5    0    0    4    2    1    1    0    1     16
11        0    0    1    0    2   13    0    0    1    0    0    0     17
12        0    0    0    0    3    5   11    2    0    0    0    0     21
Total   286  148   90   26   16   34   39   15    8    5   12    3    682
Correlation: 0.76
43
Building the Katakana variants DB automatically
44
Summary of comparison and next?
COLING 2004: SP by hand + context similarity → extracted variants: accurate
SIGIR2005: SP by mechanical method + context similarity → extracted variants: accurate?
Correlation between the two SPs: 0.76
45
Variants DB
???? ???? ???? ????
Newspaper corpus
(????,????)(????,????)(????,????)
Candidates of Katakana variants
(????,????)(????,????)
Candidates of Katakana variants
(????,????)
Katakana variants DB
46
Variants DB
???? ???? ???? ????
Newspaper corpus
Extract Katakana words
(????,????)(????,????)(????,????)
Candidates of Katakana variants
(????,????)(????,????)
Candidates of Katakana variants
(????,????)
Katakana variants DB
47
Variants DB
???? ???? ???? ????
Newspaper corpus
Extract Katakana words
(????,????)(????,????)(????,????)
Candidates of Katakana variants
SP 3
(????,????)(????,????)
Candidates of Katakana variants
(????,????)
Katakana variants DB
48
Variants DB
???? ???? ???? ????
Newspaper corpus
Extract Katakana words
(????,????)(????,????)(????,????)
Candidates of Katakana variants
SP 3
(????,????)(????,????)
Candidates of Katakana variants
Context similarity 0.005
(????,????)
Katakana variants DB
Optimized threshold
49
Comparison of variants DB
SP ≤ 3, context similarity ≥ 0.05
            SP by hand (expert)   SP by the proposed mechanical method
recall      417/420 (99.3)        415/420 (98.8)
precision   417/480 (86.9)        415/480 (86.5)
F-value     92.7                  92.2
cf. The whole DB contains 3 million Katakana
variants for 1 million distinct Katakana words.
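
The F-values in the table follow from the recall and precision counts if the F-value is the usual harmonic mean of the two, which is assumed here. The helper name `precision_recall_f` is hypothetical.

```python
def precision_recall_f(correct, extracted, gold):
    """Precision, recall, and their harmonic mean from raw counts."""
    precision = correct / extracted
    recall = correct / gold
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Counts taken from the table: 480 extracted pairs, 420 gold-standard variants.
by_hand = precision_recall_f(417, extracted=480, gold=420)
mechanical = precision_recall_f(415, extracted=480, gold=420)
```

Both F-values round to the figures reported on the slide (92.7 by hand, 92.2 with the mechanical method).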
50
Conclusions
  • Mechanical method of calculating SP
  • Using a Web search engine to extract variant
    candidates
  • SP by character context
  • Almost the same accuracy as SP assigned by hand by an expert
  • Katakana variants DB with SP by the mechanical method
  • Recall: 98.8
  • Precision: 86.5
  • F-value: 92.2

51
Future of our research
  • Other languages, such as German
  • Arbeit -- アルバイト
  • Application of our methodology (Web resources and
    statistical string penalty) to other language
    pairs.
  • Londres ↔ London
  • München ↔ Munich
  • Our hope is a cross-language automatic spelling
    variant generator for any language pair, based
    on the proposed method.

52
  • Thank you!
  • サンキュー (sankyuh)
  • サンキュウ (sankyuu)
  • Questions or comments are welcome.

53
Error analysis
  • grizzly bear
  • ???????? vs. ????????
  • gurihzurihbea vs. gurihzurih bea
  • → are not regarded as variants
  • contexts: animal vs. Norman Schwarzkopf
  • totally different contexts!
  • sign pole vs. sign ball
  • サインポール vs. サインボール
  • sainpohru vs. sainbohru
  • → are regarded as variants
  • contexts: barber shop vs. baseball
  • → customer, shop, sales (very similar contexts)
54
The threshold of SP vs. F-value
55
cosine similarity vs. F-value
56
If you search for some Katakana variants with Google,
  • In the case of spaghetti

Katakana variant            Found or not
??????(spaghetti)           found
???????(supagettuthii)      found
??????(supagettutei)        not found
?????(supagettuthi)         found
??????(supagethii)          found
?????(supagetei)            not found
57
How to find candidates of Katakana
  • How to obtain the documents on which context
    similarity is calculated
  • Google search with a query consisting of a Katakana word that
    is a Katakana variant candidate.
  • Extract the context of the Katakana variant from the
    search result pages and calculate the context
    similarity to identify Katakana variants.

58
Example of similarity calculation
  • (ウォッカ Uottuka, ウォトカ Uotoka)
  • ウォッカ: liquor 1.1, strong 1.4, alcohol 1.6, western
    liquor 0.7
  • ウォトカ: liquor 0.7, strong 0.7, alcohol 3.4, western
    liquor 1.4
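
As a worked check, and assuming the four weights listed for each spelling are the complete context vectors (in the order liquor, strong, alcohol, western liquor), the cosine similarity of this pair comes out as follows:

$$
\cos(u, v) = \frac{1.1 \cdot 0.7 + 1.4 \cdot 0.7 + 1.6 \cdot 3.4 + 0.7 \cdot 1.4}{\sqrt{1.1^2 + 1.4^2 + 1.6^2 + 0.7^2}\,\sqrt{0.7^2 + 0.7^2 + 3.4^2 + 1.4^2}} = \frac{8.17}{\sqrt{6.22}\,\sqrt{14.5}} \approx 0.86
$$

Real context vectors contain many more words, so actual similarities would differ from this toy value.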