Title: Migration of Intex resources towards NooJ - the case of Serbian
1Migration of Intex resources towards NooJ - the
case of Serbian
Faculty of Mining and Geology, Belgrade
Ranka Stanković, assistant, ranka_at_rgf.bg.ac.yu
Ivan Obradović, professor, ivano_at_rgf.bg.ac.yu
Faculty of Philology, Belgrade
Cvetana Krstev, professor, cvetana_at_matf.bg.ac.yu
Faculty of Mathematics, Belgrade
Duško Vitas, professor, vitas_at_matf.bg.ac.yu
Gordana Pavlović-Lažetić, professor, gordana_at_matf.bg.ac.yu
2CONTENTS
- Overview of Intex resources for Serbian
- Specific features of Serbian
- The approach to Intex -> NooJ migration of lexical resources
- ConvertIN: Convert Intex to NooJ (migration software, scripts)
- Migration results and open questions
3Intex dictionaries for Serbian
- Dictionary of simple words: 73000 lemmas (DELAS entries) and more than a million word forms (DELAF entries)
- Dictionaries of proper names: 21000 lemmas, or 145000 forms
- Dictionary of compound words (compound nouns, prepositions, conjunctions and adverbs, compound toponyms and proper names)
- Auxiliary dictionaries (special purpose filter dictionaries and auxiliary dictionaries for the processing of particular texts)
4Intex transducers for Serbian
- Transducers for the description of inflectional classes, used for generating DELAF from DELAS dictionaries. The largest group of transducers: 333 for nouns, 60 for adjectives and 344 for verbs
- Transducers for derivation in Serbian. Second largest group of transducers: 40
- Transducers for the identification of specific forms, such as acronyms with their appropriate inflection and derivation (e.g. OEBS, OEBS-a, OEBS-ov): 66
- Transducers for disambiguation: 41
5Specific features of Serbian
- š, ш -> sx
- đ, ђ -> dx
- č, ч -> cy
- ć, ћ -> cx
- ž, ж -> zx
- nj, њ -> nx
- lj, љ -> lx
- dž, џ -> dy
- Use of two alphabets
  - Official Cyrillic alphabet
  - Serbian Latin alphabet (also widely used)
  - Absence of a unique transliteration procedure in any of the standard coding schemes
- Rich morphological system
  - Reflected at both the inflectional and the derivational level
6The approach to migration
- Development resources are kept in the transliterated Latin alphabet (with the adopted transliteration scheme)
- Automatic production of both Serbian Latin and Cyrillic Unicode versions of the resources, which can then be used either separately or jointly
- Application of the procedure to existing DELAF dictionaries will make the translation of numerous transducers for inflection unnecessary
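The automatic production of the Serbian Latin and Cyrillic versions from the transliterated form can be sketched as follows. This is a minimal illustration, not the ConvertIN code: the function name `convert` and the target labels `"lat"` and `"cir"` are assumptions, and the two-letter codes are the ones of the transliteration scheme listed under the specific features of Serbian.

```python
# Two-letter codes of the transliteration scheme -> (Serbian Latin, Cyrillic).
CODES = {
    "sx": ("š", "ш"), "dx": ("đ", "ђ"), "cy": ("č", "ч"), "cx": ("ć", "ћ"),
    "zx": ("ž", "ж"), "nx": ("nj", "њ"), "lx": ("lj", "љ"), "dy": ("dž", "џ"),
}
# The remaining letters map one to one (position by position) onto Cyrillic.
PLAIN_LATIN = "abcdefghijklmnoprstuvz"
PLAIN_CYRILLIC = "абцдефгхијклмнопрстувз"
TO_CYRILLIC = dict(zip(PLAIN_LATIN, PLAIN_CYRILLIC))
TO_CYRILLIC.update(zip(PLAIN_LATIN.upper(), PLAIN_CYRILLIC.upper()))


def convert(text, target):
    """Convert transliterated text to Serbian Latin ("lat") or Cyrillic ("cir").

    Hypothetical helper: the name and the target labels are illustrative.
    """
    if target not in ("lat", "cir"):
        raise ValueError("target must be 'lat' or 'cir'")
    column = 0 if target == "lat" else 1
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in CODES:
            letter = CODES[pair.lower()][column]
            # Keep the capitalisation of the first character of the code.
            out.append(letter.capitalize() if pair[0].isupper() else letter)
            i += 2
        else:
            ch = text[i]
            # Characters outside the alphabet (digits, punctuation) pass through.
            out.append(TO_CYRILLIC.get(ch, ch) if target == "cir" else ch)
            i += 1
    return "".join(out)


print(convert("cyudovisxte", "lat"))
print(convert("godisxnxica", "cir"))
```

Because the digraphs have their own codes, a plain sequence such as n followed by j in the transliterated text stays two letters in Cyrillic, which is the ambiguity a direct Latin to Cyrillic conversion cannot resolve.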
7DELAS transformation
Intex DELAS entry:
ponesxto,PRO13+Indef+ProN
NooJ entries (different inflectional classes):
ponesxto,ponesxto,PRO+FLX=PRO13+Indef+ProN
ponešto,ponesxto,PRO+FLX=PRO13_lat+Indef+ProN
понешто,ponesxto,PRO+FLX=PRO13_cir+Indef+ProN
Files:
delas-im.dic -> ascdelas-im.dic, latdelas-im.dic, cirdelas-im.dic
delas-gl.dic -> ascdelas-gl.dic, latdelas-gl.dic, cirdelas-gl.dic
....
Transliteration scheme:
- š, ш -> sx
- đ, ђ -> dx
- č, ч -> cy
- ć, ћ -> cx
- ž, ж -> zx
- nj, њ -> nx
- lj, љ -> lx
- dž, џ -> dy
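The DELAS transformation above can be sketched as a small function that turns one Intex entry into three NooJ entries. This is only an illustration under assumptions: the function name `delas_to_nooj` is hypothetical, the entry is assumed to have the shape `lemma,CODE+feature+...`, and the transliteration is passed in as a callable so that the sketch does not depend on a particular implementation.

```python
import re


def delas_to_nooj(entry, transliterate):
    """Expand one Intex DELAS entry into ASCII, Latin and Cyrillic NooJ entries.

    `transliterate(lemma, target)` is a caller-supplied function (hypothetical
    interface) returning the lemma in the "lat" or "cir" alphabet.
    """
    lemma, info = entry.strip().split(",", 1)
    code, *features = info.split("+")
    match = re.match(r"[A-Z]+", code)
    if match is None:
        raise ValueError(f"cannot find a category in {code!r}")
    category = match.group(0)  # e.g. PRO from PRO13
    lines = []
    for suffix, target in (("", None), ("_lat", "lat"), ("_cir", "cir")):
        form = lemma if target is None else transliterate(lemma, target)
        # Each alphabet gets its own inflectional class: PRO13, PRO13_lat, PRO13_cir.
        fields = [category, f"FLX={code}{suffix}", *features]
        lines.append(f"{form},{lemma}," + "+".join(fields))
    return lines


# Usage with a stand-in lookup table instead of a real transliterator.
forms = {"lat": "ponešto", "cir": "понешто"}
for line in delas_to_nooj("ponesxto,PRO13+Indef+ProN", lambda word, target: forms[target]):
    print(line)
```

The transliterated lemma is kept in the second field of every entry, which corresponds to the "same lemma" option discussed for DELAF.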
8Options for transforming DELAF (1)
- Use the same lemma (transliterated option)
Intex DELAF entries:
- cyudovisxta,cyudovisxte.N+Hum:ns2v:np1v:np2v:np4v:np5v
- čudovišta,cyudovisxte.N:ns2v:np1v:np2v:np4v:np5v
- чудовишта,cyudovisxte.N:ns2v:np1v:np2v:np4v:np5v
NooJ entries (transliterated):
cyudovisxta,cyudovisxte,N+Hum+ns2v
cyudovisxta,cyudovisxte,N+np1v
cyudovisxta,cyudovisxte,N+np2v
cyudovisxta,cyudovisxte,N+np4v
cyudovisxta,cyudovisxte,N+np5v
NooJ entries (Cyrillic):
чудовишта,cyudovisxte,N+ns2v
чудовишта,cyudovisxte,N+np1v
чудовишта,cyudovisxte,N+np2v
чудовишта,cyudovisxte,N+np4v
чудовишта,cyudovisxte,N+np5v
NooJ entries (Serbian Latin):
čudovišta,cyudovisxte,N+ns2v
čudovišta,cyudovisxte,N+np1v
čudovišta,cyudovisxte,N+np2v
čudovišta,cyudovisxte,N+np4v
čudovišta,cyudovisxte,N+np5v
9Options for transforming DELAF (2)
- Use different lemmas
Intex DELAF entries:
- cyudovisxta,cyudovisxte.N+Hum:ns2v:np1v:np2v:np4v:np5v
- čudovišta,cyudovisxte.N:ns2v:np1v:np2v:np4v:np5v
- чудовишта,cyudovisxte.N:ns2v:np1v:np2v:np4v:np5v
NooJ entries (transliterated):
cyudovisxta,cyudovisxte,N+Hum+ns2v
cyudovisxta,cyudovisxte,N+np1v
cyudovisxta,cyudovisxte,N+np2v
cyudovisxta,cyudovisxte,N+np4v
cyudovisxta,cyudovisxte,N+np5v
NooJ entries (Cyrillic):
чудовишта,чудовиште,N+ns2v
чудовишта,чудовиште,N+np1v
чудовишта,чудовиште,N+np2v
чудовишта,чудовиште,N+np4v
чудовишта,чудовиште,N+np5v
NooJ entries (Serbian Latin):
čudovišta,čudovište,N+ns2v
čudovišta,čudovište,N+np1v
čudovišta,čudovište,N+np2v
čudovišta,čudovište,N+np4v
čudovišta,čudovište,N+np5v
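Both DELAF options come down to the same expansion step: one compressed Intex entry yields one NooJ line per inflectional code, and the only difference is which lemma is written into the second field. A minimal sketch, assuming the usual Intex DELAF shape `form,lemma.CATEGORY:code:code` and a hypothetical function name `delaf_to_nooj`:

```python
def delaf_to_nooj(entry, lemma=None):
    """Expand a compressed DELAF entry into one NooJ line per inflectional code.

    With lemma=None the lemma found in the entry is kept (same-lemma option);
    passing a lemma replaces it (different-lemmas option).
    """
    form, rest = entry.strip().split(",", 1)
    source_lemma, codes = rest.split(".", 1)
    category, *inflections = codes.split(":")
    if not inflections:
        raise ValueError(f"no inflectional codes in {entry!r}")
    target_lemma = lemma if lemma is not None else source_lemma
    return [f"{form},{target_lemma},{category}+{code}" for code in inflections]


entry = "чудовишта,cyudovisxte.N:ns2v:np1v:np2v:np4v:np5v"
same = delaf_to_nooj(entry)                         # transliterated lemma kept
different = delaf_to_nooj(entry, lemma="чудовиште")  # Cyrillic lemma substituted
print(same[0])
print(different[0])
```

Keeping the transliterated lemma preserves the link between the three alphabet versions of a word, while substituting the lemma gives each alphabet a self-contained dictionary.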
10ConvertIN - Overview
[Diagram: Delas, Delaf; Graph, Text; Inflexion?]
11ConvertIN a glimpse at the code
"<E>" 28 168 1 16
"" 760 144 0
"-" 236 168 1 13
"godisxnxice" 380 64 1 7
"godisxnxicu" 384 160 1 8
"godisxnxicom" 384 212 1 9
"godisxnxica" 404 288 1 10
"<E>/12godisxnxice,12godisxnxica.NCf2s" 516 64 1 1
"<E>/12godisxnxicu,12godisxnxica.NCf4s" 529 160 1 1
"<E>/12godisxnxicom,12godisxnxica.NCf6s" 524 212 1 1
"<E>/12godisxnxica,12godisxnxica.NCf1s" 524 288 1 1
"BrojCifre" 88 168 1 15
"(2" 192 168 1 2
")" 276 168 5 3 4 5 6 14
"godisxnxici" 382 108 1 17
")" 164 168 1 12
"(1" 68 165 1 11
"<E>/12godisxnxici,12godisxnxica.NCf3sf7s" 520 108 1 1
12ConvertIN a glimpse at the code
"<E>" 28 168 1 16
"" 760 144 0
"-" 236 168 1 13
"годишњице" 380 64 1 7
"годишњицу" 384 160 1 8
"годишњицом" 384 212 1 9
"годишњица" 404 288 1 10
"<E>/12годишњице,12godisxnxica.NCf2s" 516 64 1 1
"<E>/12годишњицу,12godisxnxica.NCf4s" 529 160 1 1
"<E>/12годишњицом,12godisxnxica.NCf6s" 524 212 1 1
"<E>/12годишњица,12godisxnxica.NCf1s" 524 288 1 1
"BrojCifre" 88 168 1 15
"(2" 192 168 1 2
")" 276 168 5 3 4 5 6 14
"годишњици" 382 108 1 17
")" 164 168 1 12
"(1" 68 165 1 11
"<E>/12годишњици,12godisxnxica.NCf3sf7s" 520 108 1 1
NCf2s NCf4s NCf6s NCf1s
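The two graph listings differ only in the content of some boxes, so the conversion can be pictured as a line-by-line rewrite. The sketch below assumes, as the listings suggest, that each box line consists of a quoted content string, two coordinates, a transition count and that many target box numbers. The names `parse_box`, `format_box` and `convert_box` are hypothetical, and the rule for picking boxes to transliterate (purely lowercase alphabetic content) is a simplifying heuristic, not the rule used by ConvertIN.

```python
import re

# "content" x y n target_1 ... target_n
BOX = re.compile(r'^"(.*)"\s+(\d+)\s+(\d+)\s+(\d+)((?:\s+\d+)*)$')


def parse_box(line):
    """Split one box line into (content, x, y, targets)."""
    match = BOX.match(line.strip())
    if match is None:
        raise ValueError(f"not a box line: {line!r}")
    content, x, y, count, rest = match.groups()
    targets = [int(t) for t in rest.split()]
    if len(targets) != int(count):
        raise ValueError(f"expected {count} targets, found {len(targets)}")
    return content, int(x), int(y), targets


def format_box(content, x, y, targets):
    """Serialise a box back into the line format read by parse_box."""
    parts = [f'"{content}"', str(x), str(y), str(len(targets))]
    return " ".join(parts + [str(t) for t in targets])


def convert_box(line, transliterate):
    """Transliterate the content of plain word boxes, leave all others as is."""
    content, x, y, targets = parse_box(line)
    if content.isalpha() and content.islower():  # heuristic, see lead-in
        content = transliterate(content)
    return format_box(content, x, y, targets)


lookup = {"godisxnxice": "годишњице"}  # stand-in for a real transliterator
print(convert_box('"godisxnxice" 380 64 1 7', lookup.get))
print(convert_box('")" 276 168 5 3 4 5 6 14', lookup.get))
```

Output boxes, which carry both the form and the transliterated lemma, are left untouched here; handling them would need the same choice of lemma that is discussed for the dictionaries.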
13ConvertIN DELAS editor
[Screenshot: Delas editor; _properties.def with PRO_Distribution: Cr, Demon, Ek, Gen, Ijk, Indef, Int, Neg, Pos, ProA, ProN, Prs, Ref, Rel, Sr]
14ConvertIN properties
_properties.def
15Results of dictionary conversion
Small dictionaries (all in one)
Bigger dictionaries (one each)
16Conversion of Cyrillic dictionaries
- Small Cyrillic dictionaries were converted successfully with the transliterated Latin lemma
- Large Cyrillic dictionaries could not be converted with the transliterated Latin lemma
  - чудовишта,cyudovisxte,N+ns2v
  - чудовишта,cyudovisxte,N+np1v
  - ....
- but could be converted with the Cyrillic lemma
  - чудовишта,чудовиште,N+ns2v
  - чудовишта,чудовиште,N+np1v
  - ....
- The problem of different lemmas for Latin and Cyrillic: loss of connection
17Results of graph conversion
18Results of graph conversion
19Lexical analysis a comparison
- Results of a lexical analysis using the original Intex resources and the resources converted to NooJ format, on the 13KW ebit2002-bez test corpus
- Intex
  - 900 unknown words
- NooJ (from Intex)
  - 1140 unknown words for transliterated Latin
  - 1034 unknown words for Serbian Latin
  - 10000 unknown words for Cyrillic (V,N,A)
20Lexical analysis Cyrillic lemma
- All dictionaries with Cyrillic lemma
- All graphs with Cyrillic lemma
21Lexical analysis Cyrillic lemma
22Morphology an open question
- <E>/msNNg msAAq msVVg
- 2na/fsNNg fsVVg npNNg npAAg npVVg
- 2ne/fsGGg mpAAg fpNNg fpAAg fpVVg
- 2ni/mpNNg mpVVg
- 2nih/mpGGg fpGGg npGGg
- 2nim/msIIg nsIIg mpXXg mpIIg mpWWg fpXXg fpIIg fpWWg npXXg npIIg npWWg
- 2nima/mpXXg mpIIg mpWWg fpXXg fpIIg fpWWg npXXg npIIg npWWg
- 2no/nsNNg nsAAg nsVVg
- 2nog/msGGg msAAv nsGGg
- 2noga/msGGg msAAv nsGGg
- 2noj/fsXXg fsWWg
- 2nom/msXXg msWWg nsXXg nsWWg fsIIg
- 2nome/msXXg msWWg nsXXg nsWWg
- 2nomu/msXXg msWWg nsXXg nsWWg
- 2nu/fsAAg
NUM01.exp and its NooJ equivalent
NUM01
<E>/msNNg <E>/msAAq <E>/msVVg
<B2>na(<E>/fsNNg <E>/fsVVg <E>/npNNg <E>/npAAg <E>/npVVg)
<B2>ne(<E>/fsGGg <E>/mpAAg <E>/fpNNg <E>/fpAAg <E>/fpVVg)
<B2>ni(<E>/mpNNg <E>/mpVVg)
<B2>nih(<E>/mpGGg <E>/fpGGg <E>/npGGg)
<B2>nim(<E>/msIIg <E>/nsIIg <E>/mpXXg <E>/mpIIg <E>/mpWWg <E>/fpXXg <E>/fpIIg <E>/fpWWg <E>/npXXg <E>/npIIg <E>/npWWg)
<B2>nima(<E>/mpXXg <E>/mpIIg <E>/mpWWg <E>/fpXXg <E>/fpIIg <E>/fpWWg <E>/npXXg <E>/npIIg <E>/npWWg)
<B2>no(<E>/nSNNg <E>/nSAAg <E>/nSVVg)
<B2>nog(<E>/msGGg <E>/msAAv <E>/nsGGg)
<B2>noga(<E>/msGGg <E>/msAAv <E>/nsGGg)
<B2>noj(<E>/fsXXg <E>/fsWWg)
<B2>nom(<E>/msXXg <E>/msWWg <E>/nsXXg <E>/nsWWg <E>/fsIIg)
<B2>nome(<E>/msXXg <E>/msWWg <E>/nsXXg <E>/nsWWg)
<B2>nomu(<E>/msXXg <E>/msWWg <E>/nsXXg <E>/nsWWg)
<B2>nu/fsAAg
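For simple paradigms such as NUM01, where every line only deletes a number of final characters and appends a suffix, the rewrite from the Intex line to the NooJ expression is mechanical. The sketch below covers that case only. The function name `exp_line_to_nooj` is hypothetical, the codes are assumed to be separated by colons or whitespace, and joining the alternatives with `|` is an assumption, since the separators are not preserved in the listing above.

```python
import re


def exp_line_to_nooj(line):
    """Rewrite one 'Nsuffix/codes' line as a NooJ-style expression.

    Only the pattern seen in NUM01 is handled: an optional number of
    characters to delete, a suffix to append, and a list of codes.
    """
    command, codes = line.strip().split("/", 1)
    tags = [tag for tag in re.split(r"[:\s]+", codes) if tag]
    if not tags:
        raise ValueError(f"no inflectional codes in {line!r}")
    if command == "<E>":
        # Nothing is deleted or appended: one alternative per code.
        return " | ".join(f"<E>/{tag}" for tag in tags)
    count, suffix = re.fullmatch(r"(\d*)(.*)", command).groups()
    prefix = (f"<B{count}>" if count else "") + suffix
    if len(tags) == 1:
        return f"{prefix}/{tags[0]}"
    return prefix + "(" + " | ".join(f"<E>/{tag}" for tag in tags) + ")"


print(exp_line_to_nooj("2ni/mpNNg mpVVg"))
print(exp_line_to_nooj("2nu/fsAAg"))
```

Paradigms that edit the middle of the word, such as the fleeting a classes, do not fit this pattern, which is why they remain an open question.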
23Morphology an open question
ASC N7.exp: bubanx, pirinacy (with fleeting a)
- N7
<E>/ms1q <E>/ms4q
<L2><B><R2>a/ms2q
<L2><B><R2>u/(<E>/ms3q <E>/ms7q)
<L2><B><R2>e/(<E>/ms5q <E>/mp4q)
<L2><B><R2>em/ms6q
a/mp2q
<L2><B><R2>ima/(<E>/mp3q <E>/mp6q <E>/mp7q)
- N3
<E>/ms1q <E>/ms4q
<L><B><R>a/ms2q
<L><B><R>u/(<E>/ms3q <E>/ms7q)
<L><B><R>e/(<E>/ms5q <E>/mp4q)
<L><B><R>em/ms6q
a/mp2q
<L><B><R>ima/(<E>/mp3q <E>/mp6q <E>/mp7q)
LAT: bubanj N7, pirinač N3
CIR: бубањ N7, пиринач N3 ALL NEW
24A few more open questions
- The choice of lemma: unique (transliterated) or three/two different
- Differences in lexical analysis with Intex
- Rules for automatic conversion of .exp files
Generic commands:
<B> keyboard Backspace
<D> Duplicate current char
<E> Empty string
<L> keyboard Left arrow
<N> go to end of Next word form
<P> go to end of Previous word form
<R> keyboard Right arrow
Stack operators: L R C LR LL RL LC
c: insert character 'c' at the end of the form
L: delete last character, push it onto the stack
R: pop the stack
C: copy the character at the top of the stack to the end of the form, pop the stack
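The effect of the generic commands can be illustrated with a small interpreter that applies a command string to a lemma. This is a sketch under assumptions, not NooJ's implementation: it treats the word as a text field with a cursor that starts at the end, supports only `<E>`, `<B>`, `<L>` and `<R>` with an optional repeat count, and inserts every other character at the cursor. The function name `inflect` is hypothetical.

```python
import re

# Either a command such as <B>, <L2>, <E>, or a single literal character.
TOKEN = re.compile(r"<([A-Z])(\d*)>|(.)", re.S)


def inflect(lemma, commands):
    """Apply a string of generic commands and literal characters to a lemma."""
    chars = list(lemma)
    cursor = len(chars)  # editing starts at the end of the word
    for match in TOKEN.finditer(commands):
        op, count, literal = match.groups()
        if literal is not None:
            chars.insert(cursor, literal)
            cursor += 1
            continue
        n = int(count) if count else 1
        if op == "E":    # empty string: nothing to do
            continue
        elif op == "B":  # backspace: delete the character before the cursor
            for _ in range(n):
                if cursor > 0:
                    del chars[cursor - 1]
                    cursor -= 1
        elif op == "L":  # move the cursor left
            cursor = max(0, cursor - n)
        elif op == "R":  # move the cursor right
            cursor = min(len(chars), cursor + n)
        else:
            raise ValueError(f"unsupported command <{op}{count}>")
    return "".join(chars)


print(inflect("bubanx", "<L2><B><R2>a"))  # transliterated lemma, two-letter code nx
print(inflect("pirinač", "<L><B><R>a"))   # Serbian Latin lemma, single letter č
```

The two calls show why the same noun needs a different class in each alphabet: the number of cursor moves depends on how many characters encode the final consonant.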
25Conclusion
- Further work will include procedures for automatic import and export between resources in Intex/NooJ format and resources in MULTEXT-East and other currently accepted formats (MAF, LMF)
- The process described in this paper has proven beneficial for both kinds of resources.
- The tool is language independent
- However, we should point out some problems in its application. First of all, the sizes of SMD.