Application of INTEX in refinement and validation of Serbian WordNet

1
Application of INTEX in refinement and validation
of Serbian WordNet
  • Ivan Obradovic, Ranka Stankovic
  • Cvetana Krstev, Gordana Pavlovic-Lažetic
  • University of Belgrade

2
WordNet (WN)
  • a semantic network of concepts represented by
    synsets: sets of synonymous words (nouns, verbs,
    adjectives, adverbs)
  • contains explicitly coded descriptions of
    semantic relations
  • inspired by research in the field of
    psycholinguistics
  • initially developed at Princeton for the English
    language

Fellbaum C. (ed.) (1998) WordNet: An Electronic
Lexical Database, The MIT Press
3
Multilingual WordNets
  • Featuring the InterLingual Index (ILI)
  • EuroWordNet (EWN): Dutch, Italian, Spanish,
    German, French, Czech and Estonian
  • BalkaNet (BWN): five Balkan languages (Greek,
    Turkish, Bulgarian, Romanian and Serbian), as well
    as Czech

Vossen, P. (ed.) (1998) EuroWordNet: A
Multilingual Database with Lexical Semantic
Networks, Kluwer Academic Publishers, Dordrecht

Stamou S., Oflazer K., Pala K.,
Christoudoulakis D., Cristea D., Tufis D., Koeva
S., Totkov G., Dutoit D., Grigoriadou M. (2002)
BALKANET: A Multilingual Semantic Network for
Balkan Languages, 1st International Wordnet
Conference, Mysore, India, January 2002
(http://www.ceid.upatras.gr/Balkanet/files/balkanet-elsnet-ko-accept.pdf)
4
The WN semantic network
  • based on a grouping of synonyms into synsets,
    which represent the network nodes
  • nodes are interconnected by arcs which describe
    particular semantic relations (hyperonymy,
    hyponymy, antonymy etc.)
  • in general, every synset is accompanied by a
    definition (gloss) and examples of usage that
    specify the meaning of the concept represented by
    the synset
  • the semantic network itself is an XML-document
    with a precisely established set of entities

5
The Serbian version of WN
  • developed starting from the base concepts of the
    English WN using existing English/Serbian
    dictionaries in paper form
  • synset elements represented as the elements in
    DELAS or DELAC dictionaries without any
    additional morphosyntactic information
  • lexical meanings in Serbian coded with reference
    to the dictionary of Matica Srpska

6
XML representation of a synset in Serbian WN
(demonstrate, establish, prove, show)
<SYNSET>
  <ID>ENG171-00528591-v</ID>
  <SYNONYM>
    <LITERAL>dokazati<SENSE>1</SENSE></LITERAL>
    <LITERAL>dokazivati<SENSE>1</SENSE></LITERAL>
    <LITERAL>pokazati<SENSE>3</SENSE></LITERAL>
    <LITERAL>pokazivati<SENSE>3</SENSE></LITERAL>
  </SYNONYM>
  <DEF>Utvrditi valxanost necyega, primerom,
    objasxnxenxem ili eksperimentom. (Establish the
    validity of something by example, explanation or
    experiment)</DEF>
  <USAGE>Anketa je pokazala da u tako nesxto veruje
    mali broj ispitanih. (The poll showed that few
    people believe in this)</USAGE>
  <POS>v</POS>
  <ILR>ENG171-00529622-v<TYPE>hypernym</TYPE></ILR>
  <BCS>1</BCS>
  <STAMP>Dusko 2003/04/21</STAMP>
</SYNSET>
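A synset record of this shape can be read with any standard XML parser. The following is a minimal sketch in Python using the standard library; the function name read_synset and the returned dictionary layout are illustrative assumptions, not part of the Serbian WN tools, and the sample is a shortened copy of the record above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the synset shown on this slide (DEF, USAGE, BCS, STAMP omitted).
SYNSET_XML = """
<SYNSET>
  <ID>ENG171-00528591-v</ID>
  <SYNONYM>
    <LITERAL>dokazati<SENSE>1</SENSE></LITERAL>
    <LITERAL>dokazivati<SENSE>1</SENSE></LITERAL>
    <LITERAL>pokazati<SENSE>3</SENSE></LITERAL>
    <LITERAL>pokazivati<SENSE>3</SENSE></LITERAL>
  </SYNONYM>
  <POS>v</POS>
  <ILR>ENG171-00529622-v<TYPE>hypernym</TYPE></ILR>
</SYNSET>
"""


def read_synset(xml_text):
    """Return the ID, part of speech, literals and relations of one synset.

    Hypothetical helper: the dictionary layout is an assumption made for
    this illustration.
    """
    root = ET.fromstring(xml_text.strip())
    # Each LITERAL holds the word as leading text and its sense number in SENSE.
    literals = [
        ((lit.text or "").strip(), int(lit.findtext("SENSE").strip()))
        for lit in root.iter("LITERAL")
    ]
    # Each ILR holds the target synset ID as leading text and the relation in TYPE.
    relations = [
        ((ilr.text or "").strip(), ilr.findtext("TYPE").strip())
        for ilr in root.iter("ILR")
    ]
    return {
        "id": root.findtext("ID").strip(),
        "pos": root.findtext("POS").strip(),
        "literals": literals,
        "relations": relations,
    }


synset = read_synset(SYNSET_XML)
print(synset["id"], synset["literals"], synset["relations"])
```

The literal text and the sense number are kept together as pairs because the same word can appear in several synsets with different senses.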
7
Problems in Serbian WN that might be solved using
INTEX
  • lack of morphological and syntactic information
    related to lexemes
  • absence of precise criteria for the selection of
    lexemes for a particular synset
  • lack of information on relative relevance of each
    lexeme in a synset in terms of its lexical
    frequency

8
Incorporation of morphosyntactic information into
synsets using INTEX
  • The DictWNSrp program:
  • matches literals in WN with literals in selected
    DELAS dictionaries and extracts morphosyntactic
    information from dictionaries
  • assigns morphosyntactic information to WN
    literals in cases of a 1-1 match
  • offers the user the option to confirm or alter
    the assigned information and resolve cases of
    homography (i.e. multiple matches)
  • transfers confirmed morphosyntactic information
    into the WN using the LNOTE element
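The matching step described above can be sketched roughly as follows. This is not the DictWNSrp implementation: the function names are hypothetical, dictionary entries are assumed to be simple "lemma,code" lines, and the "homograph" entries and their codes are invented toy data used only to trigger the multiple-match case.

```python
from collections import defaultdict

# Assumed "lemma,code" lines. V122 and V18 appear in this presentation;
# the "homograph" entries are invented for the illustration.
DELAS_LINES = [
    "dokazati,V122",
    "dokazivati,V18",
    "homograph,N1",
    "homograph,A1",
]


def load_dictionary(lines):
    """Index dictionary codes by lemma; a lemma may carry several codes."""
    index = defaultdict(list)
    for line in lines:
        lemma, _, code = line.partition(",")
        index[lemma.strip()].append(code.strip())
    return index


def match_literals(literals, index):
    """Split literals into 1-1 matches, homographs and unmatched items."""
    assigned, ambiguous, unmatched = {}, {}, []
    for literal in literals:
        codes = index.get(literal, [])
        if len(codes) == 1:
            assigned[literal] = codes[0]       # 1-1 match: assign directly
        elif codes:
            ambiguous[literal] = list(codes)   # homography: left to the user
        else:
            unmatched.append(literal)          # not found in the dictionary
    return assigned, ambiguous, unmatched


index = load_dictionary(DELAS_LINES)
assigned, ambiguous, unmatched = match_literals(
    ["dokazati", "dokazivati", "homograph", "pokazati"], index
)
print(assigned, ambiguous, unmatched)
```

Only the 1-1 matches are assigned automatically; the ambiguous group corresponds to the cases the user resolves by hand.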

9
Resolving homography with the DictWNSrp program
10
XML representation of a synset with assigned
morphosyntactic information
<SYNONYM>
  <LITERAL>dokazati<SENSE>1</SENSE>
    <LNOTE>V122PerfTrIrefRef</LNOTE>
  </LITERAL>
  <LITERAL>dokazivati<SENSE>1</SENSE>
    <LNOTE>V18ImperfTrIref</LNOTE>
  </LITERAL>
  <LITERAL>pokazati<SENSE>3</SENSE>
    <LNOTE>V122PerfTrIrefRef</LNOTE>
  </LITERAL>
  <LITERAL>pokazivati<SENSE>3</SENSE>
    <LNOTE>V18ImperfTrIref</LNOTE>
  </LITERAL>
</SYNONYM>
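Writing confirmed information back into the synset amounts to adding an LNOTE child to each LITERAL. A minimal sketch with Python's standard library follows; add_lnotes is a hypothetical helper, and the code strings are copied as they appear in this transcript.

```python
import xml.etree.ElementTree as ET

SYNONYM_XML = (
    "<SYNONYM>"
    "<LITERAL>dokazati<SENSE>1</SENSE></LITERAL>"
    "<LITERAL>dokazivati<SENSE>1</SENSE></LITERAL>"
    "</SYNONYM>"
)


def add_lnotes(synonym_xml, confirmed):
    """Attach an LNOTE element to every literal with confirmed information.

    Hypothetical helper: `confirmed` maps a literal to its code string.
    """
    root = ET.fromstring(synonym_xml)
    for literal in root.iter("LITERAL"):
        word = (literal.text or "").strip()
        # Skip literals without confirmed information or with an existing note.
        if word in confirmed and literal.find("LNOTE") is None:
            note = ET.SubElement(literal, "LNOTE")
            note.text = confirmed[word]
    return ET.tostring(root, encoding="unicode")


updated = add_lnotes(
    SYNONYM_XML,
    {"dokazati": "V122PerfTrIrefRef", "dokazivati": "V18ImperfTrIref"},
)
print(updated)
```

Checking for an existing LNOTE makes the operation safe to repeat on a synset that was already processed.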

11
Validation of lexemes from a synset on a corpus
  • Phase One: the IntexWN program
  • selects and displays all synsets from WN for a
    given lexeme
  • constructs INTEX graphs for all lexemes from
    selected synsets
  • Phase Two: INTEX
  • produces concordances from a chosen corpus for
    graphs constructed by IntexWN
  • Phase Three: the user
  • checks the validity of synonymous relations of
    lexemes on concordances
  • decides on removing or adding new lexemes to the
    synset
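To give a feel for what Phase Two delivers to the user, the sketch below builds a simple keyword-in-context concordance in Python. It is a stand-in, not INTEX: INTEX applies graphs and electronic dictionaries, whereas this sketch assumes a hand-written list of inflected forms and a plain regular expression. The function name concordance and the form list are hypothetical; the sample sentence is the usage example from the synset shown earlier.

```python
import re


def concordance(text, forms, width=30):
    """Return (left context, match, right context) for every listed form.

    Simplified stand-in for a concordancer: `forms` is an assumed,
    hand-written list of inflected forms of one lexeme.
    """
    # Longer forms first, so a short form never shadows a longer one.
    alternatives = "|".join(
        re.escape(form) for form in sorted(forms, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    rows = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        rows.append((left, match.group(0), right))
    return rows


corpus = "Anketa je pokazala da u tako nesxto veruje mali broj ispitanih."
rows = concordance(corpus, ["pokazati", "pokazala"])
for left, hit, right in rows:
    print(f"{left:>30} [{hit}] {right}")
```

Each row is what the user inspects in Phase Three to decide whether the lexeme is used in the sense of the synset.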

12
Constructing a graph for all lexemes from a
synset with the IntexWN program
13
Validation results for synset ENG171-11771798
(being, beingness, existence)
  • Comments
  • the lexemes in the synset have been used to
    denote the given concept in 24% of concordances
  • the lexeme most frequently used to denote the
    given concept is postojanxe
  • although zxivot is the most frequent lexeme in
    the synset, it has been used to denote the given
    concept in only 10% of cases
  • bivstvo does not occur in the corpus and its
    exclusion from the synset could be considered if
    a similar result is obtained on a wider corpus
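Figures of this kind can be derived from the user's judgements on individual concordance lines. The sketch below shows one way to tally them; the function name, the input layout and the toy judgements are assumptions made for this illustration and do not reproduce the authors' data.

```python
from collections import Counter


def validation_summary(synset_lexemes, judgements):
    """Summarise user judgements on concordance lines.

    `judgements` is an assumed list of (lexeme, denotes_concept) pairs,
    one pair per concordance line inspected by the user.
    """
    occurrences = Counter()
    hits = Counter()
    for lexeme, denotes_concept in judgements:
        occurrences[lexeme] += 1
        if denotes_concept:
            hits[lexeme] += 1
    total = len(judgements)
    return {
        # Share of all concordance lines in which the concept is denoted.
        "overall_share": sum(hits.values()) / total if total else 0.0,
        # (lines denoting the concept, all lines) for each lexeme found.
        "per_lexeme": {lex: (hits[lex], occurrences[lex]) for lex in occurrences},
        # Lexemes of the synset that never occur: candidates for review.
        "absent": [lex for lex in synset_lexemes if occurrences[lex] == 0],
    }


# Toy judgements with placeholder lexeme names (not the authors' data).
toy = [
    ("lexeme_a", True),
    ("lexeme_a", False),
    ("lexeme_b", False),
    ("lexeme_b", False),
]
summary = validation_summary(["lexeme_a", "lexeme_b", "lexeme_c"], toy)
print(summary)
```

Keeping the per-lexeme counts separate from the overall share makes it possible to spot a lexeme that is frequent in the corpus but rarely used in the sense of the synset.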

14
Further developments
  • definition of more precise criteria for
    validation of lexemes in a synset based on their
    occurrence in corpora
  • investigation of possibilities for introducing
    relevance information in synsets
  • further development of the IntexWN program to
    include semantic relations, such as
    hyponymy/hyperonymy etc.
  • introduction of near-synonym information into the
    Serbian WN using INTEX dictionaries (e.g.
    augmentatives/diminutives)
  • investigation of possibilities for introducing
    multilingual features into INTEX using the WN
    (to be used for parallel corpora)