SCHEMAS Workshop 3 Introduction - PowerPoint PPT Presentation

1
Extraction of Ontological Information
from Lexicon and Corpora
Dimitrios Kokkinakis, Maria Toporowska Gronostaj
2
Motto
  • To process information you need information
  • P. Vossen, 2003

3
Content
  • Introduction
  • Background
  • Language resources
  • Methodology
  • Lexicon-driven extraction of ontological data
  • Corpus-driven extraction of ontological data
  • Conclusions

4
Background
  • What is ontological information?
  • information necessary for making
    common-sense-like inferences based on our
    knowledge of the world
  • How is it represented?
  • in the form of structured sets of conceptual
    types, often including the semantic relations
    underlying them
  • Where?
  • SIMPLE-ontology, EWN, LexiQuest

5
Background
  • Why is ontological information relevant for NLP?
  • promotes the development of lexicon resources
    aimed at text understanding, since it offers a
    means of disambiguation
  • provides knowledge needed in:
  • machine translation (MT)
  • information retrieval (IR)
  • information extraction (IE)
  • summarization
  • computer aided language learning (CALL)
  • enables communication on the Semantic Web

6
Background
  • What is meant by semi-automatic extraction of
    ontological information (OI)?
  • some human intervention is involved in
    information processing to maximize its effects
  • What will we achieve with it?
  • enhance the content of the Swedish SIMPLE lexicon
    in a quick and cost-effective way
  • investigate lexicon-driven and corpus-driven
    methodologies

7
Methodology in general (1)
  • Methodological assumptions
  • lexical databases, MRD lexica and corpora can be
    mined for ontological information
  • relevant factors in information processing
  • resource size
  • degree of extractability
  • implicitness and explicitness of information
  • bootstrapping

8
Methodology in general (2)
  • Approach: text data mining (TDM)
  • TDM is a process of exploratory data analysis
    using text that leads to the discovery of
    heretofore unknown information, or to answers to
    questions for which the answer is not currently
    known (Mitkov 2003, Hearst 2003)
  • Result: an evolutionary lexicon model
  • output data are reused to discover new data,
    which leads to a successive enlargement of the
    lexicon

9
Language resources SIMPLE-SE (1)
  • Corpora
  • 150 million words in Språkbanken
  • Lexicon resources
  • SIMPLE-SE lexicon
  • GLDB (Göteborg lexical database)
  • SEMNET

10
Language resources SIMPLE-SE (2)
  • About SIMPLE-SE
  • computational lexicon with explicit ontological
    information (OI)
  • 10 000 lexicon units
  • 7 000 nouns, 2 000 verbs, 1 000 adjectives
  • manually annotated with semantic and ontological
    information, which is linked to the
    morphosyntactic information in the PAROLE lexicon
  • multidimensional

11
Language resources SIMPLE-SE (3)
  • SIMPLE-SE supports
  • word sense disambiguation
  • kastanji 1/1/0 FRUIT
  • kastanji 1/1/1 PLANT
  • kastanji 1/1/2sms COLOUR
  • kastanji 1/1/3 FOOD
  • kastanji 1/2/0 ORGANIC OBJECT
  • finding regular polysemy
  • creating multilingual links between lexicons

12
Language resources SIMPLE-SE (4)
  • SIMPLE-SE supports
  • text annotation
  • text data mining: knowledge-based information
    processing
  • evaluation
  • pattern matching based on the ontological
    information assigned to arguments (selection
    restrictions/preferences)

13
Language resources SIMPLE-SE (5)
  • selection-restriction-based pattern matching
  • Word/expression | Position | Ontological term
  • injicera (inject) | object | Substance
  • bebo (inhabit) | object | Area
  • griljera (roast) | object | Food
  • förlova sig (become engaged) | subject, prepositional object | Human
  • devalvera (devaluate) | object | Money
  • ha ont i (have pain in) | prepositional object | Body part
14
Language resources GLDB
  • Göteborg lexical database, GLDB
  • 67 000 core senses with a stringent definition
    format
  • implicit but extractable genus proximum (genus
    word)
  • implicit ontological information about arguments
    in definition extensions
  • 35 000 explicit references to semantic relations
    such as synonymy, antonymy, hyperonymy,
    hyponymy and cohyponymy

15
Language resources SEMNET (1)
  • SEMNET hyperonymic taxonomy
  • Extraction of hyperonymy relations from GLDB
    definitions
  • (methodology and software: Y. Cederholm, 1999)
  • Recognition of headwords (genus proximum) in
    definitions

16
Language resources SEMNET (2)
  • Input data
  • GLDB definitions
  • 44 915 noun lexemes
  • 10 082 verb lexemes
  • Two analysis methods that complement each other

17
Language resources SEMNET (3)
  • Method I
  • distinguishing typical definition patterns for
    core senses
  • (see overhead/handout from Cederholm Y. 1999,
    Tabell 1. Definitionsformler)
  • pattern matching against non-lemmatized
    definitions (using regular expressions)
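Method I can be illustrated with a small regular-expression matcher. This is a sketch under stated assumptions: the two patterns and the sample definitions in the test are invented for illustration and are not the definition formulas from Cederholm (1999); `genus_by_pattern` is a hypothetical name.

```python
import re

# Hypothetical definition patterns (NOT the formulas in Cederholm 1999).
PATTERNS = [
    re.compile(r"^(?:en|ett)\s+(?P<genus>\w+)\s+som\b"),  # "en person som ..."
    re.compile(r"^(?:en|ett)\s+(?P<genus>\w+)\s+för\b"),  # "ett redskap för ..."
]


def genus_by_pattern(definition):
    """Return the genus word captured by the first matching pattern,
    or None when no pattern applies to the definition."""
    text = definition.strip().lower()
    for pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group("genus")
    return None
```

Definitions that fit none of the patterns return `None`, which is the gap the second method is meant to fill.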

18
Language resources SEMNET (4)
  • Method II
  • Input: lemmatized definitions
  • Assumption
  • the genus word is the first word in the definition
    that matches the part of speech of the headword
    (the word being defined)
  • Method II also finds genus words that Method I
    cannot parse
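The Method II assumption amounts to a single scan over a tagged definition. A minimal sketch, assuming the definition is already lemmatized and tagged as `(lemma, part_of_speech)` pairs; the tag names, the function name `genus_by_pos` and the sample definition are illustrative and not taken from GLDB.

```python
def genus_by_pos(headword_pos, tagged_definition):
    """Return the first lemma in the definition whose part of speech
    equals that of the headword, or None if there is none."""
    for lemma, pos in tagged_definition:
        if pos == headword_pos:
            return lemma
    return None


# Invented example: a noun headword defined as "stor anordning för lyft".
definition = [
    ("stor", "ADJ"),
    ("anordning", "NOUN"),
    ("för", "PREP"),
    ("lyft", "NOUN"),
]
genus = genus_by_pos("NOUN", definition)
```

Because the rule stops at the first match, a modifier noun that precedes the true genus word would be picked up by mistake, which is one plausible reason why this method yields more analyses but a lower share of correct ones.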

19
Language resources SEMNET (5)
  • Analysis results for nouns
  • Method | Analyses (total) | Correct analyses
  • Method I | 8127 (64%) | 7141 (56%)
  • Method II | 12 194 (95%) | 8974 (70%)
  • Methods I + II | 12 528 (98%) | 10 536 (83%)
  • (evaluation based on 12 786 manually annotated
    noun genus words)
  • Approximated result for ca. 45 000 nouns in genus
    position:
  • 36 500 correctly recognised noun genus words
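The counts above also allow a figure the slide does not report: the share of produced analyses that are correct, for each method. A small sketch using only the numbers from the table; the function name `correct_share` is an assumption.

```python
# Counts from the results table: method -> (analyses, correct analyses).
RESULTS = {
    "Method I": (8127, 7141),
    "Method II": (12194, 8974),
    "Methods I + II": (12528, 10536),
}


def correct_share(analysed, correct):
    """Percentage of the produced analyses that are correct."""
    return 100 * correct / analysed


shares = {
    method: correct_share(analysed, correct)
    for method, (analysed, correct) in RESULTS.items()
}

for method, share in shares.items():
    print(f"{method}: {share:.1f}% of analyses correct")
```

Read this way, Method I analyses fewer definitions but is right more often when it does, which fits the description of the two methods as complementary.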

20
Language resources SEMNET (6)
  • The 33 most frequent noun genus words in SEMNET
  • 2702 person; 858 typ; 612 del
  • 461 anordning; 314 område; 261 kvinna
  • 228 tillstånd; 219 lära; 217 titel
  • 207 grupp; 183 föremål; 173 sammanfattning
  • 172 mängd; 169 sätt; 167 plats
  • 166 system; 165 växt; 162 ämne
  • 153 apparat; 145 förmåga; 133 medlem
  • 128 språk; 122 stycke; 122 redskap
  • 122 plats; 119 känsla; 118 form
  • 116 metod; 116 handling; 113 enhet
  • 111 ljud; 110 instrument; 102 verksamhet

21
Language resources SEMNET (7)
  • Hyperonymy taxonomy: sjukdom (disease)
  • -- 1 akutfall 1/1
  • -- 2 almsjuka 1/1
  • -- 3 astma 1/1
  • -- 4 avitaminos 1/1
  • -- 5 basedow 1/1
  • -- 6 bladrullsjuka 1/1
  • -- 7 blodkräfta 1/1
  • -- 8 blodsjukdom 1/1
  • -- 9 blödarsjuka 1/1
  • ... (66 hyponyms in total)

22
Definition-driven extraction of ontological
information (1)
  • Resources: SIMPLE-SE, SEMNET, GLDB
  • Methodological assumptions
  • A hyperonymic taxonomy in combination with the
    ontological information in SIMPLE-SE supports
    semi-automatic extraction of ontological
    information
  • Procedure
  • Preparatory phase, relevant for all ontological
    processing: annotate GLDB data with the
    ontological information from SIMPLE-SE to
    generate an ontologically enriched SEMNET

23
Definition-driven extraction of ontological
information (2)
  • Methodological assumptions (cont.)
  • The extracted ontological information is an
    approximation of the ontological category until
    verified with other methods, e.g. a
    corpus-driven methodology, semantic/ontological
    data from GLDB, or pattern matching based on
    selection restrictions
  • Since the annotated words in SIMPLE cover both
    hyperonyms and hyponyms, two methods are proposed
    here, each focusing on one of these semantic
    categories

24
Definition-driven extraction of ontological
information (3)
  • Method I
  • from annotated hyponyms to new annotations of
    hyperonyms
  • Assumption
  • One can approximate the ontological category of a
    hyperonym given some information on its hyponyms
    and using the structural knowledge inherent in
    the ontology
  • A hyperonym can be annotated if all of its
    annotated hyponyms share the same ontological
    tag, or if their tags share a common
    superordinate tag other than Entity, which is
    ontologically heterogeneous and thus relatively
    uninformative

25
Definition-driven extraction of ontological
information (4)
  • Method I example
  • Hyponyms (known info)
  • diabetes: Disease | cat: Air_animal
  • asthma: Disease | dog: Air_animal
  • cholera: Disease | fish: Water_animal
  • Hyperonym (new info)
  • disease > Disease | animal > Animal
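The rule behind this example can be sketched as a search for the most specific tag shared by all hyponyms, rejecting Entity. The child-to-parent links below are a toy fragment assumed for illustration (the real SIMPLE ontology is larger and is not reproduced here), and the function names are hypothetical.

```python
# Toy ontology fragment as child -> parent links (an assumption).
PARENT = {
    "Air_animal": "Animal",
    "Water_animal": "Animal",
    "Animal": "Entity",
    "Disease": "Entity",
}


def ancestors(tag):
    """Return the tag followed by its superordinates, most specific first."""
    chain = [tag]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def hyperonym_tag(hyponym_tags):
    """Most specific tag shared by all hyponyms, or None when there are no
    hyponyms or the only shared tag is the uninformative Entity."""
    if not hyponym_tags:
        return None
    chains = [ancestors(tag) for tag in hyponym_tags]
    for candidate in chains[0]:
        if all(candidate in chain for chain in chains[1:]):
            return None if candidate == "Entity" else candidate
    return None


disease = hyperonym_tag(["Disease", "Disease", "Disease"])
animal = hyperonym_tag(["Air_animal", "Air_animal", "Water_animal"])
```

With the tags from the example, the first call yields Disease and the second yields Animal; a mix such as Disease and Air_animal meets only at Entity and is left unannotated.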

26
Definition-driven extraction of ontological
information (5)
  • Method II
  • from annotated hyperonyms to new annotations of
    hyponyms
  • Assumption (resulting in approximation)
  • Direct hyponyms (hyponyms which are directly
    subordinated to the genus word/hyperonym)
    automatically inherit the ontological category of
    their hyperonyms; manual annotation of the most
    frequent genus words/hyperonyms can therefore
    be recommended and justified.
  • hyperonym (known info) > hyponyms (new info)
  • myntenhet: Money > dollar, krona, pund,
    rubel...: Money
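One step of Method II is a copy of the hyperonym's tag to its direct hyponyms. A minimal sketch, assuming the taxonomy is a dictionary from genus word to direct hyponyms; the function name and data layout are illustrative, and existing annotations are deliberately left untouched.

```python
def annotate_direct_hyponyms(hyperonym, tag, taxonomy, annotations):
    """Copy `tag` to each direct hyponym that has no annotation yet and
    return the list of newly annotated words."""
    added = []
    for word in taxonomy.get(hyperonym, []):
        if word not in annotations:
            annotations[word] = tag
            added.append(word)
    return added


# Data from the slide: myntenhet is annotated, its hyponyms are not.
taxonomy = {"myntenhet": ["dollar", "krona", "pund", "rubel"]}
annotations = {"myntenhet": "Money"}
new_words = annotate_direct_hyponyms("myntenhet", "Money", taxonomy, annotations)
```

Skipping words that already carry a tag keeps manual annotations from being overwritten by inherited ones.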

27
Definition-driven extraction of ontological
information (6)
  • The assumption has far reaching consequences for
    all those annotated hyponymic words which also
    occur as genus words, since their subordinates
    can automatically inherit the ontological class
    from the hyperonym/genus word.
  • Cascade effect
  • sjukdom (disease): 66 hyponyms
  • infektionssjukdom: 25 hyponyms
  • könssjukdom: 4 hyponyms
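The cascade effect corresponds to following hyponymy links transitively from an annotated genus word. A sketch using a breadth-first walk; the small taxonomy is a toy assumption (in particular, placing "kolera" under "infektionssjukdom" is for illustration only), and `cascade` is a hypothetical name.

```python
from collections import deque


def cascade(root, tag, taxonomy):
    """Breadth-first walk from `root`; returns {word: (tag, depth)} for every
    word reachable through hyponymy links, with the root at depth 0."""
    annotated = {root: (tag, 0)}
    queue = deque([root])
    while queue:
        word = queue.popleft()
        depth = annotated[word][1]
        for hyponym in taxonomy.get(word, []):
            if hyponym not in annotated:  # also guards against cycles
                annotated[hyponym] = (tag, depth + 1)
                queue.append(hyponym)
    return annotated


# Toy taxonomy (assumed nesting, for illustration only).
taxonomy = {
    "sjukdom": ["astma", "infektionssjukdom"],
    "infektionssjukdom": ["kolera"],
}
result = cascade("sjukdom", "Disease", taxonomy)
```

Recording the depth makes it possible to treat annotations inherited over several levels with more caution than first-level ones.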

28
Definition-driven extraction of ontological
information (7)
  • Cascade distribution of the ontological type
    Animal
  • djur 102 hyponyms
  • hovdjur 10
  • ryggradsdjur 8
  • fågel 98
  • däggdjur 18
  • Note: the 80 most frequent genus words, when
    ontologically annotated, give rise to 11 000
    automatically annotated genus words at the first
    hyponymy level. This number increases further due
    to the cascade effect.

29
Definition-driven extraction of ontological
information (8)
  • 2702 person 1/1 person HUMAN
  • 461 anordning 1/1 device ARTIFACT
  • 314 område 1/1 area AREA (> LOCATION)
  • 261 kvinna 1/1 woman HUMAN
  • 238 tillstånd 1/ state STATE
  • 219 lära 1/1 doctrine DOMAIN
  • 217 titel 1/1 title SOCIAL_STATUS (> HUMAN)
  • 183 föremål 1/ thing CONCRETE_ENTITY
  • 169 sätt 1/1 manner CONSTITUTIVE
  • 167 plats 1/1 or 4 place LOCATION
  • 166 system 1/1 system CONSTITUTIVE
  • 165 växt 1/1 plant PLANT

30
Conclusion
  • Ontological annotations are approximations. They
    need to be verified against manually annotated
    data and/or by means of a corpus-driven
    methodology for extracting ontological information
  • The status of ontological annotations needs to be
    explicitly specified in the database
  • Method I (from hyponyms to hyperonyms) seems to
    complement Method II (from hyperonyms to
    hyponyms), since the range of annotated categories
    increases rapidly
  • The quality (and quantity) of the lexical
    resources used determine the precision of the
    acquired ontological results
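One way to make the status of an annotation explicit is to store it as a field of each record. The following is only a sketch of such a record: the status values, the field names and the sense id "1/1" for "dollar" are assumptions for illustration and do not describe the actual SIMPLE-SE or GLDB schema.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    MANUAL = "manual"            # annotated by a lexicographer
    INHERITED = "inherited"      # copied down from a hyperonym (Method II)
    GENERALIZED = "generalized"  # derived from hyponyms (Method I)
    VERIFIED = "verified"        # automatic tag confirmed by another method


@dataclass
class Annotation:
    word: str
    sense: str
    tag: str
    status: Status
    source: str = ""  # e.g. the genus word the tag was inherited from

    def needs_review(self) -> bool:
        """Automatic annotations remain approximations until verified."""
        return self.status in (Status.INHERITED, Status.GENERALIZED)


inherited = Annotation("dollar", "1/1", "Money", Status.INHERITED, "myntenhet")
manual = Annotation("myntenhet", "1/1", "Money", Status.MANUAL)
```

Keeping the source word alongside the status would let a reviewer trace an inherited tag back to the genus word it came from.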

31
Conclusion (contd.)
  • To prevent overgeneration of incorrect
    ontological annotations, special attention needs
    to be paid to:
  • disambiguation of polysemous and homographic
    genus words (hyperonyms)
  • krona: Artifact, Money, Part
  • analysis of compound nouns
  • gosedjur (cuddly toy): Artifact vs. husdjur (pet): Animal
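A simple guard against overgeneration is to propagate a tag only from genus words that carry exactly one ontological tag. This sketch covers the polysemy case only, not compound analysis; the function name and the data layout are assumptions.

```python
def safe_to_propagate(genus_word, genus_tags):
    """A genus word is treated as safe only when it carries exactly one
    ontological tag; unknown or ambiguous words are held back."""
    return len(genus_tags.get(genus_word, set())) == 1


genus_tags = {
    "krona": {"Artifact", "Money", "Part"},  # polysemous, from the slide
    "myntenhet": {"Money"},
}
```

Genus words that fail the check would be queued for manual sense disambiguation instead of passing a tag on to their hyponyms.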

32
References
  • Cederholm, Y. 1999. Automatisk konstruktion av en
    hyperonymitaxonomi baserad på definitioner i
    GLDB. In Från dataskärm och forskarpärm. MISS 25.
    Göteborgs universitet.
  • Hearst, M. 2003. Text Data Mining. In R. Mitkov
    (ed.), The Oxford Handbook of Computational
    Linguistics. Oxford: Oxford University Press.
  • Mitkov, R. (ed.) 2003. The Oxford Handbook of
    Computational Linguistics. Oxford: Oxford
    University Press.
  • Vossen, P. 2003. Ontologies. In R. Mitkov (ed.),
    The Oxford Handbook of Computational Linguistics.
    Oxford: Oxford University Press.
  • About SIMPLE, see http://spraakbanken.gu.se