Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge-Rich Resources

Learn more at: http://www.lrec-conf.org
1
Learning to Mine Definitions from Slovene
Structured and Unstructured Knowledge-Rich
Resources
  • Darja Fišer, Senja Pollak, Špela Vintar
  • University of Ljubljana, Dept. of Translation
    Studies
  • darja.fiser, spela.vintar_at_guest.arnes.si,
    senja.pollak_at_ff.uni-lj.si

2
Aim
  • Extract definitions of specialised concepts from
    texts (journals, textbooks etc.).
  • Use Wikipedia to learn rules that help
    distinguish between proper definitions and
    non-definitions.
  • Extract candidate sentences from texts using 3
    approaches:
      • patterns (A cell is the smallest living unit in
        an organism)
      • automatic term recognition
      • wordnet
  • Apply rules to select good definitions and
    discard non-definitions

3
Learning rules from Wikipedia
[Figure: example Wikipedia article; labels: title, definition, non-definition]

4
Learning rules from Wikipedia
  • Slovene Wikipedia (December 2009): 162,500
    articles
  • only well-formed pages retained
  • morphosyntactic annotation and lemmatization with
    ToTaLe (Erjavec et al. 2005)
  • structural parsing: 19,964 instances
  • building a classification model in Weka (Witten
    and Frank 2005)
  • features: most frequent PoS and lemmata
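
A minimal sketch of how "most frequent PoS and lemmata" could become feature vectors, in either absolute-frequency or binary form. The slides do not show the actual feature extraction, so the function names, the toy sentences and the cut-off k are all assumptions made for illustration.

```python
from collections import Counter

def top_items(sentences, k):
    """Return the k most frequent items (lemmata or PoS tags) over all sentences."""
    counts = Counter(item for sent in sentences for item in sent)
    return [item for item, _ in counts.most_common(k)]

def vectorize(sentence, vocabulary, binary=False):
    """Map one sentence to a vector over the vocabulary.

    binary=False gives absolute frequencies, binary=True gives 0/1 presence flags.
    """
    counts = Counter(sentence)
    if binary:
        return [1 if counts[v] > 0 else 0 for v in vocabulary]
    return [counts[v] for v in vocabulary]

# Hypothetical lemmatized sentences, not taken from the Wikipedia data set.
sentences = [
    ["celica", "biti", "enota"],
    ["ekvator", "biti", "enota", "biti"],
]

vocabulary = top_items(sentences, k=2)
absolute = vectorize(sentences[1], vocabulary)
binary = vectorize(sentences[1], vocabulary, binary=True)
```

The same two functions work unchanged if the sentences are lists of PoS tags instead of lemmata.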

5
Learning rules - Results
  • best: J48 decision tree classifier
  • experimenting with full and merged PoS, absolute
    frequency (AF) and binary values
  • 10-fold cross-validation

SETS        Instances  Attributes  NaiveBayes  J48    JRIP   PART
ORIG        19964      260         66.91       82.13  80.91  82.56
ORIG_bin    19964      260         73.85       82.2   80.6   81.88
MERGED      19964      188         62.64       82.72  81.68  82.72
MERGED_bin  19964      188         72.39       82.44  80.5   81.79

6
Extracting definitions from texts: Resources
  • unstructured texts: subset of the FidaPlus
    corpus (http://www.fidaplus.net)
  • knowledge-rich: textbooks, popular science
    volumes (e.g. All about mushrooms)
  • various domains: astronomy, physics, geography,
    botany ...
  • sloWNet: Slovene wordnet (Fišer 2007,
    http://lojze.lugos.si/darja/slownet.html)
  • Automatic term recognition system for Slovene
    (Vintar 2004,
    http://lojze.lugos.si/cgitest/extract.cgi)

7
Extracting definitions from texts: 1. Using
wordnet hyperonymy
  • The sentence is a definition candidate if:
      • the sentence starts with a sloWNet literal and
        contains at least one more literal from the same
        hyperonymy chain (i.e. its hyponym or its
        hypernym)
  • <term id="ENG20-13313485-n">Diabetes</term> je
    <term id="ENG20-13268088-n">bolezen</term>, ki je
    posledica pomanjkanja inzulina, hormona, ki
    skrbi, da celice v telesu dobivajo glukozo
    (sladkor).
  • Diabetes is a disease resulting from a deficiency
    of insulin, the hormone that ensures that the
    body's cells receive glucose (sugar).
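
The wordnet rule above can be sketched roughly as follows. This is only an illustration: the tiny HYPERNYM map stands in for sloWNet, literals are simplified to single lemmata, and all names are hypothetical.

```python
# Hypothetical miniature hypernym map: literal -> its direct hypernym.
HYPERNYM = {"diabetes": "bolezen", "bolezen": "stanje"}
LITERALS = set(HYPERNYM) | set(HYPERNYM.values())

def hypernym_chain(literal):
    """Return all hypernyms of a literal, nearest first."""
    chain, seen = [], set()
    while literal in HYPERNYM and literal not in seen:
        seen.add(literal)  # guard against cycles in the map
        literal = HYPERNYM[literal]
        chain.append(literal)
    return chain

def same_chain(a, b):
    """True if one literal is a (possibly indirect) hypernym of the other."""
    return b in hypernym_chain(a) or a in hypernym_chain(b)

def is_candidate(lemmas):
    """Sentence starts with a literal and contains another one from its chain."""
    if not lemmas or lemmas[0] not in LITERALS:
        return False
    return any(same_chain(lemmas[0], other) for other in lemmas[1:])

# Lemmatized opening of the example sentence.
example = ["diabetes", "biti", "bolezen", "ki", "biti", "posledica"]
found = is_candidate(example)
```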

8
Extracting definitions from texts: 2. Using
automatic term recognition
  • The sentence is a definition candidate if:
      • the sentence contains at least two
        domain-specific terms and the first term is in
        the nominative case
  • <term score="80.45">Ekvator</term> je najdaljši
    vzporednik, ki deli Zemljo na severno in
    <term score="43.21">južno poloblo</term>.
  • The Equator is the longest circle of latitude,
    dividing the Earth into the Northern and the
    Southern Hemispheres.
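
A rough sketch of the term-based rule, assuming the term recognizer and the morphosyntactic tagger have already marked up the sentence. The Span class and its fields are hypothetical and not the format of the actual system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    text: str
    is_term: bool = False       # marked as a domain-specific term
    case: Optional[str] = None  # grammatical case of the term, e.g. "nom", "acc"

def is_candidate(spans):
    """At least two domain-specific terms, and the first one is nominative."""
    terms = [s for s in spans if s.is_term]
    return len(terms) >= 2 and terms[0].case == "nom"

# The Equator example, split into term and non-term spans.
sentence = [
    Span("Ekvator", is_term=True, case="nom"),
    Span("je najdaljši vzporednik, ki deli Zemljo na severno in"),
    Span("južno poloblo", is_term=True, case="acc"),
]
found = is_candidate(sentence)
```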

9
Extracting definitions from texts: 3. Using
patterns
  • The sentence is a definition candidate if:
      • the sentence contains a defining morphosyntactic
        pattern (NP-nominative is_a NP-nominative).
  • NP is_a NP: Celica je strukturna in funkcionalna
    enota vseh živih organizmov.
  • A cell is a structural and functional unit of
    all living organisms.
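
The pattern check might look like the sketch below, assuming the sentence has already been chunked into (type, value) pairs where value is the case for noun phrases and the lemma for verbs. This representation is an assumption, and is_a is approximated by the copula lemma "biti" (the lemma of "je").

```python
def has_is_a_pattern(chunks):
    """Look for NP-nominative, copula, NP-nominative in three adjacent chunks."""
    for first, verb, second in zip(chunks, chunks[1:], chunks[2:]):
        if (first == ("NP", "nom")
                and verb == ("V", "biti")
                and second == ("NP", "nom")):
            return True
    return False

# "Celica je strukturna in funkcionalna enota vseh živih organizmov."
cell_sentence = [("NP", "nom"), ("V", "biti"), ("NP", "nom"), ("NP", "gen")]
found = has_is_a_pattern(cell_sentence)
```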

10
Results
  • manual evaluation of all definition candidates
  • sloWNet: best precision; ATR: best recall
  • what is a definition?

Method           Def. candidates  True definitions  Precision
sloWNet          104              41                0.39
ATR              629              118               0.19
Patterns         311              98                0.31
Total / Average  1044             257               0.29
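
Precision here is the number of true definitions divided by the number of candidates. The sketch below recomputes it from the counts in the table, as an unweighted mean over the three methods and as a pooled figure over all candidates. The variable names are illustrative, and the last digit may differ slightly from the table depending on rounding.

```python
# (definition candidates, true definitions) per method, from the table.
counts = {
    "sloWNet": (104, 41),
    "ATR": (629, 118),
    "Patterns": (311, 98),
}

def precision(candidates, true_definitions):
    """Share of extracted candidates that are true definitions."""
    return true_definitions / candidates

per_method = {name: precision(c, t) for name, (c, t) in counts.items()}

# Unweighted mean of the three per-method precisions.
macro_average = sum(per_method.values()) / len(per_method)

# Pooled precision over all candidates together.
total_candidates = sum(c for c, _ in counts.values())
total_true = sum(t for _, t in counts.values())
pooled = precision(total_candidates, total_true)

for name, value in per_method.items():
    print(f"{name}: {value:.3f}")
print(f"macro average: {macro_average:.3f}, pooled: {pooled:.3f}")
```

The unweighted mean comes out near the table's "Average" figure, while the pooled value is lower because the large ATR candidate set dominates.
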
11
Classification accuracy
                sloWNet  ATR    Patterns
MERGED J48      61.76    69.79  69.45
MERGED_bin J48  66.67    71.06  63.9
ORIG_bin J48    63.72    65.98  62.7

For definitions only
           sloWNet  ATR    Patterns
Precision  0.63     0.46   0.514
Recall     0.415    0.441  0.551
F-measure  0.5      0.452  0.532
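
The F-measure rows are consistent with the usual balanced F-measure, the harmonic mean of precision and recall. A small check using the precision and recall values from the table (the slides do not state which formula was used, so treat this as an assumption):

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) for the definition class, from the table.
scores = {
    "sloWNet": (0.63, 0.415),
    "ATR": (0.46, 0.441),
    "Patterns": (0.514, 0.551),
}

f_scores = {name: f_measure(p, r) for name, (p, r) in scores.items()}
for name, value in f_scores.items():
    print(f"{name}: {value:.3f}")
```

Small differences from the table are expected, since the published precision and recall values are themselves rounded.
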
12
Which is the best definition?
  • The Equator is an imaginary line on the Earth's
    surface equidistant from the North Pole and South
    Pole that divides the Earth into a Northern
    Hemisphere and a Southern Hemisphere.
  • An equator is the intersection of a sphere's
    surface with the plane perpendicular to the
    sphere's axis of rotation and containing the
    sphere's center of mass.
  • The longest of the five main circles of latitude
    on Earth (the others being the Arctic and
    Antarctic Circles and the Tropics of Cancer and
    Capricorn) is called the Equator.

13
Definitions depend on context... and may span
several sentences
  • Head lice are parasites that live in the hair and
    scalp of humans.
  • HEAD LICE, also called Pediculus Humanus Capitis
    are small blood-sucking, wingless insects found
    on the human scalp. They are approximately the
    size of a sesame seed and cannot jump or fly. 
    They are six-legged creatures with claws, which
    help them cling to and crawl through human hair. 
    Head lice are an emerging social problem, not
    only in economically poor countries but also in
    practically all other societies.

14
Conclusions and future work
  • Wikipedia can help us learn the properties of
    definitions,
  • Knowledge-rich texts are a good source of
    definitions,
  • A semantically-rich approach (using wordnet and
    ATR) yields many definitions and defining
    contexts.
  • Defining a definition is hard...
  • Encyclopaedic definitions differ from those found
    in running texts,
  • Future work:
      • use other features in learning,
      • use active learning,
      • redefine definitions and possibly re-evaluate
        definition candidates