Title: Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge-Rich Resources
1 Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge-Rich Resources
- Darja Fišer, Senja Pollak, Špela Vintar
- University of Ljubljana, Dept. of Translation Studies
- darja.fiser, spela.vintar_at_guest.arnes.si, senja.pollak_at_ff.uni-lj.si
2 Aim
- Extract definitions of specialised concepts from texts (journals, textbooks etc.).
- Use Wikipedia to learn rules that help distinguish between proper definitions and non-definitions.
- Extract candidate sentences from texts using 3 approaches:
  - patterns (A cell is the smallest living unit in an organism)
  - automatic term recognition
  - wordnet
- Apply rules to select good definitions and discard non-definitions.
3 Learning rules from Wikipedia
[Figure: Wikipedia article with the labels "title", "definition" and "non-definition"]
4 Learning rules from Wikipedia
- Slovene Wikipedia (December 2009): 162,500 articles
- only well-formed pages retained
- morphosyntactic annotation and lemmatization with ToTaLe (Erjavec et al. 2005)
- structural parsing: 19,964 instances
- building a classification model in Weka (Witten and Frank 2005)
- features: most frequent PoS and lemmata
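The feature step above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the token format (dicts with "lemma" and "pos"), the function names and the toy tags are assumptions, and the real input would come from ToTaLe. It builds a vocabulary of the most frequent PoS tags and lemmata and turns each sentence into an absolute-frequency or binary vector.

```python
from collections import Counter

def top_items(sentences, key, k):
    """Return the k most frequent values of `key` ('pos' or 'lemma') over all tokens."""
    counts = Counter(tok[key] for sent in sentences for tok in sent)
    return [item for item, _ in counts.most_common(k)]

def vectorize(sentence, pos_vocab, lemma_vocab, binary=False):
    """Map one tagged sentence to absolute-frequency (or binary) feature values."""
    pos_counts = Counter(tok["pos"] for tok in sentence)
    lemma_counts = Counter(tok["lemma"] for tok in sentence)
    values = [pos_counts[p] for p in pos_vocab]
    values += [lemma_counts[l] for l in lemma_vocab]
    return [int(v > 0) for v in values] if binary else values

# Toy, hand-tagged input (hypothetical tags, not ToTaLe output).
sentences = [
    [{"lemma": "celica", "pos": "N"}, {"lemma": "biti", "pos": "V"},
     {"lemma": "enota", "pos": "N"}],
    [{"lemma": "biti", "pos": "V"}, {"lemma": "velik", "pos": "A"}],
]
pos_vocab = top_items(sentences, "pos", 2)      # most frequent PoS tags
lemma_vocab = top_items(sentences, "lemma", 1)  # most frequent lemmata
af_vector = vectorize(sentences[0], pos_vocab, lemma_vocab)
bin_vector = vectorize(sentences[0], pos_vocab, lemma_vocab, binary=True)
```

Each vector, together with its definition / non-definition label, would form one instance for the classifier.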
5 Learning rules - Results
- best: J48 decision tree classifier
- experimenting with full and merged PoS, absolute frequency (AF) and binary values
- 10-fold cross-validation

SETS        Instances  Attributes  NaiveBayes  J48    JRIP   PART
ORIG        19964      260         66.91       82.13  80.91  82.56
ORIG_bin    19964      260         73.85       82.2   80.6   81.88
MERGED      19964      188         62.64       82.72  81.68  82.72
MERGED_bin  19964      188         72.39       82.44  80.5   81.79
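For readers unfamiliar with the evaluation protocol, 10-fold cross-validation can be sketched in plain Python as below. The experiments themselves were run in Weka; this is only an illustration of the idea, without stratification, and `k_fold_splits`, `cross_validate` and `majority_baseline` are hypothetical names introduced here.

```python
import random

def k_fold_splits(n_instances, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(instances, labels, train_fn, k=10):
    """Average accuracy (%) of the models built by train_fn over k folds."""
    accuracies = []
    for train, test in k_fold_splits(len(instances), k):
        model = train_fn([instances[i] for i in train], [labels[i] for i in train])
        correct = sum(model(instances[i]) == labels[i] for i in test)
        accuracies.append(100.0 * correct / len(test))
    return sum(accuracies) / len(accuracies)

def majority_baseline(train_x, train_y):
    """Stand-in learner: always predict the most common training label."""
    majority = max(set(train_y), key=train_y.count)
    return lambda x: majority

# Toy usage: 20 instances, 15 labelled "def" and 5 labelled "non".
toy_labels = ["def"] * 15 + ["non"] * 5
baseline_accuracy = cross_validate(list(range(20)), toy_labels, majority_baseline)
```

Every instance is used for testing exactly once, so the averaged accuracy reflects performance on data the model did not see during training.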
6 Extracting definitions from texts: Resources
- unstructured texts: subset of the FidaPlus corpus (http://www.fidaplus.net)
- knowledge-rich: textbooks, popular science volumes (e.g. All about mushrooms)
- various domains: astronomy, physics, geography, botany ...
- sloWNet: Slovene wordnet (Fišer 2007, http://lojze.lugos.si/darja/slownet.html)
- Automatic term recognition system for Slovene (Vintar 2004, http://lojze.lugos.si/cgitest/extract.cgi)
7 Extracting definitions from texts: 1. Using wordnet hyperonymy
- The sentence is a definition candidate if
  - the sentence starts with a sloWNet literal and contains at least one more literal from the same hyperonymy chain (i.e. its hyponym or its hypernym)
- <term id="ENG20-13313485-n">Diabetes</term> je <term id="ENG20-13268088-n">bolezen</term>, ki je posledica pomanjkanja inzulina, hormona, ki skrbi, da celice v telesu dobivajo glukozo (sladkor).
- Diabetes is a disease resulting from insulin deficiency, the hormone providing glucose (sugar) for body cells.
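The wordnet rule above might be sketched like this. The toy hierarchy, the single-word lemma matching and all function names are assumptions made for illustration; sloWNet itself is not queried, and the chain is read here as transitive (any ancestor or descendant counts).

```python
# Hypothetical toy hierarchy: literal -> its direct hypernym.
HYPERNYM = {"diabetes": "bolezen", "gripa": "bolezen", "bolezen": "stanje"}
LITERALS = set(HYPERNYM) | set(HYPERNYM.values())

def hypernym_chain(literal):
    """Return all literals above `literal` in the toy hierarchy."""
    chain = []
    while literal in HYPERNYM and HYPERNYM[literal] not in chain:
        literal = HYPERNYM[literal]
        chain.append(literal)
    return chain

def same_chain(a, b):
    """True if one literal is a (possibly indirect) hypernym of the other."""
    return b in hypernym_chain(a) or a in hypernym_chain(b)

def is_wordnet_candidate(lemmas):
    """Sentence (list of lemmas) starts with a literal and contains a chain-mate."""
    if not lemmas or lemmas[0] not in LITERALS:
        return False
    first = lemmas[0]
    return any(l in LITERALS and same_chain(first, l) for l in lemmas[1:])

diabetes_sentence = ["diabetes", "biti", "bolezen", "ki", "biti", "posledica"]
is_candidate = is_wordnet_candidate(diabetes_sentence)
```

Sibling literals such as "gripa" and "diabetes" share a hypernym but are not on the same chain, so a sentence containing only those two would not qualify under this reading.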
8 Extracting definitions from texts: 2. Using automatic term recognition
- The sentence is a definition candidate if
  - the sentence contains at least two domain-specific terms and the first term is in the nominative case
- <term score="80.45">Ekvator</term> je najdaljši vzporednik, ki deli Zemljo na severno in <term score="43.21">južno poloblo</term>.
- The Equator is the largest circle of latitude dividing the Earth into the Northern and the Southern Hemispheres.
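A minimal sketch of the ATR rule above, assuming the term recogniser and the tagger have already produced, for each sentence, its terms in order of appearance. The `Term` class and its `case` labels ("nom", "acc") are hypothetical; only the scores are taken from the example.

```python
from dataclasses import dataclass

@dataclass
class Term:
    text: str
    score: float  # termhood score, as in the term markup of the example
    case: str     # hypothetical case label from the tagger, e.g. "nom", "acc"

def is_atr_candidate(terms, min_terms=2):
    """At least `min_terms` terms, and the first one is in the nominative."""
    return len(terms) >= min_terms and terms[0].case == "nom"

equator = [Term("Ekvator", 80.45, "nom"), Term("južno poloblo", 43.21, "acc")]
is_candidate = is_atr_candidate(equator)
```

The rule is deliberately loose, which fits the large number of candidates this approach produces.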
9 Extracting definitions from texts: 3. Using patterns
- The sentence is a definition candidate if
  - the sentence contains a defining morphosyntactic pattern (NP_nominative is_a NP_nominative).
- NP is_a NP: Celica je strukturna in funkcionalna enota vseh živih organizmov.
- A cell is a structural and functional unit of all living organisms.
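One way to sketch the pattern above is a regular expression over a string of simplified tags. The tag scheme (N-nom, A-nom, C, V-biti), the shape of the noun phrase and the hand-tagged example are assumptions for illustration; the actual patterns operate on full morphosyntactic annotation.

```python
import re

# Nominative NP: optional (coordinated) nominative adjectives, then a noun.
NP_NOM = r"(?:A-nom (?:C A-nom )?)*N-nom"
# NP_nominative, copula "biti" (to be), NP_nominative.
IS_A = re.compile(rf"(?:^| ){NP_NOM} V-biti {NP_NOM}(?: |$)")

def tag_string(tokens):
    """tokens: (lemma, pos, case) triples -> space-separated simplified tags."""
    tags = []
    for lemma, pos, case in tokens:
        if pos == "V":
            tags.append(f"V-{lemma}")      # keep the verb lemma to spot the copula
        elif pos in ("N", "A"):
            tags.append(f"{pos}-{case}")   # nouns and adjectives carry their case
        else:
            tags.append(pos)
    return " ".join(tags)

def is_pattern_candidate(tokens):
    return IS_A.search(tag_string(tokens)) is not None

# Hand-tagged version of the example sentence (hypothetical tags).
cell = [("celica", "N", "nom"), ("biti", "V", ""), ("strukturen", "A", "nom"),
        ("in", "C", ""), ("funkcionalen", "A", "nom"), ("enota", "N", "nom"),
        ("ves", "P", "gen"), ("živ", "A", "gen"), ("organizem", "N", "gen")]
is_candidate = is_pattern_candidate(cell)
```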
10 Results
- manual evaluation of all definition candidates
- sloWNet: best precision, ATR: best recall
- what is a definition??

Approach         Def. candidates  True definitions  Precision
sloWNet          104              41                0.39
ATR              629              118               0.19
Patterns         311              98                0.31
Total / Average  1044             257               0.29
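The precision column can be recomputed from the counts as true definitions divided by candidates. The sketch below also shows two ways of summarising the three approaches: the mean of the three precisions (about 0.30, close to the 0.29 reported in the last row) and the pooled value 257/1044 (about 0.25). Which of the two the last row intends is an assumption here, not something the table states.

```python
results = {  # approach: (definition candidates, true definitions)
    "sloWNet": (104, 41),
    "ATR": (629, 118),
    "Patterns": (311, 98),
}

# Precision per approach = true definitions / candidates.
precision = {name: true / cand for name, (cand, true) in results.items()}

# Two possible summaries over the three approaches.
mean_precision = sum(precision.values()) / len(precision)
total_candidates = sum(cand for cand, _ in results.values())
total_true = sum(true for _, true in results.values())
pooled_precision = total_true / total_candidates

for name, p in precision.items():
    print(f"{name}: {p:.3f}")
print(f"mean of the three precisions: {mean_precision:.3f}")
print(f"pooled over all candidates: {pooled_precision:.3f}")
```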
11 Classification accuracy

Model           sloWNet  ATR    Patterns
MERGED J48      61.76    69.79  69.45
MERGED_bin J48  66.67    71.06  63.9
ORIG_bin J48    63.72    65.98  62.7

For definitions only

Measure    sloWNet  ATR    Patterns
Precision  0.63     0.46   0.514
Recall     0.415    0.441  0.551
F-measure  0.5      0.452  0.532
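As a quick consistency check, the F-measure row can be recomputed from the precision and recall rows, assuming the balanced F-measure (harmonic mean of precision and recall) is meant. The slide does not name the variant, so that choice is an assumption.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

reported = {  # approach: (precision, recall, reported F-measure)
    "sloWNet": (0.63, 0.415, 0.5),
    "ATR": (0.46, 0.441, 0.452),
    "Patterns": (0.514, 0.551, 0.532),
}
computed = {name: f_measure(p, r) for name, (p, r, _) in reported.items()}

for name, (p, r, f) in reported.items():
    print(f"{name}: computed {computed[name]:.3f}, reported {f}")
```

Small differences from the reported values are to be expected, since the precision and recall figures are themselves rounded.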
12 Which is the best definition?
- The Equator is an imaginary line on the Earth's surface equidistant from the North Pole and South Pole that divides the Earth into a Northern Hemisphere and a Southern Hemisphere.
- An equator is the intersection of a sphere's surface with the plane perpendicular to the sphere's axis of rotation and containing the sphere's center of mass.
- The longest of the five main circles of latitude on Earth (the others being the Arctic and Antarctic Circles and the Tropics of Cancer and Capricorn) is called the Equator.
13 Definitions depend on context... and may span several sentences
- Head lice are parasites that live in the hair and scalp of humans.
- HEAD LICE, also called Pediculus Humanus Capitis, are small blood-sucking, wingless insects found on the human scalp. They are approximately the size of a sesame seed and cannot jump or fly. They are six-legged creatures with claws, which help them cling to and crawl through human hair. Head lice are an emerging social problem, not only in economically poor countries but also in practically all other societies.
14 Conclusions and future work
- Wikipedia can help us learn the properties of definitions.
- Knowledge-rich texts are a good source of definitions.
- A semantically-rich approach (using wordnet and ATR) yields many definitions and defining contexts.
- Defining a definition is hard...
- Encyclopaedic definitions differ from those found in running texts.
- Future work:
  - use other features in learning,
  - use active learning,
  - redefine definitions and possibly re-evaluate definition candidates.