An Empirical Methodology for Word Sense Creation

1
An Empirical Methodology for Word Sense Creation

Eduard Hovy
Information Sciences Institute
University of Southern California
www.isi.edu/hovy

2
How many colours are there?
colour
light, dark
> red
> blue
> yellow/green
> green/yellow
> brown
3
Why? Microtheories of colour
4
Microtheorizing

Straightforward approach
Concepts: collect terms, group, taxonomize
Lexicons: collect words, connect
Roughly 1-1 concepts-to-words
Problem: close but non-identical meaning overlaps
across languages' terms
Solution: complex mappings (e.g., EuroWordNet's
ILI)
Microtheory approach
Understand phenomenon
Concepts: basic primitives of theory
Lexicons: words defined in terms of primitives
Push the complexity into the lexicon
Problem: which microtheory? Need one for each
meaning: complex!
Solution: ? (See Mikrokosmos (Nirenburg and
Raskin 2000))

5
Toward building a lexicon of meanings or
concepts

In practice, you can't build microtheories for
everything
So you have to build more of a terminology bank
than a true ontology
Start with words (terms in the domain, in the
dictionary, etc.)
then separate out word senses and bring together
synonyms
and then group together similar meaning clusters
for later (ontology inheritance and inference)
while recording the essential features that
differentiate the groups from each other
(hopefully, some not-too-large set of features)
What does this involve?

6
Let's do an exercise

Create your own ontology of the following 24
words:
apples, beans, beef, bread, cake, carrots,
cheese, cookies, eggs, ground beef, kimchi,
mushrooms, peaches, peas, pies, pork, potatoes,
pudding, rice, sausages, scrambled eggs, toast,
tomatoes, wheat
build the taxonomy
provide some important characteristics

7
What did you do?

The easy part
vegetables, fruits, meats: but what of tomatoes?
Is your experience right, or are the biologists
right?
Or, what about
starches, proteins, greens: this is what's
inside. Is this a better organization? Should
you have both?
The harder part
Eggs and scrambled eggs; milk and cheese; pies
Methods of preparation: define somewhere else,
and then somehow apply this to the basic
foodstuffs?
What is right?
If I organized them by color or size, would that
be wrong? By sweetness? What if I were
diabetic?

8
Decisions when ontologizing

Should you create the term?
Where should the term go relative to the other
terms? (species)
What is special/unique/different about this term?
(differentium/ae)
How do you know you're right?
How do you decide between two or more
alternatives?
Should you record (and re-use) the differentiae?

Store everything in an ontology
9
The zones of ontologies
10
Outline

Problems when creating concepts
Ontological Semantics
The high-level abstractions
The content-level concepts
The OntoNotes project
A methodology for creating word senses
From senses to concepts
The Omega Ontology
Conclusion

11
CYC

Creator: Lenat (CYCorp, Austin, Texas), since
1990s
CYC: largest and richest ontology; millions of
axioms
ResearchCYC: 25571 concepts
RCYC (which was translated into RDF) omits all
second-order concept expressions (for example,
functional operator expressions), and so it has a
lot of missing supers
R-CYC in Omega
Lexical items not in Omega yet
Missing supers require a dummy root, Protocol Root,
which now has 95 arbitrary child concepts

12
Top of ResearchCYC

Largest and most developed ontology
Principally aimed at AI: inference-heavy and
NL-light
Very tangled network: hard to understand and use
unless you absorb CYC's philosophy /
methodology
Full CYC, for sale: over 1M axioms
Has been tried by many research groups in the
past; successful adoption rate low

13
SUMO

Creator: Pease and consortium; recent (USA)
Suggested as standard: Suggested Upper Merged
Ontology
1653 concepts
No lexical items
Adopts more traditional KR-style / Description
Logic approach, with lots of internal reasoning
mechanism constructs
More hierarchical than RCYC or DOLCE, with few
uplinks pointing to expressions
Omega version not complete and a little buggy

14
Some top parts of SUMO

Caserole
Agent-rel
Patient
Result
Resource
ResourceUsed
Instrument
ComputerRunning
StandardOutputDevice
DataProcessed
Experiencer
Origin
Destination
Direction
Path

Entity
Physical
Object
Process
Financial Asset
Abstract
Quantity
Graph
Attribute
Thing
Permission
Obligation
InheritableSumorelation
IntentionalSumorelation
BinarySumoRelation
CaseRole
Product
Computational System

15
CYC Event and SUMO Process

Definition (CYC Event): An important
specialization of Situation and thus also of
IntangibleIndividual and TemporallyExistingThing
(qq.v.). Each instance of Event is a dynamic
situation in which the state of the world
changes; each instance is something one would say
"happens". Events are intangible because they are
changes per se, not tangible objects that effect
and undergo changes. Notable specializations of
Event include Event-Localized, PhysicalEvent,
Action, and GeneralizedTransfer. Events should
not be confused with TimeIntervals (q.v.). The
temporal bounds of events are delineated by time
intervals, but in contrast to many events, time
intervals have no spatial location or extent.
Definition (SUMO Process): Intuitively, the class
of things that happen and have temporal parts or
stages. Examples include extended events like a
football match or a race, actions like Searching
and Reading, and biological processes. The formal
definition is: anything that lasts for a time but
is not an Object. Note that a Process may have
participants "inside" it which are Objects, such
as the players in a football match. In a 4D
ontology, a Process is something whose
spatiotemporal extent is thought of as dividing
into temporal stages roughly perpendicular to the
time-axis.

16
DOLCE

Author: Guarino et al. (LADSEB, Italy)
Upper Model only: focus on very abstract
conceptualizations
Approx. 500 concepts
No lexical items
Defined in XML/RDF (very clean and precise use of
formalism)
For Omega: DOLCE was converted from XML/RDF,
omitting some expressions Omega can't yet handle
(e.g., most superclass relations have as range
some expression, i.e., one that uses
intersection, union, a quantifier, or a
restriction object). So DOLCE doesn't appear
very hierarchical in Omega
Some very nice things in DOLCE, but it's hard to
see when nearly everything looks like a top level
object

17
DOLCE examples: Particular, Endurant

Particular
Perdurant
Event
Accomplishment
Achievement
Stative
Process
Endurant
Abstract
Region
Abstract-region
Physical-region
Quality
Physical-quality
Temporal-quality

18
RDF definition for Endurant
19
DOLCE Process

Definition: Within stative occurrences, we
distinguish between states and processes
according to homeomericity: sitting is classified
as a state but running is classified as a
process, since there are (very short) temporal
parts of a running that are not themselves
runnings. In general, processes differ from
situations because they are not assumed to have a
description from which they depend. They can be
sequenced by some course, but they do not require
a description as a unifying criterion. On the
other hand, at any time, one can conceive a
description that asserts the constraints by which
a process of a certain type is such, and in this
case, it becomes a situation. Since the decision
of designing an explicit description that unifies
a perdurant depends on context, task, interest,
application, etc., when aligning an ontology to
DLP, there can be indecision on where to align a
process-oriented class. For example, in the
WordNet alignment, we have decided to put only
some physical processes under process, e.g.
organic process, in order to stress the social
orientedness of DLP. But whereas we need to talk
explicitly of the criteria by which we conceive
organic processes, these will be put under
situation. Similar considerations are made for
the other types of perdurants in DOLCE. A
different notion of event (dealing with change)
is currently investigated for further
developments: being "achievement",
"accomplishment", "state", "event", etc. can
also be considered "aspects" of processes or of
parts of them. For example, the same process
rock erosion in the Sinni valley can be
conceptualized as an accomplishment (what has
brought the current state that e.g. we are trying
to explain), as an achievement (the erosion
process as the result of a previous
accomplishment), as a state (if we collapse the
time interval of the erosion into a time point),
or as an event (what has changed our focus from a
state to another). In the erosion case, we could
have good motivations to shift from one aspect to
another: a) causation focus, b) effectual focus,
c) condensation, d) transition (causality). If we
want to consider all the aspects of a process
together, we need to postulate a unifying
descriptive set of criteria (i.e. a
description), according to which that process
is circumstantiated in a situation. The
different aspects will arise as parts of the same
situation.

20
Sowa's top-level ontology

From Knowledge Representation (1999)
Lattice structure of 12 concepts: all
combinations of three differentiae
Firstness/secondness/thirdness
(man/husband/marriage)
Physical/abstract
Continuant/occurrent (object/process)
Obtained from C.S. Peirce and other historical
philosophers
Idea: form a basis into which all ontologies (or
their Upper Models) can be inserted
This is a pretty idea, but not used in practice
HOWEVER: local lattices are a good idea, used in
Omega to overcome the differentium order problem
(see lecture 1)
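
The "12 concepts" count follows from crossing the three differentiae: three values for firstness/secondness/thirdness times two for physical/abstract times two for continuant/occurrent. A minimal sketch of that enumeration follows; the dictionary keys are illustrative labels chosen here, and the value names Independent/Relative/Mediating are taken from the definitions on the next slide rather than from this one.

```python
from itertools import product

# The three differentiae named on the slide. The keys are illustrative
# labels (an assumption), not Sowa's own terminology.
DIFFERENTIAE = {
    "firstness": ("Independent", "Relative", "Mediating"),
    "physicality": ("Physical", "Abstract"),
    "persistence": ("Continuant", "Occurrent"),
}

def lattice_categories():
    """Return every combination of one value per differentium."""
    return list(product(*DIFFERENTIAE.values()))

categories = lattice_categories()
for combo in categories:
    print(" + ".join(combo))
print(len(categories))  # 3 * 2 * 2 combinations
```

Because every combination is generated regardless of the order in which the differentiae are listed, a lattice of this kind sidesteps the question of which differentium should be applied first.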

21
Sowa top ontology

Physical (P). An entity that has a location in
space-time.
Abstract (A). Pure information as distinguished
from any particular encoding of the information
in a physical medium. Formally, Abstract is a
primitive that satisfies the following axioms:
No abstraction has a location in space:
$\neg(\exists x{:}\mathrm{Abstract})(\exists y{:}\mathrm{Place})\,\mathrm{loc}(x,y)$.
No abstraction occurs at a point in time:
$\neg(\exists x{:}\mathrm{Abstract})(\exists t{:}\mathrm{Time})\,\mathrm{pTime}(x,t)$.
Continuant (C). An entity whose identity
continues to be recognizable over some extended
interval of time.
Occurrent (O). An entity that does not have a
stable identity during any interval of time.
Formally, Occurrent is a primitive that satisfies
the following axioms: The temporal parts of an
occurrent, which are called stages, exist at
different times. The spatial parts of an
occurrent, which are called participants, may
exist at the same time, but an occurrent may have
different participants at different stages. There
are no identity conditions that can be used to
identify two occurrents that are observed in
nonoverlapping space-time regions.
Independent (I). An entity characterized by some
inherent Firstness, independent of any
relationships it may have to other entities.
Formally, Independent is a primitive for which
the has-test of Section 2.4 need not apply. If x
is an independent entity, it is not necessary
that there exists an entity y such that x has y
or y has x:
$(\forall x{:}\mathrm{Independent})\,\neg\Box(\exists y)(\mathrm{has}(x,y) \lor \mathrm{has}(y,x))$.
Relative (R). An entity in a relationship to some
other entity. Formally, Relative is a primitive
for which the has-test must apply:
$(\forall x{:}\mathrm{Relative})\,\Box(\exists y)(\mathrm{has}(x,y) \lor \mathrm{has}(y,x))$. For any
relative x, there must exist some y such that x
has y or y has x.
Mediating (M). An entity characterized by some
Thirdness that brings other entities into a
relationship. An independent entity need not have
any relationship to anything else, a relative
entity must have some relationship to something
else, and a mediating entity creates a
relationship between two other entities. An
example of a mediating entity is a marriage,
which creates a relationship between a husband
and a wife. According to Peirce, the defining
aspect of Thirdness is the conception of
mediation, whereby a first and a second are
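brought into relation. That property could be
expressed in second-order logic:
$(\forall m{:}\mathrm{Mediating})(\forall x,y{:}\mathrm{Entity})\big((\exists R,S{:}\mathrm{Relation})(R(m,x) \land S(m,y)) \supset \Box(\exists T{:}\mathrm{Relation})\,T(x,y)\big)$.
This formula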
says that for any mediating entity m and any
other entities x and y, if there exist relations
R and S that relate m to x and m to y, then it is
necessarily true that there exists some relation
T that relates x to y. For example, if m is a
marriage, R relates m to a husband x, S relates m
to a wife y, then T relates the husband to the
wife (or the wife to the husband).

22
Sowa case roles 1

A determinant participant determines the
direction of the process, either from the
beginning as the initiator or from the end as the
goal.
An immanent participant is present throughout the
process, but does not actively control what
happens.
A source must be present at the beginning of the
process, but need not participate throughout the
process.
A product must be present at the end of the
process but need not participate throughout the
process.

23
Sowa case roles 2

Initiator corresponds to Aristotle's efficient
cause, whereby a change or a state is initiated
(1013b23).
Resource corresponds to the material cause, which
is the matter or the substrate (hypokeimenon)
(983a30).
Goal corresponds to the final cause, which is
the purpose or the benefit; for this is the goal
(telos) of any generation or motion (983a32).
Essence corresponds to the formal cause, which is
the essence (ousia) or what it is (to ti einai)
(983a27).

24
Penman Upper Model

Matthiessen et al. (ISI) 1980s
Linguistic (English) generalizations
Approx. 300 concepts. No lexical items at this
level
AI-light (no axioms)
Tested for NLG in various languages and MT
Serves as overall connection between Domain Model
symbols (used in system input representations)
and NLG system decision rules
Good example of use of Upper Model to capture
and organize essential processing distinctions

25
Some MIKRO case roles and relations

Caserole
THEME
SOURCE
PATH
LOCATION
INSTRUMENT
EXPERIENCER
DESTINATION
BENEFICIARY
AGENT
ACCOMPANIER
Manner-relation
MANNER-QUALITY
MANNER-INSTRUMENT
MANNER-MEANS
Quantity-relation
LESS-THAN
EQUAL-TO
GREATER-THAN

MIKROKOSMOS ontology
Nirenburg, Mahesh, et al.
Spatial-Temporal-relation
Temporal-relation
BETWEEN-TEMPORAL TEMPORAL-FREQUENCY
TEMPORAL-DURATION TEMPORAL-LOCATING
TEMPORAL-INTERVAL-OVERLAP (subrelations)
TEMPORAL-NON-OVERLAP (subrelations)
Spatial-relation
SURROUNDED-BY SURROUNDS RIGHT-OF OUTSIDE-OF
ON-TOP-OF MEET UNDER LEFT-OF INSIDE-OF
IN-FRONT-OF BETWEEN-SPATIAL BESIDE
BELOW-AND-TOUCHING BEHIND ACROSS-FROM ACROSS
ABOVE DIRECTIONAL-RELATION (subrelations)

26
Outline

Problems when creating concepts
Ontological Semantics
The high-level abstractions
The content-level concepts
The OntoNotes project
A methodology for creating word senses
From senses to concepts
The Omega Ontology
Conclusion

27
Questions

What do you include in the Upper Model, and what
in the Middle Model?
Where is the boundary?
How primitive are your concepts? What
granularity should you use?

28
Parsimonious vs profligate

Parsimonious
Few symbols
Easy to see conceptual relatedness
Easy to define and run inferences
Hard to compose complex meanings

Profligate
Many symbols
Hard to determine conceptual relatedness
Hard work to define inferences
No need to compose complex meanings
Easy to fall into the trap of
semantics-by-capitalization (or "wishful
mnemonics": McDermott, "Artificial Intelligence
Meets Natural Stupidity", 1981)

There is no correct position: what you choose
depends on how much inference you need vs. how
complex your domain is
29
CYC middle
Lenat www.cyc.com

Built by CYC Artificial Intelligence reasoning
and databases
Hundreds of thousands of concepts
Various termsets available over past years
Many interesting capabilities

30
WordNet
Miller & Fellbaum, wordnet.princeton.edu

Being built by Miller and Fellbaum at Princeton:
cognitive scientists
Synonymous senses of words grouped into synsets:
approx. 120,000 synsets
Rudimentary Upper Model; all Middle Model
Nouns organized by hyponym (ISA): average depth
of Noun hierarchy 12
Verbs weakly organized by hyponym: avg depth 3
Adjectives organized as star structures
(quasi-synonym clusters related to antonym
clusters)
Also meronym (part-of) and other relations, and
recently includes sense frequency values
Used for many NLP applications, but effectiveness
is controversial
IR study claims WordNet not useful (Voorhees)
QA work, using axioms in Extended WordNet
(Moldovan), shows great promise
Word sense disambiguation shows WordNet has too
many senses

31
Mikrokosmos
Nirenburg et al., crl.nmsu.edu/Research/Projects/mikro/

Intermittently being built by Nirenburg et al. at
New Mexico State U and U of Maryland: NLP people
About 6000 concepts, 250 relations (slots)
Focus on lexicon: define cores of meaning
clusters and differentiate at the word/sense
level; includes about 25K English and 25K Spanish
(and some other) words
Used as Interlingua symbol repository for MT, in
Text Meaning Rep (TMR) notation
Nice feature: facets on slots
Value: value of the slot (may be a formula)
Strength: certainty/probability
Aspect: constant/intermittent/etc.

32
Aligning ontologies

Instead of building an ontology (with all the
problems that entails), can one just combine
existing ones?
Find the most popular concepts and organization
Merge the definitions
Identify individual errors and problem areas
I tried this in 1996-97 (Hovy, LREC 1998)
Project funded by IBM: align Upper Models of CYC,
Penman, and Mikrokosmos
Built alignment routines and created merge
Conceptual mismatch problems were significant!
Since then, a fairly large group of researchers
has been doing this; there is a competition every
year

33
General alignment and merging

Goal: find attachment point(s) in ontology for
node/term from somewhere else (ontology, website,
metadata schema, etc.)
It's hard to do manually; very hard to do
automatically: system needs to understand
semantics of entities to be aligned

34
Outcome 1: Good and Misleading

S@foodstuff<food
a substance that can be used or prepared
for use as food
superconcepts (S@food)
M@FOODSTUFF (COMB 13.355 NAME 91 DEF
10.00 TAX 0.140)
a substance that can be used or prepared for
use as food
superconcepts (M@FOOD M@MATERIAL)
----------------------------------------
S@library>bibliotheca
a collection of literary documents or
records kept for reference
superconcepts (S@aggregation)
M@LIBRARY (COMB 2.742 NAME 59 DEF 3.57
TAX 0.000)
a place in which literary and artistic
materials such as books periodicals
newspapers pamphlets and prints are kept for
reading or reference; an
institution or foundation maintaining such a
collection
superconcepts (M@ACADEMIC-BUILDING)

A document collection or a place?
35
Outcome 2: Unclear and Error!

S@geisha
a Japanese woman trained to entertain men
with conversation and singing
and dancing
superconcepts (S@adult female
S@Japanese<Asian)
M@GEISHA (COMB 1.540 NAME 46 DEF 2.27
TAX 0.000)
a Japanese girl trained as an entertainer to
serve as a hired entertainer
to men
superconcepts (M@ENTERTAINMENT-ROLE)
----------------------------------------
S@archipelago
many scattered islands in a large body of
water
superconcepts (S@dry land)
M@ARCHIPELAGO (COMB 1.522 NAME 131 DEF
1.33 TAX 0.000)
a sea with many islands
superconcepts (M@SEA)

A person or a function?
Land or sea?
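
The slides do not say how the COMB score is derived from the NAME, DEF, and TAX match scores. The sketch below tests one guess: that COMB is sqrt(NAME) times DEF times TAX, with TAX floored at 0.1. This formula is reverse-engineered here from the four printed examples only and should be read as a hypothesis, not as the documented alignment routine; the function name and the `tax_floor` parameter are likewise assumptions.

```python
import math

def combined_score(name, definition, taxonomy, tax_floor=0.1):
    """Hypothetical fit: sqrt(NAME) * DEF * max(TAX, tax_floor)."""
    return math.sqrt(name) * definition * max(taxonomy, tax_floor)

# (NAME, DEF, TAX, COMB) exactly as printed on the two Outcome slides.
EXAMPLES = {
    "FOODSTUFF": (91, 10.00, 0.140, 13.355),
    "LIBRARY": (59, 3.57, 0.000, 2.742),
    "GEISHA": (46, 2.27, 0.000, 1.540),
    "ARCHIPELAGO": (131, 1.33, 0.000, 1.522),
}

for concept, (name, definition, taxonomy, comb) in EXAMPLES.items():
    fitted = combined_score(name, definition, taxonomy)
    print(f"{concept:12s} printed={comb:7.3f} fitted={fitted:7.3f}")
```

If the guess is right, a taxonomy score of zero does not veto a match: the name and definition scores still carry it, which is how LIBRARY, GEISHA, and ARCHIPELAGO get paired despite incompatible superconcepts.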
36
When are two concepts the same? Guarino's
Identity Criteria

Material: the stuff
Topological: the shape
Morphological: the parts
Functional: the use
Meronymical: the members
Social: the societal role
(see also Pustejovsky's qualia)

A water glass, before and after being smashed;
the ACL in 1964 and in 2064
37
Shishkebobs (Hovy et al. in prep)

Library ISA Building (and hence can't buy things)
Library ISA Institution (and hence can buy
things)
SO: Building ? Institution ? Location: a
Library is all these

Also Country ? Nation ? Government (GPE)
France the land, the people, and the rulers
Also Field-of-Study ? Activity ?
Result-of-Process
(Science, Medicine, Architecture, Art)
Also Company ? Product ? Stock
He worked at Coke, drank Coke, and owned Coke
(shares)
We found about 400 potential shishkebobs

Shishkebobs: concept senses or metonymy
rings. A continuum, from on-the-fly meaning
shadings to full metonymy
Link regular alternation possibilities at general
level in ontology; allow meaning shift for
semantic interpretation, where needed
Using shishkebobs makes merging ontologies easier
(possible?): you respect each ontology's
perspective

38
Domain models/ontologies

There's tons of work building domain-specific
ontologies: see the web
Artificial Intelligence
Databases
Company products
Government codes
Domain expertise capture
etc
Not the focus of this lecture; we continue with
general lexico-semantics

39
Outline

Problems when creating concepts
Ontological Semantics
The high-level abstractions
The content-level concepts
The OntoNotes project
A methodology for creating word senses
From senses to concepts
The Omega Ontology
Conclusion

40
Semantic annotation projects

Goal: corpus of pairs (sentence + semantic rep)
Process: humans add information to sentences (and
their parses)
Recent projects:

Interlingua Annotation (Dorr et al. 04)
coref links
OntoNotes (Weischedel et al. 05)
ontology
I-CAB, Greek banks
PropBank (Palmer et al. 03)
TIGER/SALSA Bank (Pinkal et al. 04)
verb frames
Framenet (Fillmore et al. 04)
noun frames
Prague Dependency Treebank (Hajic et al. 02)
word senses
Penn Treebank (Marcus et al. 99)
NomBank (Myers et al. 03)
syntax
41
OntoNotes project structure
ISI
Colorado
Verb Senses and verbal ontology links
Noun Senses and targeted nominalizations
Propositions
Training Data
Ontology Links and resulting structure
BBN
Penn
Decoders
Treebank Syntax
Coreference
Summarization
Translation
Syntactic structure Predicate/argument
structure Disambiguated nouns and verbs
Coreference links
Goal: In 4 years, annotate text corpora of 1
million words of English, Chinese, and Arabic text
42
Focus on word senses

Create a very large corpus of text by annotating
JUST the semantic sense(s) of every noun and verb
(and later, adjective and adverb)
Why?
Enable computer programs to learn to assign
correct senses automatically, in search of
improved machine translation, text summarization,
question answering, (web) search, etc.
begin to understand the distribution of principal
semantic features (animacy, concreteness, etc.)
at large scale.

43
Example of result

3@wsj/00/wsj_0020.mrg@wsj: Mrs. Hills said many
of the 25 countries that she placed under
varying degrees of scrutiny have made
genuine progress '' on this touchy issue .
Propositions:
predicate: say (pb sense: 01, on sense: 1)
ARG0: Mrs. Hills 10
ARG1: many of the 25 countries that she placed
under varying degrees of scrutiny have made
genuine progress '' on this touchy issue
predicate: make (pb sense: 03, on sense: None)
ARG0: many of the 25 countries that she placed
under varying degrees of scrutiny
ARG1: genuine progress '' on this touchy issue

OntoNotes Normal Form (ONF)
44
OntoNotes annotation procedure

Sense creation process: goes by word
Expert creates meaning options (shallow semantic
senses) for verbs, nouns, adjs, advs; follows
PropBank process (Palmer et al.)
Expert creates definitions, examples,
differentiating features
(Ontology insertion: at the same time, expert
groups equivalent senses from different words and
organizes/refines Omega ontology content and
structure; process being developed at ISI)
Sense annotation process: goes by word, across
docs
Process developed in PropBank
Annotators manually:
See each sentence in corpus containing the
current word (noun, verb, adjective, adverb) to
annotate
Select appropriate senses (= ontology concepts)
for each one
Connect frame structure (for each verb and
relational noun)
Coref annotation process: goes by doc
Annotators connect co-references within each doc

45
Ensuring trustworthiness/stability

Problematic issues
What senses are there? Are the senses
stable/good/clear?
Is the sense annotation trustworthy?
What things should corefer?
Is the coref annotation trustworthy?
Approach (from PropBank): the 90% solution
Sense granularity and stability: test with
annotators to ensure agreement at 90% on real
text
If not, then redefine and re-do until 90%
agreement reached
Coref stability: only annotate the types of
aspects/phenomena for which 90% agreement can be
achieved

46
Sense annotation procedure

Sense creator first creates senses for a word
Loop 1
Manager selects next nouns from sensed list and
assigns annotators
Programmer randomly selects 50 sentences and
creates initial Task File
Annotators (at least 2) do the first 50
Manager checks their performance
90% agreement and few or no NoneOfAbove: send on
to Loop 2
Else: Adjudicator and Manager identify reasons,
send back to Sense creator to fix senses and defs

Loop 2
Annotators (at least 2) annotate all the
remaining sentences
Manager checks their performance
90% agreement and few or no NoneOfAbove: send to
Adjudicator to fix the rest
Else: Adjudicator annotates differences
If Adj agrees with one Annotator 90% of the time,
then ignore the other Annotator's work (assume a
bad day for the other); else, if Adj agrees with
both about equally often, then assume bad senses
and send the problematic ones back to Sense creator
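
The Manager's check at the end of each loop can be sketched as a simple agreement computation over two annotators' sense labels. This is a minimal illustration, assuming plain percentage agreement between exactly two annotators; the function names and the 5% cutoff standing in for "few or no NoneOfAbove" are assumptions made here, since the slide gives no number for that condition.

```python
NONE_OF_ABOVE = "NoneOfAbove"

def agreement(labels_a, labels_b):
    """Fraction of sentences on which two annotators chose the same sense."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)

def none_of_above_rate(*label_lists):
    """Share of all assigned labels that are NoneOfAbove."""
    labels = [label for one_list in label_lists for label in one_list]
    return sum(label == NONE_OF_ABOVE for label in labels) / len(labels)

def check_round(labels_a, labels_b, threshold=0.90, max_none_rate=0.05):
    """True if the word may move on; False if it goes back for repair.

    max_none_rate is a hypothetical stand-in for "few or no NoneOfAbove".
    """
    return (agreement(labels_a, labels_b) >= threshold
            and none_of_above_rate(labels_a, labels_b) <= max_none_rate)

# Toy data: 50 sentences for one word, sense labels as strings.
annotator_1 = ["1"] * 30 + ["2"] * 20
annotator_2 = ["1"] * 30 + ["2"] * 17 + ["3"] * 3  # differs on 3 sentences
print(agreement(annotator_1, annotator_2))
print(check_round(annotator_1, annotator_2))
```

The same check serves both loops; only the consequence of failing differs (back to the Sense creator in Loop 1, on to the Adjudicator's comparison in Loop 2).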

STAMP annotation interface
Built for PropBank (Palmer, UPenn)
Target word
Sentence
Word sense choices (no mouse!)

48
Pre-project test: Can it be done?

Annotation process and tools developed and tested
in PropBank (Palmer et al., U Colorado)
Typical results (10 words of each type, 100
sentences each):

        tagger agreement            senses                      time (min/100 tokens)
        Round 1  Round 2  Round 3   Round 1  Round 2  Round 3   Round 1  Round 2  Round 3
verbs   .76      .86      .91       4.5      5.2      3.8       30       25       25
nouns   .71      .85      .95       7.3      5.1      3.3       28       20       15
adjs    .87      -        .90       2.8      -        5.5       24       -        18

(by comparison: agreement using WordNet senses is
70%)
49
Setting up: Word statistics

1000-word corpus   tokens   types
verbs              125.3    87.3
nouns              446.6    288.7
adjectives         103.2    80.6

Number of word tokens/types in 1000-word corpus
(95% confidence intervals on 85213 trials)

Nouns: approx. 50% of tokens. Monosemous nouns
(but not names etc.): 14.6% of tokens, 25.6% of
nouns

250K WSJ        verbs         nouns
total           2341          5421
1 WN sense      428 (18%)     1751 (32%)
2 or 3 senses   966 (41%)     2159 (40%)
4+ senses       947 (40%)     1511 (28%)

Polysemy of verbs and nouns

Nouns   Tokens (total 205442)   % of tokens
100     76420                   37%
500     140453                  68%
1000    167715                  82%
1500    181412                  88%
2000    189641                  92%

Coverage in WSJ and Brown Corpus of most frequent
N polysemous-2 nouns
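
The bracketed figures in the polysemy and coverage tables read as percentages of the column totals. A short check of that reading, using only the counts printed on the slide, is sketched below; the last polysemy row is labelled "4+ senses" here on the assumption that it means four or more, since the three rows sum exactly to the totals.

```python
# Counts as printed on the slide (250K WSJ sample).
VERBS = {"1 WN sense": 428, "2 or 3 senses": 966, "4+ senses": 947}
NOUNS = {"1 WN sense": 1751, "2 or 3 senses": 2159, "4+ senses": 1511}

# Tokens covered by the N most frequent polysemous nouns.
COVERAGE = {100: 76420, 500: 140453, 1000: 167715, 1500: 181412, 2000: 189641}
TOTAL_NOUN_TOKENS = 205442

def percentages(counts):
    """Each row as a rounded percentage of the column total."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}

def coverage_percentages(coverage, total):
    """Covered tokens as a rounded percentage of all tokens."""
    return {n: round(100 * tokens / total) for n, tokens in coverage.items()}

print(sum(VERBS.values()), percentages(VERBS))
print(sum(NOUNS.values()), percentages(NOUNS))
print(coverage_percentages(COVERAGE, TOTAL_NOUN_TOKENS))
```

On these numbers, roughly 1000 polysemous nouns account for about four fifths of the noun tokens, which is what makes word-by-word sense creation feasible at corpus scale.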
50
Outline

Problems when creating concepts
Ontological Semantics
The high-level abstractions
The content-level concepts
The OntoNotes project
A methodology for creating word senses
From senses to concepts
The Omega Ontology
Conclusion

51
We want to go from lexemes to concepts, and we
use senses as the bridge

Lexical space
Words
drive
steer
fahren
steuern
besturen
rijden
drijven

Sense space
Word senses
Drive1
Drive2
Drive3
Manage

Concept space
Concepts
?
How many concepts?
How related to senses?

52
Which, and how many, senses? Graduated refinement

1. Initialization: Given a term (word), collect
several dozen sentences containing it. Also
collect definitions from various dictionaries
2. Cluster the word's senses into preliminary,
loosely similar groups
3. Differentiation process: Begin a tree structure
with all the groups at the root
4. Considering all the groups, identify the group
most different from the others
  If you can find one clearly most different group,
  write down its most important distinction
  explicitly; this will later become the
  differentium and be formalized axiomatically
  If you cannot find any distinctions by which to
  further subdivide the group, stop elaborating
  this branch and continue with some other branch
  If you can find several distinctions that
  subdivide the group in different, but equally
  valid, ways, also stop elaborating this branch
  and continue with some other branch
5. Create two new branches in the evolving tree
structure, putting the new group under one, and
leaving the other groups under the other
6. Repeat from step 4, considering separately the
group(s) under each branch
7. Concept formation: When all branches have
stopped, the ultimate result is a tree of
increasingly fine-grained distinctions, which are
explicitly listed at each branch point. Each
leaf becomes a single concept, not further
differentiable in the current
task/application/domain. Each distinction must be
formalized as an axiom that holds for the branch
it is associated with
8. Insertion into ontology: Starting from the top,
visit each branch point. Do the two branches
have approximately the same meaning?
  If so, insert them into the ontology at the
  appropriate point and stop traversing this branch
  If not, split the tree and repeat step 8
  separately for each branch. Repeat until done
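
The differentiation loop (identify the most different group, record the distinction, split, repeat) can be sketched as a recursive binary split. In the procedure above the judgments are made by a human expert; the sketch replaces them with a crude stand-in, assuming each sense group carries a hand-written feature set and that "most different" means the largest mean Jaccard distance to the other groups. The group names, feature labels, and all function names are hypothetical, and the "several equally valid distinctions" stopping case is not modelled.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    groups: list                      # sense groups under this branch
    distinction: str = ""             # differentium recorded at the split
    children: list = field(default_factory=list)

def jaccard_distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def most_different(groups, features):
    """Group with the largest mean distance to all other groups."""
    def mean_distance(g):
        others = [h for h in groups if h != g]
        return sum(jaccard_distance(features[g], features[h])
                   for h in others) / len(others)
    return max(groups, key=mean_distance)

def refine(groups, features):
    """Split off the most different group, then recurse on both sides."""
    node = Node(groups=list(groups))
    if len(groups) < 2:
        return node                   # a leaf: one candidate concept
    odd = most_different(groups, features)
    rest = [g for g in groups if g != odd]
    unique = features[odd] - set().union(*(features[g] for g in rest))
    if not unique:
        return node                   # no distinction found: stop this branch
    node.distinction = ", ".join(sorted(unique))
    node.children = [refine([odd], features), refine(rest, features)]
    return node

def leaves(node):
    if not node.children:
        return [node.groups]
    return [leaf for child in node.children for leaf in leaves(child)]

# Hypothetical feature sets for four sense groups of "drive".
FEATURES = {
    "operate vehicle": {"motion", "vehicle"},
    "herd animals": {"motion", "animate-patient"},
    "drive mad": {"psych"},
    "hard bargain": {"negotiate"},
}

tree = refine(list(FEATURES), FEATURES)
print(tree.distinction)
print(leaves(tree))
```

Each recorded `distinction` corresponds to what the procedure calls a differentium; in practice it would be written by the sense creator and later formalized as an axiom, not computed from feature sets.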

53
An exercise: drive

Drive the demons out of her and teach her to stay
away from my husband!!
Shortly before nine I drove my jalopy to the
street facing the Lake and parked the car in
shadows.
He drove carefully in the direction of the brief
tour they had taken earlier.
Her scream split up the silence of the car,
accompanied by the rattling of the freight, and
then Cappy came off the floor, his legs driving
him hard.
With an untrained local labor pool, many experts
believe, that policy could drive businesses from
the city.
Treasury Undersecretary David Mulford defended
the Treasury's efforts this fall to drive down
the value of the dollar.
Even today range riders will come upon mummified
bodies of men who attempted nothing more
difficult than a twenty-mile hike and slowly lost
direction, were tortured by the heat, driven mad
by the constant and unfulfilled promise of the
landscape, and who finally died.
Cows were kept in backyard barns, and boys were
hired to drive them to and from the pasture on
the edge of town.
He had to drive the hammer really hard to get the
nail into that plank!
She learned to drive a bulldozer from her uncle,
who was a road maker.
I used to drive a taxi (for work) before I went
to night school.
Beware: Ralph drives a hard bargain; you will
probably lose all your money.

54
Grouping the senses of drive
drive (1,2,...,12)
55
Deeper semantic drive
drive (1,2,...,12)
<psych>
<profession>
11 taxi
<negotiate>
12 drive a hard bargain
56
Ontologizing drive
<move in desired direction> (1,2,3,4,5,6,8,9,10)
57
From lexemes to concepts

Lexical space
Words
Monolingual
drive
steer
fahren
rijden

Sense space
Word senses
Multilingual
Drive1
Drive2
Drive3

Concept space
Concepts
Interlingual (?)

58
Time for some fun

Do the class exercise

59
Doing this seriously: OntoNotes sense creation
interface

Input word
Tree of senses being created
Working area: write defs, exs, features, etc.
Google or dictionary sense list for ideas

60
Outline

Problems when creating concepts
Ontological Semantics
The high-level abstractions
The content-level concepts
The OntoNotes project
A methodology for creating word senses
From senses to concepts
The Omega Ontology
Conclusion

61
Trying to find real concepts

What tests can one use to ensure concept-hood
and not just sense-hood?
One idea: Multilinguality
Can one arrange wordsenses/concepts to form an
interlingua termset?
This would guarantee (some degree of) semantic
nature
How to do this?
Translate a text several times
Compare the differences: presumably they derive
from the same source, which must therefore pack
together all the translation meanings

62
Finding meaning via translation difference
K1E1: Starting on January 1 of next year, SK
Telecom subscribers can switch to less expensive
LG Telecom or KTF. The Subscribers cannot
switch again to another provider for the first 3
months, but they can cancel the switch in 14
days if they are not satisfied with services
like voice quality.

K1E2: Starting January
1st of next year, customers of SK Telecom can
change their service company to LG Telecom or
KTF. Once a service company swap has been made,
customers are not allowed to change companies
again within the first three months, although
they can cancel the change anytime within 14
days if problems such as poor call quality are
experienced.

Semantically identical

Semantically equivalent

Semantically different

Additional/less information

Different information

63
Getting at meaning: Two translations of a
Japanese original text

This year,
too,
in addition to
the birth
of Mitsubishi Chemical,
which has already been announced,
other rather large-scale mergers
may continue,
and
be recorded
as a year of mergers.

This year,
which has already seen the announcement of
the birth
of Mitsubishi Chemical Corporation
as well as
the continuous numbers of big mergers,
may
too
be recorded
as the year of the merger
for all we know.

Problems for semantic rep
Lexical differences
Dependency differences

64
Interlinguas in MT

The idea of an interlingua is intriguing
Example use in Machine Translation
For transfer systems, need 2n(n-1) rules for n
languages (L1→L2, L2→L1, L1→L3, L3→L1)
For interlingual systems, need only 2n sets of
rules (Lx→IL→Ly)
Interlingua is the deep semantic notation of
the meaning (the idea) behind the text
An Interlingua is a system of symbols and
notation to represent the meaning(s)
of (linguistic) communications with the following
features
language-independent
formally well-defined
expressive to arbitrary level of semantics
non-redundant
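
The scaling argument for interlinguas can be checked with a small count. The sketch below is illustrative only: it assumes one transfer module per ordered language pair, giving n(n-1) modules, whereas the slide's transfer figure carries an extra constant factor whose origin is not stated here. Either way, transfer grows quadratically with the number of languages while the interlingual design grows linearly. The function names and the language labels are placeholders.

```python
from itertools import permutations


def transfer_pairs(languages):
    """Ordered (source, target) pairs: one transfer direction each."""
    return list(permutations(languages, 2))


def interlingua_modules(languages):
    """One analyzer (Lx -> IL) plus one generator (IL -> Ly) per language."""
    analyzers = [(lang, "IL") for lang in languages]
    generators = [("IL", lang) for lang in languages]
    return analyzers + generators


for n in (2, 4, 8):
    languages = [f"L{i}" for i in range(1, n + 1)]
    print(n, len(transfer_pairs(languages)), len(interlingua_modules(languages)))
```

Adding a ninth language to eight existing ones would add 16 new ordered pairs under this counting, but only two new interlingual modules.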

word senses concepts
65
Structure of EuroWordNet
Slide by Wim Peters, U of Sheffield, 2001

Based on WordNet (Miller & Fellbaum, Princeton
University)
WordNet currently has about 120,000 wordsense
groupings (synsets)
EuroWordNet: Language-specific wordnets for 8
languages, all independent but connected
English, Spanish, Italian, Dutch
Wordnets represent unique concept lexicalization
patterns in 8 languages, based on
sense-inventories of mono- and bilingual
dictionaries
BUT NOT AN INTERLINGUA: Synsets of the various
languages are linked to the Inter-Lingual Index
(ILI), which serves as an interlingua mapping, but
there is no single central term/concept set

66
Slide by Wim Peters, U of Sheffield, 2001
EuroWordNet architecture
[Figure: language-specific wordnets (English: move, travel, go; Dutch: bewegen, reizen, gaan, rijden, berijden; Spanish: cabalgar, jinetear; Italian: guidare, andare, muoversi) linked through the Inter-Lingual-Index (ILI-record: drive) to the Domain-Ontology and the Top-Ontology]
67
Multilinguality: Word-sense-concept 1
Sense Space
eat1
eat3
eat2
Word Space
eat
68
Multilinguality: Word-sense-concept 2
Sense Space
eat1
eat3
eat2
Word Space
eat
69
Multilinguality: Word-sense-concept 3
Other languages may suggest refined
conceptualization
IngestFood
Concept Space
IngestFood-Human
IngestFood-Animal

Sense Space
eat1
eat3
eat2
Word Space
eat
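
The three layers on this slide (word space, sense space, concept space) can be sketched as a small data structure. This is a minimal illustration, not Omega's actual representation: the class and method names are invented here, and the assignment of eat1, eat2, and eat3 to particular concepts is a hypothetical choice, since the slide does not say which sense links to which concept.

```python
from dataclasses import dataclass, field


@dataclass
class Lexicon:
    """Words map to senses; each sense maps to one concept."""
    word_to_senses: dict = field(default_factory=dict)
    sense_to_concept: dict = field(default_factory=dict)

    def add_sense(self, word, sense, concept):
        """Register a sense of a word and the concept it expresses."""
        self.word_to_senses.setdefault(word, []).append(sense)
        self.sense_to_concept[sense] = concept

    def concepts_for(self, word):
        """Return the distinct concepts reachable from a word, in order."""
        seen = []
        for sense in self.word_to_senses.get(word, []):
            concept = self.sense_to_concept[sense]
            if concept not in seen:
                seen.append(concept)
        return seen


lex = Lexicon()
# Hypothetical sense-to-concept links, for illustration only.
lex.add_sense("eat", "eat1", "IngestFood-Human")
lex.add_sense("eat", "eat2", "IngestFood-Animal")
lex.add_sense("eat", "eat3", "IngestFood")

print(lex.concepts_for("eat"))
```

Words from other languages would be added with the same `add_sense` call; when two of them point at different concepts where English has one word, that is the cue for the refined conceptualization the slide mentions.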
70
Lexicon, Senses and Concepts in Omega

Lexical space
Words
drive
steer
fahren
steuern
besturen
rijden
drijven

Sense space
Word senses
Drive1
Drive2
Drive3
Manage
Ride

Concept space
Concepts

drive/steer
ride1
manage3
drive<propel
71
OntoNotes procedure for building the ontology

Goal: Create ontology: a repository of OntoNotes
senses, organized to provide additional
information
Creation procedure
Start with framework (Upper Structure) from ISI's
Omega ontology
Contains verb frame structures from PropBank,
FrameNet, LCS, WordNet
Gather all senses created for annotation
Include definitional features defined for senses
Concepts: Identify and pool together senses
with same meaning
Look for shared features
Recognize paraphrases to avoid redundancy
Arrange close senses together to share features
Enable eventual reasoning (buy ↔ sell)
Validation: Measure agreement between poolers
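
The pooling and validation steps can be sketched as follows. This is a toy illustration under stated assumptions, not the OntoNotes tooling: senses are pooled only when their feature sets are identical (a crude stand-in for "same meaning"), agreement is measured as the fraction of sense pairs that two poolers treat the same way (a Rand-index-style measure, chosen here because the slide does not name one), and all sense labels and features are hypothetical.

```python
from itertools import combinations


def pool_by_features(sense_features):
    """Group senses whose feature sets are identical."""
    pools = {}
    for sense, features in sense_features.items():
        pools.setdefault(frozenset(features), []).append(sense)
    return sorted(sorted(pool) for pool in pools.values())


def pairwise_agreement(pools_a, pools_b):
    """Fraction of sense pairs both poolers place together or both keep apart."""
    def assignment(pools):
        return {sense: i for i, pool in enumerate(pools) for sense in pool}

    a, b = assignment(pools_a), assignment(pools_b)
    senses = sorted(set(a) & set(b))
    pairs = list(combinations(senses, 2))
    if not pairs:
        return 1.0
    agree = sum((a[x] == a[y]) == (b[x] == b[y]) for x, y in pairs)
    return agree / len(pairs)


# Hypothetical senses and definitional features.
sense_features = {
    "buy.1": {"transfer", "money"},
    "purchase.1": {"transfer", "money"},
    "sell.1": {"transfer", "money", "give"},
}
pooler_a = pool_by_features(sense_features)
pooler_b = [["buy.1"], ["purchase.1", "sell.1"]]  # a second, hand-made pooling

print(pooler_a)
print(pairwise_agreement(pooler_a, pooler_b))
```

A chance-corrected measure would be a reasonable replacement for the raw pairwise fraction when pools are very unevenly sized.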

72
Some OntoNotes features from the word senses
building 45
business 40
device 37
legal 36
substance 34
mental 32
small 34
measure 30
unit 26
concrete 22
enterprise 19
amount 19
portion 19
military 19
people 19
vehicle 18
official 18
large 18
collection 17
natural 18
financial 17
part 17
object 16
human 16
cylindrical 16
entity 965
artifact 272
relation (advisor, agent, aid) 199
state 188
quantity 171
activity (acquisition, act, battle) 156
structure 160
role (advisor, agent, banker) 147
physical 123
social 112
action (act, appeal, bid, call) 103
quality 97
person 83
group 86
organization 59
abstract 58
event 57
location 58
document 53
individual 46
form 46

Interesting correspondences with theoretical
work
Linguistics: Comrie's syntactic and semantic
features
Knowledge Representation / AI: Upper Model
features of SUMO, CYC, etc.

73
OntoNotes sense pooler interface

Sense list
Pool being built: list of senses
Subordination: link to another pool
Features of this pool

74
Outline

Problems when creating concepts
Ontological Semantics
The high-level abstractions
The content-level concepts
The OntoNotes project
A methodology for creating word senses
From senses to concepts
The Omega Ontology
Conclusion

75
Omega content and framework
www.omega.edu
Goal: one environment for various ontologies and
resources

Concepts: 120,604 concept/term entries (76 MB)
WordNet (Princeton: Miller & Fellbaum)
Mikrokosmos (NMSU: Nirenburg et al.)
Penman Upper Model (ISI: Bateman et al.)
25,000 noun-noun compounds (ISI: Pantel)
Lexicon / sense space
156,142 English words; 33,822 Spanish words
271,243 word senses
13,000 frames of verb argument structure with case
roles
LCS case roles (Dorr), 6.3 MB
PropBank roleframes (Palmer et al.), 5.3 MB
FrameNet roleframes (Fillmore et al.), 2.8 MB
WordNet verb frames (Fellbaum), 1.8 MB
Associated information (not all complete)
WordNet subject domains (Magnini & Cavaglia), 1.2 MB
Various relations learned from text (ISI:
Pantel)
TAP domain groupings (Stanford: Guha)
SemCor term frequencies, 7.5 MB
Topic signatures (Basque U: Agirre et al.), 2.7 GB

Instances: 10.1 GB
1.1 million persons harvested from text
765,000 facts harvested from text
5.7 million locations from USGS and NGA
Framework (over 28 million statements of
concepts, relations, instances)
Available in PowerLoom
Instances in RDF
With database/MySQL
Online browser
Clustering software
Term and ontology alignment software

76
Omega browser Mammoth
77
Omega hierarchy display
78
Omega sense frames
79
Outline

Problems when creating concepts
Ontological Semantics
The high-level abstractions
The content-level concepts
The OntoNotes project
A methodology for creating word senses
From senses to concepts
The Omega Ontology
Conclusion

80
Shallow and deep ontologies

Omega is a language-based ontology
Concepts defined via word senses; annotator
agreement constrains and validates granularity
Granularity: how many senses for each word?
Associated information is subsumption hierarchy
and case frames
Deep(er) semantics
Deeper knowledge (definitional features,
subconcept differentiae, inferences, etc.) is
not present
In places: temporal relations using
Allen/Hobbs/OWL
Future, deeper versions of annotation will require
and motivate a more semantic ontology

81
What would be nice?

A small number of (globally) standardized
ontologies and/or core theories of important
aspects (time, space, social dynamics, motion,
privacy, etc.)
Solid theoretical frameworks for developing
ontological notions and theories, and for testing
them
A rich online world of ontologies, domain models,
etc., with appropriate ontology creation tools
and methodologies
(Semi-)automated techniques for rapidly finding,
absorbing, and testing existing ontologies for
your own applications
Tools that automatically create new knowledge
bases on demand, in accord with given ontologies
Ontology and knowledge base support technology
that can handle info that may be inconsistent,
tenuous, partial, and growing

82
Thank you!
