Title: An Empirical Methodology for Word Sense Creation
1. An Empirical Methodology for Word Sense Creation
- Eduard Hovy
- Information Sciences Institute
- University of Southern California
- www.isi.edu/hovy
2. How many colours are there?
colour
light, dark
> red
> blue
> yellow/green
> green/yellow
> brown
3. Why? Microtheories of colour
4. Microtheorizing
- Straightforward approach
- Concepts: collect terms, group, taxonomize
- Lexicons: collect words, connect
- Roughly 1-1 concepts-to-words
- Problem: close but non-identical meaning overlaps across languages' terms
- Solution: complex mappings (e.g., EuroWordNet's ILI)
- Microtheory approach
- Understand the phenomenon
- Concepts: basic primitives of the theory
- Lexicons: words defined in terms of primitives
- Push the complexity into the lexicon
- Problem: which microtheory? Need one for each meaning: complex!
- Solution: ? (See Mikrokosmos (Nirenburg and Raskin 2000))
5. Toward building a lexicon of meanings or concepts
- In practice, you can't build microtheories for everything
- So you have to build more of a terminology bank than a true ontology
- Start with words (terms in the domain, in the dictionary, etc.)
- then separate out word senses and bring together synonyms
- and then group together similar meaning clusters for later (ontology inheritance and inference)
- while recording the essential features that differentiate the groups from each other (hopefully, some not-too-large set of features)
- What does this involve?
6. Let's do an exercise
- Create your own ontology of the following 24 words:
- apples, beans, beef, bread, cake, carrots, cheese, cookies, eggs, ground beef, kimchi, mushrooms, peaches, peas, pies, pork, potatoes, pudding, rice, sausages, scrambled eggs, toast, tomatoes, wheat
- build the taxonomy
- provide some important characteristics
7. What did you do?
- The easy part
- vegetables, fruits, meats; but what of tomatoes? Is your experience right, or are the biologists right?
- Or, what about
- starches, proteins, greens: this is what's inside. Is this a better organization? Should you have both?
- The harder part
- Eggs and scrambled eggs; milk and cheese; pies. Methods of preparation: define somewhere else, and then somehow apply this to the basic foodstuffs?
- What is right?
- If I organized them by color or size, would that be wrong? By sweetness? What if I were diabetic?
8. Decisions when ontologizing
- Should you create the term?
- Where should the term go relative to the other terms? (species)
- What is special/unique/different about this term? (differentium/ae)
- How do you know you're right?
- How do you decide between two or more alternatives?
- Should you record (and re-use) the differentiae?
Store everything in an ontology
9. The zones of ontologies
10. Outline
- Problems when creating concepts
- Ontological Semantics
- The high-level abstractions
- The content-level concepts
- The OntoNotes project
- A methodology for creating word senses
- From senses to concepts
- The Omega Ontology
- Conclusion
11. CYC
- Creator: Lenat (CYCorp, Austin, Texas), since 1990s
- CYC: largest and richest ontology; millions of axioms
- ResearchCYC: 25571 concepts
- RCYC (which was translated into RDF) omits all second-order concept expressions (for example, functional operator expressions), and so it has a lot of missing supers
- R-CYC in Omega
- Lexical items not in Omega yet
- Missing supers require a dummy root, Protocol Root, which now has 95 arbitrary child concepts
12. Top of ResearchCYC
- Largest and most developed ontology
- Principally aimed at AI: inference-heavy and NL-light
- Very tangled network: hard to understand and use unless you absorb CYC's philosophy/methodology
- Full CYC, for sale: over 1M axioms
- Has been tried by many research groups in the past; successful adoption rate low
13. SUMO
- Creator: Pease and consortium; recent (USA)
- Suggested as standard: Suggested Upper Merged Ontology
- 1653 concepts
- No lexical items
- Adopts more traditional KR-style / Description Logic approach, with lots of internal reasoning mechanism constructs
- More hierarchical than RCYC or DOLCE, with few uplinks pointing to expressions
- Omega version not complete and a little buggy
14. Some top parts of SUMO
- Caserole
- Agent-rel
- Patient
- Result
- Resource
- ResourceUsed
- Instrument
- ComputerRunning
- StandardOutputDevice
-
- DataProcessed
- Experiencer
- Origin
- Destination
- Direction
- Path
- Entity
- Physical
- Object
- Process
- Financial Asset
- Abstract
- Quantity
- Graph
- Attribute
- Thing
- Permission
- Obligation
- InheritableSumorelation
- IntentionalSumorelation
- BinarySumoRelation
- CaseRole
-
- Product
- Computational System
15. CYC Event and SUMO Process
- Definition (CYC Event): An important specialization of Situation and thus also of IntangibleIndividual and TemporallyExistingThing (qq.v.). Each instance of Event is a dynamic situation in which the state of the world changes; each instance is something one would say happens. Events are intangible because they are changes per se, not tangible objects that effect and undergo changes. Notable specializations of Event include Event-Localized, PhysicalEvent, Action, and GeneralizedTransfer. Events should not be confused with TimeIntervals (q.v.). The temporal bounds of events are delineated by time intervals, but in contrast to many events, time intervals have no spatial location or extent.
- Definition (SUMO Process): Intuitively, the class of things that happen and have temporal parts or stages. Examples include extended events like a football match or a race, actions like Searching and Reading, and biological processes. The formal definition is anything that lasts for a time but is not an Object. Note that a Process may have participants inside it which are Objects, such as the players in a football match. In a 4D ontology, a Process is something whose spatiotemporal extent is thought of as dividing into temporal stages roughly perpendicular to the time-axis.
16. DOLCE
- Author: Guarino et al. (LADSEB, Italy)
- Upper Model only: focus on very abstract conceptualizations
- Approx. 500 concepts
- No lexical items
- Defined in XML/RDF (very clean and precise use of formalism)
- For Omega, DOLCE was converted from XML/RDF, omitting some expressions Omega can't yet handle (e.g., most superclass relations have as range some expression, i.e., one that uses intersection, union, a quantifier, or a restriction object). So DOLCE doesn't appear very hierarchical in Omega
- Some very nice things in DOLCE, but they are hard to see when nearly everything looks like a top-level object
17. DOLCE examples: Particular, Endurant
- Particular
- Perdurant
- Event
- Accomplishment
- Achievement
- Stative
- Process
- Endurant
- Abstract
- Region
- Abstract-region
- Physical-region
- Quality
- Physical-quality
- Temporal-quality
18. RDF definition for Endurant
19. DOLCE Process
- Definition: Within stative occurrences, we
distinguish between states and processes
according to homeomericity sitting is classified
as a state but running is classified as a
process, since there are (very short) temporal
parts of a running that are not themselves
runnings. In general, processes differ from
situations because they are not assumed to have a
description from which they depend. They can be
sequenced by some course, but they do not require
a description as a unifying criterion. On the
other hand, at any time, one can conceive a
description that asserts the constraints by which
a process of a certain type is such, and in this
case, it becomes a situation. Since the decision
of designing an explicit description that unifies
a perdurant depends on context, task, interest,
application, etc., when aligning an ontology to
DLP, there can be indecision on where to align a
process-oriented class. For example, in the
WordNet alignment, we have decided to put only
some physical processes under process, e.g.
organic process, in order to stress the social
orientedness of DLP. But whereas we need to talk
explicitly of the criteria by which we conceive
organic processes, these will be put under
situation. Similar considerations are made for
the other types of perdurants in DOLCE. A
different notion of event (dealing with change)
is currently investigated for further
developments: being achievement,
accomplishment, state, event, etc. can
also be considered aspects of processes or of
parts of them. For example, the same process
rock erosion in the Sinni valley can be
conceptualized as an accomplishment (what has
brought the current state that e.g. we are trying
to explain), as an achievement (the erosion
process as the result of a previous
accomplishment), as a state (if we collapse the
time interval of the erosion into a time point),
or as an event (what has changed our focus from a
state to another). In the erosion case, we could
have good motivations to shift from one aspect to
another a) causation focus, b) effectual focus,
c) condensation d) transition (causality). If we
want to consider all the aspects of a process
together, we need to postulate a unifying
descriptive set of criteria (i.e. a
description), according to which that process
is circumstantiated in a situation. The
different aspects will arise as parts of the same
situation.
20. Sowa's top-level ontology
- From Knowledge Representation (1999)
- Lattice structure of 12 concepts: all combinations of three differentiae
- Firstness/secondness/thirdness (man/husband/marriage)
- Physical/abstract
- Continuant/occurrent (object/process)
- Obtained from C.S. Peirce and other historical philosophers
- Idea: form a basis into which all ontologies (or their Upper Models) can be inserted
- This is a pretty idea, but not used in practice
- HOWEVER: local lattices are a good idea, used in Omega to overcome the differentium order problem (see lecture 1)
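The count of 12 follows from crossing the three differentiae (3 x 2 x 2). A minimal sketch that enumerates the combinations, using the category labels given on the next slide; the variable and function names are illustrative only and do not come from Sowa or Omega:

```python
from itertools import product

# The three differentiae, with the labels used on the following slide.
FIRSTNESS = ("Independent", "Relative", "Mediating")   # firstness / secondness / thirdness
PHYSICALITY = ("Physical", "Abstract")
PERSISTENCE = ("Continuant", "Occurrent")


def lattice_cells():
    """Return every combination of one value per differentium."""
    return list(product(FIRSTNESS, PHYSICALITY, PERSISTENCE))


cells = lattice_cells()
for cell in cells:
    print(" + ".join(cell))
print(len(cells), "combinations")
```

Each printed line corresponds to one bottom-level cell of the lattice; the sketch does not attach Sowa's names for the individual cells.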
21. Sowa top ontology
- Physical (P). An entity that has a location in space-time.
- Abstract (A). Pure information as distinguished from any particular encoding of the information in a physical medium. Formally, Abstract is a primitive that satisfies the following axioms. No abstraction has a location in space: $\neg(\exists x{:}\mathrm{Abstract})(\exists y{:}\mathrm{Place})\,\mathrm{loc}(x,y)$. No abstraction occurs at a point in time: $\neg(\exists x{:}\mathrm{Abstract})(\exists t{:}\mathrm{Time})\,\mathrm{pTime}(x,t)$.
- Continuant (C). An entity whose identity continues to be recognizable over some extended interval of time.
- Occurrent (O). An entity that does not have a stable identity during any interval of time. Formally, Occurrent is a primitive that satisfies the following axioms. The temporal parts of an occurrent, which are called stages, exist at different times. The spatial parts of an occurrent, which are called participants, may exist at the same time, but an occurrent may have different participants at different stages. There are no identity conditions that can be used to identify two occurrents that are observed in nonoverlapping space-time regions.
- Independent (I). An entity characterized by some inherent Firstness, independent of any relationships it may have to other entities. Formally, Independent is a primitive for which the has-test of Section 2.4 need not apply. If x is an independent entity, it is not necessary that there exists an entity y such that x has y or y has x: $(\forall x{:}\mathrm{Independent})\,\neg\Box(\exists y)(\mathrm{has}(x,y) \lor \mathrm{has}(y,x))$.
- Relative (R). An entity in a relationship to some other entity. Formally, Relative is a primitive for which the has-test must apply: $(\forall x{:}\mathrm{Relative})\,\Box(\exists y)(\mathrm{has}(x,y) \lor \mathrm{has}(y,x))$. For any relative x, there must exist some y such that x has y or y has x.
- Mediating (M). An entity characterized by some Thirdness that brings other entities into a relationship. An independent entity need not have any relationship to anything else, a relative entity must have some relationship to something else, and a mediating entity creates a relationship between two other entities. An example of a mediating entity is a marriage, which creates a relationship between a husband and a wife. According to Peirce, the defining aspect of Thirdness is the conception of mediation, whereby a first and a second are brought into relation. That property could be expressed in second-order logic: $(\forall m{:}\mathrm{Mediating})(\forall x,y{:}\mathrm{Entity})\bigl((\exists R,S{:}\mathrm{Relation})(R(m,x) \land S(m,y)) \supset \Box(\exists T{:}\mathrm{Relation})\,T(x,y)\bigr)$. This formula says that for any mediating entity m and any other entities x and y, if there exist relations R and S that relate m to x and m to y, then it is necessarily true that there exists some relation T that relates x to y. For example, if m is a marriage, R relates m to a husband x, S relates m to a wife y, then T relates the husband to the wife (or the wife to the husband).
22. Sowa case roles 1
- A determinant participant determines the direction of the process, either from the beginning as the initiator or from the end as the goal.
- An immanent participant is present throughout the process, but does not actively control what happens.
- A source must be present at the beginning of the process, but need not participate throughout the process.
- A product must be present at the end of the process but need not participate throughout the process.
23. Sowa case roles 2
- Initiator corresponds to Aristotle's efficient cause, whereby a change or a state is initiated (1013b23).
- Resource corresponds to the material cause, which is the matter or the substrate (hypokeimenon) (983a30).
- Goal corresponds to the final cause, which is the purpose or the benefit; for this is the goal (telos) of any generation or motion (983a32).
- Essence corresponds to the formal cause, which is the essence (ousia) or what it is (to ti einai) (983a27).
24. Penman Upper Model
- Matthiessen et al. (ISI), 1980s
- Linguistic (English) generalizations
- Approx. 300 concepts. No lexical items at this level
- AI-light (no axioms)
- Tested for NLG in various languages and MT
- Serves as overall connection between Domain Model symbols (used in system input representations) and NLG system decision rules
- Good example of use of Upper Model to capture and organize essential processing distinctions
25. Some MIKRO case roles and relations
- Caserole
- THEME
- SOURCE
- PATH
- LOCATION
- INSTRUMENT
- EXPERIENCER
- DESTINATION
- BENEFICIARY
- AGENT
- ACCOMPANIER
- Manner-relation
- MANNER-QUALITY
- MANNER-INSTRUMENT
- MANNER-MEANS
- Quantity-relation
- LESS-THAN
- EQUAL-TO
- GREATER-THAN
- MIKROKOSMOS ontology
- Nirenburg, Mahesh, et al.
- Spatial-Temporal-relation
- Temporal-relation
- BETWEEN-TEMPORAL TEMPORAL-FREQUENCY TEMPORAL-DURATION TEMPORAL-LOCATING TEMPORAL-INTERVAL-OVERLAP (subrelations) TEMPORAL-NON-OVERLAP (subrelations)
- Spatial-relation
- SURROUNDED-BY SURROUNDS RIGHT-OF OUTSIDE-OF ON-TOP-OF MEET UNDER LEFT-OF INSIDE-OF IN-FRONT-OF BETWEEN-SPATIAL BESIDE BELOW-AND-TOUCHING BEHIND ACROSS-FROM ACROSS ABOVE DIRECTIONAL-RELATION (subrelations)
27. Questions
- What do you include in the Upper Model, and what in the Middle Model?
- Where is the boundary?
- How primitive are your concepts? What granularity should you use?
28. Parsimonious vs profligate
- Parsimonious
- Few symbols
- Easy to see conceptual relatedness
- Easy to define and run inferences
- Hard to compose complex meanings
- Profligate
- Many symbols
- Hard to determine conceptual relatedness
- Hard work to define inferences
- No need to compose complex meanings
- Easy to fall into the trap of semantics-by-capitalization (or "wishful mnemonics": McDermott, Artificial Intelligence Meets Natural Stupidity, 1981)
There is no correct position: what you choose depends on how much inference you need vs. how complex your domain is
29. CYC middle
Lenat, www.cyc.com
- Built by CYC: Artificial Intelligence reasoning and databases
- Hundreds of thousands of concepts
- Various termsets available over past years
- Many interesting capabilities
30. WordNet
Miller & Fellbaum, wordnet.princeton.edu
- Being built by Miller and Fellbaum at Princeton: cognitive scientists
- Synonymous senses of words grouped into synsets: approx. 120,000 synsets
- Rudimentary Upper Model; all Middle Model
- Nouns organized by hyponym (ISA): average depth of Noun hierarchy 12
- Verbs weakly organized by hyponym: avg depth 3
- Adjectives organized as star structures (quasi-synonym clusters related to antonym clusters)
- Also meronym (part-of) and other relations, and recently includes sense frequency values
- Used for many NLP applications, but effectiveness is controversial
- IR study claims WordNet not useful (Voorhees)
- QA work, using axioms in Extended WordNet (Moldovan), shows great promise
- Wordsense disambiguation shows WordNet has too many senses
31. Mikrokosmos
Nirenburg et al., crl.nmsu.edu/Research/Projects/mikro/
- Intermittently being built by Nirenburg et al. at New Mexico State U and U of Maryland: NLP people
- About 6000 concepts, 250 relations (slots)
- Focus on lexicon: define cores of meaning clusters and differentiate at the word/sense level; includes about 25K English and 25K Spanish (and some other) words
- Used as Interlingua symbol repository for MT, in Text Meaning Rep (TMR) notation
- Nice feature: facets on slots
- Value: value of the slot (may be a formula)
- Strength: certainty/probability
- Aspect: constant/intermittent/etc.
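One way to picture facets on slots is as a small record attached to each slot of a concept. The sketch below is only an illustration of that idea: the class names, the EAT concept, and its slot fillers are hypothetical and are not taken from the Mikrokosmos ontology or its TMR notation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Slot:
    """A relation on a concept, carrying the three facets named on the slide."""
    name: str
    value: Any                        # Value facet: the filler (could also be a formula)
    strength: Optional[float] = None  # Strength facet: certainty/probability
    aspect: Optional[str] = None      # Aspect facet: e.g. "constant" or "intermittent"


@dataclass
class Concept:
    name: str
    slots: Dict[str, Slot] = field(default_factory=dict)

    def add_slot(self, slot: Slot) -> None:
        self.slots[slot.name] = slot


# Hypothetical example concept; names and numbers are made up for illustration.
eat = Concept("EAT")
eat.add_slot(Slot("AGENT", value="ANIMAL", strength=0.9, aspect="constant"))
eat.add_slot(Slot("THEME", value="FOOD", strength=0.8, aspect="constant"))
print(eat.slots["AGENT"])
```

Keeping the facets on the slot rather than on the concept lets two concepts share a filler while differing in how certain or how constant the relation is.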
32. Aligning ontologies
- Instead of building an ontology (with all the problems that entails), can one just combine existing ones?
- Find the most popular concepts and organization
- Merge the definitions
- Identify individual errors and problem areas
- I tried this in 1996-97 (Hovy, LREC 1998)
- Project funded by IBM: align Upper Models of CYC, Penman, and Mikrokosmos
- Built alignment routines and created merge
- Conceptual mismatch problems were significant!
- Since then, a fairly large group of researchers has been doing this; there is a competition every year
33. General alignment and merging
- Goal: find attachment point(s) in ontology for node/term from somewhere else (ontology, website, metadata schema, etc.)
- It's hard to do manually and very hard to do automatically: the system needs to understand the semantics of the entities to be aligned
34. Outcome 1: Good and Misleading
- S@foodstuff<food
- a substance that can be used or prepared for use as food
- superconcepts: (S@food)
- M@FOODSTUFF (COMB 13.355, NAME 91, DEF 10.00, TAX 0.140)
- a substance that can be used or prepared for use as food
- superconcepts: (M@FOOD M@MATERIAL)
----------------------------------------
- S@library>bibliotheca
- a collection of literary documents or records kept for reference
- superconcepts: (S@aggregation)
- M@LIBRARY (COMB 2.742, NAME 59, DEF 3.57, TAX 0.000)
- a place in which literary and artistic materials such as books periodicals newspapers pamphlets and prints are kept for reading or reference an institution or foundation maintaining such a collection
- superconcepts: (M@ACADEMIC-BUILDING)
A document collection or a place?
35. Outcome 2: Unclear and Error!
- S@geisha
- a Japanese woman trained to entertain men with conversation and singing and dancing
- superconcepts: (S@adult female S@Japanese<Asian)
- M@GEISHA (COMB 1.540, NAME 46, DEF 2.27, TAX 0.000)
- a Japanese girl trained as an entertainer to serve as a hired entertainer to men
- superconcepts: (M@ENTERTAINMENT-ROLE)
----------------------------------------
- S@archipelago
- many scattered islands in a large body of water
- superconcepts: (S@dry land)
- M@ARCHIPELAGO (COMB 1.522, NAME 131, DEF 1.33, TAX 0.000)
- a sea with many islands
- superconcepts: (M@SEA)
A person or a function?
Land or sea?
36. When are two concepts the same? Guarino's Identity Criteria
- Material: the stuff
- Topological: the shape
- Morphological: the parts
- Functional: the use
- Meronymical: the members
- Social: the societal role
- (see also Pustejovsky's qualia)
A water glass, before and after being smashed; the ACL in 1964 and in 2064
37. Shishkebobs (Hovy et al. in prep)
- Library ISA Building (and hence can't buy things)
- Library ISA Institution (and hence can buy things)
- SO: Building / Institution / Location: a Library is all these
- Also: Country / Nation / Government (GPE)
- France: the land, the people, and the rulers
- Also: Field-of-Study / Activity / Result-of-Process
- (Science, Medicine, Architecture, Art)
- Also: Company / Product / Stock
- He worked at Coke, drank Coke, and owned Coke (shares)
- We found about 400 potential shishkebobs
- Shishkebobs: concept senses or metonymy rings? A continuum, from on-the-fly meaning shadings to full metonymy
- Link regular alternation possibilities at general level in ontology; allow meaning shift for semantic interpretation, where needed
- Using shishkebobs makes merging ontologies easier (possible?): you respect each ontology's perspective
38. Domain models/ontologies
- There's tons of work building domain-specific ontologies: see the web
- Artificial Intelligence
- Databases
- Company products
- Government codes
- Domain expertise capture
- etc.
- Not the focus of this lecture: we continue with general lexico-semantics
40. Semantic annotation projects
- Goal: corpus of pairs (sentence, semantic rep)
- Process: humans add information to sentences (and their parses)
- Recent projects (figure relating projects to annotation layers: syntax, word senses, verb frames, noun frames, coref links, ontology)
- Penn Treebank (Marcus et al. 99)
- PropBank (Palmer et al. 03)
- NomBank (Meyers et al. 03)
- FrameNet (Fillmore et al. 04)
- Prague Dependency Treebank (Hajic et al. 02)
- TIGER/SALSA Bank (Pinkal et al. 04)
- I-CAB, Greek banks
- Interlingua Annotation (Dorr et al. 04)
- OntoNotes (Weischedel et al. 05)
41. OntoNotes project structure
[Figure: project structure diagram; labels only]
- Sites: ISI, Colorado, BBN, Penn
- Annotation: Verb Senses and verbal ontology links; Noun Senses and targeted nominalizations; Propositions; Treebank Syntax; Coreference; Ontology Links and resulting structure
- Training Data, Decoders, Summarization, Translation
- Syntactic structure; Predicate/argument structure; Disambiguated nouns and verbs; Coreference links
Goal: In 4 years, annotate text corpora of 1 million words of English, Chinese, and Arabic text
42. Focus on word senses
- Create a very large corpus of text by annotating JUST the semantic sense(s) of every noun and verb (and later, adjective and adverb)
- Why?
- Enable computer programs to learn to assign correct senses automatically, in search of improved machine translation, text summarization, question answering, (web) search, etc.
- Begin to understand the distribution of principal semantic features (animacy, concreteness, etc.) at large scale.
43. Example of result
- 3@wsj/00/wsj_0020.mrg@wsj: Mrs. Hills said many of the 25 countries that she placed under varying degrees of scrutiny have made genuine progress '' on this touchy issue .
- Propositions
- predicate: say (pb sense 01, on sense 1)
- ARG0: Mrs. Hills 10
- ARG1: many of the 25 countries that she placed under varying degrees of scrutiny have made genuine progress '' on this touchy issue
- predicate: make (pb sense 03, on sense None)
- ARG0: many of the 25 countries that she placed under varying degrees of scrutiny
- ARG1: genuine progress '' on this touchy issue
OntoNotes Normal Form (ONF)
44. OntoNotes annotation procedure
- Sense creation process goes by word
- Expert creates meaning options (shallow semantic senses) for verbs, nouns, adjs, advs; follows PropBank process (Palmer et al.)
- Expert creates definitions, examples, differentiating features
- (Ontology insertion: at the same time, expert groups equivalent senses from different words and organizes/refines Omega ontology content and structure; process being developed at ISI)
- Sense annotation process goes by word, across docs
- Process developed in PropBank
- Annotators manually
- See each sentence in corpus containing the current word (noun, verb, adjective, adverb) to annotate
- Select appropriate senses (ontology concepts) for each one
- Connect frame structure (for each verb and relational noun)
- Coref annotation process goes by doc
- Annotators connect co-references within each doc
45. Ensuring trustworthiness/stability
- Problematic issues
- What senses are there? Are the senses stable/good/clear?
- Is the sense annotation trustworthy?
- What things should corefer?
- Is the coref annotation trustworthy?
- Approach (from PropBank): the 90% solution
- Sense granularity and stability: test with annotators to ensure agreement at 90% on real text
- If not, then redefine and re-do until 90% agreement reached
- Coref stability: only annotate the types of aspects/phenomena for which 90% agreement can be achieved
46. Sense annotation procedure
- Sense creator first creates senses for a word
- Loop 1
- Manager selects next nouns from sensed list and assigns annotators
- Programmer randomly selects 50 sentences and creates initial Task File
- Annotators (at least 2) do the first 50
- Manager checks their performance
- 90% agreement and few or no NoneOfAbove: send on to Loop 2
- Else: Adjudicator and Manager identify reasons, send back to Sense creator to fix senses and defs
- Loop 2
- Annotators (at least 2) annotate all the remaining sentences
- Manager checks their performance
- 90% agreement and few or no NoneOfAbove: send to Adjudicator to fix the rest
- Else: Adjudicator annotates differences
- If Adj agrees with one Annotator 90% of the time, then ignore the other Annotator's work (assume a bad day for the other); else, if Adj agrees with both about equally often, then assume bad senses and send the problematic ones back to Sense creator
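The two checks in this procedure can be sketched in a few lines. This is a simplified illustration, not the project's tooling: the function names, the 5% ceiling standing in for "few or no NoneOfAbove", and the choice to measure adjudicator agreement only on disputed items are all assumptions.

```python
NONE_OF_ABOVE = "NoneOfAbove"


def agreement(a, b):
    """Fraction of items on which two annotators chose the same sense."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def none_rate(labels):
    """Fraction of items an annotator tagged NoneOfAbove."""
    return labels.count(NONE_OF_ABOVE) / len(labels)


def loop1_decision(a, b, threshold=0.90, max_none_rate=0.05):
    """Manager's check: pass on to Loop 2 or return the word to the sense creator.

    max_none_rate is an assumed numeric reading of "few or no NoneOfAbove".
    """
    ok = (agreement(a, b) >= threshold
          and max(none_rate(a), none_rate(b)) <= max_none_rate)
    return "loop2" if ok else "revise_senses"


def adjudicate(adj, a, b, threshold=0.90):
    """Loop 2 fallback, measured on the items where the annotators disagree."""
    disputed = [i for i in range(len(a)) if a[i] != b[i]]
    if not disputed:
        return "no_disputes"
    with_a = sum(adj[i] == a[i] for i in disputed) / len(disputed)
    with_b = sum(adj[i] == b[i] for i in disputed) / len(disputed)
    if with_a >= threshold:
        return "keep_a"          # assume annotator B had a bad day
    if with_b >= threshold:
        return "keep_b"          # assume annotator A had a bad day
    return "revise_senses"       # neither is clearly right: suspect the senses


ann_a = ["s1"] * 9 + ["s2"]
ann_b = ["s1"] * 10
print(agreement(ann_a, ann_b), loop1_decision(ann_a, ann_b))
```

Any case that is not a clear 90% match with one annotator falls through to sense revision here, which is broader than the slide's "about equally often" condition.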
47. STAMP annotation interface
- Built for PropBank (Palmer, UPenn)
- Target word
- Sentence
- Word sense choices (no mouse!)
48. Pre-project test: Can it be done?
- Annotation process and tools developed and tested in PropBank (Palmer et al., U Colorado)
- Typical results (10 words of each type, 100 sentences each); each cell shows Round 1 → Round 2 → Round 3; "-" marks a round with no reported value

        tagger agreement     senses             time (min/100 tokens)
verbs   .76 → .86 → .91      4.5 → 5.2 → 3.8    30 → 25 → 25
nouns   .71 → .85 → .95      7.3 → 5.1 → 3.3    28 → 20 → 15
adjs    .87 → - → .90        2.8 → - → 5.5      24 → - → 18

(by comparison, agreement using WordNet senses is 70%)
49. Setting up: Word statistics
Number of word tokens/types in 1000-word corpus (95% confidence intervals on 85213 trials):

             tokens   types
verbs        125.3    87.3
nouns        446.6    288.7
adjectives   103.2    80.6

Nouns: approx. 50% of tokens. Monosemous nouns (but not names etc.): 14.6% of tokens, 25.6% of nouns

Polysemy of verbs and nouns (250K WSJ):

                    verbs         nouns
total               2341          5421
1 WN sense          428 (18%)     1751 (32%)
2 or 3 senses       966 (41%)     2159 (40%)
4 or more senses    947 (40%)     1511 (28%)

Coverage in WSJ and Brown Corpus of most frequent N polysemous-2 nouns (total 205442 tokens):

N       tokens covered   coverage
100     76420            37%
500     140453           68%
1000    167715           82%
1500    181412           88%
2000    189641           92%
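The percentages in the polysemy and coverage tables can be recomputed from the raw counts as a sanity check. The counts below are copied from the slide; the variable and function names are illustrative.

```python
# Token counts covered by the N most frequent polysemous nouns (from the slide).
TOTAL_TOKENS = 205442
COVERED = {100: 76420, 500: 140453, 1000: 167715, 1500: 181412, 2000: 189641}

# Word counts by number of WordNet senses: (1 sense, 2 or 3 senses, remaining).
POLYSEMY = {"verbs": (428, 966, 947), "nouns": (1751, 2159, 1511)}


def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


coverage = {n: percent(count, TOTAL_TOKENS) for n, count in COVERED.items()}
polysemy = {pos: [percent(c, sum(counts)) for c in counts]
            for pos, counts in POLYSEMY.items()}

print(coverage)
print(polysemy)
```

The three polysemy rows sum to the stated totals (2341 verbs, 5421 nouns), which is why the last row is read as covering every word with four or more senses.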
51. We want to go from lexemes to concepts, and we use senses as the bridge
- Lexical space
- Words
- drive
- steer
- fahren
- steuern
- besturen
- rijden
- drijven
- Sense space
- Word senses
- Drive1
- Drive2
- Drive3
-
- Manage
- Concept space
- Concepts
- ?
- How many concepts?
- How related to senses?
52. Which, and how many, senses? Graduated refinement
1. Initialization: Given a term (word), collect several dozen sentences containing it. Also collect definitions from various dictionaries
2. Cluster the word's senses into preliminary, loosely similar groups
3. Differentiation process: Begin a tree structure with all the groups at the root
4. Considering all the groups, identify the group most different from the others
   - If you can find one clearly most different group, write down its most important distinction explicitly; this will later become the differentium and be formalized axiomatically
   - If you cannot find any distinctions by which to further subdivide the group, stop elaborating this branch and continue with some other branch
   - If you can find several distinctions that subdivide the group in different, but equally valid, ways, also stop elaborating this branch and continue with some other branch
5. Create two new branches in the evolving tree structure, putting the new group under one, and leaving the other groups under the other
6. Repeat from step 4, considering separately the group(s) under each branch
7. Concept formation: When all branches have stopped, the ultimate result is a tree of increasingly fine-grained distinctions, which are explicitly listed at each branch point. Each leaf becomes a single concept, not further differentiable in the current task/application/domain. Each distinction must be formalized as an axiom that holds for the branch it is associated with
8. Insertion into ontology: Starting from the top, visit each branch point. Do the two branches have approximately the same meaning?
   - If so, insert them into the ontology at the appropriate point and stop traversing this branch
   - If not, split the tree and repeat step 8 separately for each branch. Repeat until done
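The differentiation loop (steps 3 to 6) builds a binary tree in which each branch point records one distinction. The sketch below shows only that bookkeeping. The human judgment of which group is "most different" is passed in as a callback; the scripted judge, the group labels, and the distinction strings are hypothetical stand-ins for an expert's decisions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A judge inspects the groups on one branch and returns
# (most_different_group, distinction), or None to stop elaborating the branch
# (no distinction found, or several equally valid ones).
Judge = Callable[[List[str]], Optional[Tuple[str, str]]]


@dataclass
class Branch:
    groups: List[str]
    differentium: Optional[str] = None     # distinction recorded at this branch point
    split_off: Optional["Branch"] = None   # the group judged most different
    rest: Optional["Branch"] = None        # all the remaining groups


def refine(groups: List[str], judge: Judge) -> Branch:
    """Repeatedly split off the most different group until the judge stops."""
    node = Branch(groups=list(groups))
    if len(groups) < 2:
        return node
    verdict = judge(groups)
    if verdict is None:
        return node
    odd, distinction = verdict
    node.differentium = distinction
    node.split_off = refine([odd], judge)
    node.rest = refine([g for g in groups if g != odd], judge)
    return node


def leaves(node: Branch) -> List[List[str]]:
    """Each leaf is a candidate concept (step 7)."""
    if node.split_off is None or node.rest is None:
        return [node.groups]
    return leaves(node.split_off) + leaves(node.rest)


def scripted_judge(script: List[Tuple[str, str]]) -> Judge:
    """Hypothetical stand-in for the expert: apply distinctions in a fixed order."""
    def judge(groups: List[str]) -> Optional[Tuple[str, str]]:
        for group, distinction in script:
            if group in groups:
                return group, distinction
        return None
    return judge


# Made-up sense groups and distinctions for the word "drive".
script = [("negotiate", "idiomatic bargaining sense"),
          ("profession", "done as an occupation"),
          ("psych", "change of mental state")]
tree = refine(["move", "psych", "profession", "negotiate"], scripted_judge(script))
print(leaves(tree))
```

Step 8 (insertion into the ontology) is not modelled; it would walk the same tree from the root and compare the two subtrees at each branch point.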
53. An exercise: drive
1. Drive the demons out of her and teach her to stay away from my husband!!
2. Shortly before nine I drove my jalopy to the street facing the Lake and parked the car in shadows.
3. He drove carefully in the direction of the brief tour they had taken earlier.
4. Her scream split up the silence of the car, accompanied by the rattling of the freight, and then Cappy came off the floor, his legs driving him hard.
5. With an untrained local labor pool, many experts believe, that policy could drive businesses from the city.
6. Treasury Undersecretary David Mulford defended the Treasury's efforts this fall to drive down the value of the dollar.
7. Even today range riders will come upon mummified bodies of men who attempted nothing more difficult than a twenty-mile hike and slowly lost direction, were tortured by the heat, driven mad by the constant and unfulfilled promise of the landscape, and who finally died.
8. Cows were kept in backyard barns, and boys were hired to drive them to and from the pasture on the edge of town.
9. He had to drive the hammer really hard to get the nail into that plank!
10. She learned to drive a bulldozer from her uncle, who was a road maker.
11. I used to drive a taxi (for work) before I went to night school.
12. Beware: Ralph drives a hard bargain; you will probably lose all your money.
54. Grouping the senses of drive
drive (1, 2, ..., 12)
55. Deeper semantic drive
drive (1, 2, ..., 12)
<psych>
<profession>
11 taxi
<negotiate>
12 drive a hard bargain
56. Ontologizing drive
<move in desired direction> (1,2,3,4,5,6,8,9,10)
57. From lexemes to concepts
- Lexical space
- Words
- Monolingual
- drive
- steer
- fahren
- rijden
- Sense space
- Word senses
- Multilingual
- Drive1
- Drive2
- Drive3
- Concept space
- Concepts
- Interlingual (?)
58. Time for some fun
59. Doing this seriously: OntoNotes sense creation interface
- Input word
- Tree of senses being created
- Working area: write defs, exs, features, etc.
- Google or dictionary sense list for ideas
61. Trying to find real concepts
- What tests can one use to ensure concept-hood and not just sense-hood?
- One idea: Multilinguality
- Can one arrange wordsenses/concepts to form an interlingua termset?
- This would guarantee (some degree of) semantic nature
- How to do this?
- Translate a text several times
- Compare the differences: presumably they derive from the same source, which must therefore pack together all the translation meanings
62Finding meaning via translation difference
K1E1: Starting on January 1 of next year, SK
Telecom subscribers can switch to less expensive
LG Telecom or KTF. The Subscribers cannot
switch again to another provider for the first 3
months, but they can cancel the switch in 14
days if they are not satisfied with services
like voice quality.
K1E2: Starting January
1st of next year, customers of SK Telecom can
change their service company to LG Telecom or
KTF. Once a service company swap has been made,
customers are not allowed to change companies
again within the first three months, although
they can cancel the change anytime within 14
days if problems such as poor call quality are
experienced.
- Additional/less information
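One rough way to surface such differences automatically is a token-level diff between two translations. The sketch below uses Python's standard difflib; the function name translation_differences is hypothetical, and a plain word diff is only a stand-in for a real semantic comparison.

```python
import difflib

def translation_differences(a, b):
    """Return the spans where two translations of one source diverge."""
    a_tokens, b_tokens = a.split(), b.split()
    matcher = difflib.SequenceMatcher(a=a_tokens, b=b_tokens, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replaced, deleted, or inserted spans
            diffs.append((tag, " ".join(a_tokens[i1:i2]), " ".join(b_tokens[j1:j2])))
    return diffs

# Clauses taken from the two English translations (K1E1 and K1E2) above.
e1 = "SK Telecom subscribers can switch to less expensive LG Telecom or KTF"
e2 = "customers of SK Telecom can change their service company to LG Telecom or KTF"

for tag, left, right in translation_differences(e1, e2):
    print(f"{tag:8} | {left!r} | {right!r}")
```

Spans present in only one translation (such as "less expensive") correspond to the additional/less information case; replaced spans point to lexical choices that a shared meaning representation would have to cover.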
63Getting at meaning: Two translations of a
Japanese original text
- This year,
- too,
- in addition to
- the birth
- of Mitsubishi Chemical,
- which has already been announced,
- other rather large-scale mergers
- may continue,
- and
- be recorded
- as a year of mergers.
- This year,
- which has already seen the announcement of
- the birth
- of Mitsubishi Chemical Corporation
- as well as
- the continuous numbers of big mergers,
- may
- too
- be recorded
- as the year of the merger
- for all we know.
- Problems for semantic representation
- Lexical differences
- Dependency differences
64Interlinguas in MT
- The idea of an interlingua is intriguing
- Example use in Machine Translation
- For transfer systems, need n(n-1) sets of rules for n
languages (L1→L2, L2→L1, L1→L3, L3→L1, ...)
- For interlingual systems, need only 2n sets of
rules (Lx→IL→Ly)
- Interlingua is the deep semantic notation of
the meaning (the idea) behind the text
- An Interlingua is a system of symbols and
notation to represent the meaning(s)
of (linguistic) communications with the following
features
- language-independent
- formally well-defined
- expressive to arbitrary level of semantics
- non-redundant
word senses concepts
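The growth of the two architectures can be checked by enumeration. This sketch assumes one transfer rule set per ordered language pair (which gives n(n-1)) and, for the interlingual design, one analysis and one generation rule set per language (which gives 2n); the function names are hypothetical.

```python
from itertools import permutations

def transfer_rule_sets(languages):
    """One rule set per ordered (source, target) language pair."""
    return list(permutations(languages, 2))

def interlingua_rule_sets(languages):
    """One analysis set (Lx -> IL) and one generation set (IL -> Lx) per language."""
    analysis = [(lang, "IL") for lang in languages]
    generation = [("IL", lang) for lang in languages]
    return analysis + generation

languages = ["L1", "L2", "L3", "L4", "L5"]
for n in range(2, len(languages) + 1):
    subset = languages[:n]
    print(n, len(transfer_rule_sets(subset)), len(interlingua_rule_sets(subset)))
```

Under this counting the two designs need the same number of rule sets at three languages, and the interlingual design needs fewer from four languages onward.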
65Structure of EuroWordNet
Slide by Wim Peters, U of Sheffield, 2001
- Based on WordNet (Miller & Fellbaum, Princeton
University)
- WordNet currently has about 120,000 word sense
groupings (synsets)
- EuroWordNet: language-specific wordnets for 8
languages, all independent but connected
- English, Spanish, Italian, Dutch
- Wordnets represent unique concept lexicalization
patterns in 8 languages, based on
sense inventories of mono- and bilingual
dictionaries
- BUT NOT AN INTERLINGUA: synsets of the various
languages are linked to the Inter-Lingual Index (ILI),
which serves as an interlingua mapping, but there is no
single central term/concept set
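66EuroWordNet architecture
Slide by Wim Peters, U of Sheffield, 2001
[Figure: language-specific wordnets (English: move, travel, go; Dutch: bewegen, reizen, gaan, rijden, berijden; Spanish: cabalgar, jinetear; Italian: guidare, andare, muoversi) are linked through the Inter-Lingual-Index (ILI record: drive) to the Top-Ontology and the Domain-Ontology. Link labels I, II and III appear in the diagram.]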
67Multilinguality: Word-sense-concept 1
Sense Space
eat1
eat3
eat2
Word Space
eat
68Multilinguality: Word-sense-concept 2
Sense Space
eat1
eat3
eat2
Word Space
eat
69Multilinguality: Word-sense-concept 3
Other languages may suggest refined
conceptualization
IngestFood
Concept Space
IngestFood-Human
IngestFood-Animal
Sense Space
eat1
eat3
eat2
Word Space
eat
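The refinement step can be sketched as follows: if a second language uses different words for senses pooled under one concept, each distinct word suggests a candidate sub-concept. The German data (essen for people, fressen for animals) and the choice of which eat sense maps to which word are assumptions for illustration, as is the function name refine_concept.

```python
def refine_concept(concept, sense_translations):
    """Propose sub-concepts when another language lexicalizes the senses differently.

    sense_translations maps each sense label to the word a second language
    uses for it. Senses that share a translation stay together.
    """
    groups = {}
    for sense, foreign_word in sense_translations.items():
        groups.setdefault(foreign_word, []).append(sense)
    if len(groups) <= 1:
        # The other language makes no distinction: keep the concept whole.
        return {concept: sorted(sense_translations)}
    return {f"{concept}-{word}": sorted(senses) for word, senses in groups.items()}

# Assumed sense-to-word mapping for German.
german = {"eat1": "essen", "eat2": "fressen", "eat3": "essen"}
print(refine_concept("IngestFood", german))
```

A human would still have to name the resulting sub-concepts (for example IngestFood-Human and IngestFood-Animal) and decide whether the distinction is worth keeping.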
70Lexicon, Senses and Concepts in Omega
- Lexical space
- Words
- drive
- steer
- fahren
- steuern
- besturen
- rijden
- drijven
- Sense space
- Word senses
- Drive1
- Drive2
- Drive3
-
- Manage
- Ride
drive/steer
ride1
manage3
drive<propel
71OntoNotes procedure for building the ontology
- Goal: Create the ontology: a repository of OntoNotes
senses, organized to provide additional
information
- Creation procedure
- Start with framework (Upper Structure) from ISI's
Omega ontology
- Contains verb frame structures from PropBank,
FrameNet, LCS, WordNet
- Gather all senses created for annotation
- Include definitional features defined for senses
- Concepts: Identify and pool together senses
with same meaning
- Look for shared features
- Recognize paraphrases to avoid redundancy
- Arrange close senses together to share features
- Enable eventual reasoning (buy ↔ sell)
- Validation: Measure agreement between poolers
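The pooling and validation steps might be approximated as below. This is a sketch under stated assumptions, not the OntoNotes procedure itself: the greedy feature-overlap pooling, the 0.5 threshold, the pairwise agreement measure, and the sense names and features are all hypothetical.

```python
from itertools import combinations

def pool_by_features(senses, threshold=0.5):
    """Greedy pooling: a sense joins the first pool whose features overlap enough."""
    pools = []
    for name, feats in senses.items():
        for pool in pools:
            union = feats | pool["features"]
            overlap = len(feats & pool["features"]) / len(union) if union else 0.0
            if overlap >= threshold:
                pool["senses"].append(name)
                pool["features"] &= feats  # a pool keeps only the shared features
                break
        else:
            pools.append({"senses": [name], "features": set(feats)})
    return pools

def pairwise_agreement(pooling_a, pooling_b):
    """Fraction of sense pairs on which two poolers make the same together/apart call."""
    def together(pooling):
        return {frozenset(p) for pool in pooling for p in combinations(sorted(pool), 2)}
    senses = sorted({s for pool in pooling_a for s in pool})
    pairs = [frozenset(p) for p in combinations(senses, 2)]
    if not pairs:
        return 1.0
    a, b = together(pooling_a), together(pooling_b)
    return sum((p in a) == (p in b) for p in pairs) / len(pairs)

# Hypothetical senses with hypothetical definitional features.
senses = {
    "buy1": {"action", "financial", "acquire"},
    "purchase1": {"action", "financial", "acquire"},
    "drive1": {"action", "vehicle"},
}
pools = pool_by_features(senses)
print([pool["senses"] for pool in pools])

pooler_a = [["buy1", "purchase1"], ["drive1"]]
pooler_b = [["buy1"], ["purchase1", "drive1"]]
print(pairwise_agreement(pooler_a, pooler_b))
```

Pairwise agreement is used here only because it is simple to state; a chance-corrected measure would be a natural alternative.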
72Some OntoNotes features from the word senses
entity 965
artifact 272
relation (advisor, agent, aid) 199
state 188
quantity 171
structure 160
activity (acquisition, act, battle) 156
role (advisor, agent, banker) 147
physical 123
social 112
action (act, appeal, bid, call) 103
quality 97
group 86
person 83
organization 59
abstract 58
location 58
event 57
document 53
individual 46
form 46
building 45
business 40
device 37
legal 36
substance 34
small 34
mental 32
measure 30
unit 26
concrete 22
enterprise 19
amount 19
portion 19
military 19
people 19
vehicle 18
official 18
large 18
natural 18
collection 17
financial 17
part 17
object 16
human 16
cylindrical 16
- Interesting correspondences with theoretical
work
- Linguistics: Comrie's syntactic and semantic
features
- Knowledge Representation / AI: Upper Model
features of SUMO, CYC, etc.
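Feature counts of this kind can be produced by tallying how many senses carry each definitional feature. A minimal sketch, with hypothetical senses and feature assignments:

```python
from collections import Counter

def feature_counts(sense_features):
    """Count, for each feature, the number of senses that carry it."""
    counts = Counter()
    for feats in sense_features.values():
        counts.update(set(feats))  # each feature counts once per sense
    return counts

# Hypothetical senses; only the feature names echo the list above.
sense_features = {
    "bank1": ["entity", "organization", "financial"],
    "bank2": ["entity", "location", "natural"],
    "truck1": ["entity", "artifact", "vehicle"],
    "drive1": ["action", "vehicle"],
}

counts = feature_counts(sense_features)
for feature, n in counts.most_common(3):
    print(feature, n)
```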
73OntoNotes sense pooler interface
- Sense list
- Pool being built: list of senses
- Subordination link to another pool
- Features of this pool
74Outline
- Problems when creating concepts
- Ontological Semantics
- The high-level abstractions
- The content-level concepts
- The OntoNotes project
- A methodology for creating word senses
- From senses to concepts
- The Omega Ontology
- Conclusion
75Omega content and framework
www.omega.edu
Goal: one environment for various ontologies and
resources
- Concepts: 120,604 concept/term entries (76 MB)
- WordNet (Princeton; Miller & Fellbaum)
- Mikrokosmos (NMSU; Nirenburg et al.)
- Penman Upper Model (ISI; Bateman et al.)
- 25,000 noun-noun compounds (ISI; Pantel)
- Lexicon / sense space
- 156,142 English words; 33,822 Spanish words
- 271,243 word senses
- 13,000 frames of verb argument structure with case
roles
- LCS case roles (Dorr), 6.3 MB
- PropBank roleframes (Palmer et al.), 5.3 MB
- FrameNet roleframes (Fillmore et al.), 2.8 MB
- WordNet verb frames (Fellbaum), 1.8 MB
- Associated information (not all complete)
- WordNet subject domains (Magnini & Cavaglia), 1.2 MB
- Various relations learned from text (ISI;
Pantel)
- TAP domain groupings (Stanford; Guha)
- SemCor term frequencies, 7.5 MB
- Topic signatures (Basque U; Agirre et al.), 2.7 GB
- Instances: 10.1 GB
- 1.1 million persons harvested from text
- 765,000 facts harvested from text
- 5.7 million locations from USGS and NGA
- Framework (over 28 million statements of
concepts, relations, instances)
- Available in PowerLoom
- Instances in RDF
- With database/MySQL
- Online browser
- Clustering software
- Term and ontology alignment software
76Omega browser Mammoth
77Omega hierarchy display
78Omega sense frames
79Outline
- Problems when creating concepts
- Ontological Semantics
- The high-level abstractions
- The content-level concepts
- The OntoNotes project
- A methodology for creating word senses
- From senses to concepts
- The Omega Ontology
- Conclusion
80Shallow and deep ontologies
- Omega is a language-based ontology
- Concepts defined via word senses; annotator
agreement constrains and validates granularity
- Granularity: how many senses for each word?
- Associated information is subsumption hierarchy
and case frames
- Deep(er) semantics
- Deeper knowledge (definitional features,
subconcept differentiae, inferences, etc.) is
not present
- In places: temporal relations using
Allen/Hobbs/OWL
- Future, deeper version of annotation will require
and motivate more semantic ontology
81What would be nice?
- A small number of (globally) standardized
ontologies and/or core theories of important
aspects (time, space, social dynamics, motion,
privacy, etc.)
- Solid theoretical frameworks for developing
ontological notions and theories, and for testing
them
- A rich online world of ontologies, domain models,
etc., with appropriate ontology creation tools
and methodologies
- (Semi-)automated techniques for rapidly finding,
absorbing, and testing existing ontologies for
your own applications
- Tools that automatically create new knowledge
bases on demand, in accord with given ontologies
- Ontology and knowledge base support technology
that can handle info that may be inconsistent,
tenuous, partial, and growing
82Thank you!