An Empirical Methodology for Word Sense Creation

1
An Empirical Methodology for Word Sense Creation
  • Eduard Hovy
  • Information Sciences Institute
  • University of Southern California
  • www.isi.edu/hovy

2
How many colours are there?
colour
light, dark
> red
> blue
> yellow/green
> green/yellow
> brown
3
Why? Microtheories of colour
4
Microtheorizing
  • Straightforward approach
  • Concepts: collect terms, group, taxonomize
  • Lexicons: collect words, connect
  • Roughly 1-1 concepts-to-words
  • Problem: close but non-identical meaning overlaps across languages' terms
  • Solution: complex mappings (e.g., EuroWordNet's ILI)
  • Microtheory approach
  • Understand the phenomenon
  • Concepts: basic primitives of the theory
  • Lexicons: words defined in terms of primitives
  • Push the complexity into the lexicon
  • Problem: which microtheory? Need one for each meaning complex!
  • Solution: ? (See Mikrokosmos (Nirenburg and Raskin 2000))

5
Toward building a lexicon of meanings or
concepts
  • In practice, you can't build microtheories for everything
  • So you have to build more of a terminology bank than a true ontology
  • Start with words (terms in the domain, in the dictionary, etc.),
  • then separate out word senses and bring together synonyms,
  • and then group together similar meaning clusters for later use (ontology inheritance and inference),
  • while recording the essential features that differentiate the groups from each other (hopefully, some not-too-large set of features)
  • What does this involve?

6
Let's do an exercise
  • Create your own ontology of the following 24
    words
  • apples, beans, beef, bread, cake, carrots,
    cheese, cookies, eggs, ground beef, kimchi,
    mushrooms, peaches, peas, pies, pork, potatoes,
    pudding, rice, sausages, scrambled eggs, toast,
    tomatoes, wheat
  • build the taxonomy
  • provide some important characteristics

7
What did you do?
  • The easy part
  • vegetables, fruits, meats... but what of tomatoes? Is your experience right, or are the biologists right?
  • Or, what about
  • starches, proteins, greens: this is what's inside. Is this a better organization? Should you have both?
  • The harder part
  • Eggs and scrambled eggs; milk and cheese; pies... Methods of preparation: define somewhere else, and then somehow apply this to the basic foodstuffs?
  • What is right?
  • If I organized them by color or size, would that be wrong? By sweetness? What if I were diabetic?

8
Decisions when ontologizing
  • Should you create the term?
  • Where should the term go relative to the other terms? (species)
  • What is special/unique/different about this term? (differentium/ae)
  • How do you know you're right?
  • How do you decide between two or more alternatives?
  • Should you record (and re-use) the differentiae?

Store everything in an ontology
9
The zones of ontologies
10
Outline
  1. Problems when creating concepts
  2. Ontological Semantics
  3. The high-level abstractions
  4. The content-level concepts
  5. The OntoNotes project
  6. A methodology for creating word senses
  7. From senses to concepts
  8. The Omega Ontology
  9. Conclusion

11
CYC
  • Creator: Lenat (CYCorp, Austin, Texas), since the 1990s
  • CYC: largest and richest ontology; millions of axioms
  • ResearchCYC: 25,571 concepts
  • RCYC (which was translated into RDF) omits all second-order concept expressions (for example, functional operator expressions), and so it has a lot of missing supers
  • R-CYC in Omega
  • Lexical items not in Omega yet
  • Missing supers require a dummy root, Protocol Root, which now has 95 arbitrary child concepts

12
Top of ResearchCYC
  • Largest and most developed ontology
  • Principally aimed at AI: inference-heavy and NL-light
  • Very tangled network; hard to understand and use unless you absorb CYC's philosophy / methodology
  • Full CYC, for sale: over 1M axioms
  • Has been tried by many research groups in the past; successful adoption rate low

13
SUMO
  • Creator: Pease and consortium; recent (USA)
  • Suggested as standard: Suggested Upper Merged Ontology
  • 1,653 concepts
  • No lexical items
  • Adopts a more traditional KR-style / Description Logic approach, with lots of internal reasoning-mechanism constructs
  • More hierarchical than RCYC or DOLCE, with few uplinks pointing to expressions
  • Omega version not complete and a little buggy

14
Some top parts of SUMO
  • CaseRole
  • Agent-rel
  • Patient
  • Result
  • Resource
  • ResourceUsed
  • Instrument
  • ComputerRunning
  • StandardOutputDevice
  • DataProcessed
  • Experiencer
  • Origin
  • Destination
  • Direction
  • Path
  • Entity
  • Physical
  • Object
  • Process
  • Financial Asset
  • Abstract
  • Quantity
  • Graph
  • Attribute
  • Thing
  • Permission
  • Obligation
  • InheritableSumorelation
  • IntentionalSumorelation
  • BinarySumoRelation
  • CaseRole
  • Product
  • Computational System

15
CYC Event and SUMO Process
  • Definition (CYC Event): An important specialization of Situation, and thus also of IntangibleIndividual and TemporallyExistingThing (qq.v.). Each instance of Event is a dynamic situation in which the state of the world changes; each instance is something one would say "happens". Events are intangible because they are changes per se, not tangible objects that effect and undergo changes. Notable specializations of Event include Event-Localized, PhysicalEvent, Action, and GeneralizedTransfer. Events should not be confused with TimeIntervals (q.v.). The temporal bounds of events are delineated by time intervals, but in contrast to many events, time intervals have no spatial location or extent.
  • Definition (SUMO Process): Intuitively, the class of things that happen and have temporal parts or stages. Examples include extended events like a football match or a race, actions like Searching and Reading, and biological processes. The formal definition is: anything that lasts for a time but is not an Object. Note that a Process may have participants "inside" it which are Objects, such as the players in a football match. In a 4D ontology, a Process is something whose spatiotemporal extent is thought of as dividing into temporal stages roughly perpendicular to the time-axis.

16
DOLCE
  • Author: Guarino et al. (LADSEB, Italy)
  • Upper Model only; focus on very abstract conceptualizations
  • Approx. 500 concepts
  • No lexical items
  • Defined in XML/RDF (very clean and precise use of formalism)
  • For Omega, DOLCE was converted from XML/RDF, omitting some expressions Omega can't yet handle (e.g., most superclass relations have as range some expression, i.e., one that uses intersection, union, a quantifier, or a restriction object). So DOLCE doesn't appear very hierarchical in Omega
  • Some very nice things in DOLCE, but it's hard to see them when nearly everything looks like a top-level object

17
DOLCE examples: Particular, Endurant
  • Particular
  • Perdurant
  • Event
  • Accomplishment
  • Achievement
  • Stative
  • Process
  • Endurant
  • Abstract
  • Region
  • Abstract-region
  • Physical-region
  • Quality
  • Physical-quality
  • Temporal-quality

18
RDF definition for Endurant
19
DOLCE Process
  • Definition: Within stative occurrences, we distinguish between states and processes according to homeomericity: sitting is classified as a state but running is classified as a process, since there are (very short) temporal parts of a running that are not themselves runnings. In general, processes differ from situations because they are not assumed to have a description on which they depend. They can be sequenced by some course, but they do not require a description as a unifying criterion. On the other hand, at any time, one can conceive a description that asserts the constraints by which a process of a certain type is such, and in this case, it becomes a situation. Since the decision of designing an explicit description that unifies a perdurant depends on context, task, interest, application, etc., when aligning an ontology to DLP, there can be indecision on where to align a process-oriented class. For example, in the WordNet alignment, we have decided to put only some physical processes under process, e.g. organic process, in order to stress the social orientedness of DLP. But when we need to talk explicitly of the criteria by which we conceive organic processes, these will be put under situation. Similar considerations are made for the other types of perdurants in DOLCE. A different notion of event (dealing with change) is currently being investigated for further developments, since achievement, accomplishment, state, event, etc. can also be considered aspects of processes or of parts of them. For example, the same process, rock erosion in the Sinni valley, can be conceptualized as an accomplishment (what has brought about the current state that e.g. we are trying to explain), as an achievement (the erosion process as the result of a previous accomplishment), as a state (if we collapse the time interval of the erosion into a time point), or as an event (what has changed our focus from one state to another). In the erosion case, we could have good motivations to shift from one aspect to another: a) causation focus, b) effectual focus, c) condensation, d) transition (causality). If we want to consider all the aspects of a process together, we need to postulate a unifying descriptive set of criteria (i.e. a description), according to which that process is circumstantiated in a situation. The different aspects will arise as parts of the same situation.

20
Sowa's top-level ontology
  • From Knowledge Representation (1999)
  • Lattice structure of 12 concepts: all combinations of three differentiae
  • Firstness/secondness/thirdness (man/husband/marriage)
  • Physical/abstract
  • Continuant/occurrent (object/process)
  • Obtained from C.S. Peirce and other historical philosophers
  • Idea: form a basis into which all ontologies (or their Upper Models) can be inserted
  • This is a pretty idea, but not used in practice
  • HOWEVER: local lattices are a good idea, used in Omega to overcome the differentium order problem (see lecture 1)

21
Sowa top ontology
  • Physical (P). An entity that has a location in space-time.
  • Abstract (A). Pure information as distinguished from any particular encoding of the information in a physical medium. Formally, Abstract is a primitive that satisfies the following axioms: No abstraction has a location in space: ¬(∃x:Abstract)(∃y:Place)loc(x,y). No abstraction occurs at a point in time: ¬(∃x:Abstract)(∃t:Time)pTime(x,t).
  • Continuant (C). An entity whose identity continues to be recognizable over some extended interval of time.
  • Occurrent (O). An entity that does not have a stable identity during any interval of time. Formally, Occurrent is a primitive that satisfies the following axioms: The temporal parts of an occurrent, which are called stages, exist at different times. The spatial parts of an occurrent, which are called participants, may exist at the same time, but an occurrent may have different participants at different stages. There are no identity conditions that can be used to identify two occurrents that are observed in nonoverlapping space-time regions.
  • Independent (I). An entity characterized by some inherent Firstness, independent of any relationships it may have to other entities. Formally, Independent is a primitive for which the has-test of Section 2.4 need not apply. If x is an independent entity, it is not necessary that there exists an entity y such that x has y or y has x: (∀x:Independent)◊¬(∃y)(has(x,y) ∨ has(y,x)).
  • Relative (R). An entity in a relationship to some other entity. Formally, Relative is a primitive for which the has-test must apply: (∀x:Relative)□(∃y)(has(x,y) ∨ has(y,x)). For any relative x, there must exist some y such that x has y or y has x.
  • Mediating (M). An entity characterized by some Thirdness that brings other entities into a relationship. An independent entity need not have any relationship to anything else, a relative entity must have some relationship to something else, and a mediating entity creates a relationship between two other entities. An example of a mediating entity is a marriage, which creates a relationship between a husband and a wife. According to Peirce, the defining aspect of Thirdness is the conception of mediation, whereby a first and a second are brought into relation. That property could be expressed in second-order logic: (∀m:Mediating)(∀x,y:Entity)(((∃R,S:Relation)(R(m,x) ∧ S(m,y))) ⊃ □(∃T:Relation)T(x,y)). This formula says that for any mediating entity m and any other entities x and y, if there exist relations R and S that relate m to x and m to y, then it is necessarily true that there exists some relation T that relates x to y. For example, if m is a marriage, R relates m to a husband x, and S relates m to a wife y, then T relates the husband to the wife (or the wife to the husband).

22
Sowa case roles 1
  • A determinant participant determines the
    direction of the process, either from the
    beginning as the initiator or from the end as the
    goal.
  • An immanent participant is present throughout the
    process, but does not actively control what
    happens.
  • A source must be present at the beginning of the
    process, but need not participate throughout the
    process.
  • A product must be present at the end of the
    process but need not participate throughout the
    process.

23
Sowa case roles 2
  • Initiator corresponds to Aristotle's efficient cause, "whereby a change or a state is initiated" (1013b23).
  • Resource corresponds to the material cause, which is "the matter or the substrate (hypokeimenon)" (983a30).
  • Goal corresponds to the final cause, which is the purpose or the benefit; "for this is the goal (telos) of any generation or motion" (983a32).
  • Essence corresponds to the formal cause, which is "the essence (ousia) or what it is (to ti einai)" (983a27).

24
Penman Upper Model
  • Matthiessen et al. (ISI), 1980s
  • Linguistic (English) generalizations
  • Approx. 300 concepts; no lexical items at this level
  • AI-light (no axioms)
  • Tested for NLG in various languages, and for MT
  • Serves as the overall connection between Domain Model symbols (used in system input representations) and NLG system decision rules
  • Good example of the use of an Upper Model to capture and organize essential processing distinctions

25
Some MIKRO case roles and relations
  • Case-role
  • THEME
  • SOURCE
  • PATH
  • LOCATION
  • INSTRUMENT
  • EXPERIENCER
  • DESTINATION
  • BENEFICIARY
  • AGENT
  • ACCOMPANIER
  • Manner-relation
  • MANNER-QUALITY
  • MANNER-INSTRUMENT
  • MANNER-MEANS
  • Quantity-relation
  • LESS-THAN
  • EQUAL-TO
  • GREATER-THAN
  • MIKROKOSMOS ontology
  • Nirenburg, Mahesh, et al.
  • Spatial-Temporal-relation
  • Temporal-relation
  • BETWEEN-TEMPORAL TEMPORAL-FREQUENCY
    TEMPORAL-DURATION TEMPORAL-LOCATING
    TEMPORAL-INTERVAL-OVERLAP (subrelations)
    TEMPORAL-NON-OVERLAP (subrelations)
  • Spatial-relation
  • SURROUNDED-BY SURROUNDS RIGHT-OF OUTSIDE-OF
    ON-TOP-OF MEET UNDER LEFT-OF INSIDE-OF
    IN-FRONT-OF BETWEEN-SPATIAL BESIDE
    BELOW-AND-TOUCHING BEHIND ACROSS-FROM ACROSS
    ABOVE DIRECTIONAL-RELATION (subrelations)

26
Outline
  1. Problems when creating concepts
  2. Ontological Semantics
  3. The high-level abstractions
  4. The content-level concepts
  5. The OntoNotes project
  6. A methodology for creating word senses
  7. From senses to concepts
  8. The Omega Ontology
  9. Conclusion

27
Questions
  1. What do you include in the Upper Model, and what
    in the Middle Model?
  2. Where is the boundary?
  3. How primitive are your concepts? What
    granularity should you use?

28
Parsimonious vs profligate
  • Parsimonious
  • Few symbols
  • Easy to see conceptual relatedness
  • Easy to define and run inferences
  • Hard to compose complex meanings
  • Profligate
  • Many symbols
  • Hard to determine conceptual relatedness
  • Hard work to define inferences
  • No need to compose complex meanings
  • Easy to fall into the trap of semantics-by-capitalization (or "wishful mnemonics": McDermott, "Artificial Intelligence Meets Natural Stupidity", 1981)

There is no correct position: what you choose depends on how much inference you need vs how complex your domain is
29
CYC middle
Lenat; www.cyc.com
  • Built by CYCorp for Artificial Intelligence reasoning and databases
  • Hundreds of thousands of concepts
  • Various term sets made available over the past years
  • Many interesting capabilities

30
WordNet
Miller & Fellbaum; wordnet.princeton.edu
  • Being built by Miller and Fellbaum at Princeton; cognitive scientists
  • Synonymous senses of words grouped into synsets; approx. 120,000 synsets
  • Rudimentary Upper Model; all Middle Model
  • Nouns organized by hyponymy (ISA); average depth of the Noun hierarchy: 12
  • Verbs weakly organized by hyponymy; avg depth: 3
  • Adjectives organized as star structures (quasi-synonym clusters related to antonym clusters)
  • Also meronymy (part-of) and other relations; recently includes sense frequency values
  • Used for many NLP applications, but effectiveness is controversial
  • An IR study claims WordNet is not useful (Voorhees)
  • QA work, using axioms in Extended WordNet (Moldovan), shows great promise
  • Word sense disambiguation shows WordNet has too many senses

31
Mikrokosmos
Nirenburg et al.; crl.nmsu.edu/Research/Projects/mikro/
  • Intermittently being built by Nirenburg et al. at New Mexico State U and U of Maryland; NLP people
  • About 6,000 concepts, 250 relations (slots)
  • Focus on lexicon: define cores of meaning clusters and differentiate at the word/sense level; includes about 25K English and 25K Spanish (and some other) words
  • Used as Interlingua symbol repository for MT, in Text Meaning Representation (TMR) notation
  • Nice feature: facets on slots (sketched below)
  • Value: the value of the slot (may be a formula)
  • Strength: certainty/probability
  • Aspect: constant/intermittent/etc.
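
The facet idea is easy to picture as a data structure. Below is a minimal Python sketch; the class and slot names (Facet, Concept, HAS-PART, etc.) are invented for illustration and are not the actual Mikrokosmos schema:

    from dataclasses import dataclass, field

    @dataclass
    class Facet:
        value: object             # the filler (may itself be a formula)
        strength: float = 1.0     # certainty/probability of this filler
        aspect: str = "constant"  # constant / intermittent / ...

    @dataclass
    class Concept:
        name: str
        slots: dict = field(default_factory=dict)  # slot name -> Facet

    vehicle = Concept("VEHICLE")
    vehicle.slots["HAS-PART"] = Facet("WHEEL", strength=0.9)   # most, not all
    vehicle.slots["USED-FOR"] = Facet("TRANSPORT", aspect="constant")
    print(vehicle.slots["HAS-PART"].strength)                  # 0.9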

32
Aligning ontologies
  • Instead of building an ontology (with all the problems that entails), can one just combine existing ones?
  • Find the most popular concepts and organization
  • Merge the definitions
  • Identify individual errors and problem areas
  • I tried this in 1996-97 (Hovy, LREC 1998)
  • Project funded by IBM: align the Upper Models of CYC, Penman, and Mikrokosmos
  • Built alignment routines and created a merge
  • Conceptual mismatch problems were significant!
  • Since then, a fairly large group of researchers has been doing this; there is a competition every year

33
General alignment and merging
  • Goal: find attachment point(s) in an ontology for a node/term from somewhere else (ontology, website, metadata schema, etc.)
  • It's hard to do manually, and very hard to do automatically: the system needs to understand the semantics of the entities to be aligned (a toy scoring sketch follows below)
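
The outcome slides that follow report NAME, DEF, and TAX evidence combined into a COMB score. Here is a toy Python sketch of that style of scoring; the particular similarity measures and weights are assumptions for illustration, not the actual 1996-97 routines:

    def name_score(a, b):
        """Crude name match: character-bigram overlap (Dice coefficient)."""
        grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
        A, B = grams(a.lower()), grams(b.lower())
        return 2 * len(A & B) / (len(A) + len(B)) if A and B else 0.0

    def def_score(a, b):
        """Definition overlap: shared words between the two glosses."""
        A, B = set(a.lower().split()), set(b.lower().split())
        return len(A & B) / min(len(A), len(B)) if A and B else 0.0

    def tax_score(supers_a, supers_b):
        """Taxonomy evidence: overlap of the two concepts' superconcepts."""
        return len(supers_a & supers_b) / max(len(supers_a | supers_b), 1)

    def comb_score(name_a, name_b, def_a, def_b, sup_a, sup_b):
        return (name_score(name_a, name_b) + def_score(def_a, def_b)
                + tax_score(sup_a, sup_b))

    # e.g., WordNet 'foodstuff' vs Mikrokosmos FOODSTUFF (identical glosses):
    gloss = "a substance that can be used or prepared for use as food"
    print(comb_score("foodstuff", "FOODSTUFF", gloss, gloss,
                     {"FOOD"}, {"FOOD", "MATERIAL"}))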

34
Outcome 1: Good and Misleading
  • S@foodstuff<food
  • "a substance that can be used or prepared for use as food"
  • superconcepts: (S@food)
  • M@FOODSTUFF (COMB 13.355; NAME 91; DEF 10.00; TAX 0.140)
  • "a substance that can be used or prepared for use as food"
  • superconcepts: (M@FOOD M@MATERIAL)
  • ----------------------------------------
  • S@library>bibliotheca
  • "a collection of literary documents or records kept for reference"
  • superconcepts: (S@aggregation)
  • M@LIBRARY (COMB 2.742; NAME 59; DEF 3.57; TAX 0.000)
  • "a place in which literary and artistic materials such as books, periodicals, newspapers, pamphlets, and prints are kept for reading or reference; an institution or foundation maintaining such a collection"
  • superconcepts: (M@ACADEMIC-BUILDING)

A document collection or a place?
35
Outcome 2: Unclear and Error!
  • S@geisha
  • "a Japanese woman trained to entertain men with conversation and singing and dancing"
  • superconcepts: (S@adult female S@Japanese<Asian)
  • M@GEISHA (COMB 1.540; NAME 46; DEF 2.27; TAX 0.000)
  • "a Japanese girl trained as an entertainer to serve as a hired entertainer to men"
  • superconcepts: (M@ENTERTAINMENT-ROLE)
  • ----------------------------------------
  • S@archipelago
  • "many scattered islands in a large body of water"
  • superconcepts: (S@dry land)
  • M@ARCHIPELAGO (COMB 1.522; NAME 131; DEF 1.33; TAX 0.000)
  • "a sea with many islands"
  • superconcepts: (M@SEA)

A person or a function?
Land or sea?
36
When are two concepts the same? Guarino's Identity Criteria
  • Material: the stuff
  • Topological: the shape
  • Morphological: the parts
  • Functional: the use
  • Meronymical: the members
  • Social: the societal role
  • (see also Pustejovsky's qualia)

A water glass, before and after being smashed; the ACL in 1964 and in 2064
37
Shishkebobs (Hovy et al., in prep.)
  • Library ISA Building (and hence can't buy things)
  • Library ISA Institution (and hence can buy things)
  • SO: Building ↔ Institution ↔ Location; a Library is all of these
  • Also Country ↔ Nation ↔ Government (GPE)
  • France: the land, the people, and the rulers
  • Also Field-of-Study ↔ Activity ↔ Result-of-Process
  • (Science, Medicine, Architecture, Art)
  • Also Company ↔ Product ↔ Stock
  • He worked at Coke, drank Coke, and owned Coke (shares)
  • We found about 400 potential shishkebobs
  • Shishkebobs: concept senses or metonymy rings? A continuum, from on-the-fly meaning shadings to full metonymy
  • Link regular alternation possibilities at a general level in the ontology; allow meaning shift for semantic interpretation, where needed (a toy sketch follows below)
  • Using shishkebobs makes merging ontologies easier (possible?): you respect each ontology's perspective
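
A toy Python sketch of the idea; the facet inventories and selectional restrictions are invented for illustration:

    # One lexical item "skewers" several ontologically distinct facets;
    # the regular alternation is recorded once, at a general level, and
    # interpretation shifts the meaning as the context demands.
    SHISHKEBOBS = {
        "GPE":     ["COUNTRY", "NATION", "GOVERNMENT"],   # France: land, people, rulers
        "LIBRARY": ["BUILDING", "INSTITUTION", "LOCATION"],
        "COMPANY": ["ORGANIZATION", "PRODUCT", "STOCK"],  # worked at / drank / owned Coke
    }

    COMPATIBLE = {  # toy selectional restrictions of two predicates
        "buy":   {"INSTITUTION", "ORGANIZATION", "STOCK", "PRODUCT"},
        "enter": {"BUILDING", "LOCATION", "COUNTRY"},
    }

    def admissible_facets(kebab, predicate):
        """Facets of a shishkebob compatible with a selecting predicate."""
        return [f for f in SHISHKEBOBS[kebab] if f in COMPATIBLE[predicate]]

    print(admissible_facets("LIBRARY", "buy"))    # ['INSTITUTION']
    print(admissible_facets("LIBRARY", "enter"))  # ['BUILDING', 'LOCATION']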

38
Domain models/ontologies
  • There's tons of work on building domain-specific ontologies: see the web
  • Artificial Intelligence
  • Databases
  • Company products
  • Government codes
  • Domain expertise capture
  • etc.
  • Not the focus of this lecture; we continue with general lexico-semantics

39
Outline
  1. Problems when creating concepts
  2. Ontological Semantics
  3. The high-level abstractions
  4. The content-level concepts
  5. The OntoNotes project
  6. A methodology for creating word senses
  7. From senses to concepts
  8. The Omega Ontology
  9. Conclusion

40
Semantic annotation projects
  • Goal: corpus of pairs (sentence, semantic rep)
  • Process: humans add information to sentences (and their parses)
  • Recent projects: Penn Treebank (Marcus et al. 99), PropBank (Palmer et al. 03), NomBank (Myers et al. 03), Prague Dependency Treebank (Hajic et al. 02), TIGER/SALSA Bank (Pinkal et al. 04), FrameNet (Fillmore et al. 04), Interlingua Annotation (Dorr et al. 04), I-CAB, Greek banks, OntoNotes (Weischedel et al. 05)

[Diagram: the projects arranged by annotation layer: syntax, word senses, verb frames, noun frames, coref links, ontology]
41
OntoNotes project structure
[Diagram: OntoNotes sites and tasks. ISI: ontology links and resulting structure; Colorado: verb senses and verbal ontology links, noun senses and targeted nominalizations, propositions; Penn: Treebank syntax; BBN: coreference. Annotation layers: syntactic structure, predicate/argument structure, disambiguated nouns and verbs, coreference links. The training data feeds decoders for summarization and translation.]

Goal: In 4 years, annotate text corpora of 1 million words of English, Chinese, and Arabic text
42
Focus on word senses
  • Create a very large corpus of text by annotating
    JUST the semantic sense(s) of every noun and verb
    (and later, adjective and adverb)
  • Why?
  • Enable computer programs to learn to assign
    correct senses automatically, in search of
    improved machine translation, text summarization,
    question answering, (web) search, etc.
  • begin to understand the distribution of principal
    semantic features (animacy, concreteness, etc.)
    at large scale.

43
Example of result
  • 3@wsj/00/wsj_0020.mrg@wsj: Mrs. Hills said many of the 25 countries that she placed under varying degrees of scrutiny have made "genuine progress" on this touchy issue.
  • Propositions: predicate say (pb sense: 01; ON sense: 1)
  • ARG0: Mrs. Hills (10)
  • ARG1: many of the 25 countries that she placed under varying degrees of scrutiny have made "genuine progress" on this touchy issue
  • predicate make (pb sense: 03; ON sense: None)
  • ARG0: many of the 25 countries that she placed under varying degrees of scrutiny
  • ARG1: "genuine progress" on this touchy issue

OntoNotes Normal Form (ONF)
44
OntoNotes annotation procedure
  • Sense creation: process goes by word
  • Expert creates meaning options (shallow semantic senses) for verbs, nouns, adjs, advs; follows the PropBank process (Palmer et al.)
  • Expert creates definitions, examples, differentiating features
  • (Ontology insertion: at the same time, the expert groups equivalent senses from different words and organizes/refines Omega ontology content and structure; process being developed at ISI)
  • Sense annotation: process goes by word, across docs
  • Process developed in PropBank
  • Annotators manually:
  • See each sentence in the corpus containing the current word (noun, verb, adjective, adverb) to annotate
  • Select appropriate senses (= ontology concepts) for each one
  • Connect frame structure (for each verb and relational noun)
  • Coref annotation: process goes by doc
  • Annotators connect co-references within each doc

45
Ensuring trustworthiness/stability
  • Problematic issues
  • What senses are there? Are the senses stable/good/clear?
  • Is the sense annotation trustworthy?
  • What things should corefer?
  • Is the coref annotation trustworthy?
  • Approach (from PropBank): the 90% solution
  • Sense granularity and stability: test with annotators to ensure agreement at 90% on real text
  • If not, then redefine and re-do until 90% agreement is reached
  • Coref stability: only annotate the types of aspects/phenomena for which 90% agreement can be achieved

46
Sense annotation procedure
  • Sense creator first creates senses for a word
  • Loop 1
  • Manager selects the next nouns from the sensed list and assigns annotators
  • Programmer randomly selects 50 sentences and creates the initial Task File
  • Annotators (at least 2) do the first 50
  • Manager checks their performance
  • If 90% agreement and few or no NoneOfAbove: send on to Loop 2
  • Else: Adjudicator and Manager identify reasons, send back to Sense creator to fix senses and defs
  • Loop 2
  • Annotators (at least 2) annotate all the remaining sentences
  • Manager checks their performance
  • If 90% agreement and few or no NoneOfAbove: send to Adjudicator to fix the rest
  • Else: Adjudicator annotates the differences
  • If the Adjudicator agrees with one Annotator 90% of the time, then ignore the other Annotator's work (assume a bad day for the other); else, if the Adjudicator agrees with both about equally often, then assume bad senses and send the problematic ones back to the Sense creator (a sketch of the agreement gate follows below)
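
A minimal Python sketch of the 90% agreement gate used in both loops; the data and threshold handling are illustrative only:

    from itertools import combinations

    def agreement(a, b):
        """Fraction of instances on which two annotators chose the same sense."""
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def gate(annotations, threshold=0.90):
        """Check all annotator pairs against the threshold (the loop test)."""
        worst = min(agreement(a, b)
                    for a, b in combinations(annotations.values(), 2))
        return "advance" if worst >= threshold else "send back to Sense creator"

    anns = {"A1": ["s1", "s2", "s1", "s1"],
            "A2": ["s1", "s2", "s1", "s3"]}
    print(gate(anns))  # agreement 0.75 -> 'send back to Sense creator'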

47
  • STAMP annotation interface
  • Built for PropBank (Palmer, UPenn)
  • Target word
  • Sentence
  • Word sense choices (no mouse!)

48
Pre-project test: Can it be done?
  • Annotation process and tools developed and tested in PropBank (Palmer et al., U Colorado)
  • Typical results (10 words of each type, 100 sentences each):

         tagger agreement      # senses           time (min/100 tokens)
         R1 → R2 → R3          R1 → R2 → R3       R1 → R2 → R3
verbs    .76 → .86 → .91       4.5 → 5.2 → 3.8    30 → 25 → 25
nouns    .71 → .85 → .95       7.3 → 5.1 → 3.3    28 → 20 → 15
adjs     .87 → n/a → .90       2.8 → n/a → 5.5    24 → n/a → 18
(By comparison: agreement using WordNet senses is 70%)
49
Setting up: Word statistics

1000-word corpus    tokens    types
verbs               125.3     87.3
nouns               446.6     288.7
adjectives          103.2     80.6
  • Number of word tokens/types in a 1000-word corpus
  • (95% confidence intervals on 85213 trials)

Nouns: approx. 50% of tokens. Monosemous nouns (but not names etc.): 14.6% of tokens, 25.6% of nouns.

Polysemy of verbs and nouns (250K words of WSJ):
                 verbs         nouns
total            2341          5421
1 WN sense       428 (18%)     1751 (32%)
2 or 3 senses    966 (41%)     2159 (40%)
4+ senses        947 (40%)     1511 (28%)

Coverage in WSJ and Brown Corpus of the most frequent N nouns with 2+ senses:
N        tokens (of 205442 total)    % of tokens
100      76420                       37
500      140453                      68
1000     167715                      82
1500     181412                      88
2000     189641                      92
50
Outline
  1. Problems when creating concepts
  2. Ontological Semantics
  3. The high-level abstractions
  4. The content-level concepts
  5. The OntoNotes project
  6. A methodology for creating word senses
  7. From senses to concepts
  8. The Omega Ontology
  9. Conclusion

51
We want to go from lexemes to concepts, and we use senses as the bridge
  • Lexical space
  • Words
  • drive
  • steer
  • fahren
  • steuern
  • besturen
  • rijden
  • drijven
  • Sense space
  • Word senses
  • Drive1
  • Drive2
  • Drive3
  • Manage
  • Concept space
  • Concepts
  • ?
  • How many concepts?
  • How related to senses?

52
Which, and how many, senses? Graduated refinement
  1. Initialization: Given a term (word), collect several dozen sentences containing it. Also collect definitions from various dictionaries
  2. Cluster the word's senses into preliminary, loosely similar groups
  3. Differentiation process: Begin a tree structure with all the groups at the root
  4. Considering all the groups, identify the group most different from the others
  5. If you can find one clearly most different group, write down its most important distinction explicitly; this will later become the differentium and be formalized axiomatically
  6. If you cannot find any distinctions by which to further subdivide the group, stop elaborating this branch and continue with some other branch
  7. If you can find several distinctions that subdivide the group in different, but equally valid, ways, also stop elaborating this branch and continue with some other branch
  8. Create two new branches in the evolving tree structure, putting the new group under one, and leaving the other groups under the other
  9. Repeat from step 4, considering separately the group(s) under each branch (a code sketch of this loop follows below)
  10. Concept formation: When all branches have stopped, the ultimate result is a tree of increasingly fine-grained distinctions, which are explicitly listed at each branch point. Each leaf becomes a single concept, not further differentiable in the current task/application/domain. Each distinction must be formalized as an axiom that holds for the branch it is associated with
  11. Insertion into ontology: Starting from the top, visit each branch point. Do the two branches have approximately the same meaning?
  12. If so, insert them into the ontology at the appropriate point and stop traversing this branch
  13. If not, split the tree and repeat step 8 separately for each branch. Repeat until done
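
A Python sketch of the differentiation loop (steps 4-9), with a toy "oracle" standing in for the human judgment of step 4; the group and distinction names are invented for illustration:

    class Node:
        def __init__(self, groups, differentium=None):
            self.groups = groups              # sense groups at this branch
            self.differentium = differentium  # distinction that split the parent
            self.children = []

    def differentiate(node, oracle):
        if len(node.groups) < 2:              # a leaf: one concept (step 10)
            return node
        choice = oracle(node.groups)
        if choice is None:                    # steps 6/7: stop this branch
            return node
        group, distinction = choice
        rest = [g for g in node.groups if g != group]
        node.children = [Node([group], distinction),              # step 8
                         Node(rest, "not(" + distinction + ")")]
        for child in node.children:           # step 9: recurse on each branch
            differentiate(child, oracle)
        return node

    DISTINCTIONS = {"operate-vehicle": "agent-controls-artifact",
                    "compel/propel": "cause-motion-of-patient"}

    def oracle(groups):
        """Stand-in for the human judgment of step 4."""
        for g in groups:
            if g in DISTINCTIONS:
                return g, DISTINCTIONS[g]
        return None                           # no clearly most different group

    tree = differentiate(Node(["operate-vehicle", "compel/propel",
                               "exert-force"]), oracle)
    print([c.differentium for c in tree.children])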

53
An exercise drive
  1. Drive the demons out of her and teach her to stay
    away from my husband!!
  2. Shortly before nine I drove my jalopy to the
    street facing the Lake and parked the car in
    shadows.
  3. He drove carefully in the direction of the brief
    tour they had taken earlier.
  4. Her scream split up the silence of the car,
    accompanied by the rattling of the freight, and
    then Cappy came off the floor, his legs driving
    him hard.
  5. With an untrained local labor pool, many experts
    believe, that policy could drive businesses from
    the city.
  6. Treasury Undersecretary David Mulford defended
    the Treasurys efforts this fall to drive down
    the value of the dollar.
  7. Even today range riders will come upon mummified
    bodies of men who attempted nothing more
    difficult than a twenty-mile hike and slowly lost
    direction, were tortured by the heat, driven mad
    by the constant and unfulfilled promise of the
    landscape, and who finally died.
  8. Cows were kept in backyard barns, and boys were
    hired to drive them to and from the pasture on
    the edge of town.
  9. He had to drive the hammer really hard to get the
    nail into that plank!
  10. She learned to drive a bulldozer from her uncle,
    who was a road maker.
  11. I used to drive a taxi (for work) before I went
    to night school.
  12. Beware! Ralph drives a hard bargain; you will probably lose all your money.

54
Grouping the senses of drive
drive (senses 1-12)
55
Deeper semantics: drive
drive (senses 1-12)
<psych>
<profession>
11 taxi
<negotiate>
12 drive a hard bargain
56
Ontologizing drive
<move in desired direction> (1,2,3,4,5,6,8,9,10)
57
From lexemes to concepts
  • Lexical space
  • Words
  • Monolingual
  • drive
  • steer
  • fahren
  • rijden
  • Sense space
  • Word senses
  • Multilingual
  • Drive1
  • Drive2
  • Drive3
  • Concept space
  • Concepts
  • Interlingual (?)

58
Time for some fun
  • Do the class exercise

59
Doing this seriously: the OntoNotes sense creation interface
  • Input word
  • Tree of senses being created
  • Working area: write defs, exs, features, etc.
  • Google or dictionary sense list, for ideas

60
Outline
  1. Problems when creating concepts
  2. Ontological Semantics
  3. The high-level abstractions
  4. The content-level concepts
  5. The OntoNotes project
  6. A methodology for creating word senses
  7. From senses to concepts
  8. The Omega Ontology
  9. Conclusion

61
Trying to find real concepts
  • What tests can one use to ensure concept-hood and not just sense-hood?
  • One idea: multilinguality
  • Can one arrange word senses/concepts to form an interlingua term set?
  • This would guarantee (some degree of) semantic nature
  • How to do this?
  • Translate a text several times
  • Compare the differences: presumably they derive from the same source, which must therefore pack together all the translation meanings

62
Finding meaning via translation difference
K1E1: Starting on January 1 of next year, SK
Telecom subscribers can switch to less expensive
LG Telecom or KTF. The Subscribers cannot
switch again to another provider for the first 3
months, but they can cancel the switch in 14
days if they are not satisfied with services
like voice quality. K1E2: Starting January
1st of next year, customers of SK Telecom can
change their service company to LG Telecom or
KTF Once a service company swap has been made,
customers are not allowed to change companies
again within the first three months, although
they can cancel the change anytime within 14
days if problems such as poor call quality are
experienced.
  • Semantically identical
  • Semantically equivalent
  • Semantically different
  • Additional/less information
  • Different information

63
Getting at meaning: Two translations of a Japanese original text
  • This year,
  • too,
  • in addition to
  • the birth
  • of Mitsubishi Chemical,
  • which has already been announced,
  • other rather large-scale mergers
  • may continue,
  • and
  • be recorded
  • as a year of mergers.
  • This year,
  • which has already seen the announcement of
  • the birth
  • of Mitsubishi Chemical Corporation
  • as well as
  • the continuous numbers of big mergers,
  • may
  • too
  • be recorded
  • as the year of the merger
  • for all we know.
  • Problems for semantic rep
  • Lexical differences
  • Dependency differences

64
Interlinguas in MT
  • The idea of an interlingua is intriguing
  • Example use in Machine Translation:
  • For transfer systems, need n(n-1) rule sets for n languages (L1→L2, L2→L1, L1→L3, L3→L1, ...); for Interlingual systems, need only 2n sets of rules (Lx→IL→Ly). (See the arithmetic sketch below)
  • Interlingua: the "deep semantic" notation of the meaning (the idea) behind the text
  • An Interlingua is a system of symbols and notation to represent the meaning(s) of (linguistic) communications, with the following features:
  • language-independent
  • formally well-defined
  • expressive to arbitrary level of semantics
  • non-redundant
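
The rule-set arithmetic, as a quick Python check (counting one rule set per ordered language pair for transfer, and one analysis plus one generation rule set per language for an interlingua):

    def transfer_rule_sets(n):
        return n * (n - 1)

    def interlingua_rule_sets(n):
        return 2 * n

    for n in (2, 5, 10, 20):
        print(n, transfer_rule_sets(n), interlingua_rule_sets(n))
    # n=2: 2 vs 4; n=5: 20 vs 10; n=10: 90 vs 20; n=20: 380 vs 40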

word senses → concepts
65
Structure of EuroWordNet
Slide by Wim Peters, U of Sheffield, 2001
  • Based on WordNet (Miller & Fellbaum, Princeton University)
  • WordNet currently has about 120,000 word sense groupings (synsets)
  • EuroWordNet: language-specific wordnets for 8 languages, all independent but connected
  • English, Spanish, Italian, Dutch, ...
  • The wordnets represent unique concept lexicalization patterns in 8 languages, based on sense inventories of mono- and bilingual dictionaries
  • BUT NOT AN INTERLINGUA: synsets of the various languages are linked to the Inter-Lingual Index (ILI), which serves as an interlingua mapping, but there is no single central term/concept set (a toy sketch follows below)
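
A toy Python sketch of the architecture (all entries invented): each language-specific synset points into the unstructured Inter-Lingual Index rather than directly at synsets of other languages:

    ILI = {"ili-drive": "operate or guide a vehicle"}  # ILI records

    WORDNETS = {  # language -> {synset -> ILI record}
        "en": {"drive/steer": "ili-drive"},
        "nl": {"rijden/besturen": "ili-drive"},
        "it": {"guidare": "ili-drive"},
    }

    def equivalents(lang, synset):
        """Synsets of the other languages linked to the same ILI record."""
        ili = WORDNETS[lang][synset]
        return {l: [s for s, r in wn.items() if r == ili]
                for l, wn in WORDNETS.items() if l != lang}

    print(equivalents("en", "drive/steer"))
    # {'nl': ['rijden/besturen'], 'it': ['guidare']}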

66
Slide by Wim Peters, U of Sheffield, 2001
EuroWordNet architecture
[Diagram: language-specific synsets (English: move, travel, go; drive. Dutch: bewegen, reizen, gaan; rijden, berijden. Spanish: cabalgar, jinetear. Italian: guidare; andare, muoversi) linked via the Inter-Lingual-Index (ILI record "drive"), with the Top-Ontology and Domain-Ontology attached to the ILI]
67
Multilinguality: Word-sense-concept 1
Sense Space
eat1
eat3
eat2
Word Space
eat
68
Multilinguality: Word-sense-concept 2
Sense Space
eat1
eat3
eat2
Word Space
eat
69
Multilinguality: Word-sense-concept 3
Other languages may suggest refined
conceptualization
IngestFood
Concept Space
IngestFood-Human
IngestFood-Animal



Sense Space
eat1
eat3
eat2
Word Space
eat
70
Lexicon, Senses and Concepts in Omega
  • Lexical space
  • Words
  • drive
  • steer
  • fahren
  • steuern
  • besturen
  • rijden
  • drijven
  • Sense space
  • Word senses
  • Drive1
  • Drive2
  • Drive3
  • Manage
  • Ride
  • Concept space
  • Concepts

drive/steer
ride1
manage3
drive<propel
71
OntoNotes procedure for building the ontology
  • Goal: create the ontology = a repository of OntoNotes senses, organized to provide additional information
  • Creation procedure:
  • Start with framework (Upper Structure) from ISI's Omega ontology
  • Contains verb frame structures from PropBank, FrameNet, LCS, WordNet
  • Gather all senses created for annotation
  • Include definitional features defined for senses
  • Concepts: identify and pool together senses with the same meaning (a toy pooling sketch follows below)
  • Look for shared features
  • Recognize paraphrases to avoid redundancy
  • Arrange close senses together to share features
  • Enable eventual reasoning (buy ↔ sell)
  • Validation: measure agreement between poolers
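
A toy Python sketch of the pooling step; the senses and features are invented for illustration:

    SENSES = {  # sense -> definitional features
        "drive.2": {"motion", "agent-controls-artifact"},
        "steer.1": {"motion", "agent-controls-artifact"},
        "drive.5": {"cause", "motion"},
    }

    def pool(senses):
        """Pool senses (possibly of different words) with identical features."""
        pools = {}
        for sense, feats in senses.items():
            pools.setdefault(frozenset(feats), []).append(sense)
        return list(pools.values())

    print(pool(SENSES))  # [['drive.2', 'steer.1'], ['drive.5']]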

72
Some OntoNotes features from the word senses
building 45
business 40
device 37
legal 36
substance 34
mental 32
small 34
measure 30
unit 26
concrete 22
enterprise 19
amount 19
portion 19
military 19
people 19
vehicle 18
official 18
large 18
collection 17
natural 18
financial 17
part 17
object 16
human 16
cylindrical 16
entity 965
artifact 272
relation advisor, agent, aid 199
state 188
quantity 171
activity acquisition, act, battle 156
structure 160
role advisor, agent, banker 147
physical 123
social 112
action act, appeal, bid, call 103
quality 97
person 83
group 86
organization 59
abstract 58
event 57
location 58
document 53
individual 46
form 46
  • Interesting correspondences with theoretical work:
  • Linguistics: Comrie's syntactic and semantic features
  • Knowledge Representation / AI: Upper Model features of SUMO, CYC, etc.

73
OntoNotes sense pooler interface
  • Sense list
  • Pool being built: list of senses
  • Subordination: link to another pool
  • Features of this pool

74
Outline
  1. Problems when creating concepts
  2. Ontological Semantics
  3. The high-level abstractions
  4. The content-level concepts
  5. The OntoNotes project
  6. A methodology for creating word senses
  7. From senses to concepts
  8. The Omega Ontology
  9. Conclusion

75
Omega content and framework
omega.isi.edu
Goal: one environment for various ontologies and resources
  • Concepts: 120,604 concept/term entries (76 MB)
  • WordNet (Princeton; Miller & Fellbaum)
  • Mikrokosmos (NMSU; Nirenburg et al.)
  • Penman Upper Model (ISI; Bateman et al.)
  • 25,000 noun-noun compounds (ISI; Pantel)
  • Lexicon / sense space
  • 156,142 English words; 33,822 Spanish words
  • 271,243 word senses
  • 13,000 frames of verb argument structure with case roles
  • LCS case roles (Dorr) 6.3MB
  • PropBank role frames (Palmer et al.) 5.3MB
  • FrameNet role frames (Fillmore et al.) 2.8MB
  • WordNet verb frames (Fellbaum) 1.8MB
  • Associated information (not all complete)
  • WordNet subject domains (Magnini & Cavaglia) 1.2MB
  • Various relations learned from text (ISI; Pantel)
  • TAP domain groupings (Stanford; Guha)
  • SemCor term frequencies 7.5MB
  • Topic signatures (Basque U; Agirre et al.) 2.7GB
  • Instances (10.1 GB)
  • 1.1 million persons harvested from text
  • 765,000 facts harvested from text
  • 5.7 million locations from USGS and NGA
  • Framework (over 28 million statements of concepts, relations, instances)
  • Available in PowerLoom
  • Instances in RDF
  • With database/MySQL
  • Online browser
  • Clustering software
  • Term and ontology alignment software
76
Omega browser: mammoth
77
Omega hierarchy display
78
Omega sense frames
79
Outline
  1. Problems when creating concepts
  2. Ontological Semantics
  3. The high-level abstractions
  4. The content-level concepts
  5. The OntoNotes project
  6. A methodology for creating word senses
  7. From senses to concepts
  8. The Omega Ontology
  9. Conclusion

80
Shallow and deep ontologies
  • Omega is a language-based ontology
  • Concepts defined via word senses; annotator agreement constrains and validates granularity
  • Granularity: how many senses for each word?
  • Associated information is the subsumption hierarchy and case frames
  • Deep(er) semantics
  • Deeper knowledge (definitional features, subconcept differentiae, inferences, etc.) is not present
  • In places: temporal relations using Allen/Hobbs/OWL
  • A future, deeper version of the annotation will require and motivate a more semantic ontology
81
What would be nice?
  • A small number of (globally) standardized
    ontologies and/or core theories of important
    aspects (time, space, social dynamics, motion,
    privacy, etc.)
  • Solid theoretical frameworks for developing
    ontological notions and theories, and for testing
    them
  • A rich online world of ontologies, domain models,
    etc., with appropriate ontology creation tools
    and methodologies
  • (Semi-)automated techniques for rapidly finding,
    absorbing, and testing existing ontologies for
    your own applications
  • Tools that automatically create new knowledge
    bases on demand, in accord with given ontologies
  • Ontology and knowledge base support technology
    that can handle info that may be inconsistent,
    tenuous, partial, and growing

82
Thank you!