Annotation Types for UIMA - PowerPoint PPT Presentation

Slides: 19
Provided by: edwa146
Transcript and Presenter's Notes

1
Annotation Types for UIMA
  • Edward Loper

2
UIMA
  • Unstructured Information Management Architecture
  • Analytics framework
  • Consists of components that perform specific
    tasks (tagging, parsing, etc.)
  • Each component declares its own interface
    (input/output, requirements, work flow metadata,
    etc)
  • All information is communicated using a single
    standard data format: CAS
  • Built-in support for network distribution,
    clustering, etc.

3
CAS
  • Common Analysis Structure
  • Tends to fall on the weakly-merged side of the
    spectrum (does not require annotations to be
    modified to ensure consistency).
  • Annotations are encoded using typed feature
    structures.
  • But the type definitions are left unspecified.
  • Cf. XML
  • Components can only work together if they use the
    same type system.

4
Standard CAS Types
  • Goal: design standard CAS types for ULA
    annotations.
  • In particular, we're currently looking at
    Treebank, Propbank, Timebank.
  • Issues:
  • Redundancy of information
  • Coupling between annotations
  • Discontinuous constituents

5
CAS Types: background
  • UIMA does provide a couple of top-level types
    (e.g. Annotation).
  • These make it clear that UIMA intends:
  • Standoff annotations
  • defined using spans
  • with character-based offsets
  • Cf. AGTK

6
Treebank
  • Typical representation for treebank:
  • <TreeBankConstituent id="8" start="5" end="23"
    type="NP" children="12 28 38" parent="94">
  • Questions:
  • Should children be explicitly marked?
  • Should parents be explicitly marked?
  • These questions have consequences

7
Treebank: Explicit children?
  • How could we not mark children?
  • They can be mostly reconstructed, if we assume:
  • All constituents are properly nested
  • Unary branch direction can be determined based on
    node type.
  • Not quite true: SBAR/FRAG, S/NP, NP/FRAG, NP/PRN.
  • Theoretical consequences of (not) marking
    children:
  • Have to assume proper nesting of constituents
  • Alternatively, allow for multiple coexisting
    bracketings (a la chart parse) -- probably not
    what we want.
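
The reconstruction described on this slide can be sketched roughly as follows, under exactly the two stated assumptions: spans are properly nested, and unary branches (identical spans) are ordered by node type. The function names and the `UNARY_RANK` table are hypothetical; a real ranking would have to deal with the problem pairs listed above (SBAR/FRAG, S/NP, and so on).

```python
# Hypothetical ranking: a lower number means "higher in the tree" when two
# constituents cover exactly the same span (a unary branch).
UNARY_RANK = {"S": 0, "VP": 1, "NP": 2, "V": 3}


def unary_rank(label):
    return UNARY_RANK.get(label, len(UNARY_RANK))


def reconstruct_children(spans, rank):
    """spans maps id -> (start, end, label); returns id -> list of child ids.

    Raises ValueError if two spans cross, i.e. are not properly nested.
    """
    # Outer constituents first: earlier start, then longer span, then rank.
    order = sorted(spans, key=lambda i: (spans[i][0], -spans[i][1], rank(spans[i][2])))
    children = {i: [] for i in spans}
    stack = []  # chain of currently open ancestors
    for i in order:
        start, end, _ = spans[i]
        # Close every constituent that ends before this one begins.
        while stack and spans[stack[-1]][1] <= start:
            stack.pop()
        if stack:
            if spans[stack[-1]][1] < end:
                raise ValueError(f"constituents {stack[-1]} and {i} cross")
            children[stack[-1]].append(i)
        stack.append(i)
    return children


# "The cat sat": S covers everything, VP over V is a unary branch.
spans = {1: (0, 11, "S"), 2: (0, 7, "NP"), 3: (8, 11, "VP"), 4: (8, 11, "V")}
children = reconstruct_children(spans, unary_rank)
print(children)
```

If the proper-nesting assumption is dropped, the `ValueError` branch is where multiple coexisting bracketings would have to be handled.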

8
Treebank: Explicit parents?
  • Parent pointers are redundant -- they can be
    reconstructed.
  • But they can be very handy to have when working
    with structures.
  • Theoretical consequence of marking parents:
  • Every constituent has exactly one parent.
  • Rules out multi-parented trees. (fine.)
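
To illustrate why parent pointers are redundant, here is a small sketch that derives them from explicit child lists. The ids reuse the numbers from the TreeBankConstituent example earlier in the slides; the function name is an assumption made for illustration.

```python
def parents_from_children(children):
    """children maps a constituent id to the ids of its children.

    Returns a map child id -> parent id, and rejects multi-parented nodes.
    """
    parents = {}
    for parent, kids in children.items():
        for kid in kids:
            if kid in parents:
                raise ValueError(f"constituent {kid} has more than one parent")
            parents[kid] = parent
    return parents


# Constituent 8 has children 12, 28, 38 and parent 94.
children = {94: [8], 8: [12, 28, 38]}
parents = parents_from_children(children)
print(parents)
```

The `ValueError` makes the "exactly one parent" consequence explicit: a single parent field simply cannot represent a multi-parented tree.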

9
Propbank
  • Propbank's current annotation:
  • Is strongly coupled to treebank
  • Argument locations are specified using tree
    pointers
  • Includes trace chain information

10
Propbank: Tree Pointers
  • Each propbank argument is specified using a
    tree pointer w:h
  • The h-th constituent above the w-th word.
  • Problems with this strong coupling:
  • Propbank can't be used without trees.
  • New propbanking can't be done unless parsing has
    been done.
  • Changes to trees are annoying to propagate to
    Propbank.
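
A rough sketch of how such a tree pointer could be resolved, which also shows why it cannot be interpreted without the tree. The `Node` class and the helper names are hypothetical, and the sketch assumes that words are numbered from 0 and that height 0 is the word's own part-of-speech node.

```python
class Node:
    def __init__(self, label, children=(), word=None):
        self.label = label
        self.word = word          # set only on part-of-speech (leaf) nodes
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def leaves(node):
    """Leaf nodes of the subtree, in left-to-right order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def resolve(root, w, h):
    """Return the h-th constituent above the w-th word."""
    node = leaves(root)[w]
    for _ in range(h):
        if node.parent is None:
            raise ValueError(f"pointer {w}:{h} climbs past the root")
        node = node.parent
    return node


# (S (NP (DT The) (NN cat)) (VP (VBD sat)))
root = Node("S", [
    Node("NP", [Node("DT", word="The"), Node("NN", word="cat")]),
    Node("VP", [Node("VBD", word="sat")]),
])
arg = resolve(root, 1, 1)
print(arg.label, [leaf.word for leaf in leaves(arg)])
```

Any change to the tree (an inserted node, a re-attached phrase) silently changes what a given w:h refers to, which is the propagation problem noted above.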

11
Propbank: spans
  • Can we get away with using spans instead (UIMA's
    preferred approach)?
  • Do we lose any information?
  • Potentially yes -- for unary branching nodes,
    where several constituents cover the same span.
  • In practice:
  • 99.92% of non-trace args select the low
    constituent.
  • 97.9% of trace args select the high constituent.
  • The differences appear to just be errors.
  • So no (important) lost info!
  • About 50-55% of split arguments go away.
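
A span can only be ambiguous when several constituents cover exactly the same words. The sketch below maps a span back to a constituent using the low/high tendency reported above as a default rule; that rule, the `Constituent` tuple, and the function name are assumptions for illustration, and the list is assumed to hold phrasal constituents only.

```python
from typing import NamedTuple


class Constituent(NamedTuple):
    label: str
    start: int  # index of the first covered word
    end: int    # index just past the last covered word
    depth: int  # distance from the root (root = 0)


def constituent_for_span(constituents, start, end, is_trace=False):
    """Pick the constituent a span most plausibly refers to.

    Assumed rule: trace arguments take the highest matching node,
    all other arguments take the lowest one.
    """
    matches = sorted(
        (c for c in constituents if (c.start, c.end) == (start, end)),
        key=lambda c: c.depth,
    )
    if not matches:
        raise LookupError(f"no constituent covers words {start}..{end}")
    return matches[0] if is_trace else matches[-1]


# SBAR over S is a unary branch: both cover words 0..3.
tree = [
    Constituent("SBAR", 0, 3, 0),
    Constituent("S", 0, 3, 1),
    Constituent("NP", 0, 1, 2),
    Constituent("VP", 1, 3, 2),
]
print(constituent_for_span(tree, 0, 3).label)
print(constituent_for_span(tree, 0, 3, is_trace=True).label)
```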

12
Propbank: trace chains
  • For arguments that have undergone movement,
    propbank explicitly marks the trace chain.
  • But isn't this something the tree should give us
    anyway?
  • Treebank & Propbank have somewhat different
    notions of what gets included in trace chains.
  • 1/3 of the Propbank annotation guidelines talk
    about null elements.

13
Propbank: trace chains
  • How much can we recover?
  • Using very simple heuristics (e.g., link NP-2
    with t-2), 60%
  • Using more advanced heuristics, maybe 80%.
  • Not close enough to 100% to throw them away.
  • Some differences harder to automate, e.g.,
    propbank (usually) only marks traces that
    interact with the predicate in some way.
  • Asbestos_i was used t_i and replaced t_i
14
Propbank: trace chains (questions for discussion)
  • Should marking trace chains be part of the
    propbanking task?
  • Or should we leave it up to the treebankers?
  • If it should be part of propbanking, should it be
    split off as a separate subtask?
  • Would that help annotation speed any?
  • Should the annotation be split off as a separate
    layer?

15
Discontinuous constituents
  • Propbank has provisions for discontinuous
    constituents: w1:h1,w2:h2
  • Discontinuous constituents can appear almost
    anywhere:
  • Temporal expressions
  • Named entities
  • Parse constituents (?)
  • Want: a uniform way to handle them.

16
Discontinuous constituents
  • Goals:
  • Make the common case easy
  • Make the uncommon case possible
  • Preferred approach:
  • Add an optional property (e.g. 'pieces') that can
    be used to specify discontinuous chunks.
  • If used, then the start/end properties should be
    treated with appropriate care
  • Open question:
  • Should this property be defined on the top-level
    type, or on individual types (e.g.
    PropBankArgument)?
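
One way the preferred approach could look, as a sketch only: an annotation keeps its ordinary start/end, and an optional `pieces` list is consulted when present. The class and method names are assumptions, and placing the property on a generic `Annotation` class is just one answer to the open question above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Annotation:
    start: int  # character offset of the overall extent (inclusive)
    end: int    # character offset of the overall extent (exclusive)
    pieces: Optional[List[Tuple[int, int]]] = None  # set only when discontinuous

    def chunks(self) -> List[Tuple[int, int]]:
        # Common case: no pieces, so the annotation is one contiguous chunk.
        if self.pieces is None:
            return [(self.start, self.end)]
        return self.pieces

    def covered_text(self, text: str, sep: str = " ... ") -> str:
        return sep.join(text[s:e] for s, e in self.chunks())


text = "He turned the light off."
contiguous = Annotation(start=10, end=19)
discontinuous = Annotation(start=3, end=23, pieces=[(3, 9), (20, 23)])
print(contiguous.covered_text(text))
print(discontinuous.covered_text(text))
```

Code that ignores `pieces` still sees a valid (if too wide) span, which keeps the common case easy while leaving the uncommon case possible.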

17
A note on consistency
  • CAS is weakly merged -- it doesn't enforce
    consistency.
  • But that doesn't mean we can't enforce
    consistency ourselves.
  • For weakly merged formats, it will be important
    to:
  • Define consistencies that we want
  • Both within annotations & between annotations
  • Actively check those consistencies during
    annotation.
  • Weakly coupled annotations are a good thing.
  • But the more weakly coupled the annotations are,
    the more we'll need to check consistency.
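
Two small checks of the kind meant here, one within an annotation and one between annotation layers. The specific rules (pieces must tile the start/end extent in order; every argument span must match some constituent span) are example consistencies chosen for the sketch, not rules taken from the slides.

```python
def check_pieces(start, end, pieces):
    """Within-annotation check: do the pieces agree with start/end?"""
    problems = []
    if not pieces:
        return problems
    if pieces[0][0] != start:
        problems.append("first piece does not begin at start")
    if pieces[-1][1] != end:
        problems.append("last piece does not finish at end")
    for s, e in pieces:
        if s >= e:
            problems.append(f"empty or inverted piece {(s, e)}")
    for (_, prev_end), (next_start, _) in zip(pieces, pieces[1:]):
        if next_start < prev_end:
            problems.append("pieces overlap or are out of order")
    return problems


def unmatched_arguments(argument_spans, constituent_spans):
    """Between-annotation check: arguments with no matching constituent."""
    known = set(constituent_spans)
    return [span for span in argument_spans if span not in known]


print(check_pieces(3, 23, [(3, 9), (20, 23)]))
print(check_pieces(3, 23, [(3, 9), (8, 20)]))
print(unmatched_arguments([(0, 7), (8, 12)], [(0, 23), (0, 7), (8, 23)]))
```

Checks like these would run during annotation and report problems rather than silently repairing them, in keeping with a weakly merged format.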

18
Questions/discussion
  • Strongly vs weakly merged
  • (when) is redundancy good?
  • How strongly coupled should annotations be?
  • Handling discontinuous constituents?
  • Where is there information overlap between
    annotations (e.g. coref chains)? What should be
    done about it?
  • Any principled way to decide when to mark heads
    vs spans?
  • Token offset vs character offset vs tree pointer