Annotation Types for UIMA

Slides: 19

Provided by: edwa146

Learn more at: https://verbs.colorado.edu

1
Annotation Types for UIMA

Edward Loper

2
UIMA

Unstructured Information Management Architecture
Analytics framework
Consists of components that perform specific
tasks (tagging, parsing, etc.)
Each component declares its own interface
(input/output, requirements, work flow metadata,
etc.)
All information is communicated using a single
standard data format: CAS
Built-in support for network distribution,
clustering, etc.

3
CAS

Common Analysis Structure
Tends to fall on the weakly-merged side of the
spectrum (does not require annotations to be
modified to ensure consistency).
Annotations are encoded using typed feature
structures.
But the type definitions are left unspecified.
Cf. XML
Components can only work together if they use the
same type system.

4
Standard CAS Types

Goal: design standard CAS types for ULA
annotations.
In particular, we're currently looking at:
Treebank, Propbank, Timebank.
Issues:
Redundancy of information
Coupling between annotations
Discontinuous constituents

5
CAS Types: background

UIMA does provide a couple of top-level types
(e.g. Annotation).
These make it clear that UIMA intends:
Standoff annotations
defined using spans
with character-based offsets
Cf. AGTK

6
Treebank

Typical representation for treebank:
<TreeBankConstituent id="8" start="5" end="23"
type="NP" children="12 28 38" parent="94">
Questions:
Should children be explicitly marked?
Should parents be explicitly marked?
These questions have consequences
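
As a rough illustration of the record above, here is a minimal Python sketch of the same fields. The class is not part of UIMA; treating `end` as an exclusive character offset and making `children` and `parent` optional are assumptions made only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TreeBankConstituent:
    """Standoff constituent: offsets point into the source text."""
    id: int
    start: int   # character offset where the span begins
    end: int     # assumed exclusive: offset just past the last character
    type: str    # constituent label, e.g. "NP"
    # Both of the following are optional because the slide asks whether
    # they should be marked explicitly at all.
    children: List[int] = field(default_factory=list)  # ids of child constituents
    parent: Optional[int] = None                       # id of the parent, if marked


example = TreeBankConstituent(
    id=8, start=5, end=23, type="NP", children=[12, 28, 38], parent=94
)
```

Leaving `children` and `parent` out of a record corresponds to the "not explicitly marked" options in the two questions above.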

7
Treebank: Explicit children?

How could we not mark children?
They can be mostly reconstructed, if we assume:
All constituents are properly nested
Unary branch direction can be determined based on
node type.
Not quite true: SBAR/FRAG, S/NP, NP/FRAG, NP/PRN.
Theoretical consequences of (not) marking
children:
Have to assume proper nesting of constituents
Alternatively, allow for multiple coexisting
bracketings (a la chart parse) -- probably not
what we want.
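
The reconstruction idea can be sketched in a few lines of Python. This is only an illustration under the proper-nesting assumption: the function name is hypothetical, spans are (start, end) character offsets with an exclusive end, and constituents with identical spans (unary branches) are ordered by id as a placeholder for a real rule based on node type.

```python
from typing import Dict, List, Tuple


def reconstruct_children(spans: Dict[int, Tuple[int, int]]) -> Dict[int, List[int]]:
    """Map each constituent id to its child ids, using spans alone."""
    # Outermost first: earlier start, then longer span, then lower id
    # (the id tie-break stands in for a node-type rule on unary branches).
    order = sorted(spans, key=lambda i: (spans[i][0], -spans[i][1], i))
    children: Dict[int, List[int]] = {i: [] for i in spans}
    stack: List[int] = []  # currently open constituents, innermost last
    for cid in order:
        start, end = spans[cid]
        # Close every open constituent that ends before this one starts.
        while stack and spans[stack[-1]][1] <= start:
            stack.pop()
        if stack:
            if end > spans[stack[-1]][1]:
                # Overlap without containment: proper nesting is violated.
                raise ValueError(f"constituents {stack[-1]} and {cid} cross")
            children[stack[-1]].append(cid)
        stack.append(cid)
    return children


example_spans = {1: (0, 10), 2: (0, 4), 3: (5, 10), 4: (5, 7), 5: (8, 10)}
example_children = reconstruct_children(example_spans)
```

Crossing spans raise an error here, which is the point made above: without explicit children, proper nesting has to be assumed.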

8
Treebank: Explicit parents?

Parent pointers are redundant -- they can be
reconstructed.
But they can be very handy to have when working
with structures.
Theoretical consequence of marking parents:
Every constituent has exactly one parent.
Rules out multi-parented trees. (fine.)
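
To make the redundancy concrete, here is a small sketch that derives parent pointers from child lists. The function name and the dict-of-ids layout are assumptions for illustration only.

```python
from typing import Dict, List, Optional


def derive_parents(children: Dict[int, List[int]]) -> Dict[int, Optional[int]]:
    """Invert child lists into parent pointers (None marks a root)."""
    parents: Dict[int, Optional[int]] = {cid: None for cid in children}
    for pid, kids in children.items():
        for kid in kids:
            if parents.get(kid) is not None:
                # A second parent would make this a multi-parented tree.
                raise ValueError(f"constituent {kid} has more than one parent")
            parents[kid] = pid
    return parents


example_parents = derive_parents({1: [2, 3], 2: [], 3: [4, 5], 4: [], 5: []})
```

The error branch is where the single-parent consequence shows up: a structure with shared children cannot be expressed with one parent pointer per constituent.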

9
Propbank

Propbank's current annotation:
Is strongly coupled to treebank
Argument locations are specified using tree
pointers
Includes trace chain information

10
Propbank: Tree Pointers

Each propbank argument is specified using a
tree pointer w:h
The h-th constituent above the w-th word.
Problems with this strong coupling:
Propbank can't be used without trees.
New propbanking can't be done unless parsing has
been done.
Changes to trees are annoying to propagate to
Propbank.

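
A pointer of this kind can only be resolved against a tree, which is the coupling problem in miniature. The sketch below assumes a zero-based word index and that height 0 is the word-level node itself; the `Node` class and the toy tree are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, *kids: "Node") -> "Node":
        """Attach children and set their parent pointers."""
        for kid in kids:
            kid.parent = self
            self.children.append(kid)
        return self


def leaves(node: Node) -> List[Node]:
    """Word-level nodes under `node`, in left-to-right order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def resolve_pointer(root: Node, pointer: str) -> Node:
    """Return the h-th constituent above the w-th word for a pointer "w:h"."""
    w, h = (int(part) for part in pointer.split(":"))
    node = leaves(root)[w]
    for _ in range(h):
        if node.parent is None:
            raise ValueError(f"pointer {pointer} climbs above the root")
        node = node.parent
    return node


# Toy tree: (S (NP (DT the) (NN dog)) (VP (VBD barked)))
tree = Node("S").add(
    Node("NP").add(Node("DT/the"), Node("NN/dog")),
    Node("VP").add(Node("VBD/barked")),
)
subject = resolve_pointer(tree, "1:1")  # one level above "dog"
```

Any change to the tree (an added node, a re-attached phrase) can silently change what an existing pointer resolves to.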
11
Propbank: spans

Can we get away with using spans instead (UIMA's
preferred approach)?
Do we lose any information?
Potentially yes -- for binary branching nodes.
In practice:
99.92% of non-trace args select the low
constituent.
97.9% of trace args select the high constituent.
The differences appear to just be errors.
So no (important) lost info!
About 50-55% of split arguments go away.
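
The information loss in question can be shown with a toy case where two constituents cover exactly the same span, so a span alone cannot say which one was meant. The names, the `depth` field, and the `prefer` parameter are assumptions for this sketch; the low/high defaults mirror the statistics above rather than any Propbank rule.

```python
from typing import List, NamedTuple


class Constituent(NamedTuple):
    label: str
    start: int
    end: int
    depth: int  # distance from the root; larger means lower in the tree


def select_by_span(constituents: List[Constituent], start: int, end: int,
                   prefer: str = "low") -> Constituent:
    """Pick the constituent covering exactly (start, end).

    When several constituents share the span, `prefer` chooses the
    lowest ("low") or highest ("high") one.
    """
    if prefer not in ("low", "high"):
        raise ValueError("prefer must be 'low' or 'high'")
    matches = [c for c in constituents if (c.start, c.end) == (start, end)]
    if not matches:
        raise LookupError(f"no constituent covers ({start}, {end})")
    pick = max if prefer == "low" else min
    return pick(matches, key=lambda c: c.depth)


# Toy parse of "The prices": (FRAG (NP (DT The) (NNS prices)))
toy = [
    Constituent("FRAG", 0, 10, 0),
    Constituent("NP", 0, 10, 1),
    Constituent("DT", 0, 3, 2),
    Constituent("NNS", 4, 10, 2),
]
low = select_by_span(toy, 0, 10)                  # default for non-trace args
high = select_by_span(toy, 0, 10, prefer="high")  # default for trace args
```

If one default covers nearly all cases, as the percentages suggest, the span plus the default recovers the pointer.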

12
Propbank: trace chains

For arguments that have undergone movement,
propbank explicitly marks the trace chain.
But isn't this something the tree should give us
anyway?
Treebank and Propbank have somewhat different
notions of what gets included in trace chains.
1/3 of the Propbank annotation guidelines talk
about null elements.

13
Propbank: trace chains

How much can we recover?
Using very simple heuristics (e.g., link NP-2
with t-2), 60%
Using more advanced heuristics, maybe 80%.
Not close enough to 100% to throw them away.
Some differences are harder to automate: e.g.,
propbank (usually) only marks traces that
interact with the predicate in some way.
Asbestos_i was used t_i and replaced t_i
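
The "very simple heuristic" can be sketched as index matching over node labels. This is an illustration, not the procedure behind the figures above: the label formats follow Penn Treebank conventions (`NP-2`, `*T*-2`, `*-1`), and the function name and return shape are assumptions.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

INDEX = re.compile(r"-(\d+)$")                # co-index on a phrase, e.g. NP-2
TRACE = re.compile(r"^\*(?:\w+\*)?-(\d+)$")   # indexed null element, e.g. *T*-2


def link_traces(labels: List[str]) -> Dict[int, Tuple[List[int], List[int]]]:
    """Pair antecedents with traces that carry the same numeric index.

    Returns {index: (antecedent positions, trace positions)} for every
    index that has at least one antecedent and one trace.
    """
    antecedents = defaultdict(list)
    traces = defaultdict(list)
    for pos, label in enumerate(labels):
        match = TRACE.match(label)
        if match:
            traces[int(match.group(1))].append(pos)
            continue
        match = INDEX.search(label)
        if match:
            antecedents[int(match.group(1))].append(pos)
    return {i: (antecedents[i], traces[i]) for i in sorted(antecedents) if i in traces}


example_labels = ["WHNP-2", "NP-SBJ", "VP", "*T*-2", "NP-SBJ-1", "*-1", "NP-3"]
example_links = link_traces(example_labels)
```

A heuristic like this links every co-indexed trace, including ones that do not interact with the predicate, which is one way it diverges from what Propbank marks.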

14
Propbank: trace chains (?s for discussion)

Should marking trace chains be part of the
propbanking task?
Or should we leave it up to the treebankers?
If it should be part of propbanking, should it be
split off as a separate subtask?
Would that help annotation speed any?
Should the annotation be split off as a separate
layer?

15
Discontinuous constituents

Propbank has provisions for discontinuous
constituents: w1:h1,w2:h2
Discontinuous constituents can appear almost
anywhere:
Temporal expressions
Named entities
Parse constituents (?)
Want: a uniform way to handle them.

16
Discontinuous constituents

Goals:
Make the common case easy
Make the uncommon case possible
Preferred approach:
Add an optional property (e.g. 'pieces') that can
be used to specify discontinuous chunks.
If used, then the start/end properties should be
treated with appropriate care
Open question:
Should this property be defined on the top-level
type, or on individual types (e.g.
PropBankArgument)?
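
One possible reading of the preferred approach, sketched in Python. The class and helper names are hypothetical, and setting start/end to the envelope of the pieces is an assumption about what "appropriate care" could mean, not something the slides specify.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SpanAnnotation:
    start: int
    end: int  # assumed exclusive character offset
    pieces: Optional[List[Tuple[int, int]]] = None  # only set when discontinuous

    def chunks(self) -> List[Tuple[int, int]]:
        """The common case is a single chunk; `pieces` overrides it."""
        return list(self.pieces) if self.pieces else [(self.start, self.end)]

    def covered_text(self, text: str) -> str:
        return " ".join(text[s:e] for s, e in self.chunks())


def make_discontinuous(pieces: List[Tuple[int, int]]) -> SpanAnnotation:
    """Build an annotation whose start/end are the envelope of its pieces."""
    if not pieces:
        raise ValueError("at least one piece is required")
    ordered = sorted(pieces)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            raise ValueError("pieces overlap")
    return SpanAnnotation(start=ordered[0][0], end=ordered[-1][1], pieces=ordered)


text = "turn the lights off"
particle_verb = make_discontinuous([(16, 19), (0, 4)])  # "turn ... off"
plain = SpanAnnotation(start=5, end=15)                 # "the lights"
```

Code that only reads start/end still gets a usable (if over-wide) span, while code that knows about `pieces` gets the exact chunks.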

17
A note on consistency

CAS is weakly merged -- it doesn't enforce
consistency.
But that doesn't mean we can't enforce
consistency ourselves.
For weakly merged formats, it will be important
to:
Define consistencies that we want
Both within annotations and between annotations
Actively check those consistencies during
annotation.
Weakly coupled annotations are a good thing.
But the more weakly coupled the annotations are,
the more we'll need to check consistency.
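
As one example of a between-annotation check, the sketch below flags Propbank argument spans that do not line up with any Treebank constituent span. The function name and the choice of this particular consistency rule are assumptions for illustration.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]


def unaligned_arguments(arg_spans: Iterable[Span],
                        constituent_spans: Iterable[Span]) -> List[Span]:
    """Return argument spans that match no constituent span exactly."""
    known = set(constituent_spans)
    return [span for span in arg_spans if span not in known]


constituents = [(0, 10), (0, 3), (4, 10)]
arguments = [(0, 3), (4, 9)]
problems = unaligned_arguments(arguments, constituents)
```

A check like this could run while annotating, so that a mismatch is reported when it is introduced rather than discovered later.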

18
Questions/discussion

Strongly vs weakly merged
(When) is redundancy good?
How strongly coupled should annotations be?
Handling discontinuous constituents?
Where is there information overlap between
annotations (e.g. coref chains)? What should be
done about it?
Any principled way to decide when to mark heads
vs spans?
Token offset vs character offset vs tree pointer