Title: What
1 What's NEXT? Navigating through Dense Annotation Spaces
- Branimir K. Boguraev
- Mary S. Neff
Language Engineering for Content Analysis
IBM T.J. Watson Research Center, Yorktown Heights, NY
2 Outline
- Dense annotation spaces
- Navigational challenges
- Elements of the annotation-matching formalism
- Support for navigational control
- Conclusion
- Future work
3 Dense Annotation Spaces
[Figure: dense annotation layers over the example sentence below. Span labels: SENT, SC, SUB, OBJ, PP, NP, VG. Part-of-speech tags, one per token: np nps md vb nn nn in nn to vb dt nn.]
Service Reps can read customer name, in order to
contact the customer.
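4 Annotation trees
[Figure: annotation tree over the example sentence below. Node labels: SENT, SC, SUB, OBJ, PP, NP, VG. Part-of-speech tags, one per token: np nps md vb nn nn in nn to vb dt nn.]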
Service Reps can read customer name, in order to
contact the customer.
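5 Annotation lattice
[Figure: annotation lattice over the example sentence below. Node labels: SENT, SC, SUB, OBJ, PP, NP, VG. Part-of-speech tags, one per token: np nps md vb nn nn in nn to vb dt nn.]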
Service Reps can read customer name, in order to
contact the customer.
6 Navigational Challenges
- PNAME
- Title Name
- First Middle Last
- What is visible to the lattice traversal engine?
7 Annotation-Based Finite State Transducer (AFst)
- UIMA-based
- A finite state calculus over typed feature structures
- Cf. grep over a sequence of annotations, specified as types and features
- np = <E>/[NP .
- Token[pos=="DT"] | <E> .
- Token[pos=="JJ"]* .
- ( Token[pos=="NN"] | Token[pos=="NNS"] ) .
- <E>/]NP
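The np rule on this slide can be read as: an optional determiner, then adjectives, then a singular or plural noun head, with the whole match wrapped in an NP annotation. The following is a minimal Python sketch of that reading, not AFst itself. The `Token` class and `find_nps` function are hypothetical names, and the assumptions that adjectives may repeat and that matching is greedy and left to right are mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    pos: str  # part-of-speech tag, e.g. "DT", "JJ", "NN", "NNS"


def find_nps(tokens):
    """Return (start, end) token index pairs (end exclusive) matching
    DT? JJ* (NN | NNS), scanning left to right without overlaps."""
    spans = []
    i = 0
    while i < len(tokens):
        j = i
        # Optional determiner.
        if j < len(tokens) and tokens[j].pos == "DT":
            j += 1
        # Zero or more adjectives (assumed repeatable).
        while j < len(tokens) and tokens[j].pos == "JJ":
            j += 1
        # Obligatory noun head.
        if j < len(tokens) and tokens[j].pos in ("NN", "NNS"):
            spans.append((i, j + 1))
            i = j + 1  # continue after the match
        else:
            i += 1  # no match starting here; try the next token
    return spans


tokens = [
    Token("contact", "VB"),
    Token("the", "DT"),
    Token("new", "JJ"),
    Token("customer", "NN"),
]
print(find_nps(tokens))  # [(1, 4)]
```

A real transducer would emit an NP annotation into the annotation store rather than return index pairs; the sketch only shows the sequential pattern.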
8 Pitching the Iterator: support for navigational control
[Figure: annotations over the example sentence below. Span labels: SENT, SC, SUB, OBJ, PP, NP, VG. Part-of-speech tags, one per token: np nps md vb nn nn in nn to vb dt nn.]
Service Reps can read customer name, in order to
contact the customer.
9 AFst Traversal Regime
- Defining a particular path through the annotation space requires a lattice traversal engine that can focus simultaneously on:
- Sequential constraints: pattern matching
- Horizontal: prenominal modifier and nominal head
- Structural constraints
- Vertical: iterate over NPs with a specific configurational relationship, e.g. not sentence-initial, not in a PP
- Configurational constraints
- Type prioritization
10 Linearizing the Lattice: what's next?
[Figure: lattice fragment with span labels SUB, OBJ, PP, NP, VG.]
- Unambiguous: typeset iterator, inferred from the grammar: SUB . VG . OBJ . PP
- UIMA natural annotation sort order
- Start position ascending
- Length descending
- Type priority, defined in UIMA descriptors
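The sort order listed above (start position ascending, length descending, then type priority) combined with a typeset filter can be sketched in a few lines of Python. This is an illustration, not UIMA code: the `Annotation` class, the `TYPE_PRIORITY` list, and the token-offset spans are hypothetical, and in UIMA the priorities would come from the descriptor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    type: str
    begin: int
    end: int


# Hypothetical priority list: earlier types sort first on ties.
TYPE_PRIORITY = ["SENT", "SC", "SUB", "OBJ", "PP", "NP", "VG"]


def sort_key(a):
    # Start ascending, length descending, then type priority.
    return (a.begin, -(a.end - a.begin), TYPE_PRIORITY.index(a.type))


def linearize(annotations, typeset):
    """Sort the lattice, then keep only the types the grammar refers to."""
    return [a for a in sorted(annotations, key=sort_key) if a.type in typeset]


# Invented spans (token offsets) over a six-token clause.
lattice = [
    Annotation("NP", 4, 6),
    Annotation("VG", 2, 4),
    Annotation("NP", 0, 2),
    Annotation("OBJ", 4, 6),
    Annotation("SENT", 0, 6),
    Annotation("SUB", 0, 2),
]

print([a.type for a in sorted(lattice, key=sort_key)])
# ['SENT', 'SUB', 'NP', 'VG', 'OBJ', 'NP']
print([a.type for a in linearize(lattice, {"SUB", "VG", "OBJ", "PP"})])
# ['SUB', 'VG', 'OBJ']
```

Restricting the iterator to the types named in the grammar is what removes the ambiguity: at each position only one candidate annotation remains for "what's next".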
11 Linearizing the Lattice: what's next?
- Grammar-wide declarations
- boundary Sentence
- honour Address
- month = Token[lemma=="January"]
- | Token[lemma=="February"]
- date = <E>/[Year .
- month | <E> .
- Token[string~="[12]\d{3}"] .
- <E>/]Year
12 Focus: Selecting Nested Boundary Annotations
<nameValuePair>
  <name>Focus</name>
  <value>
    <array>
      <string>Section[label=="Education"]</string>
      <string>Sentence[number==1]</string>
    </array>
  </value>
</nameValuePair>
13 Linearizing the Lattice: what's next?
- Grammar-wide declarations
- match: first, last, longest, shortest, all
- advance: skip, step
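The slide names the match and advance options without defining them. The sketch below shows one plausible reading in Python, under my own assumptions: `match` picks among the candidate matches that start at the same position, `skip` resumes scanning after the end of the chosen match, and `step` resumes at the next position. The functions `select` and `scan` and the candidate table are hypothetical.

```python
def select(candidates, match):
    """Choose among (start, end) candidates found at one start position,
    given in discovery order."""
    if not candidates:
        return []
    if match == "all":
        return list(candidates)
    if match == "first":
        return [candidates[0]]
    if match == "last":
        return [candidates[-1]]
    if match == "longest":
        return [max(candidates, key=lambda s: s[1] - s[0])]
    if match == "shortest":
        return [min(candidates, key=lambda s: s[1] - s[0])]
    raise ValueError(f"unknown match policy: {match}")


def scan(length, matches_at, match="longest", advance="skip"):
    """Walk positions 0..length-1, collecting the selected matches."""
    results = []
    pos = 0
    while pos < length:
        chosen = select(matches_at(pos), match)
        results.extend(chosen)
        if chosen and advance == "skip":
            # Resume after the furthest match end; always move forward.
            pos = max(pos + 1, max(end for _, end in chosen))
        else:
            pos += 1
    return results


# Hypothetical candidates: start position -> matches found there.
table = {0: [(0, 1), (0, 3)], 1: [(1, 3)], 3: [(3, 4)]}
lookup = lambda p: table.get(p, [])

print(scan(4, lookup, "longest", "skip"))   # [(0, 3), (3, 4)]
print(scan(4, lookup, "shortest", "step"))  # [(0, 1), (1, 3), (3, 4)]
```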
14 What's next? Switching Levels, Mixed Iterator
- Refocus the iterator to examine the inner contour: @descend, @ascend
- findDrSmith = <E>/PName[@descend] .
- Title[string=="Dr."] .
- <E>/Name[@descend] .
- First[] | <E> . Last[string=="Smith"] .
- <E>/Name[@ascend] .
- <E>/PName[@ascend]
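The findDrSmith rule descends into a PName, checks its Title, descends again into the Name, and then climbs back out. A minimal Python sketch of that level switching over an explicit tree follows. It is not AFst: the `Node` class and `is_dr_smith` function are hypothetical, and treating First as optional is my reading of the rule.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    type: str
    text: str = ""
    children: list = field(default_factory=list)


def is_dr_smith(pname):
    """True if pname is PName(Title 'Dr.', Name([First] Last 'Smith'))."""
    if pname.type != "PName":
        return False
    # Descend: examine the annotations directly under PName.
    inner = pname.children
    if len(inner) != 2:
        return False
    title, name = inner
    if title.type != "Title" or title.text != "Dr.":
        return False
    if name.type != "Name":
        return False
    # Descend again: optional First, then Last == "Smith".
    parts = name.children
    if parts and parts[0].type == "First":
        parts = parts[1:]
    # Returning from here corresponds to ascending back to the outer level.
    return len(parts) == 1 and parts[0].type == "Last" and parts[0].text == "Smith"


dr_john_smith = Node("PName", children=[
    Node("Title", "Dr."),
    Node("Name", children=[Node("First", "John"), Node("Last", "Smith")]),
])
print(is_dr_smith(dr_john_smith))  # True
```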
15 Alternate Multiple Level Access
- Upper/lower context without switching levels
- Token_costartsSentencenumber1
- Subject_coversPName
- PName_costartsNP,_coendsNP
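The feature names on this slide suggest positional tests between an annotation and its upper or lower context. A minimal Python sketch follows, assuming `_costarts` means equal start offsets, `_coends` means equal end offsets, and `_covers` means containment. The `Span` class and all offsets are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    type: str
    begin: int
    end: int


def costarts(a, b):
    """a and b begin at the same offset."""
    return a.begin == b.begin


def coends(a, b):
    """a and b end at the same offset."""
    return a.end == b.end


def covers(a, b):
    """a spans at least the whole extent of b."""
    return a.begin <= b.begin and b.end <= a.end


# Invented character offsets for illustration.
sentence = Span("Sentence", 0, 40)
token = Span("Token", 0, 7)
np = Span("NP", 0, 12)
pname = Span("PName", 0, 12)
subject = Span("Subject", 0, 12)

print(costarts(token, sentence))                  # True: sentence-initial token
print(covers(subject, pname))                     # True: subject contains a PName
print(costarts(pname, np) and coends(pname, np))  # True: PName is exactly the NP
```

Tests like these let a rule consult another layer without moving the iterator to that layer.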
16 Grammar cascading
- From simpler to more complex analyses
- Output of lower levels feeds as input into higher levels
- Small noun phrases, verb groups
- Prepositional, possessive, adjectival phrases
- More complex noun phrases
- Variety of clause types
- Grammatical relations (subject, object)
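The cascade described above can be sketched as a chain of stages, each consuming the labels produced by the one before. This toy Python example is an assumption-laden illustration, not the grammars from the talk: the stage functions `chunk` and `relations`, the tag groupings, and the subject/object heuristic are all hypothetical.

```python
def chunk(tags):
    """Level 1: collapse runs of POS tags into small NP and VG chunks."""
    out = []
    for tag in tags:
        if tag in ("np", "nps", "nn", "dt"):
            label = "NP"
        elif tag in ("md", "vb"):
            label = "VG"
        else:
            label = tag
        if out and out[-1] == label:
            continue  # same chunk continues
        out.append(label)
    return out


def relations(chunks):
    """Level 2: an NP before the verb group is SUB, one after it is OBJ."""
    out = []
    seen_vg = False
    for c in chunks:
        if c == "VG":
            seen_vg = True
            out.append(c)
        elif c == "NP":
            out.append("OBJ" if seen_vg else "SUB")
        else:
            out.append(c)
    return out


def cascade(labels, stages):
    """Feed the output of each stage into the next."""
    for stage in stages:
        labels = stage(labels)
    return labels


# POS tags for "Service Reps can read customer name".
tags = ["np", "nps", "md", "vb", "nn", "nn"]
print(cascade(tags, [chunk]))             # ['NP', 'VG', 'NP']
print(cascade(tags, [chunk, relations]))  # ['SUB', 'VG', 'OBJ']
```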
17 Implementations
- Shallow Parsing
- Named Entity Detection interleaved with shallow parsing
- Terminology identification in new domains
- Temporal expression parsing
- Privacy policy rules
- Information extraction from resumes
- Information extraction from contact center
telephone calls
18 Future work list
- Alternate (semi-ambiguous) iterator, useful for disambiguator grammars
- Actor Director
- Tree-walk iterator for tree representations where
children are explicitly referenced in features
19 Performance Notes
- Performance is a function of:
- How the grammar is written
- Optimisation of the FST graph (grammar compiler)
- Optimisation of the symbol compiler
- Optimisation of the executor
- However, for the benefit of the curious:
- IBM Software Group (Dublin) optimised the last
two, and
20 IBM LanguageWare (Dublin) text analysis performance results
- The Results
- Precision for Company annotations only: 0.81
- Recall for Company annotations only: 0.67
- Precision for Person annotations only: 0.93
- Recall for Person annotations only: 0.91
- Processing time: 3.4 seconds
- This is 10 times faster than the best-of-breed internal reference annotators.
- The analysis
- AFst rules and FST dictionary
- 26 rules, 7 dictionaries (things like first names, indicators like Corp., etc.)
- Creating Person and Company annotations
- The Test
- Test set: Enron
- 924 files (4.5Mb)
21 Perpetrators, er, Responsible parties
- Bran Boguraev
- Mary Neff
- Bran Lambov
- D.J. McCloskey
- Thilo Goetz
- Thomas Hampp
- Oliver Suhre
- Roy Byrd
- Herb Chong
- Albert Eskenazi
- Paul Kaye
- Son Bao Pham
- Lokesh Shresta
- Max Silberztein
22 For more on AFst and tools
- Tomorrow, 12:25 in Fez 1
- A Development Environment for Configurable Meta-Annotators in a Pipelined NLP Environment
- Youssef Drissi, Branimir Boguraev, David Ferrucci, Paul Keyser, and Anthony Levas