Linguistic Annotation Framework SC4 WG 1

ISO TC37 SC4 - WG 1. Beijing 2006

1
Linguistic Annotation Framework SC4 WG 1
  • Nancy Ide, Vassar College, USA

2
LAF Goal
  • Provide a generic means to represent linguistic
    data and annotations
  • Based on a formal model
  • Users map their formats into/out of LAF
  • User formats must conform to underlying model
  • Serves as a pivot (dump) format for exchange and
    machine processing
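
The pivot idea can be illustrated with a toy sketch. The two user formats below (slash-tagged text and tab-separated lines) and the list-of-pairs pivot are hypothetical stand-ins, not LAF's actual dump format; the point is only that each user format needs one mapping into the pivot and one mapping out of it.

```python
# Hypothetical user formats; the "pivot" here is just a list of (token, tag) pairs.

def slash_to_pivot(doc):
    """Map slash-tagged text such as 'The/DT clock/NN' into the pivot."""
    return [tuple(item.rsplit("/", 1)) for item in doc.split()]

def pivot_to_slash(pairs):
    """Map the pivot back out to slash-tagged text."""
    return " ".join(f"{tok}/{tag}" for tok, tag in pairs)

def tsv_to_pivot(doc):
    """Map one-token-per-line, tab-separated text into the pivot."""
    return [tuple(line.split("\t")) for line in doc.splitlines() if line]

def pivot_to_tsv(pairs):
    """Map the pivot back out to tab-separated lines."""
    return "\n".join(f"{tok}\t{tag}" for tok, tag in pairs)

# Each format registers exactly two mappers: (into pivot, out of pivot).
MAPPERS = {
    "slash": (slash_to_pivot, pivot_to_slash),
    "tsv": (tsv_to_pivot, pivot_to_tsv),
}

def convert(doc, source, target):
    """Convert between any two registered formats by going through the pivot."""
    to_pivot, _ = MAPPERS[source]
    _, from_pivot = MAPPERS[target]
    return from_pivot(to_pivot(doc))

print(convert("The/DT clock/NN", "slash", "tsv"))
```

Adding a third format in this sketch means writing two more functions, rather than one converter per existing format.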

3
[Figure: User A's representation and User B's representation, each mapped into and out of the dump format (interlingua)]
4
Principles
  • Separation of data and annotations
  • Stand-off annotation
  • Separation of user annotation formats and the
    exchange (dump) format
  • Mappable to one another
  • Separation of referential structure and
    annotation content in dump format
  • Separation of annotation structure (relationships
    among parts) and content (data categories) in
    representation of annotations

5
LAF Development
  • LAF has gone through a slow evolution
  • Model development (GMT as base)
  • Consideration of processing needs
  • Application to different annotation
    types/structures/formats
  • Adjustments to development in other WGs on
    specific annotation types and feature structures
  • Proof of concept instantiation in the American
    National Corpus
  • Transduction of several different annotation
    types and formats to LAF format
  • API to merge, transduce to other formats

6
LAF Status
  • Have now
  • Reduced FS specification
  • Final XML format / schema
  • GrAF: Graph Annotation Format
  • Mapping rules and examples
  • Also
  • Coordination with UIMA
  • Header specification including information about
    annotation, similar to UIMA type definition

7
Basic Model
  • Annotation content represented by feature
    structures
  • Powerful means to represent any/all annotations
  • Referential structure represented as a directed
    acyclic graph (DAG)
  • Enables exploitation of well-understood graph
    traversal and manipulation algorithms

8
Referential Structure
  • Means by which annotation content is associated
    with primary data or other annotations
  • Very simple DAG model
  • No need to consider internal structure of
    annotation content (i.e. relations among bits of
    annotation information)

9
Primary Data
  • Primary data contains no annotations
  • Read-only
  • Modifications can be regarded as annotations
  • Insistence on the identification of a base
    segmentation of the primary data
  • Identifies contiguous sequences of indivisible
    logical units
  • For text, usually a character
  • Compatible annotations (i.e. those that can be
    merged etc.) use common base segmentation

10
Primary Segmentation
  • Set of disjoint edges over primary data
  • Vertices
  • Virtual, located between each logical unit
  • Sequentially numbered
  • Edges
  • Each edge (x,y) in the graph delimits a
    non-divisible region of primary data
  • Conformance to MAF, SynAF
  • Each such edge over primary data is called a span
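
A minimal sketch of a primary segmentation, assuming character offsets as the virtual vertices (vertex i sits just before character i). The offsets are the ones the sample sentence "The clock struck twenty-two" yields; the function names are illustrative only.

```python
TEXT = "The clock struck twenty-two"

# Each edge (x, y) runs between two virtual vertices and delimits one span.
SPANS = [(0, 3), (4, 9), (10, 16), (17, 23), (23, 24), (24, 27)]

def is_disjoint(spans):
    """Return True if no two spans overlap (shared endpoints are allowed)."""
    ordered = sorted(spans)
    return all(prev[1] <= cur[0] for prev, cur in zip(ordered, ordered[1:]))

def span_text(text, span):
    """Return the region of primary data delimited by an edge."""
    start, end = span
    return text[start:end]

print(is_disjoint(SPANS))
print([span_text(TEXT, s) for s in SPANS])
```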

11
  • Multiple primary segmentations may be defined
    over a single primary data set
  • Specify segmentations at different levels of
    granularity
  • A segmentation is primary with respect to a given
    annotation, not the data itself
  • Edges in a primary segmentation can be defined
    over any span of contiguous primary data,
    regardless of its length
  • No need for spans to be contiguous
  • For text, most common primary segmentation is the
    token
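
As an illustration of two segmentations at different granularities over the same primary data, the sketch below builds a character-level and a token-level segmentation. The tokenization rule (runs of word characters, or single punctuation marks) is an assumption made for the example, not something LAF prescribes.

```python
import re

TEXT = "The clock struck twenty-two"

def character_segmentation(text):
    """One edge per character: the finest granularity for text."""
    return [(i, i + 1) for i in range(len(text))]

def token_segmentation(text):
    """One edge per token; whitespace between tokens is left uncovered."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]

print(len(character_segmentation(TEXT)))
print(token_segmentation(TEXT))
```

The token spans skip the spaces, which shows that the spans of a segmentation need not be contiguous.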

12
Referring to Primary Segmentation
  • Define an edge graph over the edges (spans) in
    the primary segmentation
  • Given an edge set, E, create an edge graph E'
    such that for each edge (x,y) in E, there is a
    vertex xy in E'
  • Annotations are associated with regions of
    primary data by referencing the edge graph
    vertices
  • Annotations never reference the primary data
    directly
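
The construction can be sketched as follows, under the assumption that each span edge carries an identifier and that the edge graph is a simple mapping from vertex identifier to span. The helper names are hypothetical.

```python
TEXT = "The clock struck twenty-two"

# Edge set over the primary data: identifier -> (x, y).
EDGE_SET = {"e1": (0, 3), "e2": (4, 9), "e3": (10, 16)}

def build_edge_graph(edge_set):
    """Create one vertex per edge (x, y); the vertex remembers its span."""
    return {vertex_id: {"span": span} for vertex_id, span in edge_set.items()}

def resolve(edge_graph, vertex_id, text):
    """Follow a vertex reference down to the region of primary data."""
    start, end = edge_graph[vertex_id]["span"]
    return text[start:end]

edge_graph = build_edge_graph(EDGE_SET)

# An annotation holds only a vertex identifier, never character offsets.
annotation_target = "e2"
print(resolve(edge_graph, annotation_target, TEXT))
```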

13
  • Edges in the edge graph E' are defined when
    annotations reference vertices in E'
  • Vertices may or may not be contiguous
  • An annotation is associated with vertices in E'
    as follows
  • Create a new vertex, v
  • Label it with the FS containing the annotation
    content
  • Create an edge from v to 0 or more vertices in E'
  • Zero reference is used in the special case where
    the annotation applies to information not present
    in the data
  • References to 2 or more vertices in the edge
    graph by default concatenate the information
    covered by the referenced vertices (in order)
  • Can be overridden to specify that the vertices
    are to be regarded as an ordered list or bag
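
The steps above can be sketched as a small data structure. The class name, the feature names, and the feature values are hypothetical; only the shape (a vertex labeled with an FS, with edges to zero or more edge-graph vertices, concatenated in order by default) follows the slide.

```python
from dataclasses import dataclass, field

TEXT = "The clock struck twenty-two"

# Edge-graph vertices for the last three spans of the sample sentence.
EDGE_GRAPH = {"e4": (17, 23), "e5": (23, 24), "e6": (24, 27)}

@dataclass
class AnnotationNode:
    node_id: str
    fs: dict                                       # annotation content
    edges_to: list = field(default_factory=list)   # referenced vertices

def covered_text(node, edge_graph, text):
    """Concatenate, in reference order, the regions the node points to."""
    return "".join(text[slice(*edge_graph[v])] for v in node.edges_to)

# An annotation over three vertices; its content is an illustrative FS.
number = AnnotationNode("n1", {"type": "number"}, ["e4", "e5", "e6"])

# Zero references: the annotation applies to nothing present in the data.
implicit = AnnotationNode("n2", {"type": "implicit"})

print(covered_text(number, EDGE_GRAPH, TEXT))
print(repr(covered_text(implicit, EDGE_GRAPH, TEXT)))
```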

14
[Figure: edge graph over the primary data "The clock struck twenty-two", with annotations associated with vertices in the primary data edge graph]
15
  • As many annotations as desired can reference the
    same segmentation or be layered over lower-level
    annotations

[Figure: primary data with two segmentations (SEG1, SEG2) and layered annotations MS1, MS2, MS3, Syn1, Syn2, NP, Co-Ref, Sem]
16
Annotating Annotations
  • Vertices in an annotation may be referenced from
    other annotations
  • Create a new vertex, v
  • Label it with the FS containing the annotation
    content
  • Create an edge from v to one or more vertices
    associated with an annotation
  • The strategy described above may be applied
    recursively, thus creating a DAG whose leaves are
    the vertices in the edge graph E'
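
A minimal sketch of the recursion, assuming annotation nodes are stored as a mapping from node identifier to the identifiers they point at. The node s1 is a hypothetical extra layer added for illustration.

```python
# Vertices of the primary segmentation edge graph (the leaves of the DAG).
LEAF_IDS = {"e1", "e2", "e3"}

# Annotation nodes and their outgoing edges.
NODES = {
    "t1": ["e1"],
    "t2": ["e2"],
    "t3": ["e3"],
    "np1": ["t1", "t2"],   # annotation over token annotations
    "s1": ["np1", "t3"],   # hypothetical annotation over np1 and t3
}

def leaves(node_id, nodes, leaf_ids):
    """Return the edge-graph vertices reachable from node_id, in order."""
    if node_id in leaf_ids:
        return [node_id]
    result = []
    for target in nodes[node_id]:
        for leaf in leaves(target, nodes, leaf_ids):
            if leaf not in result:   # a leaf may be reachable by several paths
                result.append(leaf)
    return result

print(leaves("np1", NODES, LEAF_IDS))
print(leaves("s1", NODES, LEAF_IDS))
```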

17
Annotations associated with token annotations
18
XML Instantiation
<!-- edges over primary data -->
<edge id="e1" from="0" to="3"/>
<edge id="e2" from="4" to="9"/>
<edge id="e3" from="10" to="16"/>
<edge id="e4" from="17" to="23"/>
<edge id="e5" from="23" to="24"/>
<edge id="e6" from="24" to="27"/>
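
One way such edge elements might be read is sketched below with Python's standard ElementTree parser. The wrapping segmentation element is an assumption added so the fragment parses as a single document, and the last edge is taken to start at offset 24 so that it covers "two" in the sample sentence.

```python
import xml.etree.ElementTree as ET

TEXT = "The clock struck twenty-two"

# The <segmentation> root is hypothetical; the slide shows only the edges.
XML = """<segmentation>
  <edge id="e1" from="0" to="3"/>
  <edge id="e2" from="4" to="9"/>
  <edge id="e3" from="10" to="16"/>
  <edge id="e4" from="17" to="23"/>
  <edge id="e5" from="23" to="24"/>
  <edge id="e6" from="24" to="27"/>
</segmentation>"""

def read_edges(xml_text):
    """Return {edge id: (from, to)} with offsets converted to integers."""
    root = ET.fromstring(xml_text)
    return {
        e.get("id"): (int(e.get("from")), int(e.get("to")))
        for e in root.iter("edge")
    }

edges = read_edges(XML)
regions = {edge_id: TEXT[a:b] for edge_id, (a, b) in edges.items()}
print(regions)
```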
19
Token Annotation
<node id="t2" edgesTo="e2">
  <fs type="token">
    <f name="base" value="clock"/>
    <f name="pos" value="NN"/>
  </fs>
</node>
Creates a new vertex (node) associated with the
FS with a single edge to vertex e2 in the
primary segmentation edge graph
20
NP Annotation
<node id="np1" edgesTo="t1 t2">
  <fs type="NP">
    <f name="number" sVal="singular"/>
  </fs>
</node>
Creates a new vertex (node) associated with the
FS with two outgoing edges to vertices t1 and
t2 in the token annotation
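
A sketch of how the token and NP nodes might be loaded into memory, using the standard ElementTree parser. The graph wrapper element and the dictionary layout are assumptions; the parser accepts either value or sVal for an atomic feature value, since the two slides use both spellings.

```python
import xml.etree.ElementTree as ET

# The <graph> root is hypothetical; the slides show only the node elements.
XML = """<graph>
  <node id="t2" edgesTo="e2">
    <fs type="token">
      <f name="base" value="clock"/>
      <f name="pos" value="NN"/>
    </fs>
  </node>
  <node id="np1" edgesTo="t1 t2">
    <fs type="NP">
      <f name="number" sVal="singular"/>
    </fs>
  </node>
</graph>"""

def parse_nodes(xml_text):
    """Return {node id: {"edges_to": [...], "fs": {...}}}."""
    nodes = {}
    for node in ET.fromstring(xml_text).iter("node"):
        fs = None
        fs_elem = node.find("fs")
        if fs_elem is not None:
            features = {
                f.get("name"): f.get("value", f.get("sVal"))
                for f in fs_elem.findall("f")
            }
            fs = {"type": fs_elem.get("type"), "features": features}
        nodes[node.get("id")] = {
            # An absent or empty edgesTo attribute yields an empty list.
            "edges_to": node.get("edgesTo", "").split(),
            "fs": fs,
        }
    return nodes

nodes = parse_nodes(XML)
print(nodes["np1"])
```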
21
Question
  • When referring to annotations, edge targets
    typically represent components
  • E.g. in the example, "the" and "clock" are
    components of NP
  • But this is not always the case
  • Could be e.g. a list of co-referents
  • Others?
  • Possible solution: let the processor deal with it
    using the FS type

22
Note
  • Edges are never labeled, unlike in many
    linguistic analyses
  • Preserves simplicity of the graph
  • Relations are data categories (DatCats)
  • edgesTo attribute can be empty
  • Can create pseudo-nodes
  • Implies a flat (non-nested) structure in the dump
    format

23
[Figure: graph with role labels s, obj, head, subj, gen, head over the items HAVE, FLEA, DOG, DOG, MY]
<node type="clone" id="E2" ref="t2"/>
<node id="c5" edgesTo="t5"><f name="role" sVal="gen"/></node>
<node id="c6" edgesTo="t2"><f name="role" sVal="head"/></node>
<node id="c7" edgesTo="c5 c6"/>
<node id="c1" edgesTo="t1"><f name="role" sVal="head"/></node>
<node id="c7" edgesTo="c7"><f name="role" sVal="s"/></node>
<node id="c3" edgesTo="t3"><f name="role" sVal="obj"/></node>
<node id="c4" edgesTo="E2"><f name="role" sVal="subj"/></node>
<node id="D1" edgesTo="c1 c7 c3 c4 E2"/>
24
Advantages of DAG
  • Can apply graph algorithms to traverse the graph
  • Breadth-first, depth-first traversal, shortest
    path, minimum spanning tree
  • Connectedness, articulation vertices
  • Topological sort
  • Graph coloring, graph partitioning
  • Etc.
  • What can we do with this?
  • What is all info on path to/from node x
  • What is nearest common ancestor of nodes x and y
  • Find matching sub-graphs
  • Identify connected components
  • Which nodes (phenomena) are most connected, form
    articulation vertices, etc.
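
Two of the operations listed above can be sketched on a small annotation DAG: a topological sort (via the standard graphlib module, Python 3.9+) and a nearest-common-ancestor query. The graph contents are illustrative; "nearest" is taken here to mean the common ancestor with the smallest combined distance, which is one possible definition among several.

```python
from collections import deque
from graphlib import TopologicalSorter

# Node -> nodes it points at (annotations point down toward the data).
GRAPH = {
    "s1": ["np1", "t3"],   # hypothetical top-level annotation
    "np1": ["t1", "t2"],
    "t1": ["e1"],
    "t2": ["e2"],
    "t3": ["e3"],
    "e1": [], "e2": [], "e3": [],
}

def leaves_first_order(graph):
    """Order nodes so every node comes after everything it points at."""
    return list(TopologicalSorter(graph).static_order())

def distances_up(graph, start):
    """Breadth-first distances from start to itself and all its ancestors."""
    parents = {}
    for node, targets in graph.items():
        for target in targets:
            parents.setdefault(target, []).append(node)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist

def nearest_common_ancestor(graph, x, y):
    """Return the common ancestor minimizing the summed distance, or None."""
    dx, dy = distances_up(graph, x), distances_up(graph, y)
    common = set(dx) & set(dy)
    if not common:
        return None
    return min(sorted(common), key=lambda n: dx[n] + dy[n])

print(leaves_first_order(GRAPH))
print(nearest_common_ancestor(GRAPH, "t1", "t2"))
```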

25
Feature Structures
  • Each edge is labeled with a feature value
  • Can be FS, collection (list, bag, set), atom
  • Alternation and grouping handled by the FS
    mechanisms
  • Need to identify basic FS mechanisms
  • 90% of annotations use only these
  • Annotations may (optionally) use only this set
  • Ease of use
  • No need to implement procedures to handle full
    power of FS
  • Need to create a FS library for abbreviation

26
Implications for Other WGs
  • Should (conceptually at least) separate
    referential structure from annotation content
  • E.g. tlink in TimeML/SemAF: the link itself is
    the edge, tlink is the annotation content (?)
  • Need for coordination
  • Inter-project coordination committee?
  • Need examples!

27
Today's Work
  • Discuss the format in terms of specific
    annotation types
  • Remember that dump format is in principle never
    seen by the user
  • Map user format into and out of dump format
  • Two topics
  • DAG for referential structure
  • FS for representing annotation content