Title: OSIS linguistic annotation
1. OSIS linguistic annotation
- definitions and requirements
Kirk E. Lowery, Westminster Hebrew Institute
SBL Computer-Assisted Research Group
2. The context
- why OSIS linguistic annotation?
3. The goal of OSIS
- to exchange electronic Bibles
- any language, medium, presentation style
- to add meta-information to those texts
- keywords: link, hierarchy, pyramid
- to easily transform these texts
- the target transformation is unknown
- to cut costs: production, presentation, distribution of Bibles plus meta-data
- time, money, people
4. Why exchange Bible texts?
- coordination within organizations
- cooperation between organizations and between individuals
- publish in multiple formats and media from one canonical source
- long-term archival
- the changing definition of "publish"
- documents have a life cycle!
5. Who wants to exchange texts?
- Bible publishers
- commercial publishing houses
- denominations and Bible societies
- Bible translators
- translation teams and editors
- consultants and supervisors
- Bible scholars
- original languages, text criticism
- text analysis and commentary
6. Text meta-data
- what information needs to be captured?
7. Translators: managing the translation process
- document versions and responsibility
- comments and corrections by editors
- handling presentation issues
- script direction
- rubies
- linking source, relay and target translations
- linking supplementary information
- notes, glossaries, maps
8. Translators and scholars: focus on the text
- manuscript collation and description
- text criticism: establishment of the original
- linguistic analysis
- text segmentation
- segment id: from phoneme to text structures
- linguistic mapping of source and target
- alignment: parallel and synoptic texts
9. Linguistic annotation
- how can we capture the information?
10. Required
- a way to segment the text
- a mechanism for associating labels with an arbitrary text-span
- a means to declare labels used in analysis
- a common linguistic vocabulary
- language-specific grammar terms
- a protocol for user redefinition
11. Segmenting text
<seg id="gn11,1.1">B.</seg>
<seg id="gn11,1.2">R")IYT</seg>
<seg id="gn11,2.1">B.FRF)</seg>
<seg id="gn11,3.1">)ELOHIYM</seg>
<seg id="gn11,4.1">)"T</seg>
<seg id="gn11,5.1">HA</seg>
<seg id="gn11,5.2">.FMAYIM</seg>
<seg id="gn11,6.1">W</seg>
<seg id="gn11,6.2">)"T</seg>
<seg id="gn11,7.1">HF</seg>
<seg id="gn11,7.2">)FREC</seg>
Callouts: start tag, unique identification, Hebrew text, end tag
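As a rough illustration of what the segment ids make possible, the following Python sketch reads a few of the segments above and groups them by word. The `<verse>` wrapper element, the function names, and the reading of the id as "word number, then segment number after the final dot" are assumptions made for this example and are not part of the slides.

```python
import xml.etree.ElementTree as ET

# Three segments from the slide, wrapped in a hypothetical <verse> root so the
# fragment is well-formed XML. The wrapper element is an assumption.
SAMPLE = """<verse>
<seg id="gn11,1.1">B.</seg>
<seg id="gn11,1.2">R")IYT</seg>
<seg id="gn11,2.1">B.FRF)</seg>
</verse>"""


def read_segments(xml_text):
    """Return (id, text) pairs for every <seg> element, in document order."""
    root = ET.fromstring(xml_text)
    return [(seg.get("id"), (seg.text or "").strip()) for seg in root.iter("seg")]


def group_by_word(segments):
    """Group segment texts by the part of the id before the final '.'.

    Assumes ids look like 'gn11,1.2', where the last number indexes the
    segment within a word. This reading of the id scheme is a guess.
    """
    words = {}
    for seg_id, text in segments:
        word_id = seg_id.rsplit(".", 1)[0]
        words.setdefault(word_id, []).append(text)
    return words


segments = read_segments(SAMPLE)
words = group_by_word(segments)
```

Because every segment carries its own id, any later layer of annotation can point at a segment without touching the text itself.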
12. Adding annotation (1)
<seg id="gn11,1.1">B. <lemma>B.</lemma> <particle type="preposition" /></seg>
<seg id="gn11,1.2">R")IYT <lemma>R")IYT</lemma> <noun type="common" features="fsa" /></seg>
<seg id="gn11,2.1">B.FRF) <lemma homonym="1">B.R)</lemma> <verb stem="q" conjugation="p" pgn="3ms" /></seg>
<seg id="gn11,3.1">)ELOHIYM <lemma>)ELOHIYM</lemma> <noun type="common" features="mpa" /></seg>
<seg id="gn11,4.1">)"T <lemma homonym="1">)"T</lemma> <particle type="object_marker" /></seg>
Callouts: content tag, milestone tag
13. Adding annotation (2)
<seg id="gn11,5.1">HA <lemma>H</lemma> <particle type="article" /></seg>
<seg id="gn11,5.2">.FMAYIM <lemma>FMAYIM</lemma> <noun type="common" features="mpa" /></seg>
<seg id="gn11,6.1">W <lemma>W</lemma> <particle type="conjunction" /></seg>
<seg id="gn11,6.2">)"T <lemma homonym="1">)"T</lemma> <particle type="object_marker" /></seg>
<seg id="gn11,7.1">HF <lemma>H</lemma> <particle type="article" /></seg>
<seg id="gn11,7.2">)FREC <lemma>)EREC</lemma> <noun type="common" features="fsa" /></seg>
Callouts: content tag, milestone tag
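To show how the inline annotation could be consumed, here is a minimal Python sketch that turns annotated segments into plain records. It assumes that each `<seg>` holds the surface form as leading text, one `<lemma>` content element, and one empty grammar element such as `<verb />` or `<noun />`. The `<verse>` wrapper and the record field names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two annotated segments from the slides inside a hypothetical <verse> wrapper.
SAMPLE = """<verse>
<seg id="gn11,2.1">B.FRF) <lemma homonym="1">B.R)</lemma> <verb stem="q" conjugation="p" pgn="3ms" /></seg>
<seg id="gn11,3.1">)ELOHIYM <lemma>)ELOHIYM</lemma> <noun type="common" features="mpa" /></seg>
</verse>"""


def read_annotations(xml_text):
    """Return one dict per <seg>: id, surface form, lemma, part of speech, attributes."""
    records = []
    root = ET.fromstring(xml_text)
    for seg in root.iter("seg"):
        # seg.text is the text before the first child, i.e. the surface form.
        record = {"id": seg.get("id"), "form": (seg.text or "").strip()}
        for child in seg:
            if child.tag == "lemma":
                record["lemma"] = (child.text or "").strip()
            else:
                # Any other child is treated as the empty grammar tag.
                record["pos"] = child.tag
                record["attrs"] = dict(child.attrib)
        records.append(record)
    return records


records = read_annotations(SAMPLE)
```

The first record would carry the form `B.FRF)`, the lemma `B.R)`, and the part of speech `verb` with its three attributes.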
14. The hard part: linguistic labels
- must be standard
- must be applicable to any conceivable language
- labels are the linguistic inventory
- must be compatible with current and future linguistic theories
- labels must be linguistic theory-neutral
- must be redefinable by the user
15. Standard solutions: labels
- Expert Advisory Group on Language Engineering Standards (EAGLES)
- <http://www.ilc.pi.cnr.it/EAGLES/home.html>
- an initiative of the European Commission (1993)
- standard grammar labels of morphology and syntax for European languages
- create OSIS standard labels for Hebrew, Aramaic and Greek
16. Standard solutions: mechanism
- the Text Encoding Initiative (TEI) Guidelines
- chapter 14: linking, segmentation, alignment
- chapter 16: feature structures
- chapter 26: feature system declaration
- stand-off markup (XLink) or up-close-and-personal (inline)?
- separate meta-data about the text from the text itself?
- either-or or both-and?
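The stand-off versus inline question can be made concrete with a small Python sketch. It keeps the text and the annotation in separate structures linked only by segment id, then merges them on demand. This uses plain id references, not XLink syntax, and the data layout and function names are assumptions for illustration only.

```python
# Base text: segment ids and forms only, with no linguistic markup.
TEXT = [("gn11,5.1", "HA"), ("gn11,5.2", ".FMAYIM")]

# Stand-off layer: kept apart from the text, pointing at segments by id.
ANNOTATIONS = [
    {"target": "gn11,5.1", "pos": "particle", "type": "article"},
    {"target": "gn11,5.2", "pos": "noun", "type": "common", "features": "mpa"},
]


def to_inline(text, annotations):
    """Merge stand-off annotations into an inline-style view of the text."""
    by_target = {}
    for ann in annotations:
        by_target.setdefault(ann["target"], []).append(ann)
    merged = []
    for seg_id, form in text:
        labels = [
            {key: value for key, value in ann.items() if key != "target"}
            for ann in by_target.get(seg_id, [])
        ]
        merged.append({"id": seg_id, "form": form, "labels": labels})
    return merged


def dangling_targets(text, annotations):
    """List annotation targets that match no segment id in the text."""
    known = {seg_id for seg_id, _ in text}
    return [ann["target"] for ann in annotations if ann["target"] not in known]


inline_view = to_inline(TEXT, ANNOTATIONS)
```

One trade-off this makes visible: stand-off layers leave the text untouched and can coexist, but they need a check like `dangling_targets` because a pointer can outlive the segment it refers to.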
17. Formal requirements
18. Labels
- claims made about the data itself vs. claims about the claims that can be made!
- the linguistic model vs. the analysis allowed by the model
- example: does Hebrew have adverbs?
- a library of labels as comprehensive as possible
- definitions to clarify what thing is being labeled
- labels are names for grammatical objects
19. Labels as objects
- grammatical objects have attributes or features
- features can vary over a range of values
- objects and features have defaults that could be changed
- objects and features could be easily extended
- objects and features can be arranged linearly or hierarchically
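The object view of labels can be sketched in a few lines of Python. The class names, the feature values, and the noun hierarchy below are hypothetical and are meant only to illustrate value ranges, changeable defaults, extension, and hierarchical arrangement, not to propose an OSIS data model.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Feature:
    name: str
    values: Tuple[str, ...]          # the range of values the feature may take
    default: Optional[str] = None    # used when no value is chosen


@dataclass
class GrammaticalObject:
    label: str
    features: Dict[str, Feature] = field(default_factory=dict)
    parent: Optional["GrammaticalObject"] = None

    def add_feature(self, feature: Feature) -> None:
        """Extend (or redefine) this object with a feature."""
        self.features[feature.name] = feature

    def all_features(self) -> Dict[str, Feature]:
        """Own features plus inherited ones; own definitions win."""
        inherited = self.parent.all_features() if self.parent else {}
        return {**inherited, **self.features}

    def describe(self, **chosen: str) -> Dict[str, Optional[str]]:
        """Fill in defaults and reject values outside a feature's range."""
        available = self.all_features()
        unknown = set(chosen) - set(available)
        if unknown:
            raise KeyError(f"unknown feature(s): {sorted(unknown)}")
        result = {}
        for name, feature in available.items():
            value = chosen.get(name, feature.default)
            if value is not None and value not in feature.values:
                raise ValueError(f"{value!r} is not a valid {name}")
            result[name] = value
        return result


# Hypothetical hierarchy: "common noun" inherits from "noun".
noun = GrammaticalObject("noun")
noun.add_feature(Feature("gender", ("masculine", "feminine", "neuter")))
noun.add_feature(Feature("number", ("singular", "plural", "dual"), default="singular"))
common_noun = GrammaticalObject("common noun", parent=noun)
common_noun.add_feature(Feature("state", ("absolute", "construct"), default="absolute"))

analysis = common_noun.describe(gender="feminine")
```

Here `analysis` combines the chosen gender with the inherited default number and the default state, while a value outside the declared range raises an error.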
20. Mechanism
- user language declaration
- all labels and their relationships
- done by exclusion, not inclusion
- sensitive to linguistic theory
- levels of language and resolution of ambiguity
- lexical, semantic, phonemic, morphologic, phrase-, clause-, discourse-, theological levels
- context-free and context-bound analysis
- part-of-speech resolution
21. TEI feature structures
- the feature element
- the most basic markup
- requires a label and any number of values
- <f name="feature name" value="feature value">
- the feature structure element
- <fs name="feature structure name">
- may contain any number of nested <f> and <fs>
- models some grammatical object
22. TEI feature example
<f name="conjugation">
  <vAlt mutExcl="Y">
    <sym id="pf" value="perfect" />
    <sym id="impf" value="imperfect" />
    <sym id="qppt" value="qal_passive_participle" />
    <sym id="wc" value="wayyiqtol" />
    <sym id="impv" value="imperative" />
    <sym id="inf" value="infinitive" />
    <sym id="pt" value="participle" />
  </vAlt>
</f>
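A declaration like this could drive a simple validity check. The Python sketch below loads the alternatives and rejects unknown codes, or more than one code when `mutExcl` is "Y". The function names and the checking rules are assumptions for illustration and are not TEI-defined behaviour.

```python
import xml.etree.ElementTree as ET

# The conjugation declaration from the slide.
DECLARATION = """<f name="conjugation">
  <vAlt mutExcl="Y">
    <sym id="pf" value="perfect" />
    <sym id="impf" value="imperfect" />
    <sym id="qppt" value="qal_passive_participle" />
    <sym id="wc" value="wayyiqtol" />
    <sym id="impv" value="imperative" />
    <sym id="inf" value="infinitive" />
    <sym id="pt" value="participle" />
  </vAlt>
</f>"""


def load_alternatives(xml_text):
    """Return (feature name, {code: value}, mutually-exclusive flag)."""
    feature = ET.fromstring(xml_text)
    alternatives = feature.find("vAlt")
    symbols = {sym.get("id"): sym.get("value") for sym in alternatives.iter("sym")}
    exclusive = alternatives.get("mutExcl") == "Y"
    return feature.get("name"), symbols, exclusive


def check(codes, symbols, exclusive):
    """Expand codes to values, rejecting unknown or conflicting codes."""
    unknown = [code for code in codes if code not in symbols]
    if unknown:
        raise ValueError(f"undeclared code(s): {unknown}")
    if exclusive and len(codes) > 1:
        raise ValueError("alternatives are mutually exclusive")
    return [symbols[code] for code in codes]


name, symbols, exclusive = load_alternatives(DECLARATION)
expanded = check(["wc"], symbols, exclusive)
```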
23. TEI feature structure example
<fs type="common noun features">
  <f name="gender" org="set" fVal="gm gf gn" />
  <f name="number" org="set" fVal="ns np nd" />
  <f name="state" org="set" fVal="sa sc" />
</fs>
24. TEI feature library example
<fvLib id="g" type="gender feature values">
  <vAlt mutExcl="N">
    <sym id="gm" value="masculine" />
    <sym id="gf" value="feminine" />
    <sym id="gn" value="neuter" />
  </vAlt>
</fvLib>
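The link between a feature structure and a feature value library can be sketched as a lookup: the codes listed in `fVal` are resolved against the `id` values declared in the library. The Python below is a minimal illustration under that assumption. The function names are hypothetical, and codes with no library entry (here the number codes, whose library is not shown on the slides) are simply left unresolved.

```python
import xml.etree.ElementTree as ET

# The gender value library from the slide.
LIBRARY = """<fvLib id="g" type="gender feature values">
  <vAlt mutExcl="N">
    <sym id="gm" value="masculine" />
    <sym id="gf" value="feminine" />
    <sym id="gn" value="neuter" />
  </vAlt>
</fvLib>"""

# Part of the common noun feature structure from the previous slide.
STRUCTURE = """<fs type="common noun features">
  <f name="gender" org="set" fVal="gm gf gn" />
  <f name="number" org="set" fVal="ns np nd" />
</fs>"""


def load_library(xml_text):
    """Map each declared symbol id to its value."""
    library = ET.fromstring(xml_text)
    return {sym.get("id"): sym.get("value") for sym in library.iter("sym")}


def resolve(structure_xml, library):
    """Replace fVal codes with library values where a declaration exists."""
    structure = ET.fromstring(structure_xml)
    resolved = {}
    for feature in structure.findall("f"):
        codes = feature.get("fVal", "").split()
        # Codes missing from the library are kept as-is rather than guessed.
        resolved[feature.get("name")] = [library.get(code, code) for code in codes]
    return resolved


library = load_library(LIBRARY)
resolved = resolve(STRUCTURE, library)
```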
25. A different approach
Dictionary of Packard-Style Greek Morphology Codes
<div type="x-tag" osisID="A_APFC" divTitle="A APFC">
  <p>Part of speech: adjective</p>
  <p>Case: accusative</p>
  <p>Number: plural</p>
  <p>Gender: feminine</p>
  <p>Degree: comparative</p>
</div>
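A dictionary entry of this kind can be read back into a lookup table. The Python sketch below assumes each `<p>` holds a "label: value" pair separated by a colon. That separator and the function name are assumptions made for this example, not part of the dictionary's definition.

```python
import xml.etree.ElementTree as ET

# One dictionary entry, with each paragraph written as "label: value".
ENTRY = """<div type="x-tag" osisID="A_APFC" divTitle="A APFC">
  <p>Part of speech: adjective</p>
  <p>Case: accusative</p>
  <p>Number: plural</p>
  <p>Gender: feminine</p>
  <p>Degree: comparative</p>
</div>"""


def read_entry(xml_text):
    """Return (osisID, {label: value}) for one dictionary <div>."""
    div = ET.fromstring(xml_text)
    fields = {}
    for paragraph in div.findall("p"):
        # Split on the first colon only; the rest is the value.
        label, _, value = (paragraph.text or "").partition(":")
        fields[label.strip()] = value.strip()
    return div.get("osisID"), fields


code, fields = read_entry(ENTRY)
```

Unlike the feature structure approach, the meaning of each code here lives in prose paragraphs, so a program can recover it only by relying on a text convention such as the colon.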
26. What can we do with text marked up with feature structures?
- self-organizing topic maps
- compare linguistic hypotheses with actual usage
- XSLT transforms
- automated tagging of new features
- comparative linguistic study
- source-to-target language grammar mapping
27. Conclusions
- where do we go from here?
28. In the short term
- complete a first pass of language modeling
- mark up real biblical text with annotation
- distribute to translators and scholars for feedback
- does this meet your needs?
- is it practical enough that you will use it?
- is it flexible enough for your language(s) and linguistic theories?
29. In the long term
- determine if TEI feature structures are sufficient
- decide whether to require inline or stand-off markup, or to allow either
- determine the best way of integrating linguistic markup with the OSIS core tag set
- explore ideas for authoring software or, at least, linguistic annotation utility programs