Title: TEI
1 TEI and CES
- Mark-up for Scientific Purposes
2 If asked for a sure recipe of chaos I would
propose a project in which several thousand
impassioned specialists in scores of disciplines
from a dozen or more countries would be given
five years to produce some 1300 pages of
guidelines for representing the information
models of their specialties in a
rigorous, machine-verifiable notation.
Charles F. Goldfarb
3 BUT
The vaunted information superhighway would
hardly be worth traveling if the landscape were
dominated by industrial parks, office buildings,
and shopping malls. Thanks to the Text Encoding
Initiative, there will be museums, libraries,
theaters, and universities as well.
4 TEI
- 1987 Vassar Conference (Vassar College, N.Y.)
- international initiative of scientists; "Text Encoding Initiative" is also the name of the mark-up language
- diversity of tag sets, due to the different backgrounds of the scientists involved and their different scientific approaches
- bottom-up development
- conformity with SGML and XML; 5 versions of TEI exist
5 general architecture
- a) TEI Declaration: TEI.DCL
- b) TEI DTDs (Document Type Definitions)
- <!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
  <!ENTITY % TEI.prose 'INCLUDE'>
  <!ENTITY % TEI.textcrit 'INCLUDE'>
  ]>
- c) the annotated text
6 tag sets
- a) core tag set: available for any TEI document
- b) base tag sets: just one set can be chosen
- TEI.prose: prose
- TEI.verse: verse (poetry)
- TEI.drama: drama
- TEI.spoken: transcription of spoken language
- TEI.dictionaries: dictionaries
- TEI.terminology: terminological databases
- TEI.mixed: texts that need tags from more than one category
- TEI.general: similar to TEI.mixed
- c) additional tag sets: several can be chosen
- TEI.linking: tags to link documents
- TEI.analysis: simple analytical tags
- TEI.fs: tags for feature structures
- TEI.certainty: tags to name the author of particular tags and the probable accuracy of the mark-up
- TEI.transcr: transcription of primary sources
- TEI.textcrit: text criticism
- TEI.names.dates: names and dates
- TEI.nets: graphs and networks
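The selection rule on this slide (exactly one base tag set, any number of additional tag sets) can be sketched as a small helper that writes the DOCTYPE declaration. This is a minimal sketch, not part of TEI: the function name `build_doctype` and its error handling are assumptions made for illustration, and the system identifier "tei2.dtd" is taken from the architecture slide.

```python
# Tag set names as listed on the slide.
BASE_TAG_SETS = {
    "TEI.prose", "TEI.verse", "TEI.drama", "TEI.spoken",
    "TEI.dictionaries", "TEI.terminology", "TEI.mixed", "TEI.general",
}
ADDITIONAL_TAG_SETS = {
    "TEI.linking", "TEI.analysis", "TEI.fs", "TEI.certainty",
    "TEI.transcr", "TEI.textcrit", "TEI.names.dates", "TEI.nets",
}


def build_doctype(base, additional=()):
    """Return a DOCTYPE declaration that switches on one base tag set
    and any number of additional tag sets (hypothetical helper)."""
    if base not in BASE_TAG_SETS:
        raise ValueError(f"unknown base tag set: {base}")
    unknown = [name for name in additional if name not in ADDITIONAL_TAG_SETS]
    if unknown:
        raise ValueError(f"unknown additional tag sets: {unknown}")
    lines = ['<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [']
    for name in (base, *additional):
        # Each tag set is enabled through a parameter entity set to INCLUDE.
        lines.append(f"<!ENTITY % {name} 'INCLUDE'>")
    lines.append("]>")
    return "\n".join(lines)


print(build_doctype("TEI.prose", ["TEI.textcrit"]))
```

Taking the base tag set as a single required argument makes it impossible to pass two of them, which mirrors the "just one set can be chosen" rule.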
7 Example: TEI.transcr
- TEI.transcr: transcription of primary sources
- <del>: deletion
- <delSpan>: deletion that exceeds hierarchy limits (crosses element boundaries)
- attributes
- rend: declares how the deletion has been made; values
- subpunction: deletion by a line of dots below the text to be deleted
- overstrike: deletion by a line through the text
- erasure: deletion by erasure of the text
- bracketed: deletion by enclosing the text in brackets
- status: declares whether the deletion has mistakenly affected too much text on the left or right
- resp: declares who is responsible for identifying the deletion
- hand: declares who has deleted that part of the text
- cert: gives the certainty with which the deletion is attributed to that hand
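To make the attribute list concrete, the following sketch reads the `rend`, `hand` and `cert` attributes of a `<del>` element with Python's standard XML parser. The sample sentence, the hand identifier "scribeA" and the certainty value "high" are invented for illustration; only the element and attribute names come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the text, the hand "scribeA" and cert="high" are invented.
SAMPLE = (
    '<p>The quick <del rend="overstrike" hand="scribeA" cert="high">'
    "brown</del> fox</p>"
)


def list_deletions(xml_text):
    """Return one dict per <del> element with its text and attributes."""
    root = ET.fromstring(xml_text)
    return [
        {
            "text": "".join(deletion.itertext()),
            "rend": deletion.get("rend"),
            "hand": deletion.get("hand"),
            "cert": deletion.get("cert"),
        }
        for deletion in root.iter("del")
    ]


print(list_deletions(SAMPLE))
```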
8
- 400 elements
- problems with hierarchy
- problems with ambiguity
- bibliographical information included in TEI-annotated documents matches the standards of libraries
- vast number of ways to apply the TEI standard
9 CES
- Corpus Encoding Standard is part of the EAGLES guidelines by the Expert Advisory Group on Language Engineering Standards and is consistent with TEI.
- determination of a minimal encoding level for corpora
- motivated by the increasingly empirical approach of linguists
- data interchange between individuals or sites
- CES as a standard for interchange, so that only one translation between a single mark-up scheme and the local format is needed
- CES must be as expressive as the original text
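The interchange argument can be illustrated with a little arithmetic. This is a sketch under a simplifying assumption that is not stated on the slide: every site uses its own local format and needs conversion in both directions. The function names are hypothetical.

```python
def pairwise_converters(n_formats):
    """Directed converters needed if every local format is translated
    directly into every other local format."""
    return n_formats * (n_formats - 1)


def pivot_converters(n_formats):
    """Directed converters needed if every local format is translated
    only to and from one interchange format such as CES."""
    return 2 * n_formats


for n in (3, 5, 10):
    print(n, pairwise_converters(n), pivot_converters(n))
```

With a shared interchange format, each site maintains one translation pair for its own format, however many other sites take part.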
10 Criteria of the CES
- A) coverage
- B) consistency
- C) recoverability
- D) validatability
- E) capturability
- F) processability
- G) extensibility
- H) compactness
- I) readability
11 Global attributes
- id: a unique identifier for the element
- n: a number or other label for the element
- lang: indicates that the tag's content is in the specified language
- wsd: indicates that the tag's content is encoded in the specified character set
- rend: provides information about rendition in an original printed version; values
- BO: bold face
- BX: boxed
- IT: italic font
- RO: roman font
- UL: underlined
- CA: capital letters
12 Header
- each text <cesDoc> has a header and the whole corpus has a <cesHeader>, distinguished by the type attribute values CORPUS or TEXT
- title statement, edition statement, extent statement, publication statement, source description, encoding description, profile description, revision description
13 Minimal Header
<cesHeader version="2.0">
  <fileDesc>
    <titleStmt>
      <h.title></h.title>
    </titleStmt>
    <publicationStmt>
      <distributor></distributor>
      <pubAddress></pubAddress>
      <availability></availability>
      <pubDate></pubDate>
    </publicationStmt>
    <sourceDesc>
      <biblStruct>
        <monogr>
          <h.title></h.title>
          <h.author></h.author>
          <imprint>
            <pubPlace></pubPlace>
            <publisher></publisher>
            <pubDate></pubDate>
          </imprint>
        </monogr>
      </biblStruct>
    </sourceDesc>
  </fileDesc>
</cesHeader>
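A header can be checked against this minimal structure with a few lines of Python. The sketch below is not a substitute for DTD validation: it only tests whether each element path from the slide is present, and the helper names `find_path` and `missing_header_elements` as well as the sample header are hypothetical.

```python
import xml.etree.ElementTree as ET

# Element paths copied from the minimal header on this slide.
HEADER_PATHS = [
    "fileDesc/titleStmt/h.title",
    "fileDesc/publicationStmt/distributor",
    "fileDesc/publicationStmt/pubAddress",
    "fileDesc/publicationStmt/availability",
    "fileDesc/publicationStmt/pubDate",
    "fileDesc/sourceDesc/biblStruct/monogr/h.title",
    "fileDesc/sourceDesc/biblStruct/monogr/h.author",
    "fileDesc/sourceDesc/biblStruct/monogr/imprint/pubPlace",
    "fileDesc/sourceDesc/biblStruct/monogr/imprint/publisher",
    "fileDesc/sourceDesc/biblStruct/monogr/imprint/pubDate",
]


def find_path(root, path):
    """Walk a slash-separated path of child element names below root."""
    node = root
    for name in path.split("/"):
        node = next((child for child in node if child.tag == name), None)
        if node is None:
            return None
    return node


def missing_header_elements(header_xml):
    """Return the paths from HEADER_PATHS that the header does not contain."""
    root = ET.fromstring(header_xml)
    if root.tag != "cesHeader":
        raise ValueError("root element must be cesHeader")
    return [path for path in HEADER_PATHS if find_path(root, path) is None]


# Hypothetical, deliberately incomplete header: only the title is present.
INCOMPLETE = (
    '<cesHeader version="2.0"><fileDesc><titleStmt>'
    "<h.title>Sample text</h.title>"
    "</titleStmt></fileDesc></cesHeader>"
)
print(missing_header_elements(INCOMPLETE))
```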
14 Encoding Primary Data
- definition of 3 encoding levels
- level 1 is minimum standard
- interdependence
- validate against cesDoc DTD
15 level 1 conformance / metalanguage level
- CES-conformant encoding to paragraph level, having at least the following structure
- <cesDoc version="3.9">
- <cesHeader version="2.0"> ... </cesHeader>
- <text>
- <body>
- <div> (optional, for sections and chapters)
- <p>
- <p>
- <p>
- ...
- other paragraph-level elements
- <sp>: for written-to-be-spoken material
- <poem>: for poems embedded in a text
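The level 1 skeleton can be generated mechanically from a list of paragraphs. The following is a minimal sketch, assuming plain-text paragraphs as input; the function name `build_ces_doc` is hypothetical, the header is left as an empty placeholder, and the version numbers are simply the ones shown on the slide.

```python
import xml.etree.ElementTree as ET


def build_ces_doc(paragraphs, doc_version="3.9", header_version="2.0"):
    """Wrap plain paragraphs in the cesDoc/text/body/div/p skeleton."""
    doc = ET.Element("cesDoc", version=doc_version)
    # Placeholder only: a real header needs the content described on the header slides.
    ET.SubElement(doc, "cesHeader", version=header_version)
    text = ET.SubElement(doc, "text")
    body = ET.SubElement(text, "body")
    div = ET.SubElement(body, "div")
    for content in paragraphs:
        ET.SubElement(div, "p").text = content
    return doc


DOC = build_ces_doc(["First paragraph.", "Second paragraph."])
print(ET.tostring(DOC, encoding="unicode"))
```

The result is only well-formed; whether it is valid still has to be checked against the cesDoc DTD.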
16 level 2 conformance / syntactic level
- if a sub-paragraph element is marked, every occurrence of this element has to be marked throughout the text
- replacement of all special characters by SGML entities
- replacement of quotation marks by SGML entities
- correct identification of all paragraph elements
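The replacement of special characters and quotation marks can be sketched as a character-by-character lookup. The table below is a small, assumed selection of entity names from the ISO 8879 entity sets; a real corpus would need a fuller table, and the function name `to_sgml_entities` is hypothetical.

```python
# Assumed sample mapping: entity names from the ISO 8879 sets, not a complete table.
ENTITY_MAP = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "\u201c": "&ldquo;",  # left double quotation mark
    "\u201d": "&rdquo;",  # right double quotation mark
    "\u00e9": "&eacute;",  # e with acute accent
    "\u00e8": "&egrave;",  # e with grave accent
    "\u00df": "&szlig;",  # German sharp s
}


def to_sgml_entities(text):
    """Replace every character listed in ENTITY_MAP by its entity reference.

    Working one character at a time avoids escaping the ampersand of an
    entity reference that this function has just produced."""
    return "".join(ENTITY_MAP.get(char, char) for char in text)


print(to_sgml_entities("Er bi\u00df ins Gras."))
print(to_sgml_entities("\u201cA & B\u201d"))
```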
17 level 3 conformance / semantic level
- <hi> tags are resolved to more precise tags (foreign, term, etc.)
- the <hi> element marks typographically distinct words or phrases, especially when the purpose of the highlighting is not yet determined
- <hi rend="BO">: bold face
- identification and mark-up of the following sub-paragraph elements
- abbreviations
- numbers
- names
- foreign words and phrases
- validation of a 10% sample of the text
- elements for identifying s-units and quoted dialogue: <s>He said <q>The weather will be fine.</q></s>
18 refined forms of linguistic annotation
- <tok>, <orth>, <disamb>, <lex>, <base>, <msd>, <ctag>
- <tok class='tok' from='1.2.1\5'>
- <orth>critères</orth>
- <disamb>
- <ctag>NCMP</ctag>
- </disamb>
- <lex>
- <base>critère</base>
- <msd>Ncmp-</msd>
- <ctag>NCMP</ctag>
- </lex>
- </tok>
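The token annotation above can be read back into a simple record. In this sketch the interpretation of the elements is an assumption: `<disamb>` is taken to hold the tag chosen after disambiguation and each `<lex>` to hold one candidate lexical analysis. The function name `read_token` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The token from the slide, written as well-formed XML.
TOKEN_XML = r"""<tok class='tok' from='1.2.1\5'>
<orth>critères</orth>
<disamb><ctag>NCMP</ctag></disamb>
<lex><base>critère</base><msd>Ncmp-</msd><ctag>NCMP</ctag></lex>
</tok>"""


def read_token(xml_text):
    """Return the word form, the chosen corpus tag and all lexical analyses."""
    tok = ET.fromstring(xml_text)
    return {
        "orth": tok.findtext("orth"),
        "chosen_ctag": tok.findtext("disamb/ctag"),
        "analyses": [
            {
                "base": lex.findtext("base"),
                "msd": lex.findtext("msd"),
                "ctag": lex.findtext("ctag"),
            }
            for lex in tok.findall("lex")
        ],
    }


print(read_token(TOKEN_XML))
```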
19 Corpus alignment I
- rules specified in the cesAlign DTD
- separation of primary data and annotation
- avoids unwieldy documents
- architecture
- hub-document
- annotation document
- linking
- one-way links to hub-document
- linking by IDs or locators
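The separation of hub document and annotation document can be sketched as follows. Both sample documents are invented, and the element and attribute names `ann`, `target` and `label` are hypothetical placeholders rather than names from the CES DTDs; only the idea of one-way links that point to IDs in the hub document comes from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical hub document holding the primary data; the ids are invented.
HUB = (
    '<body><p id="p1"><s id="s1">He said it.</s>'
    '<s id="s2">She left.</s></p></body>'
)

# Hypothetical stand-off annotation document; "ann", "target" and "label"
# are placeholder names, not taken from the CES DTDs.
ANNOTATION = (
    '<annotations><ann target="s2" label="short"/>'
    '<ann target="s1" label="reported"/></annotations>'
)


def resolve_links(hub_xml, annotation_xml):
    """Follow each one-way link from the annotation document to the hub element it names."""
    hub = ET.fromstring(hub_xml)
    by_id = {el.get("id"): el for el in hub.iter() if el.get("id") is not None}
    resolved = []
    for ann in ET.fromstring(annotation_xml).iter("ann"):
        target = by_id.get(ann.get("target"))
        if target is None:
            raise KeyError(f"no hub element with id {ann.get('target')!r}")
        resolved.append((ann.get("label"), "".join(target.itertext())))
    return resolved


print(resolve_links(HUB, ANNOTATION))
```

Because the links run only from the annotation document to the hub document, the primary data never has to be edited when a new annotation layer is added.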
20 Corpus alignment II
- application: multilingual corpora, dictionaries
- problem: word alignment (compositionality vs. equivalence)
- -> idiomatic expressions/collocations, e.g. "Er biß ins Gras." vs. "He kicked the bucket."
- -> different usage of active and passive: "Man sagte mir" vs. "I was told"
- -> compounds: "Finanzamt" vs. "fiscal authorities"
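One way to cope with these cases is to align groups of tokens instead of single words. The sketch below is an illustration only: the class name `AlignmentLink` and the way the example phrases are grouped are assumptions, not part of the cesAlign DTD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentLink:
    """One alignment unit: a group of source tokens linked to a group of target tokens."""

    source: tuple
    target: tuple

    def is_word_level(self):
        """True only for a one-to-one link between single words."""
        return len(self.source) == 1 and len(self.target) == 1


# Assumed grouping of the examples from the slide (German to English).
LINKS = [
    AlignmentLink(("Er",), ("He",)),
    AlignmentLink(("biß", "ins", "Gras"), ("kicked", "the", "bucket")),
    AlignmentLink(("Man", "sagte", "mir"), ("I", "was", "told")),
    AlignmentLink(("Finanzamt",), ("fiscal", "authorities")),
]

MULTIWORD = [link for link in LINKS if not link.is_word_level()]
for link in MULTIWORD:
    print(" ".join(link.source), "<->", " ".join(link.target))
```

Linking the idiom as a whole keeps the equivalence of the two expressions without claiming that "Gras" corresponds to "bucket".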
21 examples
- The New York Times
- article: "French voters soundly reject European Union Constitution"
- Universität Osnabrück, Cognitive Sciences: korpusbasierte Kollokationssuche (corpus-based collocation search)
22 Sources
- CES - Vassar College, NY
- POS Tagging and Lemmatization
- TEI - University of Munich
- Universität Osnabrück, Cognitive Sciences: korpusbasierte Kollokationssuche (corpus-based collocation search)