TEI

1
TEI and CES

Mark-up for Scientific Purposes

2
If asked for a sure recipe for chaos I would
propose a project in which several thousand
impassioned specialists in scores of disciplines
from a dozen or more countries would be given
five years to produce some 1300 pages of
guidelines for representing the information
models of their specialties in a
rigorous, machine-verifiable notation.
Charles F. Goldfarb
3
BUT
The vaunted information superhighway would
hardly be worth traveling if the landscape were
dominated by industrial parks, office buildings,
and shopping malls. Thanks to the Text Encoding
Initiative, there will be museums, libraries,
theaters, and universities as well.
4
TEI

1987: Vassar Conference (Vassar College, N.Y.)
international initiative of scientists: the Text Encoding Initiative
TEI is also the name of the mark-up language
diversity of tag sets, owing to the different backgrounds of the scientists involved and their different scientific approaches
bottom-up development
conforms to SGML and XML; five versions of TEI exist

5
general architecture

a) TEI declaration: TEI.DCL
b) TEI DTDs (Document Type Definitions)
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
  <!ENTITY % TEI.prose 'INCLUDE'>
  <!ENTITY % TEI.textcrit 'INCLUDE'>
]>
c) the annotated text

6
tag sets

a) core tag set: available for any TEI document
b) base tag sets: just one can be chosen
  TEI.prose: prose
  TEI.verse: verse
  TEI.drama: drama
  TEI.spoken: transcriptions of spoken language
  TEI.dictionaries: dictionaries
  TEI.terminology: terminological databases
  TEI.mixed: texts that need tags from more than one category
  TEI.general: similar to TEI.mixed
c) additional tag sets: several can be chosen
  TEI.linking: tags for linking documents
  TEI.analysis: simple analytical tags
  TEI.fs: tags for feature structures
  TEI.certainty: tags recording who is responsible for particular tags and how accurate the mark-up probably is
  TEI.transcr: transcription of primary sources
  TEI.textcrit: textual criticism
  TEI.names.dates: names and dates
  TEI.nets: graphs and networks
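
The selection rule above (exactly one base tag set, any number of additional ones) can be sketched in a few lines of Python. This is only an illustration: the function name build_doctype is hypothetical, the tag-set names are copied from the lists above, and the output follows the usual SGML/XML internal-subset form with parameter entities.

```python
# Tag-set names as listed on the slide (without the "TEI." prefix).
BASE_SETS = {"prose", "verse", "drama", "spoken", "dictionaries",
             "terminology", "mixed", "general"}
ADDITIONAL_SETS = {"linking", "analysis", "fs", "certainty", "transcr",
                   "textcrit", "names.dates", "nets"}


def build_doctype(base, additional=()):
    """Return a DOCTYPE declaration that switches on the chosen tag sets.

    Taking a single `base` argument enforces the "just one base set" rule.
    """
    if base not in BASE_SETS:
        raise ValueError(f"unknown base tag set: {base}")
    unknown = [name for name in additional if name not in ADDITIONAL_SETS]
    if unknown:
        raise ValueError(f"unknown additional tag sets: {unknown}")
    lines = ['<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [']
    for name in (base, *additional):
        lines.append(f"  <!ENTITY % TEI.{name} 'INCLUDE'>")
    lines.append("]>")
    return "\n".join(lines)


print(build_doctype("prose", ["textcrit"]))
```

Passing an additional set such as "textcrit" as the base raises a ValueError, which mirrors the distinction between the two groups.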

7
Ex. TEI.transcr

TEI.transcr: transcription of primary sources
<del>: deletion
<delSpan>: deletion that crosses the boundaries of the element hierarchy
attributes
  rend: declares how the deletion was carried out
    values
    subpunction: deletion marked by a line of dots below the deleted text
    overstrike: deletion by a line through the text
    erasure: deletion by erasing the text
    bracketed: deletion by enclosing the text in brackets
  status: declares whether the deletion mistakenly covers too much text on the left or right
  resp: declares who is responsible for identifying the deletion
  hand: declares who deleted the passage
  cert: gives the degree of certainty about the origin of the deletion
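
As a rough illustration of how such a deletion might be processed, the following Python sketch reads the attributes of a deletion element and produces the text without the deleted passage. The sample sentence and all attribute values except "overstrike" are invented for this example, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample; only the element and attribute names come from the slide.
SAMPLE = (
    '<p>The <del rend="overstrike" hand="author" resp="editor" '
    'cert="high">quick </del>brown fox</p>'
)


def describe_deletions(xml_text):
    """List every <del> with its text and its attributes."""
    root = ET.fromstring(xml_text)
    return [{"text": d.text or "", **d.attrib} for d in root.iter("del")]


def reading_text(xml_text):
    """Return the text of the root element with deleted passages left out.

    Only direct children of the root are inspected, which is enough
    for a flat paragraph like the sample.
    """
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for child in root:
        if child.tag != "del":
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


print(describe_deletions(SAMPLE))
print(reading_text(SAMPLE))
```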

8
400 elements
problems with hierarchy
problems with ambiguity
bibliographical information included in TEI-annotated documents matches library standards
vast range of possible applications of the TEI standard

9
CES

The Corpus Encoding Standard is part of the EAGLES guidelines by the Expert Advisory Group on Language Engineering Standards and is consistent with TEI.
defines a minimal encoding level for corpora
motivated by the increasingly empirical approach of linguists
data interchange between individuals or sites
CES as an interchange standard: each site needs only one translation, between its local format and a single mark-up scheme
CES must be as expressive as the original text

10
Criteria of the CES

A) coverage
B) consistency
C) recoverability
D) validatability
E) capturability
F) processability
G) extensibility
H) compactness
I) readability

11
Global attributes

id: a unique identifier for the element
n: a number or other label for the element
lang: indicates that the element's content is in the specified language
wsd: indicates that the element's content is encoded in the specified character set
rend: provides information about the rendition in the original printed version; values:
  BO: bold face
  BX: boxed
  IT: italic font
  RO: roman font
  UL: underlined
  CA: capital letters

12
Header

each text (<cesDoc>) has a header and the corpus as a whole has one too; both are <cesHeader> elements, distinguished by the type attribute value CORPUS or TEXT
components: title statement, edition statement, extent statement, publication statement, source description, encoding description, profile description, revision description

13
Minimal Header
<cesHeader version="2.0">
  <fileDesc>
    <titleStmt>
      <h.title></h.title>
    </titleStmt>
    <publicationStmt>
      <distributor></distributor>
      <pubAddress></pubAddress>
      <availability></availability>
      <pubDate></pubDate>
    </publicationStmt>
    <sourceDesc>
      <biblStruct>
        <monogr>
          <h.title></h.title>
          <h.author></h.author>
          <imprint>
            <pubPlace></pubPlace>
            <publisher></publisher>
            <pubDate></pubDate>
          </imprint>
        </monogr>
      </biblStruct>
    </sourceDesc>
  </fileDesc>
</cesHeader>
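
A quick way to see which parts of the minimal header a given file still lacks is to walk the expected element paths. The Python sketch below does that; it is a structural check only, not DTD validation, and the names REQUIRED_PATHS, descend and missing_elements as well as the sample title are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Element paths taken from the minimal header shown above.
REQUIRED_PATHS = [
    "fileDesc/titleStmt/h.title",
    "fileDesc/publicationStmt/distributor",
    "fileDesc/publicationStmt/pubAddress",
    "fileDesc/publicationStmt/availability",
    "fileDesc/publicationStmt/pubDate",
    "fileDesc/sourceDesc/biblStruct/monogr/h.title",
    "fileDesc/sourceDesc/biblStruct/monogr/h.author",
    "fileDesc/sourceDesc/biblStruct/monogr/imprint/pubPlace",
    "fileDesc/sourceDesc/biblStruct/monogr/imprint/publisher",
    "fileDesc/sourceDesc/biblStruct/monogr/imprint/pubDate",
]


def descend(element, path):
    """Follow child tag names step by step; return None if a step is missing."""
    for tag in path.split("/"):
        element = next((child for child in element if child.tag == tag), None)
        if element is None:
            return None
    return element


def missing_elements(header_xml):
    """Return the required paths that are absent from a cesHeader string."""
    header = ET.fromstring(header_xml)
    if header.tag != "cesHeader":
        return ["cesHeader"]
    return [path for path in REQUIRED_PATHS if descend(header, path) is None]


# Deliberately incomplete header: only the title statement is present.
PARTIAL = (
    '<cesHeader version="2.0"><fileDesc><titleStmt>'
    "<h.title>Sample text</h.title>"
    "</titleStmt></fileDesc></cesHeader>"
)
print(missing_elements(PARTIAL))
```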
14
Encoding Primary Data

three encoding levels are defined
level 1 is the minimum standard
interdependence
validate against cesDoc DTD

15
level 1 conformance / metalanguage level

CES-conformant encoding down to paragraph level, with at least the following structure
<cesDoc version="3.9">
  <cesHeader version="2.0"> ... </cesHeader>
  <text>
    <body>
      <div>   optional, for sections, chapters
        <p>
        <p>
        <p>
        ... other paragraph-level elements
        <sp>     for written-to-be-spoken material
        <poem>   for poems embedded in a text
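
The level 1 skeleton can be checked roughly in Python as follows. This is a sketch under stated assumptions: it only tests for the elements named on the slide, it does not replace validation against the cesDoc DTD (the standard-library parser used here does not validate), and the function name level1_problems is hypothetical.

```python
import xml.etree.ElementTree as ET

# Paragraph-level elements mentioned on the slide.
PARAGRAPH_LEVEL = {"p", "sp", "poem"}


def level1_problems(doc_xml):
    """Return a list of structural problems; an empty list means none found."""
    root = ET.fromstring(doc_xml)
    problems = []
    if root.tag != "cesDoc":
        problems.append("root element is not cesDoc")
    if root.find("cesHeader") is None:
        problems.append("missing cesHeader")
    body = root.find("text/body")
    if body is None:
        problems.append("missing text/body")
    elif not any(elem.tag in PARAGRAPH_LEVEL for elem in body.iter()):
        problems.append("body has no paragraph-level elements")
    return problems


GOOD = (
    '<cesDoc version="3.9"><cesHeader version="2.0"/>'
    "<text><body><div><p>One paragraph.</p></div></body></text></cesDoc>"
)
print(level1_problems(GOOD))
```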

16
level 2 conformance / syntactic level

if a sub-paragraph element is marked, every occurrence of that element has to be marked throughout the text
all special characters are replaced by SGML entities
quotation marks are replaced by SGML entities
all paragraph-level elements are correctly identified
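
The replacement of special characters and quotation marks by entities might look like this in Python. The sketch assumes that the entity names in Python's html.entities table (which follow the ISO-derived names used in HTML) are acceptable stand-ins; the entity sets actually prescribed by CES may differ. It is meant for plain text content, not for text that already contains mark-up.

```python
from html.entities import codepoint2name


def to_entities(text):
    """Replace non-ASCII characters and double quotes by named entities.

    Characters without a known entity name are left unchanged.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if (code > 127 or ch == '"') and code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")
        else:
            out.append(ch)
    return "".join(out)


print(to_entities('Er biß ins "Gras"'))
```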

17
level 3 conformance / semantic level

<hi> tags are resolved to more precise tags (foreign, term, etc.)
<hi> marks typographically distinct words or phrases, especially when the purpose of the highlighting is not yet determined
<hi rend="BO">: bold face
identification and mark-up of the following sub-paragraph elements
  abbreviations
  numbers
  names
  foreign words and phrases
validation of a 10% sample of the text
elements for identifying s-units and quoted dialogue
<s>He said <q>The weather will be fine.</q></s>
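
Once s-units and quoted dialogue are marked, they are easy to extract. A minimal Python sketch, using the sentence from the slide wrapped in a paragraph element (the wrapper and the function name quoted_dialogue are assumptions for this example):

```python
import xml.etree.ElementTree as ET

SAMPLE = "<p><s>He said <q>The weather will be fine.</q></s></p>"


def quoted_dialogue(xml_text):
    """Return (sentence text, list of quotations) for every s-unit."""
    root = ET.fromstring(xml_text)
    return [
        ("".join(s.itertext()), ["".join(q.itertext()) for q in s.iter("q")])
        for s in root.iter("s")
    ]


print(quoted_dialogue(SAMPLE))
```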

18
refined forms of linguistic annotation

<tok>, <orth>, <disamb>, <lex>, <base>, <msd>, <ctag>
<tok class='tok' from='1.2.1\5'>
  <orth>critères</orth>
  <disamb>
    <ctag>NCMP</ctag>
  </disamb>
  <lex>
    <base>critère</base>
    <msd>Ncmp-</msd>
    <ctag>NCMP</ctag>
  </lex>
</tok>
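
The token annotation above can be read into a simple data structure with Python's standard XML parser. In this sketch the interpretation of the disamb element as the tag chosen after disambiguation and of each lex element as one lexical reading is an assumption, as is the function name read_token.

```python
import xml.etree.ElementTree as ET

# Raw string so that the backslash in the 'from' attribute is kept as is.
SAMPLE = r"""<tok class='tok' from='1.2.1\5'>
  <orth>critères</orth>
  <disamb><ctag>NCMP</ctag></disamb>
  <lex><base>critère</base><msd>Ncmp-</msd><ctag>NCMP</ctag></lex>
</tok>"""


def read_token(xml_text):
    """Collect the word form, the chosen tag and all lexical readings."""
    tok = ET.fromstring(xml_text)
    return {
        "orth": tok.findtext("orth"),
        "chosen_ctag": tok.findtext("disamb/ctag"),
        "readings": [
            {child.tag: child.text for child in lex}
            for lex in tok.findall("lex")
        ],
    }


print(read_token(SAMPLE))
```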

19
Corpus alignment I

rules specified in the cesAlign DTD
separation of primary data and annotation
avoids unwieldy documents
architecture
  hub document
  annotation document
linking
  one-way links to the hub document
  linking by IDs
  or locators
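
The idea of one-way links from an annotation document into a hub document can be sketched as follows. The element and attribute names used for the links ("link", "target", "note") and both sample documents are invented for illustration and are not taken from the cesAlign DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical hub document: primary data with IDs on the sentences.
HUB = '<body><p><s id="s1">First sentence.</s><s id="s2">Second one.</s></p></body>'

# Hypothetical annotation document that points into the hub by ID.
ANNOTATION = '<annotations><link target="s2" note="checked"/></annotations>'


def resolve_links(hub_xml, annotation_xml):
    """Return (note, text of the linked hub element) for every link."""
    hub = ET.fromstring(hub_xml)
    by_id = {e.get("id"): e for e in hub.iter() if e.get("id")}
    resolved = []
    for link in ET.fromstring(annotation_xml).iter("link"):
        target_id = link.get("target")
        if target_id not in by_id:
            raise KeyError(f"no element with id {target_id!r} in hub document")
        resolved.append((link.get("note"), "".join(by_id[target_id].itertext())))
    return resolved


print(resolve_links(HUB, ANNOTATION))
```

Because the links point only from the annotation to the hub, the hub document stays untouched no matter how many annotation documents refer to it.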

20
Corpus alignment II

application: multilingual corpora, dictionaries
!!! word alignment (compositionality vs. equivalence)
-> idiomatic expressions/collocations, e.g. Er biß ins Gras. vs. He kicked the bucket.
-> different usage of active and passive: Man sagte mir vs. I was told
-> compounds: Finanzamt vs. fiscal authorities

21
examples