TEI - PowerPoint PPT Presentation


Title: TEI


1
TEI CES
  • Mark-up for Scientific Purposes

2
If asked for a sure recipe of chaos I would
propose a project in which several thousand
impassioned specialists in scores of disciplines
from a dozen or more countries would be given
five years to produce some 1300 pages of
guidelines for representing the information
models of their specialties in a
rigorous, machine-verifiable notation.
Charles F. Goldfarb
3
BUT
The vaunted information superhighway would
hardly be worth traveling if the landscape were
dominated by industrial parks, office buildings,
and shopping malls. Thanks to the Text Encoding
Initiative, there will be museums, libraries,
theaters, and universities as well.
4
TEI
  • 1987 Vassar Conference (Vassar College, N.Y.)
  • Text Encoding Initiative: an international
    initiative of scientists
  • also the name of the mark-up language
  • diversity of tag sets, owing to the different
    backgrounds of the scientists involved and their
    different scientific approaches
  • bottom-up development
  • conforms to SGML and XML; five versions of TEI
    exist

5
general architecture
  • a) TEI Declaration TEI.DCL
  • b) TEI DTDs (Document Type Definitions)
  • <!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [ <!ENTITY %
    TEI.prose 'INCLUDE'> <!ENTITY % TEI.textcrit
    'INCLUDE'> ]>
  • c) the annotated text

6
tag sets
  • a) core tag set: available for any TEI document
  • b) base tag set: just one set can be chosen
  • TEI.prose: prose
  • TEI.verse: verse
  • TEI.drama: drama
  • TEI.spoken: transcription of spoken language
  • TEI.dictionaries: dictionaries
  • TEI.terminology: terminological databases
  • TEI.mixed: texts that need tags from more than
    one category
  • TEI.general: similar to TEI.mixed
  • c) additional tag sets: several can be chosen
  • TEI.linking: tags to link documents
  • TEI.analysis: simple analytical tags
  • TEI.fs: tags for feature structures
  • TEI.certainty: tags to record who is responsible
    for particular tags and the probable accuracy of
    the mark-up
  • TEI.transcr: transcription of primary sources
  • TEI.textcrit: textual criticism
  • TEI.names.dates: names and dates
  • TEI.nets: graphs and networks
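
The selection rule on this slide (exactly one base tag set, any number of additional ones) can be sketched in Python. This is only an illustration: the function name build_doctype is hypothetical, and the output uses the standard SGML form in which each tag set is switched on by a parameter entity ("<!ENTITY % ... 'INCLUDE'>") inside the DOCTYPE internal subset.

```python
# Tag set names are taken from the slide; everything else is illustrative.
BASE_TAG_SETS = {
    "TEI.prose", "TEI.verse", "TEI.drama", "TEI.spoken",
    "TEI.dictionaries", "TEI.terminology", "TEI.mixed", "TEI.general",
}
ADDITIONAL_TAG_SETS = {
    "TEI.linking", "TEI.analysis", "TEI.fs", "TEI.certainty",
    "TEI.transcr", "TEI.textcrit", "TEI.names.dates", "TEI.nets",
}


def build_doctype(base, additional=()):
    """Return a DOCTYPE declaration for one base and any additional tag sets."""
    # Taking a single `base` argument enforces "just one base tag set".
    if base not in BASE_TAG_SETS:
        raise ValueError(f"unknown base tag set: {base}")
    unknown = [name for name in additional if name not in ADDITIONAL_TAG_SETS]
    if unknown:
        raise ValueError(f"unknown additional tag sets: {unknown}")
    lines = ['<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [']
    for name in (base, *additional):
        lines.append(f"  <!ENTITY % {name} 'INCLUDE'>")
    lines.append("]>")
    return "\n".join(lines)


print(build_doctype("TEI.prose", ["TEI.textcrit"]))
```

Calling build_doctype("TEI.linking") raises a ValueError, because TEI.linking is an additional tag set and cannot serve as the base.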

7
Example: TEI.transcr
  • TEI.transcr: transcription of primary sources
  • <del>: deletion
  • <delSpan>: deletion that exceeds hierarchy
    limits
  • attributes
  • rend: declares how the deletion has been
    accomplished
  • values
  • subpunction: deletion by a line of points below
    the text to be deleted
  • overstrike: deletion by a line through the text
  • erasure: by erasure of the text
  • bracketed: by embedding the text in brackets
  • status: declares whether the deletion has
    mistakenly affected too much text on the left or
    right
  • resp: declares who is responsible for the
    declaration of deletion
  • hand: declares who has deleted the part of the
    text
  • cert: gives the degree of certainty about the
    origin of the deletion
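
As a rough illustration of these attributes, the following Python sketch reads a <del> element and reports its attribute values. The sample values for hand, resp and cert are invented, and treating the four rend values from the slide as a closed list is an assumption of this sketch, not something the slide states.

```python
import xml.etree.ElementTree as ET

# The four rend values named on the slide (assumed closed for this sketch).
REND_VALUES = {"subpunction", "overstrike", "erasure", "bracketed"}
DEL_ATTRIBUTES = ("rend", "status", "resp", "hand", "cert")


def read_deletion(xml_fragment):
    """Parse a <del> element and return its documented attributes as a dict."""
    element = ET.fromstring(xml_fragment)
    if element.tag != "del":
        raise ValueError(f"expected <del>, got <{element.tag}>")
    rend = element.get("rend")
    if rend is not None and rend not in REND_VALUES:
        raise ValueError(f"unexpected rend value: {rend}")
    # Attributes that are absent come back as None.
    return {name: element.get(name) for name in DEL_ATTRIBUTES}


# Hypothetical sample: the attribute values are placeholders.
sample = '<del rend="overstrike" hand="scribeA" resp="editor1" cert="high">two words</del>'
print(read_deletion(sample))
```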

8
  • 400 elements
  • problems with hierarchy
  • problems with ambiguity
  • bibliographical information included in
    TEI-annotated documents matches the standards of
    libraries
  • vast number of ways to apply the TEI standard

9
CES
  • Corpus Encoding Standard is part of the EAGLES
    guidelines by the Expert Advisory Group on
    Language Engineering Standards and is consistent
    with TEI.
  • defines a minimal encoding level for corpora
  • motivated by the increasingly empirical approach
    of linguists
  • data interchange between individuals or sites
  • CES is a standard for interchange, so that only
    one translation, between the local format and a
    single mark-up scheme, is needed
  • CES must be as expressive as the original text

10
Criteria of the CES
  • A) coverage
  • B) consistency
  • C) recoverability
  • D) validatability
  • E) capturability
  • F) processability
  • G) extensibility
  • H) compactness
  • I) readability

11
Global attributes
  • id: a unique identifier for an element
  • n: a number or other label for the element
  • lang: indicates that the tag's content is in the
    specified language
  • wsd: indicates that the tag's content is encoded
    in the specified character set
  • rend: provides information about rendition in an
    original printed version; values:
  • BO: bold face
  • BX: boxed
  • IT: italic font
  • RO: roman font
  • UL: underlined
  • CA: capital letters
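
The rend codes listed here amount to a small lookup table. A minimal Python sketch follows; the function name describe_rend is hypothetical, and the table contains only the six codes shown on the slide.

```python
# Rendition codes and their meanings, copied from the slide.
REND_CODES = {
    "BO": "bold face",
    "BX": "boxed",
    "IT": "italic font",
    "RO": "roman font",
    "UL": "underlined",
    "CA": "capital letters",
}


def describe_rend(code):
    """Return the meaning of a rend code, or raise KeyError if it is not listed."""
    return REND_CODES[code]


print(describe_rend("IT"))
```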

12
Header
  • each text (<cesDoc>) has a header, and so does
    the whole corpus; both use <cesHeader>,
    distinguished by the type attribute values
    CORPUS or TEXT
  • title statement, edition statement, extent
    statement, publication statement, source
    description, encoding description, profile
    description, revision description

13
Minimal Header
<cesHeader version="2.0">
  <fileDesc>
    <titleStmt>
      <h.title></h.title>
    </titleStmt>
    <publicationStmt>
      <distributor></distributor>
      <pubAddress></pubAddress>
      <availability></availability>
      <pubDate></pubDate>
    </publicationStmt>
    <sourceDesc>
      <biblStruct>
        <monogr>
          <h.title></h.title>
          <h.author></h.author>
          <imprint>
            <pubPlace></pubPlace>
            <publisher></publisher>
            <pubDate></pubDate>
          </imprint>
        </monogr>
      </biblStruct>
    </sourceDesc>
  </fileDesc>
</cesHeader>
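
The nesting of the minimal header can also be generated programmatically. The Python sketch below rebuilds the same empty skeleton with the standard library; the SKELETON structure and the build helper are illustrative names, and the result is not validated against any DTD.

```python
import xml.etree.ElementTree as ET

# (tag, children) pairs mirroring the element nesting shown on the slide.
SKELETON = ("cesHeader", [
    ("fileDesc", [
        ("titleStmt", [("h.title", [])]),
        ("publicationStmt", [
            ("distributor", []), ("pubAddress", []),
            ("availability", []), ("pubDate", []),
        ]),
        ("sourceDesc", [
            ("biblStruct", [
                ("monogr", [
                    ("h.title", []), ("h.author", []),
                    ("imprint", [
                        ("pubPlace", []), ("publisher", []), ("pubDate", []),
                    ]),
                ]),
            ]),
        ]),
    ]),
])


def build(spec, parent=None):
    """Recursively turn a (tag, children) spec into an element tree."""
    tag, children = spec
    element = ET.Element(tag) if parent is None else ET.SubElement(parent, tag)
    for child in children:
        build(child, element)
    return element


header = build(SKELETON)
header.set("version", "2.0")
print(ET.tostring(header, encoding="unicode"))
```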
14
Encoding Primary Data
  • definition of 3 encoding levels
  • level 1 is minimum standard
  • interdependence
  • validate against cesDoc DTD

15
level 1 conformance / metalanguage level
  • CES-conformant encoding to paragraph level,
    having at least the following structure
  • <cesDoc version="3.9">
  • <cesHeader version="2.0"> ... </cesHeader>
  • <text>
  • <body>
  • <div> optional, for sections, chapters
    <p>
  • <p>
  • <p> .
  • ...other paragraph-level elements
  • <sp> - for written-to-be-spoken material
  • <poem> - for poems embedded in a text
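
A rough structural check for this skeleton can be written in Python. It only tests the nesting listed above (header, text, body, at least one paragraph-level element) and is no substitute for validating against the cesDoc DTD; the function name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Paragraph-level elements named on the slide.
PARAGRAPH_LEVEL = {"p", "sp", "poem"}


def has_level1_skeleton(xml_text):
    """Return True if the document shows the minimal level 1 nesting."""
    root = ET.fromstring(xml_text)
    if root.tag != "cesDoc" or root.find("cesHeader") is None:
        return False
    body = root.find("text/body")
    if body is None:
        return False
    # Paragraph-level elements may sit directly in <body> or inside a <div>.
    return any(element.tag in PARAGRAPH_LEVEL for element in body.iter())


sample = """
<cesDoc version="3.9">
  <cesHeader version="2.0"/>
  <text><body><div><p>First paragraph.</p></div></body></text>
</cesDoc>
"""
print(has_level1_skeleton(sample))
```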

16
level 2 conformance / syntactic level
  • if a sub-paragraph element is marked, every
    occurrence of this element has to be marked
    throughout the text
  • replacement of all special characters by SGML
    entities
  • replacement of quotation marks by SGML entities
  • correct identification of all paragraph elements
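
The replacement of special characters and quotation marks by entities might look like the Python sketch below. It borrows the entity names from Python's html.entities table as a stand-in for the ISO entity sets used with SGML; that substitution is an assumption of this sketch, and the function name to_entities is hypothetical. Markup characters such as & and < are deliberately left untouched.

```python
from html.entities import codepoint2name


def to_entities(text):
    """Replace non-ASCII characters and double quotes with named entities."""
    pieces = []
    for char in text:
        codepoint = ord(char)
        if codepoint > 127 or char == '"':
            name = codepoint2name.get(codepoint)
            # Fall back to a numeric character reference when no name is known.
            pieces.append(f"&{name};" if name else f"&#{codepoint};")
        else:
            pieces.append(char)
    return "".join(pieces)


print(to_entities('Er sagte "critères".'))
```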

17
level 3 conformance / semantic level
  • <hi> tags are resolved to more precise tags
    (foreign, term, etc.)
  • the <hi> element marks typographically
    distinct words or phrases, especially when the
    purpose of the highlighting is not yet determined
  • <hi rend=BO> bold face
  • identification and mark-up of the following
    sub-paragraph elements
  • abbreviations
  • numbers
  • names
  • foreign words and phrases
  • validation of a 10% sample of the text
  • elements for identifying s-units and quoted
    dialogue: <s>He said <q>The weather will be
    fine.</q></s>
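
To show what the s-unit and quotation mark-up makes possible, this Python sketch parses the example sentence and pulls out the quoted dialogue. The helper names are hypothetical; the input is the example from the slide.

```python
import xml.etree.ElementTree as ET

# The s-unit example from the slide.
s_unit = ET.fromstring("<s>He said <q>The weather will be fine.</q></s>")


def quoted_spans(sentence):
    """Return the text of every <q> element inside the sentence."""
    return ["".join(q.itertext()) for q in sentence.iter("q")]


def plain_text(sentence):
    """Return the sentence text with all mark-up removed."""
    return "".join(sentence.itertext())


print(quoted_spans(s_unit))
print(plain_text(s_unit))
```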

18
refined forms of linguistic annotation
  • <tok>, <orth>, <disamb>, <lex>, <base>, <msd>,
    <ctag>
  • <tok class='tok' from='1.2.1\5'>
  • <orth>critères</orth>
  • <disamb>
  • <ctag>NCMP</ctag>
  • </disamb>
  • <lex>
  • <base>critère</base>
  • <msd>Ncmp-</msd>
  • <ctag>NCMP</ctag>
  • </lex>
  • </tok>
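
The token example can be read into a simple data structure. The Python sketch below assumes, from the element names, that <disamb> carries the tag chosen for the token and that each <lex> is one lexical analysis; the function name read_token and the dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# The token example from the slide (raw string keeps the backslash as-is).
TOKEN = r"""
<tok class='tok' from='1.2.1\5'>
  <orth>critères</orth>
  <disamb><ctag>NCMP</ctag></disamb>
  <lex>
    <base>critère</base>
    <msd>Ncmp-</msd>
    <ctag>NCMP</ctag>
  </lex>
</tok>
"""


def read_token(xml_text):
    """Collect the word form, the chosen tag and all lexical analyses."""
    tok = ET.fromstring(xml_text)
    return {
        "orth": tok.findtext("orth"),
        "chosen_ctag": tok.findtext("disamb/ctag"),
        "analyses": [
            {
                "base": lex.findtext("base"),
                "msd": lex.findtext("msd"),
                "ctag": lex.findtext("ctag"),
            }
            for lex in tok.findall("lex")
        ],
    }


print(read_token(TOKEN))
```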

19
Corpus alignment I
  • rules specified in the CesAlign DTD
  • separation of primary data and annotation
  • avoids unwieldy documents
  • architecture
  • hub-document
  • annotation document
  • linking
  • one-way links to hub-document
  • linking by IDs or locators
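
The hub-and-annotation architecture can be imitated in a few lines of Python: the hub document holds the primary data with IDs, and a separate annotation list points at those IDs with one-way links. Everything here (the sample sentences, the annotation format, the resolve function) is invented for illustration and does not follow the cesAlign DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical hub document: primary data only, each unit carrying an id.
HUB = ET.fromstring(
    '<body><s id="s1">First sentence.</s><s id="s2">Second sentence.</s></body>'
)

# Hypothetical annotation document, kept separate from the primary data.
ANNOTATIONS = [
    {"target": "s1", "note": "checked"},
    {"target": "s2", "note": "needs review"},
]


def resolve(hub, annotations):
    """Follow each one-way link from an annotation to its hub element."""
    by_id = {el.get("id"): el for el in hub.iter() if el.get("id")}
    resolved = []
    for annotation in annotations:
        target = by_id.get(annotation["target"])
        if target is None:
            raise KeyError(f"dangling link: {annotation['target']}")
        resolved.append((target.text, annotation["note"]))
    return resolved


print(resolve(HUB, ANNOTATIONS))
```

Because the links run only from the annotation to the hub, the hub document never has to change when annotations are added or replaced.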

20
Corpus alignment II
  • applications: multilingual corpora, dictionaries
  • caution: word alignment (compositionality vs.
    equivalence)
  • -> idiomatic expressions/collocations, e.g. Er biß
    ins Gras. vs. He kicked the bucket.
  • -> different usage of active and passive: Man
    sagte mir vs. I was told
  • -> compounds: Finanzamt vs. fiscal authorities

21
examples
  • The New York Times
  • article: "French voters soundly reject European
    Union Constitution"
  • Universität Osnabrück - Cognitive Sciences -
    korpusbasierte Kollokationssuche (corpus-based
    collocation search)

22
Sources
  • CES - Vassar College, NY
  • POS Tagging and Lemmatization
  • TEI - University of Munich
  • Universität Osnabrück - Cognitive Sciences -
    korpusbasierte Kollokationssuche