Title: TEXT ENCODING INITIATIVE (TEI)
1 TEXT ENCODING INITIATIVE (TEI)
- Inf 384C
- Block II, Module C
2 TEI History
- The developing organizations first met in 1987
  - Association for Computers and the Humanities (ACH)
  - Association for Computational Linguistics (ACL)
  - Association for Literary and Linguistic Computing (ALLC)
- 1990: first version, TEI P1
- 1992: TEI P2
- 1994: TEI P3
3 TEI History Continued
- Principles for the development of TEI
  - Standard format for data interchange in humanities research
  - Guidelines for encoding texts in the same format
  - Define a recommended syntax
  - Define a metalanguage for the description of text-encoding schemes
- Future Developments
  - Linguistic description and grammatical annotation
  - Historical analysis and interpretation
  - Base tag sets for further document types
  - Manuscript analysis and physical description of text
4 General Introduction to SGML and XML
5 The Evolution of SGML and XML
- 1960s: Generalized Markup Language (GML) developed by IBM
- 1970s-1980s: ANSI initiates a project to develop a standard text-description language based on GML
- 1983: SGML became an industry standard
- 1986: ISO ratified a standard for SGML
- 1990s: Tim Berners-Lee developed HTML, a simple formatting markup language for the World Wide Web
- Mid-1990s: XML was developed by the W3C to combine the flexibility of SGML and the simplicity of HTML
6 Benefits of SGML and XML
- SGML is a toolkit for developing specialized markup languages
- Specifies the structure of information
- Enables interoperability between multiple platforms
- Acts like a database
- All-encompassing
- The DTD acts as a blueprint for document structure
- XML provides a manageable framework in which you can define your own elements
7 XML Syntax
- Information content must have start and end tags
- Case is significant
- Elements may not overlap
- Elements can nest one inside another
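
The syntax rules above can be checked mechanically. The following is a minimal sketch, assuming Python's standard-library parser as the checker; the sample element names and the helper `is_well_formed` are made up for illustration and are not part of the original slides.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples, one per rule on the slide.
samples = {
    "nested": "<article><title>TEI</title></article>",
    "missing_end_tag": "<article><title>TEI</article>",
    "case_mismatch": "<article><Title>TEI</title></article>",
    "overlapping": "<b><i>TEI</b></i>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(name, "->", "well-formed" if ok else "rejected")
```

Only the properly nested sample is accepted; the other three each break one of the rules listed on the slide.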
8 The XML Environment
- XML Editor
- XML Parser/Validator
- Display program
- DTD or schema to define elements
- Style sheet for display of elements
9 The XML Document
- Document prologue
  - XML declaration
  - Document type declaration
    - Points to root element
    - Points to external standards (DTDs, namespaces)
- Document itself
  - Bracketed by root element
  - Contains elements, attributes, entities
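
As an illustration of these parts, here is a small sketch of one complete document held in a Python string and read with the standard-library parser. The element names, the `lang` attribute, and the `tei` entity are invented for the example. Note that this parser only checks well-formedness; it does not validate against the DTD.

```python
import xml.etree.ElementTree as ET

document = (
    '<?xml version="1.0"?>\n'                      # XML declaration
    '<!DOCTYPE article [\n'                         # document type declaration, names the root
    '  <!ELEMENT article (title, para)>\n'
    '  <!ELEMENT title (#PCDATA)>\n'
    '  <!ELEMENT para (#PCDATA)>\n'
    '  <!ATTLIST article lang CDATA #IMPLIED>\n'
    '  <!ENTITY tei "Text Encoding Initiative">\n'  # entity declared in the prologue
    ']>\n'
    '<article lang="en">\n'                         # root element brackets the document
    '  <title>About the &tei;</title>\n'            # entity reference used in content
    '  <para>First met in 1987.</para>\n'
    '</article>\n'
)

root = ET.fromstring(document)
child_names = [child.tag for child in root]
title_text = root.find("title").text

print(root.tag, root.attrib, child_names)
print(title_text)
```

The parsed tree exposes the three kinds of content named on the slide: elements (`title`, `para`), an attribute (`lang`), and an entity that has been expanded inside the title.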
10 The Document Type Definition
11 The DTD: Document Type Definition
- A DTD defines a document's structure
  - i.e., it is a set of rules and declarations that specify what tags can be used and what these tags can contain
- A DTD validates documents
  - determines which documents conform to the language
  - reduces the possibility of errors
- A DTD provides a blueprint for documents
  - specifies how to handle elements
  - specifies which elements are allowed
12 The DTD: Document Type Definition
- The DTD has four main functions
  1. Declares a set of allowed elements (vocabulary)
  2. Defines a content model for each element (grammar)
  3. Declares a set of allowed attributes for each element
  4. Provides various mechanisms to make management of the model easier
- (Ray, Chapter 5, p. 148)
13 Basic Structure of a DTD: Element Declaration
- <!ELEMENT name (content-model)>
- Serves two functions
  - Adds a new element
  - States what can go inside the element
- Every element that appears in the document must be declared in the DTD
- Order of declarations is important (e.g., parameter entities must be declared before they are used)
14 <!ELEMENT name (content-model)>
- name: vocabulary
  - Denotes the NAME of the element as it appears in the markup tag (case-sensitive; lowercase)
  - e.g. title, graphic, article, thingie
- content-model: grammar
  - Formula that delineates what kind of content, how many, and in what order
  - Empty elements: EMPTY
  - No content restrictions (little value): ANY
  - Only character data, no elements: (#PCDATA)
  - Only elements: a formula of element names
  - Mixed content: a content model combining #PCDATA and element names
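
The kinds of content model listed above can be seen by feeding a few declarations to a parser. The sketch below assumes Python's `xml.parsers.expat` module and its `ElementDeclHandler` callback; the tiny DTD, the `KIND` label table, and the handler name are invented for illustration.

```python
from xml.parsers import expat

# Human-readable labels for expat's content-model type codes.
KIND = {
    expat.model.XML_CTYPE_EMPTY: "EMPTY",
    expat.model.XML_CTYPE_ANY: "ANY",
    expat.model.XML_CTYPE_MIXED: "mixed or #PCDATA",
    expat.model.XML_CTYPE_SEQ: "element sequence",
    expat.model.XML_CTYPE_CHOICE: "element choice",
}

# Hypothetical DTD in an internal subset, one declaration per kind of content.
DOC = """<!DOCTYPE article [
  <!ELEMENT article (title, para+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT para (#PCDATA | emph)*>
  <!ELEMENT emph (#PCDATA)>
  <!ELEMENT graphic EMPTY>
  <!ELEMENT thingie ANY>
]>
<article><title>t</title><para>p</para></article>"""

declared = {}

def on_element_decl(name, model):
    # model[0] is the content-model type code reported by expat.
    declared[name] = KIND.get(model[0], "other")

parser = expat.ParserCreate()
parser.ElementDeclHandler = on_element_decl
parser.Parse(DOC, True)

for name, kind in declared.items():
    print(name, "->", kind)
```

Each `<!ELEMENT ...>` declaration triggers one callback, which mirrors the two functions of the declaration: it adds a name to the vocabulary and records what may go inside it.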
15 Basic Structure of a DTD: Attribute Declaration
- <!ATTLIST name attname1 atttype1 attdesc1 attname2 atttype2 attdesc2>
- For each element that appears in the document, the attributes of the element must be declared
- All attributes are declared in one place, the attribute list
16 <!ATTLIST name attname1 atttype1 attdesc1>
- name: vocabulary
  - Name of the element to which the attributes belong
  - Same name as the element declared earlier
  - e.g. title, article, thingie
- Attribute declarations
  - attname1: gives the attribute name
  - atttype1: specifies the datatype of the attribute, or a list of values
    - e.g. CDATA, NMTOKEN, ID
  - attdesc1: describes behavior
    1. default value, e.g. "high"
    2. author-specified value
    - #REQUIRED, #FIXED, #IMPLIED
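
A short sketch of an attribute list and how a parser reports each declaration, assuming Python's `xml.parsers.expat` module and its `AttlistDeclHandler` callback. The element and attribute names are hypothetical; `priority` reuses the "high" default from the slide.

```python
from xml.parsers import expat

# Hypothetical attribute list covering a default value and the three keywords.
DOC = """<!DOCTYPE article [
  <!ELEMENT article (#PCDATA)>
  <!ATTLIST article
      id       ID          #REQUIRED
      priority (low|high)  "high"
      source   CDATA       #IMPLIED
      version  CDATA       #FIXED "1.0">
]>
<article id="a1">text</article>"""

attributes = {}

def on_attlist_decl(elname, attname, att_type, default, required):
    # One call per declared attribute: owning element, type, default, required flag.
    attributes[attname] = (elname, att_type, default, bool(required))

parser = expat.ParserCreate()
parser.AttlistDeclHandler = on_attlist_decl
parser.Parse(DOC, True)

for attname, info in attributes.items():
    print(attname, info)
```

All four attributes belong to the same element and are declared in one place, as the slide describes. A required attribute is reported with no default, while an attribute with a default value carries that value in the declaration.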
17 The DTD: Document Type Definition
- It is important to remember that every document type definition is an interpretation of a text. There is no single DTD which encompasses any kind of absolute truth about a text, although it may be convenient to privilege some DTDs above others for particular types of analysis.
- TEI Guidelines for Electronic Text Encoding and Interchange
- http://etext.virginia.edu/TEI.html
18 The TEI DTD
- Uses the basic structural elements of a general DTD
- Designed to simplify the task of choosing an appropriate set of tags for the text in hand
- Selects an appropriate combination of smaller tag sets, each containing some set of tags likely to be used together
  1. Core tag sets: standard components that are always included, with no encoder action
  2. Base tag sets: basic building blocks for text types; the encoder must select at least one
  3. Additional tag sets: extra tags compatible with all other tag sets; the encoder may add them to the base tags in any combination
- http://www.tei-c.org/P4X/DTD/
19 The TEI Header
20 Basic Elements of TEI
- Paragraphs: <p>
- Punctuation: <stop.abbr>, <stop.sent>
- Quotations: <q> or <quote>
- Lists: <list>, <item>, etc.
- Bibliographic citations: <bibl>
- THE HEADER! <teiHeader>
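
To show how several of these elements fit together, here is a small sketch of a text fragment read with Python's standard-library parser. The sample sentences are invented placeholders, the wrapping `<div>` is an assumption, and nothing here is validated against a TEI DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment using paragraphs, a quotation, a list, and a citation.
fragment = """<div>
  <p>The instructor noted that <q>case is significant</q> in XML.</p>
  <list>
    <item>File description</item>
    <item>Encoding description</item>
  </list>
  <p>See <bibl>Erik T. Ray, Learning XML</bibl>.</p>
</div>"""

root = ET.fromstring(fragment)
paragraph_count = len(root.findall("p"))
quotation = root.find("p/q").text
items = [item.text for item in root.iter("item")]
citation = root.find("p/bibl").text

print(paragraph_count, quotation, items, citation)
```

Because each feature of the text has its own tag, a program can pull out just the quotations, list items, or citations without guessing from punctuation or layout.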
21 The TEI Header
- Required of every TEI text, composed of four parts
- May be large and complex or very simple
- The header may differ for documents not based on written text, such as computer files or spoken text
- The header is not a library cataloging record, although the intent is similar
22 Four Parts
- File Description: <fileDesc>
- Encoding Description: <encodingDesc>
- Text Profile: <profileDesc>
- Revision Description: <revisionDesc>
23 File Description: <fileDesc>
- <titleStmt>
- <editionStmt>
- <extent>
- <publicationStmt>
- <seriesStmt>
- <notesStmt>
- <sourceDesc>
24 Encoding Description: <encodingDesc>
- <projectDesc>
- <samplingDecl>
- <editorialDecl>
- <tagsDecl>
- <refsDecl>
- <classDecl>
- <fsdDecl>
- <metDecl>
- <variantEncoding>
25 Profile Description: <profileDesc>
- <creation>
- <langUsage>
- <textClass>
26 Revision Description: <revisionDesc>
- <revisionDesc>
- <change>
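
The four-part structure of the header can be sketched as a skeleton and checked with a few lines of Python. All text content below is placeholder material, the helper `missing_parts` is hypothetical, and the check only looks for the four top-level parts; it is not a substitute for validating against the TEI DTD.

```python
import xml.etree.ElementTree as ET

FOUR_PARTS = ["fileDesc", "encodingDesc", "profileDesc", "revisionDesc"]

# Placeholder header skeleton; the titles, names, and dates are made up.
header = """<teiHeader>
  <fileDesc>
    <titleStmt><title>Placeholder title</title></titleStmt>
    <publicationStmt><p>Placeholder publication details</p></publicationStmt>
    <sourceDesc><p>Placeholder source description</p></sourceDesc>
  </fileDesc>
  <encodingDesc>
    <projectDesc><p>Placeholder project description</p></projectDesc>
  </encodingDesc>
  <profileDesc>
    <creation>Placeholder creation note</creation>
  </profileDesc>
  <revisionDesc>
    <change>
      <date>2000-01-01</date>
      <respStmt><name>Placeholder name</name></respStmt>
      <item>Placeholder change note</item>
    </change>
  </revisionDesc>
</teiHeader>"""

def missing_parts(header_xml: str) -> list:
    """Return the names of the four header parts that are absent."""
    root = ET.fromstring(header_xml)
    present = {child.tag for child in root}
    return [part for part in FOUR_PARTS if part not in present]

print(missing_parts(header))
print(missing_parts("<teiHeader><fileDesc/></teiHeader>"))
```

The first call reports nothing missing; the second lists the three parts absent from a header that contains only a file description.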
27 Examples and Application
28 Examples and Application
- Dumble Geological Survey
  - A geological survey of Texas from the late 19th century, comprising twelve volumes
  - Digitally imaged monographs processed with OCR software to produce text
  - Text marked up in XML using the TEI Lite specifications
- http://www.lib.utexas.edu/books/dumble/
29 Dumble DTD
- Element and attribute definitions
- Entity references
32 Dumble Header
- Four basic sections
  - File description
  - Encoding description
  - Profile description
  - Revision description
- Contains bibliographic information
- Contains information on the creation of the digital file
36 Why XML?
- Ability to record information about a document within the document
- Ability to separate structure from format
- Ability to wrap or embed information in layers of XML
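
As a rough illustration of separating structure from format, the sketch below renders one structured document in two different ways. The sample document and both rendering functions are invented for the example; in practice a style sheet would normally play this role.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical source document: structure only, no formatting instructions.
source = "<article><title>Dumble Survey</title><para>Twelve volumes.</para></article>"
root = ET.fromstring(source)

def as_plain_text(article):
    """Render the title in capitals, then each paragraph on its own line."""
    lines = [article.findtext("title").upper(), ""]
    lines += [p.text for p in article.findall("para")]
    return "\n".join(lines)

def as_html(article):
    """Render the same structure as a heading followed by paragraphs."""
    parts = ["<h1>%s</h1>" % escape(article.findtext("title"))]
    parts += ["<p>%s</p>" % escape(p.text) for p in article.findall("para")]
    return "".join(parts)

plain = as_plain_text(root)
html_page = as_html(root)
print(plain)
print(html_page)
```

The source is never edited to change its appearance; only the rendering function differs between the two outputs.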
37 XML Beyond TEI
- Open Archives Initiative (OAI)
- Semantic Web
- Open Archival Information System
- Digital Preservation
- Information Discovery
38 References
- A Sample TEI Markup: www.tei-c.org/Lite/U5-eg.html
- Appendix A.2 Elements in TEI Lite: www.tei-c.org/Lite/U5-taglist.html
- OAI: www.openarchives.org/
- OAIS: http://www.rlg.org/longterm/oais.html
- Learning XML, Erik T. Ray