Title: Tone Merete Bruvik Aksis, June 16, 2006
1 Tone Merete Bruvik, Aksis, June 16, 2006
Short introduction to XML-encoding
Corpora in Phonological Research, Amsterdam, Netherlands, 15-17 June 2006
2 Source and editions
(Source: The Botulph Breviary fragments, MPF at Bergen University Library)
3 The encoding
- ...
- <text>
- <body lang="lat">
- <pb n="1r"/>
- <cb n="A"/>
- <lb n="1"/><div type="hymnus"><p><supplied>sum</supplied>ens illud aue gabrelis ore
- <lb n="2"/><supplied>fun</supplied>da nos in pace mutans nomen
- <lb n="3"/><supplied>eu</supplied>e. <hi rend="blue">S</hi>olue uincla reis pro
- <lb n="4"/><supplied>fer</supplied> lumen cecis mala nostra pelle
- ...
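Once an encoding like this is well formed, it can be queried directly. The following is a minimal sketch in Python, not part of the original slides. It assumes the fragment is closed with the end tags that the slide elides: the closing `</p>`, `</div>`, `</body>` and `</text>` are added here only so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# The four manuscript lines from the slide. The closing tags at the end
# are an assumption, added only so that the fragment is a complete document.
FRAGMENT = """<text>
<body lang="lat">
<pb n="1r"/>
<cb n="A"/>
<lb n="1"/><div type="hymnus"><p><supplied>sum</supplied>ens illud aue gabrelis ore
<lb n="2"/><supplied>fun</supplied>da nos in pace mutans nomen
<lb n="3"/><supplied>eu</supplied>e. <hi rend="blue">S</hi>olue uincla reis pro
<lb n="4"/><supplied>fer</supplied> lumen cecis mala nostra pelle
</p></div>
</body>
</text>"""

root = ET.fromstring(FRAGMENT)

# Text marked as supplied by the editor.
supplied = [el.text for el in root.iter("supplied")]

# Line numbers recorded by the <lb/> elements.
line_numbers = [lb.get("n") for lb in root.iter("lb")]

# Running text with all markup removed and whitespace normalised.
plain_text = " ".join("".join(root.itertext()).split())

print(supplied)
print(line_numbers)
print(plain_text)
```

Because the markup records structure rather than layout, the same file answers several questions (what was supplied, where lines break, what the running text is) without any change to the encoding.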
4 What should a document format be like?
- Open and well documented.
- Application independent.
- Interchangeable.
- Readable for both computers and humans.
- Encoding structure, not layout.
- Make the encoding of the document explicit.
5 Text encoding languages
- Encoding grammars: SGML, MECS, XML
- Encoding semantics: HTML, HNML, MECS-WIT, TEI Guidelines
SGML (1986)
XML (1998)
6 Extensible Markup Language (XML)
- Design goals
  - XML shall be straightforwardly usable over the Internet.
  - XML shall support a wide variety of applications.
  - XML shall be compatible with SGML.
  - It shall be easy to write programs which process XML documents.
  - The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
  - XML documents should be human-legible and reasonably clear.
  - The XML design should be prepared quickly.
  - The design of XML shall be formal and concise.
  - XML documents shall be easy to create.
  - Terseness in XML markup is of minimal importance.
- (Source: Extensible Markup Language (XML) 1.0 (Third Edition), W3C Recommendation 04 February 2004.)
7 DTD and Schema
- A template for the structure of a text, two main types:
  - DTD - Document Type Definition
  - Schema
    - RELAX NG
    - W3C Schema
    - Schematron
8 Well formed and Valid
- A document has to be well formed to be called an XML document:
  - Has only one root element.
  - Has no open tags.
  - All tags nest.
  - ...
- A document which follows the rules of a schema is said to be valid.
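Well-formedness can be checked mechanically by any XML parser. The following is a minimal sketch in Python using only the standard library. The helper name `is_well_formed` and the sample strings are made up for illustration. Note that this checks well-formedness only: validating against a DTD or schema needs a separate validating tool, which is not shown here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Hypothetical samples, one per rule listed on the slide.
samples = {
    "one root, all tags nest": "<text><body><p>ok</p></body></text>",
    "two root elements": "<p>one</p><p>two</p>",
    "open tag": "<text><p>never closed</text>",
    "tags do not nest": "<p><hi>crossing</p></hi>",
}

results = {label: is_well_formed(doc) for label, doc in samples.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```

Only the first sample is accepted. The other three each break one of the rules above and are rejected before any schema comes into play.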
9 TEI - Text Encoding Initiative
- Design goals
  - Provide a standard format for data interchange
  - Provide guidance for encoding of texts in this format
  - Support the encoding of all kinds of features of all kinds of texts studied by researchers
  - Be application independent
- (Source: TEI - Guidelines for Electronic Text Encoding and Interchange, 2002)
10 TEI - design decisions
- The choice of SGML, XML, ISO 646, and Unicode
- The provision of a large predefined tag set
- A distinction between required, recommended, and optional encoding practices
- Encodings for different views of text
- Alternative encodings for the same text features
- Mechanisms for user-defined extensions to the scheme
- (Source: TEI - Guidelines for Electronic Text Encoding and Interchange, 2002)
11 TEI versions
- P1 (1992)
- P3 (1999)
- P4 (2003)
- P5 (2006, and still in development)
- Do not be afraid of new versions of the TEI: old texts will still be valid according to the old version.
- New TEI versions are a new set of spelling rules, which apply only to texts that refer to them.
12 TEI schemata
- There is no such thing as "the TEI schema" or "the TEI DTD".
- Each project has to make its own TEI customisation, or should use a TEI customisation made by someone else.
- TEI is made to be customised.
- Correctly customised TEI is still TEI.
- The tool ROMA helps you pick what you would like to include in your TEI schema, see http://tei.oucs.ox.ac.uk/Roma/
13 TEI on Speech and Corpora
- TEI P5 Chapter 11: Transcriptions of Speech
- TEI P5 Chapter 15: Simple Analytic Mechanisms
- TEI P5 Chapter 18: Transcription of Primary Sources
- TEI P5 Chapter 23: Language Corpora
14 Elements Unique to Spoken Texts in TEI
- <u> utterance.
- <pause> a pause.
- <vocal> vocalized semi-lexical.
- <kinesic> any communicative phenomenon, for example a gesture, frown, etc.
- <event> any phenomenon or occurrence, for example incidental noises or other events affecting communication.
- <writing> a passage of written text revealed to participants in the course of a spoken text.
- <shift> marks the point at which some paralinguistic feature of a series of utterances by any one speaker changes.
- (Source: TEI P5, chapter 11: Transcriptions of Speech)
15 Some general elements relevant to spoken texts
- <seg> (arbitrary segment) contains any arbitrary phrase-level unit of text.
- <w> (word) represents a grammatical (not necessarily orthographic) word.
- <s> (s-unit) contains a sentence-like division of a text.
- <c> (character)
- Elements for transcriptions: <abbr>, <add>, <app>, <corr>, <del>, <damage>, <expand>, <gap>, <hi>, <rdg>, <sic>, <supplied>, <unclear>.
16 Sample: Coding of schwa
- <div>
- <u who="A">Heter hun Ronja Langangen?</u>
- <u who="B">Nei hun heter Sonja Langang<c rend="vowel" ana="phon.schwa">e</c>n</u>
- </div>
- <div>
- <u who="A">Heter hun Nelly Dalen?</u>
- <u who="B">Nei hun heter Molly Dal<c rend="empty" ana="phon.schwa">e</c>n</u>
- </div>
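With the schwa tagged in this way, a corpus can be searched for every occurrence and its realisation. The following is a minimal sketch in Python, not part of the original slides. It assumes the two exchanges are wrapped in a single root element: the `<body>` wrapper is added here for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The two exchanges from the slide. The enclosing <body> is an assumption,
# added so that the sample forms one document with a single root.
SAMPLE = """<body>
<div>
<u who="A">Heter hun Ronja Langangen?</u>
<u who="B">Nei hun heter Sonja Langang<c rend="vowel" ana="phon.schwa">e</c>n</u>
</div>
<div>
<u who="A">Heter hun Nelly Dalen?</u>
<u who="B">Nei hun heter Molly Dal<c rend="empty" ana="phon.schwa">e</c>n</u>
</div>
</body>"""

root = ET.fromstring(SAMPLE)

# Count how each tagged schwa is realised, per speaker.
realisations = Counter()
for utterance in root.iter("u"):
    for char in utterance.iter("c"):
        if char.get("ana") == "phon.schwa":
            realisations[(utterance.get("who"), char.get("rend"))] += 1

print(dict(realisations))
```

Speaker B has one schwa realised as a vowel and one realised as empty. The orthographic text stays untouched, since the analysis lives in the attributes.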
17 Sample: ...more on schwa
- <u who="B">Nei hun heter Molly Dal<c rend="empty" ana="phon.schwa">e</c>n</u>
- <u who="B">Nei hun heter Molly Dal<del rend="empty" ana="phon.schwa">e</del>n</u>
- <u who="B">Nei hun heter Molly <choice><reg>Dalen</reg><orig ana="phon.schwa">Daln</orig></choice></u>
- <u who="B">Nei hun heter Molly Dal<choice><reg>e</reg><orig ana="phon.schwa"></orig></choice>n</u>
- <u who="B">Nei hun heter Molly <reg ana="phon.schwa">Dalen</reg></u>
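The alternatives differ in which element carries the analysis, but all use the same `ana` value, so one query finds the phenomenon whichever encoding a project chose. The following is a minimal sketch in Python. The helper name `schwa_carriers` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The five alternative encodings of the same utterance, as on the slide.
ALTERNATIVES = [
    '<u who="B">Nei hun heter Molly Dal<c rend="empty" ana="phon.schwa">e</c>n</u>',
    '<u who="B">Nei hun heter Molly Dal<del rend="empty" ana="phon.schwa">e</del>n</u>',
    '<u who="B">Nei hun heter Molly <choice><reg>Dalen</reg>'
    '<orig ana="phon.schwa">Daln</orig></choice></u>',
    '<u who="B">Nei hun heter Molly Dal<choice><reg>e</reg>'
    '<orig ana="phon.schwa"></orig></choice>n</u>',
    '<u who="B">Nei hun heter Molly <reg ana="phon.schwa">Dalen</reg></u>',
]


def schwa_carriers(utterance_xml: str) -> list:
    """Return the names of the elements that carry the schwa annotation."""
    root = ET.fromstring(utterance_xml)
    return [el.tag for el in root.iter() if el.get("ana") == "phon.schwa"]


carriers = [schwa_carriers(alternative) for alternative in ALTERNATIVES]
print(carriers)
```

The query is the same in all five cases. What changes is the claim each encoding makes about the text: a character, a deletion, an original reading, or a regularisation.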
18 Sample: Pronunciation of orthographic /rd/ sequences in East Norwegian
- Sverd [sværd]: sve<m rend="rd" ana="phon.rd">rd</m>
- Bord [bur]: bo<m rend="r" ana="phon.rd">rd</m>
- Verdi [væɖi]: ve<m rend="ɖ" ana="phon.rd">rd</m>i
- or
- <!ENTITY dtail "&#x0256;"> <!-- LATIN SMALL LETTER D WITH TAIL -->
- ...
- ve<m rend="&dtail;" ana="phon.rd">rd</m>i</p>
19 Sample: Prosodic features
- <!ENTITY lr "?"> <!-- low rise intonation -->
- <!ENTITY rf "!"> <!-- rise fall intonation -->
- ...
- <u who="person2">hvilket da&lr;</u>
- Might be encoded using entities.
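Entities declared in the document type declaration are expanded by the parser, so the transcriber writes a short mnemonic and the application sees the replacement text. The following is a minimal sketch in Python using the standard library parser, which expands entities declared in the internal subset. It assumes that the utterance on the slide ends in a reference to the `lr` entity. That reading, and the wrapping of the declarations in a `<!DOCTYPE u [...]>`, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Entity declarations as on the slide, placed in an internal DTD subset.
# The position of &lr; at the end of the utterance is an assumption.
DOCUMENT = """<!DOCTYPE u [
  <!ENTITY lr "?"> <!-- low rise intonation -->
  <!ENTITY rf "!"> <!-- rise fall intonation -->
]>
<u who="person2">hvilket da&lr;</u>"""

utterance = ET.fromstring(DOCUMENT)

# The parser has already replaced &lr; with its declared value.
print(utterance.get("who"), utterance.text)
```

Changing the symbol used for an intonation contour then means editing one declaration rather than every utterance.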
20 Problems with text encoding
- Texts are more complex than one might think.
- Text encoding can be both a very philosophical and a very prosaic task.
- Time consuming.
(Source: The Botulph Breviary fragments, MPF at Bergen University Library)
21 ... and more problems
- Overlap
- Discontinuous elements
- Alternative element orderings
- This is a problem to encode in XML.
- <u>This <hi rend="blue italic">is a <emph rend="bold">problem</hi> to encode</emph> in XML</u>.
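The overlap in this example can be stated precisely as two character spans that cross: each starts outside the other and ends inside or beyond it. The following is a minimal sketch in Python. The helper names `span` and `crosses` are made up for illustration, and the spans are derived from the markup in the example above.

```python
import xml.etree.ElementTree as ET

SENTENCE = "This is a problem to encode in XML."


def span(phrase: str) -> tuple:
    """Return the (start, end) character offsets of phrase in SENTENCE."""
    start = SENTENCE.index(phrase)
    return start, start + len(phrase)


def crosses(a: tuple, b: tuple) -> bool:
    """True if the spans overlap without one containing the other."""
    first, second = sorted([a, b])
    return first[0] < second[0] < first[1] < second[1]


highlight = span("is a problem")      # content of <hi> in the example
emphasis = span("problem to encode")  # content of <emph> in the example

print(highlight, emphasis, crosses(highlight, emphasis))

# Written as inline tags, the same two spans are rejected by an XML parser.
overlapping = (
    '<u>This <hi rend="blue italic">is a <emph rend="bold">problem</hi>'
    " to encode</emph> in XML</u>"
)
try:
    ET.fromstring(overlapping)
    parsed = True
except ET.ParseError:
    parsed = False
print(parsed)
```

The two spans are perfectly describable as offsets. It is only the requirement that all tags nest that makes them a problem in XML.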
22 Questions to be solved
- What should be encoded?
- Separate tiers?
- Should a standard encoding schema be used?
- Should the text be interchangeable?
- Is there a community that has to decide which schema is to be used?
23 Links and references
- TEI P5 - Guidelines for Electronic Text Encoding and Interchange, edited by C.M. Sperberg-McQueen and Lou Burnard, 2005 (http://www.tei-c.org/release/doc/tei-p5-doc/html/).
- ROMA, http://tei.oucs.ox.ac.uk/Roma/
- XML, http://www.w3.org/XML/
- Markup Language for Complex Documents (MLCD), http://teksttek.aksis.uib.no/projects/mlcd