Title: Representing dictionaries with the TEI
1 Representing dictionaries with the TEI
- Proposal for basic guidelines
- Laurent Romary - Max Planck Digital Library
- With the help of Susanne Alt - CNRS
2 Background
- The P5 edition of the TEI Guidelines
  - XML
  - ODD - Roma
  - Modules and classes
  - DTD, RELAX NG, W3C schemas
- The dictionary chapter
  - Very close to the P4 version
- Work to be done
  - Enhancing the coherence with the class system
  - Providing more examples
3 Proposal for today
- Browse through the main features of the dictionary chapter
  - Identify questionable issues
  - Select best practices
- Work with Roma and implement (part of) the best practices
  - Minimal schema that a dictionary project can start with
  - Bottom-up approach to customization
- Discuss conformance
4 Dictionaries as TEI documents
- Same general document structure as any other TEI document
  - <teiHeader>, <text>
    - Define a common strategy concerning source identification with general text sources
    - Specific documentation of previous editions
    - Intuition that <teiCorpus> is not to be retained here
  - <front>, <body>, <back>
- Divisions
  - Strong case for unnumbered <div>s
  - Can we recommend/implement a basic dictionary-oriented typology?
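A minimal skeleton may help make this structure concrete. It is only a sketch: the title, the placeholder prose, and the type values on <div> (preface, letter, abbreviations) are assumptions chosen for illustration, not a typology taken from the Guidelines.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample dictionary (hypothetical title)</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished working draft.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Transcribed from a print dictionary (placeholder description).</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <front>
      <!-- front matter of the print dictionary -->
      <div type="preface">
        <p>Preface of the print edition.</p>
      </div>
    </front>
    <body>
      <!-- unnumbered <div>s; the type value is a hypothetical typology -->
      <div type="letter" n="A">
        <head>A</head>
        <entry>
          <form><orth>Aak</orth></form>
        </entry>
      </div>
    </body>
    <back>
      <div type="abbreviations">
        <p>List of abbreviations.</p>
      </div>
    </back>
  </text>
</TEI>
```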
5 Issues
- See Wuerzburg.xml
- Providing precise guidelines for
  - <publicationStmt>
    - Clarify the role and possible content of <publisher>
  - <sourceDesc>
    - Base the guidelines on <biblStruct> (<biblItem>?) and <listBibl>
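As an illustration of the two header components under discussion, a file description could look like the sketch below. Every bibliographic value is a placeholder, and reading <publisher> as the body responsible for the electronic edition is only one possible interpretation of its role. <biblItem> is left out because its status is still an open question.

```xml
<fileDesc>
  <titleStmt>
    <title>Electronic edition of a print dictionary (placeholder)</title>
  </titleStmt>
  <publicationStmt>
    <!-- assumed reading: the body publishing the electronic edition -->
    <publisher>Institution publishing the electronic edition (placeholder)</publisher>
    <date>Year of electronic publication (placeholder)</date>
  </publicationStmt>
  <sourceDesc>
    <listBibl>
      <!-- one <biblStruct> per print edition used as a source -->
      <biblStruct>
        <monogr>
          <author>Author of the print dictionary (placeholder)</author>
          <title>Title of the print dictionary (placeholder)</title>
          <imprint>
            <pubPlace>Place of publication (placeholder)</pubPlace>
            <publisher>Publisher of the print edition (placeholder)</publisher>
            <date>Year of the print edition (placeholder)</date>
          </imprint>
        </monogr>
      </biblStruct>
    </listBibl>
  </sourceDesc>
</fileDesc>
```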
6 Describing dictionary entries
- A variety of possible objects
  - <entry>, <entryFree>, <superEntry>, <dictScrap>
  - <hom>, <re>
- First issue: dealing with the editorial workflow
  - Keep <dictScrap> for ongoing tagging activity
  - Depends on the degree of structure of the dictionary
- Stay consistent in the use of entry/entryFree/superEntry/hom
  - Strong feeling for limiting ourselves to <entry>
  - Point to the importance of <re>
    - Embedded entries
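To show what an embedded entry can look like, here is a small sketch of an <entry> containing a related entry in <re>. The related headword "chat sauvage" and the type value "compound" are illustrative assumptions, and the definition of the related entry is deliberately left as a placeholder.

```xml
<entry>
  <form type="lemma">
    <orth>chat</orth>
  </form>
  <sense>
    <def>Petit animal familier</def>
  </sense>
  <!-- embedded (related) entry: same internal structure as an entry -->
  <re type="compound">
    <form>
      <orth>chat sauvage</orth>
    </form>
    <sense>
      <def>(definition of the related entry goes here)</def>
    </sense>
  </re>
</entry>
```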
7 Finding the right granularity
- The core lexical unit: <entry>
  - Should be used coherently in a dictionary project to gather homogeneous lexical objects
- Possible combination with
  - <superEntry> to group sets of homographs
    - Should only be used to record such a feature when it exists in legacy data
    - Should be avoided for new editorial projects
  - <hom> to subdivide senses in groups of homonyms
8 Example
- Recording a series of homographs with <superEntry>
    <body>
      <entry/>
      <entry/>
      <superEntry>
        <entry type="hom" n="1"/>
        <entry type="hom" n="2"/>
      </superEntry>
    </body>
- Issues
  - Values of the n attribute according to the source
  - Values of type defined in att.entryLike
9 Example
- Recording a series of homographs with <hom>
    <entry>
      <hom n="1">
        <sense n="1"/><sense n="2"/>
      </hom>
      <hom n="2">
        <sense n="1"/><sense n="2"/><sense n="3"/>
      </hom>
    </entry>
- Issues
  - Weak boundary between polysemes and homonyms
  - Why not just have separate entries?
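For comparison, the alternative raised in the last question could be sketched as follows: each homograph becomes a sibling <entry>, with neither <superEntry> nor <hom>. The headword text is a placeholder; the type and n values are reused from the <superEntry> example and remain assumptions about how a project would number its homographs.

```xml
<body>
  <!-- first homograph, with its own senses -->
  <entry type="hom" n="1">
    <form><orth>headword (placeholder)</orth></form>
    <sense n="1"/>
    <sense n="2"/>
  </entry>
  <!-- second homograph: same spelling, independent entry -->
  <entry type="hom" n="2">
    <form><orth>headword (placeholder)</orth></form>
    <sense n="1"/>
    <sense n="2"/>
    <sense n="3"/>
  </entry>
</body>
```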
10 From word to senses
- Background
  - Semasiological vs. onomasiological views on lexical data
  - Two complementary data organisations
- Two sets of standards
  - In ISO: TMF (ISO 16642) vs. LMF
  - In the TEI: Terminology vs. Print dictionary chapters
11 The LMF Model
[Figure: class diagram relating Lexical DB, Global Info, Lexical Entry, Form and Sense, with 1..1 and 0..n cardinalities on the associations]
12 Consequences for dictionaries
- Strong <form> to <sense> orientation
  - <form> qualifies the entry, with the identification of the headword and its morphological variations
  - <sense> is subordinated to the choice made for <form>
- Role of grammatical information
  - Overall qualification of the entry
  - Qualification of morphological variants
- Issue
  - <re> does not necessarily fit into the theory
13 Example
- Basic structure of an <entry>
    <entry>
      <form>
        <orth>chat</orth>
      </form>
      <sense>
        <def>Petit animal familier</def>
      </sense>
    </entry>
14 Representing form and grammar
- General issues
  - Multiple forms
    - <orth>, <pron>, etc.
  - Compounds
    - May be represented using embedded forms
- Role of grammar (<gramGrp>)
  - In isolation: qualifies the entry
  - Within a form: marks special features associated with the form
- Inflexions
  - Can be represented by means of additional <form>s
15 Example
- A simple entry
    <entry>
      <form>
        <orth>chat</orth>
        <pron>ʃa</pron>
      </form>
      <gramGrp>
        <pos>N</pos>
        <gen>m</gen>
      </gramGrp>
    </entry>
16 Example
- Simple entry with inflected form
    <entry>
      <form type="lemma">
        <orth>chat</orth>
      </form>
      <gramGrp>
        <pos>N</pos>
        <gen>m</gen>
      </gramGrp>
      <form type="inflected">
        <orth>chats</orth>
        <gramGrp>
          <number>p</number>
        </gramGrp>
      </form>
    </entry>
17 <form>: the case of the Campe dictionary
- Step 1: dealing with the presence of determiners
    <form type="lemma">
      <form type="determiner">
        <orth>Das</orth>
      </form>
      <form type="headword">
        <orth>Aak</orth>
      </form>
    </form>
18 <form>: the case of the Campe dictionary
- Step 2: adding grammatical information
    <form type="lemma">
      <form type="determiner">
        <orth>Das</orth>
        <gramGrp>
          <pos value="D"/>
          <gen>n</gen>
        </gramGrp>
      </form>
      <form type="headword">
        <orth>Aak</orth>
        <gramGrp>
          <pos>N</pos>
          <gen>n</gen>
        </gramGrp>
      </form>
    </form>
19 <form>: the case of the Campe dictionary
- Step 3: dealing with inflected forms
    <form type="inflected">
      <form type="determiner">
        <orth>des</orth>
        <gramGrp></gramGrp>
      </form>
      <form type="headword">
        <orth><oVar><oRef/>-es</oVar></orth>
        <gramGrp>
          <case value="G">G</case>
        </gramGrp>
      </form>
    </form>
20 Main arguments for the proposed changes
- Coherent use of <form> and <orth>
  - Accounts for a coherent access to orthographic information in form/orth
- Coherent use of grammatical features
  - Danger of tag abuse with <gram type="art_n">Das</gram>
    - The type attribute should indicate a grammatical feature
    - <gram> content should be the value of that feature
    - Non-differentiation of features (art_n -> pos + gen)
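The contrast can be sketched with three fragments. The first repeats the discouraged encoding; the second reuses the determiner form from step 2 of the Campe example; the third is an assumed equivalent with the generic <gram> element, where the type values "pos" and "gen" are chosen here for illustration.

```xml
<!-- Discouraged: two features (part of speech, gender) fused into one
     type value, and the content is a word form, not a feature value -->
<gramGrp>
  <gram type="art_n">Das</gram>
</gramGrp>

<!-- Proposed: the word form goes into form/orth,
     and each grammatical feature gets its own element -->
<form type="determiner">
  <orth>Das</orth>
  <gramGrp>
    <pos value="D"/>
    <gen>n</gen>
  </gramGrp>
</form>

<!-- With generic <gram>: type names the feature, content is its value -->
<gramGrp>
  <gram type="pos">D</gram>
  <gram type="gen">n</gram>
</gramGrp>
```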
21 <sense>: main components
- Core elements
  - <def> to provide the definition
  - <dictEg>
    - Need to establish guidelines on the identification of sources
  - <etym>: a complex issue
22 Documenting examples
    <dictEg><q>Ta gamine est assise trop <oRef/>, elle ne dépasse pas de la table.</q></dictEg>

    <dictEg>
      <cit>
        <q>Ta gamine est assise trop <oRef/>, elle ne dépasse pas de la table.</q>
        <bibl>Benoit M., Michel C., Le Parler de Metz...</bibl>
      </cit>
    </dictEg>

    <dictEg>
      <cit>
        <q>Ta gamine est assise trop <oRef/>, elle ne dépasse pas de la table.</q>
        <biblStruct>
          <monogr>
            <author>BENOIT M, MICHEL C.</author>
            <title>Le Parler de Metz et du pays messin</title>
            <imprint>
              <pubPlace>Metz</pubPlace>
              <publisher>Serpenoise</publisher>
              <date>2001</date>
              <biblScope>p. 38</biblScope>
            </imprint>
          </monogr>
        </biblStruct>
      </cit>
    </dictEg>
23 A quick glimpse into Roma
- A journey in three steps
  - Adding the PD (print dictionaries) module and generating a schema
  - Checking out elements
  - Expressing constraints on specific values
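The kind of ODD customization that Roma produces for these three steps might look like the sketch below. It is written by hand under several assumptions: the schema identifier tei_dict_minimal is hypothetical, the module key "dictionaries" and the locally defined type attribute on <form> follow the published P5 Guidelines and may differ in a draft release, the second step is interpreted as removing unused elements, and the closed value list simply reuses the type values from the earlier <form> examples.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            ident="tei_dict_minimal" start="TEI">
  <!-- Step 1: basic modules plus the dictionary module -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <moduleRef key="dictionaries"/>

  <!-- Step 2 (assumed reading): remove entry-like elements
       that the project decides not to use -->
  <elementSpec ident="superEntry" module="dictionaries" mode="delete"/>
  <elementSpec ident="entryFree" module="dictionaries" mode="delete"/>

  <!-- Step 3: restrict the values allowed for type on <form> -->
  <elementSpec ident="form" module="dictionaries" mode="change">
    <attList>
      <attDef ident="type" mode="change">
        <valList type="closed" mode="replace">
          <valItem ident="lemma"/>
          <valItem ident="inflected"/>
          <valItem ident="determiner"/>
          <valItem ident="headword"/>
        </valList>
      </attDef>
    </attList>
  </elementSpec>
</schemaSpec>
```

Starting from a small schema like this and adding elements as the project needs them is one way to follow a bottom-up approach to customization.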
24 Final discussion
- What does it mean to be TEI conformant?