Title: The Construction of Anglo-Norman Text Corpus
1The Construction of Anglo-Norman Text Corpus
- Joint project of the University of Wales, Swansea
and the University of Wales, Aberystwyth
- AHRC-funded
- Anglo-Norman Online Dictionary
- Anglo-Norman Text Corpus
- http://www.anglo-norman.net
2Goal of the Anglo-Norman Hub Text Digitisation
Project
- To provide mediaeval linguists and historians with
a set of digitised texts and articles that is
searchable and fully cross-referenced, both
internally and to and from the Anglo-Norman Online
Dictionary
3Main Challenges facing the Anglo-Norman Hub
Project
- Image-to-text migration for maximum throughput at
minimum cost
- Application of markup suitable for rendering and
full cross-referencing
- Handling of non-standard character sets
(mediaeval abbreviations)
4Image to Text Migration Strategies
- Optical Character Recognition
- Re-keying
- Both require subsequent proofreading
- Both allow insertion of appearance metadata as
provisional markup
5Advantages of Alternative Image to Text Migration
Strategies
- OCR
  - Rapid processing
  - Can be performed by students on-site, under
    supervision
- Re-keying
  - Less error-prone
  - Cheap if outsourced
  - Non-standard characters can be represented by
    combinations of standard characters
  - Image quality less critical
  - More consistent output quality
6Economic Image to Text Migration Conclusions
- Re-keying is more economic for the bulk of the
mediaeval-language material
- OCR is competitive for modern languages (critical
material)
- OCR can also be used for mediaeval-language
material when required by workflows, provided that
  - good image quality can be easily achieved
  - the material consists of standard characters
7Markup requirements must
- Conform to widely-accepted standards
- Be capable of encapsulating diverse document
structures
- Allow for automation
- Enable internal and external referencing
- Preserve as much appearance metadata as possible
- Not be tied to any one approach to rendering
8Document types requiring a variety of XML
Structures
- Texts
  - Verse
  - Prose
  - Lists and tables
- Critical material
  - Introductions (conform to prose structures)
  - Notes (do not conform to any of the above
    structures)
9Cross-referencing of Critical Matter
- Need to navigate from pointer to note
- Need to navigate cross-references from critical
material to specific points in the text or
elsewhere in critical material
- Achieved by use of target-id pairs (a target
attribute on the pointer matching an id attribute
on the destination)
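A minimal sketch of how target-id pairs could be checked automatically, assuming that pointers carry `target` (and optionally `targetEnd`) attributes holding one or more space-separated ids, as in the extracts from La Passiun de St. Edmund. The wrapping `<text>` element, the shortened note text, the function names and the deliberately dangling id `L9999` are illustrative assumptions, not part of the project's files.

```python
import xml.etree.ElementTree as ET

# Illustrative sample: the verse lines follow the deck's extract; the <text>
# wrapper, the shortened note and the id L9999 are invented for this sketch.
SAMPLE = """<text>
  <lg n="316">
    <l id="L1261">A Deu del cel ad graciéd</l>
    <l id="L1262">E al martir suvent a voéd</l>
    <l id="L1263">Que si bel l'at delivréd</l>
    <l id="L1264" n="1264">De ço qu'esteit ainz encumbrét.</l>
  </lg>
  <note id="N1261-4" target="L1261" targetEnd="L1264">See
    <ref target="L1261">1261</ref> and <ref target="L1262 L9999">others</ref>.</note>
</text>"""


def build_id_index(root):
    """Map every id attribute value to the element that carries it."""
    return {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def resolve_targets(root):
    """Return (resolved, dangling) lists of (source_tag, target_id) pairs."""
    index = build_id_index(root)
    resolved, dangling = [], []
    for el in root.iter():
        for attr in ("target", "targetEnd"):
            value = el.get(attr)
            if value is None:
                continue
            # A target attribute may hold several space-separated ids.
            for target_id in value.split():
                bucket = resolved if target_id in index else dangling
                bucket.append((el.tag, target_id))
    return resolved, dangling


root = ET.fromstring(SAMPLE)
resolved, dangling = resolve_targets(root)
print("resolved:", resolved)
print("dangling:", dangling)
```

A check of this kind could be run after each batch of markup is applied, so that pointers with no matching id are caught before rendering.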
10Markup Density and Automation
- Verse: medium density; can be automated
- Prose: variable density; can be automated if
footnote pointers are present
- Lists and tables: medium density; can be automated
- Critical material: high density; many
cross-references; limited scope for automation
11Extract from XML version of La Passiun de St.
Edmund
- <lg n="316"><l id="L1261">A Deu del cel ad graciéd</l>
- <l id="L1262">E al martir suvent a voéd</l>
- <l id="L1263">Que si bel l'at delivréd</l>
- <pb ed="folio" n="123a"/><l id="L1264" n="1264">De ço qu'esteit ainz encumbrét.</l></lg>
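The verse markup in this extract is regular enough to generate mechanically. The following is a rough sketch, assuming four-line stanzas (stanza 316 ends at line 1264 in the extract) and plain text input with one verse line per string. The function name and parameters are hypothetical, and page breaks and the extra `n` attribute seen on line 1264 are deliberately not generated.

```python
from xml.sax.saxutils import escape


def mark_up_verse(lines, first_line_no, first_stanza_no, stanza_len=4):
    """Wrap plain verse lines in <lg>/<l> elements with running ids.

    stanza_len=4 is an assumption drawn from the extract, not a project rule.
    """
    out = []
    for offset in range(0, len(lines), stanza_len):
        stanza_no = first_stanza_no + offset // stanza_len
        out.append(f'<lg n="{stanza_no}">')
        for i, text in enumerate(lines[offset:offset + stanza_len]):
            line_no = first_line_no + offset + i
            # escape() protects &, < and > in the verse text.
            out.append(f'<l id="L{line_no}">{escape(text)}</l>')
        out.append("</lg>")
    return "\n".join(out)


verse = [
    "A Deu del cel ad graciéd",
    "E al martir suvent a voéd",
    "Que si bel l'at delivréd",
    "De ço qu'esteit ainz encumbrét.",
]
marked_up = mark_up_verse(verse, first_line_no=1261, first_stanza_no=316)
print(marked_up)
```

Because every line receives a predictable id, notes in the critical material can point at `L1261` and similar ids without any manual tagging of the verse itself.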
12Extract from XML version of La Passiun de St.
Edmund
- <note id="N1261-4" target="L1261"
targetEnd="L1264">These lines present several
problems: (a) <q lang="AN" rend="b">A Deu. . .ad
graciéd</q> <ref target="L1261">1261</ref>. The
verb <term lang="AN" rend="i">gracier</term>,
occurring here with an indirect object, normally
takes a direct object and does so in its other
occurrences in the text <ref target="L826 L943
L1132">ll. 826, 943, 1132</ref>.
13Additional Markup for Critical Material
- <term>: Terms discussed may need to be linked to
the Anglo-Norman Dictionary
- <q>: Citations may need to be linked to their
sources within the text base
- <bibl>, <title> etc.: Bibliographical information
needs to be encoded to link citations with their
sources
- Much of the above can be extrapolated from the
appearance metadata embedded in the provisional
markup
- <hi>: to encode embedded appearance metadata
whose significance is not apparent
14La Passiun de St. Edmund: Rendered for a Web
Browser
- These lines present several problems: (a) A Deu.
. .ad graciéd 1261. The verb gracier, occurring
here with an indirect object, normally takes a
direct object and does so in its other
occurrences in the text ll. 826, 943, 1132.
T.-L. 4,502 cites one instance of gracier with
indirect object, but in the construction gracier
qc. a qn. If this construction were applied
here, ll. 1263-4 would have to be taken as the
direct object of gracier and also, presumably, of
voer 1262. The use here of gracier with indirect
object may have been influenced by the
construction rendre graces a qn. employed at ll.
995, 1046, 1512.
15Markup Density and Automation
- Verse: medium density; can be automated
- Prose: variable density; can be automated if
footnote pointers are present
- Lists and tables: medium density; can be automated
- Critical material: high density; many
cross-references; limited scope for automation
16Markup Requirements: Application
- 1,000 to 100,000 XML tags per document
- Automation essential for high throughput
- Digitisers can embed appearance metadata in
provisional markup
- Well-designed provisional markup schemes
facilitate automation
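As a hedged illustration of why a regular provisional scheme helps, the sketch below converts appearance codes into `<hi rend="...">` elements in a single pass. The `<sup>` code appears in the rekeyed examples in this deck; the `<i>` and `<b>` codes, the mapping table and the function name are assumptions made for the example only.

```python
import re

# Hypothetical provisional codes mapped to the value of rend on <hi>.
# Only <sup> is attested in the deck; <i> and <b> are assumed.
PROVISIONAL_TO_REND = {"i": "i", "b": "b", "sup": "sup"}

_TAG = re.compile(r"<(/?)(i|b|sup)>")


def provisional_to_hi(text):
    """Rewrite provisional appearance codes as <hi rend="..."> elements."""
    def swap(match):
        closing, code = match.group(1), match.group(2)
        if closing:
            return "</hi>"
        return f'<hi rend="{PROVISIONAL_TO_REND[code]}">'
    return _TAG.sub(swap, text)


print(provisional_to_hi("The verb <i>gracier</i> occurs here"))
print(provisional_to_hi("seign<sup>r</sup>"))
```

A later pass, manual or automated, could then promote those `<hi>` elements whose significance becomes clear to more specific elements such as `<term>` or `<q>`.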
17Facsimile of part of the Statute Roll
18The same passage in the 1800 printed edition
19Extract from the explanation published with the
Statutes, exemplifying the two forms resembling
9s.
20"rum"-abbreviation and flourishes
21Handling of Non-Unicode Characters 1)
Transcription
- Transcription is the one-to-one encapsulation of
character appearance metadata
- Transliteration is the expansion of abbreviated
characters into an intelligible sequence of
letters
- Transliteration requires transcription as a
starting point
- Transcription codes must resemble originals to
facilitate re-keying
22P-contractions
23Examples of the "per", "pro" and "pre"
contractions as represented by the agency
Signifies  Keyed as  Expanded example  Rekeyed example
per        p!!       ceperit           cep!!it
pro        p         propria           pp<sup>i</sup>a
pro        p         probum            pbu
per        p!!       persone           p!!sone
per        p!!       apertement        ap!!tement
pro        p         profit            pfit
per        p!!       permisit          p!!misit
pro        p         promisit          pmisit
pro        p         prochein          pchein
per        p!!       persona           p!!<sup>a</sup>
par        p!!       paratus           p!!atus
par        p!!       parceles          p!!celes
por        p!!       tempore           temp!!e
por        p!!       corporum          corp!!um
pre        p?        presentem         p?sentem
pre        p?        prelatz           p?laz!!
pre        p?        predictum         p?d!!cm
pre        p?        prendront         p?ndront
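The table shows that one keyed code can stand for more than one expansion. A small sketch, using only the "Signifies" and "Keyed as" columns as they appear in this text (the bare `p` for "pro" is copied as it stands), groups the expansions by code to make the ambiguous codes visible. The function and variable names are illustrative.

```python
from collections import defaultdict

# Distinct (signifies, keyed as) pairs from the table of P-contractions.
ROWS = [
    ("per", "p!!"),
    ("pro", "p"),
    ("par", "p!!"),
    ("por", "p!!"),
    ("pre", "p?"),
]


def expansions_by_code(rows):
    """Collect every expansion that a keyed code is used for."""
    by_code = defaultdict(set)
    for signifies, keyed in rows:
        by_code[keyed].add(signifies)
    return {code: sorted(values) for code, values in by_code.items()}


table = expansions_by_code(ROWS)
# A code is ambiguous when it maps to more than one expansion.
ambiguous = {code: vals for code, vals in table.items() if len(vals) > 1}
print(ambiguous)
```

In this sample the code `p!!` covers "par", "per" and "por", so the code alone does not determine the expansion and the surrounding word has to be taken into account.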
24Transcription: 1810 Edition and Rekeyed Version
25Transcription to Transliteration: Rekeyed Version
and XML File
<p><expan abbr="R-">Rex</expan> Collectorib<expan
abbr="z">us</expan> custume sue lana<expan
abbr="z">rum</expan> in Civitate Londo<expan
abbr="n-">nii</expan>, sa<expan
abbr="l-t">lute</expan>m. Cum nu<expan
abbr="p-">per</expan> <expan abbr="p-">per</expan>
nos &amp; consili<expan abbr="u-">um</expan>
n<expan abbr="r">ostru</expan>m ordinatum
fuisset, q<expan abbr="d-">uod</expan> lane,
coria, pelles lanute, plumbum &amp; stagmen
n<expan abbr="o-">on</expan> dimit<expan
abbr="t?">ter</expan>ent<expan abbr="rsup">ur</expan>
seu quomodolibet venderent<expan
abbr="rsup">ur</expan>, nisi <expan
abbr="p">pro</expan> bonis sterlingis seu aliis
<expan abbr="m?">mer</expan>candisis
legalib<expan abbr="z">us</expan>, <expan
abbr="p">pro</expan>ut in statuto inde edito
plenius continet<expan abbr="rsup">ur</expan>
26Handling of Non-Unicode Characters 2)
Transliteration
- Manual transliteration would take too long
- Blanket replacement is not possible because of
ambiguous abbreviations
- Semi-automated transliteration can be achieved
using a list of words for block-replacement,
derived from a concordance
- The appearance metadata from the transcription
should remain embedded
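The semi-automated approach can be sketched roughly as follows, assuming whitespace-separated tokens and a word list whose entries are taken from the tables of expansions in this deck. Wrapping the whole word in `<expan>` with the keyed form in `abbr` is a simplification for the example (the project's XML marks only the abbreviated part), and the function names and the token sequence are invented.

```python
from collections import Counter
from xml.sax.saxutils import escape, quoteattr

# Word-level expansions copied from the tables of expansions.
EXPANSIONS = {
    "p!!lement": "parlement",
    "t?re": "terre",
    "t?res": "terres",
    "aut?s": "autres",
}


def is_contracted(token):
    """Crude test: the token contains one of the keyed codes '!!' or '?'."""
    return "!!" in token or "?" in token


def concordance(tokens):
    """Count contracted word forms so the commonest can be listed first."""
    return Counter(t for t in tokens if is_contracted(t))


def transliterate(tokens, expansions):
    """Expand listed words, keep the keyed form in abbr, report the rest."""
    out, unresolved = [], []
    for token in tokens:
        if token in expansions:
            out.append(
                f"<expan abbr={quoteattr(token)}>{escape(expansions[token])}</expan>"
            )
        else:
            if is_contracted(token):
                unresolved.append(token)  # left for manual review
            out.append(escape(token))
    return " ".join(out), unresolved


tokens = ["p!!lement", "de", "t?re", "ten!!"]  # invented token sequence
text, unresolved = transliterate(tokens, EXPANSIONS)
print(concordance(tokens).most_common())
print(text)
print("for manual review:", unresolved)
```

Words missing from the list, such as `ten!!` here, pass through unchanged and are reported, which keeps ambiguous abbreviations out of the block replacement. The test for a contracted token is deliberately crude and would also catch a genuine question mark.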
27Extract from Concordance
28Table of expansions, example 1
Contracted word    Occurrences  Expansion
                   6264
q                  2989         q'
p<sup>r</sup>      803          p'r
seign<sup>r</sup>  325          seignour
aut?s              289          autres
man?e              250          manere
s<sup>r</sup>      224          sur
p!!lement          215          parlement
t?re               199          terre
t?res              196          terres
denglet?re         191          dengleterre
g<sup>a</sup>nt    181          grant
lo<sup>r</sup>     167          lour
p!!tie             152          partie
ap?s               142          apres
s?ront             139          serront
h<bar>o</bar>me    137          homme
29Table of expansions, example 2
Contracted word  Occurrences  Expansion
memorand!!       8            memorandum
mest?            8            mestre
p!!dre           8            perdre
p?dcm            8
p?mer            8            premer
p?scheins        8            proscheins
p?sentz          8            presentz
pasch!!          8            Pasche
t?minez          8            terminez
t?ra             8            terra
ten!!            8
tentz            8            tenementz
v?ge             8            verge
v?roie           8            verroie
v?tue            8            vertue
ppres            7            propres
q                7
30Transliteration: XML File and Rendered Output
<p><expan abbr="R-">Rex</expan> Collectorib<expan
abbr="z">us</expan> custume sue lana<expan
abbr="z">rum</expan> in Civitate Londo<expan
abbr="n-">nii</expan>, sa<expan
abbr="l-t">lute</expan>m. Cum nu<expan
abbr="p-">per</expan> <expan abbr="p-">per</expan>
nos &amp; consili<expan abbr="u-">um</expan>
n<expan abbr="r">ostru</expan>m ordinatum
fuisset, q<expan abbr="d-">uod</expan> lane,
coria, pelles lanute, plumbum &amp; stagmen
n<expan abbr="o-">on</expan> dimit<expan
abbr="t?">ter</expan>ent<expan abbr="rsup">ur</expan>
seu quomodolibet venderent<expan
abbr="rsup">ur</expan>, nisi <expan
abbr="p">pro</expan> bonis sterlingis seu aliis
<expan abbr="m?">mer</expan>candisis
legalib<expan abbr="z">us</expan>, <expan
abbr="p">pro</expan>ut in statuto inde edito
plenius continet<expan abbr="rsup">ur</expan>
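One possible way to derive rendered output from `<expan>` markup of this kind is sketched below, using the opening words of the passage with a closing `</p>` added so that the sample parses. Showing expansions in italics with the keyed abbreviation as a tooltip is an assumption made for this example and may differ from the project's actual rendering; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Opening words of the passage, closed with </p> so that the sample parses.
SAMPLE = (
    '<p><expan abbr="R-">Rex</expan> Collectorib<expan abbr="z">us</expan> '
    'custume sue lana<expan abbr="z">rum</expan></p>'
)


def expanded_text(elem):
    """Plain transliterated reading: all text, with the markup dropped."""
    return "".join(elem.itertext())


def render_html(elem):
    """Show each expansion in italics, keeping the keyed code as a tooltip."""
    parts = [escape(elem.text or "")]
    for child in elem:
        if child.tag == "expan":
            abbr = escape(child.get("abbr", ""))
            parts.append(f'<i title="{abbr}">{escape(child.text or "")}</i>')
        else:
            parts.append(escape("".join(child.itertext())))
        # The tail is the text that follows the child element.
        parts.append(escape(child.tail or ""))
    return "".join(parts)


root = ET.fromstring(SAMPLE)
print(expanded_text(root))
print(render_html(root))
```

Because the keyed abbreviation stays in the `abbr` attribute, the same file can yield either the transliterated reading or a view that exposes the transcription.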
31Main Challenges facing the Anglo-Norman Hub
Project
- Image-to-text migration for maximum throughput at
minimum cost
- Application of markup suitable for rendering and
full cross-referencing
- Handling of non-standard character sets
(mediaeval abbreviations)