The Construction of Anglo-Norman Text Corpus

Provided by: Rich76

1
The Construction of Anglo-Norman Text Corpus
  • Joint Project of the University of Wales, Swansea
    and the University of Wales, Aberystwyth.
  • AHRC-funded.
  • Anglo-Norman Online Dictionary
  • Anglo-Norman Text Corpus
  • http://www.anglo-norman.net

2
Goal of the Anglo-Norman Hub Text Digitisation
Project
  • To provide a set of digitised texts and articles
    to mediaeval linguists and historians which is
    searchable and fully cross-referenced within
    itself and to and from the Anglo-Norman Online
    Dictionary

3
Main Challenges facing the Anglo-Norman Hub
Project
  • Image to text migration for maximum throughput at
    minimum cost
  • Application of markup suitable for rendering and
    full cross-referencing
  • Handling of non-standard character sets
    (mediaeval abbreviations)

4
Image to Text Migration Strategies
  • Optical Character Recognition
  • Re-keying
  • Both require subsequent proofreading
  • Both allow insertion of appearance metadata as
    provisional markup

5
Advantages of Alternative Image to Text Migration
Strategies
  • OCR
      • Rapid processing
      • Can be performed by students on-site and can be
        supervised
  • Re-keying
      • Less error-prone
      • Cheap if outsourced
      • Non-standard characters can be represented by
        combinations
      • More consistent output quality
      • Image quality less critical

6
Economic Image to Text Migration Conclusions
  • Re-keying is more economic for the bulk of the
    mediaeval-language material
  • OCR is competitive for modern languages (critical
    material)
  • OCR can also be used for mediaeval-language
    material when required by workflows, provided that
      • good image quality can be easily achieved
      • the material consists of standard characters

7
Markup Requirements: Markup must
  • Conform to widely-accepted standards
  • Be capable of encapsulating diverse document
    structures
  • Allow for automation
  • Enable internal and external referencing
  • Preserve as much appearance metadata as possible
  • Not be tied to any one approach to rendering

8
Document types requiring a variety of XML
Structures
  • Texts
      • Verse
      • Prose
      • Lists and tables
  • Critical material
      • Introductions (conform to prose structures)
      • Notes (do not conform to any of the above
        structures)

9
Cross-referencing of Critical Matter
  • Need to navigate from pointer to note
  • Need to navigate cross-references from critical
    material to specific points in the text or
    elsewhere in critical material
  • Achieved by use of target-id pairs
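
The target-id mechanism can be illustrated with a minimal sketch. It assumes the attribute names `id` and `target` used in the project's extracts and a space-separated list of ids inside one `target`. The sample document, the function name `resolve_targets` and the dangling id `L9999` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two verse lines and one note that points at them.
SAMPLE = """<text>
  <l id="L1261">A Deu del cel ad graciéd</l>
  <l id="L1262">E al martir suvent a voéd</l>
  <note id="N1261" target="L1261">Note on this line, see also
    <ref target="L1262 L9999">l. 1262</ref>.</note>
</text>"""


def resolve_targets(root):
    """Pair every target value with the element carrying the matching id.

    Returns (resolved, dangling): resolved maps (pointing tag, target id)
    to the targeted element; dangling lists ids that match nothing.
    """
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    resolved, dangling = {}, []
    for el in root.iter():
        # A target attribute may hold several ids separated by spaces.
        for target in el.get("target", "").split():
            if target in by_id:
                resolved[(el.tag, target)] = by_id[target]
            else:
                dangling.append(target)
    return resolved, dangling


root = ET.fromstring(SAMPLE)
resolved, dangling = resolve_targets(root)
for (tag, target), el in resolved.items():
    print(f"<{tag}> -> {target}: {el.text}")
print("dangling targets:", dangling)
```

Reporting dangling targets separately is a design choice of this sketch: in densely cross-referenced critical material, a pointer without a matching id is the kind of error that proofreading is likely to miss.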

10
Markup Density and Automation
  • Verse: medium density; can be automated
  • Prose: variable density; can be automated if
    footnote pointers are present
  • Lists and tables: medium density; can be automated
  • Critical material: high density; many
    cross-references; limited scope for automation
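
As a rough illustration of why verse lends itself to automation, the sketch below wraps plain verse in `<lg>` and `<l>` elements. It assumes one verse per input line and a blank line between stanzas. The function name `mark_up_verse` and its parameters are hypothetical and are not taken from the project's own tooling.

```python
from xml.sax.saxutils import escape


def mark_up_verse(plain_text, first_line=1, first_stanza=1):
    """Wrap each verse in <l id="Ln"> and each stanza in <lg n="...">."""
    out = []
    line_no, stanza_no = first_line, first_stanza
    for stanza in plain_text.strip().split("\n\n"):
        verses = [v.strip() for v in stanza.splitlines() if v.strip()]
        if not verses:
            continue  # skip empty chunks produced by extra blank lines
        out.append(f'<lg n="{stanza_no}">')
        for verse in verses:
            # escape() protects &, < and > inside the verse text.
            out.append(f'<l id="L{line_no}">{escape(verse)}</l>')
            line_no += 1
        out.append("</lg>")
        stanza_no += 1
    return "\n".join(out)


STANZA = """A Deu del cel ad graciéd
E al martir suvent a voéd
Que si bel l'at delivréd
De ço qu'esteit ainz encumbrét."""

xml_text = mark_up_verse(STANZA, first_line=1261, first_stanza=316)
print(xml_text)
```

Page breaks and the extra `n` attribute on every fourth line are not handled here; they would need either provisional markup from the digitiser or a further pass.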

11
Extract from XML version of La Passiun de St.
Edmund
  • <lg n="316"><l id="L1261">A Deu del cel ad
    graciéd</l>
  • <l id="L1262">E al martir suvent a voéd</l>
  • <l id="L1263">Que si bel l'at delivréd</l>
  • <pb ed="folio" n="123a"/><l id="L1264"
    n="1264">De ço qu'esteit ainz encumbrét.</l></lg>

12
Extract from XML version of La Passiun de St.
Edmund
  • <note id="N1261-4" target="L1261"
    targetEnd="L1264">These lines present several
    problems: (a) <q lang="AN" rend="b">A Deu. . .ad
    graciéd</q> <ref target="L1261">1261</ref>. The
    verb <term lang="AN" rend="i">gracier</term>,
    occurring here with an indirect object, normally
    takes a direct object and does so in its other
    occurrences in the text <ref target="L826 L943
    L1132">ll. 826, 943, 1132</ref>.
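
A note that carries both `target` and `targetEnd` spans a range of lines. The sketch below expands such a pair into the ids it covers. It assumes that line ids consist of the letter `L` followed by consecutive integers, which holds for the extract but may not hold across the corpus. The function name `covered_lines` is hypothetical, and the note text is abridged.

```python
import re
import xml.etree.ElementTree as ET

# Abridged note from the extract; only the attributes matter here.
NOTE = ('<note id="N1261-4" target="L1261" targetEnd="L1264">'
        'These lines present several problems</note>')

LINE_ID = re.compile(r"L(\d+)")


def covered_lines(note):
    """Return every line id from the note's target up to its targetEnd."""
    start = note.get("target", "")
    end = note.get("targetEnd", start)  # a single-line note has no targetEnd
    m_start, m_end = LINE_ID.fullmatch(start), LINE_ID.fullmatch(end)
    if not (m_start and m_end):
        raise ValueError(f"unexpected line id: {start!r} / {end!r}")
    first, last = int(m_start.group(1)), int(m_end.group(1))
    return [f"L{n}" for n in range(first, last + 1)]


span = covered_lines(ET.fromstring(NOTE))
print(span)
```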

13
Additional Markup for Critical Material
  • <term>: Terms discussed may need to be linked to
    the Anglo-Norman Dictionary
  • <q>: Citations may need to be linked to their
    sources within the text base
  • <bibl>, <title> etc.: Bibliographical information
    needs to be encoded to link citations with their
    sources
  • Much of the above can be extrapolated from the
    appearance metadata embedded in the provisional
    markup
  • <hi>: to encode embedded appearance metadata
    whose significance is not apparent
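
One way of extrapolating structural markup from appearance metadata is sketched below. The slides do not describe the provisional scheme, so the example assumes, purely for illustration, that italic runs were keyed as `<hi rend="i">` and that a headword list is available. `PROVISIONAL`, `HEADWORDS` and `promote_terms` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical provisional markup: every italic run keyed as <hi rend="i">.
PROVISIONAL = ('<p>The verb <hi rend="i">gracier</hi> occurs '
               '<hi rend="i">passim</hi>.</p>')

# Hypothetical stand-in for a headword list taken from the dictionary.
HEADWORDS = {"gracier", "voer"}


def promote_terms(root, headwords):
    """Turn italic <hi> runs that match a headword into <term lang="AN">.

    Runs that match nothing stay as <hi>, so their appearance is kept
    even though its significance is not known. Returns the number promoted.
    """
    promoted = 0
    for el in list(root.iter("hi")):
        word = (el.text or "").strip().lower()
        if el.get("rend") == "i" and word in headwords:
            el.tag = "term"
            el.set("lang", "AN")
            promoted += 1
    return promoted


root = ET.fromstring(PROVISIONAL)
promoted = promote_terms(root, HEADWORDS)
print(promoted, ET.tostring(root, encoding="unicode"))
```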

14
La Passiun de St. Edmund: Rendered for a Web
Browser
  • These lines present several problems: (a) A Deu.
    . .ad graciéd 1261. The verb gracier, occurring
    here with an indirect object, normally takes a
    direct object and does so in its other
    occurrences in the text ll. 826, 943, 1132.
    T.-L. 4,502 cites one instance of gracier with
    indirect object, but in the construction gracier.
    qc. a qn. If this construction were applied
    here, ll. 1263-4 would have to be taken as the
    direct object of gracier and also, presumably, of
    voer 1262. The use here of gracier with indirect
    object may have been influenced by the
    construction rendre graces a qn. employed at ll.
    995, 1046, 1512.

15
Markup Density and Automation
  • Verse: medium density; can be automated
  • Prose: variable density; can be automated if
    footnote pointers are present
  • Lists and tables: medium density; can be automated
  • Critical material: high density; many
    cross-references; limited scope for automation

16
Markup Requirements: Application
  • 1,000 to 100,000 XML tags per document
  • Automation essential for high throughput
  • Digitisers can embed appearance metadata in
    provisional markup
  • Well-designed provisional markup schemes
    facilitate automation

17
Facsimile of part of the Statute Roll

18
The same passage in the 1800 printed edition

19
Extract from the explanation published with the
Statutes, exemplifying the two forms resembling
9s.

20
"rum"-abbreviation and flourishes

21
Handling of Non-Unicode Characters: 1)
Transcription
  • Transcription is the one-to-one encapsulation of
    character appearance metadata
  • Transliteration is the expansion of abbreviated
    characters into an intelligible sequence of
    letters
  • Transliteration requires transcription as a
    starting point
  • Transcription codes must resemble originals to
    facilitate re-keying

22
P-contractions

23
Examples of the "per", "pro" and "pre"
contractions as represented by the agency

Signifies  Keyed as  Expanded example  Rekeyed example
per        p!!       ceperit           cep!!it
pro        p         propria           pp<sup>i</sup>a
pro        p         probum            pbu
per        p!!       persone           p!!sone
per        p!!       apertement        ap!!tement
pro        p         profit            pfit
per        p!!       permisit          p!!misit
pro        p         promisit          pmisit
pro        p         prochein          pchein
per        p!!       persona           p!!<sup>a</sup>
par        p!!       paratus           p!!atus
par        p!!       parceles          p!!celes
por        p!!       tempore           temp!!e
por        p!!       corporum          corp!!um
pre        p?        presentem         p?sentem
pre        p?        prelatz           p?laz!!
pre        p?        predictum         p?d!!cm
pre        p?        prendront         p?ndront
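
The table shows that the code `p!!` stands for "per", "par" or "por" depending on the word. A minimal sketch of that ambiguity follows. It covers only the `p!!` and `p?` codes; "pro" is left out because it appears in the table as a bare `p`, which cannot be told apart from an ordinary letter in this simplified model. `P_CODES` and `candidate_expansions` are hypothetical names.

```python
import itertools
import re

# Keyed code -> expansions attested in the table of p-contractions.
P_CODES = {"p!!": ["per", "par", "por"], "p?": ["pre"]}

# The capturing group makes re.split keep the codes in its result.
CODE = re.compile(r"(p!!|p\?)")


def candidate_expansions(word):
    """List every reading of a rekeyed word that the codes allow."""
    parts = CODE.split(word)
    options = [P_CODES.get(part, [part]) for part in parts]
    return ["".join(combo) for combo in itertools.product(*options)]


for rekeyed in ("p!!celes", "temp!!e", "p?sentem"):
    print(rekeyed, "->", candidate_expansions(rekeyed))
```

Only one candidate per word is a real form ("parceles", "tempore"), and choosing it needs lexical knowledge rather than a character-level rule.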

24
Transcription: 1810 Edition and Rekeyed Version

25
Transcription to Transliteration: Rekeyed Version
and XML File
<p><expan abbr="R-">Rex</expan> Collectorib<expan abbr="z">us</expan>
custume sue lana<expan abbr="z">rum</expan> in Civitate
Londo<expan abbr="n-">nii</expan>, sa<expan abbr="l-t">lute</expan>m.
Cum nu<expan abbr="p-">per</expan> <expan abbr="p-">per</expan> nos
&amp; consili<expan abbr="u-">um</expan> n<expan abbr="r">ostru</expan>m
ordinatum fuisset, q<expan abbr="d-">uod</expan> lane, coria, pelles
lanute, plumbum &amp; stagmen n<expan abbr="o-">on</expan>
dimit<expan abbr="t?">ter</expan>ent<expan abbr="rsup">ur</expan> seu
quomodolibet venderent<expan abbr="rsup">ur</expan>, nisi
<expan abbr="p">pro</expan> bonis sterlingis seu aliis
<expan abbr="m?">mer</expan>candisis legalib<expan abbr="z">us</expan>,
<expan abbr="p">pro</expan>ut in statuto inde edito plenius
continet<expan abbr="rsup">ur</expan>
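
Because each expansion keeps its original code in the `abbr` attribute, one file can yield both a reading text and a list of the abbreviations behind it. The sketch below uses the opening words of the extract; a closing `</p>` is added so the fragment parses, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Opening words of the extract, closed with </p> so that they parse.
FRAGMENT = (
    '<p><expan abbr="R-">Rex</expan> Collectorib<expan abbr="z">us</expan> '
    'custume sue lana<expan abbr="z">rum</expan> in Civitate '
    'Londo<expan abbr="n-">nii</expan></p>'
)


def reading_text(root):
    """Join all text nodes, giving the fully expanded reading text."""
    return "".join(root.itertext())


def abbreviation_pairs(root):
    """List (transcription code, expansion) for every <expan> element."""
    return [(el.get("abbr"), el.text) for el in root.iter("expan")]


root = ET.fromstring(FRAGMENT)
print(reading_text(root))
print(abbreviation_pairs(root))
```

The pairs show the code `z` expanding to "us" in one word and to "rum" in another, so the transcription code alone does not determine the expansion.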

26
Handling of Non-Unicode Characters: 2)
Transliteration
  • Manual transliteration would take too long
  • Blanket replacement is not possible because of
    ambiguous abbreviations
  • Semi-automated transliteration can be achieved
    using a list of words for block-replacement,
    derived from a concordance
  • The appearance metadata from the transcription
    should remain embedded
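
Block replacement from a word list might look like the sketch below. It is a simplification under stated assumptions: the whole contracted word is wrapped in `<expan>` (the project's files wrap only the abbreviated part), words are separated by whitespace, and any token containing `!!` or `?` counts as contracted. The sample phrase and the names `EXPANSIONS` and `transliterate` are hypothetical; the word pairs come from the expansion tables.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Contracted word -> expansion, as would be derived from the concordance.
EXPANSIONS = {"p!!lement": "parlement", "t?re": "terre",
              "aut?s": "autres", "ap?s": "apres"}
CODE_MARKS = ("!!", "?")  # assumption: these only occur in keyed codes


def transliterate(text, expansions):
    """Replace listed words by <expan> elements; collect the unlisted ones."""
    out, unresolved = [], []
    for token in re.split(r"(\s+)", text):  # whitespace runs are kept
        core = token.rstrip(".,;:")          # set trailing punctuation aside
        tail = token[len(core):]
        if core in expansions:
            # The keyed form stays embedded in the abbr attribute.
            out.append(f"<expan abbr={quoteattr(core)}>"
                       f"{escape(expansions[core])}</expan>{tail}")
        else:
            if any(mark in core for mark in CODE_MARKS):
                unresolved.append(core)      # left for manual expansion
            out.append(escape(token))
    return "".join(out), unresolved


marked_up, unresolved = transliterate("ap?s le p!!lement en la t?re, p?dcm",
                                      EXPANSIONS)
print(marked_up)
print("still contracted:", unresolved)
```

Words missing from the list are left untouched and reported, which keeps the process semi-automated: the list handles the frequent cases and an editor handles the rest.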

27
Extract from Concordance

28
Table of expansions, example 1

Contracted word    Occurrences  Expansion
                   6264
q                  2989         q'
p<sup>r</sup>      803          p'r
seign<sup>r</sup>  325          seignour
aut?s              289          autres
man?e              250          manere
s<sup>r</sup>      224          sur
p!!lement          215          parlement
t?re               199          terre
t?res              196          terres
denglet?re         191          dengleterre
g<sup>a</sup>nt    181          grant
lo<sup>r</sup>     167          lour
p!!tie             152          partie
ap?s               142          apres
s?ront             139          serront
h<bar>o</bar>me    137          homme

29
Table of expansions, example 2

Contracted word  Occurrences  Expansion
memorand!!       8            memorandum
mest?            8            mestre
p!!dre           8            perdre
p?dcm            8
p?mer            8            premer
p?scheins        8            proscheins
p?sentz          8            presentz
pasch!!          8            Pasche
t?minez          8            terminez
t?ra             8            terra
ten!!            8
tentz            8            tenementz
v?ge             8            verge
v?roie           8            verroie
v?tue            8            vertue
ppres            7            propres
q                7
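
Tables of this shape can be derived from a token list by counting contracted forms and pairing each with its expansion where one has been supplied. The sketch below assumes that a token is contracted if it contains `!!` or `?`. The sample tokens and the function name `expansion_table` are hypothetical.

```python
from collections import Counter


def expansion_table(tokens, expansions, marks=("!!", "?")):
    """Return (contracted word, occurrences, expansion) rows, most frequent first.

    The expansion is an empty string when nobody has supplied one yet,
    mirroring the blank cells in the tables above.
    """
    counts = Counter(t for t in tokens if any(m in t for m in marks))
    return [(word, n, expansions.get(word, ""))
            for word, n in counts.most_common()]


# Hypothetical token stream standing in for the rekeyed corpus.
tokens = "le p!!lement de t?re et p!!lement p?dcm t?re p!!lement".split()
rows = expansion_table(tokens, {"p!!lement": "parlement", "t?re": "terre"})
for word, n, expansion in rows:
    print(f"{word:<12}{n:<4}{expansion}")
```

Sorting by frequency means that supplying expansions for the first rows resolves the largest share of the corpus for the least editorial effort.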

30
Transliteration: XML File and Rendered Output
<p><expan abbr="R-">Rex</expan> Collectorib<expan abbr="z">us</expan>
custume sue lana<expan abbr="z">rum</expan> in Civitate
Londo<expan abbr="n-">nii</expan>, sa<expan abbr="l-t">lute</expan>m.
Cum nu<expan abbr="p-">per</expan> <expan abbr="p-">per</expan> nos
&amp; consili<expan abbr="u-">um</expan> n<expan abbr="r">ostru</expan>m
ordinatum fuisset, q<expan abbr="d-">uod</expan> lane, coria, pelles
lanute, plumbum &amp; stagmen n<expan abbr="o-">on</expan>
dimit<expan abbr="t?">ter</expan>ent<expan abbr="rsup">ur</expan> seu
quomodolibet venderent<expan abbr="rsup">ur</expan>, nisi
<expan abbr="p">pro</expan> bonis sterlingis seu aliis
<expan abbr="m?">mer</expan>candisis legalib<expan abbr="z">us</expan>,
<expan abbr="p">pro</expan>ut in statuto inde edito plenius
continet<expan abbr="rsup">ur</expan>

31
Main Challenges facing the Anglo-Norman Hub
Project
  • Image to text migration for maximum throughput at
    minimum cost
  • Application of markup suitable for rendering and
    full cross-referencing
  • Handling of non-standard character sets
    (mediaeval abbreviations)