Title: Retrospective Conversion of Old Catalogues
1Retrospective Conversion of Old Catalogues
A. Belaïd, LORIA-CNRS, Nancy, France
2The Framework
Electronic Record
3The EU Library Program
- Create, enhance and harmonize bibliographies in electronic format, and develop tools for the conversion of important collections
- Develop network connections between libraries allowing data access
- Define innovative services for libraries using new technology
4Three Projects IMAGE/OCR
BIBLIOTHECA (Spain, Italy, France)
- Study of a pivot format (SGML) for the representation of different cards, and of a system for indexing and classifying different documents
FACIT (Denmark, Greece, Italy)
- Search for OCR packages adapted to retro-conversion with large character sets, and for tools for fast and cheap mass conversion of catalogues
MORE (Belgium, France)
- Study of the role and use of dictionaries in the structure modeling and recognition of catalogues by OCR techniques
5MORE: three experiments
Notices
6Bibliography Description Normative Aspects
Notice (catalogues)
Reference (bases)
Bibliographic Descriptions
Areas: position, number of digits, content
ISBD
Physical
Coded Elements
Logical
MARC
LIBRARIES Cataloguing organisms
Semantic
Interpreted Elements
7Structure Rules
8Reference Structures
9Role of the Separators ISBD
Area: Title / Responsibility
Fields:
- Title proper
- Type of the document
- Parallel title
- Sub-title
- Responsibility mention
- Other mentions
Punctuation: /
Area: Edition
Fields:
- Edition mention
- Edition parallel mention
- First mention
Punctuation: /
...
10Example
Les interrogations du psychanalyste : clinique, théorie et technique / par Jean Bergeret. --- Paris : Presses Universitaires de France, 1987. --- 193 p. : couv. ill. ; 22 cm. --- (Le fait psychanalytique, ISSN 0986-3524). Bibliogr., 5 p. --- ISBN 2-13-039-780-8 (br.) : 98 F

Title proper : sub-title / author statement. --- Location : Editor name, Pub. date. --- Print mat. : Accomp. mater. ; Format. --- (Title pr. collec., ISSN). Note. --- ISBN : Price
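The role of the separators in this example can be illustrated with a minimal Python sketch. It assumes the record is written with ISBD-style punctuation (" : " before the sub-title, " / " before the author statement, "---" between areas); the function names `split_areas` and `split_title_area` are hypothetical and are not part of any system described in these slides.

```python
import re

# Example record from the slide, with ISBD-style punctuation assumed.
RECORD = (
    "Les interrogations du psychanalyste : clinique, théorie et technique "
    "/ par Jean Bergeret. --- Paris : Presses Universitaires de France, 1987. "
    "--- 193 p. : couv. ill. ; 22 cm. --- (Le fait psychanalytique, "
    "ISSN 0986-3524). Bibliogr., 5 p. --- ISBN 2-13-039-780-8 (br.) : 98 F"
)

def split_areas(record):
    """Cut a record into areas on the '---' area separator."""
    return [area.strip() for area in re.split(r"\s*---\s*", record) if area.strip()]

def split_title_area(area):
    """Cut the first area into title proper, sub-title and author statement."""
    title_part, _, responsibility = area.partition(" / ")
    title, _, sub_title = title_part.partition(" : ")
    return {
        "title_proper": title.strip(),
        "sub_title": sub_title.strip(),
        "responsibility": responsibility.strip().rstrip("."),
    }

areas = split_areas(RECORD)
title_fields = split_title_area(areas[0])
```

On a clean record the punctuation alone is enough to locate the fields. On OCR output the separators may be missing or misread, which is why the later slides combine them with lexicons and typographic indices.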
11Structure Modeling
Object: constructor, subordinate objects, qualifier
Constructor: sequence, aggregate, choice
Qualifier: required, optional, repetitive
Separator: space, punctuation
Attributes:
- Physical: position
- Logical: lexicon
- Typographical: typeface
Weights: attributes, sub-objects
Imp / Reco. Imp / Hyp. Ambig.
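One possible reading of this model is sketched below as a Python data structure. It assumes that "sequence, aggregate, choice" are constructors and "required, optional, repetitive" are qualifiers; the class names, field names and the sample title area are hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional

class Constructor(Enum):
    SEQUENCE = "sequence"
    AGGREGATE = "aggregate"
    CHOICE = "choice"

class Qualifier(Enum):
    REQUIRED = "required"
    OPTIONAL = "optional"
    REPETITIVE = "repetitive"

@dataclass
class ModelObject:
    name: str
    constructor: Optional[Constructor] = None   # None for a terminal field
    qualifier: Qualifier = Qualifier.REQUIRED
    separator: str = " "                         # space or punctuation before the object
    attributes: Dict[str, str] = field(default_factory=dict)  # physical / logical / typographical
    weight: float = 1.0
    children: List["ModelObject"] = field(default_factory=list)

    def required_fields(self):
        """Names of the terminal objects that every instance must contain."""
        if self.qualifier is not Qualifier.REQUIRED:
            return []
        if not self.children:
            return [self.name]
        if self.constructor is Constructor.CHOICE:
            return []  # no single alternative is mandatory
        return [name for child in self.children for name in child.required_fields()]

# Hypothetical model of a title area.
title_area = ModelObject("title_area", Constructor.SEQUENCE, children=[
    ModelObject("title_proper", attributes={"typographical": "bold"}),
    ModelObject("sub_title", qualifier=Qualifier.OPTIONAL, separator=":"),
    ModelObject("responsibility", separator="/", attributes={"logical": "author lexicon"}),
])
```

Keeping separators, attributes and weights on each object lets the recognizer score a hypothesis locally before looking at the subordinate objects.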
12Recognition Schema
Learning
Recognition
Acquisition
Structure Model
Structure Recognition
Lexicons
Format Restitution
Target Format
Control
Electronic Record
13The French Library without OCR
Structural Analysis
Compilation
Anchor Points
Indices
Hypotheses Management
Image
Model
Bottom-Up
Top-Down
14Indices Extraction
Correlation with
Values: 4.7, 16.1, 76.7, 43.3, 37.5, 91.0, 31.5, 55.5, 37.5, 61
15Indices Extraction Results
16Syntactical Analysis: the approach
- Anchor points extraction (o)
- Bottom-up: choice of a rule A whose right-hand side contains an anchor point o
- Top-down: verification of the left context and of the right context of o
- Add A to the anchor points
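The four steps can be sketched as a small Python loop. This is a simplified illustration under stated assumptions: symbols are plain strings, a rule is a pair (A, right-hand side), context verification is an exact match of the neighbouring symbols, and every rule has at least two symbols on its right-hand side so that the loop terminates. The names `find_reduction`, `parse_from_anchors` and the sample grammar are hypothetical.

```python
def find_reduction(seq, trusted, rules):
    """Return (start, length, A) for the first rule that fits around an anchor."""
    for i, symbol in enumerate(seq):
        if symbol not in trusted:
            continue                      # only anchor points may trigger a rule
        for lhs, rhs in rules:            # bottom-up: rules containing the anchor
            for k, rhs_symbol in enumerate(rhs):
                if rhs_symbol != symbol:
                    continue
                start = i - k
                # top-down: check the left and right contexts of the anchor
                if start >= 0 and tuple(seq[start:start + len(rhs)]) == tuple(rhs):
                    return start, len(rhs), lhs
    return None

def parse_from_anchors(symbols, anchors, rules):
    seq, trusted = list(symbols), set(anchors)
    while True:
        found = find_reduction(seq, trusted, rules)
        if found is None:
            return seq
        start, length, lhs = found
        seq[start:start + length] = [lhs]  # replace the matched span by A
        trusted.add(lhs)                   # A becomes a new anchor point

RULES = [
    ("title_area", ("title", "/", "author")),
    ("address_area", ("place", "date")),
    ("reference", ("title_area", "---", "address_area")),
]
SYMBOLS = ["title", "/", "author", "---", "place", "date"]

full = parse_from_anchors(SYMBOLS, {"/", "---", "date"}, RULES)
partial = parse_from_anchors(SYMBOLS, {"/", "---"}, RULES)
```

With "date" among the anchors the whole sequence reduces to a single reference. Without it the address area has no anchor to start from and stays unreduced, which shows how much the method depends on the quality of the extracted indices.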
17Propagation
18Results
Group Vedette
Area Title
Principal Title
Crossing Title
End of the title
Cros. Formulae
Area Address / Date
Crossing Title
Address
Date
Area Collection
200 references 75
Group Cote
19The Belgian Library Albert I
- Large number of abbreviated words
- Numerical information
- Large quantity of names
- Mixture of languages
- Accented characters
- Punctuation marks
- Similar characters
20Analysis Schema
Filtering
ANALYSIS
OCR Flow
Hypotheses
Model
Specific Structure (Hypotheses)
Pre-conditions
Specific
Instances
a priori
a posteriori
Local Strategies
Post Analysis Actions
Hypotheses Evaluation
21OCR Flow SGML
<LIG X ... Y ...> <B>Helvetius R100</B> <LEX LGNL> <B>(Claude R100</B> Adrier R75).</B> <LEX LGFR,GNL>De<LEX LGGB,GFR,GNL> lesprit. R100 Présent-</LIG>
<LIG X ... Y ...>tation R98<LEX IGFR,GNL> de R100 François R100 <LEX LGFR>Châtelet. R99 <I>(Verviers, R100<LEX LGGB>Editions R100</I></LIG>...

Tags: LIG = line, B = bold, LEX = lexicon, I = italic
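A minimal Python sketch of how such a flow might be read is given below. It assumes that tags are delimited by angle brackets and that each recognized word is followed by a score of the form R followed by digits; the function names and the threshold of 90 are hypothetical choices for illustration.

```python
import re

TAG = re.compile(r"<[^<>]*>")                 # any tag such as <B>, </LIG>, <LEX ...>
SCORED_WORD = re.compile(r"(\S+)\s+R(\d+)")   # a word followed by its score, e.g. "Adrier R75"

def scored_words(flow):
    """Return (word, score) pairs; words without a score are skipped."""
    text = TAG.sub(" ", flow)
    return [(word, int(score)) for word, score in SCORED_WORD.findall(text)]

def doubtful_words(flow, threshold=90):
    """Words whose recognition score falls below the threshold."""
    return [word for word, score in scored_words(flow) if score < threshold]

sample = "<B>Helvetius R100</B> <LEX LGNL> <B>(Claude R100</B> Adrier R75).</B>"
pairs = scored_words(sample)
doubts = doubtful_words(sample)
```

Low-score words are natural candidates for dictionary lookup or manual checking before the structure analysis relies on them.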
22The Analysis: opportunistic mode
Reference
CHOICE A B C
Agenda
- Frontiers of the father
- Inheritance of the father score
- Put subordinate objects in the Agenda
Sort
SEQ A B C
a priori score (Attributes)
a posteriori score
- Find search area: initials (I), finals (F)
- Construct potential zones: combinations of I and F
- Put combinations in the agenda
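The agenda mechanism can be sketched in Python with a priority queue. This is an illustration under assumptions that the slide does not state: the father's score is inherited by multiplication with the child's a priori score, frontiers are a simple (start, end) span copied from the father, and the model, scores and names below are hypothetical.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Hypothesis:
    neg_score: float                      # negated so that heapq pops the best score first
    name: str = field(compare=False)
    start: int = field(compare=False)
    end: int = field(compare=False)

def run_agenda(model, a_priori, root, length):
    """Process hypotheses in decreasing score order and return the visiting order."""
    agenda = [Hypothesis(-a_priori[root], root, 0, length)]
    visited = []
    while agenda:
        hyp = heapq.heappop(agenda)       # the agenda is kept sorted by score
        visited.append(hyp.name)
        for child in model.get(hyp.name, []):
            score = -hyp.neg_score * a_priori[child]   # inheritance of the father score
            # the child inherits the frontiers of the father
            heapq.heappush(agenda, Hypothesis(-score, child, hyp.start, hyp.end))
    return visited

MODEL = {
    "reference": ["title_area", "edition_area"],
    "title_area": ["title_proper", "responsibility"],
}
A_PRIORI = {
    "reference": 1.0, "title_area": 0.9, "edition_area": 0.4,
    "title_proper": 0.8, "responsibility": 0.7,
}
order = run_agenda(MODEL, A_PRIORI, "reference", length=120)
```

The most promising objects are analysed first, so the well-scored title area is fully explored before the weaker edition area is touched. An a posteriori score computed after analysis could be used to re-sort the agenda in the same way.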
23The prototype
Catalogues
Dictionaries: general, specific
Automatic Acquisition
Manual Acquisition
OCR Flow (SGML)
Manual Structural Acquisition
Dictionaries
Structure Recognition
Manual Structure
Structure Model
Structure Specif. (UNIMARC)
Error File
Dictionaries
Error Correction (with Library)
Verification Final Formatting
Structure Model
MARC
24Structure Results
References
75.5 Recognized with ambiguities to resolve manually
3 Recognized but with risk to be re-examined
8 Recognized with structure error
1 Unrecognized unknown cause
3 Unrecognized model
2
25Manual Correction
Column headings: OCR Corrections; Nb of doubts examined; Structure Correct.; Country Correct.; Doubt Index Authors; Doubt Subj. Index French; Doubt Subj. Index Dutch; Doubt References; Default Solutions; Refer. Number; Refer. Number; Total Nb. of Refer.; Month
Months: January, February, March, April, May, June, July, August, September, October, November, December

Monthly values (one row per month of data):
427   1100    9   10   2882   2305   130    96
428   1056   15   21   2632   2105   127    84
408   1975   19   24   2344   1555    98    82
419   1132   14    9   3187   1912   200   107
386   1088    8    3   2178   1438   130    79
412   1493    7   11   2347   1488   169   112
372    930    3   13   2433   1699   102    85
302    963   12   12   1725   1170    79    43
397    925   13   21   3037   2209   120    89
387   1239   13   45   3217   1961   166    96
610   2272   17   20   4446   3230   173   141

Total 11 months: 4548   14173   130   189   30428   21072 (69)   1494 (33)   1014 (22.3)
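The monthly figures on this slide can be checked against the totals with a few lines of Python. Columns are indexed by position only, since the slide text does not show which heading belongs to which column. The reading of 69, 33 and 22.3 as percentages is an assumption that the arithmetic supports.

```python
# Eleven monthly rows of eight values, in the order they appear on the slide.
ROWS = [
    [427, 1100, 9, 10, 2882, 2305, 130, 96],
    [428, 1056, 15, 21, 2632, 2105, 127, 84],
    [408, 1975, 19, 24, 2344, 1555, 98, 82],
    [419, 1132, 14, 9, 3187, 1912, 200, 107],
    [386, 1088, 8, 3, 2178, 1438, 130, 79],
    [412, 1493, 7, 11, 2347, 1488, 169, 112],
    [372, 930, 3, 13, 2433, 1699, 102, 85],
    [302, 963, 12, 12, 1725, 1170, 79, 43],
    [397, 925, 13, 21, 3037, 2209, 120, 89],
    [387, 1239, 13, 45, 3217, 1961, 166, 96],
    [610, 2272, 17, 20, 4446, 3230, 173, 141],
]
TOTALS = [4548, 14173, 130, 189, 30428, 21072, 1494, 1014]

# Sum each column and compare with the totals row.
column_sums = [sum(column) for column in zip(*ROWS)]

# Ratios that reproduce the figures printed next to three of the totals.
share_col6_of_col5 = round(100 * TOTALS[5] / TOTALS[4])     # sixth total over fifth
share_col7_of_col1 = round(100 * TOTALS[6] / TOTALS[0])     # seventh total over first
share_col8_of_col1 = round(100 * TOTALS[7] / TOTALS[0], 1)  # eighth total over first
```

Every column sum matches its total. The three ratios give 69, 33 and 22.3, which suggests that the sixth total is a share of the fifth and that the last two totals are shares of the first.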
26Conclusion
Comparing the two methods
- Importance of the model
- Tools to extract pertinent indices
- Several references remain unrecognized
- Task complexity
- Model built from non-normalized references (pre-ISBD)
- Knowledge is incomplete and uncertain
- Great number of sub-classes
- Model construction: difficulty of introducing a fine degree of specification (attributes, weights, etc.)
27General Conclusion
- Enhanced understanding of the issues involved in retroconversion
- Advances in OCR and structure recognition, and their solutions
- Prototypes developed constitute important results
- Precious syntheses
- Broader basis for further work
- Cooperation between libraries
- Valuable insights into practices of retroconversion of old catalogues
- Contributed to a better comprehension of the problem
- A number of problems remain to be tackled