Retrospective Conversion of Old Catalogues - PowerPoint PPT Presentation

Slides: 28
Provided by: laurentj
1
Retrospective Conversion of Old Catalogues
A. Belaïd LORIA-CNRS, Nancy France
2
The Framework
F
Electronic Record
3
The EU Library Program
F
  • Create, enhance and harmonize bibliographies in
    electronic format, and develop tools for the
    conversion of important collections
  • Develop network connections between libraries
    to allow data access
  • Define innovative services for libraries using
    new technology

4
Three Projects IMAGE/OCR
F
BIBLIOTHECA (Spain, Italy, France)
  • Study of a pivot format (SGML) for the
    representation of different cards, and a system
    for indexing and classifying different documents

FACIT (Denmark, Greece, Italy)
  • Search for OCR packages adapted to
    retro-conversion with a large character set,
    and tools for fast and cheap mass conversion of
    catalogues

MORE (Belgium, France)
  • Study of the role and use of dictionaries in the
    structure modeling and recognition of
    catalogues by OCR techniques

5
MORE: three experiments
F
Notices
6
Bibliography Description Normative Aspects
B
Notice (catalogues)
Reference (bases)
Bibliographic Descriptions
Areas: position, number of digits, content
ISBD
Physical
Coded Elements
Logical
MARC
LIBRARIES Cataloguing organisms
Semantic
Interpreted Elements
7
Structure Rules
B
8
Reference Structures
B
9
Role of the Separators ISBD
B
Punctuation
Fields
Area
F /
  • Title proper
  • Type of the document
  • Parallel Title
  • Sub-title
  • Responsibility Mention
  • Other mentions

Title / Responsibility
  • Edition Mention
  • Edition Parallel Mention
  • First Mention

F /
Edition
...
10
Example
B
Les interrogations du psychanalyste clinique, théorie et technique / par Jean Bergeret,
--- Paris Presses Universitaires de France, 1987.
--- 193 p. couv. ill. 22 cm.
--- (Le fait psychanalytique, ISSN 0986-3524). Bibliogr., 5 p.
--- ISBN 2-13-039-780-8 (br.) 98 F

Title proper sub-title / author Statement.
--- Location Editor Name, Pub. date.
--- Print Mat. Accomp. Mater. Format.
--- (Title pr. collec., ISSN). Note.
--- ISBN Price
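
The role of the separators in this example can be sketched in a few lines of Python. This is only an illustration, not the project's code: the function names are hypothetical, and it relies on nothing more than the "---" area separator and the " / " before the responsibility statement as they appear in the record above.

```python
import re

# Area separator as it appears in this transcript.
AREA_SEPARATOR = re.compile(r"\s*---\s*")


def split_areas(record: str) -> list[str]:
    """Split an ISBD-style record into areas on the '---' separator."""
    return [area.strip() for area in AREA_SEPARATOR.split(record) if area.strip()]


def split_title_area(area: str) -> dict[str, str]:
    """Split the first area into title and responsibility on ' / '."""
    title, _, responsibility = area.partition(" / ")
    return {
        "title": title.strip(),
        "responsibility": responsibility.strip().rstrip(".,"),
    }


record = (
    "Les interrogations du psychanalyste clinique, théorie et technique "
    "/ par Jean Bergeret, --- Paris Presses Universitaires de France, 1987. "
    "--- 193 p. couv. ill. 22 cm. --- (Le fait psychanalytique, ISSN 0986-3524). "
    "Bibliogr., 5 p. --- ISBN 2-13-039-780-8 (br.) 98 F"
)

areas = split_areas(record)
first_area = split_title_area(areas[0])
```

The single hyphens inside the ISBN and ISSN are left alone because only the triple hyphen is treated as an area boundary.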
11
Structure Modeling
Object
Constructor (subordinate objects): sequence, aggregate, choice
Qualifier: required, optional, repetitive
Separator: space, punctuation
Attributes: Physical (position), Logical (lexicon), Typographical (typeface)
Weights: Attributes, Sub-objects
Imp / Reco. Imp / Hyp. Ambig.
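
One possible reading of this slide as a data structure, given as a hedged sketch: the class and field names are hypothetical, the grouping into constructors (sequence, aggregate, choice) and qualifiers (required, optional, repetitive) is an interpretation of the slide, and the example attribute value is invented for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Constructor(Enum):
    SEQUENCE = "sequence"
    AGGREGATE = "aggregate"
    CHOICE = "choice"


class Qualifier(Enum):
    REQUIRED = "required"
    OPTIONAL = "optional"
    REPETITIVE = "repetitive"


@dataclass
class ModelObject:
    name: str
    constructor: Optional[Constructor] = None   # None for a terminal object
    qualifier: Qualifier = Qualifier.REQUIRED
    separator: str = " "                         # space or punctuation
    attributes: Dict[str, str] = field(default_factory=dict)
    weight: float = 1.0
    children: List["ModelObject"] = field(default_factory=list)


def required_children(obj: ModelObject) -> List[str]:
    """Names of the subordinate objects that must be present."""
    return [c.name for c in obj.children if c.qualifier is Qualifier.REQUIRED]


title_area = ModelObject(
    name="Title / Responsibility",
    constructor=Constructor.SEQUENCE,
    children=[
        # The typeface value is a made-up example of a typographical attribute.
        ModelObject("Title proper", attributes={"typeface": "bold"}),
        ModelObject("Sub-title", qualifier=Qualifier.OPTIONAL),
        ModelObject("Responsibility Mention",
                    qualifier=Qualifier.OPTIONAL, separator=" / "),
    ],
)
```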
12
Recognition Schema
R
Learning
Recognition
Acquisition
Structure Model
Structure Recognition
Lexicons
Format Restitution
Target Format
Control
Electronic Record
13
The French Library without OCR
R
Structural Analysis
Compilation
Anchor Points
Indices
Hypotheses Management
Image
Model
Bottom-Up
Top-Down
14
Indices Extraction
R
Correlation with
4.7, 16.1, 76.7, 43.3, 37.5, 91.0, 31.5, 55.5, 37.5, 61
15
Indices Extraction Results
R
16
Syntactical Analysis the approach
R
  • Anchor points extraction (o)
  • Bottom-up: choice of a rule A ??o o ?o
  • Top-down: verification for left context ?o, right context ?o
  • Add A to anchor points
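
The four steps can be illustrated with a small anchor-driven reduction loop. This is a simplified sketch, not the system described in the talk: the `Rule` shape (a head, a left context, an anchor, a right context), the symbol names and the exact-match verification are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    head: str                 # the object A built by the rule
    left: tuple[str, ...]     # expected left context of the anchor
    anchor: str               # anchor symbol the rule is built around
    right: tuple[str, ...]    # expected right context of the anchor


def reduce_with_anchors(symbols: list[str], anchors: set[str],
                        rules: list[Rule]) -> list[str]:
    symbols, anchors = list(symbols), set(anchors)
    changed = True
    while changed:
        changed = False
        for i, sym in enumerate(symbols):
            if sym not in anchors:
                continue
            for rule in rules:
                if rule.anchor != sym:        # bottom-up: rule chosen from the anchor
                    continue
                lo, hi = i - len(rule.left), i + 1 + len(rule.right)
                if lo < 0 or hi > len(symbols):
                    continue
                # Top-down: verify the left and right contexts of the anchor.
                if (tuple(symbols[lo:i]) == rule.left
                        and tuple(symbols[i + 1:hi]) == rule.right):
                    symbols[lo:hi] = [rule.head]
                    anchors.add(rule.head)    # add A to the anchor points
                    changed = True
                    break
            if changed:
                break
    return symbols


rules = [
    Rule("TITLE_AREA", ("TITLE",), "/", ("AUTHOR",)),
    Rule("RECORD", ("TITLE_AREA",), "---", ("PLACE", "DATE")),
]
result = reduce_with_anchors(
    ["TITLE", "/", "AUTHOR", "---", "PLACE", "DATE"], {"/", "---"}, rules
)
```

Each successful rule shortens the symbol list, so the loop stops as long as every rule has a non-empty context.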
17
Propagation
R
18
Results
R
Group Vedette
Area Title
Principal Title
Crossing Title
End of the title
Cros. Formulae
Area Address / Date
Crossing Title
Address
Date
Area Collection
200 references: 75%
Group Cote
19
The Belgian Library Albert I
R
  • Large number of abbreviated words
  • Numerical information
  • Large quantity of names
  • Mixture of languages
  • Accented characters
  • Punctuation marks
  • Similar characters

20
Analysis Schema
R
Filtering
ANALYSIS
OCR Flow
Hypotheses
Model
Specific Structure (Hypotheses)
Pre-conditions
Specific
Instances
a priori
a posteriori
Local Strategies
Post Analysis Actions
Hypotheses Evaluation
21
OCR Flow SGML
R
<LIG X ...Y <B>Helvetius R100</B>
<LEX LGNL> <B>(Claude R100</B> Adrier
R75).</B> <LEX LGFR,GNL>De<LEX
LGGB,GFR,GNL> lesprit. R100
Présent-</LIG> <LIG X Y ... tation
R98<LEX IGFR,GNL> de R100 François R100
<LEX LGFR>Châtelet. R99 <I>(Verviers,
R100<LEX LGGB>Editions R100</I></LIG>...
Line
Bold
Lexicon
Italic
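
A minimal sketch of how such a flow could be read back. It rests on two assumptions that the slide does not confirm: that the `lt`/`gt` sequences stand for angle brackets, and that the `Rnn` value following a word is an OCR recognition score. The sample string is a shortened, hand-cleaned fragment of the flow above, and the threshold of 90 is arbitrary.

```python
import re

TAG = re.compile(r"<[^>]*>")                      # any markup tag
SCORED = re.compile(r"(\S+)\s+R(\d{1,3})(?!\d)")  # word followed by its R score


def scored_words(flow: str) -> list[tuple[str, int]]:
    """Return (word, score) pairs found in an OCR flow fragment."""
    text = TAG.sub(" ", flow)
    return [(word, int(score)) for word, score in SCORED.findall(text)]


sample = (
    "<B>Helvetius R100</B> <B>(Claude R100</B> Adrier R75). "
    "de R100 François R100 Châtelet. R99"
)

pairs = scored_words(sample)
doubtful = [word for word, score in pairs if score < 90]
```

Words whose score falls under the threshold are the natural candidates for lexicon lookup or manual checking.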
22
The Analysis opportunistic mode
R
Reference
CHOICE A B C
Agenda
  • Frontiers of the father
  • Inheritance of the father score
  • Put subordinate objects in the Agenda

Sort
SEQ A B C
a priori score (Attributes)
a posteriori score
  • Find search area: initials and finals
  • Construct potential zones: combinations of I and F
  • Put combinations in the agenda
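
The agenda mechanism can be sketched with a priority queue. This is an illustration under stated assumptions, not the prototype's algorithm: the `Node` class is hypothetical, the scores are invented, and combining the father's score with the child's a priori score by multiplication is one possible reading of "inheritance of the father score".

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    name: str
    a_priori: float                      # score derived from the attributes
    children: List["Node"] = field(default_factory=list)


def analysis_order(root: Node) -> List[Tuple[str, float]]:
    """Visit objects best-first, children inheriting the father's score."""
    counter = 0                          # tie-breaker so nodes are never compared
    agenda = [(-root.a_priori, counter, root)]
    order = []
    while agenda:
        neg_score, _, node = heapq.heappop(agenda)   # agenda stays sorted
        score = -neg_score
        order.append((node.name, score))
        for child in node.children:      # put subordinate objects in the agenda
            counter += 1
            heapq.heappush(agenda, (-(score * child.a_priori), counter, child))
    return order


reference = Node("Reference", 1.0, [
    Node("Title", 0.9, [Node("Title proper", 0.9)]),
    Node("Edition", 0.5),
    Node("Address", 0.8),
])
visited = [name for name, _ in analysis_order(reference)]
```

With these made-up scores the title proper is examined before the address, because its inherited score is higher than the address's own.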

23
The prototype
R
Catalogues
Dictionaries - General - Specific
Automatic Acquisition
Manual Acquisition
OCR Flow (SGML)
Manual Structural Acquisition
Dictionaries
Structure Recognition
Manual Structure
Structure Model
Structure Specif. (UNIMARC)
Error File
Dictionaries
Error Correction (with Library)
Verification Final Formatting
Structure Model

MARC
24
Structure Results
ER
References

75.5 Recognized with ambiguities to resolve
manually 3 Recognized but with
risk to be re-examined
8 Recognized with structure error
1 Unrecognized
unknown cause
3 Unrecognized model
2
25
Manual Correction
ER
Labels: OCR Corrections, Nb of doubts examined, Structure Correct., Country Correct., Doubt Index Authors, Doubt Subj. Index French, Doubt Subj. Index Dutch, Doubt References, Default Solutions, Refer. Number, Refer. Number, Total Nb. of Refer.
Month: January, February, March, April, May, June, July, August, September, October, November, December

 427    1100     9    10    2882    2305    130     96
 428    1056    15    21    2632    2105    127     84
 408    1975    19    24    2344    1555     98     82
 419    1132    14     9    3187    1912    200    107
 386    1088     8     3    2178    1438    130     79
 412    1493     7    11    2347    1488    169    112
 372     930     3    13    2433    1699    102     85
 302     963    12    12    1725    1170     79     43
 397     925    13    21    3037    2209    120     89
 387    1239    13    45    3217    1961    166     96
 610    2272    17    20    4446    3230    173    141

Total 11 months:
4548   14173   130   189   30428   21072 (69%)   1494 (33%)   1014 (22.3%)
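
As a consistency check on the figures above, the eleven monthly rows can be summed and compared with the totals. The column meanings are not assumed here. The percentages appear to be ratios between column totals (sixth over fifth, seventh over first, eighth over first), which is an inference from the arithmetic rather than something the slide states.

```python
monthly_rows = [
    (427, 1100, 9, 10, 2882, 2305, 130, 96),
    (428, 1056, 15, 21, 2632, 2105, 127, 84),
    (408, 1975, 19, 24, 2344, 1555, 98, 82),
    (419, 1132, 14, 9, 3187, 1912, 200, 107),
    (386, 1088, 8, 3, 2178, 1438, 130, 79),
    (412, 1493, 7, 11, 2347, 1488, 169, 112),
    (372, 930, 3, 13, 2433, 1699, 102, 85),
    (302, 963, 12, 12, 1725, 1170, 79, 43),
    (397, 925, 13, 21, 3037, 2209, 120, 89),
    (387, 1239, 13, 45, 3217, 1961, 166, 96),
    (610, 2272, 17, 20, 4446, 3230, 173, 141),
]

# Column-wise sums over the eleven months.
totals = tuple(sum(column) for column in zip(*monthly_rows))

# Ratios that reproduce the percentages printed next to the totals.
share_col6_of_col5 = round(100 * totals[5] / totals[4])
share_col7_of_col1 = round(100 * totals[6] / totals[0])
share_col8_of_col1 = round(100 * totals[7] / totals[0], 1)
```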
26
Conclusion
C
Comparing the two methods
  • Importance of the model
  • Tools to extract pertinent indices
  • Several references remain unrecognized
    - Task complexity
    - Model built from non-normalized references (pre-ISBD)
    - Knowledge is incomplete and uncertain
    - Great number of sub-classes
    - Model construction: difficulty of introducing a
      fine degree of specification (attributes,
      weights, etc.)

27
General Conclusion
C
  • Enhanced understanding of the issues involved in
    retroconversion
  • Advances in OCR and structure recognition and
    their solutions
  • The prototypes developed constitute important
    results
    - Precious syntheses
    - Broader basis for further work
  • Cooperation between libraries
    - Valuable insights into practices of
      retroconversion of old catalogues
    - Contributed to a better comprehension of the
      problem
  • A number of problems remain to be tackled