
Title: Scanning Text ABM-centrum Advanced Image Capture

1
Scanning Text ABM-centrum Advanced Image Capture

Stephen Chapman
Weissman Preservation Center
Harvard University Library
15 February 2007

2
Planning checklist for text digitization

Products
page images
machine-readable text
XML binding
page-turned objects
administrative metadata
deposit packages
discovery/descriptive/delivery metadata

Processes
scanning
OCR and/or keying
structural metadata creation
digitization
appraisal/processing
prep/transfer to repository
object cataloging and naming

3
Key planning questions for creating sustainable
text collections

What do your users want? (today and in the
future)
Where do they expect to find text objects?
In what context(s)? (e.g., as part of a
collection)
What object behaviors will they expect to be
present in the interface(s)? (e.g., display,
search, navigate, print, etc.)
Where will you put the digital objects you
create, and what system(s) will you use to
deliver them to users?
What will you do with the source items you
digitize?

4
Products

Page images, Machine-readable text, Structural
metadata, Admin metadata, Deposit packages
Discovery records (catalog records, names)

5
Page images

Definition
digital image (still image) corresponding to one
side of one page (the term as used, e.g., in the
DLF Benchmark)
First planning question: What is the rendering
intent?
reproduce item in hand (facsimile standard)
restore to original appearance
optimize for web presentation and/or printing

7
Google Partners image evaluations, December 2006
8
University of Virginia/Chadwyck-Healey, Early
American Fiction collection
9
University of Michigan, Making of America
collection
10
Royal Library / Manuscript Department
11
Page images: number of versions

What features for image display do your delivery
application(s) offer?
display of page image within web browser?
interactive features, such as zooming, page
rotation, page resizing, panning?
Will delivery application(s) produce deliverable
versions upon request (i.e., on-the-fly)?
Example: Aware Image Server generates JPEGs from
stored JP2 images

12
Format conventions for page images
13
Machine-readable text

How many errors can be tolerated in the text
produced by Optical Character Recognition (OCR)
or keying?
Consider whether the text is to be used for
display
indexing and full-text searching
both
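
One way to make the error-tolerance question concrete is to measure character accuracy on a sample page by comparing OCR output with a hand-corrected transcription. The sketch below is only an illustration and not a method named in the slides; the function names and the sample strings are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def character_accuracy(ocr_text: str, reference: str) -> float:
    """Share of reference characters reproduced correctly (0.0 to 1.0)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(ocr_text, reference)
    return max(0.0, 1.0 - errors / len(reference))


# Hypothetical sample: a corrected line and a typical OCR misreading of it.
reference = "Immigration to the United States"
ocr_text = "Imrnigration to the Unitcd States"
print(f"character accuracy: {character_accuracy(ocr_text, reference):.3f}")
```

Text intended only for indexing may tolerate a lower score than text that will be displayed to readers.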

18
University of Pennsylvania Library, Online Books
Page, Project Gutenberg edition
19
University of Virginia/Chadwyck-Healey, Early
American Fiction collection
20
Harvard University Library, Immigration to the
United States 1789-1930
21
Harvard University Library, Immigration to the
United States 1789-1930
22
Harvard University Library, Immigration to the
United States 1789-1930
23
Machine-readable text conventions

Formats for masters
ASCII, UTF-8 character encoding, generally per
page
Formats for delivery
HTML, PDF, E-Book (Palm Reader, Microsoft Reader)
for on-line viewing and/or for printing
indexing/search engine format when hidden from
user and used only for full-text searching
Whenever Optical Character Recognition (OCR) is
viable, assume that 99.995% accurate text will be
5-10 times more expensive per page than
uncorrected OCR.
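
To see what the accuracy figure implies in practice, the arithmetic can be sketched as follows. The page density (2,000 characters per page) and the uncorrected cost per page are assumptions chosen for illustration, not figures from the presentation; only the 99.995 accuracy level and the 5-10 multiplier come from the slide.

```python
def expected_errors_per_page(accuracy_percent: float, chars_per_page: int) -> float:
    """Expected number of wrong characters on one page."""
    return chars_per_page * (1.0 - accuracy_percent / 100.0)


def corrected_cost_range(uncorrected_cost: float, low: float = 5, high: float = 10):
    """Per-page cost range for corrected text, using the slide's 5-10x multiplier."""
    return uncorrected_cost * low, uncorrected_cost * high


CHARS_PER_PAGE = 2000    # assumption: a moderately dense printed page
UNCORRECTED_COST = 0.10  # assumption: hypothetical cost per page, any currency

errors = expected_errors_per_page(99.995, CHARS_PER_PAGE)
low, high = corrected_cost_range(UNCORRECTED_COST)
print(f"errors per page: {errors:.2f} (about one error every {1 / errors:.0f} pages)")
print(f"corrected text: {low:.2f} to {high:.2f} per page")
```

Under these assumptions, the target amounts to roughly one wrong character in every ten pages, which helps explain why it cannot be reached without keying or human correction.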

24
Structural metadata: the binding for multi-page
objects

Metadata used to create coherent, logical object
Primary back-end data to drive navigational
features of your (front-end) delivery application
page-forward/page-back, go-to-page number
go-to (click-to) section (e.g., Contents, Index)
Metadata to facilitate managing object and its
component parts over time
page images (all versions) and text files
related administrative metadata
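
The navigational role of structural metadata can be sketched with a small in-memory model. This is a minimal illustration under assumed names (Page, Section, PagedObject and their fields are hypothetical), not the data model of any particular delivery application.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    sequence: int    # physical position in the object, starting at 1
    label: str       # page number as printed, "" if none
    image_file: str  # file name of the page image


@dataclass
class Section:
    title: str
    first_sequence: int  # sequence number of the section's first page


@dataclass
class PagedObject:
    pages: List[Page]
    sections: List[Section] = field(default_factory=list)

    def page_forward(self, current: int) -> int:
        return min(current + 1, len(self.pages))

    def page_back(self, current: int) -> int:
        return max(current - 1, 1)

    def go_to_page_number(self, label: str) -> Optional[int]:
        """Map a printed page number to its physical sequence number."""
        for page in self.pages:
            if page.label == label:
                return page.sequence
        return None

    def go_to_section(self, title: str) -> Optional[int]:
        for section in self.sections:
            if section.title == title:
                return section.first_sequence
        return None


book = PagedObject(
    pages=[
        Page(1, "", "0001.jp2"),
        Page(2, "i", "0002.jp2"),
        Page(3, "1", "0003.jp2"),
        Page(4, "2", "0004.jp2"),
    ],
    sections=[Section("Contents", 2), Section("Chapter 1", 3)],
)
print(book.go_to_page_number("1"), book.go_to_section("Contents"))
```

The distinction between sequence and label is the point: "go to page 1" must land on the third image, not the first.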

32
Structural metadata conventions

Manage related files and drive delivery from
filenames
very limited utility, practice fading
Manage multi-page objects with XML
make all relationships and features explicit
Metadata Encoding and Transmission Standard (METS)
- overview
Text Encoding Initiative (TEI) guidelines on
Wiki
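
As a rough illustration of making relationships explicit in XML, the sketch below builds a skeletal METS document with a file section and a physical structural map. It is a minimal example that has not been validated against the METS schema; the file IDs, paths, and the choice of TYPE and USE values are assumptions.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(namespace: str, tag: str) -> str:
    """Build a namespace-qualified name in ElementTree's {uri}tag form."""
    return f"{{{namespace}}}{tag}"


def build_mets(pages):
    """pages: list of (file_id, href, printed_label) tuples in physical order."""
    mets = ET.Element(q(METS_NS, "mets"))
    file_sec = ET.SubElement(mets, q(METS_NS, "fileSec"))
    file_grp = ET.SubElement(file_sec, q(METS_NS, "fileGrp"), {"USE": "page images"})
    struct_map = ET.SubElement(mets, q(METS_NS, "structMap"), {"TYPE": "physical"})
    volume = ET.SubElement(struct_map, q(METS_NS, "div"), {"TYPE": "volume"})
    for order, (file_id, href, label) in enumerate(pages, start=1):
        # Each image is declared once in the file section ...
        file_el = ET.SubElement(
            file_grp, q(METS_NS, "file"), {"ID": file_id, "MIMETYPE": "image/jp2"}
        )
        ET.SubElement(
            file_el, q(METS_NS, "FLocat"),
            {"LOCTYPE": "URL", q(XLINK_NS, "href"): href},
        )
        # ... and referenced from the page div that gives it order and label.
        page_div = ET.SubElement(
            volume, q(METS_NS, "div"),
            {"TYPE": "page", "ORDER": str(order), "ORDERLABEL": label},
        )
        ET.SubElement(page_div, q(METS_NS, "fptr"), {"FILEID": file_id})
    return mets


# Hypothetical two-page object.
mets = build_mets([
    ("IMG0001", "images/0001.jp2", "i"),
    ("IMG0002", "images/0002.jp2", "1"),
])
print(ET.tostring(mets, encoding="unicode"))
```

Unlike a filename convention, the page order and printed labels here survive file renaming, because they are recorded in the structural map rather than inferred from names.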

34
Structural metadata to record physical (instead
of logical) features

Function: map each word in OCR-generated text
file to x-y bounding box coordinates in
corresponding page image
for prevalent examples of this feature, see
Amazon Search Inside the Book, Google Book
Search, and other mass digitization initiatives
and interfaces
ALTO (Analyzed Layout and Text Object) is one
example (integrated in METAe product, as well as
METS implementation)
PDF and XDOC are other format options
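
The word-to-coordinates mapping can be sketched as a lookup that returns the bounding boxes of search hits, which an interface could then highlight on the page image. The XML below is a simplified, namespace-free fragment modeled on ALTO's String element and its CONTENT, HPOS, VPOS, WIDTH and HEIGHT attributes; it is not a complete ALTO file, and the coordinate values are invented.

```python
import xml.etree.ElementTree as ET

# Simplified fragment for illustration only (real ALTO files are namespaced
# and wrap these elements in layout and page structures).
SAMPLE = """
<TextBlock>
  <TextLine>
    <String CONTENT="Scanning" HPOS="100" VPOS="200" WIDTH="180" HEIGHT="40"/>
    <String CONTENT="text" HPOS="300" VPOS="200" WIDTH="90" HEIGHT="40"/>
  </TextLine>
  <TextLine>
    <String CONTENT="Text" HPOS="100" VPOS="260" WIDTH="95" HEIGHT="40"/>
  </TextLine>
</TextBlock>
"""


def find_word_boxes(xml_text: str, query: str):
    """Return (left, top, right, bottom) boxes for case-insensitive word matches."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for word in root.iter("String"):
        if word.get("CONTENT", "").lower() == query.lower():
            left = float(word.get("HPOS"))
            top = float(word.get("VPOS"))
            right = left + float(word.get("WIDTH"))
            bottom = top + float(word.get("HEIGHT"))
            boxes.append((left, top, right, bottom))
    return boxes


for box in find_word_boxes(SAMPLE, "text"):
    print(box)
```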

35
Administrative and deposit metadata

Administrative metadata pertinent to text objects
page images (Z39.87, NISO MIX)
structural metadata (sections internal to METS,
TEI)
ownership, rights, governance (possibilities with
PREMIS implementations)
Metadata schemes for object packaging and
transfer
METS gaining adoption among preservation
repositories in US (e.g., CDL, FCLA, Harvard)

36
Best-Fit Processes

Scanning, OCR/keying, Metadata production:
structural, administrative, packaging
Cataloging, object naming (for linking)

37
Scanning: quality/risk assessments

Rendering intent
List required image types and quality
requirements
Disposition/handling policies for source
materials
May materials be transformed (disbound) for
scanning?
Must restrictions be placed on handling?
Viable sources
Bound print, unbound print, microfilm, photocopy

38
Scanning: technology assessments

Which technology?
auto-page turning scanner (Kirtas, 4digitalbooks)
high-end digital camera with custom cradle(s)
overhead bookscanner (Zeutschel, i2s Copibook)
flatbed scanner
single-pass autofeed (Fujitsu duplex models)
Materials handling, image quality, throughput,
operator skill, and budgets vary (among items,
projects, collections, institutions): there is no
best system

39
Creating machine-readable text

Manual approach: keying (keyboarding,
transcribing)
viable from all sources, with high quality
(accuracy)
many outsourcing options
Automated approach: OCR
viable from all page images with machine-printed
content in supported languages
uncorrected OCR inexpensive, but contains errors
complexity of source, page image quality relevant
pre-OCR image processing (dewarping) helps
ABBYY FineReader most widely used application

40
Producing structural metadata

Challenging attributes
pagination: no standard rules for implied numbers
wide variety of layouts present in historic
materials
no standardized vocabulary to record features
Technology: automated layout analysis
widely used in production of e-newspapers
docWORKS/METAe (Metadata Engine) working well for
range of 19th and 20th-c. published materials
expect errors that need to be human-corrected
efficiencies possible by mapping data from
catalogs
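
Because no standard rules exist for implied page numbers, any project has to pick a local convention. The sketch below shows one possible convention, chosen purely for illustration: an unnumbered page that follows a printed arabic number continues the count and is recorded in square brackets, while pages before the first printed number are left unlabeled. The function name and the bracket notation are assumptions, and output like this would still need human review.

```python
from typing import List, Optional


def infer_page_labels(printed: List[Optional[int]]) -> List[str]:
    """printed: page numbers as printed, in physical order; None where absent."""
    labels = []
    last_number = None
    for value in printed:
        if value is not None:
            # A printed number is recorded as-is and resets the count.
            last_number = value
            labels.append(str(value))
        elif last_number is not None:
            # An unnumbered page after a numbered one gets an implied number.
            last_number += 1
            labels.append(f"[{last_number}]")
        else:
            # Nothing to count from yet (e.g., title page, frontispiece).
            labels.append("")
    return labels


# Hypothetical volume: two unnumbered preliminaries, then a gap at page 3 and 5.
print(infer_page_labels([None, None, 1, 2, None, 4, None]))
```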

41
Producing administrative metadata

Purpose/mandate
facilitate management of objects, their component
parts, and access to them
Workflow
can be generated automatically for some formats
(Z39.87 for still images) as outputs of scanning
and/or validation (JHOVE) processes
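
The idea of generating technical metadata automatically can be sketched by reading values straight from an image file header. This example does not call JHOVE or emit Z39.87/MIX XML; it parses a PNG header with the standard library only because that header is simple, and the dictionary keys are hypothetical names loosely inspired by image technical metadata.

```python
import hashlib
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def technical_metadata(data: bytes) -> dict:
    """Extract basic technical metadata from the bytes of a PNG file."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # The first chunk must be IHDR: 4-byte length, 4-byte type, 13 data bytes.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("missing or malformed IHDR chunk")
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {
        "format": "image/png",
        "image_width": width,
        "image_height": height,
        "bits_per_sample": bit_depth,
        "png_color_type": color_type,
        "byte_size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Hypothetical header for a 2400 x 3600 pixel, 8-bit grayscale image
# (header bytes only, enough for this parser; not a complete PNG).
sample = PNG_SIGNATURE + struct.pack(">I4sIIBBBBB", 13, b"IHDR", 2400, 3600, 8, 0, 0, 0, 0)
print(technical_metadata(sample))
```

In a production workflow the same values would more plausibly come from the scanning software or a validation tool, as the slide indicates.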
Standards
PREMIS implementation efforts are important: they
offer promise for standardization and additional
automation

42
Producing packaging metadata (additional XML)

Function
producing Submission Information Packages (SIPs)
from your well-made, multi-page objects
Implementation
Most preservation repositories are prescribing
formats for SIPs and providing tools to deposit
agents (CDL, OCLC)
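
Since each repository prescribes its own SIP format, the following is only a generic sketch of the packaging step: it bundles an object's files into a ZIP archive together with a checksum manifest. The package layout, the manifest file name, and the sample file names are all assumptions, not the format of CDL, OCLC, or any other repository.

```python
import hashlib
import io
import zipfile


def build_package(files: dict) -> bytes:
    """files maps a relative path to its bytes; returns the ZIP package as bytes."""
    manifest_lines = []
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for path in sorted(files):
            content = files[path]
            package.writestr(path, content)
            # One manifest line per file: checksum, two spaces, relative path.
            digest = hashlib.sha256(content).hexdigest()
            manifest_lines.append(f"{digest}  {path}")
        package.writestr("manifest-sha256.txt", "\n".join(manifest_lines) + "\n")
    return buffer.getvalue()


# Hypothetical two-file object: a METS document and one page image.
package_bytes = build_package({
    "mets.xml": b"<mets/>",
    "images/0001.jp2": b"placeholder image bytes",
})
with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
    print(package.namelist())
```

The manifest lets the receiving repository confirm that every file arrived intact before ingest.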

43
Thanks