Title: Scanning Text, ABM-centrum Advanced Image Capture
1. Scanning Text, ABM-centrum Advanced Image Capture
- Stephen Chapman
- Weissman Preservation Center
- Harvard University Library
- 15 February 2007
2. Planning checklist for text digitization
- Products
- page images
- machine-readable text
- XML binding
- page-turned objects
- administrative metadata
- deposit packages
- discovery/descriptive/delivery metadata
- Processes
- scanning
- OCR and/or keying
- structural metadata creation
- digitization
- appraisal/processing
- prep/transfer to repository
- object cataloging and naming
3. Key planning questions for creating sustainable text collections
- What do your users want (today and in the future)?
- Where do they expect to find text objects?
- In what context(s) (e.g., as part of a collection)?
- What object behaviors will they expect to be present in the interface(s) (e.g., display, search, navigate, print)?
- Where will you put the digital objects you create, and what system(s) will you use to deliver them to users?
- What will you do with the source items you digitize?
4. Products
- Page images, machine-readable text, structural metadata, administrative metadata, deposit packages
- Discovery records (catalog records, names)
5. Page images
- Definition
- digital image (still image) corresponding to one side of one page (term as used, e.g., in the DLF Benchmark)
- First planning question: What is the rendering intent?
- reproduce item in hand (facsimile standard)
- restore to original appearance
- optimize for web presentation and/or printing
7. Google Partners image evaluations, December 2006
8. University of Virginia/Chadwyck-Healey, Early American Fiction collection
9. University of Michigan, Making of America collection
10. Royal Library / Manuscript Department
11. Page images: number of versions
- What features for image display do your delivery application(s) offer?
- display of page image within web browser?
- interactive features, such as zooming, page rotation, page resizing, panning?
- Will delivery application(s) produce deliverable versions upon request (i.e., on-the-fly)?
- example: Aware Image Server generates JPEGs from stored JP2 images
12. Format conventions for page images
13. Machine-readable text
- How many errors can be tolerated in the text produced by Optical Character Recognition (OCR) or keying?
- Consider whether the text is to be used for
- display
- indexing and full-text searching
- both
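One way to put a number on "how many errors" is character accuracy measured against a hand-checked sample. The sketch below is illustrative only: the function names and sample strings are hypothetical, and edit distance is one common error measure, not necessarily the one used in the projects shown in this talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Share of ground-truth characters reproduced correctly (0.0 to 1.0)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))


# Hypothetical sample: "m" misread as "rn", and "e" misread as "c".
truth = "Immigration to the United States"
ocr = "Imrnigration to the Unitcd States"
print(edit_distance(truth, ocr), character_accuracy(truth, ocr))
```

Text used only for indexing can often tolerate a lower score than text shown to readers, so the same measurement may lead to different decisions for the two uses.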
18. University of Pennsylvania Library, Online Books Page, Project Gutenberg edition
19. University of Virginia/Chadwyck-Healey, Early American Fiction collection
20. Harvard University Library, Immigration to the United States 1789-1930
21. Harvard University Library, Immigration to the United States 1789-1930
22. Harvard University Library, Immigration to the United States 1789-1930
23. Machine-readable text conventions
- Formats for masters
- ASCII, UTF-8 character encoding, generally per page
- Formats for delivery
- HTML, PDF, E-Book (Palm Reader, Microsoft Reader) for on-line viewing and/or for printing
- indexing/search engine format when hidden from user and used only for full-text searching
- Whenever Optical Character Recognition (OCR) is viable, assume that 99.995% accurate text will be 5-10X more expensive per page than uncorrected OCR.
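To make the accuracy target and the cost multiplier concrete, the arithmetic can be sketched as follows. The page length of 2,000 characters and the cost units are assumptions chosen for illustration, not figures from this talk; only the 99.995% target and the 5-10X range come from the slide.

```python
from fractions import Fraction

CHARS_PER_PAGE = 2000  # assumption for illustration only


def expected_errors_per_page(accuracy, chars_per_page=CHARS_PER_PAGE):
    """Expected number of wrong characters on one page at a given accuracy."""
    return chars_per_page * (1 - accuracy)


def corrected_cost_range(uncorrected_cost_per_page, pages):
    """Planning range for corrected text: 5x to 10x the uncorrected OCR cost."""
    low = 5 * uncorrected_cost_per_page * pages
    high = 10 * uncorrected_cost_per_page * pages
    return low, high


target = Fraction(99995, 100000)  # 99.995% character accuracy
print(expected_errors_per_page(target))
# Hypothetical budget: 1,000 pages at 1/10 of a cost unit per uncorrected page.
print(corrected_cost_range(Fraction(1, 10), 1000))
```

Under the assumed page length, the target allows one wrong character in every 20,000, or about one error per ten pages.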
24. Structural metadata: the binding for multi-page objects
- Metadata used to create a coherent, logical object
- Primary back-end data to drive navigational features of your (front-end) delivery application
- page-forward/page-back, go-to-page number
- go-to (click-to) section (e.g., Contents, Index)
- Metadata to facilitate managing object and its component parts over time
- page images (all versions) and text files
- related administrative metadata
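A minimal sketch of how structural metadata could drive these navigation features, assuming a simple in-memory page list and section map. All names, file names, and page labels here are hypothetical; a real delivery application would read them from its own metadata store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    order: int   # physical sequence within the object, starting at 1
    label: str   # printed page number, or "" if the page is unnumbered
    image: str   # hypothetical page-image file name
    text: str    # hypothetical page-text file name


PAGES = [
    Page(1, "", "0001.jp2", "0001.txt"),
    Page(2, "i", "0002.jp2", "0002.txt"),
    Page(3, "1", "0003.jp2", "0003.txt"),
    Page(4, "2", "0004.jp2", "0004.txt"),
]

# Section name -> sequence number of the section's first page.
SECTIONS = {"Title page": 1, "Contents": 2, "Chapter 1": 3}


def page_forward(order: int) -> int:
    """Next page in sequence, stopping at the last page."""
    return min(order + 1, len(PAGES))


def page_back(order: int) -> int:
    """Previous page in sequence, stopping at the first page."""
    return max(order - 1, 1)


def go_to_page(label: str):
    """Sequence number of the page carrying a printed label, or None."""
    for page in PAGES:
        if page.label == label:
            return page.order
    return None


def go_to_section(name: str):
    """Sequence number of a section's first page, or None if unknown."""
    return SECTIONS.get(name)
```

Keeping the sequence number separate from the printed label is what lets "go to page 1" land on the third image in this example.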
32. Structural metadata conventions
- Manage related files and drive delivery from filenames
- very limited utility, practice fading
- Manage multi-page objects with XML
- make all relationships and features explicit
- Metadata Encoding and Transmission Standard (METS)
- overview
- Text Encoding Initiative (TEI) guidelines on Wiki
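As an illustration of making relationships explicit in XML, the sketch below builds a METS structural map fragment with Python's standard library. The element and attribute names (structMap, div, fptr, TYPE, LABEL, ORDER, ORDERLABEL, FILEID) are METS names; the section titles, file IDs, and TYPE values are hypothetical, and a complete METS document would also need a root mets element and a file section declaring those IDs.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_struct_map(sections):
    """Build a structMap from [(section title, [(page label, file ID), ...])]."""
    struct_map = ET.Element(q("structMap"), {"TYPE": "logical"})
    book = ET.SubElement(struct_map, q("div"), {"TYPE": "book"})
    order = 0
    for title, pages in sections:
        section = ET.SubElement(book, q("div"),
                                {"TYPE": "section", "LABEL": title})
        for label, file_id in pages:
            order += 1  # running sequence number across the whole object
            page = ET.SubElement(section, q("div"), {
                "TYPE": "page",
                "ORDER": str(order),
                "ORDERLABEL": label,
            })
            # Point the page at its image (or text) file by ID.
            ET.SubElement(page, q("fptr"), {"FILEID": file_id})
    return struct_map


example = build_struct_map([
    ("Contents", [("i", "IMG0001")]),
    ("Chapter 1", [("1", "IMG0002"), ("2", "IMG0003")]),
])
print(ET.tostring(example, encoding="unicode"))
```

Unlike a filename convention, this records section boundaries, sequence, and printed page numbers as separate, explicit values.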
34. Structural metadata to record physical (instead of logical) features
- Function: map each word in OCR-generated text file to x-y bounding box coordinates in corresponding page image
- for prevalent examples of feature, see Amazon Search Inside the Book, Google Book Search, and other mass digitization initiatives and interfaces
- ALTO (Analyzed Layout and Text Object) is one example (integrated in METAe product, as well as METS implementation)
- PDF and XDOC are other format options
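The word-to-coordinates mapping can be sketched as a lookup that returns highlight boxes for a search term. The fragment below is a simplified, namespace-free imitation of ALTO's String element with its CONTENT, HPOS, VPOS, WIDTH, and HEIGHT attributes; the sample values are invented, and real ALTO files are namespaced and nest these elements inside a fuller page layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified ALTO-style fragment for one page.
SAMPLE = """<TextBlock>
  <TextLine>
    <String CONTENT="Scanning" HPOS="100" VPOS="200" WIDTH="180" HEIGHT="40"/>
    <String CONTENT="text" HPOS="300" VPOS="200" WIDTH="80" HEIGHT="40"/>
  </TextLine>
  <TextLine>
    <String CONTENT="Text" HPOS="100" VPOS="260" WIDTH="85" HEIGHT="40"/>
  </TextLine>
</TextBlock>"""


def word_boxes(xml_text: str, query: str):
    """Return (left, top, right, bottom) boxes for words matching the query."""
    root = ET.fromstring(xml_text)
    hits = []
    for word in root.iter("String"):
        if word.get("CONTENT", "").lower() == query.lower():
            left = float(word.get("HPOS"))
            top = float(word.get("VPOS"))
            right = left + float(word.get("WIDTH"))
            bottom = top + float(word.get("HEIGHT"))
            hits.append((left, top, right, bottom))
    return hits


print(word_boxes(SAMPLE, "text"))
```

A delivery interface could draw these boxes over the page image to show search hits in context.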
35. Administrative and deposit metadata
- Administrative metadata pertinent to text objects
- page images (Z39.87, NISO MIX)
- structural metadata (sections internal to METS, TEI)
- ownership, rights, governance (possibilities with PREMIS implementations)
- Metadata schemes for object packaging and transfer
- METS gaining adoption among preservation repositories in US (e.g., CDL, FCLA, Harvard)
36. Best-Fit Processes
- Scanning, OCR/keying, Metadata production: structural, administrative, packaging
- Cataloging, object naming (for linking)
37. Scanning: quality/risk assessments
- Rendering intent
- List required image types and quality requirements
- Disposition/handling policies for source materials
- May materials be transformed (disbound) for scanning?
- Must restrictions be placed on handling?
- Viable sources
- Bound print, unbound print, microfilm, photocopy
38. Scanning: technology assessments
- Which technology?
- auto-page turning scanner (Kirtas, 4digitalbooks)
- high-end digital camera with custom cradle(s)
- overhead bookscanner (Zeutschel, i2s Copibook)
- flatbed scanner
- single-pass autofeed (Fujitsu duplex models)
- Materials handling, image quality, throughput, operator skill, budgets vary (among items, projects, collections, institutions): no best system
39. Creating machine-readable text
- Manual approach: keying (keyboarding, transcribing)
- viable from all sources, with high quality (accuracy)
- many outsourcing options
- Automated approach: OCR
- viable from all page images with machine-printed content in supported languages
- uncorrected OCR inexpensive, but contains errors
- complexity of source, page image quality relevant
- pre-OCR image processing (dewarping) helps
- ABBYY FineReader most widely used application
40. Producing structural metadata
- Challenging attributes
- pagination: no standard rules for implied numbers
- wide variety of layouts present in historic materials
- no standardized vocabulary to record features
- Technology: automated layout analysis
- widely used in production of e-newspapers
- docWORKS/METAe (Metadata Engine) working well for range of 19th and 20th-c. published materials
- expect errors that need to be human-corrected
- efficiencies possible by mapping data from catalogs
41. Producing administrative metadata
- Purpose/mandate
- facilitate management of objects, their component parts, and access to them
- Workflow
- can be generated automatically for some formats (Z39.87 for still images) as outputs of scanning and/or validation (JHOVE) processes
- Standards
- PREMIS implementation efforts important: offer promise for standardization and additional automation
42. Producing packaging metadata (additional XML)
- Function
- producing Submission Information Packages (SIPs) from your well-made, multi-page objects
- Implementation
- Most preservation repositories prescribing formats for SIPs, providing tools to deposit agents (CDL, OCLC)
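Because each repository prescribes its own SIP format, the following is only a generic sketch of one step most packaging workflows share: listing every file in the object with its size and a checksum before transfer. The function name and manifest fields are hypothetical and do not represent any repository's required format.

```python
import hashlib
from pathlib import Path


def build_manifest(package_dir: Path):
    """List each file under package_dir with its size and SHA-256 checksum."""
    entries = []
    files = sorted(p for p in package_dir.rglob("*") if p.is_file())
    for path in files:
        data = path.read_bytes()
        entries.append({
            # Relative path with forward slashes, stable across platforms.
            "path": path.relative_to(package_dir).as_posix(),
            "bytes": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return entries
```

A deposit tool supplied by the repository would then wrap such a list in whatever packaging metadata the repository requires.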
43. Thanks
- Stephen Chapman
- stephen_chapman_at_harvard.edu
- Additional resources
- http://preserve.harvard.edu/bibliographies/textconversion.pdf
- http://preserve.harvard.edu/bibliographies/ocr.pdf