Scanning Text: ABM-centrum Advanced Image Capture

1
Scanning Text: ABM-centrum Advanced Image Capture
  • Stephen Chapman
  • Weissman Preservation Center
  • Harvard University Library
  • 15 February 2007

2
Planning checklist for text digitization
  • Products
      • page images
      • machine-readable text
      • XML binding
      • page-turned objects
      • administrative metadata
      • deposit packages
      • discovery/descriptive/delivery metadata
  • Processes
      • scanning
      • OCR and/or keying
      • structural metadata creation
      • digitization
      • appraisal/processing
      • prep/transfer to repository
      • object cataloging and naming

3
Key planning questions for creating sustainable
text collections
  • What do your users want? (today and in the
    future)
  • Where do they expect to find text objects?
  • In what context(s)? (e.g., as part of a
    collection)
  • What object behaviors will they expect to be
    present in the interface(s)? (e.g., display,
    search, navigate, print, etc.)
  • Where will you put the digital objects you
    create, and what system(s) will you use to
    deliver them to users?
  • What will you do with the source items you
    digitize?

4
Products
  • Page images, Machine-readable text, Structural
    metadata, Admin metadata, Deposit packages
  • Discovery records (catalog records, names)

5
Page images
  • Definition
      • digital image (still image) corresponding to one
        side of one page (term of use, e.g., in DLF
        Benchmark)
  • First planning question: What is the rendering
    intent?
      • reproduce item in hand (facsimile standard)
      • restore to original appearance
      • optimize for web presentation and/or printing

7
Google Partners image evaluations, December 2006
8
University of Virginia/Chadwyck-Healey, Early
American Fiction collection
9
University of Michigan, Making of America
collection
10
Royal Library / Manuscript Department
11
Page images: number of versions
  • What features for image display do your delivery
    application(s) offer?
      • display of page image within web browser?
      • interactive features, such as zooming, page
        rotation, page resizing, panning?
  • Will delivery application(s) produce deliverable
    versions upon request (i.e., on-the-fly)?
      • example: Aware Image Server generates JPEGs from
        stored JP2 images
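
One way to picture on-the-fly delivery from a stored JP2 master: JPEG 2000 files hold several resolution levels, each roughly half the size of the previous one, so a server can decode only the level it needs before writing a JPEG. The sketch below is not how Aware Image Server is implemented; the master size, the number of levels, and the function names are assumptions chosen for illustration.

```python
import math

def level_dimensions(width, height, levels):
    """Return (reduction, width, height) for each resolution level.

    A reduction of r means the image is decoded at 1/2**r of full size;
    r = 0 is the full-resolution master.
    """
    return [
        (r, math.ceil(width / 2 ** r), math.ceil(height / 2 ** r))
        for r in range(levels + 1)
    ]

def pick_reduction(width, height, levels, requested_width):
    """Pick the largest reduction whose width still covers the request."""
    best = 0
    for r, level_width, _level_height in level_dimensions(width, height, levels):
        if level_width >= requested_width:
            best = r
    return best

# Hypothetical master: 2400 x 3600 pixels with 5 decomposition levels.
sizes = level_dimensions(2400, 3600, 5)
reduction = pick_reduction(2400, 3600, 5, requested_width=600)
```

With these assumed values, a request for a 600-pixel-wide page view can be served from the level at reduction 2 instead of decoding the full master.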

12
Format conventions for page images
13
Machine-readable text
  • How many errors can be tolerated in the text
    produced by Optical Character Recognition (OCR)
    or keying?
  • Consider whether the text is to be used for:
      • display
      • indexing and full-text searching
      • both
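
To decide how many errors are tolerable, it helps to measure them. A minimal sketch, assuming a keyed (trusted) transcription is available for a sample page: count character-level edits between the reference and the OCR output and express the result as an accuracy figure. The sample strings and function names are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

def character_accuracy(reference, ocr_text):
    """Fraction of reference characters that did not need correcting."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_text)
    return max(0.0, 1.0 - errors / len(reference))

# Hypothetical sample: two characters misread by OCR.
reference = "Immigration to the United States"
ocr_text = "Immigratlon to tbe United States"
errors = edit_distance(reference, ocr_text)
accuracy = character_accuracy(reference, ocr_text)
```

Text used only for hidden full-text indexing can often tolerate a lower figure than text displayed to readers, which is the distinction the slide draws.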

18
University of Pennsylvania Library, Online Books
Page, Project Gutenberg edition
19
University of Virginia/Chadwyck-Healey, Early
American Fiction collection
20
Harvard University Library, Immigration to the
United States 1789-1930
21
Harvard University Library, Immigration to the
United States 1789-1930
22
Harvard University Library, Immigration to the
United States 1789-1930
23
Machine-readable text conventions
  • Formats for masters
      • ASCII, UTF-8 character encoding, generally per
        page
  • Formats for delivery
      • HTML, PDF, E-Book (Palm Reader, Microsoft Reader)
        for on-line viewing and/or for printing
      • indexing/search engine format when hidden from
        user and used only for full-text searching
  • Whenever Optical Character Recognition (OCR) is
    viable, assume that 99.995% accurate text will be
    5-10X more expensive per page than uncorrected
    OCR.
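
A quick worked calculation shows what the 99.995% figure means in practice. The page length (2,000 characters) and volume length (300 pages) below are assumptions for illustration, not figures from the presentation; exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

def expected_errors(accuracy_percent, characters):
    """Expected number of wrong characters at a given accuracy level.

    accuracy_percent is passed as a string (e.g. "99.995") so that the
    arithmetic stays exact.
    """
    error_rate = 1 - Fraction(accuracy_percent) / 100
    return error_rate * characters

CHARS_PER_PAGE = 2000   # assumed average page
PAGES_PER_VOLUME = 300  # assumed volume length

per_page = expected_errors("99.995", CHARS_PER_PAGE)
per_volume = expected_errors("99.995", CHARS_PER_PAGE * PAGES_PER_VOLUME)
per_page_at_99 = expected_errors("99", CHARS_PER_PAGE)
```

Under these assumptions, 99.995% accuracy is one error in every 20,000 characters, or about one error per ten pages and 30 per volume, whereas 99% accuracy leaves about 20 errors on every page. That gap is what the 5-10X cost multiplier pays for.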

24
Structural metadata: the binding for multi-page
objects
  • Metadata used to create coherent, logical object
  • Primary back-end data to drive navigational
    features of your (front-end) delivery application
      • page-forward/page-back, go-to-page number
      • go-to (click-to) section (e.g., Contents, Index)
  • Metadata to facilitate managing object and its
    component parts over time
      • page images (all versions) and text files
      • related administrative metadata
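
The navigational features listed above can be sketched as a small in-memory model. This is only an illustration of the idea that sequence, page labels, and section starts are recorded as data; the class names, file names, and page labels are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Page:
    sequence: int    # 1-based position within the object
    label: str       # page number as printed, e.g. "ii" or "12"
    image_file: str
    text_file: str

@dataclass
class Section:
    title: str
    first_sequence: int  # sequence number where the section begins

@dataclass
class TextObject:
    pages: List[Page]
    sections: List[Section] = field(default_factory=list)

    def page_forward(self, sequence: int) -> int:
        return min(sequence + 1, len(self.pages))

    def page_back(self, sequence: int) -> int:
        return max(sequence - 1, 1)

    def go_to_page(self, label: str) -> Optional[int]:
        """Translate a printed page number into a sequence number."""
        for page in self.pages:
            if page.label == label:
                return page.sequence
        return None

    def go_to_section(self, title: str) -> Optional[int]:
        for section in self.sections:
            if section.title == title:
                return section.first_sequence
        return None

# Hypothetical six-page object with two front-matter pages.
labels = ["i", "ii", "1", "2", "3", "4"]
book = TextObject(
    pages=[Page(n, label, f"{n:04d}.jp2", f"{n:04d}.txt")
           for n, label in enumerate(labels, start=1)],
    sections=[Section("Contents", 2), Section("Index", 6)],
)
```

Here `book.go_to_page("1")` resolves to the third image in the sequence, which is the distinction between printed page numbers and file order that structural metadata has to capture.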

32
Structural metadata conventions
  • Manage related files and drive delivery from
    filenames
      • very limited utility; practice fading
  • Manage multi-page objects with XML
      • make all relationships and features explicit
      • Metadata Encoding and Transmission Standard (METS)
        - overview
      • Text Encoding Initiative (TEI) guidelines on
        Wiki
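
To show what "making relationships explicit" in XML can look like, here is a minimal sketch that builds a METS-style file section and physical structural map with Python's standard library. The file paths, IDs, and the `USE` and `TYPE` values are hypothetical, and the output has not been validated against the METS schema; a production profile would carry considerably more.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)

def q(namespace, tag):
    """Qualified name in the {namespace}tag form ElementTree expects."""
    return f"{{{namespace}}}{tag}"

def build_mets(pages):
    """pages: list of (order, printed_label, image_path) tuples."""
    mets = ET.Element(q(METS_NS, "mets"))
    file_sec = ET.SubElement(mets, q(METS_NS, "fileSec"))
    file_grp = ET.SubElement(file_sec, q(METS_NS, "fileGrp"), {"USE": "page images"})
    struct_map = ET.SubElement(mets, q(METS_NS, "structMap"), {"TYPE": "physical"})
    volume = ET.SubElement(struct_map, q(METS_NS, "div"), {"TYPE": "volume"})
    for order, label, path in pages:
        file_id = f"IMG{order:04d}"
        # Each file is declared once in the file section...
        file_el = ET.SubElement(file_grp, q(METS_NS, "file"),
                                {"ID": file_id, "MIMETYPE": "image/jp2"})
        ET.SubElement(file_el, q(METS_NS, "FLocat"),
                      {"LOCTYPE": "URL", q(XLINK_NS, "href"): path})
        # ...and each page div points back to it by ID.
        page_div = ET.SubElement(volume, q(METS_NS, "div"),
                                 {"TYPE": "page", "ORDER": str(order),
                                  "ORDERLABEL": label})
        ET.SubElement(page_div, q(METS_NS, "fptr"), {"FILEID": file_id})
    return mets

mets = build_mets([(1, "i", "0001.jp2"), (2, "1", "0002.jp2")])
xml_text = ET.tostring(mets, encoding="unicode")
```

Unlike a filename convention, the sequence (`ORDER`), the printed page number (`ORDERLABEL`), and the link to the file are all stated in the document itself.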

34
Structural metadata to record physical (instead
of logical) features
  • Function: map each word in OCR-generated text
    file to x-y bounding box coordinates in
    corresponding page image
      • for prevalent examples of this feature, see Amazon
        Search Inside the Book, Google Book Search, and
        other mass digitization initiatives and
        interfaces
  • ALTO (Analyzed Layout and Text Object) is one
    example (integrated in METAe product, as well as
    METS implementation)
  • PDF and XDOC are other format options
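
The word-to-coordinate mapping can be sketched by reading ALTO-style `String` elements, which carry the word in `CONTENT` and its position in `HPOS`, `VPOS`, `WIDTH`, and `HEIGHT`. The fragment below is a simplified, namespace-free illustration with made-up coordinates, not a complete ALTO file, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified ALTO-style fragment; real files wrap this in more structure
# and declare a namespace.
ALTO_FRAGMENT = """
<TextLine>
  <String CONTENT="Advanced" HPOS="100" VPOS="200" WIDTH="180" HEIGHT="40"/>
  <String CONTENT="Image" HPOS="300" VPOS="200" WIDTH="110" HEIGHT="40"/>
  <String CONTENT="Capture" HPOS="430" VPOS="200" WIDTH="150" HEIGHT="40"/>
</TextLine>
"""

def word_boxes(xml_text):
    """Return (word, (left, top, right, bottom)) for every String element."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for string_el in root.iter("String"):
        left = float(string_el.get("HPOS"))
        top = float(string_el.get("VPOS"))
        right = left + float(string_el.get("WIDTH"))
        bottom = top + float(string_el.get("HEIGHT"))
        boxes.append((string_el.get("CONTENT"), (left, top, right, bottom)))
    return boxes

def find_hits(boxes, term):
    """Boxes to highlight on the page image for a case-insensitive word match."""
    term = term.lower()
    return [box for word, box in boxes if word.lower() == term]

boxes = word_boxes(ALTO_FRAGMENT)
hits = find_hits(boxes, "image")
```

A delivery application would draw the returned rectangles over the page image, which is how search-term highlighting on scanned pages is typically achieved.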

35
Administrative and deposit metadata
  • Administrative metadata pertinent to text objects
      • page images (Z39.87, NISO MIX)
      • structural metadata (sections internal to METS,
        TEI)
      • ownership, rights, governance (possibilities with
        PREMIS implementations)
  • Metadata schemes for object packaging and
    transfer
      • METS gaining adoption among preservation
        repositories in US (e.g., CDL, FCLA, Harvard)

36
Best-Fit Processes
  • Scanning, OCR/keying, Metadata production:
    structural, administrative, packaging
  • Cataloging, object naming (for linking)

37
Scanning quality/risk assessments
  • Rendering intent
      • List required image types and quality
        requirements
  • Disposition/handling policies for source
    materials
      • May materials be transformed (disbound) for
        scanning?
      • Must restrictions be placed on handling?
  • Viable sources
      • Bound print, unbound print, microfilm, photocopy

38
Scanning technology assessments
  • Which technology?
      • auto-page turning scanner (Kirtas, 4digitalbooks)
      • high-end digital camera with custom cradle(s)
      • overhead bookscanner (Zeutschel, i2s Copibook)
      • flatbed scanner
      • single-pass autofeed (Fujitsu duplex models)
  • Materials handling, image quality, throughput,
    operator skill, budgets vary (among items,
    projects, collections, institutions): no best
    system

39
Creating machine-readable text
  • Manual approach: keying (keyboarding,
    transcribing)
      • viable from all sources, with high quality
        (accuracy)
      • many outsourcing options
  • Automated approach: OCR
      • viable from all page images with machine-printed
        content in supported languages
      • uncorrected OCR inexpensive, but contains errors
      • complexity of source, page image quality relevant
      • pre-OCR image processing (dewarping) helps
      • ABBYY FineReader most widely used application

40
Producing structural metadata
  • Challenging attributes
      • pagination: no standard rules for implied numbers
      • wide variety of layouts present in historic
        materials
      • no standardized vocabulary to record features
  • Technology: automated layout analysis
      • widely used in production of e-newspapers
      • docWORKS/METAe (Metadata Engine) working well for
        range of 19th and 20th-c. published materials
      • expect errors that need to be human-corrected
      • efficiencies possible by mapping data from
        catalogs

41
Producing administrative metadata
  • Purpose/mandate
      • facilitate management of objects, their component
        parts, and access to them
  • Workflow
      • can be generated automatically for some formats
        (Z39.87 for still images) as outputs of scanning
        and/or validation (JHOVE) processes
  • Standards
      • PREMIS implementation efforts are important; they
        offer promise for standardization and additional
        automation
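
As an illustration of why technical metadata for still images can be generated automatically, the sketch below reads a few baseline tags straight from a TIFF header. It is not how JHOVE works internally; it handles only single inline SHORT or LONG values, and the field names and the tiny in-memory sample are assumptions made for the example.

```python
import struct

# Baseline TIFF tag numbers mapped to illustrative field names.
TAGS = {256: "imageWidth", 257: "imageHeight",
        258: "bitsPerSample", 259: "compression"}

def read_tiff_tags(data):
    """Extract selected tag values from the first image directory of a TIFF."""
    byte_orders = {b"II": "<", b"MM": ">"}
    if data[:2] not in byte_orders:
        raise ValueError("missing TIFF byte-order mark")
    order = byte_orders[data[:2]]
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    found = {}
    for i in range(count):
        start = ifd_offset + 2 + 12 * i  # each directory entry is 12 bytes
        tag, field_type, n = struct.unpack(order + "HHI", data[start:start + 8])
        # This sketch reads only single SHORT (3) or LONG (4) values,
        # which are stored inline in the entry.
        if tag in TAGS and n == 1 and field_type in (3, 4):
            fmt = order + ("H" if field_type == 3 else "I")
            size = struct.calcsize(fmt)
            (value,) = struct.unpack(fmt, data[start + 8:start + 8 + size])
            found[TAGS[tag]] = value
    return found

def entry(tag, field_type, value):
    """Hypothetical helper: pack one little-endian directory entry."""
    if field_type == 3:
        return struct.pack("<HHIHH", tag, field_type, 1, value, 0)
    return struct.pack("<HHII", tag, field_type, 1, value)

# Minimal in-memory sample: header, three entries, no pixel data.
sample = (b"II" + struct.pack("<HI", 42, 8) + struct.pack("<H", 3)
          + entry(256, 4, 2400) + entry(257, 4, 3600) + entry(258, 3, 8)
          + struct.pack("<I", 0))
technical = read_tiff_tags(sample)
```

Because these values live in the file itself, they can be harvested during scanning or validation without anyone keying them.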

42
Producing packaging metadata (additional XML)
  • Function
      • producing Submission Information Packages (SIPs)
        from your well-made, multi-page objects
  • Implementation
      • Most preservation repositories prescribe
        formats for SIPs and provide tools to deposit
        agents (CDL, OCLC)
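
Each repository prescribes its own SIP format, so the following is only a generic sketch of one common packaging step: listing every file in the package with a checksum so the repository can verify the transfer. The manifest layout, file names, and contents are hypothetical and do not reflect any particular repository's requirements.

```python
import hashlib

def build_manifest(files):
    """Return one "checksum  filename" line per file, sorted by filename.

    files maps a filename to the file's content as bytes.
    """
    lines = []
    for name in sorted(files):
        digest = hashlib.sha256(files[name]).hexdigest()
        lines.append(f"{digest}  {name}")
    return lines

# Hypothetical package contents for a one-page object.
package = {
    "0001.jp2": b"page image bytes",
    "0001.txt": b"page text",
    "mets.xml": b"<mets/>",
}
manifest = build_manifest(package)
```

On receipt, the repository can recompute each checksum and compare it with the manifest before accepting the deposit.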

43
Thanks
  • Stephen Chapman
  • stephen_chapman_at_harvard.edu
  • Additional resources
      • http://preserve.harvard.edu/bibliographies/textconversion.pdf
      • http://preserve.harvard.edu/bibliographies/ocr.pdf
