A Standardized DigiTool Ingest Approach to Internet Archive Digitized Books

1
A Standardized DigiTool Ingest Approach to
Internet Archive Digitized Books
  • Joseph Shubitowski (jshubitowski@getty.edu)
  • IGeLU 2008, September 9, 2008

2
Talking Points
  • Scope / Background
  • Why?
  • Major hurdles
  • Manual / automated workflows
  • Outcomes
  • What can we share?
    • Results
    • Methodologies
    • Tools, etc.

IGeLU Conference 2008, September 9, 2008
3
Alfred P. Sloan Foundation
  • Getty Research Institute: Archaeology and antiquities
  • Boston Public Library: John Adams collection
  • Johns Hopkins: Anti-slavery materials
  • The Metropolitan Museum of Art: Museum publications
  • Bancroft Library: Gold Rush and westward expansion
4
Scope of the Digitization Project
  • 2,000,000 pages or approx. 5,000 books
  • Self-evident collection
  • Public domain
    • pre-1923 for works published in the U.S.
    • pre-1909 for works published outside of the U.S.

5
1 Pod: 10 Scribe Stations
Internet Archive Scribe Station
6
Why Do It?
  • Internet Archive issues
    • Response/search time
    • Metadata-only searching
    • No control
  • Full-text searching
  • Use in metasearch
  • More control!

7
Major Hurdles
  • Getting the files
  • Disk space issues for general storage and for
    DTL
  • What/how to process all the files
  • ABBYY OCR vs. ALTO OCR
  • Thumbnail generation
  • Handle configuration/synchronization

8
Workflow diagram:
  • Link to URLs
  • List of OCR'd books received from Internet Archive
  • URLs from Internet Archive
  • Download files from Internet Archive
  • Process downloaded files
  • Ready for DigiTool ingest
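
The "URLs from Internet Archive" and "Download files" steps can be sketched roughly as below. This is only an illustration, not the Getty scripts: the `https://archive.org/download/<identifier>/<file>` pattern is the public download path, while the file suffixes and the sample identifier are assumptions that should be checked against each item's actual file listing.

```python
from urllib.parse import quote

IA_DOWNLOAD_BASE = "https://archive.org/download"

# Assumed derivative suffixes (hypothetical for this project): the OCR file,
# the page images, the MARCXML record, and the PDF.
DERIVATIVE_SUFFIXES = ("_abbyy.gz", "_jp2.zip", "_marc.xml", ".pdf")


def download_urls(identifier, suffixes=DERIVATIVE_SUFFIXES):
    """Build one download URL per expected file for an Internet Archive item."""
    item = quote(identifier, safe="")
    return [f"{IA_DOWNLOAD_BASE}/{item}/{item}{suffix}" for suffix in suffixes]


# "examplebook00auth" is a made-up identifier used only for illustration.
for url in download_urls("examplebook00auth"):
    print(url)
```

The resulting list could be handed to any downloader; the actual fetching and untarring are left out here.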
12
Disk Space Issues
  • Each digitized book: 500 MB to 1.5 GB of raw
    files
  • Further untarring and processing consume even
    more disk!
  • DTL scratch/processing space, permanent storage
    space, and Oracle tablespace (including full-text
    indexing) consume even more disk space
  • 3,000 books in the queue will require 10-15 TB for
    this project alone.
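
As a rough check on these figures, the raw-file share of the total can be worked out directly. The sketch below assumes decimal units (1 TB = 1000 GB); the per-book range, queue size, and projected total are taken from the slide.

```python
BOOKS_IN_QUEUE = 3000
RAW_GB_PER_BOOK = (0.5, 1.5)   # 500 MB to 1.5 GB of raw files per book
PROJECTED_TB = (10, 15)        # total quoted on the slide


def raw_total_tb(books, gb_per_book):
    """Raw download size in TB, assuming 1 TB = 1000 GB."""
    return books * gb_per_book / 1000


low, high = (raw_total_tb(BOOKS_IN_QUEUE, gb) for gb in RAW_GB_PER_BOOK)
print(f"Raw files alone: {low} to {high} TB")
print(f"Projected total: {PROJECTED_TB[0]} to {PROJECTED_TB[1]} TB")
```

The raw downloads alone come to 1.5 to 4.5 TB, so most of the projected 10-15 TB is the untarred copies, DTL scratch and permanent storage, and the Oracle tablespace listed above.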

13
  • DTL ingest package
    • Archive: raw jpeg2000 (renamed to .j2k)
    • View: use copy jpeg2000 (.jp2)
    • Index: ALTO files
    • Thumbnail: appropriate thumb of title page for
      display of the complex object
    • PDF: high-res PDF as additional manifestation
    • MARCXML: record for IE-level metadata
  • No TIF files from IA; everything is jpeg2000
  • Mapping file: same for every ingest
  • CSV file is produced automatically
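
One way to picture the package is as a mapping from staged files to the roles listed above. The sketch below is a guess at how such a sort could look, not the project's staging code: the role names and the file-naming rules (`_marc.xml`, `thumb` prefix) are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def classify(filename):
    """Return the assumed package role for one staged file."""
    name = PurePosixPath(filename).name.lower()
    if name.endswith(".j2k"):
        return "archive"      # raw jpeg2000, renamed to .j2k
    if name.endswith(".jp2"):
        return "view"         # use copy
    if name.endswith("_marc.xml"):
        return "marcxml"      # hypothetical naming rule for the IE record
    if name.endswith(".xml"):
        return "index"        # remaining XML is assumed to be per-page ALTO
    if name.endswith(".pdf"):
        return "pdf"
    if name.startswith("thumb"):
        return "thumbnail"    # hypothetical naming rule
    return "unknown"


def plan_package(filenames):
    """Group staged files by role, keeping each group sorted."""
    groups = defaultdict(list)
    for filename in sorted(filenames):
        groups[classify(filename)].append(filename)
    return dict(groups)


def archive_name(jp2_name):
    """Name of the archive copy for a raw .jp2 page image."""
    return str(PurePosixPath(jp2_name).with_suffix(".j2k"))
```

A plan like this could also drive the automatically produced CSV, although the CSV layout DigiTool expects is not described in the slides.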

14
ABBYY to ALTO
  • IA scanning produces one huge OCR file in ABBYY
    proprietary XML
  • Discussions with / proposal from CCS
  • Real need for an open source approach
    • ABBYY XSD can morph in future
    • Desire to share
  • Contract with Ex Libris to produce tool
    • Java based
    • Includes jar and class files
    • Free to share and redistribute
  • Tool transforms a single ABBYY file into one
    ALTO XML file per page
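
The one-file-per-page idea can be illustrated with a small streaming split. This is not the Java tool described above and it does not perform the ABBYY-to-ALTO mapping; it only shows how a single large XML file could be cut at its page elements. It assumes pages appear as elements named `page`, and the namespace in the sample is made up, which also shows why matching on the local name tolerates schema changes.

```python
import io
import xml.etree.ElementTree as ET


def split_pages(source):
    """Yield (filename, xml_text) for each <page> element in an OCR file.

    Elements are matched by local name so the namespace can change, and each
    page is cleared after use to keep memory low on very large files.
    """
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] == "page":
            count += 1
            yield f"page_{count:04d}.xml", ET.tostring(elem, encoding="unicode")
            elem.clear()


# Tiny stand-in document; the namespace URI here is hypothetical.
SAMPLE = b"""<document xmlns="http://example.org/hypothetical-abbyy-schema">
  <page width="100" height="200"><block>Title page</block></page>
  <page width="100" height="200"><block>Preface</block></page>
</document>"""

pages = list(split_pages(io.BytesIO(SAMPLE)))
for name, _xml in pages:
    print(name)
```

In a real transform each yielded page would be converted to ALTO markup before being written out.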

15
Thumbnail Creation
  • Initial ingest flow created complex object
    thumbnail from first page of PDF manifestation
    • Boring!
    • Ghostscript/PDF/ImageMagick problems
  • Decision to go semi-manual with script/cgi that:
    • creates thumbnails for first 15 jpeg2000 page
      images
    • sends URL in email for each separate ingest
    • creates web page for page image viewing and
      thumbnail selection
    • adds chosen thumbnail to staging directory,
      cleans up, and sends confirmation email
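
The first step of that script, picking the first 15 page images and preparing thumbnails for them, might look something like this. The slides do not say which tool produced the thumbnails, so the ImageMagick `convert ... -thumbnail` command, the 200x200 size, and the file names are all assumptions; the command is only built here, not executed.

```python
def thumbnail_candidates(page_images, limit=15):
    """Return the first `limit` jpeg2000 page images in filename order."""
    pages = sorted(
        name for name in page_images
        if name.lower().endswith((".jp2", ".j2k"))
    )
    return pages[:limit]


def convert_command(src, dst, size="200x200"):
    """Build (but do not run) an ImageMagick command for one thumbnail."""
    return ["convert", src, "-thumbnail", size, dst]


# Hypothetical page image names for a 20-page book.
images = [f"book_{n:04d}.jp2" for n in range(20)]
candidates = thumbnail_candidates(reversed(images))
commands = [
    convert_command(src, src.rsplit(".", 1)[0] + "_thumb.jpg")
    for src in candidates
]
```

The candidate list would then feed the selection web page, where a person chooses the title-page thumbnail.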

17
Handle Generation
  • Setup per DTL docs
  • Firewall tweaks
  • Ingest flow tweaks
    • Handle for IE
    • Handles for all archive jpeg2000 images
  • DTL errors with mass publication of Handles
    • Fixed in SP21

18
Ingest Summary
  • Get/process/stage files
  • Generate ALTO OCR files
  • Web CGI for thumbnail selection
  • Load.sh script moves all files to locations DTL
    expects
  • Activate saved Ingest Flow from DTL Web Ingest
    client
  • Wait.........

19
Outstanding Issues
  • Ingest speed
    • Remedied somewhat in SP21
    • Digitized books are just darn big!
    • Low number of ingests per day
  • Handles
    • Manual publishing process
    • Need to populate Voyager bib record
  • METS viewer performance issues

20
Success Factors!!
  • Code to share
    • Get/process/staging scripts
    • ABBYY/ALTO transform code
    • Web CGI thumbnail code
    • YMMV
  • Handles provide true persistent IDs
    • http://hdl.handle.net/10020/17473
  • Full-text multilingual searching
  • MetaLib QuickSet for metasearch of all local
    repositories
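
The persistent ID shown above follows the usual Handle proxy pattern of resolver, prefix, and suffix. A minimal sketch of building and splitting such URLs is given below; the function names are illustrative, and the prefix and suffix are the ones from the slide.

```python
HANDLE_PROXY = "http://hdl.handle.net"


def handle_url(prefix, suffix):
    """Build the resolver URL for a handle such as 10020/17473."""
    return f"{HANDLE_PROXY}/{prefix}/{suffix}"


def split_handle(url):
    """Return (prefix, suffix) from a resolver URL built by handle_url."""
    if not url.startswith(HANDLE_PROXY + "/"):
        raise ValueError(f"not a handle proxy URL: {url}")
    prefix, suffix = url[len(HANDLE_PROXY) + 1:].split("/", 1)
    return prefix, suffix


print(handle_url("10020", "17473"))
```

A helper like this could be used when populating catalog records with the handle for each ingested book.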

21
Demo and Thanks
  • http://archives.getty.edu
  • jshubitowski@getty.edu
