Title: Down and Dirty Digitization: Everything you need to know about putting content online
1Down and Dirty Digitization: Everything you need to know about putting content online
- Roy Tennant
- California Digital Library
2Outline
- Project Planning
- Selecting Material to Digitize
- Digitization Purpose
- Basic Imaging Principles
- Capturing Images
- Editing Images
- Best Practices
- Conversion to Text
- Metadata
- Access Systems
- Skills Required of Staff
- Preservation
3Project Planning
- Who will do the work?
- What systems will be required?
- What are the specifications for images and metadata?
- How much will the project cost?
- Who will own and manage the digital products that will be produced?
Steve Chapman, from Handbook for Digital
Projects, NEDCC
4Selecting Material to Digitize
- Publishing rights
- Available support/funding opportunity
- Critical mass
- Uniqueness
- Reputation
- Audience and potential use
- Diversity of material type
- Ability to stand on its own and fit in with other
collections
5What Do We Preserve?
- The body or the soul?
- The artifact
- The intellectual content
- How do we decide that the artifact has preservation value?
- Who decides?
6The Artifact
- The look and feel
- The experience of interacting with a specific object
- Consequences:
- Choices for providing access are limited
- Time and money spent on recreating the artifact may be better spent on increasing access
- In some cases, preserving the look and feel actually harms other uses
8Written Material
- Handwritten texts (diaries, etc.), or those with handwritten notations (manuscript drafts, etc.), can easily be considered to have artifactual value
- But how much artifactual value do printed texts have?
- And born-digital texts?
- What's it worth to you?
9If the goal of preservation is persistent
utility, then functionality rather than
aesthetics should drive system design.
Stephen Chapman, "Content Follows Form: Preservation via Systems Design," Microform & Imaging Review
10Persistent Utility
- Form must be allowed to be altered or destroyed to retain or enhance function
- If function cannot be retained or enhanced, then form should be preserved
11Considerations for Retaining Items in Original
Format
- Age
- Evidential value
- Aesthetic value
- Scarcity
- Associational value
- Market value
- Exhibition value
12The issue is not to evaluate the artifact per se to determine what survives and what does not. The issue is the need to agree on a method for interrogating the individual artifact, that would, in a climate of finite resources, help make a good decision about whether and how to preserve it.
Council on Library and Information Resources, The Evidence in Hand: Report of the Task Force on the Artifact in Library Collections
13How Do We Preserve It?
Preservation costs by method calculated by the
Library of Congress Preservation Directorate
14Types of Materials
Printed text/ Simple line art
Mixed
Halftones
Manuscripts
Continuous Tone
From Anne Kenney, et al., Moving Theory into Practice
15Benchmarking
- The process whereby you determine your
digitization requirements using the material you
will digitize
16Resolution
The number of pixels in a given area defines the
resolution of an image
One pixel
1
500 x 1,000 pixels
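A minimal sketch of the arithmetic behind resolution, assuming the usual relationship between physical size, dots per inch, and pixel dimensions. The function names and the letter-size example are illustrative, not from the slides.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return (width_px, height_px) for an original scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def total_pixels(width_px, height_px):
    """Return the total number of pixels in an image."""
    return width_px * height_px


# Hypothetical example: a letter-size page scanned at 300 dpi.
letter_300 = pixel_dimensions(8.5, 11, 300)
print(letter_300, total_pixels(*letter_300))

# The 500 x 1,000 pixel image from the slide.
print(total_pixels(500, 1000))
```

Doubling the dpi doubles each dimension and so quadruples the pixel count, which is why resolution choices drive storage needs so strongly.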
17Dynamic Range (bit-depth)
Example images: 1 bit; 8 bit grayscale; 8 bit color; 24 bit color (formats shown: GIF, GIF, JPEG)
- 1 bit: black or white
- 8 bits: 256 shades
- 16 bits: thousands
- 24 bits: millions
- 36 bits: billions
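The counts on this slide follow from the fact that n bits can encode 2 to the power n distinct values. A short sketch (the function name is illustrative):

```python
def tonal_values(bit_depth):
    """Number of distinct values a pixel can take at a given bit depth."""
    return 2 ** bit_depth


for bits in (1, 8, 16, 24, 36):
    print(f"{bits:>2} bit: {tonal_values(bits):,} possible values")
```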
18RGB Color Space
Color channels: Red, Green, Blue
8 bits per channel = 24 bit color image
12 bits per channel = 36 bit color image
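To show how three channels add up to one pixel value, here is a sketch that packs red, green, and blue into a single integer. The red-green-blue packing order and the function names are assumptions made for illustration; real file formats define their own layouts.

```python
def pack_rgb(red, green, blue, bits_per_channel=8):
    """Pack three channel values into one integer (red in the high bits)."""
    limit = (1 << bits_per_channel) - 1
    for value in (red, green, blue):
        if not 0 <= value <= limit:
            raise ValueError(f"channel value {value} outside 0..{limit}")
    return (red << (2 * bits_per_channel)) | (green << bits_per_channel) | blue


def unpack_rgb(pixel, bits_per_channel=8):
    """Split a packed pixel back into (red, green, blue)."""
    mask = (1 << bits_per_channel) - 1
    return (
        (pixel >> (2 * bits_per_channel)) & mask,
        (pixel >> bits_per_channel) & mask,
        pixel & mask,
    )


orange = pack_rgb(255, 128, 0)            # 8 bits per channel, 24 bits total
print(hex(orange), unpack_rgb(orange))
```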
19Image Compression
- Lossless: the image is unchanged after compression (no image data is lost)
- Typical file size: 50% of original
- Example: LZW compression
- Lossy: the image is altered after compression (image data is lost)
- Example: JPEG
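The difference between the two kinds of compression can be sketched with the standard library alone. Note the substitutions: zlib (DEFLATE) stands in for LZW, and throwing away low-order bits stands in for JPEG. Neither is the actual algorithm named on the slide, and the synthetic gradient is not a real scan, so the sizes printed say nothing about typical ratios.

```python
import zlib

# Synthetic 8-bit grayscale "image": 100 rows of a 256-step gradient.
pixels = bytes(range(256)) * 100

# Lossless: decompressing returns exactly the bytes that went in.
packed = zlib.compress(pixels)
restored = zlib.decompress(packed)
lossless_ok = restored == pixels

# Lossy (crude stand-in): discard the low 4 bits of every pixel.
# The discarded detail cannot be recovered afterwards.
quantized = bytes(p & 0xF0 for p in pixels)
lossy_changed = quantized != pixels

print("original bytes:  ", len(pixels))
print("lossless packed: ", len(packed), "identical after round trip:", lossless_ok)
print("lossy differs from original:", lossy_changed)
```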
20TIFF
- Tagged Image File Format
- Most often used to save master versions of images (unedited)
- Can be compressed or uncompressed
21Compuserve GIF
- Graphics Interchange Format (GIF)
- Maximum 8 bits/pixel = 256 colors (shades)
- Good for:
- Text and line art
- Thumbnails
- Not good for:
- Full-color pictures
- Anything that requires more than 256 colors
22JPEG
- Joint Photographic Experts Group
- JPEG is actually a compression scheme; the image file format is JFIF (JPEG File Interchange Format)
- Good for:
- Full-color pictures
- Anything that requires more than 256 colors
- Not good for:
- Text or line art
23New Image Formats
- Portable Network Graphics (PNG) - from the W3C, to replace the Compuserve GIF format and provide more capabilities
- JPEG2000 - an upgrade of the JPEG format
- Flashpix - from a consortium of commercial companies, to provide much higher-resolution images in a way that allows speedy network delivery
- MrSID - from LizardTech, good for large format materials (maps, panoramic photos, etc.)
24Capturing Images
- Technologies
- Digital Cameras
- Flatbed Scanners
- Film Scanners
- Kodak PhotoCD
- Outsourcing
- Standards and Best Practices
25Digital Cameras
- Phase One PowerPhase FX: 10,500 x 12,600 pixels, 760MB (48 bit RGB)
- BetterLight Super6K: 6,000 x 8,000 pixels, 136MB (24 bit RGB), $16,990
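The file sizes quoted for these cameras are roughly what uncompressed pixel data would occupy. A sketch of that calculation, assuming size = width x height x bits per pixel / 8 and ignoring file headers (the function names are illustrative):

```python
def uncompressed_bytes(width_px, height_px, bits_per_pixel):
    """Raw size of the pixel data, ignoring headers and compression."""
    return width_px * height_px * bits_per_pixel // 8


def as_mib(num_bytes):
    """Convert bytes to mebibytes (1 MiB = 1,048,576 bytes)."""
    return num_bytes / (1024 * 1024)


powerphase = uncompressed_bytes(10_500, 12_600, 48)   # 793,800,000 bytes
super6k = uncompressed_bytes(6_000, 8_000, 24)        # 144,000,000 bytes

print(f"PowerPhase FX: {as_mib(powerphase):.0f} MiB")
print(f"Super6K:       {as_mib(super6k):.0f} MiB")
```

Both results land within a few megabytes of the figures on the slide; the small differences likely come from rounding or from how a megabyte is counted.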
26Flatbed Scanners
- Minimum requirements:
- 600 x 1200 dpi optical resolution
- 36-bit color
- Not for slides or transparencies; best for 8 1/2 x 11 or 8 1/2 x 14 originals
- Sheet feeder (often optional) helpful for digitizing text
27Film Scanners
- For 35mm slides and negatives; others available for larger formats
- $600 - $3,000
- Most around 2700-4000 dpi, 30-36 bit color
28Kodak PhotoCD
- Take pictures with a normal camera, but have your pictures developed onto a PhotoCD
- A proprietary image format (ImagePAC), but very high resolution (4 different resolutions)
29Outsourcing: Pros and Cons
- Benefits
- No ramp-up costs (both time and money)
- Probably higher quality, at least to begin with
- High volume capability
- Drawbacks
- May be more costly if you have underutilized staff time
- No internal capability or experience developed (that is, when the money runs out, so does your chance to do anything more)
- Rare items may require in-house digitization
30Outsourcing: How
- Write an RFQ (Request for Quote) outlining:
- Type and amount of material being digitized
- Quality requirements
- Volume per unit of time requirements
- For RFQ guidance and samples, see RLG Tools for Digital Imaging - www.rlg.org/preserv/RLGtools.html
31Digital Image Work Flow
- Original: TIFF or PCD, 10-100MB (stored offline)
- Edit: Rotate, Crop, Retouch, Brightness/Contrast
- Derive: Resize, Sharpen
- JPEG, 100K (RGB color space; stored online)
- GIF, 10K (indexed color space; stored online)
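The resize step of this workflow can be planned before any image software is opened. The sketch below computes derivative dimensions from a master's dimensions while keeping the aspect ratio. The target sizes (1024 and 150 pixels on the longest side) and all names are hypothetical choices for illustration, not values from the slides.

```python
# Hypothetical target sizes, in pixels on the longest side.
DERIVATIVES = {"jpeg_access": 1024, "gif_thumbnail": 150}


def fit_within(width_px, height_px, max_side):
    """Scale so the longest side is at most max_side; never enlarge."""
    longest = max(width_px, height_px)
    if longest <= max_side:
        return width_px, height_px
    # Integer arithmetic keeps the result deterministic (rounds down).
    return (
        max(1, width_px * max_side // longest),
        max(1, height_px * max_side // longest),
    )


def plan_derivatives(master_w, master_h):
    """Return the pixel dimensions of each online derivative."""
    return {
        name: fit_within(master_w, master_h, side)
        for name, side in DERIVATIVES.items()
    }


print(plan_derivatives(6000, 8000))
```

The master file itself is never resized; every derivative is computed from it, so a new derivative size can be added later without rescanning.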
32Editing Images
- Rotating
- Cropping
- Retouching
- Adjusting
- Resizing
- Sharpening
- Saving
33Image Editing Demonstration
34Conversion to Text
- Optical Character Recognition (OCR) software is required (Caere OmniPage Pro, Xerox TextBridge, etc.)
- Quality and typography of originals is key
- If OCR accuracy is below 99.5%, it is less expensive to have the text re-keyed offshore
- For some applications, uncorrected text is sufficient
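The 99.5% rule of thumb can be applied to a proofread sample. This sketch assumes accuracy is measured per character, as (total characters - errors) / total characters; the slide does not say how accuracy is measured, and the function names are illustrative.

```python
REKEY_THRESHOLD = 0.995  # the 99.5% figure from the slide


def character_accuracy(total_chars, error_chars):
    """Fraction of characters the OCR engine got right in a sample."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return (total_chars - error_chars) / total_chars


def recommend(total_chars, error_chars):
    """Suggest correcting OCR output or re-keying, per the rule of thumb."""
    accuracy = character_accuracy(total_chars, error_chars)
    if accuracy < REKEY_THRESHOLD:
        return "re-key"
    return "correct OCR output"


# Hypothetical samples of 2,000 characters each.
print(recommend(2000, 8))    # 99.6% accurate
print(recommend(2000, 20))   # 99.0% accurate
```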
35Imaging Best Practices
- General guidelines for archival versions
- Photos, illustrations, maps, etc.
- 300-600dpi
- 24-36 bit color
- B/W Text document
- 300-600dpi
- 8 bit grayscale
- Negatives and Slides
- 2000-4000 pixels in longest dimension
- 24-36 bit color for color; 8 bit grayscale for B/W
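These guidelines can be written down as a small lookup table so that scan settings are checked the same way each time. The table below restates the slide's ranges; the material keys and the function name are made up for illustration, and treating each range as a strict minimum and maximum is an assumption.

```python
# Ranges restated from the slide. Film resolution is in pixels on the
# longest dimension; the other resolutions are in dpi.
GUIDELINES = {
    "photo":      {"resolution": (300, 600),   "bit_depth": (24, 36)},
    "bw_text":    {"resolution": (300, 600),   "bit_depth": (8, 8)},
    "color_film": {"resolution": (2000, 4000), "bit_depth": (24, 36)},
    "bw_film":    {"resolution": (2000, 4000), "bit_depth": (8, 8)},
}


def meets_guideline(material, resolution, bit_depth):
    """True if both settings fall inside the guideline ranges."""
    guide = GUIDELINES[material]
    low_res, high_res = guide["resolution"]
    low_bits, high_bits = guide["bit_depth"]
    return low_res <= resolution <= high_res and low_bits <= bit_depth <= high_bits


print(meets_guideline("photo", 400, 24))
print(meets_guideline("bw_text", 200, 8))
```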
36Imaging Best Practices
The key to image quality is not to capture at
the highest resolution or bit depth possible, but
to match the conversion process to the
informational content of the original, and to
scan at that level--no more, no less.
Moving Theory Into Practice
37Metadata: Types
- Structured description of an object or collection of objects
- Three basic types:
- descriptive - e.g., title, creator, subject - used for discovery
- administrative - e.g., resolution, bit depth - used for managing the collection
- structural - e.g., table of contents page, page 34, etc. - used for navigation
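One way to keep the three metadata types distinct is to store them as separate parts of each record. This is a sketch only: the class, the field names, and the sample values are hypothetical and do not follow any particular standard.

```python
from dataclasses import dataclass, field


@dataclass
class ItemMetadata:
    descriptive: dict = field(default_factory=dict)     # used for discovery
    administrative: dict = field(default_factory=dict)  # used for managing
    structural: dict = field(default_factory=dict)      # used for navigation


# Hypothetical record for one scanned page.
record = ItemMetadata(
    descriptive={"title": "Sample diary", "creator": "Unknown", "subject": "Diaries"},
    administrative={"resolution_dpi": 600, "bit_depth": 24},
    structural={"sequence": 34, "label": "page 34"},
)

print(record.descriptive["title"], record.structural["label"])
```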
38Metadata: Appropriate Level
- Collection-level access
- Discovery metadata describes the collection
- Example: archival finding aid encoded in SGML; see http://www.oac.cdlib.org/
- Item-level access
- Discovery metadata describes the item
- Example: individual metadata records for each item; see http://jarda.cdlib.org/cgi-bin/imagesearch.pl
39Collection Level Access
Images
Individual Finding Aid
Search Interface (Library catalog or dedicated)
Individual Finding Aid
43Item Level Access
Finding Aids
Images
Search Interface (Dedicated)
44jarda.cdlib.org/search.html
45Metadata: Granularity
- William Randolph Hearst
- William Randolphmiddle Hearst
- Consider all uses for the metadata
- Design for the most granular use
- Store it in a machine-parseable format
46Metadata: Qualification
- William Randolph
Hearst - Builder -- Castles --
Southern California
47Metadata: Machine Parseability
- The ability to pull apart and reconstruct metadata via software
- For example, this:
William Randolphmiddle Hearst
- Can easily become this:
Hearst, William Randolph
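A sketch of why granular storage pays off: when the parts of a name are kept in separate fields, software can rebuild either form. The class and the field names (first, middle, last) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class PersonalName:
    first: str
    middle: str
    last: str

    def display(self):
        """Natural order, e.g. for a caption."""
        return " ".join(part for part in (self.first, self.middle, self.last) if part)

    def sort_form(self):
        """Inverted order, e.g. for an index or browse list."""
        given = " ".join(part for part in (self.first, self.middle) if part)
        return f"{self.last}, {given}" if given else self.last


hearst = PersonalName(first="William", middle="Randolph", last="Hearst")
print(hearst.display())
print(hearst.sort_form())
```

Going the other way, from a single undifferentiated string to its parts, is unreliable, which is the case for designing for the most granular use up front.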
48Metadata Standards
- Metadata
- Collection Level
- Encoded Archival Description (EAD) - lcweb.loc.gov/ead/
- Item Level
- MARC
- Dublin Core - purl.org/DC/
- MODS - www.loc.gov/standards/mods/
- Harvesting
- Open Archives Initiative, www.openarchives.org
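As a small illustration of an item-level record, the sketch below writes three Dublin Core elements as XML with the standard library. The namespace URI is the Dublin Core element set's; the wrapping "record" element, the function name, and the sample values are hypothetical, and a real system would follow whatever container its harvesting or cataloging standard requires.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Serialize (element name, value) pairs as Dublin Core XML."""
    root = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample values.
xml_text = dublin_core_record([
    ("title", "Sample photograph"),
    ("creator", "Hearst, William Randolph"),
    ("subject", "Castles"),
])
print(xml_text)
```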
49Access Systems
50Access Systems: Exhibit
- Goals
- Inviting
- Easy to navigate
- Highlight selected parts of a collection
- Teach
- Requirements
- Great graphic design
- Informative and succinct commentary
- Interesting subject matter
53Access Systems: Browse
- Goals
- Provide intriguing and interesting paths into and throughout a collection
- Give a broad sense of a collection, but not necessarily show everything
- Requirements
- Logical browse paths
- May have multiple paths to the same items (e.g., time, geography, subject)
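Multiple browse paths to the same items fall out naturally when each item carries several facets. A sketch, with invented items and facet names:

```python
from collections import defaultdict

# Hypothetical items; each one carries a time, place, and subject facet.
ITEMS = [
    {"id": "img001", "decade": "1920s", "place": "San Simeon", "subject": "Castles"},
    {"id": "img002", "decade": "1920s", "place": "San Francisco", "subject": "Newspapers"},
    {"id": "img003", "decade": "1930s", "place": "San Simeon", "subject": "Castles"},
]


def build_browse_paths(items, facets):
    """Group item ids under every value of every facet."""
    paths = {facet: defaultdict(list) for facet in facets}
    for item in items:
        for facet in facets:
            paths[facet][item[facet]].append(item["id"])
    return paths


paths = build_browse_paths(ITEMS, ["decade", "place", "subject"])
print(dict(paths["decade"]))
print(dict(paths["place"]))
```

Here img001 is reachable by decade, by place, and by subject, without being stored three times.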
57Access Systems: Search
- Goals
- To provide post-coordinate access to all items in a collection relevant to a particular query
- To provide good methods to create a search as well as refine or alter the display as required
- Requirements
- Good search software (database or indexing software)
- Good metadata (minimum is probably a title or caption for each item)
- Good interface (options for navigation, search refinement, etc.)
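Post-coordinate access means terms are combined when the query is made rather than when the item is cataloged. A minimal sketch of that idea is an inverted index over captions with AND matching. The captions and function names are invented, and real search software does far more (ranking, stemming, phrase matching).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a string and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(captions):
    """Map each word to the set of item ids whose caption contains it."""
    index = defaultdict(set)
    for item_id, caption in captions.items():
        for word in tokenize(caption):
            index[word].add(item_id)
    return index


def search(index, query):
    """Return ids of items whose captions contain every query word."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical captions, the minimum metadata the slide suggests.
CAPTIONS = {
    "img001": "Castle under construction, San Simeon",
    "img002": "Newspaper office, San Francisco",
    "img003": "Castle gardens, San Simeon",
}
index = build_index(CAPTIONS)
print(sorted(search(index, "castle simeon")))
print(sorted(search(index, "san")))
```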
59Skills Required of Staff
- Imaging
- OCR
- Markup languages (HTML, XML)
- Cataloging metadata
- Indexing and database technology
- User interface design
- Programming
- Web technology
- Project management
60How Does Digital Data Die?
- Let me count the ways
- New replaces old
- Death of a sponsor
- Sponsor loses interest
- Lost functionality
- Format rot
- Media format obsolescence
- Content format obsolescence
- Disaster
61Preserving Digital Content
- No preservation format
- Digital preservation techniques
- Print (on acid free paper!)
- Store
- Refresh
- Encapsulate
- Emulate
- Proliferate (Lots Of Copies Keep Stuff Safe or
LOCKSS)
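Keeping many copies only helps if the copies are compared from time to time. The sketch below hashes each copy and flags those that disagree with the majority. It is a simplified illustration of the idea, not the LOCKSS system's actual protocol; the site names and function names are invented.

```python
import hashlib
from collections import Counter


def digest(data):
    """SHA-256 fingerprint of a copy's bytes."""
    return hashlib.sha256(data).hexdigest()


def audit_copies(copies):
    """Return (majority digest, sorted locations whose copy disagrees)."""
    digests = {location: digest(data) for location, data in copies.items()}
    majority, _count = Counter(digests.values()).most_common(1)[0]
    damaged = sorted(loc for loc, value in digests.items() if value != majority)
    return majority, damaged


# Hypothetical copies held at three sites; one has a corrupted byte.
copies = {
    "site_a": b"master image bytes",
    "site_b": b"master image bytes",
    "site_c": b"master image bytez",
}
majority, damaged = audit_copies(copies)
print("copies needing repair:", damaged)
```

A damaged copy can then be refreshed from any site that still matches the majority.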
62Preserving Digital Content
- Institutional commitment
- Consortial agreements
- Cooperatively funded central repositories
- Preservation Open Market
63The Best Defense
- What will ensure that material will not be preserved?
- Ignorance of its existence
- Ignorance of its worth
- Inability or unwillingness to pay for its preservation
- Access helps with all of these problems