Digital Content Delivery Platforms

Slides: 35

Provided by: cornelluni

Transcript and Presenter's Notes

1
Digital Content Delivery Platforms

Newspaper Delivery with Greenstone

2
Newspaper Delivery

Requirements
Full Text Search
Photo Search
Chronological Browse
Full Page and Article-Level Display

3
OCLC Olive Prototype
4
OCLC Olive Prototype

OCLC hosted initial 4 years
Little control over delivery mechanism
Annual hosting fees
Initial costs lower than in-house solution
Eventual costs greater than in-house solution
Abandoned

5
CUL Development

OCR for full text searches
Structural metadata for everything else

6
Full Text OCR

Each page scanned from bound volumes or microfilm
Automated
Sophisticated OCR software converts image of text
to text
Optimized for high recall -- Automated

7
Structural Metadata Development

Both automated and manual
Manual for most important information, e.g. dates,
titles and article continuity
Automated for everything else

8
September 6, 1967 page 2, article 2
9
Extrapolating
10
Structural Metadata
11
(No Transcript)
12
Photo Search

Uses photo captions and article content as clues
Example: search for photos of Thomas W. Mackesey,
VP for University planning, circa 1967

13
Photo Search (cont.)
14
Uses of Structural Metadata

Photocaption metatags -- explicit
Articles that contain the text "Mackesey" and also
contain a photo -- implicit

(Diagram labels: article, photo, photocaption)
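
The explicit/implicit distinction above can be sketched in a few lines of Python. This is only an illustration, not the project's code: the Article record, its field names, and the sample data are all assumptions. A caption that names the person counts as an explicit match; an article that mentions the person and carries any photo counts as an implicit one.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Article:
    """Hypothetical article record; the field names are assumptions."""
    article_id: str
    text: str
    photo_captions: List[str] = field(default_factory=list)  # one entry per photo


def find_photos(articles: List[Article], name: str) -> List[Tuple[str, str]]:
    """Return (article_id, kind) pairs, explicit caption matches first."""
    needle = name.lower()
    explicit, implicit = [], []
    for art in articles:
        if any(needle in cap.lower() for cap in art.photo_captions):
            # The caption itself names the person: explicit evidence.
            explicit.append((art.article_id, "explicit"))
        elif art.photo_captions and needle in art.text.lower():
            # The article names the person and has a photo: implicit evidence.
            implicit.append((art.article_id, "implicit"))
    return explicit + implicit


# Made-up sample data for illustration only.
articles = [
    Article("a1", "Planning office opens.", ["Thomas W. Mackesey at the site"]),
    Article("a2", "Mackesey outlined the campus plan.", ["New building model"]),
    Article("a3", "Mackesey spoke on Tuesday."),  # no photo, so never returned
]
hits = find_photos(articles, "Mackesey")
```

Ranking explicit hits ahead of implicit ones is a design choice of this sketch; the slides do not say how results were ordered.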
15
Putting It All Together

Scans are delivered in PDF form.
Metadata delivered in XML form
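
As a rough sketch of how XML metadata might be paired with the PDF scans, the Python below reads one page record and lists its articles. The element names, attribute names, and PDF file-naming pattern are assumptions made for illustration; the vendor's actual schema is not shown in the slides. The sample values reuse the September 6, 1967, page 2, article 2 example.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record; the real vendor schema is not shown in the slides.
SAMPLE = """
<page date="1967-09-06" number="2">
  <article id="2">
    <title>Sample headline</title>
    <photocaption>Sample caption</photocaption>
  </article>
</page>
"""


def load_articles(xml_text):
    """Turn one page-level XML record into a list of article dictionaries."""
    page = ET.fromstring(xml_text.strip())
    records = []
    for art in page.findall("article"):
        records.append({
            "date": page.get("date"),
            "page": int(page.get("number")),
            "article": int(art.get("id")),
            "title": art.findtext("title", default=""),
            "captions": [c.text or "" for c in art.findall("photocaption")],
            # Assumed naming pattern linking the metadata to the page scan.
            "pdf": f"{page.get('date')}_p{page.get('number')}.pdf",
        })
    return records


records = load_articles(SAMPLE)
```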

16
Original Conception: Role of Outside Vendors

iArchives/Byte Managers deliver files one volume
at a time along with thumbnails for navigation
and user interface purposes.
iArchives licensed a customized version of
Greenstone (Greenstone is open source)

17
Original Conception: Role of CUL

Compile customized version of Greenstone on top
of Apache server.
Build each volume.
Long Process for Server (Library23/Libdev),
little human intervention.

18
Perfect Solution Does Not Yet Exist

2/3 of all OCRed text just noise
All noise indexed
Known, homogeneous collection
Uncustomized and SLOW interface
Unintelligible URL: hard to cite source

19
Greenstone and URLs

Typical Greenstone URL:
http://iarchives.library.cornell.edu/cgi-bin/iarchives.cgi?ed-0y18801-y188012cy191452cy193562cy193782cy193782cy1939402cy194122cy19412b2cy194232cy195232cy195342cy195452cy195562cy196012cy196122cy196122cy196562cy19678-01-0-0--010---4------0-1l--1en-50---20-home---0-0031-0010utfZz-8-00adf1cy18801clCL2.1
?????????
Would be easier to cite:
http://cdsun.library.cornell.edu/index.php?cy19678d9.6.D0

20
Aside: Search Effectiveness
If information retrieval were perfect ... Every
hit would be relevant to the original query, and
every relevant item in the body of information
would be found.
Precision: percentage (or fraction) of the hits
that are relevant, i.e., the extent to which the
set of hits retrieved by a query satisfies the
requirement that generated the query.
Recall: percentage (or fraction) of the relevant
items that are found by the query, i.e., the
extent to which the query found all the items
that satisfy the requirement.
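
The two definitions above can be made concrete with a short Python sketch. The function name and the document identifiers are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query.

    retrieved: identifiers the search returned (the hits)
    relevant:  identifiers that actually satisfy the requirement
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_hits = retrieved & relevant
    # Precision: fraction of the hits that are relevant.
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    # Recall: fraction of the relevant items that were found.
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four hits come back and two of them are relevant;
# three relevant items exist in the collection overall.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
```

Here precision is 2/4 and recall is 2/3. An index tuned for high recall, as the OCR slides describe, accepts lower precision in exchange for missing fewer relevant items.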
21
Why So Much Noise?

Optimize search to return all documents relevant
to a particular query.
Unless noise is indexed, may miss important
subtleties
A search for "Nat Arnstein" is much different from
a search for "Matt Arnstein"

22
Problems with interface

Slow
High demand for CPU resources
Uses old-style HTML techniques--no CSS
Hard to consult with layout/interface designers
Interface functionality hard coded

23
The Greenstone User Interface

Contents of the pages are rendered at run time
via macro files.
Algorithms (for-loops) that display content are
hard coded.
Contents such as table headers are assumed to be
contained within macros or a configuration file.

24
The Greenstone User Interface (cont.)

The hard-coded display loops use tables for list
content--a serious problem for design and
interface functionality.
There is no way around this short of changing the
C code.

25
(No Transcript)
26
Solution

Use Greenstone for search capabilities.
Use PHP for everything else since content is
static and regular: we know what to expect.
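
Because the content is static and regular, a chronological browse structure can be computed directly from the issue dates. The sketch below is in Python purely for illustration (the slides say the project pages were written in PHP), and the function name and sample dates are assumptions, apart from September 6, 1967, which appears earlier in the deck.

```python
from collections import defaultdict
from datetime import date


def build_browse_tree(issue_dates):
    """Group issue dates into a year -> month -> [days] structure."""
    tree = defaultdict(lambda: defaultdict(list))
    for d in sorted(issue_dates):
        tree[d.year][d.month].append(d.day)
    # Convert to plain dicts so the result is easy to print or serialize.
    return {year: dict(months) for year, months in tree.items()}


# Sample dates; only 1967-09-06 comes from the slides, the rest are made up.
issues = [date(1967, 9, 6), date(1967, 9, 5), date(1967, 10, 2)]
tree = build_browse_tree(issues)
```

A tree like this could be generated once per build and rendered as static pages, which leaves the search engine to handle only queries.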

27
Introducing the Projects

Cornell Daily Sun
Friend of Man
Preservation News

28
Cornell Daily Sun: 16 years and growing
29
Cornell Daily Sun

Combination Greenstone/In-House Custom PHP Pages
and JavaScript
Search by Greenstone
Layout and Design by Melissa Kuo
Browse Tree by Ron Rice
Metadata Harvesting and Integration by Matthew
Arnstein
Project Management by Fiona Patrick

30
Friend of Man: 5 volumes
31
Friend of Man

283 issues (avg. 4 pages per issue)
Estimated index time: 1 month
50 CPU resources from libdev
Terminated during University-wide shutdown
No plans to restart
Plans for a headline search
~9000 articles within corpus

32
Preservation News
33
Preservation News

Next on the production queue
35 volumes, 7000 pages
Cleanest OCR
Plans for full-text indexing using Greenstone
Similar browse tree and interface functionality

34
Plans For the Future

Develop scripts to automate entire build
process: little human intervention
Digitize additional volumes of CDS
FOM considered closed
PRN considered closed
Hand user-friendly software off to Library
Systems for future ingest and delivery of
materials.
