Title: Digital Content Delivery Platforms
1. Digital Content Delivery Platforms
- Newspaper Delivery with Greenstone
2. Newspaper Delivery
- Requirements
- Full Text Search
- Photo Search
- Chronological Browse
- Full Page and Article-Level Display
3. OCLC Olive Prototype
4. OCLC Olive Prototype
- OCLC hosted initial 4 years
- Little control over delivery mechanism
- Annual hosting fees
- Initial costs lower than in-house solution
- Eventual costs greater than in-house solution
- Abandoned
5. CUL Development
- OCR for full text searches
- Structural metadata for everything else
6. Full Text OCR
- Each page scanned from bound volumes or microfilm
- Automated: sophisticated OCR software converts image of text to text
- Optimized for high recall -- automated
7. Structural Metadata Development
- Both automated and manual
- Manual for most important information, e.g. dates,
titles and article continuity
- Automated for everything else
8. September 6, 1967, page 2, article 2
9. Extrapolating
10. Structural Metadata
12. Photo Search
- Uses photocaptions and article content as clues
- Example: search for photos of Thomas W. Mackesey,
VP for University Planning, circa 1967
13. Photo Search (cont.)
14. Uses of Structural Metadata
- Photocaption metatags -- explicit
- Articles that contain the text "Mackesey" and also
contain a photo -- implicit
(Diagram labels: article, photo, photocaption)
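The explicit/implicit distinction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Photo` and `Article` classes, their field names, and the sample records are hypothetical and do not reflect the project's actual metadata schema or Greenstone's API.

```python
from dataclasses import dataclass, field


@dataclass
class Photo:
    photo_id: str
    caption: str = ""  # text of the photocaption, if any


@dataclass
class Article:
    article_id: str
    text: str  # OCR text of the article
    photos: list = field(default_factory=list)


def photo_search(articles, name):
    """Return (explicit, implicit) lists of photo ids matching `name`.

    explicit: the name appears in the photo's own caption.
    implicit: the caption does not mention the name, but the article
              that contains the photo does.
    """
    needle = name.lower()
    explicit, implicit = [], []
    for article in articles:
        for photo in article.photos:
            if needle in photo.caption.lower():
                explicit.append(photo.photo_id)
            elif needle in article.text.lower():
                implicit.append(photo.photo_id)
    return explicit, implicit


# Hypothetical sample records, for illustration only.
SAMPLE_ARTICLES = [
    Article("a1", "Thomas W. Mackesey announced the plan.",
            [Photo("p1", "Thomas W. Mackesey"), Photo("p2")]),
    Article("a2", "Mackesey spoke to the trustees.",
            [Photo("p3", "Campus view")]),
    Article("a3", "Hockey team wins.", [Photo("p4", "Team captain")]),
    Article("a4", "Mackesey is mentioned, but there is no photo.", []),
]

explicit_hits, implicit_hits = photo_search(SAMPLE_ARTICLES, "Mackesey")
print("explicit:", explicit_hits)
print("implicit:", implicit_hits)
```

Implicit matches are only clues: a photo in an article that mentions a name may show someone else, so a real interface would probably rank explicit matches first.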
15. Putting It All Together
- Scans are delivered in PDF form.
- Metadata is delivered in XML form.
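As a rough sketch of how XML structural metadata could tie an article to its PDF page scans, the Python below parses a small record with the standard library. The element and attribute names (`issue`, `article`, `segment`, `page`, `pdf`), the file names, and the sample values are all assumptions made up for illustration; the actual vendor schema is not shown in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record: one article that continues on a later page.
SAMPLE = """
<issue date="1967-09-06">
  <article id="2.2" title="Hypothetical headline">
    <segment page="2" pdf="page2.pdf"/>
    <segment page="5" pdf="page5.pdf"/>
  </article>
</issue>
"""


def article_pages(xml_text):
    """Return a list of (issue date, article id, title, page numbers)."""
    root = ET.fromstring(xml_text.strip())
    result = []
    for art in root.iter("article"):
        # Article continuity: collect every page the article runs on.
        pages = [int(seg.get("page")) for seg in art.iter("segment")]
        result.append((root.get("date"), art.get("id"), art.get("title"), pages))
    return result


for date, art_id, title, pages in article_pages(SAMPLE):
    print(date, art_id, title, pages)
```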
16. Original Conception: Role of Outside Vendors
- iArchives/Byte Managers deliver files one volume
at a time along with thumbnails for navigation
and user interface purposes.
- iArchives licensed a customized version of
Greenstone (Greenstone is open source)
17. Original Conception: Role of CUL
- Compile customized version of Greenstone on top
of Apache server.
- Build each volume.
- Long process for server (Library23/Libdev),
little human intervention.
18. Perfect Solution Does Not Yet Exist
- 2/3 of all OCRed text just noise
- All noise indexed
- Known, homogeneous collection
- Uncustomized and SLOW interface
- Unintelligible URL: hard to cite source
19. Greenstone and URLs
- Typical Greenstone URL:
http://iarchives.library.cornell.edu/cgi-bin/iarchives.cgi?ed-0y18801-y188012cy191452cy193562cy193782cy193782cy1939402cy194122cy19412b2cy194232cy195232cy195342cy195452cy195562cy196012cy196122cy196122cy196562cy19678-01-0-0--010---4------0-1l--1en-50---20-home---0-0031-0010utfZz-8-00adf1cy18801clCL2.1
- ?????????
- Would be easier to cite:
http://cdsun.library.cornell.edu/index.php?cy19678d9.6.D0
20. Aside: Search Effectiveness
If information retrieval were perfect ... every
hit would be relevant to the original query, and
every relevant item in the body of information
would be found.
- Precision: percentage (or fraction) of the hits
that are relevant, i.e., the extent to which the
set of hits retrieved by a query satisfies the
requirement that generated the query.
- Recall: percentage (or fraction) of the relevant
items that are found by the query, i.e., the
extent to which the query found all the items
that satisfy the requirement.
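The two definitions above can be made concrete with a short Python sketch. The function name, the item ids, and the sample query are hypothetical and chosen only to illustrate the arithmetic.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: ids of the items the search returned (the hits)
    relevant:  ids of the items that actually satisfy the requirement
    """
    retrieved, relevant = set(retrieved), set(relevant)
    relevant_hits = retrieved & relevant
    # Precision: fraction of the hits that are relevant.
    precision = len(relevant_hits) / len(retrieved) if retrieved else 0.0
    # Recall: fraction of the relevant items that were found.
    recall = len(relevant_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 8 hits, 4 of them relevant; 5 relevant items exist.
hits = ["a1", "a2", "a3", "a4", "n1", "n2", "n3", "n4"]
wanted = ["a1", "a2", "a3", "a4", "a5"]
p, r = precision_recall(hits, wanted)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.80
```

In this example half of the hits are noise (precision 0.50), yet the query still finds four of the five relevant items (recall 0.80), which is the trade-off a recall-optimized index accepts.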
21. Why So Much Noise?
- Optimize search to return all documents relevant
to a particular query.
- Unless noise is indexed, may miss important
subtleties
- "Nat Arnstein" is a much different search than
"Matt Arnstein"
22. Problems with Interface
- Slow
- High demand for CPU resources
- Uses old-style HTML techniques -- no CSS
- Hard to consult with layout/interface designers
- Interface functionality hard coded
23. The Greenstone User Interface
- Contents of the pages are rendered at run time
via macro files.
- Algorithms (for-loops) that display content are
hard coded.
- Contents such as table headers are assumed to be
contained within macros or a configuration file.
24. The Greenstone User Interface (cont.)
- The hard code uses tables to display list
content -- a serious problem for design and
interface functionality.
- There is no way around this short of changing the
C code.
26. Solution
- Use Greenstone for search capabilities.
- Use PHP for everything else, since content is
static and regular: we know what to expect.
27. Introducing the Projects
- Cornell Daily Sun
- Friend of Man
- Preservation News
28. Cornell Daily Sun: 16 years and growing
29. Cornell Daily Sun
- Combination Greenstone/In-House Custom PHP Pages
and JavaScript
- Search by Greenstone
- Layout and Design by Melissa Kuo
- Browse Tree by Ron Rice
- Metadata Harvesting and Integration by Matthew
Arnstein
- Project Management by Fiona Patrick
30. Friend of Man: 5 volumes
31. Friend of Man
- 283 issues (avg. issue: 4 pages)
- Estimated index time: 1 month
- 50% CPU resources from libdev
- Terminated during University-wide shutdown
- No plans to restart
- Plans for a headline search
- 9000 articles within corpus
32. Preservation News
33. Preservation News
- Next on the production queue
- 35 volumes, 7000 pages
- Cleanest OCR
- Plans for full-text indexing using Greenstone
- Similar browse tree and interface functionality
34. Plans for the Future
- Develop scripts to automate entire build
process: little human intervention
- Digitize additional volumes of CDS
- FOM considered closed
- PRN considered closed
- Hand user-friendly software off to Library
Systems for future ingest and delivery of
materials.