Digital Content Delivery Platforms - PowerPoint PPT Presentation

Provided by: cornelluni
1
Digital Content Delivery Platforms
  • Newspaper Delivery with Greenstone

2
Newspaper Delivery
  • Requirements
  • Full Text Search
  • Photo Search
  • Chronological Browse
  • Full Page and Article-Level Display

3
OCLC Olive Prototype
4
OCLC Olive Prototype
  • OCLC hosted initial 4 years
  • Little control over delivery mechanism
  • Annual hosting fees
  • Initial costs lower than in-house solution
  • Eventual costs greater than in-house solution
  • Abandoned

5
CUL Development
  • OCR for full text searches
  • Structural metadata for everything else

6
Full Text OCR
  • Each page scanned from bound volumes or microfilm
    -- Automated
  • Sophisticated OCR software converts image of text
    to text. Optimized for high recall -- Automated

7
Structural Metadata Development
  • Both automated and manual
  • Manual for most important information, e.g. dates,
    titles and article continuity
  • Automated for everything else

8
September 6, 1967 page 2, article 2
9
Extrapolating
10
Structural Metadata
11
(No Transcript)
12
Photo Search
  • Uses photocaptions and article content as clues
  • Example: search for photos of Thomas W. Mackesey,
    VP for University planning, circa 1967

13
Photo Search (cont.)
14
Uses of Structural Metadata
  • Photocaption metatags -- explicit
  • Articles that contain the text "Mackesey" as well as
    a photo -- implicit
  • Diagram labels: article, photo, photocaption
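
A minimal sketch of the explicit/implicit distinction above, written in Python. The XML element names (issue, article, text, photo, photocaption) and the sample content are invented for illustration; the slides do not show the project's actual metadata schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical structural metadata for one issue. Element names and
# article text are made up for this sketch, not taken from the project.
SAMPLE_XML = """
<issue date="1967-09-06">
  <article id="p2a2" page="2">
    <text>Thomas W. Mackesey spoke about campus planning.</text>
    <photo id="p2a2-ph1"/>
  </article>
  <article id="p3a1" page="3">
    <text>The crew team opened its season.</text>
    <photo id="p3a1-ph1">
      <photocaption>Thomas W. Mackesey at the boathouse</photocaption>
    </photo>
  </article>
  <article id="p4a1" page="4">
    <text>Mackesey declined to comment.</text>
  </article>
</issue>
"""


def find_photos(xml_text, term):
    """Return (photo id, match kind) pairs for photos that may show `term`."""
    term = term.lower()
    hits = []
    for article in ET.fromstring(xml_text).iter("article"):
        body = (article.findtext("text") or "").lower()
        for photo in article.iter("photo"):
            caption = (photo.findtext("photocaption") or "").lower()
            if term in caption:
                # The caption itself names the person: explicit match.
                hits.append((photo.get("id"), "explicit"))
            elif term in body:
                # Only the surrounding article names the person: implicit.
                hits.append((photo.get("id"), "implicit"))
    return hits


if __name__ == "__main__":
    for photo_id, kind in find_photos(SAMPLE_XML, "Mackesey"):
        print(photo_id, kind)
```

The article on page 4 mentions the name but has no photo, so it contributes no hit. An implicit match is only a clue: the photo may show someone else mentioned in the same article.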
15
Putting It All Together
  • Scans are delivered in PDF form.
  • Metadata delivered in XML form

16
Original Conception: Role of Outside Vendors
  • iArchives/Byte Managers deliver files one volume
    at a time along with thumbnails for navigation
    and user interface purposes.
  • iArchives licensed a customized version of
    Greenstone (Greenstone is open source)

17
Original Conception: Role of CUL
  • Compile customized version of Greenstone on top
    of Apache server.
  • Build each volume.
  • Long Process for Server (Library23/Libdev),
    little human intervention.

18
Perfect Solution Does Not Yet Exist
  • 2/3 of all OCRed text just noise
  • All noise indexed
  • Known, homogeneous collection
  • Uncustomized and SLOW interface
  • Unintelligible URL: hard to cite source

19
Greenstone and URLs
  • Typical Greenstone URL:
    http://iarchives.library.cornell.edu/cgi-bin/iarchives.cgi?ed-0y18801-y188012cy191452cy193562cy193782cy193782cy1939402cy194122cy19412b2cy194232cy195232cy195342cy195452cy195562cy196012cy196122cy196122cy196562cy19678-01-0-0--010---4------0-1l--1en-50---20-home---0-0031-0010utfZz-8-00adf1cy18801clCL2.1
  • ?????????
  • Would be easier to cite:
    http://cdsun.library.cornell.edu/index.php?cy19678d9.6.D0

20
Aside: Search Effectiveness
If information retrieval were perfect, every hit
would be relevant to the original query, and every
relevant item in the body of information would be
found.
  • Precision: percentage (or fraction) of the hits
    that are relevant, i.e., the extent to which the
    set of hits retrieved by a query satisfies the
    requirement that generated the query.
  • Recall: percentage (or fraction) of the relevant
    items that are found by the query, i.e., the
    extent to which the query found all the items
    that satisfy the requirement.
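
The two measures can be computed directly from a set of hits and a set of relevant items. The following Python sketch uses made-up article identifiers purely for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the hits the search returned
    relevant:  identifiers of the items that truly satisfy the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_hits = retrieved & relevant
    # Guard against division by zero when either set is empty.
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    hits = ["a1", "a2", "a3", "a4"]   # what the search returned
    wanted = ["a2", "a4", "a7"]       # what a perfect search would return
    p, r = precision_recall(hits, wanted)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four hits are relevant (precision 0.5) and two of the three relevant items were found (recall about 0.67). Indexing noisy OCR text tends to raise recall at the cost of precision.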
21
Why So Much Noise?
  • Optimize search to return all documents relevant
    to a particular query.
  • Unless noise is indexed, may miss important
    subtleties
  • "Nat Arnstein" is a much different search than
    "Matt Arnstein"
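
One reading of this slide is that any filter strict enough to discard OCR noise would also discard rare tokens such as personal names. The Python sketch below illustrates that trade-off with a toy inverted index. The page text, the tiny dictionary, and the simulated OCR error "Arnstcin" are all invented for illustration, and this is not how Greenstone builds its index.

```python
from collections import defaultdict

# Invented OCR output for three pages; "Arnstcin" simulates an OCR error.
PAGES = {
    "page-1": "Nat Arnstein spoke at the meeting",
    "page-2": "Matt Arnstein wrote the report",
    "page-3": "the report cited Arnstcin twice",
}

# A tiny stand-in word list; a real dictionary would be far larger.
DICTIONARY = {"spoke", "at", "the", "meeting", "wrote", "report", "cited", "twice"}


def build_index(pages, keep_only=None):
    """Map each token to the pages containing it.

    If keep_only is given, tokens outside that set are dropped as noise.
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in text.lower().split():
            if keep_only is None or token in keep_only:
                index[token].add(page_id)
    return index


def search(index, query):
    """Return the sorted pages that contain every term in the query."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


if __name__ == "__main__":
    full = build_index(PAGES)                  # noise indexed
    filtered = build_index(PAGES, DICTIONARY)  # noise dropped
    print(search(full, "Nat Arnstein"))
    print(search(full, "Matt Arnstein"))
    print(search(filtered, "Nat Arnstein"))
```

With everything indexed, the two names lead to different pages. With the dictionary filter, both names vanish from the index along with the noise, so neither search finds anything.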

22
Problems with interface
  • Slow
  • High demand for CPU resources
  • Uses old-style HTML techniques--no CSS
  • Hard to consult with layout/interface designers
  • Interface functionality hard coded

23
The Greenstone User Interface
  • Contents of the pages are rendered at run time
    via macro files.
  • Algorithms (for-loops) that display content are
    hard coded.
  • Contents such as table headers are assumed to be
    contained within macros or a configuration file.

24
The Greenstone User Interface (cont.)
  • The hard code uses tables to display list
    content--a serious problem for design and
    interface functionality.
  • There is no way around this short of changing the
    C code.

25
(No Transcript)
26
Solution
  • Use Greenstone for search capabilities.
  • Use PHP for everything else, since content is
    static and regular: we know what to expect.

27
Introducing the Projects
  • Cornell Daily Sun
  • Friend of Man
  • Preservation News

28
Cornell Daily Sun: 16 years and growing
29
Cornell Daily Sun
  • Combination Greenstone/In-House Custom PHP Pages
    and JavaScript
  • Search by Greenstone
  • Layout and Design by Melissa Kuo
  • Browse Tree by Ron Rice
  • Metadata Harvesting and Integration by Matthew
    Arnstein
  • Project Management by Fiona Patrick

30
Friend of Man: 5 volumes
31
Friend of Man
  • 283 Issues (avg issue 4 pages)
  • Estimated index time: 1 month
  • 50% of CPU resources from libdev
  • Terminated during University-wide shutdown
  • No plans to restart
  • Plans for a headline search
  • ~9000 articles within corpus

32
Preservation News
33
Preservation News
  • Next on the production queue
  • 35 volumes, 7000 pages
  • Cleanest OCR
  • Plans for full-text indexing using Greenstone
  • Similar browse tree and interface functionality

34
Plans For the Future
  • Develop scripts to automate entire build
    process: little human intervention
  • Digitize additional volumes of CDS
  • FOM considered closed
  • PRN considered closed
  • Hand user-friendly software off to Library
    Systems for future ingest and delivery of
    materials.