CS 430 / INFO 430 Information Retrieval

Provided by: wya54
1
CS 430 / INFO 430 Information Retrieval
Lecture 16 Library Catalogs 1
2
Course Administration
3
Information Retrieval with High Recall
Full-text Indexing (automated)
  • Text only
  • Most effective on medium-length documents on related
    topics
  • High recall requires tuning the system to the
    specific collection, and skilled users
Catalogs and Indexes (created manually)
  • Can be used for all formats of material
  • Requires close quality control of metadata creation
  • High recall requires tuning the system to the
    specific collection, and skilled users
4
Descriptive metadata
  • Information discovery can be very effective
    when applied to metadata rather than raw
    information
  • Allows fielded searching
      • author = "Goethe"
  • Suitable for non-textual material
      • type = "picture" and subject = "Ithaca"
  • Can be used with controlled vocabulary
      • language = "en" (English)
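The fielded searches listed above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular catalog works: the record layout (a dict mapping field names to lists of values), the sample records, and the function name `fielded_search` are all assumptions made for the example.

```python
def fielded_search(records, **criteria):
    """Return the records whose named fields contain every requested value.

    Matching is exact on the whole field value, ignoring case.
    """
    hits = []
    for record in records:
        matches_all = all(
            wanted.lower() in [v.lower() for v in record.get(field_name, [])]
            for field_name, wanted in criteria.items()
        )
        if matches_all:
            hits.append(record)
    return hits


# Hypothetical metadata records; each field may hold several values.
catalog = [
    {"author": ["Goethe"], "type": ["text"], "language": ["de"]},
    {"type": ["picture"], "subject": ["Ithaca"], "language": ["en"]},
    {"type": ["picture"], "subject": ["Boston"], "language": ["en"]},
]

print(fielded_search(catalog, author="Goethe"))
print(fielded_search(catalog, type="picture", subject="Ithaca"))
```

Because the search runs over metadata fields rather than document text, the same function works for the picture records, which have no text to index.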

5
Examples of Library Catalogs
Cornell University Library catalog: http://catalog.library.cornell.edu/
Library of Congress, Prints and Photographs: http://www.loc.gov/rr/print/catalog.html
6
Origins of Library Catalogs
Bibliographic Objective
  • To bring together like items
  • To differentiate among similar ones
Sir Anthony Panizzi, Keeper of Books at the
British Museum (1856-67). His Ninety-One Rules
(1841) were the basis of modern catalog rules.
7
Origins of Library Catalogs
Information Discovery
  • to enable a person to find a book of which either
    the author, title or subject is known
  • to show what the library has by a given author, on
    a given subject, or in a given kind of literature
  • to assist in the choice of a book as to its edition
    (bibliographically) or to its character (literary
    or topical)
Charles Ammi Cutter, Librarian of the Boston Athenaeum
Rules for a Dictionary Catalog, 1874
8
Origins of Library Catalogs
Classification: Division of subject matter into a
hierarchy. Typically used in libraries to
provide a subject-based order for shelving books.
Melvil Dewey, Acting Librarian of Amherst College
(1874). The Dewey Decimal system of book
classification uses the numbers 000 to 999 to
cover the general fields of knowledge and
decimals to fit special subjects.
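The structure of a Dewey number (three digits for the general field, decimals for the special subject) can be sketched as follows. The main class names are the ones used in current editions of the Dewey Decimal Classification and are not taken from the lecture; the function names are hypothetical.

```python
# Main classes by hundreds digit (names from current DDC editions, not from the slides).
DEWEY_MAIN_CLASSES = {
    0: "Computer science, information and general works",
    1: "Philosophy and psychology",
    2: "Religion",
    3: "Social sciences",
    4: "Language",
    5: "Science",
    6: "Technology",
    7: "Arts and recreation",
    8: "Literature",
    9: "History and geography",
}


def split_dewey(number):
    """Split a Dewey number such as '027.7' into (three-digit class, decimal part)."""
    whole, _, fraction = number.strip().partition(".")
    if len(whole) != 3 or not whole.isdigit() or (fraction and not fraction.isdigit()):
        raise ValueError(f"not a Dewey number: {number!r}")
    return whole, fraction


def main_class(number):
    """Return the main class (e.g. '000') and its name for a Dewey number."""
    whole, _ = split_dewey(number)
    hundreds = int(whole[0])
    return f"{hundreds}00", DEWEY_MAIN_CLASSES[hundreds]


def shelf_key(number):
    """Sort key giving shelf order: class number first, then the decimal digits."""
    whole, fraction = split_dewey(number)
    return int(whole), fraction


print(main_class("027.7"))
print(sorted(["100", "027.7", "027", "027.63"], key=shelf_key))
```

Comparing the decimal part as a string of digits keeps 027.63 ahead of 027.7 on the shelf, which is the subject-based order the classification is meant to provide.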
9
Technology
Materials to be catalogued
  • Originally books
  • Extended to serials, maps, music, etc., but concepts
    still rely heavily on experience with books
Form of catalog
  • Entries in books (Panizzi)
  • Index cards (Cutter)
  • Online databases (Kilgour)
10
Catalogs as Investments
Costs
  • Conventional catalog records are created by skilled
    librarians (cost estimate: $100 per record)
  • OCLC's catalog has 52 million records
  • Total investment is several billion dollars
Cataloguing Standards
  • Enable libraries to share records
  • Combine records of the past with records created
    today
  • Allow readers and librarians to move between
    libraries
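The "several billion dollars" figure follows directly from the two numbers on this slide. A quick check, assuming the cost estimate of 100 is in US dollars per record (the function name is hypothetical):

```python
def catalog_investment(record_count, cost_per_record_usd=100):
    """Total cataloguing cost in US dollars for a given number of records."""
    return record_count * cost_per_record_usd


oclc_total = catalog_investment(52_000_000)
print(f"${oclc_total / 1e9:.1f} billion")  # prints "$5.2 billion"
```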
11
Shared Cataloguing: OCLC
  • OCLC -- Large centralized transaction processing
    database system
  • When a library catalogs a book it deposits a MARC
    record in OCLC
  • Other libraries can copy the record
      • saves duplication of cataloguing
      • builds a database of holdings
  • OCLC database has 52 million records, serves
    47,000 libraries
  • When developed in 1967, OCLC was a pioneering
    computer system (had to develop its own network,
    computer terminal, etc.)
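The deposit-and-copy workflow described above can be sketched as a toy in-memory model. This is not OCLC's actual design or API; the class `UnionCatalog`, its method names, and the library codes are hypothetical.

```python
class UnionCatalog:
    """Toy stand-in for a centralized shared-cataloguing database."""

    def __init__(self):
        self.records = {}   # control number -> catalog record (a dict here)
        self.holdings = {}  # control number -> set of library codes

    def deposit(self, library, control_number, record):
        """Original cataloguing: store the record if it is new, then note the holding."""
        self.records.setdefault(control_number, record)
        self.holdings.setdefault(control_number, set()).add(library)

    def copy(self, library, control_number):
        """Copy cataloguing: reuse an existing record and note the new holding."""
        record = self.records[control_number]  # KeyError if nobody has catalogued it yet
        self.holdings[control_number].add(library)
        return dict(record)


union = UnionCatalog()
union.deposit("LibraryA", "89-16879",
              {"245": "Campus strategies for libraries and electronic information"})
copied = union.copy("LibraryB", "89-16879")

print(copied["245"])
print(sorted(union.holdings["89-16879"]))
```

The second library does no cataloguing work of its own, and the holdings table grows as a side effect of each copy, which are the two benefits the slide lists.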

12
Layers of a Library Catalog
Encoding: Rules that define how catalog records are
encoded in a computer system, e.g., XML mark-up.
Syntax: Rules that define the fields and subfields,
whether repeated, optional, etc.
Semantics: Rules that define the values of the field
and subfield, with instructions for cataloguers of
what data to include and how to decide when choices
have to be made.
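The separation between the layers can be made concrete with a small sketch. The field names, the `SYNTAX` table, and both functions are assumptions made for illustration; they do not correspond to MARC or any real catalog schema. The semantics layer (what a cataloguer should write in each field) is a matter of written rules and is not modelled in code.

```python
import xml.etree.ElementTree as ET

# Syntax layer: which fields exist, and whether each is required or repeatable.
SYNTAX = {
    "title": {"required": True, "repeatable": False},
    "author": {"required": False, "repeatable": True},
    "subject": {"required": False, "repeatable": True},
}


def check_syntax(record):
    """Return a list of syntax violations for a record given as (field, value) pairs."""
    errors = []
    counts = {}
    for name, _ in record:
        if name not in SYNTAX:
            errors.append(f"unknown field: {name}")
        counts[name] = counts.get(name, 0) + 1
    for name, rule in SYNTAX.items():
        occurrences = counts.get(name, 0)
        if rule["required"] and occurrences == 0:
            errors.append(f"missing required field: {name}")
        if not rule["repeatable"] and occurrences > 1:
            errors.append(f"field not repeatable: {name}")
    return errors


def encode_xml(record):
    """Encoding layer: serialize the same record as XML mark-up."""
    root = ET.Element("record")
    for name, value in record:
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


record = [
    ("title", "Campus strategies for libraries and electronic information"),
    ("author", "Arms, Caroline R."),
    ("subject", "Academic libraries--United States--Automation"),
]
print(check_syntax(record))
print(encode_xml(record))
```

The same record could be re-encoded in a different format without touching `SYNTAX`, which is the point of keeping the layers apart.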
13
Library Cataloging using the Anglo American Cataloguing Rules
Anglo American Cataloguing Rules (AACR2): Rules for
each category of material, e.g., monographs (books).
Specify what fields should be used and what data to
include in each field. Text strings were originally
intended for printed catalog cards.
MARC format: An exchange format for catalog records.
Includes encoding rules and syntax specification.
"MARC Catalog": Catalog in MARC format, where content
of each field follows AACR2.
14
Anglo American Cataloguing Rules
The Anglo American Cataloguing Rules (AACR)
provide detailed rules for
  • the choice of fields
  • the content of the data that goes into each field
  • the syntax of the data that goes into each field
The rules are an excellent example of technical
writing, precise but clear.
For an example, see
http://www.cs.cornell.edu/Courses/cs430/2004fa/slides/AACR.pdf
15
Example Controlled Vocabulary
Level 1: Arts
Level 2:
  • Architecture
  • Art therapy
  • Careers
  • Computers in art
  • Dance
  • Drama/dramatics
  • Film
  • History
  • Informal education
  • Instructional issues
  • Music
  • Photography
  • Popular culture
  • Process skills
  • Technology
  • Theater arts
  • Visual arts
Marked terms can appear in other hierarchies
Source: presentation by Diane Hillmann, 2004
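A two-level controlled vocabulary like this one is typically used to reject or normalize subject terms at data entry. The sketch below is one possible way to do that; the dictionary layout and the `lookup` function are assumptions, and only the terms themselves come from the slide.

```python
# Level 1 term -> list of permitted level 2 terms (terms taken from the slide).
VOCABULARY = {
    "Arts": [
        "Architecture", "Art therapy", "Careers", "Computers in art", "Dance",
        "Drama/dramatics", "Film", "History", "Informal education",
        "Instructional issues", "Music", "Photography", "Popular culture",
        "Process skills", "Technology", "Theater arts", "Visual arts",
    ],
}

# Case-insensitive index from a level 2 term to its (level 1, canonical level 2) pair.
_CANONICAL = {
    term.lower(): (level1, term)
    for level1, terms in VOCABULARY.items()
    for term in terms
}


def lookup(term):
    """Return (level 1, canonical level 2 term), or None if the term is not permitted."""
    return _CANONICAL.get(term.strip().lower())


print(lookup("theater arts"))
print(lookup("Sculpture"))
```

A real vocabulary in which some terms appear under several level 1 headings would need a list of parents per term rather than a single pair.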
16
MARC Format
The MARC format was developed in the late 1960s
as a tagging scheme for exchanging catalog
records on magnetic tape. It remains the standard
way to represent such data. At present, MARC is
slowly being converted to modern computing
formats, e.g., Unicode, XML.
17
MARC Monograph catalog record
Citation: Caroline R. Arms, editor, Campus
strategies for libraries and electronic
information. Bedford, MA: Digital Press, 1990.
18
MARC fields
tag   value                                                  note
001   89-16879 r93
050   Z675.U5C16 1990
082   027.7/0973 20
245   Campus strategies for libraries and electronic         title statement
      information / Caroline Arms, editor.
260   Bedford, Mass. : Digital Press, c1990.                 publisher
300   xi, 404 p. : ill. ; 24 cm.                             collation
440   EDUCOM strategies series on information technology     series title
504   Includes bibliographical references (p. 373-381).
020   ISBN 1-55558-036-X : $34.95
19
MARC fields (continued)
tag   value                                                  note
650   Academic libraries--United States--Automation.         subject heading
650   Libraries and electronic publishing--United States.
650   Library information networks--United States.
650   Information technology--United States.
700   Arms, Caroline R. (Caroline Ruth)
040   DLC DLC DLC
043   n-us---
955   CIP ver. br02 to SL 02-26-90
985   APIF/MIG
20
MARC Encoding
tag 260
  subfield a: Bedford, Mass. :
  subfield b: Digital Press,
  subfield c: c1990.
MARC encoding: 260 0 $aBedford, Mass. :$bDigital Press,$cc1990.
($ stands for the subfield delimiter character)
Definitely not a modern encoding!
Note that the content is designed to be part of a
printed catalog record and is not in a convenient
format for computer manipulation.
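To make the encoding concrete, here is a minimal sketch of splitting one MARC data field into its subfields. It relies on the control characters defined by the MARC 21 / ISO 2709 record structure (0x1F before each subfield code, 0x1E at the end of each field). The indicator values and the " :" punctuation in the sample are assumptions, since the slide text does not preserve them, and the function names are hypothetical. In a full record the tag is stored in the record directory rather than in the field, so the sketch takes it as an argument.

```python
SUBFIELD_DELIMITER = "\x1f"  # precedes every subfield code
FIELD_TERMINATOR = "\x1e"    # ends every variable field


def parse_data_field(tag, raw):
    """Split one MARC data field into its indicators and (code, value) subfields."""
    raw = raw.rstrip(FIELD_TERMINATOR)
    indicators, _, rest = raw.partition(SUBFIELD_DELIMITER)
    subfields = [
        (chunk[0], chunk[1:])
        for chunk in rest.split(SUBFIELD_DELIMITER)
        if chunk
    ]
    return {"tag": tag, "indicators": indicators, "subfields": subfields}


def display_text(field):
    """Join the subfield values in order, as they would appear on a printed card."""
    return " ".join(value for _, value in field["subfields"])


# Field 260 from the example record; indicators "0 " are an assumption.
raw_260 = "0 \x1faBedford, Mass. :\x1fbDigital Press,\x1fcc1990.\x1e"
field = parse_data_field("260", raw_260)

print(field["subfields"])
print(display_text(field))
```

Note that the punctuation between place, publisher, and date lives inside the subfield values. A program that wants the bare publisher name has to strip the trailing comma itself, which is the inconvenience the slide points out.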
21
Name authority files
  • An Authority File "brings together like items and
    differentiates among similar ones."
      • Caroline R. Arms or Caroline Ruth Arms?
      • Which William Phillips of Cardiff?
      • Mark Twain or Samuel Clemens?
  • Epithets
      • of Cardiff
      • doctor
  • Dates
      • 1832 - 1876
      • flourished 1860
      • circa 1832 - 1876
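The "bring together like items" half of an authority file amounts to mapping every variant form of a name onto one authorized heading. A minimal sketch follows, using the Caroline Arms heading and variants from the name authority example in this lecture. The Twain entry is a simplified, hypothetical one added for illustration, and the function names are assumptions.

```python
# Authorized heading -> variant ("see from") forms.
AUTHORITY_FILE = {
    "Arms, Caroline R. (Caroline Ruth)": [
        "Arms, Caroline Ruth",
        "Arms, C. R. (Caroline Ruth)",
    ],
    # Hypothetical, simplified entry for illustration only.
    "Twain, Mark": ["Clemens, Samuel"],
}


def build_lookup(authority_file):
    """Index every heading and variant, case-insensitively, to its authorized heading."""
    lookup = {}
    for heading, variants in authority_file.items():
        for name in [heading, *variants]:
            lookup[name.casefold()] = heading
    return lookup


LOOKUP = build_lookup(AUTHORITY_FILE)


def authorized_heading(name):
    """Return the authorized form of a name, or None if it is not in the file."""
    return LOOKUP.get(name.strip().casefold())


print(authorized_heading("Arms, Caroline Ruth"))
print(authorized_heading("clemens, samuel"))
print(authorized_heading("Phillips, William"))
```

Differentiating among similar names (which William Phillips?) is the harder half: it needs the epithets and dates listed above to be part of the heading itself, so that two people never share a key.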

22
Name authority example
  • LC Control Number: n 87870182
  • HEADING: Arms, Caroline R. (Caroline Ruth)
  • 000 00907cz 2200205n 450
  • 001 4383796
  • 005 19890706143144.8
  • 008 70909nacannaab a aaa c
  • 010 __ a n 87870182
  • 035 __ a (DLC)n 87870182
  • 040 __ a InU c DLC d DLC
  • 100 10 a Arms, Caroline R. q (Caroline Ruth)
  • 400 10 w nna a Arms, Caroline Ruth
  • 400 10 a Arms, C. R. q (Caroline Ruth)
  • 670 __ a Arms, W.Y. Report on the performance
    problems of the RLIN computer system, 1982 b t.p.
    (Caroline R. Arms)
  • 670 __ a LC data base, 8/24/87 b (hdg. Arms,
    Caroline Ruth usage Caroline R. Arms, C. R. Arms)
  • 670 __ a Campus networking strategies, 1988 b CIP
    t.p. (Caroline Arms)
  • 670 __ a Phone call to pub., 2/10/88 b (Caroline
    Ruth Arms studied at Oxford)
  • 670 __ a Campus strategies for libraries and
    electronic information, c1990 b CIP t.p. (Caroline
    Arms) data sheet (b. 10-24-45)
  • 953 __ a bz46 b bd24

23
Subject information
Library of Congress Subject Headings
  • Academic libraries--United States--Automation
Hierarchical classification
  • Library of Congress call number: Z675.U5C16
  • Dewey Decimal Classification: 027.7
Creation and maintenance of lists of subject headings
and classifications is a never-ending task.
24
Online public access catalog (OPAC)
  • History: First stage
      • Library mounts its MARC records on a central
        computer
      • Provides a simple terminal interface and
        dedicated terminals
      • Boolean search -- fielded searching
      • Most university libraries reached this stage
        about 1990
  • History: Second stage
      • Library connects computer to a campus network and
        Internet
      • Converts card catalog records to MARC
        (retrospective conversion)
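The Boolean, fielded searching offered by first-stage OPACs can be sketched with an inverted index keyed on (field, word) pairs and Python set operations for AND, OR, and NOT. The record layout, the tokenizer, and the function names are assumptions made for illustration. The two sample records use titles and names that appear in this lecture, but their pairing into records is only for the example.

```python
import re
from collections import defaultdict


def build_index(records):
    """Map each (field, word) pair to the set of record ids that contain it."""
    index = defaultdict(set)
    for record_id, record in records.items():
        for field_name, values in record.items():
            for value in values:
                for word in re.findall(r"[a-z0-9]+", value.lower()):
                    index[(field_name, word)].add(record_id)
    return index


def term(index, field_name, word):
    """Posting set for one fielded term; empty if the term never occurs."""
    return index.get((field_name, word.lower()), set())


records = {
    1: {
        "author": ["Arms, Caroline R."],
        "title": ["Campus strategies for libraries and electronic information"],
        "subject": ["Academic libraries--United States--Automation"],
    },
    2: {
        "author": ["Arms, W. Y."],
        "title": ["Report on the performance problems of the RLIN computer system"],
    },
}
index = build_index(records)

# author=arms AND title=libraries
print(term(index, "author", "arms") & term(index, "title", "libraries"))
# title=rlin OR subject=automation
print(term(index, "title", "rlin") | term(index, "subject", "automation"))
# author=arms AND NOT title=libraries
print(term(index, "author", "arms") - term(index, "title", "libraries"))
```

Keeping the field name in the index key is what distinguishes this from full-text search: "libraries" in a title and "libraries" in a subject heading are different terms.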

25
Library information systems
  • When the catalog is online ...
  • Add other collections and services
      • Secondary information (Inspec, Medline,
        Chemical Abstracts)
      • Reference works (dictionaries,
        encyclopedias)
  • Improve user interface
      • Add full text searching
      • Add web interface
  • Add gateway to off-campus information sources
      • Scientific journals
      • Databases (census, genome)

26
Library management systems
A library management system, sometimes called an
integrated library system, integrates the
internal processes of a library, e.g.,
acquisitions, cataloguing, binding, circulation,
etc. It usually contains an online public
access catalog, but does not provide integrated
services to users. Library management systems are
produced by small companies that lack the capital
and technical expertise to develop modern digital
libraries.
27
Notes on MARC
  • A great achievement
      • Developed in 1960s
      • Magnetic tape exchange format for printing
        catalog records
  • The dawn of computing
      • mixed upper and lower case
      • variable length fields
      • repeated fields
      • non-Roman scripts
  • 100(?) million records with standard content
    and format
  • Thousands of trained librarians (millions?)

28
Notes on MARC
  • A great problem
      • Not designed for computer algorithms
      • One record per item (poor links between
        records)
      • Tied to traditional materials and
        traditional practices
      • Not Unicode
      • 100 million records at $100 each -- $10 billion
  • A classic legacy system!