Title: CS 430 / INFO 430 Information Retrieval
1CS 430 / INFO 430 Information Retrieval
Lecture 16 Library Catalogs 1
2Course Administration
3Information Retrieval with High Recall
Full-text Indexing (automated)
- Text only
- Most effective on medium-length documents on related topics
- High recall requires tuning system to the specific collection and skilled users
Catalogs and Indexes (created manually)
- Can be used for all formats of material
- Requires close quality control of metadata creation
- High recall requires tuning system to the specific collection and skilled users
4Descriptive metadata
- Information discovery can be very effective when applied to metadata rather than raw information
- Allows fielded searching
- author "Goethe"
- Suitable for non-textual material
- type "picture" and subject "Ithaca"
- Can be used with controlled vocabulary
- language "en" (English)
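A minimal Python sketch of fielded searching over descriptive metadata, mirroring the example queries above. The records, field values, and the `fielded_search` helper are hypothetical illustrations, not part of any real catalog system.

```python
# Hypothetical metadata records; field names follow the examples above.
records = [
    {"author": "Goethe", "type": "text", "subject": "Faust", "language": "de"},
    {"author": "Unknown", "type": "picture", "subject": "Ithaca", "language": "en"},
    {"author": "Unknown", "type": "picture", "subject": "Boston", "language": "en"},
]


def fielded_search(records, **criteria):
    """Return records whose fields match every criterion (Boolean AND)."""
    return [
        r for r in records
        if all(r.get(field) == value for field, value in criteria.items())
    ]


# author "Goethe"
goethe = fielded_search(records, author="Goethe")
# type "picture" and subject "Ithaca"
ithaca_pictures = fielded_search(records, type="picture", subject="Ithaca")
print(len(goethe), len(ithaca_pictures))
```

Because the search runs over named fields rather than raw text, the same code works for non-textual items such as pictures.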
5Examples of Library Catalogs
Cornell University Library catalog: http://catalog.library.cornell.edu/
Library of Congress, Prints and Photographs: http://www.loc.gov/rr/print/catalog.html
6Origins of Library Catalogs
Bibliographic Objective
- To bring together like items
- To differentiate among similar ones
Sir Anthony Panizzi, Keeper of Books at the
British Museum (1856-67). His Ninety-One Rules
(1841) were the basis of modern catalog rules.
7Origins of Library Catalogs
Information Discovery
- to enable a person to find a book of which either the author, title or subject is known
- to show what the library has by a given author, on a given subject, or in a given kind of literature
- to assist in the choice of a book as to its edition (bibliographically) or to its character (literary or topical)
Charles Ammi Cutter, Librarian of the Boston Athenaeum, Rules for a Dictionary Catalog, 1874
8Origins of Library Catalogs
Classification: Division of subject matter into a hierarchy. Typically used in libraries to provide a subject-based order for shelving books.
Melvil Dewey, Acting Librarian of Amherst College (1874)
The Dewey Decimal system of book classification uses the numbers 000 to 999 to cover the general fields of knowledge and decimals to fit special subjects.
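The hierarchy can be sketched by mapping a Dewey number to its top-level class. This is a minimal sketch: the `main_class` helper is hypothetical, and the class labels are the present-day names of the ten main classes rather than Dewey's original wording.

```python
# The ten Dewey main classes, keyed by the hundreds digit (modern labels).
MAIN_CLASSES = {
    0: "Computer science, information and general works",
    1: "Philosophy and psychology",
    2: "Religion",
    3: "Social sciences",
    4: "Language",
    5: "Science",
    6: "Technology",
    7: "Arts and recreation",
    8: "Literature",
    9: "History and geography",
}


def main_class(dewey_number: str) -> str:
    """Return the main class for a Dewey number such as '027.7'."""
    whole = int(dewey_number.split(".")[0])   # digits before the decimal point
    if not 0 <= whole <= 999:
        raise ValueError(f"not a Dewey number: {dewey_number}")
    return MAIN_CLASSES[whole // 100]


print(main_class("027.7"))
```

The digits after the decimal point refine the subject further; they are ignored here because only the top of the hierarchy is looked up.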
9Technology
Materials to be catalogued
- Originally books
- Extended to serials, maps, music, etc., but concepts still rely heavily on experience with books
Form of catalog
- Entries in books (Panizzi)
- Index cards (Cutter)
- Online databases (Kilgour)
10Catalogs as Investments
Costs
- Conventional catalog records are created by skilled librarians (cost estimate: $100 per record)
- OCLC's catalog has 52 million records
- Total investment is several billion dollars
Cataloguing Standards
- Enable libraries to share records
- Combine records of the past with records created today
- Allow readers and librarians to move between libraries
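The investment figure follows from simple arithmetic. The sketch below assumes the per-record estimate is in dollars; the variable names are illustrative only.

```python
COST_PER_RECORD = 100        # assumed: dollars per record, per the estimate above
OCLC_RECORDS = 52_000_000    # records in OCLC's catalog, as stated above

total = COST_PER_RECORD * OCLC_RECORDS
print(f"Estimated investment: {total / 1e9:.1f} billion dollars")
```

This gives 5.2 billion dollars, consistent with "several billion".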
11Shared Cataloguing OCLC
- OCLC -- Large centralized transaction processing database system
- When a library catalogs a book it deposits a MARC record in OCLC
- Other libraries can copy the record
- saves duplication of cataloguing
- builds database of holdings
- OCLC database has 52 million records, serves 47,000 libraries
- When developed in 1967, OCLC was a pioneering computer system (had to develop own network, computer terminal, etc.)
12Layers of a Library Catalog
Encoding: Rules that define how catalog records are encoded in a computer system, e.g., XML mark-up.
Syntax: Rules that define the fields and subfields, whether repeated, optional, etc.
Semantics: Rules that define the values of the field and subfield, with instructions for cataloguers of what data to include and how to decide when choices have to be made.
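One way to picture the three layers is a small Python sketch. Everything here is a hypothetical illustration: the two-field `SCHEMA`, the `validate` and `encode` helpers, and the XML element names are invented for the example and do not correspond to MARC or any real catalog schema.

```python
import xml.etree.ElementTree as ET

# Syntax layer: which fields exist and whether each may repeat (hypothetical).
SCHEMA = {"title": {"repeatable": False}, "subject": {"repeatable": True}}

# Semantics layer: the values a cataloguer chose for each field.
record = [
    ("title", "Campus strategies for libraries and electronic information"),
    ("subject", "Academic libraries"),
    ("subject", "Information technology"),
]


def validate(record):
    """Check the record against the syntax rules in SCHEMA."""
    seen = set()
    for name, _ in record:
        if name not in SCHEMA:
            raise ValueError(f"unknown field: {name}")
        if name in seen and not SCHEMA[name]["repeatable"]:
            raise ValueError(f"field may not repeat: {name}")
        seen.add(name)


def encode(record):
    """Encoding layer: serialize the validated record as XML mark-up."""
    validate(record)
    root = ET.Element("record")
    for name, value in record:
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


xml_text = encode(record)
print(xml_text)
```

Keeping the layers separate means the encoding could be swapped (for example, to a tape format) without touching the field definitions or the cataloguing decisions.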
13Library Cataloging using the Anglo American
Cataloguing Rules
Anglo American Cataloguing Rules (AACR2): Rules for each category of material, e.g., monographs (books). Specify what fields should be used and what data to include in each field. Text strings were originally intended for printed catalog cards.
MARC format: An exchange format for catalog records. Includes encoding rules and syntax specification.
"MARC Catalog": Catalog in MARC format, where content of each field follows AACR2.
14Anglo American Cataloguing Rules
The Anglo American Cataloguing Rules (AACR) provide detailed rules for
- the choice of fields
- the content of the data that goes into each field
- the syntax of the data that goes into each field
The rules are an excellent example of technical writing, precise but clear. For an example, see http://www.cs.cornell.edu/Courses/cs430/2004fa/slides/AACR.pdf
15Example Controlled Vocabulary
Level 1: Arts
Level 2: Architecture, Art therapy, Careers, Computers in art, Dance, Drama/dramatics, Film, History, Informal education, Instructional issues, Music, Photography, Popular culture, Process skills, Technology, Theater arts, Visual arts
Terms marked can appear in other hierarchies
Source: presentation by Diane Hillmann, 2004
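A controlled vocabulary like this one can be enforced at data-entry time. The following Python sketch uses the terms from the example; the `is_valid_term` helper and the dictionary layout are hypothetical.

```python
# Two-level vocabulary from the example above (Level 1 -> Level 2 terms).
VOCABULARY = {
    "Arts": {
        "Architecture", "Art therapy", "Careers", "Computers in art", "Dance",
        "Drama/dramatics", "Film", "History", "Informal education",
        "Instructional issues", "Music", "Photography", "Popular culture",
        "Process skills", "Technology", "Theater arts", "Visual arts",
    },
}


def is_valid_term(level1: str, level2: str) -> bool:
    """Accept a term only if it appears under the given Level 1 heading."""
    return level2 in VOCABULARY.get(level1, set())


print(is_valid_term("Arts", "Dance"))      # term is in the vocabulary
print(is_valid_term("Arts", "Painting"))   # term is not in the vocabulary
```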
16MARC Format
The MARC format was developed in the late 1960s
as a tagging scheme for exchanging catalog
records on magnetic tape. It remains the standard
way to represent such data. At present, MARC is
slowly being converted to modern computing
formats, e.g., Unicode, XML.
17MARC Monograph catalog record
Citation: Caroline R. Arms, editor, Campus strategies for libraries and electronic information. Bedford, MA: Digital Press, 1990.
18MARC fields
tag  value
001  89-16879 r93
050  Z675.U5C16 1990
082  027.7/0973 20
245  Campus strategies for libraries and electronic information/Caroline Arms, editor.  [title statement]
260  Bedford, Mass. Digital Press, c1990.  [publisher]
300  xi, 404 p. ill. 24 cm.  [collation]
440  EDUCOM strategies series on information technology  [series title]
504  Includes bibliographical references (p. 373-381).
020  ISBN 1-55558-036-X 34.95
19MARC fields (continued)
tag  value
650  Academic libraries--United States--Automation.  [subject heading]
650  Libraries and electronic publishing--United States.
650  Library information networks--United States.
650  Information technology--United States.
700  Arms, Caroline R. (Caroline Ruth)
040  DLC DLC DLC
043  n-us---
955  CIP ver. br02 to SL 02-26-90
985  APIF/MIG
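Because tags such as 650 repeat, a record is more naturally a list of (tag, value) pairs than a flat dictionary. The Python sketch below is one possible in-memory representation, with a few field values copied from the record above; the variable names are illustrative.

```python
from collections import defaultdict

# (tag, value) pairs copied from the record above; only a few fields are shown.
fields = [
    ("245", "Campus strategies for libraries and electronic information"),
    ("650", "Academic libraries--United States--Automation."),
    ("650", "Libraries and electronic publishing--United States."),
    ("650", "Library information networks--United States."),
    ("650", "Information technology--United States."),
    ("700", "Arms, Caroline R. (Caroline Ruth)"),
]

by_tag = defaultdict(list)
for tag, value in fields:
    by_tag[tag].append(value)   # repeated tags accumulate in order

subjects = by_tag["650"]
# Split each subject heading into its main heading and subdivisions.
subdivided = [s.rstrip(".").split("--") for s in subjects]
print(subdivided[0])
```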
20MARC Encoding
tag 260
- subfield a: Bedford, Mass.
- subfield b: Digital Press,
- subfield c: c1990.
MARC encoding: 2600abcBedford, Mass. Digital Press,c1990.
Definitely not a modern encoding!
Note that the content is designed to be part of a
printed catalog record and is not in a convenient
format for computer manipulation.
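A rough Python sketch of how subfields are packed into a single string. It uses the ISO 2709 control characters that MARC relies on (0x1F before each subfield code, 0x1E at the end of a field). The `encode_field` and `decode_field` helpers are hypothetical, the two blank indicators are an assumption, and the record directory that carries the tag number is omitted.

```python
SUBFIELD_DELIMITER = "\x1f"   # precedes each one-character subfield code
FIELD_TERMINATOR = "\x1e"     # ends each variable field


def encode_field(indicators, subfields):
    """Build the data portion of one variable field: indicators, then subfields."""
    body = "".join(SUBFIELD_DELIMITER + code + value for code, value in subfields)
    return indicators + body + FIELD_TERMINATOR


def decode_field(data):
    """Split a field's data back into indicators and (code, value) pairs."""
    data = data.rstrip(FIELD_TERMINATOR)
    indicators, *chunks = data.split(SUBFIELD_DELIMITER)
    return indicators, [(chunk[0], chunk[1:]) for chunk in chunks]


# Field 260 from the example above; two blank indicators are assumed.
field_260 = encode_field(
    "  ",
    [("a", "Bedford, Mass."), ("b", "Digital Press,"), ("c", "c1990.")],
)
indicators, subfields = decode_field(field_260)
print(subfields)
```

Note that the punctuation meant for the printed card (the trailing comma and full stop) travels inside the subfield values, which is what makes the data awkward to process.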
21Name authority files
- An Authority File "brings together like items and differentiates among similar ones."
- Caroline R. Arms or Caroline Ruth Arms?
- Which William Phillips of Cardiff?
- Mark Twain or Samuel Clemens?
- Epithets
- of Cardiff
- doctor
- Dates
- 1832 - 1876
- flourished 1860
- circa 1832 - 1876
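The idea of an authority file can be sketched as a lookup from any variant form of a name to a single authorized heading. The entries, the `build_index` and `authorized_heading` helpers, and the case-insensitive matching are illustrative assumptions, not how any particular library system works.

```python
# Illustrative authority file: authorized heading -> variant forms of the name.
AUTHORITY = {
    "Arms, Caroline R. (Caroline Ruth)": ["Arms, Caroline Ruth", "Arms, C. R."],
    "Twain, Mark, 1835-1910": ["Clemens, Samuel Langhorne"],
}


def build_index(authority):
    """Map every known form of a name (case-insensitively) to its heading."""
    index = {}
    for heading, variants in authority.items():
        for name in [heading, *variants]:
            index[name.casefold()] = heading
    return index


INDEX = build_index(AUTHORITY)


def authorized_heading(name):
    """Return the authorized heading for a name, or None if it is unknown."""
    return INDEX.get(name.strip().casefold())


print(authorized_heading("Arms, C. R."))
print(authorized_heading("clemens, samuel langhorne"))
```

Dates and epithets matter because they become part of the heading itself, which is how two people with the same name stay distinct in the index.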
22Name authority example
- LC Control Number n 87870182
- HEADING Arms, Caroline R. (Caroline Ruth)
- 000 00907cz 2200205n 450
- 001 4383796
- 005 19890706143144.8
- 008 70909nacannaab a aaa c
- 010 __ a n 87870182
- 035 __ a (DLC)n 87870182
- 040 __ a InU c DLC d DLC
- 100 10 a Arms, Caroline R. q (Caroline Ruth)
- 400 10 w nna a Arms, Caroline Ruth
- 400 10 a Arms, C. R. q (Caroline Ruth)
- 670 __ a Arms, W.Y. Report on the performance problems of the RLIN computer system, 1982 b t.p. (Caroline R. Arms)
- 670 __ a LC data base, 8/24/87 b (hdg. Arms, Caroline Ruth usage Caroline R. Arms, C. R. Arms)
- 670 __ a Campus networking strategies, 1988 b CIP t.p. (Caroline Arms)
- 670 __ a Phone call to pub., 2/10/88 b (Caroline Ruth Arms studied at Oxford)
- 670 __ a Campus strategies for libraries and electronic information, c1990 b CIP t.p. (Caroline Arms) data sheet (b. 10-24-45)
- 953 __ a bz46 b bd24
23Subject information
Library of Congress Subject Headings
- Academic libraries--United States--Automation
Hierarchical classification
- Library of Congress call number: Z675.U5C16
- Dewey Decimal Classification: 027.7
Creation and maintenance of lists of subject headings and classifications is a never ending task.
24Online public access catalog (OPAC)
- History: First stage
- Library mounts its MARC records on a central computer
- Provides a simple terminal interface and dedicated terminals
- Boolean search -- fielded searching
- Most university libraries reached this stage about 1990
- History: Second stage
- Library connects computer to a campus network and Internet
- Converts card catalog records to MARC (retrospective conversion)
25Library information systems
- When the catalog is online ...
- Add other collections and services
- Secondary information (Inspec, Medline, Chemical Abstracts)
- Reference works (dictionaries, encyclopedias)
- Improve user interface
- Add full text searching
- Add web interface
- Add gateway to off-campus information sources
- Scientific journals
- Databases (census, genome)
26Library management systems
A library management system, sometimes called an
integrated library system, integrates the
internal processes of a library, e.g.,
acquisitions, cataloguing, binding, circulation,
etc. It usually contains an online public
access catalog, but does not provide integrated
services to users. Library management systems are
produced by small companies who lack the capital
and technical expertise to develop modern digital
libraries.
27Notes on MARC
- A great achievement
- Developed in 1960s
- Magnetic tape exchange format for printing catalog records
- The dawn of computing
- mixed upper and lower case
- variable length fields
- repeated fields
- non-Roman scripts
- 100(?) million records with standard content and format
- Thousands of trained librarians (millions?)
28Notes on MARC
- A great problem
- Not designed for computer algorithms
- One record per item (poor links between records)
- Tied to traditional materials and traditional practices
- Not Unicode
- 100 million records at $100 each -- $10 billion
- A classic legacy system!