Title: Nomadic Digital Library Research at Cornell
1Open Access to Digital Libraries. Must Research Libraries be Expensive?
William Y. Arms
Department of Computer Science
Cornell University
2Before Digital Libraries
Access to scientific, medical, legal information
In the United States
-- excellent if you belonged to a rich organization (e.g., a major university)
-- very poor otherwise
In many countries of the world
-- very poor for everybody
3Research Libraries are Expensive
library materials
buildings and facilities
staff
4The Potential of Digital Libraries
materials
buildings and facilities
staff
5Economic Models for Open Access
Who pays for open access to information?
6Two Fallacies
1. The Luddite Publishing Fallacy
Academic authors will never change. Prestige is determined by which journals a researcher publishes in. The prestigious journals make the rules.
2. The Free Lunch Fallacy
Web publishing costs nothing. Therefore groups of researchers should publish their own research. There is no need to waste money on publishers.
7Four Economic Models
                     Example: Broadcast Television
Open Access
  Advertising        network television
  External funding   public broadcasting
Restricted Access
  Subscription       cable
  Pay-by-use         pay-per-view
8Examples
Old                               New
Books in Print (subscription)     Amazon.com (advertising)
Medline (pay-by-use)              Grateful Med (external)
Journal (subscription)            ePrint archives (external)
Westlaw (pay-by-use)              Legal Information Institute (external)
Inspec (subscription)             Google (advertising)
9Thoughts on the Future of Open Access
The dominant force is author pressure, which
emphasizes open access rather than closed access.
The producing organization may be a university
(or part), a conference series, a laboratory, an
association, etc.
10A New Role For Academic Libraries and
Associations
Academic libraries and associations can provide support for open access information
-- Establish standards for academic quality
-- Maintain local archives (e.g., M.I.T.'s archive of local research)
-- Protect and preserve for the long term
11The Potential of Digital Libraries
staff
12Automated Digital Libraries
How effectively can computers be used for the skilled tasks of professional librarianship?
-- Time horizon 5 to 20 years
-- All materials in digital form
Computers cannot imitate intelligence. Can automated digital libraries provide equivalent services?
13Example: Catalogs and Indexes
Catalog, index and abstracting records are very expensive when created by skilled professionals
-- only available for certain categories of material (e.g., monographs, scientific journals)
-- contain limited fields of information (e.g., no contents page)
-- restricted to static information
14Equivalent Services Catalogs and Indexes
Cataloguing rules
-- Application of cataloguing rules is skilled
-- It is hard to imagine a computer system with these skills
but ...
-- Cataloguing rules are the means, not the end
15Equivalent Services
Information discovery
I used to be a heavy user of Inspec. Now I use Google instead.
Why are web search services the most widely used information discovery tools in universities today?
16Conventional Criteria
Web search services have many weaknesses
-- selection is arbitrary
-- index records are crude
-- no authority control
-- duplicate detection is weak
-- search precision is deplorable
yet they clearly satisfy some users ...
17Effectiveness of Web Search
Why I use Google instead of Inspec
> Broader coverage
> Better ranking
> Immediate access to information (e.g., open access version of published paper)
Google is an equivalent service for information discovery (for some users)
18Simple Algorithms + Immense Computing Power
19Brute Force Computing
Few people really understand Moore's Law
-- Computing power doubles every 18 months
-- Increases about 100 times in 10 years
-- Increases about 10,000 times in 20 years
Simple algorithms + immense computing power may outperform human intelligence
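The growth figures on this slide follow from compounding the 18-month doubling period. A minimal sketch of the arithmetic, assuming steady doubling (the function name `growth_factor` is ours, not from the talk):

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Multiplicative growth after `years`, assuming one doubling every `doubling_months`."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


if __name__ == "__main__":
    for years in (10, 20):
        # 10 years is about 6.7 doublings; 20 years is about 13.3 doublings.
        print(f"{years} years: roughly {growth_factor(years):,.0f} times")
```

The results land close to the rounded figures of 100 and 10,000 quoted on the slide.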
20Brute Force Computing
Example: Creators of the world champion chess program (Deep Thought, later Deep Blue)
-- moderate chess players
-- simple tree-search algorithm
-- very, very fast computer hardware
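To illustrate what a simple tree-search algorithm looks like, here is a sketch of exhaustive game-tree search on a toy game. It is not the chess program's actual algorithm. The game (players alternately take 1 to 3 stones, and whoever takes the last stone wins) and the names `minimax` and `best_move` are assumptions chosen to keep the example small:

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # a player may take 1, 2 or 3 stones per turn


@lru_cache(maxsize=None)
def minimax(stones: int) -> int:
    """Return +1 if the player to move can force a win, otherwise -1."""
    if stones == 0:
        # The previous player took the last stone, so the player to move has lost.
        return -1
    # Try every legal move; the opponent's result is negated to get ours.
    return max(-minimax(stones - take) for take in MOVES if take <= stones)


def best_move(stones: int) -> int:
    """Pick the move whose resulting position is worst for the opponent."""
    legal = [take for take in MOVES if take <= stones]
    return max(legal, key=lambda take: -minimax(stones - take))


if __name__ == "__main__":
    for n in (4, 5, 7):
        outcome = "win" if minimax(n) == 1 else "loss"
        print(f"{n} stones: {outcome} for the player to move, best move {best_move(n)}")
```

The search contains no game knowledge beyond the rules; it simply enumerates positions, which is the sense in which hardware speed can substitute for skill.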
21Examples of Automated Digital Library Services
22Brute Force Computing Web Search
Web search engines
-- retrieve every page on the web
-- index every word
-- repeat every month
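The "index every word" step can be sketched as an inverted index, a map from each word to the pages that contain it. This is a toy illustration under stated assumptions, not any search engine's implementation. The page URLs, their text and the function names are hypothetical:

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every word to the set of page URLs in which it occurs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical pages standing in for the output of a crawl.
PAGES = {
    "http://example.org/a": "Automated digital library services",
    "http://example.org/b": "Digital video and speech recognition",
    "http://example.org/c": "The research library and its costs",
}

if __name__ == "__main__":
    index = build_index(PAGES)
    print(sorted(search(index, "digital library")))
```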
23Substitutes for Human Intelligence
Automated algorithms for information discovery
Closeness of match
-- vector space and statistical methods (Salton, et al., c. 1970)
Importance of digital object
-- Google ranks web pages by how many other pages link to them (NSF/DARPA/NASA Digital Libraries Initiative)
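Both ideas can be sketched in a few lines. The first function scores closeness of match as the cosine between raw term-count vectors, a simplified form of the vector space approach that omits term weighting. The second counts incoming links, as the slide describes, and should not be read as a production ranking algorithm. All names and the sample data are assumptions:

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a bag of lower-cased words with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def inlink_counts(links):
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts


if __name__ == "__main__":
    query = term_vector("digital library")
    docs = ["digital library research", "chess programs"]
    for doc in docs:
        print(f"{doc!r}: similarity {cosine(query, term_vector(doc)):.3f}")

    # Hypothetical link graph: page -> pages it links to.
    graph = {"a": ["b", "c"], "b": ["c"], "c": []}
    print(inlink_counts(graph).most_common())
```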
24Brute Force Computing Archiving and Preservation
Internet Archive
-- Monthly, web crawler gathers every open access web page with associated images
-- Web pages are preserved for future generations
-- Files are available for scholarly research
25Brute Force Computing Reference Linking
ResearchIndex (CiteSeer, ScienceIndex) (NEC)
-- fully automatic
-- all open access material in computer science
-- a free service
Contrast with the Web of Science (ISI)
-- input: a combination of automatic means and skilled people
-- limited number of journals
-- very expensive
26Brute Force Computing Automated Metadata
Extraction
Informedia (Carnegie Mellon)
Automatic processing of segments of video, e.g., television news. Algorithms for
-- dividing raw video into discrete items
-- generating short summaries
-- indexing the sound track using speech recognition
-- recognizing faces
(NSF/DARPA/NASA Digital Libraries Initiative)
27Automating Interoperability
Example: Cornell University's Core System for the NSDL (the National Science Foundation's digital library for science, mathematics, engineering and technology education)
28Levels of Interoperability
A comprehensive science library
The NSDL must provide coherent services across a vast range of materials managed by organizations with many objectives.
Three levels of interoperability
-- Federation
-- Harvesting
-- Gathering
29Federation (e.g., Z39.50 and MARC)
Digital libraries that follow a full set of agreements form a federation.
Standards and agreements
-- Technical: formats, protocols, security systems, etc.
-- Content: data and metadata (including semantics)
-- Organizational: access, services, payment, authentication, etc.
Federations are desirable but very demanding and hence rare
30Gathering (e.g., Internet Archive, Google)
Gathering service for open access information, even if information providers do not follow standard agreements
-- web crawlers gather open access information
-- web search engines index it
-- automated services are possible (e.g., ResearchIndex)
Entirely automated
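The gathering step needs no cooperation from providers because a crawler only follows the links it finds. A minimal sketch of a breadth-first crawl follows, using an in-memory dictionary in place of the network so that it runs offline. The `fetch` callback, the toy pages and all names are assumptions:

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch):
    """Breadth-first crawl from `seed`; `fetch(url)` returns HTML or None."""
    seen = {seed}
    queue = deque([seed])
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page missing or not openly accessible
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical toy web: "/d" exists but nothing links to it.
TOY_WEB = {
    "/a": '<a href="/b">b</a> <a href="/c">c</a>',
    "/b": '<a href="/a">a</a> <a href="/missing">gone</a>',
    "/c": "no links here",
    "/d": "unreachable page",
}

if __name__ == "__main__":
    print(sorted(crawl("/a", TOY_WEB.get)))
```

A real crawler would also resolve relative URLs, respect robots exclusion rules and limit its request rate; those are omitted here.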
31Harvesting (e.g., Open Archives Initiative)
Digital libraries
-- provide a brief metadata record for each item (e.g., minimal Dublin Core)
-- support a simple protocol for access to this metadata
Automated harvesters
-- harvest the metadata automatically
-- build automated services
Mainly automated
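A harvester's two steps, requesting metadata and reading the Dublin Core fields, can be sketched as follows. The sketch assumes the request and response conventions of the later OAI-PMH 2.0 specification (`verb=ListRecords`, `metadataPrefix=oai_dc`), which may differ from the protocol version current when these slides were written. The base URL and the sample response are made up, and the response is parsed from a string rather than fetched:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces used by OAI-PMH 2.0 responses and by Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the request a harvester would send to a repository."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Extract identifier, title and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [c.text for c in rec.iterfind(".//dc:creator", NS)],
        })
    return records


# Hypothetical response from a hypothetical repository.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Automated digital libraries</dc:title>
          <dc:creator>Arms, William Y.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://example.org/oai"))
    for record in parse_records(SAMPLE_RESPONSE):
        print(record)
```

The provider's burden is limited to exposing short records in an agreed format, which is why harvesting sits between federation and gathering in cost.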
32Costs and Benefits
33Costs of Automated Digital Libraries
The Google Company
-- 5.5 million searches daily
-- 85 people (half technical, 14 with Ph.D. in computing)
-- 2,500 PCs running Linux, with 80 terabytes of disk
The Internet Archive
-- 7 people plus support from Alexa
(March 2000)
34Overall
If you are rich ...
-- Research libraries, using commercial information services, provide excellent service at very high cost to a favored few
-- Automated digital libraries are far from providing the personal service available to a faculty member at a rich university
but ...
35The Model T Library
The Model T Ford, with mass production, brought
car travel to the masses ...
-- Automated digital libraries, with open access
materials, can already provide good service
at low cost
-- In the future, automated digital libraries
can bring scientific, scholarly, medical
and legal information to everybody
36Some Light Reading
William Y. Arms, "Automated digital libraries."
D-Lib Magazine, July/August 2000.
http://www.dlib.org/dlib/july20/07contents.html
William Y. Arms, "Economic models for
open-access publishing." iMP, March 2000.
http://www.cisp.org/imp/march_2000/03_00arms.htm