Title: LIS 450EP Case Study: The Illinois Digital Library Initiative Project
1. LIS 450EP Case Study: The Illinois Digital Library Initiative Project
- Timothy W. Cole
- William H. Mischo
- t-cole3_at_uiuc.edu, w-mischo_at_uiuc.edu
- Grainger Engineering Library Information Center
- University of Illinois at Urbana-Champaign
- http://dli.grainger.uiuc.edu/Publications/WHMischo/LIS450EP/
2. Outline
- Digital Libraries, Publishers, XML, the Scholarly Information Environment.
- The Illinois DLI / D-Lib Testbed Project.
- XML Technologies in Journal Publishing.
- Current work: linking, metadata, metasearch, the Open Archives Initiative Protocol for Metadata Harvesting.
3. References
- Cole, Timothy W., William H. Mischo, Thomas G. Habing, and Robert H. Ferrer. "Using XML and XSLT to Process and Render Online Journals," Library Hi Tech 19, no. 3 (2001): 210-222. Available: http://dx.doi.org/10.1108/07378830110405067
- Shreeves, Sarah L., Joanne S. Kaczmarek, and Timothy W. Cole. "Harvesting Cultural Heritage Metadata Using the OAI Protocol," Library Hi Tech 21, no. 2 (2003): 159-169. Available: http://dx.doi.org/10.1108/07378830310479802
- Lagoze, Carl, and Herbert Van de Sompel. "The making of the Open Archives Initiative Protocol for Metadata Harvesting," Library Hi Tech 21, no. 2 (2003): 118-128. Available: http://dx.doi.org/10.1108/07378830310479776
- XML Schemas for Qualified Dublin Core: see bottom of Web page at http://www.dublincore.org/schemas/xmls/
4. Overview
- We now have the tools to pursue the grand challenges of information retrieval:
- Standard retrieval environment (Web) and interface/client (Web browser).
- Standardized search/retrieval mechanisms (HTTP POST/GET, SQL, Z39.50, OAI).
- Standard language for describing and transforming content and metadata (XML, XSLT, XML Schemas).
- Standard interoperability mechanisms to connect heterogeneous content (HTTP, SOAP, OAI).
5. XML and Publishers
- Tim Gill of Quark: the use of XML could lead to a drop in the cost of Web publishing of 30 to 50% and a significant reduction in the time it takes to produce sites.
- Gill: "I don't believe that there is any innovation in print that is going to save us even 10% in costs."
- AIP all-XML journal.
- Issues and challenges remain.
- Use of XML behind the scenes is commonplace.
6. XML and Publishers
- Vendor-neutral, platform-independent structured information standard.
- Document representation and interchange standard.
- Applications can externalize their data/metadata as XML.
- Based on Document Object Model (DOM), std. OOP-style components (XSLT, CSS, ...).
- Issues with full-text representation: PDF, XML/HTML. Value in indexing, retrieval.
7. The Digital Library
- Digital, Virtual, Electronic Library as network-based library without regard to place and time.
- Digital Collections vs. Digital Library.
- Tendency to call collections and resources DLs.
- IMLS Framework of Guidance for Building Good Digital Collections.
- Emphasis on the integration of collections and creation of DL services (e.g., NSDL).
- Application of standards and protocols enables and facilitates development of services.
8. Scholarly Communication Overview
- Web-based E-Resources still publisher-centric.
- Not user-centric or topic-centric
- Growth of Heterogeneous Distributed Repositories.
- Value-added services and branding of journals.
- Prestige of Journals and Publishers
- Reciprocal linking relationships between publishers.
- Cooperation on linking standards (DOI, CrossRef).
- Alternative publishing models: academia (e.g., SPARC), preprint servers, disintermediation.
9. Full-Text Technologies
- Continuum of Web-enabled technologies presently being utilized.
- Evolving technologies and standards.
- Role and history of markup.
- Increasing role and importance of XML.
- Towards a Smart Document
11. Distributed Repositories
- Current Resources
- Publisher repositories; A&Is (remote and local); course management systems; OAI and preprint servers; Web search engines; vendor portals; institutional repositories.
- Goal for distributed repositories: integration of discrete publisher repositories, locally loaded full-text, local and remote A&I services, OPAC, Web resources, and local data.
12. Distributed Repository - Needs
- Support simultaneous searching of A&I Services, Distributed Repositories, OPACs, Web search engines, local files. Integrate TOC, full-text.
- Remote Reference 24 x 7.
- Metadata harvesting.
- Digital archiving.
- Local Resolver services for locally loaded or Aggregator Resources.
13. Illinois Testbed Project
- Funded under DLI-I by NSF, DARPA, and NASA, 1994-1998. Awards made to 6 universities.
- Large-scale Testbed, Distributed Repository models, evaluation, Web software.
- Funded under CNRI D-Lib Test Suite Program, 1998-2001.
- Collaborating Partners Program: AIP, APS, ASCE, IEE, NRL, ASM, ACM, NTT Learning Systems, Elsevier.
- All-XML Journals: AIP, APS, ACM.
14. Illinois Testbed
- American Institute of Physics--APL, JAP, RSI
- 18,000 articles, 1995--.
- American Physical Society--PRL
- 14,000 articles, 1995--, weekly updates.
- ASCE Journals (25 titles)
- 10,000 articles, 1995--.
- IEE Proceedings and Electronics Letters
- 8,500 articles, 1993--.
- IEEE Computer Society.
- ASM (American Society for Materials) Handbook.
- ACM (Association for Computing Machinery) Transactions.
- Elsevier Science.
16. Project Issues
- Evolution of the Document.
- Distributed information environment.
- Use of Metalanguages and Transformations (SGML, XML).
- Searching over full-text of journals vs. document surrogates in A&I format.
- Rendering and styling (SGML, XML, MathML).
- Dynamic metadata for normalization, linking.
- Breadth and depth of collections.
- User needs.
17. Accomplishments
- Process and retrieve from multiple publishers' heterogeneous DTDs.
- Metadata specification that uses RDF, Qualified Dublin Core, XML Schemas, XML Namespaces.
- Cross-repository searching (Testbed and D-Lib Test Suite). Full-text and metadata.
- SGML to XML conversion.
- XSLT and CSS for transformation and rendering, including mathematics.
18. Accomplishments (2)
- Linking: forward/backward within Testbed, from/to A&I Services.
- Conversion of ISO 12083 math markup to MathML; rendering of MathML.
- Enhanced Web retrieval mechanisms: Author Word Wheels, Co-Occurrence Matrices.
- Detailed user transaction logs, gathered at the search argument level, with identification of characteristics of each user search session.
- Simultaneous search within DeLiver of Testbed repositories, A&Is, NCSTRL, ...
20. Ongoing Investigations
- Support federated/broadcast searching of A&I Services, Distributed Repositories, enhanced navigation, expanded gateway functions.
- Interoperability models, e.g., metadata harvesting vs. federated (broadcast) searching.
- Z39.50 protocols, HTTP harvesting, spider technology (gathering).
- E-Journal Archiving (AIP).
- Local link server with context-sensitive resources.
- MathML and other ENTS (Essential Non-Text Stuff).
21. XML Parser APIs: Tree-Based and Event-Based
- DOM (Document Object Model for XML and HTML).
- DOM Level 1 and Level 2: W3C recommendations. Widely implemented, tree-based. Hierarchy of nodes. Loads entire document into memory. Level 2 adds namespace support, traversal, stylesheets, events, triggers. Level 3: W3C candidate recommendation. Parsers allow developers to iterate through documents and change document content.
- SAX (Simple API for XML).
- Open-source, not W3C. Initially Java-based. Event-based: fires events as it reads the document, so it need not load the entire document into memory. Good for single-pass processing. Xerces, XML4C, Sun Project X (Crimson), MSXML.
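The contrast between the two parser styles can be sketched with Python's standard library, which ships both a DOM implementation (xml.dom.minidom) and a SAX driver (xml.sax). This is only an illustration: the sample document and the function names are hypothetical, and the slide's parsers (Xerces, MSXML, etc.) expose the same two models through their own APIs.

```python
import xml.dom.minidom
import xml.sax

# Hypothetical sample document, used only for illustration.
SAMPLE = "<article><title>XML in Journals</title><au>Tom</au><au>Tim</au></article>"


def authors_via_dom(xml_text):
    """Tree-based: build the whole node hierarchy in memory, then query it."""
    doc = xml.dom.minidom.parseString(xml_text)
    return [node.firstChild.data for node in doc.getElementsByTagName("au")]


class AuthorHandler(xml.sax.ContentHandler):
    """Event-based: collect author names as the parser fires events."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._in_au = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "au":
            self._in_au = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so buffer it until endElement.
        if self._in_au:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "au":
            self.authors.append("".join(self._buffer))
            self._in_au = False


def authors_via_sax(xml_text):
    """Single pass over the input; no document tree is kept."""
    handler = AuthorHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.authors


dom_authors = authors_via_dom(SAMPLE)
sax_authors = authors_via_sax(SAMPLE)
```

Both functions return the same list; the difference is that the DOM version can revisit or modify any node afterwards, while the SAX version keeps only what the handler chose to store.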
22. XML Schema and Structure
- DTD
- Original schema representation, defines structural rules for a class of XML documents. Inherited from SGML.
- XML Schema: http://www.w3.org/XML/Schema
- W3C recommendation. Also sets out standardized structure for a class of XML documents. Is coded in XML, can be parsed and edited with standard software. Two separate parts: structures and datatypes.
- Namespaces: http://www.w3.org/TR/REC-xml-names/
- W3C recommendation (1.1 candidate in work). Allows developers to qualify element and attribute names with unique URIs, which avoids recognition errors (name collisions).
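A minimal sketch of how namespace qualification keeps two same-named elements apart, using Python's xml.etree.ElementTree. The record and the http://example.org/publisher namespace are hypothetical; the Dublin Core namespace URI is the published one.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
PUB_NS = "http://example.org/publisher"  # hypothetical namespace for illustration

# Hypothetical record: two different <title> elements, told apart by namespace.
RECORD = """<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:pub="http://example.org/publisher">
  <dc:title>Using XML and XSLT</dc:title>
  <pub:title>Dr.</pub:title>
</record>"""

root = ET.fromstring(RECORD)
prefixes = {"dc": DC_NS, "pub": PUB_NS}

# The prefix is only shorthand; matching is done on the namespace URI.
dc_title = root.find("dc:title", prefixes).text
pub_title = root.find("pub:title", prefixes).text

# ElementTree stores each name in "{uri}local" form.
qualified_tags = [child.tag for child in root]
```

Here `dc_title` is the work's title and `pub_title` is an honorific, even though both elements have the local name `title`.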
23. XML, XSLT, and CSS
- Use XML full-text articles as ordered hierarchy of content objects.
- Generate item-level metadata in XML, using RDF and Dublin Core syntax and semantics.
- XSLT and CSS used to present metadata and articles in either XML or HTML format depending on browser.
- Mathematics rendering using MathML tools (conversion from ISO 12083 to MathML).
- Real-time transformation between XML and HTML using XSLT (scalability issues).
24. XML Linking
- XML Base: http://www.w3.org/TR/xmlbase
- W3C recommendation. Permits use of relative URI path prefixes. Can then shorten references.
- XLink: http://www.w3.org/TR/xlink/
- W3C recommendation. Method for specifying navigational links. Allows enforcement of specific path order through links. xlink:type="simple" corresponds to HTML <a> or <img> tags. May be used with XPointer.
- XInclude: http://www.w3.org/TR/xinclude
- W3C working draft. Copies entire XML documents or selected portions into current document. Uses XPath and XPointer to specify document elements to include. Unlike XML external entities, no DTD is required.
- XML Pointer Language (XPointer): http://www.w3.org/XML/Linking
- Composed of multiple W3C recommendations and working drafts. A language to be used for fragment identifiers in XML. Uses XPath. Permits string searches and range specifiers.
25. Searching and Transformation
- XPath: http://www.w3.org/TR/xpath
- W3C recommendation. Defines pattern-matching syntax used by XSLT and XPointer. Method for selecting data (e.g., nodes, attributes, ...) in a document.
- XSL-FO: http://www.w3.org/TR/xsl/
- W3C recommendation. FO similar to CSS but more powerful for XML document formatting.
- XSLT: http://www.w3.org/TR/xslt
- W3C recommendation (2.0 working draft). Mechanism for transforming XML documents. Can be used for normalization of XML documents from different schemas.
- XML Query: http://www.w3.org/XML/Query
- Composed of multiple W3C working drafts. Designed to bring database-style queries to XML documents.
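As an illustration of XPath-style selection, the sketch below uses the limited XPath subset built into Python's xml.etree.ElementTree (child steps, descendant search with `.//`, and attribute predicates). The article markup is a hypothetical sample, not one of the Testbed DTDs.

```python
import xml.etree.ElementTree as ET

# Hypothetical article fragment, for illustration only.
ARTICLE = """<article>
  <front>
    <title>Sample Article</title>
    <ag>
      <au role="corresp"><fn>Tom</fn><sn>Habing</sn></au>
      <au><fn>Tim</fn><sn>Cole</sn></au>
    </ag>
  </front>
  <body>
    <sect id="s1"><p>First.</p></sect>
    <sect id="s2"><p>Second.</p></sect>
  </body>
</article>"""

root = ET.fromstring(ARTICLE)

# Child-axis path: every surname under front/ag/au, in document order.
surnames = [sn.text for sn in root.findall("./front/ag/au/sn")]

# Attribute predicate: the author marked as corresponding author.
corresponding = root.find(".//au[@role='corresp']/sn").text

# Descendant search plus predicates: sections that carry an id attribute.
section_ids = [s.get("id") for s in root.findall(".//sect[@id]")]
second_section_text = root.find(".//sect[@id='s2']/p").text
```

A full XPath 1.0 engine supports more (functions, axes such as `preceding-sibling`, string tests); this subset is enough to show the node-selection idea.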
26. Converting XML to HTML (XSLT)
- Simple one-to-one conversions: <sect> becomes <span class="sect">
- span.sect { display: block; margin-left: 2em }
- Attribute-based conversions: <emph type="1"> becomes <span class="emph_1">
- span.emph_1 { font-style: italic }
- Generated text, such as punctuation: <ag><au>Tom</au><au>Tim</au><au>Bob</au></ag> becomes Tom, Tim, Bob.
- Rearranged children: <au><sn>Habing</sn><fn>Tom</fn></au> becomes Tom Habing
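The four conversion rules above can be sketched in a few lines. Note that this is a Python stand-in for illustration, not the project's XSLT: it walks the tree with xml.etree.ElementTree and applies one rule per element name, which is roughly what a set of XSLT templates would do. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape


def convert(elem):
    """Return an HTML string for one source element, applying the four rules."""
    if elem.tag == "sect":
        # One-to-one: <sect> becomes <span class="sect">
        return '<span class="sect">' + convert_children(elem) + "</span>"
    if elem.tag == "emph":
        # Attribute-based: <emph type="1"> becomes <span class="emph_1">
        css_class = "emph_" + elem.get("type", "")
        return '<span class="%s">' % escape(css_class) + convert_children(elem) + "</span>"
    if elem.tag == "ag":
        # Generated text: separate authors with ", " and close with "."
        return ", ".join(convert(au) for au in elem.findall("au")) + "."
    if elem.tag == "au":
        # Rearranged children: given name first, then surname, when both exist.
        fn, sn = elem.findtext("fn"), elem.findtext("sn")
        if fn is not None and sn is not None:
            return escape(fn + " " + sn)
    return convert_children(elem)


def convert_children(elem):
    """Emit text and child elements in document order (mixed content)."""
    parts = [escape(elem.text or "")]
    for child in elem:
        parts.append(convert(child))
        parts.append(escape(child.tail or ""))
    return "".join(parts)


author_group = convert(ET.fromstring("<ag><au>Tom</au><au>Tim</au><au>Bob</au></ag>"))
rearranged = convert(ET.fromstring("<au><sn>Habing</sn><fn>Tom</fn></au>"))
section = convert(ET.fromstring('<sect>See <emph type="1">this</emph> now</sect>'))
```

The CSS rules from the slide (span.sect, span.emph_1) would then style the generated spans in the browser.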
27. XSLT: Where Should It Happen?
- Client-side
- IE5, Netscape 7/Mozilla
- Not Netscape 6 and earlier
- IE5 not fully compliant w/ XSLT and XPath standards
- Can reduce the load on your servers
- But performance on low-end clients can be BAD
- Server-side
- Performance could be a problem on busy servers serving large, complex documents
- More control and flexibility over the conversion (metamerge)
- Offline Preconversion
- Best performance
- Not best for dynamic documents (metamerge)
28. Remote Object Access
- Web Services
- Based on XML, SOAP (Simple Object Access Protocol, W3C), UDDI (Universal Description, Discovery, and Integration), and WSDL (Web Services Description Language). Applications are assembled on the fly in XML, exposed to the world, and accessed via the Web from different devices.
- Supported by Microsoft .NET, IBM WebSphere, Sun ONE.
- OCLC looking at implementing Web Services (e.g., for Name Authority lookup).
29. Schemas vs. DTDs
- Both are systems of representing a data model that defines the data's elements and attributes, and the relationships among elements.
- Schemas add namespaces, address limitations of DTDs, and facilitate data-typing.
- W3C XML Schema Working Group: two documents, XML structures and datatypes.
- Alternatives to XML Schema: RELAX NG, Schematron.
30. Examples from DLI / D-Lib
- ACM Search
- XML and XSLT for layered views of content (publisher.toc, journal.toc, XSLT, HTML)
- Transforms of SGML to MathML (PNG image, SGML math, MathML)
- On-the-fly XML to HTML
- Transforms of Qualified DC to Simple DC (Qualified, Simple, XSLT, Alt. XSLT)
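The last example, reducing Qualified DC to Simple DC, can be sketched as follows. This is a Python stand-in for illustration, not the project's XSLT stylesheet: each DCMI Terms refinement is replaced by the simple element it refines, and anything else outside the 15-element set is dropped. The `REFINES` table is a partial list, and the `record` wrapper element and function name are hypothetical.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"

# Partial map: DCMI Terms refinement -> the simple DC element it refines.
REFINES = {
    "alternative": "title",
    "abstract": "description",
    "tableOfContents": "description",
    "created": "date",
    "issued": "date",
    "modified": "date",
    "extent": "format",
    "medium": "format",
    "isPartOf": "relation",
    "references": "relation",
    "spatial": "coverage",
    "temporal": "coverage",
}


def dumb_down(qualified_root):
    """Return a new record holding only simple DC elements."""
    simple = ET.Element("record")  # hypothetical wrapper element
    for child in qualified_root:
        if not child.tag.startswith("{"):
            continue  # no namespace: not a DC element
        ns, _, local = child.tag[1:].partition("}")
        if ns == DC:
            target = local
        elif ns == DCTERMS and local in REFINES:
            target = REFINES[local]
        else:
            continue  # unmapped element: dropped in this sketch
        out = ET.SubElement(simple, "{%s}%s" % (DC, target))
        out.text = child.text
    return simple


# Hypothetical qualified record for illustration.
QUALIFIED = """<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:title>Main Title</dc:title>
  <dcterms:alternative>Other Title</dcterms:alternative>
  <dcterms:issued>2003</dcterms:issued>
  <dcterms:audience>Students</dcterms:audience>
</record>"""

simple_record = dumb_down(ET.fromstring(QUALIFIED))
simple_pairs = [(el.tag.split("}")[1], el.text) for el in simple_record]
```

The qualifier information (for example, that 2003 is an issue date rather than a creation date) is lost, which is the expected cost of falling back to the lowest common denominator.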
31. Linking and Metadata Aggregation
- Digital Object Identifier (DOI) and CrossRef.
- OpenURL and Value-Added Service Components (SFX, Encompass).
- Local Resolver Servers.
- OAI-PMH, Dublin Core (DC) and Qualified DC.
32. Metadata in DLI
- To normalize and augment presentation.
- To normalize searching (e.g., names).
- To store dynamic links.
- Types of links
- Articles referenced by item (Backward).
- Articles that reference the item (Forward).
- A&I records for references and items.
- Other relationships (TOC, other items by author, collaborative data).
- Known item and presumptive linking.
34. Digital Object Identifier (DOI)
- DOI is both a unique identifier of a piece of digital content AND a system to access that content digitally. Persistent object identifier.
- "The ISBN for the 21st Century" -- Norman Paskin.
- DOI system has two main parts (the identifier and a directory system) and a third logical component, a database.
- Developed by AAP (Association of American Publishers), now managed by International DOI Foundation.
- 5 million DOI records in CrossRef.
35. DOI Construction
- First real open standard for content identification.
- DOI is a number that identifies a digital object
- 10.1063/S000369519903216
- 10: Registration Agency Prefix
- 1063: Publisher Prefix
- S000369519903216: Suffix (Publisher-assigned ID)
- Suffix can be SICI or PII.
- The DOI, together with the URL pointing to the digital object, is registered with the International DOI Foundation, e.g.
- 10.1063/333 -> http://www.pubsite.org/apr99/artl1.pdf
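The DOI anatomy described above can be sketched as a small parser. The field names follow the labels used on this slide; the function names are hypothetical, and the proxy address is the dx.doi.org resolver that appears elsewhere in these slides.

```python
def split_doi(doi):
    """Split a DOI into the three parts named on the slide."""
    prefix, slash, suffix = doi.partition("/")
    if not slash or not suffix:
        raise ValueError("a DOI needs a prefix and a suffix separated by '/'")
    agency, dot, publisher = prefix.partition(".")
    if not dot or not publisher:
        raise ValueError("the DOI prefix should look like '10.NNNN'")
    return {
        "registration_agency_prefix": agency,
        "publisher_prefix": publisher,
        "suffix": suffix,  # publisher-assigned; may itself be a SICI or PII
    }


def proxy_url(doi):
    """URL that asks the DOI proxy to redirect to the registered location."""
    return "http://dx.doi.org/" + doi


parts = split_doi("10.1063/S000369519903216")
link = proxy_url("10.1063/S000369519903216")
```

Only the first "/" separates prefix from suffix, so a suffix that contains further slashes is kept whole.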
36. Reference Linking
- Alternatives to DOI
- Proprietary Link Managers (AIP, APS)
- Even then, most still use DOIs as well
- CrossRef Project: major sci-tech professional societies and commercial publishers.
- 252 members
- 9.3 million registered items (journal articles and conference papers).
- Appropriate Copy Problem (OhioLINK, Los Alamos, NRL).
37. Local Resolver
- Issue: directing users to locally held or licensed version of Digital Object (locally loaded or from Aggregator).
- Appropriate Copy problem.
- Additional desire to direct users to local value-added services: local print holdings, interlibrary borrowing, other articles in A&I Services.
- Special Services
- http://g118.grainger.uiuc.edu/linker/
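A local resolver is typically reached by sending it the citation as an OpenURL. The sketch below builds one using the key names of the original (version 0.1) OpenURL draft, such as genre, issn, volume, spage, date, and id. The resolver address and the citation values are hypothetical, and this is not a description of the Illinois link server's actual interface.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical resolver address, for illustration only.
RESOLVER_BASE = "http://resolver.example.edu/linker"


def build_openurl(base, citation, doi=None):
    """Encode citation metadata (and optionally a DOI) as an OpenURL query."""
    params = [(key, value) for key, value in citation.items() if value]
    if doi:
        params.append(("id", "doi:" + doi))
    return base + "?" + urlencode(params)


def read_openurl(url):
    """What a resolver would see: the query parsed back into key/value pairs."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


# Hypothetical citation values.
citation = {"genre": "article", "issn": "1234-5678", "volume": "74", "spage": "100", "date": "1999"}
example_url = build_openurl(RESOLVER_BASE, citation, doi="10.1063/1234")
received = read_openurl(example_url)
```

Given these fields, the resolver can decide which copy is appropriate for the user (locally loaded, aggregator, or publisher) and which local value-added services to offer.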
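38. DOI Proxy
(Diagram: DOI and OpenURL link resolution. Components shown: Client (Web Browser); DOI Proxy / Handle Server (dx.doi.org/10.1063/1234), labeled "Nosfxy Aware"; publisher sites AIP, IEE, and Elsevier; Illinois Local Link Server, with local AIP and IEE content and local value-added services; CrossRef Metadata Database; UIUC Metadata Registry. Arrows are labeled DOI, OpenURL, and Metadata.)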
39. Open Archives Initiative (OAI)
- Version 1 released Jan. 2001, v.2 released June 2002.
- Mechanism for data providers to expose their metadata through an HTTP protocol and a mechanism for harvesting records containing metadata from repositories.
- Roots in e-print archives.
- Lightweight, low-barrier. Easy to implement on standard Web servers to handle OAI protocol requests; need to incorporate into workflow used to create / maintain metadata.
40. OAI Continued
- Requires repositories to support the Dublin Core schema as lowest common denominator.
- Allows communities to expose metadata in other formats as long as records are structured as XML data with corresponding XML schema.
- Application for discipline-specific portals, institutional repositories, NSDL, IMLS.
- Over 250 OAI 2.0 metadata providers.
- http://oai.grainger.uiuc.edu/registry
- OAI extensions in development
- OAI Static Repository Gateway
- OAI Rights
41. How OAI Works
- OAI VERBS
- Identify
- ListMetadataFormats
- ListSets
- ListIdentifiers
- ListRecords
- GetRecord
(Diagram: a Service Provider's HARVESTER sends an HTTP request (OAI verb) to a Metadata Provider's REPOSITORY, which returns an HTTP response (valid XML).)
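The request/response cycle can be sketched as below: the harvester encodes an OAI verb and its arguments in an HTTP GET URL, and the repository answers with XML that the harvester parses. The repository address and the sample response are hypothetical; the verb names, the `metadataPrefix` argument, and the namespace URIs come from OAI-PMH 2.0. No network call is made in this sketch.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_request(base_url, verb, args=None):
    """Encode an OAI verb and its arguments as an HTTP GET URL."""
    params = [("verb", verb)] + sorted((args or {}).items())
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Pull (identifier, titles) pairs out of a ListRecords response."""
    ns = {"oai": OAI_NS, "dc": DC_NS}
    root = ET.fromstring(xml_text)
    results = []
    for record in root.findall("./oai:ListRecords/oai:record", ns):
        identifier = record.findtext("./oai:header/oai:identifier", namespaces=ns)
        titles = [t.text for t in record.findall(".//dc:title", ns)]
        results.append((identifier, titles))
    return results


# Hypothetical repository response, trimmed to one record.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2003-01-01T00:00:00Z</responseDate>
  <request verb="ListRecords" metadataPrefix="oai_dc">http://repository.example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:1</identifier>
        <datestamp>2002-12-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

request_url = build_request("http://repository.example.org/oai", "ListRecords",
                            {"metadataPrefix": "oai_dc"})
harvested = parse_records(SAMPLE_RESPONSE)
```

A real harvester would fetch `request_url`, follow any resumptionToken in the response to get further batches, and use Identify and ListMetadataFormats first to learn what the repository supports.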
43. Metadata Schemas Used by OAI Metadata Providers
44. Illinois-Mellon OAI Project
- Funded to create a web portal to scholarly information resources in cultural heritage harvested via OAI-PMH.
- Primary objectives
- Build harvesting and search service
- Investigate viability and utility of searching OAI-harvested resources
- Explore issues of advanced search/indexing/display
- Document user needs and usage patterns
- Identify critical issues and best practices for using OAI-PMH with cultural heritage material
45. Technical achievements (Mellon)
- Developed harvesting tools (open source)
- Refined data provider tools (open source)
- Investigated logistics and scalability of harvesting activities
- Created XSL stylesheets for metadata transformations
- Experimented w/ configurations for scalability and performance issues
46. Metadata aggregation (Mellon)
- 39 providers (OAI-compliant and surrogates)
- Metadata describing resources of 580 institutions
- 1.1 million original records
- 2.6 million including item-level records derived
from EAD finding aids
47. Type of resources (Mellon)
- Hidden web
- Other includes
- archival collections
- websites
- moving images
- audio
- 30% of metadata describes digitized objects (of any type)
48. DC element usage (Mellon)
- Percentage of records containing the subject and description elements, by provider type:

Provider type (providers, records)                          SUBJECT   DESCRIPTION
Digital libraries (10 total, 122,719 records)               78%       36%
Museums, hist. societies, etc. (6 total, 255,800 records)   93%       93%
Academic libraries (7 total, 235,294 records)               15%       13%

- Many different controlled and local vocabularies in use
- Granularity: a record may describe a collection of coins or one coin
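Figures of this kind can be computed by counting, for each element, the records that contain at least one non-empty occurrence. The sketch below shows the arithmetic on four made-up records; the function name and the sample data are hypothetical and unrelated to the project's actual counts.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"


def element_usage(records, element):
    """Percent of records with at least one non-empty dc:<element>."""
    if not records:
        return 0.0
    tag = "{%s}%s" % (DC_NS, element)
    hits = sum(
        1 for record in records
        if any((node.text or "").strip() for node in record.iter(tag))
    )
    return 100.0 * hits / len(records)


def make_record(**fields):
    """Build a tiny hypothetical DC record from keyword arguments."""
    record = ET.Element("record")
    for name, value in fields.items():
        ET.SubElement(record, "{%s}%s" % (DC_NS, name)).text = value
    return record


# Four made-up records: three carry a subject, one carries a description.
sample = [
    make_record(title="Coin", subject="Numismatics"),
    make_record(title="Coin collection", subject="Numismatics", description="Roman coins"),
    make_record(title="Letter", subject="Correspondence"),
    make_record(title="Untitled"),
]

subject_pct = element_usage(sample, "subject")
description_pct = element_usage(sample, "description")
```

Empty elements are not counted as usage, which matters when providers emit placeholder tags for fields they do not populate.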
49. Related ongoing and future work
- Test usability with targeted user community
- Linking resources
- Including linking using MathML
- Simultaneous search, automated metadata generation, automated metadata normalization
- NSF National Science Digital Library Projects
- Mathematics resources and MathML
- Combining sci-tech journals with other Web resources
- Additional OAI Implementations
- IMLS NLG
- CIC
- DLF - DODL
50. Open Issues
- Role of Authors, Academic Institutions, Libraries, Publishers, Abstracting & Indexing Services.
- Disintermediation may affect both Libraries and Publishers.
- Information as Function, not Place.
- Provide Digital Library services built atop digital collections.
- Role of XML technology.
- Service mechanisms: processing and archiving, search and discovery, presentation, linking.