LIS 450EP Case Study: The Illinois Digital Library Initiative Project


1
LIS 450EP Case Study: The Illinois Digital
Library Initiative Project
  • Timothy W. Cole
  • William H. Mischo
  • t-cole3@uiuc.edu, w-mischo@uiuc.edu
  • Grainger Engineering Library Information Center
  • University of Illinois at Urbana-Champaign
  • http://dli.grainger.uiuc.edu/Publications/WHMischo/LIS450EP/

2
Outline
  • Digital Libraries, Publishers, XML, the
    Scholarly Information Environment.
  • The Illinois DLI / D-Lib Testbed Project.
  • XML Technologies in Journal Publishing
  • Current work: linking, metadata, metasearch,
    the Open Archives Initiative Protocol for
    Metadata Harvesting.

3
References
  • Cole, Timothy W., William H. Mischo, Thomas G.
    Habing, and Robert H. Ferrer. "Using XML and XSLT
    to Process and Render Online Journals." Library
    Hi Tech 19, no. 3 (2001): 210-222. Available:
    http://dx.doi.org/10.1108/07378830110405067
  • Shreeves, Sarah L., Joanne S. Kaczmarek, and
    Timothy W. Cole. "Harvesting Cultural Heritage
    Metadata Using the OAI Protocol." Library Hi Tech
    21, no. 2 (2003): 159-169. Available:
    http://dx.doi.org/10.1108/07378830310479802
  • Lagoze, Carl, and Herbert Van de Sompel. "The
    Making of the Open Archives Initiative Protocol
    for Metadata Harvesting." Library Hi Tech 21, no.
    2 (2003): 118-128. Available:
    http://dx.doi.org/10.1108/07378830310479776
  • XML Schemas for Qualified Dublin Core: see bottom
    of Web page at
    http://www.dublincore.org/schemas/xmls/

4
Overview
  • We now have the tools to pursue the grand
    challenges of information retrieval:
  • Standard retrieval environment (Web) and
    interface/client (Web Browser).
  • Standardized search/retrieval mechanisms (HTTP
    Post/Get, SQL, Z39.50, OAI).
  • Standard language for describing and transforming
    content and metadata (XML, XSLT, XML Schemas).
  • Standard interoperability mechanisms to connect
    heterogeneous content (HTTP, SOAP, OAI).

5
XML and Publishers
  • Tim Gill of Quark: the use of XML could lead to
    a drop in the cost of Web publishing by 30 to
    50% and a significant reduction in the time it
    takes to produce sites.
  • Gill: "I don't believe that there is any
    innovation in print that is going to save us even
    10% in costs."
  • AIP all-XML Journal
  • Issues and Challenges remain.
  • Use of XML behind the scenes commonplace

6
XML and Publishers
  • Vendor-Neutral, platform-independent structured
    information standard.
  • Document representation interchange standard.
  • Applications can externalize their data/metadata
    as XML.
  • Based on Document Object Model (DOM), std.
    OOP-style components (XSLT, CSS, ...)
  • Issues with full-text representation: PDF,
    XML/HTML. Value in indexing, retrieval.

7
The Digital Library
  • Digital, Virtual, Electronic Library as
    network-based library without regard to place and
    time.
  • Digital Collections vs. Digital Library.
  • Tendency to call collections and resources DLs.
  • IMLS Framework of Guidance for Building Good
    Digital Collections
  • Emphasis on the integration of collections and
    creation of DL services (e.g., NSDL).
  • Application of standards and protocols enables
    and facilitates development of services.

8
Scholarly Communication Overview
  • Web-based E-Resources still publisher-centric.
  • Not user-centric or topic-centric
  • Growth of Heterogeneous Distributed Repositories.
  • Value-added services and branding of journals.
  • Prestige of Journals and Publishers
  • Reciprocal linking relationships between
    publishers.
  • Cooperation on linking standards (DOI, CrossRef).
  • Alternative publishing models - Academia (e.g.,
    SPARC), Preprint Servers, disintermediation.

9
Full-Text Technologies
  • Continuum of Web-Enabled technologies presently
    being utilized.
  • Evolving technologies and standards.
  • Role and history of markup.
  • Increasing role and importance of XML.
  • Towards a Smart Document

11
Distributed Repositories
  • Current Resources
  • Publisher repositories; A&I services (remote and
    local); course management systems; OAI and
    preprint servers; Web search engines; vendor
    portals; institutional repositories.
  • Goal for distributed repositories: integration
    of discrete publisher repositories, locally
    loaded full-text, local and remote A&I
    services, OPAC, Web resources, and local data.

12
Distributed Repository - Needs
  • Support simultaneous searching of A&I Services,
    Distributed Repositories, OPACs, Web search
    engines, local files. Integrate TOC, full-text.
  • Remote Reference 24 X 7.
  • Metadata harvesting
  • Digital archiving.
  • Local Resolver services for locally loaded or
    Aggregator Resources.

13
Illinois Testbed Project
  • Funded under DLI-I by NSF, DARPA, and NASA,
    1994--1998. Awards made to 6 universities.
  • Large-scale Testbed, Distributed Repository
    models, evaluation, Web software.
  • Funded under CNRI D-Lib Test Suite Program,
    1998--2001.
  • Collaborating Partners Program. AIP, APS, ASCE,
    IEE, NRL, ASM, ACM, NTT Learning Systems,
    Elsevier.
  • All XML Journals -- AIP, APS, ACM.

14
Illinois Testbed
  • American Institute of Physics--APL, JAP, RSI
  • 18,000 articles, 1995--.
  • American Physical Society--PRL
  • 14,000 articles, 1995--, weekly updates.
  • ASCE Journals (25 titles)
  • 10,000 articles, 1995--.
  • IEE Proceedings and Electronics Letters
  • 8,500 articles, 1993--.
  • IEEE Computer Society.
  • ASM International (formerly the American Society
    for Metals) Handbook.
  • ACM (Association for Computing Machinery)
    Transactions.
  • Elsevier Science.

16
Project Issues
  • Evolution of the Document.
  • Distributed information environment.
  • Use of Metalanguages & Transformations (SGML,
    XML).
  • Searching over full-text of journals vs. document
    surrogates in A&I format.
  • Rendering and styling (SGML, XML, MathML).
  • Dynamic metadata for normalization, linking.
  • Breadth and depth of collections.
  • User needs.

17
Accomplishments
  • Process & retrieve content from multiple
    publishers with heterogeneous DTDs.
  • Metadata specification that uses RDF, Qualified
    Dublin Core, XML Schemas, XML Namespaces.
  • Cross-repository searching (Testbed & D-Lib Test
    Suite). Full-Text and Metadata.
  • SGML to XML Conversion.
  • XSLT and CSS for transformation & rendering,
    including Mathematics.

18
Accomplishments (2)
  • Linking: Forward/Backward within Testbed, from/to
    A&I Services.
  • Conversion of ISO 12083 math markup to MathML;
    rendering of MathML.
  • Enhanced Web retrieval mechanisms: Author Word
    Wheels, Co-Occurrence Matrices.
  • Detailed user transaction logs, gathered at the
    search argument level, with identification of
    characteristics of each user search session.
  • Simultaneous search within DeLiver of Testbed
    repositories, A&Is, NCSTRL, ...

20
Ongoing Investigations
  • Support federated/broadcast searching of A&I
    Services, Distributed Repositories, enhanced
    navigation, expanded gateway functions.
  • Interoperability models, e.g., Metadata
    harvesting vs. Federated (Broadcast)
  • Z39.50 protocols, HTTP harvesting, Spider
    technology (gathering).
  • E-Journal Archiving (AIP).
  • Local link server with context-sensitive
    resources.
  • MathML & other ENTS (Essential Non-Text Stuff)

21
XML Parser APIs: Tree-Based and Event-Based
  • DOM (Document Object Model for XML & HTML).
  • DOM Level 1 and Level 2: W3C recommendations.
    Widely implemented, tree-based. Hierarchy of
    nodes. Loads entire document into memory. Level 2
    adds namespace support, traversal, stylesheets,
    events, triggers. Level 3: W3C candidate
    recommendation. Parsers allow developers to
    iterate through documents and change document
    content.
  • SAX (Simple API for XML).
  • Open-source, not W3C. Initially Java-based.
    Event-based, fires events as it reads document,
    need not load entire document into memory. Good
    for single-pass processing. Xerces, XML4C, Sun
    Project X (Crimson), MSXML.
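
The tree-based versus event-based distinction above can be sketched with Python's standard library, which ships both a DOM (xml.dom.minidom) and a SAX (xml.sax) implementation. This is only an illustration, not the Testbed's code: the sample document and its element names (article, title, au) are made up.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document; element names are illustrative only.
SAMPLE = "<article><title>XML and XSLT</title><au>Cole</au><au>Mischo</au></article>"

# Tree-based (DOM): the whole document is loaded into memory as a node
# hierarchy, which can then be navigated and modified in place.
dom = minidom.parseString(SAMPLE)
dom_authors = [au.firstChild.data for au in dom.getElementsByTagName("au")]
dom.getElementsByTagName("title")[0].firstChild.data = "Edited title"
edited_title = dom.getElementsByTagName("title")[0].firstChild.data


class AuthorHandler(xml.sax.ContentHandler):
    """Event-based (SAX): collect <au> text as parse events arrive."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._in_au = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "au":
            self._in_au = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so buffer it.
        if self._in_au:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "au":
            self.authors.append("".join(self._buffer))
            self._in_au = False


handler = AuthorHandler()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
sax_authors = handler.authors
```

Both approaches recover the same author list; only the DOM version can change the document, while the SAX version never holds more than the current element's text.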

22
XML Schema and Structure
  • DTD
  • Original schema representation, defines
    structural rules for a class of XML documents.
    Inherited from SGML.
  • XML Schema: http://www.w3.org/XML/Schema
  • W3C recommendation. Also sets out standardized
    structure for class of XML documents. Is coded in
    XML, can be parsed and edited with standard
    software. Two separate parts: structures and
    datatypes.
  • Namespaces: http://www.w3.org/TR/REC-xml-names/
  • W3C recommendation (1.1 candidate in work). Allows
    developers to qualify element and attribute names
    with unique URIs, avoids recognition errors.
23
XML, XSLT, and CSS
  • Use XML full-text articles as ordered hierarchy
    of content objects.
  • Generate item-level metadata in XML, using RDF
    and Dublin Core syntax and semantics.
  • XSLT and CSS used to present metadata and
    articles in either XML or HTML format depending
    on Browser.
  • Mathematics rendering using MathML tools
    (conversion from ISO 12083 to MathML).
  • Real-time transformation between XML and HTML
    using XSLT (scalability issues).

24
XML Linking
  • XML Base: http://www.w3.org/TR/xmlbase
  • W3C recommendation. Permits use of relative URI
    path prefixes. Can then shorten references.
  • XLink: http://www.w3.org/TR/xlink/
  • W3C recommendation. Method for specifying
    navigational links. Allows enforcement of
    specific path order through links.
    xlink:type="simple" corresponds to HTML <a> or
    <img> tags. May be used with XPointer.
  • XInclude: http://www.w3.org/TR/xinclude
  • W3C working draft. Copies entire XML documents
    or selected portions into current document. Uses
    XPath and XPointer to specify document elements
    to include. Unlike XML external entities, no DTD
    is required.
  • XML Pointer Language: http://www.w3.org/XML/Linking
  • Composed of multiple W3C recommendations and
    working drafts. A language to be used for
    fragment identifiers in XML. Uses XPath. Permits
    string searches and range specifiers.

25
Searching and Transformation
  • XPath: http://www.w3.org/TR/xpath
  • W3C recommendation. Defines pattern-matching
    syntax used by XSLT and XPointer. Method for
    selecting data (e.g., nodes, attributes, ...) in a
    document.
  • XSL-FO: http://www.w3.org/TR/xsl/
  • W3C recommendation. FO similar to CSS but more
    powerful for XML document formatting.
  • XSLT: http://www.w3.org/TR/xslt
  • W3C recommendation (2.0 working draft). Mechanism
    for transforming XML documents. Can be used for
    normalization of XML documents from different
    schemas.
  • XML Query: http://www.w3.org/XML/Query
  • Composed of multiple W3C working drafts. Designed
    to bring database-style queries to XML documents.
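
To make the XPath selection idea concrete, here is a minimal sketch using Python's xml.etree.ElementTree, which implements only a subset of XPath 1.0 (child and descendant steps, attribute tests, positional predicates). The sample document and its element names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical journal issue; element and attribute names are illustrative.
DOC = """<issue>
  <article year="2001"><title>A</title><au>Cole</au><au>Mischo</au></article>
  <article year="2003"><title>B</title><au>Shreeves</au></article>
</issue>"""
root = ET.fromstring(DOC)

# Descendant step: every <au> anywhere below the root.
all_authors = [e.text for e in root.findall(".//au")]

# Attribute predicate: titles of articles whose year attribute is 2003.
titles_2003 = [e.text for e in root.findall("./article[@year='2003']/title")]

# Positional predicate: the first <au> child of each article.
first_authors = [e.text for e in root.findall("./article/au[1]")]
```

A full XPath or XSLT processor would accept the same expressions plus functions and axes that ElementTree does not support.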

26
Converting XML to HTML (XSLT)
  • Simple one-to-one conversions: <sect> becomes
    <span class="sect">
  • span.sect { display: block; margin-left: 2em }
  • Attribute-based conversions: <emph type="1">
    becomes <span class="emph_1">
  • span.emph_1 { font-style: italic }
  • Generated text, such as punctuation:
    <ag><au>Tom</au><au>Tim</au><au>Bob</au></ag>
    becomes Tom, Tim, Bob.
  • Rearranged children:
    <au><sn>Habing</sn><fn>Tom</fn></au> becomes
    Tom Habing
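
The four conversion patterns on this slide can be sketched in Python. The slides describe doing this with XSLT; the standard library has no XSLT processor, so the version below applies the same rules with xml.etree.ElementTree instead. The function names (to_html, author_name) and the extra class names "ag" and "au" are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def to_html(node):
    """Convert one source element to an HTML <span>, applying the four rules."""
    if node.tag == "ag":
        # Generated text: join author names with commas, end with a period.
        span = ET.Element("span", {"class": "ag"})
        span.text = ", ".join(author_name(au) for au in node.findall("au")) + "."
        return span
    if node.tag == "au":
        span = ET.Element("span", {"class": "au"})
        span.text = author_name(node)
        return span
    # Attribute-based conversion: <emph type="1"> maps to class "emph_1".
    css_class = node.tag
    if node.tag == "emph" and "type" in node.attrib:
        css_class = f"emph_{node.get('type')}"
    # One-to-one conversion: any other element becomes <span class="tag">.
    span = ET.Element("span", {"class": css_class})
    span.text = node.text
    for child in node:
        converted = to_html(child)
        converted.tail = child.tail  # keep text that follows the child
        span.append(converted)
    return span


def author_name(au):
    """Rearranged children: surname-first markup becomes 'first-name surname'."""
    fn, sn = au.findtext("fn"), au.findtext("sn")
    if fn is None and sn is None:
        return (au.text or "").strip()
    return " ".join(part for part in (fn, sn) if part)


sect = ET.fromstring('<sect>Intro <emph type="1">XML</emph> text</sect>')
sect_html = ET.tostring(to_html(sect), encoding="unicode")

group = ET.fromstring("<ag><au>Tom</au><au>Tim</au><au>Bob</au></ag>")
group_html = ET.tostring(to_html(group), encoding="unicode")

name = author_name(ET.fromstring("<au><sn>Habing</sn><fn>Tom</fn></au>"))
```

The CSS rules from the slide (span.sect, span.emph_1) would then style the generated spans in the browser.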

27
XSLT: Where Should It Happen?
  • Client-side
  • IE5, Netscape 7/Mozilla
  • Not Netscape 6 and earlier
  • IE5 not fully compliant w/ XSLT and XPath
    standard
  • Can reduce the load on your servers
  • But performance on low-end clients can be BAD
  • Server-side
  • Performance could be a problem on busy servers,
    serving large, complex documents
  • More control & flexibility over the conversion
    (metamerge)
  • Offline Preconversion
  • Best performance
  • Not best for dynamic documents (metamerge)

28
Remote Object Access
  • Web Services
  • Based on XML, SOAP (Simple Object Access
    Protocol, W3C), UDDI (Universal Description,
    Discovery, and Integration), and WSDL (Web
    Services Description Language). Applications are
    assembled on the fly in XML, exposed to the
    world, and accessed via the Web from different
    devices.
  • Supported by Microsoft .NET, IBM WebSphere, Sun
    ONE.
  • OCLC looking at implementing Web Services (e.g.,
    for Name Authority lookup)

29
Schemas vs. DTDs
  • Both are systems of representing a data model
    that defines the data's elements and attributes,
    and the relationships among elements.
  • Schemas add namespaces, address limitations of
    DTDs, and facilitate data typing.
  • W3C XML Schema Working Group: two documents,
    Structures and Datatypes.
  • Alternatives to XML Schema: RELAX NG, Schematron

30
Examples from DLI / D-Lib
  • ACM Search
  • XML & XSLT for layered views of content
    (publisher.toc, journal.toc, XSLT, HTML)
  • Transforms of SGML to MathML (png image, SGML
    math, MathML)
  • On the fly XML to HTML
  • Transforms of Qualified DC to Simple DC
    (Qualified, Simple, XSLT, Alt. XSLT)
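
As a rough illustration of the Qualified DC to Simple DC transform listed above: the slides use XSLT stylesheets for this, which are not reproduced in the transcript, so the sketch below shows the same "dumb-down" idea in Python. The refinement map is a partial selection of DCMI term refinements, and the record wrapper element and function name are hypothetical.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"

# Partial map from a qualified (refined) term to the simple DC element it
# refines. Illustrative subset only, not the project's actual mapping.
REFINES = {
    "alternative": "title",
    "abstract": "description",
    "tableOfContents": "description",
    "created": "date",
    "issued": "date",
    "modified": "date",
    "extent": "format",
    "medium": "format",
    "isPartOf": "relation",
    "references": "relation",
    "isReferencedBy": "relation",
    "spatial": "coverage",
    "temporal": "coverage",
}


def dumb_down(record):
    """Return a new record holding only simple DC elements."""
    simple = ET.Element("record")
    for child in record:
        if not child.tag.startswith("{"):
            continue  # skip elements without a namespace
        namespace, _, local = child.tag[1:].partition("}")
        if namespace == DC:
            name = local
        elif namespace == DCTERMS and local in REFINES:
            name = REFINES[local]
        else:
            continue  # no simple DC equivalent in this sketch
        out = ET.SubElement(simple, f"{{{DC}}}{name}")
        out.text = child.text
    return simple


qualified = ET.fromstring(
    f'<record xmlns:dc="{DC}" xmlns:dcterms="{DCTERMS}">'
    "<dc:title>T</dc:title>"
    "<dcterms:alternative>Alt</dcterms:alternative>"
    "<dcterms:issued>2003</dcterms:issued>"
    "<dcterms:audience>students</dcterms:audience>"
    "</record>"
)
simple_fields = [(el.tag.split("}", 1)[1], el.text) for el in dumb_down(qualified)]
```

Terms with no simple equivalent in the map (dcterms:audience here) are dropped, which is the information loss that dumbing down accepts in exchange for interoperability.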

31
Linking & Metadata Aggregation
  • Digital Object Identifier (DOI) and CrossRef.
  • OpenURL and Value-Added Service Components (SFX,
    Encompass).
  • Local Resolver Servers.
  • OAI-PMH, Dublin Core (DC) & Qualified DC.

32
Metadata in DLI
  • To normalize & augment presentation.
  • To normalize searching (e.g., names).
  • To store dynamic links.
  • Types of links:
  • Articles referenced by the item (Backward).
  • Articles that reference the item (Forward).
  • A&I Records for references and items.
  • Other relationships (TOC, Other items by Author,
    Collaborative Data).
  • Known item and presumptive linking.
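
The relationship between backward and forward links can be shown with a small sketch: once each item's reference list (backward links) is stored as metadata, forward links follow by inverting that mapping. The article identifiers and the function name are hypothetical; the slides do not describe how the Testbed computed these links.

```python
from collections import defaultdict

# Hypothetical backward links: each item maps to the items it references.
backward = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": [],
    "D": ["A", "C"],
}


def forward_links(backward_links):
    """Invert backward links: for each item, list the items that cite it."""
    forward = defaultdict(list)
    for citing, cited_items in backward_links.items():
        for cited in cited_items:
            forward[cited].append(citing)
    return {item: sorted(citers) for item, citers in forward.items()}


forward = forward_links(backward)
```

Because new citing articles keep arriving, the forward list for an item changes over time, which is one reason to hold such links in dynamic metadata rather than in the article itself.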

34
Digital Object Identifier (DOI)
  • DOI is both a unique identifier of a piece of
    digital content AND a system to access that
    content digitally. Persistent object identifier.
  • The ISBN for the 21st Century -- Norman Paskin.
  • DOI system has two main parts (the identifier
    and a directory system) and a third logical
    component, a database.
  • Developed by AAP (Association of American
    Publishers), now managed by International DOI
    Foundation.
  • 5 million DOI records in CrossRef

35
DOI Construction
  • First real open standard for content
    identification.
  • DOI is a number that identifies a digital object:
  • 10.1063/S000369519903216
  • 10: Registration Agency Prefix
  • 1063: Publisher Prefix
  • S000369519903216: Suffix (Publisher-assigned ID)
  • Suffix can be SICI or PII.
  • The DOI, and the URL pointing to the digital
    object, are registered with the International DOI
    Foundation, e.g.:
  • 10.1063/333 -> http://www.pubsite.org/apr99/artl1.pdf
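
The DOI structure described above can be illustrated with a short parsing sketch. The dictionary keys follow this slide's labels; the function names are made up, and the proxy address dx.doi.org is the one shown elsewhere in these slides.

```python
from urllib.parse import quote


def split_doi(doi):
    """Split a DOI into the three parts named on the slide."""
    prefix, slash, suffix = doi.partition("/")  # suffix may itself contain "/"
    if not slash or not suffix:
        raise ValueError(f"not a DOI (missing suffix): {doi!r}")
    agency, dot, publisher = prefix.partition(".")
    if agency != "10" or not dot or not publisher:
        raise ValueError(f"not a DOI (bad prefix): {doi!r}")
    return {
        "agency_prefix": agency,        # "10"
        "publisher_prefix": publisher,  # e.g. "1063"
        "suffix": suffix,               # publisher-assigned ID
    }


def proxy_url(doi):
    """Build the URL that asks the DOI proxy to resolve the identifier."""
    return "http://dx.doi.org/" + quote(doi, safe="/")


parts = split_doi("10.1063/S000369519903216")
url = proxy_url("10.1063/S000369519903216")
```

Splitting on the first slash only matters because publisher-assigned suffixes such as SICIs can contain further slashes and punctuation.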

36
Reference Linking
  • Alternatives to DOI
  • Proprietary Link Managers (AIP, APS)
  • Even then, most still use DOIs as well
  • CrossRef Project: major Sci-Tech professional
    societies and commercial publishers.
  • 252 members
  • 9.3 million registered items (journal articles &
    conference papers).
  • Appropriate Copy Problem (OhioLink, Los Alamos,
    NRL).

37
Local Resolver
  • Issue: directing users to locally held or
    licensed version of Digital Object (locally
    loaded or from Aggregator).
  • Appropriate Copy problem.
  • Additional desire to direct users to local
    value-added services: local print holdings,
    interlibrary borrowing, other articles in A&I
    Services.
  • Special Services
  • http://g118.grainger.uiuc.edu/linker/
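
A local resolver of this kind is typically driven by an OpenURL: a link that carries citation metadata to whichever resolver the user's institution runs. The sketch below builds an OpenURL 0.1-style query in Python. The resolver address is hypothetical (it is not the Grainger linker's actual interface), and the citation values are taken from the Cole et al. reference cited earlier, with the ISSN added for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical local resolver base URL.
RESOLVER = "http://resolver.example.edu/linker"


def make_openurl(base, **fields):
    """Append citation metadata to a resolver base URL as a query string."""
    query = urlencode({key: value for key, value in fields.items() if value})
    return f"{base}?{query}"


openurl = make_openurl(
    RESOLVER,
    genre="article",
    issn="0737-8831",
    volume="19",
    issue="3",
    spage="210",
    date="2001",
    id="doi:10.1108/07378830110405067",
)

# The resolver side would parse the same fields back out of the request.
received = parse_qs(urlsplit(openurl).query)
```

Because the metadata travels in the link rather than a fixed target URL, the same citation can lead one user to a locally loaded copy and another to an aggregator or to print holdings.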

38
DOI Proxy
[Diagram; only its labels survive: Cookie on client; Client (Web Browser); OpenURL; Handle Server (dx.doi.org/10.1063/1234); AIP, IEE, Elsevier; "Nosfxy" and "Aware"; Local AIP, IEE; Local Value Added; Illinois Local Link Server; DOI; CrossRef Metadata Database; Metadata; UIUC Metadata Registry.]

39
Open Archives Initiative (OAI)
  • Version 1.0 released Jan. 2001; version 2.0
    released June 2002.
  • Mechanism for data providers to expose their
    metadata through an HTTP protocol and a mechanism
    for harvesting records containing metadata from
    repositories.
  • Roots in e-print archives.
  • Lightweight, low-barrier. Easy to implement on
    standard Web servers to handle OAI protocol
    requests; need to incorporate into the workflow
    used to create / maintain metadata.

40
OAI Continued
  • Requires repositories to support the Dublin Core
    schema as lowest common denominator.
  • Allows communities to expose metadata in other
    formats as long as records are structured as XML
    data with corresponding XML schema.
  • Application for discipline specific portals,
    institutional repositories, NSDL, IMLS
  • Over 250 OAI 2.0 metadata providers.
  • http://oai.grainger.uiuc.edu/registry
  • OAI extensions in development
  • OAI Static Repository Gateway
  • OAI Rights
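
Since every repository must be able to return unqualified Dublin Core (metadata prefix oai_dc), a harvester can rely on one parsing path for all providers. The sketch below reads an abridged, hand-written GetRecord response in Python; the three namespace URIs are the ones defined by OAI-PMH 2.0 and Dublin Core, while the identifier, record content, and function name are invented.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Abridged, hypothetical response (responseDate and request elements omitted).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:example.org:coin-42</identifier>
        <datestamp>2003-05-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Roman coin</dc:title>
          <dc:subject>Numismatics</dc:subject>
          <dc:subject>Coins, Roman</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""


def extract_records(xml_text):
    """Return (identifier, {dc element name: [values]}) for each record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        fields = {}
        for element in record.iterfind("oai:metadata/oai_dc:dc/*", NS):
            name = element.tag.split("}", 1)[1]  # strip the "{uri}" part
            fields.setdefault(name, []).append(element.text)
        records.append((identifier, fields))
    return records


harvested = extract_records(SAMPLE)
```

Values are kept as lists because every Dublin Core element is repeatable, as the two subject entries show.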

41
How OAI Works
  • OAI VERBS
  • Identify
  • ListMetadataFormats
  • ListSets
  • ListIdentifiers
  • ListRecords
  • GetRecord

[Diagram: the Service Provider (harvester) sends an HTTP request carrying an OAI verb to the Metadata Provider (repository), which returns an HTTP response as valid XML.]
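
The request side of this exchange is simple enough to sketch: an OAI-PMH request is an ordinary HTTP GET (or POST) whose query string names one of the six verbs plus its arguments. The Python below builds such URLs and checks the arguments OAI-PMH 2.0 requires for each verb; the repository base URL, identifier, and function name are hypothetical, and no network call is made.

```python
from urllib.parse import urlencode

# Hypothetical repository base URL.
BASE_URL = "http://repository.example.org/oai"

# Arguments that OAI-PMH 2.0 requires for each verb.
REQUIRED = {
    "Identify": set(),
    "ListMetadataFormats": set(),
    "ListSets": set(),
    "ListIdentifiers": {"metadataPrefix"},
    "ListRecords": {"metadataPrefix"},
    "GetRecord": {"identifier", "metadataPrefix"},
}


def oai_request(base_url, verb, **args):
    """Build the request URL for one OAI-PMH verb."""
    if verb not in REQUIRED:
        raise ValueError(f"unknown OAI verb: {verb}")
    missing = REQUIRED[verb] - set(args)
    # A resumptionToken continues an earlier list request and replaces
    # the other arguments.
    if missing and "resumptionToken" not in args:
        raise ValueError(f"{verb} requires: {sorted(missing)}")
    return base_url + "?" + urlencode({"verb": verb, **args})


identify_url = oai_request(BASE_URL, "Identify")
list_url = oai_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
record_url = oai_request(
    BASE_URL, "GetRecord",
    identifier="oai:example.org:coin-42", metadataPrefix="oai_dc",
)
```

A harvester would fetch each URL, parse the XML response, and repeat list requests with the returned resumptionToken until the repository signals that the list is complete.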
43
Metadata Schemas Used By OAI Metadata Providers
44
Illinois-Mellon OAI Project
  • Funded to create a web portal to scholarly
    information resources in cultural heritage
    harvested via OAI-PMH
  • Primary objectives
  • Build harvesting and search service
  • Investigate viability and utility of searching
    OAI harvested resources
  • Explore issues of advanced search/indexing/display
  • Document user needs & usage patterns
  • Identify critical issues and best practices for
    using OAI-PMH with cultural heritage material

45
Technical achievements (Mellon)
  • Developed harvesting tools (OpenSource)
  • Refined data provider tools (OpenSource)
  • Investigated logistics and scalability of
    harvesting activities
  • Created XSL stylesheets for metadata
    transformations
  • Experimented w/configurations for scalability and
    performance issues

46
Metadata aggregation (Mellon)
  • 39 providers (OAI-compliant and surrogates)
  • Metadata describing resources of 580 institutions
  • 1.1 million original records
  • 2.6 million including item-level records derived
    from EAD finding aids

47
Type of resources (Mellon)
  • Hidden web
  • Other includes
  • archival collections
  • websites
  • moving images
  • audio
  • 30% of metadata describes digitized objects (of
    any type)

48
DC element usage (Mellon)
  • Percentage of records containing subject and
    description elements:

Provider type                                              SUBJECT  DESCRIPTION
Digital libraries (10 total, 122,719 records)                78%       36%
Museums, hist. societies, etc. (6 total, 255,800 records)    93%       93%
Academic libraries (7 total, 235,294 records)                15%       13%

  • Many different controlled and local vocabularies
    in use
  • Granularity: a record may describe a collection
    of coins or one coin

49
Related ongoing & future work
  • Test usability with targeted user community
  • Linking resources
  • Including linking using MathML
  • Simultaneous search, automated metadata
    generation, automated metadata normalization
  • NSF National Science Digital Library Projects
  • Mathematics resources MathML
  • Combining sci-tech journals with other Web
    resources
  • Additional OAI Implementations
  • IMLS NLG
  • CIC
  • DLF - DODL

50
Open Issues
  • Role of Authors, Academic Institutions,
    Libraries, Publishers, Abstracting & Indexing
    Services.
  • Disintermediation may affect both Libraries and
    Publishers.
  • Information as Function not Place.
  • Provide Digital Library services built atop
    digital collections.
  • Role of XML technology.
  • Service mechanisms: processing & archiving,
    search and discovery, presentation, linking.