Mixed content, mixed metadata: Information discovery in the NSDL - PowerPoint PPT Presentation

Slides: 23
Provided by: bobm181

Transcript and Presenter's Notes


1
Mixed content, mixed metadata: Information
discovery in the NSDL
2
Experience from American Memory and NSDL
Caroline R. Arms and William Y. Arms, "Mixed
content, mixed metadata: information discovery in
a messy world." In Metadata in Practice, edited by
Diane Hillmann and Elaine Westbrooks, ALA
Editions (forthcoming).
3
The National Science Digital Library
The Integration Task is to provide a coherent set
of collections and services across great
diversity (all digital collections relevant to
science education).
http://nsdl.org/
4
Mixed Content
Examples: NSDL-funded collections at Cornell
  • Atlas. Data sets of earthquakes, volcanoes, etc.
  • Reuleaux. Digitized kinematics models from the
    nineteenth century.
  • Laboratory of Ornithology. Sound recording,
    images, videos of birds and other animals.
  • Nuprl. Logic-based tools to support programming
    and to implement formal computational
    mathematics.
5
Effective Information Discovery Before Digital
Information
  • Searching
  • Resources separated into categories of
    related materials. Each category organized,
    indexed and searched separately.
  • Catalogs and indexes built on tightly controlled
    metadata standards, e.g., MARC, MeSH headings,
    etc.
  • Search engines used Boolean operators and
    fielded searching.
  • Query languages and search interfaces assumed a
    trained user.
  • Resources were physical items.
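
As a rough illustration of the Boolean, fielded searching described above, the sketch below indexes a few made-up catalog records by (field, term) and combines lookups with set operations. The records, field names and helper functions are assumptions for illustration, not any particular catalog system.

```python
# Toy catalog records; field names and values are illustrative only.
RECORDS = [
    {"id": 1, "title": "volcanoes of the pacific", "subject": "geology"},
    {"id": 2, "title": "bird song recordings", "subject": "ornithology"},
    {"id": 3, "title": "earthquakes and volcanoes", "subject": "geology"},
]


def build_fielded_index(records):
    """Map (field, term) pairs to the set of record ids containing the term."""
    index = {}
    for record in records:
        for field, value in record.items():
            if field == "id":
                continue
            for term in value.lower().split():
                index.setdefault((field, term), set()).add(record["id"])
    return index


def lookup(index, field, term):
    """Return the ids of records whose `field` contains `term`."""
    return index.get((field, term.lower()), set())


INDEX = build_fielded_index(RECORDS)

# title:volcanoes AND subject:geology
both = lookup(INDEX, "title", "volcanoes") & lookup(INDEX, "subject", "geology")
# title:volcanoes OR title:bird
either = lookup(INDEX, "title", "volcanoes") | lookup(INDEX, "title", "bird")
# subject:geology NOT title:earthquakes
without = lookup(INDEX, "subject", "geology") - lookup(INDEX, "title", "earthquakes")
```

Results are unranked sets: a record either satisfies the query or it does not, which is why this style of search presumes consistent metadata and a trained user.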

6
Effective Information Discovery With Homogeneous
Digital Information
  • Comprehensive metadata with Boolean retrieval:
    can be excellent for well-understood categories
    of material, but requires standardized metadata
    and relatively homogeneous content (e.g., MARC
    catalog).
  • Full text indexing with ranked retrieval: can be
    excellent, but methods were developed and
    validated for relatively homogeneous textual
    material (e.g., TREC ad hoc track).
7
Mixed Metadata: the Chimera of Standardization
  • Technical reasons
  • Characteristics of formats and genres
  • Differing user needs
  • Social and cultural reasons
  • Economic factors
  • Installed base

8
Cross-Domain Metadata
Dublin Core: "... indexes such as Lycos are
most useful in small collections within a given
domain. As the scope of their coverage expands,
indexes succumb to problems of large retrieval
sets and problems of cross-disciplinary semantic
drift. Richer records, created by content
experts, are necessary to improve search and
retrieval." (Weibel, 1995)
9
Information Discovery in a Messy World
Web search engines have adapted to a very large
scale. Other techniques, such as cross-domain
metadata and federated searching, have failed to
scale up.
  • What new concepts and techniques have enabled
    this adaptation?
  • What can we learn that is applicable to other
    information discovery tasks?
  • How is NSDL making use of this understanding?
10
Information Discovery in a Messy World
Building blocks
  • Brute force computation
  • The expertise of users -- human in the loop
Methods
  • (a) Better understanding of how and why users
    seek for information
  • (b) Relationships and context information
  • (c) Multi-modal information discovery
  • (d) User interfaces for exploring information
11
Understanding How and Why Users Seek for
Information
Homogeneous content
  • All documents are assumed equal
  • Criterion is relevance (binary measure)
  • Goal is to find all relevant documents (high
    recall)
  • Hits ranked in order of similarity to query
Mixed content
  • Some documents are more important than others
  • Goal is to find most useful documents on a topic
    and then browse
  • Hits ranked in order that combines importance
    and similarity to query
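
One way to picture a ranking "that combines importance and similarity" is a weighted sum of the two signals. The sketch below is only an assumption about how such a combination might look: the `weight` parameter, the importance values and the sample documents are all hypothetical, and plain term-frequency cosine similarity stands in for whatever similarity measure a real search service would use.

```python
import math
from collections import Counter


def cosine_similarity(query, text):
    """Cosine similarity between term-frequency vectors of two strings."""
    q, d = Counter(query.lower().split()), Counter(text.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def rank(query, documents, weight=0.5):
    """Order documents by a weighted mix of similarity and importance.

    `documents` maps an id to (text, importance), with importance in [0, 1].
    `weight` is a hypothetical tuning knob: 1.0 ranks by similarity alone.
    """
    scored = []
    for doc_id, (text, importance) in documents.items():
        score = weight * cosine_similarity(query, text) + (1 - weight) * importance
        scored.append((score, doc_id))
    return [doc_id for score, doc_id in sorted(scored, reverse=True)]


# Made-up documents: "b" matches less exactly but is rated more important.
DOCS = {
    "a": ("earthquake data sets", 0.1),
    "b": ("earthquake data sets and maps", 0.9),
    "c": ("bird sound recordings", 0.5),
}
by_similarity = rank("earthquake data sets", DOCS, weight=1.0)
combined = rank("earthquake data sets", DOCS, weight=0.5)
```

With similarity alone the exact match "a" comes first; once importance is mixed in, the more important document "b" overtakes it.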
12
Relationship and Contextual Information
Methods for capturing context
  • Analysis of citations and links (e.g., PageRank)
  • Mining usage logs (e.g., customers who buy the
    same product)
  • Reviews (e.g., reputation management)
  • Structural relationships (e.g., domain names)
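
To make the link-analysis example concrete, here is a minimal power-iteration sketch of PageRank over a tiny made-up graph. The graph, the damping factor of 0.85 and the fixed iteration count are illustrative assumptions; the slides do not say how, or whether, NSDL computes such scores.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping each node to its out-links."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no out-links spreads its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical link graph: three of the four resources link to "atlas".
LINKS = {
    "atlas": ["reuleaux"],
    "reuleaux": ["atlas"],
    "nuprl": ["atlas"],
    "birds": ["atlas", "reuleaux"],
}
SCORES = pagerank(LINKS)
TOP = max(SCORES, key=SCORES.get)
```

The scores sum to one and can serve as the "importance" signal for a resource, independent of any particular query.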
13
Multi-Modal Information Discovery
With mixed content and mixed metadata, the amount
of information about the various resources
varies greatly, but clues from many different
sources can be combined.
"The fundamental premise of the research was that
the integration of these technologies, all of
which are imperfect and incomplete, would
overcome the limitations of each, and improve the
overall performance in the information retrieval
task." (Wactlar, 2000)
14
User Interfaces for Exploring Information
[Diagram labels: Return objects; Return hits;
Browse content; Search index]
15
NSDL: The Spectrum of Interoperability
Level        Agreements                  Example
Federation   Strict use of standards     AACR, MARC,
             (syntax, semantic, and      Z39.50
             business)
Harvesting   Digital libraries expose    Open Archives
             metadata; simple protocol   metadata harvesting
             and registry
Gathering    Digital libraries do not    Web crawlers and
             cooperate; services must    search engines
             seek out information
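
For the harvesting level, the Open Archives protocol (OAI-PMH) works through plain HTTP requests with a `verb` argument. The sketch below only builds request URLs and makes no network calls; the base URL is a placeholder and the helper names are made up for illustration.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL for one repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


def resume_url(base_url, token):
    """Build the follow-up request for the next page of a large result."""
    # resumptionToken is an exclusive argument: it is sent with the verb only.
    return base_url + "?" + urlencode({"verb": "ListRecords", "resumptionToken": token})


# Hypothetical endpoint; a real collection would publish its own base URL.
BASE = "http://example.org/oai"
first = list_records_url(BASE, from_date="2003-01-01")
following = resume_url(BASE, "abc123")
```

A harvester would fetch `first`, store the returned records, and keep requesting `resume_url(...)` for as long as the response carries a resumption token.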
16
The NSDL Repository
The repository is a resource for service
providers. It holds information about every
collection and item known to the NSDL, including
contextual information.
[Diagram labels: Services; NSDL Repository; Users;
Collections]
17
NSDL Search Service: First Phase
[Diagram labels: NSDL Repository; harvest; Portal
(three instances); SDLIP; Search and Discovery
Service; crawl; Inquery -> Lucene; Collections]
18
NSDL Search Service: First Phase
  • Approach
  • Collections map metadata to Dublin Core, provide
    via Open Archives protocol.
  • Search service augments Dublin Core metadata with
    indexing of full-text where available.
  • User interface returns snippets derived from the
    metadata, links to full content and to metadata.
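
The approach above can be sketched as follows: parse a Dublin Core record as delivered through the Open Archives protocol, then add terms from the full text when it is available. The sample record, the choice of indexed fields and the function names are assumptions for illustration; only the Dublin Core namespace URI is standard.

```python
import xml.etree.ElementTree as ET

# Standard namespace URI for the unqualified Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Made-up record for illustration.
SAMPLE = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Kinematic models</dc:title>
  <dc:description>Digitized nineteenth century mechanisms</dc:description>
  <dc:identifier>http://example.org/item/1</dc:identifier>
</oai_dc:dc>
"""


def parse_dublin_core(xml_text):
    """Return a dict mapping each Dublin Core element name to its values."""
    root = ET.fromstring(xml_text.strip())
    record = {}
    for element in root:
        if element.tag.startswith("{" + DC_NS + "}"):
            name = element.tag.split("}", 1)[1]
            record.setdefault(name, []).append((element.text or "").strip())
    return record


def index_terms(record, full_text=None, fields=("title", "description", "subject")):
    """Collect index terms, noting whether each came from metadata or full text."""
    terms = {}
    for field in fields:
        for value in record.get(field, []):
            for term in value.lower().split():
                terms.setdefault(term, set()).add("metadata")
    for term in (full_text or "").lower().split():
        terms.setdefault(term, set()).add("fulltext")
    return terms


RECORD = parse_dublin_core(SAMPLE)
TERMS = index_terms(RECORD, full_text="Reuleaux designed kinematic mechanisms")
```

Recording where each term came from matters later: a hit on a term that appears only in the full text cannot be explained by a snippet drawn from the metadata alone.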

19
NSDL Search Service: First Phase
  • Weaknesses
  • Ranking by similarity to query not sufficient.
  • Snippets do not indicate why item was returned
    (e.g., terms in full text but not in metadata).
  • Dublin Core records provide limited information.
  • Browsing environment limited.
  • Most users begin their search with a Web
    search engine (e.g., Google).

20
NSDL Search Service: Second Phase Developments
  • Metadata
  • Accept any metadata that is available in a range
    of formats
  • System for reviews and annotations, with
    reputation management
  • Search system
  • Multimodal retrieval and ranking
  • Dynamic generation of snippets by search engine
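
Dynamic snippet generation could, for instance, pick the window of text with the most query terms and mark the matches, so that the user can see why an item was returned. The following is a minimal sketch under that assumption; the window size, the bracket marking and the sample text are made up, and this is not presented as the NSDL search engine's actual method.

```python
def make_snippet(text, query, width=8):
    """Return a window of about `width` words around the densest query match."""
    words = text.split()
    query_terms = {term.lower() for term in query.split()}

    def is_hit(word):
        return word.lower().strip(".,;:()\"'") in query_terms

    best_start, best_hits = 0, -1
    for start in range(max(1, len(words) - width + 1)):
        hits = sum(is_hit(word) for word in words[start:start + width])
        if hits > best_hits:
            best_start, best_hits = start, hits
    if best_hits == 0:
        return None  # caller can fall back to a metadata description
    window = words[best_start:best_start + width]
    marked = [f"[{w}]" if is_hit(w) else w for w in window]
    prefix = "... " if best_start > 0 else ""
    suffix = " ..." if best_start + width < len(words) else ""
    return prefix + " ".join(marked) + suffix


# Made-up full text for illustration.
FULL_TEXT = (
    "The collection holds digitized models. Many of the models were built "
    "in the nineteenth century to teach kinematics to engineering students."
)
SNIPPET = make_snippet(FULL_TEXT, "nineteenth century kinematics", width=6)
```

Because the snippet is built from the text that actually matched, it stays meaningful even when the matching terms are absent from the metadata record.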

21
NSDL Search Service: Second Phase Developments
(cont.)
  • Usability and human factors
  • Wider range of browsing tools (e.g., collection
    visualization)
  • Filters by education level and education quality,
    where known
  • Web compatibility
  • Expose records for Web crawlers to index
  • Browser bookmarklet to add NSDL information to
    Web pages

22
Mixed content, mixed metadata: Information
discovery in the NSDL