Mixed content, mixed metadata: Information discovery in the NSDL

Learn more at: https://www.cni.org
1
Mixed content, mixed metadata: Information
discovery in the NSDL
2
(No Transcript)
3
The National Science Digital Library
The Integration Task is to provide a coherent set
of collections and services across great
diversity (all digital collections relevant to
science education).
http://nsdl.org/
4
Basic Assumptions
  • Mixed content
  • Very large digital libraries will have mixed
    content from many sources, with large variations
    in formats, structure, packaging, access
    permissions, etc.
  • Mixed metadata
  • The metadata about the items in a very large
    digital library will vary greatly in extent,
    standards, and quality.

5
Mixed Content
Examples: NSDL-funded collections at Cornell
  • Atlas. Data sets of earthquakes, volcanoes,
    etc.
  • Reuleaux. Digitized kinematics models from the
    nineteenth century.
  • Laboratory of Ornithology. Sound recordings,
    images, videos of birds and other animals.
  • Nuprl. Logic-based tools to support programming
    and to implement formal computational
    mathematics.
6
Mixed Metadata: the Chimera of Standardization
  • Technical reasons
  • Characteristics of formats and genres
  • Differing user needs
  • Social and cultural reasons
  • Economic factors
  • Installed base
  • Conclusion: There will not be a single
    metadata standard for the items in the NSDL.

7
NSDL: The Spectrum of Interoperability
Level        Agreements                            Example
Federation   Strict use of standards (syntax,      AACR, MARC, Z39.50
             semantic, and business)
Harvesting   Digital libraries expose metadata;    Open Archives metadata
             simple protocol and registry          harvesting
Gathering    Digital libraries do not cooperate;   Web crawlers and
             services must seek out information    search engines
8
NSDL: The Spectrum of Interoperability
Chronology
The first phase of the NSDL has concentrated on
gathering Dublin Core metadata using the Open
Archives Initiative protocol for metadata
harvesting.
Current expansions include:
  • (a) A wider range of metadata standards, e.g.,
    LOM, Onix
  • (b) Automatic indexing of web sites recommended
    by users
  • (c) Links to the SCORM federation of structured
    learning objects
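As a rough illustration of harvesting Dublin Core metadata over the Open Archives protocol, the sketch below builds a ListRecords request and parses one record. The base URL and the sample record are invented for illustration; only the verb, the oai_dc metadata prefix, and the XML namespaces are taken from the OAI-PMH and Dublin Core specifications. No network call is made, and this is not the NSDL's own harvester.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical collection endpoint, not a real NSDL address.
BASE_URL = "http://collection.example.org/oai"

# Namespace of the oai_dc container element defined by OAI-PMH.
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query

def parse_records(xml_text):
    """Return one dict of Dublin Core fields (name -> list of values) per record."""
    root = ET.fromstring(xml_text)
    records = []
    for dc in root.iter("{%s}dc" % OAI_DC_NS):
        fields = {}
        for element in dc:
            name = element.tag.split("}")[1]  # strip the "{namespace}" prefix
            fields.setdefault(name, []).append((element.text or "").strip())
        records.append(fields)
    return records

# Made-up response body standing in for what a collection might return.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Earthquake data set</dc:title>
      <dc:subject>seismology</dc:subject>
      <dc:subject>geology</dc:subject>
    </oai_dc:dc>
  </metadata></record></ListRecords>
</OAI-PMH>"""

url = list_records_url(BASE_URL)
records = parse_records(SAMPLE)
```

Repeated elements such as subject are kept as lists, since Dublin Core allows any element to occur more than once.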
9
The NSDL Repository
The repository is a resource for service
providers. It holds information about every
collection and item known to the NSDL.
[Diagram labels: Users, Services, NSDL Repository,
Collections]
10
NSDL Search Service: First Phase
[Diagram labels: Collections, harvest, crawl, NSDL
Repository, Search and Discovery Service, Lucene,
Portals]
11
NSDL Search Service: First Phase
  • Approach
  • Collections map metadata to Dublin Core, make it
    available via the Open Archives protocol.
  • The search service augments Dublin Core metadata
    with indexing of full-text where available.
  • User interface returns snippets derived from the
    metadata, with links to full content and to
    metadata.
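The approach above can be sketched in a few lines, under stated assumptions: the crosswalk table, field names, and sample items are all hypothetical, and a plain dictionary stands in for the real index. The sketch maps native metadata to Dublin Core, indexes the metadata together with full text where it exists, and builds each snippet from the metadata only.

```python
# Hypothetical crosswalk from one collection's native field names to Dublin Core.
CROSSWALK = {"name": "title", "summary": "description", "keywords": "subject"}

def to_dublin_core(native_record):
    """Map native metadata fields to Dublin Core, dropping unmapped fields."""
    return {CROSSWALK[k]: v for k, v in native_record.items() if k in CROSSWALK}

def build_index(items):
    """Index each item's Dublin Core metadata plus full text where available."""
    index = {}
    for item_id, item in items.items():
        text = " ".join(item["dc"].values()) + " " + item.get("full_text", "")
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(item_id)
    return index

def search(index, items, query):
    """Return (item_id, snippet) pairs; the snippet is derived from the metadata."""
    terms = query.lower().split()
    hits = set.intersection(*(index.get(t, set()) for t in terms)) if terms else set()
    results = []
    for item_id in sorted(hits):
        dc = items[item_id]["dc"]
        snippet = dc.get("title", "") + ": " + dc.get("description", "")
        results.append((item_id, snippet))
    return results

# Made-up items: one with full text, one with metadata only.
items = {
    "atlas-1": {
        "dc": to_dublin_core({"name": "Volcano data", "summary": "Eruption records", "owner": "lab"}),
        "full_text": "Measurements of lava flow and seismic activity",
    },
    "birds-7": {
        "dc": to_dublin_core({"name": "Owl calls", "summary": "Sound recordings"}),
    },
}
index = build_index(items)
results = search(index, items, "seismic")
```

In this example the query term occurs only in the full text, so the item is found but the metadata-derived snippet does not contain the term that matched.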

12
(No Transcript)
13
NSDL Search Service: First Phase
  • The first phase search service is useful, but has
    weaknesses:
  • (a) Ranking by similarity to query is not
    sufficient.
  • (b) Snippets do not indicate why an item was
    returned (e.g., terms in full text but not in
    metadata).
  • (c) Dublin Core records provide limited
    information.
  • (d) Browsing environment is limited.
  • Most users begin their search with a Web search
    engine (e.g., Google).
  • What are the methods for improving information
    discovery as the system grows in size and the
    mixture of content increases?

14
Effective Information Discovery Before Digital
Information
  • Searching
  • (a) Resources separated into categories of
    related materials. Each category organized,
    indexed and searched separately.
  • (b) Catalogs and indexes built on tightly
    controlled metadata standards, e.g., MARC, MeSH
    headings, etc.
  • (c) Search engines used Boolean operators and
    fielded searching.
  • (d) Query languages and search interfaces assumed
    a trained user.
  • (e) Resources were physical items.

15
Effective Information Discovery With
Homogeneous Digital Information
  • Comprehensive metadata with Boolean retrieval:
    Can be excellent for well-understood categories
    of material, but requires standardized metadata
    and relatively homogeneous content (e.g., MARC
    catalog).
  • Full text indexing with ranked retrieval:
    Can be excellent, but methods were developed and
    validated for relatively homogeneous textual
    material (e.g., TREC ad hoc track).
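To make the contrast concrete, here is a minimal sketch of the two retrieval styles over a tiny made-up corpus. The documents, the function names, and the particular tf-idf weighting are illustrative assumptions, not the methods used in any system named in the talk.

```python
import math
from collections import Counter

# Made-up documents for illustration.
DOCS = {
    "d1": "bird song recordings",
    "d2": "earthquake data and volcano data",
    "d3": "volcano images",
}

def boolean_and(docs, terms):
    """Boolean retrieval: ids of documents containing every term, unranked."""
    return sorted(d for d, text in docs.items() if all(t in text.split() for t in terms))

def ranked(docs, terms):
    """Ranked retrieval: order documents by a simple tf-idf score, best first."""
    n = len(docs)
    tokens = {d: Counter(text.split()) for d, text in docs.items()}
    scores = {}
    for t in terms:
        df = sum(1 for counts in tokens.values() if t in counts)
        if df == 0:
            continue  # term appears nowhere, so it contributes nothing
        idf = math.log(1 + n / df)
        for d, counts in tokens.items():
            if counts[t]:
                scores[d] = scores.get(d, 0.0) + counts[t] * idf
    # Sort by descending score, breaking ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

bool_hits = boolean_and(DOCS, ["volcano", "data"])
rank_hits = ranked(DOCS, ["volcano", "data"])
```

Boolean retrieval returns only the document that satisfies the whole query, while ranked retrieval also returns partial matches, ordered by score.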
16
Information Discovery in a Messy World:
Cross-Domain Metadata
Dublin Core: "... indexes such as Lycos are
most useful in small collections within a given
domain. As the scope of their coverage expands,
indexes succumb to problems of large retrieval
sets and problems of cross-disciplinary semantic
drift. Richer records, created by content
experts, are necessary to improve search and
retrieval." (Weibel 1995)
17
Information Discovery in a Messy World: Web
Search Engines
Web search engines have adapted to a very large
scale. Other techniques, such as cross-domain
metadata and federated searching, have failed to
scale up.
  • What new concepts and techniques have enabled
    this adaptation?
  • What can we learn that is applicable to other
    information discovery tasks?
  • How is NSDL making use of this understanding?
18
Information Discovery in a Messy World
Building blocks
  • Brute force computation
  • The expertise of users -- human in the loop
Methods
  • (a) Better understanding of how and why users
    seek for information
  • (b) Relationships and context information
  • (c) Multi-modal information discovery
  • (d) User interfaces for exploring information
19
Brute Force Computing
Few people really understand Moore's Law:
  • Computing power doubles every 18 months
  • Increases 100 times in 10 years
  • Increases 10,000 times in 20 years
Simple algorithms plus immense computing power may
outperform skilled humans.
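The round figures follow from compounding the 18-month doubling. A small sketch of the arithmetic (the function name and parameter are illustrative):

```python
def growth_factor(years, doubling_months=18):
    """Factor by which computing power grows if it doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)

ten_years = growth_factor(10)     # 2 ** (120 / 18), a little over 100
twenty_years = growth_factor(20)  # 2 ** (240 / 18), a little over 10,000
```

Ten years is 120/18, or about 6.7 doublings, and twenty years is about 13.3 doublings, so the slide's 100 and 10,000 are slight under-estimates.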
20
The Expertise of Users: The Human in the Loop
[Diagram labels: Search index, Return hits, Browse
content, Return objects]
21
Understanding How and Why Users Seek for
Information
Homogeneous content
  • All documents are assumed equal
  • Criterion is relevance (binary measure)
  • Goal is to find all relevant documents (high
    recall)
  • Hits ranked in order of similarity to query
Mixed content
  • Some documents are more important than others
  • Goal is to find most useful documents on a topic
    and then browse
  • Hits ranked in an order that combines importance
    and similarity to query
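One simple way to combine importance and similarity is a weighted sum. The sketch below assumes both scores are already normalized to the range 0 to 1. The weight, the item names, and the scores are invented for illustration, and the talk does not say which combination the NSDL uses.

```python
def combined_rank(hits, weight=0.5):
    """Order hits by a weighted mix of query similarity and importance.

    `hits` maps item id -> (similarity, importance), both assumed to be in [0, 1].
    `weight` is the share given to similarity; it is an illustrative knob only.
    """
    def score(item_id):
        similarity, importance = hits[item_id]
        return weight * similarity + (1 - weight) * importance

    # Sort by descending score, breaking ties by item id.
    return sorted(hits, key=lambda i: (-score(i), i))

# Made-up scores: a close textual match of low importance and a looser
# match from a highly regarded source.
hits = {
    "obscure-page": (0.9, 0.1),
    "major-collection": (0.7, 0.9),
}
by_similarity = combined_rank(hits, weight=1.0)
by_both = combined_rank(hits, weight=0.5)
```

With weight=1.0 the ranking reduces to the homogeneous-content case of pure similarity. With an even mix, the more important item moves to the top.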
22
Research Topics from the NSDL
How can users indicate preferences?
  • They do not want to see research articles.
    Machine learning methods can identify
    collections of research articles.
  • Their students have a specific mathematical
    background. Usage data can identify items of
    similar academic level.
Detailed metadata requirements will not be
accepted!
23
Relationship and Contextual Information
Methods for capturing context
  • Analysis of citations and links (e.g., PageRank)
  • Mining usage logs (e.g., customers who buy the
    same product)
  • Reviews (e.g., reputation management)
  • Structural relationships (e.g., domain names)
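As an example of link analysis, here is a minimal power-iteration sketch of PageRank over a tiny made-up link graph. The damping factor of 0.85 and the fixed iteration count are conventional illustrative choices, and the handling of pages without outgoing links is one common variant, not a statement of how any production engine works.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share each round.
        new_rank = {p: (1 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Made-up graph: "a" and "b" both link to "c", and "c" links back to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
rank = pagerank(links)
```

Page "c" ends up with the highest rank because two pages link to it, and "b" with the lowest because nothing links to it.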
24
Multi-Modal Information Discovery
With mixed content and mixed metadata, the amount
of information about the various resources
varies greatly, but clues from many different
sources can be combined.
"The fundamental premise of the research was that
the integration of these technologies, all of
which are imperfect and incomplete, would overcome
the limitations of each, and improve the overall
performance in the information retrieval task."
(Wactlar, 2000)
25
The Expertise of Users: Examples
[Diagram labels: Search index, Return hits, Browse
content, Return objects]
26
(No Transcript)
27
(No Transcript)
28
NSDL Search Service: Second Phase Developments
  • Metadata
  • Accept any metadata that is available in a range
    of formats
  • System for reviews and annotations, with
    reputation management
  • Search system
  • Multimodal retrieval and ranking
  • Dynamic generation of snippets by search engine
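Dynamic snippet generation can be illustrated with a keyword-in-context window: the snippet is cut from wherever the query term actually occurs, so it shows why the item was returned. This is a minimal sketch with an invented function name, window size, and sample text, not the NSDL search engine's method.

```python
def dynamic_snippet(text, query, width=3):
    """Return a window of words around the first query term found in `text`.

    `width` is the number of words kept on each side; returns "" if no term matches.
    """
    words = text.split()
    terms = {t.lower() for t in query.split()}
    for i, word in enumerate(words):
        # Ignore case and trailing punctuation when comparing words.
        if word.lower().strip(".,;:!?") in terms:
            start = max(0, i - width)
            end = min(len(words), i + width + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    return ""

# Made-up full text for illustration.
full_text = "Digitized kinematic models show how linkages convert rotary motion into straight lines."
snippet = dynamic_snippet(full_text, "rotary")
```

Here the snippet is "... how linkages convert rotary motion into straight ...", which contains the matching term even though it might be absent from the item's metadata.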

29
NSDL Search Service: Second Phase Developments
(cont.)
  • Usability and human factors
  • Wider range of browsing tools (e.g., collection
    visualization)
  • Filters by education level and education quality,
    where known
  • Web compatibility
  • Expose records for Web crawlers to index
  • Browser bookmarklet to add NSDL information to
    Web pages

30
Further Reading
  • Caroline R. Arms and William Y. Arms, "Mixed
    content, mixed metadata: information discovery in
    a messy world." In Metadata in Practice, editors
    Diane Hillmann and Elaine Westbrooks, ALA
    Editions (forthcoming 2004).
  • Carl Lagoze, et al., "Core Services in the
    Architecture of the National Digital Library for
    Science Education (NSDL)." Joint Conference on
    Digital Libraries, July 2002.
    http://arxiv.org/abs/cs.DL/0201025
31
Acknowledgements and Disclaimer
The NSDL is a program of the National Science
Foundation's Directorate for Education and Human
Resources, Division of Undergraduate Education.
The NSDL Core Integration is a collaboration
between the University Corporation for
Atmospheric Research, Columbia University and
Cornell University.
The NSDL Search Service has been developed in
partnership with a team at the University of
Massachusetts, Amherst.
The ideas discussed in this talk do not represent
the official views of the NSF or of the Core
Integration team.
This work is funded in part by the NSF, grant
number 0227648.
32
Mixed content, mixed metadata: Information
discovery in the NSDL