Metadata Quality for Federated Collections

1
Metadata Quality for Federated Collections
Besiki Stvilia, Les Gasser, Mike Twidale, Sarah
Shreeves, Tim Cole
  • GSLIS, UIUC
  • November, 2004

2
1. Abstract
  • Centralized metadata repositories attempt to
    provide integrated access across multiple digital
    collections from libraries, archives and museums.
    Metadata quality in these repositories heavily
    influences the collections' usability: high
    quality can raise satisfaction and use, while low
    quality can render collections unusable.
    Individual metadata type, origin and quality
    variances are compounded into complex quality
    challenges when collections are aggregated.
    Current metadata quality assurance is generally
    piecemeal, reactive, ad hoc, and a-theoretical;
    formal compatibility and interoperability
    standards often prove unenforceable given
    metadata providers' dynamic and conflicting
    organizational priorities. We are empirically
    examining large bodies of harvested metadata to
    develop systematic techniques for metadata
    quality assessment/assurance. We study metadata
    quality, value, and cost models; algorithms for
    connecting metadata component variations to
    (aggregate) metadata record quality; and
    prototype metadata quality assurance tools that
    help providers, aggregators and users reason
    about metadata quality and make more intelligent
    selection, aggregation and maintenance decisions.

3
2. Approach
The model has been developed using a number of
techniques such as literature analysis, case
studies, statistical analysis, strategic
experimentation, and multi-agent modeling. The
model along with the concepts and metrics used
can serve as a foundation for developing
effective specific methodologies of quality
assurance in various types of organizations. Our
model of metadata quality ties together findings
from existing and new research in information
quality, along with well-developed work in
information seeking/use behavior, and the
techniques of strategic experimentation from
manufacturing. It presents a holistic approach to
determining the quality of a metadata object,
identifying quality requirements based on
typified contexts of metadata use (such as
specific information seeking/use activities) and
expressing interactions between metadata quality
and metadata value.

4
3. Measuring Metadata Quality
3.1 Metadata Quality Problem
  • Actual quality not matching the required/needed
    level of quality
  • May arise at different levels
    • Element level
    • Schema level
  • Quality Dimensions

5
3.2 Information Quality Dimensions
  • Relational / Contextual
    • Accuracy
    • Completeness
    • Complexity
    • Latency
    • Naturalness
    • Informativeness
    • Relevance (aboutness)
    • Precision
    • Security
    • Verifiability
    • Volatility
  • Intrinsic
    • Accuracy
    • Cohesiveness
    • Complexity
    • Semantic-consistency
    • Structural-consistency
    • Currency
    • Informativeness
    • Naturalness
    • Precision
  • Reputational
    • Authority

6
3.3 MQ Dimensions may trade off
  • completeness vs. simplicity; robustness vs.
    simplicity; volatility vs. simplicity; robustness
    vs. redundancy; accessibility vs. certainty
  • Taguchi curves help to model and reason about
    tradeoffs.
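
The trade-off idea can be sketched with the three standard Taguchi quadratic loss functions. The slide does not say how the curves were applied, so the mapping below is an assumption made only for illustration: completeness is treated as larger-the-better in the number of elements per record, simplicity as smaller-the-better, and the coefficients k_complete and k_simple are hypothetical.

```python
def nominal_the_best(y, target, k=1.0):
    """Taguchi loss when deviation from a target in either direction is bad."""
    return k * (y - target) ** 2


def smaller_the_better(y, k=1.0):
    """Taguchi loss when the ideal value is zero."""
    return k * y ** 2


def larger_the_better(y, k=1.0):
    """Taguchi loss when larger values are always preferred (y must be > 0)."""
    if y <= 0:
        raise ValueError("y must be positive")
    return k / y ** 2


def total_loss(n_elements, k_complete=100.0, k_simple=0.05):
    """Hypothetical combined loss for a record with n_elements elements.

    More elements lower the completeness loss but raise the simplicity loss.
    Both coefficients are illustrative assumptions, not values from the slides.
    """
    return (larger_the_better(n_elements, k_complete)
            + smaller_the_better(n_elements, k_simple))


# Number of elements (1..15, the size of simple Dublin Core) with the lowest loss.
best_n = min(range(1, 16), key=total_loss)
print(best_n)
```

With these made-up coefficients the combined curve has an interior minimum, which is the point of the trade-off: neither the sparsest nor the fullest record minimizes total loss.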

7
3.4 Genre Captures Context

8
4. Measuring Value
4.1 What's the Value of Quality?

9
4.2 Value as Amount of Use
Dublin Core element   % of total records containing element
Identifier            99.6
Title                 80.3
Type                  76.5
Subject               72.9
Format                69.4
Publisher             61.2
Language              55.0
Creator               50.7
Description           47.4
Date                  43.0
Rights                41.0
Relation              31.2
Source                14.9
Contributor            6.6
Coverage               5.9
  • The value of metadata can be a function of the
    probability distribution of the
    operations/transactions using the metadata.
  • Human factors experiments can be used for
    assessing the effectiveness of creating and using
    the metadata.
  • Metadata is often an organizational asset,
    especially in organizations like libraries, and
    one can calculate its dollar cost based on the
    average time a cataloger spends on creating a
    record or an element of the record.
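
A minimal sketch of the first and third bullets, under stated assumptions. The operation names, their probabilities, the mapping from operations to the elements they use, and the cataloging time and hourly rate are all hypothetical; the slides give no such figures.

```python
def expected_use_value(operation_probs, elements_used_by_op, element):
    """Probability that a randomly drawn operation uses `element`.

    operation_probs: dict mapping operation name -> probability (sums to 1).
    elements_used_by_op: dict mapping operation name -> set of element names.
    """
    return sum(
        prob
        for op, prob in operation_probs.items()
        if element in elements_used_by_op[op]
    )


def cataloging_cost(minutes_per_record, hourly_rate, n_records):
    """Dollar cost of creating n_records at an average time per record."""
    return minutes_per_record * n_records / 60.0 * hourly_rate


# Hypothetical distribution of operations over the metadata.
operation_probs = {"find": 0.5, "identify": 0.25, "select": 0.125, "obtain": 0.125}

# Hypothetical mapping of operations to the Dublin Core elements they rely on.
elements_used_by_op = {
    "find": {"title", "creator", "subject"},
    "identify": {"title", "date"},
    "select": {"description", "format"},
    "obtain": {"identifier", "rights"},
}

title_value = expected_use_value(operation_probs, elements_used_by_op, "title")
rights_value = expected_use_value(operation_probs, elements_used_by_op, "rights")
cost = cataloging_cost(minutes_per_record=12, hourly_rate=30.0, n_records=1000)
print(title_value, rights_value, cost)
```

Comparing an element's expected use with the cost of producing it is one simple way to decide where cataloging effort pays off.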

10
5. IMLS Digital Collections and Content Project
  • Promote centralized search, interoperability and
    reusability of metadata collections
  • Harvested metadata from >20 data providers,
    >150,000 Dublin Core records (and growing)
  • Data providers: small public libraries and
    historical societies, large academic libraries,
    museums, research centers
  • Records provided: from dozens to tens of
    thousands
  • Interoperability and reusability require
    negotiation of Global quality
  • http://imlsdcc.grainger.uiuc.edu

11
5.1 Examples of Quality Problems
  • Ptolemaios son of Diodoros
  • Dioskoros Ptolemaios
  • Dioscorus. Ptolemaios
  • (variant transliteration)
  • <date>2000</date>
  • <date>1998-03-26</date>
  • (ambiguous and structurally inconsistent)
  • <publisher>New York Robert Carter,
    1846</publisher> (schema limitation led to
    workaround)
  • . . .
  • Activity: Find, Collocate
  • Actions: Find, Identify, Select, Obtain
  • Across Federated Collections
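
The structural inconsistency in the two date examples above can be detected mechanically. The following is a minimal sketch, not the project's tool: the pattern names and the choice of three date shapes are assumptions, and real harvested data would need many more patterns.

```python
import re
from collections import Counter

# Illustrative date shapes only; the list is an assumption, not a standard.
DATE_PATTERNS = [
    ("YYYY-MM-DD", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("YYYY-MM", re.compile(r"\d{4}-\d{2}")),
    ("YYYY", re.compile(r"\d{4}")),
]


def classify_date(value):
    """Return the name of the first pattern matching the whole value."""
    text = value.strip()
    for name, pattern in DATE_PATTERNS:
        if pattern.fullmatch(text):
            return name
    return "other"


def date_format_profile(values):
    """Count how many values fall into each date shape."""
    return Counter(classify_date(v) for v in values)


def is_structurally_consistent(values):
    """True when every value shares a single date shape."""
    return len(date_format_profile(values)) <= 1


# The two <date> values from the slide, plus the publisher string as a non-date.
sample = ["2000", "1998-03-26", "New York Robert Carter, 1846"]
profile = date_format_profile(sample)
print(profile)
```

A profile with more than one shape for the same element flags a collection whose dates cannot be sorted or compared without normalization.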

12
5.2 Findings
  • MQ dimensions with major quality problems
  • completeness
  • redundancy
  • clarity
  • semantic inconsistency (incorrect element use)
  • structural inconsistency
  • inaccurate representation

Problem type:  Incomplete  Redundant  Unclear  Incorrect Use of Elements  Inconsistent  Inaccurate
               100         94         78       73                         47            24

13
5.3 Findings
  • Correlation between consistency of element use
    and type of metadata objects and type of data
    providers (sample size 2,000).
  • Grouping by type of objects made standard
    deviation of total number of elements used drop
    significantly (from 5.73 to 3.6)
  • Clustering by use of distinct DC elements
    (K-means, with 2 clusters) suggested that
    different types of institutions may use different
    numbers of distinct DC elements:
    • Academic libraries: 13
    • Public libraries: 8
    • Museums: divided
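
The second bullet's effect (lower spread after grouping by object type) can be illustrated as follows. The slide does not say how the grouped standard deviation was aggregated, so the pooled within-group standard deviation used here is an assumption, and the records are made-up values, not project data.

```python
from collections import defaultdict
from math import sqrt
from statistics import pstdev, pvariance


def pooled_within_group_stdev(records):
    """Square root of the size-weighted mean of within-group variances.

    records: iterable of (object_type, number_of_elements_used) pairs.
    """
    groups = defaultdict(list)
    for object_type, n_elements in records:
        groups[object_type].append(n_elements)
    total = sum(len(values) for values in groups.values())
    weighted_var = sum(len(v) * pvariance(v) for v in groups.values()) / total
    return sqrt(weighted_var)


# Hypothetical records: object type and total number of elements used.
records = [
    ("photograph", 6), ("photograph", 7), ("photograph", 8),
    ("manuscript", 13), ("manuscript", 14), ("manuscript", 15),
]

overall = pstdev([n for _, n in records])
within = pooled_within_group_stdev(records)
print(round(overall, 3), round(within, 3))
```

When object type explains much of the variation in element counts, the within-group figure is much smaller than the overall one, as in the drop reported on the slide.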

DC element    A (academic)   P (public)
Title         1              1
Creator       1              1
Subject       1              1
Description   1              1
Publisher     1              1
Contributor   0              0
Date          1              1
Type          1              0
Format        1              0
Identifier    1              1
Source        1              0
Language      1              0
Relation      1              0
Coverage      0              0
Rights        1              1
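
A minimal pure-Python sketch of 2-cluster K-means over binary element-use vectors. The A and P profiles are taken from the table above; the two variant profiles, the deterministic initialization (first vector and the vector farthest from it), and the use of squared Euclidean distance are assumptions for illustration, not the study's setup.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans_two_clusters(vectors, max_iter=100):
    """Lloyd's algorithm with k=2 and a deterministic start."""
    first = vectors[0]
    farthest = max(vectors, key=lambda v: squared_distance(v, first))
    centroids = [list(first), list(farthest)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min((0, 1), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels, centroids


# Element order: Title, Creator, Subject, Description, Publisher, Contributor,
# Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights.
academic = [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1]    # column A
public = [1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]      # column P
academic_variant = [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1]  # hypothetical
public_variant = [1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1]    # hypothetical

providers = [academic, academic_variant, public, public_variant]
labels, centroids = kmeans_two_clusters(providers)
print(labels)
```

On real harvested data the vectors would be one per provider or per record, and a library implementation with several random restarts would be the more usual choice.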

14
5.3 Findings
  • High complexity of metadata content related to
    quality problems
  • Strong correlation found between Content
    Simplicity/Complexity Rate and Quality Problem
    Rate (-.434, p < .01)
  • However, no significant correlation found between
    Quality Problem Rate and Length of Metadata
    Object (.043)
  • Differences in how well standard schemas handle
    different types of original objects - lowest
    quality problem rate found for print materials

Genre/type    Mean error rate   Median error rate
species       .00099            .00064
manuscript    .00063            .00058
photograph    .00025            .00018
art           .00016            .00015
print         .00010            .00010
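
A correlation such as the one reported between the simplicity/complexity rate and the quality problem rate can be computed as sketched below. The slides do not name the coefficient, so the use of Pearson's r is an assumption, and the two series are hypothetical values chosen only to show a negative relationship.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-record rates, not data from the study.
simplicity_rate = [0.9, 0.8, 0.7, 0.6, 0.5]
problem_rate = [0.1, 0.3, 0.2, 0.5, 0.4]

r = pearson_r(simplicity_rate, problem_rate)
print(round(r, 3))
```

A negative r here means that records with simpler content tend to show fewer quality problems, which is the direction of the finding on the slide.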

15
6. Conclusions and Lessons Learned
  • Communities of practice may use their own
    implicit or explicit schema when sharing metadata
    even through a standardized schema such as DC
  • Some schema elements can be more ambiguous than
    others and require qualification: Date vs.
    Creator
  • Ambiguity of schema elements can be a major source
    of quality problems, leading to context loss and
    element misuse
  • Inferring native schema and comparing it to
    destination schema can point to possible sources
    of quality problems
  • Analysis of activities can help in evaluating
    Robustness and Clarity of schema
  • Mining regularities between metadata
    characteristics and quality problems can help in
    constructing robust and inexpensive metrics
  • Some metrics used in Information Retrieval
    (Infonoise, Kullback-Leibler, Average IDF) can be
    effective and scalable in assessing quality at
    the content level
  • A general purpose dictionary-based metric was found
    robust for assessing cognitive complexity of
    metadata content
  • Structure profiles can be an effective source for
    measuring quality and predicting quality problems
    at the schema level
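
Two of the Information Retrieval metrics named above, average IDF and Kullback-Leibler divergence, can be sketched in a few lines. Whitespace tokenization, the add-one smoothing in both functions, and the sample texts are assumptions; the slides do not describe the exact formulation used, and Infonoise is not shown.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()


def average_idf(record_text, corpus_texts):
    """Mean smoothed inverse document frequency of the record's tokens."""
    tokens = tokenize(record_text)
    if not tokens:
        return 0.0
    doc_sets = [set(tokenize(t)) for t in corpus_texts]
    n_docs = len(doc_sets)
    total = 0.0
    for tok in tokens:
        df = sum(1 for s in doc_sets if tok in s)
        total += math.log((n_docs + 1) / (df + 1))
    return total / len(tokens)


def kl_divergence(record_text, corpus_texts):
    """KL divergence of the record's term distribution from the collection's.

    The collection distribution uses add-one smoothing over the joint vocabulary.
    """
    p_counts = Counter(tokenize(record_text))
    q_counts = Counter(tok for t in corpus_texts for tok in tokenize(t))
    p_total = sum(p_counts.values())
    if p_total == 0:
        return 0.0
    vocab = set(p_counts) | set(q_counts)
    q_total = sum(q_counts.values()) + len(vocab)
    kl = 0.0
    for tok, count in p_counts.items():
        p = count / p_total
        q = (q_counts[tok] + 1) / q_total
        kl += p * math.log(p / q)
    return kl


# Hypothetical description strings standing in for harvested records.
corpus = [
    "photograph of a bridge",
    "photograph of a river",
    "photograph of a church",
    "letter from ptolemaios son of diodoros",
]
generic_idf = average_idf(corpus[0], corpus)
specific_idf = average_idf(corpus[3], corpus)
print(round(generic_idf, 3), round(specific_idf, 3))
```

Records whose terms are common across the collection score low on both measures, which is one inexpensive signal of uninformative content.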

16
Acknowledgements and Contact Information
The research was made possible by generous
support from the Institute of Museum and Library
Services (IMLS) and the UIUC Campus Research
Board.
How to contact
Email Besiki Stvilia at stvilia_at_uiuc.edu