Transcript and Presenter's Notes

Title: Usage-based models of science:
1
Usage-based models of science: applications to
community mapping and scholarly assessment
Johan Bollen, Digital Library Research &
Prototyping Team, Los Alamos National Laboratory -
Research Library, jbollen@lanl.gov
Acknowledgements: Herbert Van de Sompel (LANL),
Marko A. Rodriguez (LANL), Ryan Chute (LANL),
Lyudmila L. Balakireva (LANL), Aric Hagberg (LANL),
Luis Bettencourt (LANL). Research supported by the
Andrew W. Mellon Foundation.
2
Scholarly evaluation
  • Qualitative/subjective
    • Peer review
    • Tenure committees
    • Networks
  • Quantitative
    • Citation counts
    • Many proposals, little clarity

3
Impact evaluation from citation data.
  • Citation data
    • Gold standard of scholarly evaluation
    • Citation = scholarly influence.
    • Extracted from published materials.
    • Main bibliometric data source for scholarly
      evaluation.
    • IF is part of the Journal Citation Reports (JCR)
  • JCR citation graph
    • 2005 journal citation network
    • 8,560 journals
    • 6,370,234 weighted citation edges
  • Impact Factor = mean 2-year citation rate
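The 2-year Impact Factor above can be sketched in a few lines: citations received in year Y to items published in the two prior years, divided by the number of citable items in those years. The journal and its counts below are invented for illustration, not JCR data.

```python
def impact_factor(citations_to_prior_two_years, articles_prior_two_years):
    """Mean 2-year citation rate for a journal."""
    return citations_to_prior_two_years / articles_prior_two_years

# Hypothetical journal: 900 citations in 2005 to its 2003-2004
# articles, of which there were 300.
print(impact_factor(900, 300))  # 3.0
```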

4
Citation graph: other metrics?
  • Possibility to calculate other metrics of impact.
  • How about PageRank?
  • IF = normalized in-degree in citation graph
    • Popularity
    • Favors review journals
  • Are all citations created equal?
    • Transfer citer influence to citee
    • Normalize edge weight
    • Modulate transfer by citer influence
    • Indicator of node prestige

Pinski, G., Narin, F. (1976). Citation influence
for journal aggregates of scientific publications:
theory, with application to the literature of
physics. Information Processing and Management,
12(5), 297-312.
Chen, P., Xie, H., Maslov, S., Redner, S. (2007).
Finding scientific gems with Google. Journal of
Informetrics, 1(1), arxiv.org/abs/physics/0604130.
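The influence-transfer idea on this slide is what PageRank implements: a citer's score is divided over its citation edges in proportion to edge weight. A minimal sketch of weighted PageRank by power iteration, with invented journals and citation counts rather than any real citation graph:

```python
def pagerank(edges, nodes, d=0.85, iters=50):
    """edges: dict (citer, citee) -> weight. Returns node -> score."""
    pr = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = {n: sum(w for (u, v), w in edges.items() if u == n)
                  for n in nodes}
    for _ in range(iters):
        # teleportation share for every node
        new = {n: (1 - d) / len(nodes) for n in nodes}
        # transfer citer influence to citee, normalized by edge weight
        for (u, v), w in edges.items():
            new[v] += d * pr[u] * w / out_weight[u]
        # dangling nodes (no outgoing citations) redistribute uniformly
        dangling = sum(pr[n] for n in nodes if out_weight[n] == 0)
        for n in nodes:
            new[n] += d * dangling / len(nodes)
        pr = new
    return pr

journals = ["A", "B", "C"]
cites = {("A", "B"): 10, ("B", "C"): 5, ("C", "A"): 5, ("A", "C"): 2}
scores = pagerank(cites, journals)
print(max(scores, key=scores.get))
```

Because influence is modulated by the citer's own score, a node cited by prestigious nodes ranks above one with the same raw in-degree from obscure ones.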
5
Popularity vs. prestige
rho = 0.61. Outliers reveal differences in aspects
of status: IF = general popularity; PR = prestige,
influence.
Johan Bollen, Marko A. Rodriguez, and Herbert Van
de Sompel. Journal status. Scientometrics, 69(3),
December 2006 (doi:10.1007/s11192-006-0176-z)
Philip Ball. Prestige is factored into journal
ratings. Nature 439, 770-771, February 2006
(doi:10.1038/439770a)
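The rho = 0.61 above is a rank correlation between the IF and PageRank orderings. A minimal Spearman correlation for tie-free score lists (the scores below are toy values, not the actual journal data):

```python
def spearman(x, y):
    """Spearman rho for two equal-length score lists without ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # classic closed form: 1 - 6 * sum(d^2) / (n^3 - n)
    return 1 - 6 * d2 / (n * (n * n - 1))

print(spearman([1, 2, 3, 4, 5], [1, 3, 2, 5, 4]))  # 0.8
```

A rho well below 1.0, as here, is exactly what lets the outliers separate popularity from prestige.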
6
Domain specific
7
A few issues
  • On the basis of the citation graph, many
    alternatives
    • Normalized citation statistics
    • Social network indicators (centrality)
    • Semantic web hybrids
  • Which one to choose?
  • Semantics
    • Validity: does the indicator express what it is
      intended to express?
    • Reliability: sensitivity to changes in network
      structure?
  • Another question: need they be based on citation
    data?
    • Other means of expressing influence
    • Readership, services requested, etc.
    • Usage!

8
Usage data
  • Citation data pertain to 4 levels in the
    scholarly communication process
    • Community: authors of journal articles.
    • Artifacts: journal articles.
    • Data: citation data (1 year publication delay).
    • Metrics: mean citation rate rules supreme.
    • Scale: expensive to extract.
  • However, for usage data
    • Community: all users, including most authors.
    • Artifacts: all that is accessible.
    • Data: recorded upon publication.
    • Metrics: a range of web and Web 2.0 inspired
      metrics, e.g. clickstream and data mining.
    • Scale: automatically recorded at point of
      service.

Hence, various initiatives focused on usage data:
COUNTER, IRS, SUSHI, CiteBase. But where are the
metrics?
9
Challenges to usage-based metrics.
  • Usage data is here
    • Routinely recorded by library, publisher and
      aggregator services
    • Large-scale and longitudinal
    • Highly detailed
      • Requester
      • Referent
      • Sessions
      • Service type
  • Usage-based metrics have lagged development.
    Here's why:
    • Multiple communities
    • Multiple collections (artifacts)
    • Data: usage data limited to particular
      sub-communities and collections of artifacts.
    • Metrics: various metrics studied. Different
      results because of sample, collection or metric
      definition? Aspects of scholarly status?


10
Our experience: divergence and convergence.
  • Convergence!
    • Is this guaranteed?
    • To what? A common baseline?
  • What we do know
    • Institutional perspective can be contrasted to
      a baseline.
    • As aggregation increases in size, so does value.
    • Cross-validation is key.

11
MESUR¹: Metrics from Scholarly Usage of Resources.
  • Andrew W. Mellon Foundation funded study of
    usage-based metrics (2006-2008)
  • Executed at the Digital Library Research and
    Prototyping team, Los Alamos National Laboratory
    Research Library
  • Objectives
  • Create a model of the scholarly communication
    process.
  • Create a large-scale reference data set (semantic
    network) that relates all relevant bibliographic,
    citation and usage data according to (1).
  • Characterize reference data set.
  • Survey usage-based metrics on basis of reference
    data set.

1. Pronounced "measure"
12
The MESUR project.
Johan Bollen (LANL): Principal investigator
Herbert Van de Sompel (LANL): Architectural consultant
Marko Rodriguez (LANL): PhD student (Computer Science, UCSC)
Ryan Chute (LANL): Software development and database management
Lyudmila Balakireva (LANL): Database management and HCI
Aric Hagberg (LANL): Mathematical and statistical consultant
Luis Bettencourt (LANL): Mathematical and statistical consultant
The Andrew W. Mellon Foundation has awarded a
grant to Los Alamos National Laboratory (LANL) in
support of a two-year project that will
investigate metrics derived from the
network-based usage of scholarly information. The
Digital Library Research & Prototyping Team of
the LANL Research Library will carry out the
project. The project's major objective is
enriching the toolkit used for the assessment of
the impact of scholarly communication items, and
hence of scholars, with metrics that derive from
usage data.
13
Project data flow and work plan.
[Diagram: project data flow and work plan, steps 1-4]
14
Project timeline.
We are here!
15
Presentation structure: an update on the MESUR
project
  • Usage data characterization
  • Analysis
  • Usage graphs
  • Metrics analysis
  • Results
  • Discussion

16
Presentation structure: an update on the MESUR
project
  • Usage data characterization
  • Analysis
  • Usage graphs
  • Metrics analysis
  • Results
  • Discussion

17
A tapestry of usage data providers
  • Each represents a different, and possibly
    overlapping, sample of the scholarly community.
    • Institutions
      • Institutional communities
      • Many collections
    • Aggregators
      • Many communities
      • Many collections
    • Publishers
      • Many communities
      • Publisher collection
  • Main players
    • Individual institutions
      • Link resolver data
      • EZproxy
    • Aggregators
      • Ad hoc formats
      • COUNTER reports
    • Publishers
      • Ad hoc formats
      • COUNTER reports

18
Negotiation results
  • Data: > 1B usage events and 1B citations
    • At this point, 247,083,481 usage events loaded
    • Another 1,000,000,000 on the way
  • Documents: > 50M documents
  • Journals: 326,000
    • Includes newspapers, magazines
    • Professional magazines
    • Obscure material
  • Community: > 100M users and authors combined

19
Data acquired: timelines
  • Span
    • Majority: ~1 year
    • Some minor 2002-2003 data
  • Sharing models
    • Historical data for period [t-x, t]
    • Periodical updates
  • Main issues
    • Restoration of archives
      • Digital preservation issues
      • All data fields intact
    • Integration of various sources of usage data

20
Data flow
http://www.mesur.org/schemas/2007-01/mesur/
21
An ontology of the scholarly communication
process?
22
Modeling the scholarly communication process: the
MESUR ontology.
  • Basic concepts
    • OWL, RDF/XML representation
    • Three basic notions: Documents, Agents and
      Contexts
    • Context: n-ary relationship between documents
      and agents.
    • Subclassed to Events and States to express
      action (e.g. Uses) vs. continuous state (e.g.
      hasImpact)
    • Previous efforts: the ScholOnto project, ABC
      ontology, VA Tech Goncalves (2002), Web
      Scholars, etc.
  • Requirements
    • Combined representation of usage data with
      bibliographic and citation data.
    • Fine granularity
    • Pragmatism in modeling

23
Modeling the scholarly communication process: the
MESUR ontology.
Examples: An author (Agent) publishes
(Context:Event) an article (Document). A user
(Agent) uses (Context:Event) a journal (Document).
http://www.mesur.org/schemas/2007-01/mesur/
Based on the OntologyX5 framework developed by
Rights.com
Rodriguez, Bollen & Van de Sompel. A Practical
Ontology for the Large-Scale Modeling of
Scholarly Artifacts and their Usage. JCDL'07
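The n-ary Context pattern in the examples can be sketched as plain triples: an event node ties the Agent to the Document instead of a direct author-article link. The property names (hasAgent, hasDocument) and ex: identifiers below are illustrative assumptions, not the actual MESUR schema terms.

```python
# "An author (Agent) publishes (Context:Event) an article (Document)"
# expressed as reified triples around a hypothetical event node.
MESUR = "http://www.mesur.org/schemas/2007-01/mesur/"

triples = [
    ("ex:event1", "rdf:type", MESUR + "Publishes"),      # Context:Event
    ("ex:event1", MESUR + "hasAgent", "ex:author1"),     # the Agent
    ("ex:event1", MESUR + "hasDocument", "ex:article1"), # the Document
]
# every statement hangs off the single event node
subjects = {s for s, p, o in triples}
print(subjects)  # {'ex:event1'}
```

Reifying the relationship this way is what lets extra facts (date, service type) attach to the event itself.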
24
Presentation structure: an update on the MESUR
project
  • Usage data characterization
  • Analysis
  • Usage graphs
  • Metrics analysis
  • Results
  • Discussion

25
MESUR's usage data representation framework.
  • Assumptions
    • Sessions identify sequences of events (same user,
      same documents)
    • Documents tied to aggregate request objects
    • Request objects consist of a series of service
      requests and the date and time at which each
      request took place.
  • Implications
    • Sequence preserved.
    • Most usage data and statistics can be
      reconstructed from the framework
    • Lends itself to XML and RDF formats
    • Permits request type filtering
  • Example
    • COUNTER stats: aggregate request counts for each
      document (journal) grouped by date/time (month)
    • Usage graph: overlay sessions with same document
      pairs
  • Out of 13 MESUR providers so far, only 3 natively
    follow this model.
    • The usage data of another 8 contain the
      necessary information for conversion
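The assumptions above amount to a small data model: sessions own an ordered list of requests, and each request records a document, a service type, and a timestamp. A minimal sketch with illustrative field names (not MESUR's actual schema):

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Request:
    document_id: str     # the document the request pertains to
    service_type: str    # e.g. "fulltext", "abstract" (hypothetical values)
    timestamp: datetime  # date and time at which the request took place

@dataclass
class Session:
    session_id: str                               # same user, one visit
    requests: list = field(default_factory=list)  # order preserved

s = Session("sess-1")
s.requests.append(Request("doi:10.1000/demo", "fulltext",
                          datetime(2006, 5, 1, 12, 0)))
print(len(s.requests))  # 1
```

Keeping the request sequence inside the session is what makes both COUNTER-style aggregation and usage-graph construction reconstructible from the same store.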

26
Implications for structural analysis of usage data
  • Sequence preservation allows
    • Reconstruction of user behavior
    • Usage graphs!
  • Statistics do not allow this type of analysis BUT
    are useful for
    • validating results
    • rankings

27
How to generate a usage graph.
  • Documents are associated by co-occurrence in the
    same session
    • Same session, same user → common interest
    • Frequency of co-occurrence in same session →
      degree of relationship
    • Normalized conditional probability
  • Usage data
    • Works for journals and articles
    • Anything for which usage was recorded
  • Options
    • Strict pair-wise sequence?
    • All within session?
    • Take distance into account?
  • Note: not something we invented. Association rule
    learning in data mining. Beer and diapers!
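The pair-wise option above (consecutive documents within a session add edge weight) can be sketched directly; the session data here is invented:

```python
from collections import defaultdict

def usage_graph(sessions):
    """sessions: list of per-session document-ID sequences.
    Returns dict (doc_a, doc_b) -> co-occurrence frequency."""
    edges = defaultdict(int)
    for seq in sessions:
        # only consecutive pairs within a session, order preserved
        for a, b in zip(seq, seq[1:]):
            if a != b:  # skip repeated requests for the same document
                edges[(a, b)] += 1
    return dict(edges)

sessions = [["J1", "J2", "J3"], ["J1", "J2"], ["J2", "J3", "J2"]]
print(usage_graph(sessions))
```

The raw counts could then be normalized to conditional probabilities, as the slide suggests, by dividing each edge weight by the total outgoing weight of its source document.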

28
Usage graphs
  • MESUR graph created
    • 200M usage events
    • Usage restricted to 2006
    • Journals clipped to 7,600 2004 JCR journals
    • Pair-wise sequences
      • Within session, only consecutive pairs
    • Raw frequency weights
  • Network analysis now on-going
    • Network properties
    • Clustering

29
Lay of the land: flow of information.
30
Presentation structure: an update on the MESUR
project
  • Usage data characterization
  • Analysis
  • Usage graphs
  • Metrics analysis
  • Results
  • Discussion

31
Metric types
  • Note
    • Metrics can be calculated on both citation and
      usage data
    • Structural metrics require graphs
      • Citation graph, e.g. 2004 JCR
      • Usage graph, e.g. created by MESUR

32
Frequentist metrics
  • Raw counts
    • Count number of citations to document or journal
    • Count number of times document or journal was
      accessed
  • Normalized
    • Journal Impact Factor
      • Number of citations to journal
      • Divided by number of articles published in
        journal
    • Usage Impact Factor
      • Number of requests for journal or article
      • Divided by number of articles published in
        journal

Johan Bollen. Usage Impact Factor: the effects of
sample characteristics on usage-based impact
metrics. Journal of the American Society for
Information Science and Technology, 59(1), 2008
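The Usage Impact Factor mirrors the citation IF with requests in place of citations: requests to a journal divided by articles published. The journal names and counts below are invented for illustration:

```python
def usage_impact_factor(requests, articles):
    """requests, articles: dicts journal -> count. Returns journal -> UIF."""
    return {j: requests[j] / articles[j] for j in requests}

requests = {"J.Foo": 12000, "J.Bar": 3000}
articles = {"J.Foo": 400, "J.Bar": 50}
uif = usage_impact_factor(requests, articles)
# J.Bar gets fewer raw requests but far more per article
print(sorted(uif, key=uif.get, reverse=True))  # ['J.Bar', 'J.Foo']
```

The normalization is the point: raw request counts reward large journals, while the per-article rate can rank a small, heavily-read journal first.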
33
Structural metrics calculated from usage graph
  • Classes of metrics
    • Degree
    • Shortest path
    • Random walk
    • Distribution
  • Degree
    • In-degree
    • Out-degree
  • Shortest path
    • Closeness
    • Betweenness
    • Newman
  • Random walk
    • PageRank
    • Eigenvector
  • Distribution
    • In-degree entropy
    • Out-degree entropy
    • Bucket entropy

Each can be defined to take weights into account,
e.g. by means of a weighted shortest-path
definition.
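One of the distribution metrics listed, in-degree entropy, can be sketched as the Shannon entropy of the weights on a node's incoming edges: a journal cited or used evenly by many sources scores higher than one dominated by a single source. The graph below is a toy example, not MESUR data:

```python
import math

def in_degree_entropy(edges, node):
    """edges: dict (u, v) -> weight. Shannon entropy (bits) of the
    weight distribution over edges arriving at `node`."""
    w_in = [w for (u, v), w in edges.items() if v == node]
    total = sum(w_in)
    return -sum((w / total) * math.log2(w / total) for w in w_in)

edges = {("A", "X"): 1, ("B", "X"): 1, ("C", "X"): 1, ("D", "X"): 1,
         ("A", "Y"): 97, ("B", "Y"): 1, ("C", "Y"): 1, ("D", "Y"): 1}
print(in_degree_entropy(edges, "X"))  # 2.0 (uniform over 4 sources)
print(in_degree_entropy(edges, "Y"))  # much lower: one dominant source
```

This captures breadth of attention rather than volume, which is why the entropy rankings later in the talk differ from the raw in-degree ones.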
34
Social network metrics: different aspects of
impact I
[Diagrams: degree metrics (degree centrality,
in-degree/IF); shortest path metrics (closeness
centrality, betweenness centrality)]
35
Social network metrics: different aspects of
impact II
Random walk metrics, e.g. PageRank
  • Basic idea
    • Random walkers follow edges
    • Probability of random teleportation
    • Visitation numbers converge → PageRank
    • Stationary probability distribution

[Illustration from wikipedia.org]
36
Presentation structure: an update on the MESUR
project
  • Usage data characterization
  • Analysis
  • Usage graphs
  • Metrics analysis
  • Results
  • Discussion

37
Set of metrics calculated on MESUR data set
  • List of metrics
  • JCR 2004
  • CITE-BE
  • CITE-ID
  • CITE-IE
  • CITE-IF
  • CITE-OD
  • CITE-OE
  • CITE-PG
  • CITE-UBW
  • CITE-UBW-UN
  • CITE-UCL
  • CITE-UCL-UN
  • CITE-UNM
  • CITE-UNM-UN
  • CITE-UPG
  • CITE-UPR
  • CITE-WBW
  • CITE-WBW-UN
  • Usage-based metrics
  • MESUR 2006
  • USES-BE
  • USES-ID
  • USES-IE
  • USES-OD
  • USES-OE
  • USES-PG
  • USES-UBW
  • USES-UBW-UN
  • USES-UCL
  • USES-UCL-UN
  • USES-UNM
  • USES-UNM-UN
  • USES-UPG
  • USES-UPR
  • USES-WBW
  • USES-WBW-UN
  • USES-WCL

Usage graph creation: Wenzhong Zhao. Metrics:
Marko Rodriguez and Aric Hagberg.
38
Overlaps and discrepancies
Rankings and correlation structure will reveal
components of the notion of scholarly impact
across citation and usage data.
39
Citation rankings

2004 Impact Factor (value, journal):
1. 49.794  CANCER
2. 47.400  ANNU REV IMMUNOL
3. 44.016  NEW ENGL J MED
4. 33.456  ANNU REV BIOCHEM
5. 31.694  NAT REV CANCER

Citation PageRank (value, journal):
1. 0.0116  SCIENCE
2. 0.0111  J BIOL CHEM
3. 0.0108  NATURE
4. 0.0101  PNAS
5. 0.006   PHYS REV LETT

Betweenness (value, journal):
1. 0.076  PNAS
2. 0.072  SCIENCE
3. 0.059  NATURE
4. 0.039  LECT NOTES COMPUT SC
5. 0.017  LANCET

Closeness (value, journal):
1. 7.02e-05  PNAS
2. 6.72e-05  LECT NOTES COMPUT SC
3. 6.43e-05  NATURE
4. 6.37e-05  SCIENCE
5. 6.37e-05  J BIOL CHEM

In-degree (value, journal):
1. 3448  SCIENCE
2. 3182  NATURE
3. 2913  PNAS
4. 2190  LANCET
5. 2160  NEW ENGL J MED

In-degree entropy (value, journal):
1. 9.849  LANCET
2. 9.748  SCIENCE
3. 9.701  NEW ENGL J MED
4. 9.611  NATURE
5. 9.526  JAMA
40
Usage rankings

2004 Impact Factor (value, journal):
1. 49.794  CANCER
2. 47.400  ANNU REV IMMUNOL
3. 44.016  NEW ENGL J MED
4. 33.456  ANNU REV BIOCHEM
5. 31.694  NAT REV CANCER

Betweenness (value, journal):
1. 0.035  SCIENCE
2. 0.032  NATURE
3. 0.020  PNAS
4. 0.017  LECT NOTES COMPUT SC
5. 0.006  LANCET

PageRank (value, journal):
1. 0.0016  SCIENCE
2. 0.0015  NATURE
3. 0.0013  PNAS
4. 0.0010  LECT NOTES COMPUT SC
5. 0.0008  J BIOL CHEM

In-degree (value, journal):
1. 4195  SCIENCE
2. 4019  NATURE
3. 3562  PNAS
4. 2438  J BIOL CHEM
5. 2432  LECT NOTES COMPUT SC

In-degree entropy (value, journal):
1. 9.364  MED HYPOTHESES
2. 9.152  PNAS
3. 9.027  LIFE SCI
4. 8.939  LANCET
5. 8.858  INT J BIOCHEM CELL B

Closeness (value, journal):
1. 0.670  SCIENCE
2. 0.665  NATURE
3. 0.644  PNAS
4. 0.591  LECT NOTES COMPUT SC
5. 0.587  BIOCHEM BIOPH RES CO
41
Metrics relationship
42
Metrics relationships
  • Citation and usage metrics reveal entirely
    different patterns
  • Citation is split into 2 sections
    • Degree metrics (right)
    • Shortest path and random walk (left)
  • Usage is split into 4 clusters
    • Degree metrics
    • PageRank and entropy
    • Closeness
    • Betweenness
  • The usage pattern can be caused by
    • Noise in the usage graph
    • Higher density of usage/nodes

43
Hierarchical cluster analysis
[Dendrogram leaves: Citation PageRank; Usage
degree; Usage closeness; Usage betweenness; Usage
PageRank; Citation degree; Citation closeness;
Citation betweenness; Impact Factor]
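A dendrogram like the one this slide shows comes from agglomerative clustering of metric-to-metric distances (e.g. 1 minus rank correlation). A minimal single-linkage sketch with invented distances between three hypothetical metrics:

```python
def single_linkage(dist, names):
    """dist: symmetric dict-of-dicts of distances.
    Repeatedly merge the two closest clusters; return the merge order."""
    clusters = [{n} for n in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between closest members
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] |= clusters[j]
        del clusters[j]
    return merges

dist = {"IF":    {"IF": 0, "InDeg": .1, "PR": .6},
        "InDeg": {"IF": .1, "InDeg": 0, "PR": .5},
        "PR":    {"IF": .6, "InDeg": .5, "PR": 0}}
merges = single_linkage(dist, ["IF", "InDeg", "PR"])
print(merges[0][:2])  # (['IF'], ['InDeg'])
```

With these toy distances, IF and in-degree merge first, echoing the talk's finding that IF clusters with citation degree metrics while PageRank sits apart.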
44
Presentation structure: an update on the MESUR
project
  • Usage data characterization
  • Analysis
  • Usage graphs
  • Metrics analysis
  • Results
  • Discussion

45
MESUR: an update
  • Usage data
    • Creation of the single largest reference data
      set of usage, citation and bibliographic data
    • 1,000,000,000 usage events loaded in the next
      month
    • Usage data obtained from multiple publishers,
      aggregators and institutions
    • Infrastructure for a continued research
      program in this domain
    • Results will guide scholarly evaluation and
      may help produce standards for usage data
      representation
  • Usage graphs
    • An adequate data model for item-level usage
      data naturally leads to this
    • Reduced distortion compared to raw usage:
      structure counts, not raw hits
    • Several options on how to create them; MESUR
      investigates the options
  • Metrics
    • Frequentist and structural metrics
    • Each can represent different facets of
      scholarly impact
    • Simple metrics can produce adequate results.
      Law of diminishing returns?
    • Hybrid metrics based on triple store
      functionality
    • Note: increasing convergence of usage metrics
      to citation metrics as sample size increases.

46
Some relevant publications.
  • Johan Bollen, Herbert Van de Sompel, and Marko A.
    Rodriguez. Towards usage-based impact metrics:
    first results from the MESUR project. In
    Proceedings of the Joint Conference on Digital
    Libraries, Pittsburgh, June 2008.
  • Marko A. Rodriguez, Johan Bollen and Herbert Van
    de Sompel. A Practical Ontology for the
    Large-Scale Modeling of Scholarly Artifacts and
    their Usage. In Proceedings of the Joint
    Conference on Digital Libraries, Vancouver, June
    2007.
  • Johan Bollen and Herbert Van de Sompel. Usage
    Impact Factor: the effects of sample
    characteristics on usage-based impact metrics.
    (cs.DL/0610154)
  • Johan Bollen and Herbert Van de Sompel. An
    architecture for the aggregation and analysis of
    scholarly usage data. In Joint Conference on
    Digital Libraries (JCDL 2006), pages 298-307,
    June 2006.
  • Johan Bollen and Herbert Van de Sompel. Mapping
    the structure of science through usage.
    Scientometrics, 69(2), 2006.
  • Johan Bollen, Marko A. Rodriguez, and Herbert
    Van de Sompel. Journal status. Scientometrics,
    69(3), December 2006 (arxiv.org: cs.DL/0601030)
  • Johan Bollen, Herbert Van de Sompel, Joan Smith,
    and Rick Luce. Toward alternative metrics of
    journal impact: a comparison of download and
    citation data. Information Processing and
    Management, 41(6):1419-1440, 2005.