Transcript and Presenter's Notes

Title: Usage-based models of science:
1
Usage-based models of science: applications to
community mapping and scholarly assessment
Johan Bollen, Digital Library Research &
Prototyping Team, Los Alamos National Laboratory -
Research Library, jbollen@lanl.gov
Acknowledgements: Herbert Van de Sompel (LANL),
Marko A. Rodriguez (LANL), Ryan Chute (LANL),
Lyudmila L. Balakireva (LANL), Aric Hagberg (LANL),
Luis Bettencourt (LANL). Research supported by the
Andrew W. Mellon Foundation.
2
Scholarly evaluation
  • Qualitative/subjective
    • Peer review
    • Tenure committees
    • Networks
  • Quantitative
    • Citation counts
    • Many proposals, little clarity

3
Impact evaluation from citation data.
  • Citation data
    • Gold standard of scholarly evaluation
    • Citation = scholarly influence.
    • Extracted from published materials.
    • Main bibliometric data source for scholarly
      evaluation.
    • IF is part of the Journal Citation Reports (JCR)
  • JCR citation graph
    • 2005 journal citation network
    • 8,560 journals
    • 6,370,234 weighted citation edges
  • Impact Factor = mean 2-year citation rate
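The 2-year Impact Factor above can be sketched in a few lines: citations received in year Y to items published in the two prior years, divided by the number of citable items in those years. The journal and its counts below are invented for illustration, not JCR data.

```python
def impact_factor(citations_to_prior_two_years, articles_prior_two_years):
    """Mean 2-year citation rate for a journal."""
    return citations_to_prior_two_years / articles_prior_two_years

# Hypothetical journal: 900 citations in 2005 to its 2003-2004
# articles, of which there were 300.
print(impact_factor(900, 300))  # 3.0
```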

4
Citation graph: other metrics?
  • Possibility to calculate other metrics of impact.
  • How about PageRank?
  • IF = normalized in-degree in citation graph
    • Popularity
    • Favors review journals
  • Are all citations created equal?
    • Transfer citer influence to citee
    • Normalize edge weight
    • Modulate transfer by citer influence
    • Indicator of node prestige

Pinski, G., Narin, F. (1976). Citation influence
for journal aggregates of scientific publications:
theory, with application to the literature of
physics. Information Processing and Management,
12(5), 297-312.
Chen, P., Xie, H., Maslov, S., Redner, S. (2007).
Finding scientific gems with Google. Journal of
Informetrics, 1(1), arxiv.org/abs/physics/0604130.
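The influence-transfer idea on this slide is what PageRank implements: a citer's score is divided over its citation edges in proportion to edge weight. A minimal sketch of weighted PageRank by power iteration, with invented journals and citation counts rather than any real citation graph:

```python
def pagerank(edges, nodes, d=0.85, iters=50):
    """edges: dict (citer, citee) -> weight. Returns node -> score."""
    pr = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = {n: sum(w for (u, v), w in edges.items() if u == n)
                  for n in nodes}
    for _ in range(iters):
        # teleportation share for every node
        new = {n: (1 - d) / len(nodes) for n in nodes}
        # transfer citer influence to citee, normalized by edge weight
        for (u, v), w in edges.items():
            new[v] += d * pr[u] * w / out_weight[u]
        # dangling nodes (no outgoing citations) redistribute uniformly
        dangling = sum(pr[n] for n in nodes if out_weight[n] == 0)
        for n in nodes:
            new[n] += d * dangling / len(nodes)
        pr = new
    return pr

journals = ["A", "B", "C"]
cites = {("A", "B"): 10, ("B", "C"): 5, ("C", "A"): 5, ("A", "C"): 2}
scores = pagerank(cites, journals)
print(max(scores, key=scores.get))
```

Because influence is modulated by the citer's own score, a node cited by prestigious nodes ranks above one with the same raw in-degree from obscure ones.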
5
Popularity vs. prestige
rho = 0.61. Outliers reveal differences in aspects
of status: IF = general popularity; PR = prestige,
influence.
Johan Bollen, Marko A. Rodriguez, and Herbert Van
de Sompel. Journal status. Scientometrics, 69(3),
December 2006 (doi:10.1007/s11192-006-0176-z)
Philip Ball. Prestige is factored into journal
ratings. Nature 439, 770-771, February 2006
(doi:10.1038/439770a)
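The rho = 0.61 above is a rank correlation between the IF and PageRank orderings. A minimal Spearman correlation for tie-free score lists (the scores below are toy values, not the actual journal data):

```python
def spearman(x, y):
    """Spearman rho for two equal-length score lists without ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # classic closed form: 1 - 6 * sum(d^2) / (n^3 - n)
    return 1 - 6 * d2 / (n * (n * n - 1))

print(spearman([1, 2, 3, 4, 5], [1, 3, 2, 5, 4]))  # 0.8
```

A rho well below 1.0, as here, is exactly what lets the outliers separate popularity from prestige.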
6
Domain specific
7
A few issues
  • On the basis of the citation graph, many
    alternatives
    • Normalized citation statistics
    • Social network indicators (centrality)
    • Semantic web hybrids
  • Which one to choose?
  • Semantics
    • Validity: does the indicator express what it is
      intended to express?
    • Reliability: sensitivity to changes in network
      structure?
  • Another question: need they be based on citation
    data?
    • Other means of expressing influence
    • Readership, services requested, etc.
    • Usage!

8
Usage data
  • Citation data pertain to 4 levels in the
    scholarly communication process
    • Community: authors of journal articles.
    • Artifacts: journal articles.
    • Data: citation data (1 year publication delay).
    • Metrics: mean citation rate rules supreme.
    • Scale: expensive to extract.
  • However, for usage data
    • Community: all users, including most authors.
    • Artifacts: all that is accessible.
    • Data: recorded upon publication.
    • Metrics: a range of web and Web 2.0 inspired
      metrics, e.g. clickstream and data mining.
    • Scale: automatically recorded at point of
      service.

Hence, various initiatives focused on usage data:
COUNTER, IRS, SUSHI, CiteBase. But where are the
metrics?
9
Challenges to usage-based metrics.
  • Usage data is here
    • Routinely recorded by library, publisher and
      aggregator services
    • Large-scale and longitudinal
    • Highly detailed
      • Requester
      • Referent
      • Sessions
      • Service type
  • Usage-based metrics have lagged development.
    Here's why:
    • Multiple communities
    • Multiple collections (artifacts)
    • Data: usage data limited to particular
      sub-communities and collections of artifacts.
    • Metrics: various metrics studied. Different
      results because of sample, collection or metric
      definition? Aspects of scholarly status?


10
Our experience: divergence and convergence.
  • Convergence!
    • Is this guaranteed?
    • To what? A common baseline?
  • What we do know
    • Institutional perspective can be contrasted to
      a baseline.
    • As aggregation increases in size, so does value.
    • Cross-validation is key.

11
MESUR¹: Metrics from Scholarly Usage of Resources.
  • Andrew W. Mellon Foundation funded study of
    usage-based metrics (2006-2008)
  • Executed at the Digital Library Research and
    Prototyping team, Los Alamos National Laboratory
    Research Library
  • Objectives
  • Create a model of the scholarly communication
    process.
  • Create a large-scale reference data set (semantic
    network) that relates all relevant bibliographic,
    citation and usage data according to (1).
  • Characterize reference data set.
  • Survey usage-based metrics on basis of reference
    data set.

1. Pronounced "measure"
12
The MESUR project.
Johan Bollen (LANL): Principal investigator
Herbert Van de Sompel (LANL): Architectural consultant
Marko Rodriguez (LANL): PhD student (Computer Science, UCSC)
Ryan Chute (LANL): Software development and database management
Lyudmila Balakireva (LANL): Database management and HCI
Aric Hagberg (LANL): Mathematical and statistical consultant
Luis Bettencourt (LANL): Mathematical and statistical consultant
The Andrew W. Mellon Foundation has awarded a
grant to Los Alamos National Laboratory (LANL) in
support of a two-year project that will
investigate metrics derived from the
network-based usage of scholarly information. The
Digital Library Research & Prototyping Team of
the LANL Research Library will carry out the
project. The project's major objective is
enriching the toolkit used for the assessment of
the impact of scholarly communication items, and
hence of scholars, with metrics that derive from
usage data.
13
Project data flow and work plan.
[Diagram: project data flow and work plan, steps 1-4]
14
Project timeline.
We are here!
15
Presentation structure: an update on the MESUR
project
  • Usage data characterization
  • Analysis
  • Usage graphs
  • Metrics analysis
  • Results
  • Discussion

16
Presentation structure: an update on the MESUR
project
  • Usage data characterization
  • Analysis
  • Usage graphs
  • Metrics analysis
  • Results
  • Discussion

17
A tapestry of usage data providers
  • Each represents a different, and possibly
    overlapping, sample of the scholarly community.
    • Institutions
      • Institutional communities
      • Many collections
    • Aggregators
      • Many communities
      • Many collections
    • Publishers
      • Many communities
      • Publisher collection
  • Main players
    • Individual institutions
      • Link resolver data
      • EZproxy
    • Aggregators
      • Ad hoc formats
      • COUNTER reports
    • Publishers
      • Ad hoc formats
      • COUNTER reports

18
Negotiation results
  • Data: > 1B usage events and 1B citations
    • At this point, 247,083,481 usage events loaded
    • Another 1,000,000,000 on the way
  • Documents: > 50M documents
  • Journals: 326,000
    • Includes newspapers, magazines
    • Professional magazines
    • Obscure material
  • Community: > 100M users and authors combined

19
Data acquired: timelines
  • Span
    • Majority: ~1 year
    • Some minor 2002-2003 data
  • Sharing models
    • Historical data for period [t-x, t]
    • Periodical updates
  • Main issues
    • Restoration of archives
      • Digital preservation issues
      • All data fields intact
    • Integration of various sources of usage data

20
Data flow
http://www.mesur.org/schemas/2007-01/mesur/
21
An ontology of the scholarly communication
process?
22
Modeling the scholarly communication process: the
MESUR ontology.
  • Basic concepts
    • OWL, RDF/XML representation
    • Three basic notions: Documents, Agents and
      Contexts
    • Context: n-ary relationship between documents
      and agents.
    • Subclassed to Events and States to express
      action (e.g. Uses) vs. continuous state (e.g.
      hasImpact)
    • Previous efforts: the ScholOnto project, ABC
      ontology, VA Tech Goncalves (2002), Web
      Scholars, etc.
  • Requirements
    • Combined representation of usage data with
      bibliographic and citation data.
    • Fine granularity
    • Pragmatism in modeling

23
Modeling the scholarly communication process: the
MESUR ontology.
Examples: An author (Agent) publishes
(Context:Event) an article (Document). A user
(Agent) uses (Context:Event) a journal (Document).
http://www.mesur.org/schemas/2007-01/mesur/
Based on the OntologyX5 framework developed by
Rights.com
Rodriguez, Bollen & Van de Sompel. A Practical
Ontology for the Large-Scale Modeling of
Scholarly Artifacts and their Usage. JCDL'07
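The n-ary Context pattern in the examples can be sketched as plain triples: an event node ties the Agent to the Document instead of a direct author-article link. The property names (hasAgent, hasDocument) and ex: identifiers below are illustrative assumptions, not the actual MESUR schema terms.

```python
# "An author (Agent) publishes (Context:Event) an article (Document)"
# expressed as reified triples around a hypothetical event node.
MESUR = "http://www.mesur.org/schemas/2007-01/mesur/"

triples = [
    ("ex:event1", "rdf:type", MESUR + "Publishes"),      # Context:Event
    ("ex:event1", MESUR + "hasAgent", "ex:author1"),     # the Agent
    ("ex:event1", MESUR + "hasDocument", "ex:article1"), # the Document
]
# every statement hangs off the single event node
subjects = {s for s, p, o in triples}
print(subjects)  # {'ex:event1'}
```

Reifying the relationship this way is what lets extra facts (date, service type) attach to the event itself.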
24
Presentation structure: an update on the MESUR
project
  • Usage data characterization
  • Analysis
  • Usage graphs
  • Metrics analysis
  • Results
  • Discussion

25
MESUR's usage data representation framework.
  • Assumptions
    • Sessions identify sequences of events (same user,
      same documents)
    • Documents tied to aggregate request objects
    • Request objects consist of a series of service
      requests and the date and time at which each
      request took place.
  • Implications
    • Sequence preserved.
    • Most usage data and statistics can be
      reconstructed from the framework
    • Lends itself to XML and RDF formats
    • Permits request type filtering
  • Example
    • COUNTER stats: aggregate request counts for each
      document (journal) grouped by date/time (month)
    • Usage graph: overlay sessions with same document
      pairs
  • Out of 13 MESUR providers so far, only 3 natively
    follow this model.
    • The usage data of another 8 contain the
      necessary information for conversion
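The assumptions above amount to a small data model: sessions own an ordered list of requests, and each request records a document, a service type, and a timestamp. A minimal sketch with illustrative field names (not MESUR's actual schema):

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Request:
    document_id: str     # the document the request pertains to
    service_type: str    # e.g. "fulltext", "abstract" (hypothetical values)
    timestamp: datetime  # date and time at which the request took place

@dataclass
class Session:
    session_id: str                               # same user, one visit
    requests: list = field(default_factory=list)  # order preserved

s = Session("sess-1")
s.requests.append(Request("doi:10.1000/demo", "fulltext",
                          datetime(2006, 5, 1, 12, 0)))
print(len(s.requests))  # 1
```

Keeping the request sequence inside the session is what makes both COUNTER-style aggregation and usage-graph construction reconstructible from the same store.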

26
Implications for structural analysis of usage data
  • Sequence preservation allows
    • Reconstruction of user behavior
    • Usage graphs!
  • Statistics do not allow this type of analysis BUT
    are useful for
    • validating results
    • rankings

27
How to generate a usage graph.
  • Documents are associated by co-occurrence in the
    same session
    • Same session, same user → common interest
    • Frequency of co-occurrence in same session →
      degree of relationship
    • Normalized conditional probability
  • Usage data
    • Works for journals and articles
    • Anything for which usage was recorded
  • Options
    • Strict pair-wise sequence?
    • All within session?
    • Take distance into account?
  • Note: not something we invented. Association rule
    learning in data mining. Beer and diapers!
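The pair-wise option above (consecutive documents within a session add edge weight) can be sketched directly; the session data here is invented:

```python
from collections import defaultdict

def usage_graph(sessions):
    """sessions: list of per-session document-ID sequences.
    Returns dict (doc_a, doc_b) -> co-occurrence frequency."""
    edges = defaultdict(int)
    for seq in sessions:
        # only consecutive pairs within a session, order preserved
        for a, b in zip(seq, seq[1:]):
            if a != b:  # skip repeated requests for the same document
                edges[(a, b)] += 1
    return dict(edges)

sessions = [["J1", "J2", "J3"], ["J1", "J2"], ["J2", "J3", "J2"]]
print(usage_graph(sessions))
```

The raw counts could then be normalized to conditional probabilities, as the slide suggests, by dividing each edge weight by the total outgoing weight of its source document.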

28
Usage graphs
  • MESUR graph created
    • 200M usage events
    • Usage restricted to 2006
    • Journals clipped to 7,600 2004 JCR journals
    • Pair-wise sequences
      • Within session, only consecutive pairs
    • Raw frequency weights
  • Network analysis now on-going
    • Network properties
    • Clustering

29
Lay of the land: flow of information.
30
Presentation structure: an update on the MESUR
project
  • Usage data characterization
  • Analysis
  • Usage graphs
  • Metrics analysis
  • Results
  • Discussion

31
Metric types
  • Note
    • Metrics can be calculated on both citation and
      usage data
    • Structural metrics require graphs
      • Citation graph, e.g. 2004 JCR
      • Usage graph, e.g. created by MESUR

32
Frequentist metrics
  • Raw counts
    • Count number of citations to document or journal
    • Count number of times document or journal was
      accessed
  • Normalized
    • Journal Impact Factor
      • Number of citations to journal
      • Divided by number of articles published in
        journal
    • Usage Impact Factor
      • Number of requests for journal or article
      • Divided by number of articles published in
        journal

Johan Bollen. Usage Impact Factor: the effects of
sample characteristics on usage-based impact
metrics. Journal of the American Society for
Information Science and Technology, 59(1), 2008
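The Usage Impact Factor mirrors the citation IF with requests in place of citations: requests to a journal divided by articles published. The journal names and counts below are invented for illustration:

```python
def usage_impact_factor(requests, articles):
    """requests, articles: dicts journal -> count. Returns journal -> UIF."""
    return {j: requests[j] / articles[j] for j in requests}

requests = {"J.Foo": 12000, "J.Bar": 3000}
articles = {"J.Foo": 400, "J.Bar": 50}
uif = usage_impact_factor(requests, articles)
# J.Bar gets fewer raw requests but far more per article
print(sorted(uif, key=uif.get, reverse=True))  # ['J.Bar', 'J.Foo']
```

The normalization is the point: raw request counts reward large journals, while the per-article rate can rank a small, heavily-read journal first.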
33
Structural metrics calculated from usage graph
  • Classes of metrics
    • Degree
    • Shortest path
    • Random walk
    • Distribution
  • Degree
    • In-degree
    • Out-degree
  • Shortest path
    • Closeness
    • Betweenness
    • Newman
  • Random walk
    • PageRank
    • Eigenvector
  • Distribution
    • In-degree entropy
    • Out-degree entropy
    • Bucket entropy

Each can be defined to take weights into account,
e.g. by means of a weighted shortest-path
definition.
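One of the distribution metrics listed, in-degree entropy, can be sketched as the Shannon entropy of the weights on a node's incoming edges: a journal cited or used evenly by many sources scores higher than one dominated by a single source. The graph below is a toy example, not MESUR data:

```python
import math

def in_degree_entropy(edges, node):
    """edges: dict (u, v) -> weight. Shannon entropy (bits) of the
    weight distribution over edges arriving at `node`."""
    w_in = [w for (u, v), w in edges.items() if v == node]
    total = sum(w_in)
    return -sum((w / total) * math.log2(w / total) for w in w_in)

edges = {("A", "X"): 1, ("B", "X"): 1, ("C", "X"): 1, ("D", "X"): 1,
         ("A", "Y"): 97, ("B", "Y"): 1, ("C", "Y"): 1, ("D", "Y"): 1}
print(in_degree_entropy(edges, "X"))  # 2.0 (uniform over 4 sources)
print(in_degree_entropy(edges, "Y"))  # much lower: one dominant source
```

This captures breadth of attention rather than volume, which is why the entropy rankings later in the talk differ from the raw in-degree ones.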
34
Social network metrics: different aspects of
impact I
[Diagrams: degree metrics (degree centrality,
in-degree/IF); shortest path metrics (closeness
centrality, betweenness centrality)]
35
Social network metrics: different aspects of
impact II
Random walk metrics, e.g. PageRank
  • Basic idea
    • Random walkers follow edges
    • Probability of random teleportation
    • Visitation numbers converge → PageRank
    • Stationary probability distribution

[Illustration from wikipedia.org]
36
Presentation structure: an update on the MESUR
project
  • Usage data characterization
  • Analysis
  • Usage graphs
  • Metrics analysis
  • Results
  • Discussion

37
Set of metrics calculated on MESUR data set
  • List of metrics
  • JCR 2004
  • CITE-BE
  • CITE-ID
  • CITE-IE
  • CITE-IF
  • CITE-OD
  • CITE-OE
  • CITE-PG
  • CITE-UBW
  • CITE-UBW-UN
  • CITE-UCL
  • CITE-UCL-UN
  • CITE-UNM
  • CITE-UNM-UN
  • CITE-UPG
  • CITE-UPR
  • CITE-WBW
  • CITE-WBW-UN
  • Usage-based metrics
  • MESUR 2006
  • USES-BE
  • USES-ID
  • USES-IE
  • USES-OD
  • USES-OE
  • USES-PG
  • USES-UBW
  • USES-UBW-UN
  • USES-UCL
  • USES-UCL-UN
  • USES-UNM
  • USES-UNM-UN
  • USES-UPG
  • USES-UPR
  • USES-WBW
  • USES-WBW-UN
  • USES-WCL

Usage graph creation: Wenzhong Zhao. Metrics:
Marko Rodriguez and Aric Hagberg.
38
Overlaps and discrepancies
Rankings and correlation structure will reveal
components of the notion of scholarly impact
across citation and usage data.
39
Citation rankings

2004 Impact Factor (value, journal):
1. 49.794  CANCER
2. 47.400  ANNU REV IMMUNOL
3. 44.016  NEW ENGL J MED
4. 33.456  ANNU REV BIOCHEM
5. 31.694  NAT REV CANCER

Citation PageRank (value, journal):
1. 0.0116  SCIENCE
2. 0.0111  J BIOL CHEM
3. 0.0108  NATURE
4. 0.0101  PNAS
5. 0.006   PHYS REV LETT

Betweenness (value, journal):
1. 0.076  PNAS
2. 0.072  SCIENCE
3. 0.059  NATURE
4. 0.039  LECT NOTES COMPUT SC
5. 0.017  LANCET

Closeness (value, journal):
1. 7.02e-05  PNAS
2. 6.72e-05  LECT NOTES COMPUT SC
3. 6.43e-05  NATURE
4. 6.37e-05  SCIENCE
5. 6.37e-05  J BIOL CHEM

In-degree (value, journal):
1. 3448  SCIENCE
2. 3182  NATURE
3. 2913  PNAS
4. 2190  LANCET
5. 2160  NEW ENGL J MED

In-degree entropy (value, journal):
1. 9.849  LANCET
2. 9.748  SCIENCE
3. 9.701  NEW ENGL J MED
4. 9.611  NATURE
5. 9.526  JAMA
40
Usage rankings

2004 Impact Factor (value, journal):
1. 49.794  CANCER
2. 47.400  ANNU REV IMMUNOL
3. 44.016  NEW ENGL J MED
4. 33.456  ANNU REV BIOCHEM
5. 31.694  NAT REV CANCER

Betweenness (value, journal):
1. 0.035  SCIENCE
2. 0.032  NATURE
3. 0.020  PNAS
4. 0.017  LECT NOTES COMPUT SC
5. 0.006  LANCET

PageRank (value, journal):
1. 0.0016  SCIENCE
2. 0.0015  NATURE
3. 0.0013  PNAS
4. 0.0010  LECT NOTES COMPUT SC
5. 0.0008  J BIOL CHEM

In-degree (value, journal):
1. 4195  SCIENCE
2. 4019  NATURE
3. 3562  PNAS
4. 2438  J BIOL CHEM
5. 2432  LECT NOTES COMPUT SC

In-degree entropy (value, journal):
1. 9.364  MED HYPOTHESES
2. 9.152  PNAS
3. 9.027  LIFE SCI
4. 8.939  LANCET
5. 8.858  INT J BIOCHEM CELL B

Closeness (value, journal):
1. 0.670  SCIENCE
2. 0.665  NATURE
3. 0.644  PNAS
4. 0.591  LECT NOTES COMPUT SC
5. 0.587  BIOCHEM BIOPH RES CO
41
Metrics relationship
42
Metrics relationships
  • Citation and usage metrics reveal entirely
    different patterns
  • Citation is split into 2 sections
    • Degree metrics (right)
    • Shortest path and random walk (left)
  • Usage is split into 4 clusters
    • Degree metrics
    • PageRank and entropy
    • Closeness
    • Betweenness
  • The usage pattern can be caused by
    • Noise in the usage graph
    • Higher density of usage/nodes

43
Hierarchical cluster analysis
[Dendrogram leaves: Citation PageRank; Usage
degree; Usage closeness; Usage betweenness; Usage
PageRank; Citation degree; Citation closeness;
Citation betweenness; Impact Factor]
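A dendrogram like the one this slide shows comes from agglomerative clustering of metric-to-metric distances (e.g. 1 minus rank correlation). A minimal single-linkage sketch with invented distances between three hypothetical metrics:

```python
def single_linkage(dist, names):
    """dist: symmetric dict-of-dicts of distances.
    Repeatedly merge the two closest clusters; return the merge order."""
    clusters = [{n} for n in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between closest members
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] |= clusters[j]
        del clusters[j]
    return merges

dist = {"IF":    {"IF": 0, "InDeg": .1, "PR": .6},
        "InDeg": {"IF": .1, "InDeg": 0, "PR": .5},
        "PR":    {"IF": .6, "InDeg": .5, "PR": 0}}
merges = single_linkage(dist, ["IF", "InDeg", "PR"])
print(merges[0][:2])  # (['IF'], ['InDeg'])
```

With these toy distances, IF and in-degree merge first, echoing the talk's finding that IF clusters with citation degree metrics while PageRank sits apart.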
44
Presentation structure: an update on the MESUR
project
  • Usage data characterization
  • Analysis
  • Usage graphs
  • Metrics analysis
  • Results
  • Discussion

45
MESUR: an update
  • Usage data
    • Creation of the single largest reference data
      set of usage, citation and bibliographic data
    • 1,000,000,000 usage events loaded in the next
      month
    • Usage data obtained from multiple publishers,
      aggregators and institutions
    • Infrastructure for a continued research
      program in this domain
    • Results will guide scholarly evaluation and
      may help produce standards for usage data
      representation
  • Usage graphs
    • An adequate data model for item-level usage
      data naturally leads to this
    • Reduced distortion compared to raw usage:
      structure counts, not raw hits
    • Several options on how to create them; MESUR
      investigates the options
  • Metrics
    • Frequentist and structural metrics
    • Each can represent different facets of
      scholarly impact
    • Simple metrics can produce adequate results.
      Law of diminishing returns?
    • Hybrid metrics based on triple store
      functionality
    • Note: increasing convergence of usage metrics
      to citation metrics as sample size increases.

46
Some relevant publications.
  • Johan Bollen, Herbert Van de Sompel, and Marko A.
    Rodriguez. Towards usage-based impact metrics:
    first results from the MESUR project. In
    Proceedings of the Joint Conference on Digital
    Libraries, Pittsburgh, June 2008.
  • Marko A. Rodriguez, Johan Bollen and Herbert Van
    de Sompel. A Practical Ontology for the
    Large-Scale Modeling of Scholarly Artifacts and
    their Usage. In Proceedings of the Joint
    Conference on Digital Libraries, Vancouver, June
    2007.
  • Johan Bollen and Herbert Van de Sompel. Usage
    Impact Factor: the effects of sample
    characteristics on usage-based impact metrics.
    (cs.DL/0610154)
  • Johan Bollen and Herbert Van de Sompel. An
    architecture for the aggregation and analysis of
    scholarly usage data. In Joint Conference on
    Digital Libraries (JCDL 2006), pages 298-307,
    June 2006.
  • Johan Bollen and Herbert Van de Sompel. Mapping
    the structure of science through usage.
    Scientometrics, 69(2), 2006.
  • Johan Bollen, Marko A. Rodriguez, and Herbert
    Van de Sompel. Journal status. Scientometrics,
    69(3), December 2006 (arxiv.org: cs.DL/0601030)
  • Johan Bollen, Herbert Van de Sompel, Joan Smith,
    and Rick Luce. Toward alternative metrics of
    journal impact: a comparison of download and
    citation data. Information Processing and
    Management, 41(6):1419-1440, 2005.