Title: Usage-based models of science
1 Usage-based models of science: applications to
community mapping and scholarly assessment
Johan Bollen, Digital Library Research and
Prototyping Team, Los Alamos National Laboratory -
Research Library, jbollen_at_lanl.gov
Acknowledgements: Herbert Van de Sompel (LANL), Marko A.
Rodriguez (LANL), Ryan Chute (LANL), Lyudmila L.
Balakireva (LANL), Aric Hagberg (LANL), Luis
Bettencourt (LANL). Research supported by the
Andrew W. Mellon Foundation.
2 Scholarly evaluation
- Qualitative/subjective
- Peer review
- Tenure committees
- Networks
- Quantitative
- Citation counts
- Many proposals, little clarity
3 Impact evaluation from citation data.
- Citation data
- Gold standard of scholarly evaluation
- Citations express scholarly influence.
- Extracted from published materials.
- Main bibliometric data source for scholarly
evaluation.
- IF is part of Journal Citation Reports (JCR)
- JCR citation graph
- 2005 journal citation network
- 8,560 journals
- 6,370,234 weighted citation edges
- Impact Factor: mean 2-year citation rate
4 Citation graph: other metrics?
- Possibility to calculate other metrics of impact.
- How about PageRank?
- IF: normalized in-degree in citation graph
- Popularity
- Favors review journals
- Are all citations created equal?
- Transfer citer influence to citee
- Normalize edge weight
- Modulate transfer for citer influence
- Indicator of node prestige
Pinski, G., Narin, F. (1976). Citation
influence for journal aggregates of scientific
publications: theory, with application to the
literature of physics. Information Processing and
Management, 12(5), 297-312.
Chen, P., Xie, H., Maslov, S., Redner, S. (2007).
Finding scientific gems with Google. Journal of
Informetrics, 1(1), arxiv.org/abs/physics/0604130.
5 Popularity vs. prestige
rho = 0.61. Outliers reveal differences in aspects
of status. IF: general popularity. PR (PageRank):
prestige, influence.
Johan Bollen, Marko A. Rodriguez, and Herbert Van
de Sompel. Journal status. Scientometrics, 69(3),
December 2006 (doi:10.1007/s11192-006-0176-z)
Philip Ball. Prestige is factored into journal
ratings. Nature 439, 770-771, February 2006
(doi:10.1038/439770a)
6 Domain specific
7 A few issues
- On the basis of citation graph, many alternatives
- Normalized citation statistics
- Social network indicators (centrality)
- Semantic web hybrids
- Which one to choose?
- Semantics
- Validity: does the indicator express what it is intended to express?
- Reliability: sensitivity to changes in network structure?
- Another question: need they be based on citation data?
- Other means of expressing influence
- Readership, services requested, etc.
- Usage!
8 Usage data
- Citation data pertain to 4 levels in the scholarly communication process
- Community: authors of journal articles.
- Artifacts: journal articles.
- Data: citation data (1 year publication delay).
- Metrics: mean citation rate rules supreme.
- Scale: expensive to extract.
- However, for usage data
- Community: all users, including most authors.
- Artifacts: all that is accessible.
- Data: recorded upon publication.
- Metrics: a range of web and web 2.0 inspired metrics, e.g. clickstream and data mining.
- Scale: automatically recorded at point of service.
Hence, various initiatives focused on usage data:
COUNTER, IRS, SUSHI, CiteBase. But where are the
metrics?
9 Challenges to usage-based metrics.
- Usage data is here
- Routinely recorded by library, publisher and aggregator services
- Large-scale and longitudinal
- Highly detailed
- Requester
- Referent
- Sessions
- Service type
- Usage-based metrics have lagged in development. Here's why:
- Multiple communities
- Multiple collections (artifacts)
- Data: usage data limited to particular sub-communities and collections of artifacts.
- Metrics: various metrics studied. Different results because of sample, collection or metric definition? Aspects of scholarly status?
10 Our experience: divergence and convergence.
- Convergence!
- Is this guaranteed?
- To what? A common baseline?
- What we do know
- Institutional perspective can be contrasted to baseline.
- As aggregation increases in size, so does value.
- Cross-validation is key.
11 MESUR¹: Metrics from Scholarly Usage of Resources.
- Andrew W. Mellon Foundation funded study of usage-based metrics (2006-2008)
- Executed at the Digital Library Research and Prototyping Team, Los Alamos National Laboratory Research Library
- Objectives
- (1) Create a model of the scholarly communication process.
- (2) Create a large-scale reference data set (semantic network) that relates all relevant bibliographic, citation and usage data according to (1).
- (3) Characterize reference data set.
- (4) Survey usage-based metrics on basis of reference data set.
¹ Pronounced "measure".
12 The MESUR project.
Johan Bollen (LANL): Principal investigator
Herbert Van de Sompel (LANL): Architectural consultant
Marko Rodriguez (LANL): PhD student (Computer Science, UCSC)
Ryan Chute (LANL): Software development and database management
Lyudmila Balakireva (LANL): Database management and HCI
Aric Hagberg (LANL): Mathematical and statistical consultant
Luis Bettencourt (LANL): Mathematical and statistical consultant
The Andrew W. Mellon Foundation has awarded a
grant to Los Alamos National Laboratory (LANL) in
support of a two-year project that will
investigate metrics derived from the
network-based usage of scholarly information. The
Digital Library Research and Prototyping Team of
the LANL Research Library will carry out the
project. The project's major objective is
enriching the toolkit used for the assessment of
the impact of scholarly communication items, and
hence of scholars, with metrics that derive from
usage data.
13 Project data flow and work plan.
[Figure: project data flow diagram, with parts labeled 1 to 4]
14 Project timeline.
We are here!
15 Presentation structure: an update on the MESUR
project
- Usage data characterization
- Analysis
- Usage graphs
- Metrics analysis
- Results
- Discussion
16 Presentation structure: an update on the MESUR
project
- Usage data characterization
- Analysis
- Usage graphs
- Metrics analysis
- Results
- Discussion
17 A tapestry of usage data providers
- Each represents a different, and possibly overlapping, sample of the scholarly community.
- Institutions
- Institutional communities
- Many collections
- Aggregators
- Many communities
- Many collections
- Publishers
- Many communities
- Publisher collection
- Main players
- Individual institutions
- Link resolver data
- EZ proxy
- Aggregators
- Ad hoc formats
- COUNTER reports
- Publishers
- Ad hoc
- COUNTER reports
18 Negotiation results
- Data: > 1B usage events and 1B citations
- At this point, 247,083,481 usage events loaded
- Another 1,000,000,000 on the way
- Documents: > 50M documents
- Journals: 326,000
- Includes newspapers, magazines
- Professional magazines
- Obscure material
- Community: > 100M users and authors combined
19 Data acquired: timelines
- Span
- Majority -1 years
- Some minor 2002-2003 data
- Sharing models
- Historical data for period [t-x, t]
- Periodical updates
- Main issues
- Restoration of archives
- Digital preservation issues
- All data fields intact
- Integration of various sources of usage data
20 Data flow
http://www.mesur.org/schemas/2007-01/mesur/
21 An ontology of the scholarly communication
process?
22 Modeling the scholarly communication process: the
MESUR ontology.
- Basic concepts
- OWL RDF/XML representation
- Three basic notions: Documents, Agents and Contexts
- Context: n-ary relationship between documents and agents.
- Subclassed to Events and States to express an action, e.g. Uses, vs. a continuous state, e.g. hasImpact
- Previous efforts: the ScholOnto project, ABC ontology, VA Tech (Goncalves, 2002), Web Scholars, etc.
- Requirements
- Combined representation of usage data with bibliographic and citation data.
- Fine granularity
- Pragmatism in modeling
23 Modeling the scholarly communication process: the
MESUR ontology.
Examples:
An author (Agent) publishes (Context: Event) an article (Document)
A user (Agent) uses (Context: Event) a journal (Document)
http://www.mesur.org/schemas/2007-01/mesur/
Based on OntologyX5 framework developed by
Rights.com
Rodriguez, Bollen and Van de Sompel. A Practical
Ontology for the Large-Scale Modeling of
Scholarly Artifacts and their Usage. JCDL 2007
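The three basic notions can be sketched in a few lines of Python. This is only an illustration of the Document / Agent / Context idea, not the MESUR OWL schema: the field names (`identifier`, `agents`, `documents`, `kind`, `time`, `value`) and the use of a `kind` string instead of dedicated subclasses are assumptions made for brevity.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class Document:
    identifier: str  # hypothetical field, e.g. a journal or article id


@dataclass(frozen=True)
class Agent:
    identifier: str  # hypothetical field, e.g. an author or anonymized user id


@dataclass(frozen=True)
class Context:
    """N-ary relationship between agents and documents."""
    agents: Tuple[Agent, ...]
    documents: Tuple[Document, ...]


@dataclass(frozen=True)
class Event(Context):
    """A Context expressing an action, e.g. Uses or Publishes."""
    kind: str = "Uses"
    time: Optional[datetime] = None


@dataclass(frozen=True)
class State(Context):
    """A Context expressing a continuous state, e.g. hasImpact."""
    kind: str = "hasImpact"
    value: float = 0.0


# The two examples from the slide, with made-up identifiers.
author, user = Agent("author:1"), Agent("user:42")
article, journal = Document("article:abc"), Document("journal:xyz")
publishes = Event((author,), (article,), kind="Publishes",
                  time=datetime(2007, 1, 15))
uses = Event((user,), (journal,), kind="Uses",
             time=datetime(2007, 6, 1, 9, 30))
```

Because a Context holds tuples of agents and documents, the same structure can also carry relationships with several agents or documents at once.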
24 Presentation structure: an update on the MESUR
project
- Usage data characterization
- Analysis
- Usage graphs
- Metrics analysis
- Results
- Discussion
25 MESUR's usage data representation framework.
- Assumptions
- Sessions identify sequences of events (same user - same document)
- Documents tied to aggregate request objects
- Request objects consist of a series of service requests and the date and time at which each request took place.
- Implications
- Sequence preserved.
- Most usage data and statistics can be reconstructed from framework
- Lends itself to XML and RDF format
- Permits request type filtering
- Example
- COUNTER stats: aggregate request counts for each document (journal) grouped by date/time (month)
- Usage graph: overlay sessions with same document pairs
- Out of 13 MESUR providers so far, only 3 natively follow this model.
- The usage data of another 8 contains the necessary information for conversion
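A minimal sketch of why a request-level representation is enough to rebuild COUNTER-style statistics. The `Request` record and its field names are hypothetical, as are the sample log entries; the point is only that monthly counts and ordered sessions both fall out of the same data.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Request:
    session_id: str    # hypothetical field names throughout
    document_id: str
    service_type: str
    time: datetime


def counter_style_counts(requests, service_types=None):
    """Aggregate request counts per (document, year-month).

    If service_types is given, only those request types are counted.
    """
    counts = Counter()
    for r in requests:
        if service_types is not None and r.service_type not in service_types:
            continue
        counts[(r.document_id, r.time.strftime("%Y-%m"))] += 1
    return counts


def sessions_in_order(requests):
    """Group requests by session, preserving time order within a session."""
    sessions = {}
    for r in sorted(requests, key=lambda r: r.time):
        sessions.setdefault(r.session_id, []).append(r.document_id)
    return sessions


log = [
    Request("s1", "journal:A", "fulltext", datetime(2006, 3, 1, 10, 0)),
    Request("s1", "journal:B", "abstract", datetime(2006, 3, 1, 10, 5)),
    Request("s2", "journal:A", "fulltext", datetime(2006, 4, 2, 9, 0)),
    Request("s1", "journal:A", "fulltext", datetime(2006, 3, 1, 10, 9)),
]
monthly = counter_style_counts(log, service_types={"fulltext"})
sessions = sessions_in_order(log)
```

The reverse direction does not work: monthly counts alone cannot recover the order of requests within a session.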
26 Implications for structural analysis of usage data
- Sequence preservation allows
- Reconstruction of user behavior
- Usage graphs!
- Statistics do not allow this type of analysis BUT are useful for
- validating results
- rankings
27 How to generate a usage graph.
- Documents are associated by co-occurrence in same session
- Same session, same user: common interest
- Frequency of co-occurrence in same session: degree of relationship
- Normalized: conditional probability
- Usage data
- Works for journals and articles
- Anything for which usage was recorded
- Options
- Strict pair-wise sequence?
- All within session?
- Take distance into account?
- Note: not something we invented. Association rule learning in data mining. Beer and diapers!
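The co-occurrence idea can be sketched as follows. This is a toy illustration under stated assumptions, not the MESUR pipeline: sessions are plain lists of document ids, the function names are made up, and the "take distance into account" option is left out.

```python
from collections import defaultdict


def usage_graph(sessions, consecutive_only=True):
    """Count directed document pairs that co-occur within a session."""
    weights = defaultdict(int)
    for docs in sessions:
        for i, src in enumerate(docs):
            # Strict pair-wise sequence: only the next document.
            # Otherwise: every later document in the same session.
            later = docs[i + 1:i + 2] if consecutive_only else docs[i + 1:]
            for dst in later:
                if src != dst:
                    weights[(src, dst)] += 1
    return dict(weights)


def normalize(weights):
    """Turn raw counts into conditional probabilities P(dst | src)."""
    totals = defaultdict(int)
    for (src, _), w in weights.items():
        totals[src] += w
    return {(src, dst): w / totals[src] for (src, dst), w in weights.items()}


sessions = [["A", "B", "C"], ["A", "B"], ["B", "A", "C"]]
raw = usage_graph(sessions)                          # consecutive pairs only
wide = usage_graph(sessions, consecutive_only=False)  # all pairs in session
prob = normalize(raw)
```

With consecutive pairs, A is followed by B twice and by C once, so the normalized weight of the edge A to B is 2/3.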
28 Usage graphs
- MESUR graph created
- 200M usage events
- Usage restricted to 2006
- Journals clipped to the 7,600 journals in the 2004 JCR
- Pair-wise sequences
- Within session, only consecutive pairs
- Raw frequency weights
- Network analysis now on-going
- Network properties
- Clustering
29 Lay of the land: flow of information.
30 Presentation structure: an update on the MESUR
project
- Usage data characterization
- Analysis
- Usage graphs
- Metrics analysis
- Results
- Discussion
31 Metric types
- Note
- Metrics can be calculated both on citation and usage data
- Structural metrics require graphs
- Citation graph, e.g. 2004 JCR
- Usage graph, e.g. created by MESUR
32 Frequentist metrics
- Raw cites
- Count number of citations to document or journal
- Count number of times document or journal was accessed
- Normalized
- Journal Impact Factor
- Number of citations to journal
- Divided by number of articles published in journal
- Usage Impact Factor
- Number of requests for journal or article
- Divided by number of articles published in journal
Johan Bollen. Usage Impact Factor: the effects of
sample characteristics on usage-based impact
metrics. Journal of the American Society for
Information Science and Technology, 59(1), 2008
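Both normalized metrics above are the same ratio applied to different event counts. A minimal sketch, with a hypothetical function name and made-up numbers; the choice of time windows for counting events and articles is deliberately left to the caller.

```python
def mean_rate(event_count, article_count):
    """Events (citations or requests) per published article."""
    if article_count <= 0:
        raise ValueError("article_count must be positive")
    return event_count / article_count


# Made-up figures for one journal.
articles_published = 150
journal_if = mean_rate(event_count=450, article_count=articles_published)
usage_if = mean_rate(event_count=12000, article_count=articles_published)
```

Here the citation-based value is 3.0 and the usage-based value is 80.0; the two live on very different scales, which is one reason to compare rankings rather than raw values.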
33 Structural metrics calculated from usage graph
- Classes of metrics
- Degree
- Shortest path
- Random walk
- Distribution
- Degree
- In-degree
- Out-degree
- Shortest path
- Closeness
- Betweenness
- Newman
- Random walk
- PageRank
- Eigenvector
- Distribution
- In-degree entropy
- Out-degree entropy
- Bucket Entropy
Each can be defined to take weights into account,
e.g. by means of a weighted shortest path definition
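A small sketch of one degree metric and one distribution metric on a weighted, directed graph. The definition of in-degree entropy used here (Shannon entropy, in bits, of a node's incoming weight distribution) is an assumption; the slides do not spell out the exact formula. Edge weights are assumed to be positive.

```python
import math
from collections import defaultdict


def in_degree(edges):
    """Number of distinct sources pointing at each node."""
    sources = defaultdict(set)
    for (src, dst) in edges:
        sources[dst].add(src)
    return {node: len(s) for node, s in sources.items()}


def in_degree_entropy(edges):
    """Shannon entropy (bits) of each node's incoming weight distribution."""
    incoming = defaultdict(list)
    for (_, dst), w in edges.items():
        incoming[dst].append(w)
    result = {}
    for node, ws in incoming.items():
        total = sum(ws)
        result[node] = -sum((w / total) * math.log2(w / total) for w in ws)
    return result


# Toy graph: (source, target) -> weight
edges = {("A", "C"): 1, ("B", "C"): 1, ("A", "B"): 3, ("C", "B"): 1}
degree = in_degree(edges)
entropy = in_degree_entropy(edges)
```

Nodes B and C both have in-degree 2, but C's incoming weight is spread evenly (entropy of 1 bit) while B's is dominated by one source (lower entropy), which is the distinction a distribution metric is meant to capture.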
34 Social network metrics: different aspects of
impact I
Degree metrics
Degree centrality
In-degree/IF
Closeness centrality
Shortest path metrics
Betweenness centrality
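As an illustration of a shortest path metric, closeness centrality can be sketched with Dijkstra's algorithm. Several details are assumptions: closeness is taken as the number of reachable nodes divided by the sum of their distances, distances follow outgoing edges, and edge values are treated as lengths (for a graph whose weights express strength, one would first convert them, e.g. to 1/weight).

```python
import heapq


def shortest_distances(adj, source):
    """Dijkstra: shortest path length from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, length in adj.get(node, {}).items():
            nd = d + length
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return dist


def closeness(adj, node):
    """Reachable nodes divided by the sum of their distances from node."""
    dist = shortest_distances(adj, node)
    total = sum(d for n, d in dist.items() if n != node)
    return (len(dist) - 1) / total if total > 0 else 0.0


# Toy graph: node -> {neighbor: edge length}
adj = {"A": {"B": 1.0, "C": 1.0}, "B": {"C": 1.0}, "C": {"A": 1.0}}
scores = {n: closeness(adj, n) for n in adj}
```

In this toy graph A reaches both other nodes in one step, so it scores highest; B and C each need two steps to reach one of the others.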
35 Social network metrics: different aspects of
impact II
Random walk metrics, e.g. PageRank
- Basic idea
- Random walkers follow edges
- Probability of random teleportation
- Visitation numbers converge: PageRank
- Stationary probability distribution
From wikipedia.org
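The random walk idea can be sketched as a plain power iteration over a weighted edge list. This is a textbook-style sketch, not the implementation used in the project; the damping value of 0.85, the fixed iteration count and the uniform handling of nodes without outgoing edges are all assumptions.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration.

    edges maps (source, target) -> positive weight.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_total = {n: 0.0 for n in nodes}
    for (src, _), w in edges.items():
        out_total[src] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Walkers on nodes without outgoing edges teleport uniformly.
        dangling = sum(rank[n] for n in nodes if out_total[n] == 0)
        base = (1 - damping + damping * dangling) / len(nodes)
        new = {n: base for n in nodes}
        # Walkers follow edges in proportion to edge weight.
        for (src, dst), w in edges.items():
            new[dst] += damping * rank[src] * w / out_total[src]
        rank = new
    return rank


# Toy citation graph: A and B cite C, C cites A.
edges = {("A", "C"): 1, ("B", "C"): 1, ("C", "A"): 1}
ranks = pagerank(edges)
```

The ranks sum to 1 and can be read as the long-run share of visits each node receives. In the toy graph C ranks first because it is cited by both others, and A ranks above B because its single citation comes from the highly ranked C.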
36 Presentation structure: an update on the MESUR
project
- Usage data characterization
- Analysis
- Usage graphs
- Metrics analysis
- Results
- Discussion
37 Set of metrics calculated on MESUR data set
- List of metrics
- JCR 2004
- CITE-BE
- CITE-ID
- CITE-IE
- CITE-IF
- CITE-OD
- CITE-OE
- CITE-PG
- CITE-UBW
- CITE-UBW-UN
- CITE-UCL
- CITE-UCL-UN
- CITE-UNM
- CITE-UNM-UN
- CITE-UPG
- CITE-UPR
- CITE-WBW
- CITE-WBW-UN
- Usage-based metrics
- MESUR 2006
- USES-BE
- USES-ID
- USES-IE
- USES-OD
- USES-OE
- USES-PG
- USES-UBW
- USES-UBW-UN
- USES-UCL
- USES-UCL-UN
- USES-UNM
- USES-UNM-UN
- USES-UPG
- USES-UPR
- USES-WBW
- USES-WBW-UN
- USES-WCL
Usage graph creation: Wenzhong Zhao
Metrics: Marko Rodriguez and Aric Hagberg
38 Overlaps and discrepancies
Rankings and correlation structure will reveal
components of notion of scholarly impact across
citation and usage data.
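One way to compare two rankings of the same journals is a rank correlation. The sketch below computes Spearman's rho from scratch; choosing Spearman is an assumption (the slides do not name the coefficient used), and the metric values are made up.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Made-up metric values for four journals, in the same journal order.
citation_metric = [5.0, 3.0, 1.0, 4.0]
usage_metric = [50, 10, 20, 40]
rho = spearman(citation_metric, usage_metric)
```

Computing rho for every pair of metrics gives a correlation matrix, which is the kind of input a cluster analysis of metrics would start from.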
39 Citation rankings

2004 Impact Factor
rank  value     journal
1     49.794    CANCER
2     47.400    ANNU REV IMMUNOL
3     44.016    NEW ENGL J MED
4     33.456    ANNU REV BIOCHEM
5     31.694    NAT REV CANCER

Citation PageRank
rank  value     journal
1     0.0116    SCIENCE
2     0.0111    J BIOL CHEM
3     0.0108    NATURE
4     0.0101    PNAS
5     0.006     PHYS REV LETT

Betweenness
rank  value     journal
1     0.076     PNAS
2     0.072     SCIENCE
3     0.059     NATURE
4     0.039     LECT NOTES COMPUT SC
5     0.017     LANCET

Closeness
rank  value     journal
1     7.02e-05  PNAS
2     6.72e-05  LECT NOTES COMPUT SC
3     6.43e-05  NATURE
4     6.37e-05  SCIENCE
5     6.37e-05  J BIOL CHEM

In-degree
rank  value     journal
1     3448      SCIENCE
2     3182      NATURE
3     2913      PNAS
4     2190      LANCET
5     2160      NEW ENGL J MED

In-degree entropy
rank  value     journal
1     9.849     LANCET
2     9.748     SCIENCE
3     9.701     NEW ENGL J MED
4     9.611     NATURE
5     9.526     JAMA
40 Usage rankings

2004 Impact Factor
rank  value     journal
1     49.794    CANCER
2     47.400    ANNU REV IMMUNOL
3     44.016    NEW ENGL J MED
4     33.456    ANNU REV BIOCHEM
5     31.694    NAT REV CANCER

Betweenness
rank  value     journal
1     0.035     SCIENCE
2     0.032     NATURE
3     0.020     PNAS
4     0.017     LECT NOTES COMPUT SC
5     0.006     LANCET

PageRank
rank  value     journal
1     0.0016    SCIENCE
2     0.0015    NATURE
3     0.0013    PNAS
4     0.0010    LECT NOTES COMPUT SC
5     0.0008    J BIOL CHEM

In-degree
rank  value     journal
1     4195      SCIENCE
2     4019      NATURE
3     3562      PNAS
4     2438      J BIOL CHEM
5     2432      LECT NOTES COMPUT SC

In-degree entropy
rank  value     journal
1     9.364     MED HYPOTHESES
2     9.152     PNAS
3     9.027     LIFE SCI
4     8.939     LANCET
5     8.858     INT J BIOCHEM CELL B

Closeness
rank  value     journal
1     0.670     SCIENCE
2     0.665     NATURE
3     0.644     PNAS
4     0.591     LECT NOTES COMPUT SC
5     0.587     BIOCHEM BIOPH RES CO
41 Metrics relationship
42 Metrics relationships
- Citation and usage metrics reveal an entirely different pattern
- Citation is split into 2 sections
- Degree metrics (right)
- Shortest path and random walk (left)
- Usage is split into 4 clusters
- Degree metrics
- PageRank and entropy
- Closeness
- Betweenness
- Usage pattern can be caused by
- Noise in usage graph
- Higher density of usage/nodes
43 Hierarchical cluster analysis
Citation PageRank
Usage degree
Usage Closeness
Usage betweenness
Usage PageRank
Citation Degree
Citation closeness
Citation betweenness
Impact Factor
44 Presentation structure: an update on the MESUR
project
- Usage data characterization
- Analysis
- Usage graphs
- Metrics analysis
- Results
- Discussion
45 MESUR: an update
- Usage data
- Creation of single largest reference data set of usage, citation and bibliographic data
- 1,000,000,000 usage events loaded in next month
- Usage data obtained from multiple publishers, aggregators and institutions
- Infrastructure for a continued research program in this domain
- Results will guide scholarly evaluation and may help produce standards for usage data representation
- Usage graphs
- Adequate data model for item-level usage data naturally leads to this
- Reduced distortion compared to raw usage: structure counts, not raw hits
- Several options on how to create: MESUR investigates options
- Metrics
- Frequentist and structural metrics
- Each can represent different facets of scholarly impact
- Simple metrics can produce adequate results. Law of diminishing returns?
- Hybrid metrics based on triple store functionality
- Note: increasing convergence of usage metrics to citation metrics as sample increases.
46 Some relevant publications.
- Johan Bollen, Herbert Van de Sompel, and Marko A. Rodriguez. Towards usage-based impact metrics: first results from the MESUR project. In Proceedings of the Joint Conference on Digital Libraries, Pittsburgh, June 2008
- Marko A. Rodriguez, Johan Bollen and Herbert Van de Sompel. A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage. In Proceedings of the Joint Conference on Digital Libraries, Vancouver, June 2007
- Johan Bollen and Herbert Van de Sompel. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. (cs.DL/0610154)
- Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. In Joint Conference on Digital Libraries (JCDL2006), pages 298-307, June 2006.
- Johan Bollen and Herbert Van de Sompel. Mapping the structure of science through usage. Scientometrics, 69(2), 2006.
- Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006 (arxiv.org: cs.DL/0601030)
- Johan Bollen, Herbert Van de Sompel, Joan Smith, and Rick Luce. Toward alternative metrics of journal impact: a comparison of download and citation data. Information Processing and Management, 41(6):1419-1440, 2005.