Title: Bibliometric Impact Measures Leveraging Topic Analysis
1Bibliometric Impact Measures Leveraging Topic Analysis
- Gideon Mann
- David Mimno
- Andrew McCallum
- Computer Science Department
- University of Massachusetts Amherst
2Goal
Measure the impact of papers, and research
subfields.
Important for
- Researchers understanding their own field.
- Libraries deciding which journals to purchase.
- Personnel committees deciding on hiring,
promotion, awards.
3Typical Impact Measures
- Citation Count
- Garfield's Journal Impact Factor
4Why are topical divisions useful in bibliometrics?
Source: Journal Citation Reports (2004)
Citation counts
Biochemistry and molecular biology
- J. Biol. Chem: 405017
- Cell: 136472
- Biochem.-US: 96809
Mathematics
- Lect. Notes Math: 6926
- T. Am. Math. Soc: 6469
- J. Math. Anal. Appl.: 6004
Can you compare the tallest building in NY to the tallest building in Stamford, CT?
5Why are topical divisions useful in
bibliometrics?
6Why not use Journal as a proxy for Topic?
- Journals not necessarily about one topic.
- Topics may not have their own journal.
- Open access publishing on the rise
- 5 of the 200 most-cited papers in CiteSeer are tech reports!
- Spidered web documents often do not include venue information.
7This Paper
- Discovering fine-grained, interpretable topics from text
- 8 impact measures leveraging topics. Analysis on 1.5 million research papers and their citations.
- Where did we get all this data from?
Talk Outline
- Topical N-Grams: a phrase-discovering enhancement to LDA
- A quick tour of 8 impact measures, with examples
- An introduction to Rexa, a new sibling of CiteSeer, Google Scholar, etc.
8Clustering words into topics with Latent Dirichlet Allocation
Blei, Ng, Jordan 2003
Generative process (with example):
- For each document: sample a distribution over topics, θ (example: 70% Iraq war, 30% US election)
- For each word in doc:
- Sample a topic, z (example: Iraq war)
- Sample a word from the topic, w (example: bombing)
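The generative story on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two toy topics, their word probabilities, and the symmetric Dirichlet hyperparameter `alpha` are hypothetical and do not come from the talk.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    """Generate one document following the LDA generative story.

    topics: list of dicts mapping word -> probability (one dict per topic).
    alpha:  symmetric Dirichlet hyperparameter (an assumed value).
    """
    # Per-document distribution over topics
    theta = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(n_words):
        # Sample a topic z from theta
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Sample a word w from the chosen topic
        vocab = list(topics[z])
        w = rng.choices(vocab, weights=[topics[z][v] for v in vocab])[0]
        words.append((z, w))
    return theta, words


# Hypothetical toy topics echoing the slide's example
TOPICS = [
    {"bombing": 0.5, "troops": 0.3, "baghdad": 0.2},  # "Iraq war"
    {"ballot": 0.4, "candidate": 0.4, "poll": 0.2},   # "US election"
]

rng = random.Random(0)
theta, doc = generate_document(TOPICS, alpha=0.5, n_words=10, rng=rng)
```

Each entry of `doc` is a `(topic index, word)` pair, so the topic that produced every word stays visible.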
9Inference and Estimation
- Gibbs Sampling
- Easy to implement
- Reasonably fast
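To illustrate why Gibbs sampling counts as easy to implement, here is a minimal sketch of a collapsed Gibbs sampler for LDA. The slide does not say which sampler variant was used, so the collapsed form, the hyperparameters `alpha` and `beta`, the iteration count, and the toy corpus are all assumptions.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # times word w is assigned to topic k
    topic_total = [0] * n_topics                              # total words assigned to topic k
    assign = []

    def update(d, w, z, delta):
        doc_topic[d][z] += delta
        topic_word[z][w] += delta
        topic_total[z] += delta

    # Random initial topic assignment for every token
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, z in zip(doc, zs):
            update(d, w, z, +1)
        assign.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts
                update(d, w, assign[d][i], -1)
                # Unnormalized conditional probability of each topic for this token
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                update(d, w, z, +1)
    return assign, doc_topic, topic_word


# Hypothetical toy corpus: word ids 0-1 and 2-3 tend to co-occur
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
assign, doc_topic, topic_word = lda_gibbs(docs, n_topics=2, vocab_size=4)
```

Pure-Python loops like these are far too slow for millions of papers; the sketch only shows the shape of the update.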
10Example topics induced from a large collection of text
- JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
- SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
- BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
- FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
- STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
- MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
- DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
- WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
Tenenbaum et al
12Topics Modeling Multi-word Phrases
- Topics based only on unigrams are sometimes difficult to interpret.
- Topic discovery itself is confused because important meaning / distinctions are carried by phrases.
13Topical N-gram Model
Wang, McCallum 2005
[Figure: graphical model with nodes z1 ... z4, y1 ... y4, and w1 ... w4, and plates labeled D, T, and W; the Greek parameter labels were lost in extraction.]
14Features of Topical N-Grams model
- Easily trained by Gibbs sampling
- Can run efficiently on millions of words
- Topic-specific phrase discovery
- "white house" has special meaning as a phrase in the politics topic,
- ... but not in the real estate topic.
15A Topic Comparison
LDA: algorithms, algorithm, genetic, problems, efficient
Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function
16Topic Comparison
Columns: LDA, Topical N-grams (2), Topical N-grams (1)
Unigram list: policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies
Unigram list: learning, optimal, reinforcement, state, problems, policy, dynamic, action, programming, actions, function, markov, methods, decision, rl, continuous, spaces, step, policies, planning
Phrase list: reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning rl, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods
17Our Data for This Paper
- 1.6 million research papers
- mostly in Computer Science
- 400k of them with full text
- 14 fields of meta-data from
- headers at top of papers
- citations in References section
- automatically extracted with 99% accuracy.
- Reference resolution performed on 4 million
citations.
18Example Results on our Corpus
Step 1: Run LDA on 1.6 million papers. Use topic analysis to select a subset of AI: ML, NLP, robotics, vision, etc.
Step 2: Run Topical N-grams on the 300k papers in the subset.
[Figures: sample LDA topics; sample Topical N-gram topics]
19Each topic is now an intellectual domain that
includes some number of documents. We can
substitute topic for journal in most traditional
bibliometric indicators. We can also now define
several new indicators.
20Impact Measures Leveraging Topics
- Topical Citation count
- Topical Impact factor
- Topical Diffusion
- Topical Diversity
- Topical Half-life
- Topical Precedence
- Topical H-factor
- Topical Transfer
22Topical Citation Count
23Topical Citation Count
24Impact Factor
Journal Impact Factor: Citations from articles published in 2004 to articles in Cell published in 2002-3, divided by the number of articles published in Cell in 2002-3.
2004 Impact factors from JCR:
- Nature: 32.182
- Cell: 28.389
- JMLR: 5.952
- Machine Learning: 3.258
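The same ratio carries over to topics once each paper is assigned to a topic. The sketch below is a hedged illustration: the function name, the data layout (citation pairs plus a publication-year lookup), and the hard assignment of each paper to a single group are assumptions, not the authors' implementation.

```python
def impact_factor(citations, pub_year, group, year):
    """Two-year impact factor of a group of papers (a journal or a topic).

    citations: iterable of (citing_paper, cited_paper) pairs.
    pub_year:  dict mapping paper id -> publication year.
    group:     set of paper ids belonging to the journal or topic.
    year:      the citing year, e.g. 2004.
    """
    # Papers of the group published in the two preceding years
    targets = {p for p in group if pub_year[p] in (year - 1, year - 2)}
    if not targets:
        return 0.0
    # Citations made in `year` to those papers
    hits = sum(
        1
        for citing, cited in citations
        if pub_year[citing] == year and cited in targets
    )
    return hits / len(targets)


# Hypothetical toy data
pub_year = {"a": 2002, "b": 2003, "c": 2004, "d": 2004, "e": 2001}
topic_papers = {"a", "b", "e"}
citations = [("c", "a"), ("d", "a"), ("d", "b"), ("c", "e"), ("b", "a")]
result = impact_factor(citations, pub_year, topic_papers, 2004)
```

Here three citations made in 2004 reach the two papers from 2002-3, so `result` is 1.5.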
25Topical Impact Factor over time
26Impact Measures Leveraging Topics
- Topical Citation count
- Topical Impact factor
- Topical Diffusion
- Topical Diversity
- Topical Half-life
- Topical Precedence
- Topical H-factor
- Topical Transfer
27Broad Impact Diffusion
Journal Diffusion: # of journals citing Cell divided by the total number of citations to Cell, over a given time period, times 100.
Problem: Relatively brittle at low citation counts. If a topic/journal is cited only twice, by two different topics/journals, it will have high diffusion.
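A small sketch makes the brittleness concrete. The function name and the input layout (one citing-group label per citation) are assumptions for illustration only.

```python
def diffusion(citing_groups):
    """Distinct citing journals/topics per 100 citations.

    citing_groups: list with one entry per citation, naming the journal
    or topic of the citing paper.
    """
    if not citing_groups:
        return 0.0
    return 100.0 * len(set(citing_groups)) / len(citing_groups)


# Two citations from two different topics: maximal diffusion
low_count = diffusion(["Vision", "NLP"])
# One hundred citations from two topics: low diffusion
high_count = diffusion(["Vision"] * 90 + ["NLP"] * 10)
```

Both toy inputs are cited by the same two topics, yet the rarely cited one scores 100.0 and the heavily cited one scores 2.0.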
28Broad Impact Diversity
Topic Diversity: Entropy of the distribution of citing topics
Diffusion: These are just the least cited topics!
Diversity: Better at capturing broad end of impact spectrum
29Broad Impact Diversity, for papers
Topic Diversity: Entropy of the distribution of citing topics
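Topic diversity as defined above can be sketched as a Shannon entropy over citing-topic labels. The logarithm base is not stated in the talk; base 2 is an assumption here, as are the function name and the toy labels.

```python
import math
from collections import Counter


def diversity(citing_topics):
    """Shannon entropy (in bits) of the distribution of citing topics.

    citing_topics: list with one topic label per incoming citation.
    """
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


even_split = diversity(["Vision", "NLP"])
skewed = diversity(["Vision", "Vision", "NLP", "Robotics"])
```

The same function applies to a single paper or to a whole topic; only the set of incoming citations changes.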
30Impact Measures Leveraging Topics
- Topical Citation count
- Topical Impact factor
- Topical Diffusion
- Topical Diversity
- Topical Half-life
- Topical Precedence
- Topical H-factor
- Topical Transfer
31Topical Longevity: Cited Half Life
- Two views
- Given a paper, what is the median age of citations to that paper?
- What is the median age of citations from current literature?
Collaborative Filtering is young, fast moving. Maximum Entropy looks further back, but is still producing new work. Neural Networks literature is aging.
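Both views reduce to a median over citation ages. The sketch below assumes publication years are the only dates available; the function names and toy years are hypothetical.

```python
from statistics import median


def cited_half_life(cited_year, citing_years):
    """Median age of the citations a paper (or topic) receives.

    cited_year:   publication year of the cited paper.
    citing_years: publication years of the papers citing it.
    """
    return median(y - cited_year for y in citing_years)


def citing_half_life(current_year, referenced_years):
    """Median age of the references made by the current literature.

    referenced_years: publication years of the papers being cited.
    """
    return median(current_year - y for y in referenced_years)


received = cited_half_life(1995, [1996, 1997, 2001])
looking_back = citing_half_life(2004, [2003, 2000, 1990, 1998])
```

For a topic rather than a single paper, the age lists would be pooled over all papers assigned to that topic.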
32Topical Precedence
Early-ness
Within a topic, what are the earliest papers
that received more than n citations?
- Speech Recognition
- Some experiments on the recognition of speech, with one and two ears, E. Colin Cherry (1953)
- Spectrographic study of vowel reduction, B. Lindblom (1963)
- Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965)
- Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974)
- Automatic Recognition of Speakers from Their Voices, B. Atal (1976)
33Topical Precedence
Early-ness
Within a topic, what are the earliest papers
that received more than n citations?
- Information Retrieval
- On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)
- Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)
- Relevance feedback in information retrieval, Rocchio (1971)
- Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)
- New experiments in relevance feedback, Ide (1971)
- Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)
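The precedence query amounts to a filter followed by a sort. In this sketch the record layout, the parameter `k` for the number of papers returned, and the toy records are assumptions; the titles and counts are placeholders, not real data.

```python
def topical_precedence(papers, n, k=5):
    """Earliest papers in a topic that received more than n citations.

    papers: list of (title, year, citation_count) tuples for one topic.
    n:      citation threshold.
    k:      how many papers to return (an assumed parameter).
    """
    qualifying = [p for p in papers if p[2] > n]
    return sorted(qualifying, key=lambda p: p[1])[:k]


# Hypothetical placeholder records
papers = [
    ("paper A", 1953, 12),
    ("paper B", 1960, 3),
    ("paper C", 1965, 40),
    ("paper D", 1974, 25),
]
earliest = topical_precedence(papers, n=10, k=2)
```

With the threshold at 10, "paper B" is filtered out even though it is older than "paper C".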
34Impact Measures Leveraging Topics
- Topical Citation count
- Topical Impact factor
- Topical Diffusion
- Topical Diversity
- Topical Half-life
- Topical Precedence
- Topical H-factor
- Topical Transfer
35H-factor
H = maximum number K for which you have K papers, each with at least K citations.
...for journals: Braun et al, 2005
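The definition translates directly into code. In this minimal sketch the input is a plain list of citation counts, one per paper in the journal or topic; the function name is an assumption.

```python
def h_factor(citation_counts):
    """Largest K such that K papers each have at least K citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


example = h_factor([10, 8, 5, 4, 3])
```

In the example, four papers have at least four citations each, but there are not five papers with at least five, so the result is 4.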
36Topical H-factor
Year 1990
H-factor, Topic (topic ID)
- 12, Natural Language Parsing (16)
- 12, Neural Networks (173)
- 12, Speech Recognition (120)
- 11, Hidden Markov Models (21)
- 11, Genetic Algorithms (71)
- 11, Optical Flow (48)
- 10, Reinforcement Learning (83)
- 10, Computer Vision (49)
- 10, Mobile Robots (22)
- 9, Word Sense Disambiguation (118)
- 9, NLP (160)
- 8, Planning (35)
- 8, Markov Chain Monte Carlo (106)
- 8, Maximum Likelihood Estimators (40)
- 8, Genetic Algorithms (131)
- 7, Genetic Programming (61)
37Topical H-factor
Year 1995
H-factor, Topic (topic ID)
- 18, Computer Vision (49)
- 17, Speech Recognition (120)
- 15, Decision Trees (146)
- 15, Data Mining (176)
- 14, Hidden Markov Models (21)
- 14, Genetic Algorithms (71)
- 13, Markov Chain Monte Carlo (106)
- 13, IR And Queries (138)
- 12, Word Sense Disambiguation (118)
- 12, Web And VR (80)
- 12, Natural Language Parsing (16)
- 12, Bayesian Inference (110)
- 12, Reinforcement Learning (83)
- 12, Logic Programming (150)
- 12, Mobile Robots (22)
- 12, NLP (160)
38Topical H-factor
Year 2001
H-factor, Topic (topic ID)
- 15, Web Pages (129)
- 15, Ontologies (186)
- 13, SVMs (50)
- 13, Computer Vision (49)
- 13, Gene Expression (126)
- 13, Data Mining (176)
- 12, Dimensionality Reduction (29)
- 12, Question Answering (111)
- 12, Search Engines (132)
- 11, Natural Language Parsing (16)
- 11, Reinforcement Learning (83)
- 11, Web Services (184)
- 11, HCI (164)
- 10, Hidden Markov Models (21)
- 10, Word Sense Disambiguation (118)
- 10, IR And Queries (138)
39Impact Measures Leveraging Topics
- Topical Citation count
- Topical Impact factor
- Topical Diffusion
- Topical Diversity
- Topical Half-life
- Topical Precedence
- Topical H-factor
- Topical Transfer
40Topical Transfer
Transfer from Digital Libraries to other topics
Other topic Cits Paper Title
Web Pages 31 Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan,... 1999.
Computer Vision 14 On being Undigital with digital cameras extending the dynamic...
Video 12 Lessons learned from the creation and deployment of a terabyte digital video
Graphs 12 Trawling the Web for Emerging Cyber-Communities
Web Pages 11 WebBase a repository of Web pages
41Topical Transfer
Citation counts from one topic to another.
Map producers and consumers
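Counting citations from one topic to another can be sketched as a sparse matrix keyed by (citing topic, cited topic). The hard topic assignment per paper, the exclusion of within-topic citations, and the toy identifiers are assumptions made for illustration.

```python
from collections import Counter


def topical_transfer(citations, topic_of):
    """Count citations flowing from each citing topic to each cited topic.

    citations: iterable of (citing_paper, cited_paper) pairs.
    topic_of:  dict mapping paper id -> topic label.
    """
    matrix = Counter()
    for citing, cited in citations:
        src, dst = topic_of[citing], topic_of[cited]
        if src != dst:  # keep only cross-topic citations (an assumption)
            matrix[(src, dst)] += 1
    return matrix


# Hypothetical toy data
topic_of = {
    "p1": "Digital Libraries",
    "p2": "Web Pages",
    "p3": "Web Pages",
    "p4": "Computer Vision",
    "p5": "Digital Libraries",
}
citations = [("p2", "p1"), ("p3", "p1"), ("p4", "p1"), ("p5", "p1")]
transfer = topical_transfer(citations, topic_of)
```

Summing over the first key component for a fixed cited topic shows which topics consume its work; summing over the second shows which topics it draws on.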
42Where did the data come from?
http://rexa.info
43Rexa System Overview
WWW
NSF grant DB
Home-grown Java, MySQL (1m PDF/day)
Enhanced ps2text (better word stitching, plus layout in XML)
Conditional Random Fields (99% word accuracy)
Discriminatively trained graph partitioning (competition-winning accuracy)
44IE from Research Papers
McCallum et al 99
@article{kaelbling96reinforcement,
  author = "Leslie Pack Kaelbling and Michael L. Littman and Andrew P. Moore",
  title = "Reinforcement Learning: A Survey",
  journal = "Journal of Artificial Intelligence Research",
  volume = "4",
  pages = "237-285",
  year = "1996",
}
45(Linear Chain) Conditional Random Fields
Lafferty, McCallum, Pereira 2001
Undirected graphical model, trained to
maximize conditional probability of output
sequence given input sequence
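For reference, the conditional probability being maximized is usually written in the following standard linear-chain form. The feature functions f_k, the weights λ_k, and the normalizer Z(x) are conventional notation and do not appear on the slide.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\left( \sum_{t} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \right),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'} \exp\!\left( \sum_{t} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right)
```

Training chooses the weights λ_k to maximize this probability summed (in log form) over the labeled training sequences.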
[Figure: finite state model and graphical model views. Output sequence y (FSM states) at positions t-1, t, t+1, t+2, t+3; observations x (input sequence) at the same positions.]
Example input seq: said Jones a Microsoft VP
Example output seq: OTHER PERSON OTHER ORG TITLE
46IE from Research Papers
Field-level F1
- Hidden Markov Models (HMMs): 75.6 (Seymore, McCallum, Rosenfeld, 1999)
- Support Vector Machines (SVMs): 89.7 (Han, Giles, et al, 2003)
- Conditional Random Fields (CRFs): 93.9 (Peng, McCallum, 2004)
Δ error: 40%
(Word-level accuracy is >99%)
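The error figure of about 40 on this slide is consistent with the relative error reduction from SVMs to CRFs, if error is taken to be 100 minus field-level F1. That reading is an assumption, since the slide does not define the quantity.

```latex
\frac{(100 - 89.7) - (100 - 93.9)}{100 - 89.7}
  = \frac{10.3 - 6.1}{10.3}
  = \frac{4.2}{10.3}
  \approx 0.41
```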
47Previous Systems
49Previous Systems
Cites
Research Paper
50More Entities and Relations
Expertise
Cites
Research Paper
Person
Grant
University
Venue
Groups
71Neural Information Processing Conference Dataset
Volumes 0-12, spanning 1987-1999. Prepared by Sam Roweis.
- 1740 Papers
- 13649 Unique words
- 2,301,375 Words
72Trends in 17 years of NIPS proceedings
73Topic Distributions Conditioned on Time
topic mass (in vertical height)
time
74Finding Topics in 1 million CS papers
200 topics and keywords automatically discovered.
75Topic Correlations in PAM
5000 research paper abstracts, from across all CS
Numbers on edges are supertopics' Dirichlet parameters