Detecting Research Topics via the Correlation between Graphs and Texts

1
Detecting Research Topics via the Correlation
between Graphs and Texts
  • Yookyung Jo
  • Dept. of Computer Science, Cornell University
  • Carl Lagoze, and C. Lee Giles
  • Dept. of Computer Science, Cornell University,
    Information Sciences and Technology, The
    Pennsylvania State University

2
Acknowledgment
  • John E. Hopcroft
  • Thorsten Joachims
  • Simeon Warner
  • Isaac G. Councill
  • NSF IIS-0430906, 0227648, 0227888, and 0424671

3
Topic detection
  • Problem statement: how to detect topics in a linked corpus (e.g.
    Citeseer, arXiv, the Web)
  • Our strategy: use the correlation between
      • the distribution of terms representing a topic
      • the distribution of citation links

4
Correlation between Terms and Links
[Figure: two term citation graphs side by side, one for a term a that represents a topic and one for a term that does not]
Term a represents a topic (e.g. sensor network, or association rule)
The other term does not represent a topic (e.g. six months, or practical examples)
5
Term citation graph for a term a
[Figure: the term citation graph for a; every node is labeled a]
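
The figure for this slide did not survive extraction, so the following is only a minimal sketch of how a term citation graph might be built. It assumes the graph is the citation graph restricted to the documents that contain the term, which is what the all-a node labels suggest. The names `term_citation_graph`, `connectivity_stats`, `doc_terms`, and `citations` are hypothetical, and the five-document corpus is toy data, not taken from the talk.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]


def term_citation_graph(
    doc_terms: Dict[str, Set[str]],
    citations: Iterable[Edge],
    term: str,
) -> Tuple[Set[str], Set[Edge]]:
    """Restrict the citation graph to documents containing `term`.

    Assumption: a citation link is kept only if both of its endpoints
    contain the term.
    """
    nodes = {doc for doc, terms in doc_terms.items() if term in terms}
    edges = {
        (src, dst)
        for src, dst in citations
        if src in nodes and dst in nodes and src != dst
    }
    return nodes, edges


def connectivity_stats(nodes: Set[str], edges: Set[Edge]) -> Tuple[int, int, int]:
    """Return (n, n_c, E): nodes, nodes with at least one link inside the graph, edges."""
    connected = {doc for edge in edges for doc in edge}
    return len(nodes), len(connected), len(edges)


# Toy corpus (hypothetical): document id -> set of bi-gram terms it contains.
doc_terms = {
    "d1": {"sensor network", "we show"},
    "d2": {"sensor network"},
    "d3": {"sensor network", "we show"},
    "d4": {"we show"},
    "d5": {"we show"},
}
# Toy citation links as (citing document, cited document).
citations = [("d2", "d1"), ("d3", "d1"), ("d3", "d2"), ("d4", "d1")]

for term in ("sensor network", "we show"):
    nodes, edges = term_citation_graph(doc_terms, citations, term)
    print(term, connectivity_stats(nodes, edges))
```

In this toy corpus every document containing "sensor network" has a link to another such document, while one of the "we show" documents is isolated. That difference in connectivity is the kind of signal the talk relies on.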
7
Detecting a topic via a single term
  • Given a term A,
  • Make a binary decision on whether A represents a topic
    or not
  • H1: A represents a topic
  • H0: A does not represent a topic
  • G_A: the term citation graph for A
  • O(G_A): the link connectivity observation on G_A
  • Finally, produce a ranked list of terms

8
Log-likelihood of H1
  • Observation O(G_A)
      • For each node i in G_A, is it connected to other
        nodes in G_A by at least one link?
      • p_c: the probability that a node is connected in this sense
  • Under H1
      • p_c1: the estimate of p_c
      • p_c1 is set to a value close to 1 (e.g. p_c1 = 0.9)

9
Log-likelihood of H0
  • p_c0: the estimate of p_c under H0
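
The formulas on the two log-likelihood slides were lost in extraction, so the sketch below is an assumption rather than the authors' exact definition. It treats each node's connectivity observation as an independent Bernoulli trial with success probability p_c, which gives a log-likelihood of n_c log p_c + (n - n_c) log(1 - p_c), and it scores a term by the difference between the H1 and H0 log-likelihoods. How p_c0 is estimated is not recoverable from the slides, so it is passed in as a parameter. The names `log_likelihood`, `topic_score`, and `rank_terms` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def log_likelihood(n: int, n_c: int, p_c: float) -> float:
    """Log-likelihood of seeing n_c connected nodes out of n.

    Assumption: node observations are independent Bernoulli(p_c) trials.
    """
    if not 0.0 < p_c < 1.0:
        raise ValueError("p_c must lie strictly between 0 and 1")
    return n_c * math.log(p_c) + (n - n_c) * math.log(1.0 - p_c)


def topic_score(n: int, n_c: int, p_c0: float, p_c1: float = 0.9) -> float:
    """Log-likelihood of H1 minus log-likelihood of H0 (assumed scoring rule)."""
    return log_likelihood(n, n_c, p_c1) - log_likelihood(n, n_c, p_c0)


def rank_terms(
    stats: Dict[str, Tuple[int, int]], p_c0: float, p_c1: float = 0.9
) -> List[Tuple[str, float]]:
    """Sort terms by descending score; `stats` maps term -> (n, n_c)."""
    scored = [
        (term, topic_score(n, n_c, p_c0, p_c1)) for term, (n, n_c) in stats.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# (n, n_c) for the first two terms are taken from the arXiv top-rank table in
# this talk; the third entry is made up to illustrate a poorly connected term.
stats = {
    "black hole": (4978, 4701),
    "hubbard model": (1702, 1167),
    "made-up phrase": (1000, 50),
}

# p_c0 = 0.3 is a placeholder, not a value from the talk. Using one p_c0 for
# every term is also a simplification.
for term, score in rank_terms(stats, p_c0=0.3):
    print(f"{term}: {score:.1f}")
```

Under these assumptions a term whose documents are almost all linked to one another gets a large positive score, and a term whose documents are mostly isolated gets a negative one, which matches the ranked-list output described on the earlier slide.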
10
Evaluation
  • arXiv
      • A physics literature collection
      • Years 1991-2006, 7 major arXiv areas
      • 214,546 papers, 2,165,170 citation links
      • Abstract as document
      • 137,098 bi-gram terms after low-frequency pruning
  • Citeseer
      • A computer science related collection
      • Years 1994-2004
      • 716,771 papers, 1,740,326 citation links
      • Abstract + title as document
      • 631,839 bi-gram terms after low-frequency pruning
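
The slide says only that the candidate terms are bi-grams that survive a low-frequency prune. The following minimal sketch shows one way such a vocabulary could be built. The tokenizer, the use of document frequency, and the `min_doc_freq` threshold are all assumptions; the talk does not state them. The names `bigrams` and `bigram_vocabulary` are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Set


def bigrams(text: str) -> List[str]:
    """Lowercase the text and return adjacent word pairs.

    Hyphenated words such as "real-time" are kept as single tokens.
    """
    tokens = re.findall(r"[a-z0-9][a-z0-9\-]*", text.lower())
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


def bigram_vocabulary(documents: Iterable[str], min_doc_freq: int = 2) -> Set[str]:
    """Keep bi-grams that occur in at least `min_doc_freq` documents."""
    doc_freq: Counter = Counter()
    for text in documents:
        # Count each bi-gram once per document.
        doc_freq.update(set(bigrams(text)))
    return {term for term, freq in doc_freq.items() if freq >= min_doc_freq}


# Toy abstracts (hypothetical), standing in for arXiv or Citeseer documents.
documents = [
    "We study sensor networks.",
    "Sensor networks save energy.",
    "We study black holes.",
]
print(sorted(bigram_vocabulary(documents)))
```

Note that a frequency prune alone keeps stop-phrases such as "we study". Separating those from topic terms is left to the connectivity-based ranking.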

11
arXiv (physics) topic terms at top ranks
rank  topic (term)            <n, n_c, E>
1     Black hole              <4978, 4701, 38952>
2     Quantum hall            <1863, 1493, 4862>
3     Black holes             <3131, 2896, 22824>
4     Higgs boson             <2079, 1896, 12607>
5     Renormalization group   <3738, 2920, 8490>
6     Quantum gravity         <2014, 1724, 9693>
7     Standard model          <7848, 7145, 53829>
8     Heavy quark             <1671, 1473, 6570>
9     Cosmological constant   <2141, 1815, 7134>
10    Quantum dot             <1366, 1031, 2926>
11    Chiral perturbation     <1132, 1050, 5578>
12    Form factors            <1578, 1354, 5616>
13    Lattice qcd             <1425, 1265, 5240>
14    String theory           <3818, 3539, 26250>
15    Hubbard model           <1702, 1167, 2678>
n: number of nodes in G_A
n_c: number of nodes with at least one connection within G_A
E: number of edges in G_A
12
arXiv (physics): term citation graphs for intermediate-rank topic terms
[Figure: term citation graphs; labels: time, research communities]
13
arXiv (physics) terms at bottommost ranks
rank    term
137098  we show
137097  has been
137096  we find
137095  we present
137094  we study
137093  we have
137092  we also
137091  have been
137090  we discuss
137089  we consider
137088  does not
137087  our results
137086  we investigate
137085  into account
137084  we propose
Bottom entries are stop-phrases
14
Citeseer (CS) top-rank terms
  • Top-rank terms from two different time periods: up to 1999, and since 2000
rank  topic (term) up to 1999     topic (term) since 2000
1     logic programs              sensor networks
2     model checking              hoc networks
3     semidefinite programming    logic programs
4     inductive logic             image retrieval
5     petri nets                  support vector
6     genetic programming         congestion control
7     interior point              model checking
8     kolmogorov complexity       decision diagrams
9     automatic differentiation   wireless sensor
10    complementarity problems    ad hoc
11    congestion control          intrusion detection
12    complementarity problem     vector machines
13    conservation laws           mobile ad
14    linear logic                binary decision
15    timed automata              sensor network
16    situation calculus          energy consumption
17    real-time database          content-based image
18    motion planning             semantic web
19    duration calculus           fading channels
20    volume rendering            xml data
21    chain monte                 source separation
22    association rules           timed automata
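
As a small illustration of how the two period rankings can be compared for trend analysis, the sketch below takes the two top-22 lists from this slide and splits them into terms present in both periods, terms only in the earlier list, and terms only in the later list. The function `compare_rankings` and the labels "persisting", "fading", and "emerging" are hypothetical names introduced here. They describe membership in these top-22 lists only, not the full rankings.

```python
from typing import Dict, List

# Top-22 Citeseer terms per period, as listed on this slide.
up_to_1999 = [
    "logic programs", "model checking", "semidefinite programming",
    "inductive logic", "petri nets", "genetic programming", "interior point",
    "kolmogorov complexity", "automatic differentiation",
    "complementarity problems", "congestion control",
    "complementarity problem", "conservation laws", "linear logic",
    "timed automata", "situation calculus", "real-time database",
    "motion planning", "duration calculus", "volume rendering",
    "chain monte", "association rules",
]
since_2000 = [
    "sensor networks", "hoc networks", "logic programs", "image retrieval",
    "support vector", "congestion control", "model checking",
    "decision diagrams", "wireless sensor", "ad hoc", "intrusion detection",
    "vector machines", "mobile ad", "binary decision", "sensor network",
    "energy consumption", "content-based image", "semantic web",
    "fading channels", "xml data", "source separation", "timed automata",
]


def compare_rankings(earlier: List[str], later: List[str]) -> Dict[str, List[str]]:
    """Split two ranked lists by membership, preserving rank order."""
    earlier_set, later_set = set(earlier), set(later)
    return {
        "persisting": [term for term in earlier if term in later_set],
        "fading": [term for term in earlier if term not in later_set],
        "emerging": [term for term in later if term not in earlier_set],
    }


changes = compare_rankings(up_to_1999, since_2000)
for label, terms in changes.items():
    print(f"{label} ({len(terms)}): {', '.join(terms)}")
```

Four terms appear in both lists (logic programs, model checking, congestion control, timed automata), while terms such as sensor networks and semantic web appear only in the later period.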

15
Citeseer: topic time evolution
[Figure: time evolution plots for sensor networks, logic programs, support vector, congestion control]
16
Citeseer: topic time evolution
[Figure: time evolution plots for petri nets, association rules, genetic programming, semantic web]
17
Algorithm Extension
  • To detect topics represented by a single term
      • Algorithm
      • Evaluation on arXiv, Citeseer
  • To detect topics defined by a set of terms
      • Algorithm
      • Evaluation on arXiv

18
Conclusion (poster session 7)
  • Topic detection via the correlation between terms
    and links
  • Our algorithm (in its evaluation on arXiv,
    Citeseer)
      • Effectively discovers topics represented by a
        single term or by a set of terms
      • Identifies stop-phrases as a by-product
      • Discovers topics in their natural scale
      • Demonstrates its utility in trend analysis
      • Shows the association between topic scale and
        specificity