A Comparison of On-line Computer Science Citation Databases

1
A Comparison of On-line Computer Science Citation Databases
  • Vaclav Petricek, Ingemar J. Cox, Hui Han,
  • Isaac G. Councill, C. Lee Giles
  • v.petricek@cs.ucl.ac.uk
  • http://www.cs.ucl.ac.uk/staff/V.Petricek

2
Motivation
  • Autonomous databases have advantages over manually
    constructed ones
  • Easier maintenance
  • Lower cost
  • Is it really an equivalent solution that is just
    cheaper?
  • Does the automated acquisition introduce any bias?

3
Talk Overview
  • Datasets
  • Acquisition bias and models
  • CS Citation Distribution
  • Conclusions
  • Future Work

4
Datasets - DBLP
  • DBLP has been operated by Michael Ley since 1994 [8].
    It currently contains over 550,000 computer
    science references from around 368,000 authors.
  • Each entry is manually inserted by a group of
    volunteers and occasionally hired students. The
    entries are obtained from conference proceedings
    and journals.

5
Datasets - CiteSeer
  • CiteSeer was created by Steve Lawrence and C. Lee
    Giles in 1997. It currently contains over 716,797
    documents.
  • In contrast, each entry in CiteSeer is
    automatically entered from an analysis of
    documents found on the Web.

6
Datasets - Publication year
  • CiteSeer
  • DBLP
  • Declining CiteSeer maintenance
  • Increased DBLP funding

7
Author bias
  • CiteSeer papers have higher average number of
    authors
  • Both databases show growing team sizes

8
Author bias
  • Crossover for low number of authors
  • CiteSeer has higher proportion of multiauthor
    papers than DBLP
  • (for number of authors < 4)

9
Author bias
  • Papers with higher number of authors are more
    likely to be included in CiteSeer
  • Hypothesis
  • Crawler suffers from acquisition bias due to
  • Submission
  • Crawling

10
Models - CiteSeer
  • CiteSeer Submission model
  • Probability of a document being submitted grows
    with number of authors
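  • Each author submits a publication with probability β
  • Probabilities independent for coauthors
  • $\mathrm{citeseer}_s(i) = \left(1 - (1-\beta)^i\right)\,\mathrm{all}(i)$,
    where $i$ is the number of authors and $\mathrm{all}(i)$ is the
    number of all publications with $i$ authors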
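
The submission model can be sketched numerically. This is a minimal sketch, assuming each of the coauthors independently submits the paper with the same probability β; the function name `inclusion_probability` and the value β = 0.2 are illustrative assumptions, not values from the talk.

```python
def inclusion_probability(num_authors: int, beta: float) -> float:
    """Probability that at least one coauthor submits the paper.

    Assumes each of `num_authors` coauthors submits independently
    with probability `beta` (the submission model on this slide).
    """
    if num_authors < 1:
        raise ValueError("a paper needs at least one author")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be a probability")
    # 1 - P(no coauthor submits)
    return 1.0 - (1.0 - beta) ** num_authors


if __name__ == "__main__":
    beta = 0.2  # illustrative value only, not an estimate from the talk
    for i in range(1, 6):
        print(i, round(inclusion_probability(i, beta), 3))
```

The probability rises with every additional author and saturates at 1, which is the kind of bias against papers with few authors that the model is meant to capture.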

11
Models - CiteSeer
  • CiteSeer crawler model
  • Probability of crawling a document grows with
    number of its online copies
  • Probability of a document being online grows with
    number of authors
  • Probabilities independent between authors
  • Each author puts the publication online with probability δ
  • An online copy is found by the crawler with probability γ
  • $\mathrm{citeseer}_c(i) = \left(1 - (1-\gamma\delta)^i\right)\,\mathrm{all}(i)$
  • Both models result in equivalent type of bias
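
The equivalence of the two models can be checked directly: the crawler model has the same functional form as the submission model once the per-author success probability is taken to be the product of the two crawler probabilities. The sketch below assumes that reading; the function names and the numeric values are hypothetical.

```python
def submission_model(num_authors: int, beta: float) -> float:
    """Inclusion probability when each author submits with probability beta."""
    return 1.0 - (1.0 - beta) ** num_authors


def crawler_model(num_authors: int, found: float, online: float) -> float:
    """Inclusion probability when each author puts a copy online with
    probability `online` and the crawler finds a given copy with
    probability `found`, all independently."""
    return 1.0 - (1.0 - found * online) ** num_authors


if __name__ == "__main__":
    found, online = 0.5, 0.4  # illustrative values, not from the talk
    for i in range(1, 6):
        # Same curve as the submission model with beta = found * online.
        print(i, crawler_model(i, found, online),
              submission_model(i, found * online))
```

Because only the product of the two crawler probabilities enters the formula, author counts alone cannot separate the two mechanisms.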

12
Coverage
  • Can we estimate the coverage of DBLP?
  • Can we estimate the coverage of CiteSeer?
  • Can we estimate the coverage of CS literature?
  • We need a model of DBLP acquisition method

13
Models - DBLP
  • DBLP model
  • Publication included in DBLP with probability α
  • α is a parameter reflecting DBLP coverage of CS
    literature
  • $\mathrm{dblp}(i) = \alpha\,\mathrm{all}(i)$

14
Coverage
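  • $\mathrm{citeseer}(i) = \left(1 - (1-\beta)^i\right)\,\mathrm{all}(i)$
  • $\mathrm{dblp}(i) = \alpha\,\mathrm{all}(i)$
  • $r(i) = \mathrm{dblp}(i) / \mathrm{citeseer}(i)$
  • $r(i) = \alpha / \left(1 - (1-\beta)^i\right)$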
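
In the ratio, the unknown number of all publications cancels, so the two parameters can be estimated from the observed DBLP/CiteSeer ratio per author count. The slides do not say how the fit was done; the sketch below uses a plain grid search over least-squares error as an assumed stand-in, and `fit_ratio` and the synthetic data are hypothetical.

```python
def ratio(num_authors: int, alpha: float, beta: float) -> float:
    """Model ratio r(i) = alpha / (1 - (1 - beta)^i)."""
    return alpha / (1.0 - (1.0 - beta) ** num_authors)


def fit_ratio(observed: dict, grid_steps: int = 100) -> tuple:
    """Grid-search (alpha, beta) in (0, 1] minimising squared error.

    `observed` maps an author count i to the observed ratio
    dblp(i) / citeseer(i). This is an assumed fitting procedure,
    not necessarily the one used in the talk.
    """
    best = None
    for a_step in range(1, grid_steps + 1):
        alpha = a_step / grid_steps
        for b_step in range(1, grid_steps + 1):
            beta = b_step / grid_steps
            sse = sum((ratio(i, alpha, beta) - r) ** 2
                      for i, r in observed.items())
            if best is None or sse < best[0]:
                best = (sse, alpha, beta)
    return best[1], best[2]


if __name__ == "__main__":
    # Synthetic ratios generated from known parameters, for illustration.
    synthetic = {i: ratio(i, 0.3, 0.25) for i in range(1, 9)}
    print(fit_ratio(synthetic))
```

For large author counts the denominator approaches 1, so the ratio flattens out at the DBLP coverage parameter.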

15
Results
  • $r(i) = \alpha / \left(1 - (1-\beta)^i\right)$
  • $\alpha \approx 0.3$
  • DBLP covers approx. 30% of CS literature
  • CiteSeer covers approx. 40%
  • CS literature: approx. 2M publications
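
These figures can be roughly reproduced from the database sizes quoted on the dataset slides. The arithmetic below is a back-of-the-envelope check, assuming those rounded sizes apply; the dataset snapshots actually analysed may differ, and `estimate_sizes` is a hypothetical helper.

```python
def estimate_sizes(dblp_entries: int, citeseer_documents: int,
                   alpha: float) -> tuple:
    """Return (estimated total CS publications, CiteSeer coverage).

    Assumes DBLP holds a fraction `alpha` of all CS publications, so
    the total is dblp_entries / alpha.
    """
    total = dblp_entries / alpha
    citeseer_coverage = citeseer_documents / total
    return total, citeseer_coverage


if __name__ == "__main__":
    # Sizes as quoted on the dataset slides; alpha from this slide.
    total, coverage = estimate_sizes(550_000, 716_797, alpha=0.3)
    print(f"total CS publications: about {total / 1e6:.1f}M")
    print(f"CiteSeer coverage: about {coverage:.0%}")
```

With these inputs the total comes out a little under 2M and the CiteSeer coverage close to 40%, in line with the rounded values on the slide.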

16
Citation distribution

17
Citation distribution
  • Citation distributions have been studied before
  • They follow a power law
  • Redner, Laherrere et al., Lehmann and others
  • Mostly in the physics community
  • We use a subset of CiteSeer and DBLP papers that
    have citation information

18
Citation distribution
  • Power law
  • Sparse data for high number of citations

19
Citation distribution
  • Exponential binning
  • Data aggregated in exponentially increasing
    bins
  • Equivalent to constant bins on a logarithmic
    scale
  • Easier interpolation
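
Exponential binning can be sketched as follows. This is a minimal sketch assuming bins that double in width (1, 2-3, 4-7, ...); the slides do not state the bin base, and `exponential_bins` is a hypothetical name.

```python
from collections import Counter


def exponential_bins(citation_counts):
    """Group papers into bins [1,2), [2,4), [4,8), ... by citation count.

    Returns a list of (bin_start, bin_end, papers_per_unit_width).
    Dividing by the bin width keeps wide and narrow bins comparable.
    Papers with zero citations are skipped because they have no
    position on a logarithmic axis (an assumption of this sketch).
    """
    per_bin = Counter()
    for c in citation_counts:
        if c >= 1:
            # bit_length() - 1 equals floor(log2(c)) exactly for ints
            per_bin[c.bit_length() - 1] += 1
    result = []
    for k in sorted(per_bin):
        start, end = 2 ** k, 2 ** (k + 1)
        result.append((start, end, per_bin[k] / (end - start)))
    return result


if __name__ == "__main__":
    for row in exponential_bins([0, 1, 1, 2, 3, 4, 7, 8]):
        print(row)
```

Each bin has the same width on a logarithmic axis, so the sparse tail of highly cited papers is pooled into a few well-populated bins.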

20
Citation distribution
Slope of the citation distribution

  citations   Lehmann   DBLP     CiteSeer
  < 50        -1.29     -1.876   -1.504
  > 50        -2.32     -3.509   -3.074
  • Distribution of citations more uneven in CS than
    in Physics
  • Significant differences between DBLP and CiteSeer
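
Slopes like those in the table can be estimated by fitting a straight line to the distribution on log-log axes, separately below and above 50 citations. The slides do not describe the fitting procedure, so the ordinary least-squares fit below is an assumption, and `loglog_slope` and `piecewise_slopes` are hypothetical names.

```python
import math


def loglog_slope(points):
    """Least-squares slope of log10(y) against log10(x).

    `points` is a sequence of (x, y) pairs with x > 0 and y > 0.
    """
    xs = [math.log10(x) for x, _ in points]
    ys = [math.log10(y) for _, y in points]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def piecewise_slopes(points, split=50):
    """Fit separate slopes for x < split and x > split."""
    low = [p for p in points if p[0] < split]
    high = [p for p in points if p[0] > split]
    return loglog_slope(low), loglog_slope(high)


if __name__ == "__main__":
    # Synthetic two-regime power law, for illustration only.
    data = [(x, x ** -1.5) for x in (1, 2, 4, 8, 16, 32)]
    data += [(x, 64 ** 1.5 * x ** -3.0) for x in (64, 128, 256)]
    print(piecewise_slopes(data))
```

A steeper (more negative) slope means the number of papers falls off faster as the citation count grows.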

21
Citation distribution
  • CiteSeer contains fewer low-cited papers than
    DBLP
  • No model yet
  • Lawrence, Online or invisible? [12]

22
Conclusions - authors
  • CiteSeer and DBLP have very different acquisition
    methods
  • Significant bias against papers with low number
    of authors (less than 4) in CiteSeer.
  • Single author papers appear to be disadvantaged
    with regard to the CiteSeer acquisition method.
  • Two probabilistic models for paper acquisition in
    CiteSeer result in the same type of bias
  • Crawler model
  • Submission model

23
Conclusions - coverage
  • Simple model of DBLP coverage predicts coverage
    of approx. 30% of the entire Computer Science
    literature.
  • This gives us CiteSeer coverage of approx. 40%
    and a total number of CS papers of around 2M

24
Conclusions - citations
  • CiteSeer and DBLP citation distributions are
    different
  • Both indicate that highly cited papers in
    Computer Science receive a larger citation share
    than in Physics.
  • CiteSeer contains fewer low-cited papers

25
Future Work
  • Repeat experiments on most recent CiteSeer data
  • Other methods to estimate computer science
    literature size and trends
  • Overlap of CiteSeer and DBLP
  • Bias introduced by bibliography parsing
  • Collaborative network analysis
  • Connection to internet surveys?

26
Thank you

27
References
  • [1] arXiv e-print archive, http://arxiv.org/.
  • [2] Compuscience database, http://www.zblmath.fiz-karlsruhe.de/COMP/quick.htm.
  • [3] CoRR, http://xxx.lanl.gov/archive/cs/.
  • [4] CS BibTeX database, http://liinwww.ira.uka.de/bibliography/.
  • [5] DBLP, http://dblp.uni-trier.de/.
  • [6] Scientific citation index, http://www.isinet.com/products/citation/sci/.
  • [7] SPIRES high energy physics literature database, http://www.slac.stanford.edu/spires/hep/.
  • [8] ScienceDirect digital library, http://www.sciencedirect.com, 2003.
  • [9] P. Bailey, N. Craswell, and D. Hawking. Dark matter on the web. In Poster Proceedings of 9th International World Wide Web Conference. ACM Press, 2000.
  • [10] M. Batty. Citation geography: It's about location. The Scientist, 17(16), 2003.
  • [11] M. Batty. The geography of scientific citation. Environment and Planning A, 35:761-770, 2003.
  • [12] S. Lawrence. Online or invisible? Nature, 411(6837):521, 2001.
