Online Autonomous Citation Management for CiteSeer - PowerPoint PPT Presentation

Slides: 15
Provided by: bhuvanur
Learn more at: https://www.cse.psu.edu
Transcript and Presenter's Notes

1
Online Autonomous Citation Management for CiteSeer
  • CSE598B Course Project
  • By Huajing Li

2
Introduction to CiteSeer
  • Software package developed at NEC-Labs
  • Domain Independent Software for Automatic
    Citation Indexing (ACI)
  • Focus is on scholarly publications in electronic
    format (PS / PDF and variants)
  • Performs
    • Document Discovery / Retrieval / Parsing
    • Automatic Citation Extraction
    • Document & Citation Indexing / Search

4
[Figure: CiteSeer processing pipeline]
  • Stages: Crawler -> Document URL -> Retrieval -> Document (PDF/PS) ->
    Conversion -> Document (Plain Text) -> Parsing / Meta-Data Extraction ->
    Indexing
  • Stores: Meta-Data Database (PDBM_File, Chunk Tables), Indexes,
    Document Database (File System)
  • Per-document output: Document Meta-data Set (DID, Title, Authors, etc.),
    Document Body Text, N Citation Texts, N Citation Meta-data Sets (CID,
    GID, Title, Authors, etc.)
  • Other labels: Web Server, C, D

5
Submitting Documents
  • Output of a crawl or user submission is the URL of a page
    linking to the document
  • These URLs are stored in the Paper Table
  • The Paper Table maintains the status of each document
    • Downloaded / undownloaded
    • Processed / unprocessed
    • Other processing errors (tooshort / noreference / etc.)
  • CiteSeer regularly scans this table to start
    downloading new documents
  • Only documents matching the typical pattern of
    scholarly publications are eventually added to
    the collection
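
The Paper Table workflow above can be sketched as a small status-tracking structure. This is only an illustration under stated assumptions, not CiteSeer's actual schema: the `PaperTable` class, the `Status` names, the `scan_for_new` method and the example URLs are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # Hypothetical status names modelled on the slide's list.
    UNDOWNLOADED = "undownloaded"
    DOWNLOADED = "downloaded"
    PROCESSED = "processed"
    TOO_SHORT = "tooshort"
    NO_REFERENCE = "noreference"


@dataclass
class PaperTable:
    """Toy stand-in for the Paper Table: maps a URL to its status."""
    rows: dict = field(default_factory=dict)

    def submit(self, url: str) -> None:
        # Crawl output and user submissions both land here; a URL that is
        # already known keeps its current status.
        self.rows.setdefault(url, Status.UNDOWNLOADED)

    def scan_for_new(self) -> list:
        # The periodic scan picks up every URL not yet downloaded.
        return [u for u, s in self.rows.items() if s is Status.UNDOWNLOADED]

    def mark(self, url: str, status: Status) -> None:
        self.rows[url] = status


table = PaperTable()
table.submit("http://example.org/papers/a.pdf")
table.submit("http://example.org/papers/b.ps")
table.mark("http://example.org/papers/a.pdf", Status.DOWNLOADED)
pending = table.scan_for_new()  # only b.ps is still waiting for download
```

Keeping one status per URL means a resubmitted URL is not downloaded twice, which is one plausible reason for routing both crawl output and user submissions through the same table.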

6
Document Structure Identification
  • Title
  • Subject (keywords)
  • Description (abstract)
  • Author names
  • Author affiliations
  • Author address, email, phone, Homepage URL
  • Publication date, Publication number
  • Archive date
  • Contributor
  • Type
  • Format
  • Identifier
  • Source
  • Publisher
  • Journal/Conference
  • Pages
  • Relation
  • References
  • Is Referenced By

7
Citations Grouping
  • Citations to the same document share a common Group ID
  • Each Group ID has a set of keys associated with it,
    based on citation information
    • authorkey1-titlekey, authorkey2-titlekey
  • There is one authorkey for every word in the
    author information
  • For a given citation, the titlekey is unique and is the
    concatenation of all title words
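
The key scheme above can be sketched in a few lines. This is a minimal illustration, not CiteSeer's code: `make_keys` is a hypothetical name, and lowercasing plus dropping single-letter initials are assumptions inferred from the deck's own example (authors "C. Lee Giles, S. Lawrence", title "Good Paper Title").

```python
import re


def make_keys(authors: str, title: str) -> set:
    """Build the authorkey-titlekey strings for one citation."""
    # titlekey: all title words, lowercased and concatenated.
    titlekey = "".join(re.findall(r"\w+", title.lower()))
    # authorkeys: one per word in the author string. Single-letter tokens
    # (initials such as "C." or "S.") are skipped; this is an assumption.
    words = re.findall(r"[^\W\d_]+", authors.lower())
    return {f"{w}-{titlekey}" for w in words if len(w) > 1}


keys = make_keys("C. Lee Giles, S. Lawrence", "Good Paper Title")
```

Because every author word yields its own key, two citations that agree on the title and on any one author word will share at least one key, even when the author lists are formatted differently.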

8
Citations Grouping
  • For a newly discovered citation
    • Extract
      • Authors: C. Lee Giles, S. Lawrence
      • Title: Good Paper Title
    • Generate keys: giles-goodpapertitle,
      lee-goodpapertitle, lawrence-goodpapertitle
    • Try to match at least one of them with an existing
      Group ID key
    • If there is a match, add this citation (Citation
      ID) to the group
    • Otherwise create a new Group ID for this citation
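
The match-or-create step above can be sketched as an in-memory index from keys to Group IDs. This is a sketch under assumptions, not the CiteSeer implementation: `GroupIndex` and `add_citation` are hypothetical names, the keys are taken as already generated, and the slide does not say what happens when a citation's keys hit more than one group (here the first hit in sorted key order wins).

```python
from itertools import count


class GroupIndex:
    """Maps authorkey-titlekey strings to Group IDs (toy, in-memory)."""

    def __init__(self):
        self.key_to_gid = {}       # key string -> Group ID
        self.members = {}          # Group ID -> list of Citation IDs
        self._next_gid = count(1)  # source of fresh Group IDs

    def add_citation(self, cid, keys):
        # Try to match at least one key against an existing group's keys.
        gid = next(
            (self.key_to_gid[k] for k in sorted(keys) if k in self.key_to_gid),
            None,
        )
        if gid is None:
            gid = next(self._next_gid)  # no match: open a new group
            self.members[gid] = []
        self.members[gid].append(cid)
        # Register this citation's keys so later citations can match them
        # (assumption: the slide does not state whether keys accumulate).
        for k in keys:
            self.key_to_gid.setdefault(k, gid)
        return gid


index = GroupIndex()
g1 = index.add_citation("c1", {"giles-goodpapertitle", "lee-goodpapertitle",
                               "lawrence-goodpapertitle"})
g2 = index.add_citation("c2", {"lawrence-goodpapertitle"})  # joins c1's group
g3 = index.add_citation("c3", {"smith-otherpaper"})         # new group
```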

9
Linking Citations to Documents
  • Citation ID -> Group ID
    • We just saw that
  • Document ID -> Group ID
    • Based on the document's metadata, generate
      authorkey-titlekey in the same way and try to
      match a Group ID key generated from the citations
    • Document metadata can be erroneous, so successful
      mapping often happens AFTER correction by users

10
Problems of the Current Approach
  • There is no guarantee that the most similar
    citation contains the best metadata
  • Building citation graph is a time-intensive,
    offline task
  • Due to batch clustering, the addition of a single
    citation requires rebuilding the entire citation
    graph to include the new instance
  • The so-called canonical metadata is fixed to the
    document record

11
Goals of the New Citation Management System
  • Provide better document metadata
  • Reduce the cost of maintenance
  • Use on-line citation matching such that the
    citation graph environment can be adjusted
    immediately based on a single new citation
  • Provide a fluid framework for building canonical
    metadata in which all evidence is always
    considered
  • Allow the development of flexible APIs into
    CiteSeer citation graph system
  • Maintain data security despite an open, wiki-like
    approach to user-contributed metadata changes
  • Provide better citation matching compared to the
    current system

12
Prototype Overview
13
Edge DB
  • One simple table containing one edge per row
    • id: citation handle (equivalent to CID)
    • citingDoc: citing document handle
    • citedDoc: cited document handle
  • Row-level locking
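
The edge table above might look roughly like the following. This is a sketch using Python's built-in `sqlite3` module purely for illustration: the deck does not name the database engine, the column types, the index and the handles are assumptions, and SQLite locks the whole database rather than individual rows, so an engine with row-level locking would be needed to meet that last requirement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE edge (
        id        TEXT PRIMARY KEY,  -- citation handle (equivalent to CID)
        citingDoc TEXT NOT NULL,     -- citing document handle
        citedDoc  TEXT NOT NULL      -- cited document handle
    )
    """
)
# Assumed index so "who cites X" lookups do not scan the whole table.
conn.execute("CREATE INDEX edge_cited ON edge (citedDoc)")

# Hypothetical handles, for illustration only.
conn.executemany(
    "INSERT INTO edge (id, citingDoc, citedDoc) VALUES (?, ?, ?)",
    [("c1", "docA", "docB"), ("c2", "docC", "docB"), ("c3", "docA", "docC")],
)

# All documents that cite docB.
citing = [
    row[0]
    for row in conn.execute(
        "SELECT citingDoc FROM edge WHERE citedDoc = ? ORDER BY citingDoc",
        ("docB",),
    )
]
```

With one edge per row, adding a single new citation is a single insert, which fits the goal of adjusting the citation graph without a batch rebuild.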

14
Matching Citations and Docs
  • Exact string matching across disparate metadata
    fields is far too optimistic; better matching
    criteria are needed
  • Lucene provides two methods out of the box
    • Match based on Levenshtein distance
      • Specify an arbitrary distance cut-off per field
      • Choose the most similar match out of the returned set
    • Cut out the middleman: similarity-based matching
      • Specify an arbitrary similarity threshold
      • Choose the most similar match out of the returned
        set over the threshold
  • Criteria to be determined through empirical tests
    using the prototype system
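
The threshold idea above can be illustrated with a plain-Python stand-in. This is not Lucene's API: `levenshtein`, `similarity` and `best_match` are hypothetical helpers, normalising by the longer string's length is one choice among several, and the 0.8 threshold in the usage lines is arbitrary, since the deck leaves the criteria to empirical tests.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalise edit distance to [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_match(query: str, candidates: list, threshold: float):
    """Return the most similar candidate at or above threshold, else None."""
    scored = [(similarity(query.lower(), c.lower()), c) for c in candidates]
    scored = [sc for sc in scored if sc[0] >= threshold]
    return max(scored, key=lambda sc: sc[0])[1] if scored else None


titles = ["Good Paper Title", "Another Title"]
hit = best_match("Good Papr Title", titles, threshold=0.8)
miss = best_match("Unrelated Work", titles, threshold=0.8)
```

Here the misspelt query still resolves to "Good Paper Title", while the unrelated query returns `None` instead of being forced onto its nearest neighbour.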