Refined Online Citation Matching and Adaptive Canonical Metadata Construction - PowerPoint PPT Presentation

Refined Online Citation Matching and Adaptive Canonical Metadata Construction CSE 598B Course Project Report Huajing Li – PowerPoint PPT presentation

Slides: 20
Provided by: Huaj150
Learn more at: https://www.cse.psu.edu
1
Refined Online Citation Matching and Adaptive
Canonical Metadata Construction
  • CSE 598B Course Project Report
  • Huajing Li

2
Outline
  • Introduction
  • Matching Citations and Documents
  • Learning from Observations
  • Cluster Repair
  • Evaluation

3
Introduction
  • In research repositories, citations represent
    important knowledge regarding work contexts.
  • The citation relationships form a data structure
    generally known as a citation graph, where
    documents are vertices and citations are directed
    edges between citing and cited documents.
  • Methods for constructing citation graphs:
    • Manual information extraction
    • Autonomous citation indexing (ACI)
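The citation graph described on this slide can be sketched as a pair of adjacency maps. This is only an illustrative sketch: the class name `CitationGraph`, its methods, and the paper identifiers are hypothetical and do not come from the presentation.

```python
from collections import defaultdict


class CitationGraph:
    """Directed graph: an edge u -> v means document u cites document v."""

    def __init__(self):
        self.cites = defaultdict(set)      # citing document -> cited documents
        self.cited_by = defaultdict(set)   # cited document -> citing documents

    def add_citation(self, citing, cited):
        # Store the edge in both directions so either lookup is cheap.
        self.cites[citing].add(cited)
        self.cited_by[cited].add(citing)

    def citation_count(self, doc):
        # Number of incoming edges; .get avoids creating empty entries.
        return len(self.cited_by.get(doc, ()))


graph = CitationGraph()
graph.add_citation("paper_A", "paper_C")
graph.add_citation("paper_B", "paper_C")
graph.add_citation("paper_C", "paper_D")
print(graph.citation_count("paper_C"))
```

Keeping the reverse map (`cited_by`) makes "who cites this document" queries, such as a most-cited list, a direct lookup.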

4
Introduction
  • Popular ACI systems
    • The CiteSeer Digital Library (a collection of
      over 725,000 documents with over 8 million
      citations)
    • Google Scholar (433 million document and citation
      records)
  • Typical ACI process
    • Extract citations from research papers
    • Parse subfields to build accurate metadata for
      each citation
    • Link citations to documents

5
Introduction
  • Typical problems in the ACI process
    • Citation parsers are error-prone and often
      produce noisy results
    • Errors in the citation text (such as typos)
    • Identity uncertainty in document matching
      • For an automatic DL system, the identity of
        documents is uncertain (the canonical metadata of
        a document can be incomplete or inaccurate)
      • In such cases, citation metadata can be used to
        correct the canonical metadata of documents

6
Our Research Goals
  • Provide better document metadata
  • Reduce the cost of maintenance
  • Allow the development of flexible APIs into
    the CiteSeer citation graph system
  • Maintain data security despite an open, wiki-like
    approach to user-contributed metadata changes
  • Provide better citation matching compared to the
    current system

7
Matching Citations and Documents
  • Current offline approach
    • Citations are grouped according to their
      extracted metadata
    • Each citation group is linked to a real document,
      which may already exist inside the ACI system or
      may not yet have been collected

8
Matching Citations and Documents
  • Remember that citations are themselves documents.
  • Treating citations and documents differently
    brings a lot of unnecessary complications into
    the system.
  • Citations pointing to a document in the ACI
    system can be represented by the document's
    identity.
  • To represent a cited document that is not yet in
    the system, we use the notion of a virtual
    document, which takes on the extracted metadata
    of the citation.

9
Matching Citations and Documents
  • Once the document enters the system, the
    corresponding virtual record is updated with
    a pointer to the document file, making it a
    real document record.
  • There are no citation edges pointing to an
    external unknown resource. All edges are
    internal to the document database, and real and
    virtual documents can be searched in the same
    index space.
  • We use Lucene to match documents online.
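The virtual-document idea above can be sketched as a single record type whose file pointer is optional. This is a minimal sketch under stated assumptions: `DocumentRecord`, `DocumentStore`, and the exact-title lookup in `find` are hypothetical stand-ins, and the lookup in particular replaces the Lucene index search mentioned on the slide.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class DocumentRecord:
    doc_id: int
    metadata: Dict[str, str]
    file_path: Optional[str] = None  # None means the record is virtual

    @property
    def is_virtual(self) -> bool:
        return self.file_path is None


class DocumentStore:
    def __init__(self):
        self.records: Dict[int, DocumentRecord] = {}
        self.edges: Set[Tuple[int, int]] = set()  # (citing_id, cited_id)
        self._next_id = 1

    def find(self, metadata):
        # Assumption: exact match on a normalised title stands in for
        # the index-based matching used by the real system.
        key = metadata["title"].strip().lower()
        for rec in self.records.values():
            if rec.metadata["title"].strip().lower() == key:
                return rec
        return None

    def _get_or_create(self, metadata):
        rec = self.find(metadata)
        if rec is None:
            rec = DocumentRecord(self._next_id, dict(metadata))
            self.records[rec.doc_id] = rec
            self._next_id += 1
        return rec

    def add_document(self, metadata, file_path, citations: List[Dict[str, str]]):
        rec = self._get_or_create(metadata)
        rec.file_path = file_path  # promotes a virtual record to a real one
        for cited in citations:
            target = self._get_or_create(cited)  # virtual if not yet collected
            self.edges.add((rec.doc_id, target.doc_id))
        return rec


store = DocumentStore()
a = store.add_document({"title": "Paper A"}, "a.pdf", [{"title": "Paper B"}])
b = store.find({"title": "paper b"})
was_virtual = b.is_virtual
b_real = store.add_document({"title": "Paper B"}, "b.pdf", [])
```

Because a cited paper always has a record, every edge stays inside the store, and the edge from Paper A survives unchanged when Paper B is later collected.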

10
Learning from Observations
  • The problem is to generate beliefs about the
    identity of a document based on observational
    evidence.
  • Records may be linked with many information
    sources:
    • Extracted document metadata
    • Extracted citations
    • External records (from DBLP, ACM)
    • User corrections
  • We focus on metadata elements with small
    variability in their correct representations,
    such as names, titles, and dates.

11
Learning from Observations
  • We use a Bayesian belief network to construct
    canonical metadata.
  • Decide the canonical value X? from all
    observations on X.
  • Each network BEL(X) develops a degree of belief
    in each possible value of X, and X? is chosen as
    the value with the largest belief score.
  • Given a prior belief vector BEL(x), BEL(x) can be
    updated with a new observation ox? using only a
    local computation.
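The slide does not spell out the update rule. The sketch below assumes the usual product-and-normalise step for combining a prior belief vector with an incoming evidence message; the function names `update_belief` and `canonical_index` are hypothetical.

```python
def update_belief(belief, message):
    """Combine a prior belief vector with one evidence message.

    Assumption: element-wise product followed by renormalisation,
    which needs only the current vector and the new message.
    """
    if len(belief) != len(message):
        raise ValueError("belief and message must have the same length")
    unnormalised = [b * m for b, m in zip(belief, message)]
    total = sum(unnormalised)
    if total == 0:
        raise ValueError("message is incompatible with the prior belief")
    return [u / total for u in unnormalised]


def canonical_index(belief):
    """Index of the candidate value with the largest belief score."""
    return max(range(len(belief)), key=belief.__getitem__)


prior = [0.25, 0.25, 0.25, 0.25]        # no preference among four candidates
message = [0.1, 0.1, 0.7, 0.1]          # evidence favouring candidate 2
posterior = update_belief(prior, message)
print(posterior, canonical_index(posterior))
```

Since each update touches only one belief vector, observations can be folded in one at a time as they arrive.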

12
Learning from Observations
  • An example
  • An example observation vector o?(x) may be (0, 0,
    1, 0), indicating that o?(2) is the observed
    value for x.
  • This vector must then be adjusted based on our
    confidence in the observation. This is achieved
    using a confidence matrix.
  • Assigning C = 0.7 to o? results in an actual
    message of (0.1, 0.1, 0.7, 0.1) sent to X.
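The arithmetic in this example can be reproduced with a small sketch. It assumes the confidence matrix has C on the diagonal and spreads the remaining 1 - C evenly over the other values, which is consistent with the numbers on the slide but is not stated there; the function names are hypothetical.

```python
def confidence_matrix(n, c):
    """n x n matrix with c on the diagonal and (1 - c) / (n - 1) elsewhere.

    Assumption: the leftover probability mass is shared uniformly
    among the values that were not observed.
    """
    if n < 2:
        raise ValueError("need at least two candidate values")
    off = (1.0 - c) / (n - 1)
    return [[c if i == j else off for j in range(n)] for i in range(n)]


def message_from_observation(observation, c):
    """Multiply the confidence matrix by a one-hot observation vector."""
    n = len(observation)
    matrix = confidence_matrix(n, c)
    return [sum(matrix[i][j] * observation[j] for j in range(n))
            for i in range(n)]


msg = message_from_observation([0, 0, 1, 0], 0.7)
print([round(v, 3) for v in msg])
```

With four candidate values and C = 0.7, the remaining 0.3 is divided among three values, giving 0.1 each.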

13
Cluster Repair
  • Adjusting metadata dynamically in response to new
    evidence can lead to inconsistencies in citation
    groups.
  • repairCluster(R)
    • Find matching citations M for R
    • For each citation C in GR
      • If C is not contained in M
        • Add C to REVOKE
    • Set GR ← M
    • Reset belief vectors
    • For each citation C in GR
      • If C is not contained in REVOKE
        • Update belief vectors using C
    • If metadata changes
      • repairCluster(R)

14
Cluster Repair
  • Voting privilege
  • To prevent unbounded iterations, once a citation
    C1 is removed from GR, it can return to GR but it
    cannot influence metadata belief vectors for the
    remainder of the repairCluster iterations.
  • At the end of a repairCluster call stack, the
    non-voting citations regain voting privileges.
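The repairCluster procedure and the voting-privilege rule can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the record layout, the `find_matches` and `rebuild_metadata` callbacks, the `difflib` similarity (a stand-in for the index query), the majority vote (a stand-in for the belief-vector update), and the sample titles are all assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher


def repair_cluster(record, find_matches, rebuild_metadata, revoked=None):
    """record: {'metadata': dict, 'group': set of citation ids}."""
    if revoked is None:
        revoked = set()  # fresh set per top-level call: privileges restored
    matches = set(find_matches(record["metadata"]))
    revoked |= record["group"] - matches   # dropped citations lose their vote
    record["group"] = matches              # Set GR <- M
    voters = record["group"] - revoked
    new_metadata = rebuild_metadata(record, voters)  # reset and re-update
    if new_metadata != record["metadata"]:
        record["metadata"] = new_metadata
        repair_cluster(record, find_matches, rebuild_metadata, revoked)
    return record


# Hypothetical sample data: citation id -> extracted title.
citations = {
    1: "Learning metadata from evidence",
    2: "Learning metadata from evidence",
    3: "Learning metadata from evidense",
    4: "Unrelated work on parsing",
}


def find_matches(metadata):
    title = metadata["title"].lower()
    return {cid for cid, t in citations.items()
            if SequenceMatcher(None, title, t.lower()).ratio() >= 0.9}


def rebuild_metadata(record, voters):
    if not voters:
        return record["metadata"]
    votes = Counter(citations[cid] for cid in sorted(voters))
    return {"title": votes.most_common(1)[0][0]}


record = {"metadata": {"title": "Learning metadata from evidense"},
          "group": {1, 2, 3, 4}}
repair_cluster(record, find_matches, rebuild_metadata)
print(record)
```

Passing the same `revoked` set down the recursion is what keeps a removed citation from voting again within one call stack, while a later top-level call starts with an empty set.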

15
Evaluation
  • Ten frequently referenced document records were
    selected from the top of CiteSeer's most-cited
    document list, along with all corresponding
    citations.
  • 9,121 citations were used in the final test set.
  • The data set was run through a noise generation
    program to deliberately add noise to the
    citation records:
    • Randomly insert a word into the title.
    • Randomly delete a word from the title.
    • Randomly insert an author name.
    • Randomly delete an author name.
    • Randomly misspell a word in the title.
    • Randomly misspell an author name.
    • Introduce mistakes in the publication year
      attribute.
  • A parameter for each category of noise controls
    the probability with which it occurs, varying
    from 0 to 1. A noise rate of 0 means the original
    citation text is used without any intentional
    modification. A noise rate of 1 means that type
    of noise is certain to occur.
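A noise generator of the kind described above might look like the sketch below. The presentation does not show the actual program, so the rate keys, the filler vocabulary and names, the single-character misspelling, and the size of the year error are all assumptions made for illustration.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def misspell(word, rng):
    """Replace one character with a different lowercase letter."""
    if not word:
        return word
    i = rng.randrange(len(word))
    choices = [ch for ch in ALPHABET if ch != word[i].lower()]
    return word[:i] + rng.choice(choices) + word[i + 1:]


def add_noise(citation, rates, rng,
              vocabulary=("data", "system", "model"),   # hypothetical fillers
              names=("J. Doe", "A. Smith")):            # hypothetical authors
    title = citation["title"].split()
    authors = list(citation["authors"])
    year = citation["year"]

    def fires(kind):
        # rate 0 never fires; rate 1 always fires (random() is in [0, 1)).
        return rng.random() < rates.get(kind, 0.0)

    if fires("insert_title_word"):
        title.insert(rng.randrange(len(title) + 1), rng.choice(vocabulary))
    if fires("delete_title_word") and len(title) > 1:
        del title[rng.randrange(len(title))]
    if fires("insert_author"):
        authors.insert(rng.randrange(len(authors) + 1), rng.choice(names))
    if fires("delete_author") and len(authors) > 1:
        del authors[rng.randrange(len(authors))]
    if fires("misspell_title_word") and title:
        i = rng.randrange(len(title))
        title[i] = misspell(title[i], rng)
    if fires("misspell_author") and authors:
        i = rng.randrange(len(authors))
        authors[i] = misspell(authors[i], rng)
    if fires("year_error"):
        year += rng.choice([-2, -1, 1, 2])
    return {"title": " ".join(title), "authors": authors, "year": year}


clean = {"title": "Citation matching at scale",
         "authors": ["H. Li"], "year": 2006}
rng = random.Random(0)
untouched = add_noise(clean, {}, rng)
wrong_year = add_noise(clean, {"year_error": 1.0}, rng)
```

Passing an explicit `random.Random` instance keeps a noisy test set reproducible across runs.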

16
Index-Based Citation Clustering
  • Lucene's fuzzy query is used to match
    citations to documents. We vary the similarity
    threshold to observe the precision and recall.
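The threshold sweep can be illustrated with a small sketch. It does not call Lucene: Python's `difflib.SequenceMatcher` is swapped in as a stand-in similarity measure, and the titles and ground-truth labels are made-up sample data, so the numbers it prints say nothing about the actual evaluation.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in for the fuzzy-query score; returns a value in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def precision_recall(doc_title, citations, threshold):
    """citations: list of (citation_title, truly_cites_doc) pairs."""
    matched = [truth for title, truth in citations
               if similarity(doc_title, title) >= threshold]
    true_pos = sum(matched)
    relevant = sum(truth for _, truth in citations)
    # Convention: precision is 1.0 when nothing is matched.
    precision = true_pos / len(matched) if matched else 1.0
    recall = true_pos / relevant if relevant else 1.0
    return precision, recall


doc = "Learning metadata from the evidence"
sample = [
    ("Learning metadata from the evidence", True),
    ("Learning metdata from evidence", True),
    ("Learning to rank from evidence", False),
    ("A survey of wireless sensor networks", False),
]

for threshold in (0.0, 0.5, 0.7, 0.9, 1.0):
    p, r = precision_recall(doc, sample, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold can only shrink the matched set, so recall never increases as the threshold goes up, which is the trade-off the sweep is meant to expose.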

17
Index-Based Citation Clustering
  • Noise is introduced into the citation data to
    test the capability of the matching algorithm to
    handle inaccurate inputs.

18
Metadata Determination and Cluster Repair
  • Confidence in the document metadata was
    arbitrarily set at 0.8, and confidence in
    citation data was set at 0.5.
  • The cluster repair algorithm was then used to
    iteratively query the citation index and repair
    the document's metadata until convergence.
  • Only title, author, and year metadata were tested
    for accuracy.

19
Metadata Determination and Cluster Repair