Disambiguation Problems in Digital Libraries

1
Disambiguation Problems in Digital Libraries
  • Tan Yee Fan
  • 2006 August 11
  • WING Group Meeting

2
Introduction
  • Bibliographic digital libraries
    • DBLP, CiteSeer, ACM Portal, ...
  • Metadata records
    • Authors, title, venue, year, ...
  • Inconsistencies and errors
    • Typographical errors
    • Abbreviations
    • Different entities sharing the same name

3
(No Transcript)
4
Problem formulation
  • General disambiguation problem
    • Given a list of data items X
    • Find a function d : X × X → {0, 1} such that
      • d(x1, x2) = 1 if x1 and x2 match
      • d(x1, x2) = 0 otherwise
  • The matching relation is not necessarily transitive
    • d(ab, bc) = 1 and d(bc, cd) = 1, but d(ab, cd) = 0
  • If it is transitive, the problem is clustering/classification
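
The non-transitivity example above can be reproduced with a toy matching function. This is only an illustrative sketch: the rule "two strings match if they share at least one character" is an assumption chosen to make the ab/bc/cd example work, not a rule stated in the slides.

```python
def d(x1: str, x2: str) -> int:
    """Toy matching function: 1 if the strings share a character, else 0.

    The sharing rule is a hypothetical choice made for illustration only.
    """
    return 1 if set(x1) & set(x2) else 0


# "ab" matches "bc" (shared "b") and "bc" matches "cd" (shared "c"),
# but "ab" and "cd" share nothing, so the relation is not transitive.
print(d("ab", "bc"), d("bc", "cd"), d("ab", "cd"))
```

Because such a relation is not transitive, the matched pairs do not automatically partition X into clean groups, which is why the transitive case is singled out as clustering/classification.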

5
Related fields
  • String similarity
    • Edit distance, Jaro-Winkler, ...
  • Abbreviation matching
    • Mostly deals with biomedical texts and predefined formats
  • Data cleaning
    • High-level architectures by database people
  • Social network analysis
    • Collaboration graphs of authors
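
As a concrete instance of the string similarity measures listed above, edit distance can be sketched as follows. This is the standard Levenshtein dynamic program with unit costs for insertion, deletion and substitution. The slides do not say which variant or cost scheme was used, so treat the details as assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b with unit edit costs."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


print(edit_distance("Citeseer", "CiteSeer"))
```

A small distance relative to the string length suggests a typographical variant of the same name, although a threshold would have to be tuned per data set.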

6
Citation matching, author name disambiguation
  • Can be cast as classification/clustering
  • Usual information sources
    • Coauthor information, titles and venues
    • i.e. within the records themselves (internal)
  • Models
    • Naïve Bayes, K-means, SVM, vector space model, graphical models, ...
  • Some apply methods to reduce the number of comparisons required
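
One common way to reduce the number of comparisons is blocking: only records that share a cheap key are compared with each other. The slides do not name the methods used in the literature they survey, so the sketch below is an assumption. The key (first initial plus last name token) and the helper names `blocking_key` and `candidate_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name: str) -> str:
    """Hypothetical key: first initial plus last token, lowercased.

    Assumes the name is non-empty and written as "First ... Last".
    """
    parts = name.lower().split()
    return f"{parts[0][0]} {parts[-1]}"


def candidate_pairs(names):
    """Return only the pairs of names that fall into the same block."""
    blocks = defaultdict(list)
    for name in names:
        blocks[blocking_key(name)].append(name)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs


names = ["Dongwon Lee", "D. Lee", "Jinsoo Park", "J. Park", "Sudha Ram"]
print(candidate_pairs(names))
```

With these five names, only two candidate pairs remain instead of the ten pairs an all-against-all comparison would need. The more expensive classifier or similarity measure then runs on the candidate pairs only.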

7
Resources
  • Internal resources
    • May contain insufficient information
    • Information may be difficult to extract
  • External resources
    • Web resources, ontologies
    • Contain additional freely available information
  • Objective
    • Combine internal and external resources

8
Mixed citation problem
  • Given an ambiguous name X (belonging to k
    different authors)
  • Given a list of citations C containing X
  • Which citations in C belong to which author?

Yoojin Hong, Byung-Won On and Dongwon Lee. System Support for Name Authority Control Problem in Digital Libraries: OpenDBLP Approach. ECDL 2004.
Sudha Ram, Jinsoo Park and Dongwon Lee. Digital Libraries for the Next Millennium: Challenges and Research Directions. Information Systems Frontiers 1999.

9
Search engine results
  • For each citation c in C
    • Query a search engine with the title of c to obtain relevant URLs
    • Represent c by a feature vector of relevant URLs
    • Each URL is weighted by its inverse host frequency
  • Compute cosine similarity between feature vectors
  • Perform clustering on C to derive k clusters
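
The feature construction and similarity steps above can be sketched as follows. The slides do not define inverse host frequency, so this sketch assumes, by analogy with inverse document frequency, that a URL's weight is log(N / n_h), where N is the number of citations and n_h is the number of citations whose results include a URL on the same host. The URLs, citation ids and function names are hypothetical, and the search engine query itself is replaced by a hard-coded dictionary.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def build_vectors(urls_per_citation):
    """Map each citation id to {url: weight}, using an assumed IHF weight."""
    n = len(urls_per_citation)
    host_count = Counter()
    for urls in urls_per_citation.values():
        # Count each host once per citation.
        host_count.update({urlparse(u).netloc for u in urls})
    return {
        cid: {u: math.log(n / host_count[urlparse(u).netloc]) for u in set(urls)}
        for cid, urls in urls_per_citation.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical search results for three citations of the ambiguous name.
results = {
    "c1": ["http://lab-a.example.edu/pubs", "http://portal.example.org/search"],
    "c2": ["http://lab-a.example.edu/pubs", "http://portal.example.org/search"],
    "c3": ["http://lab-b.example.edu/papers", "http://portal.example.org/search"],
}
vectors = build_vectors(results)
print(cosine(vectors["c1"], vectors["c2"]), cosine(vectors["c1"], vectors["c3"]))
```

Under this assumed weighting, a host that appears for every citation (the portal here) gets weight zero, so only the more distinctive hosts influence the similarity. Any clustering method that accepts pairwise similarities could then be run on C to obtain the k clusters.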

10
External coauthor network
  • Coauthor network from DBLP metadata
    • Delete the node representing X and its edges
  • Similarity between two author names is computed as an inverse of their distance
  • Similarity between two citations is the pairwise sum of their author similarities

Each node represents a name. Two nodes are connected if they are coauthors in some DBLP citation.
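
The coauthor-network similarity described above can be sketched as follows. Several details are assumptions, because the slides do not specify them: distance is taken to be shortest-path length in the network, the "inverse" is taken to be 1 / (1 + distance) so that identical names do not divide by zero, and unreachable names get similarity 0. The graph and author labels are hypothetical.

```python
from collections import deque


def remove_node(graph, x):
    """Return a copy of the graph without node x and its edges."""
    return {a: {b for b in nbrs if b != x} for a, nbrs in graph.items() if a != x}


def distance(graph, src, dst):
    """Shortest-path length by breadth-first search, or None if unreachable."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr == dst:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None


def name_similarity(graph, a, b):
    """Assumed inverse-distance form: 1 / (1 + distance), 0 if unreachable."""
    dist = distance(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1 + dist)


def citation_similarity(graph, authors1, authors2, x):
    """Pairwise sum of author similarities, ignoring the ambiguous name x."""
    g = remove_node(graph, x)
    return sum(name_similarity(g, a, b)
               for a in authors1 if a != x
               for b in authors2 if b != x)


# Hypothetical coauthor network: X has written with A, B and C.
graph = {"X": {"A", "B", "C"}, "A": {"X", "B"}, "B": {"X", "A"},
         "C": {"X", "D"}, "D": {"C"}}
print(citation_similarity(graph, ["X", "A"], ["X", "B"], "X"))
print(citation_similarity(graph, ["X", "A"], ["X", "C"], "X"))
```

Deleting X first matters: otherwise every coauthor of X would be at most two hops from every other coauthor, and the distances would not separate the different people who share the name.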
11
Results
12
Venue name disambiguation
  • To determine e.g. TREC = Text Retrieval Conference
    • Not using other parts of the citation records
  • Problems
    • Abbreviations are extremely common
    • Venues change name over time
  • Experiments using Google in progress
    • Using URL features
    • Using Google snippets