Title: Disambiguation Problems in Digital Libraries
1Disambiguation Problems in Digital Libraries
- Tan Yee Fan
- 2006 August 11
- WING Group Meeting
2Introduction
- Bibliographic digital libraries
- DBLP, Citeseer, ACM Portal, ...
- Metadata records
- Authors, title, venue, year, ...
- Inconsistencies and errors
- Typographical errors
- Abbreviation
- Different entities sharing same name
4Problem formulation
- General disambiguation problem
- Given a list of data items X
- Find a function $d : X \times X \to \{0, 1\}$ such that
- $d(x_1, x_2) = 1$ if $x_1$ and $x_2$ match
- $d(x_1, x_2) = 0$ otherwise
- Matching relation is not necessarily transitive
- $d(ab, bc) = 1$ and $d(bc, cd) = 1$, but $d(ab, cd) = 0$
- If transitive, it is clustering/classification
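The non-transitive case can be illustrated with a toy matching function. The shared-character rule below is an assumption chosen only to reproduce the ab/bc/cd example; it is not the matching function used in this work.

```python
def d(x1: str, x2: str) -> int:
    """Toy matching function: 1 if the two items share a character, else 0.

    The shared-character rule is a hypothetical stand-in for a real matcher.
    """
    return 1 if set(x1) & set(x2) else 0


# "ab" matches "bc" and "bc" matches "cd", yet "ab" does not match "cd",
# so the matching relation is not transitive.
results = (d("ab", "bc"), d("bc", "cd"), d("ab", "cd"))
print(results)
```

Because such a relation does not partition X into equivalence classes, it cannot always be expressed as a clustering of the items.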
5Related fields
- String similarity
- Edit distance, Jaro-Winkler, ...
- Abbreviation matching
- Mostly deals with biomedical texts and predefined formats
- Data cleaning
- High-level architectures from the database community
- Social network analysis
- Collaboration graphs of authors
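As a concrete instance of the string similarity measures listed above, here is a minimal sketch of Levenshtein edit distance. The normalised `similarity` helper is an assumption added for illustration, not something the slides define.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(t) + 1))  # distance from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs
                curr[j - 1] + 1,     # insert ct
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


def similarity(s: str, t: str) -> float:
    """Hypothetical normalisation of edit distance into [0, 1]."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest
```

For example, `edit_distance("Citeseer", "CiteSeer")` counts a single substitution, which is the kind of typographical variation found in metadata records.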
6Citation matching, author name disambiguation
- Can be cast as classification/clustering
- Usual information source
- Coauthor information, titles and venues
- i.e. within the records themselves (internal)
- Models
- Naïve Bayes, K-means, SVM, vector space model, graphical models, ...
- Some apply methods to reduce the number of comparisons required
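One common way to reduce the number of comparisons is blocking: only records that share a cheap key are compared in full. The sketch below assumes a key of last name plus first initial; the key choice and the abbreviated name variants are illustrative assumptions, not taken from the systems surveyed here.

```python
from collections import defaultdict
from itertools import combinations


def block_key(name: str) -> str:
    """Hypothetical blocking key: last name plus first initial, lowercased.

    Assumes a non-empty name written as "First ... Last".
    """
    parts = name.lower().split()
    return f"{parts[-1]}_{parts[0][0]}"


def candidate_pairs(names):
    """Return only the pairs of names that fall into the same block."""
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name)].append(name)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


# "D. Lee" and "J. Park" are made-up variants used only for illustration.
names = ["Dongwon Lee", "D. Lee", "Jinsoo Park", "J. Park", "Sudha Ram"]
pairs = candidate_pairs(names)
print(len(pairs), "candidate pairs instead of", len(names) * (len(names) - 1) // 2)
```

The expensive matcher then runs only on `pairs`, at the cost of missing matches whose keys differ.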
7Resources
- Internal resources
- May contain insufficient information
- Information may be difficult to extract
- External resources
- Web resources, ontologies
- Contains additional freely available information
- Objective
- Combine internal and external resources
8Mixed citation problem
- Given an ambiguous name X (belonging to k different authors)
- Given a list of citations C containing X
- Which citations in C belong to which author?
Yoojin Hong, Byung-Won On and Dongwon Lee. System Support for Name Authority Control Problem in Digital Libraries: OpenDBLP Approach. ECDL 2004.
Sudha Ram, Jinsoo Park and Dongwon Lee. Digital Libraries for the Next Millennium: Challenges and Research Directions. Information Systems Frontiers 1999.
9Search engine results
- For each citation c in C
- Query search engine with title of c to obtain relevant URLs
- Represent c by a feature vector of relevant URLs
- Each URL weighted by its inverse host frequency
- Cosine similarity between feature vectors
- Perform clustering on C to derive k clusters
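The steps above can be sketched as follows. The slides do not give the exact formulas, so this assumes an IDF-style weight log(N / host frequency) and single-link agglomerative clustering; the URLs are made-up placeholders standing in for search engine results, and no search engine is actually queried.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def host(url: str) -> str:
    return urlparse(url).netloc.lower()


def ihf_weights(urls_by_citation):
    """Assumed inverse host frequency: log(N / number of citations hitting the host)."""
    n = len(urls_by_citation)
    host_count = Counter()
    for urls in urls_by_citation.values():
        host_count.update({host(u) for u in urls})
    return {h: math.log(n / c) for h, c in host_count.items()}


def feature_vector(urls, ihf):
    """Sparse vector: each URL weighted by the IHF of its host."""
    return {u: ihf[host(u)] for u in set(urls)}


def cosine(v1, v2):
    dot = sum(w * v2[u] for u, w in v1.items() if u in v2)
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return 0.0 if n1 == 0 or n2 == 0 else dot / (n1 * n2)


def cluster(vectors, k):
    """Single-link agglomerative clustering down to k clusters (an assumption)."""
    clusters = [[c] for c in vectors]
    while len(clusters) > max(k, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(cosine(vectors[a], vectors[b])
                          for a in clusters[i] for b in clusters[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters


# Hypothetical search results for three citations c1..c3.
urls_by_citation = {
    "c1": ["http://a.example.edu/~lee/pubs.html", "http://portal.example.org/c1"],
    "c2": ["http://a.example.edu/~lee/pubs.html", "http://portal.example.org/c2"],
    "c3": ["http://b.example.edu/~dlee/papers.html", "http://portal.example.org/c3"],
}
ihf = ihf_weights(urls_by_citation)
vectors = {c: feature_vector(urls, ihf) for c, urls in urls_by_citation.items()}
clusters = cluster(vectors, k=2)
print(clusters)
```

Under this weighting, a host that appears for every citation (here the portal) gets weight zero, so only the shared home-page URL pulls c1 and c2 together.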
10External coauthor network
- Coauthor network from DBLP metadata
- Delete the node representing X and its edges
- Similarity between two author names computed as an inverse of their distance
- Similarity between two citations is pairwise sum of their author similarities
Each node represents a name
Connected if they are coauthors in some DBLP citation
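A minimal sketch of this coauthor-network similarity, under stated assumptions: distance is the shortest path length found by breadth-first search, "inverse of distance" is taken as 1 / (1 + distance) so that identical names do not divide by zero, and unreachable names get similarity 0. The records use placeholder names (X is the ambiguous name), not real DBLP data.

```python
from collections import deque


def build_network(records, ambiguous):
    """Coauthor graph with the ambiguous name's node and edges left out."""
    graph = {}
    for authors in records:
        others = [a for a in authors if a != ambiguous]
        for a in others:
            graph.setdefault(a, set())
            for b in others:
                if a != b:
                    graph[a].add(b)
    return graph


def distance(graph, src, dst):
    """Shortest path length by BFS, or None if dst is unreachable."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nb in graph.get(node, ()):
            if nb == dst:
                return dist + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None


def name_similarity(graph, a, b):
    dist = distance(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1 + dist)  # assumed form of "inverse"


def citation_similarity(graph, authors1, authors2, ambiguous):
    """Pairwise sum of author similarities, skipping the ambiguous name."""
    return sum(name_similarity(graph, a, b)
               for a in authors1 if a != ambiguous
               for b in authors2 if b != ambiguous)


# Placeholder author lists; "X" is the ambiguous name.
records = [["X", "A", "B"], ["X", "C", "D"], ["B", "E"], ["X", "E"]]
graph = build_network(records, ambiguous="X")
sim_related = citation_similarity(graph, ["X", "A", "B"], ["X", "E"], "X")
sim_unrelated = citation_similarity(graph, ["X", "A", "B"], ["X", "C", "D"], "X")
print(sim_related, sim_unrelated)
```

Leaving X out of the graph matters: otherwise every pair of X's coauthors would be at distance two through X, regardless of which real author they worked with.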
11Results
12Venue name disambiguation
- To determine e.g. TREC = Text Retrieval Conference
- Not using other parts of the citation records
- Problems
- Abbreviations are extremely common
- Venues change name over time
- Experiments using Google in progress
- Using URL features
- Using Google snippets