Title: Record Linkage Survey
1. Record Linkage Survey
- Tan Yee Fan
- 2007 February 9
- WING Group Meeting
2. Contents
- Introduction
- Record linkage using internal knowledge
  - String matching
  - Classification or clustering
  - Graphical formalisms
  - Blocking
- Record linkage using search engine
- Adaptive methods
- Conclusion
3. Introduction
- Ambiguous representation of named entities
  - Citation records
  - Web pages
  - People names
  - Products
  - Customer records
  - Images
- Merging information from various sources
- Typos? Errors?
- Think of large-scale databases with millions of records!
4. Commercial world
- Addresses
  - Dongwon Lee, 110 E. Foster Ave. 410, State College, PA, 16802
  - LEE Dong, 110 East Foster Avenue Apartment 410, University Park, PA 16802-2343
- Products
  - Honda Fit vs. Honda Jazz
  - T-Fal vs. Tefal
  - Apple iPod Nano 4GB vs. 4GB iPod nano 4GB
Examples courtesy of Dongwon Lee (Penn State University)
5. Author names and citations
Jeffrey D. Ullman (Stanford University)
Some images courtesy of Dongwon Lee (Penn State University)
6. Web pages
Note: I did not author any of these web pages!
7. More on person names
- Highly ambiguous
  - Only 90,000 different names for 100 million people (U.S. Census Bureau)
- Valid changes
  - Customs: Lee, Dongwon vs. Dongwon Lee vs. LEE Dongwon
  - Marriage: Carol Dusseau vs. Carol Arpaci-Dusseau
  - Misc.: Sean Engelson vs. Shlomo Argamon
Examples courtesy of Dongwon Lee (Penn State University)
8. Record linkage
- Input
  - Two lists of records, A and B
- Output
  - For each record a in A and each record b in B, do a and b refer to the same entity?
- Note
  - Entities do not come with unique identifiers
  - To disambiguate (deduplicate) items in a single list L, we set A = B = L
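9. Fellegi-Sunter model
[Figure: distributions of true matches and true non-matches along the sim(a, b) axis. Two thresholds split the axis into three regions: designate as definite non-match, no-decision region (hold for human review), and designate as definite match. Where the two distributions overlap, false matches and false non-matches occur.]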
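The three regions above can be sketched as a two-threshold decision rule. This is only an illustration: the function name `classify_pair` and the threshold values are hypothetical, and it assumes a generic similarity score in [0, 1], whereas the Fellegi-Sunter paper derives its score from match and non-match likelihood ratios.

```python
def classify_pair(score, lower=0.4, upper=0.8):
    """Map a similarity score to one of the three decision regions.

    lower and upper are illustrative thresholds, not values from the talk.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score >= upper:
        return "match"       # designate as definite match
    if score <= lower:
        return "non-match"   # designate as definite non-match
    return "review"          # no-decision region: hold for human review


if __name__ == "__main__":
    for s in (0.95, 0.6, 0.1):
        print(s, classify_pair(s))
```

Moving the thresholds apart sends more pairs to human review and reduces false matches and false non-matches; moving them together does the opposite.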
11. String matching
- String similarity
  - Strings as ordered sequences: (a, b, c) ≠ (c, b, a)
    - Edit distance
    - Jaro and Jaro-Winkler
  - Strings as unordered sets: {a, b, c} = {c, b, a}
    - Jaccard similarity
    - Cosine similarity
- Abbreviation matching
  - Usually pattern detection in texts
  - e.g. Almost Locked Sets (ALS)
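A minimal sketch of one measure from each family named on this slide: Levenshtein edit distance for strings as ordered sequences, and Jaccard similarity for strings as unordered token sets. The function names, the lowercasing, and the whitespace tokenisation are assumptions made for illustration.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def jaccard(s, t):
    """Jaccard similarity over lowercased whitespace tokens."""
    a, b = set(s.lower().split()), set(t.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    print(edit_distance("T-Fal", "Tefal"))
    print(jaccard("Apple iPod Nano 4GB", "4GB iPod nano 4GB"))
```

Edit distance is sensitive to token order, while the set-based measure ignores it, which is why reordered product names score well under Jaccard.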
12. Classification or clustering
- Feature engineering and model selection
- Features
  - String similarity, relationships (e.g. collaborators)
- Models
  - Naïve Bayes, Support Vector Machine, K-means, Agglomerative Clustering, ...
Yoojin Hong, Byung-Won On and Dongwon Lee. System Support for Name Authority Control Problem in Digital Libraries: OpenDBLP Approach. ECDL 2004.
Sudha Ram, Jinsoo Park and Dongwon Lee. Digital Libraries for the Next Millennium: Challenges and Research Directions. Information Systems Frontiers 1999.
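The feature-plus-model pipeline above can be sketched as follows, under loud assumptions: the two features (name-token overlap and coauthor overlap), the hand-set weights and threshold, and the record layout are all hypothetical, and a simple single-link merge stands in for the trained models listed on the slide.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections (1.0 when both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def features(r1, r2):
    """Two illustrative features: name-token overlap and coauthor overlap."""
    name_sim = jaccard(r1["name"].lower().split(), r2["name"].lower().split())
    coauthor_sim = jaccard(r1["coauthors"], r2["coauthors"])
    return name_sim, coauthor_sim


def cluster(records, threshold=0.5, weights=(0.5, 0.5)):
    """Single-link clustering: merge any two records whose weighted
    feature score reaches the threshold (hand-set, not learned)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        feats = features(records[i], records[j])
        score = sum(w * f for w, f in zip(weights, feats))
        if score >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    toy = [  # toy records for illustration only
        {"name": "Dongwon Lee", "coauthors": {"Byung-Won On", "Jaewoo Kang"}},
        {"name": "Lee Dongwon", "coauthors": {"Byung-Won On"}},
        {"name": "Jeffrey D. Ullman", "coauthors": {"Aho"}},
    ]
    print(cluster(toy))
```

In practice the weights and threshold would be replaced by a classifier or clustering model chosen during model selection.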
13. Classification or clustering
- Graphical models
- Structure
  - Nodes: record fields or entire records
  - Edges: join fields with similar values, fields to records, etc.
- Characteristics
  - Model global knowledge
  - Propagate information around the graph until convergence
  - Usually very time consuming
- Examples
  - Conditional random field
  - Dependency graph
  - Generative probabilistic model
14. Social network analysis
- Social network
  - Nodes: entities (e.g. author names)
  - Edges: relationships (e.g. coauthored a paper)
[Figure: coauthorship network]
15. Social network analysis
- Analysis
  - Connected components
  - Distance between nodes
  - Node/edge centrality
  - Cliques
  - Bipartite subgraphs
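Two of the analyses listed above, connected components and distance between nodes, can be sketched on a coauthorship graph. The function names and the input layout (one list of author names per paper) are assumptions for illustration.

```python
from collections import defaultdict, deque


def build_graph(papers):
    """Coauthorship graph: one node per author name, one edge per
    pair of names that appear on the same paper."""
    graph = defaultdict(set)
    for authors in papers:
        for a in authors:
            graph[a]  # ensure single-author papers still create a node
            for b in authors:
                if a != b:
                    graph[a].add(b)
    return graph


def distances_from(graph, source):
    """Breadth-first search: hop distance from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def connected_components(graph):
    """Each component is a set of nodes reachable from one another."""
    seen, components = set(), []
    for node in list(graph):
        if node not in seen:
            component = set(distances_from(graph, node))
            seen |= component
            components.append(component)
    return components


if __name__ == "__main__":
    g = build_graph([["Tan", "Kan", "Lee"], ["Lee", "On"], ["Ullman"]])
    print(distances_from(g, "Tan"))
    print(connected_components(g))
```

Two name variants that fall in the same component at a short distance are more plausible candidates for the same person than variants in different components.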
16. Social network analysis
- Connected triple
- Random walk
[Figure: example graphs over nodes s, t, x1, x2 and x3]
17. Scalability Issues
- Pairwise comparisons
  - Requires O(n^2) time
  - Major bottleneck
- Possible solutions
  - Blocking techniques
  - Avoiding pairwise comparisons altogether
Input: d1, d2, ..., dn
for i = 1 to n
    for j = (i + 1) to n
        compute sim(di, dj)
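A minimal sketch of blocking, compared with the all-pairs loop above: records are grouped by a cheap key and only pairs inside the same block are compared. The key function `last_token_initial` is a hypothetical choice made for illustration, and it assumes non-empty names.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(records):
    """Baseline: every unordered pair, n * (n - 1) / 2 comparisons."""
    return list(combinations(range(len(records)), 2))


def blocked_pairs(records, key):
    """Candidate pairs restricted to records that share a blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


def last_token_initial(name):
    """Hypothetical blocking key: first letter of the last token."""
    return name.split()[-1][0].lower()


if __name__ == "__main__":
    names = ["Jeffrey D. Ullman", "J. Ullman", "Shimon Ullman",
             "Dongwon Lee", "LEE Dongwon", "Carol Dusseau"]
    print(len(all_pairs(names)), len(blocked_pairs(names, last_token_initial)))
```

With this key, "LEE Dongwon" lands in a different block from "Dongwon Lee", so the pair is never compared. Blocking keys trade recall for speed, and the choice of key matters.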
19. Record linkage using search engine
- Previously
  - We assumed input data records contain sufficient information to perform linkage
- What if
  - There is insufficient or only noisy information?
  - e.g. linking short forms to long forms
- Ask other people!
  - Use the web as the collective knowledge of people
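20. Record linkage using search engine
[Figure: annotated search engine results page, labelling the number of results, the ranked list, and each result's title, snippet, URL and web page. Example queries: sudoku strategies; sudoku OR strategies; "sudoku strategies"]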
21. Examples
- Counts
  - Co-occurrence measure between count(q), count(q') and count(q and q')
- Hostnames from URLs
  - Overlap between the hostnames of results of queries q and q'
  - Inverse Host Frequency
- Snippets or web pages
  - Cosine similarity using the tokens
  - Counts of specific terms
    - e.g. number of snippets for the query q containing the string q'
  - Do natural language processing
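The hostname-overlap idea above can be sketched as a Jaccard overlap between the hostnames of two result lists. Jaccard is an assumed choice of overlap measure, the function names are hypothetical, and Inverse Host Frequency weighting is not modelled here.

```python
from urllib.parse import urlsplit


def hostnames(urls):
    """Set of lowercased hostnames from a list of result URLs."""
    hosts = set()
    for url in urls:
        host = urlsplit(url).hostname  # None if the URL has no network location
        if host:
            hosts.add(host)
    return hosts


def host_overlap(urls_a, urls_b):
    """Jaccard overlap between the hostnames returned for two queries."""
    a, b = hostnames(urls_a), hostnames(urls_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Placeholder URLs, not real search results.
    results_a = ["http://www.example.org/a", "http://dl.example.com/x"]
    results_b = ["http://WWW.example.org/b", "http://other.example.net/"]
    print(host_overlap(results_a, results_b))
```

Only the URLs of the ranked list are needed, so this measure avoids downloading the web pages themselves.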
22. Googled name linkage
- Suppose e and e' refer to the same entity
- Then web pages of e and web pages of e' are likely to share some representative data
  - Jeffrey D. Ullman: 384,000 pages
  - Jeffrey D. Ullman aho: 174,000 pages
  - J. Ullman: 124,000 pages
  - J. Ullman aho: 41,000 pages
  - Shimon Ullman: 27,300 pages
  - Shimon Ullman aho: 66 pages
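As a worked example on the page counts above, the share of a name's pages that also mention "aho" can be computed directly. This ratio is just one simple co-occurrence measure chosen for illustration; it is not necessarily the measure used in the Googled name linkage work.

```python
# Page counts copied from the slide: (name alone, name together with "aho").
counts = {
    "Jeffrey D. Ullman": (384_000, 174_000),
    "J. Ullman": (124_000, 41_000),
    "Shimon Ullman": (27_300, 66),
}


def cooccurrence_ratio(count_q, count_q_and_term):
    """Fraction of pages for q that also mention the extra term."""
    return count_q_and_term / count_q if count_q else 0.0


ratios = {name: cooccurrence_ratio(*c) for name, c in counts.items()}

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.4f}")
```

The ratios for "Jeffrey D. Ullman" and "J. Ullman" are about 0.45 and 0.33, while "Shimon Ullman" is about 0.002, which suggests the first two names share representative data and the third does not.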
23. Query probing
Googling and web page downloads are expensive in time!
- Consider
  - Joint Conference on Digital Libraries
  - European Conference on Digital Libraries
  - Digital Libraries
- Query probing
  - Use the common n-gram "digital libraries" as a query probe
  - If we can obtain information on all three conferences, we save two queries
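Finding the probe for the example above can be sketched as a search for the longest word n-gram shared by all the strings. The function names and the alphabetical tie-breaking are assumptions; how probes are actually selected is not specified on the slide.

```python
def word_ngrams(text, n):
    """Set of lowercased word n-grams (as tuples) in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def longest_common_ngram(names):
    """Longest word n-gram shared by every name, or None if there is none.
    Ties are broken alphabetically to keep the result deterministic."""
    if not names:
        return None
    shortest = min(len(name.split()) for name in names)
    for n in range(shortest, 0, -1):
        common = set.intersection(*(word_ngrams(name, n) for name in names))
        if common:
            return " ".join(min(common))
    return None


if __name__ == "__main__":
    venues = [
        "Joint Conference on Digital Libraries",
        "European Conference on Digital Libraries",
        "Digital Libraries",
    ]
    print(longest_common_ngram(venues))
```

One query with the shared probe can then stand in for a separate query per venue string.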
24. Adaptive querying
- Methods
  - Ms: stronger method but very slow (e.g. web page similarity)
  - Mw: weaker method but fast (e.g. host overlap)
- Aim
  - Accuracy close to Ms
  - Significantly shorter running time than Ms
- Algorithm
  - Execute Mw
  - If a heuristic suggests that the Mw results are likely incorrect
    - Execute Ms
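The algorithm above can be sketched as a thin wrapper around the two methods. Everything here is a placeholder: `weak`, `strong` and `is_unreliable` are hypothetical callables supplied by the caller, and the slide does not say which heuristic is used.

```python
def adaptive_link(a, b, weak, strong, is_unreliable):
    """Run the weak method first; fall back to the strong one only
    when the heuristic distrusts the weak result.

    weak, strong: callables (a, b) -> result   (Mw and Ms on the slide)
    is_unreliable: callable (result) -> bool   (the heuristic)
    Returns (result, used_strong).
    """
    result = weak(a, b)
    if is_unreliable(result):
        return strong(a, b), True
    return result, False


if __name__ == "__main__":
    # Stand-in methods for illustration only.
    weak = lambda a, b: 0.5 if a[0] == b[0] else 0.0
    strong = lambda a, b: 1.0 if a.lower() == b.lower() else 0.0
    unsure = lambda score: 0.3 < score < 0.7
    print(adaptive_link("JCDL", "jcdl", weak, strong, unsure))
```

The running time approaches that of Mw when the heuristic rarely fires, and approaches Mw plus Ms when it fires on most pairs.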
25. Comment
- These techniques
  - Blocking
  - Query probing
  - Adaptive querying
- Combine different methods to obtain the better aspects of each
27. Conclusion
- Comment
  - This survey is very brief and broad, but still leaves many aspects uncovered
- History
  - Record linkage became a research issue in the 1940s, possibly due to analysis of census data or medical records
- Research directions
  - Graphical models utilize global knowledge, but how can they be made scalable for large datasets?
  - Utilizing external knowledge in an effective and scalable manner
  - Adaptive methods
28. Thank You
29. Selected Bibliography
- General and surveys
  - Ivan P. Fellegi and Alan B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183-1210, December 1969.
  - William E. Winkler and Yves Thibaudeau. An application of the Fellegi-Sunter Model of record linkage to the 1990 U.S. Decennial Census. Technical Report RR91/09, U.S. Bureau of the Census, 1991.
  - Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering (TKDE), 19(1):1-16, January 2007.
  - William E. Winkler. Overview of record linkage and current research directions. Technical Report RRS2006/02, U.S. Bureau of the Census, February 2006.
  - Mikhail Bilenko, Raymond J. Mooney, William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems, 18(5):16-23, January/February 2003.
  - Min-Yen Kan and Yee Fan Tan. Record Matching in Digital Library Metadata. To appear in Communications of the ACM (CACM).
30. Selected Bibliography
- String matching
  - Robert A. Wagner and Michael J. Fischer. The string-to-string correction problem. Journal of the Association for Computing Machinery, 21(1):168-173, January 1974.
  - Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443-453, March 1970.
  - Temple F. Smith and Michael S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195-197, March 1981.
  - Andrés Marzal and Enrique Vidal. Computation of normalized edit distance and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):926-932, September 1993.
  - Alvaro E. Monge and Charles Elkan. The field matching problem: Algorithms and applications. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 267-270, August 1996.
  - Jie Wei. Markov edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3):311-321, March 2004.
  - Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 39-48, August 2003.
  - Andrew McCallum, Kedar Bellare, and Fernando Pereira. A Conditional Random Field For Discriminatively-Trained Finite-State String Edit Distance. In Conference on Uncertainty in Artificial Intelligence (UAI), July 2005.
  - William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Information Integration on the Web (IIWeb), pages 73-78, August 2003.
  - Ariel S. Schwartz and Marti A. Hearst. A simple algorithm for identifying abbreviation definitions in biomedical text. In Pacific Symposium on Biocomputing (PSB), pages 451-462, January 2003.
  - Youngja Park and Roy J. Byrd. Hybrid text mining for finding abbreviations and their definitions. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 126-133, June 2001.
  - Jeffrey T. Chang, Hinrich Schütze, and Russ B. Altman. Creating an online dictionary of abbreviations from MEDLINE. Journal of the American Medical Informatics Association, 9(6):612-620, November/December 2002.
  - Hiroko Ao and Toshihisa Takagi. ALICE: An algorithm to extract abbreviations from MEDLINE. Journal of the American Medical Informatics Association, 12(5):576-586, September/October 2005.
31. Selected Bibliography
- Direct classification or clustering, and blocking
  - Hui Han, Hongyuan Zha, and C. Lee Giles. A model-based K-means algorithm for name disambiguation. In Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, October 2003.
  - Hui Han, C. Lee Giles, Hongyuan Zha, Cheng Li, and Kostas Tsioutsiouliklis. Two supervised learning approaches for name disambiguation in author citations. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 296-305, June 2004.
  - Hui Han, Wei Xu, Hongyuan Zha, and C. Lee Giles. A hierarchical naive Bayes mixture model for name disambiguation in author citations. In ACM Symposium on Applied Computing (SAC), pages 1065-1069, March 2005.
  - Hui Han, Hongyuan Zha, and C. Lee Giles. Name disambiguation in author citations using a K-way spectral clustering method. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 334-343, June 2005.
  - Dongwon Lee, Byung-Won On, Jaewoo Kang, and Sanghyun Park. Effective and scalable solutions for mixed and split citation problems in digital libraries. In ACM SIGMOD Workshop on Information Quality in Information Systems (IQIS), pages 69-76, June 2005.
  - Byung-Won On, Dongwon Lee, Jaewoo Kang, and Prasenjit Mitra. Comparative study of name disambiguation problem using a scalable blocking-based framework. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 344-353, June 2005.
  - Andrew McCallum, Kamal Nigam, and Lyle Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169-178, August 2000.
  - Matthew Michelson and Craig A. Knoblock. Learning blocking schemes for record linkage. In National Conference on Artificial Intelligence (AAAI), July 2006.
  - Mikhail Bilenko, Beena Kamath, and Raymond J. Mooney. Adaptive Blocking: Learning to Scale Up Record Linkage and Clustering. In IEEE International Conference on Data Mining (ICDM), December 2006.
32. Selected Bibliography
- Graphical models
  - Jie Wei. Markov edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3):311-321, March 2004.
  - John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), pages 282-289, June/July 2001.
  - Andrew McCallum and Ben Wellner. Object consolidation by graph partitioning with a conditionally-trained distance metric. In ACM SIGKDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pages 19-24, August 2003.
  - Ben Wellner, Andrew McCallum, Fuchun Peng, and Michael Hay. An integrated, conditional model of information extraction and coreference with application to citation matching. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 593-601, July 2004.
  - Andrew McCallum, Kedar Bellare, and Fernando Pereira. A Conditional Random Field For Discriminatively-Trained Finite-State String Edit Distance. In Conference on Uncertainty in Artificial Intelligence (UAI), July 2005.
  - Xin Dong, Alon Halevy, and Jayant Madhavan. Reference reconciliation in complex information spaces. In ACM SIGMOD International Conference on Management of Data, pages 85-96, June 2005.
  - Indrajit Bhattacharya and Lise Getoor. A latent Dirichlet model for unsupervised entity resolution. In SIAM International Conference on Data Mining, pages 47-58, April 2006.
33. Selected Bibliography
- Social network analysis
  - H. A. Kautz, B. Selman, and M. A. Shah. The hidden web. AI Magazine, 18(2):27-36, 1997.
  - P. Mutschke. Mining networks and central entities in digital libraries. A graph theoretic approach applied to co-author networks. In Intelligent Data Analysis (IDA), pages 155-166, August 2003.
  - M. E. J. Newman. Who is the best connected scientist? A study of scientific coauthorship networks. In Complex Networks, pages 337-370, February 2004.
  - E. Otte and R. Rousseau. Social network analysis: a powerful strategy, also for the information sciences. Journal of Information Science, 28(6), December 2002.
  - T. Krichel and N. Bakkalbasi. A social network analysis of research collaboration in the economics community. In International Workshop on Webometrics, Informetrics and Scientometrics and Seventh COLLNET Meeting, May 2006.
  - R. Rousseau and M. Thelwall. Escher staircases on the world wide web. First Monday, 9(6), June 2004.
  - D. G. Feitelson. On identifying name equivalences in digital libraries. Information Research, 9(4), October 2004.
  - R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In International conference on World Wide Web (WWW), pages 463-470, May 2005.
  - R. Holzer, B. Malin, and L. Sweeney. Email alias detection using social network analysis. In Workshop on Link Discovery: Issues, Approaches and Applications (LinkKDD), August 2005.
  - B. Malin, E. Airoldi, and K. M. Carley. A network analysis model for disambiguation of names in lists. Computational and Mathematical Organization Theory, 11(2):119-139, July 2005.
  - G. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 150-160, August 2000.
  - P. K. Reddy and M. Kitsuregawa. An approach to build a cyber-community hierarchy. In SIAM ICDM Workshop on Web Analysis, April 2002.
  - Patrick Reuther. Personal name matching: New test collections and a social network based approach. Technical Report Mathematics/Computer Science 06-01, University of Trier, March 2006.
  - Yutaka Matsuo, Junichiro Mori, Masahiro Hamasaki, Keisuke Ishida, Takuichi Nishimura, Hideaki Takeda, Kôiti Hasida, and Mitsuru Ishizuka. POLYPHONET: an advanced social network extraction system from the web. In International conference on World Wide Web (WWW), pages 397-406, May 2006.
34. Selected Bibliography
- Web-based methods
  - Jamie P. Callan, Margie E. Connell, and Aiqun Du. Automatic discovery of language models for text databases. In ACM SIGMOD International Conference on Management of Data, pages 479-490, June 1999.
  - Jamie P. Callan and Margie E. Connell. Query-based sampling of text databases. ACM Transactions on Information Systems (TOIS), 19(2):97-130, April 2001.
  - Panagiotis G. Ipeirotis and Luis Gravano. Distributed search over the hidden-web: Hierarchical database sampling and selection. In International Conference on Very Large Databases (VLDB), pages 394-405, August 2002.
  - Luis Gravano, Panagiotis G. Ipeirotis, and Mehran Sahami. QProber: A system for automatic classification of hidden-web databases. ACM Transactions on Information Systems (TOIS), 21(1):1-41, January 2003.
  - Aron Culotta, Ron Bekkerman, and Andrew McCallum. Extracting social networks and contact information from email and the web. In Conference on Email and Anti-Spam (CEAS), July 2004.
  - Philipp Cimiano, Siegfried Handschuh, and Steffen Staab. Towards the self-annotating web. In International conference on World Wide Web (WWW), pages 462-471, May 2004.
  - Philipp Cimiano, Günter Ladwig, and Steffen Staab. Gimme the context: Context-driven automatic semantic annotation with C-PANKOW. In International conference on World Wide Web (WWW), pages 332-341, May 2005.
  - Yutaka Matsuo, Junichiro Mori, Masahiro Hamasaki, Keisuke Ishida, Takuichi Nishimura, Hideaki Takeda, Kôiti Hasida, and Mitsuru Ishizuka. POLYPHONET: an advanced social network extraction system from the web. In International conference on World Wide Web (WWW), pages 397-406, May 2006.
  - Yee Fan Tan, Min-Yen Kan, and Dongwon Lee. Search engine driven author disambiguation. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), June 2006.
  - Ergin Elmacioglu, Min-Yen Kan, Dongwon Lee, and Yi Zhang. Googled name linkage. 2007.
  - Yee Fan Tan, Ergin Elmacioglu, Min-Yen Kan, and Dongwon Lee. Record Linkage of Short Forms to Long Forms: A Case Study of Publication Venues. 2007.
  - Min-Yen Kan. Web page classification without the web page. In International conference on World Wide Web (WWW), pages 262-263, May 2004.
  - Min-Yen Kan and Hoang Oanh Nguyen Thi. Fast webpage classification using URL features. In International Conference on Information and Knowledge Management (CIKM), pages 325-326, October/November 2005.
  - Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, and Luis Gravano. To search or to crawl? Towards a query optimizer for text-centric tasks. In ACM SIGMOD International Conference on Management of Data, pages 265-276, June 2006.