Record Linkage Survey - PowerPoint PPT Presentation

1
Record Linkage Survey
  • Tan Yee Fan
  • 2007 February 9
  • WING Group Meeting

2
Contents
  • Introduction
  • Record linkage using internal knowledge
  • String matching
  • Classification or clustering
  • Graphical formalisms
  • Blocking
  • Record linkage using search engine
  • Adaptive methods
  • Conclusion

3
Introduction
  • Ambiguous representation of named entities
  • Citation records
  • Web pages
  • People names
  • Products
  • Customer records
  • Images
  • Merging information from various sources
  • Typos? Errors?
  • Think of large scale databases of millions of
    records!

4
Commercial world
  • Addresses
    • Dongwon Lee, 110 E. Foster Ave. 410, State
      College, PA, 16802
    • LEE Dong, 110 East Foster Avenue Apartment 410,
      University Park, PA 16802-2343
  • Products
    • Honda Fit vs. Honda Jazz
    • T-Fal vs. Tefal
    • Apple iPod Nano 4GB vs. 4GB iPod nano 4GB

Examples courtesy of Dongwon Lee (Penn State University)
5
Author names and citations
Jeffrey D. Ullman (Stanford University)
Some images courtesy of Dongwon Lee (Penn State University)
6
Web pages
Note: I did not author any of these web pages!
7
More on person names
  • Highly ambiguous
    • Only 90,000 different names for 100 million
      people (U.S. Census Bureau)
  • Valid changes
    • Customs: Lee, Dongwon vs. Dongwon Lee vs. LEE
      Dongwon
    • Marriage: Carol Dusseau vs. Carol Arpaci-Dusseau
    • Misc.: Sean Engelson vs. Shlomo Argamon

Examples courtesy of Dongwon Lee (Penn State University)
8
Record linkage
  • Input
    • Two lists of records, A and B
  • Output
    • For each record a in A and each record b in B,
      do a and b refer to the same entity?
  • Note
    • Entities do not come with unique identifiers
    • To disambiguate (deduplicate) items in a single
      list L, we set A = B = L

9
Fellegi-Sunter model
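[Figure: record pairs placed along a sim(a, b) axis, with overlapping
distributions of true matches and true non-matches. Two thresholds split
the axis into three regions: designate as definite non-match, no-decision
region (hold for human review), and designate as definite match. False
matches and false non-matches are the pairs that land in the wrong
region.]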
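
The three regions of the Fellegi-Sunter model can be read as a two-threshold rule on sim(a, b). The sketch below is only an illustration: the function name, the labels and the threshold values 0.4 and 0.8 are assumptions, not values from the slides.

```python
def classify_pair(score, lower, upper):
    """Three-way decision on a similarity score using two thresholds.

    Assumed convention: scores at or above `upper` are definite matches,
    scores at or below `lower` are definite non-matches, and anything in
    between is held for human review.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "review"


# Illustrative thresholds; in practice they are tuned to trade off
# false matches against false non-matches.
decisions = [classify_pair(s, lower=0.4, upper=0.8) for s in (0.95, 0.6, 0.1)]
print(decisions)
```

Moving the two thresholds apart reduces both kinds of error at the cost of a larger review region.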

11
String matching
  • String similarity
  • Strings as ordered sequences
  • Edit distance
  • Jaro and Jaro-Winkler
  • Strings as unordered sets
  • Jaccard similarity
  • Cosine similarity
  • Abbreviation matching
  • Usually pattern detection in texts
  • e.g. Almost Locked Sets (ALS)

Ordered sequences: (a, b, c) ≠ (c, b, a)
Unordered sets: {a, b, c} = {c, b, a}
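
To make the sequence versus set distinction concrete, here is a minimal sketch of one measure from each family, plus cosine similarity. Whitespace tokenization and lowercasing are assumptions made for the example; the slides do not fix these details.

```python
import math
from collections import Counter


def edit_distance(s, t):
    """Levenshtein distance with unit costs, by dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard(s, t):
    """Jaccard similarity on lowercased whitespace tokens (order ignored)."""
    a, b = set(s.lower().split()), set(t.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(s, t):
    """Cosine similarity on token-count vectors (order ignored)."""
    a, b = Counter(s.lower().split()), Counter(t.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


print(edit_distance("T-Fal", "Tefal"))                       # order-sensitive
print(jaccard("Apple iPod Nano 4GB", "4GB iPod nano 4GB"))   # order-insensitive
print(cosine("Dongwon Lee", "Lee Dongwon"))
```

The set-based measures treat "Dongwon Lee" and "Lee Dongwon" as identical, while edit distance penalizes the reordering.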
12
Classification or clustering
  • Feature engineering and model selection
  • Features
    • String similarity, relationships (e.g.
      collaborators)
  • Models
    • Naïve Bayes, Support Vector Machine, K-means,
      Agglomerative Clustering, ...

Yoojin Hong, Byung-Won On and Dongwon Lee.
System Support for Name Authority Control Problem
in Digital Libraries: OpenDBLP Approach. ECDL
2004.
Sudha Ram, Jinsoo Park and Dongwon Lee.
Digital Libraries for the Next Millennium:
Challenges and Research Directions. Information
Systems Frontiers 1999.
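
As one illustration of the clustering route, the sketch below runs a naive single-link agglomerative clustering over name strings. The token-level Jaccard feature, the 0.5 stopping threshold and the example names are assumptions chosen for the demo; a real system would combine several features, such as collaborator overlap.

```python
def token_jaccard(a, b):
    """Jaccard similarity on lowercased whitespace tokens."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 1.0


def single_link_clusters(items, sim, threshold):
    """Merge the two most similar clusters until no pair reaches threshold.

    Cluster similarity is single-link: the best similarity between any
    member of one cluster and any member of the other.
    """
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > 1:
        best, best_pair = -1.0, None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                link = max(sim(items[i], items[j])
                           for i in clusters[p] for j in clusters[q])
                if link > best:
                    best, best_pair = link, (p, q)
        if best < threshold:
            break
        p, q = best_pair
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return sorted(clusters)


names = ["Dongwon Lee", "Lee Dongwon", "LEE Dongwon",
         "Jeffrey D. Ullman", "Jeffrey Ullman"]
print(single_link_clusters(names, token_jaccard, threshold=0.5))
```

This naive version recomputes all cluster links on every merge, so it is only suitable for small inputs.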
13
Classification or clustering
  • Graphical models
  • Structure
    • Nodes: record fields or entire records
    • Edges: join fields with similar values, fields to
      records, etc.
  • Characteristic
    • Model global knowledge
    • Propagate information around the graph until
      convergence
    • Usually very time consuming
  • Examples
    • Conditional random field
    • Dependency graph
    • Generative probabilistic model

14
Social network analysis
  • Social network
    • Nodes: entities (e.g. author names)
    • Edges: relationships (e.g. coauthored a paper)

[Figure: coauthorship network]
15
Social network analysis
  • Analysis
  • Connected components
  • Distance between nodes
  • Node/edge centrality
  • Cliques
  • Bipartite subgraphs

16
Social network analysis
  • Connected triple
  • Random walk
  • Maximum flow
  • Clustering

[Figure: example graphs over the nodes s, t, x1, x2 and x3]
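
One reading of the connected triple idea on a coauthorship network: two name nodes that share a common coauthor form a triple (name, coauthor, name), which is evidence that the two names may refer to the same person. The sketch below assumes an undirected graph given as an edge list; the function name and the toy edges are made up for illustration.

```python
from itertools import combinations


def connected_triples(edges):
    """Map each pair of nodes to the set of neighbours they share."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    triples = {}
    for centre, neighbours in adj.items():
        # Every pair of neighbours of `centre` forms a triple through it.
        for a, b in combinations(sorted(neighbours), 2):
            triples.setdefault((a, b), set()).add(centre)
    return triples


# Toy coauthorship edges (illustrative only).
edges = [
    ("J. Ullman", "A. Aho"),
    ("Jeffrey D. Ullman", "A. Aho"),
    ("Jeffrey D. Ullman", "J. Hopcroft"),
    ("Shimon Ullman", "R. Basri"),
]
triples = connected_triples(edges)
print(triples[("J. Ullman", "Jeffrey D. Ullman")])
```

Here the two Ullman variants are linked through a shared coauthor, while "Shimon Ullman" shares none with either of them.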
17
Scalability Issues
  • Pairwise comparisons
  • Requires O(n²) time
  • Major bottleneck
  • Possible solutions
  • Blocking techniques
  • Avoiding pairwise comparisons altogether

Input: d1, d2, ..., dn
for i = 1 to n
    for j = (i + 1) to n
        compute sim(di, dj)
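
Blocking avoids most of these comparisons by only pairing records that share a cheap key. The sketch below is a minimal illustration; the blocking key (first letter of the last name token) and the sample records are assumptions, not a scheme taken from the slides.

```python
from collections import defaultdict
from itertools import combinations


def block_key(name):
    """Hypothetical blocking key: first letter of the last token, lowercased."""
    tokens = name.lower().split()
    return tokens[-1][0] if tokens else ""


def candidate_pairs(records, key=block_key):
    """Return index pairs (i, j), i < j, of records that share a block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs


records = ["Jeffrey D. Ullman", "J. Ullman", "Shimon Ullman",
           "Dongwon Lee", "D. Lee", "Min-Yen Kan"]
pairs = candidate_pairs(records)
total = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {total}")
```

The saving comes with a risk: a variant such as "Lee Dongwon" would land in a different block under this key and never be compared with "Dongwon Lee", so the choice of key matters.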

19
Record linkage using search engine
  • Previously
  • We assumed input data records contain sufficient
    information to perform linkage
  • What if
  • There is insufficient or only noisy information?
  • e.g. linking short forms to long forms
  • Ask other people!
  • Use web as collective knowledge of people

20
Record linkage using search engine
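[Figure: annotated search engine results page showing what a query
returns: the number of results and a ranked list, where each result has
a title, a snippet, a URL and the web page itself. Example queries:
sudoku strategies; sudoku OR strategies.]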
21
Examples
  • Counts
    • Co-occurrence measure between count(q), count(q')
      and count(q and q')
  • Hostnames from URLs
    • Overlap between the hostnames of results of
      queries q and q'
    • Inverse Host Frequency
  • Snippets or web pages
    • Cosine similarity using the tokens
    • Counts of specific terms
    • e.g. number of snippets for the query q
      containing the string q'
    • Do natural language processing
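
Two of these signals are easy to sketch once the search results are in hand. The code below assumes the result counts and result URLs have already been fetched; the Jaccard coefficient and the overlap coefficient are only one possible choice of co-occurrence and overlap measure, and the function names are made up.

```python
from urllib.parse import urlparse


def jaccard_from_counts(count_a, count_b, count_both):
    """Co-occurrence of two queries from their result counts.

    count_a and count_b are the counts for each query alone;
    count_both is the count for the conjunction of the two queries.
    """
    union = count_a + count_b - count_both
    return count_both / union if union > 0 else 0.0


def host_overlap(urls_a, urls_b):
    """Overlap coefficient between the hostnames of two result lists."""
    hosts_a = {urlparse(u).hostname for u in urls_a} - {None}
    hosts_b = {urlparse(u).hostname for u in urls_b} - {None}
    if not hosts_a or not hosts_b:
        return 0.0
    return len(hosts_a & hosts_b) / min(len(hosts_a), len(hosts_b))


print(jaccard_from_counts(100, 50, 25))
print(host_overlap(["http://example.org/a", "http://example.com/b"],
                   ["http://example.org/c"]))
```

Both functions return a value between 0 and 1, so they can be fed into the same thresholding or classification step as the string similarities.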

22
Googled name linkage
  • Suppose e and e' refer to the same entity
  • Then web pages of e and web pages of e' are
    likely to share some representative data
  • Jeffrey D. Ullman: 384,000 pages
  • Jeffrey D. Ullman aho: 174,000 pages
  • J. Ullman: 124,000 pages
  • J. Ullman aho: 41,000 pages
  • Shimon Ullman: 27,300 pages
  • Shimon Ullman aho: 66 pages
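
The page counts above become easier to compare as ratios: the fraction of each name's pages that also mention "aho". This short calculation uses only the numbers on the slide; the variable and function names are made up.

```python
# (pages for the name alone, pages for the name plus "aho")
page_counts = {
    "Jeffrey D. Ullman": (384_000, 174_000),
    "J. Ullman": (124_000, 41_000),
    "Shimon Ullman": (27_300, 66),
}


def cooccurrence_ratio(total, with_term):
    """Fraction of a name's pages that also contain the extra term."""
    return with_term / total if total else 0.0


ratios = {name: cooccurrence_ratio(total, with_aho)
          for name, (total, with_aho) in page_counts.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

The ratios are roughly 0.453, 0.331 and 0.002, so "Jeffrey D. Ullman" and "J. Ullman" behave alike with respect to "aho" while "Shimon Ullman" does not.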

23
Query probing
Googling and web page downloads are expensive in
time!
  • Consider
    • Joint Conference on Digital Libraries
    • European Conference on Digital Libraries
    • Digital Libraries
  • Query probing
    • Use the common n-gram "digital libraries" as a
      query probe
    • If we can obtain information on all three
      conferences, we save two queries
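
Finding a shared probe can be sketched as intersecting word n-gram sets. The code below assumes whitespace tokenization and lowercasing, and the function names are made up; it only picks the probe and does not issue any query.

```python
def word_ngrams(text, n):
    """Set of lowercased word n-grams of a string."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def common_ngrams(strings, n):
    """Word n-grams that occur in every input string."""
    sets = [word_ngrams(s, n) for s in strings]
    return set.intersection(*sets) if sets else set()


venues = [
    "Joint Conference on Digital Libraries",
    "European Conference on Digital Libraries",
    "Digital Libraries",
]
print(common_ngrams(venues, 2))
```

A single query for the shared bigram can then stand in for one query per venue, provided its results cover all three.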

24
Adaptive querying
  • Methods
    • Ms: stronger method but very slow (e.g. web page
      similarity)
    • Mw: weaker method but fast (e.g. host overlap)
  • Aim
    • Accuracy close to Ms
    • Significantly lower running time than Ms
  • Algorithm
    • Execute Mw
    • If a heuristic suggests that the Mw results are
      likely incorrect, execute Ms
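
The algorithm can be sketched as a wrapper around two scoring functions. The heuristic used here (distrust the weak score when it falls in an uncertain middle band) is an assumption for illustration, as are the band limits and the stub methods; the slides do not say which heuristic is used.

```python
def adaptive_link(pair, weak, strong, low=0.3, high=0.7):
    """Run the fast method first; fall back to the slow one if unsure.

    Assumed heuristic: a weak score strictly between `low` and `high`
    is treated as likely incorrect, so the strong method is executed.
    Returns the score and the name of the method that produced it.
    """
    score = weak(pair)
    if low < score < high:
        return strong(pair), "strong"
    return score, "weak"


# Stub methods standing in for Mw (e.g. host overlap) and Ms
# (e.g. web page similarity); the scores are made up.
strong_calls = []


def weak_method(pair):
    return {"a": 0.9, "b": 0.5, "c": 0.1}[pair]


def strong_method(pair):
    strong_calls.append(pair)
    return 0.8


results = {p: adaptive_link(p, weak_method, strong_method) for p in "abc"}
print(results)
print("strong method ran for:", strong_calls)
```

Only the ambiguous pair triggers the expensive method, which is where the running time saving comes from.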

25
Comment
  • These techniques
  • Blocking
  • Query probing
  • Adaptive querying
  • Combine different methods to obtain the better
    aspects of each


27
Conclusion
  • Comment
    • This survey is very brief and broad, and many
      aspects are still not covered
  • History
    • Record linkage became a research issue in the
      1940s, possibly due to analysis of census data or
      medical records
  • Research directions
    • Graphical models utilize global knowledge, but
      how can they be made scalable for large datasets?
    • Utilizing external knowledge in an effective and
      scalable manner
    • Adaptive methods

28
Thank You
29
Selected Bibliography
  • General and surveys
    • Ivan P. Fellegi and Alan B. Sunter. A theory for
      record linkage. Journal of the American
      Statistical Association, 64(328):1183-1210,
      December 1969.
    • William E. Winkler and Yves Thibaudeau. An
      application of the Fellegi-Sunter Model of record
      linkage to the 1990 U.S. Decennial Census.
      Technical Report RR91/09, U.S. Bureau of the
      Census, 1991.
    • Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and
      Vassilios S. Verykios. Duplicate record
      detection: A survey. IEEE Transactions on
      Knowledge and Data Engineering (TKDE),
      19(1):1-16, January 2007.
    • William E. Winkler. Overview of record linkage
      and current research directions. Technical Report
      RRS2006/02, U.S. Bureau of the Census, February
      2006.
    • Mikhail Bilenko, Raymond J. Mooney, William W.
      Cohen, Pradeep Ravikumar, and Stephen E.
      Fienberg. Adaptive name matching in information
      integration. IEEE Intelligent Systems,
      18(5):16-23, January/February 2003.
    • Min-Yen Kan and Yee Fan Tan. Record Matching in
      Digital Library Metadata. To appear in
      Communications of the ACM (CACM).

30
Selected Bibliography
  • String matching
    • Robert A. Wagner and Michael J. Fischer. The
      string-to-string correction problem. Journal of
      the Association for Computing Machinery,
      21(1):168-173, January 1974.
    • Saul B. Needleman and Christian D. Wunsch.
      A general method applicable to the search for
      similarities in the amino acid sequence of two
      proteins. Journal of Molecular Biology,
      48(3):443-453, March 1970.
    • Temple F. Smith and Michael S. Waterman.
      Identification of common molecular subsequences.
      Journal of Molecular Biology, 147(1):195-197,
      March 1981.
    • Andrés Marzal and Enrique Vidal. Computation of
      normalized edit distance and applications. IEEE
      Transactions on Pattern Analysis and Machine
      Intelligence, 15(9):926-932, September 1993.
    • Alvaro E. Monge and Charles Elkan. The field
      matching problem: Algorithms and applications. In
      ACM SIGKDD International Conference on Knowledge
      Discovery and Data Mining, pages 267-270, August
      1996.
    • Jie Wei. Markov edit distance. IEEE Transactions
      on Pattern Analysis and Machine Intelligence,
      26(3):311-321, March 2004.
    • Mikhail Bilenko and Raymond J. Mooney. Adaptive
      duplicate detection using learnable string
      similarity measures. In ACM SIGKDD International
      Conference on Knowledge Discovery and Data
      Mining, pages 39-48, August 2003.
    • Andrew McCallum, Kedar Bellare, and Fernando
      Pereira. A Conditional Random Field For
      Discriminatively-Trained Finite-State String Edit
      Distance. In Conference on Uncertainty in
      Artificial Intelligence (UAI), July 2005.
    • William W. Cohen, Pradeep Ravikumar, and Stephen
      E. Fienberg. A comparison of string distance
      metrics for name-matching tasks. In Information
      Integration on the Web (IIWeb), pages 73-78,
      August 2003.
    • Ariel S. Schwartz and Marti A. Hearst. A simple
      algorithm for identifying abbreviation
      definitions in biomedical text. In Pacific
      Symposium on Biocomputing (PSB), pages 451-462,
      January 2003.
    • Youngja Park and Roy J. Byrd. Hybrid text mining
      for finding abbreviations and their definitions.
      In Conference on Empirical Methods in Natural
      Language Processing (EMNLP), pages 126-133, June
      2001.
    • Jeffrey T. Chang, Hinrich Schütze, and Russ B.
      Altman. Creating an online dictionary of
      abbreviations from MEDLINE. Journal of the
      American Medical Informatics Association,
      9(6):612-620, November/December 2002.
    • Hiroko Ao and Toshihisa Takagi. ALICE: An
      algorithm to extract abbreviations from MEDLINE.
      Journal of the American Medical Informatics
      Association, 12(5):576-586, September/October
      2005.

31
Selected Bibliography
  • Direct classification or clustering, and blocking
    • Hui Han, Hongyuan Zha, and C. Lee Giles. A
      model-based K-means algorithm for name
      disambiguation. In Workshop on Semantic Web
      Technologies for Searching and Retrieving
      Scientific Data, October 2003.
    • Hui Han, C. Lee Giles, Hongyuan Zha, Cheng Li,
      and Kostas Tsioutsiouliklis. Two supervised
      learning approaches for name disambiguation in
      author citations. In ACM/IEEE Joint Conference on
      Digital Libraries (JCDL), pages 296-305, June
      2004.
    • Hui Han, Wei Xu, Hongyuan Zha, and C. Lee Giles.
      A hierarchical naive Bayes mixture model for name
      disambiguation in author citations. In ACM
      Symposium on Applied Computing (SAC), pages
      1065-1069, March 2005.
    • Hui Han, Hongyuan Zha, and C. Lee Giles. Name
      disambiguation in author citations using a K-way
      spectral clustering method. In ACM/IEEE Joint
      Conference on Digital Libraries (JCDL), pages
      334-343, June 2005.
    • Dongwon Lee, Byung-Won On, Jaewoo Kang, and
      Sanghyun Park. Effective and scalable solutions
      for mixed and split citation problems in digital
      libraries. In ACM SIGMOD Workshop on Information
      Quality in Information Systems (IQIS), pages
      69-76, June 2005.
    • Byung-Won On, Dongwon Lee, Jaewoo Kang, and
      Prasenjit Mitra. Comparative study of name
      disambiguation problem using a scalable
      blocking-based framework. In ACM/IEEE Joint
      Conference on Digital Libraries (JCDL), pages
      344-353, June 2005.
    • Andrew McCallum, Kamal Nigam, and Lyle Ungar.
      Efficient clustering of high-dimensional data
      sets with application to reference matching. In
      ACM SIGKDD International Conference on Knowledge
      Discovery and Data Mining, pages 169-178, August
      2000.
    • Matthew Michelson and Craig A. Knoblock. Learning
      blocking schemes for record linkage. In National
      Conference on Artificial Intelligence (AAAI),
      July 2006.
    • Mikhail Bilenko, Beena Kamath, and Raymond J.
      Mooney. Adaptive Blocking: Learning to Scale Up
      Record Linkage and Clustering. In IEEE
      International Conference on Data Mining (ICDM),
      December 2006.

32
Selected Bibliography
  • Graphical models
    • Jie Wei. Markov edit distance. IEEE Transactions
      on Pattern Analysis and Machine Intelligence,
      26(3):311-321, March 2004.
    • John Lafferty, Andrew McCallum, and Fernando
      Pereira. Conditional random fields: Probabilistic
      models for segmenting and labeling sequence data.
      In International Conference on Machine Learning
      (ICML), pages 282-289, June/July 2001.
    • Andrew McCallum and Ben Wellner. Object
      consolidation by graph partitioning with a
      conditionally-trained distance metric. In ACM
      SIGKDD Workshop on Data Cleaning, Record Linkage,
      and Object Consolidation, pages 19-24, August
      2003.
    • Ben Wellner, Andrew McCallum, Fuchun Peng, and
      Michael Hay. An integrated, conditional model of
      information extraction and coreference with
      application to citation matching. In Conference
      on Uncertainty in Artificial Intelligence (UAI),
      pages 593-601, July 2004.
    • Andrew McCallum, Kedar Bellare, and Fernando
      Pereira. A Conditional Random Field For
      Discriminatively-Trained Finite-State String Edit
      Distance. In Conference on Uncertainty in
      Artificial Intelligence (UAI), July 2005.
    • Xin Dong, Alon Halevy, and Jayant Madhavan.
      Reference reconciliation in complex information
      spaces. In ACM SIGMOD International Conference on
      Management of Data, pages 85-96, June 2005.
    • Indrajit Bhattacharya and Lise Getoor. A latent
      Dirichlet model for unsupervised entity
      resolution. In SIAM International Conference on
      Data Mining, pages 47-58, April 2006.

33
Selected Bibliography
  • Social network analysis
    • H. A. Kautz, B. Selman, and M. A. Shah. The
      hidden web. AI Magazine, 18(2):27-36, 1997.
    • P. Mutschke. Mining networks and central entities
      in digital libraries: A graph theoretic approach
      applied to co-author networks. In Intelligent
      Data Analysis (IDA), pages 155-166, August 2003.
    • M. E. J. Newman. Who is the best connected
      scientist? A study of scientific coauthorship
      networks. In Complex Networks, pages 337-370,
      February 2004.
    • E. Otte and R. Rousseau. Social network analysis:
      a powerful strategy, also for the information
      sciences. Journal of Information Science, 28(6),
      December 2002.
    • T. Krichel and N. Bakkalbasi. A social network
      analysis of research collaboration in the
      economics community. In International Workshop on
      Webometrics, Informetrics and Scientometrics &
      Seventh COLLNET Meeting, May 2006.
    • R. Rousseau and M. Thelwall. Escher staircases on
      the world wide web. First Monday, 9(6), June
      2004.
    • D. G. Feitelson. On identifying name equivalences
      in digital libraries. Information Research, 9(4),
      October 2004.
    • R. Bekkerman and A. McCallum. Disambiguating web
      appearances of people in a social network. In
      International conference on World Wide Web (WWW),
      pages 463-470, May 2005.
    • R. Holzer, B. Malin, and L. Sweeney. Email alias
      detection using social network analysis. In
      Workshop on Link Discovery: Issues, Approaches
      and Applications (LinkKDD), August 2005.
    • B. Malin, E. Airoldi, and K. M. Carley. A network
      analysis model for disambiguation of names in
      lists. Computational and Mathematical
      Organization Theory, 11(2):119-139, July 2005.
    • G. Flake, S. Lawrence, and C. L. Giles. Efficient
      identification of web communities. In ACM SIGKDD
      International Conference on Knowledge Discovery
      and Data Mining, pages 150-160, August 2000.
    • P. K. Reddy and M. Kitsuregawa. An approach to
      build a cyber-community hierarchy. In SIAM ICDM
      Workshop on Web Analysis, April 2002.
    • Patrick Reuther. Personal name matching: New test
      collections and a social network based approach.
      Technical Report Mathematics/Computer Science
      06-01, University of Trier, March 2006.
    • Yutaka Matsuo, Junichiro Mori, Masahiro Hamasaki,
      Keisuke Ishida, Takuichi Nishimura, Hideaki
      Takeda, Kôiti Hasida, and Mitsuru Ishizuka.
      POLYPHONET: an advanced social network extraction
      system from the web. In International conference
      on World Wide Web (WWW), pages 397-406, May 2006.

34
Selected Bibliography
  • Web-based methods
    • Jamie P. Callan, Margie E. Connell, and Aiqun Du.
      Automatic discovery of language models for text
      databases. In ACM SIGMOD International Conference
      on Management of Data, pages 479-490, June 1999.
    • Jamie P. Callan and Margie E. Connell.
      Query-based sampling of text databases. ACM
      Transactions on Information Systems (TOIS),
      19(2):97-130, April 2001.
    • Panagiotis G. Ipeirotis and Luis Gravano.
      Distributed search over the hidden-web:
      Hierarchical database sampling and selection. In
      International Conference on Very Large Databases
      (VLDB), pages 394-405, August 2002.
    • Luis Gravano, Panagiotis G. Ipeirotis, and Mehran
      Sahami. QProber: A system for automatic
      classification of hidden-web databases. ACM
      Transactions on Information Systems (TOIS),
      21(1):1-41, January 2003.
    • Aron Culotta, Ron Bekkerman, and Andrew McCallum.
      Extracting social networks and contact
      information from email and the web. In Conference
      on Email and Anti-Spam (CEAS), July 2004.
    • Philipp Cimiano, Siegfried Handschuh, and Steffen
      Staab. Towards the self-annotating web. In
      International conference on World Wide Web (WWW),
      pages 462-471, May 2004.
    • Philipp Cimiano, Günter Ladwig, and Steffen
      Staab. Gimme the context: Context-driven
      automatic semantic annotation with C-PANKOW. In
      International conference on World Wide Web (WWW),
      pages 332-341, May 2005.
    • Yutaka Matsuo, Junichiro Mori, Masahiro Hamasaki,
      Keisuke Ishida, Takuichi Nishimura, Hideaki
      Takeda, Kôiti Hasida, and Mitsuru Ishizuka.
      POLYPHONET: an advanced social network extraction
      system from the web. In International conference
      on World Wide Web (WWW), pages 397-406, May 2006.
    • Yee Fan Tan, Min-Yen Kan, and Dongwon Lee. Search
      engine driven author disambiguation. In ACM/IEEE
      Joint Conference on Digital Libraries (JCDL),
      June 2006.
    • Ergin Elmacioglu, Min-Yen Kan, Dongwon Lee, and
      Yi Zhang. Googled name linkage. 2007.
    • Yee Fan Tan, Ergin Elmacioglu, Min-Yen Kan, and
      Dongwon Lee. Record Linkage of Short Forms to
      Long Forms: A Case Study of Publication Venues.
      2007.
    • Min-Yen Kan. Web page classification without the
      web page. In International conference on World
      Wide Web (WWW), pages 262-263, May 2004.
    • Min-Yen Kan and Hoang Oanh Nguyen Thi. Fast
      webpage classification using URL features. In
      International Conference on Information and
      Knowledge Management (CIKM), pages 325-326,
      October/November 2005.
    • Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay
      Jain, and Luis Gravano. To search or to crawl?
      Towards a query optimizer for text-centric tasks.
      In ACM SIGMOD International Conference on
      Management of Data, pages 265-276, June 2006.