Warren Shen, Xin Li, AnHai Doan - PowerPoint PPT Presentation


1
Constraint-Based Entity Matching
  • Warren Shen, Xin Li, AnHai Doan
  • Database & AI Groups
  • University of Illinois, Urbana

2
Entity Matching
  • Decide if mentions refer to the same real-world
    entity
  • Key problem in numerous applications
  • Information integration
  • Natural language understanding
  • Semantic Web

Chris Li, Jane Smith. Numerical Analysis. SIAM 2001
Chen Li, Doug Chan. Ensemble Learning
C. Li, D. Chan. Ensemble Learning. ICML 2003
3
State of the Art
  • Numerous solutions in the AI, Database, and Web
    communities
  • Cohen, Ravikumar, Fienberg 2003
  • Li, Morie, Roth 2004
  • Bhattacharya & Getoor 2004
  • McCallum, Nigam, Ungar 2000
  • Pasula et al. 2003
  • Wellner et al. 2004
  • Most solutions largely exploit only syntactic
    similarity
  • Jeff Smith vs. J. Smith
  • (217) 235-1234 vs. 235-1234

4
Semantic Constraints
  • Incompatible
  • Subsumption
  • Layout

C. Li. User Interfaces. SIGCHI 2000
C. Li, J. Smith. Numerical Analysis. SIAM 2001
Numerical Analysis, SIAM 2001 with J. Smith.
Chris Li's Homepage
Chris Li, Jane Smith. Numerical Analysis. SIAM 2001
DBLP
Chen Li, Doug Chan. Ensemble Learning. ICML 2003
C. Li. Data Mining. KDD 2000
Chen Li's Homepage
5
Numerous Semantic Constraint Types
Type          Example
Aggregate     No researcher has chaired more than 3 conferences in a year
Subsumption   If a citation X from DBLP matches a citation Y in a homepage, then each author in Y matches some author in X
Neighborhood  If authors X and Y share similar names and some coauthors, they are likely to match
Incompatible  No researcher exists who has published in both HCI and numerical analysis
Layout        If two mentions in the same document share similar names, they are likely to match
Uniqueness    Mentions in the PC listing of a conference refer to different researchers
Ordering      If two citations match, then their authors will be matched in order
Individual    The researcher named Mayssam Saria has fewer than five mentions in DBLP (e.g., being a new graduate student with fewer than five papers)
6
Our Contributions
  • Develop a solution to exploit semantic
    constraints
  • Models constraints in a uniform probabilistic
    manner
  • Clusters mentions using a generative model
  • Uses relaxation labeling to handle constraints
  • Adds a pairwise layer to further improve accuracy
  • Experimental results on two real-world domains
  • Researchers, IMDB
  • Improved accuracy over state of the art by 3-12%
    F-1

7
Probabilistic Modeling of Constraints
  • Modeled as the effect on the probability that a
    mention refers to a real-world entity
  • If two mentions in the same document share
    similar names, they are likely to match
  • Constraint probabilities have a natural
    interpretation
  • Can be learned or manually specified by a domain
    expert

P(m2 = e1 | m1 = e1) = 0.8
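
One way to read the layout constraint above is as a lookup that returns a conditional probability when the constraint applies to a pair of mentions. The following is a minimal sketch under that reading, not the authors' implementation. The Mention and Constraint classes, the name_key helper, and the toy name-similarity rule (same last name and same first initial) are all assumptions made for illustration; only the 0.8 value comes from the slide.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Mention:
    name: str    # surface form, e.g. "C. Li"
    doc_id: str  # document the mention appears in


@dataclass(frozen=True)
class Constraint:
    name: str
    probability: float  # read as P(m2 = e | m1 = e) when the constraint applies


def name_key(name: str) -> Tuple[str, str]:
    """Toy similarity key (an assumption): last name plus first initial."""
    parts = name.replace(".", "").split()
    return parts[-1].lower(), parts[0][0].lower()


def layout_probability(m1: Mention, m2: Mention, c: Constraint) -> Optional[float]:
    """Return P(m2 = e | m1 = e) if the layout constraint applies, else None."""
    same_doc = m1.doc_id == m2.doc_id
    if same_doc and name_key(m1.name) == name_key(m2.name):
        return c.probability
    return None


layout = Constraint("layout", 0.8)
p_same = layout_probability(
    Mention("Chris Li", "homepage"), Mention("C. Li", "homepage"), layout)
p_other = layout_probability(
    Mention("Chris Li", "homepage"), Mention("C. Li", "dblp"), layout)
```

Here p_same is 0.8 because both mentions sit in the same document and share a similar name, while p_other is None because the mentions come from different documents and the layout constraint says nothing about them.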
8
The Entity Matching Problem
  • Solution
  • Model document generation
  • Cluster mentions using this model

9
Modeling Document Generation
  • Generate mentions for each document
  • Select entities
  • Generate and sprinkle mentions
  • Check constraints for each mention
  • Decide whether to enforce constraint c
  • If enforced, check if mention violates c
  • If yes, discard documents and repeat process
  • (Extension of model in Li, Morie, Roth 2004)
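
The generate-then-check loop on this slide resembles rejection sampling. The sketch below illustrates that loop only; it is not the model from the talk. The entity format, the (probability, violates) constraint representation, the one-mention-per-entity simplification, and the max_tries cutoff are all assumptions.

```python
import random


def passes(mention, mentions, constraints, rng):
    """Check one mention against every constraint that gets enforced."""
    for probability, violates in constraints:
        enforced = rng.random() < probability  # decide whether to enforce c
        if enforced and violates(mention, mentions):
            return False
    return True


def generate_document(entities, constraints, rng, max_tries=1000):
    """Sample mentions for one document, discarding drafts that violate
    an enforced constraint and repeating the process.

    entities:    dict of entity id -> list of surface names (hypothetical format)
    constraints: list of (probability, violates) pairs, where
                 violates(mention, mentions) returns True on a violation
    """
    entity_ids = sorted(entities)
    for _ in range(max_tries):
        # Select entities, then generate one mention per selected entity.
        chosen = rng.sample(entity_ids, k=rng.randint(1, len(entity_ids)))
        mentions = [(rng.choice(entities[e]), e) for e in chosen]
        if all(passes(m, mentions, constraints, rng) for m in mentions):
            return mentions
    raise RuntimeError("no valid document found within max_tries")


entities = {
    "e1": ["Chen Li", "C. Li"],
    "e2": ["Chris Li", "C. Li"],
    "e3": ["Jane Smith", "Smith, J"],
}
# Hard "incompatible" constraint (probability 1.0): e1 and e2 never co-occur.
incompatible = (1.0, lambda m, ms: m[1] == "e1" and any(o[1] == "e2" for o in ms))

rng = random.Random(0)
docs = [generate_document(entities, [incompatible], rng) for _ in range(50)]
```

Because the constraint is enforced with probability 1.0, no sampled document contains mentions of both e1 and e2. Lowering the probability would let some violating documents through, which is how a soft constraint behaves in this sketch.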

10
Clustering with the Generative Model
  • Find mention assignments F and model parameters θ
    to maximize P(D, F | θ)
  • Difficult to compute exactly, so use a variant of
    EM

11
Incorporating Constraints
  • Extend the step that assigns mentions
  • Basic mention assignment
  • Extension: use constraints to improve mention
    assignments

12
Enforcing Constraints on Clusters
  • Apply constraints at each iteration
  • Use relaxation labeling to apply constraints to
    mention assignments

13
Relaxation Labeling
  • Start with an initial labeling of mentions with
    entities
  • Iteratively improve mention labels, given
    constraints
  • Can be extended to probabilistic constraints
  • Scalable

Chen Li = e1, C. Li = e2, Y. Lee = e3
Chris Lee = e2, Jane Smith = e4
C. Lee = e2, Smith, J = e4
Constraints: c1 = layout constraint, p(c1) = 0.8
14
Relaxation Labeling

Chen Li = e1, C. Li = e2 → e1, Y. Lee = e3
Chris Lee = e2, Jane Smith = e4
C. Lee = e2, Smith, J = e4
Constraints: c1 = layout constraint, p(c1) = 0.8
15
Handling Probabilistic Constraints
  • Relaxation labeling can combine multiple
    probabilistic constraints
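
As a rough illustration of how relaxation labeling can relabel "C. Li" from e2 to e1, the sketch below uses the classic nonlinear update p_i(e) ← p_i(e)(1 + q_i(e)) / Z, where the support q_i(e) sums constraint strength times the neighbor's probability for e. This is a textbook-style variant chosen for brevity, not necessarily the update rule used in the talk. The initial label probabilities and the function name relaxation_labeling are assumptions; only the mentions and p(c1) = 0.8 come from the slides.

```python
def relaxation_labeling(probs, links, iterations=10):
    """Iteratively sharpen per-mention label distributions.

    probs: list of dicts, probs[i][e] = current P(mention i refers to e)
    links: list of (i, j, strength) triples; each says that a constraint
           with the given probability ties mentions i and j together.
           Several constraints on the same pair simply add their support.
    """
    probs = [dict(p) for p in probs]
    for _ in range(iterations):
        updated = []
        for i, dist in enumerate(probs):
            # Support for each candidate entity from linked mentions.
            support = {e: 0.0 for e in dist}
            for a, b, strength in links:
                if a != b and i in (a, b):
                    j = b if i == a else a
                    for e in dist:
                        support[e] += strength * probs[j].get(e, 0.0)
            raw = {e: dist[e] * (1.0 + support[e]) for e in dist}
            total = sum(raw.values())
            updated.append({e: v / total for e, v in raw.items()})
        probs = updated  # synchronous update: all mentions move together
    return [max(d, key=d.get) for d in probs]


# Mentions in one document: 0 = "Chen Li", 1 = "C. Li", 2 = "Y. Lee".
# The starting probabilities are made up; "C. Li" begins slightly
# favoring e2, matching its initial label on the slide.
initial = [
    {"e1": 0.9, "e2": 0.1},
    {"e1": 0.4, "e2": 0.6},
    {"e3": 1.0},
]
layout_links = [(0, 1, 0.8)]  # c1: similar names in the same document
labels = relaxation_labeling(initial, layout_links)
```

With these assumed starting values, the confident "Chen Li" label pulls "C. Li" over to e1 after the first iteration and the assignment stays there, while "Y. Lee" has no linked neighbor and keeps e3.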

16
Pairwise Layer
  • So far, we have applied constraints to clusters
  • It may be unclear how to enforce constraints on
    clusters
  • Add a pairwise layer
  • Convert clusters into predicted matching pairs
  • Remove only pairs that negative pairwise hard
    constraints apply to

Constraint: C. Li ≠ Li, C. Remove C. Li or Li, C.?
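
The pairwise layer can be pictured as two small steps: expand each cluster into its matching pairs, then drop only the pairs that a hard negative constraint forbids. The sketch below assumes mentions are identified by unique strings and that negative constraints arrive as a list of pairs; both are illustrative choices, as are the function names.

```python
from itertools import combinations


def clusters_to_pairs(clusters):
    """Convert clusters of mention ids into the set of predicted matching pairs."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def apply_negative_constraints(pairs, cannot_match):
    """Remove only the pairs that a hard negative pairwise constraint applies to."""
    forbidden = {tuple(sorted(p)) for p in cannot_match}
    return {p for p in pairs if p not in forbidden}


clusters = [{"C. Li", "Li, C.", "Chen Li"}]
predicted = clusters_to_pairs(clusters)
kept = apply_negative_constraints(predicted, [("Li, C.", "C. Li")])
```

Working on pairs sidesteps the question on the slide: neither "C. Li" nor "Li, C." has to be evicted from the cluster. Only the forbidden pair is removed, and the two pairs involving "Chen Li" survive.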
17
Empirical Evaluation
  • Two real-world domains
  • Researchers, IMDB
  • For each domain
  • Collected documents
  • Researchers: homepages from DBLP and the web
  • IMDB: text and structured records from IMDB
  • Marked up mentions and their attributes
  • 4,991 researcher mentions
  • 3,889 movie titles from IMDB
  • Manually identified all correct matching pairs
  • Evaluation Metric
  • Precision = true positives / predicted pairs
  • Recall = true positives / correct pairs
  • F1 = (2 · P · R) / (P + R)
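
The pairwise metrics above can be computed directly from two sets of pairs. This is a small sketch with made-up mention ids; the function name pairwise_prf and the choice to return 0.0 for empty inputs are assumptions.

```python
def pairwise_prf(predicted_pairs, correct_pairs):
    """Precision, recall, and F1 over unordered matching pairs."""
    predicted = {frozenset(p) for p in predicted_pairs}
    correct = {frozenset(p) for p in correct_pairs}
    true_positives = len(predicted & correct)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical mention ids: 3 predicted pairs, 4 correct pairs, 2 in common.
p, r, f1 = pairwise_prf(
    [(1, 2), (2, 3), (4, 5)],
    [(2, 1), (4, 5), (6, 7), (8, 9)],
)
```

In this toy case precision is 2/3, recall is 1/2, and F1 is 4/7. Using frozenset makes (1, 2) and (2, 1) count as the same pair.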

18
Using Constraints Improves Accuracy
  • Relaxation labeler improves F-1 by 3-12%
  • Relaxation labeling is very fast

19
Using Constraints Individually
  • Each constraint makes a contribution

20
Related Work
  • Much work in entity matching
  • Cohen, Ravikumar, Fienberg 2003
  • Li, Morie, Roth 2004
  • Bhattacharya & Getoor 2004
  • McCallum, Nigam, Ungar 2000
  • Pasula et al. 2003
  • Wellner et al. 2004
  • Recent work has looked at exploiting semantic
    constraints
  • Personal Information Management (Dong et al.
    2004)
  • Profiler-based entity matching (Doan et al.
    2003)
  • Semantic constraints successfully exploited in
    other applications
  • Clustering algorithms (Bilenko et al. 2004),
    ontology matching (Doan et al. 2002)

21
Summary and Future Work
  • Exploit semantic constraints in entity matching
  • Models constraints in a uniform probabilistic
    manner
  • Uses a generative model and relaxation labeling
    to handle constraints in a scalable way
  • Experimental results on two real-world domains
    show effectiveness
  • Future work: learning constraints effectively
    from current or external data