CLIP Colloquium Series

1
Statistical Relational Learning & Entity Resolution
  • Lise Getoor
  • University of Maryland, College Park

2
What is SRL?
  • Traditional statistical machine learning
    approaches assume:
    • A random sample of homogeneous objects from a
      single relation
  • Traditional relational learning approaches
    assume:
    • No noise or uncertainty in the data
  • Real-world data sets are:
    • Multi-relational and heterogeneous
    • Noisy and uncertain
  • Statistical Relational Learning (SRL):
    • A newly emerging research area at the intersection
      of statistical models and relational
      learning/inductive logic programming
  • Sample domains:
    • Web data, social networks, biological data,
      communication data, customer networks, sensor
      networks, natural language, vision, etc.

3
SRL Theory
  • Directed Approaches
    • Semantics based on Bayesian Networks
    • Frame-based Directed Models
    • Rule-based Directed Models
  • Undirected Approaches
    • Semantics based on Markov Networks
    • Frame-based Undirected Models
    • Rule-based Undirected Models

Issues: modeling logical vs. statistical dependencies,
feature construction, instances vs. classes,
effective inference, use of labeled & unlabeled
data, link prediction, open vs. closed world
Reference: Upcoming book on Statistical
Relational Learning, w/ Ben Taskar
4
SRL Application: Link Mining
  • Data
  • Structured Input: Mining graphs and networks
  • Structured Output: Extracting entities and
    relationships from unstructured data
  • Taxonomy
  • Node Centric
  • Labeling/ranking nodes (aka Collective
    Classification/PageRank)
  • Consolidating nodes (aka Entity Resolution)
  • Discovering hidden nodes (aka Group Discovery)
  • Edge Centric
  • Labeling/ranking edges
  • Predicting the existence, number of edges
  • Discovering new relations/paths
  • Graph/Subgraph Centric
  • Discovering frequent subpatterns
  • Metadata discovery, extraction, and reformulation

Reference: SIGKDD Explorations Special Issue on
Link Mining, Dec. 2005, w/ Chris Diehl, JHU APL
5
LINQS Group @ UMD
  • Members
    • myself, Indrajit Bhattacharya, Mustafa Bilgic,
      Rezarta Islamaj, Hyunmo Kang, Louis Licamele,
      Galileo Namata, Prithviraj Sen, Vivek Sehgal,
      Elena Zheleva
  • Projects
    • Link-based Classification
    • Entity Resolution (ER)
    • Algorithms
    • Query-time ER
    • User Interface
    • Predictive Models for Social Network Analysis
    • Abstraction in Affiliation Networks
    • Social Capital in Friendship Event Networks
    • Role Identification & Relationship Ranking
    • Temporal Analysis of Email Traffic Networks
    • Reputation-based Spam Filtering & Privacy
    • Feature Generation & Protein Interaction Prediction
      (biological data)
    • Ontology Alignment and Schema Mapping
    • Probabilistic Databases
    • Cost-sensitive Feature Acquisition

(Callout on the members list: graduated this fall, starting at IBM Delhi in April)
8
Entity Resolution
  • The Problem
  • Relational Entity Resolution
  • Algorithms
  • Graph-based Clustering (GBC)
  • Probabilistic Model (LDA-ER)
  • Query-time Entity Resolution
  • ER User Interface

9
InfoVis Co-Author Network Fragment
10
The Entity Resolution Problem
[Figure: author references in the network: James Smith (2), John Smith (2), Jim Smith, J Smith (2), Jon Smith, Jonathan Smith, Jonthan Smith]
  • Issues
  • Identification
  • Disambiguation

11
Attribute-based Entity Resolution
[Figure: pair-wise classification of James Smith against J Smith, Jim Smith, John Smith, Jon Smith and Jonthan Smith. Match scores shown: 0.8, 0.1, 0.7, 0.05; the J Smith / James Smith pair is marked "?"]
  • Choosing threshold: precision/recall tradeoff
  • Inability to disambiguate
  • Perform transitive closure?
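
The attribute-only approach on this slide can be sketched as follows. This is a minimal illustration, not the method from the talk: Python's `difflib` ratio stands in for the name similarity (the slide does not name a measure), and the thresholds are hypothetical. Every pair of names is scored, pairs at or above the threshold are kept, and a small union-find then takes the transitive closure.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; an assumption, not the talk's measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pairwise_matches(names, threshold):
    """Return index pairs whose similarity meets the threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(names)), 2)
        if name_similarity(names[i], names[j]) >= threshold
    ]


def transitive_closure(n, pairs):
    """Group indices 0..n-1 into clusters using a simple union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


names = ["James Smith", "Jim Smith", "John Smith",
         "Jon Smith", "Jonthan Smith", "J Smith"]
strict = transitive_closure(len(names), pairwise_matches(names, 0.99))
loose = transitive_closure(len(names), pairwise_matches(names, 0.5))
print(len(strict), "clusters at 0.99;", len(loose), "clusters at 0.5")
```

Lowering the threshold can only merge clusters, never split them, which is one way to see the precision/recall tradeoff. Transitive closure can also chain together names that are not directly similar, and no threshold separates two different people who share the same name string.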

12
Entity Resolution
  • The Problem
  • Relational Entity Resolution
  • Algorithms
  • Graph-based Clustering (GBC)
  • Probabilistic Model (LDA-ER)
  • Experimental Evaluation
  • Query-time Entity Resolution
  • ER User Interface

13
Relational Entity Resolution
  • References not observed independently
  • Links between references indicate relations
    between the entities
  • Co-author relations for bibliographic data
  • To, cc lists for email
  • Use relations to improve identification and
    disambiguation

14
Relational Identification
Very similar names. Added evidence from shared
co-authors
15
Relational Disambiguation
Very similar names but no shared collaborators
16
Relational Constraints
Co-authors are typically distinct
17
Collective Entity Resolution
One resolution provides evidence for another => joint resolution
18
Entity Resolution with Relations
  • Naïve Relational Entity Resolution
  • Also compare attributes of related references
  • Two references have co-authors w/ similar names
  • Collective Entity Resolution
  • Use discovered entities of related references
  • Entities cannot be identified independently
  • Harder problem to solve

19
Entity Resolution
  • The Problem
  • Relational Entity Resolution
  • Algorithms
  • Relational Clustering (RC-ER)
  • DMKD04, Wiley06, DE Bulletin06, TKDD07
  • Probabilistic Model (LDA-ER)
  • Experimental Evaluation
  • Query-time Entity Resolution
  • ER User Interface

20
P1: JOSTLE: Partitioning of Unstructured Meshes for Massively Parallel Machines, C. Walshaw, M. Cross, M. G. Everett, S. Johnson
P2: Partitioning & Mapping of Unstructured Meshes to Parallel Machine Topologies, C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus
P3: Dynamic Mesh Partitioning: A Unified Optimisation and Load-Balancing Algorithm, C. Walshaw, M. Cross, M. G. Everett
P4: Code Generation for Machines with Multiregister Operations, Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman
P5: Deterministic Parsing of Ambiguous Grammars, A. Aho, S. Johnson, J. Ullman
P6: Compilers: Principles, Techniques, and Tools, A. Aho, R. Sethi, J. Ullman
22
Relational Clustering (RC-ER)
[Figure: references grouped by paper]
P1: M. G. Everett, S. Johnson, C. Walshaw, M. Cross
P2: K. McManus, M. Everett, S. Johnson, C. Walshaw, M. Cross
P4: Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman
P5: S. Johnson, A. Aho, J. Ullman
26
Cut-based Formulation of RC-ER
[Figure: two candidate clusterings of the references M. G. Everett, M. Everett, A. Aho, Alfred V. Aho, Stephen C. Johnson, and S. Johnson (three occurrences)]
  • Clustering 1
    • Good separation of attributes
    • Many cluster-cluster relationships
    • Aho-Johnson1, Aho-Johnson2, Everett-Johnson1
  • Clustering 2
    • Worse in terms of attributes
    • Fewer cluster-cluster relationships
    • Aho-Johnson1, Everett-Johnson2

28
Objective Function
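  • Minimize

    $$\sum_{i} \sum_{j} \left[ w_A \, \mathrm{sim}_A(c_i, c_j) + w_R \, \mathrm{sim}_R(c_i, c_j) \right]$$

    where w_A is the weight for attributes, w_R is the
    weight for relations, sim_A(c_i, c_j) is the similarity of
    attributes, and sim_R(c_i, c_j) is 1 iff a relational edge
    exists between c_i and c_j
  • Greedy clustering algorithm: merge cluster pair
    with max reduction in objective function
  • The reduction for a pair combines its common cluster
    neighborhood and its similarity of attributes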
29
Measures for Attribute Similarity
  • Use best available measure for each attribute
    • Name strings: Soft TF-IDF, Levenshtein, Jaro
    • Textual attributes: TF-IDF
  • Aggregate to find similarity between clusters
    • Single link, Average link, Complete link
    • Cluster representative
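
As a rough illustration of the two steps above, the sketch below implements Levenshtein distance, rescales it to a similarity in [0, 1], and aggregates pairwise scores into a cluster-level score with single, average, or complete link. The normalization by the longer string's length and the function names are assumptions made for this example; Soft TF-IDF and Jaro are not shown.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def name_sim(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; this normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cluster_sim(c1, c2, link="average"):
    """Aggregate pairwise name similarities between two clusters of names."""
    scores = [name_sim(a, b) for a in c1 for b in c2]
    if link == "single":      # most similar pair
        return max(scores)
    if link == "complete":    # least similar pair
        return min(scores)
    if link == "average":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown link type: {link}")


left = ["S. Johnson", "Stephen C. Johnson"]
right = ["S. Johnson"]
for link in ("single", "average", "complete"):
    print(link, round(cluster_sim(left, right, link), 3))
```

Single link is the most permissive of the three and complete link the strictest, so the choice interacts with whatever merge threshold is used.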

30
Relational Similarity Example 1
[Figure: candidate pair {Stephen C. Johnson (P4), S. Johnson (P5)}, with neighborhood clusters {Alfred V. Aho (P4), A. Aho (P5)} and {Jefferey D. Ullman (P4), J. Ullman (P5)}]
All neighborhood clusters are shared: high
relational similarity
31
Relational Similarity Example 2
[Figure: the cluster {Stephen C. Johnson, S. Johnson} (P4, P5), whose neighbors are the Aho and Ullman clusters, compared with the cluster {S. Johnson, S. Johnson} (P1, P2), whose neighbors are K. McManus (P2) and the C. Walshaw, M. G. Everett / M. Everett and M. Cross clusters (P1, P2)]
No neighborhood cluster is shared: no relational
similarity
32
Comparing Cluster Neighborhoods
  • Different measures of set similarity
    • Common Neighbors: Intersection size
    • Jaccard's Coefficient: Normalize by union size
    • Adar Coefficient: Weighted set similarity
    • Higher order similarity: Consider neighbors of neighbors
  • Also consider neighborhood as multi-set
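
The first three measures can be sketched on plain Python sets. The cluster labels below are illustrative, and the weighted variant assumes the usual Adamic/Adar form, in which each shared neighbor counts 1/log(frequency); the slide only says "weighted set similarity", so that formula and the `degree` dictionary are assumptions.

```python
import math


def common_neighbors(n1: set, n2: set) -> int:
    """Size of the intersection of two neighborhoods."""
    return len(n1 & n2)


def jaccard(n1: set, n2: set) -> float:
    """Intersection size normalized by union size."""
    union = n1 | n2
    return len(n1 & n2) / len(union) if union else 0.0


def adamic_adar(n1: set, n2: set, degree: dict) -> float:
    """Shared neighbors weighted by rarity: rare neighbors count for more.

    degree[x] is assumed to be the number of neighborhoods containing x.
    """
    return sum(1.0 / math.log(max(degree[x], 2)) for x in n1 & n2)


# Example 1: both Johnson references co-occur with the Aho and Ullman clusters.
bell_a = {"Aho", "Ullman"}
bell_b = {"Aho", "Ullman"}
# Example 2: the other S. Johnson co-occurs with a disjoint set of clusters.
mesh = {"Walshaw", "Cross", "Everett", "McManus"}

print(jaccard(bell_a, bell_b))  # all neighborhood clusters shared
print(jaccard(bell_a, mesh))    # none shared
```

Treating the neighborhood as a multi-set would replace the sets with `collections.Counter` objects and use element-wise minimum and maximum in place of intersection and union.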

33
Relational Clustering Algorithm
  • Find similar references using blocking
  • Bootstrap clusters using attributes and relations
  • Compute similarities for cluster pairs and insert
    into priority queue
  • Repeat until priority queue is empty
    • Find closest cluster pair
    • Stop if similarity below threshold
    • Merge to create new cluster
    • Update similarity for related clusters
  • O(n k log n) algorithm w/ efficient
    implementation
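
A simplified sketch of the loop above, under several assumptions: blocking and bootstrapping are skipped (all pairs are scored, each reference starts as its own cluster), `difflib` stands in for the name measure, relational similarity is the Jaccard overlap of cluster neighborhoods, and the weights and threshold are hypothetical. Stale queue entries are skipped using per-cluster version counters, so this version does not reach the O(n k log n) bound quoted on the slide.

```python
import heapq
from difflib import SequenceMatcher
from itertools import combinations


def attr_sim(names_a, names_b):
    """Average-link name similarity; difflib is a stand-in measure."""
    scores = [SequenceMatcher(None, a, b).ratio() for a in names_a for b in names_b]
    return sum(scores) / len(scores)


def relational_cluster(refs, w_attr=0.5, w_rel=0.5, threshold=0.45):
    """refs: list of (name, paper_id). Returns sorted lists of reference indices."""
    clusters = {i: {i} for i in range(len(refs))}  # cluster id -> reference indices
    owner = {i: i for i in range(len(refs))}       # reference index -> cluster id
    version = {i: 0 for i in clusters}             # bumped when scores go stale
    papers = {}
    for i, (_, paper) in enumerate(refs):
        papers.setdefault(paper, []).append(i)

    def neighborhood(cid):
        """Ids of clusters that co-occur with this cluster in some paper."""
        return {owner[o] for r in clusters[cid] for o in papers[refs[r][1]]} - {cid}

    def share_paper(a, b):
        """Co-authors are assumed distinct, so such pairs are never merged."""
        return bool({refs[r][1] for r in clusters[a]} & {refs[r][1] for r in clusters[b]})

    def similarity(a, b):
        na, nb = neighborhood(a), neighborhood(b)
        rel = len(na & nb) / len(na | nb) if na | nb else 0.0  # Jaccard
        attr = attr_sim([refs[r][0] for r in clusters[a]],
                        [refs[r][0] for r in clusters[b]])
        return w_attr * attr + w_rel * rel

    heap = []

    def push(a, b):
        if not share_paper(a, b):
            heapq.heappush(heap, (-similarity(a, b), a, b, version[a], version[b]))

    for a, b in combinations(sorted(clusters), 2):
        push(a, b)

    while heap:
        neg_sim, a, b, va, vb = heapq.heappop(heap)
        if a not in clusters or b not in clusters:
            continue  # one side has been merged away
        if version[a] != va or version[b] != vb:
            continue  # stale score; a fresh entry was pushed later
        if -neg_sim < threshold:
            break     # best remaining pair is not similar enough
        clusters[a] |= clusters.pop(b)
        for r in clusters[a]:
            owner[r] = a
        # The merge changes the neighborhoods of the new cluster and its neighbors.
        touched = {a} | neighborhood(a)
        for cid in touched:
            version[cid] += 1
        for cid in touched:
            for other in clusters:
                if other != cid and not (other in touched and other < cid):
                    push(min(cid, other), max(cid, other))

    return sorted(sorted(c) for c in clusters.values())


# Toy data with hypothetical paper ids X1 and X2.
refs = [("A. Aho", "X1"), ("S. Johnson", "X1"),
        ("A. Aho", "X2"), ("S. Johnson", "X2")]
print(relational_cluster(refs))
```

In the toy data, merging the two "A. Aho" references makes the two "S. Johnson" references share a neighbor cluster, which raises their combined score on the next round. That is the collective effect the earlier slides describe.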

34
Entity Resolution
  • The Problem
  • Relational Entity Resolution
  • Algorithms
  • Relational Clustering (RC-ER)
  • Probabilistic Model (LDA-ER)
  • SIAM SDM06, Best Paper Award
  • Experimental Evaluation
  • Query-time Entity Resolution
  • ER User Interface

35
Probabilistic Generative Model for Collective
Entity Resolution
  • Model how references co-occur in data
  • Generation of references from entities
  • Relationships between underlying entities
  • Groups of entities instead of pair-wise relations

36
Discovering Groups from Relations
[Figure: two groups of entities discovered from the co-author relations]
  • Bell Labs Group: Stephen C Johnson, Alfred V Aho,
    Ravi Sethi, Jeffrey D Ullman
  • Parallel Processing Research Group: Stephen P Johnson,
    Chris Walshaw, Kevin McManus, Mark Cross, Martin Everett
P1: C. Walshaw, M. Cross, M. G. Everett, S. Johnson
P2: C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus
P3: C. Walshaw, M. Cross, M. G. Everett
P4: Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman
P5: A. Aho, S. Johnson, J. Ullman
P6: A. Aho, R. Sethi, J. Ullman
37
LDA-ER Model
  • Entity label a and group label z for each
    reference r
  • θ: mixture of groups for each co-occurrence
  • Φ_z: multinomial for choosing entity a for each
    group z
  • V_a: multinomial for choosing reference r from
    entity a
  • Dirichlet priors with α and β
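
The generative story in these bullets can be sketched as a forward sampler. This is only an illustration: the probabilities in `phi` and `variants` are made up, the entity distributions are fixed rather than drawn from a Dirichlet prior, and nothing here prevents the same entity from being drawn twice in one co-occurrence. Inference runs in the opposite direction and is not shown.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw a k-dimensional symmetric Dirichlet sample via normalized gammas."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_from(dist, rng):
    """Sample a key from a {value: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate_cooccurrence(num_refs, phi, variants, alpha, rng):
    """Generate one co-occurrence (e.g. one paper's author list)."""
    theta = sample_dirichlet(alpha, len(phi), rng)   # mixture of groups
    refs = []
    for _ in range(num_refs):
        z = rng.choices(range(len(phi)), weights=theta)[0]  # group label
        a = sample_from(phi[z], rng)                        # entity label
        r = sample_from(variants[a], rng)                   # observed reference
        refs.append((z, a, r))
    return refs


# Hypothetical parameters: group -> entity probabilities, entity -> name variants.
phi = [
    {"Alfred V Aho": 0.4, "Stephen C Johnson": 0.3, "Jeffrey D Ullman": 0.3},
    {"Chris Walshaw": 0.4, "Mark Cross": 0.3, "Stephen P Johnson": 0.3},
]
variants = {
    "Alfred V Aho": {"A. Aho": 0.5, "Alfred V. Aho": 0.5},
    "Stephen C Johnson": {"S. Johnson": 0.5, "Stephen C. Johnson": 0.5},
    "Jeffrey D Ullman": {"J. Ullman": 0.5, "Jefferey D. Ullman": 0.5},
    "Chris Walshaw": {"C. Walshaw": 1.0},
    "Mark Cross": {"M. Cross": 1.0},
    "Stephen P Johnson": {"S. Johnson": 1.0},
}

rng = random.Random(0)
for z, a, r in generate_cooccurrence(3, phi, variants, alpha=1.0, rng=rng):
    print(f"group {z}: entity {a!r} observed as {r!r}")
```

Both Johnson entities can emit the reference "S. Johnson", so the group label is what lets the model tell them apart.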

38
Generating References from Entities
  • Entities are not directly observed
  • Hidden attribute for each entity
  • Similarity measure for pairs of attributes
  • A distribution over attributes for each entity

39
Approx. Inference Using Gibbs Sampling
  • Conditional distribution over labels for each
    ref.
  • Sample next labels from conditional distribution
  • Repeat over all references until convergence
  • Converges to most likely number of entities

40
Faster Inference: Split-Merge Sampling
  • Naïve strategy reassigns references individually
  • Alternative: allow entities to merge or split
  • For entity a_i, find conditional distribution for
    • Merging with existing entity a_j
    • Splitting back to last merged entities
    • Remaining unchanged
  • Sample next state for a_i from distribution
  • O(n g e) time per iteration compared to O(n g
    n e)

41
Entity Resolution
  • The Problem
  • Relational Entity Resolution
  • Algorithms
  • Relational Clustering (RC-ER)
  • Probabilistic Model (LDA-ER)
  • Experimental Evaluation
  • Query-time Entity Resolution
  • ER User Interface

42
Evaluation Datasets
  • CiteSeer
    • 1,504 citations to machine learning papers
      (Lawrence et al.)
    • 2,892 references to 1,165 author entities
  • arXiv
    • 29,555 publications from High Energy Physics (KDD
      Cup03)
    • 58,515 refs to 9,200 authors
  • Elsevier BioBase
    • 156,156 Biology papers (IBM KDD Challenge 05)
    • 831,991 author refs
    • Keywords, topic classifications, language,
      country and affiliation of corresponding author,
      etc.

43
Baselines
  • A: Pair-wise duplicate decisions w/ attributes
    only
    • Names: Soft-TFIDF with Levenshtein, Jaro,
      Jaro-Winkler
    • Other textual attributes: TF-IDF
  • A*: Transitive closure over A
  • A+N: Add attribute similarity of co-occurring
    refs
  • A+N*: Transitive closure over A+N
  • Evaluate pair-wise decisions over references
  • F1-measure (harmonic mean of precision and recall)
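
The pair-wise evaluation can be sketched as follows, assuming predictions and ground truth are both given as a mapping from reference id to entity label (the ids and labels below are hypothetical). Every pair of references counts as a true positive, false positive, or false negative according to whether the two references are placed in the same entity.

```python
from itertools import combinations


def pairwise_f1(predicted: dict, truth: dict):
    """Return (precision, recall, F1) over all pairs of references."""
    tp = fp = fn = 0
    for r1, r2 in combinations(sorted(truth), 2):
        same_pred = predicted[r1] == predicted[r2]
        same_true = truth[r1] == truth[r2]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1   # merged, but actually different entities
        elif same_true:
            fn += 1   # kept apart, but actually the same entity
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


truth = {"r1": "A", "r2": "A", "r3": "B", "r4": "B"}
predicted = {"r1": "x", "r2": "x", "r3": "x", "r4": "y"}
print(pairwise_f1(predicted, truth))
```

Here one pair is correctly merged, two pairs are wrongly merged, and one true pair is missed, giving precision 1/3, recall 1/2, and F1 0.4.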

44
ER over Entire Dataset
  • RC-ER & LDA-ER outperform baselines in all
    datasets
  • Collective resolution better than naïve
    relational resolution
  • BioBase: Biggest improvement over baselines
  • arXiv: 6,500 additional correct resolutions, 20%
    err. red.
  • CiteSeer: Near perfect resolution, 22% error
    reduction

45
ER over Entire Dataset
  • RC-ER and baselines require threshold as
    parameter
  • Best achievable performance over all thresholds
  • Best RC-ER performance better than LDA-ER
  • LDA-ER does not require similarity threshold

46
Performance for Specific Names
arXiv: Significantly larger improvements for
ambiguous names
47
Trends in Synthetic Data
  • Bigger improvement with
    • bigger % of ambiguous refs
    • more refs per co-occurrence
    • more neighbors per entity

48
Entity Resolution
  • The Problem
  • Relational Entity Resolution
  • Algorithms
  • Relational Clustering (RC-ER)
  • Probabilistic Model (LDA-ER)
  • Experimental Evaluation
  • Query-time Entity Resolution
  • KDD06
  • ER User Interface

49
Query-time ER Motivation
  • Most publicly available databases do not have
    resolved entities
  • PubMed, CiteSeer have unresolved authors
  • Query processing requires resolved entities
  • Retrieve papers by S. Johnson of Bell Labs

50
Entity Resolution Queries
P1: Jostle ..., C. Walshaw, M. Cross, M. G. Everett, S. Johnson
P2: ... Parallel Machine Topologies, C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus
P5: Deterministic Parsing ..., A. Aho, S. Johnson, J. Ullman
  • Disambiguation Query
    • Among papers with S Johnson as author, find
      those by the Bell Labs researcher
  • Resolution Query
    • Do disambiguation
    • Also retrieve papers by the Bell Labs researcher
      with a different author name, e.g. Stephen C
      Johnson
    • P5: Deterministic Parsing ..., A. Aho, S.
      Johnson, J. Ullman
    • P4: Code Generation ..., Alfred V. Aho, Stephen
      C. Johnson, Jefferey D. Ullman

51
Query-time ER using Relations
  • Possible directions
    • Leave resolution burden on user
    • Ask owner to clean database
    • Develop techniques for query-time resolution
  • Attribute-based query resolution
    • Quick but not accurate
  • Collective resolution for queries
    • Extract relevant records by recursive expansion
    • Collective resolution on extracted records

52
Extracting Relevant Records
[Figure: expansion from the query "S. Johnson"]
  • Level 0 (attribute expansion of the query): P4 Stephen C. Johnson,
    P5 S. Johnson, P2 S. Johnson, P1 S. Johnson
  • Level 1 (relation expansion): P4 Alfred V. Aho, P5 A. Aho,
    P4 Jefferey D. Ullman, P5 J. Ullman, P2 K. McManus,
    P2 C. Walshaw, P1 C. Walshaw
  • Level 2 (attribute expansion): further records for A. Aho,
    Alfred V. Aho, J. Ullman, Jefferey D. Ullman, K. McManus
    and C. Walshaw
Start with query name or record
  • Alternate between
    • Attribute expansion: For any relevant record,
      include other records with that name
    • Relation expansion: For any relevant record,
      include other related records
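
The alternating expansion can be sketched as below. Records are (paper, name) pairs, "related" means appearing on the same paper, and the `compatible` name test (same last name and first initial) is a crude stand-in chosen for this example; the talk does not specify the name matching used during expansion.

```python
def compatible(n1: str, n2: str) -> bool:
    """Assumed name test: same last name and same first initial."""
    p1, p2 = n1.split(), n2.split()
    return p1[-1] == p2[-1] and p1[0][0] == p2[0][0]


def expand(records, query_name, depth):
    """Collect relevant (paper, name) records by alternating expansion.

    Level 0 is attribute expansion of the query; odd levels are relation
    expansion and even levels are attribute expansion of the newest records.
    """
    relevant = {r for r in records if compatible(r[1], query_name)}
    frontier = set(relevant)
    for level in range(1, depth + 1):
        if level % 2 == 1:   # relation expansion: same paper
            found = {r for r in records for f in frontier if r[0] == f[0]}
        else:                # attribute expansion: compatible name
            found = {r for r in records for f in frontier if compatible(r[1], f[1])}
        frontier = found - relevant
        relevant |= frontier
    return relevant


records = [
    ("P1", "C. Walshaw"), ("P1", "S. Johnson"),
    ("P2", "K. McManus"), ("P2", "S. Johnson"), ("P2", "C. Walshaw"),
    ("P4", "Alfred V. Aho"), ("P4", "Stephen C. Johnson"),
    ("P5", "A. Aho"), ("P5", "S. Johnson"),
    ("P6", "A. Aho"), ("P6", "R. Sethi"),
]
for depth in range(4):
    print(depth, len(expand(records, "S. Johnson", depth)))
```

Even on this small subset of the example papers, each extra level pulls in more records, which is why unconstrained expansion becomes expensive on real databases.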

53
Adaptive Expansion
  • Too many records with unconstrained expansion
  • Adaptively select records based on ambiguity
    • Smith is more ambiguous than McManus
  • Use adaptive expansion
    • Expand the more ambiguous records
      • They need extra evidence
    • When expanding, add fewer ambiguous records
      • They lead to imprecision
  • Large reduction in number of relevant records

54
Ambiguity Estimation
  • Probability of multiple entities sharing
    attribute value
    • No. of entities with last name Smith
  • No labeled data available
  • Estimate last name ambiguity using other
    attributes
    • No. of different first initials for last-name
      Smith
  • Estimate improves with more independent
    attributes

[Example: first initials seen with last name Smith: A. Smith, B. Smith, D. Smith, G. Smith, K. Smith, M. Smith, P. Smith, R. Smith, S. Smith, T. Smith, ...; with last name McManus: K. McManus]
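
A minimal sketch of the first-initial heuristic, using the names listed on the slide. Dividing by the largest count to get a score in (0, 1] is an assumption made for illustration; the slide only says that the number of distinct first initials is used as a proxy when no labeled data is available.

```python
from collections import defaultdict


def initial_counts(names):
    """Count distinct first initials observed for each last name."""
    initials = defaultdict(set)
    for name in names:
        parts = name.split()
        initials[parts[-1]].add(parts[0][0].upper())
    return {last: len(seen) for last, seen in initials.items()}


def ambiguity_scores(names):
    """Rescale counts so the most ambiguous last name scores 1.0 (assumed)."""
    counts = initial_counts(names)
    top = max(counts.values())
    return {last: count / top for last, count in counts.items()}


names = ["A. Smith", "B. Smith", "D. Smith", "G. Smith", "K. Smith",
         "M. Smith", "P. Smith", "R. Smith", "S. Smith", "T. Smith",
         "K. McManus"]
print(initial_counts(names))
print(ambiguity_scores(names))
```

With these names, Smith has ten distinct initials and McManus one, so an adaptive expansion would spend more effort gathering evidence for Smith records.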
55
QT-ER Evaluation Datasets
  • arXiv High Energy Physics
    • 29,555 publications, 58,515 refs to 9,200 authors
    • Queries: All ambiguous names (75 total)
    • True authors per name: 2 to 11 (avg. 2.4)
  • Elsevier BioBase
    • 156,156 publications, 831,991 author refs
    • Queries: 100 most frequent names
    • True authors per name: 1 to 100 (avg. 32)

56
Growth Rate of Relevant Records and Query
Processing Time
Number of relevant references grows rapidly with
expansion depth
RC-ER is fast but not good enough for query-time
resolution
57
QT-ER Results
  • Unconstrained expansion
  • Collective resolution more accurate
  • Accuracy improves beyond depth 1
  • Adaptive expansion
  • Minimal loss in accuracy
  • Dramatic reduction in query processing time

AX-2: adaptive expansion at depths 2 and beyond
AX-1: adaptive expansion even at depth 1
58
Entity Resolution
  • The Problem
  • Relational Entity Resolution
  • Algorithms
  • Relational Clustering (RC-ER)
  • Probabilistic Model (LDA-ER)
  • Experimental Evaluation
  • Query-time Entity Resolution
  • ER User Interface
  • VAST06

59
D-Dupe: An Interactive Tool for Entity Resolution
http://www.cs.umd.edu/projects/linqs/ddupe
60
Current ER Projects
  • Entity Resolution in Geospatial Data
    • Using spatial information, location name
      information and location type information
    • ACMGIS06
  • Name Reference Resolution in Email
    • Goal: Figure out who is being talked about
    • Make use of traffic patterns to infer social
      network
    • SDM06
    • Currently investigating adaptive context construction
    • Elsayed, Namata, Oard, under review
  • Ontology Alignment (work w/ Octavian Udrea, Renee
    Miller)
    • Combines relational clustering with logical
      inference (e.g. equivalence and subsumption)
    • Results in a 40% improvement in recall on 30 OWL
      Lite ontology pairs
    • under review

61
ER for GIS Data - Identification
Dataset A
Dataset B
Match
62
ER for GIS Data - Disambiguation
Dataset A
Dataset B
Not Match!
63
ER for GIS Data - Identification
Not Match
64
An Example
location reference l_i ∈ Dataset A
location reference l_j ∈ Dataset B
l_i.name: Qaryat an Nuaymiyah
l_j.name: Qaryat an Naimiyah
l_i.coordinates: (lat_i, long_i)
l_j.coordinates: (lat_j, long_j)
l_i.type: Populated place
l_j.type: City
Result: Match!
65
Conclusion
  • Projects
  • Link-based Classification and Prediction
  • Predictive Models for Social Network Analysis
  • Temporal Analysis of Email Traffic Network
  • Reputation-based Spam filtering
  • Ontology Alignment and Schema Mapping
  • Feature Generation for Sequences (biological
    data)
  • Protein Interaction Prediction (biological data)
  • Probabilistic Databases
  • SRL/Link Mining is an emerging research area at
    the intersection of statistical machine learning,
    logical reasoning and visualization.
  • In reality, want to be able to flexibly combine
    node, edge and graph-based inferences
  • While there are important pitfalls to take into
    account (confidence and privacy), there are many
    potential benefits and payoffs

66
Thanks!
http://www.cs.umd.edu/getoor
Work sponsored by the National Science
Foundation, KDD program and National Geospatial
Agency