Title: Exploiting Relationships for Object Consolidation
1. Exploiting Relationships for Object Consolidation
Work supported by NSF Grants IIS-0331707 and IIS-0083489
- Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra
- Computer Science Department, University of California, Irvine
- http://www.ics.uci.edu/dvk/RelDC
- http://www.itr-rescue.org (RESCUE)
ACM IQIS 2005
2. Talk Overview
- Motivation
- Object consolidation problem
- Proposed approach
  - RelDC: Relationship-based Data Cleaning
  - Relationship analysis and graph partitioning
- Experiments
3. Why do we need Data Cleaning?
"Hi, my name is Jane Smith. I'd like to apply for a faculty position at your university."
"Wow! Unbelievable! Are you sure you will join us even if we do not offer you tenure right away?"
"OK, let me check something quickly..."
???
Jane Smith - Fresh Ph.D.
Tom - Recruiter
4. What is the problem?
- Names often do not uniquely identify people
- CiteSeer: the top-k most cited authors
[Figure: top-k author tables from CiteSeer and DBLP]
5. Comparing raw and cleaned CiteSeer
[Figure: side-by-side top-k author lists, cleaned CiteSeer vs. raw CiteSeer]
6. Object Consolidation Problem
[Figure: representations r1, r2, ..., rN in the database mapped to the real objects o1, o2, ..., oM they refer to]
- Cluster representations that correspond to the same real-world object/entity
- Two instances of the problem: the number of real-world objects is known/unknown
7. RelDC Approach
- Exploit relationships among objects to disambiguate when the traditional approach, clustering based on feature similarity, does not work
RelDC Framework: Relationship-based Data Cleaning
[Figure: sample ARG with entity nodes A, B, C, D, E, F, X, Y, feature edges f1-f4, and '?' marks on ambiguous references. Traditional methods use features; relationship analysis uses features and context.]
8. Attributed Relational Graph (ARG)
- View the database as an ARG
- Nodes
  - one per cluster of representations (if already resolved by a feature-based approach)
  - one per representation (for tough cases)
- Edges
  - Regular: correspond to relationships between entities
  - Similarity: created using feature-based methods on representations
9. Context Attraction Principle (CAP)
- Who is "J. Smith"?
  - Jane?
  - John?
10. Questions to Answer
- Does the CAP principle hold over real datasets?
  - That is, if we consolidate objects based on it, will the quality of consolidation improve?
- Can we design a generic strategy that exploits CAP for consolidation?
11. Consolidation Algorithm
1. Construct the ARG and identify all virtual clusters of representations (VCSs)
   - use FBS in constructing the ARG
2. Choose a VCS and compute the connection strength between nodes
   - for each pair of representations connected via a similarity edge
3. Partition the VCS
   - use a graph partitioning algorithm
   - partitioning is based on connection strength
   - after partitioning, adjust the ARG accordingly
4. Go to Step 2 if more potential clusters exist
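The loop above can be sketched as follows. Note that `connection_strength` and the merge rule here are simplified stand-ins (common-neighbour counting and single-link merging), not the paper's actual c(u,v) model or graph partitioner:

```python
from itertools import combinations

def connection_strength(graph, u, v):
    """Stand-in for c(u, v): count of shared neighbours in the ARG."""
    return len(set(graph[u]) & set(graph[v]))

def consolidate(vcs, graph, threshold=1):
    """Greedily merge representations whose connection strength reaches
    the threshold (single-link merging as a stand-in for partitioning)."""
    clusters = [{r} for r in vcs]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(clusters)), 2):
            strong = any(connection_strength(graph, u, v) >= threshold
                         for u in clusters[a] for v in clusters[b])
            if strong:
                clusters[a] |= clusters.pop(b)
                merged = True
                break
    return clusters

# Toy ARG: r1 and r2 share a co-author node 'A'; r3 connects only to 'B'
graph = {'r1': ['A'], 'r2': ['A'], 'r3': ['B'],
         'A': ['r1', 'r2'], 'B': ['r3']}
clusters = consolidate(['r1', 'r2', 'r3'], graph)
```

With this toy input, r1 and r2 end up in one cluster and r3 stays alone.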
12. Connection Strength c(u,v)
- Models for c(u,v)
  - many possibilities: diffusion kernels, random walks, etc.
  - none is fully adequate: they cannot learn similarity from data
- Diffusion kernels
  - λ(x,y) = λ1(x,y): base similarity, via direct links (of length 1)
  - λk(x,y): indirect similarity, via links of length k
  - B, where Bxy = B1xy = λ1(x,y): base similarity matrix
  - Bk: indirect similarity matrix
  - K: total similarity matrix, or kernel
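A minimal sketch of the matrices above: B holds base similarities, B^k captures similarity via length-k paths, and K accumulates both. The decay factor `beta` and the truncation at L are assumptions, not from the slides:

```python
import numpy as np

def total_similarity(B, L=3, beta=0.5):
    """K = sum of beta^k * B^k for k = 1..L: base similarity plus
    indirect similarity via paths of length up to L."""
    K = np.zeros_like(B)
    Bk = np.eye(B.shape[0])
    for k in range(1, L + 1):
        Bk = Bk @ B                 # B^k: similarity via length-k links
        K += (beta ** k) * Bk
    return K

# Path graph 0 - 1 - 2: nodes 0 and 2 have no direct link
B = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
K = total_similarity(B)
```

Nodes 0 and 2 have zero base similarity but acquire a nonzero total similarity through the length-2 path, which is exactly what the indirect terms contribute.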
13. Connection Strength c(u,v) (cont.)
- Instantiating parameters
  - Determining λ(x,y)
    - regular edges have types T1,...,Tn
    - types T1,...,Tn have weights w1,...,wn
    - λ(x,y) = wi: take the type of the given edge and assign its weight as the base similarity
  - Handling similarity edges
    - λ(x,y) is assigned a value proportional to the similarity (heuristic)
    - an approach to learn λ(x,y) from data is ongoing work
- Implementation
  - we do not compute the whole matrix K
  - we compute one c(u,v) at a time
  - path lengths are limited by L
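A per-pair sketch of this implementation strategy: enumerate the L-short simple paths between u and v by depth-first search and sum a per-path weight. Taking the product of edge weights along a path is an assumed instantiation, not the paper's exact measure:

```python
def c(graph, u, v, L=4):
    """Connection strength: sum of path weights over all simple paths
    from u to v with at most L edges."""
    total = 0.0
    def dfs(node, weight, visited, edges_used):
        nonlocal total
        if edges_used == L:          # path-length limit reached
            return
        for nbr, w in graph.get(node, []):
            if nbr == v:
                total += weight * w  # a complete u -> v path
            elif nbr not in visited:
                dfs(nbr, weight * w, visited | {nbr}, edges_used + 1)
    dfs(u, 1.0, {u}, 0)
    return total

# Toy ARG: two length-2 paths u-a-v and u-b-v, each contributing 0.5
graph = {
    'u': [('a', 0.5), ('b', 0.5)],
    'a': [('v', 1.0)],
    'b': [('v', 1.0)],
    'v': [],
}
```

Restricting to simple paths (via `visited`) and cutting off at L edges is what keeps the one-pair computation tractable compared to materialising the full kernel.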
14. Consolidation via Partitioning
- Observations
  - each VCS contains representations of at least 1 object
  - if a representation is in a VCS, then the rest of the representations of the same object are in it too
- Partitioning: two cases
  - k, the number of entities in the VCS, is known
  - k is unknown
- When k is known, use any partitioning algorithm
  - maximize inside-connectivity, minimize outside-connectivity
  - we use the normalized cut of [Shi & Malik 2000]
- When k is unknown
  - split into two just to see the cut
  - compare the cut against a threshold
  - decide to split or not to split
  - iterate
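A sketch of the unknown-k step under stated assumptions: bisect by the sign of the Fiedler vector of the normalized Laplacian (one common way to approximate the normalized cut of [Shi & Malik 2000]) and accept the split only if the cut value is below the threshold:

```python
import numpy as np

def ncut_split(W, threshold=0.5):
    """Try one normalized-cut bisection of a weighted adjacency matrix W;
    return (side mask, ncut value), with side = None if the split is rejected."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt  # normalized Laplacian
    _, vecs = np.linalg.eigh(L)                       # ascending eigenvalues
    side = vecs[:, 1] >= 0            # sign of the Fiedler vector
    cut = W[side][:, ~side].sum()
    ncut = cut / W[side].sum() + cut / W[~side].sum()
    return (side if ncut < threshold else None), ncut

# Two tight pairs joined by one weak edge: should be split apart
W = np.array([[0.,   1.,  0.05, 0.],
              [1.,   0.,  0.,   0.],
              [0.05, 0.,  0.,   1.],
              [0.,   0.,  1.,   0.]])
side, ncut = ncut_split(W)
```

Iterating this on each accepted side yields the "split or not to split" loop described above.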
15. Measuring Quality of Outcome
- Dispersion
  - for an entity: into how many clusters its representations are scattered; ideal is 1
- Diversity
  - for a cluster: how many distinct entities it covers; ideal is 1
- Entity uncertainty
  - for an entity: if, out of its m representations, m1 go to cluster C1, ..., mn go to Cn, then its entropy is H = -Σi (mi/m) log(mi/m)
- Cluster uncertainty
  - for a cluster: if it consists of m1 representations of entity E1, ..., mn of En, then the same entropy formula applies
- ideal entropy is zero
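The measures above translate directly into code; the log base (2 here) is an assumption:

```python
from math import log2

def entropy(counts):
    """Uncertainty of a grouping: H = -sum (mi/m) log2(mi/m); 0 is ideal."""
    m = sum(counts)
    return -sum((mi / m) * log2(mi / m) for mi in counts)

def dispersion(entity_reprs, cluster_of):
    """Into how many clusters an entity's representations fall; 1 is ideal."""
    return len({cluster_of[r] for r in entity_reprs})

def diversity(cluster, entity_of):
    """How many distinct entities a cluster covers; 1 is ideal."""
    return len({entity_of[r] for r in cluster})

# Toy outcome: entity E1's representations land in two clusters
cluster_of = {'r1': 'C1', 'r2': 'C1', 'r3': 'C2'}
entity_of = {'r1': 'E1', 'r2': 'E1', 'r3': 'E2'}
```

The same `entropy` helper serves both entity uncertainty (cluster counts per entity) and cluster uncertainty (entity counts per cluster).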
16. Experimental Setup
- Uncertainty
  - d1, d2, ..., dn are director entities
  - pick a fraction d1, d2, ..., dm
  - group entities in groups of size k, e.g., in groups of two: (d1, d2), ..., (d9, d10)
  - make all representations within a group indiscernible by FBS
- Baseline 1
  - one cluster per VCS, regardless
  - equivalent to using only FBS
  - ideal dispersion, but cluster entropy is H(E)!
- Baseline 2
  - knows the grouping statistics
  - guesses the number of entities in a VCS
  - randomly assigns representations to clusters
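Baseline 2 as described can be sketched in a few lines; the fixed seed is only for reproducibility and is an assumption:

```python
import random

def baseline2(reprs, num_entities, seed=0):
    """Baseline 2 sketch: given the (known) number of entities in a VCS,
    assign each representation to a random cluster; no relationship
    analysis is used."""
    rng = random.Random(seed)
    return {r: rng.randrange(num_entities) for r in reprs}

assignment = baseline2(['r1', 'r2', 'r3', 'r4'], num_entities=2)
```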
- RealMov dataset
  - movies (12K)
  - people (22K): actors, directors, producers
  - studios (1K): producing, distributing
- Parameters
  - L-short simple paths, with L = 7
  - L is the path-length limit
- Note
  - the algorithm is applied to the tough cases, after FBS has already successfully consolidated many entries!
17. Sample Movies Data
18. The Effect of L on Quality
[Plots: cluster entropy and diversity; entity entropy and dispersion]
19. Effect of Threshold and Scalability
20. Summary
- RelDC
  - a domain-independent data cleaning framework
  - uses relationships for data cleaning
  - reference disambiguation [SDM'05]
  - object consolidation [IQIS'05]
- Ongoing work
  - learning the importance of relationships from data
  - exploiting relationships among entities for other data cleaning problems
21. Contact Information
- RelDC project
- www.ics.uci.edu/dvk/RelDC
- www.itr-rescue.org (RESCUE)
- Zhaoqi Chen
- chenz_at_ics.uci.edu
- Dmitri V. Kalashnikov
- www.ics.uci.edu/dvk
- dvk_at_ics.uci.edu
- Sharad Mehrotra
- www.ics.uci.edu/sharad
- sharad_at_ics.uci.edu
22. Extra slides
23. What is the lesson?
- Data should be cleaned first
  - e.g., determine the (unique) real authors of publications
- Solving such challenges is not always easy
  - which explains the large body of work on data cleaning
- Note
  - CiteSeer is aware of the problem with its ranking
  - there are more issues with CiteSeer, many not related to data cleaning
- "Garbage in, garbage out" principle: making decisions based on bad data can lead to wrong results.
24. Object Consolidation
- Notation
  - O = {o1, ..., o|O|}: the set of entities (unknown in general)
  - X = {x1, ..., x|X|}: the set of representations
  - d[xi]: the entity xi refers to (unknown in general)
  - C[xi]: all representations that refer to d[xi]
    - the group set; unknown in general; the goal is to find it for each xi
  - S[xi]: all representations that can be xi
    - the consolidation set; determined by FBS
  - we assume C[xi] ⊆ S[xi]
25. Object Consolidation Problem
- Let O = {o1, ..., o|O|} be the set of entities (unknown in general)
- Let X = {x1, ..., x|X|} be the set of representations
- Map each xi to its corresponding entity oj in O
  - d[xi]: the entity xi refers to (unknown in general)
  - C[xi]: all representations that refer to d[xi]
    - the group set; unknown in general; the goal is to find it for each xi
  - S[xi]: all representations that can be xi
    - the consolidation set; determined by FBS
  - we assume C[xi] ⊆ S[xi]
26. RelDC Framework
27. Connection Strength
- Computation of c(u,v)
- Phase 1: discover connections
  - all L-short simple paths between u and v
  - this is the bottleneck; optimizations exist (not in [IQIS'05])
- Phase 2: measure the strength in the discovered connections
  - many c(u,v) models exist
  - we use a model similar to diffusion kernels
28. Our c(u,v) Model
- Our model vs. diffusion kernels
  - virtually identical, but...
  - we do not compute the whole matrix K
  - we compute one c(u,v) at a time
  - we limit path lengths by L
  - λ(x,y) is unknown in general
    - the analyst assigns the weights
    - learning them from data is ongoing work
- Our c(u,v) model
  - regular edges have types T1,...,Tn
  - types T1,...,Tn have weights w1,...,wn
  - λ(x,y) = wi: take the type of the given edge and assign its weight as the base similarity
  - paths with similarity edges might not exist; we use heuristics
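A minimal sketch of this instantiation of λ(x,y); the edge-type names and their weights are illustrative assumptions, not values from the paper:

```python
# Illustrative edge-type weights wi (assumed, not from the slides)
EDGE_TYPE_WEIGHTS = {'co_author': 0.7, 'affiliation': 0.3}

def base_similarity(edge_type):
    """lambda(x, y) = wi: the weight of the edge's type."""
    return EDGE_TYPE_WEIGHTS[edge_type]

def path_strength(edge_types):
    """Strength of one path: product of its edges' base similarities
    (an assumed way to combine per-edge weights along a path)."""
    s = 1.0
    for t in edge_types:
        s *= base_similarity(t)
    return s
```

Summing `path_strength` over the discovered L-short simple paths between u and v then yields one c(u,v) value at a time, as the implementation notes describe.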