Title: Exploiting Relationships for Object Consolidation
1. Exploiting Relationships for Object Consolidation
Work supported by NSF Grants IIS-0331707 and IIS-0083489
- Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra
- Computer Science Department, University of California, Irvine
- http://www.ics.uci.edu/dvk/RelDC
- http://www.itr-rescue.org (RESCUE)
ACM IQIS 2005
2. Talk Overview
- Motivation
- Object consolidation problem
- Proposed approach
  - RelDC: Relationship-based Data Cleaning
  - Relationship analysis and graph partitioning
- Experiments
3. Why do we need Data Cleaning?
"Hi, my name is Jane Smith. I'd like to apply for a faculty position at your university."
"Wow! Unbelievable! Are you sure you will join us even if we do not offer you tenure right away?"
"OK, let me check something quickly..."
???
Jane Smith - Fresh Ph.D.
Tom - Recruiter
4. What is the problem?
- Names often do not uniquely identify people
- CiteSeer: the top-k most cited authors
[Figure: top-k author tables from CiteSeer and DBLP]
5. Comparing raw and cleaned CiteSeer
[Figure: side-by-side top-k author lists, cleaned CiteSeer vs. raw CiteSeer]
6. Object Consolidation Problem
[Figure: representations r1, r2, ..., rN in the database mapped to the real objects o1, o2, ..., oM they refer to]
- Cluster representations that correspond to the same real-world object/entity
- Two instances of the problem: the number of real-world objects is known/unknown
7. RelDC Approach
- Exploit relationships among objects to disambiguate when the traditional approach, clustering based on feature similarity, does not work
RelDC Framework: Relationship-based Data Cleaning
[Figure: sample ARG with entity nodes A, B, C, D, E, F, X, Y, feature edges f1-f4, and '?' marks on ambiguous references. Traditional methods use features; relationship analysis uses features and context.]
8. Attributed Relational Graph (ARG)
- View the database as an ARG
- Nodes
  - one per cluster of representations (if already resolved by a feature-based approach)
  - one per representation (for tough cases)
- Edges
  - Regular: correspond to relationships between entities
  - Similarity: created using feature-based methods on representations
9. Context Attraction Principle (CAP)
- Who is "J. Smith"?
  - Jane?
  - John?
10. Questions to Answer
- Does the CAP principle hold over real datasets?
  - That is, if we consolidate objects based on it, will the quality of consolidation improve?
- Can we design a generic strategy that exploits CAP for consolidation?
11. Consolidation Algorithm
1. Construct the ARG and identify all virtual clusters of representations (VCSs)
   - use FBS in constructing the ARG
2. Choose a VCS and compute the connection strength between nodes
   - for each pair of representations connected via a similarity edge
3. Partition the VCS
   - use a graph partitioning algorithm
   - partitioning is based on connection strength
   - after partitioning, adjust the ARG accordingly
4. Go to Step 2 if more potential clusters exist
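The loop above can be sketched as follows. Note that `connection_strength` and the merge rule here are simplified stand-ins (common-neighbour counting and single-link merging), not the paper's actual c(u,v) model or graph partitioner:

```python
from itertools import combinations

def connection_strength(graph, u, v):
    """Stand-in for c(u, v): count of shared neighbours in the ARG."""
    return len(set(graph[u]) & set(graph[v]))

def consolidate(vcs, graph, threshold=1):
    """Greedily merge representations whose connection strength reaches
    the threshold (single-link merging as a stand-in for partitioning)."""
    clusters = [{r} for r in vcs]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(clusters)), 2):
            strong = any(connection_strength(graph, u, v) >= threshold
                         for u in clusters[a] for v in clusters[b])
            if strong:
                clusters[a] |= clusters.pop(b)
                merged = True
                break
    return clusters

# Toy ARG: r1 and r2 share a co-author node 'A'; r3 connects only to 'B'
graph = {'r1': ['A'], 'r2': ['A'], 'r3': ['B'],
         'A': ['r1', 'r2'], 'B': ['r3']}
clusters = consolidate(['r1', 'r2', 'r3'], graph)
```

With this toy input, r1 and r2 end up in one cluster and r3 stays alone.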
12. Connection Strength c(u,v)
- Models for c(u,v)
  - many possibilities: diffusion kernels, random walks, etc.
  - none is fully adequate: they cannot learn similarity from data
- Diffusion kernels
  - λ(x,y) = λ1(x,y): base similarity, via direct links (of length 1)
  - λk(x,y): indirect similarity, via links of length k
  - B, where Bxy = B1xy = λ1(x,y): base similarity matrix
  - Bk: indirect similarity matrix
  - K: total similarity matrix, or kernel
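A minimal sketch of the matrices above: B holds base similarities, B^k captures similarity via length-k paths, and K accumulates both. The decay factor `beta` and the truncation at L are assumptions, not from the slides:

```python
import numpy as np

def total_similarity(B, L=3, beta=0.5):
    """K = sum of beta^k * B^k for k = 1..L: base similarity plus
    indirect similarity via paths of length up to L."""
    K = np.zeros_like(B)
    Bk = np.eye(B.shape[0])
    for k in range(1, L + 1):
        Bk = Bk @ B                 # B^k: similarity via length-k links
        K += (beta ** k) * Bk
    return K

# Path graph 0 - 1 - 2: nodes 0 and 2 have no direct link
B = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
K = total_similarity(B)
```

Nodes 0 and 2 have zero base similarity but acquire a nonzero total similarity through the length-2 path, which is exactly what the indirect terms contribute.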
13. Connection Strength c(u,v) (cont.)
- Instantiating parameters
  - Determining λ(x,y)
    - regular edges have types T1,...,Tn
    - types T1,...,Tn have weights w1,...,wn
    - λ(x,y) = wi: take the type of the given edge and assign its weight as the base similarity
  - Handling similarity edges
    - λ(x,y) is assigned a value proportional to the similarity (heuristic)
    - an approach to learn λ(x,y) from data is ongoing work
- Implementation
  - we do not compute the whole matrix K
  - we compute one c(u,v) at a time
  - path lengths are limited by L
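A per-pair sketch of this implementation strategy: enumerate the L-short simple paths between u and v by depth-first search and sum a per-path weight. Taking the product of edge weights along a path is an assumed instantiation, not the paper's exact measure:

```python
def c(graph, u, v, L=4):
    """Connection strength: sum of path weights over all simple paths
    from u to v with at most L edges."""
    total = 0.0
    def dfs(node, weight, visited, edges_used):
        nonlocal total
        if edges_used == L:          # path-length limit reached
            return
        for nbr, w in graph.get(node, []):
            if nbr == v:
                total += weight * w  # a complete u -> v path
            elif nbr not in visited:
                dfs(nbr, weight * w, visited | {nbr}, edges_used + 1)
    dfs(u, 1.0, {u}, 0)
    return total

# Toy ARG: two length-2 paths u-a-v and u-b-v, each contributing 0.5
graph = {
    'u': [('a', 0.5), ('b', 0.5)],
    'a': [('v', 1.0)],
    'b': [('v', 1.0)],
    'v': [],
}
```

Restricting to simple paths (via `visited`) and cutting off at L edges is what keeps the one-pair computation tractable compared to materialising the full kernel.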
14. Consolidation via Partitioning
- Observations
  - each VCS contains representations of at least 1 object
  - if a representation is in a VCS, then the rest of the representations of the same object are in it too
- Partitioning: two cases
  - k, the number of entities in the VCS, is known
  - k is unknown
- When k is known, use any partitioning algorithm
  - maximize inside-connectivity, minimize outside-connectivity
  - we use the normalized cut of [Shi & Malik 2000]
- When k is unknown
  - split into two just to see the cut
  - compare the cut against a threshold
  - decide to split or not to split
  - iterate
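A sketch of the unknown-k step under stated assumptions: bisect by the sign of the Fiedler vector of the normalized Laplacian (one common way to approximate the normalized cut of [Shi & Malik 2000]) and accept the split only if the cut value is below the threshold:

```python
import numpy as np

def ncut_split(W, threshold=0.5):
    """Try one normalized-cut bisection of a weighted adjacency matrix W;
    return (side mask, ncut value), with side = None if the split is rejected."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt  # normalized Laplacian
    _, vecs = np.linalg.eigh(L)                       # ascending eigenvalues
    side = vecs[:, 1] >= 0            # sign of the Fiedler vector
    cut = W[side][:, ~side].sum()
    ncut = cut / W[side].sum() + cut / W[~side].sum()
    return (side if ncut < threshold else None), ncut

# Two tight pairs joined by one weak edge: should be split apart
W = np.array([[0.,   1.,  0.05, 0.],
              [1.,   0.,  0.,   0.],
              [0.05, 0.,  0.,   1.],
              [0.,   0.,  1.,   0.]])
side, ncut = ncut_split(W)
```

Iterating this on each accepted side yields the "split or not to split" loop described above.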
15. Measuring Quality of Outcome
- Dispersion
  - for an entity: into how many clusters its representations are scattered; ideal is 1
- Diversity
  - for a cluster: how many distinct entities it covers; ideal is 1
- Entity uncertainty
  - for an entity: if, out of its m representations, m1 go to cluster C1, ..., mn go to Cn, then its entropy is H = -Σi (mi/m) log(mi/m)
- Cluster uncertainty
  - for a cluster: if it consists of m1 representations of entity E1, ..., mn of En, then the same entropy formula applies
- ideal entropy is zero
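The measures above translate directly into code; the log base (2 here) is an assumption:

```python
from math import log2

def entropy(counts):
    """Uncertainty of a grouping: H = -sum (mi/m) log2(mi/m); 0 is ideal."""
    m = sum(counts)
    return -sum((mi / m) * log2(mi / m) for mi in counts)

def dispersion(entity_reprs, cluster_of):
    """Into how many clusters an entity's representations fall; 1 is ideal."""
    return len({cluster_of[r] for r in entity_reprs})

def diversity(cluster, entity_of):
    """How many distinct entities a cluster covers; 1 is ideal."""
    return len({entity_of[r] for r in cluster})

# Toy outcome: entity E1's representations land in two clusters
cluster_of = {'r1': 'C1', 'r2': 'C1', 'r3': 'C2'}
entity_of = {'r1': 'E1', 'r2': 'E1', 'r3': 'E2'}
```

The same `entropy` helper serves both entity uncertainty (cluster counts per entity) and cluster uncertainty (entity counts per cluster).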
16. Experimental Setup
- Uncertainty
  - d1, d2, ..., dn are director entities
  - pick a fraction d1, d2, ..., dm
  - group entities in groups of size k, e.g., in groups of two: (d1, d2), ..., (d9, d10)
  - make all representations within a group indiscernible by FBS
- Baseline 1
  - one cluster per VCS, regardless
  - equivalent to using only FBS
  - ideal dispersion, but cluster entropy is H(E)!
- Baseline 2
  - knows the grouping statistics
  - guesses the number of entities in a VCS
  - randomly assigns representations to clusters
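Baseline 2 as described can be sketched in a few lines; the fixed seed is only for reproducibility and is an assumption:

```python
import random

def baseline2(reprs, num_entities, seed=0):
    """Baseline 2 sketch: given the (known) number of entities in a VCS,
    assign each representation to a random cluster; no relationship
    analysis is used."""
    rng = random.Random(seed)
    return {r: rng.randrange(num_entities) for r in reprs}

assignment = baseline2(['r1', 'r2', 'r3', 'r4'], num_entities=2)
```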
- RealMov dataset
  - movies (12K)
  - people (22K): actors, directors, producers
  - studios (1K): producing, distributing
- Parameters
  - L-short simple paths, with L = 7
  - L is the path-length limit
- Note
  - the algorithm is applied to the tough cases, after FBS has already successfully consolidated many entries!
17. Sample Movies Data
18. The Effect of L on Quality
[Plots: cluster entropy and diversity; entity entropy and dispersion]
19. Effect of Threshold and Scalability
20. Summary
- RelDC
  - a domain-independent data cleaning framework
  - uses relationships for data cleaning
  - reference disambiguation [SDM'05]
  - object consolidation [IQIS'05]
- Ongoing work
  - learning the importance of relationships from data
  - exploiting relationships among entities for other data cleaning problems
21. Contact Information
- RelDC project
- www.ics.uci.edu/dvk/RelDC
- www.itr-rescue.org (RESCUE)
- Zhaoqi Chen
- chenz_at_ics.uci.edu
- Dmitri V. Kalashnikov
- www.ics.uci.edu/dvk
- dvk_at_ics.uci.edu
- Sharad Mehrotra
- www.ics.uci.edu/sharad
- sharad_at_ics.uci.edu
22. Extra slides
23. What is the lesson?
- Data should be cleaned first
  - e.g., determine the (unique) real authors of publications
- Solving such challenges is not always easy
  - which explains the large body of work on data cleaning
- Note
  - CiteSeer is aware of the problem with its ranking
  - there are more issues with CiteSeer, many not related to data cleaning
- "Garbage in, garbage out" principle: making decisions based on bad data can lead to wrong results.
24. Object Consolidation
- Notation
  - O = {o1, ..., o|O|}: the set of entities (unknown in general)
  - X = {x1, ..., x|X|}: the set of representations
  - d[xi]: the entity xi refers to (unknown in general)
  - C[xi]: all representations that refer to d[xi]
    - the group set; unknown in general; the goal is to find it for each xi
  - S[xi]: all representations that can be xi
    - the consolidation set; determined by FBS
  - we assume C[xi] ⊆ S[xi]
25. Object Consolidation Problem
- Let O = {o1, ..., o|O|} be the set of entities (unknown in general)
- Let X = {x1, ..., x|X|} be the set of representations
- Map each xi to its corresponding entity oj in O
  - d[xi]: the entity xi refers to (unknown in general)
  - C[xi]: all representations that refer to d[xi]
    - the group set; unknown in general; the goal is to find it for each xi
  - S[xi]: all representations that can be xi
    - the consolidation set; determined by FBS
  - we assume C[xi] ⊆ S[xi]
26. RelDC Framework
27. Connection Strength
- Computation of c(u,v)
- Phase 1: discover connections
  - all L-short simple paths between u and v
  - this is the bottleneck; optimizations exist (not in [IQIS'05])
- Phase 2: measure the strength in the discovered connections
  - many c(u,v) models exist
  - we use a model similar to diffusion kernels
28. Our c(u,v) Model
- Our model vs. diffusion kernels
  - virtually identical, but...
  - we do not compute the whole matrix K
  - we compute one c(u,v) at a time
  - we limit path lengths by L
  - λ(x,y) is unknown in general
    - the analyst assigns the weights
    - learning them from data is ongoing work
- Our c(u,v) model
  - regular edges have types T1,...,Tn
  - types T1,...,Tn have weights w1,...,wn
  - λ(x,y) = wi: take the type of the given edge and assign its weight as the base similarity
  - paths with similarity edges might not exist; we use heuristics
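A minimal sketch of this instantiation of λ(x,y); the edge-type names and their weights are illustrative assumptions, not values from the paper:

```python
# Illustrative edge-type weights wi (assumed, not from the slides)
EDGE_TYPE_WEIGHTS = {'co_author': 0.7, 'affiliation': 0.3}

def base_similarity(edge_type):
    """lambda(x, y) = wi: the weight of the edge's type."""
    return EDGE_TYPE_WEIGHTS[edge_type]

def path_strength(edge_types):
    """Strength of one path: product of its edges' base similarities
    (an assumed way to combine per-edge weights along a path)."""
    s = 1.0
    for t in edge_types:
        s *= base_similarity(t)
    return s
```

Summing `path_strength` over the discovered L-short simple paths between u and v then yields one c(u,v) value at a time, as the implementation notes describe.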