1
Exploiting Relationships for Object Consolidation
Work supported by NSF Grants IIS-0331707 and IIS-0083489
  • Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra
  • Computer Science Department
  • University of California, Irvine
  • http://www.ics.uci.edu/dvk/RelDC
  • http://www.itr-rescue.org (RESCUE)

ACM IQIS 2005
2
Talk Overview
  • Motivation
  • Object consolidation problem
  • Proposed approach
  • RelDC: Relationship-based Data Cleaning
  • Relationship analysis and graph partitioning
  • Experiments

3
Why do we need Data Cleaning?
Hi, my name is Jane Smith. I'd like to apply for a faculty position at your university.
Wow! Unbelievable! Are you sure you will join us even if we do not offer you tenure right away?
OK, let me check something quickly...
???
  • Publications

Jane Smith - Fresh Ph.D.
Tom - Recruiter
4
What is the problem?
  • Names often do not uniquely identify people

CiteSeer: the top-k most cited authors
[Screenshots: CiteSeer top-k list and the corresponding DBLP entries]
5
Comparing raw and cleaned CiteSeer
[Screenshots: cleaned CiteSeer top-k vs. raw CiteSeer top-k]
6
Object Consolidation Problem
[Figure: representations r1, r2, ..., rN in the database mapped to the real objects o1, o2, ..., oM they refer to]
  • Cluster representations that correspond to the same real-world object/entity
  • Two instances of the problem: the real-world objects are known/unknown

7
RelDC Approach
  • Exploit relationships among objects to disambiguate when the traditional approach of clustering based on feature similarity does not work

RelDC Framework: Relationship-based Data Cleaning
[Figure: an ARG combining traditional methods (features and context) with relationship analysis]
8
Attributed Relational Graph (ARG)
  • View the database as an ARG
  • Nodes
  • one per cluster of representations (if already resolved by a feature-based approach)
  • one per representation (for tough cases)
  • Edges
  • regular edges correspond to relationships between entities
  • similarity edges are created using feature-based methods on representations
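The node and edge categories above can be sketched as a small adjacency structure; this is a minimal illustrative rendering (the class layout and field names are not from the slides):

```python
# Minimal sketch of an Attributed Relational Graph (ARG).
# The class layout and field names are illustrative, not from the paper.
from collections import defaultdict

class ARG:
    def __init__(self):
        # adjacency list: node -> list of (neighbor, edge attributes)
        self.adj = defaultdict(list)

    def _add(self, u, v, edge):
        self.adj[u].append((v, edge))
        self.adj[v].append((u, edge))

    def add_regular_edge(self, u, v, rel_type, weight):
        # regular edge: a known relationship between entities
        self._add(u, v, {"kind": "regular", "type": rel_type, "w": weight})

    def add_similarity_edge(self, u, v, sim):
        # similarity edge: feature-based similarity between representations
        self._add(u, v, {"kind": "similarity", "w": sim})

g = ARG()
g.add_regular_edge("paper1", "J. Smith", "author-of", 0.8)
g.add_similarity_edge("J. Smith", "Jane Smith", 0.7)
```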

9
Context Attraction Principle (CAP)
  • Who is "J. Smith"?
  • Jane?
  • John?

10
Questions to Answer
  • Does the CAP principle hold over real datasets?
  • That is, if we consolidate objects based on it, will the quality of consolidation improve?
  • Can we design a generic strategy that exploits the CAP for consolidation?

11
Consolidation Algorithm
  • Construct ARG and identify all virtual clusters
    (VCSs)
  • use FBS in constructing the ARG
  • Choose a VCS and compute connection strength
    between nodes
  • for each pair of repr. connected via a similarity
    edge
  • Partition the VCS
  • use a graph partitioning algorithm
  • partitioning is based on connection strength
  • after partitioning, adjust ARG accordingly
  • go to Step 2 if more potential clusters exist
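The steps above can be rendered as a toy loop. Here connection strengths are given as a lookup table instead of being computed from the ARG, and the partitioning stand-in is threshold-based connected components rather than the normalized cut used in the paper:

```python
# Toy rendering of the consolidation loop: representations in a VCS stay
# together when linked by connection strength above a threshold.
# A stand-in for graph partitioning, not the actual normalized-cut step.
def partition_vcs(vcs, c, threshold):
    clusters = []
    unassigned = set(vcs)
    while unassigned:
        seed = unassigned.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            u = frontier.pop()
            for v in list(unassigned):
                # c holds pairwise connection strengths c(u,v)
                if c.get((u, v), c.get((v, u), 0.0)) > threshold:
                    unassigned.remove(v)
                    cluster.add(v)
                    frontier.append(v)
        clusters.append(cluster)
    return clusters

vcs = ["r1", "r2", "r3"]   # representations FBS could not tell apart
c = {("r1", "r2"): 0.9, ("r2", "r3"): 0.05, ("r1", "r3"): 0.02}
clusters = partition_vcs(vcs, c, threshold=0.1)
# -> {'r1', 'r2'} and {'r3'}, in either order
```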

12
Connection Strength c(u,v)
  • Models for c(u,v)
  • many possibilities
  • diffusion kernels, random walks, etc.
  • none is fully adequate
  • cannot learn similarity from data
  • Diffusion kernels
  • λ(x,y) = λ1(x,y): base similarity
  • via direct links (of length 1)
  • λk(x,y): indirect similarity
  • via links of length k
  • B, where Bxy = B1xy = λ1(x,y)
  • base similarity matrix
  • Bk: indirect similarity matrix
  • K: total similarity matrix, or kernel
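A numeric sketch of the kernel construction above: B holds base similarities over direct links, B^k covers paths of length k, and K accumulates the powers. The decay factor and the length cutoff are illustrative assumptions, not parameters from the slides:

```python
# Numeric sketch of the diffusion-kernel idea: sum decayed matrix powers
# so that longer indirect connections contribute less similarity.
import numpy as np

def diffusion_kernel(B, decay=0.5, max_len=4):
    K = np.zeros_like(B)
    Bk = np.eye(B.shape[0])
    for k in range(1, max_len + 1):
        Bk = Bk @ B                # B^k: indirect similarity via length-k paths
        K += (decay ** k) * Bk     # longer paths contribute less
    return K

B = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
K = diffusion_kernel(B)
# nodes 0 and 2 share no direct link but gain similarity via node 1
```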

13
Connection Strength c(u,v) (cont.)
  • Instantiating parameters
  • Determining λ(x,y)
  • regular edges have types T1,...,Tn
  • types T1,...,Tn have weights w1,...,wn
  • λ(x,y) = wi
  • get the type of a given edge
  • assign this weight as the base similarity
  • Handling similarity edges
  • λ(x,y) assigned a value proportional to the similarity (heuristic)
  • approach to learn λ(x,y) from data (ongoing work)
  • Implementation
  • we do not compute the whole matrix K
  • we compute one c(u,v) at a time
  • limit path lengths by L
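A small sketch of assigning the base similarity for each edge, as described above: regular edges look up the weight of their type, similarity edges get a value proportional to the feature-based similarity. The edge types and the proportionality constant are hypothetical:

```python
# Hypothetical per-type edge weights; in RelDC the analyst assigns these.
type_weight = {"author-of": 0.8, "affiliated-with": 0.5, "co-author": 0.3}

def base_similarity(edge_type=None, similarity=None):
    """Return the base similarity for one edge."""
    if similarity is not None:
        # similarity edge: value proportional to FBS similarity (heuristic);
        # the 0.5 constant is an illustrative choice
        return 0.5 * similarity
    # regular edge: the weight of its type
    return type_weight[edge_type]
```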

14
Consolidation via Partitioning
  • Observations
  • each VCS contains representations of at least 1 object
  • if a repr. is in a VCS, then the rest of the repr. of the same object are in it too
  • Partitioning
  • two cases
  • k, the number of entities in a VCS, is known
  • k is unknown
  • when k is known, use any partitioning algorithm
  • maximize inside-connectivity, minimize outside-connectivity
  • we use [Shi, Malik 2000]
  • normalized cut
  • when k is unknown
  • split into two just to see the cut
  • compare the cut against a threshold
  • decide to split or not to split
  • iterate
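The "split, then compare the cut against a threshold" test for the k-unknown case can be sketched as follows; the weight table, the candidate split, and the threshold value are illustrative (the slide's reference is the normalized cut of [Shi, Malik 2000]):

```python
# Sketch of the normalized-cut value for one candidate two-way split.
# w maps unordered node pairs to edge weights.
def normalized_cut(w, part_a, part_b):
    def wt(u, v):
        return w.get((u, v), w.get((v, u), 0.0))
    cut = sum(wt(u, v) for u in part_a for v in part_b)
    nodes = part_a | part_b
    def assoc(part):
        # total weight from `part` to all nodes in the VCS
        return sum(wt(u, v) for u in part for v in nodes)
    return cut / assoc(part_a) + cut / assoc(part_b)

w = {("r1", "r2"): 1.0, ("r3", "r4"): 1.0, ("r2", "r3"): 0.1}
ncut = normalized_cut(w, {"r1", "r2"}, {"r3", "r4"})
split = ncut < 0.3   # below the threshold: weak tie, so split the VCS
```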

15
Measuring Quality of Outcome
  • Dispersion
  • for an entity, into how many clusters its repr. are clustered; ideal is 1
  • Diversity
  • for a cluster, how many distinct entities it covers; ideal is 1
  • Entity uncertainty
  • for an entity, if out of m representations m1 go to C1, ..., mn go to Cn, then H = -Σi (mi/m) log(mi/m)
  • Cluster uncertainty
  • if a cluster consists of m1 representations of E1, ..., mn of En, then (same formula)
  • ideal entropy is zero
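The uncertainty measures above reduce to an entropy over label counts; a minimal sketch (the base-2 logarithm is an assumed convention, and any base yields zero for a pure cluster):

```python
# Sketch of the cluster-uncertainty measure: entropy of entity labels
# within one cluster; zero means the cluster covers a single entity.
import math

def cluster_entropy(entity_counts):
    """entity_counts[i] = number of representations of entity Ei in the cluster."""
    m = sum(entity_counts)
    return -sum((mi / m) * math.log2(mi / m) for mi in entity_counts if mi)

pure = cluster_entropy([5])      # one entity only -> ideal entropy 0.0
mixed = cluster_entropy([3, 1])  # two entities mixed -> positive entropy
```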

16
Experimental Setup
  • Uncertainty
  • d1,d2,...,dn are director entities
  • pick a fraction d1,d2,...,dm
  • group entities in groups of size k
  • e.g., in groups of two: {d1,d2}, ..., {d9,d10}
  • make all representations of a group indiscernible by FBS, ...
  • Baseline 1
  • one cluster per VCS, regardless
  • equivalent to using only FBS
  • ideal dispersion!
  • Baseline 2
  • knows the grouping statistics
  • guesses the number of entities in a VCS
  • randomly assigns repr. to clusters
  • RealMov
  • movies (12K)
  • people (22K): actors, directors, producers
  • studios (1K): producing, distributing
  • Parameters
  • L-short simple paths, L = 7
  • L is the path-length limit
  • Note
  • the algorithm is applied to tough cases, after FBS has already successfully consolidated many entries!

17
Sample Movies Data
18
The Effect of L on Quality
[Charts: cluster entropy (diversity) and entity entropy (dispersion) as functions of L]
19
Effect of Threshold and Scalability
20
Summary
  • RelDC
  • domain-independent data cleaning framework
  • uses relationships for data cleaning
  • reference disambiguation [SDM'05]
  • object consolidation [IQIS'05]
  • Ongoing work
  • learning the importance of relationships from data
  • exploiting relationships among entities for other data cleaning problems

21
Contact Information
  • RelDC project
  • www.ics.uci.edu/dvk/RelDC
  • www.itr-rescue.org (RESCUE)
  • Zhaoqi Chen
  • chenz@ics.uci.edu
  • Dmitri V. Kalashnikov
  • www.ics.uci.edu/dvk
  • dvk@ics.uci.edu
  • Sharad Mehrotra
  • www.ics.uci.edu/sharad
  • sharad@ics.uci.edu

22
extra slides
23
What is the lesson?
  • data should be cleaned first
  • e.g., determine the (unique) real authors of publications
  • solving such challenges is not always easy
  • that explains the large body of work on data cleaning
  • note
  • CiteSeer is aware of the problem with its ranking
  • there are more issues with CiteSeer
  • many not related to data cleaning

Garbage in, garbage out principle: making decisions based on bad data can lead to wrong results.
24
Object Consolidation
  • Notation
  • O = {o1,...,o|O|}: set of entities
  • unknown in general
  • X = {x1,...,x|X|}: set of representations
  • d[xi]: the entity xi refers to
  • unknown in general
  • C[xi]: all representations that refer to d[xi]
  • group set
  • unknown in general
  • the goal is to find it for each xi
  • S[xi]: all representations that can be xi
  • consolidation set
  • determined by FBS
  • we assume C[xi] ⊆ S[xi]

25
Object Consolidation Problem
  • Let O = {o1,...,o|O|} be the set of entities
  • unknown in general
  • Let X = {x1,...,x|X|} be the set of representations
  • Map xi to its corresponding entity oj in O; d[xi] is the entity xi refers to
  • unknown in general
  • C[xi]: all representations that refer to d[xi]
  • group set
  • unknown in general
  • the goal is to find it for each xi
  • S[xi]: all representations that can be xi
  • consolidation set
  • determined by FBS
  • we assume C[xi] ⊆ S[xi]

26
RelDC Framework
27
Connection Strength
  • Computation of c(u,v)
  • Phase 1: discover connections
  • all L-short simple paths between u and v
  • bottleneck
  • optimizations (not in [IQIS'05])
  • Phase 2: measure the strength
  • in the discovered connections
  • many c(u,v) models exist
  • we use a model similar to diffusion kernels
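Phase 1 can be sketched as a depth-limited DFS that enumerates all simple paths of length at most L between u and v; the toy adjacency list is illustrative:

```python
# Sketch of Phase 1: enumerate all L-short simple paths (simple paths
# with at most L edges) between u and v via depth-limited DFS.
def l_short_simple_paths(adj, u, v, L):
    paths = []
    def dfs(node, path):
        if node == v:
            paths.append(list(path))
            return
        if len(path) > L:          # path already has L edges; cannot extend
            return
        for nxt in adj.get(node, []):
            if nxt not in path:    # simple path: no repeated nodes
                path.append(nxt)
                dfs(nxt, path)
                path.pop()
    dfs(u, [u])
    return paths

adj = {"u": ["a", "b"], "a": ["v"], "b": ["a", "v"], "v": []}
paths = l_short_simple_paths(adj, "u", "v", L=3)
# -> [u,a,v], [u,b,v], and [u,b,a,v]
```

This brute-force enumeration is the bottleneck the slide mentions; the optimizations referenced there prune this search.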

28
Our c(u,v) Model
  • Our model vs. diffusion kernels
  • virtually identical, but...
  • we do not compute the whole matrix K
  • we compute one c(u,v) at a time
  • we limit path lengths by L
  • λ(x,y) is unknown in general
  • the analyst assigns them
  • learn from data (ongoing work)
  • Our c(u,v) model
  • regular edges have types T1,...,Tn
  • types T1,...,Tn have weights w1,...,wn
  • λ(x,y) = wi
  • get the type of a given edge
  • assign this weight as the base similarity
  • paths with similarity edges
  • might not exist; use heuristics