Interactive Deduplication using Active Learning
1
Interactive Deduplication using Active Learning
Sunita Sarawagi, Anuradha Bhamidipaty
{sunita, anu}@it.iitb.ac.in
Funded by the Ministry of Information
Technology, India.
2
The de-duplication problem
  • Given a list of semi-structured records, find
    all records that refer to the same entity
  • Example applications
  • Data warehousing: merging name/address lists
    (entity: person or household)
  • Automatic citation databases (Citeseer):
    merging references (entity: paper)
  • De-duplication is not clustering: there is a
    precise, external notion of correctness

3
Challenges
  • Errors and inconsistencies in the data
  • Duplicates may be spread far apart and may not
    be groupable using obvious keys
  • The problem is domain-specific
  • Existing manual approaches require retuning for
    every new domain

4
Motivating example: Citations
  • Our prior: duplicate when author, title,
    booktitle and year match
  • Author match could be hard
  • L. Breiman, L. Friedman, and P. Stone, (1984).
  • Leo Breiman, Jerome H. Friedman, Richard A.
    Olshen, and Charles J. Stone.
  • Conference match could be harder
  • In VLDB-94
  • In Proc. of the 20th Int'l Conference on Very
    Large Databases, Santiago, Chile, September 1994.

5
  • Fields may not be segmented
  • Word overlap could be misleading
  • Non-duplicates with lots of word overlap
  • H. Balakrishnan, S. Seshan, and R. H. Katz.,
Improving Reliable Transport and Handoff
    Performance in Cellular Wireless Networks, ACM
    Wireless Networks, 1(4), December 1995.
  • H. Balakrishnan, S. Seshan, E. Amir, R. H. Katz,
    "Improving TCP/IP Performance over Wireless
    Networks," Proc. 1st ACM Conf. on Mobile
    Computing and Networking, November 1995.
  • Duplicates with little overlap even in title
  • Johnson Laird, Philip N. (1983). Mental models.
    Cambridge, Mass. Harvard University Press.
  • P. N. Johnson-Laird. Mental Models: Towards a
    Cognitive Science of Language, Inference, and
    Consciousness. Cambridge University Press, 1983.

6
The learning approach
Example labeled pairs
[Diagram: labeled record pairs (e.g. Record 1 / Record 2 marked D for
duplicate, Record 1 / Record 3 marked N for non-duplicate) are mapped
through similarity functions f1, f2, ..., fn into feature vectors that
train a classifier]
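To make the mapping concrete, here is a minimal sketch of turning a record pair into a similarity feature vector; the two functions used (word-level Jaccard overlap and absolute year difference) are illustrative stand-ins, not the exact functions from the system.

# A minimal sketch: map a record pair to a vector of similarity scores.
# word_jaccard and year_difference stand in for the many
# domain-provided similarity functions f1..fn.

def word_jaccard(a, b):
    """Jaccard overlap between the word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa or sb else 0.0

def year_difference(y1, y2):
    return abs(y1 - y2)

def similarity_vector(r1, r2):
    """One mapped instance: (f1(r1, r2), ..., fn(r1, r2))."""
    return [word_jaccard(r1["title"], r2["title"]),
            word_jaccard(r1["author"], r2["author"]),
            year_difference(r1["year"], r2["year"])]

r1 = {"title": "Mental models", "author": "Johnson Laird Philip N", "year": 1983}
r2 = {"title": "Mental Models Towards a Cognitive Science",
      "author": "P N Johnson-Laird", "year": 1983}
print(similarity_vector(r1, r2))  # three similarity features for the pair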
7
Example of a learnt function
  • Bibtex entries

Similarity functions
[Diagram: a learnt decision tree that combines the similarity functions
with threshold tests (e.g. YearDifference > 1, All-Ngrams ≤ 0.48,
AuthorTitleNgrams ≤ 0.4, TitleIsNull < 1, PageMatch ≤ 0.5,
AuthorEditDist ≤ 0.8), each path ending in a Duplicate or Non-Duplicate
leaf]
Classifier automates the non-trivial task of
combining simple similarity functions
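As a sketch of how such a function can be learnt automatically, the snippet below (assuming scikit-learn, which the slides do not name) fits a small decision tree on toy similarity vectors; the learnt thresholds are illustrative, not the ones above.

# A minimal sketch, assuming scikit-learn: learn a decision tree that
# combines similarity scores into a duplicate / non-duplicate rule.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy rows: [title_sim, author_sim, year_diff]; 1 = duplicate.
X = [[0.9, 0.8, 0], [0.7, 0.9, 1], [0.2, 0.1, 3],
     [0.3, 0.2, 0], [0.8, 0.7, 0], [0.1, 0.3, 5]]
y = [1, 1, 0, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["TitleSim", "AuthorSim", "YearDiff"]))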
8
Experiences with the learning approach
  • Too much manual search in preparing training data
  • Hard to spot challenging and covering sets of
    duplicates in large lists
  • Even harder to find close non-duplicates that
    will capture the nuances
  • Our solution: examine instances that are highly
    similar on one attribute but dissimilar on
    another (a sketch of this heuristic follows)
  • Active learning is a generalization of this!
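A minimal sketch of that heuristic, assuming each candidate pair already carries per-attribute similarity scores; the thresholds 0.8 and 0.3 are arbitrary illustrative choices.

# A minimal sketch: keep candidate pairs whose similarity is high on one
# attribute but low on another; these tend to be informative borderline
# cases.

def conflicting_pairs(pairs, hi=0.8, lo=0.3):
    """pairs: list of dicts mapping attribute name -> similarity score."""
    selected = []
    for scores in pairs:
        vals = scores.values()
        if max(vals) >= hi and min(vals) <= lo:
            selected.append(scores)
    return selected

candidates = [
    {"author": 0.95, "title": 0.10},  # same author, different title
    {"author": 0.90, "title": 0.85},  # likely duplicate, less informative
    {"author": 0.05, "title": 0.15},  # clearly different, less informative
]
print(conflicting_pairs(candidates))  # keeps only the first pair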

9
The active learning approach
Example labeled pairs
[Diagram: as above, record pairs are mapped through similarity functions
f1, f2, ..., fn, but now the classifier itself selects which pairs the
user should label next]
10
Working of ALIAS
  • Apply similarity functions on record pairs
  • Loop until user is satisfied:
  • Train classifier
  • Use active learning to select n instances
  • Collect user feedback
  • Augment with pairs inferred using transitivity
  • Add to training set
  • Output classifier (a runnable skeleton of this
    loop follows)
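The skeleton below only makes the control flow concrete; train, uncertainty and user_label are hypothetical stand-ins (a real run would plug in a classifier committee, an interactive labeling step and the transitivity inference).

# A runnable skeleton of the ALIAS loop; all three helpers are
# hypothetical stand-ins, shown only to make the control flow concrete.
import random

def train(labeled):                  # stand-in: train a classifier
    return lambda x: x[0] > 0.5

def uncertainty(clf, x):             # stand-in: prediction uncertainty
    return 1.0 - abs(x[0] - 0.5)     # highest near the decision boundary

def user_label(x):                   # stand-in: interactive user feedback
    return (x, x[0] > 0.6)

labeled = [([0.1], False), ([0.9], True)]       # tiny seed training set
pool = [[random.random()] for _ in range(100)]  # unlabeled record pairs
for _ in range(5):                              # loop until satisfaction
    clf = train(labeled)                        # 1. train classifier
    pool.sort(key=lambda x: uncertainty(clf, x), reverse=True)
    batch, pool = pool[:4], pool[4:]            # 2. select n instances
    labeled += [user_label(x) for x in batch]   # 3. collect user feedback
                                                # (transitivity step omitted)
print(len(labeled), "labeled pairs after 5 rounds")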

11
The ALIAS deduplication system
  • Interactive discovery of deduplication function
    using active learning
  • Manual effort reduced to
  • Providing simple similarity functions
  • Labeling selected pairs
  • Efficient indexing mechanism
  • Novel cluster-based evaluation engine
  • Cost-based optimizer

12
Architecture of ALIAS
[Architecture diagram: initial training records and the unlabeled pool D
pass through a Mapper that applies the similarity functions (F), yielding
mapped labeled instances Lp and a pool of mapped unlabeled instances Dp.
A classifier is trained on the training data T; the Active Learner selects
instances S for the user to label, and labeled pairs, augmented with pairs
inferred using transitivity, flow back into T. The resulting deduplication
function, together with a predicate for the uncertain region, drives an
evaluation engine (backed by similarity indices) that outputs the groups
of duplicates in A]
13
Example: active learning
Assume points from two classes (red and green) on
a real line, perfectly separable by a single-point
separator.
[Diagram: labeled points of each class at either
end, unlabeled points in between; x marks the most
uncertain point]
The point x with the greatest uncertainty of
prediction yields the largest expected reduction
in the uncertainty region.
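In this one-dimensional setting, repeatedly querying the most uncertain point is just binary search: each label halves the uncertainty region. A minimal sketch, with a hypothetical labeling oracle:

# A minimal sketch: with a single-point separator on a line, querying the
# midpoint of the current uncertainty region halves the region per label,
# so k labels narrow the separator to within (right - left) / 2**k.
def locate_separator(label_of, left, right, queries=10):
    """label_of(x) -> 'red' or 'green'; assumes reds lie below the separator."""
    for _ in range(queries):
        x = (left + right) / 2          # the most uncertain point
        if label_of(x) == "red":
            left = x
        else:
            right = x
    return left, right                  # shrunk uncertainty region

# Hypothetical oracle with the true separator at 0.37:
print(locate_separator(lambda x: "red" if x < 0.37 else "green", 0.0, 1.0))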
14
Active learning
  • Implicit measure
  • Train classifier
  • For each unlabeled instance, measure its
    prediction uncertainty
  • Choose a representative instance with high
    uncertainty
  • Explicit measure
  • For each unlabeled instance: add it to the
    training data, train the classifier, and
    measure the classifier error, quantified as
  • the size of the confusion region, or
  • the summed prediction uncertainty over all
    instances
  • Choose the instance that yields the lowest
    error (a sketch of the explicit measure
    follows)
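A minimal sketch of the explicit measure, assuming scikit-learn; for simplicity it averages over both possible labels with equal weight rather than weighting by predicted probability.

# A minimal sketch: for each candidate and each possible label, retrain
# and measure total prediction uncertainty over the pool; query the
# candidate with the lowest resulting error.
from sklearn.tree import DecisionTreeClassifier

def pool_uncertainty(clf, pool):
    # Uncertainty of one instance = 1 - probability of the winning class.
    return sum(1.0 - max(p) for p in clf.predict_proba(pool))

def explicit_select(X, y, pool):
    best, best_err = None, float("inf")
    for i, x in enumerate(pool):
        err = 0.0
        for label in (0, 1):                      # try both labels
            clf = DecisionTreeClassifier().fit(X + [x], y + [label])
            err += pool_uncertainty(clf, pool)    # error after retraining
        if err < best_err:
            best, best_err = i, err
    return best                                   # index of the best query

X, y = [[0.1], [0.9]], [0, 1]                     # toy labeled data
pool = [[0.3], [0.5], [0.8]]
print("query instance", explicit_select(X, y, pool))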

16
Measuring prediction certainty
  • Classifier-specific methods
  • Support vector machines: distance from the
    separator (see the sketch after this list)
  • Naïve Bayes classifier
  • Posterior probability of winning class
  • Decision tree classifier
  • Weighted sum of distance from different
    boundaries, error of the leaf, depth of the
    leaf, etc.
  • Committee-based approach
  • Disagreements amongst members of a committee
  • Most successfully used method
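The sketch referenced above, assuming scikit-learn: an SVM's signed distance from the separating hyperplane serves as a certainty score, so the pool instance with the smallest absolute margin is the most uncertain.

# A minimal sketch: use an SVM's distance from the separator as the
# certainty measure; the instance closest to the separator is queried.
from sklearn.svm import SVC

X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]  # toy labeled pairs
y = [0, 0, 1, 1]                                      # 0 = non-dup, 1 = dup
clf = SVC(kernel="linear").fit(X, y)

pool = [[0.5, 0.5], [0.15, 0.2], [0.85, 0.8]]         # unlabeled instances
margins = clf.decision_function(pool)                 # signed distances
most_uncertain = min(range(len(pool)), key=lambda i: abs(margins[i]))
print("query instance", pool[most_uncertain])         # nearest to separator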

17
Committee-based algorithm
  • Train k classifiers C1, C2, ..., Ck on the
    training data
  • For each unlabeled instance x
  • Find predictions y1, ..., yk from the k
    classifiers
  • Compute uncertainty U(x) as the entropy of
    these y's
  • Sampling for representativeness
  • With weight U(x), do weighted sampling to
    select an instance for labeling (a sketch
    follows)
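A minimal sketch, assuming scikit-learn, with the committee formed by bootstrap resampling (one of the options on the next slide); uncertainty is the entropy of the committee's votes, and the query is drawn by U(x)-weighted sampling.

# A minimal sketch: a committee of resampled decision trees votes on each
# unlabeled instance; uncertainty is the entropy of the vote distribution,
# and the query is chosen by weighted sampling for representativeness.
import math, random
from sklearn.tree import DecisionTreeClassifier

def committee_uncertainty(committee, x):
    votes = [clf.predict([x])[0] for clf in committee]
    probs = [votes.count(v) / len(votes) for v in set(votes)]
    return -sum(p * math.log2(p) for p in probs)      # vote entropy

X = [[0.1], [0.2], [0.8], [0.9]]                      # toy training data
y = [0, 0, 1, 1]
committee = []
for seed in range(5):                                 # bootstrap resampling
    rnd = random.Random(seed)
    idx = [rnd.randrange(len(X)) for _ in X]
    committee.append(DecisionTreeClassifier().fit(
        [X[i] for i in idx], [y[i] for i in idx]))

pool = [[0.3], [0.5], [0.7]]
weights = [committee_uncertainty(committee, x) for x in pool]
query = random.choices(pool, weights=weights)[0] if sum(weights) else pool[0]
print("selected for labeling:", query)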

18
Forming a classifier committee
  • Data partitioning
  • Resampling training data
  • Attribute partitioning
  • Random parameter perturbation
  • Probabilistic classifiers: sample from the
    posterior distribution on parameters given the
    training data
  • Example: a binomial parameter p has a Beta
    posterior distribution (a sketch follows)
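A minimal sketch of parameter perturbation for a probabilistic classifier: with s successes in n trials and a uniform prior, the binomial parameter p has a Beta(s + 1, n - s + 1) posterior, and each committee member draws its own value. The counts here are made up.

# A minimal sketch: perturb a binomial parameter by sampling from its
# Beta posterior; each committee member gets its own draw.
import random

s, n = 30, 100                       # e.g. 30 of 100 pairs show a feature
committee_params = [random.betavariate(s + 1, n - s + 1) for _ in range(5)]
print(committee_params)              # five perturbed versions of p ~= 0.3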

19
Randomly perturbing trees
  • Selecting a split attribute
  • Normally: the attribute with the lowest entropy
  • Perturbed: a random attribute within close
    range of the lowest (sketched below)
  • Selecting a split point
  • Normally: the midpoint of the range with the
    lowest entropy
  • Perturbed: a random point anywhere in the range
    with the lowest entropy
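A minimal sketch of the perturbed attribute choice; the 10% tolerance defining "close range of the lowest" is an illustrative assumption.

# A minimal sketch: instead of always taking the attribute with the lowest
# entropy, pick randomly among attributes whose entropy is within a small
# factor of the best, so each committee tree can differ.
import random

def perturbed_split_attribute(entropies, tolerance=0.10, rnd=random):
    """entropies: dict attribute -> entropy of the best split on it."""
    best = min(entropies.values())
    close = [a for a, e in entropies.items() if e <= best * (1 + tolerance)]
    return rnd.choice(close)

entropies = {"AllNgrams": 0.31, "PageMatch": 0.33, "YearDifference": 0.60}
print(perturbed_split_attribute(entropies))  # AllNgrams or PageMatch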

20
Experimental analysis
  • 250 references from Citeseer → 32,000 pairs,
    of which only 150 are duplicates
  • Citeseer's script used to segment references
    into author, title, year, page and rest
  • 20 text and integer similarity functions
  • Initial labeled set: just two pairs

21
Methods of creating committee
  • Data partitioning: bad when data is limited
  • Attribute partitioning: bad when there is
    sufficient data
  • Parameter perturbation: best overall

22
Importance of randomization
[Two charts: naïve Bayes and decision tree]
  • It is important to randomize instance selection
    for generative classifiers like naïve Bayes

23
Choosing the right classifier
  • SVMs good initially but not effective in choosing
    instances
  • Decision trees best overall

24
Benefits of active learning
  • Active learning is much better than random
    selection
  • With only 100 actively selected instances:
    97% accuracy, versus only 30% with random
    selection
  • Committee-based selection is close to optimal

25
Analyzing selected instances
  • Fraction of duplicates among selected
    instances: 44%, starting from only 0.5% in the
    full pool
  • Replacing the non-duplicates in the active set
    with the same number of random non-duplicates
    drops accuracy to only 40%

26
Related work
  • Performance aspects given a fixed function
  • Hernandez and Stolfo (DMKD journal, 1998)
  • Monge and Elkan (KDD 1996)
  • Designing domain-specific similarity functions
  • Library catalogs: Toney 1992, Hilton 1996
  • Census Bureau data: Winkler 1995
  • Learning approach
  • Census Bureau: Winkler 1995, 1999
  • Semi-supervised approach
  • TAILOR: Elfeky, Verykios, Elmagarmid (ICDE 2002)
  • Relevance feedback
  • Other applications of active learning
  • Argamon-Engelson and Dagan (JAIR 1999)

27
Conclusion and future work
  • Interactive discovery of the deduplication
    function using active learning
  • Manual effort reduced to
  • Providing simple similarity functions
  • Labeling selected pairs (two orders of
    magnitude fewer than with random selection)
  • Analyzed tradeoffs among various active
    learning methods
  • Ongoing work
  • Efficient evaluation on large data sets
  • Multi-table de-duplication