1
Interactive Deduplication using Active Learning
  • Sunita Sarawagi and Anuradha Bhamidipaty

Presented by Doug Downey
2
Active Learning for de-duplication
  • De-duplication systems try to learn a function
    f : D × D → {duplicate, non-duplicate},
    where D is the data set.
  • f is learned using a labeled training data set Lp
    of record pairs.
  • Normally, D is large, so many sets Lp are
    possible.
  • Choosing a representative, useful Lp is hard.
  • Instead of a fixed set Lp, in Active Learning the
    learner interactively chooses pairs from D × D to
    be labeled and added to Lp.

3
The ALIAS de-duplicator
  • Input:
  • Set Dp of pairs of data records represented as
    feature vectors (features might include edit
    distance, soundex, etc.).
  • Initial set Lp of some elements of Dp labeled as
    duplicates or non-duplicates.
  • Set T ← Lp, then loop until the user is satisfied:
  • Train classifier C using T.
  • Use C to choose a set S of instances from Dp for
    labeling.
  • Get labels for S from the user, and set T ← T ∪ S.
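A minimal sketch of this loop in Python, not the authors' code: it assumes scikit-learn, pairs already encoded as feature vectors, and a hypothetical get_labels_from_user oracle. Uncertainty here is simply closeness of the predicted probability to 0.5; the committee-based measures come on the next slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def alias_loop(Dp, T_X, T_y, get_labels_from_user, rounds=20, n=1):
    """Active-learning loop: Dp = pair feature vectors, (T_X, T_y) = seed set T."""
    for _ in range(rounds):                                # "until user satisfaction"
        clf = DecisionTreeClassifier().fit(T_X, T_y)       # train C using T
        p = clf.predict_proba(Dp)[:, 1]                    # P(duplicate) per pair
        uncertainty = 1.0 - 2.0 * np.abs(p - 0.5)          # peaks at P = 0.5
        S = np.argsort(-uncertainty)[:n]                   # choose set S from Dp
        labels = get_labels_from_user(S)                   # ask the user
        T_X = np.vstack([T_X, Dp[S]])                      # T <- T ∪ S
        T_y = np.concatenate([T_y, labels])
    return clf
```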

4
The ALIAS de-duplicator
5
Active Learning
  • How do we choose the set S of instances to label?
  • Idea: choose the most uncertain instances.
  • We're given that the +'s and -'s can be separated
    by some point, and assume that the probability of
    + (or -) is linear between labeled examples r and b.
  • The midpoint m between r and b is
  • maximally uncertain, and
  • also the point that reduces our confusion
    region the most.
  • So choose m!
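A quick numeric check of this argument (my illustration, not from the slides): with P(+) linear between a negative example at r and a positive one at b, the binary entropy, i.e. the uncertainty, peaks at the midpoint m.

```python
import numpy as np

r, b = 0.0, 1.0                      # assumed 1-D positions of the labeled pair
x = np.linspace(r, b, 101)
p = (x - r) / (b - r)                # linear P(+): 0 at r, 1 at b
with np.errstate(divide="ignore", invalid="ignore"):
    h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))   # binary entropy
h = np.nan_to_num(h)                 # entropy is 0 at the endpoints
print(x[h.argmax()])                 # -> 0.5, the midpoint m
```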

6
Measuring Uncertainty with Committees
  • Train a committee of several slightly different
    versions of a classifier.
  • Uncertainty(x) ∝ entropy of the committee's predictions on x
  • Form committees by:
  • Randomizing model parameters
  • Partitioning training data
  • Partitioning attributes
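A minimal sketch of the training-data route (my code, not the paper's: bootstrap resamples stand in for the partitions, and pair labels are assumed to be 0/1):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def make_committee(X, y, k=5, seed=0):
    """k slightly different trees, each trained on a resample of the data."""
    rng = np.random.default_rng(seed)
    committee = []
    for _ in range(k):
        idx = rng.choice(len(X), size=len(X), replace=True)
        committee.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return committee

def vote_entropy(committee, X):
    """Uncertainty(x) from the spread of committee votes."""
    votes = np.stack([clf.predict(X) for clf in committee])  # shape (k, n_pairs)
    p = np.clip(votes.mean(axis=0), 1e-9, 1 - 1e-9)          # fraction voting 1
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))      # high = disagreement
```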

7
Methods for Forming Committees
8
Committee Size
9
Representativeness of an Instance
  • We need informative instances, not just uncertain
    ones.
  • Solution: sample n of the kn most uncertain
    instances, weighted by uncertainty (a sketch
    follows after this list).
  • k = 1 → no sampling
  • kn = all of the data → full sampling
  • Why not use information gain?
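A minimal sketch of this partial sampling (my code): keep the kn most uncertain pairs, then draw n of them with probability proportional to their uncertainty, so picks are both uncertain and representative.

```python
import numpy as np

def sample_for_labeling(uncertainty, n=1, k=5, seed=0):
    rng = np.random.default_rng(seed)
    top = np.argsort(-uncertainty)[: k * n]   # the k*n most uncertain pairs
    w = uncertainty[top]
    w = w / w.sum()                           # weight by uncertainty
    return rng.choice(top, size=n, replace=False, p=w)
```

With k = 1 this reduces to picking the single most uncertain pair (no sampling); letting kn cover all the data gives full sampling.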

10
Sampling for Representativeness
11
Evaluation: Different Classifiers
  • Decision Trees, Naïve Bayes
  • Committees of 5 via parameter randomization
  • SVMs
  • Uncertainty = distance from separator
  • Start with one dup and one non-dup, add a new
    training example each round (n = 1), partial
    sampling (k = 5).
  • Similarity functions: 3-gram match,
    overlapping words, approx. edit distance, special
    handling of #'s/nulls (sketched after this list).
  • Data sets:
  • Bibliography: 32,131 citation pairs from Citeseer,
    0.5% duplicates.
  • Address: 44,850 pairs, 0.25% duplicates.
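A minimal sketch of such similarity features (my code; difflib's SequenceMatcher ratio stands in for an approximate edit-distance score):

```python
from difflib import SequenceMatcher

def ngrams(s, n=3):
    return {s[i:i + n] for i in range(max(0, len(s) - n + 1))}

def pair_features(a, b):
    """Similarity feature vector for two text fields a and b."""
    g1, g2 = ngrams(a.lower()), ngrams(b.lower())
    gram_match = len(g1 & g2) / max(1, len(g1 | g2))      # 3-gram match
    w1, w2 = set(a.lower().split()), set(b.lower().split())
    word_overlap = len(w1 & w2) / max(1, len(w1 | w2))    # overlapping words
    edit_sim = SequenceMatcher(None, a, b).ratio()        # approx. edit similarity
    return [gram_match, word_overlap, edit_sim]
```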

12
Evaluation: Different Classifiers
13
Evaluation: Different Classifiers
14
Value of Active Learning
15
Value of Active Learning
16
Example Decision Tree
17
Conclusions
  • Active Learning improves performance over random
    selection.
  • Uses two orders of magnitude less training data.
  • Note: not due just to a change in the +/- mix.
  • In these experiments, Decision Trees outperformed
    SVMs and Naïve Bayes.