1
Co-training
  • LING 572
  • Fei Xia
  • 02/21/06

2
Overview
  • Proposed by Blum and Mitchell (1998)
  • Important work
  • (Nigam and Ghani, 2000)
  • (Goldman and Zhou, 2000)
  • (Abney, 2002)
  • (Sarkar, 2002)
  • Used in document classification, parsing, etc.

3
Outline
  • Basic concept (Blum and Mitchell, 1998)
  • Relation with other SSL algorithms (Nigam and
    Ghani, 2000)

4
An example
  • Web-page classification, e.g., finding home pages
    of faculty members.
  • Page text: words occurring on that page
  • e.g., "research interests", "teaching"
  • Hyperlink text: words occurring in hyperlinks
    that point to that page
  • e.g., "my advisor"

5
Two views
  • Features can be split into two sets
  • The instance space X = X1 × X2, where X1 and X2
    correspond to the two views
  • Each example x = (x1, x2)
  • D: the distribution over X
  • C1: the set of target functions over X1.
  • C2: the set of target functions over X2.

6
Assumption 1: compatibility
  • The instance distribution D is compatible with
    the target function f = (f1, f2) if for any x = (x1, x2)
    with non-zero probability, f(x) = f1(x1) = f2(x2).
  • The compatibility of f with D

⇒ Each set of features is sufficient for
classification
7
Assumption 2: conditional independence
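One standard way to state this assumption, following Blum and Mitchell (1998): given the class label, the two views of an example are conditionally independent.

```latex
% x = (x_1, x_2) drawn from D; y = f(x) is the label
P(x_1, x_2 \mid y) \;=\; P(x_1 \mid y)\,P(x_2 \mid y)
\quad\Longleftrightarrow\quad
P(x_1 \mid x_2, y) \;=\; P(x_1 \mid y)
```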
8
Co-training algorithm
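A minimal Python sketch of the loop, assuming binary 0/1 labels, two pre-built feature views, and scikit-learn's multinomial Naive Bayes as the base learner; the roles of p, n, the pool U′, and the iteration count follow the discussion on the next slide.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain(X1, X2, y, labeled_idx, p=1, n=3, pool_size=75, iterations=30, seed=0):
    """Sketch of the Blum & Mitchell (1998) co-training loop (binary 0/1 labels).

    X1, X2     -- the two feature views of ALL examples (count matrices)
    y          -- labels; only y[i] for i in labeled_idx is treated as known
    p, n       -- most-confident positives / negatives each classifier adds per round
    pool_size  -- size of the smaller pool U' drawn from the unlabeled set U
    """
    rng = np.random.default_rng(seed)
    labels = {i: y[i] for i in labeled_idx}                 # known + self-labeled examples
    U = [i for i in range(len(y)) if i not in labels]
    rng.shuffle(U)
    pool, U = U[:pool_size], U[pool_size:]                  # U' and the remainder of U

    h1, h2 = MultinomialNB(), MultinomialNB()
    for _ in range(iterations):
        L = sorted(labels)
        h1.fit(X1[L], [labels[i] for i in L])               # classifier on view 1
        h2.fit(X2[L], [labels[i] for i in L])               # classifier on view 2
        for h, X in ((h1, X1), (h2, X2)):                   # each labels p pos + n neg from U'
            if not pool:
                break
            prob_pos = h.predict_proba(X[pool])[:, 1]
            order = np.argsort(prob_pos)
            picks = set(order[-p:]) | set(order[:n])        # most confident pos and neg
            for j in sorted(picks, reverse=True):
                labels[pool[j]] = int(prob_pos[j] > 0.5)    # label assigned by the classifier
                pool.pop(j)
        refill = min(2 * p + 2 * n, len(U))                 # replenish U' from U
        pool, U = pool + U[:refill], U[refill:]
    return h1, h2
```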
9
Co-training algorithm (cont)
  • Why use U′, in addition to U?
  • Using U′ yields better results.
  • Possible explanation: this forces h1 and h2 to
    select examples that are more representative of
    the underlying distribution D that generates U.
  • Choosing p and n: the ratio p/n should match
    the ratio of positive examples to negative
    examples in D.
  • Choosing the number of iterations and the size of U′.

10
Intuition behind the co-training algorithm
  • h1 adds examples to the labeled set that h2 will
    be able to use for learning, and vice versa.
  • If the conditional independence assumption holds,
    then on average each added document will be as
    informative as a random document, and the
    learning will progress.

11
Experiment setting
  • 1051 web pages from 4 CS departments
  • 263 pages (25%) as test data
  • The remaining 75% of pages
  • Labeled data: 3 positive and 9 negative examples
  • Unlabeled data: the rest (776 pages)
  • Manually labeled into a number of categories,
    e.g., course home page.
  • Two views (see the sketch below)
  • View 1 (page-based): words in the page
  • View 2 (hyperlink-based): words in the
    hyperlinks
  • Learner: Naïve Bayes

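As a toy illustration of the two views (the data here is hypothetical, not the actual WebKB pages), the sketch below builds a page-based and a hyperlink-based bag-of-words matrix; these are the X1 and X2 that the co-training sketch above would consume.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the WebKB data: each page has its own text (view 1) and the
# anchor text of hyperlinks pointing at it (view 2).
page_texts = ["my research interests include parsing and semantics",
              "course home page syllabus and homework assignments"]
link_texts = ["my advisor", "ling 572 course page"]
labels = [1, 0]                       # 1 = faculty home page, 0 = other (hypothetical)

X1 = CountVectorizer().fit_transform(page_texts).toarray()   # view 1: page-based
X2 = CountVectorizer().fit_transform(link_texts).toarray()   # view 2: hyperlink-based
```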
12
Naïve Bayes classifier (Nigam and Ghani, 2000)
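As a reminder of the model (a standard multinomial Naive Bayes formulation; the exact notation on the slide may differ), a document d is assigned the class c that maximizes the class prior times the per-word likelihoods:

```latex
% n_{w,d}: count of word w in document d
P(c \mid d) \;\propto\; P(c)\prod_{w \in d} P(w \mid c)^{\,n_{w,d}},
\qquad
\hat{c}(d) \;=\; \arg\max_{c}\Big[\log P(c) + \sum_{w \in d} n_{w,d}\,\log P(w \mid c)\Big]
```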
13
Experiment results
p = 1, n = 3, # of iterations = 30, |U′| = 75
14
Questions
  • Can co-training algorithms be applied to datasets
    without natural feature divisions?
  • How sensitive are the co-training algorithms to
    the correctness of the assumptions?
  • What is the relation between co-training and
    other SSL methods (e.g., self-training)?

15
(Nigam and Ghani, 2000)
16
EM
  • Pool the features together.
  • Use initial labeled data to get initial parameter
    estimates.
  • In each iteration, use all the data (labeled and
    unlabeled) to re-estimate the parameters.
  • Repeat until convergence (see the sketch below).

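A minimal sketch of this procedure with a Naive Bayes base learner, assuming dense count matrices and a fixed iteration count in place of a convergence test:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_nb(X_lab, y_lab, X_unlab, iterations=10):
    """Semi-supervised EM with Naive Bayes (sketch, dense count matrices)."""
    clf = MultinomialNB().fit(X_lab, y_lab)         # initial estimates: labeled data only
    classes = clf.classes_
    for _ in range(iterations):
        resp = clf.predict_proba(X_unlab)           # E-step: soft labels for unlabeled docs
        # M-step: refit on labeled data plus one class-weighted copy of each unlabeled doc
        X_all = np.vstack([X_lab] + [X_unlab] * len(classes))
        y_all = np.concatenate([y_lab] + [np.full(len(X_unlab), c) for c in classes])
        w_all = np.concatenate([np.ones(len(y_lab))] +
                               [resp[:, k] for k in range(len(classes))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return clf
```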
17
Experimental results: WebKB course database
EM performs better than co-training. Both are
close to the supervised method when trained on
more labeled data.
18
Another experiment: the News 2x2 dataset
  • A semi-artificial dataset
  • Conditional independence assumption holds.

Co-training outperforms EM and the oracle
result.
19
Co-training vs. EM
  • Co-training splits features, EM does not.
  • Co-training incrementally uses the unlabeled
    data.
  • EM probabilistically labels all the data at each
    round; EM iteratively uses the unlabeled data.

20
Co-EM: EM with feature split
  • Repeat until convergence:
  • Train the A-feature-set classifier using the labeled
    data and the unlabeled data with B's labels.
  • Use classifier A to probabilistically label all
    the unlabeled data.
  • Train the B-feature-set classifier using the labeled
    data and the unlabeled data with A's labels.
  • B re-labels the data for use by A (see the sketch
    below).

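A compact sketch of this loop under the same assumptions as the EM sketch above (Naive Bayes base learners, dense count matrices, a fixed iteration count):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def soft_fit(X_lab, y_lab, X_unlab, resp, classes):
    """Fit Naive Bayes on labeled data plus class-weighted copies of the unlabeled data."""
    X = np.vstack([X_lab] + [X_unlab] * len(classes))
    y = np.concatenate([y_lab] + [np.full(len(X_unlab), c) for c in classes])
    w = np.concatenate([np.ones(len(y_lab))] + [resp[:, k] for k in range(len(classes))])
    return MultinomialNB().fit(X, y, sample_weight=w)

def co_em(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, iterations=10):
    b = MultinomialNB().fit(X2_lab, y_lab)        # bootstrap B on view 2, labeled data only
    classes = b.classes_
    for _ in range(iterations):
        resp_b = b.predict_proba(X2_unlab)        # B probabilistically labels U for A
        a = soft_fit(X1_lab, y_lab, X1_unlab, resp_b, classes)
        resp_a = a.predict_proba(X1_unlab)        # A probabilistically labels U for B
        b = soft_fit(X2_lab, y_lab, X2_unlab, resp_a, classes)
    return a, b
```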
21
Four SSL methods
Results on the News 2x2 dataset
22
Random feature split
Co-training: 3.7 → 5.5    Co-EM: 3.3 → 5.1
  • When the conditional independence assumption does
    not hold but there is sufficient redundancy among
    the features, co-training still works well.

23
Assumptions
  • Assumptions made by the underlying classifier
    (supervised learner)
  • Naïve Bayes: words occur independently of each
    other, given the class of the document.
  • Co-training uses the classifier to rank the
    unlabeled examples by confidence.
  • EM uses the classifier to assign probabilities to
    each unlabeled example.
  • Assumptions made by the SSL method
  • Co-training: the conditional independence assumption.
  • EM: maximizing likelihood correlates with
    reducing classification errors.

24
Summary of (Nigam and Ghani, 2000)
  • Comparison of four SSL methods: self-training,
    co-training, EM, co-EM.
  • The performance of the SSL methods depends on how
    well the underlying assumptions are met.
  • Randomly splitting the features is not as good as
    a natural split, but it still works if there is
    sufficient redundancy among the features.

25
Variations of co-training
  • Goldman and Zhou (2000) use two learners of
    different types, but both take the whole feature
    set.
  • Zhou and Li (2005) use three learners. If two
    agree, the data is used to teach the third
    learner.
  • Balcan et al. (2005) relax the conditional
    independence assumption to a much weaker
    expansion condition.

26
An alternative?
  • L → L1, L → L2
  • U → U1, U → U2
  • Repeat
  • Train h1 using L1 on feature set 1
  • Train h2 using L2 on feature set 2
  • Classify U2 with h1 and let U2′ be the subset
    with the most confident scores; L2 ∪ U2′ → L2,
    U2 − U2′ → U2
  • Classify U1 with h2 and let U1′ be the subset
    with the most confident scores; L1 ∪ U1′ → L1,
    U1 − U1′ → U1

27
Yarowsky's algorithm
  • one-sense-per-discourse
  • → View 1: the ID of the document that a word
    is in
  • one-sense-per-collocation
  • → View 2: the local context of the word in the
    document
  • Yarowsky's algorithm is a special case of
    co-training (Blum and Mitchell, 1998)
  • Is this correct? No, according to (Abney, 2002).

28
Summary of co-training
  • The original paper (Blum and Mitchell, 1998)
  • Two independent views: split the features into
    two sets.
  • Train a classifier on each view.
  • Each classifier labels data that can be used to
    train the other classifier.
  • Extension
  • Relax the conditional independence assumptions
  • Instead of using two views, use two or more
    classifiers trained on the whole feature set.

29
Summary of SSL
  • Goal: use both labeled and unlabeled data.
  • Many algorithms: EM, co-EM, self-training,
    co-training, ...
  • Each algorithm is based on some assumptions.
  • SSL works well when the assumptions are satisfied.

30
Additional slides
31
Rule independence
  • H1 (H2) consists of rules that are functions of
    X1 (X2, respectively) only; see the note below.

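Roughly, and paraphrasing Abney (2002) rather than quoting the exact statement, rule independence asks that the outputs of any two rules h1 ∈ H1 and h2 ∈ H2 be conditionally independent given the label y:

```latex
% Rule independence (paraphrase): for all h_1 \in H_1, h_2 \in H_2 and values u, v, y
P\big(h_1(x_1) = u \mid y,\; h_2(x_2) = v\big) \;=\; P\big(h_1(x_1) = u \mid y\big)
```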
32
  • EM: the data is generated according to some
    simple, known parametric model.
  • Ex: the positive examples are generated according
    to an n-dimensional Gaussian D centered around
    the point