1
COMP5318/4044, Lecture 10: Knowledge Discovery and Data Mining
  • Getting Started
  • What if I only have a small training set?
  • This set of slides is adapted from two different pieces of work with Irena Koprinska, Jason Chan, Martin Buchholz and Dirk Pflüger

2
Motivation
  • Machine learning usually depends on a good (large) training set, but it is not always feasible in the real world to obtain one
  • Events may be rare
  • Labels may be expensive to obtain
  • The relevant domain knowledge may be missing
  • How can we improve the classification performance
    even when we have a small set of training
    examples?
  • Making use of the unlabelled set
  • Creating artificial examples

3
Available Methods: Co-Training with a Random Split of a Single Natural Feature Set
  • Make more use of unlabelled data while using very
    few labelled instances to help classifiers learn
  • Co-Testing
  • An active learning algorithm that exploits
    multiple views. It is based on the idea of
    learning from mistakes. More precisely, it
    queries examples on which the views predict a
    different label
  • Expectation Maximization (EM)
  • EM is a statistical estimation algorithm, here used with the finite Gaussian mixture model. It is similar to the K-means procedure in that a set of parameters is re-computed until a desired convergence value is reached. The finite mixture model assumes all attributes to be independent random variables. This algorithm is part of the Weka clustering package (an illustrative sketch follows this list).
  • Self Training
  • Co-Training
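
As an illustration only (the slides refer to Weka's clustering package; this sketch uses scikit-learn instead), a minimal example of fitting a finite Gaussian mixture with EM on synthetic data:

```python
# Illustrative sketch: EM-fitted finite Gaussian mixture (scikit-learn, not Weka).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),      # synthetic data with two clusters
               rng.normal(5, 1, (100, 2))])

# EM re-estimates the mixture parameters until the convergence tolerance `tol` is reached.
gmm = GaussianMixture(n_components=2, tol=1e-3, random_state=0).fit(X)
print(gmm.means_)                                # estimated component means
print(gmm.predict(X[:5]))                        # cluster assignments, K-means style
```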

4
The Co-Training Algorithm
  • Introduced by Avrim Blum and Tom Mitchell in 1998
  • Applied to web-page classification
  • Problem: identify the home page of a course
  • How:
    1. Build two classifiers with separate (disjoint) features, trained on the same small labelled set
       - words in the body of the web page
       - words in hyperlinks of other documents referring to that particular page
    2. Classifiers label instances in the unlabelled set
    3. Each classifier selects its most confidently predicted examples and adds them to the labelled set
    4. Repeat from step 1
  • Obtain a small set L of labelled examples
  • Obtain a large set U of unlabelled examples
  • Obtain 2 sets F1 and F2 of features describing the dataset
  • while U ≠ ∅ do
  •   train classifier C1 from L based on F1
  •   train classifier C2 from L based on F2
  •   for each classifier Ci do
  •     Ci labels examples in U based on Fi
  •     Ci chooses the most confidently predicted examples E from U
  •     E is removed from U and added (with their given labels) to L
  •   end for
  • end while

A. Blum, T. Mitchell. Combining Labeled and
Unlabeled Data with Co-Training. In Proceedings
of the Workshop on Computational Learning Theory,
1998
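
A minimal runnable sketch of this loop (assuming two bag-of-words views X1 and X2, a Naive Bayes base learner, and small per-round batches; these choices are illustrative, not the exact course implementation):

```python
# Co-training sketch following the pseudocode above (after Blum & Mitchell, 1998).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1, X2, y, labelled, rounds=10, per_round=2):
    """X1, X2: the two feature views F1/F2 of all examples; y: labels (trusted only on L).
    `labelled` is a boolean mask marking the small initial labelled set L; the rest is U."""
    labelled, y = labelled.copy(), y.copy()
    for _ in range(rounds):
        c1 = MultinomialNB().fit(X1[labelled], y[labelled])   # train C1 from L based on F1
        c2 = MultinomialNB().fit(X2[labelled], y[labelled])   # train C2 from L based on F2
        for clf, X_view in ((c1, X1), (c2, X2)):              # for each classifier Ci
            U = np.where(~labelled)[0]
            if len(U) == 0:                                   # while U is not empty
                break
            proba = clf.predict_proba(X_view[U])              # Ci labels examples in U
            order = np.argsort(-proba.max(axis=1))[:per_round]
            picked = U[order]                                 # most confidently predicted E
            y[picked] = clf.classes_[proba[order].argmax(axis=1)]
            labelled[picked] = True                           # move E from U into L
    return c1, c2
```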
5
Assumptions of Co-Training: Assumptions of the Feature Sets
  • Conditional independence
  • given the class label, knowing the values of F1 does not help you predict the values of F2 (and vice versa)
  • Redundant sufficiency
  • using only one of F1 or F2 separately still gives good classification accuracy
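
Stated a bit more formally (a standard formulation in terms of the class label Y and the two views F1, F2; the notation is ours, not from the original slides):

```latex
% Conditional independence of the views given the class label:
P(F_1 \mid Y, F_2) = P(F_1 \mid Y), \qquad P(F_2 \mid Y, F_1) = P(F_2 \mid Y)
% Redundant sufficiency: each view alone is enough to learn a good classifier,
% i.e. there exist h_1, h_2 with h_1(F_1) \approx Y and h_2(F_2) \approx Y.
```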

6
Experiment 1 (Web): Setup (1)
  • Domain: web-browsing agent
  • 4 users
  • 4 topics
  • nuclear fusion
  • circulatory system
  • food pyramid
  • greenhouse effect / ozone layer
  • 80 pages per topic

7
Experiment 1: Setup (2)
  • Each document is represented using bag-of-words
  • Information Gain used in feature selection
  • The top 100 words were selected
  • A reduction of about 98%
  • Term frequency was used in the feature vectors
  • Four types of classifiers were tested
  • Decision Tree (DT)
  • Random Forest (RF)
  • Naïve Bayesian (NB) and
  • Support Vector Machine (SVM)
  • Classification performance was measured using 10-fold cross-validation
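
For concreteness, a minimal scikit-learn sketch of this kind of setup (the corpus loader is hypothetical, mutual information is used as a stand-in for information gain, and all classifier parameters are defaults; this is not the original experimental code):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

docs, labels = load_corpus()          # hypothetical loader for the 80-pages-per-topic corpus

classifiers = {"DT": DecisionTreeClassifier(), "RF": RandomForestClassifier(),
               "NB": MultinomialNB(), "SVM": LinearSVC()}
for name, clf in classifiers.items():
    pipe = make_pipeline(
        CountVectorizer(),                          # bag-of-words term frequencies
        SelectKBest(mutual_info_classif, k=100),    # keep the top 100 words (~98% reduction)
        clf,
    )
    scores = cross_val_score(pipe, docs, labels, cv=10, scoring="f1_macro")   # 10-fold CV
    print(name, scores.mean())
```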

8
Feature Sets
  • Natural Feature Sets
  • Heading (titles + headings + hyperlinks): all words that appear in either titles, headings, or hyperlinks
  • Body: all words that appear in the Web page, without counting occurrences in the titles, headings, or hyperlinks
  • Random Selection Feature Sets
  • Half1: a random selection of half of the feature set Body
  • Half2: the other half of the words, not found in Half1
  • Fifth1: a random selection of a fifth of the feature set Body
  • Fifth2: a random selection of another fifth of the feature set Body
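
The random splits of the Body feature set can be produced by permuting the feature indices; a minimal sketch (X_body, the bag-of-words matrix over Body, is an assumed input):

```python
import numpy as np

rng = np.random.default_rng(42)
n = X_body.shape[1]                         # X_body: documents x Body-words matrix (assumed)
perm = rng.permutation(n)                   # one random shuffle of the feature indices

half1, half2 = perm[: n // 2], perm[n // 2 :]                 # Half1 / Half2: disjoint halves
fifth1, fifth2 = perm[: n // 5], perm[n // 5 : 2 * n // 5]    # Fifth1 / Fifth2: disjoint fifths

X_half1, X_half2 = X_body[:, half1], X_body[:, half2]         # the two views for co-training
```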

9
WebSL: F-measures in Supervised Learning
  • Plain supervised learning
  • No co-training
  • 10-fold cross-validation
  • 90% of the data as training set
  • Results were averaged over all users and topics
  • The numbers in the table are F-measures (macro-averaged)
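
As a reminder, the macro-averaged F-measure is the unweighted mean of the per-class F1 scores:

```latex
F_1^{(c)} = \frac{2\,P_c R_c}{P_c + R_c}, \qquad
F_1^{\mathrm{macro}} = \frac{1}{|C|} \sum_{c \in C} F_1^{(c)}
```

where P_c and R_c are the precision and recall for class c.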

10
WebCT: The Comparison
  • Co-training with Natural Features
  • words in main body of page
  • words in titles, headings, and hyperlinks
  • Co-training with a Random Split
  • using only the words in the body
  • Is co-training with a random split still
    beneficial?

11
WebCT: F-measures in Co-Training
Table 1. Maximum increase in classification performance of the combined classifier using co-training
Initial size of labeled examples: 8 instances
12
Experiment 2 (Spam): Setup
  • Domain: spam detection
  • LingSpam: emails sent to the Linguist mailing list
  • emails: 2893
  • legitimate emails: 2412 (83.4%)
  • spam: 481 (16.6%)

13
Experiment 2: Results (1)
  • SpamSL
  • Plain supervised learning
  • No co-training
  • 10-fold cross-validation
  • 90% of the data as training set (2595 instances in training)
  • Results were averaged over all users and topics
  • The numbers in the table are F-measures (macro-averaged)
  • SpamCT
  • Initial size of labeled examples: 8 instances

14
SpamCT: Results (2)
NB classifier; initial labelled spam: 1, initial labelled non-spam: 1, p = 1, n = 5
15
Why is the random split so good?
  • One of the natural feature sets is considerably
    weaker than the other
  • The classifier using Title/Subject is incorrectly
    labelling many instances in comparison with
    classifiers built using the random selection
    feature sets, hence transferring many incorrectly
    labelled instances into the labelled set.
  • As found in the supervised learning experiments, using a random selection of half of the features results in classifiers that perform only slightly worse than a classifier using all the available attributes.
  • As a result, when performing co-training, both
    classifiers using their respective half of the
    features are able to improve the training set by
    labelling unlabelled instances with a
    sufficiently high classification performance.

16
Conclusion: Co-Training with a Random Split of a Single Natural Feature Set
  • In our experiments
  • Comparison between co-training with random
    splitting and using two natural feature sets
  • Co-training with a random split of a single
    natural feature set can be just as competitive as
    co-training with two natural feature sets
  • Reasons cited for the observed experimental
    results
  • Random splitting is favoured when the data
    contains many weak attributes and/or two natural
    feature sets significantly differ in strength
  • These conditions are very common (e.g. text
    categorization), which indicates that co-training
    with random splitting has great practical
    potential

17
Another Approach to Tackle a Small Training Set: Artificial Examples
  • The domain problem:
  • Vast amount of information available
  • Use of Search Engines
  • Thousands of results
  • Often lots of irrelevant pages

18
User's Perspective
  • Ranking
  • Ranking is optimised for the whole web community, not for a user's individual needs
  • Query formulation
  • Difficult to make one's intention explicit
  • NEC: 50% one-word queries
  • → Use example pages to specify the user's intention more precisely

19
Common approaches
  • Restrict to a limited domain, construct domain-specific SEs
  • E.g. create ontologies/hierarchies [1]
  • Query refinement (add keywords) [2]
  • Clustering instead of ranking [3]

[1] Dwi H. Widyantoro, John Yen: A Fuzzy Ontology-based Abstract Search and Its User Studies. Proc. of the 10th IEEE International Conference on Fuzzy Systems, vol. 2, pp. 1291-1294, 2001.
[2] Satoshi Oyama, Takashi Kokubo, Toru Ishida: Domain-Specific Web Search with Keyword Spices. IEEE Transactions on Knowledge and Data Engineering, 2003.
[3] Michael Chau, Daniel Zeng, Hsinchun Chen: Personalized Spiders for Web Search and Analysis. Proc. of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL'01), Roanoke, Virginia, June 24-28, 2001, pp. 79-87.
20
Our approach
  • Use of general purpose search engine
  • No restriction to special domain
  • User provides
  • general query
  • example pages (implicit knowledge)
  • Re-ranking based on confidence of ML
  • Keep representation of results
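
A minimal sketch of the re-ranking step (the helper names and the use of predict_proba are assumptions for illustration, not the actual system):

```python
import numpy as np

def rerank(pages, clf, vectorize):
    """Re-rank search-engine results by the ML classifier's confidence in relevance.
    pages: result texts in the engine's original order; clf: a fitted binary classifier
    with predict_proba; vectorize: text -> feature vector (both assumed to exist)."""
    X = np.vstack([vectorize(p) for p in pages])
    confidence = clf.predict_proba(X)[:, 1]               # probability of the "relevant" class
    return [pages[i] for i in np.argsort(-confidence)]    # most confident results first
```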

21
System Design
22
Feature Selection
  • Feature reduction: nouns only
  • A threshold on the number of features is supplied
  • Use features from positive and unclassified documents
  • Use of a modified tf-idf
  • tf is additionally weighted with the inverse document length
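
One plausible reading of this modified tf-idf (an assumption; the slides do not give the exact formula) is to divide the term frequency by the document length before applying the usual idf factor:

```latex
w(t, d) = \frac{\mathrm{tf}(t, d)}{|d|} \cdot \log\frac{N}{\mathrm{df}(t)}
```

where tf(t, d) is the frequency of term t in document d, |d| the length of d, N the number of documents, and df(t) the number of documents containing t.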

23
Evaluation Method
  • Precision and recall are not feasible
  • No classification, only a ranking
  • The user looks only at the first few documents
  • → cost-benefit analysis

24
Evaluation Method (cont.)
Assume we have a set of 15 documents consisting
of 5 positive and 10 negative documents
  • User: as many good documents as soon as possible
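
A hedged sketch of such a cost-benefit curve (the slides do not define the exact measure; here a benefit of +1 per relevant document found and a cost of 1 per document inspected are assumed):

```python
import numpy as np

def cost_benefit_curve(ranked_relevance, benefit=1.0, cost=1.0):
    """ranked_relevance: 1/0 relevance flags in ranked order, e.g. 5 positives among 15 docs.
    Returns the cumulative benefit minus cost after inspecting the top k documents."""
    rel = np.asarray(ranked_relevance, dtype=float)
    return np.cumsum(benefit * rel) - cost * np.arange(1, len(rel) + 1)

# Example: 15 documents (5 positive, 10 negative) with most positives ranked near the top.
print(cost_benefit_curve([1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]))
```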

25
Creating Artificial Examples
  • Main problems for the system
  • Very small training set
  • No negative documents
  • 2 classes are necessary for training
  • Normalization → feature vectors ∈ [0, 1]^d
  • Create negative docs where no positives are expected

26
Creating Artificial Examples (cont.)
  • Artificial document Zero (AD0)
  • All features zero
  • Document completely off-topic
  • Further artificial documents (a construction sketch follows the references below)
  • Value 0 for attributes that occur in pos. docs
  • Higher value for the other attributes
  • Manabu Sassano (2003). Virtual Examples for Text Classification with Support Vector Machines. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing.
  • Partha Niyogi, Federico Girosi, and Tomaso Poggio (1998). Incorporating prior information in machine learning by creating virtual examples. Proceedings of the IEEE, volume 86, pages 2196-2207.
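
A minimal sketch of the artificial-negative construction described above (the number of generated documents and the "higher value" used for unseen attributes are assumptions):

```python
import numpy as np

def make_artificial_negatives(X_pos, n_docs=5, high=1.0, seed=0):
    """X_pos: normalized feature vectors of the positive example pages, values in [0, 1]^d.
    Returns AD0 (all features zero) plus further artificial negatives that have value 0 for
    attributes occurring in the positive docs and a higher value for the other attributes."""
    rng = np.random.default_rng(seed)
    d = X_pos.shape[1]
    seen_in_pos = (X_pos > 0).any(axis=0)          # attributes that occur in positive docs

    ad0 = np.zeros(d)                              # artificial document Zero: completely off-topic
    others = np.zeros((n_docs, d))
    others[:, ~seen_in_pos] = high * rng.random((n_docs, int((~seen_in_pos).sum())))
    return np.vstack([ad0, others])
```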

27
Typical observations
  • Significant improvement for initial ranking
  • Especially for topmost and bottommost positions
    of ranking
  • Later, enough real negative examples are available through feedback

28
Different Test Cases
  • Scenario 1
  • Unspecific query (large domain)
  • Positive documents from different subdomains
  • Very low percentage of positive results
  • Scenario 2
  • Very specific query (small domain)
  • Returned documents closely related
  • Search for very specific subdomain
  • Scenario 3
  • Example documents from related domain
  • Test for generalization capabilities

29
Evaluation Scenario 1
  • Tourist searching for information (Australia)
  • First 100 Google results
  • Example pages about accommodation, travel,
    tourists
  • Low percentage (4%) of positive results, all from different subdomains

30
Scenario 1 (cont.)
  • Works with a small number of positive results
  • SVM outperforms others

31
Evaluation Scenario 2
  • How to play a movie on a TV connected to a laptop (query: laptop tv video overlay)
  • First 100 Google results
  • Low percentage (7%) of positive results
  • Very specialized, difficult for humans
  • Target docs mainly FAQs, Forum pages

32
Scenario 2 (cont.)
  • Performance well above average, even though the returned documents are closely related
  • Half of the positive docs at the top of the list using SVM

33
Evaluation Scenario 3
  • Search for special recipe (apple pie)
  • First 500 Google results
  • Some other recipes available (fish, bread, berry
    pie)
  • High percentage (40%) of positive results
  • Negative documents: e.g. movies, books, ...

34
Scenario 3 (cont.)
  • Benefit axis most important
  • Similar results with only the fish example doc → good generalization capabilities
  • Leads to a domain-specific SE about recipes

35
Conclusions
  • Similar observations for all scenarios
  • SVM outperforms all other ML methods
  • Generation of artificial negative examples is reasonable (more work to be done)
  • The system is able to cope with a very small training set