1
CSI5388 Practical Recommendations
2
Context for our Recommendations I
  • This discussion will take place in the context of
    the following three questions:
  • I have created a new classifier for a specific
    problem. How does it compare to other existing
    classifiers on this particular problem?
  • I have designed a new classifier. How does it
    compare to existing classifiers on benchmark
    data?
  • How do various classifiers fare on benchmark data
    or on a single new problem?

3
Context for our Recommendations II
  • These three questions can be translated into four
    different situations:
  • Situation 1: Comparison of a new classifier to
    generic ones for a specific problem
  • Situation 2: Comparison of a new classifier to
    generic ones on generic problems
  • Situation 3: Comparison of generic classifiers on
    generic domains
  • Situation 4: Comparison of generic classifiers on
    a specific problem

4
Selecting learning algorithms I
  • The general strategy is to try to select
    classifiers that are more likely to succeed on
    the task at hand.
  • Situation 1: Select generic classifiers with a
    good chance of success at the particular task.
  • E.g., for a high-dimensionality problem, use an
    SVM as the generic classifier.
  • E.g., for a class-imbalanced problem, combine the
    classifier with SMOTE resampling, etc. (see the
    sketch below).
  • Situation 2: Different from Situation 1 in that no
    specific problem is targeted. So, choose generic
    classifiers that are generally accurate and stable
    across domains.
  • E.g., Random Forests, SVMs, Bagging
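  A minimal sketch of the class-imbalance case, assuming the
  imbalanced-learn library is installed; the synthetic data, base
  classifier, and parameters are illustrative choices, not part of the
  original slides:

```python
# Sketch: SMOTE resampling combined with a base classifier.
# Data and parameters are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Oversample only the training set, never the test set.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
print("Test accuracy:", clf.score(X_test, y_test))
```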

5
Selecting learning algorithms II
  • Situation 3: Different from Situations 1 and 2.
    This time, we are interested in finding the
    strengths and weaknesses of various algorithms on
    different problems. So, select various well-known
    and well-used algorithms, not necessarily the
    best algorithms overall.
  • E.g., Decision Trees, Neural Networks, Naïve
    Bayes, Nearest Neighbours, SVMs, etc.
  • Situation 4: Reduces either to Situation 1, where
    what matters is the search for an optimal
    classifier, or to Situation 3, where the purpose
    is of a more general and scientific nature.

6
Selecting Data Sets I
  • The selection of data sets is different in the
    cases of Situations 1 and 4 and Situations 2 and
    3.
  • Situations 1 and 4: We distinguish between two
    cases.
  • Case 1: There is just one data set of interest.
    Just use this data set.
  • Case 2: We are considering a class of data sets
    (e.g., data sets for text categorization). In
    this case, we should look at Situations 2 and 3,
    since data sets in the same class can have
    different characteristics (e.g., noise, class
    imbalances, etc.). The only difference is that the
    domains in this class will be more closely
    related than those in a wider study of the kind
    considered in Situations 2 and 3.

7
Selecting Data Sets II
  • Situations 2 and 3: The first thing that we need
    to do is determine what the exact purpose of the
    study is.
  • Case 1: To test a specific characteristic of a
    new algorithm or of various algorithms (e.g.,
    their resilience to noise), select domains
    presenting the same characteristics.
  • Case 2: To test the general performance of a new
    algorithm or of various algorithms on a variety
    of domains with different characteristics,
    select varied domains, but watch the way in which
    you report the results. There may be a lot of
    variance, from classifier to classifier and type
    of domain to type of domain. It will be best to
    cluster the kinds of domains on which
    classifiers excel or do poorly and report the
    results on a cluster-by-cluster basis (see the
    sketch below).
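  A minimal sketch of the cluster-by-cluster reporting idea, assuming
  results have already been collected in a table keyed by classifier
  and domain type; the column names and toy numbers are illustrative:

```python
# Sketch: report per-classifier accuracy grouped by domain type,
# rather than one grand average across heterogeneous domains.
import pandas as pd

results = pd.DataFrame({
    "classifier":  ["SVM", "SVM", "NB", "NB", "SVM", "NB"],
    "domain_type": ["text", "text", "text", "text", "image", "image"],
    "accuracy":    [0.91, 0.88, 0.84, 0.86, 0.77, 0.72],
})

# Mean and standard deviation within each (domain type, classifier) cluster.
summary = results.groupby(["domain_type", "classifier"])["accuracy"].agg(["mean", "std"])
print(summary)
```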

8
Selecting Data Sets III
  • Situations 2 and 3 (cont'd): Three questions
    remain.
  • Question 1: How many data sets are necessary /
    desirable?
  • Question 2: Where can we get these data sets?
  • Question 3: How do we select data sets from those
    available?

9
Selecting Data Sets IV
  • Situations 2 & 3: How many data sets?
  • The number of domains necessary depends on the
    variance in the performance of the classifiers.
    As a rule of thumb, 3 to 5 domains within the
    same category of domains are desirable to begin
    with. Note: as domains get added, the question
    raised by Salzberg (1997) and Jensen (2001)
    regarding the multiplicity effect should be
    considered (see the sketch below).
  • Situations 2 & 3: Where can we get these data
    sets?
  • The UCI Machine Learning Repository or other
    repositories (but the collections may not be
    representative of reality).
  • Directly from the Web (but gathering and cleaning
    a data collection is extremely time-consuming).
  • Artificial data sets (easy to build, unlimited in
    size, but too far removed from reality).
  • Real-world-inspired artificial data, i.e.,
    real-world data sets artificially augmented (easy
    to build, closer to reality).
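  The multiplicity effect can be handled with a standard correction; a
  minimal sketch of the Bonferroni adjustment, with made-up p-values
  purely for illustration:

```python
# Sketch: Bonferroni correction for comparisons over many data sets.
# With k comparisons at overall level alpha, each individual test is
# run at alpha / k; only p-values below that threshold are significant.
alpha = 0.05
p_values = [0.003, 0.020, 0.012, 0.048, 0.001]  # illustrative only
k = len(p_values)

threshold = alpha / k
for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < threshold else "not significant"
    print(f"data set {i}: p = {p:.3f} -> {verdict} at alpha/k = {threshold:.4f}")
```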

10
Selecting Data Sets V
  • Situations 2 & 3: How do we select data sets from
    those available?
  • Select all those that are available and meet the
    constraints of the algorithms under study. For
    example, the UCI repository contains many data
    sets, but only a subset of these are multi-class,
    only a subset has exclusively nominal attributes,
    only a subset has no missing values, and so on.
  • In order to increase the number of domains
    available to researchers or practitioners of Data
    Mining, the data sets can be amended so that as
    many of them as possible conform to the
    requirements of the classifiers (see the sketch
    below).
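  A minimal sketch of such amendments, assuming scikit-learn: imputing
  missing values and one-hot encoding nominal attributes so that more
  data sets meet a classifier's input requirements. The pipeline shown
  is one common choice, not the slides' prescription:

```python
# Sketch: amend a data set to conform to a classifier's requirements:
# fill in missing numeric values and one-hot encode nominal attributes.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

X = np.array([[1.0, "red"], [np.nan, "blue"], [3.0, "red"]], dtype=object)

amend = ColumnTransformer([
    ("numeric", SimpleImputer(strategy="mean"), [0]),      # mean-impute column 0
    ("nominal", OneHotEncoder(handle_unknown="ignore"), [1]),  # encode column 1
])
X_amended = amend.fit_transform(X)
print(X_amended)
```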

11
Selecting performance measures
  • Situations 2 and 3: Caruana and Niculescu-Mizil
    (2004) suggest that root mean squared error (RMSE)
    is the best general-purpose measure, since it is
    the one best correlated with the other eight
    measures they use. Researchers are, however,
    encouraged to use a variety of different metrics
    in order to discover the various strengths and
    shortcomings of each classifier and each domain
    more specifically.
  • Situations 1 and 4: We distinguish between the
    following cases (a sketch computing several of
    these measures follows this list):
  • Balanced versus imbalanced domains: ROC
  • Certainty of the decision matters: B & K
  • All the classes matter: RMSE
  • The problem is binary but one class matters more
    than the other: Precision, Recall, F-measure,
    Sensitivity, Specificity, Likelihood Ratios.
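  A minimal sketch computing several of these measures with
  scikit-learn; the toy labels and scores are illustrative only:

```python
# Sketch: RMSE, ROC AUC, precision, recall, and F-measure on toy
# binary predictions (probability scores thresholded at 0.5).
import numpy as np
from sklearn.metrics import (mean_squared_error, roc_auc_score,
                             precision_score, recall_score, f1_score)

y_true  = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.8, 0.6, 0.9, 0.3])  # predicted P(class 1)
y_pred  = (y_score >= 0.5).astype(int)

print("RMSE:     ", np.sqrt(mean_squared_error(y_true, y_score)))
print("ROC AUC:  ", roc_auc_score(y_true, y_score))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F-measure:", f1_score(y_true, y_pred))
```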

12
Selecting an error estimation method and
statistical test I
  • If the size of the data set is large enough (every
    testing set contains at least 30 examples) and the
    statistic of interest to the user has a parametric
    test associated with it: cross-validation can be
    tried (but see the next slide).
  • If the data set is particularly small, i.e., if
    some of the testing sets contain fewer than 30 or
    so examples: Bootstrapping or Randomization (see
    the sketch below).
  • If the statistic of interest does not have a
    statistical test associated with it: Bootstrapping
    or Randomization.
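  A minimal sketch of bootstrapping an error estimate on a small data
  set; the classifier, data set, and replication count are illustrative
  choices for the basic resample-and-evaluate loop, not a specific
  published procedure:

```python
# Sketch: bootstrap estimate of error rate.
# Train on a bootstrap sample; test on the examples left out of it.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n = len(X)
errors = []

for _ in range(200):
    idx = rng.integers(0, n, size=n)        # sample n indices with replacement
    oob = np.setdiff1d(np.arange(n), idx)   # out-of-bag examples for testing
    if len(oob) == 0:
        continue
    clf = DecisionTreeClassifier().fit(X[idx], y[idx])
    errors.append(1.0 - clf.score(X[oob], y[oob]))

print("Bootstrap error estimate: %.3f +/- %.3f" % (np.mean(errors), np.std(errors)))
```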

13
Selecting an error estimation method and
statistical test II
  • Question: How can one tell whether
    cross-validation is appropriate for one's
    purposes?
  • Two ways:
  • Visual: plot the distribution and check its shape
    visually.
  • Apply a hypothesis test designed to see whether
    the distribution is normal or not (e.g.,
    chi-squared goodness of fit, Kolmogorov-Smirnov
    goodness of fit, etc.; see the sketch below).
  • Since no practical distribution will be exactly
    normal, we must also look into the robustness of
    the various statistical methods considered. The
    t-test is quite robust to violations of the
    normality assumption.
  • If the distribution is far from normal,
    non-parametric tests must be used.
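  A minimal sketch of the second check, assuming scipy: a
  Kolmogorov-Smirnov goodness-of-fit test against a normal distribution
  fitted to the sample's own mean and standard deviation (a rough
  check; the scores are illustrative):

```python
# Sketch: test whether a sample of cross-validation scores looks normal.
# A small p-value suggests the sample departs from normality.
import numpy as np
from scipy import stats

scores = np.array([0.81, 0.84, 0.79, 0.85, 0.80, 0.83, 0.82, 0.78, 0.86, 0.81])

stat, p = stats.kstest(scores, "norm", args=(scores.mean(), scores.std(ddof=1)))
print(f"KS statistic = {stat:.3f}, p-value = {p:.3f}")

# Shapiro-Wilk is another common normality check:
print(stats.shapiro(scores))
```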

14
Selecting an error estimation method and
statistical test III
  • The robustness of a procedure is important, since
    it ensures that the reported significance level is
    close to the true one.
  • However, robustness does not answer the question
    of whether efficient use is made of the data, so
    that a false null hypothesis can be rejected.
  • Power should also be considered.
  • The power of a test depends on some intrinsic
    nature of that test, but also on the shape and
    size of the population to which it is applied.
  • Example: Parametric tests based on the normality
    assumption are generally as powerful as, or more
    powerful than, non-parametric tests based on ranks
    in the case of distribution functions with
    lighter tails than the normal distribution.

15
Selecting an error estimation method and
statistical test IV
  • But: parametric tests based on the normality
    assumption are less powerful than non-parametric
    ones in the case where the tails of the
    distribution are heavier than those of the normal
    distribution (an important kind of data
    presenting such distributions is data containing
    outliers). A simulation sketch of this power
    comparison follows this list.
  • Note that the relative power of parametric and
    non-parametric tests does not change as a
    function of sample size, even if a test is
    asymptotically distribution-free (i.e., if it
    becomes more and more robust as the sample
    size increases).
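  A minimal simulation sketch of this power comparison, assuming scipy:
  paired differences drawn from a heavy-tailed distribution (Student t
  with 2 degrees of freedom) plus a small shift, tested with both the
  t-test and the Wilcoxon signed-rank test. The sample size, shift, and
  replication count are illustrative:

```python
# Sketch: estimate the power of the t-test vs. the Wilcoxon signed-rank
# test when paired differences are heavy-tailed (outlier-prone).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, shift, reps, alpha = 30, 0.5, 2000, 0.05
t_rejects = w_rejects = 0

for _ in range(reps):
    diffs = stats.t.rvs(df=2, size=n, random_state=rng) + shift
    if stats.ttest_1samp(diffs, 0.0).pvalue < alpha:
        t_rejects += 1
    if stats.wilcoxon(diffs).pvalue < alpha:
        w_rejects += 1

# With heavy tails, the rank-based test typically rejects more often.
print("t-test power:   %.3f" % (t_rejects / reps))
print("Wilcoxon power: %.3f" % (w_rejects / reps))
```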