1
Classification
  • Vasileios Hatzivassiloglou
  • University of Texas at Dallas

2
Classification in bioinformatics
  • Given samples of activated/non-activated gene
    combinations (from microarrays), detect a
    particular disease or subtype of disease (e.g.,
    arthritis)
  • Given samples of interacting and non-interacting
    protein pairs, predict from their sequences
    whether two proteins interact
  • Given several known types of protein folds,
    classify new proteins into one of these types

3
Two kinds of descriptions
  • Syntactic or structural
  • Description is symbolic, with interconnected
    components
  • Example: a first-order logic description of sets
  • Decision-theoretic or probabilistic
  • The description is a vector of numeric values,
    corresponding to measurements of features
  • Classification attempts to select the class that
    best fits the data
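
To make the decision-theoretic view concrete, here is a minimal sketch (not from the slides; the feature names and values are invented) of a sample encoded as a numeric feature vector with an attached class label:

```python
import numpy as np

# Hypothetical features measured for one protein sample
# (names and values are invented for illustration)
feature_names = ["hydrophobicity", "molecular_weight", "isoelectric_point"]
x = np.array([0.62, 52.4, 6.8])  # decision-theoretic description: a numeric vector
y = "interacting"                # class label attached to this training sample

print(dict(zip(feature_names, x)), "->", y)
```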

4
Supervised learning
  • Classification is supervised learning, because we
    are given examples from each class along with
    their descriptions and class labels
  • In unsupervised learning, part of the description
    is unavailable
  • In clustering, the classes themselves are learned
    without labeled examples by detecting
    regularities among the data
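
A minimal sketch of the contrast, assuming scikit-learn is available: a supervised classifier learns from the labels, while a clustering algorithm must recover the two groups from regularities in the feature vectors alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic classes of 2-D feature vectors, 20 samples each
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(3.0, 1.0, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Supervised: the class labels y guide the learning
clf = LogisticRegression(max_iter=200).fit(X, y)

# Unsupervised: k-means must find two groups from X alone
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)

print("supervised predictions:", clf.predict(X[:3]))
print("cluster assignments:  ", clusters[:3])
```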

5
Stages of an operational classifier
  • Collect data
  • Extract features
  • Label some data with class information
    • labels may already be available for part of
      the data
    • experts may have to be asked to label data
      specifically for classification (expensive!)
  • Apply a classification algorithm
  • Measure performance
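
These stages map onto a short script. The sketch below is one plausible realization, using scikit-learn's bundled iris data as a stand-in for collected, expert-labeled samples:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Collect data and class labels (iris ships with expert labels already)
X, y = load_iris(return_X_y=True)

# Hold out a test set before any fitting, to avoid training on test data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Extract/normalize features: fit the scaler on the training data only
scaler = StandardScaler().fit(X_tr)

# Apply a classification algorithm
clf = LogisticRegression(max_iter=200).fit(scaler.transform(X_tr), y_tr)

# Measure performance on the held-out test set
print("accuracy:", accuracy_score(y_te, clf.predict(scaler.transform(X_te))))
```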

6
Features
  • Selection of features is very important because
    all later operations depend on them
  • Features capture/abstract all our knowledge about
    the data
  • Good to have many features because we may capture
    additional information
  • Bad to have many features because of the danger
    of overfitting any model
  • Best to have multiple uncorrelated features

7
Classification pitfalls
  • Selecting representative training and test sets
  • Discovering appropriate features
  • Overfitting the model
  • Training on the test data: adapting the
    classifier according to information not in the
    training set, e.g., by changing the model

8
Measuring performance
  • Done on a test set separate from the training set
    (the examples with known labels)
  • We need to know the class labels in the test set
    (without making them available to the classifier)
    in order to evaluate the classifier's performance
  • Both sets must be representative of the problem
    instances, which is not always the case
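
One way to make the hidden-labels point concrete, assuming scikit-learn: the classifier sees only the test features, and the test labels enter only when the predictions are scored. The stratify option is one simple device for keeping both sets representative of the class proportions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps class proportions similar in both sets,
# one simple step toward making them representative
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

preds = clf.predict(X_te)          # the classifier sees only test features
accuracy = (preds == y_te).mean()  # test labels are used only for scoring
print(f"accuracy: {accuracy:.3f}")
```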

9
Limited data
  • In any setting, we have a fixed amount of data
    for training and testing
  • How much should be assigned to each?
  • The more data in the training set, the better the
    trained classifier will be
  • The more data in the test set, the more accurate
    our estimate of classifier performance on unseen
    data will be

10
Cross-validation
  • Addresses the dilemma of dividing limited data
    between training and testing
  • Given a set of labeled samples D, separate it
    into k (nearly) equal subsets
  • Perform k rounds of classifier building and
    evaluation (k-fold cross-validation)
  • In each round, train on k-1 subsets and test on
    the remaining one
  • Compose a classifier from the k constructed ones
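
A sketch of k-fold cross-validation using scikit-learn's KFold: each round trains on k-1 subsets, tests on the held-out one, and the per-fold scores are averaged at the end.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

k = 5
scores = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    # train on k-1 subsets, test on the remaining one
    clf = LogisticRegression(max_iter=200).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

print(f"{k}-fold accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```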

11
Feature selection
  • Usually, we start with many potential features
    and want to use a subset of those
  • The process of eliminating features can be
    automated based on measurements of each feature's
    contribution to the classifier:
    • analysis of variance
    • information criteria (e.g., the Akaike
      Information Criterion, based on likelihood
      ratios, or relative entropy)
  • Usually an iterative process
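
The analysis-of-variance criterion mentioned above is available off the shelf; this sketch (one possible realization, using scikit-learn) scores each feature's contribution with an ANOVA F-test and keeps only the top-scoring ones.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

data = load_breast_cancer()

# Score each feature's contribution to the class with an ANOVA F-test,
# then keep only the 5 highest-scoring features
selector = SelectKBest(score_func=f_classif, k=5).fit(data.data, data.target)
kept = data.feature_names[selector.get_support()]
print("selected features:", list(kept))
```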

12
Feature representation
  • We view each sample (training or test) as a point
    in n-dimensional space
  • Each feature provides one of these dimensions
  • Features can be discrete or continuous
  • Samples can be viewed as points or, equivalently,
    as n-dimensional vectors
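
Under this geometric view, similarity between samples becomes distance between points in feature space; a small numpy sketch with made-up 4-dimensional vectors:

```python
import numpy as np

# Three samples described by the same n = 4 features (values are made up)
a = np.array([0.9, 1.2, 0.0, 3.4])
b = np.array([1.0, 1.1, 0.1, 3.3])
c = np.array([5.0, 0.2, 2.7, 0.1])

# Euclidean distance in feature space: a and b lie close together, c is far away
print("d(a, b) =", np.linalg.norm(a - b))
print("d(a, c) =", np.linalg.norm(a - c))
```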