1
Classification
  • Vasileios Hatzivassiloglou
  • University of Texas at Dallas

2
Classification in bioinformatics
  • Given samples of activated/non-activated gene
    combinations (from microarrays), detect a
    particular disease or subtype of disease (e.g.,
    arthritis)
  • Given samples of interacting and non-interacting
    protein pairs, predict from their sequences
    whether two proteins interact
  • Given several known types of protein folds,
    classify new proteins into one of these types

3
Two kinds of descriptions
  • Syntactic or structural
  • Description is symbolic, with interconnected
    components
  • Example: a first-order logic description of sets
  • Decision-theoretic or probabilistic
  • The description is a vector of numeric values,
    corresponding to measurements of features
  • Classification attempts to select the class that
    best fits the data
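
To make the decision-theoretic view concrete, here is a minimal sketch (not from the slides; the feature names and values are invented) of a sample encoded as a numeric feature vector with an attached class label:

```python
import numpy as np

# Hypothetical features measured for one protein sample
# (names and values are invented for illustration)
feature_names = ["hydrophobicity", "molecular_weight", "isoelectric_point"]
x = np.array([0.62, 52.4, 6.8])  # decision-theoretic description: a numeric vector
y = "interacting"                # class label attached to this training sample

print(dict(zip(feature_names, x)), "->", y)
```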

4
Supervised learning
  • Classification is supervised learning, because we
    are given examples from each class along with
    their descriptions and class labels
  • In unsupervised learning, part of the description
    is unavailable
  • In clustering, the classes themselves are learned
    without labeled examples by detecting
    regularities among the data
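
A minimal sketch of the contrast, assuming scikit-learn is available: a supervised classifier learns from the labels, while a clustering algorithm must recover the two groups from regularities in the feature vectors alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic classes of 2-D feature vectors, 20 samples each
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(3.0, 1.0, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Supervised: the class labels y guide the learning
clf = LogisticRegression(max_iter=200).fit(X, y)

# Unsupervised: k-means must find two groups from X alone
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)

print("supervised predictions:", clf.predict(X[:3]))
print("cluster assignments:  ", clusters[:3])
```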

5
Stages of an operational classifier
  • Collect data
  • Extract features
  • Label some data with class information
    • labels may already be available for part of
      the data
    • experts may have to be asked to label data
      specifically for classification (expensive!)
  • Apply a classification algorithm
  • Measure performance
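
These stages map onto a short script. The sketch below is one plausible realization, using scikit-learn's bundled iris data as a stand-in for collected, expert-labeled samples:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Collect data and class labels (iris ships with expert labels already)
X, y = load_iris(return_X_y=True)

# Hold out a test set before any fitting, to avoid training on test data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Extract/normalize features: fit the scaler on the training data only
scaler = StandardScaler().fit(X_tr)

# Apply a classification algorithm
clf = LogisticRegression(max_iter=200).fit(scaler.transform(X_tr), y_tr)

# Measure performance on the held-out test set
print("accuracy:", accuracy_score(y_te, clf.predict(scaler.transform(X_te))))
```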

6
Features
  • Selection of features is very important because
    all later operations depend on them
  • Features capture/abstract all our knowledge about
    the data
  • Good to have many features because we may capture
    additional information
  • Bad to have many features because of the danger
    of overfitting any model
  • Best to have multiple uncorrelated features

7
Classification pitfalls
  • Selecting representative training and test sets
  • Discovering appropriate features
  • Overfitting the model
  • Training on the test data: adapting the
    classifier according to information not in the
    training set, e.g., by changing the model

8
Measuring performance
  • Done on a test set separate from the training set
    (the examples with known labels)
  • We need to know the class labels in the test set
    (without making them available to the classifier)
    in order to evaluate the classifier's performance
  • Both sets must be representative of the problem
    instances, which is not always the case
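
One way to make the hidden-labels point concrete, assuming scikit-learn: the classifier sees only the test features, and the test labels enter only when the predictions are scored. The stratify option is one simple device for keeping both sets representative of the class proportions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps class proportions similar in both sets,
# one simple step toward making them representative
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

preds = clf.predict(X_te)          # the classifier sees only test features
accuracy = (preds == y_te).mean()  # test labels are used only for scoring
print(f"accuracy: {accuracy:.3f}")
```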

9
Limited data
  • In any setting, we have a fixed amount of data
    for training and testing
  • How much should be assigned to each?
  • The more data in the training set, the better the
    trained classifier will be
  • The more data in the test set, the more accurate
    our estimate of classifier performance on unseen
    data will be

10
Cross-validation
  • Addresses the dilemma of dividing limited data
    between training and testing
  • Given a set of labeled samples D, separate it
    into k (nearly) equal subsets
  • Perform k rounds of classifier building and
    evaluation (k-fold cross-validation)
  • In each round, train on k-1 subsets and test on
    the remaining one
  • Compose a classifier from the k constructed ones
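
A sketch of k-fold cross-validation using scikit-learn's KFold: each round trains on k-1 subsets, tests on the held-out one, and the per-fold scores are averaged at the end.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

k = 5
scores = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    # train on k-1 subsets, test on the remaining one
    clf = LogisticRegression(max_iter=200).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

print(f"{k}-fold accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```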

11
Feature selection
  • Usually, we start with many potential features
    and want to use a subset of those
  • The process of eliminating features can be
    automated based on measurements of each feature's
    contribution to the classifier:
    • analysis of variance
    • information criteria (e.g., the Akaike
      Information Criterion, based on likelihood
      ratios, or relative entropy)
  • Usually an iterative process
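
The analysis-of-variance criterion mentioned above is available off the shelf; this sketch (one possible realization, using scikit-learn) scores each feature's contribution with an ANOVA F-test and keeps only the top-scoring ones.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

data = load_breast_cancer()

# Score each feature's contribution to the class with an ANOVA F-test,
# then keep only the 5 highest-scoring features
selector = SelectKBest(score_func=f_classif, k=5).fit(data.data, data.target)
kept = data.feature_names[selector.get_support()]
print("selected features:", list(kept))
```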

12
Feature representation
  • We view each sample (training or test) as a point
    in n-dimensional space
  • Each feature provides one of these dimensions
  • Features can be discrete or continuous
  • Samples can be viewed as points or, equivalently,
    as n-dimensional vectors
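
Under this geometric view, similarity between samples becomes distance between points in feature space; a small numpy sketch with made-up 4-dimensional vectors:

```python
import numpy as np

# Three samples described by the same n = 4 features (values are made up)
a = np.array([0.9, 1.2, 0.0, 3.4])
b = np.array([1.0, 1.1, 0.1, 3.3])
c = np.array([5.0, 0.2, 2.7, 0.1])

# Euclidean distance in feature space: a and b lie close together, c is far away
print("d(a, b) =", np.linalg.norm(a - b))
print("d(a, c) =", np.linalg.norm(a - c))
```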