Machine Learning and its Applications in Bioinformatics

Transcript and Presenter's Notes

1
Machine Learning and its Applications in
Bioinformatics
  • Yen-Jen Oyang
  • Dept. of Computer Science and Information
    Engineering

2
Observations and Challenges in the Information
Age
  • A huge volume of information has been and is
    being digitized and stored in computers.
  • Due to the volume of digitized information,
    effective exploitation of information is beyond
    the capability of human beings without the aid of
    intelligent computer software.

3
An Example of Supervised Machine Learning (Data
Classification)
  • Given the data set shown on the next slide, can we
    figure out a set of rules that predict the
    classes of the objects?

4
Data Set
5
Distribution of the Data Set
6
Rule Based on Observation
7
Rule Generated by a Kernel Density Estimation
Based Algorithm
  • Let … and … (the formulas were not captured in
    the transcript).
  • If the stated condition holds, then
    prediction = O.
  • Otherwise, prediction = X.

8
(No Transcript)
9
Identifying Boundary of Different Classes of
Objects
10
Boundary Identified
11
Problem Definition ofData Classification
  • In a data classification problem, each object is
    described by a set of attribute values and each
    object belongs to one of the predefined classes.
  • The goal is to derive a set of rules that
    predicts which class a new object should belong
    to, based on a given set of training samples.
    Data classification is also called supervised
    learning.

12
The Vector Space Model
  • In the vector space model, each object is
    described by a number of numerical
    attributes/features.
  • For example, a person can be described by his
    height, weight, and age.
  • It is typical that the objects are described by a
    large number of attributes/features.

13
Transformation of Categorical Attributes into
Numerical Attributes
  • Represent the attribute values of the object in a
    binary table form as exemplified in the following

14
  • Assign appropriate weight to each column.
  • Treat the weighted vector of each row as the
    feature vector of the corresponding object.
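  • A minimal Python sketch of this transformation
    (one-hot encoding with per-column weights; the
    attribute values and weights below are
    illustrative, not taken from the slides):

    # Turn a categorical attribute into weighted binary features (a sketch;
    # the category names and column weights are made-up examples).
    CATEGORIES = ["sunny", "cloudy", "rainy"]
    WEIGHTS = {"sunny": 1.0, "cloudy": 0.5, "rainy": 2.0}

    def encode(value):
        """Return the weighted one-hot vector for one object."""
        return [WEIGHTS[c] if value == c else 0.0 for c in CATEGORIES]

    # Each row of the resulting table is the feature vector of one object.
    objects = ["sunny", "rainy", "sunny"]
    feature_vectors = [encode(v) for v in objects]
    print(feature_vectors)  # [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0], [1.0, 0.0, 0.0]]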

15
Transformation of the Similarity/Dissimilarity
Matrix Model
  • In this model, a matrix records the
    similarity/dissimilarity scores between every
    pair of objects.

16
  • We may select P2, P5, and P6 as representatives
    and use the reciprocals of the similarity scores
    to these representatives to describe an object.
  • For example, the feature vectors of P1 and P2 are
    <1/53, 1/35, 1/180> and <0, 1/816, 1/606>,
    respectively.
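  • A minimal Python sketch of this construction (the
    similarity scores below are placeholders, not the
    values from the slide's matrix):

    # Describe each object by the reciprocals of its similarity scores
    # to the chosen representatives (illustrative values only).
    similarity = {("P1", "P2"): 53, ("P1", "P5"): 35, ("P1", "P6"): 180}
    representatives = ["P2", "P5", "P6"]

    def feature_vector(obj):
        """Reciprocal of the similarity to each representative (0 if undefined)."""
        vec = []
        for rep in representatives:
            s = similarity.get((obj, rep)) or similarity.get((rep, obj))
            vec.append(1.0 / s if s else 0.0)
        return vec

    print(feature_vector("P1"))  # [1/53, 1/35, 1/180]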

17
Applications ofData Classificationin
Bioinformatics
  • In microarray data analysis, data classification
    is employed to predict the class of a new sample
    based on the existing samples with known class.
  • Data classification has also been widely employed
    in prediction of protein family, protein fold,
    and protein secondary structure.

18
  • For example, in the Leukemia data set, there are
    72 samples and 7129 genes.
  • 25 Acute Myeloid Leukemia (AML) samples.
  • 38 B-cell Acute Lymphoblastic Leukemia samples.
  • 9 T-cell Acute Lymphoblastic Leukemia samples.

19
Model of Microarray Data Sets
20
Alternative Data Classification Algorithms
  • Decision tree (C4.5 and C5.0)
  • Instance-based learning (KNN)
  • Naïve Bayesian classifier
  • RBF network
  • Support vector machine (SVM)
  • Kernel Density Estimation (KDE) based classifier

21
Instance-Based Learning
  • In instance-based learning, we take the k nearest
    training samples of a new instance (v1, v2, …,
    vm) and assign the new instance to the class that
    has the most instances among the k nearest
    training samples.
  • Classifiers that adopt instance-based learning
    are commonly called KNN classifiers.
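  • A minimal Python sketch of a KNN classifier
    (Euclidean distance and majority vote; the
    training data below are placeholders):

    from collections import Counter
    import math

    def knn_predict(train, query, k=3):
        """train: list of (feature_vector, class_label); query: a feature vector."""
        # Sort the training samples by their Euclidean distance to the query.
        by_dist = sorted(train, key=lambda s: math.dist(s[0], query))
        # Majority vote among the k nearest training samples.
        votes = Counter(label for _, label in by_dist[:k])
        return votes.most_common(1)[0][0]

    train = [([1.0, 2.0], "O"), ([1.5, 1.8], "O"),
             ([5.0, 8.0], "X"), ([6.0, 9.0], "X")]
    print(knn_predict(train, [1.2, 1.9], k=3))  # -> "O"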

22
Example of the KNN Classifiers
  • If a 1NN classifier is employed, then the
    prediction for "?" is X.
  • If a 3NN classifier is employed, then the
    prediction for "?" is O.

23
Decision Function of the KNN Classifier
  • Assume that there are two classes of samples,
    positive and negative.
  • The decision function of a KNN classifier is
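  • A standard two-class form (the slide's own
    formula was not captured in the transcript, so
    this is a conventional statement rather than the
    exact expression) is
    f(v) = sign( Σ_{xi ∈ Nk(v)} yi ),
    where Nk(v) is the set of the k nearest training
    samples to v and yi ∈ {+1, −1} is the label of xi.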

24
Extension of the KNN Classifier
  • We may extend the KNN classifier by weighting the
    contribution of each neighbor with a term related
    to its distance to the query vector
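  • A commonly used weighted form (an illustrative
    choice, not necessarily the slide's formula) is
    f(v) = sign( Σ_{xi ∈ Nk(v)} yi / (‖v − xi‖² + ε) ),
    so that closer neighbors contribute more to the
    vote; ε is a small constant that avoids division
    by zero.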

25
An RBF Network Based Classifier with Gaussian
Kernels
  • It is typical that all kernel functions are
    radial basis functions of the same form.
  • With the popular Gaussian function, the decision
    function is of the following form
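  • (The formula was not captured in the transcript;
    a standard Gaussian-kernel form is
    f(v) = sign( Σ_i wi · exp(−‖v − μi‖² / (2σi²)) + b ),
    where the μi are the kernel centers, the σi their
    widths, and the wi and b are learned parameters.)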

26
The Common Structure of the RBF Network Based
Data Classifier
27
(No Transcript)
28
Regularization of an RBF Network Based Classifier
  • The conventional approaches proceed by either
    employing a constant σ for all kernel functions
    or employing a heuristic mechanism to set each σi
    individually, e.g. a multiple of the average
    distance among samples, and attempt to minimize
    the following objective,
  • where each (xi, yi) is a learning sample.
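  • A standard form of such an objective (an
    assumption; the slide's own formula was not
    captured) is
    Σ_i ( f(xi) − yi )² + λ · Σ_j wj²,
    i.e. the squared error over the learning samples
    plus a regularization penalty on the kernel
    weights.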

29
  • The regularization term is included to avoid
    overfitting, and λ is to be set through cross
    validation.

30
(No Transcript)
31
(No Transcript)
32
Decision Function of an SVM
  • A prediction of the class of a new sample located
    at v in the vector space is based on the
    following rule
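  • (The rule itself was not captured in the
    transcript. A standard form is
    f(v) = sign( Σ_i αi yi K(xi, v) + b ),
    where the xi with αi > 0 are the support vectors,
    yi ∈ {+1, −1} their labels, K the kernel
    function, and b the bias.)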

33
The Kernel Density Estimation (KDE) Based
Classifier
  • The KDE based learning algorithm constructs one
    approximate probability density function for each
    class of objects.
  • Classification of a new object is conducted based
    on the likelihood function.
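  • A minimal Python sketch of this idea: build one
    Gaussian kernel density estimate per class and
    assign a new object to the class with the largest
    estimated likelihood (the bandwidth below is a
    placeholder, not the width-selection rule
    described later in these slides):

    import math

    def gaussian_kde(samples, bandwidth):
        """Return a 1-D kernel density estimate built from the given samples."""
        def density(x):
            return sum(
                math.exp(-((x - s) ** 2) / (2 * bandwidth ** 2))
                / (bandwidth * math.sqrt(2 * math.pi))
                for s in samples
            ) / len(samples)
        return density

    # One approximate density function per class (toy 1-D data).
    class_samples = {"O": [1.0, 1.2, 0.9, 1.1], "X": [3.0, 3.2, 2.8]}
    densities = {c: gaussian_kde(s, 0.3) for c, s in class_samples.items()}

    def classify(x):
        # Pick the class with the largest estimated likelihood at x.
        return max(densities, key=lambda c: densities[c](x))

    print(classify(1.05))  # -> "O"
    print(classify(2.9))   # -> "X"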

34
Identifying Boundary of Different Classes of
Objects
35
Boundary Identified
36
Problem Definition of Kernel Density Estimation
  • Given a set of samples
  • randomly taken from a probability distribution,
    we want to find a set of symmetric kernel
    functions and the corresponding weights such that
    the weighted sum approximates the underlying
    probability density function.
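  • In symbols (a standard statement of the problem;
    the slide's own formula was not captured): given
    samples s1, …, sn drawn from a density f, find
    kernels K1, …, Km and weights w1, …, wm such that
    fhat(x) = Σ_j wj · Kj(x) ≈ f(x).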

37
The Proposed KDE Based Classifier
  • We chose to employ the Gaussian function and set
    the width of each Gaussian function to a multiple
    of the average distance among neighboring
    samples.

38
  • The width of each Gaussian function can be
    estimated as follows.
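  • (One plausible reading, based on the slide text
    rather than the original formula: σi ≈ β · the
    average distance from sample si to its nearest
    neighboring samples, with the multiplier β set
    through cross validation.)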

39
Accuracy of Different Classification Algorithms
40
Comparison of Execution Time (in seconds)
41
Parameter Setting through Cross Validation
  • When carrying out data classification, we
    normally need to set one or more parameters
    associated with the data classification
    algorithm.
  • For example, we need to set the value of k with
    the KNN classifier.
  • The typical approach is to conduct cross
    validation to find out the optimal value.

42
  • In the cross validation process, we set the
    parameters of the classifier to a particular
    combination of values that we are interested in
    and then evaluate how good the combination is
    using one of the following schemes.
  • With the leave-one-out cross validation scheme,
    we attempt to predict the class of each sample
    using the remaining samples as the training data
    set.

43
  • With 10-fold cross validation, we evenly divide
    the training data set into 10 subsets. Each
    time, we test the prediction accuracy of one of
    the 10 subsets using the other 9 subsets as the
    training set.
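  • A minimal Python sketch of 10-fold cross
    validation for choosing k of a KNN classifier (it
    reuses the knn_predict sketch shown earlier; the
    fold split is a simple round-robin):

    def k_fold_accuracy(samples, k_neighbors, n_folds=10):
        """Average accuracy of KNN over n_folds cross-validation folds."""
        folds = [samples[i::n_folds] for i in range(n_folds)]
        correct = total = 0
        for i, test_fold in enumerate(folds):
            train = [s for j, f in enumerate(folds) if j != i for s in f]
            for features, label in test_fold:
                correct += (knn_predict(train, features, k=k_neighbors) == label)
                total += 1
        return correct / total

    # Keep the parameter value with the best cross-validation accuracy, e.g.
    # best_k = max([1, 3, 5, 7], key=lambda k: k_fold_accuracy(samples, k))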

44
Overfitting
  • Overfitting occurs when we construct a classifier
    based on an insufficient number of samples.
  • As a result, the classifier may work well on the
    training dataset but fail to deliver an
    acceptable accuracy in the real world.

45
  • For example, if we toss a fair coin two times,
    there is a 50% chance that we will observe the
    same side up in both tosses.
  • Therefore, if we draw our conclusion on how fair
    the coin is from just two tosses, we may end up
    overfitting the dataset.
  • Overfitting is a serious problem in analyzing
    high-dimensional datasets, e.g. the microarray
    datasets.

46
Alternative Similarity Functions
  • Let <vr,1, vr,2, …, vr,n> and <vt,1, vt,2, …,
    vt,n> be the gene expression vectors, i.e. the
    feature vectors, of samples Sr and St,
    respectively. Then, the following alternative
    similarity functions can be employed:
  • Euclidean distance

47
  • Cosine
  • Correlation coefficient
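  • A Python sketch of the three measures (standard
    definitions; the slides' own formulas were not
    captured):

    import math

    def euclidean_distance(vr, vt):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(vr, vt)))

    def cosine_similarity(vr, vt):
        dot = sum(a * b for a, b in zip(vr, vt))
        norm_r = math.sqrt(sum(a * a for a in vr))
        norm_t = math.sqrt(sum(b * b for b in vt))
        return dot / (norm_r * norm_t)

    def correlation_coefficient(vr, vt):
        # Pearson correlation: cosine similarity of the mean-centered vectors.
        mr, mt = sum(vr) / len(vr), sum(vt) / len(vt)
        return cosine_similarity([a - mr for a in vr], [b - mt for b in vt])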

48
Importance of Feature Selection
  • Inclusion of features that are not correlated
    with the classification decision may make the
    problem even more complicated.
  • For example, in the data set shown on the
    following slide, inclusion of the feature
    corresponding to the y-axis causes incorrect
    prediction of the test instance marked by "?" if
    a 3NN classifier is employed.

49
(Scatter plot with x and y axes; the two classes are separated by the line x = 10)
  • It is apparent that the o's and the x's are
    separated by x = 10. If only the attribute
    corresponding to the x-axis were selected, then
    the 3NN classifier would predict the class of "?"
    correctly.

50
Linearly Separable and Non-Linearly Separable
  • Some datasets are linearly separable.
  • However, there are more datasets that are
    non-linearly separable.

51
An Example of Linearly Separable
52
An Example of Non-Linearly Separable
53
A Simplest Case of Linearly Separable
(Scatter plot: the two classes are separated by the line x = 10)
54
Feature Selection Based on the Univariate Analysis
(Figure: Class 1, Class 2, and Class 3)
55
(No Transcript)
56
An Example of Univariate Analysis
57
Joint p.m.f. of X, Y, and C
58
(No Transcript)
59
Blind Spot of the Univariate Analysis
  • The univariate analysis is not able to identify
    crucial features in the following example

60
The Demonstrative Data Set
61
Joint p.m.f. of X, Y, and C
62
  • For Gene X,
  • For Gene Y,

63
  • However, if we apply the following linear
    transformation, then we will be able to identify
    the significance of these two genes:
  • 2·(Gene X) − (Gene Y)

64
  • For 2·(Gene X) − (Gene Y),

65
  • On the other hand, if we employ the linear
    operator (x − 2y), then we obtain

66
  • Accordingly, the issue now is how we can figure
    out the optimal linear operator of the form
    αx + βy for the projection.
  • In the 2-D case, given a set of samples
    (x1, y1), (x2, y2), …, (xn, yn),
    vi = cos θ · xi + sin θ · yi
    is the value obtained by projecting (xi, yi) onto
    the line sin θ · x − cos θ · y = 0, i.e. onto the
    component along the vector (cos θ, sin θ), as
    shown on the following slide.

67
(No Transcript)
68
Feature Selection with Independent Component
Analysis (ICA)
  • In recent years, ICA has emerged as a promising
    approach for carrying out multivariate analysis.

69
Basic Idea
  • The ICA algorithm attempts to identify a plane so
    that when we project the data set onto the plane,
    the distribution is maximally non-Gaussian.

70
A Measure of Non-Gaussianity
  • The kurtosis is commonly employed to measure the
    non-Gaussianity of a data set.
  • The kurtosis of a dataset v1, v2, …, vn is
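  • (The formula was not captured in the transcript.
    A standard sample version is
    kurt(v1, …, vn) = (1/n) · Σ_i ((vi − m)/s)⁴ − 3,
    where m and s are the sample mean and standard
    deviation; the −3 term makes the expected value 0
    for normally distributed data, as stated on the
    next slide.)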

71
  • The expected value of the kurtosis of a set of
    random samples taken from a standard normal
    distribution is 0.
  • If the kurtosis of a set of random samples is
    larger than 0, then the p.d.f. of the
    distribution is sharper than that of the standard
    normal distribution.
  • If the kurtosis of a set of random samples is
    smaller than 0, then the p.d.f. of the
    distribution is flatter than that of the standard
    normal distribution.

72
  • Let kurt(θ) denote the kurtosis of v1, v2, …, vn,
    where vi = cos θ · xi + sin θ · yi.

73
  • The issue now is to find the value of θ that
    minimizes kurt(θ).
  • This is an optimization problem.
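  • A minimal Python sketch of this search: project
    the 2-D samples onto the direction (cos θ, sin θ)
    for a grid of angles and keep the θ whose
    projection has the smallest kurtosis (a
    brute-force stand-in for the optimization methods
    discussed next):

    import math

    def kurtosis(values):
        n = len(values)
        m = sum(values) / n
        s = math.sqrt(sum((v - m) ** 2 for v in values) / n)
        return sum(((v - m) / s) ** 4 for v in values) / n - 3

    def best_theta(points, steps=180):
        """Angle whose projection has the smallest (most negative) kurtosis."""
        def kurt_at(theta):
            proj = [math.cos(theta) * x + math.sin(theta) * y for x, y in points]
            return kurtosis(proj)
        return min((i * math.pi / steps for i in range(steps)), key=kurt_at)

    # points = [(x1, y1), (x2, y2), ...]; theta = best_theta(points)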

74
The Optimization Problem
  • The optimization problem is to find the global
    maximum/minimum of a function.
  • There are several heuristic algorithms designed
    for solving the optimization problem, e.g.
  • gradient descent
  • genetic algorithms
  • simulated annealing.

75
The Gradient Descent Algorithm
  • In the gradient descent algorithm, a number of
    random samples are taken as the starting points.
  • Then, we compute the gradient at each point and
    make a move in the direction in which the slope
    is steepest.
  • This process is repeated a number of times until
    the convergence criterion is met.

76
A 1-D Example
d is a parameter that controls the step size
77
  • The gradient descent algorithm can be applied to
    multidimensional functions. In such cases,
    partial differentiation is involved.
  • If the gradient descent algorithm is to be
    employed, then we must be able to compute the
    gradient of the function at any point in the
    vector space.
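  • A minimal 1-D gradient descent sketch in Python
    (the function and step size are illustrative; for
    minimization we step against the gradient):

    def gradient_descent(grad, x0, step=0.1, iterations=100, tol=1e-6):
        """Minimize a function starting from x0 by stepping against its gradient."""
        x = x0
        for _ in range(iterations):
            move = step * grad(x)
            x -= move                   # move downhill
            if abs(move) < tol:         # convergence criterion
                break
        return x

    # Example: minimize f(x) = (x - 2)^2, whose gradient is 2(x - 2).
    print(gradient_descent(lambda x: 2 * (x - 2), x0=10.0))  # -> ~2.0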

78
Blind Spot of ICA
  • However, ICA may fail in the following
    non-linearly separable dataset

79
Data Clustering
  • Data clustering concerns how to group a set of
    objects based on their similarity of attributes
    and/or their proximity in the vector space.

80
Model of Microarray Data Sets
(Figure: microarray data matrix with samples grouped into Class 1, Class 2, and Class 3)
81
Applications of Data Clustering in Microarray
Data Analysis
  • Data clustering has been employed in microarray
    data analysis for
  • identifying the genes with similar expressions
  • identifying the subtypes of samples.

82
  • For cluster analysis of samples, we can employ
    the feature selection mechanism developed for
    classification of samples.
  • For cluster analysis of genes, each column of
    gene expression data is regarded as the feature
    vector of one gene.

83
The Agglomerative Hierarchical Clustering
Algorithms
  • The agglomerative hierarchical clustering
    algorithms operate by maintaining a sorted list
    of inter-cluster distances/similarities.
  • Initially, each data instance forms a cluster of
    its own.
  • The clustering algorithm repeatedly merges the
    two clusters with the minimum inter-cluster
    distance or the maximum inter-cluster similarity.

84
  • Upon merging two clusters, the clustering
    algorithm computes the distances between the
    newly-formed cluster and the remaining clusters
    and maintains the sorted list of inter-cluster
    distances accordingly.
  • There are a number of ways to define the
    inter-cluster distance
  • minimum distance/maximum similarity
    (single-link)
  • maximum distance/minimum similarity
    (complete-link)
  • average distance/average similarity
  • mean distance (applicable only with the vector
    space model).
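  • A minimal Python sketch of the agglomerative
    procedure using the single-link (minimum
    distance) criterion; dist is any pairwise
    distance function supplied by the caller:

    def single_link_clustering(points, dist, target_clusters):
        """Merge clusters until only target_clusters remain (single-link)."""
        clusters = [[p] for p in points]   # initially, one cluster per instance
        while len(clusters) > target_clusters:
            # Find the pair of clusters with the minimum inter-cluster distance.
            i, j = min(
                ((a, b) for a in range(len(clusters))
                        for b in range(a + 1, len(clusters))),
                key=lambda ab: min(dist(p, q)
                                   for p in clusters[ab[0]]
                                   for q in clusters[ab[1]]),
            )
            clusters[i].extend(clusters.pop(j))   # merge the two closest clusters
        return clusters

  • Complete-link differs only in using the maximum
    (instead of the minimum) pairwise distance when
    comparing two clusters.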

85
  • Given the following similarity matrix, we can
    apply the complete-link algorithm to obtain the
    dendrogram shown on the next slide.

86
  • Assume that the complete-link algorithm is
    employed.
  • If those similarity scores that are less than 0.3
    are excluded, then we obtain 3 clusters: {P1, P4},
    {P2, P5, P6}, {P3}.

(Dendrogram over P1, P4, P2, P5, P6, P3, with merges at similarity levels 0.862, 0.816, 0.494, 0.137, and 0.018)
87
  • If the single-link algorithm is employed, then we
    obtain the following result.

88
Example of the Chaining Effect
Single-link (10 clusters)
Complete-link (2 clusters)
89
Effect of Bias towards Spherical Clusters
Single-link (2 clusters)
Complete-link (2 clusters)
90
K-Means: A Partitional Data Clustering Algorithm
  • The k-means algorithm is probably the most
    commonly used partitional clustering algorithm.
  • The k-means algorithm begins with selecting k
    data instances as the means or centers of k
    clusters.

91
  • The k-means algorithm then executes the following
    loop iteratively until the convergence criterion
    is met.
  • repeat
  • assign every data instance to the closest cluster
    based on the distance between the data instance
    and the center of the cluster
  • compute the new centers of the k clusters
  • until (the convergence criterion is met)

92
  • A commonly-used convergence criterion is
  • If the difference between the cluster centers
    obtained in two consecutive iterations is smaller
    than a threshold, then the algorithm terminates.
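  • A minimal k-means sketch in Python (the initial
    centers are simply the first k instances, a
    simplification of the selection step described
    above):

    import math

    def kmeans(points, k, iterations=100, tol=1e-6):
        """Return the final cluster centers after iterative reassignment."""
        centers = [list(p) for p in points[:k]]   # initial centers
        for _ in range(iterations):
            # Assign every data instance to the closest cluster center.
            clusters = [[] for _ in range(k)]
            for p in points:
                idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
                clusters[idx].append(p)
            # Compute the new center of each cluster (keep the old one if empty).
            new_centers = [
                [sum(dim) / len(c) for dim in zip(*c)] if c else centers[i]
                for i, c in enumerate(clusters)
            ]
            # Convergence criterion: the centers barely move between iterations.
            if all(math.dist(a, b) < tol for a, b in zip(centers, new_centers)):
                break
            centers = new_centers
        return centers

    # centers = kmeans([(1, 1), (1.2, 0.9), (5, 5), (5.1, 4.8)], k=2)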

93
Illustration of the K-Means Algorithm---(I)
94
Illustration of the K-Means Algorithm---(II)
95
Illustration of the K-Means Algorithm---(III)
96
A Case in which the K-Means Algorithm Fails
  • The K-means algorithm may converge to a locally
    optimal state, as the following example
    demonstrates.

Initial Selection
97
Conclusions
  • Machine learning algorithms have been widely
    exploited to tackle many important bioinformatics
    problems.