Title: Machine Learning and its Applications in Bioinformatics
1. Machine Learning and its Applications in Bioinformatics
- Yen-Jen Oyang
- Dept. of Computer Science and Information Engineering
2. Observations and Challenges in the Information Age
- A huge volume of information has been and is being digitized and stored in computers.
- Due to the volume of digitized information, effective exploitation of information is beyond the capability of human beings without the aid of intelligent computer software.
3. An Example of Supervised Machine Learning (Data Classification)
- Given the data set shown on the next slide, can we figure out a set of rules that predict the classes of objects?
4. Data Set
5. Distribution of the Data Set
6. Rule Based on Observation
7. Rule Generated by a Kernel Density Estimation Based Algorithm
- Let fO(v) and fX(v) denote the estimated density functions of class O and class X, respectively.
- If fO(v) > fX(v), then the prediction is O.
- Otherwise, the prediction is X.
9. Identifying Boundary of Different Classes of Objects
10. Boundary Identified
11. Problem Definition of Data Classification
- In a data classification problem, each object is described by a set of attribute values, and each object belongs to one of the predefined classes.
- The goal is to derive a set of rules that predicts which class a new object should belong to, based on a given set of training samples.
- Data classification is also called supervised learning.
12. The Vector Space Model
- In the vector space model, each object is described by a number of numerical attributes/features.
- For example, the physical profile of a person can be described by his or her height, weight, and age.
- It is typical that objects are described by a large number of attributes/features.
13. Transformation of Categorical Attributes into Numerical Attributes
- Represent the attribute values of the object in a binary table form, as exemplified in the following.
14.
- Assign an appropriate weight to each column.
- Treat the weighted vector of each row as the feature vector of the corresponding object.
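A minimal sketch of this transformation in Python; the attribute values and the column weights below are illustrative assumptions, not taken from the slides:

# Minimal sketch: expand a categorical attribute into weighted binary columns.
# The attribute values and weights below are illustrative assumptions.
categories = ["sunny", "cloudy", "rainy"]                  # possible values of one categorical attribute
weights = {"sunny": 1.0, "cloudy": 0.5, "rainy": 2.0}      # weight assigned to each binary column

def to_feature_vector(value):
    """Return the weighted binary (one-hot) encoding of a single categorical value."""
    return [weights[c] if value == c else 0.0 for c in categories]

objects = ["sunny", "rainy", "sunny"]
feature_vectors = [to_feature_vector(v) for v in objects]
print(feature_vectors)   # [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0], [1.0, 0.0, 0.0]]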
15. Transformation of the Similarity/Dissimilarity Matrix Model
- In this model, a matrix records the similarity/dissimilarity scores between every pair of objects.
16.
- We may select P2, P5, and P6 as representatives and use the reciprocals of the similarity scores to these representatives to describe an object.
- For example, the feature vectors of P1 and P2 are <1/53, 1/35, 1/180> and <0, 1/816, 1/606>, respectively.
17. Applications of Data Classification in Bioinformatics
- In microarray data analysis, data classification is employed to predict the class of a new sample based on existing samples with known classes.
- Data classification has also been widely employed in the prediction of protein family, protein fold, and protein secondary structure.
18.
- For example, in the Leukemia data set, there are 72 samples and 7129 genes:
- 25 Acute Myeloid Leukemia (AML) samples
- 38 B-cell Acute Lymphoblastic Leukemia samples
- 9 T-cell Acute Lymphoblastic Leukemia samples
19. Model of Microarray Data Sets
20. Alternative Data Classification Algorithms
- Decision tree (C4.5 and C5.0)
- Instance-based learning (KNN)
- Naïve Bayesian classifier
- RBF network
- Support vector machine (SVM)
- Kernel Density Estimation (KDE) based classifier
21. Instance-Based Learning
- In instance-based learning, we take the k nearest training samples of a new instance (v1, v2, ..., vm) and assign the new instance to the class that has the most instances among those k nearest training samples.
- Classifiers that adopt instance-based learning are commonly called KNN classifiers.
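A minimal Python/NumPy sketch of such a classifier; the function name and the toy data are illustrative:

import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(train_X - query, axis=1)      # Euclidean distance to every training sample
    nearest = np.argsort(dists)[:k]                       # indices of the k closest samples
    votes = Counter(train_y[i] for i in nearest)          # count the class labels among the neighbors
    return votes.most_common(1)[0][0]

# Toy usage with two classes, 'O' and 'X'
train_X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
train_y = np.array(['O', 'O', 'X', 'X'])
print(knn_predict(train_X, train_y, np.array([1.1, 0.9]), k=3))   # -> 'O'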
22. Example of the KNN Classifiers
- If a 1-NN classifier is employed, then the class of the query point is predicted to be X.
- If a 3-NN classifier is employed, then the class of the query point is predicted to be O.
23. Decision Function of the KNN Classifier
- Assume that there are two classes of samples, positive and negative.
- The decision function of a KNN classifier is as follows:
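A standard two-class KNN decision function, assuming labels y_i in {+1, -1} and writing N_k(v) for the k nearest training samples of v, is:

f(\mathbf{v}) = \operatorname{sign}\Bigl(\sum_{\mathbf{s}_i \in N_k(\mathbf{v})} y_i\Bigr),

with v predicted positive if f(v) > 0 and negative otherwise.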
24. Extension of the KNN Classifier
- We may extend the KNN classifier by weighting the contribution of each neighbor with a term related to its distance to the query vector.
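One common distance-weighted variant, assuming the weight of each neighbor is the reciprocal of its squared distance to the query vector v, is:

f(\mathbf{v}) = \operatorname{sign}\Bigl(\sum_{\mathbf{s}_i \in N_k(\mathbf{v})} \frac{y_i}{\lVert\mathbf{v}-\mathbf{s}_i\rVert^{2}}\Bigr).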
25. An RBF Network Based Classifier with Gaussian Kernels
- It is typical that all kernel functions are radial basis functions of the same form.
- With the popular Gaussian function, the decision function takes the following form:
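A standard Gaussian-kernel decision function of this type, assuming centers \mu_i, widths \sigma_i, and weights w_i, is:

f(\mathbf{v}) = \sum_{i=1}^{k} w_i \exp\Bigl(-\frac{\lVert\mathbf{v}-\boldsymbol{\mu}_i\rVert^{2}}{2\sigma_i^{2}}\Bigr),

with the predicted class determined by the sign (or the largest value) of the network output.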
26. The Common Structure of the RBF Network Based Data Classifier
28. Regularization of an RBF Network Based Classifier
- The conventional approaches proceed by either employing a constant σ for all kernel functions or employing a heuristic mechanism to set each σi individually, e.g. as a multiple of the average distance among samples, and then attempt to minimize a regularized error over the learning samples (a typical form is sketched below).
29.
- A regularization term is included to avoid overfitting, and its weight λ is to be set through cross validation.
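A typical regularized least-squares objective of this kind, assuming learning samples (x_j, y_j), network output f, and regularization weight \lambda, is:

\min_{w}\;\sum_{j}\bigl(f(\mathbf{x}_j)-y_j\bigr)^{2} \;+\; \lambda\sum_{i} w_i^{2}.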
32. Decision Function of an SVM
- The prediction of the class of a new sample located at v in the vector space is based on the following rule:
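The standard SVM decision rule, assuming support vectors s_i with labels y_i, Lagrange multipliers \alpha_i, kernel K, and bias b, is:

\text{class of } \mathbf{v} = \operatorname{sign}\Bigl(\sum_{i}\alpha_i\, y_i\, K(\mathbf{s}_i,\mathbf{v}) + b\Bigr).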
33. The Kernel Density Estimation (KDE) Based Classifier
- The KDE based learning algorithm constructs one approximate probability density function for each class of objects.
- Classification of a new object is conducted based on the likelihood function.
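A standard likelihood-based rule, assuming \hat{f}_c is the estimated density of class c and P(c) its prior, is:

\text{predicted class of } \mathbf{v} = \arg\max_{c}\; P(c)\,\hat{f}_c(\mathbf{v}).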
34. Identifying Boundary of Different Classes of Objects
35. Boundary Identified
36. Problem Definition of Kernel Density Estimation
- Given a set of samples randomly taken from a probability distribution, we want to find a set of symmetric kernel functions and the corresponding weights such that the weighted sum of the kernel functions approximates the underlying density.
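In symbols, assuming samples s_1, ..., s_n drawn from an unknown density f, kernel functions K_i, and weights w_i, the goal is:

\hat{f}(\mathbf{v}) = \sum_{i=1}^{n} w_i\,K_i(\mathbf{v}) \approx f(\mathbf{v}).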
37. The Proposed KDE Based Classifier
- We employ the Gaussian function and set the width of each Gaussian function to a multiple of the average distance among neighboring samples.
38.
- The width of each Gaussian function can be estimated as follows:
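As a sketch of the overall estimator, assuming one Gaussian kernel is centered at each training sample s_i of class c (n_c samples in d dimensions) and that \sigma_i is a multiple \beta of the average distance \bar{d}_i from s_i to its nearest neighbors:

\hat{f}_c(\mathbf{v}) = \frac{1}{n_c}\sum_{i=1}^{n_c}\frac{1}{(2\pi\sigma_i^{2})^{d/2}}\exp\Bigl(-\frac{\lVert\mathbf{v}-\mathbf{s}_i\rVert^{2}}{2\sigma_i^{2}}\Bigr), \qquad \sigma_i = \beta\cdot\bar{d}_i.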
39. Accuracy of Different Classification Algorithms
40. Comparison of Execution Time (in seconds)
41. Parameter Setting through Cross Validation
- When carrying out data classification, we normally need to set one or more parameters associated with the data classification algorithm.
- For example, we need to set the value of k for the KNN classifier.
- The typical approach is to conduct cross validation to find the optimal value.
42.
- In the cross validation process, we set the parameters of the classifier to a particular combination of values that we are interested in and then evaluate how good the combination is using one of the validation schemes described below.
- With the leave-one-out cross validation scheme, we attempt to predict the class of each sample using the remaining samples as the training data set.
43.
- With 10-fold cross validation, we evenly divide the training data set into 10 subsets. Each time, we test the prediction accuracy on one of the 10 subsets using the other 9 subsets as the training set.
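A minimal sketch of choosing k for the KNN classifier by 10-fold cross validation, reusing the knn_predict sketch above (the candidate values of k are illustrative):

import numpy as np

def cross_validate_k(X, y, candidate_ks, folds=10, seed=0):
    """Return the k with the highest average 10-fold cross-validation accuracy."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))                    # shuffle before splitting into folds
    fold_ids = np.array_split(order, folds)
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        correct = 0
        for test_idx in fold_ids:
            train_idx = np.setdiff1d(order, test_idx)  # the other 9 folds form the training set
            preds = [knn_predict(X[train_idx], y[train_idx], X[i], k) for i in test_idx]
            correct += sum(p == y[i] for p, i in zip(preds, test_idx))
        acc = correct / len(X)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k

# e.g. best_k = cross_validate_k(train_X, train_y, candidate_ks=[1, 3, 5, 7])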
44. Overfitting
- Overfitting occurs when we construct a classifier based on an insufficient number of samples.
- As a result, the classifier may work well on the training data set but fail to deliver acceptable accuracy in the real world.
45.
- For example, if we toss a fair coin two times, there is a 50% chance that we will observe the same side up in both tosses.
- Therefore, if we draw a conclusion about how fair the coin is from just two tosses, we may end up overfitting the data.
- Overfitting is a serious problem in analyzing high-dimensional datasets, e.g. microarray datasets.
46. Alternative Similarity Functions
- Let <vr,1, vr,2, ..., vr,n> and <vt,1, vt,2, ..., vt,n> be the gene expression vectors, i.e. the feature vectors, of samples Sr and St, respectively. Then, the following alternative similarity functions can be employed:
- Euclidean distance
47.
- Cosine similarity
- Correlation coefficient
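The standard definitions of these three functions, writing the two vectors as \mathbf{v}_r and \mathbf{v}_t with means \bar{v}_r and \bar{v}_t, are:

d(\mathbf{v}_r,\mathbf{v}_t)=\sqrt{\sum_{i=1}^{n}(v_{r,i}-v_{t,i})^{2}}, \qquad
\cos(\mathbf{v}_r,\mathbf{v}_t)=\frac{\sum_{i} v_{r,i}\,v_{t,i}}{\lVert\mathbf{v}_r\rVert\,\lVert\mathbf{v}_t\rVert}, \qquad
\rho(\mathbf{v}_r,\mathbf{v}_t)=\frac{\sum_{i}(v_{r,i}-\bar{v}_r)(v_{t,i}-\bar{v}_t)}{\sqrt{\sum_{i}(v_{r,i}-\bar{v}_r)^{2}}\sqrt{\sum_{i}(v_{t,i}-\bar{v}_t)^{2}}}.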
48. Importance of Feature Selection
- Inclusion of features that are not correlated with the classification decision may make the problem even more complicated.
- For example, in the data set shown on the following page, inclusion of the feature corresponding to the y-axis causes an incorrect prediction for the test instance marked by "?", if a 3-NN classifier is employed.
49.
(Figure: the data set in the x-y plane, with a vertical boundary at x = 10)
- It is apparent that the o's and x's are separated by x = 10. If only the attribute corresponding to the x-axis were selected, then the 3-NN classifier would predict the class of "?" correctly.
50. Linearly Separable and Non-Linearly Separable
- Some datasets are linearly separable.
- However, many more datasets are non-linearly separable.
51. An Example of Linearly Separable Data
52. An Example of Non-Linearly Separable Data
53. A Simplest Case of Linearly Separable Data
(Figure: classes separated by the vertical line x = 10)
54. Feature Selection Based on Univariate Analysis
(Figure: per-class distributions of a single feature for Class 1, Class 2, and Class 3)
56. An Example of Univariate Analysis
57. Joint p.m.f. of X, Y, and C
59. Blind Spot of Univariate Analysis
- Univariate analysis is not able to identify the crucial features in the following example:
(Figure: the example data set in the x-y plane)
60. The Demonstrative Data Set
61. Joint p.m.f. of X, Y, and C
63.
- However, if we apply the following linear transformation, then we will be able to identify the significance of these two genes:
- 2·(Gene X) − (Gene Y)
65.
- On the other hand, if we employ the linear operator (x + 2y), then we obtain the following result:
66.
- Accordingly, the issue now is how we can figure out the optimal linear operator of the form αx + βy for the projection.
- In the 2-D case, given a set of samples (x1, y1), (x2, y2), ..., (xn, yn), the value obtained by projecting (xi, yi) onto the line sin θ · x − cos θ · y = 0, i.e. onto the component along the vector (cos θ, sin θ), is vi = cos θ · xi + sin θ · yi, as shown on the following page.
68. Feature Selection with Independent Component Analysis (ICA)
- In recent years, ICA has emerged as a promising approach for carrying out multivariate analysis.
69. Basic Idea
- The ICA algorithm attempts to identify a projection such that the projected distribution of the data set is maximally non-Gaussian.
70. A Measure of Non-Gaussianity
- The kurtosis is commonly employed to measure the non-Gaussianity of a data set.
- The kurtosis of a dataset v1, v2, ..., vn is defined as follows:
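A standard definition is the (excess) sample kurtosis, with sample mean \bar{v} and sample variance \hat{\sigma}^{2}:

\operatorname{kurt}(v_1,\dots,v_n)=\frac{\frac{1}{n}\sum_{i=1}^{n}(v_i-\bar{v})^{4}}{\hat{\sigma}^{4}}-3.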
71.
- The expected value of the kurtosis of a set of random samples taken from a standard normal distribution is 0.
- If the kurtosis of a set of random samples is larger than 0, then the p.d.f. of the distribution is sharper (more peaked) than the standard normal distribution.
- If the kurtosis of a set of random samples is smaller than 0, then the p.d.f. of the distribution is flatter than the standard normal distribution.
72.
- Let kurt(θ) denote the kurtosis of the projected values v1, v2, ..., vn obtained with projection angle θ.
73.
- The issue now is to find the value of θ that minimizes kurt(θ).
- This is an optimization problem.
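A minimal NumPy sketch of evaluating kurt(θ) for projections of a 2-D data set; the data are illustrative, and a grid search over θ stands in here for the gradient-based optimization discussed next:

import numpy as np

def excess_kurtosis(v):
    """Excess kurtosis: about 0 for Gaussian data, negative for flatter (e.g. bimodal) data."""
    v = v - v.mean()
    return np.mean(v**4) / (np.mean(v**2) ** 2) - 3.0

def kurt_of_projection(X, theta):
    """Project 2-D samples X onto the direction (cos theta, sin theta) and return the kurtosis."""
    direction = np.array([np.cos(theta), np.sin(theta)])
    return excess_kurtosis(X @ direction)

# Illustrative two-cluster data: the separating direction yields a bimodal
# (strongly non-Gaussian, negative-kurtosis) projection.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.3, (100, 2)), rng.normal([3, 1], 0.3, (100, 2))])

thetas = np.linspace(0.0, np.pi, 180)
best_theta = min(thetas, key=lambda t: kurt_of_projection(X, t))
print(best_theta)   # angle of the most non-Gaussian (most bimodal) projection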
74. The Optimization Problem
- The optimization problem is to find the global maximum/minimum of a function.
- There are several heuristic algorithms designed for solving optimization problems, e.g.:
- gradient descent
- genetic algorithms
- simulated annealing
75. The Gradient Descent Algorithm
- In the gradient descent algorithm, a number of random samples are taken as the starting points.
- Then, we compute the gradient at each point and take a step in the direction in which the slope is steepest.
- This process is repeated until the convergence criterion is met.
76. A 1-D Example
- d is a parameter that controls the step size.
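A minimal 1-D gradient descent sketch, assuming a differentiable function with a known derivative df and using d as the step-size parameter:

def gradient_descent_1d(df, x0, d=0.1, tol=1e-6, max_iters=1000):
    """Repeatedly step against the derivative until the update becomes negligible."""
    x = x0
    for _ in range(max_iters):
        step = d * df(x)          # d controls the step size
        x -= step                 # move downhill
        if abs(step) < tol:       # convergence criterion
            break
    return x

# Example: minimize f(x) = (x - 2)^2, whose derivative is df(x) = 2*(x - 2)
print(gradient_descent_1d(lambda x: 2 * (x - 2), x0=10.0))   # converges to about 2.0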
77.
- The gradient descent algorithm can also be applied to multidimensional functions. In such cases, partial differentiation is involved.
- If the gradient descent algorithm is to be employed, then we must be able to compute the gradient of the function at any point in the vector space.
78. Blind Spot of ICA
- However, ICA may fail on the following non-linearly separable dataset:
79. Data Clustering
- Data clustering concerns how to group a set of objects based on the similarity of their attributes and/or their proximity in the vector space.
80. Model of Microarray Data Sets
(Figure: samples grouped into Class 1, Class 2, and Class 3)
81. Applications of Data Clustering in Microarray Data Analysis
- Data clustering has been employed in microarray data analysis for:
- identifying genes with similar expressions
- identifying the subtypes of samples
82.
- For cluster analysis of samples, we can employ the feature selection mechanism developed for classification of samples.
- For cluster analysis of genes, each column of gene expression data is regarded as the feature vector of one gene.
83. The Agglomerative Hierarchical Clustering Algorithms
- The agglomerative hierarchical clustering algorithms operate by maintaining a sorted list of inter-cluster distances/similarities.
- Initially, each data instance forms a cluster.
- The clustering algorithm repetitively merges the two clusters with the minimum inter-cluster distance or the maximum inter-cluster similarity.
84.
- Upon merging two clusters, the clustering algorithm computes the distances between the newly formed cluster and the remaining clusters and maintains the sorted list of inter-cluster distances accordingly.
- There are a number of ways to define the inter-cluster distance (see the sketch after this list):
- minimum distance/maximum similarity (single-link)
- maximum distance/minimum similarity (complete-link)
- average distance/average similarity
- mean distance (applicable only with the vector space model)
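A minimal sketch of complete-link agglomerative clustering using SciPy; the toy points and the distance threshold are illustrative, and scipy.cluster.hierarchy also supports 'single' and 'average' linkage:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points; SciPy computes the pairwise Euclidean distances internally.
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2]])

# 'complete' = complete-link (maximum inter-cluster distance).
Z = linkage(points, method='complete')

# Cut the dendrogram so that clusters are merged only below this distance threshold.
labels = fcluster(Z, t=1.0, criterion='distance')
print(labels)   # e.g. [1 1 1 2 2]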
85.
- Given the following similarity matrix, we can apply the complete-link algorithm to obtain the dendrogram shown on the next slide.
86.
- Assume that the complete-link algorithm is employed.
- If those similarity scores that are less than 0.3 are excluded, then we obtain 3 clusters: {P1, P4}, {P2, P5, P6}, and {P3}.
(Dendrogram: leaves P1, P4, P2, P5, P6, P3; merge similarity levels 0.862, 0.816, 0.494, 0.137, 0.018)
87.
- If the single-link algorithm is employed, then we obtain the following result.
88. Example of the Chaining Effect
Single-link (10 clusters)
Complete-link (2 clusters)
89. Effect of Bias towards Spherical Clusters
Single-link (2 clusters)
Complete-link (2 clusters)
90. K-Means: A Partitional Data Clustering Algorithm
- The k-means algorithm is probably the most commonly used partitional clustering algorithm.
- The k-means algorithm begins by selecting k data instances as the means or centers of k clusters.
91.
- The k-means algorithm then executes the following loop iteratively until the convergence criterion is met:
- repeat
- assign every data instance to the closest cluster, based on the distance between the data instance and the center of the cluster
- compute the new centers of the k clusters
- until (the convergence criterion is met)
92.
- A commonly used convergence criterion is: if the difference between the results (e.g. the cluster centers) of two consecutive iterations is smaller than a threshold, then the algorithm terminates.
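A minimal NumPy sketch of this loop; the random selection of initial centers and the center-movement convergence test are illustrative choices:

import numpy as np

def k_means(X, k, tol=1e-6, max_iters=100, seed=0):
    """Basic k-means: returns the k cluster centers and the label of each data instance."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # pick k data instances as initial centers
    for _ in range(max_iters):
        # Assign every instance to the closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute the center of each cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.linalg.norm(new_centers - centers) < tol:       # convergence criterion
            centers = new_centers
            break
        centers = new_centers
    return centers, labels

# e.g. centers, labels = k_means(points, k=2)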
93. Illustration of the K-Means Algorithm (I)
94. Illustration of the K-Means Algorithm (II)
95. Illustration of the K-Means Algorithm (III)
96. A Case in which the K-Means Algorithm Fails
- The k-means algorithm may converge to a locally optimal state, as the following example demonstrates:
Initial Selection
97. Conclusions
- Machine learning algorithms have been widely exploited to tackle many important bioinformatics problems.