Machine Learning and its Applications in Bioinformatics

Transcript and Presenter's Notes

1
Machine Learning and its Applications in
Bioinformatics
  • Yen-Jen Oyang
  • Dept. of Computer Science and Information
    Engineering

2
Observations and Challenges in the Information
Age
  • A huge volume of information has been and is
    being digitized and stored in computers.
  • Due to the volume of digitized information,
    effective exploitation of information is beyond
    the capability of human beings without the aid of
    intelligent computer software.

3
An Example of Supervised Machine Learning (Data
Classification)
  • Given the data set shown on the next slide, can we
    figure out a set of rules that predict the
    classes of the objects?

4
Data Set
5
Distribution of the Data Set
6
Rule Based on Observation
7
Rule Generated by a Kernel Density Estimation
Based Algorithm
  • Let … and … (the formulas were not captured in
    the transcript).
  • If the stated condition holds, then
    prediction = O.
  • Otherwise, prediction = X.

8
(No Transcript)
9
Identifying Boundary of Different Classes of
Objects
10
Boundary Identified
11
Problem Definition ofData Classification
  • In a data classification problem, each object is
    described by a set of attribute values and each
    object belongs to one of the predefined classes.
  • The goal is to derive a set of rules that
    predicts which class a new object should belong
    to, based on a given set of training samples.
    Data classification is also called supervised
    learning.

12
The Vector Space Model
  • In the vector space model, each object is
    described by a number of numerical
    attributes/features.
  • For example, a person can be described by his
    height, weight, and age.
  • It is typical that the objects are described by a
    large number of attributes/features.

13
Transformation of Categorical Attributes into
Numerical Attributes
  • Represent the attribute values of the object in a
    binary table form as exemplified in the following

14
  • Assign appropriate weight to each column.
  • Treat the weighted vector of each row as the
    feature vector of the corresponding object.
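  • A minimal Python sketch of this transformation
    (one-hot encoding with per-column weights; the
    attribute values and weights below are
    illustrative, not taken from the slides):

    # Turn a categorical attribute into weighted binary features (a sketch;
    # the category names and column weights are made-up examples).
    CATEGORIES = ["sunny", "cloudy", "rainy"]
    WEIGHTS = {"sunny": 1.0, "cloudy": 0.5, "rainy": 2.0}

    def encode(value):
        """Return the weighted one-hot vector for one object."""
        return [WEIGHTS[c] if value == c else 0.0 for c in CATEGORIES]

    # Each row of the resulting table is the feature vector of one object.
    objects = ["sunny", "rainy", "sunny"]
    feature_vectors = [encode(v) for v in objects]
    print(feature_vectors)  # [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0], [1.0, 0.0, 0.0]]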

15
Transformation of the Similarity/Dissimilarity
Matrix Model
  • In this model, a matrix records the
    similarity/dissimilarity scores between every
    pair of objects.

16
  • We may select P2, P5, and P6 as representatives
    and use the reciprocals of the similarity scores
    to these representatives to describe an object.
  • For example, the feature vectors of P1 and P2 are
    <1/53, 1/35, 1/180> and <0, 1/816, 1/606>,
    respectively.
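  • A minimal Python sketch of this construction (the
    similarity scores below are placeholders, not the
    values from the slide's matrix):

    # Describe each object by the reciprocals of its similarity scores
    # to the chosen representatives (illustrative values only).
    similarity = {("P1", "P2"): 53, ("P1", "P5"): 35, ("P1", "P6"): 180}
    representatives = ["P2", "P5", "P6"]

    def feature_vector(obj):
        """Reciprocal of the similarity to each representative (0 if undefined)."""
        vec = []
        for rep in representatives:
            s = similarity.get((obj, rep)) or similarity.get((rep, obj))
            vec.append(1.0 / s if s else 0.0)
        return vec

    print(feature_vector("P1"))  # [1/53, 1/35, 1/180]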

17
Applications ofData Classificationin
Bioinformatics
  • In microarray data analysis, data classification
    is employed to predict the class of a new sample
    based on the existing samples with known class.
  • Data classification has also been widely employed
    in prediction of protein family, protein fold,
    and protein secondary structure.

18
  • For example, in the Leukemia data set, there are
    72 samples and 7129 genes.
  • 25 Acute Myeloid Leukemia (AML) samples.
  • 38 B-cell Acute Lymphoblastic Leukemia samples.
  • 9 T-cell Acute Lymphoblastic Leukemia samples.

19
Model of Microarray Data Sets
20
Alternative Data Classification Algorithms
  • Decision tree (C4.5 and C5.0)
  • Instance-based learning (KNN)
  • Naïve Bayesian classifier
  • RBF network
  • Support vector machine (SVM)
  • Kernel Density Estimation (KDE) based classifier

21
Instance-Based Learning
  • In instance-based learning, we take the k nearest
    training samples of a new instance (v1, v2, …,
    vm) and assign the new instance to the class that
    has the most instances among the k nearest
    training samples.
  • Classifiers that adopt instance-based learning
    are commonly called KNN classifiers.
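  • A minimal Python sketch of a KNN classifier
    (Euclidean distance and majority vote; the
    training data below are placeholders):

    from collections import Counter
    import math

    def knn_predict(train, query, k=3):
        """train: list of (feature_vector, class_label); query: a feature vector."""
        # Sort the training samples by their Euclidean distance to the query.
        by_dist = sorted(train, key=lambda s: math.dist(s[0], query))
        # Majority vote among the k nearest training samples.
        votes = Counter(label for _, label in by_dist[:k])
        return votes.most_common(1)[0][0]

    train = [([1.0, 2.0], "O"), ([1.5, 1.8], "O"),
             ([5.0, 8.0], "X"), ([6.0, 9.0], "X")]
    print(knn_predict(train, [1.2, 1.9], k=3))  # -> "O"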

22
Example of the KNN Classifiers
  • If a 1NN classifier is employed, then the
    prediction for "?" is X.
  • If a 3NN classifier is employed, then the
    prediction for "?" is O.

23
Decision Function of the KNN Classifier
  • Assume that there are two classes of samples,
    positive and negative.
  • The decision function of a KNN classifier is
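  • A standard two-class form (the slide's own
    formula was not captured in the transcript, so
    this is a conventional statement rather than the
    exact expression) is
    f(v) = sign( Σ_{xi ∈ Nk(v)} yi ),
    where Nk(v) is the set of the k nearest training
    samples to v and yi ∈ {+1, −1} is the label of xi.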

24
Extension of the KNN Classifier
  • We may extend the KNN classifier by weighting the
    contribution of each neighbor with a term related
    to its distance to the query vector
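  • A commonly used weighted form (an illustrative
    choice, not necessarily the slide's formula) is
    f(v) = sign( Σ_{xi ∈ Nk(v)} yi / (‖v − xi‖² + ε) ),
    so that closer neighbors contribute more to the
    vote; ε is a small constant that avoids division
    by zero.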

25
An RBF Network Based Classifier with Gaussian
Kernels
  • It is typical that all kernel functions are
    radial basis functions of the same form.
  • With the popular Gaussian function, the decision
    function is of the following form
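  • (The formula was not captured in the transcript;
    a standard Gaussian-kernel form is
    f(v) = sign( Σ_i wi · exp(−‖v − μi‖² / (2σi²)) + b ),
    where the μi are the kernel centers, the σi their
    widths, and the wi and b are learned parameters.)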

26
The Common Structure of the RBF Network Based
Data Classifier
27
(No Transcript)
28
Regularization of an RBF Network Based Classifier
  • The conventional approaches proceed by either
    employing a constant σ for all kernel functions
    or employing a heuristic mechanism to set each σi
    individually, e.g. a multiple of the average
    distance among samples, and attempt to minimize
    the following objective,
  • where each (xi, yi) is a learning sample.
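  • A standard form of such an objective (an
    assumption; the slide's own formula was not
    captured) is
    Σ_i ( f(xi) − yi )² + λ · Σ_j wj²,
    i.e. the squared error over the learning samples
    plus a regularization penalty on the kernel
    weights.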

29
  • The regularization term is included to avoid
    overfitting, and λ is to be set through cross
    validation.

30
(No Transcript)
31
(No Transcript)
32
Decision Function of an SVM
  • A prediction of the class of a new sample located
    at v in the vector space is based on the
    following rule
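  • (The rule itself was not captured in the
    transcript. A standard form is
    f(v) = sign( Σ_i αi yi K(xi, v) + b ),
    where the xi with αi > 0 are the support vectors,
    yi ∈ {+1, −1} their labels, K the kernel
    function, and b the bias.)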

33
The Kernel Density Estimation (KDE) Based
Classifier
  • The KDE based learning algorithm constructs one
    approximate probability density function for each
    class of objects.
  • Classification of a new object is conducted based
    on the likelihood function.
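  • A minimal Python sketch of this idea: build one
    Gaussian kernel density estimate per class and
    assign a new object to the class with the largest
    estimated likelihood (the bandwidth below is a
    placeholder, not the width-selection rule
    described later in these slides):

    import math

    def gaussian_kde(samples, bandwidth):
        """Return a 1-D kernel density estimate built from the given samples."""
        def density(x):
            return sum(
                math.exp(-((x - s) ** 2) / (2 * bandwidth ** 2))
                / (bandwidth * math.sqrt(2 * math.pi))
                for s in samples
            ) / len(samples)
        return density

    # One approximate density function per class (toy 1-D data).
    class_samples = {"O": [1.0, 1.2, 0.9, 1.1], "X": [3.0, 3.2, 2.8]}
    densities = {c: gaussian_kde(s, 0.3) for c, s in class_samples.items()}

    def classify(x):
        # Pick the class with the largest estimated likelihood at x.
        return max(densities, key=lambda c: densities[c](x))

    print(classify(1.05))  # -> "O"
    print(classify(2.9))   # -> "X"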

34
Identifying Boundary of Different Classes of
Objects
35
Boundary Identified
36
Problem Definition of Kernel Density Estimation
  • Given a set of samples
  • randomly taken from a probability distribution,
    we want to find a set of symmetric kernel
    functions and the corresponding weights such that
    the weighted sum approximates the underlying
    probability density function.
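  • In symbols (a standard statement of the problem;
    the slide's own formula was not captured): given
    samples s1, …, sn drawn from a density f, find
    kernels K1, …, Km and weights w1, …, wm such that
    fhat(x) = Σ_j wj · Kj(x) ≈ f(x).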

37
The Proposed KDE Based Classifier
  • We chose to employ the Gaussian function and set
    the width of each Gaussian function to a multiple
    of the average distance among neighboring
    samples.

38
  • The width of each Gaussian function can be
    estimated as follows.
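  • (One plausible reading, based on the slide text
    rather than the original formula: σi ≈ β · the
    average distance from sample si to its nearest
    neighboring samples, with the multiplier β set
    through cross validation.)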

39
Accuracy of Different Classification Algorithms
40
Comparison of Execution Time (in seconds)
41
Parameter Setting through Cross Validation
  • When carrying out data classification, we
    normally need to set one or more parameters
    associated with the data classification
    algorithm.
  • For example, we need to set the value of k with
    the KNN classifier.
  • The typical approach is to conduct cross
    validation to find out the optimal value.

42
  • In the cross validation process, we set the
    parameters of the classifier to a particular
    combination of values that we are interested in
    and then evaluate how good the combination is
    using one of the following schemes.
  • With the leave-one-out cross validation scheme,
    we attempt to predict the class of each sample
    using the remaining samples as the training data
    set.

43
  • With 10-fold cross validation, we evenly divide
    the training data set into 10 subsets. Each
    time, we test the prediction accuracy of one of
    the 10 subsets using the other 9 subsets as the
    training set.
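  • A minimal Python sketch of 10-fold cross
    validation for choosing k of a KNN classifier (it
    reuses the knn_predict sketch shown earlier; the
    fold split is a simple round-robin):

    def k_fold_accuracy(samples, k_neighbors, n_folds=10):
        """Average accuracy of KNN over n_folds cross-validation folds."""
        folds = [samples[i::n_folds] for i in range(n_folds)]
        correct = total = 0
        for i, test_fold in enumerate(folds):
            train = [s for j, f in enumerate(folds) if j != i for s in f]
            for features, label in test_fold:
                correct += (knn_predict(train, features, k=k_neighbors) == label)
                total += 1
        return correct / total

    # Keep the parameter value with the best cross-validation accuracy, e.g.
    # best_k = max([1, 3, 5, 7], key=lambda k: k_fold_accuracy(samples, k))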

44
Overfitting
  • Overfitting occurs when we construct a classifier
    based on an insufficient number of samples.
  • As a result, the classifier may work well on the
    training dataset but fail to deliver an
    acceptable accuracy in the real world.

45
  • For example, if we toss a fair coin two times,
    there is a 50% chance that we will observe the
    same side up in both tosses.
  • Therefore, if we draw our conclusion on how fair
    the coin is from just two tosses, we may end up
    overfitting the dataset.
  • Overfitting is a serious problem in analyzing
    high-dimensional datasets, e.g. the microarray
    datasets.

46
Alternative Similarity Functions
  • Let <vr,1, vr,2, …, vr,n> and <vt,1, vt,2, …,
    vt,n> be the gene expression vectors, i.e. the
    feature vectors, of samples Sr and St,
    respectively. Then, the following alternative
    similarity functions can be employed:
  • Euclidean distance

47
  • Cosine
  • Correlation coefficient
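  • A Python sketch of the three measures (standard
    definitions; the slides' own formulas were not
    captured):

    import math

    def euclidean_distance(vr, vt):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(vr, vt)))

    def cosine_similarity(vr, vt):
        dot = sum(a * b for a, b in zip(vr, vt))
        norm_r = math.sqrt(sum(a * a for a in vr))
        norm_t = math.sqrt(sum(b * b for b in vt))
        return dot / (norm_r * norm_t)

    def correlation_coefficient(vr, vt):
        # Pearson correlation: cosine similarity of the mean-centered vectors.
        mr, mt = sum(vr) / len(vr), sum(vt) / len(vt)
        return cosine_similarity([a - mr for a in vr], [b - mt for b in vt])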

48
Importance of Feature Selection
  • Inclusion of features that are not correlated
    with the classification decision may make the
    problem even more complicated.
  • For example, in the data set shown on the
    following slide, inclusion of the feature
    corresponding to the y-axis causes incorrect
    prediction of the test instance marked by "?" if
    a 3NN classifier is employed.

49
(Scatter plot with x and y axes; the two classes are separated by the line x = 10)
  • It is apparent that the o's and the x's are
    separated by x = 10. If only the attribute
    corresponding to the x-axis were selected, then
    the 3NN classifier would predict the class of "?"
    correctly.

50
Linearly Separable and Non-Linearly Separable
  • Some datasets are linearly separable.
  • However, there are more datasets that are
    non-linearly separable.

51
An Example of Linearly Separable
52
An Example of Non-Linearly Separable
53
A Simplest Case of Linearly Separable
(Scatter plot: the two classes are separated by the line x = 10)
54
Feature Selection Based on the Univariate Analysis
(Figure: Class 1, Class 2, and Class 3)
55
(No Transcript)
56
An Example of Univariate Analysis
57
Joint p.m.f. of X, Y, and C
58
(No Transcript)
59
Blind Spot of the Univariate Analysis
  • The univariate analysis is not able to identify
    crucial features in the following example

60
The Demonstrative Data Set
61
Joint p.m.f. of X, Y, and C
62
  • For Gene X,
  • For Gene Y,

63
  • However, if we apply the following linear
    transformation, then we will be able to identify
    the significance of these two genes:
  • 2·(Gene X) − (Gene Y)

64
  • For 2·(Gene X) − (Gene Y),

65
  • On the other hand, if we employ the linear
    operator (x − 2y), then we obtain

66
  • Accordingly, the issue now is how we can figure
    out the optimal linear operator of the form
    αx + βy for the projection.
  • In the 2-D case, given a set of samples
    (x1, y1), (x2, y2), …, (xn, yn),
    vi = cos θ · xi + sin θ · yi
    is the value obtained by projecting (xi, yi) onto
    the line sin θ · x − cos θ · y = 0, i.e. onto the
    component along the vector (cos θ, sin θ), as
    shown on the following slide.

67
(No Transcript)
68
Feature Selection with Independent Component
Analysis (ICA)
  • In recent years, ICA has emerged as a promising
    approach for carrying out multivariate analysis.

69
Basic Idea
  • The ICA algorithm attempts to identify a plane so
    that when we project the data set onto the plane,
    the distribution is maximally non-Gaussian.

70
A Measure of Non-Gaussianity
  • The kurtosis is commonly employed to measure the
    non-Gaussianity of a data set.
  • The kurtosis of a dataset v1, v2, …, vn is
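  • (The formula was not captured in the transcript.
    A standard sample version is
    kurt(v1, …, vn) = (1/n) · Σ_i ((vi − m)/s)⁴ − 3,
    where m and s are the sample mean and standard
    deviation; the −3 term makes the expected value 0
    for normally distributed data, as stated on the
    next slide.)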

71
  • The expected value of the kurtosis of a set of
    random samples taken from a standard normal
    distribution is 0.
  • If the kurtosis of a set of random samples is
    larger than 0, then the p.d.f. of the
    distribution is sharper than that of the standard
    normal distribution.
  • If the kurtosis of a set of random samples is
    smaller than 0, then the p.d.f. of the
    distribution is flatter than that of the standard
    normal distribution.

72
  • Let kurt(θ) denote the kurtosis of v1, v2, …, vn,
    where vi = cos θ · xi + sin θ · yi.

73
  • The issue now is to find the value of θ that
    minimizes kurt(θ).
  • This is an optimization problem.
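  • A minimal Python sketch of this search: project
    the 2-D samples onto the direction (cos θ, sin θ)
    for a grid of angles and keep the θ whose
    projection has the smallest kurtosis (a
    brute-force stand-in for the optimization methods
    discussed next):

    import math

    def kurtosis(values):
        n = len(values)
        m = sum(values) / n
        s = math.sqrt(sum((v - m) ** 2 for v in values) / n)
        return sum(((v - m) / s) ** 4 for v in values) / n - 3

    def best_theta(points, steps=180):
        """Angle whose projection has the smallest (most negative) kurtosis."""
        def kurt_at(theta):
            proj = [math.cos(theta) * x + math.sin(theta) * y for x, y in points]
            return kurtosis(proj)
        return min((i * math.pi / steps for i in range(steps)), key=kurt_at)

    # points = [(x1, y1), (x2, y2), ...]; theta = best_theta(points)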

74
The Optimization Problem
  • The optimization problem is to find the global
    maximum/minimum of a function.
  • There are several heuristic algorithms designed
    for solving the optimization problem, e.g.
  • gradient descent
  • genetic algorithms
  • simulated annealing.

75
The Gradient Descent Algorithm
  • In the gradient descent algorithm, a number of
    random samples are taken as the starting points.
  • Then, we compute the gradient at each point and
    make a move in the direction in which the slope
    is steepest.
  • This process is repeated a number of times until
    the convergence criterion is met.

76
A 1-D Example
d is a parameter that controls the step size
77
  • The gradient descent algorithm can be applied to
    multidimensional functions. In such cases,
    partial differentiation is involved.
  • If the gradient descent algorithm is to be
    employed, then we must be able to compute the
    gradient of the function at any point in the
    vector space.
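  • A minimal 1-D gradient descent sketch in Python
    (the function and step size are illustrative; for
    minimization we step against the gradient):

    def gradient_descent(grad, x0, step=0.1, iterations=100, tol=1e-6):
        """Minimize a function starting from x0 by stepping against its gradient."""
        x = x0
        for _ in range(iterations):
            move = step * grad(x)
            x -= move                   # move downhill
            if abs(move) < tol:         # convergence criterion
                break
        return x

    # Example: minimize f(x) = (x - 2)^2, whose gradient is 2(x - 2).
    print(gradient_descent(lambda x: 2 * (x - 2), x0=10.0))  # -> ~2.0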

78
Blind Spot of ICA
  • However, ICA may fail in the following
    non-linearly separable dataset

79
Data Clustering
  • Data clustering concerns how to group a set of
    objects based on their similarity of attributes
    and/or their proximity in the vector space.

80
Model of Microarray Data Sets
(Figure: microarray data matrix with samples grouped into Class 1, Class 2, and Class 3)
81
Applications of Data Clustering in Microarray
Data Analysis
  • Data clustering has been employed in microarray
    data analysis for
  • identifying the genes with similar expressions
  • identifying the subtypes of samples.

82
  • For cluster analysis of samples, we can employ
    the feature selection mechanism developed for
    classification of samples.
  • For cluster analysis of genes, each column of
    gene expression data is regarded as the feature
    vector of one gene.

83
The Agglomerative Hierarchical Clustering
Algorithms
  • The agglomerative hierarchical clustering
    algorithms operate by maintaining a sorted list
    of inter-cluster distances/similarities.
  • Initially, each data instance forms a cluster of
    its own.
  • The clustering algorithm repeatedly merges the
    two clusters with the minimum inter-cluster
    distance or the maximum inter-cluster similarity.

84
  • Upon merging two clusters, the clustering
    algorithm computes the distances between the
    newly-formed cluster and the remaining clusters
    and maintains the sorted list of inter-cluster
    distances accordingly.
  • There are a number of ways to define the
    inter-cluster distance
  • minimum distance/maximum similarity
    (single-link)
  • maximum distance/minimum similarity
    (complete-link)
  • average distance/average similarity
  • mean distance (applicable only with the vector
    space model).
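  • A minimal Python sketch of the agglomerative
    procedure using the single-link (minimum
    distance) criterion; dist is any pairwise
    distance function supplied by the caller:

    def single_link_clustering(points, dist, target_clusters):
        """Merge clusters until only target_clusters remain (single-link)."""
        clusters = [[p] for p in points]   # initially, one cluster per instance
        while len(clusters) > target_clusters:
            # Find the pair of clusters with the minimum inter-cluster distance.
            i, j = min(
                ((a, b) for a in range(len(clusters))
                        for b in range(a + 1, len(clusters))),
                key=lambda ab: min(dist(p, q)
                                   for p in clusters[ab[0]]
                                   for q in clusters[ab[1]]),
            )
            clusters[i].extend(clusters.pop(j))   # merge the two closest clusters
        return clusters

  • Complete-link differs only in using the maximum
    (instead of the minimum) pairwise distance when
    comparing two clusters.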

85
  • Given the following similarity matrix, we can
    apply the complete-link algorithm to obtain the
    dendrogram shown on the next slide.

86
  • Assume that the complete-link algorithm is
    employed.
  • If those similarity scores that are less than 0.3
    are excluded, then we obtain 3 clusters: {P1, P4},
    {P2, P5, P6}, {P3}.

(Dendrogram over P1, P4, P2, P5, P6, P3, with merges at similarity levels 0.862, 0.816, 0.494, 0.137, and 0.018)
87
  • If the single-link algorithm is employed, then we
    obtain the following result.

88
Example of the Chaining Effect
Single-link (10 clusters)
Complete-link (2 clusters)
89
Effect of Bias towards Spherical Clusters
Single-link (2 clusters)
Complete-link (2 clusters)
90
K-Means: A Partitional Data Clustering Algorithm
  • The k-means algorithm is probably the most
    commonly used partitional clustering algorithm.
  • The k-means algorithm begins with selecting k
    data instances as the means or centers of k
    clusters.

91
  • The k-means algorithm then executes the following
    loop iteratively until the convergence criterion
    is met.
  • repeat
  • assign every data instance to the closest cluster
    based on the distance between the data instance
    and the center of the cluster
  • compute the new centers of the k clusters
  • until (the convergence criterion is met)

92
  • A commonly-used convergence criterion is
  • If the difference between the cluster centers
    obtained in two consecutive iterations is smaller
    than a threshold, then the algorithm terminates.
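  • A minimal k-means sketch in Python (the initial
    centers are simply the first k instances, a
    simplification of the selection step described
    above):

    import math

    def kmeans(points, k, iterations=100, tol=1e-6):
        """Return the final cluster centers after iterative reassignment."""
        centers = [list(p) for p in points[:k]]   # initial centers
        for _ in range(iterations):
            # Assign every data instance to the closest cluster center.
            clusters = [[] for _ in range(k)]
            for p in points:
                idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
                clusters[idx].append(p)
            # Compute the new center of each cluster (keep the old one if empty).
            new_centers = [
                [sum(dim) / len(c) for dim in zip(*c)] if c else centers[i]
                for i, c in enumerate(clusters)
            ]
            # Convergence criterion: the centers barely move between iterations.
            if all(math.dist(a, b) < tol for a, b in zip(centers, new_centers)):
                break
            centers = new_centers
        return centers

    # centers = kmeans([(1, 1), (1.2, 0.9), (5, 5), (5.1, 4.8)], k=2)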

93
Illustration of the K-Means Algorithm---(I)
94
Illustration of the K-Means Algorithm---(II)
95
Illustration of the K-Means Algorithm---(III)
96
A Case in which the K-Means Algorithm Fails
  • The K-means algorithm may converge to a locally
    optimal state, as the following example
    demonstrates.

Initial Selection
97
Conclusions
  • Machine learning algorithms have been widely
    exploited to tackle many important bioinformatics
    problems.