Title: Introduction to Bioinformatics: Lecture VIII, Classification and Supervised Learning
1. Introduction to Bioinformatics, Lecture VIII: Classification and Supervised Learning
- Jarek Meller
- Division of Biomedical Informatics, Children's Hospital Research Foundation
- Department of Biomedical Engineering, UC
2. Outline of the lecture
- Motivating story: correlating inputs and outputs
- Learning with a teacher
- Regression and classification problems
- Model selection, feature selection and generalization
- k-nearest neighbors and some other classification algorithms
- Phenotype fingerprints and their applications in medicine
3. Web watch: an on-line biology textbook by J. W. Kimball
Dr. J. W. Kimball's Biology Pages: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/
Story 1: B-cells and DNA editing; Apolipoprotein B and RNA editing: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/R/RNA_Editing.html#apoB_gene
Story 2: ApoB, cholesterol uptake, LDL and its endocytosis: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/E/Endocytosis.html#ldl
Complex patterns of mutations in genes related to cholesterol transport and uptake (e.g. LDLR, ApoB) may lead to an elevated level of LDL in the blood.
4. Correlations and fingerprints
Instead of an underlying molecular model, which is often difficult to decipher, one may simply try to find correlations between inputs and outputs. If measurements on certain attributes correlate with molecular processes, underlying genomic structures, phenotypes, disease states etc., one can use such attributes as indicators of these hidden states and make predictions for new cases. Consider, for example, elevated levels of low density lipoprotein (LDL) particles in the blood as an indicator (fingerprint) of atherosclerosis.
5. Correlations and fingerprints: LDL example
Healthy cases in blue; heart attack or stroke within 5 years from the exam in red (simulated data). Axes: x = LDL, y = HDL, z = age (see the study by Westendorp et al., Arch Intern Med. 2003, 163(13):1549).
6. LDL example: 2D projection
7. LDL example: regression with binary output and 1D projection for classification
8. Unsupervised vs. supervised learning
In the case of unsupervised learning, the goal is to discover the structure in the data and group (cluster) similar objects, given a similarity measure. In the case of supervised learning (or learning with a teacher), a set of examples with class assignments (e.g. healthy vs. diseased) is given, and the goal is to find a representation of the problem in some feature (attribute) space that provides a proper separation of the imposed classes. Such representations, with the resulting decision boundaries, may subsequently be used to make predictions for new cases.
(Figure: objects separated into Class 1, Class 2 and Class 3 by decision boundaries.)
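To make the distinction concrete, here is a minimal sketch (an illustration, not part of the original slides) contrasting the two settings on the same toy data, assuming scikit-learn is available: clustering ignores the class labels, while the classifier is trained on them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Toy data: two features per object (e.g. two measurements per patient)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),   # objects from class 0
               rng.normal(3.0, 1.0, size=(20, 2))])  # objects from class 1
y = np.array([0] * 20 + [1] * 20)                    # class labels (the "teacher")

# Unsupervised: group similar objects without looking at y
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised: learn a decision boundary from (X, y), then predict a new case
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[2.5, 2.5]]))  # predicted class for an unseen object
```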
9. Choice of the model, problem representation and feature selection: another simple example
(Figure: adults vs. children plotted by weight and height; F vs. M plotted by estrogen and testosterone levels.)
10. Gene expression example again: JRA clinical classes
Picture courtesy of B. Aronow 
11. Advantages of prior knowledge; on the other hand, problems with class assignment (e.g. in clinical practice)
(Figure: GLOBINS, FixL and PYP, with no sequence similarity between them.)
Prior knowledge: membership in the same class despite low sequence similarity suggests that a distance based on sequence similarity alone is not sufficient; adding structure-derived features might help (again, a question of choosing a good model).
12. Three phases in supervised learning protocols
- Training data: examples with class assignments are given
- Learning: i) an appropriate model (or representation) of the problem needs to be selected in terms of attributes, distance measure and classifier type; ii) adaptive parameters in the model need to be optimized to provide correct classification of the training examples (e.g. by minimizing the number of misclassified training vectors)
- Validation: cross-validation, independent control sets and other measures of real accuracy and generalization should be used to assess the success of the model and the training phase (finding a trade-off between accuracy and generalization is not trivial); see the sketch after this list
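As an illustration of the validation phase, here is a minimal sketch (not from the original slides) that estimates out-of-sample accuracy with 5-fold cross-validation, assuming scikit-learn and a synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic training data standing in for (features, class labels)
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           random_state=0)

clf = KNeighborsClassifier(n_neighbors=5)

# 5-fold cross-validation: train on 4/5 of the data, test on the held-out 1/5,
# repeated over all folds, estimating generalization rather than training accuracy
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```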
13. Training set: LDL example again
- A set of objects (here patients) xi, i = 1, ..., N, is given. For each patient a set of features (attributes and the corresponding measurements on these attributes) is given too. Finally, for each patient we are given the class Ck, k = 1, ..., K, that he/she belongs to.

  Age  LDL  HDL  Sex  Class
  41   230  60   F    healthy (0)
  32   120  50   M    stroke within 5 years (1)
  45   90   70   M    heart attack within 5 years (1)

  Training set: { (xi, Ck) }, i = 1, ..., N
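For concreteness, the small table above could be represented as a feature matrix and a label vector, for example as follows (a minimal sketch using NumPy; the numeric encoding of the sex attribute is an arbitrary choice for illustration):

```python
import numpy as np

# Features per patient: age, LDL, HDL, sex (F = 0, M = 1 -- an arbitrary encoding)
X = np.array([
    [41, 230, 60, 0],
    [32, 120, 50, 1],
    [45,  90, 70, 1],
])

# Class labels: 0 = healthy, 1 = stroke or heart attack within 5 years
y = np.array([0, 1, 1])

print(X.shape, y.shape)  # (3, 4) objects-by-features matrix and N = 3 labels
```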
14. Optimizing adaptable parameters in the model
- Find a model y(x; w) that describes the objects of each class as a function of the features x and the adaptive parameters (weights) w.
- Prediction: given x (e.g. LDL = 240, age = 52, sex = male), assign the class C (e.g. if y(x; w) > 0.5 then C = 1, i.e. likely to suffer from a stroke or heart attack in the next 5 years)
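The slide does not fix the form of y(x; w); as one common choice, a minimal sketch with a logistic (sigmoid) model and the 0.5 decision threshold could look like this (the weights shown are made up for illustration, not fitted values):

```python
import numpy as np

def y(x, w, b):
    """Logistic model y(x; w) = sigmoid(w . x + b), returning a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

# Hypothetical weights for the features (age, LDL, sex); in practice they are
# obtained by optimizing a loss over the training examples
w = np.array([0.02, 0.01, 0.5])
b = -3.0

x_new = np.array([52, 240, 1])           # age = 52, LDL = 240, sex = male
prob = y(x_new, w, b)
predicted_class = 1 if prob > 0.5 else 0  # threshold at 0.5, as on the slide
print(prob, predicted_class)
```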
15. Examples of machine learning algorithms for classification and regression problems
- Linear perceptron, Least Squares
- LDA/FDA (Linear/Fisher Discriminant Analysis) (simple linear cuts; kernel-based non-linear generalizations)
- SVM (Support Vector Machines) (optimal, wide-margin linear cuts; kernel-based non-linear generalizations)
- Decision trees (logical rules)
- k-NN (k-Nearest Neighbors) (simple, non-parametric)
- Neural networks (general non-linear models, adaptivity, "artificial brain")
A minimal example applying a few of these is sketched below.
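As an illustration only (not from the original lecture), here is a minimal sketch, assuming scikit-learn, that compares several of the listed classifier families on the same synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "Decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
}

# Cross-validated accuracy gives a rough, like-for-like comparison
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name:18s} {scores.mean():.2f}")
```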
16. Training accuracy vs. generalization
17. Model complexity, training set size and generalization
18. Similarity measures
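The slide's own content (presumably formulas) is not reproduced here; as a sketch of a few standard similarity/distance measures often used in this context (Euclidean, Manhattan, and a correlation-based distance), assuming NumPy:

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance: square root of the sum of squared coordinate differences."""
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    """Manhattan (city-block) distance: sum of absolute coordinate differences."""
    return np.sum(np.abs(a - b))

def correlation_distance(a, b):
    """1 - Pearson correlation: small when the two profiles vary together."""
    return 1.0 - np.corrcoef(a, b)[0, 1]

a = np.array([41.0, 230.0, 60.0])
b = np.array([45.0,  90.0, 70.0])
print(euclidean(a, b), manhattan(a, b), correlation_distance(a, b))
```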
19. k-nearest neighbors as a simple algorithm for classification
- Given a training set of N objects with known class assignments and k < N, find an assignment of new objects (not included in the training set) to one of the classes, based on the assignments of their k nearest neighbors
- A simple, non-parametric method that works surprisingly well, especially in the case of low-dimensional problems
- Note, however, that the choice of the distance measure may again have a profound effect on the results
- The optimal k is found by trial and error
20. k-nearest neighbor algorithm
Step 1: Compute pairwise distances and take the k closest neighbors. Step 2: Assign the class based on simple majority voting; the new point belongs to the class with the most neighbors among the k.
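A minimal sketch of these two steps (an illustration, not the lecture's own code), using Euclidean distance as one possible choice and NumPy:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 1: distances from x_new to every training object, take the k closest
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]
    # Step 2: simple majority voting over the neighbors' class labels
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy training set (e.g. [LDL, HDL] per patient) with class labels 0/1
X_train = np.array([[230, 60], [120, 50], [90, 70], [250, 40], [100, 65]])
y_train = np.array([0, 1, 1, 0, 1])

print(knn_predict(X_train, y_train, np.array([200, 55]), k=3))
```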