Title: Bio277 Lab 2: Clustering and Classification of Microarray Data
1. Bio277 Lab 2: Clustering and Classification of Microarray Data
- Jess Mar
- Department of Biostatistics
- Quackenbush Lab DFCI
- jmar_at_hsph.harvard.edu
2. Machine Learning
Machine learning algorithms predict the class of new observations based on patterns discerned from existing data.
Goal: derive a rule (classifier) that assigns a new object (e.g. a patient's microarray profile) to a pre-specified group (e.g. aggressive vs non-aggressive prostate cancer).
Classification algorithms are a form of supervised learning; clustering algorithms are a form of unsupervised learning.
3. The Golub Data
Golub et al. published gene expression microarray data in a 1999 Science paper entitled "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring". The primary focus of their paper was to demonstrate the use of a class discovery procedure which could assign tumors to either acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML). Bioconductor has this (pre-processed) data packaged up in golubEsets.
> library(golubEsets)
> library(help = golubEsets)
4. Some Clustering Algorithms for Array Data
- Hierarchical Methods: Single, Average, Complete Linkage, plus other variations.
- Partitioning Methods: Self-Organising Maps (Kohonen), K-Means Clustering, Gene shaving (Hastie, Tibshirani et al.)
- Model-based clustering: Plaid models (Lazzeroni & Owen)
5. Cluster Analysis
- Clustering genes on the basis of experiments or across a time series can help elucidate unknown gene function.
- Clustering slides on the basis of genes can discover subclasses in tissue samples.
A clustering problem is generally much harder than a classification problem because we don't know the number of classes.
- Hierarchical Methods: (Agglomerative, Divisive); (Single, Average, Complete) Linkage
- Model-based Methods: Mixed models, Plaid models, Mixture models
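As a self-contained illustration of a partitioning method, the sketch below runs k-means on simulated two-group data; the data and the choice k = 2 are assumptions made for this example, not part of the lab data.

```r
# Simulate expression-like data: two groups of 10 samples in 50 dimensions,
# with the second group shifted so the clusters are well separated
set.seed(1)
x <- rbind(matrix(rnorm(10 * 50, mean = 0), nrow = 10),
           matrix(rnorm(10 * 50, mean = 3), nrow = 10))

# Partition the 20 samples into k = 2 clusters; nstart restarts guard
# against poor local optima
km <- kmeans(x, centers = 2, nstart = 10)

# Cross-tabulate recovered clusters against the true group labels
table(km$cluster, rep(c("A", "B"), each = 10))
```

With this much separation the two clusters recover the simulated groups exactly; on real expression data the choice of k is itself part of the problem.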
6. Hierarchical Clustering
Source: J-Express Manual
We join (or break) nodes based on the notion of maximum (or minimum) similarity, measured e.g. with Euclidean distance or (Pearson) correlation.
7. Different Ways to Determine Distances Between Clusters
Single linkage
Complete linkage
Average linkage
8. Implementing Hierarchical Clustering
Agglomerative hierarchical clustering with the function agnes (from the cluster package):
> colnames(eset.filt) <- classLabels
> plot(agnes(dist(t(eset.filt), method = "euclidean")))
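Base R's hclust supports the same linkage choices without extra packages; here is a minimal sketch on simulated data (the data are made up for illustration, not drawn from the Golub set):

```r
# 20 simulated samples x 100 genes
set.seed(2)
x <- matrix(rnorm(20 * 100), nrow = 20)

# Pairwise Euclidean distances between samples
d <- dist(x, method = "euclidean")

# The three linkage rules from the slide above
hc.single   <- hclust(d, method = "single")    # nearest-neighbour distance
hc.complete <- hclust(d, method = "complete")  # farthest-neighbour distance
hc.average  <- hclust(d, method = "average")   # mean pairwise distance

# Dendrogram for one of the fits
plot(hc.complete, main = "Complete linkage")
```

The three fits use the same distance matrix, so any differences in the dendrograms come from the linkage rule alone.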
9. Principal Component Analysis
A multi-dimensional scaling tool. See GC's lectures for a more in-depth treatment. In our Golub data set, PCA will take the data (500 genes x 72 samples) and map each sample vector (ALL or AML) from 500 dimensions down to 2 dimensions.
> pca.samples <- princomp(eset.filt)
> plot(pca.samples)
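A self-contained version of the same idea with prcomp, on simulated data (the group sizes and mean shift are invented for illustration): each sample's gene-expression vector is projected onto the first two principal components.

```r
# 30 simulated samples x 100 genes; the second group of 15 samples is
# shifted so that the leading component should separate the groups
set.seed(3)
expr <- rbind(matrix(rnorm(15 * 100, mean = 0), nrow = 15),
              matrix(rnorm(15 * 100, mean = 2), nrow = 15))

# PCA on the samples; scale. = TRUE gives each gene unit variance
pca <- prcomp(expr, scale. = TRUE)

# 2-D map of the samples: first two principal component scores
plot(pca$x[, 1], pca$x[, 2], col = rep(1:2, each = 15),
     xlab = "PC1", ylab = "PC2")
```

Note that prcomp takes samples as rows; with the Golub expression matrix stored as genes x samples, the matrix would need to be transposed first.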
10. Principal Components
12. Classification Example: Support Vector Machine
- For this example we will use data from Golub et al.: 47 patients with ALL, 25 patients with AML.
- 7129 genes from an Affymetrix HGU6800 array, but we'll take a subset for this example.
- > library(MLInterfaces)
- > library(golubEsets)
- > library(e1071)
- > data(golubMerge)
- To fit the support vector machine:
- > model <- svm(classLabels[1:40] ~ ., data = t(eset.train))
13. Visualizing the SVM
What predictions were made for the test set?
> predLabels <- predict(model, t(eset.test))
> predLabels
ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL
ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL AML AML AML AML AML AML AML
AML AML ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL
ALL ALL ALL ALL ALL ALL AML AML AML AML AML AML AML AML AML AML AML
AML AML
Levels: ALL AML
How do these stack up to the true classification?
> trueLabels <- classLabels[41:72]
> table(predLabels, trueLabels)
          trueLabels
predLabels ALL AML
       ALL  21   0
       AML   0  11
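The same fit/predict/confusion-table workflow can be sketched in a self-contained way. Since the e1071 package may not be installed, the sketch below uses linear discriminant analysis from MASS (which ships with R) as a stand-in for the SVM, on simulated data; the sample sizes mimic the 40-sample training set and 32-sample test set used above.

```r
library(MASS)  # lda(): a stand-in classifier, not the svm() used in the lab

# Simulated "expression" data: 10 genes, two classes shifted apart
set.seed(4)
train <- rbind(matrix(rnorm(20 * 10, mean = 0), nrow = 20),
               matrix(rnorm(20 * 10, mean = 2), nrow = 20))
test  <- rbind(matrix(rnorm(21 * 10, mean = 0), nrow = 21),
               matrix(rnorm(11 * 10, mean = 2), nrow = 11))
trainLabels <- factor(rep(c("ALL", "AML"), c(20, 20)))
trueLabels  <- factor(rep(c("ALL", "AML"), c(21, 11)))

fit <- lda(train, grouping = trainLabels)   # fit on the training samples
predLabels <- predict(fit, test)$class      # predict the held-out samples
table(predLabels, trueLabels)               # confusion table, as above
```

Keeping the test samples completely out of the fitting step, as here and in the slides, is what makes the confusion table an honest estimate of classification error.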
14. More Materials, More Labs?
Lecture Topics Covered Since Last Lab:
- Hypothesis Testing of Differentially Expressed Genes
- Gene Set Enrichment
- Clustering
- Classification: Support Vector Machines
Tutorial: BioConductor Tour