Bio277 Lab 2: Clustering and Classification of Microarray Data

1
Bio277 Lab 2: Clustering and Classification of Microarray Data
  • Jess Mar
  • Department of Biostatistics
  • Quackenbush Lab DFCI
  • jmar_at_hsph.harvard.edu

2
Machine Learning
Machine learning algorithms predict the classes of new observations based on patterns discerned from existing data.
Goal: derive a rule (classifier) that assigns a new object (e.g. a patient's microarray profile) to a pre-specified group (e.g. aggressive vs non-aggressive prostate cancer).
Classification algorithms are a form of supervised learning. Clustering algorithms are a form of unsupervised learning.
3
The Golub Data
Golub et al. published gene expression microarray data in a 1999 Science paper entitled "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring".
The primary focus of their paper was to demonstrate a class discovery procedure that could assign tumors to either acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML).
Bioconductor has this (pre-processed) data packaged up in golubEsets.
> library(golubEsets)
> library(help=golubEsets)
4
Some Clustering Algorithms for Array Data
  • Hierarchical Methods: Single, Average, Complete Linkage, plus other variations.
  • Partitioning Methods: Self-Organising Maps (Kohonen), K-Means Clustering
  • Gene shaving (Hastie, Tibshirani et al.)
  • Model-based clustering
  • Plaid models (Lazzeroni & Owen)
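
As an illustration of a partitioning method, here is a minimal k-means sketch in base R. It does not use the Golub data: the matrix expr, its dimensions and the group shift are all simulated assumptions chosen only to make the two groups obvious.

```r
# Simulated expression matrix (an assumption, not the Golub data):
# 6 samples (rows) x 4 genes (columns), two groups separated by a shift.
set.seed(1)
group_a <- matrix(rnorm(12, mean = 0, sd = 0.3), nrow = 3)
group_b <- matrix(rnorm(12, mean = 5, sd = 0.3), nrow = 3)
expr <- rbind(group_a, group_b)
rownames(expr) <- paste0("sample", 1:6)

# Partition the samples into k = 2 clusters. nstart repeats the random
# initialisation and keeps the solution with the lowest within-cluster
# sum of squares.
km <- kmeans(expr, centers = 2, nstart = 10)
km$cluster
```

Unlike the hierarchical methods, k-means needs the number of clusters k up front, which is exactly the quantity we usually do not know in a clustering problem.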
5
Cluster Analysis
  • Clustering genes on the basis of experiments or across a time series.
    → Elucidate unknown gene function.
  • Clustering slides on the basis of genes.
    → Discover subclasses in tissue samples.

A clustering problem is generally much harder than a classification problem because we don't know the number of classes in advance.
Hierarchical Methods (Agglomerative, Divisive): Single, Average, Complete Linkage
Model-based Methods: Mixed models, Plaid models, Mixture models
6
Hierarchical Clustering
Source: J-Express Manual
We join (or break) nodes based on the notion of maximum (or minimum) similarity.
Euclidean distance
(Pearson) correlation
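
The two measures named on this slide can be computed directly in base R. The sketch below uses two made-up expression profiles x and y (hypothetical values); turning the Pearson correlation into a dissimilarity as 1 - r is one common convention, assumed here rather than taken from the slides.

```r
# Two hypothetical expression profiles measured over four conditions.
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 9)

# Euclidean distance: square root of the summed squared differences.
euclid <- sqrt(sum((x - y)^2))

# Correlation-based dissimilarity: 1 - Pearson correlation, so that
# perfectly correlated profiles are at distance 0.
pearson_dist <- 1 - cor(x, y, method = "pearson")

# dist() returns the same Euclidean value when the profiles are rows.
euclid_check <- as.numeric(dist(rbind(x, y), method = "euclidean"))
```

Here y is roughly twice x, so the Euclidean distance is large while the correlation-based distance is close to zero: the two measures answer different questions (similar magnitude vs similar shape).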
7
Different Ways to Determine Distances Between
Clusters
Single linkage
Complete linkage
Average linkage
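
A small worked sketch of the three linkage rules, using two tiny hypothetical clusters of 2-D points (cluster_a and cluster_b are made up for illustration). Each rule summarises the same set of between-cluster pairwise distances in a different way.

```r
# Two small hypothetical clusters in 2-D (rows are observations).
cluster_a <- rbind(c(0, 0), c(0, 1))
cluster_b <- rbind(c(3, 0), c(5, 0))

# All pairwise Euclidean distances, then keep only the A-versus-B block.
n_a <- nrow(cluster_a)
all_d <- as.matrix(dist(rbind(cluster_a, cluster_b)))
between <- all_d[1:n_a, -(1:n_a)]

single_link   <- min(between)   # distance between the closest pair
complete_link <- max(between)   # distance between the farthest pair
average_link  <- mean(between)  # mean distance over all pairs
```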
8
Implementing Hierarchical Clustering
Agglomerative hierarchical clustering with the function agnes (from the cluster package):
> library(cluster)
> colnames(eset.filt) <- classLabels
> plot(agnes(dist(t(eset.filt), method="euclidean")))
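
For readers without the lab objects at hand, the same idea can be tried on simulated data. This sketch swaps in base R's hclust() in place of agnes, and the matrix expr (20 genes x 6 samples, with a shift added to the last three samples) is an assumption standing in for eset.filt.

```r
set.seed(2)
# Simulated stand-in for a filtered expression matrix:
# 20 genes (rows) x 6 samples (columns).
expr <- matrix(rnorm(20 * 6, sd = 0.5), nrow = 20)
expr[, 4:6] <- expr[, 4:6] + 4          # shift the last three samples
colnames(expr) <- c("ALL", "ALL", "ALL", "AML", "AML", "AML")

# Cluster samples: transpose so samples are rows, as dist() expects.
sample_dist <- dist(t(expr), method = "euclidean")
tree <- hclust(sample_dist, method = "average")

# Cut the dendrogram into two groups and compare with the labels.
groups <- cutree(tree, k = 2)
table(groups, colnames(expr))
# plot(tree) would draw the dendrogram.
```

Note the transpose: dist() computes distances between rows, so clustering samples rather than genes requires t().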
9
Principal Component Analysis
A dimension-reduction tool, closely related to classical multi-dimensional scaling. See GC's lectures for a more in-depth treatment.
In our Golub data set, PCA will take the data (500 genes x 72 samples) and map each sample vector (ALL or AML) from gene space (one dimension per gene) to 2 dimensions.
> pca.samples <- princomp(eset.filt)
> plot(pca.samples)
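
A minimal sketch of mapping samples to two dimensions, on simulated data. It uses prcomp() instead of princomp() because princomp() refuses inputs with more variables than observations, which is the situation once samples are rows and genes are columns. The matrix expr, its size and the labels vector are assumptions for illustration.

```r
set.seed(3)
# Simulated data: 50 genes (rows) x 8 samples (columns), with the last
# four samples shifted upwards to mimic a second class.
expr <- matrix(rnorm(50 * 8), nrow = 50)
expr[, 5:8] <- expr[, 5:8] + 3
labels <- rep(c("ALL", "AML"), each = 4)

# prcomp() treats rows as observations, so transpose to samples x genes.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Each sample is now described by its scores on the first two components.
scores <- pca$x[, 1:2]
# plot(scores, col = ifelse(labels == "ALL", "blue", "red")) would draw the map.
```

In this simulated setting the first component picks up the class shift, so the two groups fall on opposite sides of zero along PC1.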
10
Principal Components
12
Classification Example: Support Vector Machine
  • For this example we will use data from Golub et al.
  • 47 patients with ALL, 25 patients with AML
  • 7129 genes from an Affymetrix Hu6800 array, but we'll take a subset for this example.
  • > library(MLInterfaces)
  • > library(golubEsets)
  • > library(e1071)
  • > data(golubMerge)
  • To fit the support vector machine:
  • > model <- svm(classLabels[1:40] ~ ., data=t(eset.train))

13
Visualizing the SVM
What predictions were made for the test set?
> predLabels <- predict(model, t(eset.test))
> predLabels
ALL ALL ALL ALL ALL
ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL
ALL ALL ALL ALL AML AML AML AML AML AML AML AML
AML ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL
ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL AML AML
AML AML AML AML AML AML AML AML AML AML AML
Levels: ALL AML
How do these stack up to the true classification?
> trueLabels <- classLabels[41:72]
> table(predLabels, trueLabels)
          trueLabels
predLabels ALL AML
       ALL  21   0
       AML   0  11
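
The cross-tabulation can be turned into a single accuracy figure. The sketch below does not rerun the SVM: true_labels and pred_labels are hypothetical vectors built only to reproduce the counts reported on this slide (21 ALL and 11 AML, all predicted correctly).

```r
# Hypothetical label vectors reproducing the reported counts.
true_labels <- factor(rep(c("ALL", "AML"), times = c(21, 11)))
pred_labels <- true_labels  # every prediction agrees with the truth

# Confusion matrix: rows are predictions, columns are true classes.
confusion <- table(predLabels = pred_labels, trueLabels = true_labels)

# Accuracy = correctly classified samples (the diagonal) / all samples.
accuracy <- sum(diag(confusion)) / sum(confusion)
```

With real predictions the off-diagonal cells show which class is being mistaken for which, something a single accuracy number hides.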
14
More Materials, More Labs?
Lecture Topics Covered Since Last Lab
  • Hypothesis Testing of Differentially Expressed Genes
  • Gene Set Enrichment
  • Clustering
  • Classification
  • Support Vector Machines
Tutorial: Bioconductor Tour