Title: Gene Expression Data Analysis
1 Gene Expression Data Analysis
- Zhang Louxin
- Dept. of Mathematics, Nat. University of Singapore
3 RNA Transcription (by L. Miller)
[figure: RNA polymerase transcribing mRNA, 5' and 3' ends labelled]
4 The Transcriptome (by L. Miller)
5 cDNA Microarray
Based on the hybridization principle. Uses parallelism so that one can observe the activity of thousands of genes at a time.
P. Brown / Stanford
6 Paradigm for Using cDNA Microarrays
Animals / Patients / Cell Lines
  -> Appropriate Tissue
  -> Extract RNA
  -> Microarray Hybridization
  -> Scan Microarray
  -> Computer Analysis
Data measures the relative ratio of mRNA abundance of each gene in the test sample to the reference sample.
7 cDNA Microarray Schema -- P. Brown's approach
Data from a single experiment measures the relative ratio of mRNA abundance of each gene on the array in the two samples (D. Duggan et al., Nature Genetics, 1999).
8 Affymetrix GeneChip Probe Arrays
Single-stranded, fluorescently labeled DNA target.
9 cDNA Microarray
1. DNA microarrays are ordered assemblies of DNA sequences immobilized on a solid support (such as chemically modified glass).
10 What is a DNA microarray?
2. The DNA sequences (e.g. PCR products or long oligos) correspond to the transcribed regions of genes.
[figure: genomic DNA with exons 1-3 of gene Y, flanked by genes X and Z; probe sequence ATTTCAGGCGCATGCTCGG]
11 What is a DNA microarray?
3. The DNA sequences (aka probes) are capable of annealing with cDNA targets derived from the mRNA of a cell.
12 Scanning / Signal Detection
[figure: Cy3 channel and Cy5 channel scans]
13 GIS COMP 19K human oligo array v1.0p2
14 Applications
- Gene function assignment (guilt-by-association): cluster genes into groups; unknown genes are assigned a function based on the known functions of genes in the same expression cluster.
- Gene prediction.
- The regulatory network of living cells: for a given cell, arrays can produce a snapshot revealing which genes are on or off at a particular time.
- Clinical diagnosis (especially for cancers): cancers are caused by gene disorders. These disorders result in a deviation of the gene expression profile from that of the normal cell.
15 Microarray Data Analysis
Array Quantification (from digital image)
Quality Control
Data Mining
16 Gene Expression Matrix
17 Difficulties of the Analysis
- The myriad random and systematic measurement errors.
- Small numbers of samples (cell lines, patients), but a large number of variables (probes or genes).
- Random errors are caused by the time at which the arrays are processed, target accessibility, and variation in washing procedures.
- Systematic errors are biases: they result in a constant tendency to over- or underestimate true values. Biasing factors depend on the spotting, scanning, and labelling technologies.
18 Normalization 1 - Ratio and Log Transformation
Ratios of raw expression values from image quantification are usually not appropriate for statistical analysis; log-transformed data are usually used. Why?
(1) The log transformation removes much of the proportional relationship between random error and signal intensity. Most statistical tests assume an additive error model.
(2) Distributions of replicated logged expression values tend to be normal.
(3) Summary statistics of log ratios yield the same quantities regardless of the numerator/denominator assignment.
Example: Consider treatment:control ratios for three replicates together with the inverted (control:treatment) ratios. The two sets have different means and standard deviations, but their logs have the same means (with opposite signs) and the same standard deviations.
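The symmetry of log ratios under inversion can be checked with a small sketch (the ratio values here are invented for illustration, not taken from the slide):

```python
import math
import statistics

# Hypothetical treatment:control ratios for three replicates, and their inverses.
ratios = [2.0, 0.5, 4.0]
inverted = [1.0 / r for r in ratios]  # control:treatment

# Raw ratios: the means differ between the two orientations.
mean_raw, mean_inv = statistics.mean(ratios), statistics.mean(inverted)

# Log2 ratios: same magnitude, opposite sign, identical spread.
logs = [math.log2(r) for r in ratios]
logs_inv = [math.log2(r) for r in inverted]

print(mean_raw, mean_inv)                                  # different
print(statistics.mean(logs), statistics.mean(logs_inv))    # equal magnitude, opposite sign
print(statistics.stdev(logs), statistics.stdev(logs_inv))  # identical
```

This is exactly point (3): summary statistics of the logs do not depend on which sample is put in the numerator.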
19 Normalization 2 - Normalizing Two Experiments
- The expression levels of genes are normalized to a common standard so that they can be compared.
- The power of microarray analysis comes from analysing many experiments to identify common patterns of expression.
- Techniques:
  - Housekeeping genes
  - Spiked controls
  - Global normalization to the overall distribution
[figure: expression values of exp1 and exp2 across experiments]
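As a minimal sketch of global normalization (the data and the choice of median-centering are illustrative assumptions, not the slide's prescription), aligning each experiment's overall distribution removes a constant bias between arrays:

```python
import statistics

# Toy log-expression values for two hypothetical experiments:
# exp2 has the same shape as exp1 but is shifted by a constant bias.
exp1 = [0.2, 1.5, -0.3, 0.8]
exp2 = [1.2, 2.5, 0.7, 1.8]

def median_center(values):
    """Global normalization sketch: subtract the per-experiment median so
    that the overall distributions are aligned across experiments."""
    m = statistics.median(values)
    return [v - m for v in values]

n1, n2 = median_center(exp1), median_center(exp2)
print(n1)
print(n2)  # matches n1 once the constant bias is removed
```

Housekeeping genes and spiked controls serve the same purpose but anchor the correction to a known subset of probes rather than the whole distribution.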
20 Normalization 3 - Outliers
Concept: Outliers are extreme values in a distribution of replicates. Their number can be as high as 15% in a typical microarray experiment.
Reasons:
(1) They are caused by image artifacts (e.g. dust on a cDNA array, or blooming of adjoining spots on a radioisotopic array).
(2) They can also be caused by factors such as cross-hybridization or failure of a probe to hybridize adequately.
Detection: Large sample sizes are needed to detect outliers accurately and precisely. Estimate errors on all the probes together, rather than on a probe-by-probe basis.
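One simple way to flag such extreme replicate values (a robust median/MAD rule chosen here for illustration; the slide does not prescribe a particular detector, and the data are invented):

```python
import statistics

# Hypothetical logged replicate values for one probe; the last value is an
# image artifact (e.g. dust) producing an extreme measurement.
replicates = [1.1, 0.9, 1.0, 1.2, 4.8]

def flag_outliers(values, k=5.0):
    """Flag values more than k median-absolute-deviations from the median.
    Median/MAD is used instead of mean/stdev because an outlier inflates
    the standard deviation and can mask itself in small replicate sets."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [v for v in values if abs(v - med) > k * mad]

print(flag_outliers(replicates))  # [4.8]
```

With only a handful of replicates any such rule is fragile, which is the slide's point about needing large sample sizes and pooling error estimates across probes.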
21 Mining Gene Expression Data
Classification: classifying genes (or tissues, conditions) into groups, each containing genes (or tissues) with similar attributes.
Class prediction: given a set of known classes of genes (or tissues), determine the correct class for new genes (or tissues).
22 PART 1: Molecular Classification
Traditional clustering algorithms: K-means, Self-Organising Maps, Hierarchical Clustering.
Graph-theoretic clustering algorithms (Ben-Dor et al. '99, Hartuv et al. '99).
23 K-means, Self-Organising Maps
Input: a gene expression matrix and an integer k.
Output: k disjoint groups of genes with similar expression.
[figure: clustering genes with K=3 across experiments exp1..exp4]
24 Similarity and Dissimilarity Measures
Two main classes of distance functions are used here:
-- Correlation coefficients, for comparing the shape of expression curves (e.g. the Pearson correlation coefficient).
-- Metric distances, for measuring the distance between two points in a metric space (e.g. Manhattan distance, Euclidean distance).
25 Pearson Correlation Coefficient ρ(X, Y) (between -1 and 1)
Let X = (x_1, ..., x_n) and Y = (y_1, ..., y_n), with means μ_X, μ_Y and standard deviations σ_X, σ_Y. Then

  ρ(X, Y) = (1/n) Σ_{i=1..n} ((x_i − μ_X)/σ_X) ((y_i − μ_Y)/σ_Y)

[figure: examples of negatively and positively correlated expression curves X and Y]
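The definition above translates directly into code; a minimal sketch (the example profiles are invented):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two expression profiles.
    Measures similarity of *shape*: +1 for perfectly correlated curves,
    -1 for anti-correlated curves, near 0 for unrelated ones."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)

x = [1.0, 2.0, 3.0, 4.0]
print(pearson(x, [2.0, 4.0, 6.0, 8.0]))   # ~ 1.0 (same shape, scaled)
print(pearson(x, [4.0, 3.0, 2.0, 1.0]))   # ~ -1.0 (anti-correlated)
```

Note that the first pair differs in magnitude yet has correlation 1: Pearson correlation deliberately ignores scale, which is why it is paired with metric distances on the previous slide.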
26 Pearson Correlation Coefficient ρ(X, Y) (between -1 and 1)
Pitfalls: two expression curves X and Y can exhibit a large correlation even when their overall shapes differ, e.g. when a few extreme values dominate the computation.
[figure: curves X and Y showing a large correlation despite dissimilar shapes]
27 Distance Metrics
Let X = (x_1, ..., x_n) and Y = (y_1, ..., y_n).

Euclidean distance: d(X, Y) = sqrt( Σ_{i=1..n} (x_i − y_i)^2 )
-- The most commonly used distance; identical to the geometric distance in multidimensional space.

Manhattan distance: d(X, Y) = Σ_{i=1..n} |x_i − y_i|
-- Sum of differences across dimensions.
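Both metrics are one-liners; a sketch with invented vectors:

```python
import math

def euclidean(xs, ys):
    """Geometric distance between two points in n-dimensional expression space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))

def manhattan(xs, ys):
    """Sum of absolute differences across dimensions."""
    return sum(abs(x - y) for x, y in zip(xs, ys))

x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(euclidean(x, y))  # 5.0
print(manhattan(x, y))  # 7.0
```

Unlike the Pearson coefficient, both metrics are sensitive to magnitude: two profiles with identical shape but different scale are far apart.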
28 K-means Algorithm
Arbitrarily partition the input points into K clusters; each cluster is represented by its geometric center. Repeatedly adjust the K clusters by assigning each point to the nearest cluster.
[figure: input points and an initial partition with K=3]
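The loop described above can be sketched as follows (the points and the hand-picked seed centers are illustrative; in practice the initialization is arbitrary, and a poor one can leave K-means in a local optimum):

```python
def kmeans(points, centers, iters=10):
    """Minimal K-means sketch: repeatedly assign each point to the nearest
    center, then recompute each cluster's geometric center."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            # nearest center under squared Euclidean distance
            j = min(range(len(centers)),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        centers = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return clusters

# Six points in three obvious groups; one seed point per region (K = 3).
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (10.0, 0.2), (9.8, 0.0)]
clusters = kmeans(points, centers=[points[0], points[2], points[4]])
print([len(c) for c in clusters])  # [2, 2, 2]
```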
29 Hierarchical Clustering Algorithm
Input: some data points.
Output: a set of clusters arranged in a tree - a hierarchical structure. Each internal node corresponds to a cluster.
What is the distance between clusters? The average pairwise distance (average linkage).
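An agglomerative version of this can be sketched in a few lines (a naive O(n^3) illustration with invented 1-D points, not an efficient implementation):

```python
def average_linkage(c1, c2, dist):
    """Distance between two clusters = average pairwise distance
    between their members (average linkage)."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def hierarchical(points, dist):
    """Agglomerative clustering sketch: start with singleton clusters and
    repeatedly merge the closest pair, recording the merge tree."""
    clusters = [[p] for p in points]
    tree = []
    while len(clusters) > 1:
        # find the closest pair of clusters under average linkage
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: average_linkage(clusters[ab[0]], clusters[ab[1]], dist))
        tree.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return tree

dist = lambda a, b: abs(a - b)
tree = hierarchical([0.0, 0.1, 5.0, 5.2], dist)
print(tree[0])  # the closest pair merges first: ([0.0], [0.1])
```

Each recorded merge corresponds to one internal node of the tree, with the final merge as the root.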
30 Identifying Subtypes of Diffuse Large B-Cell Lymphoma (DLBCL) (Alizadeh et al., Nature, 2000)
- A special cDNA microarray -- the Lymphochip -- was designed to study gene expression patterns in three lymphoid malignancies: DLBCL, FL and CLL.
- 12,069 cDNA clones from a germinal centre B-cell library; 2,338 cDNA clones from libraries derived from DLBCL, follicular lymphoma (FL), mantle cell lymphoma, and chronic lymphocytic leukaemia (CLL); 3,349 other cDNA clones.
- 96 normal and malignant lymphocyte samples.
31 Germinal Centre B-like DLBCL vs Activated B-like DLBCL
Courtesy: Alizadeh
32 Germinal Centre B-like DLBCL vs Activated B-like DLBCL
International Prognostic Indicator
Courtesy: Alizadeh
33 Remarks
- Programmes designed to cluster data generally re-order the rows, or columns, or both, such that patterns of expression become visually apparent when presented in this fashion.
- There might never be a best approach for clustering data. Different approaches allow different aspects of the data to be explored.
- They are subjective: different distance metrics will place different objects in different clusters.
- Understanding the underlying biology, particularly of gene regulation, is important.
34 Research Problem
Bi-clustering: cluster genes and experiments at the same time. Why? Some genes are only co-regulated in a subset of conditions (experiments).
References:
- Y. Kluger et al. Spectral biclustering of microarray data: coclustering genes and conditions. Genome Res. 13, 703-716.
- L. Zhang and S. Zhu. A new clustering method for microarray data analysis. Proc. IEEE CSB 2002.
35 Molecular Class Prediction
Several supervised learning methods are available:
- Neural Networks
- Support Vector Machines
- Decision Trees
- Other statistical methods
36 A Supervised Learning Method for Predicting a Binary Class
[figure: positive and negative examples feed a learning step; a new item is then classified Yes/No in the prediction step]
A class is just a concept! In the learning step, the class is modelled as a mathematical object -- a function with multiple variables, or a subspace in a high-dimensional space -- representing knowledge of the class.
37 Learning the Class of Tall Men
The class is modelled as a half-space: hgt > 6'3''.
Examples
38 Support Vector Machines
A support vector machine finds a hyperplane that maximally separates the data points of the two classes in the feature space.
39 Molecular Class Prediction -- Leukemia Case
Morphology does not distinguish leukemias very well. Golub et al. (Science, 1999) proposed a voting method for predicting acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) using gene expression fingerprinting.
In that work, an Affymetrix DNA chip with 6,817 genes was used for 72 ALL/AML samples.
40 The Voting Algorithm (Golub '99)
Courtesy: Golub
1. Select a subset of (2 x 25) genes highly correlated with the ALL/AML distinction, based on 38 training samples.
   Correlation metric: P(g) = (μ_AML(g) − μ_ALL(g)) / (σ_AML(g) + σ_ALL(g)), where μ_AML(g) (μ_ALL(g)) is the mean expression level of g in AML (ALL) samples, and σ_AML(g) (σ_ALL(g)) is the within-class standard deviation of the expression of g in AML (ALL) samples.
2. Each selected gene casts a weighted vote for a new sample; the total of the weighted votes decides the winning class.
41 The Voting Method: Separating Samples by Hyperplanes
Mathematically, the total of all the votes on a new sample X is

  V = Σ_g a_g (x_g − b_g),

where x_g is the expression level of gene g in the new sample X, a_g = P(g), and b_g is the midpoint of the two class means of g. If V > 0, X is classified as AML; otherwise, X is ALL.
[figure: AML and ALL samples separated by a hyperplane]
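A sketch of the voting scheme in the spirit of Golub et al. (the two-gene toy data and class labels are invented; the real method selects 2 x 25 genes from 38 training samples):

```python
import statistics

def gene_params(class1_vals, class2_vals):
    """Per-gene voting parameters: a_g is the signal-to-noise correlation
    with the class distinction, b_g the midpoint of the two class means."""
    m1, m2 = statistics.mean(class1_vals), statistics.mean(class2_vals)
    s1, s2 = statistics.stdev(class1_vals), statistics.stdev(class2_vals)
    return (m1 - m2) / (s1 + s2), (m1 + m2) / 2.0

def vote(sample, params):
    """Each gene casts a weighted vote a_g * (x_g - b_g); the sign of the
    total decides the class (positive -> class 1, negative -> class 2)."""
    return sum(a * (x - b) for x, (a, b) in zip(sample, params))

# Toy training data: 2 genes, 3 samples per class (values per sample).
class1 = [[5.0, 1.0], [5.5, 1.2], [6.0, 0.8]]
class2 = [[1.0, 4.0], [1.2, 4.5], [0.8, 5.0]]
params = [gene_params([s[g] for s in class1], [s[g] for s in class2])
          for g in range(2)]

print(vote([5.8, 1.1], params) > 0)   # True  -> class 1
print(vote([0.9, 4.2], params) > 0)   # False -> class 2
```

Because V is linear in the expression values x_g, the decision boundary V = 0 is exactly the separating hyperplane of the slide.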
42 Decision Tree Learning
- An information-reduction learning method.
- Represents a class or concept as a logic sentence, e.g.
  IF (Outlook = Sunny) AND (Humidity = Normal) THEN PlayTennis.
- When to use decision trees:
  - Instances describable by attribute-value pairs
  - Target function is discrete-valued
  - Possibly noisy training data
Examples: medical diagnosis, credit risk analysis.
43 Textbook Example: PlayTennis
- Each internal node tests an attribute.
- Each branch has a value.
- Each leaf assigns a classification.
44 Remarks
- A decision tree is constructed by top-down induction.
- Preference is for short trees, and for those with high information-gain attributes near the root.
- Information is measured with entropy.
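Entropy and information gain can be computed directly; a sketch on a cut-down PlayTennis-style example (the four training rows are invented for illustration):

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    counts = {}
    for l in labels:
        counts[l] = counts.get(l, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(labels, attr_values):
    """Information gain of splitting `labels` by the parallel attribute values:
    entropy before the split minus the weighted entropy of the subsets."""
    n = len(labels)
    groups = {}
    for l, v in zip(labels, attr_values):
        groups.setdefault(v, []).append(l)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

play    = ['yes', 'yes', 'no', 'no']
outlook = ['sunny', 'sunny', 'rain', 'rain']   # splits the labels perfectly
wind    = ['weak', 'strong', 'weak', 'strong'] # tells us nothing

print(info_gain(play, outlook))  # 1.0 (perfect split of a 50/50 class)
print(info_gain(play, wind))     # 0.0
```

Top-down induction (as in ID3) places the highest-gain attribute at the root, which is why `outlook` would be tested first here.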
45 ALL vs AML - Decision Tree This Time (Y. Sun, tech report, MIT)
- Single gene (zyxin), single-branch tree; tree size up to 3 genes.
- 38/38 correct on training cases; 31/34 correct on test cases (3 errors).
- Rule: X5735_at < 81 -> 38 ALL.
- 1 decision tree with 1 error; 7 decision trees with 2 errors; 7 decision trees with 3 errors.
46 Gene Selection
Gene selection is critical in molecular class prediction, as we learn from the decision tree results. Why?
- In a cellular process, only a relatively small set of genes is active.
- Mathematically, each gene is just a feature.
- The more weak features, the noisier the data.
- More features raise the overfitting problem.
Research problem: how to select genes?
47 Two Approaches
1. Gene selection is done first, and then these genes are used for learning, as in Golub et al.'s paper.
2. Gene selection and learning are done together, as in decision tree learning.
Does this make a difference in learning?
48 Discovery
50 BioCluster
- Similarity Measure
- Cluster Number
- Clustering Methods: Self-Organising Map (SOM), Hierarchical, K-Means
- Microarray Data Sets
51 Concluding Remarks
- Some previous work and our own work in analysing gene expression data have been summarised.
- Our group will focus on designing more efficient and sophisticated algorithms and software tools for mining and visualizing gene expression data.
52 Advantages of Using Arrays
A microarray contains up to 8,000 genes or probes, and hence it is not necessary to guess in advance what the important genes or mechanisms are. An array produces a broader, more complete, less biased, genome-wide expression profile.
53 Problems with Traditional Clustering Algorithms
1. They are not well suited to studying genes with multiple functions or genes regulated by multiple factors.
2. They cannot handle data errors or missing values well. Errors and missing values often occur when tissues are rare; normalisation of expression levels across different experiments is also problematic.
54 Our Approach (ZZ00)
Let A be a gene expression matrix with gene set X and experiment set Y, and let I ⊆ X and J ⊆ Y. Then I and J specify a submatrix A(I, J). We associate a score S(i, j) with each entry of A(I, J).
A(I, J) is δ-smooth if S(i, j) ≤ δ for all i ∈ I, j ∈ J.
55 Clustering Problems
- Smooth Cluster Problem
  Instance: a gene expression matrix A with gene set X and experiment set Y, a subset J ⊆ Y, and a number δ ≥ 0.
  Question: find a largest δ-smooth submatrix A(I, J).
To handle genes with multiple functions, we use a known idea (Hartigan '72, Cheng and Church '00):
- Smooth Bicluster Problem
  Instance: a gene expression matrix A with gene set X and experiment set Y, and a number δ ≥ 0.
  Question: find I ⊆ X and J ⊆ Y with largest min(|I|, |J|) such that A(I, J) is δ-smooth.
56 Greedy Algorithm for the Smooth Cluster Problem: Top-Down Algorithm
Input: a gene expression matrix A with gene set X and experiment set Y, a subset J ⊆ Y, and a number δ > 0.
Output: a set I ⊆ X such that A(I, J) is δ-smooth under J.
  Set I = X initially
  Repeat:
    If A(I, J) is δ-smooth, stop
    Select a row i ∈ I that is furthest from the cluster's center and remove it, i.e. I = I − {i}
  End repeat.
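A runnable sketch of the Top-Down loop (the matrix is invented, and since the slide's exact score formula did not survive conversion, a placeholder S(i, j) is used here: the absolute deviation of each entry from its column mean within the submatrix):

```python
def smooth_score(A, rows, cols):
    """Placeholder per-entry score S(i, j): absolute deviation from the
    column mean within the submatrix A(rows, cols). (Assumed definition;
    the original slide's score formula is not reproduced here.)"""
    mu = {j: sum(A[i][j] for i in rows) / len(rows) for j in cols}
    return {(i, j): abs(A[i][j] - mu[j]) for i in rows for j in cols}

def top_down(A, cols, delta):
    """Greedy Top-Down sketch: start from all genes (rows) and repeatedly
    drop the row furthest from the cluster center until A(I, J) is
    delta-smooth under the given columns J."""
    rows = set(range(len(A)))
    while len(rows) > 1:
        S = smooth_score(A, rows, cols)
        if max(S.values()) <= delta:
            break
        # remove the row with the largest total deviation from the center
        worst = max(rows, key=lambda i: sum(S[(i, j)] for j in cols))
        rows.remove(worst)
    return rows

A = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [9.0, 9.0]]  # last row is an outlier
print(sorted(top_down(A, cols=[0, 1], delta=0.5)))    # [0, 1, 2]
```

The subsequent Bottom-Up pass of slide 57 would then try to add back any excluded row whose inclusion keeps the submatrix δ-smooth.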
57 Top-Down and Bottom-Up Algorithm
Input: a gene expression matrix A with gene set X and experiment set Y, a subset J ⊆ Y, and a number δ > 0.
Output: a set I ⊆ X such that A(I, J) is δ-smooth under J.
  Set I = X initially
  Apply the Top-Down Algorithm first
  Repeat:
    Select a row r ∈ X − I that is closest to the center of cluster I
    If A(I ∪ {r}, J) is δ-smooth, I = I ∪ {r}
  End repeat.
58 Algorithm (Finding a Given Number of Clusters)
Input: a gene expression matrix A with gene set X and experiment set Y, a subset J ⊆ Y, a number δ > 0, and n, the number of δ-smooth clusters to be found.
Output: a set CS of n δ-smooth clusters under the conditions in J.
  I = X; CS = ∅   /* output cluster set */
  Repeat n times:
    Run the Top-Down Algorithm on the set I of unselected genes to get a δ-smooth cluster C
    Apply Bottom-Up on X to extend C
    CS = CS ∪ {C}; I = I − C
  End repeat.
59 Experiments with the Yeast Data
The data set (Tavazoie et al. '99): 2,884 genes, 17 conditions.
Experiments with the K-means algorithm: k = 30, Pearson coefficient.
Our Algorithm 1, with smoothness threshold δ = 50, output over a hundred 50-smooth clusters.
60 Clusters from our algorithm have strong patterns. The 22nd cluster from K-means corresponds roughly to 12 smooth clusters from our algorithm.
61 More Smooth Clusters
62 Characterization of Our Algorithms
Our algorithm first clusters all low-fluctuating genes (noise) into one or two clusters, while the K-means algorithm assigns these genes to many clusters.
[figure: the first two clusters from our algorithm]
63 Comparison with Functional Categories
Our approach is systematic and blind to knowledge of yeast. However, there is significant grouping of genes within the same functional category in many of the discovered smooth clusters.
64 Performance Evaluation: Test on the DLBCL Case