Gene Expression Data Analysis - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Gene Expression Data Analysis


1
Gene Expression Data Analysis
  • Zhang Louxin
  • Dept. of Mathematics
  • Nat. University of Singapore

2
(No Transcript)
3
RNA Transcription (by L. Miller)
[Figure labels: mRNA, RNA polymerase, 3' end, 5' end]
4
The Transcriptome (by L. Miller)
5
cDNA Microarray
Based on the hybridization principle. Uses
parallelism so that one can observe the activity
of thousands of genes at a time.
(P. Brown, Stanford)
6
Paradigm for Using cDNA Microarrays
Animals
Patients
Cell Lines
Appropriate Tissue
Extract RNA
Microarray Hybridization
Scan Microarray
Computer Analysis
The data measure the ratio of the mRNA abundance
of each gene in the test sample to that in the
reference sample.
7
cDNA microarray schema -- P. Brown's approach
Data from a single experiment measures the
relative ratio of mRNA abundance of each gene on
the array in the two samples (D. Duggan et al.,
Nature Genetics, 1999)
8
Affymetrix GeneChip Probe Arrays
Single stranded, fluorescently labeled DNA target
9
cDNA microarray
1. DNA microarrays are ordered assemblies of DNA
sequences immobilized on a solid support
(such as chemically modified glass).
10
What is a DNA microarray?
2. The DNA sequences (e.g. PCR products or long
oligos) correspond to the transcribed regions
of genes.
[Figure: genomic DNA with genes X, Y and Z; exons 1 to 3; example sequence ATTTCAGGCGCATGCTCGG]
11
What is a DNA microarray?
3. The DNA sequences (aka probes) are capable
of annealing with cDNA targets derived
from the mRNA of a cell.
12
Scanning/Signal Detection
Cy3 channel
Cy5 channel
13
GIS COMP 19K human oligo array v1.0p2
14
Applications
  • Gene function assignment (guilt by association):
    cluster genes together into groups; unknown
    genes are assigned a function based on the known
    functions of genes in the same expression cluster.
  • Gene prediction
  • The regulatory network of living cells: for a
    given cell, arrays can produce a snapshot
    revealing which genes are on or off at a
    particular time.
  • Clinical diagnosis (especially for cancers):
    cancers are caused by gene disorders. These
    disorders result in a deviation of the gene
    expression profile from that of the normal cell.

15
Microarray Data Analysis
Array Quantification (from digital image)
Quality control
Data Mining
16
Gene Expression Matrix
17
Difficulties of the Analysis
  • Myriad random and systematic measurement errors.
  • Small numbers of samples (cell lines, patients)
    but a large number of variables (probes or genes).
  • Random errors are caused by the time at which
    the arrays are processed, target accessibility,
    and variation in washing procedures.
  • Systematic errors are biases. They result in a
    constant tendency to over- or underestimate
    true values.
  • Biasing factors depend on spotting, scanning
    and labelling technologies.

18
Normalization 1 - ratio and log transformation
Ratios of raw expression values from image
quantification are usually not appropriate for
statistical analysis. Log-transformed data are
usually used. Why?
(1) The log transformation removes much of the
proportional relationship between random error
and signal intensity. Most statistical tests
assume an additive error model.
(2) Distributions of replicated logged
expression values tend to be normal.
(3) Summary statistics of log ratios yield the
same quantities regardless of the
numerator/denominator assignment.
Example: Consider treatment:control ratios for
three replicates 21.1, 51.4, 15 2 and the
inverted ratios. They have different means and
standard deviations, but their logs have the
same means (with opposite signs) and the same
standard deviations.
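Point (3) can be checked numerically. The sketch below is only an illustration: the three ratios are hypothetical values chosen for the example, not the replicate values from the slide, and log base 2 is an assumption.

```python
import math
from statistics import mean, stdev

# Hypothetical treatment:control ratios for three replicates
# (illustrative values, not the ones on the slide).
ratios = [2.0, 5.0, 0.5]
inverted = [1.0 / r for r in ratios]  # control:treatment

# Raw ratios: mean and spread depend on which sample is the numerator.
raw_summary = (mean(ratios), stdev(ratios))
raw_summary_inverted = (mean(inverted), stdev(inverted))

# Log ratios: inverting only flips the sign of the mean,
# and the standard deviation is unchanged.
log_ratios = [math.log2(r) for r in ratios]
log_inverted = [math.log2(r) for r in inverted]
log_summary = (mean(log_ratios), stdev(log_ratios))
log_summary_inverted = (mean(log_inverted), stdev(log_inverted))

print("raw:", raw_summary, raw_summary_inverted)
print("log:", log_summary, log_summary_inverted)
```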
19
Normalization 2 - normalize two experiments
  • The expression levels of genes are normalized to
    a common standard so that they can be compared.
  • The power of microarray analysis comes from the
    analysis of many experiments to identify common
    patterns of expression.
  • Techniques:
  • Housekeeping genes
  • Spiked controls
  • Global normalization to overall distribution

[Figure: expression value (Exp. value) of exp1 and exp2; axis label: experiments]
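One simple form of global normalization can be sketched as follows. This is an assumption for illustration, not necessarily the procedure used in the talk: each experiment's log expression values are shifted so that their median is zero, which puts two experiments on a common scale. The function name `median_center` and the data are hypothetical.

```python
from statistics import median

def median_center(log_values):
    """Shift one experiment's log expression values so their median is 0."""
    m = median(log_values)
    return [v - m for v in log_values]

# Hypothetical log expression values: exp2 shows the same pattern as exp1,
# but every value is shifted by a constant (e.g. a brighter scan).
exp1 = [1.0, 2.0, 3.0, 4.0, 5.0]
exp2 = [3.0, 4.0, 5.0, 6.0, 7.0]

norm1 = median_center(exp1)
norm2 = median_center(exp2)
print(norm1, norm2)
```

After centering, the constant offset between the two experiments is gone and the values can be compared gene by gene.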
20
Normalization 3 - Outliers
Concept: Outliers are extreme values in a
distribution of replicates. The number can be
as high as 15 in a typical microarray experiment.
Reasons:
(1) They are caused by image artifacts (e.g.
dust on a cDNA array, or blooming of adjoining
spots on a radioisotopic array).
(2) They can also be caused by factors such as
cross-hybridization or failure of one probe to
hybridize adequately.
Detection: Large sample sizes are needed to
detect outliers more accurately and precisely.
Estimate errors on all the probes, rather than
on a probe-by-probe basis.
21
Mining Gene Expression DATA
Classification: Classifying genes (or tissues,
conditions) into groups, each containing genes
(or tissues) with similar attributes.
Class Prediction: Given a set of known classes
of genes (or tissues), determine the correct
class for new genes (or tissues).
22
PART 1: Molecular Classification
Traditional Clustering Algorithms:
K-means, Self-Organising Maps,
Hierarchical Clustering
Graph Theoretic-based Clustering Algorithms
(Ben-Dor et al. 99, Hartuv et al. 99)
23
K-means, Self-Organising Maps
Input: Gene expression matrix, and an integer k.
Output: k disjoint groups of genes with
similar expression.
[Figure: clustering genes with K = 3; expression (Exp) plotted across experiments exp1 to exp4]
24
Similarity and Dissimilarity Measures
Two main classes of distance functions are used
here:
-- Correlation coefficients, for comparing the
shape of expression curves (e.g. the Pearson
correlation coefficient).
-- Metric distances, for measuring the distance
between two points in a metric space (e.g.
Manhattan distance, Euclidean distance).
25
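Pearson Correlation Coefficient p(X, Y) (between
-1 and 1)
Let $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$. Then
$$p(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - \bar{X})(y_i - \bar{Y})}{\sigma_X \sigma_Y},$$
where $\bar{X}$ and $\bar{Y}$ are the means and $\sigma_X$ and $\sigma_Y$
are the standard deviations (with divisor n) of X and Y.
[Figure: plots of X and Y illustrating negative correlation and positive correlation]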
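A minimal sketch of the Pearson correlation coefficient in plain Python follows. The function name `pearson` and the expression profiles are hypothetical; the point is that the coefficient compares the shape of two profiles, not their magnitude.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # The 1/n factors of the covariance and the two standard deviations cancel.
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expression profiles over four experiments.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape, ten times the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]      # mirrored shape

print(pearson(gene_a, gene_b))  # positive correlation
print(pearson(gene_a, gene_c))  # negative correlation
```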
26
Pearson Correlation Coefficient p(X, Y) (between
-1 and 1)
Pitfalls
[Figure: plot of X against Y showing a large correlation]
27
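Distance metrics
Let $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$.
  • Euclidean distance: $d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
    -- Most commonly used distance
    -- Identical to the geometric distance in the
    multidimensional space.
  • Manhattan distance: $d(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$
    -- Sum of differences across dimensions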
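The two metric distances named on this slide can be written directly from their standard definitions. The function names and the two sample points below are hypothetical.

```python
import math

def euclidean(x, y):
    """Geometric (straight-line) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute differences across dimensions."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Two hypothetical expression profiles measured in two experiments.
x = [0.0, 0.0]
y = [3.0, 4.0]
print(euclidean(x, y), manhattan(x, y))
```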
28
K-means Algorithm
Arbitrarily partition the input points into K
clusters. Each cluster is represented by its
geometrical center. Repeatedly adjust the K
clusters by assigning each point to the nearest
cluster.
[Figure: input points and the initial partition, K = 3]
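The procedure described on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions: the initial partition is "point i goes to cluster i mod K", distances are Euclidean, and the function names and sample points are hypothetical.

```python
import math

def centroid(points):
    """Geometrical center (coordinate-wise mean) of a non-empty list of points."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def k_means(points, k, max_iter=100):
    # Arbitrary initial partition: point i starts in cluster i mod k.
    labels = [i % k for i in range(len(points))]
    for _ in range(max_iter):
        groups = [[p for p, lab in zip(points, labels) if lab == c] for c in range(k)]
        # Clusters that have emptied out are dropped in this sketch.
        centers = [centroid(g) for g in groups if g]
        # Reassign every point to the cluster with the nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # no point moved: converged
            break
        labels = new_labels
    return labels

# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = k_means(points, k=2)
print(labels)
```

The result depends on the initial partition, which is one reason K-means is usually run several times in practice.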
29
Hierarchical Clustering Algorithm
Input: Some data points.
Output: A set of clusters arranged in a tree, a
hierarchical structure.
What is the distance between clusters? Average
pairwise distance.
Each internal node corresponds to a cluster.
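A bottom-up (agglomerative) version with the average pairwise distance between clusters can be sketched as below. The slide does not say whether the tree is built bottom-up or top-down, so the agglomerative strategy is an assumption, as are the function names and the one-dimensional sample data.

```python
import math
from itertools import combinations

def average_distance(a, b):
    """Average pairwise Euclidean distance between two lists of points."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerate(named_points):
    """Merge the two closest clusters until one tree remains.

    Leaves of the returned tree are point names; each internal node is a
    pair (left_subtree, right_subtree) and corresponds to a cluster.
    """
    clusters = [(name, [coords]) for name, coords in named_points.items()]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]][1], clusters[ij[1]][1]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical genes with a single expression value each.
genes = {"g1": (0.0,), "g2": (1.0,), "g3": (10.0,), "g4": (11.0,)}
tree = agglomerate(genes)
print(tree)
```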
30
Identify Subtypes of
Diffuse Large B-Cell Lymphoma (DLBCL)
(Alizadeh et al., Nature, 2000)
  • A special cDNA microarray, the Lymphochip, was
    designed.
  • Study gene expression patterns in three lymphoid
    malignancies: DLBCL, FL and CLL.

12,069 cDNA clones from a germinal centre B-cell
library; 2,338 cDNA clones from libraries derived
from DLBCL, follicular lymphoma (FL), mantle
cell lymphoma, and chronic lymphocytic
leukaemia (CLL); 3,349 other cDNA clones.
96 normal and malignant lymphocyte samples.
31
Germinal centre B-like DLBCL vs Activated B-like
DLBCL
Courtesy Alizadeh
32
Germinal centre B-like DLBCL vs Activated B-like
DLBCL
International Prognostic Indicator
Courtesy Alizadeh
33
Remarks
  • Programmes designed to cluster data generally
    re-order the rows, or columns, or both, such
    that patterns of expression become visually
    apparent when presented in this fashion.
  • There might never be a best approach for
    clustering data. Different approaches allow
    different aspects of the data to be explored.
  • They are subjective. Different distance
    metrics will place different objects in
    different clusters.
  • Understanding the underlying biology,
    particularly of gene regulation, is important.

34
Research Problem
Bi-clustering: cluster genes and experiments at
the same time.
Why? Some genes are only co-regulated in a
subset of conditions (experiments).
References
Y. Kluger et al. Spectral Biclustering of
Microarray Data: Coclustering Genes and
Conditions. Genome Res. 13, 703-716.
L. Zhang and S. Zhu. A New Clustering Method for
Microarray Data Analysis. Proc. IEEE CSB 2002.
35
Molecular Class Prediction
  • Several supervised learning methods available
  • Neural Networks
  • Support Vector Machines
  • Decision trees
  • Other statistical methods

36
A Supervised Learning Method for Predicting a
Binary Class
[Figure: positive and negative examples feed the Learning step; the Prediction step answers Yes or No for a new item]
A class is just a concept! In the learning step,
the class is modelled as a mathematical object,
such as a function of multiple variables or a
subspace in a high-dimensional space,
representing knowledge of the class.
37
Learning the class of tall men
The class is modelled as the half space h > 63
Examples
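As a toy illustration of learning such a half space from labelled examples, the sketch below places the cut-off midway between the tallest negative example and the shortest positive one. The learning rule, the function names and the heights are all assumptions made for the example; the slide does not say how the threshold was obtained.

```python
def learn_height_threshold(examples):
    """examples: list of (height, is_tall) pairs. Returns a separating cut-off."""
    tallest_negative = max(h for h, is_tall in examples if not is_tall)
    shortest_positive = min(h for h, is_tall in examples if is_tall)
    if tallest_negative >= shortest_positive:
        raise ValueError("examples cannot be separated by a single threshold")
    return (tallest_negative + shortest_positive) / 2

# Hypothetical labelled examples (height, is_tall).
examples = [(60, False), (62, False), (64, True), (70, True)]
threshold = learn_height_threshold(examples)

def predict_tall(height):
    """A new item is in the class if it falls in the learned half space."""
    return height > threshold

print(threshold, predict_tall(65), predict_tall(61))
```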
38
Support Vector Machines
A support vector machine finds a hyperplane that
maximally separates data points into two classes
in the feature space.
?
39
Molecular Class Prediction-- Leukemia Case
Morphology does not distinguish leukemias very
well. Golub et al. (Science, 1999) proposed a
voting method for predicting acute
lymphoblastic leukemia (ALL) and acute myeloid
leukemia (AML) using gene expression
fingerprinting.
In that work, an Affymetrix DNA chip with 6817
genes was used for 72 ALL/AML samples.
40
The voting algorithm (Golub99)
Courtesy Golub
1. Select a subset of 2 × 25 genes highly
correlated with the ALL/AML distinction, based
on 38 training samples.
Correlation metric:
$$P(g) = \frac{\mu_{AML}(g) - \mu_{ALL}(g)}{\sigma_{AML}(g) + \sigma_{ALL}(g)}$$
where $\mu_{AML}(g)$ ($\mu_{ALL}(g)$) is the mean expression level of g in AML (ALL)
samples and $\sigma_{AML}(g)$ ($\sigma_{ALL}(g)$) is the within-class standard deviation of
expression of g in AML (ALL) samples.
2. Each selected gene casts a weighted vote for
a new sample; the total of the weighted
votes decides the winning class.
41
The voting method: Separating samples by
hyperplanes
Mathematically, the total of all the votes on a
new sample X is
$$V = \sum_{g} P(g) \left( x_g - \frac{\mu_{AML}(g) + \mu_{ALL}(g)}{2} \right)$$
where $x_g$ is the expression level of g in the new
sample X.
If V > 0, X is classified as AML; otherwise, X is
ALL.
[Figure: AML and ALL samples separated by a hyperplane]
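The weighted voting scheme can be sketched as follows, assuming the weight and decision boundary described by Golub et al. (1999): each gene's weight is its correlation metric, and its boundary is the midpoint of the two class means. The gene names, expression values and function names are hypothetical, and the gene selection step is skipped.

```python
from statistics import mean, stdev

def train(aml, all_):
    """aml, all_: dict mapping gene -> expression levels in the training
    samples of that class. Returns gene -> (weight, boundary)."""
    model = {}
    for gene in aml:
        mu_aml, mu_all = mean(aml[gene]), mean(all_[gene])
        weight = (mu_aml - mu_all) / (stdev(aml[gene]) + stdev(all_[gene]))
        boundary = (mu_aml + mu_all) / 2
        model[gene] = (weight, boundary)
    return model

def total_vote(model, sample):
    """Sum of weighted votes; this is a linear function of the sample."""
    return sum(w * (sample[g] - b) for g, (w, b) in model.items())

def classify(model, sample):
    return "AML" if total_vote(model, sample) > 0 else "ALL"

# Hypothetical training data for two already-selected genes.
aml_train = {"geneA": [8.0, 9.0, 10.0], "geneB": [1.0, 2.0, 3.0]}
all_train = {"geneA": [1.0, 2.0, 3.0], "geneB": [7.0, 8.0, 9.0]}
model = train(aml_train, all_train)

print(classify(model, {"geneA": 9.0, "geneB": 2.0}))
print(classify(model, {"geneA": 2.0, "geneB": 8.0}))
```

Because the total vote is linear in the expression levels, the decision V > 0 corresponds to one side of a hyperplane in gene expression space.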
42
Decision Tree Learning
  • Information-reduction learning method.
  • Representing a class or concept as a logic
    sentence, e.g.
    IF (Outlook = Sunny) AND (Humidity = Normal)
    THEN PlayTennis
  • When to use decision trees:
  • Instances describable by attribute-value pairs
  • Target function is discrete valued
  • Possibly noisy training data

Examples: medical diagnosis, credit risk analysis
43
Textbook Example: PlayTennis
  • Each internal node tests an attribute
  • Each branch has a value
  • Each leaf assigns a classification

44
Remarks
  • A decision tree is constructed by top-down
    induction.
  • Preference for short trees, and for those with
    high information gain attributes near the root.
  • Information is measured with entropy.
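Entropy and information gain, the quantities used to choose which attribute to test near the root, can be computed as in this sketch. The four-row data set is a hypothetical miniature in the spirit of PlayTennis, not the textbook table, and the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` obtained by splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training rows.
rows = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rain", "windy": "yes", "play": "yes"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
]
gain_outlook = information_gain(rows, "outlook", "play")
gain_windy = information_gain(rows, "windy", "play")
print(gain_outlook, gain_windy)
```

Here `outlook` separates the classes perfectly while `windy` tells nothing, so top-down induction would test `outlook` at the root.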

45
ALL vs AML - Decision Tree this time
(Y. Sun, tech report, MIT)
  • Single gene (zyxin), single branch tree
  • Tree size up to 3 genes

38/38 correct on training cases
31/34 correct on test cases, 3 errors
X5735_at < (81) 38 ALL
1 decision tree with 1 error
7 decision trees with 2 errors
7 decision trees with 3 errors
46
Gene Selection
Gene selection is critical in molecular class
prediction, as we learn from the decision tree
results. Why?
  • In a cellular process, only a relatively small
    set of genes is active.
  • Mathematically, each gene is just a feature.
  • The more weak features there are, the noisier
    the data.
  • More features raise the risk of overfitting.

Research Problem: How to select genes?
47
Two Approaches
1. Gene selection is done first, and the selected
genes are then used for learning, as in Golub et
al.'s paper.
2. Gene selection and learning are done
together, as in decision tree learning.
Does this make a difference in learning?
48
Discovery
49
(No Transcript)
50
BioCluster
[Figure labels: Microarray Data Sets, Similarity Measure, Cluster Number, Clustering Methods: Self-Organising Map (SOM), Hierarchical, K-Means]
51
Concluding remarks
  • Some previous work and our own work on analysing
    gene expression data are summarised.
  • Our group will focus on designing more efficient
    and sophisticated algorithms and software tools
    for mining and visualizing gene expression data.

52
Advantages of using arrays
A microarray contains up to 8000 genes or
probes, and hence it is not necessary to guess
in advance what the important genes or
mechanisms are. An array produces a broader,
more complete, less biased, genome-wide
expression profile.
53
Problems with Traditional Clustering Algorithms
1. They are not well suited to studying genes
with multiple functions or genes regulated by
multiple factors.
2. They cannot handle data errors or missing
values well. Errors or missing values often
occur when tissues are rare. Normalisation of
expression levels across different experiments
is also problematic.
54
Our Approach (ZZ00)
Let A be a gene expression matrix with gene set
X and experiment set Y, and let I ⊆ X and J ⊆ Y.
Then I and J specify a submatrix A(I, J). We
associate a score S_IJ(i, j) with each entry of
A(I, J).
A(I, J) is δ-smooth if S_IJ(i, j) ≤ δ for all
i ∈ I, j ∈ J.
55
Clustering Problems
  • Smooth Cluster Problem
    Instance: A gene expression matrix A with gene
    set X and experiment set Y, a subset J ⊆ Y,
    and a number δ ≥ 0.
    Question: Find a largest δ-smooth submatrix
    A(I, J).

To handle genes with multiple functions, we use
a known idea (Hartigan72, Cheng and Church00):
  • Smooth Bicluster Problem
    Instance: A gene expression matrix A with gene
    set X and experiment set Y, and a number δ ≥ 0.
    Question: Find I ⊆ X and J ⊆ Y with the largest
    min(|I|, |J|) such that A(I, J) is δ-smooth.

56
Greedy Algorithms for Smooth Cluster Problem
Top-Down Algorithm
Input: A gene expression matrix A with gene set X
and experiment set Y, a subset J ⊆ Y, a
number δ > 0.
Output: A set I ⊆ X such that A(I, J)
is δ-smooth under J.
Set I = X initially.
Repeat
    If A(I, J) is δ-smooth, stop.
    Select a row i ∈ I that is furthest from the
    cluster's center and remove it, that is,
    I = I - {i}.
End repeat.
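The Top-Down Algorithm can be sketched as below. The smoothness score is not given in this transcript, so the sketch substitutes a stand-in: the absolute residue |a_ij - (row mean) - (column mean) + (overall mean)| in the style of Cheng and Church. The cluster center is taken to be the mean row, and distance is Euclidean. These choices, the function names and the small matrix are assumptions for illustration only.

```python
import math

def max_residue(a, rows, cols):
    """Stand-in score (an assumption): the largest absolute residue
    |a_ij - row_mean_i - col_mean_j + overall_mean| over A(I, J)."""
    row_mean = {i: sum(a[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(a[i][j] for i in rows) / len(rows) for j in cols}
    overall = sum(row_mean.values()) / len(rows)
    return max(
        abs(a[i][j] - row_mean[i] - col_mean[j] + overall)
        for i in rows for j in cols
    )

def top_down(a, cols, delta):
    """Start from all rows and drop the row furthest from the cluster
    center until the submatrix is smooth (score <= delta everywhere)."""
    rows = list(range(len(a)))
    while len(rows) > 1 and max_residue(a, rows, cols) > delta:
        center = [sum(a[i][j] for i in rows) / len(rows) for j in cols]
        furthest = max(rows, key=lambda i: math.dist([a[i][j] for j in cols], center))
        rows.remove(furthest)
    return rows

# Hypothetical expression matrix: rows 0-2 follow one additive pattern,
# row 3 does not.
matrix = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0],
    [9.0, 1.0, 7.0],
]
cluster = top_down(matrix, cols=[0, 1, 2], delta=0.1)
print(cluster)
```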
57
Top-Down and Bottom-Up Algorithm
Input: A gene expression matrix A with gene set X
and experiment set Y, a subset J ⊆ Y, a
number δ > 0.
Output: A set I ⊆ X such that A(I, J)
is δ-smooth under J.
Set I = X initially.
Apply the Top-Down Algorithm first.
Repeat
    Select a row r ∈ X - I that is closest to the
    center of cluster I.
    If A(I ∪ {r}, J) is δ-smooth, I = I ∪ {r}.
End repeat.
58
Algorithm (Finding a given number of clusters)
Input: A gene expression matrix A with gene set X
and experiment set Y, a subset J ⊆ Y, a
number δ > 0, and n, the number of
δ-smooth clusters to be found.
Output: A set CS of n δ-smooth clusters under
the conditions in J.
I = X; CS = ∅ /* Output cluster set */
Repeat n times
    Run the Top-Down Algorithm on the set I of
    unselected genes to get a δ-smooth cluster C.
    Apply Bottom-Up on X to extend C.
    CS = CS ∪ {C}; I = I - C.
End repeat.
59
Experiments with the Yeast Data
The Data Set (Tavazoie et al. 99): 2884 genes,
17 conditions.
Experiments with:
  K-means algorithm with k = 30, Pearson
  coefficient.
  Our Algorithm 1 with smooth score δ = 50, which
  output over a hundred 50-smooth clusters.
60
Clusters from our algorithm have strong patterns.
The 22nd cluster from K-means corresponds
roughly to 12 smooth clusters from our
algorithm.
61
More Smooth Clusters
62
Characterization of Our Algorithms
Our algorithm first clusters all low-fluctuating
genes or noise into one or two clusters, while
the K-means algorithm assigns these genes to
many clusters.
First two clusters from our algorithm
63
Comparison with functional categories
Our approach is systematic and blind to knowledge
of yeast. However, there is significant grouping
of genes within the same functional category in
many of the discovered smooth clusters.
64
Performance evaluation: Test on DLBCL case