Title: CZ5225: Modeling and Simulation in Biology Lecture 2: Gene Expression Profiles and Microarray Data Analysis Prof. Chen Yu Zong Tel: 6874-6877 Email: yzchen@cz3.nus.edu.sg http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, NUS
2 Biology and Cells
- All living organisms consist of cells.
- Humans have trillions of cells; yeast has one cell.
- Cells are of many different types (blood, skin, nerve), but all arose from a single cell (the fertilized egg).
- Each cell contains a complete copy of the genome (the program for making the organism), encoded in DNA.
3 DNA
- DNA molecules are long double-stranded chains; 4 types of bases are attached to the backbone: adenine (A), guanine (G), cytosine (C), and thymine (T). A pairs with T, C with G.
- A gene is a segment of DNA that specifies how to make a protein.
- Human DNA has about 25-35K genes; rice has about 50-60K genes, but they are shorter.
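The pairing rule above (A with T, C with G) can be sketched in a few lines of Python. This is only an illustration: the function names and the example sequence are hypothetical and not part of the lecture.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIR[base] for base in strand.upper())

def reverse_complement(strand):
    """Return the partner strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]

example = "ATGCGT"  # hypothetical sequence, for illustration only
print(complement(example))          # TACGCA
print(reverse_complement(example))  # ACGCAT
```

Taking the complement twice returns the original strand, which is why one strand is enough to reconstruct the other.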
4 Exons and Introns
- Exons are coding DNA (translated into a protein); they make up only about 2% of the human genome.
- Introns are non-coding DNA, which provide structural integrity and regulatory (control) functions.
- Exons can be thought of as program data, while introns provide the program logic.
- Humans have much more control structure than rice.
5 Gene Expression
- Cells are different because of differential gene expression.
- About 40% of human genes are expressed at one time.
- A gene is expressed by transcribing DNA into single-stranded mRNA.
- mRNA is later translated into a protein.
- Microarrays measure the level of mRNA expression.
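The two steps above (DNA transcribed into mRNA, mRNA translated into protein) can be sketched as follows. This is a minimal illustration under stated assumptions: the input is taken to be the coding strand, the codon table is a small subset of the standard genetic code, and the example gene is hypothetical.

```python
# Small subset of the standard genetic code (assumption: enough for the demo only).
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "GGC": "G", "GCU": "A", "AAA": "K", "UGG": "W",
    "UAA": "*", "UAG": "*", "UGA": "*",  # "*" marks a stop codon
}

def transcribe(coding_dna):
    """Transcription: the mRNA matches the coding strand with T replaced by U."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna):
    """Translation: read codons (3 bases) until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")  # "?" = codon not in the subset
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

gene = "ATGTTTGGCAAATAA"  # hypothetical coding sequence
mrna = transcribe(gene)
print(mrna)             # AUGUUUGGCAAAUAA
print(translate(mrna))  # MFGK
```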
6 Molecular Biology Overview
[Figure labels: cell, nucleus, chromosome, gene (DNA), gene (mRNA, single strand), cDNA, protein]
7 Gene Expression
- Genes control cell behavior by controlling which proteins are made by a cell.
- Housekeeping genes vs. cell/tissue-specific genes
- Regulation
- Transcriptional (promoters and enhancers)
- Post-transcriptional (RNA splicing, stability, localization, small non-coding RNAs)
8 Gene Expression
- Regulation
- Translational (3' UTR repressors, poly-A tail)
- Post-transcriptional (RNA splicing, stability, localization, small non-coding RNAs)
- Post-translational (protein modification: carbohydrates, lipids, phosphorylation, hydroxylation, methylation, precursor protein)
9 Gene Expression Measurement
- mRNA expression represents dynamic aspects of the cell.
- mRNA expression can be measured with the latest technology.
- mRNA is isolated and labeled with a fluorescent dye.
- mRNA is hybridized to the target; the level of hybridization corresponds to light emission, which is measured with a laser.
10 Traditional Methods
- Northern Blotting
- Single RNA isolated
- Probed with labeled cDNA
- RT-PCR
- Primers amplify specific cDNA transcripts
11 Microarray Technology
- Microarray
- New technology (first paper 1995)
- Allows study of thousands of genes at the same time
- Glass slide of DNA molecules
- Molecule: string of bases (25 bp to 500 bp)
- Uniquely identifies the gene or unit to be studied
12 Gene Expression Microarrays
- The main types of gene expression microarrays
- Short oligonucleotide arrays (Affymetrix)
- cDNA or spotted arrays (Brown/Botstein).
- Long oligonucleotide arrays (Agilent Inkjet)
- Fiber-optic arrays
- ...
13 Fabrication of Microarrays
- Size of a microscope slide
Images: http://www.affymetrix.com/
14 Differing Conditions
- Ultimate goal
- Understand the expression level of genes under different conditions
- Helps to
- Determine genes involved in a disease
- Pathways to a disease
- Used as a screening tool
15 Gene Conditions
- Cell types (brain vs. liver)
- Developmental (fetal vs. adult)
- Response to stimulus
- Gene activity (wild vs. mutant)
- Disease states (healthy vs. diseased)
16 Expressed Genes
- Genes under a given condition
- mRNA extracted from cells
- mRNA labeled
- Labeled mRNA is the mRNA present in a given condition
- Labeled mRNA will hybridize (base pair) with the corresponding sequence on the slide
17 Two Different Types of Microarrays
- Custom spotted arrays (up to 20,000 sequences)
- cDNA
- Oligonucleotide
- High-density (up to 100,000 sequences) synthetic oligonucleotide arrays
- Affymetrix (25 bases)
- SHOW AFFYMETRIX LAYOUT
18 Custom Arrays
- Mostly cDNA arrays
- 2-dye (2-channel)
- RNA from two sources (cDNA created)
- Source 1 labeled with red dye
- Source 2 labeled with green dye
19 Two Channel Microarrays
- Microarrays measure gene expression
- Two different samples
- Control (green label)
- Sample (red label)
- Both are washed over the microarray
- Hybridization occurs
- Each spot is one of 4 colors
20 Microarray Technology
21 Microarray Image Analysis
- Microarrays detect gene interactions: 4 colors
- Green: high control
- Red: high sample
- Yellow: equal
- Black: none
- Problem is to quantify image signals
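One way to make the four-color reading quantitative is to compare the red (sample) and green (control) intensities of each spot. The sketch below is only illustrative: the function name, the background cutoff, and the two-fold threshold are assumptions, not values from the lecture.

```python
import math

def spot_color(red, green, background=100.0, fold=2.0):
    """Classify one spot from its red (sample) and green (control) intensities.

    background and fold are illustrative thresholds (assumptions).
    """
    if red < background and green < background:
        return "black"   # neither channel hybridized
    log_ratio = math.log2(max(red, 1.0) / max(green, 1.0))
    if log_ratio >= math.log2(fold):
        return "red"     # higher in the sample
    if log_ratio <= -math.log2(fold):
        return "green"   # higher in the control
    return "yellow"      # roughly equal in both

# Hypothetical spot intensities (red, green)
for red, green in [(5000, 500), (500, 5000), (3000, 2800), (20, 30)]:
    print(red, green, spot_color(red, green))
```

Working with the log ratio rather than the raw ratio treats two-fold increases and two-fold decreases symmetrically.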
22 Single Color Microarrays
- Prefabricated
- Affymetrix (25mers)
- Custom
- cDNA (500 bases or so)
- Spotted oligos (70-80 bases)
23 Microarray Animations
- Davidson College
- http://www.bio.davidson.edu/courses/genomics/chip/chip.html
- Imagecyte
- http://www.imagecyte.com/array2.html
24 Basic idea of Microarray
- Construction
- Place an array of probes on a microchip
- A probe is (for example) an oligonucleotide 25 bases long that characterizes a gene or genome
- Each probe has many, many clones
- Chip is about 2 cm by 2 cm
- Application principle
- Put a (liquid) sample containing genes on the microarray, allow probe and gene sequences to hybridize, and wash away the rest
- Analyze the hybridization pattern
25 Microarray analysis
Operation principle: samples are tagged with fluorescent material to show the pattern of sample-probe interaction (hybridization). A microarray may have 60K probes.
27 Gene Expression Data
- Gene expression data on p genes for n mRNA samples (rows: genes; columns: mRNA samples)

gene   sample1  sample2  sample3  sample4  sample5  ...
1        0.46     0.30     0.80     1.51     0.90   ...
2       -0.10     0.49     0.24     0.06     0.46   ...
3        0.15     0.74     0.04     0.10     0.20   ...
4       -0.45    -1.03    -0.79    -0.56    -0.32   ...
5       -0.06     1.06     1.35     1.09    -1.09   ...

Each entry is the gene expression level of gene i in mRNA sample j:
- Log (Red intensity / Green intensity) for two-color arrays
- Log (Avg. PM - Avg. MM) for Affymetrix arrays (PM: perfect match probes, MM: mismatch probes)
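The two expression measures named above can be computed as in the following sketch. The slide does not state the base of the logarithm, so base 2 is an assumption here; the function names and all intensity values are hypothetical.

```python
import math

def two_channel_log_ratio(red, green):
    """Log (Red intensity / Green intensity) for one spot; base 2 is an assumption."""
    return math.log2(red / green)

def affymetrix_log_signal(pm, mm):
    """Log (Avg. PM - Avg. MM) for one probe set.

    Returns None when the difference is not positive, since the log is undefined there.
    """
    diff = sum(pm) / len(pm) - sum(mm) / len(mm)
    return math.log2(diff) if diff > 0 else None

# Hypothetical intensities: rows are genes, columns are samples.
red = [[400.0, 250.0], [90.0, 300.0]]
green = [[100.0, 250.0], [180.0, 75.0]]
expression = [
    [two_channel_log_ratio(r, g) for r, g in zip(red_row, green_row)]
    for red_row, green_row in zip(red, green)
]
print(expression)
print(affymetrix_log_signal([1200.0, 1000.0, 1100.0], [100.0, 60.0, 68.0]))
```

The result is a p x n matrix like the one in the table: positive entries mean higher expression in the red channel, negative entries higher expression in the green channel.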
28 Some possible applications
- Sample from a specific organ to show which genes are expressed
- Compare samples from healthy and sick hosts to find gene-disease connections
- Probes are sets of human pathogens for disease detection
29 Huge amount of data from a single microarray
- With just two colors, the amount of data on an array with N probes is 2N
- Cannot analyze pixel by pixel
- Analyze by pattern: cluster analysis
30 Major Data Mining Techniques
- Link Analysis
- Associations Discovery
- Sequential Pattern Discovery
- Similar Time Series Discovery
- Predictive Modeling
- Classification
- Clustering
31 Cluster Analysis: Grouping Similarly Expressed Genes, Cell Samples, or Both
- Strengthens signal when averages are taken within clusters of genes (Eisen)
- Useful (essential?) when seeking new subclasses of cells, tumours, etc.
- Leads to readily interpreted figures
32 Some clustering methods and software
- Partitioning: K-Means, K-Medoids, PAM, CLARA
- Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK
- Density-based: CAST, DBSCAN, OPTICS, CLIQUE
- Grid-based: STING, CLIQUE, WaveCluster
- Model-based: SOM (self-organizing map), COBWEB, CLASSIT, AutoClass
- Two-way clustering
- Block clustering
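As an illustration of the partitioning family, here is a minimal K-Means sketch in plain Python. It is not the implementation of any package listed above: the function names are hypothetical, the gene profiles are small illustrative values, and starting from the first k points is a simplifying assumption.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and center-update steps."""
    # Naive deterministic start (an assumption): the first k points are the centers.
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Illustrative profiles: one row per gene, one value per sample.
genes = ["gene1", "gene2", "gene3", "gene4", "gene1000"]
profiles = [
    [0.4, 0.2, 0.6],
    [0.9, 0.8, 0.2],
    [0.0, 0.3, 0.0],
    [0.5, 0.2, 0.7],
    [0.8, 0.7, 0.3],
]
labels, centers = kmeans(profiles, k=2)
print(dict(zip(genes, labels)))
```

In practice the result depends on the starting centers, so K-Means is usually run from several random starts.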
33 Assessment of various methods
- Algorithmic Approaches to Clustering Gene Expression Data, Ron Shamir, School of Computer Science, Tel-Aviv University, Tel-Aviv
- http://citeseer.nj.nec.com/shamir01algorithmic.html
- Conclusion: hierarchical clustering exceptional
34 Partitioning
35 Density-based clustering
36 Hierarchical (used most often)
37 Hierarchical Clustering: grouping similarly expressed genes
Gene Expression Profile Analysis (rows: samples; columns: genes)

gene        1     2     3     4    ..   ..   1000
sample A    0.4   0.9   0     0.5  ..   ..   0.8
sample B    0.2   0.8   0.3   0.2  ..   ..   0.7
sample C    0.6   0.2   0     0.7  ..   ..   0.3
.
38 After Clustering
Gene Expression Profile Analysis (rows: samples; columns: genes, reordered so that similar genes are adjacent)

gene        ..   3     1     4    ..   2     1000
sample A    ..   0     0.4   0.5  ..   0.9   0.8
sample B    ..   0.3   0.2   0.2  ..   0.8   0.7
sample C    ..   0     0.6   0.7  ..   0.2   0.3
.
39 [Figure: clustered data compared with data randomized by row, by column, and by both; axis: time. Source: Eisen et al., Proc. Natl. Acad. Sci. USA 95 (1998)]
40 Types of Similarity Measurements
- Distance measurements
- Correlation coefficients
- Association coefficients
- Probabilistic similarity coefficients
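41 Correlation Coefficients
- The most popular correlation coefficient is the Pearson correlation coefficient (1892)
- Correlation between X = {X_1, X_2, ..., X_n} and Y = {Y_1, Y_2, ..., Y_n}:
$$s_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
- where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$
- s_XY is the similarity between X and Y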
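The Pearson correlation between two expression profiles (the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations) can be computed directly. The sketch below is a plain-Python illustration; the function name and the example profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation s_XY between two equal-length profiles.

    Raises ZeroDivisionError if either profile is constant (zero variance).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical expression profiles of three genes across four samples
gene_a = [0.1, 0.4, 0.8, 1.2]
gene_b = [0.2, 0.8, 1.6, 2.4]   # same shape as gene_a, twice the amplitude
gene_c = [1.2, 0.8, 0.4, 0.1]   # opposite trend

print(pearson(gene_a, gene_b))  # close to 1
print(pearson(gene_a, gene_c))  # close to -1
```

Because the measure ignores scale and offset, genes with the same expression pattern at different amplitudes are treated as similar.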
42 Use of Similarity for Tree Construction
- Normalize similarity so that s_XX = 1
- Then have an n x n similarity matrix S whose diagonal elements are 1
- Define a distance matrix by (for example)
- D = 1 - S
- Diagonal elements of D are 0
- Now use the distance matrix to build a tree (using some tree-building software; recall the lecture on Phylogeny)
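Converting a normalized similarity matrix into a distance matrix by D = 1 - S is a one-line operation. In this sketch the similarity values are hypothetical and the function name is an assumption.

```python
def distance_from_similarity(S):
    """Element-wise D = 1 - S for a similarity matrix with ones on the diagonal."""
    return [[1.0 - s for s in row] for row in S]

# Hypothetical 3 x 3 similarity matrix (symmetric, diagonal elements equal to 1)
S = [
    [1.0, 0.8, -0.2],
    [0.8, 1.0, 0.1],
    [-0.2, 0.1, 1.0],
]
D = distance_from_similarity(S)
for row in D:
    print([round(d, 2) for d in row])
```

With correlation-based similarities in [-1, 1], the distances fall in [0, 2]: perfectly correlated genes are at distance 0 and perfectly anti-correlated genes at distance 2.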
43 A dendrogram (tree) for clustered genes
- Let p = number of genes.
- 1. Calculate within-class correlation.
- 2. Perform hierarchical clustering, which will produce (2p-1) clusters of genes.
- 3. Average within clusters of genes.
- 4. Perform testing on averages of clusters of genes as if they were single genes.
E.g. p = 5: the leaves of the tree are genes 1, 2, 3, 4, 5 (clusters 1 to 5), and the merges add
Cluster 6 = (1,2)
Cluster 7 = (1,2,3)
Cluster 8 = (4,5)
Cluster 9 = (1,2,3,4,5)
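The count of 2p-1 clusters follows because agglomerative clustering starts with p single-gene clusters and each of the p-1 merges creates one new cluster. The sketch below uses average linkage, which is an assumption (the slide does not name a linkage rule), with a hypothetical distance matrix chosen so that the merges reproduce the p = 5 example.

```python
def agglomerate(D):
    """Average-linkage agglomerative clustering on a distance matrix D.

    Returns a dict mapping cluster id -> tuple of member genes (1-based).
    Ids 1..p are the single genes; ids p+1..2p-1 are the merged clusters.
    """
    p = len(D)
    clusters = {i + 1: (i + 1,) for i in range(p)}
    active = list(range(1, p + 1))
    next_id = p + 1
    while len(active) > 1:
        best = None
        for ai in range(len(active)):
            for bi in range(ai + 1, len(active)):
                a, b = active[ai], active[bi]
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                dist = sum(D[i - 1][j - 1] for i, j in pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[next_id] = tuple(sorted(clusters[a] + clusters[b]))
        active = [c for c in active if c not in (a, b)] + [next_id]
        next_id += 1
    return clusters

# Hypothetical distances between 5 genes (symmetric, zero diagonal)
D = [
    [0.0, 0.1, 0.2, 0.9, 0.9],
    [0.1, 0.0, 0.2, 0.9, 0.9],
    [0.2, 0.2, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.3],
    [0.9, 0.9, 0.9, 0.3, 0.0],
]
clusters = agglomerate(D)
for cluster_id in sorted(clusters):
    print(cluster_id, clusters[cluster_id])
```

Step 3 of the procedure would then average the expression profiles of the genes listed in each cluster.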
44 A real case
Nature, Feb. 2000: paper by Alizadeh, A. et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling"
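45 Validation Techniques: Hubert's Γ Statistic
- X = [X(i, j)] and Y = [Y(i, j)] are two n x n matrices
- X(i, j): similarity of gene i and gene j
- Y(i, j) = 1 if genes i and j are in the same cluster, 0 otherwise
- Hubert's Γ statistic represents the point serial correlation between X and Y:
$$\Gamma = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{X(i,j) - \bar{X}}{\sigma_X} \cdot \frac{Y(i,j) - \bar{Y}}{\sigma_Y}$$
- where M = n (n - 1) / 2, and $\bar{X}$, $\bar{Y}$, $\sigma_X$, $\sigma_Y$ are the means and standard deviations of the entries of X and Y over the M gene pairs
- A higher value of Γ represents better clustering quality.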
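A sketch of this validation measure, assuming the normalized form of Hubert's Γ: the Pearson correlation, over the M = n(n-1)/2 gene pairs, between the similarity X(i, j) and the indicator Y(i, j) of genes i and j sharing a cluster. The function name and the similarity values are hypothetical.

```python
import math

def hubert_gamma(X, labels):
    """Normalized Hubert's Gamma (assumed form) for a clustering given as labels."""
    n = len(X)
    xs, ys = [], []
    for i in range(n - 1):
        for j in range(i + 1, n):
            xs.append(X[i][j])
            ys.append(1.0 if labels[i] == labels[j] else 0.0)
    M = len(xs)  # equals n * (n - 1) / 2
    mean_x, mean_y = sum(xs) / M, sum(ys) / M
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / M)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / M)
    # Undefined (division by zero) if all pairs share a cluster or no pair does.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (M * sd_x * sd_y)

# Hypothetical similarities: genes 0,1 are alike and genes 2,3 are alike.
X = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
good = hubert_gamma(X, [0, 0, 1, 1])  # clusters match the similarity structure
bad = hubert_gamma(X, [0, 1, 0, 1])   # clusters cut across it
print(good, bad)
```

The clustering that agrees with the similarity structure scores higher, which is the sense in which a higher Γ indicates better clustering quality.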
46 Discovering sub-groups
47 Gene Expression is Time-Dependent
Time Course Data
48 Sample of time course of clustered genes
49 Limitations
- Cluster analyses
- Usually outside the normal framework of statistical inference
- Less appropriate when only a few genes are likely to change
- Needs lots of experiments
- Single gene tests
- May be too noisy in general to show much
- May not reveal coordinated effects of positively correlated genes
- Hard to relate to pathways
50 Useful Links
- Affymetrix: www.affymetrix.com
- Michael Eisen Lab at LBL (hierarchical clustering software Cluster and TreeView (Windows)): rana.lbl.gov/
- Review of Currently Available Microarray Software: www.the-scientist.com/yr2001/apr/profile1_010430.html
- ArrayExpress at the EBI: http://www.ebi.ac.uk/arrayexpress/
- Stanford MicroArray Database: http://genome-www5.stanford.edu/
- Yale Microarray Database: http://info.med.yale.edu/microarray/
- Microarray DB: www.biologie.ens.fr/en/genetiqu/puces/bddeng.html