Title: Bioinformatics approaches
1Bioinformatics approaches to gene expression
2Gene expression is regulated in several basic ways
- by region (e.g. brain versus kidney)
- in development (e.g. fetal versus adult tissue)
- in dynamic response to environmental signals (e.g. immediate-early response genes)
- in disease states
- by gene activity
3Gene expression changes measured...
Organism: virus, bacteria, fungi, invertebrates, rodents, human
In mutant or wildtype cells
Development
Cell types
Disease
In virus, bacteria, and/or host
In response to stimuli
4[Figure labels: DNA, RNA, protein, phenotype, cDNA]
5[Figure labels: DNA, RNA, cDNA, protein; approaches shown: UniGene, SAGE, microarray]
6[Figure: a gene (5' to 3') with exon 1, exon 2, and exon 3 separated by two introns. Steps shown: transcription; RNA splicing (remove introns); polyadenylation (AAAAA at the 3' end); export to cytoplasm]
7Analysis of gene expression in cDNA libraries
A fundamental approach to studying gene expression is through cDNA libraries.
- Isolate RNA (always from a specific organism, region, and time point)
- Convert RNA to complementary DNA
- Subclone into a vector
- Sequence the cDNA inserts. These are expressed sequence tags (ESTs)
[Figure labels: insert, vector]
8UniGene: unique genes via ESTs
- Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
- UniGene clusters contain many ESTs
- UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.
9Pitfalls in interpreting cDNA library data
- bias in library construction
- variable depth of sequencing
- library normalization
- error rate in sequencing
- contamination (chimeric sequences)
10Serial analysis of gene expression (SAGE)
- 9 to 11 base tags correspond to genes
- measure of gene expression in different biological samples
- SAGE tags can be compared electronically
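Comparing SAGE tags electronically comes down to counting how often each tag occurs in each sample. A minimal sketch, assuming hypothetical 10-base tags and two made-up samples (the sequences and counts are illustrative, not real data):

```python
from collections import Counter

def tag_frequencies(tags):
    """Return each tag's share of all tags sequenced in one sample."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

# Hypothetical 10-base tags from two samples (illustrative only)
sample_a = ["AAACCCGGGT", "AAACCCGGGT", "AAACCCGGGT", "TTTGGGCCCA"]
sample_b = ["AAACCCGGGT", "TTTGGGCCCA", "TTTGGGCCCA", "TTTGGGCCCA"]

freq_a = tag_frequencies(sample_a)
freq_b = tag_frequencies(sample_b)

# Print each tag with its frequency in sample A and sample B
for tag in sorted(set(freq_a) | set(freq_b)):
    print(tag, freq_a.get(tag, 0.0), freq_b.get(tag, 0.0))
```

Dividing by the total number of tags makes samples sequenced to different depths comparable.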
12Microarrays: tools for gene expression
A microarray is a solid support (such as a
membrane or glass microscope slide) on which DNA
of known sequence is deposited in a grid-like
array. RNA is isolated from matched samples of
interest. The RNA is typically converted to cDNA,
labeled with fluorescence (or radioactivity),
then hybridized to microarrays in order to
measure the expression levels of thousands of
genes.
13Questions addressed using microarrays
- Wildtype versus mutant
- Cultured cells +/- drug
- Physiological states (hibernation, cell polarity formation)
- Normal versus diseased tissue (cancer, autism)
14Organisms represented on microarrays
metazoans: human, mouse, rat, worm, insect
fungi: yeast
plants: Arabidopsis
other: bacteria, viruses
15Advantages of microarray experiments
Fast: Data on 15,000 genes in 1-4 weeks
Comprehensive: Entire genome on a chip
Flexible: As more genomes are sequenced, more arrays can be made. Custom arrays can be made to represent genes of interest
Easy: You can submit RNA samples to a core facility for analysis
Cheap?: Chip representing 15,000 genes for $350; robotic spotter/scanner cost $100,000
16Disadvantages of microarray experiments
Cost: Many researchers can't afford to do appropriate controls, replicates
RNA significance: The final product of gene expression is protein
Quality control: Impossible to assess elements on array surface; artifacts with image analysis; artifacts with data analysis
17Sample acquisition
RNA: purify, label
Data acquisition
Microarray: hybridize, wash, image
Data analysis
Data confirmation
Biological insight
20Stage 1: Experimental design
1. Biological samples: technical and biological replicates
2. RNA extraction, conversion, labeling, hybridization
3. Arrangement of array elements on a surface
21Sample 1
Sample 2
Sample 3
22Samples 1,2
Samples 1,3
Samples 2,3
Samples 2,1 (switch dyes)
Sample 1, pool
Sample 2, pool
23Stage 2: RNA preparation
- For Affymetrix chips, need total RNA (about 10 µg)
- Confirm purity by running agarose gel
- Measure A260/A280 to confirm purity, quantity
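The A260/A280 check can be sketched as simple arithmetic. This assumes the common conventions that an A260 of 1.0 corresponds to roughly 40 µg/mL of RNA and that a ratio near 2.0 suggests pure RNA. The readings and the function names are hypothetical.

```python
# Assumption: A260 of 1.0 corresponds to about 40 ug/mL of RNA (common convention)
RNA_UG_PER_ML_PER_A260 = 40.0

def rna_purity_ratio(a260, a280):
    """A260/A280 ratio; values near 2.0 are usually taken to indicate pure RNA."""
    return a260 / a280

def rna_concentration_ug_per_ml(a260, dilution_factor=1.0):
    """Estimate RNA concentration from absorbance at 260 nm."""
    return a260 * RNA_UG_PER_ML_PER_A260 * dilution_factor

# Hypothetical spectrophotometer readings for a 1:10 dilution
a260, a280 = 0.50, 0.25
ratio = rna_purity_ratio(a260, a280)
concentration = rna_concentration_ug_per_ml(a260, dilution_factor=10)
print(f"A260/A280 = {ratio:.2f}, concentration = {concentration:.0f} ug/mL")
```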
24Stage 3: hybridization to DNA arrays
- The array consists of cDNA or oligonucleotides
- Oligonucleotides can be deposited by photolithography
- The sample is converted to cRNA or cDNA
25Microarrays: array surface
26Stage 4: Image analysis
- RNA expression levels are quantitated
- Fluorescence intensity is measured with a scanner, or radioactivity with a phosphorimager
27Differential Gene Expression on a cDNA Microarray
b-Crystallin is over-expressed in Rett Syndrome
[Figure panels: Control, Rett]
32Stage 5: Data analysis
- How can arrays be compared?
- Which genes are regulated?
- Are differences authentic?
- What are the criteria for statistical significance?
- Are there meaningful patterns in the data (such as groups)?
33Microarray data analysis
preprocessing
inferential statistics
exploratory statistics
34Microarray data analysis
preprocessing: global normalization, local normalization, scatter plots
inferential statistics: t-tests
exploratory statistics: clustering
35Matrix of genes versus samples
Metric (define distance)
principal components analysis
clustering: trees (hierarchical, k-means)
supervised, unsupervised analyses
self-organizing maps
36Stage 6: Biological confirmation
Microarray experiments can be thought of as hypothesis-generating experiments. The differential up- or down-regulation of specific genes can be measured using independent assays such as
- Northern blots
- polymerase chain reaction (RT-PCR)
- in situ hybridization
37Stage 7: Microarray databases
There are two main repositories:
- Gene Expression Omnibus (GEO) at NCBI
- ArrayExpress at the European Bioinformatics Institute (EBI)
38Gene expression omnibus (GEO)
NCBI repository for gene expression data
41http://www.dnachip.org
Page 183
42Microarrays: web resources
- Many links on Leming Shi's page: http://www.gene-chips.com
- Stanford Microarray Database: http://www.dnachip.org
- Links at http://pevsnerlab.kennedykrieger.org/
43Microarray data analysis
begin with a data matrix (gene expression
values versus samples)
44Microarray data analysis
begin with a data matrix (gene expression
values versus samples)
Typically, there are many genes (>> 10,000) and few samples (~10)
45Microarray data analysis
begin with a data matrix (gene expression
values versus samples)
Preprocessing
Inferential statistics
Descriptive statistics
46Microarray data analysis: preprocessing
Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as
- different labeling efficiencies of Cy3, Cy5
- uneven spotting of DNA onto an array surface
- variations in RNA purity or quantity
- variations in washing efficiency
- variations in scanning efficiency
47Microarray data analysis: preprocessing
The main goal of data preprocessing is to
remove the systematic bias in the data as
completely as possible, while preserving the
variation in gene expression that occurs because
of biologically relevant changes in
transcription. A basic assumption of most
normalization procedures is that the average gene
expression level does not change in an
experiment.
48Data analysis: global normalization
Global normalization is used to correct two or more data sets. In one common scenario, samples are labeled with Cy3 (green dye) or Cy5 (red dye) and hybridized to DNA elements on a microarray. After washing, probes are excited with a laser and detected with a scanning confocal microscope.
49Data analysis: global normalization
Global normalization is used to correct two or more data sets.
Example: total fluorescence in
- Cy3 channel: 4 million units
- Cy5 channel: 2 million units
Then the uncorrected ratio for a gene could show 2,000 units versus 1,000 units. This would artifactually appear to show 2-fold regulation.
50Data analysis: global normalization
Global normalization procedure
Step 1: subtract background intensity values (use a blank region of the array)
Step 2: globally normalize so that the average ratio = 1 (apply this to 1-channel or 2-channel data sets)
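The two steps can be sketched for 2-channel data as follows. This is one simple variant, assuming a single background value per channel and rescaling Cy5 so that the two channel totals match. The function name and the intensities are hypothetical.

```python
def global_normalize(cy3, cy5, background_cy3=0.0, background_cy5=0.0):
    """Step 1: subtract background. Step 2: rescale Cy5 so channel totals match."""
    cy3_corrected = [max(v - background_cy3, 0.0) for v in cy3]
    cy5_corrected = [max(v - background_cy5, 0.0) for v in cy5]
    scale = sum(cy3_corrected) / sum(cy5_corrected)
    cy5_scaled = [v * scale for v in cy5_corrected]
    return cy3_corrected, cy5_scaled

# Hypothetical intensities: the Cy3 channel is twice as bright overall
cy3 = [2000.0, 4000.0, 6000.0]
cy5 = [1000.0, 2000.0, 3000.0]

cy3_norm, cy5_norm = global_normalize(cy3, cy5)
ratios = [g / r for g, r in zip(cy3_norm, cy5_norm)]
print(ratios)
```

Before normalization the first gene reads 2,000 versus 1,000 units, an apparent 2-fold change. After rescaling, its ratio is 1, so the difference is attributed to the channels rather than to transcription.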
51Microarray data preprocessing
Some researchers use housekeeping genes for global normalization.
Visit the Human Gene Expression (HuGE) Index: www.HugeIndex.org
52Scatter plots
- Useful to represent gene expression values from two microarray experiments (e.g. control, experimental)
- Each dot corresponds to a gene expression value
- Most dots fall along a line
- Outliers represent up-regulated or down-regulated genes
53Scatter plot analysis of microarray data
54Differential Gene Expression in Different Tissue
and Cell Types
[Figure panels: Fibroblast, Brain, Astrocyte, Astrocyte]
55[Figure: scatter plot with axes Expression level (sample 1) and Expression level (sample 2), each running from low to high expression level; up- and down-regulated genes are marked]
56Log-log transformation
57Scatter plots
Typically, data are plotted on log-log coordinates. Visually, this spreads out the data, which are otherwise concentrated in one region of the plot.

time    behavior      raw ratio value   log2 ratio value
t=0     basal         1.0               0.0
t=1h    no change     1.0               0.0
t=2h    2-fold up     2.0               1.0
t=3h    2-fold down   0.5               -1.0
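The raw and log2 ratio values above can be reproduced in a few lines. A minimal sketch, using the four time points from this slide (the dictionary keys are just labels):

```python
import math

# Raw expression ratios at each time point, as listed on the slide
raw_ratios = {"t=0": 1.0, "t=1h": 1.0, "t=2h": 2.0, "t=3h": 0.5}

# log2 makes 2-fold up (+1.0) and 2-fold down (-1.0) symmetric around 0
log2_ratios = {time: math.log2(ratio) for time, ratio in raw_ratios.items()}

for time, value in log2_ratios.items():
    print(time, raw_ratios[time], value)
```

On the raw scale, 2-fold up and 2-fold down sit at unequal distances from 1.0 (2.0 versus 0.5). On the log2 scale they are equally far from 0.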
58[Figure: log ratio plotted against mean log intensity (expression level, low to high); up- and down-regulated genes are marked]
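The two plotted quantities can be computed per gene from a pair of intensities. A minimal sketch, assuming base-2 logarithms and hypothetical intensities for an experimental (exp) and a control (con) sample:

```python
import math

def log_ratio_and_mean_intensity(exp, con):
    """Return (log2 ratio, mean log2 intensity) for one gene."""
    log_ratio = math.log2(exp / con)
    mean_log_intensity = (math.log2(exp) + math.log2(con)) / 2
    return log_ratio, mean_log_intensity

# Hypothetical (experimental, control) intensities
genes = {
    "gene_a": (1024.0, 256.0),  # 4-fold up, high intensity
    "gene_b": (64.0, 64.0),     # unchanged
    "gene_c": (16.0, 64.0),     # 4-fold down, low intensity
}

points = {name: log_ratio_and_mean_intensity(exp, con)
          for name, (exp, con) in genes.items()}

for name, (log_ratio, mean_log) in points.items():
    print(name, log_ratio, mean_log)
```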
59SNOMAD converts array data to scatter plots
http://snomad.org
[Figure: linear-linear plot and log-log plot of EXP versus CON, each with 2-fold lines; plot of Log10 (Ratio) versus Mean (Log10 (Intensity)), with 2-fold lines separating EXP > CON from EXP < CON]
60SNOMAD corrects local variance artifacts
[Figure: Log10 (Ratio) versus Mean (Log10 (Intensity)) with a robust local regression fit and a residual marked; Corrected Log10 (Ratio) (residuals) versus Mean (Log10 (Intensity)); 2-fold lines separate EXP > CON from EXP < CON]
61Inferential statistics
Inferential statistics are used to make inferences about a population from a sample. Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as "There is no difference in signal intensity for the gene expression measurements in normal and diseased samples." The alternative hypothesis is that there is a difference. We use a test statistic to decide whether or not to reject the null hypothesis. For many applications, we set the significance level α to 0.05 and reject the null hypothesis when p < 0.05.
62Inferential statistics
A t-test is a commonly used test statistic to assess the difference in mean values between two groups.

$t = \frac{\bar{x}_1 - \bar{x}_2}{s} = \frac{\text{difference between mean values}}{\text{variability (noise)}}$

Questions
- Is the sample size (n) adequate?
- Are the data normally distributed?
- Is the variance of the data known?
- Is the variance the same in the two groups?
- Is it appropriate to set the significance level to p < 0.05?
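The idea of dividing the difference between mean values by the variability can be sketched for one gene. This uses Welch's form of the t statistic, which does not assume equal variances in the two groups; that choice is an assumption, since the slide does not say which form is meant. The expression values are hypothetical.

```python
import math
import statistics

def welch_t(group1, group2):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    n1, n2 = len(group1), len(group2)
    # Standard error of the difference between the two means
    se1, se2 = var1 / n1, var2 / n2
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical expression values for one gene in two groups
normal = [1.0, 2.0, 3.0, 4.0, 5.0]
diseased = [2.0, 3.0, 4.0, 5.0, 6.0]

t, df = welch_t(normal, diseased)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Turning t into a p value requires the t distribution, which the Python standard library does not provide. A statistics package would be the usual choice for that step.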
63Inferential statistics

Paradigm                       Parametric test    Nonparametric
Compare two unpaired groups    Unpaired t-test    Mann-Whitney test
Compare two paired groups      Paired t-test      Wilcoxon test
Compare 3 or more groups       ANOVA
64Inferential statistics
Is it appropriate to set the significance level to p < 0.05? If you hypothesize that a specific gene is up-regulated, you can set the probability value to 0.05. But you might instead measure the expression of 10,000 genes and hope that any of them are up- or down-regulated. In that case you can expect to see 5% (500 genes) regulated at the p < 0.05 level by chance alone. To account for the thousands of repeated measurements you are making, some researchers apply a Bonferroni correction. The level for statistical significance is divided by the number of measurements, e.g. the criterion becomes p < 0.05/10,000, or p < 5 x 10^-6.
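The arithmetic behind the Bonferroni correction can be checked directly. A minimal sketch, using the 10,000 genes and 0.05 level from this slide; the gene names and p values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests

def expected_false_positives(alpha, n_tests):
    """Number of tests expected to pass the uncorrected level by chance alone."""
    return alpha * n_tests

alpha, n_genes = 0.05, 10_000
threshold = bonferroni_threshold(alpha, n_genes)
expected = expected_false_positives(alpha, n_genes)
print(f"expected by chance: {expected:.0f} genes, corrected threshold: {threshold:.1e}")

# Hypothetical p values for three genes
p_values = {"gene_a": 1e-7, "gene_b": 0.01, "gene_c": 4e-6}
significant = sorted(gene for gene, p in p_values.items() if p < threshold)
print(significant)
```

Here gene_b passes p < 0.05 but not the corrected criterion, which shows how strict the correction is.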
65Descriptive statistics
Microarray data are highly dimensional: there are many thousands of measurements made from a small number of samples. Descriptive (exploratory) statistics help you to find meaningful patterns in the data. A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are
- Euclidean distance
- Pearson coefficient of correlation
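Both metrics can be written out for two rows of a data matrix. A minimal sketch, assuming each gene is a list of expression values across the same samples (the three profiles are hypothetical):

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson coefficient of correlation between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical expression profiles across three samples
gene_1 = [1.0, 2.0, 3.0]
gene_2 = [2.0, 4.0, 6.0]   # same shape as gene_1, twice the magnitude
gene_3 = [3.0, 2.0, 1.0]   # opposite trend to gene_1

print(euclidean_distance(gene_1, gene_2), pearson_correlation(gene_1, gene_2))
print(euclidean_distance(gene_1, gene_3), pearson_correlation(gene_1, gene_3))
```

The two metrics can disagree: gene_1 and gene_2 are perfectly correlated yet separated by a nonzero Euclidean distance, because correlation ignores magnitude.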
66Data matrix (20 genes and 3 time points from Chu
et al.)
67[Figure: 3D plot (using S-PLUS software) with axes t=0, t=0.5, t=2.0]
68Descriptive statistics: clustering
Clustering algorithms offer useful visual
descriptions of microarray data. Genes may be
clustered, or samples, or both. We will next
describe hierarchical clustering. This may be
agglomerative (building up the branches of a
tree, beginning with the two most closely
related objects) or divisive (building the tree
by finding the most dissimilar objects
first). In each case, we end up with a tree
having branches and nodes.
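69Agglomerative clustering
[Figure, step axis 0 to 4: objects a, b, c, d, e; a and b are joined into a,b]
70Agglomerative clustering
[Figure, step axis 0 to 4: d and e are joined into d,e]
71Agglomerative clustering
[Figure, step axis 0 to 4: c is joined with d,e into c,d,e]
72Agglomerative clustering
[Figure, step axis 0 to 4: a,b and c,d,e are joined into a,b,c,d,e]
tree is constructed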
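The agglomerative procedure can be sketched in a few lines. This assumes single linkage (the distance between two clusters is the distance between their closest members) and hypothetical one-dimensional values for the objects a to e, chosen so that the merge order matches the figure.

```python
from itertools import combinations

def single_linkage(cluster1, cluster2, points):
    """Distance between the closest pair of members of two clusters."""
    return min(abs(points[i] - points[j]) for i in cluster1 for j in cluster2)

def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [(name,) for name in sorted(points)]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: single_linkage(pair[0], pair[1], points))
        clusters.remove(c1)
        clusters.remove(c2)
        merged = tuple(sorted(c1 + c2))
        clusters.append(merged)
        merges.append(merged)
    return merges

# Hypothetical one-dimensional values for five objects
points = {"a": 0.0, "b": 1.0, "c": 8.0, "d": 10.0, "e": 11.5}

merges = agglomerate(points)
for step, merged in enumerate(merges, start=1):
    print(step, ",".join(merged))
```

Other linkage rules (complete, average) can give a different tree from the same data.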
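73Divisive clustering
[Figure, step axis 4 to 0: all objects start in one cluster a,b,c,d,e]
74Divisive clustering
[Figure, step axis 4 to 0: clusters shown: a,b,c,d,e; c,d,e]
75Divisive clustering
[Figure, step axis 4 to 0: clusters shown: a,b,c,d,e; c,d,e; d,e]
76Divisive clustering
[Figure, step axis 4 to 0: clusters shown: a,b,c,d,e; a,b; c,d,e; d,e]
77Divisive clustering
[Figure, step axis 4 to 0: clusters shown: a,b,c,d,e; a,b; c,d,e; d,e; single objects a, b, c, d, e]
tree is constructed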
78[Figure: the same tree over a, b, c, d, e (clusters a,b; d,e; c,d,e; a,b,c,d,e), with one step axis labeled agglomerative and one labeled divisive]
81Agglomerative and divisive clustering sometimes give conflicting results, as shown here
[Figure: two trees, each labeled 1 to 12]
82Cluster and TreeView
83Cluster and TreeView
[Figure labels: clustering, PCA, SOM, K means]
84Cluster and TreeView
85Cluster and TreeView
89Two-way clustering of genes (y-axis) and cell
lines (x-axis) (Alizadeh et al., 2000)