Bioinformatics approaches - PowerPoint PPT Presentation

About This Presentation

Title:

Bioinformatics approaches

Slides: 93

Provided by: Jonathan408

Transcript and Presenter's Notes

Title: Bioinformatics approaches

1
Bioinformatics approaches to gene expression
2
Gene expression is regulated in several basic
ways

by region (e.g. brain versus kidney)
in development (e.g. fetal versus adult tissue)
in dynamic response to environmental signals
(e.g. immediate-early response genes)
in disease states
by gene activity

3
Organism: virus, bacteria, fungi, invertebrates, rodents, human
Gene expression changes measured...
In mutant or wildtype cells
Development
Cell types
Disease
In virus, bacteria, and/or host
In response to stimuli
4
[Diagram labels: DNA, RNA, cDNA, protein, phenotype]
5
[Diagram labels: DNA, RNA, cDNA, protein (repeated); UniGene, SAGE, microarray]
6
[Figure: a gene drawn 5' to 3' with exon 1, exon 2, exon 3 and two introns. Steps shown: transcription; RNA splicing (remove introns); polyadenylation (AAAAA at the 3' end); export to cytoplasm]
7
Analysis of gene expression in cDNA libraries

A fundamental approach to studying gene expression is through cDNA libraries.
Isolate RNA (always from a specific organism, region, and time point)
Convert RNA to complementary DNA
Subclone into a vector
Sequence the cDNA inserts. These are expressed sequence tags (ESTs)

[Figure: vector containing a cDNA insert]
8
UniGene: unique genes via ESTs

Find UniGene at NCBI
www.ncbi.nlm.nih.gov/UniGene
UniGene clusters contain many ESTs
UniGene data come from many cDNA libraries.
Thus, when you look up a gene in UniGene
you get information on its abundance
and its regional distribution.

9
Pitfalls in interpreting cDNA library data

bias in library construction
variable depth of sequencing
library normalization
error rate in sequencing
contamination (chimeric sequences)

10
Serial analysis of gene expression (SAGE)
9 to 11 base tags correspond to genes
Tag counts provide a measure of gene expression in different biological samples
SAGE tags can be compared electronically
11
(No Transcript)
12
Microarrays: tools for gene expression
A microarray is a solid support (such as a
membrane or glass microscope slide) on which DNA
of known sequence is deposited in a grid-like
array. RNA is isolated from matched samples of
interest. The RNA is typically converted to cDNA,
labeled with fluorescence (or radioactivity),
then hybridized to microarrays in order to
measure the expression levels of thousands of
genes.
13
Questions addressed using microarrays
Wildtype versus mutant
Cultured cells +/- drug
Physiological states (hibernation, cell polarity formation)
Normal versus diseased tissue (cancer, autism)
14
Organisms represented on microarrays
metazoans: human, mouse, rat, worm, insect
fungi: yeast
plants: Arabidopsis
other: bacteria, viruses
15
Advantages of microarray experiments
Fast: Data on 15,000 genes in 1-4 weeks
Comprehensive: Entire genome on a chip
Flexible: As more genomes are sequenced, more arrays can be made. Custom arrays can be made to represent genes of interest
Easy: You can submit RNA samples to a core facility for analysis
Cheap? Chip representing 15,000 genes for $350; robotic spotter/scanner cost $100,000
16
Disadvantages of microarray experiments
Cost: Many researchers can't afford to do appropriate controls, replicates
RNA significance: The final product of gene expression is protein
Quality control: Impossible to assess elements on array surface
Artifacts with image analysis
Artifacts with data analysis
17
Sample acquisition
RNA: purify, label
Data acquisition
Microarray: hybridize, wash, image
Data analysis
Data confirmation
Biological insight
18
(No Transcript)
19
(No Transcript)
20
Stage 1: Experimental design
1. Biological samples: technical and biological replicates
2. RNA extraction, conversion, labeling, hybridization
3. Arrangement of array elements on a surface
21
Sample 1
Sample 2
Sample 3
22
Samples 1,2
Samples 1,3
Samples 2,3
Samples 2,1 switch dyes
Sample 1, pool
Sample 2, pool
23
Stage 2: RNA preparation
For Affymetrix chips, need total RNA (about 10 µg)
Confirm purity by running agarose gel
Measure A260/A280 to confirm purity, quantity
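The A260/A280 check can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name and the absorbance readings are hypothetical, and it assumes the standard conversion of roughly 40 µg/mL of RNA per A260 unit, with a ratio near 2.0 usually taken to indicate pure RNA.

```python
def rna_quality(a260, a280, dilution_factor=1.0):
    """Return (A260/A280 ratio, RNA concentration in ug/mL).

    Assumes about 40 ug/mL of RNA per A260 unit (standard conversion).
    """
    ratio = a260 / a280
    concentration = a260 * 40.0 * dilution_factor
    return ratio, concentration


# Hypothetical spectrophotometer readings for a 1:50 dilution.
ratio, conc = rna_quality(a260=0.50, a280=0.25, dilution_factor=50)
print(f"A260/A280 = {ratio:.2f}, RNA = {conc:.0f} ug/mL")
```

A ratio well below 2.0 would suggest protein or other contamination in the sample.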
24
Stage 3: hybridization to DNA arrays
The array consists of cDNA or oligonucleotides
Oligonucleotides can be deposited by photolithography
The sample is converted to cRNA or cDNA
25
Microarrays: array surface
26
Stage 4: Image analysis
RNA expression levels are quantitated
Fluorescence intensity is measured with a scanner, or radioactivity with a phosphorimager
27
Differential Gene Expression on a cDNA Microarray
Control
b-Crystallin is over-expressed in Rett Syndrome
Rett
28
(No Transcript)
29
(No Transcript)
30
(No Transcript)
31
(No Transcript)
32
Stage 5: Data analysis

How can arrays be compared?
Which genes are regulated?
Are differences authentic?
What are the criteria for statistical
significance?
Are there meaningful patterns in the data
(such as groups)?

33
Microarray data analysis
preprocessing
inferential statistics
exploratory statistics
34
Microarray data analysis
preprocessing: global normalization, local normalization, scatter plots
inferential statistics: t-tests
exploratory statistics: clustering
35
Matrix of genes versus samples
Metric (define distance)
principal components analysis
clustering Trees (hierarchical, k-means)
supervised, unsupervised analyses
self-organizing maps
36
Stage 6: Biological confirmation
Microarray experiments can be thought of as hypothesis-generating experiments. The differential up- or down-regulation of specific genes can be measured using independent assays such as
-- Northern blots
-- reverse transcription polymerase chain reaction (RT-PCR)
-- in situ hybridization
37
Stage 7: Microarray databases
There are two main repositories:
-- Gene Expression Omnibus (GEO) at NCBI
-- ArrayExpress at the European Bioinformatics Institute (EBI)
38
Gene expression omnibus (GEO)
NCBI repository for gene expression data
39
(No Transcript)
40
(No Transcript)
41
http://www.dnachip.org
Page 183
42
Microarrays: web resources
Many links on Leming Shi's page: http://www.gene-chips.com
Stanford Microarray Database: http://www.dnachip.org
Links at http://pevsnerlab.kennedykrieger.org/
43
Microarray data analysis
begin with a data matrix (gene expression
values versus samples)
44
Microarray data analysis
begin with a data matrix (gene expression
values versus samples)
Typically, there are many genes (>> 10,000) and
few samples (~10)
45
Microarray data analysis
begin with a data matrix (gene expression
values versus samples)
Preprocessing
Inferential statistics
Descriptive statistics
46
Microarray data analysis: preprocessing

Observed differences in gene expression could be
due to transcriptional changes, or they could be
caused by artifacts such as
different labeling efficiencies of Cy3, Cy5
uneven spotting of DNA onto an array surface
variations in RNA purity or quantity
variations in washing efficiency
variations in scanning efficiency

47
Microarray data analysis: preprocessing
The main goal of data preprocessing is to
remove the systematic bias in the data as
completely as possible, while preserving the
variation in gene expression that occurs because
of biologically relevant changes in
transcription. A basic assumption of most
normalization procedures is that the average gene
expression level does not change in an
experiment.
48
Data analysis: global normalization
Global normalization is used to correct two or
more data sets. In one common scenario, samples
are labeled with Cy3 (green dye) or Cy5 (red dye)
and hybridized to DNA elements on a microarray.
After washing, probes are excited with a laser
and detected with a scanning confocal microscope.
49
Data analysis: global normalization
Global normalization is used to correct two or more data sets.
Example: total fluorescence in the Cy3 channel = 4 million units; Cy5 channel = 2 million units.
Then the uncorrected ratio for a gene could show 2,000 units versus 1,000 units. This would artifactually appear to show 2-fold regulation.
50
Data analysis: global normalization
Global normalization procedure
Step 1: subtract background intensity values (use a blank region of the array)
Step 2: globally normalize so that the average ratio = 1 (apply this to 1-channel or 2-channel data sets)
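The two-step procedure can be sketched as follows. This is an illustrative simplification rather than the method of any particular software package: the function name and intensity values are hypothetical, and the rescaling makes the ratio of the channel totals equal to 1, which is one common way to approximate "average ratio = 1".

```python
def global_normalize(cy3, cy5, background_cy3=0.0, background_cy5=0.0):
    """Return (normalized Cy3/Cy5 ratios, scale factor applied to Cy5)."""
    # Step 1: subtract the background intensity from every spot.
    cy3_bg = [v - background_cy3 for v in cy3]
    cy5_bg = [v - background_cy5 for v in cy5]
    # Step 2: rescale Cy5 so that both channels have the same total signal.
    scale = sum(cy3_bg) / sum(cy5_bg)
    cy5_scaled = [v * scale for v in cy5_bg]
    ratios = [g / r for g, r in zip(cy3_bg, cy5_scaled)]
    return ratios, scale


# Hypothetical spots where Cy3 is uniformly twice as bright as Cy5.
cy3 = [2000.0, 4000.0, 1000.0]
cy5 = [1000.0, 2000.0, 500.0]
raw_ratios = [g / r for g, r in zip(cy3, cy5)]
ratios, scale = global_normalize(cy3, cy5)
print("raw ratios:", raw_ratios)
print("normalized ratios:", ratios, "scale:", scale)
```

In this toy data set every raw ratio is 2.0 only because of the channel imbalance, so after normalization none of the genes appears regulated.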
51
Microarray data preprocessing
Some researchers use housekeeping genes for global normalization.
Visit the Human Gene Expression (HuGE) Index: www.HugeIndex.org
52
Scatter plots
Useful to represent gene expression values from two microarray experiments (e.g. control, experimental)
Each dot corresponds to a gene expression value
Most dots fall along a line
Outliers represent up-regulated or down-regulated genes
53
Scatter plot analysis of microarray data
54
Differential Gene Expression in Different Tissue
and Cell Types
Fibroblast
Brain
Astrocyte
Astrocyte
55
[Figure: scatter plot with axes Expression level (sample 1) and Expression level (sample 2), each running from low to high; regions labeled up and down]
56
Log-log transformation
57
Scatter plots
Typically, data are plotted on log-log
coordinates. Visually, this moves out the data
to a more concentrated region.

time    behavior       raw ratio value    log2 ratio value
t=0     basal          1.0                0.0
t=1h    no change      1.0                0.0
t=2h    2-fold up      2.0                1.0
t=3h    2-fold down    0.5                -1.0
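The log2 values for the four time points can be reproduced with a short Python sketch. The variable names are illustrative; the raw ratios are the ones given on the slide.

```python
import math

# (time, behavior, raw ratio) as listed on the slide.
rows = [
    ("t=0", "basal", 1.0),
    ("t=1h", "no change", 1.0),
    ("t=2h", "2-fold up", 2.0),
    ("t=3h", "2-fold down", 0.5),
]

log2_ratios = [math.log2(raw) for _, _, raw in rows]

for (time, behavior, raw), log_value in zip(rows, log2_ratios):
    print(f"{time:5s} {behavior:12s} raw={raw:<4} log2={log_value:+.1f}")
```

On the log2 scale a 2-fold increase (+1.0) and a 2-fold decrease (-1.0) are symmetric around zero, whereas the raw ratios 2.0 and 0.5 are not symmetric around 1.0.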
58
[Figure: plot of log ratio (up or down) and mean log intensity (expression level, low to high)]
59
SNOMAD converts array data to scatter plots
http://snomad.org
[Figure: linear-linear plot and log-log plot of EXP and CON intensities with 2-fold lines; plot of Log10(Ratio) and Mean(Log10(Intensity)) with 2-fold lines marking EXP > CON and EXP < CON]
60
SNOMAD corrects local variance artifacts
[Figure: plot of Log10(Ratio) and Mean(Log10(Intensity)) with a robust local regression fit and residuals; plot of corrected Log10(Ratio) residuals and Mean(Log10(Intensity)); 2-fold lines mark EXP > CON and EXP < CON]
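The idea of replacing each log ratio by its residual from a local fit can be illustrated with a rough Python sketch. This is not SNOMAD's algorithm: a running median over neighbouring intensities stands in for the robust local regression, and all function names and data values are hypothetical.

```python
import math
from statistics import median


def ratio_and_intensity(exp, con):
    """Return (Log10(Ratio), Mean(Log10(Intensity))) for paired intensities."""
    log_ratio = [math.log10(e / c) for e, c in zip(exp, con)]
    mean_log = [(math.log10(e) + math.log10(c)) / 2 for e, c in zip(exp, con)]
    return log_ratio, mean_log


def local_median_residuals(log_ratio, mean_log, window=3):
    """Subtract a running median of the log ratio, taken along the intensity axis."""
    order = sorted(range(len(mean_log)), key=lambda i: mean_log[i])
    half = window // 2
    residuals = [0.0] * len(log_ratio)
    for rank, i in enumerate(order):
        neighbours = order[max(0, rank - half):rank + half + 1]
        local_fit = median(log_ratio[j] for j in neighbours)
        residuals[i] = log_ratio[i] - local_fit
    return residuals


# Hypothetical data: EXP reads twice as bright as CON at every intensity.
exp = [200.0, 2000.0, 20000.0, 400.0]
con = [100.0, 1000.0, 10000.0, 200.0]
log_ratio, mean_log = ratio_and_intensity(exp, con)
residuals = local_median_residuals(log_ratio, mean_log)
print("Log10(Ratio):", [round(v, 3) for v in log_ratio])
print("residuals:", [round(v, 3) for v in residuals])
```

Because the bias here is the same at every intensity, the residuals are all zero. With real data the local fit would differ across the intensity range, which is the artifact the slide describes.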
61
Inferential statistics
Inferential statistics are used to make
inferences about a population from a sample.
Hypothesis testing is a common form of
inferential statistics. A null hypothesis is
stated, such as "There is no difference in
signal intensity for the gene expression
measurements in normal and diseased samples." The
alternative hypothesis is that there is a
difference. We use a test statistic to decide
whether to reject the null hypothesis or fail to reject it.
For many applications, we set the significance
level α to p < 0.05.
62
Inferential statistics
A t-test is a commonly used test statistic to
assess the difference in mean values between two
groups.
t = (x1 - x2) / s = (difference between mean values) / (variability (noise))
where x1 and x2 are the mean values of the two groups.
Questions:
Is the sample size (n) adequate?
Are the data normally distributed?
Is the variance of the data known?
Is the variance the same in the two groups?
Is it appropriate to set the significance level to p < 0.05?
Inferential statistics
Paradigm                       Parametric test    Nonparametric
Compare two unpaired groups    Unpaired t-test    Mann-Whitney test
Compare two paired groups      Paired t-test      Wilcoxon test
Compare 3 or more groups       ANOVA
64
Inferential statistics
Is it appropriate to set the significance level
to p < 0.05? If you hypothesize that a specific
gene is up-regulated, you can set the probability
value to 0.05. You might instead measure the expression
of 10,000 genes and hope that any of them are up-
or down-regulated. But you can expect to see 5%
(500 genes) regulated at the p < 0.05 level by
chance alone. To account for the thousands of
repeated measurements you are making, some
researchers apply a Bonferroni correction. The
level for statistical significance is divided by
the number of measurements, e.g. the criterion
becomes p < (0.05)/10,000 or p < 5 x 10^-6.
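The arithmetic behind these numbers can be checked with a short Python sketch. The function names and the per-gene p values are hypothetical, and the expected count of chance findings assumes that no gene is truly regulated.

```python
def expected_false_positives(alpha, n_tests):
    """Number of tests expected to pass by chance if every null hypothesis is true."""
    return alpha * n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


alpha = 0.05
n_genes = 10_000
by_chance = expected_false_positives(alpha, n_genes)
threshold = bonferroni_threshold(alpha, n_genes)

# Hypothetical p values for three genes.
p_values = {"gene_1": 1e-7, "gene_2": 0.01, "gene_3": 4e-6}
significant = sorted(g for g, p in p_values.items() if p < threshold)

print(f"expected by chance: {by_chance:.0f} genes")
print(f"corrected threshold: {threshold:.1e}")
print("significant after correction:", significant)
```

In this example gene_2 would pass an uncorrected p < 0.05 cutoff but not the corrected one, which shows how conservative the Bonferroni criterion is.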
65
Descriptive statistics
Microarray data are highly dimensional: there
are many thousands of measurements made from a
small number of samples. Descriptive
(exploratory) statistics help you to
find meaningful patterns in the data. A first
step is to arrange the data in a matrix. Next,
use a distance metric to define the
relatedness of the different data points. Two
commonly used distance metrics are
-- Euclidean distance
-- Pearson coefficient of correlation
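The two metrics can behave quite differently on the same pair of genes, as this small Python sketch illustrates. The function names and the expression profiles are hypothetical.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical profiles over three time points: same shape, different scale.
gene_a = [1.0, 2.0, 3.0]
gene_b = [10.0, 20.0, 30.0]
distance = euclidean(gene_a, gene_b)
correlation = pearson(gene_a, gene_b)
print(f"Euclidean distance = {distance:.2f}")
print(f"Pearson correlation = {correlation:.2f}")
```

The two genes are far apart by Euclidean distance yet almost perfectly correlated, so the choice of metric determines whether they end up grouped together.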
66
Data matrix (20 genes and 3 time points from Chu
et al.)
67
[Figure: 3D plot (using S-PLUS software) with axes t=0, t=0.5, and t=2.0]
68
Descriptive statistics: clustering
Clustering algorithms offer useful visual
descriptions of microarray data. Genes may be
clustered, or samples, or both. We will next
describe hierarchical clustering. This may be
agglomerative (building up the branches of a
tree, beginning with the two most closely
related objects) or divisive (building the tree
by finding the most dissimilar objects
first). In each case, we end up with a tree
having branches and nodes.
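Agglomerative clustering can be sketched in Python for five objects a to e. This is a minimal illustration under stated assumptions: the one-dimensional coordinates are hypothetical, and single linkage (distance between the closest members of two clusters) is used as the merging rule, which the slides do not specify.

```python
def agglomerative(points):
    """Single-linkage agglomerative clustering; returns clusters in merge order."""
    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(points[p] - points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical positions of five objects on a line.
points = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 8.0, "e": 9.5}
merges = agglomerative(points)
for step, cluster in enumerate(merges, start=1):
    print(step, ",".join(cluster))
```

Each merge corresponds to one node of the tree. Divisive clustering would start from the single cluster containing all five objects and split it repeatedly instead.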
69
Agglomerative clustering
[Figure: objects a, b, c, d, e along an axis numbered 0 to 4; a and b are joined (a,b)]
70
Agglomerative clustering
[Figure: d and e are joined (d,e)]
71
Agglomerative clustering
[Figure: c is joined with d,e (c,d,e)]
72
Agglomerative clustering
[Figure: a,b is joined with c,d,e (a,b,c,d,e); the tree is constructed]
73
Divisive clustering
[Figure: axis numbered 4 to 0; all objects begin in one cluster (a,b,c,d,e)]
74
Divisive clustering
[Figure: c,d,e is split from a,b,c,d,e]
75
Divisive clustering
[Figure: d,e is split from c,d,e]
76
Divisive clustering
[Figure: a,b is shown as a separate cluster]
77
Divisive clustering
[Figure: the single objects a, b, c, d, e are separated; the tree is constructed]
78
[Figure: the same tree (a; b; a,b; c; d; e; d,e; c,d,e; a,b,c,d,e) with two axes numbered 0 to 4, one labeled agglomerative and one labeled divisive]
79
(No Transcript)
80
(No Transcript)
81
Agglomerative and divisive clustering sometimes
give conflicting results, as shown here
82
Cluster and TreeView
83
Cluster and TreeView
clustering
PCA
SOM
K means
84
Cluster and TreeView
85
Cluster and TreeView
86
(No Transcript)
87
(No Transcript)
88
(No Transcript)
89
Two-way clustering of genes (y-axis) and cell
lines (x-axis) (Alizadeh et al., 2000)
90
(No Transcript)
91
(No Transcript)
92
(No Transcript)
