Title: Lecture 17 Gene expression and the transcriptome II
1Lecture 17: Gene expression and the transcriptome II
2SAGE
- SAGE: Serial Analysis of Gene Expression
- Based on serial sequencing of 15-bp tags that are
unique to each and every gene
- SAGE is a method to determine the absolute abundance
of every transcript expressed in a population of
cells
3SAGE
- 15-bp gene-specific tags are produced by an elegant
series of molecular biology manipulations and
then concatenated into a single molecule (string)
for automated sequencing
- By sequencing the concatenated fragments, the
number of copies of each tag can be counted
- A list of each unique tag and its abundance in
the population is assembled
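The counting step can be sketched in a few lines of Python. This is only an illustration: it assumes the concatemer has already been sequenced into a plain string in which 15-bp tags sit back to back, with no linker or anchoring-site sequence between them, which simplifies the real protocol. The function name `count_tags` and the toy tag sequences are hypothetical.

```python
from collections import Counter

TAG_LENGTH = 15  # tag length in bp, as stated in the lecture


def count_tags(concatemer, tag_length=TAG_LENGTH):
    """Split a concatemer read into fixed-length tags and count each tag.

    Assumes tags are joined back to back with nothing in between
    (a simplification made for this sketch).
    """
    # Ignore a trailing partial tag, if any.
    usable = len(concatemer) - len(concatemer) % tag_length
    tags = [concatemer[i:i + tag_length] for i in range(0, usable, tag_length)]
    return Counter(tags)


# Toy example with two made-up tags: tag_a occurs three times, tag_b once.
tag_a = "ACGTACGTACGTACG"
tag_b = "TTTTGGGGCCCCAAA"
counts = count_tags(tag_a + tag_b + tag_a + tag_a)

# List each unique tag with its abundance, most abundant first.
for tag, n in counts.most_common():
    print(tag, n)
```

The resulting `Counter` plays the role of the list of unique tags and their abundances described above.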
4SAGE
- At least 50,000 tags are required per sample to
approach saturation, the point where each
expressed gene (eukaryotic cell) is represented
at least twice
- SAGE costs about 5000 per sample
- Too expensive for replicated comparisons of the
kind done with microarrays
5Transcript abundance in typical eukaryotic cell
- Fewer than 100 transcripts account for 20% of the total mRNA
population, each being present at between 100 and
1000 copies per cell
- These encode ribosomal proteins and other core
elements of the transcription and translation
machinery, histones and further taxon-specific
genes
- General, basic and most important cellular
mechanisms
6Transcript abundance in typical eukaryotic cell
(2)
- Several hundred intermediate-frequency
transcripts, each present at 10 to 100 copies, make
up a further 30% of mRNA
- These code for housekeeping enzymes, cytoskeletal
components and some unusually abundant cell-type
specific proteins
- Pretty basic housekeeping things
7Transcript abundance in typical eukaryotic cell
(3)
- A further 50% of mRNA is made up of tens of
thousands of low-abundance transcripts (<10 copies), some
of which may be expressed at less than one copy
per cell (on average)
- Most of these genes are tissue-specific or
induced only under particular conditions
- Specific or special purpose products
8Transcript abundance in typical eukaryotic cell
(4)
- Get some feel for the numbers (can be off by a factor
of 2, but the order of magnitude is about right)
- If
  - 80 transcripts × 400 copies = 32,000 (20%)
  - 600 transcripts × 75 copies = 45,000 (30%)
  - 25,000 transcripts × 3 copies = 75,000 (50%)
- Then the total is about 150,000 mRNA molecules
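The arithmetic on this slide can be checked with a short Python sketch. The class labels are hypothetical names added for readability; the transcript and copy numbers are the ones given above.

```python
# (class label, number of distinct transcripts, copies per cell of each)
classes = [
    ("high abundance", 80, 400),
    ("intermediate abundance", 600, 75),
    ("low abundance", 25_000, 3),
]

# mRNA molecules contributed by each abundance class
molecules = {label: n * copies for label, n, copies in classes}
total = sum(molecules.values())

for label, count in molecules.items():
    print(f"{label}: {count:,} molecules, {100 * count / total:.0f}% of total")
print(f"total: {total:,} mRNA molecules per cell")
```

The exact sum is 152,000, which the slide rounds to 150,000, so the quoted percentages are approximate.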
9Transcript abundance in typical eukaryotic cell
(5)
- This means that most of the transcripts in a cell
population each contribute less than 0.01% of the
total mRNA
- Say 1/3 of a higher eukaryote genome is expressed
in a given tissue, then about 10,000 different tags
should be detectable
- Taking into account that half the transcriptome
is relatively abundant, at least 50,000
tags should be sequenced to approach saturation
(at least 10 copies per transcript on average)
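To get a feel for what a depth of 50,000 tags means for a rare transcript, the following Python sketch estimates how many tags a transcript is expected to yield. It is an illustration under stated assumptions: tags are sampled in proportion to transcript abundance, the cell holds about 150,000 mRNA molecules (the estimate from the previous slide), and tag counts are approximated as Poisson-distributed. The function names are hypothetical.

```python
import math

TOTAL_MRNA = 150_000     # mRNA molecules per cell, estimate from this lecture
TAGS_SEQUENCED = 50_000  # sequencing depth discussed in this lecture


def expected_tag_count(copies_per_cell, tags=TAGS_SEQUENCED, total=TOTAL_MRNA):
    """Expected number of tags for a transcript present at copies_per_cell."""
    return tags * copies_per_cell / total


def prob_at_least(k, mean):
    """P(count >= k) for a Poisson-distributed count with the given mean."""
    below = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


# Representative copy numbers for the three abundance classes.
for copies in (3, 75, 400):
    mean = expected_tag_count(copies)
    print(copies, round(mean, 2), round(prob_at_least(2, mean), 3))
```

Under these assumptions a transcript present at 3 copies per cell yields about one tag on average at this depth, which is why low-abundance transcripts dominate the sequencing effort needed to approach saturation.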
10SAGE analysis of yeast (Velculescu et al., 1997)
[Figure: fraction of all transcripts (axis 0 to 1.0) versus number of transcripts per cell (axis 0.1 to 1000); values shown: 17, 38, 45]
11SAGE quantitative comparison
- A tag present in 4 copies in one sample of 50,000
tags, and in 2 copies in another sample, shows a
twofold difference in expression, but the difference
is not going to be statistically significant
- Even 20 versus 10 tags might not be statistically
significant given the large number of
comparisons
- Often, 10-fold over- or under-expression is taken
as the threshold
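One way to see why small tag counts are unconvincing is a simple conditional binomial test, sketched below in Python. This is an illustrative choice, not a method named in the lecture: it assumes both libraries contain the same number of tags, so that under the null hypothesis each observed tag is equally likely to come from either sample. It does not correct for multiple comparisons, which would make the requirement stricter still. The function name `two_sided_p` is hypothetical.

```python
from math import comb


def two_sided_p(count_a, count_b):
    """Two-sided p-value for count_a vs count_b under a 50:50 split.

    Assumes equal library sizes, so each tag falls in either sample
    with probability 0.5 under the null hypothesis.
    """
    n = count_a + count_b
    hi = max(count_a, count_b)
    # Probability of a split at least as uneven as observed, in one direction.
    upper_tail = sum(comb(n, k) for k in range(hi, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


for a, b in [(4, 2), (20, 10), (40, 4)]:
    print(a, b, round(two_sided_p(a, b), 4))
```

Under these assumptions, 4 versus 2 tags gives a p-value of about 0.69 and 20 versus 10 tags stays above 0.05 even before any correction, while a 10-fold difference such as 40 versus 4 is clearly significant.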
12SAGE quantitative comparison
- A great advantage of SAGE is that the method is
unbiased by experimental conditions
- Direct comparison of data sets is possible
- Data produced by different groups can be pooled
- Web-based tools exist for comparing samples from
all over the world (e.g. SAGEnet and xProfiler)
13Genome-Wide Cluster Analysis: Eisen dataset
- Eisen et al., PNAS 1998
- S. cerevisiae (baker's yeast)
  - all genes (about 6200) on a single array
  - measured during several processes
- human fibroblasts
  - 8600 human transcripts on array
  - measured at 12 time points during serum
  stimulation
14The Eisen Data
- 79 measurements for yeast data
- collected at various time points during
  - diauxic shift (shutting down genes for
  metabolizing sugars, activating those for
  metabolizing ethanol)
  - mitotic cell division cycle
  - sporulation
  - temperature shock
  - reducing shock
15The Data
- each measurement represents
$\log(\mathrm{Red}_i / \mathrm{Green}_i)$,
where Red is the test expression level and Green
is the reference level for gene G in the i-th
experiment
- the expression profile of a gene is the vector
of measurements across all experiments, $G_1, \ldots, G_n$
16The Data
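- m genes measured in n experiments

$$
\begin{pmatrix}
g_{1,1} & \cdots & g_{1,n} \\
g_{2,1} & \cdots & g_{2,n} \\
\vdots & & \vdots \\
g_{m,1} & \cdots & g_{m,n}
\end{pmatrix}
$$

- each row is the vector for 1 gene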
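As a minimal sketch of how such a matrix is built, the Python below turns raw red and green intensities into log ratios, one row per gene and one column per experiment. The intensity values are made up for illustration, the function name `log_ratio_matrix` is hypothetical, and the use of base-2 logarithms is an assumption, since the slide only says "Log".

```python
import math

# Hypothetical raw intensities: one row per gene, one column per experiment.
red = [[200.0, 400.0, 100.0],
       [50.0, 50.0, 800.0]]
green = [[100.0, 100.0, 100.0],
         [100.0, 50.0, 100.0]]


def log_ratio_matrix(red, green):
    """Return the m x n matrix g with g[j][i] = log2(red[j][i] / green[j][i])."""
    return [[math.log2(r / g) for r, g in zip(red_row, green_row)]
            for red_row, green_row in zip(red, green)]


g = log_ratio_matrix(red, green)
profile_gene_1 = g[0]  # expression profile (row vector) of the first gene
print(profile_gene_1)
```

With a base-2 logarithm, a value of 1 means the test signal is twice the reference, 0 means no change, and -1 means half the reference.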
20Eisen et al. Results
- redundant representations of genes cluster
together
- but individual genes can be distinguished from
related genes by subtle differences in expression
- genes of similar function cluster together
- e.g. 126 genes strongly down-regulated in
response to stress
21Eisen et al. Results
- 126 genes down-regulated in response to stress
- 112 of the genes encode ribosomal and other
proteins related to translation
- agrees with the previously known result that yeast
responds to favorable growth conditions by
increasing the production of ribosomes
22Partitional Clustering
- divide instances into disjoint clusters
- flat vs. tree structure
- key issues
- how many clusters should there be?
- how should clusters be represented?
24Partitional Clustering from a Hierarchical
Clustering
we can always generate a partitional clustering
from a hierarchical clustering by cutting the
tree at some level
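The idea of cutting the tree can be sketched with a small bottom-up clustering in Python. The choice of single linkage and Euclidean distance is an assumption made for this example; the lecture does not specify a linkage rule. The function name `single_linkage_cut` and the toy profiles are hypothetical.

```python
import math


def single_linkage_cut(points, k):
    """Agglomerative single-linkage clustering, stopped when k clusters remain.

    Stopping the merging at k clusters gives the same partition as building
    the full tree and cutting it at the level where it has k branches.
    """
    clusters = [[p] for p in points]  # start with every instance on its own
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge the two closest clusters
        del clusters[b]
    return clusters


profiles = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.0, 0.0)]
for k in (3, 2):
    print(k, single_linkage_cut(profiles, k))
```

Cutting lower in the tree (larger k) gives more, tighter clusters; cutting higher (smaller k) merges them into fewer, broader ones.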
25K-Means Clustering
- assume our instances are represented by vectors
of real values
- put k cluster centers in the same space as the
instances
- now iteratively move the cluster centers
26K-Means Clustering
- each iteration involves two steps
- assignment of instances to clusters
- re-computation of the means
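The two steps of each iteration can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the initial centers are passed in by the caller, distances are Euclidean, and the function name `k_means` and the toy data are hypothetical.

```python
import math


def k_means(instances, centers, iterations=10):
    """Plain k-means: alternate assignment and mean re-computation."""
    centers = [tuple(c) for c in centers]
    assignment = []
    for _ in range(iterations):
        # Step 1: assign each instance to its nearest cluster center.
        assignment = [min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
                      for x in instances]
        # Step 2: move each center to the mean of the instances assigned to it.
        new_centers = []
        for j, old in enumerate(centers):
            members = [x for x, a in zip(instances, assignment) if a == j]
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # converged: the centers no longer move
        centers = new_centers
    return centers, assignment


instances = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers, assignment = k_means(instances, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers, assignment)
```

The result depends on the initial centers, so in practice the procedure is often run from several starting points.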
27K-Means Clustering
- in k-means clustering, instances are assigned to
one and only one cluster
- can do soft k-means clustering via the Expectation
Maximization (EM) algorithm
- each cluster is represented by a normal distribution
- E step: determine how likely it is that each
cluster generated each instance
- M step: move cluster centers to maximize the
likelihood of the instances
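A minimal Python sketch of the E and M steps is given below for one-dimensional data. To keep it short it makes simplifying assumptions that are not from the lecture: all clusters share a fixed standard deviation `sigma` and equal mixing weights, so only the cluster means are updated. The function name `soft_k_means` and the toy values are hypothetical.

```python
import math


def soft_k_means(values, means, sigma=1.0, iterations=50):
    """EM for a mixture of 1-D normal distributions with fixed, equal spread.

    Only the cluster means are updated; equal mixing weights and a shared
    sigma are simplifying assumptions of this sketch.
    """
    means = list(means)
    resp = []
    for _ in range(iterations):
        # E step: how likely is it that each cluster generated each instance?
        resp = []
        for x in values:
            weights = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in means]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M step: each mean becomes the responsibility-weighted average.
        means = [sum(r[j] * x for r, x in zip(resp, values)) / sum(r[j] for r in resp)
                 for j in range(len(means))]
    return means, resp


values = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
means, resp = soft_k_means(values, means=[0.0, 6.0])
print([round(m, 2) for m in means])
```

Unlike plain k-means, every instance contributes to every cluster center, weighted by its responsibility, so instances near a boundary influence both neighbouring clusters.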
29Ecogenomics
Algorithm that maps the observed clustering behaviour
of sampled gene expression data onto the
clustering behaviour of contaminant-labelled gene
expression patterns in the knowledge base
[Diagram: a sample is compared against Condition 1 (contaminant 1), Condition 2 (contaminant 2), Condition 3 (contaminant 3), ..., Condition n (contaminant n), yielding compatibility scores]
30Protein-protein interactions
31The Two-Hybrid System for the Detection of
Protein-Protein Interactions
If you want to know whether any particular
proteins bind to protein X, such proteins
can be found with the yeast two-hybrid system.
The two-hybrid system allows in vivo detection
of protein-protein interactions as well as
analysis of the affinity of these interactions.
32The Two-Hybrid System for the Detection of
Protein-Protein Interactions
- Two-hybrid technology exploits the fact that
transcriptional activators are modular in nature.
- Two physically distinct functional domains are
necessary to get transcription:
  - a DNA binding domain (DBD) that binds to the DNA
  of the promoter, and
  - an activation domain (AD) that binds to the
  basal transcription apparatus and activates
  transcription.
33The Two-Hybrid System for the Detection of
Protein-Protein Interactions
- In the yeast two-hybrid system, the known gene
encoding X is cloned into the "bait" vector.
- In this way, the gene for X is placed into a
plasmid next to the gene encoding a DNA-binding
domain from some transcription factor.
- For instance, if the gene for X is cloned into
the pHybLex/Zeo vector, X would be expressed as a
fusion protein containing the bacterially derived
LexA DBD.
34The Two-Hybrid System for the Detection of
Protein-Protein Interactions
Separately, a second gene (or a library of
cDNAs encoding potential interactors), Y, is
cloned in frame adjacent to an activation domain
of a different transcription factor. For
instance, it could be inserted next to the DNA
encoding the B42 activation domain (AD) in a
"prey" vector such as pYESTrp2.
35The Two-Hybrid System for the Detection of
Protein-Protein Interactions
Thus, in one strain of yeast, a known protein X
is fused to the DNA binding domain of a
transcription factor and, in another strain,
unknown proteins are fused to the activation
domain of another transcription factor. If one of
the unknown proteins binds to X, it will
bring the AD over to the DBD, and transcription
will be activated. The plasmids containing the
"bait" (known protein/DBD) and "prey" (unknown
protein/AD) are therefore placed into a yeast strain
in which a marker gene has a promoter containing the
sequence bound by the bait protein DBD.
36The Two-Hybrid System for the Detection of
Protein-Protein Interactions
However, in order to form a working
transcription factor, the bait needs the AD provided by
the "prey." The only way that it can get this
activation domain is if the known protein X
binds to some unknown protein Y that is
carrying the AD. If the X and Y proteins
interact, the B42 AD is brought into proximity of
the LexA DBD and transcription of the reporter
gene is activated. The activation of the reporter
gene can be screened by enzyme activity, light
release, or cell growth, depending on the type of
reporter gene activated.
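The logic of a two-hybrid screen can be summarised as a toy Python model: the reporter is switched on only for those prey proteins that interact with the bait. This is purely illustrative. The interaction table, the protein names `Y1` to `Y4` and the function `reporter_active` are hypothetical, and the model ignores false positives and false negatives that occur in real screens.

```python
def reporter_active(bait, prey, interactions):
    """Reporter is transcribed only if the bait (DBD fusion) and the
    prey (AD fusion) interact, bringing the AD to the promoter."""
    return frozenset((bait, prey)) in interactions


# Hypothetical interaction table used only for this illustration.
known_interactions = {frozenset(("X", "Y1")), frozenset(("X", "Y3"))}
prey_library = ["Y1", "Y2", "Y3", "Y4"]

# Screen the library: keep the prey that activate the reporter with bait X.
positives = [prey for prey in prey_library
             if reporter_active("X", prey, known_interactions)]
print(positives)
```

In the real experiment the interaction table is of course unknown; the reporter readout is what reveals it.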
41Literature: two-hybrid systems
1. Fields, S. and Song, O. 1989. A novel genetic
system to detect protein-protein interactions.
Nature 340: 245-246.
2. Gyuris, J., Golemis, E., Chertkov, H., and
Brent, R. 1993. Cdi1, a human G1 and S phase
protein phosphatase that associates with Cdk2.
Cell 75: 791-803.