Title: Genome-wide Analysis of Gene Regulation
1 Genome-wide Analysis of Gene Regulation
Berlin, 4th of May, 2005
Presentation by David Rozado
2 Comparative analysis of methods for representing and searching for transcription factor binding sites
- Robert Osada, Elena Zaslavsky and Mona Singh
- Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
3 Introduction
- Identification of DNA binding sites for transcription factors
- Important step in unraveling the transcriptional regulatory network
- Several approaches to searching for transcription factor binding sites:
- consensus sequences
- position-specific scoring matrices
- Berg and von Hippel
- Centroid
- Such basic approaches can all be extended by incorporating:
- pairwise nucleotide dependencies
- per-position information content
- The paper evaluates the effectiveness of the basic approaches and their extensions in finding binding sites for a transcription factor of interest
4 Datasets
- 68 regulatory proteins and their aligned DNA binding domains
- A number of filters were applied:
- Only proteins with at least four binding sites were considered
- Duplicate binding sites were removed in order to preserve the integrity of the leave-one-out cross-validation
- Each binding site was unambiguously located within the E. coli K-12 genome and extracted along with flanking regions on each side
- This process left 35 transcription factors and 410 binding sites
- Average of 11.7 ± 8.5 sites per transcription factor
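The two counting filters above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `filter_dataset`, the dictionary layout, and the choice to remove duplicates within each transcription factor are assumptions, and the genome-mapping step is not modeled.

```python
def filter_dataset(sites_by_tf, min_sites=4):
    """Drop duplicate sites per TF, then keep TFs with enough sites.

    sites_by_tf: dict mapping a TF name to a list of site sequences
    (hypothetical layout chosen for this sketch).
    """
    filtered = {}
    for tf, sites in sites_by_tf.items():
        # dict.fromkeys removes duplicates while preserving input order
        unique_sites = list(dict.fromkeys(sites))
        if len(unique_sites) >= min_sites:
            filtered[tf] = unique_sites
    return filtered


# Toy data, invented for illustration only
toy = {
    "tfA": ["AAAA", "AAAT", "AAAA", "AATT", "ATTT"],  # 4 unique sites
    "tfB": ["CCCC", "CCCC", "CCCG", "CCGG"],          # 3 unique sites
}
kept = filter_dataset(toy)
```

Removing duplicates before counting matters here: a duplicated site left in the training set would let a held-out site be scored against an exact copy of itself.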
5 Notation
- S: set of N DNA binding sites for a transcription factor
- n_j(b): number of times base b appears in the j-th position in S
- f_j(b): corresponding frequency
- n(b): number of times base b appears overall in the N binding sites
- f(b): overall frequency for base b
- n_ij(b, d): number of times the ordered pair (b, d) occurs in positions i and j
- f_ij(b, d): corresponding frequency
- t_j: j-th base of the sequence t to be scored
[Figure: a sequence t to be scored shown alongside the site set S, with columns i and j marked. Sequences shown: AGTTAACAAT, AGGTAACAAA, ATATAACAAT, TTTTAACAAT, AGTAATCAAT]
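The counts and frequencies defined in the notation can be computed directly from an aligned set of sites. The sketch below treats all five sequences shown on the slide as a toy site set, which is an assumption made only for illustration (the slide marks one of them as t); the function names are hypothetical.

```python
from collections import Counter

# Toy site set: the five sequences from the slide figure (assumed grouping)
SITES = ["AGTTAACAAT", "AGGTAACAAA", "ATATAACAAT", "TTTTAACAAT", "AGTAATCAAT"]


def position_counts(sites):
    """n_j(b): one Counter per aligned column j."""
    return [Counter(column) for column in zip(*sites)]


def overall_counts(sites):
    """n(b): base counts pooled over every position of every site."""
    return Counter("".join(sites))


def pair_counts(sites, i, j):
    """n_ij(b, d): counts of the ordered pair (b, d) at positions i and j."""
    return Counter((s[i], s[j]) for s in sites)


def to_frequencies(counts):
    """Turn any Counter of counts into relative frequencies."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


n_j = position_counts(SITES)
f_j = [to_frequencies(c) for c in n_j]
n = overall_counts(SITES)
f = to_frequencies(n)
n_01 = pair_counts(SITES, 0, 1)
f_01 = to_frequencies(n_01)
```

Positions are 0-based in the code, so `n_j[0]` corresponds to the first column of the alignment.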
6 Approaches for representing and searching for binding sites
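Three of the basic approaches named in the introduction (consensus sequence, log-odds PSSM, and Centroid) can be sketched as scoring functions over a set of aligned sites. The formulas below are common textbook readings rather than the paper's exact definitions: the uniform background, the pseudocount, the alphabetical tie-breaking, and all function names are assumptions.

```python
import math
from collections import Counter

BASES = "ACGT"


def column_counts(sites):
    """Per-position base counts for a list of equal-length sites."""
    return [Counter(col) for col in zip(*sites)]


def consensus(sites):
    """Most frequent base per position (ties go to the earlier base in BASES)."""
    return "".join(max(BASES, key=lambda b: c[b]) for c in column_counts(sites))


def consensus_score(sites, t):
    """Number of positions where t matches the consensus sequence."""
    return sum(a == b for a, b in zip(consensus(sites), t))


def pssm_score(sites, t, background=None, pseudocount=1.0):
    """Log-odds score: sum over positions of log2(f_j(t_j) / background)."""
    background = background or {b: 0.25 for b in BASES}  # assumed uniform
    n = len(sites)
    score = 0.0
    for counts, base in zip(column_counts(sites), t):
        # Pseudocount keeps unseen bases from producing log(0)
        freq = (counts[base] + pseudocount * background[base]) / (n + pseudocount)
        score += math.log2(freq / background[base])
    return score


def centroid_score(sites, t):
    """Average number of positions t shares with each known site."""
    return sum(sum(a == b for a, b in zip(s, t)) for s in sites) / len(sites)


SITES = ["AGTTAACAAT", "AGGTAACAAA", "ATATAACAAT", "TTTTAACAAT", "AGTAATCAAT"]
candidate = "AGTTAACAAT"
scores = (
    consensus_score(SITES, candidate),
    pssm_score(SITES, candidate),
    centroid_score(SITES, candidate),
)
```

Under this reading, the Centroid score equals the sum over positions of f_j(t_j), which is why it is grouped with the match-based methods rather than the probabilistic ones.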
7 Extension I - Pairwise correlations
- A method for incorporating pairwise correlations should only take into account those pairs that act together in determining DNA-protein binding specificity
- Such precise information is not always readily available
- As an approximation, focus on pairwise correlations between bases that are nearby in sequence
- Introduce the notion of scope to delimit which pairs are considered correlated
- A scope of one restricts correlated positions to adjacent pairs
- A scope of two considers both adjacent pairs and pairs separated by an intermediate base
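The notion of scope can be made concrete by listing which position pairs it admits. This is a small sketch; the function name `pairs_in_scope` and the 0-based indexing are choices made here, not taken from the slides.

```python
def pairs_in_scope(length, scope):
    """Return all position pairs (i, j) with 1 <= j - i <= scope.

    scope=1 yields adjacent pairs only; scope=2 adds pairs that are
    separated by one intermediate base.
    """
    return [
        (i, j)
        for i in range(length)
        for j in range(i + 1, min(i + scope, length - 1) + 1)
    ]


adjacent_only = pairs_in_scope(4, 1)  # [(0, 1), (1, 2), (2, 3)]
with_gaps = pairs_in_scope(4, 2)      # adds (0, 2) and (1, 3)
```

For a site of length l, scope 1 gives l - 1 pairs and scope 2 gives 2l - 3, so the number of pair frequencies to estimate grows roughly linearly with scope.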
8 Extension II - Information content
- Information content (IC) is a concept based on the information-theoretic notion of entropy
- In the current application, the entropy of a position expresses the number of bits necessary to describe the position in a binding site
- The information content of a position is calculated by subtracting its entropy from the value of the maximum possible entropy
- The higher the information content, the more conserved (and presumably more important) the position
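The per-position calculation can be sketched as below, assuming a four-letter DNA alphabet so that the maximum entropy is log2(4) = 2 bits. The function names are hypothetical, and no small-sample correction or background adjustment is applied, which the original work may or may not use.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=4):
    """IC of one alignment column: maximum entropy minus observed entropy."""
    counts = Counter(column)
    total = len(column)
    # Shannon entropy in bits; bases that never occur contribute nothing
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy


def per_position_ic(sites):
    """IC for every column of a list of equal-length sites."""
    return [information_content(col) for col in zip(*sites)]


SITES = ["AGTTAACAAT", "AGGTAACAAA", "ATATAACAAT", "TTTTAACAAT", "AGTAATCAAT"]
ic = per_position_ic(SITES)
```

A fully conserved column reaches the 2-bit maximum, while a column with all four bases equally represented scores 0, so these values can serve directly as per-position weights.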
9 Cross-validation testing and analysis
- Common usage of any of the methods described above would be to scan non-coding regions in a genome in order to find binding sites for a particular transcription factor
- Such a framework is not easily applicable when we wish to evaluate and compare different methods
- The E. coli genome contains many as yet uncharacterized binding sites
- Predicted windows may correspond to true binding sites even if they are not annotated as such in the original dataset
- Instead, use a testing framework with sets of positive and negative examples
10 Cross-validation testing and analysis II
- Conduct leave-one-out cross-validation studies to evaluate a particular method
- Suppose s belongs to a set S of known binding sites, each of length l, for a particular transcription factor TF
- The method under consideration uses all the sites except s to build the binding site representation for TF, and scores s as well as a set of negative examples
- The negative examples consist of all binding sites in our dataset except those known to be bound by TF
- It is still possible that transcription factor TF can bind some of the negative examples
- Nevertheless, s should be among the top scoring sites in the overall pool
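The leave-one-out procedure can be sketched as a loop that is independent of the scoring method. Everything named here is illustrative: `leave_one_out`, the toy scorer `match_fraction_score` (a Centroid-like average match count), and the invented negative examples are assumptions, not the paper's code or data.

```python
def leave_one_out(sites, negatives, score):
    """For each site s, train on the other sites and score s and the negatives.

    score(training_sites, sequence) -> float can be any scoring method.
    Returns a list of (held_out_score, negative_scores) tuples.
    """
    results = []
    for k, held_out in enumerate(sites):
        training = sites[:k] + sites[k + 1:]  # every site except s
        results.append(
            (score(training, held_out), [score(training, neg) for neg in negatives])
        )
    return results


def match_fraction_score(training, t):
    """Toy scorer: average number of positions t shares with each training site."""
    return sum(sum(a == b for a, b in zip(s, t)) for s in training) / len(training)


SITES = ["AGTTAACAAT", "AGGTAACAAA", "ATATAACAAT", "TTTTAACAAT", "AGTAATCAAT"]
NEGATIVES = ["CCGGCCGGCC", "GGGGCCCCGG"]  # invented stand-ins for other TFs' sites
results = leave_one_out(SITES, NEGATIVES, match_fraction_score)
```

Passing the scorer as a function keeps the evaluation loop identical across methods, so differences in outcome come from the representation alone.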
11 Comparing Methods
- For each site s of a transcription factor under consideration, a rank in cross-validation testing is computed by counting how many negative examples score as well as or better than s
- A lower rank indicates better performance
- To compare how well two methods perform, a Wilcoxon matched-pairs signed-ranks test is used
- The number of times one method outperforms the other is compared with how many times such an event would happen merely by chance under the assumption that both methods perform equally well
- A ROC curve is created for each individual leave-one-out test, and the average over all sites for that transcription factor is then computed
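The rank and the averaged ROC curve can be sketched as below. With a single positive per leave-one-out test, each individual ROC curve is a step function, and this sketch averages those steps over a fixed grid of false-positive rates; the grid and the function names are assumptions about one plausible reading, not the paper's exact procedure.

```python
def rank_of(site_score, negative_scores):
    """Number of negative examples scoring as well as or better than the site."""
    return sum(neg >= site_score for neg in negative_scores)


def averaged_roc(results, grid_size=11):
    """Average per-site step ROC curves over an evenly spaced FPR grid.

    results: list of (site_score, negative_scores), one per held-out site.
    Returns a list of (false_positive_rate, mean_true_positive_rate).
    """
    curve = []
    for k in range(grid_size):
        fpr = k / (grid_size - 1)
        # A site counts as recovered once its rank fraction is within the FPR
        hits = sum(rank_of(s, negs) / len(negs) <= fpr for s, negs in results)
        curve.append((fpr, hits / len(results)))
    return curve


# Toy scores, invented for illustration
toy_results = [(5.0, [1, 2, 3, 4]), (2.5, [1, 2, 3, 4]), (0.5, [1, 2, 3, 4])]
ranks = [rank_of(s, negs) for s, negs in toy_results]
curve = averaged_roc(toy_results)
```

Per-site ranks from two methods form matched pairs, which could then be passed to a signed-rank routine such as `scipy.stats.wilcoxon` for the comparison described above.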
12 Comparison of basic methods
13 ROC curves comparing performance when pairs are considered for Centroid
14 ROC curves comparing performance when pairs are considered for PSSM
15 ROC curves comparing Centroid-P with scope 2 using regular sites and sites with columns shuffled
16 Performance of methods based on averaged ranks per transcription factor
17 Conclusions
- Using per-position information content to weight positional scores improves the performance of all methods, sometimes dramatically
- Methods based on nucleotide matches, such as consensus sequences and Centroid, show statistically significant improvements when incorporating pairwise nucleotide dependencies
- Probabilistic methods, such as log-odds PSSMs, do not show statistically significant improvements when incorporating pairwise dependencies
- The difference in performance between methods decreases substantially once information content and pairwise correlations have been incorporated
18 Making connections between novel transcription factors and their DNA motifs
- Kai Tan (1), Lee Ann McCue (2) and Gary D. Stormo (1,3)
- (1) Department of Genetics, Washington University School of Medicine, Saint Louis, Missouri 63110, USA
- (2) The Wadsworth Center, New York State Department of Health, Albany, New York 12201-0509, USA
19 Introduction
- A computational method to connect novel transcription factors and DNA motifs in E. coli
- The method takes advantage of three types of information to assign a DNA binding motif to a given TF:
- A distance constraint between a TF and its closest binding site in the genome (Dmin information)
- The phylogenetic correlation between TFs and their regulated genes (PC information)
- A binding specificity constraint for TFs having structurally similar DNA-binding domains (FMC information)
- The different types of information are combined to calculate the probability of a given transcription factor-DNA motif pair being a true pair
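One generic way to combine several evidence types into a single probability is to multiply likelihood ratios onto prior odds. The slides do not say how the combination is done, so the sketch below is purely an assumption: it treats the three sources as conditionally independent, and the function name and the numbers are invented.

```python
def posterior_probability(prior, likelihood_ratios):
    """Combine independent evidence into a posterior probability.

    prior: prior probability that a TF-motif pair is a true pair.
    likelihood_ratios: one ratio P(evidence | true) / P(evidence | false)
    per evidence type (here imagined as Dmin, PC and FMC).
    """
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio  # independence assumption: ratios simply multiply
    return odds / (1.0 + odds)


# Invented numbers: a 10% prior and three mildly supportive evidence types
p_true = posterior_probability(0.1, [3.0, 2.0, 1.5])
```

Each ratio would in practice come from comparing the distribution of a quantity for known true pairs against its distribution for false pairs.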
20 Distance constraint
- Besides auto-regulation, it has been noticed in many cases that TFs and the genes they regulate are near each other in the genome
- Distance constraint between the TF and its closest binding site in the genome
- Dmin_self is the distance between a TF gene and its closest binding site in the genome
- Dmin_cross is the distance between a TF gene and the closest binding site for a different TF
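Dmin_self and Dmin_cross are the same minimum-distance calculation applied to different site lists. The sketch below assumes each gene and site is reduced to a single coordinate and that distance is measured around a circular chromosome; both are simplifications chosen here, and the coordinates are invented.

```python
def circular_distance(a, b, genome_length):
    """Shortest distance between two coordinates on a circular chromosome."""
    d = abs(a - b) % genome_length
    return min(d, genome_length - d)


def d_min(tf_gene_position, site_positions, genome_length):
    """Distance from a TF gene to the closest site in site_positions."""
    return min(
        circular_distance(tf_gene_position, p, genome_length)
        for p in site_positions
    )


GENOME_LENGTH = 5000          # toy value, far smaller than a real genome
own_sites = [90, 4000]        # sites of the TF's own motif (invented)
other_sites = [2500, 3000]    # sites of a different TF's motif (invented)

dmin_self = d_min(100, own_sites, GENOME_LENGTH)
dmin_cross = d_min(100, other_sites, GENOME_LENGTH)
```

Comparing the two values for many TFs gives the kind of separation the distance constraint relies on: a small Dmin_self relative to Dmin_cross supports the pairing.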
21 The phylogenetic correlation
- TFs and their regulated genes tend to evolve concurrently
- Connect TFs and DNA motifs through correlation between their occurrences in a comparative analysis of multiple species
- Two types of phylogenetic correlation (PC) distributions:
- PC for true TF-DNA-motif pairs
- PC for false TF-DNA-motif pairs
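A correlation between occurrences across species can be illustrated with presence/absence vectors. The slides do not define the PC statistic, so using a Pearson correlation on 0/1 vectors is an assumption made for this sketch, and the species profiles are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


# One entry per species: 1 if present, 0 if absent (invented profiles)
tf_present = [1, 1, 0, 0, 1, 0]
motif_present = [1, 1, 0, 0, 1, 1]
pc = pearson(tf_present, motif_present)
```

Computing this value for known true pairs and for mismatched pairs would give the two PC distributions listed above.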
22 Binding specificity constraint
- TFs that are more similar to one another are expected to bind sites that are more similar to each other than the sites bound by dissimilar TFs
- Distribution of average similarity scores for motifs from the same family and from different families
23 Conclusions
- Hypothesize that information concerning the connection of a TF to its DNA motif is carried in the genome sequences
- TFs and their binding sites are often in similar genomic locations (Dmin information)
- TFs tend to evolve concurrently with their regulated genes (PC information)
- TFs from the same structural family tend to have similar DNA motifs
24 Functional determinants of transcription factors in Escherichia coli: protein families and binding sites
- M. Madan Babu and Sarah A. Teichmann
- MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK
25 Introduction
- DNA-binding transcription factors regulate expression of genes near where they bind
- These factors can be activators or repressors of transcription, or both
- A fundamental question: what determines whether a transcription factor acts as an activator or a repressor? Candidate determinants:
- Protein-protein contacts
- Position of the DNA-binding domain in the protein primary sequence
- Altered DNA structure
- Position of its binding site on the DNA relative to the transcription start site
- This work suggests that, in general, in E. coli, a transcription factor's protein family is not indicative of its regulatory function, but the position of its binding site on the DNA is
26 Domain Architectures for different TFs
28 Conclusions
- Activators, repressors and dual regulators in E. coli belong to many of the same protein families and share some domain architectures
- A transcription factor's regulatory role is not determined by protein structure or evolutionary relationships
- Transcription factors have evolved by duplication of an ancestral transcription factor, followed by a change in function through a shift in binding sites
- A transcription factor's regulatory role is determined to a large extent simply by the position of the transcription factor binding site
- Activators have essentially only upstream binding sites
- More than two thirds of repressors have at least one downstream binding site
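The upstream/downstream distinction in these conclusions depends on the site position relative to the transcription start site (TSS) and on the gene's strand. The sketch below shows that bookkeeping only; the function names, the single-coordinate representation, and the choice to count a site exactly at the TSS as downstream are assumptions.

```python
def relative_position(site_center, tss, strand):
    """Signed offset of a site from the TSS; negative means upstream.

    strand is +1 for forward-strand genes and -1 for reverse-strand genes.
    """
    return (site_center - tss) * strand


def classify_site(site_center, tss, strand):
    """Label a site as 'upstream' or 'downstream' of the TSS."""
    offset = relative_position(site_center, tss, strand)
    return "upstream" if offset < 0 else "downstream"


def has_downstream_site(site_centers, tss, strand):
    """True if at least one site lies at or downstream of the TSS."""
    return any(
        classify_site(center, tss, strand) == "downstream"
        for center in site_centers
    )


# Invented coordinates for a forward-strand gene with TSS at 1000
labels = [classify_site(c, 1000, +1) for c in (950, 1020)]
```

The strand factor matters because the same genomic coordinate is upstream of a forward-strand TSS but downstream of a reverse-strand one.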