Genomewide Analysis of Gene regulation

1

Genome-wide Analysis of Gene regulation

Berlin, 4th of May, 2005
Presentation by David Rozado
2
Comparative analysis of methods for representing
and searching for transcription factor binding
sites

Robert Osada, Elena Zaslavsky and Mona Singh
Department of Computer Science & Lewis-Sigler
Institute for Integrative Genomics,
Princeton University, Princeton, NJ 08544, USA

3
Introduction

Identification of DNA binding sites for
transcription factors
Important step in unraveling the transcriptional
regulatory network
Several approaches to searching for transcription
factor binding sites
consensus sequences
position-specific scoring matrices
Berg and von Hippel
Centroid
Such basic approaches can all be extended by
incorporating
pairwise nucleotide dependencies
per-position information content
The paper evaluates the effectiveness of the
basic approaches and their extensions in finding
binding sites for a transcription factor of
interest
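
As a rough illustration of two of the basic approaches listed above, the sketch below builds a consensus sequence and a log-odds position-specific scoring matrix from a handful of aligned sites and scores a candidate sequence with each. The function names, the uniform background of 0.25 per base and the additive pseudocount are assumptions made for the example, not details taken from the paper. The Berg and von Hippel and Centroid methods are not shown.

```python
import math
from collections import Counter

BASES = "ACGT"

def consensus(sites):
    """Most frequent base in each column (ties resolved in A, C, G, T order)."""
    out = []
    for col in zip(*sites):
        counts = Counter(col)
        out.append(max(BASES, key=lambda b: counts[b]))
    return "".join(out)

def score_consensus(cons, t):
    """Number of positions at which t matches the consensus."""
    return sum(1 for c, b in zip(cons, t) if c == b)

def build_pssm(sites, pseudocount=1.0):
    """Log-odds PSSM from aligned, equal-length sites.

    Assumptions: uniform background (0.25 per base), additive pseudocount.
    """
    n = len(sites)
    pssm = []
    for col in zip(*sites):
        counts = Counter(col)
        column = {}
        for b in BASES:
            freq = (counts[b] + pseudocount) / (n + pseudocount * len(BASES))
            column[b] = math.log2(freq / 0.25)
        pssm.append(column)
    return pssm

def score_pssm(pssm, t):
    """Sum of the per-position log-odds scores for sequence t."""
    return sum(column[base] for column, base in zip(pssm, t))

# Example sequences as they appear on the notation slide of this transcript.
SITES = ["AGGTAACAAA", "ATATAACAAT", "TTTTAACAAT", "AGTAATCAAT"]
candidate = "AGTTAACAAT"
print(score_consensus(consensus(SITES), candidate))
print(score_pssm(build_pssm(SITES), candidate))
```

Both scorers take a candidate of the same length as the training sites. Scanning a longer region would mean sliding a window of that length along it.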

4
Datasets

68 regulatory proteins and their aligned DNA
binding domains
A number of filters were applied
Only proteins with at least four binding sites
were considered
Duplicate binding sites were removed in order to
preserve the integrity of the leave-one-out
cross-validation
Each binding site was unambiguously located within
the E. coli K-12 genome and extracted along with
flanking regions on each side
This process left 35 transcription factors and
410 binding sites
Average of 11.7 ± 8.5 sites per transcription
factor

5
Notation

S: set of N DNA binding sites for a transcription
factor
n_j(b): number of times base b appears in the j-th
position in S
f_j(b): corresponding frequency
n(b): number of times base b appears overall in
the N binding sites
f(b): overall frequency for base b
n_ij(b, d): number of times the ordered pair (b,
d) occurs in positions i and j
f_ij(b, d): corresponding frequency
t_j: j-th base of the sequence t to be scored

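[Figure: example sequences AGTTAACAAT, AGGTAACAAA,
ATATAACAAT, TTTTAACAAT and AGTAATCAAT, with labels
t (sequence to be scored), S (set of binding
sites) and column positions i and j]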
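
The counts and frequencies defined in the notation can be computed directly from an aligned set of sites. The sketch below is one way to do so, using 0-based positions and hypothetical function names. Treating the last four sequences of the slide's example as the set S is an assumption about how the figure was laid out.

```python
from collections import Counter

def position_counts(sites, j):
    """n_j(b): how often each base b appears at position j (0-based) in S."""
    return Counter(site[j] for site in sites)

def overall_counts(sites):
    """n(b): how often each base b appears anywhere in the N sites."""
    return Counter("".join(sites))

def pair_counts(sites, i, j):
    """n_ij(b, d): how often the ordered pair (b, d) occurs at positions i and j."""
    return Counter((site[i], site[j]) for site in sites)

def to_frequencies(counts):
    """Turn any of the count tables above into frequencies (f_j, f or f_ij)."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}

# Assumed reading of the slide's example: these four sequences form S.
S = ["AGGTAACAAA", "ATATAACAAT", "TTTTAACAAT", "AGTAATCAAT"]
print(position_counts(S, 0))
print(to_frequencies(pair_counts(S, 0, 1)))
```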
6
Approaches for representing and searching for
binding sites
7
Extension I - Pairwise correlations

A method for incorporating pairwise correlations
should only take into account those pairs that
act together in determining DNA–protein binding
specificity.
Such precise information is not always readily
available
As an approximation, focus on considering pairwise
correlations between bases that are nearby in
sequence
Introduce the notion of scope to delimit which
pairs are considered correlated.
A scope of one restricts correlated positions to
adjacent pairs
A scope of two considers both adjacent pairs and
pairs separated by an intermediate base
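
The notion of scope can be made concrete by listing which position pairs it admits. The sketch below assumes 0-based positions and a hypothetical function name. A pair (i, j) with i < j is included when j - i is at most the scope.

```python
def pairs_within_scope(length, scope):
    """All position pairs (i, j), i < j, whose separation j - i is at most `scope`."""
    return [
        (i, j)
        for i in range(length)
        for j in range(i + 1, min(i + scope + 1, length))
    ]

print(pairs_within_scope(4, 1))  # adjacent pairs only
print(pairs_within_scope(4, 2))  # adjacent pairs plus pairs one base apart
```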

8
Extension II - Information content

Information content (IC) is a concept based on
the information-theoretic notion of entropy.
In the current application, the entropy of a
position expresses the number of bits necessary
to describe the position in a binding site
The information content of a position is
calculated by subtracting its entropy from the
value of the maximum possible entropy
The higher the information content, the more
conserved (and presumably more important) the
position
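
The per-position calculation described above can be sketched as follows. The example assumes a four-letter DNA alphabet, so the maximum possible entropy is 2 bits, and uses raw observed frequencies with no pseudocounts or background correction. Both choices are assumptions for the illustration.

```python
import math
from collections import Counter

MAX_ENTROPY_BITS = 2.0  # log2(4) for the DNA alphabet

def entropy(column):
    """Shannon entropy, in bits, of the bases observed in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_content(sites):
    """IC of each position: maximum entropy minus the position's entropy."""
    return [MAX_ENTROPY_BITS - entropy(col) for col in zip(*sites)]

sites = ["AGGTAACAAA", "ATATAACAAT", "TTTTAACAAT", "AGTAATCAAT"]
for j, ic in enumerate(information_content(sites)):
    print(j, round(ic, 3))
```

A fully conserved column gives an IC of 2 bits, and a column split evenly between two bases gives 1 bit.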

9
Cross-validation testing and analysis

Common usage of any of the methods described
above would be to scan non-coding regions in a
genome in order to find binding sites for a
particular transcription factor
Such a framework is not easily applicable when we
wish to evaluate and compare different methods
The E. coli genome contains many as yet
uncharacterized binding sites
Predicted windows may correspond to true binding
sites even if they are not annotated as such in
the original dataset
Testing framework with sets of positive and
negative examples

10
Cross-validation testing and analysis II

Conduct leave-one-out cross-validation studies to
evaluate a particular method
Suppose s belongs to a set S of known binding
sites, each of length l, for a particular
transcription factor TF
The method under consideration uses all the sites
except s, to build the binding site
representation for TF, and scores s as well as a
set of negative examples
The negative examples consist of all binding
sites in our dataset except those known to be
bound by TF
It is still possible that transcription factor
TF can bind some of the negative examples
Nevertheless, s should be among the top scoring
sites in the overall pool
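
The leave-one-out procedure above can be sketched generically, with the representation passed in as a pair of functions. The consensus-based `build` and `score` functions and the negative examples are hypothetical stand-ins chosen to keep the example self-contained. Any of the methods discussed could be plugged in instead.

```python
from collections import Counter

BASES = "ACGT"

def build_consensus(sites):
    """Toy representation: most frequent base per column (ties in A, C, G, T order)."""
    out = []
    for col in zip(*sites):
        counts = Counter(col)
        out.append(max(BASES, key=lambda b: counts[b]))
    return "".join(out)

def score_consensus(model, t):
    """Toy score: number of positions where t matches the consensus."""
    return sum(1 for c, b in zip(model, t) if c == b)

def leave_one_out(sites, negatives, build, score):
    """For each site s: train on the other sites, then score s and every negative."""
    results = []
    for k, held_out in enumerate(sites):
        training = sites[:k] + sites[k + 1:]
        model = build(training)
        pos_score = score(model, held_out)
        neg_scores = [score(model, neg) for neg in negatives]
        results.append((held_out, pos_score, neg_scores))
    return results

sites = ["AGGTAACAAA", "ATATAACAAT", "TTTTAACAAT", "AGTAATCAAT"]
negatives = ["GGCCGGCCGG", "TTGACATATA"]  # made-up sites standing in for other TFs' sites
for held_out, pos, negs in leave_one_out(sites, negatives, build_consensus, score_consensus):
    print(held_out, pos, negs)
```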

11
Comparing Methods

For each site s of a transcription factor under
consideration, a rank in cross-validation testing
is computed by counting how many negative
examples score as well as or better than s
A lower rank indicates better performance
To compare how well two methods perform, a
Wilcoxon matched-pairs signed-ranks test is used
The number of times one method outperforms the
other is compared with how many times such an
event would happen merely by chance under the
assumption that both methods perform equally well
A ROC curve is created for each individual
leave-one-out test, and the curves are then
averaged over all sites for that transcription factor
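
The rank and the averaged ROC curve can be sketched as below. With a single held-out positive, each individual ROC curve is a step that jumps to a true-positive rate of 1 once the allowed number of false positives reaches the site's rank. Averaging those steps over the sites of one transcription factor, as done here, is one plausible reading of the slide rather than a confirmed description of the authors' procedure. The Wilcoxon test is not shown.

```python
def rank_of(pos_score, neg_scores):
    """Number of negative examples scoring as well as or better than the held-out site."""
    return sum(1 for s in neg_scores if s >= pos_score)

def averaged_roc(ranks, n_negatives):
    """Average of per-site step ROC curves.

    At k allowed false positives (FPR = k / n_negatives), the averaged TPR is
    the fraction of sites whose rank is at most k.
    """
    points = []
    for k in range(n_negatives + 1):
        fpr = k / n_negatives
        tpr = sum(1 for r in ranks if r <= k) / len(ranks)
        points.append((fpr, tpr))
    return points

ranks = [rank_of(7, [3, 7, 9, 2]), rank_of(10, [3, 7, 9, 2])]
print(ranks)
print(averaged_roc(ranks, 4))
```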

12
Comparison of basic methods
13
ROC curves comparing performance when pairs are
considered for Centroid
14
ROC curves comparing performance when pairs are
considered for PSSM
15
ROC curves comparing Centroid-P with scope 2
using regular sites and sites with columns
shuffled
16
Performance of methods based on averaged ranks
per transcription factor
17
Conclusions

Using per-position information content to weigh
positional scores improves the performance of all
methods
Sometimes dramatically
Methods based on nucleotide matches, such as
consensus sequences and Centroid, show
statistically significant improvements when
incorporating pairwise nucleotide dependencies
Probabilistic methods, such as log-odds PSSMs, do
not show statistically significant improvements
when incorporating pairwise dependencies
Difference in performance between methods
decreases substantially once information content
and pairwise correlations have been incorporated

18
Making connections between novel
transcription factors and their DNA motifs

Kai Tan (1), Lee Ann McCue (2) and Gary D. Stormo (1,3)
(1) Department of Genetics, Washington University
School of Medicine, Saint Louis, Missouri 63110,
USA
(2) The Wadsworth Center, New York State Department
of Health, Albany, New York 12201-0509, USA

19
Introduction

A computational method to connect novel
transcription factors and DNA motifs in E. coli
The method takes advantage of three types of
information to assign a DNA binding motif to a
given TF
A distance constraint between a TF and its
closest binding site in the genome (Dmin
information)
The phylogenetic correlation between TFs and
their regulated genes (PC information)
A binding specificity constraint for TFs having
structurally similar DNA-binding domains (FMC
information)
The different types of information are combined
to calculate the probability of a given
TF–DNA-motif pair being a true pair
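
The slides do not say how the three sources of evidence are combined. As a hedged illustration only, the sketch below uses a naive Bayes combination: each source contributes a likelihood ratio (how much more likely the observed value is for true pairs than for false pairs), and the sources are assumed to be conditionally independent. The function name, the prior and the example ratios are all hypothetical.

```python
def posterior_true_pair(prior, likelihood_ratios):
    """Posterior probability that a TF/motif pair is true.

    prior: assumed prior probability of a true pair (0 < prior < 1).
    likelihood_ratios: one ratio P(evidence | true) / P(evidence | false) per
    information source (for example Dmin, PC, FMC), assumed independent.
    """
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)

# Made-up numbers: three sources that each favour the pair being true.
print(posterior_true_pair(0.01, [5.0, 3.0, 4.0]))
```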

20
Distance constraint

Besides auto-regulation, it has been noticed in
many cases that TFs and the genes they regulate
are near each other in the genome
Distance constraint between the TF and its
closest binding site in the genome

Dmin_self is the distance between a TF gene and
its closest binding site in the genome
Dmin_cross is the distance between a TF gene and
the closest binding site for a different TF
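
A minimal sketch of the distance calculation follows, assuming that the TF gene and each binding site are reduced to a single genomic coordinate. Because the E. coli chromosome is circular, the optional `genome_length` argument measures distance around the circle. The function name, the choice of coordinate and the numbers in the usage lines are hypothetical.

```python
def dmin(tf_gene_pos, site_positions, genome_length=None):
    """Distance from a TF gene to the closest of the given binding sites.

    If genome_length is given, distances wrap around a circular chromosome.
    """
    def distance(a, b):
        d = abs(a - b)
        if genome_length is not None:
            d = min(d, genome_length - d)
        return d
    return min(distance(tf_gene_pos, p) for p in site_positions)

# Dmin_self uses the TF's own sites, Dmin_cross the sites of a different TF.
own_sites = [1200, 48000]
other_tf_sites = [30500, 91000]
print(dmin(1000, own_sites), dmin(1000, other_tf_sites))
```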

21
The phylogenetic correlation

TFs and their regulated genes tend to evolve
concurrently
Connect TFs and DNA motifs through correlation
between their occurrences in a comparative
analysis of multiple species
Two types of phylogenetic correlation (PC)
distributions
PC for true TF–DNA-motif pairs
PC for false TF–DNA-motif pairs
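
One simple way to quantify such a correlation is to record, for each species, whether the TF and the motif are present, and then correlate the two presence/absence profiles. The Pearson correlation used below is an assumption for illustration, and the slides do not state which measure the authors used. The profiles are made up.

```python
import math

def phylogenetic_correlation(tf_profile, motif_profile):
    """Pearson correlation between two presence/absence profiles (1 or 0 per species).

    Returns 0.0 when either profile is constant, where the correlation is undefined.
    """
    n = len(tf_profile)
    mean_x = sum(tf_profile) / n
    mean_y = sum(motif_profile) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(tf_profile, motif_profile))
    var_x = sum((x - mean_x) ** 2 for x in tf_profile)
    var_y = sum((y - mean_y) ** 2 for y in motif_profile)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

tf_present = [1, 1, 0, 0, 1, 0]     # hypothetical: TF ortholog found in species 1..6
motif_present = [1, 1, 0, 0, 0, 0]  # hypothetical: motif found upstream of target genes
print(phylogenetic_correlation(tf_present, motif_present))
```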

22
Binding specificity constraint

TFs that are more similar to one another are
expected to bind to sites that are more similar
to each other than the sites bound by dissimilar TFs
Distribution of average similarity scores for
motifs from the same family and from different
families

23
Conclusions

Hypothesize that information concerning the
connection of a TF to its DNA motif is carried in
the genome sequences
TFs and their binding sites are often in similar
genomic locations (Dmin information)
TFs tend to evolve concurrently with their
regulated genes (PC information)
TFs from the same structural family tend to have
similar DNA motifs

24
Functional determinants of transcription factors
in Escherichia coli: protein families and binding
sites

M. Madan Babu and Sarah A. Teichmann
MRC Laboratory of Molecular Biology, Hills Road,
Cambridge CB2 2QH, UK

25
Introduction

DNA-binding transcription factors regulate
expression of genes near to where they bind
These factors can be activators or repressors of
transcription, or both
A fundamental question is what determines whether
a transcription factor acts as an activator or a
repressor?
Protein–protein contacts
Position of the DNA-binding domain in the protein
primary sequence
Altered DNA structure
Position of its binding site on the DNA relative
to the transcription start site
This work suggests that, in general, in E. coli, a
transcription factor's protein family is not
indicative of its regulatory function, but the
position of its binding site on the DNA is

26
Domain Architectures for different TFs
28
Conclusions

Activators, repressors and dual regulators in E.
coli belong to many of the same protein families
and share some domain architectures
A transcription factor's regulatory role is not
determined by protein structure or evolutionary
relationships
Transcription factors have evolved by duplication
of an ancestral transcription factor, followed by
a change in function through a shift in binding
sites
A transcription factor's regulatory role is
determined to a large extent simply by the
position of the transcription factor binding site
Activators have essentially only upstream binding
sites
More than two thirds of repressors have at least
one downstream binding site
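
The upstream/downstream distinction in these conclusions can be expressed as a small helper. The sketch below assumes each binding site is summarised by the coordinate of its centre, that the transcription start site (TSS) and strand of the regulated gene are known, and that a site centred exactly on the TSS counts as downstream. These conventions and the function names are assumptions, not definitions from the paper.

```python
def site_location(site_center, tss, strand="+"):
    """Classify a binding site as 'upstream' or 'downstream' of the TSS.

    On the '-' strand the direction of transcription is reversed, so the
    sign of the offset is flipped.
    """
    offset = site_center - tss
    if strand == "-":
        offset = -offset
    return "upstream" if offset < 0 else "downstream"

def has_downstream_site(site_centers, tss, strand="+"):
    """True if at least one of the sites lies downstream of the TSS."""
    return any(site_location(c, tss, strand) == "downstream" for c in site_centers)

print(site_location(950, 1000))               # site centred 50 bp before the TSS
print(has_downstream_site([950, 1020], 1000))
```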