Genome-wide Analysis of Gene Regulation

1
  • Genome-wide Analysis of Gene Regulation

Berlin, 4th of May, 2005
Presentation by David Rozado
2
Comparative analysis of methods for representing
and searching for transcription factor binding
sites
  • Robert Osada, Elena Zaslavsky and Mona Singh
  • Department of Computer Science & Lewis-Sigler
    Institute for Integrative Genomics,
  • Princeton University, Princeton, NJ 08544, USA

3
Introduction
  • Identification of DNA binding sites for
    transcription factors
  • Important step in unraveling the transcriptional
    regulatory network
  • Several approaches to searching for transcription
    factor binding sites
  • consensus sequences
  • position-specific scoring matrices
  • Berg and von Hippel
  • Centroid
  • Such basic approaches can all be extended by
    incorporating
  • pairwise nucleotide dependencies
  • per-position information content
  • The paper evaluates the effectiveness of the
    basic approaches and their extensions in finding
    binding sites for a transcription factor of
    interest

4
Datasets
  • 68 regulatory proteins and their aligned DNA
    binding domains
  • A number of filters were applied
  • Only proteins with at least four binding sites
    were considered
  • Duplicate binding sites were removed in order to
    preserve the integrity of the leave-one-out
    cross-validation
  • Each binding site was unambiguously located within
    the E. coli K-12 genome and extracted along with
    flanking regions on each side
  • This process left 35 transcription factors and
    410 binding sites
  • Average of 11.7 ± 8.5 sites per transcription
    factor

5
Notation
  • S: set of N DNA binding sites for a transcription
    factor
  • n_j(b): number of times base b appears in the
    j-th position in S
  • f_j(b): corresponding frequency, n_j(b) / N
  • n(b): number of times base b appears overall in
    the N binding sites
  • f(b): overall frequency for base b
  • n_ij(b, d): number of times the ordered pair
    (b, d) occurs in positions i and j
  • f_ij(b, d): corresponding frequency
  • t_j: j-th base of the sequence t to be scored

Example (slide figure; i and j mark two column positions):
t: AGTTAACAAT
S: AGGTAACAAA
   ATATAACAAT
   TTTTAACAAT
   AGTAATCAAT
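
The notation on this slide can be sketched in Python using the example sequences from the figure. This is only an illustration: the function names are hypothetical, and positions are 0-based here, whereas the slide counts positions from the first base.

```python
from collections import Counter

# Example sites S and query sequence t, taken from the slide's figure.
S = ["AGGTAACAAA", "ATATAACAAT", "TTTTAACAAT", "AGTAATCAAT"]
t = "AGTTAACAAT"

def n_j(sites, j):
    """n_j(b): how often each base b occurs at position j (0-based)."""
    return Counter(site[j] for site in sites)

def f_j(sites, j):
    """f_j(b) = n_j(b) / N for each of the four bases."""
    counts = n_j(sites, j)
    return {b: counts[b] / len(sites) for b in "ACGT"}

def n_overall(sites):
    """n(b): how often each base b occurs anywhere in the N sites."""
    return Counter("".join(sites))

def n_ij(sites, i, j):
    """n_ij(b, d): how often the ordered pair (b, d) occurs at positions i and j."""
    return Counter((site[i], site[j]) for site in sites)

def f_ij(sites, i, j):
    """f_ij(b, d) = n_ij(b, d) / N for every observed pair."""
    return {pair: c / len(sites) for pair, c in n_ij(sites, i, j).items()}
```

For the four example sites, `n_j(S, 0)` counts three A's and one T in the first column, and `n_ij(S, 0, 1)` counts the pair (A, G) twice.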
6
Approaches for representing and searching for
binding sites
7
Extension I - Pairwise correlations
  • A method for incorporating pairwise correlations
    should only take into account those pairs that
    act together in determining DNA-protein binding
    specificity.
  • Such precise information is not always readily
    available
  • As an approximation, consider only pairwise
    correlations between bases that are nearby in
    sequence
  • Introduce the notion of scope to delimit which
    pairs are considered correlated.
  • A scope of one restricts correlated positions to
    adjacent pairs
  • A scope of two considers both adjacent pairs and
    pairs separated by an intermediate base
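
The notion of scope can be made concrete with a short Python sketch that lists which position pairs count as correlated. The function name `correlated_pairs` is hypothetical and positions are 0-based.

```python
def correlated_pairs(length, scope):
    """Return all position pairs (i, j) with i < j and j - i <= scope.

    scope=1 gives adjacent pairs only; scope=2 also includes pairs
    separated by one intermediate base.
    """
    pairs = []
    for i in range(length):
        last = min(i + scope, length - 1)
        for j in range(i + 1, last + 1):
            pairs.append((i, j))
    return pairs

print(correlated_pairs(4, 1))  # adjacent pairs only
print(correlated_pairs(4, 2))  # adjacent pairs plus pairs one base apart
```

For a site of length 10, a scope of one yields 9 pairs and a scope of two yields 17.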

8
Extension II - Information content
  • Information content (IC) is a concept based on
    the information-theoretic notion of entropy.
  • In the current application, the entropy of a
    position expresses the number of bits necessary
    to describe the position in a binding site
  • The information content of a position is
    calculated by subtracting its entropy from the
    value of the maximum possible entropy
  • The higher the information content, the more
    conserved (and presumably more important) the
    position
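
A minimal Python sketch of the per-position calculation described above. It assumes a maximum entropy of 2 bits (four equally likely bases) and applies no small-sample correction. Both are simplifying assumptions and not necessarily what the paper does.

```python
import math
from collections import Counter

def information_content(sites):
    """Per-position information content in bits: 2 - entropy of the column."""
    ics = []
    for column in zip(*sites):  # iterate over alignment columns
        n = len(column)
        counts = Counter(column)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        ics.append(2.0 - entropy)
    return ics

# Example sites from the notation slide.
S = ["AGGTAACAAA", "ATATAACAAT", "TTTTAACAAT", "AGTAATCAAT"]
for position, ic in enumerate(information_content(S)):
    print(position, round(ic, 3))
```

A fully conserved column (for example the fifth position, which is A in all four sites) gets the maximum of 2 bits, while a column split evenly between two bases gets 1 bit.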

9
Cross-validation testing and analysis
  • Common usage of any of the methods described
    above would be to scan non-coding regions in a
    genome in order to find binding sites for a
    particular transcription factor
  • Such a framework is not easily applicable when we
    wish to evaluate and compare different methods
  • The E. coli genome contains many as yet
    uncharacterized binding sites
  • Predicted windows may correspond to true binding
    sites even if they are not annotated as such in
    the original dataset
  • Testing framework with sets of positive and
    negative examples

10
Cross-validation testing and analysis II
  • Conduct leave-one-out cross-validation studies to
    evaluate a particular method
  • Suppose s belongs to a set S of known binding
    sites, each of length l, for a particular
    transcription factor TF
  • The method under consideration uses all the sites
    except s, to build the binding site
    representation for TF, and scores s as well as a
    set of negative examples
  • The negative examples consist of all binding
    sites in our dataset except those known to be
    bound by TF
  • It is still possible that transcription factor
    TF can bind some of the negative examples
  • Nevertheless, s should be among the top scoring
    sites in the overall pool
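
The leave-one-out procedure can be sketched in Python as follows. The scorer here is a simple log-odds PSSM with a pseudocount and a uniform background, chosen only for illustration. The negative examples are made-up sequences of the same length as the sites, and all function names are hypothetical.

```python
import math

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Log-odds PSSM from aligned sites (pseudocount and background are assumptions)."""
    pssm = []
    for column in zip(*sites):
        total = len(column) + 4 * pseudocount
        pssm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pssm

def score(pssm, seq):
    """Sum of per-position log-odds scores for a sequence of the same length."""
    return sum(col[base] for col, base in zip(pssm, seq))

def leave_one_out(sites, negatives):
    """For each site s: train on all other sites, then score s and the negatives."""
    results = []
    for k, held_out in enumerate(sites):
        training = sites[:k] + sites[k + 1:]
        pssm = build_pssm(training)
        results.append((score(pssm, held_out),
                        [score(pssm, neg) for neg in negatives]))
    return results

S = ["AGGTAACAAA", "ATATAACAAT", "TTTTAACAAT", "AGTAATCAAT"]
negatives = ["CCCCGGGGCC", "GGGGCCCCGG"]  # hypothetical negative examples
for site_score, neg_scores in leave_one_out(S, negatives):
    print(round(site_score, 2), [round(x, 2) for x in neg_scores])
```

Any of the representations discussed in the talk could be substituted for `build_pssm` and `score`; the cross-validation loop itself stays the same.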

11
Comparing Methods
  • For each site s of a transcription factor under
    consideration, a rank in cross-validation testing
    is computed by counting how many negative
    examples score as well as or better than s
  • A lower rank indicates better performance
  • To compare how well two methods perform, a
    Wilcoxon matched-pairs signed-ranks test is used
  • The number of times one method outperforms the
    other is compared with how many times such an
    event would happen merely by chance under the
    assumption that both methods perform equally well
  • A ROC curve is created for each individual
    leave-one-out test, and the average over all
    sites for that transcription factor is then computed
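
The rank described above is straightforward to compute once scores are available. The sketch below uses made-up scores, and the function names are hypothetical.

```python
def rank_of_site(site_score, negative_scores):
    """Number of negative examples that score as well as or better than the site."""
    return sum(1 for neg in negative_scores if neg >= site_score)

def average_rank(per_site_results):
    """Mean rank over all held-out sites of one transcription factor.

    per_site_results is a list of (site_score, negative_scores) tuples.
    """
    ranks = [rank_of_site(s, negs) for s, negs in per_site_results]
    return sum(ranks) / len(ranks)

# Made-up scores for two held-out sites of one transcription factor.
example = [(5.0, [6.0, 5.0, 1.0]), (7.5, [2.0, 3.0, 1.0])]
print(average_rank(example))
```

Given paired ranks from two methods on the same sites, a signed-rank test can then be applied. If SciPy is available, `scipy.stats.wilcoxon` is one way to do this.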

12
Comparison of basic methods
13
ROC curves comparing performance when pairs are
considered for Centroid
14
ROC curves comparing performance when pairs are
considered for PSSM
15
ROC curves comparing Centroid-P with scope 2
using regular sites and sites with columns
shuffled
16
Performance of methods based on averaged ranks
per transcription factor
17
Conclusions
  • Using per-position information content to weight
    positional scores improves the performance of all
    methods
  • Sometimes dramatically
  • Methods based on nucleotide matches, such as
    consensus sequences and Centroid, show
    statistically significant improvements when
    incorporating pairwise nucleotide dependencies
  • Probabilistic methods, such as log-odds PSSMs, do
    not show statistically significant improvements
    when incorporating pairwise dependencies
  • Difference in performance between methods
    decreases substantially once information content
    and pairwise correlations have been incorporated

18
Making connections between novel
transcription factors and their DNA motifs
  • Kai Tan (1), Lee Ann McCue (2) and Gary D. Stormo (1,3)
  • (1) Department of Genetics, Washington University
    School of Medicine, Saint Louis, Missouri 63110,
    USA
  • (2) The Wadsworth Center, New York State
    Department of Health, Albany, New York
    12201-0509, USA

19
Introduction
  • A computational method to connect novel
    transcription factors and DNA motifs in E. coli
  • The method takes advantage of three types of
    information to assign a DNA binding motif to a
    given TF
  • A distance constraint between a TF and its
    closest binding site in the genome (Dmin
    information)
  • The phylogenetic correlation between TFs and
    their regulated genes (PC information)
  • A binding specificity constraint for TFs having
    structurally similar DNA-binding domains (FMC
    information)
  • The different types of information are combined
    to calculate the probability of a given
    transcription factor-DNA motif pair being a true
    pair
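
The slides do not say how the three types of information are combined. One common way to combine independent pieces of evidence into a probability is a naive-Bayes style update with likelihood ratios, sketched below purely as an assumption. The function name, the prior and the ratios are all hypothetical.

```python
def posterior_true_pair(prior, likelihood_ratios):
    """Combine a prior with independent likelihood ratios.

    Each ratio is P(evidence | true pair) / P(evidence | false pair),
    e.g. one each for Dmin, PC and FMC. Independence is an assumption.
    """
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)

# Hypothetical numbers: a 1% prior and three moderately supportive ratios.
print(posterior_true_pair(0.01, [5.0, 4.0, 3.0]))
```

Each ratio would come from comparing the distribution of a quantity for true pairs against its distribution for false pairs.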

20
Distance constraint
  • Besides auto-regulation, it has been noticed in
    many cases that TFs and the genes they regulate
    are near each other in the genome
  • Distance constraint between the TF and its
    closest binding site in the genome
  • Dmin_self is the distance between a TF gene and
    its closest binding site in the genome
  • Dmin_cross is the distance between a TF gene and
    the closest binding site for a different TF
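
A small Python sketch of the distance calculation. Coordinates are hypothetical genome positions in base pairs. The optional `genome_length` argument measures distance around a circular chromosome, which is an assumption about how such a distance might be handled and not something the slides state.

```python
def d_min(tf_position, site_positions, genome_length=None):
    """Distance from a TF gene to the closest binding site in a list.

    If genome_length is given, distances wrap around a circular genome.
    """
    best = None
    for p in site_positions:
        d = abs(tf_position - p)
        if genome_length is not None:
            d = min(d, genome_length - d)
        if best is None or d < best:
            best = d
    return best

# Hypothetical positions: sites of the TF's own motif vs. another TF's motif.
own_sites = [200, 1500, 9900]
other_sites = [4000, 7000]
print(d_min(1000, own_sites))    # Dmin_self
print(d_min(1000, other_sites))  # Dmin_cross
```

Passing the binding sites of the TF's own motif gives Dmin_self; passing the sites of a different TF's motif gives Dmin_cross.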

21
The phylogenetic correlation
  • TFs and their regulated genes tend to evolve
    concurrently
  • Connect TFs and DNA motifs through correlation
    between their occurrences in a comparative
    analysis of multiple species
  • Two types of phylogenetic correlation (PC)
    distributions
  • PC for true TF-DNA motif pairs
  • PC for false TF-DNA motif pairs
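
The slides do not specify how the phylogenetic correlation is measured. As an illustration only, the sketch below uses the Pearson correlation between two presence/absence profiles across species. The function name and the profiles are hypothetical.

```python
import math

def phylogenetic_correlation(tf_profile, motif_profile):
    """Pearson correlation between two 0/1 presence profiles over the same species."""
    n = len(tf_profile)
    mean_x = sum(tf_profile) / n
    mean_y = sum(motif_profile) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(tf_profile, motif_profile))
    var_x = sum((x - mean_x) ** 2 for x in tf_profile)
    var_y = sum((y - mean_y) ** 2 for y in motif_profile)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

# Hypothetical profiles: 1 = present in that species, 0 = absent.
tf_present = [1, 1, 0, 0, 1]
motif_present = [1, 1, 0, 1, 1]
print(phylogenetic_correlation(tf_present, motif_present))
```

Computing this value for known true pairs and for mismatched pairs would give the two distributions mentioned on the slide.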

22
Binding specificity constraint
  • TFs that are more similar to one another are
    expected to bind sites that are more similar to
    each other than the sites of dissimilar TFs
  • Distribution of average similarity scores for
    motifs from the same family and from different
    families

23
Conclusions
  • Hypothesize that information concerning the
    connection of a TF to its DNA motif is carried in
    the genome sequences
  • TFs and their binding sites are often in similar
    genomic locations (Dmin information)
  • TFs tend to evolve concurrently with their
    regulated genes (PC information)
  • TFs from the same structural family tend to have
    similar DNA motifs (FMC information)

24
Functional determinants of transcription factors
in Escherichia coli: protein families and binding
sites
  • M. Madan Babu and Sarah A. Teichmann
  • MRC Laboratory of Molecular Biology, Hills Road,
    Cambridge CB2 2QH, UK

25
Introduction
  • DNA-binding transcription factors regulate
    expression of genes near to where they bind
  • These factors can be activators or repressors of
    transcription, or both
  • A fundamental question is what determines whether
    a transcription factor acts as an activator or a
    repressor?
  • Protein-protein contacts
  • Position of the DNA-binding domain in the protein
    primary sequence
  • Altered DNA structure
  • Position of its binding site on the DNA relative
    to the transcription start site
  • This work suggests that, in general, in E. coli, a
    transcription factor's protein family is not
    indicative of its regulatory function, but the
    position of its binding site on the DNA is

26
Domain Architectures for different TFs
28
Conclusions
  • Activators, repressors and dual regulators in E.
    coli belong to many of the same protein families
    and share some domain architectures
  • A transcription factor's regulatory role is not
    determined by protein structure or evolutionary
    relationships
  • Transcription factors have evolved by duplication
    of an ancestral transcription factor, followed by
    a change in function through a shift in binding
    sites
  • A transcription factor's regulatory role is
    determined to a large extent simply by the
    position of the transcription factor binding site
  • Activators have essentially only upstream binding
    sites
  • More than two thirds of repressors have at least
    one downstream binding site
