Title: 5.1: Gene Regulation and Promoter Analysis
15.1 Gene Regulation and Promoter Analysis
- Wyeth Wasserman
- Centre for Molecular Medicine and Therapeutics
- Children's and Women's Hospital
- Department of Medical Genetics
- University of British Columbia
www.cisreg.ca
2Overview
- 5.1.0 Bioinformatics for detection of transcription factor binding sites
  - The Specificity Problem
- 5.1.1 Discrimination of regulatory control sequences
  - Based on knowledge of established TFBS
- 5.1.2 Discovery of regulatory mechanisms
  - Based on de novo pattern discovery
- 5.1.3 Impending advances
3Layers of Complexity in Metazoan Transcription
4Transcription Simplified
URF
Pol-II
TATA
URE
55.1.0 Profile Models for Prediction of TF Binding
Sites
6Representing Binding Sites for a TF
Set of binding sites:
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
- A set of sites represented as a consensus
- VDRTWRWWSHD (IUPAC degenerate DNA)
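A minimal sketch of how a set of aligned sites can be tallied into a position frequency matrix (PFM). Only a subset of the listed sites is used, and the function name `position_frequency_matrix` is an assumption made for illustration.

```python
# A subset of the 10-bp sites listed on the slide (chosen for brevity).
sites = [
    "AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA",
    "GAGTTAATAA", "CAGTTATTCA", "AGATTAAAGA", "AAGTAAATGA",
]

def position_frequency_matrix(sites):
    """Count each base at each aligned position (one row per base)."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for i, base in enumerate(site):
            pfm[base][i] += 1
    return pfm

pfm = position_frequency_matrix(sites)
for base in "ACGT":
    print(base, pfm[base])
```

Each column of the result sums to the number of sites, which is the depth that later weighting steps rely on.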
7PFMs to PWMs
One would like to add the following features to the model:
1. Correcting for the base frequencies in DNA
2. Weighting for the confidence (depth) in the pattern
3. Converting to log-scale probability for easy arithmetic

f matrix
A  5  0  1  0  0
C  0  2  2  4  0
G  0  3  1  0  4
T  0  0  1  1  1

w matrix
A   1.6  -1.7  -0.2  -1.7  -1.7
C  -1.7   0.5   0.5   1.3  -1.7
G  -1.7   1.0  -0.2  -1.7   1.3
T  -1.7  -1.7  -0.2  -0.2  -0.2

$w(b,i) = \log\left(\frac{f(b,i) + s(N)}{p(b)}\right)$

Here f(b,i) is the entry for base b at position i, p(b) is the background frequency of base b, and s(N) is a correction that depends on the number of sites N.
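A hedged sketch of the PFM-to-PWM conversion. The slide does not spell out s(N) or the log base, so the choices here are assumptions: a pseudocount of sqrt(N) shared out by the background frequency, a uniform background of 0.25, and log base 2. Under these assumptions the rounded output agrees with the w matrix above.

```python
import math

# Count (f) matrix from the slide: one row per base, one column per position.
f = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}

def pfm_to_pwm(f, background=None):
    """Convert counts to log2-odds weights (pseudocount scheme is an assumption)."""
    background = background or {b: 0.25 for b in "ACGT"}
    width = len(f["A"])
    pwm = {b: [] for b in "ACGT"}
    for i in range(width):
        n = sum(f[b][i] for b in "ACGT")   # depth of this column
        root_n = math.sqrt(n)              # assumed total pseudocount
        for b in "ACGT":
            corrected = (f[b][i] + root_n * background[b]) / (n + root_n)
            pwm[b].append(math.log2(corrected / background[b]))
    return pwm

w = pfm_to_pwm(f)
for b in "ACGT":
    print(b, [round(v, 1) for v in w[b]])
```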
8JASPAR OPEN-ACCESS DATABASE OF TF BINDING
PROFILES (Some other databases with TF profiles
include Transfac, TRRD, mPromDB, SCPD (yeast),
dbTBS and EcoTFS (bacteria))
9Performance of Profiles
- 95% of predicted sites bound in vitro (Tronche 1997)
- MyoD binding sites predicted about once every 600 bp (Fickett 1995)
- Futility Theorem
  - Nearly 100% of predicted TFBS have no function in vivo
  - Brazma claims it should be called the futility conjecture
101000bp promoter screened with collection of TF
profiles (beta-globin)
115.1.1 Pattern Discrimination
- Overcoming the specificity problem by
incorporating biological knowledge into
computational algorithms
12Phylogenetic Footprinting
- 70,000,000 years of evolution reveals most
regulatory regions.
13SIDENOTE Global Progressive Alignments (e.g. ORCA, AVID, LAGAN)
- Global alignments: memory requirement scales with the product of the sequence lengths
- Progressive alignment by banding with local alignments (e.g. BLAST) and running the global method on banded sub-segments
- Recursion with decreasingly stringent parameters
14Phylogenetic Footprinting to Identify Functional
Segments
[Figure: percent identity vs. 200 bp window start position (human sequence)]
Actin gene compared between human and mouse.
15Phylogenetic Footprinting (cont)
[Figure: FoxC2, percent identity (0-100) vs. start position of 200 bp window]
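The identity plots can be approximated with a sliding window over a pairwise alignment. This is a minimal sketch, assuming the two sequences are already globally aligned, gapped with "-", and of equal length. Windows are indexed by alignment column rather than by human sequence position, which is a simplification; `window_identity` is a hypothetical helper name.

```python
def window_identity(aln_a, aln_b, window=200, step=1):
    """Percent identity per window over two equal-length aligned strings."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 where both rows carry the same base; gap columns count as mismatches.
    matches = [int(a == b and a != "-")
               for a, b in zip(aln_a.upper(), aln_b.upper())]
    results = []
    for start in range(0, len(matches) - window + 1, step):
        identity = 100.0 * sum(matches[start:start + window]) / window
        results.append((start, identity))
    return results

# Toy alignment with a short window, for illustration only.
profile = window_identity("ACGTACGT", "ACGTTCGT", window=4)
print(profile)
```

Regions where the identity stays above a chosen cutoff would be the candidates retained by a conservation filter.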
16Recall...
171000bp beta-globin promoter screened with
phylogenetic footprinting
18Choosing the right species... Genes evolve at different rates: make a gene-specific choice
[Figure: pairwise comparisons, CHICKEN vs. HUMAN, MOUSE vs. HUMAN, COW vs. HUMAN]
19Performance: Human vs. Mouse Pairwise
[Figure: SELECTIVITY and SENSITIVITY]
- Testing set: 40 experimentally defined sites in 15 well-studied genes (replicated with 100-site set)
- 85-95% of defined sites detected with conservation filter, while only 11-16% of total predictions retained
20ConSite
Now driven by the ORCA Aligner
21Selected Emerging Issues
- Multiple sequence comparisons
  - Incorporate phylogenetic distances into a scoring metric
  - Visualization (see dcode service and Sockeye)
- Analysis of many closely related species
  - Phylogenetic shadowing
- Genome rearrangements
  - Inversion compatible alignment algorithm
  - LAGAN
- Higher order models of TFBS
22Regulatory Modules for better specificity
- TFs do NOT act in isolation
23Layers of Complexity in Metazoan Transcription
24Liver regulatory modules
25PSSMs for Liver TFs
HNF3
HNF1
HNF4
C/EBP
26Detection of Clusters of TFBS
- In the best cases, we have enough data to train a discriminant function
  - Rare to have sufficient data
- Alternatively, identify dense clusters of sites that are statistically significant
  - Diverse methods have been introduced over the past few years (Berman, Markstein, Frith, Noble, Wagner)
- Non-trivial to correct for non-random properties of DNA
  - Most difficulty comes from local direct repeats
- A primary challenge from the biological side is the selection of a meaningful grouping of TFs
  - Multiple testing problems are severe
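As a rough illustration of the "dense clusters" idea, the sketch below reports windows that hold at least a minimum number of predicted sites. It only counts hits; it does not compute statistical significance or correct for repeats, both of which the methods cited above address. The function name and thresholds are hypothetical.

```python
def dense_windows(hit_positions, window=200, min_hits=3):
    """Return (first_pos, last_pos, count) for runs of hits within `window` bp."""
    positions = sorted(hit_positions)
    clusters = []
    left = 0
    for right, pos in enumerate(positions):
        # Shrink from the left until the span fits inside the window.
        while pos - positions[left] >= window:
            left += 1
        count = right - left + 1
        if count >= min_hits:
            clusters.append((positions[left], pos, count))
    return clusters

# Hypothetical start positions of predicted sites along one sequence.
clusters = dense_windows([10, 50, 120, 900, 950], window=200, min_hits=3)
print(clusters)
```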
27TFBS Clusters (MSCAN, MCAST, COMET, etc.)
- MSCAN allows users to submit any set of TF profiles
- Calculates significance for each site based on local sequence characteristics
  - A G-rich PSSM gets less weight in a G-rich region of a gene
- Calculates cluster significance using a dynamic programming approach
- Approximately 1 significant liver cluster per 18,000 bp in human genome sequence
- Filters to remove significant clusters of sites that contain local repeats
- Identification of non-random characteristics in DNA
http://mscan.cgb.ki.se
28Training predictive models for modules
- Not every combination of sites is meaningful
- Reality: some factors are critical, others secondary
- An alternative is to teach the computer which combinations are better
- Limited by small size of positive training set
- Explore an older method based on Logistic Regression Analysis
29Recall Liver regulatory modules
30Logistic Regression Analysis
a1 a2 a3 a4
Optimize a vector to maximize the distance between output values for positive and negative training data. Output value is:

$p(x) = \frac{e^{\mathrm{logit}}}{1 + e^{\mathrm{logit}}}, \qquad \mathrm{logit} = \sum_i a_i x_i$
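A small sketch of the logistic output value, assuming the logit is a weighted sum of per-factor scores x_i with weights a_i. The weights and scores below are hypothetical numbers, not trained values.

```python
import math

def logistic_output(weights, scores):
    """p(x) = e^logit / (1 + e^logit), with logit = sum of a_i * x_i."""
    logit = sum(a * x for a, x in zip(weights, scores))
    # Algebraically equal form that avoids overflow for large positive logits.
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical weights a1..a4 and best site scores for four liver TFs in one window.
a = [0.8, 0.5, 0.3, 0.6]
x = [2.1, 0.0, 1.4, 3.0]
print(round(logistic_output(a, x), 3))
```

Training would adjust the weights so that outputs for positive windows and negative windows are as far apart as possible.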
31PERFORMANCE
- Liver (Genome Research, 2001)
- At 1 hit per 35 kbp, identifies 60% of modules
- Limited to genes expressed late in liver
development
32UDPGT1 (Gilbert's Syndrome)
[Figure: liver module model score vs. window position in sequence, wildtype and mutant]
33Making better predictions
- Profiles make far too many false predictions to have predictive value in isolation
- Phylogenetic footprinting eliminates about 90% of false predictions while retaining 70% of real sites (human vs. mouse)
- Detection of clusters of binding sites offers better predictive performance, especially through trained discrimination functions
34Active Issues
- Significance of clusters of sites
  - Segmentation of DNA into regions of different composition
- Methods using training to find clusters
  - Where to place weights?
  - Interaction weighting in the absence of large data collections
- Resources
  - Limited number of solid PSSMs
  - Need a reference database for functional regulatory regions
  - Validation of predictions for tissues/cells not well represented in cell culture
35EMERGING APPLICATION: Regulatory Analysis of Variation in ENhancers
- Genetic variation in TFBS can result in
biomedically important phenotypes
36Sequence Variation in TFBS
URF
TSS
AaGT
37Stage 1: Prediction of Regulatory Regions
38Stage 1 Predict Regulatory Regions
- Retrieve orthologous human and mouse gene sequences
- Align sequences with a global aligner (ORCA)
- Identify regions of conservation
- Design primers for SNP discovery
[Figure: FoxC2 conservation plot, identity axis 0-100]
39SIDENOTE Data/Orthology obtained from GeneLynx
(www.genelynx.org)
40Stage 2: Analysis of Polymorphisms
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
41Identify variations that generate allele-specific
binding sites (predicted)
Differences in scores
Pseudo-data for instructional purposes
1234567890123456789012345
ACGCATAAGTTAAtGAATAACAGAT
.............c...........
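A hedged sketch of the allele comparison: score both alleles with the same profile and report the difference in best score. The PWM is built from a subset of the 10-bp sites shown earlier in the deck, using an assumed sqrt(N) pseudocount and uniform background. Only the forward strand is scanned, and the helper names are hypothetical.

```python
import math

# Subset of the 10-bp sites from the "Representing Binding Sites" slide.
SITES = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAATAA", "AAGTTAACGA",
         "AAATTAATGA", "AAGTTGATGA", "ATGTTAATGA", "AAGTTAATCA"]

def build_pwm(sites, bg=0.25):
    """Log2-odds column dicts; the pseudocount scheme is an assumption."""
    n = len(sites)
    pseudo = math.sqrt(n) * bg
    pwm = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        pwm.append({b: math.log2((column.count(b) + pseudo) / (n + math.sqrt(n)) / bg)
                    for b in "ACGT"})
    return pwm

def best_score(seq, pwm):
    """Highest-scoring window on the forward strand."""
    w = len(pwm)
    return max(sum(pwm[i][seq[k + i]] for i in range(w))
               for k in range(len(seq) - w + 1))

allele_t = "ACGCATAAGTTAATGAATAACAGAT"
allele_c = "ACGCATAAGTTAACGAATAACAGAT"
pwm = build_pwm(SITES)
delta = best_score(allele_t, pwm) - best_score(allele_c, pwm)
print(round(delta, 2))  # positive: the t allele scores higher with this profile
```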
42RAVEN screenshots
435.1.2 Discovery of Mediating TFBS for Sets of
Co-Regulated Genes
- Finding characteristics over-represented in a set
of co-regulated genes
44Pattern Discovery
45Linking co-expressed genes from microarrays to
candidate TFs
46oPOSSUM Project
- A significant subset of TFs are represented by existing binding profiles
- Within the same structural class, binding specificity is often retained (more on this later)
- Can we link known TFs to a putative regulon by over-representation of predicted binding sites in promoters?
- Identical concept to the detection of over-represented GO terms from previous session
47oPOSSUM Procedure
48Reference Gene Sets
Fisher p-values: p<1e-05, p<1e-02
49MICROARRAY APPLICATION: NF-kB inhibitor-sensitive genes (326)
p<1e-30, p<1e-10, p<1e-05, p<1e-02
50oPOSSUM Server
51Over-represented Site Combinations (Kreiman 2004)
- Based on our understanding of CRMs, it is likely that combinations of sites would be more distinguished than individual sites (better signal-to-noise?)
- Kreiman has introduced a system to assess clusters of neighbouring conserved sites based on counting
- Hypergeometric distribution: simply compare the frequency of the cluster occurrence vs. expectation
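The counting comparison can be sketched as a hypergeometric tail probability. This is a generic illustration, not the published procedure of any tool named here, and the gene counts in the example are hypothetical.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k): chance of seeing at least k genes with the site (or cluster)
    among n co-regulated genes, when K of all N background genes contain it."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical numbers: 12 of 50 co-regulated genes carry the site,
# versus 400 of 10000 genes in the background set.
p_value = hypergeom_tail(12, 50, 400, 10000)
print(p_value)
```

A small tail probability indicates that the site occurs in the gene set more often than expected from the background frequency.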
52What if the TFBS is novel?
53de novo Pattern Discovery Methods
- String-based
  - e.g. Moby Dick (Bussemaker, Li & Siggia)
  - Identify over-represented oligomers in comparison of + and - (or complete) promoter collections
- Profile-based
  - Monte Carlo/Gibbs Sampling
  - e.g. AnnSpec (Workman & Stormo)
  - Identify strong patterns in promoter collection vs. background model of expected sequence characteristics
54String-based Exhaustive Methods
Word-based methods: How likely are X words in a set of sequences, given sequence characteristics?
CCCGCCGGAATGAAATCTGATTGACATTTTCC
>EP71002 (+) CeIV msp-56 B range -100 to -75
TTCAAATTTTAACGCCGGAATAATCTCCTATT
>EP63009 (+) Ce Cuticle Col-12 range -100 to -75
TCGCTGTAACCGGAATATTTAGTCAGTTTTTG
>EP63010 (+) Ce Cuticle Col-13 range -100 to -75
TATCGTCATTCTCCGCCTCTTTTCTT
>EP11013 (+) Ce vitellogenin 2 range -100 to -75
GCTTATCAATGCGCCCGGAATAAAACGCTATA
>EP11014 (+) Ce vitellogenin 5 range -100 to -75
CATTGACTTTATCGAATAAATCTGTT
>EP11015 (-) Ce vitellogenin 4 range -100 to -75
ATCTATTTACAATGATAAAACTTCAA
>EP11016 (+) Ce vitellogenin 6 range -100 to -75
ATGGTCTCTACCGGAAAGCTACTTTCAGAATT
>EP11017 (+) Ce calmodulin cal-2 range -100 to -75
TTTCAAATCCGGAATTTCCACCCGGAATTACT
>EP63007 (-) Ce cAMP-dep. PKR P1 range -100 to -75
TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC
>EP63008 (+) Ce cAMP-dep. PKR P2 range -100 to -75
ACTGAACTTGTCTTCAAATTTCAACACCGGAA
>EP17012 (+) Ce hsp 16K-1 A range -100 to -75
TCAATGCCGGAATTCTGAATGTGAGTCGCCCT
>EP55011 (-) Ce hsp 16K-1 B range
55Exhaustive methods(2)
Find all words of length 7 in the yeast genome
GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGC
GTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGACGGTAAGAAGATCA
CTTCTAACCAAAGAATTGTTGCTGCTTTGCCAACCATCAAGTACGTTTTG
GAACACCACCCAAGATACGTTGTCTTGTTCTCACTTGGGTAGACCAAACG
GTGAAAGAAACGAAAAATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAA
TCATTGTTGGGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGA
AGTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTTTGTTGG
AAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGAAAGGTCGATGGTC
AAAAGGTCAAGGCTCAAGGAAGATGTTCAAAAGTTCAGACACGAATTGAG
CTCTTTGGCTGATGTTTACATCACGATGCCTTCGGTACCGCTCACAGAGC
TCACTCTTCTATGGTCGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTG
TTGGAAAAGGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAG
ACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAGATTCAAT
TGATTGACAACTTGTTGGACAAGGTCGACTCTATCATCATTGGTGGTGGT
ATGGCTTTCCCTTCAAGAAGGTTTTGGAAAACACTGAAATCGGTGACTCC
ATCTTCGACAAGGCTGGTGCTGAAATCGTTCCAAAGTTGATGGAAAAGGC
CAAGGCCAAGGGTGTCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGA
TGCTTTCTCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGTA
TTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCTAGAAAGTGT
TTGCTGCTACTGTTGCAAAGGCTAAGACCATTGTCTGGAACGGTCCACCA
GGTGTTTTCGAATTCGAAAAGTTCGCTGCTGGTACTAAGGCTTTGTTAGA
CGAAGTTGTCAAGAGCTCTGCTGCTGGTAACACCGTCATCATTGGTGGTG
GTGACACTGCCA
Make a lookup table:
TTTTTTTT/aaaaaaa 57788
GATAGGCA/tgcctatc 589
AAACCTTT/aaaggttt 456
Etc...
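A minimal sketch of such a lookup table: count every word of length k and pool each word with its reverse complement, as the paired entries above suggest. The pooling rule (keep the alphabetically smaller of the pair as the key) is an assumption chosen for illustration.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    return word.translate(COMPLEMENT)[::-1]

def count_words(sequence, k):
    """Count all k-mers, pooling each word with its reverse complement."""
    counts = Counter()
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):          # skip words with ambiguous bases
            rc = reverse_complement(word)
            counts[min(word, rc)] += 1        # one key per word/complement pair
    return counts

table = count_words("AAAATTTT", 4)
print(table.most_common())
```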
56Exhaustive methods(3)
Over-representation: How many words of type AGGAGTGA are found in our sequences?
How likely is this result?
57Exhaustive methods(4)
Modeling Properties of DNA
Simple: How likely are single nucleotides? (extended Bernoulli)
Complex: Neglect certain words; Locations of TFBS; Higher-order descriptions of DNA
58Exhaustive methods Key items
- Algorithms with high complexity
  - Large sequences and/or many possible word lengths not possible
- Often string-based
  - TFBS are not words (fuzzy binding)
- Sensitivity susceptible to noisy input data (e.g. microarrays)
59Profile-based Methods (usually probabilistic)
Find a local alignment of width x of sites that maximizes information content in reasonable time.
Usually by Gibbs sampling or EM methods.
Motivations:
- TFBS are not words
- Efficiency
- Can be intentionally influenced by biological data
60Profile Methods (2)
The Gibbs Sampling algorithm
Example sites: tgacttcc, tgatctct, agacctca, tgacctct
Two data structures are used:
1) Current pattern nucleotide frequencies q_{i,1}, ..., q_{i,4} and corresponding background frequencies p_{i,1}, ..., p_{i,4}
2) Current positions of site start points in the N sequences a_1, ..., a_N, i.e. the alignment that contributes to q_{i,j}.
One starting point in each sequence is chosen randomly initially.
61Profile Methods (3)
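Iteration step: Remove one sequence z from the set. Update the current pattern according to the update formula.
[Figure: sequence z set aside from the sites tgacttcc, tgatctct, agacctca, tgacctct; the update formula is annotated with "Pseudocount for symbol j" and "Sum of all pseudocounts in column"]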
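A sketch of one Gibbs sampling iteration, under stated assumptions. The pattern update uses a commonly cited form, q_{i,j} = (c_{i,j} + b_j) / (N - 1 + B), where c_{i,j} counts symbol j at pattern position i over the remaining sequences, b_j is the pseudocount for symbol j and B is the sum of all pseudocounts in the column. The equal pseudocount of 0.5, the toy sequences and the function names are illustrative choices, not taken from the slides.

```python
import random

BASES = "ACGT"

def pattern_frequencies(seqs, starts, width, skip, pseudo=0.5):
    """q[i][j] for motif position i and base j, leaving out sequence `skip`."""
    total_pseudo = pseudo * len(BASES)   # B: sum of all pseudocounts in a column
    n = len(seqs) - 1                    # N - 1 sequences contribute
    q = []
    for i in range(width):
        counts = {b: 0 for b in BASES}
        for s, (seq, a) in enumerate(zip(seqs, starts)):
            if s != skip:
                counts[seq[a + i]] += 1
        q.append({b: (counts[b] + pseudo) / (n + total_pseudo) for b in BASES})
    return q

def resample_start(seq, q, background, rng):
    """Draw a start for the held-out sequence, weighted by pattern/background odds."""
    width = len(q)
    weights = []
    for a in range(len(seq) - width + 1):
        odds = 1.0
        for i in range(width):
            odds *= q[i][seq[a + i]] / background[seq[a + i]]
        weights.append(odds)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def gibbs_step(seqs, starts, width, z, background, rng):
    q = pattern_frequencies(seqs, starts, width, skip=z)
    starts[z] = resample_start(seqs[z], q, background, rng)
    return starts

# Toy input: the example sites padded with one flanking base on each side.
seqs = ["TTGACTTCCA", "ATGATCTCTG", "CAGACCTCAT", "GTGACCTCTA"]
starts = [0, 0, 0, 0]
background = {b: 0.25 for b in BASES}
rng = random.Random(0)
for sweep in range(20):
    for z in range(len(seqs)):
        gibbs_step(seqs, starts, 8, z, background, rng)
print(starts)
```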
62Pattern Discovery Across Orthologous Promoters
from Gram-Positive Bacteria
Real sets
random
63EXAMPLE: Yeast Regulatory Sequence Analysis (YRSA) system
64Tests of YRSA System
65SIDENOTE Comparison of profiles requires alignment and a scoring function
- Scoring function based on sum of squared differences
- Align frequency matrices with modified Needleman-Wunsch algorithm
- Calculate empirical p-values based on simulated set of matrices
66How is the Performance? Hit and Miss
67Applied Pattern Discovery is Acutely Sensitive to
Noise
True Mef2 Binding Sites
68Overcoming the sensitivity challenge
- Metazoan genomes are far from ideal
69Biochemical complexity enables greater complexity
in regulation
70Four Approaches to Improve Sensitivity
- Better background models
  - Higher-order properties of DNA
- Phylogenetic Footprinting
  - Human-mouse comparison eliminates 75% of sequence
- Regulatory Modules
  - Architectural rules
- Limit the types of binding profiles allowed
  - TFBS patterns are NOT random
71Phylogenetic Footprinting to Identify Conserved
Regions
Bayes Block Aligner (Lawrence Group)
ORCA
72Skeletal Muscle Genes
- One of the most extensively studied tissues for transcriptional regulation
  - 45 genes partially analyzed
  - 26 genes with orthologous genomic sequence from human and rodent
- Five primary classes of transcription factors
  - Principal: Myf (myoD), Mef2, SRF
  - Secondary: Sp1 (G/C rich patches), Tef (subset of skeletal muscle types)
73de novo Discovery of Skeletal Muscle
Transcription Factor Binding Sites
Mef2-Like
SRF-Like
Myf-Like
74Pattern discovery methods using biochemical
constraints
75RECALL Gibbs Algorithm
z
tgacttcc
tgatctct
agacctca
tgacttcc
tgacctct
tgatctct
agacctca
tgacctct
76(No Transcript)
77Intra-family PSSM similarity
TF Database (JASPAR)
COMPARE
Jackknife Test: 87% correct
Independent Test Set: 93% correct
78(No Transcript)
79FBPs enhance sensitivity of pattern detection
80(No Transcript)
81APPLICATION: Cancer Protection Response
- Detoxification-related enzymes are induced by compounds present in broccoli
- Arrays, SSH and hard work have defined a set of responsive genes
- A known element mediates the response (Antioxidant Responsive Element)
- Controversy over the type of mediating leucine zipper TF
  - NF-E2/Maf or Jun/Fos
82Application (2)
Problem: Given a set of co-regulated genes, determine the common TFBS. Classify the mediating TF. We expect a leucine zipper-type TF.
83Application (3)
84Application (4)
85EMERGING METHOD: de novo Analysis of Regulatory Modules
86Focus on regulatory modules for pattern detection
Cluster Genes by Expression
87Analyze co-regulated genes to define circuit
characteristics
[Figure: General Circuit Properties (Binding Profiles; Neighbor Interactions; Width Distributions (Sum of Separations)) and Specific Gene Features]
88Discovery performance
- Approximately 50% of annotated TFBS are detected in the training set sequences of 25 genes
- Only 40% of predicted TFBS are annotated
- We suspect that most of the un-annotated sites will turn out to be functional. This needs to be determined.
89Review of Primary Points
90Regulatory regions problem space
Sets of binding sites:
AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG AATCACAC AATCATCA
AATCTCAC AATCTCTG AGTCCCCA AATCCCGG AATCTGAG AATCCATA ATTCAGCC AATAACTT
GATAACCT AATTAGAC GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA
Specificity profiles for binding sites:
A  -2      0      -2      -0.415   0.585  -2      -2       2.088  -2      -2      -1       0.585
C   1      0.585   0       0      -1      -2      -2      -2       2.088  -2       0.585   0.807
G   0.585  0.322   0.807   1.585   1      -2       2      -2      -2       2.088  -2       0
T   0.319  0.322   1      -2       0       2.088  -1      -2      -2      -2       1.459  -0.415
Clusters of binding sites
Transcription factors Transcription factor
binding sites Regulatory nucleotide sequences
91Detecting binding sites in a single sequence
Scanning a sequence against a PWM
Sp1
Abs_score 13.4 (sum of column scores)
Is 93 better than 82?
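A hedged sketch of scanning a sequence against a PWM. The absolute score is the sum of column scores. The relative score is assumed here to be (score - min) / (max - min), one common way to put sites from different profiles on a comparable 0-1 scale. The matrix is the five-column w matrix from the PFM-to-PWM slide, and the test sequence is hypothetical.

```python
# Log-odds (w) matrix from the "PFMs to PWMs" slide, one list per base.
W = {
    "A": [1.6, -1.7, -0.2, -1.7, -1.7],
    "C": [-1.7, 0.5, 0.5, 1.3, -1.7],
    "G": [-1.7, 1.0, -0.2, -1.7, 1.3],
    "T": [-1.7, -1.7, -0.2, -0.2, -0.2],
}

def scan(sequence, pwm, threshold=0.8):
    """Return (start, site, abs_score, rel_score) for windows at or above threshold.
    Assumes the sequence contains only A, C, G and T; forward strand only."""
    width = len(pwm["A"])
    max_score = sum(max(pwm[b][i] for b in "ACGT") for i in range(width))
    min_score = sum(min(pwm[b][i] for b in "ACGT") for i in range(width))
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width].upper()
        abs_score = sum(pwm[b][i] for i, b in enumerate(window))
        rel_score = (abs_score - min_score) / (max_score - min_score)
        if rel_score >= threshold:
            hits.append((start, window, round(abs_score, 2), round(rel_score, 2)))
    return hits

hits = scan("TTAGCCGTT", W)
print(hits)
```

A relative score says how close a site is to the best possible match for that profile, which is why two percentages from different profiles are not directly comparable as measures of biological relevance.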
92Phylogenetic Footprints
Scanning a single sequence
Scanning a pair of orthologous sequences for
conserved patterns in conserved sequence regions
A dramatic improvement in the percentage of
biologically significant detections
- Low specificity of profiles
- too many hits
- great majority are not biologically significant
93Applied Pattern Discovery is Acutely Sensitive to
Noise
True Mef2 Binding Sites
94Acknowledgements
- Wasserman Group
- Wynand Alkema
- Dave Arenillas
- Jochen Brumm
- Alice Chou
- Shannan Ho Sui
- Danielle Kemmer
- Jonathan Lim
- Raf Podowski
- Dora Pak
- Albin Sandelin
- Chris Walsh
- Collaborating Trainees
- Malin Andersson (KTH)
- Öjvind Johansson (UCSD)
- Stuart Lithwick (U.Toronto)
Collaborators: Boris Lenhard (K.I.), Chip Lawrence (Wadsworth), William Thompson (Wadsworth), Jens Lagergren (KTH), Christer Höög (K.I.), Brenda Gallie (OCI), Jacob Odeberg (KTH), Niclas Jareborg (AZ), William Hayes (AZ), James Mortimer (MF)
Group Alumni: Elena Herzog, Annette Höglund, William Krivan, Luis Mendoza
Support: CIHR, CGDN, CFI, Merck-Frosst, BC Children's Hospital Foundation, Pharmacia, EC Marie Curie, KI-Funder
95EXTRA SLIDES: What will a computational biologist do with a scoring function?
96The matrix tree
97Compare with consensus for both classes - CANNTG