5.1: Gene Regulation and Promoter Analysis

Provided by: stephe78


1
5.1 Gene Regulation and Promoter Analysis

Wyeth Wasserman
Centre for Molecular Medicine and Therapeutics
Children's and Women's Hospital
Department of Medical Genetics
University of British Columbia

www.cisreg.ca
2
Overview

5.1.0 Bioinformatics for detection of
transcription factor binding sites
The Specificity Problem
5.1.1 Discrimination of regulatory control
sequences
Based on knowledge of established TFBS
5.1.2 Discovery of regulatory mechanisms
Based on de novo pattern discovery
5.1.3 Impending advances

3
Layers of Complexity in Metazoan Transcription
4
Transcription Simplified
URF
Pol-II
TATA
URE
5
5.1.0 Profile Models for Prediction of TF Binding
Sites
6
Representing Binding Sites for a TF
Set of binding sites
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA

A single site
AAGTTAATGA

A set of sites represented as a consensus
VDRTWRWWSHD (IUPAC degenerate DNA)

7
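PFMs to PWMs
One would like to add the following features to the model:
1. Correcting for the base frequencies in DNA
2. Weighting for the confidence (depth) in the pattern
3. Converting to log-scale probability for easy arithmetic

f matrix
     1     2     3     4     5
A    5     0     1     0     0
C    0     2     2     4     0
G    0     3     1     0     4
T    0     0     1     1     1

w matrix
     1     2     3     4     5
A   1.6  -1.7  -0.2  -1.7  -1.7
C  -1.7   0.5   0.5   1.3  -1.7
G  -1.7   1.0  -0.2  -1.7   1.3
T  -1.7  -1.7  -0.2  -0.2  -0.2

w(b,i) = \log_2 \left( \frac{f(b,i) + s(N)}{p(b)} \right)

where f(b,i) is the frequency of base b at position i, s(N) is a pseudocount term that depends on the number of sites N, and p(b) is the background frequency of base b.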
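The conversion from the f matrix to the w matrix can be sketched in Python. This is a minimal sketch under two assumptions that the slide does not state: the pseudocount term s(N) is taken as sqrt(N) shared across the four bases in proportion to the background, and the background p(b) is uniform (0.25). With those choices the f matrix above reproduces the w matrix to one decimal place.

```python
import math

# Position frequency matrix (raw counts) from the slide: 5 sites, 5 columns.
PFM = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}


def pfm_to_pwm(pfm, background=None):
    """Convert counts to log2-odds weights with a sqrt(N) pseudocount (assumed)."""
    bases = sorted(pfm)
    background = background or {b: 1.0 / len(bases) for b in bases}
    width = len(pfm[bases[0]])
    pwm = {b: [] for b in bases}
    for i in range(width):
        n = sum(pfm[b][i] for b in bases)        # depth of column i
        total_pseudo = math.sqrt(n)              # s(N): an assumed choice
        for b in bases:
            pseudo = total_pseudo * background[b]
            prob = (pfm[b][i] + pseudo) / (n + total_pseudo)
            pwm[b].append(math.log2(prob / background[b]))
    return pwm


PWM = pfm_to_pwm(PFM)
for base in "ACGT":
    print(base, " ".join(f"{w:5.1f}" for w in PWM[base]))
```

A deeper collection of sites shrinks the relative influence of the pseudocount, which is how the "confidence (depth)" correction enters the weights.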
8
JASPAR OPEN-ACCESS DATABASE OF TF BINDING
PROFILES (Some other databases with TF profiles
include Transfac, TRRD, mPromDB, SCPD (yeast),
dbTBS and EcoTFS (bacteria))
9
Performance of Profiles

95% of predicted sites bound in vitro (Tronche
1997)
MyoD binding sites predicted about once every 600
bp (Fickett 1995)
Futility Theorem
Nearly 100% of predicted TFBS have no function in
vivo
Brazma claims it should be called the futility
conjecture

10
1000bp promoter screened with collection of TF
profiles (beta-globin)
11
5.1.1 Pattern Discrimination

Overcoming the specificity problem by
incorporating biological knowledge into
computational algorithms

12
Phylogenetic Footprinting

70,000,000 years of evolution reveals most
regulatory regions.

13
SIDENOTE: Global Progressive Alignments (e.g.
ORCA, AVID, LAGAN)

Global alignments: memory scales with the product of
sequence lengths
Progressive alignment: band with local
alignments (e.g. BLAST), then run the global method
on the banded sub-segments
Recursion with decreasingly stringent parameters

14
Phylogenetic Footprinting to Identify Functional
Segments
[Plot: percent identity vs. start position of 200 bp window (human sequence). Actin gene compared between human and mouse.]
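A percent-identity profile of this kind can be computed from a pairwise global alignment with a sliding window. The following is a minimal sketch, not the method used to produce the plot: it assumes the two sequences are already aligned (equal-length rows with '-' for gaps), counts gap columns as mismatches, and reports window starts as alignment columns rather than human sequence coordinates. The function names and the 70% threshold are illustrative.

```python
def window_identity(aln_a, aln_b, window=200, step=1):
    """Percent identity in sliding windows over two rows of a pairwise alignment."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 where both rows carry the same base, 0 for mismatches and gap columns
    matches = [int(a == b and a != "-")
               for a, b in zip(aln_a.upper(), aln_b.upper())]
    profile = []
    for start in range(0, len(matches) - window + 1, step):
        identity = 100.0 * sum(matches[start:start + window]) / window
        profile.append((start, identity))
    return profile


def conserved_windows(profile, threshold=70.0):
    """Start positions of windows at or above an identity threshold (assumed cutoff)."""
    return [start for start, identity in profile if identity >= threshold]


# Toy alignment with a short window so the result is easy to follow.
human = "ACGTACGTAC"
mouse = "ACGTTCGTAC"
toy_profile = window_identity(human, mouse, window=5)
print(toy_profile)
print(conserved_windows(toy_profile, threshold=90.0))
```

Because genes evolve at different rates, a fixed threshold is a simplification; a gene-specific cutoff would replace the constant here.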
15
Phylogenetic Footprinting (cont)
[Plot: FoxC2, percent identity (0 to 100) vs. start position of 200 bp window]
16
Recall...
17
1000bp beta-globin promoter screened with
phylogenetic footprinting
18
Choosing the right species...
Genes evolve at different rates: make a gene-specific choice
[Panels: chicken vs. human; mouse vs. human; cow vs. human]
19
Performance Human vs. Mouse Pairwise
SELECTIVITY
SENSITIVITY

Testing set: 40 experimentally defined sites in
15 well-studied genes (replicated with a 100-site
set)
85-95% of defined sites detected with
conservation filter, while only 11-16% of total
predictions retained

20
ConSite
Now driven by the ORCA Aligner
21
Selected Emerging Issues

Multiple sequence comparisons
Incorporate phylogenetic distances into a scoring
metric
Visualization (see dcode service and Sockeye)
Analysis of many closely related species
Phylogenetic shadowing
Genome rearrangements
Inversion compatible alignment algorithm
LAGAN
Higher order models of TFBS

22
Regulatory Modules for better specificity

TFs do NOT act in isolation

23
Layers of Complexity in Metazoan Transcription
24
Liver regulatory modules
25
PSSMs for Liver TFs
HNF3
HNF1
HNF4
C/EBP
26
Detection of Clusters of TFBS

In the best cases, we have enough data to train a
discriminant function
Rare to have sufficient data
Alternatively, identify dense clusters of sites
that are statistically significant
Diverse methods have been introduced over the
past few years (Berman, Markstein, Frith, Noble,
Wagner)
Non-trivial to correct for non-random properties
of DNA
Most difficulty comes from local direct repeats
A primary challenge from the biological side is
the selection of a meaningful grouping of TFs
Multiple testing problems severe

27
TFBS Clusters (MSCAN, MCAST, COMET, etc.)

MSCAN allows users to submit any set of TF
profiles
Calculates significance for each site based on
local sequence characteristics
G-rich PSSM gets less weight on G-rich region of
gene
Calculates cluster significance using a dynamic
programming approach
Approximately 1 significant liver cluster per
18,000 bp in human genome sequence
Filters to remove significant clusters of sites
that contain local repeats
Identification of non-random characteristics in
DNA

http://mscan.cgb.ki.se
28
Training predictive models for modules

Not every combination of sites is meaningful
Reality: some factors are critical, others secondary
An alternative is to teach the computer which
combinations are better
Limited by small size of positive training set
Explore an older method based on Logistic
Regression Analysis

29
Recall Liver regulatory modules
30
Logistic Regression Analysis
Weight vector: a1 a2 a3 a4
Optimize a vector to maximize the distance
between output values for positive and negative
training data. Output value is
p(x) = \frac{e^{\mathrm{logit}}}{1 + e^{\mathrm{logit}}}, \qquad \mathrm{logit} = \sum_i a_i x_i
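The logistic output and the optimization of the weight vector can be sketched as follows. This is a minimal sketch, not the published liver model: it assumes each input x_i is a per-factor site score for a sequence window, adds an intercept term, and fits the weights by plain gradient ascent on the log-likelihood. The training scores are hypothetical values invented for illustration.

```python
import math


def module_probability(weights, scores, intercept=0.0):
    """p(x) = e^logit / (1 + e^logit), with logit = intercept + sum(a_i * x_i)."""
    z = intercept + sum(a * x for a, x in zip(weights, scores))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)                      # avoids overflow for very negative z
    return e / (1.0 + e)


def fit(samples, labels, steps=5000, rate=0.1):
    """Gradient ascent on the log-likelihood (an assumed optimizer)."""
    weights, intercept = [0.0] * len(samples[0]), 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(samples, labels):
            err = y - module_probability(weights, x, intercept)
            grad_b += err
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
        intercept += rate * grad_b / len(samples)
        weights = [w + rate * g / len(samples) for w, g in zip(weights, grad_w)]
    return weights, intercept


# Hypothetical scores for four factors (e.g. HNF1, HNF3, HNF4, C/EBP) per window.
positives = [[0.9, 0.8, 0.7, 0.6], [0.8, 0.9, 0.6, 0.7], [0.7, 0.7, 0.9, 0.8]]
negatives = [[0.2, 0.3, 0.1, 0.4], [0.3, 0.1, 0.2, 0.2], [0.1, 0.2, 0.3, 0.1]]
weights, intercept = fit(positives + negatives, [1, 1, 1, 0, 0, 0])
for window in positives + negatives:
    print(round(module_probability(weights, window, intercept), 3))
```

With so few positive examples the fitted weights are poorly constrained, which is the small-training-set limitation noted for this approach.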
31
PERFORMANCE

Liver (Genome Research, 2001)
At 1 hit per 35 kbp, identifies 60% of modules
Limited to genes expressed late in liver
development

32
UDPGT1 (Gilbert's Syndrome)
Wildtype Mutant
Liver Module Model Score
Window Position in Sequence
33
Making better predictions

Profiles make far too many false predictions to
have predictive value in isolation
Phylogenetic footprinting eliminates about 90% of
false predictions while retaining 70-70% of real
sites (human vs mouse)
Detection of clusters of binding sites offers
better predictive performance, especially through
trained discrimination functions

34
Active Issues

Significance of clusters of sites
Segmentation of DNA into regions of different
composition
Methods using training to find clusters
Where to place weights?
Interaction weighting in the absence of large
data collections
Resources
Limited number of solid PSSMs
Need a reference database for functional
regulatory regions
Validation of predictions for tissues/cells not
well represented in cell culture

35
EMERGING APPLICATION: Regulatory Analysis of
Variation in ENhancers

Genetic variation in TFBS can result in
biomedically important phenotypes

36
Sequence Variation in TFBS
URF
TSS
AaGT
37
Stage 1: Prediction of Regulatory Regions
38
Stage 1 Predict Regulatory Regions

Retrieve orthologous human and mouse gene
sequences
Align sequences with a global aligner (ORCA)
Identify regions of conservation
Design primers for SNP discovery

[Plot: FoxC2, percent identity (0 to 100)]
39
SIDENOTE Data/Orthology obtained from GeneLynx
(www.genelynx.org)
40
Stage 2: Analysis of Polymorphisms
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
41
Identify variations that generate allele-specific
binding sites (predicted)
Differences in scores
Pseudo-data for instructional purposes
1234567890123456789012345
ACGCATAAGTTAAtGAATAACAGAT
.............c...........
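The score difference between the two alleles can be illustrated by scoring both sequences against a weight matrix. This is a minimal sketch on the same pseudo-data: the matrix is built from the aligned sites listed on the "Representing Binding Sites for a TF" slide, using an assumed pseudocount of 1 per base and a uniform background, and only the forward strand is scanned. The function names are illustrative.

```python
import math

# Aligned sites from the "Representing Binding Sites for a TF" slide.
SITES = """
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
""".split()


def build_pwm(sites, pseudo=1.0, bg=0.25):
    """Log2-odds matrix as a list of per-column dicts (pseudocount is assumed)."""
    pwm = []
    for column in zip(*sites):
        total = len(column) + 4 * pseudo
        pwm.append({b: math.log2((column.count(b) + pseudo) / total / bg)
                    for b in "ACGT"})
    return pwm


def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring forward-strand window."""
    width = len(pwm)
    return max(
        (sum(col[base] for col, base in zip(pwm, seq[s:s + width])), s)
        for s in range(len(seq) - width + 1)
    )


pwm = build_pwm(SITES)
allele_t = "ACGCATAAGTTAATGAATAACAGAT"
allele_c = "ACGCATAAGTTAACGAATAACAGAT"
score_t, start_t = best_hit(pwm, allele_t)
score_c, start_c = best_hit(pwm, allele_c)
print(f"T allele: {score_t:.2f} at offset {start_t}")
print(f"C allele: {score_c:.2f} at offset {start_c}")
print(f"difference: {score_t - score_c:.2f}")
```

A large positive difference flags the variant as a candidate allele-specific site; whether it matters in vivo still needs experimental confirmation.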
42
RAVEN screenshots
43
5.1.2 Discovery of Mediating TFBS for Sets of
Co-Regulated Genes

Finding characteristics over-represented in a set
of co-regulated genes

44
Pattern Discovery
45
Linking co-expressed genes from microarrays to
candidate TFs
46
oPOSSUM Project

A significant subset of TFs are represented by
existing binding profiles
Within same structural class, often binding
specificity retained (more on this later)
Can we link known TFs to a putative regulon by
over-representation of predicted binding sites in
promoters?
Identical concept to the detection of
over-represented GO terms from previous session
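The over-representation test behind this idea can be sketched with a one-sided Fisher exact test, computed here directly from the hypergeometric distribution. This is a minimal sketch and not the oPOSSUM implementation: it assumes the co-regulated set is a subset of the background gene collection and that each gene is scored simply as having or lacking a predicted site. The gene counts in the usage lines are hypothetical.

```python
from math import comb


def fisher_overrepresentation(hits_in_set, set_size,
                              hits_in_background, background_size):
    """One-sided p-value: chance of at least hits_in_set genes carrying the site
    when set_size genes are drawn without replacement from the background."""
    total = comb(background_size, set_size)
    max_hits = min(set_size, hits_in_background)
    tail = sum(
        comb(hits_in_background, k)
        * comb(background_size - hits_in_background, set_size - k)
        for k in range(hits_in_set, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 12 of 40 co-expressed genes carry a predicted site,
# against 300 of 5000 genes in the background collection.
p_value = fisher_overrepresentation(12, 40, 300, 5000)
print(f"p = {p_value:.3g}")
```

When many TF profiles are tested against the same gene set, the resulting p-values need a multiple-testing correction before they are ranked.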

47
oPOSSUM Procedure
48
Reference Gene Sets
Fisher p-values: p < 1e-05, p < 1e-02
49
MICROARRAY APPLICATION: NF-kB Inhibitor-sensitive
genes (326)
p < 1e-30, p < 1e-10, p < 1e-05, p < 1e-02
50
oPOSSUM Server
51
Over-represented Site Combinations (Kreiman 2004)

Based on our understanding of CRMs, likely that
combinations of sites would be more distinguished
than individual sites (better signal-to-noise?)
Kreiman has introduced a system to assess
clusters of neighbouring conserved sites based on
counting
Hypergeometric distribution, simply compare the
frequency of the cluster occurrence vs.
expectation

52
What if the TFBS is novel?
53
de novo Pattern Discovery Methods

String-based
e.g. Moby Dick (Bussemaker, Li & Siggia)
Identify over-represented oligomers in comparison
of + and - (or complete) promoter collections
Profile-based
Monte Carlo/Gibbs Sampling
e.g. AnnSpec (Workman & Stormo)
Identify strong patterns in promoter
collection vs. background model of expected
sequence characteristics

54
String-based Exhaustive Methods
Word-based methods: How likely are X words in a
set of sequences, given sequence characteristics?
CCCGCCGGAATGAAATCTGATTGACATTTTCC >EP71002 (+) CeIV msp-56 B range -100 to -75
TTCAAATTTTAACGCCGGAATAATCTCCTATT >EP63009 (+) Ce Cuticle Col-12 range -100 to -75
TCGCTGTAACCGGAATATTTAGTCAGTTTTTG >EP63010 (+) Ce Cuticle Col-13 range -100 to -75
TATCGTCATTCTCCGCCTCTTTTCTT >EP11013 (+) Ce vitellogenin 2 range -100 to -75
GCTTATCAATGCGCCCGGAATAAAACGCTATA >EP11014 (+) Ce vitellogenin 5 range -100 to -75
CATTGACTTTATCGAATAAATCTGTT >EP11015 (-) Ce vitellogenin 4 range -100 to -75
ATCTATTTACAATGATAAAACTTCAA >EP11016 (+) Ce vitellogenin 6 range -100 to -75
ATGGTCTCTACCGGAAAGCTACTTTCAGAATT >EP11017 (+) Ce calmodulin cal-2 range -100 to -75
TTTCAAATCCGGAATTTCCACCCGGAATTACT >EP63007 (-) Ce cAMP-dep. PKR P1 range -100 to -75
TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC >EP63008 (+) Ce cAMP-dep. PKR P2 range -100 to -75
ACTGAACTTGTCTTCAAATTTCAACACCGGAA >EP17012 (+) Ce hsp 16K-1 A range -100 to -75
TCAATGCCGGAATTCTGAATGTGAGTCGCCCT >EP55011 (-) Ce hsp 16K-1 B range
55
Exhaustive methods(2)
Find all words of length 7 in the yeast genome
GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGC
GTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGACGGTAAGAAGATCA
CTTCTAACCAAAGAATTGTTGCTGCTTTGCCAACCATCAAGTACGTTTTG
GAACACCACCCAAGATACGTTGTCTTGTTCTCACTTGGGTAGACCAAACG
GTGAAAGAAACGAAAAATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAA
TCATTGTTGGGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGA
AGTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTTTGTTGG
AAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGAAAGGTCGATGGTC
AAAAGGTCAAGGCTCAAGGAAGATGTTCAAAAGTTCAGACACGAATTGAG
CTCTTTGGCTGATGTTTACATCACGATGCCTTCGGTACCGCTCACAGAGC
TCACTCTTCTATGGTCGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTG
TTGGAAAAGGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAG
ACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAGATTCAAT
TGATTGACAACTTGTTGGACAAGGTCGACTCTATCATCATTGGTGGTGGT
ATGGCTTTCCCTTCAAGAAGGTTTTGGAAAACACTGAAATCGGTGACTCC
ATCTTCGACAAGGCTGGTGCTGAAATCGTTCCAAAGTTGATGGAAAAGGC
CAAGGCCAAGGGTGTCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGA
TGCTTTCTCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGTA
TTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCTAGAAAGTGT
TTGCTGCTACTGTTGCAAAGGCTAAGACCATTGTCTGGAACGGTCCACCA
GGTGTTTTCGAATTCGAAAAGTTCGCTGCTGGTACTAAGGCTTTGTTAGA
CGAAGTTGTCAAGAGCTCTGCTGCTGGTAACACCGTCATCATTGGTGGTG
GTGACACTGCCA
Make a lookup table
TTTTTTTT/aaaaaaa 57788
GATAGGCA/tgcctatc 589
AAACCTTT/aaaggttt 456
Etc...
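A lookup table of this kind can be built in a few lines. This is a minimal sketch that assumes each word is pooled with its reverse complement, as the paired entries in the table suggest, and that words containing symbols other than A, C, G and T are skipped. The function names are illustrative, and the example sequence is the first line of the block above.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Reverse complement of an upper-case DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def count_words(sequence, k=7):
    """Count every k-mer, pooling each word with its reverse complement."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        word = sequence[i:i + k]
        if set(word) <= set("ACGT"):         # skip words containing N etc.
            # store under the alphabetically smaller of the two strands
            counts[min(word, reverse_complement(word))] += 1
    return counts


fragment = "GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGC"
table = count_words(fragment, k=7)
for word, n in table.most_common(3):
    print(f"{word}/{reverse_complement(word).lower()} {n}")
```

The table grows as 4^k entries at most, which is why exhaustive counting becomes impractical for long words.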
56
Exhaustive methods(3)
Over-representation: How many words of type
AGGAGTGA are found in our sequences?
How likely is this result?
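One way to answer "how likely is this result?" is a binomial tail probability under the simplest background model. This is a minimal sketch with stated simplifications: bases are treated as independent (a single-nucleotide Bernoulli model), overlapping occurrences of the word are ignored, and only one strand is counted. The observed count and number of scanned positions in the usage lines are hypothetical.

```python
from math import comb


def word_probability(word, base_freqs):
    """Probability of the word at one position if bases are independent."""
    p = 1.0
    for base in word:
        p *= base_freqs[base]
    return p


def binomial_tail(observed, positions, p):
    """P(X >= observed) for X ~ Binomial(positions, p), via the complement."""
    below = sum(
        comb(positions, k) * p ** k * (1.0 - p) ** (positions - k)
        for k in range(observed)
    )
    return max(0.0, 1.0 - below)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p_word = word_probability("AGGAGTGA", uniform)
# Hypothetical numbers: the word is seen 4 times in 20,000 scanned positions.
print(f"expected count: {20000 * p_word:.3f}")
print(f"P(at least 4 occurrences): {binomial_tail(4, 20000, p_word):.3g}")
```

Computing the tail as one minus the lower sum keeps the arithmetic within floating-point range for large sequence sets, at the cost of precision for extremely small p-values.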
57
Exhaustive methods(4)
Modeling Properties of DNA
Simple: How likely are single nucleotides?
(extended Bernoulli)
Complex: Neglect certain words; Locations of TFBS;
Higher-order descriptions of DNA
58
Exhaustive methods Key items

Algorithms with high complexity: large sequences
and/or many possible word lengths are not feasible
Often string-based
TFBS are not words (fuzzy binding)
Sensitivity is susceptible to noisy input data (e.g.
microarrays)

59
Profile-based Methods (usually probabilistic)
Find a local alignment of width x of sites that
maximizes information content in reasonable
time
Usually by Gibbs sampling or EM methods
Motivations
TFBS are not words
Efficiency
Can be intentionally influenced by biological data
60
Profile Methods (2)
The Gibbs Sampling algorithm
tgacttcc
tgatctct
agacctca
tgacctct
Two data structures used:
1) Current pattern nucleotide frequencies q_{i,1}, ..., q_{i,4} and
corresponding background frequencies p_{i,1}, ..., p_{i,4}
2) Current positions of site startpoints in the N sequences a_1, ..., a_N, i.e. the
alignment that contributes to q_{i,j}. One starting
point in each sequence is chosen randomly
initially.
61
Profile Methods (3)
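Iteration step: Remove one sequence z from the
set. Update the current pattern according to
q_{i,j} = \frac{c_{i,j} + b_j}{N - 1 + B}
where c_{i,j} is the count of symbol j at pattern position i in the remaining sequences,
b_j is the pseudocount for symbol j, and
B is the sum of all pseudocounts in the column.
z
A
tgacttcc
tgatctct
agacctca
tgacctct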
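The iteration step can be sketched as a small site sampler. This is a minimal sketch under simplifying assumptions: one site per sequence, a fixed pattern width, a pseudocount of 0.5 per base, and a single background distribution estimated from the overall base composition rather than from the non-site positions. After a sequence z is removed and the pattern is rebuilt from the others, a new start in z is drawn with probability proportional to the pattern-to-background likelihood ratio. Function names and parameter values are illustrative.

```python
import random

BASES = "ACGT"


def column_freqs(seqs, starts, width, skip, pseudo=0.5):
    """Pattern frequencies from all sequences except `skip`:
    (count + pseudocount) / (N - 1 + sum of pseudocounts)."""
    n_used = len(seqs) - 1
    total_pseudo = pseudo * len(BASES)
    q = []
    for i in range(width):
        counts = {b: pseudo for b in BASES}
        for s, (seq, a) in enumerate(zip(seqs, starts)):
            if s != skip:
                counts[seq[a + i]] += 1
        q.append({b: counts[b] / (n_used + total_pseudo) for b in BASES})
    return q


def gibbs_sample(seqs, width, iterations=200, seed=0, pseudo=0.5):
    """Return one sampled start position per sequence (upper-case ACGT input)."""
    rng = random.Random(seed)
    joined = "".join(seqs)
    # background from overall composition (a simplification)
    p = {b: (joined.count(b) + pseudo) / (len(joined) + 4 * pseudo) for b in BASES}
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for z in range(len(seqs)):
            q = column_freqs(seqs, starts, width, skip=z, pseudo=pseudo)
            weights = []
            for a in range(len(seqs[z]) - width + 1):
                ratio = 1.0
                for i in range(width):
                    base = seqs[z][a + i]
                    ratio *= q[i][base] / p[base]
                weights.append(ratio)
            # sample the new start in proportion to the likelihood ratios
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


seqs = ["TGACTTCC", "TGATCTCT", "AGACCTCA", "TGACCTCT"]
starts = gibbs_sample(seqs, width=4, iterations=50, seed=1)
for seq, a in zip(seqs, starts):
    print(seq, a, seq[a:a + 4])
```

Because the new start is sampled rather than chosen greedily, repeated runs with different seeds can settle on different alignments, which is one reason results are usually compared across several runs.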
62
Pattern Discovery Across Orthologous Promoters
from Gram-Positive Bacteria
Real sets
random
63
EXAMPLE: Yeast Regulatory Sequence Analysis (YRSA)
system
64
Tests of YRSA System
65
SIDENOTE Comparison of profiles requires
alignment and a scoring function

Scoring function based on sum of squared
differences
Align frequency matrices with modified
Needleman-Wunsch algorithm
Calculate empirical p-values based on simulated
set of matrices
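The scoring function can be illustrated with a simplified comparison of two frequency matrices. This is a minimal sketch that keeps the sum-of-squared-differences column score but swaps the modified Needleman-Wunsch alignment for a plain ungapped search over offsets, and it omits the empirical p-value step. Profiles are lists of per-column frequency dictionaries; the function names and the minimum overlap of 3 columns are assumptions.

```python
def ssd(col_a, col_b):
    """Sum of squared differences between two frequency columns."""
    return sum((col_a[b] - col_b[b]) ** 2 for b in "ACGT")


def best_ungapped_alignment(prof_a, prof_b, min_overlap=3):
    """Return (mean SSD, offset, overlap) for the best ungapped alignment.
    Column i of prof_a is paired with column i - offset of prof_b."""
    best = None
    for offset in range(-(len(prof_b) - min_overlap), len(prof_a) - min_overlap + 1):
        pairs = [
            (prof_a[i], prof_b[i - offset])
            for i in range(len(prof_a))
            if 0 <= i - offset < len(prof_b)
        ]
        if len(pairs) < min_overlap:
            continue
        mean_ssd = sum(ssd(a, b) for a, b in pairs) / len(pairs)
        if best is None or mean_ssd < best[0]:
            best = (mean_ssd, offset, len(pairs))
    return best


def one_hot(word):
    """Toy profile in which each column is a single fixed base."""
    return [{b: 1.0 if b == base else 0.0 for b in "ACGT"} for base in word]


print(best_ungapped_alignment(one_hot("ACGTA"), one_hot("CGT")))
```

An empirical p-value would then come from repeating the comparison against a simulated set of matrices and asking how often a score at least this good appears.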

66
How is the Performance? Hit and Miss
67
Applied Pattern Discovery is Acutely Sensitive to
Noise
True Mef2 Binding Sites
68
Overcoming the sensitivity challenge

Metazoan genomes are far from ideal

69
Biochemical complexity enables greater complexity
in regulation
70
Four Approaches to Improve Sensitivity

Better background models
-Higher-order properties of DNA
Phylogenetic Footprinting
Human-mouse comparison eliminates 75% of
sequence
Regulatory Modules
Architectural rules
Limit the types of binding profiles allowed
TFBS patterns are NOT random

71
Phylogenetic Footprinting to Identify Conserved
Regions
Bayes Block Aligner (Lawrence Group)
ORCA
72
Skeletal Muscle Genes

One of the most extensively studied tissues for
transcriptional regulation
45 genes partially analyzed
26 genes with orthologous genomic sequence from
human and rodent
Five primary classes of transcription factors
Principal: Myf (myoD), Mef2, SRF
Secondary: Sp1 (G/C rich patches), Tef (subset of
skeletal muscle types)

73
de novo Discovery of Skeletal Muscle
Transcription Factor Binding Sites
Mef2-Like
SRF-Like
Myf-Like
74
Pattern discovery methods using biochemical
constraints
75
RECALL Gibbs Algorithm
z
tgacttcc
tgatctct
agacctca
tgacttcc
tgacctct
tgatctct
agacctca
tgacctct
76
(No Transcript)
77
Intra-family PSSM similarity
TF Database (JASPAR)
COMPARE
Jackknife Test: 87% correct
Independent Test Set: 93% correct
78
(No Transcript)
79
FBPs enhance sensitivity of pattern detection
80
(No Transcript)
81
APPLICATION: Cancer Protection Response

Detoxification-related enzymes are induced by
compounds present in Broccoli
Arrays, SSH and hard work have defined a set of
responsive genes
A known element mediates the response
(Antioxidant Responsive Element)
Controversy over the type of mediating leucine
zipper TF
NF-E2/Maf or Jun/Fos

82
Application (2)
Problem: Given a set of co-regulated genes,
determine the common TFBS. Classify the
mediating TF. We expect a leucine zipper-type
TF.
83
Application (3)
Problem: Given a set of co-regulated genes,
determine the common TFBS. Classify the
mediating TF. We expect a leucine zipper-type
TF.
84
Application (4)
Problem: Given a set of co-regulated genes,
determine the common TFBS. Classify the
mediating TF. We expect a leucine zipper-type
TF.
85
EMERGING METHOD: de novo Analysis of Regulatory
Modules
86
Focus on regulatory modules for pattern detection
Cluster Genes by Expression
87
Analyze co-regulated genes to define circuit
characteristics
[Figure: General Circuit Properties and Specific Gene Features. Panels: Binding Profiles (m_i); Neighbor Interactions (a_ij, between m_i and m_j); Width Distributions (Sum of Separations; axis labels 0, b, 250)]
88
Discovery performance

Approximately 50% of annotated TFBS are detected
in the training set sequences of 25 genes
Only 40% of predicted TFBS are annotated
We suspect that most of the un-annotated sites
will turn out to be functional. This needs to be
determined.

89
Review of Primary Points

Second Chance

90
Regulatory regions problem space
Sets of binding sites
AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG AATCACAC AATCATCA
AATCTCAC AATCTCTG AGTCCCCA AATCCCGG AATCTGAG AATCCATA ATTCAGCC AATAACTT
GATAACCT AATTAGAC GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA
Specificity profiles for binding sites
A  -2      0      -2      -0.415   0.585  -2      -2      2.088  -2      -2      -1      0.585
C   1      0.585   0       0      -1      -2      -2     -2       2.088  -2       0.585  0.807
G   0.585  0.322   0.807   1.585   1      -2       2     -2      -2       2.088  -2      0
T   0.319  0.322   1      -2       0       2.088  -1     -2      -2      -2       1.459 -0.415
Clusters of binding sites
Transcription factors Transcription factor
binding sites Regulatory nucleotide sequences
91
Detecting binding sites in a single sequence
Scanning a sequence against a PWM
Sp1
Abs_score: 13.4 (sum of column scores)
Is 93% better than 82%?
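Scanning a sequence against a PWM, and the relation between absolute and percentage scores, can be sketched as follows. This is a minimal sketch, not the Sp1 example from the slide: it reuses the small w matrix from the "PFMs to PWMs" slide, scans the forward strand only, and assumes the percentage is a relative score that rescales the absolute score between the lowest and highest values the matrix can produce. The 80% threshold and the test sequence are illustrative.

```python
# Weight matrix from the "PFMs to PWMs" slide, one dict per column.
PWM = [
    {"A": 1.6, "C": -1.7, "G": -1.7, "T": -1.7},
    {"A": -1.7, "C": 0.5, "G": 1.0, "T": -1.7},
    {"A": -0.2, "C": 0.5, "G": -0.2, "T": -0.2},
    {"A": -1.7, "C": 1.3, "G": -1.7, "T": -0.2},
    {"A": -1.7, "C": -1.7, "G": 1.3, "T": -0.2},
]


def scan(pwm, seq, min_relative=80.0):
    """Return (start, window, absolute score, relative score in %) per hit."""
    lowest = sum(min(col.values()) for col in pwm)
    highest = sum(max(col.values()) for col in pwm)
    hits = []
    for start in range(len(seq) - len(pwm) + 1):
        window = seq[start:start + len(pwm)]
        # absolute score: sum of the column scores for the observed bases
        abs_score = sum(col[base] for col, base in zip(pwm, window))
        relative = 100.0 * (abs_score - lowest) / (highest - lowest)
        if relative >= min_relative:
            hits.append((start, window, round(abs_score, 2), round(relative, 1)))
    return hits


hits = scan(PWM, "TTAGCCGTT")
print(hits)
```

A relative score makes hits from matrices of different widths comparable, but it says nothing about how often such a score arises by chance, which is the point behind asking whether 93% is really better than 82%.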
92
Phylogenetic Footprints
Scanning a single sequence
Scanning a pair of orthologous sequences for
conserved patterns in conserved sequence regions
A dramatic improvement in the percentage of
biologically significant detections

Low specificity of profiles
too many hits
great majority are not biologically significant

93
Applied Pattern Discovery is Acutely Sensitive to
Noise
True Mef2 Binding Sites
94
Acknowledgements

Wasserman Group
Wynand Alkema
Dave Arenillas
Jochen Brumm
Alice Chou
Shannan Ho Sui
Danielle Kemmer
Jonathan Lim
Raf Podowski
Dora Pak
Albin Sandelin
Chris Walsh
Collaborating Trainees
Malin Andersson (KTH)
Öjvind Johansson (UCSD)
Stuart Lithwick (U.Toronto)

Collaborators
Boris Lenhard (K.I.)
Chip Lawrence (Wadsworth)
William Thompson (Wadsworth)
Jens Lagergren (KTH)
Christer Höög (K.I.)
Brenda Gallie (OCI)
Jacob Odeberg (KTH)
Niclas Jareborg (AZ)
William Hayes (AZ)
James Mortimer (MF)

Group Alumni
Elena Herzog
Annette Höglund
William Krivan
Luis Mendoza

Support
CIHR, CGDN, CFI, Merck-Frosst, BC Children's Hospital Foundation, Pharmacia, EC Marie Curie, KI-Funder
95
EXTRA SLIDES: What will a computational biologist
do with a scoring function?