Title: Canadian Bioinformatics Workshops
3Canadian Bioinformatics Workshops 2009
Module 3: Inferring Regulatory Mechanisms Governing Sets of Genes
Wyeth W. Wasserman
University of British Columbia
www.cisreg.ca
4Module 3 Overview
- Part 1: Overview of transcription
- Lab 3.1: Promoters in Genome Browser (UCSC)
- Part 2: Prediction of transcription factor binding sites using binding profiles (Discrimination)
- Lab 3.2: TFBS scan (Footer)
- Part 3: Interrogation of sets of co-expressed genes to identify mediating transcription factors
- Lab 3.3: TFBS Over-Representation (oPOSSUM)
- Part 4: Detection of novel motifs (TFBS) over-represented in regulatory regions of co-expressed genes (Discovery)
- Lab 3.4: Motif Discovery (MEME/Motif-Compare)
5Restrictions in Coverage
- Focus on eukaryotic cells and Pol II promoters
  - Principles apply to prokaryotes
  - Will provide suggestions for similar tools for other species
- Many of the examples are drawn from my lab's work; there are many equivalent tools (links to be provided)
6Part 1: Introduction to transcription in eukaryotic cells
7Transcription Over-Simplified
- Three-step Process
  - TF binds to TFBS (DNA)
  - TF catalyzes recruitment of polymerase II complex
  - Production of RNA from transcription start site (TSS)
[Figure labels: TF, Pol-II, TATA, TFBS, TSS]
8Anatomy of Transcriptional Regulation
WARNING: Terms vary widely in meaning between scientists
[Figure labels: Core Promoter/Initiation Region (Inr), TSR, Proximal Regulatory Region, Distal Regulatory Region, EXON, TATA, TFBS]
- Core Promoter: sufficient for initiation of transcription; orientation dependent
- TSR: transcription start region
  - Refers to a region rather than a specific start site (TSS)
- TFBS: single transcription factor binding site
- Regulatory Regions
  - Proximal/Distal: vague reference to distance from TSR
  - May be positive (enhancing) or negative (repressing)
  - Orientation independent (generally)
- Modules: sets of TFBS within a region that function together
- Transcriptional Unit
  - DNA sequence transcribed as a single polycistronic mRNA
9Complexity in Transcription
[Figure labels: Chromatin, Distal enhancer (x2), Proximal enhancer, Core Promoter]
10Lab Discovery of TF Binding Sites
[Figure: LUCIFERASE reporter constructs, one carrying a mutation, plotted against Reporter Gene Activity (0 to 100)]
Identify functional regulatory region within a sequence and delineate specific TFBS through mutagenesis (and in vitro binding studies)
11EMSA/Gel Shift Assays to Identify Binding Proteins
[Figure labels: TF + DNA; DNA]
http://www.biomedcentral.com/content/figures/1741-7015-4-28-8.jpg
12High-throughput Methods
- SELEX
  - mix random ds DNA oligonucleotides with TF protein, recover TF-DNA complexes and sequence DNA
- Protein Binding Arrays (UniProbe Database)
  - prepare arrays with ds DNA attached, label protein with a fluorescent mark and observe DNA bound by protein
- ChIP
  - covalently link proteins to DNA in cell, shear DNA, recover protein-DNA complexes and identify DNA (PCR, array or sequencing)
13Promoters
- In most vertebrates the delineation of the transcription start position is not easy
  - cDNA often incomplete at 5' end
  - Multiple promoters for most human genes
- Referencing position relative to the initiation site is therefore not a good idea
  - But done almost uniformly in biological papers
- Translation start equally problematic
  - Can be in internal exon
  - Multiple ORF start positions common
- Importance of promoter proximal regions varies between species
  - Humans appear to have little enrichment for functional sequences near promoters; the vast regions to consider generally lead analysts to restrict the search to a region around the promoter(s), but the justification is not strong
  - Yeast and C. elegans have more compact regions, and promoter proximity can be a useful property to restrict analyses
14mRNA Caps for Mapping Initiation Sites
- The 5' end of mRNAs has a cap structure that can be precipitated with an antibody
  - Allows for large-scale sequencing of full-length cDNAs and tags derived from the 5' end of mRNAs
  - RIKEN is the leading generator of such sequences
  - Not well represented in genome annotation resources (unfortunately)
http://departments.oxy.edu/biology/Stillman/bi221/111300/26_18a.GIF
15Classes of Initiation Regions
[Figure axes: CAGE Cap Tags per Position vs. Position]
This is over-simplified; see the paper for greater detail. The take-home message is that promoters are not drawn from a single continuous distribution of properties, but rather from at least two classes.
Image from Carninci P, et al. (2006). Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet. Apr 28. PMID 16645617
16CpG Islands
- DNA methylation occurs in competition with histone acetylation
  - Acetylation promotes open chromatin structure that is permissive for TF binding to DNA
  - Methylation of DNA inhibits histone acetylation
  - Certain TFs promote histone acetylation by recruiting acetylases
- Methylation occurs on cytosines
  - Preferentially on cytosines adjacent to guanines (CG dinucleotides, generally referred to as CpG)
  - Methylated cytosines frequently undergo deamination to form thymine (CpG -> TpG)
- CpG Islands are regions of DNA where CG dinucleotides occur at a frequency consistent with C and G mononucleotide frequencies
  - Highlight regions of active transcription
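The CpG island definition above compares the observed CG dinucleotide count with what the C and G mononucleotide frequencies would predict. A minimal sketch of that observed/expected ratio follows; the function name is hypothetical, and the formula (CpG count x length / (C count x G count)) is the commonly used form rather than anything stated on the slide.

```python
def cpg_observed_expected(seq):
    """Return the observed/expected CpG ratio for a DNA sequence.

    Expected CpG count assumes C and G occur independently:
    expected = (#C * #G) / length, so ratio = #CG * length / (#C * #G).
    """
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if c_count == 0 or g_count == 0:
        return 0.0  # no CpG is possible without both C and G
    return cpg_count * n / (c_count * g_count)


if __name__ == "__main__":
    for s in ("CGCGCGCG", "CCGG", "CATGCATGTTGGCCAA"):
        print(s, round(cpg_observed_expected(s), 2))
```

A ratio near 1 means CG occurs about as often as the base composition predicts, which is the island signature described above; bulk vertebrate DNA, depleted by the CpG -> TpG decay, sits well below 1.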
17CpG Islands (2)
- Important to recognize that promoters selectively active after early development will not be acetylated (and hence will be methylated) in the cell divisions preceding the establishment of germ cells and therefore will not have CpG islands
- Lists of genes that have higher or lower CpG frequencies than average can misleadingly appear to have TF binding motifs based on this compositional characteristic
  - CpG Island bias in a gene set can mislead an analyst to think that there are patterns of TFBS (patterns with internal CG for island-rich and TG for island-poor sets)
18Additional Topics
- Chromatin modification studies are making great strides
  - Signatures indicative of active regulatory sequences, such as H3K4me3
  - Co-activator (p300) ChIP study suggests possibility to read off regulatory regions
- No methods currently address 3D properties of the nucleus (in the long run this will be necessary)
19Section 3.1: What have we learned?
- Transcription controlled by regulatory regions
- Regulatory regions can be distant from initiation regions
- Laboratory methods can identify regulatory regions and TF binding sites
- Concept of single initiation site is flawed
- Promoters fall into subclasses
  - CpG vs TATA
  - Can impact assessment of TFBS in sets of genes
20Questions?
- Please, please, please . . .
- ASK QUESTIONS
- . . . now is a great chance.
22Part 2: Prediction of TF Binding Sites
Teaching a computer to find TFBS
23Representing Binding Sites for a TF
Set of binding sites:
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA
GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA
AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA
AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA
AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA
AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA
AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
- A set of sites represented as a consensus
  - VDRTWRWWSHD (IUPAC degenerate DNA)
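A consensus string discards how often each base occurs at each position. The alternative is to keep the counts as a position frequency matrix (PFM). The sketch below builds one from the first five sites listed above; `build_pfm` is a hypothetical helper name, not part of any named tool.

```python
def build_pfm(sites):
    """Count each base at each position of a set of equal-length sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for position, base in enumerate(site.upper()):
            pfm[base][position] += 1
    return pfm


# First five sites from the slide's list (illustrative subset only).
sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA", "GAGTTAATAA"]
pfm = build_pfm(sites)

if __name__ == "__main__":
    for base in "ACGT":
        print(base, pfm[base])
```

Each column of the matrix sums to the number of sites, so positions with one dominant base (such as the fourth column here, all T) are immediately visible.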
24Conversion of PFMs to Position Specific Scoring Matrices (PSSM)
PSSMs are also known as Position Weight Matrices (PWMs)
Add the following features to the matrix profile:
1. Correct for nucleotide frequencies in genome
2. Weight for the confidence (depth) in the pattern
3. Convert to log-scale probability for easy arithmetic

pfm
A    5    0    1    0    0
C    0    2    2    4    0
G    0    3    1    0    4
T    0    0    1    1    1

pssm
A    1.6  -1.7  -0.2  -1.7  -1.7
C   -1.7   0.5   0.5   1.3  -1.7
G   -1.7   1.0  -0.2  -1.7   1.3
T   -1.7  -1.7  -0.2  -0.2  -0.2

$$\log\left(\frac{f(b,i) + s(n)}{p(b)}\right)$$
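The three steps can be sketched in a few lines. The slide does not give the pseudocount function or the background, so the choices here are assumptions: a pseudocount of sqrt(N)/4 per base (N = number of sites), a uniform background p(b) = 0.25, and log base 2. With those choices the slide's pfm reproduces the slide's pssm values to one decimal place, but other settings are equally legitimate.

```python
import math

BASES = "ACGT"


def pfm_to_pssm(pfm, background=None):
    """Convert a count matrix to log-odds scores.

    Assumed (not stated on the slide): pseudocount sqrt(N)/4 per base,
    uniform background, log base 2.
    """
    background = background or {base: 0.25 for base in BASES}
    width = len(pfm["A"])
    pssm = {base: [] for base in BASES}
    for i in range(width):
        n_sites = sum(pfm[base][i] for base in BASES)
        pseudo = math.sqrt(n_sites) / 4  # weight for depth of the pattern
        for base in BASES:
            corrected = (pfm[base][i] + pseudo) / (n_sites + 4 * pseudo)
            # correct for background frequency, then move to log scale
            pssm[base].append(math.log2(corrected / background[base]))
    return pssm


# The pfm shown on the slide.
slide_pfm = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}
slide_pssm = pfm_to_pssm(slide_pfm)

if __name__ == "__main__":
    for base in BASES:
        print(base, [round(v, 1) for v in slide_pssm[base]])
```

The pseudocount keeps unobserved bases from scoring minus infinity, which matters most for shallow matrices like this five-site example.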
25PSSM Scoring Scales
- Raw scores
  - Sum of values from indicated cells of the matrix
- Relative Scores (most common)
  - Normalize the scores to range of 0-1 or 0-100
- Empirical p-values
  - Based on distribution of scores for some DNA sequence, determine a p-value (see next slide)
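As an illustration of raw and relative scores, the sketch below scans a sequence with the five-column pssm from the previous slide. The normalization (score - min) / (max - min) is one common convention and is an assumption here; the function names and the 0.8 default threshold are hypothetical.

```python
# PSSM values copied from the previous slide.
PSSM = {
    "A": [1.6, -1.7, -0.2, -1.7, -1.7],
    "C": [-1.7, 0.5, 0.5, 1.3, -1.7],
    "G": [-1.7, 1.0, -0.2, -1.7, 1.3],
    "T": [-1.7, -1.7, -0.2, -0.2, -0.2],
}
WIDTH = 5

# Best and worst possible raw scores, one cell per column.
MAX_SCORE = sum(max(PSSM[b][i] for b in PSSM) for i in range(WIDTH))
MIN_SCORE = sum(min(PSSM[b][i] for b in PSSM) for i in range(WIDTH))


def raw_score(word):
    """Sum the matrix cells indicated by the word, one per column."""
    return sum(PSSM[base][i] for i, base in enumerate(word))


def relative_score(word):
    """Rescale the raw score to the 0-1 range."""
    return (raw_score(word) - MIN_SCORE) / (MAX_SCORE - MIN_SCORE)


def scan(sequence, threshold=0.8):
    """Return (start, word, relative score) for every window above threshold."""
    hits = []
    for start in range(len(sequence) - WIDTH + 1):
        word = sequence[start:start + WIDTH]
        rel = relative_score(word)
        if rel >= threshold:
            hits.append((start, word, round(rel, 3)))
    return hits


if __name__ == "__main__":
    print(scan("TTAGCCGTT"))
```

Relative scores make thresholds comparable across matrices of different width and depth, which is probably why they are the most commonly reported scale.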
26Detecting binding sites in a single sequence
Raw Scores
- Sp1: Abs_score 13.4 (sum of column scores)
Empirical p-value Scores
- p-value = area to right of value / area under entire curve
[Figure axes: Frequency (0.0 to 0.3) vs. Relative Score (0.0 to 1.0)]
27JASPAR: AN OPEN-ACCESS DATABASE OF TF BINDING PROFILES (jaspar.genereg.net)
28The Good
- Tronche (1997) tested 50 predicted HNF1 TFBS using an in vitro binding test and found that 96% of the predicted sites were bound!
- Stormo and Fields (1998) found in detailed biochemical studies that the best weight matrices produce scores highly correlated with in vitro binding energy
[Figure axes: BINDING ENERGY vs. PSSM SCORE]
29the Bad
- Fickett (1995) found that a profile for the myoD TF made predictions at a rate of 1 per 500 bp of human DNA sequence
  - This corresponds to an average of 20 sites / gene (assuming 10,000 bp as average gene size)
30and the Ugly!
Human Cardiac α-Actin gene analyzed with a set of profiles (each line represents a TFBS prediction)
Futility Conjecture: TFBS predictions are almost always wrong
Red boxes are protein coding exons; TFBS predictions excluded in this analysis
31ADVANCED TOPICIssues of Column Independence
- PSSM model assumes independence between positions
  - For example, if you observe a G at position 2, the model assumes there is no influence on the likelihood of a T at position 3; this is known to be an incorrect assumption
- Other models can represent dependence
  - Hidden Markov models of Nth order, where N refers to the number of influencing positions
- For the cases where there are hundreds of TFBS known for a TF, there has been only modest improvement in the specificity of TFBS predictions using advanced column inter-dependent models
- The newly emerging ChIP-Seq data collections will ultimately lead to the systematic use of more advanced models (not likely to advance to wet labs for 3 years)
32A Conundrum
- Counter to intuition, the ratio of true positives to predictions fails to improve for stringent thresholds
  - For most predictive models this ratio would increase
- Why?
  - True binding sites are defined by properties not incorporated into the profile scores; above some threshold all sites could be bound if accessible
33Section 3.2A: What have we learned?
- PSSMs accurately reflect in vitro binding properties of DNA binding proteins
- Suitable binding sites occur at a rate far too frequent to reflect in vivo function
- Bioinformatics methods that use PSSMs for binding site studies must incorporate additional information to enhance specificity
  - Unfiltered predictions are too noisy for most applications
  - Organisms with short regulatory sequences are less problematic (e.g. yeast and bacteria)
34Using Phylogenetic Footprinting to Improve TFBS
Discrimination
- 70,000,000 years of evolution can reveal
regulatory regions
35Phylogenetic Footprinting
[Figure: FoxC2, a single-exon gene; Human-Mouse % Identity (0 to 100)]
- Align orthologous gene sequences (e.g. LAGAN)
- For the first window of 100 bp of sequence 1, determine the % with identical match in sequence 2
- Step across the first sequence, recording the percentage of identical nucleotides in each window
- Observe that the single exon contains a region of high identity that corresponds to the ORF, with lower identity in the 5' and 3' UTRs
- Additional conserved regions could be regulatory regions
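The windowing procedure above can be sketched as follows, assuming the two sequences have already been aligned by an external tool and are supplied as equal-length strings with "-" for gaps. The function name and the convention that a gap in the second sequence counts as a mismatch are assumptions for illustration.

```python
def windowed_identity(aligned1, aligned2, window=100, step=1):
    """Percent identity in windows stepped along sequence 1.

    Alignment columns where sequence 1 has a gap are skipped, so window
    coordinates refer to positions in sequence 1. A gap in sequence 2
    is counted as a mismatch.
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a.upper(), b.upper())
               for a, b in zip(aligned1, aligned2) if a != "-"]
    results = []
    for start in range(0, len(columns) - window + 1, step):
        chunk = columns[start:start + window]
        matches = sum(1 for a, b in chunk if a == b)
        results.append((start, 100.0 * matches / window))
    return results


if __name__ == "__main__":
    # Toy alignment with a tiny window so the output is easy to inspect.
    for start, identity in windowed_identity("ACGTACGT", "ACGTTCGA",
                                             window=4, step=4):
        print(start, identity)
```

Plotting the returned percentages against window start gives the kind of identity profile shown for FoxC2, where peaks outside the ORF are candidate regulatory regions.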
36Phylogenetic Footprinting (cont)
[Figure: % identity vs. 200 bp window start position (human sequence); actin gene compared between human and mouse]
37Multi-species Phylogenetic Footprinting
- PhastCons scores indicate the regions of DNA which are unusually conserved in their sequence across some subset of organisms
38Phylogenetic Footprints in UCSC Genome Browser
- PhastCons (regions score)
- PhyloP (position score)
39Phylogenetic Footprinting Dramatically Reduces
Spurious Hits
Actin, alpha cardiac
40TFBS Prediction with Human-Mouse Pairwise Phylogenetic Footprinting
[Figure axes: SELECTIVITY, SENSITIVITY]
- Testing set: 40 experimentally defined sites in 15 well-studied genes (replicated with a 100-site set)
- 75-80% of defined sites detected with conservation filter, while only 11-16% of total predictions retained
411kbp insulin receptor promoter screened with
footprinting
42Choosing the right species for pairwise
comparison...
CHICKEN
HUMAN
MOUSE
HUMAN
COW
HUMAN
43ConSite
44TFBS Discrimination Tools
- Phylogenetic Footprinting Servers
  - FOOTER http://biodev.hgen.pitt.edu/footer_php/Footerv2_0.php
  - CONSITE http://asp.ii.uib.no:8090/cgi-bin/CONSITE/consite/
  - rVISTA http://rvista.dcode.org/
  - ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk
- SNPs in TFBS Analysis
  - RAVEN http://burgundy.cmmt.ubc.ca/cgi-bin/RAVEN/a?rm=home
- Prokaryotes or Yeast
  - PRODORIC http://prodoric.tu-bs.de/
  - YEASTRACT http://www.yeastract.com/index.php
- Software Packages
  - TOUCAN http://homes.esat.kuleuven.be/~saerts/software/toucan.php
- Programming Tools
  - TFBS http://tfbs.genereg.net/
  - ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk
45Analysis of TFBS with Phylogenetic Footprinting
Scanning a single sequence
- Low specificity of profiles
  - too many hits
  - great majority not biologically significant
Scanning a pair of orthologous sequences for conserved patterns in conserved sequence regions
- A dramatic improvement in the percentage of biologically significant detections
46Section 3.2B: What have we learned?
- TFBS discrimination coupled with phylogenetic footprinting has greater specificity with tolerable loss of sensitivity
  - As with any purification process, some true binding sites will be lost
- Available online resources support phylogenetic footprinting
47Questions?
48Laboratory Exercise 3.2
- TF Binding Site Prediction
4920 minute break
- Until 10:50 am
- Next: Sections 3.3 and 3.4
51Part 3: Inferring Regulating TFs for Sets of Co-Expressed Genes
52TFBS Over-representation
- Akin to the GO studies yesterday, we seek to determine if a set of co-expressed genes contains an over-abundance of predicted binding sites for a known TF
- Phylogenetic footprinting to reduce false prediction rate
53Two Examples of TFBS Over-Representation
More Genes with TFBS
54Statistical Methods for Identifying
Over-represented TFBS
- Binomial test (Z scores)
  - Based on the number of occurrences of the TFBS relative to background
  - Normalized for sequence length
  - Simple binomial distribution model
- Fisher exact probability scores
  - Based on the number of genes containing the TFBS relative to background
  - Hypergeometric probability distribution
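Both statistics can be sketched with the standard library. This is a simplified illustration under stated assumptions, not the oPOSSUM implementation: the Z score uses the normal approximation to a binomial with a per-nucleotide site rate estimated from the background, and the Fisher score is the one-tailed hypergeometric probability where the background gene count includes the target genes. All function and parameter names are hypothetical.

```python
import math


def binomial_z_score(target_hits, target_length,
                     background_hits, background_length):
    """Z score for the number of site occurrences in the target sequences.

    The background hit rate per nucleotide normalizes for sequence length.
    """
    rate = background_hits / background_length
    expected = target_length * rate
    std_dev = math.sqrt(target_length * rate * (1 - rate))
    return (target_hits - expected) / std_dev


def fisher_one_tailed(target_with_site, target_genes,
                      background_with_site, background_genes):
    """P(at least `target_with_site` genes carry the site) when
    `target_genes` are drawn at random from `background_genes`,
    of which `background_with_site` carry the site (hypergeometric)."""
    total = math.comb(background_genes, target_genes)
    upper = min(target_genes, background_with_site)
    tail = sum(
        math.comb(background_with_site, k)
        * math.comb(background_genes - background_with_site, target_genes - k)
        for k in range(target_with_site, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Made-up counts purely for illustration.
    print(binomial_z_score(30, 1000, 100, 10000))
    print(fisher_one_tailed(8, 16, 500, 10000))
```

The two scores answer different questions: the Z score rewards many sites even if they pile up in a few genes, while the Fisher score rewards many genes having at least one site.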
55Validation using Reference Gene Sets
A. Muscle-specific (23 input; 16 analyzed)
TF          Rank  Z-score  Fisher
SRF         1     21.41    1.18e-02
MEF2        2     18.12    8.05e-04
c-MYB_1     3     14.41    1.25e-03
Myf         4     13.54    3.83e-03
TEF-1       5     11.22    2.87e-03
deltaEF1    6     10.88    1.09e-02
S8          7     5.874    2.93e-01
Irf-1       8     5.245    2.63e-01
Thing1-E47  9     4.485    4.97e-02
HNF-1       10    3.353    2.93e-01

B. Liver-specific (20 input; 12 analyzed)
TF          Rank  Z-score  Fisher
HNF-1       1     38.21    8.83e-08
HLF         2     11.00    9.50e-03
Sox-5       3     9.822    1.22e-01
FREAC-4     4     7.101    1.60e-01
HNF-3beta   5     4.494    4.66e-02
SOX17       6     4.229    4.20e-01
Yin-Yang    7     4.070    1.16e-01
S8          8     3.821    1.61e-02
Irf-1       9     3.477    1.69e-01
COUP-TF     10    3.286    2.97e-01

TFs with experimentally-verified sites in the reference sets.
56Empirical Selection of Parameters based on
Reference Studies
57C-Myc SAGE Data
- c-Myc transcription factor dimerizes with the Max protein
  - Key regulator of cell proliferation, differentiation and apoptosis
- Menssen and Hermeking identified 216 different SAGE tags corresponding to unique mRNAs that were induced after adenoviral expression of c-Myc in HUVEC cells
  - They then went on to confirm the induction of 53 genes using microarray analysis and RT-PCR
58Induced Genes after Ectopic Expression of c-Myc (SAGE) (53 input; 36 analyzed)
TF        Class            Rank  Z-score  Fisher    No. Genes
Myc-Max   bHLH-ZIP         1     21.68    5.35e-03  7
Staf      ZN-FINGER, C2H2  2     20.17    1.70e-02  2
Max       bHLH-ZIP         3     18.32    2.16e-02  12
SAP-1     ETS              4     13.23    1.61e-04  13
USF       bHLH-ZIP         5     11.90    1.84e-01  16
SP1       ZN-FINGER, C2H2  6     11.68    4.40e-02  12
n-MYC     bHLH-ZIP         7     11.11    1.55e-01  20
ARNT      bHLH             8     11.11    1.55e-01  20
Elk-1     ETS              9     10.92    3.88e-03  19
Ahr-ARNT  bHLH             10    10.17    1.11e-01  25
59Structurally-related TFs with Indistinguishable
TFBS
- Most structurally related TFs bind to highly similar patterns
  - Zn-finger is a big exception
60oPOSSUM Server
61Ets Factor Family
- EG232974
- EG432800
- Ehf
- Elf1
- Elf2
- Elf3
- Elf4
- Elf5
- Elk1
- Elk3
- Elk4
- Erf
- Erg
- Ets1
- Ets2
- Etv1
- Etv2
- Etv3
- Etv3l
- Etv4
- Etv5
- Etv6
- Fev
- Fli1
- Gabpa
- LOC100
- LOC100
- factor)
- LOC634494
- Sfpi1
- Spdef
- Spib
- Spic
- How to pick which one?
  - At this stage there are TF catalogs coming that will be coupled to characteristics.
  - Candidate gene prioritization software can be used (such as TOPPGENE)
62Section 3.3: What have we learned?
- New generation of tools to help interrogate the meaning of observed clusters of co-expressed genes
- Generally best performance has been with data directly linked to a transcription factor
- Highly dependent on the experimental design: cannot overcome noisy data from poor design (Recall Day 1)
- The identity of a mediating TF may not be apparent when many proteins can bind to the same motif
63Questions?
64Laboratory Exercise 3.3
- TFBS Over-Representation Analysis
66Part 4: de novo Discovery of TF Binding Sites
67de novo Pattern Discovery
- String-based
  - e.g. YMF (Sinha & Tompa)
  - Generalization: identify over-represented oligomers in comparison of + and - (or complete) promoter collections
  - Used often for yeast promoter analysis
- Profile-based
  - e.g. AnnSpec (Workman & Stormo) or MEME (Bailey & Elkan)
  - Generalization: identify strong patterns in promoter collection vs. background model of expected sequence characteristics
68Assessing Discovered Patterns
- Strength
- Similarity search
69String-based methods(1)
How likely are X words in a set of sequences, given background sequence characteristics?
CCCGCCGGAATGAAATCTGATTGACATTTTCC
>EP71002 (+) CeIV msp-56 B range -100 to -75
TTCAAATTTTAACGCCGGAATAATCTCCTATT
>EP63009 (+) Ce Cuticle Col-12 range -100 to -75
TCGCTGTAACCGGAATATTTAGTCAGTTTTTG
>EP63010 (+) Ce Cuticle Col-13 range -100 to -75
TATCGTCATTCTCCGCCTCTTTTCTT
>EP11013 (+) Ce vitellogenin 2 range -100 to -75
GCTTATCAATGCGCCCGGAATAAAACGCTATA
>EP11014 (+) Ce vitellogenin 5 range -100 to -75
CATTGACTTTATCGAATAAATCTGTT
>EP11015 (-) Ce vitellogenin 4 range -100 to -75
ATCTATTTACAATGATAAAACTTCAA
>EP11016 (+) Ce vitellogenin 6 range -100 to -75
ATGGTCTCTACCGGAAAGCTACTTTCAGAATT
>EP11017 (+) Ce calmodulin cal-2 range -100 to -75
TTTCAAATCCGGAATTTCCACCCGGAATTACT
>EP63007 (-) Ce cAMP-dep. PKR P1 range -100 to -75
TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC
>EP63008 (+) Ce cAMP-dep. PKR P2 range -100 to -75
ACTGAACTTGTCTTCAAATTTCAACACCGGAA
>EP17012 (+) Ce hsp 16K-1 A range -100 to -75
TCAATGCCGGAATTCTGAATGTGAGTCGCCCT
>EP55011 (-) Ce hsp 16K-1 B range
70String-based methods(2)
Find all words of length n in the yeast promoters (e.g. n = 7)
GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGC
GTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGACGGTAAGAAGATCA
CTTCTAACCAAAGAATTGTTGCTGCTTTGCCAACCATCAAGTACGTTTTG
GAACACCACCCAAGATACGTTGTCTTGTTCTCACTTGGGTAGACCAAACG
GTGAAAGAAACGAAAAATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAA
TCATTGTTGGGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGA
AGTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTTTGTTGG
AAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGAAAGGTCGATGGTC
AAAAGGTCAAGGCTCAAGGAAGATGTTCAAAAGTTCAGACACGAATTGAG
CTCTTTGGCTGATGTTTACATCACGATGCCTTCGGTACCGCTCACAGAGC
TCACTCTTCTATGGTCGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTG
TTGGAAAAGGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAG
ACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAGATTCAAT
TGATTGACAACTTGTTGGACAAGGTCGACTCTATCATCATTGGTGGTGGT
ATGGCTTTCCCTTCAAGAAGGTTTTGGAAAACACTGAAATCGGTGACTCC
ATCTTCGACAAGGCTGGTGCTGAAATCGTTCCAAAGTTGATGGAAAAGGC
CAAGGCCAAGGGTGTCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGA
TGCTTTCTCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGTA
TTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCTAGAAAGTGT
TTGCTGCTACTGTTGCAAAGGCTAAGACCATTGTCTGGAACGGTCCACCA
GGTGTTTTCGAATTCGAAAAGTTCGCTGCTGGTACTAAGGCTTTGTTAGA
CGAAGTTGTCAAGAGCTCTGCTGCTGGTAACACCGTCATCATTGGTGGTG
GTGACACTGCCA
Make a lookup table:
AAACCTTT 456
TTTTTTTT 57788
GATAGGCA 589
Etc...
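Building the lookup table amounts to sliding a window of length n along every promoter and tallying each word. A minimal sketch follows; the function name is hypothetical and only one strand is counted, which is a simplifying assumption.

```python
from collections import Counter


def count_words(sequences, n=7):
    """Tally every overlapping word of length n across the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for start in range(len(seq) - n + 1):
            counts[seq[start:start + n]] += 1
    return counts


if __name__ == "__main__":
    # First line of the promoter sequence shown above, as a small example.
    promoter = "GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGC"
    table = count_words([promoter], n=7)
    for word, count in table.most_common(5):
        print(word, count)
```

With n = 7 there are only 4**7 = 16,384 possible words, so the whole table fits comfortably in memory even for a complete promoter collection.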
71String-based methods(3)
- $X_w$: instances of a word w within our set of X genes
- $E[X_w]$: average number of instances of w based on number of genes in our set
- $\mathrm{Var}[X_w]$: variance, i.e. how much deviation from the average is expected for w
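One common way to combine these three quantities is a z-score, (X_w - E[X_w]) / sqrt(Var[X_w]). The sketch below estimates the mean and variance empirically, by repeatedly drawing random gene sets of the same size from a background promoter collection. That resampling scheme is an assumption chosen for clarity and is not how YMF computes its statistics; all names are hypothetical.

```python
import random
import statistics


def count_word(sequences, word):
    """Count overlapping occurrences of `word` across the sequences."""
    width = len(word)
    return sum(
        1
        for seq in sequences
        for start in range(len(seq) - width + 1)
        if seq[start:start + width] == word
    )


def resampled_z_score(word, gene_set, background, samples=1000, seed=0):
    """Return (X_w, estimated E[X_w], z-score) using random gene sets.

    `background` must contain at least as many promoters as `gene_set`.
    """
    rng = random.Random(seed)
    observed = count_word(gene_set, word)
    draws = [
        count_word(rng.sample(background, len(gene_set)), word)
        for _ in range(samples)
    ]
    mean = statistics.mean(draws)
    std_dev = statistics.pstdev(draws)
    if std_dev == 0:
        # No variation in the background: report 0 or infinity explicitly.
        z = 0.0 if observed == mean else float("inf")
    else:
        z = (observed - mean) / std_dev
    return observed, mean, z


if __name__ == "__main__":
    background = ["ACGTACGTAC", "TTTTGGGGCC", "CCGGAATTCC", "GATTACAGAT"] * 5
    gene_set = ["TTCCGGAATT", "ACCGGAATAA", "GGCCGGAATC"]
    print(resampled_z_score("CCGGAAT", gene_set, background, samples=200))
```

Analytical formulas for the mean and variance are much faster than resampling, but they need care with self-overlapping words, which this empirical version sidesteps.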
72Limitations of String-based Methods
- Longer word lengths not possible
- While degeneracy codes can be used, TFBS are not words: we lose quantitation for variable positions with consensus sequences
  - Imagine a column in a PFM with 7 As and 1 T: in a consensus sequence we would represent it as W or throw out the instance with T
- Recently the string-based method has found renewed utility in the analysis of 3' UTRs for the presence of microRNA target sequences...
73microRNA Target Sequences
- Lim et al. expressed miRNAs in cells and observed that the overall pattern of gene expression shifted toward the pattern of expression observed in cells which naturally express the miRNA
  - The genes with reduced expression in response to miRNA exposure shared 7-nt motifs in the 3' UTR of their transcripts
- Nice website tutorial
  - http://www.ambion.com/main/explorations/mirna.html
74Probabilistic Methods for Pattern Discovery
- What is a probabilistic method?
- The Gibbs sampler algorithm
75Probabilistic Methods
Overview: Find a local alignment of width x of sites that maximizes information content (or related measure) in reasonable time. Usually by Gibbs sampling or EM methods.
Motivation:
- TFBS are not words
- Efficiency: can handle longer patterns than string-based methods
- Can be intentionally influenced to reflect prior knowledge
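To make "information content" concrete, the sketch below scores an ungapped alignment of sites in bits, assuming a uniform background (so each column ranges from 0 to 2 bits) and no small-sample correction. Both are simplifying assumptions, and the function names are hypothetical.

```python
import math


def column_information(counts):
    """Information content of one column in bits: 2 + sum(p * log2 p)."""
    total = sum(counts.values())
    bits = 2.0
    for count in counts.values():
        if count > 0:
            p = count / total
            bits += p * math.log2(p)
    return bits


def alignment_information(sites):
    """Sum column information over an ungapped alignment of equal-length sites."""
    width = len(sites[0])
    total_bits = 0.0
    for i in range(width):
        counts = {base: 0 for base in "ACGT"}
        for site in sites:
            counts[site[i].upper()] += 1
        total_bits += column_information(counts)
    return total_bits


if __name__ == "__main__":
    # The four aligned sites shown on the Gibbs sampling slides.
    sites = ["tgacttcc", "tgatctct", "agacctca", "tgacctct"]
    print(round(alignment_information(sites), 2))
```

A pattern discovery method searching for the alignment with the highest total is, in effect, searching for the set of start positions whose columns are least like the background.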
76What does probabilistic mean?
- Based on probability
- Functionally, it means we're going to guess our way to a good pattern (TFBS)
  - We're going to try to make a good guess
- Two different flavours of the approach
  - Expectation Maximization, in which we try to make the best guess each time
  - Gibbs Sampling, in which we make our guesses based on the strength of our conviction
77Gibbs Sampling
Two data structures used:
1) Current pattern nucleotide frequencies $q_{i,1}, \dots, q_{i,4}$ and corresponding background frequencies $p_{i,1}, \dots, p_{i,4}$
2) Current positions of site start points in the N sequences, $a_1, \dots, a_N$, i.e. the alignment that contributes to $q_{i,j}$
One starting point in each sequence is chosen randomly initially.
[Figure: aligned sites tgacttcc, tgatctct, agacctca, tgacctct]
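78Iterations in Gibbs Sampling
Remove one sequence z from the set. Update the current pattern according to
$$q_{i,j} = \frac{c_{i,j} + b_j}{N - 1 + B}$$
where $c_{i,j}$ is the count of symbol j at position i among the remaining sites, $b_j$ is the pseudocount for symbol j, and $B$ is the sum of all pseudocounts in the column.
[Figure: alignment A of sites tgacttcc, tgatctct, agacctca, tgacctct, with sequence z removed]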
79Gibbs Sampling(grossly over-simplified)
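In the same over-simplified spirit, here is a minimal site sampler: hold out one sequence z, rebuild the pattern from the other sites with pseudocounts, then draw a new start point in z with probability proportional to the pattern-versus-background likelihood ratio. It assumes exactly one site per sequence, a fixed width, uppercase A/C/G/T input and a single equal pseudocount for every symbol; the function name and defaults are hypothetical.

```python
import math
import random

BASES = "ACGT"


def gibbs_motif_search(seqs, width, iterations=200, pseudocount=0.5, seed=0):
    """Return one sampled start position per sequence."""
    rng = random.Random(seed)
    total = sum(len(s) for s in seqs)
    # Background frequencies estimated from all input sequences.
    background = {b: (sum(s.count(b) for s in seqs) + 1) / (total + 4)
                  for b in BASES}
    # One starting point in each sequence, chosen randomly.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for z in range(len(seqs)):
            # Pattern built from every sequence except z.
            others = [seqs[k][starts[k]:starts[k] + width]
                      for k in range(len(seqs)) if k != z]
            pattern = []
            for i in range(width):
                column = [site[i] for site in others]
                pattern.append({
                    b: (column.count(b) + pseudocount)
                       / (len(others) + 4 * pseudocount)
                    for b in BASES
                })
            # Weight every candidate start in z by its likelihood ratio.
            weights = []
            for a in range(len(seqs[z]) - width + 1):
                word = seqs[z][a:a + width]
                weights.append(math.prod(
                    pattern[i][c] / background[c] for i, c in enumerate(word)
                ))
            # Sample (rather than maximize) the new start point.
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


if __name__ == "__main__":
    seqs = ["ACGTTGACCTCTAAGT", "GGTGACCTCTCCATAG",
            "CATGTTGACCTCTGGA", "TTTGACCTCTACGCAT"]
    starts = gibbs_motif_search(seqs, width=8)
    for s, a in zip(seqs, starts):
        print(a, s[a:a + 8])
```

Because each start is sampled rather than chosen greedily, different seeds can settle on different alignments, which is why the procedure is normally run many times and the best-scoring result kept.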
80Pattern Discovery
- Gibbs sampling is guaranteed to return an optimal pattern if repeated sufficiently often
  - Procedure is fast, so running many 1000s of times is feasible
- Unfortunately, we have a problem: what if the mediating TFBS are not strongly over-represented relative to other patterns?
81Applied Pattern Discovery is Acutely Sensitive to
Noise
True Mef2 Binding Sites
82Four Approaches to Improve Sensitivity
- Better background models
  - Higher-order properties of DNA
- Phylogenetic Footprinting
  - Human-mouse comparison eliminates 75% of sequence
- Regulatory Modules
  - Architectural rules
- Limit the types of binding profiles allowed
  - TFBS patterns are NOT random
83Pattern Discovery Summary
- Pattern discovery methods can recover over-represented patterns in the promoters of co-expressed genes
- Methods are acutely sensitive to noise, indicating that the signal we seek is weak
  - TFs tolerate great variability between binding sites
- As for pattern discrimination, supplementary information/approaches are required to overcome the noise
84Questions?
85Laboratory Exercise 3.4
86REFLECTIONS
- Part 2
  - Futility Theorem: essentially, predictions of individual TFBS have no relationship to an in vivo function
  - Successful bioinformatics methods for site discrimination incorporate additional information (clusters, conservation)
- Part 3
  - TFBS over-representation is a powerful new means to identify TFs likely to contribute to observed patterns of co-expression
- Part 4
  - Pattern discovery methods are severely restricted by the signal-to-noise problem
  - Observed patterns must be carefully considered
  - Successful methods for pattern discovery will have to incorporate additional information (conservation, structural constraints on TFs)
88THE END
- Questions before the break?
- Lab exercises address Sections 2 and 3
89LUNCH
- On your own
- (Food court Downstairs)
- Back at ??