Canadian Bioinformatics Workshops

Slides: 90

Provided by: GaryB123

1
Canadian Bioinformatics Workshops

www.bioinformatics.ca

2
2
Module Title of Module
3
Canadian Bioinformatics Workshops 2009
Module 3: Inferring Regulatory Mechanisms Governing Sets of Genes
Wyeth W. Wasserman, University of British Columbia
www.cisreg.ca
4
Module 3 Overview

Part 1: Overview of transcription
Lab 3.1: Promoters in Genome Browser (UCSC)
Part 2: Prediction of transcription factor
binding sites using binding profiles
(Discrimination)
Lab 3.2: TFBS scan (Footer)
Part 3: Interrogation of sets of co-expressed
genes to identify mediating transcription factors
Lab 3.3: TFBS Over-Representation (oPOSSUM)
Part 4: Detection of novel motifs (TFBS)
over-represented in regulatory regions of
co-expressed genes (Discovery)
Lab 3.4: Motif Discovery (MEME/Motif-Compare)

5
Restrictions in Coverage

Focus on Eukaryotic cells and PolII Promoters
Principles apply to prokaryotes
Will provide suggestions for similar tools for
other species
Many of the examples are drawn from my lab's work;
there are many equivalent tools (links to be
provided)

6
Part 1: Introduction to transcription in
eukaryotic cells
7
Transcription Over-Simplified

Three-step Process
TF binds to TFBS (DNA)
TF catalyzes recruitment of polymerase II complex
Production of RNA from transcription start site
(TSS)

[Figure labels: TF, Pol-II, TATA, TFBS, TSS]
8
Anatomy of Transcriptional Regulation
WARNING: Terms vary widely in meaning between scientists
[Figure labels: Core Promoter/Initiation Region (Inr), TSR, Proximal Regulatory Region, Distal Regulatory Regions, EXON, TATA, TFBS]

Core Promoter: sufficient for initiation of
transcription; orientation dependent
TSR: transcription start region
Refers to a region rather than a specific start
site (TSS)
TFBS: single transcription factor binding site
Regulatory Regions
Proximal/Distal: vague reference to distance
from TSR
May be positive (enhancing) or negative
(repressing)
Orientation independent (generally)
Modules: sets of TFBS within a region that
function together
Transcriptional Unit
DNA sequence transcribed as a single
polycistronic mRNA

9
Complexity in Transcription
[Figure labels: Chromatin, Distal enhancers, Proximal enhancer, Core Promoter]
10
Lab Discovery of TF Binding Sites
[Figure: Reporter Gene Activity (scale 0 to 100) for a series of LUCIFERASE reporter constructs, one carrying a mutation]
Identify functional regulatory region within a
sequence and delineate specific TFBS through
mutagenesis (and in vitro binding studies)
11
EMSA/Gel Shift Assays to Identify Binding
Proteins
[Gel lane labels: TF + DNA, DNA]
http://www.biomedcentral.com/content/figures/1741-7015-4-28-8.jpg
12
High-throughput Methods

SELEX
mix random ds DNA oligonucleotides with TF
protein, recover TF-DNA complexes and sequence
DNA
Protein Binding Arrays (UniProbe Database)
prepare arrays with ds DNA attached, label
protein with a fluorescent mark and observe DNA
bound by protein
ChIP
covalently link proteins to DNA in cell, shear
DNA, recover protein-DNA complexes and identify
DNA (PCR, array or sequencing)

13
Promoters

In most vertebrates the delineation of the
transcription start position is not easy
cDNA often incomplete at 5' end
Multiple promoters for most human genes
Referencing position relative to the initiation
site is therefore not a good idea
But done almost uniformly in biological papers
Translation start equally problematic
Can be in internal exon
Multiple ORF start positions common
Importance of promoter proximal regions varies
between species
Humans appear to have little enrichment for
functional sequences near promoters. The vast
regions to consider generally lead analysts to
restrict the search to a region around the
promoter(s), but the justification is not strong
Yeast and C.elegans have more compact regions and
promoter proximity can be a useful property to
restrict analyses

14
mRNA Caps for Mapping Initiation Sites

The 5' end of an mRNA has a cap structure that can
be precipitated with an antibody
Allows for large-scale sequencing of
full-length cDNAs and tags derived from the
5' end of mRNAs
RIKEN is the leading generator of such sequences
Not well represented in genome annotation
resources (unfortunately)

http://departments.oxy.edu/biology/Stillman/bi221/111300/26_18a.GIF
15
Classes of Initiation Regions
[Figure axes: CAGE cap tags per position vs. position]
This is over-simplified; see the paper for greater
detail. The take-home message is that promoters are
not drawn from a single continuous distribution
of properties, but rather from at least two
classes.
Image from Carninci P, et al. (2006). Genome-wide
analysis of mammalian promoter architecture and
evolution. Nat Genet. Apr 28. PMID 16645617
16
CpG Islands

DNA methylation occurs in competition with
histone acetylation
Acetylation promotes open chromatin structure
that is permissive for TF binding to DNA
Methylation of DNA inhibits histone acetylation
Certain TFs promote histone acetylation by
recruiting acetylases
Methylation occurs on cytosines
Preferentially on cytosine adjacent to guanines
(CG dinucleotides, generally referred to as CpG)
Methylated cytosines frequently undergo
deamination to form thymine (CpG -> TpG)
CpG Islands are regions of DNA where CG
dinucleotides occur at a frequency consistent
with C and G mononucleotide frequencies
Highlight regions of active transcription
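
The CpG island definition above can be sketched as an observed/expected ratio. This is a minimal illustration, not a tool from the workshop: the function name is hypothetical, and the expected count (number of C times number of G, divided by sequence length) is a commonly used convention that the slides do not spell out. A ratio near 1 means CG dinucleotides occur about as often as the C and G mononucleotide frequencies predict; depleted regions fall well below 1.

```python
def cpg_observed_expected(seq):
    """Ratio of observed CG dinucleotides to the count expected
    from the C and G mononucleotide counts (assumed convention)."""
    seq = seq.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    n_cpg = seq.count("CG")  # CG cannot overlap itself, so count() is exact
    if n_c == 0 or n_g == 0:
        return 0.0
    expected = n_c * n_g / len(seq)
    return n_cpg / expected


if __name__ == "__main__":
    print(cpg_observed_expected("CGCGCGCG"))
    print(cpg_observed_expected("CAGTCAGT"))
```

In practice such a ratio would be computed in sliding windows along a promoter region rather than over a whole sequence.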

17
CpG Islands (2)

Important to recognize that promoters selectively
active after early development will not be
acetylated (and hence will be methylated) in the
cell divisions preceding the establishment of
germ cells and therefore will not have CpG
islands
Lists of genes that have higher or lower CpG
frequencies than average can misleadingly appear
to have TF binding motifs based on this
compositional characteristic
CpG Island bias in a gene set can mislead an
analyst to think that there are patterns of TFBS
(patterns with internal CG for island-rich and TG
for island-poor sets)

18
Additional Topics

Chromatin modification studies making great
strides
Signatures indicative of active regulatory
sequences such as H3K4me3
Co-activator (p300) ChIP study suggests
possibility to read-off regulatory regions
No methods currently address 3D properties of
nucleus (long-run will be necessary)

19
Section 3.1: What have we learned?

Transcription controlled by regulatory regions
Regulatory regions can be distant from initiation
regions
Laboratory methods can identify regulatory
regions and TF binding sites
Concept of single initiation site is flawed
Promoters fall into subclasses
CpG vs TATA
Can impact assessment of TFBS in sets of genes

20
Questions?

Please, please, please . . .
ASK QUESTIONS
. . . now is a great chance.

22
Part 2 Prediction of TF Binding Sites
Teaching a computer to find TFBS
23
Representing Binding Sites for a TF
Set of binding sites:
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA
GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA
AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA
AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA
AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA
AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA
AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA

A single site
AAGTTAATGA

A set of sites represented as a consensus
VDRTWRWWSHD (IUPAC degenerate DNA)
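
Between a single site and a consensus string sits the position frequency matrix (PFM), which simply counts each nucleotide at each aligned position. The sketch below builds one from the first five sites listed on this slide. The function name and the dictionary layout are illustrative choices, not part of any workshop tool.

```python
# First five aligned sites from the slide (all the same width).
SITES = [
    "AAGTTAATGA",
    "CAGTTAATAA",
    "GAGTTAAACA",
    "CAGTTAATTA",
    "GAGTTAATAA",
]


def build_pfm(sites):
    """Count each base at each position of a set of aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width")
    pfm = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for position, base in enumerate(site):
            pfm[base][position] += 1
    return pfm


if __name__ == "__main__":
    for base, counts in build_pfm(SITES).items():
        print(base, counts)
```

Unlike a consensus string, the PFM keeps the counts for variable positions, which is what the scoring matrices on the next slide are derived from.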

24
Conversion of PFMs to Position Specific Scoring
Matrices (PSSMs)
PSSMs are also known as Position Weight Matrices (PWMs)
Add the following features to the matrix profile:
1. Correct for nucleotide frequencies in genome
2. Weight for the confidence (depth) in the pattern
3. Convert to log-scale probability for easy arithmetic

pfm
A   5    0    1    0    0
C   0    2    2    4    0
G   0    3    1    0    4
T   0    0    1    1    1

pssm
A   1.6 -1.7 -0.2 -1.7 -1.7
C  -1.7  0.5  0.5  1.3 -1.7
G  -1.7  1.0 -0.2 -1.7  1.3
T  -1.7 -1.7 -0.2 -0.2 -0.2

Conversion applied to each cell:
\log\left(\frac{f(b,i) + s(n)}{p(b)}\right)
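
The conversion can be sketched in a few lines. The slide does not state the pseudocount or background it used, so the following are assumptions: a total pseudocount of sqrt(N) per column (N being the number of sites), split according to the background, a uniform background of 0.25 per base, and log base 2. Under those assumptions the code reproduces the pssm values shown on the slide to one decimal place.

```python
import math

# Count matrix from the slide (five sites, five positions).
PFM = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}


def pfm_to_pssm(pfm, background=None):
    """Convert counts to log2 odds scores with a sqrt(N) pseudocount."""
    bases = sorted(pfm)
    if background is None:
        # Assumed uniform background; a genome-specific one can be passed in.
        background = {b: 1.0 / len(bases) for b in bases}
    width = len(pfm[bases[0]])
    pssm = {b: [] for b in bases}
    for i in range(width):
        n_sites = sum(pfm[b][i] for b in bases)
        total_pseudo = math.sqrt(n_sites)  # assumed small-sample correction
        for b in bases:
            corrected = (pfm[b][i] + total_pseudo * background[b]) / (
                n_sites + total_pseudo
            )
            pssm[b].append(math.log2(corrected / background[b]))
    return pssm


if __name__ == "__main__":
    for base, scores in pfm_to_pssm(PFM).items():
        print(base, [round(s, 1) for s in scores])
```

The pseudocount keeps unobserved bases from scoring minus infinity, which is how the "confidence (depth)" weighting enters: the fewer sites behind a column, the more the scores are pulled toward zero.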
25
PSSM Scoring Scales

Raw scores
Sum of values from indicated cells of the matrix
Relative Scores (most common)
Normalize the scores to range of 0-1 or 0-100
Empirical p-values
Based on distribution of scores for some DNA
sequence, determine a p-value (see next slide)
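
The three scales can be compared side by side in a short sketch. It reuses the pssm values from the previous slide; the function names are hypothetical, and the relative score is taken here as (raw - minimum) / (maximum - minimum), which is one common normalization. The empirical p-value is the fraction of windows in some background sequence that score at least as high as the site.

```python
# Scoring matrix copied from the pssm shown on the previous slide.
PSSM = {
    "A": [1.6, -1.7, -0.2, -1.7, -1.7],
    "C": [-1.7, 0.5, 0.5, 1.3, -1.7],
    "G": [-1.7, 1.0, -0.2, -1.7, 1.3],
    "T": [-1.7, -1.7, -0.2, -0.2, -0.2],
}


def raw_score(pssm, site):
    """Sum of the matrix cells indicated by the bases of the site."""
    return sum(pssm[base][i] for i, base in enumerate(site))


def relative_score(pssm, site):
    """Raw score rescaled to 0-1 between the worst and best possible site."""
    width = len(site)
    best = sum(max(pssm[b][i] for b in pssm) for i in range(width))
    worst = sum(min(pssm[b][i] for b in pssm) for i in range(width))
    return (raw_score(pssm, site) - worst) / (best - worst)


def empirical_p_value(pssm, site, background_seq):
    """Fraction of background windows scoring at least as high as the site."""
    width = len(site)
    observed = raw_score(pssm, site)
    scores = [
        raw_score(pssm, background_seq[i:i + width])
        for i in range(len(background_seq) - width + 1)
    ]
    return sum(score >= observed for score in scores) / len(scores)


if __name__ == "__main__":
    site = "AGCCG"
    print(raw_score(PSSM, site), relative_score(PSSM, site))
    print(empirical_p_value(PSSM, site, "AGCCG" + "T" * 20))
```

The choice of background sequence matters: a p-value computed against CpG-rich promoter sequence will differ from one computed against bulk genomic DNA.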

26
Detecting binding sites in a single sequence
Raw scores: Sp1 site, Abs_score 13.4 (sum of column scores)
Empirical p-value scores: area to the right of the value, relative to the area under the entire curve
[Figure axes: frequency (0.0 to 0.3) vs. relative score (0.0 to 1.0)]
27
JASPAR: an open-access database of TF binding
profiles (jaspar.genereg.net)
28
The Good

Tronche (1997) tested 50 predicted HNF1 TFBS
using an in vitro binding test and found that 96%
of the predicted sites were bound!
Stormo and Fields (1998) found in detailed
biochemical studies that the best weight matrices
produce scores highly correlated with in vitro
binding energy

BINDING ENERGY
PSSM SCORE
29
the Bad

Fickett (1995) found that a profile for the myoD
TF made predictions at a rate of 1 per 500bp of
human DNA sequence
This corresponds to an average of 20 sites / gene
(assuming 10,000 bp as average gene size)

30
and the Ugly!
Human cardiac α-actin gene analyzed with a set of
profiles (each line represents a TFBS prediction)
Futility Conjecture: TFBS predictions are almost
always wrong
Red boxes are protein coding exons; TFBS
predictions were excluded from these regions in
this analysis
31
ADVANCED TOPIC: Issues of Column Independence

PSSM model assumes independence between positions
For example, if you observe a G at position 2,
the model assumes there is no influence on the
likelihood of a T at position 3 - this is known
to be an incorrect assumption
Other models can represent dependence
Hidden Markov models of Nth order where Nth
refers to the number of influencing positions
For the cases where there are hundreds of TFBS
known for a TF, there has been only modest
improvement in the specificity of TFBS
predictions using advanced column inter-dependent
models
The newly emerging ChIP-Seq data collections will
ultimately lead to the systematic use of more
advanced models (not likely to advance to wet
labs for 3 years)

32
A Conundrum

Counter to intuition, the ratio of true positives
to predictions fails to improve for stringent
thresholds
For most predictive models this ratio would
increase
Why?
True binding sites are defined by properties not
incorporated into the profile scores - above some
threshold all sites could be bound if accessible

33
Section 3.2A: What have we learned?

PSSMs accurately reflect in vitro binding
properties of DNA binding proteins
Suitable binding sites occur at a rate far too
frequent to reflect in vivo function
Bioinformatics methods that use PSSMs for binding
site studies must incorporate additional
information to enhance specificity
Unfiltered predictions are too noisy for most
applications
Organisms with short regulatory sequences are
less problematic (e.g. yeast and bacteria)

34
Using Phylogenetic Footprinting to Improve TFBS
Discrimination

70,000,000 years of evolution can reveal
regulatory regions

35
Phylogenetic Footprinting
FoxC2: a single exon gene
[Figure axis: Human-Mouse % Identity (0 to 100)]

Align orthologous gene sequences (e.g. LAGAN)
For the first window of 100 bp of sequence1,
determine the % with identical match in
sequence2
Step across the first sequence, recording the
percentage of identical nucleotides in each
window
Observe that the single exon contains a region of
high identity that corresponds to the ORF, with
lower identity in the 5' and 3' UTRs
Additional conserved regions could be regulatory
regions
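
The windowing steps above can be sketched as follows. This is a simplified illustration that assumes the two sequences have already been aligned (for example by LAGAN) into equal-length strings with "-" for gaps, and it slides the window over alignment columns rather than over ungapped positions of sequence1. The function name and parameters are hypothetical.

```python
def window_identity(aligned1, aligned2, window=100, step=1):
    """Percent identity in each window of a pairwise alignment.

    Returns a list of (window_start, percent_identity) tuples.
    Gap columns never count as matches.
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aligned1) - window + 1, step):
        pairs = zip(aligned1[start:start + window],
                    aligned2[start:start + window])
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        results.append((start, 100.0 * matches / window))
    return results


if __name__ == "__main__":
    for start, identity in window_identity("ACGTACGT", "ACGTTTTT",
                                           window=4, step=4):
        print(start, identity)
```

Plotting the second element of each tuple against the first gives the kind of identity profile shown for FoxC2; windows above a chosen identity threshold are the candidate conserved regions.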

36
Phylogenetic Footprinting (cont)
Identity
200 bp Window Start Position (human sequence)
Actin gene compared between human and mouse
37
Multi-species Phylogenetic Footprinting

PhastCons scores indicate the regions of DNA
which are unusually conserved across some
subset of the aligned organisms

38
Phylogenetic Footprints in UCSC Genome Browser

PhastCons (region score)
PhyloP (position score)

[Screenshot to be inserted]
39
Phylogenetic Footprinting Dramatically Reduces
Spurious Hits
Actin, alpha cardiac
40
TFBS Prediction with Human Mouse Pairwise
Phylogenetic Footprinting
[Figure axes: SELECTIVITY, SENSITIVITY]

Testing set: 40 experimentally defined sites in
15 well studied genes (replicated with a 100 site
set)
75-80% of defined sites detected with
conservation filter, while only 11-16% of total
predictions retained

41
1kbp insulin receptor promoter screened with
footprinting
42
Choosing the right species for pairwise
comparison...
[Figure panels: CHICKEN vs. HUMAN, MOUSE vs. HUMAN, COW vs. HUMAN]
43
ConSite
44
TFBS Discrimination Tools

Phylogenetic Footprinting Servers
FOOTER http://biodev.hgen.pitt.edu/footer_php/Footerv2_0.php
CONSITE http://asp.ii.uib.no:8090/cgi-bin/CONSITE/consite/
rVISTA http://rvista.dcode.org/
ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk
SNPs in TFBS Analysis
RAVEN http://burgundy.cmmt.ubc.ca/cgi-bin/RAVEN/a?rmhome
Prokaryotes or Yeast
PRODORIC http://prodoric.tu-bs.de/
YEASTRACT http://www.yeastract.com/index.php
Software Packages
TOUCAN http://homes.esat.kuleuven.be/saerts/software/toucan.php
Programming Tools
TFBS http://tfbs.genereg.net/
ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk

45
Analysis of TFBS with Phylogenetic Footprinting
Scanning a single sequence
Low specificity of profiles
too many hits
great majority not biologically significant

Scanning a pair of orthologous sequences for
conserved patterns in conserved sequence regions
A dramatic improvement in the percentage of
biologically significant detections

46
Section 3.2B: What have we learned?

TFBS discrimination coupled with phylogenetic
footprinting has greater specificity with
tolerable loss of sensitivity
As with any purification process, some true
binding sites will be lost
Available online resources support phylogenetic
footprinting

47
Questions?

Please Ask

48
Laboratory Exercise 3.2

TF Binding Site Prediction

49
20 minute break

Until 10:50am
Next Sections 3.3 and 3.4

51
Part 3 Inferring Regulating TFs for Sets of
Co-Expressed Genes
52
TFBS Over-representation

Akin to the GO studies yesterday, we seek to
determine if a set of co-expressed genes contains
an over-abundance of predicted binding sites for
a known TF
Phylogenetic footprinting to reduce false
prediction rate

53
Two Examples of TFBS Over-Representation
More Genes with TFBS
54
Statistical Methods for Identifying
Over-represented TFBS

Binomial test (Z scores)
Based on the number of occurrences of the TFBS
relative to background
Normalized for sequence length
Simple binomial distribution model
Fisher exact probability scores
Based on the number of genes containing the TFBS
relative to background
Hypergeometric probability distribution
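
Both statistics can be sketched with the standard library alone. This is not the oPOSSUM implementation, which the slides do not detail; the function names and argument layout are hypothetical. The z-score uses the normal approximation to a binomial count of site occurrences, and the gene-level test is a one-tailed hypergeometric (Fisher exact) probability of seeing at least the observed number of genes with a site.

```python
import math


def binomial_z_score(hits, positions_scanned, background_rate):
    """Z-score of the observed site count against a binomial expectation.

    background_rate is the per-position hit rate in the background set,
    so multiplying by positions_scanned normalizes for sequence length.
    """
    expected = positions_scanned * background_rate
    std_dev = math.sqrt(positions_scanned * background_rate
                        * (1.0 - background_rate))
    return (hits - expected) / std_dev


def hypergeometric_p_value(hits_in_set, set_size, hits_total, population_size):
    """P(at least hits_in_set genes with a site) when set_size genes are
    drawn from population_size genes, hits_total of which contain a site."""
    total = math.comb(population_size, set_size)
    upper = min(set_size, hits_total)
    tail = sum(
        math.comb(hits_total, k)
        * math.comb(population_size - hits_total, set_size - k)
        for k in range(hits_in_set, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    print(binomial_z_score(60, 1000, 0.05))
    print(hypergeometric_p_value(5, 5, 5, 10))
```

The two measures answer different questions: the z-score reacts to many sites concentrated in a few genes, while the Fisher score only asks how many genes carry at least one site.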

55
Validation using Reference Gene Sets
TFs with experimentally-verified sites in the
reference sets.
56
Empirical Selection of Parameters based on
Reference Studies
57
C-Myc SAGE Data

c-Myc transcription factor dimerizes with the Max
protein
Key regulator of cell proliferation,
differentiation and apoptosis
Menssen and Hermeking identified 216 different
SAGE tags corresponding to unique mRNAs that were
induced after adenoviral expression of c-Myc in
HUVEC cells
They then went on to confirm the induction of 53
genes using microarray analysis and RT-PCR

59
Structurally-related TFs with Indistinguishable
TFBS

Most structurally related TFs bind to highly
similar patterns
Zn-finger is a big exception

60
oPOSSUM Server
61
Ets Factor Family

EG232974
EG432800
Ehf
Elf1
Elf2
Elf3
Elf4
Elf5
Elk1
Elk3
Elk4
Erf
Erg
Ets1
Ets2
Etv1
Etv2
Etv3
Etv3l
Etv4
Etv5
Etv6
Fev
Fli1
Gabpa
LOC100
LOC100
factor)
LOC634494
Sfpi1
Spdef
Spib
Spic

How to pick which one?
At this stage there are TF catalogs coming that
will be coupled to characteristics.
Candidate gene prioritization software can be
used (such as TOPPGENE)

62
Section 3.3: What have we learned?

New generation of tools to help interrogate the
meaning of observed clusters of co-expressed
genes
Generally best performance has been with data
directly linked to a transcription factor
Highly dependent on the experimental design
cannot overcome noisy data from poor design
(Recall Day 1)
The identity of a mediating TF may not be
apparent when many proteins can bind to the same
motif

63
Questions?

Now is a good time

64
Laboratory Exercise 3.3

TFBS Over-Representation Analysis

66
Part 4: de novo Discovery of TF Binding Sites
67
de novo Pattern Discovery

String-based
e.g. YMF (Sinha & Tompa)
Generalization: Identify over-represented
oligomers in comparison of + and - (or
complete) promoter collections
Used often for yeast promoter analysis
Profile-based
e.g. AnnSpec (Workman & Stormo) or MEME (Bailey &
Elkan)
Generalization: Identify strong patterns in
promoter collection vs. background model of
expected sequence characteristics

68
Assessing Discovered Patterns

Strength
Similarity search

69
String-based methods(1)
How likely are X words in a set of sequences,
given background sequence characteristics?
CCCGCCGGAATGAAATCTGATTGACATTTTCC >EP71002 (+) CeIV msp-56 B range -100 to -75
TTCAAATTTTAACGCCGGAATAATCTCCTATT >EP63009 (+) Ce Cuticle Col-12 range -100 to -75
TCGCTGTAACCGGAATATTTAGTCAGTTTTTG >EP63010 (+) Ce Cuticle Col-13 range -100 to -75
TATCGTCATTCTCCGCCTCTTTTCTT >EP11013 (+) Ce vitellogenin 2 range -100 to -75
GCTTATCAATGCGCCCGGAATAAAACGCTATA >EP11014 (+) Ce vitellogenin 5 range -100 to -75
CATTGACTTTATCGAATAAATCTGTT >EP11015 (-) Ce vitellogenin 4 range -100 to -75
ATCTATTTACAATGATAAAACTTCAA >EP11016 (+) Ce vitellogenin 6 range -100 to -75
ATGGTCTCTACCGGAAAGCTACTTTCAGAATT >EP11017 (+) Ce calmodulin cal-2 range -100 to -75
TTTCAAATCCGGAATTTCCACCCGGAATTACT >EP63007 (-) Ce cAMP-dep. PKR P1 range -100 to -75
TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC >EP63008 (+) Ce cAMP-dep. PKR P2 range -100 to -75
ACTGAACTTGTCTTCAAATTTCAACACCGGAA >EP17012 (+) Ce hsp 16K-1 A range -100 to -75
TCAATGCCGGAATTCTGAATGTGAGTCGCCCT >EP55011 (-) Ce hsp 16K-1 B range
70
String-based methods(2)
Find all words of length n in the yeast promoters
(e.g. n = 7)
GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGC
GTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGACGGTAAGAAGATCA
CTTCTAACCAAAGAATTGTTGCTGCTTTGCCAACCATCAAGTACGTTTTG
GAACACCACCCAAGATACGTTGTCTTGTTCTCACTTGGGTAGACCAAACG
GTGAAAGAAACGAAAAATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAA
TCATTGTTGGGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGA
AGTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTTTGTTGG
AAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGAAAGGTCGATGGTC
AAAAGGTCAAGGCTCAAGGAAGATGTTCAAAAGTTCAGACACGAATTGAG
CTCTTTGGCTGATGTTTACATCACGATGCCTTCGGTACCGCTCACAGAGC
TCACTCTTCTATGGTCGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTG
TTGGAAAAGGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAG
ACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAGATTCAAT
TGATTGACAACTTGTTGGACAAGGTCGACTCTATCATCATTGGTGGTGGT
ATGGCTTTCCCTTCAAGAAGGTTTTGGAAAACACTGAAATCGGTGACTCC
ATCTTCGACAAGGCTGGTGCTGAAATCGTTCCAAAGTTGATGGAAAAGGC
CAAGGCCAAGGGTGTCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGA
TGCTTTCTCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGTA
TTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCTAGAAAGTGT
TTGCTGCTACTGTTGCAAAGGCTAAGACCATTGTCTGGAACGGTCCACCA
GGTGTTTTCGAATTCGAAAAGTTCGCTGCTGGTACTAAGGCTTTGTTAGA
CGAAGTTGTCAAGAGCTCTGCTGCTGGTAACACCGTCATCATTGGTGGTG
GTGACACTGCCA
Make a lookup table:
AAACCTTT 456
TTTTTTTT 57788
GATAGGCA 589
Etc...
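
Building the lookup table amounts to sliding a window of length n along every promoter and counting each word seen. A minimal sketch, with a hypothetical function name and overlapping windows counted on one strand only (a real analysis may also count the reverse complement):

```python
from collections import Counter


def count_words(sequences, n=7):
    """Count every overlapping word of length n across the sequences."""
    table = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - n + 1):
            table[seq[i:i + n]] += 1
    return table


if __name__ == "__main__":
    table = count_words(["GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTG"], n=7)
    for word, count in table.most_common(3):
        print(word, count)
```

With four bases there are 4^n possible words (16,384 for n = 7), so the table stays small enough to hold in memory for short words and grows quickly as n increases.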
71
String-based methods(3)
X_w: instances of a word w within our set of X genes
E[X_w]: average number of instances of w, based on the number of genes in our set
Var[X_w]: variance, i.e. how much deviation from the average is expected for w
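
One common way to combine these three quantities is a z-score, (X_w - E[X_w]) / sqrt(Var[X_w]). The sketch below estimates the expectation and variance empirically by drawing random gene sets of the same size from all promoters. That resampling scheme is an assumption made for illustration; string-based tools typically derive these values analytically from a background model, and the function names here are hypothetical.

```python
import random
import statistics


def count_word(promoters, word):
    """Total overlapping occurrences of word across a list of promoters."""
    width = len(word)
    return sum(
        1
        for seq in promoters
        for i in range(len(seq) - width + 1)
        if seq[i:i + width] == word
    )


def word_z_score(word, gene_set, all_promoters, samples=1000, seed=0):
    """Z-score of a word count in gene_set against random sets of equal size."""
    rng = random.Random(seed)
    observed = count_word(gene_set, word)
    null_counts = [
        count_word(rng.sample(all_promoters, len(gene_set)), word)
        for _ in range(samples)
    ]
    mean = statistics.fmean(null_counts)      # empirical E[X_w]
    std_dev = statistics.pstdev(null_counts)  # empirical sqrt(Var[X_w])
    if std_dev == 0:
        raise ValueError("no variation in background samples")
    return (observed - mean) / std_dev


if __name__ == "__main__":
    background = ["CCGGAA"] * 5 + ["TTTTTT"] * 45
    print(word_z_score("CCGGAA", ["CCGGAA"] * 5, background))
```

A large positive z-score flags a word as over-represented in the gene set relative to what random sets of promoters would give.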
72
Limitations of String-based Methods

Longer word lengths not possible
While degeneracy codes can be used, TFBS are not
words: we lose quantitation for variable
positions with consensus sequences
Imagine a column in a PFM with 7 As and 1 T. In a
consensus sequence we would represent it as W or
throw out the instance with T
Recently the string-based method has found
renewed utility in the analysis of 3' UTRs for the
presence of microRNA target sequences...

73
microRNA Target Sequences

Lim et al expressed miRNAs in cells and observed
that the overall pattern of gene expression
shifted toward the pattern of expression observed
in cells which naturally express the miRNA
The genes with reduced expression in response to
miRNA exposure shared 7 nt motifs in the 3' UTR of
their transcripts
Nice website tutorial:
http://www.ambion.com/main/explorations/mirna.html

74
Probabilistic Methods for Pattern Discovery

What is a probabilistic method?
The Gibbs sampler algorithm

75
Probabilistic Methods
Overview: Find a local alignment of width x of
sites that maximizes information content (or a
related measure) in reasonable time
Usually by Gibbs sampling or EM methods
Motivation:
TFBS are not words
Efficiency: can handle longer patterns than
string-based methods
Can be intentionally influenced to reflect prior
knowledge
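
The quantity being maximized can be made concrete. A minimal sketch of information content, assuming the usual relative-entropy definition (sum over positions and bases of f * log2(f / p), with f the observed frequency and p the background frequency) and a uniform background unless one is supplied. The function name is hypothetical.

```python
import math


def information_content(pfm, background=None):
    """Total information content (bits) of a count matrix."""
    bases = list(pfm)
    if background is None:
        # Assumed uniform background frequencies.
        background = {b: 1.0 / len(bases) for b in bases}
    width = len(pfm[bases[0]])
    total = 0.0
    for i in range(width):
        n_sites = sum(pfm[b][i] for b in bases)
        for b in bases:
            freq = pfm[b][i] / n_sites
            if freq > 0:  # 0 * log(0) is taken as 0
                total += freq * math.log2(freq / background[b])
    return total


if __name__ == "__main__":
    conserved = {"A": [4], "C": [0], "G": [0], "T": [0]}
    uniform = {"A": [1], "C": [1], "G": [1], "T": [1]}
    print(information_content(conserved), information_content(uniform))
```

A perfectly conserved column contributes 2 bits under a uniform background and a column matching the background contributes none, so alignments of real sites score higher than alignments of random positions.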
76
What does probabilistic mean?

Based on probability
Functionally, it means we're going to guess our
way to a good pattern (TFBS)
We're going to try to make a good guess
Two different flavours of the approach
Expectation Maximization in which we try to make
the best guess each time
Gibbs Sampling in which we make our guesses based
on the strength of our conviction

77
Gibbs Sampling
Two data structures used:
1) Current pattern nucleotide frequencies q_{i,1}, ..., q_{i,4} and corresponding background frequencies p_{i,1}, ..., p_{i,4}
2) Current positions of site start points in the N sequences, a_1, ..., a_N, i.e. the alignment that contributes to q_{i,j}. One starting point in each sequence is chosen randomly initially.
[Example sites in the alignment: tgacttcc, tgatctct, agacctca, tgacctct]
78
Iterations in Gibbs Sampling
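Remove one sequence z from the set. Update the
current pattern according to the remaining
sequences.
[Formula not captured in the transcript; its annotations read: pseudocount for symbol j; sum of all pseudocounts in column]
[Figure labels: A, z; example sites tgacttcc, tgatctct, agacctca, tgacctct]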
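
The iteration can be sketched end to end. This is a simplified illustration rather than the implementation behind any particular tool. It assumes the commonly used update q_{i,j} = (c_{i,j} + b_j) / (N - 1 + B), where c_{i,j} counts symbol j at position i in the remaining sequences, b_j is the pseudocount for symbol j and B is the sum of pseudocounts in the column. It also assumes a uniform background, a fixed pattern width, and exactly one site per sequence; all names are hypothetical.

```python
import random

BASES = "acgt"


def build_profile(sequences, starts, width, skip, pseudo=0.5):
    """Pattern frequencies from every current site except sequence `skip`."""
    n_used = len(sequences) - 1
    total_pseudo = pseudo * len(BASES)
    profile = []
    for i in range(width):
        counts = {b: pseudo for b in BASES}
        for k, (seq, start) in enumerate(zip(sequences, starts)):
            if k != skip:
                counts[seq[start + i]] += 1
        profile.append({b: counts[b] / (n_used + total_pseudo) for b in BASES})
    return profile


def gibbs_sample(sequences, width, iterations=200, seed=0):
    """Return one sampled site start per sequence."""
    rng = random.Random(seed)
    background = {b: 0.25 for b in BASES}  # assumed uniform background
    # One starting point per sequence, chosen at random.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        for z, seq in enumerate(sequences):
            # Rebuild the pattern without sequence z.
            q = build_profile(sequences, starts, width, skip=z)
            # Weight each candidate start by pattern vs. background odds.
            weights = []
            for a in range(len(seq) - width + 1):
                ratio = 1.0
                for i in range(width):
                    base = seq[a + i]
                    ratio *= q[i][base] / background[base]
                weights.append(ratio)
            # Sample (not maximize) the new start in proportion to its weight.
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


if __name__ == "__main__":
    seqs = ["ttgacttccaa", "actgatctcta", "cagacctcagg", "gtgacctctat"]
    print(gibbs_sample(seqs, width=8))
```

Replacing the sampling step with a choice of the highest-weight start would turn this into a greedy "best guess each time" procedure; drawing in proportion to the weights is what lets the sampler escape poor starting alignments.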
79
Gibbs Sampling(grossly over-simplified)
80
Pattern Discovery

Gibbs sampling is guaranteed to return an optimal
pattern if repeated sufficiently often
Procedure is fast, so running many 1000s of times
is feasible
Unfortunately, we have a problem: what if the
mediating TFBS are not strongly over-represented
relative to other patterns?

81
Applied Pattern Discovery is Acutely Sensitive to
Noise
True Mef2 Binding Sites
82
Four Approaches to Improve Sensitivity

Better background models
-Higher-order properties of DNA
Phylogenetic Footprinting
Human-mouse comparison eliminates 75% of
sequence
Regulatory Modules
Architectural rules
Limit the types of binding profiles allowed
TFBS patterns are NOT random

83
Pattern Discovery Summary

Pattern discovery methods can recover
over-represented patterns in the promoters of
co-expressed genes
Methods are acutely sensitive to noise,
indicating that the signal we seek is weak
TFs tolerate great variability between binding
sites
As for pattern discrimination, supplementary
information/approaches are required to overcome
the noise

84
Questions?

Winding down

85
Laboratory Exercise 3.4

Motif Discovery

86
REFLECTIONS

Part 2
Futility Theorem: essentially, predictions of
individual TFBS have no relationship to an in
vivo function
Successful bioinformatics methods for site
discrimination incorporate additional information
(clusters, conservation)
Part 3
TFBS over-representation is a powerful new means
to identify TFs likely to contribute to observed
patterns of co-expression
Part 4
Pattern discovery methods are severely restricted
by the Signal-to-Noise problem
Observed patterns must be carefully considered
Successful methods for pattern discovery will
have to incorporate additional information
(conservation, structural constraints on TFs)

88
THE END