1
Canadian Bioinformatics Workshops
  • www.bioinformatics.ca

3
Canadian Bioinformatics Workshops 2009
Module 3: Inferring Regulatory Mechanisms Governing Sets of Genes
Wyeth W. Wasserman, University of British Columbia
www.cisreg.ca
4
Module 3 Overview
  • Part 1: Overview of transcription
  • Lab 3.1: Promoters in Genome Browser (UCSC)
  • Part 2: Prediction of transcription factor
    binding sites using binding profiles
    (Discrimination)
  • Lab 3.2: TFBS scan (Footer)
  • Part 3: Interrogation of sets of co-expressed
    genes to identify mediating transcription factors
  • Lab 3.3: TFBS Over-Representation (oPOSSUM)
  • Part 4: Detection of novel motifs (TFBS)
    over-represented in regulatory regions of
    co-expressed genes (Discovery)
  • Lab 3.4: Motif Discovery (MEME/Motif-Compare)

5
Restrictions in Coverage
  • Focus on eukaryotic cells and Pol II promoters
  • Principles apply to prokaryotes
  • Will provide suggestions for similar tools for
    other species
  • Many of the examples are drawn from my lab's work;
    there are many equivalent tools (links to be
    provided)

6
Part 1: Introduction to transcription in
eukaryotic cells
7
Transcription Over-Simplified
  • Three-step Process
  • TF binds to TFBS (DNA)
  • TF catalyzes recruitment of polymerase II complex
  • Production of RNA from transcription start site
    (TSS)

[Figure labels: TF, Pol-II, TATA, TFBS, TSS]
8
Anatomy of Transcriptional Regulation
WARNING: Terms vary widely in meaning between scientists
[Figure: gene schematic labelled Distal Regulatory Region, Proximal Regulatory Region, Core Promoter/Initiation Region (Inr), TSR, TATA, TFBS and EXON]
  • Core Promoter: sufficient for initiation of
    transcription; orientation dependent
  • TSR: transcription start region
  • Refers to a region rather than a specific start
    site (TSS)
  • TFBS: single transcription factor binding site
  • Regulatory Regions
  • Proximal/Distal: vague reference to distance
    from TSR
  • May be positive (enhancing) or negative
    (repressing)
  • Orientation independent (generally)
  • Modules: sets of TFBS within a region that
    function together
  • Transcriptional Unit
  • DNA sequence transcribed as a single
    polycistronic mRNA

9
Complexity in Transcription
[Figure labels: Chromatin, Distal enhancer (twice), Proximal enhancer, Core Promoter]
10
Lab Discovery of TF Binding Sites
[Figure: Reporter Gene Activity (axis 0 to 100) for seven LUCIFERASE reporter constructs, one labelled "mutation"]
Identify functional regulatory region within a
sequence and delineate specific TFBS through
mutagenesis (and in vitro binding studies)
11
EMSA/Gel Shift Assays to Identify Binding
Proteins
[Figure labels: TF DNA, DNA]
http://www.biomedcentral.com/content/figures/1741-7015-4-28-8.jpg
12
High-throughput Methods
  • SELEX
  • mix random ds DNA oligonucleotides with TF
    protein, recover TF-DNA complexes and sequence
    DNA
  • Protein Binding Arrays (UniProbe Database)
  • prepare arrays with ds DNA attached, label
    protein with a fluorescent mark and observe DNA
    bound by protein
  • ChIP
  • covalently link proteins to DNA in cell, shear
    DNA, recover protein-DNA complexes and identify
    DNA (PCR, array or sequencing)

13
Promoters
  • In most vertebrates the delineation of the
    transcription start position is not easy
  • cDNA often incomplete at 5' end
  • Multiple promoters for most human genes
  • Referencing position relative to the initiation
    site is therefore not a good idea
  • But done almost uniformly in biological papers
  • Translation start equally problematic
  • Can be in internal exon
  • Multiple ORF start positions common
  • Importance of promoter proximal regions varies
    between species
  • Humans appear to have little enrichment for
    functional sequences; the vast regions to consider
    generally lead to a restricted region around the
    promoter(s), but the justification is not strong
  • Yeast and C. elegans have more compact regions and
    promoter proximity can be a useful property to
    restrict analyses

14
mRNA Caps for Mapping Initiation Sites
  • The 5' end of an mRNA has a cap structure that can be
    precipitated with an antibody
  • Allows for large-scale sequencing of
    full-length cDNAs and tags derived from the
    5' end of mRNAs
  • RIKEN is the leading generator of such sequences
  • Not well represented in genome annotation
    resources (unfortunately)

http://departments.oxy.edu/biology/Stillman/bi221/111300/26_18a.GIF
15
Classes of Initiation Regions
[Figure: CAGE cap tags per position, plotted against position]
This is over-simplified; see the paper for greater
detail. The take-home message is that promoters are
not drawn from a single continuous distribution
of properties, but rather from at least two
classes.
Image from Carninci P, et al. (2006). Genome-wide
analysis of mammalian promoter architecture and
evolution. Nat Genet. Apr 28. PMID: 16645617
16
CpG Islands
  • DNA methylation occurs in competition with
    histone acetylation
  • Acetylation promotes open chromatin structure
    that is permissive for TF binding to DNA
  • Methylation of DNA inhibits histone acetylation
  • Certain TFs promote histone acetylation by
    recruiting acetylases
  • Methylation occurs on cytosines
  • Preferentially on cytosine adjacent to guanines
    (CG dinucleotides, generally referred to as CpG)
  • Methylated cytosines frequently undergo
    deamination to form thymidine (CpG -> TpG)
  • CpG Islands are regions of DNA where CG
    dinucleotides occur at a frequency consistent
    with C and G mononucleotide frequencies
  • Highlight regions of active transcription
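
The island definition above compares the observed number of CG dinucleotides with the number predicted by the C and G mononucleotide frequencies. A minimal Python sketch of that observed/expected ratio follows. The function name `cpg_obs_exp` is illustrative, and the formula (CG count times length, divided by C count times G count) is the commonly used form of the ratio, assumed here rather than taken from the slide.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio for one DNA sequence.

    expected CG count = (count of C * count of G) / length
    ratio             = observed CG count / expected CG count
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    if c == 0 or g == 0:
        return 0.0  # no CG dinucleotide is possible
    cg = seq.count("CG")  # "CG" cannot overlap itself, so this count is exact
    return cg * n / (c * g)


if __name__ == "__main__":
    # A CG-rich stretch scores well above 1; a CG-depleted stretch scores 0.
    print(cpg_obs_exp("CGCGCGCG"))
    print(cpg_obs_exp("CATGCATG"))
```

A ratio near 1 means CG occurs about as often as the base composition predicts, which is the property the slide uses to describe islands; bulk vertebrate DNA, depleted of CG by deamination, falls well below 1.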

17
CpG Islands (2)
  • Important to recognize that promoters selectively
    active after early development will not be
    acetylated (and hence will be methylated) in the
    cell divisions preceding the establishment of
    germ cells and therefore will not have CpG
    islands
  • Lists of genes that have higher or lower CpG
    frequencies than average can misleadingly appear
    to have TF binding motifs based on this
    compositional characteristic
  • CpG Island bias in a gene set can mislead an
    analyst to think that there are patterns of TFBS
    (patterns with internal CG for island-rich and TG
    for island-poor sets)

18
Additional Topics
  • Chromatin modification studies making great
    strides
  • Signatures indicative of active regulatory
    sequences such as H3K4me3
  • Co-activator (p300) ChIP study suggests
    possibility to read-off regulatory regions
  • No methods currently address 3D properties of
    nucleus (long-run will be necessary)

19
Section 3.1: What have we learned?
  • Transcription controlled by regulatory regions
  • Regulatory regions can be distant from initiation
    regions
  • Laboratory methods can identify regulatory
    regions and TF binding sites
  • Concept of single initiation site is flawed
  • Promoters fall into subclasses
  • CpG vs TATA
  • Can impact assessment of TFBS in sets of genes

20
Questions?
  • Please, please, please . . .
  • ASK QUESTIONS
  • . . . now is a great chance.

22
Part 2: Prediction of TF Binding Sites
Teaching a computer to find TFBS
23
Representing Binding Sites for a TF
Set of binding sites:
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA
GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA
AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA
AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA
AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA
AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA
AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
  • A single site
  • AAGTTAATGA
  • A set of sites represented as a consensus
  • VDRTWRWWSHD (IUPAC degenerate DNA)
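
The set of sites on this slide can be tallied column by column into a position frequency matrix (PFM), which keeps the per-position counts that a consensus string discards. A minimal Python sketch follows, assuming the 28 ten-base sites as read from this slide; `build_pfm` is an illustrative name, not a function from any library.

```python
# The 28 sites listed on the slide (ten bases each).
SITES = [
    "AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA",
    "GAGTTAATAA", "CAGTTATTCA", "GAGTTAATAA", "CAGTTAATCA",
    "AGATTAAAGA", "AAGTTAACGA", "AGGTTAACGA", "ATGTTGATGA",
    "AAGTTAATGA", "AAGTTAACGA", "AAATTAATGA", "GAGTTAATGA",
    "AAGTTAATCA", "AAGTTGATGA", "AAATTAATGA", "ATGTTAATGA",
    "AAGTAAATGA", "AAGTTAATGA", "AAGTTAATGA", "AAATTAATGA",
    "AAGTTAATGA", "AAGTTAATGA", "AAGTTAATGA", "AAGTTAATGA",
]


def build_pfm(sites):
    """Count each base at each position of equal-length, pre-aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for position, base in enumerate(site.upper()):
            pfm[base][position] += 1
    return pfm


if __name__ == "__main__":
    pfm = build_pfm(SITES)
    for base in "ACGT":
        print(base, " ".join(f"{count:2d}" for count in pfm[base]))
```

Each column of the result sums to the number of sites, so strongly constrained positions (for example the T at position 4) stand out against the variable ones.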

24
Conversion of PFMs to Position Specific Scoring
Matrices (PSSM)
PSSMs are also known as Position Weight Matrices (PWMs)
Add the following features to the matrix profile:
1. Correct for nucleotide frequencies in genome
2. Weight for the confidence (depth) in the pattern
3. Convert to log-scale probability for easy arithmetic

pfm
A   5    0    1    0    0
C   0    2    2    4    0
G   0    3    1    0    4
T   0    0    1    1    1

pssm
A   1.6 -1.7 -0.2 -1.7 -1.7
C  -1.7  0.5  0.5  1.3 -1.7
G  -1.7  1.0 -0.2 -1.7  1.3
T  -1.7 -1.7 -0.2 -0.2 -0.2

Each pssm cell is computed from the pfm as
$\log\left(\frac{f(b,i) + s(n)}{p(b)}\right)$
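
The three conversion steps can be sketched in Python using the pfm shown on this slide. Several details are assumptions, because the slide does not state them: a total pseudocount of sqrt(N) per column shared according to the background, a uniform background of 0.25 per base, a base-2 logarithm, and normalisation of the corrected counts to a probability before dividing by the background. Under those assumptions the sketch reproduces the slide's pssm values to one decimal place.

```python
import math

# pfm from the slide: counts for each base at five positions.
PFM = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}


def pfm_to_pssm(pfm, background=None):
    """Convert counts to log2 odds scores (assumed sqrt(N) pseudocount)."""
    if background is None:
        background = {base: 0.25 for base in "ACGT"}  # assumed uniform genome
    width = len(pfm["A"])
    pssm = {base: [] for base in "ACGT"}
    for i in range(width):
        depth = sum(pfm[base][i] for base in "ACGT")  # N, sites in this column
        pseudo_total = math.sqrt(depth)               # step 2: weight by depth
        for base in "ACGT":
            corrected = (pfm[base][i] + pseudo_total * background[base]) / (
                depth + pseudo_total
            )
            # steps 1 and 3: divide by background, then take the log
            pssm[base].append(math.log2(corrected / background[base]))
    return pssm


if __name__ == "__main__":
    for base, row in pfm_to_pssm(PFM).items():
        print(base, " ".join(f"{value:5.1f}" for value in row))
```

The pseudocount keeps zero counts from producing a score of minus infinity, and its influence shrinks as more sites are collected.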
25
PSSM Scoring Scales
  • Raw scores
  • Sum of values from indicated cells of the matrix
  • Relative Scores (most common)
  • Normalize the scores to range of 0-1 or 0-100
  • Empirical p-values
  • Based on distribution of scores for some DNA
    sequence, determine a p-value (see next slide)
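
The raw and relative scales can be illustrated with the five-column pssm from the previous slide. This is a minimal sketch: the function names are illustrative, and the relative score is taken to be (raw - minimum possible) / (maximum possible - minimum possible), which is one common way to normalise to the 0-1 range.

```python
# pssm values as printed on the previous slide.
PSSM = {
    "A": [1.6, -1.7, -0.2, -1.7, -1.7],
    "C": [-1.7, 0.5, 0.5, 1.3, -1.7],
    "G": [-1.7, 1.0, -0.2, -1.7, 1.3],
    "T": [-1.7, -1.7, -0.2, -0.2, -0.2],
}


def raw_score(pssm, site):
    """Sum the matrix cell selected by each base of the candidate site."""
    return sum(pssm[base][i] for i, base in enumerate(site.upper()))


def score_range(pssm):
    """Lowest and highest raw scores any sequence can obtain."""
    width = len(pssm["A"])
    lowest = sum(min(pssm[b][i] for b in "ACGT") for i in range(width))
    highest = sum(max(pssm[b][i] for b in "ACGT") for i in range(width))
    return lowest, highest


def relative_score(pssm, site):
    """Raw score rescaled so 0 is the worst and 1 the best possible site."""
    lowest, highest = score_range(pssm)
    return (raw_score(pssm, site) - lowest) / (highest - lowest)


if __name__ == "__main__":
    for site in ("AGCCG", "ACCCG", "CAAAA"):
        print(site, round(raw_score(PSSM, site), 1),
              round(relative_score(PSSM, site), 2))
```

Multiplying the relative score by 100 gives the 0-100 scale mentioned above.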

26
Detecting binding sites in a single sequence
Raw scores: Sp1 example, Abs_score = 13.4 (sum of column scores)
Empirical p-value scores: area to right of value / area under entire curve
[Figure: histogram of Frequency (0.0 to 0.3) against Relative Score (0.0 to 1.0)]
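
The empirical p-value on this slide is the fraction of the background score distribution lying at or above the observed score. A minimal Python sketch, assuming the background scores have already been collected by scanning some reference DNA; the function name and the toy numbers are illustrative only.

```python
def empirical_p_value(score, background_scores):
    """Fraction of background scores that are >= the observed score."""
    if not background_scores:
        raise ValueError("background_scores must not be empty")
    at_or_above = sum(1 for value in background_scores if value >= score)
    return at_or_above / len(background_scores)


if __name__ == "__main__":
    # Toy background of relative scores; real use needs many more values.
    background = [0.10, 0.35, 0.50, 0.62, 0.71, 0.80, 0.88, 0.93]
    print(empirical_p_value(0.80, background))
```

With a finite background the smallest p-value that can be reported is 1 divided by the number of background scores, so a large background sample is needed for stringent thresholds.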
27
JASPAR: AN OPEN-ACCESS DATABASE OF TF BINDING
PROFILES (jaspar.genereg.net)
28
The Good
  • Tronche (1997) tested 50 predicted HNF1 TFBS
    using an in vitro binding test and found that 96%
    of the predicted sites were bound!
  • Stormo and Fields (1998) found in detailed
    biochemical studies that the best weight matrices
    produce scores highly correlated with in vitro
    binding energy

[Figure: BINDING ENERGY plotted against PSSM SCORE]
29
the Bad
  • Fickett (1995) found that a profile for the myoD
    TF made predictions at a rate of 1 per 500bp of
    human DNA sequence
  • This corresponds to an average of 20 sites / gene
    (assuming 10,000 bp as average gene size)

30
and the Ugly!
Human Cardiac α-Actin gene analyzed with a set of
profiles (each line represents a TFBS prediction)
Futility Conjecture: TFBS predictions are almost
always wrong
Red boxes are protein coding exons; TFBS
predictions excluded in this analysis
31
ADVANCED TOPIC: Issues of Column Independence
  • PSSM model assumes independence between positions
  • For example, if you observe a G at position 2,
    the model assumes there is no influence on the
    likelihood of a T at position 3 - this is known
    to be an incorrect assumption
  • Other models can represent dependence
  • Hidden Markov models of Nth order, where N
    refers to the number of influencing positions
  • For the cases where there are hundreds of TFBS
    known for a TF, there has been only modest
    improvement in the specificity of TFBS
    predictions using advanced column inter-dependent
    models
  • The newly emerging ChIP-Seq data collections will
    ultimately lead to the systematic use of more
    advanced models (not likely to advance to wet
    labs for 3 years)

32
A Conundrum
  • Counter to intuition, the ratio of true positives
    to predictions fails to improve for stringent
    thresholds
  • For most predictive models this ratio would
    increase
  • Why?
  • True binding sites are defined by properties not
    incorporated into the profile scores - above some
    threshold all sites could be bound if accessible

33
Section 3.2A: What have we learned?
  • PSSMs accurately reflect in vitro binding
    properties of DNA binding proteins
  • Suitable binding sites occur at a rate far too
    frequent to reflect in vivo function
  • Bioinformatics methods that use PSSMs for binding
    site studies must incorporate additional
    information to enhance specificity
  • Unfiltered predictions are too noisy for most
    applications
  • Organisms with short regulatory sequences are
    less problematic (e.g. yeast and bacteria)

34
Using Phylogenetic Footprinting to Improve TFBS
Discrimination
  • 70,000,000 years of evolution can reveal
    regulatory regions

35
Phylogenetic Footprinting
[Figure: FoxC2, a single exon gene; Human-Mouse Identity (%) on an axis from 0 to 100]
  • Align orthologous gene sequences (e.g. LAGAN)
  • For the first window of 100 bp of sequence 1,
    determine the % with an identical match in
    sequence 2
  • Step across the first sequence, recording the
    percentage of identical nucleotides in each
    window
  • Observe that the single exon contains a region of
    high identity that corresponds to the ORF, with
    lower identity in the 5' and 3' UTRs
  • Additional conserved regions could be regulatory
    regions
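
The windowing procedure in the list above can be sketched in Python. The sketch assumes the two sequences have already been aligned (for example by LAGAN) into equal-length strings with "-" for gaps, and it slides the window over alignment columns rather than over positions of the first sequence, which is a simplification. The function name and parameters are illustrative.

```python
def window_identity(aligned1, aligned2, window=100, step=1):
    """Percent identity in each window of a pairwise alignment.

    Returns a list of (window start column, percent identical) pairs.
    Gap columns never count as matches.
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for start in range(0, len(aligned1) - window + 1, step):
        pairs = zip(aligned1[start:start + window],
                    aligned2[start:start + window])
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        profile.append((start, 100.0 * matches / window))
    return profile


if __name__ == "__main__":
    human = "ACGTACGTAC-TTGCA"
    mouse = "ACGTTCGTACGTTGGA"
    for start, identity in window_identity(human, mouse, window=8, step=4):
        print(start, identity)
```

Plotting the second value against the first gives the identity profile in which exons and candidate regulatory regions appear as peaks.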

36
Phylogenetic Footprinting (cont)
[Figure: % Identity plotted against 200 bp window start position (human sequence)]
Actin gene compared between human and mouse
37
Multi-species Phylogenetic Footprinting
  • PhastCons scores indicate the regions of DNA
    which are unusual in their sequence composition
    in some subset of organisms

38
Phylogenetic Footprints in UCSC Genome Browser
  • PhastCons (region score)
  • PhyloP (position score)

39
Phylogenetic Footprinting Dramatically Reduces
Spurious Hits
Actin, alpha cardiac
40
TFBS Prediction with Human-Mouse Pairwise
Phylogenetic Footprinting
[Figure axes: SELECTIVITY, SENSITIVITY]
  • Testing set: 40 experimentally defined sites in
    15 well studied genes (replicated with 100 site
    set)
  • 75-80% of defined sites detected with
    conservation filter, while only 11-16% of total
    predictions retained

41
1kbp insulin receptor promoter screened with
footprinting
42
Choosing the right species for pairwise
comparison...
[Figure panels: CHICKEN/HUMAN, MOUSE/HUMAN, COW/HUMAN]
43
ConSite
44
TFBS Discrimination Tools
  • Phylogenetic Footprinting Servers
  • FOOTER: http://biodev.hgen.pitt.edu/footer_php/Footerv2_0.php
  • CONSITE: http://asp.ii.uib.no:8090/cgi-bin/CONSITE/consite/
  • rVISTA: http://rvista.dcode.org/
  • ORCAtk: http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk
  • SNPs in TFBS Analysis
  • RAVEN: http://burgundy.cmmt.ubc.ca/cgi-bin/RAVEN/a?rmhome
  • Prokaryotes or Yeast
  • PRODORIC: http://prodoric.tu-bs.de/
  • YEASTRACT: http://www.yeastract.com/index.php
  • Software Packages
  • TOUCAN: http://homes.esat.kuleuven.be/saerts/software/toucan.php
  • Programming Tools
  • TFBS: http://tfbs.genereg.net/
  • ORCAtk: http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk

45
Analysis of TFBS with Phylogenetic Footprinting
Scanning a single sequence
  • Low specificity of profiles
  • too many hits
  • great majority not biologically significant
Scanning a pair of orthologous sequences for
conserved patterns in conserved sequence regions
  • A dramatic improvement in the percentage of
    biologically significant detections

46
Section 3.2B: What have we learned?
  • TFBS discrimination coupled with phylogenetic
    footprinting has greater specificity with
    tolerable loss of sensitivity
  • As with any purification process, some true
    binding sites will be lost
  • Available online resources support phylogenetic
    footprinting

47
Questions?
  • Please Ask

48
Laboratory Exercise 3.2
  • TF Binding Site Prediction

49
20 minute break
  • Until 10:50 am
  • Next: Sections 3.3 and 3.4

51
Part 3: Inferring Regulating TFs for Sets of
Co-Expressed Genes
52
TFBS Over-representation
  • Akin to the GO studies yesterday, we seek to
    determine if a set of co-expressed genes contains
    an over-abundance of predicted binding sites for
    a known TF
  • Phylogenetic footprinting to reduce false
    prediction rate

53
Two Examples of TFBS Over-Representation
More Genes with TFBS
54
Statistical Methods for Identifying
Over-represented TFBS
  • Binomial test (Z scores)
  • Based on the number of occurrences of the TFBS
    relative to background
  • Normalized for sequence length
  • Simple binomial distribution model
  • Fisher exact probability scores
  • Based on the number of genes containing the TFBS
    relative to background
  • Hypergeometric probability distribution
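
Both statistics in the list above can be sketched with the Python standard library. This is a simplified illustration, not the oPOSSUM implementation: the z-score uses a plain normal approximation to the binomial with the background hit rate per nucleotide, the Fisher score is the one-tailed hypergeometric probability, and all function and parameter names are assumptions.

```python
import math


def binomial_z_score(target_hits, target_length,
                     background_hits, background_length):
    """Z-score for site occurrences, normalised for sequence length."""
    rate = background_hits / background_length        # hits per nucleotide
    expected = rate * target_length
    std_dev = math.sqrt(target_length * rate * (1.0 - rate))
    return (target_hits - expected) / std_dev


def fisher_one_tailed(population_size, population_with_site,
                      target_size, target_with_site):
    """P(at least `target_with_site` genes carry the site) when the target
    set is drawn at random from the population (hypergeometric tail)."""
    total = math.comb(population_size, target_size)
    upper = min(population_with_site, target_size)
    tail = sum(
        math.comb(population_with_site, k)
        * math.comb(population_size - population_with_site, target_size - k)
        for k in range(target_with_site, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers: 20 hits in 1 kb of target versus 100 hits in 10 kb background.
    print(round(binomial_z_score(20, 1000, 100, 10000), 2))
    # Toy numbers: 2 of 3 target genes have the site; 4 of 10 genes overall do.
    print(round(fisher_one_tailed(10, 4, 3, 2), 3))
```

The two tests answer different questions: the z-score reacts to many sites concentrated in a few genes, while the Fisher score counts each gene only once.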

55
Validation using Reference Gene Sets
TFs with experimentally-verified sites in the
reference sets.
56
Empirical Selection of Parameters based on
Reference Studies
57
C-Myc SAGE Data
  • c-Myc transcription factor dimerizes with the Max
    protein
  • Key regulator of cell proliferation,
    differentiation and apoptosis
  • Menssen and Hermeking identified 216 different
    SAGE tags corresponding to unique mRNAs that were
    induced after adenoviral expression of c-Myc in
    HUVEC cells
  • They then went on to confirm the induction of 53
    genes using microarray analysis and RT-PCR

59
Structurally-related TFs with Indistinguishable
TFBS
  • Most structurally related TFs bind to highly
    similar patterns
  • Zn-finger is a big exception

60
oPOSSUM Server
61
Ets Factor Family
  • EG232974
  • EG432800
  • Ehf
  • Elf1
  • Elf2
  • Elf3
  • Elf4
  • Elf5
  • Elk1
  • Elk3
  • Elk4
  • Erf
  • Erg
  • Ets1
  • Ets2
  • Etv1
  • Etv2
  • Etv3
  • Etv3l
  • Etv4
  • Etv5
  • Etv6
  • Fev
  • Fli1
  • Gabpa
  • LOC100
  • LOC100
  • factor)
  • LOC634494
  • Sfpi1
  • Spdef
  • Spib
  • Spic
  • How to pick which one?
  • At this stage there are TF catalogs coming that
    will be coupled to characteristics.
  • Candidate gene prioritization software can be
    used (such as TOPPGENE)

62
Section 3.3: What have we learned?
  • New generation of tools to help interrogate the
    meaning of observed clusters of co-expressed
    genes
  • Generally best performance has been with data
    directly linked to a transcription factor
  • Highly dependent on the experimental design:
    cannot overcome noisy data from poor design
    (Recall Day 1)
  • The identity of a mediating TF may not be
    apparent when many proteins can bind to the same
    motif

63
Questions?
  • Now is a good time

64
Laboratory Exercise 3.3
  • TFBS Over-Representation Analysis

66
Part 4: de novo Discovery of TF Binding Sites
67
de novo Pattern Discovery
  • String-based
  • e.g. YMF (Sinha & Tompa)
  • Generalization: Identify over-represented
    oligomers in comparison of + and - (or
    complete) promoter collections
  • Used often for yeast promoter analysis
  • Profile-based
  • e.g. AnnSpec (Workman & Stormo) or MEME (Bailey &
    Elkan)
  • Generalization: Identify strong patterns in
    promoter collection vs. background model of
    expected sequence characteristics

68
Assessing Discovered Patterns
  • Strength
  • Similarity search

69
String-based methods(1)
How likely are X words in a set of sequences,
given background sequence characteristics?
CCCGCCGGAATGAAATCTGATTGACATTTTCC >EP71002 (+) CeIV msp-56 B range -100 to -75
TTCAAATTTTAACGCCGGAATAATCTCCTATT >EP63009 (+) Ce Cuticle Col-12 range -100 to -75
TCGCTGTAACCGGAATATTTAGTCAGTTTTTG >EP63010 (+) Ce Cuticle Col-13 range -100 to -75
TATCGTCATTCTCCGCCTCTTTTCTT >EP11013 (+) Ce vitellogenin 2 range -100 to -75
GCTTATCAATGCGCCCGGAATAAAACGCTATA >EP11014 (+) Ce vitellogenin 5 range -100 to -75
CATTGACTTTATCGAATAAATCTGTT >EP11015 (-) Ce vitellogenin 4 range -100 to -75
ATCTATTTACAATGATAAAACTTCAA >EP11016 (+) Ce vitellogenin 6 range -100 to -75
ATGGTCTCTACCGGAAAGCTACTTTCAGAATT >EP11017 (+) Ce calmodulin cal-2 range -100 to -75
TTTCAAATCCGGAATTTCCACCCGGAATTACT >EP63007 (-) Ce cAMP-dep. PKR P1 range -100 to -75
TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC >EP63008 (+) Ce cAMP-dep. PKR P2 range -100 to -75
ACTGAACTTGTCTTCAAATTTCAACACCGGAA >EP17012 (+) Ce hsp 16K-1 A range -100 to -75
TCAATGCCGGAATTCTGAATGTGAGTCGCCCT >EP55011 (-) Ce hsp 16K-1 B range
70
String-based methods(2)
Find all words of length n in the yeast promoters
(e.g. n=7)
GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGC
GTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGACGGTAAGAAGATCA
CTTCTAACCAAAGAATTGTTGCTGCTTTGCCAACCATCAAGTACGTTTTG
GAACACCACCCAAGATACGTTGTCTTGTTCTCACTTGGGTAGACCAAACG
GTGAAAGAAACGAAAAATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAA
TCATTGTTGGGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGA
AGTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTTTGTTGG
AAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGAAAGGTCGATGGTC
AAAAGGTCAAGGCTCAAGGAAGATGTTCAAAAGTTCAGACACGAATTGAG
CTCTTTGGCTGATGTTTACATCACGATGCCTTCGGTACCGCTCACAGAGC
TCACTCTTCTATGGTCGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTG
TTGGAAAAGGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAG
ACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAGATTCAAT
TGATTGACAACTTGTTGGACAAGGTCGACTCTATCATCATTGGTGGTGGT
ATGGCTTTCCCTTCAAGAAGGTTTTGGAAAACACTGAAATCGGTGACTCC
ATCTTCGACAAGGCTGGTGCTGAAATCGTTCCAAAGTTGATGGAAAAGGC
CAAGGCCAAGGGTGTCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGA
TGCTTTCTCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGTA
TTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCTAGAAAGTGT
TTGCTGCTACTGTTGCAAAGGCTAAGACCATTGTCTGGAACGGTCCACCA
GGTGTTTTCGAATTCGAAAAGTTCGCTGCTGGTACTAAGGCTTTGTTAGA
CGAAGTTGTCAAGAGCTCTGCTGCTGGTAACACCGTCATCATTGGTGGTG
GTGACACTGCCA
Make a lookup table:
AAACCTTT  456
TTTTTTTT  57788
GATAGGCA  589
Etc...
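
Building the lookup table amounts to sliding a window of length n along every promoter and counting each word seen. A minimal Python sketch follows; the function name and the short example sequences are illustrative, not the yeast promoter set used on the slide.

```python
from collections import Counter


def count_words(sequences, n=7):
    """Count every overlapping word of length n across a set of sequences."""
    counts = Counter()
    for sequence in sequences:
        sequence = sequence.upper()
        for start in range(len(sequence) - n + 1):
            counts[sequence[start:start + n]] += 1
    return counts


if __name__ == "__main__":
    promoters = [
        "GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGG",
        "CTTCTAACCAAAGAATTGTTGCTGCTTTGCCAACC",
    ]
    table = count_words(promoters, n=7)
    for word, count in table.most_common(3):
        print(word, count)
```

Counting the same words in a background promoter collection gives the expected frequencies needed to judge which words are over-represented.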
71
String-based methods(3)
X_w: instances of a word w within our set of X genes
E[X_w]: average number of instances of w based on
number of genes in our set
Var[X_w]: variance, how much deviation from the
average is expected for w
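
These three quantities are typically combined into a z-score for each word, as in string-based tools of the YMF type. The expression below is that standard form written with the slide's symbols; it is offered as the likely intended statistic, an assumption rather than a formula recovered from the slide.

```latex
Z_w = \frac{X_w - E[X_w]}{\sqrt{\mathrm{Var}[X_w]}}
```

A large positive Z_w marks a word that occurs far more often in the gene set than the background predicts.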
72
Limitations of String-based Methods
  • Longer word lengths not possible
  • While degeneracy codes can be used, TFBS are not
    words: we lose quantitation for variable
    positions with consensus sequences
  • Imagine a column in a PFM with 7 As and 1 T: in a
    consensus sequence we would represent it as W or
    throw out the instance with T
  • Recently the string-based method has found
    renewed utility in the analysis of 3' UTRs for the
    presence of microRNA target sequences...
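
The loss of quantitation in the PFM-column example above can be shown with a small Python sketch that maps a column of counts to its IUPAC degeneracy code. The mapping table is the standard IUPAC nucleotide code; the function name and the rule of including every base with a non-zero count are illustrative assumptions.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus_symbol(column_counts):
    """IUPAC code covering every base observed at least once in a column."""
    observed = frozenset(b for b, count in column_counts.items() if count > 0)
    if not observed:
        raise ValueError("column has no observations")
    return IUPAC[observed]


if __name__ == "__main__":
    print(consensus_symbol({"A": 7, "C": 0, "G": 0, "T": 1}))
    print(consensus_symbol({"A": 1, "C": 0, "G": 0, "T": 7}))
```

Both example columns collapse to the same symbol even though one strongly prefers A and the other strongly prefers T, which is exactly the information a matrix keeps and a consensus word loses.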

73
microRNA Target Sequences
  • Lim et al expressed miRNAs in cells and observed
    that the overall pattern of gene expression
    shifted toward the pattern of expression observed
    in cells which naturally express the miRNA
  • The genes with reduced expression in response to
    miRNA exposure shared 7 nt motifs in the 3' UTR of
    their transcripts
  • Nice website tutorial
  • http://www.ambion.com/main/explorations/mirna.html

74
Probabilistic Methods for Pattern Discovery
  • What is a probabilistic method?
  • The Gibbs sampler algorithm

75
Probabilistic Methods
  • Overview: Find a local alignment of width x of
    sites that maximizes information content (or
    related measure) in reasonable time
  • Usually by Gibbs sampling or EM methods
  • Motivation: TFBS are not words
  • Efficiency: can handle longer patterns than
    string-based methods
  • Can be intentionally influenced to reflect prior
    knowledge
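
The information content being maximized can be computed per column of an alignment. A minimal Python sketch follows, assuming a uniform background so that each DNA column carries between 0 and 2 bits, and omitting any small-sample correction; the function names are illustrative.

```python
import math


def column_information(counts):
    """Information content in bits of one column, given counts for A, C, G, T.

    Uses 2 + sum(p * log2(p)), i.e. 2 bits minus the column's entropy.
    """
    total = sum(counts)
    bits = 2.0
    for count in counts:
        if count > 0:
            p = count / total
            bits += p * math.log2(p)
    return bits


def alignment_information(columns):
    """Total information content of an alignment: the sum over its columns."""
    return sum(column_information(column) for column in columns)


if __name__ == "__main__":
    # Columns of the five-position example pfm, as [A, C, G, T] counts.
    columns = [[5, 0, 0, 0], [0, 2, 3, 0], [1, 2, 1, 1],
               [0, 4, 0, 1], [0, 0, 4, 1]]
    print(round(alignment_information(columns), 2))
```

A fully conserved column contributes 2 bits and a column with equal counts contributes 0, so a search that raises this total is pulling the chosen sites toward a shared pattern.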
76
What does probabilistic mean?
  • Based on probability
  • Functionally, it means we're going to guess our
    way to a good pattern (TFBS)
  • We're going to try to make a good guess
  • Two different flavours of the approach
  • Expectation Maximization, in which we try to make
    the best guess each time
  • Gibbs Sampling, in which we make our guesses based
    on the strength of our conviction

77
Gibbs Sampling
Two data structures used:
1) Current pattern nucleotide frequencies q_{i,1}, ..., q_{i,4} and
corresponding background frequencies p_{i,1}, ..., p_{i,4}
2) Current positions of site startpoints in the N sequences
a_1, ..., a_N, i.e. the alignment that contributes to q_{i,j}
One starting point in each sequence is chosen randomly
initially.
[Figure: example aligned sites tgacttcc, tgatctct, agacctca, tgacctct]
78
Iterations in Gibbs Sampling
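Remove one sequence z from the set. Update the
current pattern according to the slide's formula
(equation image not captured in this transcript), whose
annotated terms are:
  • Pseudocount for symbol j
  • Sum of all pseudocounts in column
[Figure: sites tgacttcc, tgatctct, agacctca, tgacctct, with labels A and z]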
79
Gibbs Sampling (grossly over-simplified)
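
The two data structures and the remove-one-sequence iteration from the preceding slides can be sketched as a compact site sampler in Python. This is a simplified illustration under stated assumptions, not the implementation behind the slides: the pattern update is taken to be the usual form q_{i,j} = (c_{i,j} + b_j) / (N - 1 + B), with b_j the pseudocount for symbol j and B the sum of pseudocounts in the column; the background is uniform; exactly one site per sequence is assumed; and all names and default values are illustrative. Input sequences must be upper-case A, C, G, T only.

```python
import random

BASES = "ACGT"


def profile_without(seqs, starts, width, skip, pseudo=0.5):
    """Pattern frequencies q[i][base] built from every sequence except `skip`."""
    used = len(seqs) - 1                    # N - 1 sequences contribute counts
    pseudo_sum = pseudo * len(BASES)        # B, sum of pseudocounts per column
    counts = [{base: pseudo for base in BASES} for _ in range(width)]
    for index, (seq, start) in enumerate(zip(seqs, starts)):
        if index == skip:
            continue
        for i in range(width):
            counts[i][seq[start + i]] += 1
    return [{base: column[base] / (used + pseudo_sum) for base in BASES}
            for column in counts]


def gibbs_sample(seqs, width, iterations=200, seed=0, background=0.25):
    """Return one sampled site start per sequence after `iterations` sweeps."""
    rng = random.Random(seed)
    # One starting point in each sequence is chosen randomly at the outset.
    starts = [rng.randrange(len(seq) - width + 1) for seq in seqs]
    for _ in range(iterations):
        for z, seq in enumerate(seqs):
            q = profile_without(seqs, starts, width, skip=z)
            # Weight each candidate start by pattern/background likelihood.
            weights = []
            for start in range(len(seq) - width + 1):
                weight = 1.0
                for i in range(width):
                    weight *= q[i][seq[start + i]] / background
                weights.append(weight)
            # Sample, rather than maximise: this is the "strength of
            # conviction" step that distinguishes Gibbs sampling from EM.
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


if __name__ == "__main__":
    # Toy sequences embedding the four example sites in flanking bases.
    seqs = ["TTGACTTCCAA", "CATGATCTCTG", "GGAGACCTCAT", "ATGACCTCTCC"]
    starts = gibbs_sample(seqs, width=8, iterations=50, seed=1)
    for seq, start in zip(seqs, starts):
        print(start, seq[start:start + 8])
```

Because each new position is sampled in proportion to its weight, separate runs with different seeds can settle on different patterns, which is why the procedure is normally repeated many times.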
80
Pattern Discovery
  • Gibbs sampling is guaranteed to return an optimal
    pattern if repeated sufficiently often
  • Procedure is fast, so running many 1000s of times
    is feasible
  • Unfortunately, we have a problem: what if the
    mediating TFBS are not strongly over-represented
    relative to other patterns?

81
Applied Pattern Discovery is Acutely Sensitive to
Noise
True Mef2 Binding Sites
82
Four Approaches to Improve Sensitivity
  • Better background models
  • Higher-order properties of DNA
  • Phylogenetic Footprinting
  • Human-Mouse comparison eliminates 75% of
    sequence
  • Regulatory Modules
  • Architectural rules
  • Limit the types of binding profiles allowed
  • TFBS patterns are NOT random

83
Pattern Discovery Summary
  • Pattern discovery methods can recover
    over-represented patterns in the promoters of
    co-expressed genes
  • Methods are acutely sensitive to noise,
    indicating that the signal we seek is weak
  • TFs tolerate great variability between binding
    sites
  • As for pattern discrimination, supplementary
    information/approaches are required to overcome
    the noise

84
Questions?
  • Winding down

85
Laboratory Exercise 3.4
  • Motif Discovery

86
REFLECTIONS
  • Part 2
  • Futility Theorem: Essentially, predictions of
    individual TFBS have no relationship to an in
    vivo function
  • Successful bioinformatics methods for site
    discrimination incorporate additional information
    (clusters, conservation)
  • Part 3
  • TFBS over-representation is a powerful new means
    to identify TFs likely to contribute to observed
    patterns of co-expression
  • Part 4
  • Pattern discovery methods are severely restricted
    by the Signal-to-Noise problem
  • Observed patterns must be carefully considered
  • Successful methods for pattern discovery will
    have to incorporate additional information
    (conservation, structural constraints on TFs)

88
THE END
  • Questions before the break?
  • Lab exercises address Sections 2 and 3

89
LUNCH
  • On your own
  • (Food court Downstairs)
  • Back at ??