Title: Computational Genomics and Proteomics
1Computational Genomics and Proteomics
Lecture 8 Motif Discovery
2Outline
- Gene Regulation
  - DNA
  - Transcription factors
- Motifs
  - What are they?
  - Binding sites
- Combinatoric Approaches
  - Exhaustive searches
  - Consensus
- Comparative Genomics
  - Example
- Probabilistic Approaches
  - Statistics
  - EM algorithm
  - Gibbs Sampling
3-5 Figures from www.accessexcellence.org
6Four DNA nucleotide building blocks
G-C is more strongly hydrogen-bonded than A-T (three hydrogen bonds versus two)
7Degenerate code
- Four bases: A, C, G, T
- Two-fold degenerate IUB codes
  - R = {A, G} (purines)
  - Y = {C, T} (pyrimidines)
  - K = {G, T}
  - M = {A, C}
  - S = {G, C}
  - W = {A, T}
- Four-fold degenerate
  - N = {A, G, C, T}
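The degenerate codes on this slide can be used directly to test whether a site fits a degenerate pattern. The following is a minimal sketch, not part of the original slides; the function names `matches` and `find_sites` are hypothetical, and only the codes listed on the slide are included.

```python
# IUB codes from the slide, mapped to the bases each one stands for.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purines
    "Y": "CT",    # pyrimidines
    "K": "GT",
    "M": "AC",
    "S": "GC",
    "W": "AT",
    "N": "AGCT",  # any base
}


def matches(pattern, site):
    """Return True if every base of `site` is allowed by the degenerate `pattern`."""
    if len(pattern) != len(site):
        return False
    return all(base in IUB_CODES[code] for code, base in zip(pattern, site))


def find_sites(pattern, sequence):
    """Return the 0-based start offsets where `pattern` matches `sequence`."""
    width = len(pattern)
    return [i for i in range(len(sequence) - width + 1)
            if matches(pattern, sequence[i:i + width])]
```

For example, `matches("TATRNT", "TATGAT")` is true because R allows G and N allows any base, while `matches("TATRNT", "TATCAT")` is false because R does not allow C.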
8Transcription Factors
- Required but not a part of the RNA polymerase complex
- Many different roles in gene regulation
  - Binding
  - Interaction
  - Initiation
  - Enhancing
  - Repressing
- Various structural classes (e.g. zinc finger domains)
- Consist of both a DNA-binding domain and an interactive domain
9Motifs
- Short sequences of DNA or RNA (or amino acids)
- Often consist of 5-16 nucleotides
- May contain gaps
- Examples include
  - Splice sites
  - Start/stop codons
  - Transmembrane domains
  - Centromeres
  - Phosphorylation sites
  - Coiled-coil domains
  - Transcription factor binding sites (TFBSs, regulatory motifs)
10TFBSs
- Difficult to identify
- Each transcription factor may have more than one binding site
- Degenerate
- Most occur upstream of the transcription start site (TSS) but are known to also occur in
  - introns
  - exons
  - 3' UTRs
- Usually occur in clusters, i.e. collections of sites within a region (modules)
- Often repeated
- Sites can be experimentally verified
11Why are TFBSs important?
- Aid in identification of gene networks/pathways
- Determine correct network structure
- Drug discovery
- Switch production of gene product on/off
(Figure: Gene A and Gene B)
12Consensus sequences
- Matches all of the example sequences closely but not exactly
- A single site
  - TACGAT
- A set of sites
  - TACGAT
  - TATAAT
  - TATAAT
  - GATACT
  - TATGAT
  - TATGTT
- Consensus sequence
  - TATAAT or
  - TATRNT
- Trade-off between the number of mismatches allowed, the ambiguity in the consensus sequence, and the sensitivity and precision of the representation.
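Both consensus forms on this slide can be derived column by column from the six sites. The sketch below is an illustration under stated assumptions: ties are broken alphabetically, and the degenerate rule (keep a base that covers at least 75% of a column, use a two-fold code when exactly two bases occur, otherwise N) is a hypothetical rule chosen because it reproduces TATRNT, not necessarily the rule used in the lecture.

```python
from collections import Counter

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]

# Two-fold degenerate IUB codes, keyed by the sorted pair of bases they cover.
TWO_FOLD = {"AG": "R", "CT": "Y", "GT": "K", "AC": "M", "CG": "S", "AT": "W"}


def majority_consensus(sites):
    """Most frequent base per column; ties are broken alphabetically (assumption)."""
    consensus = []
    for column in zip(*sites):
        counts = Counter(column)
        consensus.append(max(sorted(counts), key=counts.__getitem__))
    return "".join(consensus)


def degenerate_consensus(sites, threshold=0.75):
    """Hypothetical rule: a dominant base is kept, two observed bases give a
    two-fold code, and anything more ambiguous becomes N."""
    consensus = []
    for column in zip(*sites):
        counts = Counter(column)
        base, count = max(sorted(counts.items()), key=lambda item: item[1])
        if count / len(column) >= threshold:
            consensus.append(base)
        elif len(counts) == 2:
            consensus.append(TWO_FOLD["".join(sorted(counts))])
        else:
            consensus.append("N")
    return "".join(consensus)


def mismatches(consensus, site):
    """Number of positions at which `site` differs from `consensus`."""
    return sum(a != b for a, b in zip(consensus, site))
```

With these sites, `majority_consensus(SITES)` gives TATAAT and `degenerate_consensus(SITES)` gives TATRNT. Counting `mismatches("TATAAT", s)` for each site shows the trade-off: the strict consensus needs up to two mismatches to recover all six sites, whereas the degenerate form tolerates more variation at the cost of specificity.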
13Information Content and Entropy
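The formulas from this slide did not survive extraction. As a hedged illustration only, the sketch below uses one common definition for DNA motifs: the information content of a position is 2 bits minus the Shannon entropy of its base frequencies, assuming a uniform background and no small-sample correction. The example sites are the six sites from the consensus slide; the function names are hypothetical.

```python
import math
from collections import Counter

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]


def column_entropy(column):
    """Shannon entropy in bits of the base frequencies in one alignment column."""
    total = len(column)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(column).values())


def information_content(sites):
    """Per-position information content, 2 - H, assuming a uniform background."""
    return [2.0 - column_entropy(column) for column in zip(*sites)]
```

A fully conserved position (such as the A at position 2) carries 2 bits, while a position split evenly between two bases (such as position 4, A or G) carries 1 bit.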
14Sequence Logos
15Frequency Matrices
- Given a collection of motifs,
  - TACGAT
  - TATAAT
  - TATAAT
  - GATACT
  - TATGAT
  - TATGTT
- Create the matrix (rows T, A, C, G; one column per motif position)
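The matrix entries were lost in extraction, but they follow directly from the six sites. The sketch below recomputes them, using the row order T, A, C, G from the slide; whether the original slide showed raw counts or relative frequencies is not known, so both are provided. The function names are hypothetical.

```python
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
BASES = "TACG"  # row order as on the slide


def count_matrix(sites, bases=BASES):
    """Rows are bases, columns are motif positions, entries are raw counts."""
    width = len(sites[0])
    return {b: [sum(site[i] == b for site in sites) for i in range(width)]
            for b in bases}


def frequency_matrix(sites, bases=BASES):
    """Same layout as count_matrix, with counts divided by the number of sites."""
    n = len(sites)
    return {b: [c / n for c in row]
            for b, row in count_matrix(sites, bases).items()}


for base, row in count_matrix(SITES).items():
    print(base, *row)
```

For these sites the count rows are T: 5 0 5 0 1 6, A: 0 6 0 3 4 0, C: 0 0 1 0 1 0 and G: 1 0 0 3 0 0; every column sums to 6, the number of sites.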
16Position weight matrices
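The content of this slide was not preserved. As an illustration of the general idea, the sketch below turns the six example sites into a log-odds position weight matrix and uses it to score candidate words. The pseudocount of 1 and the uniform background of 0.25 are assumptions of this sketch, and the function names are hypothetical.

```python
import math

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
BASES = "ACGT"


def position_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Log2-odds weight of each base at each position.

    The pseudocount keeps unseen bases from getting a weight of minus infinity.
    """
    n = len(sites)
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        weights = {}
        for b in BASES:
            freq = (column.count(b) + pseudocount) / (n + pseudocount * len(BASES))
            weights[b] = math.log2(freq / background)
        pwm.append(weights)
    return pwm


def score(pwm, word):
    """Sum of the position-specific weights for `word`."""
    return sum(weights[base] for weights, base in zip(pwm, word))


def best_hit(pwm, sequence):
    """0-based start of the highest-scoring window in `sequence`."""
    width = len(pwm)
    return max(range(len(sequence) - width + 1),
               key=lambda i: score(pwm, sequence[i:i + width]))
```

Scanning a new promoter region then amounts to sliding a window over it and reporting windows whose score exceeds a chosen cut-off; the cut-off itself is a design choice that trades sensitivity against precision.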
17Finding Motifs
- Two problems
  - Given a collection of known motifs, develop a representation of the motifs such that additional occurrences can reliably be identified in new promoter regions
  - Given a collection of genes thought to be related somehow, find the location of the motif common to all and a representation for it
- Two approaches
  - Combinatorial
  - Probabilistic
18Combinatorial Approach
19Exhaustive Search
20Exhaustive Search
Sample-driven here refers to trying all the words
as they occur in the sequences, instead of trying
all possible (4^W) words of width W exhaustively.
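The difference between the two search strategies can be sketched as follows. This is a minimal illustration, not the algorithm from the slides: both functions score a candidate word by its total Hamming distance to the best-matching window in each sequence (an assumed scoring function), and they differ only in where the candidates come from.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def windows(sequence, width):
    """All words of the given width that occur in the sequence."""
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]


def total_distance(word, sequences):
    """Sum over sequences of the distance to the closest window."""
    return sum(min(hamming(word, w) for w in windows(s, len(word)))
               for s in sequences)


def sample_driven_search(sequences, width):
    """Candidates are only the words that actually occur in the input."""
    candidates = sorted({w for s in sequences for w in windows(s, width)})
    return min(candidates, key=lambda w: total_distance(w, sequences))


def pattern_driven_search(sequences, width):
    """Candidates are all 4^W words over A, C, G, T."""
    candidates = ("".join(p) for p in product("ACGT", repeat=width))
    return min(candidates, key=lambda w: total_distance(w, sequences))
```

The sample-driven version evaluates at most one candidate per window in the input, so its cost grows with the input size rather than with 4^W. The price is that it can miss a motif whose best representative never occurs exactly in any sequence.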
21Greedy Motif Clustering
22Greedy Motif Clustering
23Greedy Motif Clustering
24Comparative Genomics
- Main idea: conserved non-coding regions are important
- Align the promoters of orthologous co-expressed genes from two (or more) species, e.g. human and mouse
- Search for TFBSs only in conserved regions
- Problems
  - Not all regulatory regions are conserved
  - Which genomes to use?
25Phylogenetic Footprinting
Phylogenetic footprinting refers to the task of
finding conserved motifs across different
species. Common ancestry and selection on these
motifs have resulted in these footprints.
26Phylogenetic Footprinting: An Example
- Xie et al. 2005
- Genome-wide alignments for four species (human, mouse, rat, dog)
- Promoter regions and 3' UTRs then extracted for 17,700 well-annotated genes
- Promoter region taken to be (-2000, 2000)
- This set of sequences then searched exhaustively for motifs
Nature 434, 338-345, 2005
27The Search
Xie et al. 2005
28Expected Rate
29Probabilistic Approach
30Gibbs Sampling (applied to Motif Finding)
31Gibbs Sampling Algorithm
32Gibbs Sampling Motif Positions
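The algorithm details from these slides were not preserved. The sketch below shows a standard site-sampler form of Gibbs sampling for motif finding under stated assumptions: exactly one motif occurrence per sequence, a fixed motif width, a uniform background, and a pseudocount of 1. All function names and parameters are hypothetical, and at least two sequences are required.

```python
import random

BASES = "ACGT"


def build_profile(sites, pseudocount=1.0):
    """Column-wise base probabilities estimated from the sites, with pseudocounts."""
    total = len(sites) + pseudocount * len(BASES)
    return [{b: (sum(s[i] == b for s in sites) + pseudocount) / total
             for b in BASES}
            for i in range(len(sites[0]))]


def word_weight(word, profile, background=0.25):
    """Likelihood ratio of `word` under the motif profile versus the background."""
    weight = 1.0
    for column, base in zip(profile, word):
        weight *= column[base] / background
    return weight


def gibbs_sample(sequences, width, iterations=200, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        for held_out, sequence in enumerate(sequences):
            # Build the profile from the current sites of all other sequences.
            sites = [s[p:p + width]
                     for k, (s, p) in enumerate(zip(sequences, positions))
                     if k != held_out]
            profile = build_profile(sites)
            # Sample a new start for the held-out sequence in proportion
            # to how well each window fits the profile.
            starts = list(range(len(sequence) - width + 1))
            weights = [word_weight(sequence[i:i + width], profile) for i in starts]
            positions[held_out] = rng.choices(starts, weights=weights)[0]
    return positions
```

Because each new position is sampled rather than chosen greedily, the procedure can escape poor starting alignments, but a single run returns only one sample. In practice several runs from different seeds are usually compared.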
33AlignACE - Gibbs Sampling
34Remainder of the lecture: Maximum likelihood and
the EM algorithm
The remaining slides are for your information
only and will not be part of the exam.
35Basic Statistics
36Maximum Likelihood Estimates
37EM Algorithm
38Basic idea (MEME)
http://meme.nbcr.net/meme/meme-intro.html
39Basic idea (MEME)
MEME is a tool for discovering motifs in a group
of related DNA or protein sequences. A motif is a
sequence pattern that occurs repeatedly in a
group of related protein or DNA sequences. MEME
represents motifs as position-dependent
letter-probability matrices which describe the
probability of each possible letter at each
position in the pattern. Individual MEME motifs
do not contain gaps. Patterns with
variable-length gaps are split by MEME into two
or more separate motifs. MEME takes as input a
group of DNA or protein sequences (the training
set) and outputs as many motifs as requested.
MEME uses statistical modeling techniques to
automatically choose the best width, number of
occurrences, and description for each motif.
http://meme.nbcr.net/meme/meme-intro.html
40Basic MEME Model
41MEME Background frequencies
42MEME Hidden Variable
43MEME Conditional Likelihood
44EM algorithm
45Example
46E-step of EM algorithm
47Example
48M-step of EM Algorithm
49Example
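The worked examples on these slides were not preserved. As a hedged illustration of how the E-step and M-step alternate, the sketch below implements a simplified one-occurrence-per-sequence model: the hidden variable is the motif start in each sequence, the E-step computes its posterior, and the M-step re-estimates the letter-probability matrix from expected counts. The fixed uniform background, the pseudocount, and the seed-word initialisation are assumptions of this sketch; it is not MEME's actual implementation.

```python
BASES = "ACGT"


def init_profile(seed_word, weight=0.7):
    """Start from a seed word: `weight` on its base, the rest shared equally."""
    rest = (1.0 - weight) / 3
    return [{b: (weight if b == base else rest) for b in BASES}
            for base in seed_word]


def e_step(sequences, profile, background=0.25):
    """Posterior probability z[i][j] that the motif in sequence i starts at j."""
    width = len(profile)
    posteriors = []
    for seq in sequences:
        ratios = []
        for j in range(len(seq) - width + 1):
            ratio = 1.0
            for k in range(width):
                ratio *= profile[k][seq[j + k]] / background
            ratios.append(ratio)
        total = sum(ratios)
        posteriors.append([r / total for r in ratios])
    return posteriors


def m_step(sequences, posteriors, width, pseudocount=0.1):
    """Re-estimate the profile from posterior-weighted (expected) base counts."""
    counts = [{b: pseudocount for b in BASES} for _ in range(width)]
    for seq, z in zip(sequences, posteriors):
        for j, zij in enumerate(z):
            for k in range(width):
                counts[k][seq[j + k]] += zij
    return [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]


def em(sequences, seed_word, iterations=20):
    """Alternate E- and M-steps; return the final profile and posteriors."""
    profile = init_profile(seed_word)
    for _ in range(iterations):
        posteriors = e_step(sequences, profile)
        profile = m_step(sequences, posteriors, len(seed_word))
    return profile, e_step(sequences, profile)
```

Unlike the Gibbs sampler, every update here is deterministic, so the result depends entirely on the starting profile; trying several seed words is a common way to reduce the risk of ending in a poor local optimum.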
50Characteristics of EM
51Gibbs Sampling (versus EM)