Computational Genomics and Proteomics

Slides: 52

Provided by: robert414

1
Computational Genomics and Proteomics
Lecture 8 Motif Discovery
2
Outline

Gene Regulation
DNA
Transcription factors
Motifs
What are they?
Binding Sites
Combinatoric Approaches
Exhaustive searches
Consensus
Comparative Genomics
Example
Probabilistic Approaches
Statistics
EM algorithm
Gibbs Sampling

3
www.accessexcellence.org
4
www.accessexcellence.org
5
www.accessexcellence.org
6
Four DNA nucleotide building blocks
G-C is more strongly hydrogen-bonded than A-T
7
Degenerate code

Four bases: A, C, G, T
Two-fold degenerate IUB codes
R = [AG] -- Purines
Y = [CT] -- Pyrimidines
K = [GT]
M = [AC]
S = [GC]
W = [AT]
Four-fold degenerate
N = [AGCT]
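
The degenerate codes above can be expanded mechanically into a pattern that a standard regular-expression engine understands. The following is a minimal sketch, not part of the original slides: the function names `degenerate_to_regex` and `find_matches` are made up for illustration, and only the two-fold and four-fold codes listed on the slide are included (the three-fold codes B, D, H and V are left out).

```python
import re

# IUB/IUPAC codes from the slide, mapped to the bases they stand for.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "N": "ACGT",
}


def degenerate_to_regex(pattern):
    """Translate a degenerate DNA pattern into an equivalent regular expression."""
    parts = []
    for code in pattern.upper():
        bases = IUB_CODES[code]
        # A single base stays literal; several bases become a character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return 0-based start positions of all matches, overlapping ones included."""
    # The lookahead makes each match zero-width, so overlapping sites are found.
    regex = re.compile("(?=" + degenerate_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    print(degenerate_to_regex("TATRNT"))
    print(find_matches("TATRNT", "GGTATAATCCTATGTT"))
```

A more ambiguous pattern matches more positions, so degenerate codes are convenient but can be permissive when scanning long sequences.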
8
Transcription Factors

Required but not a part of the RNA polymerase
complex
Many different roles in gene regulation
Binding
Interaction
Initiation
Enhancing
Repressing
Various structural classes (e.g. zinc finger
domains)
Consist of both a DNA-binding domain and an
interactive domain

9
Motifs

Short sequences of DNA or RNA (or amino acids)
Often consist of 5-16 nucleotides
May contain gaps
Examples include
Splice sites
Start/stop codons
Transmembrane domains
Centromeres
Phosphorylation sites
Coiled-coil domains
Transcription factor binding sites (TFBSs,
regulatory motifs)

10
TFBSs

Difficult to identify
Each transcription factor may have more than one
binding site
Degenerate
Most occur upstream of the transcription start site
(TSS) but are known to also occur in
introns
exons
3' UTRs
Usually occur in clusters, i.e. collections of
sites within a region (modules)
Often repeated
Sites can be experimentally verified

11
Why are TFBSs important?

Aid in identification of gene networks/pathways
Determine correct network structure
Drug discovery
Switch production of gene product on/off

Gene A Gene B
12
Consensus sequences

Matches all of the example sequences closely but
not exactly
A single site
TACGAT
A set of sites
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
Consensus sequence
TATAAT or
TATRNT
Trade-off between the number of mismatches allowed,
the ambiguity in the consensus sequence, and the
sensitivity and precision of the representation.
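
Both consensus sequences on this slide can be reproduced from the six example sites. The sketch below is illustrative only: the rules are assumptions chosen to match the slide, not a standard definition. `majority_consensus` takes the most frequent base per column (ties broken alphabetically, an arbitrary choice), while `degenerate_consensus` keeps a base only if at least `min_fraction` of the sites agree, falls back to a two-fold code when exactly two bases occur, and writes N otherwise.

```python
from collections import Counter

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]

# Two-fold degenerate codes, keyed by the pair of bases they represent.
TWO_FOLD = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("AC"): "M", frozenset("CG"): "S", frozenset("AT"): "W",
}


def majority_consensus(sites):
    """Most frequent base per column; ties are broken alphabetically (assumption)."""
    out = []
    for column in zip(*sites):
        counts = Counter(column)
        out.append(min(counts, key=lambda base: (-counts[base], base)))
    return "".join(out)


def degenerate_consensus(sites, min_fraction=0.8):
    """Per column: a base if it dominates, a two-fold code for two bases, else N."""
    out = []
    for column in zip(*sites):
        counts = Counter(column)
        base, count = counts.most_common(1)[0]
        if count / len(column) >= min_fraction:
            out.append(base)
        elif len(counts) == 2:
            out.append(TWO_FOLD[frozenset(counts)])
        else:
            out.append("N")
    return "".join(out)


if __name__ == "__main__":
    print(majority_consensus(SITES))
    print(degenerate_consensus(SITES))
```

Lowering `min_fraction` makes the consensus less ambiguous but hides more of the variation among the sites, which is the trade-off described above.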

13
Information Content and Entropy
14
Sequence Logos
15
Frequency Matrices

Given a collection of motifs,
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
Create the matrix (rows T, A, C, G; one column per motif position)

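
The matrix itself did not survive in this transcript, but it can be recomputed from the six motifs listed on the slide. A minimal sketch follows; the function names are hypothetical, and the row order T, A, C, G follows the slide.

```python
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]


def count_matrix(sites, alphabet="TACG"):
    """Count how often each base occurs at each motif position."""
    width = len(sites[0])
    counts = {base: [0] * width for base in alphabet}
    for site in sites:
        for position, base in enumerate(site):
            counts[base][position] += 1
    return counts


def frequency_matrix(counts):
    """Divide each count by the column total to get relative frequencies."""
    width = len(next(iter(counts.values())))
    totals = [sum(row[i] for row in counts.values()) for i in range(width)]
    return {base: [row[i] / totals[i] for i in range(width)]
            for base, row in counts.items()}


if __name__ == "__main__":
    counts = count_matrix(SITES)
    for base, row in counts.items():
        print(base, row)
    for base, row in frequency_matrix(counts).items():
        print(base, [round(value, 2) for value in row])
```

Each column of the count matrix sums to the number of sites (six here), and each column of the frequency matrix sums to one.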
16
Position weight matrices
17
Finding Motifs

Two problems
Given a collection of known motifs, develop a
representation of the motifs such that additional
occurrences can reliably be identified in new
promoter regions
Given a collection of genes, thought to be
related somehow, find the location of the motif
common to all and a representation for it.
Two approaches
Combinatorial
Probabilistic

18
Combinatorial Approach
19
Exhaustive Search
20
Exhaustive Search
Sample-driven here refers to trying all the words
as they occur in the sequences, instead of trying
all possible (4^W) words of width W exhaustively
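
The difference between the two candidate sets can be sketched as follows. This is an illustration under stated assumptions rather than the method from the slides: the scoring function (total Hamming distance from a word to its closest window in each sequence) is one common choice and is assumed here, and the example sequences are invented.

```python
from itertools import product


def all_words(width, alphabet="ACGT"):
    """Pattern-driven candidates: every possible word, 4**width in total."""
    return ("".join(letters) for letters in product(alphabet, repeat=width))


def sample_words(sequences, width):
    """Sample-driven candidates: only words that actually occur in the input."""
    words = set()
    for seq in sequences:
        for i in range(len(seq) - width + 1):
            words.add(seq[i:i + width])
    return words


def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def best_word(candidates, sequences, width):
    """Pick the candidate with the smallest summed distance to its best window
    in every sequence (assumed scoring; ties broken alphabetically)."""
    def total_distance(word):
        return sum(
            min(hamming(word, seq[i:i + width])
                for i in range(len(seq) - width + 1))
            for seq in sequences
        )
    return min(candidates, key=lambda word: (total_distance(word), word))


if __name__ == "__main__":
    seqs = ["GGTATAATCC", "ATATAATG", "CTATAATA"]  # invented example input
    print(len(sample_words(seqs, 6)), "sample-driven candidates vs", 4 ** 6)
    print(best_word(sample_words(seqs, 6), seqs, 6))
```

The sample-driven set grows with the amount of input sequence rather than exponentially with W, but it can miss a motif whose exact consensus never appears in any sequence.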
21
Greedy Motif Clustering
22
Greedy Motif Clustering
23
Greedy Motif Clustering
24
Comparative Genomics

Main idea: conserved non-coding regions are
important
Align the promoters of orthologous co-expressed
genes from two (or more) species, e.g. human and
mouse
Search for TFBS only in conserved regions
Problems
Not all regulatory regions are conserved
Which genomes to use?

25
Phylogenetic Footprinting
Phylogenetic footprinting refers to the task of
finding conserved motifs across different
species. Common ancestry and selection on these
motifs have resulted in these footprints.
26
Phylogenetic Footprinting: An Example

Xie et al. 2005
Genome-wide alignments for four species (human,
mouse, rat, dog)
Promoter regions and 3' UTRs then extracted for
17,700 well-annotated genes
Promoter region taken to be (-2000, +2000) relative to the TSS
This set of sequences then searched exhaustively
for motifs

Nature 434, 338-345, 2005
27
The Search
Xie et al. 2005
28
Expected Rate
29
Probabilistic Approach
30
Gibbs Sampling (applied to Motif Finding)
31
Gibbs Sampling Algorithm
32
Gibbs Sampling Motif Positions
33
AlignACE - Gibbs Sampling
34
Remainder of the lecture: Maximum likelihood and
the EM algorithm
The remaining slides are for your information
only and will not be part of the exam.
35
Basic Statistics
36
Maximum Likelihood Estimates
37
EM Algorithm
38
Basic idea (MEME)
http://meme.nbcr.net/meme/meme-intro.html
39
Basic idea (MEME)
MEME is a tool for discovering motifs in a group
of related DNA or protein sequences. A motif is a
sequence pattern that occurs repeatedly in a
group of related protein or DNA sequences. MEME
represents motifs as position-dependent
letter-probability matrices which describe the
probability of each possible letter at each
position in the pattern. Individual MEME motifs
do not contain gaps. Patterns with
variable-length gaps are split by MEME into two
or more separate motifs. MEME takes as input a
group of DNA or protein sequences (the training
set) and outputs as many motifs as requested.
MEME uses statistical modeling techniques to
automatically choose the best width, number of
occurrences, and description for each motif.
http://meme.nbcr.net/meme/meme-intro.html
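
To make the idea of a position-dependent letter-probability matrix concrete, the sketch below scores every window of a sequence against a small matrix. It is not MEME's code or API: the three-column matrix `MOTIF`, the uniform `BACKGROUND`, and the function names are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical letter-probability matrix: one dict per motif position,
# each giving the probability of every letter at that position.
MOTIF = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]

# Assumed uniform background letter frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds(word, motif=MOTIF, background=BACKGROUND):
    """Log2 ratio of P(word | motif) to P(word | background).

    Positions are treated as independent and the motif has no gaps,
    so the score is a sum over columns.
    """
    return sum(math.log2(column[letter] / background[letter])
               for column, letter in zip(motif, word))


def best_site(sequence, motif=MOTIF, background=BACKGROUND):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(motif)
    scored = [(log_odds(sequence[i:i + width], motif, background), i)
              for i in range(len(sequence) - width + 1)]
    return max(scored)


if __name__ == "__main__":
    score, start = best_site("GGTATCC")
    print(start, round(score, 3))
```

A positive score means the window is better explained by the motif matrix than by the background. Because the matrix has a fixed width, a pattern with a variable-length gap cannot be captured by one such matrix.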
40
Basic MEME Model
41
MEME Background frequencies
42
MEME Hidden Variable
43
MEME Conditional Likelihood
44
EM algorithm
45
Example
46
E-step of EM algorithm
47
Example
48
M-step of EM Algorithm
49
Example
50
Characteristics of EM
51
Gibbs Sampling (versus EM)
