Title: Motif discovery
1Tutorial 5
Motif discovery
2Multiple sequence alignments and motif discovery
- Motif discovery
- MEME
- MAST
- TOMTOM
- GOMO
- PROSITE
3Can we find motifs using multiple sequence
alignment?
..YDEEGGDAEE..
..YDEEGGDAEE..
..YGEEGADYED..
..YDEEGADYEE..
..YNDEGDDYEE..
..YHDEGAADEE..
- Motif
- A widespread pattern with biological
significance
Position   1    2    3    4    5    6    7    8    9    10
A          0    0    0    0    0    3/6  1/6  2/6  0    0
D          0    3/6  2/6  0    0    1/6  5/6  1/6  0    1/6
E          0    0    4/6  1    0    0    0    0    1    5/6
G          0    1/6  0    0    1    2/6  0    0    0    0
H          0    1/6  0    0    0    0    0    0    0    0
N          0    1/6  0    0    0    0    0    0    0    0
Y          1    0    0    0    0    0    0    3/6  0    0
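A position-specific frequency table like the one on this slide can be recomputed directly from the six aligned occurrences. The following is a minimal sketch, not part of the original tutorial; the names SITES and frequency_matrix are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction

# The six aligned motif occurrences shown on the slide.
SITES = [
    "YDEEGGDAEE",
    "YDEEGGDAEE",
    "YGEEGADYED",
    "YDEEGADYEE",
    "YNDEGDDYEE",
    "YHDEGAADEE",
]


def frequency_matrix(sites):
    """Return {residue: [frequency in each column]} for equal-length sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    residues = sorted(set("".join(sites)))
    matrix = {residue: [] for residue in residues}
    for col in range(width):
        # Count how often each residue occurs in this alignment column.
        counts = Counter(site[col] for site in sites)
        for residue in residues:
            matrix[residue].append(Fraction(counts[residue], len(sites)))
    return matrix


if __name__ == "__main__":
    for residue, row in frequency_matrix(SITES).items():
        print(residue, " ".join(str(f) for f in row))
```

Fractions are printed in reduced form (for example 1/2 rather than 3/6). Every column sums to 1, which is a quick sanity check for a table of this kind.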
4Can we find motifs using multiple sequence
alignment (MSA)?
YES!
NO
5Using MSA for motif discovery
- Works only if the motif occurrences align well on their own
- For most motifs this is not the case!
6ClustalW - Input
http://www.ebi.ac.uk/Tools/clustalw2/index.html
Input sequences
Scoring matrix
Gap scoring
Output format
Email address
7Muscle
http://www.ebi.ac.uk/Tools/muscle/index.html
Input sequences
Output format
Email address
8Motif search: from de-novo motifs to motif
annotation
gapped motifs
Large DNA data
http://meme.sdsc.edu/
9MEME: Multiple EM for Motif Elicitation
- http://meme.sdsc.edu/
- Motif discovery from unaligned sequences
- Genomic or protein sequences
- Flexible model of motif presence (Motif can be
absent in some sequences or appear several times
in one sequence)
Expectation-maximization
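To make the expectation-maximization idea concrete, here is a toy sketch for the simplest case, where each sequence is assumed to contain exactly one motif occurrence. It is not MEME's implementation: the function name em_oops, the seed-based initialisation (0.7 for the seed letter, 0.1 for the others), the pseudocount, and the implicit uniform background are all assumptions made for illustration.

```python
ALPHABET = "ACGT"


def em_oops(seqs, seed, iterations=20, pseudo=0.1):
    """Toy EM for one motif occurrence per sequence (DNA only).

    Returns the learned position probability matrix and, for each
    sequence, the most probable motif start.
    """
    width = len(seed)
    # Soft initial matrix built from the seed word (illustrative choice).
    pwm = [{a: (0.7 if a == s else 0.1) for a in ALPHABET} for s in seed]
    posteriors = []
    for _ in range(iterations):
        # E-step: posterior probability of every start position.
        posteriors = []
        for seq in seqs:
            weights = []
            for start in range(len(seq) - width + 1):
                w = 1.0
                for j in range(width):
                    w *= pwm[j][seq[start + j]]
                weights.append(w)
            total = sum(weights)
            posteriors.append([w / total for w in weights])
        # M-step: re-estimate the matrix from posterior-weighted counts.
        counts = [{a: pseudo for a in ALPHABET} for _ in range(width)]
        for seq, post in zip(seqs, posteriors):
            for start, p in enumerate(post):
                for j in range(width):
                    counts[j][seq[start + j]] += p
        pwm = [{a: col[a] / sum(col.values()) for a in ALPHABET}
               for col in counts]
    best_starts = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    return pwm, best_starts


if __name__ == "__main__":
    demo = ["CCATGACGTCAGG", "TTGACGTCATTCC", "GGCCTGACGTCAA"]
    matrix, starts = em_oops(demo, seed="TGACGT")
    print(starts)
    print("".join(max(col, key=col.get) for col in matrix))
```

The E-step asks "given the current motif model, where is the motif likely to be?", and the M-step asks "given those likely positions, what does the motif look like?". Allowing a motif to be absent from a sequence, or to occur several times, requires a richer model than this sketch.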
10MEME - Input
Email address
How many times in each sequence?
Input file (fasta file)
Range of motif lengths
How many motifs?
How many sites?
11MEME - Output
Motif score
12MEME - Output
Motif score
Motif length
Number of times
13MEME - Output
Low uncertainty = high information content
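The link between uncertainty and information content can be sketched numerically: the information in one motif column is the maximum possible entropy minus the observed entropy. This is an illustrative sketch only; the function name is an assumption, and the small-sample correction that sequence-logo tools often apply is left out.

```python
import math


def information_content(column, alphabet_size=4):
    """Information (in bits) of one motif column.

    `column` maps each residue to its frequency; frequencies should sum to 1.
    Use alphabet_size=4 for DNA and 20 for protein.
    """
    # Shannon entropy: high when residues are evenly mixed (high uncertainty).
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    # Information is what remains of the maximum possible uncertainty.
    return math.log2(alphabet_size) - entropy


if __name__ == "__main__":
    conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
    mixed = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(information_content(conserved))
    print(information_content(mixed))
```

A fully conserved DNA column carries the maximum of 2 bits, while a column with all four bases equally frequent carries none.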
14MEME - Output
Multilevel Consensus
15 Patterns can be presented as regular expressions
- [AG]-x-V-x(2)-{YW}
- [] - Either residue
- x - Any residue
- x(2) - Any residue in the next 2 positions
- {} - Any residue except these
- Examples: AYVACM, GGVGAA
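Patterns of this kind map closely onto ordinary regular expressions. The sketch below assumes the usual PROSITE notation, in which square brackets list the allowed residues, curly braces list the excluded residues, x stands for any residue, and a number in parentheses is a repeat count. The function name prosite_to_regex is an assumption, and anchors such as < and > are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        # Separate an optional repeat count such as (2) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # any residue except these
        # [..] groups and single letters are already valid regex syntax.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("[AG]-x-V-x(2)-{YW}")
    print(regex)
    for peptide in ("AYVACM", "GGVGAA"):
        print(peptide, bool(re.fullmatch(regex, peptide)))
```

Both example peptides from the slide satisfy the translated pattern, whereas a peptide ending in Y or W would be rejected by the final element.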
16MEME - Output
Position in sequence
Strength of match
Sequence names
Motif within sequence
17MEME - Output
Sequence names
Motif location in the input sequence
Overall strength of motif matches
18What can we do with motifs?
- MAST - Search for them in non-annotated sequence
databases (protein and DNA)
- TOMTOM - Find the protein that binds the DNA
motifs.
- GOMO - Find putative target genes (DNA) of motifs
and analyze their associated annotation terms.
- PROSITE - Search for them in annotated protein
sequence databases.
19MAST
http://meme.sdsc.edu/meme4_4_0/cgi-bin/mast.cgi
- Searches for motifs (one or more) in sequence
databases
- Like BLAST, but with motifs as input
- Similar to iterations of PSI-BLAST
- Profile defines strength of match
- Multiple motif matches per sequence
- Combined E value for all motifs
- MEME uses MAST to summarize results
- Each MEME result is accompanied by the MAST
result for searching the discovered motifs on the
given sequences.
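The idea that a profile defines the strength of a match can be sketched as a log-odds scan: each window of a sequence is scored against the motif matrix, and higher scores mean better matches. This is only an illustration under stated assumptions (uniform background, a fixed pseudocount, DNA only). It does not reproduce MAST's p-values or combined E-values, and the function names are hypothetical.

```python
import math

ALPHABET = "ACGT"


def log_odds_matrix(freqs, background=0.25, pseudo=0.01):
    """Turn per-column frequency dicts into log2-odds scores."""
    matrix = []
    for col in freqs:
        scores = {}
        for base in ALPHABET:
            # The pseudocount keeps unseen bases from scoring minus infinity.
            p = (col.get(base, 0.0) + pseudo) / (1.0 + len(ALPHABET) * pseudo)
            scores[base] = math.log2(p / background)
        matrix.append(scores)
    return matrix


def scan(sequence, pssm):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = sum(pssm[j][sequence[start + j]] for j in range(width))
        hits.append((start, score))
    return hits


if __name__ == "__main__":
    motif = [{"T": 1.0}, {"G": 0.8, "A": 0.2}, {"A": 1.0}]
    pssm = log_odds_matrix(motif)
    for start, score in scan("CCTGACCTAA", pssm):
        print(start, round(score, 2))
```

Sequences containing letters outside A, C, G, T would raise a KeyError here; a fuller tool would need a policy for ambiguous bases.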
20MAST - Input
Email address
Database
Input file (motifs)
21MAST - Output
Input motifs
Presence of the motifs in a given database
22TOMTOM
http://meme.sdsc.edu/meme/doc/tomtom.html
- Searches one or more query DNA motifs against one
or more databases of target motifs, and reports
for each query a list of target motifs, ranked
by p-value.
- The output contains results for each query, in
the order that the queries appear in the input
file.
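The general idea of comparing a query motif against a set of target motifs can be sketched as follows: slide the query along each target without gaps, measure how different the overlapping columns are, and rank targets by their best alignment. This is an assumption-laden illustration, not TOMTOM's scoring or its p-value computation; the mean Euclidean column distance, the minimum overlap of 3, and all function names are choices made here for the example.

```python
import math

ALPHABET = "ACGT"


def one_hot(word):
    """Hypothetical helper: build a fully conserved motif from a DNA word."""
    return [{b: (1.0 if b == c else 0.0) for b in ALPHABET} for c in word]


def column_distance(a, b):
    """Euclidean distance between two frequency columns."""
    return math.sqrt(sum((a[x] - b[x]) ** 2 for x in ALPHABET))


def best_alignment(query, target, min_overlap=3):
    """Return (distance, offset) of the best ungapped alignment."""
    min_overlap = min(min_overlap, len(query), len(target))
    best = None
    for offset in range(min_overlap - len(query),
                        len(target) - min_overlap + 1):
        # Query column i is compared with target column i + offset.
        pairs = [(query[i], target[i + offset])
                 for i in range(len(query))
                 if 0 <= i + offset < len(target)]
        if len(pairs) < min_overlap:
            continue
        distance = sum(column_distance(a, b) for a, b in pairs) / len(pairs)
        if best is None or distance < best[0]:
            best = (distance, offset)
    return best


def rank_targets(query, targets):
    """Rank target motifs from most to least similar to the query."""
    return sorted((best_alignment(query, motif), name)
                  for name, motif in targets.items())


if __name__ == "__main__":
    database = {"same": one_hot("AATGACAA"), "other": one_hot("CCCCCCCC")}
    for (distance, offset), name in rank_targets(one_hot("TGAC"), database):
        print(name, offset, round(distance, 3))
```

A raw distance like this is not comparable across motifs of different widths or compositions, which is one reason a real tool reports a statistical significance instead.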
23TOMTOM - Input
Input motif
Background frequencies
Database
24DNA IUPAC code
- A --> adenosine          M --> A C (amino)
- C --> cytidine           S --> G C (strong)
- G --> guanine            W --> A T (weak)
- T --> thymidine
- B --> G T C              D --> G A T
- R --> G A (purine)       H --> A C T
- Y --> T C (pyrimidine)   V --> G C A
- K --> G T (keto)         N --> A G C T (any)
Example: YCAY = [TC]CA[TC]
IUPAC: International Union of Pure and Applied
Chemistry
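Because each IUPAC letter stands for a fixed set of bases, a degenerate DNA motif can be expanded into a regular expression and searched for in a sequence. The following is a small sketch; the names IUPAC, iupac_to_regex, and find_sites are assumptions, and only one strand is searched.

```python
import re

# IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Expand an IUPAC motif such as YCAY into a regular expression."""
    return "".join(IUPAC[letter] for letter in motif.upper())


def find_sites(motif, sequence):
    """Return (start, matched_text) for every occurrence, overlaps included."""
    # A lookahead lets overlapping occurrences be reported too.
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    print(iupac_to_regex("YCAY"))
    print(find_sites("YCAY", "GGTCACGGCCATAA"))
```

Searching the reverse strand as well would require reverse-complementing either the motif or the sequence first.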
25TOMTOM - Output
Input motif
Matching motifs
26TOMTOM Output
Wrong input, ok results
27JASPAR
- Profiles
- Transcription factor binding sites
- Multicellular eukaryotes
- Derived from published collections of experiments
- Open data access
28logo
score
organism
Name of gene/protein
29GOMO
- GOMO takes DNA binding motifs to find putative
target genes and analyze their associated GO
terms. A list of significant GO terms that can be
linked to the given motifs will be produced.
- GOMO returns a list of GO terms that are
significantly associated with target genes of the
motif.
- Gene Ontology provides a controlled vocabulary to
describe gene and gene product attributes in any
organism.
30GOMO - Input
Email address
Database
Input file (motifs)
31GOMO - Output
Input motifs
GO annotation
MF - Molecular function
BP - Biological process
CC - Cellular component
32Prosite
http://www.expasy.org/tools/scanprosite
- ProSite is a database of protein domains and
motifs that can be searched by either regular
expression patterns or sequence profiles.
34Prosite - input
Input motif: a regular expression
Database
Filters
35Prosite - Output
Input motif
Location in the protein sequence
protein