Motif Search - PowerPoint PPT Presentation

About This Presentation
Title:

Motif Search

Slides: 37
Provided by: esti58
Tags: lecture | motif | search


Transcript and Presenter's Notes

Title: Motif Search


1
  • Motif Search

2
What are Motifs
  • Motif (dictionary): a recurrent thematic element,
    a common theme

3
Find a common motif in the text
4
Find a short common motif in the text
5
Motifs in biological sequences
  • Sequence motifs represent a short common sequence
    (length 4-20) which is highly represented in the
    data

6
Challenges in biological sequences
  • Motifs are usually not exact words

7
How to represent non-exact motifs?
  • Consensus string: NTAHAWT
  • May allow degenerate symbols in the string, e.g.,
    N = A/C/G/T, W = A/T, H = not G, S = C/G, R = A/G,
    Y = T/C, etc.
  • Position Weight Matrix (PWM): the probability of
    each base in each position

[Figure: PWM with one row per base (A, T, G, C) and one column per position (1-6)]
8
Motifs in biological sequences
What can we learn from these motifs?
  • Regulatory motifs in DNA (transcription factor
    binding sites)
  • Functional sites in proteins (e.g., phosphorylation
    sites)

9
DNA Regulatory Motifs
  • Transcription Factors (TFs) are regulatory proteins
    that bind to regulatory motifs near the gene and
    act as a switch button (on/off)
  • TF binding motifs are usually 6-20 nucleotides
    long
  • They are located near the target gene, mostly
    upstream of the transcription start site

[Figure: Gene X with its transcription start site; TF1 and TF2 bind the TF1 motif and the TF2 motif]
10
Can we find TF targets using a bioinformatics
approach?
11
p53 is a transcription factor involved in most
human cancers
We are interested in identifying the genes regulated
by p53
12
Finding TF targets using a bioinformatics
approach
Scenario 1: Binding motif is known (easier case)
Scenario 2: Binding motif is unknown (hard case)
13
Scenario 1: Binding motif is known
  • Given a motif (e.g., consensus string, or weight
    matrix), find the binding sites in an input
    sequence

14
Given a consensus
For each position l in the input sequence, check
whether the substring starting at position l matches
the motif.
Example: find the consensus motif NTAHAWT in the
promoter of a gene
>promoter of gene A
ACGCGTATATTACGGGTACACCCTCCCAATTACTACTATAAAT
TCATACGGACTCAGACCTTAAAA
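The position-by-position check described on this slide could be sketched roughly as follows. This is only an illustration, not the presenter's code: the function name `find_consensus` and the `IUPAC` table are assumptions introduced here, and the table covers only the degenerate symbols listed on slide 7.

```python
# Degenerate symbols from slide 7, plus the four plain bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "W": "AT", "H": "ACT", "S": "CG", "R": "AG", "Y": "CT",
}

def find_consensus(sequence, motif):
    """Return the 0-based start positions where `motif` matches `sequence`."""
    sequence = sequence.upper()
    hits = []
    # Slide a window of the motif's length over every start position.
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        # A window matches if every base is allowed by the motif symbol.
        if all(base in IUPAC[symbol] for base, symbol in zip(window, motif)):
            hits.append(start)
    return hits

# Toy example (not the promoter above): CTACAAT fits NTAHAWT.
print(find_consensus("GGCTACAATGG", "NTAHAWT"))
```

The same call can be applied to the promoter of gene A by passing its sequence as the first argument. This sketch scans one strand only.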
15
Given a Position Weight Matrix (PWM)
Starting from a set of aligned motifs
Seq 1  AAAGCCC
Seq 2  CTATCCA
Seq 3  CTATCCC
Seq 4  CTATCCC
Seq 5  GTATCCC
Seq 6  CTATCCC
Seq 7  CTATCCC
Seq 8  CTATCCC
Seq 9  TTATCTG
16
Given a Position Weight Matrix (PWM)
Counts of each base in each column
A   1    1    9    9    0    0    0    1
C   6    0    0    0    0    9    8    7
G   1    0    0    0    1    0    0    1
T   1    8    0    0    8    0    1    0

W: probability of each base in each column
A   .11  .11  1    1    0    0    0    .11
C   .67  0    0    0    0    1    .89  .78
G   .11  0    0    0    .11  0    0    .11
T   .11  .89  0    0    .89  0    .11  0

W_βk = probability of base β in column k
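As a rough illustration of how counts and probabilities like these can be derived from aligned sites, here is a minimal sketch. The function name `build_pwm` is an assumption made for this example. It uses the nine sequences exactly as printed on slide 15, which are seven letters long, so it yields seven columns.

```python
from collections import Counter

def build_pwm(aligned):
    """Return (counts, probs): one dict per column, keyed by base."""
    width = len(aligned[0])
    if any(len(site) != width for site in aligned):
        raise ValueError("all aligned sites must have the same length")
    counts, probs = [], []
    for k in range(width):
        # Count how often each base occurs in column k.
        column = Counter(site[k] for site in aligned)
        counts.append({b: column.get(b, 0) for b in "ACGT"})
        # Divide by the number of sites to get the probability of each base.
        probs.append({b: column.get(b, 0) / len(aligned) for b in "ACGT"})
    return counts, probs

sites = ["AAAGCCC", "CTATCCA", "CTATCCC", "CTATCCC", "GTATCCC",
         "CTATCCC", "CTATCCC", "CTATCCC", "TTATCTG"]
counts, probs = build_pwm(sites)
print(counts[0])                      # counts in the first column
print(round(probs[0]["C"], 2))        # probability of C in the first column
```

Practical tools usually add pseudocounts so that no probability is exactly zero; that step is left out here to stay close to the slide.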
17
Given a Position Weight Matrix (PWM)
  • Given sequence S (e.g., 1000 base pairs long)
  • For each substring s of S, compute Pr(s|W)
  • If Pr(s|W) > some threshold, call that a binding
    site
  • In DNA sequences we need to search both strands:
    AGTTACACCA
    TGGTGTAACT (reverse complement)
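The scanning procedure above might look like the following sketch. It assumes the PWM is a list of per-column dicts mapping each base to its probability. The names `reverse_complement`, `site_probability`, `scan` and the `threshold` parameter are hypothetical and chosen for this example only.

```python
def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def site_probability(site, pwm):
    """Pr(s|W): product over columns k of the probability of base s[k]."""
    p = 1.0
    for k, base in enumerate(site):
        p *= pwm[k][base]
    return p

def scan(sequence, pwm, threshold):
    """Return (strand, start, probability) for every window above threshold.

    Starts on the '-' strand are positions in the reverse complement.
    """
    width = len(pwm)
    hits = []
    for strand, seq in (("+", sequence), ("-", reverse_complement(sequence))):
        for start in range(len(seq) - width + 1):
            p = site_probability(seq[start:start + width], pwm)
            if p > threshold:
                hits.append((strand, start, p))
    return hits

# Toy two-column PWM that only accepts "AC" (an assumption for illustration).
toy_pwm = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
           {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0}]
print(reverse_complement("AGTTACACCA"))
print(scan("GGACGG", toy_pwm, 0.5))
```

Many motif scanners score log-odds against a background model rather than the raw probability; the raw product is used here because it is what the slide describes.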

18
Scenario 2: Binding motif is unknown
Ab initio motif finding
19
Ab initio motif finding: Expectation Maximization
  • Local search algorithm
  • Start from a random PWM
  • Move from one PWM to another so as to improve the
    score that measures how well the motif fits the
    sequence
  • Keep doing this until no more improvement is
    obtained: convergence to a local optimum

20
Expectation Maximization
  • Let W be a PWM.
  • Let S be the input sequence.
  • Imagine a process that randomly searches, picks
    different strings matching W and threads them
    together to a new PWM

21
Expectation Maximization
  • Find W so as to maximize Pr(S|W)
  • The Expectation-Maximization (EM) algorithm
    iteratively finds a new motif W that improves
    Pr(S|W)
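To make the iteration concrete, here is a simplified sketch of one common EM formulation. It is not necessarily the variant the slides have in mind. It assumes that each input sequence contains exactly one motif occurrence, that the background is uniform, and that a fixed number of iterations is run instead of testing for convergence. The function name `em_motif` and its parameters are hypothetical.

```python
import random

BASES = "ACGT"

def em_motif(seqs, width, iterations=50, pseudocount=0.1, seed=0):
    """Return a PWM (list of per-column dicts) fitted to `seqs` by EM."""
    rng = random.Random(seed)
    # Start from a random PWM: each column is a random probability vector.
    pwm = []
    for _ in range(width):
        raw = [rng.random() + 0.1 for _ in BASES]
        total = sum(raw)
        pwm.append({b: r / total for b, r in zip(BASES, raw)})

    for _ in range(iterations):
        # Pseudocounts keep every probability strictly positive.
        new_counts = [{b: pseudocount for b in BASES} for _ in range(width)]
        for seq in seqs:
            # E-step: weight every window by its probability under the PWM.
            weights = []
            for start in range(len(seq) - width + 1):
                p = 1.0
                for k in range(width):
                    p *= pwm[k][seq[start + k]]
                weights.append(p)
            total = sum(weights)
            # Add each window to the counts in proportion to its posterior.
            for start, w in enumerate(weights):
                z = w / total
                for k in range(width):
                    new_counts[k][seq[start + k]] += z
        # M-step: normalize the expected counts into the next PWM.
        pwm = [{b: col[b] / sum(col.values()) for b in BASES}
               for col in new_counts]
    return pwm

# Toy input (assumed for illustration): three sequences sharing "TTGAC".
pwm = em_motif(["ACGTTGACA", "GGTTGACC", "TTGACATT"], width=5, iterations=10)
print(["".join(max(col, key=col.get)) for col in pwm])
```

Because EM is a local search, different random starts (the `seed` argument here) can end in different local optima, so implementations typically try several starts and keep the best-scoring PWM.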

22
Expectation Maximization
23
The final PWM represents the motif that is most
enriched in the data
The PWM can also be represented as a sequence
logo
  • A letter's height indicates the information it
    contains
  • The top letter at each position can be read to
    obtain the consensus sequence (motif)
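The letter heights in a sequence logo are commonly computed from the information content of each column: 2 bits minus the column's entropy for DNA, with each letter drawn at its probability times that value. The sketch below follows that convention, without the small-sample correction some logo tools apply. The function names are assumptions made for this example.

```python
import math

def column_information(col):
    """Information content of one DNA column in bits (maximum log2(4) = 2)."""
    entropy = -sum(p * math.log2(p) for p in col.values() if p > 0)
    return 2.0 - entropy

def logo_heights(pwm):
    """Height of each letter: its probability times the column information."""
    return [{b: p * column_information(col) for b, p in col.items()}
            for col in pwm]

def consensus(pwm):
    """Read off the most probable (top) letter in each column."""
    return "".join(max(col, key=col.get) for col in pwm)

# Two toy columns (assumed values): one fully conserved, one uniform.
toy = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
       {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}]
print([column_information(col) for col in toy])
print(logo_heights(toy)[0])
```

A fully conserved column carries the maximum 2 bits, while a uniform column carries none, which is why uninformative positions nearly vanish in a logo.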
24
Are common motifs the right thing to search for?
25
?
26
Solutions
  • Search for motifs that are enriched in one set
    but not in a random set
  • Use experimental information to rank the
    sequences according to their binding affinity and
    search for enriched motifs at the top of the list
27
Searching for enriched motifs in a ranked list
Hypergeometric (HG) distribution test
[Figure: sequences 1-4 ranked by binding affinity]
k = number of motifs in the top of the list
m = number of sequences in the top of the list
n = number of total motifs found
N = total number of sequences
  • The P-value reflects the surprise of seeing the
    observed density of motif occurrences at the top
    of the list compared to the rest of the list.
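With the symbols defined on this slide, the hypergeometric tail probability of seeing at least k motif-containing sequences among the top m can be sketched as below. The sketch assumes that k and n count sequences containing the motif, which is one reading of the slide. The function name `hg_tail` is hypothetical.

```python
from math import comb

def hg_tail(k, m, n, N):
    """P(X >= k) when m of N sequences are drawn and n of the N contain the motif."""
    upper = min(m, n)
    # Sum the hypergeometric probabilities for i = k .. min(m, n).
    favourable = sum(comb(n, i) * comb(N - n, m - i) for i in range(k, upper + 1))
    return favourable / comb(N, m)

# Toy numbers (assumed): 4 sequences, 2 with the motif, both in the top 2.
print(hg_tail(2, 2, 2, 4))
```

A small value means the motif is concentrated at the top of the ranked list more than random ordering would explain.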

28
Searching for enriched motifs in a ranked list
Choosing the best way to cut the list (minimal HG
score)
[Figure: sequences 1-4 ranked by binding affinity]
k = number of motifs in the top of the list
m = number of sequences in the top of the list
n = number of total motifs found
N = total number of sequences
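Choosing the cut can be illustrated by trying every possible cutoff m and keeping the one with the smallest HG score. This is a sketch under the assumption that the input is a list of booleans, ordered from highest to lowest binding affinity, marking whether each sequence contains the motif. The names `hg_tail` and `min_hg` are hypothetical.

```python
from math import comb

def hg_tail(k, m, n, N):
    """P(X >= k) when m of N sequences are drawn and n of the N contain the motif."""
    return sum(comb(n, i) * comb(N - n, m - i)
               for i in range(k, min(m, n) + 1)) / comb(N, m)

def min_hg(has_motif):
    """Return (best score, best cutoff m) over all cutoffs of the ranked list."""
    N = len(has_motif)
    n = sum(has_motif)
    best_p, best_m = 1.0, 0
    k = 0
    for m in range(1, N + 1):
        k += has_motif[m - 1]          # motif-containing sequences in the top m
        p = hg_tail(k, m, n, N)
        if p < best_p:
            best_p, best_m = p, m
    return best_p, best_m

# Toy ranked list (assumed): the two strongest binders contain the motif.
print(min_hg([True, True, False, False]))
```

Because many cutoffs are tried, the minimal score is not itself a p-value and needs a correction for multiple testing before it is interpreted as one.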
29
Finding the p53 binding motif in a set of p53
target sequences that are ranked according to
binding affinity
>affinity 5.962
ACAAAAGCGUGAACACUUCCACAUGAAAUUC
GUUUUUUGUCCUUUUUUUUCUCUUCUUUUUCUCUCCUGUUUCU
>affinity 5.937
AAUAAAAAUAGAUAUAAUAGAUGGCACCGCUCUUCACG
CCCGAAAGUUGGACAUUUUAAAUUUUAAUUCUCAUGA
>affinity 5.763
UCACACUUGAAUGUGCUGCACUUUACUAGAAGUUUCUUUUUC
UUUUUUUAAAAAUAAAAAAAGAGGAGAAAAAUGC
>affinity 5.498
GCUGGUGCAAGUUUCCGGUAAAAAUAAUGAUGUUCUAGUCAUUC
AUAUAUACGAUACAAAAAUAACA
...
http://drimust.technion.ac.il/
30
Protein Motifs
Protein motifs are usually 6-20 amino acids long
and can be represented as a consensus/profile
PEDXKRWRKXED
or as a PWM
31
Protein Domains
  • In addition to short protein motifs, proteins
    are characterized by domains.
  • Domains are long motifs (30-100 aa) and are
    considered the building blocks of proteins
    (evolutionary modules).

The zinc-finger domain
32
Some domains can be found in many proteins with
different functions
33
...while other domains are found only in proteins
with a certain function.
MBD: Methylated DNA Binding Domain
34
Varieties of protein domains
Extending along the length of a protein
Occupying a subset of a protein sequence
Occurring one or more times
Page 228
35
Pfam
  • Database that contains a large collection of
    multiple sequence alignments of protein domains
  • Based on profile hidden Markov models (HMMs)
  • An HMM, in comparison to a PWM, is a model that
    considers dependencies between the different
    columns in the matrix (different residues) and is
    thus much more powerful

http://pfam.sanger.ac.uk/
36
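Profile HMM (Hidden Markov Model) can accurately
represent an MSA
[Figure: profile HMM for alignment columns 16-19, with match states M16-M19, insert states I16-I19 (each labeled X), and delete states D16-D19; transitions are labeled 100 or 50]

Alignment (columns 16 17 18 19)
D R T R
D R T S
S - - S
S P T R
D R T R
D P T S
D - - S
D - - S
D - - S
D - - R

Match state emission probabilities
M16: D 0.8, S 0.2
M17: P 0.4, R 0.6
M18: T 1.0
M19: R 0.4, S 0.6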