Biological Motif Discovery - PowerPoint PPT Presentation

1
Biological Motif Discovery
  • Concepts
    • Motif Modeling and Motif Information
    • EM and Gibbs Sampling
    • Comparative Motif Prediction
  • Applications
    • Transcription Factor Binding Site Prediction
    • Epitope Prediction
  • Lab Practical
    • DNA Motif Discovery with MEME and AlignACE
    • Co-regulated genes from TB Boshoff data set

2
Regulatory Motifs
Find promoter motifs associated with co-regulated
or functionally related genes
3
Transcription Factor Binding Sites
  • Very Small
  • Highly Variable
  • Constant Size
  • Often repeated
  • Low-complexity-ish

Slide Credit S. Batzoglou
4
Other Motifs
  • Splicing Signals
    • Splice junctions
    • Exonic Splicing Enhancers (ESE)
    • Exonic Splicing Silencers (ESS)
  • Protein Domains
    • Glycosylation sites
    • Kinase targets
    • Targeting signals
  • Protein Epitopes
    • MHC binding specificities

5
Essential Tasks
  • Modeling Motifs
    • How to computationally represent motifs
  • Visualizing Motifs
    • Motif Information
  • Predicting Motif Instances
    • Using the model to classify new sequences
  • Learning Motif Structure
    • Finding new motifs, assessing their quality

6
Modeling Motifs
7
Consensus Sequences
  • Useful for publication
  • IUPAC symbols for degenerate sites
  • Not very amenable to computation

Nature Biotechnology 24, 423 - 425 (2006)
8
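Probabilistic Model
Position Frequency Matrix (PFM) over motif positions M1 ... MK (here K = 6)
Entry Pk(S|M) is the probability of base S at position k of motif M

     M1   M2   M3   M4   M5   M6
A    .1   .1   .4   .1   .2   .1
C    .5   .1   .2   .2   .2   .2
G    .2   .1   .2   .4   .5   .4
T    .2   .7   .2   .2   .1   .3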
9
Scoring A Sequence
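To score a sequence S, we compare the motif model (PFM) to a null model, background DNA (B)
Log likelihood ratio:
Score(S) = log [ P(S | PFM) / P(S | B) ] = sum over positions k of log [ Pk(Sk | M) / P(Sk | B) ]
The matrix of per-position log ratios is the Position Weight Matrix (PWM)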
10
Scoring a Sequence
Common threshold: 60% of maximum score
MacIsaac & Fraenkel (2006) PLoS Comp Bio
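The scoring on the two slides above can be sketched in Python. This is a minimal illustration, not the method of any particular tool: the 3-position PFM, the uniform background, and the function names are all assumptions made for the example, and the threshold is taken as 60% of the maximum achievable score.

```python
import math

# Hypothetical 3-position PFM (each column sums to 1); not taken from the slides.
PFM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
# Assumed uniform background DNA model (B).
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def pwm_from_pfm(pfm, background):
    """Turn each PFM column into log2 ratios against the background."""
    return [{b: math.log2(col[b] / background[b]) for b in col} for col in pfm]


def score(seq, pwm):
    """Log likelihood ratio of seq under the motif versus the background."""
    return sum(col[base] for col, base in zip(pwm, seq))


def max_score(pwm):
    """Best score any sequence of the motif's length can reach."""
    return sum(max(col.values()) for col in pwm)


pwm = pwm_from_pfm(PFM, BACKGROUND)
best = max_score(pwm)
threshold = 0.6 * best  # the "60% of maximum score" rule of thumb

for candidate in ("AGC", "AGT", "TTT"):
    s = score(candidate, pwm)
    print(candidate, round(s, 2), "hit" if s >= threshold else "no hit")
```

Working in log space turns the product of per-position ratios into a sum, which is why the PWM can be applied with simple additions.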
11
Visualizing Motifs: Motif Logos
Represent both base frequency and conservation at
each position
Height of letter proportional to frequency of
base at that position
Height of stack proportional to conservation at
that position
12
Motif Information
  • The height of a stack is often called the motif
    information at that position measured in bits

Information
Why is this a measure of information?
13
Uncertainty and probability
Uncertainty is related to our surprise at an event
The sun will rise tomorrow: not surprising (p ≈ 1)
The sun will not rise tomorrow: very surprising (p << 1)
Uncertainty is inversely related to probability
of event
14
Average Uncertainty
Two possible outcomes for the sun rising
A: The sun will rise tomorrow, P(A) = p1
B: The sun will not rise tomorrow, P(B) = p2
What is our average uncertainty about the sun
rising?
Entropy: H = -p1 log p1 - p2 log p2
15
Entropy
  • Entropy measures average uncertainty
  • Entropy measures randomness

H(X) = -sum over x of P(x) log P(x)
If log is base 2, then the units are called bits
16
Entropy versus randomness
Entropy is maximum at maximum randomness
Example: Coin Toss
P(heads) = 0.1: not very random, H(X) = 0.47 bits
P(heads) = 0.5: completely random, H(X) = 1 bit
(Figure: entropy plotted against P(heads))
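The coin-toss numbers can be checked with a few lines of Python. This is a small sketch; the function name entropy is an illustrative choice, and log base 2 is used so that the result is in bits.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; outcomes with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


print(round(entropy([0.1, 0.9]), 2))  # biased coin
print(round(entropy([0.5, 0.5]), 2))  # fair coin
print(round(entropy([0.25] * 4), 2))  # uniform over A, C, G, T
```

The last line gives the 2 bits used later as the prior uncertainty about a nucleotide under a uniform background.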
17
Entropy Examples
18
Information Content
  • Information is a decrease in uncertainty

Once I tell you the sun will rise, your
uncertainty about the event decreases
Information = Hbefore(X) - Hafter(X)
Information is the difference in entropy before and after
receiving information
19
Motif Information
Motif Position Information = Hbackground(X) - Hmotif_i(X) = 2 - Hmotif_i(X)
Hbackground(X): prior uncertainty about the nucleotide, H(X) = 2 bits
Hmotif_i(X): uncertainty after learning it is position i in a
motif, H(X) = 0.63 bits
Uncertainty at this position has been reduced by
2 - 0.63 = 1.37 bits
20
Motif Logo
Conserved residue: reduction of uncertainty of 2
bits
Little conservation: minimal reduction of
uncertainty
21
Background DNA Frequency
The definition of information assumes a uniform
background DNA nucleotide frequency. What if the
background frequency is not uniform?
Motif Position Information = Hbackground(X) - Hmotif_i(X), where Hbackground(X) is now less than 2 bits
Example: -0.2 bits
Some motifs could have negative information!
22
A Different Measure
  • Relative entropy or Kullback-Leibler (KL)
    divergence
  • Divergence between a true distribution and
    another

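DKL(Pmotif || Pbackground) = sum over x of Pmotif(x) log2 [ Pmotif(x) / Pbackground(x) ]
True distribution: Pmotif
Other distribution: Pbackground
Properties
  • DKL is larger the more different Pmotif is from
    Pbackground
  • Same as Information if Pbackground is uniform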
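A short Python sketch can make the two properties concrete. The motif position and the AT-rich background below are hypothetical values chosen for illustration, and the function names are assumptions of this example.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a distribution given as {base: prob}."""
    return -sum(x * math.log2(x) for x in p.values() if x > 0)


def kl_divergence(p_motif, p_background):
    """Relative entropy D_KL(p_motif || p_background) in bits."""
    return sum(
        p * math.log2(p / p_background[b]) for b, p in p_motif.items() if p > 0
    )


uniform = {b: 0.25 for b in "ACGT"}
at_rich = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}   # hypothetical 20% GC background
position = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}  # hypothetical motif position

# With a uniform background, KL divergence equals 2 - H(position).
print(round(kl_divergence(position, uniform), 3), round(2 - entropy(position), 3))

# With a skewed background, an entropy difference can go negative
# for a position that is close to uniform, while KL divergence stays >= 0.
flat_position = dict(uniform)
print(round(entropy(at_rich) - entropy(flat_position), 3))
print(round(kl_divergence(flat_position, at_rich), 3))
```

The second half shows why the entropy difference is a poor measure on a skewed genome: a position that differs clearly from the background still gets a negative value, whereas the KL divergence is positive.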
23
Comparing Both Methods
Information: assuming uniform background DNA
KL distance: assuming 20% GC content (e.g.
Plasmodium)
24
Online Logo Generation
http://weblogo.berkeley.edu/
http://biodev.hgen.pitt.edu/cgi-bin/enologos/enologos.cgi
25
Finding New Motifs
  • Learning Motif Models

26
A Promoter Model
Motif of length K with model Pk(S|M)
The same motif model in all promoters
27
Probability of a Sequence
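Given a sequence, a motif model and a motif
location
(Figure: a sequence spanning positions 1 to 100 with the motif A T A T G C at positions 60 to 65)
P(Seq | location, Motif) = [background probability of positions 1 to 59] x [product of Pk(Sk|M) over positions 60 to 65] x [background probability of positions 66 to 100]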
28
Parameterizing the Motif Model
Given multiple sequences and motif locations but
no motif model
Aligned motif instances: AATGCG, ATATGG, ATATCG, GATGCA
Count frequencies and add pseudocounts
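As a sketch of this step, the four instances on the slide can be turned into a PFM in Python. The pseudocount of 1 per base and the function name are assumptions of this example; the slide does not say which pseudocount was used.

```python
from collections import Counter

# The four aligned motif instances shown on the slide.
SITES = ["AATGCG", "ATATGG", "ATATCG", "GATGCA"]


def pfm_from_sites(sites, pseudocount=1.0, alphabet="ACGT"):
    """Count bases per column, add a pseudocount, and normalise to frequencies."""
    width = len(sites[0])
    total = len(sites) + pseudocount * len(alphabet)
    pfm = []
    for k in range(width):
        counts = Counter(site[k] for site in sites)
        pfm.append({b: (counts[b] + pseudocount) / total for b in alphabet})
    return pfm


pfm = pfm_from_sites(SITES)
for k, column in enumerate(pfm, start=1):
    print(k, column)
```

The pseudocount keeps every probability above zero, so a base that was never observed at a position does not force the probability of a whole sequence to zero.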
29
Finding Known Motifs
Given multiple sequences and motif model but no
motif locations
Calculate P(Seq_window | Motif) for every starting
location (window)
Choose best starting location in each sequence
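The window scan can be sketched as follows. The 3-position PFM and the test sequence are hypothetical, and the function names are illustrative only.

```python
# Hypothetical 3-position motif model that favours "AGC".
PFM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]


def window_probability(window, pfm):
    """P(window | motif): product of the per-position base probabilities."""
    p = 1.0
    for column, base in zip(pfm, window):
        p *= column[base]
    return p


def best_start(seq, pfm):
    """0-based start of the window with the highest P(window | motif)."""
    width = len(pfm)
    starts = range(len(seq) - width + 1)
    return max(starts, key=lambda i: window_probability(seq[i:i + width], pfm))


print(best_start("TTTAGCTT", PFM))  # prints 3, the start of "AGC"
```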
30
Discovering Motifs
Given a set of co-regulated genes, we need to
discover the motif from the sequences alone
We have neither a motif model nor motif
locations. We need to discover both
How can we approach this problem? (Hint: start
with a random motif model)
31
Expectation Maximization (EM)
  • Remember the basic idea!
  • Use model to estimate (distribution of) missing
    data
  • Use estimate to update model
  • Repeat until convergence

The model is the motif model. The missing data are the
motif locations
32
EM for Motif Discovery
  1. Start with random motif model
  2. E step: estimate the probability of each motif position
    in each sequence
  3. M step: use the estimate to update the motif model
  4. Iterate (to convergence)

At each iteration, P(Sequences | Model) is guaranteed
not to decrease
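The four steps can be sketched in Python under simplifying assumptions that are not stated on the slide: exactly one motif occurrence per sequence, a fixed uniform background, a small pseudocount in the M step, and a fixed number of iterations instead of a convergence test. This is an illustration of the idea, not MEME's implementation.

```python
import random

ALPHABET = "ACGT"


def em_motif(seqs, width, iterations=50, pseudocount=0.1, seed=0):
    """Toy EM for one motif occurrence per sequence (uniform background assumed)."""
    rng = random.Random(seed)
    # Step 1: random motif model, one random distribution per position.
    pfm = []
    for _ in range(width):
        weights = [rng.random() + 0.1 for _ in ALPHABET]
        pfm.append({b: w / sum(weights) for b, w in zip(ALPHABET, weights)})
    posteriors = []
    for _ in range(iterations):
        # E step: probability of each start position in each sequence.
        posteriors = []
        for s in seqs:
            likes = []
            for i in range(len(s) - width + 1):
                p = 1.0
                for k in range(width):
                    p *= pfm[k][s[i + k]] / 0.25  # ratio to uniform background
                likes.append(p)
            z = sum(likes)
            posteriors.append([x / z for x in likes])
        # M step: expected base counts, weighted by the start probabilities.
        counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
        for s, post in zip(seqs, posteriors):
            for i, weight in enumerate(post):
                for k in range(width):
                    counts[k][s[i + k]] += weight
        pfm = [{b: c[b] / sum(c.values()) for b in ALPHABET} for c in counts]
    return pfm, posteriors


# Hypothetical toy input: "ACGT" planted in three short sequences.
seqs = ["TTTACGTTT", "GGACGTGGG", "CACGTCCCC"]
pfm, posteriors = em_motif(seqs, width=4)
for post in posteriors:
    print(max(range(len(post)), key=post.__getitem__))
```

Because the result depends on the random starting model, a different seed can end in a different local maximum.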
33
MEME
  • MEME - implements EM for motif discovery in DNA
    and proteins
  • MAST: searches sequences for motifs given a model

http://meme.sdsc.edu/meme/
34
P(Seq | Model) Landscape
EM searches for parameters to increase
P(seqs | parameters)
Useful to think of P(seqs | parameters) as a
function of parameters
(Figure: surface of P(Sequences | params1, params2) over Parameter1 and Parameter2)
Where EM starts can make a big difference
35
Search from Many Different Starts
To minimize the effects of local maxima, you
should search multiple times from different
starting points
MEME uses this idea: start at many points, run
for one iteration, then choose the starting point that
got the highest P(Sequences | params) and continue
(Figure: surface of P(Sequences | params1, params2) over Parameter1 and Parameter2)
36
Gibbs Sampling
A stochastic version of EM that differs from
deterministic EM in two key ways
  1. At each iteration, we only update the motif
    position of a single sequence
  2. We may update a motif position to a
    suboptimal new position

37
Gibbs Sampling
(Figure: the sampled new location need not be the best location)
  1. Start with random motif locations and calculate
    a motif model
  2. Randomly select a sequence, remove its motif and
    recalculate a temporary model
  3. With the temporary model, calculate the probability of
    the motif at each position on the sequence
  4. Select a new position based on this distribution
  5. Update the model and iterate

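The five steps can be sketched as a toy sampler in Python. The assumptions here are not from the slide: one motif occurrence per sequence, a uniform background, a pseudocount of 0.5, and a fixed number of iterations. It is an illustration of the sampling idea and not AlignACE's implementation.

```python
import random

ALPHABET = "ACGT"


def build_pfm(seqs, starts, width, skip=None, pseudocount=0.5):
    """Motif model from the current motif locations, optionally leaving one sequence out."""
    counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
    for n, (s, i) in enumerate(zip(seqs, starts)):
        if n == skip:
            continue
        for k in range(width):
            counts[k][s[i + k]] += 1
    return [{b: c[b] / sum(c.values()) for b in ALPHABET} for c in counts]


def gibbs_motif(seqs, width, iterations=500, seed=0):
    rng = random.Random(seed)
    # Step 1: random motif locations.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        # Step 2: pick a sequence and build a temporary model without it.
        n = rng.randrange(len(seqs))
        pfm = build_pfm(seqs, starts, width, skip=n)
        s = seqs[n]
        # Step 3: weight of the motif at each position of that sequence.
        weights = []
        for i in range(len(s) - width + 1):
            p = 1.0
            for k in range(width):
                p *= pfm[k][s[i + k]] / 0.25  # ratio to uniform background
            weights.append(p)
        # Step 4: sample a new position in proportion to the weights
        # (not necessarily the best one).
        starts[n] = rng.choices(range(len(weights)), weights=weights)[0]
    # Step 5: final model from the sampled locations.
    return starts, build_pfm(seqs, starts, width)


# Hypothetical toy input: "ACGT" planted in three short sequences.
seqs = ["TTTACGTTT", "GGACGTGGG", "CACGTCCCC"]
starts, pfm = gibbs_motif(seqs, width=4)
print(starts)
```

Sampling the new position instead of taking the maximum is what lets the search step away from a local maximum; the price is that two runs with different seeds can return different locations.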
38
Gibbs Sampling and Climbing
Because Gibbs sampling does not always choose the
best new location, it can move to another place
not directly uphill
(Figure: surface of P(Sequences | params1, params2) over Parameter1 and Parameter2)
In theory, Gibbs sampling is less likely to get
stuck in a local maximum
39
AlignACE
  • Implements Gibbs sampling for motif discovery
  • Several enhancements
  • ScanACE: looks for motifs in a sequence given a
    model
  • CompareACE: calculates similarity between two
    motifs (e.g. for clustering motifs)

http://atlas.med.harvard.edu/cgi-bin/alignace.pl
40
Assessing Motif Quality
Scan the genome for all instances and associate
with nearest genes
  • Category Enrichment: look for association
    between motif and sets of genes. Score using the
    hypergeometric distribution
    • Functional Category
    • Gene Families
    • Protein Complexes
  • Group Specificity: how restricted are motif
    instances to the promoter sequences used to find
    the motif?
  • Positional Bias: do motif instances cluster at a
    certain distance from ATG?
  • Orientation Bias: do motifs appear
    preferentially upstream or downstream of genes?
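The category enrichment score can be sketched with the hypergeometric tail probability. The function and parameter names below are illustrative assumptions, and the gene counts in the usage lines are made up.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """P(X >= k): chance of seeing at least k category genes among n motif-bearing
    genes, when K of the N genes in the genome belong to the category."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


# Made-up example: 6000 genes, 100 in the category,
# 50 genes carry the motif and 8 of those are in the category.
print(hypergeom_pvalue(6000, 100, 50, 8))

# Tiny sanity check: drawing 2 of 10 genes, 5 in the category, both hits.
print(hypergeom_pvalue(10, 5, 2, 2))
```

A small p-value means the overlap between motif-bearing genes and the category is larger than expected by chance. When many categories are tested, some correction for multiple testing is needed.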

41
Comparative Motif Prediction
42
Kellis et al. (2003) Nature
43
Conservation of Motifs
GAL10
Scer TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACA
Spar CTATGTTGATCTTTTCAGAATTTTT-CACTATATTAAGATGGGTGCAAAGAAGTGTGATTATTATATTACATCGCTTTCCTATCATACACA
Smik GTATATTGAATTTTTCAGTTTTTTTTCACTATCTTCAAGGTTATGTAAAAAA-TGTCAAGATAATATTACATTTCGTTACTATCATACACA
Sbay TTTTTTTGATTTCTTTAGTTTTCTTTCTTTAACTTCAAAATTATAAAAGAAAGTGTAGTCACATCATGCTATCT-GTCACTATCACATATA

Scer TATCCATATCTAATCTTACTTATATGTTGT-GGAAAT-GTAAAGAGCCCCATTATCTTAGCCTAAAAAAACC--TTCTCTTTGGAACTTTCAGTAATACG
Spar TATCCATATCTAGTCTTACTTATATGTTGT-GAGAGT-GTTGATAACCCCAGTATCTTAACCCAAGAAAGCC--TT-TCTATGAAACTTGAACTG-TACG
Smik TACCGATGTCTAGTCTTACTTATATGTTAC-GGGAATTGTTGGTAATCCCAGTCTCCCAGATCAAAAAAGGT--CTTTCTATGGAGCTTTG-CTA-TATG
Sbay TAGATATTTCTGATCTTTCTTATATATTATAGAGAGATGCCAATAAACGTGCTACCTCGAACAAAAGAAGGGGATTTTCTGTAGGGCTTTCCCTATTTTG

Scer CTTAACTGCTCATTGC-----TATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCT
Spar CTAAACTGCTCATTGC-----AATATTGAAGTACGGATCAGAAGCCGCCGAGCGGACGACAGCCCTCCGACGGAATATTCCCCTCCGTGCGTCGCCGTCT
Smik TTTAGCTGTTCAAG--------ATATTGAAATACGGATGAGAAGCCGCCGAACGGACGACAATTCCCCGACGGAACATTCTCCTCCGCGCGGCGTCCTCT
Sbay TCTTATTGTCCATTACTTCGCAATGTTGAAATACGGATCAGAAGCTGCCGACCGGATGACAGTACTCCGGCGGAAAACTGTCCTCCGTGCGAAGTCGTCT

Scer TCACCGG-TCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAA-----TACTAGCTTTT--ATGGTTATGAA
Spar TCGTCGGGTTGTGTCCCTTAA-CATCGATGTACCTCGCGCCGCCCTGCTCCGAACAATAAGGATTCTACAAGAAA-TACTTGTTTTTTTATGGTTATGAC
Smik ACGTTGG-TCGCGTCCCTGAA-CATAGGTACGGCTCGCACCACCGTGGTCCGAACTATAATACTGGCATAAAGAGGTACTAATTTCT--ACGGTGATGCC
Sbay GTG-CGGATCACGTCCCTGAT-TACTGAAGCGTCTCGCCCCGCCATACCCCGAACAATGCAAATGCAAGAACAAA-TGCCTGTAGTG--GCAGTTATGGT

Scer GAGGA-AAAATTGGCAGTAA----CCTGGCCCCACAAACCTT-CAAATTAACGAATCAAATTAACAACCATA-GGATGATAATGCGA------TTAG--T
Spar AGGAACAAAATAAGCAGCCC----ACTGACCCCATATACCTTTCAAACTATTGAATCAAATTGGCCAGCATA-TGGTAATAGTACAG------TTAG--G
Smik CAACGCAAAATAAACAGTCC----CCCGGCCCCACATACCTT-CAAATCGATGCGTAAAACTGGCTAGCATA-GAATTTTGGTAGCAA-AATATTAG--G
Sbay GAACGTGAAATGACAATTCCTTGCCCCT-CCCCAATATACTTTGTTCCGTGTACAGCACACTGGATAGAACAATGATGGGGTTGCGGTCAAGCCTACTCG

Scer TTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCG--ATGATTTTT-GATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCAC-----TT
Spar GTTTT--TCTTATTCCTGAGACAATTCATCCGCAAAAAATAATGGTTTTT-GGTCTATTAGCAAACATATAAATGCAAAAGTTGCATAGCCAC-----TT
Smik TTCTCA--CCTTTCTCTGTGATAATTCATCACCGAAATG--ATGGTTTA--GGACTATTAGCAAACATATAAATGCAAAAGTCGCAGAGATCA-----AT
Sbay TTTTCCGTTTTACTTCTGTAGTGGCTCAT--GCAGAAAGTAATGGTTTTCTGTTCCTTTTGCAAACATATAAATATGAAAGTAAGATCGCCTCAATTGTA

Scer TAACTAATACTTTCAACATTTTCAGT--TTGTATTACTT-CTTATTCAAAT----GTCATAAAAGTATCAACA-AAAAATTGTTAATATACCTCTATACT
Spar TAAATAC-ATTTGCTCCTCCAAGATT--TTTAATTTCGT-TTTGTTTTATT----GTCATGGAAATATTAACA-ACAAGTAGTTAATATACATCTATACT
Smik TCATTCC-ATTCGAACCTTTGAGACTAATTATATTTAGTACTAGTTTTCTTTGGAGTTATAGAAATACCAAAA-AAAAATAGTCAGTATCTATACATACA
Sbay TAGTTTTTCTTTATTCCGTTTGTACTTCTTAGATTTGTTATTTCCGGTTTTACTTTGTCTCCAATTATCAAAACATCAATAACAAGTATTCAACATTTGT

Scer TTAA-CGTCAAGGA---GAAAAAACTATA
Spar TTAT-CGTCAAGGAAA-GAACAAACTATA
Smik TCGTTCATCAAGAA----AAAAAACTA..
Sbay TTATCCCAAAAAAACAACAACAACATATA

GAL1
slide credits M. Kellis
44
Genome-wide conservation
Evaluate conservation within
(1) All intergenic regions
A signature for regulatory motifs
45
Finding Motifs in Yeast Genomes (M. Kellis PhD
Thesis)
  1. Enumerate all mini-motifs
  2. Apply three tests
    a. Look for motifs conserved in intergenic regions
    b. Look for motifs more conserved intergenically
      than in genes
    c. Look for motifs preferentially conserved upstream
      or downstream of genes

(Figure: example mini-motif; letters shown: N, C T A, C G A)
slide credits M. Kellis
46
Test 1: Intergenic conservation
Conserved count
Total count
slide credits M. Kellis
47
Test 2: Intergenic vs. Coding
Intergenic Conservation
Coding Conservation
slide credits M. Kellis
48
Test 3: Upstream vs. Downstream
Upstream Conservation
Downstream Conservation
slide credits M. Kellis
49
Constructing full motifs
(Figure: Test 1, Test 2 and Test 3 select 2,000 mini-motifs; example letters shown: C T A C G A R R)
slide credits M. Kellis
50
Results
Rank Discovered Motif Known TF motif Tissue Enrichment Distance bias
1 RCGCAnGCGY NRF-1 Yes Yes
2 CACGTG MYC Yes Yes
3 SCGGAAGY ELK-1 Yes Yes
4 ACTAYRnnnCCCR Yes Yes
5 GATTGGY NF-Y Yes Yes
6 GGGCGGR SP1 Yes Yes
7 TGAnTCA AP-1 Yes
8 TMTCGCGAnR Yes Yes
9 TGAYRTCA ATF3 Yes Yes
10 GCCATnTTG YY1 Yes
11 MGGAAGTG GABP Yes Yes
12 CAGGTG E12 Yes
13 CTTTGT LEF1 Yes
14 TGACGTCA ATF3 Yes Yes
15 CAGCTG AP-4 Yes
16 RYTTCCTG C-ETS-2 Yes Yes
17 AACTTT IRF1() Yes
18 TCAnnTGAY SREBP-1 Yes Yes
19 GKCGCn(7)TGAYG Yes Yes
  • 174 promoter motifs
    • 70 match known TF motifs
    • 60 show positional bias
    • → 75% have evidence
  • Control sequences
    • < 2% match known TF motifs
    • < 3% show positional bias
    • → < 7% false positives

slide credits M. Kellis
51
Antigen Epitope Prediction
52
Genome to Immunome
Pathogen genome sequences define all
proteins that could elicit an immune response
  • Looking for a needle...
    • Only a small number of epitopes are typically
      antigenic
  • ...in a very big haystack
    • Vaccinia virus (258 ORFs): 175,716 potential
      epitopes (8-, 9-, and 10-mers)
    • M. tuberculosis (4K genes): 433,206 potential
      epitopes
    • A. nidulans (9K genes): 1,579,000 potential
      epitopes
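The size of the haystack comes from simple window counting: a protein of length L contains L - k + 1 peptides of length k. A minimal Python sketch follows; the function name and the toy proteins are assumptions, and it counts windows rather than distinct peptide sequences.

```python
def count_potential_epitopes(proteins, lengths=(8, 9, 10)):
    """Number of 8-, 9- and 10-mer windows over all proteins (duplicates included)."""
    return sum(max(0, len(p) - k + 1) for p in proteins for k in lengths)


# Hypothetical toy proteome: one 10-residue protein and one too short to count.
toy_proteome = ["MKTAYIAKQR", "MKT"]
print(count_potential_epitopes(toy_proteome))  # 3 + 2 + 1 windows from the first protein
```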

Can computational approaches predict all
antigenic epitopes from a genome?
53
Antigen Processing and Presentation
54
Modeling MHC Epitopes
  • Have a set of peptides that have been associated
    with a particular MHC allele
  • Want to discover the motif within the peptides bound
    by the MHC allele
  • Use motif to predict other potential epitopes

55
Motifs Bound by MHCs
  • MHC I
    • Closed ends of groove
    • Peptides 8-10 AAs in length
    • Motif is the peptide
  • MHC II
    • Groove has open ends
    • Peptides have broad length distribution: 10-30
      AAs
    • Need to find binding motif within peptides

56
MHC 2 Motif Discovery
Use Gibbs Sampling!
462 peptides known to bind to MHC
II HLA-DR4 (B1*0401), 9-30 residues in
length
Goal: identify a common length-9 binding
motif
Nielsen et al. (2004) Bioinformatics
57
Vaccinia Epitope Prediction
Moutaftsi et al. (2006) Nat. Biotech.
  • Predict MHC I binding peptides
  • Using 4 matrices for H-2 Kb and Db
  • Top 1% of predictions experimentally validated

49 validated epitopes accounting for 95% of the
immune response
58
(No Transcript)