Title: Motif Discovery in DNA Sequences with EC
1 Motif Discovery in DNA Sequences with EC
2 Biological Background
Transcription Factor
Expressed
Coexpressed Genes
TFBS motifs
3 Problem Description
Upstream Sequences
4 TFBS Identification
- Experimental methods (in vivo)
- most accurate and reliable
- time-consuming and expensive.
- Computational methods
- based on the upstream sequences
- fast and inexpensive
- de novo TFBS identification
5 Deterministic Methods
- A quick look at deterministic methods
- Enumeration of all possible motifs
- Approximate matching
- (l, d)-motif discovery problem
- Generalized suffix trees: $O(N k^2 l^d |\Sigma|^d)$
- Disadvantages
- Over-prediction: a large amount of output
- TFBS motifs are weakly conserved
d?
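The enumeration approach to the (l, d)-motif problem can be sketched by brute force. This is a toy illustration only, not the suffix-tree algorithm; the function names and the toy sequences are assumptions made for the example.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate: str, seq: str, d: int) -> bool:
    """True if some window of seq is within d mismatches of candidate."""
    l = len(candidate)
    return any(hamming(candidate, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def enumerate_ld_motifs(seqs: list[str], l: int, d: int) -> list[str]:
    """Test all 4^l candidates; keep those found (with <= d mismatches) in every sequence."""
    candidates = ("".join(c) for c in product("ACGT", repeat=l))
    return [c for c in candidates if all(occurs_within(c, s, d) for s in seqs)]


# Hypothetical toy input: each sequence holds a copy of ACGT with at most one mismatch.
toy = ["TTACGTAA", "GGACCTCC", "CAACGAGT"]
motifs = enumerate_ld_motifs(toy, l=4, d=1)
```

The candidate set grows as 4^l, which is why exhaustive enumeration is only practical for short motifs.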
6 Conventional Methods
- Multiple Sequence Alignment methods
- Do not lose any information
- Do not generalize the sequence data
- Limited help for biological understanding
- Large
- Machine learning methods
- EM, Gibbs sampling, HMM, Neural networks
- Do not produce biologically meaningful results
- Prior knowledge: weight matrix
- Local search: local optima
7 Evolutionary Computation
- Why use EC for motif discovery?
- Global search, though it also does not guarantee optimal solutions
- Good scaling
- Flexibility of scoring
- Flexibility of representation
Review: Lones et al., GECCO 2005
8 EC methods for Motif Discovery
- FMGA: Finding Motifs by Genetic Algorithm
- MDGA: Motif Discovery Using a Genetic Algorithm
- GACluster: Identification of Weak Motifs in Multiple Biological Sequences using Genetic Algorithm
- St-GA: Motif Discovery in Upstream Sequences of Coordinately Expressed Genes
- Discovery, validation, and genetic dissection of transcription factor binding sites by comparative and functional genomics
9 FMGA: Liu et al., BIBE 2004
IUPAC ambiguity codes
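The standard IUPAC nucleotide ambiguity codes can be written as a lookup table. A minimal sketch: the `matches` helper is a hypothetical illustration of how a consensus with ambiguity codes is compared against a site, not code from FMGA.

```python
# Standard IUPAC nucleotide codes: each code maps to the bases it allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def matches(consensus: str, site: str) -> bool:
    """True if every base of site is allowed by the code at the same position."""
    return len(consensus) == len(site) and all(
        base in IUPAC[code] for code, base in zip(consensus, site))
```

For example, `matches("WCGCGW", "ACGCGT")` is true, while `matches("WCGCGW", "CCGCGT")` is false because W allows only A or T.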
10 FMGA: Liu et al., BIBE 2004
- Consensus-led: consensus generated randomly
- Fitness function
In the fitness function, m is the index of the sequence, i is the position within the motif, n is the index of the motif pattern, k is the length of the motif pattern, and j is the number of matched regions in the sequence.
11 FMGA: Liu et al., BIBE 2004
- IUPAC ambiguity codes
- Total fitness score function
- where L is the total number of sequences
12 FMGA: Liu et al., BIBE 2004
- Operators
- Mutation: create a weight matrix from the matched motif patterns
- Randomly mutate the positions that are not completely conserved
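One way to read this mutation operator, as a rough sketch: count the bases in each column of the matched patterns, keep positions where the column is fully conserved, and redraw the others at random. The function name, the uniform redraw over A, C, G, T, and the choice to keep the individual's own base at conserved positions are assumptions; the slides do not give these details.

```python
import random
from collections import Counter


def mutate_unconserved(consensus: str, matched: list[str], rng: random.Random) -> str:
    """Redraw only the positions whose column in the matched patterns is not fully conserved."""
    out = []
    for i, base in enumerate(consensus):
        column = Counter(site[i] for site in matched)  # one column of the count matrix
        if len(column) == 1:
            out.append(base)                 # fully conserved: leave untouched
        else:
            out.append(rng.choice("ACGT"))   # assumed: uniform random replacement
    return "".join(out)


rng = random.Random(0)
child = mutate_unconserved("ACGT", ["ACGT", "ACGA", "ACGT"], rng)
```

Only the last position can change here, since the first three columns are identical across the matched patterns.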
13 FMGA: Liu et al., BIBE 2004
- Operators
- Crossover: one-point crossover
- Ambiguity codes penalty
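One-point crossover on two consensus strings can be sketched as follows. This is a generic version of the operator; the function name and the requirement that both parents have equal length are assumptions for the example.

```python
import random


def one_point_crossover(p1: str, p2: str, rng: random.Random) -> tuple[str, str]:
    """Swap the tails of two equal-length parents at one random cut point."""
    if len(p1) != len(p2) or len(p1) < 2:
        raise ValueError("parents must have equal length of at least 2")
    cut = rng.randrange(1, len(p1))  # cut strictly inside the string
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]


rng = random.Random(42)
c1, c2 = one_point_crossover("AAAAAA", "TTTTTT", rng)
```

Each child starts with one parent's prefix and ends with the other parent's suffix, so the two children together contain exactly the parents' symbols.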
14 FMGA: Liu et al., BIBE 2004
- Rearrangement
- If the predicted motif pattern is unchanged for more than K generations (e.g., K = 10)
- For diversity
15 FMGA: Liu et al., BIBE 2004
- Experiments
- Compared with MEME and Gibbs sampler, FMGA has better prediction results.
- In computation time, FMGA is faster than MEME and slower than Gibbs sampler.
- Comments
- Early method
- Biological meaning: AAA..., TTT...
16 MDGA: Che et al., GECCO 2005
- Positions-led, like Gary B. Fogel's method
- Representation: concatenated binary-encoded string
- Fitness: information content
- Pseudocount ($d_b$) is used
In the fitness formula, $f_b$ is the observed frequency of nucleotide $b$ in the column and $p_b$ is the background frequency of the same nucleotide. The summation is taken over the four possible types of nucleotides ($b$). $W$ is the motif width.
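An information-content fitness of this kind can be sketched as below. The sketch assumes the usual relative-entropy form, summing $f_b \log_2(f_b / p_b)$ over the four nucleotides and the $W$ columns. The exact formula and pseudocount handling in MDGA may differ, and the function name, the uniform default background, and the pseudocount value 0.25 are assumptions.

```python
import math
from collections import Counter


def information_content(sites: list[str], background=None, pseudo: float = 0.25) -> float:
    """Sum over columns and nucleotides of f_b * log2(f_b / p_b), with pseudocounts."""
    bg = background or {b: 0.25 for b in "ACGT"}  # assumed uniform background
    n = len(sites)
    width = len(sites[0])  # W, the motif width
    total = 0.0
    for i in range(width):
        counts = Counter(site[i] for site in sites)
        for b in "ACGT":
            # Pseudocount keeps f_b > 0 so the logarithm is always defined.
            f = (counts[b] + pseudo) / (n + 4 * pseudo)
            total += f * math.log2(f / bg[b])
    return total


conserved = information_content(["ACGT"] * 4)
mixed = information_content(["ACGT", "CGTA", "GTAC", "TACG"])
```

A perfectly conserved alignment scores high, while an alignment whose columns match the background distribution scores zero.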
17 MDGA: Che et al., GECCO 2005
- Selection
- Roulette wheel mechanism
- The phase problem
- shifting all starting positions to the left or right by a small number
- Crossover operators
- single-point and double-point
- Mutation
- bitwise mutation operator
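The operators listed above can be sketched in a few lines. These are generic textbook versions, not the MDGA implementation; all function names, the clamping of shifted positions to the valid range, and the bit-string encoding are assumptions.

```python
import random


def roulette_select(population: list, fitnesses: list[float], rng: random.Random):
    """Pick one individual with probability proportional to its (non-negative) fitness."""
    return rng.choices(population, weights=fitnesses, k=1)[0]


def shift_phase(starts: list[int], offset: int, seq_lens: list[int], width: int) -> list[int]:
    """Shift every start position by the same offset, clamped so the motif still fits."""
    return [min(max(s + offset, 0), n - width) for s, n in zip(starts, seq_lens)]


def bitwise_mutation(bits: str, rate: float, rng: random.Random) -> str:
    """Flip each bit independently with probability rate."""
    return "".join(
        ("1" if b == "0" else "0") if rng.random() < rate else b
        for b in bits)


rng = random.Random(7)
shifted = shift_phase([0, 5, 10], -2, [20, 20, 20], 6)
```

The phase shift addresses the case where all predicted positions are off by the same few bases, so the alignment is nearly right but scores poorly. Note that `roulette_select` needs at least one positive fitness value.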
18 MDGA: Che et al., GECCO 2005
- Experiment Results
- Crossover operators
- Mutation rate: 0.01
- Compared with the Gibbs Sampler and BioProspector
- on the set of 18 sequences
- Computational time is better than AlignACE
19 MDGA: Che et al., GECCO 2005
[Results table: deviation from the true starting positions (ER) for Gibbs Sampler, BioProspector, and MDGA]
20 MDGA: Che et al., GECCO 2005
- Conclusion
- better prediction accuracy
- searches the space with a better strategy
- shorter running time for long sequences
- The assumption
- each sequence contains a motif
- zero or more in real cases
21 GACluster: Paul et al., GECCO 2006
- Consensus-led with GA alignment technique
- Tackles the (l, d)-motif discovery problem as well as weakly conserved motif discovery
- The framework
[Framework diagram: fitness evaluation, cluster, alignment score of subsequences]
22 GACluster: Paul et al., GECCO 2006
- Consensus from subsequences
- Focus on weakly conserved motifs
- Fitness Evaluation
- the alignment score (Information Content)
- Non-linear combination
23 GACluster: Paul et al., GECCO 2006
- Fitness Evaluation (cont.)
- Example
- AT, AC, AG, AA, AC, TC, AG, TG
- Clustering and Scoring
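The slides do not spell out the clustering rule, so the following is only one plausible illustration: a greedy grouping of the example subsequences by Hamming distance to the first member of each cluster. The function names, the greedy strategy, and the threshold d = 1 are assumptions, not the GACluster procedure.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_by_distance(subseqs: list[str], d: int) -> list[list[str]]:
    """Greedy clustering: join the first cluster whose seed is within d mismatches."""
    clusters: list[list[str]] = []
    for s in subseqs:
        for cluster in clusters:
            if hamming(cluster[0], s) <= d:  # cluster[0] acts as the seed
                cluster.append(s)
                break
        else:
            clusters.append([s])  # no cluster was close enough: start a new one
    return clusters


dimers = ["AT", "AC", "AG", "AA", "AC", "TC", "AG", "TG"]
clusters = cluster_by_distance(dimers, d=1)
# clusters == [["AT", "AC", "AG", "AA", "AC", "AG"], ["TC", "TG"]]
```

With this rule the example splits into an A-led cluster and a T-led cluster. A different seed order or distance threshold would give a different grouping.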
24 GACluster: Paul et al., GECCO 2006
[Figure: three panels, each labeled "Min d 1"]
25 GACluster: Paul et al., GECCO 2006
- Fitness
- Offspring generation
- One point crossover
- Single mutation
- Dealing with Poly-A and TATA box
- Reduce the fitness
26 GACluster: Paul et al., GECCO 2006
- Experiments
- CRP motifs
- The method, with 3 motifs found, outperforms the binary GA, which finds no real motifs
- MCB
- True motifs: ACGCGT, ACGCGA, CCGCGT, TCGCGA, ACGCGT, ACGCGT; consensus: WCGCGW
- GACluster: ACGCGT, ACGCGT, ACGCGT, ACGCGT, ACGCGT, ACGCGT; consensus: ACGCGT
- Binary GA: TTTCGA, TCACCA, TCACGT, TGACGA, TCACGA, TAACGG; none are true motifs
27 GACluster: Paul et al., GECCO 2006
- Discussion
- starting population is very important
- prevent loss of some initial motifs and keep diversity
- Conclusions
- Drawback: like other computational methods of motif discovery, the method looks for similar subsequences in multiple biological sequences, and many of these similar subsequences have no biological significance.
28 St-GA: Stine et al., CEC 2003
- Consensus-led
- Representation: structured GA with binary encoding (two bits for each base)
- $A = \langle S_1, S_2 \rangle$, where $S_1$ is the activation level and $S_2$ is the expression level
- $A = (a_i, a_{ij})$, with $a_i \in \{0, 1\}$, $i = 0, \ldots, 14$, and $a_{ij} \in \{0, 1\}$, $i = 0, \ldots, 14$, $j = 0, 1$
- The interpreted string is constructed from $A$ by concatenating each two-bit block of $S_2$, $(a_{i0}, a_{i1})$, for which $a_i = 1$ in $S_1$
- Supports motifs of various lengths
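The decoding step of the structured representation can be sketched as follows. The two-bit-to-base mapping (00 = A, 01 = C, 10 = G, 11 = T) and the function name are assumptions; the slides state only that each base uses two bits and that a block is expressed when its activation bit is 1.

```python
# Assumed mapping from two-bit blocks to bases; the original mapping is not given.
BASES = {"00": "A", "01": "C", "10": "G", "11": "T"}


def decode(activation: str, expression: str) -> str:
    """Concatenate the bases of the two-bit blocks whose activation bit is 1."""
    if len(expression) != 2 * len(activation):
        raise ValueError("expression level needs two bits per activation bit")
    return "".join(
        BASES[expression[2 * i:2 * i + 2]]
        for i, a in enumerate(activation) if a == "1")


short = decode("101", "000110")  # middle block is switched off
full = decode("111", "000110")   # all three blocks are expressed
```

Here `short` is "AG" and `full` is "ACG", which shows how switching activation bits changes the motif length without changing the expression bits.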
29 St-GA: Stine et al., CEC 2003
- Use BLAST for alignment
- bl2seq: align the motif (subsequence) against the sequences
- Measures from bl2seq; P is the number of sequences
- Also used as fitness
- Results
- Cannot handle motifs shorter than 7
- (l, d)-motif: poor results when d > 1
30 Comparative and functional genomics: Gertz et al., Genome Res. vol. 15, 2005
- Biological Experiment Oriented
- in vivo validation of de novo motifs
- GA is used as the computational method
- Consensus-led: randomly generated
- Fitness
- Information content I
- Conservation C: scores from other species
- Fitness
31 Summary
- Representation: consensus vs. positions
- Motif generation: from subsequences vs. randomly generated (enumeration)
- Encoding: binary vs. natural
- Evaluation methods: alignment vs. no alignment
- Fitness function: information content, similarity
- Prior knowledge
32 The End