Title: Geometric Crossovers for Supervised Motif Discovery
1Geometric Crossovers for Supervised Motif
Discovery
2Motivation and Scope
- Try out the applicability of the geometric
framework, on a supervised motif discovery
problem - Compare its merits to a previously used operator.
- In practice, we test on a very easy problem
- that existing software can solve easily
- Value as test case
- Building block for more complex motif discovery
problems, that current algorithms can not solve
satisfactory
3Motif Discovery
- Has become a standard problem in bioinformatics
- Given a set of sequences, figure out what is
special with it - by eliciting motifs in the dataset
- Differing by
- Motif model
- Learning algorithms
- Scoring functions
4The Standard Approach
- Do analysis of the positive set of sequences
- background distribution
- information content
- statistical significance
- Report motif
5Motif discovery as a classification problem
- Always at least two datasets The positive, and
the rest - Choose a negative dataset
- Report motifs best suited to discriminate
- No need to learn a background model
- The statistical significance of the motif can be
given - Discriminative motif discovery has received
increased attention lately
6Classification problem
- Protein sequences, from the SwissProt database
- Classified according to protein family (as
specified in the Prosite database) - Selected six families, that previously have been
shown to be hard to classify under similar
circumstances. - Some of the families can be said to have an
overrepresented motif as the ones we can train on
7The Potential Negative Data Set
- Huge, compared to the negative
- Quite common in bioinformatics, and an
interesting problem to cope with in its own right - In field
- randomly generated sequences
- one set of randomly selected sequences
- random rearrangement of the positive sequences
(data not shown) - The best practice was to select the samples
randomly from the negative set each generation,
so that their size matches the positive set.
8Motif Model
- Twenty amino acids
- Wildcard
C...C.C..C DMEGACGGSCACSTCHVIVDP
Motif match, positive sequence
9Operators on Motifs
- Unit edit move as mutation
- Mut(A) Insert, Delete or Replace a token
- Substring Swapping Crossover (for comparison)
- Two-point Geometric Crossover
10Geometric Crossover
- Search space have a metric
- Mutation is a move in search space
- Crossover yield children found on the shortest
path between the parents in search space - Successfully applied to other problems
11Geometric Crossover for Motifs
- View motifs as sequences
- Basic assumption The edit distance is a good way
to move around in motif space - A crossover based on the edit distance, should
yield a good crossover for motif discovery - We (arbitrarily) choose unit costs for
insertions, deletions and substitutions
12Sequence Alignment
- Alignment put spaces (-) in both sequences such
as they become of the same length - Seq1 agcacac-a
- Seq2 a-cacacta
- Score 2
- An Optimal alignment is an alignment with minimal
score - The score of the optimal alignment of two
sequences equals their edit distance - There often are multiple optimal alignments
13Homologous Crossover
- Pick an optimal alignment for two parent
sequences - Generate a crossover mask as long as the
alignment - Recombine as traditional crossover
- Remove dashes from offspring
Child1 BANANAS Child2 ANANA
Mask 1101100 Seq1 BANANA- Seq2
-ANANAS SeqA BANANAS SeqB -ANANA-
14Experiments
- Two crossovers with same parameters, and mutation
only - Ten fold cross validation
- Partitioned datasets in ten pieces
- Trained on 9/10ths
- Tested the best motif on the remaining test set
- Trained on randomly selected subset of SwissProt
- Tested on entire SwissProt
- Fitness Scaled Pearson correlation of confusion
matrix
15Dynamic behavior during evolution
16Maximum Values
17Max
18Cytochrome
- Include the following fragment of a highly
conserved motif CCH - Which geometric crossover find
- While substring swapping finds CH
- Conservation of length keeps us in the correct
ballpark - CH representa local maximum for substring swap
19Ferredoxin
- Contains the following motif C..C..C...CPH
- Which Substring Swap finds
- While Geometric Crossover dont
- Conservation of length keeps us from finding the
correct motif
20Population Means
21Means
22Classification Performance
Medians, of 10 experiments, for each family
23Classification Performance - II
- Similar for all operators
- Maybe a slight advantage, for the geometric
crossover if we have - A highly conserved motif exist
- A ballpark guess on motif length
- Surprisingly, mutation frequently outperforms the
other operators
24Concluding remarks
- The geometric operator is promising - need work
- It is more length preserving than substring swap
- The geometric operator need a good guess on motif
length - Edit move might not be optimal for motif
discovery? - even though, it for some problems shows merit.
- Our initial assumption imply an
insertion/deletion equally often as replacement
in sequence data - we are WAY off on that parameter
25Future Work
- Synthetic data with known parameters
- Include character classes and within motif gaps
in representation - Modules (composite motifs)
- Expand to position weight matrixes