Geometric Crossovers for Supervised Motif Discovery presentation

Title: Geometric Crossovers for Supervised Motif Discovery

1
Geometric Crossovers for Supervised Motif
Discovery

2
Motivation and Scope

Test the applicability of the geometric
framework to a supervised motif discovery
problem
Compare its merits with those of a previously used operator
In practice, we test on a very easy problem
that existing software can solve easily
Value as a test case
Building block for more complex motif discovery
problems that current algorithms cannot solve
satisfactorily

3
Motif Discovery

4
The Standard Approach

5
Motif discovery as a classification problem

6
Classification problem

Protein sequences, from the SwissProt database
Classified according to protein family (as
specified in the Prosite database)
Six families were selected that have previously
been shown to be hard to classify under similar
circumstances
Some of the families can be said to have an
overrepresented motif, such as the ones we can train on

7
The Potential Negative Data Set

Huge compared to the positive set
Quite common in bioinformatics, and an
interesting problem to cope with in its own right
In the field:
randomly generated sequences
one set of randomly selected sequences
random rearrangement of the positive sequences
(data not shown)
The best practice was to select the samples
randomly from the negative set each generation,
so that the sample size matches that of the positive set.
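
The per-generation sampling described on this slide can be sketched as follows. This is a minimal illustration, not the presenters' code: the function name `sample_negatives`, the placeholder sequence names, and the fixed seed are all assumptions made for the example.

```python
import random

def sample_negatives(negatives, n_positive, rng):
    """Draw a fresh random subset of the negative set, as large as the positive set."""
    # Sampling without replacement; assumes len(negatives) >= n_positive.
    return rng.sample(negatives, n_positive)

# Placeholder data (hypothetical): a small positive set, a much larger negative set.
positives = ["POS%d" % i for i in range(20)]
negatives = ["NEG%d" % i for i in range(1000)]

rng = random.Random(0)  # fixed seed only so the sketch is reproducible
for generation in range(3):
    batch = sample_negatives(negatives, len(positives), rng)
    # ... evaluate the population on positives + batch here ...
```

Resampling every generation means no single subset of the negatives defines the fitness landscape for the whole run.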

8
Motif Model

Motif C...C.C..C
Sequence DMEGACGGSCACSTCHVIVDP
Figure: motif match in a positive sequence
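
The match on this slide can be reproduced with a regular expression, assuming that `.` in the motif stands for exactly one arbitrary residue. That reading is consistent with the example, but the full motif grammar is not spelled out in the transcript. The helper name `find_motif` is hypothetical.

```python
import re

def find_motif(motif, sequence):
    """Return (start, matched_text) for the first match of the motif, or None."""
    # Assumption: '.' is a one-residue wildcard and every other character
    # must match literally, so the motif can be used directly as a regex.
    match = re.search(motif, sequence)
    if match is None:
        return None
    return match.start(), match.group(0)

print(find_motif("C...C.C..C", "DMEGACGGSCACSTCHVIVDP"))  # (5, 'CGGSCACSTC')
```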
9
Operators on Motifs

10
Geometric Crossover

The search space has a metric
Mutation is a move in the search space
Crossover yields children located on a shortest
path between the parents in the search space
Has been successfully applied to other problems
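
The shortest-path property can be checked on a simple case. A well-known instance of a geometric crossover is uniform crossover on equal-length strings under the Hamming distance. The sketch below uses that setting, not the motif setting of this talk, and verifies that the child lies between its parents, that is, d(p1, c) + d(c, p2) = d(p1, p2). The function names and parent strings are illustrative.

```python
import random

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def uniform_crossover(p1, p2, rng):
    """Pick each position from one parent or the other at random."""
    return "".join(rng.choice((x, y)) for x, y in zip(p1, p2))

rng = random.Random(0)
p1, p2 = "CACCAC", "CGCTAA"
child = uniform_crossover(p1, p2, rng)
# The child lies on a shortest path between the parents in Hamming space.
assert hamming(p1, child) + hamming(child, p2) == hamming(p1, p2)
```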

11
Geometric Crossover for Motifs

View motifs as sequences
Basic assumption: the edit distance is a good way
to move around in motif space
A crossover based on the edit distance should
yield a good crossover for motif discovery
We (arbitrarily) choose unit costs for
insertions, deletions and substitutions
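
For reference, the edit distance with unit costs can be computed by standard dynamic programming. This is a minimal sketch that treats every motif character as a plain symbol; how wildcards such as `.` should be costed is not stated in the slides, so treating them like any other symbol is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs for insertion, deletion, substitution."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("BANANA", "ANANAS"))  # 2
```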

12
Sequence Alignment

An alignment inserts gaps (-) into both sequences
so that they have the same length
Seq1 agcacac-a
Seq2 a-cacacta
Score 2
An optimal alignment is an alignment with minimal
score
The score of an optimal alignment of two
sequences equals their edit distance
There are often multiple optimal alignments
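
One optimal alignment can be recovered by tracing back through the edit-distance table. The sketch below assumes unit costs and returns just one of the possibly many optimal alignments, so its output need not be identical to the alignment shown on the slide. The names `align` and `score` are illustrative.

```python
def align(a, b):
    """Return one optimal unit-cost alignment of a and b as two gapped strings."""
    n, m = len(a), len(b)
    # d[i][j] = edit distance between a[:i] and b[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    # Trace back from the bottom-right corner of the table.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1])   # match or substitution column
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            out_a.append(a[i - 1])   # gap in b
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")        # gap in a
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def score(gapped_a, gapped_b):
    """Alignment score: number of columns whose two symbols differ."""
    return sum(x != y for x, y in zip(gapped_a, gapped_b))

s1, s2 = align("agcacaca", "acacacta")
print(s1)
print(s2)
print(score(s1, s2))  # 2
```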

13
Homologous Crossover

Seq1 BANANA-
Seq2 -ANANAS
Mask 1101100
SeqA BANANAS
SeqB -ANANA-
Child1 BANANAS
Child2 ANANA
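
The example on this slide can be reproduced with a few lines of code. The mask convention used here (1 takes the column from the first aligned parent for the first child, 0 from the second) is inferred from the values on the slide and is therefore an assumption, as is the function name.

```python
def homologous_crossover(aligned1, aligned2, mask):
    """Recombine two aligned (gapped) parents column by column using a binary mask."""
    assert len(aligned1) == len(aligned2) == len(mask)
    child_a, child_b = [], []
    for x, y, bit in zip(aligned1, aligned2, mask):
        # Bit '1': first child takes the column from parent 1, second from parent 2.
        # Bit '0': the other way round.
        if bit == "1":
            child_a.append(x)
            child_b.append(y)
        else:
            child_a.append(y)
            child_b.append(x)
    gapped_a, gapped_b = "".join(child_a), "".join(child_b)
    # Removing the gap symbols gives the actual offspring.
    return gapped_a.replace("-", ""), gapped_b.replace("-", "")

print(homologous_crossover("BANANA-", "-ANANAS", "1101100"))  # ('BANANAS', 'ANANA')
```

Because the parents are recombined along an alignment, each child copies every column from one of the two parents, and the children can differ in length from both parents.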
14
Experiments

15
Dynamic behavior during evolution

16
Maximum Values

17
Max

18
Cytochrome

19
Ferredoxin

20
Population Means

21
Means

22
Classification Performance
Medians of 10 experiments for each family
23
Classification Performance - II

24
Concluding remarks

The geometric operator is promising, but needs work
It is more length-preserving than substring swap
The geometric operator needs a good guess of the motif
length
Edit moves might not be optimal for motif
discovery,
even though they show merit for some problems.
Our initial assumption implies that insertions/deletions
occur as often as replacements
in sequence data
We are WAY off on that parameter
