Geometric Crossovers for Supervised Motif Discovery

1
Geometric Crossovers for Supervised Motif
Discovery
  • Rolv Seehuus
  • NTNU

2
Motivation and Scope
  • Try out the applicability of the geometric
    framework, on a supervised motif discovery
    problem
  • Compare its merits to a previously used operator
  • In practice, we test on a very easy problem
    that existing software can solve easily
      • Value as a test case
      • Building block for more complex motif discovery
        problems that current algorithms cannot solve
        satisfactorily

3
Motif Discovery
  • Has become a standard problem in bioinformatics
  • Given a set of sequences, figure out what is
    special about it by eliciting motifs in the
    dataset
  • Approaches differ by
      • Motif model
      • Learning algorithm
      • Scoring function

4
The Standard Approach
  • Analyze the positive set of sequences
      • background distribution
      • information content
      • statistical significance
  • Report the motif

5
Motif discovery as a classification problem
  • Always at least two datasets: the positive, and
    the rest
  • Choose a negative dataset
  • Report motifs best suited to discriminate
  • No need to learn a background model
  • The statistical significance of the motif can be
    given
  • Discriminative motif discovery has received
    increased attention lately

6
Classification problem
  • Protein sequences, from the SwissProt database
  • Classified according to protein family (as
    specified in the Prosite database)
  • Selected six families that have previously been
    shown to be hard to classify under similar
    circumstances
  • Some of the families can be said to have an
    overrepresented motif of the kind we can train on

7
The Potential Negative Data Set
  • Huge, compared to the positive
  • Quite common in bioinformatics, and an
    interesting problem to cope with in its own right
  • In the field
      • randomly generated sequences
      • one set of randomly selected sequences
      • random rearrangement of the positive sequences
        (data not shown)
  • The best practice was to select the samples
    randomly from the negative set each generation,
    so that the sample size matches the positive set
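
The per-generation sampling described above can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the function name `sample_negatives` and the toy sequences are hypothetical, and sampling is assumed to be uniform without replacement.

```python
import random

def sample_negatives(negatives, n_positive, rng):
    """Draw a fresh random subset of the negative set, sized to match the positive set."""
    k = min(n_positive, len(negatives))
    return rng.sample(negatives, k)  # uniform, without replacement (assumption)

# Hypothetical toy data, for illustration only
positives = ["ACDC", "CCHK", "MCAC"]
negatives = ["GGGG", "AAAA", "KLMN", "PQRS", "TVWY", "DEFG", "HIKL"]

rng = random.Random(0)
for generation in range(3):
    batch = sample_negatives(negatives, len(positives), rng)
    # The fitness of each motif would be evaluated on positives + batch here.
    print(generation, batch)
```

Resampling every generation keeps each evaluation balanced while still exposing the population to the whole negative set over time.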

8
Motif Model
  • Twenty amino acids
  • Wildcard

  • Example of a motif match in a positive sequence:
    motif C...C.C..C, sequence DMEGACGGSCACSTCHVIVDP
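
A minimal sketch of matching under this motif model, where `.` is the wildcard and every other token must match exactly. The function names are hypothetical, and the sketch assumes the motif may match at any offset in the sequence.

```python
def motif_matches_at(motif, sequence, start):
    """True if the motif matches the sequence beginning at offset `start`."""
    if start < 0 or start + len(motif) > len(sequence):
        return False
    window = sequence[start:start + len(motif)]
    # '.' matches any amino acid; every other token must be identical.
    return all(m == "." or m == s for m, s in zip(motif, window))

def find_motif(motif, sequence):
    """Return every offset at which the motif matches the sequence."""
    return [i for i in range(len(sequence) - len(motif) + 1)
            if motif_matches_at(motif, sequence, i)]

print(find_motif("C...C.C..C", "DMEGACGGSCACSTCHVIVDP"))  # [5]
```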
9
Operators on Motifs
  • Unit edit move as mutation
  • Mut(A): insert, delete or replace a token
  • Substring Swapping Crossover (for comparison)
  • Two-point Geometric Crossover
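
The unit edit move can be sketched like this. It is an illustration under assumptions: the name `mutate`, the uniform choice among the three moves, and the rule that a motif is never deleted down to the empty string are not taken from the slides.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty amino acids
ALPHABET = AMINO_ACIDS + "."          # plus the wildcard token

def mutate(motif, rng):
    """Apply one unit edit move: insert, delete or replace a single token."""
    ops = ["insert"]
    if len(motif) > 0:
        ops.append("replace")
    if len(motif) > 1:  # assumption: never delete down to an empty motif
        ops.append("delete")
    op = rng.choice(ops)
    if op == "insert":
        i = rng.randrange(len(motif) + 1)
        return motif[:i] + rng.choice(ALPHABET) + motif[i:]
    i = rng.randrange(len(motif))
    if op == "delete":
        return motif[:i] + motif[i + 1:]
    # Replace with a different token so the move always has edit distance 1.
    new_token = rng.choice([t for t in ALPHABET if t != motif[i]])
    return motif[:i] + new_token + motif[i + 1:]

rng = random.Random(0)
print([mutate("C..C", rng) for _ in range(5)])
```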

10
Geometric Crossover
  • The search space has a metric
  • Mutation is a move in the search space
  • Crossover yields children found on the shortest
    path between the parents in the search space
  • Successfully applied to other problems

11
Geometric Crossover for Motifs
  • View motifs as sequences
  • Basic assumption: the edit distance is a good way
    to move around in motif space
  • A crossover based on the edit distance should
    yield a good crossover for motif discovery
  • We (arbitrarily) choose unit costs for
    insertions, deletions and substitutions
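
As a sketch of the metric involved, the following computes the edit distance with unit costs and checks whether a candidate child lies on a shortest path between two parents. The function names and the example motifs are hypothetical; the wildcard `.` is treated as an ordinary token.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def on_shortest_path(parent1, parent2, child):
    """True if the child lies between the parents under the edit distance."""
    return (edit_distance(parent1, child) + edit_distance(child, parent2)
            == edit_distance(parent1, parent2))

print(edit_distance("C..C", "C.CH"))             # 2
print(on_shortest_path("C..C", "C.CH", "C.C"))   # True
print(on_shortest_path("C..C", "C.CH", "CCCC"))  # False
```

A crossover is geometric under this metric when every child it can produce passes the `on_shortest_path` check.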

12
Sequence Alignment
  • An alignment puts spaces (-) in both sequences so
    that they become the same length
  • Seq1: agcacac-a
  • Seq2: a-cacacta
  • Score: 2
  • An optimal alignment is an alignment with minimal
    score
  • The score of the optimal alignment of two
    sequences equals their edit distance
  • There often are multiple optimal alignments
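
One way to obtain an optimal alignment is the standard dynamic-programming table followed by a traceback, sketched below with unit costs. The function name `align` is hypothetical. The traceback returns only one of the possibly many optimal alignments, so it need not reproduce the exact layout shown above.

```python
def align(a, b):
    """Return (aligned_a, aligned_b, score) for one optimal unit-cost alignment."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    # Trace back from the bottom-right corner, preferring diagonal moves.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), d[n][m]

s1, s2, score = align("agcacaca", "acacacta")
print(s1)
print(s2)
print(score)  # 2
```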

13
Homologous Crossover
  1. Pick an optimal alignment for two parent
    sequences
  2. Generate a crossover mask as long as the
    alignment
  3. Recombine as in traditional crossover
  4. Remove dashes from offspring


Example:
  Seq1   BANANA-
  Seq2   -ANANAS
  Mask   1101100
  SeqA   BANANAS
  SeqB   -ANANA-
  Child1 BANANAS
  Child2 ANANA
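
Steps 3 and 4 can be sketched as below, taking the aligned parents and the mask as given. The function name `homologous_recombine` is hypothetical. The mask convention (a 1 sends the first parent's token to the first child) is inferred from the BANANA example and is an assumption.

```python
def homologous_recombine(aligned1, aligned2, mask):
    """Recombine two aligned parents with a crossover mask, then remove dashes."""
    if not (len(aligned1) == len(aligned2) == len(mask)):
        raise ValueError("aligned parents and mask must have the same length")
    # Child A takes the first parent's token where the mask is 1, else the second's.
    seq_a = "".join(x if bit == "1" else y
                    for x, y, bit in zip(aligned1, aligned2, mask))
    # Child B is the complement.
    seq_b = "".join(y if bit == "1" else x
                    for x, y, bit in zip(aligned1, aligned2, mask))
    return seq_a.replace("-", ""), seq_b.replace("-", "")

print(homologous_recombine("BANANA-", "-ANANAS", "1101100"))
# ('BANANAS', 'ANANA')
```

Both children are one edit away from each parent, and the parents are two edits apart, so each child lies on a shortest path between them.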
14
Experiments
  • Two crossovers with the same parameters, and
    mutation only
  • Ten-fold cross-validation
      • Partitioned datasets into ten pieces
      • Trained on 9/10ths
      • Tested the best motif on the remaining test set
  • Trained on a randomly selected subset of SwissProt
  • Tested on the entire SwissProt
  • Fitness: scaled Pearson correlation of the confusion
    matrix
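
For a two-class confusion matrix, the Pearson correlation between predicted and actual labels is the phi coefficient, also known as the Matthews correlation coefficient. The sketch below computes it. The slides do not say how the value is scaled, so the linear rescaling to [0, 1] in `scaled_fitness` is purely an assumption, as are the function names.

```python
import math

def pearson_confusion(tp, fp, tn, fn):
    """Pearson (phi / Matthews) correlation of a 2x2 confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # undefined case; treating it as uncorrelated is a convention
    return (tp * tn - fp * fn) / denom

def scaled_fitness(tp, fp, tn, fn):
    """Assumed scaling: map the correlation from [-1, 1] onto [0, 1]."""
    return (pearson_confusion(tp, fp, tn, fn) + 1) / 2

print(pearson_confusion(10, 0, 10, 0))  # 1.0, perfect classifier
print(pearson_confusion(5, 5, 5, 5))    # 0.0, no better than chance
print(scaled_fitness(0, 10, 0, 10))     # 0.0, always wrong
```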

15
Dynamic behavior during evolution
16
Maximum Values
17
Max
18
Cytochrome
  • Includes the following fragment of a highly
    conserved motif: CCH
  • Which geometric crossover finds
  • While substring swapping finds CH
  • Conservation of length keeps us in the correct
    ballpark
  • CH represents a local maximum for substring swap

19
Ferredoxin
  • Contains the following motif: C..C..C...CPH
  • Which substring swap finds
  • While geometric crossover does not
  • Conservation of length keeps us from finding the
    correct motif

20
Population Means
21
Means
22
Classification Performance
Medians of 10 experiments, for each family
23
Classification Performance - II
  • Similar for all operators
  • Maybe a slight advantage for the geometric
    crossover if
      • a highly conserved motif exists
      • we have a ballpark guess of the motif length
  • Surprisingly, mutation frequently outperforms the
    other operators

24
Concluding remarks
  • The geometric operator is promising, but needs work
  • It is more length preserving than substring swap
  • The geometric operator needs a good guess of motif
    length
  • Edit moves might not be optimal for motif
    discovery?
      • even though they show merit for some problems
  • Our initial assumption implies that insertions and
    deletions occur as often as replacements in
    sequence data
      • we are WAY off on that parameter

25
Future Work
  • Synthetic data with known parameters
  • Include character classes and within motif gaps
    in representation
  • Modules (composite motifs)
  • Expand to position weight matrices