TFBS Identification with Genetic Algorithms using Suffix Tree

1
TFBS Identification with Genetic Algorithms using
Suffix Tree
  • Cyrus Chan

2
Biological Background
Transcription Factor
Expressed
Coexpressed Genes
TFBS motifs
3
Problem Description
Upstream Sequences
4
Related Methods
  • Deterministic Methods
  • (l, d)-motif discovery problem
  • Weakly conserved motifs in real problems; l?
  • Machine learning methods
  • Training data or strong prior knowledge
  • May bias the training data

5
Related Works
  • Genetic Algorithms / Evolutionary Computation
  • Consensus-led
  • Evaluate the fitness of aligned subsequences
  • Rely on additional techniques: alignments,
    clustering
  • Positions-led
  • Evaluate the fitness with a position matrix
  • Cannot locate similar motifs quickly

6
GAST
  • Genetic Algorithm using Suffix Tree (GAST)
  • Combines positions and consensus
  • Suffix tree: content-addressable
  • Generalized suffix tree: for multiple sequences
  • Locate motifs quickly
  • Positions diversity
  • Based on the similarities between different
    motifs; l need not be specified
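
To illustrate why a generalized suffix structure makes motif lookup content-addressable, here is a minimal sketch. It builds a naive, depth-limited generalized suffix trie rather than a compressed linear-time suffix tree, and it is not GAST's actual data structure. The class name `SuffixTrie`, the `max_depth` cap, and the toy sequences are assumptions made for illustration.

```python
class SuffixTrie:
    """Naive generalized suffix trie over several sequences (illustrative only)."""

    def __init__(self, sequences, max_depth=8):
        # max_depth is an assumed cap on motif length to keep the trie small.
        self.max_depth = max_depth
        self.root = {"children": {}, "occ": []}
        for seq_id, seq in enumerate(sequences):
            for start in range(len(seq)):
                node = self.root
                # Insert the suffix starting at `start`, truncated to max_depth.
                for ch in seq[start:start + max_depth]:
                    node = node["children"].setdefault(
                        ch, {"children": {}, "occ": []}
                    )
                    # Every node remembers where its prefix occurs.
                    node["occ"].append((seq_id, start))

    def find(self, pattern):
        """Return all (sequence index, offset) pairs where pattern occurs."""
        if not pattern or len(pattern) > self.max_depth:
            raise ValueError("pattern length must be between 1 and max_depth")
        node = self.root
        for ch in pattern:
            node = node["children"].get(ch)
            if node is None:
                return []
        return list(node["occ"])


sequences = ["ACGTAC", "TTACGT"]
trie = SuffixTrie(sequences)
hits = trie.find("ACG")
```

A lookup costs time proportional to the pattern length, independent of how many sequences were indexed, which is the property the slide refers to as locating motifs quickly.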

7
Representation
Individual
Consensus
Positions
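
One possible encoding of an individual that carries both a consensus and positions is sketched below. The class name `Individual`, the field names, and the choice of exactly one start offset per upstream sequence are assumptions; the slide does not specify these details.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Individual:
    consensus: str        # candidate motif string
    positions: List[int]  # assumed: one start offset per upstream sequence

    def motifs(self, sequences):
        """Extract the subsequence each position points at."""
        length = len(self.consensus)
        return [seq[p:p + length] for seq, p in zip(sequences, self.positions)]


ind = Individual(consensus="ACG", positions=[0, 2])
extracted = ind.motifs(["ACGTAC", "TTACGT"])
```
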
8
Genetic Operators
  • Mutation
  • Type 1: randomly mutate the consensus, with the
    positions not completely preserved
  • Use suffix tree
  • Type 2: single mutation of the positions within an
    individual
  • Multi-point?
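
The two mutation types could look roughly like the sketch below. It is an interpretation, not the presenter's code: Type 1 changes one consensus base and then re-anchors a position wherever the new consensus occurs exactly, keeping the old position otherwise, with `str.find` standing in for the suffix tree lookup. The function names and the `rng` parameter are hypothetical.

```python
import random

ALPHABET = "ACGT"


def mutate_consensus(consensus, positions, sequences, rng=random):
    """Type 1 (sketch): change one consensus base, then re-anchor positions."""
    i = rng.randrange(len(consensus))
    new_base = rng.choice([b for b in ALPHABET if b != consensus[i]])
    new_consensus = consensus[:i] + new_base + consensus[i + 1:]
    new_positions = []
    for seq, pos in zip(sequences, positions):
        hit = seq.find(new_consensus)  # assumed stand-in for a suffix tree query
        # Positions are only partly preserved: they move when an exact hit exists.
        new_positions.append(hit if hit >= 0 else pos)
    return new_consensus, new_positions


def mutate_position(consensus, positions, sequences, rng=random):
    """Type 2 (sketch): move a single position to a random valid offset."""
    k = rng.randrange(len(positions))
    max_start = len(sequences[k]) - len(consensus)
    new_positions = list(positions)
    new_positions[k] = rng.randint(0, max_start)
    return new_positions


seqs = ["ACGTAC", "TTACGT"]
rng = random.Random(0)
new_cons, new_pos = mutate_consensus("ACG", [0, 2], seqs, rng)
moved = mutate_position("ACG", [0, 2], seqs, rng)
```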

9
Genetic Operators
  • Single-point Crossover
  • Performed on individuals in representation of
    positions
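
A minimal sketch of single-point crossover on the position vectors follows. The function name and the `rng` parameter are assumptions, and how the consensus is re-derived for the children is not stated on the slide, so it is left out here.

```python
import random


def single_point_crossover(pos_a, pos_b, rng=random):
    """Swap the tails of two position vectors at one random cut point."""
    if len(pos_a) != len(pos_b) or len(pos_a) < 2:
        raise ValueError("parents need equal length of at least 2")
    cut = rng.randrange(1, len(pos_a))  # cut strictly inside the vector
    child_1 = pos_a[:cut] + pos_b[cut:]
    child_2 = pos_b[:cut] + pos_a[cut:]
    return child_1, child_2


parent_a = [1, 2, 3, 4]
parent_b = [5, 6, 7, 8]
c1, c2 = single_point_crossover(parent_a, parent_b, random.Random(42))
```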

10
Fitness Function
  • Information Content
  • G. B. Fogel's fitness score was also tried

Where f_{b,i} is the observed frequency of
nucleotide b in column i and p_b (0.25) is
the background frequency of the same nucleotide.
The summation is taken over the four possible
types of nucleotides (b ∈ {A, T, C, G}). l is the
motif length.
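
The definitions above match the standard information content of a motif alignment. Assuming that standard form, IC = sum over columns i = 1..l and nucleotides b of f_{b,i} * log2(f_{b,i} / p_b). The sketch below computes it. The function name is hypothetical, a zero frequency contributes zero, and no pseudocounts are applied, which is a simplifying assumption.

```python
import math
from collections import Counter


def information_content(motifs, background=0.25):
    """Sum over columns and nucleotides of f * log2(f / p), skipping f == 0."""
    length = len(motifs[0])
    n = len(motifs)
    total = 0.0
    for i in range(length):
        counts = Counter(m[i] for m in motifs)
        for b in "ATCG":
            f = counts[b] / n  # observed frequency of b in column i
            if f > 0:
                total += f * math.log2(f / background)
    return total


conserved = information_content(["ACGT", "ACGT"])
uniform = information_content(["A", "C", "G", "T"])
```

A perfectly conserved column contributes 2 bits with a uniform background of 0.25, and a column where all four nucleotides are equally frequent contributes 0.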
11
Selection
  • K-tournament
  • Elitism was tried, but the result was poor:
    premature convergence
  • K-tournament adopted from G. B. Fogel's work
  • Each individual competes with K random ones
  • Rank individuals by the wins
  • Re-rank by Fitness if there is a tie
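
The ranking steps above can be sketched as follows. The function name, drawing rivals with replacement from the other individuals, and counting a win only on strictly higher fitness are assumptions not stated on the slide.

```python
import random


def k_tournament_rank(fitness, k, rng=random):
    """Rank individuals by wins against k random rivals, ties broken by fitness."""
    n = len(fitness)
    wins = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        rivals = [rng.choice(others) for _ in range(k)]  # assumed: with replacement
        wins.append(sum(fitness[i] > fitness[j] for j in rivals))
    # Sort by wins first, then by fitness, best first.
    return sorted(range(n), key=lambda i: (wins[i], fitness[i]), reverse=True)


scores = [1.0, 5.0, 3.0, 2.0]
order = k_tournament_rank(scores, k=3, rng=random.Random(1))
```

The returned list holds population indices from best to worst; a caller would typically keep the top-ranked part as parents for the next generation.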

12
Rearrangement
  • Take Advantage of Suffix Tree
  • Rank each motif within an individual according to
    its similarity to the consensus
  • Locate the occurrence of motifs in different
    sequences with Suffix Tree from highest to lowest
    similarity
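
One way to read the rearrangement step is sketched below. This is an interpretation, not the presenter's procedure: motifs are ranked by Hamming similarity to the consensus, and each sequence adopts an exact occurrence of a better-ranked motif when one exists. `str.find` stands in for the suffix tree lookup, and the function names are hypothetical.

```python
def hamming_similarity(a, b):
    """Number of matching characters between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b))


def rearrange(consensus, positions, sequences):
    """Move positions to occurrences of motifs closer to the consensus."""
    length = len(consensus)
    motifs = [s[p:p + length] for s, p in zip(sequences, positions)]
    # Deduplicate while keeping order, then rank from most to least similar.
    ranked = sorted(
        dict.fromkeys(motifs),
        key=lambda m: hamming_similarity(m, consensus),
        reverse=True,
    )
    new_positions = list(positions)
    for k, seq in enumerate(sequences):
        current = hamming_similarity(motifs[k], consensus)
        for m in ranked:
            if hamming_similarity(m, consensus) <= current:
                break  # no remaining candidate would improve this sequence
            hit = seq.find(m)  # assumed stand-in for a suffix tree query
            if hit >= 0:
                new_positions[k] = hit
                break
    return new_positions


rearranged = rearrange("ACG", [2, 0, 3], ["TTACGT", "ACCACG", "GGGACT"])
```

In this toy input the second sequence moves from `ACC` at offset 0 to `ACG` at offset 3, while the third keeps its position because no better motif occurs in it.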

13
Other Issues
  • Deal with TATA box and Poly-A
  • Penalize the fitness of an individual with an A/T
    percentage higher than a user-defined threshold
  • Phase problem
  • The solution is a shifted version of the optimal
    one
  • Cannot be handled by genetic operators
  • Shift based on the gain of Information Content
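
Both fixes on this slide can be sketched briefly. The shift window `max_shift`, shifting all positions together, the multiplicative `penalty`, and the default `threshold` are assumptions; the slide only says that shifting follows the gain in information content and that the A/T threshold is user defined.

```python
import math
from collections import Counter


def information_content(motifs, background=0.25):
    """Standard information content, with zero frequencies contributing zero."""
    total = 0.0
    for i in range(len(motifs[0])):
        counts = Counter(m[i] for m in motifs)
        for b in "ATCG":
            f = counts[b] / len(motifs)
            if f > 0:
                total += f * math.log2(f / background)
    return total


def shift_by_ic_gain(positions, sequences, length, max_shift=2):
    """Shift all positions together by the offset with the highest IC."""
    def valid(d):
        return all(0 <= p + d and p + d + length <= len(s)
                   for s, p in zip(sequences, positions))

    def ic_at(d):
        return information_content(
            [s[p + d:p + d + length] for s, p in zip(sequences, positions)]
        )

    shifts = [d for d in range(-max_shift, max_shift + 1) if valid(d)]
    # Prefer the highest IC; on ties prefer the smallest move.
    best = max(shifts, key=lambda d: (ic_at(d), -abs(d)))
    return [p + best for p in positions]


def at_penalized(fitness, motifs, threshold=0.8, penalty=0.5):
    """Scale fitness down when the A/T fraction exceeds the threshold."""
    joined = "".join(motifs)
    at_fraction = sum(ch in "AT" for ch in joined) / len(joined)
    return fitness * penalty if at_fraction > threshold else fitness


seqs = ["GGACGTGG", "CCACGTCC", "TTACGTAA"]
shifted = shift_by_ic_gain([3, 3, 3], seqs, length=4)
```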

14
Preliminary Experiments
  • The experiment of GACluster (Paul et al., GECCO'06)
  • CRP motifs
  • The method with 3 motifs found outperforms the
    binary GA, which finds no real motifs

True Motifs
15
Preliminary Experiments
  • GAST is able to identify 5 motifs in the
    sequences (population 1000, 500 generations)
  • CRP: 18 sequences of 105 bp
  • Locates some of the motifs
  • More generations are needed

16
Discussion
  • Mutation rate / crossover rate
  • Convergence
  • Evaluate Fitness?
  • Structured motifs / Modules

Expressed
17
TFBS Identification with Genetic Algorithms using
Suffix Tree
  • Thank You!
  • Cyrus Chan