Title: Gene Expression Messy GAs
1Gene Expression Messy GAs
- Kargupta, et al
- Presented by
- Abhishek Singh
2Underpinnings
- Extension of mGA
- Black Box Optimization
- Relevance of SEARCH
- Importance of intra-cellular information flow for SEARCH
- No explicit modeling
- Precursor to Model Building GAs
3Overview
- SEARCH (some gory details)!
- Natural Evolution as a SEARCH
- Basic Gene Expression Messy GA
- Updates on the GEMGA
- Interspersed with Results and Achievements
- Summary and Conclusions
- Punctuated with Discussions and Debates
(hopefully)
4SEARCH overview
- Black Box Search
- Enumeration
- Induction
- Search Envisioned as Relation and Class Hierarchizing
- Framework to formalize sample complexity, difficulty, etc.
- Relations, Classes, and Samples
5SEARCH contd.
- Enumerative search exponential
- Alternative: stochastic decision making based on sampling
- Consequence Premise
- Assumes inductive relationships
- No relations enumeration!
- Relations classify search domain
- Classes contain optima
6SEARCH Decomposition
- Relation, R: Set of ordered pairs
- Class, C: Instantiation of a relation
- Sample, S: Instance of a class
- Relations imposed implicitly or explicitly
- Representation
- Operators
- Heuristics
- Direct modeling
- Typical example for GAs follows
7Relations, Classes, and Samples
[Figure: relation space (f), class space (1, 0), and sample space (1100, 1001, 1110, 1011, 0100, 0011, 0000, 0111)]
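A minimal sketch of how a relation divides the sample space into classes. It assumes the relation fixes only the first bit (written here as the hypothetical pattern `f###`, with `#` as a wildcard); the helper name `classify` is illustrative and not from the slides.

```python
from collections import defaultdict

def classify(samples, relation):
    """Group samples into classes; positions marked 'f' in the relation are fixed."""
    classes = defaultdict(list)
    for s in samples:
        # Keep the bit at fixed positions, use the wildcard '#' elsewhere.
        key = "".join(bit if r == "f" else "#" for bit, r in zip(s, relation))
        classes[key].append(s)
    return dict(classes)

samples = ["1100", "1001", "1110", "1011", "0100", "0011", "0000", "0111"]
classes = classify(samples, "f###")  # assumed relation: only the first bit matters
```

Under this assumed relation the eight strings fall into two classes, `1###` and `0###`, each holding four samples.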
8SEARCH components
- Classification based on relations
- Sampling
- Evaluation, ordering, selection of better classes
- Evaluation, ordering, selection of better relations
- Resolution
9A little bit of theory!
- $r_i$: $i$th relation
- $\Psi_r$: Set of all relations
- $C_i$: Set of classes created by $r_i$, $|C_i| = N_i$
- P: Perturbation operator (dumb or smart)
- $T$, $T_r$: Class/relation comparison statistic
- $M_i$: Best from $C_i$ (depends on decision error probability)
- $S_r$: Best used from $\Psi_r$
- $C_i$, $r_i$: Sampled ordered class/relation
10SEARCH challenges
- Based on $T$ some classes are ordered and the best $M_i$ are selected
- Pruning of classes search (resolution)
- Relation must classify the space such that the optimal class $C_i$ is within the $T$-based $M_i$
- Sampled $C_i$ and actual $C_i$ may not be the same
- $C_i$ must be in the sampled $M_i$
- Based on $T_r$ relations are ordered and selected
- Pruning of relations search based on past
11SEARCH challenges contd.
- Ordering dependent on sampling
- SEARCH fails if
- Defining relation is such that, for chosen $T$, $C_i \notin M_i$
- Stochastic error causes $C_i \notin M_i$ (sampled)
12Quantifying the Challenges
- Relation Selection success
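- $\Pr(\mathrm{CRS} \mid r_i) \ge \Pr(r_k \le_{T_r} r_i)_{\min}^{\|\Psi_r\| - \|S_r\|}$
- Class Selection success
- $\Pr(\mathrm{CCS} \mid r_i) \ge \Pr(C_k \le_{T} C_i)_{\min}^{N_i - M_i}$
- Overall success
- $\Pr(\mathrm{CRS} \mid r_i)\,\Pr(\mathrm{CCS} \mid r_i)\quad \forall r_i \in S_r$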
13Specializing the Equation
- Specialize to a specific class comparison statistic and representation
- Similar to earlier work on decision making (Goldberg, et al, 1992)
- Order statistics used
14Ordinal Class Selection
- Prob. of correct binary decision
- $\Pr(F^{T}_{j,i} \le_{a} F^{T}_{k,i}) \ge 1 - 2^{nH(a)} (a - d)^{a n}$
- $d$: zone of indifference
- $F(\cdot)$: CDF of $F^{T}$
- $a$: Quantile of CDF
- $n$: no. of samples
- $H$: Binary entropy func.
15Class Selection contd.
- For correct class selection
- $\Pr(\mathrm{CCS} \mid r_i) \ge \left[1 - 2^{nH(a)} (a - d)^{a n}\right]^{N_i - M_i}$
- $d = \min_j \{F(F^{T}_{*,i}) - F(F^{T}_{j,i})\}\ \forall j$
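A small numeric sketch of the class-selection bound, assuming it reads $[1 - 2^{nH(a)}(a - d)^{an}]^{N_i - M_i}$ with $H$ measured in bits and $d < a$. The function names and the example values for $n$, $a$, $d$, $N_i$, $M_i$ are illustrative assumptions, not taken from the slides.

```python
import math

def binary_entropy(a):
    """Binary entropy H(a) in bits."""
    if a in (0.0, 1.0):
        return 0.0
    return -a * math.log2(a) - (1 - a) * math.log2(1 - a)

def binary_decision_bound(n, a, d):
    """Lower bound on one correct pairwise decision: 1 - 2^(nH(a)) (a - d)^(an)."""
    return 1.0 - 2.0 ** (n * binary_entropy(a)) * (a - d) ** (a * n)

def class_selection_bound(n, a, d, num_classes, num_kept):
    """Raise the pairwise bound to the power N_i - M_i (clamped at 0 if vacuous)."""
    p = max(binary_decision_bound(n, a, d), 0.0)
    return p ** (num_classes - num_kept)

p_pair = binary_decision_bound(n=20, a=0.5, d=0.4)
p_class = class_selection_bound(n=20, a=0.5, d=0.4, num_classes=8, num_kept=1)
```

With these assumed values the pairwise bound is close to 1 and tightens as the sample count n grows, while a larger gap between N_i and M_i loosens the class-selection bound.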
16Ordinal Relation Selection
- Analysis same as for class selection except $T_r$ used for comparison.
- For one relation $r_i$
- $\Pr(\mathrm{CRS} \mid r_i) \ge \left[1 - 2^{n_r H(a_r)} (a_r - d_r)^{a_r n_r}\right]^{\|\Psi_r\| - \|S_r\|}$
- Overall success probability bound
17Overall Success
- Combine the search for better classes and relations
- $q$: Overall success probability
- $d$: Bound for min $d$ over all classes compared to optima-containing class
- $N_{max} = \max N_i\ \forall r_i \in S_r$
- $M_{min} = \min M_i\ \forall r_i \in S_r$
18Sample Complexity
- Total number of function evaluations
- For non-relational enumerative search, SC becomes the size of the search space
- To bound SEARCH, $N_{max}$ and $S_r$ need to be bounded by polynomials
19Order-k delineable Problems
- Order of a relation, $o(r_i)$, defined as log of number of defined classes
- If $o(r_i) = O(l)$ then exponential $N_{max}$
- For polynomial search
- Bound $o(r_i) \le k$
- Bound $S_r$ to $O(\mathrm{poly}(l))$
- This defines a Generalized Order-k delineable problem
- Polynomial bound means simple relations can capture solutions!
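A short counting sketch of why bounding the order matters. It assumes relations over fixed-length strings are defined by which positions they fix, so a relation of order k creates (alphabet size)^k classes; the function names are hypothetical.

```python
from math import comb

def classes_per_relation(alphabet_size, order):
    """Number of classes a relation of the given order defines."""
    return alphabet_size ** order

def relations_up_to_order(length, k):
    """Number of position subsets of size at most k, a polynomial in length for fixed k."""
    return sum(comb(length, j) for j in range(k + 1))

l, k = 20, 3
bounded = relations_up_to_order(l, k)   # relations of order <= k
unbounded = 2 ** l                      # all position subsets
```

For l = 20 and k = 3 the bounded relation set has 1351 members against 1048576 unrestricted ones, and each order-3 binary relation defines only 8 classes.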
20Milestone 1
- Discussed the basic motivation behind GEMGA
- SEARCH challenges (déjà vu)
- Need for Relational search
- Enumeration bounds any BBS
- Bound on Relation set cardinality for polynomial search
- Next step: SEARCH implementation as GEMGA
21Food for thought
- Linkage: specific spatial operator-defined relation/problem
- Precursors to model builders
- Work relates to earlier and contemporary work on problem difficulty and decision making
- Break
22Recap
- Introduced SEARCH (in some detail)
- SEARCH challenges (intuitive and quantified)
- Order statistics for bounds on SEARCH complexity
- Without hierarchical search BBS is too expensive
- Restrict sampling and use Intelligent Guessing
23Nature Questions and answers
- Natural evolution evolved fitter (?) organisms
- $3 \times 10^9$ base pairs in humans implies HUGE search space
- Without a priori knowledge evolution is a BBS
- Not enough time for enumerative search
24Some Questions
- Problem of adequate time
- Shapiro: Junkyard Tornado!
- Holland: Schema processing
- Goldberg: Problem Decomposition
- Kauffman: Gene Expression
- Problem of selection space
- Problem of recombination
- Recombination good if we know what to combine
- Natural recombination different from GAs
25Evolution as Information Flow
- Extracellular: storage, exploration, and transmission within generations
- Intracellular: gene expression
- Most GAs model extracellular flow
- What about intracellular mechanisms (introns, diploidy, gene expression, etc.)?
26Expression Mechanisms
- Transcription: DNA → mRNA
- Translation: mRNA → proteins
- Protein folding
- Protein → Phenotype
- Let's see these in some detail
27Transcription
- Initiated and terminated by a specific sequence of genes
- RNA polymerase transcribes portion in between (AGCT → AGCU)
- Regulatory proteins bind to DNA portions and control transcription
- Gene activator
- Gene repressor
28Translation
- mRNA is template for protein formation
- (61) Base triplets correspond to (20) amino acids
- 3 for promotion and termination
- Many to one mapping
- Regulated by control systems of repressors, promoters, and operators
- Specialization within cells
29Protein folding
- 3-D structure determines protein function (phenotype)
- This defines fitness space
- Phenotypic Genotypic correspondence
30Intracellular Information Flow
31The SEARCH perspective
- Sample space
- DNA population
- Class space: Amino acid sequences
- mRNA correspond to DNA schemas
- mRNA translate to amino acid seq.
- Define equivalence class
- Relation space: Regulatory mech.
- Transcription process defines classes
- Transcription controlled by extra and intra DNA
components (feed back loop)
32SEARCH answers questions (?)
- Time
- Nature searched for relations too
- Selection
- Feed back loops apportion selection pressure
- Recombination
- Resolution of classes
- Representation (diploidy, introns etc)
33Doubts, concerns
- Concept of natural optimality
- Evolution as adaptation not SEARCH
- Temporal and Spatial niches
34Messy GAs
- Separate relation/class space from sample space
- Deterministically processed order k relations and
classes
35Messy GAs contd.
- Relation still not separate from class
- Sample space consisted of one template
- No implicit parallelism thus expensive
- fmGA tackled this issue (still problems of cross
competition between classes from different
relations)
36GEMGA overview
- Messy GA continued
- SC is for order-k problem, of length $l$, and alphabet $\Lambda$
- Separates relation, class, and sample space
- Explicit relation learning
- Many changes and updates
37Representation
- Messy schemes maintained (locus, value)
- Gene has additional variables
- Weights, linkage lists, capacity
- Start with simple GEMGA with only weights (initialized to 1)
- No under/over-specification
38Population Sizing
- C is signal/noise coefficient
- At least one instance of optimal order-k class in population of $\Lambda^k$
- Similar structure to previous equations
- mGA was $O(\Lambda^k l^k)$
- Order of relations processed
- $2^l$ relations but only O(k) processed
39Basic Operators
- Transcription
- Selection
- Class Selection
- String Selection
- Recombination
- Each in detail
40Transcription
- Detects appropriate order-k relations
- Relations need to be compared
- GEMGA processes relations in distributed manner
- Every chromosome evaluates its genes for instance of good class
- Quality of good classes determines quality of relation
41Transcription continued
- Flip each gene
- Note change in Fitness function
- Fitness Increases
- Gene not part of good class
- Make weight Zero
- Fitness Decreases
- Gene may be part of good class
- Make weight Δfitness
- repeat for C lt ?
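The flip-and-compare step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes binary genes, a maximization problem, weights initialized to 1, that each flip is undone before the next gene is tested, and that a zero change leaves the weight untouched. The name `transcribe` is hypothetical.

```python
def transcribe(bits, fitness):
    """Return one weight per gene based on the fitness change caused by flipping it."""
    base = fitness(bits)
    weights = [1.0] * len(bits)          # weights start at 1
    for i in range(len(bits)):
        flipped = list(bits)
        flipped[i] ^= 1                  # perturb a single gene
        delta = fitness(flipped) - base
        if delta > 0:
            weights[i] = 0.0             # fitness increased: gene not part of a good class
        elif delta < 0:
            weights[i] = -delta          # fitness decreased: weight = size of the change
        # delta == 0: weight left at its initial value (assumption)
    return weights

weights = transcribe([1, 0, 1, 1], fitness=sum)  # one-max used only as a toy fitness
```

On the toy one-max fitness, the single 0 gene gets weight 0 (flipping it helps) while every 1 gene keeps a positive weight.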
42Selection
- Class Selection
- Grow better classes
- Gene with higher weight overwrites one with lower weight on other string
- String Selection
- Binary Tournament Selection
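A rough sketch of the two selection steps, assuming each string is held as parallel lists of bits and weights, and that the overwritten gene also inherits the winning weight (the slides do not say). The names `class_selection` and `binary_tournament` are illustrative.

```python
import random

def class_selection(bits_a, w_a, bits_b, w_b):
    """Gene-wise: the higher-weight gene overwrites the lower-weight one on the other string."""
    for i in range(len(bits_a)):
        if w_a[i] > w_b[i]:
            bits_b[i], w_b[i] = bits_a[i], w_a[i]
        elif w_b[i] > w_a[i]:
            bits_a[i], w_a[i] = bits_b[i], w_b[i]

def binary_tournament(population, fitness, rng):
    """Pick two distinct strings at random and return the fitter one."""
    x, y = rng.sample(population, 2)
    return x if fitness(x) >= fitness(y) else y

a_bits, a_w = [1, 1, 0], [2.0, 0.0, 1.0]
b_bits, b_w = [0, 0, 1], [1.0, 3.0, 1.0]
class_selection(a_bits, a_w, b_bits, b_w)
winner = binary_tournament([[0, 0], [1, 1]], fitness=sum, rng=random.Random(0))
```

Ties in weight leave both genes unchanged in this sketch, which is a design choice rather than something stated in the slides.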
43Recombination
- Randomly pick two strings
- Consider all genes for swapping with some probability
- If weight of gene is greater than corresponding gene then swap it
- What does this do?
- Preserve tight linkage
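The weight-guided swap could look like the sketch below. It assumes parallel lists of bits and weights for the two strings and that weights travel with their genes; `swap_prob` and `recombine` are hypothetical names.

```python
import random

def recombine(bits_a, w_a, bits_b, w_b, swap_prob, rng):
    """Consider every gene; swap when chosen and a's weight exceeds b's at that locus."""
    for i in range(len(bits_a)):
        if rng.random() < swap_prob and w_a[i] > w_b[i]:
            bits_a[i], bits_b[i] = bits_b[i], bits_a[i]
            w_a[i], w_b[i] = w_b[i], w_a[i]

a_bits, a_w = [1, 1, 1], [2.0, 0.0, 1.0]
b_bits, b_w = [0, 0, 0], [1.0, 1.0, 1.0]
recombine(a_bits, a_w, b_bits, b_w, swap_prob=1.0, rng=random.Random(0))
```

Only the first locus is exchanged here, because it is the only one where the first string carries the strictly larger weight.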
44The Algorithm
- Primordial Phase
- l generations (all genes considered)
- Juxtapositional Phase
- Selection and recombination applied
- Every chromosome converges to optimal class when
- Substituting n
- Overall SC $O(\Lambda^k (l + k))$
- Solution quality?
45Results
- Tested over uniform l bit trap functions of length l
- Order-l delineability needed
- Function evaluations grow linearly with l
- Population size is constant for constant l
- Scaling and Noise added
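For reference, one common definition of a deceptive trap function is sketched below; the slides do not give the exact form used in the experiments, so the block scoring (k for all ones, otherwise k - 1 - u for u ones) is an assumption.

```python
def trap(block):
    """Deceptive trap over one block: best at all ones, otherwise rewards zeros."""
    k, u = len(block), sum(block)
    return k if u == k else k - 1 - u

def uniform_trap(bits, k):
    """Sum of order-k traps over consecutive, non-overlapping blocks."""
    if len(bits) % k != 0:
        raise ValueError("string length must be a multiple of k")
    return sum(trap(bits[i:i + k]) for i in range(0, len(bits), k))

best = uniform_trap([1] * 10, 5)      # global optimum
decoy = uniform_trap([0] * 10, 5)     # deceptive attractor
```

The all-zeros string scores almost as well as the optimum, and single-bit improvements lead toward it, which is what makes linkage learning necessary.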
46Results contd.
47Milestone 2
- Natural BBS discussed
- SEARCH implemented as GEMGA to solve order-k delineable problems
- Polynomial time achieved
- Issues solved
- Relation (linkage) space searched
- Simplistic relation search mechanism
- Scope for improvements
- Similarity to hybrid GAs (local search)
48GEMGA revisions
- Need for more explicit relation learning
- Linkage set added to gene representation
- Transcription extended
- And class selection linked with linkage
49Linkage Set
- For each gene the set stores related genes in chromosome
- If genes 1, 5, 10, 15 are related then linkage set for gene 1 is {5, 10, 15} and so on
- Linkage space over all genes defines relation space
50Transcription II
- In addition to previous transcription operator
- Tries to identify the exact relations (construct the linkage set)
- Not very clear how!
- Transcription II applied for $l^2 - l$ generations
51Transcription II contd.
- Pick two points (with weight > 0) on chromosome
- Keep original fitness value
- Perturb both genes
- If change of fitness ≠ change due to perturbation of single gene then genes are related
- Put them in the linkage set
- Change weights to 1
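One way to read the pairwise test is as a non-additivity check: two genes are treated as related when flipping both changes fitness by something other than the sum of the two single-flip changes. That reading is an assumption (the slides themselves note the mechanism is unclear), and all names below are hypothetical.

```python
def fitness_change(bits, positions, fitness):
    """Fitness change caused by flipping the given positions."""
    flipped = list(bits)
    for p in positions:
        flipped[p] ^= 1
    return fitness(flipped) - fitness(bits)

def are_linked(bits, i, j, fitness, tol=1e-9):
    """Assumed test: joint change differs from the sum of the single-gene changes."""
    joint = fitness_change(bits, [i, j], fitness)
    separate = fitness_change(bits, [i], fitness) + fitness_change(bits, [j], fitness)
    return abs(joint - separate) > tol

def build_linkage_sets(bits, weights, fitness):
    """Check every pair of genes with weight > 0 and record related genes."""
    linkage = {i: set() for i in range(len(bits))}
    for i in range(len(bits)):
        for j in range(i + 1, len(bits)):
            if weights[i] > 0 and weights[j] > 0 and are_linked(bits, i, j, fitness):
                linkage[i].add(j)
                linkage[j].add(i)
    return linkage

def two_bit_traps(bits):
    """Toy fitness: consecutive 2-bit traps (2 for '11', 1 for '00', 0 otherwise)."""
    score = 0
    for i in range(0, len(bits), 2):
        u = bits[i] + bits[i + 1]
        score += 2 if u == 2 else 1 - u
    return score

linkage = build_linkage_sets([1, 1, 1, 1], [1, 1, 1, 1], two_bit_traps)
```

On the toy fitness the check recovers the two blocks: genes 0 and 1 are linked, genes 2 and 3 are linked, and no cross-block pair is.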
52Class Selection II
- Here cardinality of linkage set decides gene growth
- Same as previous except genes with high cardinality overwrite lesser genes
- Linkage sets of genes with greater cardinality are copied with genes
- Weight becomes a criterion for gene consideration
53Recombination
- Same as before except cardinality of linkage set is used
- Genes with larger linkage sets are chosen and exchanged
- This preserves linkage better
54Algorithm Complexity
- Transcription I
- l generations
- Transcription II
- $l^2 - l$ generations in worst case
- Juxtapositional phase
- Same as before O(k)
- Population $O(\Lambda^k)$
- Total SC $O(\Lambda^k (l^2 + k))$
- Worse than before, but better linkage
55Final GEMGA
- Linear Sample Complexity achieved
- Change in representation (again)
- Transcription II dropped but relation learning maintained
- Recombination and Expression combined
56Representation
- Chromosome: $\{gene_i, link_i\}$, $\forall i \le l$
- Gene has locus, value, capacity (for improvement)
- Link has
- Linkage set: Set of related genes
- Weights: Number of particular linkage in population
- Goodness (0,1): How good the linkage is w.r.t. fitness contribution
- Trials: Number of times linkage has been tried
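The representation can be written down as plain data classes. This is only a sketch of the fields listed on the slide; the class names, field types, and default values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Gene:
    locus: int          # position in the chromosome
    value: int          # allele, e.g. 0 or 1
    capacity: int = 1   # capacity for improvement

@dataclass
class Link:
    linkage_set: Set[int] = field(default_factory=set)  # loci of related genes
    weight: int = 0         # how often this linkage occurs in the population
    goodness: float = 0.0   # in (0, 1): contribution of the linkage to fitness
    trials: int = 0         # number of times the linkage has been tried

@dataclass
class Chromosome:
    genes: List[Gene]
    links: List[Link]

def make_chromosome(bits):
    """Build a chromosome with one gene and one empty link per locus."""
    genes = [Gene(locus=i, value=b) for i, b in enumerate(bits)]
    links = [Link() for _ in bits]
    return Chromosome(genes=genes, links=links)

chromosome = make_chromosome([1, 0, 1])
```

Keeping one `Link` per locus mirrors the slide's pairing of each gene with its own link; `default_factory` gives each link its own set instead of a shared one.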
57Transcription
- Same as before except that capacity is set to 1 if fitness after perturbation increases, else 0
- All genes with capacity 0 are put in the initial linkage set
- Continued for l generations
58Recombination Expression
- Two phases
- Pre-recombination Expression
- GEMGA Recombination
- Pre-recombination determines related gene clusters
- GEMGA recombination ensures growth of proper classes and relations
59Pre-recombination
- Applied several times during first generation
- Pair of chromosomes selected
- Of those in initial linkage set (ILS), genes with same values and capacities are extracted
- If this set is present in ILS then weight of linkage is increased by one
- If not then this set is added to ILS
60Pre-recombination contd.
- Gives an $l \times l$ conditional probability matrix
- $M_{i,j}$ = prob(genes $i$ and $j$ together)
- Final Linkage Set constructed
- $\max_j(M_{i,j})$, calculated for each $i$
- All genes with $M_{i,j}$ within $\epsilon$ of the max are included in linkage set for $i$
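The final linkage-set construction can be sketched as a row-wise threshold on the matrix. The sketch assumes the matrix is given as a list of rows, that the diagonal entry is ignored, and that "within ε" means a difference of at most ε; `final_linkage_sets` is a hypothetical name and the matrix values are made up for illustration.

```python
def final_linkage_sets(M, eps):
    """For each gene i, keep every j whose M[i][j] is within eps of the row maximum."""
    l = len(M)
    result = []
    for i in range(l):
        others = [j for j in range(l) if j != i]     # skip the diagonal (assumption)
        row_max = max(M[i][j] for j in others)
        result.append({j for j in others if row_max - M[i][j] <= eps})
    return result

# Illustrative co-occurrence probabilities for four genes.
M = [
    [0.0, 0.9, 0.85, 0.0],
    [0.9, 0.0, 0.7, 0.2],
    [0.85, 0.7, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.0],
]
sets = final_linkage_sets(M, eps=0.1)
```

A larger ε admits more genes into each linkage set, so it trades recall of true linkages against the risk of grouping unrelated genes.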
61GEMGA recombination
- Element from linkage set of one chromosome chosen based on weight and goodness for swapping
- If goodness values of the disrupted linkage sets of the other chromosome are less than this one then SWAP and adjust linkage set
- Goodness is set by change in fitness due to recombination
62GEMGA recombination contd.
- Of the two original chromosomes and two recombined chromosomes, two are selected based on goodness and fitness
- Apply iteratively over all pairs
- No fitness evaluations if fitness and disrupted fitness are stored
63GEMGA analysis
- Transcription applied for l generations
- In Pre-recombination no fitness evaluations
- Population $O(\Lambda^k)$
- SC $O(\Lambda^k\, l)$
- Linear growth of fitness evaluations and
relational learning
64Milestone 3
- GEMGA designed with strong linkage learning and linear time
- More complex relation building
- Tested on deceptive multi-modal functions to validate conclusions
- Could it be tested for tougher relations?
65Musings
- LLGA solved linkages sequentially, GEMGA solves them in parallel
- Multi-Objective optimization
- Niche specific SEARCH?
- Linkage is a specific relation
- Test on problems with relational complexity beyond deception
- Similarity to natural gene expression?
- Memory Drawbacks
66Summary and Conclusion
- Need for relation based search
- GEMGA spans a class of methods that model relationships within individuals
- GEMGA shown to solve difficult problems efficiently
- Walsh analysis on GEMGA (Kargupta & Park, 1999)
- Used on G.P. (Neill & Ryan, 2000)