Title: SIMDDS
2Major Application: Finding Homologies
(C) Mark Gerstein, Yale University
bioinfo.mbb.yale.edu/mbb452a
3AutoSimS
- Local two-sequence alignment is the basis of sequence analysis, and perhaps the most widely used tool in computational molecular biology [1]
- The parameters of most popular local sequence alignment tools, including BLAST and FASTA, are set in one of two ways:
  - Default: set for the average case, which may not be appropriate for the sequences being examined
  - Custom: manual settings may be difficult to choose and usually require fine-tuning through several manual trials
- AutoSimS (Automated Sequence Similarity Search) contains three modules:
  - A modified version of SIM/DDS (Similarity / DNA-DNA Sequence) [2, 3] for finding similar regions
  - Adaptive simulated annealing (ASA) [4] for optimizing parameters for SIM/DDS
  - An AI decision-making system (not implemented) for guiding the adaptive simulated annealing
4(SIM/DDS)
Similarity / DNA-DNA Sequence
- Integrates features from Smith-Waterman, BLAST, FASTA and Haste (Hash-Accelerated Search) [5]
- Rated as one of the fastest and least space-consuming (linear space complexity) tools for universal sequence alignment [6]
- Provides tradeoffs between sensitivity and speed using over a dozen parameters
- Our modified SIM/DDS introduces more cutoffs
- Increases flexibility of control
- Sequence filtering
- Word masking
- Reduces the impact of short and exact matches
- Allows adjusting sensitivity for weak similarity
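
As a rough illustration of how scoring parameters shift what a local aligner reports, the sketch below computes a Smith-Waterman local alignment score in linear space. It is not the SIM/DDS algorithm; the function name and the `match`, `mismatch` and `gap` parameters are assumptions chosen for the example.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty.

    Only two rows of the dynamic-programming table are kept, so memory
    grows linearly with len(b).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment here
                prev[j - 1] + sub,    # align a[i-1] with b[j-1]
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# The same pair of sequences scores differently under different gap penalties.
print(smith_waterman_score("ACGTT", "ACTT", gap=-1))  # 3
print(smith_waterman_score("ACGTT", "ACTT", gap=-2))  # 2
```

A cheaper gap lets the aligner bridge the extra G, while a costlier gap makes the gapped and ungapped alignments score the same. With over a dozen parameters, interactions of this kind are what make manual tuning hard.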
5(ASA)
Adaptive Simulated Annealing
- Uses global and statistical optimization techniques that are able to handle complex, non-linear search spaces
- Several improvements over the original simulated annealing technique:
  - Computational complexity: exponential temperature schedule for annealing
  - Completeness: decreases the chance of missing optima
  - Generality: more options to better fit the problem to be solved
- Most attractive feature: individual consideration given to the parameter range, annealing-time-dependent sensitivities, and the probability density distribution of each parameter
- Provides up to 100 options
- Facilitates incorporation into the AutoSimS model
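
The per-parameter treatment can be sketched as follows, assuming the temperature schedule and generating distribution described in the ASA literature [4]: each parameter has its own range and its own temperature T(k) = T0 exp(-c k^(1/D)), where D is the number of parameters. The function names and the constants in the usage lines are illustrative assumptions, not values from AutoSimS.

```python
import math
import random


def asa_temperature(t0, c, k, dim):
    """Exponential schedule for one parameter: T(k) = T0 * exp(-c * k**(1/D))."""
    return t0 * math.exp(-c * k ** (1.0 / dim))


def asa_step(x, low, high, temp, rng):
    """Propose a new value for one parameter, staying inside [low, high].

    The step y lies in [-1, 1] and is scaled by the parameter's own range,
    so lower temperatures concentrate proposals near the current value.
    Requires temp > 0.
    """
    while True:
        u = rng.random()
        sign = 1.0 if u >= 0.5 else -1.0
        y = sign * temp * ((1.0 + 1.0 / temp) ** abs(2.0 * u - 1.0) - 1.0)
        candidate = x + y * (high - low)
        if low <= candidate <= high:
            return candidate


# Hypothetical usage: one parameter with range [0, 10], tuned among 3 parameters.
rng = random.Random(0)
temp = asa_temperature(t0=1.0, c=1.0, k=8, dim=3)
proposal = asa_step(5.0, 0.0, 10.0, temp, rng)
```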
6AutoSimS Model
[Diagram: AutoSimS model. Box and arrow labels, in the order they appear: User Preferences; AI Decision-Making Module (not implemented); Sequence Data; Data Selection; Knowledge Base; Modified SIM/DDS; Parameters; Parameter Search; Set of possible parameters with exponential probability; Parameter Evaluation; Exponential Annealing; Value of objective function; ASA; Preferred similarity regions]
7Summary of Model
- ASA works as a wrapper program to select parameters for SIM/DDS
- With properly specified search spaces, objective function and successor heuristics determined by the AI decision-making system, ASA is used to find the optimal parameter setting of the modified SIM/DDS program. This leads to finding better similar regions
- Even though the above-mentioned information currently has to be given to ASA manually, we find it easier to do so and let ASA tune the parameters for SIM/DDS than to tune SIM/DDS's parameters manually
- Adding the AI decision-making module will make AutoSimS nearly autonomous by automatically providing most of the information ASA needs
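
One way to picture the wrapper arrangement is as a small bundle of the information the optimizer needs: a search space, a way to run the aligner, and an objective over its output. Everything named below (`TuningProblem`, `toy_aligner`, `score_cutoff`, the hard-coded scores) is a hypothetical stand-in for illustration; the real SIM/DDS interface is not shown in these slides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]


@dataclass
class TuningProblem:
    """What the optimizer must be told before it can tune the aligner."""
    search_space: Dict[str, Tuple[float, float]]   # parameter -> (low, high)
    run_aligner: Callable[[Params], List[float]]   # returns region scores
    objective: Callable[[List[float]], float]      # value to maximize

    def evaluate(self, params: Params) -> float:
        # Reject settings outside the declared search space.
        for name, (low, high) in self.search_space.items():
            if not low <= params[name] <= high:
                raise ValueError(f"{name}={params[name]} outside [{low}, {high}]")
        return self.objective(self.run_aligner(params))


def toy_aligner(params: Params) -> List[float]:
    """Hypothetical stand-in for an alignment run; NOT the real SIM/DDS."""
    region_scores = [12.0, 30.0, 45.0, 8.0]
    return [s for s in region_scores if s >= params["score_cutoff"]]


def mean_score(scores: List[float]) -> float:
    return sum(scores) / len(scores) if scores else 0.0


problem = TuningProblem({"score_cutoff": (0.0, 50.0)}, toy_aligner, mean_score)
print(problem.evaluate({"score_cutoff": 10.0}))  # 29.0
```

The optimizer only ever calls `evaluate`, so the aligner and the objective can be swapped without touching the search code.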
8Results
- AHSC (Average of High-Scoring Chain Scores) may be used as an ASA objective function to find parameters yielding highly similar regions
- We find that close-to-optimal parameter settings are difficult to find manually, and that there are many different parameter settings that yield close-to-optimal search results
- An automatic search for parameters may be effective
- Adaptive simulated annealing may be a preferred search technique
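
The slides do not say how a chain qualifies as "high-scoring". The sketch below assumes a simple score cutoff, which is only one possible reading of AHSC; the function name and the `cutoff` parameter are hypothetical.

```python
def ahsc(chain_scores, cutoff):
    """Average of the chain scores at or above `cutoff` (assumed definition).

    Returns 0.0 when no chain qualifies, so a parameter setting that finds
    nothing is ranked below any setting that finds a high-scoring chain.
    """
    high = [s for s in chain_scores if s >= cutoff]
    return sum(high) / len(high) if high else 0.0


print(ahsc([10.0, 40.0, 60.0], cutoff=30.0))  # 50.0
```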
Three runs of our modified SIM/DDS program, using parameters selected by adaptive simulated annealing for a pair of DNA sequences of 100 and 200 letters, yield similar results but with different parameter settings.
ASA settings:
- Annealing schedule: T = 20 exp(-0.005t) if t < 100, and 0 otherwise
- Acceptance function: exp(ΔE / T)
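
A minimal annealing loop using these settings might look like the sketch below. It takes the temperature as 20 exp(-0.005t) for t < 100 and 0 afterwards, takes ΔE as the change in objective value (larger is better), and accepts a worse candidate with probability exp(ΔE / T). The toy objective and successor in the usage lines are assumptions, and remembering the best state seen is a convenience added here, not something the slides specify.

```python
import math
import random


def schedule(t):
    """Temperature at step t: 20 * exp(-0.005 * t) for t < 100, else 0."""
    return 20.0 * math.exp(-0.005 * t) if t < 100 else 0.0


def anneal(objective, start, successor, rng):
    """Maximize `objective` by simulated annealing; returns (best, best_value)."""
    current, current_value = start, objective(start)
    best, best_value = current, current_value
    t = 0
    while True:
        temp = schedule(t)
        if temp == 0.0:
            return best, best_value
        candidate = successor(current, rng)
        value = objective(candidate)
        delta = value - current_value
        # Always accept improvements; accept worse moves with prob exp(delta / T).
        if delta > 0 or rng.random() < math.exp(delta / temp):
            current, current_value = candidate, value
            if value > best_value:
                best, best_value = candidate, value
        t += 1


# Toy usage: tune a single integer setting whose ideal value is 3.
rng = random.Random(0)
best, best_value = anneal(
    objective=lambda x: -(x - 3) ** 2,
    start=0,
    successor=lambda x, r: x + r.choice((-1, 1)),
    rng=rng,
)
```

With this schedule the temperature only falls from 20 to about 12 before the cutoff at t = 100, so the search stays exploratory throughout. That may be one reason separate runs end on different parameter settings of similar quality.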
9Future Work
- Implement the AI decision-making system, including the decision analysis and knowledge base system
- Experiment on a large number of different types of molecular biological sequences to determine the proper annealing temperature schedules and successor heuristics and/or their parameters
- Parallelize AutoSimS
- Incorporate core ideas of more efficient very-large-scale sequence comparison techniques, such as LSH (Locality-Sensitive Hashing) [7]
- Generate statistical estimates for the local alignment score distributions [1], which will be used in AutoSimS's decision-making system
- Explore different ASA objective functions, which may improve results
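
The core idea behind LSH for sequences can be sketched as follows, under the assumption that it works by sampling a few positions of each fixed-length window: windows that agree on the sampled positions fall into the same bucket, even if they differ elsewhere. This illustrates the hashing step only, not the method of [7] in full; the function name and arguments are hypothetical.

```python
import random
from collections import defaultdict


def lsh_candidate_pairs(seq_a, seq_b, d, k, rng):
    """Return (i, j) pairs whose length-d windows agree on k sampled positions."""
    positions = sorted(rng.sample(range(d), k))
    buckets = defaultdict(list)
    # Hash every window of seq_a by its characters at the sampled positions.
    for i in range(len(seq_a) - d + 1):
        key = "".join(seq_a[i + p] for p in positions)
        buckets[key].append(i)
    # Look up every window of seq_b in the same hash table.
    pairs = []
    for j in range(len(seq_b) - d + 1):
        key = "".join(seq_b[j + p] for p in positions)
        for i in buckets.get(key, ()):
            pairs.append((i, j))
    return pairs


pairs = lsh_candidate_pairs("ACGTACGTAC", "TTACGTACGG", d=6, k=3, rng=random.Random(0))
```

Identical windows always collide, while windows with a few mismatches collide only when the mismatches miss the sampled positions. Repeating the procedure with fresh samples raises the chance of catching those near-matches.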
10Conclusion
- ASA's ability to fit complex functions, i.e. nonlinear search spaces and multiple variables, allows it to find a suitable set of parameters for SIM/DDS
- Incorporating the AI decision-making system into our ASA-SIM/DDS program should enhance our ability to achieve almost autonomous two-sequence similarity analysis with high-volume throughput and acceptable performance
- Our use of simulated annealing to find a suitable set of parameters can be adapted to other bioinformatics analysis programs, such as alignment and clustering
11References
[1] Altschul, S. F., Bundschuh, R., Olsen, R. and Hwa, T., The Estimation of Statistical Parameters for Local Alignment Score Distributions. Nucleic Acids Research, Vol. 29, No. 2, 351-361, 2001
[2] Jiang, T., Xu, Y. and Zhang, M. Q., Current Topics in Computational Molecular Biology. MIT Press, 2002
[3] Huang, X. and Miller, W., A Time-Efficient, Linear-Space Local Similarity Algorithm. Advances in Applied Mathematics 12, 337-357, 1991
[4] Ingber, L., Simulated Annealing: Practice versus Theory. Mathl. Comput. Modelling, Vol. 18, No. 11, 29-57, 1993
[5] Borkowski, J. A., Smith, C. P. and Huang, X., PFPA Flexible Integrated Filtering and Masking Tool, Paracel Inc., Pasadena, CA
[6] Tech Topics, Michigan Technological University, Nov. 3, 1995, Vol. XXVIII, No. 9
[7] Buhler, J., Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing. Bioinformatics 17(5), 419-428, 2001