Title: Optimization of SVM Parameters for Promoter Recognition in DNA Sequences
1. Optimization of SVM Parameters for Promoter Recognition in DNA Sequences
- Robertas Damaševičius
- Software Engineering Department, Kaunas University of Technology
- Studentu 50-415, Kaunas, Lithuania
- Email: damarobe_at_soften.ktu.lt
2. Data: genetic (DNA) sequences
- Meaning: represent genetic information stored in the DNA molecule in symbolic form
- Syntax: 4-letter alphabet {A, C, G, T}
- Complexity: numerous layers of information
  - protein-coding genes
  - regulatory sequences
  - mRNA sequences responsible for protein structure
  - directions for DNA packaging and unwinding, etc.
- Motivation: over 95% is "junk" DNA (its biological function is not fully understood)
- Aim: identify structural parts of DNA
  - introns, exons, promoters, splice sites, etc.
3. What are promoters?
- Promoter: a regulatory region of DNA located upstream of a gene, providing a control point for gene transcription
- Function: by binding to the promoter, specific proteins (transcription factors) can either promote or repress the transcription of a gene
- Structure: promoters contain binding sites, or "boxes": short DNA subsequences which are (usually) conserved
4. Promoter recognition problem
- Multitude of promoter boxes (nucleotide patterns): TATA, Pribnow, Gilbert, DPE, E-box, Y-box, ...
- Boxes within a species are conserved, but there are many exceptions to this rule
- Exact pattern TACACC:
  CAATGCAGGATACACCGATCGGTA
- Pattern with mismatches, TACACC with 1 mismatch:
  CAATGCAGGATTCACCGATCGGTA
- Degenerate pattern TASDCC (S = {C, G}, D = {A, G, T}):
  CAATGCAGGATAGTCCGATCGGTA
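The three matching modes above can be sketched in a few lines of Python; the degeneracy table is restricted to the two IUPAC symbols used on this slide.

```python
import re

# IUPAC degeneracy codes used on the slide: S = {C, G}, D = {A, G, T}
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "S": "[CG]", "D": "[AGT]"}

def degenerate_to_regex(pattern):
    """Translate a degenerate nucleotide pattern into a regular expression."""
    return "".join(IUPAC[ch] for ch in pattern)

def find_with_mismatches(pattern, seq, max_mismatches):
    """Sliding-window search allowing up to max_mismatches substitutions."""
    hits = []
    k = len(pattern)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if sum(a != b for a, b in zip(pattern, window)) <= max_mismatches:
            hits.append((i, window))
    return hits

# Degenerate match via regex: TASDCC matches TAGTCC in the third sequence
m = re.search(degenerate_to_regex("TASDCC"), "CAATGCAGGATAGTCCGATCGGTA")
print(m.group(0))  # TAGTCC

# TACACC with one allowed mismatch finds TTCACC in the second sequence
print(find_with_mismatches("TACACC", "CAATGCAGGATTCACCGATCGGTA", 1))
```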
5. Support Vector Machine (SVM)
- x_i are training data vectors, x are unknown data vectors
- y_i ∈ {-1, +1} is the target space
- K(x_i, x) is the kernel function
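The notation above can be made concrete with a minimal numeric sketch of the SVM decision function f(x) = sign(Σ_i α_i y_i K(x_i, x) + b); the support vectors, multipliers, and bias below are toy values chosen by hand for illustration, not values learned from data.

```python
def linear_kernel(u, v):
    """K(x_i, x) = <x_i, x>, the simplest kernel choice."""
    return sum(a * b for a, b in zip(u, v))

def svm_decision(x, support_vectors, alphas, labels, b, kernel=linear_kernel):
    """f(x) = sign(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    s = sum(a * y * kernel(sv, x)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if s + b >= 0 else -1

# Toy, hand-picked support vectors (illustrative values only)
svs    = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [0.5, 0.5]
labels = [+1, -1]
b      = 0.0
print(svm_decision((2.0, 1.0), svs, alphas, labels, b))    # 1
print(svm_decision((-2.0, -1.0), svs, alphas, labels, b))  # -1
```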
6. Quality of classification
- Training data
  - size of dataset, generation of negative examples, imbalanced datasets
- Mapping of data into feature space
  - orthogonal, single nucleotide, nucleotide grouping, ...
- Selection of an optimal kernel function
  - linear, polynomial, RBF, sigmoid
- Kernel function parameters
- SVM learning parameters
  - regularization parameter, cost factor
- Selection of SVM parameter values is an optimization problem
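The first of the feature mappings listed above, the orthogonal (one-hot) mapping, can be sketched as follows; the particular nucleotide-to-axis assignment below is an arbitrary choice, since any permutation of the four axes is equally orthogonal.

```python
# Orthogonal (one-hot) mapping: each nucleotide -> a 4-dimensional unit vector
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
           "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def orthogonal_encode(seq):
    """Map a DNA string to a flat feature vector of length 4 * len(seq)."""
    vec = []
    for nt in seq:
        vec.extend(ONE_HOT[nt])
    return vec

features = orthogonal_encode("ACGT")
print(len(features))  # 16 = 4 nucleotides x 4 dimensions
```

Under this mapping each 300 bp sequence from the dataset slide becomes a 1200-dimensional input vector.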
7. SVM optimization strategies
- Kernel optimization
  - adding extra parameters
  - designing new kernels
- Parameter optimization
  - learning parameters only
  - kernel parameters only
  - learning and kernel parameters together
- Optimization decisions
  - optimization method
  - objective function
8. SVM (hyper)parameters
- Kernel parameters
- Learning parameters
9. SVM parameter optimization methods

Method | Advantages | Disadvantages
Random search | Simplicity. | Depends on the selection of random points and their distribution. Very slow as the size of the parameter space increases.
Grid search | Simplicity. A starting point is not required. | Box constraints for the grid are necessary. No optimality criteria for the solution. Computationally expensive for a large number of parameters. Solution depends on the coarseness of the grid.
Nelder-Mead | Few function evaluations. Good convergence and stability. | Can fail if the initial simplex is too small. No proof of convergence.
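The trade-offs in the table can be illustrated on a toy objective that stands in for the cross-validated SVM error; on these slides the real objective is an expensive SVM train-and-test run, which is what makes the number of function evaluations matter.

```python
import itertools
import random

def objective(log_c, log_gamma):
    """Stand-in for cross-validated SVM error (a toy quadratic bowl
    with its minimum at log_c = 1, log_gamma = -2)."""
    return (log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2

# Grid search: evaluation count grows as grid_size ** n_params
grid = list(itertools.product([-2, -1, 0, 1, 2], repeat=2))
best_grid = min(grid, key=lambda p: objective(*p))

# Random search: fixed evaluation budget regardless of dimensionality,
# but the result depends on the sampled points
random.seed(0)
samples = [(random.uniform(-3, 3), random.uniform(-3, 3)) for _ in range(25)]
best_rand = min(samples, key=lambda p: objective(*p))

print(best_grid)  # (1, -2) -- the grid point at the optimum
```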
10. Dataset
- Drosophila sequence datasets
  - Promoter dataset: 1842 sequences, each 300 bp long, spanning -250 bp to +50 bp relative to the gene transcription start site
  - Intron dataset: 1799 sequences, each 300 bp long
  - Coding sequence (CDS) dataset: 2859 sequences, each 300 bp long
- Datasets for the SVM classifier
  - Training file: 1260 examples (372 promoters, 361 introns, 527 CDS)
  - Test file: 6500 examples (1842 promoters, 1799 introns, 2859 CDS)
- Datasets are unbalanced
  - 29.5% promoters vs. 70.5% non-promoters in the training dataset
  - 28.3% promoters vs. 71.7% non-promoters in the test dataset
11. Classification requisites
- Feature mapping: orthogonal
- Kernel function: power series kernel
- Metrics:
  - Specificity (SPC) = TN / (TN + FP)
  - Sensitivity (TPR) = TP / (TP + FN)
- SVM classifier: SVMlight
- SVM parameter optimization method: modified Nelder-Mead (downhill simplex)
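The two metrics follow the standard confusion-matrix definitions. In the sketch below, only the promoter total of 1842 comes from the dataset slide; the other counts are invented for illustration and are not the paper's actual confusion matrix.

```python
def specificity(tn, fp):
    """SPC = TN / (TN + FP): fraction of non-promoters correctly rejected."""
    return tn / (tn + fp)

def sensitivity(tp, fn):
    """TPR = TP / (TP + FN): fraction of true promoters recovered."""
    return tp / (tp + fn)

# Illustrative counts; tp + fn = 1842 matches the test-set promoter total
print(round(specificity(tn=4200, fp=458), 3))   # 0.902
print(round(sensitivity(tp=1650, fn=192), 3))   # 0.896
```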
12. Modification of Nelder-Mead
- Optimization time problem
  - a call to the SVM training and testing function is very time-costly for large datasets
  - many evaluations of the objective function are required
- Modifications
  - function value caching
  - normalization after the reflection step
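Function value caching, the first modification above, can be sketched as a memoizing wrapper around the objective: Nelder-Mead revisits simplex vertices after reflection and shrink steps, so repeated points are served from the cache instead of re-running SVM training. Here a cheap quadratic stands in for the expensive SVM train-and-test call.

```python
import functools

def cached_objective(fn):
    """Memoize objective evaluations keyed on the (rounded) parameter point."""
    cache = {}
    calls = {"real": 0}

    @functools.wraps(fn)
    def wrapper(point):
        key = tuple(round(x, 10) for x in point)  # tolerate float noise
        if key not in cache:
            calls["real"] += 1          # an actual expensive evaluation
            cache[key] = fn(point)
        return cache[key]

    wrapper.real_calls = calls
    return wrapper

@cached_objective
def expensive(point):  # stand-in for one SVM train + evaluate cycle
    x, y = point
    return (x - 1) ** 2 + y ** 2

for p in [(0.0, 0.0), (1.0, 1.0), (0.0, 0.0)]:  # one vertex repeats
    expensive(p)
print(expensive.real_calls["real"])  # 2 -- the repeat came from the cache
```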
13. Classification results

Kernel | No. of optimized parameters | Type of optimized parameters | Specificity (SPC), % | Sensitivity (TPR), %
Linear | - | none | 84.83 | 58.25
Linear | 3 | learning | 91.23 | 81.38
Polynomial | - | none | 81.81 | 44.90
Polynomial | 6 | learning + kernel | 87.64 | 67.48
Power series (2) | 3 | kernel | 94.85 | 89.69
Power series (3) | 4 | kernel | 94.92 | 89.95
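The slides do not give the closed form of the power series kernel; a plausible form, assumed here only for illustration, is a truncated series in the inner product, K(u, v) = Σ_k a_k <u, v>^k, whose coefficients a_k would be the extra tunable kernel parameters the table counts.

```python
def power_series_kernel(u, v, coeffs):
    """Assumed form: K(u, v) = sum_k a_k * <u, v>**k, k = 1..len(coeffs).
    The coefficients a_k are the tunable kernel parameters."""
    dot = sum(a * b for a, b in zip(u, v))
    return sum(a * dot ** (k + 1) for k, a in enumerate(coeffs))

# "Power series (2)": two series terms -> two kernel coefficients to optimize
print(power_series_kernel((1.0, 2.0), (0.5, 0.5), coeffs=(1.0, 0.1)))
```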
14. ROC plot
[ROC curves of the compared classifiers; figure not preserved in this text export]
15. Conclusions
- The SVM classifier alone cannot achieve satisfactory classification results on a complex unbalanced dataset
- SVM parameter optimization can improve classification results significantly
- The best results are achieved when SVM parameter optimization is combined with kernel function modification
- The power series kernel is particularly suitable for optimization because of its larger number of kernel parameters
16. Ongoing work and future research
- Application of SVM parameter optimization to the splice site recognition problem (presented at CISIS 2008)
- Selection of rules for optimal DNA sequence mapping to the feature space (accepted to WCSB 2008)
- Analysis of the relationships between data characteristics and classifier behavior (accepted to IS 2008)
- Automatic derivation of formal grammar rules (accepted to KES 2008)
- Structural analysis of sequences using SVM with grammar inference (accepted to ITA 2008)
17. Thank you. Any questions?