Title: Optimization of SVM Parameters for Promoter Recognition in DNA Sequences
1. Optimization of SVM Parameters for Promoter Recognition in DNA Sequences
- Robertas Damaševičius
- Software Engineering Department, Kaunas University of Technology
- Studentu 50-415, Kaunas, Lithuania
- Email: damarobe_at_soften.ktu.lt
2. Data: genetic (DNA) sequences
- Meaning: represent genetic information stored in the DNA molecule in symbolic form
- Syntax: 4-letter alphabet {A, C, G, T}
- Complexity: numerous layers of information
  - protein-coding genes
  - regulatory sequences
  - mRNA sequences responsible for protein structure
  - directions for DNA packaging and unwinding, etc.
- Motivation: over 95% is "junk" DNA (its biological function is not fully understood)
- Aim: identify structural parts of DNA
  - introns, exons, promoters, splice sites, etc.
3. What are promoters?
- Promoter: a regulatory region of DNA located upstream of a gene, providing a control point for gene transcription
- Function: by binding to the promoter, specific proteins (transcription factors) can either promote or repress the transcription of a gene
- Structure: promoters contain binding sites, or "boxes": short DNA subsequences which are (usually) conserved
4. Promoter recognition problem
- Multitude of promoter boxes (nucleotide patterns): TATA, Pribnow, Gilbert, DPE, E-box, Y-box, ...
- Boxes within a species are conserved, but there are many exceptions to this rule
- Exact pattern TACACC:
  CAATGCAGGATACACCGATCGGTA
- Pattern with mismatches, TACACC with 1 mismatch:
  CAATGCAGGATTCACCGATCGGTA
- Degenerate pattern TASDCC (S = {C, G}, D = {A, G, T}):
  CAATGCAGGATAGTCCGATCGGTA
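The three matching modes above can be sketched in a few lines of Python; the degeneracy table is restricted to the two IUPAC symbols used on this slide.

```python
import re

# IUPAC degeneracy codes used on the slide: S = {C, G}, D = {A, G, T}
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "S": "[CG]", "D": "[AGT]"}

def degenerate_to_regex(pattern):
    """Translate a degenerate nucleotide pattern into a regular expression."""
    return "".join(IUPAC[ch] for ch in pattern)

def find_with_mismatches(pattern, seq, max_mismatches):
    """Sliding-window search allowing up to max_mismatches substitutions."""
    hits = []
    k = len(pattern)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if sum(a != b for a, b in zip(pattern, window)) <= max_mismatches:
            hits.append((i, window))
    return hits

# Degenerate match via regex: TASDCC matches TAGTCC in the third sequence
m = re.search(degenerate_to_regex("TASDCC"), "CAATGCAGGATAGTCCGATCGGTA")
print(m.group(0))  # TAGTCC

# TACACC with one allowed mismatch finds TTCACC in the second sequence
print(find_with_mismatches("TACACC", "CAATGCAGGATTCACCGATCGGTA", 1))
```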
5. Support Vector Machine (SVM)
- x_i are training data vectors, x are unknown data vectors
- y_i ∈ {-1, +1} is the target space
- K(x_i, x) is the kernel function
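The notation above can be made concrete with a minimal numeric sketch of the SVM decision function f(x) = sign(Σ_i α_i y_i K(x_i, x) + b); the support vectors, multipliers, and bias below are toy values chosen by hand for illustration, not values learned from data.

```python
def linear_kernel(u, v):
    """K(x_i, x) = <x_i, x>, the simplest kernel choice."""
    return sum(a * b for a, b in zip(u, v))

def svm_decision(x, support_vectors, alphas, labels, b, kernel=linear_kernel):
    """f(x) = sign(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    s = sum(a * y * kernel(sv, x)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if s + b >= 0 else -1

# Toy, hand-picked support vectors (illustrative values only)
svs    = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [0.5, 0.5]
labels = [+1, -1]
b      = 0.0
print(svm_decision((2.0, 1.0), svs, alphas, labels, b))    # 1
print(svm_decision((-2.0, -1.0), svs, alphas, labels, b))  # -1
```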
6. Quality of classification
- Training data
  - size of dataset, generation of negative examples, imbalanced datasets
- Mapping of data into feature space
  - orthogonal, single nucleotide, nucleotide grouping, ...
- Selection of an optimal kernel function
  - linear, polynomial, RBF, sigmoid
- Kernel function parameters
- SVM learning parameters
  - regularization parameter, cost factor
- Selection of SVM parameter values is an optimization problem
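The first of the feature mappings listed above, the orthogonal (one-hot) mapping, can be sketched as follows; the particular nucleotide-to-axis assignment below is an arbitrary choice, since any permutation of the four axes is equally orthogonal.

```python
# Orthogonal (one-hot) mapping: each nucleotide -> a 4-dimensional unit vector
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
           "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def orthogonal_encode(seq):
    """Map a DNA string to a flat feature vector of length 4 * len(seq)."""
    vec = []
    for nt in seq:
        vec.extend(ONE_HOT[nt])
    return vec

features = orthogonal_encode("ACGT")
print(len(features))  # 16 = 4 nucleotides x 4 dimensions
```

Under this mapping each 300 bp sequence from the dataset slide becomes a 1200-dimensional input vector.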
7. SVM optimization strategies
- Kernel optimization
  - adding extra parameters
  - designing new kernels
- Parameter optimization
  - learning parameters only
  - kernel parameters only
  - learning and kernel parameters together
- Optimization decisions
  - optimization method
  - objective function
8. SVM (hyper)parameters
- Kernel parameters
- Learning parameters
9. SVM parameter optimization methods

Method | Advantages | Disadvantages
Random search | Simplicity. | Depends on the selection of random points and their distribution. Very slow as the size of the parameter space increases.
Grid search | Simplicity. A starting point is not required. | Box constraints for the grid are necessary. No optimality criteria for the solution. Computationally expensive for a large number of parameters. Solution depends on the coarseness of the grid.
Nelder-Mead | Few function evaluations. Good convergence and stability. | Can fail if the initial simplex is too small. No proof of convergence.
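The trade-offs in the table can be illustrated on a toy objective that stands in for the cross-validated SVM error; on these slides the real objective is an expensive SVM train-and-test run, which is what makes the number of function evaluations matter.

```python
import itertools
import random

def objective(log_c, log_gamma):
    """Stand-in for cross-validated SVM error (a toy quadratic bowl
    with its minimum at log_c = 1, log_gamma = -2)."""
    return (log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2

# Grid search: evaluation count grows as grid_size ** n_params
grid = list(itertools.product([-2, -1, 0, 1, 2], repeat=2))
best_grid = min(grid, key=lambda p: objective(*p))

# Random search: fixed evaluation budget regardless of dimensionality,
# but the result depends on the sampled points
random.seed(0)
samples = [(random.uniform(-3, 3), random.uniform(-3, 3)) for _ in range(25)]
best_rand = min(samples, key=lambda p: objective(*p))

print(best_grid)  # (1, -2) -- the grid point at the optimum
```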
10. Dataset
- Drosophila sequence datasets
  - Promoter dataset: 1842 sequences, each 300 bp long, spanning -250 bp to +50 bp relative to the gene transcription start site
  - Intron dataset: 1799 sequences, each 300 bp long
  - Coding sequence (CDS) dataset: 2859 sequences, each 300 bp long
- Datasets for the SVM classifier
  - Training file: 1260 examples (372 promoters, 361 introns, 527 CDS)
  - Test file: 6500 examples (1842 promoters, 1799 introns, 2859 CDS)
- Datasets are unbalanced
  - 29.5% promoters vs. 70.5% non-promoters in the training dataset
  - 28.3% promoters vs. 71.7% non-promoters in the test dataset
11. Classification requisites
- Feature mapping: orthogonal
- Kernel function: power series kernel
- Metrics:
  - Specificity (SPC) = TN / (TN + FP)
  - Sensitivity (TPR) = TP / (TP + FN)
- SVM classifier: SVMlight
- SVM parameter optimization method: modified Nelder-Mead (downhill simplex)
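The two metrics follow the standard confusion-matrix definitions. In the sketch below, only the promoter total of 1842 comes from the dataset slide; the other counts are invented for illustration and are not the paper's actual confusion matrix.

```python
def specificity(tn, fp):
    """SPC = TN / (TN + FP): fraction of non-promoters correctly rejected."""
    return tn / (tn + fp)

def sensitivity(tp, fn):
    """TPR = TP / (TP + FN): fraction of true promoters recovered."""
    return tp / (tp + fn)

# Illustrative counts; tp + fn = 1842 matches the test-set promoter total
print(round(specificity(tn=4200, fp=458), 3))   # 0.902
print(round(sensitivity(tp=1650, fn=192), 3))   # 0.896
```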
12. Modification of Nelder-Mead
- Optimization time problem
  - a call to the SVM training and testing function is very time-costly for large datasets
  - many evaluations of the objective function are required
- Modifications
  - function value caching
  - normalization after the reflection step
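Function value caching, the first modification above, can be sketched as a memoizing wrapper around the objective: Nelder-Mead revisits simplex vertices after reflection and shrink steps, so repeated points are served from the cache instead of re-running SVM training. Here a cheap quadratic stands in for the expensive SVM train-and-test call.

```python
import functools

def cached_objective(fn):
    """Memoize objective evaluations keyed on the (rounded) parameter point."""
    cache = {}
    calls = {"real": 0}

    @functools.wraps(fn)
    def wrapper(point):
        key = tuple(round(x, 10) for x in point)  # tolerate float noise
        if key not in cache:
            calls["real"] += 1          # an actual expensive evaluation
            cache[key] = fn(point)
        return cache[key]

    wrapper.real_calls = calls
    return wrapper

@cached_objective
def expensive(point):  # stand-in for one SVM train + evaluate cycle
    x, y = point
    return (x - 1) ** 2 + y ** 2

for p in [(0.0, 0.0), (1.0, 1.0), (0.0, 0.0)]:  # one vertex repeats
    expensive(p)
print(expensive.real_calls["real"])  # 2 -- the repeat came from the cache
```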
13. Classification results

Kernel | No. of optimized parameters | Type of optimized parameters | Specificity (SPC), % | Sensitivity (TPR), %
Linear | - | none | 84.83 | 58.25
Linear | 3 | learning | 91.23 | 81.38
Polynomial | - | none | 81.81 | 44.90
Polynomial | 6 | learning + kernel | 87.64 | 67.48
Power series (2) | 3 | kernel | 94.85 | 89.69
Power series (3) | 4 | kernel | 94.92 | 89.95
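The slides do not give the closed form of the power series kernel; a plausible form, assumed here only for illustration, is a truncated series in the inner product, K(u, v) = Σ_k a_k <u, v>^k, whose coefficients a_k would be the extra tunable kernel parameters the table counts.

```python
def power_series_kernel(u, v, coeffs):
    """Assumed form: K(u, v) = sum_k a_k * <u, v>**k, k = 1..len(coeffs).
    The coefficients a_k are the tunable kernel parameters."""
    dot = sum(a * b for a, b in zip(u, v))
    return sum(a * dot ** (k + 1) for k, a in enumerate(coeffs))

# "Power series (2)": two series terms -> two kernel coefficients to optimize
print(power_series_kernel((1.0, 2.0), (0.5, 0.5), coeffs=(1.0, 0.1)))
```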
14. ROC plot
[ROC curves of the compared classifiers; figure not preserved in this text export]
15. Conclusions
- The SVM classifier alone cannot achieve satisfactory classification results on a complex unbalanced dataset
- SVM parameter optimization can improve classification results significantly
- The best results are achieved when SVM parameter optimization is combined with kernel function modification
- The power series kernel is particularly suitable for optimization because of its larger number of kernel parameters
16. Ongoing work and future research
- Application of SVM parameter optimization to the splice site recognition problem (presented at CISIS 2008)
- Selection of rules for optimal DNA sequence mapping to the feature space (accepted to WCSB 2008)
- Analysis of the relationships between data characteristics and classifier behavior (accepted to IS 2008)
- Automatic derivation of formal grammar rules (accepted to KES 2008)
- Structural analysis of sequences using SVM with grammar inference (accepted to ITA 2008)
17. Thank you. Any questions?