Optimization of SVM Parameters for Promoter Recognition in DNA Sequences - PowerPoint PPT Presentation

1
Optimization of SVM Parameters for Promoter
Recognition in DNA Sequences
  • Robertas Damaševičius
  • Software Engineering Department,
  • Kaunas University of Technology
  • Studentu 50-415, Kaunas, Lithuania
  • Email: damarobe@soften.ktu.lt

2
Data: genetic (DNA) sequences
  • Meaning: represent the genetic information stored
    in the DNA molecule in symbolic form
  • Syntax: 4-letter alphabet {A, C, G, T}
  • Complexity: numerous layers of information
  • protein-coding genes
  • regulatory sequences
  • mRNA sequences responsible for protein structure
  • directions for DNA packaging and unwinding, etc.
  • Motivation: over 95% is "junk" DNA (its biological
    function is not fully understood)
  • Aim: identify structural parts of DNA
  • introns, exons, promoters, splice sites, etc.

3
What are promoters?
  • Promoter: a regulatory region of DNA located
    upstream of a gene, providing a control point for
    gene transcription
  • Function: by binding to the promoter, specific
    proteins (transcription factors) can either
    promote or repress the transcription of a gene
  • Structure: promoters contain binding sites, or
    "boxes": short DNA subsequences which are
    (usually) conserved

4
Promoter recognition problem
  • Multitude of promoter boxes (nucleotide
    patterns)
  • TATA, Pribnow, Gilbert, DPE, E-box, Y-box, etc.
  • Boxes within a species are conserved, but there
    are many exceptions to this rule
  • Exact pattern: TACACC
  • CAATGCAGGATACACCGATCGGTA
  • Pattern with mismatches: TACACC with 1 mismatch
  • CAATGCAGGATTCACCGATCGGTA
  • Degenerate pattern: TASDCC (S = {C, G}, D = {A, G, T})
  • CAATGCAGGATAGTCCGATCGGTA
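The three matching modes above can be sketched in Python. This is an illustration, not code from the presentation; the IUPAC table is truncated to the codes used on the slide:

```python
import re

# Degeneracy codes from the slide (a fuller IUPAC table exists)
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "S": "[CG]", "D": "[AGT]", "N": "[ACGT]"}

def degenerate_to_regex(pattern):
    """Translate a degenerate nucleotide pattern into a regular expression."""
    return "".join(IUPAC[base] for base in pattern)

def find_with_mismatches(seq, pattern, max_mismatches):
    """Start positions where `pattern` matches `seq` with at most
    `max_mismatches` substitutions (fixed length, no indels)."""
    hits = []
    for i in range(len(seq) - len(pattern) + 1):
        window = seq[i:i + len(pattern)]
        if sum(a != b for a, b in zip(window, pattern)) <= max_mismatches:
            hits.append(i)
    return hits

seq = "CAATGCAGGATAGTCCGATCGGTA"
print(re.search(degenerate_to_regex("TASDCC"), seq).start())        # degenerate hit
print(find_with_mismatches("CAATGCAGGATTCACCGATCGGTA", "TACACC", 1))  # 1-mismatch hit
```

Both calls locate the box starting at position 10 of the slide's example sequences.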

5
Support Vector Machine (SVM)
  • Decision function: f(x) = sign( sum_i a_i y_i K(x_i, x) + b )
  • x_i are training data vectors, x are unknown
    data vectors
  • y_i belongs to {-1, +1}, which is the target
    space
  • K(x_i, x) is the kernel function.
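The decision function above can be verified numerically. A minimal sketch with scikit-learn's SVC and a linear kernel (an assumption: the presentation used SVMlight, and the toy data here is synthetic); `dual_coef_` stores the products a_i * y_i for the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-dimensional data with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Reproduce sum_i a_i y_i K(x_i, x) + b by hand for a new point
x_new = np.array([[1.0, 1.0]])
K = clf.support_vectors_ @ x_new.T            # linear kernel K(x_i, x)
manual = clf.dual_coef_ @ K + clf.intercept_  # dual_coef_ holds a_i * y_i

print(manual.ravel(), clf.decision_function(x_new))  # identical values
```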

6
Quality of classification
  • Training data
  • size of dataset, generation of negative examples,
    imbalanced datasets
  • Mapping of data into feature space
  • Orthogonal, single nucleotide, nucleotide
    grouping, ...
  • Selection of an optimal kernel function
  • linear, polynomial, RBF, sigmoid
  • Kernel function parameters
  • SVM learning parameters
  • Regularization parameter, Cost factor
  • Selection of SVM parameter values is itself an
    optimization problem
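The orthogonal feature mapping mentioned above can be sketched as one-hot encoding: each nucleotide becomes a 4-dimensional indicator vector, so a 300 bp sequence maps to a 1200-dimensional feature vector. A minimal illustration (not the presentation's implementation):

```python
import numpy as np

NUC = {"A": 0, "C": 1, "G": 2, "T": 3}

def orthogonal_encode(seq):
    """Map a DNA string to a flat one-hot feature vector of length 4*len(seq)."""
    features = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        features[i, NUC[base]] = 1.0   # exactly one indicator set per position
    return features.ravel()

v = orthogonal_encode("ACGT")
print(v.reshape(4, 4))   # one indicator row per nucleotide
```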

7
SVM optimization strategies
  • Kernel optimization
  • introducing additional kernel parameters
  • designing new kernels
  • Parameter optimization
  • learning parameters only
  • kernel parameters only
  • learning and kernel parameters jointly
  • Optimization decisions
  • Optimization method
  • Objective function

8
SVM (hyper)parameters
  • Kernel parameters
  • Learning parameters

9
SVM parameter optimization methods
Method        | Advantages | Disadvantages
Random search | Simplicity. | Depends on the selection of random points and their distribution. Very slow as the size of the parameter space increases.
Grid search   | Simplicity. A starting point is not required. | Box constraints for the grid are necessary. No optimality criterion for the solution. Computationally expensive for a large number of parameters. Solution depends on the coarseness of the grid.
Nelder-Mead   | Few function evaluations. Good convergence and stability. | Can fail if the initial simplex is too small. No proof of convergence.
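A Nelder-Mead hyperparameter search can be sketched as follows, here over log10(C) and log10(gamma) of an RBF-kernel SVM. This is not the authors' setup: it uses scipy.optimize and scikit-learn instead of SVMlight, synthetic data, and cross-validated accuracy as the objective:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

def objective(log_params):
    """Negative cross-validated accuracy for a given (log C, log gamma)."""
    C, gamma = 10.0 ** log_params          # search in log space
    score = cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()
    return -score                          # Nelder-Mead minimizes

result = minimize(objective, x0=np.array([0.0, -1.0]), method="Nelder-Mead",
                  options={"xatol": 0.1, "fatol": 1e-3})
print(10.0 ** result.x, -result.fun)       # tuned (C, gamma) and accuracy
```

Searching in log space keeps the simplex steps meaningful across the orders of magnitude that C and gamma typically span.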
10
Dataset
  • Drosophila sequence datasets
  • Promoter dataset: 1842 sequences, each 300 bp
    long, spanning -250 bp to +50 bp with respect to
    the gene transcription start site
  • Intron dataset: 1799 sequences, each 300 bp long
  • Coding sequence (CDS) dataset: 2859 sequences,
    each 300 bp long
  • Datasets for the SVM classifier
  • Training file: 1260 examples (372 promoters, 361
    introns, 527 CDS)
  • Test file: 6500 examples (1842 promoters, 1799
    introns, 2859 CDS)
  • Datasets are unbalanced
  • 29.5% promoters vs. 70.5% non-promoters in the
    training dataset
  • 28.3% promoters vs. 71.7% non-promoters in the
    test dataset

11
Classification requisites
  • Feature mapping: orthogonal
  • Kernel function: power series kernel
  • Metrics
  • Specificity (SPC)
  • Sensitivity (TPR)
  • SVM classifier: SVMlight
  • SVM parameter optimization method:
    modified Nelder-Mead (downhill simplex)
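The two metrics can be written out directly from confusion-matrix counts; the numbers below are illustrative, not results from the presentation:

```python
def sensitivity(tp, fn):
    """TPR = TP / (TP + FN): fraction of true promoters recognized."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """SPC = TN / (TN + FP): fraction of non-promoters correctly rejected."""
    return tn / (tn + fp)

print(sensitivity(90, 10))   # 0.9
print(specificity(80, 20))   # 0.8
```

On unbalanced data such as the promoter datasets, reporting both metrics matters: a classifier that labels everything "non-promoter" would score high specificity but zero sensitivity.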

12
Modification of Nelder-Mead
  • Optimization time problem
  • a call to the SVM training and testing function
    is very time-costly for large datasets
  • many evaluations of the objective function are
    required
  • Modifications
  • function value caching
  • normalization after the reflection step
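Function value caching can be sketched as memoization keyed on the (rounded) parameter point, so repeated simplex evaluations reuse the stored objective value instead of re-running the expensive SVM train/test cycle. A sketch with a cheap stand-in objective, not SVMlight code:

```python
def cached(objective, decimals=6):
    """Wrap an expensive objective so repeated points are served from a cache."""
    cache = {}
    calls = {"expensive": 0}          # counts real objective evaluations

    def wrapper(params):
        key = tuple(round(p, decimals) for p in params)
        if key not in cache:
            calls["expensive"] += 1   # cache miss: pay the full cost once
            cache[key] = objective(params)
        return cache[key]

    wrapper.calls = calls
    return wrapper

f = cached(lambda p: (p[0] - 1.0) ** 2 + p[1] ** 2)
f([1.0, 2.0]); f([1.0, 2.0]); f([0.0, 0.0])
print(f.calls["expensive"])   # 2: the repeated point hit the cache
```

Rounding the key tolerates the tiny floating-point differences that simplex arithmetic can introduce at revisited points.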

13
Classification results
Kernel           | No. of optimized parameters | Type of optimized parameters | Specificity (SPC), % | Sensitivity (TPR), %
Linear           | -                           | none                         | 84.83                | 58.25
Linear           | 3                           | learning                     | 91.23                | 81.38
Polynomial       | -                           | none                         | 81.81                | 44.90
Polynomial       | 6                           | learning + kernel            | 87.64                | 67.48
Power series (2) | 3                           | kernel                       | 94.85                | 89.69
Power series (3) | 4                           | kernel                       | 94.92                | 89.95
14
ROC plot
[Figure: ROC curves of the compared classifiers; the plot itself did not survive transcription.]
15
Conclusions
  • An SVM classifier alone cannot achieve
    satisfactory classification results on a complex,
    unbalanced dataset
  • SVM parameter optimization can improve
    classification results significantly
  • The best results are achieved when SVM parameter
    optimization is combined with kernel function
    modification
  • The power series kernel is particularly suitable
    for optimization because of its larger number of
    kernel parameters

16
Ongoing work and future research
  • Application of SVM parameter optimization to the
    splice site recognition problem (presented at
    CISIS 2008)
  • Selection of rules for optimal DNA sequence
    mapping into the feature space (accepted to
    WCSB 2008)
  • Analysis of the relationships between data
    characteristics and classifier behavior (accepted
    to IS 2008)
  • Automatic derivation of formal grammar rules
    (accepted to KES 2008)
  • Structural analysis of sequences using SVM with
    grammar inference (accepted to ITA 2008)

17
Thank You. Any questions?