Title: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design
1An Empirical Study of Choosing Efficient
Discriminative Seeds for Oligonucleotide Design
- Won-Hyong Chung and Seong-Bae Park
- Dept. of Computer Engineering
- Kyungpook National University, South Korea
2Motivation
- Issues for designing oligonucleotides
- To minimize the cross-hybridizations
- To minimize the computing time
- Seeding (or indexing) has been widely used to
address those issues by pre-screening
unreliable sequence regions before calculating
cross-hybridizations.
- Although many types of seeding methods have been
proposed, no measure has yet been proposed for
evaluating how adequate and efficient the seeds
are in oligonucleotide design.
3Difference between alignment and oligonucleotide
design
- Alignment
- To find all possible alignments that have
sufficiently high scores.
- Sensitivity is important, while specificity is
usually guaranteed by the seed's own specificity.
- Oligonucleotide design
- To find optimal oligonucleotides that differentiate
target sequences from the others.
- Specificity should be considered as well as
sensitivity when checking cross-hybridization.
4Objectives
- We propose novel measures for evaluating seeds
based on discriminability and efficiency.
- We examine five seeding methods in
oligonucleotide design.
- Continuous, spaced, transition-constrained, BLAT,
and Vector seeds
- We provide a software package, SeedChooser, which
enables users to obtain adequate seeds under
their own experimental conditions.
5What is a Seed?
- Seeding process
- Filtering step: short fixed-length common words
that are found in both the query and target
sequences are selected.
- Extension step: the selected words are extended
to the size of the oligonucleotide and checked for
cross-hybridization.
- Seed: the filtering template of the fixed-length
words
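The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the SeedChooser implementation: the function names are hypothetical, the seed is a plain continuous word of length k, and a simple mismatch count over the extended window stands in for a real cross-hybridization check.

```python
from collections import defaultdict


def index_words(sequence, k):
    """Map every k-length word in `sequence` to the positions where it starts."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index


def filter_common_words(query, target, k):
    """Filtering step: (query_pos, target_pos) pairs that share a k-length word."""
    target_index = index_words(target, k)
    hits = []
    for q_pos in range(len(query) - k + 1):
        word = query[q_pos:q_pos + k]
        for t_pos in target_index.get(word, []):
            hits.append((q_pos, t_pos))
    return hits


def extend_hit(query, target, q_pos, t_pos, oligo_len):
    """Extension step (assumed form): grow the hit to an oligo-sized window.

    Returns the number of mismatches between the two windows, or None when
    either window would run past the end of its sequence.
    """
    q_win = query[q_pos:q_pos + oligo_len]
    t_win = target[t_pos:t_pos + oligo_len]
    if len(q_win) < oligo_len or len(t_win) < oligo_len:
        return None
    return sum(a != b for a, b in zip(q_win, t_win))


if __name__ == "__main__":
    query, target = "ACGTACGT", "TTACGTAA"
    for q_pos, t_pos in filter_common_words(query, target, 4):
        print(q_pos, t_pos, extend_hit(query, target, q_pos, t_pos, 6))
```

Only the positions that survive the filtering step are passed to the more expensive extension step, which is where the saving in computing time comes from.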
6Seeding methods (1/2)
- Continuous seed: a seed that finds k-length exact
matches
- BLAST employs an 11-bp seed, 11111111111
- Spaced seed: allows a don't-care letter, labeled
0, in the seed
- An 18-bp seed containing 11 match positions,
101101100111001011, is used in PatternHunter.
- Transition-constrained seed: adopts a transition
(A <-> G, C <-> T) letter, @, in the seed
- YASS uses such a seed, 1110@10010@1010111; it
is 18 bp long, with 10 match positions and 2
transition positions.
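As a rough sketch of how these three seed types behave, the Python below checks two equal-length words against a seed written with 1 for a required match, 0 for don't care, and @ for the transition letter. The function names are hypothetical and the code is illustrative only; it is not taken from BLAST, PatternHunter, or YASS.

```python
# Transitions are purine <-> purine (A, G) and pyrimidine <-> pyrimidine (C, T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASE_CLASS = {"A": "R", "G": "R", "C": "Y", "T": "Y"}


def seed_matches(seed, word_a, word_b):
    """True if word_a and word_b agree under `seed`.

    '1' requires identical letters, '0' ignores the position, and '@'
    accepts identical letters or a transition (A<->G, C<->T).
    """
    if not (len(seed) == len(word_a) == len(word_b)):
        raise ValueError("seed and words must have the same length")
    for symbol, a, b in zip(seed, word_a, word_b):
        if symbol == "1" and a != b:
            return False
        if symbol == "@" and a != b and (a, b) not in TRANSITIONS:
            return False
    return True


def seed_key(seed, word):
    """Index key for `word`: two words match under the seed iff their keys are equal."""
    if len(seed) != len(word):
        raise ValueError("seed and word must have the same length")
    key = []
    for symbol, letter in zip(seed, word):
        if symbol == "1":
            key.append(letter)
        elif symbol == "@":
            key.append(BASE_CLASS[letter])  # keep only purine/pyrimidine class
        # '0' positions contribute nothing to the key
    return "".join(key)


if __name__ == "__main__":
    print(seed_matches("1101", "ACGT", "ACTT"))  # mismatch falls on the 0 position
    print(seed_matches("1@1", "AAC", "AGC"))     # A<->G is a transition
    print(seed_key("1@1", "AAC"), seed_key("1@1", "AGC"))
```

A continuous seed is the special case where every symbol is 1, so the key is the word itself.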
7Seeding methods (2/2)
- BLAT seed: a continuous seed allowing one or two
mismatches at any position of the seed.
- Vector seed: a generalized seed combining the
ideas of the BLAT seed and the spaced seed.
- The BLAT seed and the Vector seed allow some
mismatches at any position.
- They greatly increase the sensitivity but spend
much more computing time than the previous seeds.
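The difference between the two mismatch-tolerant seeds can be sketched as follows. The sketch assumes a common formulation of a vector seed as a weight vector plus a threshold, where a hit requires the summed weights of the matching positions to reach the threshold; the function names are hypothetical and this is not the BLAT or SeedChooser code.

```python
def mismatch_seed_hit(word_a, word_b, max_mismatches):
    """BLAT-style seed: a continuous word that tolerates a few mismatches anywhere."""
    if len(word_a) != len(word_b):
        raise ValueError("words must have the same length")
    mismatches = sum(a != b for a, b in zip(word_a, word_b))
    return mismatches <= max_mismatches


def vector_seed_hit(weights, threshold, word_a, word_b):
    """Vector seed (assumed form): weighted count of matching positions >= threshold."""
    if not (len(weights) == len(word_a) == len(word_b)):
        raise ValueError("weights and words must have the same length")
    score = sum(w for w, a, b in zip(weights, word_a, word_b) if a == b)
    return score >= threshold


if __name__ == "__main__":
    # Spaced seed 1011 as a vector seed: weight 0 at the don't-care position,
    # threshold equal to the number of 1s.
    print(vector_seed_hit([1, 0, 1, 1], 3, "ACGT", "ATGT"))
    # BLAT-style seed of length 4 with one mismatch allowed: all weights 1,
    # threshold 4 - 1.
    print(vector_seed_hit([1, 1, 1, 1], 3, "ACGT", "ATGT"))
    print(mismatch_seed_hit("ACGT", "ATGT", 1))
```

Under this formulation the continuous, spaced, and BLAT seeds are all special cases of the vector seed, which is why it is described as a generalization. Because a mismatch can fall at any position, such seeds cannot be looked up with a single exact index key, which is one reason for the extra computing time.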
8The Issues of seeds for oligo design
- An ideal seed should filter out, as fast as
possible, all regions that have no possibility of
being chosen as an oligo.
- Discriminability: a seed should find as many
oligos as possible.
- Discriminability: a seed should avoid finding
non-oligo regions.
- Efficiency: a seed should minimize the cost of
indexing to find oligos.
- Efficient discriminability: discriminability and
efficiency combined.
9Discriminability
- The discriminability is a balance between
precision and recall that minimizes both false
positives and false negatives.
- Parameter: alpha
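The exact formula for discriminability is not given in the text here. As a hedged illustration only, the sketch below uses the standard weighted F-measure, with alpha playing the role usually written as beta, as one familiar way to balance precision and recall with a single parameter. Treat the formula and the function names as assumptions, not as the authors' definition.

```python
def precision(true_pos, false_pos):
    """Fraction of seed hits that are real matched oligos."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


def recall(true_pos, false_neg):
    """Fraction of real matched oligos that the seed finds."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


def weighted_f_measure(true_pos, false_pos, false_neg, alpha=1.0):
    """Assumed stand-in: F = (1 + alpha^2) * P * R / (alpha^2 * P + R).

    alpha = 1 weights precision and recall equally; larger alpha
    puts more weight on recall.
    """
    p = precision(true_pos, false_pos)
    r = recall(true_pos, false_neg)
    denominator = alpha ** 2 * p + r
    if denominator == 0:
        return 0.0
    return (1 + alpha ** 2) * p * r / denominator


if __name__ == "__main__":
    # A seed that finds 8 of 10 matched oligos and also reports 2 non-oligo regions.
    print(weighted_f_measure(true_pos=8, false_pos=2, false_neg=2, alpha=1.0))
```

A seed with many false positives lowers precision, and one with many false negatives lowers recall; a balanced measure of this kind penalizes both.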
10Efficiency
- The efficiency is the proportion of useful
regions filtered by a seed.
- The duplication ratio of generated indices
- The average number of indices in each oligo
- Parameters: beta, gamma
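The two quantities named above can be illustrated with a short sketch. How they are combined into the efficiency score, and how beta and gamma enter, is not stated in the text, so the code computes only the two statistics. The definitions used here (duplicates as the share of generated keys that repeat an earlier key, and a continuous k-length seed for generating keys) are assumptions, and the function names are hypothetical.

```python
def continuous_keys(oligo, k):
    """All k-length index keys of an oligo under a continuous seed."""
    return [oligo[i:i + k] for i in range(len(oligo) - k + 1)]


def duplication_ratio(keys):
    """Assumed definition: share of generated keys that repeat an earlier key."""
    if not keys:
        return 0.0
    return 1.0 - len(set(keys)) / len(keys)


def average_indices_per_oligo(keys_per_oligo):
    """Mean number of index keys generated per oligo."""
    if not keys_per_oligo:
        return 0.0
    return sum(len(keys) for keys in keys_per_oligo) / len(keys_per_oligo)


if __name__ == "__main__":
    oligos = ["AAAA", "ACGT"]
    keys_per_oligo = [continuous_keys(oligo, 2) for oligo in oligos]
    all_keys = [key for keys in keys_per_oligo for key in keys]
    print(duplication_ratio(all_keys))
    print(average_indices_per_oligo(keys_per_oligo))
```

Both statistics relate to indexing cost: repeated keys add index entries without adding new information, and more keys per oligo mean more lookups per candidate.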
11Efficient discriminability
12Experiments
- Empirically chosen seeds were evaluated by three
measures: discriminability, efficiency, and
efficient discriminability.
- We tested the seeds for designing 50mer
oligos.
- The parameters are set to 1 for evaluation.
- Simulated data set
- A set of random sequences generated by
OligoGenerator in SeedChooser.
- Biological data set
- Ecologically important genes involved in the
nitrogen and carbon cycles.
- nirS: nitrite reductase gene set
- pmoA: methane monooxygenase gene set
13Discriminability of the five seeding methods
14Efficiency of the five seeding methods
15Efficient Discriminability of the five seeding
methods
16Evaluation results of pmoA data set
17Evaluation results of nirS data set
18SeedChooser: Seed Evaluation and Recommendation
Tools
- SeedChooser: recommends the best seeds according
to the evaluation parameters. It uses a genetic
algorithm to find the best seeds.
- SeedEvaluator: evaluates a set of seeds by
the parameters.
- OligoGenerator: generates a set of oligos for
the desired experimental conditions.
- SeedChooser homepage
- http://ml.knu.ac.kr/whchung/seedchooser.html
19CONCLUSION
- We proposed a novel measure for evaluating seeds
in oligo design, based on discriminability and
efficiency.
- The spaced seed was generally preferred to the
other seeding methods.
- Our study can be applied to oligo design
programs to improve their performance by
suggesting experiment-specific seeds.
- We expect that our study will also be helpful for
other genomic tasks.
20Supplementary materials
21[Figure labels: P0, T0, P1, T1, P2, T2, T3]
- T1, T2, and T3 are the target sequences.
- P1 and P2 are the matched oligos for an oligo P0.
- S1, S2, and S3 are the seed indices for S0 by a
seed.
22Relations of precision, recall and
discriminability
23Discriminability according to values of α
24Efficiency according to values of β and γ
25Efficient Discriminability for 70mer Oligos