Title: CISD 794
1CISD 794 Knowledge Discovery in Databases
Prof. Junping Sun
- DENCLUE - Clustering DNA Sequences using FFT
(Fast Fourier Transform ) - Gilberto R dos Santos
2Clustering
- Cluster analysis is an unsupervised learning
method that constitutes a cornerstone of an
intelligent data analysis process. - Some DNA sequences may reach several millions of
base pairs. Any search method tried to have a
direct hit comparing two DNA sequences may have
an algorithm that grows Onn, being n the
number of bases in the DNA sequence. Several
methods are being combined to optimize sequence
search and matching. One initial approach is to
build clusters of DNA sequences.
3Functions in Time or Frequency
- A physical process can be described either in the
time domain, by the values of some quantity h as
a function of time t, e.g., h(t), or else in the
frequency domain, where the process is specified
by giving its amplitude H (generally a complex
number indicating phase also) as function of
frequency f, that is H(f), with - 8 lt f lt 8 - h(t) and H(f) are two different representations
of the same function. - H(f) h(t) e ( 2 p i f t ) dt
- h(t) H(f) e ( -2 p i f t ) dt
4FFT Fast Fourier Transform
- Fourier Equation
- Where An and Bn are
5DNA Sequence PDB 1b05
6DNA Sequence PDB 1b25
7DNA and FFT (DFT)
- DFT /FFT (Discrete Fourier Transformation) can
be used to define a wave that is the closest
representation of a DNA sequence function h(t).
DFT transformation technique can be used to
reduce the search time of range queries. This
method can be used in conjunction with any other
existing method, or can be used as a filtration
technique. The challenge of this research is to
define how DFT (FFT) can represent a DNA
sequence. The proximity search seeks the
sequences close enough to a given query sequence
either through direct alignment or using other
heuristics. The alignment of biological sequences
( pairwise or multiple alignment) is the
operation to place nucleotide or amino acid
residues in columns inferring the closest common
ancestral relationships. The best alignment
usually refers to the one demonstrating the most
likely evolutionary scenario.
8Wave Composition
9Data Structures Used
- Data matrix
- (two modes)
- Dissimilarity matrix
- (one mode)
10Euclidean Distance
- The Euclidean Distance for the matrix d(i,j) is
-
- Dx (i,j) sqrt ( (Xi1 Xj1 ) 2 (Xi2 Xj2
) 2 .. ( Xip Xjp ) 2 ) -
- Dy (i,j) sqrt ( (Yi1 Yj1 ) 2 (Yi2 Yj2
) 2 .. ( Yip Yjp ) 2 ) -
- Dz (i,j) sqrt ( (Zi1 Zj1 ) 2 (Zi2 Zj2
) 2 .. ( Zip Zjp ) 2 )
11Euclidean Distance
- cursor c1 is select from DNA_SEQUENCE_FREQ_XY
where frequency gt 0 for update order by 1 - cursor c2 is select from DNA_SEQUENCE_FREQ_YZ
where frequency gt 0 for update order by 1 - cursor c3 is select from DNA_SEQUENCE_FREQ_ZX
where frequency gt 0 for update order by 1 - d_v_dist_a number0
- d_v_dist_b number0
- v_dist_a number0
- v_dist_b number0
- r2 dna_variance_coef_xyrowtype
- rec3 dna_variance_coef_yzrowtype
- rec4 dna_variance_coef_zxrowtype
- begin
- for r1 in c1 loop
- select into r2
- from dna_variance_coef_xy where frequency
r1.frequency - begin
- select / INDEX ( DNA_SEQUENCE_FREQ_XY,
DNA_SEQUENCE_FREQ_XY_IDX1 ) / - sum( ( r1.ACOEF - ACOEF ) exp ( -1 (
power ( (r1.ACOEF - ACOEF) ,2) / r2.var_acoef ))
), - sum( ( r1.BCOEF - BCOEF ) exp ( -1 (
power ( (r1.BCOEF - BCOEF) ,2) / r2.var_bcoef ))
) - into v_dist_a,v_dist_b
12Euclidean Distance - sine coefficient Function
X ? Y
13Euclidean Distance - cosine coefficient
Function (X) ? Y
14DENCLUE Clustering Based on Density
Distribution Functions
- The influence of each data point can be formally
modeled using a mathematical function (influence
function), that describes the impact of a data
point within its neighborhood. - The overall density of the data space can be
modeled analytically as the sum of the influence
functions of all data points - Clusters can then be determined mathematically by
identifying density atractors, where density
atractors are local maxima of the overall density
functions. - d(X,Y) ? should be reflexive and symetric ?
Euclidean distance Function. - Density Attractor/Density-Attracted Points
- -local maximum of the density function
- -density-attracted points are determined by a
gradient-based hill-climbing method - Center-Defined Cluster
- A center-defined cluster with density-attractor
x ( ) is the subset of the database which is
density-attracted by x.
15DNA Sequence 1b5s ALA Residue
16Denclue Technical Essence
- Uses grid cells but only keeps information about
grid cells that do actually contain data points
and manages these cells in a tree-based access
structure. - Influence function describes the impact of a
data point within its neighborhood. - Overall density of the data space can be
calculated as the sum of the influence function
of all data points. - Clusters can be determined mathematically by
identifying density attractors. - Density attractors are local maximal of the
overall density function.
17Gradient The steepness of a slope
18Density Attractor Function
- A point x Î is called a density-attractor
for a given influence function, iff x a local
maximum of the density-function - A point x Î
- is density-attracted to a density attractor x,
- iff k Î N d ( , x) lt x with
19Center Defined Clusters
- A center-defined cluster (wrt to s , x ) for a
density attractor x is a subset C Í D, with x Î
C being density-attracted by x - and (x) gt x. Points x Î D are
called outliers if they are density-attracted by
a local maximum - with
- ( ) lt x
20DENCLUE AlgorithmLocal Density Function
- Two cubes c1, c2 Î Cp are connected if
- d(mean(c1) mean(c2)) lt 4s
- near(x) x1 Î c1 d(mean(c1),x) lt ks
- (note k4)
21Density Attractors
22DENCLUE Algorithm (cont.)
- After determining the density-attractor x for a
point x and - the point x is classified and attached to the
cluster belonging to x.
23Gaussian Distance
- cursor c1 is select from DNA_SEQUENCE_FREQ_XY
where frequency gt 0 for update order by 1 - cursor c2 is select from DNA_SEQUENCE_FREQ_YZ
where frequency gt 0 for update order by 1 - cursor c3 is select from DNA_SEQUENCE_FREQ_ZX
where frequency gt 0 for update order by 1 - d_v_dist_a number0
- d_v_dist_b number0
- v_dist_a number0
- v_dist_b number0
- r2 dna_variance_coef_xyrowtype
- rec3 dna_variance_coef_yzrowtype
- rec4 dna_variance_coef_zxrowtype
- begin
- for r1 in c1 loop
- select into r2
- from dna_variance_coef_xy where frequency
r1.frequency - begin
- select / INDEX ( DNA_SEQUENCE_FREQ_XY,
DNA_SEQUENCE_FREQ_XY_IDX1 ) / - sum(exp ( -1 ( power ( (r1.ACOEF -
ACOEF) ,2) / r2.var_acoef )) ), - sum( exp ( -1 ( power ( (r1.BCOEF
- BCOEF) ,2) / r2.var_bcoef )) )
24Gaussian Distance - sine coefficient ? xy ? ALA
Residue
25Gaussian Distance - cosine coefficient ? xy ?
ALA Residue
26Gradient Distance
- cursor c1 is select from DNA_SEQUENCE_FREQ_XY
where frequency gt 0 for update order by 1 - cursor c2 is select from DNA_SEQUENCE_FREQ_YZ
where frequency gt 0 for update order by 1 - cursor c3 is select from DNA_SEQUENCE_FREQ_ZX
where frequency gt 0 for update order by 1 - d_v_dist_a number0
- d_v_dist_b number0
- v_dist_a number0
- v_dist_b number0
- r2 dna_variance_coef_xyrowtype
- rec3 dna_variance_coef_yzrowtype
- rec4 dna_variance_coef_zxrowtype
- begin
-
- for r1 in c1 loop
- select into r2
- from dna_variance_coef_xy where frequency
r1.frequency - begin
- select / INDEX ( DNA_SEQUENCE_FREQ_XY,
DNA_SEQUENCE_FREQ_XY_IDX1 ) / - sum( ( r1.ACOEF - ACOEF ) exp ( -1 (
power ( (r1.ACOEF - ACOEF) ,2) / r2.var_acoef ))
), - sum( ( r1.BCOEF - BCOEF ) exp ( -1 (
power ( (r1.BCOEF - BCOEF) ,2)/ r2.var_bcoef )) )
27Gradient Gaussian Distance Cos ? xy ? ALA Residue
28References
- Flexible Pattern Matching in Strings Practical
On-line - Search Algorithms for Texts and
Biological Sequences - Gonzalo Navarro and Mathieu Raffinot
- Numerical Recipes in C
- William H. Press / Saul A. Teukolsky /
William T. Vetterling / - Brian P. Flannery - http//library.lanl
.gov/numerical - Introduction to Algorithms
- Thomas H. Cormen / Charles E.
Leiserson / Ronald L. Rivest - Introduction to Bioinformatics - Arthur M. Lesk
- Handbook of Algorithms and Data Structures,
- G.H. Gonnet R. Baeza-Yates.
- 6. Data Mining Opportunities and
Challenges - John Wang - Who is Fourier? A Mathematical Adventure Alan
Gleason - 8. Mathematical Handbook for Scientists and
Engineers - Granino A. Korn and Theresa M. Korn
- 9. An Efficient Approach to Clustering in Large
Multimedia Databases with Noise - Alexander Hinneburg, Daniel A. Keim
- Data Mining Concepts and Techniques
- Jiawei Han and Micheline Kamber