CISD 794 - PowerPoint PPT Presentation

Provided by: gilbertod

Learn more at: http://scis.nova.edu

Transcript and Presenter's Notes

1
CISD 794 Knowledge Discovery in Databases
Prof. Junping Sun

DENCLUE - Clustering DNA Sequences using FFT
(Fast Fourier Transform)
Gilberto R dos Santos

2
Clustering

Cluster analysis is an unsupervised learning
method and a cornerstone of an intelligent data
analysis process.
Some DNA sequences reach several million base
pairs. A search method that looks for a direct
hit by comparing two DNA sequences base by base
may need running time that grows as O(n^2), where
n is the number of bases in the DNA sequence.
Several methods are being combined to optimize
sequence search and matching. One initial
approach is to build clusters of DNA sequences.

3
Functions in Time or Frequency

A physical process can be described either in the
time domain, by the values of some quantity h as
a function of time t, e.g., h(t), or else in the
frequency domain, where the process is specified
by giving its amplitude H (generally a complex
number that also indicates phase) as a function
of frequency f, that is H(f), with -∞ < f < ∞.
h(t) and H(f) are two different representations
of the same function.
$$H(f) = \int_{-\infty}^{\infty} h(t)\, e^{2 \pi i f t}\, dt$$
$$h(t) = \int_{-\infty}^{\infty} H(f)\, e^{-2 \pi i f t}\, df$$
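
The transform pair above has a discrete analogue for sampled data. The following is a minimal sketch, assuming a short list of real samples and keeping the sign convention of this slide (positive exponent in the forward transform). The function names `dft` and `inverse_dft` are illustrative only, and the direct O(N^2) summation is used for clarity rather than an actual FFT.

```python
import cmath

def dft(h):
    """Forward transform, using e^{+2*pi*i*k*n/N} as on the slide."""
    count = len(h)
    return [
        sum(h[n] * cmath.exp(2j * cmath.pi * k * n / count) for n in range(count))
        for k in range(count)
    ]

def inverse_dft(H):
    """Inverse transform, using e^{-2*pi*i*k*n/N} and a 1/N factor."""
    count = len(H)
    return [
        sum(H[k] * cmath.exp(-2j * cmath.pi * k * n / count) for k in range(count)) / count
        for n in range(count)
    ]

# Hypothetical samples of h(t); any real or complex list works.
samples = [1.0, 2.0, 0.0, -1.0]
spectrum = dft(samples)
recovered = inverse_dft(spectrum)
```

Applying `inverse_dft` to the output of `dft` returns the original samples up to rounding error, which is the discrete counterpart of h(t) and H(f) being two representations of the same function.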

4
FFT: Fast Fourier Transform

Fourier Equation
Where An and Bn are

5
DNA Sequence PDB 1b05
6
DNA Sequence PDB 1b25
7
DNA and FFT (DFT)

The DFT (Discrete Fourier Transform), computed
with the FFT, can be used to define a wave that
is the closest representation of a DNA sequence
function h(t). The DFT technique can be used to
reduce the search time of range queries. It can
be used in conjunction with any other existing
method, or as a filtration technique. The
challenge of this research is to define how the
DFT (FFT) can represent a DNA sequence. A
proximity search seeks the sequences close
enough to a given query sequence, either through
direct alignment or using other heuristics. The
alignment of biological sequences (pairwise or
multiple alignment) places nucleotide or amino
acid residues in columns so as to infer the
closest common ancestral relationships. The best
alignment usually refers to the one demonstrating
the most likely evolutionary scenario.
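
As a rough illustration of representing a sequence by a wave, the sketch below computes cosine and sine coefficients per frequency for a numeric series. It assumes the sequence has already been turned into numbers; that encoding is exactly the open question named on this slide, so the `series` here is a made-up stand-in. The function name `fourier_coefficients` and the (frequency, a, b) row layout are assumptions and are not taken from the presentation.

```python
import math

def fourier_coefficients(values):
    """Return (frequency, a, b) rows for frequencies 1 .. N/2 - 1.

    a is the cosine coefficient and b is the sine coefficient of the
    real-valued series `values`.
    """
    n = len(values)
    rows = []
    for k in range(1, n // 2):
        a = (2.0 / n) * sum(
            v * math.cos(2 * math.pi * k * i / n) for i, v in enumerate(values)
        )
        b = (2.0 / n) * sum(
            v * math.sin(2 * math.pi * k * i / n) for i, v in enumerate(values)
        )
        rows.append((k, a, b))
    return rows

# Hypothetical numeric series standing in for an encoded sequence:
# one full cosine cycle over 8 samples.
series = [math.cos(2 * math.pi * i / 8) for i in range(8)]
rows = fourier_coefficients(series)
```

For this test series all of the energy sits at frequency 1 in the cosine coefficient, so two series with similar coefficient rows describe similar waves. That is the property a filtration step could exploit before a full alignment.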

8
Wave Composition
9
Data Structures Used

Data matrix
(two modes)
Dissimilarity matrix
(one mode)

10
Euclidean Distance

The Euclidean distance for the matrix d(i,j) is
$$D_x(i,j) = \sqrt{(X_{i1} - X_{j1})^2 + (X_{i2} - X_{j2})^2 + \cdots + (X_{ip} - X_{jp})^2}$$
$$D_y(i,j) = \sqrt{(Y_{i1} - Y_{j1})^2 + (Y_{i2} - Y_{j2})^2 + \cdots + (Y_{ip} - Y_{jp})^2}$$
$$D_z(i,j) = \sqrt{(Z_{i1} - Z_{j1})^2 + (Z_{i2} - Z_{j2})^2 + \cdots + (Z_{ip} - Z_{jp})^2}$$
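
The distance formula above, and the dissimilarity matrix it fills, can be sketched as follows. This is a minimal illustration, assuming each object is a row of p numeric measurements. The names `euclidean_distance` and `dissimilarity_matrix` and the sample rows are hypothetical.

```python
import math

def euclidean_distance(row_i, row_j):
    """Square root of the summed squared differences over p measurements."""
    if len(row_i) != len(row_j):
        raise ValueError("rows must have the same number of measurements")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row_i, row_j)))

def dissimilarity_matrix(rows):
    """One-mode matrix: entry [i][j] is the distance between rows i and j."""
    return [[euclidean_distance(r, s) for s in rows] for r in rows]

# Hypothetical two-mode data matrix: 3 objects with 2 measurements each.
data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
matrix = dissimilarity_matrix(data)
```

The resulting matrix is symmetric with zeros on the diagonal, which matches the reflexive and symmetric distance that the clustering step relies on.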

11
Euclidean Distance

-- Cursors over the per-plane frequency tables (XY, YZ, ZX)
cursor c1 is select * from DNA_SEQUENCE_FREQ_XY
  where frequency > 0 for update order by 1;
cursor c2 is select * from DNA_SEQUENCE_FREQ_YZ
  where frequency > 0 for update order by 1;
cursor c3 is select * from DNA_SEQUENCE_FREQ_ZX
  where frequency > 0 for update order by 1;
d_v_dist_a number := 0;
d_v_dist_b number := 0;
v_dist_a number := 0;
v_dist_b number := 0;
r2 dna_variance_coef_xy%rowtype;
rec3 dna_variance_coef_yz%rowtype;
rec4 dna_variance_coef_zx%rowtype;
begin
  for r1 in c1 loop
    -- Variance of the coefficients at this row's frequency
    select * into r2
      from dna_variance_coef_xy where frequency = r1.frequency;
    begin
      -- Sum the coefficient differences, each weighted by a Gaussian
      -- kernel scaled by the per-frequency variance
      select /*+ INDEX ( DNA_SEQUENCE_FREQ_XY, DNA_SEQUENCE_FREQ_XY_IDX1 ) */
        sum( ( r1.ACOEF - ACOEF ) * exp ( -1 * ( power ( (r1.ACOEF - ACOEF) ,2) / r2.var_acoef )) ),
        sum( ( r1.BCOEF - BCOEF ) * exp ( -1 * ( power ( (r1.BCOEF - BCOEF) ,2) / r2.var_bcoef )) )
      into v_dist_a, v_dist_b
      -- (the statement is cut off here on the slide)

12
Euclidean Distance - sine coefficient Function
X ? Y

13
Euclidean Distance - cosine coefficient
Function (X) ? Y
14
DENCLUE: Clustering Based on Density
Distribution Functions

The influence of each data point can be formally
modeled using a mathematical function (the
influence function) that describes the impact of
a data point within its neighborhood.
The overall density of the data space can be
modeled analytically as the sum of the influence
functions of all data points.
Clusters can then be determined mathematically by
identifying density attractors, where density
attractors are local maxima of the overall
density function.
The distance function d(X, Y) should be reflexive
and symmetric, e.g. the Euclidean distance
function.
Density Attractor/Density-Attracted Points
- a density attractor is a local maximum of the density function
- density-attracted points are determined by a
gradient-based hill-climbing method
Center-Defined Cluster
A center-defined cluster with density attractor
x* is the subset of the database which is
density-attracted by x*.
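
The influence function and the overall density described on this slide can be sketched as follows. This is a minimal illustration, assuming a Gaussian influence function with width `sigma` and the Euclidean distance. The function names, sample points and value of `sigma` are hypothetical.

```python
import math

def gaussian_influence(x, y, sigma):
    """Influence of data point y at location x (Gaussian kernel)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def density(x, data, sigma):
    """Overall density at x: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)

# Hypothetical 2-D data: three points close together and one far away.
data = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0)]
sigma = 0.5
dense_spot = density((1.0, 1.0), data, sigma)
sparse_spot = density((3.0, 3.0), data, sigma)
```

The density is high inside the group of three points and close to zero in the empty region between the groups, so the local maxima of this function mark the cluster centers.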

15
DNA Sequence 1b5s ALA Residue
16
DENCLUE: Technical Essence

Uses grid cells, but only keeps information about
grid cells that actually contain data points,
and manages these cells in a tree-based access
structure.
The influence function describes the impact of a
data point within its neighborhood.
The overall density of the data space can be
calculated as the sum of the influence functions
of all data points.
Clusters can be determined mathematically by
identifying density attractors.
Density attractors are local maxima of the
overall density function.

17
Gradient: The steepness of a slope

Example

18
Density Attractor Function

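A point $x^* \in F^d$ (the d-dimensional feature space) is called a
density attractor for a given influence function iff $x^*$ is a local
maximum of the density function $f_B^D$.
A point $x \in F^d$
is density-attracted to a density attractor $x^*$
iff $\exists k \in \mathbb{N}: d(x^k, x^*) < \epsilon$ with
$$x^0 = x, \qquad x^i = x^{i-1} + \delta \cdot \frac{\nabla f_B^D(x^{i-1})}{\lVert \nabla f_B^D(x^{i-1}) \rVert}$$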

19
Center Defined Clusters

A center-defined cluster (with respect to $\sigma$, $\xi$) for a
density attractor $x^*$ is a subset $C \subseteq D$, with every $x \in C$
being density-attracted by $x^*$
and $f_B^D(x^*) > \xi$. Points $x \in D$ are
called outliers if they are density-attracted by
a local maximum $x_0^*$
with
$f_B^D(x_0^*) < \xi$
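
The definitions on the last two slides can be tied together in a small sketch: each point climbs the density surface in fixed steps along the normalized gradient until the density stops increasing, and the end point is treated as its density attractor. Points whose attractor density falls below the threshold become outliers. This is only an illustration under stated assumptions: a Gaussian influence function, and the values of `delta`, `xi` and `merge_radius` (the radius used to decide that two end points are the same attractor) are all hypothetical choices, not values from the presentation.

```python
import math

def density_and_gradient(x, data, sigma):
    """Gaussian density at x and its gradient (up to a constant factor)."""
    dens = 0.0
    grad = [0.0] * len(x)
    for y in data:
        sq = sum((a - b) ** 2 for a, b in zip(x, y))
        w = math.exp(-sq / (2.0 * sigma ** 2))
        dens += w
        for d in range(len(x)):
            grad[d] += (y[d] - x[d]) * w
    return dens, grad

def find_attractor(x, data, sigma, delta=0.05, max_steps=1000):
    """Hill-climb from x in steps of length delta; stop when density drops."""
    x = list(x)
    dens, grad = density_and_gradient(x, data, sigma)
    for _ in range(max_steps):
        norm = math.sqrt(sum(g * g for g in grad))
        if norm == 0.0:
            break
        candidate = [xi + delta * g / norm for xi, g in zip(x, grad)]
        cand_dens, cand_grad = density_and_gradient(candidate, data, sigma)
        if cand_dens < dens:
            break
        x, dens, grad = candidate, cand_dens, cand_grad
    return x, dens

def center_defined_clusters(data, sigma, xi, merge_radius):
    """Group points by attractor; low-density attractors yield outliers."""
    clusters = []  # list of (attractor, member points)
    outliers = []
    for point in data:
        attractor, dens = find_attractor(point, data, sigma)
        if dens < xi:
            outliers.append(point)
            continue
        for centre, members in clusters:
            gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(centre, attractor)))
            if gap < merge_radius:
                members.append(point)
                break
        else:
            clusters.append((attractor, [point]))
    return clusters, outliers

# Hypothetical data: two tight groups and one isolated point.
data = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
        (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
        (9.0, 0.0)]
clusters, outliers = center_defined_clusters(data, sigma=0.5, xi=1.5, merge_radius=0.5)
```

With these settings the two tight groups each form one cluster, and the isolated point is reported as an outlier because the density at its attractor is only about 1, which is below `xi`.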

20
DENCLUE Algorithm: Local Density Function

Two cubes $c_1, c_2 \in C_p$ are connected if
$d(mean(c_1), mean(c_2)) < 4\sigma$
$near(x) = \{ x_1 \in c_1 \mid d(mean(c_1), x) < k\sigma \}$
(note: $k = 4$)
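
The cube bookkeeping on this slide can be sketched as follows. The sketch keeps only populated cubes in a dictionary and tests whether two cubes are connected by comparing the distance between their means with 4σ. The cube edge length of 2σ is an assumption made for this example, and a plain dictionary stands in for the tree-based access structure. All function names and data points are hypothetical.

```python
import math
from collections import defaultdict

def populated_cubes(data, sigma):
    """Map each grid key to its points; empty cubes are never stored."""
    edge = 2.0 * sigma  # assumed cube edge length
    cubes = defaultdict(list)
    for point in data:
        key = tuple(math.floor(c / edge) for c in point)
        cubes[key].append(point)
    return dict(cubes)

def cube_mean(points):
    """Mean position of the points stored in one cube."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def are_connected(points_1, points_2, sigma):
    """Two cubes are connected if their means are closer than 4 * sigma."""
    m1, m2 = cube_mean(points_1), cube_mean(points_2)
    gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(m1, m2)))
    return gap < 4.0 * sigma

sigma = 0.5
data = [(0.2, 0.3), (0.8, 0.6), (1.4, 0.5), (7.5, 7.5)]
cubes = populated_cubes(data, sigma)
near_pair = are_connected(cubes[(0, 0)], cubes[(1, 0)], sigma)
far_pair = are_connected(cubes[(0, 0)], cubes[(7, 7)], sigma)
```

Restricting the density computation to points in connected cubes is what lets the local density function skip most of the data set.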

21
Density Attractors
22
DENCLUE Algorithm (cont.)

After determining the density attractor $x^*$ for a
point $x$ and verifying that $f_B^D(x^*) > \xi$,
the point $x$ is classified and attached to the
cluster belonging to $x^*$.

23
Gaussian Distance

-- Cursors over the per-plane frequency tables (XY, YZ, ZX)
cursor c1 is select * from DNA_SEQUENCE_FREQ_XY
  where frequency > 0 for update order by 1;
cursor c2 is select * from DNA_SEQUENCE_FREQ_YZ
  where frequency > 0 for update order by 1;
cursor c3 is select * from DNA_SEQUENCE_FREQ_ZX
  where frequency > 0 for update order by 1;
d_v_dist_a number := 0;
d_v_dist_b number := 0;
v_dist_a number := 0;
v_dist_b number := 0;
r2 dna_variance_coef_xy%rowtype;
rec3 dna_variance_coef_yz%rowtype;
rec4 dna_variance_coef_zx%rowtype;
begin
  for r1 in c1 loop
    -- Variance of the coefficients at this row's frequency
    select * into r2
      from dna_variance_coef_xy where frequency = r1.frequency;
    begin
      -- Sum of Gaussian kernels of the coefficient differences,
      -- scaled by the per-frequency variance
      select /*+ INDEX ( DNA_SEQUENCE_FREQ_XY, DNA_SEQUENCE_FREQ_XY_IDX1 ) */
        sum( exp ( -1 * ( power ( (r1.ACOEF - ACOEF) ,2) / r2.var_acoef )) ),
        sum( exp ( -1 * ( power ( (r1.BCOEF - BCOEF) ,2) / r2.var_bcoef )) )
      -- (the statement is cut off here on the slide)
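
The aggregate in the query above can be mirrored outside the database. The sketch below is an assumption-laden illustration: it takes one coefficient value, the coefficients of the other rows at the same frequency, and the variance for that frequency, and returns the sum of Gaussian terms. Like the query, it divides by the variance itself rather than by twice the variance. The function name `gaussian_sum` and the sample numbers are hypothetical.

```python
import math

def gaussian_sum(value, others, variance):
    """Sum of exp(-(value - other)^2 / variance) over all other coefficients."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return sum(math.exp(-((value - other) ** 2) / variance) for other in others)

# Hypothetical cosine or sine coefficients sharing one frequency.
same = gaussian_sum(1.0, [1.0, 1.0], 2.0)
mixed = gaussian_sum(1.0, [1.0, 3.0], 4.0)
```

Coefficients equal to `value` each contribute 1, and distant ones contribute almost nothing, so the sum acts as a density estimate around `value`.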

24
Gaussian Distance - sine coefficient - xy - ALA
Residue
25
Gaussian Distance - cosine coefficient - xy -
ALA Residue
26
Gradient Distance

-- Cursors over the per-plane frequency tables (XY, YZ, ZX)
cursor c1 is select * from DNA_SEQUENCE_FREQ_XY
  where frequency > 0 for update order by 1;
cursor c2 is select * from DNA_SEQUENCE_FREQ_YZ
  where frequency > 0 for update order by 1;
cursor c3 is select * from DNA_SEQUENCE_FREQ_ZX
  where frequency > 0 for update order by 1;
d_v_dist_a number := 0;
d_v_dist_b number := 0;
v_dist_a number := 0;
v_dist_b number := 0;
r2 dna_variance_coef_xy%rowtype;
rec3 dna_variance_coef_yz%rowtype;
rec4 dna_variance_coef_zx%rowtype;
begin
  for r1 in c1 loop
    -- Variance of the coefficients at this row's frequency
    select * into r2
      from dna_variance_coef_xy where frequency = r1.frequency;
    begin
      -- Gradient terms: each coefficient difference multiplied by its
      -- Gaussian kernel, then summed
      select /*+ INDEX ( DNA_SEQUENCE_FREQ_XY, DNA_SEQUENCE_FREQ_XY_IDX1 ) */
        sum( ( r1.ACOEF - ACOEF ) * exp ( -1 * ( power ( (r1.ACOEF - ACOEF) ,2) / r2.var_acoef )) ),
        sum( ( r1.BCOEF - BCOEF ) * exp ( -1 * ( power ( (r1.BCOEF - BCOEF) ,2) / r2.var_bcoef )) )
      -- (the statement is cut off here on the slide)
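
The gradient aggregate in the query above can be sketched in the same way. Each difference between the current coefficient and another coefficient is weighted by its Gaussian term, and the results are summed. The sketch keeps the sign convention of the query, `value - other`, and divides by the variance as the query does. The function name `gradient_sum` and the sample numbers are hypothetical.

```python
import math

def gradient_sum(value, others, variance):
    """Sum of (value - other) * exp(-(value - other)^2 / variance)."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return sum(
        (value - other) * math.exp(-((value - other) ** 2) / variance)
        for other in others
    )

# Neighbours placed symmetrically around the value cancel each other out.
balanced = gradient_sum(2.0, [1.0, 3.0], 1.0)
# A single neighbour above the value gives a negative sum in this convention.
one_sided = gradient_sum(0.0, [1.0], 1.0)
```

A sum near zero indicates that the coefficient sits at a local extremum of the density, which is the stopping condition that hill climbing looks for.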