CISD 794 - PowerPoint PPT Presentation

Provided by: gilbertod
Learn more at: http://scis.nova.edu

1
CISD 794 Knowledge Discovery in Databases
Prof. Junping Sun
  • DENCLUE - Clustering DNA Sequences using FFT
    (Fast Fourier Transform)
  • Gilberto R dos Santos

2
Clustering
  • Cluster analysis is an unsupervised learning
    method and a cornerstone of intelligent data
    analysis.
  • Some DNA sequences reach several million base
    pairs. A search method that looks for a direct
    hit by comparing two DNA sequences base by base
    may have a cost that grows as O(n^2), where n is
    the number of bases in the DNA sequence. Several
    methods are being combined to optimize sequence
    search and matching. One initial approach is to
    build clusters of DNA sequences.

3
Functions in Time or Frequency
  • A physical process can be described either in the
    time domain, by the values of some quantity h as
    a function of time t, e.g., h(t), or in the
    frequency domain, where the process is specified
    by giving its amplitude H (generally a complex
    number that also indicates phase) as a function
    of frequency f, that is H(f), with
    $-\infty < f < \infty$.
  • h(t) and H(f) are two different representations
    of the same function.
  • $H(f) = \int_{-\infty}^{\infty} h(t)\, e^{2\pi i f t}\, dt$
  • $h(t) = \int_{-\infty}^{\infty} H(f)\, e^{-2\pi i f t}\, df$
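
As a discrete illustration of the transform pair above, the following minimal sketch applies a forward and an inverse transform to a short sampled signal and checks that the samples are recovered. The function names `dft` and `idft` and the sample values are illustrative assumptions; the sign convention (positive exponent in the forward transform) follows the slide.

```python
import cmath

def dft(h):
    """Forward transform: H[k] = sum_n h[n] * exp(+2*pi*i*k*n/N)."""
    n_samples = len(h)
    return [
        sum(h[n] * cmath.exp(2j * cmath.pi * k * n / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]

def idft(H):
    """Inverse transform: h[n] = (1/N) * sum_k H[k] * exp(-2*pi*i*k*n/N)."""
    n_samples = len(H)
    return [
        sum(H[k] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for k in range(n_samples)) / n_samples
        for n in range(n_samples)
    ]

# Hypothetical samples of h(t); any real or complex values would do.
signal = [0.0, 1.0, 0.0, -1.0, 0.5, 2.0, -0.5, 1.5]
recovered = idft(dft(signal))

# Largest round-trip error; it should be at floating-point noise level.
max_error = max(abs(a - b) for a, b in zip(signal, recovered))
```

This direct form costs O(N^2) operations; the FFT computes the same coefficients in O(N log N).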

4
FFT Fast Fourier Transform
  • Fourier Equation
  • Where An and Bn are
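
For reference, the standard textbook Fourier series of a function h(t) with period T is given below, with A_n the cosine coefficients and B_n the sine coefficients. The period symbol T and the exact normalization are assumptions; this is the common form and is not confirmed to match the slide's own notation.

```latex
h(t) = \frac{A_0}{2} + \sum_{n=1}^{\infty}
       \left[ A_n \cos\!\left(\frac{2\pi n t}{T}\right)
            + B_n \sin\!\left(\frac{2\pi n t}{T}\right) \right]

A_n = \frac{2}{T} \int_{0}^{T} h(t) \cos\!\left(\frac{2\pi n t}{T}\right) dt,
\qquad
B_n = \frac{2}{T} \int_{0}^{T} h(t) \sin\!\left(\frac{2\pi n t}{T}\right) dt
```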

5
DNA Sequence PDB 1b05
6
DNA Sequence PDB 1b25
7
DNA and FFT (DFT)
  • The DFT (Discrete Fourier Transform), computed
    efficiently with the FFT, can be used to define a
    wave that is the closest representation of a DNA
    sequence function h(t). The DFT can reduce the
    search time of range queries, either in
    conjunction with any other existing method or as
    a filtration technique. The challenge of this
    research is to define how the DFT (FFT) can
    represent a DNA sequence.
  • A proximity search seeks the sequences that are
    close enough to a given query sequence, either
    through direct alignment or by using other
    heuristics. The alignment of biological sequences
    (pairwise or multiple alignment) places
    nucleotide or amino acid residues in columns so
    as to infer the closest common ancestral
    relationships. The best alignment usually refers
    to the one that demonstrates the most likely
    evolutionary scenario.
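
The filtration idea can be sketched as follows, under several assumptions that the slides do not state: bases are mapped to the numbers 1 to 4 (`BASE_CODE` is hypothetical), all sequences have equal length, and the DFT is scaled by 1/sqrt(N). With that scaling the distance computed on the first few coefficients never exceeds the full Euclidean distance, so the filter can discard sequences without losing a true match. All function names are illustrative.

```python
import cmath
import math

# Hypothetical numeric encoding of bases; the slides do not specify one.
BASE_CODE = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}

def encode(sequence):
    """Turn a DNA string into a list of numbers."""
    return [BASE_CODE[base] for base in sequence]

def dft_orthonormal(values):
    """DFT scaled by 1/sqrt(N), which preserves Euclidean distance (Parseval)."""
    n = len(values)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(v * cmath.exp(2j * cmath.pi * k * t / n)
                    for t, v in enumerate(values))
        for k in range(n)
    ]

def euclidean(a, b):
    """Euclidean distance between two real or complex vectors."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

def filter_candidates(query, database, radius, n_coefficients=3):
    """Keep sequences whose leading DFT coefficients lie within `radius`.

    Survivors are only candidates: each still needs an exact comparison.
    """
    q = dft_orthonormal(encode(query))[:n_coefficients]
    return [
        s for s in database
        if euclidean(q, dft_orthonormal(encode(s))[:n_coefficients]) <= radius
    ]

# Hypothetical toy database of equal-length sequences.
database = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "ACGAACGT"]
candidates = filter_candidates("ACGTACGT", database, radius=2.0)
```

Because the truncated distance is only a lower bound, the candidate list can contain false positives, which the exact comparison removes afterwards.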

8
Wave Composition
9
Data Structures Used
  • Data matrix (two modes)
  • Dissimilarity matrix (one mode)

10
Euclidean Distance
  • The Euclidean distance for the matrix d(i,j) is
    computed per axis:
  • $D_x(i,j) = \sqrt{(X_{i1} - X_{j1})^2 + (X_{i2} - X_{j2})^2 + \dots + (X_{ip} - X_{jp})^2}$
  • $D_y(i,j) = \sqrt{(Y_{i1} - Y_{j1})^2 + (Y_{i2} - Y_{j2})^2 + \dots + (Y_{ip} - Y_{jp})^2}$
  • $D_z(i,j) = \sqrt{(Z_{i1} - Z_{j1})^2 + (Z_{i2} - Z_{j2})^2 + \dots + (Z_{ip} - Z_{jp})^2}$
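
A minimal sketch of the distance above, applied to one axis: it takes a data matrix (n objects by p values) and produces the n by n dissimilarity matrix. The function names and the sample values are illustrative assumptions.

```python
import math

def euclidean_distance(row_i, row_j):
    """sqrt of the summed squared differences over the p values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row_i, row_j)))

def dissimilarity_matrix(data):
    """Build the n x n matrix d(i, j) from an n x p data matrix."""
    n = len(data)
    return [[euclidean_distance(data[i], data[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical data matrix for the X axis: 3 objects, p = 2 values each.
x_data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
d_x = dissimilarity_matrix(x_data)
```

The result is symmetric with a zero diagonal, so only the lower triangle needs to be stored in practice.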

11
Euclidean Distance
    cursor c1 is select * from DNA_SEQUENCE_FREQ_XY
      where frequency > 0 for update order by 1;
    cursor c2 is select * from DNA_SEQUENCE_FREQ_YZ
      where frequency > 0 for update order by 1;
    cursor c3 is select * from DNA_SEQUENCE_FREQ_ZX
      where frequency > 0 for update order by 1;
    d_v_dist_a number := 0;
    d_v_dist_b number := 0;
    v_dist_a   number := 0;
    v_dist_b   number := 0;
    r2   dna_variance_coef_xy%rowtype;
    rec3 dna_variance_coef_yz%rowtype;
    rec4 dna_variance_coef_zx%rowtype;
    begin
      for r1 in c1 loop
        -- variance row for the current frequency
        select * into r2
          from dna_variance_coef_xy where frequency = r1.frequency;
        begin
          select /*+ INDEX ( DNA_SEQUENCE_FREQ_XY, DNA_SEQUENCE_FREQ_XY_IDX1 ) */
            sum( (r1.ACOEF - ACOEF) * exp( -1 * ( power( (r1.ACOEF - ACOEF), 2 ) / r2.var_acoef ) ) ),
            sum( (r1.BCOEF - BCOEF) * exp( -1 * ( power( (r1.BCOEF - BCOEF), 2 ) / r2.var_bcoef ) ) )
          into v_dist_a, v_dist_b

12
Euclidean Distance - sine coefficient Function
X ? Y

13
Euclidean Distance - cosine coefficient
Function (X) ? Y
14
DENCLUE Clustering Based on Density
Distribution Functions
  • The influence of each data point can be formally
    modeled using a mathematical function (the
    influence function) that describes the impact of
    a data point within its neighborhood.
  • The overall density of the data space can be
    modeled analytically as the sum of the influence
    functions of all data points.
  • Clusters can then be determined mathematically by
    identifying density attractors, where density
    attractors are local maxima of the overall
    density function.
  • The distance function d(X,Y) should be reflexive
    and symmetric, e.g., the Euclidean distance
    function.
  • Density Attractor / Density-Attracted Points
  • - a density attractor is a local maximum of the
    density function
  • - density-attracted points are determined by a
    gradient-based hill-climbing method
  • Center-Defined Cluster
  • A center-defined cluster with density attractor
    x* is the subset of the database which is
    density-attracted by x*.
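
The influence and density functions described above can be sketched as follows, assuming the Gaussian influence function with smoothing parameter `sigma` from the Hinneburg and Keim paper listed in the references. The function names and the sample points are illustrative.

```python
import math

def squared_distance(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def gaussian_influence(x, y, sigma):
    """Influence of data point y at position x: exp(-d(x, y)^2 / (2 sigma^2))."""
    return math.exp(-squared_distance(x, y) / (2.0 * sigma ** 2))

def density(x, data, sigma):
    """Overall density at x: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, point, sigma) for point in data)

# Hypothetical 2-D data: three close points and one isolated point.
data = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0)]
dense = density((1.0, 1.0), data, sigma=0.5)    # inside the group
sparse = density((3.0, 3.0), data, sigma=0.5)   # empty region between groups
```

A larger `sigma` widens each point's neighborhood and smooths the density, which tends to merge nearby maxima into fewer attractors.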

15
DNA Sequence 1b5s ALA Residue
16
DENCLUE: Technical Essence
  • Uses grid cells but only keeps information about
    grid cells that do actually contain data points
    and manages these cells in a tree-based access
    structure.
  • Influence function describes the impact of a
    data point within its neighborhood.
  • Overall density of the data space can be
    calculated as the sum of the influence function
    of all data points.
  • Clusters can be determined mathematically by
    identifying density attractors.
  • Density attractors are local maxima of the
    overall density function.
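
A minimal sketch of the first point, keeping only populated grid cells: a plain dictionary stands in here for the tree-based access structure, which is a simplification. The cube edge length of 2 sigma follows the DENCLUE paper; the function names and sample points are assumptions.

```python
import math
from collections import defaultdict

def cube_key(point, side):
    """Integer grid coordinates of the cube that contains `point`."""
    return tuple(math.floor(coordinate / side) for coordinate in point)

def build_populated_cubes(points, sigma):
    """Map cube key -> points inside it; empty cubes are never stored."""
    side = 2.0 * sigma  # edge length used in the DENCLUE paper
    cubes = defaultdict(list)
    for point in points:
        cubes[cube_key(point, side)].append(point)
    return dict(cubes)

# Hypothetical 2-D points: two share a cube, one lies far away.
points = [(0.1, 0.2), (0.3, 0.1), (5.0, 5.1)]
cubes = build_populated_cubes(points, sigma=0.5)
```

Memory use grows with the number of populated cubes rather than with the size of the full grid, which is what makes the approach workable in higher dimensions.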

17
Gradient: The steepness of a slope
  • Example
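
For the Gaussian influence function, the gradient of the overall density has a closed form, given in the Hinneburg and Keim paper listed in the references. Here x_1, ..., x_N are the data points and sigma is the smoothing parameter; the vector points toward the region where density rises fastest.

```latex
\nabla f^{D}_{Gauss}(x) = \sum_{i=1}^{N} (x_i - x) \cdot
    \exp\!\left( -\frac{d(x, x_i)^2}{2\sigma^2} \right)
```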

18
Density Attractor Function
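  • A point $x^* \in F^d$ is called a density-attractor
    for a given influence function iff $x^*$ is a local
    maximum of the density function $f^D_B$.
  • A point $x \in F^d$ is density-attracted to a
    density-attractor $x^*$ iff
    $\exists k \in \mathbb{N}: d(x^k, x^*) \le \epsilon$, with
    $x^0 = x$ and
    $x^i = x^{i-1} + \delta \cdot \frac{\nabla f^D_B(x^{i-1})}{\lVert \nabla f^D_B(x^{i-1}) \rVert}$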
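
The hill-climbing procedure behind this definition can be sketched as below, assuming a Gaussian density and a fixed step size `delta`. The climb stops as soon as a step no longer increases the density, and the last accepted position is taken as the attractor. The function names, the sample data and the parameter values are illustrative assumptions.

```python
import math

def influence(x, p, sigma):
    """Gaussian influence of data point p at position x."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)) / (2 * sigma ** 2))

def density(x, data, sigma):
    """Sum of the influences of all data points at x."""
    return sum(influence(x, p, sigma) for p in data)

def gradient(x, data, sigma):
    """Density gradient: sum over data points of (p - x) * influence."""
    grad = [0.0] * len(x)
    for p in data:
        weight = influence(x, p, sigma)
        for axis in range(len(x)):
            grad[axis] += (p[axis] - x[axis]) * weight
    return grad

def find_attractor(start, data, sigma, delta=0.05, max_steps=1000):
    """Follow the normalized gradient uphill in steps of length delta."""
    current = tuple(start)
    current_density = density(current, data, sigma)
    for _ in range(max_steps):
        grad = gradient(current, data, sigma)
        norm = math.sqrt(sum(g * g for g in grad))
        if norm == 0.0:
            break  # exactly on a stationary point
        candidate = tuple(c + delta * g / norm for c, g in zip(current, grad))
        candidate_density = density(candidate, data, sigma)
        if candidate_density <= current_density:
            break  # the step no longer climbs: keep the current position
        current, current_density = candidate, candidate_density
    return current

# Hypothetical data with two well-separated groups.
data = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9)]
start = (1.5, 1.4)
attractor = find_attractor(start, data, sigma=0.5)
```

The fixed step size limits the accuracy of the attractor to roughly `delta`; a smaller value gives a tighter estimate at the cost of more iterations.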

19
Center Defined Clusters
  • A center-defined cluster (with respect to $\sigma$,
    $\xi$) for a density-attractor $x^*$ is a subset
    $C \subseteq D$, with each $x \in C$ being
    density-attracted by $x^*$ and $f^D_B(x^*) > \xi$.
  • Points $x \in D$ are called outliers if they are
    density-attracted by a local maximum $x_0^*$ with
    $f^D_B(x_0^*) < \xi$.

20
DENCLUE Algorithm: Local Density Function
  • Two cubes $c_1, c_2 \in C_p$ are connected if
    $d(mean(c_1), mean(c_2)) < 4\sigma$
  • $near(x) = \{ x_1 \in c_1 \mid d(mean(c_1), x) < k\sigma \}$
    (note: k = 4)
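
A minimal sketch of the two tests on this slide, assuming each cube is represented simply as the list of points it contains. The dictionary layout, the function names and the sample values are illustrative assumptions.

```python
import math

def mean(points):
    """Component-wise mean of the points in a cube."""
    count = len(points)
    return tuple(sum(p[axis] for p in points) / count
                 for axis in range(len(points[0])))

def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def are_connected(cube_a, cube_b, sigma):
    """Two cubes are connected if their means are closer than 4 * sigma."""
    return distance(mean(cube_a), mean(cube_b)) < 4.0 * sigma

def near(x, cubes, sigma, k=4):
    """Points of every cube whose mean is closer to x than k * sigma."""
    return [point
            for cube in cubes
            if distance(mean(cube), x) < k * sigma
            for point in cube]

# Hypothetical populated cubes, keyed by grid coordinates.
populated = {
    (0, 0): [(0.2, 0.3), (0.4, 0.1)],
    (1, 0): [(1.1, 0.2)],
    (9, 9): [(9.5, 9.5)],
}
neighbours = near((0.5, 0.5), populated.values(), sigma=0.5)
```

Restricting the density sum to `near(x)` is what makes the density local: distant cubes contribute almost nothing to a Gaussian sum and can be skipped.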

21
Density Attractors
22
DENCLUE Algorithm (cont.)
  • After determining the density-attractor $x^*$ for a
    point x, and provided that $f^D_B(x^*) > \xi$,
    the point x is classified and attached to the
    cluster belonging to $x^*$.
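
The classification step can be sketched end to end on one-dimensional data. This is a simplified illustration, not the full algorithm: it uses the global rather than the local density, a fixed step size, and a hypothetical `merge_tolerance` to decide when two climbs reached the same attractor. The threshold `xi`, all names and all values are assumptions.

```python
import math

def density(x, data, sigma):
    """Gaussian density at x, summed over all data points."""
    return sum(math.exp(-((x - p) ** 2) / (2 * sigma ** 2)) for p in data)

def gradient(x, data, sigma):
    """Derivative of the density, up to a positive constant factor."""
    return sum((p - x) * math.exp(-((x - p) ** 2) / (2 * sigma ** 2))
               for p in data)

def find_attractor(x, data, sigma, delta=0.01, max_steps=10000):
    """Step uphill by delta until the density stops increasing."""
    for _ in range(max_steps):
        slope = gradient(x, data, sigma)
        if slope == 0.0:
            break
        candidate = x + (delta if slope > 0 else -delta)
        if density(candidate, data, sigma) <= density(x, data, sigma):
            break
        x = candidate
    return x

def assign_clusters(data, sigma, xi, merge_tolerance=0.1):
    """Attach each point to its attractor's cluster, or mark it an outlier."""
    clusters = []   # list of (attractor, member points)
    outliers = []
    for point in data:
        attractor = find_attractor(point, data, sigma)
        if density(attractor, data, sigma) < xi:
            outliers.append(point)   # attractor too weak: treat as noise
            continue
        for center, members in clusters:
            if abs(center - attractor) <= merge_tolerance:
                members.append(point)
                break
        else:
            clusters.append((attractor, [point]))
    return clusters, outliers

# Hypothetical data: two groups and one isolated value.
data = [1.0, 1.1, 0.9, 1.2, 5.0, 5.1, 4.9, 9.0]
clusters, outliers = assign_clusters(data, sigma=0.3, xi=1.5)
```

Raising `xi` turns weaker groups into outliers, while lowering it lets isolated points form clusters of their own.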

23
Gaussian Distance
    cursor c1 is select * from DNA_SEQUENCE_FREQ_XY
      where frequency > 0 for update order by 1;
    cursor c2 is select * from DNA_SEQUENCE_FREQ_YZ
      where frequency > 0 for update order by 1;
    cursor c3 is select * from DNA_SEQUENCE_FREQ_ZX
      where frequency > 0 for update order by 1;
    d_v_dist_a number := 0;
    d_v_dist_b number := 0;
    v_dist_a   number := 0;
    v_dist_b   number := 0;
    r2   dna_variance_coef_xy%rowtype;
    rec3 dna_variance_coef_yz%rowtype;
    rec4 dna_variance_coef_zx%rowtype;
    begin
      for r1 in c1 loop
        -- variance row for the current frequency
        select * into r2
          from dna_variance_coef_xy where frequency = r1.frequency;
        begin
          select /*+ INDEX ( DNA_SEQUENCE_FREQ_XY, DNA_SEQUENCE_FREQ_XY_IDX1 ) */
            sum( exp( -1 * ( power( (r1.ACOEF - ACOEF), 2 ) / r2.var_acoef ) ) ),
            sum( exp( -1 * ( power( (r1.BCOEF - BCOEF), 2 ) / r2.var_bcoef ) ) )

24
Gaussian Distance - sine coefficient - xy - ALA Residue
25
Gaussian Distance - cosine coefficient - xy - ALA Residue
26
Gradient Distance
    cursor c1 is select * from DNA_SEQUENCE_FREQ_XY
      where frequency > 0 for update order by 1;
    cursor c2 is select * from DNA_SEQUENCE_FREQ_YZ
      where frequency > 0 for update order by 1;
    cursor c3 is select * from DNA_SEQUENCE_FREQ_ZX
      where frequency > 0 for update order by 1;
    d_v_dist_a number := 0;
    d_v_dist_b number := 0;
    v_dist_a   number := 0;
    v_dist_b   number := 0;
    r2   dna_variance_coef_xy%rowtype;
    rec3 dna_variance_coef_yz%rowtype;
    rec4 dna_variance_coef_zx%rowtype;
    begin
      for r1 in c1 loop
        -- variance row for the current frequency
        select * into r2
          from dna_variance_coef_xy where frequency = r1.frequency;
        begin
          select /*+ INDEX ( DNA_SEQUENCE_FREQ_XY, DNA_SEQUENCE_FREQ_XY_IDX1 ) */
            sum( (r1.ACOEF - ACOEF) * exp( -1 * ( power( (r1.ACOEF - ACOEF), 2 ) / r2.var_acoef ) ) ),
            sum( (r1.BCOEF - BCOEF) * exp( -1 * ( power( (r1.BCOEF - BCOEF), 2 ) / r2.var_bcoef ) ) )
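
For readers who do not use PL/SQL, the two aggregate expressions from the Gaussian and gradient fragments can be mirrored in Python as below. The sketch keeps the sign and the plain division by the variance exactly as written in the queries; the coefficient values and the variance are hypothetical, and the function names are illustrative.

```python
import math

def gaussian_sum(value, others, variance):
    """Mirror of sum(exp(-1 * (power(value - other, 2) / variance)))."""
    return sum(math.exp(-((value - other) ** 2) / variance)
               for other in others)

def gradient_sum(value, others, variance):
    """Mirror of sum((value - other) * exp(-1 * (power(value - other, 2) / variance)))."""
    return sum((value - other) * math.exp(-((value - other) ** 2) / variance)
               for other in others)

# Hypothetical A coefficients at one frequency, with an assumed variance.
acoefs = [0.10, 0.12, 0.11, 0.90]
variance = 0.01

crowded = gaussian_sum(0.11, acoefs, variance)    # value with close neighbours
isolated = gaussian_sum(0.90, acoefs, variance)   # value far from the others
balance = gradient_sum(0.11, [0.10, 0.12, 0.11], variance)
```

A high Gaussian sum marks a coefficient that lies in a dense region, and a gradient sum close to zero marks a value sitting near a local maximum of that density.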

27
Gradient Gaussian Distance - Cos - xy - ALA Residue
28
References
  • 1. Flexible Pattern Matching in Strings: Practical
    On-line Search Algorithms for Texts and
    Biological Sequences - Gonzalo Navarro and
    Mathieu Raffinot
  • 2. Numerical Recipes in C - William H. Press,
    Saul A. Teukolsky, William T. Vetterling and
    Brian P. Flannery - http://library.lanl.gov/numerical
  • 3. Introduction to Algorithms - Thomas H. Cormen,
    Charles E. Leiserson and Ronald L. Rivest
  • 4. Introduction to Bioinformatics - Arthur M. Lesk
  • 5. Handbook of Algorithms and Data Structures -
    G.H. Gonnet and R. Baeza-Yates
  • 6. Data Mining: Opportunities and Challenges -
    John Wang
  • 7. Who is Fourier? A Mathematical Adventure -
    Alan Gleason
  • 8. Mathematical Handbook for Scientists and
    Engineers - Granino A. Korn and Theresa M. Korn
  • 9. An Efficient Approach to Clustering in Large
    Multimedia Databases with Noise - Alexander
    Hinneburg and Daniel A. Keim
  • 10. Data Mining: Concepts and Techniques -
    Jiawei Han and Micheline Kamber