cisd 794 – knowledge discovery in databases prof. junping sun

28
CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun DENCLUE - Clustering DNA Sequences using FFT (Fast Fourier Transform ) Gilberto R dos Santos

Upload: hollie

Post on 06-Jan-2016

25 views

Category:

Documents


0 download

DESCRIPTION

CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun. DENCLUE - Clustering DNA Sequences using FFT (Fast Fourier Transform ) Gilberto R dos Santos. Clustering. Cluster analysis is an unsupervised learning method that constitutes a cornerstone of an intelligent data analysis process. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

DENCLUE - Clustering DNA Sequences using FFT (Fast Fourier Transform )

Gilberto R dos Santos

Page 2: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

Clustering

• Cluster analysis is an unsupervised learning method that constitutes a cornerstone of an intelligent data analysis process.

• Some DNA sequences may reach several millions of base pairs. Any search method tried to have a direct hit comparing two DNA sequences may have an algorithm that grows O[n*n], being n the number of bases in the DNA sequence. Several methods are being combined to optimize sequence search and matching. One initial approach is to build clusters of DNA sequences.

Page 3: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

Functions in Time or Frequency

• A physical process can be described either in the time domain, by the values of some quantity h as a function of time t, e.g., h(t), or else in the frequency domain, where the process is specified by giving its amplitude H (generally a complex number indicating phase also) as function of frequency f, that is H(f), with - ∞ < f < + ∞

• h(t) and H(f) are two different representations of the same function.

• H(f) = h(t) e ** ( 2* π * i * f * t ) dt• h(t) = H(f) e ** ( -2 * π * i * f * t ) dt

Page 4: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

FFT –Fast Fourier Transform

Fourier Equation

Where An and Bn are

Page 5: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

DNA –Sequence – PDB: 1b05

Page 6: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

DNA –Sequence – PDB: 1b25

Page 7: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

DNA and FFT (DFT)

• DFT /FFT (Discrete Fourier Transformation) can be used to define a wave that is the closest representation of a DNA sequence function h(t). DFT transformation technique can be used to reduce the search time of range queries. This method can be used in conjunction with any other existing method, or can be used as a filtration technique. The challenge of this research is to define how DFT (FFT) can represent a DNA sequence. The proximity search seeks the sequences close enough to a given query sequence either through direct alignment or using other heuristics. The alignment of biological sequences ( pairwise or multiple alignment) is the operation to place nucleotide or amino acid residues in columns inferring the closest common ancestral relationships. The best alignment usually refers to the one

demonstrating the most likely evolutionary scenario.

Page 8: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

Wave Composition

Page 9: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

Data Structures Used

• Data matrix– (two modes)

• Dissimilarity matrix– (one mode)

npx...nfx...n1x

...............ipx...ifx...i1x

...............1px...1fx...11x

0...)2,()1,(

:::

)2,3()

...ndnd

0dd(3,1

0d(2,1)

0

Page 10: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

Euclidean Distance

• The Euclidean Distance for the matrix d(i,j) is•  

• Dx (i,j) = sqrt ( (Xi1 – Xj1 ) ** 2 + (Xi2 – Xj2 ) ** 2 + .. + ( Xip – Xjp )** 2 )

•  

• Dy (i,j) = sqrt ( (Yi1 – Yj1 ) ** 2 + (Yi2 – Yj2 ) ** 2 + .. + ( Yip – Yjp )** 2 )

•  

• Dz (i,j) = sqrt ( (Zi1 – Zj1 ) ** 2 + (Zi2 – Zj2 ) ** 2 + .. + ( Zip – Zjp )** 2 )

Page 11: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

Euclidean Distancecursor c1 is select * from DNA_SEQUENCE_FREQ_XY where frequency > 0 for update order by 1;

cursor c2 is select * from DNA_SEQUENCE_FREQ_YZ where frequency > 0 for update order by 1;

cursor c3 is select * from DNA_SEQUENCE_FREQ_ZX where frequency > 0 for update order by 1;

d_v_dist_a number:=0;

d_v_dist_b number:=0;

v_dist_a number:=0;

v_dist_b number:=0;

r2 dna_variance_coef_xy%rowtype;

rec3 dna_variance_coef_yz%rowtype;

rec4 dna_variance_coef_zx%rowtype;

begin

for r1 in c1 loop

select * into r2

from dna_variance_coef_xy where frequency = r1.frequency ;

begin

select /*+ INDEX ( DNA_SEQUENCE_FREQ_XY, DNA_SEQUENCE_FREQ_XY_IDX1 ) */

sum( ( r1.ACOEF - ACOEF ) * exp ( -1 * ( power ( (r1.ACOEF - ACOEF) ,2) / r2.var_acoef )) ),

sum( ( r1.BCOEF - BCOEF ) * exp ( -1 * ( power ( (r1.BCOEF - BCOEF) ,2) / r2.var_bcoef )) )

into v_dist_a,v_dist_b

from DNA_SEQUENCE_FREQ_XY where frequency = r1.frequency and pdb_id != r1.pdb_id;

update DNA_SEQUENCE_FREQ_XY set gradient_ACOEF = round(v_dist_a), gradient_BCOEF = round(v_dist_b)

where current of c1;

exception when others then

null;

end;

end loop;

commit;

Page 12: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

Euclidean Distance - sine coefficient – Function [X] Y

Page 13: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

Euclidean Distance - cosine coefficient – Function (X) Y

Page 14: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

DENCLUE – Clustering Based on Density Distribution Functions

The influence of each data point can be formally modeled using a mathematical function (influence function), that describes the impact of a data point within its neighborhood.

The overall density of the data space can be modeled analytically as the sum of the influence functions of all data points

Clusters can then be determined mathematically by identifying density atractors, where density atractors are local maxima of the overall density functions.

d(X,Y) should be reflexive and symetric Euclidean distance Function.

Density Attractor/Density-Attracted Points

-local maximum of the density function

-density-attracted points are determined by a gradient-based hill-climbing method

Center-Defined Cluster

A center-defined cluster with density-attractor x* ( ) is the subset of the database which is density-attracted by x*.

Page 15: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

DNA Sequence – 1b5s ALA Residue

Page 16: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

Denclue: Technical Essence• Uses grid cells but only keeps information about grid cells that do

actually contain data points and manages these cells in a tree-based access structure.

• Influence function: describes the impact of a data point within its neighborhood.

• Overall density of the data space can be calculated as the sum of the influence function of all data points.

• Clusters can be determined mathematically by identifying density attractors.

• Density attractors are local maximal of the overall density function.

Page 17: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

Gradient: The steepness of a slope

• Example

N

i

xxdD

Gaussian

i

exf1

2

),(2

2

)(

N

i

xxd

iiD

Gaussian

i

exxxxf1

2

),(2

2

)(),(

f x y eGaussian

d x y

( , )( , )

2

22

Page 18: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

Density Attractor Function

A point x* is called a density-attractor for a given influence function, iff x* a local maximum of the density-function

A point x

is density-attracted to a density attractor x*,

iff k N : d ( , x*) <= with

Page 19: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

Center Defined Clusters

A center-defined cluster (wrt to , ) for a density attractor x* is a subset C D, with x C being density-attracted by x*

and (x*) >= . Points x D are called outliers if they are density-attracted by a local maximum

with

( ) <

Page 20: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

DENCLUE AlgorithmLocal Density Function

Two cubes c1, c2 Cp are connected if

d(mean(c1); mean(c2)) <= 4

near(x) = { x1 c1 | d(mean(c1),x) < k

(note: k

Page 21: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

Density Attractors

Page 22: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

DENCLUE Algorithm (cont.)

After determining the density-attractor x* for a point x and

the point x is classified and attached to the cluster belonging to x*.

Page 23: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

cursor c1 is select * from DNA_SEQUENCE_FREQ_XY where frequency > 0 for update order by 1;

cursor c2 is select * from DNA_SEQUENCE_FREQ_YZ where frequency > 0 for update order by 1;

cursor c3 is select * from DNA_SEQUENCE_FREQ_ZX where frequency > 0 for update order by 1;

d_v_dist_a number:=0;

d_v_dist_b number:=0;

v_dist_a number:=0;

v_dist_b number:=0;

r2 dna_variance_coef_xy%rowtype;

rec3 dna_variance_coef_yz%rowtype;

rec4 dna_variance_coef_zx%rowtype;

begin

for r1 in c1 loop

select * into r2

from dna_variance_coef_xy where frequency = r1.frequency ;

begin

select /*+ INDEX ( DNA_SEQUENCE_FREQ_XY, DNA_SEQUENCE_FREQ_XY_IDX1 ) */

sum(exp ( -1 * ( power ( (r1.ACOEF - ACOEF) ,2) / r2.var_acoef )) ),

sum( exp ( -1 * ( power ( (r1.BCOEF - BCOEF) ,2) / r2.var_bcoef )) )

into v_dist_a,v_dist_b

from DNA_SEQUENCE_FREQ_XY where frequency = r1.frequency and pdb_id != r1.pdb_id;

update DNA_SEQUENCE_FREQ_XY set GAUSSIAN_ACOEF = round(v_dist_a), GAUSSIAN_BCOEF = round(v_dist_b)

where current of c1;

exception when others then

null;

end;

end loop;

commit;

end;

/

Gaussian Distance

Page 24: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

Gaussian Distance - sine coefficient xy ALA Residue

Page 25: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

Gaussian Distance - cosine coefficient xy ALA Residue

Page 26: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

Gradient Distancecursor c1 is select * from DNA_SEQUENCE_FREQ_XY where frequency > 0 for update order by 1;cursor c2 is select * from DNA_SEQUENCE_FREQ_YZ where frequency > 0 for update order by 1;cursor c3 is select * from DNA_SEQUENCE_FREQ_ZX where frequency > 0 for update order by 1;d_v_dist_a number:=0;d_v_dist_b number:=0;v_dist_a number:=0;v_dist_b number:=0;r2 dna_variance_coef_xy%rowtype;rec3 dna_variance_coef_yz%rowtype;rec4 dna_variance_coef_zx%rowtype;begin for r1 in c1 loop select * into r2 from dna_variance_coef_xy where frequency = r1.frequency ; begin select /*+ INDEX ( DNA_SEQUENCE_FREQ_XY, DNA_SEQUENCE_FREQ_XY_IDX1 ) */ sum( ( r1.ACOEF - ACOEF ) * exp ( -1 * ( power ( (r1.ACOEF - ACOEF) ,2) / r2.var_acoef )) ), sum( ( r1.BCOEF - BCOEF ) * exp ( -1 * ( power ( (r1.BCOEF - BCOEF) ,2)/ r2.var_bcoef )) ) into v_dist_a,v_dist_b from DNA_SEQUENCE_FREQ_XY where frequency = r1.frequency and pdb_id != r1.pdb_id; update DNA_SEQUENCE_FREQ_XY set gradient_ACOEF = round(v_dist_a), gradient_BCOEF =

round(v_dist_b) where current of c1; exception when others then null; end;end loop;commit;

Page 27: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

Gradient Gaussian Distance Cos xy ALA Residue

Page 28: CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

References

1. Flexible Pattern Matching in Strings: Practical On-line

Search Algorithms for Texts and Biological Sequences

Gonzalo Navarro and Mathieu Raffinot 

2. Numerical Recipes in C –

William H. Press / Saul A. Teukolsky / William T. Vetterling /

Brian P. Flannery - http://library.lanl.gov/numerical

3. Introduction to Algorithms –

Thomas H. Cormen / Charles E. Leiserson / Ronald L. Rivest

4. Introduction to Bioinformatics - Arthur M. Lesk

5. Handbook of Algorithms and Data Structures,

G.H. Gonnet & R. Baeza-Yates.

6.  Data Mining: Opportunities and Challenges - John Wang

7. Who is Fourier? A Mathematical Adventure –Alan Gleason

8. Mathematical Handbook for Scientists and Engineers –

Granino A. Korn and Theresa M. Korn

9. An Efficient Approach to Clustering in Large Multimedia Databases with Noise

Alexander Hinneburg, Daniel A. Keim

10. Data Mining: Concepts and Techniques

Jiawei Han and Micheline Kamber