digital signal processing techniques for protein- coding regions identification of rheumatic...



International Journal of Computer Trends and Technology, volume 4, Issue 3, 2013

ISSN: 2231-2803 http://www.internationaljournalssrg.org Page 436

Digital Signal Processing Techniques for Protein-Coding Regions Identification of Rheumatic Arthritis (RA) Disease

Dr. K. B. Ramesh (1), Prabhu Shankar K. S. (2), Dr. B. P. Mallikarjunaswamy (3), Dr. E. T. Puttaiah (4)

(1) Associate Professor, Dept. of Instrumentation Technology, R.V. College of Engineering, Bangalore, India.
(2) Biomedical Signal Processing and Instrumentation, I.T. Dept., R.V. College of Engineering, Bangalore, India.
(3) Professor, Department of Computer Science and Engineering, SSIT, Tumkur, Karnataka, India.
(4) Professor, Dept. of Environmental Science, Vice-Chancellor, Gulbarga University, Karnataka, India.

Abstract: Rheumatic arthritis is a chronic autoimmune disease that ultimately affects a person's ability to carry out everyday tasks. The prediction of genes involved in RA is an important application in bioinformatics. The locations and lengths of the regions in a genomic sequence play a prominent role in predicting the exons, and hence in analyzing the proteins they encode. DSP tools have been applied in this field based on the observation that coding regions have a prominent period-3 spectral peak at frequency f = 1/3 due to the presence of codons (triplets of nucleotides), while non-coding regions lack such a prominent peak. This paper reviews several digital signal processing techniques that lead to improved computational methods for solving useful problems in genomic information science and technology.

Keywords: DNA, Rheumatic arthritis [RA], Gene prediction, Period-3 periodicity, Digital signal processing techniques.

I.  INTRODUCTION

Bioinformatics is a new, growing area of science that uses computational approaches to answer biological questions. With the explosion of sequence and structural information available to researchers, the field is playing an increasingly large role in the study of fundamental biomedical problems. In all areas of biological and medical research, the role of the computer has been dramatically enhanced over the last five to ten years. While the first wave of computational analysis focused on sequence analysis, where many highly important unsolved problems still remain, current and future needs will in particular concern the sophisticated integration of extremely diverse sets of data. These novel types of data originate from a variety of experimental techniques, many of which are capable of data production at the level of entire cells, organs, organisms, or even populations. The main driving force behind these changes has been the advent of new, efficient experimental techniques, primarily DNA sequencing, that have led to an exponential growth of linear descriptions of protein, DNA and RNA molecules.

The Bioinformatics Toolbox offers computational molecular biologists and other research scientists an open and extensible environment in which to explore ideas, prototype new algorithms, and build applications in drug research, genetic engineering, and other genomics and proteomics projects. The toolbox provides access to genomic and proteomic data formats, analysis techniques, and specialized visualizations for genomic and proteomic sequence and microarray analysis. Most functions are implemented in the open MATLAB language, so users can customize the algorithms or develop their own.

Rheumatoid arthritis is a chronic form of inflammatory arthritis that may affect many tissues and organs, but principally attacks flexible (synovial) joints.

Fig. 1 Joint affected by RA

Rheumatoid arthritis can also produce diffuse inflammation in the lungs, the membrane around the heart, the membranes of the lung (pleura), and the white of the eye, as well as nodular lesions, most commonly in subcutaneous tissue. Rheumatoid arthritis may affect many different joints and cause damage to cartilage, tendons and ligaments; it can even wear away the ends of the bones. One common outcome is joint deformity and disability. Some people with RA develop rheumatoid nodules,


lumps of tissue that form under the skin, often over bony areas exposed to pressure. These occur most often around the elbows but can be found elsewhere on the body, such as on the fingers, over the spine or on the heels. Over time, the inflammation that characterizes RA can also affect numerous organs and internal systems.

A DNA sequence is made from an alphabet of four elements, namely A, C, G and T (adenine, cytosine, guanine and thymine, respectively). These letters represent molecules called nucleotides or bases. A large number of functions in living organisms are governed by proteins, which are sequences of amino acids. Since there are 64 possible codons but only 20 amino acids, the mapping from codons to amino acids is many-to-one. The introns do not participate in protein synthesis. Gene identification is a very complex problem, and identifying period-3 regions is only one step towards gene and exon identification, that is, towards distinguishing protein-coding from non-protein-coding regions. Each time the sliding window of length L is shifted by one or more base positions, the power spectrum at DFT index L/3 (frequency f = 1/3) is computed. After the period-3 property has been extracted, sequences are classified into exon (protein-coding) and intron (non-protein-coding) regions using a threshold or a learning process. Thresholding is one of the major challenges in this field, since the optimal threshold value can differ from one sequence to another. According to [6], the complementary strands are statistically symmetric. Thus, a non-coding region on the 5'-3' strand will have the same period-3 property as the coding region on the complementary strand. This means that the period-3 property still holds in prokaryotes and can be used to detect coding regions.

As shown in Fig. 2, a DNA sequence can be divided into genes and intergenic spaces. The genes are responsible for protein synthesis. A gene can be divided into two kinds of sub-regions, called exons and introns. Only the exons are involved in protein coding. The bases in the exon regions can be imagined as divided into groups of three adjacent bases.

Fig. 2 Various regions in a DNA molecule

Each triplet is called a codon. Scanning the gene from left to right, a codon sequence can be defined by concatenating the codons in all the exons. Each codon instructs the cell machinery to synthesize an amino acid. The codon sequence therefore uniquely identifies an amino acid sequence, which defines a protein. It is known that exons (coding regions) are rich in nucleotides C and G whereas introns (non-coding regions) are rich in nucleotides A and T, and that protein-coding regions of nucleotide sequences exhibit a period-3 property, which likely results from the three-base length of the codons used to generate amino acids.

The process is divided into four steps, as shown in Fig. 3. The first step is to convert the DNA sequence into a numerical sequence, since DSP tools can only handle numerical entities. The next step is to choose a window, so that a long sequence is divided into frames and only a specific length of sequence is processed at a time. The window is then slid over the whole sequence. It has been shown that the choice of window and its length affect the prediction results. In addition, researchers interested in this field can address the weaknesses of their methods and work on improving the efficiency of each component of the process. The third stage is the most important stage of the process: here the period-3 component is extracted in order to discriminate protein-coding regions.
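The first three steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in any of the reviewed papers: it assumes the binary indicator-sequence mapping (one 0/1 sequence per base), a rectangular window, and the function names `indicator_sequences` and `period3_power`, all of which are choices made here for illustration. The default window of 351 bases is the window size mentioned later in this review.

```python
import cmath

BASES = "ACGT"

def indicator_sequences(dna):
    """Step 1: map DNA to four binary sequences (1 where the base occurs)."""
    return {b: [1 if ch == b else 0 for ch in dna] for b in BASES}

def period3_power(dna, window=351, step=1):
    """Steps 2-3: slide a rectangular window and sum, over the four bases,
    the squared DFT magnitude at frequency f = 1/3."""
    seqs = indicator_sequences(dna.upper())
    # exp(-j*2*pi*n/3) takes only three distinct values, so precompute them.
    twiddles = [cmath.exp(-2j * cmath.pi * r / 3) for r in range(3)]
    powers = []
    for start in range(0, len(dna) - window + 1, step):
        total = 0.0
        for seq in seqs.values():
            acc = sum(seq[start + n] * twiddles[n % 3] for n in range(window))
            total += abs(acc) ** 2
        powers.append(total)
    return powers

if __name__ == "__main__":
    periodic = period3_power("ATG" * 40, window=30)
    non_periodic = period3_power("ACGT" * 30, window=30)
    print(periodic[0], non_periodic[0])
```

A strictly period-3 toy sequence gives a much larger value than a period-4 one. The fourth step, classification, would compare each value in `powers` against a threshold, whose choice is sequence dependent as noted earlier.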

Fig. 3 Functional Block Diagram

Organization of the paper: Each type of technique is reviewed in Sections II, III, IV, V and VI. The conclusion is drawn in Section VII. The references are listed in Section VIII.

II.  METHODS AND RESULTS

P. P. Vaidyanathan and Byung-Jun Yoon [2] proposed a digital filtering technique for gene prediction. Here the sliding window is treated as a filter whose impulse response is shown in Fig. 4.

Fig. 4 Impulse response of the band-pass filter.

An efficiently designed filter can reduce the computational complexity and can also separate period-3 behaviour from background information, such as 1/f noise, more effectively. Here the band-pass


filter has a passband centred at ω0 = 2π/3 and a minimum stopband attenuation of about 13 dB, as shown in Fig. 4 above. The digital filter H(z) takes the indicator sequence xG(n) as its input, and yG(n) denotes its output, as indicated in Fig. 5 below.

Fig. 5 Digital filter H(z) with frequency response H(e^{jω})

The narrow-band filter H(z) [1] is regarded as an antinotch IIR filter for gene prediction. The gene F56F11.4 on chromosome III of C. elegans was used for the analysis. This gene has five exons, as depicted in Fig. 6 below.

Fig. 6 Output obtained from Digital filter 
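To make the idea of a narrow-band filter centred at 2π/3 concrete, the sketch below uses a simple two-pole resonator with poles at r·e^{±j2π/3}. This is not the antinotch design of [1, 2], whose coefficients are not given in this review; the transfer function H(z) = 1 / (1 + r z^-1 + r^2 z^-2), the pole radius r = 0.992 and the function name `resonator_filter` are assumptions chosen only for illustration.

```python
def resonator_filter(x, r=0.992):
    """Filter x with H(z) = 1 / (1 + r z^-1 + r^2 z^-2).

    The denominator factors as (1 - r e^{j2pi/3} z^-1)(1 - r e^{-j2pi/3} z^-1),
    because 2*cos(2*pi/3) = -1, so the poles sit at angle +-2*pi/3.
    """
    y = []
    y1 = y2 = 0.0  # previous two output samples
    for sample in x:
        out = sample - r * y1 - r * r * y2
        y.append(out)
        y2, y1 = y1, out
    return y

if __name__ == "__main__":
    # One indicator sequence with period 3 and one with period 4.
    x3 = [1.0 if n % 3 == 0 else 0.0 for n in range(900)]
    x4 = [1.0 if n % 4 == 0 else 0.0 for n in range(900)]
    energy = lambda seq: sum(v * v for v in seq[-120:])
    print(energy(resonator_filter(x3)), energy(resonator_filter(x4)))
```

The period-3 input produces far more output energy than the period-4 input, which is the property the filtering approach relies on. Moving r closer to 1 narrows the passband but lengthens the transient.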

III.  CORRELATING AND FILTERING APPROACH

Lun Huang, Mohammad Al Bataineh and G. E. Atkin [3] introduced a novel gene detection method based on the period-3 property, which uses correlation [13] and filtering techniques. Here the genome sequence, represented with the complex poly-phase set {1, j, -1, -j}, is correlated with four sequences to predict the exon regions. Each of these four sequences is composed of 3 of the 4 types of bases arranged in a period-3 pattern. The genome sequence X(n) is taken as the input, and the four period-3 sequences are S1(n), S2(n), S3(n) and S4(n).

The outputs after the correlation are

$$Y_i(n) = X(n) * S_i(-n), \qquad i = 1, 2, 3, 4,$$

where * denotes convolution.
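The correlation above can be sketched directly from the formula: convolving X(n) with the time-reversed S_i(-n) is the same as sliding S_i along X and summing the products. The sketch below is only illustrative. The base-to-number mapping is borrowed from the quaternary code quoted in Section VI, the template "ACG" repeated is a hypothetical example of a period-3 sequence built from three of the four bases, and only lags where the template fully overlaps the input are computed. None of these choices is confirmed by [3].

```python
# Assumed mapping onto the poly-phase set {1, j, -1, -j} (see Section VI).
QUATERNARY = {"A": 1, "C": -1, "G": -1j, "T": 1j}

def map_bases(dna):
    """Convert a DNA string to a list of complex numbers."""
    return [QUATERNARY[b] for b in dna.upper()]

def correlate(x, s):
    """Y(n) = sum_m x[n + m] * s[m], i.e. x convolved with the time-reversed s.

    No conjugate is applied, following the formula as written in the text.
    """
    span = len(x) - len(s) + 1
    return [sum(x[n + m] * s[m] for m in range(len(s))) for n in range(span)]

if __name__ == "__main__":
    genome = map_bases("TTACGACGACGACGTT")
    template = map_bases("ACG" * 3)  # hypothetical period-3 template
    print([abs(v) for v in correlate(genome, template)])
```

The four outputs Y1..Y4 would be obtained by calling `correlate` once per template before combining them.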

The maximum ratio combination algorithm [8] is used to combine the four output sequences. Period-3 filtering is then applied with a sliding window. To predict the exons, a window of length L is used for the analysis, where L is chosen to be odd so that the window is symmetric around its centre.

The window length is chosen with the following algorithm.

1. L0 and L1 are initialized to Lx/3, and p0 is set to 2, where L0 is the initial window length, L1 is the current window length, and p0 is the original peak number.
2. A window of length L1 is used to detect the peaks of the filter output sequence f(n).
3. If the number of detected peaks satisfies p1 < p0 + 1, set L1 = L1 − 1 and go to step 2; otherwise, set the minimum value of the variance to Vmin = ∞.
4. For this distribution of p1 peaks, find the peak interval sequence {Ij}, j = 1, 2, …, p1 − 1.
5. Compute the variance of {Ij}:

$$V_1 = \frac{1}{p_1 - 1} \sum_{j=1}^{p_1 - 1} \left( I_j - \bar{I} \right)^2, \qquad \text{where } \bar{I} = \frac{1}{p_1 - 1} \sum_{j=1}^{p_1 - 1} I_j .$$

6. If V1 < Vmin, set Vmin = V1, p0 = p1 and L0 = L1, and go back to step 4; otherwise set L1 = L0.
7. In the same way as in steps 2 and 3, minimize L1 for the peak number p1 = p0, and set L = 2⌈(1/3) Min(L1)⌉ + 1, where Min(·) is the minimization function and ⌈·⌉ denotes the function that rounds its argument up to the next integer.

Applied to both prokaryote and eukaryote genome sequences, this approach gave better results than the DFT and SONF approaches, detecting more coding regions. The output obtained is shown in Fig. 7 below.

Fig. 7 Exons predicted by novel gene prediction method.
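Steps 4 and 5 of the window-length algorithm reduce to computing the spacing between consecutive peaks and the variance of those spacings. A minimal sketch follows, assuming the peak positions have already been detected and that the variance is normalized by the number of intervals; the function name `interval_variance` is hypothetical.

```python
def interval_variance(peaks):
    """Variance of the intervals between consecutive peak positions.

    `peaks` is a sorted list of sample indices at which peaks were detected.
    """
    if len(peaks) < 2:
        raise ValueError("at least two peaks are needed to form an interval")
    intervals = [b - a for a, b in zip(peaks, peaks[1:])]  # {I_j}, j = 1..p1-1
    mean = sum(intervals) / len(intervals)
    return sum((i - mean) ** 2 for i in intervals) / len(intervals)

if __name__ == "__main__":
    print(interval_variance([10, 40, 70, 100]))  # evenly spaced peaks
    print(interval_variance([0, 10, 30]))        # unevenly spaced peaks
```

Evenly spaced peaks give zero variance, so minimizing V1 over window lengths favours the window that produces the most regular peak pattern.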

IV.  HARMONIC SUPPRESSION FILTER.

A disadvantage of the methods introduced in [2,3] is that harmonics of the frequency 2π/3 appear along with the 2π/3 frequency component. These harmonic frequencies give a false measure of the period-3 property by adding peak strength to both exons and introns. To address this, [4] proposed a harmonic suppression (HS) filter and a minimum variance filter. The harmonic suppression filter has dominant zeros at the multiples of the frequency 2π/6, except at 2π/3, where a dominant pole is placed, so that the filter suppresses the components at the harmonic frequencies while passing the component at 2π/3. The peak at 16 kHz is evidence of the pole at the angular frequency 2π/3 radians. A pole-zero plot of the designed filter has poles of magnitude 0.898, 0.898, 0.998 and 0.898 at the angular frequencies ω = 0, 2π/6, 2π/3 and π radians, respectively, and zeros of magnitude 0.998, 0.998, 0.898 and 0.998 at the angular frequencies ω = 0, 2π/6, 2π/3 and π radians,


respectively. Suppressing the higher-period harmonics is not necessary, because their contribution in a window of only 351 samples is negligible, but it is necessary to suppress the harmonics of 2π/3.
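The effect of the pole and zero placement quoted above can be checked numerically by evaluating the magnitude response on the unit circle. The sketch below uses only the four poles and four zeros listed in the text; whether the original design in [4] also includes conjugate roots or a gain factor is not stated here, so this is an assumption-laden illustration rather than the authors' filter.

```python
import cmath
import math

ANGLES = [0.0, 2 * math.pi / 6, 2 * math.pi / 3, math.pi]
POLE_MAGS = [0.898, 0.898, 0.998, 0.898]
ZERO_MAGS = [0.998, 0.998, 0.898, 0.998]

def gain(omega):
    """|H(e^{j omega})| for H(z) = prod(z - zero_k) / prod(z - pole_k)."""
    z = cmath.exp(1j * omega)
    num = den = 1.0
    for ang, pole_mag, zero_mag in zip(ANGLES, POLE_MAGS, ZERO_MAGS):
        num *= abs(z - cmath.rect(zero_mag, ang))
        den *= abs(z - cmath.rect(pole_mag, ang))
    return num / den

if __name__ == "__main__":
    for ang in ANGLES:
        print(round(ang, 3), gain(ang))
```

At ω = 2π/3 the pole lies closest to the unit circle and the gain is large, while at ω = 0, 2π/6 and π the zeros dominate and the gain is small, which matches the behaviour described for the HS filter.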

Fig. 8 shows that the HS filter is able to detect the shorter exons where the antinotch filter failed; however, the problem of spurious peaks in introns remains, because the filter fails to attenuate the complex conjugate harmonic frequency components. These can be suppressed by the minimum variance filter [9].

Fig. 8 Shows the output obtained by HS Filter.

Another approach to gene prediction [4] makes use of the minimum variance (MV) filter, an adaptation of the Maximum Likelihood Method (MLM) developed by Capon for the analysis of two-dimensional power spectral densities. The minimum variance spectrum estimation technique provides the flexibility to minimize the power at the side-lobe frequencies, thus maximizing the power in the main lobe.

The minimum variance spectrum estimation technique involves the following steps:

1) Design a band-pass filter g(n) with centre frequency ω = 2π/3 so that the filter rejects the maximum amount of out-of-band power while passing the component at frequency ω with no distortion.
2) Filter the DNA sequence x(n) with this filter and estimate the power in the output process y(n).

Hence the impulse response of such a filter for a given input sequence is

$$g = \frac{R_x^{-1} e}{e^H R_x^{-1} e}$$

where e^H is the Hermitian (complex conjugate) transpose of the exponential vector e, R_x is the p × p Toeplitz autocorrelation matrix of the samples in the current window, and g is the impulse response of the minimum variance filter with band-pass frequency ω = 2π/3. In summary, the band-pass filter is first designed to reject the maximum amount of out-of-band power around the centre frequency ω = 2π/3 while passing the component at frequency ω with no distortion. The power is then estimated by filtering the DNA sequence x(n) and measuring the output process y(n); the result is shown in Fig. 9 below.
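For reference, the filter expression follows from a standard constrained minimization, as treated in [9]. The derivation below is a sketch in the notation of this section; the explicit form of the exponential vector e and the symbol for the resulting power estimate are assumptions, since the review does not write them out.

```latex
\begin{aligned}
e &= \begin{bmatrix} 1 & e^{j\omega} & \cdots & e^{j(p-1)\omega} \end{bmatrix}^{T},
   \qquad \omega = \frac{2\pi}{3} \\[4pt]
\min_{g}\; & g^{H} R_x\, g
   \quad \text{subject to} \quad g^{H} e = 1
   \qquad \text{(no distortion at } \omega\text{)} \\[4pt]
g &= \frac{R_x^{-1} e}{e^{H} R_x^{-1} e} \\[4pt]
\hat{\sigma}_x^{2}(\omega) &= g^{H} R_x\, g = \frac{1}{e^{H} R_x^{-1} e}
\end{aligned}
```

The last line is the output power of the filter, which serves as the period-3 measure for the current window. Because R_x is re-estimated for each window, the filter adapts to the data.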

Fig.9 shows the predicted exons using MV Filter.

Improved accuracy was achieved with the above approaches by suppressing the harmonic frequencies with the HS filter and the adaptive minimum variance filter.

V.  RECURSIVE WIENER-KHINCHIN THEOREM

[5] explored a comparative approach based on the recursive Wiener-Khinchin theorem (RWKT) [10] for locating protein-coding regions. It is an efficient algorithm that applies the Wiener-Khinchin theorem to a sliding window of the autocorrelation function in order to estimate the power spectral density [14]. With a window of size N, the DFT of each window is calculated from that of the previous N-point window using just one complex multiplication and two real additions. In that paper both the DFT and RWKT were applied to a DNA sequence, and it was shown that RWKT provides a better estimate of the power spectral density than the DFT, but that it is inefficient in terms of computation time.
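The cost of "one complex multiplication and two real additions" per shift is characteristic of a recursive (sliding) DFT update. The sketch below shows that generic recurrence for a single bin of a real-valued input. It is not the RWKT algorithm of [5, 10], which operates on the autocorrelation function; the function names `direct_bin` and `sliding_bin` are hypothetical.

```python
import cmath

def direct_bin(window, k):
    """k-th DFT bin of one window, computed from the definition."""
    n_pts = len(window)
    return sum(x * cmath.exp(-2j * cmath.pi * k * n / n_pts)
               for n, x in enumerate(window))

def sliding_bin(x, n_pts, k):
    """k-th DFT bin of every length-n_pts window of x, updated recursively.

    Recurrence: X_next = (X_prev - x_oldest + x_newest) * exp(j*2*pi*k/n_pts),
    which is two real additions and one complex multiplication for real x.
    """
    twiddle = cmath.exp(2j * cmath.pi * k / n_pts)
    current = direct_bin(x[:n_pts], k)  # only the first window is computed directly
    out = [current]
    for start in range(1, len(x) - n_pts + 1):
        current = (current - x[start - 1] + x[start + n_pts - 1]) * twiddle
        out.append(current)
    return out

if __name__ == "__main__":
    signal = [1.0 if n % 3 == 0 else 0.0 for n in range(30)]
    print([round(abs(v), 6) for v in sliding_bin(signal, 9, 3)])
```

With n_pts a multiple of 3 and k = n_pts / 3, the tracked bin is the period-3 component. In long runs, rounding errors can accumulate in such recursions, so periodic recomputation from the definition may be needed.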

Fig. 10 shows the output of RWKT with good resolution compared to DFT method.

VI.  FFT, FIR AND IIR METHODS

[6] compared the performance of three spectral methods, the FFT, FIR and IIR (DF-based) methods, for the classification of short introns and exons. Computational complexity, which depends on the FFT algorithm used, is taken as a factor in the comparison. The K-Quaternary Code, with the nucleotide-to-numeric mapping C = -1, G = -j, A = 1, T = j, is adopted. Three thresholds are studied: the mid threshold Tm, the proportional threshold Tp, and the cumulative distribution threshold Tc.

The first spectral method, the FFT-based spectral method, uses the Fast Fourier Transform (FFT) to compute the period-3 value of a numerically represented nucleotide sequence with the procedures described in [2]. The second and third spectral methods, the FIR DF-based and IIR DF-based spectral methods, use an FIR or IIR [11] band-pass digital filter with its passband centred at a normalized frequency of 2π/3 radians to compute the period-3 value of a numerically represented nucleotide sequence.

The FFT requires a total of L log2(L) complex-to-complex multiplications and L log2(L) complex additions, or 4L log2(L) real multiplications and 2L log2(L) real additions. An even Nth-order linear-phase FIR digital filter requires a total of L(N/2+1) complex-to-real multiplications and L(N+1) complex additions, or 2L(N/2+1) real multiplications and 2L(N+1) real additions. An Nth-order [12] IIR digital filter requires a total of L(2N+1) complex-to-real multiplications and 2LN complex additions, or 2L(2N+1) real multiplications and 4LN real additions.
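The operation counts quoted above are easy to tabulate for concrete values. The helper functions below simply evaluate the real-operation formulas as stated in the text; the function names and the example values L = 1024, N = 8 (FIR) and N = 2 (IIR) are hypothetical and chosen only to illustrate the comparison.

```python
import math

def fft_ops(L):
    """(real multiplications, real additions) for the FFT, as quoted from [6]."""
    stages = math.log2(L)
    return 4 * L * stages, 2 * L * stages

def fir_ops(L, N):
    """Counts for an even N-th order linear-phase FIR filter."""
    return 2 * L * (N / 2 + 1), 2 * L * (N + 1)

def iir_ops(L, N):
    """Counts for an N-th order IIR filter."""
    return 2 * L * (2 * N + 1), 4 * L * N

if __name__ == "__main__":
    L = 1024
    print("FFT:", fft_ops(L))
    print("FIR, N=8:", fir_ops(L, 8))
    print("IIR, N=2:", iir_ops(L, 2))
```

Because the FFT cost grows with log2(L) while the filter costs grow with the order N, which method is cheaper depends on the values of L and N being compared.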

VII. CONCLUSION.

A current problem in the field of genomic signal processing is the detection of genes in DNA sequences associated with rheumatic arthritis. This paper reviews several DSP techniques proposed so far, with the aim of applying them to an RA sequence to predict the genes responsible for the disease, and then comparing the result with a normal sequence to determine exactly where the mutation takes place in the RA sequence. The results obtained from these techniques, once incorporated in a software module, can be used to characterize the disease, and will support physicians and analysts in diagnosis and in better understanding the development, treatment and prevention of the disease. From the review, RWKT [5] and FFT [6] are the best methods for gene prediction, offering better resolution and fast computation, respectively.

VIII.  REFERENCES.

[1] P. P. Vaidyanathan and B.-J. Yoon, "Digital filters for gene prediction applications," in Proc. Asilomar Conference on Signals, Systems, and Computers, pp. 306-310, Nov. 2002.

[2] P. P. Vaidyanathan and B.-J. Yoon, "Digital filters for gene prediction applications," Proc. Asilomar Conf. Signals Syst. Comput., pp. 306-310, 2002.

[3] Lun Huang, Mohammad Al Bataineh, G. E. Atkin, Siyun Wang, Wei Zhang, "A Novel Gene Detection Method Based on Period-3 Property," 31st Annual International Conference of the IEEE, September 2-6, 2009.

[4] Vikrant Tomar, Dipesh Gandhi, "Digital Signal Processing for Gene Prediction," IEEE, pp. 435-439, 2008.

[5] M. Roy, S. Barman, "Spectral analysis of DNA sequence using Recursive Weiner Khinchine theorem - A comparative approach," 978-1-4577, 2011.

[6] Benjamin Y. M. Kwan, Jennifer Y. Y. Kwan, Hon Keung Kwan, "Spectral Techniques for Classifying Short Exon and Intron Sequences," IEEE, pp. 219-226, 2012.

[7] M. K. Hota, V. K. Srinivasa, "DSP technique for gene and exon prediction taking complex indicator sequence," pp. 354-359, Pune, India, 03-05 January 2009.

[8] Mohammad Al Bataineh, Lun Huang, Ismaeel Muhamed, Nick Menhart, and Guillermo Atkin, "Gene Expression Analysis using Communications, Coding and Information Theory Based Models," BIOCOMP'09, July 13-16, 2009.

[9] M. H. Hayes, "Statistical digital signal processing and modeling," John Wiley & Sons, Inc., USA, 1996.

[10] Khalid M. Aamir and Mohammad A. Maud, "Recursive Weiner Khinchine theorem," World Academy of Science, Engineering and Technology 2, 2007.

[11] I. W. Selesnick, M. Lang, and C. S. Burrus, "Constrained least square design of FIR filters without specified transition bands," IEEE Transactions on Signal Processing, vol. 44, no. 8, pp. 1879-1892, August 1996.

[12] B. Giardine, C. Riemer, R. C. Hardison, R. Burhans, L. Elnitski, P. Shah, Y. Zhang, D. Blankenberg, I. Albert, J. Taylor, W. Miller, W. J. Kent, and A. Nekrutenko, "Galaxy: A platform for interactive large scale genome analysis," Genome Research, vol. 15, issue 10, pp. 1451-1455, 15 October 2005.

[13] H. Herzel, E. N. Trifonov, O. Weiss, and I. Große, "Interpreting correlations in biosequences," Physica A, vol. 249, pp. 449-459, 1998.

[14] J. Tuqan and A. Rashdi, "A DSP approach for finding the codons bias in DNA sequence," IEEE journal on signal processing, vol. 2, no. 3, pp. 343-356, Jun. 2008.