7/27/2019 Digital Signal Processing Techniques for Protein- Coding Regions identification of Rheumatic Arthritis (RA) disease.
http://slidepdf.com/reader/full/digital-signal-processing-techniques-for-protein-coding-regions-identification 1/5
International Journal of Computer Trends and Technology, volume 4, Issue 3, 2013
ISSN: 2231-2803 http://www.internationaljournalssrg.org Page 436
Digital Signal Processing Techniques for Protein-Coding Regions Identification of Rheumatic Arthritis (RA) Disease
Dr. K. B. Ramesh (1), Prabhu Shankar K. S. (2), Dr. B. P. Mallikarjunaswamy (3), Dr. E. T. Puttaiah (4)
(1) Associate Professor, Dept. of Instrumentation Technology, R.V. College of Engineering, Bangalore, India.
(2) Biomedical Signal Processing and Instrumentation, I.T. Dept., R.V. College of Engineering, Bangalore, India.
(3) Professor, Department of Computer Science and Engineering, SSIT, Tumkur, Karnataka, India.
(4) Professor, Dept. of Environmental Science, Vice-chancellor, Gulbarga University, Karnataka, India.
Abstract - Rheumatoid arthritis (RA) is a chronic autoimmune disease that ultimately affects a person's ability to carry out everyday tasks. Predicting the genes involved in RA is an important application of bioinformatics. To analyze proteins, the locations and lengths of genomic sequences play a prominent role in predicting the exons. DSP tools have been applied in this field based on the observation that coding regions have a prominent period-3 spectrum peak at frequency f = 1/3 due to the presence of codons (triplets of nucleotides), while non-coding regions lack such a prominent peak. This paper reviews different digital signal processing techniques that lead to improved computational methods for solving useful problems in genomic information science and technology.
Keywords - DNA, Rheumatoid arthritis (RA), Gene prediction, Period-3 periodicity, Digital signal processing techniques.
I. INTRODUCTION
Bioinformatics is a new, growing area of science that uses computational approaches to answer biological questions. With the explosion of sequence and structural information available to researchers, the field of bioinformatics is playing an increasingly large role in the study of fundamental biomedical problems. In all areas of biological and medical research, the role of the computer has been dramatically enhanced in the last five to ten years. While the first wave of computational analysis focused on sequence analysis, where many highly important unsolved problems still remain, current and future needs will in particular concern the sophisticated integration of extremely diverse sets of data. These novel types of data originate from a variety of experimental techniques, many of which are capable of data production at the level of entire cells, organs, organisms, or even populations. The main driving force behind these changes has been the advent of new, efficient experimental techniques, primarily DNA sequencing, that have led to an exponential growth of linear descriptions of protein, DNA and RNA molecules.
The Bioinformatics Toolbox offers computational molecular biologists and other research scientists an open and extensible environment in which to explore ideas, prototype new algorithms, and build applications in drug research, genetic engineering, and other genomics and proteomics projects. The toolbox provides access to genomic and proteomic data formats, analysis techniques, and specialized visualizations for genomic and proteomic sequence and microarray analysis. Most functions are implemented in the open MATLAB language, so the algorithms can be customized or new ones developed.
Rheumatoid arthritis is a form of inflammatory arthritis. It is a chronic disease that may affect many tissues and organs, but principally attacks flexible (synovial) joints.
Fig.1 Joint affected by RA
Rheumatoid arthritis can also produce diffuse inflammation in the lungs, the membrane around the heart, the membranes of the lung (pleura), and the white of the eye, as well as nodular lesions, most commonly in subcutaneous tissue. Rheumatoid arthritis may affect many different joints and cause damage to cartilage, tendons and ligaments; it can even wear away the ends of the bones. One common outcome is joint deformity and disability. Some people with RA develop rheumatoid nodules,
lumps of tissue that form under the skin, often over bony areas exposed to pressure. These occur most often around the elbows but can be found elsewhere on the body, such as on the fingers, over the spine or on the heels. Over time, the inflammation that characterizes RA can also affect numerous organs and internal systems.
A DNA sequence is made from an alphabet of four elements, namely A, C, G and T (respectively adenine, cytosine, guanine and thymine). The letters A, C, G and T represent molecules called nucleotides or bases. A large number of functions in living organisms are governed by proteins. Proteins are sequences made of amino acids. Since there are 64 possible codons but only 20 amino acids, the mapping from codons to amino acids is many-to-one. The introns do not participate in protein synthesis. Gene identification is a very complex problem, and the identification of period-3 regions is only a step towards gene and exon identification, that is, towards distinguishing protein-coding from non-protein-coding regions. Each time the sliding window is shifted by one or more base positions, the power spectrum is computed at DFT index k = L/3 (frequency 2π/3), where L is the window length. After extracting the period-3 property, sequences are classified into exon (protein-coding) and intron (non-protein-coding) regions using a threshold or a learning process. Thresholding is one of the major challenges in this field, since the optimal threshold value can differ from one sequence to another. According to [6], the complementary strands are statistically symmetric. Thus, the non-coding region for the 5'-3' strand will have the same period-3 property as the coding region. This means that the period-3 property still holds in prokaryotes, and it can be used to detect coding regions. As shown in Fig. 2, a DNA sequence can be divided into genes and intergenic spaces. The genes are responsible for protein synthesis. A gene can be divided into two subregions called exons and introns. Only the exons are involved in protein coding. The bases in the exon region can be imagined to be divided into groups of three adjacent bases.
Fig. 2 Various regions in a DNA molecule
Each triplet is called a codon. Scanning the gene from left to right, a codon sequence can be defined by concatenating the codons in all the exons. Each codon instructs the cell machinery to synthesize an amino acid. The codon sequence therefore uniquely identifies an amino acid sequence, which defines a protein. It is known that exons (coding regions) are rich in nucleotides C and G whereas introns (noncoding regions) are rich in nucleotides A and T, and that protein-coding regions of nucleotide sequences exhibit a period-3 property, which likely results from the three-base length of the codons used to generate amino acids.
The process is divided into four steps, as shown in Fig. 3. The first step converts the DNA sequence into a numerical sequence, since DSP tools can handle only numerical entities. The next step is the choice of a window, so that a long sequence is divided into frames and a specific length of sequence is processed at a time; the window is then slid over the whole sequence. It has been shown that the choice of window and its length affect the prediction results. In addition, researchers interested in this field can address weaknesses of their methods and work on improving the efficiency of each component of the process. The third step is the most important stage of the process: here the period-3 component is extracted to discriminate protein-coding regions. In the fourth step the regions are classified using a threshold or a learning process, as described above.
Fig. 3 Functional Block Diagram
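The four steps can be illustrated with a minimal Python sketch. It assumes binary (Voss-style) indicator sequences for the numerical mapping and a rectangular window, neither of which is prescribed by the text above; the function names and the threshold value are hypothetical. The default window of 351 samples is the size quoted in Section IV.

```python
import cmath


def period3_spectrum(seq, window=351, step=1):
    """Slide a window over seq and return the period-3 power of each frame.

    For every frame the four binary indicator sequences (one per base) are
    projected onto frequency 2*pi/3, i.e. DFT index k = window/3, and the
    squared magnitudes are summed.
    """
    # Step 2: the window length should be a multiple of 3 so that
    # k = window/3 is an integer DFT index.
    twiddle = [cmath.exp(-2j * cmath.pi * m / 3) for m in range(window)]
    spectrum = []
    for start in range(0, len(seq) - window + 1, step):
        frame = seq[start:start + window]
        power = 0.0
        for base in "ACGT":
            # Step 1: indicator x_base(m) = 1 where frame[m] == base, else 0.
            # Step 3: single-bin DFT of that indicator at 2*pi/3.
            x_k = sum(twiddle[m] for m, ch in enumerate(frame) if ch == base)
            power += abs(x_k) ** 2
        spectrum.append(power)
    return spectrum


def classify(spectrum, threshold):
    """Step 4: label each window position as coding (True) or not (False)."""
    return [value > threshold for value in spectrum]


if __name__ == "__main__":
    toy = "ACGT" * 15 + "ATG" * 20   # period-4 stretch, then period-3 stretch
    s = period3_spectrum(toy, window=30)
    print(classify(s, threshold=100.0))  # threshold is an arbitrary toy value
```

On real sequences the threshold would have to be tuned per sequence, which is the difficulty noted in the introduction.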
Organization of the paper: Each type of technique is reviewed in Sections II, III, IV, V, and VI. The conclusion is drawn in Section VII. The references are listed in Section VIII.
II. METHODS AND RESULTS
P. P. Vaidyanathan and Byung-Jun Yoon [2] proposed a digital filtering technique for the prediction of genes. Here the sliding window is regarded as a filter, which has the impulse response shown in Fig. 4.
Fig. 4 Impulse responses of Band pass Filter .
An efficiently designed filter can reduce the computational complexity and can isolate the period-3 behaviour from background information such as 1/f noise more effectively. Here the band-pass
filter has a pass band centred at ω0 = 2π/3 and a minimum stop-band attenuation of about 13 dB, as shown in Fig. 4. Consider a digital filter H(z) with the indicator sequence xG(n) as its input, and let yG(n) denote its output, as indicated in Fig. 5.
Fig. 5 Digital filter H(z) with frequency response H(e^jω)
The narrow-band filter H(z) [1] has been regarded as an antinotch IIR filter for gene prediction, and the gene sequence F56F11.4 in the C. elegans chromosome III was used for the analysis. This gene has five exons, as depicted in Fig. 6.
Fig. 6 Output obtained from Digital filter
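As a rough illustration of an antinotch filter, the sketch below builds a second-order IIR resonator from an allpass section, H(z) = (1 - A(z))/2, and applies it to an indicator sequence. The allpass-based form, the pole radius R = 0.992 and the function names are assumptions made for this example; the section above does not give the filter coefficients.

```python
import cmath
import math

W0 = 2 * math.pi / 3  # centre frequency of the period-3 component


def antinotch_coeffs(radius=0.992, w0=W0):
    """Return (gain, a1, a2) for
    H(z) = gain * (1 - z^-2) / (1 - a1 z^-1 + a2 z^-2)."""
    a1 = (1 + radius ** 2) * math.cos(w0)   # places the peak exactly at w0
    a2 = radius ** 2
    gain = (1 - radius ** 2) / 2            # normalises |H| to 1 at w0
    return gain, a1, a2


def antinotch_filter(x, radius=0.992, w0=W0):
    """Filter the sequence x with the difference equation of H(z)."""
    gain, a1, a2 = antinotch_coeffs(radius, w0)
    y = []
    for n, sample in enumerate(x):
        x2 = x[n - 2] if n >= 2 else 0.0
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(gain * (sample - x2) + a1 * y1 - a2 * y2)
    return y


def magnitude_response(w, radius=0.992, w0=W0):
    """|H(e^{jw})| evaluated directly from the transfer function."""
    gain, a1, a2 = antinotch_coeffs(radius, w0)
    z1 = cmath.exp(-1j * w)
    return abs(gain * (1 - z1 ** 2) / (1 - a1 * z1 + a2 * z1 ** 2))


if __name__ == "__main__":
    # Toy indicator sequence x_G(n): a G at every third position.
    x_g = [1.0 if n % 3 == 0 else 0.0 for n in range(3000)]
    y_g = antinotch_filter(x_g)
    print(max(abs(v) for v in y_g[-30:]))
    print(magnitude_response(W0), magnitude_response(math.pi / 2))
```

A radius closer to 1 gives a narrower pass band at the cost of a longer transient, so the value would need tuning against typical exon lengths.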
III. CORRELATING AND FILTERING APPROACH
Lun Huang, Mohammad Al Bataineh and G. E. Atkin [3] introduced a novel gene detection method based on the period-3 property which uses correlation [13] and filtering techniques. Here the bases are represented by the complex polyphase set {1, j, -1, -j}, and the resulting sequence is correlated with four sequences to predict the exon regions. These four sequences are composed by using 3 out of the 4 types of bases in a period-3 pattern. The genome sequence X(n) is taken as the input, and the four period-3 sequences are S1(n), S2(n), S3(n), S4(n).
The outputs after the correlation are
$$Y_i(n) = X(n) * S_i(-n), \quad i = 1, 2, 3, 4,$$
where $*$ denotes convolution.
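A direct way to evaluate this correlation is sketched below. The base-to-symbol mapping is borrowed from the K-Quaternary code quoted in Section VI, the template built from the bases A, C and G is only one possible period-3 pattern, and the template is conjugated (the usual convention for complex correlation, although the formula above does not show a conjugate). All three choices, and the function names, are assumptions for illustration.

```python
# Assumed mapping onto the polyphase set {1, j, -1, -j}.
SYMBOL = {"A": 1 + 0j, "T": 1j, "C": -1 + 0j, "G": -1j}


def to_complex(seq):
    """Map a DNA string to a list of complex symbols."""
    return [SYMBOL[base] for base in seq]


def correlate(x, s):
    """Y(n) = sum_m X(n + m) * conj(S(m)), i.e. X convolved with the
    time-reversed (and conjugated) template S."""
    return [
        sum(x[n + m] * s[m].conjugate() for m in range(len(s)))
        for n in range(len(x) - len(s) + 1)
    ]


if __name__ == "__main__":
    template = to_complex("ACG" * 4)            # hypothetical period-3 pattern
    genome = to_complex("TTTT" + "ACG" * 4 + "TTTT")
    y = correlate(genome, template)
    peak = max(range(len(y)), key=lambda n: abs(y[n]))
    print(peak, abs(y[peak]))
```

The magnitude of Y(n) peaks where the input lines up with the template, which is the property the detection method relies on.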
A maximum ratio combining algorithm [8] is used to combine the four correlation outputs. Period-3 filtering is then applied as a sliding window. To predict the exons, a window of length L is used for the analysis (L is taken to be odd so that the window is symmetric around its centre).
The algorithm proceeds as follows.
1. L0 and L1 are initialized to Lx/3 and p0 is set to 2, where L0 is the initial window length, L1 is the current window length and p0 is the original peak number.
2. A window of length L1 is used to detect the peaks of the filter output sequence f(n).
3. If the number of detected peaks satisfies p1 < p0 + 1, set L1 = L1 − 1 and go to step 2; otherwise, set the minimum value of the variance Vmin = ∞.
4. For this p1-peak distribution, the peak interval sequence {Ij}, j = 1, 2, …, p1 − 1, is found.
5. The variance V1 of {Ij} about its mean value Ī is then computed.
6. If V1 < Vmin, set Vmin = V1, p0 = p1, L0 = L1 and go back to step 4; otherwise set L1 = L0.
7. In the same way as in steps 2 and 3, L1 is minimized for peak number p1 = p0, and the window length is set to L = 2⌈(1/3) min(L1)⌉ + 1, where min(·) is the minimization function and ⌈·⌉ denotes the function that rounds its argument up to the next integer.
The results obtained with this approach were better than those of the DFT and SONF approaches in that more coding regions were detected. The method was applied to both prokaryote and eukaryote genome sequences, and the output obtained is shown in Fig. 7.
Fig. 7 Exons predicted by novel gene prediction method.
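Steps 4, 5 and 7 of the algorithm can be sketched as follows. The population variance (division by the number of intervals) is an assumption, since the normalisation is not legible in the text above, and the function names are hypothetical.

```python
import math


def peak_intervals(peak_positions):
    """Step 4: intervals I_j between consecutive peaks, j = 1 .. p1 - 1."""
    ordered = sorted(peak_positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def interval_variance(intervals):
    """Step 5: variance of the intervals about their mean (population form)."""
    mean = sum(intervals) / len(intervals)
    return sum((i - mean) ** 2 for i in intervals) / len(intervals)


def final_window_length(min_l1):
    """Step 7: L = 2 * ceil(min(L1) / 3) + 1, which is always odd."""
    return 2 * math.ceil(min_l1 / 3) + 1


if __name__ == "__main__":
    peaks = [120, 480, 860, 1210]          # toy peak positions in f(n)
    gaps = peak_intervals(peaks)           # three intervals for four peaks
    print(gaps, interval_variance(gaps), final_window_length(100))
```

A low variance means the detected peaks are evenly spaced, which is the criterion step 6 uses to accept a window length.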
IV. HARMONIC SUPPRESSION FILTER
The methods introduced in [2, 3] have the disadvantage that harmonics of the frequency 2π/3 appear along with the 2π/3 frequency component. These harmonic frequencies give a false measure of the period-3 property by adding peak strength in both exons and introns. Therefore [4] proposed a harmonic suppression (HS) filter and a minimum variance filter to suppress the harmonics. The harmonic suppression filter has dominant zeros at the multiples of frequency 2π/6 except at 2π/3, where a dominant pole is placed, so that it suppresses the harmonic frequencies of 2π/3 while passing the frequency 2π/3 itself. The peak at 16 kHz is evidence of the pole at angular frequency 2π/3 radians. The pole-zero plot of the designed filter has poles of magnitude 0.898, 0.898, 0.998 and 0.898 at angular frequencies ω = 0, 2π/6, 2π/3 and π radians, respectively, and zeros of magnitude 0.998, 0.998, 0.898 and 0.998 at angular frequencies ω = 0, 2π/6, 2π/3 and π radians,
respectively. Suppression of the higher-period harmonics is not necessary, because their contribution in a window of only 351 samples is negligible, but it is necessary to suppress the harmonics of 2π/3.
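The magnitude response implied by these pole and zero positions can be checked numerically. The sketch below uses the listed radii and angles literally, without adding complex-conjugate partners, which is an assumption about how the filter in [4] is constructed; the function name is hypothetical.

```python
import cmath
import math

ANGLES = [0.0, 2 * math.pi / 6, 2 * math.pi / 3, math.pi]
POLE_RADII = [0.898, 0.898, 0.998, 0.898]
ZERO_RADII = [0.998, 0.998, 0.898, 0.998]


def hs_gain(w):
    """|H(e^{jw})| = prod |e^{jw} - z_k| / prod |e^{jw} - p_k|,
    with the overall gain constant taken as 1."""
    point = cmath.exp(1j * w)
    gain = 1.0
    for angle, rp, rz in zip(ANGLES, POLE_RADII, ZERO_RADII):
        zero = rz * cmath.exp(1j * angle)
        pole = rp * cmath.exp(1j * angle)
        gain *= abs(point - zero) / abs(point - pole)
    return gain


if __name__ == "__main__":
    for w in ANGLES:
        print(f"w = {w:.4f} rad  gain = {hs_gain(w):.4f}")
```

With these values the gain is large at 2π/3 and strongly attenuated at 0, 2π/6 and π, which matches the intent described above.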
Fig. 8 shows that the HS filter is able to detect the shorter exons where the antinotch filter failed; however, the problem of suppressing the spurious peaks in introns remains, because the filter fails to attenuate the complex-conjugate harmonic frequency components. These can be suppressed by the minimum variance filter [9].
Fig. 8 Shows the output obtained by HS Filter.
Another approach [4] to gene prediction makes use of the minimum variance filter, an adaptation of the Maximum Likelihood Method (MLM) developed by Capon for the analysis of two-dimensional power spectral densities. The minimum variance spectrum estimation technique gives the flexibility to minimize the power in the side-lobe frequencies, thus maximizing the power in the main lobe.
The minimum variance spectrum estimation technique involves the following steps:
1) Design a band-pass filter g(n) with centre frequency ω = 2π/3 so that the filter rejects the maximum amount of out-of-band power while passing the component at frequency ω with no distortion.
2) Filter the DNA sequence x(n) with the filter and estimate the power in the output process y(n).
Hence the impulse response of such a filter for a given input sequence is
$$g = \frac{R_x^{-1} e}{e^H R_x^{-1} e}$$
where $e^H$ is the Hermitian (complex-conjugate) transpose of the exponential vector $e$, $R_x$ is the p × p Toeplitz autocorrelation matrix of the samples in the current window, and $g$ is the impulse response of the minimum variance filter with band-pass frequency ω = 2π/3.
With the filter designed in this way and applied to the DNA sequence x(n), the power estimated in the output process y(n) is as shown in Fig. 9.
Fig.9 shows the predicted exons using MV Filter.
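The two steps of the minimum variance estimate can be sketched in plain Python as below. The biased autocorrelation estimate, the filter length p = 6, the small diagonal loading term and the function names are assumptions made for this example and are not taken from [4].

```python
import cmath
import math


def autocorr_matrix(x, p):
    """p x p Toeplitz autocorrelation matrix (biased estimate) of window x."""
    x = [complex(v) for v in x]
    n = len(x)
    r = [sum(x[i + k] * x[i].conjugate() for i in range(n - k)) / n
         for k in range(p)]
    return [[r[i - j] if i >= j else r[j - i].conjugate() for j in range(p)]
            for i in range(p)]


def solve(a, b):
    """Solve a @ u = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for c in range(col, n + 1):
                m[row][c] -= factor * m[col][c]
    u = [0j] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][c] * u[c] for c in range(i + 1, n))
        u[i] = (m[i][n] - tail) / m[i][i]
    return u


def mv_filter(x, w=2 * math.pi / 3, p=6, loading=1e-9):
    """Return (g, power) with g = R^-1 e / (e^H R^-1 e) and
    power = 1 / (e^H R^-1 e)."""
    r = autocorr_matrix(x, p)
    for i in range(p):
        r[i][i] += loading          # assumed diagonal loading, keeps R invertible
    e = [cmath.exp(1j * w * k) for k in range(p)]
    u = solve(r, e)                 # u = R^-1 e
    denom = sum(ek.conjugate() * uk for ek, uk in zip(e, u)).real
    return [uk / denom for uk in u], 1.0 / denom


if __name__ == "__main__":
    window = [1.0 if n % 3 == 0 else 0.0 for n in range(90)]  # toy indicator
    _, p3 = mv_filter(window)
    _, p_off = mv_filter(window, w=math.pi / 2)
    print(p3, p_off)
```

Because g depends on the autocorrelation of the current window, the filter adapts to the data, unlike the fixed antinotch and HS filters.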
Accuracy was improved with the above approaches by suppressing the harmonic frequencies by means of the HS filter and the adaptive minimum variance filter.
V. RECURSIVE WIENER-KHINCHIN THEOREM
[5] explored a comparative approach based on the recursive Wiener-Khinchin theorem (RWKT) [10] for locating protein-coding regions. It is an efficient algorithm that applies the Wiener-Khinchin theorem to the autocorrelation function of a sliding window to estimate the power spectral density [14]. With the window size defined as N, the DFT of each window is calculated from that of the previous N-point window with just one complex multiplication and two real additions. In that paper both the DFT and the RWKT were applied to a DNA sequence, and it was shown that the RWKT provides a better estimate of the power spectral density than the DFT, but it is less efficient in terms of computation time.
Fig. 10 shows the output of RWKT with good resolution
compared to DFT method.
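The per-step cost quoted above matches that of the standard sliding-DFT recurrence, sketched below for the single bin k = N/3. This illustrates the recursive update idea only and is not claimed to be the exact RWKT formulation of [5] or [10]; the function names are hypothetical.

```python
import cmath


def direct_bin(x, start, n, k):
    """DFT bin k of the n-sample window beginning at index start."""
    return sum(x[start + m] * cmath.exp(-2j * cmath.pi * k * m / n)
               for m in range(n))


def sliding_bin(x, n, k):
    """Bin k for every window position, updated recursively.

    X_{s+1}(k) = (X_s(k) - x[s] + x[s + n]) * exp(+j 2 pi k / n)
    """
    rot = cmath.exp(2j * cmath.pi * k / n)
    current = direct_bin(x, 0, n, k)        # only the first window is summed
    bins = [current]
    for s in range(len(x) - n):
        current = (current - x[s] + x[s + n]) * rot
        bins.append(current)
    return bins


if __name__ == "__main__":
    signal = [1.0 if c == "G" else 0.0 for c in "ATGGCGATGCCGTTGACG" * 4]
    n = 30                                   # window length, multiple of 3
    power = [abs(v) ** 2 for v in sliding_bin(signal, n, n // 3)]
    print(len(power), power[0])
```

For a real-valued input each update subtracts the outgoing sample, adds the incoming one and applies a single complex rotation, instead of recomputing the full N-term sum.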
VI. FFT, FIR, IIR METHODS
[6] compared the performance of three spectral methods, the FFT, FIR and IIR digital-filter (DF) based methods, for the classification of short introns and exons. Computational complexity is taken as a factor in the comparison, and it depends on the FFT algorithm used. The K-Quaternary Code with the nucleotide-to-numeric mapping C = -1, G = -j, A = 1, T = j is adopted. Three thresholds are studied: the mid threshold Tm, the proportional threshold Tp, and the cumulative distribution threshold Tc. The first spectral method, the FFT-based spectral method, uses the fast Fourier transform (FFT) to compute the period-3 value of a numerically represented nucleotide sequence with the procedures described in [2]. The second (third) spectral method, the FIR DF-based (IIR DF-based) spectral method, uses an FIR (IIR) [11] band-pass digital filter with its pass band centred at a normalized frequency of 2π/3 radians to compute the period-3 value of a numerically represented nucleotide sequence. The FFT requires a total of L log2(L) complex-by-complex multiplications and L log2(L) complex additions, or 4L log2(L) real multiplications and 2L log2(L) real additions. An even Nth-order linear-phase FIR digital filter requires a total of L(N/2+1) complex-by-real multiplications and L(N+1) complex additions, or 2L(N/2+1) real multiplications and 2L(N+1) real additions. An Nth-order [12] IIR digital filter requires a total of L(2N+1) complex-by-real multiplications and 2LN complex additions, or 2L(2N+1) real multiplications and 4LN real additions.
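The real-operation counts quoted above are easy to tabulate for concrete sizes. In the sketch below L is taken to be the sequence length and N the filter order, which is an interpretation of the symbols rather than something stated in the text; the function name is hypothetical.

```python
import math


def real_operation_counts(length, order):
    """Real multiplications and additions for each method, using the
    formulas quoted in the text."""
    log_term = length * math.log2(length)
    return {
        "FFT": {"mult": 4 * log_term, "add": 2 * log_term},
        "FIR": {"mult": 2 * length * (order / 2 + 1),
                "add": 2 * length * (order + 1)},
        "IIR": {"mult": 2 * length * (2 * order + 1),
                "add": 4 * length * order},
    }


if __name__ == "__main__":
    for method, ops in real_operation_counts(length=1024, order=10).items():
        print(f"{method}: {ops['mult']:.0f} multiplications, "
              f"{ops['add']:.0f} additions")
```

The FFT cost depends only on L, while the filter costs grow linearly with the order N, so the ranking of the three methods depends on both parameters.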
VII. CONCLUSION
A current problem in the field of genomic signal processing is to detect the genes in the DNA sequence associated with rheumatoid arthritis. In this paper we review some of the DSP techniques proposed so far, with the aim of applying them to rheumatoid arthritis sequences to predict the genes responsible for the disease and then comparing the results with the normal sequence to determine exactly where the mutation takes place in the RA sequence. The results obtained from these techniques, incorporated in a software module, can be used for the characterization of the disease, and will support physicians and analysts in diagnosis and in better understanding the development, treatment and prevention of the disease. From the review, the RWKT [5] and FFT [6] methods are the best for the prediction of genes, offering better resolution and faster computation, respectively.
VIII. REFERENCES
[1] P. P. Vaidyanathan and B.-J. Yoon, "Digital filters for gene prediction applications," in Proc. Asilomar Conference on Signals, Systems, and Computers, pp. 306-310, Nov. 2002.
[2] P. P. Vaidyanathan and B.-J. Yoon, "Digital filters for gene prediction applications," Proc. Asilomar Conf. Signals Syst. Comput., pp. 306-310, 2002.
[3] Lun Huang, Mohammad Al Bataineh, G. E. Atkin, Siyun Wang, Wei Zhang, "A Novel Gene Detection Method Based on Period-3 Property," 31st Annual International Conference of the IEEE, September 2-6, 2009.
[4] Vikrant Tomar, Dipesh Gandhi, "Digital Signal Processing for Gene Prediction," IEEE, pp. 435-439, 2008.
[5] M. Roy, S. Barman, "Spectral analysis of DNA sequence using Recursive Weiner Khinchine theorem - A comparative approach," 978-1-4577, 2011.
[6] Benjamin Y. M. Kwan, Jennifer Y. Y. Kwan, Hon Keung Kwan, "Spectral Techniques for Classifying Short Exon and Intron Sequences," IEEE, pp. 219-226, 2012.
[7] M. K. Hota, V. K. Srinivasa, "DSP technique for gene and exon prediction taking complex indicator sequence," pp. 354-359, Pune, India, 03-05 January 2009.
[8] Mohammad Al Bataineh, Lun Huang, Ismaeel Muhamed, Nick Menhart, and Guillermo Atkin, "Gene Expression Analysis using Communications, Coding and Information Theory Based Models," BIOCOMP'09, July 13-16, 2009.
[9] M. H. Hayes, "Statistical digital signal processing and modeling," John Wiley & Sons, Inc., USA, 1996.
[10] Khalid M. Aamir and Mohammad A. Maud, "Recursive Weiner Khinchine theorem," World Academy of Science, Engineering and Technology 2, 2007.
[11] I. W. Selesnick, M. Lang, and C. S. Burrus, "Constrained least square design of FIR filters without specified transition bands," IEEE Transactions on Signal Processing, vol. 44, no. 8, pp. 1879-1892, August 1996.
[12] B. Giardine, C. Riemer, R. C. Hardison, R. Burhans, L. Elnitski, P. Shah, Y. Zhang, D. Blankenberg, I. Albert, J. Taylor, W. Miller, W. J. Kent, and A. Nekrutenko, "Galaxy: A platform for interactive large scale genome analysis," Genome Research, vol. 15, issue 10, pp. 1451-1455, 15 October 2005.
[13] H. Herzel, E. N. Trifonov, O. Weiss, and I. Große, "Interpreting correlations in biosequences," Physica A, vol. 249, pp. 449-459, 1998.
[14] J. Tuqan and A. Rashdi, "A DSP approach for finding the codons bias in DNA sequence," IEEE journal on signal processing, vol. 2, no. 3, pp. 343-356, Jun. 2008.