

CHAPTER 16

FRACTAL RELATED METHODS FOR PREDICTING PROTEIN STRUCTURE CLASSES AND FUNCTIONS

ZU-GUO YU, VO ANH, JIAN-YI YANG, and SHAO-MING ZHU

16.1 INTRODUCTION

The molecular function of a protein can be inferred from the protein's structure information [1]. It is well known that an amino acid sequence partially determines the eventual three-dimensional structure of a protein, and that similarity at the protein sequence level implies similarity of function [2,3]. The prediction of protein structure and function from amino acid sequences is one of the most important and challenging problems in molecular biology.

Protein secondary structure, which is a summary of the general conformation and hydrogen bonding pattern of the amino acid backbone [4,5], provides some knowledge to further simplify the complicated 3D structure prediction problem. Hence an intermediate but useful step is to predict the protein secondary structure. Since the 1970s, many methods have been developed for predicting protein secondary structure (see the more recent references cited in Ref. [6]).

Four main classes of protein structures, based on the types and arrangement of their secondary structural elements, have been recognized [7]: (1) the α helices, (2) the β strands, and those with a mixture of α and β shapes, denoted (3) α + β and (4) α/β. This structural classification has been accepted and widely used in protein structure and function prediction. As a consequence, predicting the structural class of a protein has become an important problem, one that can help build protein databases and predict protein function. In fact, Hou et al. [1] (see also a short report published in Science [8]) constructed a map of the protein structure space using the pairwise structural similarity of 1898 protein chains. They found that the space has a defining feature showing these four classes clustered together as four elongated arms emerging

Algorithmic and Artificial Intelligence Methods for Protein Bioinformatics, First Edition. Edited by Yi Pan, Jianxin Wang, Min Li. © 2014 John Wiley & Sons, Inc. Published 2014 by John Wiley & Sons, Inc.



from a common center. A review by Chou [9] described the development of some existing methods for prediction of protein structural classes. Proteins in the same family usually have similar functions. Therefore, family identification is an important problem in the study of proteins.

Because similarity at the protein sequence level normally implies similarity of function and structure, sequence similarities can be used to detect the biological function and interactions of proteins [10]. It is necessary to enrich the concept and context of similarity, however, because some proteins with low sequence identity may have similar tertiary structure and function [11].

Fractal geometry provides a mathematical formalism for describing complex spatial and dynamical structures [12,13]. Fractal methods are known to be useful for the detection of similarity. Traditional multifractal analysis is a useful way to characterize the spatial heterogeneity of both theoretical and experimental fractal patterns [14]. More recently, it has been applied successfully in many different fields, including time series analysis and financial modeling [15]. Some applications of fractal methods to DNA sequences are provided in References [16–18] and the references cited therein.

Wavelet analysis, recurrence quantification analysis (RQA), and empirical mode decomposition (EMD) are related to fractal methods in the nonlinear sciences. Wavelet analysis is a useful tool in many applications such as noise reduction, image processing, information compression, synthesis of signals, and the study of biological data. Wavelets are mathematical functions that decompose data into different frequency components and then describe each component with a resolution matched to its scale [19,20]. The recurrence plot (RP) is a purely graphical tool originally proposed by Eckmann et al. [21] to detect patterns of recurrence in the data. RQA is a relatively new nonlinear technique introduced by Zbilut and Webber [22,23] that can quantify the information supplied by an RP. The traditional EMD for data, which is a highly adaptive scheme serving as a complement to Fourier and wavelet transforms, was originally proposed by Huang et al. [24]. In EMD, a complicated dataset is decomposed into a finite, often small, number of components called intrinsic mode functions (IMFs). Lin et al. [25] presented a new approach to EMD. This method has been used successfully in analyzing a diverse range of datasets in the biological and medical sciences, geophysics, astronomy, engineering, and other fields [26,27].

Fractal methods have been used to study proteins. Their applications include fractal analysis of proton exchange kinetics [28], chaos game representation of protein structures [29] and of sequences based on the detailed HP model [30,31], the fractal dimension of protein mass [32], fractal properties of protein chains [33], fractal dimensions of protein secondary structures [34], multifractal analysis of the solvent accessibility of proteins [35], and the measure representation of protein sequences [36]. The wavelet approach has been used for prediction of secondary structures of proteins [37–45]. Chen et al. [41] predicted the secondary structure of a protein by the continuous wavelet transform (CWT) and the Chou–Fasman method. Marsolo et al. [42] combined the pairwise distance with wavelet decomposition to generate a set of coefficients and capture some features of proteins.


Qiu et al. [43] used the continuous wavelet transform to extract the positions of the α helices and short peptides at special scales. Rezaei et al. [44] applied wavelet analysis to membrane proteins. Pando et al. [45] used the discrete wavelet transform to detect protein secondary structures. Webber et al. [46] defined two independent variables to elucidate protein secondary structures based on the RQA of the coordinates of α-carbon atoms; the variables describe the percentages of α carbons belonging to α helices and β sheets, respectively. The ability of RQA to deal with protein sequences was reviewed by Giuliani et al. [47]. The IMFs obtained by the EMD method were used to discover similarities of protein sequences, and the results showed that IMFs may reflect the functional identities of proteins [48,49].

More recently, the authors used fractal methods and the related wavelet analysis, RQA, and EMD methods to study the prediction of protein structure classes and functions [50–54]. In this chapter, we review the methods and results of these studies.

16.2 METHODS

16.2.1 Measure Representation Based on the Detailed HP Model and Six-Letter Model

The detailed HP model was proposed by Yu et al. [30]. In this model, the 20 different kinds of amino acids are divided into four classes: nonpolar, negative polar, uncharged polar, and positive polar. The nonpolar class consists of the eight residues ALA, ILE, LEU, MET, PHE, PRO, TRP, and VAL; the negative polar class consists of the two residues ASP and GLU; the uncharged polar class is made up of the seven residues ASN, CYS, GLN, GLY, SER, THR, and TYR; and the remaining three residues ARG, HIS, and LYS constitute the positive polar class.

For a given protein sequence s = s_1 s_2 ⋯ s_L with length L, where s_i is one of the 20 kinds of amino acids for i = 1, …, L, we define

\[
a_i = \begin{cases}
0, & \text{if } s_i \text{ is nonpolar} \\
1, & \text{if } s_i \text{ is negative polar} \\
2, & \text{if } s_i \text{ is uncharged polar} \\
3, & \text{if } s_i \text{ is positive polar}
\end{cases} \tag{16.1}
\]

This results in a sequence X(s) = a_1 ⋯ a_L, where a_i is a letter of the alphabet {0, 1, 2, 3}. The mapping (16.1) is called the detailed HP model [30].
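As a concrete illustration, the mapping (16.1) can be coded directly from the residue classes listed above. This is a minimal sketch with our own function and set names, using one-letter amino acid codes:

```python
# Detailed HP encoding of Eq. (16.1); residue classes as listed in the text.
NONPOLAR = set("AILMFPWV")    # ALA, ILE, LEU, MET, PHE, PRO, TRP, VAL
NEGATIVE = set("DE")          # ASP, GLU
UNCHARGED = set("NCQGSTY")    # ASN, CYS, GLN, GLY, SER, THR, TYR
POSITIVE = set("RHK")         # ARG, HIS, LYS

def detailed_hp(seq):
    """Map a one-letter protein sequence to the alphabet {0, 1, 2, 3}."""
    out = []
    for s in seq:
        if s in NONPOLAR:
            out.append(0)
        elif s in NEGATIVE:
            out.append(1)
        elif s in UNCHARGED:
            out.append(2)
        elif s in POSITIVE:
            out.append(3)
        else:
            raise ValueError("unknown residue: " + s)
    return out
```

For example, `detailed_hp("ADEK")` returns `[0, 1, 1, 3]`.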

According to Chou and Fasman [55], the 20 different kinds of amino acids are divided into six classes: strong β former (Hβ), β former (hβ), weak β former (Iβ), β indifferent (iβ), β breaker (bβ), and strong β breaker (Bβ). The Hβ class consists of the three residues Met, Val, and Ile; the hβ class consists of the seven residues Cys, Tyr, Phe, Gln, Leu, Thr, and Trp; the Iβ class consists of the residue


Ala; the iβ class consists of the three residues Arg, Gly, and Asp; the bβ class is made up of the five residues Lys, Ser, His, Asn, and Pro; and the remaining residue Glu constitutes the Bβ class.

For a given protein sequence s = s_1 ⋯ s_L with length L, where s_i is one of the 20 kinds of amino acids for i = 1, …, L, we define

\[
a_i = \begin{cases}
0, & \text{if } s_i \text{ is in the B}\beta \text{ class} \\
1, & \text{if } s_i \text{ is in the b}\beta \text{ class} \\
2, & \text{if } s_i \text{ is in the i}\beta \text{ class} \\
3, & \text{if } s_i \text{ is in the I}\beta \text{ class} \\
4, & \text{if } s_i \text{ is in the h}\beta \text{ class} \\
5, & \text{if } s_i \text{ is in the H}\beta \text{ class}
\end{cases} \tag{16.2}
\]

This results in a sequence X(s) = a_1 ⋯ a_L, where a_i is a letter of the alphabet {0, 1, 2, 3, 4, 5}. The mapping (16.2) is called the six-letter model [51].
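The six-letter mapping (16.2) can be sketched the same way; the table-building layout is our own choice, with one-letter codes for the residues listed above:

```python
# Six-letter encoding of Eq. (16.2): each residue gets the code of its
# Chou-Fasman beta-propensity class.
SIX_LETTER = {}
for letters, code in [("E", 0),        # B-beta: strong beta breaker
                      ("KSHNP", 1),    # b-beta: beta breaker
                      ("RGD", 2),      # i-beta: beta indifferent
                      ("A", 3),        # I-beta: weak beta former
                      ("CYFQLTW", 4),  # h-beta: beta former
                      ("MVI", 5)]:     # H-beta: strong beta former
    for aa in letters:
        SIX_LETTER[aa] = code

def six_letter(seq):
    """Map a one-letter protein sequence to the alphabet {0, ..., 5}."""
    return [SIX_LETTER[s] for s in seq]
```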

Here we call any string made up of K letters from the set {0, 1, 2, 3} (for the detailed HP model) or {0, 1, 2, 3, 4, 5} (for the six-letter model) a K-string. For a given K, there are in total 4^K or 6^K different K-strings. In order to count the numbers of K-strings in a sequence X(s) from a protein sequence s, we need 4^K or 6^K counters. We divide the interval [0, 1] into 4^K or 6^K disjoint subintervals, and use each subinterval to represent a counter. For a substring r = r_1 ⋯ r_K of length K with r_i ∈ {0, 1, 2, 3}, i = 1, …, K, we define

\[
x_{\mathrm{left}}(r) = \sum_{i=1}^{K} \frac{r_i}{4^i}, \qquad
x_{\mathrm{right}}(r) = \frac{1}{4^K} + \sum_{i=1}^{K} \frac{r_i}{4^i} \tag{16.3}
\]

For a substring r = r_1 ⋯ r_K of length K with r_i ∈ {0, 1, 2, 3, 4, 5}, i = 1, …, K, we define

\[
x_{\mathrm{left}}(r) = \sum_{i=1}^{K} \frac{r_i}{6^i}, \qquad
x_{\mathrm{right}}(r) = \frac{1}{6^K} + \sum_{i=1}^{K} \frac{r_i}{6^i} \tag{16.4}
\]

We then use the subinterval [x_left(r), x_right(r)) to represent the substring r. Let N_K(r) be the number of times that a substring r of length K appears in the sequence X(s) (when we count these numbers, we open a reading frame of width K and slide the frame one amino acid at a time). We define

\[
F_K(r) = \frac{N_K(r)}{L - K + 1} \tag{16.5}
\]

to be the frequency of the substring r. It follows that \(\sum_{\{r\}} F_K(r) = 1\). We can now define a measure μ_K on [0, 1) by dμ_K(x) = Y_K(x) dx, where

\[
Y_K(x) = 4^K F_K(r) \tag{16.6}
\]


when x ∈ [x_left(r), x_right(r)) as defined by (16.3). Similarly, we define a measure ν_K on [0, 1) by dν_K(x) = Y′_K(x) dx, where

\[
Y'_K(x) = 6^K F_K(r) \tag{16.7}
\]

when x ∈ [x_left(r), x_right(r)) as defined by (16.4). We see that \(\int_0^1 d\mu_K(x) = 1\) and \(\int_0^1 d\nu_K(x) = 1\). We call μ_K and ν_K the measure representations of the protein sequence corresponding to the given K based on the detailed HP model and the six-letter model, respectively.
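The K-string counting and the subintervals of Eqs. (16.3)–(16.5) can be sketched as follows. The helper below is our own (the chapter defines only the measure itself); it returns, for each observed K-string, its subinterval and frequency:

```python
from collections import Counter

def measure_representation(a, K, alphabet_size=4):
    """For an encoded sequence a, return {K-string: (x_left, x_right, F_K)}.

    A reading frame of width K slides one letter at a time (Eq. 16.5); the
    subinterval follows Eqs. (16.3)/(16.4) with base alphabet_size.
    """
    L = len(a)
    counts = Counter(tuple(a[i:i + K]) for i in range(L - K + 1))
    rep = {}
    for r, n in counts.items():
        x_left = sum(ri / alphabet_size ** (i + 1) for i, ri in enumerate(r))
        x_right = x_left + 1.0 / alphabet_size ** K
        rep[r] = (x_left, x_right, n / (L - K + 1))
    return rep
```

The frequencies over all observed K-strings sum to 1, matching \(\sum_{\{r\}} F_K(r) = 1\).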

16.2.2 Measures and Time Series Based on the Physicochemical Features of Amino Acids

Measured in kcal/mol, the hydrophobic free energies of the 20 amino acids are A = 0.87, R = 0.85, N = 0.09, D = 0.66, C = 1.52, Q = 0.0, E = 0.67, G = 0.0, H = 0.87, I = 3.15, L = 2.17, K = 1.65, M = 1.67, F = 2.87, P = 2.77, S = 0.07, T = 0.07, W = 3.77, Y = 2.76, and V = 1.87 [39].

The solvent accessibility (SA) values for solvent-exposed area >30 Å are S = 0.70, T = 0.71, A = 0.48, G = 0.51, P = 0.78, C = 0.32, D = 0.81, E = 0.93, Q = 0.81, N = 0.82, L = 0.41, I = 0.39, V = 0.40, M = 0.44, F = 0.42, Y = 0.67, W = 0.49, K = 0.93, R = 0.84, and H = 0.66 [56].

The Schneider–Wrede scale hydrophobicity (SWH) values of the 20 kinds of amino acids are A = 1.6, R = −12.3, N = −4.8, D = −9.2, C = 2, Q = −1.1, E = −8.2, G = 1, H = −3, I = 3.1, L = 2.8, K = −8.8, M = 2.4, F = 3.7, P = −0.2, S = 0.6, T = 1.2, W = 1.9, Y = −0.7, and V = 2.6 [47]. Yang et al. [51] added the constant 12.30 to these values to make all 20 values nonnegative, yielding the revised Schneider–Wrede scale hydrophobicity (RSWH).

Rose et al. [57] proposed different measures of hydrophobicity of proteins. They gave four kinds of values for the surface area and hydrophobicity of each amino acid. We use A0, the stochastic standard state accessibility, that is, the solvent-accessible surface area (SASA) of a residue in the standard state. The SASA values of the 20 kinds of amino acids are A = 118.1, R = 256.0, N = 165.5, D = 158.7, C = 146.1, Q = 193.2, E = 186.2, G = 88.1, H = 202.5, I = 181.0, L = 193.1, K = 225.8, M = 203.4, F = 222.8, P = 146.8, S = 129.8, T = 152.5, W = 266.3, Y = 236.8, and V = 164.5.

Each amino acid can also be represented by the volume of its sidechain [58]. These values are A = 27.5, C = 44.6, D = 40, E = 62, F = 115.5, G = 0, H = 79, I = 93.5, K = 100, L = 93.5, M = 94.1, N = 58.7, P = 41.9, Q = 80.7, R = 105, S = 29.3, T = 51.3, V = 71.5, W = 145.5, and Y = 117.3.

We convert each amino acid according to hydrophobic free energy, SA, RSWH, SASA, and volume of sidechain along the protein sequence to obtain five different numerical sequences, and view them as time series.

Let T_t, t = 1, 2, …, N, be a time series of length N. First, we define F_t = T_t / \(\sum_{j=1}^{N} T_j\), t = 1, 2, …, N, as the frequency of T_t. It follows that \(\sum_{t=1}^{N} F_t = 1\). Now we can define a measure ν on the interval [0, 1) by dν(x) = Y_1(x) dx, where

\[
Y_1(x) = N F_t = \frac{T_t}{\frac{1}{N}\sum_{j=1}^{N} T_j}, \qquad
x \in \left[\frac{t-1}{N}, \frac{t}{N}\right) \tag{16.8}
\]

We denote the interval [(t − 1)/N, t/N) by I_t. It is seen that ν([0, 1)) = 1 and ν(I_t) = F_t. We call ν the measure for the time series.
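As an illustration, with the hydrophobic free energies listed above the step density Y_1 of Eq. (16.8) can be sketched as follows (the function name is ours; the values come from the text):

```python
# Hydrophobic free energies (kcal/mol) from the text.
HFE = {"A": 0.87, "R": 0.85, "N": 0.09, "D": 0.66, "C": 1.52, "Q": 0.0,
       "E": 0.67, "G": 0.0, "H": 0.87, "I": 3.15, "L": 2.17, "K": 1.65,
       "M": 1.67, "F": 2.87, "P": 2.77, "S": 0.07, "T": 0.07, "W": 3.77,
       "Y": 2.76, "V": 1.87}

def density(seq):
    """Value of Y_1 on each interval I_t = [(t-1)/N, t/N): Y_1 = T_t / mean(T)."""
    T = [HFE[s] for s in seq]
    mean = sum(T) / len(T)
    return [t / mean for t in T]
```

Averaging the returned values over the N intervals gives 1, so the measure has total mass 1.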

16.2.3 Z-Curve Representation of Proteins

The concept of Z-curve representation of a DNA sequence was first proposed by Zhang and Zhang [59]. We propose a similar concept for proteins [50]. Once we get the sequence X(s) = a_1 ⋯ a_L for a protein, where a_i is a letter of the alphabet {0, 1, 2, 3} as in Section 16.2.1, we can define the Z-curve representation of this protein as follows. The Z curve consists of a series of nodes Q_i, i = 0, 1, …, L, whose coordinates are denoted by x_i, y_i, and z_i. These coordinates are defined as

\[
\begin{cases}
x_i = 2(\mathrm{num}^0_i + \mathrm{num}^2_i) - i \\
y_i = 2(\mathrm{num}^0_i + \mathrm{num}^1_i) - i \\
z_i = 2(\mathrm{num}^0_i + \mathrm{num}^3_i) - i
\end{cases}
\qquad i = 0, 1, 2, \ldots, L \tag{16.9}
\]

where \(\mathrm{num}^0_i\), \(\mathrm{num}^1_i\), \(\mathrm{num}^2_i\), \(\mathrm{num}^3_i\) denote the numbers of occurrences of the symbols 0, 1, 2, 3 in the prefix a_1 a_2 ⋯ a_i, respectively, and \(\mathrm{num}^0_0 = \mathrm{num}^1_0 = \mathrm{num}^2_0 = \mathrm{num}^3_0 = 0\). Connecting the nodes Q_0, Q_1, …, Q_L one to another by straight lines yields the Z-curve representation of the protein. We then define

\[
\begin{cases}
\Delta x_i = x_i - x_{i-1} \\
\Delta y_i = y_i - y_{i-1} \\
\Delta z_i = z_i - z_{i-1}
\end{cases}
\qquad i = 1, 2, \ldots, L \tag{16.10}
\]

where Δx_i, Δy_i, and Δz_i can take only the values 1 and −1.
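The node coordinates of Eq. (16.9) accumulate simple symbol counts, so the Z curve can be computed in one pass over the encoded sequence (a sketch with our own function name):

```python
def z_curve(a):
    """Z-curve nodes Q_0, ..., Q_L (Eq. 16.9) for a {0,1,2,3}-encoded sequence."""
    counts = [0, 0, 0, 0]        # occurrences of symbols 0..3 in the prefix
    nodes = [(0, 0, 0)]          # Q_0 at the origin
    for i, sym in enumerate(a, start=1):
        counts[sym] += 1
        x = 2 * (counts[0] + counts[2]) - i
        y = 2 * (counts[0] + counts[1]) - i
        z = 2 * (counts[0] + counts[3]) - i
        nodes.append((x, y, z))
    return nodes
```

Successive differences of the coordinates are always ±1, in agreement with Eq. (16.10).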

16.2.4 Chaos Game Representation of Proteins and Related Time Series

Chaos game representation (CGR) of protein structures was first proposed by Fiser et al. [29]. We denote this CGR by 20-CGR, as 20 kinds of letters are used to represent protein sequences. Later Basu et al. [60] and Yu et al. [31] proposed other kinds of CGRs for proteins, in which 12 and 4 kinds of letters were used for protein sequences, respectively. We denote them by 12-CGR and 4-CGR.


16.2.4.1 Reverse Encoding for Amino Acids  It is known that several codons can encode the same amino acid. As a result, there are many possible nucleotide sequences for one given protein sequence. We have used [52] the encoding method proposed by Deschavanne and Tuffery [61], which is listed in Table 1 of our study [52]. Deschavanne and Tuffery [61] explained that the rationale for the choice of this fixed code is to keep a balance in base composition so as to maximize the difference between the amino acid codes.

After a protein sequence is transformed into a nucleotide sequence, we can use the CGR of nucleotide sequences [62] to analyze it; the CGR obtained is abbreviated AAD-CGR (amino acids to DNA CGR). The CGR for a nucleotide sequence is defined on the square [0, 1] × [0, 1], whose four vertices correspond to the four letters A, C, G, and T: the first point of the plot is placed halfway between the center of the square and the vertex corresponding to the first letter of the nucleotide sequence; the ith point of the plot is then placed halfway between the (i − 1)th point and the vertex corresponding to the ith letter. The plot is called the CGR of the nucleotide sequence, or the AAD-CGR of the protein sequence.

We can decompose the AAD-CGR plot into two time series [52]. Any point in the AAD-CGR plot is determined by two coordinates: its x and y coordinates. Because the AAD-CGR plot can be uniquely reconstructed from these two time series, all the information stored in the AAD-CGR plot is contained in the time series, and the information in the AAD-CGR plot comes from the primary sequence of the protein. Therefore, any analysis of the two time series is equivalent to an indirect analysis of the protein primary sequence. It is possible that such analysis provides better results than direct analysis of the protein primary sequence.
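The CGR construction described above can be sketched as follows. The assignment of A, C, G, T to the corners of the square below is one common convention, not necessarily the one used in [62]:

```python
def cgr(nt_seq):
    """Chaos game representation points of a nucleotide sequence: each point
    lies halfway between the previous point and the vertex of its letter."""
    vertex = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}
    x, y = 0.5, 0.5                      # start from the center of the square
    points = []
    for nt in nt_seq:
        vx, vy = vertex[nt]
        x, y = (x + vx) / 2, (y + vy) / 2
        points.append((x, y))
    return points
```

The x and y coordinates of the returned points are exactly the two time series into which the plot decomposes.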

16.2.5 Time Series Based on 6-Letter Model, 12-Letter Model, and 20-Letter Model

According to the 6-letter model, a protein sequence can be represented by a numerical sequence \(\{a_i\}_{i=1}^{L}\), where a_i ∈ {0, 1, 2, 3, 4, 5} for each i = 1, 2, …, L. We now analogously define a 12-letter model and a 20-letter model.

Following the idea of chaos game representation based on a 12-sided regular polygon [60], we define the 12-letter model as A = 0, G = 0, P = 1, S = 2, T = 2, H = 3, Q = 4, N = 5, D = 6, E = 6, R = 7, K = 7, I = 8, L = 8, V = 8, M = 8, W = 9, F = 10, Y = 10, and C = 11, according to the order of the vertices on the 12-sided regular polygon representing these amino acids.

The 20-letter model can be similarly defined as A = 0, R = 1, N = 2, D = 3, C = 4, Q = 5, E = 6, G = 7, H = 8, I = 9, L = 10, K = 11, M = 12, F = 13, P = 14, S = 15, T = 16, W = 17, Y = 18, and V = 19, according to the dictionary order of the three-letter representations of the amino acids listed in Brown's treatise [63].


These three models can be used to convert a protein sequence into three different numerical sequences (which can also be viewed as time series).

16.2.6 Iterated Function Systems Model

In order to simulate the measure representation of a protein sequence, we proposed using the iterated function systems (IFS) model [30]. IFS is the acronym originally assigned by Barnsley and Demko [64] to a system of contractive maps w = {w_1, w_2, …, w_N}. Let E_0 be a compact set in a compact metric space, and let

\[
E_{\sigma_1 \sigma_2 \cdots \sigma_n} = w_{\sigma_1} \circ w_{\sigma_2} \circ \cdots \circ w_{\sigma_n}(E_0), \qquad
E_n = \bigcup_{\sigma_1, \ldots, \sigma_n \in \{1, 2, \ldots, N\}} E_{\sigma_1 \sigma_2 \cdots \sigma_n}
\]

Then \(E = \bigcap_{n=1}^{\infty} E_n\) is called the attractor of the IFS. The attractor is usually a fractal set, and the IFS is a relatively general model that generates many well-known fractal sets such as the Cantor set and the Koch curve. Given a set of probabilities P_i > 0 with \(\sum_{i=1}^{N} P_i = 1\), we pick an x_0 ∈ E and define the iteration sequence

\[
x_{n+1} = w_{\sigma_n}(x_n), \qquad n = 0, 1, 2, 3, \ldots
\]

where the indices σ_n are chosen randomly and independently from the set {1, 2, …, N} with probabilities P(σ_n = i) = P_i. Then every orbit {x_n} is dense in the attractor E [64]. For n sufficiently large, we can view the orbit {x_0, x_1, …, x_n} as an approximation of E. This process is called the chaos game.

Let χ_B be the characteristic function of a Borel subset B ⊂ E. Then, from the ergodic theorem for IFS [64], the limit

\[
\mu(B) = \lim_{n \to \infty} \frac{1}{n+1} \sum_{k=0}^{n} \chi_B(x_k)
\]

exists. The measure μ is the invariant measure of the attractor of the IFS. In other words, μ(B) is the relative visitation frequency of B during the chaos game. A histogram approximation of the invariant measure may then be obtained by counting the number of visits made to each pixel.

The coefficients of the contractive maps and the probabilities in the IFS model are the parameters to be estimated for a real measure that we want to simulate. A moment method [30,65] can be used to perform this task.

From the measure representation of a protein sequence based on the detailed HP model, it is logical to choose N = 4 and

\[
w_1(x) = \frac{x}{4}, \qquad w_2(x) = \frac{x}{4} + \frac{1}{4}, \qquad
w_3(x) = \frac{x}{4} + \frac{1}{2}, \qquad w_4(x) = \frac{x}{4} + \frac{3}{4}
\]


in the IFS model. For a given measure representation of a protein sequence based on the detailed HP model, we obtain the estimated values of the probabilities P_1, P_2, P_3, P_4 by solving an optimization problem [65]. Using the estimated values of the probabilities, we can use the chaos game to generate a histogram approximation of the invariant measure of the IFS, which can be compared with the real measure representation of the protein sequence.
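Once probabilities are given, the chaos game itself is simple to run. The sketch below (our own function; the moment-method estimation of the probabilities is not shown) builds a histogram approximation of the invariant measure for the four maps w_i(x) = x/4 + (i − 1)/4:

```python
import random

def chaos_game_histogram(probs, n_iter=100_000, n_bins=64, seed=0):
    """Histogram approximation of the IFS invariant measure on [0, 1]
    for the maps w_i(x) = x/4 + (i-1)/4 chosen with probabilities probs."""
    rng = random.Random(seed)
    x = rng.random()
    hist = [0] * n_bins
    for _ in range(n_iter):
        i = rng.choices(range(4), weights=probs)[0]   # pick a map at random
        x = x / 4 + i / 4                             # apply w_{i+1}
        hist[min(int(x * n_bins), n_bins - 1)] += 1   # count the visited pixel
    return [h / n_iter for h in hist]
```

With all mass on w_1 the orbit collapses toward 0, so almost all visits fall in the first bin, as expected for a point-mass invariant measure.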

16.2.7 Detrended Fluctuation Analysis

The exponent obtained in a detrended fluctuation analysis can be used to characterize the correlation of a time series [17,18]. We view Δx_i, Δy_i, and Δz_i, i = 1, 2, …, L, in the Z-curve representation of proteins as time series. We denote such a time series by F(t), t = 1, …, L. First, the time series is integrated as \(T(k) = \sum_{t=1}^{k} [F(t) - F_{\mathrm{av}}]\), where F_av is the average over the whole time period. Next, the integrated time series is divided into boxes of equal length n. In each box of length n, a linear regression is fitted to the data by least squares, representing the trend in that box. We denote the T coordinate of the straight-line segments by T_n(k). We then detrend the integrated time series T(k) by subtracting the local trend T_n(k) in each box. The root-mean-square fluctuation of this integrated and detrended time series is computed as

\[
F(n) = \sqrt{\frac{1}{L} \sum_{k=1}^{L} [T(k) - T_n(k)]^2} \tag{16.11}
\]

Typically, F(n) increases with box size n. A linear relationship on a log–log plot indicates the presence of scaling, F(n) ∝ n^λ. Under such conditions, the fluctuations can be characterized by the scaling exponent λ, the slope of the regression line of ln F(n) against ln n. For uncorrelated data, the integrated time series T(k) corresponds to a random walk, and therefore λ = 0.5. A value 0.5 < λ < 1.0 indicates the presence of long memory so that, for example, a large value is likely to be followed by large values. In contrast, the range 0 < λ < 0.5 indicates a different type of power-law correlation such that positive and negative values of the time series are more likely to alternate. In this chapter, we consider the exponents λ computed from Δx_i, Δy_i, and Δz_i, i = 1, 2, …, L, of the Z-curve representation of protein sequences as candidates for constructing parameter spaces for proteins. These exponents are denoted by λ_x, λ_y, and λ_z, respectively.
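The procedure above can be sketched numerically as follows (our own function; the box sizes are illustrative and would in practice be chosen to suit the sequence length):

```python
import numpy as np

def dfa_exponent(series, box_sizes=(4, 8, 16, 32, 64)):
    """Detrended fluctuation analysis: the scaling exponent lambda from a
    log-log regression of F(n) against n (minimal sketch of Eq. 16.11)."""
    F = np.asarray(series, dtype=float)
    T = np.cumsum(F - F.mean())                   # integrated series T(k)
    fluctuations = []
    for n in box_sizes:
        sq_residuals = []
        for b in range(len(T) // n):
            seg = T[b * n:(b + 1) * n]
            t = np.arange(n)
            trend = np.polyval(np.polyfit(t, seg, 1), t)   # local linear trend
            sq_residuals.append(np.mean((seg - trend) ** 2))
        fluctuations.append(np.sqrt(np.mean(sq_residuals)))
    slope, _ = np.polyfit(np.log(box_sizes), np.log(fluctuations), 1)
    return slope
```

For uncorrelated (white) noise, the estimated exponent comes out close to 0.5, as stated above.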

16.2.8 Ordinary Multifractal Analysis

The most common algorithms of multifractal analysis are the so-called fixed-size box-counting algorithms [16]. In the one-dimensional case, for a given measure μ with support E ⊂ ℝ, we consider the partition sum

\[
Z_\varepsilon(q) = \sum_{\mu(B) \neq 0} [\mu(B)]^q \tag{16.12}
\]


with q ∈ ℝ, where the sum runs over all different nonempty boxes B of a given side ε in a grid covering of the support E, that is, B = [kε, (k + 1)ε). The exponent τ(q) is defined by

\[
\tau(q) = \lim_{\varepsilon \to 0} \frac{\ln Z_\varepsilon(q)}{\ln \varepsilon} \tag{16.13}
\]

and the generalized fractal dimensions of the measure are defined as

\[
D_q = \frac{\tau(q)}{q - 1} \quad \text{for } q \neq 1 \tag{16.14}
\]

\[
D_q = \lim_{\varepsilon \to 0} \frac{Z_{1,\varepsilon}}{\ln \varepsilon} \quad \text{for } q = 1 \tag{16.15}
\]

where \(Z_{1,\varepsilon} = \sum_{\mu(B) \neq 0} \mu(B) \ln \mu(B)\). The generalized fractal dimensions are numerically estimated through a linear regression of \((\ln Z_\varepsilon(q))/(q-1)\) against \(\ln \varepsilon\) for q ≠ 1, and similarly through a linear regression of \(Z_{1,\varepsilon}\) against \(\ln \varepsilon\) for q = 1. The value D_1 is called the information dimension and D_2 the correlation dimension.
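For a measure given as a fine histogram on [0, 1), the estimation of D_q can be sketched as follows (the function name and the ε values are ours; the histogram length is assumed divisible by every box count):

```python
import numpy as np

def generalized_dimension(measure, q, eps_list=(1/4, 1/16, 1/64)):
    """Estimate D_q (Eqs. 16.12-16.15) by box counting on a histogram of the
    measure; a linear regression over the eps values gives the dimension."""
    measure = np.asarray(measure, dtype=float)
    logs_eps, logs_Z = [], []
    for eps in eps_list:
        n_boxes = int(round(1 / eps))
        box = measure.reshape(n_boxes, -1).sum(axis=1)   # box masses mu(B)
        box = box[box > 0]                               # nonempty boxes only
        if q == 1:
            logs_Z.append(np.sum(box * np.log(box)))     # Z_{1,eps}
        else:
            logs_Z.append(np.log(np.sum(box ** q)) / (q - 1))
        logs_eps.append(np.log(eps))
    slope, _ = np.polyfit(logs_eps, logs_Z, 1)
    return slope
```

A quick sanity check: for the uniform measure, every D_q equals 1, the dimension of the support.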

The concept of phase transition in multifractal spectra was introduced in studies of logistic maps, Julia sets, and other simple systems. Following the thermodynamic formulation of multifractal measures, Canessa [66] derived an expression for the analogous specific heat as

\[
C_q \equiv -\frac{\partial^2 \tau(q)}{\partial q^2} \approx 2\tau(q) - \tau(q+1) - \tau(q-1) \tag{16.16}
\]

He showed that the form of C_q resembles a classical phase transition at a critical point for financial time series.

The singularities of a measure are characterized by the Lipschitz–Hölder exponent α, which is related to τ(q) by

\[
\alpha(q) = \frac{d}{dq}\tau(q) \tag{16.17}
\]

Substitution of Equation (16.13) into Equation (16.17) yields

\[
\alpha(q) = \lim_{\varepsilon \to 0} \frac{\sum_{\mu(B) \neq 0} [\mu(B)]^q \ln \mu(B)}{Z_\varepsilon(q) \ln \varepsilon} \tag{16.18}
\]

Again, the exponent α(q) can be estimated through a linear regression of

\[
\frac{\sum_{\mu(B) \neq 0} [\mu(B)]^q \ln \mu(B)}{Z_\varepsilon(q)}
\]

against \(\ln \varepsilon\) [35]. The multifractal spectrum f(α) versus α can then be calculated from the relationship f(α) = qα(q) − τ(q).


16.2.9 Analogous Multifractal Analysis

The analogous multifractal analysis (AMFA) is similar to multiaffinity analysis and can be briefly sketched as follows [51]. We denote a time series by X(t), t = 1, 2, …, N. First, the time series is integrated as

\[
y'_q(k) = \sum_{t=1}^{k} (X(t) - X_{\mathrm{av}}), \qquad (q > 0) \tag{16.19}
\]

\[
y_q(k) = \sum_{t=1}^{k} |X(t) - X_{\mathrm{av}}|, \qquad (q \neq 0) \tag{16.20}
\]

where X_av is the average over the whole time period. Then two quantities M_q(L) and M′_q(L) are defined as

\[
M'_q(L) = \left[ \langle |y'(j) - y'(j+L)|^q \rangle_j \right]^{1/q}, \qquad (q > 0) \tag{16.21}
\]

\[
M_q(L) = \left[ \langle |y(j) - y(j+L)|^q \rangle_j \right]^{1/q}, \qquad (q \neq 0) \tag{16.22}
\]

where ⟨·⟩_j denotes the average over j, j = 1, 2, …, N − L; L typically varies from 1 to an N_1 for which the linear fit is good. From the ln L versus ln M_q(L) and ln L versus ln M′_q(L) planes, one can find the following relations:

\[
M'_q(L) \propto L^{h'(q)} \quad \text{for } q > 0 \tag{16.23}
\]

\[
M_q(L) \propto L^{h(q)} \quad \text{for } q \neq 0 \tag{16.24}
\]

Linear regressions of ln M_q(L) and ln M′_q(L) against ln L yield the exponents h(q) and h′(q), respectively.
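The h(q) branch of the analysis (Eqs. 16.20, 16.22, 16.24) can be sketched as follows (our own function; the lags are illustrative, and q is assumed nonzero):

```python
import numpy as np

def amfa_h(series, q, lags=(1, 2, 4, 8, 16)):
    """Slope h(q) of ln M_q(L) against ln L for the absolute-increment
    profile y(k) of Eq. (16.20); a minimal sketch."""
    X = np.asarray(series, dtype=float)
    y = np.cumsum(np.abs(X - X.mean()))              # Eq. (16.20)
    log_M = []
    for L in lags:
        diffs = np.abs(y[:-L] - y[L:]) ** q          # |y(j) - y(j+L)|^q
        log_M.append(np.log(np.mean(diffs) ** (1 / q)))
    slope, _ = np.polyfit(np.log(lags), log_M, 1)
    return slope
```

For a series whose absolute deviations are constant, y(k) is exactly linear and h(q) = 1 for every q, which makes a convenient check.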

16.2.10 Wavelet Spectrum

As the wavelet transform of a function can be considered an approximation of the function, wavelet algorithms process data at different scales or components. At each scale, many coefficients are obtained, and the wavelet spectrum is calculated on the basis of these coefficients. Hence the wavelet spectrum provides useful information for analyzing data. Given a function f(t), one defines its wavelet transform as [19]

\[
W_f(a, b) = |a|^{-1/2} \int_{-\infty}^{\infty} f(t)\, \psi\!\left(\frac{t - b}{a}\right) dt \tag{16.25}
\]


where b is the position and a is the scale. The scale a in the wavelet transform corresponds to the ath resolution of the data. Taking a = j with j = 2^0, 2^1, 2^2, …, and b = k with k ∈ ℝ, we get the wavelet spectrum as

\[
\mathrm{Spectrum}[j] = \sum_{k} C_{j,k}^2, \qquad j = 2^0, 2^1, 2^2, \ldots
\]

where \(C_{j,k} = W_f(j, k)\). For simplicity, the scales j can also be selected as j = 1, 3/2, 2, …, 20, which are more closely spaced and can be used to capture more details of the data. The wavelet spectrum is calculated by summing the squares of the coefficients at each scale j. The local wavelet spectrum is defined through the modulus maxima of the coefficients [67] as \(\mathrm{local\ spectrum}[j] = \sum_k \tilde{C}_{j,k}^2\), where

\[
\tilde{C}_{j,k} =
\begin{cases}
|C_{j,k}|, & \text{if } |C_{j,k}| > |C_{j,k-1}| \text{ and } |C_{j,k}| > |C_{j,k+1}| \\
0, & \text{otherwise}
\end{cases}
\]

The maximum of the wavelet spectrum and the maximum of the local wavelet spectrum were applied in the prediction of structural classes and families of proteins [53].

In our work, we chose the Daubechies wavelets, which are commonly used as signal processing tools. These wavelet functions are compactly supported wavelets with extremal phase and the highest number of vanishing moments for a given support width. They are orthogonal and biorthogonal functions. The Daubechies wavelets can improve on the frequency-domain characteristics of other wavelets [19].
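A dyadic wavelet spectrum can be sketched with the simplest orthonormal wavelet, the Haar wavelet, in place of a Daubechies one (a deliberate simplification so the transform fits in a few lines; the signal length is assumed to be a power of 2):

```python
import numpy as np

def haar_wavelet_spectrum(signal):
    """Sum of squared detail coefficients at each dyadic scale, computed
    with the orthonormal Haar fast wavelet transform."""
    a = np.asarray(signal, dtype=float)
    spectrum = []
    while len(a) > 1:
        d = (a[0::2] - a[1::2]) / np.sqrt(2)   # detail coefficients C_{j,k}
        a = (a[0::2] + a[1::2]) / np.sqrt(2)   # approximation for next scale
        spectrum.append(float(np.sum(d ** 2)))
    return spectrum
```

By orthonormality, the spectrum entries plus the squared final approximation sum to the signal energy, so the spectrum is a partition of energy across scales.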

16.2.11 Recurrence Quantification Analysis

The recurrence plot (RP) is a purely graphical tool originally proposed by Eckmann et al. [21] to detect patterns of recurrence in data. A time series {x_1, x_2, . . . , x_N} of length N can be embedded into the space R^m with embedding dimension m and time delay τ. We write y_i = (x_i, x_{i+τ}, x_{i+2τ}, . . . , x_{i+(m−1)τ}), i = 1, 2, . . . , N_m, where N_m = N − (m − 1)τ. In this way we obtain N_m vectors (points) in the embedding space R^m. We gave some numerical guidance for the selection of m and τ in our earlier paper [52].
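The delay embedding above is straightforward to implement; a minimal numpy sketch (illustrative, not the chapter's code):

```python
import numpy as np

def delay_embed(x, m, tau):
    """Embed {x_1, ..., x_N} into R^m with time delay tau:
    y_i = (x_i, x_{i+tau}, ..., x_{i+(m-1)tau}), i = 1, ..., N_m,
    where N_m = N - (m - 1) * tau."""
    x = np.asarray(x, dtype=float)
    n_m = len(x) - (m - 1) * tau
    if n_m <= 0:
        raise ValueError("series too short for this (m, tau)")
    # column j holds x shifted by j * tau; rows are the embedded vectors y_i
    return np.column_stack([x[j * tau : j * tau + n_m] for j in range(m)])
```

For example, a series of length 10 with m = 3 and tau = 2 yields N_m = 6 embedded vectors.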

From the N_m points, we can calculate the distance matrix (DM), a square N_m × N_m matrix whose elements are the distances between all possible pairs of points i and j, computed according to the selected norm; generally, the Euclidean norm is used [47]. The DM can be rescaled by dividing each of its elements by a fixed value, which allows systems operating on different scales to be compared statistically. The most commonly used (and recommended) rescaling value is the maximum distance of the entire matrix, which redefines the DM over the unit interval (0.0–100.0%).

Once the rescaled DM = (D_{i,j})_{N_m×N_m} is calculated, it can be transformed into a recurrence matrix (RM) of distance elements within a threshold ε (the radius): RM = (R_{i,j}(ε))_{N_m×N_m} with R_{i,j}(ε) = H(ε − D_{i,j}), i, j = 1, 2, . . . , N_m, where H is the Heaviside function

H(x) = 0 if x < 0, and H(x) = 1 if x ≥ 0        (16.26)

The RP is simply a visualization of the RM obtained by plotting a point on the ij plane for each element of RM equal to 1. If R_{i,j}(ε) = 1, we say that point j recurs with reference to point i. Since R_{i,i}(ε) ≡ 1 for any ε (i = 1, 2, . . . , N_m), the RP always has a black main diagonal line. Furthermore, the RP is symmetric with respect to the main diagonal because R_{i,j}(ε) = R_{j,i}(ε), i, j = 1, 2, . . . , N_m. The radius ε is a crucial parameter of the RP. If ε is chosen too small, there may be almost no recurrence points, and we cannot learn about the recurrence structure of the underlying system; if ε is chosen too large, almost every point is a neighbor of every other point, which leads to a large number of artifacts [68]. The selection of ε was discussed numerically in [52].
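The rescaled distance matrix and the thresholded recurrence matrix can be sketched as follows (Euclidean norm, rescaling by the maximum distance, with the threshold eps given as a fraction of that maximum; an illustrative sketch, not the chapter's code):

```python
import numpy as np

def recurrence_matrix(points, eps):
    """Rescaled distance matrix and recurrence matrix R_ij = H(eps - D_ij).
    `points` is the (N_m x m) array of embedded vectors; `eps` is the radius
    on the rescaled DM, whose entries lie in [0, 1]."""
    diff = points[:, None, :] - points[None, :, :]
    dm = np.sqrt(np.sum(diff ** 2, axis=-1))   # Euclidean distances
    dm = dm / dm.max()                         # rescale by the maximum distance
    rm = (dm <= eps).astype(int)               # Heaviside thresholding
    return dm, rm
```

The resulting matrix is symmetric with an all-ones main diagonal, as described above.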

Recurrence quantification analysis (RQA) is a relatively new nonlinear technique proposed by Zbilut and Webber [22,23] that quantifies the information supplied by the RP. Eight recurrence variables are usually used to quantify the RP [68]. It should be pointed out that the recurrence points in the following definitions are only those in the upper triangle of the RP (excluding the main diagonal line). The first five variables are defined on the diagonal line structure, and the remaining three on the vertical line structure:

1. %recurrence (%REC) measures the density of recurrence points in the RP. It can range from 0% (no recurrent points) to 100% (all points are recurrent).

2. %determinism (%DET) measures the proportion of recurrent points forming diagonal line structures. One must first decide how many adjacent recurrent points are needed to define a diagonal line segment; the minimum number required (and commonly used) is 2.

3. linemax (Lmax) is simply the length of the longest diagonal line segment in the RP. It is a very important recurrence variable because it scales inversely with the largest positive Lyapunov exponent.

4. entropy (ENT) is the Shannon information entropy of the probability distribution of the diagonal line lengths.

5. trend (TND) quantifies the degree of system stationarity. It is calculated as the slope of the least-squares regression of %local recurrence as a function of the displacement from the main diagonal. Here %local recurrence is the proportion of recurrent points on a given line parallel to the main diagonal, relative to the length of that line; whereas %recurrence is calculated over the whole upper triangle of the RP, %local recurrence is computed on single lines, hence the term "local". Multiplying the slope by 1000 increases the gain of the TND variable.

6. %laminarity (%LAM) is analogous to %DET but is calculated from recurrent points constituting vertical line structures. Similarly, we select 2 as the minimum number of adjacent recurrent points forming a vertical line segment.

7. trapping time (TT) is the average length of the vertical line structures.

8. Vmax is the maximal length of the vertical lines in the RP, analogous to Lmax.
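Three of the diagonal-line variables (%REC, %DET, Lmax) can be computed from the upper triangle of a recurrence matrix; the sketch below is a simplified illustration of those three, not a full eight-variable RQA implementation:

```python
import numpy as np

def rqa_basic(rm, lmin=2):
    """%REC, %DET and Lmax from the upper triangle of a recurrence matrix
    (main diagonal excluded); lmin adjacent points define a diagonal line."""
    n = rm.shape[0]
    total = n * (n - 1) // 2                 # points in the upper triangle
    rec_points = 0
    diag_lengths = []
    for d in range(1, n):                    # diagonals above the main one
        line = np.diagonal(rm, offset=d)
        rec_points += int(line.sum())
        run = 0
        for v in np.append(line, 0):         # trailing 0 ends the last run
            if v:
                run += 1
            else:
                if run >= lmin:
                    diag_lengths.append(run)
                run = 0
    pct_rec = 100.0 * rec_points / total
    det_points = sum(diag_lengths)
    pct_det = 100.0 * det_points / rec_points if rec_points else 0.0
    lmax = max(diag_lengths) if diag_lengths else 0
    return pct_rec, pct_det, lmax
```

For a fully recurrent 5 × 5 matrix, %REC is 100%, Lmax is 4 (the longest off-diagonal), and %DET is 90% because the length-1 corner diagonal does not qualify as a line.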

16.2.12 Empirical Mode Decomposition and Similarity of Proteins

Empirical mode decomposition (EMD) was originally designed for nonlinear and nonstationary data analysis by Huang et al. [24]. The traditional EMD decomposes a time series into components called intrinsic mode functions (IMFs) that define meaningful frequencies of a signal.

Lin et al. [25] proposed a new algorithm for EMD. Instead of using the envelopes generated by splines, a lowpass filter is used to generate a moving average that replaces the mean of the envelopes; the essence of the sifting algorithm remains. Let L be a lowpass filter operator, where L(X)(t) represents a moving average of X, and define T(X) = X − L(X). In this approach, the lowpass filter L depends on the data X. For a given X(t), we choose a lowpass filter L_1 accordingly and set T_1 = I − L_1, where I is the identity operator. The first IMF in the new EMD is given by I_1 = lim_{n→∞} T_1^n(X). Subsequently, the kth IMF I_k is obtained by first selecting a lowpass filter L_k according to the data X − I_1 − · · · − I_{k−1} and then iterating I_k = lim_{n→∞} T_k^n(X − I_1 − · · · − I_{k−1}), where T_k = I − L_k. The process stops when Y = X − I_1 − · · · − I_K has at most one local maximum or local minimum. Lin et al. [25] suggested using the filter Y = L(X) given by Y(n) = Σ_{j=−m}^{m} a_j X(n + j). We selected the mask

a_j = (m − |j| + 1)/(m + 1)^2,    j = −m, . . . , m

in our work [53]. Let r(t) = X(t) − I_1(t) − · · · − I_{K1}(t). The original signal can then be expressed as X(t) = Σ_{i=1}^{K1} I_i(t) + r(t), where the number K1 can be chosen according to a standard deviation criterion. In our work, the number of IMF components was set to 4 because of the short length of some amino acid sequences [53].
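The triangular mask and the iteration T_1^n(X) can be sketched as follows. This is an illustrative fixed-iteration version: the method described above iterates to convergence, and the boundary handling here (reflection at the ends) is our assumption, not specified in the chapter.

```python
import numpy as np

def triangular_mask(m):
    """Mask a_j = (m - |j| + 1) / (m + 1)^2, j = -m..m; the weights sum to 1."""
    j = np.arange(-m, m + 1)
    return (m - np.abs(j) + 1) / float((m + 1) ** 2)

def moving_average(x, mask):
    """L(X): moving average, with reflected ends to preserve the length."""
    m = (len(mask) - 1) // 2
    xp = np.pad(np.asarray(x, dtype=float), m, mode="reflect")
    return np.convolve(xp, mask, mode="valid")

def first_imf(x, m=8, n_iter=30):
    """Approximate I_1 = lim T_1^n(X), with T_1 = I - L_1, by iterating a
    fixed number of times (a sketch; convergence checks are omitted)."""
    mask = triangular_mask(m)
    h = np.asarray(x, dtype=float)
    for _ in range(n_iter):
        h = h - moving_average(h, mask)
    return h
```

Since the mask weights sum to 1, a constant signal is annihilated after the first iteration (its moving average equals the signal itself), so only oscillatory content survives in the extracted component.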

The similarity value of two proteins at each component (IMF) is obtained as the maximum absolute value of the correlation coefficient. In our work [53], a new cross-correlation coefficient C_12(j) is defined by

C_12(j) = Σ_{n=0}^{N−1} S_1(n) S_2(n − j) / [Σ_{n=0}^{N1−1} S_1^2(n) Σ_{n=0}^{N2−1} S_2^2(n)]^{1/2},    j = 0, ±1, ±2, . . .        (16.27)

where N is the length of the intersection of the two signals at lag j, N1 is the length of signal S_1, and N2 is the length of signal S_2. The maximum absolute value C over all the correlation coefficients of the components is taken as the similarity value for the two proteins.
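Equation (16.27) and the resulting similarity value can be sketched as follows (an illustrative implementation; the exact overlap bookkeeping for negative lags is our assumption about the intended indexing):

```python
import numpy as np

def cross_corr(s1, s2, j):
    """C_12(j) of equation (16.27): the lagged inner product over the overlap
    of the two signals, normalized by the total signal energies."""
    s1 = np.asarray(s1, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    # indices n with 0 <= n < len(s1) and 0 <= n - j < len(s2)
    lo, hi = max(0, j), min(len(s1), len(s2) + j)
    if lo >= hi:
        return 0.0                     # no overlap at this lag
    num = float(np.dot(s1[lo:hi], s2[lo - j:hi - j]))
    den = np.sqrt(np.sum(s1 ** 2) * np.sum(s2 ** 2))
    return num / den if den else 0.0

def similarity(s1, s2, max_lag=None):
    """Similarity value: max over lags j of |C_12(j)|."""
    if max_lag is None:
        max_lag = max(len(s1), len(s2)) - 1
    return max(abs(cross_corr(s1, s2, j))
               for j in range(-max_lag, max_lag + 1))
```

With this normalization, a signal compared with itself attains similarity 1 at lag zero, and by the Cauchy–Schwarz inequality no lag can exceed 1.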

16.3 RESULTS AND CONCLUSIONS

In our earlier paper [50], we selected the amino acid sequences of 43 large proteins from the RCSB Protein Data Bank (http://www.rcsb.org/pdb/index.html). These 43 proteins belong to four structural classes according to their secondary structures:

1. We converted the amino acid sequences of these proteins into their measure representations based on the detailed HP model with K = 5. We found that the IFS model corresponding to K = 5 is a good model for simulating the measure representation of protein sequences, and that the estimated value of the probability P1 from the IFS model contains information useful for the secondary structural classification of proteins [30]. We performed an IFS simulation for the selected proteins and adopted the estimated parameter P1 as one parameter for constructing the parameter space for proteins.

2. We converted the amino acid sequences of these proteins to their Z-curve representations and performed detrended fluctuation analysis on them. The exponents λx, λy, λz were estimated and used as candidate parameters to construct the parameter space.

3. We computed the generalized fractal dimensions Dq, the related spectra Cq, and the multifractal spectra f(α) of the hydrophobic free-energy sequences and solvent accessibility sequences of all 43 proteins.

4. For a structural classification of proteins, we considered the following parameters as candidates for constructing parameter spaces: P1 from the IFS estimations of the measure representations; the exponents λx, λy, λz from the detrended fluctuation analysis of the Z-curve representations; the range of Dq (i.e., the value D−15 − D15 in our frame); the maximum value of Cq (denoted MaxCq); the value q0 of q that corresponds to the maximum value of Cq; and the maximum value of α (denoted αmax), the minimum value of α (denoted αmin), and Δα (defined as αmax − αmin) from the multifractal analysis of the hydrophobic free-energy sequences and solvent accessibility sequences of proteins.

In a parameter space, one point represents a protein. We wanted to determine whether the proteins of the four structural classes can be separated in these parameter spaces. We proposed a method consisting of three components to cluster proteins [50], and used Fisher's linear discriminant algorithm to give a quantitative assessment of our clustering on the selected proteins. The discriminant accuracies are satisfactory; in particular, they reach 94.12% and 88.89% in separating the β proteins from the {α, α + β, α/β} proteins in a 3D space.
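Fisher's linear discriminant algorithm, used throughout these assessments, can be sketched for the two-class case as follows (a generic implementation under standard assumptions, not the chapter's exact code; the feature arrays and cluster parameters are illustrative):

```python
import numpy as np

def fisher_direction(x1, x2):
    """Fisher's linear discriminant direction for two classes:
    w = S_w^{-1} (mu_1 - mu_2), with S_w the within-class scatter matrix."""
    mu1, mu2 = x1.mean(axis=0), x2.mean(axis=0)
    sw = np.cov(x1, rowvar=False) * (len(x1) - 1) \
       + np.cov(x2, rowvar=False) * (len(x2) - 1)
    return np.linalg.solve(sw, mu1 - mu2)

def discriminant_accuracy(x1, x2):
    """Fraction of points falling on the correct side of the midpoint
    threshold along the Fisher direction."""
    w = fisher_direction(x1, x2)
    c = 0.5 * (x1.mean(axis=0) + x2.mean(axis=0)) @ w
    correct = np.sum(x1 @ w > c) + np.sum(x2 @ w < c)
    return correct / (len(x1) + len(x2))
```

For two well-separated clusters of feature points, the accuracy computed this way approaches 100%, which is the sense in which the percentages quoted above are reported.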

In [51], we considered a set of 49 large proteins that included the 43 proteins studied earlier [50]. Given the amino acid sequence of a protein, we first converted it into its measure representation μK based on the six-letter model with length K = 5. Then we calculated Dq, τ(q), Cq, α, and f(α) for the measures μK of the 49 selected proteins. We next converted the amino acid sequences of the proteins into their RSWH sequences according to the revised Schneider–Wrede hydrophobicity scale and used these sequences to construct the measures ν, on which the ordinary multifractal analysis was performed. The AMFA was also performed on the RSWH sequences. Nine parameters from these analyses were then selected as candidates for constructing parameter spaces. We proposed another three steps to cluster protein structures [51]. Fisher's linear discriminant algorithm was used to assess our clustering accuracy on the 49 selected large proteins. The discriminant accuracies are satisfactory. In particular, they reach 100.00% and 84.21% in separating the α proteins from the {β, α + β, α/β} proteins in one parameter space; 92.86% and 86.96% in separating the β proteins from the {α + β, α/β} proteins in another parameter space; and 91.67% and 83.33% in separating the α/β proteins from the α + β proteins in the last parameter space.

In [52], we aimed to predict protein structural classes (α, β, α + β, or α/β) for low-similarity datasets. Two widely used datasets were considered: 1189 (containing 1092 proteins) and 25PDB (containing 1673 proteins), with sequence similarity values of 40% and 25%, respectively. We proposed decomposing the chaos game representation of proteins into two kinds of time series and then applied recurrence quantification analysis to these time series. For a given protein sequence, a total of 16 characteristic parameters can be calculated with RQA and treated as a feature representation of the protein. On the basis of this feature representation, the structural class of each protein was predicted with Fisher's linear discriminant algorithm. The overall accuracies with the step-by-step procedure are 65.8% and 64.2% for the 1189 and 25PDB datasets, respectively. With the widely used one-against-others procedure, we compared our method with five other existing methods; in particular, the overall accuracies of our method are 6.3% and 4.1% higher for the two datasets, respectively. Furthermore, only 16 parameters were used in our method, fewer than in the other methods.

Family identification is helpful in predicting protein functions. Since most protein sequences are relatively short, we first randomly linked protein sequences from the same family or superfamily together to form 120 longer protein sequences [53], with each structural class containing 30 linked protein sequences. We then used the 6-letter model, the 12-letter model, the 20-letter model, the revised Schneider–Wrede hydrophobicity scale, solvent accessibility, and stochastic standard state accessibility values to convert the linked protein sequences to numerical sequences. We calculated the generalized fractal dimensions Dq, the related spectra Cq, the multifractal spectra f(α), and the h(q) curves of the six kinds of numerical sequences of all 120 linked proteins. The curves of Dq, Cq, f(α), and h(q) showed that the numerical sequences from linked proteins are multifractal-like and sufficiently smooth. The Cq curves resemble a phase transition at a certain point, while the f(α) and h(q) curves indicate the multifractal scaling features of proteins. In wavelet analysis, the choice of wavelet function should be considered carefully, since different wavelet functions represent a given function with different approximation components. We [53] chose the commonly used Daubechies wavelet db2 and computed the maximum of the wavelet spectrum and the maximum of the local wavelet spectrum for the six kinds of numerical sequences of all 120 linked proteins. The parameters from the multifractal and wavelet analyses were used to construct parameter spaces in which each linked protein is represented by a point. The four classes of proteins were then distinguished in these parameter spaces, and the discriminant accuracies obtained through Fisher's linear discriminant algorithm are satisfactory in separating these classes. We found that the linked proteins from the same family or superfamily tend to group together and can be separated from other linked proteins. The methods are thus also helpful for identifying the family of an unknown protein.

Zhu et al. [54] applied component similarity analysis based on EMD and the new cross-correlation coefficient formula (16.27) to protein pairs. They considered the maximum absolute value C of all the correlation coefficients of the components as the similarity value for two proteins, and they set thresholds on the correlation [54]: two signals are considered strongly correlated if the correlation coefficient exceeds ±0.7 and weakly correlated if the coefficient is between ±0.6 and ±0.7. The results showed that the functional relationships of some proteins may be revealed by component analysis of their IMFs. Compared with traditional alignment methods, component analysis can be evaluated and described easily. This illustrates that EMD and component analysis can complement traditional sequence similarity approaches that focus on the alignment of amino acids.

From our analyses, we found that the measure representation based on the detailed HP model and the six-letter model, the time series representation based on physicochemical features of amino acids, the Z-curve representation, and the chaos game representation of proteins can provide much information for predicting the structural classes and functions of proteins. Fractal methods are useful for analyzing protein sequences, and our methods may play a role complementary to the existing methods.

ACKNOWLEDGMENT

This work was supported by the Natural Science Foundation of China (Grant 11071282); the Chinese Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) (Grant IRT1179); the Research Foundation of the Education Commission of Hunan Province, China (Grant 11A122); the Lotus Scholars Program of Hunan Province of China; the Aid Program for Science and Technology Innovative Research Team in Higher Educational Institutions of Hunan Province of China; and the Australian Research Council (Grant DP0559807).

REFERENCES

1. Hou J, Jun S-R, Zhang C, Kim S-H, Global mapping of the protein structure space and application in structure-based inference of protein function, Proc. Natl. Acad. Sci. USA 102:3651–3656 (2005).

2. Anfinsen C, Principles that govern the folding of protein chains, Science 181:223–230 (1973).

3. Chothia C, One thousand families for the molecular biologist, Nature (Lond.) 357:543–544 (1992).

4. Frishman D, Argos P, Knowledge-based protein secondary structure assignment, Proteins 23:566–579 (1995).

5. Crooks GE, Brenner SE, Protein secondary structure: Entropy, correlation and prediction, Bioinformatics 20:1603–1611 (2004).

6. Adamczak R, Porollo A, Meller J, Combining prediction of secondary structure and solvent accessibility in proteins, Proteins 59:467–475 (2005).

7. Levitt M, Chothia C, Structural patterns in globular proteins, Nature 261:552–558 (1976).

8. Service RF, A dearth of new folds, Science 307:1555 (2005).

9. Chou KC, Progress in protein structural class prediction and its impact to bioinformatics and proteomics, Curr. Protein Peptide Sci. 6(5):423–436 (2005).

10. Trad CH, Fang Q, Cosic I, Protein sequence comparison based on the wavelet transform approach, Protein Eng. 15(3):193–203 (2002).

11. Lesk AM, Computational Molecular Biology: Sources and Methods for Sequence Analysis, Oxford Univ. Press, 1988.

12. Mandelbrot BB, The Fractal Geometry of Nature, W. H. Freeman, New York, 1983.

13. Feder J, Fractals, Plenum Press, New York, 1988.

14. Grassberger P, Procaccia I, Characterization of strange attractors, Phys. Rev. Lett. 50:346–349 (1983).

15. Yu ZG, Anh V, Eastes R, Multifractal analysis of geomagnetic storm and solar flare indices and their class dependence, J. Geophys. Res. 114:A05214 (2009).

16. Yu ZG, Anh VV, Lau KS, Measure representation and multifractal analysis of complete genome, Phys. Rev. E 64:031903 (2001).

17. Yu ZG, Anh VV, Wang B, Correlation property of length sequences based on global structure of complete genome, Phys. Rev. E 63:011903 (2001).

18. Peng CK, Buldyrev S, Goldberger AL, Havlin S, Sciortino F, Simons M, Stanley HE, Long-range correlations in nucleotide sequences, Nature 356:168–170 (1992).

19. Chui CK, An Introduction to Wavelets, Academic Press Professional, San Diego, 1992.

20. Daubechies I, Ten Lectures on Wavelets, SIAM, Philadelphia, 1992.


21. Eckmann JP, Kamphorst SO, Ruelle D, Recurrence plots of dynamical systems, Europhys. Lett. 4:973–977 (1987).

22. Zbilut JP, Webber CL Jr, Embeddings and delays as derived from quantification of recurrence plots, Phys. Lett. A 171:199–203 (1992).

23. Webber CL Jr, Zbilut JP, Dynamical assessment of physiological systems and states using recurrence plot strategies, J. Appl. Physiol. 76:965–973 (1994).

24. Huang N, Shen Z, Long SR, Wu ML, Shih HH, Zheng Q, Yen NC, Tung CC, Liu HH, The empirical mode decomposition and Hilbert spectrum for nonlinear and nonstationary time series analysis, Proc. Roy. Soc. Lond. A 454:903–995 (1998).

25. Lin L, Wang Y, Zhou H, Iterative filtering as an alternative for empirical mode decomposition, Adv. Adapt. Data Anal. 1(4):543–560 (2009).

26. Janosi IM, Muller R, Empirical mode decomposition and correlation properties of long daily ozone records, Phys. Rev. E 71:056126 (2005).

27. Yu ZG, Anh V, Wang Y, Mao D, Wanliss J, Modeling and simulation of the horizontal component of the geomagnetic field by fractional stochastic differential equations in conjunction with empirical mode decomposition, J. Geophys. Res. 115:A10219 (2010).

28. Dewey TG, Fractal analysis of proton-exchange kinetics in lysozyme, Proc. Natl. Acad. Sci. USA 91:12101–12104 (1994).

29. Fiser A, Tusnady GE, Simon I, Chaos game representation of protein structure, J. Mol. Graphics 12:302–304 (1994).

30. Yu ZG, Anh VV, Lau KS, Fractal analysis of large proteins based on the detailed HP model, Physica A 337:171–184 (2004).

31. Yu ZG, Anh VV, Lau KS, Chaos game representation, and multifractal and correlation analysis of protein sequences from complete genome based on detailed HP model, J. Theor. Biol. 226:341–348 (2004).

32. Enright MB, Leitner DM, Mass fractal dimension and the compactness of proteins, Phys. Rev. E 71:011912 (2005).

33. Moret MA, Miranda JGV, Nogueira E, et al, Self-similarity and protein chains, Phys. Rev. E 71:012901 (2005).

34. Pavan YS, Mitra CK, Fractal studies on the protein secondary structure elements, Indian J. Biochem. Biophys. 42:141–144 (2005).

35. Balafas JS, Dewey TG, Multifractal analysis of solvent accessibilities in proteins, Phys. Rev. E 52:880–887 (1995).

36. Yu ZG, Anh VV, Lau KS, Multifractal and correlation analysis of protein sequences from complete genome, Phys. Rev. E 68:021913 (2003).

37. Mandell AJ, Selz KA, Shlesinger MF, Mode matches and their locations in the hydrophobic free energy sequences of peptide ligands and their receptor eigenfunctions, Proc. Natl. Acad. Sci. USA 94:13576–13581 (1997).

38. Mandell AJ, Selz KA, Shlesinger MF, Wavelet transformation of protein hydrophobicity sequences suggests their memberships in structural families, Physica A 244:254–262 (1997).

39. Selz KA, Mandell AJ, Shlesinger MF, Hydrophobic free energy eigenfunctions of pore, channel, and transporter proteins contain β-burst patterns, Biophys. J. 75:2332–2342 (1998).


40. Hirakawa H, Muta S, Kuhara S, The hydrophobic cores of proteins predicted by wavelet analysis, Bioinformatics 15:141–148 (1999).

41. Chen H, Gu F, Liu F, Predicting protein secondary structure using continuous wavelet transform and Chou–Fasman method, Proc. 27th Annual IEEE Engineering in Medicine and Biology Conf., 2005.

42. Marsolo K, Ramamohanarao K, Structure-based querying of proteins using wavelets, Proc. 15th Int. ACM Conf. Information and Knowledge Management, 2006, pp. 24–33.

43. Qiu JD, Liang RP, Zou XY, Mo JY, Prediction of protein secondary structure based on continuous wavelet transform, Talanta 61(3):285–293 (2003).

44. Rezaei M, Abdolmaleki P, Jahandideh S, Karami Z, Asadabadi EB, Sherafat MA, Abrishami-Moghaddam H, Fadaie M, Foroozan M, Prediction of membrane protein types by means of wavelet analysis and cascaded neural networks, J. Theor. Biol. 254(4):817–820 (2008).

45. Pando J, Sands L, Shaheen SE, Detection of protein secondary structures via the discrete wavelet transform, Phys. Rev. E 80:051909 (2009).

46. Webber CL Jr, Giuliani A, Zbilut JP, Colosimo A, Elucidating protein secondary structures using alpha-carbon recurrence quantifications, Proteins: Struct., Funct., Genet. 44:292–303 (2001).

47. Giuliani A, Benigni R, Zbilut JP, Webber CL Jr, Sirabella P, Colosimo A, Signal analysis methods in elucidation of protein sequence-structure relationships, Chem. Rev. 102:1471–1491 (2002).

48. Shi F, Chen Q, Niu X, Functional similarity analyzing of protein sequences with empirical mode decomposition, Proc. 4th Int. Conf. Fuzzy Systems and Knowledge Discovery, 2007.

49. Shi F, Chen QJ, Li NN, Hilbert Huang transform for predicting proteins subcellular location, J. Biomed. Sci. Eng. 1:59–63 (2008).

50. Yu ZG, Anh VV, Lau KS, Zhou LQ, Fractal and multifractal analysis of hydrophobic free energies and solvent accessibilities in proteins, Phys. Rev. E 73:031920 (2006).

51. Yang JY, Yu ZG, Anh V, Clustering structure of large proteins using multifractal analyses based on 6-letters model and hydrophobicity scale of amino acids, Chaos, Solitons, Fractals 40:607–620 (2009).

52. Yang JY, Peng ZL, Yu ZG, Zhang RJ, Anh V, Wang D, Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation, J. Theor. Biol. 257:618–626 (2009).

53. Zhu SM, Yu ZG, Anh V, Protein structural classification and family identification by multifractal analysis and wavelet spectrum, Chin. Phys. B 20:010505 (2011).

54. Zhu SM, Yu ZG, Anh V, Yang SY, Analysing the similarity of proteins based on a new approach to empirical mode decomposition, Proc. 4th Int. Conf. Bioinformatics and Biomedical Engineering (ICBBE2010), vol. 1, 2010.

55. Chou PY, Fasman GD, Prediction of protein conformation, Biochemistry 13:222–245 (1974).

56. Bordo D, Argos P, Suggestions for safe residue substitutions in site-directed mutagenesis, J. Mol. Biol. 217:721–729 (1991).


57. Rose GD, Geselowitz AR, Lesser GJ, Lee RH, Zehfus MH, Hydrophobicity of amino acid residues in globular proteins, Science 229:834–838 (1985).

58. Krigbaum WR, Komoriya A, Local interactions as a structure determinant for protein molecules: II, Biochim. Biophys. Acta 576:204–228 (1979).

59. Zhang R, Zhang CT, Z curves, an intuitive tool for visualizing and analyzing the DNA sequences, J. Biomol. Struct. Dyn. 11:767–782 (1994).

60. Basu S, Pan A, Dutta C, Das J, Chaos game representation of proteins, J. Mol. Graphics 15:279–289 (1997).

61. Deschavanne P, Tuffery P, Exploring an alignment free approach for protein classification and structural class prediction, Biochimie 90:615–625 (2008).

62. Jeffrey HJ, Chaos game representation of gene structure, Nucleic Acids Res. 18:2163–2170 (1990).

63. Brown TA, Genetics, 3rd ed., Chapman & Hall, London, 1998.

64. Barnsley MF, Demko S, Iterated function systems and the global construction of fractals, Proc. Roy. Soc. Lond. A 399:243–275 (1985).

65. Vrscay ER, Iterated function systems: Theory, applications and the inverse problem, in Belair J, Dubuc S (eds.), Fractal Geometry and Analysis, NATO ASI series, Kluwer, 1991.

66. Canessa E, Multifractality in time series, J. Phys. A: Math. Gen. 33:3637–3651 (2000).

67. Arneodo A, Bacry E, Muzy JF, The thermodynamics of fractals revisited with wavelets, Physica A 213:232–275 (1995).

68. Marwan N, Romano MC, Thiel M, Kurths J, Recurrence plots for the analysis of complex systems, Phys. Reports 438:237–329 (2007).