
Available online at www.sciencedirect.com

www.elsevier.com/locate/cplett

Chemical Physics Letters 454 (2008) 355–361

A 3D graphical representation of RNA secondary structures based on chaos game representation

Jie Feng *, Tian-ming Wang

Department of Applied Mathematics, Dalian University of Technology, Linggong Street, No. 2, Dalian 116024, PR China

Received 7 December 2007; in final form 16 January 2008
Available online 21 January 2008

Abstract

In this Letter, based on chaos game representation (CGR), we propose a 3D graphical representation for RNA secondary structures in terms of classifications of bases of nucleic acids. Some information on the base distribution and composition of an RNA secondary structure can be read intuitively from the graphical representation. Furthermore, a numerical characterization of the graphical representation is applied to compute the similarities of RNA secondary structures. As an application, we make quantitative comparisons for two sets of RNA secondary structures based on the graphical representation.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Ribonucleic acid (RNA) is an important molecule that performs a wide range of functions in biological systems. In particular, it is RNA (not DNA) that carries the genetic information of viruses such as HIV and therefore regulates the functions of such viruses. RNA has recently become the center of much attention because of its catalytic properties, leading to an increased interest in obtaining structural information. The analysis of the similarity between RNA secondary structures provides valuable information for function prediction.

There are many algorithms for computing the similarity between RNA secondary structures [1–3]. So far, almost all such comparisons are based on alignments of RNA structures: a distance function or a score function is used to represent insertion, deletion and substitution of letters in the compared structures. Using the distance function, one can compute the similarity between RNA structures. Another approach to comparing RNA secondary structures, developed by Shapiro and Zhang [4], is based on the topological invariants of tree structures. However, that method does not directly use base-paired and unpaired nucleotides, and is not suitable for RNA secondary structures with pseudoknots.

* Corresponding author. E-mail addresses: [email protected] (J. Feng), [email protected] (T.-m. Wang).

0009-2614/$ - see front matter © 2008 Elsevier B.V. All rights reserved.
doi:10.1016/j.cplett.2008.01.041

Recently, motivated by the work of several researchers on graphical representations of DNA sequences [5–9], Liao and Wang [10] proposed to use graphs to represent RNA secondary structures and then to derive numerical invariants from the graphs in order to compare RNA secondary structures. Since then, different graphical measures have been extensively studied for the similarity analysis of RNA secondary structures [11–17]. The advantage of graphical representations is that they allow visual inspection of data, which helps in recognizing major differences among RNA secondary structures. Furthermore, numerical characterizations derived from the representation can be selected as invariants for comparing various RNA structures.

In this Letter, motivated by Jeffrey's ingenious work on the chaos game representation (CGR) of DNA sequences [18], we propose a 3D representation of RNA secondary structures and outline an approach to analyzing the similarities between RNA secondary structures. Our approach considers not only the sequence structure but also the chemical structure of RNA secondary structures.



To validate our method, we apply it to two sets of data. The secondary structures of set I, listed in Fig. 1, are the 3′-termini of nine different viruses reported by Bol [19]. Set II contains two kinds of RNA secondary structures: Synechocystis sp. PCC6803, Anabaena sp. PCC7120, Agrobacterium tumefaciens and Rhodospirillum rubrum are from the RNase P Database [28], and Halobacterium spl, Pyrodictium occultum and Planocera recticulata are from the 5S ribosomal RNA Database [29]. Of the seven secondary structures in set II, the four from the RNase P Database are shown in Fig. 2. At the end of the Letter, we apply hierarchical clustering analysis to the resulting similarity/dissimilarity distance matrix, and the result shows the utility of our approach.

Fig. 1. Secondary structure at the 3′-terminus of RNA 3 of alfalfa mosaic virus (AIMV-3 [20]), citrus leaf rugose virus (CiLRV-3 [21]), tobacco streak virus (TSV-3 [22]), citrus variegation virus (CVV-3 [21]), apple mosaic virus (APMV-3 [23]), prune dwarf ilarvirus (PDV-3 [24]), lilac ring mottle virus (LRMV-3 [25]), elm mottle virus (EMV-3 [26]) and asparagus virus II (AVII [27]). Numbering of nucleotides is from the 3′ end of RNA 3.

Fig. 2. Secondary structure of Synechocystis sp. PCC6803, Anabaena sp. PCC7120, Agrobacterium tumefaciens and Rhodospirillum rubrum.

2. Outline of the graphical representation of RNA secondary structures

The secondary structure of an RNA is a set of free bases and base pairs formed by bonds between A–U and G–C (the base pair G–U is also frequently allowed). Let A′, U′, G′, C′ denote A, U, G, C, respectively, when they occur in a base pair A–U, G–C or G–U. Then we obtain a special sequence representation of the secondary structure, which we call the characteristic sequence of the secondary structure. For example, the characteristic sequence of the substructure of CVV-3 (Fig. 3a) is GCC′U′C′C′GAAG′G′A′G′AU (from 5′ to 3′).
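As an illustration, the characteristic sequence can be generated mechanically once the pairing status of each base is known. The sketch below assumes the structure is supplied in dot-bracket notation, an input format that this Letter does not use, and writes the primed symbols with an ASCII apostrophe. The function name `characteristic_sequence` is hypothetical, and the dot-bracket string for the CVV-3 substructure is inferred from the characteristic sequence quoted above.

```python
def characteristic_sequence(sequence, dot_bracket):
    """Return the characteristic sequence as a list of symbols.

    Unpaired bases keep their letter; paired bases get a trailing
    apostrophe (A', U', G', C'). The dot-bracket input is an assumption
    made for this sketch: '.' marks a free base, any other character
    marks a paired base.
    """
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    symbols = []
    for base, mark in zip(sequence.upper(), dot_bracket):
        if base not in "AUGC":
            raise ValueError(f"unexpected base: {base!r}")
        symbols.append(base if mark == "." else base + "'")
    return symbols


# Substructure of CVV-3 (Fig. 3a), read from 5' to 3'.
cvv3 = characteristic_sequence("GCCUCCGAAGGAGAU", "..((((...))))..")
print("".join(cvv3))  # GCC'U'C'C'GAAG'G'A'G'AU
```

Only the paired/unpaired status of each base enters the characteristic sequence, so the sketch does not need to know which base pairs with which.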

Nucleic acids are linear macromolecules, and the analysis of RNA should take their chemical properties into account. In RNA primary sequences, the four bases A, U, G, C can be divided into two classes according to the strength of the hydrogen bond, i.e., weak H-bonds W = {A, U} and strong H-bonds S = {G, C}. The bases can also be divided into amino group M = {A, C} and keto group K = {G, U}. Finally, the division can be made according to their chemical structures, i.e., purine R = {A, G} and pyrimidine Y = {C, U}. Based on these three classifications of nucleic acids and CGR, we construct three curves confined to a regular tetrahedron for a given RNA secondary structure.

Fig. 3. (a) Substructure of CVV-3. (b) The characteristic curves (panels: W–S curve, M–K curve, R–Y curve; axes x, y, z) for the substructure of CVV-3 corresponding to three patterns, where (−1,−1,1) represents {A,U}, (1,1,1) represents {G,C}, (1,−1,−1) represents {A′,U′} and (−1,1,−1) represents {G′,C′}.

Let S be an arbitrary RNA secondary structure and let $S = s_1 s_2 \cdots s_n$ be the corresponding characteristic sequence, where n is its length. Following CGR, all we have to do is replace the two-dimensional representation of a DNA primary sequence, based on the assignment of the four bases to the four corners of a square, by an analogous three-dimensional representation based on the assignment of four base classes to the four corners of a tetrahedron. That is, the weak H-bond bases {A, U} are assigned the coordinate (−1, −1, 1), the strong H-bond bases {G, C} the coordinate (1, 1, 1), the paired weak H-bond bases {A′, U′} the coordinate (1, −1, −1), and the paired strong H-bond bases {G′, C′} the coordinate (−1, 1, −1). The point $P_i(x_i, y_i, z_i)$ corresponding to the ith element of the RNA characteristic sequence is calculated by the following formula:

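\[
\begin{cases}
x_i = \frac{1}{2}\,(x_{i-1} + x_{s_i}),\\
y_i = \frac{1}{2}\,(y_{i-1} + y_{s_i}),\\
z_i = \frac{1}{2}\,(z_{i-1} + z_{s_i}),
\end{cases}
\tag{1}
\]

for $i = 1, 2, \ldots, n$, where $s_i$ denotes the ith element in S, $(x_0, y_0, z_0) = (0, 0, 0)$, and $x_{s_i}$, $y_{s_i}$ and $z_{s_i}$ are calculated by the following formula:

\[
\begin{aligned}
x_{s_i} &= \begin{cases} -1, & \text{if } s_i \in \{A, U, C', G'\},\\ \phantom{-}1, & \text{if } s_i \in \{A', U', C, G\}, \end{cases}\\
y_{s_i} &= \begin{cases} -1, & \text{if } s_i \in \{A, U, A', U'\},\\ \phantom{-}1, & \text{if } s_i \in \{C, G, C', G'\}, \end{cases}\\
z_{s_i} &= \begin{cases} -1, & \text{if } s_i \in \{A', U', C', G'\},\\ \phantom{-}1, & \text{if } s_i \in \{A, U, C, G\}. \end{cases}
\end{aligned}
\tag{2}
\]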

In this way, S is converted into a series of points $P_1, P_2, \ldots, P_n$. Let the origin (0, 0, 0) be the point $P_0$. Connecting the points $P_0, P_1, P_2, \ldots, P_n$ in turn as the index i runs from 0 to n, we get a zigzag 3D curve within a regular tetrahedron. We call this curve the W–S characteristic curve of the RNA secondary structure, because it uses the weak–strong H-bond classification. The W–S curve is completely determined by the RNA secondary structure: once S is given, the curve is uniquely obtained by Eqs. (1) and (2). In like manner, if we assign the amino group {A, C}, the keto group {G, U}, the paired amino group bases {A′, C′} and the paired keto group bases {G′, U′} to the four corners of the regular tetrahedron, we obtain the M–K characteristic curve of the RNA secondary structure; the R–Y curve is obtained similarly. Hence, we get three different 3D curves corresponding to S.
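The construction of Eqs. (1) and (2) can be sketched in a few lines of plain Python. The helper names (`tokenize`, `corner`, `characteristic_curve`) are hypothetical. The corner order for the W–S pattern follows Eq. (2); for the M–K pattern the sketch places {G, U} at (1, 1, 1), as the discussion of Fig. 3b indicates, and the corner order for the R–Y pattern is assumed by analogy because the Letter does not list it.

```python
# First class -> (-1,-1,1) unpaired, (1,-1,-1) paired;
# second class -> (1,1,1) unpaired, (-1,1,-1) paired.
CLASSES = {
    "W-S": ("AU", "GC"),  # weak / strong H-bonds, as in Eq. (2)
    "M-K": ("AC", "GU"),  # amino / keto
    "R-Y": ("AG", "CU"),  # purine / pyrimidine (corner order assumed)
}


def tokenize(text):
    """Split e.g. "GC'A" into ["G", "C'", "A"]."""
    tokens = []
    for ch in text:
        if ch == "'":
            tokens[-1] += "'"
        else:
            tokens.append(ch)
    return tokens


def corner(symbol, pattern="W-S"):
    """Tetrahedron corner assigned to one characteristic-sequence symbol."""
    first, second = CLASSES[pattern]
    base, paired = symbol[0], symbol.endswith("'")
    if base in first:
        return (1, -1, -1) if paired else (-1, -1, 1)
    if base in second:
        return (-1, 1, -1) if paired else (1, 1, 1)
    raise ValueError(f"unexpected symbol: {symbol!r}")


def characteristic_curve(symbols, pattern="W-S"):
    """Points P0..Pn of Eq. (1): each point is the midpoint of the
    previous point and the corner of the current symbol."""
    points = [(0.0, 0.0, 0.0)]
    for symbol in symbols:
        cx, cy, cz = corner(symbol, pattern)
        px, py, pz = points[-1]
        points.append(((px + cx) / 2, (py + cy) / 2, (pz + cz) / 2))
    return points


curve = characteristic_curve(tokenize("GCC'U'C'C'GAAG'G'A'G'AU"))
print(curve[1])  # (0.5, 0.5, 0.5): unpaired G moves halfway toward (1, 1, 1)
```

Passing `pattern="M-K"` or `pattern="R-Y"` yields the other two curves under the stated assumptions.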

In Fig. 3b, we draw the three characteristic curves of the substructure of CVV-3 shown in Fig. 3a. Most of the points of the W–S curve lie near the vertex {G′, C′} at (−1, 1, −1), because G′ and C′ are the most frequent symbols in the characteristic sequence of the substructure of CVV-3. On the other hand, because unpaired G and U are the least frequent symbols in that characteristic sequence, the points near the vertex {G, U} at (1, 1, 1) are sparse in the M–K curve. Since the characteristic curves contain the information of the corresponding RNA secondary structure, multiple RNAs can be compared and analyzed by comparing and analyzing their 3D graphical representations.
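The two frequency statements above can be checked by counting how many symbols of the characteristic sequence map to each vertex. The following is a minimal sketch; `vertex_counts` is a hypothetical helper, and primed symbols are again written with an ASCII apostrophe.

```python
from collections import Counter

CLASSES = {"W-S": ("AU", "GC"), "M-K": ("AC", "GU"), "R-Y": ("AG", "CU")}


def vertex_counts(symbols, pattern="W-S"):
    """Count the symbols assigned to each of the four tetrahedron vertices."""
    first, second = CLASSES[pattern]
    counts = Counter()
    for symbol in symbols:
        base, paired = symbol[0], symbol.endswith("'")
        if base not in first + second:
            raise ValueError(f"unexpected symbol: {symbol!r}")
        group = first if base in first else second
        # Build a label such as "{G',C'}" for the vertex of this symbol.
        label = "{" + ",".join(b + ("'" if paired else "") for b in group) + "}"
        counts[label] += 1
    return counts


# Characteristic sequence of the CVV-3 substructure (Fig. 3a).
cvv3 = ["G", "C", "C'", "U'", "C'", "C'", "G", "A", "A",
        "G'", "G'", "A'", "G'", "A", "U"]

print(vertex_counts(cvv3, "W-S"))
print(vertex_counts(cvv3, "M-K"))
```

For this sequence the W–S vertex {G′, C′} collects 6 of the 15 symbols, and the M–K vertex {G, U} collects only 3.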

In Fig. 4, the W–S graphical representations for the secondary structures of sets I and II are displayed. Observing this figure, we immediately note that the points are dense in some graphs and sparse in others. This is because the sizes of the RNA secondary structures differ. Take Anabae. sp. and AIMV-3 for example: the points in the graphical representation of Anabae. sp. are more compact than those of AIMV-3. In fact, the former RNA secondary structure has 457 bases in total and the latter only 39. Furthermore, we find that the graphs of LRMV-3, EMV-3, AVII and CVV-3 are on the whole very similar to each other, and so are the graphical representations of Hal-s and Pla-r. Meanwhile, Synech. sp. has almost the same representation as Anabae. sp., which agrees with the fact that they share almost the same secondary structure (Fig. 2).

Moreover, in the W–S scatter plot of Anabae. sp., the points near the edge {A,U}–{G,C} of the regular tetrahedron are distributed approximately evenly. According to Eqs. (1) and (2), this suggests that in Anabae. sp. the number of bases A and U roughly equals that of bases G and C; for the same reason, the numbers of paired bases A′ and U′ and of paired bases G′ and C′ are roughly equal to each other. Indeed, the numbers of bases {A,U}, of bases {G,C}, of paired bases {A′,U′} and of paired bases {G′,C′} are 95, 67, 123 and 172, respectively. Similarly, in the W–S scatter plot of R. rubrum, many of the points within the regular tetrahedron approach the vertex {G′,C′}, which indicates that paired bases G′ and C′ outnumber the other bases in R. rubrum. In fact, there are 236 paired bases G′ and C′ out of 429 bases of R. rubrum, accounting for 55.01% of the total. This shows that some information on the base distribution and composition of an RNA secondary structure can be read intuitively from the graphical representation.

Fig. 4. The W–S scatterplot representations for the secondary structures in sets I and II, where (−1,−1,1) represents {A,U}, (1,1,1) represents {G,C}, (1,−1,−1) represents {A′,U′} and (−1,1,−1) represents {G′,C′}. Panels: AIMV-3, CiLRV-3, TSV-3, CVV-3, APMV-3, LRMV-3, PDV-3, EMV-3, AVII, Hal-s, Pyr-o, Pla-r, Synech.sp, Anabae.sp, A.tume, R.rubr.

3. Numerical characterization of RNA secondary structure

In order to find invariants sensitive to the form of the characteristic curve, we transform the graphical representation of the characteristic curve into another mathematical object, a matrix. Once we have a matrix representing the characteristic sequence of an RNA secondary structure, some matrix invariants are selected as descriptors of the sequence. One such matrix is the L/L matrix [30], whose elements $l_{i,j}$ are defined as the quotient of the Euclidean distance between a pair of vertices (dots) of the characteristic curve and the sum of distances between the same pair of vertices measured along the characteristic curve. In other words,

\[
l_{i,j} = \frac{d_{i,j}}{\sum_{k=i}^{j-1} d_{k,k+1}},
\]

where $d_{i,j}$ is the Euclidean distance between the pair of vertices i and j. The eigenvalues of this matrix, and in particular its leading eigenvalue, can be used as RNA secondary structure descriptors. Since the characteristic curve does not represent the genuine molecular geometry, we are not interested in the interpretation of the leading eigenvalues of these matrices, but in their use as numerical parameters that may facilitate comparisons of RNA secondary structures.
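A minimal sketch of the L/L matrix and its leading eigenvalue in plain Python follows. The function names are hypothetical, the diagonal is set to zero, and the eigenvalue is obtained by shifted power iteration, which is a choice made for this sketch and not a method stated in the Letter. Whether the origin $P_0$ counts as a vertex is also not stated, so the sketch simply uses whatever points it is given.

```python
import math


def ll_matrix(points):
    """L/L matrix: straight-line distance divided by distance along the curve.

    Assumes consecutive points are distinct, which holds for CGR curves.
    """
    n = len(points)
    # cumulative[k] = length of the curve from points[0] to points[k]
    cumulative = [0.0]
    for a, b in zip(points, points[1:]):
        cumulative.append(cumulative[-1] + math.dist(a, b))
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            ratio = math.dist(points[i], points[j]) / (cumulative[j] - cumulative[i])
            matrix[i][j] = matrix[j][i] = ratio
    return matrix


def leading_eigenvalue(matrix, tol=1e-12, max_iter=10_000):
    """Largest eigenvalue by power iteration on (matrix + I).

    The shift by the identity keeps the iteration from oscillating and
    is undone before returning. Suitable for the symmetric, non-negative
    L/L matrix, not for general matrices.
    """
    n = len(matrix)
    vec = [1.0 / math.sqrt(n)] * n
    value = 0.0
    estimate = 0.0
    for _ in range(max_iter):
        product = [vec[i] + sum(m * v for m, v in zip(matrix[i], vec))
                   for i in range(n)]
        estimate = sum(p * v for p, v in zip(product, vec))  # Rayleigh quotient
        norm = math.sqrt(sum(p * p for p in product))
        vec = [p / norm for p in product]
        if abs(estimate - value) < tol:
            break
        value = estimate
    return estimate - 1.0


# Three collinear points: every off-diagonal entry is 1.
line = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(leading_eigenvalue(ll_matrix(line)))
```

For a straight line every quotient equals 1, so the leading eigenvalue is n − 1 for n points; the more a curve folds back on itself, the smaller the entries become.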

By means of the leading eigenvalue of the L/L matrix, we characterize the RNA secondary structures at the 3′-terminus of the nine viruses in Fig. 1. In Table 1, we give the leading eigenvalues of the L/L matrices associated with three essentially different patterns of the characteristic curves representing each of the RNA secondary structures. Table 1 shows that the largest leading eigenvalue occurs for the R–Y pattern, except for AIMV-3 and APMV-3, where it occurs for the M–K pattern.

Table 1
The leading eigenvalues of the L/L matrices associated with three essentially different patterns of the characteristic curves for the secondary structures at the 3′-terminus belonging to nine viruses of Fig. 1

Patterns AIMV-3   CiLRV-3  TSV-3    CVV-3    APMV-3   LRMV-3   PDV-3    EMV-3    AVII
W–S      9.5633   13.6892  10.8395  8.7309   9.4184   9.8692   9.5275   9.1729   9.7844
M–K      9.8639   9.7931   8.7752   9.1720   14.7721  8.6687   10.7976  8.1902   9.5501
R–Y      8.8139   16.1021  10.9945  15.8580  9.4138   15.9447  11.0470  16.4562  16.1991

4. Similarities/dissimilarities among the RNA secondary structures

To illustrate the use of the quantitative characterization of RNA secondary structures, we analyze the similarities/dissimilarities among the nine viruses of set I listed in Fig. 1. For each secondary structure we construct a 3-component vector consisting of the leading eigenvalues of the L/L matrices associated with the three essentially different patterns of the characteristic curves. The analysis of similarity/dissimilarity among RNA secondary structures represented by the 3-component vectors is based on the assumption that two RNA secondary structures are similar if the corresponding 3-component vectors point in a similar direction in 3D space and have similar magnitudes. The similarity between two such vectors can be measured by the Euclidean distance between their end points. Clearly, the smaller the Euclidean distance between the end points of two vectors, the more similar the RNA secondary structures.

In Table 2, we give the similarity/dissimilarity matrix for the secondary structures at the 3′-terminus belonging to the nine viruses of Fig. 1, based on the Euclidean distance between the end points of the 3-component vectors of the leading eigenvalues of the L/L matrices. We believe that it is not accidental that the smallest entries in Table 2 are associated with the pairs (LRMV-3, AVII), (LRMV-3, EMV-3), (CVV-3, AVII), (CVV-3, EMV-3), (LRMV-3, CVV-3) and (AVII, EMV-3). On the other hand, the larger entries in the similarity/dissimilarity matrix appear in the row belonging to APMV-3.

Table 2
The similarity/dissimilarity matrix for the secondary structures at the 3′-terminus belonging to nine viruses of Fig. 1 based on the Euclidean distance between the end points of the 3-component vectors of the leading eigenvalues of the L/L matrices

Species  AIMV-3   CiLRV-3  TSV-3    CVV-3    APMV-3   LRMV-3   PDV-3    EMV-3    AVII
AIMV-3   0        8.3753   2.7512   7.1268   4.9468   7.2368   2.4208   7.8332   7.3952
CiLRV-3           0        5.9367   5.0030   9.3682   3.9852   6.6244   4.8054   3.9135
TSV-3                      0        5.3158   6.3625   5.0456   2.4113   5.7402   5.3667
CVV-3                               0        8.5652   1.2476   5.1403   1.2317   1.1702
APMV-3                                       0        8.9503   4.2984   9.6425   8.5700
LRMV-3                                                0        5.3513   0.9876   0.9212
PDV-3                                                          0        6.0152   5.3072
EMV-3                                                                   0        1.5130
AVII                                                                             0
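The entries of Table 2 follow from Table 1 by a plain Euclidean distance between 3-component vectors. The sketch below copies the Table 1 values and recomputes the distances; the names `EIGENVALUES` and `distance_matrix` are hypothetical. Because the tabulated eigenvalues are rounded to four decimals, recomputed distances may differ from Table 2 in the last digit.

```python
import math

# Leading eigenvalues (W-S, M-K, R-Y) copied from Table 1.
EIGENVALUES = {
    "AIMV-3": (9.5633, 9.8639, 8.8139),
    "CiLRV-3": (13.6892, 9.7931, 16.1021),
    "TSV-3": (10.8395, 8.7752, 10.9945),
    "CVV-3": (8.7309, 9.1720, 15.8580),
    "APMV-3": (9.4184, 14.7721, 9.4138),
    "LRMV-3": (9.8692, 8.6687, 15.9447),
    "PDV-3": (9.5275, 10.7976, 11.0470),
    "EMV-3": (9.1729, 8.1902, 16.4562),
    "AVII": (9.7844, 9.5501, 16.1991),
}


def distance_matrix(vectors):
    """Euclidean distance between the end points of every pair of vectors."""
    names = list(vectors)
    return {(a, b): math.dist(vectors[a], vectors[b])
            for a in names for b in names}


table = distance_matrix(EIGENVALUES)
print(round(table[("AIMV-3", "CiLRV-3")], 4))  # 8.3753, as in Table 2

# The most similar pair of distinct structures.
closest = min((pair for pair in table if pair[0] < pair[1]), key=table.get)
print(closest)
```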

Fig. 5. The dendrogram of hierarchical clustering for the 16 secondary structures by using XLMINER program. 1 – AIMV-3, 2 – CiLRV-3, 3 – TSV-3, 4 – CVV-3, 5 – APMV-3, 6 – LRMV-3, 7 – PDV-3, 8 – EMV-3, 9 – AVII, 10 – Halobacterium sp. (5S RNA), 11 – Pyrodictium occultum (5S RNA), 12 – Planocera recticulata (5S RNA), 13 – Synechocystis sp. PCC6803 (RNase P RNA), 14 – Anabaena sp. PCC7120 (RNase P RNA), 15 – Agrobacterium tumefaciens (RNase P RNA), 16 – Rhodospirillum rubrum (RNase P RNA).


Furthermore, the results derived from the L/L matrices are used in a hierarchical clustering analysis, since the quality of a clustering analysis may indicate whether the matrix is good, that is, whether our method of abstracting information from RNA secondary structures is efficient. The result of the hierarchical clustering analysis of the RNA secondary structures in sets I and II using our graphical representations is shown in Fig. 5. Clearly, the nine viruses from set I are grouped into one cluster, and in particular LRMV-3, AVII, EMV-3 and CVV-3 are grouped in a single branch, which is consistent with the comparison of entries in Table 2; Synechocystis sp. PCC6803, Anabaena sp. PCC7120, Agrobacterium tumefaciens and Rhodospirillum rubrum (which belong to RNase P) are grouped into another cluster; and Halobacterium spl, Pyrodictium occultum and Planocera recticulata (which are from 5S ribosomal RNA) are grouped into a third cluster. The test indicates that our method yields reasonable results. In general, the proposed graphical representation is feasible for comparing RNA secondary structures and deducing their similarity relationships.
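The grouping within set I can be reproduced approximately from Table 1 alone. The sketch below is a naive average-linkage agglomerative clustering in plain Python. It is not the XLMINER procedure used for Fig. 5, the linkage rule is an assumption (the Letter does not state one), and it covers only the nine viruses because leading eigenvalues for set II are not tabulated. The names `NAMES`, `VECTORS` and `average_linkage` are hypothetical.

```python
import math

NAMES = ["AIMV-3", "CiLRV-3", "TSV-3", "CVV-3", "APMV-3",
         "LRMV-3", "PDV-3", "EMV-3", "AVII"]

# Leading eigenvalues (W-S, M-K, R-Y) copied from Table 1, in NAMES order.
VECTORS = [
    (9.5633, 9.8639, 8.8139),
    (13.6892, 9.7931, 16.1021),
    (10.8395, 8.7752, 10.9945),
    (8.7309, 9.1720, 15.8580),
    (9.4184, 14.7721, 9.4138),
    (9.8692, 8.6687, 15.9447),
    (9.5275, 10.7976, 11.0470),
    (9.1729, 8.1902, 16.4562),
    (9.7844, 9.5501, 16.1991),
]


def average_linkage(labels, vectors):
    """Merge the two closest clusters until one remains.

    Returns the merges in order as (members_a, members_b, height), where
    height is the mean pairwise Euclidean distance between the clusters.
    """
    clusters = [[i] for i in range(len(vectors))]
    merges = []

    def mean_distance(a, b):
        total = sum(math.dist(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = mean_distance(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append(([labels[i] for i in clusters[p]],
                       [labels[i] for i in clusters[q]], d))
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return merges


merges = average_linkage(NAMES, VECTORS)
for left, right, height in merges[:3]:
    print(left, "+", right, "at", round(height, 4))
```

Under these assumptions the first three merges join LRMV-3, AVII, CVV-3 and EMV-3, the same branch that is singled out in the discussion of Fig. 5.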

5. Conclusion

We have presented a 3D graphical representation of RNA secondary structure and outlined an approach to the similarity analysis of RNA secondary structures. According to three classifications of bases of nucleic acids and the chaos game representation, each RNA secondary structure is mapped into three curves confined to a regular tetrahedron. The similarities/dissimilarities among 16 RNA secondary structures are computed (some of the results are listed in Table 2) based on the 3-component vectors consisting of the leading eigenvalues of the L/L matrices, and are then used in the hierarchical clustering analysis. This test shows that our method is viable for analyzing the similarity relationships of RNA secondary structures.

It is worth noting that our approach allows visual inspection of data, which helps in recognizing major similarities among different RNA secondary structures. Some information on the base distribution and composition of RNA secondary structures can be read intuitively from the graphical representations. Moreover, our method does not require multiple alignments and gives results rapidly. The new 3D graphical representation provides a different approach for both computational scientists and molecular biologists to analyze RNA secondary structures efficiently.

Acknowledgement

We thank all the anonymous referees for their valuable suggestions and support. This work is supported by National Natural Science Foundation of China (10571019).

References

[1] V. Bafna, S. Muthukrisnan, R. Ravi, Comput. Sci. 937 (1995) 1.
[2] F. Corpet, B. Michot, Comput. Appl. Biosci. 10 (4) (1995) 389.
[3] B. Shapiro, Comput. Appl. Biosci. 4 (3) (1998) 387.
[4] B. Shapiro, K. Zhang, Comput. Appl. Biosci. 6 (4) (1990) 309.
[5] M. Randic, M. Vracko, N. Lers, D. Plavsic, Chem. Phys. Lett. 371 (2003) 202.
[6] M. Randic, A.T. Balaban, J. Chem. Inf. Comput. Sci. 43 (2003) 532.
[7] C. Raychaudhury, A. Nandy, J. Chem. Inf. Comput. Sci. 39 (1999) 243.
[8] A. Nandy, P. Nandy, Curr. Sci. 68 (1995) 75.
[9] P.M. Leong, S. Mogenthaler, Comput. Appl. Biosci. 12 (1995) 503.
[10] B. Liao, T. Wang, J. Biomol. Struct. Dyn. 21 (2004) 827.
[11] W. Zhu, B. Liao, K.Q. Ding, J. Mole. Struct. Theochem. 757 (2005) 193.
[12] J.W. Luo, B. Liao, R.F. Li, W. Zhu, J. Math. Chem. 39 (2006) 629.
[13] Y.S. Zhang, MATCH Commun. Math. Comput. Chem. 57 (2007) 157.
[14] L.W. Liu, T.M. Wang, J. Math. Chem. 42 (2007) 595.
[15] J.W. Gao, X.P. Zhang, MATCH Commun. Math. Comput. Chem. 56 (2006) 249.
[16] B. Liao, W. Zhu, R.F. Li, MATCH Commun. Math. Comput. Chem. 57 (2007) 687.
[17] B. Liao, K.Q. Ding, T.M. Wang, J. Biomol. Struct. Dyn. 22 (2005) 455.
[18] H.I. Jeffrey, Nucl. Acids Res. 18 (1990) 2163.
[19] C.B.E.M. Reusken, J.F. Bol, Nucl. Acids Res. 14 (1996) 2660.
[20] E.C. Koper-Zwarthoff, F.T. Brederode, P. Walstra, J.F. Bol, Nucl. Acids Res. 7 (1979) 1887.
[21] S.W. Scott, X. Ge, J. Gen. Virol. 76 (1995) 957.
[22] B.J. Cornelissen, H. Janssen, D. Zuidema, J.F. Bol, Nucl. Acids Res. 12 (1984) 2427.
[23] R.H. Alrefai, P.J. Shicl, L.L. Domier, C.J. DArcy, P.H. Berger, S.S. Korban, J. Gen. Virol. 75 (1994) 2847.
[24] S.W. Scott, X. Ge, J. Gen. Virol. 76 (1995) 1801.
[25] E.J. Bachman, S.W. Scott, G. Xin, V.B. Vance, Virology 201 (1994) 127.
[26] F. Houser-Scott, M.L. Baer, K.F. Liem, J.M. Cai, L. Gehrke, J. Virol. 68 (1994) 2194.
[27] EMBL/GenBank/DDBJ databases. Accession no. X86352.
[28] J. Brown, Nucl. Acids Res. 26 (1998) 351.
[29] M. Szymanski, Z. Miroslawa, V.A. Barciszewska, E.J. Barciszewski, Nucl. Acids Res. 30 (2002) 176.
[30] M. Randic, M. Vracko, N. Lers, D. Plavsic, Chem. Phys. Lett. 368 (2003) 1.
