
Identifying Affinity Classes of Inorganic Materials Binding Sequences via a Graph-based Model

Nan Du, Marc R. Knecht, Mark T. Swihart, Zhenghua Tang, Tiffany R. Walsh and Aidong Zhang

Abstract—Rapid advances in bionanotechnology have recently generated growing interest in identifying peptides that bind to inorganic materials and classifying them based on their inorganic material affinities. However, there are some distinct characteristics of inorganic materials binding sequence data that limit the performance of many widely-used classification methods when applied to this problem. In this paper, we propose a novel framework to predict the affinity classes of peptide sequences with respect to an associated inorganic material. We first generate a large set of simulated peptide sequences based on an amino acid transition matrix tailored for the specific inorganic material. Then the probability of test sequences belonging to a specific affinity class is calculated by minimizing an objective function. The objective function is minimized through iterative propagation of probability estimates among sequences and sequence clusters. Results of computational experiments on two real inorganic material binding sequence datasets show that the proposed framework is highly effective for identifying the affinity classes of inorganic material binding sequences. Moreover, the experiments on the SCOP (Structural Classification of Proteins) dataset show that the proposed framework is general and can be applied to traditional protein sequences.

Index Terms—inorganic material, peptide sequences, classification


1 INTRODUCTION

Over the past decade, many studies have been published on analyzing peptide sequences with affinity to biological entities such as enzymes, cells, viruses, lipids and proteins. Recently, interest in identifying and classifying peptides that interact specifically with inorganic materials has grown. These inorganic material binding peptide sequences have been identified from biocombinatorial peptide libraries using phage display [1], cell surface display [2], and yeast display [3].

In particular, numerous studies have been reported on peptide sequences that bind to inorganic materials such as noble metals (gold, silver, platinum) [4], [5], [6], [7], [8], semiconductors (zinc sulfide, cadmium sulfide) [9], [10], [11], [12], and metal oxides (silica, titanium and magnetite) [13], [14], [15], [16], [17], [18], [19], which are of great interest for applications in technology and medicine. Inorganic material binding peptide sequences, which are usually 7-14 amino acids long, are differentiated from other polypeptides by their specific molecular recognition properties for targeted inorganic material surfaces [20]. Effectively identifying the affinity class, which indicates the binding strength of a specific sequence with respect to the target inorganic material, is crucial for further designing novel peptides [21]. The binding affinity of a peptide to an inorganic surface is the result of a complex interplay between the binding strength of its individual residues and its conformation. The binding strength of a sequence for a specific material is usually measured with the adsorption free energy (∆G_ads), which is then used to assign each sequence to a weak, medium, or strong affinity class.

• Nan Du and Aidong Zhang are with the Computer Science and Engineering Department, University at Buffalo (SUNY), Buffalo, NY 14260. E-mail: nandu,[email protected]

• Marc R. Knecht and Zhenghua Tang are with the Department of Chemistry, University of Miami, 1301 Memorial Drive, Coral Gables, Florida 33146. E-mail: knecht,[email protected]

• Mark T. Swihart is with the Department of Chemical and Biological Engineering, University at Buffalo (SUNY), Buffalo, NY 14260. E-mail: [email protected]

• Tiffany R. Walsh is with the Institute for Frontier Materials, Deakin University, Geelong, Vic. 3216, Australia. E-mail: [email protected]

Despite extensive recent reports on combinatorially selected inorganic binding peptides and their bionanotechnological utility as synthesizers and molecular linkers [22], [23], [20], there is still limited knowledge about the relationships between binding peptide sequences and their associated inorganic materials. Therefore, by using machine learning to suggest sequence affinity classes, we can predict new sequences with a desired affinity for specific inorganic materials without running new large-scale screenings via phage display.

Various approaches have been used or developed for recognizing both close and distant homologs of given protein sequences, which is one of the central themes in bioinformatics. Most of this work is based on established machine learning models such as the Hidden Markov Model (HMM) [24], [25], Neural Network (NN) [26], [27] and Support Vector Machine (SVM) [28]. However, identifying the affinity classes of inorganic material binding peptide sequences poses some distinct challenges that are rarely faced in protein sequence identification. These challenges markedly limit the performance of the models mentioned above, despite their success in detecting other types of protein sequences.


Challenge I: The number of labeled samples is usually insufficient. Identifying peptide sequences that bind solid inorganic materials is an emerging topic that has developed only in the last decade, and it is not as well studied as protein sequence analysis, which has a much longer history. For example, whereas protein sequence analysis has numerous large-scale public datasets such as GPCR [29] or SCOP [30], no complete results of large-scale screening experiments have been made publicly available for inorganic material binding sequences. As a result, data on inorganic material binding peptide sequences are usually scarce. Most existing protein sequence classification approaches require a large set of labeled samples to train an accurate model. However, labeling the affinity classes for a large number of inorganic material binding sequences is very time-consuming and expensive, and thus usually infeasible. If only a limited number of labeled samples are available for model training, the learned model may suffer from over-fitting or under-fitting. Semi-Supervised Learning (SSL) [31], a machine learning method which has received much attention in the past decade, is good at handling the lack of sufficient labeled training data. However, the utility of this method may be markedly limited due to the next challenge.

Challenge II: The peptide sequences belonging to the same affinity class may be very dissimilar. Usually, protein sequences which belong to the same family follow some apparent patterns; in other words, they are similar to each other from some point of view. However, the "similarity" between inorganic material binding peptide sequences from the same affinity class may not be so apparent. In some cases, the intra-class similarity, which measures the similarity of all sequences inside the same class, is even lower than the inter-class similarity, which measures the similarity among sequences from different classes. This means that some peptide sequences belonging to the same class may be dissimilar to each other, at least according to current knowledge. This observation reflects the fact that inorganic material binding sequences do not satisfy, at the class level, the smoothness assumption which is generally made in both supervised learning and semi-supervised learning.

In light of these challenges for identifying the affinity classes of inorganic material binding sequences, we propose a novel framework which includes two parts. First, to tackle the insufficient data challenge, we augment the training sequence set with simulated sequences which are generated based on a new amino acid transition matrix. By using the simulated sequences, we incorporate into the training data not only the prior phylogenetic knowledge but also the specific sequence patterns responsible for binding to the target inorganic material. Second, instead of searching for patterns globally across all peptide sequences belonging to the same affinity class, we separate the sequences into smaller clusters and try to learn the patterns from them locally via a graph-based optimization model. Intuitively, since few obvious patterns can be found at the class level, we search for them at the smaller cluster level.

Based on the two strategies mentioned above, we propose a novel model that combines sequence simulation and cluster-based sequence affinity identification. The initial idea was published in [32]. This paper extends the original idea into a solid method and provides more supportive, comprehensive experiments. The main process of the proposed method is shown in Fig. 1: we first use the labeled sequences as seeds to simulate more sequences, and then all the labeled and simulated sequences are used to train our graph-based optimization model, which is effective at identifying the sequences' affinity classes. We discuss the proposed method in detail in the following sections.

[Figure 1: flow diagram. The labeled strong set (e.g. PPTNSM, HFQN, ...) and weak set (e.g. LWSTVA, SNLFT, ...) are passed to peptide sequence simulation, which adds a simulated set (e.g. SNMFT, PPANST, ...); the strong, weak and simulated sets are then passed to the graph-based optimization model.]

Fig. 1: Main process of the proposed method.

In this paper, we make the following contributions:

• We introduce the distinct challenges associated with identifying affinity classes for inorganic material binding sequences.

• We propose a novel framework which can effectively predict the affinity classes of inorganic material binding sequences, and provide an efficient iterative algorithm to find the optimal solution of the proposed objective function. Moreover, our framework is general and is also effective for identifying the classes of traditional protein sequences.

• Extensive computational experiments show that the proposed method outperforms many other baseline methods.

The rest of the paper is organized as follows. In Section 2, we explain the relationship between our work and previous related work. In Section 3, we describe the datasets used in this paper and the setting of our problem. The peptide sequence simulation method and the graph-based optimization model are presented in Section 4 and Section 5, respectively. Extensive experimental results are shown in Section 6. The conclusion and future work are presented in Section 7.


2 RELATED WORK

Because this is an emerging research topic, there is very little published work on identifying the affinity classes of inorganic material binding sequences that we can compare to. On a similar topic, however, much research has been devoted to identifying the homologs of protein sequences. HMM is a widely-used probability modeling method for protein homology detection [24], [25], [33], which first generates a probability model for each specific sequence family and then calculates the likelihood of an unknown sequence fitting each family. Another type of direct modeling method for protein homology detection is based on neural networks [26], [27], whose multilayer nature allows them to discover non-linear higher-order correlations among the sequences. As a widely-used machine learning algorithm, SVM [28] has also been applied to protein homology detection problems. Mak et al. [34] proposed an SVM-based model named PairProSVM to automatically predict the sub-cellular locations of protein sequences. Karchin et al. [29] combined HMM with SVM to identify protein homologies. Tian et al. [35] proposed a weighted version of SVM that weakens the influence of outliers to improve protein sub-cellular localization predictions. However, these methods are inappropriate in our case for two reasons. First, they require a training set consisting of sufficient labeled examples. Second, they try to learn a pattern from each class, and such a pattern may not exist at the class level.

Besides its differences from traditional classification approaches, the proposed framework also differs from the following work. 1) Oren et al. [21] have proposed a method to generate a new transition matrix and perform classification based on it. The first difference between the work presented here and Oren's work is that they approach the sequence classification problem only by learning patterns from the entire sequence set belonging to the same affinity class. Second, the newly generated transition matrix in [21] was only used to calculate the pairwise distance between sequences, whereas in our proposed method the newly generated matrix is also used to generate the simulated sequences. 2) Ge et al. [36] proposed a consensus maximization model to solve the problem of finding informative genes from multiple studies. The proposed method shares the intuition of Ge's work that a cluster should correspond to a particular class z if the majority of instances in this cluster belong to class z. However, Ge's work aimed at making reliable predictions by utilizing multiple experimental results, which is quite different from our work. In our case, we only have the raw dataset (i.e. labeled inorganic material binding sequences) rather than multiple experimental results.

3 DATASETS AND PROBLEM DEFINITION

In this section, we describe the datasets used in this paper and present the problem definition.

3.1 Datasets

We have used three datasets to demonstrate the proposed method's performance. The first dataset is from Oren et al. [21]. This dataset consists of a total of 25 quartz (rhombohedral silica, SiO2) binding peptide sequences which were identified using phage-display techniques. These peptide sequences are further classified into two classes based on their affinity strength: strong and weak binder classes, which contain 10 and 15 sequences, respectively. To better demonstrate the problem and illustrate the proposed method in the rest of the paper, we extract from this dataset a sample set which includes both affinity classes and show it in Table 1.

TABLE 1: Sample set of peptide sequences data

Strong Class                Weak Class
Name     Sequence           Name     Sequence
DS202    RLNPPSQMDPPF       DS201    MEGQYKSNLLFT
DS189    QTWPPPLWFSTS       DS191    VAPRVQNLHFGA
...      ...                ...      ...

The second inorganic material binding peptide sequence dataset is from our systematic study of peptide binding on gold (Au) [37], combined with the previous data from Wang et al. [38], to give a total of 32 peptide sequences. Sequences in our set that follow the pattern XHXHXHX, where X is an arbitrary amino acid, are from Wang et al. [38]. Since any peptide sequence that contains cysteine (i.e. amino acid C) can bind strongly onto the gold surface, sequences containing cysteine are not considered, without loss of generality. Using the measured adsorption free energies (∆G, in kJ/mol) for all the sequences, we drew the boundary between strong and weak binding sequences such that the weak class has ∆G > −25 kJ/mol and the strong class has ∆G ≤ −25 kJ/mol. Note that Hnilova et al. [39] have shown that the sequence 'TLRRWRDRRILN' (AUBP30) has weak binding ability to gold. Although they did not report its free energy, it is very likely to reside in the weak set based on the qualitative binding analysis. All the sequences from the strong and weak classes are listed in Table 2.
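The labeling rule described above can be sketched in a few lines of Python. This is only an illustration: the function and constant names are hypothetical, and the four sequence and ∆G pairs are copied from Table 2.

```python
# Boundary stated in the text: strong if dG <= -25 kJ/mol, weak if dG > -25 kJ/mol.
STRONG_MAX_DG = -25.0


def is_eligible(sequence):
    """Cysteine-containing sequences are excluded from the gold dataset."""
    return "C" not in sequence


def affinity_class(delta_g):
    """Map an adsorption free energy in kJ/mol to 'strong' or 'weak'."""
    return "strong" if delta_g <= STRONG_MAX_DG else "weak"


# Four (sequence, dG) pairs copied from Table 2.
measured = {
    "WAGAKRLVLRRE": -37.6,
    "TSNAVHPTLRHL": -30.3,
    "HHHHHHH": -23.9,
    "PHPHPHP": -14.1,
}

labels = {
    seq: affinity_class(dg)
    for seq, dg in measured.items()
    if is_eligible(seq)
}
```

A sequence without a reported free energy, such as AUBP30, cannot be labeled by this rule and would have to be assigned by hand, as the text does.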

It is worth noting that these datasets illustrate well the two challenges mentioned above. First, there are only around ten sequences available for each affinity class, which is very few in comparison to the data sizes used for classifier training in protein sequence analysis, where hundreds or thousands of sequences are usually involved [33], [25]. Second, the challenge of unobvious patterns in these datasets is illustrated in Fig. 2 and Fig. 3. For these figures, based on the total similarity score (TSS) defined in [21], we first calculate the total similarity of sequences from the same class A via the following equation:

$$\mathrm{TSS}_A = \frac{1}{N_A (N_A - 1)} \sum_{i=1}^{N_A} \sum_{j=1}^{N_A} \mathrm{PSS}_{ij}\,(1 - \delta_{ij}), \qquad (1)$$


TABLE 2: Summary of the gold binding peptide sequences (∆G in kJ/mol)

Strong Sequence    ∆G       Weak Sequence    ∆G
WAGAKRLVLRRE       -37.6    HHHHHHH          -23.9
MHGKTQATSGTIQS     -37.6    MHMHMHM          -23.1
LKAHLPPSRLPS       -36.6    RHRHRHR          -22.9
WALRRSIRRQSY       -36.4    YHYHYHY          -22.8
TGTSVLIATPYV       -35.7    WHWHWHW          -22.3
EQLGVRKELRGV       -35.3    KHKHKHK          -22.3
RMRMKMK            -35.0    AHAHAHA          -21.6
PPPWLPYMPPWS       -35.0    GHGHGHG          -21.6
AYSSGAPPMPPF       -31.8    QHQHQHQ          -20.5
TGIFKSARAMRN       -31.6    IHIHIHI          -20.4
KHKHWHW            -31.3    NHNHNHN          -19.9
TSNAVHPTLRHL       -30.3    VHVHVHV          -19.9
                            SHSHSHS          -18.9
                            THTHTHT          -18.7
                            EHEHEHE          -18.6
                            DHDHDHD          -18.2
                            LHLHLHL          -16.2
                            FHFHFHF          -15.4
                            PHPHPHP          -14.1
                            TLRRWRDRRILN     -

where δ is the usual Kronecker delta function, in which δ_ij = 1 when i = j and 0 otherwise, N_A is the total number of sequences in set A, and PSS_ij is the similarity between the i-th sequence and the j-th sequence of set A calculated via the Needleman-Wunsch algorithm [40]. For the sake of simplicity, we call this the self-class similarity. Moreover, the TSS of the sequences across the classes A and B is calculated as:

$$\mathrm{TSS}_{A-B} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} \mathrm{PSS}_{ij}, \qquad (2)$$

where N_B is the total number of sequences in set B. Correspondingly, we call the total similarity for sequences across the classes the cross-class similarity.
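As an illustration of Eqs. (1) and (2), the sketch below computes the self-class and cross-class similarity in Python. It rests on stated assumptions: the pairwise score uses a simple match/mismatch scheme (+1/−1) and a linear gap penalty of −1 in place of Pam 250 or Blosum 62, whose values and gap settings are not reproduced here, and all function names are hypothetical.

```python
def nw_score(a, b, score=lambda x, y: 1 if x == y else -1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The match/mismatch scores and the gap penalty are placeholders for a
    real substitution matrix such as Pam 250 or Blosum 62.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            cur.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                prev[j] + gap,                            # gap in b
                cur[j - 1] + gap,                         # gap in a
            ))
        prev = cur
    return prev[-1]


def tss_self(seqs, pss=nw_score):
    """Eq. (1): mean pairwise similarity over all ordered pairs with i != j."""
    n = len(seqs)
    total = sum(pss(seqs[i], seqs[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def tss_cross(seqs_a, seqs_b, pss=nw_score):
    """Eq. (2): mean pairwise similarity between class A and class B."""
    total = sum(pss(a, b) for a in seqs_a for b in seqs_b)
    return total / (len(seqs_a) * len(seqs_b))


# The four sample sequences from Table 1.
strong = ["RLNPPSQMDPPF", "QTWPPPLWFSTS"]
weak = ["MEGQYKSNLLFT", "VAPRVQNLHFGA"]
print(tss_self(strong), tss_self(weak), tss_cross(strong, weak))
```

Skipping the i = j terms in `tss_self` plays the role of the factor (1 − δ_ij) in Eq. (1).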

To calculate PSS_ij, we need to provide a transition matrix on which the optimal scoring alignment is based. Without loss of generality, we have used both Pam 250 [41] (Fig. 2(a) and Fig. 3(a)) and Blosum 62 [42] (Fig. 2(b) and Fig. 3(b)) as the transition matrices. Fig. 2 shows that the sequences belonging to the weak class have very low or no significant similarities: their self-class similarity is much lower than the cross-class similarity. Similarly, as shown in Fig. 3, the self-class similarities of the sequences belonging to the strong gold binding set are very close to the cross-class similarity. Due to this phenomenon, traditional classification approaches cannot readily identify an effective pattern.

To demonstrate that the proposed framework is general and is also effective for predicting the homology families of traditional protein sequences, we also use a third dataset, the Structural Classification of Proteins (SCOP) dataset from [43]. We employ the approach developed by Kumar and Cowen [25] to pick the SCOP families, where the acquired proteins are further grouped into seven families (i.e. A, B, C, D, E, F and G). The number of sequences and the lengths of the shortest and longest sequences in each family are shown in Table 3, and the data we used are available at http://www.acsu.buffalo.edu/~nandu/InorganicSeq/.

[Figure 2: similarity score plotted over the Strong and Weak classes; panels (a) Pam 250 and (b) Blosum 62.]

Fig. 2: Total similarity scores of the self-class (the strong class and weak class) and the cross-class for quartz binders.

[Figure 3: similarity score plotted over the Strong and Weak classes; panels (a) Pam 250 and (b) Blosum 62.]

Fig. 3: Total similarity scores of the self-class (the strong class and weak class) and the cross-class for gold binders.

TABLE 3: Summary of the protein sequence data

Family     Number of Seq    Length of Shortest Seq    Length of Longest Seq
Class A    23               160                       177
Class B    23               72                        136
Class C    16               224                       260
Class D    19               120                       144
Class E    18               324                       429
Class F    14               131                       221
Class G    20               45                        83

3.2 Problem Definition

We consider our problem as identifying the affinity classes of test inorganic material binding peptide sequences based on the training sequences. We start from a pool of l+u peptide sequences S_source = {s_1, ..., s_l, ..., s_{l+u}}, where each peptide sequence is represented as an ordered series of amino acids. To illustrate what a peptide sequence looks like, take a sequence from Table 1 as an example: DS202: RLNPPSQMDPPF is a peptide sequence composed of twelve ordered amino acids, where each letter denotes one of the 20 standard amino acids. We also assume that, in this sequence pool S_source, the first l sequences s_i ∈ S_source (1 ≤ i ≤ l) are labeled based on their affinity to the target inorganic material (e.g. weak or strong) and together form the set L, while the remaining u sequences s_i ∈ S_source (l+1 ≤ i ≤ l+u) are unlabeled and together form the set U, where L ∪ U = S_source. Our goal is to predict the labels of the peptide sequences in U, using the training sequences in L.

4 PEPTIDE SEQUENCE SIMULATION

As mentioned above, a lack of labeled data is a general problem when working on inorganic material binding sequences. One of the most successful methods to date for recognizing protein sequences based on evolutionary knowledge is the use of simulated sequences. Many studies [25], [44] have shown that augmenting the training set with simulated sequences generated from an amino acid transition matrix such as Blosum 62 or Pam 250 can increase homolog identification performance. One can reasonably expect that a set of peptides generated by directed evolution to recognize a given solid material will have similar sequences [21].

Although these transition matrices have been shown to be effective and have gained wide acceptance, we cannot directly apply them to generate simulated sequences. These transition matrices are derived from large-scale natural protein sequence databases rather than from the target inorganic material binding sequences, which means that they cannot represent the target inorganic material well. Thus, we only use a traditional transition matrix as a seed, and based on it we generate a new transition matrix which not only maintains the prior knowledge from proteins but also captures the significant knowledge specific to the target inorganic material.

Here, aiming to provide a more comprehensive and diverse view for our model, we use a two-step simulated sequence generation approach to enlarge our training set. First, we generate a new transition matrix which can better measure the amino acid transition relations for the target inorganic material. Specifically, we use a traditional transition matrix (e.g. Blosum 62 or Pam 250) as a seed matrix M, which is a 20 × 20 symmetric matrix whose rows and columns are each labeled with a single letter representing an amino acid. Then we greedily and iteratively mutate each entry m_ij, which is an integer coefficient between two amino acids in the seed transition matrix, to maximize the difference between the self-class similarity of class A (i.e. TSS_A) and the cross-class similarity between classes A and B (i.e. TSS_{A−B}) [21]. This step is designed to enlarge the gap between the affinity classes.
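The first step can be sketched as a greedy coordinate search in Python. This is a minimal illustration under stated assumptions, not the procedure of [21]: the step size of 1, the fixed number of sweeps, the linear gap penalty of −1, the toy seed matrix (2 on the diagonal, −1 elsewhere, restricted to the letters that occur) and all names are hypothetical.

```python
import itertools


def nw_score(a, b, matrix, gap=-1):
    """Needleman-Wunsch global alignment score (the gap penalty is an assumption)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            cur.append(max(prev[j - 1] + matrix[(a[i - 1], b[j - 1])],
                           prev[j] + gap,
                           cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def class_gap(matrix, class_a, class_b):
    """Self-class similarity of A minus cross-class similarity of A and B."""
    n_a, n_b = len(class_a), len(class_b)
    self_sum = sum(nw_score(class_a[i], class_a[j], matrix)
                   for i in range(n_a) for j in range(n_a) if i != j)
    cross_sum = sum(nw_score(a, b, matrix) for a in class_a for b in class_b)
    return self_sum / (n_a * (n_a - 1)) - cross_sum / (n_a * n_b)


def greedy_refine(seed, class_a, class_b, step=1, sweeps=2):
    """Change one symmetric entry at a time; keep a change only if the gap grows."""
    matrix = dict(seed)
    best = class_gap(matrix, class_a, class_b)
    letters = sorted({x for pair in matrix for x in pair})
    for _ in range(sweeps):
        improved = False
        for x, y in itertools.combinations_with_replacement(letters, 2):
            for delta in (step, -step):
                trial = dict(matrix)
                trial[(x, y)] = trial[(y, x)] = matrix[(x, y)] + delta
                value = class_gap(trial, class_a, class_b)
                if value > best:
                    matrix, best, improved = trial, value, True
                    break
        if not improved:
            break
    return matrix, best


# Toy data: short made-up fragments, not sequences from the datasets.
class_a = ["RLNP", "RLNS"]
class_b = ["MEGQ", "VAPR"]
alphabet = sorted(set("".join(class_a + class_b)))
seed = {(x, y): (2 if x == y else -1) for x in alphabet for y in alphabet}

gap_before = class_gap(seed, class_a, class_b)
refined, gap_after = greedy_refine(seed, class_a, class_b)
print(gap_before, gap_after)
```

Updating `(x, y)` and `(y, x)` together keeps the refined matrix symmetric, matching the symmetry of the seed matrix M. Because nothing bounds the entries, the number of sweeps is kept small here; a practical version would need a stopping rule or a bound on how far entries may move.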

Second, after the new transition matrix M* is constructed, the simulated sequences are generated based on the labeled sequence set L. When a sequence is selected as a seed, the simulated sequence is generated by randomly selecting a position in it and replacing the amino acid i at that position with a new amino acid j with the probability defined in Eq. (3):

$$P_{ij} = \frac{M^{*}_{ij}}{\sum_{j=1}^{20} M^{*}_{ij}}. \qquad (3)$$

Note that all the probabilities are calculated after normalizing the values in M* into a positive value space (e.g. 0 to 1). As an example, Fig. 4 shows the process of mutating the amino acid in the selected position based on the mutation probability. We keep replacing amino acids in the target sequence until a desired mutation threshold t is reached [25].

[Figure 4: mutation example. Seed sequence: RLNPPSQMDPPF; the 8-th amino acid, M (methionine), is selected to be mutated. Mutation probabilities for M: A 4.6%, R 4.6%, N 3.5%, D 2.2%, C 4.6%, Q 5.8%, E 3.5%, G 2.2%, H 3.5%, I 6.9%, L 8.0%, K 4.6%, M 11.4%, F 5.8%, P 3.5%, S 4.6%, T 4.6%, W 4.6%, Y 4.6%, V 6.9%. W (tryptophan) is selected to replace M, giving RLNPPSQWDPPF.]

Fig. 4: An example of mutating an amino acid in the selected position. When a specific position (e.g. the 8-th) is selected from the target sequence, the corresponding amino acid M is mutated into another amino acid W based on the mutation probability.

With this two-step method, we can incorporate into the data not only the prior phylogenetic knowledge but also the specific amino acid patterns responsible for binding to the target inorganic material. Accordingly, based on this peptide sequence simulation method, for each labeled source sequence s_i ∈ S_source (1 ≤ i ≤ l) we generate m mutated sequences, which together form the simulated peptide sequence set S_simulated = {s*_1, ..., s*_{l×m}}. Finally, we define the sequence pool as S = S_source ∪ S_simulated, which includes the source peptide sequences and the simulated sequences. We will show in the experiments that the simulated sequences effectively improve the performance.
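The second step, sampling replacements according to Eq. (3), can be sketched as follows. Several details are assumptions made for illustration: M* is rescaled into [0, 1] by min-max normalization over the whole matrix, the threshold t is read as the fraction of positions to mutate, the toy matrix stands in for a real M*, and all names are hypothetical.

```python
import random

AMINO_ACIDS = list("ARNDCQEGHILKMFPSTWYV")


def transition_probs(matrix, i, alphabet=AMINO_ACIDS):
    """Eq. (3): P_ij = M*_ij / sum_j M*_ij after rescaling M* into [0, 1]."""
    lo, hi = min(matrix.values()), max(matrix.values())
    span = (hi - lo) or 1.0
    weights = [(matrix[(i, j)] - lo) / span for j in alphabet]
    total = sum(weights)
    if total == 0:  # degenerate row: fall back to a uniform choice
        return [1.0 / len(alphabet)] * len(alphabet)
    return [w / total for w in weights]


def simulate(seed_seq, matrix, t=0.25, rng=None, alphabet=AMINO_ACIDS):
    """Mutate a fraction t of randomly chosen positions of seed_seq."""
    rng = rng or random.Random()
    residues = list(seed_seq)
    n_mut = max(1, round(t * len(residues)))
    for pos in rng.sample(range(len(residues)), n_mut):
        probs = transition_probs(matrix, residues[pos], alphabet)
        # As in Fig. 4, the original residue itself may be drawn again.
        residues[pos] = rng.choices(alphabet, weights=probs, k=1)[0]
    return "".join(residues)


# Toy symmetric stand-in for M*: 4 on the diagonal, 0, -1 or -2 elsewhere.
toy = {(x, y): (4 if x == y else -(abs(i - j) % 3))
       for i, x in enumerate(AMINO_ACIDS)
       for j, y in enumerate(AMINO_ACIDS)}

rng = random.Random(0)  # fixed seed so the run is repeatable
seed_seq = "RLNPPSQMDPPF"
simulated = [simulate(seed_seq, toy, t=0.25, rng=rng) for _ in range(5)]
print(simulated)
```

Calling `simulate` m times for each labeled sequence yields a set of the kind denoted S_simulated above. Note that min-max rescaling assigns zero probability to the smallest entries of the matrix; a different normalization would be needed if every substitution should remain possible.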

5 GRAPH-BASED OPTIMIZATION MODEL

To handle the challenge that obvious patterns are hard to find at the class level, we propose a graph-based optimization model to estimate the conditional probability of the test sequences belonging to each affinity class. Our method begins by mapping sequences from the sequence pool into nodes of a sequence-to-sequence graph (Section 5.1), where the relationships among sequences are better measured and many efficient clustering methods are available. Instead of searching for patterns at the class level, we partition the sequences into clusters, where we believe the significant patterns exist, and propose an objective function (Section 5.2) to learn the conditional probability of each sequence belonging to a specific affinity class. Finally, we present an efficient iterative algorithm to obtain the optimal value of the objective function (Section 5.3).

5.1 Mapping Sequences into Nodes of a Graph

We map all the sequences into a graph where each node denotes a peptide sequence and each edge denotes the pairwise similarity between two sequences. This graph offers a good understanding of the pairwise relationships among peptide sequences and is easily partitioned into clusters. The pairwise similarity among sequences is calculated using the Needleman-Wunsch algorithm [40] after local alignment between each sequence pair using the Smith-Waterman algorithm [45].

5.2 The Objective Function

The key idea of our approach is that, instead of searching for patterns at the class level, we narrow down the affinity class prediction problem from the class level to the cluster level. We believe that, if the patterns are obscure at the class level, shifting the focus to the cluster level lets us find clearer patterns.

Before proceeding further, we introduce the notation that will be used in the following discussion: $d_{ij}$ denotes the $ij$-th entry of a matrix $D$, and $d_{i.}$ and $d_{.j}$ denote the vectors of the $i$-th row and $j$-th column of $D$, respectively.

Belongingness Matrix: We denote the belongingness matrix $B$ as an $N \times V$ matrix, where $N$ is the number of all sequences (including source sequences, simulated sequences and test sequences) and $V$ is the number of clusters detected by clustering the $N$ sequences. Note that each entry in the belongingness matrix corresponds to the probability of a peptide sequence belonging to a cluster. If peptide sequence $s_i$ is assigned to a cluster $c_j$, then $b_{ij} = 1$, and 0 otherwise. To construct the belongingness matrix $B$, we use spectral clustering [46], which has proven effective for solving graph partitioning problems, to partition the sequence-to-sequence graph constructed in the previous section into $V$ clusters.

Sequence Probability Matrix: The conditional probability of peptide sequence $s_i$ belonging to class $z$ ($s_{iz} = P(y = z \mid s_i)$) is estimated with an $N \times D$ matrix $S$, where $D$ is the number of affinity classes we want to classify.

Cluster Probability Matrix: The conditional probability of cluster $c_j$ belonging to class $z$ ($c_{jz} = P(y = z \mid c_j)$) is estimated as a $V \times D$ matrix $C$, where $c_{jz}$ represents the probability of a cluster $c_j$ belonging to a class $z$.

Sequence Labeled Matrix: In the labeled sequence set $L$, the sequences have initial class labels, which are represented by an $N \times D$ matrix $F$, where $f_{iz} = 1$ if we know in advance that sequence $s_i$ belongs to class $z$, and 0 otherwise.

Cluster Labeled Matrix: We may also have prior information that a cluster belongs to a specific class. We use a $V \times D$ matrix $Y$ to define initial labels for clusters, where $y_{jz} = 1$ denotes that we are confident that a cluster $c_j$ belongs to a specific class $z$, and 0 otherwise. Specifically, we assign a cluster $c_j$ to a specific class $z$ if all the source sequences in it belong to the same class $z$.

Cluster Similarity Matrix: In addition, a $V \times V$ matrix $W$ denotes the similarity among the sequence clusters, where $w_{ij}$ is the similarity between the sequence clusters $c_i$ and $c_j$. Specifically, the pairwise cluster similarity is calculated using $TSS_{A-B}$ between the sets of sequence binders $A$ and $B$.
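The matrices $B$, $F$ and $Y$ defined above can be assembled from a hard clustering as sketched below. The toy cluster assignment, the use of -1 for an unlabeled sequence, and the helper names are assumptions made for illustration; the clustering itself (spectral clustering in the paper) is taken as given.

```python
import numpy as np

def one_hot(indices, width):
    """Rows with a single 1 at the given index; index -1 gives an all-zero row."""
    M = np.zeros((len(indices), width))
    for row, idx in enumerate(indices):
        if idx >= 0:
            M[row, idx] = 1.0
    return M

def cluster_labels(cluster_of, class_of, is_source, V, D):
    """Y[j, z] = 1 when every source sequence in cluster j has the same class z."""
    Y = np.zeros((V, D))
    for j in range(V):
        classes = {class_of[i] for i in range(len(cluster_of))
                   if cluster_of[i] == j and is_source[i]}
        if len(classes) == 1:          # pure cluster: all its source sequences agree
            Y[j, classes.pop()] = 1.0
    return Y

# Hypothetical toy input: 6 sequences, 3 clusters, 2 classes (0 = weak, 1 = strong).
cluster_of = [0, 0, 1, 1, 2, 2]
class_of = [1, 1, 0, 1, -1, -1]        # -1 marks a sequence without a known label
is_source = [True, True, True, True, False, False]

B = one_hot(cluster_of, 3)             # N x V belongingness matrix
F = one_hot(class_of, 2)               # N x D sequence labeled matrix
Y = cluster_labels(cluster_of, class_of, is_source, 3, 2)  # V x D cluster labeled matrix
print(Y)
```

In this toy case only cluster 0 receives a prior label: cluster 1 mixes both classes and cluster 2 contains no source sequence.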

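Now we formulate the affinity class identification problem as the following objective function:

$$
\min_{S,C} J(S,C) = \min_{S,C} \Bigg( \sum_{i=1}^{N}\sum_{j=1}^{V} b_{ij}\,\|s_{i.} - c_{j.}\|^2 + \alpha \sum_{i=1}^{V}\sum_{j=1}^{V} w_{ij}\,\|c_{i.} - c_{j.}\|^2 + \beta \sum_{i=1}^{N} h_i\,\|s_{i.} - f_{i.}\|^2 + \gamma \sum_{j=1}^{V} k_j\,\|c_{j.} - y_{j.}\|^2 \Bigg) \tag{4}
$$

subject to the following conditions:

$$
\sum_{z=1}^{D} s_{iz} = 1,\quad s_{iz} \ge 0, \qquad \sum_{z=1}^{D} c_{jz} = 1,\quad c_{jz} \ge 0,
$$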

where $\|\cdot\|$ indicates the L2 norm, and $h_i = \sum_{z=1}^{D} f_{iz}$ and $k_j = \sum_{z=1}^{D} y_{jz}$ indicate whether sequence $s_i$ and cluster $c_j$ carry an initial label (these are the diagonal entries of $H_n$ and $K_v$ in Section 5.3). The first term in Eq. (4), $\sum_{i=1}^{N}\sum_{j=1}^{V} b_{ij}\|s_{i.} - c_{j.}\|^2$, ensures that a sequence has a probability vector similar to that of the cluster it belongs to; namely, cluster $c_j$ should correspond to class $z$ if the majority of sequences in this cluster belong to class $z$. Intuitively, the higher the deviation, the larger the penalty. The second term, $\alpha \sum_{i=1}^{V}\sum_{j=1}^{V} w_{ij}\|c_{i.} - c_{j.}\|^2$, corresponds to the intuition that clusters which are close to each other should have similar classes, and $\alpha$ denotes the confidence in this source of information. From the view of graph theory, this term propagates the class information among the clusters. The third term, $\beta \sum_{i=1}^{N} h_i\|s_{i.} - f_{i.}\|^2$, applies the constraint that the predictions should not deviate too much from the corresponding sequence ground truth, and $\beta$ is the parameter that expresses our confidence in the prior knowledge of sequences. Similarly, the last term, $\gamma \sum_{j=1}^{V} k_j\|c_{j.} - y_{j.}\|^2$, is the loss function penalizing the deviation between predictions and our prior knowledge of clusters, and $\gamma$ is the parameter that expresses our confidence in the prior knowledge of clusters.
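To make the four terms concrete, the following sketch evaluates $J(S, C)$ for given matrices. It assumes that the norms in Eq. (4) are squared and that $h_i$ and $k_j$ are the row sums of $F$ and $Y$; the function name and the tiny example matrices are hypothetical.

```python
import numpy as np

def objective(S, C, B, W, F, Y, alpha, beta, gamma):
    """Value of J(S, C) in Eq. (4), with squared L2 norms (an assumption)."""
    h = F.sum(axis=1)                                   # 1 for labeled sequences
    k = Y.sum(axis=1)                                   # 1 for labeled clusters
    d_sc = ((S[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # N x V distances
    d_cc = ((C[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # V x V distances
    fit = (B * d_sc).sum()                              # sequence vs. its cluster
    smooth = alpha * (W * d_cc).sum()                   # neighbouring clusters agree
    seq_prior = beta * (h * ((S - F) ** 2).sum(axis=1)).sum()
    clu_prior = gamma * (k * ((C - Y) ** 2).sum(axis=1)).sum()
    return fit + smooth + seq_prior + clu_prior

# Two sequences in one cluster; only the first sequence and the cluster are labeled.
B_ex = np.array([[1.0], [1.0]])
W_ex = np.array([[0.0]])
F_ex = np.array([[1.0, 0.0], [0.0, 0.0]])
Y_ex = np.array([[1.0, 0.0]])
S_ex = np.array([[1.0, 0.0], [0.5, 0.5]])
C_ex = np.array([[1.0, 0.0]])

J = objective(S_ex, C_ex, B_ex, W_ex, F_ex, Y_ex, alpha=2, beta=10, gamma=2)
print(J)
```

Here only the first term is non-zero: the unlabeled second sequence deviates from its cluster by $0.5^2 + 0.5^2 = 0.5$, while both prior terms vanish because the labeled sequence and the labeled cluster match their labels exactly.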

5.3 Iterative Update Algorithm

It is easy to prove that the objective function Eq. (4) is convex, which makes it possible to find a globally optimal solution. To obtain the optimal solution for matrices $S$ and $C$, we propose to solve Eq. (4) using the block coordinate descent method [47]. At iteration $t$, fixing the value of $s_{i.}$, we take the partial derivative of Eq. (4) with respect to $c^{t}_{j.}$, set it to 0, and obtain the update formula Eq. (5):

$$
c^{t}_{j.} = \frac{\sum_{i=1}^{N} b_{ij}\, s^{t-1}_{i.} + \gamma k_j\, y_{j.}}{\sum_{i=1}^{N} b_{ij} + \gamma k_j + \alpha\,(i_{j.} - \tilde{l}_{j.})}. \tag{5}
$$

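Accordingly, the update can be represented in matrix form as Eq. (6), where $D_v = \mathrm{diag}\{\sum_{i=1}^{N} b_{ij}\}$ is the normalization factor, $K_v = \mathrm{diag}\{\sum_{z=1}^{D} y_{jz}\}$ indicates the constraints for the clusters, and $\mathrm{diag}\{\cdot\}$ denotes a diagonal matrix with the given diagonal elements. Furthermore, $L$ is the normalized similarity matrix [48] defined as $L = D_w^{-1/2} W D_w^{-1/2}$, where $D_w$ is the diagonal degree matrix of $W$, so that $I - L$ is the normalized graph Laplacian. Comparing with Eq. (5), $A$ here plays the role of the belongingness matrix $B$.

$$
C^{t} = (D_v + \gamma K_v + \alpha (I - L))^{-1} (A^{T} S^{t-1} + \gamma K_v Y). \tag{6}
$$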

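The Hessian matrix with respect to $C$ is, up to a constant factor, $D_v + \gamma K_v + \alpha(I - L)$, the matrix inverted in Eq. (6). Its diagonal part $D_v + \gamma K_v$ has entries $\sum_{i=1}^{N} b_{ij} + \gamma k_j > 0$ for every non-empty cluster and is therefore positive definite, and it is easy to prove that $I - L$ is positive semi-definite. Thus, the Hessian matrix is positive definite, which means that setting the derivative with respect to $C$ to zero gives the unique minimum of Eq. (4) for fixed $S$. Similarly, we can obtain the update formula Eq. (7) with respect to $s_{i.}$ by fixing $c^{t}_{j.}$:

$$
s^{t}_{i.} = \frac{\sum_{j=1}^{V} b_{ij}\, c^{t}_{j.} + \beta h_i f_{i.}}{\sum_{j=1}^{V} b_{ij} + \beta h_i}. \tag{7}
$$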

The matrix form of Eq. (7) is as follows:

$$
S^{t} = (D_n + \beta H_n)^{-1} (A C^{t} + \beta H_n F), \tag{8}
$$

where $D_n = \mathrm{diag}\{\sum_{j=1}^{V} b_{ij}\}$ is the normalization factor and $H_n = \mathrm{diag}\{\sum_{z=1}^{D} f_{iz}\}$ indicates the constraints for the sequences. The Hessian matrix is also a diagonal matrix, with diagonal elements $\sum_{j=1}^{V} b_{ij} + \beta h_i > 0$ (the denominator of Eq. (7)), which means that setting the derivative with respect to $S$ to zero gives the unique minimum of Eq. (4) for fixed $C$. To sum up, the pseudo-code for iteratively solving Eq. (4) by the block coordinate descent method is shown as Algorithm 1, where $\epsilon$ is a convergence threshold. Because the proposed method is based on a graph model, we name our approach Peptide Sequences Identification Graph Model - PSIGM.
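The step from Eq. (4) to Eq. (7) can be written out explicitly. The sketch below assumes that the norms in Eq. (4) are squared and that $h_i = \sum_z f_{iz}$, as the definition of $H_n$ suggests; only the first and third terms of Eq. (4) depend on $s_{i.}$.

```latex
\begin{align*}
\frac{\partial J}{\partial s_{i.}}
  &= 2\sum_{j=1}^{V} b_{ij}\,\bigl(s_{i.} - c^{t}_{j.}\bigr)
   + 2\beta h_i\,\bigl(s_{i.} - f_{i.}\bigr) = 0 \\
\Longrightarrow\quad
\Bigl(\sum_{j=1}^{V} b_{ij} + \beta h_i\Bigr)\, s_{i.}
  &= \sum_{j=1}^{V} b_{ij}\, c^{t}_{j.} + \beta h_i f_{i.} \\
\Longrightarrow\quad
s^{t}_{i.}
  &= \frac{\sum_{j=1}^{V} b_{ij}\, c^{t}_{j.} + \beta h_i f_{i.}}
          {\sum_{j=1}^{V} b_{ij} + \beta h_i}
\end{align*}
```

Two special cases follow directly. An unlabeled sequence ($h_i = 0$) assigned to a single cluster $c_j$ simply copies that cluster's vector, $s^{t}_{i.} = c^{t}_{j.}$. A labeled sequence with $\beta = 10$ takes the weighted average $(c^{t}_{j.} + 10 f_{i.})/11$, so it stays close to its initial label.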

This iterative process is a procedure of information propagation among the clusters. To better demonstrate it, Fig. 5 shows an example of the information propagation. In each iteration step, each cluster estimates its class based on its members' classes while retaining its initial class $Y$ (as in Fig. 5-(A)). After all the clusters receive the label information (as in Fig. 5-(B)), they propagate their label information to their neighboring clusters based on the smoothness assumption (as in Fig. 5-(C)). After the clusters have received the information from their neighbors, they pass the information back to the nodes belonging to them, while the nodes retain their initial classes (as in Fig. 5-(D)). This process continues until convergence.

Algorithm 1 The PSIGM Algorithm
Input: Belongingness matrix $B_{N \times V}$, sequence labeled matrix $F_{N \times D}$, cluster labeled matrix $Y_{V \times D}$, cluster similarity matrix $W_{V \times V}$, and parameters $\alpha$, $\beta$, $\gamma$, $\epsilon$
Output: Estimated sequence probability matrix $S_{N \times D}$
1: Initialize $S^{0}$ randomly;
2: $t \leftarrow 1$;
3: begin
4:   repeat
5:     Update $C^{t}$ using Eq. (6);
6:     Update $S^{t}$ using Eq. (8);
7:     $t \leftarrow t + 1$;
8:   until $\|S^{t} - S^{t-1}\| \le \epsilon$
9:   Output $S^{t}$;
10: end
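Algorithm 1 can be sketched in NumPy as below. This is an illustrative reading of Eqs. (6) and (8), not the authors' implementation: it takes $A$ to be the belongingness matrix $B$, uses the Frobenius norm for the stopping test, adds a `max_iter` safeguard that is not part of Algorithm 1, and does not re-impose the constraints of Eq. (4) after each update. The toy inputs are hypothetical.

```python
import numpy as np

def psigm(B, F, Y, W, alpha=2.0, beta=10.0, gamma=2.0, eps=1e-4, max_iter=100, seed=0):
    """Alternate the closed-form updates of Eq. (6) and Eq. (8) until S stabilises."""
    B, F, Y, W = (np.asarray(M, dtype=float) for M in (B, F, Y, W))
    V = B.shape[1]

    # Line 1 of Algorithm 1: random starting point, normalised so rows sum to 1.
    rng = np.random.default_rng(seed)
    S = rng.random(F.shape)
    S /= S.sum(axis=1, keepdims=True)

    Dv = np.diag(B.sum(axis=0))   # cluster sizes
    Dn = np.diag(B.sum(axis=1))   # memberships per sequence
    Kv = np.diag(Y.sum(axis=1))   # 1 where a cluster has a prior label
    Hn = np.diag(F.sum(axis=1))   # 1 where a sequence has a prior label

    # L = D_w^{-1/2} W D_w^{-1/2}; clusters with zero degree keep zero rows.
    deg = W.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    L = inv_sqrt[:, None] * W * inv_sqrt[None, :]

    lhs_C = Dv + gamma * Kv + alpha * (np.eye(V) - L)
    lhs_S = Dn + beta * Hn
    for _ in range(max_iter):
        C = np.linalg.solve(lhs_C, B.T @ S + gamma * Kv @ Y)      # Eq. (6)
        S_new = np.linalg.solve(lhs_S, B @ C + beta * Hn @ F)     # Eq. (8)
        converged = np.linalg.norm(S_new - S) <= eps
        S = S_new
        if converged:
            break
    return S, C

# Hypothetical toy problem: 6 sequences, 3 clusters, 2 classes (0 = weak, 1 = strong).
B_toy = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]])
F_toy = np.array([[0, 1], [0, 1], [1, 0], [0, 1], [0, 0], [0, 0]])
Y_toy = np.array([[0, 1], [0, 0], [0, 0]])
W_toy = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])

S_hat, C_hat = psigm(B_toy, F_toy, Y_toy, W_toy)
print(S_hat.round(3))
```

The linear systems are solved with `np.linalg.solve` rather than by forming the inverses in Eqs. (6) and (8) explicitly, which is the usual numerically safer choice. The two unlabeled sequences in the last cluster receive their scores purely through propagation from the neighbouring clusters.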

5.4 Time Complexity

The time complexity of the proposed algorithm is composed of two parts: updating the cluster probability matrix $C$ and updating the sequence probability matrix $S$. For updating the matrix $C$, the time complexity is $O(V N^2 D + V^3 + V^2 D)$, where $N$ is the size of the peptide sequence pool, $V$ is the number of clusters and $D$ is the number of affinity classes. Because in our case the sequence set is usually much larger than the number of clusters, the time complexity for the first step is $O(V N^2 D)$. For updating the matrix $S$, the time complexity is $O(N^2 D + N V D)$, thus $O(N^2 D)$. Therefore, the overall time complexity per iteration is $O(V N^2 D)$. If the number of iterations is $k$, the time complexity of the whole algorithm is $O(k V N^2 D)$. In experiments, we observe that $k$ is usually between 8 and 20.

6 EXPERIMENTS

In the following, we first conduct experiments on both the quartz and gold binding sequence datasets to show that PSIGM is effective for identifying the binding affinity classes of inorganic material binding sequences, and then conduct experiments on the SCOP protein sequence dataset to show that PSIGM is a general framework which also works effectively on other kinds of sequence sets. Because most of our baselines are designed as binary classifiers, for the sake of simplicity, in the following experiments we only consider the case of weak and strong binder identification for the datasets mentioned in Section 3.1, although our proposed method is not restricted to binary classification. Throughout all the experiments, we set α = 2, β = 10, and γ = 2 as the default values of our algorithm. The rationale is that both α and γ depend on the clustering result, which is influenced by uncertain factors such as the number of clusters and the initial centroid of each cluster, so assigning relatively low values to them is better; on the other hand, β expresses our confidence in the labeled sequences, which come from strict and reliable experiments, so β should be assigned a relatively large value.

[Figure 5 image: four panels (A)-(D), each showing six numbered clusters. Node classes: Strong, Weak, Test, Simulated. Cluster classes: Strong, Weak, Unlabeled.]

Fig. 5: An example illustrating the label propagation at each iteration. (A) Partition of all the nodes (each sequence is represented as a node here) into multiple clusters; (B) conditional probability estimates of clusters (i.e. cluster probability matrix C) receiving the label information from the sequences (nodes) belonging to them; (C) each cluster propagates its class information to its neighboring clusters; and (D) after updating its probability, each cluster passes the label information back to its members' conditional probabilities (i.e. sequence probability matrix S).

6.1 Experiments on Material Binding Sequences

To show that our proposed framework is effective at predicting the binding affinity class of inorganic material binding sequences, we perform the following three experiments to demonstrate that: 1) simulated sequences and the information propagation among the clusters can effectively alleviate the limitation due to challenge I; 2) searching for patterns in clusters rather than classes helps in handling challenge II; and 3) the newly generated transition matrix contributes to our proposed method's performance. For each accuracy (i.e. the percentage of test examples correctly classified by the classifier when compared with the ground truth) shown in the following experiments, we performed the experiment 5 times using leave-one-out validation and report the mean value. We iteratively select one labeled sequence as the test sequence, and use the rest of the labeled sequences to generate the simulated sequences and train the model. Once the model is trained, we predict the test sequence's affinity class. The reason why we use leave-one-out validation rather than k-fold cross validation for the inorganic material binding sequences is that the number of labeled sequences is small. In such a case, each one of them may represent a significant underlying pattern or characteristic. In addition, for each experiment, we fix the convergence threshold ϵ to $10^{-4}$.
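The leave-one-out protocol described above can be sketched as follows. The `fit_predict` callback stands in for any classifier (PSIGM or a baseline), and the majority-class baseline and the toy labels are hypothetical placeholders used only to exercise the loop.

```python
def leave_one_out_accuracy(sequences, labels, fit_predict):
    """Hold out each labeled sequence once, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(sequences)):
        train_x = sequences[:i] + sequences[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        # In the paper's setting, simulated sequences would be generated from
        # train_x inside fit_predict, never from the held-out sequence.
        if fit_predict(train_x, train_y, sequences[i]) == labels[i]:
            correct += 1
    return correct / len(sequences)

def majority_baseline(train_x, train_y, test_x):
    """Trivial stand-in classifier: always predict the most frequent training label."""
    return max(set(train_y), key=train_y.count)

seqs = ["s1", "s2", "s3", "s4"]                  # placeholder identifiers
labs = ["strong", "strong", "strong", "weak"]
acc = leave_one_out_accuracy(seqs, labs, majority_baseline)
print(acc)
```

Generating the simulated sequences only from the training portion of each split keeps information about the held-out sequence out of the training pool.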

The effect of simulated sequences and cluster information propagation. To test the effect of simulated sequences and information propagation among the clusters, we ran our method with and without simulated sequences, and with and without information propagation among clusters. Furthermore, as mentioned in Section 5.2, we need to partition all the sequences into V clusters, so we also want to show the relationship between the proposed method's performance and the number of clusters. We vary the number of clusters over 2, 5, 10, 15, 18 and 20.

We show that both strategies, sequence simulation and information propagation among the clusters, are crucial for improving the performance. The result is shown in Fig. 6, where the x axis denotes the number of clusters and the y axis denotes the accuracy. We can see that simulated sequences contribute to the performance improvement. Also, in the absence of information propagation among clusters, the performance degrades. Finally, we notice that hill-like shapes appear as the number of clusters increases in all three cases. Most likely, when the number of clusters is too low, the model is close to a global view; when it is too high, the clusters are too small to learn from.

[Figure 6 image: accuracy (y axis, 0.65 to 0.85) versus number of clusters (x axis: 2, 5, 10, 15, 18, 20) for three settings: PSIGM, Without Information Propagation, Without Sequences Simulation.]

Fig. 6: Performance comparison with different strategies.

Performance comparison with baselines. In this part, we compare the proposed method with 5 other algorithms mentioned above, including SVM, Neural Network, HMM, and Learning with local and global consistency (LLGC) [48], which is a well-known graph-based semi-supervised learning algorithm. For fairness and comprehensiveness, we have also tried adding the simulated sequences used for our framework to these methods; these variants are marked as SVM*, Neural Network* and HMM*. Note that, since LLGC was designed as a semi-supervised algorithm which needs unlabeled instances to aid propagation of the labeled information, we only consider LLGC with the simulated sequences, which are used as unlabeled data. Thus, all the methods are separated into two groups: those using the simulated sequences and those not using them. To measure the influence of the mutation rate t which is used to generate the simulated sequences, we vary t over 5%, 10%, 15% and 20%. The results of predicting the binding affinity classes on the quartz and gold binding sequence datasets, compared with the baselines, are shown in Table 4 and Table 5, respectively.

Note that the proposed method significantly outperforms the others in predicting the affinity classes of the test inorganic material binding sequences in most cases. In addition, the performance of the proposed method is not very sensitive to the setting of the mutation rate over the range considered. It is worth noting that, instead of aiding the performance, the simulated sequences in SVM*, HMM* and NN* make the performance worse than without them. The main reason for this phenomenon is that all these methods take a global view of the training data, which is represented as only two large classes: strong binding and weak binding. However, as we mentioned, the sequences inside the same class may be very different, so the more simulated sequences are added, the less obvious the class-level patterns are likely to become. The proposed method treats the sequences locally as clusters, and thus it can properly handle this problem.

TABLE 4: Comparison with baselines under different mutation rates for quartz-binding sequences

| Method | 5% | 10% | 15% | 20% |
|---|---|---|---|---|
| PSIGM | 0.82 | 0.82 | 0.84 | 0.83 |
| LLGC | 0.60 | 0.63 | 0.60 | 0.62 |
| NN* | 0.60 | 0.62 | 0.58 | 0.55 |
| HMM* | 0.66 | 0.65 | 0.69 | 0.67 |
| SVM* | 0.62 | 0.63 | 0.58 | 0.59 |
| SVM | 0.68 | | | |
| NN | 0.65 | | | |
| HMM | 0.70 | | | |

TABLE 5: Comparison with baselines under different mutation rates for gold-binding sequences

| Method | 5% | 10% | 15% | 20% |
|---|---|---|---|---|
| PSIGM | 0.91 | 0.91 | 0.92 | 0.91 |
| LLGC | 0.81 | 0.82 | 0.84 | 0.81 |
| NN* | 0.86 | 0.82 | 0.84 | 0.81 |
| HMM* | 0.89 | 0.90 | 0.90 | 0.87 |
| SVM* | 0.83 | 0.82 | 0.80 | 0.82 |
| SVM | 0.84 | | | |
| NN | 0.82 | | | |
| HMM | 0.90 | | | |

In both tables the columns give the mutation rate. SVM, NN and HMM do not use simulated sequences, so a single accuracy is reported for each.

Parameter Sensitivity. There are three parameters in our objective function Eq. (4): α, β and γ. We conducted sensitivity experiments, shown in Fig. 7. In the experiments, when one parameter is varied, the other two parameters are fixed at their default settings (i.e., α = 2, β = 10, and γ = 2). Note that α represents our confidence in information propagation among clusters. The clusters of the sequences are obtained from arbitrary clustering methods, which are not very stable; in other words, the clustering may not be completely correct. Therefore, a smaller α usually yields better performance. β expresses our confidence in the prior knowledge of the sequence classes. These sequence classes, which are obtained from rigorous physical or chemical experiments, are deemed reliable, and thus a large β is usually better. γ denotes the confidence in the prior knowledge of cluster classes. This information may not be totally reliable; therefore a lower value usually yields better results. The results in Fig. 7 are consistent with these expectations.

[Figure 7 image: accuracy (y axis, 0.76 to 0.92) versus parameter value (x axis: 0, 1, 2, 5, 8, 10, 15, 20), with one curve each for α, β and γ.]

Fig. 7: Parameter sensitivity experiments.

Comparison with Varying Transition Matrices. Finally, we show the performance of the proposed method with different transition matrices. We want to demonstrate that the newly generated matrix improves the performance of the proposed algorithm for the target inorganic material. In this experiment, we also vary the mutation rate t over 5%, 10%, 15% and 20%. We have compared the new transition matrix M1, which was generated based on Blosum 62, and M2, which was generated based on Pam 250, with four other widely-used transition matrices, namely Blosum 62, Pam 250, Dayhoff [49] and Gonnet [50], in Table 6. The result shows that the newly generated transition matrices perform better than the others at each mutation rate.

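TABLE 6: Performance with different transition matrices (columns give the mutation rate)

| Transition Matrix | 5% | 10% | 15% | 20% |
|---|---|---|---|---|
| M1 | 0.82 | 0.82 | 0.84 | 0.83 |
| M2 | 0.80 | 0.81 | 0.82 | 0.82 |
| Blosum 62 | 0.72 | 0.74 | 0.75 | 0.74 |
| Pam 250 | 0.72 | 0.73 | 0.73 | 0.74 |
| Dayhoff | 0.71 | 0.71 | 0.72 | 0.72 |
| Gonnet | 0.79 | 0.78 | 0.79 | 0.80 |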

6.2 Experiments on Protein Sequences

The proposed PSIGM is a general framework which is not limited to identifying the affinity class of inorganic material binding sequences. To demonstrate this, we have used the SCOP protein data mentioned in Section 3.1. Instead of predicting the sequences' affinity classes, we consider the problem of homology family prediction: for a specific family, can the proposed framework distinguish the sequences belonging to it from those of the remaining families? Correspondingly, we construct seven identification tasks from this dataset, where the sequences from one particular family are used as the positive set and the sequences from the remaining six families are used as the negative set. For example, when the sequences in family A are used as the positive set, the sequences from families B, C, D, E, F and G are used as the negative set. Two experiments are performed to demonstrate that: 1) PSIGM is a general framework which can also handle traditional protein sequence identification; and 2) a moderate setting of the mutation rate is conducive to improving the performance.
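The construction of the seven one-versus-rest tasks can be sketched as below. The function name and the toy family list are hypothetical; real inputs would be the SCOP family annotation of each sequence.

```python
def one_vs_rest_tasks(family_of):
    """Map each family to binary labels: 1 for its own sequences, 0 for all others."""
    tasks = {}
    for fam in sorted(set(family_of)):
        tasks[fam] = [1 if f == fam else 0 for f in family_of]
    return tasks

families = list("AABCDEFG")          # toy annotation: two sequences from A, one from each other family
tasks = one_vs_rest_tasks(families)
print(len(tasks), tasks["A"])
```

Each task keeps every sequence in the dataset, so the negative set is typically much larger than the positive set.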

Performance comparison with baselines. It is worth noting that, by constructing the tasks in this way, the data acquire some of the characteristics of inorganic binding sequences. Each result shown in the following experiments (i.e. Table 7 and Fig. 9) is the average over 10 runs of 5-fold cross validation. Since the protein sequence dataset has relatively sufficient training samples and the sequences that belong to the same protein family are similar to each other, we have used cross validation rather than leave-one-out validation. Table 7 shows the result of predicting the homology family, compared with the baselines mentioned in Section 6.1. As the table shows, the proposed method matches or outperforms the other methods on every protein family. Note that the accuracies of predicting the homology families are much higher than the accuracies of predicting the affinity classes of the inorganic material binding sequences. The reason can be seen in Fig. 8, which shows the self-class and cross-class similarity of each prediction task. The more the cross-class similarity surpasses the self-class similarity, the more difficult it is to separate the two classes.

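TABLE 7: Comparison with baselines for each class

| Family | PSIGM | LLGC | SVM | HMM | NN |
|---|---|---|---|---|---|
| A | 0.999 | 0.80 | 0.998 | 0.966 | 0.913 |
| B | 0.965 | 0.81 | 0.959 | 0.944 | 0.947 |
| C | 0.999 | 0.82 | 0.999 | 0.952 | 0.968 |
| D | 0.999 | 0.82 | 0.999 | 0.969 | 0.999 |
| E | 0.999 | 0.82 | 0.985 | 0.935 | 0.999 |
| F | 0.978 | 0.82 | 0.946 | 0.952 | 0.857 |
| G | 0.987 | 0.82 | 0.975 | 0.972 | 0.929 |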

Mutation rate sensitivity. The performance of PSIGM is influenced by the setting of the mutation rate which is used to generate the simulated sequences. To evaluate how the mutation rate affects the performance, we increase it from 0.05 to 0.3 with a step of 0.05 and report the accuracy of each family's prediction task in Fig. 9. Most families show an increase in accuracy as the mutation rate rises until it reaches a threshold of 15%, after which the performance begins to decrease. This indicates that the performance of PSIGM can be improved by a moderate setting of the mutation rate. In addition, we can infer that PSIGM is effective not only at identifying the affinity classes of the inorganic material binding sequences, but also at predicting the homology families of traditional protein sequences.

[Figure 9 image: AUC (y axis, 0.93 to 1) versus mutation rate (x axis: 0.05, 0.1, 0.15, 0.2, 0.25, 0.3), with one curve for each family A to G.]

Fig. 9: Mutation rate sensitivity experiment.

[Figure 8 image: seven bar-chart panels, (a) Class A to (g) Class G, each plotting the similarity score (y axis) for the self-class and cross-class comparisons between a family and its complement (e.g. A versus Not A).]

Fig. 8: Total similarity scores of the self-class and the cross-class for each prediction task based on Pam 250. (A) self-class and cross-class TSS of class A and non-A; (B) self-class and cross-class TSS of class B and non-B; (C) self-class and cross-class TSS of class C and non-C; (D) self-class and cross-class TSS of class D and non-D; (E) self-class and cross-class TSS of class E and non-E; (F) self-class and cross-class TSS of class F and non-F; and (G) self-class and cross-class TSS of class G and non-G.

7 CONCLUSION AND FUTURE WORK

Identifying the affinity classes of peptide sequences binding to a specific inorganic material is a new and challenging research problem with broad applications. In this paper, we proposed a novel framework, PSIGM, to solve this problem. We begin by providing a two-step simulated peptide sequence generation method to make the training set more comprehensive and diverse. Moreover, unlike traditional machine learning approaches used for protein sequence identification, which try to find patterns at the class level, our framework partitions the sequences into smaller clusters and learns the patterns from them using a graph-based optimization model. Extensive experimental studies demonstrate that the proposed framework can effectively identify the affinity classes of the inorganic material binding sequences.

In the future, to achieve better performance, we plan to use a cyclic model to validate and retrain PSIGM: first, we will use PSIGM to select the sequences with the highest and lowest probabilities of binding to a target inorganic material as a candidate set; second, we plan to validate the candidate sequence set with efficient experimental methods such as QCM (Quartz Crystal Microbalance); finally, the validated sequences will be used to retrain PSIGM, and then new candidate sequences will be selected from the sequence database based on their affinity, and so on. We believe that with this cyclic validation model, we can not only further validate PSIGM's effectiveness but also keep improving it through retraining.

8 ACKNOWLEDGMENTS

This material is based upon work supported by the Air Force Office of Scientific Research (AFOSR), grant number FA9550-12-1-0226. We gratefully acknowledge the Victorian Life Sciences Computation Facility (VLSCI) for allocation of computational resources. TRW thanks veski for an Innovation Fellowship.

REFERENCES

[1] G. P. Smith, “Filamentous fusion phage: novel expression vectors that display cloned antigens on the virion surface,” Science, vol. 228, no. 4705, pp. 1315–1317, 1985.

[2] K. Y. Dane, C. Gottstein, and P. S. Daugherty, “Cell surface profiling with peptide libraries yields ligand arrays that classify breast tumor subtypes,” Molecular Cancer Therapeutics, vol. 8, no. 5, pp. 1312–1318, 2009.

[3] E. T. Boder and K. D. Wittrup, “Yeast surface display for screening combinatorial polypeptide libraries,” Nature Biotechnology, vol. 15, no. 6, pp. 553–557, 1997.

[4] E. Kasotakis, E. Mossou, L. Adler-Abramovich, E. P. Mitchell, V. T. Forsyth, E. Gazit, and A. Mitraki, “Design of metal-binding sites onto self-assembled peptide fibrils,” Biopolymers, vol. 92, no. 3, pp. 164–172, 2009.

[5] J. Kimling, M. Maier, B. Okenve, V. Kotaidis, H. Ballot, and A. Plech, “Turkevich method for gold nanoparticle synthesis revisited,” The Journal of Physical Chemistry B, vol. 110, no. 32, pp. 15700–15707, 2006.

[6] Y. Huang, C.-Y. Chiang, S. K. Lee, Y. Gao, E. L. Hu, J. De Yoreo, and A. M. Belcher, “Programmable assembly of nanoarchitectures using genetically engineered viruses,” Nano Letters, vol. 5, no. 7, pp. 1429–1434, 2005.

[7] K. T. Nam, D.-W. Kim, P. J. Yoo, C.-Y. Chiang, N. Meethong, P. T. Hammond, Y.-M. Chiang, and A. M. Belcher, “Virus-enabled synthesis and assembly of nanowires for lithium ion battery electrodes,” Science, vol. 312, no. 5775, pp. 885–888, 2006.

[8] R. R. Naik, S. J. Stringer, G. Agarwal, S. E. Jones, and M. O. Stone, “Biomimetic synthesis and patterning of silver nanoparticles,” Nature Materials, vol. 1, no. 3, pp. 169–172, 2002.

[9] E. Estephan, C. Larroque, F. J. G. Cuisinier, Z. Bálint, and C. Gergely, “Tailoring GaN semiconductor surfaces with biomolecules,” The Journal of Physical Chemistry B, vol. 112, no. 29, pp. 8799–8805, 2008.

[10] E. Estephan, M.-b. Saab, C. Larroque, M. Martin, F. Olsson, S. Lourdudoss, and C. Gergely, “Peptides for functionalization of InP semiconductors,” Journal of Colloid and Interface Science, vol. 337, no. 2, pp. 358–363, 2009.

[11] M. M. Tomczak, M. K. Gupta, L. F. Drummy, S. M. Rozenzhak, and R. R. Naik, “Morphological control and assembly of zinc oxide using a biotemplate,” Acta Biomaterialia, vol. 5, no. 3, pp. 876–882, 2009.

[12] C. Vreuls, G. Zocchi, A. Genin, C. Archambeau, J. Martial, and C. V. De Weerdt, “Inorganic-binding peptides as tools for surface quality control,” Journal of Inorganic Biochemistry, vol. 104, no. 10, pp. 1013–1021, 2010.

[13] R. R. Naik, L. L. Brott, S. J. Clarson, and M. O. Stone, “Silica-precipitating peptides isolated from a combinatorial phage display peptide library,” Journal of Nanoscience and Nanotechnology, vol. 2, no. 1, pp. 95–100, 2002.

[14] H. Chen, X. Su, K.-G. Neoh, and W.-S. Choe, “Probing the interaction between peptides and metal oxides using point mutants of a TiO2-binding peptide,” Langmuir, vol. 24, no. 13, pp. 6852–6857, 2008.

[15] Y. Liu, J. Mao, B. Zhou, W. Wei, and S. Gong, “Peptide aptamers against titanium-based implants identified through phage display,” Journal of Materials Science: Materials in Medicine, vol. 21, no. 4, pp. 1103–1107, 2010.

[16] M. B. Dickerson, S. E. Jones, Y. Cai, G. Ahmad, R. R. Naik, N. Kröger, and K. H. Sandhage, “Identification and design of peptides for the rapid, high-yield formation of nanoparticulate TiO2 from aqueous solutions at room temperature,” Chemistry of Materials, vol. 20, no. 4, pp. 1578–1584, 2008.

[17] C. Tamerler, T. Kacar, D. Sahin, H. Fong, and M. Sarikaya, “Genetically engineered polypeptides for inorganics: A utility in biological materials science and engineering,” Materials Science and Engineering C, vol. 27, no. 3, pp. 558–564, 2007.

[18] H. Chen, X. Su, K.-G. Neoh, and W.-S. Choe, “QCM-D analysis of binding mechanism of phage particles displaying a constrained heptapeptide with specific affinity to SiO2 and TiO2,” Analytical Chemistry, vol. 78, no. 14, pp. 4872–4879, 2006.

[19] E. Eteshola, L. J. Brillson, and S. C. Lee, “Selection and characteristics of peptides that bind thermally grown silicon dioxide films,” Biomolecular Engineering, vol. 22, no. 5-6, pp. 201–204, 2005.

[20] S. Donatan, M. Sarikaya, C. Tamerler, and M. Urgen, “Effect of solid surface charge on the binding behaviour of a metal-binding peptide,” Journal of the Royal Society Interface, no. April, pp. rsif.2012.0060–, 2012.

[21] E. E. Oren, C. Tamerler, D. Sahin, M. Hnilova, U. O. S. Seker, M. Sarikaya, and R. Samudrala, “A novel knowledge-based approach to design inorganic-binding peptides,” Bioinformatics, vol. 23, no. 21, pp. 2816–2822, 2007.

[22] M. Hnilova, E. E. Oren, U. O. S. Seker, B. R. Wilson, S. Collino, J. S. Evans, C. Tamerler, and M. Sarikaya, “Effect of molecular conformations on the adsorption behavior of gold-binding peptides,” Langmuir, vol. 24, no. 21, pp. 12440–12445, 2008.

[23] A. Vila Verde, P. J. Beltramo, and J. K. Maranas, “Adsorption of homopolypeptides on gold investigated using atomistic molecular dynamics,” Langmuir, vol. 27, no. 10, pp. 5918–5926, 2011.

[24] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, “Biological sequence analysis: probabilistic models of proteins and nucleic acids,” Cambridge Univ., 1998.

[25] A. Kumar and L. Cowen, “Augmented training of hidden Markov models to recognize remote homologs via simulated evolution,” Bioinformatics, vol. 25, no. 13, pp. 1602–1608, 2009.

[26] C. H. Wu, S. Zhao, H. L. Chen, C. J. Lo, and J. McLarty, “Motif identification neural design for rapid and sensitive protein family search,” Computer Applications in the Biosciences (CABIOS), vol. 12, no. 2, pp. 109–118, 1996.

[27] D. Wang, N. K. Lee, T. S. Dillon, and N. J. Hoogenraad, “Protein sequences classification using radial basis function (RBF) neural networks,” 2002.

[28] M. J. Grimble, “Adaptive systems for signal processing, commu-nications and control,” Control, vol. 3, 2001.

[29] R. Karchin, K. Karplus, and D. Haussler, “Classifying g-proteincoupled receptors with support vector machines.,” Bioinformatics,vol. 18, no. 1, pp. 147–159, 2002.

[30] M. Wistrand and E. L. L. Sonnhammer, “Improving profile hmmdiscrimination by adapting transition probabilities.,” Journal ofMolecular Biology, vol. 338, no. 4, pp. 847–854, 2004.

[31] X. Zhu, “Semi-supervised learning literature survey,” SciencesNewYork, vol. Tech. Rep., no. 1530, pp. 1–59, 2007.

[32] N. Du, M. R. Knecht, P. N. Prasad, M. T. Swihart, T. Walsh,and A. Zhang, “A framework for identifying affinity classes ofinorganic materials binding peptide sequence.,” ACM Conferenceon Bioinformatics, Computational Biology and Biomedical Informatics(ACM BCB), 2013.

[33] N. Terrapon, O. Gascuel, E. Marechal, and L. Brehelin, “Fittinghidden markov models of protein domains to a target species: ap-plication to plasmodium falciparum,” BMC Bioinformatics, vol. 13,no. 1, p. 67, 2012.

[34] M.-W. M. M.-W. Mak, J. G. J. Guo, and S.-Y. K. S.-Y. Kung, “Pair-prosvm: protein subcellular localization based on local pairwiseprofile alignment and svm.,” IEEEACM Transactions on Computa-tional Biology and Bioinformatics, vol. 5, no. 3, pp. 416–422, 2008.

[35] J. Tian, H. Gu, W. Liu, and C. Gao, “Robust prediction of pro-tein subcellular localization combining {PCA} and {WSVMs},”Computers in Biology and Medicine, vol. 41, no. 8, pp. 648 – 652,2011.

[36] L. Ge, N. Du, and A. Zhang, “Finding informative genes from multiple microarray experiments: A graph-based consensus maximization model,” in Proceedings of the 2011 IEEE International Conference on Bioinformatics and Biomedicine, BIBM ’11, pp. 506–511, 2011.

[37] Z. Tang, J. Palafox-Hernandez, W.-C. Law, Z. Hughes, M. T. Swihart, P. N. Prasad, M. R. Knecht, and T. R. Walsh, “Biomolecular recognition principles for bionanocombinatorics: An integrated approach to elucidate enthalpic and entropic factors,” ACS Nano, vol. 7, pp. 9632–9646, 2013, DOI: 10.1021/nn404427y.

[38] Y. N. Tan, J. Y. Lee, and D. I. C. Wang, “Uncovering the design rules for peptide synthesis of metal nanoparticles,” Journal of the American Chemical Society, vol. 132, no. 16, pp. 5677–5686, 2010.

[39] M. Hnilova, C. R. So, E. E. Oren, B. R. Wilson, T. Kacar, C. Tamerler, and M. Sarikaya, “Peptide-directed co-assembly of nanoprobes on multimaterial patterned solid surfaces,” Soft Matter, vol. 8, pp. 4327–4334, 2012.

[40] S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, 1970.

[41] W. R. Pearson, “Rapid and sensitive sequence comparison with FASTP and FASTA,” Methods in Enzymology, vol. 183, no. 1988, pp. 63–98, 1990.

[42] S. Henikoff and J. G. Henikoff, “Amino acid substitution matrices from protein blocks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 89, no. 22, pp. 10915–10919, 1992.

[43] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: A structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[44] Afiahayati and S. Hartati, “Multiple sequence alignment using hidden Markov model with augmented set based on BLOSUM 80 and its influence on phylogenetic accuracy,” 2010.

[45] T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981.

[46] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.

[47] D. P. Bertsekas, Nonlinear Programming, vol. 43. Athena Scientific, 1995.

[48] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” Advances in Neural Information Processing Systems 16, vol. 1, pp. 595–602, 2003.

[49] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, “A model of evolutionary change in proteins,” Atlas of Protein Sequence and Structure, vol. 5, no. Suppl 3, pp. 345–352, 1978.

[50] G. H. Gonnet, M. A. Cohen, and S. A. Benner, “Exhaustive matching of the entire protein sequence database,” Science, vol. 256, no. 5062, pp. 1443–1445, 1992.


Nan Du received his B.S. degree from Guangdong University of Technology in 2006 and his M.S. degree from Southern China University of Technology in 2009. Since 2009, he has been working toward the Ph.D. degree at the State University of New York at Buffalo, NY, under the supervision of Prof. Aidong Zhang. His research interests are in the areas of data mining, machine learning, and bioinformatics.

Marc R. Knecht earned a B.S. degree in Chemistry from Duquesne University in 2001. In 2004, he received a Ph.D. in Bio-Inspired Chemistry from Vanderbilt University under the direction of Professor David W. Wright, followed by postdoctoral research at the University of Texas with Professor Richard M. Crooks focused on characterizing the structure/function relationship of nanocatalysts. After completing postdoctoral studies, he began his independent career as an assistant professor of Chemistry at the University of Kentucky. In the summer of 2011, Professor Knecht joined the Department of Chemistry at the University of Miami as an associate professor. During his independent career, Professor Knecht has established a research program focused on elucidating the effects of the biotic/abiotic interface of bio-inspired nanomaterials. In this regard, his group has employed high-resolution characterization, activity studies, and synthetic analyses of peptides to demonstrate that the biological surface of bionanomaterials possesses significant control over the functionality and could serve as modification sites to control the activity. He has authored 47 publications in this area.

Mark T. Swihart is a Professor in the Department of Chemical and Biological Engineering at the University at Buffalo (SUNY). He earned a B.S. in Chemical Engineering from Rice University in 1992 and a Ph.D. in Chemical Engineering from the University of Minnesota in 1997. He then spent one year as a postdoctoral researcher in Mechanical Engineering at the University of Minnesota before joining the University at Buffalo as an assistant professor in 1998. Since 2007, he has directed a university-wide strategic initiative in Integrated Nanostructured Systems. His research interests include synthesis, processing, and applications of nanoparticles and other nanomaterials, and he has co-authored more than 120 journal papers in these areas. Dr. Swihart is a recipient of the Kenneth Whitby award from the American Association for Aerosol Research, the Schoellkopf medal from the Western New York section of the American Chemical Society, and the J.B. Wagner award from the Electrochemical Society.

Zhenghua Tang is currently a postdoctoral research associate in the group of Marc R. Knecht at the University of Miami. He obtained his B.S. degree from the College of Chemistry and Chemical Engineering, Lanzhou University, Lanzhou, Gansu, P. R. China, in 2005. He attended graduate school there from August 2005 to June 2007. During his graduate study, he spent about one year (2006-2007) as a visiting student at the Institute of Chemistry, Chinese Academy of Sciences (ICCAS). In August 2007, he moved to the US, and he obtained his Ph.D. degree in chemistry from the Department of Chemistry, Georgia State University, in July 2012. He has held his current position since August 2012. His research focuses on bio-inspired nanomaterials for targeted applications, including bionanocombinatorics, self-assembly, catalysis, and multifunctional design. He is the recipient of the 2010 Chair's Award in the Chemistry Department at GSU as well as the 2011 Chinese Government Award for Outstanding Self-Financed Students Abroad.

Tiffany R. Walsh graduated with a B.Sci(Hons) from the University of Melbourne. She earned her PhD degree in theoretical chemistry from the University of Cambridge, U.K., working in the group of Prof. David Wales in the Dept. of Chemistry as a Cambridge Commonwealth Trust scholar. Walsh then joined the Dept. of Materials, University of Oxford, U.K., as a postdoctoral researcher in the Materials Modelling Laboratory (MML) with Prof. Adrian Sutton. She was then awarded a Glasstone fellowship, which she held in the MML in Oxford. In 2002, she joined the faculty of the University of Warwick, U.K., in a joint appointment in the Dept. of Chemistry and the Centre for Scientific Computing. Her research focuses on computational modelling of the interface between biomolecules and inorganic surfaces, using molecular dynamics simulations. She was a lead investigator in the team that won £5.3 M (US$8.2 M) of funding for a 5-year EPSRC Programme Grant in this area (started in October 2010). In 2012, Walsh joined the Institute for Frontier Materials at Deakin University in Australia, where she holds the position of Associate Prof. in Bio/Nanotechnology.

Dr. Aidong Zhang is University Distinguished Professor and Chair in the Department of Computer Science and Engineering at the State University of New York at Buffalo. Her research interests include bioinformatics, data mining, multimedia and database systems, and content-based image retrieval. She is an author of over 250 research publications in these areas. She has chaired or served on over 100 program committees of international conferences and workshops, and currently serves on several journal editorial boards. She has published two books, Protein Interaction Networks: Computational Analysis (Cambridge University Press, 2009) and Advanced Analysis of Gene Expression Microarray Data (World Scientific Publishing Co., Inc., 2006). Dr. Zhang is a recipient of the National Science Foundation CAREER award and the State University of New York (SUNY) Chancellor’s Research Recognition award. Dr. Zhang is an IEEE Fellow.