learning structure activity relationship (sar) of the

113
doi.org/10.26434/chemrxiv.14609670.v1 Learning Structure Activity Relationship (SAR) of the Wittig Reaction from Genetically-Encoded Substrates Kejia Yan, Vivian Triana, Sunil Vasu Kalmady, Kwami Aku-Dominguez, Sharyar Memon, Alex Brown, Russell Greiner, Ratmir Derda Submitted date: 18/05/2021 Posted date: 19/05/2021 Licence: CC BY-NC-ND 4.0 Citation information: Yan, Kejia; Triana, Vivian; Vasu Kalmady, Sunil; Aku-Dominguez, Kwami; Memon, Sharyar; Brown, Alex; et al. (2021): Learning Structure Activity Relationship (SAR) of the Wittig Reaction from Genetically-Encoded Substrates. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.14609670.v1 The Wittig reaction can be used for late stage functionalization of proteins and peptides to ligate glycans, pharmacophores, and many other functionalities. In this manuscript, we modified 160,000 N-terminal glyoxaldehyde peptides displayed on phage with the Wittig reaction by biotin labeled ylide under conditions that functionalize only 1% of the library population. Deep-sequencing of the biotinylated and input populations estimated the rate of conversion for each sequence. This “deep conversion” (DC) from deep sequencing correlates with rate constants measured by HPLC. Peptide sequences with fast and slow reactivity highlighted a critical role of primary backbone amides (N-H) in accelerating the rate of the aqueous Wittig reaction. Experimental measurement of reaction rates and density functional theory (DFT) computation of the transition state geometries corroborated this relationship. We also collected deep-sequencing data to build structure activity relationship (SAR) models that can predict DC value of the Wittig reaction. By using this data, we trained two classifier models based on Gradient Boosted trees. These classifiers achieved area under the ROC (Receiver Operating Characteristic) Curve (ROC AUC) of 81.2 ± 0.4 and 73.7 ± 0.8 (90–92% accuracy) in determining whether a sequence belonged to the top 5% or the bottom 5% in terms of its reactivity. We have deployed our learned models as a publicly available web app: http://44.226.164.95/ We anticipate that phage-displayed peptides and related mRNA or DNA-displayed substrates can be employed in a similar fashion to study the substrate scope and mechanisms of many other chemical reactions. File list (3) download file view on ChemRxiv Derda_MainText.pdf (0.93 MiB) download file view on ChemRxiv Electronic Supplementary Information.pdf (6.63 MiB) download file view on ChemRxiv Supplementary Data.rar (68.99 MiB)

Upload: others

Post on 13-Jan-2022

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Learning Structure Activity Relationship (SAR) of the

doi.org/10.26434/chemrxiv.14609670.v1

Learning Structure Activity Relationship (SAR) of the Wittig Reactionfrom Genetically-Encoded SubstratesKejia Yan, Vivian Triana, Sunil Vasu Kalmady, Kwami Aku-Dominguez, Sharyar Memon, Alex Brown, RussellGreiner, Ratmir Derda

Submitted date: 18/05/2021 • Posted date: 19/05/2021Licence: CC BY-NC-ND 4.0Citation information: Yan, Kejia; Triana, Vivian; Vasu Kalmady, Sunil; Aku-Dominguez, Kwami; Memon,Sharyar; Brown, Alex; et al. (2021): Learning Structure Activity Relationship (SAR) of the Wittig Reaction fromGenetically-Encoded Substrates. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.14609670.v1

The Wittig reaction can be used for late stage functionalization of proteins and peptides to ligate glycans,pharmacophores, and many other functionalities. In this manuscript, we modified 160,000 N-terminalglyoxaldehyde peptides displayed on phage with the Wittig reaction by biotin labeled ylide under conditionsthat functionalize only 1% of the library population. Deep-sequencing of the biotinylated and input populationsestimated the rate of conversion for each sequence. This “deep conversion” (DC) from deep sequencingcorrelates with rate constants measured by HPLC. Peptide sequences with fast and slow reactivity highlighteda critical role of primary backbone amides (N-H) in accelerating the rate of the aqueous Wittig reaction.Experimental measurement of reaction rates and density functional theory (DFT) computation of the transitionstate geometries corroborated this relationship. We also collected deep-sequencing data to build structureactivity relationship (SAR) models that can predict DC value of the Wittig reaction. By using this data, wetrained two classifier models based on Gradient Boosted trees. These classifiers achieved area under theROC (Receiver Operating Characteristic) Curve (ROC AUC) of 81.2 ± 0.4 and 73.7 ± 0.8 (90–92% accuracy)in determining whether a sequence belonged to the top 5% or the bottom 5% in terms of its reactivity. Wehave deployed our learned models as a publicly available web app: http://44.226.164.95/ We anticipate thatphage-displayed peptides and related mRNA or DNA-displayed substrates can be employed in a similarfashion to study the substrate scope and mechanisms of many other chemical reactions.

File list (3)

download fileview on ChemRxivDerda_MainText.pdf (0.93 MiB)

download fileview on ChemRxivElectronic Supplementary Information.pdf (6.63 MiB)

download fileview on ChemRxivSupplementary Data.rar (68.99 MiB)

Page 2: Learning Structure Activity Relationship (SAR) of the

Learning Structure Activity Relationship (SAR) of the Wittig Reaction from Genetically-Encoded

substrates

Authors: Kejia Yana, Vivian Trianaa, , Sunil Vasu Kalmadyb, Kwami Aku-Domingueza, Sharyar Memonc

Alex Browna, Russell Greinerb,d, Ratmir Derdaa*

a Department of Chemistry, University of Alberta, Edmonton, AB T6G 2G2, Canada.

b Department of Computing Science, University of Alberta, Alberta, AB T6G 2E8, Canada.

c Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 1H9,

Canada.

d Alberta Machine Intelligence institute, Alberta, AB T5J 3B1, Canada.

Corresponding author Email: [email protected]

Page 3: Learning Structure Activity Relationship (SAR) of the

2

Abstract

The Wittig reaction can be used for late stage functionalization of proteins and peptides to ligate glycans,

pharmacophores, and many other functionalities. In this manuscript, we modified 160,000 N-terminal

glyoxaldehyde peptides displayed on phage with the Wittig reaction by biotin labeled ylide under conditions

that functionalize only 1% of the library population. Deep-sequencing of the biotinylated and input

populations estimated the rate of conversion for each sequence. This “deep conversion” (DC) from deep

sequencing correlates with rate constants measured by HPLC. Peptide sequences with fast and slow

reactivity highlighted a critical role of primary backbone amides (N-H) in accelerating the rate of the aqueous

Wittig reaction. Experimental measurement of reaction rates and density functional theory (DFT)

computation of the transition state geometries corroborated this relationship. We also collected deep-

sequencing data to build structure activity relationship (SAR) models that can predict DC value of the Wittig

reaction. By using this data, we trained two classifier models based on Gradient Boosted trees. These

classifiers achieved area under the ROC (Receiver Operating Characteristic) Curve (ROC AUC) of 81.2 ±

0.4 and 73.7 ± 0.8 (90–92% accuracy) in determining whether a sequence belonged to the top 5% or the

bottom 5% in terms of its reactivity. We have deployed our learned models as a publicly available web app:

http://44.226.164.95/ We anticipate that phage-displayed peptides and related mRNA or DNA-displayed

substrates can be employed in a similar fashion to study the substrate scope and mechanisms of many other

chemical reactions.

Page 4: Learning Structure Activity Relationship (SAR) of the

3

Introduction

Profiling multiple substrates under the same reaction conditions is a cornerstone of mechanistic

organic chemistry. Optimization of chemical reactions, discovery of catalytic systems, and mechanistic

studies hinge on measurements of reaction rates of multiple substrates and conditions, all these situations

need a more efficient way to select the right substrates from massive compounds. The data collected in such

structure activity relationship (SAR) studies serve as essential input for developing mechanistic hypotheses

and decision making in the discovery of new reactions, however, measurement of the rates of a plurality of

substrates under different reaction conditions one-by-one is time consuming.1 Quantitative analyses starting

from pioneering work of Hammet and co-workers on linear free energy relationships (LFER)2 to modern

approaches that employ Multiple Linear Regression (MLR)3 and other Machine Learning (ML) methods4

permit converting observations from SAR studies to quantitative models that relate reactivity to observable

physical properties such as pKa or theoretically calculated parameters such as HOMO/LUMO energies.

These models, combined with “qualitative chemical intuition,” allow prediction of optimal conditions and

substrates for a particular reaction and provide critical insight in reaction mechanisms. The most valuable

input into these models consists of both “positive” and “negative” data (i.e., fast and slow reactions). The

same requirements exist in other machine learning fields: sets with positive and negative observations serve

as the most effective input for the training of models.5-7 Methods that allow collecting a large unbiased set

of reactivity data, facilitate building models that minimize the bias that could originate from human decision

making such as conscious selection of substrates with anticipated reactivity.

The high-throughput screening (HTS) technique was developed in the early 1990s for the discovery

of pharmaceutically valuable molecules.8, 9 It was later repurposed for high-throughput acquisition of SAR

information. Such HTS approaches have facilitated not only optimization of chemical reactions10 but also

discovery of catalytic systems,11-14 and unexpected chemical transformations.9 Screen that mix of multiple

Page 5: Learning Structure Activity Relationship (SAR) of the

4

substrates in one solution15 expand the traditional one-well-one-reaction format.9, 16 Multisubstrate

screening17, 18 can evaluate yields and enantiomeric excesses for multiple substrates converted by a catalyst,19

determine reaction kinetics, and corresponding rate constants of a large number of reactions in a few

experiments (340 measurements in 17 experiments), 20 and it can be used to discover new transformations.

Traditional considerations in multisubstrate approaches are throughput and non-linearity. The complexity of

the mixture should be amenable to the separation by the liquid or gas chromatography (LC or GC) and the

concentration of individual substrates should be sufficiently high to permit the analysis by mass-

spectrometry (MS). However, if substrates are present at high concentrations, nonlinear effects21 may

originate in such mixture due to the cross-reactivity between multiple substrates and products. Genetically-

encoded (GE) peptide libraries9, 16 and DNA-encoded small molecule libraries (DEL)22-26 make the DNA

message associate with each substrate which can be amplified from single copy number. This amplification

makes it possible to perform the screen at concentrations that minimize undesired cross-substrate reactions

(Fig S1) while conveniently increasing the number of substrates that can be interrogated to 106–1012 scale.

In this manuscript, we perform multisubstrate screening using >105 genetically-encoded substrates present

in one solution at 10 attomolar/substrate concentration and employ deep-sequencing to monitor the reactions.

Such a concentration of substrates minimizes the interaction between the substrates and can be used to collect

valuable information to understand the SAR of the Wittig reaction.

Phage display and DEL technology have been successfully applied to discover new reactions.22-24, 26-

29 A traditional application of DEL or GE-peptide library in reaction discovery aims to enrich a rare subset

of substrates that exhibit a faster reactivity than the average population. Such approach can be broadly

characterized as a “gain of function” screen. Knowledge emanating from the screens focused on “gain” in

chemical reactivity is insufficient for comprehensive structure-reactivity models. The problem can be easily

resolved if the genetically-encoded screen is modified to identify both “gain” and “loss” of function in

Page 6: Learning Structure Activity Relationship (SAR) of the

5

chemical reactivity. On the other hand, quantitative models built from continuous datasets including

examples that both increase and decrease the rate of the reaction can be made to predict chemical reactivity

on both ‘gain’ and ‘loss’ of function. Here we demonstrate that modification of GE-libraries under

kinetically-controlled conditions that convert ~1% of the library and deep-sequencing of the modified

population identified both the fast and slow-reacting substrates. We implement such a GE-screen to identify

the peptides that increase and decrease the rate of the Wittig reaction in water and then train the dataset use

machine learning to predict peptide reactivity.

The Wittig reaction is a versatile and biocompatible carbon-carbon bond forming process; the

product of the reaction can serve as a versatile electrophile or dienophile building block, or as a warhead for

reversible covalent trapping of biological nucleophiles. The Wittig reaction has been employed to synthesize

DNA-encoded libraries and modify phage displayed libraries of aldehyde-peptides.30, 31 The substrates for

the Wittig reaction—libraries of peptide aldehydes—have been generated by NaIO4 oxidation of readily

available phage-displayed libraries with N-terminal Ser.32-34 Aldehydes can also be introduced by unnatural

amino acid mutagenesis35 or by formyl glycine generating enzymes.36 The mechanism of the Wittig reaction

in water remains a topic of extensive research.37-39 It proceeds through a formal cycloaddition with an early

oxaphosphetane (OPA) transition state (TS). The stereo-electronic properties of the aldehydes influence the

geometry and energy around the OPA manifold and change the rate and stereo chemical outcome of the

Wittig reaction. Peptide aldehydes substrates contain a rich repertoire of functional groups that could

potentially stabilize or destabilize the OPA TS. The effect of hydroxyl groups with pKa of 7–15, ammonium

ions with pKa 7–12, carboxylates, aromatic rings with different “π-basicity” and different H-bond donors

and acceptors on the stability of the OPA in water is not obvious. The GE-screen described in this report

profiled the effect of a diverse combination of these groups in 105 peptide-aldehydes displayed on phage.

The predictions built on these observations illuminated the unique role of primary amides in stabilizing the

Page 7: Learning Structure Activity Relationship (SAR) of the

6

transition state via short-range non-covalent interactions.

Result and Discussion

Wittig reaction in linear phage library

Multisubstrate profiling of the Wittig reaction repurposed classical tagging strategies employed for

physical separation of the reacted subset of library members from the remaining unreacted population.32, 40

Specifically, successful Wittig reaction between phage-displayed peptide-aldehyde and biotin-tagged ylide

introduced biotin into a product. Tagged products can be affinity separated from the unreacted population

and analyzed by deep-sequencing. We employed periodate oxidation28, 32 of linear SXXXX libraires

displayed on phage41 to produce libraries of aldehyde substrates denoted as CHO-XXXX (Fig 1a, X denotes

any of the 20 amino acids). The resulting library of (20)4=160,000 substrates can be characterized

completely by Illumina sequencing (Fig 1–c).41 Change in the library composition that leads to either

enrichment or depletion of substrates in these libraries can be, therefore, reliably quantified.

Based on previously reported average rates of reaction on phage (k=0.2 M-1s-1),31 a 10 minute reaction

time and 0.4 mM Ylide Ester Biotin (YEB) should convert ~5% of the population to a biotinylated product

(Fig S1). Kinetic modeling of the distribution of the products in the multisubstrate reaction predicts that

interrupting the reaction at 5% conversion of the aldehyde population and counting the copy number of the

biotinylated products make it possible to identify substrates with rates that range from 0.002 M-1s-1 to 20 M-

1s-1 (Fig S1d). Therefore, interruption of the reaction at 1–5% conversion should make it possible to identify

substrates that react 100-time faster or 100-time slower than the average substrate. Adding these pre-

determined concentrations of YEB to 109 phage virions and quenching with acidic buffer at the specified

time intervals, ensured reproducible biotinylation of ~1% of the population of phage clones.

The pull-down process selects not only the “specific” biotinylated population but also the peptide

Page 8: Learning Structure Activity Relationship (SAR) of the

7

sequences that bind non-specifically to the components in the system such as protein streptavidin agarose

beads. A set of two controls assessed the magnitude of such non-specific binding to be 1–5% of the specific

biotin-streptavidin interaction. . Recovery of biotinylated peptides to streptavidin blocked biotin-binding site

(set Y, Figure 1e) and recovery of the unreacted aldehyde-peptide population by streptavidin (set Z, Figure

1e) was decreased by factors of 22 and 120 when compared to the recovery of the “specific” population (set

X, Fig 1e). We collected these “specific” and control populations in 3–5 independent experiments, subjected

them to PCR and Illumina sequencing. The combined data was used to identify the copy number of the

peptides that were biotinylated during the Wittig reactions (Fig 1d, S2, all the sequencing data are available

in ESI as VT_unfiltered_Feb.txt).

Figs 1e and S2 summarize the calculation of the absolute number of biotinylated particles from

sequencing data accounting for factors like sequencing depth and amount of phage particles in each specific

experiment. “Ni” (copy number of the peptide in the naïve library) was critical to account for sequences that

were present in a high copy number in naïve libraries but did not react fast in the Wittig reaction. Applying

normalization and sampling correction to peptide sequences, values denoted “Deep Conversion” (DC) for

over 50,000 peptide aldehyde sequences were generated. For example, biotin-tagged SPPXX sequences were

observed in high copy in captured population and they can be mistakenly interpreted as “most reactive”

substrates (Fig 1d). On the other hand, these sequences were also present in high copy number in the naïve

library (Fig 1c) and introduction of normalization predicted that SPPXX sequences in fact have the lowest

DC values in the Wittig reaction (Fig 1f). We found that the most straightforward approach for observing

the relation between sequence and conversion was a library-wide visualization tool referred to as “20×20

plot” using in our previous publications (Fig 1f).41

To confirm the observations predicted by high and low Deep Conversion (DC) indeed correlated

with experimentally determined reactivities of peptide aldehydes, we selected a series of peptides predicted

Page 9: Learning Structure Activity Relationship (SAR) of the

8

to be “fast” (DC ≥ 0.35), “medium” (0.15 ≤ DC ≤ 0.35) and “slow” (DC ≤ 0.15), synthesized them and

validated the reaction rate (Figs S3–S5). The results showed an agreement between the experimentally

determined conversion measured by HPLC and the DC (Fig 2a, Fig S6) in a range of two orders of magnitude

for both values. Specifically, we reported previously that the average rate for library of peptide aldehydes

displayed on phage was 0.23 ± 0.09 M-1s-1. 31 This rate constant was similar to the rate constant for “average”

synthetic peptide sequences that did not contain Trp or Pro residues in the first two positions (Fig 2a). In

contrast, synthetic aldehydes HCO-PPLA and HCO-PPPL exhibited rates of 0.017 ± 0.005 and 0.014 ± 0.03

M-1s-1 respectively. These sequences were up to 13–16-fold slower than the average population. This

observation was in line with the observed decrease in DC for SPPXX sequences highlighted by the 20x20

plot (Fig 1f). To test the effect of proline in a specific position, we systematically evaluated the reactivity of

HCO-PPPA, HCO-PPAA, HCO-PAAA, HCO-APAA and HCO-AAAP sequences (red in Fig 2a, Fig S7, all

the peptide traces can be found in Fig S28–S67). Replacing either the 1st or 2nd amino acid with Proline

lead only to a modest 2–3 fold decrease in rate whereas simultaneous replacement of the 1st and 2nd amino

acids with Prolines resulted in a 10-fold decrease. Subsequent introduction of Proline in the of 3rd or 4th

position had little additional effect on the rate of the Wittig reaction. In analysis of HCO-PXXX sequences,

we noted a known side reaction that consumed HCO-PXXX aldehydes via proline-assisted 6-exo-trig attack

of an amide nitrogen onto the aldehyde(Fig S8).42 In HPLC assay, the rate of the Wittig reaction for HCO-

PXXX was measured to be 0.12–0.17 M-1s-1 and the rates of cyclization were in a similar range. Combination

of both processes—see kinetic equation in Fig 2b—contributed to an apparent decrease in DC. We also

examined the factors that led to an increase in DC and focused on peptides with tryptophan in the first and

second position. Introduction of only one tryptophan in peptides HCO-QWLH, HCO-WIVR, HCO-HWFP,

HCO-LWYR and HCO-WLPR (Fig 2a) had no statistically significant increase from average reactivity,

whereas sequences HCO-WWPQ and HCO-WWGL with two tryptophan residues in both positions,

Page 10: Learning Structure Activity Relationship (SAR) of the

9

exhibited significant increases from the average rate and more than 50-fold increase from the lowest rate in

the population. These results suggested that the first and second N-terminal amino acids play critical

synergistic role in the rate of the Wittig reaction, possibly by stabilizing or destabilizing the OPA TS. These

observations also confirmed that DC values serve as a good predictive surrogate for fast and slow reaction.

Investigation of the transition states of Wittig reaction

Proline, unlike all other proteogenic amino acids forms no intramolecular hydrogen bonds between

the amide hydrogen and carbonyl groups nearby (Fig 2c, Fig S7). Another evidence for the role of backbone

N-H emanated from measurement of the rates of Wittig reaction with di-alanine and di-Sarcosine oxaloyl

aldehydes (Fig 3c). An HCO-Sarcosine-Sarcosine aldehyde devoid of backbone N-H exhibited a 5 fold

decrease in reactivity when compared to isomeric HCO-Ala-Ala aldehyde. To test further whether backbone

primary amides contribute to stabilization of the transition state, we performed DFT calculations of the

transition state geometry using the Gaussian 09 software43 and and B3LYP 6-31G(d) basis set. Due to the

conformational flexibility and complexity of oxyclic tetrapeptides and without access to a putative starting

structure for the TS, the TS could not be located for the full tetrapeptide. On the other hand, the TS

geometries of simpler model dipeptides such as HCO-Ala-Ala and HCO-Sarcosine-Sarcosine and a simple

ylide CH3OCOCH2PPH3 were successfully optimized using DFT; further details on the computational

methods are provided in ESI. Similarly, to previous calculations of the TS of the Wittig reaction for stabilized

ylides, we observed two transition states and TS1 exhibits a higher energy barrier than TS2 (Fig S9).37, 38

We focused on the geometry and relative energies of the TS1 to understand the preferences in the rate

limiting step of these reactions. In the optimized geometry of TS1 of HCO-Ala-Ala, we observed two

intramolecular hydrogen bonds between the backbone N-H and two carbonyls of the oxaloyl (Fig 3a). The

analysis in Fig 4 is focused on the E-isomer of TS1 because both DFT calculations and experiments favor

formation of E Wittig product (Fig S9). In the TS1 of HCO-Sarcosine-Sarcosine, N-methylated backbone

Page 11: Learning Structure Activity Relationship (SAR) of the

10

amides were unable to form such interactions and showed 10 kcal/mol higher than HCO-Ala-Ala TS1 (Fig

3b). As these two TS1 have the same atomic composition, the difference in energy interactions can be

neglected. Thus, the energy barrier of HCO-Ala-Ala TS1 can be attributed to the stabilization of HCO-Ala-

Ala TS1 by two hydrogen bonds. Experimental measurements of the rate constants of the reaction between

HCO-Ala-Ala, HCO-Sarcosine-Sarcosine and ylide CH3OCOCH2PPH3, corroborated the results from

computations (Fig 3c). We were also able to obtain TS1 and TS2 for HCO-Pro-Pro and HCO-Trp-Trp

substrates and observed a similar role of backbone amides in TS1 (Fig S10–S11).

To determine whether the change in the rate of the Wittig reaction was due to amino acid influences

on the OPA TSs or the inherent reactivity of substrates towards any nucleophilic attack, we measured the

rate of reaction of HCO-PPAA and HCO-WWRR sequences with 2,4-dinitrophenylhydrazine. To our

surprise, the reactivity for hydrazine ligation was reversed (Fig S12 and S13) and HCO-PPAA reacted at

least 2 times faster than HCO-WWRR both at pH 0 and pH 5. These results (Figs S14 and S15) demonstrate

that the increase in reactivity is not due to the inherent increase of the electrophilicity of the substrate but to

reaction-specific effects such as the geometry and the relative energy of the TS.

DFT calculations showed lower energies of TS1-E barriers when compared to TS1-Z configuration

for all calculated substrates (Fig S9). HPLC and NMR analysis confirmed E product was favoured over Z

for HCO-Ala-Ala (E/Z 4.5:1) and HCO-Sarcosine-Sarcosine (E/Z 3.6:1) (Fig S19–S20). We note that the

selection process was not designed to select sequences that react stereoselectively. Still, we noticed

interesting changes in E/Z selectivity in the Wittig reaction of “fast” (CHO-WWRR 1:1 E/Z), “medium”

(CHO-HWFP 1:9 E/Z) and “slow” (CHO-PPAA 1:9 E/Z) reacting sequences (Figs S16–S18). These results

provide additional evidence that the side chains of the amino acids and backbone amides might exert

different influence on the E-OPA vs Z-OPA transition states.

Application to the cyclic library

Page 12: Learning Structure Activity Relationship (SAR) of the

11

Applying the conditions (10 min. reaction, 0.4 mM [YEB]) that convert 1% of the population to

another library SXCXXXC, we found that the effect of amino-acid residues on conversion was significantly

attenuated when compared to the SX4 library (Fig S21). These results were not surprising because the

previous observations suggested that synergistic contributions from the amino acids both positions 1 and 2

are critical to change the reactivity. However, in the SXCXXXC library the synergy between position 1 and

2 is not possible because this library contains a constant cysteine at the second position. As in SX4, the

presence of proline in the first, but not the third position led to a decrease of DC due to proline-assisted 6-

exo-trig attack of amide nitrogen to aldehyde to afford intramolecular cyclization.42 In the SXCXXXC

library, tryptophan and tyrosine in the first (but not the third) position lead to a modest increase in conversion.

The synthesis and testing of several sequences confirmed that the placement of Tyr in the first position led

to a detectable increase in rate (Table S1 and Fig S22). Peptides with Tyr in the third position exhibited

average rates similar to the average rate of peptides from the SX4 library. Modification of SXCXXXC library

was not useful for identification of new sequences with low or high reactivity; still, a uniform reactivity of

such library is as advantage. Such library can be considered as an example of library in which majority of

the members exhibit the same reactivity towards one chemical modification.

Machine Learning using DC value to predict unobserved sequences.

Analysis by deep-sequencing made it possible to observe 60,432 of the 160,000 possible tetramer

amino acid sequences. Measurement of the rates of the remaining 99,568 sequences was not possible for

multiple reasons, such as low or unreliable count of these peptide sequences in the naïve population. We

sought to extrapolate the reactivity of the “missing” 99,568 peptides using a machine learning model trained

using the data from the “observed” 60,432 sequences. As a proof-of-principle, we trained a classification

model. The experimentally observed data was split into ‘HIGH’ or ‘LOW’ deep conversion subgroups where

HIGH and LOW are defined by the highest and lowest 5% of the DC values obtained by experimental

Page 13: Learning Structure Activity Relationship (SAR) of the

12

observation (Fig S23). As sequences with single Pro in the first position exhibit side-reaction, these

sequences were not used for training. Using XGBoost44 (Gradient Boosted Decision Trees) (Fig S24), we

trained two binary classifiers: one to identify sequences belonging to the HIGH subgroup and the other to

identify sequences belonging to the LOW subgroup (Fig S25, all the prediction probabilities are available in

Predictions-160-Probabilities.csv). The input features included the quantitative chemical properties of each

of the amino acids in the tetramer sequences, namely z-scale45 descriptors (3×4 AA position = 12 features),

VHSE46 (Vectors of Hydrophobic, Steric, and Electronic properties) descriptors (8×4 AA position = 32

features), as well as sequence patterns based on permutations of 20 amino acids among 4 positions within

the tetramer sequences (2,396 features) to yield a total of 2,440 features (more details about features can be

found in ESI section titled “Feature Engineering”). Evaluation of respective classifiers showed area under

the receiver operating characteristic curve (ROC AUC) scores of 81.2 ± 0.4 and 73.7 ± 0.8 for HIGH and

LOW, using a 5-fold stratified cross-validation. It also showed F1-scores of 33.7±0.9 and 19.0±0.9 for the

respective classifiers (Table S2). We deployed the learned models into a publicly available web app

(http://44.226.164.95/, Fig S26) which allows users to compute the probability of sequences belonging to

both the HIGH and LOW class. Using this model, we predicted the class labels of 99,568 amino acid

sequences not observed in the sequencing. Figure 4 compares two 20×20 plots, one shows the experimentally

measured labels for 60,432 sequences (Fig 4a) and the other showing predicted labels for peptides never

observed in deep sequencing (Fig 4b). The learned models (Fig 4b) suggest definitive patterns of reactivity

that were not clearly observed in the scarce experimental data (Fig 4a). Experimental validation of this

prediction would require synthesis of a large number of representative peptide sequences predicted by the

model to be in a HIGH or LOW class of reactivity. Such exploration is beyond the scope of the current

manuscript and further details will be described in our future work.

Page 14: Learning Structure Activity Relationship (SAR) of the

13

Conclusion

In conclusion, a genetically encoded library of >105 phage-displayed peptide aldehydes provided

a rich dataset guiding SAR studies for the Wittig reaction in water with a stabilized ylide (YEB).

The study highlighted the cooperative effect of the first and second N-terminal amino acids on the

rate of the reaction. Low reactivity of PP motif highlights the role of backbone amides in stabilization

of TS of the Wittig reaction. DFT computations corroborated the experimental observations and

suggested intramolecular hydrogen bonds with backbone amides as important stabilization factor the

transition state of Wittig reaction. The 50-fold dynamic range of reaction rate suggested a 3 kcal/mol

contribution to Wittig TS from the peptide and a large fraction of this contribution emanates from

the backbone amides.

Coupling of deep sequencing methodology to investigation of chemical reactivity and SAR studies with

only minor changes can be applied to investigate other M13 virion compatible reactions already known to

be compatible with the M13 virion (thiol SN2,47-50 thiol SNAr,51 oxime ligation,32, 52 copper catalyzed azide

alkyne cycloaddition,53 copper-free azide alkyne cycloaddition,54 Diels Alder cycloaddition31). These

investigations, in principle would follow the same experimental design: (i) couple the reactive group

to biotin; (ii) terminate the reaction at 1-5% conversion and (iii) sequence the reactive (biotinylated)

population. This approach can also Illuminate SAR of other reactions that have potential to be applied on

phage such as multicomponent reactions compatible with aqueous conditions55 or chemistry that has already

been explored on proteins, unprotected peptides and other display platforms34, 40, 56-58 As chemical

modification of DNA and RNA-displayed libraries is also possible, the SAR of reactions that modify these

libraries could be investigated in a similar fashion. In principle, peptide libraries combined with calibrated

mass spectrometry analysis,59 could permit analogous SAR analysis however the dynamic range of the mass-

spectrometry instrument should be sufficient to quantify peptides both enriched and depleted in the reaction.

Page 15: Learning Structure Activity Relationship (SAR) of the

14

The SAR data sets produced by deep-sequencing are sufficiently large for training of machine learning

models. In a proof-of-principle example, we trained a classification model and attained a respectable

accuracy as determined by cross-validation in a held-out dataset. We envision several important next steps

in the application of ML approaches to such datasets: (i) replacing peptide-centric descriptors with all-atomic

molecular descriptors will make it possible to extrapolate the reactivity of non-peptide structure from a

peptide datasets; (ii) training of the quantitative regression models in place of classification will make it

possible to prediction structure with higher or lower reactivity than any of the experimentally observed

structures. Most importantly, testing of ML models has to extend beyond simple evaluation of accuracy in a

cross-validation. Additional experimental validation will be required to test whether de novo predictions

made by the ML models are corroborated by the experimental measurements.

Contributions

V.T. performed synthesis of chemical and biochemical reagents, modification, selection, and analysis of the

phage display libraries. K.Y performed synthesis, modification of CHO-AA and CHO-Sarcosine-Sarcosine

peptides. S.M. and S.V.K. designed and implement machine learning algorithm to produce the machine

learned model. S.V.K developed the prediction web app. K.Y. and K.A-D., performed the DFT computation.

R.D., V.T., K.Y. and S.M. wrote the manuscript, edited the final manuscript and contributed intellectual and

strategic input. All authors approved the final manuscript.

Conflicts of interest

There are no conflicts to declare.

Acknowledgement

The authors acknowledge funding from NSERC (RGPIN-2016-402511 to R.D.), Amii (Alberta Machine

Intelligence Institute (to R.G. and S.V.K.) and NSERC Accelerator Supplement (to R.D.). Infrastructure

Page 16: Learning Structure Activity Relationship (SAR) of the

15

support was provided by CFI New Leader Opportunity (to R.D). We thank Dr. Ryan T. McKay at the

University of Alberta NMR spectrometry facility, Dr. Randy Whittal and Béla Reiz at the University of

Alberta mass spectrometry facility.

Page 17: Learning Structure Activity Relationship (SAR) of the

16

Fig 1. (a) Wittig reaction on phage libraries. (b) How to read a 20×20 plot.41 (c) 20×20 plots displaying

library-wide DC values from peptide substrates and the Wittig product. (d) After reacting 10 min, the

biotinylated Wittig products were captured by streptavidin beads and ready for sequencing. (e) Pull-down

titering data showing higher capture of phage expressing peptide (blue bars) vs wild type phage (white bars)

and higher capture in 1% biotinylated set (X) than in controls (Y and Z). (f) 20×20 plots for SX4 selections.

Page 18: Learning Structure Activity Relationship (SAR) of the

17

Fig 2. (a) Deep Conversion (DC), rate constants measured by HPLC and conversions at 600 seconds

(Conv600). Sequences marked “N/A” were not detected in the naïve library but were predicted by

bioinformatics analysis of the sequencing data. (b) Plot of DC vs. Conv600. Calculation of the conversion

from experimentally measured rate constants. (c) Reaction mechanism of the Wittig reaction in peptides.

Page 19: Learning Structure Activity Relationship (SAR) of the

18

Fig 3. (a) Geometries of trans Ala-Ala and Sarcosine-Sarcosine rate determined transition states (TS1) in

gas ((B3LYP 6-31G(d)). First and second amides formed intramolecular hydrogen bonds, distances are

labelled in red. Distances in OPA are labelled as black. (b) Energy gap of E Ala-Ala and Sarcosine-

Sarcosine transition states, TS1 are selected as rate-limit step. (c) Model peptides rate constants measured

by HPLC.

Page 20: Learning Structure Activity Relationship (SAR) of the

19

Fig 4. (a) 20×20 Plots showing original observed data with class labels for HIGH (top 5%), MEDIUM

(middle 90%) and LOW (Bottom 5%) as 1,0,-1 respectively. Sequences not observed experimentally are

showed in white. (b) 20×20 plot showing the predicted labels of labels of 99,568 non-observed sequences

with 1, 0, -1 corresponding to HIGH, MEDIUM, and LOW, respectively(Fig. S27). 60,432 sequences from

original observed data were omitted from this plot (showed as white pixels). Red points denote peptides with

high probability of ‘fast’ label, while in blue denotes peptides with high probability of ‘slow’ label. Data for

PXXX sequences but not PPXX were excluded from training and predictions on purpose due to side reaction

in these sequences.

Page 21: Learning Structure Activity Relationship (SAR) of the

20

References

1. M. T. Reetz, Angew. Chem., Int. Ed., 2002, 41, 1335-1338.

2. L. P. Hammett, Chem. Rev., 1935, 17, 125-136.

3. T. Wang, M.-B. Wu, J.-P. Lin and L.-R. Yang, Expert Opin. Drug Discovery, 2015, 10, 1283-1300.

4. C. B. Santiago, J.-Y. Guo and M. S. Sigman, Chem. Sci., 2018, 9, 2398-2412.

5. P. S. Kutchukian, J. F. Dropinski, K. D. Dykstra, B. Li, D. A. DiRocco, E. C. Streckfuss, L.-C. Campeau, T. Cernak, P.

Vachal, I. W. Davies, S. W. Krska and S. D. Dreher, Chem. Sci., 2016, 7, 2604-2613.

6. A. R. Katritzky, V. S. Lobanov and M. Karelson, Chem. Soc. Rev., 1995, 24, 279-287.

7. K. C. Harper, E. N. Bess and M. S. Sigman, Nat. Chem., 2012, 4, 366-374.

8. B. A. Bunin, M. J. Plunkett and J. A. Ellman, Proc. Natl. Acad. Sci. U. S. A., 1994, 91, 4708-4712.

9. K. D. Collins, T. Gensch and F. Glorius, Nat. Chem., 2014, 6, 859-871.

10. J. A. Selekman, J. Qiu, K. Tran, J. Stevens, V. Rosso, E. Simmons, Y. Xiao and J. Janey, Annu. Rev. Chem. Biomol.

Eng., 2017, 8, 525-547.

11. D. D. Devore and R. M. Jenkins, Comments Inorg. Chem., 2014, 34, 17-41.

12. G. Gasparini, M. Dal Molin and L. J. Prins, Eur. J. Org. Chem., 2010, 2010, 2429-2440.

13. M. T. Reetz, Angew. Chem., Int. Ed., 2001, 40, 284-310.

14. M. T. Reetz, Angew. Chem., Int. Ed., 2008, 47, 2556-2588.

15. D. W. Robbins and J. F. Hartwig, Science, 2011, 333, 1423-1427.

16. Y. Maeda, O. V. Makhlynets, H. Matsui and I. V. Korendovych, Annu. Rev. Biomed. Eng., 2016, 18, 311-328.

17. X. Gao and H. B. Kagan, Chirality, 1998, 10, 120-124.

18. C. Gennari, S. Ceccarelli, U. Piarulli, C. A. G. N. Montalbetti and R. F. W. Jackson, J. Org. Chem., 1998, 63, 5312-

5313.

19. H. Kim, G. Gerosa, J. Aronow, P. Kasaplar, J. Ouyang, J. B. Lingnau, P. Guerry, C. Farès and B. List, Nat. Commun.,

2019, 10, 770.

20. M. R. an der Heiden, H. Plenio, S. Immel, E. Burello, G. Rothenberg and H. C. J. Hoefsloot, Chem. - Eur. J., 2008, 14,

2857-2866.

21. T. Satyanarayana, S. Abraham and H. B. Kagan, Angew. Chem., Int. Ed., 2009, 48, 456-494.

22. M. W. Kanan, M. M. Rozenman, K. Sakurai, T. M. Snyder and D. R. Liu, Nature, 2004, 431, 545-549.

23. M. M. Rozenman, M. W. Kanan and D. R. Liu, J. Am. Chem. Soc., 2007, 129, 14933-14938.

24. Y. Chen, A. S. Kamlet, J. B. Steinman and D. R. Liu, Nat. Chem., 2011, 3, 146-153.

25. K. D. Hook, J. T. Chambers and R. Hili, Chem. Sci., 2017, 8, 7072-7076.

26. A. I. Chan, L. M. McGregor and D. R. Liu, Curr. Opin. Chem. Biol., 2015, 26, 55-61.

27. F. Tanaka, R. Fuller, L. Asawapornmongkol, A. Warsinke and S. Gobuty, Bioconjug. Chem., 2007, 18, 1318-1324.

28. G. M. Eldridge and G. A. Weiss, Bioconjug. Chem., 2011, 22, 2143-2153.

29. R. K. V. Lim, N. Li, C. P. Ramil and Q. Lin, ACS Chem. Biol., 2014, 9, 2139-2148.

30. B. N. Tse, T. M. Snyder, Y. Shen and D. R. Liu, J. Am. Chem. Soc., 2008, 130, 15611-15626.

31. V. Triana and R. Derda, Org. Biomol. Chem., 2017, 15, 7869-7877.

32. S. Ng, M. R. Jafari, W. L. Matochko and R. Derda, ACS Chem. Biol., 2012, 7, 1482-1487.

33. S. Ng, K. F. Tjhung, B. M. Paschal, C. J. Noren and R. Derda, in Peptide Libraries: Methods and Protocols, ed. R.

Derda, Springer New York, New York, NY, 2015, pp. 155-172.

34. S. Ng, M. R. Jafari and R. Derda, ACS Chem. Biol., 2012, 7, 123-138.

35. L. Wang, A. Brock, B. Herberich and P. G. Schultz, Science, 2001, 292, 498-500.

36. I. S. Carrico, B. L. Carlson and C. R. Bertozzi, Nat. Chem. Biol., 2007, 3, 321-322.

37. R. Robiette, J. Richardson, V. K. Aggarwal and J. N. Harvey, J. Am. Chem. Soc., 2005, 127, 13468-13469.

38. R. Robiette, J. Richardson, V. K. Aggarwal and J. N. Harvey, J. Am. Chem. Soc., 2006, 128, 2394-2409.

39. P. A. Byrne and D. G. Gilheany, Chem. Soc. Rev., 2013, 42, 6670-6696.

40. C. P. Ramil and Q. Lin, Chem. Commun., 2013, 49, 11007-11022.

41. B. He, K. F. Tjhung, N. J. Bennett, Y. Chou, A. Rau, J. Huang and R. Derda, Sci. Rep., 2018, 8, 1214.

42. K. Rose, J. Chen, M. Dragovic, W. Zeng, D. Jeannerat, P. Kamalaprija and U. Burger, Bioconjug. Chem., 1999, 10,

1038-1043.

43. M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, G. Scalmani, V. Barone, B.

Mennucci, G. A. Petersson, H. Nakatsuji, M. Caricato, X. Li, H. P. Hratchian, A. F. Izmaylov, J. Bloino, G. Zheng, J.

L. Sonnenberg, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao,

H. Nakai, T. Vreven, J. A. M. Jr., J. E. Peralta, F. Ogliaro, M. Bearpark, J. J. Heyd, E. Brothers, K. N. Kudin, V. N.

Page 22: Learning Structure Activity Relationship (SAR) of the

21

Staroverov, R. Kobayashi, J. Normand, K. Raghavachari, A. Rendell, J. C. Burant, S. S. Iyengar, J. Tomasi, M. Cossi,

N. Rega, J. M. Millam, M. Klene, J. E. Knox, J. B. Cross, V. Bakken, C. Adamo, J. Jaramillo, R. Gomperts, R. E.

Stratmann, O. Yazyev, A. J. Austin, R. Cammi, C. Pomelli, J. W. Ochterski, R. L. Martin, K. Morokuma, V. G.

Zakrzewski, G. A. Voth, P. Salvador, J. J. Dannenberg, S. Dapprich, A. D. Daniels, Ö. Farkas, J. B. Foresman, J. V.

Ortiz, J. Cioslowski and D. J. Fox, Gaussian 09 (Revision A.02) Gaussian, Inc., Wallingford CT, 2016.

44. T. Chen and C. Guestrin. Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and

data mining, 2016, pp. 785-794.

45. C. Ge, E. Spånning, E. Glaser and Å. Wieslander, Mol. Plant, 2014, 7, 121-136.

46. J. Xie, Z. Xu, S. Zhou, X. Pan, S. Cai, L. Yang and H. Mei, PLOS ONE, 2013, 8, e74506.

47. S. Li and R. W. Roberts, Chem Biol, 2003, 10, 233-239.

48. M. R. Jafari, L. Deng, P. I. Kitov, S. Ng, W. L. Matochko, K. F. Tjhung, A. Zeberoff, A. Elias, J. S. Klassen and R.

Derda, ACS Chem. Biol., 2014, 9, 443-450.

49. S. Chen, J. Morales-Sanfrutos, A. Angelini, B. Cutting and C. Heinis, ChemBioChem, 2012, 13, 1032-1038.

50. S. Ng and R. Derda, Org. Biomol. Chem., 2016, 14, 5539-5545.

51. S. Kalhor-Monfared, M. R. Jafari, J. T. Patterson, P. I. Kitov, J. J. Dwyer, J. M. Nuss and R. Derda, Chem. Sci., 2016,

7, 3785-3790.

52. Z. M. Carrico, M. E. Farkas, Y. Zhou, S. C. Hsiao, J. D. Marks, H. Chokhawala, D. S. Clark and M. B. Francis, ACS

Nano, 2012, 6, 6675-6680.

53. F. Tian, M.-L. Tsao and P. G. Schultz, J. Am. Chem. Soc., 2004, 126, 15962-15963.

54. T. Urquhart, E. Daub and J. F. Honek, Bioconjug. Chem., 2016, 27, 2276-2280.

55. M. C. Pirrung and K. D. Sarma, J. Am. Chem. Soc., 2004, 126, 444-445.

56. R. J. Spears and M. A. Fascione, Org. Biomol. Chem., 2016, 14, 7622-7638.

57. R. A. Goodnow, C. E. Dumelin and A. D. Keefe, Nat. Rev. Drug Discovery, 2017, 16, 131-147.

58. J. R. Frost, J. M. Smith and R. Fasan, Curr. Opin. Struct. Biol., 2013, 23, 571-580.

59. C. Zhang, M. Welborn, T. Zhu, N. J. Yang, M. S. Santos, T. Van Voorhis and B. L. Pentelute, Nat. Chem., 2016, 8,

120-128.

Page 24: Learning Structure Activity Relationship (SAR) of the

1

Supplementary Information for:

Learning Structure Activity Relationships (GE-SAR) of the Wittig Reaction from Genetically-

Encoded substrates

Kejia Yana, Vivian Trianaa, Sunil Vasu Kalmadyb, Kwami Aku-Domingueza, Sharyar Memonc,Alex

Browna, Russ Greinerb,d, Ratmir Derdaa*

a Department of Chemistry, University of Alberta, Edmonton, AB T6G 2G2, Canada.

b Department of Computer Science, University of Alberta, Alberta, AB T6G 2E8, Canada.

c Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, T6G

1H9, Canada.

d Alberta Machine Intelligence institute, Alberta, AB T5J 3B1, Canada.

Corresponding author email: [email protected]

Page 25: Learning Structure Activity Relationship (SAR) of the

2

Table of Contents

Abbreviations ........................................................................................................................6

1. Synthetic methods ..........................................................................................................7

1.1 General chemistry information .....................................................................................7

1.2 Oxidation of N-terminal serine peptides .......................................................................7

1.3 Kinetics of selected peptides..........................................................................................8

1.4 Hydrazone ligation experiments ....................................................................................8

1.5 In situ E/Z selectivity determination ..............................................................................9

2. Biochemistry methods.......................................................................................................9

2.1 General biochemistry information ................................................................................9

2.2 Phage functionalization .................................................................................................9

2.2.1 SXC3C and SX4 1% biotinylation and purification .................................................9

2.2.2 SXC3C and SX4 0.1% biotinylation and purification ............................................10

2.2.3 SXC3C and SX4 oxidations and purification .........................................................10

2.3 Phage pull-down experiments .....................................................................................11

2.3.1 1% and 0.1% SX4 and SXC3C selections .............................................................11

2.4 Direct PCR of phage ...................................................................................................12

2.4.1 Direct PCR of NaOH elutions ..............................................................................12

2.4.2 Direct PCR of inputs .............................................................................................12

2.4.3 Gel electrophoresis characterization of PCR products ........................................12

3. Data analysis ....................................................................................................................13

3.1General data processing methods ................................................................................13

3.2 Processing of illumina data .........................................................................................13

4. DFT computation ............................................................................................................14

5. Machine Learning ...........................................................................................................14

Figure S3. Kinetic traces for 1% SX4 “Deep Conversion” ≥ 3.5. ........................................21

Figure S4. Kinetic traces for 1% SX4 and 0.15 ≤ “Deep Conversion” ≤ 3.5. ......................22

Figure S5. Kinetic traces for 1% SX4 and 0.15 ≥ “Deep Conversion” ................................23

Figure S6. Kinetics for SWWXX and SPXXX sequences with no Illumina counts. ..........24

Figure S7. Proline-Alanine scan ..........................................................................................25

Figure S8. Example of SPXXX LCMS traces for determination of rate. ............................26

Page 26: Learning Structure Activity Relationship (SAR) of the

3

Figure S9. Gibbs free energy barrier of model peptides ......................................................27

Figure S10. Geometries of trans (E) TS1 model peptides. ..................................................28

Figure S11. Geometry of cis (Z) TS1 model peptides .........................................................29

Figure S12. Kinetics of hydrazone ligation for HCO-WWRR at [H2SO4] = 65 mM. .........30

Figure S13. Kinetics of hydrazone ligation for HCO-PPAA at [H2SO4] = 65 mM. ............31

Figure S14. Kinetics of hydrazone ligation for HCO-WWRR at pH 5 ...............................32

Figure S15. Kinetics of hydrazone ligation for HCO-PPAA at pH 5 ..................................33

Figure S16. E/Z selectivity for HCO-WWRR .....................................................................34

Figure S17. E/Z selectivity for HCO-HWFP .......................................................................35

Figure S18. E/Z selectivity for HCO-PPAA ........................................................................36

Figure S21. 20×20 plot of SXCXXXC library ....................................................................39

Table S1. Comparison between DC and rate constants........................................................40

Figure S22. Kinetic traces for sequences of 1% SXC3C selection ......................................41

Figure S23. Distribution of log DC value (DC = “deep conversion”). ................................42

Histogram of the top 5% and bottom 5% of the sequences ..................................................42

Figure S24. A single decision tree for illustrative purposes. ...............................................43

Figure S25. Number of sequence patterns vs threshold number of sequences ....................44

Figure S26. Screenshot of the DC prediction app. ...............................................................45

Table S2. Performance metrics of the two classifiers for a 5-fold cross validation .............47

HPLC purity and LCMS traces of synthesized peptides.................................................48

Figure S28. Summary for SWWRR synthesis .....................................................................48

Figure S29. Summary for CHO-WWRR synthesis..............................................................49

Figure S30. Summary for SWWGP synthesis .....................................................................50

Figure S31. Summary for CHO-WWGP synthesis ..............................................................51

Figure S32. Summary for SQWLH synthesis ......................................................................52

Figure S33. Summary for CHO-QWLH synthesis ..............................................................53

Figure S34. Summary for SWWGL synthesis .....................................................................54

Figure S35. Summary for CHO-WWGL synthesis..............................................................55

Figure S36. Summary for SWWPQ synthesis .....................................................................56

Figure S37. Summary for CHO-WWPQ synthesis ..............................................................57

Page 27: Learning Structure Activity Relationship (SAR) of the

4

Figure S38. Summary for SWLPR synthesis .......................................................................58

Figure S39. Summary for CHO-WLPR synthesis ...............................................................59

Figure S40. Summary for SLWYR synthesis ......................................................................60

Figure S41. Summary for CHO-LWYR synthesis ...............................................................61

Figure S42. Summary for SIWVR synthesis .......................................................................62

Figure S43. Summary for CHO-IWVR synthesis ................................................................63

Figure S44. Summary for SHWFP synthesis .......................................................................64

Figure S45. Summary for CHO-HWFP synthesis ...............................................................65

Figure S46. Summary for SALRV synthesis .......................................................................66

Figure S47. Summary for CHO-ALRV synthesis................................................................67

Figure S48. Summary for SAPAA synthesis .......................................................................68

Figure S49. Summary for CHO-APAA synthesis................................................................69

Figure S50. Summary for SPPAA synthesis ........................................................................70

Figure S51. Summary for CHO-PPAA synthesis ................................................................71

Figure S52. Summary for SPPLA synthesis ........................................................................72

Figure S53. Summary for CHO-PPLA synthesis .................................................................73

Figure S54. Summary for SPPPA synthesis ........................................................................74

Figure S55. Summary for CHO-PPPA synthesis .................................................................75

Figure S56. Summary for SPRLP synthesis ........................................................................76

Figure S57. Summary for CHO-PRLP synthesis .................................................................77

Figure S58. Summary for SPQPL synthesis ........................................................................78

Figure S59. Summary for CHO-PQPL synthesis .................................................................79

Figure S60. Summary for SPPPL synthesis .........................................................................80

Figure S61. Summary for CHO-PPPL synthesis .................................................................81

Figure S62. Summary for SPPPP synthesis .........................................................................82

Figure S63. Summary for CHO-PPPP synthesis..................................................................83

Figure S64. Summary for SPYPA synthesis ........................................................................84

Figure S65. Summary for CHO-PYPA synthesis ................................................................85

Figure S66. Summary for SPAAA synthesis .......................................................................86

Figure S67. Summary for CHO-PAAA synthesis................................................................87

Page 28: Learning Structure Activity Relationship (SAR) of the

5

Supporting information references ...................................................................................88

Page 29: Learning Structure Activity Relationship (SAR) of the

6

Abbreviations

dNTP Doxyucleoside Triphosphate

DNA Deoxyribonucleic acid

EDTA Ethylenediaminetetraacetic acid

EtOH Ethanol

ESI Electrospray Ionization

HILIC Hydrophilic Interaction Chromatography

HRMS High-resolution Mass Spectrometry

MOPS 3-(N-morpholino)propanesulfonic acid

NMR Nuclear Magnetic Resonance

PBS Phosphate buffered saline

PCR Polymerase chain reaction

PEG Polyethylene glycol

RNAse Ribonuclease

UPLC-MS Ultra Performance Liquid Chromatography- Mass Spectrometry

Page 30: Learning Structure Activity Relationship (SAR) of the

7

1. Synthetic methods

1.1 General chemistry information

Acetate buffer 200 mM was prepared by mixing sodium acetate (16.4 g), NaCl (80 g) and adding 800

mL of milliQ water. The pH was adjusted to 5.0 with glacial acetic acid and the volume was completed

to 1L with milliQ water. MOPS buffer for kinetic measurements was prepared as a solution of 200 mM

MOPS and adjusted to pH 6.5 with NaOH (1M). All solutions used for phage work were prepared with

milliQ water. Peptides were synthesized using the Prelude X peptide synthesizer (Gyros Protein

Technologies) using 200 mg Rink Amide AM resin. Ylide Ester Biotin (YEB) was synthesized as already

reported.1 HRMS (ESI) spectra were recorded on Agilent 6220 oaTOF mass spectrometer using either

positive or negative ionization mode. Characterization of reaction crude was performed with UPLC-MS

using a C18 column (Phenomenex Kinetex 1.7 μm EVO C18, 2.1×50 mm) running with a gradient of

water/acetonitrile with 0.1% formic acid from 98/2 at 0 min to 40/60 at 5 min under a flow rate of 0.5

mL/min. HPLC kinetics were performed in an Agilent 1100 Series using a Thermo Scientific C18 column

(Hypersil GOLD, 3 μm, 50×2.1mm). 1H and 13C NMR spectra were acquired on a 600 MHz four channel

Agilent VNMRS spectrometer, equipped with a z gradient HCN probe and using VNMRJ 4.2A as the

acquisition software. Suppression of the H2O signal was performed using either presaturation or

excitation sculpting.2

1.2 Oxidation of N-terminal serine peptides

Unless specified differently, ~20 mg of peptide is diluted into 1 mL of 1× PBS and required amount of

acetonitrile (when needed). The peptide is incubated in ice for 5 minutes and solubility is checked, in

case of precipitation, more acetonitrile is added. 2.5 equivalents of NaIO4 are added as a saturated

solution in water. The mixture is incubated in ice for 20 minutes and injected into C18 or HILIC column

for quenching and purification. The acetonitrile is evaporated via speedvac and the oxidized peptide is

Page 31: Learning Structure Activity Relationship (SAR) of the

8

dried using lyophilization. When the peptide contains a methionine, a 2 mM solution of peptide (in 1×

PBS) is kept in ice while 1.5 equivalents of NaOH are dissolved in water. The NaOH is added and after

10 seconds, the reaction is quenched with 12 equivalents of glutathione. The mixture is incubated in ice

for ten minutes and then injected on C18 or HILIC column for purification. Acetonitrile is removed with

speedvac and the oxidized peptides are lyophilized.

1.3 Kinetics of selected peptides

Solutions of oxidized peptides (7.7 mM) were prepared in water. A stock solution of YEB (38.7 mM)

was prepared in water. An aliquot of peptide solution (10.4 μL, final concentration 0.8 mM) is dissolved

in 79.3 μL of MOPS buffer (200 mM, pH 6.5). YEB solution (10.3 μL, final concentration 4 mM) is

added. An aliquot of the reaction (10 μL) is taken out and quenched with HCl (1μL, 1 M). Quenching is

performed after 3 min, 10 min, 30 min, 60 min, 120 min and 180 min reaction time. An aliquot (3 μL) of

each quenched solution is injected into UHPLC-MS to obtain traces. Area percentages are extracted and

the pseudo first order rate constant and kinetic traces are obtained by using the MATLAB script attached

below. It is important to mention that several oxidized peptides required certain amount of co-solvent

(acetonitrile) to improve solubility. At all measurements, the final percentage of acetonitrile was ≤ 2%,

since rate and stereoselectivity of the Wittig reaction is highly dependent on the type of solvent.3 Also,

the volumes of aliquots were kept the same at all times to ensure all peptides were measured at the same

pH. Self-catalysis and cross-catalysis tests were performed by same protocol.

1.4 Hydrazone ligation experiments

0.1 mg of 2,4-dinitrophenylhydrazone were dissolved in 1.1mL of concentrated H2SO4. This solution

was slowly added to a mixture of 5.7 mL EtOH (95%)/1.6 mL H2O to generate a 61 mM solution of

hydrazine. 10.4 μL of HCO-peptide solution (7.7 mM) were mixed with 73.5 μL of H2O, 10 μL of EtOH

Page 32: Learning Structure Activity Relationship (SAR) of the

9

were added and finally, 6.5 μL of hydrazine solution were added. An aliquot of the reaction (2 μL) was

taken out and quenched by dilution with 198 μL of water. Quenching was performed at the times

specified in Fig S16 and Fig S17. An aliquot (25 μL) of each quenched solution was injected into

UHPLC-MS to obtain traces. Area percentages were extracted and the pseudo first order rate constant

and kinetic traces were obtained by using the MATLAB script.

1.5 In situ E/Z selectivity determination

Aliquots (usually 20-50 μL, 1.00 eq, final concentration 0.8 mM) of stock solutions of HCO-peptides

(usually 7.7 mM) were mixed with PB 200 mM at pH 6.5 in an NMR tube. 55 μL of D2O were added. A

solution of YEB (51.7 µL, 38.7 mM, 5.00 eq, final concentration 4.00 mM) was added. The final volume

was 550 μL (D2O 10%). Solvent suppression was performed and 1H NMR spectra were collected at the

specified time intervals.

2. Biochemistry methods

2.1 General biochemistry information

The SX4 library was bulk-amplified from a stock library synthesized by Katrina Tjhung4. The SXC3C

library was synthesized by Nicholas Bennet5 as well as the silently encoded SWYD neon green clones

for internal control.

2.2 Phage functionalization

2.2.1 SXC3C and SX4 1% biotinylation and purification

1012 pfu of library were taken to a final volume of 100 μL with MOPS 200 mM. To the solution, 1 μL of

freshly prepared NaIO4 (6 mM in water) was added and the mixture was incubated in ice for 8 minutes.

Then, 1 μL of methionine (50 mM in water) was added and the mixture was incubated for 20 minutes at

Page 33: Learning Structure Activity Relationship (SAR) of the

10

room temperature. 100 μL of YEB (0.8 mM in MOPS 200 mM) were added and the reaction was

incubated for 10 minutes at room temperature. After this, 800 μL of acetate buffer (200 mM) and 200 μL

of PEG NaCl were added and the mixture was kept at 4 °C for at least 12 hours. After this, the sample

was centrifuged at 21100 xg for 30 minutes and the supernatant was discarded. The sample was further

centrifuged for 5 minutes and remaining supernatant was removed. The phage pellet was re-suspended

in 200 μL of acetate buffer (10 mM) and incubated at room temperature for 15 minutes to allow phage

to completely dissolve. After this, remaining solid debris was removed by centrifugation at 21100 xg for

5 minutes. The supernatant was transferred to a clean epi-tube and kept at 4 °C for use in pull-down

screenings.

2.2.2 SXC3C and SX4 0.1% biotinylation and purification

1012 pfu of library were taken to a final volume of 100 μL with MOPS 200 mM. To the solution, 1 μL of

freshly prepared NaIO4 (6 mM in water) was added and the mixture was incubated in ice for 8 minutes.

Then, 1 μL of methionine (50 mM in water) was added and the mixture was incubated for 20 minutes.

100 μL of YEB (0.8 mM in MOPS 200 mM) were added and the reaction was incubated for 10 minutes

at room temperature. After this, PEG purification was performed as indicated in section 2.2.1 and the

phage was stored at 4 °C, ready to be used in screenings.

2.2.3 SXC3C and SX4 oxidations and purification

1012 pfu of library were taken to a final volume of 100 μL with MOPS 200 mM. To the solution, 1 μL of

freshly prepared NaIO4 (6 mM in water) was added and the mixture was incubated in ice for 8 minutes.

Then, 1 μL of methionine (50 mM in water) was added and the mixture was incubated for 20 minutes.

After this, PEG purification was performed as indicated in section 2.2.1 and the phage was stored at 4

°C, ready to be used in screenings.

Page 34: Learning Structure Activity Relationship (SAR) of the

11

2.3 Phage pull-down experiments

2.3.1 1% and 0.1% SX4 and SXC3C selections

Set X: ~109 pfu of biotinylated, purified library were mixed with ~105 pfu of internal control mixture

neon green phage (in the case of 1% selections) and the volume was taken to 1 mL with BSA 2% in

acetate buffer in a 1.5 mL microcentrifuge tube.

Set Y: ~109 pfu of biotinylated, purified library were mixed with ~105 pfu of internal control mixture (in

the case of 1% selections) and the volume was taken to 1 mL with BSA 2% + 0.9 mM biotin in acetate

buffer in a 1.5 mL microcentrifuge tube.

Set Z: ~109 pfu of oxidized, purified library were mixed with ~105 pfu of internal control mixture neon

green phage (in the case of 1% selections) and the volume was taken to 1 mL with BSA 2% in acetate

buffer in a 1.5 mL microcentrifuge tube.

All mixtures were prepared in triplicates. 100 μL of streptavidin magnetic beads per each sample (9 in

total) were washed three times with 1 mL of acetate buffer (10 mM) and suspended in 1 mL of BSA 2%

in acetate buffer.

All mixtures and bead suspensions were incubated for 30 minutes at room temperature in a Labquake

Tube Rotator. After this, 10 μL aliquots of each phage mixture (9 in total) were taken for tittering and

PCR of inputs. The bead suspensions (9 in total) were placed in a magnet and the BSA supernatants were

discarded. Each phage mixture (990 μL each one) was added to one corresponding tube of beads and

incubated at room temperature in a Labquake Tube Rotator for 1 hour. The unbound phage was washed

from the beads with Tween 0.1% in acetate buffer (10 washes of 1 mL each one) by using a Kingfisher

magnetic bead washer (ThermoFisher Scientific).

Page 35: Learning Structure Activity Relationship (SAR) of the

12

The washed beads were suspended in acetate buffer (10 mM), placed in a magnet and the supernatant

was discarded (this step was necessary to remove any remaining Tween that might interfere with PCR).

The rinsed beads were suspended in 20 μL of NaOH solution (10 mM) and incubated at room temperature

for one hour in a Labquake Tube Rotator. After that, suspensions were placed in a magnet and the

supernatant was transferred to a 0.6 mL microcentrifuge tube containing 20 μL of 1X Phusion® high

fidelity (HF) buffer to give a final volume of 40 μL of phage elution. 2 μL are taken for tittering of this

elution and the rest is stored 4 °C for use in direct PCR.

2.4 Direct PCR of phage

2.4.1 Direct PCR of NaOH elutions

Each elution consists of 40 μL of phage as specified in section 2.3.1. This elution was split in two aliquots

of 20 μL and each one was mixed with 1 μL of dNTPs (10 mM each), 1 μL of Phusion® High Fidelity

Polymerase, 2.5 μL of reverse primer (10 mM), 2.5 μL of forward primer (10 mM) and 23 μL of RNAase

free water. Samples were PCRed like this: first, the sample was held at 95 °C for 30 seconds and then 25

cycles of 95 °C (10 seconds), 60.5 °C (15 seconds) and 72 °C (30 seconds). Finally, the sample was

maintained at 72 °C for 5 minutes and held at 4 °C for storage. The two PCR products for each elution

were mixed before gel electrophoresis characterization.

2.4.2 Direct PCR of inputs

5 μL of the input aliquots (9 in total, from section 2.3.1) were mixed with 10 μL of Phusion® High

Fidelity buffer, 1 μL of dNTPs (10 mM each), 1 μL of Phusion® High Fidelity Polymerase, 2.5 μL of

reverse primer (10 mM), 2.5 μL of forward primer (10 mM) and 28 μL of RNAase free water. The

temperature cycles used to PCR these samples were the same as the ones described in section 2.4.1.

2.4.3 Gel electrophoresis characterization of PCR products

Page 36: Learning Structure Activity Relationship (SAR) of the

13

10 μL of the PCR products obtained in sections 2.4.1 and 2.4.2 were mixed with 2 μL of gel dye and

loaded into a 2% agarose gel in Tris Acetate EDTA buffer (TAE) containing SYBR-safe dye. 44 ng of

DNA low molecular weight ladder (New England Biolabs) were also loaded in one line for quantification

purposes. The gel was developed in TAE at 500 mV and 100 A for 40 minutes. Pictures were taken with

an ImageQuant RT ECL machine using 520 nm wavelength filter. After quantification, all DNA samples

(18 in total for each screening) were mixed together by adding 20 ng of each one. Purification with E-

Gel® SizeSelectTM 2% agarose gel (Invitrogen, #G6610-02) provided the sample that was sequenced

using the Illumina NextSeq platform.

3. Data analysis

3.1 General data processing methods

Data analysis was performed in MATLAB. Core scripts are available as part of the Supplementary

Data.rar.

3.2 Processing of illumina data

Illumina data was processed following already reported strategies.4, 6, 7 Briefly, raw FASTQ data was

processed using MATLAB Scripts developed by this lab. These scripts identified the barcodes and

constant flanking regions and extracted the reads of correct length for each type of library. Each sequence

and copy numbers at which they appear are arranged in a N×M matrix, where N is the total number of

unique sequences and M is the number of experiments and replicates. After this, the data arranged in the

matrix is used to find sequences significantly enriched in the positive sets. This is done with the

Differential Enrichment MATLAB script already reported by this lab7 and attached in Supplementary

Data.rar as DE_analysis.m. Basically, average normalized frequency of appearance was calculated for

each sequence in each set (positive, controls and inputs). Then, the average frequency in the positive set

Page 37: Learning Structure Activity Relationship (SAR) of the

14

was divided by the average frequency in control sets to obtain a final ratio. A p-value between replicates

of positive and control sets was calculated using a two-sided unequal variance t-test. Sequences that are

considered “hits” must have p-value < 0.05 and R ≥ 3. Further information on all the math is in previous

report.4 The 20×20 plots were obtained by using techniques already reported by this lab4, 5 and the script

(plot20x20_generation.m) is attached in Supplementary Data.rar. Logo plots were obtained by

transforming the txt data table after Differential Enrichment analysis back to FASTQ and uploading it

into the https://weblogo.berkeley.edu/logo.cgi website.

4. DFT computation

All the optimized structures of reactants, products and transition state were obtained using density

functional theory (DFT). The geometries were optimized in the gas phase using the B3LYP8-11 function

with 6-31G(d) basis set.12, 13 All the optimizations were performed using default convergence criteria.

The initial structure of TS1 and TS2 was built based on the structure from Robiette et al.3 To confirm the

TS conformation minima or first-order transition states, to determine Gibbs free energies (at 298 K),

vibrational frequencies were computed for all the optimized structures. The Gaussian 09 suite of

software14 was used to perform all the electronic structure computations. All the output files are attached

in Supplementary Data.rar.

5. Machine Learning

The DC values calculated in the previous section for each tetramer amino acid sequence were used as

input data and the MATLAB script being used to generate DC values is attached in Supplementary

Data.rar as MakeMLinput.m. Algorithm info is available at https://github.com/derdalab/GESAR. To

prepare the data for machine learning, all the amino acid sequences and their corresponding DC values

were loaded as a data frame by using Pandas library and python. A log2 of the DC values was added to

Page 38: Learning Structure Activity Relationship (SAR) of the

15

the data frame as a new column to be used to create the class labels for machine learning. Quantitative

chemical properties of each of the amino acids in the tetramer sequences were added as new columns,

specifically z-scale descriptors (3×4 AA position = 12 features)15 and VHSE (Vectors of Hydrophobic,

Steric, and Electronic properties) descriptors (8×4 AA position = 32 features)16 to the table.

* A data frame is a 2-dimensional data structure. In our case it has n rows and m columns where n is the

number of amino acid sequences observed and m are the corresponding columns of features and labels.

* Label refers to the target value or class we want to predict. Features refer to measurable properties

about our process of interest that are used as inputs to predict our label.

Apart from the chemical properties, we added sequence patterns based on permutations of 20 amino acids

among 4 positions within the tetramer sequences (2,396 features) to yield total of 2,440 features. We

used the following strategy to generate features based on sequence patterns. First, for the list of Amino-

Acids: [A,R,N,D,C,E,Q,G,H,I,L,K,M,F,P,S,T,W,Y,V], we compute the cartesian products with 1, 2 or 3

elements with repetitions as illustrated below:

○ 3 elements: AAA, AAR, ARN.. [8000 of these]

○ 2 elements: AA, AR, RN.. [400 of these]

○ 1 element: A,R.. [20 of these]

Then, we create sequence patterns as following illustration, where X is referred to wildcard amino-acids

○ 3 element: AAAX , AAXA, AXAA, XAAA [4 × 8000]

○ 2 element: AAXX, AXXA, XXAA, AXAX, AXXA, XAXA [6 × 400]

○ 1 element: XXXA, XXAX, XAXX, AXXX [4 × 20]

Then, we create features based on the lookup of these patterns. If a pattern is present in a tetramer

Page 39: Learning Structure Activity Relationship (SAR) of the

16

instance, we give a value of 1 and 0 otherwise. This procedure yielded 34,480 patterns, of which 2,396

were finally retained by setting a threshold of at least 20 tetramer instances that match a particular pattern.

To use classifiers, we split the data in to HIGH or LOW deep conversion subgroups where HIGH and

LOW are defined by the highest and lowest 5% of the Log DC values obtained by experimental

observation (Figure S5). This means that there were two independent splits of the data where the first

split was made up of HIGH (top 5%) and NOT-HIGH (95%), and the second was made up LOW (Bottom

5%) and NOT-LOW (95%). We used these class labels as our target labels for training our models. The

scripts used to create these labels and pre-process the data can be found in the github repo under

Final_evaluation_CLF_Models.ipynb file.

Training: Using a gradient boosted ensemble of decision trees (XGBoost17), we trained two binary

classifiers: one to identify sequences belonging to the HIGH subgroup and the other to identify sequences

belonging to the LOW subgroup . The two independent splits of the data we created earlier were used to

train these two classifiers, respectively. We then used 5-fold stratified cross-validation to evaluate the

performance of our models. Cross-validation refers to a resampling technique where the data set is split

into k groups (5 groups in our case). Where one of the five groups is used to evaluate the model while

the other four groups are used to train the model. This is done k times (5 times in our case) with a reshuffle

of the data each time and a new model trained every time. The average evaluation scores collected from

each reshuffle of the data are used to evaluate the performance of the model. Stratified k-fold includes

splitting the data so that class distribution is preserved across folds. We evaluated our models on several

metrics including ROC AUC Scores, Accuracy, F1-Scores, Precision and Recall.

Evaluation: The respective classifiers showed area under the receiver operating characteristic curve

Page 40: Learning Structure Activity Relationship (SAR) of the

17

(ROC AUC) scores of 81.2 ± 0.4 and 73.7 ± 0.8 for HIGH and LOW, as evaluated by 5-fold stratified

cross-validation on the available data. Further details about the performance of each model with

additional metrics are in Table S1. After evaluating our models, we discarded the old models and

retrained both models on the entire data set collected from experiments. These models were then saved

as .pickle files which can then be reloaded from disk when needed.

Prediction: To predict the classes of the sequences not observed in our experimental data set, we created

a new dataset that consisted of about 100,000 sequences and their corresponding features. We computed

these features by using the process specified above for feature engineering. Using our two classifier

models, we predicted two probabilities for each of these sequences: probability of sequence belonging to

LOW class, probability of sequencing belonging to HIGH class. Since these probabilities were generated

by two independent classifier models, these two probabilities are also independent and as such do not

add up to 1 for each sequence. We then set a threshold of 0.5 on the probabilities to decide if a sequence

belonged to the class or not. Each of our classifiers can generate a probability of a sample belonging to

the two binary classes each classifier is trained for i.e. the HIGH classifier model, it outputs a probability

value between 0 and 1 where higher values indicate that sequences is predicted to belong to the top 5%.

We can set the threshold at which a sequence is labelled as HIGH or NOT-HIGH (LOW or NOT-LOW

for the other classifier). Setting different threshold values can change the F1-score for each of the

classifiers as seen Figure S13 in Supplementary Information. We set the threshold value at 0.5 as it allows

us to have a high confidence in our predictions and maximizes the F1-score for our classifiers. The

threshold is a trade off between precision and recall. A higher threshold allows us to minimize false

positives while a lower threshold minimizes false negatives. We believe that 0.5 represents a balanced

approach between the two for our data set. Setting a lower threshold for each of the classifiers while

Page 41: Learning Structure Activity Relationship (SAR) of the

18

possible may also cause overlap in sequences that are labelled as positive and negative since each of

those predicted probabilities is independent. We then used the threshold to assign class labels to the

sequences. I.e if a sequence had a probability of greater than or equal 0.5 for HIGH and less than 0.5 for

low, we labelled it as HIGH. If it had a probability of greater than or equal to 0.5 and less than 0.5 for

HIGH, we labelled it as low. If a sequence had any other combination of probabilities, we labelled it as

the medium (middle 90%).

We used these labels to create a 20×20 plot that allows us to visualize the predictions of about 100,000

amino acid sequences.

The 20×20 plots were created by using scripts based on previous scripts reported by this lab.

(citation)These scripts were converted and modified for python implementation with class labels. The

modified script can be found under helper_functions.py in our Github repo.

Deployment: We deployed the saved models into a publicly available web app http://44.226.164.95/

which allows users to get the probability of sequences belonging to both the HIGH and LOW class as

well as their position within the 20×20 plot of the observed data.

Page 42: Learning Structure Activity Relationship (SAR) of the

19

Figure S1. (a) Scheme of Wittig reaction in a population of peptide aldehydes. (b) Population reaction

rate measured by biotin capture. (c) The experimentally measured population rate constant can be used

to extrapolate the conversion of the population at 10 minutes in the presence of different concentrations

of YEB. (d) Individual members of the library react at rates that are faster or slower than the average

population rate. We modeled both disappearance of the starting material and appearance of the

products in such kinetically heterogeneous population. Specifically, we used a realistic input: a 1010

total number of the phage in the library. To simplify the visualization, we assumed that every member

of the library is present in the same concentration. In such input, each phage clone is present at a copy

number 105 and conversion of all clones to the product can be easily followed. After 10 minutes of

reaction, it is possible to use the relative abundance of the products to distinguish the substrates with

rate constants ranging from 0.01x to 100x the average population rate of 0.2 M-1s-1. Detection of

substrates slower than 0.002 M-1s-1 is difficult because the copy numbers of the products approach

single digits and their relative abundance is <0.1 parts per million. Detection of such product by deep

sequencing requires significant depth (tens of millions of reads) and it would be hampered by

sequencing noise. Differentiation of substrates faster than 20 M-1s-1 is not possible because they all

reach complete conversion under these conditions. Fast reactivities can be differentiated by shifting the

detection time to the earlier time points; however, early detection would make it impossible to measure

the slow reactivity. It is simple to show that the dynamic range of the detection—0.01x to 100x the

average population rate—represents the upper theoretical limit at the 1010 pfu input and 106-107 read

depth. To improve it, one must increase the number of phage clones in the input beyond 1010 pfu and

increase the depth of sequencing beyond several million sequencing reads.

Page 43: Learning Structure Activity Relationship (SAR) of the

20

Figure S2. Deep sequencing analysis for 1% SX4 selection control.

A) Tittering. B) Illumina reads. C) Illumina counts for control mixture and Ci using equations 1-3. D)

20:20 plots for SX4 1% selection.

Page 44: Learning Structure Activity Relationship (SAR) of the

21

Figure S3. Kinetic traces for 1% SX4 “Deep Conversion” ≥ 3.5.

Page 45: Learning Structure Activity Relationship (SAR) of the

22

Figure S4. Kinetic traces for 1% SX4 and 0.15 ≤ “Deep Conversion” ≤ 3.5.

Page 46: Learning Structure Activity Relationship (SAR) of the

23

Figure S5. Kinetic traces for 1% SX4 and 0.15 ≥ “Deep Conversion” (not including SPXXX

sequences).

Page 47: Learning Structure Activity Relationship (SAR) of the

24

Figure S6. Kinetics for SWWXX and SPXXX sequences with no Illumina counts.

Page 48: Learning Structure Activity Relationship (SAR) of the

25

Figure S7. Proline-Alanine scan to determine position and quantity of hydrogen-bond donors for

stabilization of OPA transition state.

Page 49: Learning Structure Activity Relationship (SAR) of the

26

Figure S8. Example of SPXXX LCMS traces for determination of rate.

Page 50: Learning Structure Activity Relationship (SAR) of the

27

Figure S9. Gibbs free energy barrier of model peptides CHO-AlaAla, CHO-AlaSar, CHO-SarAla, CHO-

SarSar (Sar = Sarcosine) as determined using B3LYP/6-31(g) in gas phase. E configurations are shown

in red while Z configurations are shown in blue. Structures of TS1 are provided in Fig S14 and S15.

Cartesian coordinates are provided in output files as well.

Page 51: Learning Structure Activity Relationship (SAR) of the

28

Figure S10. Geometries of trans (E) TS1 model peptides as determined using B3LYP/6-31(g) in gas

phase. Important bond and distances are indicated, and Hydrogen bonds are labelled in red.

Page 52: Learning Structure Activity Relationship (SAR) of the

29

Figure S11. Geometry of cis (Z) TS1 model peptides as determined using B3LYP/6-31(g) in gas phase.

Important bond and distances are indicated, and Hydrogen bonds are labelled in red.

Page 53: Learning Structure Activity Relationship (SAR) of the

30

Figure S12. Kinetics of hydrazone ligation for HCO-WWRR at [H2SO4] = 65 mM.

Page 54: Learning Structure Activity Relationship (SAR) of the

31

Figure S13. Kinetics of hydrazone ligation for HCO-PPAA at [H2SO4] = 65 mM.

Page 55: Learning Structure Activity Relationship (SAR) of the

32

Figure S14. Kinetics of hydrazone ligation for HCO-WWRR at pH 5

Page 56: Learning Structure Activity Relationship (SAR) of the

33

Figure S15. Kinetics of hydrazone ligation for HCO-PPAA at pH 5

Page 57: Learning Structure Activity Relationship (SAR) of the

34

Figure S16. E/Z selectivity for HCO-WWRR

Page 58: Learning Structure Activity Relationship (SAR) of the

35

Figure S17. E/Z selectivity for HCO-HWFP

Page 59: Learning Structure Activity Relationship (SAR) of the

36

Figure S18. E/Z selectivity for HCO-PPAA

Page 60: Learning Structure Activity Relationship (SAR) of the

37

Figure S19. E/Z selectivity for HCO-AA

Page 61: Learning Structure Activity Relationship (SAR) of the

38

Figure S20. E/Z selectivity for HCO-SarSar

Page 62: Learning Structure Activity Relationship (SAR) of the

39

Figure S21. 20×20 plot of SXCXXXC library

Page 63: Learning Structure Activity Relationship (SAR) of the

40

Entry Sequence DC Rate

1 SYCKADC 0.28 0.44

2 SKCETFC 0.65 0.23

3 SQCYWRC 0.45 0.18

4 SQCYESC - 0.14

5 SFCQGKC 0.73 0.15

Table S1. Comparison between DC and rate constants in isolated peptides for SXC3C 1% selection

Page 64: Learning Structure Activity Relationship (SAR) of the

41

Figure S22. Kinetic traces for sequences of 1% SXC3C selection

Page 65: Learning Structure Activity Relationship (SAR) of the

42

Figure S23. Distribution of log DC value (DC = “deep conversion”).

Histogram of the top 5% and bottom 5% of the sequences based on the DC values. This threshold was

used to define labels for peptides for the classification task.

Page 66: Learning Structure Activity Relationship (SAR) of the

43

Figure S24. A single decision tree for illustrative purposes.

The XGBoost algorithm uses an ensemble of trees to generate a more accurate output for the

classification task. Decision trees can be used to deduce a simple rule based model that navigates

through a sequence pattern. Impurity at each node is shown below using the gini index. Wildcard base

is shown as ‘.’ e.g.: (W…) is WXXX where X can be any of the 20 amino acid bases. Value of 0 and 1

is assigned to the absence and presence of pattern respectively (Therefore ≤ 0.5 means absence of a

particular pattern).

Page 67: Learning Structure Activity Relationship (SAR) of the

44

Figure S25. Number of sequence patterns vs threshold number of sequences used as input features for

the models.

This threshold was based on the number of instances with each sequence pattern appearing among the

observed sequences. When the threshold for the minimum number of instances was set to 20, it reduced

the number of sequence patterns used as input features from 34,480 to 2,396.

Page 68: Learning Structure Activity Relationship (SAR) of the

45

Figure S26. Screenshot of the DC prediction app.

The app allows users to provide sequences and predicts the probability of a sequence belonging to the

HIGH and LOW DC groups. HIGH and LOW are learnt using two separate models. Hence, the scores

do not necessarily add up to 1. Users can use their judgement to compare the scores and decide which of

the two is more likely. However, these scores are not calibrated probabilities.

Page 69: Learning Structure Activity Relationship (SAR) of the

46

Figure S27. Probability distributions for predictions and corresponding F1-scores for different

probability thresholds. a) Probability distribution for predicted probabilities from the HIGH classifier

model for predicted (unobserved) data. b) Probability distribution for predicted probabilities from the

LOW classifier model for predicted (unobserved) data. c) Probability threshold vs F1-scores for HIGH

classifier. d) Probability threshold vs F1-scores for LOW classifier.

Page 70: Learning Structure Activity Relationship (SAR) of the

47

Metric High DC

Mean ± SD

Low DC

Mean ± SD

Accuracy 92.5 ± 0.2 90.1 ± 0.2

AUC-ROC 81.2 ± 0.4 73.7 ± 0.8

F1-Score* 33.7 ± 0.9 19.0 ± 0.9

Precision 30.4 ± 1.2 16.1 ± 0.9

Recall 37.9 ± 1.1 23.2 ± 0.9

Table S2. Performance metrics of the two classifiers for a 5-fold cross validation.

* F1-Score is the harmonic mean of precision and recall. The highest value for F1 is 1.0 which indicates

perfect precision and recall and the lowest is 0 which happens if either precision or recall is 0.

Page 71: Learning Structure Activity Relationship (SAR) of the

48

HPLC purity and LCMS traces of synthesized peptides

Figure S28. Summary for SWWRR synthesis

Page 72: Learning Structure Activity Relationship (SAR) of the

49

Figure S29. Summary for CHO-WWRR synthesis

Page 73: Learning Structure Activity Relationship (SAR) of the

50

Figure S30. Summary for SWWGP synthesis

Page 74: Learning Structure Activity Relationship (SAR) of the

51

Figure S31. Summary for CHO-WWGP synthesis

Page 75: Learning Structure Activity Relationship (SAR) of the

52

Figure S32. Summary for SQWLH synthesis

Page 76: Learning Structure Activity Relationship (SAR) of the

53

Figure S33. Summary for CHO-QWLH synthesis

Page 77: Learning Structure Activity Relationship (SAR) of the

54

Figure S34. Summary for SWWGL synthesis

Page 78: Learning Structure Activity Relationship (SAR) of the

55

Figure S35. Summary for CHO-WWGL synthesis

Page 79: Learning Structure Activity Relationship (SAR) of the

56

Figure S36. Summary for SWWPQ synthesis

Page 80: Learning Structure Activity Relationship (SAR) of the

57

Figure S37. Summary for CHO-WWPQ synthesis

Page 81: Learning Structure Activity Relationship (SAR) of the

58

Figure S38. Summary for SWLPR synthesis

Page 82: Learning Structure Activity Relationship (SAR) of the

59

Figure S39. Summary for CHO-WLPR synthesis

Page 83: Learning Structure Activity Relationship (SAR) of the

60

Figure S40. Summary for SLWYR synthesis

Page 84: Learning Structure Activity Relationship (SAR) of the

61

Figure S41. Summary for CHO-LWYR synthesis

Page 85: Learning Structure Activity Relationship (SAR) of the

62

Figure S42. Summary for SIWVR synthesis

Page 86: Learning Structure Activity Relationship (SAR) of the

63

Figure S43. Summary for CHO-IWVR synthesis

Page 87: Learning Structure Activity Relationship (SAR) of the

64

Figure S44. Summary for SHWFP synthesis

Page 88: Learning Structure Activity Relationship (SAR) of the

65

Figure S45. Summary for CHO-HWFP synthesis

Page 89: Learning Structure Activity Relationship (SAR) of the

66

Figure S46. Summary for SALRV synthesis

Page 90: Learning Structure Activity Relationship (SAR) of the

67

Figure S47. Summary for CHO-ALRV synthesis

Page 91: Learning Structure Activity Relationship (SAR) of the

68

Figure S48. Summary for SAPAA synthesis

Page 92: Learning Structure Activity Relationship (SAR) of the

69

Figure S49. Summary for CHO-APAA synthesis

Page 93: Learning Structure Activity Relationship (SAR) of the

70

Figure S50. Summary for SPPAA synthesis

Page 94: Learning Structure Activity Relationship (SAR) of the

71

Figure S51. Summary for CHO-PPAA synthesis

Page 95: Learning Structure Activity Relationship (SAR) of the

72

Figure S52. Summary for SPPLA synthesis

Page 96: Learning Structure Activity Relationship (SAR) of the

73

Figure S53. Summary for CHO-PPLA synthesis

Page 97: Learning Structure Activity Relationship (SAR) of the

74

Figure S54. Summary for SPPPA synthesis

Page 98: Learning Structure Activity Relationship (SAR) of the

75

Figure S55. Summary for CHO-PPPA synthesis

Page 99: Learning Structure Activity Relationship (SAR) of the

76

Figure S56. Summary for SPRLP synthesis

Page 100: Learning Structure Activity Relationship (SAR) of the

77

Figure S57. Summary for CHO-PRLP synthesis

Page 101: Learning Structure Activity Relationship (SAR) of the

78

Figure S58. Summary for SPQPL synthesis

Page 102: Learning Structure Activity Relationship (SAR) of the

79

Figure S59. Summary for CHO-PQPL synthesis

Page 103: Learning Structure Activity Relationship (SAR) of the

80

Figure S60. Summary for SPPPL synthesis

Page 104: Learning Structure Activity Relationship (SAR) of the

81

Figure S61. Summary for CHO-PPPL synthesis

Page 105: Learning Structure Activity Relationship (SAR) of the

82

Figure S62. Summary for SPPPP synthesis

Page 106: Learning Structure Activity Relationship (SAR) of the

83

Figure S63. Summary for CHO-PPPP synthesis

Page 107: Learning Structure Activity Relationship (SAR) of the

84

Figure S64. Summary for SPYPA synthesis

Page 108: Learning Structure Activity Relationship (SAR) of the

85

Figure S65. Summary for CHO-PYPA synthesis

Page 109: Learning Structure Activity Relationship (SAR) of the

86

Figure S66. Summary for SPAAA synthesis

Page 110: Learning Structure Activity Relationship (SAR) of the

87

Figure S67. Summary for CHO-PAAA synthesis

Page 111: Learning Structure Activity Relationship (SAR) of the

88

Supporting information references

References

1. V. Triana and R. Derda, Org. Biomol. Chem., 2017, 15, 7869-7877.

2. T. L. Hwang and A. J. Shaka, J. Magn. Reson., Series A, 1995, 112, 275-279.

3. R. Robiette, J. Richardson, V. K. Aggarwal and J. N. Harvey, J. Am. Chem. Soc., 2006, 128,

2394-2409.

4. K. F. Tjhung, P. I. Kitov, S. Ng, E. N. Kitova, L. Deng, J. S. Klassen and R. Derda, J. Am.

Chem. Soc., 2016, 138, 32-35.

5. B. He, K. F. Tjhung, N. J. Bennett, Y. Chou, A. Rau, J. Huang and R. Derda, Sci. Rep., 2018, 8,

1214.

6. W. L. Matochko, S. Cory Li, S. K. Y. Tang and R. Derda, Nucleic Acids Res., 2014, 42, 1784-

1798.

7. S. Ng, E. Lin, P. I. Kitov, K. F. Tjhung, O. O. Gerlits, L. Deng, B. Kasper, A. Sood, B. M.

Paschal, P. Zhang, C.-C. Ling, J. S. Klassen, C. J. Noren, L. K. Mahal, R. J. Woods, L. Coates

and R. Derda, J. Am. Chem. Soc., 2015, 137, 5248-5251.

8. C. Ge, E. Spånning, E. Glaser and Å. Wieslander, Mol. Plant, 2014, 7, 121-136.

9. J. Xie, Z. Xu, S. Zhou, X. Pan, S. Cai, L. Yang and H. Mei, PLOS ONE, 2013, 8, e74506.

10. T. Chen and C. Guestrin, Proceedings of the 22nd acm sigkdd international conference on

knowledge discovery and data mining, 2016, pp. 785-794.

11. A. D. Becke, J. Chem. Phys., 1993, 98, 1372-1377.

12. C. Lee, W. Yang and R. G. Parr, Phys. Rev. B, 1988, 37, 785-789.

13. S. H. Vosko, L. Wilk and M. Nusair, Can. J. Phys., 1980, 58, 1200-1211.

14. P. J. Stephens, F. J. Devlin, C. F. Chabalowski and M. J. Frisch, J. Phys. Chem., 1994, 98,

11623-11627.

15. W. J. Hehre, R. Ditchfield and J. A. Pople, J. Chem. Phys., 1972, 56, 2257-2261.

16. R. Ditchfield, W. J. Hehre and J. A. Pople, J. Chem. Phys., 1971, 54, 724-728.

17. M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, G.

Scalmani, V. Barone, B. Mennucci, G. A. Petersson, H. Nakatsuji, M. Caricato, X. Li, H. P.

Hratchian, A. F. Izmaylov, J. Bloino, G. Zheng, J. L. Sonnenberg, M. Hada, M. Ehara, K.

Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, T.

Vreven, J. A. M. Jr., J. E. Peralta, F. Ogliaro, M. Bearpark, J. J. Heyd, E. Brothers, K. N. Kudin,

V. N. Staroverov, R. Kobayashi, J. Normand, K. Raghavachari, A. Rendell, J. C. Burant, S. S.

Iyengar, J. Tomasi, M. Cossi, N. Rega, J. M. Millam, M. Klene, J. E. Knox, J. B. Cross, V.

Bakken, C. Adamo, J. Jaramillo, R. Gomperts, R. E. Stratmann, O. Yazyev, A. J. Austin, R.

Cammi, C. Pomelli, J. W. Ochterski, R. L. Martin, K. Morokuma, V. G. Zakrzewski, G. A.

Voth, P. Salvador, J. J. Dannenberg, S. Dapprich, A. D. Daniels, Ö. Farkas, J. B. Foresman, J.

V. Ortiz, J. Cioslowski and D. J. Fox, Gaussian 09 (Revision A.02) Gaussian, Inc., Wallingford

CT, 2016.