

arXiv:1803.10342v2 [q-bio.BM] 26 May 2018

Classification of crystallization outcomes using deep convolutional neural networks

Andrew E. Bruno
Center for Computational Research, University at Buffalo, Buffalo, New York, United States of America.

Patrick Charbonneau
Department of Chemistry, Duke University, Durham, North Carolina, USA and Department of Physics, Duke University, Durham, North Carolina, USA

Janet Newman
Collaborative Crystallisation Centre CSIRO, Parkville, Victoria, Australia.

Edward H. Snell
Hauptman-Woodward Medical Research Institute and SUNY Buffalo, Department of Materials, Design, and Innovation, Buffalo, New York 14203, United States of America.

David R. So and Vincent Vanhoucke
Google Brain, Google Inc., Mountain View, California, United States of America.

Christopher J. Watkins
IM&T Scientific Computing, CSIRO, Clayton South, Victoria, Australia

Shawn Williams
Platform Technology and Sciences, GlaxoSmithKline Inc., Collegeville, Pennsylvania, United States of America.

Julie Wilson
Department of Mathematics, University of York, York, United Kingdom.

(Dated: May 29, 2018)

The Machine Recognition of Crystallization Outcomes (MARCO) initiative has assembled roughly half a million annotated images of macromolecular crystallization experiments from various sources and setups. Here, state-of-the-art machine learning algorithms are trained and tested on different parts of this data set. We find that more than 94% of the test images can be correctly labeled, irrespective of their experimental origin. Because crystal recognition is key to high-density screening and the systematic analysis of crystallization experiments, this approach opens the door to both industrial and fundamental research applications.

Author summary: Protein crystal growth experiments are routinely imaged, but the mass of accumulated data is difficult to manage and analyze. Using state-of-the-art machine learning algorithms on a large and diverse set of reference images, we manage to recapitulate the labels of a remarkably large fraction of the set. This automation should enable a number of industrial and fundamental applications.

I. INTRODUCTION

X-ray crystallography provides the atomic structure of molecules and molecular complexes. These structures in turn provide insight into the molecular driving forces for small molecule binding, protein-protein interactions, supramolecular assembly and other biomolecular processes. The technique is thus foundational to molecular modeling and design. Beyond the obvious importance of structure information for understanding and altering the role of biomolecules, it also has important industrial applications. The pharmaceutical industry, for instance, uses structures to guide chemistry as part of a “predict first” strategy [1], employing expert systems to reduce optimization cycle times and more effectively bring medicine to patients. Yet, despite decades of methodological advances, crystallizing molecular targets of interest remains the bottleneck of the entire crystallography program in structural biology.

Even when crystallization is facile, it is microscopically rare; for macromolecules it is also uncommon [2–5]. Experimental trials typically involve: (i) mixing a purified sample with chemical cocktails designed to promote molecular association, (ii) generating a supersaturated solution of the desired molecule via evaporation or equilibration, and (iii) visually monitoring the outcomes, before (iv) optimizing those conditions and analyzing the resultant crystal with an X-ray beam. One hopes for the formation of a crystal instead of non-specific (amorphous) precipitates or of nothing at all.

In order to help run these trials, commercial crystallization screens have been developed; each screen generally contains 96 formulations designed to promote crystal growth. Whether these screens are equally effective or not [5, 6] remains debated, but their overall yield is in any case paltry. Typically fewer than 5% of crystallization attempts produce useful results (with a success rate as low as 0.2% in some contexts [7]).

The practical solution to this hurdle has been to increase the convenience and number of crystallization trials. To offset the expense of reagents and scientist time, labs routinely employ industrial robotic liquid handlers, nanoliter-size drops, and record trial outcomes using automated imaging systems [5, 8–11]. Hoping to compensate for the rarity of crystallization, commercially available systems readily probe a large area of chemical space with minimal sample volume, with a throughput of ∼1000 individual experiments per hour.

While liquid handling is readily automated, crystal recognition is not. Imaging systems may have made viewing results more comfortable than bending over a microscope, but crystallographers still manually inspect images and/or drops, looking for crystals or, more commonly, conditions that are likely to produce good crystals when optimized. This human cost makes crystal recognition a key experimental bottleneck within the larger challenge of crystallizing biomolecules [7]. A typical experiment for a given sample includes four 96-well screens at two temperatures, i.e., 768 conditions (and can have up to twice that [12]). Assuming that it takes 2 seconds to manually scan a droplet (and noting that the scans have to be repeated, as crystallization is time dependent), simply looking at a single set of 96 trials over the lifetime of an experiment can take the better part of an hour [13]. For the sake of illustration, the U.S. Structural Science group at GlaxoSmithKline performs ∼1200 96-well experiments per year. If the targeted observation schedule were rigorously followed, the group would spend a quarter of the year staring at drops, of which the vast majority contains no crystal. Recording outcomes and analyzing the results of the 96 trials would further increase the time burden. Current operations are already straining existing resources, and the approach simply does not scale for proposed higher-density screening [10].
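The scale of this burden is easy to check with back-of-the-envelope arithmetic. The sketch below uses the 2 s per-drop and ∼1200 plates per year figures from the text; the number of scheduled inspections per drop over an experiment's lifetime is a hypothetical assumption, since the text only implies a repeated imaging schedule:

```python
# Back-of-the-envelope check of the manual-inspection burden.
# SECONDS_PER_DROP and PLATES_PER_YEAR come from the text;
# INSPECTIONS_PER_DROP is an assumed imaging schedule (hypothetical).
SECONDS_PER_DROP = 2
WELLS_PER_PLATE = 96
PLATES_PER_YEAR = 1200          # U.S. Structural Science group at GSK
INSPECTIONS_PER_DROP = 8        # hypothetical lifetime inspection count

drops_per_year = WELLS_PER_PLATE * PLATES_PER_YEAR
hours_per_year = drops_per_year * INSPECTIONS_PER_DROP * SECONDS_PER_DROP / 3600
print(f"{hours_per_year:.0f} hours of scanning per year")  # 512 hours
```

With these assumptions the total is roughly 500 hours, i.e., about a quarter of a working year, consistent with the estimate in the text.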

Crystal growth is also sufficiently uncommon that the tolerance for false negatives is almost nil. Yet most crystallographers are misguided in thinking that they themselves would never miss identifying a crystal given an image containing one, or indeed miss a crystal in a droplet viewed directly under a microscope [14]. In fact, not only do crystallographers miss crystals due to lack of attention through boredom, they often disagree on the class an image should be assigned to. An overall agreement rate of ∼70% was found when the classes assigned to 1200 images by 16 crystallographers were compared [14]. (When considering only crystalline outcomes, agreement rose to ∼93%.) Consistency in visual scoring was also considered by Snell et al. when compiling a ∼150,000-image dataset [15]. They found that viewers give different scores to the same image on different occasions during the study, with the average agreement rate for scores on a control set at the beginning and middle of the study being 77%, rising to 84% for the agreement in scores between the middle and end of the study. Crystallographers also tend to be optimistically biased when scoring their own experiments [16]. A better use of expert time and attention would be to focus on scientific inquiry.

An algorithm that could analyze images of drops, distinguish crystals from trivial outcomes, and reduce the effort spent cataloging failure, would present clear value both to the discipline and to industry. Ideally, such an algorithm would act like an experienced crystallographer in:

• recognizing macromolecular crystals appropriate for diffraction experiments;

• recognizing outcomes that, while requiring optimization, would lead to crystals for diffraction experiments;

• recognizing non-macromolecular crystals;

• ignoring technical failures;

• identifying non-crystalline outcomes that require follow up;

• being agnostic as to the imaging platform used;

• being indefatigable and unbiased;

• occurring in a time frame that does not impede the process;

• learning from experience.

Such an algorithm would further reduce the variance in the assessments, irrespective of its accuracy. A high-variance, manual process is not conducive to automating the quality control of the system end-to-end, including the imaging equipment. Enhanced reproducibility enables traceability of the outcomes, and paves the way for putting in place measurable, continuous improvement processes across the entire imaging chain.

Automated crystallization image classification that attempts to meet the above criteria has been previously attempted. The research laboratories that first automated crystallization inspection quickly realized that image analysis would be a huge problem, and concomitantly developed algorithms to interpret the images [17–20]. None of these programs was ever widely adopted. This may have been due in part to their dependence on a particular imaging system, and to the relatively limited use of imaging systems at the time. Many of the early image analysis programs further required very time-consuming collation of features and significant preprocessing, e.g., drop segmentation to locate the experimental droplet within the image. To the best of our knowledge, there was also no widespread effort to make a widely available image analysis package in the same way that the diffraction-oriented programs have been organized, e.g., the CCP4 package [21].

Can a better algorithm be constructed and trained? In order to help answer this question, the Machine Recognition of Crystallization Outcomes (MARCO) initiative was set up [22]. MARCO assembled a set of roughly half a million classified images of crystallization trials through an international collaboration with five separate institutions. Here, we present a machine-learning based approach to categorize these images. Remarkably, the algorithm we employ manages to obtain an accuracy exceeding 94%, which is even above what was once thought possible for human categorization. This suggests that a deployment of this technology in a variety of laboratory settings is now conceivable. The rest of this paper is as follows. Section II describes the dataset and the scoring scheme, Sec. III describes the machine-learning model and training procedure, Secs. IV and V describe and discuss the results, respectively, and Sec. VI briefly concludes.

II. MATERIAL AND METHODS

Image Data

The MARCO data set used in this study contains 493,214 scored images from five institutions (see Table I [22]). The images were collected from imagers made by two different manufacturers (Rigaku Automation and Formulatrix), which have different optical systems, as well as by the in-house imaging equipment built at the Hauptman-Woodward Medical Research Institute (HWMRI) High-Throughput Crystallization Center (HTCC). Different versions of the setups were also used – some Rigaku images are collected with a true color camera, some are collected as greyscale images. The zoom extent varies, with some imagers set up to collect a field-of-view (FOV) of only the experimental droplet, and some set for the FOV to encompass a larger area of the experimental setup. The Rigaku and Formulatrix automation imaged vapor diffusion based experiments, while the HTCC system imaged microbatch-under-oil experiments. A random selection of 50,284 test images was held out for validation. Images in the test set were not represented in the training set. The precise data split is available from the MARCO website [22].

Labeling

Images were scored by one or more crystallographers. As the dataset is composed of archival data, no common scoring system was imposed, nor were exemplar images distributed to the various contributors. Instead, existing scores were collapsed into four comprehensive and fairly robust categories: clear, precipitate, crystal, and other. This last category was originally used as a catchall for images not obviously falling into the three major classes, and came to assume a functional significance as the classification process was further investigated. Examination of the least classifiable five percent of images indeed revealed many instances of process failure, such as dispensing errors or illumination problems. These uninterpretable images were then labelled as “other” during the rescoring, which added an element of quality control to the overall process [25].

Relabeling

After a first baseline system was trained (see Sec. III), the 5% of the images that were most in disagreement with the classifier (independently of whether the image was in the training or the test set) were relabeled by one expert, in order to obtain a systematic eye on the most problematic images.

Because no rules were established and no exemplars were circulated prior to the initial scoring, individual viewpoints varied on classifying certain outcomes. For example, the bottom 5% contained many instances of phase separation, where the protein forms oil droplets or an oily film that coats the bottom of the crystallization well. Images were found to be inconsistently scored as “clear”, “precipitate”, or “other” depending on the amount and visibility of the oil film. This example highlights the difficulty of scoring experimental outcomes beyond crystal identification. A more serious source of ambiguity arises from process failure. Many of the problematic images did not capture experimental results at all. They were out of focus, dark, overexposed, dropless, etc. Whatever labeling convention was initially followed, for the relabeling the “other” category was deemed to also diagnose problems with the imaging process.

A total of 42.6% of annotations for the images that were revisited disagreed with the original label, suggesting somewhat high (1 to 2%) label noise in this difficult fraction of the dataset. For a fraction of this data, multiple raters were asked to label the images independently and had an inter-rater disagreement rate of approximately 22%. The inherent difficulty of assigning a label to a small fraction of the images is therefore consistent with the results of Ref. [14]. Table II shows the final image counts after relabeling.
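The parenthetical 1 to 2% estimate follows from simple arithmetic: only the hardest ~5% of the dataset was revisited, so a 42.6% disagreement rate within that slice translates into roughly 2% of all labels:

```python
# Label-noise estimate implied by the text: 42.6% of revisited labels
# changed, but only the hardest ~5% of the dataset was revisited.
fraction_revisited = 0.05
disagreement_in_revisited = 0.426
overall_label_noise = fraction_revisited * disagreement_in_revisited
print(f"~{overall_label_noise:.1%} of all labels")  # ~2.1%
```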

III. MACHINE LEARNING MODEL

The goal of the classifier here is to take an image as an input, and output the probability of it belonging to each of four classes (crystals, precipitate, clear, other) (see Fig. 1). The classifier used is a deep Convolutional Neural Network (CNN). CNNs, originally proposed in Ref. [26], and their modern ‘deep’ variants (see, e.g., Refs. [27, 28] for recent reviews), have proven to consistently provide reliable results on a broad variety of visual recognition tasks, and are particularly amenable to addressing data-rich problems. They have been, for instance, state of the art on the very competitive ILSVRC image recognition challenge [29] since 2012.

TABLE I. Breakdown of data sources and imaging technology per institution contributing to MARCO.

Institution            Technical Setup                               # of Images
Bristol-Myers Squibb   Formulatrix Rock Imager (FRI)                       8,719
CSIRO                  Sitting drop, FRI, Rigaku Minstrel [23, 24]        15,933
HWMRI                  Under oil, Home system [15]                        79,632
GlaxoSmithKline        Sitting drop, FRI                                  83,126
Merck                  Sitting drop, FRI                                 305,804

TABLE II. Data distribution. Final number of images in the dataset for each category after collapsing the labels and relabeling.

Label         Training   Validation
Crystals        56,672        6,632
Precipitate    212,541       23,892
Clear          148,861       16,760
Other           24,856        3,000

This approach to visual perception has been making unprecedented inroads in areas such as medical imaging [30] and computational biology [31], and has also been shown to be human-competitive on a variety of specialized visual identification tasks [32, 33]. The chosen classifier is thus well suited for the current analysis.
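The classifier's output stage, a multinomial logistic regression over the four outcome classes, maps the network's final feature vector to posterior probabilities through a softmax. A minimal sketch of that last step (the logits below are made up for illustration):

```python
import math

def softmax(logits):
    """Map raw class scores (logits) to a probability distribution."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (Crystals, Precipitate, Clear, Other).
probs = softmax([2.1, 0.3, -1.0, -0.5])
print(dict(zip(["Crystals", "Precipitate", "Clear", "Other"], probs)))
```

The four outputs sum to one, so a single acceptance threshold on any class's posterior directly controls the precision/recall trade-off discussed in Sec. IV.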

Model Architecture

The model is a variation on the widely-used Inception-v3 architecture [36], which was state of the art on the ILSVRC challenge around 2015. Several more recent alternatives were tried, including Inception-ResNet-v2 [37] and automatically generated variants of NASNet [38], but none yielded any significant improvements. An extensive hyperparameter search was also conducted using Vizier [39], also without providing significant improvement over the baseline.

The Inception-v3 architecture is a complex deep CNN architecture described in detail in Ref. [36] as well as in the reference implementation [40]. We only describe here the modifications made to tailor the model to the task at hand.

Standard Inception-v3 operates on a 299x299 square image. Because the current problem involves very detailed, thin structures, it is plausible to assume that a larger input image may yield better outcomes. We use instead 599x599 images, and compress them down to 299x299 using an additional convolutional layer at the very bottom of the network, before the layer labeled Conv2d_1a_3x3 in the reference implementation. The additional convolutional layer has a depth (number of filters) of 16, a 3×3 receptive field (it operates on a 3×3 square patch convolved over the image) and a stride of 2 (it skips over every other location in the image to reduce the dimensionality of the feature map). This modification improved classification accuracy by approximately 0.3% absolute. A few other convolutional layers were shrunk compared to the standard Inception-v3 by capping their depth as described in Table III, using the conventions from the reference implementation.

Fig 1. Conceptual Representation of a Convolutional Neural Network. A CNN is a stack of nonlinear filters (three filter levels are depicted here) that progressively reduce the spatial extent of the image, while increasing the number of filter outputs that describe the image at every location. On top of this stack sits a multinomial logistic regression classifier, which maps the representation to one probability value per output class (Crystals vs. Precipitate vs. Clear vs. Others). The entire network is jointly optimized through backpropagation [34], in general by means of a variant of stochastic gradient descent [35].
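The 599x599-to-299x299 reduction performed by the extra stride-2 layer follows from the standard convolution output-size formula; a quick check (assuming 'valid' padding, which the text does not state explicitly):

```python
def conv_output_size(input_size, kernel, stride, padding=0):
    """Spatial output size of a convolution, per dimension:
    floor((input + 2*padding - kernel) / stride) + 1."""
    return (input_size + 2 * padding - kernel) // stride + 1

# Extra layer from the text: 3x3 receptive field, stride 2, 599x599 input.
print(conv_output_size(599, kernel=3, stride=2))  # 299
```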

TABLE III. Limits applied to layer depths to reduce the model complexity. In each named layer of the deep network – here named after the conventions of the reference implementation – every convolutional subblock had its number of filters reduced to contain no more than this many outputs.

Layer          Max depth
Conv2d_4a_3x3        144
Mixed_6b             128
Mixed_6c             144
Mixed_6d             144
Mixed_6e              96
Mixed_7a              96
Mixed_7b             192
Mixed_7c             192

While these parameters are exhaustively reported here to ensure reproducibility of the results, their fine tuning is not essential to maximizing the success rate, and was mainly motivated by improving the speed of training. In the end, it was possible to train the model at a larger batch size (64 instead of 32) and still fit within the memory of an NVidia K80 GPU (see more details in the training section below). Given the large number of examples available, all dropout [41] regularizers were removed from the model definition at no cost in performance.

Data Preprocessing and Augmentation

The source data is partitioned randomly into 415,990 training images and 47,062 test images. The training data is generated dynamically by taking random 599x599 patches of the input images, and subjecting them to a wide array of photometric distortions, identical to the reference implementation:

• randomized brightness (± 32 out of 255),

• randomized saturation (from 50% to 150%),

• randomized hue (± 0.2 out of 0.5),

• randomized contrast (from 50% to 150%).

In addition, images are randomly flipped left to right with 50% probability, and, in contrast to the usual practice for natural scenes, which don’t have a vertical symmetry, they are also flipped upside down with 50% probability. Because images in this dataset have full rotational invariance, one could also consider rotations beyond the mere 90◦, 180◦, 270◦ that these flips provide, but we didn’t attempt it here, as we surmise the incremental benefits would likely be minimal for the additional computational cost. This form of aggressive data augmentation greatly improves the robustness of image classifiers, and partly alleviates the need for large quantities of human labels.
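A minimal plain-Python sketch of this augmentation scheme (random crop, both flips, and brightness as one representative photometric distortion), not the actual TensorFlow pipeline used in the paper:

```python
import random

def augment(image, crop=3):
    """Sketch of the training-time augmentation: random crop, random
    left-right and upside-down flips, and a random brightness shift.
    `image` is a list of rows of grayscale pixel values in [0, 255].
    Illustrative stand-in for the TensorFlow ops used in the paper."""
    h, w = len(image), len(image[0])
    top = random.randint(0, h - crop)
    left = random.randint(0, w - crop)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if random.random() < 0.5:                    # flip left to right
        patch = [row[::-1] for row in patch]
    if random.random() < 0.5:                    # flip upside down
        patch = patch[::-1]
    delta = random.uniform(-32, 32)              # brightness, +/-32 out of 255
    return [[min(255.0, max(0.0, p + delta)) for p in row] for row in patch]

img = [[float(10 * r + c) for c in range(6)] for r in range(6)]
out = augment(img)
```

Each training step thus sees a slightly different version of every image, which is what lets the model encounter each sample hundreds of times without simply memorizing it.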

For evaluation, no distortion is applied. The test imagesare center cropped and resized to 599x599.

Fig 2. Classifier Accuracy. Accuracy on the training and validation sets as a function of the number of steps of training. Training halts when the performance on the evaluation set no longer increases (‘early stopping’). As is typical for this type of stochastic training, performance increases rapidly at first as large training steps are taken, and slows down as the learning rate is annealed and the model fine-tunes its weights.

Training

The model is implemented in TensorFlow [42], and trained using an asynchronous distributed training setup [43] across 50 NVidia K80 GPUs. The optimizer is RmsProp [44], with a batch size of 64, a learning rate of 0.045, a momentum of 0.9, a decay of 0.9 and an epsilon of 0.1. The learning rate is decayed every two epochs by a factor of 0.94. Training completed after 1.7M steps (Fig. 2) in approximately 19 hours, having processed 100M images, which is the equivalent of 260 epochs. The model thus sees every training sample 260 times on average, with a different crop and set of distortions applied each time. The model used at test time is a running average of the training model over a short window to help stabilize the predictions.
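The staircase learning-rate schedule described above can be sketched directly; the steps-per-epoch figure is derived from the reported 1.7M steps and 260 epochs, not stated explicitly in the text:

```python
def learning_rate(step, steps_per_epoch, base_lr=0.045, decay=0.94):
    """Staircase schedule from the text: multiply the learning rate
    by 0.94 once every two epochs."""
    return base_lr * decay ** (step // (2 * steps_per_epoch))

# With 1.7M steps over ~260 epochs, one epoch is roughly 6500 steps.
steps_per_epoch = 1_700_000 // 260
print(learning_rate(0, steps_per_epoch))          # 0.045 at the start
print(learning_rate(1_700_000, steps_per_epoch))  # ~1.4e-5 after 260 epochs
```

By the early-stopping point the rate has been decayed ~130 times, which is why the accuracy curve in Fig. 2 flattens out: late steps barely move the weights.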

IV. RESULTS

Classification

The original labeling gave rise to a model with 94.2% accuracy on the test set. Relabeling improved reported classification accuracy by approximately 0.3% absolute, with the caveat that the figures are not precisely comparable since some of the test labels changed in between. The revised model thus achieves 94.5% accuracy on the test set for the four-way classification task. It overfits modestly to the training set, reaching just above 97% at the early-stopping mark of 1.7M steps. Table IV summarizes the confusions between classes. Although the classifier does not perform equally well on images from the various datasets, the standard deviation in performance from one set to another is fairly small, about 5% (see Table V), compared to the overall performance of the classifier.

TABLE IV. Confusion Matrix. Fraction of the test data that is assigned to each class based on the posterior probability assigned by the classifier. For instance, 0.8% of images labeled as Precipitate in the test set were classified as Crystals.

True                        Predictions
Label         Crystals  Precipitate   Clear   Other
Crystals         91.0%         5.8%    1.7%    1.5%
Precipitate       0.8%        96.1%    2.3%    0.7%
Clear             0.2%         1.8%   97.9%    0.2%
Other             4.8%        19.7%    5.9%   69.6%

TABLE V. Standard Deviation of the predictions across data sources. Note in particular the large variability in the consistency of the label ‘Other’ across datasets, which leads to comparatively poor selectivity of that less well-defined class.

True                        Predictions
Label         Crystals  Precipitate   Clear   Other
Crystals            5%           4%      1%      1%
Precipitate         2%           4%      1%      2%
Clear               1%           3%      5%      1%
Other               7%          15%      6%     21%
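As a sanity check, the headline accuracy can be recovered from the diagonal of Table IV (per-class recall) weighted by the validation counts of Table II:

```python
# Diagonal of Table IV (per-class recall) and Table II validation counts.
recall = {"Crystals": 0.910, "Precipitate": 0.961, "Clear": 0.979, "Other": 0.696}
counts = {"Crystals": 6632, "Precipitate": 23892, "Clear": 16760, "Other": 3000}

total = sum(counts.values())                       # 50,284 validation images
accuracy = sum(recall[c] * counts[c] for c in counts) / total
print(f"{accuracy:.1%}")  # ~94.4%, matching the reported 94.5% to rounding
```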

The classifier outputs a posterior probability for each class. By varying the acceptance threshold for a proposed classification, one can trade precision of the classification against recall. The receiver operating characteristic (ROC) curves can be seen in Fig. 3.
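Each point on such a curve comes from sweeping a single acceptance threshold over the posterior probabilities; a toy illustration (the scores and labels below are made up, not MARCO outputs):

```python
def roc_point(scores, labels, threshold):
    """True-positive and false-positive rates when accepting every
    example whose posterior probability meets the threshold."""
    tp = sum(s >= threshold and y for s, y in zip(scores, labels))
    fp = sum(s >= threshold and not y for s, y in zip(scores, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg

# Hypothetical 'Crystals' posteriors; label 1 marks a true crystal image.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    0,    1,    0]
for t in (0.25, 0.5, 0.75):
    tpr, fpr = roc_point(scores, labels, t)
    print(f"threshold {t:.2f}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Lowering the threshold recalls more true crystals at the cost of more false positives, which is exactly the trade-off quantified in Fig. 3.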

Validation

At CSIRO C3, a workflow [45] has been set up which uses a variation of the analysis tool from DeepCrystal [46] to analyze newly collected crystallisation images and to assign either no score, a ‘crystal’ score or a ‘clear’ score. A total of 37,851 images collected in Q1 2018 and assigned a human score by a C3 user were used as an independent dataset to test the MARCO tool. Within this dataset, 9746 images had been identified as containing crystals. The current DeepCrystal tool (which assigns only crystal or clear scores) was found to have an overall accuracy rate of 74%, while the MARCO tool has 90%. Although this retrospective analysis doesn’t allow for a direct comparison of the ROC, the precision, recall and accuracy of the two tools all favor the MARCO tool, as shown in Table VI. The precision achieved by MARCO on this dataset is also very similar to that seen for the CSIRO images in the training data.

TABLE VI. Validation at C3. Precision, recall and accuracy from an independent set of images collected after the MARCO tool was developed. The 38K images of sitting drop trials were collected between January 1 and March 30, 2018 on two Formulatrix Rock Imager (FRI) instruments.

DL tool       Precision  Recall  Accuracy
DeepCrystal      0.4928  0.4520    0.7391
MARCO            0.7777  0.8663    0.9018
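The three reported figures are the standard binary-classification metrics over confusion counts. The counts below are approximate values back-derived from MARCO's row in Table VI and the totals in the text (37,851 images, 9746 with crystals); they are illustrative reconstructions, not published values:

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall and accuracy from binary confusion counts."""
    total = tp + fp + fn + tn
    return tp / (tp + fp), tp / (tp + fn), (tp + tn) / total

# Approximate counts implied by Table VI for the MARCO tool (hypothetical
# reconstruction: 37,851 images total, 9746 labeled as containing crystals).
p, r, a = metrics(tp=8443, fp=2414, fn=1303, tn=25691)
print(f"precision={p:.4f} recall={r:.4f} accuracy={a:.4f}")
```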

Pixel Attribution

We visually inspect to what parts of the image the classifier learns to attend by aggregating noisy gradients of the image with respect to its label on a per-pixel basis. The SmoothGrad [47] approach is used to visualize the focus of the classifier. The images in Fig. 4 are constructed by overlaying a heat map of the classifier’s attention over a grayscale version of the input image.

Note that saliency methods are imperfect and do not in general weigh faithfully all the evidence present in an image according to their contributions to the decision, especially when the evidence is highly correlated. Although these visualizations paint a simplified and very partial picture of the classifier’s decision mechanisms, they help confirm that it is likely not picking up and overfitting to cues that are irrelevant to the task.

Inference and Availability

The model is open-sourced and available online at [48]. It can be run locally using TensorFlow or TensorFlow Lite, or as a Google Cloud Machine Learning [49] endpoint. At time of writing, inference on a standard Cloud instance takes approximately 260 ms end-to-end per standalone query. However, due to the very efficient parallelism properties of convolutional networks, latency per image can be dramatically cut down for batch requests.

V. DISCUSSION

Previous attempts at automating the analysis of crys-tallisation images have employed various pattern recog-nition and machine learning techniques, including lineardiscriminant analysis [50, 51], decision trees and randomforests [52–54], and support vector machines [20, 55]. Neu-ral networks, including self-organizing maps, have alsobeen used classify these images [17, 56], with the mostrecent involving deep learning [57]. However, all previ-ous approaches have required a consistent set of imageswith the same field of view and resolution, in order toidentify the crystallization droplet in the well [23], andthereby restrict the analysis. Various statistical, geomet-ric or textural features were then extracted, either directlyfrom the image or from some transformation of the region


Fig 3. Receiver Operating Characteristic Curves. (A) Percentage of correctly accepted detections of crystals as a function of the percentage of incorrect detections (AUC: 98.8). 98.7% of the crystal images can be recalled at the cost of less than 19% false positives. Alternatively, 94% of the crystals can be retrieved with less than 1.6% false positives. (B) Percentage of correctly accepted detections of precipitate as a function of the percentage of incorrect detections (AUC: 98.9). 99.6% of the precipitate images can be recalled at the cost of less than 25% false positives. Alternatively, 94% of the precipitates can be retrieved with less than 3.4% false positives.


Fig 4. Sample heatmaps for various types of images. (A) Crystal: the classifier focuses on some of the angular geometric features of individual crystals (arrows). (B) Precipitate: the classifier lands on the precipitate (arrows). (C) Clear: the classifier broadly samples the image, likely because this label is characterized by the absence of structures rather than their presence. Note the slightly more pronounced focus on some darker areas (circle and arrows) that could be confused for crystals or precipitate. Because the 'Others' class is defined negatively, by the image not being identifiable as belonging to any of the other three classes, heatmaps for images of that class are not particularly informative.

of interest, to be used as variables in the classification algorithms.

The results from various studies can be difficult to compare head-to-head because different groups present confusion matrices with the number of classes ranging from 2 to 11, only sometimes aggregating results for crystals/crystalline materials. There is also a tradeoff between the number of false negatives and the number of false positives. Yet most report classification rates for crystals of around 80–85%, even in more recent work [8, 54, 58] in which missed crystals are reported at much lower rates. This advance comes at the expense of more false positives. For example, Pan et al. report just under 3% false negatives, but almost 38% false positives [55].
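The false-negative/false-positive tradeoff behind these comparisons amounts to choosing an operating point on a score threshold. A minimal sketch (the function name and the example scores are hypothetical, for illustration only):

```python
import numpy as np

def fp_fn_rates(scores, labels, threshold):
    """False-positive and false-negative rates at a given score threshold.
    Lowering the threshold recovers more crystals (fewer false negatives)
    at the cost of more false positives, and vice versa."""
    predicted = scores >= threshold
    positives = labels.astype(bool)
    fn_rate = np.mean(~predicted[positives])   # missed crystals
    fp_rate = np.mean(predicted[~positives])   # non-crystals flagged as crystals
    return fp_rate, fn_rate

# Hypothetical classifier scores: true crystals (label 1) tend to score higher.
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.4, 0.5, 0.3, 0.2, 0.1])
loose = fp_fn_rates(scores, labels, 0.35)   # permissive: no misses, more FPs
strict = fp_fn_rates(scores, labels, 0.75)  # conservative: no FPs, more misses
```

Sweeping the threshold over all values traces out exactly the ROC curves of Fig. 3, from which operating points such as "94% recall at 1.6% false positives" are read off.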

As the trained algorithms are specific to a set of images, they are also restricted to a particular type of crystallisation experiment. Prior to the curation of the current dataset, the largest set of images (by far) came from the Hauptman-Woodward Medical Research Institute HTCC [15]. This dataset, which contains 147,456


images from 96 different proteins but is limited to experiments with the microbatch-under-oil technique, has been used in a number of studies [57, 59]. Most notably, Yann et al. used a deep convolutional neural network that automatically extracted features, and reported correct classification rates as high as 97% for crystals and 96% for non-crystals. Although impressive, these results were obtained from a curated subset of 85,188 clean images, i.e., images with class labels on which several human experts agreed [57]. In order to validate our approach, we retrained our model to perform the same 10-way classification on that subset of the data alone, without any tuning of the model's hyperparameters, and achieved 94.7% accuracy, compared to the reported 90.8%.

In this context, the current results are especially remarkable. A crystallographer can classify images of experiments independently of the systems used to create those images. They can view an experiment with a microscope or look at a computer image and reach similar conclusions. They can look at a vapor diffusion experiment or a microbatch-under-oil setup and, again, assess either with confidence. Here, we show that this can be accomplished equally well, if not better, using deep CNNs. A benchtop researcher can classify many images, especially if they relate to a project that has been years in the making. For high-throughput approaches, however, that task becomes challenging. The strength of computational approaches is that each image is treated like the previous one, with no fatigue: classification of 10,000 images is as consistent as classification of one. This advance opens the door to complete classification of all results in a high-throughput setting and to data mining of repositories of past image data.

Another remarkable aspect of our results is that they leverage a very generic computer vision architecture originally designed for a different classification problem – the categorization of natural images – with very distinct characteristics. For instance, one can presume that the global geometric relationships between object parts would play a greater role in identifying a car or a dog in an image, compared to the very local, texture-like features involved in recognizing crystal-like structures. Yet no particular specialization of the model was required to adapt it to the widely differing visual appearances of the samples originating from different imagers. This convergence of approaches toward a unified perception architecture across a wide range of computer vision problems has been a common theme in recent years, further suggesting that the technology is now ready for wide adoption for any human-mediated visual recognition task.

VI. CONCLUSION

In this work, we have collated biomolecular crystallization images for nearly half a million experiments across a large range of conditions, and trained a CNN on the labels of these images. Remarkably, the resulting machine-learning scheme was able to recapitulate the labels of more than 94% of a test set. Such accuracy has rarely been obtained, and has no equal for an uncurated dataset. The analysis also identified a small subset of problematic images, which upon reconsideration revealed a high level of label discrepancy. This variability inherent to human labeling highlights one of the main benefits of automatic scoring. Such accuracy also makes high-density screening conceivable.

Enhancing the imaging capabilities by including UV or SONICC results, for instance, could certainly enrich the model. But several research avenues could also be pursued without additional laboratory equipment. In particular, it should be possible to leverage side information that is currently not being used.

• The four-way classification scheme used here is a distillation of the 38 categories present in the source data. While these categories are presumed to be somewhat inconsistent across datasets, they could potentially provide an additional supervision signal.

• Because one goal of this classifier is to be able to generalize across datasets, it would be worthwhile to investigate the contribution of techniques that have been specifically designed to reduce the effect of domain shift across data sources on classification outcomes [60, 61].

• Each crystallization experiment records a series of images taken over time. Using this timecourse information could enhance the success rate of the classifier [62].

Note in closing that the current study focused on crystallization as an outcome, which is but a small fraction of the protein solubility diagram. Patterns of precipitation, phase separation, and clear drops also provide information as to whether and where crystallization might occur. The success in identifying crystals, precipitate and clear drops can thus also be used to accurately chart the crystallization regimes and to identify pathways for optimization [59, 63, 64]. The application of this approach to large libraries of historical data may therefore reveal patterns that guide future crystallization strategies, including novel chemical screens and mutagenesis programs.

ACKNOWLEDGMENTS

We acknowledge discussions at various stages of this project with I. Altan, S. Bowman, R. Dorich, D. Fusco, E. Gualtieri, R. Judge, A. Narayanaswamy, J. Noah-Vanhoucke, P. Orth, M. Pokross, X. Qiu, P. F. Riley, V. Shanmugasundaram, B. Sherborne and F. von Delft. PC acknowledges support from National Science Foundation Grant no. NSF DMR-1749374.


[1] S. Harrison, B. Lahue, Z. Peng, A. Donofrio, C. Chang, and M. Glick, "Extending predict first to the design-make-test cycle in small-molecule drug discovery," Future Med. Chem. 9, 533–536 (2017).

[2] Alexander McPherson, Crystallization of Biological Macromolecules (CSHL Press, Cold Spring Harbor, 1999) p. 586.

[3] Naomi E. Chayen, "Turning protein crystallisation from an art into a science," Curr. Opin. Struct. Biol. 14, 577–583 (2004).

[4] Diana Fusco and Patrick Charbonneau, "Soft matter perspective on protein crystal assembly," Colloids Surf. B: Biointerfaces 137, 22–31 (2016).

[5] J.T. Ng, C. Dekker, P. Reardon, and F. von Delft, "Lessons from ten years of crystallization experiments at the SGC," Acta Cryst. D 72, 224–235 (2016).

[6] V.J. Fazio, T.S. Peat, and J. Newman, "Lessons for the future," Methods Mol. Biol. 1261, 141–156 (2015).

[7] Janet Newman, Evan E. Bolton, Jochen Muller-Dieckmann, Vincent J. Fazio, D. Travis Gallagher, David Lovell, Joseph R. Luft, Thomas S. Peat, David Ratcliffe, Roger A. Sayle, Edward H. Snell, Kerry Taylor, Pascal Vallotton, Sameer Velanker, and Frank von Delft, "On the need for an international effort to capture, share and use crystallization screening data," Acta Cryst. F 68, 253–258 (2012).

[8] Yulia Kotseruba, Christian A. Cumbaa, and Igor Jurisica, "High-throughput protein crystallization on the World Community Grid and the GPU," J. Phys. Conf. Ser. 341, 012027 (2012).

[9] Janet Newman, "One plate, two plates, a thousand plates. How crystallisation changes with large numbers of samples," Methods 55, 73–80 (2011).

[10] Shuheng Zhang, Charline J.J. Gerard, Aziza Ikni, Gilles Ferry, Laurent M. Vuillard, Jean A. Boutin, Nathalie Ferte, Romain Grossier, Nadine Candoni, and Stéphane Veesler, "Microfluidic platform for optimization of crystallization conditions," J. Cryst. Growth 472, 18–28 (2017), Industrial Crystallization and Precipitation in France (CRISTAL-8), May 2016, Rouen (France).

[11] Y. Thielmann, J. Koepke, and H. Michel, "The ESFRI Instruct Core Centre Frankfurt: Automated high-throughput crystallization suited for membrane proteins and more," J. Struct. Funct. Genomics 13, 63–69 (2012).

[12] Edward H. Snell, Angela M. Lauricella, Stephen A. Potter, Joseph R. Luft, Stacey M. Gulde, Robert J. Collins, Geoff Franks, Michael G. Malkowski, Christian Cumbaa, Igor Jurisica, and George T. DeTitta, "Establishing a training set through the visual analysis of crystallization trials. Part II: Crystal examples," Acta Cryst. D 64, 1131–1137 (2008).

[13] This estimate is based on personal communication with five experienced crystallographers at GlaxoSmithKline: 2 seconds/observation × 8 observations × 96 wells. Note that current technology can automatically store and image plates at about 3 min/plate.

[14] Julie Wilson, "Automated classification of images from crystallisation experiments," in Advances in Data Mining. Applications in Medicine, Web Mining, Marketing, Image and Signal Mining, edited by Petra Perner (Springer Berlin Heidelberg) pp. 459–473.

[15] Edward H. Snell, Joseph R. Luft, Stephen A. Potter, Angela M. Lauricella, Stacey M. Gulde, Michael G. Malkowski, Mary Koszelak-Rosenblum, Meriem I. Said, Jennifer L. Smith, Christina K. Veatch, Robert J. Collins, Geoff Franks, Max Thayer, Christian Cumbaa, Igor Jurisica, and George T. DeTitta, "Establishing a training set through the visual analysis of crystallization trials. Part I: 150 000 images," Acta Cryst. D 64, 1123–1130 (2008).

[16] David Hargreaves, Personal communication.

[17] Glen Spraggon, Scott A. Lesley, Andreas Kreusch, and John P. Priestle, "Computational analysis of crystallization trials," Acta Cryst. D 58, 1915–1923 (2002).

[18] Christian Cumbaa and Igor Jurisica, "Automatic classification and pattern discovery in high-throughput protein crystallization trials," J. Struct. Funct. Genomics 6, 195–202 (2005).

[19] Kuniaki Kawabata, Kanako Saitoh, Mutsunori Takahashi, Hajime Asama, Taketoshi Mishima, Mitsuaki Sugahara, and Masashi Miyano, "Evaluation of protein crystallization state by sequential image classification," Sensor Rev. 28, 242–247 (2008).

[20] Samarasena Buchala and Julie C. Wilson, "Improved classification of crystallization images using data fusion and multiple classifiers," Acta Cryst. D 64, 823–833 (2008).

[21] Martyn D. Winn, Charles C. Ballard, Kevin D. Cowtan, Eleanor J. Dodson, Paul Emsley, Phil R. Evans, Ronan M. Keegan, Eugene B. Krissinel, Andrew G. W. Leslie, Airlie McCoy, Stuart J. McNicholas, Garib N. Murshudov, Navraj S. Pannu, Elizabeth A. Potterton, Harold R. Powell, Randy J. Read, Alexei Vagin, and Keith S. Wilson, "Overview of the CCP4 suite and current developments," Acta Cryst. D 67, 235–242 (2011).

[22] "MAchine Recognition of Crystallization Outcomes (MARCO)," (2017), https://marco.ccr.buffalo.edu [Online; accessed 17-March-2018].

[23] Pascal Vallotton, Changming Sun, David Lovell, Vincent J. Fazio, and Janet Newman, "Droplit, an improved image analysis method for droplet identification in high-throughput crystallization trials," J. Appl. Crystallogr. 43, 1548–1552 (2010).

[24] N. Rosa, M. Ristic, B. Marshall, and J. Newman, "Keeping crystallographers app-y," Acta Cryst. F, submitted.

[25] Katarina Mele, Rongxin Li, Vincent J. Fazio, and Janet Newman, "Quantifying the quality of the experiments used to grow protein crystals: the iQC suite," J. Appl. Cryst. 47, 1097–1106 (2014).

[26] Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation 1, 541–551 (1989).

[27] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, "Deep learning," Nature 521, 436 (2015).

[28] Waseem Rawat and Zenghui Wang, "Deep convolutional neural networks for image classification: A comprehensive review," Neural Comput. 29, 2352–2449 (2017).

[29] A. Berg, J. Deng, and L. Fei-Fei, "Large scale visual recognition challenge (ILSVRC)," (2010), http://www.image-net.org/challenges/LSVRC.


[30] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A.W.M. van der Laak, Bram van Ginneken, and Clara I. Sanchez, "A survey on deep learning in medical image analysis," Med. Image Anal. 42, 60–88 (2017).

[31] Christof Angermueller, Tanel Parnamaa, Leopold Parts, and Oliver Stegle, "Deep learning for computational biology," Mol. Syst. Biol. 12, 878 (2016).

[32] Jonathan Krause, Varun Gulshan, Ehsan Rahimy, Peter Karth, Kasumi Widner, Greg S. Corrado, Lily Peng, and Dale R. Webster, "Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy," arXiv:1710.01711 [cs.CV] (preprint) (2017).

[33] Yun Liu, Krishna Gadepalli, Mohammad Norouzi, George E. Dahl, Timo Kohlberger, Aleksey Boyko, Subhashini Venugopalan, Aleksei Timofeev, Philip Q. Nelson, Greg S. Corrado, et al., "Detecting cancer metastases on gigapixel pathology images," arXiv:1703.02442 [cs.CV] (preprint) (2017).

[34] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams, "Learning representations by back-propagating errors," Nature 323, 533 (1986).

[35] Leon Bottou, "Large-scale machine learning with stochastic gradient descent," in Proceedings of COMPSTAT'2010 (Springer, 2010) pp. 177–186.

[36] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) pp. 2818–2826.

[37] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," arXiv:1602.07261 [cs.CV] (preprint) (2017).

[38] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le, "Learning transferable architectures for scalable image recognition," arXiv:1707.07012 [cs.CV] (preprint) (2017).

[39] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D. Sculley, "Google Vizier: A service for black-box optimization," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017) pp. 1487–1495.

[40] Nathan Silberman and Sergio Guadarrama, "TensorFlow-Slim image classification model library," (2017), https://github.com/tensorflow/models/tree/master/research/slim [Online; accessed 02-February-2018].

[41] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res. 15, 1929–1958 (2014).

[42] Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," (2015).

[43] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, et al., "Large scale distributed deep networks," in Advances in Neural Information Processing Systems (2012) pp. 1223–1231.

[44] Tijmen Tieleman and Geoffrey Hinton, "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning 4, 26–31 (2012).

[45] Chris Watkins, "C4, C3 Classifier Pipeline. v1. CSIRO. Software Collection." (2018), https://doi.org/10.4225/08/5a97375e6c0aa [Online; accessed 09-May-2018].

[46] "DeepCrystal," (2017), http://www.deepcrystal.com [Online; accessed 09-May-2018].

[47] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viegas, and Martin Wattenberg, "SmoothGrad: Removing noise by adding noise," arXiv:1706.03825 [cs.LG] (preprint) (2017).

[48] Vincent Vanhoucke, "MARCO repository in TensorFlow Models," (2018), http://github.com/tensorflow/models/tree/master/research/marco [Online; accessed 01-May-2018].

[49] "Google Cloud Machine Learning Engine," (2018), [Online; accessed 02-February-2018].

[50] Christian A. Cumbaa, Angela Lauricella, Nancy Fehrman, Christina Veatch, Robert Collins, Joe Luft, George DeTitta, and Igor Jurisica, "Automatic classification of sub-microlitre protein-crystallization trials in 1536-well plates," Acta Cryst. D 59, 1619–1627 (2003).

[51] Kanako Saitoh, Kuniaki Kawabata, Hajime Asama, Taketoshi Mishima, Mitsuaki Sugahara, and Masashi Miyano, "Evaluation of protein crystallization states based on texture information derived from greyscale images," Acta Cryst. D 61, 873–880 (2005).

[52] Marshall Bern, David Goldberg, Raymond C. Stevens, and Peter Kuhn, "Automatic classification of protein crystallization images using a curve-tracking algorithm," J. Appl. Cryst. 37, 279–287 (2004).

[53] R. Liu, Y. Freund, and G. Spraggon, "Image-based crystal detection: a machine-learning approach," Acta Cryst. D 64, 1187–95 (2008).

[54] Christian A. Cumbaa and Igor Jurisica, "Protein crystallization analysis on the World Community Grid," J. Struct. Funct. Genomics 11, 61–69 (2010).

[55] Shen Pan, Gidon Shavit, Marta Penas-Centeno, Dong-Hui Xu, Linda Shapiro, Richard Ladner, Eve Riskin, Wim Hol, and Deirdre Meldrum, "Automated classification of protein crystallization images using support vector machines with scale-invariant texture and Gabor features," Acta Cryst. D 62, 271–279 (2006).

[56] M.J. Po and A.F. Laine, "Leveraging genetic algorithm and neural network in automated protein crystal recognition," in Proceedings of the 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS'08 - "Personalized Healthcare through Technology" (2008) pp. 1926–1929.

[57] Margot Lisa-Jing Yann and Yichuan Tang, "Learning deep convolutional neural networks for X-ray protein crystallization image analysis," in Thirtieth AAAI Conference on Artificial Intelligence (2016).


[58] Jeffrey Hung, John Collins, Mehari Weldetsion, Oliver Newland, Eric Chiang, Steve Guerrero, and Kazunori Okada, "Protein crystallization image classification with elastic net," in SPIE Medical Imaging, Vol. 9034 (SPIE) p. 14.

[59] Diana Fusco, Timothy J. Barnum, Andrew E. Bruno, Joseph R. Luft, Edward H. Snell, Sayan Mukherjee, and Patrick Charbonneau, "Statistical analysis of crystallization database links protein physico-chemical features with crystallization mechanisms," PLoS ONE 9, e101123 (2014).

[60] Yaroslav Ganin and Victor Lempitsky, "Unsupervised domain adaptation by backpropagation," in International Conference on Machine Learning (2015) pp. 1180–1189.

[61] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan, "Domain separation networks," in Advances in Neural Information Processing Systems (2016) pp. 343–351.

[62] Katarina Mele, B. M. Thamali Lekamge, Vincent J. Fazio, and Janet Newman, "Using time courses to enrich the information obtained from images of crystallization trials," Cryst. Growth Des. 14, 261–269 (2014).

[63] Edward H. Snell, Ray M. Nagel, Ann Wojtaszcyk, Hugh O'Neill, Jennifer L. Wolfley, and Joseph R. Luft, "The application and use of chemical space mapping to interpret crystallization screening results," Acta Cryst. D 64, 1240–1249 (2008).

[64] Irem Altan, Patrick Charbonneau, and Edward H. Snell, "Computational crystallization," Arch. Biochem. Biophys. 602, 12–20 (2016).