d6 - media.romi-project.eu

Deliverable

n°

D6.3

Deliverable title: D6.3 - Results of trained models

Tasks

WP6 - Plant modelling

T6.1 Modelling.

T6.2 Extraction of plant parameters using neural networks.

T6.3 Models of plant status for crop monitoring.

Task Leader INRIA Planned Date 01.10.2020

Effective Date 01.10.2020

Written by Reviewed and approved by Authorized by

Name / Surname

Timothée Wintz, Aliénor Lahlou & Peter Hanappe

(SONY)

Christophe Godin (INRIA), Fabrice Besnard (CNRS), David Colliaux

(SONY)

Romain Azais (INRIA)

1. Executive Summary 2

1.1 Summary of deliverable content and initial objectives 2

1.2 Partners involved 2

1.3 Relation with other work packages and tasks 2

1.4 Dissemination / IPR policy 3

2. Demonstrator 3

3. Results of trained models 3

3.1 Introduction 3

3.1 Related Work 4

3.3 Material and Methods 6

3.3.1 Image acquisition and dataset of real images 7

3.3.2 Virtual plants and artificial dataset generation 9

3.3.3 Semantic segmentation of images 11

3.3.4 Training from simulated data 12

1

3.3.5 Volume carving 14

3.3.6 Level Set method 14

3.3.7 Back-projection 15

3.3.8 Fine-tuning 15

3.3.9 Evaluation metric 15

3.3.10 Software integration 16

3.4 Results and Discussions 16

3.4.1 Segmentation of virtual images 17

3.4.2 3D reconstruction of virtual plants 18

3.4.3 Segmentation of real plants images, CNN trained on virtual plants 19

3.4.4 3D reconstruction of real plants 20

3.4.5 Transfer to other species 22

3.5 Conclusion and perspectives 23

References 24

1. Executive Summary

1.1 Summary of deliverable content and initial objectives

This report relates to the work carried out in WP6. In particular, it discusses the results that we obtained

to extract segmented, 3D plant representations from 2D images of real plants using a technique that

combines machine learning with the virtual plant models developed in Task 6.1.

The text covers only a small part of the work performed in WP6. Please refer to the Progress Report to

have an overview of all the work-in-progress in WP6.

1.2 Partners involved

Leader: INRIA

Participants: CNRS, SONY

1.3 Relation with other work packages and tasks

Relation to WP5: In WP6, we develop the high-level software components that use the virtual plant

models and that rely on machine learning. Then, a close collaboration between WP6 and WP5 is

ROMI - D6.3 - Results of trained models 2

required to integrate these software components into the image processing pipelines and the data

storage infrastructure that are put in place in WP5.

Relation to WP7: The software components are tested on real plant data that are collected as part of

WP7.

1.4 Dissemination / IPR policy

The source code of the project is available under an Open Source license at

https://github.com/romi/romiscan.

Additional user documentation can be found at https://docs.romi-project.eu/Scanner/ and

https://github.com/romi/romiseg/tree/master

The work was part of a presentation to industry insiders at the International Forum for Agricultural

Robotics (FIRA 2019): https://www.youtube.com/watch?v=C2UJTtwS2QM

The work presented in this document will be completed to prepare a scientific publication.

2. Demonstrator

A short video will be provided to illustrate the work presented in this deliverable.

3. Results of trained models

3.1 Introduction

In the presented work, we seek to automatically obtain the geometrical structure of the inflorescence

stem of Arabidopsis thaliana in three dimensions (3D) using an RGB camera. This 3D-reconstruction is a

required step for the study of phyllotaxis, the geometric arrangement of organs around the stem, using

high-throughput methods [13]. This problem is also addressed in WP5 using geometrical techniques.

Here, we show how we can take advantage of the plant models developed in WP6 to speed up the

training of convolutional neural networks (CNN) to segment images of A. thaliana into its constituent

organs. These segmented images are then used to reconstruct the plants and its individual organs in 3D.

Advances in machine learning, and in particular the use of deep convolutional neural networks, offer

new opportunities for pixel-by-pixel semantic segmentation of images. However, the main bottleneck to

enable the power of CNNs for segmentation tasks is the need for annotated data sets to train the neural

nets. Data annotation is a tedious and time-consuming process. This is a constraint for small scale

phenotyping problems and also limits the adaption of machine learning techniques for economically less


https://github.com/romi/romiscan

https://docs.romi-project.eu/Scanner/

https://github.com/romi/romiseg/tree/master

https://www.youtube.com/watch?v=C2UJTtwS2QM

important crops. Two solutions to this problem are often suggested. In the first approach, generative

computer models are used to feed the machine learning algorithm with artificial data. In the second

approach, an existing CNN that is trained with images from one domain is fine-tuned with a small set of

images of a different target domain.

In this work, we look at both of these approaches. We propose to use virtual models of A. thaliana to

generate datasets of 2D images (see deliverable D6.1). These synthetic images are used to train a

convolutional neural network for the identification of the organs along the inflorescence stem. The CNN

is then applied to segment a set of real plant images, taken from different viewpoints. Finally, the

segmented images are used to reconstruct the three dimensional structure of the inflorescence. We also

validate whether the combined segmentation and 3D reconstruction reduces the errors as previously

reported by [41].

We also evaluate the efficiency of fine-tuning. We fine-tune the segmentation network, trained on

virtual A thaliana plants, with a small number of images of young tomato plants. We evaluate whether

the tuned CNN can successfully segment tomato plant images. This approach is useful to speed up the

adoption of machine learning techniques for a species for which no virtual plant model is available but

that has similar anatomical traits as a species that has been modeled.

Finally, we use the segmented images and a technique called volume carving that reconstructs the 3D

shape of the plant to obtain a fully segmented 3D representation of the plant’s organs.

To our knowledge, our work provides a unique attempt to gather state-of-the-art techniques of

machine-learning for plant phenotyping into a single pipeline, from raw images to a 3D reconstruction

segmented with high precision.

3.1 Related Work

The problem we address relates to the creation of 3D representation of individual plants in which all the

constitutive organs have a distinct label. The main fields of application are phenotyping for phenomics

and crop improvement [48, 30], in-field monitoring of highly valuable crops [16, 8], and precision

agriculture [2, 7].

A common approach to this problem is to first obtain a 3D representation such as a point cloud or a

mesh structure and then subsequently segment this 3D structure into organs. There are several

methods to obtain the initial 3D data, including the use of LiDAR [27, 9, 47, 46], time-of-flight cameras

[1, 4, 50, 11], or structured light [28, 26]. The 3D reconstruction can also be obtained from multi-view 2D

images. One such method, structure-from-motion, identifies the key-points from one image to another

to reconstruct the plant in 3D [44]. Another 2D-based method is volume carving that uses the projected 1

contours of the object to carve away voxels and obtain the visual hull of the object [21]. In the field of

phenotyping, volume carving has been discussed by several authors, including [36,41,54,55]. Once the

1 Alternative names for volume carving are space carving and voxel carving.


3D structure has been obtained, in general a 3D point cloud, the segmentation can be done using

geometric and algebraic methods [25] or, more recently, machine learning methods [52].

An alternative approach to the segmentation of the individual organs in the 3D representation is to first

segment 2D views of the plant and then project the segmented images into 3D using the camera

positions [41]. It is the approach that we have chosen to evaluate in this work.

Machine learning and computer vision neural-networks offer powerful toolbox for the 2D pixel-wise

image segmentation that is needed in this approach [19]. Several recent works have suggested their use

in phenotyping and precision agriculture, including [5, 51, 49, 41, 2]. A key obstacle in the training of

neural networks for this task is the need for annotated datasets. The datasets are mostly annotated

manually, which is time consuming (5 to 30 min./image). However, recently, several authors have

reported on the successful use of artificially generated images using virtual plant models to train the

CNNs [5, 51, 49, 2, 7].

Once the organs have been detected in each image, the set of 2D views have to be projected in 3D to

obtain the point cloud volumes of the organs. The visual hull based techniques described previously are

an efficient method for this task. In the case of multi-label images, an additional back-projection and

voting mechanism is introduced to decide the label of each point in the cloud [41].

The work presented here includes topics that are active fields of research in the phenotyping

community, including the use of synthetic images to train neural networks, the 3D reconstruction of the

plant’s shape from 2D images, or the segmentation of the 3D representation into the plant’s organs (see

Table 1). However, to our knowledge, the work presented here is unique in that it combines these

components into a single pipeline. We are thus in a position to evaluate the process from the generation

of synthetic images up until the 3D segmentation of the real plant images.

Our contributions can be summarized as follows:

● We show that convolutional neural networks trained only with synthetic images of plants can

successfully segment images of real plants.

● We successfully apply the method to reconstruct the inflorescence stem of A. thaliana in 3D and

detect all the individual organs. The inflorescence stem of A. thaliana is arguably a challenging

object of study due to its thin stem and fruits.

● We offer further proof that fine-tuning a CNN using a small dataset accelerates the use of

semantic segmentation on other plants.

Cicco, 2017 [5]

Ward, 2018 [51]

Ubbens, 2018 [49]

Shi, 2019 [41] Barth, 2019 [2]

ROMI

Object of study Sugar beets and weed

Rosetta of A. thaliana

Rosetta of A. thaliana

Tomato seedlings

Population of sweet peppers plants

Inflorescence stem of A. thaliana


Generation of synthetic images

Yes Yes Yes No Yes Yes

Training on synthetic images

Yes Yes, to augment real images

Yes, to augment real images

No Yes Yes

Training on (additional) real images

With and without

Yes Yes Yes Yes, a small set

No

3D reconstruction

No No No Yes No Yes

3D segmentation

No No No (leaf detection in 2D images)

No No Yes

Table 1: An comparison of the most relevant papers that use synthetic images to train neural networks for plant phenotyping and precision agriculture. For a more complete discussion of papers that are relevant to the presented work, please see above.

3.3 Material and Methods

In the following sections, we present the different steps of the pipeline. Figure 1 provides a schematic

overview on how the different components interact.


Figure 1: A presentation of the different steps of the final 3D segmentation pipeline along with its defining parameters.

The main pipeline is indicated by the arrows on the top. The arrows on the bottom show the information that is

provided by the Colmap application: the precise camera position and the camera matrix (see the discussion in 3.3.1).

3.3.1 Image acquisition and dataset of real images

Images of A. thaliana

We used a Sony RX0 RGB camera with a resolution of 1920 × 1060 pixels. The camera was fixed on a one

degree-of-freedom camera mount to control the panning. The images were retrieved through the

camera’s WiFi interface. The camera mount was attached on a CNC Cartesian arm, more specifically the

X-Carve [17].

To perform the volume carving and back-projection steps, discussed below, we must know the exact

position and orientation of the camera for each image, as well as the intrinsics of the camera model

(incorporated in the so-called camera matrix). It is challenging to perfectly calibrate a camera arm in real

world scenarios. We thus obtained the camera poses and matrix using a structure-from-motion

algorithm – in this work, we used the open source software Colmap [39, 38].

If I is the set of views taken of a given scene, we thus have the following projection models that include

both the intrinsic camera parameters and camera pose estimation: πi : ℝ3 → ℝ2, i∈ I, such that for any

world coordinates x ∈ ℝ3, πi(x) is the pixel coordinate of point x in view i.


Using the set-up above, we obtained 864 images of A. thaliana (12 different individuals) (see also

Deliverable D5.3 and WP7 - Data acquisition and real-world application testing). From this dataset, 6

images of 4 different individuals were manually annotated to evaluate the results of the experiments

(see 3.4.3). Some sample images from this dataset are shown in Figure 2.

Figure 2: Sample images from the A. thaliana dataset used in this study.

Images of a young tomato plant

An additional set of real images from a young tomato plant was obtained to test the fine-tuning method.

We took a video turning around a tomato plant and sampled 41 images equally distributed along the

video to get images from different viewpoints. Two images from this dataset were manually annotated

with our interface (see 3.3.8). We included only two classes: stem and leaf.


Figure 3: Two sample images from the dataset of the young tomato plant.

3.3.2 Virtual plants and artificial dataset generation


Figure 4: An example of simulated Arabidopsis thaliana plants at different growth stages.

We used a set of 3D mesh models of A. thaliana to generate the artificial dataset. The generated plant

models were exported as meshes from LPy [31] as described in Deliverable D6.1. We use the standard

OBJ format. The 5 different types of organs in the mesh (fruit, stem, pedicel, leaf and flower) each have

an associated “material” identifier that is used to apply colors or texture by the rendering engine. Each

organ class was attributed a distinct material, used below.

The open source software Blender was chosen to render the virtual plants because the Python bpy 2

module that comes with this application allows for easy scripting. Blender’s Eevee renderer was used

because of its fast and realistic looking rendering of scenes. To further increase the diversity of the

simulated dataset and prevent over-fitting on the chosen plant textures, a base color of the plant

material was randomly chosen from a color distribution derived from an image of a tree canopy.

For the background, we rendered the plants in a 360∘ High Dynamic Range Image (HDRI) of a real world

scene. Rendering a HDRI in blender reproduces the original scene lighting. By analogy with the image

acquisition set-up used for real plants (see 3.3.1 above and deliverable D5.3), we use the term “virtual

scanner” for the software module that generates a set of 2D images of the virtual plants in this artificial

background scene as if the camera were turning around the plant. In order to increase the complexity of

the images acquired with this virtual scanner, several sets of backgrounds were used, without limitations

on the lighting environment – night, daylight, sunset – or the type of scene – indoor or outdoor. In both

2 https://www.blender.org/


cases, a simulated flash light is triggered with a random level of intensity to simulate different artificial

lighting conditions during the acquisition of real life pictures.

The ground truth for the plant part segmentation is acquired by rendering each organ class separately

by making the materials of the other classes transparent (Figure 4).

Figure 5: Example of a virtual training image (top-left) and the masks associated with all the labels

(from left-to-right: background, stem, leaves, fruits, and pedicels).

A dataset of 2520 virtual plant images was generated using 18 views of 140 different plant models

(896x896 pixels), each with plant part segmentation. The distance and position of the camera around

the plant varied to offer more diversity.

3.3.3 Semantic segmentation of images


The objective of semantic image segmentation is to convert an input image into a “segmented” image

where each pixel is given a label according to the class it belongs to. In our case, the input is a plant

image, and, for each pixel in the image, the segmentation network labels the pixel as belonging to the

fruit, stem, pedicel, leaf, flower, or background classes. More generally, semantic image segmentation

converts the input image into a stack of probability maps of the same dimension as the image. If C is the

number of different classes (organs plus background), semantic segmentation produces C output images

with pixels having a value between 0 and 1. The pixel value in output image i is equal to the probability

that the corresponding pixel in the input image belongs to class i.

To produce such semantic probability maps, we used a segmentation convolutional neural network [12].

These segmentation networks are based on a contracting structure from an image to a low dimensional

feature space, called the latent space, which encodes the content of the image. It is connected to a

symmetric expanding structure that translates the information from the latent space to an image similar

to the original one, but with only the content of interest reconstructed. The contracting structure is

directly inspired from classification neural networks, and made of convolution layers, non-linearities,

and down-pooling layers. The convolution kernels allow to filter the information to emphasize the

content, and the down-pooling allows to reduce the dimension of the information. The expanding

structure reproduces the path in the other direction, with up-pooling layers and convolutions. The

spatial information is re-injected to reconstruct the image properly, by providing the down-pooling

coordinates from the contracting phase.

We decided to use a neural network architecture inspired by U-Net [33]. However, to leverage the

power of already existing labeled datasets, the contracting structure, or encoder, was replaced by a

classification network trained on ImageNet, and with six classes to segment. Thanks to the great

diversity in the ImageNet dataset, this classification structure has already learned to encode the

semantic representation of an image in the latent space. The classification network that we used is

ResNet [14] which is a deep neural network where inputs from previous layers are regularly re-injected

into deeper layers in order to maintain the geometry and avoid vanishing gradients [15].

3.3.4 Training from simulated data

The network was trained and tested with the 2520 images (140 individual x 18 images) from the virtual

scanner. To make the network robust to images and lighting conditions that are not in the dataset, we

artificially augment the dataset by adding uncorrelated Gaussian noise to each of the RGB color channels

(σ = 0.01, R’=R+eR, G’=G+eG, …) and random rotations (Figure 4). The images were normalized to

correspond to the mean and variance of the training set of ImageNet, on which was initially trained the

semantic structure (mean(R,G,B) = (0.485, 0.456, 0.406), standard-deviation(R,G,B) =

(0.229,0.224,0.225)). The dataset was split into 3 sets:

● a training set to train the network (70% of the dataset)

● a validation set on which the network is not trained and is used to evaluate and compare the

different networks (7% of the dataset)


● a test set used at the very end on the selected architecture (23% of the dataset)

The selection of the images was done on the individuals: 70% of the plants were selected for testing and

their images added to the training set. Using this procedure we introduce backgrounds in the testing

datasets that have not been encountered by the neural network during the training.

Figure 6: A sample of the training set (before normalization).

To train the network for the segmentation, a combination of metrics were used in the loss function. As it

is a multi-class problem, a sigmoid is applied to each output in order to contain the numerical range of

the predictions. First, cross-entropy was used as a per-pixel metric. It represents the uncertainty of the

prediction compared to the ground-truth. If the class of a pixel has a very low predicted probability, it

will highly penalise the loss. If the probability is close to one, the contribution to the loss is close to zero.

The notion of entropy represents this discrepancy between the ground truth distribution and the


predicted distribution. It is naturally translated at the mathematical level with the negative of the

logarithm.

(1)

Where n is the number of pixels in the image, C the number of possible classes, ygtk = 1 if pixel i is in

class k, 0 otherwise, and yi,predk = p(label(yi,gt) = k).

Second, the Dice coefficient was used as a another loss metric, which theoretically writes as:

(2)

This coefficient compares the number of right predictions to the number of samples, for each class. In

practice, the product of the ground truth by the predictions is summed, so that only the prediction of

the right class contributes to the sum. For example for point n, the class is k, and the prediction for class

k is pk, the loss for point n will be 1 - pk. The total loss is computed vector-wise, and lies between 0 and 1.

We used the mean of these two losses for the training: cross-entropy has more stable gradients while

Dice loss represents what should be minimised and is less sensitive to class imbalance than

cross-entropy.

3.3.5 Volume carving

Volume carving is a photogrammetric approach which uses pictures of an object from different points of

view to reconstruct an object in 3D as a binary function of space [21, 10, 36, 54, 55]. Space is divided

into a regular voxel grid with a voxel size chosen such that the size of the projection of each voxel is of

the same order of magnitude of a pixel in each of the pictures.

The input of the volume carving algorithm is a set of binary background images Bi produced by the

segmentation algorithm. The value of a pixel in the mask indicates whether the pixel belongs to an organ

class or to the background. Since volume carving is very sensitive to false positives – pixels labeled as

background when they are in fact projections from organ class voxels – the threshold for the grey value

of the pixel to distinguish between background or organs is chosen very close to 1.

To reduce the probability of carving away voxels that belong to the plant volume, the plant masks are

dilated by an additional pixel. This is justified by the fact that we only use the voxel center and not the

pixel space covered by the whole voxel. In the case where the error on the camera poses is non

negligible, the dilation of masks is increased even more to match this possible uncertainty.

3.3.6 Level Set method


The set of voxels resulting from the volume carving corresponds to the visual hull of the object, under

the given views. It is an approximation of the true volume occupied by the object. Because the voxels

are positioned on a fixed grid, the surface of the resulting volume suffers from spatial aliasing: it is not

continuous but “cubic”. A more continuous surface is obtained by estimating the real surface of the

scanned object from the carved volume and moving the voxel points to the closest position on this

surface. To compute this improved point cloud a Level Set method is used [40]. The Level Set method

requires a signed distance function that estimates the distance between a point and a volume. A positive

(negative) distance means that point lies inside (outside) the volume and a point on the surface has a

distance of zero. We used the fast marching algorithm provided by the SciPy software library [18].

3.3.7 Back-projection

Once the point cloud is obtained using the Level Set method, a label has to be attributed to each point in

the point cloud. Similar to the method proposed by [41], the coloring of the point cloud is performed

after its construction with the visual hull method.

A technique similar to the volume carving method is used, whereby each point is projected back onto

the segmented images. Let Mi,k denote the output of the segmentation algorithm for view i and class k.

In the following, classes include all non-background organ classes. A threshold is applied to the

probability scores, so that Mi,k(x) ∈ {0,1} for all x in Ω. Then, for each point, each class gets a vote from

each of its projection in the masks Mi,k:

Then the class with the maximal number of votes is attributed to the point:

class(x) = argmaxk(Mk(x))

If a point gets no vote from any of its projection, then it is discarded as a background point. This yields a

labeled point cloud, in which every point is attributed to an organ class.

3.3.8 Fine-tuning

Our network was trained on a virtual A. thaliana model but aims at reconstructing real plants. Although

the training performs well for real A. thaliana (see below), it transfers poorly to other species with large

anatomical differences. Therefore we conceived a simple interface that allows the user to manually label

a few images of interest, then run the training of the network on this small dataset. The interface uses

LabelMe [34] for manual annotations, and runs the training on these images for 20 epochs. This

generates a segmentation network specialized for the new species, starting from the network trained on

the virtual plant models of A. thaliana.


3.3.9 Evaluation metric

To evaluate the segmentation task, we use the classical metric of precision and recall. As discussed

above, the segmentation network outputs C images (one per class) with pixel values between 0 and 1.

To perform the evaluation discussed below, we convert these images separately into binary masks by

applying a threshold of 0.5 on the pixel values. We have noted that the chosen threshold does not

matter much because the output images from the neural network are highly contrasted. Every pixel x in

the segmented image is then attributed to one of four sets:

● x belongs to TP (true positive) if our predicted class is positive and the actual class is positive.

● x belongs to TN (true negative) if our predicted class is negative and the actual class is negative.

● x belongs to FP (false positive) if our predicted class is positive and the actual class is negative.

● x belongs to FN (false negative) if our predicted class is negative and the actual class is positive.

The precision and recall are then defined as follows:

Given a class C:

● The precision is the proportion of the pixels labeled with C that were correctly labeled. The

remaining proportion are pixels that were labeled C but that belong to another class.

● Considering now all the pixels that belong to a given class C, the recall measures the proportion

of those pixels that have been correctly labeled. The remaining proportion are pixels that belong

to class C but that were incorrectly labeled with another class.

To evaluate the 3D reconstruction, the same evaluation metric is applied to voxels. Each voxel is

attributed a single class from the back-projection algorithm presented above, and its class is compared

to the class of the closest point on the original mesh of the virtual plant. We similarly measure the

precision and recall.

3.3.10 Software integration

The code of the algorithm is written in Python and is integrated in the image analysis pipeline developed

in WP5. We used PyTorch as the machine learning framework. The package LabelMe was used to 3 4

annotate the images. Besides the SciPy and bpy libraries already mentioned, we use the standard

3 https://pytorch.org/ 4 http://labelme.csail.mit.edu/Release3.0/


computer vision libraries OpenCV and scikit-image to manipulate images. The pipeline itself uses the 5 6

Luigi task scheduler to execute the individual tasks shown in Figure 1. For more information on the 7

pipeline, please refer to the progress report on WP5.

3.4 Results and Discussions

We recall that the following datasets are used in this work:

● Dataset A: 2520 virtual plant images (18 views of 140 different models, see 3.3.2). 70% of the

images (the selection was done on the plants) were used for training, 23% for validation, and 7%

for the tests, as discussed in 3.4.1 and 3.4.2 below.

● Dataset B: 6 hand-annotated images of real A. thaliana (4 different individuals, see 3.3.1). These

images are used to test the segmentation on real plants quantitatively in 3.4.3.

● Dataset C: 858 images of real A. thaliana (12 different individuals, see 3.3.1). These images were

used to evaluate the segmentation and 3D segmentation qualitatively (section 3.4.4)

● Dataset D: 39 images of a young tomato plant taken from a hand-help mobile phone.

● Dataset E: 2 annotated images of the young tomato plant.

3.4.1 Segmentation of virtual images

The images from the dataset A are segmented using the trained neural network. Then, the results of 2D

segmentation are compared to the ground truth provided by images produced by the virtual scanner

(see 3.3.2). Figure 6 presents a sample from dataset A and its segmentation.

5 https://opencv.org/ 6 https://scikit-image.org/ 7 https://luigi.readthedocs.io/en/stable/index.html


Figure 7: Images segmented with our network. From left to right, then top to bottom: original image,

background, flower, fruit, leaf, pedicel. White indicates a high score (1) for the class, black indicates a

low score (0).

Pixel precision and recall for the five organ classes are shown in the figure 8.


Figure 8: Left: The pixel precision of the 2D segmentation on the dataset A, containing the computer-generated images of the virtual A. thaliana. Right: The pixel recall of the 2D segmentation.

This analysis shows a similar trend for all segmentation classes. For example, considering the fruit class,

~50% of all the pixels that have been labeled as “fruit” by the CNN are indeed fruit pixels in the original

image (true positives). Hence the remaining ~50% of fruit-labeled pixels are false-positive and

correspond to pixels with other labels in the original images. Similarly, 94.5% of all the pixels labeled as

fruit in the original image are correctly labeled by the neural network (true positives), so 5.5% of the

original flower pixels are miss-labeled by the CNN (false negatives).

In conclusion, the neural network has a high sensitivity to detect almost all pixels (>90%) of any plant

classes defined in the training phase (Fig. 5). However, its precision is lower, since too many pixels are

often assigned to a given class (false positives). We will come back to this point below.

3.4.2 3D reconstruction of virtual plants

The segmented images from the previous test are used to reconstruct the segmented 3D representation

of the plant, as discussed in 3.3.5-7. Similarly, we compute the voxel precision and recall (see 3.3.9) to

evaluate the results and plot it side-by-side with previous 2D evaluation for comparison (Fig. 9).


Figure 9: Evaluation of 2D and 3D segmentation side-by-side (blue: 2D pixel metric, orange: 3D, voxel metric). Left: precision . Right: recall.

The results show that the precision for each class is higher in 3D than in 2D. This is because the transfer

to 3D using the volume carving operation reduces the number of false positives. For this reason we favor

false positives against false negatives in the 2D predictions, as explained below. To understand this bias,

we recall that the 3D reconstruction carves away the voxels based on the prediction of pixels onto which

they project. On the one hand, false negative results in losing voxels that are part of the plant. On the

other hand, false-positive retains additional voxels that are likely to be removed in subsequent carving

operations using images that correctly labeled the pixels. Thus false-positives are thus preferable. We

therefore keep the threshold that is used to generate the binary masks low with the risk of introducing

more false positives.

3.4.3 Segmentation of real plants images, CNN trained on virtual plants

We evaluate the performance of the neural network trained on the virtual dataset to segment the 2D

images of dataset B, the set of manually annotated real images of A. thaliana. We present pixel precision

and recall results for each class, comparing results on virtual plants and real plants below (Figure 10).


Figure 10: Pixel precision and recall comparison between 2D segmentation on virtual and real images unseen by the neural network.

Note that the size of the dataset (6 pictures) is too small for a proper statistical analysis. Although the

network makes more errors than with the virtual plants (especially more leaf false positive and more

flower false negative), we get satisfactory segmentation results. The errors include the mix-up between

the stem, fruit, and pedicel classes. Also, the network has difficulties distinguishing fruits and leaves

when they grow along the stem.

Observing the errors made by the network has led to improvements in the 3D models of the plants: we

included features such as the rotation of the leaves, the bending of the stem, and the random coloration

of the organs. This allowed both to improve the plant model and the 2D-segmentation network. These

improvements are already included in the results above. Overall, the 3D models of the plant and the

images generated in the virtual environment allow us to obtain a satisfactory segmentation and 3D

reconstruction of real plants. Note that we recently developed new models of A. thaliana in WP6 (see

Deliverable 6.2) that are even more realistic than the first version of virtual models used here. This

should significantly improve our segmentation results in the near future.

3.4.4 3D reconstruction of real plants

We segmented and reconstructed A. thaliana real plants from dataset C. Without ground truths we can

only give a qualitative analysis of the results. Figure 11 shows the 3D reconstruction and segmentation

of the plants shown in Figure 2. The reconstructions and segmentation were qualitatively as good as the

3D reconstruction to the point where in some cases it is hard to tell whether the reconstructed plant is

from a real or a virtual plant. One default is that the reconstruction of the leaves at the basis on the

stem is not accurate. This is due to the fact that these leaves are hardly identifiable, mixed with the pot

or dried out. They are of little interest to the study of the phyllotaxy of the main fluorescence stem so


we didn’t focus on solving this issue. Another issue comes from occlusions, especially of the stem. The

reconstruction fails, for example, when there are several branches grouped together in the same pot, or

when there is a structure unknown to the segmentation network, for example, a tutor obstructing some

viewpoints. To solve the first issue one solution would be to increase the variety in the viewpoints, by

taking pictures from the top of the pot. To solve the second, we would need to teach the network to

interpret occlusions by foreign objects, by including such objects in the virtual scene for example.

Figure 11: 3D reconstruction of real plants. Red: stem, green: leaf, white: pedicel, purple: fruit:,

yellow: flower.


3.4.5 Transfer to other species

We used model fine-tuning to transfer the use of the neural network to species that are anatomically

different from A. thaliana. Then the network trained on virtual A. thaliana was trained on the two

tomato images of dataset E. The results of the 2D segmentation are shown in Figure 12. The results of

the 3D reconstruction pipeline are shown Figure 13. We can only evaluate the results qualitatively. We

notice that some of the top leaves are missing in some of the photos used in the reconstruction, which

impacts its quality. The results are encouraging enough to consider fine-tuning a potential solution to

train a neural network when a virtual plant model for a given species is not available. Note that using

our virtual tomato plant recently developed in WP6, we will be able to quantitatively assess transfer

learning error rates on virtual plant ground truth in the next phase of our work.

Figure 12: Predictions of the model fine-tuned on two images of tomato that are manually annotated.

(The masks of pedicel and stem are superimposed on this figure but they were predicted separately).


Figure 13: 3D reconstruction and segmentation of tomato plants with the pipeline.

3.5 Conclusion and perspectives

In this work, we have presented and assessed a fully automated method for segmentation of plants

from 2D pictures using convolutional networks for segmentation in 2D and backprojection of 3D points

for segmentation in 3D.

Our initial results suggest that using generative models of plants for training neural networks and

applying the trained algorithms on real specimens of plants is robust enough so that no additional

annotation of data is needed. This is of utmost importance to the field of plant biology, since annotation

of data is very time-consuming, and the variety in plant species makes the annotation of new species a

never-ending task. In WP6, throughout the ROMI project’s life, we also work at increasing the

photo-realism of our virtual plants on various aspects (geometric, physical, and material). These

improvements should simultaneously contribute to strengthen CNN training and significantly improve

automatic segmentation of organs.

We have also shown that annotation of a small set of real world plant data can be enough to transfer

the CNNs to plants with different anatomic properties.


Further work is needed to consolidate these initial findings. We will validate some of the implicit

hypotheses in the current set-up. Notably, we assume that the randomized 3D backgrounds in the

virtual plant images makes the segmentation by the neural network more robust. This intuition needs

additional validation before it can become a best practice. We also augmented the virtual plant model

and added the bending of the main stem and the rotation of the leaves without a precise evaluation of

its effects. We should also increase the number of manually annotated images of A. thaliana and tomato

plants for a proper evaluation of the 2D segmentation.

Another perspective of work is to investigate whether the obtained segmentation is precise enough to

extract precise quantitative data from the reconstructed plant. In particular, using the segmented 3D

data of A. thaliana, we want to extract the angles between the sequence of fruits along the main

inflorescence stem so that we can compare these results with manual measurements and the results of

the geometrical pipeline discussed in WP5. Our hope is that the knowledge of the individual organs

helps to extract the plant’s skeleton more accurately.

Finally, the work also opens up some additional research questions. The 2D masks generated from the

virtual models provide information on overlapping classes. The network is currently trained to predict

these overlapping classes: one pixel can be attributed to multiple classes. However, we have not fully

explored this possibility to improve the segmentation when organs are occluded in a subset of the

images.

References

[1] G. Aleny, B. Dellen, and C. Torras. 3D modelling of leaves from color and ToF data for robotized

plant measuring. In 2011 IEEE International Conference on Robotics and Automation, pages 3408–3414,

2011.

[2] R. Barth, J. IJsselmuiden, J. Hemming, and E.J. [Van Henten]. Synthetic bootstrapping of

convolutional neural networks for semantic plant part segmentation. Computers and Electronics in

Agriculture, 161:291 – 304, 2019.

[3] Frederic Boudon, Christophe Pradal, Thomas Cokelaer, Przemyslaw Prusinkiewicz, and Christophe

Godin. L-Py: An L-System Simulation Framework for Modeling Plant Architecture Development Based on

a Dynamic Language. Frontiers in Plant Science, 3, 2012.

[4] S. Chaivivatrakul, L. Tang, M. N. Dailey, and A. D. Nakarmi. Automatic morphological trait

characterization for corn plants via 3d holographic reconstruction. Computers and Electronics in

Agriculture, 109:109–123, 2014.

[5] Maurilio Di Cicco, Ciro Potena, Giorgio Grisetti, and Alberto Pretto. Automatic model based dataset

generation for fast and accurate crop and weeds detection. 2017 IEEE/RSJ International Conference on

Intelligent Robots and Systems (IROS), pages 5188–5195, 2017.


[6] Barabe Denis et al. Symmetry in plants, volume 4. World Scientific, 1998.

[7] Thomas Duboudin, Maxime Petit, and Liming Chen. Toward a Procedural Fruit Tree Rendering

Framework for Image Analysis. arXiv:1907.04759 [cs], July 2019. arXiv: 1907.04759.

[8] Fabio Fiorani and Ulrich Schurr. Future Scenarios for Plant Phenotyping. Annual Review of Plant

Biology, 64(1):267–291, Apr. 2013.

[9] Miguel Garrido, Dimitris Paraforos, David Reiser, Manuel Vzquez Arellano, Hans Griepentrog, and

Constantino Valero. 3d Maize Plant Reconstruction Based on Georeferenced Overlapping LiDAR Point

Clouds. Remote Sensing, 7(12):17077–17096, Dec. 2015.

[10] Franck Golbach, Gert Kootstra, Sanja Damjanovic, Gerwoud Otten, and Rick Van de Zedde.

Validation of plant part measurements using a 3d reconstruction method suitable for high-throughput

seedling phenotyping. Machine Vision and Applications, 27:663–680, 07 2016.

[11] Guan, H., Liu, M., Ma, X., and Yu, Song. Three-Dimensional Reconstruction of Soybean Canopies

Using Multisource Imaging for Phenotyping Analysis. Remote Sensing, 10:1206, 2018. .

[12] Yanming Guo, Yu Liu, Theodoros Georgiou, and Michael S. Lew. A review of semantic segmentation

using deep neural networks. International Journal of Multimedia Information Retrieval, 7(2):87–93, June

2018.

[13] Yann Guédon, Yassin Refahi, Fabrice Besnard, Etienne Farcot, Christophe Godin, and Teva Vernoux.

Pattern identification and characterization reveal permutations of organs as a key genetically controlled

property of post-meristematic phyllotaxis. Journal of Theoretical Biology, 338:94 – 110, 2013.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image

Recognition. arXiv:1512.03385 [cs], Dec. 2015. arXiv: 1512.03385.

[15] Sepp Hochreiter. The Vanishing Gradient Problem During Learning Recurrent Neural Nets and

Problem Solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems,

06(02):107–116, Apr. 1998.

[16] Anna M. Hoffmann, Georg Noga, and Mauricio Hunsche. Fluorescence indices for monitoring the

ripening of tomatoes in pre- and postharvest phases. Scientia Horticulturae, 191:74–81, Aug. 2015.

[17] Inventables. X-Carve, 2019.

[18] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python,

2001–. [Online; accessed ].


[19] Evangelos Kalogerakis, Melinos Averkiou, Subhransu Maji, and Siddhartha Chaudhuri. 3D Shape

Segmentation with Projective Convolutional Networks. In 2017 IEEE Conference on Computer Vision and

Pattern Recognition (CVPR), pages 6630–6639, Honolulu, HI, July 2017. IEEE.

[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep

convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105,

2012.

[21] K.N. Kutulakos and S.M. Seitz. A theory of shape by space carving. In Proceedings of the Seventh

IEEE International Conference on Computer Vision, pages 307–314 vol.1, Kerkyra, Greece, 1999. IEEE.

[22] Lei Li, Qin Zhang, and Danfeng Huang. A review of imaging techniques for plant phenotyping.

14(11):20078–20111.

[23] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic

segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages

3431–3440, 2015.

[24] Massimo Minervini, Hanno Scharr, and Sotirios A Tsaftaris. Image analysis: the new bottleneck in

plant phenotyping [applications corner]. IEEE signal processing magazine, 32(4):126–131, 2015.

[25] Anh Nguyen and Bac Le. 3D point cloud segmentation: A survey. In 2013 6th IEEE Conference on

Robotics, Automation and Mechatronics (RAM), pages 225–230, 2013.

[26] Thuy Nguyen, David Slaughter, Nelson Max, Julin Maloof, and Neelima Sinha. Structured

Light-Based 3d Reconstruction System for Plants. Sensors, 15(8):18587–18612, July 2015.

[27] Kenji Omasa, Fumiki Hosoi, and Atsumi Konishi. 3D Lidar imaging for detecting and understanding

plant responses and canopy structure. Journal of Experimental Botany, 58(4):881–898, Mar. 2007.

[28] Stefan Paulus, Jan Behmann, Anne-Katrin Mahlein, Lutz Plümer, and Heiner Kuhlmann. Low-Cost

3D Systems: Suitable Tools for Plant Phenotyping. Sensors, 14(2):3001–3018, Feb. 2014.

[29] Laura Soledad Peirone, Gustavo Pereyra Irujo, Alejandro Bolton, Ignacio Erreguerena, and Luis AN

Aguirrezabal. Assessing the efficiency of phenotyping early traits in a greenhouse automated platform

for predicting drought tolerance of soybean in the field. Frontiers in plant science, 9:587, 2018.

[30] Fernando Perez-Sanz, Pedro J. Navarro, and Marcos Egea-Cortines. Plant phenomics: an overview

of image acquisition technologies and image data analysis algorithms. 6(11).

[31] Frédéric Boudon, Christophe Pradal, Thomas Cokelaer, Przemyslaw Prusinkiewicz, Christophe

Godin. L-Py: an L-system simulation framework for modeling plant architecture development based on a

dynamic language. Frontiers in Plant Science, Frontiers, 2012, 3 (76).


[32] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical

image segmentation. In International Conference on Medical image computing and computer-assisted

intervention, pages 234–241. Springer, 2015.

[33] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for

Biomedical Image Segmentation. arXiv:1505.04597 [cs], May 2015. arXiv: 1505.04597.

[34] Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. LabelMe: A Database

and Web-Based Tool for Image Annotation. International Journal of Computer Vision, 77(1):157–173,

May 2008.

[35] Thiago Santos and Julio Ueda. Automatic 3D plant reconstruction from photographies,

segmentation and classification of leaves and internodes using clustering. page 3.

[36] Hanno Scharr, Christoph Briese, Patrick Embgenbroich, Andreas Fischbach, Fabio Fiorani, and Mark

Müller-Linow. Fast high resolution volume carving for 3d plant shoot reconstruction. Frontiers in plant

science, 8:1680, 2017.

[37] Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-Motion Revisited. In 2016 IEEE

Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, Las Vegas, NV, USA,

June 2016. IEEE.

[38] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In

Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[39] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view

selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.

[40] J.A. Sethian. Fast marching methods and level set methods for propagating interfaces, 1998.

Karman Institute Lecture Series, Computational Fluid Dynamics.

[41] Weinan Shi, Rick van de Zedde, Huanyu Jiang, and Gert Kootstra. Plant-part segmentation using

deep learning and multi-view vision. Biosystems Engineering, 187:81–95, 2019.

[43] Yeyin Shi, Ning Wang, Randal K. Taylor, William R. Raun, and James A. Hardin. Automatic corn plant

location and spacing measurement using laser line-scan technique. Precision Agriculture, 14(5):478–494,

Oct. 2013.

[44] Paloma Sodhi, Srinivasan Vijayarangan, and David Wettergreen. In-field segmentation and

identification of plant structures using 3d imaging. In 2017 IEEE/RSJ International Conference on

Intelligent Robots and Systems (IROS), pages 5180–5187, Vancouver, BC, Sept. 2017. IEEE.


[45] Siddharth Srivastava, Swati Bhugra, Brejesh Lall, and Santanu Chaudhury. Drought stress

classification using 3d plant models. In Proceedings of the IEEE International Conference on Computer

Vision, pages 2046–2054, 2017.

[46] Wei Su, Dehai Zhu, Jianxi Huang, and Guo Hao. Estimation of the vertical leaf area profile of corn (

zea mays ) plants using terrestrial laser scanning (tls). Computers and Electronics in Agriculture,

150:5–13, 07 2018.

[47] Thapa, 2018. A_Novel_LiDAR-Based_Instrument_for_High-Throughput.pdf.

[48] Sbastien Tisn, Yann Serrand, Liłn Bach, Elodie Gilbault, Rachid Ben Ameur, Herv Balasse, Roger

Voisin, David Bouchez, Mylne DurandTardif, Philippe Guerche, Gal Chareyron, Jrme Da Rugna, Christine

Camilleri, and Olivier Loudet. Phenoscope: an automated large-scale phenotyping platform offering high

spatial homogeneity. The Plant Journal, 74(3):534–544, 2013.

[49] Jordan Ubbens, Mikolaj Cieslak, Przemyslaw Prusinkiewicz, and Ian Stavness. The use of plant

models in deep learning: An application to leaf counting in rosette plants. Plant Methods, 14, 01 2018.

[50] Manuel Vázquez, David Reiser, Dimitrios S. Paraforos, Miguel Izard, and Hans W. Griepentrog. Leaf

area estimation of reconstructed maize plants using a time-of-flight camera based on different scan

directions. Robotics, 7, 10 2018.

[51] Daniel Ward, Peyman Moghadam, and Nicolas Hudson. Deep Leaf Segmentation Using Synthetic

Data. arXiv:1807.10931 [cs], July 2018. arXiv: 1807.10931.

[52] Illia Ziamtsov and Saket Navlakha. Machine learning approaches to improve three basic plant

phenotyping tasks using three-dimensional point clouds. Plant Physiology, 181(4):1425–1440, 2019.

[53] Rossi, R., Leolini, C., Costafreda-Aumedes, S., Leolini, L., Bindi, M., Zaldei, A., and Moriondo, M.

Performances Evaluation of a Low-Cost Platform for High-Resolution Plant Phenotyping. Sensors,

20(11):3150, 2020.

[54] Gaillard, M., Miao, C., Schnable, J. and Benes, B. Voxel Carving Based 3D Reconstruction of

Sorghum Identifies Genetic Determinants of Radiation Interception Efficiency. BioRXiv, April, 2020.

[55] Phattaralerphong, J. and Sinoquet, H. A method for 3D reconstruction of tree crown volume

from photographs: Assessment with 3D-digitized plants. Tree physiology. 25. 1229-42.


d6 - media.romi-project.eu

Documents