d6 - media.romi-project.eu
TRANSCRIPT
Deliverable
n°
D6.3
Deliverable title: D6.3 - Results of trained models
Tasks
WP6 - Plant modelling
T6.1 Modelling.
T6.2 Extraction of plant parameters using neural networks.
T6.3 Models of plant status for crop monitoring.
Task Leader INRIA Planned Date 01.10.2020
Effective Date 01.10.2020
Written by Reviewed and approved by Authorized by
Name / Surname
Timothée Wintz, Aliénor Lahlou & Peter Hanappe
(SONY)
Christophe Godin (INRIA), Fabrice Besnard (CNRS), David Colliaux
(SONY)
Romain Azais (INRIA)
1. Executive Summary 2
1.1 Summary of deliverable content and initial objectives 2
1.2 Partners involved 2
1.3 Relation with other work packages and tasks 2
1.4 Dissemination / IPR policy 3
2. Demonstrator 3
3. Results of trained models 3
3.1 Introduction 3
3.1 Related Work 4
3.3 Material and Methods 6
3.3.1 Image acquisition and dataset of real images 7
3.3.2 Virtual plants and artificial dataset generation 9
3.3.3 Semantic segmentation of images 11
3.3.4 Training from simulated data 12
1
3.3.5 Volume carving 14
3.3.6 Level Set method 14
3.3.7 Back-projection 15
3.3.8 Fine-tuning 15
3.3.9 Evaluation metric 15
3.3.10 Software integration 16
3.4 Results and Discussions 16
3.4.1 Segmentation of virtual images 17
3.4.2 3D reconstruction of virtual plants 18
3.4.3 Segmentation of real plants images, CNN trained on virtual plants 19
3.4.4 3D reconstruction of real plants 20
3.4.5 Transfer to other species 22
3.5 Conclusion and perspectives 23
References 24
1. Executive Summary
1.1 Summary of deliverable content and initial objectives
This report relates to the work carried out in WP6. In particular, it discusses the results that we obtained
to extract segmented, 3D plant representations from 2D images of real plants using a technique that
combines machine learning with the virtual plant models developed in Task 6.1.
The text covers only a small part of the work performed in WP6. Please refer to the Progress Report to
have an overview of all the work-in-progress in WP6.
1.2 Partners involved
Leader: INRIA
Participants: CNRS, SONY
1.3 Relation with other work packages and tasks
Relation to WP5: In WP6, we develop the high-level software components that use the virtual plant
models and that rely on machine learning. Then, a close collaboration between WP6 and WP5 is
ROMI - D6.3 - Results of trained models 2
required to integrate these software components into the image processing pipelines and the data
storage infrastructure that are put in place in WP5.
Relation to WP7: The software components are tested on real plant data that are collected as part of
WP7.
1.4 Dissemination / IPR policy
The source code of the project is available under an Open Source license at
https://github.com/romi/romiscan.
Additional user documentation can be found at https://docs.romi-project.eu/Scanner/ and
https://github.com/romi/romiseg/tree/master
The work was part of a presentation to industry insiders at the International Forum for Agricultural
Robotics (FIRA 2019): https://www.youtube.com/watch?v=C2UJTtwS2QM
The work presented in this document will be completed to prepare a scientific publication.
2. Demonstrator
A short video will be provided to illustrate the work presented in this deliverable.
3. Results of trained models
3.1 Introduction
In the presented work, we seek to automatically obtain the geometrical structure of the inflorescence
stem of Arabidopsis thaliana in three dimensions (3D) using an RGB camera. This 3D-reconstruction is a
required step for the study of phyllotaxis, the geometric arrangement of organs around the stem, using
high-throughput methods [13]. This problem is also addressed in WP5 using geometrical techniques.
Here, we show how we can take advantage of the plant models developed in WP6 to speed up the
training of convolutional neural networks (CNN) to segment images of A. thaliana into its constituent
organs. These segmented images are then used to reconstruct the plants and its individual organs in 3D.
Advances in machine learning, and in particular the use of deep convolutional neural networks, offer
new opportunities for pixel-by-pixel semantic segmentation of images. However, the main bottleneck to
enable the power of CNNs for segmentation tasks is the need for annotated data sets to train the neural
nets. Data annotation is a tedious and time-consuming process. This is a constraint for small scale
phenotyping problems and also limits the adaption of machine learning techniques for economically less
ROMI - D6.3 - Results of trained models 3
important crops. Two solutions to this problem are often suggested. In the first approach, generative
computer models are used to feed the machine learning algorithm with artificial data. In the second
approach, an existing CNN that is trained with images from one domain is fine-tuned with a small set of
images of a different target domain.
In this work, we look at both of these approaches. We propose to use virtual models of A. thaliana to
generate datasets of 2D images (see deliverable D6.1). These synthetic images are used to train a
convolutional neural network for the identification of the organs along the inflorescence stem. The CNN
is then applied to segment a set of real plant images, taken from different viewpoints. Finally, the
segmented images are used to reconstruct the three dimensional structure of the inflorescence. We also
validate whether the combined segmentation and 3D reconstruction reduces the errors as previously
reported by [41].
We also evaluate the efficiency of fine-tuning. We fine-tune the segmentation network, trained on
virtual A thaliana plants, with a small number of images of young tomato plants. We evaluate whether
the tuned CNN can successfully segment tomato plant images. This approach is useful to speed up the
adoption of machine learning techniques for a species for which no virtual plant model is available but
that has similar anatomical traits as a species that has been modeled.
Finally, we use the segmented images and a technique called volume carving that reconstructs the 3D
shape of the plant to obtain a fully segmented 3D representation of the plant’s organs.
To our knowledge, our work provides a unique attempt to gather state-of-the-art techniques of
machine-learning for plant phenotyping into a single pipeline, from raw images to a 3D reconstruction
segmented with high precision.
3.1 Related Work
The problem we address relates to the creation of 3D representation of individual plants in which all the
constitutive organs have a distinct label. The main fields of application are phenotyping for phenomics
and crop improvement [48, 30], in-field monitoring of highly valuable crops [16, 8], and precision
agriculture [2, 7].
A common approach to this problem is to first obtain a 3D representation such as a point cloud or a
mesh structure and then subsequently segment this 3D structure into organs. There are several
methods to obtain the initial 3D data, including the use of LiDAR [27, 9, 47, 46], time-of-flight cameras
[1, 4, 50, 11], or structured light [28, 26]. The 3D reconstruction can also be obtained from multi-view 2D
images. One such method, structure-from-motion, identifies the key-points from one image to another
to reconstruct the plant in 3D [44]. Another 2D-based method is volume carving that uses the projected 1
contours of the object to carve away voxels and obtain the visual hull of the object [21]. In the field of
phenotyping, volume carving has been discussed by several authors, including [36,41,54,55]. Once the
1 Alternative names for volume carving are space carving and voxel carving.
ROMI - D6.3 - Results of trained models 4
3D structure has been obtained, in general a 3D point cloud, the segmentation can be done using
geometric and algebraic methods [25] or, more recently, machine learning methods [52].
An alternative approach to the segmentation of the individual organs in the 3D representation is to first
segment 2D views of the plant and then project the segmented images into 3D using the camera
positions [41]. It is the approach that we have chosen to evaluate in this work.
Machine learning and computer vision neural-networks offer powerful toolbox for the 2D pixel-wise
image segmentation that is needed in this approach [19]. Several recent works have suggested their use
in phenotyping and precision agriculture, including [5, 51, 49, 41, 2]. A key obstacle in the training of
neural networks for this task is the need for annotated datasets. The datasets are mostly annotated
manually, which is time consuming (5 to 30 min./image). However, recently, several authors have
reported on the successful use of artificially generated images using virtual plant models to train the
CNNs [5, 51, 49, 2, 7].
Once the organs have been detected in each image, the set of 2D views have to be projected in 3D to
obtain the point cloud volumes of the organs. The visual hull based techniques described previously are
an efficient method for this task. In the case of multi-label images, an additional back-projection and
voting mechanism is introduced to decide the label of each point in the cloud [41].
The work presented here includes topics that are active fields of research in the phenotyping
community, including the use of synthetic images to train neural networks, the 3D reconstruction of the
plant’s shape from 2D images, or the segmentation of the 3D representation into the plant’s organs (see
Table 1). However, to our knowledge, the work presented here is unique in that it combines these
components into a single pipeline. We are thus in a position to evaluate the process from the generation
of synthetic images up until the 3D segmentation of the real plant images.
Our contributions can be summarized as follows:
● We show that convolutional neural networks trained only with synthetic images of plants can
successfully segment images of real plants.
● We successfully apply the method to reconstruct the inflorescence stem of A. thaliana in 3D and
detect all the individual organs. The inflorescence stem of A. thaliana is arguably a challenging
object of study due to its thin stem and fruits.
● We offer further proof that fine-tuning a CNN using a small dataset accelerates the use of
semantic segmentation on other plants.
Cicco, 2017 [5]
Ward, 2018 [51]
Ubbens, 2018 [49]
Shi, 2019 [41] Barth, 2019 [2]
ROMI
Object of study Sugar beets and weed
Rosetta of A. thaliana
Rosetta of A. thaliana
Tomato seedlings
Population of sweet peppers plants
Inflorescence stem of A. thaliana
ROMI - D6.3 - Results of trained models 5
Generation of synthetic images
Yes Yes Yes No Yes Yes
Training on synthetic images
Yes Yes, to augment real images
Yes, to augment real images
No Yes Yes
Training on (additional) real images
With and without
Yes Yes Yes Yes, a small set
No
3D reconstruction
No No No Yes No Yes
3D segmentation
No No No (leaf detection in 2D images)
No No Yes
Table 1: An comparison of the most relevant papers that use synthetic images to train neural networks for plant phenotyping and precision agriculture. For a more complete discussion of papers that are relevant to the presented work, please see above.
3.3 Material and Methods
In the following sections, we present the different steps of the pipeline. Figure 1 provides a schematic
overview on how the different components interact.
ROMI - D6.3 - Results of trained models 6
Figure 1: A presentation of the different steps of the final 3D segmentation pipeline along with its defining parameters.
The main pipeline is indicated by the arrows on the top. The arrows on the bottom show the information that is
provided by the Colmap application: the precise camera position and the camera matrix (see the discussion in 3.3.1).
3.3.1 Image acquisition and dataset of real images
Images of A. thaliana
We used a Sony RX0 RGB camera with a resolution of 1920 × 1060 pixels. The camera was fixed on a one
degree-of-freedom camera mount to control the panning. The images were retrieved through the
camera’s WiFi interface. The camera mount was attached on a CNC Cartesian arm, more specifically the
X-Carve [17].
To perform the volume carving and back-projection steps, discussed below, we must know the exact
position and orientation of the camera for each image, as well as the intrinsics of the camera model
(incorporated in the so-called camera matrix). It is challenging to perfectly calibrate a camera arm in real
world scenarios. We thus obtained the camera poses and matrix using a structure-from-motion
algorithm – in this work, we used the open source software Colmap [39, 38].
If I is the set of views taken of a given scene, we thus have the following projection models that include
both the intrinsic camera parameters and camera pose estimation: πi : ℝ3 → ℝ2, i∈ I, such that for any
world coordinates x ∈ ℝ3, πi(x) is the pixel coordinate of point x in view i.
ROMI - D6.3 - Results of trained models 7
Using the set-up above, we obtained 864 images of A. thaliana (12 different individuals) (see also
Deliverable D5.3 and WP7 - Data acquisition and real-world application testing). From this dataset, 6
images of 4 different individuals were manually annotated to evaluate the results of the experiments
(see 3.4.3). Some sample images from this dataset are shown in Figure 2.
Figure 2: Sample images from the A. thaliana dataset used in this study.
Images of a young tomato plant
An additional set of real images from a young tomato plant was obtained to test the fine-tuning method.
We took a video turning around a tomato plant and sampled 41 images equally distributed along the
video to get images from different viewpoints. Two images from this dataset were manually annotated
with our interface (see 3.3.8). We included only two classes: stem and leaf.
ROMI - D6.3 - Results of trained models 8
Figure 3: Two sample images from the dataset of the young tomato plant.
3.3.2 Virtual plants and artificial dataset generation
ROMI - D6.3 - Results of trained models 9
Figure 4: An example of simulated Arabidopsis thaliana plants at different growth stages.
We used a set of 3D mesh models of A. thaliana to generate the artificial dataset. The generated plant
models were exported as meshes from LPy [31] as described in Deliverable D6.1. We use the standard
OBJ format. The 5 different types of organs in the mesh (fruit, stem, pedicel, leaf and flower) each have
an associated “material” identifier that is used to apply colors or texture by the rendering engine. Each
organ class was attributed a distinct material, used below.
The open source software Blender was chosen to render the virtual plants because the Python bpy 2
module that comes with this application allows for easy scripting. Blender’s Eevee renderer was used
because of its fast and realistic looking rendering of scenes. To further increase the diversity of the
simulated dataset and prevent over-fitting on the chosen plant textures, a base color of the plant
material was randomly chosen from a color distribution derived from an image of a tree canopy.
For the background, we rendered the plants in a 360∘ High Dynamic Range Image (HDRI) of a real world
scene. Rendering a HDRI in blender reproduces the original scene lighting. By analogy with the image
acquisition set-up used for real plants (see 3.3.1 above and deliverable D5.3), we use the term “virtual
scanner” for the software module that generates a set of 2D images of the virtual plants in this artificial
background scene as if the camera were turning around the plant. In order to increase the complexity of
the images acquired with this virtual scanner, several sets of backgrounds were used, without limitations
on the lighting environment – night, daylight, sunset – or the type of scene – indoor or outdoor. In both
2 https://www.blender.org/
ROMI - D6.3 - Results of trained models 10
cases, a simulated flash light is triggered with a random level of intensity to simulate different artificial
lighting conditions during the acquisition of real life pictures.
The ground truth for the plant part segmentation is acquired by rendering each organ class separately
by making the materials of the other classes transparent (Figure 4).
Figure 5: Example of a virtual training image (top-left) and the masks associated with all the labels
(from left-to-right: background, stem, leaves, fruits, and pedicels).
A dataset of 2520 virtual plant images was generated using 18 views of 140 different plant models
(896x896 pixels), each with plant part segmentation. The distance and position of the camera around
the plant varied to offer more diversity.
3.3.3 Semantic segmentation of images
ROMI - D6.3 - Results of trained models 11
The objective of semantic image segmentation is to convert an input image into a “segmented” image
where each pixel is given a label according to the class it belongs to. In our case, the input is a plant
image, and, for each pixel in the image, the segmentation network labels the pixel as belonging to the
fruit, stem, pedicel, leaf, flower, or background classes. More generally, semantic image segmentation
converts the input image into a stack of probability maps of the same dimension as the image. If C is the
number of different classes (organs plus background), semantic segmentation produces C output images
with pixels having a value between 0 and 1. The pixel value in output image i is equal to the probability
that the corresponding pixel in the input image belongs to class i.
To produce such semantic probability maps, we used a segmentation convolutional neural network [12].
These segmentation networks are based on a contracting structure from an image to a low dimensional
feature space, called the latent space, which encodes the content of the image. It is connected to a
symmetric expanding structure that translates the information from the latent space to an image similar
to the original one, but with only the content of interest reconstructed. The contracting structure is
directly inspired from classification neural networks, and made of convolution layers, non-linearities,
and down-pooling layers. The convolution kernels allow to filter the information to emphasize the
content, and the down-pooling allows to reduce the dimension of the information. The expanding
structure reproduces the path in the other direction, with up-pooling layers and convolutions. The
spatial information is re-injected to reconstruct the image properly, by providing the down-pooling
coordinates from the contracting phase.
We decided to use a neural network architecture inspired by U-Net [33]. However, to leverage the
power of already existing labeled datasets, the contracting structure, or encoder, was replaced by a
classification network trained on ImageNet, and with six classes to segment. Thanks to the great
diversity in the ImageNet dataset, this classification structure has already learned to encode the
semantic representation of an image in the latent space. The classification network that we used is
ResNet [14] which is a deep neural network where inputs from previous layers are regularly re-injected
into deeper layers in order to maintain the geometry and avoid vanishing gradients [15].
3.3.4 Training from simulated data
The network was trained and tested with the 2520 images (140 individual x 18 images) from the virtual
scanner. To make the network robust to images and lighting conditions that are not in the dataset, we
artificially augment the dataset by adding uncorrelated Gaussian noise to each of the RGB color channels
(σ = 0.01, R’=R+eR, G’=G+eG, …) and random rotations (Figure 4). The images were normalized to
correspond to the mean and variance of the training set of ImageNet, on which was initially trained the
semantic structure (mean(R,G,B) = (0.485, 0.456, 0.406), standard-deviation(R,G,B) =
(0.229,0.224,0.225)). The dataset was split into 3 sets:
● a training set to train the network (70% of the dataset)
● a validation set on which the network is not trained and is used to evaluate and compare the
different networks (7% of the dataset)
ROMI - D6.3 - Results of trained models 12
● a test set used at the very end on the selected architecture (23% of the dataset)
The selection of the images was done on the individuals: 70% of the plants were selected for testing and
their images added to the training set. Using this procedure we introduce backgrounds in the testing
datasets that have not been encountered by the neural network during the training.
Figure 6: A sample of the training set (before normalization).
To train the network for the segmentation, a combination of metrics were used in the loss function. As it
is a multi-class problem, a sigmoid is applied to each output in order to contain the numerical range of
the predictions. First, cross-entropy was used as a per-pixel metric. It represents the uncertainty of the
prediction compared to the ground-truth. If the class of a pixel has a very low predicted probability, it
will highly penalise the loss. If the probability is close to one, the contribution to the loss is close to zero.
The notion of entropy represents this discrepancy between the ground truth distribution and the
ROMI - D6.3 - Results of trained models 13
predicted distribution. It is naturally translated at the mathematical level with the negative of the
logarithm.
(1)
Where n is the number of pixels in the image, C the number of possible classes, ygtk = 1 if pixel i is in
class k, 0 otherwise, and yi,predk = p(label(yi,gt) = k).
Second, the Dice coefficient was used as a another loss metric, which theoretically writes as:
(2)
This coefficient compares the number of right predictions to the number of samples, for each class. In
practice, the product of the ground truth by the predictions is summed, so that only the prediction of
the right class contributes to the sum. For example for point n, the class is k, and the prediction for class
k is pk, the loss for point n will be 1 - pk. The total loss is computed vector-wise, and lies between 0 and 1.
We used the mean of these two losses for the training: cross-entropy has more stable gradients while
Dice loss represents what should be minimised and is less sensitive to class imbalance than
cross-entropy.
3.3.5 Volume carving
Volume carving is a photogrammetric approach which uses pictures of an object from different points of
view to reconstruct an object in 3D as a binary function of space [21, 10, 36, 54, 55]. Space is divided
into a regular voxel grid with a voxel size chosen such that the size of the projection of each voxel is of
the same order of magnitude of a pixel in each of the pictures.
The input of the volume carving algorithm is a set of binary background images Bi produced by the
segmentation algorithm. The value of a pixel in the mask indicates whether the pixel belongs to an organ
class or to the background. Since volume carving is very sensitive to false positives – pixels labeled as
background when they are in fact projections from organ class voxels – the threshold for the grey value
of the pixel to distinguish between background or organs is chosen very close to 1.
To reduce the probability of carving away voxels that belong to the plant volume, the plant masks are
dilated by an additional pixel. This is justified by the fact that we only use the voxel center and not the
pixel space covered by the whole voxel. In the case where the error on the camera poses is non
negligible, the dilation of masks is increased even more to match this possible uncertainty.
3.3.6 Level Set method
ROMI - D6.3 - Results of trained models 14
The set of voxels resulting from the volume carving corresponds to the visual hull of the object, under
the given views. It is an approximation of the true volume occupied by the object. Because the voxels
are positioned on a fixed grid, the surface of the resulting volume suffers from spatial aliasing: it is not
continuous but “cubic”. A more continuous surface is obtained by estimating the real surface of the
scanned object from the carved volume and moving the voxel points to the closest position on this
surface. To compute this improved point cloud a Level Set method is used [40]. The Level Set method
requires a signed distance function that estimates the distance between a point and a volume. A positive
(negative) distance means that point lies inside (outside) the volume and a point on the surface has a
distance of zero. We used the fast marching algorithm provided by the SciPy software library [18].
3.3.7 Back-projection
Once the point cloud is obtained using the Level Set method, a label has to be attributed to each point in
the point cloud. Similar to the method proposed by [41], the coloring of the point cloud is performed
after its construction with the visual hull method.
A technique similar to the volume carving method is used, whereby each point is projected back onto
the segmented images. Let Mi,k denote the output of the segmentation algorithm for view i and class k.
In the following, classes include all non-background organ classes. A threshold is applied to the
probability scores, so that Mi,k(x) ∈ {0,1} for all x in Ω. Then, for each point, each class gets a vote from
each of its projection in the masks Mi,k:
Then the class with the maximal number of votes is attributed to the point:
class(x) = argmaxk(Mk(x))
If a point gets no vote from any of its projection, then it is discarded as a background point. This yields a
labeled point cloud, in which every point is attributed to an organ class.
3.3.8 Fine-tuning
Our network was trained on a virtual A. thaliana model but aims at reconstructing real plants. Although
the training performs well for real A. thaliana (see below), it transfers poorly to other species with large
anatomical differences. Therefore we conceived a simple interface that allows the user to manually label
a few images of interest, then run the training of the network on this small dataset. The interface uses
LabelMe [34] for manual annotations, and runs the training on these images for 20 epochs. This
generates a segmentation network specialized for the new species, starting from the network trained on
the virtual plant models of A. thaliana.
ROMI - D6.3 - Results of trained models 15
3.3.9 Evaluation metric
To evaluate the segmentation task, we use the classical metric of precision and recall. As discussed
above, the segmentation network outputs C images (one per class) with pixel values between 0 and 1.
To perform the evaluation discussed below, we convert these images separately into binary masks by
applying a threshold of 0.5 on the pixel values. We have noted that the chosen threshold does not
matter much because the output images from the neural network are highly contrasted. Every pixel x in
the segmented image is then attributed to one of four sets:
● x belongs to TP (true positive) if our predicted class is positive and the actual class is positive.
● x belongs to TN (true negative) if our predicted class is negative and the actual class is negative.
● x belongs to FP (false positive) if our predicted class is positive and the actual class is negative.
● x belongs to FN (false negative) if our predicted class is negative and the actual class is positive.
The precision and recall are then defined as follows:
Given a class C:
● The precision is the proportion of the pixels labeled with C that were correctly labeled. The
remaining proportion are pixels that were labeled C but that belong to another class.
● Considering now all the pixels that belong to a given class C, the recall measures the proportion
of those pixels that have been correctly labeled. The remaining proportion are pixels that belong
to class C but that were incorrectly labeled with another class.
To evaluate the 3D reconstruction, the same evaluation metric is applied to voxels. Each voxel is
attributed a single class from the back-projection algorithm presented above, and its class is compared
to the class of the closest point on the original mesh of the virtual plant. We similarly measure the
precision and recall.
3.3.10 Software integration
The code of the algorithm is written in Python and is integrated in the image analysis pipeline developed
in WP5. We used PyTorch as the machine learning framework. The package LabelMe was used to 3 4
annotate the images. Besides the SciPy and bpy libraries already mentioned, we use the standard
3 https://pytorch.org/ 4 http://labelme.csail.mit.edu/Release3.0/
ROMI - D6.3 - Results of trained models 16
computer vision libraries OpenCV and scikit-image to manipulate images. The pipeline itself uses the 5 6
Luigi task scheduler to execute the individual tasks shown in Figure 1. For more information on the 7
pipeline, please refer to the progress report on WP5.
3.4 Results and Discussions
We recall that the following datasets are used in this work:
● Dataset A: 2520 virtual plant images (18 views of 140 different models, see 3.3.2). 70% of the
images (the selection was done on the plants) were used for training, 23% for validation, and 7%
for the tests, as discussed in 3.4.1 and 3.4.2 below.
● Dataset B: 6 hand-annotated images of real A. thaliana (4 different individuals, see 3.3.1). These
images are used to test the segmentation on real plants quantitatively in 3.4.3.
● Dataset C: 858 images of real A. thaliana (12 different individuals, see 3.3.1). These images were
used to evaluate the segmentation and 3D segmentation qualitatively (section 3.4.4)
● Dataset D: 39 images of a young tomato plant taken from a hand-help mobile phone.
● Dataset E: 2 annotated images of the young tomato plant.
3.4.1 Segmentation of virtual images
The images from the dataset A are segmented using the trained neural network. Then, the results of 2D
segmentation are compared to the ground truth provided by images produced by the virtual scanner
(see 3.3.2). Figure 6 presents a sample from dataset A and its segmentation.
5 https://opencv.org/ 6 https://scikit-image.org/ 7 https://luigi.readthedocs.io/en/stable/index.html
ROMI - D6.3 - Results of trained models 17
Figure 7: Images segmented with our network. From left to right, then top to bottom: original image,
background, flower, fruit, leaf, pedicel. White indicates a high score (1) for the class, black indicates a
low score (0).
Pixel precision and recall for the five organ classes are shown in the figure 8.
ROMI - D6.3 - Results of trained models 18
Figure 8: Left: The pixel precision of the 2D segmentation on the dataset A, containing the computer-generated images of the virtual A. thaliana. Right: The pixel recall of the 2D segmentation.
This analysis shows a similar trend for all segmentation classes. For example, considering the fruit class,
~50% of all the pixels that have been labeled as “fruit” by the CNN are indeed fruit pixels in the original
image (true positives). Hence the remaining ~50% of fruit-labeled pixels are false-positive and
correspond to pixels with other labels in the original images. Similarly, 94.5% of all the pixels labeled as
fruit in the original image are correctly labeled by the neural network (true positives), so 5.5% of the
original flower pixels are miss-labeled by the CNN (false negatives).
In conclusion, the neural network has a high sensitivity to detect almost all pixels (>90%) of any plant
classes defined in the training phase (Fig. 5). However, its precision is lower, since too many pixels are
often assigned to a given class (false positives). We will come back to this point below.
3.4.2 3D reconstruction of virtual plants
The segmented images from the previous test are used to reconstruct the segmented 3D representation
of the plant, as discussed in 3.3.5-7. Similarly, we compute the voxel precision and recall (see 3.3.9) to
evaluate the results and plot it side-by-side with previous 2D evaluation for comparison (Fig. 9).
ROMI - D6.3 - Results of trained models 19
Figure 9: Evaluation of 2D and 3D segmentation side-by-side (blue: 2D pixel metric, orange: 3D, voxel metric). Left: precision . Right: recall.
The results show that the precision for each class is higher in 3D than in 2D. This is because the transfer
to 3D using the volume carving operation reduces the number of false positives. For this reason we favor
false positives against false negatives in the 2D predictions, as explained below. To understand this bias,
we recall that the 3D reconstruction carves away the voxels based on the prediction of pixels onto which
they project. On the one hand, false negative results in losing voxels that are part of the plant. On the
other hand, false-positive retains additional voxels that are likely to be removed in subsequent carving
operations using images that correctly labeled the pixels. Thus false-positives are thus preferable. We
therefore keep the threshold that is used to generate the binary masks low with the risk of introducing
more false positives.
3.4.3 Segmentation of real plants images, CNN trained on virtual plants
We evaluate the performance of the neural network trained on the virtual dataset to segment the 2D
images of dataset B, the set of manually annotated real images of A. thaliana. We present pixel precision
and recall results for each class, comparing results on virtual plants and real plants below (Figure 10).
ROMI - D6.3 - Results of trained models 20
Figure 10: Pixel precision and recall comparison between 2D segmentation on virtual and real images unseen by the neural network.
Note that the size of the dataset (6 pictures) is too small for a proper statistical analysis. Although the
network makes more errors than with the virtual plants (especially more leaf false positive and more
flower false negative), we get satisfactory segmentation results. The errors include the mix-up between
the stem, fruit, and pedicel classes. Also, the network has difficulties distinguishing fruits and leaves
when they grow along the stem.
Observing the errors made by the network has led to improvements in the 3D models of the plants: we
included features such as the rotation of the leaves, the bending of the stem, and the random coloration
of the organs. This allowed both to improve the plant model and the 2D-segmentation network. These
improvements are already included in the results above. Overall, the 3D models of the plant and the
images generated in the virtual environment allow us to obtain a satisfactory segmentation and 3D
reconstruction of real plants. Note that we recently developed new models of A. thaliana in WP6 (see
Deliverable 6.2) that are even more realistic than the first version of virtual models used here. This
should significantly improve our segmentation results in the near future.
3.4.4 3D reconstruction of real plants
We segmented and reconstructed A. thaliana real plants from dataset C. Without ground truths we can
only give a qualitative analysis of the results. Figure 11 shows the 3D reconstruction and segmentation
of the plants shown in Figure 2. The reconstructions and segmentation were qualitatively as good as the
3D reconstruction to the point where in some cases it is hard to tell whether the reconstructed plant is
from a real or a virtual plant. One default is that the reconstruction of the leaves at the basis on the
stem is not accurate. This is due to the fact that these leaves are hardly identifiable, mixed with the pot
or dried out. They are of little interest to the study of the phyllotaxy of the main fluorescence stem so
ROMI - D6.3 - Results of trained models 21
we didn’t focus on solving this issue. Another issue comes from occlusions, especially of the stem. The
reconstruction fails, for example, when there are several branches grouped together in the same pot, or
when there is a structure unknown to the segmentation network, for example, a tutor obstructing some
viewpoints. To solve the first issue one solution would be to increase the variety in the viewpoints, by
taking pictures from the top of the pot. To solve the second, we would need to teach the network to
interpret occlusions by foreign objects, by including such objects in the virtual scene for example.
Figure 11: 3D reconstruction of real plants. Red: stem, green: leaf, white: pedicel, purple: fruit:,
yellow: flower.
ROMI - D6.3 - Results of trained models 22
3.4.5 Transfer to other species
We used model fine-tuning to transfer the use of the neural network to species that are anatomically
different from A. thaliana. Then the network trained on virtual A. thaliana was trained on the two
tomato images of dataset E. The results of the 2D segmentation are shown in Figure 12. The results of
the 3D reconstruction pipeline are shown Figure 13. We can only evaluate the results qualitatively. We
notice that some of the top leaves are missing in some of the photos used in the reconstruction, which
impacts its quality. The results are encouraging enough to consider fine-tuning a potential solution to
train a neural network when a virtual plant model for a given species is not available. Note that using
our virtual tomato plant recently developed in WP6, we will be able to quantitatively assess transfer
learning error rates on virtual plant ground truth in the next phase of our work.
Figure 12: Predictions of the model fine-tuned on two images of tomato that are manually annotated.
(The masks of pedicel and stem are superimposed on this figure but they were predicted separately).
ROMI - D6.3 - Results of trained models 23
Figure 13: 3D reconstruction and segmentation of tomato plants with the pipeline.
3.5 Conclusion and perspectives
In this work, we have presented and assessed a fully automated method for segmentation of plants
from 2D pictures using convolutional networks for segmentation in 2D and backprojection of 3D points
for segmentation in 3D.
Our initial results suggest that using generative models of plants for training neural networks and
applying the trained algorithms on real specimens of plants is robust enough so that no additional
annotation of data is needed. This is of utmost importance to the field of plant biology, since annotation
of data is very time-consuming, and the variety in plant species makes the annotation of new species a
never-ending task. In WP6, throughout the ROMI project’s life, we also work at increasing the
photo-realism of our virtual plants on various aspects (geometric, physical, and material). These
improvements should simultaneously contribute to strengthen CNN training and significantly improve
automatic segmentation of organs.
We have also shown that annotation of a small set of real world plant data can be enough to transfer
the CNNs to plants with different anatomic properties.
ROMI - D6.3 - Results of trained models 24
Further work is needed to consolidate these initial findings. We will validate some of the implicit
hypotheses in the current set-up. Notably, we assume that the randomized 3D backgrounds in the
virtual plant images makes the segmentation by the neural network more robust. This intuition needs
additional validation before it can become a best practice. We also augmented the virtual plant model
and added the bending of the main stem and the rotation of the leaves without a precise evaluation of
its effects. We should also increase the number of manually annotated images of A. thaliana and tomato
plants for a proper evaluation of the 2D segmentation.
Another perspective of work is to investigate whether the obtained segmentation is precise enough to
extract precise quantitative data from the reconstructed plant. In particular, using the segmented 3D
data of A. thaliana, we want to extract the angles between the sequence of fruits along the main
inflorescence stem so that we can compare these results with manual measurements and the results of
the geometrical pipeline discussed in WP5. Our hope is that the knowledge of the individual organs
helps to extract the plant’s skeleton more accurately.
Finally, the work also opens up some additional research questions. The 2D masks generated from the
virtual models provide information on overlapping classes. The network is currently trained to predict
these overlapping classes: one pixel can be attributed to multiple classes. However, we have not fully
explored this possibility to improve the segmentation when organs are occluded in a subset of the
images.
References
[1] G. Aleny, B. Dellen, and C. Torras. 3D modelling of leaves from color and ToF data for robotized
plant measuring. In 2011 IEEE International Conference on Robotics and Automation, pages 3408–3414,
2011.
[2] R. Barth, J. IJsselmuiden, J. Hemming, and E.J. [Van Henten]. Synthetic bootstrapping of
convolutional neural networks for semantic plant part segmentation. Computers and Electronics in
Agriculture, 161:291 – 304, 2019.
[3] Frederic Boudon, Christophe Pradal, Thomas Cokelaer, Przemyslaw Prusinkiewicz, and Christophe
Godin. L-Py: An L-System Simulation Framework for Modeling Plant Architecture Development Based on
a Dynamic Language. Frontiers in Plant Science, 3, 2012.
[4] S. Chaivivatrakul, L. Tang, M. N. Dailey, and A. D. Nakarmi. Automatic morphological trait
characterization for corn plants via 3d holographic reconstruction. Computers and Electronics in
Agriculture, 109:109–123, 2014.
[5] Maurilio Di Cicco, Ciro Potena, Giorgio Grisetti, and Alberto Pretto. Automatic model based dataset
generation for fast and accurate crop and weeds detection. 2017 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), pages 5188–5195, 2017.
ROMI - D6.3 - Results of trained models 25
[6] Barabe Denis et al. Symmetry in plants, volume 4. World Scientific, 1998.
[7] Thomas Duboudin, Maxime Petit, and Liming Chen. Toward a Procedural Fruit Tree Rendering
Framework for Image Analysis. arXiv:1907.04759 [cs], July 2019. arXiv: 1907.04759.
[8] Fabio Fiorani and Ulrich Schurr. Future Scenarios for Plant Phenotyping. Annual Review of Plant
Biology, 64(1):267–291, Apr. 2013.
[9] Miguel Garrido, Dimitris Paraforos, David Reiser, Manuel Vzquez Arellano, Hans Griepentrog, and
Constantino Valero. 3d Maize Plant Reconstruction Based on Georeferenced Overlapping LiDAR Point
Clouds. Remote Sensing, 7(12):17077–17096, Dec. 2015.
[10] Franck Golbach, Gert Kootstra, Sanja Damjanovic, Gerwoud Otten, and Rick Van de Zedde.
Validation of plant part measurements using a 3d reconstruction method suitable for high-throughput
seedling phenotyping. Machine Vision and Applications, 27:663–680, 07 2016.
[11] Guan, H., Liu, M., Ma, X., and Yu, Song. Three-Dimensional Reconstruction of Soybean Canopies
Using Multisource Imaging for Phenotyping Analysis. Remote Sensing, 10:1206, 2018. .
[12] Yanming Guo, Yu Liu, Theodoros Georgiou, and Michael S. Lew. A review of semantic segmentation
using deep neural networks. International Journal of Multimedia Information Retrieval, 7(2):87–93, June
2018.
[13] Yann Guédon, Yassin Refahi, Fabrice Besnard, Etienne Farcot, Christophe Godin, and Teva Vernoux.
Pattern identification and characterization reveal permutations of organs as a key genetically controlled
property of post-meristematic phyllotaxis. Journal of Theoretical Biology, 338:94 – 110, 2013.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image
Recognition. arXiv:1512.03385 [cs], Dec. 2015. arXiv: 1512.03385.
[15] Sepp Hochreiter. The Vanishing Gradient Problem During Learning Recurrent Neural Nets and
Problem Solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems,
06(02):107–116, Apr. 1998.
[16] Anna M. Hoffmann, Georg Noga, and Mauricio Hunsche. Fluorescence indices for monitoring the
ripening of tomatoes in pre- and postharvest phases. Scientia Horticulturae, 191:74–81, Aug. 2015.
[17] Inventables. X-Carve, 2019.
[18] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python,
2001–. [Online; accessed ].
ROMI - D6.3 - Results of trained models 26
[19] Evangelos Kalogerakis, Melinos Averkiou, Subhransu Maji, and Siddhartha Chaudhuri. 3D Shape
Segmentation with Projective Convolutional Networks. In 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 6630–6639, Honolulu, HI, July 2017. IEEE.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105,
2012.
[21] K.N. Kutulakos and S.M. Seitz. A theory of shape by space carving. In Proceedings of the Seventh
IEEE International Conference on Computer Vision, pages 307–314 vol.1, Kerkyra, Greece, 1999. IEEE.
[22] Lei Li, Qin Zhang, and Danfeng Huang. A review of imaging techniques for plant phenotyping.
14(11):20078–20111.
[23] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic
segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
3431–3440, 2015.
[24] Massimo Minervini, Hanno Scharr, and Sotirios A Tsaftaris. Image analysis: the new bottleneck in
plant phenotyping [applications corner]. IEEE signal processing magazine, 32(4):126–131, 2015.
[25] Anh Nguyen and Bac Le. 3D point cloud segmentation: A survey. In 2013 6th IEEE Conference on
Robotics, Automation and Mechatronics (RAM), pages 225–230, 2013.
[26] Thuy Nguyen, David Slaughter, Nelson Max, Julin Maloof, and Neelima Sinha. Structured
Light-Based 3d Reconstruction System for Plants. Sensors, 15(8):18587–18612, July 2015.
[27] Kenji Omasa, Fumiki Hosoi, and Atsumi Konishi. 3D Lidar imaging for detecting and understanding
plant responses and canopy structure. Journal of Experimental Botany, 58(4):881–898, Mar. 2007.
[28] Stefan Paulus, Jan Behmann, Anne-Katrin Mahlein, Lutz Plümer, and Heiner Kuhlmann. Low-Cost
3D Systems: Suitable Tools for Plant Phenotyping. Sensors, 14(2):3001–3018, Feb. 2014.
[29] Laura Soledad Peirone, Gustavo Pereyra Irujo, Alejandro Bolton, Ignacio Erreguerena, and Luis AN
Aguirrezabal. Assessing the efficiency of phenotyping early traits in a greenhouse automated platform
for predicting drought tolerance of soybean in the field. Frontiers in plant science, 9:587, 2018.
[30] Fernando Perez-Sanz, Pedro J. Navarro, and Marcos Egea-Cortines. Plant phenomics: an overview
of image acquisition technologies and image data analysis algorithms. 6(11).
[31] Frédéric Boudon, Christophe Pradal, Thomas Cokelaer, Przemyslaw Prusinkiewicz, Christophe
Godin. L-Py: an L-system simulation framework for modeling plant architecture development based on a
dynamic language. Frontiers in Plant Science, Frontiers, 2012, 3 (76).
ROMI - D6.3 - Results of trained models 27
[32] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical
image segmentation. In International Conference on Medical image computing and computer-assisted
intervention, pages 234–241. Springer, 2015.
[33] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for
Biomedical Image Segmentation. arXiv:1505.04597 [cs], May 2015. arXiv: 1505.04597.
[34] Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. LabelMe: A Database
and Web-Based Tool for Image Annotation. International Journal of Computer Vision, 77(1):157–173,
May 2008.
[35] Thiago Santos and Julio Ueda. Automatic 3D plant reconstruction from photographies,
segmentation and classification of leaves and internodes using clustering. page 3.
[36] Hanno Scharr, Christoph Briese, Patrick Embgenbroich, Andreas Fischbach, Fabio Fiorani, and Mark
Müller-Linow. Fast high resolution volume carving for 3d plant shoot reconstruction. Frontiers in plant
science, 8:1680, 2017.
[37] Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-Motion Revisited. In 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, Las Vegas, NV, USA,
June 2016. IEEE.
[38] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In
Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[39] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view
selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
[40] J.A. Sethian. Fast marching methods and level set methods for propagating interfaces, 1998.
Karman Institute Lecture Series, Computational Fluid Dynamics.
[41] Weinan Shi, Rick van de Zedde, Huanyu Jiang, and Gert Kootstra. Plant-part segmentation using
deep learning and multi-view vision. Biosystems Engineering, 187:81–95, 2019.
[43] Yeyin Shi, Ning Wang, Randal K. Taylor, William R. Raun, and James A. Hardin. Automatic corn plant
location and spacing measurement using laser line-scan technique. Precision Agriculture, 14(5):478–494,
Oct. 2013.
[44] Paloma Sodhi, Srinivasan Vijayarangan, and David Wettergreen. In-field segmentation and
identification of plant structures using 3d imaging. In 2017 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), pages 5180–5187, Vancouver, BC, Sept. 2017. IEEE.
ROMI - D6.3 - Results of trained models 28
[45] Siddharth Srivastava, Swati Bhugra, Brejesh Lall, and Santanu Chaudhury. Drought stress
classification using 3d plant models. In Proceedings of the IEEE International Conference on Computer
Vision, pages 2046–2054, 2017.
[46] Wei Su, Dehai Zhu, Jianxi Huang, and Guo Hao. Estimation of the vertical leaf area profile of corn (
zea mays ) plants using terrestrial laser scanning (tls). Computers and Electronics in Agriculture,
150:5–13, 07 2018.
[47] Thapa, 2018. A_Novel_LiDAR-Based_Instrument_for_High-Throughput.pdf.
[48] Sbastien Tisn, Yann Serrand, Liłn Bach, Elodie Gilbault, Rachid Ben Ameur, Herv Balasse, Roger
Voisin, David Bouchez, Mylne DurandTardif, Philippe Guerche, Gal Chareyron, Jrme Da Rugna, Christine
Camilleri, and Olivier Loudet. Phenoscope: an automated large-scale phenotyping platform offering high
spatial homogeneity. The Plant Journal, 74(3):534–544, 2013.
[49] Jordan Ubbens, Mikolaj Cieslak, Przemyslaw Prusinkiewicz, and Ian Stavness. The use of plant
models in deep learning: An application to leaf counting in rosette plants. Plant Methods, 14, 01 2018.
[50] Manuel Vázquez, David Reiser, Dimitrios S. Paraforos, Miguel Izard, and Hans W. Griepentrog. Leaf
area estimation of reconstructed maize plants using a time-of-flight camera based on different scan
directions. Robotics, 7, 10 2018.
[51] Daniel Ward, Peyman Moghadam, and Nicolas Hudson. Deep Leaf Segmentation Using Synthetic
Data. arXiv:1807.10931 [cs], July 2018. arXiv: 1807.10931.
[52] Illia Ziamtsov and Saket Navlakha. Machine learning approaches to improve three basic plant
phenotyping tasks using three-dimensional point clouds. Plant Physiology, 181(4):1425–1440, 2019.
[53] Rossi, R., Leolini, C., Costafreda-Aumedes, S., Leolini, L., Bindi, M., Zaldei, A., and Moriondo, M.
Performances Evaluation of a Low-Cost Platform for High-Resolution Plant Phenotyping. Sensors,
20(11):3150, 2020.
[54] Gaillard, M., Miao, C., Schnable, J. and Benes, B. Voxel Carving Based 3D Reconstruction of
Sorghum Identifies Genetic Determinants of Radiation Interception Efficiency. BioRXiv, April, 2020.
[55] Phattaralerphong, J. and Sinoquet, H. A method for 3D reconstruction of tree crown volume
from photographs: Assessment with 3D-digitized plants. Tree physiology. 25. 1229-42.
ROMI - D6.3 - Results of trained models 29