
Simultaneous Recognition and Pose Estimation of Instruments in Minimally Invasive Surgery

Thomas Kurmann¹, Pablo Marquez Neila², Xiaofei Du³, Pascal Fua², Danail Stoyanov³, Sebastian Wolf⁴, and Raphael Sznitman¹

¹University of Bern, Switzerland
²École Polytechnique Fédérale de Lausanne, Switzerland
³University College London, United Kingdom
⁴Bern University Hospital, Switzerland

Abstract

Detection of surgical instruments plays a key role in ensuring patient safety in minimally invasive surgery. In this paper, we present a novel method for 2D vision-based recognition and pose estimation of surgical instruments that generalizes to different surgical applications. At its core, we propose a novel scene model in order to simultaneously recognize multiple instruments as well as their parts. We use a Convolutional Neural Network architecture to embody our model and show that the cross-entropy loss is well suited to optimize its parameters, which can be trained in an end-to-end fashion. An additional advantage of our approach is that instrument detection at test time is achieved while avoiding the need for scale-dependent sliding-window evaluation. This allows our approach to be relatively parameter free at test time and shows good performance for both instrument detection and tracking. We show that our approach surpasses state-of-the-art results on in-vivo retinal microsurgery image data, as well as ex-vivo laparoscopic sequences.

1 Introduction

Vision-based detection of surgical instruments in both minimally invasive surgery and microsurgery has gained increasing popularity in the last decade. This is largely due to the potential it holds for more accurate guidance of surgical robots such as the da Vinci® (Intuitive Surgical, USA) and Preceyes (Netherlands), as well as for directing imaging technology such as endoscopes [13] or OCT imaging [2] at manipulated regions of the workspace.

In recent years, a large number of methods have been proposed to either track instruments over time or detect them without any prior temporal information, in both 2D and 3D. In this work, we focus on 2D detection of surgical instruments as it is often required for tracking in both 2D [9] and 3D [4]. In this context, [7, 12] proposed to build ensemble-based classifiers using hand-crafted features to detect instrument parts (e.g. shaft, tips or center). Similarly, [3] detected multiple instruments in neurosurgery by repeatedly evaluating a boosted classifier based on semantic segmentation.



Yet for most methods described above, two important limitations arise. The first is that instrument detection and pose estimation (i.e. instrument position, orientation and location of parts) have been tackled in two phases, leading to complicated pipelines that are sensitive to parameter tuning. The second is that, at evaluation time, detection of instruments has been achieved by repeated window sliding at limited scales, which is both inefficient and error prone (e.g. small or very large instruments are missed). Both points heavily reduce the usability of the proposed methods.

In order to overcome these limitations, we propose a novel framework that can be applied to a variety of surgical settings. Assuming a known maximum number of instruments and parts that could appear in the field of view, our approach, which relies on recent deep learning strategies [10], avoids the need for window sliding at test time and estimates multiple instruments and their pose simultaneously. This is achieved by designing a novel Convolutional Neural Network (CNN) architecture that explicitly models object parts and the different objects that may be present. We show that when combined with a cross-entropy loss function, our model can be trained in an end-to-end fashion, thus bypassing the need for traditional two-stage detection and pose estimation. We validate our approach on both ex-vivo laparoscopy images and on in-vivo retinal microsurgery, where we show improved results over existing detection and tracking methods.

2 Multi-Instrument Detector

In order to detect multiple instruments and their parts in a coherent and simple manner, we propose a scene model which assumes that the maximum number of instruments that can appear in the field of view is known. We use a CNN to embody this model and use the cross-entropy to learn effective parameters from a training set. Our CNN architecture takes an image as input and provides binary outputs as to whether or not a given instrument is present, as well as 2D location estimates for its parts. A visualization of our proposed detection framework can be seen in Fig. 1. Conveniently, detecting instruments and estimating the joint positions on a test frame is then achieved by a single feed-forward pass of the network. We now describe our scene model and our CNN in more detail.
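To make the test-time procedure concrete, the following is a minimal sketch of how the two network outputs could be decoded into detections, assuming the network returns an $(M,)$ vector of presence probabilities and an $(M, N, h, w)$ stack of joint probability maps; the function name, array layout and the 0.5 presence threshold are illustrative choices, not part of the original method.

```python
import numpy as np

def decode_predictions(presence_probs, joint_maps, presence_threshold=0.5):
    """Turn raw network outputs into per-instrument detections.

    presence_probs: (M,) array of P(T_m = 1).
    joint_maps:     (M, N, h, w) array of P(J_m^n | T_m = 1).
    """
    detections = []
    for m, p_tool in enumerate(presence_probs):
        if p_tool < presence_threshold:
            continue  # instrument m considered absent
        joints = []
        for n in range(joint_maps.shape[1]):
            # Joint estimate = argmax of the softmax-normalised probability map.
            y, x = np.unravel_index(np.argmax(joint_maps[m, n]),
                                    joint_maps[m, n].shape)
            joints.append((int(x), int(y)))
        detections.append({"instrument": m, "presence": float(p_tool),
                           "joints": joints})
    return detections
```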

2.1 Scene Model

Let $I \in \mathbb{R}^{w \times h}$ be an image that may contain up to $M$ instruments. In particular, we denote $T = \{T_1, \ldots, T_M\}$, $T_m \in \{0, 1\}$, to be the set of instruments that could appear in the field of view, such that $T_m = 0$ if the tool is not present and $T_m = 1$ if it is. In addition, each instrument present in the image is defined as a set of $N$ parts, or joints, $\{J_m^n \in \mathbb{R}^2\}_{n=1}^{N}$, consisting of 2D image locations. Furthermore, let $GT_m^n \in \mathbb{R}^2$ be the ground truth 2D position for joint $n$ of instrument $T_m$, and $t_m \in \{0, 1\}$ be the ground truth variable indicating if the $m$th instrument is visible in the image. Assuming that the instrument presence is unknown and is probabilistic in nature, our goal is to train a network to estimate the scene model given in Eq. (1).


Figure 1: Proposed multi-instrument detector network architecture. The network produces probabilistic outputs for both the presence of the different instruments and the positions of their joints. The number of channels C is denoted on top of each box.

$$P(T_1, \ldots, T_M, J_1^1, \ldots, J_1^N, \ldots, J_M^1, \ldots, J_M^N) = \prod_m P(T_m) \prod_m \prod_n P(J_m^n \mid T_m) \qquad (1)$$

where the $P(T_m)$ are Bernoulli random variables and the likelihood models $P(J_m^n \mid T_m)$ are parametric probability distributions. Note that Eq. (1) assumes independence between the different instruments as well as conditional independence between the various joints of a given instrument. Even though both assumptions are quite strong, they provide a convenient decomposition and a model simplification of what would otherwise be a complicated distribution. Letting $\hat{P}$ be the distribution predicted by our CNN and $P$ be a probabilistic interpretation of the ground truth, the cross-entropy loss function can be defined as

$$H(\hat{P}, P) = -\sum_{s \in S} P(s) \log \hat{P}(s) \qquad (2)$$

where $S$ is the probability space over all random variables $(T_1, \ldots, T_M, J_1^1, \ldots, J_1^N, \ldots, J_M^1, \ldots, J_M^N)$. Replacing $\hat{P}$ and $P$ in Eq. (2) with the model of Eq. (1) and simplifying gives rise to

$$H(\hat{P}, P) = \sum_m H\!\left(\hat{P}(T_m), P(T_m)\right) + \sum_m \sum_n H\!\left(\hat{P}(J_m^n \mid T_m = t_m), P(J_m^n \mid T_m = t_m)\right) \qquad (3)$$

To model the ground truth distribution $P$, we let $P(T_m) = 0$ if $t_m = 0$ and $P(T_m) = 1$ if $t_m = 1$, and specify the following likelihood models from the ground truth annotations:

$$\forall n, \forall m, \quad P(J_m^n = j \mid T_m = t_m) = \begin{cases} U(j; 0, wh), & \text{if } t_m = 0 \\ G(j; GT_m^n, \sigma^2 I), & \text{if } t_m = 1 \end{cases}$$


where $U$ is a uniform distribution over the interval $[0, wh]$ and $G$ denotes a Gaussian distribution with mean $GT_m^n$ and covariance $\sigma^2 I$ (i.e. assuming a symmetric and diagonal covariance matrix). We use this Gaussian distribution to account for inaccuracies in the ground truth annotations, such that $P(J_m^n \mid T_m = t_m)$ is a 2D probability map generated from the ground truth, which the network will try to estimate by optimizing Eq. (3). In this work, we fix $\sigma^2 = 10$ for all experiments. That is, our network will optimize both the binary cross-entropy loss for each of the instruments as well as the sum of the pixel-wise probability map cross-entropy losses.
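As an illustration of how these target distributions and the loss of Eq. (3) can be realized, below is a NumPy sketch under our own assumptions about array shapes; the function names, the $\varepsilon$ used to stabilize the logarithms, and the normalization of the Gaussian map are implementation choices not specified in the text.

```python
import numpy as np

def ground_truth_map(gt_xy, w, h, tool_present, sigma2=10.0):
    """Target distribution for one joint (Sec. 2.1): a Gaussian centred on the
    annotation GT_m^n if the tool is present, a uniform map otherwise."""
    if not tool_present:
        return np.full((h, w), 1.0 / (w * h))
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - gt_xy[0]) ** 2 + (ys - gt_xy[1]) ** 2) / (2.0 * sigma2))
    return g / g.sum()  # normalise so the map sums to one

def combined_loss(pres_pred, pres_gt, maps_pred, maps_gt, eps=1e-12):
    """Eq. (3): binary cross-entropy over instrument presence plus the sum of
    pixel-wise cross-entropies between predicted and target joint maps.

    pres_*: (M,) probabilities; maps_*: (M, N, h, w) probability maps.
    """
    bce = -np.sum(pres_gt * np.log(pres_pred + eps)
                  + (1.0 - pres_gt) * np.log(1.0 - pres_pred + eps))
    map_ce = -np.sum(maps_gt * np.log(maps_pred + eps))
    return bce + map_ce
```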

2.2 Multi-Instrument Detector Network

In order to provide a network suited to the loss function of Eq. (3), we modify and extend the U-Net [10] architecture originally used for semantic segmentation. Illustrated in Fig. 1, the architecture uses down- and up-sampling stages, where each stage has a convolutional layer, a ReLU activation and a sampling layer. Here we use a total of 5 down- and 5 up-sampling stages, and a single convolutional layer is used per stage to reduce the computational requirements. The number of features is doubled (down) or halved (up) per stage, starting with 64 features in the first convolutional layer. All convolutional kernels have a size of 3x3, except for the last layer where a 1x1 kernel is used. Batch normalization [5] is applied before every activation layer. In order to provide the output estimates $\hat{P}(T_m)$ and $\hat{P}(J_m^n \mid T_m)$ for all $(m, n)$, we extend this architecture in two ways:

1. We create classification layers stemming from the lowest layer of the network by expanding it with a fully connected classification stage. The expansion is connected to the lowest layer in the network such that this layer learns to spatially encode the instruments. In particular, this stage has one output per instrument, which is activated with a sigmoid activation function to force a probabilistic output range. By doing so, we are effectively making the network provide the estimates $\hat{P}(T_m)$.

2. Our network produces $M \times N$ maps of size $w \times h$ which correspond to each of the $P(J_m^n \mid T_m = 1)$ likelihood distributions. Note that explicitly outputting $P(J_m^n \mid T_m = 0)$ is unnecessary. Each output probability map $\hat{P}(J_m^n \mid T_m = 1)$ is normalized using a softmax function such that the joint position estimate of $GT_m^n$ is equal to $\arg\max_z \hat{P}(J_m^n = z \mid T_m = 1)$.

When combined with the loss function of Eq. (3), this network can be trained to both detect multiple instruments and estimate their joint parts. We implemented this network using the open source TensorFlow library [1] in Python¹.
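The following TensorFlow/Keras sketch illustrates one plausible realization of the modified U-Net described above: 5 down- and 5 up-sampling stages with a single 3x3 convolution each, batch normalization before every ReLU, a sigmoid presence head attached to the lowest layer, and one spatially softmax-normalized map per (instrument, joint) pair. Layer choices such as the global average pooling before the dense presence layer and the exact feature widths are our own assumptions, not the authors' published configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_detector(h=480, w=640, n_instruments=1, n_joints=4, base_features=64):
    """Sketch of a multi-instrument detector network in the spirit of Sec. 2.2."""
    inputs = layers.Input((h, w, 3))

    def conv_bn_relu(x, f):
        x = layers.Conv2D(f, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)   # batch norm before the activation
        return layers.ReLU()(x)

    # Encoder: 5 down-sampling stages, doubling the number of features.
    skips, x = [], inputs
    for i in range(5):
        x = conv_bn_relu(x, base_features * 2 ** i)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = conv_bn_relu(x, base_features * 2 ** 5)  # bottleneck ("lowest layer")

    # Presence head: fully connected sigmoid outputs, one per instrument.
    presence = layers.GlobalAveragePooling2D()(x)
    presence = layers.Dense(n_instruments, activation="sigmoid",
                            name="presence")(presence)

    # Decoder: 5 up-sampling stages with skip connections, halving the features.
    for i, skip in enumerate(reversed(skips)):
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skip])
        x = conv_bn_relu(x, base_features * 2 ** (4 - i))

    # One map per (instrument, joint), softmax-normalised over the image plane.
    maps = layers.Conv2D(n_instruments * n_joints, 1)(x)        # 1x1 kernel
    maps = layers.Reshape((h * w, n_instruments * n_joints))(maps)
    maps = layers.Permute((2, 1))(maps)                         # (M*N, h*w)
    maps = layers.Softmax(axis=-1, name="joint_maps")(maps)
    return tf.keras.Model(inputs, [presence, maps])
```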

3 Experiments

Retinal Microsurgery. We first evaluate our approach on the publicly available in-vivo retinal microsurgery instrument dataset [11]. The set contains 3 video sequences with 1171 images, each with a resolution of 640x480 pixels. Each image contains a single instrument with 4 annotated joints (start shaft, end shaft, left tip and right tip). As in [11], we trained our network on the first 50% of all three sequences and evaluated on the rest.

¹Code and models available at: https://github.com/otl-artorg/instrument-pose


Figure 2: Detection accuracy. (left) Percentage of correctly detected end shaft joints as a function of the accuracy threshold, comparing POSE, DDVT, RTOA and our work. (right) Percentage of correctly detected joints (start shaft, end shaft, left tip, right tip).

Optimization of the network was performed with the Adam optimizer [6] using a batch size of 2 and an initial learning rate of $10^{-4}$. The network was trained for 10 epochs. Training and testing were performed on an Nvidia GTX 1080 GPU, with an inference rate of approximately 9 FPS.
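A hedged sketch of this training configuration (Adam, learning rate $10^{-4}$, batch size 2, 10 epochs) using the `build_detector` sketch above; the dummy tensors stand in for the annotated sequences, and the use of Keras' built-in cross-entropy losses for the two terms of Eq. (3) is our simplification.

```python
import numpy as np
import tensorflow as tf

# Tiny dummy dataset standing in for the annotated sequences.
F, M, N, H, W = 4, 1, 3, 480, 640
train_images = np.zeros((F, H, W, 3), dtype=np.float32)
presence_targets = np.ones((F, M), dtype=np.float32)
map_targets = np.full((F, M * N, H * W), 1.0 / (H * W), dtype=np.float32)

model = build_detector(h=H, w=W, n_instruments=M, n_joints=N)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss={"presence": "binary_crossentropy",         # presence term of Eq. (3)
          "joint_maps": "categorical_crossentropy"}, # pixel-wise map term
)
model.fit(train_images,
          {"presence": presence_targets, "joint_maps": map_targets},
          batch_size=2, epochs=10)
```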

The network was trained on three joints (left tip, right tip and end shaft), while only the end shaft joint was evaluated. Similar to [11, 8, 9], we report the proportion of frames where the end shaft is correctly identified as a function of detection sensitivity. Fig. 2 shows the performance of our approach as well as that of state-of-the-art detection and tracking methods. Our method achieves an accuracy of 96.7% at a threshold radius of 15 pixels, which outperforms the state-of-the-art result of 94.3%. The other two joints (left tip, right tip) achieve accuracies of 98.3% and 95.3%, showing that the method is capable of learning all joint positions together with high accuracy. The mean joint position errors are 5.1, 4.6 and 5.5 pixels. As the dataset includes 4 annotated joints, we also evaluate the performance for all joints and report in Fig. 2 (right) the accuracy of the joints after the network was trained with all 4 joints using the same train-test split. Overall, the performance is slightly lower than when training and evaluating with 3 joints because the 4th joint is the most difficult to detect due to blur and image noise. Fig. 3 depicts qualitative results of our approach, and a video of all results can be found at https://www.youtube.com/watch?v=ZigYQbGHQus.
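As a small sketch of how such accuracy-versus-threshold curves can be computed from predicted and annotated joint positions; the array shapes and threshold range are illustrative.

```python
import numpy as np

def accuracy_curve(pred_xy, gt_xy, thresholds=range(5, 41, 5)):
    """Fraction of frames whose predicted joint lies within a given pixel
    radius of the annotation. pred_xy, gt_xy: (F, 2) arrays of 2D positions."""
    errors = np.linalg.norm(np.asarray(pred_xy, float) - np.asarray(gt_xy, float),
                            axis=1)
    return {t: float(np.mean(errors <= t)) for t in thresholds}
```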

Robotic Laparoscopy. We also evaluated our approach on the instrument tracking dataset of the MICCAI 2015 Endoscopic Vision Challenge². The dataset includes 4 training and 6 testing video sequences. In total, 3 different tools are visible in the sequences: left clasper, right clasper and left scissor, the latter being visible only in the test set. The challenge data include only a single annotated joint (extracted from the operating da Vinci® robot), which is inaccurate in a large number of cases. For this reason, 5 joints (left tip, right tip, shaft point, end point, head point) per instrument in each image were manually labeled and used instead³. Images were resized to 640x512 pixels due to memory constraints when training the network. The training set consists of 940 images and the test set of 910 images. A tool $T_m$ is considered present if at least one joint is annotated.

²https://endovissub-instrument.grand-challenge.org/
³https://github.com/surgical-vision/EndoVisPoseAnnotation


Figure 3: Visual results on retinal microsurgery image sequences 1-3 (top) and laparoscopy sequences (bottom). The first two laparoscopy sequences contain claspers, whereas the right-most contains a scissor and a clasper. The ground truth joints are denoted with green points.


We define the instruments $T_{1 \ldots 4}$ as left clasper, right clasper, left scissor and right scissor. To evaluate our approach, we propose two experiments: (1) using the same training and test data as in the original challenge, with one tool unseen during training; (2) a modified split in which the left scissor is also available during training, obtained by moving sequence 6 of the test set to the training set. By flipping the images in this sequence left-to-right, we augment our training data so as to also obtain the right scissor. Not only does this increase the complexity of the detection problem, but it also allows flipping-based data augmentation to be used, as sketched below.
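A minimal sketch of such a left-right flipping augmentation, assuming joints are stored as an $(M, N, 2)$ array of $(x, y)$ coordinates and presence as an $(M,)$ binary array; the instrument index pairs that get swapped are hypothetical and depend on how the tool slots are ordered.

```python
import numpy as np

def flip_left_right(image, joints, presence, swap_pairs=((0, 1), (2, 3))):
    """Mirror the image and joint x-coordinates, and swap left/right tool slots
    (e.g. left/right clasper and left/right scissor)."""
    h, w = image.shape[:2]
    flipped = image[:, ::-1].copy()
    joints = joints.copy()
    joints[..., 0] = (w - 1) - joints[..., 0]   # mirror x-coordinates
    presence = presence.copy()
    for a, b in swap_pairs:                      # swap left/right identities
        joints[[a, b]] = joints[[b, a]]
        presence[[a, b]] = presence[[b, a]]
    return flipped, joints, presence
```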

Experiment 1. Using the original dataset, we first verified that the network can detect specific tools. As the left scissor has not been seen during training, we expect this tool to be missed. The training set was augmented using left-right and up-down flips. On the test set, only two images were wrongly classified, giving an average detection rate of 99.9% (right clasper 100%, left clasper 99.89%). Evaluation of the joint prediction accuracy was performed as for the microsurgery dataset and the results are illustrated in Fig. 4 (left). The accuracy is over 90% at a threshold of 15 pixels for all joints except the two tips of the left clasper. The lower performance is explained by the left clasper being visible in only 40 frames, and by the method failing on 7 images where the tool tips of both the left and right clasper are in the vicinity of each other or overlapping.

Experiment 2. Here the dataset was modified so that the scissor is also visible in the training set by placing sequence 6 from the test set into the training set. The classification results per instrument are: right clasper 100%, left clasper 100%, right scissor 99.78% and left scissor 99.67%. Fig. 4 (right) shows the joint accuracies for this experiment. The accuracy of the left clasper is slightly improved compared to the previous experiment due to the increased augmented training size. However, the method still fails on the same images as in Experiment 1.


Figure 4: Accuracy threshold curves for Experiment 1 (left) and Experiment 2 (right), per instrument (right clasper, left clasper, right scissor, left scissor) and per joint (left tip, right tip, shaft point, end point, head point).

The scissors show similar results for both left and right, which is to be expected since they come from the same flipped images. Furthermore, the scissor results show that one joint performs poorer than the rest. Upon visual inspection, we attribute this performance drop to inconsistency in our annotations and to the joint not being visible in certain images. Given that our method assumes all joints are visible if a tool is present, detection failures occur when joints are occluded. Due to the larger input image size compared to the retinal microsurgery experiments, the inference rate is lower, at around 6 FPS on the same hardware.

4 Conclusion

We presented a deep learning based surgical instrument detector. The network collectively estimates joint positions and instrument presence using a combined loss function. Furthermore, the network obtains all predictions in a single feed-forward pass. We validated the method on two datasets, an in-vivo retinal microsurgery dataset and an ex-vivo laparoscopy set. Evaluations on the retinal microsurgery dataset showed state-of-the-art performance, outperforming even current tracking methods. Our detector is uninfluenced by previous estimations, which is a key advantage over tracking solutions. The laparoscopy experiments showed that the method is capable of classifying instrument presence with very high accuracy while jointly estimating the positions of 20 joints. This points to our method being able to simultaneously count instruments, estimate joint locations and classify whether instruments are visible, all in a single feed-forward pass.


References

[1] Martín Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. 2015.

[2] Mohamed Alsheakhali, Abouzar Eslami, Hessam Roodaki, and Nassir Navab. CRF-Based Model for Instrument Detection and Pose Estimation in Retinal Microsurgery. Computational and Mathematical Methods in Medicine, 2016.

[3] David Bouget, Rodrigo Benenson, Mohamed Omran, Laurent Riffaud, Bernt Schiele, and Pierre Jannin. Detecting Surgical Tools by Modelling Local Appearance and Global Shape. IEEE Transactions on Medical Imaging, 34(12), 2015.

[4] Xiaofei Du, Maximilian Allan, Alessio Dore, Sebastien Ourselin, David Hawkes, John D. Kelly, and Danail Stoyanov. Combined 2D and 3D tracking of surgical instruments for minimally invasive and robotic-assisted surgery. IJCARS, (6):1109–1119, 2016.

[5] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[6] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2014.

[7] A. Reiter, P. Allen, and T. Zhao. Feature classification for tracking articulated surgical tools. In MICCAI 2012, volume 7511, pages 592–600, 2012.

[8] Nicola Rieke, David Joseph Tan, Mohamed Alsheakhali, Federico Tombari, Chiara Amat di San Filippo, Vasileios Belagiannis, Abouzar Eslami, and Nassir Navab. Surgical Tool Tracking and Pose Estimation in Retinal Microsurgery. In MICCAI 2015, pages 266–273. Springer, Cham, 2015.

[9] Nicola Rieke, David Joseph Tan, Federico Tombari, Josue Page Vizcaíno, Chiara Amat di San Filippo, Abouzar Eslami, and Nassir Navab. Real-Time Online Adaption for Robust Instrument Tracking and Pose Estimation. In MICCAI 2016, pages 422–430. Springer, Cham, 2016.

[10] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In MICCAI 2015, pages 234–241, 2015.

[11] Raphael Sznitman, Karim Ali, Rogerio Richa, Russell H. Taylor, Gregory D. Hager, and Pascal Fua. Data-driven visual tracking in retinal microsurgery. In MICCAI 2012, (Pt 2):568–575, 2012.

[12] Raphael Sznitman, Carlos Becker, and Pascal Fua. Fast Part-Based Classification for Instrument Detection in Minimally Invasive Surgery. In MICCAI 2014, 2014.

[13] Rémi Wolf, Josselin Duchateau, Philippe Cinquin, and Sandrine Voros. 3D Tracking of Laparoscopic Instruments Using Statistical and Geometric Modeling, pages 203–210. Springer Berlin Heidelberg, 2011.
