
Page 1: Learning Phantom Dose Distribution using Regression Artificial …uu.diva-portal.org/smash/get/diva2:1301203/FULLTEXT01.pdf · 2019-04-01 · Learning Phantom Dose Distribution using

UPTEC F 19011

Examensarbete 30 hp, March 2019

Learning Phantom Dose Distribution using Regression Artificial Neural Networks

Mattias Åkesson


Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH unit. Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0. Postal address: Box 536, 751 21 Uppsala. Phone: 018 – 471 30 03. Fax: 018 – 471 30 00. Website: http://www.teknat.uu.se/student

Abstract

Learning Phantom Dose Distribution using Regression Artificial Neural Networks

Mattias Åkesson

Before radiation treatment of a cancer patient can be carried out, the treatment plan produced by the treatment planning system (TPS) needs to undergo quality assurance (QA). The QA includes a pre-treatment QA (PT-QA) on a synthetic phantom body. During the PT-QA, data is collected from the phantom detectors, a set of monitors (transmission detectors) and the angular state of the machine. This thesis project investigates whether it is possible to predict the radiation dose distribution on the phantom body based on the data from the transmission detectors and the angular state of the machine. The motive is that an accurate prediction model could remove the PT-QA from most patient treatments. The prediction difficulties lie in reducing the noise that contaminates the transmission detector signal and in correctly mapping the transmission data to the phantom. The task is solved by modeling an artificial neural network (ANN) that uses a u-net architecture to reduce the noise, combined with a novel model that maps the transmission values to the phantom based on the angular state. The results show a median relative dose deviation of ~1%.

ISSN: 1401-5757, UPTEC F 19011
Examiner: Tomas Nyberg
Subject reader: Salman Toor
Supervisor: Prashant Singh


Acknowledgments

I would like to thank my supervisor Prashant Singh for all the good support and feedback during the development of this project. I would also like to thank Erik Bangtsson from ScandiDos for his expertise in radiation therapy and for the data he provided to the project. Last but not least I want to thank Simon Stromstedt Hallberg, who started this project with me, for all the good discussions and for introducing me to Tensorflow.


Populärvetenskaplig sammanfattning (Popular Science Summary)

Radiation therapy is a popular method for treating cancer patients today. The treatment is delivered with a linear accelerator (linac) that concentrates the radiation on the tumour region while exposing the surrounding healthy tissue to as little radiation as possible. This is possible because the linac can shape and direct the beam from different angular states. Before a radiation treatment can be carried out on a patient, a so-called treatment plan must be computed and quality assured to ensure that the correct radiation dose is delivered to the patient. The quality assurance partly consists of a time-consuming pre-treatment QA on a synthetic cylindrical body (phantom) equipped with detectors. A transmission detector is mounted in front of the mouth of the radiation source to monitor the treatment. This thesis investigates whether the dose distribution on the detectors of the cylindrical body can be computed by training an artificial neural network (ANN) to transform the signal from the transmission detector to the cylindrical body. The main difficulties of the transformation are reducing noise from the transmission detector and finding a correct relation between the transmission detector and all detectors on the cylindrical body as a function of the angular state of the linac. The model developed uses an existing ANN architecture for noise reduction combined with new solutions for transforming the signal from the transmission detector to the cylindrical body. The result shows a relative median error of around 1% for most of the treatment plans in the test set.


Contents

1 Introduction
  1.1 Radiation Treatment
  1.2 Treatment Plan
  1.3 Treatment Plan QA
  1.4 Pre-Treatment QA
  1.5 At Treatment QA
  1.6 Measurement Setup
  1.7 Problem Formulation
  1.8 Further known properties

2 Theory
  2.1 Artificial Neural Network
  2.2 Optimization and back propagation
  2.3 Mini-Batch Stochastic Gradient Descent
  2.4 Dense Layers
  2.5 Convolutional Layers
    2.5.1 2-Dimensional Convolutional Operation
    2.5.2 SAME Padding
    2.5.3 Max Pooling
    2.5.4 Transposed Convolution
  2.6 U-net Architecture
  2.7 Batch Normalization

3 Model
  3.1 Noise Reduction Filter
  3.2 Transformation Function
  3.3 Decay Function
  3.4 The Complete Model
  3.5 Training the Model
  3.6 Design Settings
    3.6.1 U-net hyper-parameters
    3.6.2 Abs. dense layer hyper-parameters
    3.6.3 Batch size
    3.6.4 Optimizer
    3.6.5 Batch norm
    3.6.6 Dropout filter

4 Additional functions
  4.1 Orthogonal Projection Function
  4.2 Absolute Dense Layer

5 Implementation
  5.1 The Data Set
  5.2 Pre-process and Normalization of the Data
  5.3 Software

6 Results
  6.1 Training Progress
  6.2 Dose Accuracy
  6.3 Sub Functions

7 Discussion
  7.1 Alternative Model
    7.1.1 Transformation Matrix
    7.1.2 Picture to Picture
    7.1.3 Modify the Current Model

8 Conclusions


Nomenclature

Acronyms

ANN Artificial Neural Network

AT-QA At-Treatment Quality Assurance

CPC Charged Particle Contamination

linac linear accelerator

MLC Multileaf Collimator

MU Monitor Units

PT-QA Pre-Treatment Quality Assurance

QA Quality Assurance

TPS Treatment Planning System

Data

θ1 gantry angle rotation

θ2 collimator angle rotation

\hat{P} predicted phantom data

P target phantom data

T Transmission data


1 Introduction

Radiation therapy is a popular approach towards treating cancer patients. The treatment involves radiating the affected part of the body using a radiation source. The radiation process must be performed very carefully so that healthy tissues are left untouched by the radiation. In order to achieve this, pre-treatment machine calibration is performed using a 'phantom' that takes the place of the patient and measures the shape of the radiation received. The machine parameters are then tuned in order to achieve the desired shape. This process is often time consuming and expensive, and leads to the machine being unavailable for treatment for different spells of time. This project explores machine learning models as a replacement for the phantom-driven calibration process. The project is in collaboration with the company ScandiDos, which specializes in quality assurance and dosimetry for modern radiation therapy. The following section describes the problem in detail.

1.1 Radiation Treatment

The radiation treatment used for this project involves a linear accelerator (linac). It consists of a radiation source that is mounted on a gantry, which can rotate around the patient. A collimator is also mounted on the gantry, which is used to shape the radiation field in a way that is compliant with the anatomy of the patient being treated. The collimator can rotate around an axis that spans from the radiation source to the linac isocenter, see Figure 1a. This axis is also known as the beam or field axis. These two rotational degrees of freedom are described by the gantry angle rotation and the collimator angle rotation. The collimator consists of one or two pairs of blocks (or jaws or diaphragms) and a multileaf collimator (MLC). The MLC consists of a number of metal (in general tungsten) leaves that can be moved in and out of the radiation field in order to create an appropriate aperture. A typical MLC consists of 60-80 leaf pairs that move independently of each other. The intensity of radiation is described in terms of monitor units (MU). The higher the MU, the higher the dose being delivered. The correlation between MU and the delivered dose is linear.

Figure 1: (a) A schematic description of the linear accelerator. (b) A treatment room view of a linear accelerator.


1.2 Treatment Plan

A treatment plan is a patient-specific set of machine parameters that describes how the radiation is delivered to the patient. A treatment plan consists of one or several fields, where each field consists of a number of control points. A control point is essentially a moment in time where the gantry and collimator angles are specified together with the MLC and block positions, and a specific value in terms of MU. Typically, a treatment field consists of 100 to 200 control points. The treatment plan is generated by a treatment planning system (TPS). The TPS is in general designed to create the plan such that an optimal energy (or dose) is delivered to the tumour while the surrounding healthy tissue is spared. This is done via inverse planning, and it involves a number of models that describe the energy fluence through the linac head, the allowed range of motion of the MLC and blocks, the radiation transport in the patient, etc. In order to verify that these models are correct and that the treatment plan can be delivered by the linac as planned, it is necessary to perform treatment plan quality assurance (QA) to validate it.

1.3 Treatment Plan QA

For a patient-specific treatment plan, the TPS is used to predict the dose that is delivered to a phantom. A phantom is a detector system or measurement device that consists of a body (typically a plastic cylinder) enclosing a set of detectors at specific locations. The phantom dose distribution is sampled in the detector positions, which enables prediction of the discrete dose distribution that can be measured. The phantom is placed on the treatment couch and the treatment plan is delivered. During the treatment, the dose that is delivered to the phantom detectors is measured. This measured dose is compared to the predicted dose, and if the two dose distributions are similar enough, the plan is approved for patient treatment. Note that the dose distribution that is predicted in, and delivered to, the phantom is not the same dose distribution as in the patient. But as long as the predicted and delivered dose distributions in the phantom are equal, we can conclude that the models in the TPS are accurate enough and that the linac can deliver the dose in the way that is modelled.

1.4 Pre-Treatment QA

The procedure described above is often performed as pre-treatment (PT) QA. It is done once and only once before the start of the treatment. Note that the most common type of radiation treatment is based on treatment fractions. This means that the therapeutic dose is not delivered all at once. Instead the treatment is split into a number of fractions, often around 30. In each fraction an nth of the total dose is delivered, if there are n fractions. So a typical treatment plan involves a PT fraction, when the plan is delivered to the phantom and approved for treatment, followed by a number of treatment fractions with the patient present.


1.5 At Treatment QA

When QA is also performed during the treatment fractions, the scenario is known as at-treatment (AT) QA or online QA. This requires that the dose delivery be monitored by a device that can measure the dose while the patient is treated. ScandiDos have developed a detector called Delta4 Discover for this purpose, which is a transmission detector that is mounted on the linac head between the collimator and the patient. The transmission detector measures the energy fluence that is impinging on the patient. The Discover is designed such that the measured signal is mapped to a dose distribution in the phantom geometry. This is done with a plan-specific calibration that is performed during the pre-treatment QA fraction. This is known as PT-AT calibration, and it makes it possible to detect anomalies in the dose delivery that may occur during the different fractions.

1.6 Measurement Setup

The detector system considered in this study consists of one transmission detector (Delta4 Discover) and a cylindrical phantom (Delta4 Phantom+). These two units operate together and/or alone. The Discover unit is mounted on the linac head and the Phantom+ is placed on the patient couch. When the Discover and the Phantom+ operate together in Synthesis mode, the phantom measurement acts as a calibration of the Discover signal. The Discover unit is equipped with 4040 detectors (semiconductor diodes) that are ordered in a regular rectangular grid. In the Phantom+ there are 1069 detectors, which are ordered in two crossing planes. In each plane the detectors are laid out in two grid configurations: there is a fine grid (5 mm spacing) close to the origin and a coarse grid (10 mm spacing) outside a 6 x 6 x 6 cm³ box around the origin.


Figure 2: Simultaneous measurement (Pre-treatment QA) with the Delta4 Dis-cover and the Delta4 Phantom+.

When the Discover and the Phantom+ are set up as in Figure 2, they collect data synchronously. Data packages that consist of the transmission detector signal and the phantom signal (together with the gantry and collimator angle information) are sampled every 25 ms. In this way it is possible to calibrate the signal from the Discover so that it can be interpreted as a discrete phantom dose distribution. In the existing product used for experiments in this work, Delta4 Synthesis, this calibration must be performed once for every treatment plan that is going to be delivered to a patient.

The reason behind the calibration is that the transmission detector signal is highly contaminated by charged particle radiation that originates in the linac head. The photon radiation that is used for the patient treatment is collimated in the linac head by different kinds of metal structures. When the photon radiation is obscured by this metal (and other kinds of materials as well), it is absorbed. In the absorption process high-energy electrons are produced. These electrons constitute the charged particle contamination (CPC) of the radiation that is measured by the transmission detector. These electrons, however, do not reach the patient/phantom surface and do not contribute to the delivered therapeutic dose.


1.7 Problem Formulation

The pre-treatment QA procedure is both time consuming and economically expensive for the clinic, since revenue is only generated while patients are treated on the linac, not while the phantom occupies the machine. The goal of this project is to investigate whether it is possible to predict the phantom dosage of a treatment plan

P_{tp} = \sum_{i=0}^{n} P_i   (1)

based on the set of control points from the transmission detectors \{T_i \mid i = 0, \dots, n\} and the states of the gantry angle rotation \{\theta_1^i \mid i = 0, \dots, n\} and the collimator angle rotation \{\theta_2^i \mid i = 0, \dots, n\}. If it is possible to predict the dose distribution on the phantom accurately enough, the PT-QA procedure could be removed from the treatment plan. The task is solved by modeling an Artificial Neural Network (ANN) that takes the control-point data T_i, \theta_1^i, \theta_2^i as input and outputs a predicted control-point phantom vector

\hat{P}_i = f(T_i, \theta_1^i, \theta_2^i).   (2)

The predicted phantom dosage of a treatment plan is the sum of its control-point fractions

\hat{P}_{tp} = \sum_{i=0}^{n} \hat{P}_i.   (3)
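As a minimal sketch of how the plan-level prediction is assembled from the per-control-point predictions, consider the following. The function `f` below is a hypothetical stand-in for the trained model; the data shapes and values are made up for illustration only.

```python
import numpy as np

def f(T_i, theta1_i, theta2_i):
    # Hypothetical stand-in for the trained model f in Eq. (2);
    # the real model is the ANN described later in the thesis.
    return T_i.mean() * np.ones(4)    # dummy 4-detector phantom vector

n = 3                                 # control points 0..n
T = [np.full(6, i + 1.0) for i in range(n + 1)]   # fake transmission data
theta1 = np.linspace(0.0, 180.0, n + 1)           # gantry angles
theta2 = np.zeros(n + 1)                          # collimator angles

# Eq. (3): plan-level dose is the sum of the control-point predictions.
P_tp = sum(f(T[i], theta1[i], theta2[i]) for i in range(n + 1))
print(P_tp)  # [10. 10. 10. 10.]
```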

1.8 Further known properties

An extra obstacle to tackle while modeling the function in Eq. 2 is the decay of the signal while it propagates from the gantry to the detectors on the phantom. ScandiDos have performed measurements to find out how the radiation decays inside the plastic cylinder that encloses the phantom. The measurements were conducted in water tanks that have similar properties to the plastic. They found that different field sizes from the gantry result in slightly different characteristics in the decay of the signal.


Figure 3: Graph showing the decay of a linac signal transmitted through water for different field sizes. The window between the dotted lines shows the plastic depth range for the signal before reaching the phantom detectors.

2 Theory

This project explores ANN models as a replacement of the phantom-driven QA process. ANNs have delivered promising results for problems involving denoising images. An ANN has the ability to accurately model highly non-linear patterns. In particular, the u-net architecture has delivered good results in previous work on denoising images. The following text explains artificial neural networks in detail.

2.1 Artificial Neural Network

An Artificial Neural Network (ANN) can be seen as a directed graph, where the input signals are directed through a network of nodes to an output of one or more nodes. The nodes inside the ANN are referred to as neurons. The name originates from the idea of trying to mimic the biochemical process of the neurons in the human brain. A neuron takes a set of inputs, multiplies them by a set of trainable weights, and sums the products with each other and an additional weight referred to as the bias:

f(x; w) = \sum_{i} x_i w_i + b = x \cdot w + b,   (4)

where w = \{w_1, \dots, w_n, b\}. The output of each neuron can also be an input node to another neuron. The number of neurons an input signal has to pass on its way to the output is referred to as the depth of the model. The higher depth an

6

Page 14: Learning Phantom Dose Distribution using Regression Artificial …uu.diva-portal.org/smash/get/diva2:1301203/FULLTEXT01.pdf · 2019-04-01 · Learning Phantom Dose Distribution using

ANN has, the more complexity lies in the model, i.e. the higher the ability of the model to learn non-linear patterns [3]. If the neuron operation comprised simply a linear transformation of input nodes to output nodes, a higher-level system of neurons could be rewritten as a single-level system. To accomplish a more complex model (a non-linear transformation), the linearity of the neuron has to be broken. To achieve this, an activation function is applied to the output of each neuron. The activation function can be any non-linear function with a well-defined derivative. Popular activation functions are the sigmoid function, f(x) = 1 / (1 + e^{-x}), and the rectified linear unit (ReLU), f(x) = max(0, x). The neuron operation for a general activation function can be written as

f_n(x; w) = a(x \cdot w + b).   (5)

Figure 4 shows a schematic view of a neuron.

Figure 4: A schematic view of a neuron in an ANN: the input nodes are multiplied by the weights w_1, ..., w_n, summed together with the bias b, and passed through the activation function f_a to the output node.
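The neuron operation in Eq. (5) can be sketched in a few lines of numpy; the weights and inputs below are made-up toy values.

```python
import numpy as np

def relu(x):
    # Rectified linear unit: max(0, x)
    return np.maximum(0.0, x)

def neuron(x, w, b, activation=relu):
    # Eq. (5): a(x . w + b) -- weighted sum plus bias,
    # passed through a non-linear activation function.
    return activation(np.dot(x, w) + b)

x = np.array([1.0, -2.0, 0.5])   # toy inputs
w = np.array([0.4, 0.1, -0.6])   # toy weights
b = -0.2                         # bias
print(neuron(x, w, b))  # pre-activation is -0.3, so ReLU clips it to 0.0
```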

2.2 Optimization and back propagation

ANN models are a powerful learning tool and enable modeling of complex datasets arising from diverse fields. The datasets typically comprise input-output relationships wherein certain variables form the input to the model, while the output is represented by certain target variables. For a classification task the output would be correct classification predictions; for a regression task the output would be predicted values similar to the target values. For the ANN to realize this, an objective function needs to be defined. An objective


function is the optimization (minimization or maximization) of a scalar function referred to as the loss or cost function. The cost function can take different forms depending on the task. For a regression task a common cost function is the summed squared error, defined as

C = \frac{1}{m} \sum_{j=0}^{m} \left( \hat{P}^j - P^j \right)^2 = \frac{1}{m} \sum_{j=0}^{m} \sum_{i=0}^{n} \left( \hat{P}_i^j - P_i^j \right)^2,   (6)

where m is the number of records and n is the number of outputs of the model. To update the weights of the graph, the cost function is fed to an optimizer. In the context of ANNs, an optimizer typically refers to the gradient descent optimizer. It does three things:

1. it propagates a number of input records through the graph to the cost function and stores all of the neuron outputs,

2. it computes the gradient of the cost with respect to w, defined as

\nabla C = \left\{ \frac{\partial C}{\partial w_1}, \dots, \frac{\partial C}{\partial w_n} \right\},   (7)

3. it updates the weights with a learning rate \alpha:

w := w - \alpha \nabla C.   (8)
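As a minimal sketch of the cost in Eq. (6) and the update rule in Eq. (8), consider fitting a toy linear model with gradient descent; the data sizes and the linear model are illustrative assumptions, not the thesis model.

```python
import numpy as np

def cost(P_hat, P):
    # Eq. (6): squared error summed over the n outputs,
    # averaged over the m records.
    return np.sum((P_hat - P) ** 2) / P.shape[0]

# Toy linear model P_hat = X @ w, trained with the update in Eq. (8).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))      # m=100 records, 3 inputs
w_true = np.array([1.0, -2.0, 0.5])
P = X @ w_true                         # noise-free targets

w = np.zeros(3)
alpha = 0.1
for _ in range(500):
    grad = -2.0 / len(X) * X.T @ (P - X @ w)  # gradient of Eq. (6)
    w = w - alpha * grad                      # Eq. (8)

print(np.round(w, 3))  # recovers approximately [1.0, -2.0, 0.5]
```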

The gradient of the cost function in Eq. 7 is derived with a method called backpropagation. Because the neurons in the graph are just analytical functions with well-defined derivatives with respect to the weights, and the graph is just a network of neurons connected to a cost function that is also differentiable, it is possible to compute the gradients of the cost function with respect to all the weights. Before describing backpropagation it is convenient to define the following notation:

• i_i := input vector of neuron i

• i_{i,j} := input j of neuron i

• o_i := output of neuron i

• w_i := weight vector of neuron i

• w_{i,j} := weight j of neuron i

The cost function Eq. 6 can be written as

C = \frac{1}{m} \sum_{j=0}^{m} \sum_{o_i \in \hat{P}^j} \left( P_i^j - o_i^j \right)^2,   (9)


where o_i^j = \hat{P}_i^j. The derivative of C with respect to any weight w_{k,l} in the graph can be written as

\frac{\partial C}{\partial w_{k,l}} = -\frac{2}{m} \sum_{j=0}^{m} \sum_{o_i \in \hat{P}^j} \left( P_i^j - o_i^j \right) \frac{\partial o_i^j}{\partial w_{k,l}}.   (10)

The factor \partial o_i^j / \partial w_{k,l} is just the partial derivative of the neuron function with respect to w_{k,l}, derived as

\frac{\partial o_i^j}{\partial w_{k,l}} = \frac{\partial f_n(i_i^j; w_i)}{\partial w_{k,l}} = a'\left( i_i^j \cdot w_i + b_i \right) \frac{\partial \left( i_i^j \cdot w_i \right)}{\partial w_{k,l}},

\frac{\partial \left( i_i^j \cdot w_i \right)}{\partial w_{k,l}} =
\begin{cases}
i_{k,l}^j & \text{if } w_{k,l} \in w_i, \\
\sum_{j'} w_{i,j'} \, \frac{\partial i_{i,j'}^j}{\partial w_{k,l}} & \text{if the inputs } i_{i,j'}^j \text{ depend on } w_{k,l}, \\
0 & \text{else,}
\end{cases}   (11)

where a' is the derivative of the activation function and the middle case applies the chain rule to inputs that are themselves outputs of earlier neurons.

2.3 Mini-Batch Stochastic Gradient Descent

Performing gradient descent on a complete training set in one iteration can in many cases be computationally infeasible if the training set is too large. The computational cost increases linearly with the number of records it acts on. A solution to this problem is to split the entire training set into equal sub sets and update the weights on each of them. In machine learning those sub sets are referred to as batches or mini-batches, depending on their size. The extreme case of batch size one is called stochastic gradient descent (SGD). Given enough iterations SGD will improve the model, but it is very noisy. A compromise is to select a small batch size (10-1000), which is called mini-batch gradient descent.
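The splitting of the training set into mini-batches can be sketched as follows; the data and batch size are toy values, and shuffling once per epoch is a common (assumed) convention.

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    # Shuffle the records, then yield equally sized sub sets
    # (mini-batches) of the training set; the last batch may be smaller.
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)   # 10 toy records
y = np.arange(10, dtype=float)
sizes = [len(xb) for xb, _ in minibatches(X, y, batch_size=4, rng=rng)]
print(sizes)  # 10 records in batches of 4 -> [4, 4, 2]
```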

2.4 Dense Layers

The simplest ANN model is the Dense Neural Network or Fully Connected Neural Network (FC). It consists of several layers of neurons, each fully connected to all neurons of the previous layer. The layers between the input and output are called hidden layers. The name comes from the fact that the output of the intermediate layers is hidden from the user.


Figure 5: A schematic view of a FC network with 3 input nodes, 3 layers of hidden fully connected neurons of various sizes, and a single-valued output node.
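A forward pass through stacked dense layers can be sketched as below; the layer sizes mirror the spirit of Figure 5 but the random weights and ReLU activation are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def dense_forward(x, layers):
    # Forward pass through fully connected layers: every neuron in a
    # layer sees all outputs of the previous layer (x @ W + b).
    for W, b in layers:
        x = relu(x @ W + b)
    return x

rng = np.random.default_rng(0)
# 3 inputs -> hidden layers of 5 and 4 neurons -> 1 output node
shapes = [(3, 5), (5, 4), (4, 1)]
layers = [(rng.standard_normal(s), np.zeros(s[1])) for s in shapes]
out = dense_forward(np.ones(3), layers)
print(out.shape)  # a single-valued output node
```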

2.5 Convolutional Layers

For input data that has spatial dimensions, e.g. sound files (1-dim) or images (2-dim), a convolutional neural network (CNN) can be used. The concept is that kernels (trainable filters) perform convolutional operations on the input data and output a feature map. The feature map can be seen as a higher-level meta image, where the values correspond to patterns from the input layer. The idea of convolutional layers is to extract patterns from their input. A common use of convolutional layers is to pass data through several levels of operations to extract increasingly complex patterns. A big advantage over classical dense layers is that convolutional layers share weights (kernel parameters) over the whole input space, whereas the connections in a FC network are unique. This saves a lot of computational cost and memory, and therefore also reduces overfitting compared to a FC network of comparable size. There are various dimensions of convolutional operations, but in this context (image data) it refers to the 2-dimensional convolutional operation. A kernel is a 3-dimensional filter with trainable parameters. The sizes of the first two dimensions (x-, y-) of the kernel are hyper-parameters defining the spatial shape of the filter; the third dimension takes the shape of the number of channels of the incoming image, e.g. a gray-scale image has 1 channel, an RGB-color image has 3 channels, and the output of a previous convolutional layer can have any number of channels (defined by the number of kernels in that previous layer).

2.5.1 2-Dimensional Convolutional Operation

Figure 6 visualizes a 2-dimensional convolutional operation. The kernel K slides across the width and height of the input image I, computing the dot product between the entries of the kernel and the active (red shadowed) part of the image, resulting in a 2-dimensional feature map. A convolutional layer consisting of a set of k kernels will generate k 2-dimensional feature maps. These are concatenated into one 3-dimensional feature map where the third dimension acts as the channel depth.

I =
0 1 1 1 0 0 0
0 0 1 1 1 0 0
0 0 0 1 1 1 0
0 0 0 1 1 0 0
0 0 1 1 0 0 0
0 1 1 0 0 0 0
1 1 0 0 0 0 0

K =
1 0 1
0 1 0
1 0 1

I * K =
1 4 3 4 1
1 2 4 3 3
1 2 3 4 1
1 3 3 1 1
3 3 1 1 0

Figure 6: A convolutional operation between a 1-channel-deep image I and a 3x3x1 kernel K.
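The operation in Figure 6 can be reproduced with a short sketch (stride 1 and no padding assumed):

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image (stride 1, VALID padding) and
    # take the dot product at every position.
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

I = np.array([[0,1,1,1,0,0,0],
              [0,0,1,1,1,0,0],
              [0,0,0,1,1,1,0],
              [0,0,0,1,1,0,0],
              [0,0,1,1,0,0,0],
              [0,1,1,0,0,0,0],
              [1,1,0,0,0,0,0]])
K = np.array([[1,0,1],
              [0,1,0],
              [1,0,1]])
print(conv2d_valid(I, K))  # 5x5 feature map, as in Figure 6
```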

The shape of the feature map depends on the stride length of the convolutional operation and the size of the kernel. If the padding is set to VALID it is defined as

feature map x-length (VALID) = ceil( (input x-length - kernel x-length + 1) / stride-length x ),
feature map y-length (VALID) = ceil( (input y-length - kernel y-length + 1) / stride-length y ).   (12)

The next section describes padding.

2.5.2 SAME Padding

The key to restoring the size of the output feature map after a convolutional operation (under the assumption that the stride is 1 in both directions) is to pad the input image with zeros at the borders. Adding enough padding to restore the spatial dimension is called SAME padding. The shape of the feature map with SAME padding is calculated as

feature map x-length (SAME) = ceil( input x-length / stride-length x ),
feature map y-length (SAME) = ceil( input y-length / stride-length y ).   (13)

VALID padding is the same as no padding.
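The two shape rules in Eqs. (12) and (13) can be collected in one small helper, sketched here for a single spatial dimension:

```python
import math

def out_length(n, k, stride, padding):
    # Eq. (12)/(13): spatial output length of a convolution for
    # VALID (no padding) and SAME (zero-padded) modes.
    if padding == "VALID":
        return math.ceil((n - k + 1) / stride)
    if padding == "SAME":
        return math.ceil(n / stride)
    raise ValueError(padding)

print(out_length(7, 3, 1, "VALID"))  # 5, matching the 7x7 input in Figure 6
print(out_length(7, 3, 1, "SAME"))   # 7, the spatial size is restored
```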


2.5.3 Max Pooling

A common companion to the convolutional layer is a pooling layer applied after the operation. A pooling layer downsamples the feature map, reducing its dimensionality and allowing assumptions to be made about the features contained in the binned sub-regions. A pooling operation reduces the computational cost, counteracts over-fitting and achieves spatial invariance of the input. The most common pooling layer is max pooling; it outputs a reduced-size feature map containing the local maximum values, see Figure 7.

4 0 7 2
1 7 0 4      2x2 max pool      7 7
2 4 5 2      ------------>     8 5
6 8 3 4

Figure 7: A 2x2 max pooling filter with stride [2,2].
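The max pooling in Figure 7 can be sketched directly in numpy (non-overlapping windows assumed):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    # Downsample by taking the local maximum in each window.
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for i in range(ow):
            out[y, i] = x[y*stride:y*stride+size, i*stride:i*stride+size].max()
    return out

x = np.array([[4,0,7,2],
              [1,7,0,4],
              [2,4,5,2],
              [6,8,3,4]])
print(max_pool(x))  # [[7. 7.] [8. 5.]], as in Figure 7
```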

2.5.4 Transposed Convolution

The transposed convolution operation is a backward version of the convolutionoperation. Here the input is a feature map and the kernels in the transposedconvolution layer get up-sampled to the output. The transposed convolution isprimarily designed to decode information from features to images. The typicaldesign is 2x2 kernel with stride 2 transforming the input to doubles the spatialdimensions and often bisects the feature space.

F (3x3):

2 3 0
5 0 1
2 2 6

K (2x2):

1 0
0 1

F ∗ K (6x6):

2 0 3 0 0 0
0 2 0 3 0 0
5 0 0 0 1 0
0 5 0 0 0 1
2 0 2 0 6 0
0 2 0 2 0 6

Figure 8: A transposed convolution operation between a 1-channel-deep feature map F and a 2x2x1 kernel K.
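A naive implementation makes the scattering behind Figure 8 explicit: each input pixel stamps a scaled copy of the kernel onto a stride-spaced output grid. This NumPy sketch (not the thesis implementation, which uses TensorFlow's built-in op) reproduces the figure exactly:

```python
import numpy as np

def transposed_conv2d(F, K, stride=2):
    """Naive transposed convolution: each input pixel scatters a copy of
    the kernel K, scaled by its value, into the stride-spaced output grid."""
    h, w = F.shape
    kh, kw = K.shape
    out = np.zeros((stride * (h - 1) + kh, stride * (w - 1) + kw))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + kh,
                j * stride:j * stride + kw] += F[i, j] * K
    return out
```

With the 3x3 map F and the 2x2 identity kernel from Figure 8 this yields the 6x6 output shown there, illustrating how a stride-2 transposed convolution (roughly) doubles the spatial dimensions.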

2.6 U-net Architecture

The U-net model was developed for biomedical image segmentation at the Computer Science Department of the University of Freiburg, Germany [5]. The u-net is an encoder/decoder-like model: it encodes the input to a spatially low, feature-rich space, referred to as the bottleneck representation of the record. The bottleneck can be seen as a feature state of the incoming spatially organized data (image data). From the bottleneck it decodes back to a spatially high dimension. The encoder is composed of several levels, each consisting of 2 convolutional operations, where the first doubles the feature space. Each level except the last ends with a max pooling layer that halves the spatial dimension. The model builds up higher orders of features after each level, enabling more complex patterns to be learned. The decoding part is level-symmetrical to the encoder. Each decode level begins with a transposed convolutional layer that doubles the spatial dimension and halves the feature dimension. The output is concatenated with the feature map from the corresponding encoding level (the second convolutional layer). The concatenated layer passes through a convolutional layer before entering the next decode level. See Figure 10 for a visualization of the u-net architecture used in this project. The name is self-explanatory when observing the figure: the shape of the graph forms a 'U'. Note that the original paper [5] used VALID padding in the convolutional operations, making the spatial dimensions differ at the concatenation; this was solved by cropping the encoded feature map to fit the decoded feature map. By using SAME padding in this work, the spatial dimension is restored during each convolutional operation and no cropping is needed. Also, the u-net designed in this project uses only a single convolutional layer after the concatenation, in contrast to the design in [5], which uses 2 convolutional operations.

2.7 Batch Normalization

Batch normalization [6] is an algorithm made primarily for speeding up training. During model training, the weight updates from previous layers affect the distribution of the layers next in line. This causes disturbance when fitting the weights of the "forward" layers to operate with the corresponding activation functions. Batch normalization solves this by first normalizing the outputs of a layer (before the activation function), subtracting the batch mean and dividing by the batch standard deviation, then re-scaling by two trainable parameters γ and β, see Algorithm 1. The idea is that γ and β now control the output distribution that goes into the activation function (γ is the standard deviation and β is the mean), independently of the distribution shifts from previous layers. The mini-batch mean and mini-batch variance are only calculated during training; at inference the model uses a moving mean and variance updated during training according to Eq. 14. Here k is a hyper-parameter defined between 0 and 1. The value of k determines how much the current mini-batch statistics (µB, σB) influence the update of the mean µ and variance σ used at inference. A typical choice is k = 0.99, which means 1% of the mean and variance is affected by the current mini-batch statistics.


Input: values of x over a mini-batch: B = {x1...m}; parameters to be learned: γ, β
Output: {yi = BNγ,β(xi)}

µB ← (1/m) Σi=1..m xi // mini-batch mean
σ²B ← (1/m) Σi=1..m (xi − µB)² // mini-batch variance
x̂i ← (xi − µB) / √(σ²B + ε) // normalize
yi ← γx̂i + β ≡ BNγ,β(xi) // scale and shift

Algorithm 1: Batch normalizing transform, applied to activation x over a mini-batch (from the original batch-norm paper).

µ ← kµ + (1 − k)µB
σ ← kσ + (1 − k)σB (14)
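Algorithm 1 and the moving-statistics update of Eq. 14 are compact enough to sketch directly in NumPy (function names are mine; the thesis uses TensorFlow's batch-norm layers):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Algorithm 1: normalize a mini-batch (rows = samples), then scale by
    gamma and shift by beta. Also returns the batch statistics needed for
    the moving-average update of Eq. 14."""
    mu_b = x.mean(axis=0)
    var_b = x.var(axis=0)
    x_hat = (x - mu_b) / np.sqrt(var_b + eps)
    return gamma * x_hat + beta, mu_b, var_b

def update_moving_stats(mu, var, mu_b, var_b, k=0.99):
    """Eq. 14: exponential moving statistics used at inference time."""
    return k * mu + (1 - k) * mu_b, k * var + (1 - k) * var_b
```

After the transform, each feature column of the output has mean β and standard deviation (approximately) γ, regardless of the input distribution.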

3 Model

The model is an ANN, i.e. a graph-based function with trainable parameters. Two assumptions were made about the physical dependency between the incoming data from the transmission image T and the angle state θ1, θ2, and the outcome phantom detectors P, i.e. {T, θ1, θ2} → P:

• The value of a phantom detector Pi depends on only one point on the transmission image, and that point depends on the angular state, i.e. Pi = f(Txi,yi), where Txi,yi is an interpolated point at position (xi, yi) on the transmission image T and the point coordinate is determined by the angular state: (xi, yi) = fi(θ1, θ2).

• The value of a phantom detector Pi can be predicted as the product of the angle-specified point on the transmission image and an additional factor dependent only on the angles, i.e. Pi = Txi,yi · fi(θ1, θ2).

The main function consists of three smaller task-specific functions. The first is the noise reduction filter, a u-net convolutional network that reduces the contaminated signal from the transmission detectors discussed in Section 1.6. The second is a transformation function that maps the values from the transmission detectors to the phantom detectors, i.e. a unique angle-dependent function for every detector in the phantom that interpolates a point from the noise-reduced transmission image to its corresponding phantom detector. The third is the decay factor, also a unique angle-dependent function for every detector in the phantom, needed because radiation absorption in the plastic of the body reduces the intensity of the signal, as discussed in Section 1.8.


3.1 Noise Reduction Filter

The transmission detectors are placed in a two-dimensional, evenly distributed 40x101 grid; this can be seen as a one-channel image, the transmission image. The purpose of the reduction filter is to reduce the noise from the transmission detectors, i.e. to reduce the noise from the transmission image. Previous work has shown good results in reducing noise from images using a u-net architecture [1]. The idea with the u-net is that the encoding convolutional layers extract the noise pattern from the input image, and the decoding part subtracts it.

Figure 9: A transmission image sample.

The u-net architecture for this project is shown in Figure 10. Due to the unbalanced relation between the height and width resolution, the transmission image is padded on the left and right with 5 additional zero-valued lines each (40x101 −> 50x101). The first max pool layer has kernel size 1x2 and stride (1,2), so that the following levels in the u-net hierarchy have a more quadratic shape (50x101 −> 50x51). The following max pool layers in the model have kernel size 2x2 and stride (2,2). All of the max pool, convolution and transposed convolution layers have SAME padding, which avoids cropping of the feature maps at the concatenation phase of the model. The noise reduction filter can be described as the function

fn.r.(Traw; wn.r.) = Tnoise reduced, (15)

where wn.r. are the trainable weights i.e. the kernels in the convolution layers.
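The encoder shapes in Figure 10 follow directly from the SAME-padded pooling arithmetic of Eq. 13. A small pure-Python bookkeeping sketch (the model itself was built in TensorFlow; this only checks the shape arithmetic) reproduces the resolutions down to the 2x2 bottleneck:

```python
import math

def unet_encoder_shapes(h=50, w=101, channels=8, pools=6):
    """Track (height, width, channels) down the encoder of Figure 10.
    The first pool is 1x2 (width only), the remaining pools are 2x2;
    all use SAME padding, so each pooled length is ceil(length / 2)."""
    shapes = [(h, w, channels)]
    # first pool: 50x101 -> 50x51, channels double
    w = math.ceil(w / 2)
    channels *= 2
    shapes.append((h, w, channels))
    # remaining 2x2 pools, channels doubling each level
    for _ in range(pools - 1):
        h, w = math.ceil(h / 2), math.ceil(w / 2)
        channels *= 2
        shapes.append((h, w, channels))
    return shapes
```

Running it gives 50x101x8, 50x51x16, 25x26x32, 13x13x64, 7x7x128, 4x4x256 and the 2x2x512 bottleneck, matching the figure.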


[Figure: u-net diagram. Encoder feature-map shapes: 50x101x1 (input) → 50x101x8 → 50x51x16 → 25x26x32 → 13x13x64 → 7x7x128 → 4x4x256 → 2x2x512 (bottleneck); the decoder mirrors the shapes back up to the 50x101x1 output, with skip concatenations at each level. Legend: 3x3 convolution layer, ReLU, batch-norm, SAME padding, stride (1,1); 2x2 max pool, SAME padding, stride (2,2); 1x2 max pool, SAME padding, stride (1,2); 2x2 transposed convolution layer, ReLU, batch-norm, SAME padding, stride (2,2); 1x2 transposed convolution layer, ReLU, batch-norm, SAME padding, stride (1,2); concatenating parts. Input: raw transmission image. Output: noise-reduced transmission image.]

Figure 10: A schematic view of the u-net architecture used for this project.


3.2 Transformation Function

Figure 11: Visualization of orthogonal radiation propagation from the transmission to the phantom.

The transformation function is created under the assumption that the radiation beams propagate almost orthogonally to the transmission plane when hitting the phantom detectors. Under this assumption it is valid to assume that each detector in the phantom depends on one specific, angle-dependent point in the transmission image; e.g. phantom detector i maps its value from a point (xi, yi) on the transmission image determined by a point function f(i)p(θ1, θ2) = (xi, yi). The point function for all phantom detectors can be written as

fp(θ1, θ2) = {(xi, yi) | i = 1, ..., 1069}. (16)


The coordinates output by the point function are then used to interpolate the values from the transmission image to the phantom detectors as

ft.f.(T, θ1, θ2) = fBL(T, fp(θ1, θ2)) = v = {vi | i = 1, ..., 1069}, (17)

where fBL is a bi-linear interpolation function that takes the coordinates from fp(θ1, θ2), interpolates them on the transmission image T and outputs the vector v with the interpolated value for each phantom detector. The point function Eq. 16 can be rewritten as

fp(θ1, θ2; wt.f.) = frot (θ2, fproj.(θ1; wt.f.)) , (18)

where frot is the rotation function and fproj. is the projection function. The projection function can be written as

fproj.(θ1; wt.f.) = fo.p.(θ1) + fcorr.(θ1; wt.f.) = {(xproj,i, yproj,i) | i = 1, ..., 1069}. (19)

It consists of two terms. The first term is the orthogonal projection function fo.p.; it takes the gantry angle rotation θ1 as input and calculates the points in the transmission image that would project a line orthogonal to the transmission plane onto each detector in the phantom (if the collimator rotation θ2 = 0), see Figure 11 for a visual interpretation. The second term fcorr. is the correction term; it takes the gantry angle rotation θ1 as input and outputs a tuple of coordinates. wt.f. is the set of trainable weights in the dense neural network described in Figure 12. Note that only the correction term is trainable. The coordinates from Eq. 19 are then fed to the rotation function

frot(θ2, x, y) = [ cos(θ2)   sin(θ2) ] [ x ]
                [ −sin(θ2)  cos(θ2) ] [ y ]
              = {(xrot,i, yrot,i) | i = 1, ..., 1069}, (20)

where each coordinate tuple is rotated by the collimator rotation angle θ2. The rotation function is simply an affine rotation transformation, i.e. a matrix multiplication between a rotation matrix and the coordinate tuples. The transformation function Eq. 17 can now, with the input of Eq. 18, be written as

ft.f.(T, θ1, θ2; wt.f.) = v = {vi | i = 1, ..., 1069}. (21)
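The geometric core of Eqs. 17-20 — rotate the projected coordinates by the collimator angle, then sample the transmission image bilinearly — can be sketched in NumPy. The thesis implements these as differentiable TensorFlow operations; the function names and the axis convention below (x indexing columns, y indexing rows) are my assumptions:

```python
import numpy as np

def rotate(theta2, x, y):
    """Eq. 20: rotate coordinate arrays by the collimator angle theta2."""
    c, s = np.cos(theta2), np.sin(theta2)
    return c * x + s * y, -s * x + c * y

def bilinear_sample(T, x, y):
    """f_BL in Eq. 17: sample image T at fractional (x, y) coordinates,
    weighting the four surrounding pixels by their overlap."""
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    dx, dy = x - x0, y - y0
    return (T[y0, x0] * (1 - dx) * (1 - dy)
            + T[y0, x0 + 1] * dx * (1 - dy)
            + T[y0 + 1, x0] * (1 - dx) * dy
            + T[y0 + 1, x0 + 1] * dx * dy)
```

Chaining the two (rotate the 1069 projected points, then sample) gives the vector v of Eq. 21 for a fixed decay factor of one.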


[Figure: a fully connected network taking θ1 as input, passing it through an abs. dense hidden layer, and outputting the coordinate pairs (x1, y1), ..., (xi, yi), ..., (x1069, y1069).]

Figure 12: The dense neural network used for the correction function. It takes a single-valued argument (θ1) as input and outputs 2138 values (1069 coordinate points).

3.3 Decay Function

The decay function has the same input and output shape as the transformation function and is designed similarly to the correction term of the transformation function. It uses the same dense neural network, but the output shape is 1069 instead of 2138. The decay function can be written as

fd.y.(θ1, θ2; wd.y.) = d = {di | i = 1, ..., 1069}. (22)

3.4 The Complete Model

The noise reduction filter (Eq. 15), the transformation function (Eq. 21) and the decay function (Eq. 22) can now be put together as one complete function

fc.f.(T, θ1, θ2; w) = ft.f.(fn.r.(T; wn.r.), θ1, θ2; wt.f.) ⊙ fd.y.(θ1, θ2; wd.y.) = P̂ = {P̂i | i = 1, ..., 1069}. (23)

T is the raw transmission image, P̂ is the predicted phantom vector, w = {wn.r., wt.f., wd.y.} is the complete set of all the trainable weights and ⊙ is the Hadamard product. The cost function

C = (P − P̂)² = (P − fc.f.(T, θ1, θ2; w))², (24)

is defined as the summed squared difference between the target phantom vector P and the predicted phantom vector P̂. The cost function is fed to an optimizer, and the model is then ready to be trained.

3.5 Training the Model

During training, two data sets are available: a training set on which the model is trained, and a validation set on which the model is evaluated after each epoch of training. Each evaluation compares its result to the prevailing best result; if the new evaluation result is an improvement, the model is saved as the best result. The model is also saved after each epoch (regardless of the evaluation result) as the current model. The idea with saving two versions of the model is that the best-result model can be used to predict new data, while the current version can be trained in parallel and fed with new training data. The reason not to train only on the best-result model is that this would cause the model to overfit the validation set after some time. There are many ways of scoring the result. The most trivial would be the mean cost value or mean error over all records in the validation set. But the objective is not to produce the lowest cost or error record-wise; the objective is to produce the smallest absolute relative dose deviation, i.e. the deviation between the summed prediction and the summed target divided by the summed target over a whole treatment plan (tp), defined as

Dtp = abs( Σj∈tp P̂j − Σj∈tp Pj ) ⊙ abs( 1 / Σj∈tp Pj ) = {Dtp,i | i = 1, ..., 1069}. (25)

The relative dose deviation Dtp consists of 1069 values (one for every phantom sensor). The mean value of Dtp can give a misleading picture of the magnitude of the deviation, because some detectors report inflated means. A better way of representing the dose error is to first calculate the median relative dose deviation for every treatment plan in the validation set and then take the mean over every treatment plan median Dtp, defined as

result = (1 / len(V)) Σk∈V median(Dk), (26)

where V is the set of treatment plans in the validation set and len(V) is the number of treatment plans in the validation set.
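The mean-of-medians scoring of Eq. 26 is a one-liner in NumPy; a sketch (the function name is mine):

```python
import numpy as np

def validation_score(deviations_per_plan):
    """Eq. 26: mean over treatment plans of the per-plan median absolute
    relative dose deviation (one array of 1069 values per plan, Eq. 25)."""
    return np.mean([np.median(d) for d in deviations_per_plan])
```

Taking the median inside each plan before averaging keeps a handful of inflated detector values from dominating the score.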

3.6 Design Settings

The number of hyper-parameters in this model is large; due to time limitations, no systematic hyper-parameter optimization algorithm has been explored. During the design process, however, many different hyper-parameter versions were compared, and this section discusses how the current model was chosen.

3.6.1 U-net hyper-parameters

The shape of the u-net is crucial for the performance of the model. A less dense version, with a small number of channels (fewer kernels in the convolutional layers), is computationally cheaper but may not reduce the noise properly. The best performance was found with an eight-level-deep u-net, i.e. a bottleneck layer with spatial dimension 2x2. With the first convolutional layer consisting of 8 kernels, the bottleneck has a feature dimension of 512. Denser models did not perform better.

3.6.2 Abs. dense layer hyper-parameters

The abs. dense layer is used in both the decay function and the correction term in the transformation function. The shape of the abs. dense layer was not found to be critical for the performance of the model (computation- and precision-wise) as long as it was within the range 100-500. A minor improvement was detected when the angle parameters were initialized evenly between 0 and 2π.

3.6.3 Batch size

With computational efficiency in mind, only batch sizes that are powers of two were tested. 2^7 = 128 was found to perform best.

3.6.4 Optimizer

The optimizer used in the model was the Adam optimizer. Due to the individual adaptive learning rates of the Adam optimizer [2], no decay of the learning rate was necessary. The default learning rate 0.001 was the best of the tested values (0.01, 0.001, 0.0001).

3.6.5 Batch norm

Batch norm was used in its extended version, batch renormalization [4], allowing the training and inference modes to use the same moving statistics (µ and σ). The paper recommends that the additional hyper-parameters rmax and dmax start at 1 and 0 and then slowly increase to 3 and 5. The model training was found to be faster by setting them free from the start (rmax = inf and dmax = inf).

3.6.6 Dropout filter

The model was first integrated with dropout filters [7] after each convolution layer in the noise reduction filter. The dropout did not improve the performance on the validation set (or the training set), so it was removed from the model.


4 Additional functions

This section contains algorithms designed during the course of this project.

4.1 Orthogonal Projection Function

The orthogonal projection function is used to find the coordinates on the un-rotated transmission plane from which a phantom detector would be irradiated, under the assumption that the radiation beam propagates orthogonally to the transmission plane. For this, the coordinates of each phantom detector are needed. This is the only part of the complete model where additional information from the phantom coordinates is used. The orthogonal projection function fo.p., used as the first term in the projection function Eq. 19, is defined as

fo.p.(θ1) = (xt, yt),
xt = xp cos(θ1) − zp sin(θ1),
yt = yp, (27)

where (xp, yp, zp) are the coordinate tuples for the phantom detectors.
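Eq. 27 translates directly into vectorized NumPy (a sketch; the thesis version is a TensorFlow op, and the function name is mine):

```python
import numpy as np

def orthogonal_projection(theta1, xp, yp, zp):
    """Eq. 27: project phantom detector coordinates onto the un-rotated
    transmission plane for gantry angle theta1. The y coordinate (along
    the rotation axis) is unchanged."""
    return xp * np.cos(theta1) - zp * np.sin(theta1), yp
```

At θ1 = 0 the projection returns xp unchanged; at θ1 = π/2 the detector's depth −zp becomes its transmission-plane x coordinate.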

4.2 Absolute Dense Layer

This layer was designed to transform single-valued inputs to multiple-valued outputs, where each output oi gives a measurement of how close the input is to some center value θc,i. The layer architecture is similar to the classical dense neuron layer, but with abs. neurons instead. The abs. neuron acts as a hat function with 3 trainable parameters: θc, Vmax, k. The abs. neuron is defined as

fabs(θ) = relu(Vmax − |θc − θ| · |k|) (28)

and plotted in Figure 13.
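Eq. 28 is a one-line activation; a NumPy sketch (function name is mine):

```python
import numpy as np

def abs_neuron(theta, theta_c, v_max, k):
    """Eq. 28: a 'hat' activation peaking at theta_c with height v_max
    and slopes of magnitude |k| on both sides; zero elsewhere (relu)."""
    return np.maximum(0.0, v_max - np.abs(theta_c - theta) * np.abs(k))
```

With the parameters of Figure 13 (θc = 4, Vmax = 0.75, k = 2) the neuron outputs 0.75 at θ = 4 and falls to zero for |θ − 4| ≥ 0.375, so each neuron in the layer responds only to angles near its own center.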


[Figure: plot of the abs. neuron function f(θ) for θ ∈ [0, 3π/2], a hat peaking at Vmax for θ = θc.]

Figure 13: Abs. neuron function with parameters θc = 4, Vmax = 0.75 and k = 2.

5 Implementation

5.1 The Data Set

ScandiDos provided this project with 37 treatment plans with a total of 126 878 control points. A control point contains 4040 transmission detector values, one gantry angle, one collimator angle and 1069 phantom detector values. The smallest treatment plan consists of 1564 control points and the largest plan consists of 7500 points. There are 5 different types of treatment plans in this data set:

• 2 Ani plans (tumors in the rectum and/or the colon).

• 3 Esof plans (esophagus; tumors in the food tract).

• 4 HoN plans (head and neck; tumors in the region above the lungs, but not in the brain).

• 4 Lung plans (lung tumors).

• 24 Prost plans (prostate tumors, in the pelvis region of male patients).

The 37 treatment plans are divided into 3 subsets: training data, test data and validation data. To minimize the dependency between the subsets, the plans are divided evenly between them. The training data set got 1 Ani plan, 2 Esof plans, 3 HoN plans, 2 Lung plans and 19 Prost plans. The validation data set got 1 Esof plan, 1 Lung plan and 1 Prost plan. The test data set got 1 Ani plan, 1 HoN plan, 1 Lung plan and 4 Prost plans.


5.2 Pre-process and Normalization of the Data

A rational assumption is that no detector should have a negative value. Before normalizing the input transmission values and target phantom values, the negative values are therefore set to zero. Two pre-processing goals were deemed critical when choosing the normalization algorithm for the data: avoid negative values, and keep the input transmission data and the target phantom data at similar magnitudes, preferably around magnitude 1. A simple approach would be to scale the transmission data by a factor so that the mean value is of magnitude 1, and to do the same with the phantom data; these constants could preferably be the mean of all transmission data and the mean of all phantom data. Here we chose to normalize the input transmission values by multiplying them by 1e-4 and to multiply the output transmission by 1e+4. The input angles were transformed from degrees to radians.
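The pre-processing steps above can be sketched in a few NumPy lines. This is an assumption-laden reading of the text: the function name and signature are mine, and "output" is read here as the phantom target values; the exact scaling the thesis applied may differ:

```python
import numpy as np

def preprocess(transmission, phantom, angles_deg):
    """Sketch of Section 5.2 pre-processing: clip negatives to zero,
    rescale by the fixed factors quoted in the text, convert angles
    from degrees to radians."""
    t = np.maximum(transmission, 0.0) * 1e-4   # input transmission scaling
    p = np.maximum(phantom, 0.0) * 1e+4        # assumed target scaling
    return t, p, np.deg2rad(angles_deg)
```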

5.3 Software

The project was implemented in the programming language Python using the TensorFlow framework. TensorFlow is an open-source software library for dataflow programming across a range of tasks.

6 Results

The results are divided into three subsections: Training Progress, Dose Accuracy and Sub Functions. The training progress plots, Figures 14, 15 and 16, show the progress over the control points during training. During training, the validation data set was used for quality measurements after each training epoch. Measurements on the training data set were made while training over the epoch, so the training set result at epoch n is a mean result between epoch n − 1 and epoch n. One could argue that the indexing on the training set should state epoch n − 1/2 instead of n, but over the 51 epochs in the visualized plots it does not differ much. The error measurement used for the control points is defined as

error = |P̂ − P| / |P|. (29)

For the treatment plan accuracy, different statistics of the relative dose deviation are used instead. The argument for using a different measurement lies in avoiding the zero divisions that occur in the control point data.

The dose accuracy shows the treatment plan accuracy on the test set, for the training version with the highest accuracy with respect to the validation set. Figure 17 shows histograms of the relative dose deviation, defined as

relative dose deviation = Dtp = ( Σj∈tp P̂j − Σj∈tp Pj ) ⊙ abs( 1 / Σj∈tp Pj ) = {Dtp,i | i = 1, ..., 1069} (30)

for all the packages from the test set. The mean and median of the absolute relative dose deviation, Eq. 25, are indicated at the top of every histogram. Note the difference between the absolute relative dose deviation and the relative dose deviation: the visualization of the relative dose deviation gives a clearer picture of whether the prediction over- or under-produces. A desired outcome would be a deviation centered around 0%. Figure 18 compares the predicted and the target phantom detector dose distributions on the two crossing phantom planes x = 0 and z = 0. As additional information, the error measurement used for the control point accuracy, Eq. 29, is displayed at the top of each deviation plane. The third subsection, Sub Functions, visualizes the behaviour of the sub functions of the main model. Figure 19 visualizes the noise reduction function for two sample control points from the test set. Figure 20 visualizes the point function (from the transformation function) with the initial orthogonal projection points and with the added trainable correction term points for different angle states. Figure 21 shows the decay function for different angle states.

6.1 Training Progress

Figure 14: Cost progress during training plot: training set and validation set.


Figure 15: Mean error progress during training: training set and validation set.

Figure 16: Mean median dose error progress during training: validation set.


6.2 Dose Accuracy

Figure 17: (1/2)


Figure 17: (2/2) Histograms showing the relative dose deviation between prediction and target over all detectors in the phantom.


(a) treatment plan: Ani1.1

(b) treatment plan: HoN1.1

(c) treatment plan: Lung1.1

Figure 18: (1/3)


(d) treatment plan: Prost1.1

(e) treatment plan: Prost4.1

(f) treatment plan: Prost6.1

Figure 18: (2/3)


(g) treatment plan: Prost9.2

Figure 18: (3/3) Scatter plot of the predicted, target and deviation dose distributions visualized over the two crossing planes x = 0 and z = 0; the error (Eq. 29) on each plane is expressed inside the brackets on top of the two deviation plots.


6.3 Sub Functions

Figure 19: Noise reduction filter: input sample image (left) and output sample image (right).


Figure 20: Scatter plot of the point function from the phantom detectors on the x = 0 plane hitting the transmission plane for different angles. The dotted region corresponds to the area of the transmission detectors.


Figure 21: Decay function acting on the two crossing planes x = 0 and z = 0, for various angles.


7 Discussion

The dose error on the test set shows satisfying results, with a median deviation ∼ 1% for almost every treatment plan. A substantial error contribution comes from the detectors at the border of the y-axis, as can be seen clearly in Figure 18 a, d, f. The cause of these errors could be that the transmission detector placement is too narrow. Figure 20 of the point function could confirm that theory: the structure of the points is broken outside the area of the transmission image. Another theory for the bad results at the border is that the point function cannot "walk" back into the active part of the image after it has reached the zero-padding area. A point that has completely reached the zero-padding area (all four transmission detectors that the point is interpolated from belong to the zero-padding) has no cost gradients for movement in any direction. The decay function visualized in Figure 21 shows some pattern breaks at the detectors on the border of the y-axis, and also at some detectors next in line from the border, which supports the previous statements. Elsewhere the decay function acts according to expectations: the decay of the signal is proportional to the distance the x-ray propagates inside the cylinder before reaching the detector. The noise reduction filter visualized in Figure 19 shows a smoother image after the filter is applied. A strange property revealed in the figure is the vertical stripes that appear in the darker part of the plot. The reason for this phenomenon is unclear; a theory is that it has something to do with the uneven stride the last transposed convolution has to take to fit the dimensions of the transmission detector distribution.

7.1 Alternative Model

This subsection discusses alternative models that were abandoned during the project for various reasons. For future work, these models may still be worth drawing inspiration from.

7.1.1 Transformation Matrix

A tempting approach was to design a transformation matrix M that would transform the transmission data T to the phantom plane P (here T acts as a vector instead of an image), with each element of M having its own angular function Mi,j = fi,j(θ1, θ2; wi,j) with trainable weights. This matrix would replace both the transformation function and the decay function of the current model. The noise-reduced image gets flattened out ([50,101] −> [4040]) and then undergoes a matrix multiplication with the matrix M, see Eq. 32. This approach has many pros over the current model. The biggest challenge is to define the angular functions for each element correctly. The only model tested was a pre-rotated (θ2) abs. dense network similar to the ones used for the correction term in the transformation function and the decay function. The design is shown in Figure 12, where the output layer would be of size 4 318 760 (= 4040x1069), one output for every element of the transformation matrix. Even though the training of the model converged to something better than guessing, it performed far worse than the current model. The pros of having a transformation matrix compared to the current model are:

• Simpler model.

• More general (no need to assume orthogonality of the propagation, and no need for the first physical assumption that the current model relies on).

• No need for any spatial information from the phantom detectors (or any information at all from the phantom detectors).

The cons are, in addition to the difficulty of finding a good angular function, the computational cost of training a model with a matrix of that size.

P = M(θ1, θ2; w_{t.m.}) × T    (31)

f_{alt.1}(T, θ1, θ2; w) = M(θ1, θ2; w_{t.m.}) × f_{n.r.}(T; w_{n.r.}) = P = {P_i | i = 1, ..., 1069}    (32)
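As an illustration, the transformation-matrix idea can be sketched as follows. This is a toy sketch with made-up dimensions and randomly initialised weights (the real model would map a 4040-element transmission vector to 1069 phantom outputs); the function names and the single-hidden-layer ReLU network for the angular function are our assumptions, not the implementation from the thesis.

```python
import numpy as np

# Toy sketch of the abandoned transformation-matrix model:
# a small dense network maps the angular state (theta1, theta2) to every
# element of M, and the flattened transmission vector T is multiplied by M.
rng = np.random.default_rng(0)
N_T, N_P, HIDDEN = 8, 5, 16          # toy sizes instead of 4040 and 1069

# trainable weights of the angular network (randomly initialised here)
W1 = rng.normal(size=(2, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(size=(HIDDEN, N_P * N_T))
b2 = np.zeros(N_P * N_T)

def transformation_matrix(theta1, theta2):
    """Angular dense network: (theta1, theta2) -> matrix M of shape (N_P, N_T)."""
    h = np.maximum(0.0, np.array([theta1, theta2]) @ W1 + b1)  # ReLU hidden layer
    return (h @ W2 + b2).reshape(N_P, N_T)

def predict_phantom_dose(T_image, theta1, theta2):
    """Eq. 31: P = M(theta1, theta2) x T, with the image T flattened to a vector."""
    T_flat = T_image.ravel()                      # e.g. [40, 101] -> [4040]
    return transformation_matrix(theta1, theta2) @ T_flat

T = rng.random((2, 4))                            # toy transmission "image"
P = predict_phantom_dose(T, 0.3, 1.2)
print(P.shape)                                    # one value per phantom point
```

The sketch also makes the cost argument concrete: the output layer of the angular network has N_P × N_T weights per hidden unit, which at full scale (4040 × 1069 matrix elements) is what makes training expensive.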

7.1.2 Picture to Picture

Another tempting approach was to build an "encoder/decoder" model similar to the u-net, but where the decoder transforms the input into the corresponding points in the phantom instead of into a noise-reduced version of the incoming image. This approach would be even simpler than the transformation matrix approach discussed above, and it needs none of the physical assumptions that the current model relies on. In contrast to the u-net, the decoder in this model cannot use concatenated bridges from the encoder. The reason is that the feature maps from the encoder are locally connected to the spatial dimension that is to be transformed. The angular information is a crucial component for this model to work; a reasonable choice is to put it in the bottleneck of the model, so that the angular information is treated as a feature. This approach failed to succeed on a model similar to the u-net used in this project, without the concatenated contributions.
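A minimal sketch of the bottleneck idea above, under the assumption of plain dense encoder/decoder layers (the actual model was convolutional, like the u-net); all names, sizes and weights here are illustrative:

```python
import numpy as np

# Toy sketch of the abandoned "picture to picture" idea: an encoder
# compresses the transmission image to a bottleneck, the angular state
# (theta1, theta2) is appended there as extra features, and a decoder maps
# the result to the phantom points. Note: no skip connections from encoder
# to decoder, since the encoder features are tied to the untransformed
# spatial geometry.
rng = np.random.default_rng(1)
N_IN, BOTTLENECK, N_OUT = 12, 4, 6                # toy sizes

W_enc = rng.normal(size=(N_IN, BOTTLENECK))
W_dec = rng.normal(size=(BOTTLENECK + 2, N_OUT))  # +2 for (theta1, theta2)

def encode(T_image):
    """Compress the flattened transmission image to the bottleneck."""
    return np.maximum(0.0, T_image.ravel() @ W_enc)   # ReLU bottleneck

def decode(z, theta1, theta2):
    """Append the angles as bottleneck features, then map to phantom points."""
    z_aug = np.concatenate([z, [theta1, theta2]])
    return z_aug @ W_dec                              # linear dose output

T = rng.random((3, 4))
P = decode(encode(T), 0.3, 1.2)
print(P.shape)                                        # one value per phantom point
```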

7.1.3 Modify the Current Model

A modification of the current model that was never tested is to replace the element-wise multiplication (the Hadamard product) between each pair of transformed transmission image and decay function output with a function that maps every output tuple independently. This modification would drop the second physical assumption from the current model.
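The proposed replacement can be sketched as below; the shared two-input network is a hypothetical choice of ours, since this modification was never implemented in the thesis:

```python
import numpy as np

# Current model: element-wise (Hadamard) product of transmission and decay
# values. Untested modification: a small shared network applied to every
# (t_i, d_i) tuple independently. Weights are illustrative only.
rng = np.random.default_rng(2)
HIDDEN = 8
Wg1 = rng.normal(size=(2, HIDDEN))
Wg2 = rng.normal(size=(HIDDEN, 1))

def hadamard_combine(t, d):
    """Current model: element-wise product."""
    return t * d

def tuple_combine(t, d):
    """Modification: map every (t_i, d_i) pair through a shared network."""
    pairs = np.stack([t, d], axis=-1)        # shape (n, 2), one row per tuple
    h = np.maximum(0.0, pairs @ Wg1)         # shared ReLU hidden layer
    return (h @ Wg2).ravel()                 # one output value per tuple

t = np.array([0.2, 0.5, 0.9])
d = np.array([1.0, 0.8, 0.3])
print(hadamard_combine(t, d))                # element-wise products
print(tuple_combine(t, d).shape)             # same length as the inputs
```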


8 Conclusions

This work presents a proof of concept for predicting the dose in the phantom geometry from the signal in the transmission detector, together with the angular state information from the gantry and collimator, using a trained ANN. Four out of seven treatment plans had a median absolute relative dose deviation below 1% (the worst treatment plan had 2.37%). This fulfilled the project goal specified by ScandiDos: "The error in the ANN dose in the phantom must be less than 1% compared to measurement or TPS dose." The u-net architecture for denoising the transmission detector image had shown good results in earlier works. What was new in this project was the additional task of transforming the information from the moving transmission plane on the gantry to the two crossing planes inside the phantom. The two physical assumptions made for the model proved to work sufficiently accurately for transforming the signal between the planes.

The data set provided in this work is just a fraction of all the data ScandiDos can deliver to the model. A first step could be to have the model running in the background while new PT-QA procedures are running. The data from new treatment plans can then validate the model before being mixed into the training set for continued training. When enough validations have been done with sufficient accuracy, the pre-treatment part could be removed from the quality assurance.
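For clarity, the evaluation metric quoted above can be computed as follows; the function name and the example numbers are illustrative, not taken from the thesis results:

```python
import numpy as np

def median_abs_rel_dose_dev(predicted, measured):
    """Median of |predicted - measured| / measured over all phantom detectors."""
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    return np.median(np.abs(predicted - measured) / measured)

# Illustrative example: four detectors, measured dose 100 each.
pred = np.array([101.0, 99.0, 102.0, 100.0])
meas = np.array([100.0, 100.0, 100.0, 100.0])
print(median_abs_rel_dose_dev(pred, meas))   # 0.01, i.e. 1%
```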

