

Faculty of Sciences

Statistical modeling of flow cytometry data

Niels Verdoodt

Master dissertation submitted to obtain the degree of
Master of Statistical Data Analysis

Promotor: Prof. Dr. Ir. Olivier Thas
Tutor: Alain Visscher

Department of Mathematical Modelling, Statistics and Bioinformatics

Academic year 2016 - 2017


The author and the promoter give permission to consult this master dissertation and to copy it or parts of it for personal use. Any other use falls under the restrictions of the copyright, in particular concerning the obligation to mention explicitly the source when using results of this master dissertation.

Niels Verdoodt                      Prof. Dr. Ir. Olivier Thas
January 27, 2017                    January 27, 2017


Foreword

When I first saw the presentation on the Master of Statistical Data Analysis in my last bachelor year, the thought of enrolling in it never really left my mind. While doing my thesis in bioscience engineering two years later, I was introduced to the world of fitting log-logistic models and hypothesis testing. Wanting to know and understand what formed the basis of this, I entered the Master of Statistical Data Analysis, which was the beginning of an intensive statistical journey. When looking for a master's thesis topic, it would ideally be a combination of my newly acquired set of data analysis skills and a little twist of bio-engineering, preferably in a biological context. This thesis topic is just that, giving me a complex method to spend many hours with while still giving me the chance to explore what caused the bacteria in the water samples to behave the way they did.

This thesis is an addition to the conceptual framework designed by professor Olivier Thas and embedded in a more mathematical context by Sem Peelman using a synthetic data set. I continued by exploring the effect of the carrier density on the model along with the adjusted standard errors, analyzed the real flow cytometry data set and added a smaller data set which could be used to validate the method.

First of all, I would like to thank my promotor Olivier Thas for providing the first data set, his patience in giving as many examples as needed to clarify things, and his never wavering good mood. I would also like to thank my tutor Alain Visscher for helping me translate theory into practice, his willingness to answer my questions and his ability to find typing errors in my manuscripts. I would also like to thank Karen De Roy, Bart De Gusseme, Benjamin Buysschaert and Sam Van Nevel for providing me with an additional data set and for giving me some new insights into the water distribution network of Flanders. Last but not least, I would like to thank my family and friends for their never-ending support throughout the past years.


Abstract

INTRODUCTION Flow cytometry is a high-throughput method which can rapidly give an idea of the bacterial composition of a water sample. As such, flow cytometry analysis could help explore when the bacterial composition in repaired drinking water pipelines stabilizes. Samples were taken up- and downstream of the place of the repairs at several time points. The data sets used were measured in Ghent and Brakel.

METHOD Probability density functions are constructed for each sample using exponential families consisting of a carrier density and a basis expansion. Poisson regression is used to estimate these density functions. The method consists of 4 main steps. First, the samples are discretized into equally sized bins. Second, the set of basis functions and the carrier density are chosen before fitting each individual sample. Third, the set of basis functions is transformed using PCA to reduce the number of basis functions needed. Finally, a joint model of all samples is constructed using the transformed basis functions. This joint model has parameter estimates which vary with the time and location of the samples.

RESULTS A set of 25 polynomial basis functions was chosen, along with a normal kernel smoother with bandwidth λ = 0.5 as the carrier density. Only a minor effect of the carrier density could be noticed. Individual samples were fit using log-linear Poisson regression. The set of basis functions was transformed using 4 principal components (PCs) for the Ghent data and 3 PCs for the Brakel data. A varying coefficients model was constructed by taking the Poisson model and letting its coefficients vary along the covariates. This allowed all samples to be fitted simultaneously. An extra parameter L was introduced which can be seen as a smoothness parameter. A stable model could be obtained for Brakel with L = 4, which showed stabilization of the bacterial composition after the first hour of rinsing the pipe. No stabilization could be detected for the Ghent data. Instead, a parsimonious model with L = 5 was chosen to build the density functions. Adjusted standard errors which take into account the choice of the carrier density were calculated for both the individual and the joint models.

CONCLUSION When a good set of basis functions is chosen, the carrier density does little to improve the model. The method can provide density functions from an exponential family while taking extra information about the samples into account. A joint model which could help predict the moment of reaching a stable bacterial community could be obtained for the small data set (Brakel). Improvements are still needed for larger and noisier data sets (Ghent).


Table of Contents

1 Introduction
  1.1 Thesis overview
  1.2 Flow cytometry

2 Data & Methodology
  2.1 Design of the experiment
  2.2 Density estimation
  2.3 One-sample density estimation
  2.4 Multi-sample density estimation
  2.5 Functional PCA

3 Results
  3.1 Data set construction
  3.2 Method overview
  3.3 Step 0: binning
  3.4 Step 1: one-sample fitting
  3.5 Step 2: basis transformation
  3.6 Step 3: varying coefficient model
  3.7 Analysis of Brakel data set

4 Discussion
  4.1 Choice of basis functions
  4.2 Carrier density and standard errors
  4.3 Model outcome
  4.4 Bayesian approach

5 Conclusion

References

A Remarks

B Extra figures


1 Introduction

1.1 Thesis overview

Having access to a steady supply of clean drinking water is something we take for granted nowadays. When repairs of the water supply lines are necessary, significant precautions are taken to make sure these are carried out as quickly and efficiently as possible. When the repairs are done, the pipe needs to be flushed before being reconnected to the water distribution network in order not to contaminate the drinking water. Since flushing can be costly and time consuming (up to 60 hours), flow cytometry might aid in giving an overview of the drinking water quality by checking the bacterial community. Contrary to what most people would think, high-quality drinking water is far from sterile. This is of no concern, as only a minor fraction of bacteria can negatively affect human health. The hypothesis is that after a period of time, the impact of the flushing decreases and the bacterial composition stabilizes and becomes more or less the same up- and downstream of the repaired pipe. Estimating when this will happen could help reduce the cost and time needed for the rinsing procedure (Van Nevel, 2013).

The data used in this master's thesis originate from the repair of main drinking water pipes in Ghent and Brakel. To place the methodology in a broader context: the goal of this thesis lies not so much in accurately estimating the tipping period when the bacterial community stabilizes. Rather, it is to find a well-fitting model for the collected data using a functional data analysis approach.

Chapter 1 continues with a short introduction to the basic principles of flow cytometry to give a more complete view of this technique. Chapter 2 outlines how the data sets were constructed, followed by the statistical methodology used in this dissertation. In chapter 3 the results are presented in the order of the steps necessary to construct the model explained in chapter 2. Chapter 4 then discusses these results, links the method with existing techniques and highlights possible improvements. Chapter 5 summarizes the most insightful results and concludes the dissertation.

Statistical analyses were carried out using R version 3.2.2 (R Core Team, 2013). Scripts to reproduce the given results were submitted together with this document. English notation of units was used (decimal separator is .) to comply with R. Since this thesis contributes to work done by Thas (2014) and Peelman (2016), similar notation was used.

1.2 Flow cytometry

1.2.1 Overview

Flow cytometry is a technique used to measure physical and chemical properties of cells or other biological particles as they flow in a fluid stream through a beam of light. The properties measured include a particle's relative size, internal complexity and relative fluorescence intensity. These characteristics are determined using an optical-to-electronic coupling system that records how the cell or particle scatters incident laser light and emits fluorescence (Dako, 2006, Biomedical Instrumentation Center, 2000, BD Biosciences, 2000).

The advantage of using flow cytometers is that they can rapidly and quantitatively measure multiple parameters simultaneously on individual cells. For some applications the analysis time can drop below 20 minutes (De Roy et al., 2012), while high-performance cell sorters can reach rates of 70,000 cells per second (Dako, 2006). A flow cytometer consists of three components: the fluidics, the optics and the electronics. These will be discussed in more detail in the following sections.

1.2.2 Fluidics

The purpose of the fluidics system is to transport particles in a fluid stream to the laser beam for analysis. For optimal illumination, the stream transporting the particles should be positioned in the center of the laser beam. Optimally, only one cell or particle should move through the laser beam at a given moment. To accomplish this, the sample is injected into a stream of sheath fluid within the flow chamber. The design of the flow chamber is such that samples will be focused in the center of the sheath fluid (BD Biosciences, 2000).

Increasing the sample pressure increases the flow rate of the sample particles. This will in turn increase the width of the sample core. A higher flow rate is generally used for more quantitative measurements. Since more particles pass through the laser beam faster, the data are less resolved but are acquired faster. A lower flow rate ensures particles pass more gradually through the laser and is used when high quality measurements are needed, as for example in DNA analysis (BD Biosciences, 2000).

1.2.3 Optics

The optical system consists of the excitation optics and the collection optics. The excitation optics consist of the laser and lenses that are used to shape and focus the laser beam. The collection optics consist of a collection lens to collect light emitted from the particle-laser beam interaction and a system of mirrors and filters to route specific wavelengths of the collected light to the right detectors (BD Biosciences, 2000).


Figure 1.1: Scattering of light by a cell (BD Biosciences, 2000)

When a cell passes through the laser beam, it deflects incident light as shown in figure 1.1. Forward-scattered light (FSC) is proportional to the surface area or size of a cell. Side-scattered light (SSC) is proportional to the granularity or internal complexity of a cell (Biomedical Instrumentation Center, 2000). The cell membrane can also interact with different fluorescent dyes (called fluorochromes) to analyze many different bacterial features (Dako, 2006). Fluorescence occurs when a molecule absorbs light of one wavelength (absorption) and emits light of a longer wavelength (emission). Since visible light (the colors visible to the human eye) has wavelengths from 400 nm to 700 nm, this difference in wavelength results in the color of the absorbed light being different from the color of the emitted light. Figure 1.2 gives an example of this. By using different fluorochromes which emit at different wavelengths, several features can be measured simultaneously (Dako, 2006).

Figure 1.2: Fluorescent absorption (left) and fluorescent emission (right) (Dako, 2006)

1.2.4 Electronics

The role of the electronics is to monitor and control the operation of the flow cytometer, from detection of a particle as it passes through the laser, to the physical deflection of that particle into a collection flask for sorting. As a particle of interest passes through the laser, an electrical pulse is generated and presented to the signal processing electronics of the flow cytometer (figure 1.3). The instrument is triggered when this signal exceeds a predefined threshold level. This threshold is primarily used to reject measurement noise from optical and electronic sources (Dako, 2006). The height, width and area of the electrical pulse are determined by the particle's size, speed and fluorescence intensity (Biomedical Instrumentation Center, 2000).


Figure 1.3: Creation of a voltage pulse (BD Biosciences, 2000)

1.2.5 Microbial fingerprint

Each particle creates a pulse of which the intensity can be measured by the electronics of the flow cytometer. All of the pulses of a sample combined can give an idea of the structure of the microbial community in the sample, which is called the microbial fingerprint (De Roy et al., 2012). These can be visualized as univariate histograms or bivariate density plots, as seen in figure 1.4. Both figures were obtained using the Ghent data set (see section 2.1).

Figure 1.4: Histogram (FL1-A, left) and density plot (FL3-A vs FL1-A, right) of a flow cytometry sample.


2 Data & Methodology

2.1 Design of the experiment

Flow cytometry measurements were conducted after reparation works on main water pipes in both Ghent and Brakel. To avoid contamination of the drinking water, the pipeline was first disinfected with chlorine, as shown in figure 2.1. After the chlorine is added, the pipe is filled with water. Once full, the pipe was rinsed for 19 hours. For the pipeline at Ghent, 3 water samples (3 replicates) were collected each hour (except at 03h and 04h), upstream (reference water) and downstream (rinse water) of the location of the repairs, and afterwards analyzed using flow cytometry. The samples collected upstream will be considered clean, while the downstream samples can be seen as potentially contaminated. After a while the bacterial community is assumed to stabilize at both sides of the pipe. Since the incoming water is potable, this indicates good drinking water quality from a bacteriological point of view (Van Nevel, 2013).

Figure 2.1: Rinsing procedure of the water pipe after reparation works. After adding chlorine (orange plug) the pipe was filled with water. When the pipe is completely full, it is rinsed during 19 hours. Samples of the reference and rinsing water were taken every hour during the rinsing process and analyzed by flow cytometry. Modified from Van Nevel (2013).

The pipeline at Brakel had a length of almost 3 km. Due to this, an absolute comparison of all water samples was not possible (Van Nevel, 2013). The 4 remaining samples were measured in quadruplicate at time points 24h, 2h, 4h and 5h. Because of its much smaller size, this set of samples will be used as a validation data set.

2.2 Density estimation

The cell intensities of each particle from a flow cytometry measurement can be seen as observations randomly sampled from an unknown density f(y), as

$$y_i \overset{i.i.d.}{\sim} f(y) \qquad \text{for } i = 1, 2, ..., n$$

with y_i the intensity of the i-th particle of a particular sample. Each y_i takes values in the sample space Y, which is the unit interval [0,1]. Estimating f(y) usually happens in one of two ways: by fitting a parametric family (such as the normal) or by using nonparametric methods such as kernel density estimation. The method used here is a combination of these methods to form a hybrid estimator (Efron and Tibshirani, 1996). The exponential family of choice on Y will be the one introduced by Efron and Tibshirani (1996):

$$f(y; \beta) = f_0(y) \exp\big(\alpha + \beta\, b(y)\big) \qquad (2.1)$$

with carrier or target density f_0(y) and some function b(.). β is the parameter vector and α is the normalizing parameter that makes f(y; β) integrate to 1 over Y. The function b(.) is generally unknown but can be written as a series expansion

$$\sum_{k=1}^{\infty} \beta_k b_k(y)$$

for k ≥ 1, constants β_k ∈ ℝ and basis functions b_k (see 2.2.1). An approximation of f(y; β) can be given by truncating the series after K terms,

$$f_K(y; \beta) = f_0(y) \exp\left(\alpha(\beta) + \sum_{k=1}^{K} \beta_k b_k(y)\right) \qquad (2.2)$$

in which α(β) is a normalizing constant

$$\alpha(\beta) = -\ln \int_0^1 f_0(y) \exp\left(\sum_{k=1}^{K} \beta_k b_k(y)\right) dy$$


To simplify notation α(β) will be denoted as α.

Let's illustrate this with an example. Say the density function we are looking for is a normal distribution with mean µ and variance σ², then f(.) equals

$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)$$

which can be written as

$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2} + \frac{\mu x}{\sigma^2} - \frac{\mu^2}{2\sigma^2}\right)$$

or even

$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x + \left(-\frac{\mu^2}{2\sigma^2} - \ln(\sigma)\right)\right)$$

in which a polynomial of the 2nd degree can be seen. If we choose the standard normal density as the carrier density, so that f_0(x) = φ(x) = exp(-0.5x²)/√(2π), and polynomial basis functions, then equation 2.2 becomes

$$f_2(y; \beta) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right) \cdot \exp\left(\alpha + \beta_0 + \beta_1 x + \beta_2 x^2\right)$$

in which it becomes easy to see which values the parameters have to take to obtain the wanted density.¹
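Spelling this out, equating the exponents of the last two expressions gives

$$\beta_1 = \frac{\mu}{\sigma^2}, \qquad \beta_2 = \frac{1}{2} - \frac{1}{2\sigma^2}, \qquad \alpha + \beta_0 = -\frac{\mu^2}{2\sigma^2} - \ln(\sigma),$$

so that for µ = 0 and σ² = 1 all coefficients vanish and the carrier density itself is recovered.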

2.2.1 Basis functions

If a data distribution does not seem to follow any known function, the curve can be approximated using mathematical building blocks called basis functions. As indicated by equation 2.2, the number of basis functions (K) can be increased depending on the complexity of the curve or the degree of approximation that is desired. Any set of linearly independent (orthogonal) functions could be used for this. Three families of basis functions (polynomials, Fourier series and wavelets) were chosen, which will be shortly introduced in the following sections.

Polynomial base

In their simplest form polynomials can be represented as

$$b_k(y) = y^{k-1} \qquad \text{for } k = 1, 2, ..., K, \quad K \in \mathbb{N}$$

which are known as the polynomials with a monomial basis of degree K. Computationally, these polynomials become more ill-conditioned as the degree increases (Corless and Fillion, 2014). As a result, other polynomial sequences which exhibit this behaviour to a lesser extent will be used. Most polynomials are defined on the domain [-1,1]. Since our sample space is defined in advance, and to have more flexible basis functions, only those polynomial sequences which can easily be shifted to the [0,1] domain will be used (Abramowitz et al., 1965). These include the shifted Chebyshev and Legendre polynomials.

¹ In this case, an intercept (β0) was added to complete the equation.


(a) Legendre polynomials (b) Chebyshev polynomials

Figure 2.2: The first 5 shifted Legendre and Chebyshev polynomials. The color legend applies to both graphs.

The explicit expression for the shifted Legendre polynomial L of degree K (K ∈ ℕ) is given by

$$L_K(y) = (-1)^K \sum_{k=0}^{K} \binom{K}{k} \binom{K+k}{k} (-y)^k$$

while the shifted Chebyshev polynomials C of degree K (K ∈ ℕ) are given by the recurrence relation

$$C_0(y) = 1, \qquad C_1(y) = 2y - 1, \qquad C_K(y) = 2\,C_1(y)\,C_{K-1}(y) - C_{K-2}(y)$$

The first 5 Legendre and Chebyshev polynomials can be seen in figures 2.2a and 2.2b, respectively.
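As an illustration, the explicit sum above can be evaluated directly in R; the function name and the evaluation grid below are chosen for this sketch only and are not part of the thesis scripts.

    shifted_legendre <- function(y, K) {
      # shifted Legendre polynomial of degree K on [0,1], via the explicit sum above
      (-1)^K * sapply(y, function(yi) {
        sum(choose(K, 0:K) * choose(K + 0:K, 0:K) * (-yi)^(0:K))
      })
    }
    y <- seq(0, 1, length.out = 200)
    B <- sapply(0:4, function(d) shifted_legendre(y, d))   # first 5 basis functions
    matplot(y, B, type = "l", lty = 1)                      # compare with figure 2.2a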

Fourier basis

A Fourier basis is an expansion of a periodic function, as it can be written as a sum of sines and cosines (Abramowitz et al., 1965). The approximation of a function F_K(y) up to degree K (K ∈ ℕ) by a generalized Fourier series is given by

$$F_K(y) = \frac{1}{2}a_0 + \sum_{k=1}^{K} a_k \cos(ky) + \sum_{k=1}^{K} b_k \sin(ky)$$

with

$$a_0 = \frac{1}{\pi}\int_{-\pi}^{\pi} F_K(y)\, dy \qquad a_k = \frac{1}{\pi}\int_{-\pi}^{\pi} F_K(y)\cos(ky)\, dy \qquad b_k = \frac{1}{\pi}\int_{-\pi}^{\pi} F_K(y)\sin(ky)\, dy$$

The first 5 Fourier polynomials are displayed in figure 2.3.
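A possible construction of such a basis on the unit interval is sketched below; rescaling the argument to 2πy, so that every term has a whole number of periods on [0,1], is an assumption made for this sketch and is not spelled out in the text above.

    fourier_basis <- function(y, K) {
      # one cosine and one sine column per degree k (argument rescaled to 2*pi*y)
      do.call(cbind, lapply(seq_len(K), function(k)
        cbind(cos(2 * pi * k * y), sin(2 * pi * k * y))))
    }
    Bf <- fourier_basis(seq(0, 1, length.out = 200), K = 5)   # 2K = 10 columns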


Figure 2.3: The first 5 Fourier polynomials.

Wavelet basis

Wavelets (ψ) are constructed from scaling functions (φ) with properties

$$\int \phi(y)\, dy = 1 \qquad \text{and} \qquad \int [\phi(y) < 0]\, dy = \int [\phi(y) > 0]\, dy$$

The most basic scaling function is a step function defined as

$$\phi(y) = \begin{cases} 1 & \text{if } 0 \le y < \tfrac{1}{2} \\ -1 & \text{if } \tfrac{1}{2} \le y < 1 \\ 0 & \text{otherwise} \end{cases}$$

which is the basis of the Haar wavelet, but many others exist as well. The wavelet polynomial consists of different scaling functions

$$\psi(y) = \sum_{k=1}^{K} (-1)^k c_{k-1}\, \phi(2y - k)$$

where c_{k-1} is a finite set of filter coefficients depending on the scaling function. Instead of adapting the scaling function, one can also adapt the wavelet as follows

$$\psi_{jk}(y) = 2^{j/2}\, \psi(2^j y - k)$$

where j is a dilation index (comparable to changing the amplitude of a periodic function), k is a translation index (horizontal shift), 2^{j/2} a normalizing constant and ψ_{jk} a family of wavelets (Tangborn, 2010). Figure 2.4a shows the first 5 Haar wavelets, figure 2.4b the first 5 4-coefficient Daubechies wavelets.


(a) Haar wavelets (b) 4-coefficient Daubechies wavelets

Figure 2.4: The first 5 Haar and 4-coefficient Daubechies wavelets.

2.3 One-sample density estimation

The β_k parameter estimates of equation 2.2 proposed in section 2.2 can be found by solving the maximum likelihood equations

$$\frac{\partial \alpha}{\partial \beta_k} = -\frac{1}{n}\sum_{i=1}^{n} b_k(y_i) \qquad k = 1, 2, ..., K$$

with

$$\alpha = -\ln \int_0^1 f_0(y) \exp\left(\sum_{k=1}^{K} \beta_k b_k(y)\right) dy$$

ignoring the fact that the chosen carrier f_0(y) is itself data-dependent. This matches the coefficients with the empirical averages of the chosen basis functions. Instead of doing this, the method of a hybrid estimator by means of Poisson regression will be used (Efron and Tibshirani, 1996), originally introduced by Lindsey (1984a,b).

2.3.1 Poisson regression

First the sample space Y is partitioned into G disjoint cells Y_j,

$$\mathcal{Y} = \bigcup_{j=1}^{G} \mathcal{Y}_j$$

and the data y = (y_1, y_2, ..., y_n) are reduced to the cell counts N_j,

$$N_j = \#\{y_i \in \mathcal{Y}_j\} \qquad \text{for } j = 1, 2, ..., G$$

The vector of counts N = (N_1, N_2, ..., N_G) has sum N_+ = n. The probability of observing y in the j-th cell is given by

$$\pi_{Kj}(\beta) = \int_{\mathcal{Y}_j} f_K(y; \beta)\, dy$$


with

$$\sum_{j=1}^{G} \pi_{Kj}(\beta) = 1.$$

The counts (N_1, ..., N_G) can then be described by a multinomial distribution on G categories, with n draws and probability vector π_K(β) = (π_{K1}(β), ..., π_{KG}(β)),

$$N \sim \mathrm{Mult}(n;\, \pi_K(\beta))$$

The maximum likelihood estimate (MLE) of β, based on N, could then be found by maximizing the multinomial probability of N. Instead, the method of Lindsey (1984a,b) is to consider the counts as independent Poisson observations

$$N_j \overset{i.i.d.}{\sim} \mathrm{Poisson}(\mu_j(\gamma, \beta)) \qquad \text{with } E[N_j] = \mu_j(\gamma, \beta) = \gamma\, \pi_j(\beta) \text{ and } \gamma > 0$$

Using standard Poisson properties the previous equations can be expressed as

$$N_+ \sim \mathrm{Poisson}(\gamma) \qquad \text{and} \qquad N \mid N_+ \sim \mathrm{Mult}(N_+;\, \pi_K(\beta))$$

from which it can be derived that the MLE of γ is

$$\hat{\gamma} = N_+ = n$$

so that

$$\mu_j(\beta) = n\, \pi_{Kj}(\beta) \qquad (2.3)$$

After solving this, the Poisson regression equations can be written as²

$$N_j \sim \mathrm{Poisson}(\mu_j(\beta)) \qquad (2.4a)$$

$$\mu_j(\beta) = n \cdot \mu_j^0 \cdot \exp\left(\alpha + \sum_{k=1}^{K} \beta_k b_k(y_{(j)})\right) \qquad (2.4b)$$

$$\mu_j^0 \propto \pi_j^0 = \int_{\mathcal{Y}_j} f_0(y)\, dy \qquad (2.4c)$$

where y_{(j)} denotes a representative point (usually the midpoint) of Y_j. Once the β parameters are estimated they can be plugged into equation 2.2 to obtain a good estimate of the density function f_K,

$$\hat{f}_K(y; \hat{\beta}) \propto f_0(y) \exp\left(\alpha + \sum_{k=1}^{K} \hat{\beta}_k b_k(y)\right) \qquad (2.5)$$

Since the obtained density function will only be proportional to the real density function, the normalizing constant α may be omitted.

² Since the observations within one sample come from the same biological sample, they may not be i.i.d., so this assumption is relaxed. It is argued that this does not affect the method too much (Thas and Clement, 2014).


2.3.2 Standard errors

The estimated standard errors for the regression coefficients β̂ are not completely correct under the generalized linear model (GLM) framework because they do not take the data-based choice of the carrier density f_0 into account. The carrier density is incorporated in the Poisson regression through µ⁰, as indicated by equations 2.3 and 2.4c,

$$\mu_j^0 = n\, \pi_j^0 = n \int_{\mathcal{Y}_j} f_0(y)\, dy$$

First, equation 2.4b can be written in matrix notation as (omitting α)

$$\mu = n\, \mu^0 e^{X\beta}$$

where X is the G × K data matrix consisting of the basis functions b_k evaluated in y_{(j)},

$$X = \begin{pmatrix} b_1(y_{(1)}) & b_2(y_{(1)}) & \cdots & b_K(y_{(1)}) \\ b_1(y_{(2)}) & b_2(y_{(2)}) & \cdots & b_K(y_{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ b_1(y_{(G)}) & b_2(y_{(G)}) & \cdots & b_K(y_{(G)}) \end{pmatrix}$$

In the most general case, µ⁰ could be estimated by a function m so that

$$\hat{\mu}^0 = m(N)$$

Let H be the G × G derivative matrix of ln(µ⁰) = (ln(µ⁰₁), ln(µ⁰₂), ..., ln(µ⁰_G)) with respect to N, with index u = 1, 2, ..., G; then

$$H = \frac{d \ln \mu^0}{dN} = \left(\frac{\partial \ln(\mu_j^0)}{\partial N_u}\right) \qquad (2.6)$$

Let D̂ be the G × G diagonal matrix with j-th diagonal element µ̂_j = µ⁰_j exp(x_j β̂); then the K × G derivative matrix of β̂ with respect to N is

$$\frac{d\hat{\beta}}{dN} = [X^T \hat{D} X]^{-1} Z^T \qquad (2.7a)$$

$$Z^T = X^T (I - \hat{D} H) \qquad (2.7b)$$

where the superscript T indicates the transposed matrix.

According to Efron and Tibshirani (1996) (appendix A), it is better not to use m(N) directly but to use a kernel smoother for estimating µ⁰, as in

$$\hat{\mu}^0 = M \cdot N \qquad (2.8)$$

Efron and Tibshirani (1996) suggest using a normal kernel smoother with ju-th element given by

$$M_{ju}(\lambda) = \frac{c_j}{\lambda}\, \phi\left(\frac{y_{(j)} - y_{(u)}}{\lambda}\right) \qquad \text{for } j, u = 1, 2, ..., G \qquad (2.9)$$


with λ the window width of the smoother and the constants c_j chosen to make M_{j+} = 1. When using µ̂⁰ = M · N, equation 2.7b becomes

$$Z^T = X^T \big(I - D(e^{X\hat{\beta}}) M\big) \qquad (2.7c)$$

where D(e^{Xβ̂}) is the diagonal matrix with j-th diagonal element exp(x_j β̂).
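The smoother matrix of equation 2.9 can be built directly; a minimal sketch in R (the helper name is illustrative, the G = 250 grid anticipates chapter 3):

    make_M <- function(mids, lambda) {
      # normal kernel smoother matrix; rows rescaled to sum to 1 (the constants c_j)
      M <- outer(mids, mids, function(a, b) dnorm((a - b) / lambda)) / lambda
      M / rowSums(M)
    }
    mids <- (seq_len(250) - 0.5) / 250        # bin midpoints y_(j) for G = 250
    M    <- make_M(mids, lambda = 0.5)
    # carrier estimate of equation 2.8, given a vector of bin counts N:
    # mu0 <- as.vector(M %*% N)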

An approximate covariance matrix for β̂ is then given by

$$\overline{\mathrm{Cov}}(\hat{\beta}) = [X^T \hat{D} X]^{-1}\, [Z^T \bar{D} Z]\, [X^T \hat{D} X]^{-1} \qquad (2.10a)$$

$$\widehat{\mathrm{Cov}}(\hat{\beta}) = [X^T \hat{D} X]^{-1}\, [Z^T \hat{D} Z]\, [X^T \hat{D} X]^{-1} \qquad (2.10b)$$

where D̄ is the diagonal matrix with j-th diagonal element N_j and D̂ is the diagonal matrix with j-th element µ̂_j. The standard errors (SE) can thus be obtained as

$$SE(\hat{\beta}) = \sqrt{\mathrm{diag}\big(\mathrm{Cov}(\hat{\beta})\big)} \qquad (2.11)$$

For comparison, the "naive" standard errors of a GLM ignore that µ̂⁰ is a function of the counts N. In that case H = 0 in equation 2.6, so that Z^T = X^T in equation 2.7b, which reduces the covariance matrix of equation 2.10b to

$$\mathrm{Cov}(\hat{\beta}) = (X^T \hat{D} X)^{-1} \qquad (2.12)$$

Equation 2.10a would then give the sandwich estimate of the standard errors (equation 2.13), which gives larger values since it is more robust against violations of homoscedasticity.

$$\mathrm{Cov}(\hat{\beta}) = [X^T \hat{D} X]^{-1}\, [X^T \bar{D} X]\, [X^T \hat{D} X]^{-1} \qquad (2.13)$$
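A minimal sketch of these adjusted standard errors, assuming fit is the one-sample Poisson GLM of section 2.3.1, X the basis matrix actually used in that fit (including its intercept column, if any), M the smoother matrix from the sketch above and N the vector of bin counts; the function name is illustrative.

    adjusted_se <- function(fit, X, M, N) {
      eta   <- as.vector(X %*% coef(fit))       # x_j beta-hat (offsets excluded)
      Dhat  <- diag(as.vector(fitted(fit)))     # diag(mu-hat_j)
      Dbar  <- diag(N)                          # diag(N_j)
      Zt    <- t(X) %*% (diag(nrow(X)) - diag(exp(eta)) %*% M)   # Z^T of eq. 2.7c
      bread <- solve(t(X) %*% Dhat %*% X)
      covb  <- bread %*% (Zt %*% Dbar %*% t(Zt)) %*% bread       # eq. 2.10a
      sqrt(diag(covb))                          # adjusted SE of eq. 2.11
    }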

2.4 Multi-sample density estimation

The previous model can be extended by considering multiple samples and jointly estimating their density functions (Efron and Tibshirani, 1996, Thas, 2014). The observations are now denoted as

$$(y_{iq}, x_i) \qquad \text{for } i = 1, 2, ..., n \text{ and } q = 1, 2, ..., m_i$$

where the index i refers to the water sample and the index q to a particle of that sample; x_i is a p-dimensional covariate vector x_i = (x_{i1}, ..., x_{ip}) for each water sample. So in the multi-sample situation we observe independent random samples from n different densities f_1, f_2, ..., f_n on the same sample space Y,

$$y_{iq} \sim f_i(y) \qquad \text{for } i = 1, 2, ..., n \text{ and } q = 1, 2, ..., m_i$$

Using the extra information contained in x_i can be done with a technique similar to the varying coefficient models described in Hastie and Tibshirani (1993).

2.4.1 Poisson regression with varying coefficients

Discretization is done as described in 2.3.1, obtaining a count vector for each sample i,

$$N_{ij} = \#\{y_{iq} \in \mathcal{Y}_j\} \qquad \text{for } j = 1, 2, ..., G \text{ and } i = 1, 2, ..., n$$


Equation 2.2 can be expanded as

$$f_K(y; \beta_i) = f_0(y) \exp\left(\alpha_i + \sum_{k=1}^{K} \beta_{ik} b_k(y)\right) \qquad (2.14)$$

So each density function has its own parameter vector β_i = (β_{i1}, β_{i2}, ..., β_{iK}) and normalising parameter α_i. The Poisson regression equations then become

$$N_{ij} \sim \mathrm{Poisson}(\mu_j(\beta_i)) \qquad (2.15a)$$

$$\mu_j(\beta_i) = n_i \cdot \mu_j^0 \cdot \exp\left(\alpha_i + \sum_{k=1}^{K} \beta_{ik} b_k(y_{(j)})\right) \qquad (2.15b)$$

from which it becomes clearer which parts of the equation vary with the different samples and which parts do not. This approach requires solving n regression problems (one for each sample) to estimate a total of (K + 1)n parameters.

Since we now have additional information, it may be possible to let the density function vary smoothly with the covariates x_i. This is done by setting

$$\beta_i = \Theta x_i \qquad \text{and} \qquad \alpha_i = \omega^T x_i$$

where Θ is a K × p parameter matrix

$$\Theta = \begin{pmatrix} \theta_1^T \\ \theta_2^T \\ \vdots \\ \theta_K^T \end{pmatrix} = \begin{pmatrix} \theta_{11} & \theta_{12} & \cdots & \theta_{1p} \\ \theta_{21} & \theta_{22} & \cdots & \theta_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \theta_{K1} & \theta_{K2} & \cdots & \theta_{Kp} \end{pmatrix}$$

and ω a 1 × p parameter vector (ω_1, ω_2, ..., ω_p). The density model (equation 2.14) can then be rewritten as

$$f_K(y; \Theta x_i) = f_0(y) \exp\left(\omega^T x_i + \sum_{k=1}^{K} (\theta_k^T x_i)\, b_k(y)\right) \qquad (2.16)$$

and the Poisson regression equations as

$$N_{ij} \sim \mathrm{Poisson}(\mu_j(\Theta x_i)) \qquad (2.17a)$$

$$\mu_j(\Theta x_i) = n_i \cdot \mu_j^0 \cdot \exp\left(\omega^T x_i + \sum_{k=1}^{K} (\theta_k^T x_i)\, b_k(y_{(j)})\right) \qquad (2.17b)$$

This reparameterization makes the β parameters (and therefore the density function) vary smoothly with the covariates. In addition, the number of parameters to be estimated reduces from (K + 1)n to (K + 1)p. As mentioned at the end of section 2.3.1, the ω parameters can be omitted since the obtained density will only be proportional to the final density.

From a more practical point of view, in this case x_i will consist of the location and time measurements (see 2.1) of each sample i. Using a time-varying coefficient model, the degree of smoothness with which the coefficients vary between samples can be adjusted. This is done by expanding x_i with basis functions as a function of time (t), so that

$$\beta_{ik} = \theta_k^T x_i = \sum_{l=1}^{p} \theta_{kl} b_l(t_i) \qquad \text{and} \qquad \alpha_i = \omega^T x_i = \sum_{l=1}^{p} \omega_l b_l(t_i) \qquad (2.18)$$

where the basis functions b_l could be different from those used for the density estimation (the b_k of equations 2.16 and 2.17b).

2.4.2 Standard errors

Since the carrier density f_0 is the same over all samples (equation 2.16) and the information of multiple samples is combined by setting β_i = Θx_i, only one Poisson regression has to be fit. The adjusted standard errors are therefore the same in the one-sample and the multi-sample setting. This means that equations 2.10a and 2.10b of section 2.3.2 can still be used. For the derivation of adjusted standard errors which are able to differentiate between multiple samples, see Efron and Tibshirani (1996).

The estimation of µ⁰ (equation 2.8) is adapted to take the data of all samples into account by pooling the counts,

$$\hat{\mu}^0 = M \cdot \bar{N} \qquad \text{with} \qquad \bar{N} = \sum_{i=1}^{n} N_i \qquad (2.19)$$

in order to obtain a common carrier over all samples.

2.5 Functional PCA

As can be seen from all density function equations (2.2, 2.14, 2.16), the number of basis functions determines the number of parameters that need to be estimated. The Θ matrix grows with an additional p parameters for each additional basis function (see section 2.4.1). This number can be very large even for estimating simple density functions, so it is of interest to keep the number of basis functions to a minimum. In order to combine maximum precision with a minimum number of parameters, the basis functions will be transformed. This is done using the singular value decomposition (SVD) of a principal component analysis (PCA) and makes it possible to construct a new set of basis functions from the first few principal components.

After solving the n Poisson regressions of equation 2.15, the β parameters can be collected in an n × K matrix B as

$$B = [\beta_1\ \beta_2\ \cdots\ \beta_n]^T$$

We assume that K ≤ n, so that rank(B) = K. The singular value decomposition of B can then be written as

$$B = \sum_{k=1}^{K} \delta_k u_k v_k^T$$

or in matrix notation

$$B = U D V^T$$

where


• U is an n × K matrix with columns u_1, ..., u_K, also called the left singular vectors
• V is a K × K matrix with columns v_1, ..., v_K, also called the right singular vectors
• D is a K × K diagonal matrix with elements δ_1, ..., δ_K, also called the singular values

In this way, a matrix decomposition of B is obtained in which the singular values are ordered so that δ_1 ≥ δ_2 ≥ ... ≥ δ_K ≥ 0. They can be seen as weights for the total variability (information) in B. Since our goal is a dimension reduction which best approximates the original matrix, only the first few (r) δ_k's are retained. B can then be truncated after r terms (r < K) as

$$B_r = U_r D_r V_r^T$$

with

• U_r an n × r matrix
• V_r a K × r matrix
• D_r an r × r diagonal matrix with elements δ_1, ..., δ_r.

Using SVD properties,

$$B_r = Z_r V_r^T \qquad B_r V_r = Z_r$$

where V_r is a transformation matrix (also called the loadings) used to transform the original matrix B_r into the lower dimensional version Z_r (also called the scores). Z_r is constructed in such a way that the first column contains the most information about the data, the second column the second most information, and so on.

The original basis functions b used to estimate β can now be transformed by using the loadings v_l as

$$b'_l(y) = v_{1l} b_1(y) + ... + v_{Kl} b_K(y) \qquad l = 1, ..., r.$$

The transformed functions are linear combinations of the original basis functions and are linearly independent themselves. Thus, solving the Poisson regression equations 2.15 with the original basis functions b_1(y), ..., b_K(y) or with the transformed basis functions b'_1(y), ..., b'_r(y) should lead to roughly the same density estimates.


3 Results

3.1 Data set construction

The original fcs format of the flow cytometry data was converted for use with R. These original data are in the long format, where each row consists of the measured intensities of one particle together with the source (upstream/downstream), time and replicate number of the sample. The number of intensities matches the number of detectors (channels) which are operational on the flow cytometer. The FL1 channel was selected for analysis. Data were log-transformed using the natural logarithm and rescaled to the interval [0,1]. Intensities lower than 2^-17 were considered artifacts caused by the procedure and were not included in the analysis (Clement and Thas, 2016). As mentioned in 2.1, the Ghent data consist of 3 replicates taken each hour from 12h until 07h the next day, except for hours 03h and 04h. In this case the number of samples is n = 108. For the Brakel data set, samples that could be matched were measured in quadruplicate at 0h (24), 02h, 04h and 05h, leading to a total of n = 32 samples. The method will be explained in detail using the Ghent data set, while the Brakel data set will be used to see how easily the method can be applied to other data sets.

3.2 Method overview

To construct a time-varying model for the density of a single flow channel (FL1), the methodology consists of 4 main steps.

Step 0 Divide the unit interval into equal-sized bins and convert the cell intensities of each water sample into bin counts. The aim is to find the number of bins which gives a good representation of the sample density f_i(y), with i = 1, 2, ..., n (the number of samples).

Step 1 Solve the Poisson regression equations for each sample. The aim is to find a common carrier density f_0 and a class of best fitting basis functions b_1, ..., b_K1 over all samples.

Step 2 Transform the first set of basis functions into a smaller set of basis functions b'_1, ..., b'_K2 (K2 < K1) using SVD. The aim is to decrease the number of parameters without losing quality of the fitted models.


Step 3 Combine the one-sample Poisson models into one multi-sample Poisson model. The aim is to find a general model over all samples and to make use of x_i in such a way that the parameters vary smoothly with the covariates.

3.3 Step 0: binning

The number of bins can be chosen quite arbitrarily. The aim is to find a number G which is sufficiently large in order to have enough data points for smoothly estimating the underlying density function. Since the total number of data points for the final model (step 3) will be n × G, G = 250 was assumed to be a good compromise between smoothness and accuracy. For the first hour (12:00, figure 3.1) it can be seen that the number of cell counts is always higher in the downstream sample, which also has a more pronounced second peak.

The midpoint of the bin will be chosen as the representative point of Y_j, so that

$$y_{(j)} = \frac{j - 0.5}{G} \qquad \text{for } j = 1, 2, ..., G$$

If the number of bins is not taken too small, there will be bins where the cell count equals 0. Since this causes errors in the fitting of the Poisson model (see Step 1, section 3.4), a constant of 1 is added to each bin count. Since the final density model will only be proportional to the real density function (equation 2.5) and this number is very small relative to most of the cell counts (figure 3.1), it is assumed this does not harm the procedure.
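A minimal sketch of this step in R, with a placeholder vector standing in for the rescaled FL1 intensities of one sample (the object names are illustrative):

    G      <- 250
    breaks <- seq(0, 1, length.out = G + 1)
    y      <- runif(30000)                    # placeholder for the rescaled intensities
    counts <- as.vector(table(cut(y, breaks, include.lowest = TRUE))) + 1  # constant 1 added
    mids   <- (seq_len(G) - 0.5) / G          # representative points y_(j)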

Figure 3.2 shows a top view of all the samples. The color scale is adapted since a linear scale would not capture the variability of the data very well. Up- and downstream samples seem very similar at first sight. Over all samples, the second peak of the distribution (FL1 values between 0.4 and 0.5) moves slightly towards lower values after 20h. It is most pronounced in the downstream samples from 12h to 14h. Very low cell intensities occur the most after 19h.

3.4 Step 1: one-sample fitting

Fitting of the Poisson model is done using the glm function for fitting generalized linear models in R. Since a log-linear Poisson model is used, equations 2.15 can be rewritten as

$$N_{ij} \sim \mathrm{Poisson}(\mu_j(\beta_i)) \qquad (3.1a)$$

$$\ln(\mu_j(\beta_i)) = \sum_{k=0}^{K} \beta_{ik} b_k(y_{(j)}) + \ln(\mu_j^0) + \ln(n_i) \qquad (3.1b)$$

with i = 1, 2, ..., n and j = 1, 2, ..., G, omitting the normalizing constants α_i for now. Both µ_j^0 and n_i can be modelled as an offset.
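A minimal sketch of such a fit, reusing the objects from the earlier sketches (shifted Legendre basis, smoother matrix, binned counts); this is an illustration, not the submitted script:

    K1   <- 25
    Bmat <- sapply(0:(K1 - 1), function(d) shifted_legendre(mids, d))  # degree 0 = intercept
    mu0  <- as.vector(make_M(mids, lambda = 0.5) %*% counts)           # carrier values (eq. 2.8)
    off  <- log(mu0) + log(length(y))                                  # ln(mu0_j) + ln(n_i)
    fit  <- glm(counts ~ Bmat - 1, family = poisson(link = "log"), offset = off)
    AIC(fit)                                                           # lower is better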

Akaike's information criterion (AIC) will be used as a measure of model quality,

$$AIC = -2\ln(L) + 2p$$

with L the maximized value of the likelihood function of the model and p the number of free parameters to be estimated. In this way, the most parsimonious model with the highest precision has the lowest (best) AIC score.


Figure 3.1: The first sample (12:00) of both upstream (left) and downstream (right) cell counts. The number of bins (G) increases from top to bottom.

Figure 3.2: Top view of all up- and downstream samples. Lines represent estimated density functions for that hour. Values between 02:00 and 05:00 were interpolated since no data was available for these time steps.


3.4.1 Basis functions

The binning procedure results in graphs (figure 3.1) which give an idea of what the final density curve should look like. Since the shape of this density does not resemble any classic distribution with support [0,1], finding a good basis for the model provided by the basis functions will be dealt with first. A standard uniform distribution can be chosen as f_0 to make the carrier redundant, by putting

$$f_0(y) = \begin{cases} 1 & \text{if } 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases}$$

Two practical remarks can be mentioned (results not shown):

• Although not necessary from a theoretical point of view, an intercept will be added when fitting the basis functions, meaning that equation 3.1b starts from k = 0. Dropping the intercept resulted in absolute AIC values which were at least 10 times higher in comparison with models which were allowed to have an intercept. In addition, the glm function was not always able to converge for the polynomial or Fourier basis when no intercept was added. Defining a more informative f_0 before fitting the models did not change this.

• When fitting individual samples, adding n_i to the model (equation 3.1b) did not contribute to a better model fit in most cases. When an intercept was added, it only changed the estimated value of the intercept. It will still be added to be consistent with the final model, but only for the definitive set of basis functions.

Graphs of fitting the different basis functions for some degrees are given in appendix B by figures B.1 to B.8. AIC values of each model fit are given in table 3.1.

Table 3.1: AIC values for the model fits of the first upstream sample with the different basis functions. Numbers between brackets represent the number of basis functions included. Lower AIC values denote a better model fit.

Polynomial           AIC    Fourier          AIC    Wavelet              AIC
Legendre (5)        3408    Fourier (5)     3129    Haar (1)            5186
Legendre (10)       2017    Fourier (10)    1868    Haar (4)            4974
Legendre (20)       1465    Fourier (20)    1419    Haar (7)            3088
Chebyshev (5)       3408                            Daubechies (1)      5846
Chebyshev (10)      2017                            Daubechies (4)    27,165
Chebyshev (20)      1465                            Daubechies (7)      8219

For the polynomial basis, the shifted Legendre and Chebyshev polynomials give the same model fits. Parameter estimates vary between both (results not shown) but their fitted values are very similar, explaining the overlap in figures B.1 and B.2. AIC values of both polynomials are also exactly the same (left column of table 3.1). The model fits show that the second peak of the distribution does not get fitted correctly for polynomials of a degree less than 20. Especially for fitting the second downstream peak, which is even more spiked, polynomials of a higher degree will be necessary. The first peak of both distributions, however, already shows some signs of overfitting even with a polynomial of degree 20.

The model fit of the Fourier basis (figures B.3 and B.4) shows some odd behaviour by going up at high intensities. It gives a good fit from 10 degrees onward and has the lowest AIC values for each degree compared to the other basis functions (middle column of table 3.1). This could partly be attributed to the fitting of both sines and cosines in each degree, so that a Fourier basis of degree 5 already estimates 10 parameters.

Figure 3.3: Average AIC values of the model fitting of all samples with the Poisson regression model. Straight lines are parallel with the lowest AIC, which is at 24 basis functions for the polynomial basis and 40 basis functions for the Fourier basis. The average AIC over all samples with 1 Fourier basis function equals 13.5 · 10³.

The wavelet basis was modeled using the modwt function (maximal overlap discrete wavelet transform) of the waveslim package (Whitcher, 2015). Because of this, the maximum degree is log2(G) ≈ 7 for G = 250, which explains the choice of degrees for this basis. It has the highest AIC values (right column of table 3.1) of the considered basis functions and does not seem a good match for this procedure. Especially the Daubechies wavelet shows some odd results because of the worse fits at higher degrees. The Haar wavelet gives a very blocked fit to the data (figures B.5 and B.6), which lies in the nature of its steep mother wavelet (section 2.2.1). It estimates the first peak of the distribution quite well but fails at fitting the second peak for each selected degree. The Daubechies wavelet (figures B.7 and B.8) fluctuates too much to give a good estimate of the density. The second peak of the distribution is estimated quite well in height for the highest degree Daubechies wavelet, but comes too early to give a good fit compared to the other basis functions.

As a result, the wavelet basis will not be considered anymore. Figure 3.3 shows the average AIC values over all samples when using a polynomial or Fourier basis of degrees 1 to 40. Fitting the models using Fourier basis functions leads to a smooth decrease of AIC values with increasing degrees. However, the lowest AIC value is at 40 degrees, indicating that still more degrees could be added to give a better model fit, although already 81 parameters are estimated at 40 degrees. This might indicate serious overfitting, with a highly fluctuating model as a result. The polynomial basis has a more jagged descent when more degrees are added and has its lowest AIC value at 24 degrees (25 parameters). The most parsimonious model thus seems to be obtained using polynomial basis functions.

Considering these results, the basis functions which seem the best in general are the polynomials. Since the goal was to find a large set of basis functions to start with and a degree of 24 gives the lowest AIC, a degree of 25 was chosen. The remaining results were obtained using Legendre polynomials, but Chebyshev polynomials can be used as well.

3.4.2 Carrier density

A normal kernel smoother (equation 2.9) will be used to estimate the carrier density f_0, as explained in section 2.3.2. Figure 3.4 shows the kernel with different values for the window width λ. A window width equal to 1, as used by Efron and Tibshirani (1996), does not seem to suffice for this data set.

Figure 3.4: The carrier density f_0 as estimated by a standard normal kernel smoother for different window widths λ. f_0 was estimated and plotted together with the data points of the first upstream sample (12h).

Table 3.2 gives the first 10 parameters of the first upstream sample to show the general trend when estimating the carrier density. The sandwich estimators which take into account the observed counts (SE (data)) are slightly higher than the usual standard errors; for those taking into account the fitted values (SE (fit)) no difference can be observed with the naive glm SEs. The parameter estimates do change depending on the choice of f_0, which could mean the basis functions mold themselves around the carrier density to give the best possible fit.

Since the optimization of the model was already primarily done using the basis functions, a kernel smoother with window width λ = 0.5 (which has some slope) was chosen, together with polynomial basis functions of degree up to 25, to fit β_i for each sample i.

3.5 Step 2: basis transformation

Equation 3.1b of step 1 can also be denoted as

$$\ln(\mu_j(\beta_i)) = \sum_{k=1}^{K_1} \beta_{ik} b_k(y_{(j)}) + \ln(\mu_j^0) + \ln(n_i) \qquad (3.2)$$


Table 3.2: The first 10 parameter estimates of the Poisson regression model of the first upstream sample together with their standard errors (SE). Adjusted standard errors are calculated for models with a defined carrier density f0. naive: GLM SE; SE (data): sandwich estimator using the observed counts; SE (fit): sandwich estimator using the fitted values; AIC: Akaike's information criterion.

          Uniform f0            Kernel f0 (λ = 1)                      Kernel f0 (λ = 0.05)
      estimate   naive    estimate  naive  SE (data)  SE (fit)   estimate  naive  SE (data)  SE (fit)
β0      -7.30    0.063     -11.2    0.063    0.067     0.063      -9.68    0.063    0.067     0.063
β1      -3.35    0.15       -3.17   0.15     0.16      0.15        0.023   0.15     0.16      0.15
β2       0.56    0.20        0.56   0.20     0.22      0.20        0.27    0.20     0.22      0.20
β3       0.42    0.23        0.42   0.23     0.25      0.23       -0.085   0.23     0.25      0.23
β4       0.60    0.21        0.60   0.21     0.23      0.21       -0.16    0.21     0.23      0.21
β5      -0.49    0.19       -0.49   0.19     0.20      0.19       -0.25    0.19     0.20      0.19
β6      -1.26    0.17       -1.26   0.17     0.18      0.17       -0.20    0.17     0.18      0.17
β7       0.67    0.18        0.67   0.18     0.19      0.18        0.29    0.18     0.19      0.18
β8       0.59    0.18        0.59   0.18     0.19      0.18       -0.13    0.18     0.19      0.18
β9      -0.76    0.20       -0.76   0.20     0.21      0.20       -0.38    0.20     0.21      0.20
AIC      1312                1312                                  1313

with i = 1, 2, ..., n, j = 1, 2, ..., G and K1 = 25. The basis functions will now be transformed as outlined in section 2.5, by applying an SVD to the n × K1 parameter matrix B to form the transformed basis functions b',

$$b'_l(y) = v_{1l} b_1(y) + v_{2l} b_2(y) + ... + v_{K_1 l} b_{K_1}(y)$$

for k = 0, 1, ..., K1, l = 1, 2, ..., K2 and K2 ≤ K1. The difference in numbering is because no numerical problems occurred when fitting the final model (section 3.6) without an intercept, hence the intercept is dropped from this step onwards. The new Poisson regression equations then become

$$N_{ij} \sim \mathrm{Poisson}(\mu_j(\beta'_i)) \qquad (3.3a)$$

$$\ln(\mu_j(\beta'_i)) = \sum_{k=1}^{K_2} \beta'_{ik} b'_k(y_{(j)}) + \ln(\mu_j^0) + \ln(n_i) \qquad (3.3b)$$

with i = 1, 2, ..., n and j = 1, 2, ..., G.
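A minimal sketch of this transformation, assuming Beta is the n × K1 matrix whose rows are the one-sample estimates and Bmat the G × K1 basis matrix from the earlier sketch (the object names are illustrative):

    dec    <- svd(Beta)                  # uncentered SVD (PCA without centering)
    K2     <- 4                          # number of retained components
    Bprime <- Bmat %*% dec$v[, 1:K2]     # transformed basis functions b'_1, ..., b'_K2
    scores <- Beta %*% dec$v[, 1:K2]     # Z_r, used e.g. for the biplot of figure 3.6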

The PCA was obtained without centering since the original data need to be preserved. Over 95% of the variability of the data could already be explained by the first principal component, which could be an indication of a very variable mean (results not shown). Figure 3.5a shows the average AIC values over all samples when fitting the individual models. Since a sparse solution is the aim, 4 basis functions should suffice. Figures B.9 and B.10 (see Appendix B) show that there is not much gain in taking more than 2 transformed basis functions into account for the first upstream and downstream sample. This could be different for other samples. Figure 3.5b displays boxplots for the AIC values of all samples when increasing the number of b' functions from 1 to 10. At 5 transformed basis functions there is no noticeable improvement anymore. Taking K2 = 4 seems a reasonable compromise and will be the number of transformed basis functions used to fit the final model.

Figure 3.5: AIC values over all samples obtained by fitting the Poisson regression model as a function of the number of transformed basis functions b'. (a) Average AIC values, (b) boxplots of the AIC values for b' equal to 1 to 10.

Side note The data were originally collected to see how many hours it would take for the water samples downstream (potentially dirty water) to be no longer distinguishable from the clean water samples upstream (section 2.1). Since a PCA was performed, a biplot of samples and parameters could be constructed (figure 3.6). It is nice to see that this tipping point (after 3 hours) can also be noticed with a different approach. The first 3 hours of downstream samples form a group without any upstream samples, indicating that the largest difference between upstream and downstream samples is situated in this time period.

3.6 Step 3: varying coefficient model

The Poisson regression equations for the joint density model are adaptations of equations 2.17 and are given by

$$N_{ij} \sim \mathrm{Poisson}(\mu_j(\Theta x_i)) \qquad (3.4a)$$

$$\ln(\mu_j(\Theta x_i)) = \sum_{k=1}^{K_2} (\theta_k^T x_i)\, b'_k(y_{(j)}) + \ln(\mu_j^0) + \ln(n_i) \qquad (3.4b)$$

with i = 1, 2, ..., n and j = 1, 2, ..., G. The covariate matrix in its simplest form consists of vectors

$$x_i = \begin{cases} (1, 0) & \text{for an upstream sample} \\ (1, 1) & \text{for a downstream sample} \end{cases}$$


Figure 3.6: The biplot of the SVD on the β_i parameters of every sample. Black labels indicate the samples (scores), red labels the parameter estimates (loadings).

which includes a dummy coding of the origin of the sample (location). This can be expanded by including the moment of sampling t_i (time), so that x_i can be constructed by using the vectors

x_i = (1, t_i, ..., t_i^{L-1}, 0, 0, ..., 0)               for an upstream sample at time t_i
x_i = (1, t_i, ..., t_i^{L-1}, 1, t_i, ..., t_i^{L-1})     for a downstream sample at time t_i        (3.5)

with L ∈ N_0. In this way L determines the smoothness with which the parameters vary with the covariates. Θ then becomes a K2 × 2L parameter matrix defined by

Θ = (θ_1^T; θ_2^T; ...; θ_K2^T) =

    | θ_11     ...   θ_1L     θ_1,L+1    ...   θ_1,2L   |
    | θ_21     ...   θ_2L     θ_2,L+1    ...   θ_2,2L   |
    |  ...            ...       ...              ...    |
    | θ_K2,1   ...   θ_K2,L   θ_K2,L+1   ...   θ_K2,2L  |                         (3.6)

To avoid extremely high values when fitting the model, the time steps were also rescaled to [0, 1] with 12h as the starting point. To be able to estimate the parameters of model 3.4, x_i will be treated as an expansion matrix: the matrix of the b′_k basis functions is expanded by multiplying it with each element of x_i, one at a time. In practice, the matrix notation of fitting an individual sample (equation 3.1) is given by (ignoring the offsets)

   N     =     b     ×     β
(G × 1)     (G × K)     (K × 1)

Page 34: Statistical modeling of flow cytometry datalib.ugent.be/.../881/RUG01-002349881_2017_0001_AC.pdf · When I first saw the presentation on the Master of Statistical Data Analysis

26 Chapter 3. Results

Figure 3.7: The AIC values for the joint density model for increasing values of L.

whereas for the joint model this becomes

    N      =       b′        ×      Θ_V
(nG × 1)     (nG × K2·p)     (K2·p × 1)

with p = 2L and Θ_V denoting Θ as a vector with its values inserted by column. Equations 3.4 can now be written in the form in which the model is explicitly fitted, which is given by

N_ij ∼ Poisson(µ_j(x_i Θ^T))                                                             (3.7a)

ln(µ_j(x_i Θ^T)) = Σ_{l=1}^{p} Σ_{k=1}^{K2} θ_k,l (x_i,l · b′_k(y^(j))) + ln(µ_0j) + ln(n_i)     (3.7b)

with i = 1, 2, ..., n, j = 1, 2, ..., G and p = 2L. In order to compare this model with the one-sample regressions (equations 3.3) and to be able to construct the density functions given by equation 2.2, the sample-dependent β's need to be calculated. These now vary with the covariates and are obtained by

β_i(x_i) = x_i · Θ^T                                                                     (3.8)
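A hedged sketch of how model 3.7 could be fitted in practice is given below, with the Kronecker product playing the role of the expansion matrix described above. All object names (time, downstream, basis_t, counts_mat, mu0, n_vec) are assumptions carried over from the earlier sketches, and L = 5 is used only as an example; its choice is discussed below.

    L  <- 5
    tp <- (time - min(time)) / diff(range(time))      # rescale sampling times to [0, 1]
    Tm <- outer(tp, 0:(L - 1), `^`)                   # 1, t_i, ..., t_i^(L-1)
    X  <- cbind(Tm, Tm * downstream)                  # n x 2L covariate matrix (eq. 3.5)

    # Expand the transformed basis with every element of x_i (rows stacked per sample).
    design <- do.call(rbind, lapply(seq_len(nrow(X)), function(i)
      kronecker(X[i, , drop = FALSE], basis_t)))      # (n*G) x (K2 * 2L)

    y_all      <- as.vector(counts_mat)               # counts stacked column by column
    offset_all <- rep(log(mu0), ncol(counts_mat)) + rep(log(n_vec), each = nrow(basis_t))

    joint  <- glm(y_all ~ design - 1, family = poisson, offset = offset_all)
    Theta  <- matrix(coef(joint), nrow = ncol(basis_t))   # K2 x 2L, filled by column (Theta_V)
    beta_i <- X %*% t(Theta)                               # sample-specific beta'_i (eq. 3.8)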

Figure 3.7 shows the AIC values for increasing values of L. Values higher than L = 10 were not considered because this would counteract the effect of the dimension reduction in step 2. Choosing a value of L higher than 7 does not seem to bring significant improvement to the model.

Figure 3.8 shows the parameter coefficients β′ for each of the basis functions b′. The black dots represent the values of the individual samples. The colored lines were obtained by calculating β′_k = θ_k^T x_i for different values of L. For the final model it is visible that each replicate has the same value for β since it has the same covariate values (same location and time). Taking L = 5 seems like a good compromise between flexibility and smoothness of the curve, whereas L = 8 shows signs of overfitting and L = 3 does not seem flexible enough to ensure good quality density estimations.


Figure 3.8: β′_i for the joint and individual models (black dots) for different values of L. Legend applies to all plots.


To visualize the model fit, the first replicate of the 12h samples (figures B.11 and B.12) and 21h samples (figures B.13 and B.14) were plotted. Taking L = 1 is not enough to fit the second peak of the distribution well; the model with L = 8 gives the best fitting values in the majority of cases. For the shown 12h samples, L = 3 actually seems to give better results than L = 5, which overestimates the first peak of the density. This is reversed for the 21h samples, where the model with L = 3 underestimates this first peak. Summarizing, one can argue how high the desired value of L needs to be, weighing accuracy against sparseness of the model.

Table 3.3: Parameter estimates for the joint model with L = 1 together with their standard errors. Naive: GLM SE, SE (true): sandwich estimator using the true values, SE (fitted): sandwich estimator using the fitted values. Subscripts of the parameters are given according to equation 3.6.

       Estimate    Naive SE (·10−3)    SE true (·10−3)    SE fitted (·10−3)

θ11 -11.6 3.01 6.77 6.76

θ21 0.115 8.22 8.39 8.26

θ31 -0.043 11.6 14.3 14.2

θ41 -0.131 6.05 12.9 12.9

θ12 -0.068 4.31 4.40 4.33

θ22 -0.289 12.0 12.3 12.2

θ32 0.130 16.6 16.9 16.7

θ42 0.144 8.76 8.98 8.84

Table 3.4: β’s for the joint model with L = 1.

β1 β2 β3 β4

Upstream -11.6 0.115 -0.043 -0.131

Downstream -11.7 -0.174 0.087 0.013

Parameter estimates and their standard errors for the joint model with L = 1 can be found in table 3.3. These are given because they are calculated with the simplest way of representing x_i (equation 3.5), which might give an indication of how the coefficients vary with location. No clear trend, except the large influence of the starting value, could be observed. The adjusted standard errors are now considerably larger for some parameter estimates. Calculating the adjusted standard errors for the final model meant expanding the M matrix (equation 2.8) in a similar way as the other matrices (see section 3.6). This gives a matrix with 7.29 · 10^8 elements. Using the method on a more extended data set may cause R to abort the calculations due to memory problems.
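As an aside, for moderate problem sizes a robust covariance matrix can also be obtained with the sandwich package, which implements a generic Huber–White estimator. Note that this is only a stand-in for, and not identical to, the estimator of equation 2.13 with the expanded M matrix; the 'joint' object is the GLM fit from the sketch above.

    library(sandwich)                          # assumed to be installed
    se_naive  <- sqrt(diag(vcov(joint)))       # GLM (naive) standard errors
    se_robust <- sqrt(diag(sandwich(joint)))   # sandwich-type standard errors
    cbind(naive = se_naive, robust = se_robust)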

By using equation 3.8, the β's could be derived using the θ's from table 3.3. As expected, the obtained values lie in the middle of the one-sample β's from figure 3.8. Since the θ_k2's have quite small values compared to θ_11, this might give an indication that there is not much difference between up- and downstream samples.


Figure 3.9: Top view of all up- and downstream sample estimates fitted by the final model with L = 5. Lines represent estimated density functions for that hour. Values between 02:00 and 05:00 were interpolated since no data was available for these time steps.

Figure 3.9 shows a heatmap, similar to figure 3.2 in section 3.3, made using the fitted values of the joint model with L = 5. Some small differences can be noticed when comparing both (no small peaks in the fitted samples), but overall they look quite similar.

Figure 3.10: Density curves for all samples. Legend applies to both plots.

Now that all parameter estimates are known, the density curves could be constructed which are


given by equation 2.2. They are shown in figure 3.10 for all samples. Darker colors indicate that a sample was taken at a later time point. The carrier density f0 was chosen to be a normal distribution with µ = 0 and σ² = 0.5 to comply with the chosen normal kernel (see section 3.4.2). Clearly visible is a shift of the second peak of the distribution towards lower intensities after 19h. This effect is stronger in the downstream samples. The earliest and latest samples have the highest probabilities of having intensities around 0.4.
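A small sketch of how one density curve of figure 3.10 could be evaluated on a grid is given below. Here basis_t_fun is a hypothetical helper that returns the K2 transformed basis values at an intensity y, beta_i comes from the joint-model sketch above, and the normal carrier follows the choice µ = 0, σ² = 0.5 mentioned above.

    y_grid <- seq(0, 1, length.out = 500)
    f0     <- dnorm(y_grid, mean = 0, sd = sqrt(0.5))             # carrier density
    bmat   <- t(sapply(y_grid, basis_t_fun))                      # 500 x K2
    unnorm <- f0 * exp(drop(bmat %*% beta_i[1, ]))                # numerator of eq. 2.2
    dens   <- unnorm / sum(unnorm * (y_grid[2] - y_grid[1]))      # normalize numerically
    plot(y_grid, dens, type = "l", xlab = "intensity", ylab = "density")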

3.7 Analysis of Brakel data set

The Brakel data set will now be used to see to what extent the method can be applied to other flow cytometry data sets. Since the basis functions are not sample dependent (see section 2.4), the best possible outcome is to have a reliable set of basis functions which could be used for more than one data set. This means that only steps 2 and 3 have to be slightly modified: step 2 by choosing an appropriate number of principal components, step 3 by adapting the covariates x_i to the new data set.

3.7.1 Step 0: binning

Figure 3.11 shows the first replicate (24:00) of the upstream and downstream samples for G = 250 bins. The upstream sample looks quite similar to the first sample in the Ghent data set (figure 3.1), only the distribution now has 2 smaller peaks instead of one. The downstream sample, however, follows a smoother curve with much higher cell counts.

Figure 3.11: The first sample (24:00) of both upstream (left) and downstream (right) cell counts for G = 250 bins.
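The binning step itself is only a matter of counting events per intensity interval. A minimal sketch, assuming a vector intensity with values already rescaled to [0, 1], is:

    G      <- 250
    breaks <- seq(0, 1, length.out = G + 1)
    counts <- as.vector(table(cut(intensity, breaks, include.lowest = TRUE)))
    y_mid  <- (breaks[-1] + breaks[-(G + 1)]) / 2    # bin midpoints y^(j)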

3.7.2 Step 1: one-sample fitting

All of the individual samples were now fit using the Poisson regression model for an increasing number of basis functions, as in section 3.4.1. The lowest average AIC over all samples is reached at 23 basis functions, as shown in figure 3.12. Unfortunately, even at 40 basis functions the twin peaks of the first upstream sample (figure 3.11) could not be modeled correctly (results not shown). Given this, the set of 25 polynomial basis functions will be used as with the Ghent data set.

With respect to the choice of the carrier density, the same results were obtained as for the Ghent data set. A very low bandwidth λ has to be used to match the data points well (figure 3.13). The parameter estimates (table 3.5) change with the choice of f0, while there is only a minimal


Figure 3.12: Average AIC values of the model fitting of all samples with a polynomial basis. The horizontal line is drawn at the lowest AIC value, reached at 23 basis functions.

effect on the standard errors. As before (section 3.4.2), a bandwidth of λ = 0.5 will be chosen to add some effect of the carrier density.
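The kernel carrier can be sketched with base R's density() function; the bin midpoints y_mid and the vector of raw intensities of one sample are assumed to be available, and bw plays the role of the bandwidth λ of figure 3.13.

    lambda <- 0.5
    f0_fit <- density(intensity, bw = lambda, kernel = "gaussian",
                      from = 0, to = 1, n = 512)
    f0     <- approx(f0_fit$x, f0_fit$y, xout = y_mid)$y    # carrier at the bin midpoints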

3.7.3 Step 2: basis transformation

After choosing the carrier density and the set of basis functions, the PCA could be carried out. Again, more than 95% of the variability could be explained by the first principal component. Figure 3.14a shows that taking 2 PCs might be enough. Closer inspection shows that there is still a small drop in AIC value when also taking the third PC into account, which is confirmed by looking at the boxplots of the AIC values (figure 3.14b). The number of transformed basis functions will thus be set at 3.

Figure 3.13: The carrier density f0 as estimated by a standard normal kernel smoother for different window widths λ. f0 was estimated and plotted together with the data points of the first upstream sample (24h).


Table 3.5: The first 10 parameter estimates of the Poisson regression model of the first upstream sample of Brakel together with their standard errors (SE). Adjusted standard errors are calculated for models with a defined carrier density f0. Naive: GLM SE, SE (true): sandwich estimator using the true values, SE (fitted): sandwich estimator using the fitted values, AIC: Akaike's information criterion.

Uniform f0 Kernel f0(λ = 1) Kernel f0(λ = 0.05)

      estimate  naive     estimate  naive  SE (true)  SE (fitted)     estimate  naive  SE (true)  SE (fitted)

β0 1.24 0.068 -13.3 0.068 0.068 0.068 -12.44 0.068 0.068 0.068

β1 -1.87 0.156 -1.72 0.156 0.156 0.156 0.034 0.156 0.156 0.156

β2 0.15 0.204 0.15 0.204 0.204 0.204 0.079 0.204 0.204 0.204

β3 -0.44 0.227 -0.44 0.227 0.228 0.227 -0.26 0.227 0.228 0.227

β4 1.33 0.206 1.33 0.206 0.207 0.206 0.007 0.206 0.207 0.206

β5 -0.71 0.192 -0.71 0.192 0.195 0.192 -0.34 0.192 0.195 0.192

β6 -1.20 0.182 -1.20 0.182 0.181 0.181 -0.003 0.182 0.184 0.182

β7 0.49 0.188 0.49 0.188 0.189 0.188 -0.10 0.188 0.189 0.188

β8 1.13 0.200 1.13 0.200 0.203 0.200 0.43 0.200 0.203 0.200

β9 -1.01 0.209 -1.01 0.209 0.212 0.209 -0.56 0.209 0.212 0.209

AIC 1096 1096 1097


Side note. The biplot (figure 3.15) shows the same result as obtained by Van Nevel (2013): the bacterial community stabilizes after 1 hour of rinsing.

3.7.4 Step 3: varying coefficient model

Figure 3.16 shows the AIC values for the joint model for different values of L, as outlined in section 3.6. Since the first hour of the Brakel samples is 0:00 (24:00), a small value will be added to those sample times so that 0's are used only for the dummy coding of up- and downstream samples¹. Unlike for the Ghent data set (figure 3.7), there seems to be no gain in taking L > 4. For higher values of L, not all models could estimate all the parameter coefficients due to singularities.

Figure 3.17 shows the β′ coefficients for the joint model (lines) along with those of the individual models (black dots). Due to the smaller number of time points, the line through the joint model β′'s looks far more jagged than it did for the Ghent data set (figure 3.8). The joint model did not seem to capture the first β′ coefficient of time point 24h of the upstream samples, as all the values are far below those of the individual samples. Taking L = 2 does not seem to give a model flexible enough to vary smoothly with the covariates. However, when L = 4 the β′_1 coefficients stabilize from 2h onwards. This is key because it shows that the Poisson model can

¹ In the Ghent data set, 24h was not the first sample, so the rescaling to [0,1] solved this issue.


Figure 3.14: AIC values over all samples of Brakel by fitting the Poisson regression model as a function of the number of transformed basis functions b′. (a) Average AIC values; (b) boxplots of AIC values for b′ = 1 to 10 (outliers of the first b′ are at ±14 · 10^3).

Figure 3.15: The biplot of the SVD on the β_i parameters of every sample of Brakel. Black labels indicate the samples (scores), red labels the parameter estimates (loadings).


Figure 3.16: The AIC values for the joint density model of Brakel for increasing values of L.

give an indication from which time point on there are no major changes in the bacterial community anymore and when the rinsing of the pipeline can stop. L = 4 thus gives the best model and will be used to construct the density curves.

In table 3.6 the parameter estimates and their standard errors are given for the joint model with L = 1. Larger values for θ_k2 are now noticeable compared to table 3.3. This means that there is a more pronounced difference between up- and downstream samples in the Brakel than in the Ghent data set. This can also be seen from the β's shown in table 3.7: there is a much larger difference in absolute value between up- and downstream samples than in table 3.4. The adjusted standard errors are somewhat larger but do not differ greatly from the naive standard errors.

The density curves of the Brakel data set are shown in figure 3.18. The first peak of both distributions is much steeper than those in the Ghent samples (figure 3.10). Like in the Ghent samples, there is a shift towards lower values for the second peak of the distribution, but it is much more pronounced here. As could be seen from figure 3.11, the density curve of the first downstream sample has a different shape than those of the following time points, which is also captured well by the model.


Figure 3.17: β′_i for the joint and individual models (black dots) of Brakel for different values of L. Legend applies to all plots.


Table 3.6: Parameter estimates for the joint model of Brakel with L = 1 together with their standard errors. Naive: GLM SE, SE (true): sandwich estimator using the true values, SE (fitted): sandwich estimator using the fitted values. Subscripts of the parameters are given according to equation 3.6.

       Estimate    Naive SE (·10−3)    SE true (·10−3)    SE fitted (·10−3)

θ11 -14.3 10.6 11.0 10.6

θ21 -0.707 16.8 17.8 16.8

θ31 0.923 15.1 15.2 15.1

θ12 4.28 12.3 12.8 12.3

θ22 4.14 18.2 19.2 18.2

θ32 -0.395 16.3 16.3 16.3

Table 3.7: β’s for the joint model of Brakel with L = 1.

β1 β2 β3

Upstream -14.3 -0.707 0.923

Downstream -10.0 3.43 0.528

Figure 3.18: Density curves for all samples of Brakel. Legend applies to both plots.


4 Discussion

4.1 Choice of basis functions

As can be seen from table 3.1, the shifted Chebyshev and Legendre polynomials give the same AIC values when fitting the models. The reason for this is that the only difference between them is their specific coefficients (see section 2.2.1) at each polynomial degree. As a result, their parameter estimates when fitting the Poisson regression model were different, but the fitted values were not. Hence it does not really matter which one is taken to fit the regression model. However, choosing the monomial basis (every degree has coefficient 1) did not change the fitted values but restricted the maximum number of basis functions to 15. As indicated in Corless and Fillion (2014), the numerical properties of the monomial basis are not well suited for fitting models with a high-degree polynomial basis.
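For reference, the shifted Chebyshev basis on [0, 1] can be built from the three-term recurrence T*_0(y) = 1, T*_1(y) = 2y − 1, T*_k(y) = 2(2y − 1) T*_{k−1}(y) − T*_{k−2}(y). A small sketch returning the K non-constant polynomials (the intercept being dropped, as above) is:

    cheb_basis <- function(y, K) {
      B <- matrix(0, length(y), K)
      B[, 1] <- 2 * y - 1                                    # T*_1
      if (K >= 2) B[, 2] <- 2 * (2 * y - 1) * B[, 1] - 1     # T*_2 (with T*_0 = 1)
      if (K >= 3) for (k in 3:K) B[, k] <- 2 * (2 * y - 1) * B[, k - 1] - B[, k - 2]
      B
    }
    basis <- cheb_basis(y_mid, 25)   # the K1 = 25 polynomial basis functions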

Splines could also have been considered alongside polynomials. Splines are piecewise defined functions which can have a high degree of smoothness depending on their number of knots, which are the places where the pieces (often polynomials themselves) connect. A simple example of a cubic spline with 1 knot is given by

S(t) = |t|^3,   i.e.   S(t) = t^3 for t ≥ 0   and   S(t) = −t^3 for t < 0.

Since the density curve mostly consists of 1 steep peak (especially for Brakel) and 1 smaller peak, more flexibility could be given in those areas. Intensities from 0.5 to 1 may be easily modeled using a straight line. Especially for the high peaks the joint model was not always able to estimate the cell counts well, which can be seen when comparing figures B.13 and B.14. When high precision is wanted, splines might give the needed improvement to the model.
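As an illustration of this alternative, a B-spline basis with extra knots under the steep first peak could replace the polynomial basis in the one-sample fit; the knot positions below are purely illustrative and the other object names are the assumptions used earlier.

    library(splines)   # part of base R
    spline_basis <- bs(y_mid, knots = c(0.05, 0.10, 0.15, 0.25), degree = 3)
    fit_spline   <- glm(counts ~ spline_basis, family = poisson,
                        offset = log(mu0) + log(n_i))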

As can be seen from figures B.3 and B.4, the model fit with the Fourier basis shows some unwanted behaviour, as the shape of the curve rises at high intensities. This effect can also be noticed (although much smaller) for the polynomial basis with 5 basis functions (figures B.1 and B.2). Since there are no higher cell counts in this area, it remains unclear why this happens, but


it diminishes for a higher number of basis functions. Since the average AIC values of the one-sample model with the Fourier basis were still slightly decreasing after adding up to 40 basis functions (figure 3.3), while the model with 20 basis functions already shows signs of overfitting (figures B.3 and B.4), the Fourier basis will probably not give a stable solution in this case.

There are many other types of wavelets, like the Morlet wavelet, or other R packages which could have been considered to fit the model using a wavelet basis. The wavelets opted for here were not flexible enough (Haar wavelet, figures B.5 and B.6) or fluctuated too much (Daubechies wavelet, figures B.7 and B.8) to give a decent fit. Because of this, and because the polynomial basis gave good results, the wavelet basis was not considered in more detail. Since wavelets can decompose information on time and location (frequency) simultaneously (Tangborn, 2010), an adaptation of the method could be to directly fit the joint model without using an intermediary dimension reduction step. In this way, the data would consist of a series with multiple peaks in which the wavelet properties could be used more extensively.

Instead of adding a small value to the bins as mentioned in section 3.3, one could think of using a zero-inflated Poisson model. This would, however, need some expert knowledge to correctly specify the extent of the structural zeros.
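Such a model is available, for instance, in the pscl package. The sketch below keeps only an intercept in the zero-inflation part, and the offsets ln(µ_0j) + ln(n_i) of the Poisson model would still have to be added; counts and basis_t are the assumed objects from the earlier sketches.

    library(pscl)      # assumed to be installed
    fit_zip <- zeroinfl(counts ~ basis_t - 1 | 1, dist = "poisson")
    summary(fit_zip)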

4.2 Carrier density and standard errors

The presented method can be seen as an optimization problem in which choices need to be made along the way. Here it was chosen to first model the basis functions and then look at the carrier density. If some notion about the density curve is available, reversing this order might be more beneficial. At first glance it may also feel more 'natural' to first consider modeling the carrier density. However, as can be seen from the results, a large and well-chosen set of basis functions should be able to compensate for a lack of prior knowledge. All of the model fits using a standard uniform density as carrier (figures B.1 to B.8) were able to fit the general curve of the density function quite well. With a good set of basis functions it does not always seem necessary to have a well-known density function.

In this dissertation, a normal kernel smoother was used as proposed by Efron and Tibshirani (1996). This works quite well in their example with a small data set. In this case, due to the high peaks in the distribution, it might not be the ideal solution. A very small bandwidth is needed (figures 3.4 and 3.13) in order to let the smoother vary along the shape of the density curve. This is also noticeable in the parameter estimates of the regression models, as seen in tables 3.2 and 3.5. Taking a smoother with bandwidth λ = 1 only has an effect on the values of β0 and β1, while taking a bandwidth of λ = 0.05 changes all the listed β's. With a smoother this precise, one could argue whether a set of 25 basis functions is still necessary. In this case it probably causes some overfitting, as indicated by the slight increase of the AIC value at this bandwidth. This was the main reason for choosing a less optimal bandwidth (λ = 0.5), which will be compensated by a maximally optimized set of basis functions.

Using a carrier density has the benefit that the standard errors of the parameter estimates can be calculated correctly. These follow the reasoning of sandwich estimators (equation 2.13) and can be model-based (equation 2.10b) or not model-based (equation 2.10a). The reasoning is that the adaptive choice of the carrier absorbs some of the variability of the β's, resulting in lower standard errors. With a small sample size and a low number of basis functions this is the


case (Efron and Tibshirani, 1996). For a larger set of basis functions and samples, very little difference is noticeable. It seems this correcting effect diminishes when the number of basis functions or samples increases. When looking at the standard errors of the joint model (tables 3.3 and 3.6), the standard errors do get noticeably larger for some of the parameters. This could be explained by stating that, by expanding the M matrix, each sample is now seen as a set of data points and not as an independent random sample. The independence assumption will thus not be entirely correct in this case, which is now correctly reflected in the standard errors (as opposed to the footnote at equations 2.4). However, for the joint model the values of θ are calculated and not those of β directly. How these standard errors relate to each other might be inspiration for future research.

4.3 Model outcome

4.3.1 One-sample density estimation

Looking at the parameter estimates for the one-sample density estimation, the first 2 β's seem to be estimated fairly well in both data sets, given their low standard errors (tables 3.2 and 3.5). From β2 onwards the standard errors start to get larger relative to the value of the parameter estimates. The same trend can be seen in the adjusted standard errors. By the nature of the basis expansion, the first few basis functions usually model the general curve of the data while higher degrees are used to model the smaller trends. This might mean that by only adding an intercept and a linear term, this general trend can be captured quite well. This can also be seen to some extent in figures B.1 and B.2: 5 polynomial basis functions are enough to fit the major course the data points follow, while 20 or more basis functions are needed to model the small fluctuations of cell counts at intensities between 0.3 and 0.5. P-values of the parameter estimates were not mentioned because they are less relevant in this case. The goal is to find a best-fitting density curve and not to find the best possible predictors in the data set.

The difference between the ways the adjusted standard errors are calculated lies in whether the real (equation 2.10a) or fitted values (equation 2.10b) were used to calculate them. There is no preference for which one to use, although the latter resembles the usual (naive) way of calculating SE's more closely. The former always gives larger standard errors (mostly minor differences), which is opposite to the results of Efron and Tibshirani (1996), who got lower SE's by calculating them using the real values. This could mean that the deviation from the real values is still quite large at some points. Of course, the assumption is that this effect can be found over all samples and not just the one considered here. Presumably some nuance is needed.

4.3.2 Multi-sample density estimation

As mentioned in section 1.1 and in the side note of section 3.5, the data was collected to see if flow cytometry data could help predict when a bacterial community stabilizes after repairs were carried out on drinking water pipelines. Ideally, the joint model should also show when this happens, since the model varies smoothly with the covariates which indicate time and location. The input of the system (incoming water) is considered to have a constant composition of bacteria, whereas outgoing samples (downstream water) will have a varying composition until the effect of the rinsing wears off and the bacterial community stabilizes again. This means that very little variation is expected of the upstream β's as a function of time, while the downstream β's will show some variation up to a particular time point where they also stabilize. This was


also the conclusion of the work done by Peelman (2016) on a synthetic flow cytometry data set.

For the joint model of Brakel this time trend can be noticed for the model with L = 4 (figure 3.17), where the downstream β's stabilize after 24:00. The upstream samples seem to stabilize after 24:00 for the β1's and after 04:00 for the β3's. In the Ghent data set (figure 3.8) no such trends are visible; there always seems to be some time dependency in the parameter estimates. There are 3 reasons which can help explain this (De Gusseme, 2016):

• The most important reason for the time dependency is that the incoming water supply itself is time dependent. Depending on soil and other geohydrological characteristics, the water composition and thus the bacterial composition is strongly location dependent. To be able to guarantee a steady supply of water for a large water distribution network such as Ghent, water can be provided from many nearby locations such as Melle, Wetteren, etc. Since the demand changes over time, the supply and also the mixing ratio of the water varies over time, which can even change every hour. But even for Brakel changes are visible in the upstream β's. The assumption of a constant bacterial community upstream is thus too strong to hold for both Ghent and Brakel.

• The length of the pipe plays a role as well, as it affects the residence time of the bacteria in one type of pipe. Since the distance between sampling points in Brakel was very long (± 3 km), the bacteria might have had more time to adjust to their new environment, which resulted in a stable community at the downstream sampling point.

• The depth at which a pipeline is buried varies with the geography of the area. For Brakel, the first sample could be taken directly at the exit of the pipeline. This was not the case for the Ghent pipeline. The sampling point there was at the top of a curve in the pipeline, which means that the first sample was not the first rinsing water as was the case in Brakel. If the first water could have been collected, the first downstream sample of Ghent might have had a histogram more similar to that of Brakel (figure 3.11).

Comparing the β's of Ghent (figure 3.8) with the heatmap (figure 3.9), it can be seen that higher parameter estimates result in higher first peaks. This is best noticeable around 19h - 24h. Values around 0 of β′3 and β′4 could indicate lower second peaks, as is somewhat visible for the downstream samples. This could indicate a very good result from the PCA step, since for each sample fit the same transformed basis functions are used. Except for the β′2 values, the individual β′'s seem to follow a quite random pattern in which it is hard to fit a smooth line. This may be explained by the varying mixing ratio of the water as explained above. This could be the reason that even for high values of L there was still room to improve the model (figure 3.7). Fitting the model with L = 1 ignores the time dependency so that the variation in location could be maximized¹. Tables 3.3 and 3.4 show that even then, the difference between up- and downstream β's stays rather small. The Ghent model may thus be improved by penalizing time more, in favor of maximizing the location differences. Even with a highly fluctuating inflow, the stabilizing trend should be noticeable. This might, however, lead to a model with a very high value for L.

For the Brakel data set, the model could no longer be improved at L = 4 (figure 3.16), which was also the value at which the stabilizing trend could be observed best (figure 3.17). When

¹ This is the reason only 8 values are shown in table 3.4: by minimizing x_i, every time point gets the same parameter estimates.


looking at the up- and downstream β's, the range of the values is clearly different. This could indicate a larger effect of location, which can also be found in tables 3.6 and 3.7 and in the shape of the final density curves in figure 3.18. For Brakel, a maximally optimized model could thus be found.

4.4 Bayesian approach

A similarity between equation 2.2 and the general notation of Bayesian parameter estimation can be noticed. In Bayesian statistics the (posterior) probability of a parameter θ given data points y is denoted by

p(θ | y) = p(y | θ) p(θ) / ∫ p(y | θ) p(θ) dθ

where p(θ | y) is the posterior, p(y | θ) the likelihood, p(θ) the prior and the integral in the denominator the marginal likelihood. Here the carrier density can be seen as the prior, the basis expansion as the likelihood and the normalizing constant α(β) as the marginal likelihood. The probability of a new data point ỹ is given by

p(ỹ | y) = ∫ p(ỹ | θ) p(θ | y) dθ

from which the density curve could be constructed. Model comparison can be done by using the Bayes factor, which follows from

p(θ | y, M) = p(y | θ, M) p(θ | M) / p(y | M)

with M a certain model and

p(y | M) = ∫ p(y | θ, M) p(θ | M) dθ

so that the Bayes factor can be constructed for two models M1 and M2 as

p(M1 | y) / p(M2 | y) = [ p(y | M1) / p(y | M2) ] · [ p(M1) / p(M2) ]

where the first factor on the right-hand side is the Bayes factor.

Using Bayesian statistics for estimating the density functions has some advantages over doing this with classical statistics. For comparing models (for instance when changing the carrier), the Bayes factor focuses on the credibility of both models instead of gathering proof for rejecting one model. As a consequence it is not influenced as much by large sample sizes as p-values are, and it will move towards the most credible model instead of rejecting one. More generally, the standard errors for the parameter estimates would not have to be calculated separately, as the choice of the carrier density is embedded in the Bayesian framework by use of the prior.

When the method can be successfully adapted, Bayesian analysis tools like Markov chain Monte Carlo methods could be used to find the (posterior) density functions. On the other hand, it could be more challenging to find a proper way to implement the covariates x_i into the model while using this framework.
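To make the idea tangible, a very small random-walk Metropolis sketch for the posterior of the β′ of one sample, under the Poisson likelihood of equation 3.3 and a vague normal prior, is given below. All object names are the assumptions used earlier, and no claim is made that this is an efficient sampler.

    log_post <- function(beta) {
      eta <- drop(basis_t %*% beta) + log(mu0) + log(n_i)
      sum(dpois(counts, exp(eta), log = TRUE)) + sum(dnorm(beta, 0, 10, log = TRUE))
    }
    K2       <- ncol(basis_t)
    beta_cur <- rep(0, K2)
    draws    <- matrix(NA_real_, 5000, K2)
    for (s in 1:5000) {
      prop <- beta_cur + rnorm(K2, sd = 0.05)            # random-walk proposal
      if (log(runif(1)) < log_post(prop) - log_post(beta_cur)) beta_cur <- prop
      draws[s, ] <- beta_cur
    }
    colMeans(draws[-(1:1000), ])                         # posterior means after burn-in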


5 Conclusion

Using flow cytometry, one can rapidly measure several cell properties simultaneously. It is mostly used in the medical sciences, but also in the assessment of water treatment processes. When the repairs of a drinking water pipeline are finished, it is flushed for several hours before reconnecting to make sure no contaminated water stays in the pipe. Flow cytometry could be used to analyze the bacterial community during the flushing. When it stabilizes, this could give an indication that the rinsing may be stopped. In this way, the duration of the flushing procedure could be reduced, thus saving many liters of potable water.

In order to construct probability density functions of water samples measured by flow cytometry and potentially find this tipping period, 2 data sets (Ghent and Brakel) were analyzed. The method of choice was to use the specially designed exponential families introduced by Efron and Tibshirani (1996) and modified by Thas (2014). This uses a binning procedure to reduce the density estimation to a log-linear Poisson regression problem. The exponential part of the equation was then adapted to include covariates of the sampling points (x_i), taking extra information on time and location into account. In this way, all the data could be fitted simultaneously by using a varying coefficient model.

Since this was an optimization problem, a few choices were made along the way. The density estimation equation consisted of a carrier density f0(y) and a basis expansion using a set of basis functions (Σ_{k=1}^{K} β_k b_k(y)).

The basis expansion was optimized first, by using a set of polynomials, after which no major improvement of the model could be observed from the choice of the carrier density anymore. The carrier was mainly added because it allowed for the calculation of adjusted standard errors of the parameter estimates β. However, when a less informative carrier density is chosen, very little difference between the adjusted and naive standard errors could be observed when modeling individual samples.

The joint varying coefficient model of all samples was able to have parameters β(x_i) which vary smoothly with the location (upstream or downstream of the pipe) and the time of sampling.


This smoothness is denoted by L and was in itself a parameter which could be optimized to find a stable model. For the Brakel data set, an optimal model could be obtained by setting L = 4. Using this degree of smoothness, the parameter estimates stabilized after the first time point, indicating the same result as obtained by Van Nevel (2013).

The same results could not be achieved with the Ghent data set. Since this was a drinking water pipeline of a major city, the mixing ratio of the water changes over time (De Gusseme, 2016), which makes it very difficult for the model to notice any trends in the data. A suboptimal but parsimonious model with L = 5 could be obtained for this data set. An improvement of the model could thus consist of making it less time and more location dependent. Adjusted standard errors for the joint model were found to be more correct than the GLM standard errors, since they incorporate the effect of considering multiple samples at once.

To conclude, the method certainly has potential but still needs some refinement to work optimally for large data sets. Using the same method in a different approach (such as the Bayesian framework) could also yield interesting results.


References

Abramowitz, M., Stegun, I. A., and Miller, D. (1965). Handbook of Mathematical Functions with Formulas, Graphs and Mathematical Tables. In Abramowitz, M. and Stegun, I. A., editors, Applied Mathematics, volume 55, chapter 8, page 774. Dover Publications, Washington, DC.

BD Biosciences (2000). Introduction To Flow Cytometry: A Learning Guide.

Biomedical Instrumentation Center (2000). Introduction to flow cytometry.

Clement, L. and Thas, O. (2016). flowFDA: Functional Data Analysis for Flowcytometry Data.

Corless, R. and Fillion, N. (2014). A Graduate Introduction to Numerical Methods: From the Viewpoint of Backward Error Analysis. Springer-Verlag New York.

Dako (2006). Flow Cytometry: Educational guide. Technical report, Carpinteria, CA, USA.

De Gusseme, B. (2016). Water distribution network Flanders.

De Roy, K., Clement, L., Thas, O., Wang, Y., and Boon, N. (2012). Flow cytometry for fast microbial community fingerprinting. Water Research, 46(3):907–919.

Efron, B. and Tibshirani, R. (1996). Using specially designed exponential families for density estimation. The Annals of Statistics, 24(6):2431–2461.

Hastie, T. and Tibshirani, R. (1993). Varying-Coefficient Models. Journal of the Royal Statistical Society, 55(4):757–796.

Lindsey, J. (1984a). Comparison of Probability Distributions. Journal of the Royal Statistical Society, 36(1):38–47.

Lindsey, J. (1984b). Construction and Comparison of Statistical Models. Journal of the Royal Statistical Society, 36(3):418–425.

Peelman, S. (2016). Statistical analysis of flow cytometry data. Master's thesis, University of Ghent.

R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Tangborn, A. (2010). Wavelet Transforms in Time Series Analysis.

Thas, O. (2014). Modelling the Bivariate Density function in FC Space.

Thas, O. and Clement, L. (2014). Analysis of Flow Cytometric Data: a Functional Data Analysis Approach.


Van Nevel, S. (2013). Flow cytometric fingerprinting for the follow-up of the rinsing process of water pipes. In Growth and flow cytometric monitoring of bacteria in drinking water, chapter 5, pages 3–10. Ghent.

Whitcher, B. (2015). waveslim: Basic wavelet routines for one-, two- and three-dimensional signal processing. R package version 1.7.5.


Appendices


A Remarks

Remark C of section 3 in Efron and Tibshirani (1996)
Notation was modified to comply with this thesis.
In case µ0 = M · N, the matrix Z in equation 2.7a is a function of β but not of µ0, say Z = Z(β). If dN is a vector of size G orthogonal to the columns of Z(β), then

dβ = [X′DX]⁻¹ Z′(β) dN = 0

This implies that the level surfaces of constant β value are (G − K)-dimensional flat subspaces in the G-dimensional N space. Flat level surfaces tend to make delta-method covariance estimates such as equations 2.10a and 2.10b more accurate.


B Extra figures

All of the following figures are constructed using the Ghent data set.


Figure B.1: Model fit of the first 12h upstream sample for a different number of basis functions of the shifted Legendre and Chebyshev polynomials. Only the Chebyshev polynomials are visible since both polynomials overlap each other. The number of basis functions included is displayed between brackets.

Figure B.2: Model fit of the first 12h downstream sample for a different number of basis functions of the shifted Legendre and Chebyshev polynomials. Only the Chebyshev polynomials are visible since both polynomials overlap each other. The number of basis functions included is displayed between brackets.


Figure B.3: Model fit of the first 12h upstream sample for a different number of basis functions of the Fourier basis. The number of basis functions included is displayed between brackets.

Figure B.4: Model fit of the first 12h downstream sample for a different number of basis functions of the Fourier basis. The number of basis functions included is displayed between brackets.


Figure B.5: Model fit of the first 12h upstream sample for a different number of basis functions of the Haar wavelet basis. The number of basis functions included is displayed between brackets.

Figure B.6: Model fit of the first 12h downstream sample for a different number of basis functions of the Haar wavelet basis. The number of basis functions included is displayed between brackets.


Figure B.7: Model fit of the first 12h upstream sample for a different number of basis functions of the 4-coefficient Daubechies wavelet basis. The number of basis functions included is displayed between brackets.

Figure B.8: Model fit of the first 12h downstream sample for a different number of basis functions of the 4-coefficient Daubechies wavelet basis. The number of basis functions included is displayed between brackets.


Figure B.9: Model fit of the first 12h upstream sample taking into account different numbers (K2) of transformed basis functions.

Figure B.10: Model fit of the first 12h downstream sample taking into account different numbers (K2) of transformed basis functions.


Figure B.11: Model fit of the first 12h upstream sample by the joint model for different values of L.

Figure B.12: Model fit of the first 12h downstream sample by the joint model for different values of L.


Figure B.13: Model fit of the first 21h upstream sample by the joint model for different values of L.

Figure B.14: Model fit of the first 21h downstream sample by the joint model for different values of L.