
Eindhoven University of Technology

BACHELOR

Optimally allocating green times at a traffic light intersection

Peeters, Anika

Award date: 2021


Disclaimer: This document contains a student thesis (bachelor's or master's), as authored by a student at Eindhoven University of Technology. Student theses are made available in the TU/e repository upon obtaining the required degree. The grade received is not published on the document as presented in the repository. The required complexity or quality of research of student theses may vary by program, and the required minimum study period may vary in duration.

General rights: Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners, and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
• You may not further distribute the material or use it for any profit-making activity or commercial gain.


Department of Mathematics and Computer Science

Optimally allocating green times at a traffic light intersection

Bachelor Thesis

Academic year 2020-2021

Anika Peeters (1243429)
Bachelor of Applied Mathematics

Supervisor: dr.ir. Marko A.A. Boon

Eindhoven, February 2021


Abstract

In this thesis, two methods for finding an optimal configuration of the traffic lights at an intersection with two conflicting flows will be investigated, discussed and compared. The goal of the configuration is to minimize the mean queue length and/or the mean waiting time.

The first method uses queuing theory to find an allocation of the green time. Such an allocation that minimizes the mean queue length and/or mean waiting time will be referred to as the optimal split. This method will be applied to an M/M/1, M/D/1, D/D/1 and fixed cycle traffic light fluid queue model. Here, it was found that the fixed cycle traffic light fluid queue model represents the intersection with two conflicting flows and traffic lights best. Thus, the optimal split found through this model has been compared to the optimal split found using the second method, the Q-learning algorithm.

The Q-learning algorithm is a model-free reinforcement learning algorithm. The environment used for the algorithm was a discrete-event simulation of the intersection with traffic lights. The optimality results of both methods have been applied in the simulation, and the mean queue length and mean waiting time results for both methods were compared. From this, it could be concluded that the first method, the method using queuing theory, performs best. However, when rounding the results to the nearest integer, both methods have equal performance. For a more complex intersection involving more than two directions, flows and/or traffic lights, the Q-learning algorithm is more promising. Considering the flexibility and potential of the Q-learning algorithm compared to the method using queuing theory, it would be recommended to use the Q-learning algorithm.


Contents

1 Introduction
  1.1 Problem description
  1.2 Approach
  1.3 Thesis overview

2 Optimal split for queuing and road traffic models
  2.1 M/M/1 queue
    2.1.1 The optimal split minimizing the waiting time
    2.1.2 The optimal split minimizing the queue length
  2.2 M/D/1 queue
    2.2.1 The optimal split minimizing the waiting time
    2.2.2 The optimal split minimizing the queue length
  2.3 D/D/1 queue
    2.3.1 The optimal split minimizing the queue length
  2.4 Fixed cycle traffic light fluid queue
    2.4.1 The optimal split minimizing the waiting time
  2.5 Simulation

3 Model analyses
  3.1 Stability region
  3.2 Mean queue length and waiting time
  3.3 Optimal split
  3.4 Discussion

4 MDP and Q-learning theory and literature study
  4.1 Finite Markov Decision Processes
    4.1.1 Bellman's optimality principle
  4.2 MDP framework applied to an intersection
    4.2.1 Policy iteration
  4.3 Q-Learning

5 Q-Learning
  5.1 The algorithm
    5.1.1 Parameters
  5.2 Q-learning result analysis and comparison
  5.3 Discussion

6 Conclusion and recommendation

References

A Theory
  A.1 List of symbols
  A.2 Calculations
  A.3 Plots model analysis
    A.3.1 Mean queue length and waiting time
    A.3.2 Optimal split

B Simulation results
  B.1 Plots simulation analysis

C Q-learning results


Chapter 1

Introduction

Waiting in front of a red traffic light for a longer time than seemingly necessary is a universal experience. More often than not it gives rise to irritation and restlessness. Apart from these minor psychological frustrations, a badly arranged traffic light can also have environmental and economic downsides and can cause congestion. Therefore, it is of interest to find a traffic light cycle configuration such that some, if not all, of these negative effects are minimal. In this thesis, methods that can find a traffic light cycle configuration such that the number of vehicles waiting and/or the amount of time spent waiting is minimized will be investigated.

1.1 Problem description

The problem that will be tackled in this thesis is the optimization of the allocation of the green time at a traffic light intersection with two conflicting flows. The objective is to find an optimal split that minimizes the mean waiting time and/or the mean queue length of the system. The optimal split denotes the optimal allocation of the green times. The research objectives of this thesis are as follows:

1. Determine the optimal split for several queuing and road traffic models.

2. Develop a method in which Q-learning is used to determine the optimal split.

3. Compare both methods using a discrete-event simulation.

The first objective mentions queuing and road traffic models. Inspired by the paper Performance Comparison Between Queuing Theoretical Optimality and Q-learning Approach for Intersection Traffic Signal Control (Chanloha, Usaha, Chinrungrueng, & Aswakul, 2012), the optimal split will be derived for four different queuing and road traffic models. In that paper, the optimum found by Q-learning is compared to the theoretical optimum found from queuing theory for an M/M/1 and a D/D/1 queue. In this thesis, the models that will be explored are the M/M/1, M/D/1, D/D/1 and a fixed cycle traffic light fluid queue model.

In order to achieve the research objectives, assumptions have been made to translate the real-world problem into a mathematical problem. The traffic light intersection environment in the mathematical model is as shown in Figure 1.1.


Figure 1.1: Traffic light intersection with two conflicting flows

The modeled environment as shown in this figure, in which we will strive to solve the problem, consists of the following components and assumptions:

• Two separate directions p ∈ {1, 2}.

• Each direction has one traffic light. The traffic light directs the vehicles straight ahead.

• Vehicles arrive in direction p with arrival rate λp.

• Vehicles depart in direction p with departure rate µp.

• The amount of time a traffic light is green for direction p will be denoted as the green time. During this time, vehicles are allowed to start their departure.

• The amount of time a traffic light is red or orange/yellow for direction p will be denoted as the red time. During this time, vehicles are not allowed to start their departure. However, if a vehicle started its departure during the green time, it is still allowed to continue its departure during the red time.

• The time unit in this thesis is seconds.

• The road behind the traffic light is of infinite length, meaning there is no constraint on the number of cars that can queue in the lane.

• All vehicles fit within the width of the road and are of equal size.

Further assumptions depend on the queuing model and will be discussed in the section(s) destined for said model.


1.2 Approach

As mentioned before, the models considered for the first research objective are the M/M/1, M/D/1, D/D/1 and fixed cycle traffic light fluid queue models. Based on queuing theory, the optimal split for these models will be derived by minimizing either the mean queue length or the mean waiting time. This method that uses queuing theory will be referred to as the theoretical optimization method. The found optima will be used in a discrete-event simulation of the intersection to compare the expected queue length and waiting time of the system with the theoretical findings from queuing theory.

The discrete-event simulation will also serve as the environment for the Q-learning algorithm of the second research objective. Before implementing the Q-learning algorithm, theory regarding Markov decision processes will be discussed. This theory provides a framework for the Q-learning algorithm, for which the theory will also be discussed. After understanding the theory behind the Q-learning algorithm, a state and action space will be determined so that the algorithm can be used to find the optimal allocation of the green time.

Finally, the theoretical optimization method and the method using Q-learning will be compared using the discrete-event simulation. This comparison will be done by applying the optimal allocation of the green times in the simulation for both methods and comparing the mean queue length and mean waiting time results.

1.3 Thesis overview

In this thesis, the research objectives will be investigated and discussed in chronological order. In Chapter 2, the optimal split for the various queuing and road traffic models will be derived using queuing theory. This chapter is followed by an analysis and discussion of the models in Chapter 3. Next, a literature study and theory about Markov decision processes and Q-learning will be provided in Chapter 4. In Chapter 5, the Q-learning algorithm's set-up and results will be discussed and compared with the theoretical optimization method. Finally, a conclusion and recommendation will be given in Chapter 6. A list of symbols, full calculations, extra figures etc. can be found in the Appendix. In short, the research objectives will be addressed in the following order:

• Research objective 1: Chapter 2, Chapter 3 and Chapter 6.

• Research objective 2: Chapter 4, Chapter 5 and Chapter 6.

• Research objective 3: Chapter 5 and Chapter 6.


Chapter 2

Optimal split for queuing and road traffic models

In this chapter, the optimal split will be derived for the M/M/1, M/D/1, D/D/1 (Adan & Resing, 2020) and fixed cycle traffic light fluid queue (Adan & Resing, 2003) models respectively. Here, the objective is to find the allocation of the green time that minimizes either the expected queue length or the expected waiting time.

At the end of this chapter, a discrete-event simulation of an intersection with two conflicting flows and traffic lights will be described. This simulation will be used to compare with the theoretical findings regarding the expected queue length and waiting time in Chapter 3. The simulation will also be used for the second and third research objectives of this thesis as described in Section 1.1.

An overview of commonly used symbols and variables used in this chapter can be found in Appendix A.1.

2.1 M/M/1 queue

Let λp denote the traffic arrival rate in direction p ∈ {1, 2}, µ the intersection departure rate and wp the ratio of green time in a cycle that is allocated to direction p. In the M/M/1 queuing model, the inter-arrival times are exponential with mean 1/λp, the service times are exponential with mean 1/(wpµ) and there is a single server. For the queue length not to explode, it is required that ρp = λp/(wpµ) < 1. Let L denote the total loss time per signal cycle, normalized by the cycle length. This can be seen as the fraction of time in a cycle during which all traffic lights are red. Hence,

$$\sum_{\forall p} w_p + L = 1.$$

Let the mean waiting time in direction p be denoted by Tp. An expression for the mean waiting time can be obtained using the mean value approach (Adan & Resing, 2020). Fix p ∈ {1, 2}. Let the average number of vehicles in direction p seen by an arriving vehicle be denoted by E[Lp]. Moreover, let the mean time spent in the system in direction p be E[Sp]. Each of these vehicles in direction p, including the one in service, has a (residual) service time with mean 1/(wpµ). A vehicle also has to wait for its own service time. This results in the arrival relation

$$E[S_p] = E[L_p]\cdot\frac{1}{w_p\mu} + \frac{1}{w_p\mu}.$$


Using E[Lp] = λpE[Sp] (Little's Law), the following is obtained

$$E[S_p] = E[L_p]\cdot\frac{1}{w_p\mu} + \frac{1}{w_p\mu} = \frac{\lambda_p}{w_p\mu}\cdot E[S_p] + \frac{1}{w_p\mu}.$$

Simplifying leads to

$$\left(1-\frac{\lambda_p}{w_p\mu}\right)E[S_p] = \frac{1}{w_p\mu} \implies (1-\rho_p)E[S_p] = \frac{1}{w_p\mu} \implies E[S_p] = \frac{1/(w_p\mu)}{1-\rho_p}.$$

The mean waiting time in direction p, Tp, can then be expressed as

$$T_p = E[S_p] - \frac{1}{w_p\mu} = \frac{1/(w_p\mu)}{1-\rho_p} - \frac{1}{w_p\mu} = \frac{1-(1-\rho_p)}{w_p\mu(1-\rho_p)} = \frac{\rho_p}{w_p\mu-\lambda_p} = \frac{\lambda_p}{w_p\mu(w_p\mu-\lambda_p)}.$$

The mean waiting time of an arbitrary vehicle in the system is then given by

$$T = \frac{\lambda_1}{\lambda_1+\lambda_2}T_1 + \frac{\lambda_2}{\lambda_1+\lambda_2}T_2.$$

This expression will be used in the next section to find the optimal split for the M/M/1 queuing model.

2.1.1 The optimal split minimizing the waiting time

The objective is to find the optimal split that minimizes the mean waiting time. Substituting w2 = 1 − L − w1 into T results in

$$T = \frac{\lambda_1}{\lambda_1+\lambda_2}\cdot\frac{\lambda_1}{w_1\mu(w_1\mu-\lambda_1)} + \frac{\lambda_2}{\lambda_1+\lambda_2}\cdot\frac{\lambda_2}{w_2\mu(w_2\mu-\lambda_2)} = \frac{1}{\lambda_1+\lambda_2}\left(\frac{\lambda_1^2}{w_1\mu(w_1\mu-\lambda_1)} + \frac{\lambda_2^2}{(1-L-w_1)\mu((1-L-w_1)\mu-\lambda_2)}\right).$$

Differentiating T with respect to w1 and equating to zero gives

$$\partial_{w_1}T = \frac{1}{\lambda_1+\lambda_2}\left(\frac{\lambda_1^2(\lambda_1-2w_1\mu)}{w_1^2\mu(\lambda_1-w_1\mu)^2} - \frac{\lambda_2^2(\lambda_2+2\mu(L-1+w_1))}{\mu(L-1+w_1)^2(\lambda_2+\mu(L-1+w_1))^2}\right) = 0. \tag{2.1}$$

After solving this equation for w1, the minimal and/or maximal T will be found. The value of w1 that corresponds to a minimal T will be denoted as w∗1,MM1. In this case, the equation needs to be solved numerically in order to find w∗1,MM1. This will be done using Mathematica. Then,

$$w^*_{2,\mathrm{MM1}} = 1 - L - w^*_{1,\mathrm{MM1}}.$$
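Since Equation 2.1 is solved numerically in Mathematica, the same step can also be reproduced with standard numerical tools. Below is a minimal sketch in Python (assuming SciPy is available; the function name and parameter values are illustrative and not taken from the thesis code) that minimizes T directly over the interval of w1 for which both queues are stable:

```python
from scipy.optimize import minimize_scalar

def mean_wait_mm1(w1, lam1, lam2, mu, L):
    """Mean waiting time T of the M/M/1 model as a function of the split w1."""
    w2 = 1 - L - w1
    T1 = lam1 / (w1 * mu * (w1 * mu - lam1))  # mean wait in direction 1
    T2 = lam2 / (w2 * mu * (w2 * mu - lam2))  # mean wait in direction 2
    return (lam1 * T1 + lam2 * T2) / (lam1 + lam2)

lam1, lam2, mu, L = 0.5, 1.0, 2.0, 0.1
# Stability requires lam_p < w_p * mu, which confines w1 to an open interval.
lo, hi = lam1 / mu + 1e-9, 1 - L - lam2 / mu - 1e-9
res = minimize_scalar(mean_wait_mm1, bounds=(lo, hi), method="bounded",
                      args=(lam1, lam2, mu, L))
w1_star, w2_star = res.x, 1 - L - res.x
```

Minimizing T directly on this interval is equivalent to solving Equation 2.1 there, since the critical point is shown to be a minimum in the next subsection.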


Evaluating the critical point

As said before, when solving Equation 2.1, the found critical points can correspond to a minimum, maximum or saddle point. To see if the found critical point (w∗1,MM1, w∗2,MM1) is a minimum, it needs to be confirmed that

$$T_{w_1w_1}T_{w_2w_2} - T_{w_1w_2}^2 > 0 \quad\text{and}\quad T_{w_1w_1},\ T_{w_2w_2} > 0$$

both hold. Here, the subscript wp denotes differentiation with respect to wp. For the M/M/1 queue, this gives

$$\begin{cases} T_{w_1} = \dfrac{\lambda_1}{\lambda_1+\lambda_2}\cdot\dfrac{-\frac{2\lambda_1}{w_1}+\frac{\lambda_1^2}{w_1^2\mu}}{(w_1\mu-\lambda_1)^2}\\[2ex] T_{w_2} = \dfrac{\lambda_2}{\lambda_1+\lambda_2}\cdot\dfrac{-\frac{2\lambda_2}{w_2}+\frac{\lambda_2^2}{w_2^2\mu}}{(w_2\mu-\lambda_2)^2} \end{cases} \implies \begin{cases} T_{w_1w_1} = \dfrac{\lambda_1}{\lambda_1+\lambda_2}\cdot\dfrac{2\lambda_1(\lambda_1^2-3\lambda_1\mu w_1+3\mu^2w_1^2)}{\mu w_1^3(\mu w_1-\lambda_1)^3}\\[2ex] T_{w_2w_2} = \dfrac{\lambda_2}{\lambda_1+\lambda_2}\cdot\dfrac{2\lambda_2(\lambda_2^2-3\lambda_2\mu w_2+3\mu^2w_2^2)}{\mu w_2^3(\mu w_2-\lambda_2)^3}\\[2ex] T_{w_1w_2} = 0. \end{cases}$$

Note that if the constraints (i): λp, wp, µ > 0 and (ii): λp < wpµ ⟹ wpµ − λp > 0 hold, then

$$T_{w_1w_1}T_{w_2w_2} - T_{w_1w_2}^2 = \frac{1}{(\lambda_1+\lambda_2)^2}\left(\frac{2\lambda_1^2(\lambda_1^2-3\lambda_1\mu w_1+3\mu^2w_1^2)}{\mu w_1^3(\mu w_1-\lambda_1)^3}\cdot\frac{2\lambda_2^2(\lambda_2^2-3\lambda_2\mu w_2+3\mu^2w_2^2)}{\mu w_2^3(\mu w_2-\lambda_2)^3}\right) - 0^2$$

$$= \frac{4\lambda_1^2\lambda_2^2(\lambda_1^2-3\lambda_1\mu w_1+3\mu^2w_1^2)(\lambda_2^2-3\lambda_2\mu w_2+3\mu^2w_2^2)}{(\lambda_1+\lambda_2)^2\mu^2w_1^3w_2^3(\mu w_1-\lambda_1)^3(\mu w_2-\lambda_2)^3} = \frac{4\lambda_1^2\lambda_2^2(\lambda_1^2+3\mu w_1(\mu w_1-\lambda_1))(\lambda_2^2+3\mu w_2(\mu w_2-\lambda_2))}{(\lambda_1+\lambda_2)^2\mu^2w_1^3w_2^3(\mu w_1-\lambda_1)^3(\mu w_2-\lambda_2)^3} > 0,$$

$$T_{w_1w_1} = \frac{2\lambda_1^2(\lambda_1^2-3\lambda_1\mu w_1+3\mu^2w_1^2)}{(\lambda_1+\lambda_2)\mu w_1^3(\mu w_1-\lambda_1)^3} = \frac{2\lambda_1^2(\lambda_1^2+3\mu w_1(\mu w_1-\lambda_1))}{(\lambda_1+\lambda_2)\mu w_1^3(\mu w_1-\lambda_1)^3} > 0,$$

$$T_{w_2w_2} = \frac{2\lambda_2^2(\lambda_2^2-3\lambda_2\mu w_2+3\mu^2w_2^2)}{(\lambda_1+\lambda_2)\mu w_2^3(\mu w_2-\lambda_2)^3} = \frac{2\lambda_2^2(\lambda_2^2+3\mu w_2(\mu w_2-\lambda_2))}{(\lambda_1+\lambda_2)\mu w_2^3(\mu w_2-\lambda_2)^3} > 0.$$

So the critical point is indeed a minimum if the given constraints hold.

2.1.2 The optimal split minimizing the queue length

Note that in the paper (Chanloha et al., 2012), there was a mistake in the formula for the mean waiting time Tp in direction p. The formula used in its place was the mean number of vehicles in the system in direction p, given by

$$E[L_p] = \frac{\rho_p}{1-\rho_p} = \frac{\lambda_p}{w_p\mu-\lambda_p}.$$

Let Q = E[L1] + E[L2]. Using w2 = 1 − L − w1, then

$$Q = \frac{\lambda_1}{w_1\mu-\lambda_1} + \frac{\lambda_2}{w_2\mu-\lambda_2} = \frac{\lambda_1}{w_1\mu-\lambda_1} + \frac{\lambda_2}{(1-L-w_1)\mu-\lambda_2}.$$


The optimal w1 can be found by differentiating Q with respect to w1 and equating it to zero. The full computation can be found in Appendix A.2.

$$\partial_{w_1}Q = -\frac{\lambda_1\mu}{(w_1\mu-\lambda_1)^2} + \frac{\lambda_2\mu}{((1-L-w_1)\mu-\lambda_2)^2}.$$

Solving this equation gives

$$\frac{\lambda_1\mu}{(w_1\mu-\lambda_1)^2} = \frac{\lambda_2\mu}{((1-L-w_1)\mu-\lambda_2)^2}$$

$$\implies \cdots \implies w_1^2\left[(\lambda_1-\lambda_2)\mu^2\right] + w_1\left[4\lambda_1\lambda_2\mu - 2(1-L)\lambda_1\mu^2\right] + \left[\lambda_1\lambda_2^2 - 2\lambda_1\lambda_2(1-L)\mu + (1-L)^2\lambda_1\mu^2 - \lambda_1^2\lambda_2\right] = 0,$$

$$D = \left[4\lambda_1\lambda_2\mu - 2(1-L)\lambda_1\mu^2\right]^2 - 4(\lambda_1-\lambda_2)\mu^2\left[\lambda_1\lambda_2^2 - 2\lambda_1\lambda_2(1-L)\mu + (1-L)^2\lambda_1\mu^2 - \lambda_1^2\lambda_2\right] = \cdots = 4\lambda_1\lambda_2\mu^2\left[\lambda_1+\lambda_2-(1-L)\mu\right]^2.$$

Using the abc-formula, the optimal split w∗1,MM1 is obtained

$$w^*_{1,\mathrm{MM1}} = \frac{-\left[4\lambda_1\lambda_2\mu - 2(1-L)\lambda_1\mu^2\right] \pm \sqrt{4\lambda_1\lambda_2\mu^2\left[\lambda_1+\lambda_2-(1-L)\mu\right]^2}}{2(\lambda_1-\lambda_2)\mu^2} = \cdots = \frac{\Lambda(1-L) + \frac{\lambda_1-\lambda_2\Lambda}{\mu}}{1+\Lambda}, \quad\text{where } \Lambda = \sqrt{\frac{\lambda_1}{\lambda_2}}.$$

For direction 2, the optimal split w∗2,MM1 is given by

$$w^*_{2,\mathrm{MM1}} = 1 - L - w^*_{1,\mathrm{MM1}} = \frac{(1-L)(1+\Lambda) - \Lambda(1-L) - \frac{\lambda_1-\lambda_2\Lambda}{\mu}}{1+\Lambda} = \frac{(1-L) + \frac{\lambda_2\Lambda-\lambda_1}{\mu}}{1+\Lambda}.$$

Evaluating the critical point

Again, it needs to be checked if the found critical point (w∗1,MM1, w∗2,MM1) is a minimum. For this it needs to be confirmed that

$$Q_{w_1w_1}Q_{w_2w_2} - Q_{w_1w_2}^2 > 0 \quad\text{and}\quad Q_{w_1w_1},\ Q_{w_2w_2} > 0$$

both hold for this point. For the M/M/1 queue, this gives

$$\begin{cases} Q_{w_1} = -\dfrac{\lambda_1\mu}{(w_1\mu-\lambda_1)^2}\\[2ex] Q_{w_2} = -\dfrac{\lambda_2\mu}{(w_2\mu-\lambda_2)^2} \end{cases} \implies \begin{cases} Q_{w_1w_1} = \dfrac{2\lambda_1\mu^2}{(w_1\mu-\lambda_1)^3}\\[2ex] Q_{w_2w_2} = \dfrac{2\lambda_2\mu^2}{(w_2\mu-\lambda_2)^3}\\[2ex] Q_{w_1w_2} = 0. \end{cases}$$


Note that if the constraints (i): λp, wp, µ > 0 and (ii): λp < wpµ ⟹ wpµ − λp > 0 hold, then

$$Q_{w_1w_1}Q_{w_2w_2} - Q_{w_1w_2}^2 = \frac{2\lambda_1\mu^2}{(w_1\mu-\lambda_1)^3}\cdot\frac{2\lambda_2\mu^2}{(w_2\mu-\lambda_2)^3} - 0^2 = \frac{4\lambda_1\lambda_2\mu^4}{(w_1\mu-\lambda_1)^3(w_2\mu-\lambda_2)^3} > 0,$$

$$Q_{w_1w_1} = \frac{2\lambda_1\mu^2}{(w_1\mu-\lambda_1)^3} > 0, \qquad Q_{w_2w_2} = \frac{2\lambda_2\mu^2}{(w_2\mu-\lambda_2)^3} > 0.$$

So the critical point is indeed a minimum if the given constraints hold.
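For the Q version the optimal split is available in closed form, so it can be evaluated directly. A minimal sketch in Python (the function name is illustrative):

```python
import math

def optimal_split_mm1_q(lam1, lam2, mu, L):
    """Closed-form optimal split of the M/M/1 model, Q version."""
    Lam = math.sqrt(lam1 / lam2)                  # Lambda = sqrt(lambda1 / lambda2)
    w1 = (Lam * (1 - L) + (lam1 - lam2 * Lam) / mu) / (1 + Lam)
    return w1, 1 - L - w1                         # (w1*, w2* = 1 - L - w1*)
```

The result should of course still be checked against the stability conditions λp < w∗pµ before it is used.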

2.2 M/D/1 queue

In an M/D/1 queuing model, the service time is deterministic. Suppose the traffic arrival rate in direction p ∈ {1, 2} is given by λp. The inter-arrival times are exponential with mean 1/λp. Moreover, suppose the service time in direction p is given by E[Bp] = 1/(wpµ), where µ denotes the intersection departure rate and wp the ratio of green time allocated to direction p in a cycle. For stability, it is required that ρp = λpE[Bp] < 1 (Adan & Resing, 2020).

For an M/G/1 queue, the mean waiting time is given by

$$T_p = \frac{\rho_p E[R_p]}{1-\rho_p}.$$

Here,

$$E[R_p] = \frac{E[B_p^2]}{2E[B_p]}$$

denotes the mean residual service time in direction p. In the case of an M/D/1 queue, Var[Bp] = 0. Thus, the mean residual service time can be expressed as

$$E[R_p] = \frac{E[B_p]^2}{2E[B_p]} = \frac{E[B_p]}{2}.$$

As a result, the following mean waiting time Tp in direction p is obtained

$$T_p = \frac{\rho_p E[B_p]}{2-2\rho_p} = \frac{\frac{\lambda_p}{w_p\mu}\cdot\frac{1}{w_p\mu}}{2-\frac{2\lambda_p}{w_p\mu}} = \frac{\lambda_p}{2w_p\mu(w_p\mu-\lambda_p)}.$$

Again, the mean waiting time of an arbitrary vehicle in the system is given by

$$T = \frac{\lambda_1}{\lambda_1+\lambda_2}T_1 + \frac{\lambda_2}{\lambda_1+\lambda_2}T_2.$$

Substituting the mean waiting time Tp and using w2 = 1 − L − w1 results in

$$T = \frac{\lambda_1}{\lambda_1+\lambda_2}\cdot\frac{\lambda_1}{2w_1\mu(w_1\mu-\lambda_1)} + \frac{\lambda_2}{\lambda_1+\lambda_2}\cdot\frac{\lambda_2}{2w_2\mu(w_2\mu-\lambda_2)} = \frac{1}{2(\lambda_1+\lambda_2)}\left(\frac{\lambda_1^2}{w_1\mu(w_1\mu-\lambda_1)} + \frac{\lambda_2^2}{(1-L-w_1)\mu((1-L-w_1)\mu-\lambda_2)}\right).$$

It can be seen that the mean waiting time T of the M/D/1 queue is half the mean waiting time of the M/M/1 queue.


2.2.1 The optimal split minimizing the waiting time

The objective is to find the optimal split that minimizes the mean waiting time T. Differentiating T with respect to w1 and equating to 0 gives

$$\partial_{w_1}T = \frac{1}{2(\lambda_1+\lambda_2)}\left(\frac{\lambda_1^2(\lambda_1-2w_1\mu)}{w_1^2\mu(\lambda_1-w_1\mu)^2} - \frac{\lambda_2^2(\lambda_2+2\mu(L-1+w_1))}{\mu(L-1+w_1)^2(\lambda_2+\mu(L-1+w_1))^2}\right) = 0.$$

The critical point that results in the minimum, the optimal split, will be denoted by w∗1,MD1. Then, w∗2,MD1 is found using

$$w^*_{2,\mathrm{MD1}} = 1 - L - w^*_{1,\mathrm{MD1}}.$$

Note that this leads to the same equation as the one found for the M/M/1 queue. Therefore, the derivation from the M/M/1 queue can be re-used to find the optimal split for the M/D/1 queue.

2.2.2 The optimal split minimizing the queue length

The objective is to find the optimal split that minimizes the mean queue length. Using E[Sp] = E[Wp] + E[Bp] and Little's Law E[Lp] = λpE[Sp] applied to the system, the following is obtained

$$E[L_p] = \lambda_p E[S_p] = \lambda_p\left(E[W_p]+E[B_p]\right) = \lambda_p\left(\frac{\rho_p E[B_p]}{2-2\rho_p} + E[B_p]\right) = \frac{\lambda_p E[B_p](\rho_p+2-2\rho_p)}{2-2\rho_p} = \frac{\lambda_p E[B_p](2-\rho_p)}{2-2\rho_p} = \frac{2\rho_p-\rho_p^2}{2-2\rho_p} = \frac{\frac{2\lambda_p}{w_p\mu}-\left(\frac{\lambda_p}{w_p\mu}\right)^2}{2-\frac{2\lambda_p}{w_p\mu}} = \frac{\lambda_p(2w_p\mu-\lambda_p)}{2w_p\mu(w_p\mu-\lambda_p)}.$$

Let Q = E[L1] + E[L2]. Using w2 = 1 − L − w1 results in

$$Q = \frac{\lambda_1(2w_1\mu-\lambda_1)}{2w_1\mu(w_1\mu-\lambda_1)} + \frac{\lambda_2(2w_2\mu-\lambda_2)}{2w_2\mu(w_2\mu-\lambda_2)} = \frac{\lambda_1(2w_1\mu-\lambda_1)}{2w_1\mu(w_1\mu-\lambda_1)} + \frac{\lambda_2(2(1-L-w_1)\mu-\lambda_2)}{2(1-L-w_1)\mu((1-L-w_1)\mu-\lambda_2)}.$$

To find the optimal split w∗1,MD1, Q is differentiated with respect to w1 and equated to 0,

$$\partial_{w_1}Q = -\frac{\lambda_1(\lambda_1^2-2\lambda_1w_1\mu+2\mu^2w_1^2)}{2w_1^2\mu(\lambda_1-w_1\mu)^2} + \frac{\lambda_2(\lambda_2^2-2\lambda_2\mu(1-L-w_1)+2\mu^2(1-L-w_1)^2)}{2\mu(L-1+w_1)^2(\lambda_2-\mu(1-L-w_1))^2} = 0.$$

This equation has to be solved numerically with Mathematica. Here, it needs to be checked whether the critical point (w∗1,MD1, w∗2,MD1) is indeed a minimum.


2.3 D/D/1 queue

In the D/D/1 queue, both the arrival and departure rates are deterministic and there is a single server. Suppose the arrival rate in direction p is λp and the departure rate is µ, with wp the ratio of allocated green time in direction p. Then for stability, it is required that ρp = λp/(wpµ) < 1. Due to this requirement, the mean waiting time for a D/D/1 queue is equal to 0.

2.3.1 The optimal split minimizing the queue length

The objective is to find the optimal split that minimizes the mean queue length. The expected queue length in direction p is given by Qp = ρp = λp/(wpµ). Let Q = Q1 + Q2. Then, using w2 = 1 − L − w1,

$$Q = \rho_1 + \rho_2 = \frac{\lambda_1}{w_1\mu} + \frac{\lambda_2}{w_2\mu} = \frac{\lambda_1}{w_1\mu} + \frac{\lambda_2}{(1-L-w_1)\mu}.$$

To find the optimal split w∗1,DD1, the expression for Q will be differentiated with respect to w1 and equated to 0,

$$\partial_{w_1}Q = -\frac{\lambda_1}{w_1^2\mu} + \frac{\lambda_2}{(1-L-w_1)^2\mu} = 0$$

$$\implies w_1 = \frac{\lambda_1 - \lambda_1 L \pm \sqrt{\lambda_1\lambda_2 - 2\lambda_1\lambda_2 L + \lambda_1\lambda_2 L^2}}{\lambda_1-\lambda_2} = \frac{\lambda_1(1-L) \pm (1-L)\sqrt{\lambda_1\lambda_2}}{\lambda_1-\lambda_2} = \cdots = \frac{\sqrt{\frac{\lambda_1}{\lambda_2}}\,(1-L)}{1+\sqrt{\frac{\lambda_1}{\lambda_2}}} = \frac{\Lambda(1-L)}{1+\Lambda}.$$

Since w∗2,DD1 = 1 − L − w∗1,DD1, the following optimal splits are obtained

$$w^*_{1,\mathrm{DD1}} = \frac{\Lambda(1-L)}{1+\Lambda}, \qquad w^*_{2,\mathrm{DD1}} = \frac{1-L}{1+\Lambda}.$$
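As a short check, take λ1 = 0.5, λ2 = 1 and L = 0.1 (values that also appear in Chapter 3). Then Λ = √(0.5/1) ≈ 0.7071 and

$$w^*_{1,\mathrm{DD1}} = \frac{0.7071\cdot 0.9}{1.7071} \approx 0.3728,$$

which matches the value 0.372792 quoted for this parameter choice in Section 3.1.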

Evaluating the critical point

To see if the critical point (w∗1,DD1, w∗2,DD1) is indeed a minimum, it needs to be confirmed that

$$Q_{w_1w_1}Q_{w_2w_2} - Q_{w_1w_2}^2 > 0 \quad\text{and}\quad Q_{w_1w_1},\ Q_{w_2w_2} > 0$$

both hold for this point. For the D/D/1 queue, this gives

$$\begin{cases} Q_{w_1} = -\dfrac{\lambda_1}{w_1^2\mu}\\[2ex] Q_{w_2} = -\dfrac{\lambda_2}{w_2^2\mu} \end{cases} \implies \begin{cases} Q_{w_1w_1} = \dfrac{2\lambda_1}{w_1^3\mu}\\[2ex] Q_{w_2w_2} = \dfrac{2\lambda_2}{w_2^3\mu}\\[2ex] Q_{w_1w_2} = 0. \end{cases}$$


Note that if the constraint (i): λp, wp, µ > 0 holds, then

$$Q_{w_1w_1}Q_{w_2w_2} - Q_{w_1w_2}^2 = \frac{2\lambda_1}{w_1^3\mu}\cdot\frac{2\lambda_2}{w_2^3\mu} - 0^2 = \frac{4\lambda_1\lambda_2}{w_1^3w_2^3\mu^2} > 0, \qquad Q_{w_1w_1} = \frac{2\lambda_1}{w_1^3\mu} > 0, \qquad Q_{w_2w_2} = \frac{2\lambda_2}{w_2^3\mu} > 0.$$

So the critical point is indeed a minimum if the given constraint holds.

2.4 Fixed cycle traffic light fluid queue

As opposed to the G/G/1 queue types, the fixed cycle traffic light fluid queue model takes into account the traffic lights. Here, the green and red times are considered for a fixed cycle length. The set-up for the fixed cycle traffic light fluid queue, which will frequently be shortened to fluid queue, is as follows:

Suppose the vehicles arrive with rate λp in direction p ∈ {1, 2} and depart with rate µ during green time. Moreover, assume the inter-arrival and service times are deterministic. Let rp denote the red time and gp the green time in direction p. Let x denote the time it takes to clear the road. This is the minimum time between one traffic light turning red and another traffic light turning green; during this time both traffic lights are red. The cycle time of the traffic light can then be expressed as

$$c = g_1 + r_1 = g_2 + r_2 = g_1 + x + g_2 + x.$$

For the queue length not to explode, the stability condition λpc ≤ µgp needs to hold for both directions. In other words, the number of cars arriving in a cycle needs to be less than or equal to the maximum number of cars that can depart during green time (Adan & Resing, 2003).

The number of cars in the queue (per direction) is a step-function. This step-function can be expressed as a continuous function: during red time this function has slope λp and during green time it has slope λp − µ. The area beneath this continuous function is then given by

$$\text{Area} = \frac{1}{2}\cdot r_p\cdot\lambda_p r_p + \frac{1}{2}\cdot\frac{\lambda_p r_p}{\mu-\lambda_p}\cdot\lambda_p r_p = \frac{\lambda_p r_p^2}{2(1-\rho_p)}.$$

Here, ρp = λp/µ. Then the mean number of vehicles Qp waiting in the queue in direction p is given by

$$Q_p = \frac{\text{Area}}{c} = \frac{\lambda_p r_p^2}{2c(1-\rho_p)}.$$

Let Tp denote the mean waiting time in direction p. Using Little's Law, Qp = λpTp, yields

$$T_p = \frac{Q_p}{\lambda_p} = \frac{r_p^2}{2c(1-\rho_p)} = \frac{(c-g_p)^2}{2c(1-\rho_p)}.$$

The mean waiting time for an arbitrary vehicle in the system is then given by

$$T = \frac{\lambda_1}{\lambda_1+\lambda_2}T_1 + \frac{\lambda_2}{\lambda_1+\lambda_2}T_2.$$


Using g2 = c − 2x − g1 results in

$$T = \frac{\lambda_1}{\lambda_1+\lambda_2}\cdot\frac{(c-g_1)^2}{2c(1-\rho_1)} + \frac{\lambda_2}{\lambda_1+\lambda_2}\cdot\frac{(c-g_2)^2}{2c(1-\rho_2)} = \frac{1}{\lambda_1+\lambda_2}\left(\frac{\lambda_1(c-g_1)^2}{2c(1-\rho_1)} + \frac{\lambda_2(c-(c-2x-g_1))^2}{2c(1-\rho_2)}\right) = \frac{1}{\lambda_1+\lambda_2}\left(\frac{\lambda_1(c-g_1)^2}{2c(1-\rho_1)} + \frac{\lambda_2(2x+g_1)^2}{2c(1-\rho_2)}\right).$$

This expression will be used in the next section to find the optimal split that minimizes the mean waiting time.

2.4.1 The optimal split minimizing the waiting time

The objective is to minimize the mean waiting time, that is, to minimize the expression T. As with the queuing models, T is differentiated with respect to g1 and equated to 0,

$$\partial_{g_1}T = \frac{1}{\lambda_1+\lambda_2}\left(\frac{-2\lambda_1(c-g_1)}{2c(1-\rho_1)} + \frac{2\lambda_2(2x+g_1)}{2c(1-\rho_2)}\right) = 0.$$

This leads to the equation

$$\frac{\lambda_1(c-g_1)}{1-\rho_1} = \frac{\lambda_2(2x+g_1)}{1-\rho_2}.$$

Solving this equation for g1 gives

$$g_1 = -\frac{\lambda_2(g_1+2x)(1-\rho_1)}{\lambda_1(1-\rho_2)} + c \implies \cdots \implies g_1 = \frac{c\lambda_1(\mu-\lambda_2) - 2x\lambda_2(\mu-\lambda_1)}{(\lambda_1+\lambda_2)\mu - 2\lambda_1\lambda_2}.$$

The full computation can be found in Appendix A.2. The critical point that corresponds to a minimal mean waiting time will be denoted as g∗p. Since g2 = c − 2x − g1, the following optimal green times are obtained

$$g_1^* = \frac{c\lambda_1(\mu-\lambda_2) - 2x\lambda_2(\mu-\lambda_1)}{(\lambda_1+\lambda_2)\mu - 2\lambda_1\lambda_2}, \tag{2.2}$$

$$g_2^* = c - 2x - \frac{c\lambda_1(\mu-\lambda_2) - 2x\lambda_2(\mu-\lambda_1)}{(\lambda_1+\lambda_2)\mu - 2\lambda_1\lambda_2}. \tag{2.3}$$
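As a quick check with the parameter values used later in Chapter 3 (c = 100, x = 5, µ = 2) and λ1 = 0.5, λ2 = 1, Formula 2.2 gives

$$g_1^* = \frac{100\cdot 0.5\cdot(2-1) - 2\cdot 5\cdot 1\cdot(2-0.5)}{(0.5+1)\cdot 2 - 2\cdot 0.5\cdot 1} = \frac{50-15}{2} = 17.5,$$

the value reported in Section 3.1.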

Evaluating the critical point

To see if the critical point (g∗1, g∗2) is indeed a minimum, Tg1g1Tg2g2 − T²g1g2 > 0 and Tg1g1, Tg2g2 > 0 both need to hold for the critical point. Here, the subscript gp denotes differentiation with respect to gp. Note that by definition c ≥ gp. If the constraints (i): c, λp, µ > 0 and (ii): λpc ≤ µgp both hold, then

$$\begin{cases} T_{g_1} = \dfrac{\lambda_1}{\lambda_1+\lambda_2}\cdot\dfrac{-(c-g_1)}{c(1-\rho_1)}\\[2ex] T_{g_2} = \dfrac{\lambda_2}{\lambda_1+\lambda_2}\cdot\dfrac{-(c-g_2)}{c(1-\rho_2)} \end{cases} \implies \begin{cases} T_{g_1g_1} = \dfrac{\lambda_1}{\lambda_1+\lambda_2}\cdot\dfrac{1}{c(1-\rho_1)} = \dfrac{\mu\lambda_1}{c(\mu-\lambda_1)(\lambda_1+\lambda_2)},\\[2ex] T_{g_2g_2} = \dfrac{\lambda_2}{\lambda_1+\lambda_2}\cdot\dfrac{1}{c(1-\rho_2)} = \dfrac{\mu\lambda_2}{c(\mu-\lambda_2)(\lambda_1+\lambda_2)},\\[2ex] T_{g_1g_2} = 0, \end{cases}$$

$$\implies T_{g_1g_1}T_{g_2g_2} - T_{g_1g_2}^2 = \frac{\mu\lambda_1}{c(\mu-\lambda_1)(\lambda_1+\lambda_2)}\cdot\frac{\mu\lambda_2}{c(\mu-\lambda_2)(\lambda_1+\lambda_2)} - 0^2 = \frac{\mu^2\lambda_1\lambda_2}{(c\mu-c\lambda_1)(c\mu-c\lambda_2)(\lambda_1+\lambda_2)^2} \overset{(*)}{>} 0,$$

$$\implies T_{g_1g_1} = \frac{\mu\lambda_1}{c(\mu-\lambda_1)(\lambda_1+\lambda_2)} \overset{(*)}{>} 0, \qquad T_{g_2g_2} = \frac{\mu\lambda_2}{c(\mu-\lambda_2)(\lambda_1+\lambda_2)} \overset{(*)}{>} 0,$$

$$(*):\ \text{since } c\mu - c\lambda_p \geq c\mu - \mu g_p = \mu(c-g_p) > 0.$$

Thus, the critical point is a minimum if the constraints hold.

2.5 Simulation

In this section, the set-up of the discrete-event simulation of the intersection with two conflicting flows and traffic lights will be described. This simulation will be used in Chapter 3 to compare with the theoretical results regarding the expected queue length and waiting time. Furthermore, it will be used in Chapter 5 as the environment for the Q-learning algorithm and to compare the optimality results of the two optimization methods.

In this thesis, there will only be two directions. However, the simulation has been constructed in such a way that more directions can be added without needing to make many alterations to the code.

The vehicles in the simulation will arrive according to a Poisson process with rate λ1 and λ2 in direction 1 and 2 respectively. Furthermore, it is assumed that they depart with deterministic rate µp for p = 1, 2. Let Gp denote the green time in direction p and Rp the amount of time it takes to clear the road in direction p for p = 1, 2. This is similar to the x from the fixed cycle traffic light fluid queue from Section 2.4. A cycle thus consists of c = G1 + R1 + G2 + R2. Moreover, the optimal split Formulas 2.2 and 2.3 of the fixed cycle fluid queue discussed in Section 2.4.1 can and will be used in the simulation. Note that outside of the interval [0, c − R1 − R2], these formulas do not hold. This is due to the fact that outside of this interval, Gp ≥ 0 is no longer satisfied for one of the two directions. Thus, when using these formulas in the simulation, the following modification will be implemented (a code sketch follows the list):

• If G1, G2 ∈ [0, c − R1 − R2], then still

$$G_1 = \frac{c\lambda_1(\mu-\lambda_2) - (R_1+R_2)\lambda_2(\mu-\lambda_1)}{\mu(\lambda_1+\lambda_2) - 2\lambda_1\lambda_2}, \qquad G_2 = c - R_1 - R_2 - G_1.$$

• If G1 < 0 according to the formula, then G1 = 0 and G2 = c − R1 − R2.

• If G2 < 0 according to the formula, then G1 = c − R1 − R2 and G2 = 0.
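A minimal sketch of this clamped allocation in Python (the function and variable names are illustrative, not the thesis code):

```python
def green_times(c, R1, R2, lam1, lam2, mu):
    """Green times from the fluid-queue formula, clamped to [0, c - R1 - R2]."""
    avail = c - R1 - R2  # total green time available in one cycle
    G1 = (c * lam1 * (mu - lam2) - (R1 + R2) * lam2 * (mu - lam1)) \
         / (mu * (lam1 + lam2) - 2 * lam1 * lam2)
    if G1 < 0:               # formula gives direction 1 negative green time
        return 0.0, avail
    if G1 > avail:           # equivalently, G2 < 0
        return avail, 0.0
    return G1, avail - G1

# With the Chapter 3 parameters: green_times(100, 5, 5, 0.5, 1.0, 2.0) -> (17.5, 72.5)
```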

The simulation has been modeled as a discrete-event simulation in Python. In Algorithm 1, the pseudo-code for the four events that occur in the simulation can be found. These four events are: an arrival, a departure, the start of a green period and the end of a green period. Let the variable state denote which traffic light is green. Here, state = p means that the traffic light in direction p is green. When all traffic lights are red, state = −1. Let e denote the current event and e.flow the direction in which said event occurs. Let G and R be the arrays containing the green time and the time it takes to clear the road respectively. Here, G[p] denotes the amount of green time in direction p. Similarly, R[p] denotes the amount of time it takes to clear the road in direction p; during this time all traffic lights are red. These are the same as the Gp and Rp described before. Furthermore, ArrDist[p] denotes a random number generated from the arrival distribution for direction p, and µ[p] denotes the departure rate in direction p. In this simulation, the green periods occur in the chronological order of the directions. Since there are only two directions, the green periods will alternate, unless one of the two directions has a green time of 0. In that case, there is only one direction in which green periods occur.

Algorithm 1: Pseudo-code discrete-event simulation

At time t:
if Event = START GREEN then
    state = e.flow
    if queue length in direction e.flow ≥ 1 then
        schedule DEPARTURE event
    end
    schedule END GREEN event in direction e.flow at time t + G[e.flow]
else if Event = END GREEN then
    state = −1
    schedule START GREEN event in direction (e.flow + 1) mod p at time t + R[e.flow]
else if Event = ARRIVAL then
    add the vehicle to the queue in direction e.flow
    register the waiting time
    schedule DEPARTURE event at time t + µ[e.flow]
    schedule new ARRIVAL event in direction e.flow at time t + ArrDist[e.flow]
else if Event = DEPARTURE then
    remove the vehicle from the queue in direction e.flow
    if queue length in direction e.flow ≥ 1 and state = e.flow then
        register the waiting time of the vehicle eligible for departure
        schedule new DEPARTURE event of said vehicle in direction e.flow at time t + µ[e.flow]
    end
end
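The thesis simulation itself is not reproduced here, but a minimal event-loop skeleton in Python (using heapq as the future-event list; all names are illustrative and several details of the thesis code are guessed) shows how the four events of Algorithm 1 can be scheduled:

```python
import heapq, random

def simulate(G, R, lam, service, horizon):
    """Skeleton of the discrete-event loop of Algorithm 1 (illustrative sketch)."""
    events, seq = [], 0                    # heap of (time, tie-breaker, kind, flow)
    def schedule(t, kind, flow):
        nonlocal seq
        heapq.heappush(events, (t, seq, kind, flow)); seq += 1
    queues = [[], []]                      # arrival times of waiting vehicles
    state, waits = -1, []                  # -1: all lights red, p: direction p green
    schedule(0.0, "START_GREEN", 0)
    for p in (0, 1):                       # first arrivals in both directions
        schedule(random.expovariate(lam[p]), "ARRIVAL", p)
    while events:
        t, _, kind, p = heapq.heappop(events)
        if t > horizon:
            break
        if kind == "START_GREEN":
            state = p
            if queues[p]:                  # a waiting vehicle may start departing
                schedule(t + service[p], "DEPARTURE", p)
            schedule(t + G[p], "END_GREEN", p)
        elif kind == "END_GREEN":
            state = -1
            schedule(t + R[p], "START_GREEN", (p + 1) % 2)
        elif kind == "ARRIVAL":
            queues[p].append(t)
            if state == p and len(queues[p]) == 1:   # empty queue, green light
                schedule(t + service[p], "DEPARTURE", p)
            schedule(t + random.expovariate(lam[p]), "ARRIVAL", p)
        elif kind == "DEPARTURE":
            waits.append(t - queues[p].pop(0))       # time spent in the system
            if queues[p] and state == p:             # next vehicle starts departing
                schedule(t + service[p], "DEPARTURE", p)
    return sum(waits) / len(waits) if waits else 0.0
```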


Chapter 3

Model analyses

In this chapter, the various queuing models and the simulation discussed in Chapter 2 will be analyzed. To this end, the stability region, mean queue length, mean waiting time and optimal split found by the theoretical optimization method will be examined and compared. For this, most parameters and variables will be fixed and equal for all queues. Namely,

• Departure/service rate µ = µp = 2 for p = 1, 2,

• The normalized total loss time per cycle L = 0.1,

• Cycle length c = 100,

• The amount of time given to clear the road per direction x = Rp = 5 for p = 1, 2.

Furthermore, λp will be varied within the interval (0, 2.5]. These values have loosely been chosen based on the paper (Chanloha et al., 2012), in which similar values have been implemented.

For easier reference, the case where the optimal split has been found by minimizing the mean queue length will from now on be denoted as the Q version. The case where the optimal split has been found by minimizing the mean waiting time will be denoted as the T version. The meaning of the symbols and variables used in this chapter can again be found in the overview in Appendix A.1.

3.1 Stability region

In Figures 3.1 and 3.2, the stability region can be found for each of the different queuing and road traffic models and cases. Note that in the plots, the white/black lines and/or regions occur when there is a 1/0 evaluation in the equation for finding the optimal split. As a result, there is no solution at the values of λ1 and λ2 where this happens.

The region where both queues are stable is virtually equal for both cases of the M/M/1 and M/D/1 queue. For the T version this makes sense, as the equation that had to be solved to find the optimal split was equivalent for both these queues. Subsequently, the optimal split, and thus the stability condition, is equivalent. This is illustrated in Figures 3.1b and 3.1d. Comparing the Q version of the M/M/1 and M/D/1 queue, it can be seen in Figures 3.1a and 3.1c that the unstable regions for the M/D/1 queue are more inconsistent. This is due to the fact that the equation to find the optimal split is more complicated than the equation for the M/M/1 queue. The equation contained extra terms and had to be solved numerically. While the equation for the M/M/1 queue only had two critical points, the M/D/1 queue could reach up to six critical points for some values of λ1 and λ2. Most of these critical points were imaginary. Thus, the real, positive critical point that led towards the smallest queue length had to be selected. Nevertheless, the region where both queues are stable is virtually the same for the M/M/1 and M/D/1 Q version models.

For both the D/D/1 and the fixed cycle traffic light fluid models the stability region is smaller. For the D/D/1 queue, the solutions w∗1 and w∗2 from Section 2.3 do not include one of the terms from the solution of the M/M/1 (Q version) from Section 2.1. As a result, w∗p is smaller for certain values of λ1 and λ2, and thus the stability condition λp < w∗pµ is no longer satisfied for one and/or both directions.

The stability behavior of the fixed cycle traffic light fluid queue is also noticeably different compared to the queuing models. Here, a bigger difference between the values of λ1 and λ2 can lead to instability. For example, when choosing λ1 = 0.5 and λ2 = 1, where λ1 is half of λ2, the following results are obtained:

$$w^*_{1,\mathrm{DD1}} = 0.372792 \quad\text{and}\quad g_1^* = 17.5,\ \frac{g_1^*}{c} = 0.175.$$

When comparing w∗1 with g∗1/c, it can be seen that g∗1/c is less than half of w∗1. This results in the instability of queue 1, as illustrated in Figure 3.2a: the stability condition λ1 ≤ g∗1µ/c is no longer satisfied. For the D/D/1 queue, on the other hand, the stability condition is still satisfied and there thus is stability.

Moreover, taking λ1 = 1 and λ2 = 0.000000001 results in

$$w^*_{1,\mathrm{DD1}} \approx 0.9,\quad w^*_{2,\mathrm{DD1}} \approx 0 \quad\text{and}\quad g_1^* \approx 100,\quad g_2^* \approx -10.$$

Since λ2 is virtually equal to zero, it would be expected that the amount of green time allocated to direction 2 also equals 0. This is indeed true for the D/D/1 case, and for this case the stability condition is met. Checking the stability condition λpc ≤ µg∗p for the fixed cycle fluid queue gives

$$\lambda_1 c \approx 100 \leq 200 \approx \mu g_1^*\ (\checkmark) \quad\text{and}\quad \lambda_2 c \approx 0 \leq -20 \approx \mu g_2^*\ (\times).$$

It can be seen that only for direction 1 the stability condition is met. Furthermore, the values g∗1 = 100 and g∗2 = −10 are out of range: they do not lie in the interval [0, c − 2x] with c = 100 and x = 5.

As illustrated in Figure 3.2b, there is a similarity between the stability region of the fixed cycle traffic light fluid queue and the simulation. This is due to the fact that the green time used in the simulation comes from the optimal green time formula of the fixed cycle fluid queue from Section 2.4.1. However, due to the adjustment made in the simulation to include valid green times for all values of λp, the stability region of the simulation is slightly bigger.

In the simulation, the stability region has not been determined by checking if the stability condition is met for each of the values of λp. Instead it is determined with a more heuristic approach. In this approach, one checks the difference in the average total waiting time when evaluating the simulation for a time-horizon of T = 1 000 000 and a time-horizon of T = 100 000.

The stability region has been evaluated for values of λp with 0 < λp ≤ 2 using a step-size of 0.01. If for a certain value of λp there was an increase of factor% or more in the waiting time, that direction was labeled as unstable. Figure 3.2b shows the stability region when using factor = 10. To reiterate: here a queue is labeled as unstable when there is a 10% difference in waiting time between the two runs (of different length) of the simulation.
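A sketch of this heuristic in Python (assuming a mean_wait(λ1, λ2, horizon) helper that runs the simulation and returns the average waiting time; all names are illustrative):

```python
def is_unstable(wait_short, wait_long, factor=10):
    """Unstable if the long run's mean wait exceeds the short run's by >= factor %."""
    return wait_long >= wait_short * (1 + factor / 100)

# Illustrative scan over the grid of arrival rates:
# for lam1 in [k * 0.01 for k in range(1, 201)]:
#     for lam2 in [k * 0.01 for k in range(1, 201)]:
#         unstable = is_unstable(mean_wait(lam1, lam2, 100_000),
#                                mean_wait(lam1, lam2, 1_000_000))
```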

Figure 3.1: Stability regions for the various queues. Panels: (a) M/M/1 (Q version), (b) M/M/1 (T version), (c) M/D/1 (Q version), (d) M/D/1 (T version), (e) D/D/1.


Figure 3.2: Stability regions. Panels: (a) Fixed cycle traffic light fluid queue, (b) Simulation.

3.2 Mean queue length and waiting time

In this section, the most interesting results regarding the mean queue length and mean waiting time will be discussed. The set-up of the plots is as follows: the value of λ1 has been fixed, while the value of λ2 is varied within the stability region. Note that due to symmetry the same results are obtained when interchanging λ1 and λ2. The plots for each of the queuing models separately can be found in Appendix A.3.1. Furthermore, all of the plots for the simulation can be found in Appendix B.1. For the simulation results, a time-horizon of T = 1 000 000 and a step-size of 0.01 for λp have been used.

Figure 3.3 compares the mean waiting time for the M/M/1 and M/D/1 queue for λ1 = 0.5. Here, it can be seen that indeed the mean waiting time of the M/D/1 queue is half the mean waiting time of the M/M/1 queue. For both the Q and T version of the M/M/1 and M/D/1 queues, the mean queue length, and thus also the mean waiting time due to Little's Law, grows increasingly fast as λp grows. At the turning point into an unstable region, the mean queue length and waiting time grow towards infinity.

The mean waiting time for the fixed cycle fluid queue (in the stable region) attains a (local) maximum, after which the waiting time decreases again, as can be seen in Figure 3.4a. This local maximum occurs at the point where λ1 = λ2. Figure 3.4b illustrates the simulation results when using the formula for Gp from the fixed cycle fluid queue. The simulation could be seen as the implementation of the fixed cycle fluid queue in "real life". However, as the simulation is still merely a model, this of course has to be taken with a grain of salt. Nevertheless, it does give more insight into the queue's behavior. Comparing both of the graphs in Figure 3.4, it can be seen that the behavior of the mean waiting time in the stable region is alike for both the theoretical model and the simulation. Note that it is expected that the graphs will not be identical. This is 1) due to the fact that the arrivals are stochastic in the simulation compared to the deterministic arrivals in the theoretical fixed cycle fluid queue model, and 2) because the Gp used in the simulation is an altered version of the theoretical formula. The mean waiting time in the simulation for the entire interval λ2 ∈ [0, 0.9], instead of just the stable region, can be seen in Figure 3.5. The mean waiting time grows towards infinity in the unstable region.

Instead of using the formula, the green time can also be fixed for all λp. A comparison of the mean queue length when using a fixed and a variable Gp is visualized in Figure 3.6a. Here, it can be seen that when allocating the same amount of green time in both directions, namely Gp = 45 for p = 1, 2, the mean queue length is bigger than or equal to that in the case where the formula is used. However, an advantage of a fixed Gp is that it leads to a bigger stability region. For small values of λp, for example, the mean queue length is finite. The mean queue length increases as λp increases. Additionally, the mean queue length grows towards infinity at bigger values of λp for the fixed Gp case compared to the variable Gp case.

When analyzing the mean queue length for each direction separately, as in Figure 3.6b, it can be seen that for λ1 = λ2, the mean queue length for each direction is equal. Moreover, for λ1 > λ2, the mean queue length in direction 1 is bigger than the mean queue length in direction 2. The converse is true as well. This is as expected, since a higher arrival rate corresponds to on average more vehicles arriving in that direction.

Figure 3.3: Mean waiting time for λ1 = 0.5 (T version)

Figure 3.4: Mean waiting time comparison in the stable region. Panels: (a) Fixed cycle traffic light fluid queue (theoretical), (b) Simulation using the Gp formula from the fluid queue.


Figure 3.5: Mean waiting time simulation

Figure 3.6: Simulation results for the mean queue length for λ1 = 0.3. Panels: (a) Comparison between fixed and variable Gp, (b) Using Gp from the fluid queue formula.


3.3 Optimal split

In this section, the optimal split found in Chapter 2 by minimizing either the mean queue length or the mean waiting time will briefly be analyzed. Plots of the optimal split for each of the queuing models separately can be found in Appendix A.3.2. In these plots, three different values of λ1 have been evaluated for each of the queuing models. One similar aspect for all the queuing models is that for fixed λ2, a higher value for λ1 leads to a bigger difference between w∗1 and w∗2. Again, note that due to symmetry switching λ1 and λ2 leads to the same results.

When comparing the optimal split of the M/M/1 and M/D/1 queue for λ1 = 0.5, it can be seen in Figure 3.7 that these coincide for both versions. Particularly, for the T version of these queues the optimal split is equivalent, as illustrated in Figure 3.7b. In this figure, the optimal split for the fixed cycle traffic light fluid queue has been normalized by the cycle length for easier comparison. Here, the difference between g∗1 and g∗2 (normalized) is bigger than the difference between w∗1 and w∗2 of the M/M/1 and M/D/1 queue.

Figures 3.7a and 3.7b show that the optimal splits in direction 1 and direction 2 intersect at the point where λ1 = λ2, regardless of the queuing model. In addition, it holds for every queuing model that for λ1 > λ2 the amount of green time allocated in direction 1 is bigger than the amount of green time allocated in direction 2; in other words, w∗1 > w∗2. This is as expected, since more vehicles arrive in direction 1 than in direction 2 and thus more time is needed for these vehicles to depart. The contrary, where λ1 < λ2, is also true: here it holds that w∗1 < w∗2.

Figure 3.7: Optimal split comparison for λ1 = 0.5. Panels: (a) Q version, (b) T version.


3.4 Discussion

Finding the optimal split by minimizing the theoretical mean waiting time or mean queue length using queuing theory is a method that can be tricky to use in real-life applications. In our model, the intersection only consists of two conflicting flows, for which finding a theoretical solution is still somewhat doable. Still, numerical methods already had to be used for the T version of the minimizations. For a more complex intersection consisting of more flows, lanes and possible directions, it would be very difficult, if not impossible, to theoretically find an optimum.

Furthermore, in order to implement the optimal split found through theory at a real-life traffic light intersection with two conflicting flows, the arrival and departure rates would have to be known. However, as the arriving traffic is dynamic, such rates can constantly change. These are aspects that continuously have to be taken into account when employing this method.

The choice of parameters in this chapter might also not correspond well to the real-life situation. In particular, the amount of time given to clear the road that has been investigated was x = Rp = 5. However, since the departure rate equaled 2, values of x and Rp that are more attuned to this departure rate could have made the situation more realistic. Another option could be to use x = Rp = 3.5, as this is the official amount of yellow/orange time on 50 km/h roads in the Netherlands (ANWB, 2017).

The queuing models of the G/G/1 type do not take the cycle length into consideration. The optimal split found for these models is the ratio of green time; it is normalized. This is because in these models, the two queues operate separately. The connection is made by equating w1 = 1 − L − w2. There is no red or green time; in fact, these models are always in their "green period". The number of vehicles able to depart per time unit, the departure rate µp, is adjusted by multiplying it with the normalized optimal split w∗p. Since these queuing models do not use red and green periods and a cycle, one could argue that these models do not correspond well with our intersection with two conflicting flows and two traffic lights. The set-up of these models also made it more difficult to implement them in a discrete-event simulation. Therefore, a similar set-up to that of the fixed cycle traffic light fluid queue model has been implemented in the simulation instead.


Chapter 4

MDP and Q-learning theory and literature study

In this chapter, various mathematical concepts and tools needed for the construction and implementation of the Q-learning algorithm will be explored. The first concept that will be delved into is the theory of Markov decision processes. This theory will then be applied to a traffic light intersection using policy iteration. Furthermore, the notion of Markov decision processes will be used to describe the Q-learning algorithm for finding optimal policies. The Q-learning algorithm and theory will later be implemented in Chapter 5.

4.1 Finite Markov Decision Processes

A Markov decision process (MDP) is a discrete-time stochastic process that is used to provide a mathematical framework for sequential decision making where the outcomes are partly uncertain and partly under the control of a decision maker (the agent). It takes both short-term outcomes of current decisions and opportunities for making future decisions into consideration. MDPs are useful for optimization problems that can be solved via dynamic programming and reinforcement learning (Bellman, 1957; Kallenberg, 2016). One of these reinforcement learning algorithms is the Q-learning algorithm, which will be implemented in this thesis. This section about MDPs will therefore provide the theory that is necessary to understand and construct this algorithm.

The model

Markov decision processes consist of a stochastic (discrete-time) process Xn, n = 0, 1, 2, ... on a state space S. Given Xn = i, a decision is chosen from the action set A(i). The probabilistic law according to which the process subsequently evolves may depend on both Xn and the action An ∈ A(Xn) chosen by the decision maker. This transition mechanism has the Markov property, since if the state and action at time n are known, the state at time n + 1 is independent of the history hn−1 = (X0, A0, ..., Xn−1, An−1). So

$$P(X_{n+1}=i_{n+1}\mid X_0=i_0, A_0=a_0,\ldots,X_n=i_n,A_n=a_n) = P(X_{n+1}=i_{n+1}\mid X_n=i_n,A_n=a_n) =: p_{i_n,i_{n+1}}(a_n),$$

where i0, ..., in+1 ∈ S and a0 ∈ A(i0), ..., an ∈ A(in). Furthermore, there is a transition matrix, in which:


1. If the action a = f(i) is a given function f of the state, then {Xn}n is a Markov chain with transition matrix P(f) = {pi,j(f(i))}i,j∈S.

2. If a = fn(i) is a time-dependent function of the state, then {Xn}n is a non-stationary Markov chain with transition matrix P(fn) = {pi,j(fn(i))}i,j∈S at time n.

Suppose a reward ri,j(a) is earned whenever the process Xn is in state i at time n, action a is chosen and the process moves to state j. Then the expected reward if action a is taken while in state i is represented by

$$r_i(a) = \sum_{j\in S} p_{ij}(a)r_{ij}(a).$$

Moreover, when at state i at time T, we may wish to allow a terminal reward to be earned. This terminal reward is denoted by qi. The expected and terminal rewards are assumed to be uniformly bounded (Spieksma, 2015; Núñez-Queija, 2009).

Decision rules and policies

A decision rule at time stage n, $\sigma^n_{h_{n-1},i_n}$, is a probability distribution on the action space A(in) associated with state in, given the history hn−1 and state in. Then, $\sigma^n_{h_{n-1},i_n}(a)$ is the probability that action a ∈ A(in) is selected.

The decision rule σn is a map that associates a probability distribution with each history up to time stage n − 1 and state up to time stage n. A policy is a sequence of decision rules σ = (σ0, ..., σT−1). There are several types of policies (Spieksma, 2015):

• σ is a Markov policy if for each time stage n the decision rule σn is independent of the history. To indicate the decision rule and probabilities, we can then write $\sigma^n_{i_n}$ and $\sigma^n_{i_n}(a)$ respectively.

• σ is a stationary policy if all decision rules are equal: σn = σ0 for n = 0, ..., T − 1.

• σ is a deterministic (Markov) policy if for each n the probability distribution $\sigma^n_{i_n}$ is degenerate. We then write σ = f = (f0, f1, ..., fT−1), where $f^n_{i_n} \in A(i_n)$ is the decision that is selected in state in at stage n with probability 1.

Objective function

The total expected reward vector $V_T^\sigma$ corresponding to policy σ is, for i ∈ S, given by

$$V_T^\sigma(i) = \sum_{n=0}^{T-1} E^\sigma\left[r_{X_n}(A_n)\mid X_0=i\right] + E^\sigma\left[q_{X_T}\mid X_0=i\right] = \sum_{n=0}^{T-1} E_i^\sigma\left[r_{X_n}(A_n)\right] + E_i^\sigma\left[q_{X_T}\right].$$

Restricting to Markov policies, the total expected reward vector can be expressed more explicitly as

$$V_T^\sigma(i) = r(\sigma^0) + P(\sigma^0)r(\sigma^1) + P(\sigma^0)P(\sigma^1)r(\sigma^2) + \cdots + P(\sigma^0)\cdots P(\sigma^{T-2})r(\sigma^{T-1}) + P(\sigma^0)\cdots P(\sigma^{T-1})q = \sum_{n=0}^{T-1} P(\sigma^0)\cdots P(\sigma^{n-1})r(\sigma^n) + P(\sigma^0)\cdots P(\sigma^{T-1})q.$$

Let σ′ = (σ1, ..., σT−1) denote the restricted policy for a (T − 1)-horizon problem from time 1 up to time T, with associated total expected reward $V_{T-1}^{\sigma'}$. Then,

$$V_T^\sigma = r(\sigma^0) + P(\sigma^0)V_{T-1}^{\sigma'}.$$

From this, it can be seen that for a fixed policy, $V_T^\sigma$ can recursively be computed backwards in time (Spieksma, 2015).

4.1.1 Bellman’s optimality principle

The policy σ∗ is optimal when it attains the maximum value for each initial state. Thus, it is an optimal policy if $V_T^{\sigma^*} = V_T^*$. Here, the function $V_T^*$ is the (T-horizon) value function, for which $V_T^* = \sup_\sigma V_T^\sigma$.

Theorem: Let V0(i) = qi for i ∈ S, and for n = 1, 2, ... let Vn(i) recursively be given by

$$V_n(i) = \max_{a\in A(i)}\Big\{r_i(a) + \sum_{j\in S} p_{ij}(a)V_{n-1}(j)\Big\}, \quad i\in S.$$

Then Vn(i) = V∗n(i), i ∈ S, n = 0, 1, ..., and any policy fn = (fn, fn−1, ..., f1) determined by

$$f_i^n = \arg\max_{a\in A(i)}\Big\{r_i(a) + \sum_{j\in S} p_{ij}(a)V_{n-1}(j)\Big\}, \quad i\in S$$

attains the optimal expected total reward over the periods T − n, ..., T and hence over periods 1, ..., n. In particular, fn is n-horizon optimal (Spieksma, 2015; Núñez-Queija, 2009).

Bellman's optimality principle is one of the concepts from Markov decision processes that is used in Q-learning.

4.2 MDP framework applied to an intersection

In this section, an example is given of how the configuration of the traffic lights at an intersection with two conflicting flows can be modelled as a Markov decision process. For this, a state space, action space and transition probabilities have to be assigned and constructed. This model draws inspiration from the models proposed by (Haijema & Van der Wal, 2008; Haijema, Hendrix, & van der Wal, 2017). The following is defined:

• Let qi be the number of cars waiting in the queue in direction i and q = (q1, q2).

• Let Xn = (X1,n, X2,n)T be the random variable of the number of cars in the queues at time-step n. Here, Xi,n is the random variable of the number of cars in direction i.

• Let G denote a green traffic light and R a red traffic light.

• Let L be the state space of the traffic lights of the intersection, with l = (l1, l2) ∈ L, where li ∈ {G, R} denotes the traffic light state at the beginning of a time-step for direction i.

• Let S be the state space of the intersection, with states (l, q) ∈ S, and S finite.

• Let A(s) ⊆ L be the action space of the intersection, with a = (a1, a2), where ai ∈ {G, R} denotes the action chosen for direction i. Furthermore, let An = (A1,n, A2,n) denote the action taken at time-step n.


Transition probabilities

Using these defined notions, the state transition probabilities from state $s = (l, \mathbf{q})$ to state $s' = (a, \mathbf{q}')$ when taking action $a$ can be derived. For this, suppose that the arrivals in the different directions and time-steps are independent. The following transition probabilities are obtained:

$$p_{s,s'}(a) = \mathbb{P}\left(X_{n+1} = \mathbf{q}' \mid X_n = \mathbf{q},\, A_n = a,\, l_n = (l_{1,n}, l_{2,n})\right) = \prod_{i=1}^{2} \mathbb{P}\left(X_{i,n+1} = q'_i \mid X_{i,n} = q_i,\, A_{i,n} = a_i,\, l_{i,n} = l_i\right) = \prod_{i=1}^{2} p_{s_i,s'_i}(a_i).$$

Suppose that within a time-step exactly one vehicle (if there is any) will depart. Within this time-step, a vehicle can arrive in direction $i$ with probability $\lambda_i$. These new arrivals in the queues are independent Bernoulli processes. Thus, with probability $1 - \lambda_i$ no vehicle arrives in direction $i$. The transition probabilities depend on the direction $i \in \{1, 2\}$ and the chosen action in said direction, $a_i \in \{G, R\}$. In this example, the transition probabilities are given by

$$p_{s_i,s'_i}(a_i = G) = \begin{cases} \lambda_i & \text{if } q'_i = q_i, \\ 1-\lambda_i & \text{if } q'_i = \max\{0,\, q_i - 1\}, \\ 0 & \text{otherwise}, \end{cases} \qquad \text{with matrix } \begin{pmatrix} \lambda_i & 0 & \cdots & \cdots & 0 \\ 1-\lambda_i & \lambda_i & \ddots & & \vdots \\ 0 & \ddots & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & \lambda_i & 0 \\ 0 & \cdots & 0 & 1-\lambda_i & \lambda_i \end{pmatrix},$$

$$p_{s_i,s'_i}(a_i = R) = \begin{cases} \lambda_i & \text{if } q'_i = q_i + 1, \\ 1-\lambda_i & \text{if } q'_i = q_i, \\ 0 & \text{otherwise}, \end{cases} \qquad \text{with matrix } \begin{pmatrix} 1-\lambda_i & \lambda_i & 0 & \cdots & 0 \\ 0 & 1-\lambda_i & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & 0 \\ \vdots & & \ddots & 1-\lambda_i & \lambda_i \\ 0 & \cdots & \cdots & 0 & 1-\lambda_i \end{pmatrix}.$$

Bellman equation

Now that the state space, action space and transition probabilities have been defined, the cost function and Bellman's optimality principle (see Section 4.1.1) for this model can be derived. The objective is to find an optimal policy that minimizes the expected waiting time. Nevertheless, instead of directly minimizing the expected waiting time, the expected queue length at the start of a time-slot will be minimized. As a result of Little's Law, the expected waiting time will then be minimized as well. To this end, the (direct) cost function is given by
$$c(\mathbf{q}) = \sum_{i=1}^{2} q_i = q_1 + q_2.$$

This is the contribution of the cost in one time-slot to the objective function.


Furthermore, let the value function $v^*$ be a solution of the Bellman equation. The Bellman equation is given by
$$\forall\, s \in S: \quad v^*(s) + g^* = c(\mathbf{q}) + \min_{a \in A(s)} \sum_{\mathbf{q}'} p_{s,s'}(a)\,v^*(s').$$

Here, $g^*$ denotes the expected queue length at the start of a time-slot when an optimal policy $\pi^*$ is followed. Such an optimal policy $\pi^*$ satisfies
$$\pi^*(s) = \arg\min_{a \in A(s)} \sum_{\mathbf{q}'} p_{s,s'}(a)\,v^*(s').$$

Note that s′ = (a,q′).

4.2.1 Policy iteration

To solve the Bellman equation, various algorithms can be used if the state and action spaces are finite. The goal of these algorithms is to find the optimal policy $\pi^*$. One such algorithm is the policy iteration algorithm. Policy iteration consists of two steps that are repeated until the policy $\pi$ does not change anymore, i.e., until $\pi' = \pi$. For this traffic light intersection application, the two steps of the iteration are as follows:

1. Policy evaluation step:
$$\forall\, s \in S: \quad v^\pi(s) + g^\pi = c(\mathbf{q}) + \sum_{\mathbf{q}'} p_{s,s'}(\pi(s))\,v^\pi(s')$$

2. Policy improvement step:
$$\pi'(s) = \arg\min_{a \in A(s)} \sum_{\mathbf{q}'} p_{s,s'}(a)\,v^\pi(s')$$

The optimal policy $\pi^*$ has been found if $\pi' = \pi$. However, if $\pi' \neq \pi$, then $\pi := \pi'$ is set and the algorithm is repeated from step 1 until the optimal policy has been found.
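A minimal sketch of these two steps for a finite MDP of this form is given below, assuming a unichain model so that the evaluation system is solvable. The function name, the array layout, and the device of pinning $v(s_0) = 0$ to make the linear system determinate (relative values are only defined up to a constant) are implementation choices of this sketch, not part of the thesis.

```python
import numpy as np

def policy_iteration(P, c):
    """Average-cost policy iteration.

    P: transition probabilities, shape (n_actions, n_states, n_states).
    c: direct costs c(q), shape (n_states,).
    Returns the optimal policy pi, relative values v and average cost g.
    """
    n_a, n_s, _ = P.shape
    pi = np.zeros(n_s, dtype=int)                 # arbitrary initial policy
    while True:
        # 1. Policy evaluation: solve v(s) + g = c(s) + sum_s' p_{s,s'}(pi(s)) v(s').
        P_pi = P[pi, np.arange(n_s), :]           # rows of P under policy pi
        A = np.hstack([np.eye(n_s) - P_pi, np.ones((n_s, 1))])
        A = np.vstack([A, np.eye(1, n_s + 1)])    # extra row: pin v[0] = 0
        b = np.append(c, 0.0)
        sol, *_ = np.linalg.lstsq(A, b, rcond=None)
        v, g = sol[:n_s], sol[n_s]
        # 2. Policy improvement: pi'(s) = argmin_a sum_s' p_{s,s'}(a) v(s').
        pi_new = np.argmin(P @ v, axis=0)
        if np.array_equal(pi_new, pi):
            return pi, v, g
        pi = pi_new
```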

4.3 Q-Learning

Besides algorithms that use the transition probabilities of the MDP framework, model-free algorithms exist as well. One such algorithm is the Q-learning algorithm, which will be implemented in this thesis. Q-learning is a model-free reinforcement learning algorithm that can find an optimal policy for finite MDPs. It optimizes the expected value of the total reward over any and all successive steps, starting from the current state (Melo, 2001). Since Q-learning is model-free, it is suitable for the traffic environment, which contains stochastic problems that change in real-time (Joo, Ahmed, & Lim, 2020). The Q in Q-learning stands for quality: it indicates how useful a given state-action combination is in gaining a certain future reward or cost (Shyalika, 2019). This information is stored in the Q-value for state $s$ and action $a$: $Q(s, a)$. The process of Q-learning is visualized in Figure 4.1. Based on the current state and reward, the agent (the decision maker) takes an action. This action then influences the environment and leads to the next state and reward.


Generally speaking, the Q-learning algorithm learns from the consequences (the reward or cost) of the actions taken at each state by updating the Q-values. Mathematically speaking, the core of the Q-learning algorithm, which updates the Q-values, is the Bellman equation.

Figure 4.1: Schematic overview of Q-learning

When the objective is to maximize the reward, the following Q-learning update rule is implemented:
$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right],$$

in which,

• Q denotes the function of expected cumulative discounted future rewards,

• r denotes the received reward when moving from state s to state s′,

• α ∈ (0, 1] denotes the learning rate,

• γ ∈ (0, 1) denotes the discount factor.

The learning rate $\alpha$ can be seen as the step-size. It determines to what extent the newly obtained information overrides past knowledge. With a learning rate of 0, the algorithm only considers old experiences and information, whereas a learning rate of 1 considers only the recently gained information. In other words, the learning rate is the weight of the newly gained experience in the algorithm (Chin, Kow, Khong, Tan, & Teo, 2012). A learning rate of $\alpha = 1$ is optimal in deterministic environments. In stochastic environments, for convergence of the algorithm, the learning rate needs to decay towards zero under certain conditions (Sutton & Barto, 1998). A further elaboration on these conditions will be given in Section 5.1.1.

The discount factor $\gamma$ dictates the importance of future rewards. With a discount factor of 0, the algorithm only considers current rewards; one could say the agent is short-sighted. For a discount factor close to 1, the algorithm focuses more on long-term rewards (Russell & Norvig, 2010).


The optimal policy that maximizes the expected future rewards according to the Q-learning algorithm is given by
$$\pi(s) = \arg\max_a Q(s, a).$$

In the case where the objective is to minimize the cost, there are several updating rules possible. The first possibility is to use the following updating rule:
$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[c + \gamma \min_{a'} Q(s', a') - Q(s, a)\right].$$
Here, $c$ denotes the received cost when moving from state $s$ to state $s'$ (Chanloha et al., 2012). Now, the best guess at an optimal policy is
$$\pi(s) = \arg\min_a Q(s, a).$$

Another possibility is to take $r = -c$ and then use the updating rule that is used to maximize the reward:
$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[-c + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$
(Chu, Liao, Chang, & Lee, 2019). In this thesis, the first possibility will be used to minimize the cost, as this updating rule is more intuitive.

The pseudo-code of the Q-learning algorithm is described below in Algorithm 2. The initialization of the Q-values can be done in various ways. Here, they have been initialized to zero, as that is the most common and a safe initialization.

In the algorithm, an action is selected in each iteration according to a pre-set action selection strategy. A common action selection strategy is the ε-greedy policy (Chanda, 2020). The idea of the ε-greedy policy is that the "best" action for the current state is selected with probability $1-\varepsilon$ and another action is randomly selected with probability $\varepsilon$. The "best" action is the action that satisfies the objective, which is to either maximize or minimize the expected reward or cost. Thus, this will be the action that optimizes the Q-value $Q(s, a)$. The ε-greedy policy is generally used to avoid possible over-fitting. At the start of the algorithm, a higher value of ε is usually chosen, since not much of the environment is known yet; a higher value of ε then helps with updating more Q-values. After exploring and acquiring more information, ε is lowered (Shyalika, 2019; Kansal & Martin, 2019).

After selecting the action $a$ according to the pre-set action selection strategy in the current state $s$, the resulting state $s'$ and reward $r$ (or cost $c$, depending on the problem) will be observed. These will then be used to update the Q-value $Q(s, a)$ according to the updating rule. State $s'$ will become the current state and the process will be repeated until the goal state is reached or the episode ends.

Once the algorithm has finished, the optimal policy can be derived from the Q-values.


Algorithm 2: Q-learning algorithm

Initialize α, γ, ε
Initialize Q(s, a) as zero for all states s and actions a
for each episode do
    Initialize state s
    for each iteration do
        if s is not the goal state then
            Select a ∈ A(s) according to a pre-set action selection strategy
            Take action a and observe s′ and r
            Update Q(s, a) according to the corresponding updating rule
            s ← s′
        end
    end
end
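A minimal Python rendering of Algorithm 2 (in its reward-maximizing form) could look as follows. The environment interface used here, with reset(), actions(s), step(s, a) and is_goal(s), is an assumption made for this sketch; the thesis obtains the environment from the discrete-event simulation instead.

```python
import random
from collections import defaultdict

def q_learning(env, alpha, gamma, epsilon, n_episodes, n_iterations):
    """Tabular Q-learning with an epsilon-greedy action selection strategy."""
    Q = defaultdict(float)                  # Q(s, a), initialized to zero
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(n_iterations):
            if env.is_goal(s):
                break
            acts = env.actions(s)
            if random.random() < epsilon:   # explore
                a = random.choice(acts)
            else:                           # exploit the current Q-values
                a = max(acts, key=lambda b: Q[(s, b)])
            s_next, r = env.step(s, a)
            # Bellman-based update rule
            best_next = max((Q[(s_next, b)] for b in env.actions(s_next)),
                            default=0.0)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```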


Chapter 5

Q-Learning

In this chapter, the Q-learning algorithm will be applied to the traffic light intersection with two conflicting flows to find an optimal allocation of the green times. Thus, the second research objective of this thesis will be met. The set-up of the algorithm and the choice of parameters will be discussed in the first two sections. Next, the results of the Q-learning algorithm will be compared to the results of the fixed cycle traffic light fluid queue using a discrete-event simulation. With this, the third research objective of this thesis will be achieved.

5.1 The algorithm

In order to apply the Q-learning algorithm, the objective, state space, action space and its cost need to be specified. The set-up of the algorithm in this thesis will be different from the set-up in the paper by Chanloha et al. (2012), from which the idea of implementing the Q-learning algorithm came. Examples of such differences are that in the paper, the state was the number of vehicles in each direction and that a "cell transmission model" was used to determine the state at each iteration. Such a model will not be used in this thesis. Instead, the environment will come from the discrete-event simulation as described in Section 2.5.

The framework of the algorithm that will be described in this section was created by drawing inspiration from various sources and examples that also used Q-learning. Most of these sources also implemented Q-learning to find an optimal traffic light configuration: (Chanloha et al., 2012; Chin et al., 2012; Chu et al., 2019). A source that had nothing to do with traffic at all was utilized as well: (McCullock, 2012). However, the context and action/state space in which the Q-learning algorithm was implemented in these sources is different from that in this thesis. The set-up of the algorithm in this thesis is as follows:

• The objective is to minimize the total mean waiting time.

• The state space is defined as $S = \{2.5 \cdot n \mid n \in \{0, 1, \dots, 36\}\} = \{0, 2.5, \dots, 87.5, 90\}$. Here, the state $s \in S$ denotes the amount of green time allocated towards the traffic light in direction 1 of the intersection.

• The action space is defined as $A = \{0, 1, 2\}$, where $a \in A$ denotes the action taken by the agent/decision maker, with

– for $a = 0$, no change is made to the allocation of the green times $G_p$,


– for $a = 1$, the following change is made to the allocation of the green time: $G_1 \leftarrow G_1 + 2.5$ and $G_2 \leftarrow G_2 - 2.5$,

– for $a = 2$, the following change is made to the allocation of the green time: $G_1 \leftarrow G_1 - 2.5$ and $G_2 \leftarrow G_2 + 2.5$ (a sketch of this state update is given after this list).

• The cost $C$ is defined as the increase in the average total waiting time during one iteration.
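The effect of the three actions on the pair $(G_1, G_2)$ can be sketched as follows. Keeping the state within $S = \{0, 2.5, \dots, 90\}$ by ignoring moves that would leave it is an assumption of this sketch, as the thesis does not spell out the boundary handling.

```python
STEP = 2.5  # seconds shifted between the two directions per action

def apply_action(g1, g2, a):
    """Return the new green time allocation (G1, G2) after action a."""
    if a == 1 and g1 + STEP <= 90:
        return g1 + STEP, g2 - STEP
    if a == 2 and g1 - STEP >= 0:
        return g1 - STEP, g2 + STEP
    return g1, g2   # a == 0, or the move would leave the state space
```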

One iteration of the algorithm consists of 5 cycles. Again, $c = 100$ and $\mu = 2$. The environment for Q-learning will be provided by the simulation from Section 2.5 and by following Algorithm 2. There will only be one episode, with time-horizon $T = 50\,000\,000$ and $\frac{T}{5c} = 100\,000$ iterations.

The initial amount of green time allocated towards direction 1, the initial state that will be used, is $s_0 = 45$. Thus, the discrete-event simulation starts with $G_1 = G_2 = 45$. At each iteration, the agent of the algorithm will decide on a new allocation of the green time. After the Q-learning algorithm has finished, a path will be taken from the initial state towards the optimal state by following the action corresponding to the minimal Q-value for each encountered state. In other words, for each state
$$\pi(s) = \arg\min_a Q(s, a)$$
will be selected and followed. In the overview below, an example of such a path has been marked with an asterisk for initial state $s = 45$. Here, the optimal amount of green time allocated towards direction 1 found by the algorithm is 37.5 seconds.

s      a = 0      a = 1     a = 2
37.5   −3.045*    −2.509    −3.040
40.0   −0.055      0.484    −3.050*
42.5    0.502      0.614     0.029*
45      1.335     −0.280    −0.768*

It is also possible that the Q-values of two subsequent states contradict each other. In that case, the state with the minimal $Q(s, a = 0)$ is selected. An example of such a situation can be seen in the overview below. Here, $s = 37.5$ and $s = 35$ contradict each other. When looking at $Q(s, a = 0)$ of both of these states, it can be concluded that $\min\{Q(35, 0),\, Q(37.5, 0)\} = \min\{-0.650,\, -0.910\} = -0.910$. This Q-value is marked with an asterisk in the overview.

s      a = 0      a = 1     a = 2
35.0   −0.650     −3.025    −2.102
37.5   −0.910*    −2.509    −3.045
40.0   −0.055      0.484    −3.050
42.5    0.502      0.614     0.029
45      1.335     −0.280    −0.768
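A sketch of this path extraction, including the tie-break on $Q(s, 0)$ for two contradicting subsequent states, is given below. The dictionary layout Q[(s, a)] and the step bound are assumptions of the sketch.

```python
def greedy_path(Q, s0, step=2.5, max_steps=100):
    """Follow pi(s) = argmin_a Q[(s, a)] from initial state s0."""
    s, prev = s0, None
    for _ in range(max_steps):
        a = min((0, 1, 2), key=lambda b: Q[(s, b)])
        if a == 0:                     # no change: s is the optimal state
            return s
        s_next = s + step if a == 1 else s - step
        if s_next == prev:             # two subsequent states contradict
            return min(s, s_next, key=lambda t: Q[(t, 0)])
        s, prev = s_next, s
    return s
```

Applied to the two overviews above, this returns 37.5 in both cases: in the first by following a = 2 three times and then a = 0, and in the second via the tie-break on $Q(s, 0)$.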


5.1.1 Parameters

The difficulty of using a reinforcement learning algorithm such as the Q-learning algorithm is the balance between exploration and exploitation (Tijsma, Drugan, & Wiering, 2017). The choice of the learning rate $\alpha$, the discount factor $\gamma$ and the implemented action selection strategy can influence the outcome of the algorithm. A poor choice can lead to being stuck in a local optimum, oscillating around the optimum, etc. Various combinations of parameters and strategies have been examined to find a balance. In this section, the best combination found will be discussed.

In the algorithm, the actions will be selected according to the following action selection strategy. For the first 100 iterations, to stimulate exploration, the actions will be selected randomly. After that, the actions will be selected according to the ε-greedy policy as described in Section 4.3, with $\varepsilon = 0.15$.

The choice of the learning rate $\alpha$ can be tricky. A too big $\alpha$ could lead to the algorithm continuously oscillating around the optimum. This is due to the fact that $\alpha$ can be seen as the step-size, so a big $\alpha$ corresponds to a big step-size. A big $\alpha$ can be useful at the beginning of the algorithm to take a big leap towards the solution. A too small $\alpha$ could mean that a lot of time and iterations are needed to find the optimum; a small $\alpha$ can be useful at the end of the algorithm to get closer to the optimal solution. Therefore, instead of fixing $\alpha$, it could be advantageous to implement a decay in the learning rate. However, as mentioned in Section 4.3, the learning rate has to decrease under certain conditions to ensure convergence. These conditions are as follows: the learning rate $\alpha(i)$ has to satisfy that 1) $\sum_i \alpha(i) = \infty$ and that 2) $\sum_i \alpha^2(i)$ converges. An example of such an $\alpha(i)$ is $\alpha(i) = \frac{K}{K+i}$. Here, $K$ denotes the total number of iterations and $i$ the current iteration (Even-Dar & Mansour, 2003; Mehta & Kelkar, n.d.).

For the discount factor $\gamma$, a value smaller than 1 is needed for convergence. In this thesis, the focus will mainly be on the current and short-term cost. Therefore, a value of $\gamma = 0.1$ will be implemented.

In summary:

• The action selection strategy will be as follows:
  – for iteration $i \leq 100$, the action will be selected randomly,
  – for iteration $i > 100$, the action will be selected according to the ε-greedy policy with $\varepsilon = 0.15$.

• The learning rate: $\alpha(i) = \frac{K}{K+i}$ with $K = \frac{T}{5c}$.

• The discount factor: $\gamma = 0.1$.

A sketch of this action selection strategy and learning rate schedule is given below.
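This is a minimal sketch of the combination summarized above, with illustrative function names; the thesis does not name these routines.

```python
import random

def select_action(Q, s, actions, i, epsilon=0.15, warmup=100):
    """Random exploration during the first `warmup` iterations,
    epsilon-greedy and cost-minimizing afterwards."""
    if i <= warmup or random.random() < epsilon:
        return random.choice(actions)
    return min(actions, key=lambda a: Q[(s, a)])

def learning_rate(i, K):
    """alpha(i) = K / (K + i); for fixed K this satisfies the two
    convergence conditions from Section 5.1.1."""
    return K / (K + i)
```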

5.2 Q-learning result analysis and comparison

In Table 5.1, the optimal green time allocation results of the Q-learning algorithm and the fixed cycle traffic light fluid queue model can be found for various load types. The label load type will be used to indicate the load of the system. These results, and all the coming results, are for $\lambda_1 = 0.5$ fixed. Thus, a higher value of $\lambda_2$ corresponds to a higher load type. For easier comparison, the results from the table have also been illustrated in Figure 5.1. Here, it can be seen that the difference between the two methods becomes smaller as the load type becomes higher. The difference between the Q-learning algorithm's optimum and the fluid formula's optimum is at most 5 seconds for the load types that have been evaluated.


When taking the theoretical optimal split of the fixed cycle traffic light fluid queue as a benchmark, the Q-learning algorithm seems to perform best for higher arrival rates. That is, the difference between the optimum found by Q-learning and the optimum found through theory is smaller for higher arrival rates, particularly for arrival rates bigger than $0.25\mu$. Looking at how the algorithm learns and works, this makes sense. The algorithm takes a certain action and observes the cost of that action. For higher arrival rates, 5 seconds of extra green time has a bigger influence on the cost, the average waiting time. For lower arrival rates, there comes a point at which an extra 5 seconds makes only a small difference in the average waiting time of the intersection. Therefore, the benefit of adding or subtracting a certain amount of green time becomes small. This is reflected in the solution of the algorithm for smaller arrival rates, in particular arrival rates smaller than $0.25\mu$.

Load type   λ2     gQ1 (Q-learning)   gQ2 (Q-learning)   g∗1 (fluid)   g∗2 (fluid)
1           0.2    67.5               22.5               72.5          17.5
2           0.25   62.5               27.5               67            23
3           0.35   55                 35                 57.2222       32.7778
4           0.4    50                 40                 52.8571       37.1429
5           0.6    37.5               52.5               38.125        51.875
6           0.75   30                 60                 29.2857       60.7143

Table 5.1: Optimal green time allocation for λ1 = 0.5

Figure 5.1: Optimal green time allocation for λ1 = 0.5 illustrated; (a) Direction 1, (b) Direction 2

Table 5.2 and Table 5.3 give the mean queue length and mean waiting time results respectively for the various values of $\lambda_2$. These results were obtained by implementing the optimal split found by each method into the discrete-event simulation for a time-horizon of $T = 50$ million. When rounding these results to the nearest integer, they are the same regardless of the method. This again shows that, regardless of a 5 second difference in the allocation of the green time, the mean waiting time for an arbitrary vehicle in the system is more or less the same.


mean queue length \ λ2     0.2      0.25     0.35     0.4      0.6      0.75
fluid (theory)             10.0833  12.1     15.6852  17.2857  22.6875  25.9286
fluid (simulation)         10.4983  12.5472  16.2064  17.8524  23.4256  27.2824
q-learning (simulation)    10.6162  12.6542  16.2391  17.8894  23.4267  27.1499

Table 5.2: Mean queue length results for λ1 = 0.5

mean waiting time \ λ2     0.2      0.25     0.35     0.4      0.6      0.75
fluid (theory)             14.4048  16.1333  18.4532  19.2063  20.625   20.7429
fluid (simulation)         14.5035  16.2337  18.5710  19.3331  20.7933  21.3269
q-learning (simulation)    14.6654  16.3695  18.6020  19.3796  20.7954  21.2203

Table 5.3: Mean waiting time results for λ1 = 0.5

Since the mean queue length and mean waiting time are more or less the same regardless of the optimization method, it is interesting to look into the results per direction. To this end, Figure 5.2 and Figure 5.3 illustrate the mean queue length and mean waiting time per direction. In these figures, it can be seen that although a 5 second difference in green time does not have a major influence on the overall mean queue length and mean waiting time results, it does have an influence when looking at each direction separately.

Figure 5.2: Mean queue length comparison per direction


Figure 5.3: Mean waiting time comparison per direction

5.3 Discussion

As mentioned before, the difficulty of the Q-learning algorithm is the balance between exploration and exploitation. A poorly selected learning rate, discount factor and/or action selection strategy can lead to the algorithm being stuck in a local optimum. In this thesis, the Q-learning algorithm has only been investigated for certain arrival and departure rates. It was tricky to find a good balance between exploration and exploitation. It has not been investigated whether or not this choice of parameters and action selection strategy will also work for completely different arrival and departure rates. Furthermore, it would be interesting to experiment with the discount factor as well. In this thesis, only current and short-term costs were taken into account. It has not been thoroughly investigated whether or not it would be realistic to take long-term costs into account and what would happen if they were. These are all points for future research. If the algorithm performs well for other arrival and departure rates and/or a different discount factor as well, it could certainly be an optimization method to keep an eye on.

Although it does not optimize to the millisecond, as the theoretical optimization methods do, the Q-learning algorithm can still be beneficial for determining the allocation of the green times. In fact, a huge advantage of the Q-learning algorithm compared to the theoretical optimization methods is that it is model-free. It can therefore be applied to and "learn" from a real traffic light intersection. Furthermore, it is possible to expand and extend the algorithm to include more flows and directions. This would also be an interesting point for future research.

The Q-learning algorithm also offers the possibility to redefine the state and action space. In this thesis, the states were the amount of green time allocated towards direction 1, and only 2.5 seconds could be added to or subtracted from this green time. Consequently, it was not possible for the Q-learning algorithm to find the exact same green time results as the fixed cycle traffic light fluid queue. However, this state definition could be altered to, for example, a 1 second step. Moreover, if more flows were to be added to the intersection, a state of the form $s = (s_1, s_2, \dots, s_n)$ could be used, where $s_p$ denotes the amount of green time allocated in direction $p$. But it is also possible to have a completely different state space. For instance, the state could be the number of vehicles in the queue in direction $p$, as used in (Chanloha et al., 2012), or the state could be the "level" of vehicles in the queue (i.e. no vehicles, a high amount of vehicles, etc.), as used in (Chin et al., 2012). Q-learning has many possibilities.


Chapter 6

Conclusion and recommendation

In this chapter, a conclusion and/or recommendation based on the main research objectives will be given. To reiterate, the research objectives of this thesis are as follows:

1. Determine the optimal split for several queuing and road traffic models.

2. Develop a method in which Q-learning is used to determine the optimal split.

3. Compare both methods using a discrete-event simulation.

Note that the conclusions are drawn within the environment’s parameters, assumptionsand constraints. For a different environment, different conclusions might hold.

For the first research objective, the optimal split has been determined by minimizing the mean queue length and/or mean waiting time as found by queuing theory for the several queuing and road traffic models. For a traffic light intersection with two conflicting flows for which the arrival and departure rates are known or can be derived, this theoretical optimization method would be recommended over Q-learning. Here, the fixed cycle traffic light fluid queue model fits the environment of an intersection with two conflicting flows and traffic lights best. This is due to the fact that the queuing models of the G/G/1 type do not contain a cycle with red and green periods, as already mentioned in the discussion in Section 3.4. Therefore, it is recommended to consider the optimal split derived for the fixed cycle traffic light fluid queue model. An advantage of this model is that a closed-form expression of the optimal split exists for the intersection with two flows.

When comparing this theoretically obtained optimal split with the split found through Q-learning, it can be concluded that the theoretical method performs better. This conclusion is drawn based on the mean queue length and mean waiting time results from the discrete-event simulation. However, the difference between the results of the two methods is only a fraction of a second or vehicle. When rounding to the nearest integer, both methods have an equal performance within the investigated model parameters and constraints. Based on the discrete-event simulation and the rounding, it could be concluded that both methods can be used to optimally configure the traffic lights at an intersection with two conflicting flows.

Nevertheless, most intersections with traffic lights in real life are more complex than the intersection with only two conflicting flows and directions; they usually consist of more traffic lights, lanes and directions. Therefore, for practicality, it would be recommended to use the Q-learning algorithm in that case to allocate the green time as optimally as possible. As opposed to the methods in which the optima are found theoretically, the Q-learning algorithm is model-free. Thus, it is not limited to the traffic light intersection


with two conflicting flows that has been investigated in this thesis. The state and action spaces of the algorithm can be extended and adjusted to fit these environments as well, giving the algorithm great potential for broader use.


References

Adan, I., & Resing, J. (2003). Even geduld a.u.b. Retrieved from https://www.win.tue.nl/bs/masterclass2003/index.html

Adan, I., & Resing, J. (2020). Queueing Theory.

ANWB. (2017). Hoe lang duurt het gele stoplicht? Retrieved from https://www.anwb.nl/experts/juridisch/29/hoe-lang-duurt-het-gele-stoplicht

Bellman, R. (1957). A Markovian Decision Process. Indiana Univ. Math. J., 6(4), 679–684.

Chanda, K. K. (2020). Q-Learning in Python. Retrieved from https://www.geeksforgeeks.org/q-learning-in-python/

Chanloha, P., Usaha, W., Chinrungrueng, J., & Aswakul, C. (2012). Performance comparison between queueing theoretical optimality and q-learning approach for intersection traffic signal control. Proceedings of International Conference on Computational Intelligence, Modelling and Simulation, 172–177. doi: 10.1109/CIMSim.2012.12

Chin, Y. K., Kow, W. Y., Khong, W. L., Tan, M. K., & Teo, K. T. K. (2012). Q-Learning traffic signal optimization within multiple intersections traffic network. Proceedings - UKSim-AMSS 6th European Modelling Symposium, EMS 2012, 343–348. doi: 10.1109/EMS.2012.75

Chu, H. C., Liao, Y. X., Chang, L. H., & Lee, Y. H. (2019). Traffic light cycle configuration of single intersection based on modified Q-Learning. Applied Sciences (Switzerland), 9(21). doi: 10.3390/app9214558

Even-Dar, E., & Mansour, Y. (2003). Learning rates for Q-learning. Journal of Machine Learning Research, 5, 1–25. Retrieved from https://www.jmlr.org/papers/volume5/evendar03a/evendar03a.pdf doi: 10.1007/3-540-44581-1_39

Haijema, R., Hendrix, E. M., & van der Wal, J. (2017). Dynamic control of traffic lights. In International Series in Operations Research and Management Science (Vol. 248, pp. 371–386). doi: 10.1007/978-3-319-47766-4_13

Haijema, R., & Van Der Wal, J. (2008). An MDP decomposition approach for traffic control at isolated signalized intersections. Probability in the Engineering and Informational Sciences, 22(4), 587–602. doi: 10.1017/S026996480800034X

Joo, H., Ahmed, S. H., & Lim, Y. (2020). Traffic signal control for smart cities using reinforcement learning. Computer Communications, 154, 324–330. doi: 10.1016/j.comcom.2020.03.005

Kallenberg, L. (2016). Markov decision processes lecture notes.

Kansal, S., & Martin, B. (2019). Reinforcement Q-Learning from Scratch in Python with OpenAI Gym. Retrieved from https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/

McCullock, J. (2012). Q-learning: Step-by-step tutorial. Retrieved from http://mnemstudio.org/path-finding-q-learning-tutorial.htm

Mehta, V., & Kelkar, R. (n.d.). Reinforcement Learning. Carnegie Mellon School of Computer Science. Retrieved from http://www.cs.cmu.edu/afs/andrew/course/15/381-f08/www/lectures/HandoutModelFreeRL.pdf

Melo, F. S. (2001). Convergence of Q-learning: A simple proof. Institute of Systems and Robotics, Tech. Rep., 1–4.

Núñez-Queija, R. (2009). Operations Research 2 S: Introduction to Markov Decision Theory.

Russell, S. J., & Norvig, P. (2010). Artificial Intelligence: A Modern Approach (Third ed.). Prentice Hall.

Shyalika, C. (2019). A Beginners Guide to Q-Learning. Retrieved from https://towardsdatascience.com/a-beginners-guide-to-q-learning-c3e2a30a653c

Spieksma, F. (2015). Markov decision processes lecture notes.

Sutton, R., & Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press. Retrieved from http://incompleteideas.net/sutton/book/ebook/the-book.html

Tijsma, A. D., Drugan, M. M., & Wiering, M. A. (2017). Comparing exploration strategies for Q-learning in random stochastic mazes. 2016 IEEE Symposium Series on Computational Intelligence, SSCI 2016. Retrieved from https://www.ai.rug.nl/~mwiering/GROUP/ARTICLES/Exploration_QLearning.pdf doi: 10.1109/SSCI.2016.7849366

Appendix A

Theory

This appendix chapter will provide extra figures, information and the full calculations for the theory and the theoretical optimization methods.

A.1 List of symbols

λp        The arrival rate of the vehicles in direction p
µp        The departure rate of the vehicles in direction p
µ         The departure rate of the vehicles
Gp        The amount of green time allocated towards direction p
Rp        The amount of red time allocated towards direction p
x         The amount of time it takes to clear the road; during this time all traffic lights are red
L         The fraction of time in a cycle in which both traffic lights are red
c         The cycle length
wp        The fraction of green time allocated towards direction p (from G/G/1 queues)
w∗p       The optimal fraction of green time allocated towards direction p (from G/G/1 queues)
gp        The amount of green time allocated towards direction p (from the fixed cycle traffic light fluid queue)
g∗p       The optimal amount of green time allocated towards direction p (from the fixed cycle traffic light fluid queue)
gQp       The optimal amount of green time allocated towards direction p (from the Q-learning algorithm)
Tp        The mean waiting time in direction p
T         The mean waiting time of an arbitrary vehicle in the system, or the time-horizon of the simulation
Qp        The mean queue length in direction p
Q         The mean queue length
Q(s, a)   The Q-value of the Q-learning algorithm for state s and action a
α         The learning rate in the Q-learning algorithm
γ         The discount factor in the Q-learning algorithm


A.2 Calculations

This section provides the full computations of some of the calculations from Chapter 2.

M/M/1 queue

The equation that had to be solved is
$$\frac{\partial}{\partial w_1} Q = -\frac{\lambda_1\mu}{(w_1\mu - \lambda_1)^2} + \frac{\lambda_2\mu}{((1-L-w_1)\mu - \lambda_2)^2} = 0.$$
Solving this equation gives
$$\frac{\lambda_1\mu}{(w_1\mu-\lambda_1)^2} = \frac{\lambda_2\mu}{((1-L-w_1)\mu-\lambda_2)^2}$$
$$\implies \frac{\lambda_1}{w_1^2\mu^2 - 2\lambda_1 w_1\mu + \lambda_1^2} = \frac{\lambda_2}{(1-L-w_1)^2\mu^2 - 2\lambda_2(1-L-w_1)\mu + \lambda_2^2}$$
$$\implies \frac{\lambda_1}{w_1^2\mu^2 - 2\lambda_1 w_1\mu + \lambda_1^2} = \frac{\lambda_2}{(1-L)^2\mu^2 - 2(1-L)w_1\mu^2 + w_1^2\mu^2 - 2\lambda_2(1-L)\mu + 2\lambda_2 w_1\mu + \lambda_2^2}$$
$$\implies \frac{\lambda_1}{w_1^2\mu^2 - 2\lambda_1 w_1\mu + \lambda_1^2} = \frac{\lambda_2}{w_1^2\mu^2 + w_1\left[2\lambda_2\mu - 2(1-L)\mu^2\right] + \left[\lambda_2^2 - 2\lambda_2(1-L)\mu + (1-L)^2\mu^2\right]}$$
$$\implies w_1^2(\lambda_1-\lambda_2)\mu^2 + w_1\left[2\lambda_1\lambda_2\mu - 2(1-L)\lambda_1\mu^2 + 2\lambda_1\lambda_2\mu\right] + \left[\lambda_1\lambda_2^2 - 2\lambda_1\lambda_2(1-L)\mu + (1-L)^2\lambda_1\mu^2 - \lambda_1^2\lambda_2\right] = 0$$
$$\implies w_1^2(\lambda_1-\lambda_2)\mu^2 + w_1\left[4\lambda_1\lambda_2\mu - 2(1-L)\lambda_1\mu^2\right] + \left[\lambda_1\lambda_2^2 - 2\lambda_1\lambda_2(1-L)\mu + (1-L)^2\lambda_1\mu^2 - \lambda_1^2\lambda_2\right] = 0.$$
The discriminant of this quadratic equation in $w_1$ is
$$D = \left[4\lambda_1\lambda_2\mu - 2(1-L)\lambda_1\mu^2\right]^2 - 4(\lambda_1-\lambda_2)\mu^2\left[\lambda_1\lambda_2^2 - 2\lambda_1\lambda_2(1-L)\mu + (1-L)^2\lambda_1\mu^2 - \lambda_1^2\lambda_2\right]$$
$$= 16\lambda_1^2\lambda_2^2\mu^2 - 16(1-L)\lambda_1^2\lambda_2\mu^3 + 4(1-L)^2\lambda_1^2\mu^4 - 4\lambda_1^2\lambda_2^2\mu^2 + 8\lambda_1^2\lambda_2(1-L)\mu^3 - 4(1-L)^2\lambda_1^2\mu^4$$
$$\qquad + 4\lambda_1^3\lambda_2\mu^2 + 4\lambda_1\lambda_2^3\mu^2 - 8\lambda_1\lambda_2^2(1-L)\mu^3 + 4(1-L)^2\lambda_1\lambda_2\mu^4 - 4\lambda_1^2\lambda_2^2\mu^2$$
$$= 8\lambda_1^2\lambda_2^2\mu^2 - 8(1-L)\lambda_1^2\lambda_2\mu^3 + 4\lambda_1^3\lambda_2\mu^2 + 4\lambda_1\lambda_2^3\mu^2 - 8\lambda_1\lambda_2^2(1-L)\mu^3 + 4(1-L)^2\lambda_1\lambda_2\mu^4$$
$$= 4\lambda_1\lambda_2\mu^2\left[2\lambda_1\lambda_2 - 2(1-L)\lambda_1\mu + \lambda_1^2 + \lambda_2^2 - 2\lambda_2(1-L)\mu + (1-L)^2\mu^2\right]$$
$$= 4\lambda_1\lambda_2\mu^2\left[\lambda_1 + \lambda_2 - (1-L)\mu\right]^2.$$
Using the abc-formula, the optimal split $w^*_{1,\mathrm{MM1}}$ is obtained:
$$w^*_{1,\mathrm{MM1}} = \frac{-\left[4\lambda_1\lambda_2\mu - 2(1-L)\lambda_1\mu^2\right] \pm \sqrt{4\lambda_1\lambda_2\mu^2\left[\lambda_1+\lambda_2-(1-L)\mu\right]^2}}{2(\lambda_1-\lambda_2)\mu^2}$$
$$= \frac{-\left[4\lambda_1\lambda_2 - 2(1-L)\lambda_1\mu\right] \pm 2\left[\lambda_1+\lambda_2-(1-L)\mu\right]\sqrt{\lambda_1\lambda_2}}{2(\lambda_1-\lambda_2)\mu}$$
$$= \frac{-\frac{2\lambda_1\lambda_2}{\mu} + (1-L)\lambda_1 \pm \left[\frac{(\lambda_1+\lambda_2)\sqrt{\lambda_1\lambda_2}}{\mu} - (1-L)\sqrt{\lambda_1\lambda_2}\right]}{\lambda_1-\lambda_2}$$
$$= \frac{\frac{-2\lambda_1\lambda_2 \pm (\lambda_1+\lambda_2)\sqrt{\lambda_1\lambda_2}}{\mu} + (1-L)\left[\lambda_1 \mp \sqrt{\lambda_1\lambda_2}\right]}{\lambda_1-\lambda_2}.$$
Dividing the numerator and denominator by $\lambda_2$ gives
$$= \frac{\frac{-2\lambda_1 \pm (\lambda_1+\lambda_2)\sqrt{\lambda_1/\lambda_2}}{\mu} + (1-L)\left[\frac{\lambda_1}{\lambda_2} \mp \sqrt{\frac{\lambda_1}{\lambda_2}}\right]}{\frac{\lambda_1}{\lambda_2}-1}.$$
Taking the $+$ branch, factoring $\sqrt{\lambda_1/\lambda_2}-1$ out of the numerator and writing $\frac{\lambda_1}{\lambda_2}-1 = \left[\sqrt{\frac{\lambda_1}{\lambda_2}}-1\right]\left[1+\sqrt{\frac{\lambda_1}{\lambda_2}}\right]$ yields
$$= \frac{\left[\sqrt{\frac{\lambda_1}{\lambda_2}}-1\right]\left[\frac{\lambda_1-\lambda_2\sqrt{\lambda_1/\lambda_2}}{\mu} + (1-L)\sqrt{\frac{\lambda_1}{\lambda_2}}\right]}{\left[\sqrt{\frac{\lambda_1}{\lambda_2}}-1\right]\left[1+\sqrt{\frac{\lambda_1}{\lambda_2}}\right]} = \frac{\sqrt{\frac{\lambda_1}{\lambda_2}}(1-L) + \frac{\lambda_1-\lambda_2\sqrt{\lambda_1/\lambda_2}}{\mu}}{1+\sqrt{\frac{\lambda_1}{\lambda_2}}}$$
$$= \frac{\Lambda(1-L) + \frac{\lambda_1-\lambda_2\Lambda}{\mu}}{1+\Lambda}, \qquad \text{where } \Lambda = \sqrt{\frac{\lambda_1}{\lambda_2}}.$$
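As a quick numerical check of this closed form, a small sketch with an illustrative function name:

```python
import math

def optimal_split_mm1(lam1, lam2, mu, L):
    """Optimal green time fraction w1* for the M/M/1 model."""
    Lam = math.sqrt(lam1 / lam2)
    return (Lam * (1 - L) + (lam1 - lam2 * Lam) / mu) / (1 + Lam)
```

For $\lambda_1 = \lambda_2$ it returns $(1-L)/2$, as expected by symmetry.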

Fixed cycle traffic light fluid queue

The equation that has to be solved is
$$\frac{\partial}{\partial g_1} T = \frac{1}{\lambda_1+\lambda_2}\left(-\frac{2\lambda_1(c-g_1)}{2c(1-\rho_1)} + \frac{2\lambda_2(2x+g_1)}{2c(1-\rho_2)}\right) = 0.$$
This leads to the equation
$$\frac{\lambda_1(c-g_1)}{1-\rho_1} = \frac{\lambda_2(2x+g_1)}{1-\rho_2}.$$
Solving this equation for $g_1$ gives
$$g_1 = -\frac{\lambda_2(g_1+2x)(1-\rho_1)}{\lambda_1(1-\rho_2)} + c$$
$$\implies g_1\left[1 + \frac{\lambda_2(1-\rho_1)}{\lambda_1(1-\rho_2)}\right] = -\frac{2x\lambda_2(1-\rho_1) - c\lambda_1(1-\rho_2)}{\lambda_1(1-\rho_2)}$$
$$\implies g_1\left[\frac{\lambda_1(1-\rho_2) + \lambda_2(1-\rho_1)}{\lambda_1(1-\rho_2)}\right] = -\frac{2x\lambda_2(1-\rho_1) - c\lambda_1(1-\rho_2)}{\lambda_1(1-\rho_2)}$$
$$\implies g_1 = -\frac{2x\lambda_2(1-\rho_1) - c\lambda_1(1-\rho_2)}{\lambda_1(1-\rho_2)} \cdot \frac{\lambda_1(1-\rho_2)}{\lambda_1(1-\rho_2) + \lambda_2(1-\rho_1)}$$
$$= -\frac{2x\lambda_2(1-\rho_1) - c\lambda_1(1-\rho_2)}{\lambda_2(1-\rho_1) + \lambda_1(1-\rho_2)} = \frac{c\lambda_1\left(1-\frac{\lambda_2}{\mu}\right) - 2x\lambda_2\left(1-\frac{\lambda_1}{\mu}\right)}{\lambda_2\left(1-\frac{\lambda_1}{\mu}\right) + \lambda_1\left(1-\frac{\lambda_2}{\mu}\right)}$$
$$= \frac{c\lambda_1(\mu-\lambda_2) - 2x\lambda_2(\mu-\lambda_1)}{\lambda_2(\mu-\lambda_1) + \lambda_1(\mu-\lambda_2)} = \frac{c\lambda_1(\mu-\lambda_2) - 2x\lambda_2(\mu-\lambda_1)}{(\lambda_1+\lambda_2)\mu - 2\lambda_1\lambda_2}.$$
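Evaluating this closed form is straightforward. The sketch below (function name illustrative) reproduces $g^*_1 = 29.2857$ from Table 5.1 for $\lambda_1 = 0.5$, $\lambda_2 = 0.75$, $\mu = 2$, $c = 100$, taking the clearance time $x = 5$ consistent with the simulation's $G_1 + G_2 = 90$ and $c = 100$.

```python
def optimal_split_fluid(lam1, lam2, mu, c, x):
    """Optimal green time g1* for the fixed cycle traffic light fluid queue."""
    num = c * lam1 * (mu - lam2) - 2 * x * lam2 * (mu - lam1)
    den = (lam1 + lam2) * mu - 2 * lam1 * lam2
    return num / den

print(optimal_split_fluid(0.5, 0.75, 2, 100, 5))  # 29.2857..., cf. Table 5.1
```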


A.3 Plots model analysis

In this appendix, extra plots for the mean waiting time, mean queue length and optimal split for the various models can be found. This is supplementary to Chapter 3.

A.3.1 Mean queue length and waiting time

Figure A.1: M/M/1 (Q version); (a) mean total queue length, (b) mean waiting time

Figure A.2: M/M/1 (T version); (a) mean total queue length, (b) mean waiting time

Figure A.3: M/D/1 (Q version); (a) mean total queue length, (b) mean waiting time


Figure A.4: M/D/1 (T version); (a) mean total queue length, (b) mean waiting time

Figure A.5: Mean total queue length D/D/1

Figure A.6: Fixed cycle traffic light fluid queue; (a) mean queue length, (b) mean waiting time


Figure A.7: Simulation; (a) mean queue length, (b) mean waiting time


A.3.2 Optimal split

Figure A.8: Optimal split M/M/1; (a) Q version, (b) T version

Figure A.9: Optimal split M/D/1; (a) Q version, (b) T version

Figure A.10: Optimal split; (a) D/D/1, (b) fixed cycle traffic light fluid


Appendix B

Simulation results

B.1 Plots simulation analysis

This appendix provides extra plots regarding the discrete-event simulation results when implementing either fixed green times or the fixed cycle fluid formula.

Figure B.1: Simulation results mean queue length; (a) using Gp from the fluid queue formula, (b) using Gp = 45

Figure B.2: Simulation results mean waiting time; (a) using Gp from the fluid queue formula, (b) using Gp = 45


Figure B.3: Simulation results using Gp from the fluid formula (stable region only); (a) mean queue length, (b) mean waiting time

Figure B.4: Simulation results comparison; (a) mean queue length comparison for λ1 = 0.3, (b) mean waiting time comparison for λ1 = 0.3

Figure B.5: Simulation results per direction using Gp from the fluid formula; (a) mean queue length for λ1 = 0.3, (b) mean waiting time for λ1 = 0.3


Appendix C

Q-learning results

This appendix provides extra results that can be used to compare the Q-learning algorithm's optima with those of the fixed cycle traffic light fluid model.

Load type   λ2     λ1/λ2   λ2/µ    λ1c/(gQ1µ)   λ2c/(gQ2µ)   λ1c/(g∗1µ)   λ2c/(g∗2µ)
1           0.2    2.5     0.1     0.370        0.444        0.345        0.571
2           0.25   2       0.125   0.4          0.455        0.373        0.543
3           0.35   1.43    0.175   0.455        0.5          0.437        0.534
4           0.4    1.25    0.2     0.5          0.5          0.473        0.538
5           0.6    0.83    0.3     0.667        0.571        0.656        0.578
6           0.75   0.67    0.375   0.833        0.625        0.854        0.618

Table C.1: Various ratio comparison for λ1 = 0.5
