
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2018

A comparison of algorithms used in traffic control systems

ERIK BJÖRCK

FREDRIK OMSTEDT

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


A comparison of algorithms used in traffic control systems

ERIK BJÖRCK & FREDRIK OMSTEDT

Computer Science
Date: June 5, 2018
Supervisor: Jeanette Hellgren Kotaleski
Examiner: Örjan Ekeberg
Swedish title: En jämförelse av algoritmer i trafiksystem
School of Electrical Engineering and Computer Science


Abstract

A challenge in today's society is to handle the large number of vehicles traversing an intersection. Traffic lights are often used to control the traffic flow in these intersections. However, there are inefficiencies, since the algorithms used to control the traffic lights do not perfectly adapt to the traffic situation. The purpose of this paper is to compare three different types of algorithms used in traffic control systems to find out how to minimize vehicle waiting times. A pretimed, a deterministic and a reinforcement learning algorithm were compared with each other. Tests were conducted on a four-way intersection with various traffic demands using the program Simulation of Urban MObility (SUMO). The results showed that the deterministic algorithm performed best for all demands tested. The reinforcement learning algorithm performed better than the pretimed one for low demands, but worse for varied and higher demands. The reasons behind these results are the deterministic algorithm's knowledge about vehicular movement and the negative effect the curse of dimensionality has on the training of the reinforcement learning algorithm. However, more research must be conducted to ensure that the results obtained are reliable in similar and different traffic situations.


Sammanfattning

A challenge in today's society is to handle the large number of vehicles driving through an intersection. Traffic lights are often used to control the traffic flows through these intersections. However, there are inefficiencies, since the algorithms used to control the traffic lights are not perfectly adapted to the traffic situation. The purpose of this report is to compare three types of algorithms used in traffic control systems in order to investigate how the waiting time for vehicles can be minimized. A pretimed, a deterministic and a reinforcement learning algorithm were compared with each other. The tests were carried out on a four-way intersection with different traffic demands using the program Simulation of Urban MObility (SUMO). The results showed that the deterministic algorithm performed best for all traffic demands. The learning algorithm performed better than the pretimed one for low demands, but worse for varying and higher demands. The reasons behind the results are that the deterministic algorithm has knowledge of how vehicles move and that the curse of dimensionality negatively affects the training of the learning algorithm. However, more research is needed to ensure that the results are reliable in similar and different traffic situations.


Contents

1 Introduction
   1.1 Problem statement
   1.2 Scope
   1.3 Purpose
   1.4 Disposition
   1.5 Terminology

2 Background
   2.1 Traffic control systems
   2.2 Reinforcement learning
       2.2.1 The exploration and exploitation trade-off
       2.2.2 Q-learning
   2.3 Curse of dimensionality
   2.4 Related work

3 Method
   3.1 Traffic control system model
   3.2 Algorithms
       3.2.1 Pretimed algorithm
       3.2.2 Deterministic algorithm
       3.2.3 Reinforcement learning algorithm
   3.3 Testing
       3.3.1 Uniform demand
       3.3.2 Varied demand

4 Results
   4.1 Results for uniform demand
       4.1.1 Comparison
   4.2 Results for varied demand
       4.2.1 Comparison

5 Discussion
   5.1 Future work

6 Conclusions

Bibliography


Chapter 1

Introduction

A problem in today's society is the amount of traffic due to the excessive use of cars and similar vehicles for transportation. Intersections contribute to this problem as one or several traffic flows must wait for another. Traffic control systems, in the form of traffic lights, are used to handle these conflicting traffic flows. However, there are inefficiencies with these systems. In the United States alone, people must collectively wait 296 million hours every year, averaging one hour per person, due to bad timing in traffic control systems [5]. Moreover, these congestions are bad for the environment.

Several approaches can be taken to decrease the waiting time for each vehicle, such as increasing road capacity or creating roundabouts. However, these solutions are expensive and in some cases impossible to implement due to various structural and economic reasons. It would therefore be preferable to optimize the algorithms used by the control systems.

Several different solutions to the traffic scheduling problem have been presented, such as [15], [14] and [11]. However, it is hard to know if these are optimal solutions, since different algorithms yield better results for different traffic conditions [8]. This problem is in fact NP-hard, which further explains the difficulty of finding good algorithms [3].

Over the past few years, the use of machine learning in various applications has grown enormously, solving different complex problems in an efficient manner. By analyzing gathered data, these algorithms learn to find solutions without extensive knowledge about the problem at hand. This seems applicable to the traffic control problem, specifically through a type of machine learning called Reinforcement Learning. This is a learning method fit for problems where data about the environment is not known beforehand.

1.1 Problem statement

How does a deterministic algorithm used in traffic control systems compare to a reinforcement learning based algorithm with regard to optimal traffic flow within an intersection? Is either of these viable compared to a trivial solution where the green time is constant for each traffic flow?

1.2 Scope

For the scope of this project, only an isolated traffic intersection will be examined. This is partly because of the increase in complexity when modeling a system of several intersections, but also because the focus lies on comparing algorithms meant to improve traffic light handling rather than those meant to improve the entire traffic flow.

In this thesis it is assumed that perfect sensors are used for gathering data from the vehicles within the intersection. The reason for this is that the algorithms are to be compared, not the control system as a whole.

The different algorithms will be compared based on the waiting time for each vehicle. The average squared waiting time (ASWT), measured in squared seconds, will be used since this penalizes cases where some vehicles have to wait a long time, compared to cases where most vehicles wait an intermediate amount of time.
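For example, two vehicles that each wait 10 seconds give an ASWT of (10² + 10²)/2 = 100, whereas one vehicle waiting 0 seconds and another waiting 20 seconds give (0² + 20²)/2 = 200, even though the total waiting time is the same in both cases.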

1.3 Purpose

Since efficient transportation is vital to many people, it is important to find out which approaches should be taken to ensure that traffic flow is streamlined. The purpose of this thesis is to investigate whether machine learning or deterministic approaches work better for various traffic situations. Moreover, this thesis aims to present the problem of traffic congestion.


1.4 Disposition

The first section introduces the subject to the reader and presents the problem to be examined, as well as the constraints on the project.

The second section gives detailed information about the different concepts regarding traffic control systems and their algorithms, as well as a description of what reinforcement learning is and how it can be used for the problem at hand.

The third section describes and motivates the model used to analyze the various algorithms. It also presents the algorithms tested and the tests that were conducted to gather the results needed to answer the problem statement.

The fourth section presents the results acquired from the tests for the different algorithms. The following section then discusses and analyzes these results. The thesis is then concluded with regard to the problem statement, based on the previous discussion and results.

Finally, the references used in this paper are listed.

1.5 Terminology

In this section, terms used in the report are briefly explained.

• ANN - Artificial Neural Network.

• API - Application Programming Interface.

• ASWT - Average Squared Waiting Time.

• CMAC - Cerebellar Model Articulation Controller.

• COD - Curse of Dimensionality.

• Green time - The time during which the vehicles on a road may traverse the intersection.

• GUI - Graphical User Interface.

• Intersection - A crossing of two or more roads.

• Lane - A section of a road that a vehicle can move in. A road can have one or more lanes.


• Python - A programming language.

• Q-learning - A specific type of reinforcement learning algorithm.

• Q-values - Values used to determine the action in a Q-learning algorithm.

• SUMO - Simulation of Urban MObility.

• TraCI - Traffic Control Interface.


Chapter 2

Background

In this section, necessary information regarding traffic control systems as well as the algorithms examined is presented. Furthermore, work related to this paper is presented.

2.1 Traffic control systems

There are several different types of traffic control systems using traffic lights. Some control the traffic over several intersections, whereas others only control it within one intersection. For the latter type, there are three main systems used today for controlling the traffic lights: pretimed systems, semi actuated systems and fully actuated systems.

Pretimed systems utilize fixed time intervals to control traffic. This means that one phase is active for a fixed amount of time before the control system switches to the next phase and keeps that one active equally long. After each phase has been active, the process is repeated [15].

Semi actuated systems use sensors at the smaller of two crossing roads at an intersection. The larger road has a green light until these sensors indicate that vehicles from the smaller road wish to traverse the intersection. A phase allowing this is then activated until there are no more vehicles on the lesser road or until a maximum fixed time has been reached. After this, the larger road is given a green light again.

Fully actuated systems are similar to semi actuated systems. However, fully actuated systems use sensors for each road at an intersection. The phases are therefore activated in a way that responds to the current traffic situation [11].

Semi and fully actuated systems generally respond better to traffic flows than pretimed ones do. This is because pretimed systems do not analyze the traffic at the intersection and therefore do not adapt to the environment like semi and fully actuated systems do [14] [11].

In this paper, pretimed systems and fully actuated systems will be analyzed.

2.2 Reinforcement learning

Reinforcement learning can in broad terms be explained as the concept of learning a behaviour which maximizes a reward. The learner does not have prior knowledge of which actions to take to achieve this, and must therefore explore which of them are rewarding by trying them. These actions may not only yield immediate reward but also affect future situations and thus future rewards. These are the two main features of reinforcement learning: trial-and-error search and delayed reward.

More specifically, reinforcement learning can be defined as a learning system of six different elements.

The environment is a set of possible states that can be achieved by taking various actions. It describes the problem which the learning system is trying to solve, and it is from the environment that information about the reward is obtained.

The agent is the action taker and interacts with the environment by taking actions based on the current state and various reward values.

The policy describes how the agent takes actions based on states and reward values. It is the core of the learning agent, as it alone determines the behaviour of the system.

The reward signal is a measurement of how well a reinforcement learning system is doing. It is sent to the agent by the environment. The agent's only objective is to maximize the reward.

Rewards only specify the immediate result of an action. Therefore, the value function is used to establish what is good in the long run. The value of a state can be described as the amount of reward an agent can expect to receive over time, starting from that state.

A model of the environment is used in some reinforcement learning systems. It copies the environment to allow assumptions to be made about the environment and its behaviour. Models are used to plan actions by exploring future situations before they actually happen.


Methods for solving reinforcement learning problems that use models are called model-based, whereas methods using trial-and-error are called model-free [10].

These components are used to solve various reinforcement learning problems. An overview of how the components work together can be seen in Figure 2.1.

Figure 2.1: A model showing how an agent interacts with the environment in a reinforcement learning system [1].

In some cases, a punishment is used instead of a reward. This value is meant to show that an action taken was bad. In reinforcement learning systems where punishments are used, the agent's objective is to minimize the punishment [2].

2.2.1 The exploration and exploitation trade-off

The exploration and exploitation trade-off is a challenge regarding how an agent maximizes the reward. The agent must choose rewarding actions it has tried previously to get a high reward. However, to discover these actions it must try actions it has not taken before. The agent must therefore exploit what it has experienced to get rewarded, but also explore new actions to increase its knowledge of rewarding actions.

The problem with these approaches is that neither of them can be chosen exclusively. The agent must both try new actions and iteratively favor those that seem to be the best. Actions must be taken several times for their estimates of expected reward to become reliable, and it is therefore hard to determine the trade-off between exploration and exploitation [10].


One possible solution is the ε-greedy action selection policy. This policy chooses the action with the highest estimated reward the majority of the time. However, with a small probability of 1 − ε, a random action is chosen instead. This action is selected uniformly from the possible actions for the current state. The policy can be modified to gradually increase the ε value, such that random actions are taken less frequently [4].
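As a minimal sketch of this selection step (the function name and data structures are ours, not taken from the thesis code), using the convention above where ε is the probability of exploiting:

    import random

    def epsilon_greedy_action(q_values, state, actions, epsilon):
        # Exploit (highest Q-value) with probability epsilon,
        # explore (uniformly random action) with probability 1 - epsilon.
        if random.random() < epsilon:
            return max(actions, key=lambda a: q_values.get((state, a), 0.0))
        return random.choice(actions)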

2.2.2 Q-learning

Q-learning is a reinforcement learning algorithm. The algorithm estimates the final reward at each state instead of reaching the final state, tracing back and updating the value function. This makes it possible to use Q-learning in an environment with no end state [4].

Q-learning approximates the optimal action-value function independently of whichever policy is followed. This simplifies analysis of the algorithm. Using a policy still has an effect, since it determines which state-action pairs are visited. However, for the algorithm to converge, the only requirement is that all pairs are continuously updated [10]. It can be proven that Q-learning, with sufficient training, converges [4].

The value function Q maps state-action pairs to rewards and can be defined in several ways. One way is using a matrix, where the rows represent each state and the columns represent each action. When the number of states is small, this is a good representation. However, when there are many or even an infinite number of states, the matrix can get large or impossible to use. By discretizing the states into sets or intervals, the value function can be approximated using a matrix.

Another way of approximating the value function is using an artificial neural network (ANN). The input layer is composed of the actions and states, and the output layer represents the reward [12].

The value function Q is updated according to Equation 2.1.

Q(S_t, A_t) ← Q(S_t, A_t) + α · [R_{S_t,A_t} + γ · max_a Q(S_{t+1}, a) − Q(S_t, A_t)]    (2.1)

In the equation, S_t refers to the current state at time t, and A_t refers to the action taken at time t. R_{S_t,A_t} is the reward from taking action A_t when in state S_t. S_{t+1} is the state the agent enters by taking action A_t.

max_a Q(S_{t+1}, a) is the maximum reward attainable in the state S_{t+1}. This is estimated using the value function. This is possible because values may have been stored for these state-action pairs in a previous iteration.

α is the learning rate, a value between 0 and 1. It determines how much the Q-values are updated in each iteration. A higher value indicates that learning can occur quickly.

γ is the discount factor, a value between 0 and 1. It determines how much future rewards are worth compared to immediate ones. A low γ value indicates that immediate rewards are more important than future ones [4].

The Q-learning algorithm can be described in the following way:

1. Initialize the Q-values Q(S, A) arbitrarily for all states S and actions A.

2. For each learning episode, execute steps 3-7.

3. Observe the current state, S_t.

4. Choose an action, A_t, for that state based on an action selection policy, e.g. ε-greedy.

5. Take the action and observe the reward R_{S_t,A_t} and the new state S_{t+1}.

6. Update the Q-value using Equation 2.1.

7. Set the state to the new state S_{t+1}, and repeat steps 3-7 until a terminal state is reached.

As can be seen from the above, the algorithm populates its value function and uses these values to determine the best action to take [4].
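A minimal tabular sketch of these steps follows; the environment interface (env.reset() and env.step() returning the next state, the reward and a termination flag) and all names are illustrative assumptions, not the thesis's actual code:

    import random
    from collections import defaultdict

    def q_learning(env, actions, episodes, alpha=0.8, gamma=0.95, epsilon=0.9):
        # Step 1: Q(S, A) is initialized to 0 for every state-action pair.
        q = defaultdict(float)
        for _ in range(episodes):                      # step 2
            state = env.reset()                        # step 3: observe the current state
            done = False
            while not done:
                # Step 4: epsilon-greedy selection (exploit with probability epsilon).
                if random.random() < epsilon:
                    action = max(actions, key=lambda a: q[(state, a)])
                else:
                    action = random.choice(actions)
                # Step 5: take the action, observe the reward and the new state.
                next_state, reward, done = env.step(action)
                # Step 6: update the Q-value according to Equation 2.1.
                best_next = max(q[(next_state, a)] for a in actions)
                q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
                # Step 7: move to the new state and repeat until a terminal state is reached.
                state = next_state
        return q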

2.3 Curse of dimensionality

Imagine that a coin has been dropped somewhere on a line of length 100 m. Finding this coin will not be that difficult: one can simply start at the beginning of the line and walk along it until the coin is found. Imagine instead that the coin is dropped on a field with a size of 100 · 100 m². Finding the coin in this field is considerably harder than in the previous case. Now extend the problem further by imagining the coin being in a building with the same width and length as the field, but with a height of 100 m. Yet again, finding the coin becomes a much harder problem.


This is an example of the curse of dimensionality (COD). Formally, it is defined as "the problem caused by the exponential increase in volume associated with adding extra dimensions to Euclidean space" [6].

The COD arises in several machine learning problems. This is due to the fact that problems in higher dimensions generally require much more data to be able to keep the same performance as in cases where the number of dimensions is lower [6].

2.4 Related work

Papers presenting new algorithms for traffic control systems usually compare them to pretimed algorithms. Such is the case for [2], the paper on which the reinforcement learning algorithm presented in this paper is based. [2] presents a Q-learning algorithm which is compared to a pretimed algorithm on constant, uniform and variable traffic demands. It shows results on par with the pretimed algorithm for the uniform and constant demands, and results in favor of the Q-learning algorithm for variable demands. This algorithm is presented in more detail in subsection 3.2.3.

Another example of a comparison between a pretimed algorithm and a Q-learning algorithm is found in [12]. Like in [2], a Q-learning algorithm is presented and compared to a pretimed algorithm. This is done on constant and variable traffic demands. This paper differs from [2] in that it explores the usage of both a matrix and an ANN to approximate the value function for the Q-learning algorithm. The ANN variant did not show any positive results, but the matrix variant performed better than the pretimed algorithm in some cases during a variable demand.

[9] is quite similar to this paper in that it compares a Q-learning algorithm and a deterministic algorithm with each other, as well as against two pretimed algorithms. The deterministic algorithm is a longest-queue-first algorithm, which resembles the algorithm used in this paper. More information regarding this paper's deterministic algorithm can be found in subsection 3.2.2. The Q-learning algorithm's value function is approximated using linear regression and with an ANN, and these variants are tested separately.

[9] differs from this paper in that the simulations are done on a network of intersections. Moreover, [9] analyzes both emission rates and delay.

The results from [9] show that both the deterministic algorithm and the Q-learning algorithm perform better than the pretimed algorithms tested. The ANN variant performed slightly better than the better of the two pretimed algorithms, whereas the linear regression variant performed slightly better than the deterministic algorithm.


Chapter 3

Method

3.1 Traffic control system model

Simulation of Urban MObility (SUMO) is a road traffic simulation program used to model various road networks and traffic flows. By implementing a road network, adding vehicles to it and defining rules for how these vehicles traverse the network, a model can be created which resembles a realistic environment [7].

We chose to use SUMO for three reasons. First, SUMO is a widely used tool for creating and analyzing different intersections using traffic lights. The paper on which the deterministic algorithm is based used this program to run its simulations [13]. Second, SUMO is well suited to modeling a realistic traffic setting, which made it easy to use. Finally, SUMO saved us time, since we did not need to implement our own simulation program.

The intersection created was a four-way intersection with two lanes in each direction. For each road entering the intersection, the leftmost lane was used for traffic going left, whereas the rightmost lane was used for traffic going forward or to the right. A traffic light was used to make sure that the traffic flows on the crossing roads did not enter the intersection at the same time. This traffic light had four phases:

• WE-G - the west and east roads were green and the others were red.

• WE-Y - the west and east roads were yellow and the others were red.


• NS-G - the north and south roads were green and the others were red.

• NS-Y - the north and south roads were yellow and the others were red.

For a graphic view of the intersection, see Figure 3.1.

Figure 3.1: The intersection used to analyze the various algorithms, here shown using SUMO's graphical user interface (GUI).

From this model, different values were utilized to be able to get the results. ASWT was calculated by querying for vehicles' total waiting time as soon as they had left the intersection. These values were then stored and the ASWT was calculated as seen in Equation 3.1.

ASWT = (Σ_{i=1}^{N} T_i²) / N    (3.1)


In Equation 3.1, N is the number of vehicles that have traversed the intersection and T_i is the waiting time for the i-th vehicle to traverse the intersection.
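As a small sketch of this calculation (the function name is ours), Equation 3.1 can be computed directly from the stored waiting times:

    def average_squared_waiting_time(waiting_times):
        # Equation 3.1: the mean of the squared waiting times, in squared seconds.
        # waiting_times holds one total waiting time per vehicle that has left the intersection.
        if not waiting_times:
            return 0.0
        return sum(t ** 2 for t in waiting_times) / len(waiting_times)

For instance, waiting times of 2, 4 and 30 seconds give (4 + 16 + 900) / 3 ≈ 306.67.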

The algorithms tested also utilized values such as the number of vehicles on a given lane as well as vehicle positions.

These values were retrieved using SUMO's Traffic Control Interface (TraCI). TraCI is used to start the SUMO simulation and control its simulation rate. Furthermore, it contains an application programming interface (API) for gathering values or setting states within the simulation [7]. This can be done using several different programming languages, but for this paper Python was used. More specifically, scripts were written that retrieved the values needed and that contained the various traffic control algorithms. All code used, including the SUMO model files, can be found at https://xaril.github.io/traffic-control-system-comparison/.
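A minimal sketch of such a script is shown below. The calls are from TraCI's Python API as we understand it, but the configuration file name and lane ID are placeholders and the structure is simplified compared to the actual scripts.

    import traci

    # Launch SUMO (headless) with a configuration file; the file name is a placeholder.
    traci.start(["sumo", "-c", "intersection.sumocfg"])

    last_wait = {}       # most recently observed accumulated waiting time per vehicle
    finished_waits = []  # final waiting times of vehicles that have left the simulation

    while traci.simulation.getMinExpectedNumber() > 0:
        traci.simulationStep()  # advance the simulation by one step

        # Record the accumulated waiting time of every vehicle currently in the network.
        for veh_id in traci.vehicle.getIDList():
            last_wait[veh_id] = traci.vehicle.getAccumulatedWaitingTime(veh_id)

        # Keep the final waiting time of vehicles that just left, for the ASWT calculation.
        for veh_id in traci.simulation.getArrivedIDList():
            finished_waits.append(last_wait.pop(veh_id, 0.0))

        # Queue lengths per incoming lane (the lane ID is a placeholder) can feed the
        # deterministic and learning algorithms here.
        queue_north = traci.lane.getLastStepHaltingNumber("north_in_0")

    traci.close()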

3.2 Algorithms

3.2.1 Pretimed algorithm

Three different pretimed algorithms were tested. They work like the pretimed algorithms mentioned in chapter 2 in that one phase is active for a fixed time before the system switches to another phase.

The three algorithms only differ in green time. They have a green time of 20 seconds, 30 seconds and 40 seconds, respectively. These green times were chosen since they both resemble reality and are similar to related work. The yellow time is 3 seconds and each algorithm follows a WE-G -> WE-Y -> NS-G -> NS-Y phase cycle.
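A sketch of such a fixed cycle (the function name and interface are ours) simply maps the elapsed simulation time to a phase:

    def pretimed_phase(elapsed_seconds, green_time, yellow_time=3):
        # The fixed WE-G -> WE-Y -> NS-G -> NS-Y cycle used by the pretimed algorithms.
        cycle = [("WE-G", green_time), ("WE-Y", yellow_time),
                 ("NS-G", green_time), ("NS-Y", yellow_time)]
        t = elapsed_seconds % sum(duration for _, duration in cycle)
        for phase, duration in cycle:
            if t < duration:
                return phase
            t -= duration

For example, with a green time of 30 seconds, 35 seconds into the cycle falls in the NS-G phase, since WE-G covers 0-30 s and WE-Y covers 30-33 s.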

The reason for testing three pretimed algorithms instead of one stems from the problem of determining how well a pretimed algorithm performs based on just one fixed value. A good result could have occurred because that green time happened to be optimal for just that test case or demand. By exploring several values, a more general result for this kind of algorithm is obtained.

3.2.2 Deterministic algorithm

The deterministic algorithm used in this paper is based on the algorithm presented in [13]. It utilizes queue length, vehicle speed, vehicle acceleration and vehicle distance from the intersection to determine which road should be given a green light. More specifically, it specifies an order in which the different green phases should be executed and for how long. These phases are then executed and the process is repeated. The order of phases is based on the queue length of each road entering the intersection, where a larger queue gets higher priority than a smaller one. Furthermore, the algorithm utilizes the vehicle speed and distance to the intersection for each queue active in the phase to determine the queue's traversal time. The largest traversal time is then chosen as the duration of the phase, as long as it is not larger than a fixed maximum green time, set to 60 seconds. The yellow time is 3 seconds.

This algorithm was chosen because it relies on sensor values to determine the traffic situation instead of using assumptions. This follows the scope of this paper, as seen in section 1.2, in which perfect knowledge of the vehicles' information was assumed. Moreover, it uses only the sensor values to determine the order of phases, making it a completely deterministic algorithm. This makes for an interesting comparison, as a learning based algorithm utilizes randomness, meaning that it is non-deterministic.

The version implemented in this paper differs somewhat from the algorithm presented in [13]. For example, the algorithm in [13] uses a parameter called a "ready zone" to determine if a vehicle is close enough to the intersection to be able to traverse it in the maximum green time. In this paper, the entire traffic model is regarded as a "ready zone". Furthermore, the intersection in [13] is more complex with regard to the number of phases and lanes. This is reflected in the algorithm in that there are several phase options to begin with after determining the largest queue. In our case, there is only one phase possible after the largest queue has been determined. Finally, the approximation of the traversal time for each queue is calculated in a different way. Our algorithm calculates this by determining the time needed to traverse the intersection for the last vehicle in each lane's queue. This time is composed of three terms:

1. The startup time needed for each vehicle in the lane’s queue.

2. The time needed for the last vehicle to move from its position to the center of the intersection.

3. If the last vehicle is going left, the time needed for the cars going forward in the opposite direction to traverse the crossing part of the intersection.

These times were calculated according to Equation 3.2, Equation 3.3 and Equation 3.4, respectively. The calculation for the entire traversal time is found in Equation 3.5.

T_start = 0.5 · N_lane    (3.2)

In Equation 3.2, N_lane is the number of vehicles on the lane for which the time is calculated.

T_move = −V_last_vehicle / A_last_vehicle + √((V_last_vehicle / A_last_vehicle)² + 2 · D_last_vehicle / A_last_vehicle)    (3.3)

In Equation 3.3, V_last_vehicle is the velocity and A_last_vehicle is the acceleration of the last vehicle in the lane's queue. D_last_vehicle is the vehicle's distance to the center of the intersection. This equation is derived by applying the quadratic formula to the equation of motion for constant acceleration. Since time cannot be negative in this case, the positive root is used.

T_opposite = 0.1 · N_opposite_lane    (3.4)

In Equation 3.4, N_opposite_lane is the number of vehicles from the opposite road that the last vehicle must wait for. This is 0 if the vehicle is not going left.

T = T_start + T_move + T_opposite    (3.5)

Unlike our algorithm, the algorithm in [13] does not utilize acceleration, and does not take into consideration the time needed to wait for vehicles from an opposite direction. Other than that, the calculations are similar.
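A small sketch combining Equations 3.2-3.5 (the parameter names are ours; the velocity, acceleration and distance refer to the last vehicle in the lane's queue):

    import math

    def traversal_time(n_lane, v_last, a_last, d_last, n_opposite, going_left):
        t_start = 0.5 * n_lane                                  # Equation 3.2
        t_move = -v_last / a_last + math.sqrt(                  # Equation 3.3
            (v_last / a_last) ** 2 + 2 * d_last / a_last)
        t_opposite = 0.1 * n_opposite if going_left else 0.0    # Equation 3.4
        return t_start + t_move + t_opposite                    # Equation 3.5

The phase duration is then the largest of these traversal times over the lanes active in the phase, capped at the 60 second maximum green time.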

3.2.3 Reinforcement learning algorithm

The reinforcement learning algorithm used in this paper is based on the algorithm presented in [2]. The algorithm uses Q-learning, which is described in section 2.2. For this algorithm, the states are the combinations of the number of vehicles in each queue, together with the elapsed phase time. There are two actions: switching to another phase or continuing with the current phase. The reward, which in this case is a penalty, is the total delay between two decision points in the algorithm. This is a penalty because the algorithm seeks to minimize the delay. The algorithm also declares a minimum green time of 20 seconds: ten seconds at the start of a phase as well as ten seconds at the end of one.

This algorithm was chosen because both the actions and the states are perceived as realistic components when handling traffic. Moreover, the paper was clear and concise, which made it easy for us to implement a version of the algorithm presented in [2].

The main difference between our algorithm and the one described above is the way Q-values are stored. [2] utilizes a Cerebellar Model Articulation Controller (CMAC), a concept similar to ANNs. Our algorithm uses a lookup table which divides the many state values into intervals. This reduces the size of the lookup table, which in turn reduces computation time and counteracts the effects of the COD. Using a table instead of an ANN was chosen for several reasons. [12] shows better results for a Q-learning algorithm used in a traffic control system with a lookup table than with an ANN. Moreover, an extra level of complexity is added if an ANN must be trained to contain and correctly calculate the values. This can be time consuming, which is not ideal when working with a traffic control system that makes decisions each second.

Both the algorithm in [2] and the one in this paper follow similar exploration policies. An ε-greedy approach is used, where the ε value is gradually increased for each iteration in each simulation until it reaches 0.9. The algorithms differ in that the one in this paper only increases the value when training the agent, not when performing the analyzing tests. During those, ε = 0.9 for each time step. This is because the agent has already been trained, meaning that exploration should not be performed as often. Moreover, the algorithm in [2] only allows the ε value to reach 0.9 if the current state has been visited several times. This is not the case for the algorithm used in this paper.

The following parameters were used in the algorithm: γ = 0.95, α = 0.8. The queue lengths were divided into intervals of length 4, from 0 to 20 vehicles. The elapsed phase time was divided into intervals of length 10, from 0 to 60 seconds. The delay, used as the punishment, was the ASWT between two adjacent decision points.
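A minimal sketch of this discretization (the function name is ours, and the exact interval boundaries are our reading of the description):

    def discretize_state(queue_lengths, elapsed_phase_time):
        # Queue lengths in intervals of 4 (capped at 20 vehicles), elapsed phase
        # time in intervals of 10 (capped at 60 seconds).
        queue_bins = tuple(min(q, 20) // 4 for q in queue_lengths)
        time_bin = min(elapsed_phase_time, 60) // 10
        return queue_bins + (time_bin,)

For example, queues of (3, 17, 0, 9) vehicles and 25 seconds of elapsed phase time map to the state (0, 4, 0, 2, 2).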


3.3 Testing

Two sets of tests were performed, one where the traffic demand entering the intersection was uniformly distributed for each road during the entire simulation, and one where it varied. The vehicles generated for both these sets were one of four different types:

• Short car, with a length of 3 meters.

• Medium car, with a length of 4 meters.

• Long car, with a length of 5 meters.

• Truck, with a length of 10 meters.

Each vehicle type had the same probability of being generated at any given time, and each vehicle type had the same acceleration, deceleration and maximum speed, at 2 m/s², 4.5 m/s² and 50 km/h, respectively. Moreover, each vehicle generated had the same probability of going to the left, forward or to the right in the intersection.
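A small sketch of these generation rules, independent of how the vehicles are actually fed into SUMO (the names and dictionary are ours):

    import random

    VEHICLE_LENGTHS = {"short": 3, "medium": 4, "long": 5, "truck": 10}  # metres
    DIRECTIONS = ["left", "forward", "right"]

    def maybe_generate_vehicle(demand, rng=random):
        # With probability `demand` (e.g. 0.05 for the 5% case), one vehicle enters
        # a given road this second; its type and direction are chosen uniformly.
        if rng.random() >= demand:
            return None
        return {"type": rng.choice(list(VEHICLE_LENGTHS)),
                "direction": rng.choice(DIRECTIONS)}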

3.3.1 Uniform demand

Four different demands were tested: a 5% chance per second per road for a vehicle to enter, 10%, 15% and 20%. These demands were chosen because they represent situations where traffic flows freely, where congestion occurs, and cases in between. One hour and ten minutes were simulated for each test case, but the first ten minutes were not analyzed, to allow vehicles to enter the system. Each of the five algorithms ran on every test case, and 50 test cases were generated randomly and run for each demand.

Before running any tests, the Q-learning algorithm had to be trained. It trained on 300 test cases for each demand, for a total of 1200 cases. During these 1200 cases, the algorithm utilized a gradual increase of the ε value. However, in the analyzed test cases it was fixed to 0.9.

3.3.2 Varied demand

For the varied demand, simulations two hours and ten minutes long were generated. For the first ten minutes, a fixed 5% demand was set to allow vehicles to enter the simulation, and these minutes were not analyzed. After this, the demand followed a second-degree polynomial curve from 5% to 20% and then back to 5% again, with 20% being reached after one hour and ten minutes. 50 test cases were generated randomly and run for each of the five algorithms.
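The exact polynomial is not stated, but a sketch consistent with the description (fixed at 5% during the 600 second warm-up, peaking at 20% at t = 4200 s and back at 5% at t = 7800 s) is:

    def varied_demand(t):
        # Per-second entry probability per road under the varied demand; the
        # coefficients below are an assumed reconstruction, not taken from the thesis.
        if t < 600:
            return 0.05
        return 0.20 - 0.15 * ((t - 4200.0) / 3600.0) ** 2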

Like with the uniform demands, the Q-learning algorithm was trained on the varied demand as well. 1200 test cases were generated, which the algorithm trained on. The same settings used for the ε value in the uniform demand tests were used here as well.


Chapter 4

Results

The results found in this section are meant to answer the question of how well a deterministic algorithm performs compared to a learning algorithm with regard to optimal traffic flow within an intersection. They are also used to determine whether either of these algorithms is viable compared to a trivial pretimed solution.

These results were gathered by using the simulation program SUMO to simulate traffic with certain demands and over certain time intervals. More information about SUMO can be found in chapter 3.

4.1 Results for uniform demand

The results for the various uniform demands are presented in Figure 4.1 and Table 4.1. In the figure and table, "Trivial X" refers to the pretimed algorithm with a green time of X seconds. The values in the table have a precision of two decimals.

Demand        Trivial 20   Trivial 30   Trivial 40   Deterministic   Learning
5% demand       121.00       232.00       382.55         45.99         163.67
10% demand      153.24       276.59       446.23         85.48         296.22
15% demand      365.58       598.40       692.29        230.08         519.60
20% demand     4741.19      5353.42      5733.31       4332.14        5850.29

Table 4.1: The ASWT for the four uniform demands tested.

Figure 4.1: The ASWT for the four uniform demands tested.

Not much can be said about these results alone, as they only present how each algorithm performed on its own. It is however clear from these results that the ASWT grows with increased demand. Furthermore, it does not grow linearly. This is especially obvious for the 20% demand, as the values for each algorithm are a lot larger than for the other demands.

4.1.1 Comparison

Since this paper seeks to determine how the algorithms tested compare to each other, the ASWT ratios between the algorithms have also been calculated. These results can be found in Figure 4.2 and Table 4.2. The results use the mean ASWT of the three pretimed results for each demand to yield a single result for pretimed algorithms. This is because the three algorithms were used to get a result for a pretimed algorithm in general, which is what we want to compare. The data in Table 4.2 has a precision of three decimals.

Figure 4.2: The ASWT ratio for the four uniform demands tested.

These results tell more about how well the algorithms performed in comparison to one another. As is shown in Figure 4.2 and Table 4.2, the learning algorithm performed better than the pretimed one for lower demands, but performed on par with or worse than it for higher demands. The deterministic algorithm performed better than the pretimed algorithm, especially during lower demands. Finally, it is shown that the deterministic algorithm performed better than the learning algorithm for all demands tested.

Demand        Deterministic/Trivial   Learning/Trivial   Deterministic/Learning
5% demand            0.188                 0.668                 0.281
10% demand           0.293                 0.706                 0.414
15% demand           0.441                 0.995                 0.443
20% demand           0.821                 1.109                 0.741

Table 4.2: The ASWT ratio for the four uniform demands tested.

4.2 Results for varied demand

The results for each algorithm run on the varied demand can be found in Table 4.3. Each value in the table has a precision of two decimals.

        Trivial 20   Trivial 30   Trivial 40   Deterministic   Learning
ASWT      2103.65      2444.88      2828.33       1846.03       3100.22

Table 4.3: The ASWT for the varied demand tested.

Like with the uniform results, Table 4.3 does not present much information on how the algorithms compare to each other.

4.2.1 Comparison

The ratio between the algorithms can be found in Table 4.4. Like with the uniform comparison, the mean ASWT of the three pretimed algorithms was used. The values in Table 4.4 have a precision of three decimals.

Deterministic/Trivial   Learning/Trivial   Deterministic/Learning
        0.751                 1.261                 0.595

Table 4.4: The ASWT ratio for the varied demand tested.

The results in Table 4.4 show more about the performance of the algorithms. It is shown that the deterministic algorithm performs better than both the pretimed algorithms and the learning algorithm. Moreover, it is shown that the learning algorithm performs worse than the pretimed ones.


Chapter 5

Discussion

As can be seen in chapter 4, the deterministic algorithm outperformed both the pretimed and the learning algorithm in every scenario tested, with regard to ASWT. Moreover, it was shown that the learning algorithm did not perform well in a situation with varied demand or high demands in general. In this chapter, these results will be analyzed to determine the reasons behind them.

The main possible reason as to why the deterministic algorithm performed better than the others is that, unlike the other algorithms, the deterministic algorithm has knowledge about how traffic control systems work. More specifically, it knows about physics models for vehicle movement. This enables it to effectively approximate the time needed for vehicles to traverse the intersection, which is then used to set phase times. This makes it possible for the deterministic algorithm to better adapt to the different demands compared to the other two algorithms. The learning algorithm only relies on previous experience, which may not always yield perfect results in a similar environment. Moreover, the pretimed algorithm does not adapt to the traffic at all.

It would not be possible for the deterministic algorithm to utilize its knowledge about traffic control systems without the many types of information gathered from the sensors. This information is vital for the algorithm to correctly approximate phase times. The deterministic algorithm gathers more types of data about the system than any of the other algorithms. However, in a real world situation some of this data is hard to collect, such as the velocity of vehicles. While the deterministic algorithm performs best when perfect sensors are assumed, it is perhaps not the most realistic algorithm.

The results show that the pretimed algorithm performed worse than both the deterministic and the learning algorithm when used on the lower demands, 5% and 10%. This is probably due to the fact that the green time was too long. When the vehicles in a queue have traversed the intersection, the phase stays active until the green time is up. This results in vehicles on the other roads not being able to traverse the intersection even though the active roads are empty. The inability to adapt to the traffic situation results in the ASWT growing larger for each unused second. Better results could have been obtained for the pretimed algorithm if lower green times were used.

For the higher demands, 15% and 20%, the deterministic algorithm was not as superior to the pretimed algorithm as for the lower demands. Moreover, the learning algorithm performed worse than the pretimed one. This may be because the increase of traffic minimizes the number of unused seconds during a phase's green time. For the test cases in this paper, when there is a green time there will almost always be a vehicle traversing the intersection, since the amount of vehicles is uniformly distributed over each road entering the intersection. If there was a case that had a road that was more congested than the others, but with the same total demand, the ASWT would be larger, since unused seconds of the green time would occur. This would not be the case for the deterministic algorithm, since it adapts to the traffic situation. However, in a uniformly distributed traffic system with a high demand, the intersection will almost always be full of vehicles. This results in each phase getting the same green time. The deterministic algorithm therefore works like a pretimed one for higher demands. The difference, and the reason why the deterministic algorithm performs better than the pretimed one, is that it finds the optimal green time value, whereas the pretimed algorithm needs to be manually set to the optimal value. This human error is the cause of the pretimed algorithm's worse performance.

The results were varied for the learning algorithm. This was surprising, as the related work showed promising results for a Q-learning algorithm in a traffic control system [2] [12] [9]. There is however one difference between this paper and the related work: the demand. Most of the other studies used a demand that was lower than the ones used in our tests. As our results show, the algorithm performed better for lower demands and could therefore have performed even better when used on the demands presented in the other papers. Furthermore, some of the reports tested their algorithms on demands that were not uniform. For instance, [12] had one road with twice as high demand as the other road. When testing varied demands, only one road's demand was increased and decreased. As was mentioned in the previous paragraph, non-uniform demand can result in higher ASWT for pretimed systems. Therefore it is expected that the learning algorithm could perform better in these cases.

There are other factors than the demand that could have affected the results for the learning algorithm. One of these is the COD. As was mentioned in subsection 3.2.3, the lookup table used to store Q-values was used to counteract the COD, since it reduced the state dimensions. However, this might have been implemented in a way that either did not counteract it efficiently, or caused the states to unrealistically represent the traffic flow.

In the former case, this means that a lot more training would have been needed for the algorithm to correctly approximate the Q-values. The results in [12] show that the usability of a Q-matrix might have been too specific to their implementation, meaning that it was not a viable option in our case. Using an ANN could have prevented this, as it more dynamically sets boundaries for the states.

The unrealism mentioned in the latter case is troublesome, as there is no clear solution that also counteracts the COD. For instance, increasing the number of intervals used for each state variable would more realistically reflect the traffic situation, but would also increase the dimensionality. This may be a problem for our implementation of the Q-learning algorithm, as the intervals were chosen linearly. This is not necessarily the best partitioning for the various traffic demands.

The effects of the COD are shown in the results. Since the Q-learning algorithm relies on queue lengths for the states, it is vital that various lengths have been trained on several times. For lower demands, only a few of the state intervals are used, since queue lengths remain small. Because of this, these Q-values are trained and updated often. Since these are the only values needed by the algorithm for the demand, this yields a good result. For higher demands, all state intervals are used, which means that Q-values will not be trained and updated as often. This is because there are more possible queue lengths, meaning that a certain state interval occurs less frequently. This results in the states being less reliable, which in turn may yield a worse result. The results from the varied demand further showcase this. The varied demand more frequently generates instances of all possible queue lengths, compared to higher demands which seldom generate shorter queues and more often cause congestion. This explains why the learning algorithm performed worst on the varied demand.

The ε-greedy policy used in the learning algorithm may also have affected the results. As can be seen in the results, the pretimed algorithm with a green time of 20 seconds performed best of all the pretimed algorithms, especially for high demands. Since the learning algorithm has a minimum green time of 20 seconds, one could assume that it would perform close to the pretimed one. The results do not show this, and one of the factors causing this could be that the policy chooses actions randomly 10% of the time to ensure exploration. For these actions, the result is worse.

Aside from what has previously been mentioned, there are several factors which may have caused erroneous results. For instance, all the various parameters used within the algorithms, such as the maximum green time, the yellow time or the learning parameters, were fixed values that were not tested on their own. These were taken from the various papers that presented the algorithms. By tweaking these values, other results could have been obtained. Moreover, there is a risk of human error when implementing the various algorithms and the traffic simulation. Minor bugs or miscalculations could have caused erroneous results. Finally, the number of tests performed may have been insufficient to correctly present valuable results.

5.1 Future work

The tests in this paper were limited. To further compare the algorithms, tests on different types of intersections and with different demands should be run. For instance, three-way crossings and non-uniformly distributed traffic demand are realistic components of a traffic system. More rewarding results would be obtained if this was tested. Furthermore, the possible improvements mentioned in this chapter should be tested. This is left as future research.


Chapter 6

Conclusions

From the discussion of the results, two conclusions can be drawn. Firstly, the deterministic algorithm performs better than the learning algorithm with regard to average squared waiting time. Furthermore, it is a viable option compared to a trivial pretimed algorithm. This is because the deterministic algorithm has knowledge about the traffic system. Secondly, the learning algorithm does not perform well for high traffic demands, due to the curse of dimensionality making it hard to train efficiently. However, these conclusions rely on the assumption that perfect sensors are used.


Bibliography

[1] URL: https://i.stack.imgur.com/eoeSq.png (visited on 04/17/2018).

[2] Baher Abdulhai, Rob Pringle, and Grigoris J Karakoulas. "Reinforcement learning for true adaptive traffic signal control". In: Journal of Transportation Engineering 129.3 (2003), pp. 278–285.

[3] Shiuan-Wen Chen, Chang-Biau Yang, and Yung-Hsing Peng. "Algorithms for the traffic light setting problem on the graph model". In: Taiwanese Association for Artificial Intelligence (2007), pp. 1–8.

[4] T. Eden, A. Knittel, and R. van Uffelen. Reinforcement learning. URL: http://www.cse.unsw.edu.au/~cs9417ml/RL1/tdlearning.html (visited on 04/18/2018).

[5] John Halkias and Michael Schauer. "Red light, green light". In: Public Roads 68.3 (2004).

[6] Eamonn Keogh and Abdullah Mueen. "Curse of dimensionality". In: Encyclopedia of Machine Learning and Data Mining. Springer, 2017, pp. 314–315.

[7] Daniel Krajzewicz et al. "Recent Development and Applications of SUMO - Simulation of Urban MObility". In: International Journal On Advances in Systems and Measurements 5.3&4 (Dec. 2012), pp. 128–138.

[8] Fredrik Omstedt. "Fordonstrafik - Trafikljusförbättringar för att minska väntetider". Unpublished paper. Dec. 2015. URL: https://github.com/Xaril/iskriv-report/blob/master/fordonstrafik-trafikljusforbattringar-att.pdf.

[9] Matt Stevens and Christopher Yeh. Reinforcement Learning for Traffic Optimization. 2016. URL: http://cs229.stanford.edu/proj2016spr/report/047.pdf (visited on 04/18/2018).


[10] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. Vol. 1. MIT Press, Cambridge, 1998.

[11] Francesco Viti and Henk J. van Zuylen. "A probabilistic model for traffic at actuated control signals". In: Transportation Research Part C: Emerging Technologies 18.3 (2010). 11th IFAC Symposium: The Role of Control, pp. 299–310. ISSN: 0968-090X. DOI: https://doi.org/10.1016/j.trc.2009.05.003. URL: http://www.sciencedirect.com/science/article/pii/S0968090X09000618.

[12] Kaidi Yang, Isabelle Tan, and Monica Menendez. "A reinforcement learning based traffic signal control algorithm in a connected vehicle environment". In: 17th Swiss Transport Research Conference (STRC 2017). TRANSP-OR, Transport and Mobility Laboratory, EPF Lausanne. 2017.

[13] Maram Bani Younes and Azzedine Boukerche. "An intelligent traffic light scheduling algorithm through VANETs". In: Local Computer Networks Workshops (LCN Workshops), 2014 IEEE 39th Conference on. IEEE. 2014, pp. 637–642.

[14] G. Zhang and Y. Wang. "Optimizing Minimum and Maximum Green Time Settings for Traffic Actuated Control at Isolated Intersections". In: IEEE Transactions on Intelligent Transportation Systems 12.1 (Mar. 2011), pp. 164–173. ISSN: 1524-9050. DOI: 10.1109/TITS.2010.2070795.

[15] X. Zheng and L. Chu. "Optimal Parameter Settings for Adaptive Traffic-Actuated Signal Control". In: 2008 11th International IEEE Conference on Intelligent Transportation Systems. Oct. 2008, pp. 105–110. DOI: 10.1109/ITSC.2008.4732676.


www.kth.se