

Reinforcement learning based traffic

optimization at an intersection with GLOSA

Master Thesis

Submitted in Fulfillment of the

Requirements for the Academic Degree

M.Sc.

Dept. of Computer Science

Chair of Computer Engineering

Submitted by: Rajitha Jayasinghe

Student ID: 456470

Date: 07.01.2019

Supervising tutor: Prof. Dr. W. Hardt

Prof. Dr. Uranchimeg Tudevdagva

Dr. Leonhard Lücken, DLR - Berlin


Abstract

Traffic flow optimization at an intersection helps to maintain a smooth urban traffic flow; it can reduce travel time and emissions. New algorithms are regularly introduced to control approaching vehicles and traffic light phases, and the combination of reinforcement learning and traffic optimization has recently attracted the research community. This thesis proposes a methodology to reduce the travel time and emissions of vehicles for a specific intersection design. The author presents a solution that takes into account the driving route of a vehicle approaching the intersection. Using reinforcement learning and route information, this research proposes a vehicle ordering mechanism to improve the throughput of the intersection. Before presenting the solution, the author gives a thorough review of previous studies: the literature review explains various reinforcement learning algorithms and how they have been applied to traffic optimization. Further, the author uses GLOSA as a baseline to evaluate the new solution, and several GLOSA variations are discussed in this report. A new approach, which can be seen as an extension of the existing GLOSA algorithms, is described in the concept chapter. A deep Q network approach and a rule-based policy are introduced as the solution. The proposed solution was implemented and evaluated, and the author achieved promising results with the rule-based policy approach. Finally, the issues related to both approaches are discussed in detail, and suggestions are given to further improve the proposed solutions.

Keywords: Traffic management, junction optimization, GLOSA, Reinforcement learning


Content

Abstract ....................................................................................................................... 1

Content ........................................................................................................................ 2

Acknowledgement ....................................................................................................... 6

List of Figures .............................................................................................................. 7

List of Tables ............................................................................................................. 10

List of Abbreviations .................................................................................................. 11

1 Introduction ........................................................................................................ 12

1.1 Motivation ..................................................................................................... 12

1.2 Problem Statement....................................................................................... 13

1.3 Proposed Solution and Approach ................................................................. 14

1.4 Objectives ................................................................................................... 15

1.5 Thesis structure ............................................................................................ 15

1.6 Summary ...................................................................................................... 16

2 Fundamentals of Reinforcement Learning ......................................................... 17

2.1 Introduction .................................................................................................. 17

2.2 What is Reinforcement Learning .................................................................. 17

2.3 Comparison with Supervised and Unsupervised learning ............................ 17

2.4 Exploration and exploitation ......................................................................... 18

2.5 Markov Decision Process ............................................................................. 18

2.5.1 Definitions .............................................................................................. 19

2.6 Components of Reinforcement learning ....................................................... 19

2.7 Reinforcement learning algorithms ............................................................... 21

2.7.1 Q learning .............................................................................................. 21

2.7.2 Deep Q Network .................................................................................... 22

2.7.3 Rule-based policies ............................................................................... 23

2.8 Summary ...................................................................................................... 23

3 State of the Art ................................................................................................... 24

3.1 Fundamental parameters of traffic flow ........................................................ 24


3.2 Traffic stream parameters ............................................................................ 24

3.3 Car2X communication .................................................................................. 26

3.3.1 Reasons for using car2x communication ............................................... 26

3.3.2 Technical specifications ......................................................................... 27

3.3.3 The technical architecture of Car2X ....................................................... 27

3.3.4 Software architecture ............................................................................. 28

3.3.5 Car2X message types ........................................................................... 28

3.3.6 Forwarding types ................................................................................... 29

3.3.7 Car2X applications. ................................................................................ 29

3.4 Traditional approaches to intersection management .................................... 30

3.5 Green Light Optimal Speed Advisory. .......................................................... 31

3.5.2 Test scenarios ....................................................................................... 35

3.5.3 Limitations of GLOSA ............................................................................ 35

3.5.4 AGLOSA and their variations ................................................................. 35

3.5.5 Results ................................................................................................... 36

3.6 Traditional intersection management plans .................................................. 36

3.7 Reinforcement learning based intersection optimization .............................. 37

3.7.1 Results ................................................................................................... 48

3.8 Non-RL approaches to optimize traffic flow .................................................. 48

3.9 Summary ...................................................................................................... 50

4 Concept .............................................................................................................. 51

4.1 Reinforcement learning and proposed solution ............................................ 51

4.2 Deep Q network-based approach ................................................................ 51

4.2.1 System architecture ............................................................................... 51

4.2.2 Building a Reinforcement learning model .............................................. 52

4.2.3 Deep Q network structure ...................................................................... 53

4.2.4 Single-agent and multi-agent approaches ............................................. 55

4.2.5 Reward definition ................................................................................... 56

4.3 Rule-based policy approach ......................................................................... 56

4.3.1 System architecture ............................................................................... 56


4.3.2 Observation space ................................................................................. 58

4.3.3 Action Space.......................................................................................... 58

4.3.4 Rules ..................................................................................................... 59

4.3.5 Rewards ................................................................................................ 65

4.3.6 Policy parameters .................................................................................. 66

4.4 Summary ...................................................................................................... 67

5 Implementation ................................................................................................... 68

5.1 Technical details .......................................................................................... 68

5.1.1 Deep Q network approach ..................................................................... 68

5.1.2 Rule-based policy approach .................................................................. 68

5.2 Implementation of Deep Q network approach .............................................. 69

5.2.1 Flow configuration .................................................................................. 69

5.2.2 Dynamic SUMO network configuration .................................................. 69

5.3 Implementation of Rule-based policy approach ........................................... 74

5.3.1 Data extraction....................................................................................... 74

5.3.2 Rules implementation ............................................................................ 74

5.3.3 Extended-GLOSA implementation ......................................................... 75

5.3.4 Optimizer implementation ...................................................................... 76

5.3.5 Reward calculation ................................................................................ 77

5.4 Summary ...................................................................................................... 77

6 Evaluation .......................................................................................................... 78

6.1 DQN based approach ................................................................................... 78

6.1.1 Tests ...................................................................................................... 78

6.1.2 Results and discussion of DQN approach ............................................. 78

6.1.3 Issues observed ..................................................................................... 80

6.2 Rule-based policy approach ......................................................................... 81

6.2.1 Tests ...................................................................................................... 81

6.2.2 Results and discussion .......................................................................... 81

6.2.3 Optimization results and discussion ....................................................... 83

6.2.4 Further discussion of Grid test ............................................................... 84


6.2.5 Solutions and improvements ................................................................. 85

6.3 Summary ...................................................................................................... 86

7 Conclusion ......................................................................................................... 87

7.1 Challenges ................................................................................................... 87

7.2 Future improvements ................................................................................... 87

7.3 Concluding remarks ..................................................................................... 88

Bibliography ............................................................................................................... 90

Appendix ................................................................................................................... 93


Acknowledgement

First of all, I would like to thank Prof. Dr. Wolfram Hardt for giving me the opportunity to commence my thesis at the Department of Automotive Software Engineering, TU Chemnitz. I would especially like to thank my university supervisor Prof. Uranchimeg Tudevdagva for her assistance towards the fulfilment of the thesis.

Next, I would like to thank Deutsches Zentrum für Luft- und Raumfahrt (DLR) – Berlin for giving me the opportunity to carry out my master's thesis there. Special thanks to my supervisor Dr. Leonhard Lücken for the constant support, the time spent on my project, and the feedback during the research. Further, I would like to thank the SUMO development team for creating such a wonderful simulation toolkit.

Finally, I would like to thank my parents and my closest friends for their continuous encouragement and support until the end of my Master's degree.


List of Figures

Figure 1 : Problem scenario ...................................................................................... 13

Figure 2 : Solution scenario ....................................................................................... 14

Figure 3 : Machine learning paradigms ..................................................................... 18

Figure 4 : MDP sample scenario ............................................................................... 19

Figure 5 : RL agent and environment interaction [7] .................................................. 20

Figure 6 : Q learning in action ................................................................................... 21

Figure 7 : Q table and DQN [15] ................................................................................ 22

Figure 8 : CNN to calculate Q values [15] ................................................................. 23

Figure 9: Car2X overview ......................................................................................... 26

Figure 10 : Car2x architecture [21] ............................................................................ 27

Figure 11 : Car2x software architecture [19] ............................................................. 28

Figure 12 : Example Car2X message [21] ................................................................. 29

Figure 13 : Car2X application stack [23] .................................................................... 30

Figure 14: GLOSA vs regular driving [24] .................................................................. 32

Figure 15 : GLOSA algorithm [5] ............................................................................... 33

Figure 16 : Coasting and freewheeling [24] ............................................................... 34

Figure 17 : GLOSA + coasting and freewheeling [24] ............................................... 34

Figure 18 : Road network by [2] ................................................................................ 38

Figure 19 : Simulation runs VS average delay [26] ................................................... 39

Figure 20 : Intersection network ................................................................................ 39

Figure 21 : Running time, delay and RL policies [28] ................................................ 40

Figure 22 : Online and offline training [30] ................................................................. 41

Figure 23 : Double DQN + Duel DQN structure [2] .................................................... 42

Figure 24 : Convolutional Neural Network + prioritized replay [33] ............................ 43

Figure 25 : RL agents and intersection scenarios [34] .............................................. 44

Figure 26 : FLOW experiments [11] .......................................................................... 45

Figure 27 : FLOW architecture [11] ........................................................................... 46

Figure 28 : FLOW components [35] .......................................................................... 47

Figure 29 : Intersection and cells [44] ........................................................................ 50

Figure 30: Deep Q network approach overview ........................................................ 51

Figure 31 : Deep Q network approach components .................................................. 52

Figure 32 : DQN - multi-agent ................................................................................... 54

Figure 33 : DQN - single-agent .................................................................................. 55

Figure 34 : Rule-based policy method overview ........................................................ 56

Figure 35: Rule-based policy approach components ................................................ 57


Figure 36: Rule 1 expected result ............................................................................. 59

Figure 37 : Rule 1 algorithm ...................................................................................... 59

Figure 38 : Rule 2 expected results ........................................................................... 60

Figure 39 : Rule 2 algorithm ...................................................................................... 60

Figure 40 : Rule 3 expected results ........................................................................... 61

Figure 41 : Rule 3 algorithm ...................................................................................... 62

Figure 42 : Extended-GLOSA algorithm .................................................................... 64

Figure 43 : FLOW steps ............................................................................................ 69

Figure 44 : Specify nodes code snippet .................................................................... 70

Figure 45 : Specify edges code snippet .................................................................... 70

Figure 46 : Specify route code snippet ...................................................................... 70

Figure 47 : Specify edge starts code snippet .............................................................. 71

Figure 48 : Action space code snippet ...................................................................... 72

Figure 49 : Observation space code snippet ............................................................. 72

Figure 50 : Get state code snippet ............................................................................ 72

Figure 51 : Master configuration code snippet........................................................... 73

Figure 52 : DQN policy .............................................................................................. 73

Figure 53 : Data extraction code snippet ................................................................... 74

Figure 54 : Rule 1 and 2 code snippet ....................................................................... 75

Figure 55 : Traffic cycle ............................................................................................. 75

Figure 56 : Extended GLOSA code snippet .............................................................. 76

Figure 57 : Optimizer code snippet ............................................................................ 76

Figure 58 : Reward calculation code snippet ............................................................. 77

Figure 59 : FLOW results 1 ....................................................................................... 79

Figure 60 : FLOW results 2 ....................................................................................... 80

Figure 61 : Travel time evaluation ............................................................................. 82

Figure 62 : Emission evaluation ................................................................................ 83

Figure 63 : Optimizer travel time results .................................................................... 84

Figure 64 : Optimizer emission results ...................................................................... 84

Figure 65 : Gap creation issue .................................................................................. 85

Figure 66 : Rules execution - simulation loop ............................................................ 93

Figure 67 : Rule 3 code snippet ................................................................................. 93

Figure 68 : GLOSA arrival time calculation all variations ........................................... 94

Figure 69 : Traffic light phase selector ...................................................................... 94

Figure 70 : Advised speed - GREEN phase .............................................................. 94

Figure 71: Advised speed next GREEN phase.......................................................... 95

Figure 72 : SUMO initialization .................................................................................. 95

Figure 73 : SUMO main configuration ....................................................................... 96


Figure 74 : Route file SUMO ..................................................................................... 96

Figure 75 : Rules pseudocode .................................................................................. 100


List of Tables

Table 1: Traffic stream parameters [18] .................................................................... 26

Table 2: Car2X application categories ....................................................................... 30

Table 3 : GLOSA experiments results ....................................................................... 36

Table 4 : RL past experiments results ....................................................................... 48

Table 5 : RL and proposed solutions ......................................................................... 51

Table 6 : Observations - DQN approach ................................................................... 53

Table 7 : Actions: DQN approach .............................................................................. 54

Table 8 : Components, a Rule-based policy approach .............................................. 57

Table 9 : Observations - rule-based policy approach ................................................ 58

Table 10 : Actions - rule-based policy approach........................................................ 59

Table 11 : Optimizing variables ................................................................................. 66

Table 12 : Software - DQN approach ........................................................................ 68

Table 13 : Software - rule-based policy approach ..................................................... 68


List of Abbreviations

RL Reinforcement Learning

MDP Markov Decision Process

RELU Rectified Linear Unit

FC Fully Connected

CNN Convolutional Neural Network

DQN Deep Q Network

Car2X Car2X Communication

GLOSA Green Light Optimal Speed Advisory

RSU Road Side Unit

V2V Vehicle to vehicle communication

V2I Vehicle to Infrastructure communication

IEEE The Institute of Electrical and Electronic Engineers

DSRC Dedicated Short Range Communication

SUMO Simulation of Urban Mobility

CDRL Cooperative Deep Reinforcement Learning

LIDAR Light Detection and Ranging

RADAR Radio Detection and Ranging

AU Application Unit

CCU Communication Control Unit

HMI Human Machine Interface

TCP Transmission Control Protocol

OBD2 On-Board Diagnostics

CAN Controller Area Network

CAM Cooperative Awareness Message

DENM Decentralized Environmental Notification Message

TSB Topologically Scoped Broadcast

GSB Geographically Scoped Broadcast

SPAT Signal Phase and Timing Message

SAM Service Announce Message

TTL Time to Live

AGLOSA Adaptive GLOSA

API Application Programming Interface

TRPO Trust Region Policy Optimization


1 Introduction

Traffic control at an intersection plays a vital role in achieving smooth urban traffic flow. Conventionally, an intersection is controlled by traffic lights with fixed-length phases [1]. However, this is now considered an inefficient way of controlling traffic due to the growth of road infrastructure and vehicle numbers. One way of addressing these issues is to use pre-computed plans for different times of the day or different days of the week [1], [2], which is more efficient than conventional traffic lights. More sophisticated forms of adaptation rely on real-time traffic measurements. Optimization can mainly be done in two ways: controlling the vehicles approaching an intersection, or controlling the traffic light phases depending on real-time traffic information [1], [2], [3]. In some situations, both approaches are combined to achieve an even more optimized behavior [1]. Algorithms related to all of these approaches are explained in more detail in the following chapters. However, these advanced approaches are still under development and testing.

An optimization of traffic may aim at several objectives. For instance, safety may be increased by preventing sudden decelerations [1]. Efficiency can be increased by reducing the number of vehicle stops, since freely flowing traffic has a higher throughput than an accelerating queue. Likewise, the reduction of emissions can be the objective of an optimized control. With the emerging progress in vehicle automation and connectivity [4], it becomes more and more conceivable that such algorithms can rely on a more adherent behavior of the traffic participants and thus unfold their optimization potential to a higher degree in the future.

1.1 Motivation

Many algorithms based on different approaches have already been developed. Traffic lights are already controlled by optimization algorithms and pre-computed plans, but controlling vehicles is only available in test fields and is yet to come. Algorithms like GLOSA [5], [6] use real-time information to adjust the speed of a vehicle approaching an intersection. GLOSA uses the vehicle speed, position, and several other parameters to suggest an advised approach speed. AGLOSA, which is more sophisticated than GLOSA, is able to control vehicles as well as traffic light phases. All of these approaches achieve more efficient results than conventional traffic light systems. However, these advanced algorithms do not take into account the direction of a vehicle (its direction after it passes the junction), which prescribes which lane it needs to use when approaching the intersection. Lane change advice is another aspect that needs further investigation.

This thesis also uses machine learning to optimize traffic flows. Reinforcement learning is a very promising machine learning paradigm due to its interaction mechanism with the environment: the algorithm takes decisions depending on the changes in the environment caused by its own actions [7]. Using reinforcement learning for traffic optimization is a new trend that is currently very popular in the research community [8], and it achieves promising results. Here the author tries to optimize the traffic flow at an intersection by using reinforcement learning for a specific scenario, which is discussed further in the problem statement.

The main research areas of this thesis are traffic optimization and reinforcement learning.

1.2 Problem Statement

In this thesis, this is studied for a particular scenario: a traffic-light-controlled intersection features mixed turning lanes, and straight-going vehicles are occasionally blocked by left-turning vehicles that have to wait for traffic from the opposite direction.

Figure 1 : Problem scenario


The figure above illustrates the described problem. In this example, the circled vehicles are straight-going vehicles that are blocked by a leading left turner.

1.3 Proposed Solution and Approach

Assuming knowledge of the approaching vehicles' destinations as an additional observable, it is studied how an ordering of the approaching vehicles can improve the intersection throughput. For instance, the ordering may lead to a queue where straight-going vehicles are in front of the left-turning ones (see the following figure). As an additional controllable element, this approach requires the inclusion of lane change advice to the approaching vehicles, which is necessary to modify the vehicles' order.

Figure 2 : Solution scenario

The extended GLOSA algorithm, which provides the speed advisory, is expected to further decrease the time loss of vehicles at the intersection in comparison to the basic GLOSA algorithm [6], as it adds more degrees of controllability and provides more information to the traffic control. To attain an operational controller, the thesis takes a reinforcement learning (RL) [7], [9] approach to train vehicles in a simulated environment (SUMO) [10]. Two approaches were selected to perform this training. The first is the RL framework FLOW [11], which is based on the reinforcement learning library Rllab. A rule-based policy with RL is the second approach.

The more successful solution is compared to the classical GLOSA algorithm, which serves as a baseline. Further, the author discusses issues and shortcomings that occurred during the testing phase.

1.4 Objectives

The objectives of this thesis are:

Traffic simulation with a reinforcement learning model: this thesis aims at evaluating a rule-based policy and a deep Q network approach, both based on reinforcement learning. For the deep Q network approach, a framework named FLOW is used. The research aims to build a simulation demonstrating the approach and to examine the resulting issues.

Comparison of the proposed solution with existing GLOSA: the more promising approach is compared with the existing GLOSA algorithm.

Discussion of recorded issues and solutions: this covers the issues recorded during testing and evaluation. Further, the report briefly discusses how to resolve them.

1.5 Thesis structure

The thesis is organized into seven chapters.

Introduction: this chapter introduces the problem the author is solving and briefly presents the solution approach along with the motivation and the thesis objectives.

Fundamentals of reinforcement learning: this chapter briefly explains the reinforcement learning paradigm: what reinforcement learning is, how it relates to other machine learning paradigms, the main components of a reinforcement learning scenario, and various types of reinforcement learning algorithms.

State of the art: this chapter outlines previous work done by the research community. It mainly explains traffic engineering basics, Car2X, various traffic optimization algorithms, the use of reinforcement learning to optimize traffic flow, intersection management, and the results of previous work. Further, the author explains GLOSA and its variations in more detail.

Concept: the author proposes the solution along with the technologies and algorithms, mainly the technical architecture and specifications of the reinforcement learning model.

Implementation: this chapter presents the implementation of the previously introduced solution.

Evaluation: this chapter discusses the results obtained from the proposed implementation and compares them with currently existing algorithms.

Conclusion: this chapter gives an overview of the report, the progress towards the goals, and the drawbacks. It also states future work that can improve the proposed solution.

1.6 Summary

In this chapter, the author introduced the research problem and the solution approach. Further, it discussed the thesis motivation and the objectives that need to be addressed during the research. The next chapter focuses on the fundamentals of reinforcement learning.


2 Fundamentals of Reinforcement Learning

2.1 Introduction

In this chapter, the author briefly explains the theoretical aspects of reinforcement learning: first the basics of RL and why it differs from other machine learning paradigms, then the various types of RL algorithms that currently exist. Further, the chapter addresses the components of RL and how RL relates to the Markov Decision Process.

2.2 What is Reinforcement Learning

Reinforcement learning is a specific type of machine learning in which states are mapped to actions in order to maximize a numerical reward signal [7], [9]. RL differs from the well-established supervised and unsupervised learning paradigms for several reasons, which the author discusses in the next section. One main characteristic of RL is that the learner is not told during the training phase which action to take. Instead, the learner must discover which action is best for a specific state by trying them. Choosing the best action is guided by the reward. The trial-and-error approach and the reward received after an action are the two main characteristics of RL [7].

2.3 Comparison with Supervised and Unsupervised learning

As stated earlier, reinforcement learning is different from supervised and unsupervised learning. Supervised learning uses pre-labeled datasets to train a classifier [12]; the pre-labeling is done by an external supervisor. A set of inputs or attributes is fed to the model, and the model predicts the output or its class depending on its previous experience from the training session. The objective of a supervised learning system is to generalize and learn the characteristics of the training data so that it can later classify unseen data. However, there is no interaction with the environment in supervised learning. In an interactive problem, it is sometimes not possible to obtain all the states on which a supervised learning model could be trained [12]. That is why there should be a way for a model to interact with the environment and learn from its own experience, which is exactly what the reinforcement learning paradigm provides.

Unsupervised learning is also different from reinforcement learning. The main difference is that RL always tries to maximize a reward, while unsupervised learning tries to find hidden structures. Also, reinforcement learning uses a mapping from inputs to outputs, which is not used in unsupervised learning.

There is another learning paradigm named semi-supervised learning, which is a combination of the supervised and unsupervised methodologies. It also differs from reinforcement learning for the reasons above.

Figure 3 : Machine learning paradigms

2.4 Exploration and exploitation

One challenge that comes with RL is the trade-off between exploration and exploitation [13]. The agent passes through the same states several times in order to find optimal actions. Assume that the agent has already found a good action set along one path; there may still be much better action sets along other paths. If RL does not consider exploration, it will not find the best action set. On the other hand, if an agent explores too much, it cannot stick to one or a limited number of paths [13]; it cannot exploit its knowledge and acts as if it has not learned anything. This is why it is important to find a balance between exploration and exploitation.
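In practice, this balance is often handled with an ε-greedy strategy: with probability ε the agent picks a random action (exploration), otherwise it picks the action with the highest estimated value (exploitation). The following minimal Python sketch illustrates the idea; the Q-values and action names are made-up placeholders, not part of the thesis experiments.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Return a random action with probability epsilon, else the greedy one.

    q_values: dict mapping each available action to its estimated Q-value.
    """
    actions = list(q_values.keys())
    if random.random() < epsilon:
        return random.choice(actions)                # explore
    return max(actions, key=lambda a: q_values[a])   # exploit

# Illustrative Q-values for three hypothetical actions
q = {"keep_speed": 0.4, "slow_down": 0.9, "change_lane": 0.1}
print(epsilon_greedy(q, epsilon=0.2))
```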

2.5 Markov Decision Process

A Markov Decision Process (MDP) is a mathematical representation used to model decision making in a stochastic environment [7]. The goal is to find a policy which maps the best action to every state of the environment. Reinforcement learning is a methodology which can solve an MDP for a given scenario [14].


2.5.1 Definitions

States (S): the set of possible situations of a given scenario.

Model T(s, a, s′) = P(s′ | s, a): the probability that state s changes to state s′ due to a specific action a. This is called the transition model.

Action A(s): an influence which causes a state transition.

Reward R(s): the feedback for an action.

Policy π(s) → a: a map of an optimal action for each and every state.

Optimal policy π*(s) → a: the special policy which maximizes the expected reward.

The following figure shows an example of an MDP. It consists of several states (s0, s1, s2, s3); due to the actions (A0, A1, A2) a state transitions to the next state, and a reward value (R1, R2, R3) is marked for each state. A small code representation of this example is given below the figure.

Figure 4 : MDP sample scenario
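To make these definitions concrete, the sample MDP above can be written down as plain data structures. This is only an illustration; the transition probabilities and reward values below are invented, since the figure does not specify them.

```python
# States, actions, transition model T(s, a, s') = P(s' | s, a), and rewards R(s)
states = ["s0", "s1", "s2", "s3"]
actions = ["A0", "A1", "A2"]

# Invented transition probabilities; for each (state, action) pair they sum to 1
T = {
    ("s0", "A0"): {"s1": 0.8, "s0": 0.2},
    ("s1", "A1"): {"s2": 1.0},
    ("s2", "A2"): {"s3": 0.9, "s1": 0.1},
}

# Invented reward per state
R = {"s0": 0.0, "s1": 1.0, "s2": 2.0, "s3": 5.0}

# A policy maps every state to an action; the optimal policy maximizes expected reward
policy = {"s0": "A0", "s1": "A1", "s2": "A2"}
```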

2.6 Components of Reinforcement learning

Most components of a reinforcement learning model are similar to those of an MDP, but there are some small differences.

Agent: an actor or object that lives inside the environment. The actor receives information from the environment and executes actions on the environment.

Environment: the place where the agent lives. It generates the states, and the agent's actions are executed on it.

State: a specific situation returned by the environment, for example the values of the observed parameters over a duration of one second.


Policy: the behavior of an agent at a given time. A policy is the methodology the agent uses to determine the next best action based on the state; in other words, it is a mapping from states to the actions that can be executed in those states. The policy is the core component of a reinforcement learning scenario. It can be a simple function such as a lookup table or a complicated structure such as a deep neural network.

Reward: the immediate feedback given by the environment for the action executed in the last step. The agent's main goal is to maximize the total reward in the long run. The reward also informs the agent whether an action was good or bad: good actions receive positive rewards and bad actions receive negative rewards, with the actual values depending on the scenario. Based on the reward values, the policy identifies whether an action was good or bad and tries to avoid executing bad actions in the future.

Value: the value is similar to the reward, but it is the long-term feedback: the total (accumulated) reward an agent can gain from a specific state until the end. Values, rather than immediate rewards, are used to select the best action at each step; however, without rewards there are no values.

The following diagram shows how the agent and the environment interact with each other. The agent lives in an environment and at each step receives a state, along with the observed factors, from the environment. Depending on the state and the reward received in the last step, the policy decides the best action for the current step. The agent executes this action and moves to the next state, receiving the reward at the same time. This agent-environment interaction continues until the training phase ends.

Figure 5 : RL agent and environment interaction [7]
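The interaction loop shown in the figure can be sketched in a few lines of Python. The `env` and `policy` objects are generic placeholders (any simulation environment and any decision rule could stand behind them); this is not the training loop used later in the thesis.

```python
def run_episode(env, policy, max_steps=1000):
    """Generic agent-environment loop: observe a state, act, receive a reward."""
    state = env.reset()                          # initial state from the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy.select_action(state)     # policy maps the state to an action
        state, reward, done = env.step(action)   # environment returns next state and reward
        policy.observe(reward, state)            # the reward is the feedback signal
        total_reward += reward
        if done:                                 # episode finished
            break
    return total_reward
```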


2.7 Reinforcement learning algorithms

2.7.1 Q learning

Q-learning is a value-based algorithm [14], [15]. The main goal is to create a Q-table; based on this Q-table, the best action is determined. The following is an example of a Q-table. The scenario contains four actions (move up, down, left, and right), and each cell contains the maximum expected reward for taking that action in that state.

The author explains Q-learning with the following scenario [15]. The main goal of the agent is to reach the bottom-right corner while avoiding obstacles (the obstacles are not shown in the diagram). The agent has several possible actions (move up, down, left, right). The following Q-table consists of 25 states and is shown in the middle of the Q-learning process; at the end of the process, each state contains a Q-value for every action, and the action with the highest Q-value is the best action for that state.

Figure 6 : Q learning in action

Another important fact about Q-learning is that there is no explicit policy, only the learned Q-table. All Q-values are calculated by the action-value function [15].


Eq 1 : Q value function [15]

Gamma (γ) is the discount factor. A discount factor of 0 means that the algorithm only considers immediate rewards; the closer the discount factor is to 1, the further rewards propagate through time. The state and the action are the inputs to the function, and the output is the Q-value, which represents the expected future reward. The process continues until a maximum number of iterations is exceeded; finally, the outcome is an optimized Q-table.
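Written out, the action-value update referenced above takes the standard Q-learning form Q(s, a) ← Q(s, a) + α·[R + γ·max_a′ Q(s′, a′) − Q(s, a)], with learning rate α and discount factor γ. A minimal tabular sketch is given below; the grid states and movement actions mirror the example above, and the numbers are illustrative.

```python
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]
Q = defaultdict(float)   # Q-table: (state, action) -> Q-value, default 0.0

def q_update(state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update step."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)   # max_a' Q(s', a')
    target = reward + gamma * best_next                    # discounted future reward
    Q[(state, action)] += alpha * (target - Q[(state, action)])

# Illustrative transition on a 5x5 grid: moving right from cell (0, 0) to (0, 1)
q_update(state=(0, 0), action="right", reward=-1.0, next_state=(0, 1))
```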

2.7.2 Deep Q Network

When a scenario consists of a massive number of states, Q-learning is not a good choice [2]: creating and updating the Q-table is no longer efficient. A better solution is the deep Q network. A deep Q network is a neural network which takes states as inputs and emits a Q-value for each action. The following diagram shows the difference between Q-learning and a deep Q network.

Figure 7 : Q table and DQN [15]

A DQN can consist of several fully connected hidden layers. Convolutional neural networks (CNNs) are also used as DQNs in some experiments [3]; further details regarding CNNs are discussed in the next chapter. The following is the structure of a convolutional neural network for a gaming scenario. It consists of convolution, Rectified Linear Unit (RELU), and fully connected (FC) layers. Video frames are used as the input to the network, and the final output is the Q-values of the possible actions for a given state.

Figure 8 : CNN to calculate Q values [15]
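As an illustration of the fully connected variant mentioned above, a deep Q network can be written as a small neural network that maps a state vector to one Q-value per action. The sketch below uses PyTorch with arbitrary layer sizes; it is not the network architecture used later in this thesis.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per possible action."""

    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),   # fully connected + RELU
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),              # one output per action
        )

    def forward(self, state):
        return self.layers(state)

# Greedy action for a single (illustrative) 4-dimensional state
net = QNetwork(state_dim=4, n_actions=3)
q_values = net(torch.zeros(1, 4))
best_action = int(q_values.argmax(dim=1))
```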

2.7.3 Rule-based policies

Rule-based policies are considered a traditional way of modeling reinforcement learning policies. Here, a set of rules decides which action the actor needs to take depending on the state. In the early stages of reinforcement learning, rule-based policies were more popular than Q-learning and other approaches, as they are an easier and more direct way to solve a problem. Compared with a DQN, rule-based systems are simpler and have a shorter training period.
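The following toy sketch shows what such a hand-written policy looks like: a fixed set of if-then rules mapping an observation to an action. It is only an illustration in the spirit of the traffic scenario of this thesis; the actual rules are defined in the concept chapter, and the thresholds here are invented.

```python
def rule_based_policy(phase, distance_to_light_m, speed_mps, speed_limit_mps):
    """Choose an action from simple hand-written rules (illustrative only)."""
    if phase == "GREEN":
        return "accelerate" if speed_mps < speed_limit_mps else "keep_speed"
    if phase == "RED" and distance_to_light_m < 100.0:   # invented threshold
        return "slow_down"      # avoid arriving while the light is still red
    return "keep_speed"

print(rule_based_policy("RED", distance_to_light_m=80.0,
                        speed_mps=13.9, speed_limit_mps=13.9))
```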

2.8 Summary

This chapter provided a basic understanding of RL. It is a special machine learning paradigm that belongs to neither the supervised nor the unsupervised learning category. Agent-environment interaction and a trial-and-error mechanism for finding the best action in a given state are the main characteristics of an RL system. The chapter introduced the basic components of RL: agents, states with observations, actions, the policy, and, most importantly, the reward. The Q-value, which derives from the reward, states how valuable a specific action is in a particular state. The author introduced the MDP and how RL relates to it: reinforcement learning is a method to solve an MDP. The chapter also briefly explained Q-learning and the DQN with its variations. Further, the research community uses rule-based policies, which are less complex and more straightforward than DQNs.


3 State of the Art

This chapter explains the work, experiments, and results of previous research. First, the author introduces the fundamentals of traffic engineering and the basic parameters of traffic flow. The next focus is on Car2X technology, especially how Car2X works and why it is important. The author then introduces GLOSA, which is one of the important Car2X applications, and discusses variations of GLOSA and the results achieved by the research community. Finally, the chapter introduces important approaches that experts have taken to optimize traffic flow and intersection throughput. As reinforcement learning plays a vital role in this project, the author points out the major reinforcement learning based algorithms and techniques that past research has used for optimization, and briefly discusses a few other notable non-RL algorithms.

3.1 Fundamental parameters of traffic flow

The aim of traffic engineering is to understand the characteristics and behavior of traffic flows [16]. This helps to build smooth, efficient, and safe traffic models which can later be deployed in the real world. Traffic stream parameters help to understand the nature and the variations of a traffic flow.

3.2 Traffic stream parameters

A traffic stream represents a combination of driver and vehicle behavior. More importantly, engineers need to examine how a vehicle and its driver interact with other instances in a large flow [17]. A flow also varies depending on the location, the geographical characteristics of the road, and time factors.

The stream parameters are important for modeling flows and are used to forecast traffic. According to the research, there are several types of parameters [16], [18]:

1. Measurements of quality, e.g. speed
2. Measurements of quantity, e.g. density and flow of traffic


Stream parameters can also be either macroscopic or microscopic: macroscopic parameters describe the behavior of the flow as a whole, while microscopic parameters describe the behavior of each individual vehicle and its interaction with other instances. The fundamental stream parameters are as follows.

Speed (distance / time): considered a quality measurement.

Spot speed: the instantaneous speed of a vehicle at a specific location.

Running speed (length of the track / time spent in motion): the average speed a vehicle maintained between two points, counting only the time it was in motion.

Journey speed (length of the track / time spent, including stopped durations): the average speed a vehicle needed to travel from one point to another, including the stopping times.

Time mean speed (speed of all vehicles / time): the average speed of all vehicles passing a specific point during a given time.

Flow (number of vehicles / time interval): the number of vehicles passing a given point of a road during a specific time interval. The volume can be further divided into other measurements, for example average annual daily traffic and average weekday traffic.

Density (number of vehicles / road length): the number of vehicles driving on a given length of road.

Travel time: the time a specific vehicle needs to complete a journey.

Time headway: the time difference between two successive vehicles passing a given point.

Distance headway: the distance from the leading vehicle's rear bumper to the following vehicle's rear bumper at a point in time.

Table 1: Traffic stream parameters [18]
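The basic parameters in Table 1 follow directly from simple measurements. The short sketch below computes flow, density, and time mean speed from invented detector readings, using the definitions given in the table.

```python
# Invented measurements at one detector over a 60-second interval
vehicle_count = 30                          # vehicles passing the detector
interval_s = 60.0                           # observation interval [s]
spot_speeds_mps = [12.0, 14.5, 13.0, 11.8]  # spot speeds of observed vehicles [m/s]
section_length_m = 500.0                    # observed road section [m]
vehicles_on_section = 18                    # vehicles on the section at one instant

flow_veh_per_h = vehicle_count / interval_s * 3600.0                   # flow = vehicles / time
density_veh_per_km = vehicles_on_section / section_length_m * 1000.0   # density = vehicles / length
time_mean_speed_mps = sum(spot_speeds_mps) / len(spot_speeds_mps)      # average of spot speeds

print(flow_veh_per_h, density_veh_per_km, time_mean_speed_mps)
```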

3.3 Car2X communication

Car2X is the wireless communication and data exchange between vehicles, roadside units, and pedestrians [4]. Several types of Car2X communication are currently available:

1. Car-to-car communication
2. Car-to-infrastructure communication
3. Car-to-pedestrian communication

Vehicles, RSUs (e.g. traffic lights and road signs), and the mobile phones carried by pedestrians are the main building blocks of a Car2X environment.

Figure 9: Car2X overview [4]

3.3.1 Reasons for using car2x communication

Car2X research is important for the following reasons [4], [19]:

1. Safety: to save the lives of people and avoid road accidents.

2. Traffic efficiency: to optimize the traffic flow in smart cities and improve the fuel efficiency of drivers, which also reduces travel time.

3. Comfort: lower-priority applications which focus on improving the driving experience.

3.3.2 Technical specifications

The Car2X research community uses Dedicated Short Range Communication based on the IEEE 802.11p standard [20], which was specifically developed for vehicular networks. The main advantages of DSRC are that it can establish a connection very quickly and that its transmission latency is low [4]. The frequency band is at 5.9 GHz, and two main bandwidths are used: 75 MHz in the United States and 30 MHz in Europe [21]. The maximum communication range is 1 km, which is better than currently available LIDAR and RADAR. GPS is also used to determine the position of the vehicle or RSU.

3.3.3 The technical architecture of Car2X

The following diagram depicts the technical architecture of the current simTD project [22], a flagship project run by the German government.

Figure 10 : Car2x architecture [21]

All vehicles and RSUs consist of an AU and a CCU. The CCU handles the communication with other vehicles and with RSUs such as traffic signs and traffic lights, while the AU runs the Car2X application. An HMI allows the driver to interact with the application. In addition, all vehicles and RSUs are connected via a regular TCP/IP connection to central management systems, where people monitor the environment. These management centers manage the environment depending on the input they receive from the vehicles and RSUs.

3.3.4 Software architecture

Car2Car applications consist of vehicle components, while Car2I applications consist of both vehicle and infrastructure components. The vehicle component gathers data from the OBD2 system and the CAN bus. RSUs have embedded sensors depending on their functionality.

Figure 11 : Car2x software architecture [19]

3.3.5 Car2X message types

Several message types are transmitted over Car2X networks [21].

CAM announces the presence of a vehicle or roadside unit. Every vehicle maintains a neighborhood table which records details about nearby vehicles and RSUs and is updated frequently.

DENM is an event-triggered message. Each scenario triggers a certain type of DENM, and its content changes depending on the situation. It is broadcast via TSB or GSB.

SPAT is mainly used for intersection management: traffic lights broadcast the remaining green-light interval to the vehicles.

SAM is another message type which is still under discussion.

The following is a generic Car2X message, consisting of a header and a payload.

Figure 12 : Example Car2X message [21]
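The header/payload structure shown above can be mirrored in code as a simple container. The field names below are simplified illustrations and do not reproduce the exact fields standardized for CAM, DENM, or SPAT messages.

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class Car2XHeader:
    protocol_version: int
    message_type: str      # e.g. "CAM", "DENM", "SPAT"
    station_id: int        # sending vehicle or RSU
    timestamp_ms: int

@dataclass
class Car2XMessage:
    header: Car2XHeader
    payload: Dict[str, Any]   # content depends on the message type

# Illustrative SPAT-like message: a traffic light announcing its remaining green time
msg = Car2XMessage(
    header=Car2XHeader(protocol_version=1, message_type="SPAT",
                       station_id=4711, timestamp_ms=123456),
    payload={"intersection_id": "J1", "remaining_green_s": 12.0},
)
```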

3.3.6 Forwarding types

Forwarding describes how a message is transported from its origin to its destination. The transmission needs to consider the number of hops, the TTL, and other technical information [21].

Unicast is the first type. The key point is that the destination is predefined: the message knows exactly where it needs to go, which may be the direct neighbor (one hop) or several hops away.

TSB usually involves many hops. A TTL value is attached to the message and reduced by one every time it is forwarded; in this way it reaches its destination. Most messages use this forwarding type, but in a high-density area such as a traffic jam the interference can be very high.

A single hop is a transmission to the direct neighborhood only; here the TTL is one.

GSB is similar to TSB, but it additionally considers geographical details: the geographical area is marked as a circle or a rectangle.

3.3.7 Car2X applications

Car2X applications can be divided into safety and non-safety related applications. Safety applications mainly focus on protecting human lives, while the rest aim at improving traffic efficiency.

Safety applications [21], [19]: emergency vehicle warning, motorcycle approach notification, intersection collision warning, weather-related warnings, blind spot detection, etc.

Traffic efficiency applications [21], [4]: Green Light Optimal Speed Advisory (GLOSA), enhanced route guidance.

Table 2: Car2X application categories

Figure 13 : Car2X application stack [23]

3.4 Traditional approaches to intersection management

For several decades, traffic engineers have used simple approaches to improve traffic efficiency by controlling the traffic light phases. The easiest way is to introduce precomputed plans for the traffic lights depending on the time of day and the day of the week (e.g. one plan for busy hours and another for off-peak hours).

There are mainly two ways to address the traffic optimization problem at an intersection [1]:

Adaptive traffic lights: change the traffic light phases in order to improve the traffic flow.

Guidance for vehicles: give real-time guidance to drivers so they can adjust the vehicle parameters before reaching the intersection.

3.5 Green Light Optimal Speed Advisory.

Green Light Optimal Speed Advisory (GLOSA) is a Car2X application which aims to improve the fuel efficiency of the driver and to reduce the journey time. GLOSA optimizes the speed at which vehicles approach a nearby traffic-light-controlled intersection. This prevents driving too fast while the traffic light phase is red and driving too slowly while the phase is green [5]. The communication type is primarily car-to-infrastructure communication. GLOSA can also reduce traffic congestion.

The execution of the algorithm starts when the vehicle enters the communication range, i.e. the range of the communication between the vehicle and the RSUs (traffic lights). The traffic light sends a CAM declaring its availability to the vehicle. The vehicle receives the message, and the on-board GLOSA application checks the type of the message, whether it originates from a traffic light, and the position of the traffic light. The on-board application also checks whether the traffic light is situated on the current route; otherwise, it disregards the signal.

3.5.1.1 Important steps

The following steps describe how GLOSA makes decisions, its calling sequence, the data exchange and the conditions required for it to work [1], [5].

1. The GLOSA application calculates the time the vehicle needs to reach the traffic light, considering the current speed of the vehicle, the distance between the traffic light and the vehicle, and the acceleration.
2. The next important step is checking the phase of the traffic light.
2.1. If the light is GREEN, the algorithm sets the expected speed to the maximum speed.
2.2. If the light is RED, the algorithm slows the vehicle down if it is traveling too fast. It calculates the speed the vehicle needs in order to reach the traffic light in the next green phase.
2.3. If the light is YELLOW, the expected speed is finalized depending on the distance of the vehicle.
3. Finally, the driver receives a speed range (minimum expected speed, maximum expected speed).
4. The algorithm runs once per simulation step.


The following figure shows how GLOSA advises a driver and slows down a vehicle before it reaches the intersection.

Figure 14: GLOSA vs regular driving [24]

3.5.1.2 Useful mathematical equations

The following equations are used to find the arrival time (t) of a vehicle at the intersection from its current distance, and the advised GLOSA speed (U) [5].

First, the arrival time (t) to the intersection from the current position is calculated from the distance (d), velocity (u) and acceleration (a), where the distance is the difference between the current vehicle position and the intersection:

d = u*t + (1/2)*a*t^2

When a != 0 and t is made the subject of the above equation:

t = (-u + sqrt(u^2 + 2*a*d)) / a

When a = 0:

t = d / u

Next, the advised speed (U) is calculated from the distance from the current vehicle position to the traffic light (d), the time needed to reach the traffic light (t), and the current speed of the vehicle (u):

U = (2*d / t) - u

3.5.1.3 A Pseudo Code

The following pseudocode further illustrates how to implement the algorithm in the

simplest form.

Figure 15 : GLOSA algorithm [5]
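As a complement to the pseudocode above, the following minimal Python sketch illustrates the basic GLOSA decision described in section 3.5.1.1. It is a simplified illustration under the stated kinematic assumptions, not the exact implementation of [5]; MAX_SPEED and the phase handling are example choices.

    import math

    MAX_SPEED = 14.0  # m/s, example speed limit

    def arrival_time(d, u, a):
        """Time needed to cover distance d from speed u with constant acceleration a."""
        if abs(a) < 1e-6:
            return d / u if u > 0 else float("inf")
        disc = u * u + 2.0 * a * d
        if disc < 0:
            return float("inf")  # the vehicle would stop before reaching the light
        return (-u + math.sqrt(disc)) / a

    def glosa_speed_range(d, u, a, phase, t_phase_remaining, t_next_green):
        """Return a (min, max) advised speed for one vehicle and one step."""
        if phase == "GREEN":
            return (u, MAX_SPEED)                      # step 2.1: aim for maximum speed
        if phase == "YELLOW" and arrival_time(d, u, a) <= t_phase_remaining:
            return (u, MAX_SPEED)                      # step 2.3: still reachable in time
        # step 2.2: target the start of the next green phase
        advised = (2.0 * d / t_next_green) - u if t_next_green > 0 else 0.0
        advised = max(0.0, min(advised, MAX_SPEED))
        return (0.0, advised)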

According to [24], GLOSA productivity increases if the system can advise the driver to freewheel or coast when the vehicle cannot pass during the current green phase. Both freewheeling and coasting apply only to vehicles with a manual transmission.

Freewheeling: the vehicle slows down in neutral gear. This is more efficient than braking, and the distance the vehicle can travel depends on its initial speed.

Coasting: the vehicle slows down in any gear, using the vehicle's kinetic energy.

The following diagram shows how coasting and freewheeling can be used with GLOSA. According to [24], freewheeling is more energy efficient than coasting.


Figure 16 : Coasting and freewheeling [24]

Before suggesting a speed for slowing down, the algorithm needs to check whether freewheeling is possible. If not, it checks the possibility of coasting before braking. Freewheeling is checked before coasting because freewheeling is more efficient. The following is an activity diagram of the suggested approach.

Figure 17 : GLOSA + coasting and freewheeling [24]


Some experiments have also been carried out to check how useful multi-hop communication is with GLOSA. Researchers suggest using a forwarding mechanism like TSB instead of SHB [6] in order to improve the information distance and the communication range between RSUs and vehicles; this aspect is usually disregarded in simulation-based testing. Applying multi-hop forwarding via parked cars is considered the best way to increase the information distance.

3.5.2 Test scenarios

GLOSA has been tested with various traffic simulators [5], [1], and several authors have tested GLOSA in real test fields [6]. GLOSA has also been tested under various conditions; the length of the GLOSA-activated track, single- or multi-intersection approaches, the percentage of vehicles equipped with GLOSA systems, the traffic density, and various forwarding types are a few of the test variables.

3.5.3 Limitations of GLOSA

Number of vehicles equipped with GLOSA [24]: the percentage of vehicles that use GLOSA is very important when it is deployed in the real world. Vehicles that are not equipped with GLOSA drive in the regular, inefficient way. This can cause traffic congestion as usual, and even if the remaining vehicles have GLOSA, the congestion prevents GLOSA from working well. GLOSA does not detect congestion, as it only focuses on a specific vehicle and the traffic light. Installing GLOSA in all approaching vehicles is the best way to obtain maximum results [5], [24].

High traffic density areas: GLOSA does not consider traffic congestion, information about leading vehicles, or road information when it suggests a speed. Due to that, it is not a good solution for high-density traffic situations [24].

3.5.4 AGLOSA and their variations

Adaptive GLOSA (AGLOSA) is a combination of adaptive traffic lights and GLOSA [1].

GLOSA only changes the vehicle speed; it does not affect the traffic light phases in any way.

Adaptive traffic lights only change the traffic light phases, depending on the inputs they get from nearby vehicles.

AGLOSA changes both the vehicle approach speed and the traffic light phase, which is more efficient. AGLOSA mainly consists of several steps [1]:

1. Each vehicle sends its position and speed to the RSU (traffic light).
2. The RSU calculates the most optimal plan.
3. The RSU sends the best switching time/phase to each vehicle.
4. The vehicle's on-board system calculates the approach speed according to the switching phase.

3.5.5 Results

The following table summarizes important observations from past work related to GLOSA.

Past work | Results
[5] | According to the results, 300 m is the best distance at which to activate GLOSA for approaching vehicles. Further, the researchers ran an experiment with both GLOSA-equipped and non-equipped vehicles; increasing the share of GLOSA-equipped vehicles led to better results.
[6] | Single-hop broadcast suffers from signal attenuation due to various reasons. [6] suggests a multi-hop mechanism to extend the information distance.
[1] | [1] compared AGLOSA with existing fixed-length lights, adaptive lights, and GLOSA. The simulation was run with up to 2000 vehicles per hour and achieved successful results.

Table 3 : GLOSA experiment results

3.6 Traditional intersection management plans

Fixed-length phases are used for traditional traffic lights. All phase lengths are pre-calculated and do not take environmental or external factors into account when the phases change [1]. Installing precomputed plans based on the time of day and the day of the week is a much better approach than fixed-length phases, and such plans are mostly used in real environments [6]. Engineers mainly focus on the rush hours of the day, the average number of vehicles that pass the intersection during rush hours, and many other environmental factors. This is a more effective way of controlling the traffic flow than fixed-length phased traffic lights.

Due to improvements in car-to-car and car-to-infrastructure communication, engineers have been able to introduce more advanced algorithms and methodologies to control vehicle movements and traffic light phases [24].

3.7 Reinforcement learning based intersection optimization

Reinforcement learning is one of the premier machine learning paradigms currently used by the research community to solve complex traffic optimization problems. Traffic flow optimization, intersection throughput enhancement, and autonomous-driving-related testing are a few example uses. Experiments are mainly carried out in simulation environments, and the overall results show that reinforcement learning is a good choice for solving the above problems.

Here the author lists several notable reinforcement learning based studies that mainly deal with solving various traffic situations.

There are mainly two ways of solving traffic-related problems with reinforcement learning:

1. Consider the traffic light as an agent
2. Consider the vehicle as an agent

[2] proposed a Deep Q-network and a Q-matrix approach to experiment on a traffic light system where the red and green light phases change dynamically; their duration depends on the number of vehicles approaching the traffic light. [2] used the following road network.


Figure 18 : Road network by [2]

Vehicles approach the intersection from the north-south and west-east directions and vice versa. The author used SUMO [10] as the simulation interface and PyBrain [25] to train the reinforcement learning algorithm. Both the Q-network and the Q-matrix were tested with constant and varying demand. The reward was the negative sum of the squared delays of all vehicles, where the delay is the difference between the actual arrival time and the virtual arrival time. According to the results, the Q-network did not produce promising results when the vehicle queue was long, even though it was trained with various numbers of hidden layers. The Q-matrix, on the other hand, produced better results than the Q-network even when the queue length was high.

[26] introduced a Q-learning based traffic light optimization algorithm to improve the efficiency of traffic lights. The main difference is that the researcher used a road network with three nearby intersections, equipped with left-turn lanes similar to the real world. To build the algorithm, [26] used both peak and off-peak situations. The author used VISSIM [27] for simulation and external APIs to train the Q-learning algorithm. This is also a multi-agent approach. The following observations were used as inputs to Q-learning:

1. Queue lengths of main and side lanes
2. Current signal phase
3. Duration of the green phase
4. Leader information

Extending the current phase or switching the phase (green light for the opposite direction) were the actions proposed by Q-learning. The total Q-value of all intersections was taken as the global reward for the experiment, while the Q-value of each intersection was considered a local reward. The following chart shows the results achieved by [26]; mainly, it shows how the delay is reduced under various demands.

Figure 19 : Simulation runs VS average delay [26]


[28] also proposed a reinforcement learning based approach for a multi-intersection scenario where all intersections are located in close proximity.

Figure 20 : Intersection network

Traffic lights were considered as agents, and due to the multiple intersections the proposed scenario consists of multiple agents. Vehicle position, speed, and the traffic signal phases of the current and neighboring intersections were used as state variables. The direction for which the green light should be turned on was taken as the action space; for example, south-to-west is one action. In total there were four actions, including the left-turn phases. [28] used a convolutional network to create the reinforcement learning algorithm, which the researchers specifically named Cooperative Deep Reinforcement Learning (CDRL). As usual, the network consists of convolution layers, batch normalization, ReLU layers, and finally fully connected layers. Further, CDRL is equipped with an experience replay feature.

The authors used the well-known transportation planning function first introduced by the U.S. Bureau of Public Roads (BPR) [29]. According to the reported results, CDRL achieves lower delay values when the running time and the number of episodes are high. The following figure shows how CDRL compares with other algorithms such as the Deep Q-network and Q-learning.

Figure 21 : Running time, delay and RL policies [28]
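For reference, the standard BPR volume-delay function expresses the travel time t on a link as a function of its free-flow travel time t0, the traffic volume v and the link capacity c. With the commonly used coefficients it reads as follows; the exact parameterization used in [28] is not reproduced here and may differ:

t = t0 * (1 + 0.15 * (v / c)^4)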

[30] is another reinforcement learning approach that reduces the intersection delay by considering the traffic light as an agent, but it has a few differences compared with the previous approaches. Mainly, the author simulated the implemented algorithm using real-time data, with SUMO as the simulation framework. The model introduced by [30] has two phases.


The offline phase is a data-gathering phase with a fixed traffic light schedule; it allows vehicles to pass the intersection as usual. The online phase is the reinforcement learning phase, which extracts details from the simulation and sends them to a convolutional network that finally outputs the Q-values. As in the previous instance, the author used an experience replay mechanism. The following figure shows the researcher's approach in more detail.

Figure 22 : Online and offline training [30]

The main actions were similar to [28]. State observations were taken for each lane; the observation space includes the queue length, updated waiting times, and other regular vehicle data. However, [28] used several different reward types; among them, the accumulated waiting time of all approaching vehicles, the total number of vehicles that pass the intersection during the simulation step, and the total travel time of passing vehicles were notable.

[2] is another approach, where the author used a 3DQN network. 3DQN is a combination of Double DQN [31] and Duel DQN [32]. The Double DQN approach is heavily used to handle the problem of Q-value overestimation. The solution to this problem is the use of two networks in order to decouple action selection and Q-value prediction:

A single DQN predicts the best action for the next state.

A second network (the target network) estimates the target Q-value by taking the above-predicted action.


According to the research community, Double DQN is easier to train than a regular DQN and has more stable learning.

Duel DQN is the other special technique [2] used. It again decouples the Q-value computation into two parts:

One stream calculates the state value of the current state.

The other stream calculates the advantage of all available actions; the network determines the best action for the state by comparing all available actions.

According to the researcher, combining both techniques can enhance the performance of the overall network. Because the author used both techniques, the overall architecture consists of three networks, which is the main difference compared with all other approaches.
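To illustrate the Double DQN idea described above, the following minimal sketch shows how the learning target is computed by decoupling action selection (online network) from value estimation (target network). It is a generic illustration, not the network of [2]; the callables and the discount factor are assumptions.

    import numpy as np

    GAMMA = 0.95  # example discount factor

    def double_dqn_target(reward, next_state, done, q_online, q_target):
        """Compute the Double DQN target for one transition.

        q_online(s) and q_target(s) are assumed to return a vector of
        Q-values, one entry per action.
        """
        if done:
            return reward
        best_action = int(np.argmax(q_online(next_state)))           # online net selects
        return reward + GAMMA * q_target(next_state)[best_action]    # target net evaluates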

The author also used a prioritized experience replay technique, which favors samples that occur rarely and are more important than others. This technique increases the replay probability of samples with a high temporal-difference error. The following figure shows the overall architecture of the network.

Figure 23 : Double DQN + Duel DQN structure [2]

The author used vehicle speed and position as state variables, and the action space consisted of the four phases of the traffic lights. The accumulated waiting time between cycles was taken as the reward of the system. According to the final results, 3DQN achieved the highest cumulative reward and the least waiting time when compared with the well-established DQN approach.

[33] proposed another approach with a prioritized replay technique. The researcher used a convolutional neural network with a target network. Speeds and positions were fed into the convolutional network as state variables. The convolutional network also consists of two parts. The following figure shows the network structure [33] used, which is a distinctive structure compared with the other authors.

Figure 24 : Convolutional Neural Network + prioritized replay [33]

The main action space was the green light direction, either south-to-north or east-to-west. The author tested the algorithm with real traffic data, and the results show that the algorithm is more effective than fixed-phase control and the longest-queue-first algorithm.

As seen above, many studies use the traffic light as the agent, but only a few authors used the approaching vehicle as the agent.

[34] tested how an automated vehicle handles an unsignalized intersection using reinforcement learning. According to [34], traditional rule-based systems are not good at handling complex situations like this, and a DQN is a good choice for solving it. The following figure shows a few scenarios used to train the DQN; the red vehicle is the reinforcement learning agent and the other vehicles are controlled by SUMO.


Figure 25 : RL agents and intersection scenarios [34]

The researcher tested the reinforcement learning algorithm with three representations:

Time-to-Go: the action space only consists of Wait and Go when the vehicle is next to the intersection.

Sequential: the action space consists of accelerating, decelerating, and keeping a constant speed.

Creep-and-Go: a combination of the above two representations; the action space consists of wait, move forward slowly, and go.

The state variables were the heading angle, the presence of a vehicle (0 or 1), and the velocity. Rewards were calculated by checking the vehicle status, and reward increments depend on the collision status of the vehicles. According to the results, the Time-to-Go representation outperforms the other two representations and performs much better than the traditional Time-to-Collision (TTC) algorithm.

[8]'s DeepTraffic is a DQN-based, JavaScript simulation that runs in a browser. The main idea of the project was to train a DQN to handle a vehicle in a dense traffic area in the most efficient way and avoid collisions. The action space consists of changing lanes to either side, keeping the same settings, accelerating, and decelerating. This is an open competition for students, who can change the hyper-parameters of the algorithm (learning rate, number of hidden layers and neurons of the DQN, rewards, exploration and exploitation, and many others), train the network, and see how the algorithm reacts to the changed parameters. In this way [8] tried to figure out the best set of hyper-parameters for the scenario. A few findings identified so far:

The larger and deeper the DQN, the better the performance for this specific scenario.

A too-large input size is bad for performance.

A large input size can still be acceptable if training time is not a major constraint, because training time grows considerably when the network is deeper and larger and the input size is high. However, this is not always the case, as larger and deeper networks sometimes show diminishing returns.

[11], [35] introduced a computational framework named FLOW, which helps simulate autonomous vehicle behavior using reinforcement learning algorithms. FLOW uses SUMO [10] as the simulation framework and a reinforcement learning library named Rllab [36]. The research community uses FLOW for developing complex controllers for reinforcement learning agents mixed with human-driven vehicles in various scenarios. Mainly, [11] developed controllers for ring roads (single and double lane), a figure-eight road network, and intersection scenarios. The development team trained autonomous vehicles using reinforcement learning algorithms to drive without colliding with each other. The following images show a few example scenarios; all vehicles were reinforcement learning agents living in the SUMO environment.

Figure 26 : FLOW experiments [11]

The following diagram shows the FLOW process. As mentioned earlier, SUMO is the simulation environment, and TraCI is used to extend the SUMO functionality and modify SUMO states dynamically in each simulation step. FLOW uses Rllab to implement, evaluate, and optimize the reinforcement learning policies. OpenAI Gym [37], another reinforcement learning library, was used as the base framework when creating Rllab.

Figure 27 : FLOW architecture [11]

FLOW is the connector between SUMO and Rllab. After the user provides the information needed to create the SUMO network and to train the model, FLOW dynamically creates the SUMO road network at runtime. It collects data from the simulation in every step and passes it to Rllab. The reinforcement learning policies from Rllab use this information to find the best action for each state and pass it back to FLOW; SUMO then receives the action to execute as a TraCI command.

The simulation continues until it exceeds the configured number of steps. At the end of each simulation, the gradient is calculated and the policies are updated. FLOW then resets the environment and starts the simulation again until it reaches the user-defined number of iterations. At the end of the training phase, FLOW delivers a fully trained reinforcement learning policy for the specific scenario. However, there is no standard process for finding the exact hyper-parameter values initially; they need to be found experimentally. As usual, computation power plays a key role during the training phase, as it takes a considerable amount of time to train the model. One disadvantage of FLOW is that the currently released version does not use a multi-agent architecture: even if the user has many agents in one simulation, all inputs must be passed to a single model. Due to that, models are often larger and deeper, and they need more time to train.
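The iteration scheme described above can be summarized by the following minimal, framework-agnostic sketch of the rollout/update loop. It is not FLOW's actual API; the env and policy objects and their method names are assumptions used only to illustrate the control flow.

    def train(env, policy, n_iterations, horizon):
        """Generic sketch of the rollout/update loop described in the text."""
        for iteration in range(n_iterations):
            state = env.reset()                     # reset the simulation
            trajectory = []
            for step in range(horizon):             # run one episode
                action = policy.act(state)          # policy chooses an action
                next_state, reward, done = env.step(action)  # executed via the simulator
                trajectory.append((state, action, reward))
                state = next_state
                if done:
                    break
            policy.update(trajectory)               # gradient step at episode end
        return policy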

FLOW has a dynamic way of creating SUMO-based road networks. It is completely written in Python and provides several classes which the user needs to override:

Generator: the user enters the information for generating nodes, edges, and routes.

Scenario: specifies the network configuration for an experiment using the shapes the user defined in the Generator class. Based on the specification, FLOW creates the SUMO configuration and net files.

Environment: the class responsible for retrieving the observation space in every step and executing the action from the listed action space. It also calculates the rewards. The user can write a custom function or use specifications already defined by FLOW.

The following figure provides more details regarding the main components of FLOW.

Figure 28 : FLOW components [35]


For all the above scenarios, the observation space consists of the velocity, the current lane, and the current position. The action space consists of a lane change or a velocity change (acceleration, deceleration, or keeping a constant velocity).

The user also needs to add other details which are important for the training process (reinforcement learning algorithm, policies, evaluation methodology, number of iterations, simulation time).

3.7.1 Results

Past work | Results
[2] | The neural network approach had mixed results: the network was unable to fit the action-value function, and the results got worse when the number of hidden layers was increased. The Q-learning approach was much more robust and certainly better than the neural network.
[26] | The Q-learning multi-agent approach decreases the average delay per vehicle in both saturated and oversaturated flows.
[28] | CDRL achieved a higher reward and a lower average cumulative delay than deep reinforcement learning and Q-learning.
[33] | The average staying time was reduced significantly after 800 episodes, with up to a 40 % reduction compared with the longest-queue-first and fixed-control algorithms.
[11] | FLOW already provides an environment to experiment with RL algorithms for a wide range of scenarios, but it currently supports only a single-agent mechanism. The FLOW developers plan to release a more complete framework that expands the number of scenarios, adds further functionality, and supports a multi-agent mechanism.

Table 4 : Results of past RL experiments

3.8 Non-RL approaches to optimize traffic flow

This section briefly explains a few other methodologies and algorithms the research community has used to optimize traffic flows.


[38] ran a queue-length optimization to find the best cycle values (phase lengths for red and green) that achieve the lowest queue lengths at an intersection. The solution was designed for a real environment, and the optimized cycle length was 140 seconds according to the results.

[39] is a genetic algorithm based approach for allocating green time depending on the traffic demand. A chromosome consists of four genes, each representing an effective green-time ratio. The author follows the regular genetic algorithm steps: the genes go through fitness evaluation, selection, crossover, and mutation until the stopping criteria are reached. Rank selection was used for the selection phase and blending for crossover. [40] proposed a similar methodology to optimize traffic light phases using an adaptive genetic algorithm; the main goal was to reduce queuing time and improve intersection capacity.

[41] uses a fuzzy algorithm to optimize the phase sequence. The vehicle queue length and the arrival rate are inputs to the algorithm, and it has four output classes (very long, long, medium, and short). It controls the length of the current green phase depending on the predicted class. [42] introduced another fuzzy approach to control the phase duration, similar to [41]; the inputs to the fuzzy controller were the number of vehicles in the arrival direction (the direction which is green) and the queue length of the waiting direction.

[43] proposed a hybrid fuzzy-genetic controller which is architecturally a multi-agent system. It has an intersection management agent and driver agents.

[44] suggested a reservation-based scheduling approach for multilane intersections, which can be considered a unique system compared with the other approaches.

In [44]'s design, the area belonging to the intersection is divided into small portions (cells). Approaching vehicles need to announce themselves to the intersection management system, which then calculates an arrival time and a departure time for each vehicle in each cell. Each cell has a queue of vehicle identities sorted by the vehicles' arrival times. If a cell is not available, the system checks when it becomes available for a specific vehicle and, depending on the availability, suggests an arrival time for that vehicle at that cell. The following figure shows how the author divided an intersection into cells.


Figure 29 : Intersection and cells [44]

Please note that the above approaches are only a few notable experiments. Traffic flow optimization and intersection management is a wide research area, and various experiments have been carried out with a wide range of optimization algorithms.

3.9 Summary

This chapter started by introducing traffic stream parameters, defining important parameters such as traffic density, travel time, and traffic flow, which are used later in this report. Next came a brief explanation of Car2X technology: how Car2X works and the overall architecture of its hardware and software setup, followed by more technical details such as message types and forwarding types. The author then explained GLOSA, one of the Car2X applications; the report described how GLOSA works, its variations, and its mathematical and technical details. Next, one of the most important topics, past work on reinforcement learning, was discussed. Various studies have mainly used different types of DQNs to optimize traffic flows and intersections; among those, CNNs, duel networks, and 3DQN are popular, and Q-learning has been used as well. The ways the various authors addressed the problem also differed, mainly in the scenarios, road networks, assumptions, and the use of simulated or live data. Rule-based policies for traffic optimization have not been used widely by the research community. Finally, the author discussed a few non-RL approaches to traffic optimization, among which genetic algorithms, various model-based optimizations, and fuzzy logic are widely used.


4 Concept

This chapter introduces the solution to the problem statement. The author experiments with two main approaches, both of which use reinforcement learning to solve the problem. The chapter describes the structure of the Deep Q-network and of the rule-based policy. Further, it explains the system architecture and components of both solutions and the processes/activities required to achieve the final results.

4.1 Reinforcement learning and proposed solution

Before moving to a specific approach, the following table compares how the reinforcement learning concepts map to the proposed problem. Please note that this is a general view of the problem; more details are given in the following sections.

RL concept | Proposed problem
Environment | A traffic simulation
Agent | A vehicle in the traffic simulation
State | A single simulation step of the simulation
Policy | Deep Q-network or rule-based policy
Actions | Vary depending on the approach, e.g., lane change, speed change
Observations | Depend on the approach, e.g., speed, position, etc.

Table 5 : RL concepts and the proposed solution

4.2 Deep Q network-based approach

4.2.1 System architecture

The following diagram shows an abstract view of the proposed prototype.

Figure 30: Deep Q network approach overview

The next diagram provides a clearer view of the connector component, which is responsible for connecting the simulator and the DQN. Mainly, it extracts real-time data from the simulation and feeds it to the DQN for training. It also receives the best-suited action for each state from the DQN and feeds it back to the simulator. The connector is equipped with further components to calculate the reward and optimize the weights of the DQN.

Figure 31 : Deep Q network approach components

4.2.2 Building a Reinforcement learning model

Several major steps need to be followed in order to create an RL policy:

1. Creation of agents: the first challenge is to define the agent of the environment, how many agents exist, and the interaction between agents.
2. Selection of observations: defining a state of the scenario and selecting what kind of information to extract from the environment.
3. Selection of actions: defining the possible actions an agent can take, considering the constraints.
4. Designing the policy: the author experiments with two algorithms; first a DQN approach is discussed and later a rule-based policy.
5. Defining rewards: creating reward metrics/variables to provide feedback for the actions. Sometimes a reward needs to be normalized.
6. Running until optimized results are obtained: this phase takes a considerable amount of time and resources. The author also has to try various mechanisms in order to achieve better and faster results.
7. Rerunning with various hyper-parameters: several parameters can be changed in order to observe the behavior of the algorithm.


4.2.3 Deep Q network structure

Before designing the network, the author should decide the inputs (observations) and

outputs (actions) for the network.

4.2.3.1 Observations selection

Observation/input | Range | Description
Speed | 0 to 14 m/s | The current speed of a specific vehicle in the current simulation step
Lane | left lane 1, right lane 0 | The current lane in which the specific vehicle is positioned
Distance | 0 to 1000 m | Distance from the current position to the traffic light
Number of leading vehicles | 0 or more, depending on the flow | The number of leading vehicles up to the intersection, counted for each vehicle on both lanes
Number of following vehicles | 0 or more, depending on the flow | The number of following vehicles, counted for every vehicle
Remaining green light time | 0 to less than 90 s (the full traffic light cycle is 90 seconds) | If the current phase is GREEN, the remaining green light time until the phase changes
Next green light | 0, or up to 59 seconds (excluding the green phase) if the vehicle passes the intersection in the next cycle | If the vehicle misses the current green phase, the time until the next green phase after the RED and YELLOW phases

Table 6 : Observations - DQN approach


4.2.3.2 Actions selection

Action/output | Range | Description
Speed change | 0 to 14 m/s | The target speed, indicating an acceleration or a deceleration
Lane change | no lane change 0, left to right 1, right to left -1 | Instructs the vehicle to change lane (from left to right, from right to left, or no lane change at all)

Table 7 : Actions - DQN approach

4.2.3.3 Hidden Neurons and layers

The author tested the proposed solution with several DQN structures to train the agents. In particular, the above observations and actions were tested with various hidden-layer structures.

Example: 32*32 means 2 hidden layers with 32 neurons in each layer; 64*64 means 2 hidden layers with 64 neurons in each layer.

The following diagram shows an abstract structure of the proposed DQN. The inputs and outputs were already defined in the previous section.

Figure 32 : DQN - multi-agent

The network structure also varies between the multi-agent and the single-agent approach, which the author discusses in the next section. The above structure is only for the multi-agent approach; the next diagram shows how it is modified for the single-agent approach.


Figure 33 : DQN - single-agent

Neurons in one layer are connected to all neurons in the next layer, as in a usual feed-forward neural network.

4.2.4 Single-agent and multi-agent approaches

Single-agent approach: the data obtained from all vehicles (agents) is sent to one Deep Q-network, and this single network decides the action for every vehicle. Due to that, the training phase only contains one neural network. The disadvantage is that when the simulation changes the number of vehicles, the DQN structure changes and an entirely new network needs to be trained. Also, when the number of vehicles increases, the numbers of observations and actions increase, and training a network with many inputs needs more time and computation power. Here, the "single-agent approach" means training all vehicles and the whole scenario as one instance.

Multi-agent approach: there is also only one Deep Q-network architecture to train all vehicles, but several instances with the same architecture exist, one for each vehicle. Each vehicle sends its observations to its Deep Q-network instance and receives the action from it, and all Deep Q-networks run in parallel. The "multi-agent approach" means training each vehicle as a universal vehicle controller that is identical across all instances/agents. This approach also needs considerable computation power, but expectedly less time than the single-agent approach.


4.2.5 Reward definition

The author considers the accumulated reward, which is based on the totals over all vehicles for the whole episode. An episode is a whole scenario, including all states from start to end.

Accumulated travel time: the travel time of all vehicles from start to end.

Accumulated emission: the emission of all vehicles for the whole scenario.

After completing a simulation, the program compares the accumulated travel time and emission with the previous simulation. Mainly, it checks whether any improvement (a reduction of travel time and emission) occurred. If so, the reward increases proportionally to the improvement; if not, the reward decreases.
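A minimal sketch of this comparison-based reward is shown below. It illustrates the idea only; the weighting of travel time against emission and the function name are assumptions, not the author's exact implementation.

    def episode_reward(travel_time, emission, prev_travel_time, prev_emission,
                       alpha=0.5):
        """Reward proportional to the improvement over the previous episode.

        alpha weights travel time against emission (assumption for illustration).
        Improvements yield a positive reward, deteriorations a negative one.
        """
        travel_improvement = prev_travel_time - travel_time
        emission_improvement = prev_emission - emission
        return alpha * travel_improvement + (1.0 - alpha) * emission_improvement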

4.3 Rule-based policy approach

4.3.1 System architecture

The following diagram shows the main components and the interaction between them.

Figure 34 : Rule-based policy method overview

Component | Functionality
Simulator | Used to extract information for training the policies and to execute the actions in every simulation step
Connector | Extracts data from the simulation and passes the selected action in each step
Rule-based policy | The control logic of an agent. The component consists of a set of rules which decide the appropriate action depending on the state of the observation space
Optimizer | Responsible for optimizing the rule-based policies. Several parameters of the policy are exposed to the optimizer; more details about the optimized variables are discussed later

Table 8 : Components of the rule-based policy approach

The optimizer runs several parallel simulations to optimize the parameters. It continues until it achieves the best-suited values for the optimized variables or reaches a maximum number of iterations. The final output is a well-trained rule-based policy with optimized values. Unlike the DQN approach, here the author only considers a multi-agent reinforcement learning approach; that means all simulated vehicles are considered agents controlled by rule-based policies. The following figure shows the main components of the suggested system.

Figure 35: Rule-based policy approach components


4.3.2 Observation space

The following observations are extracted from the simulation in each step. Please note that the table only shows the observation space of a single vehicle; as the author follows a multi-agent approach, each vehicle has its own observation space. The observation space is similar to the DQN-based approach above, but it contains a few additional variables.

Observation (single vehicle) | Range | Description
Speed | 0 to the maximum speed set by the simulation framework | The speed in the current simulation step
Acceleration | lowest and highest values set by the simulation framework | The acceleration in the current simulation step
Lane | left lane 1, right lane 0 | The current lane in which the vehicle is positioned
Distance | 0 to 1000 m | Distance from the current position to the traffic light
Leading vehicle IDs | 0 or many vehicles, depending on the flow | The IDs of the leading vehicles of the current vehicle
Following vehicle IDs | 0 or many vehicles, depending on the flow | The IDs of the following vehicles of the current vehicle
Remaining green light time | 0 to less than 90 s (the full traffic light cycle is 90 seconds) | If the current phase is GREEN, the remaining green light time until the phase changes
Next green light | 0, or up to 59 seconds (excluding the green phase) if the vehicle passes the intersection in the next cycle | If the vehicle misses the current green phase, the time until the next green phase after the RED and YELLOW phases

Table 9 : Observations - rule-based policy approach

4.3.3 Action Space

Action | Range | Description
Speed change | 0 to 14 m/s | Indicates an acceleration or a deceleration. The speed change is decided by the extended GLOSA function. Possible values: 0 (full stop) to the maximum speed
Lane change | 0 or 1 | Two Boolean values representing the left and the right lane

Table 10 : Actions - rule-based policy approach

4.3.4 Rules

The following set of rules defines the previously declared actions. Please note that each vehicle executes these rules in every simulation step to find the best possible action.

4.3.4.1 Rule 1

If a left-turning vehicle is traveling in the right lane, it should change to the left lane.

Figure 36: Rule 1 expected result

Figure 37 : Rule 1 algorithm


4.3.4.2 Rule 2

If a straight-going vehicle is following a left-turning vehicle in the left lane, it should change to the right lane.

Figure 38 : Rule 2 expected results

Figure 39 : Rule 2 algorithm


4.3.4.3 Rule 3

This rule is only valid for straight-going vehicles which are traveling in the right lane. If there are no leading left-turners in the left lane and there are free SLOTS available in the left lane, the vehicle can change from the right lane to the left lane. The number of SLOTS in the left lane is a parameter which is defined according to the scenario; the number of vehicles which can pass during a green light phase and during the extended green light phase for left-turners is considered when defining the slots.

SLOTS = the maximum desired number of leading straight-going vehicles in front of the first left-turner traveling in the left lane.

Figure 40 : Rule 3 expected results

Several flags are set after changing lanes to indicate that the ego vehicle has already executed a particular rule, and thereafter the program keeps the vehicle in the changed lane.


Figure 41 : Rule 3 algorithm
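A minimal sketch of the SLOTS check described above is given below. The counts passed to the function and the SLOTS value are assumptions for illustration; the actual rule additionally sets flags and interacts with the other rules as shown in the algorithm figure.

    SLOTS = 3  # example value; in the thesis this is a scenario-dependent parameter

    def rule3_should_change_to_left(leading_in_left_lane, leading_left_turners_in_left_lane):
        """Decide whether a straight-going vehicle in the right lane may move left.

        leading_in_left_lane: number of leading vehicles currently in the left lane.
        leading_left_turners_in_left_lane: number of leading left-turners in the left lane.
        """
        no_left_turners_ahead = (leading_left_turners_in_left_lane == 0)
        free_slot_available = (leading_in_left_lane < SLOTS)
        return no_left_turners_ahead and free_slot_available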


4.3.4.4 Extended GLOSA

This is an extension of the traditional GLOSA algorithm, which decides the approach speed of the vehicle. The purpose of the extension is to establish the connection between traditional GLOSA and the rules. Several new modifications were added, mainly to control the left-turners and to determine when the vehicle reaches the intersection. The following diagram shows the activity flow of the proposed extended GLOSA function.

The following equations are introduced by the extended GLOSA algorithm.

If the condition dMax > distance to junction is true, the following equation is valid. In other words, when the vehicle tries to reach the maximum speed and the distance it needs to reach MAX_SPEED is larger than the current distance to the junction, the following equation (EQ1) is used:

t = (-u + sqrt(u^2 + 2*a*d)) / a

Eq 2 : Arrival time calculation equation 1

If the above condition is false, i.e., the vehicle reaches MAX_SPEED before the junction, the next equation (EQ2) is valid, with dMax being the distance needed to reach MAX_SPEED:

t = (MAX_SPEED - u) / a + (d - dMax) / MAX_SPEED

Eq 3 : Arrival time calculation equation 2

The following equation (EQ3) is used when the acceleration is 0 and the speed is not 0:

t = d / u

Eq 4 : Arrival time calculation equation 3

All three equations above determine the arrival time at the junction from the current vehicle position under different circumstances.


Figure 42 : Extended-GLOSA algorithm


The final equation (EQ4) finds the suitable speed for a vehicle in order to fulfill the GLOSA constraints:

U = (2*d / t) - u

Eq 5 : Advised speed calculation

The GLOSA algorithm is executed for all RL agents in every simulation step.
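The following minimal Python sketch combines the arrival-time cases (EQ1 to EQ3) and the advised-speed calculation (EQ4) described above, assuming the reconstructed kinematic forms of the equations. MAX_SPEED is an illustrative value; the actual extended GLOSA additionally handles the margin and the delay for the first left-turner.

    import math

    MAX_SPEED = 14.0  # m/s, illustrative value

    def arrival_time(d, u, a):
        """Arrival time at the junction for distance d, speed u, acceleration a."""
        if abs(a) < 1e-6:
            return d / u if u > 0 else float("inf")     # EQ3: constant speed
        d_max = (MAX_SPEED ** 2 - u ** 2) / (2.0 * a)   # distance needed to reach MAX_SPEED
        if d_max > d:
            # EQ1: the vehicle is still accelerating when it reaches the junction
            return (-u + math.sqrt(u * u + 2.0 * a * d)) / a
        # EQ2: accelerate to MAX_SPEED, then continue at constant speed
        return (MAX_SPEED - u) / a + (d - d_max) / MAX_SPEED

    def advised_speed(d, t, u):
        """EQ4: advised speed so that distance d is covered in the target time t."""
        return max(0.0, min((2.0 * d / t) - u, MAX_SPEED))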

4.3.5 Rewards

The following equations are used to compute the reward value for training the rule-based policies; the author explains each equation step by step. The final reward is constructed from the accumulated total travel time and the accumulated emission. Several parallel runs are executed with the parameter values chosen by the optimizer.

First, the author explains how the accumulated total travel time is measured:

T_total = Σ_runs Σ_vehicles ( duration + depart delay )

Eq 6 : Travel time calculation

Eq 6 gives the travel time of all vehicles over the whole simulation in each parallel run. Duration represents how long a particular vehicle runs in the simulation. Depart delay indicates the delay caused when a vehicle enters the simulation at the beginning because there is no space in the lane.

T_acc = T_total / ( Σ_runs Σ_vehicles route length [km] )

Eq 7 : Accumulated travel time calculation (per km)

Eq 7 is derived from Eq 6. Here, route length means the total distance a vehicle travels from start to finish. This is used to obtain the accumulated travel time figure per km.

Next is the accumulated emission, which is the other factor relevant for constructing the reward:

E_total = Σ_runs Σ_vehicles ( CO2 emission )

Eq 8 : Accumulated emission calculation (in g)

Eq 8 calculates the accumulated CO2 emission of all vehicles over the whole simulation in each parallel run.

E_acc = E_total / ( Σ_runs Σ_vehicles route length [km] )

Eq 9 : Accumulated emission calculation (per km)

Eq 9 is used to obtain the accumulated emission figure per km. Finally, the author introduces the reward function, which uses the previously calculated accumulated travel time (per km) and accumulated emission (per km). Two normalization coefficients are used, separately for travel time and emission.

The travel time normalization coefficient is 1/120, which is used to normalize the accumulated travel time figure; the emission normalization coefficient is 1/150, which is used to normalize the accumulated emission figure. The values of these constants were found experimentally. Alpha controls the influence of travel time and emission on the final reward; in other words, alpha is a weighting factor, and alpha = 1 means that only the travel time matters for the final reward.

R = alpha * (1/120) * T_acc + (1 - alpha) * (1/150) * E_acc

Eq 10 : Reward calculation
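A minimal sketch of this reward computation, using the normalization coefficients and the weighting factor described above, is shown below. It assumes that the accumulated travel time per km and the accumulated emission per km have already been computed; how the optimizer then uses this quantity is handled outside the function.

    TRAVEL_TIME_NORM = 1.0 / 120.0   # normalization coefficient for travel time
    EMISSION_NORM = 1.0 / 150.0      # normalization coefficient for CO2 emission

    def reward(acc_travel_time_per_km, acc_emission_per_km, alpha):
        """Weighted combination of normalized travel time and emission (Eq 10).

        alpha = 1 means only the travel time influences the result.
        """
        return (alpha * TRAVEL_TIME_NORM * acc_travel_time_per_km
                + (1.0 - alpha) * EMISSION_NORM * acc_emission_per_km)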

4.3.6 Policy parameters

Optimizing variable | Description
Margin | Extra time added to the traffic light cycle which is used to control the vehicles (for all vehicles). Extended GLOSA uses the margin as a parameter when calculating the arrival time at the intersection
Extra delay for the first left-turner | This duration is added to the first left-turner's targeted arrival time in the GLOSA algorithm in order to slow it down. This allows straight-going vehicles to overtake the left-turner easily

Table 11 : Optimizing variables


A suitable optimization algorithm is selected depending on the nature of the problem; here the objective is a stochastic, discontinuous function. The author uses the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm for the optimization [45]. The optimizer runs until it achieves the best-suited margin and left-turner delay by obtaining the best reward value; if it does not find such values, the optimizer runs for a specific number of turns and takes the best-suited values out of all turns. Further, it is generally impossible to minimize both objective quantities simultaneously; for that reason, the author uses alpha as a weighting factor to combine both accumulated figures into one single figure. The author needs to provide starting values and bounds for the margin and the delay.

Bound values: margin from 0.1 to 5 and delay from 0 to 40

Starting values: margin = 2 and delay = 31

4.4 Summary

The author introduced the DQN and the rule-based policy approach. The chapter mainly discussed the major steps of creating RL algorithms, as well as the design, the components, and the major processes/functions of both solutions. Next, the report defined the architecture of the solution, the observation and action spaces, and the reward functions. For the DQN design, two architectures were used; mainly, the author divided them into single-agent and multi-agent based designs. In the multi-agent based DQN, seven neurons were placed in the input layer and two in the output layer. The single-agent architecture was more complicated than the multi-agent architecture due to the many parameters. For the rule-based approach, the author introduced the rules and their expected results; the rule-based policy mainly consists of three rules to decide the best-suited lane for each vehicle during all simulation steps. The extended GLOSA is used to calculate the speed of each vehicle and has several modifications in order to support the rules. Finally, the chapter discussed the optimization process that is used to improve the rules and the extended GLOSA.


5 Implementation

This chapter introduces the implementation aspects of the proposed solutions. First, it presents the software stack the author used to implement the solutions. Next, it discusses the steps required for both approaches: FLOW involves several implementation steps to create an RL prototype, and the rule-based policies also consist of several components. Code snippets are provided where they help to explain the implementation.

5.1 Technical details

5.1.1 Deep Q network approach

Software | Description
FLOW [11] | FLOW is used as the connector linking the reinforcement learning library and the SUMO microscopic traffic simulator. FLOW uses several other third-party libraries: Theano [46], OpenAI Gym [37], SUMO TraCI
SUMO [10] | Microscopic traffic simulator which provides the environment for the RL agents
Rllab [36] | Reinforcement learning library used to train the Deep Q-network
Python | All the above frameworks are written in Python

Table 12 : Software - DQN approach

5.1.2 Rule-based policy approach

Software | Description
Python | All components are written in Python
SUMO | As in the above approach, SUMO is the simulation environment
SUMO-TraCI [47] | TraCI is able to change the state of the simulation at runtime. Using TraCI, the user can pass commands to the simulation and retrieve data from it
SciPy [48] | SciPy provides Python-based optimization with various optimization algorithms

Table 13 : Software - rule-based policy approach


5.2 Implementation of Deep Q network approach

5.2.1 Flow configuration

The following diagram shows the steps the user needs to follow in order to create a FLOW-based SUMO simulation.

5.2.2 Dynamic SUMO network configuration

Figure 43 : FLOW steps

Please note that here the author only explains how to create a FLOW simulation; more information regarding FLOW's architecture and functionality was given in the Literature Review chapter.

5.2.2.1 FLOW Generator creation

The user needs to create a custom Generator class which extends the base Generator and overrides several inherited methods. In the SUMO context, the user usually has to provide node, edge and route information in order to generate the network configuration file. Similarly, the FLOW creators provide several methods for supplying node, edge and route information to create the net file.

specify_nodes() states the locations of the nodes relative to one specific node. Here the author only considers an intersection scenario; therefore, the center node which creates the intersection is considered the main node (junction node), and all other nodes are positioned relative to it.

The following code snippet shows the node array. The proposed intersection scenario consists of 9 nodes, and the coordinates of the "center" node are (0, 0).


Figure 44 : Specify nodes code snippet

specify_edges() gives the edge details, mainly the type of the edge, its length, and which nodes contribute to creating the edge. The following code shows the edge array, which consists of 9 edges.

Figure 45 : Specify edges code snippet

specify_routes() states the route names and the edges that contribute to creating each route.

Figure 46 : Specify route code snippet

The above code shows the routing array. As an example, the route "left" consists of 3 edges ("left", "altleft1" and "right"), which creates the road from left to right through the intersection.


5.2.2.2 FLOW Scenario creation

Similar to the Generator creation, the user needs to create a custom Scenario class which overrides several methods of the base Scenario class.

specify_edge_starts(): here the user needs to provide the starting coordinate of each edge relative to one specific edge. The following code snippet shows part of the edge array. The "bottom" edge is taken as the main edge, and other edges like "top" are given their starting positions relative to the "bottom" edge.

Figure 47 : Specify edge starts code snippet

specify_intersection_edge_starts() and specify_internal_edge_starts() are two further functions similar to specify_edge_starts(). The only difference is that the internal-edge function targets the internal edges of an intersection, which allow vehicles to pass through the intersection in various directions.

5.2.2.3 FLOW environment creation

The user needs to create a custom environment class that extends the base Environment. As usual, there are several inherited methods which need to be overridden.

action_space(): the following function declares the possible actions the agent can execute at a given time. The agent can only change the lane and the speed of the vehicles, and the author has set upper and lower bounds for acceleration and deceleration. As FLOW currently only supports single-agent scenarios, the actions for all vehicles living in the simulation have to be provided by a single DQN. As an example, if a simulation has 30 RL agents, the DQN has to provide 60 (2 * 30) actions.


Figure 48 : Action space code snippet

observation_space(): here the author provides the observations/inputs to the DQN. The following code snippet shows all inputs to the DQN; for 30 RL agents, the DQN has to support 30 * 8 inputs/observations. The user needs to provide lower and upper bounds for all inputs. More information regarding the inputs was already given in the Design chapter.

Figure 49 : Observation space code snippet
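As an illustration of how such bounded action and observation spaces are typically declared, the following sketch uses gym.spaces.Box. It is a generic example for 30 agents with 2 actions and 8 observations each; the exact property signatures and bounds used in the thesis code may differ.

    import numpy as np
    from gym.spaces import Box

    N_AGENTS = 30                # example number of RL vehicles
    N_ACTIONS_PER_AGENT = 2      # speed change and lane change
    N_OBS_PER_AGENT = 8          # observations listed in the Design chapter

    # one flat Box covering all agents, as required by a single-network setup
    action_space = Box(low=-1.0, high=1.0,
                       shape=(N_AGENTS * N_ACTIONS_PER_AGENT,), dtype=np.float32)

    observation_space = Box(low=0.0, high=1.0,
                            shape=(N_AGENTS * N_OBS_PER_AGENT,), dtype=np.float32)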

get_state() retrieves the observations of the following state after the execution of the actions; here the user can call getters that are already defined by FLOW.

Figure 50 : Get state code snippet

There are several other functions, such as apply_rl_actions(), which executes the previously defined actions, and compute_reward(), which calculates the reward as specified in the Design chapter.


5.2.2.4 Master configuration creation

The main configuration connects the previously defined generator, scenario and experiment classes. The user further needs to enter additional information such as the lengths of the vertical and horizontal lanes, the number of agents to create, the velocity bounds for the vehicles, and other technical information.

Figure 51 : Master configuration code snippet

The following code snippet shows how the author used the predefined Gaussian MLP policy, which consists of a DQN with 2 hidden layers (64*64). Further, FLOW uses the TRPO algorithm to adjust the DQN weights.

Figure 52 : DQN policy


5.3 Implementation of Rule-based policy approach

5.3.1 Data extraction

The first step is data extraction using SUMO-TraCI, which allows extracting data at runtime in every simulation step. TraCI provides a wide range of functions, as seen in the following code snippet. Here the author collects information about a specific vehicle and its neighboring vehicles. The author uses a Python dictionary to store the information of all agents and several lists (e.g., leaders_rightlane_straight_list) to store information about neighboring vehicles. Further, there are several flags which indicate whether a rule has already been applied to a specific vehicle.

Figure 53 : Data extraction code snippet
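The following minimal sketch shows the kind of per-vehicle data collection this step performs with the TraCI Python API. Only a few standard getters are used; the dictionary layout is an assumption and is simplified compared with the thesis code.

    import traci

    def collect_observations():
        """Collect a small observation dictionary for every vehicle in the simulation."""
        observations = {}
        for veh_id in traci.vehicle.getIDList():
            leader = traci.vehicle.getLeader(veh_id)   # (leader_id, gap) or None
            observations[veh_id] = {
                "speed": traci.vehicle.getSpeed(veh_id),
                "lane": traci.vehicle.getLaneIndex(veh_id),
                "lane_position": traci.vehicle.getLanePosition(veh_id),
                "leader": leader[0] if leader else None,
                "leader_gap": leader[1] if leader else None,
            }
        return observations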

5.3.2 Rules implementation

The rules were already introduced in the last chapter. The following code snippet shows rules 1 and 2. The author uses TraCI commands to control the agents at runtime, mainly changeLane(), which moves a vehicle from one lane to another. Rule 1 changes left-turners from the right lane to the left lane (lane 0 to 1 in the SUMO environment), and rule 2 changes a straight-going vehicle from the left lane to the right lane (1 to 0) if it is following a left-turner.


Figure 54 : Rule 1 and 2 code snippet
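For illustration, a minimal sketch of how rules 1 and 2 can be expressed with TraCI is given below. The way left-turners are identified (here via the route ID) and the omission of the flag handling are simplifying assumptions; this does not reproduce the exact thesis implementation shown in the figure above.

    import traci

    LEFT_LANE, RIGHT_LANE = 1, 0           # SUMO lane 0 is the rightmost lane
    CHANGE_DURATION = 5.0                  # seconds the lane change request stays active

    def is_left_turner(veh_id):
        # assumption: left-turning vehicles use a route whose ID contains "left"
        return "left" in traci.vehicle.getRouteID(veh_id)

    def apply_rules_1_and_2(veh_id):
        lane = traci.vehicle.getLaneIndex(veh_id)
        leader = traci.vehicle.getLeader(veh_id)

        # Rule 1: a left-turner in the right lane moves to the left lane
        if is_left_turner(veh_id) and lane == RIGHT_LANE:
            traci.vehicle.changeLane(veh_id, LEFT_LANE, CHANGE_DURATION)

        # Rule 2: a straight-going vehicle following a left-turner in the left lane moves right
        elif not is_left_turner(veh_id) and lane == LEFT_LANE:
            if leader is not None and is_left_turner(leader[0]):
                traci.vehicle.changeLane(veh_id, RIGHT_LANE, CHANGE_DURATION)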

The code snippet relevant to rule 3 can be found in the Appendix.

5.3.3 Extended-GLOSA implementation

All equations and conditions needed to create the extended GLOSA were already introduced in the Concept chapter, and the full algorithm is added in the Appendix. Here the author discusses the implementation aspects.

The proposed SUMO traffic light has 8 phases in one cycle. The following code snippet shows the available phases: phase 1 is a green phase for all vehicles traveling from left to right and vice versa, and phase 3 is the extended green phase for left-turners, which is not available to straight-going vehicles. The code snippet also determines the next green light phase for a given current phase.

Figure 55 : Traffic cycle
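A sketch of the phase handling, based on the 8-phase cycle described above (phase 1: green for both directions, phase 3: extended green for left turners only). The traffic-light ID, the phase durations and the module names/time units, which differ between SUMO versions, are assumptions.

import traci

TLS_ID = "center"                               # assumed traffic-light ID
PHASE_DURATIONS = [6, 31, 6, 6, 6, 31, 6, 6]    # s per phase, illustrative values
GREEN_PHASE = 1                                 # green for straight traffic in both directions
LEFT_TURN_GREEN_PHASE = 3                       # extended green for left turners only

def time_until_green_start():
    """Seconds until the next start of the common green phase (phase 1).

    Left turners additionally keep PHASE_DURATIONS[LEFT_TURN_GREEN_PHASE]
    seconds of usable green after the common green phase ends.
    """
    current = traci.trafficlight.getPhase(TLS_ID)
    if current == GREEN_PHASE:
        return 0.0
    # time until the current phase ends (time units depend on the SUMO version)
    wait = traci.trafficlight.getNextSwitch(TLS_ID) - traci.simulation.getTime()
    phase = (current + 1) % len(PHASE_DURATIONS)
    while phase != GREEN_PHASE:                  # walk the fixed cycle to the green phase
        wait += PHASE_DURATIONS[phase]
        phase = (phase + 1) % len(PHASE_DURATIONS)
    return wait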

The next code snippet shows how the author slows vehicles down during red phases. The program has two main ways to handle the slowdown of a vehicle. First, it checks whether the vehicle can arrive during the next green phase or the one after it,


using the previously calculated arrival time. Further, it considers the extended green phase for left turners. The program also has an extreme slowdown technique for the first left turner, which is controlled by the “slow_down_flag”. For slowing down a vehicle, the author uses SUMO's slowDown(), which slows a specific vehicle down to a given speed over a given time duration.

Figure 56 : Extended GLOSA code snippet
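A sketch of this red-phase speed advisory (not the author's exact code). The advised-speed formula and the parameter names ARRIVAL_MARGIN, CYCLE_LENGTH and GREEN_PHASE_DURATION follow the pseudocode in the appendix; the numeric values and the slowDown duration are assumptions.

import traci

ARRIVAL_MARGIN = 1.0          # s, tuned later by the optimizer (assumed start value)
CYCLE_LENGTH = 98.0           # s, illustrative full cycle length
GREEN_PHASE_DURATION = 31.0   # s, illustrative green phase length

def advise_red_phase(veh_id, distance, speed, arrival_time,
                     next_green_start, first_left_turner=False):
    """Slow the vehicle down so that it arrives when its green window opens."""
    target = next_green_start + ARRIVAL_MARGIN
    if arrival_time > target:
        # the vehicle cannot make the next green phase: aim for the one after it
        target += CYCLE_LENGTH
    if first_left_turner:
        # extreme slowdown for the first left turner (the slow_down_flag case):
        # delay it by one green phase so the straight-going vehicles clear first
        target += GREEN_PHASE_DURATION
    # advised speed follows the formula used in the appendix pseudocode
    advised_speed = max(0.0, (2.0 * distance) / target - speed)
    traci.vehicle.slowDown(veh_id, advised_speed, 5)   # duration unit depends on the SUMO version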

5.3.4 Optimizer implementation

The following implementation shows how the author has set the starting values and bounds for the arrival margin and the delay of the first left turner, as mentioned in the Concept chapter. The author uses the SciPy framework and minimize() to call the built-in optimization process, here with the L-BFGS-B optimization algorithm. During the optimization process, the author also tried various other optimization algorithms to find the best fit. Once started, the process runs until the optimizer finds the lowest accumulated travel time and emission figure.

Figure 57 : Optimizer code snippet
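A sketch of such an optimizer call with SciPy. The simulate_cost() helper is a dummy stand-in (the real objective would launch the parallel SUMO runs and evaluate their output), the starting values are illustrative, and the bounds mirror the ranges later used in the grid scan.

import numpy as np
from scipy.optimize import minimize

def simulate_cost(arrival_margin, left_turner_delay):
    """Dummy stand-in so the sketch runs; the real objective launches the
    parallel SUMO runs with these parameters and returns the normalized
    accumulated travel time plus emission."""
    return (arrival_margin - 2.0) ** 2 + (left_turner_delay - 20.0) ** 2

def objective(params):
    return simulate_cost(params[0], params[1])

x0 = np.array([1.0, 10.0])            # starting values (illustrative)
bounds = [(0.1, 5.0), (0.0, 40.0)]    # arrival margin, first-left-turner delay
result = minimize(objective, x0, method="L-BFGS-B", bounds=bounds)
print(result.x, result.fun)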


5.3.5 Reward calculation

This is part of the reward function which is called by the minimize() function above. After every simulation, SUMO generates two separate output files containing the emission figures and travel times of all vehicles. The program runs several simulations in parallel, and each simulation creates these files separately. The program reads all files and calculates the accumulated travel time and emission. Then, considering the normalization factors shown in the following code, the final reward is calculated.

Figure 58 : Reward calculation code snippet
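A sketch of such a reward computation from SUMO's output files (not the author's exact code). It assumes one tripinfo file and one emission file per parallel run; the file-name patterns, the attribute names read here and the normalization factors are assumptions that depend on the SUMO output options used.

import glob
import xml.etree.ElementTree as ET

TT_NORM = 1000.0   # normalization factor for travel time (assumed)
EM_NORM = 1.0e6    # normalization factor for CO2 emission (assumed)

def accumulated_reward(run_dir="."):
    """Sum travel time and CO2 over the output files of all parallel runs."""
    total_tt, total_em = 0.0, 0.0
    for path in glob.glob(run_dir + "/tripinfo_*.xml"):     # one file per parallel run (assumed naming)
        for trip in ET.parse(path).getroot().iter("tripinfo"):
            total_tt += float(trip.get("duration"))
    for path in glob.glob(run_dir + "/emission_*.xml"):
        for veh in ET.parse(path).getroot().iter("vehicle"):
            total_em += float(veh.get("CO2", 0.0))
    # weighted, normalized combination of both objectives
    return total_tt / TT_NORM + total_em / EM_NORM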

5.4 Summary

This chapter explained how the author implemented both solutions. FLOW was used to implement the DQN. Following the guidelines provided by the FLOW development team [11], the author implemented the required classes (Generator, Scenario, and Experiment). Further, the reward was introduced; it is based on the accumulated travel time and emission of each vehicle. Important code snippets were provided where necessary. Next, the rule-based approach was described. The author introduced the necessary inputs to the rules. The input vector was similar to that of the earlier approach, but it contains several additional inputs, such as data related to leading vehicles. The chapter further explained the implementation aspects of the rules and how SUMO-Traci was used, including its important functions. Finally, the optimizer was introduced; it was built using SciPy.


6 Evaluation

This chapter explains the results and main observations the author obtained after executing the previously implemented prototypes. Issues which occurred during the testing phase are also discussed, and solutions to the discovered problems are explained briefly.

6.1 DQN based approach

6.1.1 Tests

As mentioned in the above chapters, the author uses FLOW to create the DQN. During the training phase of the DQN, the author completed the following subtasks.

Preliminary tests: Initially the simulation was run for a very short time, with few simulation steps and a low number of iterations. The idea was to see how FLOW performs under test conditions. These runs started at 10 minutes of execution time and were later extended to 1 hour; the duration of a single simulation is 5 minutes.

The next step was to increase the simulation time (up to 15 minutes) and the number of iterations (up to 1000). The total execution time was approximately 20 hours for this set of experiments.

Changing the number of hidden layers and neurons per hidden layer: The experiments started with 32 neurons in a single hidden layer and then tried two other variations:

32*32: 2 hidden layers with 32 neurons in each.

64*64: Similar to the previous structure with 2 hidden layers, but using 64 neurons in each layer instead of 32.

Parallel runs: Up to 4 parallel runs were executed for 20 hours.

6.1.2 Results and discussion of DQN approach

Preliminary tests: These checked whether FLOW generates the intersection scenario as expected. No errors related to FLOW's capabilities were found.


Single-agent architecture: The author was able to observe how the agents react to the RL training. The starting positions of all vehicles in the first simulation step varied, as FLOW uses a custom algorithm introduced by the author to generate the starting positions of the vehicles on the lanes. That means every simulation starts with different starting positions. One advantage of this approach is that it creates a new scenario each time, instead of the vehicles starting at the same positions in every simulation.

In the first few iterations, the movement of the vehicles was extremely slow and there were few lane changes. As mentioned in the Literature review, RL takes time to learn the best action for each state, and the experience gained from the first set of iterations is not sufficient to find the best actions. Due to this slowness, the vehicles did not reach the intersection and were still arranged in the initial partition. The following figure illustrates the observed output.

Figure 59 : FLOW results 1

After 5 iterations, the author was able to see the vehicles forming groups (partitions), meaning they broke away from the original partition as the leading vehicles tried to reach the intersection at a higher speed than in the initial iterations. The leading vehicles of a partition drove noticeably faster, and one or two vehicles passed the intersection. Further, there were many overtakes in every step.


Figure 60 : FLOW results 2

Running 50 iterations took around 20 hours and still showed similar output. From the results, the author could mainly see that FLOW was trying various actions to find the best actions for all vehicles, but after the 5th iteration the changes between iterations were small. When more neurons were added to the hidden layers (e.g. 64*64), training became much slower. The above results are only valid for the single-agent approach.

6.1.3 Issues observed

Extreme slowness of the single-agent approach: As mentioned above, after the initial iterations the results between iterations were very similar; there was little variation even after running for 20 hours. The author was able to see lane changes and speed changes, but the improvement was very slow. In other words, the issue here is training time: training needs more time than expected. Due to the project timeline, the available resources and the preliminary results obtained from the rule-based policy approach, the author concentrated further experiments on the rule-based approach rather than the DQN.

FLOW-based multi-agent approach: Due to the extremely slow response in the above tests, the author next tried to extend FLOW to a multi-agent approach. However, due to the design of FLOW, it was not possible to convert it into a multi-agent system with minor modifications.


6.2 Rule-based policy approach

6.2.1 Tests

SUMO route files were used to generate various traffic flows. By changing attributes such as “period”, “number of vehicles per hour” and “probability”, various traffic scenarios were generated [49]. The author was able to change the number of vehicles for each direction/route. All these scenarios were tested with the following variations.

GLOSA-only scenario: Here the main goal is to see how the traditional GLOSA algorithm performs for a given scenario.

Extended-GLOSA with the rule-based system: Instead of traditional GLOSA, the author used the Extended-GLOSA algorithm together with the newly developed rules. The system is therefore equipped with both a speed and a lane advisory.

Estimation: This is a test step to verify whether the proposed approach functions as expected. Here the vehicles enter in an already arranged order. According to the proposed approach, straight-going vehicles need to arrive earlier than left turners, so this order was already established when the vehicles entered. No speed or lane-change advisory was used here.

Random SUMO runs: Here no speed or lane advisory is provided, and no order is imposed as in the estimation step above. All vehicles were controlled by the built-in models (Krauss model [49]) of SUMO.

Another important factor is that SUMO can create random behavior when loading vehicles, which means it does not run the same simulation twice. Further tests were carried out for the above variants with various simulation times between 10 and 60 minutes per run.

By calculating the accumulated travel time and emission, the author evaluates the final results for the above variants.

6.2.2 Results and discussion

First, the author explains how the accumulated travel time changes across the above variants.


Figure 61 : Travel time evaluation

Category 1: number of approaching straight-going vehicles in the left lane = 11 and left-turning vehicles = 5 for a single green phase.

Category 2: number of approaching straight-going vehicles in the left lane = 11 and left-turning vehicles = 4 for a single green phase.

Category 3: number of approaching straight-going vehicles in the left lane = 11 and left-turning vehicles = 3 for a single green phase.

Category 4: number of approaching straight-going vehicles in the left lane = 13 and left-turning vehicles = 3 for a single green phase.

Further, this figure was constructed after conducting the above SUMO experiments. According to the traffic light cycle used for the experiments, the maximum number of vehicles which can pass during the green and extended green phase (the latter just for left turners) is 15. For the first 3 categories, the number of straight-going vehicles was fixed at 11, as only 11 can pass the intersection during the whole green phase, while the number of left turners was changed to check how the rules react. Up to 5 left turners can pass the intersection during the extended green phase. The last category shows the delay when the number of vehicles exceeds 15; in other words, it definitely creates a traffic jam, and the author wanted to see how the rules react to this.

Next, the author explains how the accumulated emission changes for the same 4 categories and variants. All of the above conditions are valid for this experiment too.

[Figure 61 plots the accumulated travel time (%) per category (Category 1–4) for the Random, Estimation, GLOSA and GLOSA+rules variants.]


Figure 62 : Emission evaluation

According to the above results, this section can be concluded as follows:

There is up to a 10% improvement of the newly introduced Extended GLOSA+rules over traditional GLOSA when comparing accumulated travel time.

According to the estimation, this can be reduced even further.

When comparing accumulated travel time with random flows, the new prototype shows an improvement of 15%.

There is up to a 12% improvement in accumulated emission when comparing the new prototype with GLOSA.

When more straight-going vehicles are added, both the accumulated travel time and emission percentages decrease significantly.

6.2.3 Optimization results and discussion

The optimization process was carried out as described in the Concept chapter. Even though the optimizer ran for several hours with different optimization algorithms, it was unable to find optimal values for the arrival margin and the left-turner delay. Because of that, the author ran a parameter grid scan. The idea was to examine a range of values for the algorithm parameters and to monitor the resulting accumulated travel time and emission figures. Furthermore, several parallel simulations (15 runs) were executed with various simulation times.

[Figure 62 plots the accumulated emission (%) per category (Category 1–4) for the Random, Estimation, GLOSA and GLOSA+rules variants.]


Arrival margin: 0.1 to 5

Left turner delay: 0 to 40

After running several grid tests with various traffic flows, the following results were

achieved.

Figure 63 : Optimizer travel time results

Figure 64 : Optimizer emission results

As seen in the above diagrams, no clear patterns were found in the grid test either. Another main observation was that the optimization process was slower than expected. The author discusses the reasons for this in the next section.

6.2.4 Further discussion of Grid test

Here the author explains the reasons for the above results in more detail.


One major issue was created by the newly designed rules. This does not mean the rules were not working properly; as shown in the above results, they work as expected, and the rule-based policy approach was clearly better than existing GLOSA. However, when closely examining the random scenarios generated by SUMO, there were situations in which no gaps were available to perform the lane changes. As an example, when rule 3 was applied to a certain straight-going vehicle in the right lane but no space opened up in the left lane before it reached the intersection (the marked vehicle in the following figure), the proposed order could not be created. When running parallel runs with randomness, there were many such situations in which a vehicle was unable to perform the lane change as expected. This was the major reason for the unclear results during the grid test and the optimization.

Figure 65 : Gap creation issue

Slowness of the simulation execution: The author observed unexpected slowness when running the rule-based algorithm and the optimization. All rules and Extended-GLOSA were executed in every simulation step for all monitored vehicles, which was the main reason for the slowness.

6.2.5 Solutions and improvements


A novel gap-creation strategy needs to be introduced to create space when there is no available gap in the adjacent lane; this requires cooperative decision making among neighboring vehicles.

To reduce the slowness of the simulation, further experiments need to be carried out to find the optimal execution frequency of the rules and of GLOSA.

Even after introducing the proposed rule set, there were situations where a few left turners were unable to cross the intersection and therefore had to stop next to the intersection while the light was red. This happens when the number of left turners is higher than the rules expect. Special rules need to be introduced to address this situation.

6.3 Summary

This chapter explained the results obtained during testing. First, the author explained the results obtained from the FLOW-based solution. Preliminary tests were run to verify that it creates the designed SUMO road network. Later, several tests were run to test the DQN with a few variations. A custom algorithm was created to generate vehicles on each lane, and the author was able to create the proposed scenario as expected. However, FLOW only supports a single-agent mechanism so far; the development team is currently working on a multi-agent toolkit as well, but due to the timeline of this project the author was unable to use the multi-agent version of FLOW. The model was trained for up to 20 hours with several parallel simulations. It responded well during the first few hours and showed promising results: FLOW was able to order vehicles and adjust speeds. After the preliminary runs, however, the author did not see any further progress even though training continued for several hours (nearly a day). The main issue of this approach was discussed in detail in this chapter.

According to the currently available results, the rule-based policy was more successful than FLOW. The implemented rules worked as expected, and the proposed solution was compared with traditional GLOSA and performed better. However, the optimizer produced unexpected results. A parameter grid scan was carried out to investigate these results further. The results were examined thoroughly, and the issues which caused the unexpected outcomes were identified. Finally, suitable solutions were suggested in order to obtain better results in the future.


7 Conclusion

This chapter summarizes the work that has been done during the research. The

challenges that the author has faced and future work to improve the proposed

solution are outlined.

7.1 Challenges

The author faced several challenges during the research.

FLOW initial simulation setup: The FLOW framework is a toolkit newly introduced to the research community, and its development is still ongoing. Further, FLOW only supports simple scenarios at the moment. The author had to spend a considerable amount of time creating the proposed road network and traffic flow.

Finalization of GLOSA: The author found various approaches to creating GLOSA. This research used the GLOSA algorithm suggested by [5]. However, during the implementation phase it did not work as expected, and the author had to take extra steps to resolve the issues. Further, extensive tests were carried out to check whether GLOSA functions properly.

FLOW multi-agent approach setup: As mentioned in the evaluation chapter, FLOW only supports a single-agent architecture. Recently, the FLOW development team started to implement a multi-agent toolkit as well. However, the author was unable to use this new FLOW solution due to the research timeline and therefore investigated the possibility of converting the existing FLOW into a multi-agent architecture.

Initial development of rules: After the unexpected issues with FLOW, the research focused on developing a rule-based RL policy. Due to the nature of the problem, the initial design phase of the rules was complicated. The main problem was how to arrange the approaching vehicles and to determine what the best arrangement was.

Investigating optimization issues: This issue was already discussed in the evaluation chapter. The optimization produced unexpected results even after running it several times. After running a comprehensive parameter scan, the author was able to identify the causes of the unexpected results.

7.2 Future improvements


DQN approach with a multi-agent architecture: The FLOW-based DQN solution was very slow due to the single-agent architecture. It should run in a GPU environment, and the necessary hardware needs to be provided. One disadvantage of DQN is its long training time; a DQN needs even more training time when it tries to solve a complex problem because of its reward-driven, trial-and-error mechanism.

Extension of the rules: The author observed several issues during the optimization phase. In particular, creating a gap between vehicles when a vehicle changes lanes is necessary. The current version does not handle this, and the rules only execute a lane change when there is already available space.

Speeding up the simulations: The simulation was slow and needs to be sped up by reducing the calling frequency of the rules and GLOSA.

7.3 Concluding remarks

The research introduces a reinforcement learning based approach to optimize the traffic flow at an intersection, focusing on reducing the travel time and emission of vehicles. The approach is very successful when the scenario contains more straight-going vehicles than left turners. The author proposes a vehicle reordering mechanism that establishes a specific sequence of vehicles in the traffic flow before it reaches the intersection.

A separate chapter describes the basics of reinforcement learning. It explains the theoretical aspects: mainly how reinforcement learning works, the components of an RL scenario and existing algorithms.

Before introducing the new approach, a literature review was carried out. The literature review section pointed out the most important findings. First, it introduced the basics of traffic engineering. Before discussing the GLOSA variants, the author described the Car2X system: its structure and how it works in a real environment. GLOSA is the only Car2X application discussed, as it is considered the baseline for the project. Next, the literature review focused on various RL algorithms that have been used to optimize traffic flow and intersections. Simple DQNs, CNNs and more complex DQN variants are predominant in the existing research body.

The concept chapter introduced the DQN-based solution and the rule-based policy. The DQN structure, observation space, action space and reward functions were described in detail


for both solutions. The DQN network had 7 neurons in the input layer and 2 in the output layer.

The hidden layers had a 32*32 (2 hidden layers with 32 neurons each) and a 64*64 architecture. The next solution was the rule-based policy. It consists of several rules which control an approaching vehicle in each simulation step. The rules were designed to slow down left turners and to allow more straight-going vehicles to overtake when possible, so that as many vehicles as possible pass the intersection during the next green phase. An improved GLOSA was introduced as the speed advisory; it supports the newly introduced rules, which provide the lane changes. GLOSA lets vehicles reach the intersection exactly on time (when the green phase starts), so that a vehicle does not come to a full stop next to the intersection. An optimizer was responsible for tuning the rules further.

The implementation chapter explains the technical aspects of the solutions. All third-party frameworks were introduced there. The author used SUMO [10], SUMO-Traci [47] and FLOW [11] to implement the first solution, while the rules were fully implemented using Traci [47]. The chapter also contains important code snippets for creating the SUMO road network, the rules and extended-GLOSA.

The final step was to evaluate the solutions and point out the issues. The FLOW-based solution did not provide the expected results. After running the simulation for up to 20 hours, progress was slow, although the author was still able to see some development: the agents performed lane changes and speed changes. Due to the single-agent setup, the simulation was much slower than anticipated, and the FLOW multi-agent setup was not available during the development phase of this research.

In the next approach, the rules were evaluated together with GLOSA. The rules were 10 to 12% more efficient than the existing GLOSA, but the optimizer again produced unexpected results. The author troubleshot the issue and described it extensively in the Evaluation chapter, together with promising solutions. As mentioned in the Introduction chapter, the initial proposal was to reorder the approaching left turners and straight-going vehicles in order to improve the intersection throughput. This was successful, as the new rule-based policy dynamically changes the speed and the lane of each vehicle to achieve a better sequence. Further, the output was more efficient than the currently existing GLOSA.


Bibliography

[1] Jakob Erdmann, “Combining Adaptive Junction Control with Simultaneous Green-Light-Optimal-Speed-Advisory,” presented at the 2013 IEEE 5th International Symposium on Wireless Vehicular Communications (WiVeC), Dresden,Germany, 2013.

[2] Yang, Kaidi, Isabelle Tan, and Monica Menendez., “A reinforcement learning based traffic signal control algorithm in a connected vehicle environment,” presented at the 17th Swiss Transport Research Conference (STRC 2017)., Lausanne, 2017.

[3] B. Otkrist and Naik, Nikhil and Raskar, Ramesh Bowen, “Designing neural network architectures using reinforcement learning,” ArXiv Prepr., 2016.

[4] Harding, Y. Gregory, and J. Wang, “Vehicle-to-vehicle communications: Readiness of V2V technology for application,” NHTSA, Technical HS 812 014, 2014.

[5] K. K. Mehrdad Dianati, David Riecky Ralf Kernchen, “Performance study of a Green Light Optimized Speed Advisory (GLOSA) Application Using an Integrated Cooperative ITS Simulation Platform,” presented at the 2011 7th International Wireless Communications and Mobile Computing Conference, Turkey, 2011, p. 6.

[6] R. S. ; A. T. Reinhard German ; David Eckhoff, “Multi-hop for GLOSA Systems: Evaluation and Results From a Field Experiment,” presented at the 2017 IEEE Vehicular Networking Conference (VNC), Torino,Italy, 2017.

[7] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, Second. MIT Press, 2017.

[8] “DeepTraffic: Driving Fast through Dense Traffic with Deep Reinforcement Learning,” ArXiv Prepr. ArXiv, Jan. 2018.

[9] R. S. Sutton, “Introduction,” in The Challenge of Reinforcement Learning, Boston: Springer, 1992.

[10] DLR, “SUMO - Simulation of Urban Mobility,” Company web, 2018.

[11] C. Wu, A. Kreidieh, K. Parvate, E. Vinitsky, and A. M. Bayen, “Flow: Architecture and Benchmarking for Reinforcement Learning in Traffic Control,” 2017.

[12] M. Riedmiller, M. V. .. Kavukcuoglu, K., “Playing atari with deep reinforcement learning,” ArXiv Prepr., vol. 1312.5602, 2013.

[13] M. Coggan, “Exploration and exploitation in reinforcement learning,” McGill Univ., vol. Research supervised by Prof. Doina Precup, CRA-W DMP Project at McGill University, 2004.

[14] L. Riedmiller, Martin Martin, “An algorithm for distributed reinforcement learning in cooperative multi-agent systems,” in In Proceedings of the Seventeenth International Conference on Machine Learning}, Citeseer, 2000.

[15] Thomas Simonini, “Diving deeper into Reinforcement Learning with Q-Learning,” Apr-2018. .

[16] Mathew, Tom V., and KV Krishna Rao, Fundamental parameters of traffic flow. NPTE, 2016.

[17] C. F. D. Daganzo, Carlos, “Fundamentals of transportation and traffic operations,” in Fundamentals of transportation and traffic operations, vol. 30, Oxford: Pergamon, 1997.


[18] B. D. M. Sven Maerivoet, “Traffic flow theory,” in Physics and Society, 2005, p. 33.

[19] D. Stephens and J. Schroeder, “Vehicle-to-infrastructure (V2I) safety applications performance,” US Dept.of Transportation, Technical FHWA-JPO-16-253, 2013.

[20] IEEE Standards Association, “IEEE 802.11p.”

[21] H. Stübing, “Car-to-X Communication: System Architecture and Applications,” in Multilayered Security and Privacy Protection in Car-to-X Networks, Wiesbaden: Springer, 2013, pp. 9–19.

[22] German Association of the Automotive Industry, “SimTD,” Project SimTD, 2018.

[23] R. Baldessari and W. Zhang, “CAR-2-X Communication SDK – A Software Toolkit for Rapid Application Development and Experimentations,” presented at the International Conference on Communication, Dresden, 2009.

[24] D. E. Bastian Halmos ; Reinhard German, “Potentials and Limitations of Green Light Optimal Speed Advisory Systems,” presented at the 2013 IEEE Vehicular Networking Conference, Boston,USA, 2013.

[25] PyBrain, “PyBrain features,” 2018.

[26] R. Marsetič, D. Šemrov, and M. Žura, “Road Artery Traffic Light Optimization with Use of Reinforcement Learning,” presented at Intelligent Transportation Systems (ITS), 2014.

[27] PTV Group, “PTV Vissim,” 2010.

[28] M. Liu, J. Deng, and W. Wang, “Cooperative Deep Reinforcement Learning for Traffic Signal,” presented at the International Workshop on Urban Computing, Canada, 2017.

[29] US Department of Transportation, “Federal Highway Administration,” 2018.

[30] H. Wei and Z. Li, “IntelliLight: A Reinforcement Learning Approach for Intelligent Traffic Light Control,” presented at the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, 2018, p. 10.

[31] A. and S. Van Hasselt, Hado and Guez David, “Deep Reinforcement Learning with Double Q-Learning,” Cornell Univ., vol. 2, p. 5, 2016.

[32] M. and V. H. Wang, Ziyu and Schaul, Tom and Hessel Hado and Lanctot, “Dueling network architectures for deep reinforcement learning,” arXiv, 2015.

[33] J. G. Minoru Ito, Norio Shiratori Yulong Shen, Jia Liu, “Adaptive Traffic Signal Control: Deep Reinforcement Learning Algorithm with Experience Replay and Target Network,” 2017, p. 10.

[34] D. I. Kaushik Subramanian, Kikuo Fujimura Reza Rahimi, Akansel Cosgun, “Navigating Occluded Intersections with Autonomous Vehicles using Deep Reinforcement Learning,” presented at the IEEE International Conference on Robotics and Automation, 2017.

[35] Ankur Mehta ; Eugene Vinitsky, “Framework for control and deep reinforcement learning in traffic,” presented at the 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Yokohama, Japan, 2017.

[36] J. S. Yan Duan, Xi Chen, Rein Houthooft Pieter Abbeel, “Benchmarking Deep Reinforcement Learning for Continuous Control,” presented at the Proceedings of the 33rd International Conference on Machine Learning, 2016.

[37] L. and S. Brockman, G., Cheung, V., Pettersson, L Jonas and Schulman, “OpenAI Gym,” arXiv, 2016.

[38] E. J. and A. S. Mutiara Maulida,Herman Y. Sutarto, “Queue Length Optimization of Vehicles at Road Intersection Using Parabolic Interpolation Method,” presented


at the International Conference on Automation, Cognitive Science, Optics, Micro Electro-Mechanical System, and Information Technology, Indonesia, 2015.

[39] R. K. yin Min Keng, Helen Chuo Kenneth Tze, “Genetic algorithm based signal optimizer for oversaturated urban signalized intersection,” presented at the IEEE International Conference on Consumer Electronics, Malaysia, 2016.

[40] Guo, X., Song, Y, “Research of traffic assignment algorithm based on adaptive genetic algorithm,” presented at the Computing, Control and Industrial Engineering (CCIE), 2011 IEEE 2nd International Conferenc, 2011.

[41] Leng, Junqiang, and Yuqin Feng., “Research on the Fuzzy Control and Simulation for Intersection Based on the Phase Sequence Optimization,” in Measuring Technology and Mechatronics Automation, 2009, 2009.

[42] P. B. Vipul Vilas Sawake1, “Review of Traffic Signal Timing Optimization based on Fuzzy Logic Controller,” presented at the International Conference on Innovation in Information Embedded and Communication System, 2017.

[43] M. and H. Abdelhameed, Magdy M and Abdelaziz S. and Shehata, Omar M., “A hybrid fuzzy-genetic controller for a multi-agent intersection control system,” presented at the Engineering and Technology (ICET), 2014 International Conference, 2014.

[44] A. and C. Choi, Myungwhan and Rubenecia Hyo Hyun, “Reservation-based cooperative traffic management at an intersection of multi-lane roads,” presented at the Information Networking (ICOIN), 2018 International Conference, 2018.

[45] SciPy, “SciPy optimization algorithms,” BFGS, Nov. 2018.

[46] Theano, 2018.

[47] DLR, Sumo-Traci. Berlin, Germany: DLR, 2018.

[48] SciPy. Open source, 2018.

[49] DLR, “Sumo User Documentation,” Nov. 2018.


Appendix

Figure 66 : Rules execution - simulation loop

Figure 67 :Rule 3 code snippet


Figure 68 : GLOSA arrival time calculation all variations

Figure 69 : Traffic light phase selector

Figure 70 : Advised speed - GREEN phase


Figure 71: Advised speed next GREEN phase

Figure 72 : SUMO initialization


Figure 73 : SUMO main configuration

Figure 74 : Route file SUMO


Variable initialization:

GREEN_PHASE_DURATION

LEFT_TURN_SEPERATE_PHASE

CYCLE_LENGTH

MAX_SPEED

ARRIVAL_MARGIN

MAX_NUMBER_VEHICLES_LEFT_LANE_POOL

LEFT_LANE_CHANGE_CONTROL_DISTANCE

Simulated_DATA // list to store the data of the simulated vehicles for all steps

For step 0 to end

Data extraction from list Simulated_DATA:

Data extracted :

For each and every vehicle

Speed

Acceleration

Distance to the intersection

Current lane

Leading vehicles in left lane

Leading vehicles in right lane

Leading vehicles in left lane (straight going)

Leading vehicles in left lane (left turning)

Leading vehicles in right lane (straight going)

Flag 1 : Flag for rule 1

Flag 2: Flag for rule2

Flag 3: Flag for rule 3

Slow_down_flag //for first left turner

Save it in a list

Get the Current phase and calculate remaining GREEN phase and NEXT GREEN phase

Get the current traffic light phase and give a number to the phase

Ex. if phase == "rrrrGGGgrrrrGGGg":

phaseNo = 1

//This until phase 8. Phase 1 is GREEN for all vehicles

// Phase 3 GREEN only for left turners

// Others are RED phases

Calculate the remaining time of the current phase

Next, calculate the remaining GREEN light time and the NEXT green light

Eg: phaseNo== 1

Remaining_green= remaining

NextGreenPhase= remaining + 59 (depends on Sumo traffic light )


Extended-GLOSA

Set speed mode for vehicle 0b11111

If acceleration != 0

//tmax represents time vehicle needs to reach MAX speed

Calculate tmax -> tmax = (MAX_SPEED - speed) / acceleration

//dmax represents the distance the vehicle travels while reaching MAX speed

dmax = tmax * ((MAX_SPEED + speed) / 2)

if dmax > distance to traffic light

//calculate arrival time

arrival_time = -(speed / acceleration) + math.sqrt(((speed * speed) /

(acceleration * acceleration)) + ((2 * distance) / acceleration))

else

arrival_time = tmax + ((distance - dmax) / MAX_SPEED)

else

//executes for very small speeds like 0.1 or 0.0012, i.e. when the vehicle is next to a traffic light.

//Mainly to avoid unexpected stopping

//consider last 100m for executing this command (current 1km scenario)

if (speed >= 0 and speed < 1)

if distance <= position of the intersection / 10:

arrival_time = 1

else if speed != 0:

arrival_time = distance / speed

if phase == 1 or phase == 3

//GREEN phase, or the additional 6-second phase for left turners

if remaining > arrival_time or (vehicle is a left turner

and remaining + LEFT_TURN_SEPERATE_GREEN_TIME > arrival_time)

traci setSpeed -1

//advised speed: MAX speed (no restriction)

Else

Get the time for next green phase+added Margin

//calculate advised speed

advisedSpeed = max(0, ((2 * distance) / (nextGreenPhase)) - speed)

traci slowdown with advised speed

else

calculate the time until the next green phase starts, for each and every phase

Eg: if the phase==1

nextGreenPhase = 59 + remaining + ARRIVAL_MARGIN


if the vehicle is a left turner, also consider the additional 6-second phase

if arrival_time> NEXTGREENPHASE

//arrive in green phase after next green

If vehicle == first left turner

myGreenPhase = nextGreenPhase + CYCLE_LENGTH +

GREEN_PHASE_DURATION

Else

myGreenPhase = nextGreenPhase + CYCLE_LENGTH

else

//vehicle arrive in next green phase

If vehicle == first left turner

myGreenPhase = nextGreenPhase + GREEN_PHASE_DURATION

else

myGreenPhase = nextGreenPhase

advisedSpeed = max(0, ((2 * distance) / (myGreenPhase)) - speed)

Issue slowdown command with advisedSpeed

Rules execution

For each and every vehicle

Rule1

//change all left turners to the left lane

If the route is a left-turning route and the current lane is the right lane

Lane change to left lane

Change flag to true

Rule2

// check the left lane: if a straight-going vehicle is following a left turner, change it to the right lane

If the route is straight-going and the vehicle is currently in the left lane:

Check how many left turners are leading in left lane

If Leading vehicles in left lane (left turning) >0

Change to right lane

Set flag2 = true

Rule3

//move the straight-going vehicles in the right lane that are ahead of the first left turner into the left lane

Left_lane_pool // number of vehicles which can lead on left lane for first left turner

// First find the first left turner

Get vehicle list currently travelling in left lane.

For each vehicle in the list, from end to start

If the route is a left turning route

First_left_turner_found


Break

Else

Left_lane_pool++

If First_left_turner_found == currently investigated vehicle

If Left_lane_pool <= MAX_NUMBER_VEHICLES_LEFT_LANE_POOL

// slow down mechanism for the first left turner

Slow_down_flag = true

Get list Leading vehicles in right lane

While from end to start

If left_lane_pool<= MAX_NUMBER_VEHICLES_LEFT_LANE_POOL

Change leading straight going vehicle to left lane

left_lane_pool++

else

break

Store the updated data in the list Simulated_DATA

End for

Figure 75 : Rules pseudocode