Departure MANagement with a Reinforcement Learning Approach: Respecting CFMU Slots
TRANSCRIPT
Ivomar Brito Soares, Yann-Michael De Hauwere, Kris Januarius, Tim Brys, Thierry Salvant, Ann Nowe
[email protected] - [email protected]
ITSC 2015, September 2015
Introduction Reinforcement Learning Departure MANagement DMAN RL Model Experiments Conclusions
Summary
Introduction
Reinforcement Learning
Departure MANagement
DMAN RL Model
Experiments
Conclusions
Brito Soares et al. Departure MANagement with a Reinforcement Learning Approach 2
Large Scale Multi-Agent Systems (MAS)
[Figures: Smart Energy Grids, Intelligent Traffic Systems, Warehouse Planning, Air Traffic System]
Large Scale MAS
Characteristics

- Resources and control are inherently distributed.
- Different actors have mutual and conflicting interests.
- Highly stochastic.
- The full system dynamics are unknown (e.g. not all constraints are known).

A full mathematical description or a global centralized solution is therefore difficult to compute.
Why use RL?
- Artificial Intelligence (AI)
  - Machine Learning (ML)
    - Reinforcement Learning (RL)

Reinforcement Learning

- Agent-based modelling effort is reduced.
- Can exceed human controller performance.
- Adaptive: it can change its decisions dynamically.
Single-Agent RL
Single-Agent Reinforcement Learning (RL) Model [Kaelbling et al. (1996)]

- An approach to solving a Markov Decision Process (MDP).
- The agent must learn by itself (trial and error).
- It maximizes a long-term numerical reward signal.
Multi-Agent RL
Multi-Agent Reinforcement Learning (MARL) [Nowe (2011)]

- Markov Game (MG).
- Multiple agents / multiple sequential decisions.
- More complex than single-agent RL.
Air Traffic System
- Air Traffic System (ATS)
  - Airport Ground Operations (AGO)
    - Departure MANagement (DMAN)

Organizes the movement of departing aircraft from the gate to take-off clearance.

DMAN Tasks

- Respect assigned Target Take-Off Time Windows (TTOTW).
- Increase runway throughput and airport capacity.
- Reduce fuel consumption, noise and CO2 emissions.
Target Take-Off Time Window (TTOTW)
Respecting the TTOTW allows for better usage of the ATS infrastructure (CFMU slots ≡ take-off time slots).

| Flight Plan Callsign | TTOTWmin | TTOT     | TTOTWmax |
|----------------------|----------|----------|----------|
| AAL0005D             | 07:19:12 | 07:20:12 | 07:21:12 |
| DAL0067D             | 07:22:12 | 07:23:12 | 07:24:12 |
| JBU0065D             | 07:25:12 | 07:26:12 | 07:27:12 |
| AAL0007D             | 07:28:12 | 07:29:12 | 07:30:12 |
| DAL0009D             | 07:31:12 | 07:32:12 | 07:33:12 |

Some TTOT windows generated for learning scenario 6.

At Charles de Gaulle Airport (LFPG) in Paris, France, roughly 80% of the flights succeed in taking off inside their slots [Gotteland, 2003].
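As a small illustration of what "respecting a slot" means computationally, the check below models a CFMU slot as a window of plus or minus one minute around the Target Take-Off Time, matching the scenario-6 table above. The function names and the window half-width are illustrative assumptions, not the paper's code:

```python
from datetime import datetime, timedelta

# Illustrative sketch: a CFMU slot modelled as a +/- 1 minute window
# around the Target Take-Off Time (TTOT), as in the scenario-6 table.
def ttotw(ttot: datetime, half_width: timedelta = timedelta(minutes=1)):
    return ttot - half_width, ttot + half_width

def inside_window(atot: datetime, ttot: datetime) -> bool:
    lo, hi = ttotw(ttot)
    return lo <= atot <= hi

ttot = datetime(2015, 9, 1, 7, 20, 12)                       # AAL0005D's TTOT
print(inside_window(datetime(2015, 9, 1, 7, 19, 12), ttot))  # True (window min)
print(inside_window(datetime(2015, 9, 1, 7, 22, 0), ttot))   # False (too late)
```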
Human Controllers Approaches to Respecting TTOTW
1. Gate Controllers: clear the aircraft off-block at TTOT minus an estimate of the duration between off-block and take-off.
   - Average: the average over all departure aircraft at KJFK.
   - Exact: the exact duration when the aircraft taxis alone.
2. Runway Controllers: make the aircraft wait before lining up if they estimate that it will miss its window by taking off too early.
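The gate-controller heuristic above is essentially a subtraction of an estimated taxi duration from the TTOT. A minimal sketch, where the 18-minute "average" estimate is an illustrative value, not KJFK data:

```python
from datetime import datetime, timedelta

# Sketch of the gate-controller heuristic: clear the aircraft off-block
# at TTOT minus an estimated off-block-to-take-off duration.
def offblock_clearance(ttot: datetime, taxi_estimate: timedelta) -> datetime:
    return ttot - taxi_estimate

ttot = datetime(2015, 9, 1, 7, 20, 12)
avg_taxi = timedelta(minutes=18)             # assumed airport-wide average
print(offblock_clearance(ttot, avg_taxi))    # 2015-09-01 07:02:12
```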
Fast Time Simulation (FTS)
- Modeling: en-route flight phase, Terminal Maneuvering Area (TMA), aircraft and airport handlers, vehicle ground movements.
- The simulation clock runs faster than a regular clock (fast-time).
- Tools: AirTOp, SIMMOD, TAAM, Arc Port ALTO, etc.
John F. Kennedy International Airport (KJFK)
[Figures: grid-world environment of KJFK; New York City, USA (NY-Metro)]
Environment
- Fast Time Simulation: AirTOp.
- Departure flights: Off-Block → Pushing Back → Taxiing → Runway Acceleration → Take-Off → Standard Instrument Departure (SID) → En-Route.
- Arrival flights: En-Route → Standard Terminal Arrival Route (STAR) → Landing → Runway Deceleration → Taxiing → In-Block.
- Safety requirements: wake-vortex separations, runway usage.
Markov Decision Process (MDP)
- Agent: a flight plan (FP) controller agent.
- States:
  1. Parked (initial state): one per departure aircraft with a TTOT window.
     - Departure gate
     - Entry time at the departure gate
  2. Taxiing (intermediate states, finite):
     - Entry node
     - Exit node
     - Entry time at the entry node
  3. Taken-Off (goal/absorbing states):
     - Taken-Off Inside Window
     - Taken-Off Outside Window
- Actions:
  1. Delay Off-Block.
  2. Delay During Taxiing.
  3. Take-Off.
- Reward function:
  - Inside window: a positive reward that penalizes delay on the ground.
  - Outside window: 0.
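A hedged reconstruction of this reward signal: the slide gives only its shape (positive inside the window, penalizing ground delay; 0 outside), so the linear form and the parameter names below are assumptions chosen to match the worked example later, where 60 s of ground delay inside the window yields a reward of 70:

```python
# Assumed reward shape: r_max minus a per-second ground-delay penalty
# inside the window (r_max = 100, f_delay = 0.5), 0 outside the window.
def reward(inside_window: bool, ground_delay_s: float,
           r_max: float = 100.0, f_delay: float = 0.5) -> float:
    if not inside_window:
        return 0.0                      # Taken-Off Outside Window
    return r_max - f_delay * ground_delay_s

print(reward(True, 60))   # 70.0
print(reward(False, 60))  # 0.0
```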
RLDMAN: Single-Agent Single-State to Multi-Agent Multi-State
1. Single-Agent Single-State (N-armed bandit):
   - Parked state / Delay Off-Block actions.
   - Delay is absorbed at the gate → reduced fuel consumption.
2. Multi-Agent Single-State: multiple single-state agents learning in a shared environment.
3. Single-Agent Multi-State: it is not always possible to absorb all the delay at the gate, e.g. when:
   - an arriving flight is requesting the gate;
   - the aircraft must avoid traffic in the vicinity of the gate.
4. Multi-Agent Multi-State: multiple multi-state agents learning in a shared environment.
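The single-agent single-state case above can be sketched as a standard ε-greedy N-armed bandit over candidate off-block delays. Everything below (the delay grid, the toy environment standing in for the simulator, and the hyperparameters) is illustrative, not the paper's setup:

```python
import random

# Minimal sketch (not the paper's code): each arm is a candidate off-block
# delay in seconds; pulling an arm runs one simulated departure and
# returns the TTOTW-based reward.
def epsilon_greedy_bandit(delays, pull, episodes=2000, eps=0.1, alpha=0.2):
    q = {d: 0.0 for d in delays}
    for _ in range(episodes):
        if random.random() < eps:
            d = random.choice(delays)          # explore
        else:
            d = max(q, key=q.get)              # exploit the current best arm
        q[d] += alpha * (pull(d) - q[d])       # incremental value estimate
    return q

# Toy environment: an off-block delay near 900 s hits the window best.
def pull(delay_s):
    return max(0.0, 100.0 - abs(delay_s - 900) / 10) + random.gauss(0, 1)

q = epsilon_greedy_bandit([0, 300, 600, 900, 1200], pull)
print(max(q, key=q.get))  # typically 900
```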
Single-Agent Multi-State Example
- TTOT Window (TTOTW): [08:16:00, 08:18:00]; taxi segment durations a-b = b-c = c-d = 5 min.
- Reward function parameters: r_max = 100, r_TTOTW,out = 0, f_taxiing = 0.5. Q-learning: α = 0.2, γ = 0.8.

[Figure: taxiway graph from the terminal gate gd(a) through nodes a, b, c, d to runways 27 and 09, showing the Parked state, the Taxiing states (node entry/exit pairs ne/nx with timestamps from 08:00:00 to 08:13:00), and the Taken-Off Inside Window / Taken-Off Outside Window absorbing states.]

1. The aircraft parks at the gate at 08:00:00.
2. It goes off-blocks (AOBT) at 08:00:30. R = 0, Q = 0.
3. No stop at node b. R = 0, Q = 0.
4. The aircraft stops for 30 s at node c. R = 0, Q = 0.
5. The aircraft takes off (ATOT) at 08:16:00. r_TTOTW,in = 100 − 0.5 × 60 = 70. Q = 0.2 × 70 = 14.
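The numbers in steps 2 to 5 follow from the standard tabular Q-learning update. A minimal sketch that reproduces them, assuming (as in the example) that only the final transition into the absorbing Taken-Off state earns a non-zero reward:

```python
# Standard Q-learning update: Q <- Q + alpha * (r + gamma * max_a' Q(s', a') - Q).
alpha, gamma = 0.2, 0.8
r_max, f_taxiing = 100.0, 0.5

ground_delay_s = 30 + 30       # 30 s off-block delay + 30 s stop at node c
r = r_max - f_taxiing * ground_delay_s
print(r)                       # 70.0

q = 0.0                        # Q-value of the last (state, action) pair
next_max_q = 0.0               # absorbing state: no future value
q += alpha * (r + gamma * next_max_q - q)
print(round(q, 6))             # 14.0
```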
Learning Problem Size
Large Multi-Agent System (MAS)
- Number of departure flights (agents): 698
- Number of arrival flights: 711
- Two days of operations at KJFK
Independent Learning Scenarios
- Total: 42

| Scenario index | 6–38 | 5  | 0 | 39 | 1, 3, 4 | 2, 41 | 40 |
|----------------|------|----|---|----|---------|-------|----|
| # of Dep       | 20   | 13 | 8 | 6  | 3       | 1     | 0  |
| Agents Type    | MA   | MA | MA| MA | MA      | SA    | -  |

Number of departure flights per learning scenario index (Dep: departure flights, MA: multi-agent, SA: single-agent).
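A quick cross-check of the table above: the per-scenario departure counts should sum to the 698 departure flights over 42 scenarios stated earlier:

```python
# Cross-check of the learning-scenario table against the totals above.
per_scenario = {i: 20 for i in range(6, 39)}   # scenarios 6-38: 20 departures each
per_scenario.update({5: 13, 0: 8, 39: 6, 1: 3, 3: 3, 4: 3, 2: 1, 41: 1, 40: 0})
assert len(per_scenario) == 42                 # 42 independent scenarios
print(sum(per_scenario.values()))              # 698
```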
Learning Scenario 6: Single-State vs. Multi-State Comparison

- Deterministic environment
- 20 independent learning agents

[Figures: average # of take-offs inside the TTOTW; average fuel consumption of departure flights (kg)]
[Figures: average reward (all agents); average reward (per agent)]
[Figures: average gate delay (all agents); average taxiing delay (all agents)]
All Learning Scenarios: Percentage of Windows Respected
| Case                                | Deterministic | Stochastic |
|-------------------------------------|---------------|------------|
| Machine Learning                    | 99            | 96         |
| Gate Controllers (Average)          | 85            | 71         |
| Gate Controllers (Exact)            | 97            | 44         |
| Gate + Runway Controllers (Average) | 87            | 70         |
| Gate + Runway Controllers (Exact)   | 96            | 44         |

Percentage of windows respected for all scenarios.
All Learning Scenarios: Fuel Consumption
| Case                                | Deterministic | Stochastic |
|-------------------------------------|---------------|------------|
| Machine Learning                    | 34,806        | 37,989     |
| Gate Controllers (Average)          | 35,057        | 38,106     |
| Gate Controllers (Exact)            | 34,839        | 37,865     |
| Gate + Runway Controllers (Average) | 35,412        | 36,613     |
| Gate + Runway Controllers (Exact)   | 34,847        | 37,872     |

Fuel consumption (kg) for departure aircraft for all scenarios.
Conclusions
- Reinforcement Learning (RL) has shown good potential for modeling and finding solutions for respecting assigned take-off windows for departure aircraft.
- A realistic real-world application of RL.
- Single-State
  - Advantages: reduced fuel consumption and a reduced learning problem, since no taxiing states are visited.
  - Disadvantages: increased gate delay, and not able to find a solution for all cases, e.g. when the aircraft needs to avoid traffic taxiing in the vicinity of the gate.
- Multi-State
  - Advantages: reduced gate delay; finds solutions that avoid disturbance traffic on the path to the runway.
  - Disadvantages: increased fuel consumption; a bigger learning problem.
Questions?
Ivomar Brito Soares
[email protected] - [email protected]
AI Lab (VUB): https://ai.vub.ac.be
Airtopsoft SA: http://www.airtopsoft.com
Innoviris: http://www.innoviris.be
Youtube Channel (RLDMAN): http://www.youtube.com/channel/UC8uJBsMej5A1as8trbVxbbQ
References
1. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
2. K. Tumer and A. Agogino, "Improving Air Traffic Management with a Learning Multiagent System," IEEE Intelligent Systems, vol. 24, no. 1, pp. 18-21, 2009.
3. R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, Machine Learning: An Artificial Intelligence Approach. Springer Science & Business Media, 2013.
4. A. Nowé, P. Vrancx, and Y.-M. De Hauwere, "Game Theory and Multi-agent Reinforcement Learning," in Reinforcement Learning: State of the Art. Springer, 2012, pp. 441-470.
5. Y.-M. De Hauwere, Sparse Interactions in Multi-Agent Reinforcement Learning, 2011.
6. R. De Neufville and A. Odoni, Airport Systems: Planning, Design and Management, 2013.
![Page 47: Departure MANagement with a Reinforcement Learning Approach: Respecting CFMU Slots](https://reader030.vdocuments.us/reader030/viewer/2022021500/5870d75f1a28ab64768b6e05/html5/thumbnails/47.jpg)
References
7. S. Stroiney, B. Levy, and C. Knickerbocker, "Departure Management: Savings in Taxi Time, Fuel Burn, and Emissions," in Integrated Communications Navigation and Surveillance Conference (ICNS). IEEE, 2010.
8. J.-B. Gotteland, N. Durand, and J.-M. Alliot, "Handling CFMU Slots in Busy Airports," in 5th USA/Europe Air Traffic Management Research and Development Seminar, 2003.
9. Airtopsoft, "AirTOp Fast Time Simulator," http://www.airtopsoft.com/, 2005. [Online; accessed 23-February-2015].
10. C. J. Watkins and P. Dayan, "Q-Learning," Machine Learning, vol. 8, no. 3-4, pp. 279-292, 1992.
![Page 48: Departure MANagement with a Reinforcement Learning Approach: Respecting CFMU Slots](https://reader030.vdocuments.us/reader030/viewer/2022021500/5870d75f1a28ab64768b6e05/html5/thumbnails/48.jpg)
Abstract
This paper considers how existing Reinforcement Learning (RL) techniques can be used to model and learn solutions for large-scale Multi-Agent Systems (MAS). The large-scale MAS of interest is the movement of departure flights at big airports, commonly known as the Departure MANagement (DMAN) problem. A particular DMAN subproblem is how to respect Central Flow Management Unit (CFMU) take-off time windows, which are time windows planned by flow management authorities to be respected for the take-off time of departure flights. An RL model to handle this problem is proposed, including the Markov Decision Process (MDP) definition, the behavior of the learning agents, and how the problem can be modeled using RL, ranging from the simplest to the full RL problem. Several experiments are also shown that illustrate the performance of the machine learning algorithm, with a comparison to how these problems are commonly handled by airport controllers nowadays. The environment in which the agents learn is provided by the Fast Time Simulator (FTS) AirTOp, and the airport case study is the John F. Kennedy International Airport (KJFK) in New York City, USA, one of the busiest airports in the world.
![Page 49: Departure MANagement with a Reinforcement Learning Approach: Respecting CFMU Slots](https://reader030.vdocuments.us/reader030/viewer/2022021500/5870d75f1a28ab64768b6e05/html5/thumbnails/49.jpg)

Reinforcement Learning (RL) Overview

I Markov Decision Process (MDP): (S, A, T_sa, γ, R).
I S = {s_1, ..., s_N} is a finite set of states.
I A = {a_1, ..., a_k} are the actions available to the agent.
I T_sa: each combination of starting state s_i, action choice a_l ∈ A and next state s_j has an associated transition probability T(s_i, a_l, s_j).
I R: immediate reward R(s_i, a_l).
I γ ∈ [0, 1) is the discount factor (e.g., 0.9).
I Learn policy π: J^π ≡ E[∑_{t=0}^∞ γ^t R(s(t), π(s(t)))]
I E: expected value.
I Q values: Q(s, a) = R(s, a) + γ ∑_{s'} T(s, a, s') max_{a'} Q(s', a')
I s': next state.
I a': action taken in the next state.
I Q-Learning update rule [Watkins, 1992]: Q(s, a) ← Q(s, a) + α[R(s, a) + γ max_{a'} Q(s', a') − Q(s, a)]
I α: learning rate.
I ε-greedy action selection mechanism: ε(episode) = ε_0 · τ^episode
I ε_0: the initial value (e.g., 1.0).
I τ ∈ (0, 1): the decay (e.g., 0.995, reaching ε = 0.001 after 1378 episodes).
![Page 59: Departure MANagement with a Reinforcement Learning Approach: Respecting CFMU Slots](https://reader030.vdocuments.us/reader030/viewer/2022021500/5870d75f1a28ab64768b6e05/html5/thumbnails/59.jpg)
Algorithm 1 RL Update With Q-Learning and ε-Greedy with Decay

1: Initialize Q(s, a) (e.g., to 0)
2: for each episode do
3:   Agent returns to the initial state
4:   Decay ε: ε ← ε_0 · τ^episode
5:   while final/absorbing state not reached do
6:     Generate random number n ∈ [0, 1)
7:     if n ≤ ε then
8:       Explore: choose a at random
9:     else
10:      Exploit: choose among a with the highest Q
11:     Execute action a
12:     Observe reward r, next state s'
13:     Update Q(s, a): Q(s, a) ← Q(s, a) + α[R(s, a) + γ max_{a'} Q(s', a') − Q(s, a)] [Watkins, 1992]
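Algorithm 1 can be sketched in Python. The 3-state corridor environment below is an illustrative assumption (it stands in for the simulated airport), while α, γ, ε_0 and τ follow the values used later in the RL set-up:

```python
import random

def q_learning(n_states, n_actions, step, episodes=1378,
               alpha=0.2, gamma=0.8, eps0=1.0, tau=0.995):
    """Tabular Q-learning with epsilon-greedy exploration and decay."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for episode in range(episodes):
        s = 0                                   # agent returns to initial state
        eps = eps0 * tau ** episode             # decay epsilon
        while s is not None:                    # until absorbing state reached
            if random.random() <= eps:
                a = random.randrange(n_actions)            # explore
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])  # exploit
            r, s_next = step(s, a)              # execute a, observe r and s'
            target = r if s_next is None else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])          # Watkins update
            s = s_next
    return Q

# Toy 3-state corridor (assumed): action 0 advances toward take-off
# (reward 10 at the end, None = absorbing), action 1 waits with zero reward.
def step(s, a):
    if a == 1:
        return 0.0, s
    return (10.0, None) if s == 2 else (0.0, s + 1)

random.seed(42)
Q = q_learning(n_states=3, n_actions=2, step=step)
```

With γ = 0.8, the learned values approach 10, 8 and 6.4 along the corridor, and the greedy policy advances in every state.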
![Page 60: Departure MANagement with a Reinforcement Learning Approach: Respecting CFMU Slots](https://reader030.vdocuments.us/reader030/viewer/2022021500/5870d75f1a28ab64768b6e05/html5/thumbnails/60.jpg)

Target Take-Off Time Window (TTOTW)

Respecting the TTOTW allows for better usage of the ATS infrastructure.

I Departure aircraft i: ac^dep_i ∈ AC^dep, with Target Take-Off Time TTOT_i.
I TTOT Window (TTOTW):
I Width: TTOTW^w_i > 0.
I Range: [TTOTW^min_i, TTOTW^max_i].
I Constraints: TTOTW^min_i < TTOT_i < TTOTW^max_i and TTOTW^max_i = TTOTW^min_i + TTOTW^w_i.
I TTOTW_i respected: Actual Take-Off Time (ATOT): ATOT_i ∈ [TTOTW^min_i, TTOTW^max_i].

At Charles de Gaulle Airport (LFPG) in Paris, France, roughly 80% of the flights succeed in taking off inside their slots [Gotteland, 2003].
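The window arithmetic and the compliance test ATOT_i ∈ [TTOTW^min_i, TTOTW^max_i] can be written directly; centering the window on the TTOT is an illustrative assumption here, since the slide only fixes the width and the containment constraints:

```python
from datetime import datetime, timedelta

def make_ttotw(ttot, width):
    """Build [TTOTW_min, TTOTW_max] with TTOTW_max = TTOTW_min + width.
    Centering on the TTOT is an assumed convention for this sketch."""
    lo = ttot - width / 2
    return lo, lo + width

def window_respected(atot, window):
    """TTOTW respected iff the ATOT falls inside [TTOTW_min, TTOTW_max]."""
    lo, hi = window
    return lo <= atot <= hi

# Hypothetical flight: TTOT at 14:30 with a 3-minute window.
ttot = datetime(2015, 9, 15, 14, 30)
win = make_ttotw(ttot, width=timedelta(minutes=3))
```

Taking off 80 seconds after the TTOT still falls inside this 3-minute window; taking off 2 minutes late does not.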
![Page 62: Departure MANagement with a Reinforcement Learning Approach: Respecting CFMU Slots](https://reader030.vdocuments.us/reader030/viewer/2022021500/5870d75f1a28ab64768b6e05/html5/thumbnails/62.jpg)

Human Controllers' Approaches to Respecting the TTOTW

1. Gate controllers: clear the aircraft off-block at TTOT − T^ot_i.
I Average: T^ot,a_i =
  average push-back duration (00:02:35)
  + average taxi time duration (total taxi length / 14 kt)
  + average runway line-up duration (00:00:28)
  + runway acceleration duration.
I Exact: T^ot,e_i =
  Actual Take-Off Time (ATOT)
  − Actual Off-Block Time (AOBT).
2. Runway controllers: make the aircraft wait before lining up if they estimate that it would miss its window by taking off too early.
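The average lead time T^ot,a_i is a straightforward sum of the four durations above. In this sketch the taxi length and the runway acceleration duration are assumed values, since the slide does not quantify them:

```python
KT_TO_MPS = 0.514444  # knots to metres per second

def avg_offblock_lead_time_s(taxi_length_m, accel_duration_s):
    """T_ot_a: average time from off-block clearance to take-off."""
    push_back_s = 2 * 60 + 35                   # average push-back, 00:02:35
    taxi_s = taxi_length_m / (14 * KT_TO_MPS)   # taxi at an average 14 kt
    line_up_s = 28                              # average line-up, 00:00:28
    return push_back_s + taxi_s + line_up_s + accel_duration_s

# Assumed figures: a 3 km taxi route and a 40 s take-off roll.
lead = avg_offblock_lead_time_s(taxi_length_m=3000, accel_duration_s=40)
```

A gate controller would then clear the aircraft off-block `lead` seconds before its TTOT.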
![Page 64: Departure MANagement with a Reinforcement Learning Approach: Respecting CFMU Slots](https://reader030.vdocuments.us/reader030/viewer/2022021500/5870d75f1a28ab64768b6e05/html5/thumbnails/64.jpg)

RLDMAN: Single-Agent Single-State to Multi-Agent Multi-State

1. Single-Agent Single-State (N-Armed Bandit):
I Only one learning agent; the only state considered is the Parked state, with the Delay Off-Block actions.
I All delay is absorbed at the gate while the aircraft engines are turned off, thus reducing fuel consumption.
2. Multi-Agent Single-State: multiple Single-State agents learning in a shared environment.
3. Single-Agent Multi-State: it is not always possible to absorb all the delay at the gate:
I An arriving flight is requesting the gate.
I The aircraft must avoid traffic in the vicinity of the gate.
4. Multi-Agent Multi-State: multiple Multi-State agents learning in a shared environment.
![Page 68: Departure MANagement with a Reinforcement Learning Approach: Respecting CFMU Slots](https://reader030.vdocuments.us/reader030/viewer/2022021500/5870d75f1a28ab64768b6e05/html5/thumbnails/68.jpg)

Markov Decision Process (MDP)

I Agent: a flight plan (FP) controller agent a_i.
I States (S):
1. Parked (initial state) (s^p_i):
I Departure gate (g_d)
I Entry time (e_p)
2. Taxiing (intermediate states) (S^t_i = {s^t_i,1, ..., s^t_i,N}):
I Entry node (n_e)
I Exit node (n_x)
I Entry time (e_t)
3. Taken-Off (goal/absorbing states):
I Taken-Off Inside Window (s^TTOTW,in_i)
I Taken-Off Outside Window (s^TTOTW,out_i)
![Page 70: Departure MANagement with a Reinforcement Learning Approach: Respecting CFMU Slots](https://reader030.vdocuments.us/reader030/viewer/2022021500/5870d75f1a28ab64768b6e05/html5/thumbnails/70.jpg)

Markov Decision Process (MDP)

I Actions (A):
1. Delay Off-Block (A^o = {a^o_1, ..., a^o_L})
2. Delay During Taxiing (A^t = {a^t_1, ..., a^t_M})
3. Take-Off (a^e)
I Reward function (R):
I Inside window: r^TTOTW,in_i = r_max − p^taxiing_i = r_max − f^taxiing · d^taxiing_i
I Outside window: r^TTOTW,out_i = 0.
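The reward function can be sketched as follows. r_max = 10,000 comes from the RL set-up; the taxi-delay penalty is written here with a positive coefficient folded into the subtraction (the set-up lists f^taxiing = −1.0, i.e., each second of taxi delay costs one reward unit):

```python
R_MAX = 10_000.0   # maximum reward, from the RL set-up
F_TAXIING = 1.0    # taxi-delay penalty per second (sign folded into the '-')

def reward(inside_window, taxi_delay_s):
    """r = r_max - f * d_taxiing when the window is respected, else 0."""
    if not inside_window:
        return 0.0  # taking off outside the window earns nothing
    return R_MAX - F_TAXIING * taxi_delay_s
```

An agent that respects its window with 60 s of taxi delay earns 9,940; missing the window earns 0 regardless of delay.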
![Page 72: Departure MANagement with a Reinforcement Learning Approach: Respecting CFMU Slots](https://reader030.vdocuments.us/reader030/viewer/2022021500/5870d75f1a28ab64768b6e05/html5/thumbnails/72.jpg)
RL Set Up
I Independent learners.
I Q-learning: α = 0.2, γ = 0.8.
I Action selection mechanism is ε-greedy with parameter decay:ε0 = 1.0, τ = 0.995.
I Competitive setting: r_max = 10,000, f^taxiing = −1.0.
I Environment: either deterministic or stochastic (non-deterministic).
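With ε_0 = 1.0 and τ = 0.995, the schedule ε(episode) = ε_0 · τ^episode reaches the stopping threshold ε = 0.001 after roughly 1378 episodes, which is consistent with the trial length reported for the experiments:

```python
import math

eps0, tau = 1.0, 0.995
threshold = 0.001

# Solve eps0 * tau**n = threshold for n.
n = math.log(threshold / eps0) / math.log(tau)
episodes = round(n)  # approximately 1378
```

The slower the decay τ, the longer the agents keep exploring before the trial ends.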
![Page 73: Departure MANagement with a Reinforcement Learning Approach: Respecting CFMU Slots](https://reader030.vdocuments.us/reader030/viewer/2022021500/5870d75f1a28ab64768b6e05/html5/thumbnails/73.jpg)
RL Set Up

I Delay off-block actions: A^o defined with a range of [−10 min, +10 min] centered around TTOT − T^ot,e and a step of 10 s for every agent (121 actions per s^p_i).
I Delay during taxiing actions: A^t defined with a range of [0, 1 min] and a step of 10 s (7 actions per S^t_i), close to each apron exit.
I A learning trial ends when ε = 0.001 for all agents (1378 episodes).
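Both action sets can be enumerated directly; counting them confirms the 121 off-block and 7 taxiing actions quoted above:

```python
def delay_actions_s(lo_s, hi_s, step_s=10):
    """Delay actions from lo_s to hi_s inclusive, spaced step_s apart."""
    return list(range(lo_s, hi_s + 1, step_s))

# Off-block delays: [-10 min, +10 min] around TTOT - T_ot_e, 10 s steps.
off_block_actions = delay_actions_s(-600, 600)
# Taxiing delays near each apron exit: [0, 1 min], 10 s steps.
taxiing_actions = delay_actions_s(0, 60)
```

Each Parked state thus offers 121 choices, and each taxiing decision point 7.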
![Page 74: Departure MANagement with a Reinforcement Learning Approach: Respecting CFMU Slots](https://reader030.vdocuments.us/reader030/viewer/2022021500/5870d75f1a28ab64768b6e05/html5/thumbnails/74.jpg)

All Learning Scenarios: Percentage of Windows Respected

[Figure: percentage of KJFK departure flights that take off inside their window, per learning scenario]
![Page 76: Departure MANagement with a Reinforcement Learning Approach: Respecting CFMU Slots](https://reader030.vdocuments.us/reader030/viewer/2022021500/5870d75f1a28ab64768b6e05/html5/thumbnails/76.jpg)

All Learning Scenarios: Fuel Consumption

[Figure: fuel consumption of KJFK departure flights across learning scenarios]