
Page 1: ECE-517: Reinforcement Learning in Artificial Intelligence (web.eecs.utk.edu/~ielhanan/courses/ECE-517/notes/lecture4.pdf)


ECE-517: Reinforcement Learning in Artificial Intelligence

Lecture 4: Discrete-Time Markov Chains

Dr. Itamar Arel

College of Engineering
Department of Electrical Engineering & Computer Science

The University of Tennessee
Fall 2015

September 1, 2015


ECE 517 – Reinforcement Learning in AI 2

Simple DTMCs

“States” can be labeled 0, 1, 2, 3, …

At every time slot a “jump” decision is made randomly based on the current state

[Figure: a two-state chain (states 0 and 1) with transition probabilities p (0 to 1) and q (1 to 0), and self-loops 1-p and 1-q; and a three-state chain (states 0, 1, 2) whose transitions are labeled a through f.]

(Sometimes the arrow pointing back to the same state is omitted)


1-D Random Walk

Time is slotted

The walker flips a coin every time slot to decide which way to go

[Figure: sample path of the walker's position X(t); at each slot the walker steps right w.p. p and left w.p. 1-p.]
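The coin-flip dynamics above can be sketched in a few lines of Python (an illustrative sketch; the function name `random_walk` and its defaults are chosen here, not taken from the slides):

```python
import random

def random_walk(steps, p=0.5, x0=0):
    # One coin flip per time slot: step +1 w.p. p, else -1
    x, path = x0, [x0]
    for _ in range(steps):
        x += 1 if random.random() < p else -1
        path.append(x)
    return path

random.seed(42)
path = random_walk(10)
print(path)
```

Each entry of `path` differs from its predecessor by exactly 1, matching the slotted-time model.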


Single Server Queue

Consider a queue at a supermarket

In every time slot:
- A customer arrives with probability p
- The head-of-line (HoL) customer leaves with probability q

We’d like to learn about the behavior of such a system

[Figure: single-server queue with Bernoulli(p) arrivals and Geom(q) service times.]
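The slot dynamics above are easy to simulate. The sketch below is illustrative, not from the slides; in particular, processing the arrival before the departure attempt within a slot is an assumption about slot ordering:

```python
import random

def simulate_queue(slots, p, q, seed=0):
    # Slotted queue: Bernoulli(p) arrivals; if nonempty, the HoL
    # customer departs w.p. q (geometric service time).
    # Assumption: the arrival is processed before the departure
    # attempt within each slot.
    rng = random.Random(seed)
    n, sizes = 0, []
    for _ in range(slots):
        if rng.random() < p:
            n += 1
        if n > 0 and rng.random() < q:
            n -= 1
        sizes.append(n)
    return sizes

sizes = simulate_queue(100_000, p=0.3, q=0.5)
avg = sum(sizes) / len(sizes)
print(avg)  # time-average queue size
```

Running with p < q (arrivals slower than service) keeps the queue stable, so the time average settles near a finite value.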


Birth-Death Chain

Our queue can be modeled by a Birth-Death Chain (a.k.a. Geom/Geom/1 queue)

Want to know:
- Queue size distribution
- Average waiting time, etc.

Markov Property: the “Future” is independent of the “Past” given the “Present”

In other words: Memoryless

We’ve mentioned memoryless distributions: Exponential and Geometric

Useful for modeling and analyzing real systems

[Figure: birth-death chain over states 0, 1, 2, 3, …]


Discrete Time Random Process (DTRP)

Random Process: An indexed family of random variables

Let {Xn} be a DTRP consisting of a sequence of independent, identically distributed (i.i.d.) random variables with common cdf FX(x). This sequence is called the i.i.d. random process. Example: a sequence of Bernoulli trials (coin flips)

In networking: traffic may obey a Bernoulli i.i.d. arrival pattern

In reality, some degree of dependency/correlation exists between consecutive elements in a DTRP Example: Correlated packet arrivals (video/audio stream)


Discrete Time Markov Chains

A sequence of random variables {Xn} is called a Markov Chain if it has the Markov property:

P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, ..., X_0 = i_0) = P(X_{n+1} = j | X_n = i)

States are usually labeled {0, 1, 2, …}

State space can be finite or infinite

Transition Probability Matrix

Probability of transitioning from state i to state j in one step:

p_ij = P(X_{n+1} = j | X_n = i)

We will assume the MC is homogeneous/stationary: p_ij is independent of the time index n

Transition probability matrix: P = {p_ij}, where each row sums to 1

Two-state MC (0 to 1 w.p. p, 1 to 0 w.p. q):

P = [ 1-p    p  ]
    [  q    1-q ]
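As an illustrative check (the numbers p = 0.2 and q = 0.5 are chosen here, not taken from the slides), repeatedly multiplying an initial distribution by P shows where the chain settles:

```python
import numpy as np

# Two-state chain: 0 -> 1 w.p. p, 1 -> 0 w.p. q
p, q = 0.2, 0.5
P = np.array([[1 - p, p],
              [q, 1 - q]])

pi = np.array([1.0, 0.0])  # start surely in state 0
for _ in range(50):
    pi = pi @ P            # pi_{k+1} = pi_k * P
print(pi)                  # approaches [q/(p+q), p/(p+q)]
```

The convergence factor is |1 - p - q|, so 50 iterations are far more than enough here.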


Stationary Distribution

Define π_k(i) = P(X_k = i) and the row vector π_k = (π_k(0), π_k(1), …); then π_{k+1} = π_k P

Stationary distribution: π = lim_{k→∞} π_k, if the limit exists

If π exists, we can solve for it from

π = π P,   Σ_i π(i) = 1

These are called balance equations

Transitions in and out of state i are balanced


General Comment & Conditions for π to Exist (I)

If we partition all the states into two sets, then transitions between the two sets must be “balanced”. This can be easily derived from the balance equations

Definitions:

State j is reachable from state i if there is an n such that p_ij^(n) > 0, i.e., j can be reached from i in a finite number of steps

States i and j commute if each is reachable from the other

The Markov chain is irreducible if all states commute


Conditions for π to Exist (I) (cont’d)

Condition: The Markov chain is irreducible

Counter-examples: [Figure: chains containing states that do not commute, e.g. an absorbing state or disconnected groups of states]

Condition: The Markov chain is aperiodic

Counter-example: [Figure: a chain that cycles deterministically 1 → 2 → 3 → 4 → 1 with p = 1; every state has period 4, so no limiting distribution exists]


Conditions for π to Exist (II)

For the stationary distribution to exist, the Markov chain must also be recurrent

All states i must be recurrent, i.e., starting from state i, the chain returns to i with probability 1

Otherwise the state is transient

With regards to a recurrent MC:

State i is positive recurrent if E(Ti) < ∞, where Ti is the time between successive visits to state i

Otherwise the state is considered null-recurrent
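For a positive recurrent, irreducible chain, E(Ti) = 1/π(i), which can be checked by simulation. A sketch, with the example chain and the function name `mean_return_time` chosen here:

```python
import random

def mean_return_time(P, i, visits=20_000, seed=1):
    # Estimate E(T_i): average number of steps between
    # successive visits to state i.
    rng = random.Random(seed)
    state, t, total, count = i, 0, 0, 0
    while count < visits:
        # Sample the next state from row `state` of P
        state = rng.choices(range(len(P)), weights=P[state])[0]
        t += 1
        if state == i:
            total += t
            t = 0
            count += 1
    return total / visits

P = [[0.9, 0.1],
     [0.4, 0.6]]
est = mean_return_time(P, 0)
print(est)  # for this chain pi(0) = 0.8, so E(T_0) = 1.25
```

A transient or null-recurrent state would make this estimate blow up (or the loop never terminate), which is exactly why positive recurrence matters.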


Solving for π: Example for the two-state Markov Chain

[Figure: two-state chain; 0 → 1 w.p. p, 1 → 0 w.p. q, self-loops 1-p and 1-q]

Balance across the cut between the two states: π(0) p = π(1) q

With normalization π(0) + π(1) = 1, this gives

π(0) = q / (p + q),   π(1) = p / (p + q)


Birth-Death Chain

Arrival w.p. p ; departure w.p. q

Let u = p(1-q), d = q(1-p), r = u/d

Balance equations (across the cut between states i-1 and i):

π(i-1) u = π(i) d

[Figure: birth-death chain over states 0, 1, 2, 3, …; up-transitions w.p. u, down-transitions w.p. d, self-loops 1-u at state 0 and 1-u-d elsewhere]


Birth-Death Chain (cont’d)

Continuing this pattern, we observe that:

π(i-1) u = π(i) d

Equivalently, we can draw a bi-section between state i and state i-1

Therefore, we have

π(i) = r^i π(0)

where r = u/d.

What we are interested in is the stationary distribution of the states, so we normalize: Σ_i π(i) = 1, which (for r < 1) gives π(0) = 1 - r and hence π(i) = (1 - r) r^i
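A quick numeric sanity check of the geometric form (a sketch assuming r < 1; the parameters p, q and the truncation level N are chosen here): truncate the infinite chain, solve its balance equations, and compare with (1 - r) r^i:

```python
import numpy as np

p, q = 0.3, 0.5                   # arrival / departure probabilities
u, d = p * (1 - q), q * (1 - p)   # u = p(1-q), d = q(1-p)
r = u / d                         # r < 1 here, so the chain is stable

# Truncate the infinite birth-death chain at N states
N = 60
P = np.zeros((N, N))
for i in range(N):
    if i + 1 < N:
        P[i, i + 1] = u           # birth
    if i > 0:
        P[i, i - 1] = d           # death
    P[i, i] = 1 - P[i].sum()      # remaining mass: self-loop

# Solve pi = pi * P with the normalization sum(pi) = 1
A = np.vstack([P.T - np.eye(N), np.ones(N)])
b = np.zeros(N + 1)
b[-1] = 1.0
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

closed = (1 - r) * r ** np.arange(4)
print(pi[:4])   # numeric solution of the truncated chain
print(closed)   # closed form (1-r) r^i
```

Because r^N is vanishingly small at N = 60, the truncated chain's solution matches the closed form to numerical precision.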


Birth-Death Chain (cont’d)