CS 234

Assignment 1

For submission instructions please refer to website

1 Optimal Policy for Simple MDP [20 pts]

Consider the simple n-state MDP shown in Figure 1. Starting from state s1, the agent can move

to the right (a0) or left (a1) from any state si

. Actions are deterministic and always succeed (e.g.

going left from state s2 goes to state s1, and going left from state s1 transitions to itself). Rewards

are given upon taking an action from the state. Taking any action from the goal state G earns a

reward of r = +1 and the agent stays in state G. Otherwise, each move has zero reward (r = 0).

Assume a discount factor γ < 1.

�” �’ �& �$%” �

�),� = 0

�”,� = 0

� = 1

�),� = 0 �),� = 0

�”,� = 0

�”,� = 0

Figure 1: n-state MDP

(a) The optimal action from any state si

is taking a0 (right) until the agent reaches the goal state

G. Find the optimal value function for all states si and the goal state G. [5 pts]

(b) Does the optimal policy depend on the value of the discount factor γ? Explain your answer.

[5 pts]

(c) Consider adding a constant c to all rewards (i.e. taking any action from states si has reward c

and any action from the goal state G has reward 1 + c). Find the new optimal value function

for all states si and the goal state G. Does adding a constant reward c change the optimal

policy? Explain your answer. [5 pts]

(d) After adding a constant c to all rewards now consider scaling all the rewards by a constant a

(i.e. rnew = a(c + rold)). Find the new optimal value function for all states si and the goal

state G. Does that change the optimal policy? Explain your answer, If yes, give an example

of a and c that changes the optimal policy. [5 pts]

1

2 Running Time of Value Iteration [20 pts]

In this problem we construct an example to bound the number of steps it will take to find the optimal

policy using value iteration. Consider the infinite MDP with discount factor γ < 1 illustrated in

Figure 2. It consists of 3 states, and rewards are given upon taking an action from the state. From

state s0, action a1 has zero immediate reward and causes a deterministic transition to state s1 where

there is reward +1 for every time step afterwards (regardless of action). From state s0, action a2

causes a deterministic transition to state s2 with immediate reward of γ

2/(1 − γ) but state s2 has

zero reward for every time step afterwards (regardless of action).

� = +1

� = 0

�’, � = 0

�), � =

�)

1 − �

�-

�’

�)

Figure 2: infinite 3-state MDP

(a) What is the total discounted return (P∞

t=0 γ

t

rt) of taking action a1 from state s0 at time step

t = 0? [5 pts]

(b) What is the total discounted return (P∞

t=0 γ

t

rt) of taking action a2 from state s0 at time step

t = 0? What is the optimal action? [5 pts]

(c) Assume we initialize value of each state to zero, (i.e. at iteration n = 0, ∀s : Vn=0(s) = 0).

Show that value iteration continues to choose the sub-optimal action until iteration n

∗ where,

n

∗ ≥

log(1 − γ)

log γ

≥

1

2

log( 1

1 − γ

)

1

1 − γ

Thus, value iteration has a running time that grows faster than 1/(1 − γ). (You just need to

show the first inequality) [10 pts]

3 Approximating the Optimal Value Function [35 pts]

Consider a finite MDP M = hS, A, T, R, γi, where S is the state space, A action space, T transition

probabilities, R reward function and γ the discount factor. Define Q∗

to be the optimal state-action

value Q∗

(s, a) = Qπ∗ (s, a) where π

∗

is the optimal policy. Assume we have an estimate Q˜ of Q∗

,

and Q˜ is bounded by l∞ norm as follows:

||Q˜ − Q

∗

||∞ ≤ ε

2

Where ||x||∞ = maxs,a|x(s, a)|.

Assume that we are following the greedy policy with respect to Q˜, π(s) = argmaxa∈AQ˜(s, a). We

want to show that the following holds:

Vπ(s) ≥ V

∗

(s) −

2ε

1 − γ

Where Vπ(s) is the value function of the greedy policy π and V

∗

(s) = maxa∈AQ∗

(s, a) is the optimal

value function. This shows that if we compute an approximately optimal state-action value function

and then extract the greedy policy for that approximate state-action value function, the resulting

policy still does well in the real MDP.

(a) Let π

∗ be the optimal policy, V

∗

the optimal value function and as defined above π(s) =

argmaxa∈AQ˜(s, a). Show the following bound holds for all states s ∈ S. [10 pts]

V

∗

(s) − Q

∗

(s, π(s)) ≤ 2ε

(b) Using the results of part 1, prove that Vπ(s) ≥ V

∗

(s) −

2ε

1−γ

. [10 pts]

Now we show that this bound is tight. Consider the 2-state MDP illustrated in figure 3. State s1

has two actions, “stay” self transition with reward 0 and “go” that goes to state s2 with reward 2ε.

State s2 transitions to itself with reward 2ε for every time step afterwards.

� = 2�

“��”, � = 2�

�+ �,

“����”, � = 0

Figure 3: 2-state MDP

(c) Compute the optimal value fucntion V

∗

(s) for each state and the optimal state-action value

function Q∗

(s, a) for state s1 and each action. [5 pts]

(d) Show that there exists an approximate state-action value function Q˜ with ε error (measured

with l∞ norm), such that Vπ(s1) − V

∗

(s1) = −

2ε

1−γ

, where π(s) = argmaxa∈AQ˜(s, a). (You

may need to define a consistent tie break rule) [10 pts]

4 Frozen Lake MDP [25 pts]

Now you will implement value iteration and policy iteration for the Frozen Lake environment from

OpenAI Gym. We have provided custom versions of this environment in the starter code.

(a) (coding) Read through vi_and_pi.py and implement policy_evaluation, policy_improvement

and policy_iteration. The stopping tolerance (defined as maxs |Vold(s) − Vnew(s)|) is tol =

10−3

. Use γ = 0.9. Return the optimal value function and the optimal policy. [10pts]

3

(b) (coding) Implement value_iteration in vi_and_pi.py. The stopping tolerance is tol =

10−3

. Use γ = 0.9. Return the optimal value function and the optimal policy. [10 pts]

(c) (written) Run both methods on the Deterministic-4×4-FrozenLake-v0 and

Stochastic-4×4-FrozenLake-v0 environments. In the second environment, the dynamics of the

world are stochastic. How does stochasticity affect the number of iterations required, and the

resulting policy? [5 pts]

4

## Reviews

There are no reviews yet.