CS 234

Assignment 1

1 Effect of Modeling Errors in Policy Evaluation [20 pts]

Consider deploying a Reinforcement Learning (RL) agent on an episodic MDP M with a horizon

of H timesteps. The true MDP M is never revealed to the agent except for the state and action

space (S and A, respectively) and the time horizon H. In other words, the agent knows neither the

expected reward $r(s, a)$ nor the transition dynamics $P(s' \mid s, a)$ for a generic state-action pair $(s, a)$.

As the agent explores the environment in different episodes it builds an empirical model of the

world (often via maximum likelihood) which we denote by Mˆ . After collecting sufficient data we

would expect the empirical MDP to be similar to the true MDP.
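For concreteness, such a maximum-likelihood model can be built from transition counts. The sketch below is illustrative only — the function name and the $(s, a, r, s')$ tuple format are our own choices, not part of the assignment's starter code:

```python
import numpy as np

def build_empirical_mdp(transitions, n_states, n_actions):
    """Maximum-likelihood estimates of P_hat and r_hat from observed
    (s, a, r, s_next) tuples. Interface is illustrative."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sums[s, a] += r
        visits[s, a] += 1
    # Avoid division by zero for unvisited (s, a) pairs.
    safe = np.maximum(visits, 1)
    P_hat = counts / safe[:, :, None]   # empirical transition probabilities
    r_hat = reward_sums / safe          # empirical mean rewards
    return P_hat, r_hat
```

With enough data per $(s, a)$ pair, these estimates concentrate around the true $P$ and $r$, which is the sense in which $\hat{M}$ becomes "similar" to $M$.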

Since our MDP is episodic, the value of a state depends on the time to termination of the episode.

Thus, for a given stochastic policy π, we require a different value function for each timestep, which

we denote $V_1, \dots, V_H$ in $M$ and $\hat{V}_1, \dots, \hat{V}_H$ in $\hat{M}$. In particular, for $i = 1, \dots, H$, let

$$V_i(s) = \mathbb{E}\left[\,\sum_{t=i}^{H} r(s_t, a_t) \;\middle|\; s_i = s\right] \quad \text{and} \quad \hat{V}_i(s) = \mathbb{E}\left[\,\sum_{t=i}^{H} \hat{r}(s_t, a_t) \;\middle|\; s_i = s\right],$$

where the expectation is taken over trajectories obtained by following policy $\pi$. Suppose we use the empirical MDP $\hat{M}$
instead of $M$ to evaluate policy $\pi$. Assuming $\hat{V}_{H+1} = V_{H+1} = \vec{0}$, show that for all $i = 1, \dots, H$:

$$\hat{V}_i(s) - V_i(s) = \sum_{t=i}^{H} \mathbb{E}\left[\,\hat{r}(s_t, a_t) - r(s_t, a_t) + \sum_{s'} \bigl(\hat{P}(s' \mid s_t, a_t) - P(s' \mid s_t, a_t)\bigr)\,\hat{V}_{t+1}(s') \;\middle|\; s_i = s\right].$$

In the above equality the expectation is taken with respect to the states encountered in the true MDP
$M$ when starting from $s_i = s$ and following the stochastic policy $\pi$.

This result relates the value of policy $\pi$ on $\hat{M}$ and $M$ using expected trajectories on $M$, which
we can compute easily. If $\hat{r}$ and $\hat{P}$ are close to $r$ and $P$, then this result can be used to
conclude that the empirical value function $\hat{V}$ is also close to the true one, $V$.

If you use an existing result in your derivation, you must cite the source.
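Although no substitute for a proof, the identity can be sanity-checked numerically. The sketch below builds a random episodic MDP, a random empirical model, and a fixed stochastic policy (all sizes and names are our own choices), then evaluates both sides by dynamic programming:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 3, 2, 4  # small state/action spaces and horizon, chosen arbitrarily

def random_P():
    P = rng.random((S, A, S))
    return P / P.sum(axis=-1, keepdims=True)   # normalize rows to distributions

P, P_hat = random_P(), random_P()              # true and "empirical" dynamics
r, r_hat = rng.random((S, A)), rng.random((S, A))
pi = rng.random((S, A))
pi /= pi.sum(axis=-1, keepdims=True)           # stochastic policy pi(a | s)

def values(P, r):
    # Backward recursion; row H holds V_{H+1} = 0 (timestep i maps to index i-1).
    V = np.zeros((H + 1, S))
    for t in range(H - 1, -1, -1):
        V[t] = (pi * (r + P @ V[t + 1])).sum(axis=-1)
    return V

V, V_hat = values(P, r), values(P_hat, r_hat)

# Right-hand side of the identity for i = 1 (index 0), for every start state.
rhs = np.zeros(S)
d = np.eye(S)                                  # d[s0, s] = Pr(s_t = s | s_1 = s0) under M, pi
for t in range(H):
    err = (pi * (r_hat - r + (P_hat - P) @ V_hat[t + 1])).sum(axis=-1)
    rhs += d @ err
    d = d @ np.einsum('sa,sap->sp', pi, P)     # advance one step in the true MDP

lhs = V_hat[0] - V[0]                          # left-hand side, per start state
```

Running this, `lhs` and `rhs` agree to numerical precision for every start state, as the identity predicts.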

2 Value Iteration [35 pts]

(a) After each iteration of value iteration you can extract the current best policy (given the current

value function, by taking an argmax instead of max). If the best policy at step k is the same

as the best policy at step k+1 when doing value iteration, can the policy ever change again

(with further iterations of value iteration)? Prove it cannot or disprove with a counterexample.

(b) Consider an MDP with finite state space and finite action set. Empirically we often halt value
iteration after a fixed number of steps for computational reasons, yielding an approximately
optimal value function $\tilde{V}$. We then extract a greedy policy $\pi_{\tilde{V}}$ from $\tilde{V}$ by taking the argmax
of one more backup. Let's assume we know that the maximum difference across any state
is bounded as follows (for extra understanding, think about how the difference $|BV - V|$ allows us to ensure this):

$$|V^*(s) - \tilde{V}(s)| \le \varepsilon, \quad \text{for all } s.$$

Then policy $\pi_{\tilde{V}}$ has values close to the optimal ones for each state. In particular, define the
loss function $L_{\tilde{V}}$ as follows:

$$L_{\tilde{V}}(s) = V^*(s) - V_{\pi_{\tilde{V}}}(s), \quad \text{for all } s,$$

where $V^*$ is the optimal value function and $V_{\pi_{\tilde{V}}}$ is the value function under policy $\pi_{\tilde{V}}$. Prove

$$L_{\tilde{V}}(s) \le \frac{2\gamma\varepsilon}{1 - \gamma}$$

for all states $s$. Here $\gamma < 1$ is the discount factor.

If you use outside sources you must write up your own solution and cite the used source.
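As a concrete illustration of the setup in part (b), here is a minimal value-iteration sketch that halts on a sup-norm tolerance and extracts a greedy policy via one more backup. The $(S, A, S)$ transition-tensor interface and function name are assumptions for this sketch, not a prescribed API:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-3):
    """P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    Returns an approximately optimal V and its greedy policy."""
    S, A = R.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * (P @ V)     # Q[s, a] = r(s, a) + γ Σ_{s'} P(s'|s, a) V(s')
        V_new = Q.max(axis=1)       # Bellman optimality backup (BV)(s)
        done = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if done:
            break
    # Greedy policy from one more backup on the final V (the pi_V-tilde of part (b)).
    policy = (R + gamma * (P @ V)).argmax(axis=1)
    return V, policy
```

Because each backup is a $\gamma$-contraction in the sup norm, stopping when $\|BV - V\|_\infty < \mathrm{tol}$ guarantees $V$ is within $\gamma\,\mathrm{tol}/(1-\gamma)$ of $V^*$, which is how the $\varepsilon$ in part (b) can be ensured in practice.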

3 Grid Policies [20 pts]

Consider the following grid environment. Starting from any unshaded square, you can move up,

down, left, or right. Actions are deterministic and always succeed (e.g. going left from state 1 goes
to state 0) unless they would cause the agent to run into a wall. The thicker edges indicate walls, and

attempting to move in the direction of a wall results in staying in the same square. Taking any

action from the green target square (no. 5) earns a reward of +5 and ends the episode. Taking any

action from the red square of death (no. 11) earns a reward of -5 and ends the episode. Otherwise,

each move is associated with some reward r ∈ {−1, 0, +1}. Assume the discount factor γ = 1 unless

otherwise specified.

Grid layout (state numbers; the thicker wall edges drawn in the original figure are not reproduced here):

20 21 22 23 24
15 16 17 18 19
10 11 12 13 14
 5  6  7  8  9
 0  1  2  3  4

(a) Define the reward r for all states (except state 5 and state 11 whose rewards are specified

above) that would cause the optimal policy to return the shortest path to the green target

square (no. 5).

(b) Using r from part (a), find the optimal value function for each square.

(c) Does setting γ = 0.8 change the optimal policy? Why or why not?

(d) All transitions are even better now: each transition now has an extra reward of 1 in addition

to the reward you defined in (a). Assume γ = 0.8 as in part (c). How would the value function

change? How would the policy change? Explain why.

4 Frozen Lake MDP [25 pts]

Now you will implement value iteration and policy iteration for the Frozen Lake environment from

OpenAI Gym (https://gym.openai.com/envs/FrozenLake-v0). We have provided custom versions of

this environment in the starter code.

(a) (coding) Implement policy iteration in vi_and_pi.py. The stopping tolerance (defined as
$\max_s |V_{\mathrm{old}}(s) - V_{\mathrm{new}}(s)|$) is tol $= 10^{-3}$. Use $\gamma = 0.9$. Return the optimal value function and
the optimal policy. [10 pts]

(b) (coding) Implement value iteration in vi_and_pi.py. The stopping tolerance is tol $= 10^{-3}$.
Use $\gamma = 0.9$. Return the optimal value function and the optimal policy. [10 pts]

(c) (written) Run both methods on the Deterministic-4×4-FrozenLake-v0 and

Stochastic-4×4-FrozenLake-v0 environments. In the second environment, the dynamics of the

world are stochastic. How does stochasticity affect the number of iterations required, and the

resulting policy? [5 pts]
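For orientation on part (a), a tabular policy-iteration sketch with the same stopping rule might look like the following. The $(S, A, S)$ tensor layout and function name are assumptions, not necessarily the interface used in vi_and_pi.py:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-3):
    """P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    Alternates policy evaluation and greedy improvement until the policy is stable."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)
    while True:
        # Policy evaluation: iterate the Bellman expectation backup to tolerance.
        V = np.zeros(S)
        while True:
            V_new = R[np.arange(S), policy] + gamma * (P[np.arange(S), policy] @ V)
            converged = np.max(np.abs(V_new - V)) < tol
            V = V_new
            if converged:
                break
        # Policy improvement: act greedily with respect to the evaluated V.
        new_policy = (R + gamma * (P @ V)).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy
```

On the deterministic environment both methods recover the same optimal policy; comparing their iteration counts against the stochastic environment is the point of part (c).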
