HOME ASSIGNMENT 2, CMPE 260

TOPIC: Sequential decision making with discrete state and action spaces.

LEARNING OBJECTIVES: to get hands-on experience with the theoretical concepts discussed in class, such as the MDP graph, Value/Policy Iteration, Q-learning, SARSA, and DQN (a bonus question); and to get hands-on experience with efficient implementation using broadcasting of multidimensional tensors.

This assignment is part of the preparation for the midterm on the same topics. You can work in teams of 2-3 students. You can discuss your solutions with other teams, but sharing your code, or parts of it, with other teams is plagiarism.

DUE: April 7, 11:59 PM.

Coordinate convention: a state is a pair (x, y) that indexes the maze array as array[x, y]. The first coordinate, x, grows downward (Down is +x), and the second coordinate, y, grows rightward (Right is +y).

### UTILITY FUNCTIONS / PSEUDO-CODE ###

# 5 possible actions: 1 = Up, 2 = Right, 3 = Left, 4 = Down, 5 = Stay


PSEUDOCODE: Actions = Dict( 1=>(-1, 0), 2=>(0, +1), 3=>(0, -1), 4=>(+1, 0), 5=>(0, 0))

# E.g., Right = (keep constant the first coordinate of state, increase the second coordinate of state by +1)
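For example, applying an action is just tuple broadcasting ((3, 3) below is an arbitrary example state):

(3, 3) .+ Actions[2]   # gives (3, 4): one step Right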

### example of an obstacle (a fence of length 3 blocks)

PSEUDOCODE: Obstacle = [(4, xObst) for xObst in 1:3]

PSEUDOCODE: validState(s) = all(1 .<= s .<= dS) && s ∉ Obstacle # true iff the state s is within the maze boundaries AND s is not inside an obstacle

# dS – size of the maze, with dimensions dS x dS

# dA – number of actions

# Goal – the goal state (exit), kept in a collection so that s ∈ Goal works below.

PSEUDOCODE: function BuildMaze(dS, dA, Goal)

# dynamics tensor p(s′ | s, a) with dimensions dS x dS x dA x dS x dS x 1, where the

# dimensions are S₁, S₂, A, S₁′, S₂′, A′; e.g., S₂ is the second coordinate of the current

# state and S₁′ is the first coordinate of the state at the next time step. The trailing

# singleton A′ dimension is left implicit below, since broadcasting adds it automatically.

Ps′_sa = zeros(dS, dS, dA, dS, dS)

# the reward tensor, with the same dimensions as the dynamics;

# the reward is -1 at every state, and 0 at the Goal state.

Rs′_sa = -ones(dS, dS, dA, dS, dS)

# iterate over the valid states

for s in filter(validState, (x->x.I).(CartesianIndices((dS, dS))))

if s ∈ Goal

Ps′_sa[s..., :, s...] .= 1.0 # all the actions get prob 1 at the goal

Rs′_sa[s..., :, s...] .= 0.0 # all the actions get reward 0

continue

end

for a in Actions # iterate over (index => move) pairs; the same action set at each state

# if the next state is valid, move to it; otherwise stay in place

s′ = validState(s .+ a[2]) ? s .+ a[2] : s

Ps′_sa[s..., a[1], s′...] = 1.0

end

end

# sanity test: for all s, a: sum_s′ P(s′ | s, a) = 1

return Ps′_sa, Rs′_sa

end
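Example usage, including the sanity test in code (the exit corner below is an arbitrary choice; Obstacle must be defined before validState is used):

dS, dA = 10, 5
Goal = [(1, 1)]                          # the exit in a corner; a collection, so s ∈ Goal works
Ps′_sa, Rs′_sa = BuildMaze(dS, dA, Goal)

# sanity test: every (s, a) row of the dynamics is a probability distribution
for s in filter(validState, (x->x.I).(CartesianIndices((dS, dS)))), a in 1:dA
    @assert sum(Ps′_sa[s..., a, :, :]) ≈ 1.0
end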

### TASKS ###

### Task 1 ###

Build your maze with dimensions 10×10 and 3 fences, and the goal state (exit)

in one of the corners of the maze. Visualize the maze layout on a 2D plot.
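Hint: one possible visualization, assuming the Plots package (any plotting library works) and an Obstacle list extended to the 3 fences the task asks for:

using Plots

maze = zeros(dS, dS)                       # 0 = free cell
for s in Obstacle; maze[s...] = 1.0; end   # 1 = fence block
for s in Goal;     maze[s...] = 2.0; end   # 2 = exit
heatmap(maze, yflip=true)                  # first index (x) on the vertical axis, flipped so x grows downward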

Tasks 2-4 are for the model-based setting, where both p(s′ | s, a) and R(s′, s, a) are known.

### Task 2 ###

Implement the Policy Evaluation (PE) algorithm for a deterministic policy, π.

The dimensions of the policy π[a | s] are dS x dS x dA x 1 x 1 x 1.

Being deterministic, the policy puts probability 1 on a single action at every state.

E.g., a policy that selects Up at some_state has π[· | some_state] = [1, 0, 0, 0, 0]; see 'Actions' above.

E.g., a policy that selects Stay at another_state has π[· | another_state] = [0, 0, 0, 0, 1]; see 'Actions' above.

Evaluate a random deterministic policy, π, and plot its value function on a 2D plot.

Instructions:

The policy is stationary, which means π[a′ | s′] is permute_dimensions(π[a | s], dim1->dim4, dim2->dim5, dim3->dim6)

Use broadcasting ‘.*’, e.g.,

p(s′, a′ | s, a) = π[a′ | s′] .* p(s′ | s, a)

sum_s′ p(s′ | s, π[a|s]) .* V[s′], where V[s′] is the value of the next state, with dimensions 1 x 1 x 1 x dS x dS x 1.

The value of the current state has dimensions dS x dS x 1 x 1 x 1 x 1.

"V of the next state" is permute_dimensions("V of the current state", dim1->dim4, dim2->dim5)
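Hint: a minimal sketch of one PE implementation under these conventions; the discount γ, the fixed sweep count, and the variable names (π_sa, P6, R6) are assumptions, not requirements (π itself is a reserved constant in Julia):

γ = 0.9                                          # assumed discount factor
π_sa = zeros(dS, dS, dA, 1, 1, 1)                # one-hot action distribution per state
for s in filter(validState, (x->x.I).(CartesianIndices((dS, dS))))
    π_sa[s..., rand(1:dA), 1, 1, 1] = 1.0        # a random deterministic policy
end

P6 = reshape(Ps′_sa, dS, dS, dA, dS, dS, 1)      # make the 6-D layout explicit
R6 = reshape(Rs′_sa, dS, dS, dA, dS, dS, 1)
V = zeros(dS, dS, 1, 1, 1, 1)                    # value of the current state

for sweep in 1:200                               # fixed sweep count instead of a tolerance test
    V′ = permutedims(V, (4, 5, 3, 1, 2, 6))      # value of the next state: 1 x 1 x 1 x dS x dS x 1
    V .= sum(π_sa .* P6 .* (R6 .+ γ .* V′), dims=(3, 4, 5))   # expectation over a and s′
end

The backup sums over dims 3 (A), 4, and 5 (S₁′, S₂′), which is exactly the expectation over the action chosen by π and the next state.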

### Task 3 ###

Repeat Task 2, but manually set the optimal actions within a radius of 2 states around the goal state.

Explain your observations.

### Task 4 ###

Implement the Policy Iteration (PI) algorithm, and find the optimal policy π*.

Visualize the value function, V_i, on a 2D plot at 3 different iterations, i, of PI.

Explain your observations.
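Hint: a sketch of the greedy improvement step, reusing P6, R6, γ, and V from the Task 2 sketch (Q6 is an illustrative name):

V′ = permutedims(V, (4, 5, 3, 1, 2, 6))          # next-state view of the current value estimate
Q6 = sum(P6 .* (R6 .+ γ .* V′), dims=(4, 5))     # Q(s, a); dims dS x dS x dA x 1 x 1 x 1
π_sa .= 0.0                                      # rebuild the policy greedily w.r.t. Q6
for s in filter(validState, (x->x.I).(CartesianIndices((dS, dS))))
    π_sa[s..., argmax(Q6[s..., :, 1, 1, 1]), 1, 1, 1] = 1.0
end

PI alternates this improvement step with the evaluation from Task 2 until the policy stops changing.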

The next tasks are for the model-free setting, where neither p(s′ | s, a) nor R(s′, s, a) is known.

### Task 5 ###

Write a function, s′, r = step(s, a), that receives the current state, s, and the current action, a, and

returns the next state, s′, and reward, r.

Generate 10 trajectories from 10 different initial states, using a random policy.

Generate 10 trajectories from 10 different initial states, using the optimal policy from Task 4 above.

Explain your observations.
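Hint: one possible shape for step and for trajectory generation, assuming the Actions dict, validState, and Goal from the pseudocode above (rollout, randompolicy, and the step cap are illustrative names):

function step(s, a)
    s′ = validState(s .+ Actions[a]) ? s .+ Actions[a] : s   # invalid moves keep the agent in place
    r = s′ ∈ Goal ? 0.0 : -1.0                               # -1 per step, 0 at the exit
    return s′, r
end

function rollout(s, policy; maxsteps=200)        # maxsteps is an assumed safety cap
    traj = [s]
    while s ∉ Goal && length(traj) < maxsteps
        s, _ = step(s, policy(s))
        push!(traj, s)
    end
    return traj
end

using Random                                     # for shuffle
randompolicy(s) = rand(1:dA)                     # the random policy from the task
starts = shuffle(filter(validState, (x->x.I).(CartesianIndices((dS, dS)))))[1:10]
trajectories = [rollout(s0, randompolicy) for s0 in starts]   # 10 different initial states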

### Task 6 ###

Implement the Q-learning algorithm for the tabular case, where the Q function is given by a table.

Plot the accumulated reward as a function of the iteration number of the Q-learning algorithm

for 5 runs of Q-learning from scratch. Plot the average curve over the 5 runs of Q-learning,

and the variance (use fill_between). Explain your observations.
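Hint: a minimal sketch of the tabular loop, reusing step from Task 5 and γ from the Task 2 sketch; the learning rate α, the exploration rate ε, the start state, and the episode/step budgets are all assumptions:

α, ε = 0.1, 0.1
Q = zeros(dS, dS, dA)                            # tabular Q, indexed as Q[s..., a]
rewards = Float64[]
for episode in 1:500
    s = (dS, dS)                                 # assumed fixed start in the corner opposite the exit
    total = 0.0
    for t in 1:200
        a = rand() < ε ? rand(1:dA) : argmax(Q[s..., :])   # ε-greedy action selection
        s′, r = step(s, a)
        # off-policy target: bootstrap on the best next action
        Q[s..., a] += α * (r + γ * maximum(Q[s′..., :]) - Q[s..., a])
        total += r
        s = s′
        s ∈ Goal && break
    end
    push!(rewards, total)
end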

### Task 7 ###

Repeat Task 6 with the SARSA algorithm.

Explain your observations.
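Hint: the only change relative to the Q-learning sketch is the bootstrap target; SARSA is on-policy, so it bootstraps on the action it actually takes next:

a′ = rand() < ε ? rand(1:dA) : argmax(Q[s′..., :])      # the action actually taken next
Q[s..., a] += α * (r + γ * Q[s′..., a′] - Q[s..., a])   # on-policy target
s, a = s′, a′                                           # carry both state and action forward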

### Task 8 (bonus 20 points) ###

Repeat Tasks 6 and 7 with the DQN algorithm, where Q is represented by a neural network with a few

feedforward layers. Compare the results with and without Experience Replay, and with and without a

separate network for the target. Explain your observations.

What to submit:

1. runnable code that allows reproducing your results;

2. a PDF file with all the plots, explanations, and the code.

Please submit two separate files in Canvas, rather than a single ZIP.
