CS 747: Programming

Assignment 2

In this assignment, you will implement algorithms

for finding an optimal policy for a given MDP. The

first part of the assignment is to apply Linear

Programming, based on the formulation presented

in class. The second part of the assignment is to

implement Howard’s Policy Iteration. The third part

of the assignment requires you to apply these

solvers to a variant of the Gambler’s Problem

(described in the textbook by Sutton and Barto

(2018, see Example 4.3)): in fact your job is

essentially to formulate this task as an MDP.

Data

This directory provides a few samples of input and

output that you can use to test your code. The

directory contains four MDPs encoded as text files,

with each file in the following format.

Number of states

Number of actions

Reward function

Transition function

Discount factor

Type

In these files, and also in the MDPs on which your

algorithms will be tested, the number of states S

will be an integer greater than 0 and less than 150.

Assume that the states are numbered 0, 1, 2, …, (S

– 1). Similarly, actions will be numbered 0, 1, 2, …,

(A – 1), where A is less than 100. The reward

function will be provided over S×A lines, each line

containing S entries. Each entry corresponds to R(s,

a, s’), wherein state s, action a, and state s’ are being

iterated in sequence from 0 to (S – 1), 0 to (A – 1),

and 0 to (S – 1), respectively. A similar scheme is

CS 747: Programming Assignment 2 https://www.cse.iitb.ac.in/~shivaram/teaching/c…

1 of 8 01/09/18, 18:37

adopted for the transition function T. Each reward

lies between -1 and 1 (both included). The discount

factor is a real number in [0, 1]. However, the

discount factor will only be set to 1 if the

underlying task is episodic. The last field in the file,

denoted Type, will either be “continuing” or

“episodic”. If episodic, it is our convention that the

very last state (numbered S – 1) will be a terminal

state. The MDP will be such that for every starting

state and policy, trajectories will eventually reach

the terminal state.

Below is a snippet of python code that is used to

generate MDP files.

print S

print A

for s in range(0, S):

for a in range(0, A):

for sPrime in range(0, S):

print str(R[s][a][sPrime]) + “\t”,

print “\n”,

for s in range(0, S):

for a in range(0, A):

for sPrime in range(0, S):

print str(T[s][a][sPrime]) + “\t”,

print “\n”,

print gamma

print type

Solution

Given an MDP, your program must compute the

optimal value function V* and an optimal policy π*

by applying the algorithm that is specified through

the command line. Create a shell script called

planner.sh to invoke your program. The arguments

to planner.sh will be

–mdp followed by a full path to the input

MDP file, and

–algorithm followed by one of lp and hpi.

Make no assumptions about the location of the

CS 747: Programming Assignment 2 https://www.cse.iitb.ac.in/~shivaram/teaching/c…

2 of 8 01/09/18, 18:37

MDP file relative to the current working directory;

read it in from the full path that will be provided.

The algorithms specified above correspond to

Linear Programming and Howard’s Policy Iteration,

respectively. Here are a few examples of how your

planner might be invoked (it will always be invoked

from its own directory).

./planner.sh –mdp /home/user/temp/data

/mdp-7.txt –algorithm lp

./planner.sh –mdp /home/user/mdpfiles

/mdp-5.txt –algorithm hpi

You are free to implement the planner in any

programming language of your choice. You are not

expected to code up a solver for LP; rather, you can

use available solvers as blackboxes (more below).

Your effort will be in providing the LP solver the

appropriate input based on the MDP, and

interpreting its output appropriately. You are

expected to write your own code for Howard’s

Policy Iteration; you may not use any libraries that

might be available for the purpose.

Output Format

The output of your planner must be in the following

format, and written to standard output.

V*(0) π*(0)

V*(1) π*(1)

.

.

.

V*(S – 1) π*(S – 1)

In the data directory provided, you will find four

output files corresponding to the MDP files, which

have solutions in the format above.

Since your output will be checked automatically,

make sure you have nothing printed to stdout other

than the S lines as above in sequence. If the testing

code is unable to parse your output, you will not

receive any marks.

CS 747: Programming Assignment 2 https://www.cse.iitb.ac.in/~shivaram/teaching/c…

3 of 8 01/09/18, 18:37

Note:

Your output has to be written to

the standard output, not to any

file.

1.

For values, print at least 6 places

after the decimal point. Print

more if you’d like, but 6

(xxx.123456) will suffice.

2.

If your code produces output that

resembles the solution files: that

is, S lines of the form

value + “\t” + action + “\n”

or even

value + ” ” + action + “\n”

you should be okay. Make sure

you don’t print anything else.

3.

If there are multiple optimal

policies, feel free to print any

one of them.

4.

Gambler’s Problem

For a full description of the Gambler’s Problem, see

Example 4.3 in the textbook by Sutton and Barto

(2018). As in the Exercise 4.9, your objective is to

prepare plots of the optimal value function for

different values of ph.

Rather than write separate code for the Gambler’s

Problem, you will merely encode the problem as an

MDP, on which your Linear Programming and

Policy Iteration algorithms have already been

designed to work. You must create a script called

encodeGambler.sh, which will take ph as input, and

write out an MDP (in the same format as described

above) to standard output. We will invoke your

program as follows.

./encodeGambler.sh ph > mdpFileName

CS 747: Programming Assignment 2 https://www.cse.iitb.ac.in/~shivaram/teaching/c…

4 of 8 01/09/18, 18:37

Here ph will be a number between 0 and 1 (both

excluded), and mdpFileName will be the full path to a

file into which your MDP gets written. Thereafter

we will run your policy iteration code on the same

MDP, as follows, and manually inspect the values

printed out.

./planner.sh –algorithm hpi

The challenge of this exercise is for you to be able

to pose the Gambler’s Problem using the MDP

format we have adopted. Your formulation must be

mathematically correct: that is, optimal values in

the MDP must indeed be the Gambler’s maximum

expected profits from corresponding states. You do

have some leeway in terms of deciding what your

states, actions, rewards, and transition probabilities

must be. If you run into technical difficulties on

account of the discount factor or the type of the

task, use your understanding of the task to make

suitable modeling choices. Make sure you

document all these choices in your report.

Apart from fully describing the rationale behind

your encoding in your report, include graphs for ph

values of 0.2, 0.4, 0.6, and 0.8. Interpret the results:

do the graphs agree with your intuition?

Submission

You will submit three items: (1) working code for

your planner, which implements the two different

algorithms, (2) working code for encoding the

Gambler’s Problem as an MDP, and (3) a report

describing your MDP encoding, results, and

observations on the Gambler’s Problem.

Create a directory titled [rollnumber]. Place all

your source and executable files in this directory.

The directory must contain scripts titled planner.sh

and encodeGambler.sh, which must take in the

command line arguments specified above, and

produce the output also as specified. Also place

CS 747: Programming Assignment 2 https://www.cse.iitb.ac.in/~shivaram/teaching/c…

5 of 8 01/09/18, 18:37

report.pdf in the same directory.

Before you submit, make sure you can successfully

run planner.sh and encodeGambler.sh on the

departmental (sl2) machines. Provide references to

any libraries and code snippets you have utilised

(mention them in report.pdf). It is okay to use

libraries for data structures and for operations such

as sorting. You may also use libraries for solving

systems of linear equations. However, the logic

used for policy improvement, and for translating the

given MDP into a linear program, and for encoding

the Gambler’s Problem, must entirely be code that

you have written.

Compress and submit your directory as

[rollnumber].tar.gz. The directory must contain

all the sources, executables, and experimental data,

and importantly, your planner.sh and

encodeGambler.sh scripts and report.pdf file.

Make sure you upload your submission to Moodle

by the submission deadline.

Evaluation

Your planner will be tested on a large number of

MDPs. Your task is to ensure that it prints out the

correct solution (V* and π*) in each case, using

each of the algorithms you have been asked to

implement. 3 marks each are allotted for the

correctness of your Linear Programming and

Howard’s Policy Iteration algorithms. 2 marks are

allotted for the correctness of your encoding of

Gambler’s Problem, and another 2 marks for your

results and observations.

The TAs and instructor may look at your source

code to corroborate the results obtained by your

program, and may also call you to a face-to-face

session to explain your code.

Deadline and Rules

CS 747: Programming Assignment 2 https://www.cse.iitb.ac.in/~shivaram/teaching/c…

6 of 8 01/09/18, 18:37

Your submission is due by 11.55 p.m., Sunday,

September 9. You are advised to finish working on

your submission well in advance, keeping enough

time to test it on the sl2 machines and upload to

Moodle. Your submission will not be evaluated

(and will be given a score of zero) if it is not

received by the deadline.

Before submission, make sure that your code runs

for a variety of experimental conditions. Test your

code on the sl2 machines even while you are

developing it: do not postpone this step to the last

minute. If your code requires any special libraries

to run, it is your responsibility to get those libraries

working on the sl2 machines (go through the CSE

bug tracking system to make a request to the system

administrators). Make sure that you upload the

intended version of your code to Moodle (after

uploading, download your submission and test it on

the sl2 machines to make sure it is the correct

version). You will not be allowed to alter your code

in any way after the submission deadline. In short:

your grade will be completely determined by your

submission on Moodle at the time of the deadline.

Play safe by having it uploaded and tested at least a

few hours in advance.

You must work alone on this assignment. Do not

share any code (whether yours or code you have

found on the Internet) with your classmates. Do not

discuss the design of your solution with anybody

else. Do not see anybody else’s code or report.

References for Linear Programming

Although you are free to use any library of your

choice for LP, we recommend that you use the

Python library PuLP (https://pythonhosted.org

/PuLP/) or the lp_solve program

(http://lpsolve.sourceforge.net/5.5/). Both of these

are already installed on the sl2 machines.

CS 747: Programming Assignment 2 https://www.cse.iitb.ac.in/~shivaram/teaching/c…

7 of 8 01/09/18, 18:37

PuLP is convenient to use directly from Python

code: here is a short tutorial and here is a reference.

lp_solve can be used both through an API and

through the command line. Here is a reference and

here is an introductory example.

CS 747: Programming Assignment 2 https://www.cse.iitb.ac.in/~shivaram/teaching/c…

8 of 8 01/09/18, 18:37

## Reviews

There are no reviews yet.