
Assignment 4: Deep Q-Network (DQN)



Reinforcement Learning Assignment 4

1 Introduction
The goal of this assignment is to experiment with the deep Q-network (DQN), which combines Q-learning with a neural network function approximator. In classical Q-learning methods, representing the action-value function Q becomes intractable as the state space and action space grow. DQN builds on the success of deep learning and has achieved a super-human level of play in Atari games. Your goal is to implement the DQN algorithm and one of its improved variants, and to evaluate them in some classical RL control scenarios.
2 Deep Q-learning
The following description of the training procedure is excerpted from the original DQN paper (Mnih et al., 2015). When the state and action spaces are large, a nonlinear function approximator is used instead, such as a neural network. We refer to a neural network function approximator with weights θ as a Q-network. A Q-network can be trained by adjusting the parameters θ_i at iteration i to reduce the mean-squared error in the Bellman equation, where the optimal target values r + γ max_{a'} Q*(s', a') are substituted with approximate target values y = r + γ max_{a'} Q(s', a'; θ_i^−), using parameters θ_i^− from some previous iteration. This leads to a sequence of loss functions L_i(θ_i) that changes at each iteration i,

L_i(θ_i) = E_{s,a,r}[ (E_{s'}[y | s, a] − Q(s, a; θ_i))^2 ]
         = E_{s,a,r,s'}[ (y − Q(s, a; θ_i))^2 ] + E_{s,a,r}[ Var_{s'}[y] ].
Note that the targets depend on the network weights; this is in contrast with the targets used for supervised learning, which are fixed before learning begins. At each stage of optimization, we hold the parameters from the previous iteration θ_i^− fixed when optimizing the ith loss function L_i(θ_i), resulting in a sequence of well-defined optimization problems. The final term is the variance of the targets, which does not depend on the parameters θ_i that we are currently optimizing, and may therefore be ignored. Differentiating the loss function with respect to the weights we arrive at the following gradient:

∇_{θ_i} L_i(θ_i) = E_{s,a,r,s'}[ (r + γ max_{a'} Q(s', a'; θ_i^−) − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ].
Rather than computing the full expectations in the above gradient, it is often computationally expedient to optimize the loss function by stochastic gradient descent. The familiar Q-learning algorithm19 can be recovered in this framework by updating the weights after every time step, replacing the expectations using single samples, and setting θ_i^− = θ_{i−1}.
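To make the sampled gradient step concrete, here is a minimal sketch in PyTorch (one of the frameworks suggested in Section 3). The network architecture, the function name dqn_loss, and the minibatch tensors (states, actions, rewards, next_states, dones) are illustrative assumptions, not part of the paper; the loss is the single-sample version of L_i(θ_i) above.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small fully connected Q-network: state -> one Q-value per action."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean squared TD error (y - Q(s, a; theta))^2 on a sampled minibatch."""
    states, actions, rewards, next_states, dones = batch
    # Q(s, a; theta) for the actions that were actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # y = r + gamma * max_a' Q(s', a'; theta^-); y = r at terminal states.
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * max_next_q
    return ((targets - q_sa) ** 2).mean()
```

A single stochastic gradient descent step then consists of optimizer.zero_grad(), loss.backward(), and optimizer.step() with any torch.optim optimizer (the original paper uses RMSProp).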
Note that this algorithm is model-free: it solves the reinforcement learning task directly using samples from the emulator, without explicitly estimating the reward and transition dynamics P(r, s' | s, a). It is also off-policy: it learns about the greedy policy a = argmax_{a'} Q(s, a'; θ), while following a behaviour distribution that ensures adequate exploration of the state space. In practice, the behaviour distribution is often selected by an ε-greedy policy that follows the greedy policy with probability 1 − ε and selects a random action with probability ε.
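As an illustration of the ε-greedy behaviour policy described above, the following sketch selects actions from a Q-network; the function name, q_net, and n_actions are assumptions for the example.

```python
import random
import torch

def epsilon_greedy(q_net, state, n_actions, epsilon):
    """Act greedily w.r.t. Q with probability 1 - epsilon, randomly otherwise."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q_values.argmax(dim=1).item())
```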
Training algorithm for deep Q-networks. The full algorithm for training deep Q-networks is presented in Algorithm 1. The agent selects and executes actions according to an ε-greedy policy based on Q. Because using histories of arbitrary length as inputs to a neural network can be difficult, our Q-function instead works on a fixed-length representation of histories produced by the function φ described above. The algorithm modifies standard online Q-learning in two ways to make it suitable for training large neural networks without diverging.
First, we use a technique known as experience replay23 in which we store the agent's experiences at each time step, e_t = (s_t, a_t, r_t, s_{t+1}), in a data set D_t = {e_1, …, e_t}, pooled over many episodes (where the end of an episode occurs when a terminal state is reached) into a replay memory. During the inner loop of the algorithm, we apply Q-learning updates, or minibatch updates, to samples of experience, (s, a, r, s') drawn at random from the pool of stored samples. This approach has several advantages over standard online Q-learning. First, each step of experience is potentially used in many weight updates, which allows for greater data efficiency. Second, learning directly from consecutive samples is inefficient, owing to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. Third, when learning on-policy the current parameters determine the next data sample that the parameters are trained on. For example, if the maximizing action is to move left then the training samples will be dominated by samples from the left-hand side; if the maximizing action then switches to the right then the training distribution will also switch. It is easy to see how unwanted feedback loops may arise and the parameters could get stuck in a poor local minimum, or even diverge catastrophically20. By using experience replay the behaviour distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters. Note that when learning by experience replay, it is necessary to learn off-policy (because our current parameters are different to those used to generate the sample), which motivates the choice of Q-learning.
In practice, our algorithm only stores the last N experience tuples in the replay
memory, and samples uniformly at random from D when performing updates. This
approach is in some respects limited because the memory buffer does not differentiate important transitions and always overwrites with recent transitions owing
to the finite memory size N. Similarly, the uniform sampling gives equal importance to all transitions in the replay memory. A more sophisticated sampling strategy might emphasize transitions from which we can learn the most, similar to
prioritized sweeping30.
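A replay memory with exactly the behaviour described above (keep only the last N transitions, sample minibatches uniformly at random) can be sketched as follows; the class name and tuple layout are illustrative assumptions, and prioritized sampling is deliberately left out.

```python
import random
from collections import deque

import numpy as np
import torch

class ReplayMemory:
    """Fixed-capacity buffer: old transitions are discarded once N is exceeded."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniformly sample a minibatch and stack it into tensors."""
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.as_tensor(np.array(states), dtype=torch.float32),
                torch.as_tensor(actions, dtype=torch.int64),
                torch.as_tensor(rewards, dtype=torch.float32),
                torch.as_tensor(np.array(next_states), dtype=torch.float32),
                torch.as_tensor(dones, dtype=torch.float32))

    def __len__(self):
        return len(self.buffer)
```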
The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the targets y_j in the Q-learning update. More precisely, every C updates we clone the network Q to obtain a target network Q̂ and use Q̂ for generating the Q-learning targets y_j for the following C updates to Q. This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases Q(s_t, a_t) often also increases Q(s_{t+1}, a) for all a and hence also increases the target y_j, possibly leading to oscillations or divergence of the policy. Generating the targets using an older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets y_j, making divergence or oscillations much more unlikely.
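In code, the target network is simply a second copy of the Q-network whose weights are overwritten every C gradient updates. A minimal PyTorch sketch, where the function name and the clone period C are assumptions:

```python
import torch.nn as nn

def sync_target(q_net: nn.Module, target_net: nn.Module) -> None:
    """Clone Q into Q^hat: copy the online weights theta into the target network."""
    target_net.load_state_dict(q_net.state_dict())

# Illustrative usage inside the training loop, with C a tunable clone period:
#   if gradient_updates % C == 0:
#       sync_target(q_net, target_net)
```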
We also found it helpful to clip the error term from the update, r + γ max_{a'} Q(s', a'; θ_i^−) − Q(s, a; θ_i), to be between −1 and 1. Because the absolute value loss function |x| has a derivative of −1 for all negative values of x and a derivative of 1 for all positive values of x, clipping the squared error to be between −1 and 1 corresponds to using an absolute value loss function for errors outside of the (−1, 1) interval. This form of error clipping further improved the stability of the algorithm.
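This clipping is commonly implemented with the Huber (smooth-L1) loss, whose gradient with respect to Q equals (up to a constant factor) the TD error clamped to [−1, 1]. A hedged sketch in PyTorch, with q_sa and targets as in the earlier loss example:

```python
import torch.nn.functional as F

def clipped_td_loss(q_sa, targets):
    """Huber / smooth-L1 loss: quadratic for |error| < 1, absolute value outside,
    so the gradient w.r.t. Q is the TD error clipped to [-1, 1]."""
    return F.smooth_l1_loss(q_sa, targets)

# Equivalent explicit form of the clipped error term (illustrative):
#   error = (targets - q_sa).clamp(-1.0, 1.0)
```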
Algorithm 1: deep Q-learning with experience replay.
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights θ
Initialize target action-value function Q̂ with weights θ^− = θ
For episode = 1, M do
    Initialize sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
    For t = 1, T do
        With probability ε select a random action a_t
        otherwise select a_t = argmax_a Q(φ(s_t), a; θ)
        Execute action a_t in emulator and observe reward r_t and image x_{t+1}
        Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
        Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
        Sample random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
        Set y_j = r_j                                        if episode terminates at step j + 1
            y_j = r_j + γ max_{a'} Q̂(φ_{j+1}, a'; θ^−)        otherwise
        Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))^2 with respect to the network parameters θ
        Every C steps reset Q̂ = Q
    End For
End For
Figure 1: Deep Q-learning with experience replay
You can refer to the original paper for the details of DQN: Mnih et al., "Human-level control through deep reinforcement learning," Nature 518(7540), 529–533 (2015).
3 Experiment Description
• Programming language: Python 3
• You should compare the performance of DQN and one of its improved variants, and test them in a classical RL control environment, MountainCar. OpenAI Gym provides this environment, which is implemented in Python (https://gym.openai.com/envs/MountainCar-v0/). Gym also provides other, more complex environments such as Atari games and MuJoCo. Since the state is abstracted into the car's position, convolutional layers are not necessary in this experiment. You can get started with OpenAI Gym by following this link (https://gym.openai.com/docs/); a minimal interaction sketch is given after this list. Note that it is suggested to implement your neural network with TensorFlow or PyTorch.
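To get started with the environment, here is a minimal interaction sketch using the classic Gym API; the random action is where your ε-greedy DQN policy would plug in. The reset()/step() signature shown matches Gym releases from around the time of this assignment; newer Gym/Gymnasium versions return additional values.

```python
import gym

env = gym.make("MountainCar-v0")
print(env.observation_space)  # Box(2,): car position and velocity
print(env.action_space)       # Discrete(3): push left, no push, push right

for episode in range(3):
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()  # replace with your epsilon-greedy DQN policy
        obs, reward, done, info = env.step(action)
        total_reward += reward
    print(f"episode {episode}: return {total_reward}")

env.close()
```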
4 Report and Submission
• Your report and source code should be compressed into a single archive named "studentID+name+assignment4".
• The files should be submitted on Canvas before Apr. 30, 2021.
