In this article, I am going to explain the terminology used in the paper “Human-level control through deep reinforcement learning,” which describes a deep Q-network used to play Atari games.
In reinforcement learning, an agent is an algorithm which observes the environment (gets an input which represents the state of that environment). The environment may be fully observable, in such a case, the agent has all information in every step, or it may be partially observable if some information is not available (even if the lack of availability is not permanent).
The goal of the agent is to maximize the reward it gets by performing actions that change the state of the environment.
The agent is supposed to learn the action-value function, which approximates the optimal action-value function. An action-value function is a function that given the state of the environment and the available actions, returns a sequence of actions that maximizes the total rewards.
The function does not need always to pick the action that maximizes the reward in the current step. Because of the discount factor in the Bellman equation, the action-value function may choose a worse action in the current step if it allows the agent to get a significantly higher reward in the future.
Optimal action-value function
The optimal action-value function is the perfect function which for any given state always returns the actions that give the agent the highest possible reward.
Deep Q-network uses deep learning to approximate the optimal action-value function. It is possible because neural networks are universal function approximators (given a large volume of training examples and sufficient time, a neural network can learn to approximate any mathematical function).
Greedy behavior policy
The actions that maximize the total reward are called behavior policy. The greedy behavior policy is a policy that always picks the action that maximizes the total reward. Such behavior seems to make sense, but it is susceptible to overfitting and getting stuck in a local maximum.
\(\varepsilon\)-greedy behavior policy
Duo to that problem, a modification of the greedy behavior policy is used. An epsilon greedy behavior policy picks the optimal action with probability \(1 - \varepsilon\), so in some cases, it may pick a behavior that is not optimal. When it does not pick the optimal action, it chooses the action randomly. This implementation allows the actor to explore the set of available actions and learn a better behavior policy.
We call such implementation off-policy learning because some of the training data is from data “off” the target policy (the data that is produced randomly).
The deep Q-network described in the paper “Human-level control through deep reinforcement learning” starts learning with epsilon set to 1, so at the beginning, all training examples are random. Later, it gradually changes the epsilon value to 0.1.
Such a configuration of epsilon hyperparameter allows the network to explore a wast space of options at the beginning of the training. Later, the random actions are used to modify behavior policy to improve them even more and avoid ending up in a local maximum of the action-value function.
Training the neural network
In reinforcement learning, there is no predefined training dataset, so generating it randomly at the beginning of training is the only way to get data for the training of a neural network.
In general, we can distinguish two phases of learning: exploration and exploitation. In the beginning, the algorithm explores the possibilities by making random choices. It allows the neural network to be trained on random data. Obviously, such a neural network does not make choices which result in a high reward, so the random exploration is continued on top of the behaviors already learned by the neural network.
Because the epsilon hyperparameter diminishes gradually until it reaches 0.1, the network starts using the learned action-value function to generate the behavior policy. When it happens, the network may start exploiting the action-value function by repeatedly choosing high-reward actions. Even at that stage, some random behavior is allowed to avoid overfitting.
To train the neural network, the authors of the paper mentioned above used two interesting techniques: termed experience replay and iterative update.
Termed experience replay
Termed experience replay is a method of storing the current state of the environment, the action taken by the actor, the received reward, and the subsequent state of the environment which is a result of the action picked by the actor. The algorithm stores those data in a collection called reply memory.
When the neural network is trained, the mini-batches of training examples are drawn randomly from the reply memory. The reply memory contains examples from multiple previous iterations (the authors of the paper stored the last million examples), so in every iteration, the mini-batch contains not only recently generated data, but also examples from the past. This method helps break the correlation between training examples and smooths the learning by averaging over diverse sample set.
When training a neural network to approximate the action-value function, we generate the training dataset labels using the previous version of that neural network (the one trained during the previous iteration).
Typically, the network used to generate labels is replaced by a new one in every iteration, but when iterative update technique is used the label generating network is updated every C iterations (the authors used 10000 as the value of C).
It creates a time delay between the network that generated labels and the network that is currently trained, so the learning averages over multiple iterations again. As a result, the technique helps to avoid oscillation and divergence (it is less likely to continue using a poor behavior policy if it gets randomly learned by the network).