---
title: "Using Boltzmann distribution as the exploration policy in TensorFlow-agent reinforcement learning models"
description: "There is a whole spectrum of exploration strategies between random and greedy policies."
author: "Bartosz Mikulski"
author_bio: "Principal AI Engineer & MLOps Architect. I bridge the gap between \"it works in a notebook\" and \"it works for 200 million users.\""
author_url: https://mikulskibartosz.name
author_linkedin: https://www.linkedin.com/in/mikulskibartosz/
author_github: https://github.com/mikulskibartosz
canonical_url: https://mikulskibartosz.name/using-boltzmann-distribution-as-exploration-policy-in-tensorflow-agent
---

In this article, I show how to use the Boltzmann policy in TensorFlow Agents (TF-Agents), how to configure it, and what behavior to expect from different temperature values.

## Use Boltzmann policy with DQN Agent

When we use the deep Q-network (DQN) agent as our reinforcement learning model, we can enable the Boltzmann policy by passing the `boltzmann_temperature` parameter to the `DqnAgent` constructor.

```python
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent

# tf_env is the environment implementation, q_net is the neural network used as the model

agent = dqn_agent.DqnAgent(
    tf_env.time_step_spec(),
    tf_env.action_spec(),
    q_network=q_net,
    boltzmann_temperature=0.8,  # <-- this parameter configures Boltzmann policy
    # tf.compat.v1 keeps the TF1-style optimizer available under TensorFlow 2
    optimizer=tf.compat.v1.train.AdamOptimizer(0.001))
```

Remember that **we cannot use the `epsilon_greedy` and `boltzmann_temperature` parameters together** because they configure two different exploration methods.

The `DqnAgent` code contains the following if statement:

```python
# DQNAgent implementation in Tensorflow-Agents
# https://github.com/tensorflow/agents/blob/a155216ded2ad151359c6f719149aacc9503b5f5/tf_agents/agents/dqn/dqn_agent.py#L285
if boltzmann_temperature is not None:
    collect_policy = boltzmann_policy.BoltzmannPolicy(
        policy, temperature=self._boltzmann_temperature)
else:
    collect_policy = epsilon_greedy_policy.EpsilonGreedyPolicy(
        policy, epsilon=self._epsilon_greedy)
```

We see that `boltzmann_temperature` is used to create the exploration policy object (called `collect_policy` in the TF-Agents code). When the temperature is not set, the agent falls back to the epsilon-greedy policy.

## How does it work

While exploring, the agent creates an action distribution. This distribution **describes how optimal an action is according to the data gathered by the agent**. If you want, you can say that the action distribution describes the agent's belief about the optimal action.

In the Boltzmann policy implementation, the **logits of the original action distribution get divided by the temperature parameter**. Because of that, the Boltzmann policy turns the agent's exploration behavior into a **spectrum between picking the action randomly (random policy) and always picking the best action (greedy policy)**.

```python
# BoltzmannPolicy implementation in Tensorflow-Agents
# https://github.com/tensorflow/agents/blob/a155216ded2ad151359c6f719149aacc9503b5f5/tf_agents/policies/boltzmann_policy.py#L67
def _apply_temperature(self, dist):
    """Change the action distribution to incorporate the temperature."""
    logits = dist.logits / self._get_temperature_value()
    return dist.copy(logits=logits)
```
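To see what dividing the logits does to the action probabilities, here is a minimal pure-Python sketch. It is not TF-Agents code: `boltzmann_probabilities` is a hypothetical helper, the logits are made up, and the sketch assumes a categorical action distribution whose probabilities are the softmax of the logits.

```python
import math


def boltzmann_probabilities(logits, temperature):
    """Softmax of logits / temperature (hypothetical helper, not part of TF-Agents)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    highest = max(scaled)
    weights = [math.exp(value - highest) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


# Made-up logits for three actions.
logits = [2.0, 1.0, 0.5]
original = boltzmann_probabilities(logits, temperature=1.0)
with_temperature = boltzmann_probabilities(logits, temperature=0.8)
print([round(p, 3) for p in original])
print([round(p, 3) for p in with_temperature])
```

With a temperature of 1.0 the distribution stays unchanged. A temperature below 1.0, such as the 0.8 used in the agent configuration, gives the best action a larger share of the probability.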

If we specify a very small temperature value, the differences between the logits grow, so the action with the highest probability becomes even more likely to be selected.
**If the temperature parameter is very close to zero, it turns the Boltzmann policy into a greedy policy** because the most probable action gets selected almost every time.

On the other hand, a **huge value of the temperature parameter** shrinks all the logits toward zero and washes out the original action distribution. As a result, there are almost no differences between the probabilities, and we end up with a **random policy**.
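Both extremes can be checked by sampling actions and counting how often each one gets picked. The snippet below is a rough sketch that uses only the Python standard library. The helpers `sample_action` and `action_counts`, the logits, and the temperature values are all assumptions made for illustration and are not taken from TF-Agents.

```python
import math
import random
from collections import Counter


def sample_action(logits, temperature, rng):
    """Sample an action index with probability proportional to exp(logit / temperature)."""
    scaled = [logit / temperature for logit in logits]
    # Subtract the maximum so the largest weight is exactly 1.0.
    highest = max(scaled)
    weights = [math.exp(value - highest) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def action_counts(logits, temperature, samples=10_000, seed=0):
    """Count how many times each action is selected over a number of samples."""
    rng = random.Random(seed)
    counts = Counter(sample_action(logits, temperature, rng) for _ in range(samples))
    return [counts[action] for action in range(len(logits))]


# Made-up logits for three actions; the first action is the best one.
logits = [2.0, 1.0, 0.5]
nearly_greedy = action_counts(logits, temperature=0.01)
nearly_random = action_counts(logits, temperature=100.0)
print(nearly_greedy)
print(nearly_random)
```

With the tiny temperature, the counts concentrate on the first action. With the huge temperature, each of the three actions is picked roughly a third of the time.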