---
title: "How to use a behavior policy with Tensorflow Agents"
description: "Random and scripted behavior policies"
author: "Bartosz Mikulski"
author_bio: "Principal AI Engineer & MLOps Architect. I bridge the gap between \"it works in a notebook\" and \"it works for 200 million users.\""
author_url: https://mikulskibartosz.name
author_linkedin: https://www.linkedin.com/in/mikulskibartosz/
author_github: https://github.com/mikulskibartosz
canonical_url: https://mikulskibartosz.name/how-to-use-a-behavior-policy-with-tensorflow-agents
---

In this blog post, I show an example of configuring a basic behavior policy and using it to interact with a TensorFlow Agents environment.

As examples, I will use two policies: a random policy and a scripted policy. Note that using a neural network as the behavior policy is out of scope for this article.

I have already defined the environment in the previous article. Please copy the imports, the environment class, and the initialization code from that text. In addition, you will need the following imports:

```python
from tf_agents.policies import random_py_policy
from tf_agents.policies import scripted_py_policy
```

The environment represents a bare-bones version of TicTacToe. I limited the game to a single player, whose only goal is to fill the board without picking the same spot twice.
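
For reference, here is a minimal sketch of the setup the snippets below assume (TicTacToeEnvironment is a placeholder name for the environment class from the previous article):

```python
import numpy as np

from tf_agents.environments import tf_py_environment
from tf_agents.specs import array_spec

# TicTacToeEnvironment stands in for the PyEnvironment subclass
# defined in the previous article.
tf_env = tf_py_environment.TFPyEnvironment(TicTacToeEnvironment())
```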

## Random policy

In the previous article, I used randomly chosen actions, but I picked them manually. This time, I will implement picking a random action as a TF-Agents policy.

```python
# An action is a scalar index of the board spot to pick (0-8).
action_spec = array_spec.BoundedArraySpec((), np.int32, minimum=0, maximum=8)
random_policy = random_py_policy.RandomPyPolicy(time_step_spec=None, action_spec=action_spec)
```
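
Before plugging the policy into the environment, it is worth peeking at what it returns. Because I passed time_step_spec=None, the policy ignores the time step, so it accepts None:

```python
# action() returns a PolicyStep named tuple with action, state, and info fields.
policy_step = random_policy.action(None)
print(policy_step.action)  # a random spot index between 0 and 8
```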

Now, I can use the policy to interact with the environment. The only difference between the previous implementation and the following code is that instead of calling the tf.random_uniform function, I call the action function of the random policy and pass the action field of the returned PolicyStep to the environment.

```python
time_step = tf_env.reset()
rewards = []
steps = []
num_episodes = 10000

for _ in range(num_episodes):
  episode_reward = 0
  episode_steps = 0
  tf_env.reset()
  while not tf_env.current_time_step().is_last():
    # action() returns a PolicyStep; the chosen action is in its action field.
    action = random_policy.action(tf_env.current_time_step())
    next_time_step = tf_env.step(action.action)
    episode_steps += 1
    episode_reward += next_time_step.reward.numpy()
  rewards.append(episode_reward)
  steps.append(episode_steps)

num_steps = np.sum(steps)
avg_length = np.mean(steps)
avg_reward = np.mean(rewards)
max_reward = np.max(rewards)
max_length = np.max(steps)

print('num_episodes:', num_episodes, 'num_steps:', num_steps)
print('avg_length', avg_length, 'avg_reward:', avg_reward)
print('max_length', max_length, 'max_reward:', max_reward)
```

Result:

```
num_episodes: 10000 num_steps: 44350
avg_length 4.435 avg_reward: -0.6551
max_length 9 max_reward: 1.8000001
```

## Scripted policy

Because I have oversimplified the rules of the game, a simple scripted policy can get the maximal reward every time: it just needs to pick the spots sequentially, one after another. The following scripted policy represents that behavior.

```python
action_spec = array_spec.BoundedArraySpec((), np.int32, minimum=0, maximum=8)
# Each entry is a (number_of_repeats, action) tuple.
action_script = [(1, np.int32(0)),
                 (1, np.int32(1)),
                 (1, np.int32(2)),
                 (1, np.int32(3)),
                 (1, np.int32(4)),
                 (1, np.int32(5)),
                 (1, np.int32(6)),
                 (1, np.int32(7)),
                 (1, np.int32(8))]

scripted_policy = scripted_py_policy.ScriptedPyPolicy(
    time_step_spec=None, action_spec=action_spec, action_script=action_script)
```

Note two things. The script is represented as a list of actions, and every action is a tuple. The first element of the tuple is the number of times the action should be repeated. I want to perform every action exactly once, so I put 1 in every tuple.

The second element is the action itself, which must have the same shape as the action specification in the first line. I specified the shape as an empty tuple, which is a special value that denotes a scalar (a single numerical value). Therefore, to create the scalar, I call a NumPy constructor that returns a value of the given type (in this case, int32).
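
Because every entry follows the same pattern, the script can also be generated in a loop, and the repeat count makes it possible to compress consecutive identical actions. A quick sketch (the second script merely illustrates the format; in this environment, repeating a taken spot would presumably end the episode early):

```python
# The same script, generated programmatically: every spot exactly once.
action_script = [(1, np.int32(spot)) for spot in range(9)]

# Using the repeat count: a script that picks spot 0 nine times in a row.
stubborn_script = [(9, np.int32(0))]
```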

### Using a scripted policy

A scripted policy instance does not internally track its position in the script. That information is carried in the policy state, which I have to pass around explicitly, so I must modify the code I use to test the policy.

```python
time_step = tf_env.reset()
rewards = []
steps = []
num_episodes = 10000

for _ in range(num_episodes):
  episode_reward = 0
  episode_steps = 0
  tf_env.reset()
  policy_state = scripted_policy.get_initial_state() #1
  while not tf_env.current_time_step().is_last():
    action = scripted_policy.action(tf_env.current_time_step(), policy_state) #2
    next_time_step = tf_env.step(action.action) #3
    policy_state = action.state #4
    episode_steps += 1
    episode_reward += next_time_step.reward.numpy()
  rewards.append(episode_reward)
  steps.append(episode_steps)

num_steps = np.sum(steps)
avg_length = np.mean(steps)
avg_reward = np.mean(rewards)
max_reward = np.max(rewards)
max_length = np.max(steps)

print('num_episodes:', num_episodes, 'num_steps:', num_steps)
print('avg_length', avg_length, 'avg_reward:', avg_reward)
print('max_length', max_length, 'max_reward:', max_reward)
```

First, I call the get_initial_state function and store the policy state in a variable (#1). In the inner loop, I pass both the current time step and the policy state to the policy (#2). Its action function returns not just the next action but a PolicyStep object that contains the action (#3) and the new policy state (#4).
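
To see the state threading in isolation, here is a minimal check. As with the random policy, the time step can be None because time_step_spec is None; the printed state is whatever bookkeeping ScriptedPyPolicy uses to track its position in the script:

```python
policy_state = scripted_policy.get_initial_state()
for _ in range(3):
  step = scripted_policy.action(None, policy_state)
  print('action:', step.action, 'state:', step.state)
  policy_state = step.state  # thread the state into the next call
```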