What reward-based algorithms teach us about … US!

blockimperiumgames
2 min read · Jun 6, 2021

Experiment

In this experiment, I established a maze environment in Unity and used Unity ML-Agents to navigate it. To reduce learning time, Unity NavMeshes were used so the agent could move readily from waypoint to waypoint. The goal was not to teach the agent to navigate, but to see how well it would adapt to a dynamic environment.

The agent was trained for 50,000 epochs with the same set of rewards. After 50k epochs, the rewards were drastically altered: some paths that had previously led to rewards now led to penalties, and other paths were randomly given larger rewards.
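
To make the manipulation concrete, here is a minimal Python sketch of the kind of reward-table change described above. The waypoint names, reward values, and flip/boost probabilities are hypothetical placeholders for illustration, not numbers from the actual Unity project.

```python
import random

# Hypothetical reward received for reaching each waypoint; values are placeholders.
rewards = {"A": 0.1, "B": 0.5, "C": 1.0, "D": 0.2}

def perturb_rewards(rewards, flip_prob=0.5, boost_prob=0.25, seed=None):
    """Flip some rewards into penalties and randomly boost others,
    mimicking the change made after the initial 50k epochs."""
    rng = random.Random(seed)
    altered = {}
    for waypoint, value in rewards.items():
        roll = rng.random()
        if roll < flip_prob:
            altered[waypoint] = -value        # a previously rewarding path now penalizes
        elif roll < flip_prob + boost_prob:
            altered[waypoint] = value * 3.0   # some paths randomly get larger rewards
        else:
            altered[waypoint] = value
    return altered

print(perturb_rewards(rewards, seed=7))
```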

Learning

I have been running some fairly long-term reinforcement learning experiments to determine whether a network that has learned a mechanism for maximizing rewards can 'retrain' itself in a similar environment where the value of the rewards changes. I found effectively what I expected.

When modeling an environment where the value of intrinsic rewards is dynamic, if there is a long training period (in terms of epochs) before the rewards change, the agents tend to carry their existing reward-seeking behavior into the altered environment. Effectively, they appear to learn one way of doing things that works, but when the environment changes they don't readily adapt; they continue their previous reward-seeking behavior, which is now less optimal. They will only deviate if they are forced, randomly (and this is the key), to choose a new path. If that randomization is relatively low, they will not discover that a new way of doing things yields better rewards.
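
A toy stand-in makes the point. The sketch below is not the Unity agent; it is a two-armed epsilon-greedy learner whose value estimates were trained on the old rewards, and it measures how long the learner keeps exploiting the stale choice after the payoffs flip. The reward numbers and learning rate are assumptions chosen purely for illustration.

```python
import random

def steps_until_switch(epsilon, stale_q=(1.0, 0.1), true_r=(0.2, 1.0),
                       lr=0.1, max_steps=100_000, seed=0):
    """Arm 0 is the old best path (now worth only 0.2); arm 1 is the new best (1.0).
    Returns how many steps pass before the greedy preference flips to arm 1."""
    rng = random.Random(seed)
    q = list(stale_q)
    for step in range(1, max_steps + 1):
        a = rng.randrange(2) if rng.random() < epsilon else q.index(max(q))
        q[a] += lr * (true_r[a] - q[a])   # incremental value update
        if q[1] > q[0]:
            return step
    return max_steps

def avg_steps(epsilon, trials=50):
    return sum(steps_until_switch(epsilon, seed=s) for s in range(trials)) / trials

for eps in (0.001, 0.01, 0.1, 0.5):
    print(f"epsilon={eps}: preference flips after ~{avg_steps(eps):.0f} steps on average")
```

The lower the forced randomization, the longer the stale behavior survives, which is the same qualitative pattern the maze agents showed.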

As I sat back and looked at the results of this 30-day experiment, I realized that this is similar to the 'short-term thinking' that humans experience. Once we learn to do something that yields a reward, when we are presented with a similar challenge we DON'T randomly choose to try a new way. We continue with our existing reward-seeking behavior, which is less than optimal.

Hypothesis

The hypothesis is that I need to add something to the learning process that tracks the age of a decision and gradually increases the randomization as that learned behavior gets older (increasing the exploration rate). This should result in the agent constantly learning (exploring) instead of only following its existing reward-seeking behavior (exploitation).
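
Here is a minimal sketch of what that could look like, again in Python rather than in the actual ML-Agents training code; the class name, parameters, and the linear growth schedule are my own assumptions about one way to implement the idea.

```python
import random

class AgingEpsilon:
    """Epsilon-greedy selection where the exploration rate grows with the 'age'
    of the current preferred action (how long the greedy choice has gone
    unchanged) and resets once the agent switches to something new."""

    def __init__(self, base_eps=0.05, max_eps=0.5, growth=1e-4):
        self.base_eps = base_eps   # exploration floor
        self.max_eps = max_eps     # cap so behavior never becomes pure noise
        self.growth = growth       # how quickly exploration rises with age
        self.age = 0               # steps since the greedy choice last changed
        self.last_greedy = None

    def epsilon(self):
        return min(self.max_eps, self.base_eps + self.growth * self.age)

    def select(self, q_values, rng=random):
        greedy = max(range(len(q_values)), key=lambda i: q_values[i])
        # The longer the same decision has stood unchallenged, the more often
        # the agent is forced to try a random alternative.
        self.age = self.age + 1 if greedy == self.last_greedy else 0
        self.last_greedy = greedy
        if rng.random() < self.epsilon():
            return rng.randrange(len(q_values))   # forced exploration
        return greedy
```

The intent matches the hypothesis: old, unchallenged decisions accumulate 'age', which raises the chance of being forced down a new path and gives the agent a way to rediscover the environment when the rewards have moved.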
