python - Why does TensorFlow Agents' built-in DQN tutorial fail to learn when I swap the cartpole environment for my own (simpler) environment?
Problem Description
I'm trying to train a DQN agent modeled almost exactly on TensorFlow Agents' DQN tutorial. Instead of cartpole, I want it to learn a simple game in which a battery can buy and sell power as the price alternates between 1 and 2 every 12 timesteps (twelve 1s, twelve 2s, twelve 1s, ...). The battery can hold 10 units of charge. The optimal policy is to buy when the price is 1 and sell when the price is 2. All I did was add this cell to import the environment I wrote:
# Import the environment
from storage_environment import StorageEnvironment

# Define the price signal and the maximum charge
price_signal = ([1] * 6 + [2] * 12 + [1] * 6) * 365
price_signal = [p * 1 for p in price_signal]  # price scaling hook; a multiplier of 1 is a no-op
max_charge = 10

# Load the environment and wrap it for TensorFlow
train_py_env = StorageEnvironment(price_signal, max_charge)
eval_py_env = StorageEnvironment(price_signal, max_charge)
train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)
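Before training, it is worth validating the custom environment against its own specs. TF-Agents provides a utility (used in its tutorials) that drives a PyEnvironment with random actions and raises if any observation, reward, or step type violates the declared specs; a minimal sketch:

from tf_agents.environments import utils

# Step the environment with randomly sampled actions and check every
# time step against action_spec/observation_spec. One episode here is
# len(price_signal) = 8760 steps, so a single episode is plenty.
utils.validate_py_environment(train_py_env, episodes=1)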
Here is the environment:
from tf_agents.environments import py_environment
from tf_agents.specs import array_spec
import numpy as np
from tf_agents.trajectories import time_step as ts


# The class for our environment.
# The base class is the standard PyEnvironment.
# In practice, we'll use a wrapper to convert this to a TensorFlow environment.
class StorageEnvironment(py_environment.PyEnvironment):

    # price_signal: a list of prices, one for each timestep. The length of the
    # episodes will be determined by the length of this signal.
    #
    # max_charge: the maximum charge of the battery
    def __init__(self, price_signal, max_charge):
        # Add the price signal and max charge as attributes
        self._price_signal = price_signal
        self._max_charge = max_charge
        # Keep track of the timestep
        self._timestep = 0
        # The charge begins at 0
        self._charge = 0
        # The balance and value begin at 0
        self._balance = 0
        self._value = 0
        # Actions are integers between 0 and 2
        self._action_spec = array_spec.BoundedArraySpec(
            shape=(),
            dtype=np.int32,
            minimum=0,
            maximum=2,
            name='action'
        )
        # Observations are floating-point vectors of length 2.
        # The first element is the current price signal (min: 0, max: inf).
        # The second element is the current battery charge (min: 0, max: max_charge).
        self._observation_spec = array_spec.BoundedArraySpec(
            shape=(2,),
            dtype=np.float32,
            minimum=[0, 0],
            maximum=[np.inf, self._max_charge],
            name='observation'
        )

    # Required implementation for inheritance
    def action_spec(self):
        return self._action_spec

    # Required implementation for inheritance
    def observation_spec(self):
        return self._observation_spec

    # Reset environment - required for inheritance
    def _reset(self):
        # Set timestep to 0
        self._timestep = 0
        # Set price to first element of price signal
        self._current_price = self._price_signal[self._timestep]
        # Set charge to 0
        self._charge = 0
        # Set balance and value to 0
        self._balance = 0
        self._value = 0
        # Restart environment
        return ts.restart(
            observation=np.array([self._current_price, self._charge], dtype=np.float32)
        )

    # Take a step with an action (integer from 0 to 2)
    def _step(self, action):
        # If the last step was the final time step, ignore the action and reset
        if self._current_time_step.is_last():
            return self.reset()
        # 1 -> idle: no reward and charge doesn't change
        if action == 1:
            pass
        # 0 -> discharge: sell one unit at the current price
        elif action == 0:
            if self._charge > 0:
                self._charge -= 1
                self._balance += self._current_price
        # 2 -> charge: buy one unit at the current price
        elif action == 2:
            if self._charge < self._max_charge:
                self._charge += 1
                self._balance -= self._current_price
        else:
            raise ValueError('action should be 0, 1, or 2')
        # Calculate reward.
        # The reward is the change in the total value of the agent's position:
        # the cash balance plus the stored energy marked at the current price.
        self._new_value = self._balance + self._current_price * self._charge
        self._reward = self._new_value - self._value
        self._value = self._new_value
        # Alternatively (would require tracking the previous price):
        # self._reward = self._charge * (self._current_price - self._old_price)
        # If we've reached the end of the price signal, terminate the episode
        if self._timestep == len(self._price_signal) - 1:
            return ts.termination(
                observation=np.array([self._current_price, self._charge], dtype=np.float32),
                reward=self._reward
            )
        # Otherwise, transition to the next time step
        else:
            self._timestep += 1
            self._current_price = self._price_signal[self._timestep]
            return ts.transition(
                observation=np.array([self._current_price, self._charge], dtype=np.float32),
                reward=self._reward
            )
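Independent of the agent, it is also worth confirming that the reward signal actually favors the intended behavior. A minimal sketch that rolls out the known optimal policy by hand (charge at price 1, discharge at price 2) and checks that it accumulates a positive return:

# Hand-coded optimal policy: buy one unit whenever the price is 1 and the
# battery is not full, sell one unit whenever the price is 2 and the battery
# is not empty, otherwise idle.
price_signal = ([1] * 6 + [2] * 12 + [1] * 6) * 365
env = StorageEnvironment(price_signal, max_charge=10)
time_step = env.reset()
total_reward = 0.0
while not time_step.is_last():
    price, charge = time_step.observation
    if price == 1 and charge < 10:
        action = 2  # charge while the price is low
    elif price == 2 and charge > 0:
        action = 0  # discharge while the price is high
    else:
        action = 1  # idle
    time_step = env.step(action)
    total_reward += time_step.reward
print('Return of the hand-coded optimal policy:', total_reward)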
When I run the cartpole tutorial in Colab, the algorithm finds the optimal policy within a few hundred iterations. I also extracted the Q-values; the plot showed the last 24 training timesteps:
[figure: cartpole Q-values over the last 24 training timesteps]
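For reference, Q-values can be read off the agent's network directly. A sketch, assuming `q_net` is the QNetwork passed to the DqnAgent in the tutorial (a QNetwork returns a `(q_values, network_state)` tuple when called on a batch of observations):

import tensorflow as tf

# Evaluate the network on a small grid of (price, charge) observations.
# Columns of q_values correspond to actions 0 (discharge), 1 (idle), 2 (charge).
observations = tf.constant(
    [[price, charge] for price in (1.0, 2.0) for charge in (0.0, 5.0, 10.0)],
    dtype=tf.float32)
q_values, _ = q_net(observations)
for obs, q in zip(observations.numpy(), q_values.numpy()):
    print('price=%g charge=%g -> Q = %s' % (obs[0], obs[1], q))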
On my problem, even after 20,000 iterations the Q-values rarely make sense (I would expect the "charge" and "discharge" curves to alternate like mirrored square waves):
[figure: storage-environment Q-values after 20,000 iterations]
I've tried changing the size of the network and using different learning rates, epsilon values, optimizers, and so on. Nothing seems to help. Even with the parameters unchanged, every run looks different.
My main question is: why is the algorithm powerful enough to solve cartpole, yet unable to learn in this even simpler environment? Am I missing something fundamental?
Solution