Why does TensorFlow Agents' built-in DQN tutorial fail to learn when I swap the cartpole environment for my own (simpler) environment?

Problem description

I am trying to train a DQN agent modeled almost exactly on the TensorFlow Agents DQN tutorial. Instead of cartpole, I want it to learn a simple game in which a battery can buy and sell electricity as the price alternates between 1 and 2 every 12 time steps (twelve 1s, twelve 2s, twelve 1s, ...). The battery can hold 10 units of charge. The optimal policy is to buy when the price is 1 and sell when the price is 2. All I did was add this cell, which imports the environment I wrote:

#import environment (tf_py_environment is already imported in the tutorial notebook)
from tf_agents.environments import tf_py_environment
from storage_environment import StorageEnvironment

# define price signal and max charges
price_signal = ([1] * 6 + [2] * 12 + [1] * 6) * 365
price_signal = [p*1 for p in price_signal]
max_charge = 10

#load environment
train_py_env = StorageEnvironment(price_signal, max_charge)
eval_py_env = StorageEnvironment(price_signal, max_charge)
train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)
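
Since the optimal policy is known, one way to sanity-check the environment itself, independently of the agent, is to roll a hand-coded buy-low/sell-high policy through the raw Python environment and sum the rewards. This is only a sketch (not part of the tutorial), and the variable names below are purely illustrative:

import numpy as np

# Sanity check: roll the known optimal policy (charge while the price is 1,
# discharge while it is 2) through the raw Python environment and sum the rewards.
# A successfully trained agent should approach this reference return.
check_env = StorageEnvironment(price_signal[:48], max_charge)   # two price periods
time_step = check_env.reset()
optimal_return = 0.0
while not time_step.is_last():
    price = time_step.observation[0]
    action = np.array(2 if price == 1 else 0, dtype = np.int32)  # 2 = charge, 0 = discharge
    time_step = check_env.step(action)
    optimal_return += time_step.reward
print('Return of the hand-coded policy:', optimal_return)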

Here is the environment:

from tf_agents.environments import py_environment
from tf_agents.specs import array_spec
import numpy as np
from tf_agents.trajectories import time_step as ts
from matplotlib import pyplot as plt
import random

# The class for our environment
# The base class is the standard PyEnvironment
# In practice, we'll use a wrapper to convert this to a TensorFlow environment
class StorageEnvironment(py_environment.PyEnvironment):
    
    # price_signal: a list of prices, one for each timestep. The length of the episodes will be determined
    # by the length of this signal
    #
    # max_charge: the maximum charge of the battery
    def __init__(self, price_signal, max_charge):
        
        # Add the price signal and max charge as attributes
        self._price_signal = price_signal
        self._max_charge = max_charge
        
        # Keep track of the timestep
        self._timestep = 0
        
        # The charge begins at 0
        self._charge = 0
        
        # The balance and value begin at 0
        self._balance = 0
        self._value = 0
        
        # Actions are integers between 0 and 2
        self._action_spec = array_spec.BoundedArraySpec(
            shape = (),
            dtype = np.int32,
            minimum = 0,
            maximum = 2,
            name = 'action'
        )
        # Observations are floating-point vectors of length 2
        # The first element is the current price signal (min: 0, max: inf)
        # The second element is the current battery charge (min: 0, max: max_charge)
        self._observation_spec = array_spec.BoundedArraySpec(
            shape = (2,),
            dtype = np.float32,
            minimum = [0, 0],
            maximum = [np.inf, self._max_charge],
            name = 'observation'
        )
    
    # Required implementation for inheritance
    def action_spec(self):
        return self._action_spec
    
    # Required implementation for inheritance
    def observation_spec(self):
        return self._observation_spec
    
    # Reset environment - required for inheritance
    def _reset(self):
        # Set timestep to 0
        self._timestep = 0
        
        # Set price to first element of price signal
        self._current_price = self._price_signal[self._timestep]
        
        # Set charge to 0
        self._charge = 0
        
        # Set balance and value to 0
        self._balance = 0
        self._value = 0
        
        # Restart environment
        return ts.restart(
            observation = np.array([self._current_price, self._charge], dtype = np.float32)
        )
    
    # Take a step with an action (integer from 0 to 2)
    def _step(self, action):
        
        # If the last step was the final time step, ignore action and reset environment
        if self._current_time_step.is_last():
            return self.reset()
        
        # 1 -> idle
        # No reward and charge doesn't change
        if action == 1:
            pass
            
        # 0 -> discharge
        elif action == 0:
            if self._charge > 0:
                self._charge -= 1
                self._balance += self._current_price
                
        # 2 -> charge
        elif action == 2:
            if self._charge < self._max_charge:
                self._charge += 1
                self._balance -= self._current_price

        else:
            raise ValueError('action should be 0, 1, or 2')
        
        # Calculate reward
        # In practice, reward is equal to the change in the value of the energy currently stored by the battery
        self._new_value = self._balance + self._current_price*self._charge
        self._reward = self._new_value - self._value
        self._value = self._new_value
        
        # Alternatively:
        # self._reward = self._charge * (self._current_price-self._old_price)
            
        # If we've reached the end of the price signal, terminate the episode
        if self._timestep == len(self._price_signal)-1:
            return ts.termination(
                observation = np.array([self._current_price, self._charge], dtype = np.float32),
                reward = self._reward
            )
        
        # If we've not reached the end of the price signal, transition to the next time step
        else:
            self._timestep += 1
            self._current_price = self._price_signal[self._timestep]
            
            return ts.transition(
                observation = np.array([self._current_price, self._charge], dtype = np.float32),
                reward = self._reward
            )
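
TF-Agents also ships a spec-conformance check, validate_py_environment, which drives an environment with random actions for a few episodes and raises an error if the emitted time steps do not match the declared specs. A quick check of this environment (the short price signal is just to keep it fast) would look like:

from tf_agents.environments import utils

# Run a few random-action episodes and compare every emitted time step against
# the declared observation/action specs; raises a ValueError on any mismatch.
utils.validate_py_environment(StorageEnvironment(price_signal[:48], max_charge), episodes = 3)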

Running the cartpole tutorial in Colab, the algorithm finds the optimal policy within a few hundred iterations. I also extracted the Q-values; this plot shows the last 24 training time steps: [Q-value plot]

For my problem, the Q-values rarely make sense even after 20,000 iterations (I would expect the "charge" and "discharge" curves to alternate like mirror-image square waves): [Q-value plot]
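
The Q-values can be read off the network roughly like this (a sketch that assumes the tutorial's q_net variable; the (price, charge) observations below are only examples): pass a batch of observations through the Q-network and it returns one Q-value per action.

import tensorflow as tf

# Evaluate the Q-network on a batch of (price, charge) observations; each row of
# the result holds the Q-values for actions 0 (discharge), 1 (idle) and 2 (charge).
observations = tf.constant([[1.0, 0.0],
                            [2.0, 10.0]], dtype = tf.float32)
q_values, _ = q_net(observations)
print(q_values.numpy())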

I have tried changing the size of the network and using different learning rates, epsilon values, optimizers, and so on. Nothing seems to help. Even with the parameters unchanged, every run looks different.

My main question is: why is the algorithm powerful enough to solve cartpole, yet unable to learn in this simpler environment? Am I missing something fundamental?

Tags: python, machine-learning, reinforcement-learning, dqn, tensorflow-agents

Solution

