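neural-network - Lunar Lander model on OpenAI Gym does not converge
Problem description
I am trying to use deep reinforcement learning with keras to train an agent to play the Lunar Lander OpenAI Gym environment. The problem is that my model does not converge. Here is my code: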
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers


def get_random_action(epsilon):
    # True with probability epsilon (take a random action)
    return np.random.rand(1) < epsilon


def get_reward_prediction(q, a):
    # Predict the return for taking action a (one-hot encoded) in state q
    qs_a = np.concatenate((q, table[a]), axis=0)
    x = np.zeros(shape=(1, environment_parameters + num_of_possible_actions))
    x[0] = qs_a
    guess = model.predict(x[0].reshape(1, x.shape[1]))
    r = guess[0][0]
    return r


results = []
epsilon = 0.05
alpha = 0.003
gamma = 0.3
environment_parameters = 8
num_of_possible_actions = 4
obs = 15
mem_max = 100000
epochs = 3
total_episodes = 15000

# One-hot encoding table for the four actions
possible_actions = np.arange(0, num_of_possible_actions)
table = np.zeros((num_of_possible_actions, num_of_possible_actions))
table[np.arange(num_of_possible_actions), possible_actions] = 1

env = gym.make('LunarLander-v2')
env.reset()

i_x = np.random.random((5, environment_parameters + num_of_possible_actions))
i_y = np.random.random((5, 1))

# Network input: state (8 values) + one-hot action (4 values); output: predicted return
model = Sequential()
model.add(Dense(512, activation='relu', input_dim=i_x.shape[1]))
model.add(Dense(i_y.shape[1]))

opt = optimizers.adam(lr=alpha)

model.compile(loss='mse', optimizer=opt, metrics=['accuracy'])

total_steps = 0
i_x = np.zeros(shape=(1, environment_parameters + num_of_possible_actions))
i_y = np.zeros(shape=(1, 1))

# Replay memory of (state + action) inputs and discounted-return targets
mem_x = np.zeros(shape=(1, environment_parameters + num_of_possible_actions))
mem_y = np.zeros(shape=(1, 1))
max_steps = 40000

for episode in range(total_episodes):
    g_x = np.zeros(shape=(1, environment_parameters + num_of_possible_actions))
    g_y = np.zeros(shape=(1, 1))
    q_t = env.reset()
    episode_reward = 0

    for step_number in range(max_steps):
        if episode < obs:
            # Observation phase: act randomly
            a = env.action_space.sample()
        else:
            # This call passes three arguments, matching the get_random_action
            # given in the solution; the one-argument definition above would
            # raise a TypeError here.
            if get_random_action(epsilon, total_episodes, episode):
                a = env.action_space.sample()
            else:
                actions = np.zeros(shape=num_of_possible_actions)
                for i in range(4):
                    actions[i] = get_reward_prediction(q_t, i)
                a = np.argmax(actions)

        # env.render()
        qa = np.concatenate((q_t, table[a]), axis=0)

        s, r, episode_complete, data = env.step(a)
        episode_reward += r

        if step_number == 0:
            g_x[0] = qa
            g_y[0] = np.array([r])
            mem_x[0] = qa
            mem_y[0] = np.array([r])

        g_x = np.vstack((g_x, qa))
        g_y = np.vstack((g_y, np.array([r])))

        if episode_complete:
            # Walk backward through the episode, turning rewards into discounted returns
            for i in range(0, g_y.shape[0]):
                if i == 0:
                    g_y[(g_y.shape[0] - 1) - i][0] = g_y[(g_y.shape[0] - 1) - i][0]
                else:
                    g_y[(g_y.shape[0] - 1) - i][0] = g_y[(g_y.shape[0] - 1) - i][0] + gamma * g_y[(g_y.shape[0] - 1) - i + 1][0]

            if mem_x.shape[0] == 1:
                mem_x = g_x
                mem_y = g_y
            else:
                mem_x = np.concatenate((mem_x, g_x), axis=0)
                mem_y = np.concatenate((mem_y, g_y), axis=0)

            # Drop the oldest samples once the memory is full
            # (len replaces np.alen, which newer NumPy releases no longer provide)
            if len(mem_x) >= mem_max:
                for l in range(len(g_x)):
                    mem_x = np.delete(mem_x, 0, axis=0)
                    mem_y = np.delete(mem_y, 0, axis=0)

        q_t = s

        # Retrain on the whole memory every 10th episode
        if episode_complete and episode >= obs:
            if episode % 10 == 0:
                model.fit(mem_x, mem_y, batch_size=32, epochs=epochs, verbose=0)

        if episode_complete:
            results.append(episode_reward)
            break
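I have run tens of thousands of episodes, but my model still does not converge. Over roughly the first 5000 episodes the average policy change starts to decrease while the average reward increases, but then it goes off the deep end, and the average reward per episode actually drops afterward. I have tried tweaking the hyperparameters, but that got me nowhere. I am trying to model my code after the DeepMind DQN paper.
Solution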
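You may want to change your get_random_action function so that epsilon decays with each episode. After all, assuming your agent can learn an optimal policy, at some point you no longer want to take random actions at all, right? Here is a slightly different version of get_random_action that does this for you: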
def get_random_action(epsilon, total_episodes, episode):
    explore_prob = epsilon - (epsilon * (episode / total_episodes))
    return np.random.rand(1) < explore_prob
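In this modified version of your function, the exploration probability decreases slightly with each episode, falling linearly from epsilon at episode 0 to zero at total_episodes. This may help your model converge.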
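To see what that schedule looks like, here is a minimal sketch that prints the exploration probability at a few episodes. The helper name explore_probability is hypothetical (it just isolates the first line of the function above), the values epsilon = 0.05 and total_episodes = 15000 are taken from the question, and Python 3 is assumed so that / is true division (under Python 2 integer division, episode / total_episodes would stay at 0 and nothing would decay).

```python
def explore_probability(epsilon, total_episodes, episode):
    """Linearly decayed exploration probability (hypothetical helper name)."""
    return epsilon - (epsilon * (episode / total_episodes))


# Hyperparameters from the question: epsilon = 0.05, total_episodes = 15000
for episode in (0, 3750, 7500, 11250, 15000):
    prob = explore_probability(0.05, 15000, episode)
    print("episode %5d -> explore probability %.4f" % (episode, prob))
```

The probability starts at epsilon and reaches zero at the final episode, so late in training the agent acts almost entirely greedily.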
There are several ways to decay a parameter. For more information, see this Wikipedia article.
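As one illustration of an alternative, here is a sketch of an exponential decay with a lower bound. This is not taken from the question or from the function above: the helper name exponential_epsilon and the values epsilon_start = 1.0, epsilon_min = 0.01 and decay_rate = 0.995 are assumptions chosen only for the example, and they would need tuning for Lunar Lander.

```python
def exponential_epsilon(epsilon_start, epsilon_min, decay_rate, episode):
    """Exponentially decayed epsilon, never dropping below epsilon_min.

    Hypothetical helper; all parameter values used below are illustrative.
    """
    return max(epsilon_min, epsilon_start * decay_rate ** episode)


# Illustrative values only (assumptions, not tuned for this environment)
for episode in (0, 100, 500, 1000, 5000):
    eps = exponential_epsilon(1.0, 0.01, 0.995, episode)
    print("episode %5d -> epsilon %.4f" % (episode, eps))
```

Compared with the linear schedule, this one explores heavily at first, drops off quickly, and keeps a small constant amount of exploration once the floor is reached.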