InvalidArgumentError when running a session in TensorFlow

Problem description

I am following this a3c blog post to train an agent. It uses a neural network to optimize performance, but when execution reaches the code below it fails, saying that the shape of my input data is incompatible with a placeholder. I have tried many different shapes and also considered reshaping the data, but the error still appears when the sess.run() call runs. What should I do to fix it?

InvalidArgumentError: You must feed a value for placeholder tensor 'Placeholder_21' with dtype float and shape [?,2]
     [[node Placeholder_21 (defined at <ipython-input-462-3c1b764fbd4e>:3) ]]

When I print the batched input data, I see:

print("State shape:", batch_states.shape)
print("Batch states:",batch_states)
print("Batch actions length:",len(batch_actions))
print("Batch Actions:", batch_actions)
print("Batch Rewards:", batch_rewards)
print("Batch Done:", batch_done)
print("Num actions:", n_actions)
State shape: (10, 2)
Batch states: [[1501.87201108 1501.87201108]
 [1462.65450863 1462.65450863]
 [1480.95616876 1480.95616876]
 [1492.24380743 1492.24380743]
 [1481.92809598 1481.92809598]
 [1480.19257102 1480.19257102]
 [1503.54571786 1503.54571786]
 [1489.38563414 1489.38563414]
 [1541.16797527 1541.16797527]
 [1516.04036259 1516.04036259]]
Batch actions length: 10
Batch Actions: [[1. 0. 1. 0. 0.]
 [1. 0. 1. 0. 0.]
 [1. 0. 1. 0. 0.]
 [1. 0. 1. 0. 0.]
 [1. 0. 1. 0. 0.]
 [1. 0. 1. 0. 0.]
 [1. 0. 1. 0. 0.]
 [1. 0. 1. 0. 0.]
 [1. 0. 1. 0. 0.]
 [1. 0. 1. 0. 0.]]
Batch Rewards: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Batch Done: [False False False False False False False False False False]
Num actions: 5

This is the part of the code where I get the error:

states_ph = tf.placeholder('float32', [None,] + list(obs_shape))    
next_states_ph = tf.placeholder('float32', [None,] + list(obs_shape))
actions_ph = tf.placeholder('int32', (None,n_actions))
rewards_ph = tf.placeholder('float32', (None,))
is_done_ph = tf.placeholder('float32', (None,))
# logits[n_envs, n_actions] and state_values[n_envs, n_actions]
logits, state_values = agent.symbolic_step(states_ph)
next_logits, next_state_values = agent.symbolic_step(next_states_ph)

# There is no next state if the episode is done!
next_state_values = next_state_values * (1 - is_done_ph)

# probabilities and log-probabilities for all actions
probs = tf.nn.softmax(logits, axis=-1)            # [n_envs, n_actions]
logprobs = tf.nn.log_softmax(logits, axis=-1)     # [n_envs, n_actions]

# log-probabilities only for agent's chosen actions
logp_actions = tf.reduce_sum(logprobs * tf.one_hot(actions_ph, n_actions), axis=-1) # [n_envs,]
# Compute advantage using rewards_ph, state_values and next_state_values.
gamma = 0.99
advantage = rewards_ph + gamma * (next_state_values - state_values)

assert advantage.shape.ndims == 1, "please compute advantage for each sample, vector of shape [n_envs,]"

# Compute policy entropy given logits_seq. Mind the "-" sign!
entropy =  - tf.reduce_sum(probs * logprobs, 1) 


assert entropy.shape.ndims == 1, "please compute pointwise entropy vector of shape [n_envs,] "

# Compute target state values using temporal difference formula. Use rewards_ph and next_step_values
target_state_values = rewards_ph + gamma*next_state_values

actor_loss = -tf.reduce_mean(logp_actions * tf.stop_gradient(advantage), axis=0) - 0.001 * tf.reduce_mean(entropy, axis=0)
critic_loss = tf.reduce_mean((state_values - tf.stop_gradient(target_state_values))**2, axis=0)

train_step = tf.train.AdamOptimizer(1e-4).minimize(actor_loss + critic_loss)
sess.run(tf.global_variables_initializer())
l_act, l_crit, adv, ent = sess.run([actor_loss, critic_loss, advantage, entropy], feed_dict = {
        states_ph: batch_states,
        actions_ph: batch_actions,
        next_states_ph: batch_states,
        rewards_ph: batch_rewards,
        is_done_ph: batch_done,
    })

Tags: python, tensorflow

Solution
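
The error means that one of the fetched ops still depends on a placeholder that is not in the feed_dict. Only five placeholders are defined in the code above, yet the missing one is called 'Placeholder_21' and has shape [?, 2], which matches obs_shape. A plausible explanation (an assumption based on the naming, not something stated in the question) is that the placeholder-defining cell was re-run in the notebook, so the default graph holds stale copies of states_ph and the other placeholders, and the fetched losses were built on one of those stale copies. A small debugging sketch, assuming the TF1-style graph and session from the question are still live:

import tensorflow as tf

# List every placeholder in the default graph. If far more than the five
# placeholders defined above show up (including 'Placeholder_21' with shape
# [?, 2]), the graph contains stale copies from earlier runs of the cell.
for op in tf.get_default_graph().get_operations():
    if op.type == 'Placeholder':
        print(op.name, op.outputs[0].shape)

# One way to rule the stale copies out: wipe the graph and rebuild everything
# (the agent network, the placeholders, the losses, the optimizer and the
# session) in a single pass, executed exactly once.
tf.reset_default_graph()

Restarting the notebook kernel and running the graph-building cells exactly once has the same effect.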


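A separate observation about the posted code, not reported as an error in the question: actions_ph is declared with shape (None, n_actions) and fed with rows of 0/1 flags, but tf.one_hot(actions_ph, n_actions) then yields a rank-3 tensor, so logp_actions cannot be the [n_envs,] vector the comment next to it promises. The usual convention in this kind of A3C/A2C code is to feed integer action indices; a sketch reusing logprobs, n_actions and batch_actions from the question:

import numpy as np
import tensorflow as tf

# Hypothetical alternative: one integer action index per environment step.
actions_ph = tf.placeholder('int32', (None,))                      # [n_envs]

# logprobs: [n_envs, n_actions]; one_hot(actions_ph): [n_envs, n_actions],
# so the reduce_sum really does produce a vector of shape [n_envs].
logp_actions = tf.reduce_sum(logprobs * tf.one_hot(actions_ph, n_actions), axis=-1)

# If batch_actions currently holds 0/1 rows with a single chosen action per
# step, the matching indices can be recovered with argmax before feeding:
batch_action_ids = np.argmax(batch_actions, axis=-1)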