DQN - Can't solve CartPole-v1 - What am I doing wrong?

Problem description

I have been trying to solve CartPole-v1 by reaching an average reward of 475 over 100 consecutive episodes.

This is the algorithm I need to run: (image: pseudocode for DQN with experience replay and fixed Q-targets)

I have tried many DQN architectures with fixed Q-targets. What am I doing wrong?
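
Concretely, the solved check amounts to something like this (a minimal sketch; episode_rewards is assumed to hold each episode's total reward):

    import numpy as np

    def is_solved(episode_rewards, threshold=475.0, window=100):
        """CartPole-v1 counts as solved when the mean reward over the
        last `window` consecutive episodes reaches `threshold`."""
        if len(episode_rewards) < window:
            return False
        return np.mean(episode_rewards[-window:]) >= threshold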

These are my hyperparameters:

    from collections import deque
    import gym

    env = gym.make('CartPole-v1')

    TOTAL_EPISODES = 5000
    T = 500                        # max steps per episode (CartPole-v1's limit)
    LR = 0.01
    GAMMA = 0.95
    MIN_EPSILON = 0.01
    EPSILON_DECAY_RATE = 0.9995
    epsilon = 1.0  # moving epsilon

    state_size = env.observation_space.shape[0]   # 4 for CartPole
    action_size = env.action_space.n              # 2 for CartPole
    batch_size = 64
    C = 8                          # sync target network every C episodes
    reward_discount = 10           # (not used in the snippets shown)
    deque_size = 2000
    experience_replay = deque(maxlen=deque_size)

I have tried LR values in [0.01, 0.02, 0.001], lowering the epsilon decay rate, batch_size = 32, C = 4, and so on.
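
One sanity check on these numbers: if EPSILON_DECAY_RATE = 0.9995 is applied once per episode, then after all 5000 episodes epsilon is still 0.9995^5000 ≈ 0.082, so it never reaches MIN_EPSILON = 0.01; getting there takes roughly ln(0.01)/ln(0.9995) ≈ 9208 decays (a per-step decay reaches it much sooner):

    import math

    # Decays needed to bring epsilon from 1.0 down to MIN_EPSILON = 0.01
    print(math.log(0.01) / math.log(0.9995))   # ~9208
    # Epsilon remaining after one decay per episode for 5000 episodes
    print(0.9995 ** 5000)                      # ~0.082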

The implementation is the same as in the picture; I'll include just the non-trivial parts:

    def train_on_batch(batch_size, memory, gamma, model, ddqn_target_model, losses):
        # Sample a random minibatch of transitions from the replay buffer
        minibatch = random.sample(memory, batch_size)
        states = np.zeros((batch_size, 4))
        targets = np.zeros((batch_size, 2))

        for index, (state, action, reward, next_state, done) in enumerate(minibatch):
            states[index] = state.reshape(1, 4)
            # Current Q-estimates become the regression target for the
            # actions we did NOT take, so their error is zero
            model_target = model.predict(state.reshape(1, 4))
            # Bootstrap the next-state value from the frozen target network
            target_pred = ddqn_target_model.predict(next_state.reshape(1, 4))
            if done:
                target = reward
            else:
                target = reward + gamma * np.amax(target_pred)

            model_target[0][action] = target
            targets[index] = model_target[0]
        history = model.fit(states, targets, batch_size=batch_size, epochs=1, verbose=0)
        losses.append(history.history['loss'][0])
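
For reference, the per-transition predict calls above are the main speed bottleneck; an equivalent batched version of the same target computation (a sketch with the same math and names, not my actual code) would be:

    def train_on_batch_vectorized(batch_size, memory, gamma, model, ddqn_target_model, losses):
        minibatch = random.sample(memory, batch_size)
        states = np.array([s.reshape(4) for s, a, r, ns, d in minibatch])
        next_states = np.array([ns.reshape(4) for s, a, r, ns, d in minibatch])
        actions = np.array([a for s, a, r, ns, d in minibatch])
        rewards = np.array([r for s, a, r, ns, d in minibatch])
        dones = np.array([d for s, a, r, ns, d in minibatch], dtype=bool)

        # One forward pass per network instead of one per transition
        targets = model.predict(states)                  # shape (batch, 2)
        next_q = ddqn_target_model.predict(next_states)  # shape (batch, 2)

        # TD target: r for terminal transitions, r + gamma * max_a' Q_target otherwise
        td = rewards + gamma * np.max(next_q, axis=1) * (~dones)
        targets[np.arange(batch_size), actions] = td

        history = model.fit(states, targets, batch_size=batch_size, epochs=1, verbose=0)
        losses.append(history.history['loss'][0])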


    def build_model(state_size, action_size, learning_rate, layers_num=3):
        # Simple MLP: state in, one linear Q-value per action out
        model = Sequential()
        if layers_num == 3:
            model.add(Dense(24, input_dim=state_size, activation='relu'))
            model.add(Dense(24, activation='relu'))
            model.add(Dense(24, activation='relu'))
        else:
            # Deeper but narrower variant: five hidden layers of 18 units
            model.add(Dense(18, input_dim=state_size, activation='relu'))
            model.add(Dense(18, activation='relu'))
            model.add(Dense(18, activation='relu'))
            model.add(Dense(18, activation='relu'))
            model.add(Dense(18, activation='relu'))
        model.add(Dense(action_size, activation='linear'))
        model.compile(loss='mse',
                      optimizer=Adam(learning_rate=learning_rate))

        return model
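
Not something I'm running, just noting a common variant: many DQN implementations compile with Huber loss instead of MSE, since it is less sensitive to outlier TD errors. Assuming TF2-style Keras, it's a one-line change:

    from tensorflow.keras.losses import Huber

    # Hypothetical variant of the compile step above
    model.compile(loss=Huber(),
                  optimizer=Adam(learning_rate=learning_rate))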

    def sample_action(model, state, epsilon):
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action_pred = model.predict(state)
            action = np.argmax(action_pred[0])

        return action

At the start I do ddqn_target_model.set_weights(model.get_weights()), and inside the episode loop I do if episode % C == 0: ddqn_target_model.set_weights(model.get_weights()).
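
For completeness, the surrounding episode loop looks roughly like this (a sketch of how the pieces above fit together; the losses list, per-episode epsilon decay, and the classic Gym step/reset API are assumptions, since I'm only showing the non-trivial parts):

    losses = []
    ddqn_target_model.set_weights(model.get_weights())   # initial sync

    for episode in range(TOTAL_EPISODES):
        state = env.reset().reshape(1, state_size)
        for t in range(T):
            action = sample_action(model, state, epsilon)
            next_state, reward, done, info = env.step(action)
            next_state = next_state.reshape(1, state_size)
            experience_replay.append((state, action, reward, next_state, done))
            state = next_state
            if len(experience_replay) >= batch_size:
                train_on_batch(batch_size, experience_replay, GAMMA,
                               model, ddqn_target_model, losses)
            if done:
                break
        epsilon = max(MIN_EPSILON, epsilon * EPSILON_DECAY_RATE)
        if episode % C == 0:
            ddqn_target_model.set_weights(model.get_weights())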

What am I missing?

Thanks

Tags: python, keras, deep-learning, reinforcement-learning, openai-gym
