Can PPO be applied to the REINFORCE algorithm instead of A2C?

Problem description

Instead of implementing PPO on top of A2C, can we implement Proximal Policy Optimization for the REINFORCE algorithm, like the code below?

import torch as T

def update(self):
    s, a, r, done_mask, prob_a = self.make_batch()

    gamma = 0.99
    k_epoch = 3
    eps_clip = 0.1

    # The Monte-Carlo return G_t depends only on the collected rewards,
    # so it is computed once, outside the optimization loop.
    discounted_rewards = []
    Gt = 0
    for t in reversed(range(len(r))):
        Gt = r[t] + done_mask[t] * gamma * Gt
        discounted_rewards.append(Gt)
    discounted_rewards.reverse()
    # unsqueeze(1) so the [N] returns broadcast against the [N, 1]
    # ratio below instead of producing an [N, N] matrix.
    discounted_rewards = T.tensor(discounted_rewards, dtype=T.float).unsqueeze(1)

    for i in range(k_epoch):
        # Re-evaluate the current policy on the stored states.
        pi = self.model(s)
        pi_a = pi.squeeze(1).gather(1, a)

        # Importance ratio pi_theta(a|s) / pi_theta_old(a|s). prob_a was
        # recorded (detached) at sampling time, so only the numerator
        # carries gradients.
        ratio = T.exp(T.log(pi_a) - T.log(prob_a))

        # PPO clipped surrogate, with the return G_t in place of an
        # advantage estimate.
        surr1 = ratio * discounted_rewards
        surr2 = T.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * discounted_rewards
        loss = -T.min(surr1, surr2)

        self.optimizer.zero_grad()
        # Each epoch rebuilds the graph via the forward pass above,
        # so retain_graph=True is not needed.
        loss.mean().backward()
        self.optimizer.step()
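
For reference, the loss in the loop above is the (negated) PPO clipped surrogate objective, with REINFORCE's Monte-Carlo return $G_t$ standing in where standard PPO uses an advantage estimate $\hat{A}_t$; eps_clip in the code is $\epsilon$:

$$
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\,G_t,\; \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,G_t\big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
$$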

Tags: tensorflow, deep-learning, pytorch, reinforcement-learning

Solution
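
In principle, yes. The clipped surrogate only needs an importance ratio between the current policy and the policy that collected the data, plus a per-timestep weight; nothing forces that weight to be an advantage estimate. Using the Monte-Carlo return G_t, as the code above does, gives a PPO-style REINFORCE. The trade-off is variance: A2C's critic exists precisely to reduce the variance of G_t, so the REINFORCE variant will typically learn more noisily, and subtracting a baseline (even simple return normalization) helps.

Two details matter for the code to be correct. First, prob_a must be recorded and detached when the action is sampled, so the ratio's denominator stays constant across the k_epoch updates. A minimal sketch of the sampling side, assuming a discrete action space, states arriving as NumPy arrays, and a model that outputs a probability vector (the act helper name is an assumption, not part of the question's code):

import torch as T
from torch.distributions import Categorical

def act(self, s):
    # Hypothetical helper: sample an action and record its probability
    # under the current policy, which becomes the "old" policy
    # pi_theta_old for the subsequent PPO epochs.
    pi = self.model(T.from_numpy(s).float())
    dist = Categorical(pi)
    a = dist.sample()
    # detach(): the stored probability must be a constant during the
    # PPO epochs, so only the ratio's numerator carries gradients.
    return a.item(), pi[a].detach()

Second, as a cheap baseline, the returns can be normalized once before the epoch loop:

discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-8)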

