Reinforcement Learning: An Introduction - Exercise Solutions - Chapter 7,8

initial-h 2021-06-01 13:44

Reinforcement Learning: An Introduction (second edition) - Chapter 7,8

Contents

Chapter 7

7.1
In Chapter 6 we noted that the Monte Carlo error can be written as the sum of TD errors (6.6) if the value estimates don’t change from step to step. Show that the n-step error used in (7.2) can also be written as a sum of TD errors (again if the value estimates don’t change) generalizing the earlier result.

  • From (6.5) we have \(\delta_t \overset{.}{=} R_{t+1}+\gamma V(S_{t+1})-V(S_t)\); expanding the n-step error in (7.2) directly:

\[\begin{array}{l} G_{t:t+n} -V(S_t) = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n V(S_{t+n}) - V(S_t) \\ \\ \qquad \qquad \qquad \ \ = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n V(S_{t+n}) - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1}) \\ \qquad \qquad \qquad \qquad + \gamma^2 V(S_{t+2}) - \gamma^2 V(S_{t+2})+ \cdots + \gamma^{n-1} V(S_{t+n-1}) - \gamma^{n-1} V(S_{t+n-1}) \\ \qquad \qquad \qquad \qquad + \gamma^{n} V(S_{t+n}) - \gamma^{n} V(S_{t+n}) \\ \\ \qquad \qquad \qquad \ \ = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) + \gamma R_{t+2} + \gamma^2 V(S_{t+2}) - \gamma V(S_{t+1}) + \cdots \\ \qquad \qquad \qquad \qquad + \gamma^{n-1}R_{t+n} + \gamma^n V(S_{t+n}) - \gamma^{n-1} V(S_{t+n-1}) \\ \\ \qquad \qquad \qquad \ \ = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) + \gamma (R_{t+2} + \gamma V(S_{t+2}) - V(S_{t+1})) + \cdots \\ \qquad \qquad \qquad \qquad + \gamma^{n-1}(R_{t+n} + \gamma V(S_{t+n}) - V(S_{t+n-1})) \\ \\ \qquad \qquad \qquad \ \ = \delta_t +\gamma \delta_{t+1} + \cdots + \gamma^{n-1}\delta_{t+n-1} \\ \\ \qquad \qquad \qquad \ \ = \sum_{k=0}^{n-1}\gamma^k \delta_{t+k} \end{array} \]
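
A quick numerical sanity check of this identity (a minimal sketch; the random rewards, states, and the fixed value table are arbitrary):

```python
import numpy as np

# Check that, with a fixed value table V, the n-step error G_{t:t+n} - V(S_t)
# equals the discounted sum of the one-step TD errors.
rng = np.random.default_rng(0)
gamma, n = 0.9, 4
rewards = rng.normal(size=n)             # R_{t+1}, ..., R_{t+n}
states = rng.integers(0, 5, size=n + 1)  # S_t, ..., S_{t+n}
V = rng.normal(size=5)                   # fixed (unchanging) value estimates

G = sum(gamma**k * rewards[k] for k in range(n)) + gamma**n * V[states[n]]
lhs = G - V[states[0]]

deltas = [rewards[k] + gamma * V[states[k + 1]] - V[states[k]] for k in range(n)]
rhs = sum(gamma**k * deltas[k] for k in range(n))

assert np.isclose(lhs, rhs)
```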

7.2
(programming) With an n-step method, the value estimates do change from step to step, so an algorithm that used the sum of TD errors (see previous exercise) in place of the error in (7.2) would actually be a slightly different algorithm. Would it be a better algorithm or a worse one? Devise and program a small experiment to answer this question empirically.
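
One possible experiment (a sketch; the constants, helper names, and bookkeeping below are my own choices, not the book's code) compares the two updates on the 19-state random walk by the RMS error of the value estimates:

```python
import numpy as np

# (a) usual n-step TD: error = G_{t:t+n} - V(S_t), computed from the current V
# (b) variant: error = sum of the one-step TD errors recorded online, while V changes
N_STATES, N, ALPHA, GAMMA, EPISODES, RUNS = 19, 4, 0.2, 1.0, 50, 100
TRUE_V = np.arange(1, N_STATES + 1) / (N_STATES + 1) * 2 - 1  # -0.9 ... 0.9

def generate_episode(rng):
    """One trajectory: non-terminal states S_0..S_{T-1} and rewards R_1..R_T."""
    s, states, rewards = N_STATES // 2, [], []
    while 0 <= s < N_STATES:
        states.append(s)
        step = rng.choice([-1, 1])
        rewards.append(-1.0 if s + step < 0 else (1.0 if s + step >= N_STATES else 0.0))
        s += step
    return states, rewards

def run(use_td_error_sum, rng):
    V = np.zeros(N_STATES)
    errors = []
    for _ in range(EPISODES):
        states, rewards = generate_episode(rng)
        T = len(rewards)
        deltas = []  # one-step TD errors, recorded as they are observed
        for t in range(T + N - 1):
            if t < T:
                v_next = V[states[t + 1]] if t + 1 < T else 0.0
                deltas.append(rewards[t] + GAMMA * v_next - V[states[t]])
            tau = t - N + 1
            if tau >= 0:
                if use_td_error_sum:
                    error = sum(GAMMA**k * deltas[tau + k] for k in range(min(N, T - tau)))
                else:
                    G = sum(GAMMA**k * rewards[tau + k] for k in range(min(N, T - tau)))
                    if tau + N < T:
                        G += GAMMA**N * V[states[tau + N]]
                    error = G - V[states[tau]]
                V[states[tau]] += ALPHA * error
        errors.append(np.sqrt(np.mean((V - TRUE_V) ** 2)))
    return np.mean(errors)

rng = np.random.default_rng(0)
for flag in (False, True):
    label = "sum of TD errors" if flag else "n-step error"
    print(label, np.mean([run(flag, rng) for _ in range(RUNS)]))
```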

7.3
Why do you think a larger random walk task (19 states instead of 5) was used in the examples of this chapter? Would a smaller walk have shifted the advantage to a different value of \(n\)? How about the change in left-side outcome from 0 to −1 made in the larger walk? Do you think that made any difference in the best value of \(n\)?

  • A task with more states is used mainly to show off the benefit of n-step methods, producing curves as clearly separated as those in Figure 7.2. With the earlier 5-state walk, any \(n>2\) already spans most of an episode, so the curves for different \(n\) would be nearly indistinguishable and the advantage of intermediate \(n\) would not stand out. As for changing the left-side outcome from 0 to −1: the walk is undiscounted, and with a left-side outcome of 0 the updates near the left end are small (values are pulled toward 0 on the left and toward 1 on the right). With −1 the update magnitudes grow, pulling values toward −1 on the left and still toward 1 on the right. Larger update magnitudes mean more variance and more oscillation for Monte-Carlo-like (large-\(n\)) returns, while TD-like (small-\(n\)) returns have lower variance, so the best-performing \(n\) should shift toward smaller values.

7.4
Prove that the n-step return of Sarsa (7.4) can be written exactly in terms of a novel TD error, as

\[G_{t:t+n} = Q_{t-1}(S_t,A_t)+\sum_{k=t}^{\min(t+n,T)-1} \gamma^{k-t}[R_{k+1}+\gamma Q_k(S_{k+1},A_{k+1})-Q_{k-1}(S_{k},A_{k})] \]

  • Following Exercise 7.1, now with \(\delta_k \overset{.}{=} R_{k+1}+\gamma Q_k(S_{k+1},A_{k+1})-Q_{k-1}(S_k,A_k)\), we have

\[\begin{array}{l} G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n},A_{t+n}) \\ \\ \qquad \quad = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n},A_{t+n}) + Q_{t-1}(S_t,A_t) - Q_{t-1}(S_t,A_t) \\ \qquad \qquad + \gamma Q_{t}(S_{t+1},A_{t+1}) - \gamma Q_{t}(S_{t+1},A_{t+1}) + \gamma^2 Q_{t+1}(S_{t+2},A_{t+2}) - \gamma^2 Q_{t+1}(S_{t+2},A_{t+2}) + \cdots \\ \qquad \qquad + \gamma^{n-1} Q_{t+n-2}(S_{t+n-1},A_{t+n-1}) - \gamma^{n-1} Q_{t+n-2}(S_{t+n-1},A_{t+n-1}) + \gamma^{n} Q_{t+n-1}(S_{t+n},A_{t+n}) - \gamma^{n} Q_{t+n-1}(S_{t+n},A_{t+n}) \\ \\ \qquad \quad = R_{t+1} + \gamma Q_{t}(S_{t+1},A_{t+1}) - Q_{t-1}(S_t,A_t) + \gamma R_{t+2} + \gamma^2 Q_{t+1}(S_{t+2},A_{t+2})- \gamma Q_{t}(S_{t+1},A_{t+1}) + \cdots \\ \qquad \qquad + \gamma^{n-1}R_{t+n}+ \gamma^n Q_{t+n-1}(S_{t+n},A_{t+n}) - \gamma^{n-1} Q_{t+n-2}(S_{t+n-1},A_{t+n-1}) + Q_{t-1}(S_t,A_t) \\ \\ \qquad \quad = \delta_{t} + \gamma \delta_{t+1} + \cdots + \gamma^{n-1} \delta_{t+n-1} + Q_{t-1}(S_t,A_t) \\ \\ \qquad \quad = Q_{t-1}(S_t,A_t) + \sum_{k=t}^{\min(t+n,T)-1}\gamma^{k-t}[R_{k+1}+\gamma Q_k(S_{k+1},A_{k+1})-Q_{k-1}(S_{k},A_{k})] \end{array} \]

7.5
Write the pseudocode for the off-policy state-value prediction algorithm described above.

  • There is not much to add here: in the n-step TD prediction pseudocode, replace the update (7.2), \(V_{t+n}(S_t) \overset{.}{=} V_{t+n-1}(S_t)+\alpha[G_{t:t+n}-V_{t+n-1}(S_t)]\), with the version that uses the return of (7.13), \(V_{t+n}(S_t) \overset{.}{=} V_{t+n-1}(S_t)+\alpha[\rho_t(R_{t+1}+\gamma G_{t+1:t+n}) + (1-\rho_t)V_{t+n-1}(S_t) - V_{t+n-1}(S_t)]\); the corresponding update step is sketched below.
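
A minimal sketch of that update step (names are mine; the return of (7.13) is built by a backward recursion that bottoms out with \(G_{h:h}=V(S_h)\) at the horizon, or with a return of 0 from the terminal state):

```python
def n_step_update(V, states, rewards, rho, tau, n, T, alpha, gamma):
    """Off-policy n-step TD prediction update for V[states[tau]], using the
    control-variate return (7.13).

    states[k]  = S_k (non-terminal states, k = 0..T-1)
    rewards[k] = R_{k+1}
    rho[k]     = pi(A_k | S_k) / b(A_k | S_k)
    """
    h = tau + n
    if h < T:
        G = V[states[h]]   # G_{h:h} = V(S_h)
        start = h - 1
    else:
        G = 0.0            # the return from the terminal state is zero
        start = T - 1
    for k in range(start, tau - 1, -1):
        # (7.13): G_{k:h} = rho_k (R_{k+1} + gamma G_{k+1:h}) + (1 - rho_k) V(S_k)
        G = rho[k] * (rewards[k] + gamma * G) + (1.0 - rho[k]) * V[states[k]]
    V[states[tau]] += alpha * (G - V[states[tau]])
```

This would be called at every time step \(t\) with \(\tau = t-n+1 \geq 0\) (and for the remaining \(\tau < T\) once the episode ends), exactly where the n-step TD pseudocode performs its update.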

7.6
Prove that the control variate in the above equations does not change the expected value of the return.

  • The task is to show that the expected value of the return (7.14), \(G_{t:h} \overset{.}{=} R_{t+1} + \gamma (\rho_{t+1}G_{t+1:h}+\bar V_{h-1}(S_{t+1})-\rho_{t+1}Q_{h-1}(S_{t+1},A_{t+1}))\), is unchanged by the control variate. It suffices to show that the added term \(\bar V_{h-1}(S_{t+1})-\rho_{t+1}Q_{h-1}(S_{t+1},A_{t+1})\) has expectation zero given \(S_{t+1}\), with the expectation taken over \(A_{t+1}\sim b(\cdot|S_{t+1})\):

\[\begin{array}{l} E_b[\bar V_{h-1}(S_{t+1})-\rho_{t+1}Q_{h-1}(S_{t+1},A_{t+1}) \mid S_{t+1}] \\ \\ = \bar V_{h-1}(S_{t+1}) - E_b[\rho_{t+1}Q_{h-1}(S_{t+1},A_{t+1}) \mid S_{t+1}] \\ \\ = \sum_a\pi(a|S_{t+1})Q_{h-1}(S_{t+1},a) - \sum_a b(a|S_{t+1})\frac{\pi(a|S_{t+1})}{b(a|S_{t+1})}Q_{h-1}(S_{t+1},a) \qquad (\text{using } \bar V_t(s) \overset{.}{=} \sum_a\pi(a|s)Q_t(s,a), \ (7.8)) \\ \\ = 0 \end{array} \]

7.7
Write the pseudocode for the off-policy action-value prediction algorithm described immediately above. Pay particular attention to the termination conditions for the recursion upon hitting the horizon or the end of episode.

  • Only the update portion is shown; the surrounding loop is the same as in the book's off-policy n-step Sarsa pseudocode. The return of (7.14) is built by a backward recursion that terminates either at the horizon, with \(G_{t+1:t+1} = Q(S_{t+1},A_{t+1})\), or at the end of the episode, with \(G_{T-1:h} = R_T\); here \(\bar V(s) \overset{.}{=} \sum_a\pi(a|s)Q(s,a)\) as in (7.8).

\[\begin{array}{l} \tau \leftarrow t-n+1 \\ \text{If} \ \tau \geq 0: \\ \quad \text{If} \ t+1<T: \ G \leftarrow Q(S_{t+1},A_{t+1}) \\ \quad \text{For} \ k=\min(t,T-1) \ \text{down to} \ \tau: \\ \qquad \text{If} \ k=T-1: \ G \leftarrow R_{T} \\ \qquad \text{Else}: \ G \leftarrow R_{k+1}+\gamma\big[\rho_{k+1}\big(G-Q(S_{k+1},A_{k+1})\big)+\bar V(S_{k+1})\big] \\ \quad Q(S_\tau,A_\tau) \leftarrow Q(S_\tau,A_\tau) + \alpha [G-Q(S_\tau,A_\tau)] \end{array} \]
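
The same update in Python form (a sketch; `Q` is assumed to be a NumPy array indexed as `Q[s, a]`, `pi(a, s)` returns \(\pi(a|s)\), and the trajectory arrays are filled through index \(t+1\)):

```python
def expected_q(Q, pi, s):
    """V-bar(s) = sum_a pi(a|s) Q(s,a), as in (7.8)."""
    return sum(pi(a, s) * Q[s, a] for a in range(Q.shape[1]))

def n_step_q_update(Q, states, actions, rewards, rho, pi, tau, n, T, alpha, gamma):
    """Off-policy n-step action-value update with the control variate (7.14).

    states[k], actions[k] = S_k, A_k;  rewards[k] = R_{k+1};
    rho[k] = pi(A_k | S_k) / b(A_k | S_k).
    """
    t = tau + n - 1                                   # data through S_{t+1}, A_{t+1}
    if t + 1 < T:
        G = Q[states[t + 1], actions[t + 1]]          # G_{t+1:t+1}
    for k in range(min(t, T - 1), tau - 1, -1):
        if k == T - 1:
            G = rewards[k]                            # G_{T-1:h} = R_T
        else:
            G = rewards[k] + gamma * (rho[k + 1] * (G - Q[states[k + 1], actions[k + 1]])
                                      + expected_q(Q, pi, states[k + 1]))
    Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
```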

7.8
Show that the general (off-policy) version of the n-step return (7.13) can still be written exactly and compactly as the sum of state-based TD errors (6.5) if the approximate state value function does not change.

  • With an unchanging value function (so the \(h-1\) subscripts on \(V\) can be dropped) and \(\delta_t \overset{.}{=} R_{t+1}+\gamma V(S_{t+1})-V(S_t)\) as in (6.5), Exercise 7.1 gives

\[\begin{array}{l} G_{t:t+n} -V(S_t) = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n V(S_{t+n}) - V(S_t) \\ \\ \qquad \qquad \qquad \ \ = \sum_{k=0}^{n-1}\gamma^k \delta_{t+k} \end{array} \]

For the off-policy return (7.13), however, the inner return \(G_{t+1:h}\) is itself an off-policy return, so the importance-sampling ratios accumulate as the recursion is unrolled. Writing \(\rho_{t:k} \overset{.}{=} \prod_{i=t}^{k}\rho_i\) and expanding one step at a time:

\[\begin{array}{l} G_{t:h} - V(S_t) = \rho_t (R_{t+1}+\gamma G_{t+1:h})+(1-\rho_t)V(S_t) - V(S_t) \\ \\ \qquad \qquad \quad \ \ = \rho_t \big(R_{t+1}+\gamma V(S_{t+1}) - V(S_t) + \gamma (G_{t+1:h}-V(S_{t+1}))\big) \\ \\ \qquad \qquad \quad \ \ = \rho_t \delta_t + \gamma \rho_t (G_{t+1:h}-V(S_{t+1})) \\ \\ \qquad \qquad \quad \ \ = \rho_t \delta_t + \gamma \rho_t \big(\rho_{t+1} \delta_{t+1} + \gamma \rho_{t+1}(G_{t+2:h}-V(S_{t+2}))\big) \\ \\ \qquad \qquad \quad \ \ \cdots \\ \\ \qquad \qquad \quad \ \ = \sum_{k=t}^{\min(h,T)-1} \gamma^{k-t} \rho_{t:k}\delta_{k} \end{array} \]

so that \(G_{t:h} = V(S_t) + \sum_{k=t}^{\min(h,T)-1} \gamma^{k-t}\rho_{t:k}\delta_{k}\), where the recursion terminates with \(G_{h:h}-V(S_h)=0\) at the horizon (or with the value of the terminal state taken as 0 at the end of the episode).

7.9
Repeat the above exercise for the action version of the off-policy n-step return (7.14) and the Expected Sarsa TD error (the quantity in brackets in Equation 6.9).

  • From (6.9), the Expected Sarsa TD error is \(\delta_t \overset{.}{=} R_{t+1}+\gamma \sum_a\pi(a|S_{t+1})Q(S_{t+1},a)-Q(S_t,A_t)\), where \(\sum_a\pi(a|S_{t+1})Q(S_{t+1},a)=\bar V(S_{t+1})\). With the convention \(\rho_{t+1:t} \overset{.}{=} 1\) we then have

\[\begin{array}{l} G_{t:h}-Q(S_t,A_t)= R_{t+1}+\gamma \rho_{t+1}(G_{t+1:h}-Q(S_{t+1},A_{t+1}))+\gamma \bar V_{h-1}(S_{t+1})-Q(S_t,A_t) \\ \\ \qquad \qquad \qquad \quad \ = R_{t+1} + \gamma \bar V(S_{t+1})-Q(S_t,A_t)+\gamma \rho_{t+1}(G_{t+1:h}-Q(S_{t+1},A_{t+1})) \\ \\ \qquad \qquad \qquad \quad \ = \delta_t + \gamma \rho_{t+1}(G_{t+1:h}-Q(S_{t+1},A_{t+1})) \\ \\ \qquad \qquad \qquad \quad \ = \delta_t + \gamma \rho_{t+1}\big(\delta_{t+1} + \gamma \rho_{t+2}(G_{t+2:h}-Q(S_{t+2},A_{t+2}))\big) \\ \\ \qquad \qquad \qquad \quad \ \cdots \\ \\ \qquad \qquad \qquad \quad \ = \delta_t + \gamma \rho_{t+1}\delta_{t+1} + \gamma^2 \rho_{t+1:t+2}\delta_{t+2} + \cdots \\ \\ \qquad \qquad \qquad \quad \ = \sum_{k=t}^{\min(h,T)-1}\gamma^{k-t}\rho_{t+1:k}\delta_{k} \end{array} \]

7.10
(programming) Devise a small off-policy prediction problem and use it to show that the off-policy learning algorithm using (7.13) and (7.2) is more data efficient than the simpler algorithm using (7.1) and (7.9).
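
One possible setup (a sketch; the 5-state task, the two policies, and every name below are my own choices): a small random walk where the behavior policy moves right with probability 0.5 and the target policy with probability 0.7, comparing the ordinary-importance-sampling learner, (7.1) with (7.9), against the control-variate learner, (7.13) with (7.2), by RMS error against the exact \(v_\pi\):

```python
import numpy as np

N_S, N, GAMMA, ALPHA, EPISODES, RUNS = 5, 3, 1.0, 0.1, 200, 50
PI_RIGHT, B_RIGHT = 0.7, 0.5

# Exact v_pi from the linear system v = r + gamma * P v (exiting right pays +1).
P, r = np.zeros((N_S, N_S)), np.zeros(N_S)
for s in range(N_S):
    if s + 1 < N_S: P[s, s + 1] += PI_RIGHT
    else:           r[s] += PI_RIGHT * 1.0
    if s - 1 >= 0:  P[s, s - 1] += 1 - PI_RIGHT
TRUE_V = np.linalg.solve(np.eye(N_S) - GAMMA * P, r)

def episode(rng):
    """Trajectory under b: states S_0..S_{T-1}, rewards R_1..R_T, ratios rho_0..rho_{T-1}."""
    s, states, rewards, rhos = N_S // 2, [], [], []
    while 0 <= s < N_S:
        states.append(s)
        right = rng.random() < B_RIGHT
        rhos.append(PI_RIGHT / B_RIGHT if right else (1 - PI_RIGHT) / (1 - B_RIGHT))
        s += 1 if right else -1
        rewards.append(1.0 if s >= N_S else 0.0)
    return states, rewards, rhos

def run(rng):
    VA, VB = np.zeros(N_S), np.zeros(N_S)   # A: ordinary IS, B: control variates
    errA, errB = [], []
    for _ in range(EPISODES):
        states, rewards, rhos = episode(rng)
        T = len(states)
        for tau in range(T):
            h = min(tau + N, T)
            # A: n-step return (7.1), whole error weighted by the ratio product, as in (7.9)
            G = sum(GAMMA**(k - tau) * rewards[k] for k in range(tau, h))
            if tau + N < T:
                G += GAMMA**N * VA[states[tau + N]]
            rho = np.prod(rhos[tau:h])
            VA[states[tau]] += ALPHA * rho * (G - VA[states[tau]])
            # B: control-variate return (7.13), built by backward recursion, then (7.2)
            Gb = VB[states[tau + N]] if tau + N < T else 0.0
            for k in range(h - 1, tau - 1, -1):
                Gb = rhos[k] * (rewards[k] + GAMMA * Gb) + (1 - rhos[k]) * VB[states[k]]
            VB[states[tau]] += ALPHA * (Gb - VB[states[tau]])
        errA.append(np.sqrt(np.mean((VA - TRUE_V) ** 2)))
        errB.append(np.sqrt(np.mean((VB - TRUE_V) ** 2)))
    return np.array(errA), np.array(errB)

rng = np.random.default_rng(0)
curves = [run(rng) for _ in range(RUNS)]
print("final RMS error, ordinary IS      :", np.mean([c[0][-1] for c in curves]))
print("final RMS error, control variates :", np.mean([c[1][-1] for c in curves]))
```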

7.11
Show that if the approximate action values are unchanging, then the tree-backup return (7.16) can be written as a sum of expectation-based TD errors:

\[G_{t:t+n} = Q(S_t,A_t) + \sum_{k=t}^{\min(t+n-1,T-1)}\delta_k \prod_{i=t+1}^k\gamma\pi(A_i|S_i), \]

where \(\delta_t \overset{.}{=} R_{t+1}+\gamma \bar V_t(S_{t+1})-Q(S_t,A_t)\) and \(\bar V_t\) is given by (7.8).

  • It suffices to show directly that \(G_{t:t+n} - Q(S_t,A_t) = \sum_{k=t}^{\min(t+n-1,T-1)}\delta_k \prod_{i=t+1}^k\gamma\pi(A_i|S_i)\):

\[\begin{array}{l} G_{t:t+n} - Q(S_t,A_t) = R_{t+1} + \gamma \sum_{a \not = A_{t+1}}\pi(a|S_{t+1})Q(S_{t+1},a)+\gamma \pi(A_{t+1}|S_{t+1})G_{t+1:t+n} - Q(S_t,A_t) \\ \\ \qquad \qquad \qquad \qquad = R_{t+1} + \gamma\sum_a\pi(a|S_{t+1})Q(S_{t+1},a) - \gamma \pi(A_{t+1}|S_{t+1})Q(S_{t+1},A_{t+1}) +\gamma \pi(A_{t+1}|S_{t+1})G_{t+1:t+n} - Q(S_t,A_t) \\ \\ \qquad \qquad \qquad \qquad = R_{t+1} + \gamma\sum_a\pi(a|S_{t+1})Q(S_{t+1},a) - Q(S_t,A_t)+\gamma \pi(A_{t+1}|S_{t+1})(G_{t+1:t+n} - Q(S_{t+1},A_{t+1})) \\ \\ \qquad \qquad \qquad \qquad = R_{t+1} + \gamma \bar V_t(S_{t+1}) - Q(S_t,A_t)+\gamma \pi(A_{t+1}|S_{t+1})(G_{t+1:t+n} - Q(S_{t+1},A_{t+1})) \\ \\ \qquad \qquad \qquad \qquad = \delta_t + \gamma \pi(A_{t+1}|S_{t+1})(G_{t+1:t+n} - Q(S_{t+1},A_{t+1})) \\ \\ \qquad \qquad \qquad \qquad = \delta_t + \gamma \pi(A_{t+1}|S_{t+1})(\delta_{t+1} + \gamma \pi(A_{t+2}|S_{t+2})( G_{t+2:t+n} - Q(S_{t+2},A_{t+2}))) \\ \\ \qquad \qquad \qquad \qquad \cdots \\ \\ \qquad \qquad \qquad \qquad = \sum_{k=t}^{\min(t+n-1,T-1)}\delta_k\prod_{i=t+1}^k\gamma\pi(A_i|S_i) \end{array} \]


Chapter 8

8.1
The nonplanning method looks particularly poor in Figure 8.3 because it is a one-step method; a method using multi-step bootstrapping would do better. Do you think one of the multi-step bootstrapping methods from Chapter 7 could do as well as the Dyna method? Explain why or why not.

  • Another open-ended question. For the maze task specifically, Dyna should still have the edge over a multi-step bootstrapping method, even when both are given the same budget of n backups per real step. Dyna's planning backups are drawn from previously visited state-action pairs all over the state space, while a multi-step method's n-step backups are confined to the current trajectory, so Dyna spreads value information more broadly and finds the optimal path more reliably. In addition, the maze has deterministic transitions, so the learned model is essentially error-free, which also works in Dyna's favour.

8.2
Why did the Dyna agent with exploration bonus, Dyna-Q+, perform better in the first phase as well as in the second phase of the blocking and shortcut experiments?

  • Dyna-Q+ explores more systematically because of its bonus, so it discovers the globally best path sooner in both phases; plain Dyna-Q is more likely to keep exploiting the first path it happens to find, a locally good but suboptimal solution.

8.3
Careful inspection of Figure 8.5 reveals that the difference between Dyna-Q+ and Dyna-Q narrowed slightly over the first part of the experiment. What is the reason for this?

  • It took a while to see what this question is getting at: in Figure 8.5, during the first ~3000 time steps the gap in cumulative reward between Dyna-Q+ and Dyna-Q first grows and then narrows slightly (the change is quite subtle in the figure). The main reason is that Dyna-Q eventually finds the optimal path as well, so its cumulative reward starts climbing at the full rate. Dyna-Q+, with its stronger exploration drive, keeps taking exploratory actions even after it knows the optimal path, so its per-step reward rate is slightly lower and its cumulative reward no longer rises as fast as Dyna-Q's.

8.4
(programming) The exploration bonus described above actually changes the estimated values of states and actions. Is this necessary? Suppose the bonus \(\kappa \sqrt{\tau}\) was used not in updates, but solely in action selection. That is, suppose the action selected was always that for which \(Q(S_t,a)+\kappa\sqrt{\tau(S_t,a)}\) was maximal. Carry out a gridworld experiment that tests and illustrates the strengths and weaknesses of this alternate approach.
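
One way to set this up (a sketch; `visit_gap` and the function name are my own): keep the Dyna-Q+ bookkeeping of \(\tau(s,a)\), leave every Q-learning and planning update untouched, and apply the bonus only when choosing the action:

```python
import numpy as np

def select_action(Q, visit_gap, s, kappa, rng):
    """Pick argmax_a Q(s,a) + kappa * sqrt(tau(s,a)), breaking ties at random.

    visit_gap[s, a] = tau(s, a): time steps since (s, a) was last tried.
    """
    scores = Q[s] + kappa * np.sqrt(visit_gap[s])
    best = np.flatnonzero(scores == scores.max())
    return rng.choice(best)
```

The rest of the agent is plain Dyna-Q on the blocking/shortcut gridworlds; comparing it against standard Dyna-Q+ shows whether putting the bonus only in action selection is enough to recover the re-exploration behaviour.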

8.5
How might the tabular Dyna-Q algorithm shown on page 164 be modified to handle stochastic environments? How might this modification perform poorly on changing environments such as considered in this section? How could the algorithm be modified to handle stochastic environments and changing environments?

  • In a stochastic environment the deterministic model \(R,S^\prime \leftarrow Model(S,A)\) no longer works; the model has to store, for each \((S,A)\), a distribution over rewards and next states (e.g. empirical transition counts), and the planning updates then compute an expected return from this model, or sample from it, as sketched below.
  • In a changing environment this would perform even worse, because the learned distribution becomes inaccurate when the dynamics shift, and it takes many more time steps before the model again reflects the true dynamics.
  • A remedy in the spirit of Dyna-Q+ is to add some exploration to the policy, so that the model keeps being updated everywhere and the Q estimates stay accurate.
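
A minimal sketch of such a model (my own design, not the book's; the count decay is one possible way to also cope with a changing environment):

```python
import random
from collections import defaultdict

class StochasticModel:
    """Empirical model: for each (s, a), counts of observed (reward, next_state)."""

    def __init__(self, decay=0.99):
        self.counts = defaultdict(lambda: defaultdict(float))  # (s,a) -> {(r,s'): n}
        self.decay = decay  # forget old outcomes so a changing world can be tracked

    def update(self, s, a, r, s_next):
        for outcome in self.counts[(s, a)]:
            self.counts[(s, a)][outcome] *= self.decay
        self.counts[(s, a)][(r, s_next)] += 1.0

    def sample(self, s, a):
        """Draw (r, s') with probability proportional to its (decayed) count.

        Assumes (s, a) has been experienced at least once, as in tabular Dyna-Q,
        which only plans over previously observed pairs. An expected update could
        be formed from these same counts instead of sampling.
        """
        outcomes = list(self.counts[(s, a)].keys())
        weights = list(self.counts[(s, a)].values())
        return random.choices(outcomes, weights=weights, k=1)[0]
```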

8.6
The analysis above assumed that all of the \(b\) possible next states were equally likely to occur. Suppose instead that the distribution was highly skewed, that some of the \(b\) states were much more likely to occur than most. Would this strengthen or weaken the case for sample updates over expected updates? Support your answer.

  • If the b branches are not equally likely, sample updates will mostly draw the high-probability successors, so the estimate becomes accurate sooner: the low-probability branches contribute little to the value, and rarely sampling them introduces only a small error. The skew effectively reduces the branching factor and lets the improvement propagate faster, which strengthens the case for sample updates over expected updates (which still pay the full cost of all b branches).

8.7
Some of the graphs in Figure 8.8 seem to be scalloped in their early portions, particularly the upper graph for \(b = 1\) and the uniform distribution. Why do you think this is? What aspects of the data shown support your hypothesis?

  • My reading of the question is: why does the uniform, b = 1 curve rise in a scalloped, step-like fashion, i.e. why does it climb so slowly at first? It again comes down to which states get updated. With uniform sampling most updates are wasted: only states whose successor values (or rewards) have already changed produce an effective update, so early progress comes in small bursts, once per sweep. Later on, when useful value information has reached almost every state, the uniform distribution catches up and even overtakes the on-policy one, because the on-policy distribution keeps re-updating the frequently visited states whose estimates have already converged, and those updates no longer change anything.

8.8
(programming) Replicate the experiment whose results are shown in the lower part of Figure 8.8, then try the same experiment but with \(b = 3\). Discuss the meaning of your results.
