Exciting topics in this chapter include DQN, policy gradients/REINFORCE, TRPO/PPO, etc. Current RLHF methods are mostly applications of model-free methods. However, I’m slightly overwhelmed by this chapter, so I’ll leave question marks in places where I’m not sure.

First we must formulate the deep reinforcement learning problem.

Gradient Update View of TD-learning (12.1)

Recall the update rule of TD-learning:

\[V^\pi(x) \leftarrow V^\pi(x) + \alpha_t(r+\gamma V^{\pi} (x') - V^\pi(x))\]

We wish to re-derive it as a gradient update of a per-state parameter. The purpose of this view is to later generalize it and update an NN estimator (of the parameter/V-function) instead, providing a tractable solution to the problem where we have too many or even continuously many states.

\[V^\pi (x; \mathbf{\theta}) = \theta(x) = \theta_x\] \[\mathbf{\theta} = [\theta_1, \dots, \theta_n]\]

We consider a simple $\frac{1}{2}$-scaled squared loss and apply the bootstrapping strategy as we did in TD-learning. (Otherwise it is not possible to update the estimate, since we don’t know the ground truth.)

\[\begin{aligned} & l_\theta(x,r) = \frac{1}{2} (v^\pi(x) - \theta(x))^2 \\ & = \frac{1}{2} (r + \gamma E_{x'\vert x, \pi(x)}[v^\pi(x')] - \theta(x))^2\\ \\ & \nabla_{\theta(x)} l_\theta(x,r) = \theta(x) - (r + \gamma E_{x'\vert x, \pi(x)}[v^\pi(x')]) \end{aligned}\]

After bootstrapping, and a single-sample Monte Carlo estimate (to avoid needing the transition model):

\[\begin{aligned} & l = \frac{1}{2} (v^\pi(x) - \theta(x))^2 \\ & = \frac{1}{2} (r + \gamma \theta^{old}(x') - \theta(x))^2\\\\ & \delta_{TD} = \theta(x) - (r + \gamma \theta^{old}(x'))\\ \\ & V^\pi(x) = \theta(x) \leftarrow \theta(x) - \alpha_t \delta_{TD} \end{aligned}\]

Note that in the previous chapter, TD-learning converges only in the limit of infinitely many iterations. This requires infinitely many (or at least a lot of) samples, since each iteration acquires/uses a new sample (unlike in ML, where infinitely many epochs simply means updating over a fixed dataset infinitely many times). This inefficiency is introduced by bootstrapping and single-sample Monte Carlo (Remark 12.1). It is the cost of fitting a function to something we don’t know.

Deep RL

Note that the previous view extends to other variants of TD-learning as well (like Q-learning). Now consider using a linear estimator, or even an NN estimator, to approximate the Q-function across states (instead of using a single parameter for each state):

\[Q(x, a; \theta) = \theta^T \phi(x,a)\]

For Q-learning, we can define the loss and gradient update (via the Bellman error):

\[\begin{aligned} & l = \frac{1}{2} (r + \gamma \max_{a'} Q^*(x',a';\theta^{old}) - Q^*(x,a;\theta))^2\\ \\ & \delta_{B} = r+\gamma \max_{a'} Q^*(x',a';\theta^{old}) - Q^*(x,a;\theta)\\ \\ & \theta \leftarrow \theta - \alpha_t \nabla_\theta l(\theta; a,x,r, x') \\ & = \theta + \alpha_t \delta_{B} \nabla_\theta Q^*(x,a;\theta) \end{aligned}\]

With one-hot embeddings of state-action pairs, this method should be pretty much the same as regular Q-learning. One is welcome to try.
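
For concreteness, here is a minimal sketch (my own, not from PAI) of this semi-gradient update with a linear Q-function; `phi` (the feature map) and `actions` are placeholders, and with one-hot features over state-action pairs this reduces to tabular Q-learning.

```python
import numpy as np

def semi_gradient_q_update(theta, phi, x, a, r, x_next, actions,
                           gamma=0.99, alpha=0.1):
    """One semi-gradient Q-learning step for Q(x, a; theta) = theta @ phi(x, a)."""
    q_sa = theta @ phi(x, a)
    # Bootstrapped target; ideally computed with a frozen copy theta_old,
    # here we reuse theta for simplicity (vanilla semi-gradient).
    target = r + gamma * max(theta @ phi(x_next, a_next) for a_next in actions)
    delta_b = target - q_sa                      # Bellman error
    theta = theta + alpha * delta_b * phi(x, a)  # grad of Q w.r.t. theta is phi(x, a)
    return theta
```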

However, this vanilla stochastic semi-gradient descent (so called, as opposed to SGD, because we use bootstrapping and single-sample MC) converges very slowly. We wish to introduce some tricks to alleviate this problem.

Heuristics for Value Function Approximation

Motivation: stabilize the moving optimization target introduced by bootstrapping.

DQN: Experience Replay

Instead of updating one NN for every observed transition, DQN proposes to maintain two networks: an online network and a target network (the latter holding the infrequently updated parameters $\theta^{old}$).

This allows something like a mini-batch update. You maintain a “replay buffer” ($D$) of new observations and perform mini-batch updates (with the optimization target held fixed across several updates).

\[l = \frac{1}{2} \Sigma_{(x,a,x',r)\in D} (r + \gamma \max_{a'} Q^*(x',a';\theta^{old}) - Q^*(x,a;\theta))^2\]
  • DQN does generate its own data in the pseudo-code. (But it is off-policy? lol) See the Reddit discussion on why DQN is off-policy.
  • The online network is updated for each “mini-batch” using this loss.
  • The target network is updated less frequently by copying the weights of the online network. (In the pseudo-code: every C steps)
  • New observations are generated using the Q-value estimates of the online network.
  • PAI didn’t mention this: the dataset is initialized to some amount before any updates, and during the inner loop (one iteration is one step of gameplay), you store every new transition in the dataset. During the mini-batch update, you actually sample a random mini-batch from $D$ because “randomizing the samples breaks these (strong) correlations (between consecutive samples) and therefore reduces the variance of the updates”. (Mnih et al. 2015)

The DQN paper (Mnih et al. 2015) was published in Nature and is accessible through purchase or an institution. If you do not have access, the pseudo-code is available on page 32 of this slide deck, and it should contain the information of interest.
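
A minimal sketch of the replay/mini-batch update described above; `online_net`, `target_net`, and the tensor layout of the replay buffer are my own placeholders, not from the paper.

```python
import random
import torch
import torch.nn.functional as F

def dqn_step(online_net, target_net, optimizer, replay_buffer,
             batch_size=32, gamma=0.99):
    """One mini-batch update of the online network against a frozen target network."""
    batch = random.sample(replay_buffer, batch_size)        # random sampling breaks correlations
    x, a, r, x_next, done = map(torch.stack, zip(*batch))
    q = online_net(x).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(x, a; theta)
    with torch.no_grad():                                   # target uses theta_old
        target = r + gamma * (1 - done.float()) * target_net(x_next).max(dim=1).values
    loss = F.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every C environment steps, copy the online weights into the target network:
# target_net.load_state_dict(online_net.state_dict())
```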

Double DQN (DDQN): Overcoming Maximization Bias

Note that in our previous loss, the maximization operator is applied to the Q-values estimated by the target network. This creates a bias, because the chosen action might not be the action the online network would choose (see Figure 12.1, which is also from the DDQN paper).

\[l = \frac{1}{2} \Sigma_{(x,a,x',r)\in D} (r + \gamma \textcolor{red}{\max_{a'} Q^*(x',a';\theta^{old})} - Q^*(x,a;\theta))^2\]

If you ask ChatGPT/Gemini, it’ll tell you that this bias exists because the max operator is not linear, so $E[\max_{a'} X_{a'}] \geq \max_{a'} E[X_{a'}]$ (Jensen’s inequality): noisy Q-estimates get systematically overestimated.

Also, note that this bias occurs in the loss calculation, so it is irrelevant to the playing stage.

DDQN improves on this by choosing the action that maximizes the Q-value of the online network (instead of applying the max operator to the target network’s Q-values), while still using the target network’s Q-value estimate of that action for the parameter update.
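
The only change relative to DQN is in how the target is computed; a hedged sketch with the same placeholder networks as above:

```python
import torch

@torch.no_grad()
def ddqn_target(online_net, target_net, r, x_next, done, gamma=0.99):
    """Double DQN target: the online network selects the action,
    the target network evaluates it."""
    a_star = online_net(x_next).argmax(dim=1)                              # argmax_a' Q(x', a'; theta)
    q_eval = target_net(x_next).gather(1, a_star.unsqueeze(1)).squeeze(1)  # Q(x', a*; theta_old)
    return r + gamma * (1 - done.float()) * q_eval
```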

Additionally, PAI provides another, on-policy-style interpretation/implementation of the DQN family: you update the online network for each transition, and update the target network after observing a certain number of transitions.

Policy Approximation/Policy Gradient Methods

Motivation: maximizing over all actions during every update is intractable for large or continuous action spaces, so we wish to learn a parameterized policy approximation $\pi_\phi(a \vert x)$ instead. The parameters should maximize the policy value, defined by the discounted payoff ($G_0$) and approximated by the bounded discounted payoff ($G_{0:T}$).

The Optimization Goal

\[\phi = argmax_\phi J(\pi_\phi); \quad J(\pi) = E_\pi [G_0]=E_\pi [\Sigma_t \gamma^t R_t] \approx E_\pi[G_{0:T}]\]

Calculating the exact expectation is a bit tricky (and costly): the policy $\pi$ can be stochastic, and the unmodeled transitions and rewards are definitely stochastic. However, we can use Monte Carlo sampling (basically play the game multiple times with the policy, producing multiple episodes) to approximate it:

\[J_T(\phi) \approx \frac{1}{m} \Sigma^m_i g^{i}_{0:T}\] \[g^i_{0:T} = \Sigma^{T-1}_{t=0} \gamma^t r^i_t\]

Policy Gradient

Then we need to update our policy by some portion of the gradient.

We first define the probability of a trajectory ($\tau$), and from it, an unbiased gradient estimate.

\[\Pi_\phi(\tau) = p(x_0) \prod_t \pi_\phi(a_t \vert x_t) p(x_{t+1}\vert x_t, a_t)\] \[\nabla_\phi J(\phi) \approx \nabla_\phi J_T(\phi) = \nabla_\phi E_{\tau \sim \Pi_\phi} [G_{0:T}]\]

We cannot move the gradient operator inside because the expectation is also dependent on $\phi$.

We can bypass this headache, under a regularity assumption, by using a score gradient estimator.

Note that:

\[\nabla_\phi \Pi_\phi (\tau) = \nabla_\phi e^{\log\Pi_\phi (\tau)} =\Pi_\phi(\tau) \nabla_\phi \log \Pi_\phi (\tau)\]

The regularity assumption states that you can swap the integral and differential operators if (1) the function inside of the integral is differentiable with respect to the parameter (2) the original function and its partial derivative must be integrable.

Thus,

\[\begin{aligned} & \nabla_\phi E_{\tau \sim \Pi_\phi} [G_{0:T}] = \nabla_\phi \int \Pi_\phi(\tau) G_0 d\tau\\ & = \int \nabla_\phi \Pi_\phi(\tau) G_0 d\tau \\ & = \int G_0 \Pi_\phi(\tau) \nabla_\phi \log \Pi_\phi (\tau) d \tau \\ & = E_{\tau \sim \Pi_\phi} [G_0 \nabla_\phi \log \Pi_\phi (\tau)]\\ \end{aligned}\]

Now we have moved the derivative inside, and we just have to worry about computing $\nabla_\phi \log \Pi_\phi (\tau)$. You can check that (since the policy is the only parameterized term):

\[\nabla_\phi \log \Pi_\phi (\tau) = \Sigma_t \nabla_\phi \log \pi_\phi (a_t\vert x_t)\]

Since it’s intractable to compute the exact expectation, we apply Monte Carlo sampling:

\[\nabla_\phi J_T(\phi) \approx \frac{1}{m} \Sigma_i g^{(i)}_{0:T} \Sigma_t \nabla_\phi \log \pi_\phi(a^{(i)}_t\vert x^{(i)}_t)\]

Note that in MC sampling we simply take the mean rather than a weighted average over the sequence likelihoods: the sequences are already sampled from $\Pi_\phi$, and in practice we don’t know $p(x_0)$ or the transition probabilities anyway, so we could not compute the sequence likelihood. We treat every observed sequence equally instead.
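
As a sanity check of the estimator above, here is a minimal autograd sketch; the `episodes` structure (per-step log-probabilities plus rewards) is my own assumption about how rollouts are stored.

```python
import torch

def score_gradient_loss(episodes, gamma=0.99):
    """Surrogate loss whose gradient is the MC score-gradient estimate
    (1/m) * sum_i g^{(i)}_{0:T} * sum_t grad log pi_phi(a_t | x_t).

    `episodes` is a list of (log_probs, rewards) pairs: log_probs is a 1-D
    tensor of log pi_phi(a_t | x_t) (requires grad), rewards a list of floats.
    """
    losses = []
    for log_probs, rewards in episodes:
        g = sum(gamma**t * r for t, r in enumerate(rewards))  # g_{0:T}, a constant w.r.t. phi
        losses.append(-g * log_probs.sum())                   # negative sign: we maximize J
    return torch.stack(losses).mean()

# loss = score_gradient_loss(episodes); loss.backward()  # gradient estimates -grad J_T(phi)
```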

Controlling Variance: Baseline and Downstream Return

$G_0$ is likely to be a positive number (a sum of discounted rewards), and our MC sampling can have high variance. We want to control the variance by controlling the magnitude of the multiplier, introducing a “normalizing” term (the baseline).

\[E_{\tau \sim \Pi_\phi}[G_0 \nabla_\phi \log \Pi_\phi (\tau)] = E_{\tau \sim \Pi_\phi}[(G_0-b) \nabla_\phi \log \Pi_\phi (\tau)]\]

We can do this, and the score gradient remains an unbiased estimate, because the introduced term has expectation 0 after expanding.

\[\begin{aligned} & = E_{\tau \sim \Pi_\phi}[G_0 \nabla_\phi \log \Pi_\phi (\tau)] - E_{\tau \sim \Pi_\phi}[b \nabla_\phi \log \Pi_\phi (\tau)];\\ & E_{\tau \sim \Pi_\phi}[b \nabla_\phi \log \Pi_\phi (\tau)] = b \int \nabla_\phi \Pi_\phi(\tau) d\tau \\ & = b \nabla_\phi \int \Pi_\phi(\tau) d\tau \\ & = b \nabla_\phi 1 = 0 \end{aligned}\]

The baseline $b$ can be a constant. However, it can also be the discounted payoff of the previous states (Example 12.7):

\[G_0 -b(\tau_{0:t-1}) = \gamma^t G_{t:T}\]

REINFORCE

This gives us the policy update of the REINFORCE algorithm (with downstream returns). Consult Algorithm 12.8 for pseudo-code. At every time step of an obtained episode, we perform this update:

\[\phi \leftarrow \phi + \eta \gamma^t g_{t:T} \nabla_\phi \log \pi_\phi(a_t \vert x_t)\]

See the book for details on further reducing the variance: you can additionally subtract the mean from each term of the downstream return sequence.
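
A sketch of the per-step REINFORCE update above, batched over one episode and including the optional mean subtraction; `policy` is assumed to be a module returning a torch distribution.

```python
import torch

def reinforce_episode_update(policy, optimizer, states, actions, rewards,
                             gamma=0.99, subtract_mean=True):
    """All per-step REINFORCE updates of one episode, batched into one gradient step."""
    # Downstream returns g_{t:T} for every t, computed backwards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    if subtract_mean:                      # optional extra variance reduction
        returns = returns - returns.mean()

    loss = 0.0
    for t in range(len(rewards)):
        log_prob = policy(states[t]).log_prob(actions[t])   # log pi_phi(a_t | x_t)
        loss = loss - (gamma ** t) * returns[t] * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```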

On-Policy Actor-Critic

Motivation: our previous mitigation of introducing a baseline reduces the magnitude of the variance rather than the underlying uncertainty. The idea of actor-critic is to incorporate the techniques from value approximation so that they can guide the policy update, reducing the variance (i.e., reducing the uncertainty of full-trajectory MC sampling).

Also, introducing value approximation enables the method to update at each time step without completing a rollout (compare Algorithm 12.8 with Algorithm 12.11), since the value approximation is updated at every time step and the policy update is well-defined given the value approximation.

Compared to pure value approximation methods, delegating action selection to the actor means the critic doesn’t have to perform the intractable optimization over an infinite action space.

Advantage Function

\[a^\pi(x,a) = q^\pi(x,a) - v^\pi (x) = q^\pi(x,a) - E_{a' \sim \pi(x)}[q^\pi(x,a')]\]

Optimizing the advantage function is the same as optimizing the Q-function. However, we gain a numerical advantage by optimizing the advantage function (its values are centered around zero).

Policy Gradient Theorem

We take the previously seen downstream-return policy gradient and condition on $x_t$ and $a_t$ instead. We also use a different estimator to reduce variance.

\(\begin{aligned} & \nabla_\phi J(\phi) = \Sigma^\infty_{t=0} E_{\tau \sim \Pi} \gamma^t G_t \nabla _\phi \log \pi_\phi (a_t \vert x_t)\\ & = \Sigma^\infty_{t=0} E_{\tau_{t:T} \sim \Pi} \gamma^t G_t \nabla _\phi \log \pi_\phi (a_t \vert x_t) \\ & = \Sigma^\infty_{t=0} E_{x_t, a_t} \gamma^t E_{\pi_\phi} [G_t \vert x_t, a_t] \nabla _\phi \log \pi_\phi (a_t \vert x_t)\\ & = \Sigma^\infty_{t=0} E_{x_t, a_t} \gamma^t \textcolor{red}{q^\pi_\phi(x_t, a_t)} \nabla _\phi \log \pi_\phi (a_t \vert x_t) \\ \end{aligned}\)

This can be rephrased by introducing the discounted state occupancy measure $\rho^\infty_\phi$, which measures how often we visit a state under a given policy:

\[\begin{aligned} & = \Sigma_t \int P_{X_t}(x) E_{a \sim \pi_\phi(. \vert x)} [\gamma^t q^{\pi_\phi}(x, a) \nabla _\phi \log \pi_\phi (a \vert x)] dx \\ & = \frac{1}{1-\gamma} \int \rho^\infty_\phi (x) E_{a \sim \pi_\phi(. \vert x)} [q^{\pi_\phi}(x, a) \nabla _\phi \log \pi_\phi (a \vert x)] dx\\ \\ & \rho^\infty_\phi(x) = (1-\gamma)\Sigma^\infty_{t=0} \gamma^t P_{X_t}(x) \end{aligned}\]

Actor-Critic

Both actor and critic should be parameterized by NNs.

  • Actor is the parameterized policy, $\pi_\phi$.

  • Critic is a parameterized value function approximation. $q^{\pi_\phi}(x,a) \approx Q^{\pi_\phi}(x,a; \theta)$

Recall the SARSA update from note 4 (with some rewriting):

\[Q^\pi(x,a)\leftarrow Q^\pi (x,a) + \alpha_t \textcolor{red}{(r+\gamma Q^\pi (x',a') - Q^\pi (x,a))}\]

The red part is the temporal difference error, $\delta$. If we use the same update here, with the chain rule, we can derive the update rule for the critic of the first actor-critic variant introduced by PAI:

\(\delta = r+\gamma Q^\pi (x',a'; \theta) - Q^\pi (x,a; \theta)\) \(\theta \leftarrow \theta + \eta \delta \nabla_\theta Q(x,a;\theta)\)

The update rule for the actor simply substitutes the downstream discounted payoff with the critic’s Q-value estimate:

\[\phi \leftarrow \phi + \eta \gamma^t Q(x,a;\theta) \nabla_\phi \log \pi_\phi(a\vert x)\]

Consult Algorithm 12.11 for full details; note that the algorithm is on-policy and the update is performed for each obtained transition.
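
A per-transition sketch of this first (Q-based) actor-critic: the TD error drives the critic and the critic’s Q-estimate drives the actor. `actor` and `critic` are placeholder modules (the actor returns a distribution, the critic a scalar).

```python
import torch

def q_actor_critic_step(actor, critic, actor_opt, critic_opt,
                        x, a, r, x_next, a_next, t, gamma=0.99):
    """One on-policy update from a single transition (x, a, r, x', a')."""
    # Critic: SARSA-style bootstrapped target and TD error.
    q = critic(x, a)
    with torch.no_grad():
        target = r + gamma * critic(x_next, a_next)
    critic_loss = 0.5 * (target - q) ** 2
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: the critic's Q-estimate replaces the downstream discounted payoff.
    log_prob = actor(x).log_prob(a)
    actor_loss = -(gamma ** t) * critic(x, a).detach() * log_prob
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```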

This is, however, problematic.

The Q-value estimate is not the discounted payoff we observed; rather, it is biased.

Also, note that SARSA is on-policy (the Q-function it learns depends on the policy), and the policy (actor) is updated in every iteration. This means the critic can never produce a fully accurate estimate.

There is, thus, no guarantee of improvement for the actor.

Advantage Actor Critic

Advantage Actor-Critic (A2C): the critic estimates the advantage rather than the Q-function. It is easier to model a relative value than an absolute value, since you mostly need to get the sign right: a positive advantage induces a positive gradient that encourages the action, and vice versa.

\[\phi \leftarrow \phi + \eta \gamma^t A(x,a;\theta) \nabla_\phi \log \pi_\phi(a\vert x)\]

Generalized Advantage Estimation (GAE): the motivation of GAE is to perform a bias-variance tradeoff between value approximation (low variance, biased) and policy approximation (high variance, unbiased). GAE adopts a stochastic policy and $\epsilon$-greedy to encourage exploration. But this doesn’t solve the sample inefficiency of the on-policy setting.

There are different ways to approximate/model the advantage function (1-step TD and GAE), and PAI doesn’t seem to discuss that much. (?)
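
Although PAI doesn’t go into detail, the GAE recursion itself is short. A sketch under the standard definition $\hat A_t = \delta_t + \gamma\lambda \hat A_{t+1}$ with $\delta_t = r_t + \gamma V(x_{t+1}) - V(x_t)$:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` has length T+1 (it includes V(x_T) for bootstrapping).
    lam=0 recovers the 1-step TD advantage (low variance, biased);
    lam=1 recovers the Monte Carlo return minus baseline (high variance, unbiased).
    """
    T = len(rewards)
    adv, last = np.zeros(T), 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # 1-step TD error
        last = delta + gamma * lam * last                       # A_t = delta_t + gamma*lam*A_{t+1}
        adv[t] = last
    return adv
```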

Trust-Region Policy Optimization (TRPO): TRPO aims to improve sample efficiency by allowing some reuse of past data via stable and larger policy updates. (“kinda” analogous to DQN)

TRPO constrains the update at each iteration ($k$) using the KL divergence between the stochastic policies, so that the policy update lands in a “trust region”.

\[\phi_{k+1}\leftarrow argmax_\phi J(\phi) \quad \text{s.t.} \quad E_{x \sim \rho^\infty_{\phi_k}}KL[\pi_{\phi_k}(.\vert x)\vert \vert \pi_{\phi}(.\vert x)] \leq \delta\] \[J(\phi) = E_{x \sim \rho^\infty_{\phi_k}; a \sim \pi_{\phi_k}} [w_k(\phi; x,a) A^{\pi_{\phi_k}}(x,a)]\] \[w_k(\phi;x,a) = \frac{\pi_{\phi}(a\vert x)}{\pi_{\phi_k}(a\vert x)}\]

The expectation is a rephrased policy objective that maximizes the weighted advantage, taking into account how often the states are visited (the discounted state occupancy measure).

The importance-sampling weight is applied so that the policy evaluation focuses on actions that the new policy chooses more often than the previous policy.

The KL divergence constrains the new policy so that it behaves roughly similarly to the previous policy on frequently visited states. By taking small steps, we allow the critic to keep up and stay effective (increasing sample efficiency). Also, with this constraint, the importance weights won’t become too large (say, infinity) or too small.

PPO replaces the constrained optimization with an unconstrained optimization plus regularization. This introduces a soft trade-off in place of a hard constraint.

\[\phi_{k+1}\leftarrow argmax_\phi J(\phi) - \lambda E_{x \sim \rho^\infty_{\phi_k}}KL[\pi_{\phi_k}(.\vert x)\vert \vert \pi_{\phi}(.\vert x)]\]

Remark 12.12 shows that the KL divergence can be estimated by Monte Carlo, with a baseline-like correction term: (?)

\[KL = E_{a\sim \pi_{\phi_k}} [w_k(\phi; x,a) - 1 - \log w_k(\phi; x,a)]\]
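
Putting the pieces together, a sketch of the resulting PPO objective with this Monte Carlo KL estimator (this is the KL-penalty variant described here, not the clipped-surrogate variant commonly used in practice):

```python
import torch

def ppo_penalty_loss(new_log_probs, old_log_probs, advantages, kl_coef=0.1):
    """Importance-weighted surrogate with a KL penalty (PPO-penalty flavor).

    `old_log_probs` come from the behavior policy pi_{phi_k} and are constants;
    `advantages` are estimates of A^{pi_{phi_k}}(x, a).
    """
    w = torch.exp(new_log_probs - old_log_probs.detach())   # w_k(phi; x, a)
    surrogate = (w * advantages.detach()).mean()            # weighted advantage objective
    kl_estimate = (w - 1.0 - torch.log(w)).mean()           # MC estimate of KL(pi_{phi_k} || pi_phi)
    return -(surrogate - kl_coef * kl_estimate)             # negate: the optimizer minimizes
```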

GRPO uses a Monte Carlo estimate of the advantage function instead of the critic’s estimate. GRPO is more computationally efficient than PPO while retaining its sample efficiency. GRPO embraces the high variance of the rollouts, but it applies group-wise normalization with the group mean and standard deviation.

\[\begin{aligned} & \hat J(\phi) = E_{[\tau^{(i)}]^m_i \sim \Pi_{\phi_k}} [\frac{1}{m} \Sigma^m_i \Sigma_t w_k(\phi;x,a) \hat{A}^{\pi_{\phi_k}}_{t,i}]\\ & \hat{A}^{\pi_{\phi_k}}_{t,i} = \frac{g^{(i)}_{t:T} - mean(\{\tau^i\}) }{std(\{\tau^i\})}\\ \end{aligned}\]
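
A sketch of the group-relative advantage above: sample a group of rollouts from the same prompt/state and normalize their returns within the group (whole-trajectory returns are assumed here, as in outcome-reward GRPO):

```python
import numpy as np

def grpo_advantages(group_returns, eps=1e-8):
    """Group-relative advantages for m rollouts of the same prompt/state.

    Each rollout's return is normalized by the group's mean and std; the
    resulting advantage is shared by every time step of that rollout.
    """
    g = np.asarray(group_returns, dtype=float)
    return (g - g.mean()) / (g.std() + eps)
```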

You can even see GRPO as a variant of the vanilla REINFORCE!

It sort of makes sense for GRPO to be proposed in a generative LM paper: if you see every word/token as an action, there is no immediate reward for each action, only a preference score (from a human?) for the entire sequence. (?)

It is not explicitly shown here or in the book, but GRPO still adopts PPO’s idea of constrained updates despite the absence of a learned critic. On second thought this actually makes sense, because the regularization can reduce the instability of the policy parameters caused by high variance. (?) Note that GRPO uses reverse KL and PPO uses forward KL. (Why?)

Off-Policy Actor-Critic

Even with the previously introduced mitigations, on-policy actor-critic can still suffer from sample inefficiency. Off-policy actor-critic is another family of methods. Naturally, off-policy implies reusing past data.

So we naturally want to decouple actor and critic training. The actor is used to generate “playing” data to be stored in a replay buffer, and the critic is trained by sampling experience from the replay buffer, like we did in DQN. The actor is trained to maximize the estimated Q-value across states (instead of being updated by a policy gradient from a single transition).

Recall that we could not evaluate the DQN loss with a large action space because of the max operator. Similarly, the actor network lifts the need for it.

As usual, the regression target for the critic is a bootstrapped estimate.

Exploration distribution: $\mu(x) > 0$ is the exploration distribution. It denotes the sampling process from the replay buffer. Transitions from all possible states are sampled (full support). This ensures no states are missed and thus an unbiased gradient estimate.

\[\begin{aligned} & \hat J_\mu (\phi; \theta) = E_{x\sim \mu} [Q^*(x, \pi_\phi(x); \theta)]\\ \\ & \nabla_\phi \hat J_\mu (\phi; \theta) = E_{x\sim \mu} [\nabla_\phi Q^*(x, \pi_\phi(x); \theta)]\\ & = E_{x\sim \mu} [\textbf{J}_\phi \pi_\phi(x) \nabla_a Q^*(x, a; \theta)\vert _{a=\pi_\phi(x)}] \end{aligned}\]

$\textbf{J}_\phi \pi_\phi(x)$ is the Jacobian matrix, and $\nabla_a Q^*(x, a; \theta)\vert_{a=\pi_\phi(x)}$ is a gradient vector of the same length as the action vector. The matrix-vector product accumulates the “blame” on each parameter across the action dimensions (chain rule).

For the same reason, DDPG simply alternates between playing/collecting data and updating both actor and critic from the replay buffer.

Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3)

PAI first introduces us to the setting where the policy is deterministic.

The old parameters are kept and updated with an EMA (Polyak averaging) to support bootstrapping.

In this setting, noise is added to the action chosen by the actor to encourage exploration. Also, note that the actor is updated in every inner loop with its own chosen actions. (Algorithm 12.13)

Regression target of a transition for Critic \(y = r+\gamma Q^*(x',\pi(x';\phi^{old});\theta^{old})\)

Critic Update \(\theta \leftarrow \theta - \eta \nabla_\theta \frac{1}{B} \Sigma_{(x,a,r,x',y)\in B} (y-Q^*(x,a;\theta))^2\) \(\theta_{old} = (1-\rho)\theta_{old} + \rho \theta\)

Actor Update \(\phi \leftarrow \phi + \eta \nabla_\phi \frac{1}{B} \Sigma_{(x,a,r,x',y)\in B} Q^*(x,\pi(x;\phi);\theta)\) \(\phi_{old} = (1-\rho)\phi_{old} + \rho \phi\)
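
A sketch of one DDPG mini-batch update matching the equations above, with Polyak/EMA averaging for the old parameters; the module and optimizer names are placeholders:

```python
import torch

def ddpg_update(actor, critic, actor_old, critic_old, actor_opt, critic_opt,
                batch, gamma=0.99, rho=0.005):
    """One DDPG mini-batch update; `batch` is (x, a, r, x_next) tensors."""
    x, a, r, x_next = batch
    # Critic regression target, bootstrapped with the old actor and old critic.
    with torch.no_grad():
        y = r + gamma * critic_old(x_next, actor_old(x_next))
    critic_loss = ((y - critic(x, a)) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor ascends the critic's Q-value at its own chosen actions.
    actor_loss = -critic(x, actor(x)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak / EMA update of the old networks.
    with torch.no_grad():
        for old, new in ((actor_old, actor), (critic_old, critic)):
            for p_old, p in zip(old.parameters(), new.parameters()):
                p_old.mul_(1 - rho).add_(rho * p)
```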

Similarly to DDQN, TD3 improves DDPG by introducing a second critic network and delaying actor updates.

Stochastic Policies and Reparameterization Trick

Critic

The critic gradient involves an expectation over the stochastic policy and, because the action space is intractable, probably sampling:

\[\nabla_\theta E_{a'\sim \pi(x';\phi^{old})} [\frac{1}{2}(r+\gamma Q^*(x',a';\theta^{old}) -Q^*(x,a;\theta))^2 ]\]

Actor

The actor gradient is slightly trickier, since we cannot move the gradient operator inside the expectation (again).

\[\nabla_\phi \hat J(\phi; \theta) = E_{x\sim \mu} \nabla_\phi E_{a\sim \pi(x;\phi)}[Q^*(x,a;\theta)]\]

This time, we use the reparameterization trick.

(Suppose $\pi$ is a Gaussian distribution thus reparameterizable)

\[\begin{aligned} & \nabla_\phi E_{a\sim \pi(x;\phi)}[Q^*(x,a;\theta)] \\ & = E_{\epsilon \sim Z} [ \nabla_\phi Q^*(x, a_\epsilon;\theta)] ; \\ & a_\epsilon = \mu(x;\phi)+\Sigma(x;\phi)^{\frac{1}{2}}\epsilon, \epsilon \sim Z. \end{aligned}\]

Consult the book for more general notation, since the Gaussian is not the only reparameterizable distribution.
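
A sketch of the reparameterized actor gradient for a Gaussian policy, using `rsample()` (which implements exactly $a_\epsilon = \mu + \Sigma^{1/2}\epsilon$); the actor returning a mean and std is my assumption:

```python
import torch
from torch.distributions import Normal

def reparam_actor_loss(actor, critic, x):
    """Pathwise (reparameterized) estimate of -E_{a ~ pi(x; phi)}[Q(x, a; theta)].

    `actor(x)` is assumed to return the mean and std of a Gaussian policy.
    """
    mu, std = actor(x)
    a = Normal(mu, std).rsample()   # a_eps = mu + std * eps, eps ~ N(0, I)
    return -critic(x, a).mean()     # gradients flow through a back into phi
```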

Maximum Entropy RL (MERL) / Entropy Regularization

Motivation: a randomized/stochastic policy can collapse to a deterministic one. MaxEnt regularizes the policy (by maximizing entropy, i.e., increasing uncertainty, as the name suggests) and prevents that.

\[\begin{aligned} & J_{\lambda}(\phi) = J(\phi) + \lambda H[\Pi_{\phi}]\\ & = \Sigma_t E_{(x_t,a_t)\sim \Pi_\phi} [r(x_t,a_t) + \lambda H[\pi_\phi(.\vert x_t)]] \end{aligned}\]

The entropy of some given distributions (for a continuous action space) can be written in closed form. For a univariate Gaussian:

\[H = \frac{1}{2} \log(2\pi \sigma^2) + \frac{1}{2} = \frac{1}{2} \log(2\pi e \sigma^2)\]
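
A quick numerical check of this closed form against `torch.distributions` (my own sanity check, not from the book):

```python
import math
import torch
from torch.distributions import Normal

sigma = 0.7
closed_form = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
print(closed_form)                          # 0.5 * log(2*pi*e*sigma^2)
print(Normal(0.0, sigma).entropy().item())  # matches the closed form
```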

Control as Inference / Soft Actor-Critic (SAC) (12.6.1)

More here

I will defer the RLHF part (12.7) of this chapter to the “extension”.