A note on PPO

Neil — Tue, 30 Jun 2026 00:00:00 GMT

A lot of RL algorithms of late are focusing on KL divergence. Since it’s a core interview question, let’s work through the RLHF + PPO formulation and its relation to DPO. The first step is getting an analytic representation of the optimal policy. Everything I have learned about RL I have learned from Lambert (2026) or GPT, and what follows is no exception.

Ok, so we take for granted that the PPO objective is a good idea, which not everyone does. What is it anyways?

Let $r(x, y)$ be our reward function that takes in a context $x$ and a completion $y$, i.e., the LLM output based on $x$. Thus, $r(x, y)$ is the reward or “goodness” of the completion $y$ for $x$. Somehow, we never see this reward in real life.

We can let $\mathcal{Y}$ be the (discrete) domain of $y$.

$\pi_{\text{ref}}$ is the reference policy that we wish to stay close to in KL divergence. Why do we do this? I’m not sure really, especially since PPO has different roots than RL on LLMs. From the abstract of the original PPO paper: “Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.” Huh.

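Then, the PPO objective is to solve the following optimization problem, where $\mathcal{D}$ is the true distribution of contexts and $\beta$ is a positive real hyperparameter:

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D}} \Big[ \mathbb{E}_{y \sim \pi(\cdot \mid x)} \big[ r(x, y) \big] - \beta \, D_{\mathrm{KL}}\big( \pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big) \Big]$$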

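Here, $D_{\mathrm{KL}}$ is the KL divergence, which for two distributions $P$ and $Q$ on $\mathcal{Y}$ is defined as

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{y \in \mathcal{Y}} P(y) \log \frac{P(y)}{Q(y)}$$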
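To make the definition concrete, here is a minimal sketch of discrete KL divergence in plain Python. The function name `kl_divergence` and the two toy distributions are my own choices for illustration, not anything from Lambert (2026). It uses the usual conventions that terms with zero mass under the first argument contribute nothing, and that the divergence is infinite if the first argument puts mass where the second has none.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions given as equal-length lists."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Toy distributions over two outcomes (illustrative values only).
p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, q))  # D_KL(p || q)
print(kl_divergence(q, p))  # D_KL(q || p), a different number: KL is not symmetric
print(kl_divergence(p, p))  # 0.0, a distribution has zero divergence from itself
```

The asymmetry is why it matters which argument the policy goes in: in the objective above the policy being optimized is the first argument.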

I had to refer to the Wikipedia page while looking up the definition of KL, and the following seems to be an application of Donsker-Varadhan. But that is somewhat difficult to read and intuitively get.

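Let’s pretend that $\mathcal{X}$ and $\mathcal{Y}$ are countable for simplicity (which they are for LLMs) and $r$ is bounded (which I believe is necessary). We can reformulate the above objective as follows:

$$\max_{\pi} \; \sum_{x \in \mathcal{X}} \mathcal{D}(x) \sum_{y \in \mathcal{Y}} \pi(y \mid x) \left[ r(x, y) - \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right]$$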

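Now, we will proceed by realizing that the KL minimizer of a distribution is itself: forward or reverse KL (what we’re doing here is reverse), it’s all the same, since $D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$ with equality exactly when $P = Q$. So instead of doing KKT or something, we can just construct a distribution to compare against. Because $\pi(\cdot \mid x)$ can be chosen separately for each $x$ and $\mathcal{D}(x) \ge 0$, the above optimization problem can now be written as

$$\sum_{x \in \mathcal{X}} \mathcal{D}(x) \; \max_{\pi(\cdot \mid x)} \; \sum_{y \in \mathcal{Y}} \pi(y \mid x) \left[ r(x, y) - \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right]$$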

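Well, let’s focus on the inner optimization, since we are now choosing a different probability distribution for each value of $x$. We will also add a term that does not depend on $\pi$ (which is fine, since for a fixed $x$ it’s a constant and does not change the maximizer):

$$\max_{\pi(\cdot \mid x)} \; \sum_{y \in \mathcal{Y}} \pi(y \mid x) \left[ r(x, y) - \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right] - \beta \log Z(x)$$

where

$$Z(x) = \sum_{y \in \mathcal{Y}} \pi_{\text{ref}}(y \mid x) \exp\left( \frac{r(x, y)}{\beta} \right)$$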

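We do this because we can now define the probability distribution

$$\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \exp\left( \frac{r(x, y)}{\beta} \right)$$

and substitute it into our inner maximization to get that the inner problem is

$$\max_{\pi(\cdot \mid x)} \; -\beta \, D_{\mathrm{KL}}\big( \pi(\cdot \mid x) \,\|\, \pi^*(\cdot \mid x) \big)$$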
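The substitution step is easy to get lost in, so here is a worked version. The notation is my assumption about what the post intends: $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp(r(x, y)/\beta)$ is the normalizer, $\pi^*(y \mid x) = \pi_{\text{ref}}(y \mid x) \exp(r(x, y)/\beta) / Z(x)$ is the distribution we defined, and the added constant is $-\beta \log Z(x)$. Rearranging the definition of $\pi^*$ gives $r(x, y) = \beta \log \frac{\pi^*(y \mid x) \, Z(x)}{\pi_{\text{ref}}(y \mid x)}$, and plugging that in:

```latex
\begin{aligned}
&\sum_{y} \pi(y \mid x) \left[ r(x, y) - \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right] - \beta \log Z(x) \\
&\quad = \sum_{y} \pi(y \mid x) \left[ \beta \log \frac{\pi^*(y \mid x) \, Z(x)}{\pi_{\text{ref}}(y \mid x)} - \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right] - \beta \log Z(x) \\
&\quad = -\beta \sum_{y} \pi(y \mid x) \log \frac{\pi(y \mid x)}{\pi^*(y \mid x)} + \beta \log Z(x) \sum_{y} \pi(y \mid x) - \beta \log Z(x) \\
&\quad = -\beta \, D_{\mathrm{KL}}\big( \pi(\cdot \mid x) \,\|\, \pi^*(\cdot \mid x) \big)
\end{aligned}
```

The last line uses $\sum_y \pi(y \mid x) = 1$, so the two $\beta \log Z(x)$ terms cancel and the $\pi_{\text{ref}}$ factors drop out of the logarithms.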

Naturally, picking $\pi = \pi^*$ maximizes this, since the KL term is then zero, and we get that $\pi^*$ is the solution to the PPO objective above. No one has been super interested in this solution previously because we don’t have access to $r$, possibly? Perhaps it was useful to inverse RL people.
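As a sanity check on the closed form, here is a small numerical sketch for a single fixed context with four completions. Everything in it is an assumption for illustration: the rewards, the reference policy, the value of the KL weight, and the helper names `objective`, `optimal_policy`, and `random_policy`. It builds the candidate optimum as the reference policy reweighted by the exponentiated reward, then checks two things: the objective there equals the KL weight times the log of the normalizer, and no randomly drawn policy does better.

```python
import math
import random


def objective(pi, pi_ref, rewards, beta):
    """Expected reward minus beta * KL(pi || pi_ref), for one fixed context."""
    value = 0.0
    for p, p_ref, r in zip(pi, pi_ref, rewards):
        if p > 0.0:  # zero-probability completions contribute nothing
            value += p * (r - beta * math.log(p / p_ref))
    return value


def optimal_policy(pi_ref, rewards, beta):
    """Return (pi_star, Z) where pi_star is proportional to pi_ref * exp(r / beta)."""
    weights = [p_ref * math.exp(r / beta) for p_ref, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z


def random_policy(n, rng):
    """Draw an arbitrary strictly positive distribution over n completions."""
    draws = [rng.random() + 1e-12 for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


# Toy setup (all values are made up for the example).
rng = random.Random(0)
rewards = [1.0, 0.0, -1.0, 2.0]
pi_ref = [0.4, 0.3, 0.2, 0.1]
beta = 0.5

pi_star, z = optimal_policy(pi_ref, rewards, beta)
best = objective(pi_star, pi_ref, rewards, beta)

# The objective at pi_star should match beta * log Z up to rounding.
print(best, beta * math.log(z))

# Gap between pi_star and 1000 random policies; each gap is beta * KL(pi || pi_star).
gaps = [
    best - objective(random_policy(len(rewards), rng), pi_ref, rewards, beta)
    for _ in range(1000)
]
print(min(gaps) >= -1e-12)
```

Random search is of course not a proof, but it is a cheap way to catch a sign error or a misplaced $\beta$ in the derivation.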

References

Lambert, Nathan. 2026. Reinforcement Learning from Human Feedback. Online. https://rlhfbook.com.

Neil Xu
