<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Neil Xu</title>
<link>https://blog.neilzxu.me/</link>
<atom:link href="https://blog.neilzxu.me/index.xml" rel="self" type="application/rss+xml"/>
<description>Notes on statistics, machine learning, and research.</description>
<generator>quarto-1.9.38</generator>
<lastBuildDate>Tue, 30 Jun 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>A note on PPO</title>
  <dc:creator>Neil </dc:creator>
  <link>https://blog.neilzxu.me/posts/welcome/</link>
  <description><![CDATA[ 





<p>A lot of recent <em>RL</em> algorithms focus on KL divergence. Since it’s a core interview question, let’s work through the RLHF + PPO formulation and its relation to DPO. The first step is getting an analytic representation of the optimal policy. Everything I have learned about <em>RL</em> I have learned from <span class="citation" data-cites="rlhf2026lambert">Lambert (2026)</span> or GPT, and what follows is no exception.</p>
<p>Ok, so we take for granted that the PPO objective is a good idea, which <a href="https://arxiv.org/abs/2407.13399">not everyone does</a>. What is it anyways?</p>
<p>Let <img src="https://latex.codecogs.com/png.latex?r"> be our reward function that takes in a context <img src="https://latex.codecogs.com/png.latex?x"> and a completion <img src="https://latex.codecogs.com/png.latex?y">, i.e., the LLM output based on <img src="https://latex.codecogs.com/png.latex?x">. Thus, <img src="https://latex.codecogs.com/png.latex?r(x,%20y)"> is the reward or “goodness” of the completion <img src="https://latex.codecogs.com/png.latex?y"> for <img src="https://latex.codecogs.com/png.latex?x">. Somehow, we never see this reward in real life.</p>
<p>We can let <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BY%7D"> be the (discrete) domain of <img src="https://latex.codecogs.com/png.latex?y">.</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctext%7Bref%7D%7D"> is the reference policy that we wish to stay close to in KL divergence. Why do we do this? I’m not really sure, especially since PPO has different roots than RL on LLMs. From the abstract of the original PPO paper: “Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.” Huh.</p>
<p>Then, the PPO objective is to solve the following optimization problem, where <img src="https://latex.codecogs.com/png.latex?%5Ctexttt%7BP%7D"> is the true distribution of contexts and <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> is a positive real hyperparameter:</p>
<p><img src="https://latex.codecogs.com/png.latex?%20%5Cmax_%7B%5Ctext%7Bpolicy%7D%20%5Cpi%7D%20%5Cmathbb%7BE%7D_%7BY%20%5Csim%20%5Cpi(%5Ccdot%20%5Cmid%20X),%20X%20%5Csim%20%5Ctexttt%7BP%7D%7D%5Br(X,%20Y)%5D%20-%20%5Cbeta%20%5Cmathbb%7BE%7D_%7BX%20%5Csim%20%5Ctexttt%7BP%7D%7D%5BD_%7B%5Ctext%7BKL%7D%7D(%5Cpi(%20%5Ccdot%20%5Cmid%20X)%20%7C%7C%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(%20%5Ccdot%20%5Cmid%20X))%5D%20"></p>
<p>Here, <img src="https://latex.codecogs.com/png.latex?D_%7B%5Ctext%7BKL%7D%7D"> is KL divergence which is defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?D_%7B%5Ctext%7BKL%7D%7D(P%20%7C%7C%20Q)%20=%20%5Cmathbb%7BE%7D_%7BX%20%5Csim%20P%7D%5Cleft%5B%5Clog%20%5Cfrac%7BP(X)%7D%7BQ(X)%7D%5Cright%5D."></p>
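<p>To make the definition concrete, here is a minimal sketch for discrete distributions using only the Python standard library. The function name <code>kl_divergence</code> and the two example distributions are made up for illustration. It shows that KL is zero when the two arguments coincide and that it is not symmetric, which is why forward and reverse KL are different objects.</p>
<pre>
```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions given as lists of probabilities."""
    # Terms with p_i == 0 contribute nothing, by the convention 0 * log 0 = 0.
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# Hypothetical distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]

kl_pq = kl_divergence(p, q)  # expectation taken under p
kl_qp = kl_divergence(q, p)  # expectation taken under q
kl_pp = kl_divergence(p, p)  # a distribution against itself

print(kl_pq, kl_qp, kl_pp)
```
</pre>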
<p>I had to refer to the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Wikipedia page</a> while looking up the definition of KL, and the derivation below seems to be an application of the Donsker-Varadhan variational formula. But that formula is somewhat difficult to read and build intuition from.</p>
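<p>Let’s pretend that <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BX%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BY%7D"> are countable for simplicity (which they are for LLMs) and <img src="https://latex.codecogs.com/png.latex?r"> is bounded (which guarantees that the sums below are finite). Writing the KL term as an expectation over <img src="https://latex.codecogs.com/png.latex?Y%20%5Csim%20%5Cpi(%5Ccdot%20%5Cmid%20X)"> and merging it with the reward term, we can reformulate the above objective as follows:</p>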
<p><img src="https://latex.codecogs.com/png.latex?%20%5Cmax_%7B%5Ctext%7Bpolicy%7D%5C%20%5Cpi%7D%20%5Cmathbb%7BE%7D_%7BY%20%5Csim%20%5Cpi(%5Ccdot%20%5Cmid%20X),%20X%20%5Csim%20%5Ctexttt%7BP%7D%7D%5Br(X,%20Y)%20-%20%5Cbeta%20%5Clog%20%5Cpi(Y%20%5Cmid%20X)%20+%20%5Cbeta%20%5Clog%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(Y%20%5Cmid%20X)%5D."></p>
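<p>As a sanity check on this rewriting, the sketch below evaluates both forms of the objective on a toy problem and prints them side by side. Everything in it is a made-up assumption for illustration: two contexts, three completions, and the specific probabilities, rewards, and value of beta.</p>
<pre>
```python
import math

# Hypothetical toy problem: two contexts, three completions each.
context_probs = {"x1": 0.7, "x2": 0.3}                      # P(x)
policy = {"x1": [0.6, 0.3, 0.1], "x2": [0.2, 0.2, 0.6]}     # pi(y | x)
reference = {"x1": [0.5, 0.3, 0.2], "x2": [0.3, 0.3, 0.4]}  # pi_ref(y | x)
reward = {"x1": [0.0, 1.0, 2.0], "x2": [1.0, 0.0, 0.5]}     # r(x, y)
beta = 0.5

def two_term_form():
    """E[r(X, Y)] - beta * E_X[KL(pi(. | X) || pi_ref(. | X))]."""
    total = 0.0
    for x, px in context_probs.items():
        expected_reward = sum(p * r for p, r in zip(policy[x], reward[x]))
        kl = sum(p * math.log(p / q) for p, q in zip(policy[x], reference[x]))
        total += px * (expected_reward - beta * kl)
    return total

def single_expectation_form():
    """E[r(X, Y) - beta * log pi(Y | X) + beta * log pi_ref(Y | X)]."""
    return sum(
        px * p * (r - beta * math.log(p) + beta * math.log(q))
        for x, px in context_probs.items()
        for p, q, r in zip(policy[x], reference[x], reward[x])
    )

print(two_term_form(), single_expectation_form())
```
</pre>
<p>The two printed numbers should agree up to floating-point error, since the second form is just the first with the KL expectation written out.</p>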
<p>Now, we will proceed by using the fact that the distribution closest in KL to a distribution <img src="https://latex.codecogs.com/png.latex?Q"> is <img src="https://latex.codecogs.com/png.latex?Q"> itself; forward or reverse KL (what we’re doing here is reverse), it’s all the same. So instead of doing KKT or something, we can just construct the right distribution. Written out as sums, the objective above becomes</p>
<p><img src="https://latex.codecogs.com/png.latex?%20%5Cmax_%7B%5Ctext%7Bpolicy%7D%5C%20%5Cpi%7D%20%5Cbeta%20%5Csum_%7Bx%20%5Cin%20%5Cmathcal%7BX%7D%7D%20%5Csum_%7By%20%5Cin%20%5Cmathcal%7BY%7D%7D%20(r(x,%20y)%20/%20%5Cbeta%20+%20%5Clog%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%20%5Cmid%20x)%20-%20%5Clog%20%5Cpi(y%20%5Cmid%20x))%20%5Cpi(y%20%5Cmid%20x)%20%5Ccdot%20%5Ctexttt%7BP%7D(x)."></p>
<p>Let’s focus on the inner optimization, since we are now choosing a separate probability distribution for each value of <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathcal%7BX%7D">. We will also subtract the constant <img src="https://latex.codecogs.com/png.latex?%5Clog%20z(x)"> inside the sum. This is fine since it doesn’t depend on <img src="https://latex.codecogs.com/png.latex?%5Cpi(%5Ccdot%20%5Cmid%20x)">: it only shifts the objective by <img src="https://latex.codecogs.com/png.latex?%5Cbeta%20%5Clog%20z(x)"> and leaves the maximizer unchanged:</p>
<p><img src="https://latex.codecogs.com/png.latex?%20%5Cmax_%7B%5Ctext%7Bdistribution%7D%5C%20%5Cpi(%5Ccdot%20%5Cmid%20x)%7D%20%5Cbeta%20%20%5Csum_%7By%20%5Cin%20%5Cmathcal%7BY%7D%7D%20(r(x,%20y)%20/%20%5Cbeta%20+%20%5Clog%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%20%5Cmid%20x)%20-%20%5Clog%20z(x)%20-%20%5Clog%20%5Cpi(y%20%5Cmid%20x))%20%5Cpi(y%20%5Cmid%20x)"></p>
<p>where</p>
<p><img src="https://latex.codecogs.com/png.latex?z(x)%20=%20%5Csum_%7By%20%5Cin%20%5Cmathcal%7BY%7D%7D%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%20%5Cmid%20x)%5Cexp(r(x,%20y)%20/%20%5Cbeta)."></p>
<p>We do this because we can now define the probability distribution:</p>
<p><img src="https://latex.codecogs.com/png.latex?%20%5Cpi%5E*(y%20%5Cmid%20x)%20=%20%5Cfrac%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%20%5Cmid%20x)%5Cexp(r(x,%20y)%20/%20%5Cbeta)%7D%7Bz(x)%7D"></p>
<p>and substitute it into our inner maximization to get that</p>
<p><img src="https://latex.codecogs.com/png.latex?%20%5Cmax_%7B%5Ctext%7Bdistribution%7D%5C%20%5Cpi(%5Ccdot%20%5Cmid%20x)%7D%20-%5Cbeta%20D_%7B%5Ctext%7BKL%7D%7D(%5Cpi(%5Ccdot%20%5Cmid%20x)%20%7C%7C%20%5Cpi%5E*(%5Ccdot%20%5Cmid%20x)).%20"></p>
<p>Since KL divergence is nonnegative and equals zero only when the two distributions coincide, picking <img src="https://latex.codecogs.com/png.latex?%5Cpi%20=%20%5Cpi%5E*"> maximizes this, and we get that <img src="https://latex.codecogs.com/png.latex?%5Cpi%5E*"> is the solution to the PPO objective. No one seems to have been super interested in this solution previously, possibly because we don’t have access to <img src="https://latex.codecogs.com/png.latex?r(x,%20y)">? Perhaps it was useful to inverse RL people.</p>
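<p>The closed form can be checked numerically on a single context. The sketch below is only an illustration under made-up assumptions (three completions, a hypothetical reference policy and reward, and beta = 0.5; the helper names are mine). It builds the reweighted reference policy, evaluates the per-context objective there, and compares it against randomly drawn policies.</p>
<pre>
```python
import math
import random

def objective(pi, ref_probs, rewards, beta):
    """Per-context objective: E_pi[r] - beta * KL(pi || pi_ref)."""
    return sum(
        p * (r - beta * math.log(p / q))
        for p, q, r in zip(pi, ref_probs, rewards)
        if p > 0
    )

def random_distribution(n, rng):
    """Draw an arbitrary distribution over n outcomes by normalizing random weights."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical single context with three completions.
ref_probs = [0.5, 0.3, 0.2]   # pi_ref(y | x)
rewards = [0.0, 1.0, 2.0]     # r(x, y)
beta = 0.5

# pi_star(y | x) is proportional to pi_ref(y | x) * exp(r(x, y) / beta).
weights = [q * math.exp(r / beta) for q, r in zip(ref_probs, rewards)]
z = sum(weights)              # the normalizer z(x)
pi_star = [w / z for w in weights]

best_value = objective(pi_star, ref_probs, rewards, beta)

# Gap between the value at pi_star and the value at 1000 random policies.
rng = random.Random(0)
gaps = [
    best_value - objective(random_distribution(3, rng), ref_probs, rewards, beta)
    for _ in range(1000)
]

print(best_value, beta * math.log(z), min(gaps))
```
</pre>
<p>The first two printed values should match, since the objective at the optimum equals beta times log z(x), and the smallest gap should be nonnegative because each gap is beta times the KL divergence between the random policy and the optimum.</p>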




<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-rlhf2026lambert" class="csl-entry">
Lambert, Nathan. 2026. <em>Reinforcement Learning from Human Feedback</em>. Online. <a href="https://rlhfbook.com">https://rlhfbook.com</a>.
</div>
</div></section></div> ]]></description>
  <category>LLM</category>
  <category>RL</category>
  <guid>https://blog.neilzxu.me/posts/welcome/</guid>
  <pubDate>Tue, 30 Jun 2026 00:00:00 GMT</pubDate>
</item>
</channel>
</rss>
