# Policy Gradients

## 1 Motivation: Limitations of Deep Q-Learning

In Deep Q-Learning (DQN), a neural network approximates the action-value function $ Q(s,a) $, and the greedy policy is recovered by taking the action with the largest predicted value:

$$
\pi(s) = \arg\max_a Q(s,a)
$$

![DQN algorithm with experience replay and target network. Source: Mnih et al. (2015) [4] via Lil'Log [8].](images/weng_dqn_algorithm.png)

This works well for **discrete** action spaces, but it creates two important limitations:

1. **Continuous actions**: in a continuous action space, computing $ \arg\max_a Q(s,a) $ may require solving a difficult continuous optimization problem at every state, rather than simply taking the max over finitely many actions.
2. **Stochastic policies**: the greedy DQN policy is deterministic. Exploration can be added externally, such as with $ \epsilon $-greedy, but DQN does not directly parameterize a stochastic policy $ \pi_\theta(a \mid s) $.

This motivates a key question: **What if we learn the policy directly?**

---

## 2 Policy Gradient Architecture

Instead of learning Q-values and deriving a policy afterward, a **policy network** directly parameterizes the policy:

$$
\pi_\theta(a \mid s) = \mathbb{P}(A=a \mid S=s;\theta)
$$

For **discrete** action spaces, the policy network outputs a categorical distribution over actions.

For **continuous** action spaces, the policy often outputs parameters of a continuous distribution, such as the mean and standard deviation of a Gaussian:

$$
a \sim \mathcal{N}(\mu_\theta(s), \sigma_\theta(s)^2)
$$

The action is then sampled from this distribution.

### 2.1 Continuous Policies and Reparameterization

For continuous stochastic policies, one common idea is to represent randomness separately from the policy parameters.
For example, a Gaussian sample can be written as

$$
z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0,I)
$$

so that

$$
z \sim \mathcal{N}(\mu,\sigma^2 I)
$$

This is called the **reparameterization trick**. It isolates the randomness in $ \epsilon $, which does not depend on the learned parameters, allowing gradients to flow through $ \mu $ and $ \sigma $.

![Reparameterization trick: in the original form (left) gradients cannot flow through the stochastic node z; in the reparameterized form (right) z = g(φ, x, ε) with ε ~ p(ε), enabling backpropagation. Source: Kingma & Welling (2014) [9] via Lil'Log [10].](images/weng_reparam_trick.png)

However, vanilla policy gradient methods such as REINFORCE do **not** require backpropagating through the sampled action. Instead, they use the **log-derivative trick**, which differentiates the log-probability of the sampled action:

$$
\nabla_\theta \log \pi_\theta(a \mid s)
$$

This is why policy gradients can be used even when the action sampling process, reward function, or environment dynamics are not directly differentiable.

---

## 3 Goal of Reinforcement Learning

The goal is to find policy parameters $ \theta^* $ that maximize the expected discounted return:

$$
J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=0}^{T} \gamma^t r_t \right]
$$

where:

- $ \tau = (s_0,a_0,r_0,\ldots,s_T,a_T,r_T) $ is a trajectory sampled by following $ \pi_\theta $.
- $ p_\theta(\tau) $ is the trajectory distribution induced by the policy and environment.
- $ \gamma \in [0,1] $ is the discount factor.

The optimal parameters are:

$$
\theta^* = \arg\max_\theta J(\theta), \qquad \pi^* = \pi_{\theta^*}
$$

---

## 4 Optimization via Gradient Ascent

In supervised learning, we usually **minimize** a loss using gradient descent.
In policy optimization, we **maximize** the objective $ J(\theta) $ using gradient ascent:

$$
\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\theta_k)
$$

To compute the gradient, we need to differentiate through an expectation over trajectories:

$$
\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=0}^{T} \gamma^t r_t \right]
$$

This is challenging because:

1. the expectation is over all possible trajectories;
2. the trajectory distribution $ p_\theta(\tau) $ depends on $ \theta $;
3. the environment transition dynamics may be unknown or non-differentiable;
4. the reward function may not be differentiable with respect to $ \theta $.

The policy gradient theorem gives a practical way to estimate this gradient from sampled trajectories.

---

## 5 Policy Gradient Theorem

The **Policy Gradient Theorem** gives a tractable expression for $ \nabla_\theta J(\theta) $.

### 5.1 Express $ J(\theta) $ Using the Value Function

We can write the objective as the expected value of the starting state:

$$
J(\theta) = \mathbb{E}_{s_0 \sim p(s_0)} \left[ V^{\pi_\theta}(s_0) \right] = \sum_{s_0} p(s_0) V^{\pi_\theta}(s_0)
$$

where

$$
V^{\pi_\theta}(s) = \mathbb{E}_{\tau \sim p_\theta(\tau \mid s_0=s)} \left[ \sum_{t=0}^{T} \gamma^t r_t \right]
$$

is the expected return starting from state $ s $ and following policy $ \pi_\theta $.

### 5.2 Recall: Relationship Between $ V $ and $ Q $

The value function averages over the action-values under the policy:

$$
V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s,a)
$$

For continuous action spaces, the sum becomes an integral:

$$
V^\pi(s) = \int \pi(a \mid s) Q^\pi(s,a)\, da
$$

### 5.3 Log-Derivative Trick

The central trick is the identity:

$$
\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s) \nabla_\theta \log \pi_\theta(a \mid s)
$$

This is called the **log-derivative trick** or **score-function trick**.
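
The identity can be checked numerically. The sketch below is a minimal illustration, assuming a three-action softmax policy with hypothetical logits `theta`; it compares a finite-difference estimate of $ \nabla_\theta \pi_\theta(a) $ with $ \pi_\theta(a) \nabla_\theta \log \pi_\theta(a) $, using the standard softmax score $ \mathbb{1}[j=a] - \pi_\theta(j) $. The helper names are illustrative and not part of the original notes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Analytic score of a softmax policy: d log pi(a) / d theta_j = 1[j == a] - pi(j)."""
    probs = softmax(logits)
    return [(1.0 if j == action else 0.0) - probs[j] for j in range(len(logits))]

def finite_diff_grad_prob(logits, action, eps=1e-6):
    """Central finite-difference estimate of d pi(a) / d theta_j for each j."""
    grad = []
    for j in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[j] += eps
        down[j] -= eps
        grad.append((softmax(up)[action] - softmax(down)[action]) / (2 * eps))
    return grad

theta = [0.5, -1.0, 2.0]  # hypothetical logits (assumption, for illustration only)
action = 1
probs = softmax(theta)

lhs = finite_diff_grad_prob(theta, action)                        # grad of pi(a)
rhs = [probs[action] * g for g in grad_log_prob(theta, action)]   # pi(a) * grad log pi(a)
max_gap = max(abs(l - r) for l, r in zip(lhs, rhs))
print(max_gap)  # tiny; limited only by finite-difference and rounding error
```
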
Intuitively, instead of differentiating through the reward or the environment, we differentiate the log-probability of the action that the policy took.

A simplified derivation begins with:

$$
J(\theta) = \sum_{s_0} p(s_0) V^{\pi_\theta}(s_0)
$$

Differentiate with respect to $ \theta $:

$$
\nabla_\theta J(\theta) = \sum_{s_0} p(s_0) \nabla_\theta V^{\pi_\theta}(s_0)
$$

Expand $ V^{\pi_\theta}(s_0) $ using $ Q^{\pi_\theta} $:

$$
\nabla_\theta J(\theta) = \sum_{s_0} p(s_0) \nabla_\theta \sum_a \pi_\theta(a \mid s_0) Q^{\pi_\theta}(s_0,a)
$$

By the product rule, the gradient acts on both $ \pi_\theta $ and $ Q^{\pi_\theta} $. Applying the log-derivative trick to the $ \nabla_\theta \pi_\theta $ term, and deferring the $ \nabla_\theta Q^{\pi_\theta} $ term to the full theorem, gives the basic form:

$$
\nabla_\theta J(\theta) = \sum_{s_0} p(s_0) \sum_a \pi_\theta(a \mid s_0) \nabla_\theta \log \pi_\theta(a \mid s_0) Q^{\pi_\theta}(s_0,a)
$$

![Policy-Gradient Theorem Part 2: full derivation via the log-derivative trick, step by step. Source: CS 4782 Lecture Slides [6].](images/slide-22.png)

The full theorem extends this idea from the initial state to all states visited under the policy.

### 5.4 Final Form

Let $ d^{\pi_\theta}(s) $ denote the state visitation distribution induced by following policy $ \pi_\theta $. In discounted settings, this is often a discounted state visitation distribution.

The policy gradient theorem can be written as:

$$
\boxed{ \nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)} \left[ \nabla_\theta \log \pi_\theta(a \mid s) Q^{\pi_\theta}(s,a) \right] }
$$

![The Policy Gradient Theorem: for any differentiable policy, the gradient of the objective equals the expectation of the log-policy gradient weighted by the cumulative return. Source: HuggingFace Deep RL Course [7].](images/hf_policy_gradient_theorem.png)

**Key insight:** We never need to differentiate through the reward function or environment dynamics. We only differentiate the log-probability of the action under the policy.

---

## 6 REINFORCE Algorithm

REINFORCE is the classic Monte Carlo policy gradient algorithm [1].
The key idea is to replace the unknown action-value function $ Q^{\pi_\theta}(s_t,a_t) $ with the empirical return from that timestep onward. Define the return from timestep $ t $ as:

$$
G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k
$$

This is a Monte Carlo estimate of the return obtained after taking action $ a_t $ in state $ s_t $ and then following the policy. Thus, REINFORCE uses the gradient estimate:

$$
\hat{g}_t = \nabla_\theta \log \pi_\theta(a_t \mid s_t) G_t
$$

![REINFORCE training loop: collect an episode with π, compute the return, then increase the probability of sampled actions in proportion to their returns. Source: HuggingFace Deep RL Course [7].](images/hf_pg_bigpicture.jpg)

### Algorithm

```text
Initialize policy parameters θ randomly
for each episode do
    Sample trajectory τ = (s₀,a₀,r₀,...,s_T,a_T,r_T) by following πθ
    for each timestep t = 0, ..., T do
        Compute return G_t = Σ_{k=t}^{T} γ^{k-t} r_k
        Compute gradient estimate ĝ_t = ∇θ log πθ(a_t | s_t) · G_t
    end for
    Update θ ← θ + α Σ_t ĝ_t
end for
```

Some implementations average the gradient over timesteps:

$$
\theta \leftarrow \theta + \alpha \frac{1}{T+1} \sum_{t=0}^{T} \hat{g}_t
$$

Averaging is an implementation choice that changes the scale of the update and can help stabilize training across episodes of different lengths.

### Advantages and Limitations

| Property | Explanation |
|---|---|
| **Unbiased** | $ G_t $ is an unbiased Monte Carlo estimate of $ Q^{\pi_\theta}(s_t,a_t) $ under the current policy. |
| **General** | Works for both discrete and continuous action spaces. |
| **High variance** | A single sampled trajectory may produce a very noisy estimate of the expected return. |
| **Slow convergence** | High variance can make learning unstable or sample-inefficient. |

---

## 7 On-Policy vs. Off-Policy

| | On-Policy | Off-Policy |
|---|---|---|
| **Definition** | Learn from data collected by the current or very recent policy being improved. | Learn from data collected by a different behavior policy. |
| **Data reuse** | Old data cannot be reused freely because the policy has changed. | Past samples can often be reused, such as through experience replay. |
| **Examples** | REINFORCE, A2C/A3C, PPO | Q-Learning, DQN |
| **Stability** | Often simpler and more stable conceptually. | Can be unstable and may require corrections such as importance sampling. |
| **Sample efficiency** | Usually less sample-efficient. | Usually more sample-efficient. |

**Q-Learning and DQN are off-policy** because they can learn about the greedy target policy while collecting data from another behavior policy, such as an $ \epsilon $-greedy policy.

**REINFORCE is on-policy** because it estimates gradients using trajectories sampled from the current policy.

---

## 8 Issues with Policy Gradient Methods

### 8.1 Exploration

Policy gradient methods can converge prematurely to suboptimal behavior if the policy becomes too deterministic too early.

Common exploration strategies include:

1. **Stochastic policies**: sampling from $ \pi_\theta(a \mid s) $ naturally introduces exploration.
2. **Entropy regularization**: add an entropy bonus to the objective to discourage premature determinism:

   $$
   J_{\text{entropy}}(\theta) = J(\theta) + \beta \mathbb{E}_{s} \left[ \mathcal{H}(\pi_\theta(\cdot \mid s)) \right]
   $$

3. **Action noise**: especially in continuous control, deliberately injecting noise into actions can help exploration.

### 8.2 High Variance

From the same starting state, different sampled trajectories can produce very different returns. This makes the policy gradient estimate noisy.

![REINFORCE variance: from the same Pong starting state, different trajectories yield returns of +100, −1000, +10, and −40. This variance makes learning unstable. Source: HuggingFace Deep RL Course [7].](images/hf_variance.jpg)

The gradient estimate

$$
\nabla_\theta \log \pi_\theta(a_t \mid s_t)G_t
$$

depends heavily on the sampled return $ G_t $.
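
To make the role of $ G_t $ concrete, here is a minimal sketch of the discounted return $ G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k $, computed with one backward pass over an episode's rewards. The function name, the two reward sequences, and the choice $ \gamma = 0.5 $ are illustrative assumptions, not values from the notes.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_T] where G_t = sum_{k >= t} gamma**(k - t) * r_k."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each step reuses the recursion G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Two hypothetical episodes from the same start state that end differently.
episode_a = [0.0, 0.0, 1.0]
episode_b = [0.0, 0.0, -1.0]
gamma = 0.5

returns_a = discounted_returns(episode_a, gamma)
returns_b = discounted_returns(episode_b, gamma)
print(returns_a)  # [0.25, 0.5, 1.0]
print(returns_b)  # [-0.25, -0.5, -1.0]
```

In this toy case the first action is weighted by +0.25 in one episode and by −0.25 in the other, so the same log-probability gradient is pushed in opposite directions depending on which episode was sampled.
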
If $ G_t $ varies a lot across trajectories, the updates will also vary a lot.

### 8.3 Reward Hacking

Agents can maximize the reward in unintended ways. This is called **reward hacking**. For example, if a reward function gives points for moving forward, an agent might learn to move forward unsafely or exploit a simulator bug rather than solving the intended task.

Designing reward functions that are robust to exploitation is a central challenge in reinforcement learning.

---

## 9 Variance Reduction with a Baseline

Instead of asking:

> Was this return good?

we ask:

> Was this return better than expected from this state?

To do this, subtract a **baseline** $ b(s_t) $ from the return:

$$
\hat{g}_t = \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( G_t - b(s_t) \right)
$$

A well-chosen baseline reduces variance and does not introduce bias, as long as it does not depend on the action $ a_t $. Why?

$$
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)} \left[ \nabla_\theta \log \pi_\theta(a \mid s)b(s) \right] = b(s) \sum_a \pi_\theta(a \mid s) \nabla_\theta \log \pi_\theta(a \mid s)
$$

Using the log-derivative trick:

$$
= b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)
$$

Since probabilities sum to one,

$$
= b(s) \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s) \nabla_\theta 1 = 0
$$

Therefore, subtracting an action-independent baseline does not change the expected gradient.

The natural baseline is the value function:

$$
b(s_t) = V^{\pi_\theta}(s_t)
$$

This gives the **advantage function**:

$$
A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)
$$

![The Advantage function A(s,a) = Q(s,a) − V(s): Q(s,a) is the value of taking action a, while V(s) is the average value of the state. The advantage measures how much better action a is relative to the average. Source: HuggingFace Deep RL Course [7].](images/hf_advantage1.jpg)

The advantage tells us whether action $ a $ was better or worse than the average action from state $ s $.
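
The effect of the baseline can be verified exactly in a small case. The sketch below is an illustration under stated assumptions: a single state, a three-action softmax policy with uniform logits, and known, noise-free action values `q` (in practice these would be noisy returns). It enumerates all actions to compute the exact mean and total variance of $ \nabla_\theta \log \pi_\theta(a)\,(Q(a) - b) $ with $ b = 0 $ and with $ b = V $. All names and numbers are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def estimator_stats(logits, q_values, baseline):
    """Exact mean vector and total variance (trace of the covariance) of
    grad log pi(a) * (Q(a) - baseline), with a drawn from the softmax policy."""
    probs = softmax(logits)
    n = len(logits)
    mean = [0.0] * n
    second_moment = 0.0
    for a in range(n):
        # Softmax score: d log pi(a) / d theta_j = 1[j == a] - pi(j)
        score = [(1.0 if j == a else 0.0) - probs[j] for j in range(n)]
        g = [s * (q_values[a] - baseline) for s in score]
        for j in range(n):
            mean[j] += probs[a] * g[j]
        second_moment += probs[a] * sum(x * x for x in g)
    variance = second_moment - sum(x * x for x in mean)
    return mean, variance

theta = [0.0, 0.0, 0.0]   # uniform policy (assumption)
q = [10.0, 11.0, 12.0]    # hypothetical action values
value = sum(p * qa for p, qa in zip(softmax(theta), q))  # V = sum_a pi(a) Q(a)

mean_plain, var_plain = estimator_stats(theta, q, baseline=0.0)
mean_base, var_base = estimator_stats(theta, q, baseline=value)
print(mean_plain, var_plain)
print(mean_base, var_base)
```

For these numbers the two mean vectors agree, while the total variance drops from roughly 80.9 without a baseline to roughly 0.22 with $ b = V $. The reduction is large here because all action values share a big common offset, which the baseline removes.
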
Using the advantage, the policy gradient becomes:

$$
\nabla_\theta J(\theta) = \mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a \mid s) A^{\pi}(s,a) \right]
$$

---

## 10 Actor-Critic Algorithms

Actor-Critic methods combine policy-based and value-based ideas [2].

![Actor-Critic intuition: the Actor takes actions and the Critic evaluates them, providing a feedback signal that improves the policy without waiting for the full episode return. Source: HuggingFace Deep RL Course [7].](images/hf_ac.jpg)

### Components

- **Actor** $ \pi_\theta(a \mid s) $: the policy network that selects actions.
- **Critic** $ V_\phi(s) $, $ Q_\phi(s,a) $, or $ A_\phi(s,a) $: a value network that evaluates the actor's behavior.

The actor is updated using a policy gradient, while the critic is trained to estimate values.

Instead of using the full Monte Carlo return $ G_t $, actor-critic methods often use an advantage estimate:

$$
\hat{g}_t = \nabla_\theta \log \pi_\theta(a_t \mid s_t) \hat{A}_t
$$

A common one-step advantage estimate is the TD error:

$$
\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)
$$

Then the actor update uses:

$$
\hat{g}_t = \nabla_\theta \log \pi_\theta(a_t \mid s_t) \delta_t
$$

The critic is trained to make its value predictions more accurate, for example by minimizing a squared TD error:

$$
\mathcal{L}_{\text{critic}} = \left( r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) \right)^2
$$

### Why Actor-Critic Helps

REINFORCE waits until the end of an episode to compute Monte Carlo returns. This is unbiased but high variance.

Actor-critic methods use a learned critic to estimate value, which usually reduces variance and allows more frequent online updates.

The tradeoff is that the critic may be inaccurate, which can introduce bias.

### A2C vs. A3C

![A3C (asynchronous) vs A2C (synchronous): parallel agents collect diverse experience and send updates to shared global network parameters, stabilizing training and improving sample efficiency. Source: Mnih et al. (2016) [5] via Lil'Log [11].](images/weng_a3c_vs_a2c.png)

| Method | Meaning | Key Idea |
|---|---|---|
| **A3C** | Asynchronous Advantage Actor-Critic | Multiple workers interact with environments in parallel and asynchronously update shared global parameters. |
| **A2C** | Advantage Actor-Critic | A synchronous version where workers collect experience in parallel and updates are coordinated together. |

Both methods use the advantage idea to improve the actor update.

---

## 11 Summary

| Method | Estimate Used | Variance | Bias | Main Tradeoff |
|---|---|---|---|---|
| **REINFORCE** | Monte Carlo return $ G_t $ | High | None | Simple but noisy |
| **REINFORCE + Baseline** | $ G_t - b(s_t) $ | Lower | None, if baseline is action-independent | More stable updates |
| **Actor-Critic** | Learned value, Q-value, or advantage estimate | Usually lower | Possible critic bias | More sample-efficient, but depends on critic quality |

Key takeaways:

- **Policy gradient methods** directly optimize a parameterized policy $ \pi_\theta(a \mid s) $.
- The objective is to maximize expected discounted return:

  $$
  J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=0}^{T} \gamma^t r_t \right]
  $$

- The **Policy Gradient Theorem** gives:

  $$
  \nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}, a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a \mid s) Q^{\pi_\theta}(s,a) \right]
  $$

- The key computational advantage is that we only need to differentiate the log-policy, not the reward function or environment dynamics.
- **REINFORCE** uses Monte Carlo returns. It is unbiased but high variance.
- **Baselines** reduce variance without bias as long as the baseline does not depend on the action.
- The most common baseline is the value function, giving the advantage:

  $$
  A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)
  $$

- **Actor-Critic** methods use a learned critic to estimate values or advantages, reducing variance and enabling online updates.
- The tradeoff is that actor-critic methods can introduce bias if the critic is inaccurate.

---

## References

1. Williams, R.J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. *Machine Learning*, 8, 229–256.
2. Konda, V.R., & Tsitsiklis, J.N. (2000). Actor-Critic Algorithms. *NeurIPS*, 13.
3. Sutton, R.S., & Barto, A.G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press. [PDF](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)
4. Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. *Nature*, 518, 529–533.
5. Mnih, V., et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. *ICML*. [arXiv:1602.01783](https://arxiv.org/abs/1602.01783)
6. CS 4782 Deep Learning, Cornell. Week 9 Lecture Slides (Policy Gradients). Thanks to Varsha Kishore, Justin Lovelace, Luke Kulm, Jinzhou Li.
7. HuggingFace Deep RL Course. [https://huggingface.co/learn/deep-rl-course/](https://huggingface.co/learn/deep-rl-course/)
8. Weng, L. (2018). A (Long) Peek into Reinforcement Learning. *Lil'Log*. [link](https://lilianweng.github.io/posts/2018-02-19-rl-overview/)
9. Kingma, D.P., & Welling, M. (2014). Auto-Encoding Variational Bayes. *ICLR*. [arXiv:1312.6114](https://arxiv.org/abs/1312.6114)
10. Weng, L. (2018). From Autoencoder to Beta-VAE. *Lil'Log*. [link](https://lilianweng.github.io/posts/2018-08-12-vae/)
11. Weng, L. (2018). Policy Gradient Algorithms. *Lil'Log*. [link](https://lilianweng.github.io/posts/2018-04-08-policy-gradient/)
12. Schulman, J., et al. (2016). High-dimensional continuous control using generalized advantage estimation. *ICLR*. [arXiv:1506.02438](https://arxiv.org/abs/1506.02438)