# Policy Gradients
## 1 Motivation: Limitations of Deep Q-Learning
In Deep Q-Learning (DQN), a neural network approximates the action-value function $ Q(s,a) $, and the greedy policy is recovered by taking the action with the largest predicted value:
$$
\pi(s) = \arg\max_a Q(s,a)
$$
![DQN algorithm with experience replay and target network. Source: Mnih et al. (2015) [4] via Lil'Log [8].](images/weng_dqn_algorithm.png)
This works well for **discrete** action spaces, but it creates two important limitations:
1. **Continuous actions** — in a continuous action space, computing $ \arg\max_a Q(s,a) $ may require solving a difficult continuous optimization problem at every state, rather than simply taking the max over finitely many actions.
2. **Stochastic policies** — the greedy DQN policy is deterministic. Exploration can be added externally, such as with $ \epsilon $-greedy, but DQN does not directly parameterize a stochastic policy $ \pi_\theta(a \mid s) $.
This motivates a key question:
**What if we learn the policy directly?**
---
## 2 Policy Gradient Architecture
Instead of learning Q-values and deriving a policy afterward, a **policy network** directly parameterizes the policy:
$$
\pi_\theta(a \mid s) = \mathbb{P}(A=a \mid S=s;\theta)
$$
For **discrete** action spaces, the policy network outputs a categorical distribution over actions.
For **continuous** action spaces, the policy often outputs parameters of a continuous distribution, such as the mean and standard deviation of a Gaussian:
$$
a \sim \mathcal{N}(\mu_\theta(s), \sigma_\theta(s)^2)
$$
The action is then sampled from this distribution.
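As a minimal sketch of both cases, the snippet below samples one action from a categorical policy and one from a Gaussian policy, and returns the log-probability that policy gradient methods need later. The logits, mean, and standard deviation are made-up stand-ins for network outputs, and the helper names are illustrative rather than taken from any library.

```python
import math
import random

def softmax(logits):
    """Turn unnormalized scores into a categorical distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_discrete(logits, rng):
    """Sample an action index from the categorical policy and return its log-probability."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])

def sample_gaussian(mu, sigma, rng):
    """Sample a scalar action from N(mu, sigma^2) and return its log-density."""
    action = rng.gauss(mu, sigma)
    log_prob = (
        -0.5 * ((action - mu) / sigma) ** 2
        - math.log(sigma)
        - 0.5 * math.log(2.0 * math.pi)
    )
    return action, log_prob

rng = random.Random(0)
# Hypothetical network outputs for a single state (assumed values).
a_disc, logp_disc = sample_discrete([2.0, 0.5, -1.0], rng)
a_cont, logp_cont = sample_gaussian(mu=0.3, sigma=0.5, rng=rng)
```

In a real agent the logits or $ (\mu, \sigma) $ would come from the policy network evaluated at the current state.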
### 2.1 Continuous Policies and Reparameterization
For continuous stochastic policies, one common idea is to represent randomness separately from the policy parameters. For example, a Gaussian sample can be written as
$$
z = \mu + \sigma \odot \epsilon,
\qquad
\epsilon \sim \mathcal{N}(0,I)
$$
so that
$$
z \sim \mathcal{N}(\mu,\sigma^2 I)
$$
This is called the **reparameterization trick**. It isolates the randomness in $ \epsilon $, which does not depend on the learned parameters, allowing gradients to flow through $ \mu $ and $ \sigma $.
![Reparameterization trick: in the original form (left) gradients cannot flow through the stochastic node z; in the reparameterized form (right) z = g(φ, x, ε) with ε ~ p(ε), enabling backpropagation. Source: Kingma & Welling (2014) [9] via Lil'Log [10].](images/weng_reparam_trick.png)
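A small sketch of the reparameterized sampler for a scalar Gaussian follows. It only checks empirically that $ z = \mu + \sigma \epsilon $ has the intended mean and standard deviation; the values of `mu` and `sigma` are assumed for illustration.

```python
import random
import statistics

def reparameterized_sample(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of (mu, sigma) and noise eps."""
    eps = rng.gauss(0.0, 1.0)  # randomness that does not depend on the parameters
    z = mu + sigma * eps
    # For fixed eps, z is affine in the parameters: dz/dmu = 1 and dz/dsigma = eps.
    return z, eps

rng = random.Random(0)
mu, sigma = 1.5, 0.5  # assumed parameter values
samples = [reparameterized_sample(mu, sigma, rng)[0] for _ in range(20000)]
sample_mean = statistics.mean(samples)
sample_std = statistics.pstdev(samples)
```

Because `eps` is drawn independently of `mu` and `sigma`, an automatic differentiation framework can treat it as a constant input and propagate gradients to both parameters.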
However, vanilla policy gradient methods such as REINFORCE do **not** require backpropagating through the sampled action. Instead, they use the **log-derivative trick**, which differentiates the log-probability of the sampled action:
$$
\nabla_\theta \log \pi_\theta(a \mid s)
$$
This is why policy gradients can be used even when the action sampling process, reward function, or environment dynamics are not directly differentiable.
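To make the quantity $ \nabla_\theta \log \pi_\theta(a \mid s) $ concrete, here is a sketch for a softmax policy whose parameters are the logits themselves (a tabular assumption, not a neural network). For this parameterization the gradient is the one-hot vector of the sampled action minus the action probabilities, which the code checks against finite differences.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_prob(logits, action):
    return math.log(softmax(logits)[action])

def score(logits, action):
    """Analytic gradient of log pi(action) with respect to the logits: onehot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def finite_difference_score(logits, action, h=1e-6):
    """Central-difference approximation of the same gradient, used only as a check."""
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += h
        down[i] -= h
        grad.append((log_prob(up, action) - log_prob(down, action)) / (2.0 * h))
    return grad

logits = [0.2, -0.4, 1.0]  # assumed values for illustration
analytic = score(logits, action=2)
numeric = finite_difference_score(logits, action=2)
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
```

Note that nothing here differentiates through the sampling step or the environment: only the log-probability of the chosen action is differentiated.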
---
## 3 Goal of Reinforcement Learning
The goal is to find policy parameters $ \theta^* $ that maximize the expected discounted return:
$$
J(\theta)
=
\mathbb{E}_{\tau \sim p_\theta(\tau)}
\left[
\sum_{t=0}^{T} \gamma^t r_t
\right]
$$
where:
- $ \tau = (s_0,a_0,r_0,\ldots,s_T,a_T,r_T) $ is a trajectory sampled by following $ \pi_\theta $.
- $ p_\theta(\tau) $ is the trajectory distribution induced by the policy and environment.
- $ \gamma \in [0,1] $ is the discount factor.
The optimal parameters are:
$$
\theta^* = \arg\max_\theta J(\theta),
\qquad
\pi^* = \pi_{\theta^*}
$$
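In practice the expectation in $ J(\theta) $ is approximated by averaging discounted returns over sampled trajectories. The sketch below does this for hand-written reward sequences that stand in for rollouts of $ \pi_\theta $; the numbers are invented for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for one trajectory's reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

def estimate_objective(reward_sequences, gamma):
    """Monte Carlo estimate of J(theta): the average discounted return over trajectories."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)

# Made-up reward sequences standing in for trajectories sampled from pi_theta.
trajectories = [[1.0, 1.0, 1.0], [0.0, 2.0], [4.0]]
j_hat = estimate_objective(trajectories, gamma=0.5)
# Per-trajectory returns are 1.75, 1.0, and 4.0, so j_hat is their average, 2.25.
```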
---
## 4 Optimization via Gradient Ascent
In supervised learning, we usually **minimize** a loss using gradient descent.
In policy optimization, we **maximize** the objective $ J(\theta) $ using gradient ascent:
$$
\theta_{k+1}
=
\theta_k
+
\alpha \nabla_\theta J(\theta_k)
$$
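The update rule can be illustrated on a toy objective whose gradient is known in closed form. The function $ J(\theta) = -(\theta - 3)^2 $ below is an assumed stand-in, not a reinforcement learning objective; the point is only the sign of the update.

```python
def objective_grad(theta):
    """Gradient of the toy objective J(theta) = -(theta - 3)^2."""
    return -2.0 * (theta - 3.0)

def gradient_ascent(theta, alpha, steps):
    for _ in range(steps):
        theta = theta + alpha * objective_grad(theta)  # plus sign: we climb the objective
    return theta

theta_final = gradient_ascent(theta=0.0, alpha=0.1, steps=100)
# theta_final approaches the maximizer theta = 3.
```

Deep learning libraries usually expose minimizers, so policy gradient code often performs ascent on $ J $ by descending on $ -J $.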
To compute the gradient, we need to differentiate through an expectation over trajectories:
$$
\nabla_\theta J(\theta)
=
\nabla_\theta
\mathbb{E}_{\tau \sim p_\theta(\tau)}
\left[
\sum_{t=0}^{T} \gamma^t r_t
\right]
$$
This is challenging because:
1. the expectation is over all possible trajectories;
2. the trajectory distribution $ p_\theta(\tau) $ depends on $ \theta $;
3. the environment transition dynamics may be unknown or non-differentiable;
4. the reward function may not be differentiable with respect to $ \theta $.
The policy gradient theorem gives a practical way to estimate this gradient from sampled trajectories.
---
## 5 Policy Gradient Theorem
The **Policy Gradient Theorem** gives a tractable expression for $ \nabla_\theta J(\theta) $.
### 5.1 Express $ J(\theta) $ Using the Value Function
We can write the objective as the expected value of the starting state:
$$
J(\theta)
=
\mathbb{E}_{s_0 \sim p(s_0)}
\left[
V^{\pi_\theta}(s_0)
\right]
=
\sum_{s_0} p(s_0) V^{\pi_\theta}(s_0)
$$
where
$$
V^{\pi_\theta}(s)
=
\mathbb{E}_{\tau \sim p_\theta(\tau \mid s_0=s)}
\left[
\sum_{t=0}^{T} \gamma^t r_t
\right]
$$
is the expected return starting from state $ s $ and following policy $ \pi_\theta $.
### 5.2 Recall: Relationship Between $ V $ and $ Q $
The value function averages over the action-values under the policy:
$$
V^\pi(s)
=
\sum_a \pi(a \mid s) Q^\pi(s,a)
$$
For continuous action spaces, the sum becomes an integral:
$$
V^\pi(s)
=
\int \pi(a \mid s) Q^\pi(s,a)\, da
$$
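Both forms can be sketched numerically. The discrete case is an exact weighted sum; the continuous case is approximated here by Monte Carlo sampling from a Gaussian policy. The action-value function `q_fn` is a hypothetical example chosen so that the exact answer is known.

```python
import random

def state_value_discrete(policy_probs, q_values):
    """V(s) = sum_a pi(a | s) Q(s, a) for a single state."""
    return sum(p * q for p, q in zip(policy_probs, q_values))

def state_value_continuous(mu, sigma, q_fn, num_samples, rng):
    """Monte Carlo approximation of the integral of pi(a | s) Q(s, a) with a ~ N(mu, sigma^2)."""
    total = 0.0
    for _ in range(num_samples):
        total += q_fn(rng.gauss(mu, sigma))
    return total / num_samples

v_discrete = state_value_discrete([0.2, 0.5, 0.3], [1.0, 2.0, 4.0])
# 0.2 * 1 + 0.5 * 2 + 0.3 * 4 = 2.4

rng = random.Random(0)
# Hypothetical Q(s, a) = -(a - 1)^2 for a fixed state; under N(0, 1) the exact value is -2.
v_continuous = state_value_continuous(0.0, 1.0, lambda a: -(a - 1.0) ** 2, 50000, rng)
```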
### 5.3 Log-Derivative Trick
The central trick is the identity:
$$
\nabla_\theta \pi_\theta(a \mid s)
=
\pi_\theta(a \mid s)
\nabla_\theta \log \pi_\theta(a \mid s)
$$
This is called the **log-derivative trick** or **score-function trick**.
Intuitively, instead of differentiating through the reward or the environment, we differentiate the log-probability of the action that the policy took.
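The identity follows from the chain rule applied to the logarithm. As a short worked step, assuming $ \pi_\theta(a \mid s) > 0 $ for the action in question:
$$
\nabla_\theta \log \pi_\theta(a \mid s)
=
\frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
\quad\Longrightarrow\quad
\nabla_\theta \pi_\theta(a \mid s)
=
\pi_\theta(a \mid s)
\nabla_\theta \log \pi_\theta(a \mid s)
$$
Its usefulness is that it turns the gradient of a sum into an expectation under the policy, which can be estimated from samples. For any function $ f(a) $ that does not depend on $ \theta $:
$$
\nabla_\theta \sum_a \pi_\theta(a \mid s) f(a)
=
\sum_a \pi_\theta(a \mid s) \nabla_\theta \log \pi_\theta(a \mid s) f(a)
=
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
\left[
\nabla_\theta \log \pi_\theta(a \mid s) f(a)
\right]
$$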
A simplified derivation begins with:
$$
J(\theta)
=
\sum_{s_0} p(s_0) V^{\pi_\theta}(s_0)
$$
Differentiate with respect to $ \theta $:
$$
\nabla_\theta J(\theta)
=
\sum_{s_0} p(s_0) \nabla_\theta V^{\pi_\theta}(s_0)
$$
Expand $ V^{\pi_\theta}(s_0) $ using $ Q^{\pi_\theta} $:
$$
\nabla_\theta J(\theta)
=
\sum_{s_0} p(s_0)
\nabla_\theta
\sum_a
\pi_\theta(a \mid s_0)
Q^{\pi_\theta}(s_0,a)
$$
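Applying the product rule gives one term that differentiates the policy and one that differentiates $ Q^{\pi_\theta} $. Using the log-derivative trick on the first term:
$$
\nabla_\theta J(\theta)
=
\sum_{s_0} p(s_0)
\sum_a
\pi_\theta(a \mid s_0)
\left[
\nabla_\theta \log \pi_\theta(a \mid s_0)
Q^{\pi_\theta}(s_0,a)
+
\nabla_\theta Q^{\pi_\theta}(s_0,a)
\right]
$$
The first term is the contribution of the initial state. The second term depends on $ \theta $ through the states and actions that follow, and it expands recursively in the same way at each successor state.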
![Policy-Gradient Theorem Part 2: full derivation via the log-derivative trick, step by step. Source: CS 4782 Lecture Slides [6].](images/slide-22.png)
The full theorem extends this idea from the initial state to all states visited under the policy.
### 5.4 Final Form
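Let $ d^{\pi_\theta}(s) $ denote the state visitation distribution induced by following policy $ \pi_\theta $. In discounted settings, this is usually the discounted visitation distribution, which weights a visit to $ s $ at timestep $ t $ by $ \gamma^t $. When $ d^{\pi_\theta} $ is normalized to sum to one, the expression below holds up to a constant factor that can be absorbed into the learning rate.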
The policy gradient theorem can be written as:
$$
\boxed{
\nabla_\theta J(\theta)
=
\mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}
\left[
\nabla_\theta \log \pi_\theta(a \mid s)
Q^{\pi_\theta}(s,a)
\right]
}
$$
![The Policy Gradient Theorem: for any differentiable policy, the gradient of the objective equals the expectation of the log-policy gradient weighted by the cumulative return. Source: HuggingFace Deep RL Course [7].](images/hf_policy_gradient_theorem.png)
**Key insight:** We never need to differentiate through the reward function or environment dynamics. We only differentiate the log-probability of the action under the policy.
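The theorem can be sanity-checked numerically in the simplest possible setting: a single-state, one-step problem (a bandit), where $ Q $ does not depend on $ \theta $ and $ J(\theta) = \sum_a \pi_\theta(a) Q(a) $. The sketch below assumes a softmax policy parameterized directly by its logits and compares the score-function expression with a finite-difference gradient of $ J $. The $ Q $ values are invented.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def objective(logits, q_values):
    """J(theta) = sum_a pi_theta(a) Q(a) for a one-step problem."""
    return sum(p * q for p, q in zip(softmax(logits), q_values))

def policy_gradient(logits, q_values):
    """E_a[grad log pi(a) * Q(a)], computed exactly by enumerating actions."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, q_a) in enumerate(zip(probs, q_values)):
        for i, p_i in enumerate(probs):
            score_i = (1.0 if i == a else 0.0) - p_i  # d log pi(a) / d logit_i
            grad[i] += p_a * score_i * q_a
    return grad

def finite_difference_gradient(logits, q_values, h=1e-6):
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += h
        down[i] -= h
        grad.append((objective(up, q_values) - objective(down, q_values)) / (2.0 * h))
    return grad

logits = [0.1, -0.3, 0.7]   # assumed policy parameters
q_values = [1.0, 5.0, 2.0]  # assumed action values
pg = policy_gradient(logits, q_values)
fd = finite_difference_gradient(logits, q_values)
max_error = max(abs(a - b) for a, b in zip(pg, fd))
```

The score-function expression never differentiates `q_values`, which mirrors the key insight above.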
---
## 6 REINFORCE Algorithm
REINFORCE is the classic Monte Carlo policy gradient algorithm [1].
The key idea is to replace the unknown action-value function $ Q^{\pi_\theta}(s_t,a_t) $ with the empirical return from that timestep onward.
Define the return from timestep $ t $ as:
$$
G_t
=
\sum_{k=t}^{T} \gamma^{k-t} r_k
$$
This is a Monte Carlo estimate of the return obtained after taking action $ a_t $ in state $ s_t $ and then following the policy.
Thus, REINFORCE uses the gradient estimate:
$$
\hat{g}_t
=
\nabla_\theta \log \pi_\theta(a_t \mid s_t) G_t
$$
![REINFORCE training loop: collect an episode with π, compute the return, then increase the probability of sampled actions in proportion to their returns. Source: HuggingFace Deep RL Course [7].](images/hf_pg_bigpicture.jpg)
### Algorithm
```text
Initialize policy parameters θ randomly
for each episode do
    Sample trajectory τ = (s₀,a₀,r₀,...,s_T,a_T,r_T) by following πθ
    for each timestep t = 0,...,T do
        Compute return G_t = Σ_{k=t}^{T} γ^{k-t} r_k
        Compute gradient estimate ĝ_t = ∇θ log πθ(a_t | s_t) · G_t
    end for
    Update θ ← θ + α Σ_t ĝ_t
end for
```
Some implementations average the gradient over timesteps:
$$
\theta
\leftarrow
\theta
+
\alpha \frac{1}{T+1}
\sum_{t=0}^{T}
\hat{g}_t
$$
Averaging is an implementation choice that changes the scale of the update and can help stabilize training across episodes of different lengths.
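The loop above can be sketched end to end on a toy problem. The environment here is a hypothetical three-state corridor invented for this example: action 0 moves forward for a reward of 1, and action 1 stops the episode with a reward of 0. The policy is a tabular softmax with one row of logits per state, so the gradient of the log-probability is available in closed form. This is an illustrative sketch under those assumptions, not a reference implementation.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def run_episode(theta, horizon, rng):
    """Roll out the toy corridor: action 0 moves forward (+1), action 1 stops (0)."""
    states, actions, rewards = [], [], []
    for s in range(horizon):
        a = rng.choices([0, 1], weights=softmax(theta[s]), k=1)[0]
        states.append(s)
        actions.append(a)
        rewards.append(1.0 if a == 0 else 0.0)
        if a == 1:
            break
    return states, actions, rewards

def returns_to_go(rewards, gamma):
    """G_t = sum_{k >= t} gamma^(k - t) r_k, computed with one backward pass."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def reinforce(num_episodes=3000, horizon=3, alpha=0.05, gamma=0.99, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(horizon)]  # tabular logits, one row per state
    for _ in range(num_episodes):
        states, actions, rewards = run_episode(theta, horizon, rng)
        returns = returns_to_go(rewards, gamma)
        # Accumulate sum_t grad log pi(a_t | s_t) * G_t under the pre-update policy.
        grad = [[0.0, 0.0] for _ in range(horizon)]
        for s, a, g in zip(states, actions, returns):
            probs = softmax(theta[s])
            for i in range(2):
                grad[s][i] += ((1.0 if i == a else 0.0) - probs[i]) * g
        for s in range(horizon):
            for i in range(2):
                theta[s][i] += alpha * grad[s][i]  # gradient ascent step
    return theta

theta = reinforce()
forward_probs = [softmax(row)[0] for row in theta]  # probability of moving forward per state
```

With a neural network policy, the closed-form score would be replaced by automatic differentiation of the summed log-probabilities weighted by the returns.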
### Advantages and Limitations
| Property | Explanation |
|---|---|
| **Unbiased** | $ G_t $ is an unbiased Monte Carlo estimate of $ Q^{\pi_\theta}(s_t,a_t) $ under the current policy. |
| **General** | Works for both discrete and continuous action spaces. |
| **High variance** | A single sampled trajectory may produce a very noisy estimate of the expected return. |
| **Slow convergence** | High variance can make learning unstable or sample-inefficient. |
---
## 7 On-Policy vs. Off-Policy
| | On-Policy | Off-Policy |
|---|---|---|
| **Definition** | Learn from data collected by the current or very recent policy being improved. | Learn from data collected by a different behavior policy. |
| **Data reuse** | Old data cannot be reused freely because the policy has changed. | Past samples can often be reused, such as through experience replay. |
| **Examples** | REINFORCE, A2C/A3C, PPO | Q-Learning, DQN |
| **Stability** | Often conceptually simpler, and often more stable. | Can be unstable and may require corrections such as importance sampling. |
| **Sample efficiency** | Usually less sample-efficient. | Usually more sample-efficient. |
**Q-Learning and DQN are off-policy** because they can learn about the greedy target policy while collecting data from another behavior policy, such as an $ \epsilon $-greedy policy.
**REINFORCE is on-policy** because it estimates gradients using trajectories sampled from the current policy.
---
## 8 Issues with Policy Gradient Methods
### 8.1 Exploration
Policy gradient methods can converge prematurely to suboptimal behavior if the policy becomes too deterministic too early.
Common exploration strategies include:
1. **Stochastic policies** — sampling from $ \pi_\theta(a \mid s) $ naturally introduces exploration.
2. **Entropy regularization** — add an entropy bonus to the objective to discourage premature determinism:
$$
J_{\text{entropy}}(\theta)
=
J(\theta)
+
\beta
\mathbb{E}_{s \sim d^{\pi_\theta}}
\left[
\mathcal{H}(\pi_\theta(\cdot \mid s))
\right]
$$
3. **Action noise** — especially in continuous control, deliberately injecting noise into actions can help exploration.
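The entropy bonus in item 2 can be sketched for a categorical policy as follows. The function names and the choice to average entropy over a batch of visited states are assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy H(pi) = -sum_a pi(a) log pi(a), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_regularized_objective(expected_return, probs_per_state, beta):
    """J + beta * (average policy entropy over a batch of visited states)."""
    mean_entropy = sum(entropy(p) for p in probs_per_state) / len(probs_per_state)
    return expected_return + beta * mean_entropy

uniform = entropy([0.25, 0.25, 0.25, 0.25])  # maximal for four actions: log 4
peaked = entropy([0.97, 0.01, 0.01, 0.01])   # close to deterministic, so close to 0
bonus_objective = entropy_regularized_objective(1.0, [[0.5, 0.5], [0.9, 0.1]], beta=0.01)
```

A nearly deterministic policy earns almost no bonus, so the extra term pushes the optimizer toward policies that keep sampling several actions.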
### 8.2 High Variance
From the same starting state, different sampled trajectories can produce very different returns. This makes the policy gradient estimate noisy.
![REINFORCE variance: from the same Pong starting state, different trajectories yield returns of +100, −1000, +10, and −40. This variance makes learning unstable. Source: HuggingFace Deep RL Course [7].](images/hf_variance.jpg)
The gradient estimate
$$
\nabla_\theta \log \pi_\theta(a_t \mid s_t)G_t
$$
depends heavily on the sampled return $ G_t $. If $ G_t $ varies a lot across trajectories, the updates will also vary a lot.
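As a quick piece of arithmetic, take the four returns quoted in the figure caption above as if they were four sampled values of $ G_t $ from the same state. The sketch below computes their mean and spread using the standard library.

```python
import statistics

# The four example returns from the figure caption, treated as samples of G_t.
returns = [100.0, -1000.0, 10.0, -40.0]

mean_return = statistics.mean(returns)   # (100 - 1000 + 10 - 40) / 4 = -232.5
std_return = statistics.pstdev(returns)  # population standard deviation of the samples
spread = max(returns) - min(returns)     # 100 - (-1000) = 1100

# Each single-sample gradient estimate multiplies the same score vector by one of
# these returns, so its scale and even its sign change from trajectory to trajectory.
signs = [1 if g > 0 else -1 for g in returns]
```

The standard deviation is larger in magnitude than the mean, which is the situation in which single-trajectory updates are dominated by noise.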
### 8.3 Reward Hacking
Agents can maximize the reward in unintended ways. This is called **reward hacking**.
For example, if a reward function gives points for moving forward, an agent might learn to move forward unsafely or exploit a simulator bug rather than solving the intended task.
Designing reward functions that are robust to exploitation is a central challenge in reinforcement learning.
---
## 9 Variance Reduction with a Baseline
Instead of asking:
> Was this return good?
we ask:
> Was this return better than expected from this state?
To do this, subtract a **baseline** $ b(s_t) $ from the return:
$$
\hat{g}_t
=
\nabla_\theta \log \pi_\theta(a_t \mid s_t)
\left(
G_t - b(s_t)
\right)
$$
A well-chosen baseline reduces variance, and no baseline introduces bias as long as it does not depend on the action $ a_t $.
Why?
$$
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
\left[
\nabla_\theta \log \pi_\theta(a \mid s)b(s)
\right]
=
b(s)
\sum_a
\pi_\theta(a \mid s)
\nabla_\theta \log \pi_\theta(a \mid s)
$$
Using the log-derivative trick:
$$
=
b(s)
\sum_a
\nabla_\theta \pi_\theta(a \mid s)
$$
Since probabilities sum to one,
$$
=
b(s)
\nabla_\theta
\sum_a
\pi_\theta(a \mid s)
=
b(s)
\nabla_\theta 1
=
0
$$
Therefore, subtracting an action-independent baseline does not change the expected gradient.
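Both claims, unchanged expectation and reduced variance, can be checked exactly on a small example by enumerating actions instead of sampling. The sketch assumes a single state, a softmax policy parameterized by its logits, and invented $ Q $ values with a large common offset.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def estimator_stats(logits, q_values, baseline):
    """Exact mean and total variance of grad log pi(a) * (Q(a) - baseline) over a ~ pi."""
    probs = softmax(logits)
    n = len(logits)
    samples = []
    for a in range(n):
        score = [(1.0 if i == a else 0.0) - probs[i] for i in range(n)]
        samples.append([s * (q_values[a] - baseline) for s in score])
    mean = [sum(probs[a] * samples[a][i] for a in range(n)) for i in range(n)]
    variance = sum(
        probs[a] * sum((samples[a][i] - mean[i]) ** 2 for i in range(n))
        for a in range(n)
    )
    return mean, variance

logits = [0.0, 0.5, -0.5]      # assumed policy parameters
q_values = [10.0, 11.0, 12.0]  # assumed action values
value = sum(p * q for p, q in zip(softmax(logits), q_values))  # V(s) used as the baseline

mean_plain, var_plain = estimator_stats(logits, q_values, baseline=0.0)
mean_base, var_base = estimator_stats(logits, q_values, baseline=value)
mean_gap = max(abs(a - b) for a, b in zip(mean_plain, mean_base))
```

Here `mean_gap` is zero up to floating-point error, while `var_base` is far smaller than `var_plain` because the common offset in the $ Q $ values no longer scales every sample.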
The natural baseline is the value function:
$$
b(s_t) = V^{\pi_\theta}(s_t)
$$
With this baseline, $ G_t - V^{\pi_\theta}(s_t) $ is a Monte Carlo estimate of the **advantage function**:
$$
A^{\pi}(s,a)
=
Q^{\pi}(s,a)
-
V^{\pi}(s)
$$
![The Advantage function A(s,a) = Q(s,a) − V(s): Q(s,a) is the value of taking action a, while V(s) is the average value of the state. The advantage measures how much better action a is relative to the average. Source: HuggingFace Deep RL Course [7].](images/hf_advantage1.jpg)
The advantage tells us whether action $ a $ was better or worse than the average action from state $ s $.
Using advantage, the policy gradient becomes:
$$
\nabla_\theta J(\theta)
=
\mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}
\left[
\nabla_\theta \log \pi_\theta(a \mid s)
A^{\pi}(s,a)
\right]
$$
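A tiny worked example of the advantage for one state, with invented probabilities and action values, shows how the sign of $ A^{\pi}(s,a) $ separates better-than-average actions from worse ones.

```python
def advantages(policy_probs, q_values):
    """Return (A(s, a) for every action, V(s)) for a single state."""
    value = sum(p * q for p, q in zip(policy_probs, q_values))
    return [q - value for q in q_values], value

probs = [0.5, 0.25, 0.25]  # assumed pi(a | s)
q = [1.0, 2.0, 4.0]        # assumed Q(s, a)
adv, v = advantages(probs, q)
# v = 0.5 * 1 + 0.25 * 2 + 0.25 * 4 = 2.0, so adv = [-1.0, 0.0, 2.0]

# The advantage always averages to zero under the policy itself.
expected_adv = sum(p * a for p, a in zip(probs, adv))
```

In the gradient, the first action is pushed down, the second is left unchanged by its own term, and the third is pushed up.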
---
## 10 Actor-Critic Algorithms
Actor-Critic methods combine policy-based and value-based ideas [2].
![Actor-Critic intuition: the Actor takes actions and the Critic evaluates them, providing a feedback signal that improves the policy without waiting for the full episode return. Source: HuggingFace Deep RL Course [7].](images/hf_ac.jpg)
### Components
- **Actor** $ \pi_\theta(a \mid s) $: the policy network that selects actions.
- **Critic** $ V_\phi(s) $, $ Q_\phi(s,a) $, or $ A_\phi(s,a) $: a value network that evaluates the actor's behavior.
The actor is updated using a policy gradient, while the critic is trained to estimate values.
Instead of using the full Monte Carlo return $ G_t $, actor-critic methods often use an advantage estimate:
$$
\hat{g}_t
=
\nabla_\theta \log \pi_\theta(a_t \mid s_t)
\hat{A}_t
$$
A common one-step advantage estimate is the TD error:
$$
\delta_t
=
r_t
+
\gamma V_\phi(s_{t+1})
-
V_\phi(s_t)
$$
Then the actor update uses:
$$
\hat{g}_t
=
\nabla_\theta \log \pi_\theta(a_t \mid s_t)
\delta_t
$$
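The critic is trained to make its value predictions more accurate, for example by minimizing a squared TD error in which the bootstrapped target $ r_t + \gamma V_\phi(s_{t+1}) $ is typically treated as a constant when differentiating: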
$$
\mathcal{L}_{critic}
=
\left(
r_t
+
\gamma V_\phi(s_{t+1})
-
V_\phi(s_t)
\right)^2
$$
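The actor and critic updates for a single transition can be sketched with tabular parameters. The code assumes a softmax actor with one row of logits per state and a critic stored as a list of state values. The `done` flag, which switches off bootstrapping at terminal states, is a common convention added here rather than something stated above. The factor of 2 from differentiating the squared error is absorbed into the critic step size.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(theta, values, s, a, r, s_next, done,
                      gamma, alpha_actor, alpha_critic):
    """Apply one one-step actor-critic update in place and return the TD error."""
    bootstrap = 0.0 if done else values[s_next]  # assumed: no bootstrapping past a terminal state
    delta = r + gamma * bootstrap - values[s]
    # Actor: theta += alpha * grad log pi(a | s) * delta, using the current policy.
    probs = softmax(theta[s])
    for i in range(len(probs)):
        theta[s][i] += alpha_actor * ((1.0 if i == a else 0.0) - probs[i]) * delta
    # Critic: semi-gradient step on the squared TD error with the target held constant.
    values[s] += alpha_critic * delta
    return delta

theta = [[0.0, 0.0], [0.0, 0.0]]  # actor logits for two states, two actions
values = [0.0, 1.0]               # critic estimates V(s) (assumed initial values)
delta = actor_critic_step(theta, values, s=0, a=1, r=0.5, s_next=1, done=False,
                          gamma=0.9, alpha_actor=0.1, alpha_critic=0.5)
# delta = 0.5 + 0.9 * 1.0 - 0.0 = 1.4, so action 1 becomes more likely in state 0.
```

Unlike REINFORCE, this update needs only one transition, so it can be applied after every environment step.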
### Why Actor-Critic Helps
REINFORCE waits until the end of an episode to compute Monte Carlo returns. The resulting gradient estimate is unbiased but has high variance.
Actor-critic methods use a learned critic to estimate value, which usually reduces variance and allows more frequent online updates. The tradeoff is that the critic may be inaccurate, which can introduce bias.
### A2C vs. A3C
![A3C (asynchronous) vs A2C (synchronous): parallel agents collect diverse experience and send updates to shared global network parameters, stabilizing training and improving sample efficiency. Source: Mnih et al. (2016) [5] via Lil'Log [11].](images/weng_a3c_vs_a2c.png)
| Method | Meaning | Key Idea |
|---|---|---|
| **A3C** | Asynchronous Advantage Actor-Critic | Multiple workers interact with environments in parallel and asynchronously update shared global parameters. |
| **A2C** | Advantage Actor-Critic | A synchronous version where workers collect experience in parallel and updates are coordinated together. |
Both methods use the advantage idea to improve the actor update.
---
## 11 Summary
| Method | Estimate Used | Variance | Bias | Main Tradeoff |
|---|---|---|---|---|
| **REINFORCE** | Monte Carlo return $ G_t $ | High | None | Simple but noisy |
| **REINFORCE + Baseline** | $ G_t - b(s_t) $ | Lower | None, if baseline is action-independent | More stable updates |
| **Actor-Critic** | Learned value, Q-value, or advantage estimate | Usually lower | Possible critic bias | More sample-efficient, but depends on critic quality |
Key takeaways:
- **Policy gradient methods** directly optimize a parameterized policy $ \pi_\theta(a \mid s) $.
- The objective is to maximize expected discounted return:
$$
J(\theta)
=
\mathbb{E}_{\tau \sim p_\theta(\tau)}
\left[
\sum_{t=0}^{T} \gamma^t r_t
\right]
$$
- The **Policy Gradient Theorem** gives:
$$
\nabla_\theta J(\theta)
=
\mathbb{E}_{s \sim d^{\pi_\theta}, a \sim \pi_\theta}
\left[
\nabla_\theta \log \pi_\theta(a \mid s)
Q^{\pi_\theta}(s,a)
\right]
$$
- The key computational advantage is that we only need to differentiate the log-policy, not the reward function or environment dynamics.
- **REINFORCE** uses Monte Carlo returns. It is unbiased but has high variance.
- **Baselines** reduce variance without bias as long as the baseline does not depend on the action.
- The most common baseline is the value function, giving the advantage:
$$
A^\pi(s,a)
=
Q^\pi(s,a)
-
V^\pi(s)
$$
- **Actor-Critic** methods use a learned critic to estimate values or advantages, reducing variance and enabling online updates.
- The tradeoff is that actor-critic methods can introduce bias if the critic is inaccurate.
---
## References
1. Williams, R.J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. *Machine Learning*, 8, 229–256.
2. Konda, V.R., & Tsitsiklis, J.N. (2000). Actor-Critic Algorithms. *NeurIPS*, 12.
3. Sutton, R.S., & Barto, A.G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press. [PDF](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)
4. Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. *Nature*, 518, 529–533.
5. Mnih, V., et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. *ICML*. [arXiv:1602.01783](https://arxiv.org/abs/1602.01783)
6. CS 4782 Deep Learning, Cornell — Week 9 Lecture Slides (Policy Gradients). Thanks to Varsha Kishore, Justin Lovelace, Luke Kulm, Jinzhou Li.
7. HuggingFace Deep RL Course. [https://huggingface.co/learn/deep-rl-course/](https://huggingface.co/learn/deep-rl-course/)
8. Weng, L. (2018). A (Long) Peek into Reinforcement Learning. *Lil'Log*. [link](https://lilianweng.github.io/posts/2018-02-19-rl-overview/)
9. Kingma, D.P., & Welling, M. (2014). Auto-Encoding Variational Bayes. *ICLR*. [arXiv:1312.6114](https://arxiv.org/abs/1312.6114)
10. Weng, L. (2018). From Autoencoder to Beta-VAE. *Lil'Log*. [link](https://lilianweng.github.io/posts/2018-08-12-vae/)
11. Weng, L. (2018). Policy Gradient Algorithms. *Lil'Log*. [link](https://lilianweng.github.io/posts/2018-04-08-policy-gradient/)
12. Schulman, J., et al. (2016). High-dimensional continuous control using generalized advantage estimation. *ICLR*. [arXiv:1506.02438](https://arxiv.org/abs/1506.02438)