Policy Gradient

Posted Apr 27, 2023 Updated Jun 17, 2024

1 min read

What is Policy Gradient?

Policy Gradient is used to update the actor model using the Q-values optimized by the critic model. The goal of the policy $\pi_{\phi}$ gradient is to optimize the expected return by optimizing $\phi$. How does the policy gradient update weights? The function is:

$ \phi_{t+1}\ =\ \phi_{t}\ +\ \alpha \nabla_{\phi}J(\pi_{\phi})|_{\phi_{t}} $
$ \nabla_{\phi}J(\phi)\ =\ \nabla_{\phi}\sum_{s\in S}d^{\pi}(s)V^{\pi}(s)\ =\ \nabla_{\phi}\sum_{s\in S}d^{\pi}(s)\sum_{a \in A} \pi_{\phi}(a|s)Q^{\pi}(s,a) $

The return of the policy gradient affects the reward in the Bellman Equation.
The return of the policy graddient is:

$ R_{t}\ =\ \sum_{i=t}^{T} \gamma^{i-t}r(s_{i}, a_{i}) $

The structure of the Policy Gradient is::

Deep Reinforcement Learning, DRL_Theory

deep reinforcement learning python

This post is licensed under CC BY 4.0 by the author.

Policy Gradient

What is Policy Gradient?

Further Reading

Living Penalty

Q Learning

Bellman Equation