
Policy Gradient

What is Policy Gradient?

The policy gradient is used to update the actor model using the Q-values estimated by the critic model. The goal of the policy gradient is to maximize the expected return $J(\pi_{\phi})$ by optimizing the policy parameters $\phi$. How does the policy gradient update the weights? The parameter update and the gradient of the objective are:

$ \phi_{t+1}\ =\ \phi_{t}\ +\ \alpha \nabla_{\phi}J(\pi_{\phi})|_{\phi_{t}} $
$ \nabla_{\phi}J(\pi_{\phi})\ =\ \nabla_{\phi}\sum_{s\in S}d^{\pi}(s)V^{\pi}(s)\ =\ \nabla_{\phi}\sum_{s\in S}d^{\pi}(s)\sum_{a \in A} \pi_{\phi}(a|s)Q^{\pi}(s,a) $
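
As a rough illustration, here is a minimal sketch in PyTorch of one such gradient step. The network shape, optimizer, batch of states and actions, and the Q-value tensor are all placeholder assumptions; in practice the Q-values would come from the critic.

```python
import torch
import torch.nn as nn

# Illustrative softmax policy pi_phi(a|s): 4-dim states, 2 discrete actions (assumed sizes).
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)  # learning rate alpha

# Placeholder batch: states s_i, actions a_i, and Q-value estimates Q^pi(s_i, a_i).
states = torch.randn(32, 4)
actions = torch.randint(0, 2, (32,))
q_values = torch.randn(32)  # stand-in for the critic's output

# Policy-gradient objective: ascend E[ log pi_phi(a|s) * Q^pi(s, a) ].
log_probs = torch.log_softmax(policy(states), dim=-1)
log_prob_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
loss = -(log_prob_a * q_values.detach()).mean()  # minimize the negative objective

optimizer.zero_grad()
loss.backward()
optimizer.step()  # phi_{t+1} = phi_t + alpha * grad_phi J(pi_phi)
```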



The return optimized by the policy gradient is the discounted sum of the rewards that also appear in the Bellman equation. The return from time step $t$ is:

$ R_{t}\ =\ \sum_{i=t}^{T} \gamma^{i-t}r(s_{i}, a_{i}) $
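
For example, a small helper (assuming a list of per-step rewards and a discount factor $\gamma$) that computes this discounted return for every time step could look like this:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_{i=t}^{T} gamma^(i-t) * r_i for every t (illustrative helper)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each R_t reuses R_{t+1}: R_t = r_t + gamma * R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: rewards from a short episode.
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```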




The structure of the policy gradient is outlined below:
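
A minimal sketch of that structure, assuming a simple actor-critic setup in which an actor network parameterizes $\pi_{\phi}(a|s)$ and a critic network estimates $Q^{\pi}(s,a)$ (the class names and layer sizes are hypothetical):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor: parameterizes the policy pi_phi(a|s) (illustrative sizes)."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)  # action probabilities

class Critic(nn.Module):
    """Critic: estimates Q(s, a), which weights the actor's gradient."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, state):
        return self.net(state)  # one Q-value per action

# The actor is updated with the policy gradient; the critic supplies Q^pi(s, a).
actor, critic = Actor(), Critic()
state = torch.randn(1, 4)
probs, q_values = actor(state), critic(state)
```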


