What is Policy Gradient?
Policy gradient updates the actor model using the Q-values estimated by the critic model. The goal is to maximize the expected return $J(\phi)$ by adjusting the parameters $\phi$ of the policy $\pi_{\phi}$. How does the policy gradient update the weights? It follows the gradient of the objective:
$ \nabla_{\phi}J(\phi)\ =\ \nabla_{\phi}\sum_{s\in S}d^{\pi}(s)V^{\pi}(s)\ =\ \nabla_{\phi}\sum_{s\in S}d^{\pi}(s)\sum_{a \in A} \pi_{\phi}(a|s)Q^{\pi}(s,a) $
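To make the formula concrete, here is a minimal NumPy sketch under several assumptions that are not in the text above: a tabular softmax policy with one logit `phi[s, a]` per state-action pair, a fixed state distribution `d`, and random numbers in `Q` standing in for the critic's estimates. It also relies on the policy gradient theorem, which lets $d^{\pi}$ and $Q^{\pi}$ be treated as constants when differentiating. All function and variable names are illustrative.

```python
import numpy as np

def softmax_policy(phi):
    """Row-wise softmax: phi[s, a] are logits, result is pi_phi(a|s)."""
    z = phi - phi.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def surrogate_objective(phi, d, Q):
    """sum_s d(s) sum_a pi_phi(a|s) Q(s, a), with d and Q held fixed."""
    pi = softmax_policy(phi)
    return float(np.sum(d[:, None] * pi * Q))

def policy_gradient(phi, d, Q):
    """Gradient of the surrogate objective with respect to the logits phi.

    For a softmax policy this reduces to
    d(s) * pi(a|s) * (Q(s, a) - V(s)), where V(s) = sum_a pi(a|s) Q(s, a).
    """
    pi = softmax_policy(phi)
    V = np.sum(pi * Q, axis=1, keepdims=True)
    return d[:, None] * pi * (Q - V)

# Toy problem (assumed values): 3 states, 2 actions.
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
phi = rng.normal(size=(n_states, n_actions))  # actor parameters
Q = rng.normal(size=(n_states, n_actions))    # stand-in for critic Q-values
d = np.array([0.5, 0.3, 0.2])                 # assumed state distribution

grad = policy_gradient(phi, d, Q)
phi_new = phi + 0.1 * grad  # one gradient-ascent step on the actor
```

In an actor-critic setup the actor would usually be a neural network and the gradient would be estimated from sampled transitions, but the update has the same shape: actions whose Q-value exceeds the state value become more likely.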
The policy gradient update changes which actions the actor selects, and therefore the rewards that enter the Bellman equation.
The return of the policy gradient is:
The structure of the Policy Gradient is: