4.5 Natural Gradients
When using the symmetric KL divergence to measure the distance between two close distributions, the corresponding Riemannian metric is the Fisher Information Matrix (FIM), defined as the Hessian of the divergence around θ, i.e. the matrix of second-order derivatives w.r.t. the elements of θ:

\(F(\theta)=\nabla^2 D_{JS}(p(x;\theta)\mid\mid p(x;\theta + \Delta\theta))\mid_{\Delta\theta=0}\)

where \(D_{JS}\) denotes the symmetric KL divergence. Equivalently, the FIM can be computed as the expected outer product of the score function:

\(F(\theta)=\mathbb{E}_{x\sim p(x;\theta)}[\nabla \log p(x;\theta)\,(\nabla \log p(x;\theta))^T]\)
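As a quick sanity check (not from the referenced notes), the expectation form can be estimated by Monte Carlo sampling of the score function. The sketch below assumes a hypothetical 1D Gaussian \(p(x;\mu)\) with fixed variance \(\sigma^2\), whose score w.r.t. μ is \((x-\mu)/\sigma^2\) and whose Fisher information is \(1/\sigma^2\):

```python
import numpy as np

# Hypothetical example: 1D Gaussian p(x; mu) with fixed sigma.
# Score w.r.t. mu:  d/dmu log p(x; mu) = (x - mu) / sigma^2
# Analytic Fisher information:  F(mu) = 1 / sigma^2
mu, sigma = 0.5, 2.0
rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=100_000)

score = (x - mu) / sigma**2        # per-sample gradient of the log-likelihood
fim_mc = np.mean(score**2)         # Monte Carlo estimate of E[score * score^T]

print(f"Monte Carlo FIM: {fim_mc:.4f}")   # close to 0.25
print(f"Analytic FIM:    {1 / sigma**2:.4f}")
```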
Why is it useful? The FIM allows us to locally approximate (for small Δθ) the symmetric KL divergence between two close distributions, using a second-order Taylor series expansion:

\(D_{JS}(p(x;\theta) \mid\mid p(x;\theta+\Delta\theta))\approx\Delta\theta^T F(\theta)\,\Delta\theta\)

The divergence is therefore locally quadratic in Δθ, which means that the update rule obtained when minimizing it with gradient descent is linear in Δθ.
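To illustrate the quadratic approximation concretely, the sketch below (again a hypothetical Gaussian example, not taken from the referenced notes) parameterizes a 1D Gaussian by θ = (μ, σ), for which the FIM is diag(1/σ², 2/σ²), and compares the closed-form symmetric KL with \(\Delta\theta^T F(\theta)\Delta\theta\) for shrinking perturbations:

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

# Hypothetical example: theta = (mu, sigma), FIM = diag(1/sigma^2, 2/sigma^2).
mu, sigma = 0.5, 2.0
fim = np.diag([1 / sigma**2, 2 / sigma**2])

for eps in (0.1, 0.01):
    dtheta = np.array([eps, eps])              # small perturbation of (mu, sigma)
    d_sym = (kl_gauss(mu, sigma, mu + eps, sigma + eps)
             + kl_gauss(mu + eps, sigma + eps, mu, sigma))
    quad = dtheta @ fim @ dtheta               # local quadratic approximation
    print(f"eps={eps:5.2f}  sym-KL={d_sym:.6f}  quadratic form={quad:.6f}")
```

The two quantities agree more and more closely as Δθ shrinks, matching the second-order expansion above.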
Reference
https://julien-vitay.net/deeprl/NaturalGradient.html#sec:natural-policy-gradient-and-natural-actor-critic-nac