The objective of perceptual inference is to find the distribution of useful mental states based on sensory observations of the world. A wet feeling on the face when going outside, for instance, should probably give a high probability to the mental state rain, but also some probability to the mental state active lawn sprinkler.
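To make the rain example concrete, here is a minimal sketch that computes an exact posterior over mental states with Bayes' rule. The state names, the extra state `neither`, and all the probabilities are hypothetical values chosen only for illustration; none of them come from a real model.

```python
# Hypothetical numbers, assumed purely for illustration.
prior = {"rain": 0.20, "sprinkler": 0.05, "neither": 0.75}            # p(s)
likelihood_wet = {"rain": 0.90, "sprinkler": 0.80, "neither": 0.01}   # p(o = wet | s)

# Joint p(s, o = wet) and evidence p(o = wet).
joint = {s: likelihood_wet[s] * prior[s] for s in prior}
evidence = sum(joint.values())

# Exact posterior p(s | o = wet) by Bayes' rule.
posterior = {s: joint[s] / evidence for s in prior}

for state, prob in posterior.items():
    print(f"p({state} | wet) = {prob:.3f}")
```

With these assumed numbers, most of the posterior mass lands on rain, a smaller share on the sprinkler, and very little on neither.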
The loss function used for perceptual inference is variational free energy (VFE). It can be expressed in many ways, each with a different but complementary interpretation.
Below are a few different ways to express VFE, \(\mathcal F[q; o]\) 1:
$$\mathcal F[q; o] =$$
$$\mathbb E_{q(s)}[\log q(s) - \log p(s, o)] = \ \ \ \ (1a) $$
$$\mathbb E_{q(s)}[\log q(s) - \log p(s \mid o) - \log p(o)] = \ \ \ \ (2a)$$
$$\mathbb E_{q(s)}[\log q(s) - \log p(o \mid s) - \log p(s)] \ \ \ \ (3a)$$
Two of these formulations can be expressed as a KL-divergence with a “residual term”:
$$\mathcal F[q; o] =$$
$$D_{KL}\left[q(s) \mid \mid p(s \mid o) \right] - \log p(o) = \ \ \ \ (2b)$$
$$D_{KL}\left[q(s) \mid \mid p(s)\right] - \mathbb{E}_{q(s)}[\log p(o \mid s)] \ \ \ \ (3b)$$
And a bonus variant:
$$-\mathbb H[q(s)]-\mathbb{E}_{q(s)}[\log p(o, s)] \ \ \ \ (4)$$
The quantity \(\mathbb H\) found in the last expression is called entropy.
$$ \mathbb H[q(s)]= \sum_s q(s) \log \frac{1}{q(s)} = \mathbb E_{q(s)} \log \frac{1}{q(s)} = - \mathbb E_{q(s)} \log q(s)$$
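As a small sketch of the entropy definition, the function below evaluates \(\mathbb H[q(s)]\) in nats for a discrete distribution given as a list. The function name `entropy` and the two example distributions are assumptions made for illustration, and \(0 \log 0\) is taken to be \(0\) by convention.

```python
import math


def entropy(q):
    """Shannon entropy H[q] = -sum_s q(s) log q(s), in nats (0 log 0 := 0)."""
    return -sum(p * math.log(p) for p in q if p > 0.0)


peaked = [0.98, 0.01, 0.01]   # nearly certain about one state
flat = [1 / 3, 1 / 3, 1 / 3]  # maximally uncertain over three states

print("H[peaked] =", entropy(peaked))
print("H[flat]   =", entropy(flat))
```

The flat distribution attains the maximum possible entropy for three states, \(\log 3\), while the peaked one is close to zero.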
In the derivations we have also used
$$p(o, s) = p(o \mid s)p(s) = p(s \mid o)p(o)$$
and
$$\sum_s \log p(o) q(s) = \log p(o) \sum_s q(s) = \log p(o)$$
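The equivalence of the formulations can also be checked numerically. The sketch below evaluates \((1a)\), \((2b)\), \((3b)\) and \((4)\) for one arbitrary \(q(s)\) and one fixed observation. The three-state model, all its probabilities, and the helper names `expect` and `kl` are hypothetical choices made only for this check.

```python
import math

# Hypothetical generative model for one fixed observation o (assumed numbers).
prior = [0.20, 0.05, 0.75]   # p(s)
lik = [0.90, 0.80, 0.01]     # p(o | s)
q = [0.60, 0.30, 0.10]       # an arbitrary approximate posterior q(s)

joint = [l * p for l, p in zip(lik, prior)]   # p(s, o) = p(o | s) p(s)
evidence = sum(joint)                          # p(o)
post = [j / evidence for j in joint]           # p(s | o) = p(s, o) / p(o)


def expect(weights, values):
    """Expectation of `values` under the distribution `weights`."""
    return sum(w * v for w, v in zip(weights, values))


def kl(a, b):
    """KL divergence D_KL[a || b] in nats."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)


log_q = [math.log(x) for x in q]
log_joint = [math.log(x) for x in joint]
log_lik = [math.log(x) for x in lik]
entropy_q = -expect(q, log_q)

f_1a = expect(q, [a - b for a, b in zip(log_q, log_joint)])   # (1a)
f_2b = kl(q, post) - math.log(evidence)                        # (2b)
f_3b = kl(q, prior) - expect(q, log_lik)                       # (3b)
f_4 = -entropy_q - expect(q, log_joint)                        # (4)

print(f_1a, f_2b, f_3b, f_4)
```

All four values should agree up to floating-point rounding, since they are algebraic rearrangements of the same quantity.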
Expression \((2b)\) can be interpreted as the sum of divergence and surprise, where surprise is \(-\log p(o)\).
Divergence measures the accuracy of the approximate posterior, i.e., how close it is to the true posterior distribution of states, \(p(s \mid o)\).
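Because the divergence is never negative, the surprise is a lower bound on free energy that is reached only when \(q(s)\) equals the true posterior. The sketch below illustrates this for a few candidate distributions; the model numbers and the candidate names are hypothetical.

```python
import math

# Hypothetical model for one fixed observation o (assumed numbers).
prior = [0.20, 0.05, 0.75]   # p(s)
lik = [0.90, 0.80, 0.01]     # p(o | s)
joint = [l * p for l, p in zip(lik, prior)]
evidence = sum(joint)
post = [j / evidence for j in joint]   # true posterior p(s | o)
surprise = -math.log(evidence)         # -log p(o)


def kl(a, b):
    """KL divergence D_KL[a || b] in nats."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)


def free_energy(q):
    """F[q; o] = E_q[log q(s) - log p(s, o)], as in (1a)."""
    return sum(qi * (math.log(qi) - math.log(ji))
               for qi, ji in zip(q, joint) if qi > 0.0)


candidates = {"prior": prior, "uniform": [1 / 3] * 3, "posterior": post}
for name, q in candidates.items():
    print(f"{name:9s} divergence={kl(q, post):.4f} "
          f"surprise={surprise:.4f} F={free_energy(q):.4f}")
```

In this sketch the divergence shrinks to zero for the posterior candidate, where free energy equals the surprise.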
Expression \((3b)\) can be interpreted as complexity minus accuracy.
Complexity measures how far the approximate posterior is from the prior beliefs, the expected mental states \(p(s)\).
Accuracy, \(\mathbb{E}_{q(s)}[\log p(o \mid s)]\), is high when the mental states that \(q(s)\) considers probable give a high likelihood to what we actually observe.
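The tradeoff between the two terms of \((3b)\) can be sketched with three candidate beliefs: sticking to the prior, committing fully to one state, and the true posterior. The model numbers, the candidate names, and the helper names `complexity` and `accuracy` are hypothetical.

```python
import math

# Hypothetical model for one fixed observation o (assumed numbers).
prior = [0.20, 0.05, 0.75]   # p(s)
lik = [0.90, 0.80, 0.01]     # p(o | s)
evidence = sum(l * p for l, p in zip(lik, prior))
post = [l * p / evidence for l, p in zip(lik, prior)]


def complexity(q):
    """D_KL[q(s) || p(s)]: how far q has moved away from the prior."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior) if qi > 0.0)


def accuracy(q):
    """E_q[log p(o | s)]: expected log-likelihood of the observation."""
    return sum(qi * math.log(li) for qi, li in zip(q, lik) if qi > 0.0)


candidates = {
    "stay at prior": prior,
    "all mass on state 0": [1.0, 0.0, 0.0],
    "true posterior": post,
}
for name, q in candidates.items():
    c, a = complexity(q), accuracy(q)
    print(f"{name:20s} complexity={c:.4f} accuracy={a:.4f} F={c - a:.4f}")
```

Staying at the prior costs no complexity but explains the observation poorly, while committing fully to the best-fitting state is accurate but complex. With these assumed numbers the posterior gives the lowest free energy of the three.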
The complexity-accuracy formulation is particularly interesting because it suggests a mathematical explanation of cognitive dissonance, a psychological phenomenon with firm empirical support [5]. Changing one’s beliefs carries a cognitive cost and discomfort, especially if these beliefs are considered part of one’s identity. Try, for instance, to talk somebody out of their religion or political sympathies by pointing at observations (or the lack thereof).
Expression \((4)\) is usually interpreted as the difference between energy, \(-\mathbb{E}_{q(s)}[\log p(o, s)]\), and entropy, \(\mathbb H[q(s)]\).
If the energy is low, i.e., if the inferred mental state distribution is probable under the generative model, we can afford a high-precision (peaked) \(q(s)\), meaning a low entropy. If the energy is high, meaning that the generative model doesn’t give a high probability to the inferred mental state distribution, this has to be compensated with a high-entropy (flat) \(q(s)\) to keep free energy low. High entropy means that we are uncertain and therefore must spread the probability mass over many mental states.
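A small sketch can illustrate the energy-entropy reading. It compares a \(q(s)\) peaked on a state the model finds probable, a \(q(s)\) peaked on a state the model finds improbable, and a flat \(q(s)\). The model numbers and the three candidate distributions are hypothetical.

```python
import math

# Hypothetical model for one fixed observation o (assumed numbers).
prior = [0.20, 0.05, 0.75]   # p(s)
lik = [0.90, 0.80, 0.01]     # p(o | s)
log_joint = [math.log(l * p) for l, p in zip(lik, prior)]   # log p(o, s)


def energy(q):
    """-E_q[log p(o, s)]: low when q favors states the model finds probable."""
    return -sum(qi * lj for qi, lj in zip(q, log_joint))


def entropy(q):
    """H[q] = -E_q[log q(s)] in nats (0 log 0 := 0)."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0.0)


def free_energy(q):
    """Expression (4): energy minus entropy."""
    return energy(q) - entropy(q)


candidates = {
    "peaked, probable state": [0.98, 0.01, 0.01],
    "peaked, improbable state": [0.01, 0.01, 0.98],
    "flat": [1 / 3, 1 / 3, 1 / 3],
}
for name, q in candidates.items():
    print(f"{name:25s} energy={energy(q):.4f} "
          f"entropy={entropy(q):.4f} F={free_energy(q):.4f}")
```

With these assumed numbers, a confident belief is cheap only when it sits on a state the model finds probable. When the confident belief sits on an improbable state, even the flat distribution achieves a lower free energy.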
All the formulations are communicating vessels: if we minimize the divergence in \((2b)\), which is the only term that depends on \(q\) and hence the only one that can be minimized in perceptual inference, we also get the optimal tradeoff between complexity and accuracy and between energy and entropy.
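To see the minimization at work, the sketch below runs plain gradient descent on free energy, with \(q(s)\) parameterized by softmax logits. The optimizer, the step size, the iteration count and the model numbers are all assumptions made for illustration; they are not a claim about how perceptual inference is actually implemented.

```python
import math

# Hypothetical model for one fixed observation o (assumed numbers).
prior = [0.20, 0.05, 0.75]   # p(s)
lik = [0.90, 0.80, 0.01]     # p(o | s)
log_joint = [math.log(l * p) for l, p in zip(lik, prior)]
evidence = sum(l * p for l, p in zip(lik, prior))
post = [l * p / evidence for l, p in zip(lik, prior)]   # true posterior


def softmax(theta):
    """Map unconstrained logits to a probability distribution."""
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]


def free_energy(q):
    """F[q; o] = E_q[log q(s) - log p(s, o)], as in (1a)."""
    return sum(qi * (math.log(qi) - lj) for qi, lj in zip(q, log_joint))


theta = [0.0, 0.0, 0.0]   # start from a uniform q(s)
step = 0.5                # assumed step size
for _ in range(2000):
    q = softmax(theta)
    f = free_energy(q)
    # For q = softmax(theta): dF/dtheta_k = q_k * (log q_k - log p(s_k, o) - F)
    grad = [qi * ((math.log(qi) - lj) - f) for qi, lj in zip(q, log_joint)]
    theta = [t - step * g for t, g in zip(theta, grad)]

q = softmax(theta)
print("q(s)        =", [round(x, 4) for x in q])
print("p(s | o)    =", [round(x, 4) for x in post])
print("F at q      =", free_energy(q))
print("-log p(o)   =", -math.log(evidence))
```

In this sketch the optimized \(q(s)\) should approach the true posterior, the divergence goes to zero, and free energy settles at the surprise \(-\log p(o)\).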
1. The square brackets indicate that \(\mathcal F\) is a functional of \(q\), i.e., a function of a function.