The objective of perceptual inference is to find a good approximation of the distribution over the states of the world given sensory observations. A wet feeling on the face when going outside, for instance, should probably assign a high probability to the state *rain*, but also some probability to the state *active lawn sprinkler*.

The loss function used for perceptual inference is variational free energy (VFE). It can be expressed in many ways, each with a different but complementary interpretation.

Below are a few different ways to express VFE, \(\mathcal F[q; o]\) ^{1}:

$$\mathcal F[q; o] =$$

$$\sum_s q(s) \log q(s) - \sum_s q(s) \log p(s, o) = \ \ \ \ (1a)$$

$$\sum_s q(s) \log q(s) - \sum_s q(s) \log p(s \mid o) - \sum_s q(s) \log p(o) = \ \ \ \ (2a)$$

$$\sum_s q(s) \log q(s) - \sum_s q(s) \log p(o \mid s) - \sum_s q(s) \log p(s) \ \ \ \ (3a)$$

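Each of these formulations can be written as a KL-divergence, in two of the three cases with a “residual term”. (In \(1b\) the second argument, \(p(s, o)\), is not normalized over \(s\), so the expression is a KL-divergence only in a generalized sense.)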

$$\mathcal F[q; o] =$$

$$D_{KL}\left[q(s) \mid \mid p(s, o) \right] = \ \ \ \ (1b)$$

$$D_{KL}\left[q(s) \mid \mid p(s \mid o) \right] - \log p(o) = \ \ \ \ (2b)$$

$$D_{KL}\left[q(s) \mid \mid p(s)\right] - \mathbb{E}_{q(s)} \left[\log p(o \mid s)\right] \ \ \ \ (3b)$$

And a bonus variant:

$$-\mathbb H[q(s)]-\mathbb{E}_{q(s)}\left[\log p(o, s) \right] \ \ \ \ (1c)$$

The quantity \(\mathbb H\) in the last expression is called *entropy*:

$$ \mathbb H[q(s)]= \sum_s q(s) \log \frac{1}{q(s)} = \mathbb E_{q(s)} \left[\log \frac{1}{q(s)}\right] = - \mathbb E_{q(s)} \left[\log q(s)\right]$$

In the derivations we have also used

$$p(o, s) = p(o \mid s)p(s) = p(s \mid o)p(o)$$

and

$$\sum_s \log p(o) q(s) = \log p(o) \sum_s q(s) = \log p(o)$$
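The equivalence of the expressions can be checked numerically. The sketch below is only an illustration: the two states (rain, sprinkler), the prior, the likelihood, and the choice of \(q(s)\) are assumed values, not taken from any particular model.

```python
import math

# Hypothetical toy model: two hidden states and one fixed observation o = "wet face".
prior = [0.7, 0.3]        # p(s) for s = rain, sprinkler (assumed values)
likelihood = [0.9, 0.5]   # p(o | s) for the observed o (assumed values)
q = [0.6, 0.4]            # an arbitrary approximate posterior q(s)

joint = [l * p for l, p in zip(likelihood, prior)]   # p(s, o) = p(o | s) p(s)
evidence = sum(joint)                                # p(o)
posterior = [j / evidence for j in joint]            # p(s | o)


def expect(values, dist):
    """Expectation of `values` under the distribution `dist`."""
    return sum(v * d for v, d in zip(values, dist))


def kl(a, b):
    """sum_s a(s) log(a(s) / b(s)); b may be unnormalized, as in (1b)."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


log_q = [math.log(x) for x in q]
log_joint = [math.log(j) for j in joint]
entropy = -expect(log_q, q)

f_1a = expect(log_q, q) - expect(log_joint, q)
f_1b = kl(q, joint)
f_2b = kl(q, posterior) - math.log(evidence)
f_3b = kl(q, prior) - expect([math.log(l) for l in likelihood], q)
f_1c = -entropy - expect(log_joint, q)

for name, value in [("1a", f_1a), ("1b", f_1b), ("2b", f_2b), ("3b", f_3b), ("1c", f_1c)]:
    print(f"({name}) F = {value:.6f}")
```

All five printed values agree up to floating-point rounding, whichever valid \(q(s)\) is plugged in.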

Expression \(2b\) can be interpreted as the sum of *divergence* and *surprise*. Divergence measures how close the approximate posterior \(q(s)\) is to the true posterior distribution of states, \(p(s \mid o)\).

Surprise, \(-\log p(o)\), measures how far the current observation is from an expected observation. For a fish it is surprising to be on land. The more probable the observation under the generative model, the smaller the surprise. The probability of the observation is given by

$$p(o) = \sum_s p(o \mid s)p(s)$$

This means that surprise is measured against the *expected states*, \(p(s)\). If the current observation is likely for state \(s\) and that state is an expected state, then \(p(o)\) will also be high and the observation will be unsurprising. (We will later make the distinction between *expected states* and *preferred states*. As everybody intuitively knows, they are not always the same.)
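As a small illustration of how surprise depends on the expected states, the sketch below computes \(-\log p(o)\) by summing \(p(o \mid s)p(s)\) over all states for two different priors. The likelihood and prior values are assumptions chosen only for the example.

```python
import math

# Assumed values for illustration only.
likelihood = {"rain": 0.9, "sprinkler": 0.5}   # p(o = wet face | s)


def surprise(prior):
    """-log p(o), with p(o) obtained by summing p(o | s) p(s) over all states."""
    evidence = sum(likelihood[s] * prior[s] for s in prior)
    return -math.log(evidence)


rain_expected = {"rain": 0.9, "sprinkler": 0.1}
rain_unexpected = {"rain": 0.1, "sprinkler": 0.9}

print(surprise(rain_expected))     # smaller: the state that best explains o is expected
print(surprise(rain_unexpected))   # larger: that state is not expected
```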

Expression \(3b\) can be interpreted as *complexity* minus *accuracy*. Complexity measures how far the approximate posterior is from the prior beliefs, the expected states \(p(s)\).

Accuracy, \(\mathbb{E}_{q(s)} \left[\log p(o \mid s)\right]\), is high when the states that \(q(s)\) considers probable are states under which the current observation is likely.
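The tradeoff between the two terms of \(3b\) can be made concrete with a few candidate beliefs. This is a sketch under assumed numbers (a two-state model with states rain and sprinkler); the labels of the candidates are purely illustrative.

```python
import math

prior = [0.7, 0.3]        # p(s) for rain, sprinkler (assumed values)
likelihood = [0.9, 0.5]   # p(o | s) for the observed o (assumed values)


def complexity(q):
    """D_KL[q(s) || p(s)]: how far the beliefs have moved from the prior."""
    return sum(x * math.log(x / p) for x, p in zip(q, prior) if x > 0)


def accuracy(q):
    """E_q[log p(o | s)]: how well the believed states predict the observation."""
    return sum(x * math.log(l) for x, l in zip(q, likelihood))


candidates = {
    "stick to the prior": prior,
    "follow the data only": [1.0, 0.0],
    "exact posterior": [0.63 / 0.78, 0.15 / 0.78],
}
for label, q in candidates.items():
    c, a = complexity(q), accuracy(q)
    print(f"{label:22s} complexity={c:.4f} accuracy={a:.4f} F={c - a:.4f}")
```

Sticking to the prior costs no complexity but is less accurate; putting all mass on the state that best explains the observation is the most accurate but also the most complex. The exact posterior gives the lowest free energy of the three.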

The complexity-accuracy formulation is particularly interesting because it suggests a mathematical explanation of the empirically well-supported psychological phenomenon *cognitive dissonance* [5]. There is a cognitive cost and discomfort associated with changing one’s beliefs, especially if these beliefs are considered part of one’s identity. Try, for instance, to talk somebody out of their religion or political sympathies by pointing at observations (or the lack thereof).

Expression \(1c\) is usually interpreted as the difference between *energy*, \(-\mathbb{E}_{q(s)}\left[\log p(o, s) \right]\), and *entropy*. If the energy is low, i.e., if the inferred state distribution is probable under the generative model, we can afford a high-precision (peaked) \(q(s)\), meaning a low entropy. If the energy is high, meaning that the generative model doesn’t give a high probability to the inferred state distribution, then this has to be compensated with a high-entropy (flat) \(q(s)\) to keep free energy low. High entropy means that we are uncertain and therefore must spread the probability mass over many states.
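The interplay between energy and entropy can be illustrated with a peaked and a flat \(q(s)\) under two hypothetical generative models. All numbers below are assumptions made for the sake of the example; `sharp_joint` and `vague_joint` stand for \(p(s, o)\) evaluated at the observed \(o\).

```python
import math


def entropy(q):
    """H[q(s)] = -sum_s q(s) log q(s)."""
    return -sum(x * math.log(x) for x in q if x > 0)


def energy(q, joint):
    """-E_q[log p(o, s)]."""
    return -sum(x * math.log(j) for x, j in zip(q, joint))


def free_energy(q, joint):
    """Expression (1c): energy minus entropy."""
    return energy(q, joint) - entropy(q)


peaked, flat = [0.98, 0.02], [0.5, 0.5]
sharp_joint = [0.63, 0.01]   # one state explains o far better (assumed values)
vague_joint = [0.32, 0.32]   # both states explain o equally well (assumed values)

for name, joint in [("sharp model", sharp_joint), ("vague model", vague_joint)]:
    for label, q in [("peaked q", peaked), ("flat q", flat)]:
        print(f"{name}, {label}: energy={energy(q, joint):.3f} "
              f"entropy={entropy(q):.3f} F={free_energy(q, joint):.3f}")
```

Under the sharp model the peaked \(q(s)\) has the lower free energy, because it keeps the energy low. Under the vague model no \(q(s)\) can lower the energy, so the flat, high-entropy \(q(s)\) wins.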

All the formulations are communicating vessels in the sense that if we, for instance, minimize the divergence in \(2b\), which is the only term that depends on \(q(s)\) and can therefore be minimized in perceptual inference, then we also get the optimal tradeoff between complexity and accuracy and between energy and entropy. Also, since the divergence is non-negative, \(\mathcal F[q; o] \geq -\log p(o)\); for discrete observations, where \(p(o) \leq 1\), this also gives \(\mathcal F[q; o] \geq 0\).
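A brute-force search over \(q(s)\) shows where the minimum lies. The sketch below uses an assumed two-state model (rain, sprinkler) and a simple grid search in place of a proper optimizer; it is an illustration, not how perceptual inference would be implemented in practice.

```python
import math

prior = [0.7, 0.3]        # p(s) for rain, sprinkler (assumed values)
likelihood = [0.9, 0.5]   # p(o | s) for the observed o (assumed values)
joint = [l * p for l, p in zip(likelihood, prior)]
evidence = sum(joint)                  # p(o)
posterior_rain = joint[0] / evidence   # p(rain | o)


def free_energy(q_rain):
    """F[q; o] as in (1b), for q = (q_rain, 1 - q_rain)."""
    q = [q_rain, 1.0 - q_rain]
    return sum(x * math.log(x / j) for x, j in zip(q, joint) if x > 0)


# Evaluate F on a grid of candidate beliefs and pick the smallest value.
grid = [i / 1000 for i in range(1001)]
best = min(grid, key=free_energy)

print(f"argmin over grid: q(rain) = {best:.3f}")
print(f"exact posterior:  p(rain | o) = {posterior_rain:.3f}")
print(f"minimum F = {free_energy(best):.4f}, surprise = {-math.log(evidence):.4f}")
```

The grid minimum sits at the exact posterior (up to the grid resolution), and the minimum free energy approaches the surprise \(-\log p(o)\) from above.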

1. The square brackets indicate that \(\mathcal F\) is a functional of \(q\), i.e., a function of a function.