A deeper look at perceptual inference

In this post we will take a deeper look at how, according to the active inference framework (AIF), an agent assigns probabilities to mental states about the world based on observations. We call this process perceptual inference.

The simple example in the previous post showed how, according to AIF, the human brain could turn an observation into a probability distribution over mental states, using its generative model and Bayes’ theorem. Since there were only two interesting things in the garden scenario, a frog and a lizard, it was possible to calculate the posterior probability distribution for the mental state as a closed-form expression (a “frog” with 73% probability). In real life there are too many possible mental states for the brain to be able to calculate the probability distribution of mental states analytically.

This post describes how, according to AIF, an agent gets around the messy intractability of reality.

We don’t really know what’s out there

AIF posits that the generative model translates observations to probabilities for mental states, high-level representations of real-world states. The utility of mental states is to facilitate cognitive tasks such as storage of knowledge, decision making, communication, and planning, tasks that would be very hard on the abstraction level of observations only 1.

A maybe trivial, but sometimes forgotten, fact is that a model is not the reality; the mental state is not the real-world state. An agent can only attempt to create, use, and store useful representations of the world in the form of mental states. These representations only capture certain aspects of the real-world states. Ontologically the model and the real world are two very different animals.

In the case of frogs and lizards there is a straightforward correspondence between the mental state and the animal out there in the wild.

Some, perhaps all, mental states are also associated with an attribute that has not so far been observed in the real world, namely subjective experience. Examples of subjective experiences are green and pain [5]. We see not only a frog but a green frog. We not only see that our tooth is cracked and feel the crack with our tongue, we also feel the pain.

Mental states associated with subjective experiences are arguably useful as they are easy to remember, communicate, and act on. In some cases, as with severe pain, they are imperative to act on, thereby giving the agent a very strong motivation to stay alive (and, under the influence of other subjective experiences, procreate). I will return to this topic in a future post.

Some notation and ontology

We will in this post, like in the previous post, assume that all distributions are categorical, i.e., their supports are vectors of discrete events:

$$p(o) = \text{Cat}(o, \pmb{\omega})$$

$$p(s) = \text{Cat}(s, \pmb{\sigma})$$

Again, the above distributions mean for instance that \(p(o_i) = P(O = o_i) = \omega_i\), i.e., \(\pmb{\omega}\) is a vector of probabilities.

As stated before, the optimal way, according to AIF, for an agent to arrive at the probabilities for its mental states is by using Bayes’ theorem:

$$p(s \mid o) = \frac{p(o, s)}{p(o)} = \frac{p(o \mid s)p(s)}{p(o)} = \frac{p(o \mid s)p(s)}{\sum_s p(o \mid s)p(s)}$$

\(p(o, s)\) is the brain’s generative model. It models how observations and mental states are related in general. It packs all information needed for perceptual inference.

\(p(o, s) = p(o \mid s)p(s)\) where:

  • \(p(s)\) is the agent’s expectations, “priors”, about the probabilities of each possible mental state in the vector of all potential mental states \(\pmb{S}\) prior to an observation.
  • \(p(o \mid s)\), the likelihood, is the probability distribution of observations given a certain mental state \(s\).

The last piece of Bayes’ theorem is the evidence \(p(o)\). It is simply the overall probability of a certain observation \(o\). It would ideally need to be calculated using the law of total probability by summing up the contributions to \(p(o)\) from all possible mental states in \(\pmb{S}\):

$$p(o) = \sum_{s} p(o \mid s)p(s)$$

Since there is a large number of mental states, this sum would be very long making the evidence intractable.
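As a minimal sketch of what Bayes’ theorem looks like for categorical distributions, the snippet below computes the evidence and the posterior for one fixed observation. The three mental states and all the numbers are hypothetical and chosen only for illustration; with so few states the sum in the evidence is trivial, which is exactly what stops being true in real life.

```python
def posterior(prior, likelihood):
    """Return (p(s | o), p(o)) for one fixed observation o.

    prior[i]      is p(s_i)
    likelihood[i] is p(o | s_i) for the observation that was made
    """
    joint = [l * p for l, p in zip(likelihood, prior)]  # p(o, s_i)
    evidence = sum(joint)                                # law of total probability
    return [j / evidence for j in joint], evidence


# Hypothetical numbers, not taken from the post.
prior = [0.5, 0.3, 0.2]
likelihood = [0.1, 0.6, 0.3]

post, evidence = posterior(prior, likelihood)
print(evidence)  # p(o)
print(post)      # p(s | o), sums to one
```

The cost of the sum grows with the number of mental states, so this direct approach only works for toy models.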

Some intuition

According to Bayes’ theorem, the prior probabilities of mental states \(p(s)\) are modulated by the ratio \(\frac{p(o \mid s)}{p(o)}\) to get the posterior probabilities of mental states. Let’s look at two examples to gain some intuition about this ratio.

Let’s say that we are looking at animals in a zoo and want to identify the species by making an observation. We know that there are exactly ten animals, all of different species, in this small zoo and that the prior probability of seeing a certain species is the same for all ten animals (here identified with numbers from one to ten). \(p(s)\) is therefore a uniform distribution with \(p(s) = 0.1\) for each of the ten species.

In the first case we assume that our observation is whether the sun is shining or not. We know that where we are, the sun shines \(50\%\) of the time, whichever animal we happen to be looking at. The conditional probability in the numerator is therefore \(p(\texttt{sunny} \mid s) = 0.5\) for every mental state \(s\).

As assumed above, the evidence (the denominator) is \(p( \texttt{sunny}) = 0.5\). Therefore the ratio \(\frac{p(\texttt{sunny} \mid s)}{p(\texttt{sunny})} = 1.0\) for all mental states, meaning that the posterior is equal to the prior. We have not learned anything from the observation, which we of course knew beforehand since knowing the weather doesn’t tell us very much about the animals in the zoo.

In the second case we are smarter and try to find a feature to observe that actually says something about the animal we are looking at. Let’s say that one of the animals is a dromedary and we observe a hump. Only the dromedary sports a hump, so we are certain to observe a hump for the \(\texttt{dromedary}\) mental state and certain not to observe a hump for any other mental state. The numerator of the ratio is now \(p(\texttt{hump} \mid s) = 1\) for the \(\texttt{dromedary}\) mental state (#4 in our numbering) and \(0\) for all the others.

The total probability of observing a hump is now \(0.1\) since only one of the ten animals has a hump. The ratio \(\frac{p(\texttt{hump} \mid s)}{p(\texttt{hump})} = 10\) for the \(\texttt{dromedary}\) mental state (\(s\)). This means that if we observe a hump, the probability of a \(\texttt{dromedary}\) mental state should increase by a factor of \(10\). With the uniform prior probability distribution, the posterior probability for a \(\texttt{dromedary}\) now becomes \(10 \times 0.1 = 1.0\). We have gained enough information to now be certain that our mental state is \(\texttt{dromedary}\).
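The two zoo cases can be checked with a few lines of Python. This is only a sketch of the arithmetic above; the variable names are my own, and the dromedary is placed at index 3 because it is species #4 in the numbering.

```python
N_SPECIES = 10
DROMEDARY = 3  # species #4, zero-based index

prior = [1.0 / N_SPECIES] * N_SPECIES  # uniform p(s)


def update(prior, likelihood):
    """Return (evidence, ratio, posterior) for one observation."""
    evidence = sum(l * p for l, p in zip(likelihood, prior))  # p(o)
    ratio = [l / evidence for l in likelihood]                # p(o | s) / p(o)
    post = [r * p for r, p in zip(ratio, prior)]              # p(s | o)
    return evidence, ratio, post


# Case 1: the weather is equally likely whatever animal we look at.
sunny = [0.5] * N_SPECIES
ev_sunny, ratio_sunny, post_sunny = update(prior, sunny)

# Case 2: only the dromedary has a hump.
hump = [1.0 if i == DROMEDARY else 0.0 for i in range(N_SPECIES)]
ev_hump, ratio_hump, post_hump = update(prior, hump)

print(ratio_sunny[DROMEDARY], post_sunny[DROMEDARY])  # ratio about 1, posterior about 0.1
print(ratio_hump[DROMEDARY], post_hump[DROMEDARY])    # ratio about 10, posterior about 1
```

An uninformative observation leaves every ratio at one, while an observation that only one state can explain moves all the probability mass to that state.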

How to calculate the intractable

There are ways to find approximations \(q(s)\) to the optimal posterior distributions \(p(s \mid o)\) without having to calculate \(p(o)\):

$$q(s) \approx p(s \mid o) = \frac{p(o \mid s)p(s)}{p(o)}$$

Although \(q(s)\) depends on \(o\) indirectly, this dependency is usually omitted in AIF notation. \(q(s)\) can be seen as just a probability distribution that is an approximation of another probability distribution that in turn depends on \(o\).

One method to find the approximate probability distribution \(q(s)\) is variational inference which can be formulated as an optimization problem. There is some empirical evidence suggesting that the brain actually does something similar [6].

\(q(s)\) is usually assumed to belong to some tractable family of distributions, for example the multidimensional Gaussians. Variational inference finds the optimal parameters of the distribution, in this case the mean (vector) and the variance (vector) or covariance matrix. In the following, the set of parameters is denoted \(\theta\).

There are several optimization methods that can be used for finding the \(\theta\) that minimizes the dissimilarity between \(q(s \mid \theta)\) and \(p(s \mid o)\). A popular one is gradient descent, which is also widely used in machine learning. We assume that the inference of \(q(s \mid \theta)\) is fast enough for \(o\) and the generative model to remain constant during the inference.

Optimization minimizes a loss function \(\mathcal L(o, \theta)\). The Kullback-Leibler divergence 2 \(D_{KL}\) measures the dissimilarity between two probability distributions and is thus a good loss function candidate for active inference:

$$\mathcal L(o, \theta) = D_{KL}\left[q(s \mid \theta) \mid \mid p(s \mid o) \right] := \sum_{s} \log\left(\frac{q(s \mid \theta) }{p(s \mid o)}\right) q(s \mid \theta)$$

\(\mathcal L(o, \theta)\) equals zero when both distributions are identical (all logarithms equal zero). It can be shown that \(\mathcal L(o, \theta) \gt 0\) for all other combinations of distributions.
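A small sketch of the Kullback-Leibler divergence for categorical distributions may make these two properties concrete. The function name and the example distributions are my own; the code assumes natural logarithms and that \(p\) is non-zero wherever \(q\) is non-zero.

```python
import math


def kl_divergence(q, p):
    """D_KL[q || p] for two categorical distributions given as lists.

    Terms with q_i == 0 contribute nothing (the usual convention).
    Assumes p_i > 0 wherever q_i > 0.
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, p))  # identical distributions give zero
print(kl_divergence(q, p))  # different distributions give a positive value
print(kl_divergence(p, q))  # note that the divergence is not symmetric
```

The asymmetry is the reason the order of the arguments in \(D_{KL}\left[q \mid \mid p\right]\) matters.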

The loss function unpacked

Let’s try to unpack \(\mathcal L(o, \theta)\) into something that is useful for gradient descent:

$$\mathcal L(o, \theta) = D_{KL}\left[q(s \mid \theta) \mid \mid p(s \mid o) \right] = \sum_{s} \log\left(\frac{q(s \mid \theta) }{p(s \mid o)}\right) q(s \mid \theta)=$$

$$\sum_s \log\left(\frac{q(s \mid \theta)p(o) }{p(o \mid s)p(s)}\right) q(s \mid \theta)=$$

$$\sum_s q(s \mid \theta) \log q(s \mid \theta) - \sum_s q(s \mid \theta) \log p(o \mid s) -$$

$$\sum_s q(s \mid \theta) \log p(s) + \sum_s q(s \mid \theta) \log p(o) =$$

$$\sum_s q(s \mid \theta) \log q(s \mid \theta) - \sum_s q(s \mid \theta) \log p(o \mid s) -$$

$$\sum_s q(s \mid \theta) \log p(s) + \log p(o)=$$

$$\mathcal{F}(o, \theta) + \log p(o)$$

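In the second-to-last step, \(\log p(o)\) moves out of the sum because it does not depend on \(s\) and \(q(s \mid \theta)\) sums to one. The remaining terms are collected in: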

$$\mathcal{F}(o, \theta) = \sum_s q(s \mid \theta) \log q(s \mid \theta) -$$

$$\sum_s q(s \mid \theta) \log p(o \mid s) - \sum_s q(s \mid \theta) \log p(s)$$

\(\mathcal{F}(o, \theta)\) is called variational free energy or just free energy 3. Since \(\log p(o) \) doesn’t depend on \(\theta\), \(D_{KL}\left[q(s \mid \theta) \mid \mid p(s \mid o) \right]\) is minimized when \(\mathcal{F}(o, \theta)\) is minimized meaning that we can replace our earlier loss function with free energy.
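The decomposition can be checked numerically on a toy model. The sketch below uses a hypothetical three-state model and an arbitrary \(q\), both invented for illustration, and confirms that the Kullback-Leibler divergence equals free energy plus \(\log p(o)\).

```python
import math

# Hypothetical three-state model for one fixed observation o.
prior = [0.5, 0.3, 0.2]        # p(s)
likelihood = [0.1, 0.6, 0.3]   # p(o | s)
q = [0.2, 0.5, 0.3]            # an arbitrary approximation q(s | theta)

evidence = sum(l * p for l, p in zip(likelihood, prior))           # p(o)
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]  # p(s | o)


def kl_divergence(q, p):
    """D_KL[q || p] for categorical distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def free_energy(q, prior, likelihood):
    """F = sum_s q log q - sum_s q log p(o | s) - sum_s q log p(s)."""
    return sum(
        qi * (math.log(qi) - math.log(li) - math.log(pi))
        for qi, li, pi in zip(q, likelihood, prior)
        if qi > 0.0
    )


kl = kl_divergence(q, posterior)
f = free_energy(q, prior, likelihood)
log_evidence = math.log(evidence)

print(kl, f + log_evidence)  # the two numbers agree up to rounding
```

Note that `free_energy` never needs the posterior or the evidence, only the prior, the likelihood, and \(q\). That is what makes it usable as a loss function.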

\(\mathcal{F}(o, \theta)\) can be made differentiable with respect to \(\theta\) which means that it can be minimized using gradient descent. An estimate of the loss function in each iteration can be found using Monte Carlo integration which means that we only take a few samples from \(q(s \mid \theta)\), not the full distribution, do the multiplications and the summation. We then calculate the gradient of this sum with respect to \(\theta\) and adjust \(\theta\) with a small fraction (the learning rate) of the negative gradient [1]. Note that \(p(s)\) and \(p(o \mid s)\) are assumed to be quantities available to the agent for use in the calculation of the loss function.
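To see gradient descent on free energy in action, here is a minimal sketch for a categorical \(q\). Several things in it are my own assumptions and not part of AIF as such: \(q(s \mid \theta)\) is parameterized as a softmax over \(\theta\), the model has three hypothetical states, and because the state space is tiny the sum over \(s\) and its gradient are computed exactly instead of being estimated with Monte Carlo samples.

```python
import math


def softmax(theta):
    """Map unconstrained parameters theta to a categorical q(s | theta)."""
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def free_energy(q, log_joint):
    """F = sum_s q(s) * (log q(s) - log p(o, s))."""
    return sum(qi * (math.log(qi) - lj) for qi, lj in zip(q, log_joint))


def grad_free_energy(theta, log_joint):
    """Exact gradient of F with respect to theta for the softmax parameterization."""
    q = softmax(theta)
    h = [math.log(qi) - lj for qi, lj in zip(q, log_joint)]
    mean_h = sum(qi * hi for qi, hi in zip(q, h))
    return [qi * (hi - mean_h) for qi, hi in zip(q, h)]


# Hypothetical model: p(s) and p(o | s) for the observation that was made.
prior = [0.5, 0.3, 0.2]
likelihood = [0.1, 0.6, 0.3]
log_joint = [math.log(l * p) for l, p in zip(likelihood, prior)]

theta = [0.0, 0.0, 0.0]  # start from a uniform q
learning_rate = 0.5
for _ in range(2000):
    grad = grad_free_energy(theta, log_joint)
    theta = [t - learning_rate * g for t, g in zip(theta, grad)]

q = softmax(theta)
evidence = sum(l * p for l, p in zip(likelihood, prior))
exact = [l * p / evidence for l, p in zip(likelihood, prior)]

print(q)      # approximate posterior found by gradient descent
print(exact)  # exact posterior from Bayes' theorem, for comparison
```

The loop only ever touches the prior and the likelihood through `log_joint`. The evidence is computed at the end purely to verify the result.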

The quantity \(-\mathcal{F}(o, \theta)\) is called the evidence lower bound (ELBO) in Bayesian variational methods, since \(D_{KL}\left[q(s \mid \theta) \mid \mid p(s \mid o)\right] \geq 0\) and therefore \(\log p(o) \geq -\mathcal{F}(o, \theta)\).

Note that the quantity \(-\log p(o)\) remains unchanged during perceptual inference. The observation is what it is as long as the agent doesn’t change something in its environment that would cause the observation to change as a consequence.

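\(-\mathcal{F}(o, \theta)\) is a lower bound on \(\log p(o)\). Since \(p(o)\) is fixed during perceptual inference, minimizing \(\mathcal{F}(o, \theta)\) with respect to \(\theta\) does not change \(p(o)\). It tightens the bound: \(\mathcal{F}(o, \theta)\) approaches \(-\log p(o)\) from above and reaches it when \(q(s \mid \theta) = p(s \mid o)\). \(\mathcal{F}(o, \theta)\) can only go all the way to zero if \(p(o)\) itself is one.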

Surprise

The quantity \(-\log p(o)\) can be seen as the “residual” of the variational inference process. It is the part that cannot be optimized away, regardless of how accurate a variational distribution \(q(s \mid \theta)\) we manage to come up with. \(p(o)\) is the probability of the observation. If the observation has a high probability, it was “expected” by the model. Low probability observations are unexpected. \(-\log p(o)\) is therefore also called surprise. High probability \(p(o)\) means low surprise and vice versa.
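As a worked example, take the two observations from the zoo, where \(p(\texttt{sunny}) = 0.5\) and \(p(\texttt{hump}) = 0.1\). Assuming natural logarithms, so that surprise is measured in nats:

$$-\log p(\texttt{sunny}) = -\log 0.5 \approx 0.69$$

$$-\log p(\texttt{hump}) = -\log 0.1 \approx 2.30$$

The hump is the less probable observation and therefore the more surprising one.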

From above we have \(\log p(o) \geq -\mathcal{F}(o, \theta) \Rightarrow -\log p(o) \leq \mathcal{F}(o, \theta)\) meaning that the free energy is an upper bound on surprise; surprise is always lower than or equal to free energy.

According to AIF, a biological agent strives to minimize surprise at all times, as surprise means that the agent is outside its comfort zone, its expected or preferred mental states. Since surprise itself is intractable, the agent instead acts to minimize free energy, which is an upper bound on surprise. Minimizing surprise will be the topic of coming posts.

Why \(p(o \mid s)p(s)\) and not \(p(s \mid o)\)?

At first glance it may seem backward that the brain is “probing” its way through the world, asserting hypothetical mental states and performing continuous error correction to keep its mental states in sync with the observations (and therefore hopefully with the relevant real-world state). Why isn’t the brain programmed to infer the probability of a mental state directly from the observation, \(p(s \mid o)\)? Why go through all the bother with variational inference?

It turns out that nature almost always chooses the most economical “good enough” way to do things. Without considering the priors, \(p(s)\), there would be a huge number of possible mental states corresponding to an observation (“Is it a bird, is it a plane, …”), most of them entirely unlikely in a certain context. It would therefore be wasteful to calculate the probabilities for all those states, most of which have almost zero probability.

Instead nature has inverted the calculation. It uses its priors (adjusted to the situation) to limit the search space for the most probable mental states, starting with the most likely prior. It then generates the observation that the candidate mental state would produce and compares it with the actual observation. If they match, then the candidate mental state is promoted to the actual mental state. If not, then the search continues.
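The search described above can be caricatured in a few lines of Python. This is a deliberately crude sketch and not a claim about how the brain implements it: the candidate states, their priors, the feature sets, and the subset test used as a “match” are all hypothetical.

```python
def perceive(observed, priors, predicted):
    """Test candidate states in order of decreasing prior probability.

    observed:  set of features actually observed
    priors:    dict mapping state -> prior probability
    predicted: dict mapping state -> set of features that state would produce
    Returns (state, number of candidates tested), or (None, n) if nothing matches.
    """
    candidates = sorted(priors, key=priors.get, reverse=True)
    for tested, state in enumerate(candidates, start=1):
        # "Generate" the observation for this candidate and compare.
        if observed <= predicted[state]:
            return state, tested
    return None, len(candidates)


# Hypothetical priors and predictions for a foggy street near home.
priors = {"neighbor": 0.55, "spouse with dog": 0.30, "teenager": 0.14, "gorilla": 0.01}
predicted = {
    "neighbor": {"upright", "walking"},
    "spouse with dog": {"upright", "walking", "leash"},
    "teenager": {"upright", "walking", "headphones"},
    "gorilla": {"walking", "knuckles on ground"},
}

state, tested = perceive({"upright", "leash"}, priors, predicted)
print(state, tested)  # spouse with dog 2
```

The point of the sketch is the ordering: unlikely candidates such as the gorilla are only reached if every more probable candidate has failed to explain the observation.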

An illustration would be you walking on a street one block from your home in dense fog. You see a shape coming toward you but it’s all fuzzy and dark. Your brain starts doing the variational inference: the most likely state would be that of a person, a fellow pedestrian. It might be a neighbor, your spouse walking the dog, your teenager “going out”, or a few other alternatives from your prior list. A gorilla would be way down the list. Superman would have almost zero prior probability and would probably not be considered at all, together with a million other phenomena.

A consequence of this process is that it would require a quite clear observation for you to be convinced that it is actually a gorilla. As long as the observation is fuzzy, you rely more on your prior. This is nature’s good enough and fast enough solution to perceptual inference.
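The gorilla case can be illustrated with Bayes’ theorem and two made-up observations, one fuzzy and one clear. All probabilities below are hypothetical and chosen only to show the effect.

```python
def posterior(prior, likelihood):
    """p(s | o) from p(s) and p(o | s), both given as dicts keyed by state."""
    joint = {s: likelihood[s] * prior[s] for s in prior}
    evidence = sum(joint.values())
    return {s: j / evidence for s, j in joint.items()}


# Hypothetical prior: gorillas are rare one block from home.
prior = {"person": 0.99, "gorilla": 0.01}

# Fuzzy observation (a dark shape in the fog): almost equally likely for both states.
fuzzy = {"person": 0.5, "gorilla": 0.6}

# Clear observation (fur and knuckle-walking up close): very unlikely for a person.
clear = {"person": 0.001, "gorilla": 0.9}

print(posterior(prior, fuzzy)["gorilla"])  # stays close to the prior of 0.01
print(posterior(prior, clear)["gorilla"])  # rises to about 0.9
```

With the fuzzy observation the likelihood ratio is close to one, so the prior dominates. Only a likelihood that strongly separates the two states can overturn a prior this lopsided.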

Another consequence is that if your teenager had left the fridge door open several nights in a row and it was again open, you would be excused for suspecting your teenager again, even if it turned out to be your spouse this time (or, alas, yourself). The mechanism causing such misunderstandings is not a bug, it’s a feature.

In summary:

Priors make perception possible by constraining which states are even considered; forward models allow those constrained hypotheses to be tested locally, whereas a direct inverse \(p(s \mid o)\) would require global consideration of an unbounded state space.

Links

[1] Khan Academy. Gradient descent.
[2] Volodymyr Kuleshov, Stefano Ermon. Variational inference. Class notes from Stanford course CS288.
[3] Thomas Parr, Giovanni Pezzulo, Karl J. Friston. Active Inference.
[4] Hohwy, J., Friston, K. J., & Stephan, K. E. (2013). The free-energy principle: A unified brain theory? Trends in Cognitive Sciences, 17(10), 417-425.
[5] Anil Seth. Being You.
[6] Andre M. Bastos, W. Martin Usrey, Rick A. Adams, George R. Mangun, Pascal Fries, Karl J. Friston. Canonical Microcircuits for Predictive Coding. Neuron, Volume 76, Issue 4, 2012, Pages 695-711.

  1. The distinction between observations and mental states may not be so clear-cut but rather a matter of degree. A pure observation, like the sight of a linear structure, is void of semantics whereas the mental state representing a whole object, such as a face, has a lot of semantic content. There are several layers of representation between the pure observation and the final mental state with increasing degree of action-relevant semantics. ↩︎
  2. Technically, \(D_{KL}\) is a functional, that is, a function of one or more other functions. Here \(D_{KL}\) is a function of \(q(s \mid \theta)\) and \(p(s \mid o)\). The square brackets around the arguments are meant to indicate a functional. Intuitively one can think of a functional as a function of a large (up to infinite) number of parameters, namely all the values of the functions that are its arguments. ↩︎
  3. The term is borrowed from thermodynamics where similar equations arise. Knowing about thermodynamics is not important for understanding AIF though. \(\mathcal{F}(o, \theta)\) is sometimes written as a functional like this: \(\mathcal{F}[q; o]\). This is a more general expression as it doesn’t make any assumptions about the probability distribution \(q\). ↩︎
