In this post we will take a deeper look at how, according to the active inference theory, an organism interprets what it observes without yet taking any action to change the world. This is called *perceptual inference.*

The simple example in the previous post showed how, according to AIF, the human brain could turn an observation into a prediction about what it observes (a state) using its generative model and Bayes’ theorem. Since there were only two interesting states in the garden scenario, a frog and a lizard, it was possible to calculate the *posterior probability distribution* for the prediction as a closed-form expression. In the real world there are too many possible states for the brain to be able to calculate its predictions analytically.

This post describes how, according to the active inference framework (AIF), an organism may get around the messy intractability of reality.

## We don’t really know what’s out there

AIF posits that the generative model translates observations to *(hidden) states*, high-level representations of the world. The utility of explicit states is to facilitate cognitive tasks such as storage of knowledge, decision making, communication, and planning, tasks that would be very hard on the abstraction level of observations.

A perhaps trivial, but sometimes forgotten, fact is that a model is not reality. The organism can only attempt to create, use, and store useful *representations* of the world. These representations, like all models, only capture certain aspects of the real-world states. And ontologically the model and the real world are two very different animals.

My way to understand this is to remember that the variables (states and observations) in the equations describing AIF are stored in the brain (or some other organ capable of performing computations), like variables in a computer memory. The observations are generated within the organism based on real-world states. The state variables (in the brain) map more or less accurately onto real-world states.

In the case of frogs and lizards (two not so different animals) there is probably a fairly simple correspondence between the human brain’s states and the animals out there in the wild. Quantum fields and black holes, on the other hand, may not correspond to very useful states in everybody’s generative model. Yet other things, such as mathematical theorems, may *only* exist in our brains and have no real-world existence. (I am a Platonist, so I *do* believe in the existence of abstract objects, but the topic is contentious.)

Some, or perhaps all, internal states in the brain are also associated with a quality that has so far not been observed in the real world, namely subjective experience, or a *quale*. Examples of such internal states are “red” or “pain” [5]. States represented as qualia are arguably useful as they are easy to remember, communicate, and act on. Some, like severe pain, are imperative to act on, giving the organism a very strong motivation to stay alive (and, under the influence of other qualia, to procreate).

## Some notation and ontology

We will in this post, like in the previous post, assume that all distributions are categorical, i.e., their supports are vectors of discrete events:

$$p(o) = \text{Cat}(o, \pmb{\omega})$$

$$p(s) = \text{Cat}(s, \pmb{\sigma})$$

As stated before, the optimal way for the organism (again, according to AIF) to arrive at its predictions about the observed states is using Bayes’ theorem:

$$p(s \mid o) = \frac{p(o, s)}{p(o)} = \frac{p(o \mid s)p(s)}{p(o)} = \frac{p(o \mid s)p(s)}{\sum_s p(o \mid s)p(s)}$$

- \(p(o, s)\) is the brain’s generative model. It models how observations and states are related in general. \(p(o, s) = p(o \mid s)p(s)\) where:
- \(p(s)\) is the organism’s prior assumption about the probability of each possible state in the vector of all potential states \(\pmb{S}\).
- \(p(o \mid s)\), the likelihood, is the probability distribution of observations given a certain predicted state \(s\). The likelihood models the organism’s sensory apparatus.

The last piece of Bayes’ theorem is the *evidence* \(p(o)\). It is simply the overall probability of a certain observation \(o\). It would ideally need to be calculated using the law of total probability by summing up the contributions to \(p(o)\) from all possible states in \(\pmb{S}\):

$$p(o) = \sum_{s} p(o \mid s)p(s)$$

Since there is a huge number of possible states in the real world, this sum would be very long, making the evidence intractable. We therefore cannot rely on Bayes’ theorem alone to infer \(p(s \mid o)\) except in toy examples.

The evidence normalizes the numerator of Bayes’ theorem, \(p(o \mid s)p(s)\), into a proper probability distribution, one that sums or integrates to unity. (The posterior has the same *shape* as \(p(o \mid s)p(s)\) but is scaled by the inverse of the evidence.)
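To make the mechanics concrete, here is a minimal sketch of the garden-style calculation. All the numbers, states, and observations below are made up for illustration; with only two states the evidence sum is short, so the exact posterior is easy to compute:

```python
import numpy as np

# Hypothetical generative model for the garden scenario:
# two states (frog, lizard), two observations (hop, scurry).
prior = np.array([0.7, 0.3])            # p(s): frogs assumed more common here
likelihood = np.array([[0.9, 0.1],      # p(o | s=frog)
                       [0.2, 0.8]])     # p(o | s=lizard)

o = 0  # we observed a hop
evidence = np.sum(likelihood[:, o] * prior)        # p(o) by total probability
posterior = likelihood[:, o] * prior / evidence    # Bayes' theorem

print(evidence)   # p(hop)
print(posterior)  # p(s | hop); sums to one, favors the frog
```

The whole difficulty addressed in the rest of this post is that the `evidence` sum above is only cheap because there are two states instead of billions.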

## How to calculate the intractable

There are ways to find *approximations* \(q(s)\) to the optimal posterior distributions \(p(s \mid o)\) without having to calculate \(p(o)\):

$$q(s) \approx p(s \mid o) = \frac{p(o \mid s)p(s)}{p(o)}$$

Although \(q(s)\) depends on \(o\) indirectly, this dependency is usually omitted in AIF notation. \(q(s)\) can be seen as just a probability distribution that is an *approximation* of another probability distribution that in turn depends on \(o\).

One method to find the approximate probability distribution \(q(s)\) is *variational inference*. There is some empirical evidence suggesting that the brain actually does something similar [6].

\(q(s)\) is usually assumed to belong to some tractable family of distributions, such as the multidimensional Gaussians. Variational inference finds the optimal parameters of the distribution, in the Gaussian case the mean vector and the covariance matrix. In the following, the set of parameters is denoted \(\pmb \theta\). Variational inference finds the optimal parameters of \(q(s)\) by optimizing a loss function.

## Optimization

There are several optimization methods that can be used for finding the \(\pmb \theta\) that minimizes the dissimilarity between \(q(s \mid \pmb \theta)\) and \(p(s \mid o)\). We assume that the inference of \(q(s \mid \pmb \theta)\) is fast enough for \(o\) and the generative model to remain constant during the inference.

Optimization minimizes a *loss function* \(\mathcal L(o, \pmb \theta)\). The Kullback-Leibler divergence ^{1} \(D_{KL}\) measures the dissimilarity between two probability distributions and is thus a good loss function candidate for active inference:

$$\mathcal L(o, \pmb \theta) = D_{KL}\left[q(s \mid \pmb \theta) \mid \mid p(s \mid o) \right] := \sum_{s} \log\left(\frac{q(s \mid \pmb \theta) }{p(s \mid o)}\right) q(s \mid \pmb \theta)$$
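For categorical distributions this divergence is a plain sum, which a few lines of Python can illustrate (the example distributions below are made up):

```python
import numpy as np

def kl_divergence(q, p):
    """D_KL[q || p] for two categorical distributions over the same support."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0  # terms with q(s) = 0 contribute 0 by convention
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = np.array([0.913, 0.087])         # a (hypothetical) exact posterior
print(kl_divergence(p, p))           # 0.0: identical distributions
print(kl_divergence([0.5, 0.5], p))  # positive: the distributions differ
```

Note the asymmetry: \(D_{KL}[q \mid\mid p]\) is in general not equal to \(D_{KL}[p \mid\mid q]\), which is why the order of the arguments in the loss function matters.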

## The loss function unpacked

Let’s try to unpack \(\mathcal L(o, \pmb \theta)\) into something that is useful in gradient descent:

$$\mathcal L(o, \pmb \theta) = D_{KL}\left[q(s \mid \pmb \theta) \mid \mid p(s \mid o) \right] = \sum_{s} \log\left(\frac{q(s \mid \pmb \theta) }{p(s \mid o)}\right) q(s \mid \pmb \theta)=$$

$$\sum_s \log\left(\frac{q(s \mid \pmb \theta)p(o) }{p(o \mid s)p(s)}\right) q(s \mid \pmb \theta)=$$

$$\sum_s q(s \mid \pmb \theta) \log q(s \mid \pmb \theta) - \sum_s q(s \mid \pmb \theta) \log p(o \mid s) -$$

$$\sum_s q(s \mid \pmb \theta) \log p(s) + \sum_s q(s \mid \pmb \theta) \log p(o) =$$

$$\sum_s q(s \mid \pmb \theta) \log q(s \mid \pmb \theta) - \sum_s q(s \mid \pmb \theta) \log p(o \mid s) -$$

$$\sum_s q(s \mid \pmb \theta) \log p(s) + \log p(o) =$$

$$\mathcal{F}(o, \pmb \theta) + \log p(o)$$

With:

$$\mathcal{F}(o, \pmb \theta) = \sum_s q(s \mid \pmb \theta) \log q(s \mid \pmb \theta) -$$

$$\sum_s q(s \mid \pmb \theta) \log p(o \mid s) - \sum_s q(s \mid \pmb \theta) \log p(s)$$

\(\mathcal{F}(o, \pmb \theta)\) is called *variational free energy* or just *free energy* ^{2}. Since \(\log p(o) \) doesn’t depend on \(\pmb \theta\), \(D_{KL}\left[q(s \mid \pmb \theta) \mid \mid p(s \mid o) \right]\) is minimized when \(\mathcal{F}(o, \pmb \theta)\) is minimized meaning that we can use the free energy as our loss function.

\(\mathcal{F}(o, \pmb \theta)\) can be made differentiable with respect to \(\pmb \theta\), which means that it can be minimized using gradient descent. In each iteration, an estimate of the loss function can be found using Monte Carlo integration: instead of summing over the full distribution, we draw a few samples from \(q(s \mid \pmb \theta)\) and evaluate the summands only for those. We then calculate the gradient of this estimate with respect to \(\pmb \theta\) and adjust \(\pmb \theta\) by a small fraction (the learning rate) of the negative gradient [1]. Note that \(p(s)\) and \(p(o \mid s)\) are assumed to be quantities available to the organism for use in the calculation of the loss function.
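The optimization loop can be sketched in code. The toy example below uses hypothetical numbers, parameterizes a categorical \(q(s \mid \pmb \theta)\) with a softmax, and evaluates \(\mathcal F\) exactly rather than by Monte Carlo, which is feasible here because the state space is tiny. A finite-difference gradient stands in for the automatic differentiation a real implementation would use:

```python
import numpy as np

# Toy model with hypothetical numbers: two states, one fixed observation o.
prior = np.array([0.7, 0.3])   # p(s)
lik_o = np.array([0.9, 0.2])   # p(o | s), one entry per state

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def free_energy(theta):
    """F(o, theta) = sum_s q(s)[log q(s) - log p(o|s) - log p(s)]."""
    q = softmax(theta)
    return np.sum(q * (np.log(q) - np.log(lik_o) - np.log(prior)))

theta = np.zeros(2)            # start from a uniform q(s)
lr, eps = 0.5, 1e-6
for _ in range(300):
    # Finite-difference gradient of F with respect to theta.
    grad = np.array([
        (free_energy(theta + eps * np.eye(2)[i])
         - free_energy(theta - eps * np.eye(2)[i])) / (2 * eps)
        for i in range(2)
    ])
    theta -= lr * grad         # step against the gradient

q = softmax(theta)
exact = lik_o * prior / np.sum(lik_o * prior)  # exact posterior p(s | o)
print(q)      # close to the exact posterior
print(exact)
```

Because the state space is small enough that the exact posterior is also computable, we can verify that minimizing \(\mathcal F\) drives \(q(s \mid \pmb \theta)\) toward \(p(s \mid o)\), which is exactly the point of the derivation above.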

The quantity \(-\mathcal{F}(o, \pmb \theta)\) is in Bayesian variational methods called the *evidence lower bound* (ELBO), since \(D_{KL}\left[q(s \mid \pmb \theta) \mid \mid p(s \mid o)\right] \geq 0\) and therefore \(\log p(o) \geq -\mathcal{F}(o, \pmb \theta)\).

Note that the quantity \(-\log p(o)\) remains unchanged during perceptual inference. The observation is what it is as long as the organism doesn’t change something in its environment that would cause the observation to change as a consequence.

## Surprise

The quantity \(-\log p(o)\) can be seen as the “residual” of the variational inference process. It is the part that cannot be optimized away, regardless of how accurate a variational distribution \(q(s \mid \pmb \theta)\) we manage to come up with. \(p(o)\) is the probability of the observation. If the observation has a high probability, it was “expected” by the model; low-probability observations are unexpected. \(-\log p(o)\) is therefore also called *surprise*: a high probability \(p(o)\) means low surprise and vice versa.

From above we have \(\log p(o) \geq – \mathcal{F}(o, \pmb \theta) \Rightarrow -\log p(o) \leq \mathcal{F}(o, \pmb \theta)\) meaning that the free energy is an upper bound on surprise; surprise is always lower than or equal to free energy.
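The bound can be checked numerically: for any choice of \(q\), the free energy never drops below the surprise, with equality only when \(q\) equals the exact posterior. A small sketch, again with hypothetical numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
prior = np.array([0.7, 0.3])   # hypothetical p(s)
lik_o = np.array([0.9, 0.2])   # hypothetical p(o | s) for the observed o
surprise = -np.log(np.sum(lik_o * prior))  # -log p(o), fixed by the observation

# Evaluate the free energy for a handful of random categorical q(s).
free_energies = []
for _ in range(5):
    q = rng.dirichlet([1.0, 1.0])  # a random point on the probability simplex
    F = np.sum(q * (np.log(q) - np.log(lik_o) - np.log(prior)))
    free_energies.append(F)

print(surprise)            # the irreducible residual
print(min(free_energies))  # never drops below the surprise
```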

According to AIF, the organism strives to minimize surprise at all times as surprise means that the organism is outside its comfort zone, its expected states. Minimizing surprise will be the topic of coming posts.

## Links

[1] Khan Academy. Gradient descent.

[2] Volodymyr Kuleshov, Stefano Ermon. Variational inference. Class notes from Stanford course CS228.

[3] Thomas Parr, Giovanni Pezzulo, Karl J. Friston. Active Inference.

[4] Hohwy, J., Friston, K. J., & Stephan, K. E. (2013). The free-energy principle: A unified brain theory? *Trends in Cognitive Sciences*, 17(10), 417-425.

[5] Anil Seth. Being You.

[6] Andre M. Bastos, W. Martin Usrey, Rick A. Adams, George R. Mangun, Pascal Fries, Karl J. Friston (2012). Canonical microcircuits for predictive coding. *Neuron*, 76(4), 695-711.

- Technically \(D_{KL}\) is a *functional*, which is a function of one or more other functions; \(D_{KL}\) is a function of \(q(s \mid \pmb \theta)\). The square brackets around the argument are meant to indicate a functional. Intuitively one can think of a functional as a function of a large (up to infinite) number of parameters, namely all the values of the functions that are its arguments. ↩︎
- The term is borrowed from thermodynamics, where similar equations arise. Knowing about thermodynamics is not important for understanding AIF, though. ↩︎