A good way to learn something is to try to explain it to somebody else. This is the essence of the Feynman method of learning [5]. I believe Feynman suggested the topic should be explained to a six-year-old. I think that most of the topics that interest me would be challenging to explain to a six-year-old, though. Unless the six-year-old has some university education.

I have in the same vein found that turning thoughts into text, and then reading the text to see if it makes sense, is a useful tool for structuring my thoughts and ideas [6]. I will therefore, from time to time, try to put tentative new insights into writing on this blog. If what I wrote doesn’t make sense, then I probably haven’t understood what I’m writing about. Or as the Swedish writer Esaias Tegnér put it:

What clearly you cannot say, you do not know;

with thought the word is born on lips of man;

what’s dimly said is dimly thought.

Esaias Tegnér. Epilogue at an academic graduation in Lund, 1820. Unknown translator.

Learning is an iterative process so I will most certainly have to revise earlier posts as I learn more. Hopefully they will eventually coalesce into a consistent and comprehensive narrative.

## Active inference framework

I have set out to try to understand the *active inference framework* (AIF) that is claimed to provide a unified model for human and animal perception, learning, decision making, and action [2].

The theory of active inference was first proposed by Karl Friston based on ideas that go back all the way to the 19th century.

The theory has connections to several other topics that I’m interested in. There seem to be some interesting and potentially important connections between AIF and machine learning. It can possibly contribute to the understanding of the mental health conditions that affect many people today, especially the young. Active inference is perhaps also, as Anil Seth claims in his book Being You [1], a milestone on the road to understand the greatest of mysteries, consciousness.

I will in a first series of posts try to deconstruct the model mathematically and not focus much on the wider claim of its biological plausibility, or on some even wider claims about the generality of its applicability in different scenarios.

The central claim by Karl Friston [2] is that AIF explains both perception and action planning by biological organisms, so I will introduce the framework in those terms before moving on to more mathematical aspects.

## Key ideas

The key claims about AIF and its applicability to biological organisms, as I understand them today, are summarized below.

- An organism, like a human being, has in its brain (or similar computing organ) a *generative model*^{1} that describes the probabilistic relationship between the observations the organism makes and the organism’s mental representations of the corresponding real-world states, its *mental states* (referred to simply as *states* below). The generative model is simply the joint probability distribution \(p(s, o)\), where \(o\) denotes an observation and \(s\) a mental state.
- The generative model can be expressed as a product of two probability distributions, *priors* and *likelihoods*: \(p(s, o) = p(o \mid s)p(s)\). The prior, \(p(s)\), represents the organism’s beliefs (“prejudices”) about the probabilities of mental states prior to an observation. The likelihood, \(p(o \mid s)\), represents the organism’s beliefs about the probabilities of certain observations given a particular mental state. Note that the organism doesn’t have access to the real-world states, only to observations and to its internal mental states, which in some useful sense should represent the real-world states.
- The organism uses an approximate implementation of *Bayes’ theorem* and *variational inference*, a numerical method, to match states and observations using the generative model. Notably, it does this “backwards”, starting with a hypothesis (“guess”) about the state indicated by an observation. It then iteratively generates new guesses of the state until the guess matches the observation according to the generative model.
- The organism maintains probability distributions over *preferred states* or *preferred observations*^{2} (which depend on earlier observations and states) and selects actions to increase the probability of such observations or states. A preferred state for a fish is, for instance, to be in water of the right temperature and salinity. Humans have many levels of preferred states, from the right body temperature all the way to a meaningful life.
- The organism updates its generative model over time based on experience.
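To make the first two bullet points concrete, here is a minimal sketch of a discrete generative model as a prior vector and a likelihood matrix, using the numbers from the garden example later in the post (the crawl probabilities, taken as complements of the jump probabilities, and all variable names are my own additions):

```python
import numpy as np

states = ["frog", "lizard"]          # mental states s
observations = ["jumps", "crawls"]   # observations o

prior = np.array([0.25, 0.75])       # p(s): beliefs before any observation

# likelihood[i, j] = p(o_j | s_i); crawl probabilities assumed complementary
likelihood = np.array([
    [0.8, 0.2],   # frog:   jumps, crawls
    [0.1, 0.9],   # lizard: jumps, crawls
])

# The generative model is the joint distribution p(s, o) = p(o | s) p(s)
joint = likelihood * prior[:, None]

print(joint)          # each row sums to that state's prior
print(joint.sum())    # the full joint sums to 1
```

The joint matrix is the whole model: summing over rows or columns recovers the marginals, and dividing a column by its sum gives the posterior over states for that observation.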

Active inference presents a shift from the traditional view that an organism passively receives inputs through the senses, processes these inputs, and then decides on an action. Instead, in the active inference framework, the organism continuously generates predictions about what it expects to observe now and in the future (given some actions). When there’s a mismatch between these predictions and actual observations, a “prediction error” arises. The organism aims to minimize this error, either by updating its beliefs or by taking further actions to make the world more in line with its predictions.
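A crude way to sketch this belief-updating loop is repeated exact Bayesian updating, where each observation’s posterior becomes the prior for the next. This is a simplification: full AIF uses approximate, variational inference rather than exact Bayes, and the numbers here are borrowed from the garden example below.

```python
import numpy as np

belief = np.array([0.25, 0.75])                  # p(frog), p(lizard)
likelihood = {"jumps": np.array([0.8, 0.1]),     # p(obs | frog), p(obs | lizard)
              "crawls": np.array([0.2, 0.9])}

for obs in ["jumps", "jumps"]:
    p_obs = likelihood[obs] @ belief             # how expected was this observation?
    belief = likelihood[obs] * belief / p_obs    # Bayes: posterior becomes new prior
    print(obs, belief)
```

A surprising observation (small `p_obs`, i.e. a large prediction error) shifts the belief a lot; an expected one barely moves it.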

Whether to observe or act depends on the context. For instance, during food foraging, there is a strong emphasis on action to reach adaptive states. In contrast, while watching a movie, the emphasis might lean towards more or less passive observation, having hopefully reached the adaptive state of enjoyment.

In all scenarios, the organism evaluates possible actions or “policies” based on their anticipated outcome states or observations and the attractiveness of those states and observations, respectively.

## A simple example

Let’s look at how a human actor, according to AIF, could optimally infer useful information about the *state* of the real world based on their *observation* in an extremely simple scenario. The real-world state is represented in the generative model by a mental state that is inferred from the observation.

We assume that there are two kinds of animals in a garden, frogs and lizards. Frogs are prone to jumping quite often. The lizards also jump, but more seldom. Let’s also assume that the actor has forgotten their eyeglasses in the house, so that they can’t really tell a frog from a lizard just by looking at it, but they can discern whether it jumps or crawls (yes, this is a contrived scenario getting more contrived, but please bear with me).

The actor has seen frogs and lizards before so they have a fairly accurate one-to-one mapping from the real-world amphibians and reptiles to corresponding mental states.

The actor *believes* that there are more lizards than frogs in the garden. This belief, when quantified, is the *prior*. The actor also *believes* based on some experience that frogs jump much more often than lizards do. This belief, also when quantified, is called *likelihood*. The prior and the likelihood constitute the actor’s *generative model* of the world.

The actor now observes a random animal for two minutes. They don’t see what animal it is because of their poor eyesight but they can see that the animal jumps. What is their best guess about the species of the animal that jumps?

## Nothing is certain

We cannot be certain about anything in this world (except perhaps death and taxes). In the context of active inference there is uncertainty about the accuracy of the observation (aleatoric uncertainty) and about the accuracy of the brain’s generative model (epistemic uncertainty). There will therefore never be a clear-cut answer to the question about what animal we see or any other prediction.

When I suggested to Bard that when I see a cat on the street I’m pretty sure it’s a cat and not a gorilla it replied “it is possible that the cat we see on the street is actually a very large cat, or that it is a gorilla that has been dressed up as a cat”. So take it from Bard, nothing is certain. A cat may actually be a gorilla in disguise and vice versa.

When we are uncertain, we need to describe the world in terms of *probabilities* and *probability distributions*. We will say things like “the animal is a frog with 73% probability” (and therefore, in a garden with only two species, a lizard with 27% probability). Bayes’ theorem and AIF are all about probabilities, not certainties.

## Notation

When I don’t understand something mathematical or technical, it is frequently due to confusing or unfamiliar notation or unclear ontology. (The other times it is down to cognitive limitations, I guess.) Before I continue, I will therefore introduce some notation that I hope I can stick to in this and future posts. I try to use notation that is as commonly accepted as possible, but unfortunately there is no universal standard notation in mathematics.

Active inference is about *observations*, *states*, and *probability distributions* (probability mass functions and probability density functions). Observations and states can be *discrete* or *continuous*. Observations that can be enumerated or named, like \(\text{jumps}\), are discrete. Observations that can be measured, like the body temperature, are continuous.

Discrete observations and states are *events* in the general vocabulary of probability theory. In this post we are only considering discrete observations and states.

*Potential* observations are represented by a vector of events:

$$\pmb{\mathcal{O}} = [\mathcal{O}_1, \mathcal{O}_2, \ldots, \mathcal{O}_n] = [\text{jumps}, \text{crawls}, \ldots]$$

\(\pmb{\mathcal{O}}\) is a vector of *potential* observations. It doesn’t by itself say anything about what observations have been made or are likely to be made. We therefore also need to assign a probability \(P\) to each observation, in this case \(P(\text{jumps})\) and \(P(\text{crawls})\) respectively. We collect these probabilities into a vector

$$\pmb{\omega} = [\omega_1, \omega_2, \ldots, \omega_n] = [P(\text{jumps}), P(\text{crawls}), \ldots]$$

The vector of observations \(\pmb{\mathcal{O}}\) and the vector of probabilities \(\pmb{\omega}\) are assumed to be matched element-wise so that \(P(\mathcal{O}_i) = \omega_i\).

For formal mathematical treatment, all events need to be mapped to *random variables*, real numbers representing the different events. This is a technicality but will make things more mathematically consistent.

In our case \(\text{jumps}\) could for instance be mapped to \(0\) and \(\text{crawls}\) to \(1\) (the mapping is arbitrary for discrete events without an order). In general we do the mapping \(\pmb{\mathcal{X}} \mapsto \pmb{X}\), where \(\pmb{\mathcal{X}}\) is an event and \(\pmb{X} \in \mathbb{R}\).

The observation events are analogously mapped to observation random variables: \(\pmb{\mathcal{O}} \mapsto \pmb{O}\). When referring to a certain value of a random variable such as an observation, we use lower case letters, e.g., \(P(O = o)\).

Continuous observations and states can be represented by random variables directly; there is no need for the event concept.

A *probability mass function* [3], \(p(x)\), is a function that returns the probability of a random variable value \(x\), such that \(p(x) = P(X = x)\). It is often possible to define a probability mass function analytically, which makes using random variables in some ways easier than bare-bones events, for which probabilities need to be explicitly defined one by one. Also, continuous random variables, like the body temperature, don’t have any meaningful representation in the event space.

Note that a probability is denoted with a capital \(P\) while a probability mass function (probability distribution) is denoted with a lower case \(p\).

Since our observations are discrete, the probability mass function for the observations that we are interested in would be:

$$p(o) = \text{Cat}(o, \pmb{\omega})$$

This is called a categorical probability mass function and is basically a “lookup table” of probabilities such that \(p(o_i) = P(O = o_i) = \omega_i\). The interesting aspect of a categorical probability mass function is the vector of probabilities. In practical situations it is often useful to reason about the probabilities directly and to “forget” that we pull them out of a probability mass function. Read on to see what I mean.
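As a sketch of the “lookup table” view, here is a categorical pmf in a few lines of Python. The helper name `cat_pmf` and the event-to-value mapping are my own; the probabilities are the garden values, with \(P(\text{jumps}) = 0.275\) as derived later in the post.

```python
omega = [0.275, 0.725]                       # [P(jumps), P(crawls)]
event_to_value = {"jumps": 0, "crawls": 1}   # arbitrary event -> value mapping

def cat_pmf(o, omega):
    """p(o) = Cat(o, omega): just index into the probability vector."""
    return omega[o]

print(cat_pmf(event_to_value["jumps"], omega))   # 0.275
```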

States are also events:

$$\pmb{\mathcal{S}} = [\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_n] = [\text{frog}, \text{lizard}, \ldots]$$

The probability of each state is represented by the vector:

$$\pmb{\sigma} = [\sigma_1, \sigma_2, \ldots, \sigma_n] = [P(\text{frog}), P(\text{lizard}), \ldots]$$

The state events are mapped to state random variables: \(\pmb{\mathcal{S}} \mapsto \pmb{S}\).

The probability distribution of the state random variable is:

$$p(s) = \text{Cat}(s, \pmb{\sigma})$$

## Back to frogs and lizards

Assume that the prior probabilities held by the actor for finding frogs and lizards in the garden are:

$$\pmb{\sigma} = [0.25, 0.75]$$

This means that, under the actor’s current generative model of the world, they believe that there are three times more lizards than frogs in the garden.

Assume that the likelihood for each of the species in the garden jumping within two minutes is represented by the vector:

$$[P(\text{jumps} \mid \text{frog}), P(\text{jumps} \mid \text{lizard})] = [0.8, 0.1]$$

\(P(\text{jumps} \mid \text{frog})\) should be read as “the probability for observing jumping *given* that the animal is a frog”.

This means that frogs are eight times more likely to jump within a two-minute period than lizards. In other words: if there were as many frogs as lizards in the garden, then, when the actor sees an animal jumping, they would believe it to be a frog eight times out of nine; for every eight jumping frogs there would be one jumping lizard.

With three times as many lizards as frogs, the number of jumping lizards goes up by a factor of three, meaning that when the actor sees an animal jumping in the garden, they would believe it is a frog only eight times out of eleven; for every eight jumping frogs there would be three jumping lizards.
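This counting argument can be checked numerically. Any population with a 1:3 frog-to-lizard ratio gives the same answer; the population sizes below are arbitrary.

```python
frogs, lizards = 100, 300          # any 1:3 split works
jumping_frogs = 0.8 * frogs        # 80 expected jumpers
jumping_lizards = 0.1 * lizards    # 30 expected jumpers

# Of all jumpers, what fraction are frogs?
p_frog_given_jump = jumping_frogs / (jumping_frogs + jumping_lizards)
print(p_frog_given_jump)           # 8/11 ≈ 0.727
```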

The actor saw the animal jump. The probabilities for the animal being a frog and a lizard are therefore given by the vector:

$$[P(\text{frog} \mid \text{jumps}), P(\text{lizard} \mid \text{jumps})] = [\frac{8}{11}, \frac{3}{11}] \approx [0.73, 0.27]$$

meaning that the actor’s best prediction, given their current model and the observation, is that the observed animal is a frog with \(73\%\) probability and a lizard with \(27\%\) probability.

## With some more math

Below follows an alternative account of the same example, this time with a little more mathematics, introducing Bayes’ theorem.

As stated above, \(p(s)\) and \(p(o \mid s)\), the prior and the likelihood together define the brain’s current model of the (very limited) world.

AIF claims that the actor uses Bayes’ theorem to predict the probability of the jumping animal being a frog and a lizard respectively:

$$p(s \mid o)= \frac{p(o \mid s)p(s)}{p(o)}$$

Let’s put some numbers in the numerator (\(\odot\) indicates element-wise multiplication):

$$[P(\text{jumps} \mid \text{frog})P(\text{frog}), P(\text{jumps} \mid \text{lizard})P(\text{lizard})] = [0.8, 0.1] \odot [0.25, 0.75] = [0.2, 0.075]$$

The denominator indicates how probable it is to see an animal, any animal, jumping in the garden. It can in this simple case, assuming that there are only two types of animals in the garden and that their jumping propensities are known, be calculated using the *law of total probability*:

$$P(\text{jumps}) = P(\text{jumps} \mid \text{frog})P(\text{frog}) + P(\text{jumps} \mid \text{lizard})P(\text{lizard}) = 0.8 \cdot 0.25 + 0.1 \cdot 0.75 = 0.275$$

The actor’s brain thus predicts the probabilities for frog and lizard respectively as follows:

$$[P(\text{frog} \mid \text{jumps}), P(\text{lizard} \mid \text{jumps})]= [0.2, 0.075] / 0.275 \approx [0.73, 0.27]$$
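The whole calculation, numerator, denominator, and normalization, fits in a few lines. This is a sketch using NumPy; the variable names are my own.

```python
import numpy as np

prior = np.array([0.25, 0.75])      # [P(frog), P(lizard)]
likelihood = np.array([0.8, 0.1])   # [P(jumps | frog), P(jumps | lizard)]

numerator = likelihood * prior      # element-wise product: [0.2, 0.075]
evidence = numerator.sum()          # P(jumps) = 0.275, law of total probability
posterior = numerator / evidence    # Bayes' theorem

print(posterior.round(2))           # [0.73 0.27]
```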

This is the same result as above. Note how the fact that there are three times as many lizards as frogs, as expressed in the prior, increases the probability of the jumping animal being a lizard even though lizards don’t readily jump. Bayes’ theorem sometimes gives surprising but always correct answers.

We will take a deeper look into the mathematics needed in real-world situations, outside the simple garden, in the next posts.

## Links

[1] Anil Seth. Being You.

[2] Thomas Parr, Giovanni Pezzulo, Karl J. Friston. Active Inference.

[3] MIT Open Courseware. Introduction to Probability and Statistics.

[4] Ryan Smith, Karl J. Friston, Christopher J. Whyte. A step-by-step tutorial on active inference and its application to empirical data. Journal of Mathematical Psychology. Volume 107. 2022.

[5] Farnham Street Blog. The Feynman Technique: Master the Art of Learning.

[6] Youki Terada. Why Students Should Write in All Subjects. Edutopia.

1. A generative model approximates the full joint probability distribution between input and output, in this case between observations and states, \(p(s, o)\). This is in contrast to a *discriminative model*, which is “one way”. The likelihood \(p(o \mid s)\) is a discriminative model. ↩︎
2. There is some ambiguity in AIF about whether an actor’s preferences are encoded as states or as observations. Both alternatives are implied in different parts of the literature. Observations are a probabilistic function of the states, as given by the generative model. ↩︎