Active Inference is a normative framework to characterize Bayes-optimal behavior and cognition in living organisms. Its normative character is evinced in the idea that all facets of behavior and cognition in living organisms follow a unique imperative: minimizing the surprise of their sensory observations. Surprise has to be interpreted in a technical sense: it measures how much an agent’s current sensory observations differ from its preferred sensory observations—that is, those that preserve its integrity (e.g., for a fish, being in the water).
[1, p.6]
Minimizing surprise
When an agent has determined, using perceptual inference, what state it is in, it will want to decide what to do next. The process of finding the best set of actions going forward is here called policy inference or, synonymously, action inference, and is the topic of this post. A policy is defined as a sequence of actions.
The active inference framework (AIF) posits, as stated in the opening quote, that rational agents choose their actions so as to minimize a quantity called surprise.
Surprise was introduced in an earlier post. A high surprise in perceptual inference means that the observation has a low probability according to the agent’s generative model. Perceptual inference in itself does not assign any valence to the probability of the observation. The probability measurement is a “byproduct” of the perceptual inference.
The observation relates to the prior probability distribution of states \(p(s)\) through:
$$p(o) = \sum_s p(o \mid s)p(s)$$
In a baseline formulation of policy inference for the control of low-level biological functions, the prior over states functions as a homeostatic setpoint distribution. States with high prior probabilities are expected states in which the agent usually prefers to stay. The prior probability distribution specifies the region of state space within which the agent remains viable. If the current observation is likely for state \(s\) and that state has a high probability in \(p(s)\), then \(p(o)\) will also be high and the observation will be unsurprising.
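To make this relation concrete, here is a minimal numerical sketch (my own illustration; the likelihood matrix and prior are made up) of how the marginal \(p(o)\) and the surprise \(-\log p(o)\) are computed in a small discrete model:

```python
import numpy as np

# Hypothetical discrete model: three hidden states and three observations.
# likelihood[o, s] = p(o | s); prior[s] = p(s). All numbers are made up.
likelihood = np.array([[0.8, 0.1, 0.1],
                       [0.1, 0.8, 0.2],
                       [0.1, 0.1, 0.7]])
prior = np.array([0.7, 0.2, 0.1])

# Marginal over observations: p(o) = sum_s p(o | s) p(s)
p_o = likelihood @ prior

# Surprise of each possible observation: -log p(o)
surprise = -np.log(p_o)
print(p_o)         # observation 0 is the least surprising under this prior
print(surprise)
```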
The baseline model can explain the behavior of an agent with limited behavioral choices in a predictable environment. A bacterium has a high prior probability for a nutrient-rich environment and well-tuned sensors to measure the nutrient concentration. Whales have a high prior probability for being in water and therefore remain mostly unsurprised while in water. They would become almost infinitely surprised if summoned into existence high in the atmosphere.
… against all probability a sperm whale had suddenly been called into existence several miles above the surface of an alien planet. And since this is not a naturally tenable position for a whale, this poor innocent creature had very little time to come to terms with its identity as a whale before it then had to come to terms with not being a whale any more.
Douglas Adams. The Hitchhiker’s Guide to the Galaxy.
An agent experiencing high surprise will attempt to choose and take some action to lower the surprise, thereby getting closer to the homeostatic setpoints. For this it needs, as a baseline approximation, to find a policy \(\pi\) that on average satisfies:
$$\pi \propto – \nabla_{\pi} D_{KL}[p(o \mid \pi) \mid \mid p(o)] \ \ \ \ (0)$$
A bacterium senses the nutrient concentration around it and moves, on average, toward a higher concentration of nutrients, i.e., toward lower surprise observations.
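As a toy illustration of following the surprise gradient (my own sketch; the environment, the Gaussian preference, and the step size are all invented), consider an agent in one dimension whose observation is the local nutrient concentration:

```python
import numpy as np

def observe(x):
    """Nutrient concentration at position x in a made-up environment (peak at x = 5)."""
    return np.exp(-0.5 * (x - 5.0) ** 2)

def surprise(o, o_pref=1.0, sigma=0.2):
    """-log of a Gaussian preference density around o_pref, up to an additive constant."""
    return 0.5 * ((o - o_pref) / sigma) ** 2

x = 3.0                                   # starting position
for _ in range(200):
    eps = 1e-3                            # finite-difference estimate of d(surprise)/dx
    grad = (surprise(observe(x + eps)) - surprise(observe(x - eps))) / (2 * eps)
    x -= 0.1 * grad                       # step against the surprise gradient

print(round(x, 2))                        # ends up near the nutrient peak at x = 5
```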
If the observation (vector) contains both integral and derivative terms, then the above expression describes a classic PID controller. For a mountain bike rider who uses visual and proprioceptive observations to guide their ride, an integral signal would be finding themselves off the trail. A derivative signal would be the expansion of the visual field when approaching a hill. Both types of observations are useful for policy inference on the trail.
Intelligent animals, most notably humans, can do more than follow the nutrient gradient (although that kind of behavior is not unheard of among humans either). They can imagine and execute a large repertoire of possible actions for getting to a large repertoire of possible setpoints.
“Organisms follow a unique imperative: minimizing the surprise of their sensory observations” cannot alone guide the behavior of a complex agent, for several reasons:
- Life in a matrix or under the influence of psychotropic substances would minimize surprise without supporting the integrity of the agent. An agent under influence has a generative model that doesn’t model the generative process, i.e., mental states will not correspond to real-world states in any meaningful sense (in a “truthful way”, see this post). The agent will therefore not make adaptive decisions. The “acid trip” scene from the movie Easy Rider is a perfect illustration.
- Looking into the future we have to consider that expected mental states and preferred mental states are not always the same. We may expect a project to fail while still preferring it to succeed and planning our actions so as to rescue it if possible. Using the prior probabilities as setpoints for an agent’s behavior may be fine for bacteria without ambitions but would only be useful to model humans in a static society who believe that the future must be like the past.
- Observations are sometimes so uncertain that there is no “surprise gradient” to follow to reduce surprise. Finding your bike in the basement of an apartment building is hard if it is pitch black. To find your bike where you expected to find it (the low-surprise observation), you first need to switch on the lights, a kind of a “helper action” not specific to bike search. Only when there is light can you trust your observations to find your bike among a dozen other bikes. This indicates that the loss function must also include an “epistemic” component; the observations need to be informative for the minimization of surprise to be a useful objective.
- At least (some) human agents have morals. Deontic moral rules such as “you shall not lie” will limit the repertoire of policies that the agent can choose from. A term assigning a high loss to “forbidden” policies needs to be added to the total loss function for it to explain many types of human behavior.
Policy selection is also limited by the physical capabilities of the agent, the available time, the available energy, and many other aspects that may have to be weighed in for a realistic model.
Into the lake
One evening I decided to go ice skating on the local lake. The ice was thick and smooth, perfect for long distance skating, a popular activity in my native Sweden. The moon was shining and I picked up a good speed with a bit of tailwind.
I knew I needed to avoid the mouth of the river flowing into the lake because the ice is thin where the water flows. It turned out that the thin ice extended much further from the river mouth into the lake than I thought and suddenly I found myself in a hole in the ice, swimming in sorbet. It took a second or two for my brain to accept the mental state corresponding to the real-world state of swimming in freezing water. My first reaction was: this is not happening. My generative model assigned a very low prior probability to this state! Then my brain started to minimize free energy (while spending some chemical energy to keep up my body temperature).
There were several distinct observations pointing at me actually being in water: I was obviously wet, I started to feel the cold, I felt pain in my sartorius muscle which had hit the edge of the ice on the other side of the hole at high speed, and I was definitely swimming. No further information seeking was necessary to establish high-contrast observations. Whirr, click! After a second or two of free energy minimization I had a posterior mental state distribution peaking sharply for the mental state “in a hole in the ice”. The remaining surprise was, after the variational inference was done and the mental state was established, still high because of the low prior probability of the state “being in a hole in the ice”.
My desired mental state was of course not to be swimming with my skates on in a freezing lake but to be back on the ice and then in a warm sauna. I started to evaluate possible policies…
Concepts
When going from inferring what is to inferring actions that will lead to what ought to be, we need to expand the generative model and add yet another model, the preference model.
- Instead of instantaneous observations and states, we will consider sequences of future observations, \(o_{1:T}\), and associated future mental states, \(s_{1:T}\). For convenience and brevity we will skip the subscripts in most equations and use the notations \(o\) and \(s\) to also represent sequences.
- We add the concept of a policy, \(\pi\), which is a sequence of actions, \([\pi_0, \pi_1, \ldots, \pi_{T-1}] = \pi_{0:T-1}\), to the generative model, which then becomes \(p(s, o, \pi)\). Each policy is assumed to lead to a sequence of observations \(o_{1:T}\) and associated mental states \(s_{1:T}\), as given by the generative model. Depending on the complexity of the agent and perhaps its sensing range [7], a policy may be anything from moving with the nutrient gradient to a plan for a professional career.
- An agent’s preferences over observations, mental states, and policies are expressed as the preference distribution \(\tilde p(s, o, \pi)\). In primitive biological agents the preference distribution represents the objectives that promote survival and procreation, the “primordial imperatives”, and can be equal to the prior distribution. We humans are, sometimes to our detriment, more complex; we for instance hold many preferences that are not all that adaptive. The preference model is usually simplified to \(\tilde p(s) \tilde p(o) \tilde p(\pi)\). (A small code sketch of these ingredients follows right after this list.)
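To ground these concepts, here is a minimal discrete-state setup in the spirit of the tutorial literature [2]; all array names, sizes, and numbers are my own illustrative choices, not prescribed by the text:

```python
import numpy as np

n_states, n_obs, n_actions, horizon = 4, 4, 2, 3

# Generative model p(s, o, pi):
A = 0.05 + 0.8 * np.eye(n_obs, n_states)        # likelihood p(o | s)
A /= A.sum(axis=0, keepdims=True)               # columns sum to one
B = np.stack([np.roll(np.eye(n_states), k, axis=0)
              for k in range(n_actions)])       # transitions p(s_t | s_{t-1}, action)
s_prior = np.array([0.85, 0.05, 0.05, 0.05])    # prior over initial states p(s_1)

# A policy is a sequence of actions over the planning horizon:
policies = [tuple(p) for p in np.ndindex(*(n_actions,) * horizon)]

# Preference model, simplified to independent factors p~(o), p~(s), p~(pi):
o_pref = np.array([0.7, 0.1, 0.1, 0.1])         # preferred observations p~(o)
s_pref = A.T @ o_pref
s_pref /= s_pref.sum()                          # one way to induce p~(s) from p~(o)
pi_pref = np.ones(len(policies)) / len(policies)  # uniform: no deontic constraints here

print(len(policies), "candidate policies over a horizon of", horizon, "steps")
```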
Policy inference overview
Policy inference according to AIF entails two phases.
First, the agent simulates the likely futures using their “built-in” generative model \(p(s, o, \pi)\) to infer a distribution of future states for each candidate policy, \(q(s \mid \pi)\). It will find answers to questions like “which city will I end up in if I turn left at this intersection?”. We call this phase simulation.
Second, once the agent has simulated a number of policies and inferred the distribution of states each policy will lead to, they estimate the loss incurred by each policy according to a loss function \(\mathcal G\). Based on the loss of each policy, a probability is assigned to the policy and a probability distribution over policies, \(q(\pi)\), is inferred. A rational agent will then choose and execute one of the high-probability policies from that distribution.
In practice the agent will often execute the first one or a few steps of the policy and then do a new policy inference to adjust the policy. Planning also has several levels and time horizons. Homeostasis is a subconscious level of planning aimed at keeping the body’s vital parameters within optimal ranges. Going down the stairs to the kitchen to brew coffee requires some conscious short-term planning (and may lead to a catastrophic surprise if it turns out that you are out of coffee). A career plan is a very long term plan, often requiring adjustment along its execution.
Simulation using Bayesian dead reckoning
The simulation uses the agent’s generative model \(p(s_{1:T}, o_{1:T}, \pi)\) to predict the distribution of future states that each policy leads to. Usually the state transitions are assumed to be a controlled Markov process, meaning that all information required to infer the state \(s_{t+1}\) is contained in the current state \(s_t\) and the policy \(\pi\) (see footnote 1). With that restriction, the generative model becomes:
$$p(s_{1:T}, o_{1:T} \mid \pi) = p(s_1) \prod_{t=2}^T p(s_t \mid s_ {t-1}, \pi_t) \prod_{t=1}^T p(o_t \mid s_t) \ \ \ \ (1)$$
The sequence of states is thus assumed to unfold according to \(p(s_{t+1} \mid s_t, \pi_t)\), starting from a known prior state \(p(s_1)\). The prior distribution is the only information we can reasonably expect to have about the future. The observations can be derived using the likelihood \(p(o \mid s)\), but they are a consequence of the progression of states in the simulation, not a cause, so they don’t add any information to the progression of state distributions (see footnote 2).
Marginalizing \(p(s_{1:T}, o_{1:T} \mid \pi)\) over observations we get:
$$p(s_{1:T} \mid \pi) = p(s_1) \prod_{t = 2}^T p(s_t \mid s_{t-1}, \pi_t) \ \ \ \ (2)$$
The distribution of states resulting from the unfolding of the expression above is usually referred to as \(q(s_{1:T} \mid \pi)\), but since it is fully based on the generative model it is identical to \(p(s_{1:T} \mid \pi)\), so
$$q(s_{1:T} \mid \pi) = p(s_{1:T} \mid \pi)$$
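A sketch of this dead-reckoning step (my own code; the state and action labels are invented): under the Markov assumption of equation \((2)\), the predicted state marginals for a policy are obtained by repeatedly applying the transition matrices to the prior.

```python
import numpy as np

def predict_states(s_prior, B, policy):
    """Propagate q(s_t | pi) forward, one transition per planned action, as in equation (2)."""
    q_s = [np.asarray(s_prior, dtype=float)]
    for action in policy:
        q_s.append(B[action] @ q_s[-1])
    return np.stack(q_s[1:])            # one row of state marginals per future time step

# Tiny demo: two states, action 0 keeps the state, action 1 flips it.
B = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
s_prior = np.array([0.9, 0.1])
print(predict_states(s_prior, B, policy=(1, 1, 0)))
```

Because the observations add no information to the progression of states (see footnote 2), only the transition model is needed here.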
The loss function
The loss function for policy inference presented in the literature is called expected free energy (EFE). It is tempting to try to derive it from, or find some direct relationship with, variational free energy (VFE); the term free energy also suggests a deep association between the two quantities. The association turns out to be largely terminological rather than mathematical, as we will show below.
The standard derivation of EFE goes something like this:
A viable agent’s generative model is assumed to model the world (the generative process that generates observations) well enough to support utilitarian decision making. If the model leaves a large surprise, \(- \log p(o)\), when variational free energy has been minimized, then the observation is said to be unexpected relative to the agent’s current generative model. An unexpected observation should trigger a policy (a set of actions) to bring the observation back to expected territory. A sensory reading of our body temperature of 37.9 degrees Celsius would for instance be unexpected and a signal for the hypothalamus to trigger a policy involving sweating and widening of blood vessels. This is what equation \((0)\) above describes, and it is a reasonable model for continuous homeostatic regulation.
For allostasis, which involves the prediction and mitigation of future unexpected observations, a more complex algorithm is needed. To mitigate unexpected observations in the future by choosing a preventive policy \(\pi\), a baseline algorithm could be to estimate the expected variational free energy (EVFE) that each policy would lead to and then pick the one that minimizes it, thereby minimizing surprise. This means that instead of minimizing the variational free energy:
$$\mathcal F[q; o] =\mathbb E_{q(s)}[\log q(s) – \log p(s, o)]$$
we would want to minimize its expected value over all possible observations encountered when executing policy \(\pi\):
$$\mathcal G[q; \pi] = \mathbb E_{q(o \mid \pi)}[\mathcal F[q; o, \pi]] = \mathbb E_{q(s, o \mid \pi)}[\log q(s \mid \pi) – \log p(s, o \mid \pi)]$$
As explained above, \(q(s \mid \pi)\) is derived from equation \((2)\). With the above assumptions, all terms of \(\mathcal G[q; \pi]\), including \(q(s \mid \pi)\), are determined by the generative model and the assumed inference scheme for a given policy, meaning that the EVFE can be calculated in closed form for each policy. (We can therefore drop the functional notation and denote the expected free energy simply \(\mathcal G(\pi)\).) Having calculated \(\mathcal G(\pi)\) for all policies \(\pi\), the policy associated with the lowest \(\mathcal G(\pi)\) can be chosen.
The above algorithm works for an agent in a stable environment where nothing novel ever happens, like for a bacterium in a nutrient solution. The only action it needs to decide on is in which direction to move to maximize nutrient concentration; to follow the negative of the surprise gradient.
For human agents, what is expected is not always what is preferred. The ought is not the same as the is, as David Hume famously pointed out.
In the AIF literature this is often addressed by defining an additional distribution of preferred observations, \(\tilde p(o)\), and replacing the outcome prior inside the scoring term with it, like this [2]:
$$\mathcal G(\pi) = \mathbb E_{q(o, s \mid \pi)}[\log q(s \mid \pi) – \log p(o, s \mid \pi)] \approx$$
$$\mathbb E_{q(o, s \mid \pi)}[\log q(s \mid \pi) – \log p(s \mid o, \pi) – \log \tilde p(o)]$$
The last expression is called expected free energy (EFE). The term expected free energy is historically motivated but technically misleading: after the introduction of a preference distribution, the objective is no longer the expectation of the variational free energy of the agent’s generative model. It is better understood as a free-energy-inspired decision functional.
The \(\approx\) sign is therefore misleading: the step is not an approximation in the statistical sense but a substitution of a different distribution. The implications are:
- \(\mathcal G(\pi)\) is no longer the expected free energy of the generative model \(p(s,o)\); it becomes a scoring function that evaluates model-predicted outcomes using an external preference distribution.
- Nothing guarantees that \(\tilde p(o)\) resembles \(p(o)\). An agent may fully expect an outcome while strongly preferring the opposite (e.g., expecting a project to fail while wishing it to succeed).
- The epistemic term that appears in the final EFE expression is inherited from the EVFE decomposition, but its retention after the substitution is a design decision, not a mathematical necessity.
EFE is a Bayesian formulation of the well-known exploitation–exploration tradeoff from machine learning and decision theory. The “derivation” of EFE from EVFE does not add fundamental insights, but it has value in offering a unified probabilistic notation for perception and policy inference.
The “root” loss function for policy inference used in the derivations in the rest of this post is taken to be a general form of the function \(\mathcal G\) above (see footnote 3):
$$\mathcal J[q] = – \mathbb E_{q(o)} [D_{KL}[q(s, \pi \mid o) \mid \mid q(s, \pi)] + \log \tilde p(o)] \ \ \ \ (3)$$
The first term encourages the search for informative observations by maximizing the difference (divergence) between \(q(s, \pi \mid o)\) and \(q(s, \pi)\). We want the observations to add information to the prior distribution. The second term is surprise in relation to the preference distribution. It favors observations that are preferred under the preference distribution.
Dissecting the loss function
Starting with the second term of \(\mathcal J[q]\): for any probability distribution \(\tilde p(s, o, \pi)\), it is true that:
$$\tilde p(o) = \sum_{s, \pi} \tilde p(s, o, \pi) \Rightarrow$$
$$- \log \tilde p(o) = – \log \sum_{s, \pi} \tilde p(s, o, \pi) = – \log \mathbb E_{q(s, \pi \mid o)} \left[\frac{\tilde p(s, o, \pi)}{q(s, \pi \mid o)}\right] \ \ \ \ (4)$$
The rationale for using exactly \(q(s, \pi \mid o)\), instead of, say, \(q(s, \pi)\), will become apparent further down.
The equality requires that \(q(s, \pi \mid o) \gt 0\) wherever \(\tilde p(s, o, \pi) \gt 0\).
Jensen’s inequality gives:
$$- \log \tilde p(o) = – \log \mathbb E_{q(s, \pi \mid o)} \left[\frac{\tilde p(s, o, \pi)}{q(s, \pi \mid o)}\right] \leq – \mathbb E_{q(s, \pi \mid o)} \left[\log \frac{\tilde p(s, o, \pi)}{q(s, \pi \mid o)}\right] \Rightarrow$$
$$\mathbb E_{q(o)} [- \log \tilde p(o)] \leq \mathbb E_{q(o)} \left[- \mathbb E_{q(s, \pi \mid o)} \left[\log \frac{\tilde p(s, o, \pi)}{q(s, \pi \mid o)}\right]\right] = – \mathbb E_{q(s, o, \pi)} \left[\log \frac{\tilde p(s, o, \pi)}{q(s, \pi \mid o)}\right] =$$
$$\mathbb{E}_{q(s, o, \pi)} [\log q(s, \pi \mid o) – \log \tilde p(s, o, \pi)]\ \ \ \ (5)$$
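As a quick sanity check (my own, not part of the original argument), the bound \((5)\) can be verified numerically for random discrete distributions; the code below draws arbitrary \(q(s, o, \pi)\) and \(\tilde p(s, o, \pi)\) and compares the two sides:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dist(shape):
    x = rng.random(shape)
    return x / x.sum()

n_s, n_o, n_pi = 3, 4, 2
p_tilde = random_dist((n_s, n_o, n_pi))        # preference distribution p~(s, o, pi)
q_joint = random_dist((n_s, n_o, n_pi))        # any joint q(s, o, pi)

q_o = q_joint.sum(axis=(0, 2))                 # q(o)
q_cond = q_joint / q_o[None, :, None]          # q(s, pi | o)
p_tilde_o = p_tilde.sum(axis=(0, 2))           # p~(o)

lhs = np.sum(q_o * (-np.log(p_tilde_o)))                    # E_q(o)[-log p~(o)]
rhs = np.sum(q_joint * (np.log(q_cond) - np.log(p_tilde)))  # right-hand side of (5)
print(lhs, rhs, lhs <= rhs)                    # the bound holds
```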
The first term of \(\mathcal J[q]\), the information-gain term, decomposes like this:
$$- \mathbb E_{q(o)} [D_{KL}[q(s, \pi \mid o) \mid \mid q(s, \pi)]] =$$
$$- \mathbb E_{q(o)} [\mathbb E_{q(s, \pi \mid o)} [\log q(s, \pi \mid o) – \log q(s, \pi)]] =$$
$$\mathbb E_{q(s, o, \pi)} [\log q(s, \pi) – \log q(s, \pi \mid o)] \ \ \ \ (6)$$
Putting \((5)\) and \((6)\) together gives us an upper bound of \(\mathcal J[q]\):
$$\mathcal G[q] = \mathbb{E}_{q(s, o, \pi)} [\log q(s, \pi \mid o) – \log \tilde p(s, o, \pi)] + \mathbb{E}_{q(s, o, \pi)} [\log q(s, \pi) – \log q(s, \pi \mid o)] =$$
$$\mathbb{E}_{q(s, o, \pi)} [\log q(s, \pi) – \log \tilde p(s, o, \pi)] =$$
$$\mathbb{E}_{q(\pi)}[\mathbb{E}_{q(s, o \mid \pi)}[\log q(s \mid \pi) + \log q(\pi) – \log \tilde p(s, o \mid \pi) – \log \tilde p(\pi)]] =$$
$$\mathbb{E}_{q(\pi)}[ \log q(\pi) – \log \tilde p(\pi) + \mathbb{E}_{q(s, o \mid \pi)}[\log q(s \mid \pi) – \log \tilde p(s, o \mid \pi)] ]=$$
$$\mathbb{E}_{q(\pi)} [\log q(\pi) – \log \tilde p(\pi) + \mathcal G[q; \pi]] \ \ \ \ (7)$$
where
$$\mathcal G[q; \pi] = \mathbb{E}_{q(s, o \mid \pi)}[\log q(s \mid \pi) – \log \tilde p(s, o \mid \pi)] \ \ \ \ (8)$$
\(\tilde p(\pi)\) encodes our deontic policy preferences such as “avoid lying”.
Finding \(q(\pi)\)
The notation \(\mathcal G[q; \pi]\) indicates that \(\mathcal G\) is a functional of the distribution \(q(s, o \mid \pi)\). As shown above, this distribution is given by the “simulation” and is therefore fixed when evaluating \(\mathcal G\). \(\mathcal G\) thus becomes a function of \(\pi\) and is no longer a functional. We therefore use the notation \(\mathcal G(\pi)\) in the derivations below.
In perceptual inference we use variational free energy to find an approximation of the probability distribution of mental states \(q(s)\) for a given observation (see above). In policy inference we are instead seeking the probability distribution over policies \(q(\pi)\) that minimizes the expected free energy:
$$ q(\pi) = \underset {q(\pi)}{\arg \min}\left[\mathcal G[q]\right]$$
We here assume that we have a finite set of policies \(\{\pi_i\}\), each with a probability \(q(\pi_i) = q_i\). This implies that we want to minimize \(\mathcal G[q]\) with respect to \(q(\pi)\) subject to \(q(\pi) \geq 0\) and \(\sum_{\pi} q(\pi) = 1\) (the probabilities must sum up to one).
Our policy preferences \(\tilde p(\pi)\) can likewise be expressed as a vector with components \(\tilde p_i\) that sum to one.
We use Lagrangian optimization to find \(q(\pi)\). We define the Lagrangian:
$$\mathcal L [q] = \sum_{\pi} q(\pi) (\log q(\pi) – \log \tilde p(\pi) + \mathcal G(\pi)) + \lambda \left(\sum_{\pi} q(\pi) – 1 \right) =$$
$$\sum_{i} q_i (\log q_i – \log \tilde p_i + \mathcal G(\pi_i)) + \lambda \left(\sum_{i} q_i – 1 \right)$$
At the minimum the gradient of the Lagrangian is zero meaning that all partial derivatives with respect to \(q_i\) are zero:
$$\frac {\partial {\mathcal L(q_i)}}{\partial q_i} = \log q_i + 1 – \log {\tilde p_i} + \mathcal G(\pi_i) + \lambda = 0 \Rightarrow$$
$$\log q_i = \log \tilde {p_i} – \mathcal G(\pi_i) – (1 + \lambda) \Rightarrow$$
$$q_i = \tilde {p_i} e^{-\mathcal G(\pi_i)} e^{- (1 + \lambda)}$$
We have that \(\sum_{i} q_i = 1\), which determines the last unknown, \(\lambda\). We satisfy this condition by choosing a \(\lambda\) such that:
$$e^{1 + \lambda} = \sum_i \tilde{p_i} e^{-\mathcal G(\pi_i)} \Rightarrow$$
$$q_i = \text{softmax}(\log \tilde {p_i} – \mathcal G(\pi_i)) \ \ \ \ (9)$$
To ensure that we have a minimum we can evaluate the second derivatives:
$$\frac {\partial^2 {\mathcal L(q_i)}}{\partial {q_i}^2} = \frac{1}{ q_i} \gt 0$$
Note that \(\mathcal G(\pi_i)\) does not vary with the probability of \(\pi\) so it is a constant with respect to \(q_i\).
Having found \(q(\pi)\), the agent typically selects and executes the policy with the highest probability.
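Equation \((9)\) in code (a small sketch; the \(\mathcal G\) values and the policy preferences below are placeholders of my own):

```python
import numpy as np

def policy_posterior(G, log_pi_pref):
    """q(pi) = softmax(log p~(pi) - G(pi)), equation (9)."""
    logits = log_pi_pref - G
    logits = logits - logits.max()        # subtract the max for numerical stability
    q = np.exp(logits)
    return q / q.sum()

G = np.array([2.3, 1.1, 0.9])             # made-up losses for three candidate policies
pi_pref = np.array([0.45, 0.45, 0.10])    # mild deontic preference against the third policy

q_pi = policy_posterior(G, np.log(pi_pref))
print(q_pi, "-> execute policy", q_pi.argmax())
```

Note how the deontic preference can shift the choice away from the policy with the lowest \(\mathcal G\).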
Alternative expressions for \(\mathcal G(\pi)\)
\(\mathcal G(\pi)\) is, together with our policy preferences \(\tilde p(\pi)\), used to “score” the policies as shown above; the smaller the \(\mathcal G(\pi)\), the more attractive the policy. Below we show a few standard formulations of \(\mathcal G(\pi)\). They all start from \((8)\):
$$\mathcal G(\pi) = \mathbb E_{q(o, s \mid \pi)}[\log q(s \mid \pi) – \log \tilde p(o, s \mid \pi)]$$
The derivations rely on the following identities:
- \(\tilde p(o \mid \pi) = \tilde p(o)\), since the agent’s preferences (in terms of observations) are not assumed to depend on the chosen policy.
- \(q(o \mid s, \pi) = p(o \mid s)\). The likelihood describes the agent’s sensory apparatus and is therefore not conditional on \(\pi\).
- \(\tilde p(s, o \mid \pi) = p(s \mid o, \pi) \tilde p(o) = p(o \mid s) \tilde p(s)\). These are modeling choices; we don’t want the generative model itself to carry the preferences.
- \(q(s, o \mid \pi) = q(o \mid s, \pi) q(s \mid \pi) = p(o \mid s) q(s \mid \pi) = q(s \mid o, \pi) q(o \mid \pi)\).
- \(q(\cdot) = p(\cdot)\) in general. \(q\) is any distribution derived from the forward-looking simulation based on the generative model \(p(s, o, \pi)\).
Information gain – pragmatic value
The first formulation is isomorphic with the loss function \((3)\), but conditioned on the policy:
$$\mathcal G(\pi) = \mathbb E_{q(o, s \mid \pi)}[\log q(s \mid \pi) – \log \tilde p(o, s \mid \pi)]=$$
$$\mathbb E_{q(o, s \mid \pi)}[\log q(s \mid \pi) – \log p(s \mid o, \pi) – \log \tilde p(o)]=$$
$$\mathbb E_{q(o \mid \pi)}[\mathbb E_{q(s \mid o, \pi)}[\log q(s \mid \pi) – \log p(s \mid o, \pi) – \log \tilde p(o)]]=$$
$$- \mathbb E_{q(o \mid \pi)}[D_{KL}[q(s \mid o, \pi) \mid \mid q(s \mid \pi)] + \log \tilde p(o)]$$
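In code, for a single time step (my own sketch with made-up numbers; `A` denotes the likelihood \(p(o \mid s)\), `q_s` the predicted states \(q(s \mid \pi)\), and `o_pref` the preference \(\tilde p(o)\)):

```python
import numpy as np

def efe_info_gain_form(q_s, A, o_pref, eps=1e-16):
    """G(pi) = -E_q(o|pi)[ KL[q(s|o,pi) || q(s|pi)] + log p~(o) ], single time step."""
    q_o = A @ q_s                                   # predicted observations q(o | pi)
    post = A * q_s[None, :]                         # unnormalized q(s | o, pi), one row per o
    post = post / (post.sum(axis=1, keepdims=True) + eps)
    info_gain = np.sum(q_o * np.sum(post * (np.log(post + eps) - np.log(q_s + eps)), axis=1))
    pragmatic = np.sum(q_o * np.log(o_pref + eps))
    return -(info_gain + pragmatic)

# Tiny demo with made-up numbers.
A = np.array([[0.9, 0.2], [0.1, 0.8]])   # p(o | s), columns sum to one
q_s = np.array([0.5, 0.5])               # predicted states under some policy
o_pref = np.array([0.8, 0.2])            # preferred observations
print(efe_info_gain_form(q_s, A, o_pref))
```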
Risk over states – ambiguity
$$\mathcal G(\pi) = \mathbb E_{q(o, s \mid \pi)}[\log q(s \mid \pi) – \log \tilde p(o, s \mid \pi)]=$$
$$\mathbb E_{q(o, s \mid \pi)}[\log q(s \mid \pi) – \log p(o \mid s) – \log \tilde p(s)]=$$
$$\mathbb E_{q(s \mid \pi)}[\mathbb E_{p(o \mid s)}[\log q(s \mid \pi) – \log p(o \mid s) – \log \tilde p(s)]]=$$
$$D_{KL}[q(s \mid \pi) \mid \mid \tilde p(s)] + \mathbb E_{q(s \mid \pi)}[\mathbb H[p(o \mid s)]]$$
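The risk-over-states form for a single time step (again my own sketch; `s_pref` stands for \(\tilde p(s)\)):

```python
import numpy as np

def efe_state_risk_form(q_s, A, s_pref, eps=1e-16):
    """G(pi) = KL[q(s|pi) || p~(s)] + E_q(s|pi)[ H[p(o|s)] ], single time step."""
    risk = np.sum(q_s * (np.log(q_s + eps) - np.log(s_pref + eps)))
    ambiguity = np.sum(q_s * (-np.sum(A * np.log(A + eps), axis=0)))   # entropy of each likelihood column
    return risk + ambiguity

A = np.array([[0.9, 0.2], [0.1, 0.8]])
q_s = np.array([0.5, 0.5])
s_pref = np.array([0.8, 0.2])
print(efe_state_risk_form(q_s, A, s_pref))
```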
Risk over observations – ambiguity
$$\mathcal G(\pi) = \mathbb E_{q(o, s \mid \pi)}[\log q(s \mid \pi) – \log \tilde p(o, s \mid \pi)]=$$
$$\mathbb E_{q(o, s \mid \pi)}[\log q(s \mid \pi) – \log p(s \mid o, \pi) – \log \tilde p(o)]=$$
$$\mathbb E_{q(o, s \mid \pi)}[\log q(s \mid o, \pi) + \log q(o \mid \pi) – \log p(o \mid s) – \log p(s \mid o, \pi) – \log \tilde p(o)]=$$
$$\mathbb E_{q(o \mid \pi)}[\log q(o \mid \pi) – \log \tilde p(o)] + \mathbb E_{q(o \mid \pi)}[\mathbb E_{q(s \mid o, \pi)}[\log q(s \mid o, \pi) – \log p(s \mid o, \pi)]] – \mathbb E_{q(s \mid \pi)}[\mathbb E_{p(o \mid s)}[\log p(o \mid s)]] =$$
$$D_{KL}[q(o \mid \pi) \mid \mid \tilde p(o)] + \mathbb E_{q(o \mid \pi)}[D_{KL}[q(s \mid o, \pi) \mid \mid p(s \mid o, \pi)]] + \mathbb E_{q(s \mid \pi)} \mathbb H[p(o \mid s)] =$$
$$D_{KL}[q(o \mid \pi) \mid \mid \tilde p(o)] + \mathbb E_{q(s \mid \pi)} \mathbb H[p(o \mid s)]$$
We assume \(q(s \mid o, \pi) = p(s \mid o, \pi)\), which implies that the middle divergence is zero.
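And the risk-over-observations form for a single time step (my own sketch). With the exact posterior assumed above, it evaluates to the same number as the information-gain form, as the derivation implies:

```python
import numpy as np

def efe_obs_risk_form(q_s, A, o_pref, eps=1e-16):
    """G(pi) = KL[q(o|pi) || p~(o)] + E_q(s|pi)[ H[p(o|s)] ], single time step."""
    q_o = A @ q_s
    risk = np.sum(q_o * (np.log(q_o + eps) - np.log(o_pref + eps)))
    ambiguity = np.sum(q_s * (-np.sum(A * np.log(A + eps), axis=0)))
    return risk + ambiguity

A = np.array([[0.9, 0.2], [0.1, 0.8]])
q_s = np.array([0.5, 0.5])
o_pref = np.array([0.8, 0.2])
print(efe_obs_risk_form(q_s, A, o_pref))
```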
The last step
During planning, the future state distributions are inferred using equation \((2)\):
$$p(s_{1:T} \mid \pi) = p(s_1) \prod_{t = 2}^T p(s_t \mid s_{t-1}, \pi_t)$$
This means that, if we additionally assume that the predicted state distribution factorizes across time steps (a mean-field approximation over the sequence),
$$q(s_{1:T} \mid \pi) \approx \prod_{t=1}^{T} q(s_t \mid \pi)$$
where each marginal \(q(s_t \mid \pi)\) is obtained by propagating \(p(s_1)\) through the transition model \(p(s_t \mid s_{t-1}, \pi_t)\). The loss then decomposes over time:
$$\mathcal G(\pi) = \mathbb{E}_{q(s, o \mid \pi)} \left[\log q(s \mid \pi) – \log \tilde p(o, s \mid \pi)\right] =$$
$$\mathbb{E}_{q(s, o \mid \pi)} \left[\sum_{t=1}^T \left(\log q(s_t \mid \pi) – \log \tilde p(o_t, s_t \mid \pi)\right) \right] =$$
$$\sum_{t=1}^T \mathbb{E}_{q(s, o \mid \pi)} \left[\log q(s_t \mid \pi) – \log \tilde p(o_t, s_t \mid \pi) \right] = $$
$$\sum_{t=1}^T \mathcal G(\pi, t)$$
Where
$$\mathcal G(\pi, t) = \mathbb{E}_{q(s_t, o_t \mid \pi)} \left[\log q(s_t \mid \pi) – \log \tilde p(o_t, s_t)\right]$$
We can thus find \(\mathcal G(\pi)\) by calculating \(\mathcal G(\pi, t)\) for each time step in the planning horizon and taking the sum of all values.
The decomposition of the EFE into a sum hints at the possibility of giving the states in the sequence different weights. We may for instance put a very high value on the end state and therefore discount the intermediate states. One example would be undergoing surgery: the end state is highly desirable but the intermediate states are not, so they need to be suppressed in the calculations to give the policy involving the surgery a high probability. Another possibility would be to discount states further in the future more than immediate states, because future states are uncertain and can be influenced by adjusting the policy during execution.
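Putting the pieces together, here is my own sketch that sums \(\mathcal G(\pi, t)\) over the planning horizon using the risk-over-observations form per time step; the optional `weights` argument is one way to implement the kind of per-step weighting or discounting discussed above, and all numbers are made up:

```python
import numpy as np

def efe_over_horizon(s_prior, B, A, o_pref, policy, weights=None, eps=1e-16):
    """Sum G(pi, t) over the planning horizon, using the risk-over-observations form
    for each time step; `weights` optionally discounts or emphasizes individual steps."""
    if weights is None:
        weights = [1.0] * len(policy)
    q_s, total = np.asarray(s_prior, dtype=float), 0.0
    for action, w in zip(policy, weights):
        q_s = B[action] @ q_s                     # q(s_t | pi), equation (2)
        q_o = A @ q_s                             # q(o_t | pi)
        risk = np.sum(q_o * (np.log(q_o + eps) - np.log(o_pref + eps)))
        ambiguity = np.sum(q_s * (-np.sum(A * np.log(A + eps), axis=0)))
        total += w * (risk + ambiguity)           # w * G(pi, t)
    return total

# Tiny demo: two states, action 0 keeps the state, action 1 flips it,
# and the agent prefers observation 0.
A = np.array([[0.9, 0.2], [0.1, 0.8]])
B = np.array([[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]])
s_prior, o_pref = np.array([0.1, 0.9]), np.array([0.95, 0.05])
for policy in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(policy, round(efe_over_horizon(s_prior, B, A, o_pref, policy), 3))
```

In this toy setup, the policy that flips into the preferred state and then stays there gets the lowest score, as one would hope.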
Conclusions
Behavior is not governed by the minimization of surprise or by the minimization of variational free energy itself, but by a decision functional that is inspired by free-energy principles and motivated by empirical considerations of goal-directedness and information seeking.
Out of the lake
After a few seconds in the zero-degree water, my brain started to dry-run (no pun intended) a few policies to determine which one would get me to my desired mental states with a high probability. The first desired mental state in that sequence, and pretty much the limit of my planning horizon at the time, was of course to get out of the water and back onto the ice. The policy I chose (among not too many alternatives) involved using my ice prods to pull myself out of the water and calling my wife to come and pick me up.
Always bring ice prods!
Links
[1] Thomas Parr, Giovanni Pezzulo, Karl J. Friston. Active Inference. MIT Press Direct.
[2] Ryan Smith, Karl J. Friston, Christopher J. Whyte. A step-by-step tutorial on active inference and its application to empirical data. Journal of Mathematical Psychology. Volume 107. 2022.
[3] Beren Millidge, Alexander Tschantz, Christopher L. Buckley. Whence the Expected Free Energy. Neural Computation 33, 447–482 (2021).
[4] Stephen Francis Mann, Ross Pain, Michael D. Kirchhoff. Free energy: a user’s guide. Biology & Philosophy (2022) 37: 33.
[5] Carol Tavris on mistakes, justification, and cognitive dissonance. Sean Carroll’s Mindscape: science, society, philosophy, culture, arts, and ideas. Podcast (Spotify link).
[6] The principle of least action. Feynman Lectures.
[7] Sean Carroll’s Mindscape podcast. Episode 39: Malcolm MacIver on Sensing, Consciousness, and Imagination.
[8] Caucheteux, C., Gramfort, A. & King, JR. Evidence of a predictive coding hierarchy in the human brain listening to speech. Nat Hum Behav 7, 430–441 (2023). https://doi.org/10.1038/s41562-022-01516-2
[9] Beren Millidge, Anil Seth, Christopher L Buckley. Predictive Coding: A Theoretical and Experimental Review.
[10] Théophile Champion, Howard Bowman, Dimitrije Marković, Marek Grześ. Reframing the Expected Free Energy: Four Formulations and a Unification. arXiv:2402.14460
[11] Joseph Marino. Predictive Coding, Variational Autoencoders, and Biological Connections. Neural Computation (2022) 34 (1): 1–44.
[12] Tschantz A, Millidge B, Seth AK, Buckley CL. Hybrid predictive coding: Inferring, fast and slow. PLoS Comput Biol. 2023 Aug 2.
Footnotes
1. Sometimes the term partially observable Markov decision process (POMDP) is used when describing the distribution for the simulation of future states. Partial observability pertains to the observation model and perceptual inference, not to the transition process itself, so I believe it is confusing to introduce the term when describing the prediction model. ↩︎
2. We could consider running the simulation with observations derived from states and doing a full variational inference of \(q\). This would take into account the quality of the observations along the different trajectories of states and thus promote information seeking. But since we include an epistemic term in the subsequent evaluation of each policy, we would weigh in information quality twice if we also included it in the prediction of states. ↩︎
3. Other loss functions have been suggested in the literature, e.g., the “free energy of the expected future” [3]: \(\mathcal J[q] = D_{KL}[q(s, o, \pi) \mid \mid \tilde p(s, o)]\) ↩︎