Learning in the active inference framework

Learning by observing

We introduced variational free energy (VFE) in this post as:

$$\mathcal{F}(o) = \mathbb{E}_{q(s)}[\log q(s) - \log p(s, o)]$$

The agent infers the variational distribution \(q(s)\), its best estimate of the intractable posterior distribution \(p(s \mid o)\), by minimizing \(\mathcal{F}(o)\).
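This minimization can be checked numerically for a discrete toy model. A minimal sketch, assuming a hypothetical three-state model where the joint \(p(s, o)\) has been evaluated at the observed \(o\) (the numbers are illustrative only):

```python
import numpy as np

# Hypothetical joint p(s, o=o_obs) over 3 states; illustrative values.
# Summing over s gives the evidence p(o).
p_joint = np.array([0.05, 0.30, 0.15])

def vfe(q, p_joint):
    """F(o) = E_q[log q(s) - log p(s, o)]."""
    return np.sum(q * (np.log(q) - np.log(p_joint)))

# The exact posterior p(s|o) is the minimizer, and the minimum is
# the surprise -log p(o).
posterior = p_joint / p_joint.sum()
uniform = np.ones(3) / 3

print(vfe(uniform, p_joint))      # higher free energy
print(vfe(posterior, p_joint))    # minimum
print(-np.log(p_joint.sum()))     # surprise -log p(o)
```

Any other \(q(s)\) gives a strictly larger \(\mathcal{F}\), since \(\mathcal{F}(o) = D_{KL}[q(s) \parallel p(s \mid o)] - \log p(o)\).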

In this basic version of the VFE, the generative model \(p(s, o)\) was assumed to be fixed. In reality it evolves as the agent learns to better predict its environment. Assuming the generative model can be defined by a set of parameters \(\phi\), it becomes \(p(s, o, \phi)\).

The new recognition model, the variational posterior distribution, becomes \(q(s, \phi)\). As before, the recognition model is found by minimizing the augmented expression for VFE:

$$\mathcal{F}(o, \phi) = \mathbb{E}_{q(s, \phi)}[\log q(s, \phi) - \log p(s, o, \phi)]$$

The inference of the state \(s\) is a real-time process, whereas the evolution of \(\phi\) takes place over seconds to days, based on average surprise over many observations and state inferences. Adapting the generative model to each individual data point would make it unstable, in effect overfitting to the data.

Because of this timescale separation between fast perception and slow learning, we can justify a mean-field approximation, forcing the recognition model to treat \(s\) and \(\phi\) as independent, meaning that \(q(s, \phi) \approx q(s)q(\phi)\). We get:

$$\mathcal{F}(o, \phi) = \mathbb{E}_{q(s)q(\phi)}[\log q(s)q(\phi) - \log p(s, o, \phi)]$$
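The factorized free energy can be minimized by coordinate ascent, updating one factor while holding the other fixed. A minimal sketch, assuming a hypothetical model with 3 states and 2 parameter settings (the joint values are illustrative only):

```python
import numpy as np

# Hypothetical joint p(s, o=o_obs, phi): rows are states s, columns are phi.
p_joint = np.array([[0.02, 0.04],
                    [0.10, 0.20],
                    [0.05, 0.09]])

def mean_field_vfe(q_s, q_phi, p_joint):
    """F = E_{q(s)q(phi)}[log q(s)q(phi) - log p(s, o, phi)]."""
    q = np.outer(q_s, q_phi)          # factorized q(s, phi) = q(s) q(phi)
    return np.sum(q * (np.log(q) - np.log(p_joint)))

q_s = np.ones(3) / 3
q_phi = np.ones(2) / 2
print(mean_field_vfe(q_s, q_phi, p_joint))

# One standard mean-field update: q(s) ∝ exp(E_{q(phi)}[log p(s, o, phi)]).
# This is the optimal q(s) for the current q(phi), so F can only decrease.
q_s_new = np.exp(np.log(p_joint) @ q_phi)
q_s_new /= q_s_new.sum()
print(mean_field_vfe(q_s_new, q_phi, p_joint))
```

Alternating this update with the analogous one for \(q(\phi)\) is the usual variational-inference recipe for a factorized posterior.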

Learning by doing

In this post we introduced policies \(\pi\), sequences of actions that the agent selects and executes to minimize surprise over time. The generative model was augmented to \(p(s, o, \pi)\), and the expected free energy (EFE) \(\mathcal G\), used to infer the optimal policy, became:

$$\mathcal G(\pi) = \mathbb{E}_{q(s, o \mid \pi)}[\log q(s \mid \pi) - \log \tilde p(s, o \mid \pi)]$$

To augment the EFE so that it also recognizes learning opportunities, we include the parameters \(\phi\) in the generative model, so that \(p(s, o, \pi)\) becomes \(p(s, o, \pi, \phi)\). The EFE now becomes:

$$\mathcal G(\pi) = \mathbb{E}_{q(s, o, \phi \mid \pi)}[\log q(s, \phi \mid \pi) - \log \tilde p(s, o, \phi \mid \pi)] =$$

$$\mathbb{E}_{q(o \mid \pi)}[\mathbb{E}_{q(\phi \mid o, \pi)}[\mathbb{E}_{q(s \mid o, \phi, \pi)}[\log q(\phi \mid \pi) - \log q(\phi \mid o, \pi) + \log q(s \mid \phi, \pi) - \log p(s \mid o, \phi, \pi) - \log \tilde p(o)]]]$$
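The expansion above follows from factorizing both distributions inside the logarithms. A minimal sketch of the algebra, using the common EFE convention of approximating the generative marginals with the corresponding variational posteriors:

$$q(s, \phi \mid \pi) = q(s \mid \phi, \pi)\, q(\phi \mid \pi)$$

$$\tilde p(s, o, \phi \mid \pi) \approx p(s \mid o, \phi, \pi)\, q(\phi \mid o, \pi)\, \tilde p(o)$$

Taking logarithms and subtracting the second expression from the first yields exactly the five terms inside the innermost expectation.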

We assume the agent performs optimal Bayesian inference such that the variational posterior \(q(s \mid o, \pi, \phi)\) converges to the true posterior \(p(s \mid o, \pi, \phi)\) under the generative model. This allows us to treat the two as equivalent for the purpose of policy evaluation.

$$\mathcal G(\pi) = -\mathbb{E}_{q(o \mid \pi)}[D_{KL}[q(\phi \mid o, \pi) \parallel q(\phi \mid \pi)] + \mathbb{E}_{q(\phi \mid o, \pi)}[D_{KL}[q(s \mid o, \phi, \pi) \parallel q(s \mid \phi, \pi)]] + \log \tilde p(o)]$$

With the assumption that \(s\) and \(\phi\) are independent, as motivated above, and recognizing that \(\phi\) is independent of the chosen policy \(\pi\), we get¹:

$$\mathcal G(\pi) = -\mathbb{E}_{q(o \mid \pi)}[D_{KL}[q(\phi \mid o, \pi) \parallel q(\phi)] + D_{KL}[q(s \mid o, \pi) \parallel q(s \mid \pi)] + \log \tilde p(o)]$$

This is the split of the EFE into information gain and pragmatic value from this post, but with the added term \(D_{KL}[q(\phi \mid o, \pi) \parallel q(\phi)]\), which motivates the search for information that changes \(\phi\), i.e., updates the generative model.
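The added term can be computed explicitly for a toy model. A minimal sketch, assuming a hypothetical two-outcome likelihood \(p(o \mid \phi)\) and a flat prior over two parameter settings (the numbers are illustrative only):

```python
import numpy as np

def kl(p, q):
    """KL divergence D_KL[p || q] for discrete distributions."""
    return np.sum(p * (np.log(p) - np.log(q)))

# Illustrative likelihood p(o | phi): rows are phi, columns are outcomes o.
p_o_given_phi = np.array([[0.9, 0.1],
                          [0.2, 0.8]])
q_phi = np.array([0.5, 0.5])              # prior belief over parameters

# Predictive distribution over outcomes: q(o) = sum_phi q(phi) p(o|phi)
q_o = q_phi @ p_o_given_phi

# Posterior q(phi|o) for each possible outcome, by Bayes' rule
q_phi_given_o = (q_phi[:, None] * p_o_given_phi) / q_o[None, :]

# Expected parameter information gain: E_{q(o)}[KL[q(phi|o) || q(phi)]]
info_gain = sum(q_o[o] * kl(q_phi_given_o[:, o], q_phi) for o in range(2))
print(info_gain)   # positive: observations are expected to update q(phi)
```

The term is an expectation over outcomes the agent has not yet seen, which is why it scores *anticipated* learning rather than learning that has already happened.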

An interesting observation: if \(q(\phi)\) is very narrow, meaning the agent is very certain about its generative model, then the posterior belief \(q(\phi \mid o)\) will be \(\approx q(\phi)\), since no evidence can appreciably change a near-certain prior. In layman's terms, the agent is dogmatic and will not be motivated to select actions that entail learning.
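This dogmatism effect can be checked numerically. A minimal sketch, reusing the same hypothetical two-outcome likelihood and comparing a flat prior against a near-certain one (all numbers are illustrative only):

```python
import numpy as np

def kl(p, q):
    """KL divergence D_KL[p || q] for discrete distributions."""
    return np.sum(p * (np.log(p) - np.log(q)))

def expected_info_gain(q_phi, p_o_given_phi):
    """E_{q(o)}[KL[q(phi|o) || q(phi)]] under exact Bayesian updating."""
    q_o = q_phi @ p_o_given_phi
    q_phi_given_o = (q_phi[:, None] * p_o_given_phi) / q_o[None, :]
    return sum(q_o[o] * kl(q_phi_given_o[:, o], q_phi)
               for o in range(len(q_o)))

# Illustrative likelihood p(o | phi)
p_o_given_phi = np.array([[0.9, 0.1],
                          [0.2, 0.8]])

open_minded = expected_info_gain(np.array([0.5, 0.5]), p_o_given_phi)
dogmatic = expected_info_gain(np.array([0.999, 0.001]), p_o_given_phi)
print(open_minded, dogmatic)   # the near-certain prior barely moves
```

With the near-certain prior the expected information gain collapses toward zero, so epistemic actions lose their value in \(\mathcal G(\pi)\).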

Summary: the three pillars of the agent’s objective function

Because the entire right-hand side of the equation is negated, minimizing \(\mathcal G(\pi)\) compels the agent to maximize the sum of the terms inside the brackets:

The epistemic drive for model parameters (learning): \(D_{KL}[q(\phi \mid o, \pi) \parallel q(\phi)]\). This term mandates curiosity. It compels the system to seek out experiences that challenge its current ontological assumptions. It is the mathematical antidote to dogmatism, forcing the agent to refine its generative model \(\phi\) to better match an evolving reality.

The epistemic drive for state (perception): \(D_{KL}[q(s \mid o, \pi) \parallel q(s \mid \pi)]\). This term drives situational awareness. It compels the system to resolve immediate ambiguity by actively gathering data to figure out exactly what is happening in the present moment.

The pragmatic drive (survival): \(\log \tilde p(o)\). This term acts as the biological boundary condition. Because \(\log \tilde p(o)\) is negative (probabilities are less than 1), maximizing it means driving the probability of preferred outcomes as close to 1 as possible. This ensures that the pursuit of knowledge never supersedes the physical integrity of the agent.
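Putting the three pillars together, policy selection reduces to comparing \(\mathcal G(\pi)\) across candidate policies. A minimal sketch with made-up term values for two hypothetical policies; the softmax selection rule is a common convention in the active-inference literature, not something derived above:

```python
import numpy as np

# Hypothetical per-policy values (illustrative numbers only):
# [parameter info gain, state info gain, log-preference log p~(o)]
terms = {
    "explore": np.array([0.40, 0.30, -2.0]),   # learns a lot, risky outcome
    "exploit": np.array([0.01, 0.05, -0.1]),   # learns little, safe outcome
}

# G(pi) is the negated sum of the three bracketed terms
G = {pi: -v.sum() for pi, v in terms.items()}

# Soft policy selection: q(pi) = softmax(-G(pi))
g = np.array(list(G.values()))
q_pi = np.exp(-g) / np.exp(-g).sum()
print(G, dict(zip(G, q_pi)))
```

With these particular numbers the safe policy wins; raising the expected information gain of "explore" (or softening its preference penalty) would tip the balance toward learning.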

Footnotes

  1. Given that:
    $$q(s, \phi) = q(s) q(\phi)$$
    we get:
    $$q(s, \phi \mid o, \pi) = q(s \mid o, \pi) q(\phi \mid o, \pi) \Rightarrow$$
    $$q(s \mid \phi, o, \pi)q(\phi \mid o, \pi) = q(s \mid o, \pi) q(\phi \mid o, \pi) \Rightarrow$$
    $$q(s \mid \phi, o, \pi) = q(s \mid o, \pi)$$
    ↩︎
