Changing the world

Minimizing surprise

Active Inference is a normative framework to characterize Bayes-optimal behavior and cognition in living organisms. Its normative character is evinced in the idea that all facets of behavior and cognition in living organisms follow a unique imperative: minimizing the surprise of their sensory observations. Surprise has to be interpreted in a technical sense: it measures how much an agent’s current sensory observations differ from its preferred sensory observations—that is, those that preserve its integrity (e.g., for a fish, being in the water).

[1, p.6]

In an earlier post about perceptual inference we introduced the generative model \(p(o, s)\) used to infer mental state distributions from observations. We also hinted that the model is such that the more probable an observation is according to the model, the more attractive the observation is for the organism. An attractive observation is called an expected observation if it is an actual observation and a wanted observation 1 if it hasn’t yet happened but the organism wants it to happen.

The ultimate objective of the organism is to minimize surprise over time. Surprise measures the discrepancy between the organism’s expected observation and its actual observation. Surprise means that the organism is outside of its “comfort zone” of expected observations as defined by the generative model. Whether the expected or wanted states are also adaptive states depends on the quality of the generative model (in psychological terms, the mental health of the organism).

The organism wants to move through life with as few surprises as possible, basically taking the “path of least resistance”, where resistance is defined as surprise. It is useful, as a baseline approximation, to think of an organism’s life as an attempt to minimize the integral of surprise over its lifetime. This is similar to the way nature minimizes the integral of the difference between kinetic and potential energy over time, obeying the principle of least action [6].

Simple organisms that live in predictable environments usually only need to, or can, take one of a few reflex actions to get back to a wanted state. The only option a beached fish has is to try to flap and hope the flapping will take it back to the water. Bacteria follow the nutrient gradient. Simple organisms may not have the cognitive capacity to hold explicit, long-term wants, only expectations for the immediate future 2.

Intelligent animals, most notably (at least some) humans, can do more than flap or follow the nutrient gradient. They can imagine a large repertoire of possible actions for getting to a wanted state or wanted observations.

Surprise is quantified as the negative logarithm of the probability of an observation according to the generative model, \(-\log p(o)\). As explained in an earlier post, surprise cannot in the general case be calculated analytically. Instead the organism uses a quantity called variational free energy (VFE) as a proxy (upper bound) for surprise in perceptual inference and expected free energy (EFE) as a proxy for (future) surprise in action inference 3.

Variational free energy was introduced in an earlier post. We will dig deeper into both variational free energy and expected free energy in this post.

Free energy is thus, according to AIF, the loss function that the organism seeks to minimize at all times in order to stay within its homeostatic and allostatic states.
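
To make the quantity concrete, here is a minimal sketch in Python (with made-up numbers) of surprise as the negative log probability of an observation under a categorical marginal \(p(o)\):

```python
import numpy as np

# Hypothetical marginal p(o) for a fish over two observations:
# o = 0 ("sensing water") and o = 1 ("sensing air").
p_o = np.array([0.99, 0.01])

surprise = -np.log(p_o)
print(surprise)  # ~[0.01, 4.61]: being beached is highly surprising
```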

Now and in the future

When the organism decides what to do next, it runs simulations of future states and observations to evaluate different actions and to determine which actions would most likely lead to a series of wanted observations or wanted states over a planning horizon. Depending on the complexity of the organism and perhaps its sensing range [7], the simulations may be anything from determining which way the nutrient gradient points to planning a professional career.

In a stable world the wanted observations should be the same as the expected observations. The future body temperature should be in line with the historical body temperature, give or take. The fish both expects and wants to be in water rather than on land. If this is the case, then one and the same generative model can guide both perceptual inference and the “simulation” in action inference. This means that the priors \(p(s)\) can in stable situations represent both expected states and wanted states (from which expected and wanted observations can be inferred). If this is not the case, then the organism must hold two different generative models in its “head”: one for analyzing the current state of affairs and one for planning for the future. Intuitively this seems reasonable, at least for humans, as we often want to be in a state that is different from the state we expect to be in (sometimes to the detriment of our mental health).

The jury seems to still be out regarding whether the brain defines its wants in terms of observations or states. Both variants can be found in the literature without much motivation for one or the other. The generative model maps between states and observations, and expressions for EFE can be derived for both wanted states and wanted observations 4. When I below write “wanted states” it should be read as “wanted states or wanted observations”.

Planning for the future

I predict myself therefore I am.

Anil Seth. Being You.

A sequence of future actions is in AIF called a policy, \(\pi\) 5: \(\pi = [a_0, a_1, \ldots a_{T-1}] = a_{0:T-1}\). The purpose of each action is to transition the organism to a new state.

Each policy will lead to a unique sequence of observations \(o_{1:T}\) and a sequence of associated mental states \(s_{1:T}\). We will in the following for convenience and brevity skip the subscripts in most equations and use the notations \(a\), \(o\), and \(s\) to also represent sequences.

The purpose of action inference is to, when action needs to be taken, infer a probability distribution of policies, \(\hat q(\pi)\), that minimizes free energy in the future. Since we are working with probabilistic inputs, we also end up with a probabilistic result; different policies will, with different probabilities, lead to the wanted state. The actual policy to employ can be sampled from the distribution by for instance choosing the most probable policy (or if one is adventurous, a slightly less probable but more “interesting” policy).

Since all the states and observations are in the future when evaluating policies, the policy distribution has to be inferred based on a “simulation” during which each possible policy is evaluated probabilistically with respect to whether it will minimize future surprise or not (lead us to a wanted state, through not too many unwanted states).

The policy distribution inference can be formulated as an optimization (in this case minimization) problem with expected free energy as the loss function.
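
Jumping ahead a little to the softmax form derived later in this post, the overall loop might be sketched like this in Python. Here compute_efe is a hypothetical stand-in for the expected free energy calculation derived in the following sections, and the numbers are made up:

```python
import numpy as np

def softmax(x):
    """Exponentiate and normalize so the values sum to one."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def infer_policy_distribution(policies, compute_efe):
    """Evaluate each candidate policy and return q(pi) proportional to exp(-G)."""
    G = np.array([compute_efe(pi) for pi in policies])
    return softmax(-G)

# Hypothetical usage with three candidate policies and made-up EFE values.
efe_table = {"stay": 5.0, "flap": 2.0, "swim": 0.5}
policies = list(efe_table)
q_pi = infer_policy_distribution(policies, lambda pi: efe_table[pi])
chosen = policies[int(np.argmax(q_pi))]   # e.g. pick the most probable policy
print(dict(zip(policies, q_pi)), chosen)
```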

Generative model including action

In preparation for understanding action inference, we start by making two enhancements to the generative model for perceptual inference introduced in this post.

First, we don’t just infer the state distribution at the current moment in time (\(t = 0\)) but estimate the distribution of the whole sequence of states leading up to \(t = 0\), \(s_{-T+1:0}\), based on the corresponding sequence of observations \(o_{-T+1:0}\).

We will also include the actions that cause the organism to go from one state to the next, \(a_{-T:-1}\). The sequence of actions is a policy, \(\pi\). We still drop the subscripts for brevity below.

The above additions mean that \(p(o, s)\) becomes \(p(o, s, \pi)\).

For any probability distribution \(p(o, s, \pi)\), it is true that:

$$p(o) = \sum_{s, \pi} p(o, s, \pi)$$

This is the marginal probability distribution of \(o\). Expressed in terms of surprise this becomes:

$$- \log p(o) = - \log \sum_{s, \pi} p(o, s, \pi) = - \log \mathbb E_{q(s, \pi)} \left[\frac{p(o, s, \pi)}{q(s, \pi)}\right] \ \ \ \ (1)$$

In the last expression we have multiplied both the numerator and the denominator by the probability distribution \(q(s, \pi)\), which here represents the variational posterior of perceptual inference. (The expression would of course be true for any well-behaved probability distribution.)

Jensen’s inequality gives:

$$- \log p(o) = - \log \mathbb E_{q(s, \pi)} \left[\frac{p(o, s, \pi)}{q(s, \pi)}\right] \leq - \mathbb E_{q(s, \pi)} \left[\log \frac{p(o, s, \pi)}{q(s, \pi)}\right] = \mathcal F[q; o]$$

The right hand side of the inequality can be written:

$$\mathcal F[q; o] = \mathbb{E}_{q(s, \pi)} [\log q(s, \pi) - \log p(o, s, \pi)] \ \ \ \ (2)$$

\(\mathcal F\) is the variational free energy at time \(0\), introduced in an earlier post and further explained here, but now based not only on the current observation but on a sequence of observations and a sequence of actions.

Formally \(\mathcal F\) is a functional (function of a function) of the variational distribution \(q\), parametrized by \(o\). The variational distribution is the “variable” that is modified to find the minimum value of \(\mathcal F\) in perceptual inference.
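
A quick numerical sanity check of the bound, using a small made-up discrete model \(p(o, s, \pi)\) and an arbitrary (not optimized) variational distribution \(q(s, \pi)\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up joint p(o, s, pi) over 3 observations, 4 states and 2 policies.
p_joint = rng.random((3, 4, 2))
p_joint /= p_joint.sum()

# An arbitrary (not optimized) variational distribution q(s, pi).
q = rng.random((4, 2))
q /= q.sum()

o = 1                                               # the observation actually made
surprise = -np.log(p_joint[o].sum())                # -log p(o)
F = np.sum(q * (np.log(q) - np.log(p_joint[o])))    # equation (2)

assert F >= surprise                                # Jensen's inequality
print(surprise, F)
```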

For a given policy \(\pi\), equation (2) yields:

$$\mathcal F[q; o, \pi] = \mathbb{E}_{q(s \mid \pi)} [\log q(s \mid \pi) - \log p(o, s \mid \pi)] = \ \ \ \ (3)$$

$$\mathbb{E}_{q(s \mid \pi)} [\log q(s \mid \pi) - \log p(o \mid s) - \log p(s \mid \pi)] =$$

$$\sum_s q(s \mid \pi) \log q(s \mid \pi) - \sum_s q(s \mid \pi) \log p(o \mid s) - \sum_s q(s \mid \pi) \log p(s \mid \pi) =$$

$$D_{KL}[q(s \mid \pi) \mid \mid p(s \mid \pi)] - \mathbb E_{q(s \mid \pi)}[\log p(o \mid s)]$$

Above we have taken into account that the likelihood \(p(o \mid s)\) does not depend on \(\pi\). It describes the sensory system and is therefore relatively stable.

Factorizing the joint distribution the other way around:

$$\mathcal F[q; o, \pi] = \mathbb{E}_{q(s \mid \pi)} [\log q(s \mid \pi) - \log p(o, s \mid \pi)] =$$

$$\mathbb{E}_{q(s \mid \pi)} [\log q(s \mid \pi) - \log p(s \mid o, \pi) - \log p(o \mid \pi)] =$$

$$\sum_s q(s \mid \pi) \log q(s \mid \pi) - \sum_s q(s \mid \pi) \log p(s \mid o, \pi) - \sum_s q(s \mid \pi) \log p(o \mid \pi) =$$

$$D_{KL}[q(s \mid \pi) \mid \mid p(s \mid o, \pi)] - \log p(o \mid \pi)$$

Since a Kullback-Leibler divergence is never negative, this shows in another way that the free energy is always at least as large as the surprise (here the surprise is conditioned on \(\pi\); different policies yield different surprises).
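
The algebra above is easy to verify numerically. A minimal sketch with made-up numbers, checking that both decompositions agree with the definition of \(\mathcal F[q; o, \pi]\) and that the free energy bounds the policy-conditioned surprise:

```python
import numpy as np

rng = np.random.default_rng(1)

n_s = 4
A = rng.random((3, n_s)); A /= A.sum(axis=0)     # likelihood p(o|s); columns sum to 1
p_s = rng.random(n_s); p_s /= p_s.sum()          # prior p(s|pi) for one fixed policy
q_s = rng.random(n_s); q_s /= q_s.sum()          # variational posterior q(s|pi)

o = 2                                            # the observed outcome
lik = A[o]                                       # p(o|s) as a function of s

F = np.sum(q_s * (np.log(q_s) - np.log(lik * p_s)))                   # equation (3)

# First decomposition: KL to the prior minus expected log likelihood.
F1 = np.sum(q_s * (np.log(q_s) - np.log(p_s))) - np.sum(q_s * np.log(lik))

# Second decomposition: KL to the exact posterior plus surprise.
p_o = np.sum(lik * p_s)                          # p(o|pi)
post = lik * p_s / p_o                           # p(s|o,pi)
F2 = np.sum(q_s * (np.log(q_s) - np.log(post))) - np.log(p_o)

assert np.allclose(F, F1) and np.allclose(F, F2)
assert F >= -np.log(p_o)                         # free energy bounds the surprise
print(F, -np.log(p_o))
```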

In this enhanced model the prior distribution of states in the current moment depends on the policy that was executed leading up to the current moment. If the last thing we did was to jump into the lake we would expect a wet state. Had we chickened out we would expect to stay dry. A prior informed by the last action (and potentially all actions before that) would in turn modulate the approximate posterior; it would be much more likely that we were actually (in the posterior sense) wet if we had taken that jump. The last action of the policy would precede the observation so it could inform the posterior even before any observation was made. Taking into account a whole sequence of observations and actions gives the organism more information on which to base the estimate of the posterior than the current observation alone.

Since we assume that the generative model is available to the organism for calculating free energy, it follows that \(p(o \mid s)\) and \(p(s \mid \pi)\) are also available for computation 6.
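
In discrete-state-space treatments of AIF (e.g. [2]) these two distributions are typically encoded as a likelihood matrix and a set of action-conditioned transition matrices. A minimal sketch with made-up numbers for the lake example:

```python
import numpy as np

# Likelihood p(o|s): rows are observations ("feel wet", "feel dry"),
# columns are states ("in water", "on land"). Made-up numbers.
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])

# Transitions p(s_t | s_{t-1}, a): one column-stochastic matrix per action.
B = {
    "jump_in": np.array([[1.0, 0.9],   # jumping in almost surely lands you in water
                         [0.0, 0.1]]),
    "stay":    np.array([[1.0, 0.0],   # staying leaves the state unchanged
                         [0.0, 1.0]]),
}

q_s = np.array([0.2, 0.8])             # belief before acting: probably on land
for a in ["jump_in"]:                  # a one-step policy
    q_s = B[a] @ q_s                   # p(s|pi): the policy-conditioned prior
predicted_o = A @ q_s                  # the observations this prior leads us to expect
print(q_s, predicted_o)                # ~[0.92, 0.08] and ~[0.84, 0.16]
```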

Inferring \(\hat q(\pi)\)

As mentioned above, the purpose of action inference is to, when action needs to be taken, infer an optimal probability distribution of policies, \(\hat q(\pi)\), that minimizes free energy in the future. The actual policy can then be sampled from the distribution by for instance choosing the most probable policy.

Let’s pause in the present for a little bit more before we head into the future. The distribution \(p(o, s, \pi)\) is the generative model that is valid up until the present moment. It’s a historic record of what has happened in the past; it is our experience base. It tells us what actions typically led to what states and what observations these states led to. The marginal distribution \(p(s) = \sum_{o, \pi} p(o, s, \pi)\) yields the probabilities of the expected states.

When planning future actions we replace \(p(o, s, \pi)\) with \(\tilde p(o, s, \pi)\). \(\tilde p(s) = \sum_{o, \pi} \tilde p(o, s, \pi)\) yields the probabilities of the preferred states. Here \(o\) and \(\pi\) are future expected observations and actions respectively. When the organism prefers something that it doesn’t expect, then \(\tilde p(s) \neq p(s)\).

Action inference can be seen as tweaking the \(\tilde p(o, s, \pi)\) until it produces the preferred marginal distribution \(\tilde p(s)\) and then looking up the actions that correlate with the high probability states of that marginal distribution.

When simulating the future we have to replace actual observations with a probability distribution of observations. This means that equation \((2)\) is transformed into an expectation over \(s, \pi\) and \(o\); \(\mathcal F[q; o]\), the “retrospective” free energy is replaced by \(\mathcal G[q]\), the “prospective” free energy 7.

$$\mathcal G[q] = \mathbb{E}_{q(s, o, \pi)} [\log q(s, \pi) - \log \tilde p(o, s, \pi)] = \ \ \ \ (4)$$

$$\mathbb{E}_{q(\pi)} \left[\log q(\pi) + \mathbb{E}_{q(s, o \mid \pi)} \left[ \log q(s \mid \pi) - \log \tilde p(o, s \mid \pi) \right]\right]=$$

$$\mathbb{E}_{q(\pi)} \left[\log q(\pi) + \mathcal G[q; \pi]\right] \ \ \ \ (5)$$

where

$$\mathcal G[q; \pi] = \mathbb{E}_{q(s, o \mid \pi)}\left[\log q(s \mid \pi) - \log \tilde p(o, s \mid \pi) \right]$$

Starting from \((5)\) we get:

$$\mathcal G[q] = \mathbb{E}_{q(\pi)} \left[\log q(\pi) + \mathcal G[q; \pi]\right] =$$

$$\sum_{\pi} q(\pi)\left(\log q(\pi) - \log e^{- \mathcal G[q; \pi]}\right) =$$

$$\sum_{\pi} q(\pi)\left(\log q(\pi) - \log \frac{e^{- \mathcal G[q; \pi]} \sum_{\pi'} e^{- \mathcal G[q; \pi']}}{\sum_{\pi'} e^{- \mathcal G[q; \pi']}}\right) =$$

$$\sum_{\pi} q(\pi)\left(\log q(\pi) - \log \frac{e^{- \mathcal G[q; \pi]}}{\sum_{\pi'} e^{- \mathcal G[q; \pi']}}\right) - \sum_{\pi} q(\pi) \log \sum_{\pi'} e^{- \mathcal G[q; \pi']} =$$

$$\sum_{\pi} q(\pi)\left(\log q(\pi) - \log \frac{e^{- \mathcal G[q; \pi]}}{\sum_{\pi'} e^{- \mathcal G[q; \pi']}}\right) - \log \sum_{\pi'} e^{- \mathcal G[q; \pi']} =$$

$$\sum_{\pi} q(\pi)\left(\log q(\pi) - \log \sigma (- \mathcal G[q; \pi])\right) - C$$

Remember that the objective was to find the \(\hat q(\pi)\) that minimizes \(\mathcal G[q]\). \(\sigma\) is the softmax function, often used at the output of artificial neural network classifiers. It exponentiates and normalizes an array of values so that they sum to one, effectively turning the values into a probability distribution.

\(C\) is a quantity that doesn’t depend on \(q(\pi)\), only on the total set of candidate policies, which remains unchanged during action inference. Since both \(q(\pi)\) and \(\sigma (- \mathcal G[q; \pi])\) are proper probability distributions we can express \(\mathcal G[q]\) as a KL-divergence:

$$\mathcal G[q] = D_{KL}\left[q(\pi) \mid \mid \sigma (- \mathcal G[q; \pi])\right] - C$$

This expression is trivially minimized if we set \(\hat q(\pi) =\sigma (- \mathcal G[q; \pi])\), making the two distributions of the divergence equal.
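
In code the final step is a one-liner. The sketch below (made-up EFE values) computes \(\hat q(\pi) = \sigma(-\mathcal G[q; \pi])\) and checks that any other distribution over the same policies gives a larger \(\mathcal G[q]\):

```python
import numpy as np

G_pi = np.array([3.0, 1.0, 2.0])     # made-up expected free energies G[q; pi]

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

q_hat = softmax(-G_pi)               # the minimizer derived above

def G_total(q):
    """G[q] = sum_pi q(pi) (log q(pi) + G[q; pi]), equation (5)."""
    return np.sum(q * (np.log(q) + G_pi))

q_other = np.array([0.2, 0.5, 0.3])  # any other distribution over the policies
assert G_total(q_hat) <= G_total(q_other)
print(q_hat, G_total(q_hat), G_total(q_other))
```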

Shapes of \(\mathcal G[q; \pi]\)

To find \(\hat q(\pi)\) we thus have to find the expected free energy \(\mathcal G[q; \pi]\) for each candidate policy.

Wants as mental states

This section outlines a derivation of \(\mathcal G[q; \pi]\) in the case when the organism’s preferences are expressed in terms of preferred states.

$$\mathcal G[q; \pi] = \mathbb E_{q(o, s \mid \pi)}[\log q(s \mid \pi) - \log \tilde p(o, s \mid \pi)] =$$

$$\mathbb E_{q(o, s \mid \pi)}[\log q(s \mid \pi) - \log p(o \mid s) - \log \tilde p(s \mid \pi)] \ \ \ \ (6)$$

\(\log p(o \mid s)\) characterizes the sensory system and therefore does not depend on \(\pi\). It is also assumed to be the same in the future as in the past. These two assertions imply that \(\tilde p(o, s \mid \pi) = p(o \mid s) \tilde p(s \mid \pi)\).

\(\tilde p(s \mid \pi)\) represents the preferred (target) distribution of states; we minimize \(\mathcal G[q; \pi]\) under the assumption that this distribution is true. The preferred distribution is the same regardless of policy so we can set \(\tilde p(s \mid \pi) = \tilde p(s)\). This means that \(\tilde p(o, s \mid \pi) = p(o \mid s) \tilde p(s) = \tilde p(o, s)\) [3, p.453].

Also, \(q(o, s \mid \pi) = p(o \mid s)q(s \mid \pi)\) where \(q(s \mid \pi)\) is the variational posterior inferred during the “simulation”.

Equation \((6)\) therefore becomes:

$$\mathcal G[q; \pi] = \mathbb E_{p(o \mid s)q(s \mid \pi)}[\log q(s \mid \pi) - \log p(o \mid s) - \log \tilde p(s)] =$$

$$\sum_{o, s} p(o \mid s)q(s \mid \pi)[\log q(s \mid \pi) - \log p(o \mid s) - \log \tilde p(s)] =$$

$$\sum_s \left(\sum_o p(o \mid s)q(s \mid \pi)[\log q(s \mid \pi) - \log \tilde p(s)] - \sum_o q(s \mid \pi)p(o \mid s)\log p(o \mid s)\right) =$$

$$\sum_s \left(q(s \mid \pi)[\log q(s \mid \pi) - \log \tilde p(s)]\sum_o p(o \mid s) - q(s \mid \pi)\sum_o p(o \mid s)\log p(o \mid s)\right)$$

$$\sum_o p(o \mid s) = 1 \Rightarrow$$

$$\mathcal G[q; \pi] = \sum_s \left(q(s \mid \pi)[\log q(s \mid \pi) - \log \tilde p(s)] - q(s \mid \pi)\sum_o p(o \mid s)\log p(o \mid s)\right) =$$

$$D_{KL}[q(s \mid \pi) \mid \mid \tilde p(s)] + \mathbb E_{q(s \mid \pi)}[\mathbb H[p(o \mid s)]]$$
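
The two terms are often referred to as risk (the divergence between predicted and preferred states) and ambiguity (the expected entropy of the likelihood). A minimal numerical sketch with a made-up likelihood, predicted-state distribution and preference distribution:

```python
import numpy as np

A = np.array([[0.8, 0.3],           # p(o|s): rows = observations, columns = states
              [0.2, 0.7]])
q_s = np.array([0.6, 0.4])          # q(s|pi): states predicted under one policy
p_tilde_s = np.array([0.9, 0.1])    # preferred (wanted) state distribution

risk = np.sum(q_s * (np.log(q_s) - np.log(p_tilde_s)))   # D_KL[q(s|pi) || p~(s)]
H_o_given_s = -np.sum(A * np.log(A), axis=0)             # H[p(o|s)] for each state
ambiguity = np.sum(q_s * H_o_given_s)                    # E_q(s|pi)[H[p(o|s)]]

G_states = risk + ambiguity
print(risk, ambiguity, G_states)
```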

Wants as observations

This section outlines a derivation of \(\mathcal G[q; \pi]\) in the case when the organism’s preferences are expressed in terms of preferred observations.

$$\mathcal G[q; \pi] = \mathbb E_{q(o, s \mid \pi)}[\log q(s \mid \pi) - \log \tilde p(o, s \mid \pi)] =$$

$$\mathbb E_{q(o, s \mid \pi)}[\log q(s \mid \pi) - \log p(s \mid o, \pi) - \log \tilde p(o)] =$$

$$\sum_{o, s} q(o, s \mid \pi)[\log q(s \mid \pi) - \log p(s \mid o, \pi) - \log \tilde p(o)]$$

Bayes’ theorem gives:

$$p(s \mid o, \pi) = \frac{p(o \mid s, \pi)p(s \mid \pi)}{p(o \mid \pi)}$$

Also, as stated above, the likelihood \(p(o \mid s)\) doesn’t depend on \(\pi\). We get:

$$p(s \mid o, \pi) = \frac{p(o \mid s)p(s \mid \pi)}{p(o \mid \pi)}$$

We also assume that the variational distributions approximate the corresponding generative ones, \(p(s \mid \pi) \approx q(s \mid \pi)\) and \(p(o \mid \pi) \approx q(o \mid \pi) = \sum_s p(o \mid s) q(s \mid \pi)\), so that \(\log q(s \mid \pi) - \log p(s \mid o, \pi) \approx \log q(o \mid \pi) - \log p(o \mid s)\). This yields:

$$\mathcal G[q; \pi] = \sum_{o, s} q(o, s \mid \pi)[\log q(o \mid \pi) - \log p(o \mid s) - \log \tilde p(o)] =$$

$$\sum_{o, s} q(o, s \mid \pi)[\log q(o \mid \pi) - \log \tilde p(o)] - \sum_{o, s} q(o, s \mid \pi)[\log p(o \mid s)]$$

$$ q(o, s \mid \pi) = q(o \mid \pi)p(s \mid o, \pi) \Rightarrow$$

$$\mathcal G[q; \pi] = \sum_o \left(\sum_s p(s \mid o, \pi)q(o \mid \pi) [\log q(o \mid \pi) - \log \tilde p(o)]\right) -$$

$$\sum_s \left(\sum_o p(o \mid s)q(s \mid \pi) \log p(o \mid s)\right) =$$

$$\sum_o \left(q(o \mid \pi) [\log q(o \mid \pi) - \log \tilde p(o)] \sum_s p(s \mid o, \pi)\right) - \sum_s q(s \mid \pi) \left(\sum_o p(o \mid s) \log p(o \mid s)\right)$$

$$ \sum_s p(s \mid o, \pi) = 1 \Rightarrow$$

$$\mathcal G[q; \pi] = D_{KL}[q(o \mid \pi) \mid \mid \tilde p(o)] + \mathbb E_{q(s \mid \pi)} \mathbb H[p(o \mid s)] $$

Note that the two expressions for \(\mathcal G[q; \pi]\) don’t necessarily yield exactly the same answer as they start from different premises. It can in fact be proven that [1, p.251]:

$$D_{KL}[q(s \mid \pi) \mid \mid \tilde p(s)] + \mathbb E_{q(s \mid \pi)}[\mathbb H[p(o \mid s)]] \geq$$

$$D_{KL}[q(o \mid \pi) \mid \mid \tilde p(o)] + \mathbb E_{q(s \mid \pi)} \mathbb H[p(o \mid s)]$$
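
A numerical sketch of the observation-based form, reusing the same kind of made-up model and checking the inequality above. Here \(q(o \mid \pi)\) is the posterior predictive \(\sum_s p(o \mid s) q(s \mid \pi)\), and the preferred observations \(\tilde p(o)\) are assumed to be obtained by passing the preferred states through the likelihood (so that the two forms are directly comparable):

```python
import numpy as np

A = np.array([[0.8, 0.3],            # p(o|s)
              [0.2, 0.7]])
q_s = np.array([0.6, 0.4])           # q(s|pi)
p_tilde_s = np.array([0.9, 0.1])     # preferred states
p_tilde_o = A @ p_tilde_s            # preferred observations, via the likelihood
q_o = A @ q_s                        # posterior predictive q(o|pi)

kl = lambda p, q: np.sum(p * (np.log(p) - np.log(q)))
ambiguity = np.sum(q_s * -np.sum(A * np.log(A), axis=0))   # E_q[H[p(o|s)]]

G_states = kl(q_s, p_tilde_s) + ambiguity
G_obs = kl(q_o, p_tilde_o) + ambiguity

assert G_states >= G_obs             # the inequality from [1, p.251]
print(G_states, G_obs)
```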

Whence \(\mathcal G[q; \pi]\)

We are not done yet though. We still need to calculate each \(\mathcal G[q; \pi]\). Remember that \(o = o_{1:T}\) and \(s = s_{1:T}\) are sequences of observations and corresponding states. To simplify the inference of \(\mathcal G[q; \pi]\) we will make a couple of assumptions [3, p.453]:

$$q(s_{1:T} | \pi) \approx \prod_{t=1}^{T} q(s_t | \pi)$$

And

$$\tilde p(o_{1:T}, s_{1:T} | \pi) \approx \prod_{t=1}^{T} \tilde p(o_t, s_t | \pi)$$

These simplifications rely on a mean field approximation which assumes that all dependencies between the observations and states in a sequence are captured by the parameter \(\pi\) so that the distribution can be expressed as a product of independent distributions.

$$\mathcal G[q; \pi] = \mathbb{E}_{q(s, o \mid \pi)} \left[\log q(s \mid \pi) - \log \tilde p(o, s \mid \pi)\right] =$$

$$\mathbb{E}_{q(s, o \mid \pi)} \left[\sum_{t=1}^T \left(\log q(s_t \mid \pi) - \log \tilde p(o_t, s_t \mid \pi)\right) \right] =$$

$$\sum_{t=1}^T \mathbb{E}_{q(s, o \mid \pi)} \left[\log q(s_t \mid \pi) - \log \tilde p(o_t, s_t \mid \pi) \right] = $$

$$\sum_{t=1}^T \mathcal G[q; \pi, t]$$

Where

$$\mathcal G[q; \pi, t] = \mathbb{E}_{q(s_t, o_t \mid \pi)} \left[\log q(s_t \mid \pi) - \log \tilde p(o_t, s_t \mid \pi)\right]$$

We can thus find \(\mathcal G[q; \pi]\) by calculating \(\mathcal G[q; \pi, t]\) for each time step in the planning horizon and taking the sum of all values.
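
A minimal sketch of this per-timestep accumulation, rolling the predicted state distribution forward through an assumed transition model and summing the state-based risk and ambiguity terms at each step (all numbers made up):

```python
import numpy as np

A = np.array([[0.9, 0.2],                 # p(o|s)
              [0.1, 0.8]])
B = {0: np.array([[0.7, 0.4],             # p(s_t | s_{t-1}, a): one matrix per action
                  [0.3, 0.6]]),
     1: np.array([[0.2, 0.1],
                  [0.8, 0.9]])}
p_tilde_s = np.array([0.1, 0.9])          # preferred states, same at every time step
H_A = -np.sum(A * np.log(A), axis=0)      # H[p(o|s)] for each state

def efe(policy, s0):
    """G[q; pi] = sum_t G[q; pi, t], using the state-based risk + ambiguity form."""
    G, q_s = 0.0, s0
    for a in policy:
        q_s = B[a] @ q_s                                      # q(s_t | pi)
        risk = np.sum(q_s * (np.log(q_s) - np.log(p_tilde_s)))
        G += risk + np.sum(q_s * H_A)                         # G[q; pi, t]
    return G

s0 = np.array([0.5, 0.5])
for pi in [(0, 0), (0, 1), (1, 1)]:       # three candidate two-step policies
    print(pi, efe(pi, s0))                # (1, 1) drives towards the preferred state
```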

Links

[1] Thomas Parr, Giovanni Pezzulo, Karl J. Friston. Active Inference. MIT Press Direct.
[2] Ryan Smith, Karl J. Friston, Christopher J. Whyte. A step-by-step tutorial on active inference and its application to empirical data. Journal of Mathematical Psychology. Volume 107. 2022.
[3] Beren Millidge, Alexander Tschantz, Christopher L. Buckley. Whence the Expected Free Energy. Neural Computation 33, 447–482 (2021).
[4] Stephen Francis Mann, Ross Pain, Michael D. Kirchhoff. Free energy: a user’s guide. Biology & Philosophy (2022) 37: 33.
[5] Carol Tavris on mistakes, justification, and cognitive dissonance. Sean Carroll’s Mindscape: science, society, philosophy, culture, arts, and ideas. Podcast (Spotify link).
[6] The principle of least action. Feynman Lectures.
[7] Sean Carroll’s Mindscape podcast. Episode 39: Malcolm MacIver on Sensing, Consciousness, and Imagination.

  1. Often in the literature called preferred states or preferred observations. I prefer the stronger expressions wanted states and wanted observations. ↩︎
  2. One hypothesis is that imagination, and thus the ability to hold preferences, arose when fish climbed up on land and could see much further than under water. This made it possible, and necessary, to plan further ahead by imagining and evaluating different possible courses of actions [7]. ↩︎
  3. The term active inference is used to describe both the whole framework and the action-oriented part of the framework. I find this confusing. Active inference seems to refer to the fact that the policy (sequence of actions) can be inferred through an inference algorithm resembling the one used to infer the posterior in perceptual inference. To avoid overloading the term active inference I will call the action-oriented part of active inference action inference for now. ↩︎
  4. Intuitively (and speculatively) I believe I think in terms of states when I plan actions consciously, like when I (a long time ago) planned my education. I wanted to end up in a state of competence and knowledge, which is a very abstract state that is not easy to characterize with an observation. When I jump my horse, I, on the other hand, see myself on the other side of the fence after the jump (as opposed to on the ground in front of the fence). A physical and to a degree unconscious want like that may be more observation-based. There is most likely a hierarchy of representations from, e.g., raw retinal input to abstract states, so maybe there is a continuum between what we call observations and what we call states. ↩︎
  5. It is possible to build a continuous model of active inference. I start with introducing the discrete variant as it is somewhat more intuitive. ↩︎
  6. It is tempting to use the term “known quantity” but that might lead one to believe that the generative model is cognitively known to the organism, which is probably not the case. It suffices that it can use the generative model for its unconscious inference of the approximate posterior. ↩︎
  7. We can derive \(\mathcal G[q]\) from \(\mathcal F[q; o]\) by taking the expectation of \(\mathcal F[q; o]\) over \(q(o \mid s, \pi) = p(o \mid s)\). (We cannot take the expectation over \(q(o \mid \pi)\) as this would ignore the fact that the observations and the states are correlated.) \(\mathcal G[q] = \mathbb{E}_{p(o \mid s)q(s, \pi)} [\log q(s, \pi) - \log p(o, s, \pi)]\). ↩︎
