
Reinforcement Learning as Probabilistic Modelling: A Variational Inference Formulation (Part I) WIP

  • Writer: Haitham Bou Ammar
  • Feb 14, 2019
  • 15 min read

Updated: Feb 22, 2019


Reinforcement Learning is concerned with an agent attempting to acquire optimal behaviour in unknown environments that typically exhibit stochasticity. Though minimally supervised, reinforcement learning algorithms have shown numerous successes, ranging from solving ATARI games using Deep Q-Networks, to the triumphant victory against world champions in the game of Go, and, recently, in StarCraft.


Maybe the most intuitive formulation of reinforcement learning is the one stemming from Rich Sutton and Andrew Barto's book, which introduces a reinforcement learning agent as one acting in a Markov Decision Process with the aim of learning an optimal action-selection rule, or policy. Though interesting, with time I came to appreciate another view on reinforcement learning that has roots in probabilistic modelling and variational inference. I found that such a formulation unifies different views on reinforcement learning under one common framework and gives a more formal treatment of the field altogether. I will detail this formulation and explain some special cases arising from it, e.g., Max-Entropy Reinforcement Learning, among others. To fully understand it, the reader needs background in probabilistic modelling, latent variable models, and the evidence lower bound (ELBO). As such, I split this post into two parts. In this part, we will discuss the basics, paving the way to the reinforcement learning formulation in Part II.


Probabilistic Modelling & Latent Variable Models:


Imagine you are given a data set of some observed variables. You are interested in understanding the distribution governing these observations. To do so, you go ahead and assume the existence of a hidden, or latent, variable which is not directly observable but is assumed to affect your observations. Such variables are typically introduced in a statistical model, which we then call a latent variable model, with different goals in mind, e.g., representing unobservable factors, accounting for measurement errors, among others. In machine learning, latent variable models have numerous applications in a variety of fields, e.g., clustering, dimensionality reduction, and, most notably, time-series analysis. Such applications are important in the real world. For example, understanding how to model time-series data can be adapted to applications in robotics and in the stock market.


To get a better grasp of latent variable models, let us walk through an example of modelling time-series data. Say you gathered a data set describing the joint movements of a robotic arm over some time horizon. Given this collection, our goal is to understand the hidden, or latent, process (i.e., the dynamics) by which this data was generated. If this data had been (marginally) independently and identically distributed, our problem would have been much easier. Unfortunately, the joint values at some time in the future can (and actually do) depend on the joint values at previous time-steps. As such, we need to handle our probabilistic assumptions about the data with more care. Here, there are different types of models we could use, but we will be interested in the so-called First-Order Markov Model, as detailed later. Intuitively, this assumption refers to a conditional independence relationship, where, given the current observation, the future is independent of the past.


I can go ahead and start a philosophical discussion motivating this. But this is not a conference paper, so let's move on! It's just a mathematically convenient assumption to make our problem more tractable.



The figure (a probabilistic graphical model) to the right illustrates our modelling assumptions so far. The arrows in the figure represent probabilistic relations. For example, the variable at the 3rd time-step only depends on the one at the 2nd time-step, and not on both the 2nd and the 1st. In other words, the density of the variable at time-step 3 is conditioned only on that at the 2nd time-step.
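In symbols (with $x_t$ denoting the observed variable at time-step $t$, notation adopted here for concreteness), the first-order Markov assumption lets the joint density factorise as:

$$p(x_1, \dots, x_T) = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_{t-1}).$$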


So far, we have not introduced any latent or hidden variables, and have made our probabilistic assumptions directly on our observed variables. Among other restrictions, such models cannot cope well with high-dimensional data, e.g., when the observations are a sequence of images. Introducing latent variables, as we do next, not only allows our model to scale, but can also provide interpretable summaries of the observed data.


As such, we will introduce a hidden (typically lower-dimensional) variable that dictates the transitions between time-steps. This leads us to the graphical model depicted below. As you can see, we now have a latent layer that is temporally dependent, which in turn generates the observations.

Example of a graphical model for time-series problems:



In other words, we managed to introduce a hidden stochastic process that generates the observations we see. If we are able to find such a process, then we will have discovered a fundamental property of the system (e.g., our robot) which can be used to make predictions and forecasts into the future. Given the above graphical model, we now need to formally state the assumptions that will allow us to learn about such a latent process. For instance, we need to understand what dictates transitions in our latent layer, or how an observation would be generated from the latent layer. Of course, we do not know the exact distributions up-front. Rather, we transform the problem into a machine learning one, where we parameterise a broad class of distributions and fit the parameters according to the observed data.


Stating Our Assumptions:


Closest to the heart of a probabilistic modeller is a clear statement of the assumptions used to define the problem. Personally, I don't blame them, as I think a clear problem definition is key to any machine learning algorithm. Engineering comes later! Math comes first! I can't stress this enough: Engineering comes later! Math comes first!


Hopefully, you are still reading after such a provocative statement. After all, it is impossible to code something you have not yet defined.


Anyhow, let's now make some assumptions about our latent and observation processes. What comes next is a specific set of assumptions that I chose for the sake of presentation. You can make this problem as complex as you'd like depending on your data, e.g., discrete versus continuous variables, the type of latent functions, etc.


Starting with the latent variables, we know that the information at the current time-step depends only on the previous one. We also know that our hidden variable is continuous (it can be discrete as well, but for now let's stick to continuous). As such, we parameterise a function that relates the continuous variable at the current time-step to that at the previous one. So far, this function is deterministic. We are going to induce a distribution in the latent space using a very primitive and simple structure. Namely, we assume that the latent variable at the next time-step is the value of this function plus a Gaussian noise perturbation. Formally, we say:

Example of our latent function:
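With $z_t$ denoting the latent variable at time-step $t$, $f_\theta$ the parameterised transition function, and $\Sigma_z$ the noise covariance (notation chosen here for concreteness), the latent dynamics read:

$$z_{t+1} = f_\theta(z_t) + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(\mathbf{0}, \Sigma_z).$$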



The above equation simply states the temporal relationship in the latent space. In more formal terms, our function is in fact a map from the dimension of the latent variable back to that same dimension. As for the distribution over epsilon, we simply choose it to be a multivariate Gaussian. If you are unfamiliar with this concept, just think of it as a generalisation of a univariate Gaussian distribution to more dimensions. Rather than having a mean and a variance value, we now have a mean vector and a covariance matrix so as to handle higher dimensions. Here, you can find a nice introduction to these.


Example of two-dimensional Gaussian Distributions:


In the above figure, I illustrate how a Gaussian distribution in two dimensions behaves. There are two components that affect the behaviour of these plots. The first is the mean vector, while the second is the covariance matrix. The mean vector simply specifies where the distribution is centred, while the covariance matrix defines the "stretch" in each dimension. To illustrate this point, notice that the covariance matrix is 2 × 2 for a two-dimensional problem. The entries at the top-left and bottom-right reflect the variance in each dimension, while the off-diagonal entries represent the covariance across these dimensions (i.e., how one dimension varies with respect to the other). As such, if you set the variances in each of the dimensions to be high and the covariances to zero, you would expect a flat distribution like the one shown on the right-hand side of the figure above.
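As a quick hands-on check (a minimal sketch using NumPy; the mean vector and covariance matrix are example values I picked for illustration), we can sample from such a two-dimensional Gaussian and verify that the diagonal of the covariance matrix dictates the spread in each dimension:

```python
import numpy as np

# Example two-dimensional Gaussian (values chosen for illustration only).
mean = np.array([0.0, 0.0])        # where the distribution is centred
cov = np.array([[2.0, 0.0],        # top-left: variance along dimension 1
                [0.0, 0.5]])       # bottom-right: variance along dimension 2
                                   # off-diagonals: covariance across dimensions

rng = np.random.default_rng(seed=0)
samples = rng.multivariate_normal(mean, cov, size=10_000)

# The empirical variances should roughly recover the diagonal of cov.
print(samples.var(axis=0))         # approximately [2.0, 0.5]
```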

Given our assumption above about the temporal structure of the latent variables, we can easily derive the conditional distribution of the variable at the current time-step given the previous one. Typically, people are presented with these equations and take them for granted, but deriving such a distribution is really simple. There's nothing fancy about it. Let's do so. Here are the steps to reason about this:

  1. We know that the conditional distribution is Gaussian, as we are just transforming Gaussian noise

  2. We know a Gaussian distribution in high-dimensions has a mean vector and a covariance matrix

  3. We know that the mean is just the expectation which is known to be linear (i.e., we can add expectations)

  4. We know the definition of a covariance matrix

  5. We know how to do simple arithmetic.

Let us commence with computing the mean of the distribution. The derivations are shown below:

Derivations for mean and covariance matrices:
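Reconstructing the mean derivation in the notation above (with the expectation taken over the noise $\epsilon_t$):

$$\mathbb{E}\left[z_{t+1} \mid z_t\right] = \mathbb{E}\left[f_\theta(z_t) + \epsilon_t \mid z_t\right] = f_\theta(z_t) + \mathbb{E}\left[\epsilon_t\right] = f_\theta(z_t).$$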

Here, we first realised that the mean is the expectation. We then used the definition of our temporal model from before. With this, we used the linearity of the expectation and realised that the function is deterministic (so its mean is the value itself) and that epsilon has a mean of zero. Of course, this is true as we take the expectation with respect to our noise epsilon. Please note that the latent variable itself is stochastic, but our expectation above is with respect to the additive noise epsilon. So now we know that the mean is just the function value at the previous time-step. Next, we derive the covariance matrix of our distribution. This can be done in a similar fashion:
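In the same notation, a plausible reconstruction of the covariance derivation is:

$$\mathrm{Cov}\left[z_{t+1} \mid z_t\right] = \mathbb{E}\left[\left(z_{t+1} - f_\theta(z_t)\right)\left(z_{t+1} - f_\theta(z_t)\right)^{\top} \mid z_t\right] = \mathbb{E}\left[\epsilon_t \epsilon_t^{\top}\right] = \Sigma_z,$$

so that, altogether, $p(z_{t+1} \mid z_t) = \mathcal{N}\left(f_\theta(z_t), \Sigma_z\right)$.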


Again, we started from the definition of the covariance matrix and used the mean we just derived in the previous step. Given the mean, the expression in the middle turns out to be just epsilon. As such, in the last equation, we realise that we are looking for nothing other than the covariance of epsilon itself. Hence, our conditional distribution is nothing but a Gaussian centred around the function value, with the same covariance as epsilon. Note that we need to fit both the free parameters of the function as well as the covariance matrix. Interestingly, the covariance matrix has to be positive semi-definite. Therefore, when optimising for it, we need to use some tricks, e.g., either a semi-definite program or a triangular factorisation. I will detail these later.


After defining our state dynamics, or temporal equation, in the latent space, we are now left with one more ingredient to seal the deal. That is, we should define the distribution by which an observation is generated given the latent representation. From the probabilistic graphical model above, we see that the observation at the current time-step depends on the latent variable at that same time-step. The corresponding derivation is sketched below.


Example derivation of the emission model:
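With $x_t$ denoting the observation at time-step $t$, $g_\phi$ the parameterised emission function, and $\Sigma_x$ the observation-noise covariance (again, notation chosen here), the emission model and the resulting conditional are:

$$x_t = g_\phi(z_t) + \nu_t, \qquad \nu_t \sim \mathcal{N}(\mathbf{0}, \Sigma_x) \quad \Longrightarrow \quad p(x_t \mid z_t) = \mathcal{N}\left(g_\phi(z_t), \Sigma_x\right).$$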


Here, we simply parameterised a function to relate the latent variables to the observations. This function is itself deterministic. As such, we introduced Gaussian noise to induce a distribution. Following steps similar to the previous derivation, we arrive at the conditional distribution of the observed variable given the latent one, which is nothing but another multivariate Gaussian with the deterministic function as its mean and the same covariance matrix as the noise. In addition to the previous free parameters, this definition introduces two new sets of free variables to fit: the parameters of the unknown function and those of the noise. Also note that if we were to pick the latent and observation models to be linear, and the noise covariances to be time-independent, we would arrive at the standard linear-Gaussian model underlying Kalman filtering.

Deriving the Optimisation Problem:


Having stated all of our assumptions, we are now ready to commence optimisation to fit the free parameters of the model. Typically, in machine learning, one starts directly from the optimisation problem with little to no regard for where it came from. For instance, a lot of people know the least-squares optimisation objective and argue that it makes intuitive sense. Good! But where did this objective come from?


It turns out that this objective is nothing but maximum likelihood estimation under Gaussian noise! We are not going to derive it here, but the example is meant to illustrate that optimisation objectives can be fundamentally derived from probabilistic modelling assumptions. We are going to illustrate this idea in our running example of learning models for time-series problems.


How do we derive our optimisation problem? We only have access to observations (e.g., the recorded values of the joints), and we would like to train all of our free variables. OK! So, let's try to find a model that maximises the joint density of our observations -- also referred to as the marginal likelihood, since the latent variables are integrated out. To understand this, you need to know the following two concepts:


Concept I - Marginalisation of probability densities: I am sure you have seen this before, maybe in a different context. Let us assume we have two discrete random variables X and Y that are jointly dependent. If we would like to find the probability of X being equal to some value x from the joint, we simply marginalise out Y. In other words, if you sum the joint distribution over all possible Y values, you end up with the distribution over X. We can do the same for continuous random variables. Rather than summing, we just need to integrate. This is what is done in the equation below:
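In our time-series notation, marginalising out the latent variables reads:

$$p(x_{1:T}) = \int \cdots \int p(x_{1:T}, z_{1:T}) \, \mathrm{d}z_1 \cdots \mathrm{d}z_T.$$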

  • Explanation: On the left-hand side of the equation, we simply have the joint density of our observations. This is the quantity we would like to maximise. Specifically, we would like to fit our free parameters such that we maximise the density of our observations. So far so good! But where do the assumptions we developed so far come in? Well, our assumptions stated that the observations are affected by hidden variables that adhere to some temporal structure. Therefore, we can think of the density of our observations as the one resulting from marginalising out the latents. This is what the integrals on the right-hand side of the equation achieve. We are simply integrating out, from the joint density (over observed and latent variables), all latent variables.

Concept II - Factorising probability distributions: To commence with our derivations, we next need to handle the joint distribution inside the integrals above in a way that takes into account the modelling assumptions we introduced before. This is also not hard to do. All you need to know is how to factor joint densities in terms of conditionals, which is also known as the chain rule of probability. It simply says the following:
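For generic random variables $v_1, \dots, v_n$ (symbols chosen here):

$$p(v_1, \dots, v_n) = p(v_n \mid v_1, \dots, v_{n-1}) \, p(v_1, \dots, v_{n-1}).$$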

  • Explanation: So, if we have a set of random variables, we can factor their joint distribution by conditioning the last random variable on the previous ones and then multiplying by the joint of the remaining variables. Please note that we can apply this recursively, conditioning the (n-1)-th variable on the rest, and then multiplying by the joint of the remainder.

Great! Let us apply this idea to our problem. If we do so, we get:
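Applying the chain rule twice in our time-series notation, one plausible reconstruction of the result is:

$$p(x_{1:T}, z_{1:T}) = p(x_T \mid x_{1:T-1}, z_{1:T}) \, p(z_T \mid x_{1:T-1}, z_{1:T-1}) \, p(x_{1:T-1}, z_{1:T-1}).$$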


Now, we can keep applying the same idea to the last term on the right-hand side of the above equation, recursively, until we reach the variables at the first time-step. Below are the main steps to follow:
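Sketching those steps (a reconstruction): at each application of the chain rule, the graphical model lets us drop every dependency except the direct parents,

$$p(x_T \mid x_{1:T-1}, z_{1:T}) = p(x_T \mid z_T), \qquad p(z_T \mid x_{1:T-1}, z_{1:T-1}) = p(z_T \mid z_{T-1}),$$

and we then repeat the same two steps on the remaining joint $p(x_{1:T-1}, z_{1:T-1})$ until we reach $p(x_1 \mid z_1) \, p(z_1)$.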



With this, you should be able to convince yourself that our joint distribution over both the latent and observed variables factorises conveniently into the form shown in the equation below:
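In our notation, that form is:

$$p(x_{1:T}, z_{1:T}) = p(z_1) \, p(x_1 \mid z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \, p(x_t \mid z_t).$$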


Interestingly, the form we arrived at allows us to use the distributions we derived above (e.g., the emission model and the temporal distribution of the latent space). Now it should be clearer why we decided to choose multivariate Gaussian perturbations during the assumptions phase. It's not philosophical, really! (Now, let's hear it from the "Central Limiters".)


It's just a mathematical convenience. As you can see, the above equation requires us to multiply a variety of probability densities. Multiplying Gaussians is easy, and that is why we chose our assumptions as we did. That is not to say that you can't pick other forms. In fact, there is a large body of research on handling the intractabilities of such models, which I leave to future posts.


Inspecting the above equation carefully, we come to realise that we know everything about it except the distribution over the latent variable at the first time-step. There are multiple ways to handle this. For now, let us assume that it is a prior on the latent variable at the first time-step that we specify.


The Evidence Lower Bound (ELBO):


So far, we have derived the main equations needed for our model. We discussed the probabilistic assumptions, derived some conditional distributions, and detailed (at a high level) the quantity we would like to maximise to fit our free parameters. Before positing the complete optimisation problem, one more problem needs to be carefully revisited.


Remember the equation with all the integrals? Yeah, that's a scary one! See, we have not yet discussed how to compute or handle any of these integrals. In fact, this is the hardest part of our problem, and it can easily become intractable.


A brilliant colleague of mine, who is an expert in these problems, once said, and I paraphrase: "You know, I spent my career learning numerous techniques that allow me to integrate extremely hard problems. I never thought an integration course I took in my first college year could have such a profound effect on me." Sipping my drink, I thought: oh, sure, integration. Want another drink?


I never really understood that statement until encountering variational inference. Apart from its numerous other applications, here it is again: integration pops up in machine learning too. Hmm! My colleague could be on to something!


From my point of view, the beauty of variational inference is that it transforms hard integrals (like the one we have above) into optimisation problems, which can be handled more easily. Well, this statement is somewhat controversial, as the resulting optimisation problems are non-convex, and those are themselves hard to solve. Nonetheless, we can still find an acceptable solution using algorithms like Adam, SGD, and others.


Rather than computing our integrals exactly, we are going to resort to an approximation. In other words, note that the quantity we want to approximate is a probability density which we cannot compute exactly. Therefore, we are going to assume a family of parameterised distributions (in the latent space) and attempt to fit its parameters so as to acquire the "best" approximation possible. We call this distribution the variational distribution and its parameters the variational parameters.


With this assumption, we can now rewrite our original problem into a more tractable one. We are going to do this in three steps:

Step 1 - Introduce the variational distribution into the marginalisation equation: In this step, we simply take our marginalisation equation and multiply and divide by the variational distribution, as shown below:
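Writing $q(z_{1:T})$ for the variational distribution (symbol chosen here), multiplying and dividing leaves the marginal unchanged:

$$p(x_{1:T}) = \int q(z_{1:T}) \, \frac{p(x_{1:T}, z_{1:T})}{q(z_{1:T})} \, \mathrm{d}z_{1:T}.$$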

Step 2 - Introduce a logarithm to the optimisation problem: As you recall, our goal is to maximise the marginal density of our observations. Instead of maximising a function itself, one can maximise the logarithm of that function. Here, you might have two questions:

1) Why is this statement true? And 2) why do we need to introduce such a function to our problem?


2) Why do we need to introduce such a function? Let me answer the second question first. The logarithm has some interesting properties that will allow us to arrive at a simpler problem in the end. For example, the log of a product becomes a sum of logs. This property, for instance, is beneficial for us, as we have lots of products in our problem (remember factoring our distribution in Concept II).

1) Why is this statement true? As for why introducing a log keeps the solution of the problem unchanged, there are two explanations. The "kind-of" intuitive one goes as follows. Imagine you would like to maximise a function in one dimension. Somewhere in your computation, this involves setting its derivative to zero. Well, when would the derivative of the logarithm of the function be zero? Check it out. Now, to the more formal explanation, which goes as follows:
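A compact version of that argument, for any strictly increasing function $g$ (such as the logarithm):

$$x^{\star} = \arg\max_{x} f(x) \iff f(x^{\star}) \ge f(x) \; \forall x \iff g\!\left(f(x^{\star})\right) \ge g\!\left(f(x)\right) \; \forall x \iff x^{\star} = \arg\max_{x} g\!\left(f(x)\right).$$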


As we can see from the above argument, this works not only for the logarithm but for any strictly increasing function. Cool, so now we know why introducing a log doesn't hurt us, and in fact helps us.

Introducing the logarithm, our problem (in the notation above) can now be written as:
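$$\max \; \log p(x_{1:T}) = \max \; \log \int q(z_{1:T}) \, \frac{p(x_{1:T}, z_{1:T})}{q(z_{1:T})} \, \mathrm{d}z_{1:T}.$$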

Step 3 - Jensen's Inequality: If we are to benefit from the logarithm, we had better attempt to take it inside the integration sign -- then we can, for example, transform all products into summations -- what a treat! How are we going to do that, though? Here is where Jensen's inequality comes to the rescue. So, what does Jensen say?
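For a random variable $X$ and a function $g$, Jensen's inequality states:

$$g\!\left(\mathbb{E}[X]\right) \le \mathbb{E}\!\left[g(X)\right] \text{ for convex } g, \qquad g\!\left(\mathbb{E}[X]\right) \ge \mathbb{E}\!\left[g(X)\right] \text{ for concave } g.$$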



Namely, if we have a random variable and we apply a convex function to its expectation, we can take the function inside the expectation by introducing a less-than-or-equal sign. The same can be done for a concave function (our case), but with a greater-than-or-equal sign.


Cool! How does this apply to our problem, then? Well, the integration signs are nothing but an expectation, which we are transforming using a concave function (the logarithm). Hence, we can write:
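Recognising the integral as an expectation under $q$ and applying Jensen's inequality with the (concave) logarithm gives the evidence lower bound:

$$\log p(x_{1:T}) = \log \mathbb{E}_{q(z_{1:T})}\!\left[\frac{p(x_{1:T}, z_{1:T})}{q(z_{1:T})}\right] \ge \mathbb{E}_{q(z_{1:T})}\!\left[\log \frac{p(x_{1:T}, z_{1:T})}{q(z_{1:T})}\right] =: \mathrm{ELBO}(q).$$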


Simply put, we realised that all the integrals correspond to an expectation (under the variational distribution) of the ratio of the joint density over observed and latent variables to the variational distribution. This should be easy to see from the definition of the expectation for continuous random variables.


The Optimisation Problem:


We are now at the final step, where we put everything together and look at our optimisation problem. Hooray!


In doing so, we come to realise that we need to make some assumptions about the structure of our variational distribution. There are a multitude of ways we can choose such a distribution. The better the structure, the more informative the distribution, and the better our approximation. I am going to stick to a "simple" one, where my variational distribution factorises analogously to the latent variable distribution. By no means is this the only way to do so. Interested readers are referred here for an in-depth discussion of this topic. With this, our derivations can be written as:
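With a variational distribution factorising analogously to the latent prior (one plausible reconstruction in our notation),

$$q(z_{1:T}) = q(z_1) \prod_{t=2}^{T} q(z_t \mid z_{t-1}),$$

the objective to maximise becomes:

$$\mathrm{ELBO} = \mathbb{E}_{q}\!\left[\log p(z_1) - \log q(z_1) + \sum_{t=2}^{T} \left(\log p(z_t \mid z_{t-1}) - \log q(z_t \mid z_{t-1})\right) + \sum_{t=1}^{T} \log p(x_t \mid z_t)\right].$$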



Finally, we have the optimisation objective that we would like to maximise in order to fit the free parameters. Now, we can substitute all of our modelling assumptions (e.g., the Gaussian distributions) and proceed to optimise. Please note that these parameters include all the parameters from our modelling assumptions (e.g., the parameters of the functions f and g, the covariance matrices, and the initial state distribution -- we call these the model parameters), as well as those of the variational distribution. To learn these two sets of parameters (i.e., model and variational), we can proceed using, for example, alternating optimisation, where we fix the first set, update the second, and then repeat. Other authors proceed by applying a gradient step to both sets of parameters jointly (e.g., executing Adam on both). There is also an interesting direction of work that uses natural gradients, which can be very efficient for Gaussian distributions under the natural parameterisation. A sketch of the joint-gradient variant follows below.
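To make the fitting loop concrete, here is a minimal sketch (assumptions, all mine: PyTorch as the framework, linear f and g, diagonal noise covariances, a standard-normal prior on the first latent, and, for brevity, a fully factorised mean-field variational distribution rather than the conditional factorisation above). It takes joint single-sample Monte Carlo gradient steps on the model and variational parameters with Adam:

```python
import torch

T, d_z, d_x = 50, 2, 4
x = torch.randn(T, d_x)                  # observed sequence (dummy data here)

# Model parameters: transition f, emission g, and diagonal noise scales.
f = torch.nn.Linear(d_z, d_z)            # latent dynamics: z_{t+1} ~ N(f(z_t), .)
g = torch.nn.Linear(d_z, d_x)            # emissions:       x_t     ~ N(g(z_t), .)
log_sig_z = torch.zeros(d_z, requires_grad=True)
log_sig_x = torch.zeros(d_x, requires_grad=True)

# Variational parameters: one diagonal Gaussian per time-step (mean-field).
q_mu = torch.zeros(T, d_z, requires_grad=True)
q_log_sig = torch.zeros(T, d_z, requires_grad=True)

params = list(f.parameters()) + list(g.parameters()) \
         + [log_sig_z, log_sig_x, q_mu, q_log_sig]
opt = torch.optim.Adam(params, lr=1e-2)

def gauss_logpdf(y, mu, log_sig):
    """Summed log-density of a diagonal Gaussian."""
    return torch.distributions.Normal(mu, log_sig.exp()).log_prob(y).sum()

for step in range(2000):
    opt.zero_grad()
    # One reparameterised sample z ~ q: a single-sample Monte Carlo ELBO.
    z = q_mu + q_log_sig.exp() * torch.randn(T, d_z)
    log_joint = (
        gauss_logpdf(z[0], torch.zeros(d_z), torch.zeros(d_z))  # prior p(z_1)
        + gauss_logpdf(z[1:], f(z[:-1]), log_sig_z)             # transitions
        + gauss_logpdf(x, g(z), log_sig_x)                      # emissions
    )
    log_q = gauss_logpdf(z, q_mu, q_log_sig)                    # entropy part
    elbo = log_joint - log_q
    (-elbo).backward()            # maximise the ELBO = minimise its negative
    opt.step()
```

Swapping the mean-field q for a conditional one, or alternating updates between the model and variational parameter sets, fits the same template.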


As a final caution, please note that the number of parameters of our variational distribution can become large, as each time-step requires a different set of parameters. This can be handled in different ways. An interesting approach I have read uses a recurrent neural network to parameterise these. You might want to have a look:

  • https://arxiv.org/pdf/1801.10395.pdf



Cool! Now you should be ready to apply such ideas to reinforcement learning for a more general problem definition. Stay posted for Part II!


This blog post would never have been possible without the help of others. Shout-out to:

Rasul Tutunov (Proofreading and checking the maths)

Simon Tatham (Proofreading, commenting, and proposing improvements)


Lectures by Modelling Gurus, have a look:



If you found this post useful in your research please cite as:

Bou Ammar, H. and Tutunov, R., "Reinforcement Learning as Probabilistic Modelling: A Variational Inference Formulation (Part I)", blog post, 2019.





[I don't own any of the pictures (including the cover) in this post]


