Advances in deep learning have pushed generative learning into new and complex domains such as molecule design, music, voice, image, and program generation. These advances have been made using models with continuous latent variables, in spite of the computational efficiency and greater interpretability offered by discrete latent variables, because continuous latent variable models have proven to be much easier to train. Unfortunately, problems such as clustering, semi-supervised learning, and variational memory addressing all require discrete variables. Thus, efficient training of machine learning models with discrete variables remains an important challenge.

The main reason behind the success of continuous latent variable models is the reparameterization trick, which provides low-variance gradient estimates during training. The basic idea of reparameterization is to separate sampling from a parametric distribution into two steps: samples are first drawn from a simple, parameter-free base distribution, and then a differentiable mapping (which depends on the parameters) is applied to these samples. The mapping is designed such that the base samples are transformed to samples from the desired distribution. This may sound complicated, but in practice it is straightforward. For example, let's say we would like to sample from a Normal distribution parameterized by the mean \(\mu\) and standard deviation \(\sigma\). The base distribution in this case is a standard Normal distribution with zero mean and unit variance. Samples from this distribution, represented by \(\epsilon \sim \mathcal{N}(0, 1)\), can be mapped to our desired distribution using the function \(f(\epsilon,\mu,\sigma)=\mu + \sigma \epsilon\). Since \(f(\epsilon,\mu,\sigma)\) is differentiable with respect to \(\mu\) and \(\sigma\), we can use \(f\) at the core of deep generative models that train Normal distributions.
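As a minimal sketch of the Normal example above, the following Python snippet draws base samples once and maps them through \(f(\epsilon,\mu,\sigma)=\mu + \sigma \epsilon\). The function names and the values \(\mu=2\), \(\sigma=0.5\) are chosen here purely for illustration and do not come from any DVAE code.

```python
import random
import statistics

def sample_normal_reparam(mu, sigma, eps):
    """Map base samples eps ~ N(0, 1) to samples from N(mu, sigma^2)."""
    return [mu + sigma * e for e in eps]

def grad_f(eps):
    """Partial derivatives of f = mu + sigma * eps: (df/dmu, df/dsigma)."""
    return 1.0, eps

# Step 1: draw from the parameter-free base distribution.
rng = random.Random(0)
eps = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

# Step 2: apply the differentiable mapping (illustrative parameter values).
samples = sample_normal_reparam(2.0, 0.5, eps)
sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)
```

Because the randomness lives entirely in `eps`, the derivatives returned by `grad_f` are well defined for every sample, which is what makes gradient-based training possible.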

The difficulty in training binary latent variable models arises from the fact that there is no differentiable function for sampling binary variables. For example, in order to sample from a Bernoulli distribution with mean \(\mu\), we typically sample \(\epsilon\) from the uniform distribution \(\mathcal{U}(0, 1)\) defined over the range \( [0,1] \). If \(\epsilon \geq 1-\mu\), we select 1; otherwise we select 0. The mapping function \(f(\epsilon,\mu)\) for Bernoulli random variables can be represented using the step function shown below for the fixed \(\epsilon=0.5\). As can be seen, the gradient of the mapping function \(f(\epsilon,\mu)\) with respect to \(\mu\) is 0 for \(\mu \neq 0.5\) and undefined at \(\mu = 0.5\), where the function jumps from 0 to 1. In other words, \(f(\epsilon,\mu)\) is not differentiable at the jump and carries no gradient information everywhere else.
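The step-function behaviour can be checked numerically. The sketch below implements the thresholding rule described above and probes it with a central finite difference; the helper `finite_difference` and the step size `h` are illustrative choices, not part of any DVAE implementation.

```python
def sample_bernoulli(eps, mu):
    """Threshold a uniform sample: return 1 if eps >= 1 - mu, else 0."""
    return 1 if eps >= 1.0 - mu else 0

def finite_difference(f, eps, mu, h=1e-4):
    """Central finite-difference estimate of df/dmu at fixed eps."""
    return (f(eps, mu + h) - f(eps, mu - h)) / (2 * h)

eps = 0.5

# Away from the jump the estimate is exactly zero.
flat_grads = [finite_difference(sample_bernoulli, eps, mu)
              for mu in (0.1, 0.3, 0.7, 0.9)]

# At the jump the estimate equals 1 / (2h) and grows without bound as h shrinks.
jump_grad = finite_difference(sample_bernoulli, eps, 0.5)
```

Neither value is useful for gradient descent: the zero gradients give no learning signal, and the value at the jump depends only on the step size.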

At Quadrant, we have developed several methods for training deep generative models with binary latent variables. These frameworks are known as "Discrete Variational Autoencoders", or DVAEs for short. The common theme in all these frameworks is the idea of relaxing binary variables to continuous variables so that the mapping function becomes differentiable. Here, we will review the relaxation proposed in the latest framework, called DVAE#, but you can find links to the technical papers describing all of these models in the references at the end of this post.

Let \(z \in \{0, 1\}\) be a Bernoulli random variable with mean parameter \(\mu\), i.e. \(P(z=1)=\mu\). We define a probabilistic smoothing transformation of \(z\) using the conditional distribution \(r(\zeta|z)\) defined for \(\zeta \) in the continuous range \([0,1]\). This transformation defines a distribution for \(\zeta\) such that if \(z\) is 0, \(\zeta\) is likely to have a value near 0, and if \(z\) is 1, \(\zeta\) is likely to be near 1. One approach for defining the smoothing transformation is to use power-function distributions as follows:

\(r(\zeta|z=0) = \frac{1}{\beta} \zeta ^{\frac{1}{\beta} - 1} \) and \(r(\zeta|z=1) = \frac{1}{\beta} (1-\zeta) ^{\frac{1}{\beta} - 1} \).

\(\beta \) is a scalar parameter that controls the sharpness of the transformation. This smoothing transformation is visualized below:
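The two power-function densities can also be written down and sampled directly. The sketch below assumes \(\beta > 1\) and uses the closed-form CDFs that follow from integrating the densities above, \(\zeta^{1/\beta}\) for \(z=0\) and \(1-(1-\zeta)^{1/\beta}\) for \(z=1\); the function names and \(\beta=5\) are illustrative assumptions.

```python
import random

def r_density(zeta, z, beta):
    """Density r(zeta | z) of the power-function smoothing, for zeta in (0, 1)."""
    base = zeta if z == 0 else 1.0 - zeta
    return (1.0 / beta) * base ** (1.0 / beta - 1.0)

def sample_r(rho, z, beta):
    """Inverse-CDF sample from r(zeta | z), given rho ~ U(0, 1)."""
    # z = 0: CDF is zeta^(1/beta), so zeta = rho^beta.
    # z = 1: CDF is 1 - (1 - zeta)^(1/beta), so zeta = 1 - (1 - rho)^beta.
    return rho ** beta if z == 0 else 1.0 - (1.0 - rho) ** beta

rng = random.Random(0)
beta = 5.0  # illustrative sharpness value
rhos = [rng.random() for _ in range(100_000)]

# The expected values are 1 / (beta + 1) for z = 0 and beta / (beta + 1) for z = 1.
mean_given_z0 = sum(sample_r(rho, 0, beta) for rho in rhos) / len(rhos)
mean_given_z1 = sum(sample_r(rho, 1, beta) for rho in rhos) / len(rhos)
```

Larger values of `beta` push the two means closer to 0 and 1 respectively, which matches the role of \(\beta\) as a sharpness parameter.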

Now, we can define the purely continuous mixture distribution \(q(\zeta)= (1- \mu) r(\zeta|z=0) + \mu r(\zeta|z=1) \) which mixes \(r(\zeta|z=0)\) and \(r(\zeta|z=1)\) with weights \((1- \mu)\) and \(\mu\) respectively. The mixture is visualized below for \(\mu=0.75\):

As can be seen, the mixture distribution \(q(\zeta)\) assigns high probability to values of \(\zeta\) close to 1 because \(\mu=0.75\). It is easy to show that as \(\beta\) approaches \(\infty\), \(q(\zeta)\) approaches the probability mass function defined on \(z\). When \(\beta\) is finite, \(q(\zeta)\) acts as a continuous relaxation of the Bernoulli distribution defined on \(z\).
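One way to see the limiting behaviour numerically is through the CDF of the mixture, which follows from integrating the two power-function densities. The sketch below is illustrative only: `q_cdf` and `mass_above_half` are hypothetical helper names, and the chosen values of \(\beta\) are arbitrary.

```python
def q_cdf(zeta, mu, beta):
    """CDF of q(zeta) = (1 - mu) r(zeta | z=0) + mu r(zeta | z=1)."""
    cdf_z0 = zeta ** (1.0 / beta)
    cdf_z1 = 1.0 - (1.0 - zeta) ** (1.0 / beta)
    return (1.0 - mu) * cdf_z0 + mu * cdf_z1

def mass_above_half(mu, beta):
    """Probability that zeta > 0.5 under the mixture q."""
    return 1.0 - q_cdf(0.5, mu, beta)

mu = 0.75
# As beta grows, the mass above 0.5 approaches P(z = 1) = mu.
mass_beta_5 = mass_above_half(mu, 5.0)
mass_beta_100 = mass_above_half(mu, 100.0)
```

For \(\mu=0.75\) the mass above 0.5 moves toward 0.75 as \(\beta\) increases, consistent with \(q(\zeta)\) approaching the Bernoulli probability mass function.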

Since \(q(\zeta)\) is a probability density function defined on a continuous random variable, we can define a differentiable mapping function that converts samples from a uniform distribution to samples from \(q(\zeta)\). This mapping is the inverse of the cumulative distribution function (CDF) of \(q(\zeta)\). Below, we visualize the mapping as a function of \(\mu\) for the fixed \(\epsilon=0.5\) and several values of \(\beta\). As you can see, the mapping function is now differentiable with respect to \(\mu\) and closely approximates the step function used for Bernoulli sampling as \(\beta\) increases.

However, this is not the whole story of DVAEs. DVAEs are designed to train variational autoencoders whose binary latent variables follow powerful distributions called Boltzmann machines, which allow for rich correlations between different binary variables. We have released an implementation of DVAE# that makes it easy to develop models with binary latent variables. If you are interested in smoothing transformations, or you would like to train discrete variational autoencoders with Boltzmann prior distributions, we encourage you to check our code available here. The repository contains not only an implementation of DVAE# but also its predecessors DVAE++ and DVAE. To train DVAEs with Boltzmann priors, efficient sampling libraries are also required. At Quadrant, we have developed QuPA, a TensorFlow sampling library that uses population annealing to efficiently sample from Boltzmann machines. You can get this library here.
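To make the inverse-CDF mapping discussed earlier concrete, here is a small sketch that inverts the mixture CDF by bisection and estimates its derivative with respect to \(\mu\) by finite differences. This is not the released DVAE# implementation: the bisection solver, the finite-difference step, and all function names are assumptions made for illustration.

```python
def q_cdf(zeta, mu, beta):
    """CDF of the mixture q(zeta) built from power-function smoothing."""
    cdf_z0 = zeta ** (1.0 / beta)
    cdf_z1 = 1.0 - (1.0 - zeta) ** (1.0 / beta)
    return (1.0 - mu) * cdf_z0 + mu * cdf_z1

def q_inverse_cdf(eps, mu, beta, iters=200):
    """Solve q_cdf(zeta) = eps for zeta in [0, 1] by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if q_cdf(mid, mu, beta) < eps:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def grad_wrt_mu(eps, mu, beta, h=1e-4):
    """Central finite-difference estimate of d zeta / d mu at fixed eps."""
    upper = q_inverse_cdf(eps, mu + h, beta)
    lower = q_inverse_cdf(eps, mu - h, beta)
    return (upper - lower) / (2 * h)

eps = 0.5
# Unlike the Bernoulli step function, the gradient is finite and nonzero.
grads = [grad_wrt_mu(eps, mu, 5.0) for mu in (0.1, 0.3, 0.5, 0.7, 0.9)]

# Larger beta pushes the mapping toward the hard 0/1 decision.
zeta_soft = q_inverse_cdf(eps, 0.75, 5.0)
zeta_sharp = q_inverse_cdf(eps, 0.75, 20.0)
```

In a real training setup one would differentiate through the mapping with an automatic differentiation framework rather than finite differences; the point here is only that the relaxed mapping has a usable gradient for every \(\mu\).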

You can find the technical details of different DVAE variants in the following papers:

[1] Discrete Variational Autoencoders, by Jason Tyler Rolfe, ICLR 2017, paper.

[2] DVAE++: Discrete Variational Autoencoders with Overlapping Transformations, by Arash Vahdat, William G. Macready, Zhengbing Bian, Amir Khoshaman, Evgeny Andriyash, ICML 2018, paper.

[3] DVAE#: Discrete Variational Autoencoders with Relaxed Boltzmann Priors by Arash Vahdat*, Evgeny Andriyash*, William G. Macready, NIPS 2018, paper.