Bijectors of Tensorflow Probability - A Guide to Understand the Motivation and Mathematical Intuition Behind Them

Posted November 7, 2021 by Gowri Shankar ‐ 11 min read

A bijector is a function of a tensor and its utility is to transform one distribution to another distribution. Bijectors bring determinism to the randomness of a distribution where the distribution by itself is a source of stochasticity. For example, If you want a log density of distribution, we can start with a Gaussian distribution and do log transform using bijector functions. Why do we need such transformations, the real world is full of randomness and probabilistic machine learning establishes a formalism for reasoning under uncertainty. i.e A prediction that outputs a single variable is not sufficient but has to quantify the uncertainty to bring in model confidence. Then to sample complex random variables that get closer to the randomness of nature, we seek the help of bijective functions.

In this post, we shall learn the mathematical intuition and the motivation behind the Bijector functions of Tensorflow in detail, this post is one of many posts that comes under the category Probabilistic Deep Learning focusing on uncertainty and density estimation. Previous posts can be referred to here

Objective

The objective of this post is to understand the bijective functions and their properties to understand the motivation and mathematical intuition behind the Bijectors of Tensorflow Probability(TFP) package.

Introduction

Time travel to your 10th grade - bijections, bijective functions or invertible functions is not something new to us. We have studied them all in our set theory classes, oh yeah they are just maps. Each element of the first set is paired with exactly one set in the second set. The elements of the first set are called the domain and the elements of the second set are called co-domain. Few rules that define what exactly a function is,

  • All elements of set 1 must have a pair in the second set
  • If set 1 and set 2 elements are related by a single pair then the relationship is called as one-one function
  • If more than 1 element in set 1 have the same pair in set 2, then the relationship is called a many-one function
  • If every element in set 2 has a corresponding domain and a few elements of set 1 paired to the same element in set 2, then it is called onto function - surjective, non-injective
  • If there is an element in the co-domain that has no corresponding domain, then it is called into function or injective, non-surjective function
  • If there is at least a single element in set 1 that has no co-domain, then there exists no relationship between set 1 and set 2 - hence that is not a function at all.

Surjective and Injective

That set of rules leaves us with one combination that defines bijection precisely - i.e. A function that is injective, surjective, and one-one is a bijective function, It also means both the sets have the same number of elements. Set theory is cooler than what one can imagine if you closely observe - the whole tech space is built on top of the foundational principles of set theory. A Venn diagram is the classic example of how a relational database works(UNIONS and JOINS) - It is silly to quote yet I am quoting, the term relational in RDBMS is nothing but the above set of rules that we have mentioned to define a function.

Bijection

Distributions are Sources of Stochasticity

There is too much randomness in nature(pun intended) and we are trying to model these natural phenomenons. It is quite difficult to craft the complex distributions that are resultant of randomness observed from nature, hence we start from something simple, say a Gaussian Distribution. In a Gaussian Distribution, the parameters of interest like mean and standard deviations are constrained but our estimation requires an unconstrained space where a direct sampling is difficult. Traditionally a Markov Chain Monte Carlo method is used for obtaining such sequences of random samples that converge to a target probability distribution, these are time-reversal and volume-preserving in nature. Bijector functions of Tensorflow Probability helps us to create such complex transformation of distributions through a process called Normalizing Flows. Please note, these transformations are deterministic - that is the beauty. Refer to one of our previous articles, a practical guide to normalizing flows to understand implementation details of Bijectors,



Bijectors encapsulsate the change of variables for a probability density.

Roughly speaking, when you (smoothly and invertibly) map one space to another, you also 
induce a map from probability densities on the initial space to densities on the target 
space. In general such transformations warp lengths/areas/volumes (measures) in the initial 
space to different lengths/areas/volumes in the target space. Since densities incorporate 
volume information, we need to keep track of these warpings and account for them in the 
computation of the probability density in the latter space.

By implementing forward and inverse transformations, as well as log Jacobian determinants, 
Bijectors give us all the info we need to transform random samples as well as probability 
densities.

- Chris Sutter, 2019

image.png

The Need for Bijective Functions

In this section, we shall review some of the critical characteristics of Bijective functions - This could be a perfect review of the concepts we often forget when we realize how beguiling math is.

Isomorphism

There are quite a lot of complex definitions for this concept but the one that made me really understand is from graph theory Two graphs G1 and G2 are isomorphic if there exists a matching between their vertices so that two vertices are connected by an edge in G1 if and only if corresponding vertices are connected by an edge in G2. The following diagram makes it much more simple - both these graphs are isomorphic.

Isomorphism

The above geometric intuition can be interpreted in many other forms especially via sets but for absolute clarity, we shall study the mathematical intuition through our favorite notations.

Let us have two sets $A$ and $B$ with their corresponding elements ${a_1, a_2, a_3, \cdots, a_n}$ and ${b_1, b_2, b_3, \cdots, b_m}$ respectively, then if there exist a binary mapping $f$ such that

$$\large f(a_i \oplus b_j) = f(a_i) \otimes f(b_j) \tag{1. Isomorphic Mapping}$$ $$\large f^{-1}(b_r \otimes b_s) = f^{-1}(b_r) \oplus f^{-1}(b_s) \tag{2. Inverse Isomorphic Mapping}$$ $$then, \text{they are isomorpmic}$$

From Group theory, isomorphism preserves the structural integrity of a set - It often maps a complicated group to a simpler one.

Homeomorphism - A Donut and A Coffee Mug are Same

How can a donut becomes a coffee mug - by continuous deformation, refer to the animation below. The key point to take from this metaphor is the volume preservation that we have briefly mentioned at the beginning of this post.

Mug to Donut



The intuitive criterion of stretching, bending, cutting and gluing back together 
takes a certain amount of practice to apply correctly—it may not be obvious from
the description above that deforming a line segment to a point is impermissible, 
for instance. It is thus important to realize that it is the formal definition given 
above that counts. In this case, for example, the line segment possesses infinitely 
many points, and therefore cannot be put into a bijection with a set containing only
a finite number of points, including a single point.

- Wikipedia

The formal definition of homeomorphic function has one-one mapping between sets such that both the function and its inverse are continuous and that in topology exists for geometric figures which can be transformed one into the other by elastic deformation.

Diffeomorphism

If you notice the definition of isomorphism and homeomorphism, the relationship between the sets/groups is established stronger with the addition of more and more properties. This closeness eventually lead to minimal KL-Divergence a common measure of the difference between two distributions or a loss function for stochasticity measurements. A diffeomorphism is a stronger form of equivalence than a homeomorphism. The formal definition of diffeomorphic function is it is a map between manifolds that is differentiable and has a differentiable inverse


A manifold is a topological space that is locally Euclidean (i.e., around every point, there 
is a neighborhood that is topologically the same as the open unit ball in R^n). To illustrate
this idea, consider the ancient belief that the Earth was flat as contrasted with the modern
evidence that it is round. The discrepancy arises essentially from the fact that on the small
scales that we see, the Earth does indeed look flat. In general, any object that is nearly 
"flat" on small scales is a manifold, and so manifolds constitute a generalization of objects 
we could live on in which we would encounter the round/flat Earth problem, as first codified 
by Poincaré.

More concisely, any object that can be "charted" is a manifold.

One of the goals of topology is to find ways of distinguishing manifolds. For instance, 
a circle is topologically the same as any closed loop, no matter how different these 
two manifolds may appear. Similarly, the surface of a coffee mug with a handle is topologically 
the same as the surface of the donut, and this type of surface is called a (one-handled) torus.

- Wolfram Mathworld

One simple way to understand diffeomorphism is - Let us say we believe ourselves, geometrically speaking a point from nowhere means our beliefs are personal. Then we find another person who has the same beliefs, geometrically a line connecting two points in a 2D space. Similarly, the dimension increases and the belief manifests across through the dimensions. If the beliefs are differentiable and invertible through the dimensions then we call it diffeomorphic bijective belief(function) - pun intended.

Wikipedia says, Given two manifolds ${\displaystyle M}$ and ${\displaystyle N}$, a differentiable map ${\displaystyle f\colon M\rightarrow N}$ is called a diffeomorphism if it is a bijection and its inverse ${\displaystyle f^{-1}\colon N\rightarrow M}$ is differentiable as well. If these functions are ${\displaystyle r}$ times continuously differentiable, ${\displaystyle f}$ is called a ${\displaystyle C^{r}}$-diffeomorphism.

Permutation Group

This is more simple to grasp, the group consisting of all possible permutations of ${\displaystyle N}$ objects is known as the permutation group. of the order ${\displaystyle N}!$

Projective Map

It is all about perspective, imagine a negative film is projected from a distance placed on a plane - that is pixel-pixel, line-line, object-object mapping. Otherwise called isomorphism of vector spaces or tensors. A picture worth a million words also I doubt whether I am qualified enough to scribe this in worlds.

Homography


Use of cross-ratios in projective geometry to measure real-world dimensions of 
features depicted in a perspective projection. A, B, C, D and V are points on the 
image, their separation given in pixels; A', B', C' and D' are in the real world, 
their separation in metres.

In (1), the width of the side street, W is computed from the known widths of the adjacent shops.
In (2), the width of only one shop is needed because a vanishing point, V is visible.

- Wikipedia

Bijectors of Tensorflow Probability

We studied distributions are sources of stochasticity that collects properties of probability that include ${\mu - mean, \sigma - standard deviation}$. Bijectors make the deterministic transformations for the distributions as per the properties and conditions we have studied in the previous sections. They are the abstraction that enables efficient, composable manipulations of probability distributions by ensuring the following,

  1. Simple but robust and consistent interface for the programmers to manipulate the distributions
  2. Provide rich high dimensional densities that are invertible and volume-preserving mappings
  3. Provide efficient and fast volume adjustments through log determinant of the Jacobian matrix concerning dimensionality
  4. Decouple stochasticity from determinism through simplistic design and flexibility
  5. Encapsulate the complexities through imparting modularity and exploiting algebraic relationships among random variables

For example, Bijector APIs provides an interface for diffeomorphism for a random variable $\displaystyle X$ and its diffeomorphism $\displaystyle F$ as follows,

$$Y = F(X) \tag{3. Density Function}$$ $$\Large p_{Y(y)} = p_X \left(F^{-1}(y)\right)|DF^{-1}(y)| \tag{4. Diffeomorphism in Bijectors}$$

Where,

  • $DF^{-1}$ is the inverse of the Jacobian of $\displaystyle F$

Core of Bijectors

The key objective of a Bijector is to transform a Tensor of certain random distribution to a target distribution and ensure it is invertible. From Joshua V. Dillon et al’s paper title Tensorflow Distributions,

  • forward implements $\displaystyle x \mapsto F(x)$, and is used by TransfromedDistribution.sampleto convert one random outcome into another. It also establishes the name of the bijector.
  • inverse undeos the transformation $\displaystyle y \mapsto F^{-1}(y)$ and is used by TransfromedDistribution.log_prob
  • inverse_log_det_jacobian computes $log|DF^{-1}(y)|$ and is used by TransfromedDistribution.log_prob to adjust for how the volume changes by the transformation. . In certain settings, it is more numerically stable (or generally preferable) to implement the forward_log_det_jacobian.Because forward and reverse log ◦ det ◦ Jacobians differ only in sign, either (or both) may be implemented.

Conclusion

Probabilistic Machine/Deep Learning gives the measure of uncertainty for our predictions, making it a more human-like approach towards achieving convergence. Although this post is predominantly focused on the properties and mathematical intuition behind bijective functions, the intent is to understand the motivation behind the Bijectors of TFP. I hope I covered all the nuances of bijective functions in detail with relevant examples for the readers' pleasure. This post is inspired by Joshua V. Dillon et al’s paper on TFP architecture, I thank Joshua and his team for motivating me to write this short one. I also thank my readers and say goodbye until one more adventure with math and its utility.

References