### KL-Divergence, Relative Entropy in Deep Learning

Posted April 10, 2021 by Gowri Shankar ‐ **5 min read**

This is the fourth post on Bayesian approach to ML models. Earlier we discussed uncertainty, entropy - measure of uncertainty, maximum likelihood estimation etc. In this post we are exploring KL-Divergence to calculate relative entropy between two distributions.

We shall see, how KL-Divergence works for Cats and Dogs classification problem.

Image Credit: www.mentalfloss.com

Previous Posts in this Series

- Bayesian and Frequentist Approach to Machine Learning Models
- Understaning Uncertainty, Deterministic to Probabilistic Neural Networks
- Shannon’s Entropy, Measure of Uncertainty When Elections are Around

## Objective

Our objective is to get answers to the following questions

- What is KL Divergence?
- Why
`log`

likelihood? - What is Likelihood ratio?
- Is expected values - Weighted average of instances of Random values?
- Is expected value of log likelihood ratio is KLD?
- What can KL-Divergence be used for?
- How are KL-Divergence and Cross Entropy related?
- Is KL-Divergence a distance measure?
- How are KL-Divergence and Log-Likelihood related?
- Is KLD asymmetric?
- What are forward and reverse KL?

## Introduction

KL-Divergence is a measure of how two distributions differ from each others. Some of very well known probability density distribution plots

Image Credit: ∫ntegrabℓε ∂iﬀerentiαℓs

Let us say we are building a deep neural network that classifies dogs and cats, for a dog picture - The probability of classifying a dog as dog by a perfect neural network is 1. i.e

$$\large p_{\theta}(dog) = 1, p_{\theta}(cat) = 0$$ where $\theta$ represents the reality.

**Network Performance**

However, our neural networks is governed by the estimated weights $\phi$. i.e for a decent neural network, outcome could be
$$\large q_{\phi}(dog) = 0.8, q_{\phi}(cat) = 0.2$$
Meanwhile, if our network behaves adversarial to the expectations… we seek deeper explanations. i.e.
$$\large q_{\phi}(dog) = 0.2, q_{\phi}(cat) = 0.8$$

Hence, we seek a clarity to declutter this uncertainity by measuring the probabilities relatively.

## KL-Divergence

In the above representations, we introduced two probabilities $p_{\theta}$ representing the distribution of the reality and $q_{\phi}$ representing the distribution of the estimator(DNN).

From a dataset $X$ with observations ${x_1, x_2, x_3, \cdots, x_n}$, the probability for a single observation($i$) can be written as below as a representation of two distributions, $$p_{\theta}(x_i), q_{\phi}(x_i)$$

A typical measure of difference is subtracting $p_{\theta}$ and $q_{\phi}$ but we are dealing with very small numbers which might result in rounding to zero. To avoid this, we compute `log`

differences or `loglikelihood`

.
i.e instead of
$$p_{\theta}(x_i) - q_{\phi}(x_i)$$
We take logarithmic difference
$$log p_{\theta}(x_i) - log q_{\phi}(x_i)$$
$$i.e.$$
$$\large log \frac { p_{\theta}(x_i)}{q_{\phi}(x_i)} \tag{1. Log Likelihood Ratio}$$

We saw the behavior for a single observer $i$ but our interest is to understand the likelihood of the whole sample set. Which is what is the average likelihood or the tendency of the random variables

### KLD from Expected Value for Discrete Random Variables

Since every observation has a role in determining the probability, we have to give give importance to the ones with maximum likelihood. Hence a weighted average to all the observations are computed

$$\mathbb{E_{p\theta}}[h(X)] = \sum_{i=1}^\infty h(x_i)p_{\theta}(x_i) \tag{2. Expected Values}$$

Our goal is to find the average difference between the distributions $p_{\theta}$ and $q_{\phi}$, here expected values comes handy to calculate the average log likelihood ratio.

from eq.1 and eq.2

$$\sum_{i=1}^{\infty}p_{\theta}(x_i)log \frac {p_{\theta}(x_i)}{q_{\phi}(x_i)} \tag{3. Expected Log Likelihood Ratio}$$ $$i.e.$$ $$\large \mathbb{E_p}\left[log \frac {p_{\theta}(x)}{q_{\phi}(x)}\right ] = \sum_{i=1}^{\infty}p_{\theta}(x_i)\left[log \frac {p_{\theta}(x_i)}{q_{\phi}(x_i)}\right ] \tag{4. KL-Divergence}$$

### Continuous Random Varaibles

For continuous random variables $$D_{KL}(p_{\theta} || q_{\phi}) = \int_{\mathbb{R}} p_{\theta}(x)\left[log \frac {p_{\theta}(x)}{q_{\phi}(x)}\right ]dx \tag{5. Forward KLD}$$ $$and$$ $$D_{KL}(q_{\phi} || p_{\theta}) = \int_{\mathbb{R}} q_{\phi}(x)\left[log \frac {q_{\phi}(x)}{p_{\theta}(x)}\right ]dx \tag{6. Reverse KLD}$$

They are not symmetric, i.e. $$D_{KL}(p_{\theta} || q_{\phi}) \neq D_{KL}(q_{\phi} || p_{\theta})$$ Hence it is not distance measure but divergence.

Generally $p$ is the reference distribution and $q$ is approximation. Forward KLs are the cross-entropy losses widely used in machine learning.

### KL-Divergence as Loss Function

In this section let us explore how KL-Divergence is used as a loss function, from eqn.4

$$\large \sum_{i=1}^{\infty}p_{\theta}(x)log p_{\theta}(x) - \sum_{i=1}^{\infty}p_{\theta}(x)log q_{\phi}(x)$$

- The negative part is the cross-entropy $H(p_{\theta}, p_{\phi})$
- The positive part is the negative of entropy $H(p_{\theta}, p_{\theta})$, constant and derived from the dataset

Hence the loss function is $$\large - \sum_{i=1}^{\infty}p_{\theta}(x)log q_{\phi}(x)$$

## Cats and Dogs

We created 2 situations out of our neural network,

- With a decent outcome and $$q_{\phi}(dog) = 0.8, q_{\phi}(cat) = 0.2$$
- With adversarial outcome $$q_{\phi}(dog) = 0.2, q_{\phi}(cat) = 0.8$$

Let us calculate KL-Divergence between reality and estimation

Image Credit: www.searchenginejournal.com

### Case 1: $q_{\phi}(dog) = 0.8, q_{\phi}(cat) = 0.2$

$$1.log_2 \left[\frac {1.0}{0.8}\right] + 0.log_2 \left[\frac {1}{0.2} \right]$$ $$i.e.$$ $$0.3219$$

Observation: KLD value is closer to zero

### Case 2: $q_{\phi}(dog) = 0.2, q_{\phi}(cat) = 0.8$

$$1.log_2 \left[\frac {1.0}{0.2}\right] + 0.log_2 \left[\frac {1}{0.8} \right]$$ $$i.e.$$ $$2.3219$$

Observation: KLD value is far from zero

### Case 3: A perfect model

$$1.log_2 \left[\frac {1.0}{1.0}\right] + 0.log_2 \left[\frac {1}{0.0} \right]$$ $$i.e.$$ $$0$$

Observation: KLD value is zero since it is a perfect model

$$\sum_{i=1}^{\infty}p_{\theta}(x_i)log \frac {p_{\theta}(x_i)}{q_{\phi}(x_i)} \geq 0$$

When two disributions are similar then the KL-Divergence value is close to zero and it increases as the intensity of differences of distributions increases

```
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors
import matplotlib.pyplot as plt
```

## Computing KL Divergence

In this section we shall explore how to use KL-Divergence API from Tensorflow Probabilities module

### Univariate Gausian Distributions

Let us compute KL divergence for two univariate normal distributions using Tensorflow Probabilities functions.

- Create $p$ and $q$ distributions, using $\mu = { 1, 2 }$ and $\sigma={ 1, 0.5 }$
- Compute the divergence using tensorflow distributions

```
𝜇_q = 1
𝜎_q = 1
𝜇_p = 2
𝜎_p = 0.5
distribution_q = tfd.Normal(loc=𝜇_q, scale=𝜎_q)
distribution_p = tfd.Normal(loc=𝜇_p, scale=𝜎_p)
```

```
n_samples = 1000
samples_q = distribution_q.sample(n_samples)
samples_p = distribution_p.sample(n_samples)
plt.figure(figsize=(20, 5))
plt.subplot(1, 2, 1)
plt.hist(samples_q, bins=50, density=True)
plt.subplot(1, 2, 2)
plt.hist(samples_p, bins=50, density=True)
plt.show()
```

Observe the distribution mean and the spread in the above histograms. Further let us calculate the KL divergence

```
tfd.kl_divergence(distribution_q, distribution_p)
```

```
<tf.Tensor: shape=(), dtype=float32, numpy=2.8068528>
```

### Asymmetry

Note when we swap the distribution KLD value varies, hence it is not a measure of distance but divergence

```
tfd.kl_divergence(distribution_p, distribution_q)
```

```
<tf.Tensor: shape=(), dtype=float32, numpy=0.6371457>
```

### Fair Coin vs Unfair Coin

```
distribution_q = tfd.Categorical(probs=[0.5, 0.5])
distribution_p = tfd.Categorical(probs=[0.99, 0.01])
samples_q = distribution_q.sample(n_samples)
samples_p = distribution_p.sample(n_samples)
plt.figure(figsize=(20, 5))
plt.subplot(1, 2, 1)
plt.hist(samples_q, bins=3, density=True)
plt.subplot(1, 2, 2)
plt.hist(samples_p, bins=3, density=True)
plt.show()
```

```
tfd.kl_divergence(distribution_q, distribution_p)
```

```
<tf.Tensor: shape=(), dtype=float32, numpy=1.6144631>
```

#### 2 Fair Coins

```
tfd.kl_divergence(distribution_q, distribution_q)
```

```
<tf.Tensor: shape=(), dtype=float32, numpy=0.0>
```

## Conclusion

In this post, we dived deep into KL-Divergence formula and how it is derived. Also learnt the significance of KL divergence in cross-entropy calculation. Further, we calculated KLD for various probability distributions understood the efficacy of the algorithm.