3rd Wave in India, Covid Debacle Continues - Let us Use Covid Data to Learn Piecewise LR and Exponential Curve Fitting

Posted January 7, 2022 by Gowri Shankar ‐ 8 min read

Deep neural networks models are dominant in their job compared to any other algorithms like support vectors machines or statistical models that are celebrated once. When it comes to big data, without a doubt deep learning models are the defacto choice for convergence. I often wonder what must be making them so efficient, something should be quite obvious and provable. Activation functions, we know activation functions bring in non-linearity to the network layers through the neurons and they do the magic in vogue. ReLU, Sigmoid, and their sister-in-law gang are the piecewise linear functions that create non-linearity to the outcomes. i.e. the activation functions help the neural networks to slice and dice the input space into finer grains and form locally sensitive hash tables. A piecewise linear function in the network can be visualized as a polyhedron(or a cell) with sharp edges is the fundamental building block for achieving convergence in DNNs.

The above explanation for the success of neural networks is most convincing and thought-provoking. Certain beliefs are known, accepted but often unconvincingly because of the gap in understanding the fundamental ideas and concepts. When we attempt to ask more and more questions, we see the light - that is the process of learning. This post has 2 parts, we shall study Piecewise Linear Regression models in the first part and how to fit an exponential curve in the second one. This post falls under the section Fundamentals of AI/ML and Convergence under Math for AI/ML, previous posts under these topics can be referred to here,

Omicron

Image Credit: Omicron in Kerala: 45 of 50 fresh cases arrived from low-risk countries

I thank Tim Scarfe and fearsome(with his knowledge) Dr. Keith Duggar(MIT PhD) for brilliantly explaining the concept of Activation Functions within the context of Piecewise Linear Functions in their 2022 premiere of machine learning street talk. Without the inspiration from them, this article wouldn’t have been possible.

061: Interpolation, Extrapolation, and Linearisation (Prof. Yann LeCun, Dr. Randall Balestriero)
trivia 1: I just finished watching the first 50 minutes of the MLST latest edition, It was like watching a thriller movie - deeply indebted to Tim, Yannic, and Keith for extraordinary rendition of their craft.
trivia 2: Why we all address Dr.Keith Duggar as Dr.Keith Duggar but his compadre Dr.Tim Scarfe and Dr.Yannic Kilcher as Tim and Yannic is quite beyond the scope of this post

DISCLAIMER: This post is purely for academic purposes. I am not suggesting anything beyond learning goals, the analysis is conducted using the latest Covid-19 data.

Objective

The objective of this post is to understand the Piecewise Linear Regression without much mathematics but through coding. We also study how to fit an exponential curve in the second section. With that quest, we cover the following,

An explorative analysis of the Covid-19 dataset
Univariate spline smoothing using SciKit interpolate
PWLF: Piecewise Linear Fitting library for Piecewise Linear Regression
Math behind logarithmic fitting

Introduction

We are in the nascent stages of understanding how deep neural network models are bringing accurate predictions. Most of the models are not explainable and lack sufficient trust for the end-user to let the machines make the decisions. Hence, we seek models to provide recommendations with uncertainty measurements to increase their reliability. For example, the right way to find the area under a curve is to look at the curve with a magnifying lens and dissect them using a knife to make smaller rectangles and then sum the area of smaller rectangles. i.e. Divide and Conquer. The same principle is apt for fitting curves. To demonstrate this, let us load the global Covid-19 dataset and then analyze a portion of it.

Covid data used in this analysis is taken from Our World In Date: Coronavirus (COVID-19) Cases on $7^{th}$ Jan 2022, 12 PM IST

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#%matplotlib inline

covid_df = pd.read_csv("./owid-covid-data.csv", sep=",", parse_dates=["date"])
print(f"No of features: {covid_df.shape[1]}, No of observations: {covid_df.shape[0]}")
print(f"Start Date: {min(covid_df.date)}, End Date: {max(covid_df.date)}")

No of features: 67, No of observations: 152691
Start Date: 2020-01-01 00:00:00, End Date: 2022-01-06 00:00:00

covid_ind_df = covid_df[covid_df.location == "India"]
covid_ind_df.iloc[:, 0:10].tail()

	iso_code	continent	location	date	total_cases	new_cases	new_cases_smoothed	total_deaths	new_deaths	new_deaths_smoothed
64766	IND	Asia	India	2022-01-02	34922882.0	33750.0	18507.000	481893.0	123.0	270.857
64767	IND	Asia	India	2022-01-03	34960261.0	37379.0	22938.571	482017.0	124.0	246.714
64768	IND	Asia	India	2022-01-04	35018358.0	58097.0	29924.571	482551.0	534.0	279.857
64769	IND	Asia	India	2022-01-05	35109286.0	90928.0	41035.143	482876.0	325.0	288.000
64770	IND	Asia	India	2022-01-06	35109286.0	NaN	NaN	482876.0	NaN	NaN

Covid Diagnosis and Growth Curves

Our goal is to fit the new covid cases(Diagnosis) and the total number of covid cases(Growth) curves under a constrained environment to understand Piecewise Linear Regression. We also fit a Logarithmic curve on the growth curve focusing on the period where the first wave ends and up to the peak of the second wave. Waves often result in exponential growth, hence we choose a logarithmic fitting scheme.

def plot_daily_cases(data, scatter=False, predictions_daily=[], predictions_total=[], titleA="", titleB="", sup_title=""):
    fig, axes = plt.subplots(1, 2, figsize=(25, 6))
    if(scatter):
        axes[0].scatter(data.date.values, data.new_cases.values, label='Observed')
        axes[1].scatter(data.date.values, data.total_cases.values, label='Observed')
    else:
        axes[0].plot(data.date.values, data.new_cases.values, label='Observed')
        axes[1].plot(data.date.values, data.total_cases.values, label='Observed')
    if(len(predictions_daily) == len(data)):
        axes[0].plot(data.date.values, predictions_daily, color="red", linestyle='dashed',
     linewidth=5, label='Predicted')

    if(len(predictions_total) == len(data)):
        axes[1].plot(data.date.values, predictions_total, color="red", linestyle='dashed',
     linewidth=5, label='Predicted')
    axes[0].legend()
    axes[1].legend()
    axes[0].set_title(titleA)
    axes[1].set_title(titleB)

    fig.suptitle(sup_title, fontsize=20)

    plt.show()

from scipy.interpolate import UnivariateSpline
def smooth_spline(data, sm_factor=0.5):
    spl = UnivariateSpline(np.arange(len(data)), data, s=len(data))
    spl.set_smoothing_factor(sm_factor)
    return spl(data)

plot_daily_cases(
    covid_ind_df,
    titleA="Daily - No. of Individuals Diagnosed +ve",
    titleB="Growth - Total No of Infected Individuals",
    sup_title="Covid Cases(India)"
)

png

Data Subset

Few observations from the data

We see there are 2 peaks and a shooting count at the tail end - i.e 2 completed waves and the 3rd is about to start
The first wave subsided around September of 2020 - If I remember correctly, August 2020 was the month Amazon started delivering non-essential goods
The second wave started by March 2021

We shall extract the covid new cases and total diagnosed data between October 2020 and March 2021 for understanding Piecewise Linear Regression.

I used univariate spline fit to smooth the data to simplify things

covid_SEP20_MAR21_df = covid_ind_df[(covid_ind_df.date >= "2020-10-01") & (covid_ind_df.date <= "2021-03-30")]
plot_daily_cases(
    covid_SEP20_MAR21_df,
    scatter=True,
    titleA="Daily - No. of Individuals Diagnosed +ve",
    titleB="Growth - Total No of Infected Individuals",
    sup_title="Covid Cases(India) $1^{st}$ Oct, 2020 - $30^{th}$ Mar, 2021"
)

smoothened_data = smooth_spline(covid_SEP20_MAR21_df.new_cases.values, sm_factor=0.1)
covid_SEP20_MAR21_df.loc[:, "new_cases"] = smoothened_data
plot_daily_cases(
    covid_SEP20_MAR21_df,
    scatter=True,
    titleA="Daily - No. of Individuals Diagnosed +ve(Smooth)",
    titleB="Growth - Total No of Infected Individuals",
    sup_title="Covid Cases(India) $1^{st}$ Oct, 2020 - $30^{th}$ Mar, 2021"
)

png

/Users/shankar/dev/tools/anaconda3/envs/nxt/lib/python3.9/site-packages/pandas/core/indexing.py:1773: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)

png

Observe in Pieces

From the graphs we can see, there are 6 different regions where trend shifts during the period between the end of the first wave and the beginning of the second wave. Splits

# !python -m pip install pwlf

Piecewise Linear Regression

We need Piecewise Linear Regression models when the relationship between the dependent variable$(y)$ and the independent variable$(x)$ is continuous and consist of 2 or more linear segments. Then such function is estimated using nonlinear least squares by specifying the number of segments. Let us say the model fit consist of $k$ linear segment(in our example it is 6) then, $$\Large y = \beta_0 + \beta_1 x + \sum_{j=1}^{k-1} \beta_{j+1}(x - \Delta_j)I(x - \Delta_j) \tag{1. Piecewise Linear Regression Model}$$ Where,

${\beta_0}$ is the $y$-intercept
${\beta_j}$ is the slope of the segment, ${j=1,2,3, \cdots, k}$
$\Delta_j$ location of the slope changes between segment $j$ and segment $j+1$
$I(x-\Delta_j) = 1 \ if \ x \geq \Delta_j$ and $0$ otherwise

import pwlf
pwlf_model = pwlf.PiecewiseLinFit(np.arange(len(covid_SEP20_MAR21_df)), covid_SEP20_MAR21_df.new_cases.values)
z = pwlf_model.fit(5)
slopes = pwlf_model.calc_slopes()

# Predictions
predicted_new_cases = pwlf_model.predict(np.arange(len(covid_SEP20_MAR21_df)))

/Users/shankar/dev/tools/anaconda3/envs/nxt/lib/python3.9/site-packages/pwlf/pwlf.py:1086: RuntimeWarning: invalid value encountered in true_divide
  self.slopes = np.divide(

plot_daily_cases(
    covid_SEP20_MAR21_df,
    scatter=True,
    predictions_daily=predicted_new_cases,
    titleA="Daily - No. of Individuals Diagnosed +ve(Smooth)",
    titleB="Growth - Total No of Infected Individuals",
    sup_title="Covid Cases(India) $1^{st}$ Oct, 2020 - $30^{th}$ Mar, 2021 Actual vs Predictions"
)

png

Exponential and Logarithmic Curve Fitting

We know a simple linear regression model is represented as $y = Ax + B$ and we have studied the Piecewise Linear Regression model in the previous section. An exponential curve fitting model is represented as follows,

$$\Large y = Ae^{Bx} \tag{2. An Exponential Model}$$ $$take \ log \ on \ both \ sides$$ $$log(y) = log(Ae^{Bx}) = log(A) + log(e^{Bx}))$$ $$log(y) = log(A) + Bx$$

Let us do the following,

Subset the data for a period where there is an exponential growth
Fit the model using numpy.polyfit function
Apply piecewise linear regression for the log(y)

Yes, It is a hack but the intent of this post is to understand the idea behind the Piecewise approach and nothing beyond

covid_2nd_wave_df = covid_ind_df[(covid_ind_df.date >= "2021-01-31") & (covid_ind_df.date <= "2021-05-10")]
plot_daily_cases(
    covid_2nd_wave_df,
    scatter=True,
    titleA="Daily - No. of Individuals Diagnosed +ve",
    titleB="Growth - Total No of Infected Individuals",
    sup_title="Covid Cases(India) $31^{st}$ Jan, 2021 - $10^{th}$ May, 2021"

)

png

NUM_PIECES = 10
def logarithmic_fit(data):
    prediction_coeffs = np.polyfit(
        np.arange(len(data)),
        np.log(data.total_cases),
        1,
        w= np.sqrt(data.total_cases)
    )
    B, A = prediction_coeffs

    preds = [np.exp(A) * np.exp(B * x) for x in np.arange(len(data))]

    return preds

def piecewise_logarithmic_fit(data, num_pieces=NUM_PIECES):

    piece_length = len(data) // num_pieces

    predictions = []
    for i in np.arange(num_pieces):
        offset = i * piece_length
        piece_of_data = data[offset:offset + piece_length ]
        preds = logarithmic_fit(piece_of_data)
        predictions.append(preds)

    predictions = np.concatenate(predictions)

    return predictions

preds = logarithmic_fit(covid_2nd_wave_df)
plot_daily_cases(
    covid_2nd_wave_df,
    scatter=True,
    predictions_total=preds,
    titleA="Daily - No. of Individuals Diagnosed +ve",
    titleB="Growth - Total No of Infected Individuals",
    sup_title="Covid Cases(India) $31^{st}$ Jan, 2021 - $10^{th}$ May, 2021 Actual vs Predictions"
)

pwlf_model = pwlf.PiecewiseLinFit(np.arange(len(covid_2nd_wave_df)), covid_2nd_wave_df.new_cases.values)
z = pwlf_model.fit(5)
slopes = pwlf_model.calc_slopes()

# Predictions
predicted_new_cases = pwlf_model.predict(np.arange(len(covid_2nd_wave_df)))

preds = piecewise_logarithmic_fit(covid_2nd_wave_df)
plot_daily_cases(
    covid_2nd_wave_df,
    scatter=True,
    predictions_daily=predicted_new_cases,
    predictions_total=preds,
    titleA="Daily - No. of Individuals Diagnosed +ve",
    titleB="Growth - Total No of Infected Individuals",
    sup_title="Covid Cases(India) $31^{st}$ Jan, 2021 - $10^{th}$ May, 2021 Actual vs Piecewise Predictions"
)

png

/Users/shankar/dev/tools/anaconda3/envs/nxt/lib/python3.9/site-packages/pwlf/pwlf.py:1086: RuntimeWarning: invalid value encountered in true_divide
  self.slopes = np.divide(

png

Epilogue

In this post, we took one of the most fundamental concepts of machine learning and analyzed using the covid data. Our goal was to learn how the piecewise approach works to get an intuition of polyhedron(cell) in the complex and large-scale DNNs. I took many shortcuts, assumptions, and presumptions to make this post concise and precise at the expense of accuracy/correctness - I request the readers to pardon me for this approach. In my defense, complete learning and understanding a complex concept is a continuous process because we all are in different stages of comprehension. Hoping this post is a useful one that declutters a few critical ideas.

3rd-wave-in-india-covid-debacle-continues-let-us-use-covid-data-to-learn-piecewise-lr-and-exponential-curve-fitting

References

Piecewise Linear Regression from StatGraphics Centurion, 2020
How to do exponential and logarithmic curve fitting in Python? I found only polynomial fitting by Kennytm, 2010
Fit piecewise linear functions to data! from PWLF Library