Metrics That We Measure - Measuring Efficacy of Your Machine Learning Models
Posted October 30, 2021 by Gowri Shankar ‐ 7 min read
Have we identified the perfect metrics to measure the efficacy of our machine learning models? A perfect metric - does that even exist? A recent LinkedIn post on measuring metrics caught my attention. It is a somewhat opinionated claim from the author, backed by substantial evidence and arguments. His post drew attention from many and became a valuable repository of information and views from diverse people. This post summarizes the diverse responses from its participants.
Nikhil Aggarwal from Google made the post that drew my attention; it can be accessed here.
The F1 score is one of the most widely used but misunderstood metrics.
In real-life scenarios, I hardly find a use case where the cost of a
false positive is equal to a false negative. We can’t give equal weightage
to recall and precision. Fbeta is more appropriate where the value of beta
changes based on business needs and problem statements.
- Nikhil Aggarwal, Google
I thank Nikhil and the LinkedIn community who participated in the discussion for inspiring me to write this post.
Objective
The key objective of this post is to catalog the critical comments made by the participants of Nikhil Aggarwal’s post on $F_1$ vs $F_{\beta}$ (e.g. $F_2$) and the cost of False Positives vs False Negatives.
Introduction
Is accuracy the right metric to evaluate the efficacy of our machine learning models? If not, then why do the most qualified and accomplished professionals start the conversation by asking, How accurate is your model?
In truth, the term accurate has nothing to do with the accuracy metric - I believe they mean to ask, Have you identified the right metrics to evaluate against, and what is the efficacy of your model?
Disclaimer: They may or may not know the right metrics and their significance to the context of the problem. Meanwhile, the question How accurate is your model? is destined to remain in its glory for years to come - it is the one question that holds the truth of all the evaluation metrics we have ever invented and will ever invent.
Academia or Industry
Context is the King and the Medium is the Message - in data science there is no golden rule or clearly defined goal to achieve, hence context plays a crucial role in glorifying or cremating an idea. Dr. Andrew Ng, in a recent email to his followers, detailed the differences between academia and industry for data science professionals who are transitioning between the two. The following are the trade-offs one needs to keep in mind based on the nature of one's organization, i.e. academia or industry
- Speed vs Accuracy
- Return on Investment vs Novelty
- Experienced vs Junior Teams
- Interdisciplinary Work vs Disciplinary Specialization
- Top-down vs Bottom-up Management
If one looks closely at the above list, these are all either quantitative or qualitative metrics for the success of a professional in the context of the organization he or she represents, i.e. there is no hard and fast rule in data science, because most problems have more than one right answer.
Metrics Mislead
Machine learning is not a deterministic process; its outcomes are stochastic because we lack information about all the confounders. Hence the metrics that work today may or may not be valid tomorrow, for a simple reason called data drift. After all, we rely upon and measure the density of a predictor to make predictions, and that density keeps changing as time moves on.
the statistical component of your exercise ends when you output a probability for each class of your new sample. Mapping these predicted probabilities $(\hat{p}, 1-\hat{p})$ to a 0-1 classification, by choosing a threshold beyond which you classify a new observation as 1 vs. 0, is not part of the statistics any more. It is part of the decision component
- Venkat Raman, Aryma Lab
Assumptions made because of business context that influence the outcome are biases. That context often does not come from the data but from some sort of inductive bias injected into the system. Such bias makes us judgemental about an idea and ends up in sub-optimal results. Hence an evaluation metric is a mere indicator of where the data lean, rather than an autocratic authority destined to make one’s decisions.
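To illustrate the decision component Venkat describes, here is a minimal sketch (the labels and probabilities are made up for this sketch) showing how the same predicted probabilities $\hat{p}$ yield different precision and recall depending on the threshold we choose:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Illustrative true labels and predicted probabilities (made up for this sketch)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
p_hat  = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.65, 0.55, 0.90, 0.45, 0.05])

# The "decision component": map probabilities to 0/1 with a chosen threshold
for threshold in (0.3, 0.5, 0.7):
    y_pred = (p_hat >= threshold).astype(int)
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_true, y_pred):.2f}  "
          f"recall={recall_score(y_true, y_pred):.2f}")
```

The statistics end at `p_hat`; everything after the threshold is a business decision, which is exactly where the biases discussed above creep in.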
Classic Case of Credit Risk
Is it possible for a machine learning model to predict the credit risk of a potential borrower accurately? There are two cases to think through:
- Loan approval for a potential defaulter (False Negative)
- Loan rejection for a potential customer (False Positive)
We generally use custom metrics built on top of recall, because if we weed out all risky customers we're losing money in terms of interest and late payments.
- Kriti Doneria, Kaggle Master
From a business point of view, the cost of the uncertainty about whether a customer is a defaulter or not is less than the interest and late payments earned from risky customers, claims Manjunatha. It is quite evident from this section of the post that a machine learning algorithm and its evaluation metrics are still at a nascent stage where they cannot make decisions independently.
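To make the trade-off concrete, here is a minimal, hypothetical sketch of such a custom metric (the cost ratio, the labels, and the `expected_cost` helper are illustrative assumptions, not anything proposed in the discussion): weight false negatives (approved defaulters) and false positives (rejected good customers) by their business cost and pick the threshold that minimizes the expected loss.

```python
import numpy as np

def expected_cost(y_true, y_pred, cost_fn=5.0, cost_fp=1.0):
    """Hypothetical business cost: approving a defaulter (FN) is assumed
    to cost 5x more than rejecting a good customer (FP)."""
    fn = np.sum((y_pred == 0) & (y_true == 1))  # approved defaulters
    fp = np.sum((y_pred == 1) & (y_true == 0))  # rejected good customers
    return cost_fn * fn + cost_fp * fp

# Illustrative labels (1 = defaulter) and model scores
y_true = np.array([0, 1, 0, 0, 1, 0, 1, 0, 0, 1])
scores = np.array([0.20, 0.70, 0.10, 0.40, 0.35, 0.05, 0.80, 0.30, 0.15, 0.60])

# Choose the threshold with the lowest expected business cost
thresholds = np.linspace(0.1, 0.9, 9)
costs = [expected_cost(y_true, (scores >= t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print(f"best threshold = {best:.1f}, expected cost = {min(costs):.1f}")
```

The point is not the specific numbers but that the threshold, and hence the reported precision and recall, falls out of a business cost model rather than a default of 0.5.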
The Metrics
In this section, we shall review the common metrics that we often use to evaluate our machine learning models.
Accuracy
To describe it in words, “How good is the classification model overall?” $$\frac{True \ Positives + True \ Negatives}{Total \ No. \ of \ Samples}$$
Confusion Matrix
A confusion matrix is used to describe the performance of a binary classification model. There are four basic terms to ponder (a minimal code sketch follows the table below):
- True Positives: Predicted as TRUE and they are TRUE
- True Negatives: Predicted as FALSE and they are FALSE
- False Positives or Type 1 Error: Predicted as TRUE but they are FALSE
- False Negatives or Type 2 Error: Predicted as FALSE but they are TRUE
| | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | True Positives | False Positives |
| Predicted Negative | False Negatives | True Negatives |
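Here is a minimal sketch (with illustrative labels) that computes the four counts of the table above directly from predictions, along with accuracy and the precision, recall, and FPR metrics defined in the subsections that follow:

```python
import numpy as np

# Illustrative ground truth and predictions (1 = positive class)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))  # Type 1 error
fn = np.sum((y_pred == 0) & (y_true == 1))  # Type 2 error

accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)          # a.k.a. sensitivity / TPR
fpr       = fp / (fp + tn)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} FPR={fpr:.2f}")
```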
Precision
To describe it in words, “When the model predicts TRUE, how often is it correct?”
$$\frac{True \ Positives}{True \ Positives + False \ Positives}$$
Recall(or Sensitivity/TPR)
To describe it in words, TPR is “If it is TRUE, how often does the model predict TRUE?”
$$\frac{True \ Positives}{True \ Positives + False \ Negatives}$$
$F_1$ Score
$F_1$ Score is the harmonic mean of the true positive rate (recall) and precision
$$2 * \frac{Precision \times Recall}{Precision + Recall}$$
$F_{\beta}$ Score
$F_{\beta}$ is a generalized measure of the $F_1$ score with an additional weight $\beta$ that values either recall or precision more than the other. $$F_{\beta} = (1 + \beta^2) \cdot \frac{precision \cdot recall}{(\beta^2 \cdot precision) + recall}$$
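A minimal sketch (with illustrative labels) of the point Nikhil makes: the choice of $\beta$ changes the score materially. It uses scikit-learn's `f1_score` and `fbeta_score`; $\beta > 1$ weighs recall more heavily, $\beta < 1$ weighs precision more heavily.

```python
import numpy as np
from sklearn.metrics import f1_score, fbeta_score

# Illustrative labels: the model is precise but misses many positives
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])

print(f"F1   = {f1_score(y_true, y_pred):.2f}")
print(f"F2   = {fbeta_score(y_true, y_pred, beta=2):.2f}")    # recall-heavy
print(f"F0.5 = {fbeta_score(y_true, y_pred, beta=0.5):.2f}")  # precision-heavy
```

Here precision is perfect but recall is poor, so the recall-heavy $F_2$ punishes the model while the precision-heavy $F_{0.5}$ rewards it; which one is "right" depends on the business cost of the two error types.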
True Positive Rate (TPR or Recall or Sensitivity)
To describe it in words, TPR (the same quantity as recall above) is “If it is TRUE, how often does the model predict TRUE?”
$$\frac{True \ Positives}{True \ Positives + False \ Negatives}$$
False Positive Rate (FPR)
To describe it in words, FPR is “When it is FALSE, how often does the model predict TRUE?”
$$\frac{False \ Positives}{False \ Positives + True \ Negatives}$$
ROC Curve
ROC stands for Receiver Operating Characteristic; ROC curves are used to present the results of binary decision problems (a minimal code sketch follows the list below).
- ROC curves show how the number of correctly classified positive samples varies with the number of incorrectly classified negative samples, i.e.
$$False \ Positive \ Rate(FPR) \ vs \ True \ Positive \ Rate(TPR)$$
- ROC Curves can present an overly optimistic view of an algorithm’s performance
- In ROC space, the goal is to be in the upper left-hand corner
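A minimal sketch (with illustrative scores) of sweeping the threshold to obtain the FPR/TPR pairs and the area under the curve, using scikit-learn's `roc_curve` and `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative labels and predicted probabilities for the positive class
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.65, 0.55, 0.90, 0.45, 0.05])

# roc_curve sweeps every threshold and returns the (FPR, TPR) pairs
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("FPR:", np.round(fpr, 2))
print("TPR:", np.round(tpr, 2))
print("AUC:", round(roc_auc_score(y_true, y_score), 2))
```

Plotting `tpr` against `fpr` gives the ROC curve; the closer the curve hugs the upper left-hand corner, the larger the AUC.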
Other Metrics
The metrics discussed above are the most common evaluation metrics for classification problems; beyond them there is an extensive list of metrics specific to the problem of interest. The following are a few of them:
- Concordant - Discordant Ratio, Pair Ranking
- Gini Coefficient - Degree of Variation
- Kolmogorov Smirnov Chart - Distance between Distributions
- Gain and Lift Charts - Classification
- Word Error Rate (WER) - Speech
- Perplexity - NLP
- BLEU Score - NLP
- Intersection over Union (IoU) - Computer Vision (sketched after this list)
- Inception Score - GANs
- Frechet Inception Distance - GANs
- Peak Signal to Noise Ratio (PSNR) - Image Quality & Reconstruction
- Structural Similarity Index(SSIM) - Image Quality and Reconstruction
- Mean Reciprocal Rank(MRR) - Ranking
- Discounted Cumulative Gain(DCG) - Ranking
- Normalized Discounted Cumulative Gain(NDCG) - Ranking
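As one concrete example from the list above, here is a minimal sketch of Intersection over Union for two axis-aligned bounding boxes (the boxes are made up for illustration):

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Illustrative ground-truth and predicted boxes
print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 3))  # 0.143
```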
Conclusion
Nikhil’s post triggered many to share their views and brought people from diverse domains together to offer their opinions and ideas. For a simple classification problem, the most meaningful and reliable measure could be AUC-ROC; I have noticed that almost all research papers in the healthcare domain conclude with one flavor or another of the ROC curve. Beyond that, leaning towards any particular metric is not advised.
This is why no ONE metric makes sense for all use cases. The outcomes and predictions need to be evaluated based on their costs/benefits.
- Phil Fry
Research from academia and from industry are equally important. Though the context and objectives differ between the two, our end goal is the same: to build superior human-like, human-inspired, reliable, and responsible expert systems for the greater good.
A Few Critical Reads
- Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules by Frank Harrell, 2020
- Of quantiles and expectiles: consistent scoring functions, Choquet representations and forecast rankings by Ehm et al., 2016
- Biostatistics for Biomedical Research by Harrell and Slaughter of Vanderbilt University, 2021
- Machine Learning Meets Economics by Nicolas Kruchten, 2016