# AI Model Evaluation and Metrics Tutorial

Evaluating AI models is crucial to understand their performance and make informed improvements. Different tasks (classification, regression, ranking) require different evaluation metrics. This tutorial covers key metrics for each type, explains their significance and use-cases, and provides Python examples (using **scikit-learn** and **NumPy**, and SciPy for ranking) to compute them. We also discuss trade-offs between metrics and how to choose the right ones for your problem.

## 1. Classification Metrics

Classification metrics assess how well a model predicts discrete class labels (e.g. spam vs not-spam). Many classification metrics are derived from the **confusion matrix** of true vs predicted labels. The confusion matrix is a table showing counts of **True Positives (TP)**, **True Negatives (TN)**, **False Positives (FP)**, and **False Negatives (FN)**. Each metric gives a different perspective on classifier performance.

### Confusion Matrix

A **confusion matrix** is a table that visualizes the performance of a classification model by comparing actual labels with predicted labels (see [Confusion matrix - Wikipedia](https://en.wikipedia.org/wiki/Confusion_matrix)). Each row represents the actual class and each column represents the predicted class. For a binary classification (with classes "Positive" and "Negative"), the confusion matrix might look like:

```
                    Predicted Negative    Predicted Positive
Actual Negative            TN                    FP
Actual Positive            FN                    TP
```

The diagonal elements (TN and TP) are correct predictions, and off-diagonals are errors (FP = type I error, FN = type II error). The confusion matrix helps derive metrics like accuracy, precision, recall, etc., and lets you see which classes are being confused by the model (hence the name).
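To make the layout concrete, here is a minimal sketch that builds a confusion matrix with scikit-learn, using the same toy labels as the accuracy example below; the `ravel()` unpacking order shown applies to the binary case:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1, 0, 1])  # actual labels
y_pred = np.array([0, 1, 0, 0])  # predicted labels

cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows = actual class, columns = predicted class

# For a binary problem, ravel() returns the four cells as TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```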
### Accuracy

**Accuracy** is the simplest classification metric: it is the proportion of all predictions that the model got right. In terms of the confusion matrix, it is `(TP + TN) / (TP + TN + FP + FN)`. Accuracy gives an overall indication of correctness.

*Significance:* Accuracy can be useful as a quick check to see if a model is training correctly and for comparing models when the class distribution is roughly balanced. However, **accuracy can be misleading for imbalanced datasets**. For example, if 99% of instances are class A, a model that always predicts A will be 99% accurate but essentially useless for finding class B. In such cases, accuracy doesn't reflect the model's true effectiveness (you'd be "accurate" 99% of the time simply by always predicting the majority class).

*When to use:* Use accuracy when classes are balanced and the cost of FP and FN errors is similar. Avoid using accuracy as the sole metric in class-imbalanced scenarios or when you care more about specific error types.

*Formula:* $Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$.

**Example:** If a model makes 100 predictions with 90 correct, accuracy = 90%.

**Python Example – Calculating Accuracy:** You can compute accuracy with `sklearn.metrics.accuracy_score` or manually using NumPy.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 0, 1])  # 0 = negative class, 1 = positive class
y_pred = np.array([0, 1, 0, 0])  # model predictions

acc = accuracy_score(y_true, y_pred)
acc_manual = np.mean(y_true == y_pred)  # manual computation
print("Accuracy:", acc)
```

Output:
```
Accuracy: 0.75
```

In this example, the model got 3 out of 4 correct (75% accuracy).
### Precision

**Precision** (also called **Positive Predictive Value**) is the fraction of predicted positives that are actually positive. It is calculated as $Precision = \frac{TP}{TP + FP}$. In other words, when the model predicts "positive", how often is it correct?

*Significance:* Precision measures the quality of positive predictions. A high precision means that **few false positives** are being predicted. This is important in situations where a false alarm is costly. For example, in medical diagnostics for cancer, a false positive (misdiagnosing a healthy person as sick) can lead to anxiety and unnecessary procedures. In information retrieval, precision is the proportion of retrieved documents that are relevant.

*When to use:* Use precision when **false positives are more problematic** than false negatives. For instance, in an email spam filter, precision matters if you want to avoid misclassifying valid emails as spam. If you need to be very confident when the model flags a positive, focus on improving precision.

*Trade-off:* Precision is often balanced with recall – increasing precision may lower recall and vice versa. A trivial way to get 100% precision is to only predict "positive" when you are very certain (or not predict "positive" at all), but that would likely miss many actual positives, hurting recall.

**Python Example – Precision:** Use `sklearn.metrics.precision_score`:

```python
from sklearn.metrics import precision_score
prec = precision_score(y_true, y_pred)
print("Precision:", prec)
```

Using `y_true` and `y_pred` from above, the output is:
```
Precision: 1.0
```

In our example, the model's precision is 1.0 (100%) because it predicted one positive and that one was actually positive (TP=1, FP=0). High precision (no false positives) is good, but we will see below that the model missed some positives (affecting recall).

### Recall

**Recall** (also called **Sensitivity** or **True Positive Rate**) is the fraction of actual positives that the model correctly identified. It is calculated as $Recall = \frac{TP}{TP + FN}$. Recall answers: **when the actual class is "positive", how often does the model predict positive?**

*Significance:* Recall measures the model's ability to capture positive instances. A high recall means **few false negatives** – the model misses very few actual positives. This is crucial when missing a positive case has a high cost. For example, in disease detection or security (e.g. detecting intruders or missiles), missing a true positive can be disastrous. In the cancer diagnosis example, recall is the proportion of people with cancer that the test correctly flags. If recall is low, many cancer cases go undetected, which is dangerous.

*When to use:* Use recall when **false negatives are more problematic** than false positives. For instance, in fraud detection or critical alarms, you want to catch as many true incidents as possible (even if it means some false alarms).

*Trade-off:* Recall often trades off with precision. A trivial way to get 100% recall is to predict every instance as positive – you'd catch all positives, but also flag many negatives as positives (hurting precision). The right balance depends on the problem's costs.

**Python Example – Recall:** Use `sklearn.metrics.recall_score`:

```python
from sklearn.metrics import recall_score
rec = recall_score(y_true, y_pred)
print("Recall:", rec)
```

For our example:
```
Recall: 0.5
```

Recall is 0.5 (50%) because out of 2 actual positive cases, the model caught 1 and missed 1. This illustrates the **precision/recall trade-off**: our model had high precision but low recall (it was very conservative in predicting positives).
### F1-Score

**F1-Score** is the harmonic mean of precision and recall. It provides a single metric that balances both concerns. The formula is:

\[ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \]

The harmonic mean punishes extreme values, so a model will only get a high F1 if **both precision and recall are high**. If either is low, F1 drops.

*Significance:* F1 is useful for overall performance on the positive class, especially **when classes are imbalanced** or when you seek a balance between precision and recall. It gives a more comprehensive picture than accuracy in many cases. A good F1 means the model is managing both low false positives and low false negatives.

*When to use:* Use F1 when you want a single metric to evaluate a classifier's precision/recall balance. This is common in information retrieval, medical tests, etc., where both kinds of errors matter. F1 is **especially appropriate for imbalanced class problems** (e.g., rare event detection), where a plain accuracy score would be misleading. By default, F1 is defined for the positive class in binary classification; for multi-class or multi-label problems, you can compute a weighted or macro-average F1.

*Trade-off:* F1 doesn't differentiate which is more important (precision or recall) – it assumes you care equally about FP and FN. If that's not true for your case, consider using a weighted F-score (Fβ) to emphasize one over the other, or just track precision and recall separately.

**Python Example – F1:** Use `sklearn.metrics.f1_score`:

```python
from sklearn.metrics import f1_score
f1 = f1_score(y_true, y_pred)
print("F1 Score:", f1)
```

Our example yields:
```
F1 Score: 0.66...
```

The F1 score is about 0.667 (66.7%). This makes sense because precision was 1.0 and recall 0.5, and the harmonic mean of 1.0 and 0.5 is 0.667. This single number summarizes the model's performance on the positive class: moderate, due to the trade-off between precision and recall.
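If you care more about recall than precision (or vice versa), the weighted Fβ score mentioned in the trade-off note above is available as `sklearn.metrics.fbeta_score`. A minimal sketch with the same `y_true`/`y_pred`:

```python
from sklearn.metrics import fbeta_score

# beta > 1 weights recall more heavily; beta < 1 weights precision more heavily
f2 = fbeta_score(y_true, y_pred, beta=2)
f05 = fbeta_score(y_true, y_pred, beta=0.5)
print("F2 Score:", f2)     # ~0.556 here, pulled toward recall (0.5)
print("F0.5 Score:", f05)  # ~0.833 here, pulled toward precision (1.0)
```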
### ROC-AUC (Receiver Operating Characteristic – Area Under Curve)

**ROC-AUC** is a metric for binary classification that measures a model's ability to distinguish between classes across all classification thresholds. The ROC curve plots the True Positive Rate (Recall) against the False Positive Rate for various threshold settings. **AUC** is the area under this ROC curve, a number between 0.0 and 1.0. AUC can be interpreted as the probability that a randomly chosen positive instance is ranked higher (assigned a higher score) than a randomly chosen negative instance (see [What is a ROC Curve - How to Interpret ROC Curves - Displayr](https://www.displayr.com/what-is-a-roc-curve-how-to-interpret-it/)).

*Significance:* ROC-AUC evaluates the model's **ranking performance** – how well it separates positive vs negative examples independent of any specific threshold. A model with AUC = 0.5 is no better than random guessing, while AUC = 1.0 indicates perfect separation. A higher AUC means that the model can achieve a better trade-off between TPR and FPR by choosing an appropriate threshold. One big advantage of ROC-AUC is that it's **insensitive to class imbalance** (it does not depend on the absolute number of positives/negatives). This makes it useful for imbalanced datasets (like rare disease detection) where accuracy would be dominated by the majority class.

*When to use:* Use ROC-AUC when you want to evaluate the inherent discriminative power of a binary classifier, especially if you plan to choose a probability threshold later or if the class distribution is skewed. It's commonly used in machine learning competitions and binary classification benchmarks.

*Trade-offs:* ROC-AUC gives a global view, but it may be less informative in certain cases. For example, if positive cases are very rare, a high AUC can sometimes be achieved even though the model may not be great at identifying the few positives (because AUC considers rank ordering, not absolute performance at a given threshold). In such cases, **Precision-Recall curves** or PR-AUC might be more informative. Also, ROC-AUC doesn't directly tell you what threshold to use; it just evaluates the score quality.

**Python Example – ROC-AUC:** To compute ROC-AUC, you need the model's probability scores for the positive class (or any continuous decision score). Use `sklearn.metrics.roc_auc_score`:

```python
from sklearn.metrics import roc_auc_score
y_proba = np.array([0.3, 0.7, 0.4, 0.35])  # predicted probabilities for class 1
auc = roc_auc_score(y_true, y_proba)
print("ROC AUC:", auc)
```

Output:
```
ROC AUC: 0.75
```

In this example, the AUC is 0.75. We provided made-up probabilities corresponding to our earlier predictions. An AUC of 0.75 means that 75% of the time, a random positive is ranked higher than a random negative. Generally, you would use the classifier's output probabilities here (e.g. `model.predict_proba(X)[:,1]` for the positive class).
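To see the threshold sweep behind that number, `sklearn.metrics.roc_curve` returns the FPR/TPR pair obtained at each candidate threshold. A minimal sketch reusing `y_true` and `y_proba` from above:

```python
from sklearn.metrics import roc_curve

# fpr[i] and tpr[i] are the false/true positive rates obtained when
# scores >= thresholds[i] are classified as positive
fpr, tpr, thresholds = roc_curve(y_true, y_proba)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th}  FPR={f:.2f}  TPR={t:.2f}")
```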
### Log Loss (Cross-Entropy Loss)

**Log Loss** (also known as **Cross-Entropy Loss**) measures the performance of a classification model by penalizing false classifications with a cost that increases the more confident the wrong prediction is. For each instance, log loss is the negative log-likelihood of the predicted probability for the true class. The formula for binary classification is:

\[ \text{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N} \Big[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \Big], \]

where $p_i$ is the predicted probability of instance $i$ being in the positive class (and $y_i$ is 1 for positive instances, 0 for negative).

In plain terms, **log loss is the negative average of the log of the predicted probabilities of the true class**. If the model predicts a probability close to 1 for the actual class, the contribution to log loss is small (good). If it predicts a low probability for the actual class, the log loss is large. Log loss is always non-negative, and lower values are better. A perfect model would have log loss = 0 (because it would assign probability 1 to the correct class for every instance, and $\log(1) = 0$).

*Significance:* Log loss is a strict metric that **takes into account the confidence of predictions**. It heavily penalizes predictions that are both wrong and confident. This makes it very useful when you not only care about the correctness of predictions but also the quality of the probability estimates. For example, in probabilistic classifiers (like logistic regression), log loss is often used as the objective function for training, and it's a common evaluation metric in machine learning competitions (Kaggle often uses log loss for classification contests).

*When to use:* Use log loss when you need to evaluate the **probability outputs** of a model rather than just the hard class predictions. If you want well-calibrated probabilities (e.g., in risk assessment), optimizing for log loss will encourage the model to output probabilities that reflect true likelihood. It's also useful when comparing models that output probabilities – a lower log loss means the predicted probability distribution is closer to the true distribution of outcomes.

*Trade-offs:* Log loss is more informative than accuracy because it rewards models that are confident about correct predictions and penalizes those that are confident about wrong predictions. However, it can be less intuitive to interpret because it's not a percentage or an error in original units. Also, log loss can be sensitive to outliers; one extremely mispredicted instance (where the model was very sure but wrong) can dominate the average. In practice, you might monitor log loss along with AUC or accuracy for a fuller picture.

**Python Example – Log Loss:** Use `sklearn.metrics.log_loss`. For binary problems this function accepts either a 1-D array with the predicted probability of the positive class for each sample, or the full probability distribution across all classes for each sample. Here we provide the probability of each class for each sample (for binary, two columns per sample):

```python
from sklearn.metrics import log_loss
# Predicted probability distribution for each sample (columns: [P(class=0), P(class=1)])
y_pred_proba = np.array([
    [0.7, 0.3],  # for sample1, model predicts 30% positive (so 70% negative)
    [0.3, 0.7],  # sample2: 70% positive
    [0.6, 0.4],  # sample3: 40% positive
    [0.8, 0.2]   # sample4: 20% positive
])
ll = log_loss(y_true, y_pred_proba)
print("Log Loss:", ll)
```

Output (example):
```
Log Loss: 0.7084...
```

A lower log loss indicates better calibrated predictions. In our example, the log loss of ~0.71 corresponds to the model being fairly good but not perfect. (For reference, predicting 0.5 for everything would give log loss = 0.693 for binary classification, so 0.71 is slightly worse than that baseline.)

*Note:* Ensure the predicted probabilities for each sample sum to 1 (they should represent a valid probability distribution). Scikit-learn's `log_loss` will handle multi-class as well if you pass an array of shape (n_samples, n_classes) with the correct class probabilities.
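To tie the number back to the formula, here is a manual NumPy computation of the same quantity, assuming the `y_true` and `y_pred_proba` arrays defined above:

```python
import numpy as np

# Probability the model assigned to the *true* class of each sample
p_true = y_pred_proba[np.arange(len(y_true)), y_true]
manual_ll = -np.mean(np.log(p_true))
print("Manual Log Loss:", manual_ll)  # matches log_loss() up to probability clipping
```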
### Putting It Together: Classification Report

Often, we look at multiple classification metrics together. Scikit-learn provides `classification_report`, which prints precision, recall, F1, and support for each class and is very useful for multi-class evaluation. Also consider metrics like **Balanced Accuracy** (especially for imbalanced data), which is the average of the per-class recalls, or the **ROC curve** and **precision-recall curve** to visualize performance. For binary classification, metrics like **Matthews Correlation Coefficient** and **Cohen's Kappa** (explained below) can also give additional insight, especially in imbalanced scenarios.
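A minimal sketch of that per-class report, using the same toy labels (the `target_names` shown are just illustrative labels for classes 0 and 1):

```python
from sklearn.metrics import classification_report, balanced_accuracy_score

print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
```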
**Example – All Classification Metrics Above:** Using our `y_true` and `y_pred` example, here's how to compute the discussed metrics in one go:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix

print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1 Score:", f1_score(y_true, y_pred))
```

Output:
```
Confusion Matrix:
 [[2 0]
 [1 1]]
Accuracy: 0.75
Precision: 1.0
Recall: 0.5
F1 Score: 0.66...
```

This matches our earlier discussion: 2 TN, 1 FN, 1 TP; accuracy 75%; precision 100%; recall 50%; F1 ~66.7%.

And for metrics that use probabilities:

```python
from sklearn.metrics import roc_auc_score, log_loss
print("ROC AUC:", roc_auc_score(y_true, y_proba))
print("Log Loss:", log_loss(y_true, y_pred_proba))
```

Output:
```
ROC AUC: 0.75
Log Loss: 0.7084...
```

These illustrate how to compute each metric using scikit-learn.
## 2. Regression Metrics

Regression metrics measure error in continuous predictions (e.g. predicting a price or temperature). Unlike classification, where predictions are right or wrong, regression errors are numerical differences between predicted and actual values. Different metrics capture different aspects of these errors (average magnitude, squared error, relative error, etc.).

Let $y_i$ be the true value and $\hat{y}_i$ the predicted value for instance $i$, and $n$ the number of instances.

### Mean Squared Error (MSE)

**Mean Squared Error (MSE)** is the average of the squared differences between predicted values and actual values. Formally:

\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. \]

*Significance:* MSE is the most common regression loss metric. By squaring the errors, it **penalizes larger errors more than smaller ones** (quadratically). This makes it sensitive to outliers: a few large errors can substantially increase MSE. MSE has nice mathematical properties (smooth, differentiable), which is why many regression models (like linear regression) minimize MSE during training.

*When to use:* MSE is appropriate when you want a single measure of overall error and you care more about **large errors** – the squaring means a model that makes a few large mistakes will be judged harshly. It's the default choice for many regression problems and is widely understood.

*Interpretation:* MSE's units are the square of the output units (e.g., if predicting dollars, MSE is in "squared dollars"), which can be hard to interpret directly. That's why people often take the square root to get **Root Mean Squared Error (RMSE)**.

### Root Mean Squared Error (RMSE)

**RMSE** is simply the square root of MSE:

\[ RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum (y_i - \hat{y}_i)^2}. \]

*Significance:* RMSE is also a standard metric and has the advantage of being in the **same units as the target variable** (since we undo the squaring). It is often easier to interpret: e.g., "on average, our prediction is about \$5.00 off." RMSE still penalizes large errors more (because the errors were squared before the final root).

*When to use:* Use RMSE for interpretation and when you want to compare it directly to target values. If someone asks "how far off are the predictions, typically?", RMSE gives a sense of that magnitude. RMSE is also useful for comparing models; it can be compared directly like MSE, since if one model has a lower MSE it will also have a lower RMSE (the relationship is monotonic). In practice, many people report RMSE because of its interpretability.

*Relationship with MSE:* Both MSE and RMSE convey the same information (one is just the square of the other). If you will further process the metric (e.g., optimization algorithms prefer MSE for mathematical reasons), use MSE. If you need to report performance in a friendly way, use RMSE.

### Mean Absolute Error (MAE)

**Mean Absolute Error (MAE)** is the average of the absolute differences between predictions and actual values:

\[ MAE = \frac{1}{n} \sum_{i=1}^{n} |\,y_i - \hat{y}_i\,|. \]

*Significance:* MAE measures the average magnitude of errors **without considering their direction (sign)**. By taking absolute values, positive and negative errors don't cancel out, and all errors contribute linearly. MAE is more **robust to outliers** than MSE because it doesn't square the error – a very large error contributes proportionally (linearly) to the metric, whereas in MSE it would dominate by its square.

*When to use:* Use MAE when you want a metric that's easy to understand in terms of the original units and when you want to treat all errors equally. For example, if predicting how late a flight is (in minutes), an MAE of 5 means on average your prediction is 5 minutes off. MAE is a good choice if you want to minimize the average absolute error, and it's often used in finance or any domain where outlier errors should not be overly penalized, or where the distribution of errors may not be Gaussian (MSE implicitly assumes large errors are rare; if the data has heavy tails, MAE may be the better choice).

*Trade-offs:* Compared to MSE/RMSE, MAE gives less incentive to fix a few very bad predictions if most are pretty good (since those outlier errors don't blow up as much). MAE is not differentiable at 0 (which is a technical point – some optimization algorithms might prefer MSE's smoothness), but this is usually not an issue with modern libraries.

*Interpretation:* MAE is in the same units as the target. It answers: "On average, how big is the error?" For instance, MAE = 5.0 (degrees) means the predictions are on average 5 degrees off from the actual temperature.
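To illustrate the robustness difference described above, here is a small sketch (with made-up error values) showing how a single outlier moves RMSE much more than MAE:

```python
import numpy as np

errors = np.array([1.0, 1.0, 1.0, 1.0])           # four errors of 1 unit each
errors_outlier = np.array([1.0, 1.0, 1.0, 10.0])  # same, but one large error

for e in (errors, errors_outlier):
    mae = np.mean(np.abs(e))
    rmse = np.sqrt(np.mean(e ** 2))
    print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
# The outlier raises MAE from 1.00 to 3.25, but RMSE jumps from 1.00 to about 5.07
```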
### R-squared (Coefficient of Determination)

**R-squared (R²)**, or the **Coefficient of Determination**, is not an error in absolute terms but a relative measure of how well the variance in the dependent variable is explained by the model. It is defined as:

\[ R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}, \]

where the numerator is the residual sum of squares (SS_res, essentially n × MSE) and the denominator is the total sum of squares (SS_tot, i.e. n times the variance of the true values around their mean).

Equivalently, $R^2 = 1 - \frac{MSE}{\text{Variance of } y}$.

*Significance:* R² represents the **proportion of variance in the target variable that is explained by the model**. An $R^2$ of 0.70 means 70% of the variability in the target is explained by the model's predictions (and 30% is unexplained/random). R² is a handy metric for understanding goodness-of-fit: how close are the predictions to the actual values, relative to simply predicting the mean of the target each time?

*When to use:* R² is commonly used in regression analysis (especially in simple or multiple linear regression output). It gives an intuitive sense of model fit. Use R² when you want to communicate or understand the **overall explanatory power** of the model. It's especially useful when comparing a model to a baseline. For example, $R^2 = 0$ means your model is no better than always predicting the average $\bar{y}$; $R^2 = 1$ means a perfect fit.

*Trade-offs and caution:*

- R² can be **misleading for non-linear models or inappropriate models**. A high R² doesn't guarantee the model is good (it might be overfitting; adding more features can only increase R² on training data). Conversely, a low R² isn't always bad – sometimes the data inherently has a lot of variance that can't be explained.
- R² can be negative! This happens if the model is worse than the baseline of predicting the mean. For example, if predictions are really off, the numerator (residual error) can exceed the denominator (total variance), making $R^2 < 0$ (see the sketch just after this list).
- R² doesn't tell you about the magnitude of errors, just the relative variance explained. So it's often reported alongside MAE or RMSE for a complete picture.
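As a concrete illustration of those caveats, a minimal sketch (with made-up values, the same ones used in the combined example below) showing that `r2_score` goes negative as soon as the model's squared error exceeds that of simply predicting the mean:

```python
import numpy as np
from sklearn.metrics import r2_score

y_actual = np.array([3.0, 5.0, 2.0, 7.0])

pred_good = np.array([2.5, 4.0, 2.0, 8.0])              # reasonable model
pred_mean = np.full_like(y_actual, y_actual.mean())     # baseline: always predict the mean
pred_bad = np.array([7.0, 1.0, 8.0, 2.0])               # worse than the baseline

print("R2 (good):", r2_score(y_actual, pred_good))          # ~0.85
print("R2 (mean baseline):", r2_score(y_actual, pred_mean))  # 0.0
print("R2 (bad):", r2_score(y_actual, pred_bad))             # negative
```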
### Mean Absolute Percentage Error (MAPE)

**Mean Absolute Percentage Error (MAPE)** is the mean of the absolute errors divided by the actual values (often expressed as a percentage). The formula:

\[ MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|. \]

In words, MAPE measures the average **percentage** error between prediction and true values. For example, a MAPE of 8% means that on average, the prediction is 8% off from the actual value.

*Significance:* MAPE is intuitive for audiences who are comfortable with percentages. It's **scale-independent**, meaning it doesn't matter if you're predicting prices in dollars or sales in thousands – by using percentages, you can compare performance across different scales. It's widely used in business forecasting (finance, economics) because a percentage error is easy to interpret ("we were off by 5% on average").

*When to use:* Use MAPE when **relative error matters** more than absolute error – for instance, in forecasting demand or sales, a 10 unit error might be big or small depending on the scale (10 units in one context could be huge, in another negligible). MAPE normalizes this by the actual values. It's also useful when you need to compare model accuracy across datasets with different scales.

*Trade-offs and caution:*

- **Division by zero:** If any actual value $y_i$ is zero or very close to zero, MAPE is problematic. A zero actual value causes division by zero (undefined MAPE), and very small actual values produce enormous percentage errors. Therefore, MAPE is not suitable if your data can have zeros (e.g., sales can be zero).
- MAPE is asymmetric in how it treats over- vs under-prediction. With non-negative forecasts, an under-prediction can be off by at most 100% (e.g., actual 50 vs predicted 0 is a 100% error), but an over-prediction has no upper bound (actual 50 vs predicted 150 is a 200% error). This means MAPE penalizes over-forecasting more heavily and can bias model selection toward systematic under-forecasting (see the sketch after this list).
- MAPE gives more weight to errors when $y_i$ is small (because you're dividing by $y_i$). Some prefer **symmetric MAPE (sMAPE)** or other adjustments to address this.

Despite these issues, MAPE remains popular for its interpretability.
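A quick numerical sketch of that bound asymmetry (made-up values):

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

actual = np.array([50.0])

# The worst possible under-forecast (predicting 0) tops out at 100% error...
print(mean_absolute_percentage_error(actual, np.array([0.0])))    # 1.0 -> 100%
# ...while over-forecasts can grow without bound
print(mean_absolute_percentage_error(actual, np.array([200.0])))  # 3.0 -> 300%
```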
**Python Example – Regression Metrics:** Let's compute the above regression metrics on a sample dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_absolute_percentage_error

# Example true values and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 4.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # square root of MSE; recent scikit-learn also provides root_mean_squared_error
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)  # returns a decimal (fraction of 1)

print("MSE:", mse)
print("RMSE:", rmse)
print("MAE:", mae)
print("R^2:", r2)
print("MAPE:", mape)
```

Output:
```
MSE: 0.5625
RMSE: 0.75
MAE: 0.625
R^2: 0.8475
MAPE: 0.1275
```

Here:

- MSE = 0.5625,
- RMSE = 0.75 (which is the square root of 0.5625, in the same units as $y$; e.g., if $y$ was in thousands of dollars, RMSE = 0.75 thousand dollars = \$750),
- MAE = 0.625 (the errors were 0.5, 1.0, 0.0, 1.0; average = 0.625),
- $R^2 \approx 0.8475$ (the model explains ~84.75% of the variance in $y$),
- MAPE = 0.1275, which is **12.75%** (scikit-learn's `mean_absolute_percentage_error` by default gives a fraction, so 0.1275 = 12.75% error on average).

If we multiply `mape` by 100, we get 12.75%. So on average, predictions were off by about 12.75% from the true values.

*Note:* If any `y_true` were zero, the MAPE calculation would involve division by zero – scikit-learn avoids an undefined result by substituting a very small epsilon in the denominator ([3.4. Metrics and scoring: quantifying the quality of predictions — scikit-learn documentation](https://scikit-learn.org/stable/modules/model_evaluation.html)), but conceptually, MAPE isn't defined when actual values are zero ([Mean absolute percentage error - Wikipedia](https://en.wikipedia.org/wiki/Mean_absolute_percentage_error)).
## 3. Other Relevant Metrics

Beyond the basic metrics above, there are other important metrics that are often used for specific scenarios or to provide additional insight:

### Matthews Correlation Coefficient (MCC)

**Matthews Correlation Coefficient (MCC)** is a metric for binary classification that takes into account all four confusion matrix categories (TP, TN, FP, FN). It is essentially the Pearson correlation between the predicted and actual binary labels. The formula is:

\[ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}. \]

MCC yields a value between -1 and 1:

- **1** means perfect prediction,
- **0** means predictions are no better than random,
- **-1** means total disagreement (the model's predictions are the exact inverse of the actual labels).

*Significance:* MCC is often cited as a *balanced* measure even for imbalanced datasets, because it considers the correct and incorrect predictions of both classes. It produces a high score only if the model is doing well on **all** aspects: predicting positives and negatives correctly in proper proportion. It's sometimes called the "phi coefficient." Unlike accuracy, MCC won't be high unless the model is performing well across the board (it won't let a model with high TP but also high FP, or one that gets only one class right, score too well).

*When to use:* MCC is especially useful in binary classification with **imbalanced classes**. For example, in medical diagnosis with a rare condition, MCC gives a better sense of overall performance than accuracy, precision, or recall alone. It's also used in machine learning competitions and research when a single number is needed to summarize binary classification performance. If you suspect accuracy is giving an overly rosy picture due to class imbalance, MCC is a good metric to check.

*Interpretation:* An MCC close to 0 indicates the model is not doing better than chance (for instance, always guessing the majority class yields MCC ~ 0 in many cases even if accuracy is high). An MCC of, say, 0.5 indicates a decent correlation between predictions and true labels. MCC = 1 is perfect, and -1 indicates the model is outputting the exact opposite label for every instance.

**Python Example – MCC:** Using scikit-learn's `matthews_corrcoef`:

```python
from sklearn.metrics import matthews_corrcoef
mcc = matthews_corrcoef(y_true, y_pred)
print("MCC:", mcc)
```

For our earlier classification example:
```
MCC: 0.5773...
```

The MCC of ~0.577 indicates a moderate positive correlation between predictions and actuals. (In that example, precision was high but recall was moderate, and MCC captures the balance.) In a perfectly balanced dataset, MCC might track close to F1, but it's more informative when classes are skewed or when TN is meaningful.
### Cohen's Kappa

**Cohen's Kappa** is a metric that measures the agreement between two raters (or one model and the ground truth as "rater") for categorical classification, adjusted for the agreement that could happen by chance. The formula is:

\[ \kappa = \frac{p_o - p_e}{1 - p_e}, \]

where $p_o$ is the observed agreement (accuracy) and $p_e$ is the expected agreement by chance (based on class frequencies).

$\kappa$ ranges typically from 0 to 1 (it can be negative if there is less agreement than expected by chance):

- **1** = perfect agreement,
- **0** = agreement equal to chance,
- **negative** = agreement worse than random guessing.

*Significance:* Cohen's Kappa answers: *how much better is the classifier than guessing according to the baseline frequencies?* It's useful in scenarios like inter-annotator agreement (e.g., two doctors diagnosing, two judges rating essays) to ensure that high agreement isn't just because some categories are very common. In classification, it can be seen as a more nuanced accuracy that factors out random chance agreement. For example, if 90% of instances are class A, even a trivial classifier that always predicts A will be right 90% of the time ($p_o = 0.9$). But $p_e$ would also be 0.9 in that case, leading to $\kappa \approx 0$ (no real skill). Kappa thus highlights that the model isn't actually performing beyond the prior distribution.

*When to use:* Use Kappa in multi-class classification or imbalanced binary classification when you want to account for the possibility of correct guesses by chance. It's common in human-labelled dataset evaluation. In machine learning, it's not as commonly reported as precision/recall or MCC, but it's useful to know. Particularly if you have many classes, Kappa gives a single measure of classifier agreement with truth relative to chance.

*Interpretation:* Rough guidelines from the literature often say $\kappa > 0.8$ is very good agreement, $0.6-0.8$ is substantial, $0.4-0.6$ moderate, $0.2-0.4$ fair, and $<0.2$ slight. These are just informal benchmarks. Always consider $p_e$; if classes are balanced, Kappa will be close to accuracy. If classes are highly imbalanced, Kappa can be much lower than accuracy, revealing the imbalance effect.

**Python Example – Cohen's Kappa:** Use `sklearn.metrics.cohen_kappa_score`:

```python
from sklearn.metrics import cohen_kappa_score
kappa = cohen_kappa_score(y_true, y_pred)
print("Cohen's Kappa:", kappa)
```

For our example:
```
Cohen's Kappa: 0.5
```

Kappa ~0.5 indicates moderate agreement between model and true labels, considering chance. (In a balanced binary case, Kappa is related to other metrics; here it's moderate partly because our accuracy was 0.75, but some of that could be "by chance".)
### Hinge Loss (for Support Vector Machines)

**Hinge Loss** is the loss function used by support vector machines (SVM) and some other "maximum-margin" classifiers. While it's primarily a training loss, it can also be used as an evaluation metric to see how well an SVM has separated the classes. For an SVM with true labels $t_i \in \{-1, +1\}$ and decision function output $s_i$ (where the sign of $s_i$ is the predicted class and the magnitude is the distance to the margin), the hinge loss for a single instance is:

\[ L_i = \max(0,\, 1 - t_i \cdot s_i). \]

If the instance is correctly classified **and** is at least 1 unit away from the decision boundary (margin), then $t_i \cdot s_i \ge 1$ and the loss is 0 (no penalty). If it's on the wrong side of the boundary or within the margin, the loss is positive (it grows linearly as it violates the margin). The total hinge loss is the average of $L_i$ over all instances.

*Significance:* Hinge loss is **0** for well-classified points with a sufficient margin, which means the model is confidently correct for those. It penalizes points that are misclassified or too close to the boundary. Using hinge loss as a metric can tell you how well the model's decision boundary is doing: lower hinge loss means most points are on the correct side with margin. It's very task-specific (mainly for SVMs or similar).

*When to use:* If you are using a linear SVM (or any model trained with hinge loss), you might look at the hinge loss on a test set to evaluate how close to the margin the predictions are. However, it's not as interpretable as accuracy or AUC for general use. It's rarely reported outside of the SVM context.

*Interpretation:* The hinge loss value doesn't have a straightforward interpretation like "% error" or "units off"; it's more of a technical metric. If hinge loss = 0, every test point is correctly classified with a margin of at least 1 (which is very strong). If it's, say, 0.2, that is the average slack by which points fall inside the margin or onto the wrong side.

**Python Example – Hinge Loss:** Use `sklearn.metrics.hinge_loss`. We need to provide true labels as -1/1 and decision function outputs:

```python
from sklearn.metrics import hinge_loss

# Example with true labels -1 and +1
y_true_binary = np.array([1, -1, 1, -1])            # actual labels
decision_scores = np.array([0.8, -0.4, 1.2, -1.1])  # hypothetical decision function outputs

hloss = hinge_loss(y_true_binary, decision_scores)
print("Hinge Loss:", hloss)
```

Output:
```
Hinge Loss: 0.2
```

In this made-up example, hinge loss is 0.2. This could be interpreted as a fairly good margin (most points are correctly classified with some margin, except a couple that have a small violation). Usually, you'd get the `decision_scores` from `svm.decision_function(X_test)`.
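A minimal end-to-end sketch of that workflow, using synthetic data from `make_classification` (so the exact loss value will depend on the random data and the fitted model):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import hinge_loss

# Synthetic binary classification problem
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = LinearSVC()  # linear max-margin classifier
svm.fit(X_train, y_train)

# Signed distances to the decision boundary for the test set
scores = svm.decision_function(X_test)
print("Test hinge loss:", hinge_loss(y_test, scores))
```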
## 4. Ranking Metrics

For tasks where the goal is to produce a **ranking** of items (for example, search engine results, recommender system rankings, or any scenario where you care about the **order** of predictions rather than exact values), different metrics are used. Here we discuss two common **rank correlation** metrics: **Spearman's Rank Correlation** and **Kendall's Tau**. These metrics compare the ordering produced by the model with the true ordering.

Imagine you have a list of items with a true relevance order (ground truth ranking) and your model produces a predicted order or scores for these items. Rank correlation metrics will be high if the model's ranking is similar to the true ranking.

### Spearman's Rank Correlation (Spearman's ρ)

**Spearman's Rank Correlation Coefficient (ρ)** is a non-parametric measure of rank correlation (statistical dependence between the rankings of two variables). Essentially, Spearman's ρ is the Pearson correlation between the *rank values* of the two variables: if we convert the true scores and the predicted scores of the items into ranks, Spearman's ρ is the Pearson correlation of those rank numbers.

Values range from -1 to 1:

- **+1**: The predicted ranking is exactly the same as the true ranking (perfect monotonic increasing relationship).
- **0**: No correlation between predicted and true ranks (no particular order relationship).
- **-1**: The predicted ranking is the exact reverse of the true ranking (perfect monotonic decreasing relationship).

*Significance:* Spearman's ρ assesses how well the relationship between two sets of rankings can be described by a **monotonic function**. It will be high if, for any two items $i$ and $j$, the order of their scores is largely consistent with the order of their true relevance. It is also sensitive to the magnitude of rank differences: disagreeing about items ranked 1 and 10 moves the metric more than disagreeing about items ranked 1 and 2, because the rank differences enter the Pearson correlation.

*When to use:* Use Spearman's correlation when you care about the overall correlation between two rankings. This is common in evaluating search results, recommendation systems, or any scenario where the output is an ordered list. For example, you have a ground truth ranking of movies by user preference and you want to see if your recommendation algorithm's ranking correlates with that.

*Interpretation:* A Spearman's ρ of 0.9 means a very strong agreement in ordering (if a user likes A more than B, the model likely also scores A above B). 0 means no better than random ordering. Because it's based on the Pearson correlation of ranks, it assumes the relationship is monotonic but not necessarily linear. It's a broader notion of correlation than Pearson on raw values – Spearman cares only about order, not the actual prediction values.
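To make the "Pearson correlation of the ranks" definition concrete, a small sketch (with made-up relevance scores) comparing `spearmanr` with Pearson applied to rank-transformed data:

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr, rankdata

true_relevance = np.array([9.0, 7.5, 6.0, 3.0, 1.0])  # ground-truth scores
model_scores = np.array([0.8, 0.9, 0.4, 0.35, 0.1])   # model's predicted scores

rho, _ = spearmanr(true_relevance, model_scores)
rho_manual, _ = pearsonr(rankdata(true_relevance), rankdata(model_scores))
print("Spearman rho:", rho)
print("Pearson on ranks:", rho_manual)  # identical to Spearman rho (no ties here)
```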
### Kendall's Tau

**Kendall's Tau (τ)** is another measure of rank correlation, based on counting pairwise agreements and disagreements in ordering. It is defined as:

\[ \tau = \frac{C - D}{\frac{1}{2} n(n-1)}, \]

where $C$ is the number of **concordant pairs** (pairs of items that are in the same order in both rankings) and $D$ is the number of **discordant pairs** (pairs that are ordered differently in the two rankings). The denominator is the total number of pairs. Essentially, Kendall's τ is the proportion of concordant pairs minus the proportion of discordant pairs between the predicted and true rankings.
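To make the pair counting explicit, a brute-force sketch (using the same two example rankings as the SciPy example further below):

```python
from itertools import combinations

true_ranks = [1, 2, 3, 4, 5]
pred_ranks = [1, 3, 2, 5, 4]

concordant = discordant = 0
for i, j in combinations(range(len(true_ranks)), 2):
    true_order = true_ranks[i] - true_ranks[j]
    pred_order = pred_ranks[i] - pred_ranks[j]
    if true_order * pred_order > 0:
        concordant += 1       # both rankings order this pair the same way
    elif true_order * pred_order < 0:
        discordant += 1       # the two rankings disagree on this pair

total_pairs = len(true_ranks) * (len(true_ranks) - 1) // 2
tau = (concordant - discordant) / total_pairs
print("C =", concordant, "D =", discordant, "tau =", tau)  # C = 8, D = 2, tau = 0.6
```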
Values range from -1 to 1 (similar interpretation as Spearman):

- **+1**: Perfect agreement (all item pairs have consistent order),
- **0**: About half of the pairs are in agreement, half in disagreement (random order relative to truth),
- **-1**: Perfectly inverted order.

*Significance:* Kendall's Tau is a more **direct measure of ordinal association**. It looks at each pair of items and checks if the model put them in the correct order relative to each other. It's often more intuitive for comparing rankings, but also more computationally expensive for large lists (because you consider all pairs). Tau has a probabilistic interpretation: it's the probability of agreement minus the probability of disagreement for a randomly chosen pair of items.

*When to use:* Use Kendall's Tau for evaluating rankings when you are interested in pairwise comparisons. In information retrieval, Kendall's Tau is sometimes used to compare ranked lists (though metrics like NDCG are more common). Tau is also used in statistical analysis of rankings (like in surveys or preference studies). It's very strict in the sense that every inversion is counted.

*Interpretation:* A Kendall's Tau of 0.8 means that aside from some pairwise disagreements, the rankings are largely similar (concordant pairs outnumber discordant pairs by 80% of all pairs – e.g., 90% concordant vs 10% discordant). Tau often gives lower values than Spearman for the same degree of similarity, especially on longer lists, because any single swap of two items affects potentially many pair comparisons. But it's quite interpretable in terms of pairwise accuracy.

**Spearman vs. Kendall:** Both measure rank correlation but in slightly different ways. Spearman's ρ uses the magnitudes of rank differences (like a correlation on rank values). Kendall's τ purely considers order relations (concordant/discordant pairs). If your application cares about the ordering of every pair, Kendall's is very direct. If you want a correlation-style summary and perhaps a bit more weight on larger ranking disagreements, Spearman's might be more appropriate. In practice, they often agree qualitatively. Spearman's can be more sensitive if the ranks have nonlinear relationships.

**Python Example – Rank Correlation (Spearman, Kendall):**

We can use SciPy to compute these correlations:

```python
from scipy.stats import spearmanr, kendalltau

# True ranking of 5 items (1 = highest rank, 5 = lowest)
true_ranks = [1, 2, 3, 4, 5]
# Predicted ranking of the same 5 items by the model
pred_ranks = [1, 3, 2, 5, 4]

spearman_corr, _ = spearmanr(true_ranks, pred_ranks)
kendall_tau, _ = kendalltau(true_ranks, pred_ranks)
print("Spearman's Rank Correlation:", spearman_corr)
print("Kendall's Tau:", kendall_tau)
```

Output:
```
Spearman's Rank Correlation: 0.8
Kendall's Tau: 0.6
```

In this example, the model's ranking (`pred_ranks`) is not perfect: items 2 and 3 are swapped, and items 4 and 5 are swapped relative to `true_ranks`. Spearman's ρ comes out to 0.8, indicating a strong correlation. Kendall's τ is 0.6, reflecting that 2 of the 10 possible pairs are out of order (in a list of 5 items there are 10 pairs; 8 are concordant and 2 discordant, giving $\tau = (8-2)/10 = 0.6$, matching the brute-force count above). Both metrics show a fairly good but not perfect rank correlation.

For a perfect ranking match, Spearman = 1, Kendall = 1. For a completely opposite ranking, both would be -1.

**Other Ranking Metrics:** In practice, when evaluating search or recommendation, you'll encounter metrics like **NDCG (Normalized Discounted Cumulative Gain)**, **Mean Average Precision (MAP)**, **Precision@K**, etc., which are beyond the scope of this tutorial but focus on top-ranked items. Rank correlation metrics like Spearman's and Kendall's, however, are general purpose and can be applied to any two sets of rankings to judge overall agreement.
## 5. Choosing the Right Metric

No single metric is best for all situations. The choice of metrics should be guided by the **task requirements**, the **data characteristics**, and what aspect of performance you care about. Here are some guidelines and trade-offs:

- **Classification:**
  - If your classes are imbalanced (one class much more frequent), **accuracy can be misleading** ([Pathmind: Evaluation Metrics for Machine Learning](https://wiki.pathmind.com/accuracy-precision-recall-f1)). In such cases, consider metrics like **Precision, Recall, F1, or MCC**, which better capture performance on the minority class. For example, in fraud detection with 1% fraud cases, 99% accuracy could mean you detect nothing. Instead, you'd look at recall (did we catch the frauds?) and precision (are our fraud alerts mostly correct?).
  - **Precision vs Recall:** Decide which error is worse. If false positives are costly (e.g., alarming innocents), aim for high precision. If false negatives are costly (missing real issues), aim for high recall ([Pathmind: Evaluation Metrics for Machine Learning](https://wiki.pathmind.com/accuracy-precision-recall-f1)). Often you will examine both and perhaps optimize a balance (using F1 or a custom weighted metric).
  - **F1-Score:** Use F1 when you need a single figure to represent the balance between precision and recall. It's useful for model selection in imbalanced contexts or when you care about detecting positives and avoiding false alarms roughly equally ([Pathmind: Evaluation Metrics for Machine Learning](https://wiki.pathmind.com/accuracy-precision-recall-f1)).
  - **ROC-AUC:** Great for understanding overall separability of classes and when you want to choose a threshold later. Use AUC especially in binary classification with probabilistic models and imbalanced data, as it is not affected by the class ratio. However, if positive cases are very rare or if you care more about performance in a specific range of recall/precision, consider PR curves or other metrics.
  - **Log Loss:** If you need well-calibrated probabilities or are in a competition that uses log loss, this is your go-to. It will heavily punish over-confident wrong answers, so it's a stricter metric. Use it when you care about the uncertainty estimates (for example, in medical diagnoses, you might prefer a model that says "I'm only 60% sure" over one that is 100% sure and occasionally wrong).
  - **Confusion Matrix & Derived Metrics:** It is always good practice to look at the confusion matrix for classification. From it, you can derive **Specificity** (TN rate), **False Positive Rate**, etc. If you have a multi-class problem, you might look at per-class precision/recall or a macro-averaged F1.
  - **MCC and Cohen's Kappa:** If you need a single metric for binary classification that accounts for all outcomes (especially in imbalanced cases), MCC is excellent. Cohen's Kappa is useful to understand performance relative to chance agreement – if accuracy is high partly due to class imbalance, Kappa will reveal that by being lower. These can be used alongside precision/recall to give a fuller picture.
  - **Hinge Loss:** Mostly relevant if you're using SVMs. If you optimize hinge loss during training, you might report it. But stakeholders usually find metrics like accuracy or F1 more interpretable. Hinge loss could be monitored for model fine-tuning or comparison with other models that use a similar loss.
- **Regression:**
  - **MSE vs MAE:** MSE (and RMSE) will **emphasize large errors**, which is useful if large mistakes are especially bad in your application (e.g., squaring an error of 10 makes it 100, so the model is pushed to avoid those). MAE treats all errors linearly, which can be more representative of "typical" error. If your loss function during training is MSE, you'll likely look at RMSE; if you used MAE (L1), you'll look at MAE. In general, **MAE is more robust to outliers** (one huge error won't blow it up as much) and is easier to interpret. **RMSE** is very popular as well – many people report RMSE because it's in the same units as the target and comparable to the standard deviation of errors.
  - **R-squared:** Use this to communicate how well your model explains variability. It's great for a quick sense ("Our model explains 85% of variance in house prices"). But always pair it with an error metric like RMSE or MAE, so you know the scale of errors. For example, $R^2$ could be 0.95 (sounds great), but if your dataset's variation is huge, the remaining 5% unexplained variance could still be a large error in absolute terms. Conversely, in some fields (like social sciences), $R^2$ might be low (like 0.3) but still meaningful.
  - **MAPE:** Good for business settings where percentage error is more intuitive. However, be cautious if your data has zeros or values close to zero. Also, MAPE will put a lot of weight on errors for small true values. If predicting something that ranges widely (say 1 to 1000), a fixed percentage error might not be equally desirable across the range – sometimes other relative error measures or a scaled error might be used.
  - **Median Absolute Error:** (Not listed above, but worth noting) if your error distribution is very skewed, sometimes the median of absolute errors is reported to say "half of our predictions are within X of the truth".
  - It's common to report multiple metrics: e.g., "MAE = 5.0, RMSE = 7.2, R² = 0.87". MAE tells you the typical magnitude of error, RMSE tells you that a few larger errors push it from 5 to 7.2 (implying some variability), and R² tells you that the overall fit is high.
- **Ranking:**
  - Use **Spearman's ρ or Kendall's τ** when the quality of the ordering is important. For example, in a recommendation system you might say "The Spearman correlation between our predicted ranking of movies and the user's true preferences was 0.9, indicating a very good match."
  - **Spearman's ρ** is easier to compute and understand as a correlation. **Kendall's τ** is more interpretable in terms of pairwise agreements (it's essentially a percentage of correct pair orders after adjustment). If the list is long or you only care about the top of the ranking, you might use other metrics like NDCG or Precision@K instead. But Spearman and Kendall are good summary statistics for ranking quality.
  - In search engines, Kendall's τ has been used to compare whole rankings (e.g., compare Google's ranking vs Bing's for the same query). In machine learning, if you turn a regression problem into predicting a ranking (say, preference learning), these metrics evaluate that.
  - One thing to note: if there are ties in the ranking or you are dealing with scores that produce a partial order, there are generalized versions (Spearman's footrule, etc.). But usually, you'll just rank by score and compute these.

- **Other considerations:**
  - Sometimes you might have domain-specific metrics or cost-based metrics (e.g., profit achieved, or a custom score that weights precision in one class 10x more than another). Always align the metric with the business goal.
  - **Threshold tuning:** For classifiers where you can adjust a threshold (like logistic regression output), you might use metrics to find the optimal threshold (e.g., maximize F1 or set recall to a certain level; see the sketch after this list). The evaluation metric in deployment might then be accuracy at that threshold, but you chose it by looking at precision/recall.
  - **Human understanding:** If you need to present to a non-technical audience, choose metrics that are easy to explain. "Our predictions are on average 5 years off" (MAE) is easier to grasp than "our RMSE is 7.2". For classification, "we catch 90% of fraud but with 5% false alarms" (recall = 0.9, precision = 0.95) is very clear.
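As one illustration of the threshold-tuning point above, a minimal sketch that picks the F1-maximizing threshold from a validation set; the `y_val`/`proba_val` arrays here are made-up stand-ins for your model's validation labels and positive-class probabilities:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_val, proba_val):
    """Return the probability threshold that maximizes F1 on validation data."""
    precision, recall, thresholds = precision_recall_curve(y_val, proba_val)
    # precision/recall have one extra trailing element with no matching threshold
    f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
    return thresholds[np.argmax(f1)]

# Toy validation data
y_val = np.array([0, 0, 1, 0, 1, 1, 0, 1])
proba_val = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.55])
print("Best F1 threshold:", best_f1_threshold(y_val, proba_val))
```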
In practice, you will often monitor several metrics. For example, in an imbalanced classification, you might track accuracy, precision, recall, and AUC. Each reveals something: maybe accuracy is improving (due to better overall classification), but if precision is dropping you know you're getting more false positives. It's common to use a **primary metric** for optimization (e.g., maximize AUC via cross-validation) but also check others to ensure the model meets all requirements.

Finally, metrics can be complemented with **visualizations**: ROC curves, precision-recall curves, residual plots for regression, etc., to give a fuller picture of model performance. Even though this tutorial focuses on numeric metrics, always remember the context: what does an error mean in real terms? What's the impact of a false prediction? The best metric is one that correlates well with the real-world cost or benefit of the model's decisions.