Why use a confusion matrix instead of just accuracy?

Accuracy can be misleading with imbalanced data; the matrix shows exactly which kinds of mistakes the model is making.
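A minimal sketch of that effect, using made-up data: the 990/10 class split and the always-negative "model" below are assumptions chosen for illustration, not figures from this entry.

```python
# Hypothetical imbalanced dataset: 990 negative cases (0), 10 positive cases (1).
actual = [0] * 990 + [1] * 10

# A degenerate "model" that always predicts the negative class.
predicted = [0] * len(actual)

# Count the four confusion-matrix cells.
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

accuracy = (tp + tn) / len(actual)
# Guard against division by zero when there are no actual positives.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy comes out at 0.99 even though the model misses every positive case; the matrix makes that visible through the 10 false negatives and zero true positives.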

Can a confusion matrix be used for more than two classes?

Yes, it can be extended to any number of classes by showing a row and column for each class label.
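As a rough sketch of the multi-class case, the helper below builds the table in plain Python. The function name `confusion_matrix`, the three labels, and the sample data are hypothetical. Putting actual labels on rows and predicted labels on columns is an assumed convention (the same orientation scikit-learn uses); some sources transpose it.

```python
def confusion_matrix(actual, predicted, labels):
    """Return a nested list where matrix[i][j] counts items whose actual
    label is labels[i] and whose predicted label is labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Hypothetical three-class example.
labels = ["cat", "dog", "bird"]
actual = ["cat", "cat", "dog", "dog", "bird", "bird", "bird"]
predicted = ["cat", "dog", "dog", "dog", "bird", "cat", "bird"]

matrix = confusion_matrix(actual, predicted, labels)

# Print one row per actual class.
for label, row in zip(labels, matrix):
    print(f"{label:>5}: {row}")

# Correct predictions sit on the diagonal.
correct = sum(matrix[i][i] for i in range(len(labels)))
print(f"correct: {correct} of {len(actual)}")
```

Off-diagonal cells show which classes get confused with which: here one cat was labeled a dog and one bird was labeled a cat.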

What is a Confusion Matrix?

A confusion matrix is a table that shows how well a classification model performs by comparing its predictions to the actual labels.

For a binary classifier, it organizes results into four categories: true positives (positive cases correctly predicted as positive), true negatives (negative cases correctly predicted as negative), false positives (negative cases incorrectly predicted as positive), and false negatives (positive cases incorrectly predicted as negative).

These counts let you calculate key metrics such as accuracy, precision, recall, and F1 score, revealing not just overall correctness but the specific types of errors the model makes.
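For reference, the standard definitions of these metrics in terms of the four counts can be written as follows, where TP, TN, FP, and FN abbreviate true positives, true negatives, false positives, and false negatives:

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
```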

The matrix works for both binary and multi-class problems and is especially helpful when classes are imbalanced.

Example

A spam filter might produce a matrix showing 900 true negatives (real emails correctly kept), 80 true positives (spam correctly caught), 20 false positives (real emails wrongly sent to spam), and 10 false negatives (spam that reached the inbox).
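The short sketch below plugs the spam filter counts above into the standard metric definitions. The variable names are illustrative, and spam is assumed to be the positive class, as the example implies.

```python
# Counts from the spam filter example (spam is the positive class).
tn = 900  # real emails correctly kept
tp = 80   # spam correctly caught
fp = 20   # real emails wrongly sent to spam
fn = 10   # spam that reached the inbox

total = tp + tn + fp + fn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of everything flagged as spam, how much was spam
recall = tp / (tp + fn)     # of all actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.3f}")   # 0.970
print(f"precision: {precision:.3f}")  # 0.800
print(f"recall:    {recall:.3f}")     # 0.889
print(f"f1:        {f1:.3f}")         # 0.842
```

Accuracy is about 97%, yet one in five messages sent to the spam folder is a real email, which is the kind of detail the matrix exposes.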

Why it matters

It gives a clear picture of model errors beyond simple accuracy, which is essential for high-stakes uses like medical diagnosis or fraud detection where different mistakes carry different costs.

Frequently asked questions

What do TP, TN, FP, and FN stand for?

They stand for true positive, true negative, false positive, and false negative, the four possible outcomes shown in the matrix.

Related terms

Accuracy

Accuracy is the proportion of a machine learning model's predictions that are correct: the number of correct predictions divided by the total number of predictions.

Precision

Precision is an evaluation metric for classification models that measures the proportion of true positive predictions among all positive predictions made.

Recall

Recall is an evaluation metric that measures the proportion of actual positive cases a model correctly identifies. It shows how well the model finds all relevant instances in the data.

F1 Score

The F1 Score is the harmonic mean of precision and recall, a single metric that balances the two to evaluate how well a classification model performs, especially when classes are imbalanced.

ROC Curve

A ROC Curve is a graph that illustrates the performance of a binary classification model by plotting the true positive rate against the false positive rate at various decision thresholds.
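One way to sketch how the points on such a curve are obtained: sweep a threshold over the model's scores and compute the two rates at each step. The function name `roc_points`, the scores, and the thresholds below are hypothetical, and a score at or above the threshold is assumed to mean a positive prediction.

```python
def roc_points(actual, scores, thresholds):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    Assumes `actual` contains at least one positive (1) and one negative (0);
    otherwise one of the rates would divide by zero.
    """
    positives = sum(actual)
    negatives = len(actual) - positives
    points = []
    for t in thresholds:
        # Predict positive when the score reaches the threshold.
        predicted = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
        fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical labels and model scores.
actual = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
thresholds = [1.0, 0.65, 0.35, 0.0]

for t, (fpr, tpr) in zip(thresholds, roc_points(actual, scores, thresholds)):
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves the point from (0, 0), where nothing is flagged, toward (1, 1), where everything is flagged; each threshold corresponds to its own confusion matrix.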

Benchmark

A benchmark is a standardized dataset and task used to measure and compare how well different AI models perform.