How is F1 different from accuracy?

Accuracy can be misleading with imbalanced data, while F1 focuses on the positive class by combining precision and recall.

Can F1 be used for multi-class problems?

Yes, by calculating F1 for each class and then averaging (macro, micro, or weighted) to get an overall score.
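
The three averaging schemes can be sketched in plain Python as follows. This is a minimal illustration, not a library implementation: the helper names (`per_class_counts`, `f1_from_counts`, `averaged_f1`) and the toy labels are assumptions made up for this example, and F1 is taken to be 0 when a class has no true positives, false positives, or false negatives.

```python
from collections import Counter


def per_class_counts(y_true, y_pred, labels):
    """Count true positives, false positives, and false negatives per class."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted class p, but it was wrong
            counts[t]["fn"] += 1  # true class t was missed
    return counts


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); assumed to be 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(y_true, y_pred):
    """Return macro, micro, and weighted F1 for a multi-class problem."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = per_class_counts(y_true, y_pred, labels)
    per_class = {c: f1_from_counts(**counts[c]) for c in labels}
    support = Counter(y_true)  # number of true examples per class

    # Macro: unweighted mean of the per-class scores.
    macro = sum(per_class.values()) / len(labels)
    # Weighted: per-class scores weighted by how often each class occurs.
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1_from_counts(
        sum(v["tp"] for v in counts.values()),
        sum(v["fp"] for v in counts.values()),
        sum(v["fn"] for v in counts.values()),
    )
    return {"macro": macro, "micro": micro, "weighted": weighted}


# Toy data, invented for illustration.
y_true = ["cat", "cat", "cat", "cat", "dog", "dog", "bird"]
y_pred = ["cat", "cat", "cat", "dog", "dog", "bird", "bird"]
scores = averaged_f1(y_true, y_pred)
print({name: round(value, 3) for name, value in scores.items()})
```

Macro averaging treats every class equally, weighted averaging favors frequent classes, and micro averaging pools all decisions. In single-label multi-class problems like this one, micro F1 equals accuracy.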

What is F1 Score?

The F1 Score is a single metric that balances precision and recall to evaluate how well a classification model performs, especially when the classes are imbalanced.

It is the harmonic mean of precision (how many predicted positives are actually correct) and recall (how many actual positives were found), giving equal weight to both.

The formula is F1 = 2 * (precision * recall) / (precision + recall). It ranges from 0 to 1, with 1 being perfect performance.
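
As a small worked example, the formula can be computed directly from confusion-matrix counts. The function name `f1_from_counts` and the counts used are hypothetical, and returning 0 when precision and recall are both 0 is a common convention assumed here rather than part of the formula itself.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from true positives, false positives, and false negatives."""
    # Precision: share of predicted positives that are actually positive.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall: share of actual positives that the model found.
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return 0.0  # assumed convention: F1 is 0 when it would be undefined
    # Harmonic mean of precision and recall.
    return 2 * (precision * recall) / (precision + recall)


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
# precision = 8/10 = 0.8, recall = 8/12 ≈ 0.667
score = f1_from_counts(tp=8, fp=2, fn=4)
print(round(score, 3))  # → 0.727
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the result (0.727) sits closer to the recall of 0.667 than the arithmetic mean (0.733) would.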

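Unlike accuracy, F1 ignores true negatives, so a model cannot score well just by predicting the majority class when that class is the negative one. This makes it more informative than accuracy on imbalanced data, though the score depends on which class is treated as positive.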

Example

In a medical test for a rare disease, a model might achieve high accuracy by always saying 'no disease,' but its F1 score would be 0 because it finds none of the actual cases (recall is 0).
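
To put numbers on this scenario, here is a minimal sketch with invented data: 1,000 patients, 10 of whom have the disease, and a model that always predicts 'no disease'. The prevalence and variable names are assumptions for illustration only, and F1 uses the algebraically equivalent form 2*TP / (2*TP + FP + FN).

```python
# Hypothetical screening data: 1 = disease, 0 = no disease.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # the model always predicts "no disease"

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sick, flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # sick, missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, cleared

accuracy = (tp + tn) / len(y_true)

# F1 = 2*TP / (2*TP + FP + FN), taken as 0 when the denominator is 0.
denom = 2 * tp + fp + fn
f1 = 2 * tp / denom if denom else 0.0

print(f"accuracy={accuracy:.2f}, F1={f1:.2f}")  # → accuracy=0.99, F1=0.00
```

The 990 true negatives lift accuracy to 0.99, but they do not appear anywhere in the F1 calculation, so the 10 missed cases drive the score to 0.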

Why it matters

The F1 Score is widely used to assess models on imbalanced real-world tasks such as fraud detection, medical diagnosis, and content moderation, where the positive class is rare and accuracy alone says little.

Frequently asked questions

What is a good F1 score?

It depends on the task, but as a rough rule of thumb, scores above 0.7 are often considered decent and above 0.9 excellent for many classification problems.

Related terms

Precision

Precision is an evaluation metric for classification models that measures the proportion of true positive predictions among all positive predictions made.

Recall

Recall is an evaluation metric that measures the proportion of actual positive cases a model correctly identifies. It shows how well the model finds all relevant instances in the data.

Accuracy

Accuracy measures the proportion of correct predictions made by a machine learning model out of all predictions. It is calculated as the number of correct predictions divided by the total number of predictions.

Confusion Matrix

A confusion matrix is a table that shows how well a classification model performs by comparing its predictions to the actual labels.

Benchmark

A benchmark is a standardized dataset and task used to measure and compare how well different AI models perform.

BLEU Score

BLEU Score is an automatic metric that evaluates machine-generated text quality, mainly for machine translation, by measuring overlap with human-written reference translations.