Accuracy measures the proportion of correct predictions made by a machine learning model out of all predictions. It is calculated as the number of correct predictions divided by the total number of predictions.
Accuracy is a simple evaluation metric for classification tasks. It counts true positives and true negatives, then divides by the total instances to give a percentage of correct outcomes.
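The calculation above can be sketched in a few lines. This is a minimal illustration for a binary task; the function name `accuracy` and the counts passed to it are hypothetical, chosen only to show the arithmetic.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of predictions that were correct (binary classification)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("accuracy is undefined when there are no predictions")
    # Correct predictions are the true positives plus the true negatives.
    return (tp + tn) / total


# Hypothetical counts, not taken from any real model.
print(accuracy(tp=40, tn=45, fp=5, fn=10))  # 0.85
```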
It assumes equal importance for all errors and works best on balanced datasets. On imbalanced data it can be misleading because a model can achieve high accuracy by always predicting the majority class.
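To make the majority-class problem concrete, the sketch below scores a "model" that ignores its input and always predicts the most common label. The dataset (990 negatives, 10 positives) and the helper name `majority_baseline_accuracy` are assumptions made up for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a predictor that always outputs the most common label."""
    if not labels:
        raise ValueError("labels must not be empty")
    # most_common(1) returns [(label, count)] for the most frequent label.
    _, majority_count = Counter(labels).most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical imbalanced dataset: 990 negatives, 10 positives.
labels = ["negative"] * 990 + ["positive"] * 10
print(majority_baseline_accuracy(labels))  # 0.99
```

Under these assumed numbers the baseline reaches 99% accuracy while never identifying a single positive case, which is why a model's accuracy is usually worth comparing against this baseline.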
Accuracy is often reported alongside other metrics such as precision, recall, and F1 score to give a fuller picture of model performance.
A model that classifies 95 out of 100 images correctly as cat or dog has 95% accuracy. If the dataset contains 90 dogs and only 10 cats, the same score could hide poor performance on cats.
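One way the cat-and-dog example could play out is sketched below. The breakdown of errors is an assumption (all 90 dogs correct, only 5 of the 10 cats correct); the source gives only the overall 95 out of 100. The helper names are hypothetical.

```python
# Assumed breakdown (hypothetical): every dog is right, half the cats are wrong.
y_true = ["dog"] * 90 + ["cat"] * 10
y_pred = ["dog"] * 90 + ["cat"] * 5 + ["dog"] * 5


def overall_accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_accuracy(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals, hits = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (t == p)
    return {label: hits[label] / totals[label] for label in totals}


print(overall_accuracy(y_true, y_pred))    # 0.95
print(per_class_accuracy(y_true, y_pred))  # {'dog': 1.0, 'cat': 0.5}
```

With this assumed split, the headline figure stays at 95% even though the model is right about cats only half the time.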
Accuracy remains one of the most widely reported baseline metrics for comparing models and communicating results to non-experts. It is easy to understand, but it must be interpreted carefully in real-world applications where classes are imbalanced.
What counts as good accuracy depends on the problem: 90% or more is often strong for balanced data, but random guessing on a balanced binary task already gives about 50%.
Precision is an evaluation metric for classification models that measures the proportion of true positive predictions among all positive predictions made.
Recall is an evaluation metric that measures the proportion of actual positive cases a model correctly identifies. It shows how well the model finds all relevant instances in the data.
The F1 Score is a single metric that balances precision and recall to evaluate how well a classification model performs, especially when classes are uneven.
A confusion matrix is a table that shows how well a classification model performs by comparing its predictions to the actual labels.
A benchmark is a standardized dataset and task used to measure and compare how well different AI models perform.
BLEU Score is an automatic metric that evaluates machine-generated text quality, mainly for machine translation, by measuring overlap with human-written reference translations.
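The confusion matrix, precision, recall, and F1 definitions above are closely related, and a short sketch can show how they connect to accuracy. It assumes a binary task with "cat" treated as the positive class, uses the standard harmonic-mean definition of F1, and returns 0.0 when a ratio would be undefined (a convention chosen here, not the only option). The labels and function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, fn, tn), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1; 0.0 is returned where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels, for illustration only.
y_true = ["cat", "cat", "dog", "dog", "dog", "cat", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "cat", "dog", "dog"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="cat")
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(tp, fp, fn, tn)  # 2 1 1 4
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
print(accuracy)  # 0.75
```

The four counts are the cells of a 2x2 confusion matrix. Accuracy uses all four, while precision and recall ignore the true negatives, which is why they can be more informative when the positive class is rare.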