What is K-Nearest Neighbors?
Also known as: KNN
K-Nearest Neighbors (KNN) is a simple supervised machine learning algorithm used for classification and regression that predicts the label or value of a new data point based on the majority vote or average of its K closest training examples.
KNN works by storing the entire training dataset and, for each new query point, calculating distances to all known points using a chosen metric such as Euclidean distance. It then selects the K nearest points and aggregates their labels (for classification) or values (for regression) to produce the prediction.
The algorithm is instance-based and lazy, meaning it performs no explicit training phase and defers computation until prediction time. Key choices include the value of K, the distance metric, and optional weighting of neighbors by distance.
Because it makes few assumptions about data distribution, KNN is non-parametric and can model complex decision boundaries, though it can suffer from the curse of dimensionality and requires careful feature scaling.
Example
Imagine classifying a new fruit as an apple or orange: measure its weight and diameter, find the three closest fruits in a labeled dataset, and assign the label that appears most often among those three neighbors.
Why it matters
KNN remains a foundational baseline in modern AI because of its interpretability and ease of implementation, and it underpins many recommendation systems, anomaly detection pipelines, and prototype-based methods used in production today.
Frequently asked questions
K is a user-chosen hyperparameter that specifies how many nearest neighbors are considered when making a prediction.
Related terms
Supervised learning is a machine learning method where a model is trained on data that already has correct answers attached, so it can learn to predict those answers for new data.
Classification is a supervised machine learning task that assigns input data to one of several predefined categories or classes based on patterns learned from labeled training examples.
Regression is a supervised machine learning method that predicts continuous numerical values from input features.
Active learning is a machine learning technique where the model itself selects the most informative unlabeled data points to be labeled by a human, rather than labeling data randomly or all at once.
Adam (Adaptive Moment Estimation) is a popular optimization algorithm used to train machine learning models by iteratively updating parameters based on gradients.
Anomaly detection is a machine learning technique that identifies rare or unusual data points that differ significantly from the majority of the data, often called outliers.