What is K-Means?
K-Means is an unsupervised machine learning algorithm that partitions data into a user-specified number (K) of clusters by grouping similar points together.
The algorithm starts by randomly placing K centroids in the feature space. Each data point is then assigned to the nearest centroid based on distance, typically Euclidean.
Each centroid is then updated to the mean position of the points assigned to it, and the assignment and update steps repeat until the centroids stabilize or a maximum iteration limit is reached.
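K-Means seeks to minimize within-cluster variance (the sum of squared distances from each point to its assigned centroid) and is efficient for large datasets. It converges only to a local optimum, however, so results can vary with the initial centroid placement, and it assumes roughly spherical clusters.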
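The assign-and-update loop described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function name `kmeans`, its parameters (`max_iter`, `seed`, `init`), and the toy `points` are all hypothetical, and it assumes the initial centroids are K randomly chosen data points, which is one common way to place them.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0, init=None):
    """Cluster a list of equal-length numeric tuples into k groups."""
    rng = random.Random(seed)
    # Assumption: unless `init` is given, start from k distinct data points.
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments unchanged, so the centroids have stabilized
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical toy data: two visibly separate groups in two dimensions.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Changing `seed` changes the starting centroids, which is a simple way to observe the sensitivity to initialization; a common remedy is to run the algorithm several times and keep the result with the lowest within-cluster sum of squares.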
Example
A retailer might use K-Means with K=4 on customer purchase data to automatically group shoppers into clusters such as 'budget buyers', 'frequent high-spenders', 'seasonal purchasers', and 'one-time visitors'.
Why it matters
K-Means remains a foundational tool for exploratory data analysis, customer segmentation, image compression, and anomaly detection in modern AI pipelines due to its simplicity and speed.
Frequently asked questions
Common methods for choosing K include the elbow plot of the within-cluster sum of squares and the silhouette score, which measures cluster cohesion and separation.
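Both measures can be computed directly from a set of points and their cluster labels. The sketch below is illustrative only: the helper names `wcss` and `silhouette`, the toy `points`, and the hand-assigned labelings in `by_k` are hypothetical. In practice the labels for each candidate K would come from running K-Means, not from manual assignment.

```python
import math

def wcss(points, labels):
    """Within-cluster sum of squared distances to each cluster's mean."""
    total = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        center = tuple(sum(c) / len(members) for c in zip(*members))
        total += sum(math.dist(p, center) ** 2 for p in members)
    return total

def silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the same cluster (cohesion)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of another cluster (separation)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data and hand-assigned labelings for K = 1, 2, 3.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
by_k = {
    1: [0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 2, 1, 1, 1],
}
for k, labels in by_k.items():
    line = f"K={k}  WCSS={wcss(points, labels):.3f}"
    if k > 1:  # the silhouette score is undefined for a single cluster
        line += f"  silhouette={silhouette(points, labels):.3f}"
    print(line)
```

On this toy data the within-cluster sum of squares falls sharply from K=1 to K=2 and only slightly from K=2 to K=3, which is the "elbow", and the silhouette score is higher for K=2 than for K=3.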
Related terms
Clustering is an unsupervised machine learning technique that automatically groups similar data points together into clusters based on their features, without using any labeled examples.
Unsupervised learning is a machine learning method that trains models on unlabeled data to find hidden patterns, structures, or relationships without any guidance on correct outputs.
Active learning is a machine learning technique where the model itself selects the most informative unlabeled data points to be labeled by a human, rather than labeling data randomly or all at once.
Adam (Adaptive Moment Estimation) is a popular optimization algorithm used to train machine learning models by iteratively updating parameters based on gradients.
Anomaly detection is a machine learning technique that identifies rare or unusual data points that differ significantly from the majority of the data, often called outliers.
The bias-variance tradeoff describes the balance between two sources of error in a machine learning model: bias (error from overly simple assumptions) and variance (error from sensitivity to small fluctuations in the training data).