How many principal components should I keep?

A common rule of thumb is to keep enough components to explain a chosen share of the total variance (95% is a frequently used threshold), or to inspect a scree plot for an 'elbow' point beyond which additional components add little.
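The variance-threshold rule can be sketched in a few lines of plain Python. This is only an illustration: the function name `components_for_variance` and the sample variances are hypothetical, and the 0.95 default simply mirrors the threshold mentioned above.

```python
def components_for_variance(variances, threshold=0.95):
    """Return the smallest number of components whose combined variance
    reaches `threshold` (a fraction between 0 and 1) of the total."""
    ordered = sorted(variances, reverse=True)  # largest variance first
    total = sum(ordered)
    running = 0.0
    for k, variance in enumerate(ordered, start=1):
        running += variance
        if running / total >= threshold:
            return k
    # Guard against rounding when the threshold is very close to 1.
    return len(ordered)


# Hypothetical per-component variances (eigenvalues of a covariance matrix).
example_variances = [6.0, 3.0, 0.7, 0.3]
print(components_for_variance(example_variances, threshold=0.95))
```

The threshold is a judgment call; a lower value keeps fewer components at the cost of discarding more variance.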

Does PCA work with non-linear relationships?

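Standard PCA is linear: each component is a linear combination of the original features, so it can miss curved or otherwise non-linear structure. For such patterns, variants such as Kernel PCA are often used instead.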
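As a rough illustration of the Kernel PCA idea, the sketch below applies an RBF (Gaussian) kernel with NumPy. The function name `rbf_kernel_pca`, the choice of kernel, and the `gamma` value are assumptions made for this example; other kernels are equally possible, and this is not a reference implementation.

```python
import numpy as np


def rbf_kernel_pca(X, n_components=2, gamma=1.0):
    """Project the rows of X onto the leading kernel principal components."""
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances between all rows.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # remove tiny negative rounding errors
    K = np.exp(-gamma * sq_dists)

    # Center the kernel matrix (the analogue of centering the data).
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Eigen-decomposition of the symmetric centered kernel matrix.
    eigenvalues, eigenvectors = np.linalg.eigh(K_centered)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    eigenvalues = np.maximum(eigenvalues[order], 0.0)
    eigenvectors = eigenvectors[:, order]

    # Coordinates of the training points along each kernel component.
    return eigenvectors * np.sqrt(eigenvalues)


# Two concentric circles: a pattern with no useful linear direction.
angles = np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False)
inner = np.column_stack([0.3 * np.cos(angles), 0.3 * np.sin(angles)])
outer = np.column_stack([np.cos(angles), np.sin(angles)])
embedded = rbf_kernel_pca(np.vstack([inner, outer]), n_components=2, gamma=2.0)
print(embedded.shape)
```

How well the embedding separates the two circles depends on `gamma`, which usually needs tuning for the data at hand.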

What is Principal Component Analysis?

Also known as: PCA

Principal Component Analysis (PCA) is a technique for reducing the number of dimensions in a dataset while keeping as much of the original information as possible. It does this by finding new axes, called principal components, that capture the largest amounts of variation in the data.

PCA works by calculating the directions in which the data spreads out the most. These directions become the principal components, ordered from the one that explains the most variance to the one that explains the least.

The method relies on linear algebra: it centers the data, computes the covariance matrix, and finds that matrix's eigenvectors and eigenvalues. Each eigenvector defines a new axis, and its eigenvalue gives the variance of the data along that axis. Data points are then projected onto the top few axes to create a lower-dimensional version.
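The steps described above (center, compute the covariance matrix, eigen-decompose, project) can be sketched with NumPy. This is a minimal illustration under stated assumptions: the function name `pca` and the synthetic three-feature dataset are invented for the example.

```python
import numpy as np


def pca(X, n_components):
    """Return (projected data, component axes, all eigenvalues in descending order)."""
    X = np.asarray(X, dtype=float)
    # 1. Center each feature at zero.
    X_centered = X - X.mean(axis=0)
    # 2. Covariance matrix of the features (columns are variables).
    cov = np.cov(X_centered, rowvar=False)
    # 3. Eigen-decomposition; eigh is appropriate for symmetric matrices.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # eigh returns ascending eigenvalues, so reorder to descending variance.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # 4. Project onto the leading axes.
    components = eigenvectors[:, :n_components]
    projected = X_centered @ components
    return projected, components, eigenvalues


# Synthetic data: three features, most of the variation along one direction.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2.0 * t + 0.1 * rng.normal(size=200),
    0.1 * rng.normal(size=200),
])
projected, components, eigenvalues = pca(X, n_components=1)
print(projected.shape)
print(eigenvalues / eigenvalues.sum())  # share of variance per component
```

Dividing each eigenvalue by their sum gives the fraction of total variance that the corresponding component explains.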

Because it is unsupervised, PCA does not use labels; it only looks at the structure of the input features themselves.

Example

Imagine a spreadsheet with 50 columns describing each customer (age, income, purchase history, etc.). PCA can combine these into just two or three new columns that still capture the main differences between customers, making it easier to plot the data and explore clusters.

Why it matters

PCA is a standard preprocessing step in machine-learning pipelines: it can speed up training, reduce noise, and make high-dimensional data easier to visualize. It remains widely used in computer vision, genomics, and recommendation systems.

Frequently asked questions

Is PCA supervised or unsupervised?

PCA is unsupervised because it does not require labeled outcomes; it only examines relationships among the input features.

Related terms

Dimensionality Reduction

Dimensionality reduction is a machine learning technique that decreases the number of features (dimensions) in a dataset while preserving as much relevant information as possible.

Active Learning

Active learning is a machine learning technique where the model itself selects the most informative unlabeled data points to be labeled by a human, rather than labeling data randomly or all at once.

Adam Optimizer

Adam (Adaptive Moment Estimation) is a popular optimization algorithm used to train machine learning models by iteratively updating parameters based on gradients.

Anomaly Detection

Anomaly detection is a machine learning technique that identifies rare or unusual data points that differ significantly from the majority of the data, often called outliers.

Bias-Variance Tradeoff

The bias-variance tradeoff describes the balance between two sources of error in a machine learning model: bias (error from overly simple assumptions) and variance (error from sensitivity to small fluctuations in the training data).

Classification

Classification is a supervised machine learning task that assigns input data to one of several predefined categories or classes based on patterns learned from labeled training examples.