What is Dimensionality Reduction?
Dimensionality reduction is a machine learning technique that decreases the number of features (dimensions) in a dataset while preserving as much relevant information as possible.
High-dimensional data often suffers from the curse of dimensionality: as the number of features grows, the data becomes sparse, computation becomes more expensive, and the risk of overfitting rises. Dimensionality reduction addresses this by transforming the data into, or selecting, a smaller set of features.
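One symptom of this sparsity is that distances between points become hard to tell apart as dimensions are added. The sketch below is a minimal illustration, assuming NumPy and uniformly random points in a unit hypercube; the function name `relative_contrast`, the sample size, and the chosen dimensions are hypothetical choices made for this example.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query point.

    A large value means near and far neighbors are easy to distinguish;
    a value close to 0 means all points look roughly equally far away.
    """
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform points in the unit hypercube
    query = rng.random(n_dims)                # one random query point
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Compare the contrast across increasing numbers of dimensions.
contrasts = {d: relative_contrast(500, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>4} dims: relative contrast = {c:.2f}")
```

The contrast shrinks as the dimension grows, which is one reason distance-based methods tend to struggle on raw high-dimensional data.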
Common approaches include linear methods such as Principal Component Analysis (PCA), which projects data onto a smaller number of axes that capture maximum variance, and nonlinear methods such as t-SNE or autoencoders, which can uncover more complex structure in the data.
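The PCA projection can be sketched in a few lines. This is a minimal illustration, assuming NumPy and computing the principal components from a singular value decomposition of the centered data; the function name `pca_project` and the toy data are hypothetical, and a library implementation would normally be used in practice.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X (rows = samples) onto its top principal components."""
    X_centered = X - X.mean(axis=0)                   # PCA assumes centered data
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]                    # rows are unit-length axes
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    return X_centered @ components.T, components, explained_variance

# Toy data: the second feature is roughly twice the first, the third is small noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + 0.1 * rng.normal(size=200),
    0.1 * rng.normal(size=200),
])

Z, components, variance = pca_project(X, n_components=1)
print("reduced shape:", Z.shape)
print("first principal axis:", np.round(components[0], 2))
```

Because two of the three features move together, a single component captures most of the variation, and its axis points roughly along the direction (1, 2, 0).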
The goal is to simplify models, speed up training, reduce noise, and enable visualization of data in 2D or 3D while minimizing information loss.
Example
In a dataset of house prices with 50 features like size, location, and amenities, dimensionality reduction might combine correlated features into 5 key components that still allow accurate price predictions.
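The scenario above can be mimicked with synthetic data. The sketch below is illustrative only: it assumes NumPy, and the data is randomly generated so that 50 features are driven by 5 hidden factors by construction. It is not a real house-price dataset, and the factor weights and noise levels are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n_houses, n_features, n_factors = 1000, 50, 5

# Synthetic assumption: 5 hidden factors generate 50 correlated features.
latent = rng.normal(size=(n_houses, n_factors))
mixing = rng.normal(size=(n_factors, n_features))
X = latent @ mixing + 0.1 * rng.normal(size=(n_houses, n_features))

# Synthetic "price" that depends on the same hidden factors (arbitrary weights).
price = latent @ np.array([3.0, -2.0, 1.5, 1.0, 0.5]) + 0.1 * rng.normal(size=n_houses)

# PCA via SVD: keep the 5 components with the most variance.
X_centered = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_ratio = S ** 2 / np.sum(S ** 2)
Z = X_centered @ Vt[:5].T

# Ordinary least squares on the 5 components (plus an intercept column).
A = np.column_stack([Z, np.ones(n_houses)])
coef, *_ = np.linalg.lstsq(A, price, rcond=None)
residual = price - A @ coef
r2 = 1 - residual.var() / price.var()

print(f"variance kept by 5 components: {explained_ratio[:5].sum():.3f}")
print(f"R^2 of price regression on 5 components: {r2:.3f}")
```

On real data the features are rarely this cleanly structured, so the number of components worth keeping is usually chosen by inspecting the explained variance or by validating the downstream model.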
Why it matters
Modern AI datasets built from images, genomics, and text are extremely high-dimensional. Dimensionality reduction makes their analysis computationally feasible, can improve model performance by removing noisy or redundant features, and often makes the data easier to visualize and interpret.
Frequently asked questions
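Is dimensionality reduction the same as feature selection?
Not exactly. Feature selection keeps a subset of the existing features, while feature extraction methods such as PCA create new combined features or projections. Both reduce the number of dimensions.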
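The difference can be sketched side by side. This is a minimal illustration, assuming NumPy; ranking columns by variance is just one hypothetical selection criterion chosen for brevity, and the toy data and column scales are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: four independent features with very different scales.
X = rng.normal(size=(100, 4)) * np.array([5.0, 0.1, 3.0, 0.2])
k = 2

# Feature selection: keep the k original columns with the highest variance.
keep = np.sort(np.argsort(X.var(axis=0))[-k:])
X_selected = X[:, keep]            # still original, directly interpretable columns

# Feature extraction: PCA builds k new columns as linear combinations of all four.
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_extracted = X_centered @ Vt[:k].T

print("selected column indices:", keep.tolist())
print("selected shape:", X_selected.shape, "extracted shape:", X_extracted.shape)
```

Both outputs have two columns, but the selected columns are copies of original features, whereas each extracted column mixes all of the original features.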
Related terms
Principal Component Analysis (PCA) is a technique for reducing the number of dimensions in a dataset while keeping as much of the original information as possible. It does this by finding new axes, called principal components, that capture the largest amounts of variation in the data.
An autoencoder is a neural network that learns to compress input data into a smaller representation and then reconstruct the original data from that compressed form.
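The compress-then-reconstruct idea can be sketched with a deliberately simplified model. The example below assumes NumPy and uses a linear autoencoder (no activation functions or bias terms) trained with plain gradient descent on synthetic data; practical autoencoders are usually nonlinear and built with a deep learning framework, and all names and hyperparameters here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 4 observed features driven by 2 hidden factors plus small noise.
z = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0]])
X = z @ mixing + 0.05 * rng.normal(size=(200, 4))

# Encoder maps 4 features to a 2-dimensional code; decoder maps the code back to 4.
W_enc = 0.1 * rng.normal(size=(4, 2))
W_dec = 0.1 * rng.normal(size=(2, 4))

def reconstruction_loss(X, W_enc, W_dec):
    """Mean squared reconstruction error per sample."""
    residual = X @ W_enc @ W_dec - X
    return np.mean(np.sum(residual ** 2, axis=1))

initial_loss = reconstruction_loss(X, W_enc, W_dec)

learning_rate = 0.05
for _ in range(500):
    code = X @ W_enc                       # compress
    residual = code @ W_dec - X            # reconstruct and compare
    grad_dec = 2.0 / len(X) * code.T @ residual
    grad_enc = 2.0 / len(X) * X.T @ residual @ W_dec.T
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_loss = reconstruction_loss(X, W_enc, W_dec)
codes = X @ W_enc                          # the reduced 2-dimensional representation

print(f"loss before training: {initial_loss:.3f}, after: {final_loss:.3f}")
```

After training, `codes` serves as the reduced representation of each sample, and the drop in reconstruction loss indicates how much of the original data the two-dimensional code retains.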
Active learning is a machine learning technique where the model itself selects the most informative unlabeled data points to be labeled by a human, rather than labeling data randomly or all at once.
Adam (Adaptive Moment Estimation) is a popular optimization algorithm used to train machine learning models by iteratively updating parameters based on gradients.
Anomaly detection is a machine learning technique that identifies rare or unusual data points that differ significantly from the majority of the data, often called outliers.
The bias-variance tradeoff describes the balance between two sources of error in a machine learning model: bias (error from overly simple assumptions) and variance (error from sensitivity to small fluctuations in the training data).