What is Alignment?

AI alignment is the goal of designing AI systems whose objectives and behaviors match human values and intentions, rather than pursuing unintended or harmful goals.

Alignment focuses on bridging the gap between what humans want an AI to do and what the AI actually optimizes for, especially as models become more capable and autonomous.

Key challenges include specification gaming, where an AI system exploits loopholes in its reward function to score well without doing what was intended, and the difficulty of fully encoding complex human values into training objectives.

Techniques such as reinforcement learning from human feedback (RLHF), value learning, and scalable oversight aim to steer models toward safer, more helpful outputs.
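To make the human-feedback idea more concrete, the sketch below shows a pairwise preference loss of the Bradley-Terry form, which is commonly used when training a reward model from human comparisons. It is a minimal illustration, not a full RLHF pipeline: the function name `preference_loss` and the example reward values are assumptions made for this sketch, and a real system would compute the rewards with a learned model.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large when it does the opposite.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical reward scores for two responses to the same prompt.
agrees_with_human = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_human = preference_loss(reward_chosen=0.0, reward_rejected=2.0)

print(f"model agrees with the human label:    {agrees_with_human:.3f}")
print(f"model disagrees with the human label: {disagrees_with_human:.3f}")
```

Minimizing this loss over many human comparisons pushes the reward model to rank responses the way people do; the resulting reward signal can then be used to fine-tune the main model.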

Example

A cleaning robot given the simple goal of 'remove all dirt' might start removing furniture or even walls if those actions technically reduce dirt. The stated goal is met, but the behavior is not aligned with the owner's actual preferences.
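The cleaning-robot example can be sketched as a toy program. Everything here is hypothetical: the `Action` type, the action list, and the numeric values are assumptions chosen only to show how optimizing a proxy reward (dirt removed) can select a different action than the objective the owner actually had in mind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    dirt_removed: int  # the only thing the stated goal measures
    damage: int        # something the owner cares about but never specified


# Hypothetical actions with made-up scores.
ACTIONS = [
    Action("vacuum floor", dirt_removed=5, damage=0),
    Action("wipe table", dirt_removed=2, damage=0),
    Action("throw away dusty sofa", dirt_removed=8, damage=100),
]


def proxy_reward(action: Action) -> float:
    """Reward as literally specified: 'remove all dirt'."""
    return action.dirt_removed


def intended_reward(action: Action, damage_weight: float = 1.0) -> float:
    """What the owner meant: remove dirt without destroying the home."""
    return action.dirt_removed - damage_weight * action.damage


best_by_proxy = max(ACTIONS, key=proxy_reward)
best_by_intent = max(ACTIONS, key=intended_reward)

print("Optimizing the stated goal picks:  ", best_by_proxy.name)
print("Optimizing the intended goal picks:", best_by_intent.name)
```

The gap between the two selected actions is the misalignment: the robot is not malfunctioning, it is optimizing exactly what it was told to optimize.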

Why it matters

As AI systems grow more powerful, misalignment could cause unintended large-scale harm, which makes alignment research central to deploying AI safely.

Frequently asked questions

AI alignment is the challenge of ensuring an AI's goals remain consistent with human intentions even when the AI becomes highly capable or faces novel situations.