What is Interpretability?

Interpretability is the property of an AI model that allows humans to understand why it made a particular decision or prediction.

In machine learning, many powerful models, such as deep neural networks, act as black boxes: they produce outputs without revealing their internal reasoning. Interpretability techniques aim to open this box, for example by highlighting which inputs most influenced the result.

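Common approaches include feature importance scores, simpler surrogate models (such as a shallow decision tree or a linear model) that approximate the original model's behavior, and visualization methods that show decision pathways. These help users trace how input data relates to model behavior, although an attribution on its own does not prove a causal relationship.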
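As a rough illustration of a feature importance score, the sketch below uses permutation importance: shuffle one input column at a time and measure how much the model's error grows. The model, the data, and all function names here are hypothetical and chosen only for the example. It uses only the Python standard library.

```python
import random


def model(row):
    # Hypothetical black-box model (an assumption for this sketch):
    # depends strongly on feature 0, weakly on feature 1, and ignores feature 2.
    return 3.0 * row[0] + 0.5 * row[1]


def mean_squared_error(predict, rows, targets):
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(predict, rows, targets, n_repeats=20, seed=0):
    """Average increase in error when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_squared_error(predict, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle only column j, breaking its link to the targets.
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mean_squared_error(predict, shuffled, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Synthetic data: three features drawn uniformly from [-1, 1].
data_rng = random.Random(42)
rows = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
targets = [model(r) for r in rows]

scores = permutation_importance(model, rows, targets)
for index, value in enumerate(scores):
    print(f"feature {index}: importance {value:.3f}")
```

A larger score means the model leans more heavily on that feature. Here the ignored feature should score zero, because shuffling it does not change any prediction.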

From an ethics perspective, interpretability supports accountability by making it possible to detect bias, errors, or unfair patterns that would otherwise remain hidden.

Example

A bank uses an AI system to approve loans. With interpretability tools, an applicant can see that their application was rejected mainly because of a high debt-to-income ratio and a recent late payment, rather than receiving only a yes/no answer.
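For a simple linear scoring model, the kind of explanation described above can be sketched by computing how much each feature moves the score relative to a reference applicant. The weights, the reference values, the approval threshold of zero, and the feature names below are all illustrative assumptions and do not describe any real bank's system.

```python
# Hypothetical weights of a linear credit-scoring model (illustrative values only).
WEIGHTS = {
    "debt_to_income": -8.0,
    "recent_late_payments": -1.5,
    "years_employed": 0.2,
    "credit_history_years": 0.1,
}
BIAS = 1.0

# Hypothetical reference applicant that contributions are measured against.
BASELINE = {
    "debt_to_income": 0.30,
    "recent_late_payments": 0,
    "years_employed": 5,
    "credit_history_years": 10,
}


def score(applicant):
    # In this sketch, an application is approved when the score is >= 0.
    return BIAS + sum(WEIGHTS[name] * applicant[name] for name in WEIGHTS)


def explain(applicant):
    """Per-feature contribution to the score, most negative first."""
    contributions = {
        name: WEIGHTS[name] * (applicant[name] - BASELINE[name])
        for name in WEIGHTS
    }
    return sorted(contributions.items(), key=lambda item: item[1])


applicant = {
    "debt_to_income": 0.55,
    "recent_late_payments": 1,
    "years_employed": 4,
    "credit_history_years": 9,
}

decision = "approved" if score(applicant) >= 0 else "rejected"
print(f"Decision: {decision}")
for name, contribution in explain(applicant):
    print(f"{name}: {contribution:+.2f}")
```

Because the model is linear, the contributions add up exactly to the gap between the applicant's score and the reference score, so the ranked list is a faithful account of the decision. For non-linear models, attribution methods can only approximate this kind of breakdown.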

Why it matters

As AI systems are increasingly used in high-stakes areas like healthcare, hiring, and criminal justice, interpretability helps build trust, supports regulatory compliance, and helps prevent harmful or discriminatory outcomes.

Frequently asked questions

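What is the difference between interpretability and explainability?

The two terms are closely related and often used interchangeably. Interpretability usually refers to models that are inherently understandable, while explainability usually refers to techniques that explain a model's outputs after training, whatever its internal structure.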