What is Retrieval-Augmented Generation?

Also known as: RAG

Retrieval-Augmented Generation (RAG) is a technique that improves the output of a large language model (LLM) by retrieving relevant external information before the model generates a response.

RAG combines two steps: first, a retriever searches a knowledge base (often using vector embeddings and similarity search) for documents relevant to the user's query. Then, the generator (the LLM) uses both the original query and the retrieved content to produce an answer.
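The two steps can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: the word-count `embed` function stands in for a real embedding model, `generate` is a placeholder where an LLM call would go, and every name here (`retrieve`, `build_prompt`, `answer`, `DOCS`) is hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Step 1: rank documents by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Combine the original query with the retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


def generate(prompt: str) -> str:
    """Step 2 placeholder: a real system would send the prompt to an LLM here."""
    return f"[LLM answer based on a {len(prompt)}-character prompt]"


def answer(query: str, documents: list[str], k: int = 2) -> str:
    """Run retrieval, then generation."""
    return generate(build_prompt(query, retrieve(query, documents, k)))


# Hypothetical knowledge base for illustration only.
DOCS = [
    "To reset the router, hold the power button for ten seconds.",
    "The warranty covers hardware defects for two years.",
    "Firmware updates are installed from the settings menu.",
]

if __name__ == "__main__":
    print(retrieve("How do I reset the router?", DOCS, k=1))
    print(answer("How do I reset the router?", DOCS))
```

In a real pipeline the documents would be embedded once and stored in a vector index, so that only the query is embedded at request time.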

This approach lets the model draw on current information without retraining, and it can reduce hallucinations by grounding outputs in retrieved source text. A typical pipeline combines an embedding model, a vector database, and the LLM.

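A RAG system can be tuned by changing how documents are retrieved (for example, the embedding model or the number of passages returned), the chunk size used when splitting documents, and how retrieved passages are inserted into the prompt.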
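Two of these knobs, chunk size and prompt insertion, are easy to show in code. The sketch below assumes a simple word-based splitter with overlap and a numbered-passage format; both are illustrative choices, and the names `chunk_words` and `format_passages` are hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words, so a sentence that straddles
    a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def format_passages(passages: list[str]) -> str:
    """Number each retrieved passage so the prompt can refer to it by index."""
    return "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_words(sample, chunk_size=4, overlap=1):
        print(chunk)
    print(format_passages(["first passage", "second passage"]))
```

Smaller chunks tend to make retrieval more precise but carry less surrounding context per passage, so the right values depend on the documents and are usually found by experiment.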

Example

A customer-support chatbot uses RAG to answer questions about a company's latest product manual by first retrieving the relevant sections from an internal document store and then generating a clear reply based on those sections.

Why it matters

RAG lets LLMs give more accurate, up-to-date answers without expensive retraining, and it has become a common pattern for building reliable enterprise and domain-specific AI applications.

Frequently asked questions

How is RAG different from fine-tuning?

Fine-tuning updates the model's weights by training on new data, while RAG leaves the model unchanged and supplies fresh information at inference time via retrieval.