What is Retrieval-Augmented Generation?
Also known as: RAG
Retrieval-Augmented Generation (RAG) is a technique that improves large language models by retrieving relevant external information before generating a response.
RAG combines two steps: first, a retriever searches a knowledge base (often using vector embeddings and similarity search) for documents relevant to the user's query. Then, the generator (the LLM) uses both the original query and the retrieved content to produce an answer.
This approach keeps the model's answers current without retraining, because new information is added to the knowledge base rather than to the model's weights. It can also reduce hallucinations by grounding outputs in retrieved source text. A typical pipeline involves an embedding model, a vector database, and the LLM working together.
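The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustration, not a production design: the bag-of-words `embed` function stands in for a real embedding model, the in-memory `DOCUMENTS` list stands in for a vector database, and the final call to an LLM is left out. All names and sample texts here are hypothetical.

```python
import math
import re
from collections import Counter

# Toy knowledge base (assumption); in practice these would be chunks of real documents.
DOCUMENTS = [
    "To reset the router, hold the reset button for ten seconds.",
    "The warranty covers hardware defects for two years.",
    "Firmware updates are installed from the admin settings page.",
]

def embed(text):
    """Stand-in for an embedding model: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=2):
    """Retriever step: return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Generator input: the original query plus the retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

query = "How do I reset the router?"
passages = retrieve(query, DOCUMENTS, k=1)
prompt = build_prompt(query, passages)
# `prompt` would now be sent to whichever LLM the application uses.
```

Swapping `embed` for a learned embedding model and the list scan for an approximate nearest-neighbor index changes the quality and speed of retrieval, but the overall shape of the pipeline stays the same.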
RAG can be tuned by adjusting retrieval quality, chunk size, or how retrieved passages are inserted into the prompt.
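As a rough illustration of two of these knobs, the sketch below splits text into overlapping word-based chunks and formats retrieved passages as numbered sources. The function names, the word-based splitting, and the default sizes are assumptions chosen for the example; real systems often chunk by tokens, sentences, or document structure instead.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words, so a sentence that straddles
    a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

def format_passages(passages):
    """Number each passage so the prompt (and the answer) can refer to sources."""
    return "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))

sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
context_block = format_passages(chunks)
```

Smaller chunks tend to make retrieval more precise but can strip away surrounding context; larger chunks keep context at the cost of filling the prompt with less relevant text.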
Example
A customer-support chatbot uses RAG to answer questions about a company's latest product manual by first retrieving the relevant sections from an internal document store and then generating a clear reply based on those sections.
Why it matters
RAG lets LLMs deliver more accurate, up-to-date answers without expensive retraining, provided the knowledge base itself is kept current. It is now a standard pattern for building reliable enterprise and domain-specific AI applications.
Frequently asked questions
How is RAG different from fine-tuning?
Fine-tuning changes the model's weights by training on new data, while RAG leaves the model unchanged and supplies fresh information at inference time via retrieval.
Related terms
A vector database is a specialized database designed to store and query high-dimensional vector embeddings, enabling fast similarity searches instead of traditional exact-match queries.
Prompt engineering is the practice of designing and refining text inputs (prompts) to guide AI models like large language models toward producing accurate, relevant, or creative outputs.
In LLMs, hallucination is when the model generates fluent, confident text that is factually incorrect, fabricated, or not supported by its training data or the context it was given.
Temperature is a parameter in large language models that controls the randomness of generated text. Lower values produce more focused and deterministic outputs, while higher values increase creativity and variability.
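A common formulation, shown here as a small sketch, divides the model's raw scores (logits) by the temperature before applying softmax. The example logits and the function name are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after scaling them by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.5)
base = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)
# The top token's probability is highest in `sharp` and lowest in `flat`.
```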
Top-p sampling (nucleus sampling) is a text-generation technique that dynamically selects the smallest set of most likely next tokens whose combined probability exceeds a threshold p (e.g. 0.9), then samples from that set.
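The selection step can be sketched as follows, assuming `probs` is already a valid probability distribution over token indices. The function names are hypothetical, and this version stops as soon as the cumulative probability reaches `p`; implementations differ slightly in how they treat the boundary.

```python
import random

def top_p_filter(probs, p=0.9):
    """Return {token_index: renormalized probability} for the nucleus.

    Assumes `probs` is a non-empty distribution that sums to roughly 1.
    """
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for index, prob in ranked:
        nucleus.append((index, prob))
        cumulative += prob
        if cumulative >= p:
            break  # smallest set whose mass reaches the threshold
    # Renormalize so the kept probabilities sum to 1 again.
    return {index: prob / cumulative for index, prob in nucleus}

def sample_top_p(probs, p=0.9, rng=random):
    """Sample one token index from the nucleus."""
    nucleus = top_p_filter(probs, p)
    return rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token probabilities
kept = top_p_filter(probs, p=0.9)
token = sample_top_p(probs, p=0.9)
```

With these example probabilities and `p=0.9`, the lowest-probability token is excluded and sampling happens only among the remaining three.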