RAG (Retrieval-Augmented Generation)
RAG is the architectural pattern for grounding LLM responses in your own data. The pipeline has three stages:
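- Ingest: split documents into chunks, embed each chunk into a vector, and store the vectors (together with the chunk text) in a vector index.
- Retrieve: embed the user's query with the same embedding model, then find the top-K chunks whose vectors are most similar to it (typically by cosine similarity).
- Generate: pass the retrieved chunks to an LLM as context alongside the query; the LLM produces the answer from that context.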
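The ingest and retrieve stages can be sketched in a few lines. This is a minimal, self-contained illustration, not a production pipeline: `embed` is a toy hashed bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and the function names and chunk sizes (`chunk_text`, `ingest`, `retrieve`, 50-word chunks with 10-word overlap) are hypothetical choices for the example.

```python
import math
import re
import zlib

DIM = 1024  # size of the toy embedding space (assumption, not a real model's dimension)


def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Ingest, step 1: split text into overlapping windows of chunk_size words."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def ingest(documents: list[str]) -> list[tuple[str, list[float]]]:
    """Ingest: chunk every document, embed each chunk, store (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for doc in documents for chunk in chunk_text(doc)]


def retrieve(store: list[tuple[str, list[float]]], query: str, k: int = 3) -> list[str]:
    """Retrieve: embed the query with the same model, return the top-k chunks.

    Vectors are unit length, so the dot product equals cosine similarity.
    """
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for chunk, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


store = ingest([
    "Invoices are emailed on the first business day of each month.",
    "Refunds are processed within five business days of approval.",
    "Support is available by chat from 9am to 5pm on weekdays.",
])
print(retrieve(store, "How long do refunds take?", k=1))
```

Here the refunds sentence should rank first because it is the only chunk sharing a word ("refunds") with the query. A real embedding model would also match paraphrases such as "money back", which this toy hash cannot; the retrieved chunks are what the generate stage passes to the LLM.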
Why RAG: LLMs don't know your private data, and they hallucinate when they have to guess. Retrieval grounds the model in actual, attributable content; the model's job becomes "summarise and cite", not "remember the answer".
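One way to make the model "summarise and cite" is to number each retrieved chunk in the prompt and ask for citations by number. The sketch below assumes each chunk arrives as a `(source_id, text)` pair; the function name `build_grounded_prompt`, the source IDs, and the exact instruction wording are illustrative assumptions, and sending the prompt to an LLM is left to whatever client you use.

```python
def build_grounded_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Build a prompt that restricts the model to numbered, attributable sources.

    chunks: (source_id, text) pairs, e.g. from the retrieve step (hypothetical shape).
    """
    sources = "\n".join(
        f"[{i}] ({source_id}) {text}"
        for i, (source_id, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the sources below. "
        "Cite each claim with its source number, e.g. [1]. "
        "If the sources do not contain the answer, say you don't know.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_grounded_prompt(
    "How long do refunds take?",
    [("refund-policy.md", "Refunds are processed within five business days of approval.")],
)
print(prompt)
```

The explicit "say you don't know" instruction targets the guessing failure mode described above, and the numbered sources let a reader check each cited claim against the original chunk.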
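As of 2026, best practice is hybrid retrieval: run both vector similarity search and lexical (full-text) search, then merge the two ranked result lists with reciprocal rank fusion (RRF). Neither signal alone catches every relevant chunk: embeddings match paraphrases and synonyms, while lexical search matches exact terms such as identifiers, error codes, and names.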
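Reciprocal rank fusion scores each chunk by summing 1 / (k + rank) over every result list it appears in, with ranks starting at 1. A minimal sketch follows; the function name is hypothetical, the inputs are assumed to be lists of chunk IDs already ordered best-first by each search, and `k = 60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk IDs: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order because sorted() is stable.
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


vector_hits = ["c1", "c2", "c3"]   # best-first results from vector similarity search
lexical_hits = ["c3", "c1", "c4"]  # best-first results from full-text search
print(reciprocal_rank_fusion([vector_hits, lexical_hits]))
```

Because RRF uses only rank positions, it sidesteps the problem that cosine similarities and full-text relevance scores live on different, incomparable scales. Chunks found by both searches (`c1`, `c3` here) rise above chunks found by only one.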