Retrieval-Augmented Generation (RAG)

General-purpose language models can be fine-tuned to perform several common tasks, such as sentiment analysis and named entity recognition. These tasks generally don't require additional background knowledge.

For more complex, knowledge-intensive tasks, it's possible to build a language-model-based system that accesses external knowledge sources to complete them. This improves factual consistency, makes the generated responses more reliable, and helps to mitigate the problem of "hallucination".

Why LLMs Need RAG

LLMs are essentially highly advanced text-prediction engines trained on massive, but static, datasets. This creates a few fundamental problems that RAG directly addresses:

The Knowledge Cutoff: Training an LLM from scratch takes months and millions of dollars. By the time a model is released, its knowledge of current events is already outdated. RAG allows the model to access the internet or live databases to pull in news, stock prices, or updated data without needing a full retraining.
Reducing Hallucinations: When a standard LLM doesn't know an answer, it tends to confidently guess or make things up (hallucinate). RAG gives the model actual retrieved facts to ground its response in, which typically improves accuracy.
Providing Verifiable Sources: Because a RAG system actively pulls specific documents or web pages to answer a prompt, it can "show its work" by providing direct links and citations so users can verify the information.

How RAG Works

Think of a standard LLM as a student taking a closed-book exam, relying only on their memory. A RAG system turns it into an open-book exam.

Retrieval: When you ask a question (e.g., "What was the score of last night's cricket match?"), the system doesn't immediately send it to the LLM. First, it uses a search engine or a vector database to find the most relevant and recent articles or data snippets.
Augmentation: The system takes your original question and appends the freshly retrieved information to it in the background.
Generation: The LLM reads this augmented package (your prompt + the newly found facts) and synthesises a natural, coherent answer grounded in those facts.
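
The three steps can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the corpus, the document IDs, the word-overlap scoring, and the `generate` placeholder are all assumptions made for the example. A real system would use a search engine or vector database for retrieval and an actual LLM call for generation.

```python
import re

# Hypothetical in-memory corpus with made-up content; a real system would
# query a search engine or a vector database instead.
CORPUS = [
    {"id": "doc-1", "text": "Team A beat Team B by 5 wickets in last night's cricket match."},
    {"id": "doc-2", "text": "Central bank leaves interest rates unchanged this week."},
    {"id": "doc-3", "text": "A new cricket stadium is planned for next year."},
]


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def retrieve(question, corpus, k=2):
    """Retrieval: rank documents by word overlap with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(doc["text"])), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep the top k documents that share at least one word with the question.
    return [doc for score, doc in scored[:k] if score > 0]


def augment(question, docs):
    """Augmentation: combine the retrieved snippets with the question."""
    context = "\n".join(f"[{d['id']}] {d['text']}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def generate(prompt):
    """Generation: placeholder standing in for a call to an actual LLM."""
    return f"(LLM output for a prompt of {len(prompt)} characters)"


question = "What was the score of last night's cricket match?"
docs = retrieve(question, CORPUS)
prompt = augment(question, docs)
print(prompt)
print(generate(prompt))
```

Tagging each snippet with its document ID in the prompt is one simple way to let the model cite where a fact came from, which is what makes the "verifiable sources" benefit possible.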

Meta AI researchers introduced a method called Retrieval-Augmented Generation (RAG) to address such knowledge-intensive tasks. RAG combines an information retrieval component with a text generator model. RAG can be fine-tuned, and its internal knowledge can be updated efficiently without retraining the entire model.

RAG takes an input and retrieves a set of relevant, supporting documents from a source (e.g., Wikipedia). The documents are concatenated as context with the original input prompt and fed to the text generator, which produces the final output. This makes RAG adaptive in situations where facts evolve over time, which is very useful because an LLM's parametric knowledge is static. RAG lets language models bypass retraining and gives them access to the latest information, so they can generate reliable outputs via retrieval-based generation.
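
To illustrate why this makes the system adaptive, here is a small sketch in which the answer context changes after a document is updated, while the "model" itself is never touched. Everything in it is an assumption for illustration: the `DocumentStore` class, its `upsert` and `search` methods, the bag-of-words `embed` function (a stand-in for a learned encoder), and the made-up release-notes text.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class DocumentStore:
    """Hypothetical external knowledge source, keyed by document ID."""

    def __init__(self):
        self._docs = {}  # doc_id -> (text, vector)

    def upsert(self, doc_id, text):
        """Insert a document, or overwrite it if the ID already exists."""
        self._docs[doc_id] = (text, embed(text))

    def search(self, query, k=1):
        """Return the texts of the k documents most similar to the query."""
        q = embed(query)
        ranked = sorted(
            self._docs.values(), key=lambda item: cosine(q, item[1]), reverse=True
        )
        return [text for text, _ in ranked[:k]]


def build_prompt(store, question):
    """Concatenate the retrieved context with the original question."""
    context = "\n".join(store.search(question))
    return f"Context:\n{context}\n\nQuestion: {question}"


store = DocumentStore()
store.upsert("release-notes", "The current stable version is 1.0.")
store.upsert("office-hours", "Office hours are 9 to 5 on weekdays.")

question = "What is the current stable version?"
before = build_prompt(store, question)

# The facts change: only the document store is updated, not the model.
store.upsert("release-notes", "The current stable version is 2.0.")
after = build_prompt(store, question)

print(before)
print(after)
```

The same question yields a different prompt once the stored document changes, so a generator reading that prompt would see the new fact without any retraining.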

Meta proposed a general-purpose fine-tuning recipe for RAG. A pre-trained seq2seq model is used as the parametric memory, and a dense vector index of Wikipedia is used as the non-parametric memory (accessed using a pre-trained neural retriever). Below is an overview of how the approach works:
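
To make the interplay of the two memories more concrete, here is a toy numerical sketch. It assumes the formulation in which the retrieved document z is treated as a latent variable: the retriever scores documents by the inner product of query and document vectors, turns the top-k scores into a distribution p(z | x), and the generator's probability p(y | x, z) is averaged under that distribution. The vectors and probabilities below are made up for illustration, and the code is not Meta's implementation.

```python
import math

# Made-up 3-dimensional embeddings standing in for the dense document index
# (the non-parametric memory).
DOC_VECTORS = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.7, 0.3, 0.1],
    "doc_c": [0.0, 0.2, 0.9],
}
QUERY_VECTOR = [1.0, 0.2, 0.0]  # made-up embedding of the input x

# Made-up generator probabilities p(y | x, z) for one candidate answer y,
# standing in for the seq2seq model (the parametric memory).
ANSWER_PROB_GIVEN_DOC = {"doc_a": 0.8, "doc_b": 0.5, "doc_c": 0.1}


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query, doc_vectors, k):
    """Return the k highest inner-product scores as {doc_id: score}."""
    scores = {doc_id: dot(query, vec) for doc_id, vec in doc_vectors.items()}
    best = sorted(scores, key=scores.get, reverse=True)[:k]
    return {doc_id: scores[doc_id] for doc_id in best}


def softmax(scores):
    """Normalise scores into probabilities that sum to 1."""
    peak = max(scores.values())
    exps = {key: math.exp(s - peak) for key, s in scores.items()}
    total = sum(exps.values())
    return {key: e / total for key, e in exps.items()}


retrieved = top_k(QUERY_VECTOR, DOC_VECTORS, k=2)
p_doc = softmax(retrieved)  # p(z | x), restricted to the retrieved documents

# Marginalise over the retrieved documents: sum_z p(z | x) * p(y | x, z).
p_answer = sum(p_doc[z] * ANSWER_PROB_GIVEN_DOC[z] for z in p_doc)

print(sorted(retrieved))
print(round(p_answer, 3))
```

Because the final probability is a weighted average over retrieved documents, a document the retriever ranks highly has more influence on the output, which is also why the retriever and generator can be fine-tuned together.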

RAG performs strongly across several benchmarks, including Natural Questions, WebQuestions, and CuratedTrec. RAG generates responses that are more factual, specific, and diverse when tested on MS-MARCO and Jeopardy questions. RAG also improves results on FEVER fact verification.

This shows the potential of RAG as a viable option for enhancing the outputs of language models in knowledge-intensive tasks.

These retriever-based approaches have become more popular and are now combined with LLMs such as ChatGPT, Claude, and Gemini to improve their capabilities and factual consistency.