As AI systems reshape the way we interact with information, Retrieval-Augmented Generation (RAG) has emerged as a powerful architecture for combining internal knowledge with generative language models. But these capabilities come with a critical challenge: how do we know whether a RAG system actually works?
This blog post outlines a structured, real-world approach to validating a RAG system, based on our experience building and maintaining COGNOS, a RAG-powered product at Apiumhub.
What Is a RAG System?
A RAG system combines an information retrieval engine (e.g., a search or vector database) with a generative large language model (LLM), allowing it to answer questions grounded in proprietary data while drawing on the model's general knowledge.
Why Is RAG Validation Difficult?
Unlike traditional rule-based systems, RAG systems are non-deterministic. Answers depend on:
- The model you use
- The documents retrieved
- The phrasing of the input question
In our case with COGNOS, the challenge is even greater: users upload their own files or connect private knowledge sources, so we cannot see the actual content used at inference time. Even so, we must guarantee a minimum level of reliability.
Can Validation Be Automated?
Yes, tools like MLflow, RAGAS, UpTrain, Opik, or custom frameworks using LangChain can help automate the process. However, before automating, you must understand what to validate and why.
This post focuses on the manual foundation you should establish before automation.
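Still, for orientation, here is a minimal glimpse of what that automation can look like with RAGAS. This is a sketch assuming the ragas ~0.1 Python API (names and signatures may differ in your installed version), and these metrics call an LLM judge under the hood, so model-provider credentials are required:

```python
# Minimal sketch of automated RAG evaluation with RAGAS.
# Assumes the ragas ~0.1 API; check your installed version, as names may differ.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One evaluation record: the question asked, the answer the system produced,
# the contexts it retrieved, and the ground truth you expect.
records = {
    "question": ["How many vacation days am I entitled to?"],
    "answer": ["Each employee is entitled to up to 23 days of vacation."],
    "contexts": [["Each employee is entitled to take up to 23 days of vacation."]],
    "ground_truth": ["Employees are entitled to up to 23 vacation days per year."],
}

result = evaluate(
    Dataset.from_dict(records),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)  # per-metric scores between 0 and 1
```

Even with such tooling in place, the records you feed it must come from the manual groundwork described next.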
Validation Workflow for a RAG System
Let’s split the process into two phases: preparation and execution. Preparation is the basis for all validation: you need sample data, but also the questions you want to ask and the answers you expect.
1. Data Preparation
Start by gathering realistic document samples—policies, invoices, requirements, CVs, etc. Avoid synthetic examples as much as possible.
A key privacy characteristic of COGNOS is its guarantee that no data is shared with third parties during the validation process.
Example: a company policy document regulating paid leave, remote work, or employee reviews. Real documents like these tend to have dense, technical language.
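One lightweight way to keep such samples organized is a small manifest recording where each document came from and what it covers. The fields below are our own illustration, not a COGNOS format:

```python
# Illustrative manifest for the validation corpus; field names are hypothetical.
# Tracking provenance and topics per document makes it easy to trace a failed
# test back to its source material.
validation_corpus = [
    {
        "path": "docs/company_policy.pdf",
        "kind": "policy",
        "language": "en",
        "topics": ["paid leave", "remote work", "employee reviews"],
        "synthetic": False,  # prefer real documents over synthetic ones
    },
    {
        "path": "docs/invoice_2024_031.pdf",
        "kind": "invoice",
        "language": "en",
        "topics": ["billing"],
        "synthetic": False,
    },
]
```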
2. Question Design
For each document or document set:
- Start with simple questions, then increase complexity.
- Define preconditions (e.g., “Requires company_policy.pdf”).
- Specify the expected context from which the model should draw.
Tips:
- Paraphrase questions to test semantic understanding.
- Include multilingual prompts if the system supports multiple languages.
- Add negative test cases (e.g., questions that should not be answerable).
- Reflect real user concerns and define questions by persona if the system serves multiple roles.
Example:
- Question: How many vacation days am I entitled to?
- Document required: “company_policy.pdf”
- Expected context: “Each employee is entitled to take up to 23 days of vacation”.
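Putting those pieces together, each test case can be captured as a small structured record. The schema below is a sketch we find convenient, not a standard; all field names are hypothetical:

```python
from dataclasses import dataclass, field

# Hypothetical schema for a single validation question; the field names are
# our own convention, not a COGNOS artifact.
@dataclass
class RagTestCase:
    question: str                  # what a user would actually ask
    required_documents: list[str]  # preconditions: files that must be indexed
    expected_context: str          # passage the retriever should surface
    persona: str = "employee"      # role the question is written for
    language: str = "en"
    answerable: bool = True        # False marks a negative test case
    paraphrases: list[str] = field(default_factory=list)

vacation_case = RagTestCase(
    question="How many vacation days am I entitled to?",
    required_documents=["company_policy.pdf"],
    expected_context="Each employee is entitled to take up to 23 days of vacation",
    paraphrases=["What is my annual vacation allowance?"],
)
```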
3. Test Execution
Once your dataset and question set are ready, run your queries. While this step can be automated, the key is obtaining complete input-output logs for validation.
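A sketch of such a run follows. Here `rag_query` is a hypothetical stand-in for whatever entry point your system exposes, and the log format is our own suggestion:

```python
import json
import time

def run_test_suite(test_cases, rag_query, log_path="rag_run.jsonl"):
    """Run every question and persist complete input-output logs.

    `rag_query` is a hypothetical callable: it takes a question string and
    returns (answer, retrieved_contexts). Adapt it to your system's API.
    """
    with open(log_path, "w", encoding="utf-8") as log:
        for case in test_cases:
            answer, contexts = rag_query(case.question)
            log.write(json.dumps({
                "timestamp": time.time(),
                "question": case.question,
                "required_documents": case.required_documents,
                "expected_context": case.expected_context,
                "retrieved_contexts": contexts,
                "answer": answer,
            }) + "\n")
```

One JSON line per query keeps the full input-output pair together, which is exactly what the evaluation criteria below need.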
RAG Evaluation Criteria
Context Relevance
Before judging the output, confirm that the retrieved context is correct. Was the system able to locate the relevant portion of the document?
Tools like COGNOS provide a “source view” so you can inspect what was retrieved before the model generated a response.
Validate that the retrieved content matches the expected section. This helps isolate failures in retrieval from generation errors.
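A crude but useful first pass is a normalized containment check between the expected passage and what was retrieved. This is only a sketch; real pipelines often add fuzzy or embedding-based matching:

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so cosmetic differences don't matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def context_hit(expected_context: str, retrieved_contexts: list[str]) -> bool:
    """True if any retrieved chunk contains the expected passage."""
    expected = normalize(expected_context)
    return any(expected in normalize(chunk) for chunk in retrieved_contexts)

# A retrieval miss here means generation never had a chance, so grade
# retrieval before grading the answer itself.
assert context_hit(
    "Each employee is entitled to take up to 23 days of vacation",
    ["... Each employee is entitled to take up to 23 days of vacation. ..."],
)
```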
Response Accuracy
Once context is verified, evaluate the quality of the generated answer using the following categories:
- Accurate / Inaccurate: Is the answer factually correct?
- Complete / Incomplete: Does it include all critical information?
- Generated / Failed: Did the model generate a coherent answer, or did it default to a fallback or fail silently?
Example:
- Document: “All employees will receive Microsoft 365 accounts except warehouse staff.”
- Question: “Do all employees get Microsoft 365 accounts?”
- Answer: “Yes, they all do.” → Incorrect: it omits the exception for warehouse staff.
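These three dimensions can be recorded per answer during manual review. The structure below is just one way to do it; the enum values mirror the categories above:

```python
from dataclasses import dataclass
from enum import Enum

class Accuracy(Enum):
    ACCURATE = "accurate"
    INACCURATE = "inaccurate"

class Completeness(Enum):
    COMPLETE = "complete"
    INCOMPLETE = "incomplete"

class Generation(Enum):
    GENERATED = "generated"  # coherent answer produced
    FAILED = "failed"        # fallback response or silent failure

@dataclass
class AnswerGrade:
    accuracy: Accuracy
    completeness: Completeness
    generation: Generation
    notes: str = ""

# Grading the Microsoft 365 example above: the answer was generated and
# fluent, but wrong because it omitted the warehouse-staff exception.
grade = AnswerGrade(
    accuracy=Accuracy.INACCURATE,
    completeness=Completeness.INCOMPLETE,
    generation=Generation.GENERATED,
    notes="Omitted the exception for warehouse staff.",
)
```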
Best Practices
- Iterate frequently: RAG systems evolve—so should your validation tests.
- Learn from real user queries: they’ll guide you toward real-world validation cases.
- Monitor performance in production to detect regressions and edge cases.
Conclusion
Validating a RAG system is not trivial. It requires a deep understanding of the entire pipeline—from retrieval and context handling to final response generation. But with a structured validation process, product managers can gain confidence in the reliability of their system’s behavior, even in complex, dynamic environments.