How does CRAG (Corrective RAG) determine when to trigger an external web search?

Retrieval confidence thresholds

RAG Interview Preparation Guide

Introduction

Retrieval-Augmented Generation (RAG) has become a standard pattern for grounding Large Language Models (LLMs) in external, dynamic, and private data sources. By combining the reasoning capabilities of LLMs with information retrieved at query time, RAG reduces hallucinations, works around context window limits, and avoids the cost of repeatedly fine-tuning a model just to keep its knowledge current. Companies rely on RAG to power enterprise search, customer support agents, and automated knowledge discovery systems. Consequently, RAG system design and optimization are among the most frequently tested topics in technical interviews for AI Engineers, Machine Learning Engineers, and AI Architects. Interviewers use these questions to evaluate a candidate's practical understanding of data ingestion pipelines, vector databases, search relevance, latency optimization, and production-grade evaluation frameworks. This guide covers the complete RAG ecosystem (document ingestion, chunking, embedding models, vector indexing, query rewriting, retrieval scoring, reranking, and generation) with architecture diagrams, 50 graded interview questions, and production considerations for latency, cost, and faithfulness evaluation.

Why It Matters

RAG bridges the gap between static model training and dynamic real-world data. From a business perspective, RAG enables organizations to deploy AI systems that securely reference proprietary internal documents without exposing sensitive data to public model training runs. It provides immediate auditability, as every generated answer can be traced back to its specific source document chunk. From an engineering perspective, RAG decouples the knowledge base from the model weights, allowing developers to update, delete, or add information instantly by simply modifying the vector database. As enterprise adoption of generative AI scales, the focus has shifted from simple proof-of-concept RAG pipelines to highly optimized, cost-effective, and low-latency production architectures that can handle millions of documents and queries daily.

RAG's production impact extends beyond hallucination reduction. By externalizing knowledge into a vector database, RAG enables instant knowledge updates, fine-grained document-level access control, and clear citation of every generated claim, which is a compliance requirement in regulated industries. In 2026, the industry has moved beyond naive RAG to advanced patterns including query decomposition, hypothetical document embeddings, self-reflective RAG, and corrective RAG. Candidates who understand these patterns and can reason about retrieval precision, latency, and cost tradeoffs demonstrate the systems thinking that defines senior AI engineering.

Core Concepts

Architecture Overview

A production RAG architecture separates data preparation (ingestion pipeline) from real-time query processing (retrieval and generation pipeline). The ingestion pipeline runs asynchronously, parsing and indexing documents. The online pipeline processes user queries in real-time to generate grounded responses.

Data Flow

1. Raw documents are parsed, chunked, embedded, and stored in the Vector Database.
2. A user submits a query.
3. The query is converted into an embedding.
4. The Vector Database performs a similarity search to find candidate chunks.
5. The Reranker filters and re-orders the candidates.
6. The top chunks and the query are formatted into a prompt.
7. The Generator LLM synthesizes the final response.

[Raw Docs] -> [Parser] -> [Chunker] -> [Embedder] -> [Vector DB]
                                                    ↑
[User Query] -> [Query Encoder] --------------------+ (Similarity Search)
                     ↓
[LLM Response] <- [Generator LLM] <- [Prompt Builder] <- [Reranker] <- [Top Chunks]
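
The data flow above can be sketched end to end in a few lines. This is a toy illustration, not a production design: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector database, each document is treated as a single chunk, and the reranker and generator steps are omitted. All function names are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a term-count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def ingest(docs: list[str]) -> list[tuple[str, Counter]]:
    """Offline path: embed every document and store it next to its text."""
    return [(doc, embed(doc)) for doc in docs]

def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Online path: embed the query and return the k most similar texts."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Format the retrieved chunks and the query into a grounded prompt."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

index = ingest([
    "The reranker reorders candidate chunks by relevance.",
    "Kafka decouples ingestion from the query path.",
    "HNSW is a graph index used for vector search.",
])
question = "what does the reranker do"
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(question, top_chunks)
```

In a real system the `prompt` string would be sent to the Generator LLM, and a reranker would sit between `retrieve` and `build_prompt`.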

Key Components

Tools & Frameworks

Design Patterns

Parent-Child Chunking Workflow Pattern

Documents are split into small child chunks for precise vector retrieval, but when a match is found, the larger parent chunk is passed to the LLM.

Trade-offs: Improves retrieval precision and provides rich context to the LLM, but increases database complexity and storage requirements.
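
A minimal sketch of the parent-child idea, assuming word-based splitting and a toy term-overlap score in place of vector similarity. The function names and the `child_size` parameter are illustrative, not taken from any particular framework.

```python
def build_child_index(parents: dict[str, str], child_size: int = 8) -> list[tuple[str, str]]:
    """Split each parent into child chunks of `child_size` words.

    Returns (child_text, parent_id) pairs so every child points back to its parent.
    """
    children = []
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), child_size):
            children.append((" ".join(words[start:start + child_size]), parent_id))
    return children

def retrieve_parents(query: str, children: list[tuple[str, str]],
                     parents: dict[str, str], k: int = 1) -> list[str]:
    """Match against small children, but return the larger parent texts."""
    q_terms = set(query.lower().split())
    # Toy relevance: number of shared lowercase terms (stand-in for vector similarity).
    ranked = sorted(children,
                    key=lambda c: len(q_terms & set(c[0].lower().split())),
                    reverse=True)
    seen, result = set(), []
    for _, parent_id in ranked:
        if parent_id not in seen:  # several children may share one parent
            seen.add(parent_id)
            result.append(parents[parent_id])
        if len(result) == k:
            break
    return result

parents = {
    "p1": "Refunds are processed within five business days after the returned item arrives at the warehouse",
    "p2": "Shipping is free for orders above fifty dollars in the continental region only",
}
children = build_child_index(parents)
context = retrieve_parents("refunds processed", children, parents)
```

The extra storage cost mentioned in the trade-offs shows up here as the `children` list plus the original `parents` mapping, both of which have to be kept.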

Hybrid Search with RRF Architecture Pattern

Combines traditional keyword search (BM25) with semantic vector search, merging the results using Reciprocal Rank Fusion (RRF).

Trade-offs: Ensures robust retrieval for both exact keyword matches and conceptual queries, but increases query latency and system complexity.
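
Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over every ranked list in which it appears. The sketch below assumes the commonly used constant k = 60 and two hypothetical result lists; the document IDs are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one list.

    Each document receives 1 / (k + rank) from every list that contains it
    (rank starts at 1); documents are returned by descending total score.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

bm25_results = ["d1", "d2", "d3"]   # hypothetical keyword ranking
dense_results = ["d3", "d1", "d4"]  # hypothetical vector ranking
fused = reciprocal_rank_fusion([bm25_results, dense_results])
```

Because RRF uses only ranks, it does not need BM25 scores and cosine similarities to be on the same scale, which is one reason it is a popular default for hybrid search.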

Query Rewriting Workflow Pattern

Uses a lightweight LLM to rephrase, expand, or decompose a user's query into multiple search-friendly queries before retrieval.

Trade-offs: Can substantially improve recall for complex or poorly phrased queries, but adds an extra LLM call, and therefore latency and API cost, to the retrieval step.
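
One way to wire query rewriting into retrieval is to run the original query plus each rewrite and merge the results. In this sketch the `rewrite` and `search` callables are placeholders: in practice `rewrite` would call a lightweight LLM and `search` would hit the vector database. The stubs and document IDs below are invented for illustration.

```python
from typing import Callable

def multi_query_retrieve(query: str,
                         rewrite: Callable[[str], list[str]],
                         search: Callable[[str], list[str]],
                         per_query_k: int = 3) -> list[str]:
    """Search with the original query and every rewrite, then merge.

    Results are de-duplicated and kept in first-seen order.
    """
    merged, seen = [], set()
    for q in [query, *rewrite(query)]:
        for doc_id in search(q)[:per_query_k]:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

# Stubs standing in for an LLM rewriter and a vector search backend.
def fake_rewrite(query: str) -> list[str]:
    return ["reset password steps", "account recovery"]

FAKE_RESULTS = {
    "how do i get back in": ["d9"],
    "reset password steps": ["d1", "d2"],
    "account recovery": ["d2", "d3"],
}

def fake_search(query: str) -> list[str]:
    return FAKE_RESULTS.get(query, [])

doc_ids = multi_query_retrieve("how do i get back in", fake_rewrite, fake_search)
```

Each rewrite adds one more search call, so the number of rewrites is a direct lever on the latency cost described in the trade-offs.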

Self-RAG / Corrective RAG Reliability Pattern

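The system scores the retrieved context for relevance (in Corrective RAG, with a lightweight retrieval evaluator whose confidence scores are compared against thresholds) and self-corrects, fetching web search results if the local database context is insufficient.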

Trade-offs: Minimizes hallucinations and handles out-of-knowledge queries, but introduces significant latency and token overhead.
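
The web-search trigger in Corrective RAG is usually described as a threshold decision over retrieval evaluator scores: if at least one document scores above an upper threshold the retrieved context is used, if every document falls below a lower threshold the system falls back to web search, and anything in between combines both sources. The sketch below assumes scores in [0, 1]; the threshold values 0.6 and 0.2 are placeholders to be tuned, not recommended settings.

```python
from enum import Enum

class Action(Enum):
    CORRECT = "use_retrieved_context"
    INCORRECT = "web_search"
    AMBIGUOUS = "combine_retrieved_and_web"

def choose_action(scores: list[float], upper: float = 0.6, lower: float = 0.2) -> Action:
    """Pick a corrective action from per-document relevance scores.

    `upper` and `lower` are hypothetical thresholds; real systems tune them
    on a validation set for their own evaluator.
    """
    if not scores:
        return Action.INCORRECT   # nothing retrieved: rely on web search
    best = max(scores)
    if best >= upper:
        return Action.CORRECT     # at least one confidently relevant document
    if best < lower:
        return Action.INCORRECT   # every document looks irrelevant
    return Action.AMBIGUOUS       # uncertain: use both sources

decision = choose_action([0.15, 0.05, 0.1])
```

Here `decision` is `Action.INCORRECT`, so the pipeline would discard the local results and query the web instead.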

Common Mistakes

Production Considerations

Reliability	To ensure high reliability, implement fallback mechanisms such as falling back to keyword search if the vector database is unavailable. Use circuit breakers and rate limiters on LLM APIs. Implement self-evaluation steps where a lightweight model checks if the generated answer is supported by the retrieved context before serving it to the user.
Scalability	Scale the ingestion pipeline horizontally using distributed message queues (e.g., Kafka) to handle bulk document processing. For the retrieval path, scale the vector database using sharding (partitioning data by tenant or document type) and read replicas to handle high query volumes without bottlenecking.
Performance	Minimize end-to-end latency by implementing semantic caching to serve frequent queries instantly. Use lightweight embedding models and parallelize the retrieval of vector and keyword search results. Optimize vector database index parameters (e.g., HNSW construction parameters) to balance search speed and accuracy.
Cost	Manage token and API costs by using smaller, open-source embedding models and applying vector quantization to reduce memory footprints. Implement prompt compression techniques to strip redundant words from retrieved chunks, and use semantic caching to avoid invoking expensive LLMs for identical queries.
Security	Enforce strict Document-Level Access Control (DLAC) by attaching user permission metadata to every chunk and applying pre-filtering during queries. Ensure all data is encrypted at rest and in transit, and sanitize user inputs to prevent prompt injection attacks designed to leak system instructions or unauthorized documents.
Monitoring	Track key performance indicators including retrieval latency, LLM generation latency, Time to First Token (TTFT), token usage, and cache hit rates. Monitor retrieval quality metrics such as Hit Rate and Mean Reciprocal Rank (MRR), and log LLM alignment metrics like faithfulness and answer relevance using observability tools.
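
The two retrieval quality metrics named under Monitoring are straightforward to compute from logged results. The sketch below assumes each query has a ranked list of retrieved chunk IDs and a labeled set of relevant IDs; the sample data is invented.

```python
def hit_rate_at_k(results: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant ID in the top k results."""
    hits = sum(1 for ranked, rel in zip(results, relevant) if rel & set(ranked[:k]))
    return hits / len(results)

def mean_reciprocal_rank(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1 / rank of the first relevant result (0 if none is found)."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)

# Three hypothetical queries: first hit at rank 1, at rank 3, and no hit at all.
results = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant = [{"a"}, {"f"}, {"z"}]

hit_rate = hit_rate_at_k(results, relevant, k=2)
mrr = mean_reciprocal_rank(results, relevant)
```

With this data the hit rate at k=2 is 1/3 (only the first query has a relevant chunk in its top two) and the MRR is (1 + 1/3 + 0) / 3, roughly 0.44.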

Key Trade-offs

•Latency vs. Accuracy: Adding a reranker improves answer quality but increases the end-to-end query latency.

•Storage vs. Precision: Smaller chunk sizes with high overlap improve retrieval granularity but increase vector database storage costs.

•Cost vs. Performance: Using proprietary state-of-the-art LLMs yields superior reasoning but incurs significantly higher API costs compared to self-hosted open-source models.

Scaling Strategies

•Implement multi-region read replicas for the vector database to minimize retrieval latency for global users.

•Decouple document ingestion from the query path using an asynchronous, event-driven architecture with Apache Kafka.

•Utilize vector database sharding based on customer tenant ID to ensure strict data isolation and horizontal scalability.
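
As a rough illustration of tenant-based sharding, the sketch below routes each tenant ID to a shard with a stable hash. The function name and shard count are assumptions. Simple modulo routing like this moves most tenants when the shard count changes, so production systems often use consistent hashing or an explicit tenant-to-shard map instead.

```python
import hashlib

def shard_for_tenant(tenant_id: str, num_shards: int) -> int:
    """Map a tenant ID to a shard index in [0, num_shards).

    Uses SHA-256 rather than Python's built-in hash(), which is randomized
    per process and would route the same tenant differently after a restart.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

shard = shard_for_tenant("acme-corp", num_shards=8)
```

When several tenants share a shard, routing alone does not isolate their data; queries still need a tenant filter on top.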

Optimisation Tips

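•Apply scalar or product quantization to compress vector embeddings. Scalar quantization from 32-bit floats to 8-bit integers reduces memory usage by about 4x, usually with a small accuracy loss; product quantization can compress further at a larger accuracy cost.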

•Implement a semantic cache layer using Redis to resolve semantically equivalent queries without calling the LLM. The incoming query still has to be embedded so it can be compared against cached query embeddings.

•Use parent-child chunking to keep vector search highly precise while providing the LLM with rich, continuous context.
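
The memory saving from scalar quantization can be checked with a small worked example. This sketch maps each float to one byte using the vector's own min and max; it is a simplified per-vector scheme for illustration (many vector databases calibrate ranges over the whole collection instead), and it ignores the few bytes needed to store `lo` and `scale`.

```python
from array import array

def quantize(vec: list[float]) -> tuple[array, float, float]:
    """Map floats to 8-bit codes in 0..255 using the vector's min/max range."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for constant vectors
    codes = array("B", (min(255, max(0, round((x - lo) / scale))) for x in vec))
    return codes, lo, scale

def dequantize(codes: array, lo: float, scale: float) -> list[float]:
    """Approximately reconstruct the original floats from the codes."""
    return [lo + c * scale for c in codes]

vec = [0.12, -0.5, 0.33, 0.9, -0.07, 0.0, 0.61, -0.88]
codes, lo, scale = quantize(vec)
restored = dequantize(codes, lo, scale)

float32_bytes = array("f", vec).itemsize * len(vec)  # 4 bytes per dimension
uint8_bytes = codes.itemsize * len(codes)            # 1 byte per dimension
max_error = max(abs(a - b) for a, b in zip(vec, restored))
```

Each dimension shrinks from 4 bytes to 1, and the reconstruction error per dimension is bounded by half of `scale`.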

FAQ

Is RAG important for AI engineering interviews?

Yes, RAG is one of the most frequently tested system design topics in 2026. Companies building generative AI applications expect candidates to know how to design, optimize, and evaluate RAG pipelines.

How often does RAG appear in interviews?

Almost always for roles involving LLMs, enterprise search, or knowledge systems. You should expect both conceptual questions and a full system design scenario centered around RAG.

What tools should I learn first?

Focus on LangChain or LlamaIndex for orchestration, and Pinecone, Qdrant, or pgvector for vector databases. Understanding the underlying concepts is more important than memorizing specific APIs.

What is the difference between RAG and fine-tuning?

RAG provides external knowledge dynamically at query time without changing the model weights. Fine-tuning adapts the model's style, tone, or task-specific behavior by updating its weights, but is costly and hard to update dynamically.

How do I choose the right chunk size?

It is a tradeoff: smaller chunks (e.g., 100-200 tokens) are more precise but lack context; larger chunks (e.g., 500-1000 tokens) have more context but introduce noise and cost more tokens. Use parent-child chunking to get the best of both.
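
Chunk size and overlap are easiest to reason about with a sliding-window splitter. The sketch below works on a list of tokens; whitespace-separated words stand in for real model tokens, and the `size` and `overlap` values are arbitrary examples.

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # this window already reached the end
            break
    return chunks

tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, size=4, overlap=1)
```

With 10 tokens, a size of 4, and an overlap of 1, this produces three chunks starting at positions 0, 3, and 6. Raising the overlap increases the number of chunks, which is the storage cost of finer retrieval granularity.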

What is the 'Lost in the Middle' phenomenon?

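LLMs tend to pay more attention to the beginning and end of long prompts, often ignoring or omitting information located in the middle of a large context window. Reranking mitigates this by letting you place the most relevant chunks at the start (and, in some setups, the end) of the context instead of the middle, and by letting you pass fewer chunks overall.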
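
One common mitigation is to reorder reranked chunks so the strongest ones sit at the edges of the prompt and the weakest in the middle. The function below is a minimal sketch of that idea, assuming the input is already sorted from most to least relevant; the name is illustrative.

```python
def reorder_for_long_context(chunks_by_relevance: list[str]) -> list[str]:
    """Place the most relevant chunks at the start and end, the least relevant in the middle.

    Input must be sorted from most to least relevant.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: ranks 1, 3, 5, ... fill the front; ranks 2, 4, ... fill the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ordered = reorder_for_long_context(["c1", "c2", "c3", "c4", "c5"])
```

For five chunks ranked c1 (best) to c5 (worst), the result is c1, c3, c5, c4, c2: the two best chunks occupy the first and last positions.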

How do I evaluate a RAG system?

Use automated evaluation frameworks like Ragas or TruLens. Focus on three core metrics: Faithfulness (is the answer supported by context?), Answer Relevance (does the answer address the query?), and Context Recall (did we retrieve the right chunks?).

What is Hybrid Search and why use it?

Hybrid Search combines keyword search (BM25) and vector search (dense embeddings). It is crucial because keyword search excels at exact matches (like product IDs or names), while vector search excels at conceptual and semantic queries.

How do I handle document security in RAG?

Implement Document-Level Access Control (DLAC). Attach metadata tags containing user permissions to every chunk in the vector database, and apply a pre-filter during the query step to ensure users only retrieve chunks they are authorized to see.
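
The pre-filtering step can be sketched as follows, assuming each chunk carries a set of groups allowed to read it and a precomputed similarity score. The `Chunk` fields and sample data are hypothetical; in a real vector database the permission filter is normally pushed into the index query itself rather than applied in application code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_groups: frozenset  # permission metadata attached at ingestion time
    score: float               # similarity to the current query (assumed precomputed)

def authorized_search(chunks: list[Chunk], user_groups: set[str], k: int = 3) -> list[Chunk]:
    """Drop chunks the user may not see BEFORE ranking, then return the top k."""
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:k]

chunks = [
    Chunk("Salary bands for 2026", frozenset({"hr"}), 0.95),
    Chunk("Deploy runbook", frozenset({"eng"}), 0.70),
    Chunk("Holiday calendar", frozenset({"hr", "eng"}), 0.40),
]
results = authorized_search(chunks, user_groups={"eng"})
```

The highest-scoring chunk is excluded for an engineering user because filtering happens before ranking. Filtering after ranking instead could return fewer than k results and would risk unauthorized text reaching later pipeline stages.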

How do I demonstrate RAG knowledge in an interview?

Discuss real-world production trade-offs. Don't just describe a basic pipeline; talk about chunking strategies, metadata filtering, hybrid search, reranking, latency optimization, and how you would quantitatively evaluate the system.