Why RAG Is the Architecture Pattern of 2026
Large language models know a lot, but they don't know your company's policies, product catalog, internal procedures, or yesterday's board decision. They hallucinate confidently when they lack information. They can't cite their sources. And their knowledge has a training cutoff that makes them increasingly stale. Retrieval-Augmented Generation addresses all of these problems: the model retrieves relevant documents from your knowledge base before generating a response, grounding the output in factual, current, company-specific information with traceable sources.
RAG has become the default architecture for enterprise AI applications because it provides domain specificity without fine-tuning (cheaper, faster, updatable), factual grounding with source citation (auditable, trustworthy), current information (update the knowledge base, not the model), and data security (your documents stay in your infrastructure; only retrieved excerpts go to the LLM). Industry surveys consistently report that a large majority of enterprise generative AI applications in production use RAG as their primary architecture pattern.
RAG Architecture: The 5-Stage Pipeline
| Stage | What Happens | Key Decision | Impact of Getting It Wrong |
|---|---|---|---|
| 1. Ingestion | Documents parsed, cleaned, chunked | Chunk size, overlap, metadata extraction | Too large: irrelevant context dilutes retrieval. Too small: loses coherence |
| 2. Embedding | Text chunks converted to vectors | Embedding model selection | Poor embeddings = poor retrieval = poor answers |
| 3. Indexing | Vectors stored in searchable index | Vector database, index configuration | Slow search = slow response. Wrong config = missed relevant docs |
| 4. Retrieval | Query matched to relevant chunks | Search strategy (vector, keyword, hybrid), top-k | Poor retrieval = hallucination (model generates without facts) |
| 5. Generation | LLM produces response from retrieved context | Prompt design, context window, model selection | Poor prompt = model ignores context. Too much context = noise |
Stage 1: Chunking — The Most Underestimated Decision
Chunking determines the granularity of retrieval — the size of text blocks that the system can retrieve and present to the LLM. This single decision affects retrieval accuracy, generation quality, and cost more than any other architectural choice.
Chunking Strategies
Fixed-size chunking (300-500 tokens): Split documents into equal-sized blocks with overlap (50-100 token overlap between consecutive chunks). Simple, predictable, and sufficient for homogeneous documents (articles, reports). Limitation: splits can break mid-paragraph or mid-concept, creating chunks that lack coherent meaning.
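A fixed-size chunker can be sketched in a few lines. This sketch uses whitespace tokens as a rough stand-in for model tokens; a production pipeline would count tokens with the embedding model's tokenizer.

```python
def chunk_fixed(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    """Split text into fixed-size chunks with overlap.

    Whitespace tokens approximate model tokens here; swap in the
    embedding model's tokenizer for accurate sizing.
    """
    tokens = text.split()
    chunks = []
    step = size - overlap  # each chunk starts `overlap` tokens before the previous one ends
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):  # last chunk reached the end
            break
    return chunks
```

With the defaults, a 1,000-token document yields three chunks, each sharing 80 tokens with its neighbor, so no concept falls entirely on a split boundary.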
Semantic chunking: Split at natural boundaries — paragraphs, sections, topic shifts. Preserves meaning within each chunk. Better retrieval accuracy because each chunk represents a complete thought. More complex to implement — requires boundary detection (by heading structure, paragraph breaks, or embedding similarity between sentences).
Document-structure-aware chunking: Uses document structure (headings, sections, tables, lists) to create chunks that respect the document's organization. A policy document gets chunked by section — each policy clause is a separate chunk with its section heading as metadata. Tables are chunked separately from prose. This produces the highest-quality chunks but requires document parsing that understands structure.
Parent-child chunking: Small chunks for retrieval (high precision), larger parent chunks for context (coherence). The system retrieves using small chunks (2-3 sentences) but passes the surrounding parent chunk (full section) to the LLM. This combines retrieval precision with generation context.
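The parent-child lookup amounts to mapping each retrieved child back to its parent section and de-duplicating, so the LLM sees each section once. A minimal sketch, with a hypothetical Chunk record carrying a parent id:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str       # small chunk used for retrieval (2-3 sentences)
    parent_id: str  # id of the full section this chunk came from

def expand_to_parents(hits: list[Chunk], parents: dict[str, str]) -> list[str]:
    """Replace retrieved child chunks with their parent sections,
    de-duplicating sections that multiple children point to."""
    seen: set[str] = set()
    context: list[str] = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            context.append(parents[hit.parent_id])
    return context
```

Retrieval precision comes from matching the small chunks; generation coherence comes from passing the expanded parent sections to the LLM.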
Chunk size is not a parameter you set once — it's a parameter you experiment with. Test 3-4 chunk sizes (200, 400, 600, 1000 tokens) on your specific documents and queries. Measure retrieval accuracy (% of queries where the relevant chunk is in the top-5 results). The optimal size varies by document type, query type, and embedding model — there's no universal correct answer.
Stage 2: Embedding Models — Matching Meaning to Mathematics
Embedding models convert text into numerical vectors that capture semantic meaning. Two sentences with similar meaning produce vectors that are close together in the embedding space. The embedding model's quality directly determines retrieval quality — a poor embedding model that places "refund policy" far from "return procedure" will miss relevant documents.
| Embedding Model | Dimensions | Best For | Cost |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Best overall quality, multi-language | $0.13/million tokens |
| OpenAI text-embedding-3-small | 1536 | Good quality at lower cost, most use cases | $0.02/million tokens |
| Cohere embed-v3 | 1024 | Multi-language, search-optimized | $0.10/million tokens |
| BGE-large-en (open-source) | 1024 | No data leaves premises, no API cost | Compute only |
| E5-large-v2 (open-source) | 1024 | Strong retrieval performance, self-hosted | Compute only |
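Whichever model you pick, retrieval ultimately reduces to vector similarity, most commonly cosine similarity. A toy illustration with made-up 4-dimensional vectors (real embeddings have 1024 to 3072 dimensions):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 for identical direction, ~0 for unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" chosen by hand for illustration only
refund  = np.array([0.9, 0.1, 0.2, 0.0])
returns = np.array([0.8, 0.2, 0.3, 0.1])  # semantically close to `refund`
weather = np.array([0.0, 0.9, 0.0, 0.8])  # unrelated topic

assert cosine_sim(refund, returns) > cosine_sim(refund, weather)
```

A good embedding model is precisely one that makes this inequality hold for paraphrases like "refund policy" and "return procedure" across your actual documents.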
Stage 4: Retrieval Strategy — Vector, Keyword, or Hybrid
Pure vector search matches by semantic meaning — "annual revenue" retrieves documents about "yearly sales." Pure keyword search matches by exact terms — "AML compliance" retrieves documents containing those exact words. Hybrid search combines both — and consistently outperforms either alone.
Hybrid search (recommended): Run both vector similarity and keyword (BM25) search. Merge results using Reciprocal Rank Fusion (RRF) — a document ranked high by both methods gets the highest merged score. Azure AI Search implements hybrid search natively with a single query. This handles both: queries where meaning matters ("how do I handle a customer complaint?") and queries where exact terminology matters ("what's the SOX 404 compliance procedure?").
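Reciprocal Rank Fusion itself is simple: each document earns 1/(k + rank) from each result list it appears in, and the scores are summed. A minimal sketch (k = 60 is the constant commonly used in the RRF literature):

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge multiple ranked result lists with Reciprocal Rank Fusion.

    score(doc) = sum over lists of 1 / (k + rank_in_that_list)
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["doc_a", "doc_b", "doc_c"]   # ranked by vector similarity
keyword_hits = ["doc_b", "doc_d", "doc_a"]   # ranked by BM25
merged = rrf_merge([vector_hits, keyword_hits])
# doc_b wins: it places near the top of both lists
```

Note the fusion uses only ranks, never raw scores, which is why it merges vector similarities and BM25 scores without any normalization step.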
Re-ranking: The Accuracy Multiplier
Initial retrieval returns 20-50 candidate chunks. A re-ranking model (cross-encoder) evaluates each candidate's relevance to the specific query and reorders them. The top 3-5 re-ranked results are passed to the LLM. Re-ranking typically improves retrieval accuracy by 10-15% because cross-encoders jointly evaluate the query-document pair (more accurate than independent embedding similarity). Azure AI Search Semantic Ranker and Cohere Rerank provide managed re-ranking.
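Structurally, re-ranking is a score-and-sort over the candidate pool. In the sketch below a trivial term-overlap scorer stands in for a real cross-encoder (such as Cohere Rerank or Azure AI Search Semantic Ranker), purely to show the shape of the step:

```python
def rerank(query: str, candidates: list[str], score_fn, top_n: int = 5) -> list[str]:
    """Score every candidate against the query and keep the top_n.

    score_fn(query, doc) -> float is a placeholder; in production it
    would be a cross-encoder model or a managed re-ranking service.
    """
    ordered = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ordered[:top_n]

def overlap_score(query: str, doc: str) -> float:
    """Toy scorer: fraction of query terms present in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)
```

The key property a real cross-encoder adds over this toy scorer is joint evaluation: it reads the query and the document together, which is what buys the 10-15% accuracy gain over independent embedding similarity.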
Stage 5: Generation — Prompt Architecture for Grounded Responses
The prompt structure determines whether the LLM actually uses the retrieved context or ignores it in favor of its training knowledge. The prompt must: instruct the model to answer based only on the provided context, include the retrieved chunks in a clearly delineated section, specify the response format (citations, structure, length), and handle the "I don't know" case (when the context doesn't contain the answer, the model should say so instead of hallucinating).
Prompt Template Pattern
System instruction: "Answer the user's question based only on the provided context. If the context doesn't contain enough information to answer, say 'I don't have information about that in my knowledge base.' Cite the source document for each claim." → Retrieved context (clearly labeled: "Context: [document chunks with source metadata]") → User query. This pattern grounds the model in retrieved facts, provides attribution, and prevents hallucination by explicitly instructing the model to acknowledge knowledge gaps.
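The pattern above can be assembled programmatically. A sketch using the common OpenAI-style chat message format; the dict layout and field names are illustrative and depend on your provider:

```python
def build_rag_prompt(query: str, chunks: list[dict]) -> list[dict]:
    """Assemble a grounded chat prompt from retrieved chunks.

    Each chunk is a dict with 'source' and 'text' keys (assumed shape).
    """
    system = (
        "Answer the user's question based only on the provided context. "
        "If the context doesn't contain enough information to answer, say "
        "'I don't have information about that in my knowledge base.' "
        "Cite the source document for each claim."
    )
    # Clearly delineated context section with per-chunk source metadata
    context = "\n\n".join(f"[Source: {c['source']}]\n{c['text']}" for c in chunks)
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
```

Keeping the source label adjacent to each chunk is what makes per-claim citation possible in the model's answer.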
Advanced RAG Patterns
Query transformation: Rewrite the user's query before retrieval to improve search results. "What's our return policy for electronics purchased more than 30 days ago?" → decompose into "return policy" + "electronics" + "30-day window" and search for each. HyDE (Hypothetical Document Embedding) generates a hypothetical answer and uses it as the search query — often retrieving more relevant documents than the original question.
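HyDE reduces to a three-step pipeline: generate a hypothetical answer, embed it, search with that embedding. In this sketch, generate, embed, and search are placeholders for your LLM call, embedding model, and vector index respectively:

```python
def retrieve_with_hyde(question: str, generate, embed, search, top_k: int = 5) -> list[str]:
    """HyDE retrieval sketch.

    generate(prompt) -> str       : any LLM completion call
    embed(text)      -> vector    : the embedding model
    search(vec, k)   -> list[str] : the vector index lookup
    All three are injected dependencies, not real library APIs.
    """
    # The hypothetical passage resembles stored answers more closely
    # than the question does, so its embedding lands nearer to them.
    hypothetical = generate(f"Write a brief passage answering: {question}")
    return search(embed(hypothetical), top_k)
```

The intuition: questions and answers live in different regions of embedding space, and searching with a synthetic answer moves the query into the answer region where the documents actually are.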
Multi-step retrieval: First retrieval identifies relevant topic areas. Second retrieval dives deeper into the specific topic. Useful for complex queries that span multiple documents or concepts — the first step narrows the search space, the second step finds the specific answer.
Agentic RAG: An AI agent decides dynamically: does this query need retrieval? Which knowledge bases to search? Is the retrieved context sufficient, or should the agent search again with a different query? Agentic RAG adapts the retrieval strategy per query rather than applying the same pipeline to every question.
Scaling RAG: From Thousands to Millions of Documents
RAG systems that work with 10,000 documents can fail at 1 million. Four scaling challenges dominate:

Search latency: vector search latency increases with index size. Mitigation: approximate nearest neighbor algorithms (HNSW, IVF) trade a small accuracy loss for roughly 100x speed.

Ingestion throughput: embedding large corpora becomes a bottleneck. Mitigation: batch embedding with parallel workers.

Context window management: at scale, top-k retrieval returns more diverse results that may exceed the LLM's context window. Mitigation: re-ranking to select only the most relevant 3-5 chunks.

Storage cost: 1 million document chunks × 1536-dimensional embeddings × 4 bytes/float ≈ 6GB of vector storage. Manageable, but plan for growth.
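The storage figure is simple arithmetic:

```python
# Raw vector storage estimate from the figures above (float32 vectors)
chunks = 1_000_000
dims = 1536
bytes_per_float = 4

total_gb = chunks * dims * bytes_per_float / 1e9
print(f"{total_gb:.1f} GB")  # 6.1 GB of raw vectors
```

Note this is the raw vector payload only; ANN index structures such as HNSW graphs add memory overhead on top, so budget accordingly.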
Evaluation: Measuring RAG Quality
RAG quality has two measurable dimensions: retrieval quality (did the system find the right documents?) and generation quality (did the model produce a correct answer from the retrieved documents?).

Retrieval metrics: recall@k (what percentage of relevant documents are in the top-k results?) and MRR (Mean Reciprocal Rank: how high is the first relevant result ranked?).

Generation metrics: groundedness (is the answer supported by retrieved context?), relevance (does the answer address the question?), and completeness (does the answer cover all aspects of the question?).

Build an evaluation dataset of 100-200 question-answer pairs with known correct answers and source documents. Run evaluation after every pipeline change: chunking, embedding model, retrieval strategy, or prompt adjustment. The evaluation score determines whether the change improved or degraded quality.
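Both retrieval metrics are straightforward to compute from ranked result lists and the known relevant documents per query. A minimal sketch:

```python
def recall_at_k(results: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant doc in the top-k."""
    hits = sum(1 for res, rel in zip(results, relevant) if rel & set(res[:k]))
    return hits / len(results)

def mrr(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit
    per query (0 if no relevant doc is retrieved)."""
    total = 0.0
    for res, rel in zip(results, relevant):
        for rank, doc in enumerate(res, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(results)
```

Running these over a fixed evaluation set before and after each pipeline change gives the single number that tells you whether the change helped.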
Multi-Index RAG: Different Knowledge Bases for Different Query Types
Enterprise RAG often requires multiple vector indexes — one for product documentation, one for policy documents, one for technical specifications, one for customer interaction history. Query routing determines which index to search based on the query type: product questions search the product index, policy questions search the policy index, technical questions search the specifications index. This routing can be rule-based (keyword detection), classifier-based (fine-tuned intent classifier), or LLM-based (the LLM determines which knowledge base is relevant). Multi-index architecture improves retrieval accuracy by 15-25% versus a single combined index because each index is optimized for its specific document type and query pattern.
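A rule-based router can be as simple as keyword matching per index. The keyword lists below are purely illustrative; a classifier- or LLM-based router replaces this function when rules stop being accurate enough:

```python
# Hypothetical index names and keyword lists for illustration
ROUTES = {
    "product":   ["price", "feature", "model", "catalog"],
    "policy":    ["policy", "refund", "compliance", "procedure"],
    "technical": ["api", "error", "install", "configuration"],
}

def route_query(query: str, default: str = "product") -> str:
    """Pick the index whose keywords best match the query;
    fall back to `default` when nothing matches."""
    q = query.lower()
    scores = {
        index: sum(kw in q for kw in keywords)
        for index, keywords in ROUTES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

The same routing interface stays in place as you upgrade the implementation from rules to a fine-tuned intent classifier to an LLM call, so the rest of the pipeline is unaffected.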
The Xylity Approach
We implement RAG with the 5-stage pipeline — document-aware chunking, tested embedding models, hybrid search with re-ranking, grounded prompt architecture, and continuous retrieval quality evaluation. Our RAG architects and LLM engineers build the pipeline alongside your team, tuning each stage for your specific documents, queries, and accuracy requirements.
Go Deeper
Continue building your understanding with these related resources from our consulting practice.
RAG That Gets the Answer Right
Five stages — chunking, embedding, indexing, retrieval, generation. RAG architecture tuned for your documents, your queries, and your accuracy standard.
Start Your RAG Architecture Engagement →