Why RAG Is the Architecture Pattern of 2026

Large language models know a lot — but they don't know your company's policies, product catalog, internal procedures, or yesterday's board decision. They hallucinate confidently when they lack information. They can't cite their sources. And their knowledge has a training cutoff that makes them increasingly stale. Retrieval-Augmented Generation solves all of these problems: the model retrieves relevant documents from your knowledge base before generating a response — grounding the output in factual, current, company-specific information with traceable sources.

RAG has become the default architecture for enterprise AI applications because it provides domain specificity without fine-tuning (cheaper, faster, updatable), factual grounding with source citation (auditable, trustworthy), current information (update the knowledge base, not the model), and data security (your documents stay in your infrastructure — only retrieved excerpts go to the LLM). Over 70% of enterprise generative AI applications in production use RAG as the primary architecture pattern.

Fine-tuning teaches the model how to respond. RAG teaches the model what to respond about. For enterprise applications that need domain-specific factual answers, RAG is almost always the right starting point. — Xylity AI Engineering Practice

RAG Architecture: The 5-Stage Pipeline

| Stage | What Happens | Key Decision | Impact of Getting It Wrong |
|---|---|---|---|
| 1. Ingestion | Documents parsed, cleaned, chunked | Chunk size, overlap, metadata extraction | Too large: irrelevant context dilutes retrieval. Too small: loses coherence |
| 2. Embedding | Text chunks converted to vectors | Embedding model selection | Poor embeddings = poor retrieval = poor answers |
| 3. Indexing | Vectors stored in searchable index | Vector database, index configuration | Slow search = slow response. Wrong config = missed relevant docs |
| 4. Retrieval | Query matched to relevant chunks | Search strategy (vector, keyword, hybrid), top-k | Poor retrieval = hallucination (model generates without facts) |
| 5. Generation | LLM produces response from retrieved context | Prompt design, context window, model selection | Poor prompt = model ignores context. Too much context = noise |

Stage 1: Chunking — The Most Underestimated Decision

Chunking determines the granularity of retrieval — the size of text blocks that the system can retrieve and present to the LLM. This single decision affects retrieval accuracy, generation quality, and cost more than any other architectural choice.

Chunking Strategies

Fixed-size chunking (300-500 tokens): Split documents into equal-sized blocks with overlap (50-100 token overlap between consecutive chunks). Simple, predictable, and sufficient for homogeneous documents (articles, reports). Limitation: splits can break mid-paragraph or mid-concept, creating chunks that lack coherent meaning.
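
As an illustration, a minimal token-based fixed-size chunker with overlap might look like the sketch below. It assumes tiktoken for tokenization; any tokenizer with encode/decode methods works, and the default sizes are the ranges mentioned above, not prescriptions.

```python
# Minimal fixed-size chunker with overlap (illustrative sketch).
import tiktoken

def fixed_size_chunks(text: str, chunk_size: int = 400, overlap: int = 75) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap  # each chunk starts `step` tokens after the last
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break  # last window reached the end of the document
    return chunks
```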

Semantic chunking: Split at natural boundaries — paragraphs, sections, topic shifts. Preserves meaning within each chunk. Better retrieval accuracy because each chunk represents a complete thought. More complex to implement — requires boundary detection (by heading structure, paragraph breaks, or embedding similarity between sentences).

Document-structure-aware chunking: Uses document structure (headings, sections, tables, lists) to create chunks that respect the document's organization. A policy document gets chunked by section — each policy clause is a separate chunk with its section heading as metadata. Tables are chunked separately from prose. This produces the highest-quality chunks but requires document parsing that understands structure.

Parent-child chunking: Small chunks for retrieval (high precision), larger parent chunks for context (coherence). The system retrieves using small chunks (2-3 sentences) but passes the surrounding parent chunk (full section) to the LLM. This combines retrieval precision with generation context.
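
One way to wire up parent-child retrieval: search over the small chunks, then swap in each hit's enclosing parent section before generation. In this sketch, `vector_search` is a placeholder for your vector store's query call, and the mapping layout is an assumption.

```python
# Parent-child retrieval sketch: search small chunks, return parent sections.
# child_to_parent maps each small chunk's ID to its enclosing section's text;
# it would be populated during ingestion.
child_to_parent: dict[str, str] = {}

def retrieve_with_parents(query: str, k: int = 5) -> list[str]:
    child_ids = vector_search(query, top_k=k)   # precise match on small chunks
    parents = []
    for cid in child_ids:
        parent = child_to_parent[cid]
        if parent not in parents:               # de-duplicate shared parents
            parents.append(parent)
    return parents                              # coherent context for the LLM
```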

The Chunking Experiment

Chunk size is not a parameter you set once — it's a parameter you experiment with. Test 3-4 chunk sizes (200, 400, 600, 1000 tokens) on your specific documents and queries. Measure retrieval accuracy (% of queries where the relevant chunk is in the top-5 results). The optimal size varies by document type, query type, and embedding model — there's no universal correct answer.
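
The sweep itself can be a short loop. In the sketch below, `build_index` and `retrieve` stand in for your own ingestion and search code, and the eval set labels each query with the gold source passage rather than a chunk ID, since chunk IDs don't survive re-chunking.

```python
# Chunk-size sweep sketch: measure top-5 retrieval accuracy per chunk size.
def sweep_chunk_sizes(documents, eval_set, sizes=(200, 400, 600, 1000)):
    # eval_set: list of (query, gold_passage) pairs, where gold_passage is
    # the span of source text that answers the query.
    results = {}
    for size in sizes:
        index = build_index(documents, chunk_size=size)
        hits = sum(
            1 for query, gold in eval_set
            if any(gold in chunk or chunk in gold
                   for chunk in retrieve(index, query, top_k=5))
        )
        results[size] = hits / len(eval_set)  # fraction answered in top 5
    return results  # maps chunk size -> retrieval accuracy
```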

Stage 2: Embedding Models — Matching Meaning to Mathematics

Embedding models convert text into numerical vectors that capture semantic meaning. Two sentences with similar meaning produce vectors that are close together in the embedding space. The embedding model's quality directly determines retrieval quality — a poor embedding model that places "refund policy" far from "return procedure" will miss relevant documents.

| Embedding Model | Dimensions | Best For | Cost |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Best overall quality, multi-language | $0.13/million tokens |
| OpenAI text-embedding-3-small | 1536 | Good quality at lower cost, most use cases | $0.02/million tokens |
| Cohere embed-v3 | 1024 | Multi-language, search-optimized | $0.10/million tokens |
| BGE-large-en (open-source) | 1024 | No data leaves premises, no API cost | Compute only |
| E5-large-v2 (open-source) | 1024 | Strong retrieval performance, self-hosted | Compute only |
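
To make "close together in the embedding space" concrete, the sketch below embeds two phrases and computes their cosine similarity, using the OpenAI Python client for one of the models in the table; any embedding API with the same shape works.

```python
# Cosine similarity between two embedded phrases (illustrative sketch).
# Requires OPENAI_API_KEY in the environment.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

a = embed("refund policy")
b = embed("return procedure")
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {similarity:.3f}")  # higher = closer in meaning
```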

Stage 4: Retrieval Strategy — Vector, Keyword, or Hybrid

Pure vector search matches by semantic meaning — "annual revenue" retrieves documents about "yearly sales." Pure keyword search matches by exact terms — "AML compliance" retrieves documents containing those exact words. Hybrid search combines both — and consistently outperforms either alone.

Hybrid search (recommended): Run both vector similarity and keyword (BM25) search. Merge results using Reciprocal Rank Fusion (RRF) — a document ranked high by both methods gets the highest merged score. Azure AI Search implements hybrid search natively with a single query. This handles both kinds of queries: those where meaning matters ("how do I handle a customer complaint?") and those where exact terminology matters ("what's the SOX 404 compliance procedure?").
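
RRF itself is only a few lines. The sketch below merges two ranked lists of document IDs; the k=60 constant is the default from the original RRF paper, not something you typically need to tune.

```python
# Reciprocal Rank Fusion sketch: merge vector and keyword (BM25) result lists.
# Each input is a list of document IDs ordered best-first.
def rrf_merge(vector_ids: list[str], keyword_ids: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (vector_ids, keyword_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents ranked high by both methods accumulate the highest scores.
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge(["d3", "d1", "d7"], ["d1", "d3", "d9"])
# d1 and d3 rise to the top because both methods rank them highly.
```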

Re-ranking: The Accuracy Multiplier

Initial retrieval returns 20-50 candidate chunks. A re-ranking model (cross-encoder) evaluates each candidate's relevance to the specific query and reorders them. The top 3-5 re-ranked results are passed to the LLM. Re-ranking typically improves retrieval accuracy by 10-15% because cross-encoders jointly evaluate the query-document pair (more accurate than independent embedding similarity). Azure AI Search Semantic Ranker and Cohere Rerank provide managed re-ranking.
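
A self-hosted version of the same idea, using a cross-encoder from the sentence-transformers library; the model name shown is one common public checkpoint, not a recommendation.

```python
# Cross-encoder re-ranking sketch using sentence-transformers.
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # The cross-encoder scores each (query, chunk) pair jointly, which is
    # more accurate than comparing independently computed embeddings.
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```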

Stage 5: Generation — Prompt Architecture for Grounded Responses

The prompt structure determines whether the LLM actually uses the retrieved context or ignores it in favor of its training knowledge. The prompt must: instruct the model to answer based only on the provided context, include the retrieved chunks in a clearly delineated section, specify the response format (citations, structure, length), and handle the "I don't know" case (when the context doesn't contain the answer, the model should say so instead of hallucinating).

Prompt Template Pattern

System instruction: "Answer the user's question based only on the provided context. If the context doesn't contain enough information to answer, say 'I don't have information about that in my knowledge base.' Cite the source document for each claim." → Retrieved context (clearly labeled: "Context: [document chunks with source metadata]") → User query. This pattern grounds the model in retrieved facts, provides attribution, and prevents hallucination by explicitly instructing the model to acknowledge knowledge gaps.
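
Assembled as code, the pattern might look like the sketch below. The chunk metadata fields ("text", "source") are assumptions; adapt them to whatever your ingestion stage actually stores.

```python
# Grounded prompt assembly sketch. Each chunk dict is assumed to carry
# "text" and "source" fields populated during ingestion.
SYSTEM_INSTRUCTION = (
    "Answer the user's question based only on the provided context. "
    "If the context doesn't contain enough information to answer, say "
    "'I don't have information about that in my knowledge base.' "
    "Cite the source document for each claim."
)

def build_prompt(query: str, chunks: list[dict]) -> list[dict]:
    context = "\n\n".join(
        f"[Source: {c['source']}]\n{c['text']}" for c in chunks
    )
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTION},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
```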

Advanced RAG Patterns

Query transformation: Rewrite the user's query before retrieval to improve search results. "What's our return policy for electronics purchased more than 30 days ago?" → decompose into "return policy" + "electronics" + "30-day window" and search for each. HyDE (Hypothetical Document Embedding) generates a hypothetical answer and uses it as the search query — often retrieving more relevant documents than the original question.
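
HyDE in miniature: ask the LLM for a plausible answer, then search with that answer instead of the question. In this sketch, `llm` and `vector_search` are placeholders for your own generation and retrieval calls.

```python
# HyDE (Hypothetical Document Embedding) sketch.
# llm() and vector_search() are placeholders for your own calls.
def hyde_retrieve(question: str, k: int = 5) -> list[str]:
    # 1. Generate a hypothetical answer; it may be wrong in its details,
    #    but it is phrased like the documents we want to find.
    hypothetical = llm(
        f"Write a short passage that answers this question: {question}"
    )
    # 2. Search with the hypothetical answer, not the question. Answer-like
    #    text often lands closer to real answer passages in embedding space.
    return vector_search(hypothetical, top_k=k)
```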

Multi-step retrieval: First retrieval identifies relevant topic areas. Second retrieval dives deeper into the specific topic. Useful for complex queries that span multiple documents or concepts — the first step narrows the search space, the second step finds the specific answer.

Agentic RAG: An AI agent decides dynamically: does this query need retrieval? Which knowledge bases to search? Is the retrieved context sufficient, or should the agent search again with a different query? Agentic RAG adapts the retrieval strategy per query rather than applying the same pipeline to every question.
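
A stripped-down version of that control flow might look like the sketch below; every function called here is a placeholder standing in for your own components.

```python
# Agentic RAG control-flow sketch; all called functions are placeholders.
def agentic_answer(query: str, max_attempts: int = 3) -> str:
    if not needs_retrieval(query):          # e.g. "summarize our chat so far"
        return llm(query)
    index = choose_index(query)             # route to the relevant knowledge base
    search_query = query
    for _ in range(max_attempts):
        chunks = vector_search(index, search_query, top_k=5)
        if context_is_sufficient(query, chunks):
            return llm(build_prompt(query, chunks))
        # Not enough signal: let the agent reformulate and search again.
        search_query = llm(f"Rewrite this search query to find better results: {query}")
    return "I don't have information about that in my knowledge base."
```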

Scaling RAG: From Thousands to Millions of Documents

RAG systems that work with 10,000 documents can fail at 1 million. Four scaling challenges dominate. Vector search latency grows with index size; approximate nearest neighbor algorithms (HNSW, IVF) trade a small accuracy loss for roughly 100x speed. Embedding ingestion throughput becomes a bottleneck; batch embedding with parallel workers keeps it tractable. Context window management gets harder: at scale, top-k retrieval returns more diverse results that may exceed the LLM's context window, so re-ranking is needed to select only the most relevant 3-5 chunks. And storage cost grows: 1 million document chunks × 1536-dimensional embeddings × 4 bytes per float ≈ 6GB of vector storage, manageable but requiring planning for growth.
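
For the ANN mitigation, a minimal HNSW index with the hnswlib library might look like this; the parameter values are common starting points, not tuned recommendations, and random vectors stand in for real embeddings.

```python
# Approximate nearest neighbor search sketch with hnswlib (HNSW index).
# pip install hnswlib numpy
import hnswlib
import numpy as np

dim, num_chunks = 1536, 100_000  # sized down for a sketch; the math scales
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_chunks, ef_construction=200, M=16)

embeddings = np.float32(np.random.rand(num_chunks, dim))  # stand-in vectors
index.add_items(embeddings, np.arange(num_chunks))

index.set_ef(50)  # search-time accuracy/speed knob; higher = more accurate
query = np.float32(np.random.rand(dim))
labels, distances = index.knn_query(query, k=5)  # milliseconds, not seconds
```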

Evaluation: Measuring RAG Quality

RAG quality has two measurable dimensions: retrieval quality (did the system find the right documents?) and generation quality (did the model produce a correct answer from the retrieved documents?). Retrieval metrics: recall@k (what percentage of relevant documents are in the top-k results?), MRR (Mean Reciprocal Rank — how high is the first relevant result ranked?). Generation metrics: groundedness (is the answer supported by retrieved context?), relevance (does the answer address the question?), completeness (does the answer cover all aspects of the question?). Build an evaluation dataset of 100-200 question-answer pairs with known correct answers and source documents. Run evaluation after every pipeline change — chunking, embedding model, retrieval strategy, or prompt adjustment. The evaluation score determines whether the change improved or degraded quality.
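
Both retrieval metrics are straightforward to compute once each eval item pairs a ranked result list with the set of known-relevant document IDs, as in this sketch:

```python
# Retrieval metric sketches: recall@k and MRR over a labeled eval set.
# Each eval item: (ranked_result_ids, set_of_relevant_ids).
def recall_at_k(evals, k=5):
    scores = [
        len(set(ranked[:k]) & relevant) / len(relevant)
        for ranked, relevant in evals
    ]
    return sum(scores) / len(scores)

def mean_reciprocal_rank(evals):
    total = 0.0
    for ranked, relevant in evals:
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank   # reciprocal rank of first relevant hit
                break
    return total / len(evals)
```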

Multi-Index RAG: Different Knowledge Bases for Different Query Types

Enterprise RAG often requires multiple vector indexes — one for product documentation, one for policy documents, one for technical specifications, one for customer interaction history. Query routing determines which index to search based on the query type: product questions search the product index, policy questions search the policy index, technical questions search the specifications index. This routing can be rule-based (keyword detection), classifier-based (fine-tuned intent classifier), or LLM-based (the LLM determines which knowledge base is relevant). Multi-index architecture improves retrieval accuracy by 15-25% versus a single combined index because each index is optimized for its specific document type and query pattern.
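
An LLM-based router can be a single classification call. In the sketch below, the index names and the `llm` helper are illustrative assumptions:

```python
# LLM-based query routing sketch; index names and llm() are illustrative.
INDEXES = {"products", "policies", "specs", "customer_history"}

ROUTING_PROMPT = (
    "Classify which knowledge base answers this question. "
    "Reply with exactly one of: products, policies, specs, customer_history.\n"
    "Question: {query}"
)

def route_query(query: str) -> str:
    choice = llm(ROUTING_PROMPT.format(query=query)).strip().lower()
    return choice if choice in INDEXES else "products"  # safe fallback
```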

The Xylity Approach

We implement RAG with the 5-stage pipeline — document-aware chunking, tested embedding models, hybrid search with re-ranking, grounded prompt architecture, and continuous retrieval quality evaluation. Our RAG architects and LLM engineers build the pipeline alongside your team, tuning each stage for your specific documents, queries, and accuracy requirements.


RAG That Gets the Answer Right

Five stages — chunking, embedding, indexing, retrieval, generation. RAG architecture tuned for your documents, your queries, and your accuracy standard.

Start Your RAG Architecture Engagement →