
Retrieval Augmented Generation

RAG systems that ground LLM answers in your real documents — with hybrid search, reranking, and source citations your users (and auditors) can verify.

[Diagram: RAG pipeline, from retrieval to AI response]

What is retrieval augmented generation?

Retrieval augmented generation (RAG) is an architecture that grounds a large language model's answers in your actual documents. Instead of relying on what the model memorized during training, the system retrieves relevant passages from your knowledge base at query time and feeds them to the LLM as context — so answers are based on your data, with citations users and auditors can verify.

RAG is the foundation for trustworthy enterprise AI. It's how you build a "ChatGPT for our 50,000-page policy library" or "an assistant that drafts answers from our actual contract precedents" without paying to retrain a model and without the model fabricating sources. Done well, RAG measurably reduces hallucinations and produces answers traceable to specific documents. Done badly, it produces confident-sounding garbage faster than any other AI architecture.

Key terms used on this page:

  • Retrieval: Finding the most relevant passages from a knowledge base in response to a query.
  • Embedding: A dense numerical vector representing the meaning of a piece of text, used for similarity comparison.
  • Vector database: A specialized data store optimized for nearest-neighbor search over millions or billions of embeddings.
  • Chunk: A passage of a document — typically 200 to 1,000 tokens — that gets embedded and indexed independently.
  • Hybrid search: Combining keyword (BM25) scoring with vector similarity, so exact-term matches and semantic matches both surface.
  • Reranker: A more expensive model that re-scores the top retrieved candidates for relevance, dramatically improving precision.
  • Citation / source attribution: Returning the document IDs and passages that grounded each part of the answer, so claims are verifiable.

How does the retrieval pipeline actually work?

A production RAG retrieval pipeline has more moving parts than the typical diagram suggests. Here's the architecture we ship:

1. Ingestion. Source connectors pull from SharePoint, Confluence, Google Drive, S3, Notion, databases, ticketing systems. We extract text (Unstructured, Tika, AWS Textract for scanned PDFs), preserve metadata (author, date, ACL, source URL), and emit cleaned documents.

2. Chunking. Documents are split into passages — recursive character splitting for prose, structure-aware splitting for documents with headings, atomic chunks for tables and code. Overlap of 10–20% reduces lost context at boundaries.

3. Embedding. Each chunk is embedded with OpenAI text-embedding-3-large, Voyage AI voyage-3, Cohere Embed v3, or an open-source model (BGE-large, E5) when self-hosting matters.

4. Indexing. Vectors land in Pinecone, Weaviate, Qdrant, Chroma, Milvus, or pgvector. Alongside, we maintain a BM25 keyword index (Elasticsearch, OpenSearch, or Postgres full-text) for hybrid search.

5. Query. The user question is embedded; vector search returns the top 50–100 nearest neighbors. In parallel, BM25 retrieves the top 50 by keyword. Results are merged with reciprocal rank fusion or weighted scoring.

6. Reranking. A cross-encoder (Cohere Rerank, BGE Reranker, Voyage rerank-2) re-scores the top 50–100 candidates for true relevance and returns the top 5–10. This step matters more than people expect — it routinely doubles precision at the cost of a few hundred milliseconds.

7. Generation. The reranked passages plus the user question go to the LLM (Claude, GPT-4, or an open-source model) with a prompt that requires citations and instructs the model to refuse if context is insufficient.

8. Evaluation and logging. Every query, retrieval, and answer is logged so we can replay, measure, and improve.

Cutting corners on steps 5 and 6 is the most common reason RAG systems underperform. Pure vector search without a keyword fallback misses exact phrases and IDs; without a reranker, the LLM gets noisy context and answers are worse than they need to be.
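
Condensed into code, steps 5–7 look roughly like this. It's a minimal sketch: `embed`, `vector_search`, `bm25_search`, `rrf_merge`, `rerank`, and `generate_with_citations` are illustrative stand-ins for the components described above, not any specific library's API.

```python
def answer(question: str) -> dict:
    # Step 5: query both indexes, then merge the ranked lists.
    q_vec = embed(question)                      # embedding-model call
    dense = vector_search(q_vec, top_k=100)      # ANN over the vector index
    sparse = bm25_search(question, top_k=50)     # keyword (BM25) index
    candidates = rrf_merge(dense, sparse)        # reciprocal rank fusion

    # Step 6: cross-encoder rerank; keep only the best passages.
    passages = rerank(question, candidates, top_n=8)

    # Step 7: generate with required citations; the prompt instructs
    # the model to refuse when the retrieved context is insufficient.
    return generate_with_citations(question, passages)
```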

Which vector database should you use?

The market has consolidated around a handful of options that each make different tradeoffs.

Vector DB | Best for | Strengths | Weaknesses
Pinecone | Teams that want managed and don't want to think about infrastructure | Fastest time-to-production, serverless tier, predictable performance | Vendor lock-in, no self-host option, costs grow with scale
Weaviate | Hybrid search out of the box, self-hostable | Native BM25 + vector hybrid, modular embeddings, GraphQL API | Operational complexity at scale
Qdrant | Performance-sensitive workloads, full control | Fast Rust-based engine, payload filtering, self-host or managed | Smaller ecosystem than Pinecone
Chroma | Prototypes and small-to-medium corpora | Dead-simple Python API, embedded mode, low ceremony | Not built for very large or high-QPS production
Milvus | Massive corpora, multi-tenant SaaS, on-prem at scale | Battle-tested at billions-of-vectors scale, strong open-source community | Operational footprint is heavy
pgvector | Already on Postgres, corpus under a few million chunks | One less system to operate, transactional consistency with your app data, free | Slower at very large scale, fewer ANN options

We default to pgvector for clients already on Postgres with manageable corpora; Pinecone or Weaviate when scale and ops simplicity matter; Qdrant when performance is the dominant constraint.

How do hybrid search and reranking improve RAG quality?

Pure vector similarity has a known weakness: it retrieves passages that are topically similar to the query but not necessarily answer-bearing. It also struggles with exact identifiers — model numbers, ticker symbols, statute references, error codes — because embeddings smooth over the literal characters.

Hybrid search combines vector similarity with BM25 keyword scoring. We use reciprocal rank fusion (RRF) or weighted score combination to merge the two ranked lists. On most enterprise corpora, hybrid retrieval beats either method alone by 10–25% on standard relevance benchmarks.
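
RRF itself is only a few lines of code. A minimal sketch: each document's score is the sum of 1/(k + rank) over the ranked lists it appears in, with k = 60 as the customary constant from the original RRF paper.

```python
from collections import defaultdict

def rrf_merge(*ranked_lists: list[str], k: int = 60) -> list[str]:
    """Merge ranked lists of doc IDs with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: merged = rrf_merge(vector_hits, bm25_hits)
```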

Reranking is the single highest-leverage upgrade for most RAG systems we audit. After retrieving 50–100 candidates with hybrid search, a cross-encoder model (Cohere Rerank 3, BGE Reranker v2, Voyage rerank-2) re-scores them for relevance to the query. Cross-encoders are too slow to apply to the full corpus but fast enough on 100 candidates, and their precision is dramatically better than embedding similarity. We've seen reranking take answer accuracy from "barely usable" to "production-ready" without changing anything else.
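
For illustration, here is that pattern with the open-source BGE reranker via sentence-transformers (hosted APIs like Cohere Rerank follow the same shape: score query–passage pairs, keep the top few):

```python
from sentence_transformers import CrossEncoder

# A cross-encoder reads the query and passage together, which is why
# it is slower but far more precise than embedding similarity.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, candidates: list[str], top_n: int = 8) -> list[str]:
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]
```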

Query rewriting is the third lever. Often the user's literal question is a poor retrieval query. A small LLM call rewrites it — expanding acronyms, splitting multi-part questions, generating hypothetical document text (HyDE) — before retrieval runs. This is cheap and frequently underused.
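
A minimal HyDE sketch with the OpenAI Python client (the model choice and prompt wording are illustrative). The generated passage, not the raw question, is what gets embedded for retrieval:

```python
from openai import OpenAI

client = OpenAI()

def hyde_query(question: str) -> str:
    # Draft a short passage that would answer the question; its embedding
    # tends to land closer to answer-bearing chunks than the question's own.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative: any small, cheap model works
        messages=[{
            "role": "user",
            "content": f"Write a short, factual paragraph that answers: {question}",
        }],
    )
    return response.choices[0].message.content
```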

How do generation, prompting, and citation work in production?

Retrieval gives the model the right context. The generation step decides whether the model uses it well. The patterns that matter:

  • Structured prompts that require citations. Every claim must reference a source ID from the retrieved set. We enforce this with format instructions and validate post-hoc — answers without grounded citations are flagged or rejected (a validation sketch follows this list).
  • Refusal when context is insufficient. The prompt explicitly instructs the model to say "I don't have enough information" rather than guessing. This is the single biggest hallucination reducer.
  • Context ordering. LLMs attend more to the start and end of context windows than the middle. We rerank and place the highest-relevance passages at the boundaries.
  • Chain-of-thought, used carefully. For complex multi-document reasoning, asking the model to reason step-by-step in a designated scratchpad section before answering improves accuracy at the cost of latency and tokens. Worth it for legal, financial, or compliance tasks; overkill for simple Q&A.
  • Output schemas. When the answer needs structured fields (a contract summary with party, term, obligations), we use OpenAI structured outputs, Anthropic tool use, or Instructor / Pydantic to enforce the shape. This eliminates parsing failures.
  • Streaming with citation rendering. For chat UIs, we stream tokens and resolve citations to clickable links inline so users can verify as they read.
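
The post-hoc citation check from the first bullet is straightforward to implement. A minimal sketch, assuming the prompt asks the model to mark sources as bracketed IDs like [doc-42] (that marker format is illustrative, not a standard):

```python
import re

CITATION = re.compile(r"\[(doc-[\w-]+)\]")  # matches e.g. [doc-42]

def validate_citations(answer: str, retrieved_ids: set[str]) -> tuple[bool, set[str]]:
    """Reject answers that cite nothing, or cite IDs we never retrieved."""
    cited = set(CITATION.findall(answer))
    ungrounded = cited - retrieved_ids
    return bool(cited) and not ungrounded, ungrounded
```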

Should you build, buy, or partner for RAG?

The RAG market has matured in the last 18 months. There's now a real spectrum of options.

Option | Best for | Strengths | Weaknesses | Typical cost
Off-the-shelf RAG SaaS (Glean, Guru, Notion AI Q&A) | Generic enterprise search across SaaS apps | Fast deployment, no engineering, polished UX | One-size-fits-all retrieval, limited customization, ongoing per-seat cost, ACL fidelity varies | $20–$50 per user per month
Cloud-vendor RAG (Azure AI Search + OpenAI, AWS Bedrock Knowledge Bases, Vertex AI Search) | Teams already deep in one cloud, standard document corpora | Native integration with cloud identity and storage, low ops | Limited control over chunking, reranking, prompt strategy; harder to tune for hard domains | $5K–$30K/month at scale
Orchestration frameworks (LangChain, LlamaIndex, Haystack) | Custom RAG built by an internal team | Full control, vendor-neutral, large ecosystem | Real engineering effort, abstraction tax, you own the maintenance | Engineer time + infra
Self-hosted on pgvector or Qdrant | Existing Postgres / dev teams, data residency requirements, cost-sensitive at scale | One stack, full control, low marginal cost per query | Requires evaluation discipline and ML engineering | Infra only after build
Custom RAG build with us | Differentiated knowledge, regulated data, high accuracy bar | Tuned chunking + hybrid search + reranking, ACL enforcement, evaluation harness, owned IP | 8–14 weeks to first production release | $60K–$220K upfront, then mostly infra

What we see work most often: start with a custom build on managed components (Pinecone or Weaviate, OpenAI or Voyage embeddings, Cohere Rerank, Claude or GPT-4 generation) and a real evaluation harness. Migrate to self-hosted (pgvector + open-source embeddings + open-source reranker) only when cost or data-residency constraints justify the operational load. Skip "RAG-in-a-box" SaaS unless your problem is genuinely generic.

How do you evaluate a RAG system?

"It looks pretty good" is not an evaluation. Every RAG project we ship has a measurable evaluation harness from week two:

  • A gold question set of 100–500 representative queries with reference answers and the document IDs that should be retrieved.
  • Retrieval metrics: recall@k (did we retrieve the right passages?), MRR (how high were they ranked?), and nDCG; the first two are sketched in code after this list.
  • Answer metrics: faithfulness (does every claim trace to a retrieved passage?), answer correctness (against the reference), and citation accuracy. We use Ragas, TruLens, or a custom LLM-as-judge harness depending on the project.
  • Regression on every change. New embedding model, new chunking strategy, new prompt — all must be measured against the gold set before shipping.
  • Production telemetry. Thumbs up/down, query logs, retrieval latency, generation latency, cost per query.
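
The retrieval metrics are easy to compute from the gold set. A minimal sketch of recall@k and MRR over ranked document IDs:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of gold passages that show up in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant passage; 0 if none retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```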

This is the discipline that separates RAG systems that get better over time from RAG systems that quietly degrade.

What does a RAG engagement look like with us?

Typical engagements run 8 to 14 weeks:

1. Corpus and use-case scoping (1–2 weeks). We inventory the data sources, ACL requirements, query patterns, and success criteria. We build the gold evaluation set with you — this is the foundation everything else stands on.

2. Baseline build (2–3 weeks). A minimum-viable RAG with managed components: Pinecone or pgvector, OpenAI or Voyage embeddings, Claude or GPT-4 generation, basic prompt with citations. We measure against the gold set and report.

3. Tuning (3–5 weeks). Chunking strategies, hybrid search, reranking, query rewriting, prompt engineering, ACL enforcement. Each change is measured. The accuracy curve at this stage is what matters.

4. Production hardening (2–3 weeks). Ingestion pipelines with incremental sync, monitoring, cost controls, rate limiting, audit logging, fallback behavior, eval-on-deploy in CI.

5. Hand-off (1 week). Documentation, runbooks, retraining and reindexing playbooks, on-call coverage during stabilization.

Outcomes: a measurable accuracy and faithfulness number on your data, an ingestion pipeline that keeps the index fresh, an evaluation harness that runs on every change, and code your team owns.

What does RAG cost?

Realistic ranges:

  • Lightweight RAG (one corpus, one or two query types, managed components): USD 40,000–90,000 to build, USD 1,000–5,000/month to run.
  • Production RAG with hybrid search, reranking, ACL, and evaluation harness: USD 80,000–180,000 to build, USD 3,000–15,000/month to run depending on volume.
  • Multi-source enterprise RAG (5+ data sources, role-based access, multilingual, multiple downstream use cases): USD 150,000–300,000 to build, USD 8,000–30,000/month at scale.

Operating cost is dominated by LLM generation (when traffic is high), embedding regeneration (when the corpus changes a lot), and vector DB hosting. We model these on real query volume before quoting.
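
As a back-of-the-envelope illustration of that modeling (every number here is an assumed placeholder, not a quote; real token prices and query volumes vary):

```python
def monthly_generation_cost(
    queries_per_month: int = 50_000,    # assumed traffic
    input_tokens: int = 4_000,          # retrieved passages + prompt, assumed
    output_tokens: int = 500,           # assumed answer length
    usd_per_m_input: float = 3.00,      # placeholder $/1M input tokens
    usd_per_m_output: float = 15.00,    # placeholder $/1M output tokens
) -> float:
    per_query = (input_tokens * usd_per_m_input
                 + output_tokens * usd_per_m_output) / 1_000_000
    return queries_per_month * per_query

# With these placeholders: ~$0.0195 per query, ~$975/month for generation alone.
```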

For pricing detail, see our Pricing page.

Frequently asked questions about RAG

Why not just dump everything into a long-context model and skip RAG?

Long-context models (Claude with 200K, Gemini with 1M+) are great for single large documents. RAG still wins when your corpus is bigger than the context, when you need source citations, when latency matters, when cost matters, or when you want to update knowledge without retraining. Most enterprise knowledge bases are 100x to 10,000x larger than even the longest context window.

Which vector database should we use?

Pinecone for fastest time-to-production with managed scaling, Weaviate for hybrid search out of the box and self-hostability, Qdrant for performance at scale with full control, Chroma for prototypes and small-to-medium corpora, Milvus for very large self-hosted deployments. If you're already on Postgres and your corpus is under a few million chunks, pgvector is often the right answer — one less system to operate.

How do we stop the model from hallucinating?

Three layers. First, retrieval that actually finds the right passages (hybrid search plus reranking, not just vector similarity). Second, prompts that instruct the model to refuse to answer when retrieved context is insufficient and to cite sources for every claim. Third, evaluation: a regression set of questions where you score faithfulness — does every claim trace back to a retrieved passage? — on every change. Hallucinations don't disappear, but with these in place they drop from "unusable" to "rare and detectable."

How do you handle access controls — different users seeing different documents?

Permissions enforced at retrieval time, not generation time. Every chunk is tagged with its source document's ACL metadata; the vector query filters by the user's groups before retrieval runs. We never rely on the LLM to "remember" not to show a document — that's how data leaks happen.
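
On pgvector, for instance, the ACL filter is a WHERE clause in the same statement that does the vector ranking. A sketch assuming a hypothetical `chunks` table with an `allowed_groups` text array and a pgvector `embedding` column:

```python
import psycopg  # psycopg 3

def acl_filtered_search(conn, query_vec: list[float], user_groups: list[str]):
    # && is Postgres array overlap: a chunk is visible only if it shares at
    # least one group with the user. The filter runs inside the same query
    # as the ANN ranking, so unauthorized chunks never reach the LLM.
    sql = """
        SELECT id, content
        FROM chunks
        WHERE allowed_groups && %(groups)s
        ORDER BY embedding <=> %(qvec)s::vector
        LIMIT 50;
    """
    with conn.cursor() as cur:
        cur.execute(sql, {"groups": user_groups, "qvec": str(query_vec)})
        return cur.fetchall()
```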

What's the right chunking strategy?

It depends on the documents. Recursive character splitting with 500–1,000 token chunks and 10–20% overlap is the safe default. For structured documents (contracts with clause headings, technical manuals with sections), structure-aware chunking that respects headings retrieves better. For tables and code, treat them as atomic units. We benchmark 3–4 chunking strategies on a real evaluation set rather than guessing.
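
For illustration, the safe default expressed with LangChain's splitter, sized in tokens via tiktoken (one common implementation, not the only one):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = open("handbook.txt").read()  # any extracted document

# 500-token chunks with 15% overlap: the safe default described above.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=500,
    chunk_overlap=75,
)
chunks = splitter.split_text(document_text)
```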

Do we need to fine-tune the embedding model on our data?

Usually no. OpenAI text-embedding-3-large, Voyage AI voyage-3, and Cohere Embed v3 are strong enough out of the box for most domains. Fine-tuning embeddings is worth it for very specialized vocabulary (clinical, legal, hard sciences) and only after a strong baseline is in place. The bigger wins are usually in chunking, hybrid search, and reranking.

How do we keep the knowledge base in sync with source systems?

Incremental ingestion. Source systems (SharePoint, Confluence, Drive, S3, databases) emit change events or are polled on a schedule; new and updated documents are re-chunked, re-embedded, and upserted; deleted documents are removed by ID. We log every ingestion run and surface lag metrics so users know how fresh the index is.
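
Shape-wise, each sync run is a short loop. A sketch in which the source connector, `chunk`, `embed`, the index client, and `log_run` are all hypothetical stand-ins rather than a specific library:

```python
def sync(source, index, since):
    changes = source.fetch_changes(since=since)    # poll, or consume change events
    for doc in changes.updated:
        index.delete(filter={"doc_id": doc.id})    # drop stale chunks first
        for i, passage in enumerate(chunk(doc)):
            index.upsert(
                id=f"{doc.id}-{i}",
                vector=embed(passage),
                metadata={"doc_id": doc.id, "acl": doc.acl, "url": doc.url},
            )
    for doc_id in changes.deleted:
        index.delete(filter={"doc_id": doc_id})    # remove deleted docs by ID
    log_run(changes)                               # surface freshness / lag metrics
```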

Ready to Transform Your Business with AI?

Let's discuss how our AI solutions can drive growth, reduce costs, and create competitive advantages for your organization.

Schedule a Consultation