Building a RAG pipeline from scratch: the decisions that actually matter
In brief
Building RAG is easy. Building RAG that doesn't silently degrade over time is hard. Here's the production-ready version — including the retrieval failures most tutorials don't mention.
Retrieval-Augmented Generation is the pattern for giving a language model access to your data without fine-tuning. The concept is straightforward: retrieve relevant chunks from your knowledge base, stuff them into the context window, let Claude answer. The implementation has enough decisions to make it non-trivial.
This covers the full pipeline: ingestion, chunking, embedding, retrieval, generation, and evaluation. Focus is on the decisions that affect quality, not on getting something running in ten minutes.
The full pipeline
Document → Chunk → Embed → Store in vector DB
↓
Query → Embed → Vector search → Top-K chunks → Claude → Answer
Each step has levers. Here is what matters at each one.
Step 1: Ingestion and chunking
Chunking is the most underrated decision in RAG. Bad chunking means bad retrieval, which means bad answers, regardless of how good your model is.
Fixed-size chunking (split every N tokens): simple, predictable, fast. Works poorly when your documents have structure — it splits mid-sentence, mid-table, mid-code-block. Good as a baseline, bad in production.
Recursive character splitting: split on paragraph breaks first, then sentences, then words, then characters. Tries to preserve natural text boundaries. This is what most libraries default to and it is reasonable for prose-heavy documents.
Semantic chunking: use an embedding model to identify where topic shifts occur, and chunk at those boundaries. More expensive, more accurate, worth it for documents where paragraph breaks do not cleanly align with topic changes.
Document-aware chunking: respect the document's actual structure — split markdown on headers, split code files on function boundaries, split PDFs on page breaks with header detection. This is the right approach for structured documents, and it is not that complicated:
import re

def chunk_markdown(text: str, max_tokens: int = 500) -> list[str]:
    # Split on h2 headers first
    sections = re.split(r'(?=^## )', text, flags=re.MULTILINE)
    chunks = []
    for section in sections:
        if len(section.split()) > max_tokens:
            # Further split long sections on paragraphs
            paragraphs = section.split('\n\n')
            current = []
            current_len = 0
            for p in paragraphs:
                p_len = len(p.split())
                if current_len + p_len > max_tokens and current:
                    chunks.append('\n\n'.join(current))
                    current = [p]
                    current_len = p_len
                else:
                    current.append(p)
                    current_len += p_len
            if current:
                chunks.append('\n\n'.join(current))
        else:
            chunks.append(section)
    return [c.strip() for c in chunks if c.strip()]
Chunk overlap: add 10-20% overlap between chunks so that context at the boundary of one chunk is not lost. Easy to implement, meaningfully reduces retrieval gaps.
Chunk size: 200-500 tokens is the common range. Smaller chunks are more precise but lose context; larger chunks include more context but dilute relevance. Test both ends on your specific documents and queries.
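Overlap can be added as a post-processing pass over the chunks. A minimal sketch, carrying the tail of each chunk into the next (the overlap ratio is an assumption to tune; this uses word counts as a rough token proxy):

```python
def add_overlap(chunks: list[str], overlap_ratio: float = 0.15) -> list[str]:
    """Prepend the tail of the previous chunk to each chunk so boundary
    context appears in both. overlap_ratio is the fraction of the previous
    chunk's words to carry forward."""
    result = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            result.append(chunk)
            continue
        prev_words = chunks[i - 1].split()
        n = max(1, int(len(prev_words) * overlap_ratio))
        result.append(" ".join(prev_words[-n:]) + " " + chunk)
    return result
```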
Step 2: Embedding
The embedding model turns text into a vector that captures meaning. Similar meaning → similar vectors → found by vector search.
For most use cases, text-embedding-3-small (OpenAI) or embed-english-v3.0 (Cohere) are solid choices. They are fast, cheap, and good. You do not need the largest model unless you have measured a quality gap with the smaller one.
Critical: embed your chunks and your queries with the same model. Mixing models produces nonsense distances.
from openai import OpenAI

embed_client = OpenAI()

def embed(text: str) -> list[float]:
    response = embed_client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
Store the embedding alongside the chunk text and any metadata (source document, page number, section header) that will help at generation time.
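A stored record might look like the following sketch. The field names are illustrative, chosen to match the keys the retrieval and generation code in this article reads (`text`, `source`, `embedding`):

```python
def make_record(chunk_id: int, text: str, source: str,
                embedding: list[float]) -> dict:
    """Bundle a chunk with its embedding and retrieval metadata."""
    return {
        "id": chunk_id,
        "text": text,
        "source": source,        # source document, used for citations
        "embedding": embedding,  # must come from the same model as queries
    }
```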
Step 3: Vector storage and retrieval
For development, numpy with cosine similarity is fine. For production, use a real vector database — pgvector (Postgres extension), Pinecone, Weaviate, Qdrant, or Chroma. They handle indexing, approximate nearest-neighbor search, and filtering at scale.
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, chunks: list[dict], top_k: int = 5) -> list[dict]:
    query_embedding = embed(query)
    scored = [
        {**chunk, "score": cosine_similarity(query_embedding, chunk["embedding"])}
        for chunk in chunks
    ]
    return sorted(scored, key=lambda x: x["score"], reverse=True)[:top_k]
Hybrid search: vector search finds semantically similar content but misses exact keyword matches. BM25 (keyword search) finds exact matches but misses paraphrases. Combining both — hybrid search — consistently outperforms either alone. Most production RAG systems use hybrid retrieval with a reranker.
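One common way to combine the two result lists is reciprocal rank fusion (RRF). A minimal sketch, assuming each input is a list of document IDs in rank order (k=60 is the conventional constant from the original RRF paper):

```python
def rrf_fuse(vector_ranked: list[str], bm25_ranked: list[str],
             k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each document scores 1/(k + rank) per list
    it appears in; documents ranked well by either method rise to the top."""
    scores: dict[str, float] = {}
    for ranked in (vector_ranked, bm25_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score normalization, which is why it is a popular default for fusing BM25 scores with cosine similarities that live on different scales.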
Reranking: take your top-20 retrieved chunks, run them through a cross-encoder reranker (Cohere Rerank, or a local model), and keep the top 5. Cross-encoders are more accurate than embedding similarity because they process query and document together rather than independently. This step meaningfully improves precision.
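A sketch of that rerank step, with the cross-encoder abstracted behind a `score_fn` callable (`score_fn` is a placeholder for whatever you use in practice, e.g. a Cohere Rerank call or a local cross-encoder; the `keep=5` default mirrors the top-5 suggestion above):

```python
from typing import Callable

def rerank(query: str, chunks: list[dict],
           score_fn: Callable[[str, str], float], keep: int = 5) -> list[dict]:
    """Re-score retrieved chunks with a cross-encoder and keep the best few.
    score_fn(query, text) -> float scores the pair jointly."""
    scored = [{**c, "rerank_score": score_fn(query, c["text"])} for c in chunks]
    return sorted(scored, key=lambda c: c["rerank_score"], reverse=True)[:keep]
```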
Step 4: Generation
Construct the prompt carefully. The retrieval context should be clearly separated from the question, and Claude should be instructed on how to handle cases where the retrieved context does not contain the answer.
import anthropic

client = anthropic.Anthropic()

def answer(query: str, retrieved_chunks: list[dict]) -> str:
    context = "\n\n---\n\n".join([
        f"Source: {c['source']}\n{c['text']}"
        for c in retrieved_chunks
    ])
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="""You are a helpful assistant. Answer questions using only the provided context.
If the context does not contain enough information to answer, say so directly.
Do not make up information. Cite the source when relevant.""",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }]
    )
    return message.content[0].text
Context window management: if your retrieved chunks are large, you can overflow the context. Track token counts — the Anthropic token-counting endpoint (client.messages.count_tokens()) gives exact counts, or tiktoken works for rough estimation. Drop lower-scoring chunks before you hit the limit.
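A minimal budget-trimming sketch, assuming each chunk carries the "score" field set during retrieval. The words-to-tokens multiplier of 1.3 is a rough assumption; swap in a real tokenizer for accuracy:

```python
def fit_to_budget(chunks: list[dict], max_tokens: int = 4000) -> list[dict]:
    """Keep the highest-scoring chunks that fit within a token budget,
    estimating tokens as word count * 1.3."""
    result, used = [], 0
    for c in sorted(chunks, key=lambda c: c["score"], reverse=True):
        est = int(len(c["text"].split()) * 1.3)
        if used + est > max_tokens:
            break
        result.append(c)
        used += est
    return result
```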
Step 5: Evaluation
This is where most RAG implementations fall apart. You cannot improve what you do not measure.
The minimum viable eval set: 20-50 question-answer pairs grounded in your documents. For each question, you know the correct answer and which source it should come from.
Measure:
- Retrieval recall: does the correct source appear in the top-K results?
- Answer correctness: is the generated answer correct? (Manual for now, LLM-as-judge at scale)
- Faithfulness: does the answer only use information from the retrieved context? (Important for hallucination detection)
def evaluate_retrieval(eval_set: list[dict], chunks: list[dict]) -> dict:
    hits = 0
    for item in eval_set:
        retrieved = retrieve(item["question"], chunks, top_k=5)
        retrieved_sources = {r["source"] for r in retrieved}
        if item["expected_source"] in retrieved_sources:
            hits += 1
    return {
        "recall@5": hits / len(eval_set),
        "total": len(eval_set)
    }
Run this eval every time you change chunking strategy, embedding model, or retrieval parameters. It gives you a number. Numbers let you make decisions.
The things that actually move quality
In rough order of impact:
- Chunking strategy — document-aware chunking beats fixed-size meaningfully
- Reranking — cross-encoder after vector retrieval is a consistent win
- Hybrid search — catches the exact-match cases embedding misses
- Evaluation — the only way to know if anything is working
Embedding model choice, vector database choice, and chunk size matter less than people think. Get the eval working first, then optimize.
Try this before you build: Sketch your retrieval strategy on paper. What are you retrieving? From where? What does a bad retrieval result look like, and how would you know? This 15-minute exercise will surface the hard problems before you've written a line of code.
Further reading
- Introducing Contextual Retrieval — Anthropic's approach to improving RAG with contextual embeddings
- Citations documentation — how to get Claude to cite its sources from retrieved documents