Building a RAG pipeline from scratch: the decisions that actually matter
In brief
Building RAG is easy. Building RAG that doesn't silently degrade over time is hard. Here's the production-ready version — including the retrieval failures most tutorials don't mention.
Retrieval-Augmented Generation is the pattern for giving a language model access to your data without fine-tuning. The concept is straightforward: retrieve relevant chunks from your knowledge base, stuff them into the context window, let Claude answer. The implementation has enough decisions to make it non-trivial.
This covers the full pipeline: ingestion, chunking, embedding, retrieval, generation, and evaluation. Focus is on the decisions that affect quality, not on getting something running in ten minutes.
The full pipeline
Document → Chunk → Embed → Store in vector DB
↓
Query → Embed → Vector search → Top-K chunks → Claude → Answer
Each step has levers. Here is what matters at each one.
Step 1: Ingestion and chunking
Chunking is the most underrated decision in RAG. Bad chunking means bad retrieval, which means bad answers, regardless of how good your model is.
Fixed-size chunking (split every N tokens): simple, predictable, fast. Works poorly when your documents have structure — it splits mid-sentence, mid-table, mid-code-block. Good as a baseline, bad in production.
Recursive character splitting: split on paragraph breaks first, then sentences, then words, then characters. Tries to preserve natural text boundaries. This is what most libraries default to and it is reasonable for prose-heavy documents.
Semantic chunking: use an embedding model to identify where topic shifts occur, and chunk at those boundaries. More expensive, more accurate, worth it for documents where paragraph breaks do not cleanly align with topic changes.
Document-aware chunking: respect the document's actual structure — split markdown on headers, split code files on function boundaries, split PDFs on page breaks with header detection. This is the right approach for structured documents, and it is not that complicated:
import re

def chunk_markdown(text: str, max_tokens: int = 500) -> list[str]:
    # Split on h2 headers first
    sections = re.split(r'(?=^## )', text, flags=re.MULTILINE)
    chunks = []
    for section in sections:
        if len(section.split()) > max_tokens:
            # Further split long sections on paragraphs
            paragraphs = section.split('\n\n')
            current = []
            current_len = 0
            for p in paragraphs:
                p_len = len(p.split())
                if current_len + p_len > max_tokens and current:
                    chunks.append('\n\n'.join(current))
                    current = [p]
                    current_len = p_len
                else:
                    current.append(p)
                    current_len += p_len
            if current:
                chunks.append('\n\n'.join(current))
        else:
            chunks.append(section)
    return [c.strip() for c in chunks if c.strip()]
Chunk overlap: add 10-20% overlap between chunks so that context at the boundary of one chunk is not lost. Easy to implement, meaningfully reduces retrieval gaps.
Chunk size: 200-500 tokens is the common range. Smaller chunks are more precise but lose context; larger chunks include more context but dilute relevance. Test both ends on your specific documents and queries.
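Overlap can be added as a post-processing pass over the chunks. A minimal sketch, carrying the tail of each chunk into the next (the overlap ratio is an assumption to tune; this uses word counts as a rough token proxy):

```python
def add_overlap(chunks: list[str], overlap_ratio: float = 0.15) -> list[str]:
    """Prepend the tail of the previous chunk to each chunk so boundary
    context appears in both. overlap_ratio is the fraction of the previous
    chunk's words to carry forward."""
    result = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            result.append(chunk)
            continue
        prev_words = chunks[i - 1].split()
        n = max(1, int(len(prev_words) * overlap_ratio))
        result.append(" ".join(prev_words[-n:]) + " " + chunk)
    return result
```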
Step 2: Embedding
The embedding model turns text into a vector that captures meaning. Similar meaning → similar vectors → found by vector search.
For most use cases, text-embedding-3-small (OpenAI) or embed-english-v3.0 (Cohere) are solid choices. They are fast, cheap, and good. You do not need the largest model unless you have measured a quality gap with the smaller one.
Critical: embed your chunks and your queries with the same model. Mixing models produces nonsense distances.
from openai import OpenAI

embed_client = OpenAI()

def embed(text: str) -> list[float]:
    response = embed_client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
Store the embedding alongside the chunk text and any metadata (source document, page number, section header) that will help at generation time.
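A stored record might look like the following sketch. The field names are illustrative, chosen to match the keys the retrieval and generation code in this article reads (`text`, `source`, `embedding`):

```python
def make_record(chunk_id: int, text: str, source: str,
                embedding: list[float]) -> dict:
    """Bundle a chunk with its embedding and retrieval metadata."""
    return {
        "id": chunk_id,
        "text": text,
        "source": source,        # source document, used for citations
        "embedding": embedding,  # must come from the same model as queries
    }
```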
Step 3: Vector storage and retrieval
For development, numpy with cosine similarity is fine. For production, use a real vector database — pgvector (Postgres extension), Pinecone, Weaviate, Qdrant, or Chroma. They handle indexing, approximate nearest-neighbor search, and filtering at scale.
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, chunks: list[dict], top_k: int = 5) -> list[dict]:
    query_embedding = embed(query)
    scored = [
        {**chunk, "score": cosine_similarity(query_embedding, chunk["embedding"])}
        for chunk in chunks
    ]
    return sorted(scored, key=lambda x: x["score"], reverse=True)[:top_k]
Hybrid search: vector search finds semantically similar content but misses exact keyword matches. BM25 (keyword search) finds exact matches but misses paraphrases. Combining both — hybrid search — consistently outperforms either alone. Most production RAG systems use hybrid retrieval with a reranker.
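One common way to combine the two result lists is reciprocal rank fusion (RRF). A minimal sketch, assuming each input is a list of document IDs in rank order (k=60 is the conventional constant from the original RRF paper):

```python
def rrf_fuse(vector_ranked: list[str], bm25_ranked: list[str],
             k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each document scores 1/(k + rank) per list
    it appears in; documents ranked well by either method rise to the top."""
    scores: dict[str, float] = {}
    for ranked in (vector_ranked, bm25_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score normalization, which is why it is a popular default for fusing BM25 scores with cosine similarities that live on different scales.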
Reranking: take your top-20 retrieved chunks, run them through a cross-encoder reranker (Cohere Rerank, or a local model), and keep the top 5. Cross-encoders are more accurate than embedding similarity because they process query and document together rather than independently. This step meaningfully improves precision.
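A sketch of that rerank step, with the cross-encoder abstracted behind a `score_fn` callable (`score_fn` is a placeholder for whatever you use in practice, e.g. a Cohere Rerank call or a local cross-encoder; the `keep=5` default mirrors the top-5 suggestion above):

```python
from typing import Callable

def rerank(query: str, chunks: list[dict],
           score_fn: Callable[[str, str], float], keep: int = 5) -> list[dict]:
    """Re-score retrieved chunks with a cross-encoder and keep the best few.
    score_fn(query, text) -> float scores the pair jointly."""
    scored = [{**c, "rerank_score": score_fn(query, c["text"])} for c in chunks]
    return sorted(scored, key=lambda c: c["rerank_score"], reverse=True)[:keep]
```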
Step 4: Generation
Construct the prompt carefully. The retrieval context should be clearly separated from the question, and Claude should be instructed on how to handle cases where the retrieved context does not contain the answer.
import anthropic

client = anthropic.Anthropic()

def answer(query: str, retrieved_chunks: list[dict]) -> str:
    context = "\n\n---\n\n".join([
        f"Source: {c['source']}\n{c['text']}"
        for c in retrieved_chunks
    ])
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="""You are a helpful assistant. Answer questions using only the provided context.
If the context does not contain enough information to answer, say so directly.
Do not make up information. Cite the source when relevant.""",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }]
    )
    return message.content[0].text
Context window management: if your retrieved chunks are large, you can overflow the context. Track token counts — the Anthropic token-counting endpoint (client.messages.count_tokens()) gives exact counts, or tiktoken works for rough estimation. Drop lower-scoring chunks before you hit the limit.
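A minimal budget-trimming sketch, assuming each chunk carries the "score" field set during retrieval. The words-to-tokens multiplier of 1.3 is a rough assumption; swap in a real tokenizer for accuracy:

```python
def fit_to_budget(chunks: list[dict], max_tokens: int = 4000) -> list[dict]:
    """Keep the highest-scoring chunks that fit within a token budget,
    estimating tokens as word count * 1.3."""
    result, used = [], 0
    for c in sorted(chunks, key=lambda c: c["score"], reverse=True):
        est = int(len(c["text"].split()) * 1.3)
        if used + est > max_tokens:
            break
        result.append(c)
        used += est
    return result
```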
Step 5: Evaluation
This is where most RAG implementations fall apart. You cannot improve what you do not measure.
The minimum viable eval set: 20-50 question-answer pairs grounded in your documents. For each question, you know the correct answer and which source it should come from.
Measure:
- Retrieval recall: does the correct source appear in the top-K results?
- Answer correctness: is the generated answer correct? (Manual for now, LLM-as-judge at scale)
- Faithfulness: does the answer only use information from the retrieved context? (Important for hallucination detection)
def evaluate_retrieval(eval_set: list[dict], chunks: list[dict]) -> dict:
    hits = 0
    for item in eval_set:
        retrieved = retrieve(item["question"], chunks, top_k=5)
        retrieved_sources = {r["source"] for r in retrieved}
        if item["expected_source"] in retrieved_sources:
            hits += 1
    return {
        "recall@5": hits / len(eval_set),
        "total": len(eval_set)
    }
Run this eval every time you change chunking strategy, embedding model, or retrieval parameters. It gives you a number. Numbers let you make decisions.
The things that actually move quality
In rough order of impact:
- Chunking strategy — document-aware chunking beats fixed-size meaningfully
- Reranking — cross-encoder after vector retrieval is a consistent win
- Hybrid search — catches the exact-match cases embedding misses
- Evaluation — the only way to know if anything is working
Embedding model choice, vector database choice, and chunk size matter less than people think. Get the eval working first, then optimize.
Try this before you build: Sketch your retrieval strategy on paper. What are you retrieving? From where? What does a bad retrieval result look like, and how would you know? This 15-minute exercise will surface the hard problems before you've written a line of code.
Further reading
- Introducing Contextual Retrieval — Anthropic's approach to improving RAG with contextual embeddings
- Citations documentation — how to get Claude to cite its sources from retrieved documents