AI Codex
Foundation Models & LLMs: Core Definition

The engine under everything

A large language model is what Claude is at its core — and understanding how it works changes how you think about everything else in AI.

Before you can understand RAG, agents, prompt engineering, or anything else in AI, you need a working mental model of what a large language model actually is.

Not the technical details — the intuition.

The core idea

A large language model is a system trained to predict what text should come next, given the text that came before. That's it.

Trained on trillions of words — books, articles, code, conversations, documentation, the web — the model learned the patterns of language at a scale that's hard to comprehend. Not just grammar and syntax, but reasoning patterns, factual associations, argument structures, writing styles, domain knowledge.

The prediction task sounds humble. "Given these words, what comes next?" But at scale, with enough data and enough parameters, something remarkable emerges: a system that can answer questions, write code, analyze documents, translate languages, explain concepts, and reason through problems — all as a side effect of learning to predict text well.
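The idea is easiest to see at toy scale. The sketch below is a deliberately tiny "language model": it counts which word follows which in a handful of words, then predicts the most common continuation. Real LLMs use neural networks over trillions of words, not counts over a sentence — this is only the intuition.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for "trillions of words".
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows each word: the simplest possible "language model".
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the most likely next word given the previous one."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" — it follows "the" most often in this corpus
```

Everything an LLM does is a vastly more sophisticated version of this move: look at the context, predict the continuation.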

What "large" actually means

The "large" in large language model refers to the number of parameters — the numerical weights that get adjusted during training to capture patterns in the data.

Early language models had millions of parameters. GPT-2 had 1.5 billion. Claude, GPT-4, and their contemporaries are estimated to have hundreds of billions to over a trillion parameters.

More parameters means more capacity to store patterns, make distinctions, and handle complex reasoning. But "large" is also relative — the trend is toward better performance at smaller sizes through improved training techniques. Today's smaller models often outperform yesterday's larger ones.
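One way to feel the scale: each parameter is a number that has to be stored. A back-of-envelope calculation, assuming 2 bytes per parameter (common fp16/bf16 storage; real deployments vary), shows why these models need serious hardware:

```python
# Back-of-envelope memory footprint for model weights, assuming
# 2 bytes per parameter (fp16/bf16 storage; real deployments vary).
def weight_memory_gb(n_params, bytes_per_param=2):
    return n_params * bytes_per_param / 1e9

print(f"GPT-2 (1.5B params): {weight_memory_gb(1.5e9):.1f} GB")   # 3.0 GB
print(f"A 175B-param model:  {weight_memory_gb(175e9):.0f} GB")   # 350 GB
```

A 1.5-billion-parameter model fits on a laptop; a hundreds-of-billions-parameter model needs a cluster of accelerators just to hold its weights.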

How it generates a response

When you send Claude a message, here's roughly what happens:

  1. Your text gets converted into tokens — chunks of characters, roughly ¾ of a word each
  2. The model processes all the tokens in your context window simultaneously
  3. For each position, it calculates a probability distribution over its entire vocabulary — how likely each token is to come next
  4. It samples from that distribution to produce the next token
  5. That token gets appended, and the process repeats until the response is complete

This happens quickly — tokens stream out far faster than most people read. But the underlying mechanism is the same for every word: predict what comes next, then predict what comes after that.
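The five steps above can be sketched as a loop. The "model" here is a hand-written stand-in that just prefers a fixed continuation — in a real LLM the scores (logits) come from a neural network — but the softmax-then-sample loop is the genuine shape of generation:

```python
import math
import random

# Toy vocabulary; "<end>" plays the role of the stop token.
vocab = ["the", "cat", "sat", "down", "<end>"]

def fake_logits(context):
    # Stand-in for the neural network: strongly prefer a fixed continuation.
    target = {0: "the", 1: "cat", 2: "sat", 3: "down"}.get(len(context), "<end>")
    return [5.0 if tok == target else 0.0 for tok in vocab]

def softmax(logits):
    # Turn raw scores into a probability distribution over the vocabulary.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(max_tokens=10):
    context = []
    for _ in range(max_tokens):
        probs = softmax(fake_logits(context))            # step 3: distribution
        token = random.choices(vocab, weights=probs)[0]  # step 4: sample
        if token == "<end>":
            break
        context.append(token)                            # step 5: append, repeat
    return " ".join(context)

print(generate())  # usually "the cat sat down" — sampling adds randomness
```

Note that step 4 is sampling, not picking the single most likely token — which is why the same prompt can produce different responses on different runs.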

Why this mental model matters

Understanding that Claude is fundamentally a next-token predictor changes how you interact with it.

It explains why prompting works. The tokens you provide are the context the model uses to predict what should come next. More relevant, well-structured context leads to better predictions. That's why clear prompts get better results.

It explains hallucination. When the model doesn't know something, it doesn't produce an error — it produces a plausible-looking continuation of the pattern. It's still doing next-token prediction, just with weaker signal.

It explains why format matters. If you ask for a bulleted list, Claude has seen millions of examples of bulleted lists in its training data and will generate text that continues that pattern. The format you specify shapes the prediction.

It explains why Claude isn't a search engine. Claude doesn't retrieve information — it generates text that's consistent with patterns in its training. For specific, current, or proprietary facts, you need to give it the information, not ask it to recall it.
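The practical consequence: for facts the model can't reliably recall, put them in the context. A minimal sketch of that pattern — the function name and prompt wording here are illustrative, not any real API:

```python
# Hypothetical sketch: instead of asking the model to recall a fact,
# put the relevant document into the prompt so the answer can be
# predicted from the provided context.
def build_prompt(question, retrieved_docs):
    context = "\n\n".join(retrieved_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

docs = ["Policy v2.3: refunds are allowed within 30 days of purchase."]
print(build_prompt("What is the refund window?", docs))
```

This is the seed of retrieval-augmented generation (RAG): fetch the relevant text first, then let next-token prediction do what it's good at — continuing a context that already contains the answer.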

What Claude is on top of this

Claude is a large language model plus several layers of additional training.

After pre-training on the broad text corpus, Claude was further trained using RLHF (reinforcement learning from human feedback) and Constitutional AI — techniques that shaped its behavior to be helpful, honest, and careful about harm.

This post-training is what makes Claude feel different from a raw language model. It's why Claude follows instructions, maintains a consistent character, pushes back on harmful requests, and expresses uncertainty rather than confidently guessing. The underlying prediction engine is what makes it capable. The post-training is what makes it useful.


Further reading